Summary: | [nvme][resume] Delay and possible deadlock after resume | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Oleksandr Tymoshenko <gonzo> | ||||||
Component: | kern | Assignee: | Warner Losh <imp> | ||||||
Status: | Closed FIXED | ||||||||
Severity: | Affects Some People | CC: | imp, mmatalka, ports, shuriku, ultrageranium | ||||||
Priority: | Normal | ||||||||
Version: | CURRENT | ||||||||
Hardware: | amd64 | ||||||||
OS: | Any | ||||||||
Attachments: |
|
Description
Oleksandr Tymoshenko
2018-08-17 23:50:38 UTC
I get these on my Dell XPS 13 running head, too. Aug 12 20:35:51 hammy kernel: nvme0: Resetting controller due to a timeout. Aug 12 20:35:51 hammy kernel: nvme0: resetting controller Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:1 cid:103 nsid:1 lba:296718939 len:24 Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:103 cdw0:0 Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:1 cid:106 nsid:1 lba:147433251 len:186 Aug 12 20:35:51 hammy kernel: nvme0: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:106 cdw0:0 Aug 12 20:35:51 hammy kernel: async event occurred (type 0x0, info 0x00, page 0x01) Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o Aug 12 20:35:51 hammy kernel: nvme0: WRITE sqid:2 cid:125 nsid:1 lba:290334796 len:7 Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:125 cdw0:0 Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o Aug 12 20:35:51 hammy kernel: nvme0: DATASET MANAGEMENT sqid:3 cid:84 nsid:1 Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:84 cdw0:0 Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o Aug 12 20:35:51 hammy kernel: nvme0: DATASET MANAGEMENT sqid:3 cid:94 nsid:1 Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:94 cdw0:0 Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:4 cid:126 nsid:1 lba:261212309 len:7 Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:126 cdw0:0 Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o Aug 12 20:35:51 hammy kernel: nvme0: WRITE sqid:5 cid:87 nsid:1 lba:290332732 len:16 Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:87 cdw0:0 Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:5 cid:86 nsid:1 lba:107551651 len:4 Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:86 cdw0:0 Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:6 cid:115 nsid:1 lba:312559547 len:5 Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:115 cdw0:0 Aug 12 20:35:51 hammy acpi[1696]: resumed at 20180812 20:35:51 I am experiencing the same problem however I will get, around 50% of the time, a kernel panic or the machine simply becomes unresponsive. I have seen errors saying something along the lines of the nvme device disappeared. Unfortunately because logs don't actually persist to disk since the disk is gone. I am running a new Dell XPS 13 and commit 338924 with drm-kmod-next I'm seeing the same on a Thinkpad X1 (2018) with 12.0-PRERELEASE (r341261) and following hardware: nvme0@pci0:4:0:0: class=0x010802 card=0xa801144d chip=0xa808144d rev=0x00 hdr=0x00 vendor = 'Samsung Electronics Co Ltd' device = 'NVMe SSD Controller SM981/PM981' class = mass storage subclass = NVM Not sure if related but I suffered an unpleasant data loss during one of these hanging resuming state. Is it reliable to use suspend while this issue is not solved? I have not any data corruptions, but I prefer to shutdown system instead of suspend until we do not have any progress on it. I have some code in my https://github.com/bsdimp/freebsd in the nvme-suspend branch you might want to test. I've not had time to test it out, so I'm looking for help testing it out. Do I need to build a whole tree or just patch these files? sys/dev/nvme/nvme.c sys/dev/nvme/nvme_ctrlr.c sys/dev/nvme/nvme_private.h Created attachment 200116 [details]
Kernel Panic while resuming
Created attachment 200117 [details]
Kernel Panic while resuming 2
I uploaded photos of what the kernel panic I get on resume. Hopefully that is useful. (In reply to Alexandr Krivulya from comment #6) Just rebuild the kernel and/or module. With kernel patched from nvme-suspend branch I have a new messages in log: Dec 15 14:50:41 thinkpad kernel: pm_runtime_get_if_in_use not implemented -- see your local kernel hacker Dec 15 14:50:41 thinkpad kernel: sched_setscheduler_nocheck not implemented -- see your local kernel hacker Dec 15 14:50:41 thinkpad kernel: register_oom_notifier not implemented -- see your local kernel hacker Dec 15 14:50:41 thinkpad kernel: [drm] Initialized i915 1.6.0 20171222 for drmn0 on minor 0 Dec 15 14:50:41 thinkpad kernel: register_acpi_notifier not implemented -- see your local kernel hacker Dec 15 14:50:41 thinkpad kernel: async_schedule is dodgy -- see your local kernel hacker Dec 15 14:50:41 thinkpad kernel: pm_runtime_set_autosuspend_delay not implemented -- see your local kernel hacker Dec 15 14:50:41 thinkpad kernel: __pm_runtime_use_autosuspend not implemented -- see your local kernel hacker Dec 15 14:50:41 thinkpad kernel: async_synchronize_cookie not implemented -- see your local kernel hacker Also kernel panicked while suspending: Fatal trap 12: page fault while in kernel mode cpuid = 3; apic id = 06 fault virtual address = 0x4d8 fault code = supervisor write data, page not present instruction pointer = 0x20:0xffffffff807ca21d stack pointer = 0x28:0xfffffe00004857a0 frame pointer = 0x28:0xfffffe00004857c0 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 13 (g_down) trap number = 12 panic: page fault cpuid = 3 time = 1544877759 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000485450 vpanic() at vpanic+0x1b4/frame 0xfffffe00004854b0 panic() at panic+0x43/frame 0xfffffe0000485510 trap_fatal() at trap_fatal+0x369/frame 0xfffffe0000485560 trap_pfault() at trap_pfault+0x49/frame 0xfffffe00004855c0 trap() at trap+0x29f/frame 0xfffffe00004856d0 calltrap() at calltrap+0x8/frame 0xfffffe00004856d0 --- trap 0xc, rip = 0xffffffff807ca21d, rsp = 0xfffffe00004857a0, rbp = 0xfffffe00004857c0 --- nvme_qpair_submit_request() at nvme_qpair_submit_request+0x2d/frame 0xfffffe00004857c0 nvme_ctrlr_submit_io_request() at nvme_ctrlr_submit_io_request+0x2f/frame 0xfffffe00004857e0 nvme_ns_cmd_deallocate() at nvme_ns_cmd_deallocate+0x84/frame 0xfffffe0000485820 nvme_ns_bio_process() at nvme_ns_bio_process+0x257/frame 0xfffffe00004858a0 nvd_strategy() at nvd_strategy+0x4d/frame 0xfffffe00004858d0 g_disk_start() at g_disk_start+0x360/frame 0xfffffe0000485920 g_io_schedule_down() at g_io_schedule_down+0x1a1/frame 0xfffffe0000485960 g_down_procbody() at g_down_procbody+0x6d/frame 0xfffffe0000485970 fork_exit() at fork_exit+0x83/frame 0xfffffe00004859b0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00004859b0 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- KDB: enter: panic (In reply to Warner Losh from comment #5) Any updates on this branch given Alexandr's log message? I'm happy to test the changes on my machine. I understand what I must do next, but haven't had the time to do the changes for the next round of testing because of time off around the holidays, travel, etc. (In reply to Warner Losh from comment #13) Great! I'll keep an eye out for updates on here. Thank you for working on this. Hello, is there any progress on this ticket? Or is there any way I can understand what is going on to fix myself? I'm not sure what else has changed but effectively suspend/resume is useless on my laptop because resume works maybe 1/20th of the time. Hello, I recently updated to 12.1-STABLE (353675) and since then my Thinkpad X1 (Gen 6) cannot wake up from sleep. On opening the lid, screen stays blank/off, sleep LED keeps on blinking as if it was still sleeping and a couple seconds later the machine reboots and I go through a filesystem check. Before the update, sleep/resume was fine but was exhibiting the issues described here (without noticeable data loss). For what it's worth, my previous system was somewhere around 351xxx, but before 351914. It's actually this commit that motivated me to update and see if the situation with resume had improved :) Thanks! My bad! Problem came from drm-fbsd12.0-kmod needing to be recompiled. Sorry for the noise :( On the plus side, r351914 really fixed the resume delays for me and I do not have nvme error messages. Thanks a lot for the fix! Hopefully it will also work for those that had deadlocks on top of the delays. 12.x and later have this fixed. Closing. If you have new issues, please open a new bug. |