Bug 230722 - [nvme][resume] Delay and possible deadlock after resume
Summary: [nvme][resume] Delay and possible deadlock after resume
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: amd64 Any
: Normal Affects Some People
Assignee: Warner Losh
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-17 23:50 UTC by Oleksandr Tymoshenko
Modified: 2019-01-11 20:13 UTC (History)
4 users (show)

See Also:


Attachments
Kernel Panic while resuming (653.54 KB, image/jpeg)
2018-12-14 14:58 UTC, mmatalka
no flags Details
Kernel Panic while resuming 2 (653.54 KB, image/jpeg)
2018-12-14 14:58 UTC, mmatalka
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Oleksandr Tymoshenko freebsd_committer freebsd_triage 2018-08-17 23:50:38 UTC
When resuming from S3 sleep any disk activity stops for around 30 seconds and kernel prints a bunch of error messages from NVME. Eventually, the system resumes normal operations. I also saw this scenario resulting in deadlock although can't reproduce it reliably anymore.

Error messages printed by the kernel:
nvme0: Resetting controller due to a timeout.
nvme0: resetting controller
nvme0: aborting outstanding i/o
nvme0: READ sqid:1 cid:111 nsid:1 lba:98381272 len:8
nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:111 cdw0:0
nvme0: aborting outstanding i/o
nvme0: READ sqid:1 cid:124 nsid:1 lba:626180072 len:8
nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:124 cdw0:0
nvme0: aborting outstanding i/o
nvme0: READ sqid:2 cid:121 nsid:1 lba:617174512 len:40
nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:121 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:94 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:94 cdw0:0
nvme0: aborting outstanding i/o
nvme0: WRITE sqid:4 cid:93 nsid:1 lba:46557808 len:16
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:93 cdw0:0
nvme0: Missing interrupt

Controller information:

Controller Capabilities/Features
================================
Vendor ID:                   17aa
Subsystem Vendor ID:         17aa
Serial Number:               1150592304774
Model Number:                LENSE20512GMSP34MEAT2TA
Firmware Version:            1.9.8341
Recommended Arb Burst:       2
IEEE OUI Identifier:         99 32 a0
Multi-Path I/O Capabilities: Not Supported
Max Data Transfer Size:      131072
Controller ID:               0x01
Version:                     0.0.0

Admin Command Set Attributes
============================
Security Send/Receive:       Supported
Format NVM:                  Supported
Firmware Activate/Download:  Supported
Namespace Managment:         Not Supported
Device Self-test:            Not Supported
Directives:                  Not Supported
NVMe-MI Send/Receive:        Not Supported
Virtualization Management:   Not Supported
Doorbell Buffer Config       Not Supported
Abort Command Limit:         4
Async Event Request Limit:   4
Number of Firmware Slots:    1
Firmware Slot 1 Read-Only:   No
Per-Namespace SMART Log:     No
Error Log Page Entries:      4
Number of Power States:      5

NVM Command Set Attributes
==========================
Submission Queue Entry Size
  Max:                       64
  Min:                       64
Completion Queue Entry Size
  Max:                       16
  Min:                       16
Number of Namespaces:        1
Compare Command:             Not Supported
Write Uncorrectable Command: Supported
Dataset Management Command:  Supported
Write Zeroes Command:        Not Supported
Save Features:               Supported
Reservations:                Not Supported
Timestamp feature:           Not Supported
Fused Operation Support:     Not Supported
Format NVM Attributes:       Crypto Erase, Per-NS Erase, Per-NS Format
Volatile Write Cache:        Present
Comment 1 Eric van Gyzen freebsd_committer 2018-08-24 21:06:51 UTC
I get these on my Dell XPS 13 running head, too.

Aug 12 20:35:51 hammy kernel: nvme0: Resetting controller due to a timeout.
Aug 12 20:35:51 hammy kernel: nvme0: resetting controller
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:1 cid:103 nsid:1 lba:296718939 len:24
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:103 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:1 cid:106 nsid:1 lba:147433251 len:186
Aug 12 20:35:51 hammy kernel: nvme0: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:106 cdw0:0
Aug 12 20:35:51 hammy kernel: async event occurred (type 0x0, info 0x00, page 0x01)
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: WRITE sqid:2 cid:125 nsid:1 lba:290334796 len:7
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:125 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: DATASET MANAGEMENT sqid:3 cid:84 nsid:1
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:84 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: DATASET MANAGEMENT sqid:3 cid:94 nsid:1
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:94 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:4 cid:126 nsid:1 lba:261212309 len:7
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:126 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: WRITE sqid:5 cid:87 nsid:1 lba:290332732 len:16
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:87 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:5 cid:86 nsid:1 lba:107551651 len:4
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:86 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:6 cid:115 nsid:1 lba:312559547 len:5
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:115 cdw0:0
Aug 12 20:35:51 hammy acpi[1696]: resumed at 20180812 20:35:51
Comment 2 mmatalka 2018-09-27 05:58:40 UTC
I am experiencing the same problem however I will get, around 50% of the time, a kernel panic or the machine simply becomes unresponsive.  I have seen errors saying something along the lines of the nvme device disappeared.  Unfortunately because logs don't actually persist to disk since the disk is gone.

I am running a new Dell XPS 13 and commit 338924 with drm-kmod-next
Comment 3 ultrageranium 2018-11-29 22:08:22 UTC
I'm seeing the same on a Thinkpad X1 (2018) with 12.0-PRERELEASE (r341261) and following hardware:

nvme0@pci0:4:0:0:       class=0x010802 card=0xa801144d chip=0xa808144d rev=0x00 hdr=0x00
      vendor     = 'Samsung Electronics Co Ltd'
      device     = 'NVMe SSD Controller SM981/PM981'
      class      = mass storage
      subclass   = NVM

Not sure if related but I suffered an unpleasant data loss during one of these hanging resuming state.

Is it reliable to use suspend while this issue is not solved?
Comment 4 Alexandr Krivulya 2018-12-12 10:37:42 UTC
I have not any data corruptions, but I prefer to shutdown system instead of suspend until we do not have any progress on it.
Comment 5 Warner Losh freebsd_committer 2018-12-12 17:09:23 UTC
I have some code in my https://github.com/bsdimp/freebsd in the nvme-suspend branch you might want to test. I've not had time to test it out, so I'm looking for help testing it out.
Comment 6 Alexandr Krivulya 2018-12-13 06:09:04 UTC
Do I need to build a whole tree or just patch these files?

sys/dev/nvme/nvme.c
sys/dev/nvme/nvme_ctrlr.c
sys/dev/nvme/nvme_private.h
Comment 7 mmatalka 2018-12-14 14:58:01 UTC
Created attachment 200116 [details]
Kernel Panic while resuming
Comment 8 mmatalka 2018-12-14 14:58:43 UTC
Created attachment 200117 [details]
Kernel Panic while resuming 2
Comment 9 mmatalka 2018-12-14 14:59:14 UTC
I uploaded photos of what the kernel panic I get on resume.  Hopefully that is useful.
Comment 10 Warner Losh freebsd_committer 2018-12-14 21:33:48 UTC
(In reply to Alexandr Krivulya from comment #6)
Just rebuild the kernel and/or module.
Comment 11 Alexandr Krivulya 2018-12-15 13:10:49 UTC
With kernel patched from nvme-suspend branch I have a new messages in log:

Dec 15 14:50:41 thinkpad kernel: pm_runtime_get_if_in_use not implemented -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: sched_setscheduler_nocheck not implemented -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: register_oom_notifier not implemented -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: [drm] Initialized i915 1.6.0 20171222 for drmn0 on minor 0
Dec 15 14:50:41 thinkpad kernel: register_acpi_notifier not implemented -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: async_schedule is dodgy -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: pm_runtime_set_autosuspend_delay not implemented -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: __pm_runtime_use_autosuspend not implemented -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: async_synchronize_cookie not implemented -- see your local kernel hacker

Also kernel panicked while suspending:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 06
fault virtual address   = 0x4d8
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff807ca21d
stack pointer           = 0x28:0xfffffe00004857a0
frame pointer           = 0x28:0xfffffe00004857c0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 13 (g_down)
trap number             = 12
panic: page fault
cpuid = 3
time = 1544877759
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000485450
vpanic() at vpanic+0x1b4/frame 0xfffffe00004854b0
panic() at panic+0x43/frame 0xfffffe0000485510
trap_fatal() at trap_fatal+0x369/frame 0xfffffe0000485560
trap_pfault() at trap_pfault+0x49/frame 0xfffffe00004855c0
trap() at trap+0x29f/frame 0xfffffe00004856d0
calltrap() at calltrap+0x8/frame 0xfffffe00004856d0
--- trap 0xc, rip = 0xffffffff807ca21d, rsp = 0xfffffe00004857a0, rbp = 0xfffffe00004857c0 ---
nvme_qpair_submit_request() at nvme_qpair_submit_request+0x2d/frame 0xfffffe00004857c0
nvme_ctrlr_submit_io_request() at nvme_ctrlr_submit_io_request+0x2f/frame 0xfffffe00004857e0
nvme_ns_cmd_deallocate() at nvme_ns_cmd_deallocate+0x84/frame 0xfffffe0000485820
nvme_ns_bio_process() at nvme_ns_bio_process+0x257/frame 0xfffffe00004858a0
nvd_strategy() at nvd_strategy+0x4d/frame 0xfffffe00004858d0
g_disk_start() at g_disk_start+0x360/frame 0xfffffe0000485920
g_io_schedule_down() at g_io_schedule_down+0x1a1/frame 0xfffffe0000485960
g_down_procbody() at g_down_procbody+0x6d/frame 0xfffffe0000485970
fork_exit() at fork_exit+0x83/frame 0xfffffe00004859b0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00004859b0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
Comment 12 mmatalka 2019-01-10 20:45:44 UTC
(In reply to Warner Losh from comment #5)

Any updates on this branch given Alexandr's log message?  I'm happy to test the changes on my machine.
Comment 13 Warner Losh freebsd_committer 2019-01-10 22:06:02 UTC
I understand what I must do next, but haven't had the time to do the changes for the next round of testing because of time off around the holidays, travel, etc.
Comment 14 mmatalka 2019-01-11 14:45:23 UTC
(In reply to Warner Losh from comment #13)
Great!  I'll keep an eye out for updates on here.  Thank you for working on this.