Bug 230722 - [nvme][resume] Delay and possible deadlock after resume
Summary: [nvme][resume] Delay and possible deadlock after resume
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: amd64 Any
: --- Affects Some People
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-17 23:50 UTC by Oleksandr Tymoshenko
Modified: 2018-12-15 13:10 UTC (History)
5 users (show)

See Also:


Attachments
Kernel Panic while resuming (653.54 KB, image/jpeg)
2018-12-14 14:58 UTC, mmatalka
no flags Details
Kernel Panic while resuming 2 (653.54 KB, image/jpeg)
2018-12-14 14:58 UTC, mmatalka
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Oleksandr Tymoshenko freebsd_committer freebsd_triage 2018-08-17 23:50:38 UTC
When resuming from S3 sleep any disk activity stops for around 30 seconds and kernel prints a bunch of error messages from NVME. Eventually, the system resumes normal operations. I also saw this scenario resulting in deadlock although can't reproduce it reliably anymore.

Error messages printed by the kernel:
nvme0: Resetting controller due to a timeout.
nvme0: resetting controller
nvme0: aborting outstanding i/o
nvme0: READ sqid:1 cid:111 nsid:1 lba:98381272 len:8
nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:111 cdw0:0
nvme0: aborting outstanding i/o
nvme0: READ sqid:1 cid:124 nsid:1 lba:626180072 len:8
nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:124 cdw0:0
nvme0: aborting outstanding i/o
nvme0: READ sqid:2 cid:121 nsid:1 lba:617174512 len:40
nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:121 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:94 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:94 cdw0:0
nvme0: aborting outstanding i/o
nvme0: WRITE sqid:4 cid:93 nsid:1 lba:46557808 len:16
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:93 cdw0:0
nvme0: Missing interrupt

Controller information:

Controller Capabilities/Features
================================
Vendor ID:                   17aa
Subsystem Vendor ID:         17aa
Serial Number:               1150592304774
Model Number:                LENSE20512GMSP34MEAT2TA
Firmware Version:            1.9.8341
Recommended Arb Burst:       2
IEEE OUI Identifier:         99 32 a0
Multi-Path I/O Capabilities: Not Supported
Max Data Transfer Size:      131072
Controller ID:               0x01
Version:                     0.0.0

Admin Command Set Attributes
============================
Security Send/Receive:       Supported
Format NVM:                  Supported
Firmware Activate/Download:  Supported
Namespace Managment:         Not Supported
Device Self-test:            Not Supported
Directives:                  Not Supported
NVMe-MI Send/Receive:        Not Supported
Virtualization Management:   Not Supported
Doorbell Buffer Config       Not Supported
Abort Command Limit:         4
Async Event Request Limit:   4
Number of Firmware Slots:    1
Firmware Slot 1 Read-Only:   No
Per-Namespace SMART Log:     No
Error Log Page Entries:      4
Number of Power States:      5

NVM Command Set Attributes
==========================
Submission Queue Entry Size
  Max:                       64
  Min:                       64
Completion Queue Entry Size
  Max:                       16
  Min:                       16
Number of Namespaces:        1
Compare Command:             Not Supported
Write Uncorrectable Command: Supported
Dataset Management Command:  Supported
Write Zeroes Command:        Not Supported
Save Features:               Supported
Reservations:                Not Supported
Timestamp feature:           Not Supported
Fused Operation Support:     Not Supported
Format NVM Attributes:       Crypto Erase, Per-NS Erase, Per-NS Format
Volatile Write Cache:        Present
Comment 1 Eric van Gyzen freebsd_committer 2018-08-24 21:06:51 UTC
I get these on my Dell XPS 13 running head, too.

Aug 12 20:35:51 hammy kernel: nvme0: Resetting controller due to a timeout.
Aug 12 20:35:51 hammy kernel: nvme0: resetting controller
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:1 cid:103 nsid:1 lba:296718939 len:24
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:103 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:1 cid:106 nsid:1 lba:147433251 len:186
Aug 12 20:35:51 hammy kernel: nvme0: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:106 cdw0:0
Aug 12 20:35:51 hammy kernel: async event occurred (type 0x0, info 0x00, page 0x01)
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: WRITE sqid:2 cid:125 nsid:1 lba:290334796 len:7
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:125 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: DATASET MANAGEMENT sqid:3 cid:84 nsid:1
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:84 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: DATASET MANAGEMENT sqid:3 cid:94 nsid:1
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:94 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:4 cid:126 nsid:1 lba:261212309 len:7
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:126 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: WRITE sqid:5 cid:87 nsid:1 lba:290332732 len:16
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:87 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:5 cid:86 nsid:1 lba:107551651 len:4
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:86 cdw0:0
Aug 12 20:35:51 hammy kernel: nvme0: aborting outstanding i/o
Aug 12 20:35:51 hammy kernel: nvme0: READ sqid:6 cid:115 nsid:1 lba:312559547 len:5
Aug 12 20:35:51 hammy kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:115 cdw0:0
Aug 12 20:35:51 hammy acpi[1696]: resumed at 20180812 20:35:51
Comment 2 mmatalka 2018-09-27 05:58:40 UTC
I am experiencing the same problem however I will get, around 50% of the time, a kernel panic or the machine simply becomes unresponsive.  I have seen errors saying something along the lines of the nvme device disappeared.  Unfortunately because logs don't actually persist to disk since the disk is gone.

I am running a new Dell XPS 13 and commit 338924 with drm-kmod-next
Comment 3 ultrageranium 2018-11-29 22:08:22 UTC
I'm seeing the same on a Thinkpad X1 (2018) with 12.0-PRERELEASE (r341261) and following hardware:

nvme0@pci0:4:0:0:       class=0x010802 card=0xa801144d chip=0xa808144d rev=0x00 hdr=0x00
      vendor     = 'Samsung Electronics Co Ltd'
      device     = 'NVMe SSD Controller SM981/PM981'
      class      = mass storage
      subclass   = NVM

Not sure if related but I suffered an unpleasant data loss during one of these hanging resuming state.

Is it reliable to use suspend while this issue is not solved?
Comment 4 Alexandr Krivulya 2018-12-12 10:37:42 UTC
I have not any data corruptions, but I prefer to shutdown system instead of suspend until we do not have any progress on it.
Comment 5 Warner Losh freebsd_committer 2018-12-12 17:09:23 UTC
I have some code in my https://github.com/bsdimp/freebsd in the nvme-suspend branch you might want to test. I've not had time to test it out, so I'm looking for help testing it out.
Comment 6 Alexandr Krivulya 2018-12-13 06:09:04 UTC
Do I need to build a whole tree or just patch these files?

sys/dev/nvme/nvme.c
sys/dev/nvme/nvme_ctrlr.c
sys/dev/nvme/nvme_private.h
Comment 7 mmatalka 2018-12-14 14:58:01 UTC
Created attachment 200116 [details]
Kernel Panic while resuming
Comment 8 mmatalka 2018-12-14 14:58:43 UTC
Created attachment 200117 [details]
Kernel Panic while resuming 2
Comment 9 mmatalka 2018-12-14 14:59:14 UTC
I uploaded photos of what the kernel panic I get on resume.  Hopefully that is useful.
Comment 10 Warner Losh freebsd_committer 2018-12-14 21:33:48 UTC
(In reply to Alexandr Krivulya from comment #6)
Just rebuild the kernel and/or module.
Comment 11 Alexandr Krivulya 2018-12-15 13:10:49 UTC
With kernel patched from nvme-suspend branch I have a new messages in log:

Dec 15 14:50:41 thinkpad kernel: pm_runtime_get_if_in_use not implemented -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: sched_setscheduler_nocheck not implemented -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: register_oom_notifier not implemented -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: [drm] Initialized i915 1.6.0 20171222 for drmn0 on minor 0
Dec 15 14:50:41 thinkpad kernel: register_acpi_notifier not implemented -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: async_schedule is dodgy -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: pm_runtime_set_autosuspend_delay not implemented -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: __pm_runtime_use_autosuspend not implemented -- see your local kernel hacker
Dec 15 14:50:41 thinkpad kernel: async_synchronize_cookie not implemented -- see your local kernel hacker

Also kernel panicked while suspending:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 06
fault virtual address   = 0x4d8
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff807ca21d
stack pointer           = 0x28:0xfffffe00004857a0
frame pointer           = 0x28:0xfffffe00004857c0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 13 (g_down)
trap number             = 12
panic: page fault
cpuid = 3
time = 1544877759
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000485450
vpanic() at vpanic+0x1b4/frame 0xfffffe00004854b0
panic() at panic+0x43/frame 0xfffffe0000485510
trap_fatal() at trap_fatal+0x369/frame 0xfffffe0000485560
trap_pfault() at trap_pfault+0x49/frame 0xfffffe00004855c0
trap() at trap+0x29f/frame 0xfffffe00004856d0
calltrap() at calltrap+0x8/frame 0xfffffe00004856d0
--- trap 0xc, rip = 0xffffffff807ca21d, rsp = 0xfffffe00004857a0, rbp = 0xfffffe00004857c0 ---
nvme_qpair_submit_request() at nvme_qpair_submit_request+0x2d/frame 0xfffffe00004857c0
nvme_ctrlr_submit_io_request() at nvme_ctrlr_submit_io_request+0x2f/frame 0xfffffe00004857e0
nvme_ns_cmd_deallocate() at nvme_ns_cmd_deallocate+0x84/frame 0xfffffe0000485820
nvme_ns_bio_process() at nvme_ns_bio_process+0x257/frame 0xfffffe00004858a0
nvd_strategy() at nvd_strategy+0x4d/frame 0xfffffe00004858d0
g_disk_start() at g_disk_start+0x360/frame 0xfffffe0000485920
g_io_schedule_down() at g_io_schedule_down+0x1a1/frame 0xfffffe0000485960
g_down_procbody() at g_down_procbody+0x6d/frame 0xfffffe0000485970
fork_exit() at fork_exit+0x83/frame 0xfffffe00004859b0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00004859b0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic