Bug 224064 - [nvme][hang][resume] nvme_ctrlr_wait_for_ready called with desired_val = 0 but cc.en =1
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-scsi (Nobody)
 
Reported: 2017-12-03 22:03 UTC by Dave Cottlehuber
Modified: 2017-12-19 16:14 UTC (History)

Attachments
nvme info (1.70 KB, text/x-log)
2017-12-03 22:03 UTC, Dave Cottlehuber
dmesg (9.69 KB, text/x-log)
2017-12-03 22:03 UTC, Dave Cottlehuber

Description Dave Cottlehuber freebsd_committer freebsd_triage 2017-12-03 22:03:27 UTC
Created attachment 188501
nvme info

On resuming from suspend, the system stalls on any disk I/O after the console shows:

freebsd nvme_ctrlr_wait_for_ready called with desired_val = 0 but cc.en = 1

- the issue occurs frequently but not on every resume; some resumes work fine
- easily reproducible
- X and any open terminals keep running until they need the disk
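
A minimal sketch of the kind of sanity check that would print this message, written as a standalone userspace model rather than the driver code itself. The function name `model_check_enable_bit` and its return codes are hypothetical; only the message text comes from the console output above. The assumption is that the caller passes the value it expects CC.EN to hold, and the check refuses to wait when the register disagrees.

```c
#include <assert.h>
#include <stdio.h>

enum { MODEL_OK = 0, MODEL_EN_MISMATCH = 1 };	/* hypothetical return codes */

/*
 * Compare the enable bit read back from the controller (cc_en) with the
 * value the caller expects (desired_val). On a mismatch, log a message
 * shaped like the one in this report and refuse to wait for ready.
 */
int
model_check_enable_bit(unsigned cc_en, unsigned desired_val)
{
	/* Both values model a single register bit. */
	assert(cc_en <= 1 && desired_val <= 1);

	if (cc_en != desired_val) {
		fprintf(stderr, "nvme_ctrlr_wait_for_ready called with "
		    "desired_val = %u but cc.en = %u\n", desired_val, cc_en);
		return (MODEL_EN_MISMATCH);
	}
	return (MODEL_OK);
}
```

With cc_en = 1 and desired_val = 0, the case reported here, the model returns MODEL_EN_MISMATCH without polling anything.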

The system is a Dell XPS 13 laptop with a TOSHIBA NVMe drive. See https://wiki.freebsd.org/Laptops/Dell_XPS13_9360 for full specs and dmesg/devinfo/diskinfo output.

https://s3.amazonaws.com/uploads.hipchat.com/8784/2508819/N0PyBJEhSHX3jKu/IMG_2693.JPG
https://s3.amazonaws.com/uploads.hipchat.com/8784/2508819/eLtZ9caTMR7B6eb/IMG_2677.JPG

# dmesg

nvme0: <Generic NVMe Device> mem 0xdc000000-0xdc003fff at device 0.0 on pci4
nvd0: <THNSN5512GPUK NVMe TOSHIBA 512GB> NVMe namespace
nvd0: 488386MB (1000215216 512 byte sectors)

# nvmecontrol info

nvme0: THNSN5512GPUK NVMe TOSHIBA 512GB
    nvme0ns1 (488386MB)
Controller Capabilities/Features
================================
Vendor ID:                  1179
Subsystem Vendor ID:        1179
Serial Number:              376B508IKSJU
Model Number:               THNSN5512GPUK NVMe TOSHIBA 512GB
Firmware Version:           5KDA4103
Recommended Arb Burst:      1
IEEE OUI Identifier:        0d 08 00
Multi-Interface Cap:        00
Max Data Transfer Size:     Unlimited
Controller ID:              0x00

Admin Command Set Attributes
============================
Security Send/Receive:       Supported
Format NVM:                  Supported
Firmware Activate/Download:  Supported
Namespace Managment:         Not Supported
Abort Command Limit:         4
Async Event Request Limit:   4
Number of Firmware Slots:    1
Firmware Slot 1 Read-Only:   No
Per-Namespace SMART Log:     No
Error Log Page Entries:      128
Number of Power States:      5

NVM Command Set Attributes
==========================
Submission Queue Entry Size
  Max:                       64
  Min:                       64
Completion Queue Entry Size
  Max:                       16
  Min:                       16
Number of Namespaces:        1
Compare Command:             Not Supported
Write Uncorrectable Command: Supported
Dataset Management Command:  Supported
Volatile Write Cache:        Present
Size (in LBAs):              1000215216 (953M)
Capacity (in LBAs):          1000215216 (953M)
Utilization (in LBAs):       1000215216 (953M)
Thin Provisioning:           Not Supported
Number of LBA Formats:       2
Current LBA Format:          LBA Format #00
LBA Format #00: Data Size:   512  Metadata Size:     0
LBA Format #01: Data Size:  4096  Metadata Size:     0
Comment 1 Dave Cottlehuber freebsd_committer freebsd_triage 2017-12-03 22:03:59 UTC
Created attachment 188502
dmesg
Comment 2 Dave Cottlehuber freebsd_committer freebsd_triage 2017-12-03 22:06:14 UTC
# uname

FreeBSD akai.skunkwerks.at 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r325987+5aee85eae833(master): Sun Nov 19 04:34:13 UTC 2017     root@wintermute:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64
Comment 3 chuck 2017-12-11 05:16:32 UTC
While most likely not a fix, it would be worth testing the change proposed in D13389 [1]. The message you are seeing will definitely go away, but it isn't clear whether the change would either a) let the driver proceed with the reset and life is good or b) fall down somewhere else.

[1] https://reviews.freebsd.org/D13389
Comment 4 Warner Losh freebsd_committer freebsd_triage 2017-12-11 14:54:54 UTC
If you do, you'll have to add this device to the quirk list... I suspect some other issue is going on since the device is likely transitioning from power state D3 to D0. I haven't checked, but sometimes that takes a while and it's quite possible there's another spot in the driver that needs either a fixed delay (yuck) or to poll something to become active before proceeding.
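
The polling alternative mentioned above can be sketched as a bounded loop. This is a userspace model under stated assumptions, not code from the nvme driver: `model_dev`, `model_dev_ready`, `model_poll_until_ready`, and `model_resume` are hypothetical names, and the device is simulated as becoming ready after a fixed number of polls. A real driver would delay between polls and read a hardware register instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device model: reports ready after a number of polls. */
struct model_dev {
	int polls_until_ready;
};

/* One poll of the simulated "ready" state. */
static bool
model_dev_ready(struct model_dev *d)
{
	if (d->polls_until_ready > 0) {
		d->polls_until_ready--;
		return (false);
	}
	return (true);
}

/*
 * Poll until the device is ready or the budget is spent.
 * Returns the number of failed polls before success, or -1 on timeout.
 */
int
model_poll_until_ready(struct model_dev *d, int max_polls)
{
	assert(max_polls >= 0);

	for (int waited = 0; waited <= max_polls; waited++) {
		if (model_dev_ready(d))
			return (waited);
		/* A real driver would wait here, e.g. about 1 ms per poll. */
	}
	return (-1);
}

/* Convenience wrapper: simulate a resume with the given latency. */
int
model_resume(int polls_until_ready, int max_polls)
{
	struct model_dev d = { .polls_until_ready = polls_until_ready };

	return (model_poll_until_ready(&d, max_polls));
}
```

Compared with a fixed delay, the loop returns as soon as the device reports ready and still gives up after a bounded number of attempts.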
Comment 5 Dave Cottlehuber freebsd_committer freebsd_triage 2017-12-11 16:22:50 UTC
Thanks Chuck & Warner for the info. I'll build with the patch later this week & report back. I'm not familiar with adding a quirk.

I see nothing in /sys/dev/nv* regarding these, so I assume they are specified in sys/cam/.

I'll read up on README.quirks and my FreeBSD Design & Implementation book, but if you have any pointers or similar commits that would be a big help. I've found:

- https://reviews.freebsd.org/D13093
- https://forums.freebsd.org/threads/55210/

to start with.
Comment 6 Dave Cottlehuber freebsd_committer freebsd_triage 2017-12-11 16:24:20 UTC
actually it looks like https://reviews.freebsd.org/D13389#inline-80238 is all I need - sorry for the noise
Comment 7 chuck 2017-12-11 23:51:53 UTC
It would be interesting to see what, if any, failure occurs without adding a specific quirk for your device. So if you have the time and inclination, I'd vote for experiment #1 to be the D13389 patch without changes and experiment #2 to be the D13389 patch plus an entry in pci_ids for your device with .quirks set to QUIRK_DELAY_B4_CHK_RDY.
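
Experiment #2 amounts to adding a table row that maps a PCI ID to a quirk flag. The sketch below models that lookup in standalone C and is not the driver's actual table. `QUIRK_DELAY_B4_CHK_RDY` is the name used in this thread, but its value here, the struct layout, the `model_` names, and the device half of the ID (0xffff) are placeholders. Only vendor ID 0x1179 comes from the nvme info in the description, read as hex, and the (device << 16) | vendor packing is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag name taken from the thread; the numeric value is an assumption. */
#define QUIRK_DELAY_B4_CHK_RDY	(1u << 0)

/* Hypothetical table entry: packed PCI ID, description, quirk bits. */
struct model_pci_id {
	uint32_t	devid;	/* assumed layout: (device << 16) | vendor */
	const char	*desc;
	uint32_t	quirks;
};

static const struct model_pci_id model_pci_ids[] = {
	/* 0xffff is a placeholder device ID, not the real one. */
	{ 0xffff1179u, "placeholder Toshiba entry", QUIRK_DELAY_B4_CHK_RDY },
	{ 0x00000000u, NULL, 0 }	/* terminator */
};

/* Return the quirk bits for devid, or 0 if the table has no entry. */
uint32_t
model_lookup_quirks(uint32_t devid)
{
	/* devid 0 marks the end of the table, so it is never a valid key. */
	assert(devid != 0);

	for (const struct model_pci_id *p = model_pci_ids; p->devid != 0; p++) {
		if (p->devid == devid)
			return (p->quirks);
	}
	return (0);
}
```

A device without an entry gets no quirks, which corresponds to experiment #1; adding the row is what turns the delay on for one specific device.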
Comment 8 Dave Cottlehuber freebsd_committer freebsd_triage 2017-12-16 12:19:22 UTC
Experiment #1 underway.
Comment 9 Dave Cottlehuber freebsd_committer freebsd_triage 2017-12-18 07:15:11 UTC
early days but I've not had a repeat hang with this patch (yay). I'll close this later in the week if I don't have a recurrence. thanks!
Comment 10 Dave Cottlehuber freebsd_committer freebsd_triage 2017-12-19 16:10:12 UTC
LGTM, no issues reported at all, no quirk hacks needed. My perception is that resume time is super fast now as well.