Bug 262969 - NVMe - Resetting controller due to a timeout and possible hot unplug
Summary: NVMe - Resetting controller due to a timeout and possible hot unplug
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.1-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-04-01 05:45 UTC by Ian Brennan
Modified: 2024-06-16 22:45 UTC
CC List: 16 users

See Also:


Attachments

Description Ian Brennan 2022-04-01 05:45:42 UTC
I am seeing very unstable NVMe behavior on TrueNAS 12 and 13. I'm using Western Digital Black NVMe M.2 SSDs; specifically, WD_BLACK SN770 250GB and 2TB drives with firmware version 731030WD.

This is on TrueNAS version: TrueNAS-12.0-U8 / FreeBSD: 12.2-RELEASE-p12 (I later upgraded to 13 Beta, see below)

I set hw.nvme.per_cpu_io_queues=0, and it did not fix the problem; if anything it seems to have made the resets more frequent, although I'm not 100% sure about that and need to test again.

I also tried switching from the nvd(4) driver to nda(4) with hw.nvme.use_nvd=0, which doesn't seem to make a difference either, although the log output was slightly different when the issue happened again (see the second excerpt below). I would be grateful if somebody could help with this problem.
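
For reference, both settings are loader tunables, so they go in /boot/loader.conf and take effect at the next boot. A minimal sketch (the comments reflect my understanding of what each knob does):

# /boot/loader.conf
hw.nvme.per_cpu_io_queues="0"   # don't allocate one I/O queue pair per CPU
hw.nvme.use_nvd="0"             # expose namespaces via nda(4)/CAM instead of nvd(4)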


Mar 29 21:42:25 truenas nvme5: Resetting controller due to a timeout and possible hot unplug.
Mar 29 21:42:25 truenas nvme5: resetting controller
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:12 cid:120 nsid:1 lba:1497544880 len:16
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:12 cid:120 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:12 cid:123 nsid:1 lba:198272936 len:16
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:12 cid:123 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:13 cid:121 nsid:1 lba:431014528 len:24
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:13 cid:121 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:15 cid:127 nsid:1 lba:864636432 len:8
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:15 cid:127 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:16 cid:126 nsid:1 lba:2445612184 len:8
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:16 cid:126 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:16 cid:120 nsid:1 lba:430503600 len:8
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:16 cid:120 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:18 cid:123 nsid:1 lba:1499051024 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:18 cid:123 cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: WRITE sqid:18 cid:124 nsid:1 lba:1990077368 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:18 cid:124 cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: READ sqid:19 cid:122 nsid:1 lba:1237765696 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:19 cid:122 cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: READ sqid:19 cid:125 nsid:1 lba:180758264 len:16
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:19 cid:125 cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: READ sqid:20 cid:121 nsid:1 lba:2445612192 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:20 cid:121 cdw0:0
Mar 29 21:42:26 truenas nvd5: detached
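
For anyone reading the log: the driver prints NVMe completion status as (SCT/SC), so (00/07) decodes as:

ABORTED - BY REQUEST (00/07)
  00 -> Status Code Type 0: Generic Command Status
  07 -> Status Code 0x07: Command Abort Requested

i.e. these I/Os were aborted by the driver as part of the controller reset, not failed by the media.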

The same failure with the nda(4) driver attached (hw.nvme.use_nvd=0):

nvme3: Resetting controller due to a timeout and possible hot unplug.
nvme3: resetting controller
nvme3: failing outstanding i/o
nvme3: READ sqid:7 cid:127 nsid:1 lba:419546528 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:7 cid:127 cdw0:0
nvme3: (nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=1901c5a0 0 7 0 0 0
failing outstanding i/o
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
nvme3: READ sqid:11 cid:127 nsid:1 lba:782841288 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:11 cid:127 cdw0:0
nvme3: (nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=2ea935c8 0 7 0 0 0
failing outstanding i/o
nvme3: READ sqid:11 cid:123 nsid:1 lba:704576056 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:11 cid:123 cdw0:0
nvme3: failing outstanding i/o
nvme3: WRITE sqid:12 cid:127 nsid:1 lba:1016402352 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:12 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:12 cid:125 nsid:1 lba:1824854760 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:12 cid:125 cdw0:0
nvme3: failing outstanding i/o
nvme3: (nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
WRITE sqid:13 cid:124 nsid:1 lba:1008638008 len:64
nvme3: ABORTED - BY REQUEST (00/07) sqid:13 cid:124 cdw0:0
nvme3: failing outstanding i/o
nvme3: WRITE sqid:13 cid:125 nsid:1 lba:1008638152 len:56
nvme3: ABORTED - BY REQUEST (00/07) sqid:13 cid:125 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:15 cid:127 nsid:1 lba:783188688 len:8
nvme3: (nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=29fefa38 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c9511b0 0 7 0 0 0
ABORTED - BY REQUEST (00/07) sqid:15 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: WRITE sqid:15 cid:123 nsid:1 lba:1008553080 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:15 cid:123 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:16 cid:124 nsid:1 lba:147012776 len:8
nvme3: (nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=6cc512e8 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c1e9838 0 3f 0 0 0
ABORTED - BY REQUEST (00/07) sqid:16 cid:124 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:16 cid:127 nsid:1 lba:2881895592 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:16 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:17 cid:127 nsid:1 lba:2574392744 len:16
nvme3: ABORTED - BY REQUEST (00/07) sqid:17 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:18 cid:126 nsid:1 lba:155895056 len:8
nvme3: (nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c1e98c8 0 37 0 0 0
ABORTED - BY REQUEST (00/07) sqid:18 cid:126 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:19 cid:125 nsid:1 lba:151377120 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:19 cid:125 cdw0:0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=2eae82d0 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c1d4c78 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=8c33ca8 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=abc63ca8 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=99721da8 0 f 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=94ac510 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=905d4e0 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
nda3 at nvme3 bus 0 scbus13 target 0 lun 1
nda3: <WD_BLACK SN770 2TB 731030WD 21513C800057>
 s/n 21513C800057 detached
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file


After upgrading to 13.0-STABLE I see this:

nvme4: RECOVERY_START 166765538822 vs 166329346273
nvme4: Controller in fatal status, resetting
nvme4: Resetting controller due to a timeout and possible hot unplug.
nvme4: RECOVERY_WAITING
nvme4: resetting controller
nvme4: failing outstanding i/o
nvme4: READ sqid:18 cid:127 nsid:1 lba:32 len:224
nvme4: ABORTED - BY REQUEST (00/07) sqid:18 cid:127 cdw0:0
nvme4: failing outstanding i/o
nvme4: READ sqid:18 cid:126 nsid:1 lba:544 len:224
nvme4: ABORTED - BY REQUEST (00/07) sqid:18 cid:126 cdw0:0
nvme4: failing outstanding i/o
nvme4: READ sqid:18 cid:125 nsid:1 lba:3907028000 len:224
nvme4: ABORTED - BY REQUEST (00/07) sqid:18 cid:125 cdw0:0
nvme4: failing outstanding i/o
nvme4: READ sqid:18 cid:124 nsid:1 lba:3907028512 len:224
nvme4: ABORTED - BY REQUEST (00/07) sqid:18 cid:124 cdw0:0
nvd4: detached
Comment 1 Tomasz "CeDeROM" CEDRO 2022-04-01 21:00:38 UTC
Hello world :-)

I just saw the previous report and this one. I also switched to an M.2 NVMe Samsung SSD 980 1TB drive, running on an ICYBOX IB-PCI224M2-ARGB PCIe 4.0 adapter in my desktop, and everything seems to work fine. Just a data point for reference; it may help somehow :-)

nvme0: <Generic NVMe Device> mem 0xfe600000-0xfe603fff at device 0.0 on pci5
nvme0: Allocated 64MB host memory buffer
nvd0: <Samsung SSD 980 1TB> NVMe namespace
nvd0: 953869MB (1953525168 512 byte sectors)

FreeBSD hexagon 13.1-STABLE FreeBSD 13.1-STABLE #0 stable/13-n250096-4f69c575996: Fri Mar 25 03:50:58 CET 2022     root@hexagon:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64
Comment 2 Graham Perrin freebsd_committer freebsd_triage 2022-04-02 10:15:48 UTC
(In reply to Ian Brennan from comment #0)

Is there a corresponding report in the TrueNAS area?
Comment 3 Timothy Guo 2022-09-24 15:58:42 UTC
I have a WD SN570 1TB NVMe drive which suddenly ran into this controller-reset problem after about two months of active use. Once the problem shows up, it reproduces on every reboot.

I'm still trying to confirm whether this is an OS driver issue or a disk issue.
The disk appears to behave differently on different operating systems. When I switched to Linux, the same problematic disk reacted to different APST configurations, and with APST disabled I was able to get it through some short read-only tests (dd, a disk-wide find, grep over a kernel tree, etc.).
Whether or not that observation is real, it encouraged me to switch back to my FreeBSD box and try a ZFS scrub. Unfortunately the disk failed terribly this time, and I couldn't even get it to work in Linux again, as the drive appears to be stuck in some low-power state... I will try to dig deeper and see if there is anything I can do to get it back.


BTW, the SMART log does not report any errors for this drive...
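
For anyone wanting to check the same thing on FreeBSD: the SMART/health data is NVMe log page 0x02 and the error log is page 0x01, so something along these lines with nvmecontrol(8) should show them (adjust the device name to match yours):

nvmecontrol logpage -p 2 nvme0    # SMART / Health Information log page
nvmecontrol logpage -p 1 nvme0    # Error Information log page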
Comment 4 Timothy Guo 2022-09-24 17:11:11 UTC
(In reply to Timothy Guo from comment #3)
My SN550 was brought into such a bad state that it triggered an AER flood on a Linux system, and the BIOS was affected as well: I couldn't even enter the BIOS setup page.

As a desperate last resort, I moved the disk into another machine (a NUC box). Magically, the NUC recognized the broken disk without a problem and booted just fine (actually I lied: it landed on a prompt saying it's a FreeNAS data disk, which at least proves the disk can be read just fine).

Now that I have moved it back to my home server box, the disk is back to life once again. I'm able to scrub the ZFS pool on this disk and find no data errors at all.

This time I didn't run into any issue with APST enabled or disabled, which is really surprising.

Let me switch back to the offending FreeNAS system and see how it behaves this time. Is this really a physical connection problem? How could it be -- I didn't use any adapter cable; the drive is hooked directly into the M.2 slot on my motherboard.
Comment 5 Timothy Guo 2022-09-24 17:51:19 UTC
(In reply to Timothy Guo from comment #4)

The disk appears to be running in the FreeNAS box just fine once again.
I'll need to monitor it for a longer while, since it took about a day for the issue to reproduce the second time, even though it then repeated soon after every reboot...

Apologies if this turns out to be noise.
Comment 6 Ian Brennan 2022-09-24 22:40:40 UTC
I had basically the same problem. I gave up and installed TrueNAS SCALE, and Debian seemed to solve the problem. NVMe SSDs on FreeBSD just seemed way too unstable.
Comment 7 Timothy Guo 2022-10-03 16:11:14 UTC
(In reply to Timothy Guo from comment #5)
I'm back to report that the problem returned just now (two hours ago, according to the alert mail) for no obvious reason -- I'm not actively using that server at the moment. The same NVMe controller timeout/reset shows up in the kernel log and I have lost access to the ZFS pool on it.

The disk itself seems to be physically fine, so I'm not sure whether I can ask for a refund or service. On the other hand, once the problem shows up it appears to affect both Linux and FreeBSD running on the same physical box. Maybe a firmware bug, maybe a driver issue, maybe both...

I used to suspect the problem is APST related, but I have no way to play with that setting in FreeBSD. There is hardly any mention of the term in the FreeBSD world, and no userland tool to manipulate or inspect the APST feature setting. That is kind of surprising, given the feature's bad reputation in the Linux world.

Is there anybody who can help me at least run some diagnostics on this problem? Is it feasible to manually parse the PCI config space to determine the APST status? I'd need some guidance for that, though...
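
For the APST question above: APST is not in PCI config space; it is NVMe Feature Identifier 0x0c, read with an admin Get Features command (opcode 0x0a). A rough sketch using nvmecontrol(8) admin-passthru follows; the option spelling here is an assumption based on the man page, so double-check it on your release:

nvmecontrol admin-passthru --opcode=0x0a --cdw10=0x0c \
    --data-len=256 --read --show-command nvme0
# Completion dword 0, bit 0 (APSTE) indicates whether APST is enabled; the
# 256-byte data buffer is the per-power-state transition table.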
Comment 8 Timothy Guo 2022-10-03 16:21:09 UTC
(In reply to Ian Brennan from comment #6)
Was that a migration from TrueNAS CORE to SCALE? Did you run into any regressions specific to TrueNAS SCALE? I have heard SCALE is less mature than CORE.

My software setup used to be stable until I added that NVMe disk as an upgrade. I would really like to avoid any software migration, since this setup has been working just fine for 8+ years with quite a few unusual customizations -- Xen virtualization, with LXC containers in dom0, served by TrueNAS CORE running as a Xen guest.

Xen virtualization and PCI passthrough could be a factor in this problem, but so far I haven't found any evidence of that... My current top suspect is still the APST handling in the FreeBSD driver...
Comment 9 bibi 2022-10-08 13:55:26 UTC
Hello,

I see the same error on FreeBSD 13.1:

----8<-------------
Oct  8 15:47:09 donald kernel: nvme0: WRITE sqid:2 cid:0 nsid:1 lba:88256808 len:24
Oct  8 15:47:09 donald kernel: nvme0: resubmitting queued i/o
Oct  8 15:47:09 donald kernel: nvme0: WRITE sqid:2 cid:0 nsid:1 lba:212904296 len:8
Oct  8 15:47:09 donald kernel: nvme0: resubmitting queued i/o
Oct  8 15:47:09 donald kernel: nvme0: WRITE sqid:2 cid:0 nsid:1 lba:213705328 len:8
----8<-------------

Also found in the log:

----8<-------------
Oct  7 23:49:21 donald kernel: nvme0: RECOVERY_START 6604360949574 vs 6603365139954
Oct  7 23:49:21 donald kernel: nvme0: timeout with nothing complete, resetting
Oct  7 23:49:21 donald kernel: nvme0: Resetting controller due to a timeout.
Oct  7 23:49:21 donald kernel: nvme0: RECOVERY_WAITING
Oct  7 23:49:21 donald kernel: nvme0: resetting controller
Oct  7 23:49:21 donald kernel: nvme0: aborting outstanding i/o
----8<-------------
Comment 10 Lucas Holt 2023-06-13 13:56:02 UTC
I've had the same issue on FreeBSD 13.1-RELEASE. The drive worked fine for about two months, then the ZFS pool completely flipped out (one device only) and the system had to be powered off; after rebooting, zpool status and other commands would just hang.

WD_BLACK™ SN770 NVMe™ SSD - 2TB
SKU: WDS200T3X0E 

nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme0: GET LOG PAGE (02) sqid:0 cid:0 nsid:ffffffff cdw10:007f0002 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
Solaris: WARNING: Pool 'vm' has encountered an uncorrectable I/O failure and has been suspended.
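
Decoding the repeated command above, for what it's worth (this looks like the periodic health-log poll failing once the controller stopped responding):

GET LOG PAGE (opcode 0x02), cdw10 = 0x007f0002
  bits  7:0  = 0x02  -> Log Page Identifier 0x02: SMART / Health Information
  bits 27:16 = 0x07f -> NUMDL = 127 (zero-based), i.e. 128 dwords = 512 bytes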
Comment 11 crb 2023-06-13 19:45:16 UTC
It looks like Ed posted a comment referring to a similar bug I had submitted. The TL;DR of that bug: I had a 3.3V rail that was out of spec (as reported by the BIOS). I replaced the power supply, the BIOS then reported a compliant value for the 3.3V rail, and that system has been ROCK solid ever since with the same SSD. I'm not suggesting that is your issue, but I'm trying to save people the time of reading my bug report and figuring out what the problem was in that case.
Christopher
Comment 12 Lucas Holt 2023-06-21 18:52:10 UTC
It appears this is a problem with the WD SN770 in general. https://github.com/openzfs/zfs/discussions/14793
Comment 13 Ed Maste freebsd_committer freebsd_triage 2023-06-21 19:22:06 UTC
(In reply to crb from comment #11)
Thanks for adding the follow-up; it seems I was probably hasty in adding the see-also. I'll remove it.
Comment 14 Timothy Guo 2023-06-26 17:25:42 UTC
(In reply to crb from comment #11)

I would like to share my follow-up experience with this issue.

In short, the problem magically went away after I wiped the disk and recreated the pool from backup. The same system (hardware and software) has now been working without issue for about half a year. Unfortunately, I couldn't identify a conclusive culprit during the entire procedure.

One thing I would also like to note is the 3.3V rail of the PSU. While I was still suffering from the issue, I also discovered 3.3V rail under-voltage, probably thanks to the hint from @crb's bug. I first read the out-of-range voltage value in the BIOS, then confirmed it by measuring directly at the PSU pin-out with a voltmeter. So the issue could really be power related. Unfortunately I can't tell which is cause and which is effect: is the NVMe drive drawing too much power due to a firmware bug, or is a failing PSU leading to the NVMe failures?

I contacted my PSU vendor and got the feedback that the wire connector may have aged and increased its resistance. Maybe my voltage-measuring attempt fixed the wiring connection; maybe the wipe and rebuild worked around a potential firmware bug. The issue went away as suddenly as it came. (Note: I don't remember re-assembling any hardware around the time it first appeared, though.)

The only thing I'm sure of is that the power problem was real and highly related. A stronger PSU might simply have avoided the problem altogether.
Comment 15 Marcus Oliveira 2023-07-23 16:35:00 UTC
Writing for whoever might come after me... After reading the last post about the PSU's 3.3V rail, I replaced the PSU and the problem with my NVMe drive on FreeBSD disappeared. Funnily enough, even before replacing the PSU the problem never occurred on Windows 11.

Marcus
Comment 16 Thierry Thomas freebsd_committer freebsd_triage 2023-07-23 17:17:37 UTC
(In reply to Marcus Oliveira from comment #15)

See also
https://forums.tomshardware.com/threads/nvme-ssd-sometimes-does-not-boot-because-of-weak-psu.3692473/
Comment 17 Mark Johnston freebsd_committer freebsd_triage 2023-12-30 16:20:25 UTC
Just a "me too" on a newly built system with an SN770 drive (actually two, but I haven't tested the other one yet) used by ZFS. My BIOS doesn't report any problem with the 3.3V rail. The PSU is brand new and I'm a bit reluctant to replace it at this point.

The problem isn't reproducible on demand for me, but it has been happening overnight, even when I'm not running anything and the system should be idle.
Comment 18 Wes Morgan 2024-04-27 20:01:55 UTC
Exactly the same behavior for me with two WD SN770s running in a ZFS mirror: no apparent trigger, but it happened on every boot. After reading this bug and some forum discussions, I replaced them with two Corsair MP600 PROs and the problem vanished.

The two SN770s are still plugged into the machine, but with their pool exported and idle, they show no errors.

A WD SN850X running as a root pool has no issues.