Dear all! I've encountered a problem with the nvme0 controller driver on FreeBSD. The driver/system works properly with a Samsung 950 Pro 512GB NVMe. Unfortunately, during installation from a USB stick (FreeBSD 10.3-RELEASE-p0) onto a Samsung SM961 1TB NVMe drive, I get the following messages:

nvme0: resetting controller
nvme0: aborting outstanding i/o
nvme0: WRITE sqid:8 cid:127 nsid:1 lba:5131264 len:64
nvme0: ABORTED - BY REQUEST (00/07) sqid:8 cid:127 cdw0:0

or

nvme0: resetting controller
nvme0: aborting outstanding i/o
nvme0: WRITE sqid:8 cid:127 nsid:1 lba:2047 len:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:8 cid:127 cdw0:0

The above messages repeat. It happens more often than not while mounting partitions and before copying the installation files. I can even partition and format the disk, and I can use disklabel, diskinfo and fdisk to check the drive, but sometimes the errors occur during system boot. The drive works properly on Windows 10 and shows no errors. Thanks for any help.
Can you post more of the dmesg to give proper context to when this happens?
Copyright (c) 1992-2016 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD is a registered trademark of The FreeBSD Foundation. FreeBSD 10.3-RELEASE #0 r297264: Fri Mar 25 02:10:02 UTC 2016 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64 FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512 VT(efifb): resolution 1280x1024 CPU: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (2300.05-MHz K8-class CPU) Origin="GenuineIntel" Id=0x306f2 Family=0x6 Model=0x3f Stepping=2 Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE> Features2=0x7ffefbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND> AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM> AMD Features2=0x21<LAHF,ABM> Structured Extended Features=0x37ab<FSGSBASE,TSCADJ,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,PQM,NFPUSG> XSAVE Features=0x1<XSAVEOPT> VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr TSC: P-state invariant, performance statistics real memory = 274869518336 (262136 MB) avail memory = 267073318912 (254700 MB) Event timer "LAPIC" quality 600 ACPI APIC Table: <GBT GBTUACPI> FreeBSD/SMP: Multiprocessor System Detected: 40 CPUs FreeBSD/SMP: 2 package(s) x 10 core(s) x 2 SMT threads cpu0 (BSP): APIC ID: 0 cpu1 (AP): APIC ID: 1 cpu2 (AP): APIC ID: 2 cpu3 (AP): APIC ID: 3 cpu4 (AP): APIC ID: 4 cpu5 (AP): APIC ID: 5 cpu6 (AP): APIC ID: 6 cpu7 (AP): APIC ID: 7 cpu8 (AP): APIC ID: 8 cpu9 (AP): APIC ID: 9 cpu10 (AP): APIC ID: 16 cpu11 (AP): APIC ID: 17 cpu12 (AP): APIC ID: 18 cpu13 (AP): APIC ID: 19 cpu14 (AP): APIC ID: 20 cpu15 (AP): APIC ID: 21 cpu16 (AP): APIC ID: 22 cpu17 (AP): APIC ID: 23 cpu18 (AP): APIC ID: 24 cpu19 (AP): APIC ID: 25 cpu20 (AP): APIC ID: 32 cpu21 (AP): 
APIC ID: 33 cpu22 (AP): APIC ID: 34 cpu23 (AP): APIC ID: 35 cpu24 (AP): APIC ID: 36 cpu25 (AP): APIC ID: 37 cpu26 (AP): APIC ID: 38 cpu27 (AP): APIC ID: 39 cpu28 (AP): APIC ID: 40 cpu29 (AP): APIC ID: 41 cpu30 (AP): APIC ID: 48 cpu31 (AP): APIC ID: 49 cpu32 (AP): APIC ID: 50 cpu33 (AP): APIC ID: 51 cpu34 (AP): APIC ID: 52 cpu35 (AP): APIC ID: 53 cpu36 (AP): APIC ID: 54 cpu37 (AP): APIC ID: 55 cpu38 (AP): APIC ID: 56 cpu39 (AP): APIC ID: 57 random: <Software, Yarrow> initialized ioapic0 <Version 2.0> irqs 0-23 on motherboard ioapic1 <Version 2.0> irqs 24-47 on motherboard ioapic2 <Version 2.0> irqs 48-71 on motherboard module_register_init: MOD_LOAD (vesa, 0xffffffff80dc6500, 0) error 19 kbd0 at kbdmux0 acpi0: <GBT GBTUACPI> on motherboard acpi0: Power Button (fixed) cpu0: <ACPI CPU> on acpi0 cpu1: <ACPI CPU> on acpi0 cpu2: <ACPI CPU> on acpi0 cpu3: <ACPI CPU> on acpi0 cpu4: <ACPI CPU> on acpi0 cpu5: <ACPI CPU> on acpi0 cpu6: <ACPI CPU> on acpi0 cpu7: <ACPI CPU> on acpi0 cpu8: <ACPI CPU> on acpi0 cpu9: <ACPI CPU> on acpi0 cpu10: <ACPI CPU> on acpi0 cpu11: <ACPI CPU> on acpi0 cpu12: <ACPI CPU> on acpi0 cpu13: <ACPI CPU> on acpi0 cpu14: <ACPI CPU> on acpi0 cpu15: <ACPI CPU> on acpi0 cpu16: <ACPI CPU> on acpi0 cpu17: <ACPI CPU> on acpi0 cpu18: <ACPI CPU> on acpi0 cpu19: <ACPI CPU> on acpi0 cpu20: <ACPI CPU> on acpi0 cpu21: <ACPI CPU> on acpi0 cpu22: <ACPI CPU> on acpi0 cpu23: <ACPI CPU> on acpi0 cpu24: <ACPI CPU> on acpi0 cpu25: <ACPI CPU> on acpi0 cpu26: <ACPI CPU> on acpi0 cpu27: <ACPI CPU> on acpi0 cpu28: <ACPI CPU> on acpi0 cpu29: <ACPI CPU> on acpi0 cpu30: <ACPI CPU> on acpi0 cpu31: <ACPI CPU> on acpi0 cpu32: <ACPI CPU> on acpi0 cpu33: <ACPI CPU> on acpi0 cpu34: <ACPI CPU> on acpi0 cpu35: <ACPI CPU> on acpi0 cpu36: <ACPI CPU> on acpi0 cpu37: <ACPI CPU> on acpi0 cpu38: <ACPI CPU> on acpi0 cpu39: <ACPI CPU> on acpi0 atrtc0: <AT realtime clock> port 0x70-0x71,0x74-0x77 irq 8 on acpi0 Event timer "RTC" frequency 32768 Hz quality 0 attimer0: <AT timer> port 
0x40-0x43,0x50-0x53 irq 0 on acpi0 Timecounter "i8254" frequency 1193182 Hz quality 0 Event timer "i8254" frequency 1193182 Hz quality 100 hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0 Timecounter "HPET" frequency 14318180 Hz quality 950 Event timer "HPET" frequency 14318180 Hz quality 350 Event timer "HPET1" frequency 14318180 Hz quality 340 Event timer "HPET2" frequency 14318180 Hz quality 340 Event timer "HPET3" frequency 14318180 Hz quality 340 Event timer "HPET4" frequency 14318180 Hz quality 340 Event timer "HPET5" frequency 14318180 Hz quality 340 Event timer "HPET6" frequency 14318180 Hz quality 340 Event timer "HPET7" frequency 14318180 Hz quality 340 Timecounter "ACPI-fast" frequency 3579545 Hz quality 900 acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0 pcib0: <ACPI Host-PCI bridge> on acpi0 pci255: <ACPI PCI bus> on pcib0 pcib1: <ACPI Host-PCI bridge> on acpi0 pci127: <ACPI PCI bus> on pcib1 pcib2: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0 pci0: <ACPI PCI bus> on pcib2 pcib3: <ACPI PCI-PCI bridge> irq 26 at device 1.0 on pci0 pci1: <ACPI PCI bus> on pcib3 pcib4: <ACPI PCI-PCI bridge> irq 32 at device 2.0 on pci0 pci2: <ACPI PCI bus> on pcib4 pcib5: <ACPI PCI-PCI bridge> irq 32 at device 2.2 on pci0 pci3: <ACPI PCI bus> on pcib5 pcib6: <ACPI PCI-PCI bridge> irq 40 at device 3.0 on pci0 pci4: <ACPI PCI bus> on pcib6 pcib7: <ACPI PCI-PCI bridge> irq 40 at device 3.2 on pci0 pci5: <ACPI PCI bus> on pcib7 pci0: <unknown> at device 17.0 (no driver attached) xhci0: <XHCI (generic) USB 3.0 controller> mem 0xc7200000-0xc720ffff irq 19 at device 20.0 on pci0 xhci0: 32 bytes context size, 64-bit DMA usbus0 on xhci0 pci0: <simple comms> at device 22.0 (no driver attached) pci0: <simple comms> at device 22.1 (no driver attached) ehci0: <EHCI (generic) USB 2.0 controller> mem 0xc7214000-0xc72143ff irq 18 at device 26.0 on pci0 usbus1: EHCI version 1.0 usbus1 on ehci0 pcib8: <ACPI PCI-PCI bridge> irq 16 at device 
28.0 on pci0 pci6: <ACPI PCI bus> on pcib8 pcib9: <ACPI PCI-PCI bridge> irq 18 at device 28.2 on pci0 pci7: <ACPI PCI bus> on pcib9 pcib10: <ACPI PCI-PCI bridge> at device 0.0 on pci7 pci8: <ACPI PCI bus> on pcib10 vgapci0: <VGA-compatible display> port 0x6000-0x607f mem 0xc6000000-0xc6ffffff,0xc7000000-0xc701ffff irq 16 at device 0.0 on pci8 vgapci0: Boot video device pcib11: <ACPI PCI-PCI bridge> irq 16 at device 28.4 on pci0 pci9: <ACPI PCI bus> on pcib11 nvme0: <Generic NVMe Device> mem 0xc7100000-0xc7103fff irq 16 at device 0.0 on pci9 ehci1: <EHCI (generic) USB 2.0 controller> mem 0xc7213000-0xc72133ff irq 18 at device 29.0 on pci0 usbus2: EHCI version 1.0 usbus2 on ehci1 isab0: <PCI-ISA bridge> at device 31.0 on pci0 isa0: <ISA bus> on isab0 ahci0: <Intel Wellsburg AHCI SATA controller> port 0x7050-0x7057,0x7040-0x7043,0x7030-0x7037,0x7020-0x7023,0x7000-0x701f mem 0xc7212000-0xc72127ff irq 16 at device 31.2 on pci0 ahci0: AHCI v1.30 with 6 6Gbps ports, Port Multiplier not supported ahcich0: <AHCI channel> at channel 0 on ahci0 ahcich1: <AHCI channel> at channel 1 on ahci0 ahcich2: <AHCI channel> at channel 2 on ahci0 ahcich3: <AHCI channel> at channel 3 on ahci0 ahciem0: <AHCI enclosure management bridge> on ahci0 pcib12: <ACPI Host-PCI bridge> on acpi0 pci128: <ACPI PCI bus> on pcib12 pcib13: <ACPI PCI-PCI bridge> irq 50 at device 1.0 on pci128 pci129: <ACPI PCI bus> on pcib13 igb0: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xf020-0xf03f mem 0xfbd20000-0xfbd3ffff,0xfbd44000-0xfbd47fff irq 50 at device 0.0 on pci129 igb0: Using MSIX interrupts with 9 vectors igb0: Ethernet address: 40:8d:5c:6d:c1:81 igb0: Bound queue 0 to cpu 0 igb0: Bound queue 1 to cpu 1 igb0: Bound queue 2 to cpu 2 igb0: Bound queue 3 to cpu 3 igb0: Bound queue 4 to cpu 4 igb0: Bound queue 5 to cpu 5 igb0: Bound queue 6 to cpu 6 igb0: Bound queue 7 to cpu 7 igb1: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xf000-0xf01f mem 
0xfbd00000-0xfbd1ffff,0xfbd40000-0xfbd43fff irq 52 at device 0.1 on pci129 igb1: Using MSIX interrupts with 9 vectors igb1: Ethernet address: 40:8d:5c:6d:c1:82 igb1: Bound queue 0 to cpu 8 igb1: Bound queue 1 to cpu 9 igb1: Bound queue 2 to cpu 10 igb1: Bound queue 3 to cpu 11 igb1: Bound queue 4 to cpu 12 igb1: Bound queue 5 to cpu 13 igb1: Bound queue 6 to cpu 14 igb1: Bound queue 7 to cpu 15 pcib14: <ACPI PCI-PCI bridge> irq 56 at device 2.0 on pci128 pci130: <ACPI PCI bus> on pcib14 pcib15: <ACPI PCI-PCI bridge> at device 0.0 on pci130 pci131: <ACPI PCI bus> on pcib15 pcib16: <PCI-PCI bridge> at device 2.0 on pci131 pci132: <PCI bus> on pcib16 igb2: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xe020-0xe03f mem 0xfbc20000-0xfbc3ffff,0xfb800000-0xfbbfffff,0xfbc44000-0xfbc47fff irq 61 at device 0.0 on pci132 igb2: Using MSIX interrupts with 9 vectors igb2: Ethernet address: 90:e2:ba:06:6a:d8 igb2: Bound queue 0 to cpu 16 igb2: Bound queue 1 to cpu 17 igb2: Bound queue 2 to cpu 18 igb2: Bound queue 3 to cpu 19 igb2: Bound queue 4 to cpu 20 igb2: Bound queue 5 to cpu 21 igb2: Bound queue 6 to cpu 22 igb2: Bound queue 7 to cpu 23 igb3: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xe000-0xe01f mem 0xfbc00000-0xfbc1ffff,0xfb000000-0xfb3fffff,0xfbc40000-0xfbc43fff irq 62 at device 0.1 on pci132 igb3: Using MSIX interrupts with 9 vectors igb3: Ethernet address: 90:e2:ba:06:6a:d9 igb3: Bound queue 0 to cpu 24 igb3: Bound queue 1 to cpu 25 igb3: Bound queue 2 to cpu 26 igb3: Bound queue 3 to cpu 27 igb3: Bound queue 4 to cpu 28 igb3: Bound queue 5 to cpu 29 igb3: Bound queue 6 to cpu 30 igb3: Bound queue 7 to cpu 31 pcib17: <PCI-PCI bridge> at device 4.0 on pci131 pci133: <PCI bus> on pcib17 igb4: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xd020-0xd03f mem 0xfa820000-0xfa83ffff,0xfa400000-0xfa7fffff,0xfa844000-0xfa847fff irq 56 at device 0.0 on pci133 igb4: Using MSIX interrupts with 9 vectors igb4: Ethernet 
address: 90:e2:ba:06:6a:dc igb4: Bound queue 0 to cpu 32 igb4: Bound queue 1 to cpu 33 igb4: Bound queue 2 to cpu 34 igb4: Bound queue 3 to cpu 35 igb4: Bound queue 4 to cpu 36 igb4: Bound queue 5 to cpu 37 igb4: Bound queue 6 to cpu 38 igb4: Bound queue 7 to cpu 39 igb5: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xd000-0xd01f mem 0xfa800000-0xfa81ffff,0xf9c00000-0xf9ffffff,0xfa840000-0xfa843fff irq 60 at device 0.1 on pci133 igb5: Using MSIX interrupts with 9 vectors igb5: Ethernet address: 90:e2:ba:06:6a:dd igb5: Bound queue 0 to cpu 0 igb5: Bound queue 1 to cpu 1 igb5: Bound queue 2 to cpu 2 igb5: Bound queue 3 to cpu 3 igb5: Bound queue 4 to cpu 4 igb5: Bound queue 5 to cpu 5 igb5: Bound queue 6 to cpu 6 igb5: Bound queue 7 to cpu 7 acpi_button0: <Power Button> on acpi0 uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0 uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0 orm0: <ISA Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xc8fff on isa0 ppc0: cannot reserve I/O port range est0: <Enhanced SpeedStep Frequency Control> on cpu0 est1: <Enhanced SpeedStep Frequency Control> on cpu1 est2: <Enhanced SpeedStep Frequency Control> on cpu2 est3: <Enhanced SpeedStep Frequency Control> on cpu3 est4: <Enhanced SpeedStep Frequency Control> on cpu4 est5: <Enhanced SpeedStep Frequency Control> on cpu5 est6: <Enhanced SpeedStep Frequency Control> on cpu6 est7: <Enhanced SpeedStep Frequency Control> on cpu7 est8: <Enhanced SpeedStep Frequency Control> on cpu8 est9: <Enhanced SpeedStep Frequency Control> on cpu9 est10: <Enhanced SpeedStep Frequency Control> on cpu10 est11: <Enhanced SpeedStep Frequency Control> on cpu11 est12: <Enhanced SpeedStep Frequency Control> on cpu12 est13: <Enhanced SpeedStep Frequency Control> on cpu13 est14: <Enhanced SpeedStep Frequency Control> on cpu14 est15: <Enhanced SpeedStep Frequency Control> on cpu15 est16: <Enhanced SpeedStep Frequency Control> on cpu16 est17: <Enhanced SpeedStep Frequency 
Control> on cpu17 est18: <Enhanced SpeedStep Frequency Control> on cpu18 est19: <Enhanced SpeedStep Frequency Control> on cpu19 est20: <Enhanced SpeedStep Frequency Control> on cpu20 est21: <Enhanced SpeedStep Frequency Control> on cpu21 est22: <Enhanced SpeedStep Frequency Control> on cpu22 est23: <Enhanced SpeedStep Frequency Control> on cpu23 est24: <Enhanced SpeedStep Frequency Control> on cpu24 est25: <Enhanced SpeedStep Frequency Control> on cpu25 est26: <Enhanced SpeedStep Frequency Control> on cpu26 est27: <Enhanced SpeedStep Frequency Control> on cpu27 est28: <Enhanced SpeedStep Frequency Control> on cpu28 est29: <Enhanced SpeedStep Frequency Control> on cpu29 est30: <Enhanced SpeedStep Frequency Control> on cpu30 est31: <Enhanced SpeedStep Frequency Control> on cpu31 est32: <Enhanced SpeedStep Frequency Control> on cpu32 est33: <Enhanced SpeedStep Frequency Control> on cpu33 est34: <Enhanced SpeedStep Frequency Control> on cpu34 est35: <Enhanced SpeedStep Frequency Control> on cpu35 est36: <Enhanced SpeedStep Frequency Control> on cpu36 est37: <Enhanced SpeedStep Frequency Control> on cpu37 est38: <Enhanced SpeedStep Frequency Control> on cpu38 est39: <Enhanced SpeedStep Frequency Control> on cpu39 random: unblocking device. 
usbus0: 5.0Gbps Super Speed USB v3.0 Timecounters tick every 1.000 msec usbus1: 480Mbps High Speed USB v2.0 ugen0.1: <0x8086> at usbus0 uhub0: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0 usbus2: 480Mbps High Speed USB v2.0 ugen1.1: <Intel> at usbus1 uhub1: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1 ugen2.1: <Intel> at usbus2 uhub2: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus2 nvd0: <SAMSUNG MZVKW1T0HMLH-00000> NVMe namespace nvd0: 976762MB (2000409264 512 byte sectors) ses0 at ahciem0 bus 0 scbus4 target 0 lun 0 ses0: <AHCI SGPIO Enclosure 1.00 0001> SEMB S-E-S 2.00 device ses0: SEMB SES Device ada0 at ahcich0 bus 0 scbus0 target 0 lun 0 ada0: <HGST HDN724040ALE640 MJAOA5E0> ATA8-ACS SATA 3.x device ada0: Serial Number PK1334PEJH5ULS ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada0: Command Queueing enabled ada0: 3815447MB (7814037168 512 byte sectors) ada0: Previously was known as ad4 ada1 at ahcich1 bus 0 scbus1 target 0 lun 0 ada1: <HGST HDN724040ALE640 MJAOA5E0> ATA8-ACS SATA 3.x device ada1: Serial Number PK1334PEJH1VBS ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada1: Command Queueing enabled ada1: 3815447MB (7814037168 512 byte sectors) ada1: Previously was known as ad6 ada2 at ahcich2 bus 0 scbus2 target 0 lun 0 ada2: <HGST HDN724040ALE640 MJAOA5E0> ATA8-ACS SATA 3.x device uhub1: 2 ports with 2 removable, self powered ada2: Serial Number PK1334PEJGY25S uhub2: 2 ports with 2 removable, self powered ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada2: Command Queueing enabled ada2: 3815447MB (7814037168 512 byte sectors) ada2: Previously was known as ad8 ada3 at ahcich3 bus 0 scbus3 target 0 lun 0 ada3: <HGST HDN724040ALE640 MJAOA5E0> ATA8-ACS SATA 3.x device ada3: Serial Number PK1334PEJH1UZS ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes) ada3: Command Queueing enabled ada3: 3815447MB (7814037168 512 byte sectors) uhub0: 21 ports 
with 21 removable, self powered ada3: Previously was known as ad10 SMP: AP CPU #1 Launched! SMP: AP CPU #4 Launched! SMP: AP CPU #30 Launched! SMP: AP CPU #5 Launched! SMP: AP CPU #34 Launched! SMP: AP CPU #7 Launched! SMP: AP CPU #20 Launched! SMP: AP CPU #25 Launched! SMP: AP CPU #16 Launched! SMP: AP CPU #31 Launched! SMP: AP CPU #17 Launched! SMP: AP CPU #38 Launched! SMP: AP CPU #11 Launched! SMP: AP CPU #33 Launched! SMP: AP CPU #32 Launched! SMP: AP CPU #14 Launched! SMP: AP CPU #21 Launched! SMP: AP CPU #6 Launched! SMP: AP CPU #22 Launched! SMP: AP CPU #10 Launched! SMP: AP CPU #35 Launched! SMP: AP CPU #8 Launched! SMP: AP CPU #3 Launched! SMP: AP CPU #2 Launched! SMP: AP CPU #23 Launched! SMP: AP CPU #12 Launched! SMP: AP CPU #18 Launched! SMP: AP CPU #27 Launched! SMP: AP CPU #36 Launched! SMP: AP CPU #39 Launched! SMP: AP CPU #19 Launched! SMP: AP CPU #15 Launched! SMP: AP CPU #29 Launched! SMP: AP CPU #24 Launched! SMP: AP CPU #26 Launched! SMP: AP CPU #37 Launched! SMP: AP CPU #28 Launched! SMP: AP CPU #9 Launched! SMP: AP CPU #13 Launched! 
Timecounter "TSC-low" frequency 1150026386 Hz quality 1000 Root mount waiting for: usbus2 usbus1 usbus0 ugen0.2: <no manufacturer> at usbus0 uhub3: <no manufacturer Gadget USB HUB, class 9/0, rev 2.00/0.00, addr 1> on usbus0 ugen2.2: <vendor 0x8087> at usbus2 uhub4: <vendor 0x8087 product 0x8002, class 9/0, rev 2.00/0.05, addr 2> on usbus2 ugen1.2: <vendor 0x8087> at usbus1 uhub5: <vendor 0x8087 product 0x800a, class 9/0, rev 2.00/0.05, addr 2> on usbus1 uhub5: 6 ports with 6 removable, self powered uhub4: 8 ports with 8 removable, self powered uhub3: 5 ports with 5 removable, self powered Root mount waiting for: usbus0 ugen0.3: <Avocent> at usbus0 ukbd0: <Keyboard> on usbus0 kbd1 at ukbd0 ugen0.4: <Dell> at usbus0 ukbd1: <EP1 Interrupt> on usbus0 kbd2 at ukbd1 Root mount waiting for: usbus0 Root mount waiting for: usbus0 ugen0.5: <SanDisk> at usbus0 umass0: <SanDisk Extreme, class 0/0, rev 3.00/0.10, addr 4> on usbus0 umass0: SCSI over Bulk-Only; quirks = 0x0100 umass0:5:0:-1: Attached to scbus5 Trying to mount root from ufs:/dev/da0p3 [rw,noatime]... mountroot: waiting for device /dev/da0p3 ... da0 at umass-sim0 bus 0 scbus5 target 0 lun 0 da0: <SanDisk Extreme 0001> Removable Direct Access SPC-4 SCSI device da0: Serial Number AA011021141312316547 da0: 400.000MB/s transfers da0: 61057MB (125045424 512 byte sectors) da0: quirks=0x2<NO_6_BYTE> WARNING: / was not properly dismounted igb0: link state changed to UP igb1: link state changed to UP igb2: link state changed to UP ums0: <Mouse> on usbus0 ums0: 3 buttons and [Z] coordinates ID=0 ums1: <Mouse REL> on usbus0 ums1: 3 buttons and [XYZ] coordinates ID=0
crw-r----- 1 root operator 0x53 Aug 10 06:24 /dev/nvd0
crw-r----- 1 root operator 0x5a Aug 10 06:24 /dev/nvd0s1
crw------- 1 root wheel    0x27 Aug 10 06:24 /dev/nvme0
crw------- 1 root wheel    0x51 Aug 10 06:24 /dev/nvme0ns1

/dev/nvd0 512 1024209543168 2000409264 512 0

******* Working on device /dev/nvd0 *******
parameters extracted from in-core disklabel are:
cylinders=124519 heads=255 sectors/track=63 (16065 blks/cyl)

Figures below won't work with BIOS for partitions not in cyl 1
parameters to be used for BIOS calculations are:
cylinders=124519 heads=255 sectors/track=63 (16065 blks/cyl)

Media sector size is 512
Warning: BIOS sector numbering starts with sector 1
Information from DOS bootblock is:
The data for partition 1 is:
sysid 7 (0x07),(NTFS, OS/2 HPFS, QNX-2 (16 bit) or Advanced UNIX)
    start 2048, size 2000404480 (976760 Meg), flag 0
        beg: cyl 0/ head 32/ sector 33;
        end: cyl 1023/ head 254/ sector 63
The data for partition 2 is:
<UNUSED>
The data for partition 3 is:
<UNUSED>
The data for partition 4 is:
<UNUSED>

******* Working on device /dev/nvd0 *******
parameters extracted from in-core disklabel are:
cylinders=124519 heads=255 sectors/track=63 (16065 blks/cyl)

Figures below won't work with BIOS for partitions not in cyl 1
parameters to be used for BIOS calculations are:
cylinders=124519 heads=255 sectors/track=63 (16065 blks/cyl)

Media sector size is 512
Warning: BIOS sector numbering starts with sector 1
Information from DOS bootblock is:
The data for partition 1 is:
sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
    start 63, size 2000397672 (976756 Meg), flag 80 (active)
        beg: cyl 0/ head 1/ sector 1;
        end: cyl 614/ head 254/ sector 63
The data for partition 2 is:
<UNUSED>
The data for partition 3 is:
<UNUSED>
The data for partition 4 is:
<UNUSED>

=>         34  2000409197  nvd0  GPT  (954G)
           34        2014     1  efi  (1.0M)
         2048    10485760     2  freebsd-ufs  (5.0G)
     10487808   104857600     3  freebsd-ufs  (50G)
    115345408    20971520     4  freebsd-ufs  (10G)
    136316928    10485760     5  freebsd-ufs  (5.0G)
    146802688    10485760     6  freebsd-ufs  (5.0G)
    157288448    10485760     7  freebsd-ufs  (5.0G)
    167774208  1832635023     8  freebsd-ufs  (874G)
I've got the same problem with a Samsung SSD SM961 256GB and FreeBSD 11.0. The SSD works fine with Ubuntu and Windows.

My equipment:
Xeon E3-1245 v5
Gigabyte GA-X170-Extreme ECC
2x Kingston KVR21E15D8K2/16I = 32GB
Samsung SSD SM961 256GB, M.2 (MZVPW256HEGL-00000)
I have the same problem on 10-STABLE (r309209) with a 128GB SM961 and can also reproduce it on CURRENT (20161117 snapshot). Even attempting to read the device with dd causes the error:

(0:21) pool13:/sysprog/terry# dd if=/dev/nvme0ns1 of=/dev/null count=1
nvme0: resetting controller
nvme0: aborting outstanding i/o
nvme0: READ sqid:8 cid:127 nsid:1 lba:3 len:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:8 cid:127 cdw0:0

nvmecontrol reports that the drive has never experienced an error, even after the above:

(0:30) pool13:/sysprog/terry# nvmecontrol logpage -p 1 nvme0
Error Information Log
=====================
No error entries found

smartctl similarly reports no problems. Booting Arch Linux 2016.11.01 lets me read and write the drive with no problems, so I don't think this is a hardware problem. I have the module in a dedicated test system and can provide an HTTPS remote console to the developer(s) if it would help pin down the problem. Let me know if you'd like any further information.
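For anyone gathering data for the developers, the reset/abort churn in a saved dmesg capture can be tallied with a short script. This is just a sketch; the embedded log is sample data modeled on the messages in this report, and in real use you would point it at your own captured dmesg output instead.

```shell
#!/bin/sh
# Sketch: tally nvme reset/abort churn in a dmesg capture.
# The heredoc below is stand-in sample data; replace it with a real capture.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
nvme0: resetting controller
nvme0: aborting outstanding i/o
nvme0: READ sqid:8 cid:127 nsid:1 lba:3 len:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:8 cid:127 cdw0:0
EOF
resets=$(grep -c 'resetting controller' "$LOG")
aborts=$(grep -c 'ABORTED - BY REQUEST' "$LOG")
echo "controller resets: $resets"
echo "aborted commands:  $aborts"
rm -f "$LOG"
```

On the sample above it reports one reset and one aborted command; on a real capture the ratio of resets to aborts may hint at whether the recovery path ever succeeds.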
Hi all. I have a similar issue. My hardware is a Lenovo ThinkPad P50 with a 512GB NVMe SSD running CURRENT (synced yesterday) with a custom kernel configuration that differs from GENERIC only in that VESA is disabled (nvidia issues while resuming from ACPI S3 otherwise).

$ nvmecontrol devlist
 nvme0: SAMSUNG MZVKV512HAJH-000L1
    nvme0ns1 (488386MB)

The particular manifestation I see is on resume, when disk access seems to stall for 5-10 seconds. The relevant messages in the kernel log buffer are a scary stream of:

nvme0: READ sqid:6 cid:124 nsid:1 lba:697441256 len:8
nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:124 cdw0:0
nvme0: aborting outstanding i/o
nvme0: READ sqid:8 cid:109 nsid:1 lba:399615616 len:40
nvme0: ABORTED - BY REQUEST (00/07) sqid:8 cid:109 cdw0:0

Please let me know if you need any more specifics. Thanks.
(In reply to Robin Randhawa from comment #6) That certainly looks like the same issue. I'm working with the driver developer on this issue and we have reproduced it on my system. I've sent some debugging info, but may need to run a specially-instrumented version of the driver to track down where the fault is being triggered. If you need this working now (with possibly VERY reduced performance), you can add:

hw.nvme.per_cpu_io_queues=0

to your /boot/loader.conf file. Note that this is a workaround, not a fix.
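Mechanically, applying that workaround is just one line in loader.conf plus a reboot. A sketch of doing it idempotently, operating on a scratch copy (on a real system the target is /boot/loader.conf and you would edit it as root):

```shell
#!/bin/sh
# Sketch: append the workaround tunable to loader.conf if not already present.
# Uses a scratch file so it is safe to run anywhere; the real target
# on an affected system is /boot/loader.conf.
CONF=$(mktemp)
echo '# existing loader.conf contents (illustrative)' > "$CONF"
TUNABLE='hw.nvme.per_cpu_io_queues=0'
grep -q '^hw.nvme.per_cpu_io_queues' "$CONF" || echo "$TUNABLE" >> "$CONF"
cat "$CONF"
```

The grep guard means re-running the script never duplicates the line, which matters since loader.conf is parsed at every boot.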
(In reply to Terry Kennedy from comment #7) On second thought, if yours only happens on resume, it may not be the same bug at all - it may be something in the suspend/resume code that isn't restoring the NVMe state until the request times out and the recovery code kicks in. Plus, I think your module is a 951 and people are successfully using that module - all the other reports in this thread are on the 961.
(In reply to Terry Kennedy from comment #8) Hi Terry. Thanks for the responses. I concur with your view that this is likely something to do with the suspend/resume pathway and missing context that isn't restored correctly (or at all). BTW, I was incorrect in stating a ~5 second 'stall'; it is in fact closer to ~30 seconds. So after a resume from ACPI S3, there is a ~30 second freeze before the abort messages appear and the system becomes usable again. Most irritating. I'm trying to grok the nvme driver source to see if there are some basic suspend/resume callbacks that need fleshing out. I will update if I make any headway. I also worry that there are some ACPI <-> NVMe interactions that may be to blame. This is a fairly new laptop and there are a whole lot of ACPI-related messages appearing in the kernel log buffer at suspend/resume time. Cheers.
I'm having this issue on a fairly recent SuperMicro Xeon D-1541 (SuperServer 5018D-FN4T) system as soon as I put _two_ NVMe SSDs in. When using only the onboard M.2/M-key slot on the motherboard I can install FreeBSD 11.0-RELEASE-p1 just fine. As soon as I add another NVMe SSD via a PCIe card adapter, I get the "resetting controller" messages that IPTRACE already mentioned. There is no difference whether I put both on the PCIe card, or one on the motherboard and one on the PCIe card. I'm using the SuperMicro AOC-SLG3-2M2 for adding the second NVMe SSD; the drives are Samsung SM961/256GB models. Since I basically cannot put the system into production due to this bug, I'm happy to test anything that may help track it down and fix it. If you need any more details of the hardware or BIOS/UEFI settings used, I'm happy to provide them. I'll also investigate Terry Kennedy's loader.conf hint and check whether it helps, and also how much of a performance hit the system takes. Best regards MacLemon
Any news on this bug? I'm having the same error with 2x SM961 1TB NVMe drives on PCIe adapters in a Dell R720 running FreeNAS 10 RC1.
(In reply to tkurmann from comment #11) I just pushed a few changes into -head that may help. Any chance you can try booting a snapshot?
Created attachment 180826 [details] Boot capture with uname and "resetting controller"
Created attachment 180827 [details] dd fails with error, but made some progress
Created attachment 180828 [details] newfs output - also makes some progress
Looks like my text got lost when I added the images. This is a boot with 12-CURRENT 20170309 (the latest snapshot available). It looks like the driver is making a bit more progress with the nvd0 device, but it still ends up in a "resetting controller" loop. Sorry about the .BMP attachments; that's the way my remote management card does screenshots.
I've done some testing now on FreeBSD 11.0-RELEASE-p3 and 12.0-CURRENT. I've tested two of these Samsung SM961/256GB in a SuperServer 5018D-FN4T with a SuperMicro AOC-SLG3-2M2. I've tried all possible combinations of the two SSDs in the 3 available M.2 slots of that combo. I've also tried pretty much every combination of Legacy/BIOS/UEFI settings for the PCIe slots, to no avail. The SSDs are only recognized at all every few reboots, and I never managed to get both to show up at once. Using the onboard slot or the 2M2 card doesn't make any difference. I did not manage to successfully initialize these SSDs, let alone create a ZFS mirror from them to boot from. Just for completeness, I've tested them running Debian/Sid and they behave just as unusably there. I'll be returning these Samsung SSDs and try to get Toshiba X3 instead.
(In reply to MacLemon from comment #17) I'd suggest testing with a single card first, to rule out potential PCIe bifurcation problems. My single SM961 works as expected under Linux (some random live CD I downloaded), but gives the "resetting controller" message under FreeBSD. A lot of stuff makes assumptions about there being a single "thing" in a PCIe slot. There seem to be 2 types of PCIe/NVMe multi-module adapters. The first just takes the 4 lanes from each NVMe module and puts them directly on the PCIe bus, so a 2 * M.2 adapter uses 8 PCIe lanes, 4 for each NVMe. The other type has a PCIe-to-PCIe bridge on the board.
My Samsung 960 PRO works great. We have other (hundreds of) drives at work that are doing close to 3.8GB/s steady for hours... so it can work. Let's dig down a level.

The 'reset' messages Terry is seeing in the two screenshots he just posted are either the result of some prankster doing an nvmecontrol reset (quite unlikely), or the result of the driver calling reset internally. It does this only when it gets a timeout for a command. Assuming for the moment that the timeout code is good, there's a command that's coming back bad and we wind up here:

nvme_timeout(void *arg)
...
	/* Read csts to get value of cfs - controller fatal status. */
	csts.raw = nvme_mmio_read_4(ctrlr, csts);

	if (ctrlr->enable_aborts && csts.bits.cfs == 0) {
		/*
		 * If aborts are enabled, only use them if the controller is
		 * not reporting fatal status.
		 */
		nvme_ctrlr_cmd_abort(ctrlr, tr->cid, qpair->id,
		    nvme_abort_complete, tr);
	} else
		nvme_ctrlr_reset(ctrlr);

So we read CSTS (the controller status), and unless aborts are enabled (the tunable hw.nvme.enable_aborts defaults to 0, so the reset branch is the path we're probably taking unless you've already changed it), we do a reset. The reset turns out to be unsuccessful, and we drive off the road into the ditch with the follow-on errors. So maybe try setting hw.nvme.enable_aborts=1 and trying again.

I'd normally ask about all the stupid issues here: is power good, are the connections good, are you seeing PCIe errors (pciconf -lbace nvmeX), etc., but with so many reports I assume that's unlikely to be fruitful for everybody. Maybe I'll try to find a Samsung 950 Pro 512GB (which form factor do you have?) and try as well, but that process will take a week or two since I have an offsite soon and I don't think I can get one here before then.
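For reference, flipping that branch at boot is one line in /boot/loader.conf. A hypothetical excerpt (add it, reboot, and watch whether the log changes from resets to aborts):

```
# /boot/loader.conf
hw.nvme.enable_aborts=1
```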
In addition, you can control the timeout period with the sysctl dev.nvme.X.timeout=S, where S is the number of seconds. The min is 5, the max is 120, and the default is 30. It might be helpful to see whether setting it lower causes this to happen more often, or setting it higher causes it to happen less often. It's possible that we're missing a completion interrupt, or that we get one and somehow take a code path that doesn't cancel the timeout (though given there were actual I/Os that were aborted, that seems unlikely). Disabling TRIM might also make things not suck so badly, but that wouldn't help Terry's simple newfs. We had issues with insanely slow TRIMs for a drive we were evaluating under NDA that might be relevant: we had no issues with newfs, nor with the drive itself, once we turned TRIM off in UFS. I couldn't find a 950 PRO, but was able to find an SM961. I'll see if I can recreate this issue on my NUC6.
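To make the experiment concrete: the timeout sysctl only accepts values in the quoted [5, 120] range, so a sweep should stay inside it. A tiny shell illustration of that clamping (this mirrors the documented min/max, it is not the driver's code):

```shell
#!/bin/sh
# Sketch: the documented bounds on dev.nvme.X.timeout (seconds).
# Illustrative only; values outside [MIN, MAX] are pulled back in.
MIN=5
MAX=120
clamp() {
    v=$1
    if [ "$v" -lt "$MIN" ]; then v=$MIN; fi
    if [ "$v" -gt "$MAX" ]; then v=$MAX; fi
    echo "$v"
}
echo "requested 2   -> effective $(clamp 2)"     # below min, raised to 5
echo "requested 30  -> effective $(clamp 30)"    # the default, unchanged
echo "requested 300 -> effective $(clamp 300)"   # above max, capped at 120
```

So a useful sweep might be sysctl dev.nvme.0.timeout set to 5, 30, and 120, noting whether the reset frequency tracks the setting.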
(In reply to Warner Losh from comment #19) It is an MZVPW128HEGM-00000, which is an SM961. This has happened in multiple systems (identical hardware configs; see below) which otherwise are operating flawlessly. In previous correspondence with jimharris@, he had me try setting hw.nvme.per_cpu_io_queues=0, which concealed the problem but resulted in abysmal I/O performance (as expected). He also had me set some debug loader tunables and post the results. Those are at https://www.glaver.org/transient/nvme I just moved the card (SM961 on a generic PCIe slot adapter) to a Dell PowerEdge R710; it had been in a Supermicro X8DTH-iF. It works fine in the R710, even on 10.3-STABLE. Both systems are as similar as I could make them: both use the Intel 5520 chipset, both have 2 * X5680 CPUs, both have 48GB of RAM (same part number in both systems). So it seems (at least in my case) to be related to the system hardware. I can try ordering some other M.2 NVMe module to see if this issue is specific to the SM961, or if it is a problem with any NVMe on the X8DTH-iF board.
I can confirm the finding of Terry Kennedy. With the latest snapshot 12-CURRENT 20170309, dd can read and write a couple of kilobytes until a reset is requested. Further, after the five "aborting outstanding i/o" messages (30 s apart), the drive can no longer read or write using dd. nvmecontrol reset also seems to have no effect, though identify and devlist still work. Here is my exact hardware:

Dell PowerEdge R720
2x Xeon E5-2670
192 GB RAM
SM961 on ASUS PCIe x4 to M.2 breakout

I will also try the drive on a desktop tomorrow with the snapshot and report back.
Progress! I updated the BIOS of the R720 to 2.5.4 and rebooted FreeBSD with the latest snapshot 12-CURRENT 20170309. To my surprise, reading using dd works with any block size. The speed is capped at 2.0 GB/s, but it works. Writing, on the other hand, seems to work only up to a bs of around 512k (I haven't found the exact threshold yet), and then it times out again. The speed cap made me suspicious, so I checked which PCIe version the card was running at (pciconf -lbace nvme0), and of course it was version 2.0. Under Ubuntu 16.04 the card was reported at version 3.0 and the speed limit was 3.2 GB/s. I assume this is related somehow; any thoughts?
Gen2 PCIe is limited to 2GB/s for that setup. That's your problem, and likely an indicator of the solution... When you say 'under Ubuntu', is that on the same physical hardware or a different system? If it is just a reboot between the two performance profiles, that tells me one thing. If it is in a physically separate box, that tells me something else. At work we have some drives that are defective (bad resistors that need to be swapped out) because they can't keep the link established at x4 PCIe3 speeds. Either they fall back to x1 PCIe3 speeds, or to x2 and/or PCIe2 speeds. And when they do, they aren't super reliable, in addition to being slow. FreeBSD currently does a poor job of dealing with PCIe errors, so links can get into crazy states where they perform horribly. Maybe Linux is better able to reset the links on errors. If so, then that's up the alley of some uncommitted AER / link-retrain code I've been working on.
(In reply to Warner Losh from comment #24) I think there may be multiple bugs all getting lumped into this PR. On my Dell R710 (same exact CPUs, memory modules, and system chipset as the Supermicro X8DTH-iF where I get the hangs) with FreeBSD 10.3-STABLE, I get:

(0:4) pool20:/sysprog/terry# dd if=/dev/nvd0 of=/dev/null bs=16m
7631+1 records in
7631+1 records out
128035676160 bytes transferred in 86.524463 secs (1479762738 bytes/sec)
(0:5) pool20:/sysprog/terry# dd if=/dev/zero of=/dev/nvd0 bs=16m
dd: /dev/nvd0: short write on character device
dd: /dev/nvd0: end of device
7632+0 records in
7631+1 records out
128035676160 bytes transferred in 164.568004 secs (778010750 bytes/sec)
(1:6) pool20:/sysprog/terry# pciconf -lbace nvme0
nvme0@pci0:5:0:0: class=0x010802 card=0xa801144d chip=0xa804144d rev=0x00 hdr=0x00
    bar [10] = type Memory, range 64, base rxdf2fc000, size 16384, enabled
    cap 01[40] = powerspec 3 supports D0 D3 current D0
    cap 05[50] = MSI supports 32 messages, 64 bit
    cap 10[70] = PCI-Express 2 endpoint max data 256(256) FLR RO NS
                 link x4(x4) speed 5.0(8.0) ASPM disabled(L1)
    cap 11[b0] = MSI-X supports 33 messages, enabled
                 Table in map 0x10[0x3000], PBA in map 0x10[0x2000]
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 0 corrected
    ecap 0003[148] = Serial 1 0000000000000000
    ecap 0004[158] = Power Budgeting 1
    ecap 0019[168] = PCIe Sec 1 lane errors 0
    ecap 0018[188] = LTR 1
    ecap 001e[190] = unknown 1

This is in a PCIe 2 x4 slot, but the write speed matches the SM961 datasheet for the 128GB version - "700MB/Sec sequential write". Sequential read is probably being limited by PCIe 2 speeds, as the datasheet specifies "Up to 3100MB/Sec". These numbers are from "Samsung Rev 1.0, June 2016". I can't post the whole thing as it is marked Company Confidential. On the Supermicro X8DTH-iF where FreeBSD gets the controller resets, Arch Linux 2016.11.01 reads and writes the SM961 at approximately the same speeds as on the Dell R710, with no resets.
Terry: Is this one supermicro that you've booted FreeBSD, then booted linux and observed the difference? Or is it two different supermicros that are otherwise identical?
(In reply to Warner Losh from comment #26) Exact same system. Wasn't even power-cycled.
OK. It isn't the SM961 generally. I have one in a NUC6 on my desk now (it arrived last night). With 5 dd's reading it, I can get 2.5GB/s. More dd's don't help. Data sheet says up to 3.1GB/s. But I'm seeing no errors with my card, though it has AER reporting. I was able to newfs the 256GB version w/o issue. I suspect, but cannot prove, this may be a signal issue causing errors. Linux seems to cope better maybe? Or maybe it knows to push it less hard? Not sure. I need to polish off the AER code I've written to monitor other devices I have.
(In reply to Warner Losh from comment #28) As I said, my SM961 works fine in a Dell PowerEdge R710 with the same model of CPU, same amount and model of memory chips, and same platform controller as the Supermicro X8DTH-iF where it doesn't work and gets the controller resets. So the Dell and the Supermicro are doing something differently that the drive and / or card don't like. As I mentioned a few replies further up, I can try purchasing some other brand of NVMe card to see if this issue is specific to the SM961 or something more generic.
(In reply to Warner Losh from comment #24) Same machine same hardware. Read performance (single dd if.. of=/dev/zero) on Ubuntu is 3.0 GB/s on the R720 and 2.0 GB/s with FreeBSD (snapshot 20170323). On a single CPU system I measured around 3.2 GB/s. Read speeds are faster and writes have no errors on Linux, whereas FreeBSD fails at writing with block sizes > 1K.
Intel Skull Canyon NUC with WDC WDS512G1X0C-00ENX0 on VEN_15B7 DEV_5001 on C230 Chipset Linux / Win10 - fine FreeBSD / TrueOS (FreeBSD 12 Current) recognize card but soon after boot card goes offline and stays offline until machine is powered off. (Reboot alone won't clear the state. Reboot into other OS not possible. UEFI can not see drive until powered off and back on.) Played with BIOS PCI power settings to little avail. - G
So something is hanging the card so that posted transactions don't complete. There's a small chance this is some other runaway thread in a different driver (we see that at work), but it would be useful to know which transactions are pending prior to it hanging.
(In reply to Warner Losh from comment #32) Back when I first ran into this, I sent Jim Harris a bunch of "sysctl dev.nvme.0.ioq*.dump_debug=1" traces that he requested. He had me configure the driver with "hw.nvme.per_cpu_io_queues=0" which caused the card to work, but also crippled performance. I recently obtained an Optane 16GB NVMe stick, and that one does work in the Supermicro board (where the PM961 didn't). I don't know if that proves anything.
I'm having the same problem on a Supermicro SYS-5019S-M with a Samsung SM961 128GB. Right now the boot does not complete, I guess due to ZFS probing the disk. I'll see if hw.nvme.per_cpu_io_queues=0 will make the kernel complete booting.
(In reply to stb from comment #34) Setting hw.nvme.per_cpu_io_queues=0 works.
(In reply to stb from comment #35) Can you provide more information about reduced performance? # diskinfo -t /dev/nvme0ns1
[root@foo ~]# diskinfo -t /dev/nvd0
/dev/nvd0
        512             # sectorsize
        128035676160    # mediasize in bytes (119G)
        250069680       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        S347NY0HB01730  # Disk ident.

Seek times:
        Full stroke:      250 iter in  0.014551 sec =  0.058 msec
        Half stroke:      250 iter in  0.015022 sec =  0.060 msec
        Quarter stroke:   500 iter in  0.029067 sec =  0.058 msec
        Short forward:    400 iter in  0.015134 sec =  0.038 msec
        Short backward:   400 iter in  0.015675 sec =  0.039 msec
        Seq outer:       2048 iter in  0.063374 sec =  0.031 msec
        Seq inner:       2048 iter in  0.057973 sec =  0.028 msec
Transfer rates:
        outside:       102400 kbytes in  0.094174 sec = 1087349 kbytes/sec
        middle:        102400 kbytes in  0.089065 sec = 1149722 kbytes/sec
        inside:        102400 kbytes in  0.089141 sec = 1148742 kbytes/sec

I think I should be getting 2.2GB/s. With 4 concurrent dd's, gstat shows:

[root@foo ~]# gstat -I60s -f '^....$'
dT: 60.002s  w: 60.000s  filter: ^....$
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    4  13578  13578 1737975    0.3      0      0    0.0  100.0| nvd0
[root@foo ~]# for i in 0 1 2 3; do dd if=/dev/nvd0 of=/dev/null bs=1m count=100k & done; wait; echo 'done'
[1] 41520
[2] 44578
[3] 46696
[4] 47833
102400+0 records in
102400+0 records out
107374182400 bytes transferred in 192.262522 secs (558476927 bytes/sec)
[1]   Done       dd if=/dev/nvd0 of=/dev/null bs=1m count=100k
102400+0 records in
102400+0 records out
107374182400 bytes transferred in 241.421031 secs (444759026 bytes/sec)
[2]   Done       dd if=/dev/nvd0 of=/dev/null bs=1m count=100k
102400+0 records in
102400+0 records out
107374182400 bytes transferred in 241.552144 secs (444517613 bytes/sec)
[3]-  Done       dd if=/dev/nvd0 of=/dev/null bs=1m count=100k
102400+0 records in
102400+0 records out
107374182400 bytes transferred in 241.559861 secs (444503412 bytes/sec)
[4]+  Done       dd if=/dev/nvd0 of=/dev/null bs=1m count=100k
done

So I'm guessing the penalty is not too big.
The 128 GB model has a significantly lower write speed compared to the 256GB and 512GB models (around 800MB/s I believe), so I didn't test that.
(In reply to stb from comment #34) One more detail: the SM961 supports PCIe 3.0 with 4 lanes, but the M.2 socket on the X11SSH-F provides only two lanes. I have no idea whether this should make a difference in function or not, but it will limit the theoretical performance to about 2 GB/s. With that in mind, my numbers look like it's using the available bandwidth fully. I guess there appears to be no performance penalty, at least for large, linear transfers.
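The lane math above can be checked with a quick back-of-the-envelope calculation. Per-direction PCIe link bandwidth is the per-lane transfer rate times the line-encoding efficiency (8b/10b for Gen2, 128b/130b for Gen3), ignoring TLP/DLLP protocol overhead, which shaves off a further 10-20% in practice:

```python
def pcie_bandwidth_gbs(gen, lanes):
    """Approximate raw per-direction bandwidth in GB/s, before protocol overhead."""
    # (gigatransfers/s per lane, line-encoding efficiency)
    rates = {2: (5.0, 8 / 10), 3: (8.0, 128 / 130)}
    gt_per_s, efficiency = rates[gen]
    return gt_per_s * efficiency / 8 * lanes  # divide by 8: bits -> bytes

print(pcie_bandwidth_gbs(2, 4))  # Gen2 x4  -> 2.0 GB/s  (the R720 slot above)
print(pcie_bandwidth_gbs(3, 2))  # Gen3 x2  -> ~1.97 GB/s (the X11SSH-F M.2 socket)
print(pcie_bandwidth_gbs(3, 4))  # Gen3 x4  -> ~3.94 GB/s (the SM961's full link)
```

This matches both observations in the thread: a Gen2 x4 link tops out at 2 GB/s, and a Gen3 x2 M.2 socket lands at the same ~2 GB/s ceiling.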
This is what I get from pciconf:

[root@foo ~]# pciconf -lBbcevV nvme0@pci0:3:0:0
nvme0@pci0:3:0:0: class=0x010802 card=0xa801144d chip=0xa804144d rev=0x00 hdr=0x00
    vendor   = 'Samsung Electronics Co Ltd'
    device   = 'NVMe SSD Controller SM961/PM961'
    class    = mass storage
    subclass = NVM
    bar [10] = type Memory, range 64, base rxdf100000, size 16384, enabled
    cap 01[40] = powerspec 3 supports D0 D3 current D0
    cap 05[50] = MSI supports 32 messages, 64 bit
    cap 10[70] = PCI-Express 2 endpoint max data 256(256) FLR NS
                 link x2(x4) speed 8.0(8.0) ASPM L1(L1)
    cap 11[b0] = MSI-X supports 33 messages, enabled
                 Table in map 0x10[0x3000], PBA in map 0x10[0x2000]
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[148] = Serial 1 0000000000000000
    ecap 0004[158] = Power Budgeting 1
    ecap 0019[168] = PCIe Sec 1 lane errors 0
    ecap 0018[188] = LTR 1
    ecap 001e[190] = unknown 1
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Advisory Non-Fatal Error
Having the same issue, using a Lenovo 4xb0m52449 256GB NVMe M.2 SSD. VMware operates the SSD fine, but if I try to clean-install FreeBSD on it, I'm stuck in the "resetting controller" loop.
Same problem here. Works fine under Debian Stretch 9.2 with the same hardware.

Supermicro X10DRL-I-O motherboard
ASUS Hyper M.2 x16 NVMe card
2 x Samsung PM961 256GB
1 x Samsung PM961 128GB

I've disabled hw.nvme.per_cpu_io_queues and it's working, but slowly, I think. Here are some dmesg lines with hw.nvme.per_cpu_io_queues=0:

FreeBSD 11.1-STABLE #0 r321665+d4625dcee3e(freenas/11.1-stable): Wed Dec 13 16:33:42 UTC 2017
CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (2100.04-MHz K8-class CPU)
FreeBSD/SMP: Multiprocessor System Detected: 32 CPUs
FreeBSD/SMP: 2 package(s) x 8 core(s) x 2 hardware threads
nvme0: <Generic NVMe Device> mem 0xc7800000-0xc7803fff irq 40 at device 0.0 numa-domain 0 on pci6
nvme1: <Generic NVMe Device> mem 0xc7700000-0xc7703fff irq 40 at device 0.0 numa-domain 0 on pci7
nvme2: <Generic NVMe Device> mem 0xc7600000-0xc7603fff irq 40 at device 0.0 numa-domain 0 on pci8
nvd0: <SAMSUNG MZVLW128HEGR-00000> NVMe namespace
nvd0: 122104MB (250069680 512 byte sectors)
nvd1: <SAMSUNG MZVLW256HEHP-000L7> NVMe namespace
nvd1: 244198MB (500118192 512 byte sectors)
nvd2: <SAMSUNG MZVLW256HEHP-000L7> NVMe namespace
nvd2: 244198MB (500118192 512 byte sectors)
Same problem during FreeBSD installation on a Lenovo ThinkPad T470p (storage: Samsung PM961 NVMe MZVLW512HMJP, 512 GB, M.2 SSD):

nvme0: resetting controller
nvme0: aborting outstanding i/o
nvme0: READ sqid:8 cid:127 nsid:1 lba:1000215153 len:4
nvme0: ABORTED - BY REQUEST (00/07) sqid:8: cid:127 cdw0:0
There are several additional complications while using a Lenovo (IdeaPad 700). The most frustrating one is that sometimes the system will decide the nvme disk is okay, with no timeouts, and the next boot times out continually. Aack! Out of desperation, I built a system disk on an external drive. Most of the time that works with no use of nvme and starts without an issue, but if the timeouts occur during boot, it may take several hours to complete the error checking. Stupid question: is the driver detecting the proper device and selecting the correct flavor of the driver? The messages look right, but the inconsistency would suggest that something of that sort is happening. How would I verify this?
I have exactly the same issue on my SSD: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961. The issue happened with the 12.0-CURRENT installation media, and also with the system installed. I am able to work around this issue with kern.smp.disabled=1, so I can boot; however, it also disables the multiprocessor feature (and I end up with a single core). (The "Fail Safe" option at boot includes this fix.) Hope this helps!
Talked to Jim Harris the other day... What might be going on here is a lost interrupt, so we timeout. I'm going to modify the timeout code to check completions before doing a reset. If we find any, we'll complete the I/Os and continue, otherwise we'll reset the card. This may help.
same as here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211713#c43 dumps on heavy load during update, Lenovo Ideapad 700
(In reply to Warner Losh from comment #45) This is also my conclusion about the problem. I have managed to overcome the interrupt timeout issues by disabling PCI Express MSI interrupt signalling in loader.conf with hw.pci.enable_msi="0". Disabling this globally solves the issue, but it will cause problems with other PCI Express devices that do not fully function with MSI-X interrupt signalling. Should these specific PCI Express controllers be added to a quirk list, or what would be the correct way of solving the issue? What does this tunable mean? Excerpt from man pci:

hw.pci.enable_msi (Defaults to 1) Enable support for Message Signalled Interrupts (MSI). MSI interrupts can be disabled by setting this tunable to 0.

Some additional details on PCI Express interrupts:
https://en.wikipedia.org/wiki/Message_Signaled_Interrupts
https://electronics.stackexchange.com/questions/76867/what-do-the-different-interrupts-in-pcie-do-i-referring-to-msi-msi-x-and-intx
(In reply to tommi.pernila from comment #47) I wish this really fixed the problem, but it doesn't. It did, however, reduce the frequency of occurrence.

nvme0: resetting controller
nvme0: aborting outstanding i/o
nvme0: WRITE sqid:8 cid:127 nsid:1 lba:2064 len:64
nvme0: ABORTED - BY REQUEST (00/07) sqid:8: cid:127 cdw0:0

This is from a cold boot. Room temperature is about 24 C, so I doubt this is heat-related. loader.conf.local contains:

kern.cam.boot.delay=10000
kern.cam.scsi.delay=10000
vfs.root.mountfrom="ufs:/dev/da0p2"
hw.pci.enable_msi="0"
(In reply to tommi.pernila from comment #47) I found that the same hardware (exact same NVMe card, not just same model) works fine when moved to a Dell PowerEdge R710, even though it shows the problem in a Supermicro X8DTH-iF (see comment 21). I also found that although FreeBSD has problems with it on that Supermicro board, some random Linux distro does not (see comment #18). This makes me think that it is something we're not handling properly, either timing-related or motherboard chipset-related.
A commit references this bug:

Author: imp
Date: Fri Mar 16 05:23:49 UTC 2018
New revision: 331046
URL: https://svnweb.freebsd.org/changeset/base/331046

Log:
Try polling the qpairs on timeout.

On some systems, we're getting timeouts when we use multiple queues on drives that work perfectly well on other systems. On a hunch, Jim Harris suggested I poll the completion queue when we get a timeout. This patch polls the completion queue if no fatal status was indicated. If it had pending I/O, we complete that request and return. Otherwise, if aborts are enabled and no fatal status, we abort the command and return. Otherwise we reset the card. This may clear up the problem, or we may see it result in lots of timeouts and a performance problem. Either way, we'll know the next step. We may also need to pay attention to the fatal status bit of the controller.

PR: 211713
Suggested by: Jim Harris
Sponsored by: Netflix

Changes:
head/sys/dev/nvme/nvme_private.h
head/sys/dev/nvme/nvme_qpair.c
You might try hw.nvme.enable_aborts=1 in loader.conf. This will enable aborting the command on timeouts when there's no fatal error indicated, which might help. Also, r331046 has a workaround suggested by Jim Harris. If there's no fatal error signaled, we'll poll the completion queue. If that works, we move on (with a loud printf; there will likely still be a performance issue, but we'll see it). If not, and no fatal error is signaled and aborts are enabled, we'll abort the command. Otherwise we'll reset the card (the current behavior). I could never recreate this problem, despite buying the exact card (I think) that others have reported as being bad. So if you can reproduce this problem, please try r331046 or later and let me know whether that helps.
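The decision ladder in r331046 can be sketched as a toy model. This is illustrative pseudocode only, with hypothetical names (`handle_timeout`, `FakeQpair`); the real logic lives in C in sys/dev/nvme/nvme_qpair.c:

```python
# Toy model of the r331046 timeout strategy: on an I/O timeout, first poll
# the completion queue; only escalate to abort or reset if nothing completed.
# All names here are illustrative, not taken from the actual driver.

def handle_timeout(qpair, csts_fatal, aborts_enabled):
    """Return the recovery action taken for a timed-out command."""
    if csts_fatal:                    # controller signals fatal status: reset
        return "reset"
    if qpair.poll_completions() > 0:  # missed interrupt: completions were pending
        return "completed"
    if aborts_enabled:                # hw.nvme.enable_aborts=1
        return "abort"
    return "reset"                    # pre-r331046 behavior

class FakeQpair:
    def __init__(self, pending_completions):
        self.pending = pending_completions
    def poll_completions(self):
        n, self.pending = self.pending, 0
        return n

print(handle_timeout(FakeQpair(3), False, True))   # completed (lost interrupt case)
print(handle_timeout(FakeQpair(0), False, True))   # abort
print(handle_timeout(FakeQpair(0), False, False))  # reset
print(handle_timeout(FakeQpair(5), True, True))    # reset (fatal status wins)
```

The interesting case is the first one: if polling drains pending completions, a reset was never needed, which is exactly the lost-interrupt hypothesis discussed above.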
If the timeout 'fixes' the issue, Jim thinks it might mean that we have a MSIX interrupt mapping issue, or similar, to track down. Either by the driver making bad assumptions, it getting fed bad data, or some issue in the msix code. I'm skeptical, but we'll know after the retesting.
(In reply to Warner Losh from comment #52) The system I was using to test this has been off and in storage - when I booted it just now to update it, it said it was 11.1-PRERELEASE 8-}. I'm in the process of updating it to 11-STABLE and will try the Samsung NVMe device again with this patch (assuming I can apply it to 11-STABLE). Right now the box has an Intel Optane NVMe drive in it, so I can also test for regressions with a known-working module. If nobody else tries this and reports back in a few days, ping me to make sure I'm still working on testing. Thanks!
Created attachment 191543 [details] Log of failed patch application on 11-STABLE
(In reply to Terry Kennedy from comment #54) The previous comment seems to be missing my comment text... I tried applying the patch to 11-STABLE (r331049) and it didn't apply cleanly. Before I dig into this, would be possible to get a version for 11?
Exact same issue when trying to install FreeBSD on a brand new Lenovo E480. I tried `FreeBSD-12.0-CURRENT-amd64-20180322-r331345` and `FreeBSD-11.1-STABLE-amd64-20180322-r331337`. Both failed, continuously logging nvme0 failures. Help welcome! I can provide more information if needed.
[update]: trying to install FreeBSD-12.0-CURRENT-amd64-20180329-r331740 in 'normal' mode on a Lenovo E480 with Samsung SSD MZVLW256HEHP-000L7. Output: `nvme0: missing interrupt` many times, then the graphical installer displays. I selected Auto (ZFS) install on nvd0. During installation, I still saw `nvme0: missing interrupt` messages a few more times. Then installation failed with an error in the graphical window: `Error: gpart provider: Device not configured`. Booting in 'safe mode' ended with the same gpart error. Hope it helps!
Following my comment #57, here is more debug info in another context with the same hardware: I am able to boot TrueOS-Desktop-201803131015 with `hw.nvme.per_cpu_io_queues="0"` set in /boot/loader.conf. Everything works well, BUT a fatal error comes when trying to resume from S3 suspend mode. I see kernel messages ending with:
```
(…) kernel: WARN_ON(…stripped…) CSR SSP Base Not fine
(…) kernel: CSR HTP Not fine
(…) kernel: WARN_ON(…stripped…) Clearing unexpected auxiliary request for power well 2
```
then:
```
nvme0: resetting controller
nvme0: controller ready did not become 0 within 30000 ms
nvme0: failing queued i/o
nvme0: READ sqid:1 cid:0 nsid: 1 lba:324015968 len:20
nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:0 cdw0:0
```
and similar errors repeated a dozen times, then the fatal:
```
nvd0: lost device - 0 outstanding
nvd0: removing device entry
nvme0: WRITE sqid:1 cid:0 nsid:1 lba:4416948 len:48
nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:0 cdw0:0

Fatal trap 12: page fault while in kernel mode
cpuid = 4; apic id = 04
fault virtual address   = 0x8
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80a3b141
stack pointer           = 0x28:0xfffffe0000545820
frame pointer           = 0x28:0xfffffe0000545860
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (nvme taskq)
[ thread pid 0 tid 100077 ]
stopped at g_disk_done+0xc1: movq 0x8(%rax),%rdi
db>
```
```
nvme0: resetting controller
nvme0: controller ready did not become 0 within 30000 ms
nvme0: failing queued i/o
nvme0: READ sqid:1 cid:0 nsid: 1 lba:324015968 len:20
nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:0 cdw0:0
```
I can observe the same on my host on resume; everything works except resuming. Sometimes it manages to reset the bloody controller after 5-30 seconds, and then it works properly.
Here's actually from my system; it had woken up successfully this morning.

nvme0: Resetting controller due to a timeout.
nvme0: Resetting controller due to a timeout.
nvme0: resetting controller
nvme0: Resetting controller due to a timeout.
nvme0: aborting outstanding i/o
nvme0: WRITE sqid:1 cid:124 nsid:1 lba:302574390 len:72
nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:124 cdw0:0
nvme0: aborting outstanding i/o
nvme0: WRITE sqid:1 cid:127 nsid:1 lba:308798193 len:8
nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:127 cdw0:0
nvme0: aborting outstanding i/o
nvme0: READ sqid:1 cid:90 nsid:1 lba:146704418 len:5
nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:90 cdw0:0
nvme0: aborting outstanding i/o
nvme0: READ sqid:2 cid:126 nsid:1 lba:436423099 len:54
nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:126 cdw0:0
nvme0: aborting outstanding i/o
nvme0: READ sqid:3 cid:125 nsid:1 lba:785815849 len:14
nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:125 cdw0:0
nvme0: aborting outstanding i/o
nvme0: READ sqid:3 cid:75 nsid:1 lba:859171570 len:2
nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:75 cdw0:0
nvme0: aborting outstanding i/o
nvme0: WRITE sqid:4 cid:100 nsid:1 lba:306185450 len:2
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:100 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:79 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:79 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:119 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:119 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:118 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:118 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:95 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:95 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:117 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:117 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:101 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:101 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:109 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:109 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:75 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:75 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:107 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:107 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:123 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:123 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:110 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:110 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:93 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:93 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:115 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:115 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:98 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:98 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:72 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:72 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:65 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:65 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:111 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:111 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:108 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:108 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:74 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:74 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:92 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:92 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:87 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:87 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:96 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:96 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:94 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:94 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:77 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:77 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:104 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:104 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:113 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:113 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:66 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:66 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:120 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:120 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:67 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:67 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:71 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:71 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:88 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:88 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:106 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:106 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:116 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:116 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:121 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:121 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:126 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:126 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:84 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:84 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:70 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:70 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:76 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:76 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:99 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:99 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:124 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:124 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:69 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:69 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:91 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:91 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:81 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:81 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:103 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:103 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:114 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:114 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:89 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:89 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:127 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:127 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:85 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:85 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:125 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:125 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:73 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:73 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:83 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:83 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:68 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:68 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:86 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:86 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:82 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:82 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:102 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:102 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:78 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:78 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:122 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:122 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:90 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:90 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:112 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:112 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:105 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:105 cdw0:0
nvme0: aborting outstanding i/o
nvme0: DATASET MANAGEMENT sqid:4 cid:80 nsid:1
nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:80 cdw0:0
I've recently hit the `nvme0: Missing interrupt` error on 10.3 and 12.0-CURRENT. I discovered a workaround that may help lead to a proper fix, though I don't know for certain whether the missing interrupt is related to the resetting controller message. If I have nvme_load="YES" and nvd_load="YES" in /boot/loader.conf, I get the missing interrupt every time, and the more the device gets used, the more of the messages show up. Each time that message is shown, it seems the read/write operation fails, as the end result is corruption. (AND oddly, half the time my Intel ix card doesn't work properly when this happens: it will spit out `ix0: TX(0) desc avail = 34, pidx = 87`, and link status stays "no carrier".) If I load nvme/nvd AFTER the system finishes booting, it behaves normally and doesn't affect ix. So it seems loading nvme/nvd early in the boot process causes some kind of interrupt conflict with other driver(s).
Same problem on FreeBSD 11.2-RC3 on a ThinkPad T480. After resuming from suspend the system is unusable for 10-30 seconds, showing the following messages:

nvme0: Resetting controller due to a timeout.
nvme0: Resetting controller due to a timeout.
nvme0: resetting controller
nvme0: Resetting controller due to a timeout.
nvme0: aborting outstanding i/o
nvme0: WRITE sqid:1 cid:124 nsid:1 lba:302574390 len:72
.....
.....
Experiencing a similar suspend/resume issue with a Samsung NVMe PCIe 960 EVO. I am running FreeBSD 11.2. When it comes back from a suspend/resume cycle, I receive a long list of ABORTs on pending writes to the NVMe. I have also verified that after each such resume, the "unsafe shutdowns" count in the NVMe increments by 1. I was running the same NVMe with a Windows OS for some months, doing many suspend/resumes, and that count had not incremented, so I do not believe it is an issue with the NVMe but with FreeBSD. The unsafe shutdown count does NOT increment when FreeBSD shuts down (e.g. shutdown -p now). I can simulate a similar list of pending I/O actions in the queue by using nvmecontrol to send a reset to the NVMe device, but in that circumstance it repopulates the queue instead of aborting it. Also note that it is very common to find newly missing data fragments in the NVMe partition when running fsck after the aborted I/O queue is reported. I SUSPECT THAT AN ABILITY TO AT LEAST SEND a FLUSH COMMAND to the NVMe would allow us to avoid the lost/corrupted data by putting such an action in /etc/rc.suspend, but I have not discovered a way to send that flush command. Using nvmecontrol to set a very low power level on the NVMe during rc.suspend does not prevent the behavior. Possibly a "shutdown" command sent to the NVMe during suspend would give the same result. nvmecontrol does not seem to expose flush or shutdown functionality. FreshPorts appears to have an nvme-cli port that has FLUSH, but it is flagged as broken for 11.2. I have not attempted to update/test under FreeBSD 12.
'sync' will force all the dirty buffers to be scheduled in the nvme controller and won't return until they are complete. No other 'flush' operation is needed; the problem is that we suspend while we still have pending I/O in the nvme controller. That might need to be attended to, but it isn't currently. But a suspend/resume bug is very different from this bug. Please file a new bug to track that. This bug is: during normal operations, something bad happens, we stop being able to talk to the NVMe drive, and error recovery is insufficient to cope.
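If anyone wants to experiment along these lines, a minimal sketch of what could go into /etc/rc.suspend (purely illustrative, and by the reasoning above it only narrows the window rather than closing it, since I/O issued after the sync but before suspend is still exposed):

```
#!/bin/sh
# Illustrative /etc/rc.suspend fragment (assumption: this runs just
# before the ACPI suspend is actually triggered).  sync(8) schedules
# all dirty buffers to the controller before we power down.
sync
sync
sync
```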
(In reply to Warner Losh from comment #64) I have attempted to sync in many combinations during suspend, and it doesn't change the behavior. The flush I am referring to is a command defined in the NVMe spec that forces the NVMe device to flush its internal buffer.
The drive should likely be properly shut down before suspend/resume. I agree; that's a different bug. There's code to do this on shutdown. The FLUSH command won't help, because it has to be integrated into the driver to be useful (since I/O can happen after the sync but before things suspend). The errors people are seeing from pending commands, however, are a different issue. Both of these are different issues from this bug. This isn't an omnibus NVMe error bug.
Created attachment 199031 [details] dmesg output
I'm testing FreeBSD 12.0-BETA3 r340039 GENERIC, and I have a PM961 PCIe NVMe M.2 1TB drive that came with my Lenovo ThinkPad P50 (P/N: MZSLW1T0HMLH-000L1, produced Oct 2016). That drive is recognized by FreeBSD 12, but is not usable whatsoever (can't read/write to it). I've used this drive with Debian testing since 2016 without trouble on my ThinkPad P50. I installed FreeBSD 12 on an internal 2TB HDD in the ThinkPad in order to test FreeBSD, but the PM961 continued to cause boot delays -- I would see "nvme0: Missing interrupt" messages until the system finally gave up and continued with the boot process. I attempted to install FreeBSD 11 on the 2TB HDD, but the install failed when it had trouble recognizing the nvme drive. Initially I thought the missing-interrupt problem with FreeBSD was caused by the LUKS encryption on the nvme drive, because I was dual booting and had not reformatted that drive yet. So I purchased another Samsung NVMe SSD 960 PRO M.2 1TB drive (P/N: MZVKP1T0HMJP), and that drive works with FreeBSD 12. The new nvme was installed in the ThinkPad along with the original nvme and the HDD. The 2TB HDD and the new 1TB nvme drive are dedicated to FreeBSD using ZFS. I attempted to create a ZFS mirror using the two nvme drives; FreeBSD successfully wrote to the original nvme drive (it overwrote my Linux partitions), but the overall `zfs_create_diskpart` process failed and I had to start over using only the new nvme drive, which worked. I eventually removed the original nvme drive from my laptop because of the constant missing-interrupt delays. However, after removing the original nvme drive, while installing a virtual machine in VirtualBox on my new nvme, my laptop went into (what seemed to be) ACPI S3 suspend, and after I woke the machine the laptop rebooted itself. Thinking the problem was VirtualBox, I removed that software and set up bhyve instead.
During a virtual machine install in bhyve, the laptop went into an S3-style suspend again, and this time when I woke the machine I noticed the nvme0 resetting-controller, write, read, and aborted-by-request messages in `dmesg` (output attached above). For the most part, the new nvme device seems stable with FreeBSD 12. I haven't tested it with FreeBSD 11. I don't know if KDE's Baloo service crashing and creating a 256GB core dump every single time I log in is part of the problem with this drive. Today I disabled Baloo file indexing and installed another virtual machine using bhyve, and the system hasn't reported any problems with the nvme. I also used `dd` to create some 10GB and 100GB files using input from /dev/urandom, and that hasn't caused any issues so far. Lastly, cold boots on the new nvme (without the old nvme installed in the laptop) are normal. However, reboots can take literally 2 minutes to complete. This includes an extended delay on the BIOS screen before reaching the GELI password prompt, a delay after loading the kernel before moving on to the ---<<BOOT>>--- screen, and a sluggish boot process until finally reaching the login prompt. I've never experienced this with Debian testing, and I suspect the FreeBSD nvme driver is leaving the system in a weird state. IIRC, setting hw.nvme.enable_aborts=1 while the original nvme drive is still in the laptop causes a kernel panic while booting. I haven't tried setting hw.nvme.per_cpu_io_queues=0 since the system is usable and not completely unstable.
Hardware details:

# nvmecontrol devlist
 nvme0: SAMSUNG MZSLW1T0HMLH-000L1
    nvme0ns1 (976762MB)
 nvme1: Samsung SSD 960 PRO 1TB
    nvme1ns1 (976762MB)

# pciconf -lbace nvme0
nvme0@pci0:2:0:0:	class=0x010802 card=0xa801144d chip=0xa804144d rev=0x00 hdr=0x00
    bar   [10] = type Memory, range 64, base 0xd4400000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 32 messages, 64 bit
    cap 10[70] = PCI-Express 2 endpoint max data 256(256) FLR RO NS link x4(x4) speed 8.0(8.0) ASPM L1(L1)
    cap 11[b0] = MSI-X supports 33 messages, enabled
                 Table in map 0x10[0x3000], PBA in map 0x10[0x2000]
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[148] = Serial 1 0000000000000000
    ecap 0004[158] = Power Budgeting 1
    ecap 0019[168] = PCIe Sec 1 lane errors 0
    ecap 0018[188] = LTR 1
    ecap 001e[190] = unknown 1
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Advisory Non-Fatal Error

# pciconf -lbace nvme1
nvme1@pci0:62:0:0:	class=0x010802 card=0xa801144d chip=0xa804144d rev=0x00 hdr=0x00
    bar   [10] = type Memory, range 64, base 0xd4200000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 32 messages, 64 bit
    cap 10[70] = PCI-Express 2 endpoint max data 256(256) FLR RO NS link x4(x4) speed 8.0(8.0) ASPM L1(L1)
    cap 11[b0] = MSI-X supports 8 messages, enabled
                 Table in map 0x10[0x3000], PBA in map 0x10[0x2000]
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[148] = Serial 1 0000000000000000
    ecap 0004[158] = Power Budgeting 1
    ecap 0019[168] = PCIe Sec 1 lane errors 0
    ecap 0018[188] = LTR 1
    ecap 001e[190] = unknown 1
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Advisory Non-Fatal Error

# diskinfo -t /dev/nvme0ns1
/dev/nvme0ns1
	512         	# sectorsize
	1024209543168	# mediasize in bytes (954G)
	2000409264  	# mediasize in sectors
	0           	# stripesize
	0           	# stripeoffset
	No          	# TRIM/UNMAP support
	Unknown     	# Rotation rate in RPM

Seek times:
	Full stroke:	^C
Nov  6 18:58:09 fenixbsd kernel: nvme0: Missing interrupt
Nov  6 18:58:39 fenixbsd syslogd: last message repeated 1 times

# diskinfo -t /dev/nvme1ns1
/dev/nvme1ns1
	512         	# sectorsize
	1024209543168	# mediasize in bytes (954G)
	2000409264  	# mediasize in sectors
	0           	# stripesize
	0           	# stripeoffset
	No          	# TRIM/UNMAP support
	Unknown     	# Rotation rate in RPM

Seek times:
	Full stroke:	250 iter in   0.011499 sec =    0.046 msec
	Half stroke:	250 iter in   0.010018 sec =    0.040 msec
	Quarter stroke:	500 iter in   0.015302 sec =    0.031 msec
	Short forward:	400 iter in   0.013087 sec =    0.033 msec
	Short backward:	400 iter in   0.012144 sec =    0.030 msec
	Seq outer:	2048 iter in   0.041548 sec =    0.020 msec
	Seq inner:	2048 iter in   0.042294 sec =    0.021 msec

Transfer rates:
	outside:       102400 kbytes in   0.066412 sec =  1541890 kbytes/sec
	middle:        102400 kbytes in   0.064908 sec =  1577618 kbytes/sec
	inside:        102400 kbytes in   0.064534 sec =  1586760 kbytes/sec
I can confirm this bug rears its head on a Dell R510 running FreeNAS 11.2 (which follows FreeBSD 11.2) if I use a Synology M2D18 (https://www.synology.com/en-us/products/M2D18) dual-NVMe PCIe card. I cannot get diskinfo to run cleanly without the target going out to lunch. Setting hw.nvme.enable_aborts=1 via a loader.conf tunable (and rebooting) had no impact in my case. Disabling or enabling hyperthreading does not seem to affect the hang, although having hyperthreading enabled seems to prevent MSI-X from being used. If I take the NVMe devices out of the Synology dual card and put them into single NVMe-to-PCIe adapters, I do not (yet) seem to have the problem. Happy to provide whatever details might be useful in making progress, but I figured the additional info might be sufficiently useful without superfluous noise.
I'm having the same reset controller issue on 11.2-RELEASE with an SM961. I tried it on 2 different SuperMicro systems: one very new system with a mobo-based m.2 slot, and one older system with a PCIe m.2 adapter. Let me know if I can be of assistance with troubleshooting. Would really love to be able to use this hardware.
Created attachment 205429 [details] A patch trying to fix the missing interrupt issue on SM961. A patch trying to fix the missing interrupt issue on SM961.
(In reply to Ka Ho Ng from comment #71) For anyone being affected by this bug can you try whether the patch works for you?
(In reply to Ka Ho Ng from comment #72) One more to add, the patch is to be applied on FreeBSD 12.0-RELEASE, but trying this on FreeBSD 11 should also be trivial.
(In reply to Ka Ho Ng from comment #72) wait. please wait for the next revision...
(In reply to Ka Ho Ng from comment #71) This patch is only a workaround to the issue with cpu thread number <= 8 with its own issues. It is not a fix so don't try it out.
Created attachment 205541 [details] Fix SM961 issue For people using FreeBSD 11.3 or FreeBSD 12.0 please try if this patch fixes the issue instead.
Why do you need to change pci_mask_msix and pci_unmask_msix? Surely that can't be right? The nmve patches look good, I think, but that one seems like a non-starter to do unconditionally.
(In reply to Warner Losh from comment #77) The commit message of the patch is inside this commit: https://github.com/khng300/freebsd/commit/c75f08495fde5dee08e4b24f399f2d70a77254a6 Put simply, some controllers return zeroes for MMIO reads of certain regions, which leads to MSI-X never being enabled: the interrupts are first masked, and that takes effect, but the subsequent unmask does not work at all. As a result, the corresponding bit in the PBA is set by the controller, since the interrupt is never actually re-enabled after being masked (recall that reads of the vector control word always return zero). The modification takes its cue from the interrupt-unmask implementation in the illumos kernel: instead of considering the existing content of the vector control bit, it simply overwrites the word.
(In reply to Warner Losh from comment #77) The NVMe patch was a mistake on my part: I thought the corresponding feature was 1-based, when in fact it is zero-based. The resulting behavior would be that the number of I/O queues is the number of CPUs minus one, which is undesirable. Although this sort of fixes the behavior on machines with 8 or fewer CPU threads, it still causes trouble if the CPU count is greater than 8.
(In reply to Ka Ho Ng from comment #76) I just rebuilt the installer with a kernel including your patch. That's amazing! It works! Thank you for your work :)
My only potential concern with this patch is that in my original testing, I found that the NVMe drive worked on some systems and not others (under FreeBSD; under Linux I could not get it to fail anywhere). Is it possible that we're seeing a difference in the way the BIOS sets things up? If so, is the proposed patch the way to go, or should we do further diagnosis to see if we can find what the actual BIOS initialization differences are? OTOH, if nobody else thinks there are issues with the patch, good to go. I'm just concerned about changing this behavior globally as opposed to just on the NVMe device.
Hello World :-) The same problem is still here in 12.0-RELEASE amd64 on a Panasonic Toughbook CF-MX4 with an M.2 SSD SAMSUNG MZNTE256HMP (model MZ-NTE2560)! I cannot install or use FreeBSD on this machine. I constantly get lots of: CAM status: Uncorrectable parity/CRC error / Retrying command, N more tries remain / SEND_FPDMA_QUEUED / DATA SET MANAGEMENT. ACB 61 04 00 .... The worst thing is that everything works fine on Linux and Windoze :-(
(In reply to Tomasz "CeDeROM" CEDRO from comment #82) Well, it seems that this SSD is using AHCI, so it may be unrelated to this ticket.
A commit references this bug: Author: imp Date: Thu Sep 5 23:54:45 UTC 2019 New revision: 351915 URL: https://svnweb.freebsd.org/changeset/base/351915 Log: MFC r349845: Work around devices which return all zeros for reads of existing MSI-X table VCTRL registers. Note: This is confirmed to fix the nvme lost interrupt issues, seen on both virtual and real cards. PR: 211713 Changes: _U stable/12/ stable/12/sys/dev/pci/pci.c
A commit references this bug: Author: imp Date: Fri Sep 6 00:06:55 UTC 2019 New revision: 351917 URL: https://svnweb.freebsd.org/changeset/base/351917 Log: MFC r349845: Work around devices which return all zeros for reads of existing MSI-X table VCTRL registers. PR: 211713 Changes: _U stable/11/ stable/11/sys/dev/pci/pci.c
^Triage: Assign to committer resolving @Warner Is this now resolved given base r349845 and subsequent merges to stable/12,11 ? If so, please close FIXED
I can confirm that the problem reported with the M.2 SSD SAMSUNG MZNTE256HMP (model MZ-NTE2560) was caused by a lack of TRIM support on that particular drive. Disabling TRIM solved the problem. I have replaced it with a Samsung SSD 860 EVO M.2 2TB (RVT22B6Q) and all works fine even with TRIM enabled. It was unrelated to this controller problem. Thank you for the hint! :-)
I'm experiencing what seems to be this same exact bug, however I'm using Western Digital Black NVMe M.2 SSDs, which are of course very similar to the Samsung. The exact models are WD_BLACK SN770 250GB & 2TB, firmware version 731030WD. This is on TrueNAS-12.0-U8 / FreeBSD 12.2-RELEASE-p12. I set hw.nvme.per_cpu_io_queues=0, and it did not fix the problem; in fact it seems to have made it much more frequent, although I'm not 100% sure about that and need to test again. I also tried switching drivers with hw.nvme.use_nvd=0, which doesn't seem to make a difference, although it produced slightly different results in the log when the issue happened again. See logs below; I would be grateful if somebody can help with this problem.

Mar 29 21:42:25 truenas nvme5: Resetting controller due to a timeout and possible hot unplug.
Mar 29 21:42:25 truenas nvme5: resetting controller
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:12 cid:120 nsid:1 lba:1497544880 len:16
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:12 cid:120 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:12 cid:123 nsid:1 lba:198272936 len:16
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:12 cid:123 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:13 cid:121 nsid:1 lba:431014528 len:24
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:13 cid:121 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:15 cid:127 nsid:1 lba:864636432 len:8
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:15 cid:127 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:16 cid:126 nsid:1 lba:2445612184 len:8
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:16 cid:126 cdw0:0
Mar
29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:16 cid:120 nsid:1 lba:430503600 len:8
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:16 cid:120 cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:18 cid:123 nsid:1 lba:1499051024 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:18 cid:123 cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: WRITE sqid:18 cid:124 nsid:1 lba:1990077368 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:18 cid:124 cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: READ sqid:19 cid:122 nsid:1 lba:1237765696 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:19 cid:122 cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: READ sqid:19 cid:125 nsid:1 lba:180758264 len:16
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:19 cid:125 cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: READ sqid:20 cid:121 nsid:1 lba:2445612192 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:20 cid:121 cdw0:0
Mar 29 21:42:26 truenas nvd5: detached
nvme3: Resetting controller due to a timeout and possible hot unplug.
nvme3: resetting controller
nvme3: failing outstanding i/o
nvme3: READ sqid:7 cid:127 nsid:1 lba:419546528 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:7 cid:127 cdw0:0
nvme3: (nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=1901c5a0 0 7 0 0 0
failing outstanding i/o
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
nvme3: READ sqid:11 cid:127 nsid:1 lba:782841288 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:11 cid:127 cdw0:0
nvme3: (nda3:nvme3:0:0:1): READ.
NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=2ea935c8 0 7 0 0 0
failing outstanding i/o
nvme3: READ sqid:11 cid:123 nsid:1 lba:704576056 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:11 cid:123 cdw0:0
nvme3: failing outstanding i/o
nvme3: WRITE sqid:12 cid:127 nsid:1 lba:1016402352 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:12 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:12 cid:125 nsid:1 lba:1824854760 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:12 cid:125 cdw0:0
nvme3: failing outstanding i/o
nvme3: (nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
WRITE sqid:13 cid:124 nsid:1 lba:1008638008 len:64
nvme3: ABORTED - BY REQUEST (00/07) sqid:13 cid:124 cdw0:0
nvme3: failing outstanding i/o
nvme3: WRITE sqid:13 cid:125 nsid:1 lba:1008638152 len:56
nvme3: ABORTED - BY REQUEST (00/07) sqid:13 cid:125 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:15 cid:127 nsid:1 lba:783188688 len:8
nvme3: (nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=29fefa38 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c9511b0 0 7 0 0 0
ABORTED - BY REQUEST (00/07) sqid:15 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: WRITE sqid:15 cid:123 nsid:1 lba:1008553080 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:15 cid:123 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:16 cid:124 nsid:1 lba:147012776 len:8
nvme3: (nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=6cc512e8 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE.
NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c1e9838 0 3f 0 0 0
ABORTED - BY REQUEST (00/07) sqid:16 cid:124 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:16 cid:127 nsid:1 lba:2881895592 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:16 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:17 cid:127 nsid:1 lba:2574392744 len:16
nvme3: ABORTED - BY REQUEST (00/07) sqid:17 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:18 cid:126 nsid:1 lba:155895056 len:8
nvme3: (nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c1e98c8 0 37 0 0 0
ABORTED - BY REQUEST (00/07) sqid:18 cid:126 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:19 cid:125 nsid:1 lba:151377120 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:19 cid:125 cdw0:0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=2eae82d0 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c1d4c78 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=8c33ca8 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=abc63ca8 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ.
NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=99721da8 0 f 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=94ac510 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=905d4e0 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
nda3 at nvme3 bus 0 scbus13 target 0 lun 1
nda3: <WD_BLACK SN770 2TB 731030WD 21513C800057> s/n 21513C800057 detached
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
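For anyone else trying the knobs mentioned in the comment above, both are loader tunables (set them in /boot/loader.conf and reboot); whether they help clearly varies by controller, so treat this as a diagnostic sketch rather than a fix:

```
# /boot/loader.conf -- diagnostic tunables mentioned in this thread
hw.nvme.per_cpu_io_queues=0   # one shared I/O queue pair instead of one per CPU
hw.nvme.use_nvd=0             # attach disks via CAM/nda(4) instead of nvd(4)
```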
(In reply to Ian Brennan from comment #88) Any chance you can boot 13.1-RC1 on this box? I'm pretty sure that 12.2 hasn't been updated with my changes. My changes were committed last summer (end of July), and 12.2 was released in 2020; 12.3 is the first release that has them.
(In reply to Warner Losh from comment #89) Thank you so much for the reply! I just upgraded to 13.0-STABLE; I assume this version has the fix in it? However, so far the same problem exists, though with slightly updated wording in the logs (see below). Just before I upgraded, the problem got significantly worse: it now happens every time on the same SSD at boot, although the system continues booting and seems stable, just missing the one SSD. Maybe this is because the disk is very full at the moment.

nvme4: RECOVERY_START 168425279147 vs 166453916285
nvme4: Controller in fatal status, resetting
nvme4: Resetting controller due to a timeout and possible hot unplug.
nvme4: RECOVERY_WAITING
nvme4: resetting controller
nvme4: failing outstanding i/o
nvme4: READ sqid:18 cid:127 nsid:1 lba:32 len:224
nvme4: ABORTED - BY REQUEST (00/07) sqid:18 cid:127 cdw0:0
nvme4: failing outstanding i/o
nvme4: READ sqid:18 cid:126 nsid:1 lba:544 len:224
nvme4: ABORTED - BY REQUEST (00/07) sqid:18 cid:126 cdw0:0
nvme4: failing outstanding i/o
nvme4: READ sqid:18 cid:125 nsid:1 lba:3907028000 len:224
nvme4: ABORTED - BY REQUEST (00/07) sqid:18 cid:125 cdw0:0
nvme4: failing outstanding i/o
nvme4: READ sqid:18 cid:124 nsid:1 lba:3907028512 len:224
nvme4: ABORTED - BY REQUEST (00/07) sqid:18 cid:124 cdw0:0
nvd4: detached
(In reply to Ian Brennan from comment #90) Can you open a new bug? Your failure is a little different from this bug, and it would be better to have it in a new bug so it doesn't get lost in the long history of this one. The controllers never indicated a fatal state with this bug, for example... I'd like to fix this, and I hate to redirect you like this, but it would help me track it more easily. Thanks
I just created bug ID 262969 thx for your help
Same problem over here... it works perfectly on Windows on the same hardware :~( WD SN750 512GB. Any tips would be really appreciated. Should the bug be reopened?

...
nvme0: RECOVERY_START 756963034121 vs 755216752641
nvme0: Controller in fatal status, resetting
nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: RECOVERY_WAITING
nvme0: resetting controller
nvme0: failing outstanding i/o
nvme0: READ sqid:1 cid:124 nsid:1 lba:0 len:1024
nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:124 cdw0:0
nvme0: failing outstanding i/o
nvme0: READ sqid:1 cid:125 nsid:1 lba:1024 len:1024
nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:125 cdw0:0
nvd0: detached
nvme0: waiting

# freebsd-version
13.1-RELEASE-p7

Marcus