This issue occurs on both 12.0-RELEASE-p5 and 13-CURRENT r347415 (snapshot from 10 May 2019).

Reproducing it is very easy:

1. Set up a bunch of multipath SAS disks using gmultipath (I am using 100 SAS disks with two 9300-8e SAS HBAs, but the issue occurs even with a single HBA).
2. Create a ZFS pool based on these (zpool create bench mirror multipath/.. ...).
3. Run some I/O on the pool: iozone -a
4. Physically pull one of the paths, wait, and plug it back in. Confirm via gmultipath that the path is OPTIMAL. Then repeat with the other port.

Eventually I/O freezes (there are a bunch of "scsi ioc terminated" and "CAM status: CCB request completed with an error" messages), and eventually the system panics. This is 100% reproducible on my setup. I can also provide SSH/IPMI access if needed.
Created attachment 204415
Photo of kernel panic
Do you need more information? I can reproduce this panic very easily.
Thanks for the bug report. I'm talking to another person who is seeing something similar. I'll work on reproducing the problem and see if I can come up with a solution.
(In reply to Scott Long from comment #3) Thanks Scott, please let me know when you need testers. I have kit with 102 SAS HDDs and dual controllers at the ready!
This bug is also hit if bad devices are on the bus. Here is the message output when the bug is hit:

Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): Retrying command, 14 more tries remain
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): WRITE(16). CDB: 8a 00 00 00 00 03 a8 54 56 58 00 00 00 20 00 00
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): CAM status: CCB request completed with an error
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): Retrying command, 14 more tries remain
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): WRITE(16). CDB: 8a 00 00 00 00 03 a8 54 55 68 00 00 00 18 00 00
Jun 12 03:33:59 pavo kernel: Finished recovery after LUN reset for target 120
Jun 12 03:33:59 pavo kernel: mps1: More commands to abort for target 120
Jun 12 03:33:59 pavo kernel:
Jun 12 03:33:59 pavo syslogd: last message repeated 1 times
Jun 12 03:33:59 pavo kernel: Fatal trap 12: page fault while in kernel mode
Jun 12 03:33:59 pavo kernel: cpuid = 10; apic id = 0a
Jun 12 03:33:59 pavo kernel: fault virtual address = 0x0
Jun 12 03:33:59 pavo kernel: fault code = supervisor read data, page not present
Jun 12 03:33:59 pavo kernel: instruction pointer = 0x20:0xffffffff806f8421
Jun 12 03:33:59 pavo kernel: stack pointer = 0x28:0xfffffe00bab8f900
Jun 12 03:33:59 pavo kernel: frame pointer = 0x28:0xfffffe00bab8f940
Jun 12 03:33:59 pavo kernel: code segment = base rx0, limit 0xfffff, type 0x1b
Jun 12 03:33:59 pavo kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): CAM status: CCB request completed with an error
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): Retrying command, 14 more tries remain
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): WRITE(16). CDB: 8a 00 00 00 00 03 a8 54 56 08 00 00 00 20 00 00
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): CAM status: CCB request completed with an error
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): Retrying command, 14 more tries remain
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): WRITE(16). CDB: 8a 00 00 00 00 03 a8 54 55 d8 00 00 00 18 00 00
Jun 12 04:09:48 pavo syslogd: kernel boot file is /boot/kernel/kernel
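For anyone correlating the failing writes with on-disk locations, here is a rough sketch of decoding the WRITE(16) CDBs from the log above. It assumes the standard SCSI WRITE(16) layout (opcode 0x8a in byte 0, big-endian LBA in bytes 2-9, big-endian transfer length in bytes 10-13). The helper name decode_write16 is made up for this example and is not part of any FreeBSD tool. The transfer length is in logical blocks; the log does not state the sector size, so no byte count is computed.

```python
def decode_write16(cdb_hex: str) -> dict:
    """Decode a WRITE(16) CDB given as space-separated hex bytes.

    Hypothetical helper for illustration only; assumes the standard
    16-byte WRITE(16) layout.
    """
    cdb = bytes.fromhex(cdb_hex)  # fromhex skips the spaces between bytes
    if len(cdb) != 16:
        raise ValueError("WRITE(16) CDB must be exactly 16 bytes")
    if cdb[0] != 0x8A:
        raise ValueError("not a WRITE(16) CDB (opcode is not 0x8a)")
    return {
        # Bytes 2..9: logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:10], "big"),
        # Bytes 10..13: transfer length in logical blocks, big-endian.
        "blocks": int.from_bytes(cdb[10:14], "big"),
    }


if __name__ == "__main__":
    # The four CDBs reported for da106 in the log above.
    for line in (
        "8a 00 00 00 00 03 a8 54 56 58 00 00 00 20 00 00",
        "8a 00 00 00 00 03 a8 54 55 68 00 00 00 18 00 00",
        "8a 00 00 00 00 03 a8 54 56 08 00 00 00 20 00 00",
        "8a 00 00 00 00 03 a8 54 55 d8 00 00 00 18 00 00",
    ):
        info = decode_write16(line)
        print(f"lba={info['lba']:#x} blocks={info['blocks']}")
```

Decoded this way, the four retried writes all fall within a narrow LBA range on da106 (0x3a8545568 through 0x3a8545658), in chunks of 24 or 32 blocks.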
Ping. Any update, Scott?
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235559 I believe these two are related if not the same bug.