Bug 237937 - Removing path with mpr and geom_multipath causes kernel panic
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs mailing list
Keywords: panic
Reported: 2019-05-17 06:14 UTC by pascal.guitierrez
Modified: 2019-06-26 13:30 UTC
Attachments
Photo of kernel panic (195.66 KB, image/jpeg)
2019-05-17 06:26 UTC, pascal.guitierrez

Description pascal.guitierrez 2019-05-17 06:14:11 UTC
This issue occurs on both 12.0-RELEASE-p5 and 13-CURRENT r347415 (snap from 10th of May 2019).


Reproducing is very easy:
1. Set up a bunch of multipath SAS disks using gmultipath (I am using 100 SAS disks with two 9300-8e SAS HBAs, but the issue occurs even with a single HBA)
2. Create a ZFS pool based on these (zpool create bench mirror multipath/.. ...)
3. Run some I/O on the pool: iozone -a
4. Physically pull one of the paths, wait, then plug it back in. Confirm via gmultipath that the path is OPTIMAL, then repeat with the other port.
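
The "confirm that the path is OPTIMAL" check in step 4 can be scripted rather than eyeballed across 100 disks. The following is a minimal sketch, not part of the original report: it assumes that `gmultipath status` prints a name, a status and one `provider (STATE)` entry per path, and the sample text and device names below are made up for illustration.

```python
import re

# Hypothetical sample text. The column layout of `gmultipath status`
# is an assumption here, and the device names are invented.
SAMPLE = """\
           Name    Status  Components
multipath/disk0   OPTIMAL  da0 (ACTIVE)
                           da100 (PASSIVE)
multipath/disk1  DEGRADED  da1 (ACTIVE)
"""

# Matches one path entry such as "da0 (ACTIVE)".
COMPONENT = re.compile(r"(\S+) \((\w+)\)")


def parse_status(text):
    """Return {geom name: {"status": str, "paths": [(provider, state), ...]}}."""
    devices = {}
    current = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] == "Name":
            continue  # skip blank lines and the header row
        if tokens[0].startswith("multipath/"):
            current = tokens[0]
            status = tokens[1] if len(tokens) > 1 else "UNKNOWN"
            devices[current] = {"status": status, "paths": []}
        if current is not None:
            # Continuation lines carry additional paths of the same geom.
            devices[current]["paths"].extend(COMPONENT.findall(line))
    return devices


def not_optimal(devices):
    """Names of all geoms whose status is anything other than OPTIMAL."""
    return sorted(n for n, info in devices.items() if info["status"] != "OPTIMAL")


if __name__ == "__main__":
    print(not_optimal(parse_status(SAMPLE)))
```

On a real system the text would come from running `gmultipath status` (for example via `subprocess.run`) instead of `SAMPLE`. Waiting until `not_optimal` returns an empty list before pulling the next cable keeps the test cycle consistent.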

Eventually I/O freezes at the OS level (the log fills with "scsi ioc terminated" and "CAM status: CCB request completed with an error" messages), and eventually the system panics.


This is 100% reproducible on my setup; I can also provide SSH/IPMI access if needed.
Comment 1 pascal.guitierrez 2019-05-17 06:26:20 UTC
Created attachment 204415
Photo of kernel panic
Comment 2 pascal.guitierrez 2019-05-28 22:52:41 UTC
Do you need more information? I can reproduce this panic very easily.
Comment 3 Scott Long freebsd_committer 2019-05-29 19:38:02 UTC
Thanks for the bug report.  I'm talking to another person who is seeing something similar.  I'll work on reproducing the problem and see if I can come up with a solution.
Comment 4 pascal.guitierrez 2019-06-04 04:52:54 UTC
(In reply to Scott Long from comment #3)

Thanks Scott, please let me know when you need testers. I have kit ready with 102 SAS HDDs and dual controllers at the ready!
Comment 5 harrison 2019-06-12 13:15:25 UTC
This bug is also hit when bad devices are on the bus. Here is the message output from when the bug is hit:
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): Retrying command, 14 more tries remain
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): WRITE(16). CDB: 8a 00 00 00 00 03 a8 54 56 58 00 00 00 20 00 00 
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): CAM status: CCB request completed with an error
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): Retrying command, 14 more tries remain
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): WRITE(16). CDB: 8a 00 00 00 00 03 a8 54 55 68 00 00 00 18 00 00 
Jun 12 03:33:59 pavo kernel: Finished recovery after LUN reset for target 120
Jun 12 03:33:59 pavo kernel: mps1: More commands to abort for target 120
Jun 12 03:33:59 pavo kernel: 
Jun 12 03:33:59 pavo syslogd: last message repeated 1 times
Jun 12 03:33:59 pavo kernel: Fatal trap 12: page fault while in kernel mode
Jun 12 03:33:59 pavo kernel: cpuid = 10; apic id = 0a
Jun 12 03:33:59 pavo kernel: fault virtual address      = 0x0
Jun 12 03:33:59 pavo kernel: fault code         = supervisor read data, page not present
Jun 12 03:33:59 pavo kernel: instruction pointer        = 0x20:0xffffffff806f8421
Jun 12 03:33:59 pavo kernel: stack pointer              = 0x28:0xfffffe00bab8f900
Jun 12 03:33:59 pavo kernel: frame pointer              = 0x28:0xfffffe00bab8f940
Jun 12 03:33:59 pavo kernel: code segment               = base rx0, limit 0xfffff, type 0x1b
Jun 12 03:33:59 pavo kernel:                    = DPL 0, pres 1, long 1, def32 0, gran 1
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): CAM status: CCB request completed with an error
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): Retrying command, 14 more tries remain
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): WRITE(16). CDB: 8a 00 00 00 00 03 a8 54 56 08 00 00 00 20 00 00 
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): CAM status: CCB request completed with an error
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): Retrying command, 14 more tries remain
Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): WRITE(16). CDB: 8a 00 00 00 00 03 a8 54 55 d8 00 00 00 18 00 00 
Jun 12 04:09:48 pavo syslogd: kernel boot file is /boot/kernel/kernel
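
For anyone comparing logs across machines, the retried commands above can be decoded mechanically. This is a small sketch, not something from the original comment: it assumes the standard WRITE(16) CDB layout (opcode 0x8a, bytes 2-9 the logical block address, bytes 10-13 the transfer length in blocks, both big-endian) and reads the CAM path prefix as periph:sim:bus:target:lun. The function name is hypothetical.

```python
import re

# Matches e.g. "(da106:mps1:0:120:0): WRITE(16). CDB: 8a 00 ... 00"
LINE = re.compile(
    r"\((?P<periph>[a-z]+\d+):(?P<sim>[a-z]+\d+):"
    r"(?P<bus>\d+):(?P<target>\d+):(?P<lun>[0-9a-f]+)\): "
    r"WRITE\(16\)\. CDB: (?P<cdb>(?:[0-9a-f]{2} ?){16})"
)


def decode_write16(line):
    """Decode one WRITE(16) log line; return None if the line does not match."""
    m = LINE.search(line)
    if m is None:
        return None
    cdb = bytes.fromhex(m.group("cdb"))  # fromhex ignores the separating spaces
    if cdb[0] != 0x8A:  # 0x8a is the WRITE(16) opcode
        return None
    return {
        "periph": m.group("periph"),
        "sim": m.group("sim"),
        "target": int(m.group("target")),
        "lun": int(m.group("lun"), 16),
        "lba": int.from_bytes(cdb[2:10], "big"),      # bytes 2-9
        "blocks": int.from_bytes(cdb[10:14], "big"),  # bytes 10-13
    }


if __name__ == "__main__":
    sample = ("Jun 12 03:33:59 pavo kernel: (da106:mps1:0:120:0): WRITE(16). "
              "CDB: 8a 00 00 00 00 03 a8 54 56 58 00 00 00 20 00 00 ")
    info = decode_write16(sample)
    print(info["periph"], hex(info["lba"]), info["blocks"])
```

Grouping the decoded entries by `target` shows quickly whether the retries are confined to a single device (target 120 in the log above) or spread across the bus.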
Comment 6 pascal.guitierrez 2019-06-17 22:41:06 UTC
Ping... any update, Scott?
Comment 7 harrison 2019-06-26 13:30:56 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235559
I believe these two are related if not the same bug.