If I connect an HP D6020 external SAS disk enclosure to an HP H241 (HBA mode), with each drawer connected to its own port on the H241, the system only sees the first 48 drives (and not all 70). Looking at the output from 'camcontrol devlist', the last drive seen has "at scbus4 target 63 lun 0" and the first "at scbus4 target 16 lun 0". If I connect each drawer to a separate H241 controller then all 70 drives are visible. According to HP documentation the H241 should handle up to 200 physical drives. I've tried reading the ciss driver source code and I can't see any "obvious" limits/adjustable knobs, but I might be missing something... The "target 63" number feels like a 64-target limit somewhere. (I'm guessing the first 16 targets are reserved for logical drives, which we don't use, and that's why numbering starts at 16.) (Now I want to connect a second D6020 to the server(s) in question, so I can't use the "use two controller cards" solution anymore.)
Created attachment 214288 [details] Output from "camcontrol devlist"
Created attachment 214289 [details] Output from "camcontrol devlist"
Did some more testing today. Attached one of the new D6020 disk cabinets (with 70 SAS drives) to our test server. It looks like the CISS driver does see all 70 drives, but somewhere on the way up to the CAM layer 22 of them get lost :-)

# cciss_vol_status -V /dev/ciss0 | egrep 1200 | wc -l
70
# camcontrol devlist | egrep 1200 | wc -l
48

(Full output from cciss_vol_status & camcontrol added as attachments)
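The count comparison above can be taken one step further by also extracting the lowest and highest target number CAM reports. A minimal sketch in Python, assuming the usual "at scbusN target T lun L" layout of `camcontrol devlist` lines (as quoted in the first comment); the helper name target_range is made up for illustration:

```python
import re

# Matches the "at scbusN target T lun L" part of a `camcontrol devlist` line.
DEVLIST_RE = re.compile(r"at scbus(\d+) target (\d+) lun (\d+)")


def target_range(devlist_text, model_substring):
    """Return (count, lowest_target, highest_target) for matching devices.

    Only lines containing model_substring are considered, mirroring the
    `egrep 1200` filter used in the shell commands above.
    """
    targets = []
    for line in devlist_text.splitlines():
        if model_substring not in line:
            continue
        match = DEVLIST_RE.search(line)
        if match:
            targets.append(int(match.group(2)))
    if not targets:
        return (0, None, None)
    return (len(targets), min(targets), max(targets))
```

Feeding it the saved output of `camcontrol devlist` together with the same "1200" filter would show directly whether the visible drives stop at a suspicious boundary such as target 63.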
Created attachment 214315 [details] Output from "cciss_vol_status"
Ok, with some bits of printf-debugging I found some suspect code in sys/dev/ciss/ciss.c:ciss_cam_action() at the "case XPT_PATH_INQ" section:

cpi->max_target = sc->ciss_cfg->max_logical_supported;

Notice the "max logical logical volumes: 64" below?

ciss0: PERFORMANT Transport
ciss0: 0 logical drives configured
ciss0: firmware 5.04
ciss0: 1 SCSI channels
ciss0: signature 'CISS'
ciss0: valence 3
ciss0: supported I/O methods 0x7e000147<READY,simple,performant>
ciss0: active I/O method 0x5<performant>
ciss0: 4G page base rx00000000
ciss0: interrupt coalesce delay 0us
ciss0: interrupt coalesce count 16
ciss0: max outstanding commands 1024
ciss0: bus types 0x200000
ciss0: server name 'CZ3729EX3D'
ciss0: heartbeat 0xc0
ciss0: max logical logical volumes: 64
ciss0: max physical disks supported: 384
ciss0: max physical disks per logical volume: 128
ciss0: JBOD Support is Available
ciss0: JBOD Mode is Enabled
ciss0: 72 physical devices

(72 is 2 too many, but I guess the two extra are the storage drawers)

If I change that line to:

cpi->max_target = sc->ciss_cfg->max_physical_supported;

then "camcontrol devlist" now shows 69 of 70 drives... Better but not 100% there.
It now probes targets up to around 373 or so (but it takes a looong time), and also detects the SES devices that were hidden before since it only probed up to target 63...

ses0 at ciss0 bus 33 scbus2 target 119 lun 0
ses0: <HPE D6020 1.63> Fixed Enclosure Services SPC-4 SCSI device
ses0: Serial Number 7CE952P06X
ses0: 135.168MB/s transfers
ses0: SES Device
ses1 at ciss0 bus 33 scbus2 target 121 lun 0
ses1: <HPE D6020 1.63> Fixed Enclosure Services SPC-4 SCSI device
ses1: Serial Number 7CE952P06X
ses1: 135.168MB/s transfers
ses1: SES Device

Due to the probing taking a long time I also see timeouts:

> run_interrupt_driven_hooks: still waiting after 300 seconds for xpt_config

and SCSI errors:

(probe0:ciss1:33:351:0): REPORT LUNS. CDB: a0 00 00 00 00 00 00 00 00 10 00 00
(probe0:ciss1:33:351:0): CAM status: CCB request completed with an error
(probe0:ciss1:33:351:0): Retrying command, 4 more tries remain
(probe0:ciss1:33:351:0): REPORT LUNS. CDB: a0 00 00 00 00 00 00 00 00 10 00 00
(probe0:ciss1:33:351:0): CAM status: CCB request completed with an error
(probe0:ciss1:33:351:0): Retrying command, 3 more tries remain

This code feels... broken :-) Ah well, I'll see if I can modify the driver code to be a bit smarter about how many targets to probe. It really doesn't have to check them all since it knows how many targets there are (72 in my case) - it could stop after having detected that many... (No wonder the Linux folks have replaced their cciss driver with a rewritten one called hpsa.)

- Peter
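The "stop once the known number of devices has been found" idea can be sketched outside the kernel. This is only a Python model of the scan loop, not driver or CAM code; probe_until_found and the is_present callback are hypothetical names, and the assumption that targets are scanned in ascending order from 0 is mine:

```python
def probe_until_found(is_present, max_target, expected_count):
    """Probe targets 0..max_target in order, stopping early.

    is_present(target) stands in for an actual probe of one target.
    The loop ends as soon as expected_count devices have been found,
    so empty targets beyond the last device are never probed.
    Returns (found_targets, probes_issued).
    """
    found = []
    probes = 0
    for target in range(max_target + 1):
        if len(found) >= expected_count:
            break  # every device the controller reported is accounted for
        probes += 1
        if is_present(target):
            found.append(target)
    return found, probes
```

For example, with 70 drives assumed to sit at targets 16..85 and the two SES devices at targets 119 and 121 (as in the output above), an expected count of 72 ends the scan right after target 121 instead of walking all the way to the maximum supported target.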
The current code probably works reasonably well for cases where the controller is being used in RAID mode (where it works with logical LUNs). Well, except that it probably fails to detect the SES devices on the D6020 cabinets. But when used as a "dumb" HBA with more physical drives than the controller supports logical devices (64 in my case for the H241 controller) it will always do the wrong thing. And since it starts numbering physical targets at 16 (probably because the controller reports "0" supported logical LUNs, since it's an HBA!) and the driver code then uses a compile-time default of 16 (intended for really old controllers), things become strange... So 64-16 = 48. Okidoki. Time for some code hacking :-)
Created attachment 214328 [details] Patch to fix support for more than 48 drives in HBA mode

The attached patch fixes a couple of bugs in the current ciss driver code where it incorrectly enumerates physical drives if the controller is in JBOD mode. There are two bugs/problems:

1. If you attach more physical drives to a controller than the number of logical volumes the controller supports (yes, really - totally wrong logic here), the additional drives will not be available, because the driver sets the max_target limit to the number of logical volumes while the enumeration of hardware drives starts at 16. So for a controller that supports, say, 64 logical volumes, only the first (64-16) drives will be detected.

2. The code also sets the initiator_id to the same max logical volume number, so any physical drive that happens to have the same target number will silently be skipped...

The patch also enables a little more verbosity.

This patch has been tested with HP H241 controllers in JBOD mode with 70 drives connected to an HP D6020 external SAS enclosure on FreeBSD 12.1-RELEASE-p3. It has not been tested with controllers in "RAID" mode, but it should be compatible...
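Both problems can be reproduced with a few lines of arithmetic. The following is a Python model, not driver code: it assumes that the bus scan covers targets 0 through max_target inclusive and skips the target equal to initiator_id, and that the 70 drives occupy consecutive targets starting at 16. The function name visible_targets is invented for this sketch.

```python
def visible_targets(device_targets, max_target, initiator_id):
    """Return the device targets a scan of 0..max_target would reach.

    A target is lost if it lies above max_target (bug 1) or if it
    collides with the ID reserved for the initiator (bug 2).
    """
    return sorted(
        t for t in device_targets
        if 0 <= t <= max_target and t != initiator_id
    )


# Assumed layout: 70 drives on consecutive targets 16..85.
drives = range(16, 86)

# Unpatched: max_target and initiator_id both come from the logical limit (64).
unpatched = visible_targets(drives, max_target=64, initiator_id=64)

# Only max_target raised to the physical limit (384), initiator_id still 64.
half_fixed = visible_targets(drives, max_target=384, initiator_id=64)
```

Under these assumptions the first case yields 48 visible drives (targets 16..63) and the second yields 69, with target 64 being the one that is silently skipped. That matches the 48 and "69 of 70" counts reported earlier in this bug.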
This part: "but the enumeration of hardware drives starts at 16. So for a controller that support say 64 logical volumes, only the first (64-16) drives will be detected." should probably read: Depending on how disk enclosures enumerate drives, the exact number of drives allowed might differ. HP D6020 enclosures seem to start enumeration at 16, so with an HP H241 controller that supports 64 logical volumes only the first (64-16) drives will be detected. None of the SES "targets" (one per drawer in the D6020 enclosure) are detected either, since they are listed last...
Convert to modern way to indicate "patch".
Just verified that the patched driver also works on an old HP DL380G5 with an HP Smart Array P400 controller and four drives. Not much of a test, but at least that part behaves the same as before (the old P400 doesn't really support a true JBOD mode anyway).
Just a note that I've created a diff at https://reviews.freebsd.org/D25155 which fixes the problems mentioned here, plus the bug in 246280 (panic at unplug/replug of devices), plus a bug that prevents SES (SCSI Enclosure Services) from enumerating devices behind an HP controller (now "sesutil map" works). It also makes the /boot/loader.conf (and sysctl) tunables visible, and allows the driver to be more verbose at boot time (without having to set boot_verbose="YES" in /boot/loader.conf).
Hmm... Is there something more I need to do for this patch? If the ciss driver isn't really maintained by anyone anymore then perhaps I could take over that responsibility. (We are using it in production and will probably keep using it for at least a number of years more, so I have an incentive to keep it working.)
(In reply to Peter Eriksson from comment #13) Please nudge mav and imp who approved the review request.
I'll see if I can get this committed. I have no ciss cards, however...
Any progress on this?
(In reply to Warner Losh from comment #15)
> I'll see if I can get this committed. I have no ciss cards, however...

I probably could send you a couple: an HP Smart HBA H241 (external SAS) and an old HP Smart Array P400 (internal controller in an old HP DL380g5), if that helps getting things committed :-) Or is there something else I can be of assistance with?

- Peter
Any chance of getting this into 13-stable so it could be in 13.1-release? :-) (So I could use the stock kernel in the future :-)
(In reply to Peter Eriksson from comment #18) Thanks for the info. At least the problem is known, and its resolution too; perhaps it can be fixed to celebrate the 3rd birthday of the patch? Do you have an account on Phabricator, Peter? Would you mind creating a review there (https://reviews.freebsd.org/)? If not, perhaps someone else can take care of that?
See comment #12 a bit up here :-)
(In reply to Peter Eriksson from comment #17) Peter, could you please test the patch which is in the review again and give some feedback there?
Yes, I'll try to test it later today. I'll get back with some results. - Peter
First test on a server with two HP H241 HBA cards with just 5 disks (in two boxes) running FreeBSD 12.4 with the full ciss.c from Phabricator - works fine. Next I'll test it on another server with 140 disks on two H241 controllers (70 in each enclosure).

root@balur00:/boot # sysctl hw.ciss
hw.ciss.force_interrupt: 0
hw.ciss.force_transport: 0
hw.ciss.nop_message_heartbeat: 0
hw.ciss.expose_hidden_physical: 0
hw.ciss.verbose: 2
hw.ciss.base_transfer_speed: 135168
hw.ciss.initiator_id: -1

root@balur00:/boot # egrep ciss /var/run/dmesg.boot
ciss0: <HP Smart Array H241> port 0x3000-0x30ff mem 0x95400000-0x954fffff,0x95500000-0x955003ff at device 0.0 numa-domain 0 on pci5
ciss0: PERFORMANT Transport
ciss0: Using 1 MSIX interrupt
ciss0: using 1024 of 1024 available commands
ciss0: 0 logical drives configured
ciss0: firmware 7.00
ciss0: 1 SCSI channels
ciss0: 0 FC channels
ciss0: 0 enclosures
ciss0: 0 expanders
ciss0: maximum blocks: 65535
ciss0: controller clock: 18343
ciss0: 256 MB controller memory
ciss0: signature 'CISS'
ciss0: valence 3
ciss0: supported I/O methods 0x7f000147<READY,simple,performant>
ciss0: active I/O method 0x5<performant>
ciss0: 4G page base rx00000000
ciss0: interrupt coalesce delay 0us
ciss0: interrupt coalesce count 16
ciss0: max outstanding commands 1024
ciss0: bus types 0x200000
ciss0: server name 'CZ3729EX3D'
ciss0: heartbeat 0xb7
ciss0: max logical volumes supported: 64
ciss0: max physical drives supported: 384
ciss0: max physical drives per logical volume: 128
ciss0: JBOD Support is Available
ciss0: JBOD Mode is Enabled
ciss0: 0 physical devices
ciss0: max physical target id: 0
ciss0: 0 logical drives
ciss1: <HP Smart Array H241> port 0x2000-0x20ff mem 0x95200000-0x952fffff,0x95300000-0x953003ff at device 0.0 numa-domain 0 on pci11
ciss1: PERFORMANT Transport
ciss1: Using 1 MSIX interrupt
ciss1: using 1024 of 1024 available commands
ciss1: 0 logical drives configured
ciss1: firmware 7.00
ciss1: 1 SCSI channels
ciss1: 0 FC channels
ciss1: 2 enclosures
ciss1: 2 expanders
ciss1: maximum blocks: 65535
ciss1: controller clock: 18486
ciss1: 256 MB controller memory
ciss1: signature 'CISS'
ciss1: valence 3
ciss1: supported I/O methods 0x7f000147<READY,simple,performant>
ciss1: active I/O method 0x5<performant>
ciss1: 4G page base rx00000000
ciss1: interrupt coalesce delay 0us
ciss1: interrupt coalesce count 16
ciss1: max outstanding commands 1024
ciss1: bus types 0x200000
ciss1: server name 'CZ3729EX3D'
ciss1: heartbeat 0xb9
ciss1: max logical volumes supported: 64
ciss1: max physical drives supported: 384
ciss1: max physical drives per logical volume: 128
ciss1: JBOD Support is Available
ciss1: JBOD Mode is Enabled
ciss1: 7 physical devices
ciss1: max physical target id: 120
ciss1: 0 logical drives
Root mount waiting for:ses0 at ciss1 bus 33 scbus3 target 119 lun 0
ses1 at ciss1 bus 33 scbus3 target 120 lun 0
da2 at ciss1 bus 32 scbus2 target 83 lun 0
da3 at ciss1 bus 32 scbus2 target 84 lun 0
da4 at ciss1 bus 32 scbus2 target 85 lun 0
da1 at ciss1 bus 32 scbus2 target 50 lun 0
uhub3: da0 at ciss1 bus 32 scbus2 target 49 lun 0

root@balur00:/boot # cciss_vol_status -V /dev/ciss1
Controller: Smart HBA H241
Board ID: 0x21c8103c
Logical drives: 0
Running firmware: 7.00
ROM firmware: 7.00
Physical drives: 5
connector 1E box 1 bay 34 HP MB010000JWAYK 7PH8MXKG HPD5 OK
connector 1E box 1 bay 35 HP MB010000JWAYK 7PH4AJ9G HPD5 OK
connector 2E box 1 bay 33 HP MB010000JWAYK 7PH816MG HPD5 OK
connector 2E box 1 bay 34 HP MB010000JWAYK 7PH8G5PG HPD5 OK
connector 2E box 1 bay 35 HP MB010000JWAYK 7PGTUTHG HPD5 OK
/dev/ciss1: (Smart HBA H241) Enclosure D6020 (S/N: 7CE714P009) on Bus 2, Physical Port 1E status: OK.
/dev/ciss1: (Smart HBA H241) Enclosure D6020 (S/N: 7CE714P009) on Bus 3, Physical Port 2E status: OK.
/dev/ciss1(Smart HBA H241:0): Non-Volatile Cache status: Cache configured: No root@balur00:/boot # sesutil -u /dev/ses0 show ses0: <HPE D6020 2.74>; ID: 5001438030884b80 Desc Dev Model Ident Size/Status {"Name":"Drive bay"} - - - Not Installed {"Name":"DriveBay1"} - - - Not Installed {"Name":"DriveBay2"} - - - Not Installed {"Name":"DriveBay3"} - - - Not Installed {"Name":"DriveBay4"} - - - Not Installed {"Name":"DriveBay5"} - - - Not Installed {"Name":"DriveBay6"} - - - Not Installed {"Name":"DriveBay7"} - - - Not Installed {"Name":"DriveBay8"} - - - Not Installed {"Name":"DriveBay9"} - - - Not Installed {"Name":"DriveBay10"} - - - Not Installed {"Name":"DriveBay11"} - - - Not Installed {"Name":"DriveBay12"} - - - Not Installed {"Name":"DriveBay13"} - - - Not Installed {"Name":"DriveBay14"} - - - Not Installed {"Name":"DriveBay15"} - - - Not Installed {"Name":"DriveBay16"} - - - Not Installed {"Name":"DriveBay17"} - - - Not Installed {"Name":"DriveBay18"} - - - Not Installed {"Name":"DriveBay19"} - - - Not Installed {"Name":"DriveBay20"} - - - Not Installed {"Name":"DriveBay21"} - - - Not Installed {"Name":"DriveBay22"} - - - Not Installed {"Name":"DriveBay23"} - - - Not Installed {"Name":"DriveBay24"} - - - Not Installed {"Name":"DriveBay25"} - - - Not Installed {"Name":"DriveBay26"} - - - Not Installed {"Name":"DriveBay27"} - - - Not Installed {"Name":"DriveBay28"} - - - Not Installed {"Name":"DriveBay29"} - - - Not Installed {"Name":"DriveBay30"} - - - Not Installed {"Name":"DriveBay31"} - - - Not Installed {"Name":"DriveBay32"} - - - Not Installed {"Name":"DriveBay33"} da2 HP MB010000JWAYK 7PH816MG 10T {"Name":"DriveBay34"} da3 HP MB010000JWAYK 7PH8G5PG 10T {"Name":"DriveBay35"} da4 HP MB010000JWAYK 7PGTUTHG 10T Temperatures: {"Name":"Temperature sensor"} : 42 C, {"Name":"LocalIoModule-Sensor[0]"} : 31 C, {"Name":"LocalIoModule-Sensor[1]"} : 38 C, {"Name":"LocalExpander-CpuSensor[0]"} : 42 C, {"Name":"PowerSupply[3]-InletSensor[0]"}: 28 C, 
{"Name":"PowerSupply[3]-Sensor[0]"} : 32 C, {"Name":"PowerSupply[4]-InletSensor[0]"}: 27 C, {"Name":"PowerSupply[4]-Sensor[0]"} : 31 C, {"Name":"Backplane-Sensor[0]"} : 25 C, {"Name":"Backplane-Sensor[1]"} : 23 C, {"Name":"Backplane-Sensor[2]"} : 24 C, {"Name":"Backplane-Sensor[3]"} : 27 C, {"Name":"Backplane-Sensor[4]"} : 24 C, {"Name":"Backplane-Sensor[5]"} : 23 C, {"Name":"DisplayBoard-Sensor[0]"} : 25 C root@balur00:/boot # sesutil -u /dev/ses1 show ses1: <HPE D6020 2.74>; ID: 5001438030894600 Desc Dev Model Ident Size/Status {"Name":"Drive bay"} - - - Not Installed {"Name":"DriveBay1"} - - - Not Installed {"Name":"DriveBay2"} - - - Not Installed {"Name":"DriveBay3"} - - - Not Installed {"Name":"DriveBay4"} - - - Not Installed {"Name":"DriveBay5"} - - - Not Installed {"Name":"DriveBay6"} - - - Not Installed {"Name":"DriveBay7"} - - - Not Installed {"Name":"DriveBay8"} - - - Not Installed {"Name":"DriveBay9"} - - - Not Installed {"Name":"DriveBay10"} - - - Not Installed {"Name":"DriveBay11"} - - - Not Installed {"Name":"DriveBay12"} - - - Not Installed {"Name":"DriveBay13"} - - - Not Installed {"Name":"DriveBay14"} - - - Not Installed {"Name":"DriveBay15"} - - - Not Installed {"Name":"DriveBay16"} - - - Not Installed {"Name":"DriveBay17"} - - - Not Installed {"Name":"DriveBay18"} - - - Not Installed {"Name":"DriveBay19"} - - - Not Installed {"Name":"DriveBay20"} - - - Not Installed {"Name":"DriveBay21"} - - - Not Installed {"Name":"DriveBay22"} - - - Not Installed {"Name":"DriveBay23"} - - - Not Installed {"Name":"DriveBay24"} - - - Not Installed {"Name":"DriveBay25"} - - - Not Installed {"Name":"DriveBay26"} - - - Not Installed {"Name":"DriveBay27"} - - - Not Installed {"Name":"DriveBay28"} - - - Not Installed {"Name":"DriveBay29"} - - - Not Installed {"Name":"DriveBay30"} - - - Not Installed {"Name":"DriveBay31"} - - - Not Installed {"Name":"DriveBay32"} - - - Not Installed {"Name":"DriveBay33"} - - - Not Installed {"Name":"DriveBay34"} da0 HP MB010000JWAYK 
7PH8MXKG 10T {"Name":"DriveBay35"} da1 HP MB010000JWAYK 7PH4AJ9G 10T Temperatures: {"Name":"Temperature sensor"} : 43 C, {"Name":"LocalIoModule-Sensor[0]"} : 34 C, {"Name":"LocalIoModule-Sensor[1]"} : 39 C, {"Name":"LocalExpander-CpuSensor[0]"} : 43 C, {"Name":"PowerSupply[1]-InletSensor[0]"}: 32 C, {"Name":"PowerSupply[1]-Sensor[0]"} : 43 C, {"Name":"PowerSupply[2]-InletSensor[0]"}: 33 C, {"Name":"PowerSupply[2]-Sensor[0]"} : 38 C, {"Name":"Backplane-Sensor[0]"} : 20 C, {"Name":"Backplane-Sensor[1]"} : 20 C, {"Name":"Backplane-Sensor[2]"} : 22 C, {"Name":"Backplane-Sensor[3]"} : 23 C, {"Name":"Backplane-Sensor[4]"} : 21 C, {"Name":"Backplane-Sensor[5]"} : 20 C, {"Name":"DisplayBoard-Sensor[0]"} : 18 C
Result on the bigger server: Booted, but then I got a long list of CAM errors ending with a "ADAPTER HEARTBEAT FAILED". ... (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Retrying command, 3 more tries remain (da123:ciss2:32:63:0): READ(6). CDB: 08 00 01 40 08 00 (da123:ciss2:32:63:0): CAM status: CCB request completed with an error (da123:ciss2:32:63:0): Retrying command, 3 more tries remain (da123:ciss2:32:63:0): READ(10). CDB: 28 00 00 00 03 00 00 01 00 00 (da123:ciss2:32:63:0): CAM status: CCB request completed with an error (da123:ciss2:32:63:0): Retrying command, 2 more tries remain (da123:ciss2:32:63:0): READ(6). CDB: 08 00 01 38 08 00 (da123:ciss2:32:63:0): CAM status: CCB request completed with an error (da123:ciss2:32:63:0): Retrying command, 3 more tries remain (da123:ciss2:32:63:0): READ(6). CDB: 08 00 01 30 08 00 (da123:ciss2:32:63:0): CAM status: CCB request completed with an error (da123:ciss2:32:63:0): Retrying command, 3 more tries remain (da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 01 ce 8c 30 28 00 00 01 00 00 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Retrying command, 3 more tries remain (da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 01 ce 8c 31 28 00 00 01 00 00 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Retrying command, 3 more tries remain (da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 01 ce 8c 32 28 00 00 01 00 00 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Retrying command, 3 more tries remain (da116:ciss2:32:56:0): READ(16). 
CDB: 88 00 00 00 00 01 ce 8c 33 28 00 00 00 38 00 00
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
ciss2: ADAPTER HEARTBEAT FAILED

I could log in and see all disks with "camcontrol devlist", but it started displaying these errors when zfs was importing the pools. The da116 and da123 disks are probably bad, but it would be nice if the ciss controller handled bad drives a bit better in general. I had to replace the HP H241 controller with an LSI SAS3816 controller on my other "big" HP server since that H241/ciss controller would go into a loop and just spew out retry errors every day or so. The LSI controller handles the bad drives better...
Rebooted and tried with hw.ciss.nop_message_heartbeat=1, then logged in via ssh and ran an "sesutil show" (before zfs had started importing pools), then it panic:ed with: login: (da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 48 08 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error ciss2: *** Hot-plug drive removed, Port=2E Box=1 Bay=6 SN= 5PGTSWYC (da116:ciss2:32:56:0): Retrying command, 3 more tries remain ciss2: *** Physical drive failure, Port=2E Box=1 Bay=6 reason=0x14 (da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 05 74 ff ff 00 00 00 01 00 00 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Retrying command, 3 more tries remain (da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 50 b0 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Retrying command, 3 more tries remain (da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 03 00 00 01 00 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Retrying command, 3 more tries remain (da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 05 74 ff fd 00 00 00 01 00 00 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Retrying command, 3 more tries remain (da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 40 08 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Retrying command, 3 more tries remain (da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 38 08 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Retrying command, 3 more tries remain (da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 28 08 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Retrying command, 3 more tries remain (da116:ciss2:32:56:0): READ(6). 
CDB: 08 00 01 30 08 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Retrying command, 3 more tries remain (da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 20 08 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Retrying command, 3 more tries remain (da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10. (da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 05 74 ff ff 00 00 00 01 00 00 00 (da116:ciss2:32:56:0): CAM status: SCSI Status Error (da116:ciss2:32:56:0): SCSI status: Check Condition (da116:ciss2:32:56:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported) (da116:ciss2:32:56:0): Error 6, Unretryable error (da116:ciss2:32:56:0): Invalidating pack (da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10. (da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 03 00 00 01 00 00 (da116:ciss2:32:56:0): CAM status: SCSI Status Error (da116:ciss2:32:56:0): SCSI status: Check Condition (da116:ciss2:32:56:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported) (da116:ciss2:32:56:0): Error 6, Unretryable error (da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 05 74 ff fd 00 00 00 01 00 00 00 (da116:ciss2:32:56:0): CAM status: SCSI Status Error (da116:ciss2:32:56:0): SCSI status: Check Condition (da116:ciss2:32:56:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported) (da116:ciss2:32:56:0): Error 6, Unretryable error (da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10. (da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10. (da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10. (da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10. 
da116 at ciss2 bus 32 scbus7 target 56 lun 0 da116: <HP MB012000JWDFD HPD2> s/n 5PGTSWYC detached May 8 21:56:32 balur03 ZFS[4858]: vdev probe failure, zpool=$FILUR02 path=$/dev/diskid/DISK-5PGTSWYC (da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 20 08 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Error 5, Periph was invalidated (da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 48 00 00 08 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Error 5, Periph was invalidated (da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 28 00 00 08 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Error 5, Periph was invalidated (da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 30 00 00 08 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Error 5, Periph was invalidated (da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 38 00 00 08 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Error 5, Periph was invalidated (da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 40 00 00 08 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Error 5, Periph was invalidated (da116:ciss2:32:56:0): READ(10). 
CDB: 28 00 00 00 01 50 00 00 b0 00 (da116:ciss2:32:56:0): CAM status: CCB request completed with an error (da116:ciss2:32:56:0): Error 5, Periph was invalidated Fatal trap 12: page fault while in kernel mode cpuid = 3; apic id = 03 fault virtual address = 0x180 fault code = supervisor read data, page not present (da116:ciss2:32:56:0): Periph destroyed instruction pointer = 0x20:0xffffffff8256e51d stack pointer = 0x28:0xfffffe058a1d59d0 frame pointer = 0x28:0xfffffe058a1d5a40 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 45 (solthread 0xfffffff) trap number = 12 panic: page fault cpuid = 3 time = 1683575795 KDB: stack backtrace: #0 0xffffffff80c325f5 at kdb_backtrace+0x65 #1 0xffffffff80be89c8 at vpanic+0x178 #2 0xffffffff80be8843 at panic+0x43 #3 0xffffffff8110408f at trap_fatal+0x38f #4 0xffffffff811040df at trap_pfault+0x4f #5 0xffffffff810dbd58 at calltrap+0x8 #6 0xffffffff8256e49a at vdev_dtl_reassess+0x5a #7 0xffffffff8256e49a at vdev_dtl_reassess+0x5a #8 0xffffffff825639e2 at spa_vdev_state_exit+0x42 #9 0xffffffff8255d04f at spa_async_thread_vd+0x17f #10 0xffffffff80ba969e at fork_exit+0x7e #11 0xffffffff810dcd8e at fork_trampoline+0xe System is going down. Uptime: 4m40s I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion. I watched C-beams glitter in the dark near the Tannhäuser Gate. All those moments will be lost in time, like tears in rain. Time to die.
On the third attempt I just let it keep trying, no "sesutil show". And then it eventually printed: .... (da116:ciss2:32:56:0): SCSI status: Check Condition (da116:ciss2:32:56:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported) (da116:ciss2:32:56:0): Error 6, Unretryable error da116 at ciss2 bus 32 scbus7 target 56 lun 0 da116: <HP MB012000JWDFD HPD2> s/n 5PGTSWYC detached May 8 22:19:40 balur03 ZFS[22519]: vdev probe failure, zpool=$FILUR02 path=$/dev/diskid/DISK-5PGTSWYC (da116:ciss2:32:56:0): Periph destroyed (da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 39 58 00 00 01 00 00 00 (da123:ciss2:32:63:0): CAM status: CCB request completed with an error (da123:ciss2:32:63:0): Retrying command, 3 more tries remain (da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 38 58 00 00 01 00 00 00 (da123:ciss2:32:63:0): CAM status: CCB request completed with an error (da123:ciss2:32:63:0): Retrying command, 3 more tries remain (da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 37 58 00 00 01 00 00 00 (da123:ciss2:32:63:0): CAM status: CCB request completed with an error (da123:ciss2:32:63:0): Retrying command, 3 more tries remain (da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 39 58 00 00 01 00 00 00 (da123:ciss2:32:63:0): CAM status: SCSI Status Error (da123:ciss2:32:63:0): SCSI status: Check Condition (da123:ciss2:32:63:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported) (da123:ciss2:32:63:0): Command Specific Info: 0x2060000 (da123:ciss2:32:63:0): Error 6, Unretryable error (da123:ciss2:32:63:0): Invalidating pack (da123:ciss2:32:63:0): WRITE(16). 
CDB: 8a 00 00 00 00 03 1b 00 38 58 00 00 01 00 00 00 (da123:ciss2:32:63:0): CAM status: SCSI Status Error (da123:ciss2:32:63:0): SCSI status: Check Condition (da123:ciss2:32:63:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported) (da123:ciss2:32:63:0): Command Specific Info: 0x2060000 (da123:ciss2:32:63:0): Error 6, Unretryable error (da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 37 58 00 00 01 00 00 00 (da123:ciss2:32:63:0): CAM status: SCSI Status Error (da123:ciss2:32:63:0): SCSI status: Check Condition (da123:ciss2:32:63:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported) (da123:ciss2:32:63:0): Command Specific Info: 0x2060000 (da123:ciss2:32:63:0): Error 6, Unretryable error da123 at ciss2 bus 32 scbus7 target 63 lun 0 da123: <HP MB012000JWDFD HPD2> s/n 5PGU79TE detached May 8 22:26:52 balur03 ZFS[38178]: vdev probe failure, zpool=$FILUR07 path=$/dev/diskid/DISK-5PGU79TE (da123:ciss2:32:63:0): Periph destroyed And then zfs started mounting the filesystems. root@balur03# df -h | wc -l 169281 It'll be interesting to see if it survives :-) The fact that the "sesutil" command seemed to trigger a panic might indicate a problem somewhere though, but I'm not sure it's in the ciss driver.
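Retry logs like the ones in the last few comments are easier to judge once they are tallied per device. A small Python sketch, assuming the console log has been saved with one kernel message per line in the "(daN:cissM:bus:target:lun): CAM status: ..." form shown above; tally_cam_errors is a name made up here:

```python
import re
from collections import Counter

# One CAM status line, e.g.
# "(da116:ciss2:32:56:0): CAM status: CCB request completed with an error"
CAM_STATUS_RE = re.compile(
    r"^\((\w+):ciss\d+:\d+:\d+:\d+\): CAM status: (.*)$"
)


def tally_cam_errors(log_text):
    """Count CAM status messages per (device, status text) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CAM_STATUS_RE.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Sorting the result with counts.most_common() would put the noisiest devices (da116 and da123 in the runs above) at the top, which makes it quicker to tell failing drives apart from a controller-wide problem.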
Thanks for making the effort to test it again, Peter. Without the patch you authored, the tests would end sooner with no panic and 40 instead of 140 drives seen. That's a large setup and the solution is not ideal, but it looks like a step forward. In the meantime I played a bit with ciss(4) settings, and found out that setting hw.ciss.nop_message_heartbeat="1", which you complained about in comment 25, also breaks things for me regardless of whether the patch from D25155 is applied. On the other hand, I find the new sysctl knob "hw.ciss.verbose" introduced by your patch quite useful for operations like drive replacement:

[7372] ciss0: *** Hot-plug drive removed, Port=1I Box=1 Bay=3 SN= PLXXXXX
[7372] ciss0: *** Physical drive failure, Port=1I Box=1 Bay=3
[7372] ciss0: *** State change, logical drive 2, new state=FAILED
[7372] ciss0: logical drive 2 (da2) changed status OK->failed, spare status 0x0
[7372] da2 at ciss0 bus 0 scbus0 target 2 lun 0
[7372] (da2:ciss0:0:2:0): Periph destroyed
[7444] ciss0: *** Hot-plug drive inserted, Port=1I Box=1 Bay=3 SN= PLXXXXX
[7444] ciss0: *** Media exchanged detected, logical drive 2
[7444] ciss0: logical drive 2 () media exchanged, ready to go online
[7444] ciss0: *** State change, logical drive 2, new state=OK
[7444] ciss0: logical drive (b0t2): RAID 0, 139776MB online
[7444] ciss0: logical drive 2 () changed status failed->OK, spare status 0x0
[7444] ciss0: logical drive (b0t2): RAID 0, 139776MB online
[7444] da2 at ciss0 bus 0 scbus0 target 2 lun 0

Anyway, the patch proposed in review D25155 fixes a couple of things, was tested in JBOD and RAID mode, and does not seem to introduce any regression.
Actually, with hw.ciss.nop_message_heartbeat="0" my first test ended with

ciss2: ADAPTER HEARTBEAT FAILED

and then things deadlocked. With hw.ciss.nop_message_heartbeat="1", and as long as I didn't provoke it with "sesutil show", all disks were detected correctly, the two dead ones were failed after a while, and all zfs filesystems got mounted. The server has now been running for 10 hours without problems. (Going to remove the problem disks later today.)

hw.ciss.verbose = "1"
hw.ciss.nop_message_heartbeat = "1"
hw.ciss.base_transfer_speed = "1200000"

- Peter
OK. I've staged the rebased patch. I broke it down into tiny bites in case there are problems. I've also noted my extreme reservations about making the cr == NULL panic just a printf. There's some race that's causing it (I think where we set the shutdown flag is racing another thread that's setting it w/o ciss_mtx held). Ideally, someone who has the hardware and the time would chase that to ground, but I do not. This was reviewed, though, by mav@ as well as myself (though he reviewed a prior version). Most of the change is otherwise uncontroversial. However, if there are problems, I'm going to be quick to back out if need be. If everything builds, I'll push the changes soon.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=45645518ea19ccb4761aee3a525aab2f323d37d4 commit 45645518ea19ccb4761aee3a525aab2f323d37d4 Author: Peter Eriksson <pen@lysator.liu.se> AuthorDate: 2024-10-14 04:01:33 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-14 05:24:15 +0000 ciss: Add max physical target Add support for tracking the maximum physical target and using that to override the maximum logical target. PR: 246279 Reviewed by: imp Tested by: Marek Zarychta Differential Revision: https://reviews.freebsd.org/D25155 sys/dev/ciss/ciss.c | 11 ++++++++++- sys/dev/ciss/cissvar.h | 1 + 2 files changed, 11 insertions(+), 1 deletion(-)
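For readers following along, the idea described in this commit message (track the highest physical target and let it override the logical maximum) can be sketched in userland C roughly as below. This is only an illustration under assumptions: the struct, its field names, and both functions are hypothetical stand-ins and are not copied from sys/dev/ciss/ciss.c or cissvar.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, userland-only stand-in for the driver's per-adapter state. */
struct sketch_softc {
	uint32_t max_logical_supported;	/* limit reported by the controller */
	uint32_t max_physical_target;	/* highest physical target ID seen */
};

/* Record a physical target found during discovery. */
static void
sketch_note_physical_target(struct sketch_softc *sc, uint32_t target)
{
	assert(sc != NULL);
	if (target > sc->max_physical_target)
		sc->max_physical_target = target;
}

/* Value to report as the bus's maximum target: whichever limit is larger. */
static uint32_t
sketch_max_target(const struct sketch_softc *sc)
{
	assert(sc != NULL);
	if (sc->max_physical_target > sc->max_logical_supported)
		return (sc->max_physical_target);
	return (sc->max_logical_supported);
}
```

With a logical limit of 64 and 70 physical drives numbered from target 16, the sketch reports the physical maximum instead of the logical one, which is the behaviour the original report was missing.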
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=7c74337e2c3d2269d1559f4e5541c0a3f402d814 commit 7c74337e2c3d2269d1559f4e5541c0a3f402d814 Author: Peter Eriksson <pen@lysator.liu.se> AuthorDate: 2024-10-14 04:01:33 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-14 05:24:06 +0000 ciss: Expose tunable hw.ciss.force_interrupt as sysctl Expose the hw.ciss.force_interrupt tuneable as a sysctl and make it writeable at runtime. PR: 246279 Reviewed by: imp Tested by: Marek Zarychta Differential Revision: https://reviews.freebsd.org/D25155 sys/dev/ciss/ciss.c | 3 +++ 1 file changed, 3 insertions(+)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=fd95966af50bab6229bc5e67fadc7ffd915f77f5 commit fd95966af50bab6229bc5e67fadc7ffd915f77f5 Author: Peter Eriksson <pen@lysator.liu.se> AuthorDate: 2024-10-14 04:01:33 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-14 05:37:46 +0000 ciss: hw.ciss.initator_id to set the initiator ID Add hw.ciss.inititor_id to set the initiator to something other than the default. PR: 246279 Reviewed by: imp Tested by: Marek Zarychta Differential Revision: https://reviews.freebsd.org/D25155 sys/dev/ciss/ciss.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=b339ab1491055d89415f85b6d1a03423193178f9 commit b339ab1491055d89415f85b6d1a03423193178f9 Author: Peter Eriksson <pen@lysator.liu.se> AuthorDate: 2024-10-14 04:01:33 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-14 05:37:46 +0000 ciss: Don't panic on null CR ciss_dequeue_notify Apparently, sometimes on hot plug/unplug, a null cr comes back from ciss_dequeue_notify. This is clearly a bug, and by ignoring it we're papering over that bug. We only ever wake the thread after enqueing a notification or setting a bit about killing the thread, so once we check the bit isn't the cause, cr can't be NULL unless something else has dequeued it. Ideally, this would be fixed, rather than papered over, but this makes a very old card somewhat more useable for external enclosures. I suspect it's a race when we set CISS_THREAD_SHUT and another flag (the latter w/o ciss_mtx held), but I don't see it and w/o hardware to reproduce it would be hard to know for sure. PR: 246279 Reviewed by: imp Tested by: Marek Zarychta Differential Revision: https://reviews.freebsd.org/D25155 sys/dev/ciss/ciss.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=cec58bba6425d4b9cf19a9ede9ca0dc00c1d48e3 commit cec58bba6425d4b9cf19a9ede9ca0dc00c1d48e3 Author: Peter Eriksson <pen@lysator.liu.se> AuthorDate: 2024-10-14 04:01:33 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-14 05:23:45 +0000 ciss: Expose tunable hw.ciss.nop_message_heartbeat as sysctl Expose the hw.ciss.nop_message_heartbeat tuneable as a sysctl and make it writeable at runtime. PR: 246279 Reviewed by: imp Tested by: Marek Zarychta Differential Revision: https://reviews.freebsd.org/D25155 sys/dev/ciss/ciss.c | 3 +++ 1 file changed, 3 insertions(+)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=a35564358ac442d5d7a5c9c2dd0544f07b1963e7 commit a35564358ac442d5d7a5c9c2dd0544f07b1963e7 Author: Peter Eriksson <pen@lysator.liu.se> AuthorDate: 2024-10-14 04:01:33 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-14 05:23:36 +0000 ciss: Expose tunable hw.ciss.expose_hidden_physical as sysctl Expose the hw.ciss.expose_hidden_physical tuneable as a sysctl and make it writeable at runtime. PR: 246279 Reviewed by: imp Tested by: Marek Zarychta Differential Revision: https://reviews.freebsd.org/D25155 sys/dev/ciss/ciss.c | 3 +++ 1 file changed, 3 insertions(+)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=d8b024673bbfb32259030db7e54f043f3e471abe commit d8b024673bbfb32259030db7e54f043f3e471abe Author: Peter Eriksson <pen@lysator.liu.se> AuthorDate: 2024-10-14 04:01:33 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-14 05:23:25 +0000 ciss: Report more errors at higher ciss_verbose levels Report more information on errors, including the the opcode. PR: 246279 Reviewed by: imp Tested by: Marek Zarychta Differential Revision: https://reviews.freebsd.org/D25155 sys/dev/ciss/ciss.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=f373e6b866b9efafc66ccc5355e1ea0aeeedfb6a commit f373e6b866b9efafc66ccc5355e1ea0aeeedfb6a Author: Peter Eriksson <pen@lysator.liu.se> AuthorDate: 2024-10-14 04:01:33 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-14 05:23:17 +0000 ciss: Add sysctl/tunable hw.ciss.verbose Add tuneable to turn on/off verbosity for debugging purposes. This is approximately the same as bootverbose, but will print even more information when > 1. PR: 246279 Reviewed by: imp Tested by: Marek Zarychta Differential Revision: https://reviews.freebsd.org/D25155 sys/dev/ciss/ciss.c | 28 ++++++++++++++++++++++------ 1 file changed, 22 insertions(+), 6 deletions(-)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=74575d14284fde2c7617ad14fd9a8bac897b4427 commit 74575d14284fde2c7617ad14fd9a8bac897b4427 Author: Peter Eriksson <pen@lysator.liu.se> AuthorDate: 2024-10-14 04:01:33 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-14 05:23:08 +0000 ciss: Add sysctl/tunable hw.ciss and hw.ciss.base_transfer_speed Add a sysctl/tuneable to report a different base transfer speed than the default of 132*1024. PR: 246279 Reviewed by: imp Tested by: Marek Zarychta Differential Revision: https://reviews.freebsd.org/D25155 sys/dev/ciss/ciss.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=77af8c6db25ff6154268eb17f54f082a7eb61ea0 commit 77af8c6db25ff6154268eb17f54f082a7eb61ea0 Author: Peter Eriksson <pen@lysator.liu.se> AuthorDate: 2024-10-14 04:01:33 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-14 05:23:54 +0000 ciss: Expose tunable hw.ciss.force_transport as sysctl Expose the hw.ciss.force_transport tuneable as a sysctl and make it writeable at runtime. PR: 246279 Reviewed by: imp Tested by: Marek Zarychta Differential Revision: https://reviews.freebsd.org/D25155 sys/dev/ciss/ciss.c | 3 +++ 1 file changed, 3 insertions(+)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=f03e1a42e92eff76dcf474655b600db37b04ae2b commit f03e1a42e92eff76dcf474655b600db37b04ae2b Author: Peter Eriksson <pen@lysator.liu.se> AuthorDate: 2024-10-14 04:01:33 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-14 05:22:01 +0000 ciss: Minor formatting nit. PR: 246279 Reviewed by: imp Tested by: Marek Zarychta Differential Revision: https://reviews.freebsd.org/D25155 sys/dev/ciss/cissvar.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=cafc839393db5c5d8000fd086118b3c7b47e95c2 commit cafc839393db5c5d8000fd086118b3c7b47e95c2 Author: Peter Eriksson <pen@lysator.liu.se> AuthorDate: 2024-10-14 04:01:33 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-14 05:22:19 +0000 ciss: Ignore data over/under run on RECEIVE_DIAGNOSTIC This appears to be harmless, so ignore data over/under run on diagnostics. PR: 246279 Reviewed by: imp Tested by: Marek Zarychta Differential Revision: https://reviews.freebsd.org/D25155 sys/dev/ciss/ciss.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
(In reply to commit-hook from comment #33)
> I've also noted my extreme reservations about making the cr == NULL
> panic just a printf. There's some race that's causing it (I think
> with where we set the shutdown flag racing another thread that's setting it w/o
> ciss_mtx held). Ideally, someone would chase that to ground who had the hardware and
> the time to do so, but I do not.
Perhaps we should make that panic / printf fix selectable via a sysctl knob? I agree that there is a bug somewhere since with the change to printf() I sometimes see a never-ending loop with errors instead. However, that takes longer to "develop" than the panic-reboot so it still makes it a bit more usable... It typically occurs with bad disks though, so another fix is to remove the bad disks :-)
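A minimal userland sketch of what such a selectable behaviour might look like follows. Everything in it is an assumption made for illustration: the knob name sketch_panic_on_null_cr, the enum, and the function do not exist in the ciss driver, and a real implementation would expose the knob through sysctl and call panic() rather than return a value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical knob; in a driver this would be a sysctl/tunable. */
static int sketch_panic_on_null_cr = 0;

enum sketch_action { SKETCH_PROCESS, SKETCH_SKIP, SKETCH_PANIC };

/* Decide what the notify thread should do with a dequeued request. */
static enum sketch_action
sketch_handle_dequeued(const void *cr)
{
	if (cr != NULL)
		return (SKETCH_PROCESS);
	if (sketch_panic_on_null_cr)
		return (SKETCH_PANIC);	/* a real driver would panic() here */
	/* Knob off: log the anomaly and carry on, as the printf variant does. */
	fprintf(stderr, "sketch: NULL request dequeued, ignoring\n");
	return (SKETCH_SKIP);
}
```

Defaulting the knob to 0 keeps the more usable "log and continue" behaviour, while setting it to 1 would restore the panic for anyone trying to chase the underlying race.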