Bug 246279

Summary: ciss device driver not allowing more than 48 drives to be detected by the CAM layer
Product: Base System
Reporter: Peter Eriksson <pen>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Open
Severity: Affects Some People
CC: allanjude, imp, michael.osipov, rgrimes, sbruno, stable, zarychtam
Priority: ---
Version: 12.4-RELEASE   
Hardware: Any   
OS: Any   
See Also: https://reviews.freebsd.org/D25155
Attachments (Description: Flags):
Output from "camcontrol devlist": none
Output from "camcontrol devlist": none
Output from "cciss_vol_status": none
Patch to fix support for more than 48 drives in HBA mode: none

Description Peter Eriksson 2020-05-07 11:12:25 UTC
If I connect an HP D6020 external SAS disk enclosure to an HP H241 (HBA mode) and then connect each drawer to its own port on the H241, the system only sees the first 48 drives (and not all 70). In the output from 'camcontrol devlist' the last drive seen has "at scbus4 target 63 lun 0" and the first has "at scbus4 target 16 lun 0".

If I connect each drawer into separate H241 controllers then all 70 drives are visible.

According to HP documentation the H241 should handle up to 200 physical drives.

I've tried reading the ciss driver source code and I can't see any "obvious" limits/adjustable knobs but I might be missing something....

The "target 63" number feels like a 64-target limit somewhere.
(I'm guessing the first 16 are reserved for logical drives (which we don't use), which would be why numbering starts at 16.)

(Now I want to connect a second D6020 to the server(s) in question, so I can't use the "use two controller cards" solution anymore.)
Comment 1 Peter Eriksson 2020-05-08 17:56:10 UTC
Created attachment 214288 [details]
Output from "camcontrol devlist"
Comment 2 Peter Eriksson 2020-05-08 17:56:57 UTC
Created attachment 214289 [details]
Output from "camcontrol devlist"
Comment 3 Peter Eriksson 2020-05-08 17:58:56 UTC
Did some more testing today. Attached one of the new D6020 disk cabinets (with 70 SAS drives) to our test server. It looks like the CISS driver does see all 70 drives, but somewhere on the way up to the CAM layer 22 of them get lost :-)

# cciss_vol_status -V /dev/ciss0 | egrep 1200 | wc -l
      70

# camcontrol devlist | egrep 1200 | wc -l
      48

(Full output from cciss_vol_status & camcontrol added as attachments)
Comment 4 Peter Eriksson 2020-05-09 16:13:47 UTC
Created attachment 214315 [details]
Output from "cciss_vol_status"
Comment 5 Peter Eriksson 2020-05-09 16:45:01 UTC
Ok, with some bits of printf-debugging I found some suspect code in sys/dev/ciss/ciss.c:ciss_cam_action() at the "case XPT_PATH_INQ" section:

  cpi->max_target = sc->ciss_cfg->max_logical_supported;

Notice the "max logical logical volumes: 64" below?

ciss0: PERFORMANT Transport
ciss0:   0 logical drives configured
ciss0:   firmware 5.04
ciss0:   1 SCSI channels
ciss0:   signature 'CISS'
ciss0:   valence 3
ciss0:   supported I/O methods 0x7e000147<READY,simple,performant>
ciss0:   active I/O method 0x5<performant>
ciss0:   4G page base rx00000000
ciss0:   interrupt coalesce delay 0us
ciss0:   interrupt coalesce count 16
ciss0:   max outstanding commands 1024
ciss0:   bus types 0x200000
ciss0:   server name 'CZ3729EX3D'
ciss0:   heartbeat 0xc0
ciss0:   max logical logical volumes: 64
ciss0:   max physical disks supported: 384
ciss0:   max physical disks per logical volume: 128
ciss0:   JBOD Support is Available
ciss0:   JBOD Mode is Enabled
ciss0: 72 physical devices

(72 is 2 too many, but I guess the two extra are the storage drawers)

If I change that line to:

  cpi->max_target = sc->ciss_cfg->max_physical_supported;

then "camcontrol devlist" now shows 69 of 70 drives... Better but not 100% there.
Comment 6 Peter Eriksson 2020-05-09 19:00:15 UTC
It now probes targets up to around 373 or so (but it takes a looong time), and also detects the SES devices that were hidden before, since it only probed up to target 63...
                                                                                                                                                              
ses0 at ciss0 bus 33 scbus2 target 119 lun 0                                                                                                                       
ses0: <HPE D6020 1.63> Fixed Enclosure Services SPC-4 SCSI device                                                                                                  
ses0: Serial Number 7CE952P06X                                                                                                                                     
ses0: 135.168MB/s transfers                                                                                                                                        
ses0: SES Device                                                                                                                                                   
                                                                                                                                                                   
ses1 at ciss0 bus 33 scbus2 target 121 lun 0                                                                                                                       
ses1: <HPE D6020 1.63> Fixed Enclosure Services SPC-4 SCSI device                                                                                                  
ses1: Serial Number 7CE952P06X                                                                                                                                     
ses1: 135.168MB/s transfers                                                                                                                                        
ses1: SES Device                      

Due to the probing taking a loong time I also see timeouts:

> run_interrupt_driven_hooks: still waiting after 300 seconds for xpt_config         

and SCSI errors:

(probe0:ciss1:33:351:0): REPORT LUNS. CDB: a0 00 00 00 00 00 00 00 00 10 00 00
(probe0:ciss1:33:351:0): CAM status: CCB request completed with an error
(probe0:ciss1:33:351:0): Retrying command, 4 more tries remain
(probe0:ciss1:33:351:0): REPORT LUNS. CDB: a0 00 00 00 00 00 00 00 00 10 00 00
(probe0:ciss1:33:351:0): CAM status: CCB request completed with an error
(probe0:ciss1:33:351:0): Retrying command, 3 more tries remain

This code feels... broken :-)

Ah well, I'll see if I can modify the driver code to be a bit smarter about how many targets to probe. It really doesn't have to check them all since it knows how many targets there are (72 in my case) - it could stop after having detected that many...
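The idea of bounding the probe by what the controller actually reported can be sketched roughly as follows. This is only an illustration, not the attached patch: `struct phys_dev`, its `target` field and `highest_physical_target()` are hypothetical names, and the sketch assumes the driver already holds a list of the physical devices it discovered together with their target ids.

```c
#include <stddef.h>

/* Hypothetical record for one physical device reported by the controller. */
struct phys_dev {
    int target;   /* target id assigned to this device */
};

/*
 * Return the highest target id in use, or -1 if the list is empty.
 * A driver could report this value as the upper bound of the scan
 * instead of the controller's theoretical maximum.
 */
int
highest_physical_target(const struct phys_dev *devs, size_t ndevs)
{
    int max_id = -1;

    for (size_t i = 0; i < ndevs; i++) {
        if (devs[i].target > max_id)
            max_id = devs[i].target;
    }
    return max_id;
}
```

With a bound like this, the scan would end at the last target that is really present rather than running on to the "max physical disks supported" figure.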

(No wonder the Linux folks have replaced their cciss driver with a rewritten one called hpsa).

- Peter
Comment 7 Peter Eriksson 2020-05-09 19:32:47 UTC
The current code probably works reasonably well for cases where the controller is being used in RAID mode (where it works with logical LUNs).

Well, except that it probably fails to detect the SES devices on the D6020 cabinets.

But when used as a "dumb" HBA with more physical drives than the number of logical devices the controller handles (64 in my case for the H241 controller) it will always do the wrong thing.

And since it starts numbering physical targets at 16 (probably because the controller signals "0" as supported logical LUNs, since it's an HBA, and the driver code then uses a compile-time default of 16 intended for really old controllers) things become strange... So 64-16 = 48.
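The arithmetic above can be written out as a small sketch. The function name and parameters are illustrative only and are not taken from ciss.c; the sketch assumes physical targets are numbered consecutively from the first physical target id and that only ids below the limit are probed.

```c
/*
 * Count the physical drives that remain visible when physical targets are
 * numbered consecutively from first_target and only target ids below
 * target_limit are probed. Illustrative helper, not driver code.
 */
int
visible_drives(int first_target, int target_limit, int attached)
{
    int slots = target_limit - first_target;

    if (slots < 0)
        slots = 0;
    return attached < slots ? attached : slots;
}
```

With the numbers from this report, `visible_drives(16, 64, 70)` gives 48.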

Okidoki. Time for some code hacking :-)
Comment 8 Peter Eriksson 2020-05-09 23:19:10 UTC
Created attachment 214328 [details]
Patch to fix support for more than 48 drives in HBA mode

The attached patch will fix a couple of bugs in the current ciss driver code where it incorrectly enumerates physical drives if the controller is in JBOD mode.

There are two bugs/problems:

1. If you attach more physical drives to a controller than the number of logical volumes the controller supports (yes, really - totally wrong logic here) the additional drives will not be available, because the driver sets the max_target limit to the number of logical volumes, but the enumeration of hardware drives starts at 16. So for a controller that supports, say, 64 logical volumes, only the first (64-16) drives will be detected.

2. The code also sets the initiator_id to the same max logical volume number, so any physical drive that happens to have the same target number will silently be skipped...
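A minimal model of how the two problems interact, assuming (as described above) that targets beyond max_target are not probed and that a target whose id equals the initiator id is skipped. The names below are hypothetical and the loop is a simplification for illustration, not the CAM scan code itself; it also assumes the drives occupy consecutive target ids.

```c
/*
 * Count drives found by a simplified bus scan. Drives occupy consecutive
 * target ids starting at first_target; ids above max_target are not probed
 * and the id equal to initiator_id is skipped.
 */
int
drives_found(int first_target, int attached, int max_target, int initiator_id)
{
    int found = 0;

    for (int i = 0; i < attached; i++) {
        int target = first_target + i;

        if (target > max_target)
            break;              /* beyond the scanned range */
        if (target == initiator_id)
            continue;           /* collides with the initiator id */
        found++;
    }
    return found;
}
```

Under these assumptions, `drives_found(16, 70, 64, 64)` gives 48 and `drives_found(16, 70, 384, 64)` gives 69, which would be consistent with the counts reported in this thread.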

The patch also enables a little more verbosity.

This patch has been tested with HP H241 controllers in JBOD mode with 70 drives connected to a HP D6020 external SAS enclosure on FreeBSD 12.1-RELEASE-p3. 

This patch has not been tested with controllers in "RAID" mode but the patch should be compatible...
Comment 9 Peter Eriksson 2020-05-09 23:35:29 UTC
This part:

"but the enumeration of hardware drives starts at 16. So for a controller that support say 64 logical volumes, only the first (64-16) drives will be detected."

should probably read:

Depending on how disk enclosures enumerate drives, the exact number of drives allowed might differ. HP D6020 enclosures seem to start enumeration at 16, and with an HP H241 controller that supports 64 logical volumes only the first (64-16) drives will be detected. None of the SES "targets" (one per drawer in the D6020 enclosure) will be detected either, since they are listed last...
Comment 10 Mark Linimon freebsd_committer freebsd_triage 2020-05-10 01:53:04 UTC
Convert to modern way to indicate "patch".
Comment 11 Peter Eriksson 2020-05-10 13:27:23 UTC
Just verified that the patched driver also works on an old HP DL380G5 with an HP Smart Array P400 controller with four drives. Not much of a test, but at least that part behaves the same as before (the old P400 doesn't really support a true JBOD mode anyway).
Comment 12 Peter Eriksson 2020-06-06 14:35:41 UTC
Just a note that I've created a patch and submitted a diff at:

  https://reviews.freebsd.org/D25155

This diff fixes the problems mentioned here, plus the bug in 246280 (panic at unplug/replug of devices), plus a bug that prevents SES (storage enclosure services) from enumerating devices behind an HP controller (now "sesutil map" works). It also makes the /boot/loader.conf (and sysctl) tunables visible, and allows the driver to be more verbose at boot time (without having to set boot_verbose="YES" in /boot/loader.conf).
Comment 13 Peter Eriksson 2020-07-05 19:25:57 UTC
Hmm.. Is there something more I need to do for this patch? If the ciss driver isn't really maintained by anyone anymore then I could perhaps take over that responsibility. (We are using it in production and will probably keep on using it for at least a number of years more, so I have an incentive to keep it working.)
Comment 14 Andriy Gapon freebsd_committer freebsd_triage 2020-07-06 06:04:10 UTC
(In reply to Peter Eriksson from comment #13)
Please nudge mav and imp who approved the review request.
Comment 15 Warner Losh freebsd_committer freebsd_triage 2020-07-06 14:57:41 UTC
I'll see if I can get this committed. I have no ciss cards, however...
Comment 16 Peter Eriksson 2020-11-07 20:34:13 UTC
Any progress on this?
Comment 17 Peter Eriksson 2021-02-20 12:41:57 UTC
(In reply to Warner Losh from comment #15)

> I'll see if I can get this committed. I have no ciss cards, however...

I probably could send you a couple - an HP Smart HBA H241 (external SAS) and an old HP Smart Array P400 (internal controller in an old HP DL380g5) if that helps getting things committed :-)

Or is there something else I can be of assistance with?

- Peter
Comment 18 Peter Eriksson 2022-02-14 15:44:52 UTC
Any chance of getting this into 13-stable so it could be in 13.1-release? :-)

(So I could use the stock kernel in the future :-)
Comment 19 Marek Zarychta 2023-05-04 13:13:54 UTC
(In reply to Peter Eriksson from comment #18)

Thanks for the info. At least the problem is known, and the resolution too; perhaps it can be fixed to celebrate the 3rd birthday of the fixing patch?

Do you have an account on Phabricator, Peter? Would you mind creating a review there (https://reviews.freebsd.org/)? If not, perhaps someone can take care of that?
Comment 20 Peter Eriksson 2023-05-04 14:14:59 UTC
See comment #12 a bit up here :-)
Comment 21 Marek Zarychta 2023-05-08 14:30:42 UTC
(In reply to Peter Eriksson from comment #17)
Peter, could you please test the patch that is in the review again and give some feedback there?
Comment 22 Peter Eriksson 2023-05-08 15:28:30 UTC
Yes, I'll try to test it later today. I'll get back with some results.

- Peter
Comment 23 Peter Eriksson 2023-05-08 19:14:21 UTC
First test on a server with two HP H241 HBA cards with just 5 disks (in two boxes) running FreeBSD 12.4 with the full ciss.c from Phabricator - works fine.

Next I'll test it on another server with 140 disks on two H241 controllers (70 in each enclosure).


root@balur00:/boot # sysctl hw.ciss
hw.ciss.force_interrupt: 0
hw.ciss.force_transport: 0
hw.ciss.nop_message_heartbeat: 0
hw.ciss.expose_hidden_physical: 0
hw.ciss.verbose: 2
hw.ciss.base_transfer_speed: 135168
hw.ciss.initiator_id: -1


root@balur00:/boot # egrep ciss /var/run/dmesg.boot 
ciss0: <HP Smart Array H241> port 0x3000-0x30ff mem 0x95400000-0x954fffff,0x95500000-0x955003ff at device 0.0 numa-domain 0 on pci5
ciss0: PERFORMANT Transport
ciss0: Using 1 MSIX interrupt
ciss0: using 1024 of 1024 available commands
ciss0:   0 logical drives configured
ciss0:   firmware 7.00
ciss0:   1 SCSI channels
ciss0:   0 FC channels
ciss0:   0 enclosures
ciss0:   0 expanders
ciss0:   maximum blocks: 65535
ciss0:   controller clock: 18343
ciss0:   256 MB controller memory
ciss0:   signature 'CISS'
ciss0:   valence 3
ciss0:   supported I/O methods 0x7f000147<READY,simple,performant>
ciss0:   active I/O method 0x5<performant>
ciss0:   4G page base rx00000000
ciss0:   interrupt coalesce delay 0us
ciss0:   interrupt coalesce count 16
ciss0:   max outstanding commands 1024
ciss0:   bus types 0x200000
ciss0:   server name 'CZ3729EX3D'
ciss0:   heartbeat 0xb7
ciss0:   max logical volumes supported: 64
ciss0:   max physical drives supported: 384
ciss0:   max physical drives per logical volume: 128
ciss0:   JBOD Support is Available
ciss0:   JBOD Mode is Enabled
ciss0: 0 physical devices
ciss0: max physical target id: 0
ciss0: 0 logical drives
ciss1: <HP Smart Array H241> port 0x2000-0x20ff mem 0x95200000-0x952fffff,0x95300000-0x953003ff at device 0.0 numa-domain 0 on pci11
ciss1: PERFORMANT Transport
ciss1: Using 1 MSIX interrupt
ciss1: using 1024 of 1024 available commands
ciss1:   0 logical drives configured
ciss1:   firmware 7.00
ciss1:   1 SCSI channels
ciss1:   0 FC channels
ciss1:   2 enclosures
ciss1:   2 expanders
ciss1:   maximum blocks: 65535
ciss1:   controller clock: 18486
ciss1:   256 MB controller memory
ciss1:   signature 'CISS'
ciss1:   valence 3
ciss1:   supported I/O methods 0x7f000147<READY,simple,performant>
ciss1:   active I/O method 0x5<performant>
ciss1:   4G page base rx00000000
ciss1:   interrupt coalesce delay 0us
ciss1:   interrupt coalesce count 16
ciss1:   max outstanding commands 1024
ciss1:   bus types 0x200000
ciss1:   server name 'CZ3729EX3D'
ciss1:   heartbeat 0xb9
ciss1:   max logical volumes supported: 64
ciss1:   max physical drives supported: 384
ciss1:   max physical drives per logical volume: 128
ciss1:   JBOD Support is Available
ciss1:   JBOD Mode is Enabled
ciss1: 7 physical devices
ciss1: max physical target id: 120
ciss1: 0 logical drives
Root mount waiting for:ses0 at ciss1 bus 33 scbus3 target 119 lun 0
ses1 at ciss1 bus 33 scbus3 target 120 lun 0
da2 at ciss1 bus 32 scbus2 target 83 lun 0
da3 at ciss1 bus 32 scbus2 target 84 lun 0
da4 at ciss1 bus 32 scbus2 target 85 lun 0
da1 at ciss1 bus 32 scbus2 target 50 lun 0
uhub3: da0 at ciss1 bus 32 scbus2 target 49 lun 0


root@balur00:/boot # cciss_vol_status -V /dev/ciss1
Controller: Smart HBA H241
  Board ID: 0x21c8103c
  Logical drives: 0
  Running firmware: 7.00
  ROM firmware: 7.00
  Physical drives: 5
         connector 1E box 1 bay 34                 HP      MB010000JWAYK                                    7PH8MXKG     HPD5 OK
         connector 1E box 1 bay 35                 HP      MB010000JWAYK                                    7PH4AJ9G     HPD5 OK
         connector 2E box 1 bay 33                 HP      MB010000JWAYK                                    7PH816MG     HPD5 OK
         connector 2E box 1 bay 34                 HP      MB010000JWAYK                                    7PH8G5PG     HPD5 OK
         connector 2E box 1 bay 35                 HP      MB010000JWAYK                                    7PGTUTHG     HPD5 OK
/dev/ciss1: (Smart HBA H241) Enclosure D6020 (S/N: 7CE714P009) on Bus 2, Physical Port 1E status: OK.
/dev/ciss1: (Smart HBA H241) Enclosure D6020 (S/N: 7CE714P009) on Bus 3, Physical Port 2E status: OK.
/dev/ciss1(Smart HBA H241:0): Non-Volatile Cache status:
                   Cache configured: No

root@balur00:/boot # sesutil -u /dev/ses0 show
ses0: <HPE D6020 2.74>; ID: 5001438030884b80
Desc     Dev     Model                     Ident                Size/Status
{"Name":"Drive bay"} -       -                         -                    Not Installed
{"Name":"DriveBay1"} -       -                         -                    Not Installed
{"Name":"DriveBay2"} -       -                         -                    Not Installed
{"Name":"DriveBay3"} -       -                         -                    Not Installed
{"Name":"DriveBay4"} -       -                         -                    Not Installed
{"Name":"DriveBay5"} -       -                         -                    Not Installed
{"Name":"DriveBay6"} -       -                         -                    Not Installed
{"Name":"DriveBay7"} -       -                         -                    Not Installed
{"Name":"DriveBay8"} -       -                         -                    Not Installed
{"Name":"DriveBay9"} -       -                         -                    Not Installed
{"Name":"DriveBay10"}    -       -                         -                    Not Installed
{"Name":"DriveBay11"}    -       -                         -                    Not Installed
{"Name":"DriveBay12"}    -       -                         -                    Not Installed
{"Name":"DriveBay13"}    -       -                         -                    Not Installed
{"Name":"DriveBay14"}    -       -                         -                    Not Installed
{"Name":"DriveBay15"}    -       -                         -                    Not Installed
{"Name":"DriveBay16"}    -       -                         -                    Not Installed
{"Name":"DriveBay17"}    -       -                         -                    Not Installed
{"Name":"DriveBay18"}    -       -                         -                    Not Installed
{"Name":"DriveBay19"}    -       -                         -                    Not Installed
{"Name":"DriveBay20"}    -       -                         -                    Not Installed
{"Name":"DriveBay21"}    -       -                         -                    Not Installed
{"Name":"DriveBay22"}    -       -                         -                    Not Installed
{"Name":"DriveBay23"}    -       -                         -                    Not Installed
{"Name":"DriveBay24"}    -       -                         -                    Not Installed
{"Name":"DriveBay25"}    -       -                         -                    Not Installed
{"Name":"DriveBay26"}    -       -                         -                    Not Installed
{"Name":"DriveBay27"}    -       -                         -                    Not Installed
{"Name":"DriveBay28"}    -       -                         -                    Not Installed
{"Name":"DriveBay29"}    -       -                         -                    Not Installed
{"Name":"DriveBay30"}    -       -                         -                    Not Installed
{"Name":"DriveBay31"}    -       -                         -                    Not Installed
{"Name":"DriveBay32"}    -       -                         -                    Not Installed
{"Name":"DriveBay33"}    da2     HP MB010000JWAYK          7PH816MG             10T
{"Name":"DriveBay34"}    da3     HP MB010000JWAYK          7PH8G5PG             10T
{"Name":"DriveBay35"}    da4     HP MB010000JWAYK          7PGTUTHG             10T

Temperatures: {"Name":"Temperature sensor"}   : 42 C, {"Name":"LocalIoModule-Sensor[0]"}  : 31 C, {"Name":"LocalIoModule-Sensor[1]"}  : 38 C, {"Name":"LocalExpander-CpuSensor[0]"}   : 42 C, {"Name":"PowerSupply[3]-InletSensor[0]"}: 28 C, {"Name":"PowerSupply[3]-Sensor[0]"} : 32 C, {"Name":"PowerSupply[4]-InletSensor[0]"}: 27 C, {"Name":"PowerSupply[4]-Sensor[0]"} : 31 C, {"Name":"Backplane-Sensor[0]"}  : 25 C, {"Name":"Backplane-Sensor[1]"}  : 23 C, {"Name":"Backplane-Sensor[2]"}  : 24 C, {"Name":"Backplane-Sensor[3]"}  : 27 C, {"Name":"Backplane-Sensor[4]"}  : 24 C, {"Name":"Backplane-Sensor[5]"}  : 23 C, {"Name":"DisplayBoard-Sensor[0]"}   : 25 C


root@balur00:/boot # sesutil -u /dev/ses1 show
ses1: <HPE D6020 2.74>; ID: 5001438030894600
Desc     Dev     Model                     Ident                Size/Status
{"Name":"Drive bay"} -       -                         -                    Not Installed
{"Name":"DriveBay1"} -       -                         -                    Not Installed
{"Name":"DriveBay2"} -       -                         -                    Not Installed
{"Name":"DriveBay3"} -       -                         -                    Not Installed
{"Name":"DriveBay4"} -       -                         -                    Not Installed
{"Name":"DriveBay5"} -       -                         -                    Not Installed
{"Name":"DriveBay6"} -       -                         -                    Not Installed
{"Name":"DriveBay7"} -       -                         -                    Not Installed
{"Name":"DriveBay8"} -       -                         -                    Not Installed
{"Name":"DriveBay9"} -       -                         -                    Not Installed
{"Name":"DriveBay10"}    -       -                         -                    Not Installed
{"Name":"DriveBay11"}    -       -                         -                    Not Installed
{"Name":"DriveBay12"}    -       -                         -                    Not Installed
{"Name":"DriveBay13"}    -       -                         -                    Not Installed
{"Name":"DriveBay14"}    -       -                         -                    Not Installed
{"Name":"DriveBay15"}    -       -                         -                    Not Installed
{"Name":"DriveBay16"}    -       -                         -                    Not Installed
{"Name":"DriveBay17"}    -       -                         -                    Not Installed
{"Name":"DriveBay18"}    -       -                         -                    Not Installed
{"Name":"DriveBay19"}    -       -                         -                    Not Installed
{"Name":"DriveBay20"}    -       -                         -                    Not Installed
{"Name":"DriveBay21"}    -       -                         -                    Not Installed
{"Name":"DriveBay22"}    -       -                         -                    Not Installed
{"Name":"DriveBay23"}    -       -                         -                    Not Installed
{"Name":"DriveBay24"}    -       -                         -                    Not Installed
{"Name":"DriveBay25"}    -       -                         -                    Not Installed
{"Name":"DriveBay26"}    -       -                         -                    Not Installed
{"Name":"DriveBay27"}    -       -                         -                    Not Installed
{"Name":"DriveBay28"}    -       -                         -                    Not Installed
{"Name":"DriveBay29"}    -       -                         -                    Not Installed
{"Name":"DriveBay30"}    -       -                         -                    Not Installed
{"Name":"DriveBay31"}    -       -                         -                    Not Installed
{"Name":"DriveBay32"}    -       -                         -                    Not Installed
{"Name":"DriveBay33"}    -       -                         -                    Not Installed
{"Name":"DriveBay34"}    da0     HP MB010000JWAYK          7PH8MXKG             10T
{"Name":"DriveBay35"}    da1     HP MB010000JWAYK          7PH4AJ9G             10T

Temperatures: {"Name":"Temperature sensor"}   : 43 C, {"Name":"LocalIoModule-Sensor[0]"}  : 34 C, {"Name":"LocalIoModule-Sensor[1]"}  : 39 C, {"Name":"LocalExpander-CpuSensor[0]"}   : 43 C, {"Name":"PowerSupply[1]-InletSensor[0]"}: 32 C, {"Name":"PowerSupply[1]-Sensor[0]"} : 43 C, {"Name":"PowerSupply[2]-InletSensor[0]"}: 33 C, {"Name":"PowerSupply[2]-Sensor[0]"} : 38 C, {"Name":"Backplane-Sensor[0]"}  : 20 C, {"Name":"Backplane-Sensor[1]"}  : 20 C, {"Name":"Backplane-Sensor[2]"}  : 22 C, {"Name":"Backplane-Sensor[3]"}  : 23 C, {"Name":"Backplane-Sensor[4]"}  : 21 C, {"Name":"Backplane-Sensor[5]"}  : 20 C, {"Name":"DisplayBoard-Sensor[0]"}   : 18 C
Comment 24 Peter Eriksson 2023-05-08 19:50:57 UTC
Result on the bigger server: 

Booted, but then I got a long list of CAM errors ending with an "ADAPTER HEARTBEAT FAILED".

...
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da123:ciss2:32:63:0): READ(6). CDB: 08 00 01 40 08 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da123:ciss2:32:63:0): READ(10). CDB: 28 00 00 00 03 00 00 01 00 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 2 more tries remain
(da123:ciss2:32:63:0): READ(6). CDB: 08 00 01 38 08 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da123:ciss2:32:63:0): READ(6). CDB: 08 00 01 30 08 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 01 ce 8c 30 28 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 01 ce 8c 31 28 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 01 ce 8c 32 28 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 01 ce 8c 33 28 00 00 00 38 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
ciss2: ADAPTER HEARTBEAT FAILED

(I could log in and see all disks with "camcontrol devlist", but it started displaying these errors when zfs was importing the pools.)

The da116 and da123 disks are probably bad, but it would be nice if the ciss controller would handle bad drives a bit better in general.

I had to replace the HP H241 controller with an LSI SAS3816 controller on my other "big" HP server since that H241/ciss controller would go into a loop and just spew out retry errors every day or so. The LSI controller handles the bad drives better...
Comment 25 Peter Eriksson 2023-05-08 20:00:33 UTC
Rebooted and tried with hw.ciss.nop_message_heartbeat=1, then logged in via ssh and ran "sesutil show" (before zfs had started importing pools); it then panicked with:

login: (da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 48 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
ciss2: *** Hot-plug drive removed, Port=2E Box=1 Bay=6 SN=            5PGTSWYC
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
ciss2: *** Physical drive failure, Port=2E Box=1 Bay=6 reason=0x14
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 05 74 ff ff 00 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 50 b0 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 03 00 00 01 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 05 74 ff fd 00 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 40 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 38 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 28 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 30 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 20 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10.
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 05 74 ff ff 00 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: SCSI Status Error
(da116:ciss2:32:56:0): SCSI status: Check Condition
(da116:ciss2:32:56:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da116:ciss2:32:56:0): Error 6, Unretryable error
(da116:ciss2:32:56:0): Invalidating pack
(da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10.
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 03 00 00 01 00 00 
(da116:ciss2:32:56:0): CAM status: SCSI Status Error
(da116:ciss2:32:56:0): SCSI status: Check Condition
(da116:ciss2:32:56:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da116:ciss2:32:56:0): Error 6, Unretryable error
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 05 74 ff fd 00 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: SCSI Status Error
(da116:ciss2:32:56:0): SCSI status: Check Condition
(da116:ciss2:32:56:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da116:ciss2:32:56:0): Error 6, Unretryable error
(da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10.
(da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10.
(da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10.
(da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10.
da116 at ciss2 bus 32 scbus7 target 56 lun 0
da116: <HP MB012000JWDFD HPD2>  s/n 5PGTSWYC detached
May  8 21:56:32 balur03 ZFS[4858]: vdev probe failure, zpool=$FILUR02 path=$/dev/diskid/DISK-5PGTSWYC
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 20 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 48 00 00 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 28 00 00 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 30 00 00 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 38 00 00 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 40 00 00 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 50 00 00 b0 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated


Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x180
fault code              = supervisor read data, page not present
(da116:ciss2:32:56:0): Periph destroyed
instruction pointer     = 0x20:0xffffffff8256e51d
stack pointer           = 0x28:0xfffffe058a1d59d0
frame pointer           = 0x28:0xfffffe058a1d5a40
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 45 (solthread 0xfffffff)
trap number             = 12
panic: page fault
cpuid = 3
time = 1683575795
KDB: stack backtrace:
#0 0xffffffff80c325f5 at kdb_backtrace+0x65
#1 0xffffffff80be89c8 at vpanic+0x178
#2 0xffffffff80be8843 at panic+0x43
#3 0xffffffff8110408f at trap_fatal+0x38f
#4 0xffffffff811040df at trap_pfault+0x4f
#5 0xffffffff810dbd58 at calltrap+0x8
#6 0xffffffff8256e49a at vdev_dtl_reassess+0x5a
#7 0xffffffff8256e49a at vdev_dtl_reassess+0x5a
#8 0xffffffff825639e2 at spa_vdev_state_exit+0x42
#9 0xffffffff8255d04f at spa_async_thread_vd+0x17f
#10 0xffffffff80ba969e at fork_exit+0x7e
#11 0xffffffff810dcd8e at fork_trampoline+0xe
System is going down.
Uptime: 4m40s

I've seen things you people wouldn't believe.
Attack ships on fire off the shoulder of Orion.
I watched C-beams glitter in the dark near the
Tannhäuser Gate. All those moments will be lost
in time, like tears in rain.

Time to die.
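
For anyone following the log, the CDBs above can be decoded by hand. The following is a minimal Python sketch, not part of the ciss driver or the patch, that extracts the LBA and transfer length from the READ/WRITE CDBs. It assumes the standard SCSI block command layouts for the 6-, 10- and 16-byte variants; the function name decode_cdb is made up for illustration.

```python
# Hypothetical helper for reading the log; not taken from the driver or the patch.
OPCODES = {
    0x08: "READ(6)", 0x0A: "WRITE(6)",
    0x28: "READ(10)", 0x2A: "WRITE(10)",
    0x88: "READ(16)", 0x8A: "WRITE(16)",
}


def decode_cdb(hex_bytes):
    """Decode a READ/WRITE CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(hex_bytes)
    if not cdb or cdb[0] not in OPCODES:
        raise ValueError("unsupported or empty CDB")
    opcode = cdb[0]
    if opcode in (0x08, 0x0A):
        if len(cdb) != 6:
            raise ValueError("6-byte CDB expected")
        # 21-bit LBA: low 5 bits of byte 1, then bytes 2 and 3.
        lba = int.from_bytes(cdb[1:4], "big") & 0x1FFFFF
        # A length byte of 0 means 256 blocks for the 6-byte commands.
        blocks = cdb[4] or 256
    elif opcode in (0x28, 0x2A):
        if len(cdb) != 10:
            raise ValueError("10-byte CDB expected")
        lba = int.from_bytes(cdb[2:6], "big")      # 32-bit LBA
        blocks = int.from_bytes(cdb[7:9], "big")   # 16-bit transfer length
    else:
        if len(cdb) != 16:
            raise ValueError("16-byte CDB expected")
        lba = int.from_bytes(cdb[2:10], "big")     # 64-bit LBA
        blocks = int.from_bytes(cdb[10:14], "big") # 32-bit transfer length
    return {"op": OPCODES[opcode], "lba": lba, "blocks": blocks}


if __name__ == "__main__":
    for text in (
        "08 00 01 20 08 00",
        "28 00 00 00 01 48 00 00 08 00",
        "88 00 00 00 00 05 74 ff ff 00 00 00 01 00 00 00",
    ):
        print(decode_cdb(text))
```

Applied to the log, the READ(10) with CDB "28 00 00 00 01 48 00 00 08 00" is a read of 8 blocks at LBA 328, and the READ(16) for da116 is a read of 256 blocks at LBA 0x574ffff00.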
Comment 26 Peter Eriksson 2023-05-08 20:48:31 UTC
On the third attempt I just let it keep trying, without running "sesutil show", and it eventually printed:

....
(da116:ciss2:32:56:0): SCSI status: Check Condition
(da116:ciss2:32:56:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da116:ciss2:32:56:0): Error 6, Unretryable error
da116 at ciss2 bus 32 scbus7 target 56 lun 0
da116: <HP MB012000JWDFD HPD2>  s/n 5PGTSWYC detached
May  8 22:19:40 balur03 ZFS[22519]: vdev probe failure, zpool=$FILUR02 path=$/dev/diskid/DISK-5PGTSWYC
(da116:ciss2:32:56:0): Periph destroyed
(da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 39 58 00 00 01 00 00 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 38 58 00 00 01 00 00 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 37 58 00 00 01 00 00 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 39 58 00 00 01 00 00 00 
(da123:ciss2:32:63:0): CAM status: SCSI Status Error
(da123:ciss2:32:63:0): SCSI status: Check Condition
(da123:ciss2:32:63:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da123:ciss2:32:63:0): Command Specific Info: 0x2060000
(da123:ciss2:32:63:0): Error 6, Unretryable error
(da123:ciss2:32:63:0): Invalidating pack
(da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 38 58 00 00 01 00 00 00 
(da123:ciss2:32:63:0): CAM status: SCSI Status Error
(da123:ciss2:32:63:0): SCSI status: Check Condition
(da123:ciss2:32:63:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da123:ciss2:32:63:0): Command Specific Info: 0x2060000
(da123:ciss2:32:63:0): Error 6, Unretryable error
(da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 37 58 00 00 01 00 00 00 
(da123:ciss2:32:63:0): CAM status: SCSI Status Error
(da123:ciss2:32:63:0): SCSI status: Check Condition
(da123:ciss2:32:63:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da123:ciss2:32:63:0): Command Specific Info: 0x2060000
(da123:ciss2:32:63:0): Error 6, Unretryable error
da123 at ciss2 bus 32 scbus7 target 63 lun 0
da123: <HP MB012000JWDFD HPD2>  s/n 5PGU79TE detached
May  8 22:26:52 balur03 ZFS[38178]: vdev probe failure, zpool=$FILUR07 path=$/dev/diskid/DISK-5PGU79TE
(da123:ciss2:32:63:0): Periph destroyed

And then zfs started mounting the filesystems.

root@balur03# df -h | wc -l
  169281

It'll be interesting to see if it survives :-)


The fact that the "sesutil" command seemed to trigger a panic might indicate a problem somewhere, but I'm not sure it's in the ciss driver.
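
Console output like the above gets long quickly, so it can help to condense it per device. Below is a minimal Python sketch that counts the terminal CAM messages for each peripheral. It assumes the "(periph:sim:bus:target:lun): message" line format seen in this report; the names LINE_RE, SAMPLE and summarize are made up for illustration and nothing here comes from the driver or the patch.

```python
import re
from collections import Counter, defaultdict

# Matches lines such as "(da116:ciss2:32:56:0): Error 6, Unretryable error".
LINE_RE = re.compile(
    r"^\((?P<periph>[a-z]+\d+):(?P<sim>[a-z]+\d+):"
    r"(?P<bus>\d+):(?P<target>\d+):(?P<lun>\d+)\): (?P<msg>.*)$"
)

# A few lines copied from the output above, used as sample input.
SAMPLE = """\
(da116:ciss2:32:56:0): SCSI status: Check Condition
(da116:ciss2:32:56:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da116:ciss2:32:56:0): Error 6, Unretryable error
da116 at ciss2 bus 32 scbus7 target 56 lun 0
(da116:ciss2:32:56:0): Periph destroyed
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da123:ciss2:32:63:0): Error 6, Unretryable error
(da123:ciss2:32:63:0): Invalidating pack
"""


def summarize(lines):
    """Count sense, error and teardown messages per peripheral name."""
    summary = defaultdict(Counter)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a per-device CAM message
        msg = m["msg"]
        if msg.startswith(("SCSI sense:", "Error ")) or msg in (
            "Invalidating pack",
            "Periph destroyed",
        ):
            summary[m["periph"]][msg] += 1
    return summary


if __name__ == "__main__":
    for periph, counts in sorted(summarize(SAMPLE.splitlines()).items()):
        for msg, n in counts.most_common():
            print(f"{periph}: {n} x {msg}")
```

Feeding it a saved copy of the console log should make it easier to see which disks only produced retries and which ones were actually invalidated and destroyed.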
Comment 27 Marek Zarychta 2023-05-08 21:54:17 UTC
Thanks for making the effort to test it again, Peter. Without the patch you authored, the tests would end sooner, with no panic and 40 instead of 140 drives seen. That's a large setup and the solution is not ideal, but it looks like a step forward.

In the meantime I played a bit with the ciss(4) settings and found that setting hw.ciss.nop_message_heartbeat="1", which you complained about in comment 25, also breaks things for me, regardless of whether the patch from D25155 is applied. On the other hand, I find the new sysctl knob "hw.ciss.verbose" introduced by your patch quite useful for operations like drive replacement:

[7372] ciss0: *** Hot-plug drive removed, Port=1I Box=1 Bay=3 SN=            PLXXXXX
[7372] ciss0: *** Physical drive failure, Port=1I Box=1 Bay=3
[7372] ciss0: *** State change, logical drive 2, new state=FAILED
[7372] ciss0: logical drive 2 (da2) changed status OK->failed, spare status 0x0
[7372] da2 at ciss0 bus 0 scbus0 target 2 lun 0
[7372] (da2:ciss0:0:2:0): Periph destroyed
[7444] ciss0: *** Hot-plug drive inserted, Port=1I Box=1 Bay=3 SN=            PLXXXXX
[7444] ciss0: *** Media exchanged detected, logical drive 2
[7444] ciss0: logical drive 2 () media exchanged, ready to go online
[7444] ciss0: *** State change, logical drive 2, new state=OK
[7444] ciss0: logical drive (b0t2): RAID 0, 139776MB online
[7444] ciss0: logical drive 2 () changed status failed->OK, spare status 0x0
[7444] ciss0: logical drive (b0t2): RAID 0, 139776MB online
[7444] da2 at ciss0 bus 0 scbus0 target 2 lun 0

Anyway, the patch proposed in review D25155 fixes a couple of things, has been tested in both JBOD and RAID mode, and does not seem to introduce any regressions.
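
As an illustration of how the verbose output can be used, here is a minimal Python sketch that pulls the hot-plug events out of such a log. It assumes the message format shown above stays as it is; EVENT_RE, SAMPLE and hotplug_events are hypothetical names for this example only.

```python
import re

# Matches the "Hot-plug drive removed/inserted" lines printed with hw.ciss.verbose.
EVENT_RE = re.compile(
    r"^\[(?P<stamp>\d+)\] (?P<ctrl>ciss\d+): \*\*\* Hot-plug drive "
    r"(?P<action>removed|inserted), "
    r"Port=(?P<port>\S+) Box=(?P<box>\d+) Bay=(?P<bay>\d+)"
)

# Lines copied from the log above, used as sample input.
SAMPLE = """\
[7372] ciss0: *** Hot-plug drive removed, Port=1I Box=1 Bay=3 SN=            PLXXXXX
[7372] ciss0: *** Physical drive failure, Port=1I Box=1 Bay=3
[7444] ciss0: *** Hot-plug drive inserted, Port=1I Box=1 Bay=3 SN=            PLXXXXX
[7444] ciss0: *** State change, logical drive 2, new state=OK
"""


def hotplug_events(lines):
    """Return one dict per hot-plug removal or insertion, in log order."""
    events = []
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if m is None:
            continue  # some other message
        events.append({
            "stamp": int(m["stamp"]),  # bracketed value, units not confirmed
            "ctrl": m["ctrl"],
            "action": m["action"],
            "slot": (m["port"], int(m["box"]), int(m["bay"])),
        })
    return events


if __name__ == "__main__":
    for event in hotplug_events(SAMPLE.splitlines()):
        print(event)
```

If the bracketed values are kernel timestamps in seconds (an assumption on my side), the drive swap in the example took about 72 seconds.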
Comment 28 Peter Eriksson 2023-05-09 06:22:43 UTC
Actually, with

  hw.ciss.nop_message_heartbeat="0"

my first test ended with 

  ciss2: ADAPTER HEARTBEAT FAILED

and then things deadlocked.



With:

  hw.ciss.nop_message_heartbeat="1"

and as long as I didn't provoke it with "sesutil show", all disks were detected correctly, the two dead ones were failed after a while, and all zfs filesystems got mounted. The server has now been running for 10 hours without problems.

(Going to remove the problem disks later today)

hw.ciss.verbose = "1"
hw.ciss.nop_message_heartbeat = "1"
hw.ciss.base_transfer_speed = "1200000"

- Peter