Bug 246279 - ciss device driver not allowing more than 48 drives to be detected by the CAM layer
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.4-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Warner Losh
Reported: 2020-05-07 11:12 UTC by Peter Eriksson
Modified: 2024-11-25 05:21 UTC (History)

See Also:
linimon: mfc-stable14?
linimon: mfc-stable13?


Attachments
Output from "camcontrol devlist" (4.08 KB, text/plain)
2020-05-08 17:56 UTC, Peter Eriksson
Output from "camcontrol devlist" (4.08 KB, text/plain)
2020-05-08 17:56 UTC, Peter Eriksson
Output from "cciss_vol_status" (9.23 KB, text/plain)
2020-05-09 16:13 UTC, Peter Eriksson
Patch to fix support for more than 48 drives in HBA mode (3.60 KB, patch)
2020-05-09 23:19 UTC, Peter Eriksson

Description Peter Eriksson 2020-05-07 11:12:25 UTC
If I connect an HP D6020 external SAS disk enclosure to an HP H241 (HBA mode), with each drawer connected to its own port on the H241, the system sees only the first 48 drives (not all 70). In the output from 'camcontrol devlist', the last drive seen is "at scbus4 target 63 lun 0" and the first is "at scbus4 target 16 lun 0".

If I connect each drawer into separate H241 controllers then all 70 drives are visible.

According to HP documentation the H241 should handle up to 200 physical drives.

I've tried reading the ciss driver source code and I can't see any "obvious" limits/adjustable knobs but I might be missing something....

The "target 63" number feels like a 64-target limit somewhere.
(I'm guessing the first 16 targets are reserved for logical drives, which we don't use, and that's why numbering starts at 16.)

(Now I want to connect a second D6020 to the server(s) in question, so I can't use the "use two controller cards" solution anymore.)
Comment 1 Peter Eriksson 2020-05-08 17:56:10 UTC
Created attachment 214288 [details]
Output from "camcontrol devlist"
Comment 2 Peter Eriksson 2020-05-08 17:56:57 UTC
Created attachment 214289 [details]
Output from "camcontrol devlist"
Comment 3 Peter Eriksson 2020-05-08 17:58:56 UTC
Did some more testing today. Attached one of the new D6020 disk cabinets (with 70 SAS drives) to our test server. It looks like the CISS driver does see all 70 drives, but somewhere on the way up to the CAM layer 22 of them get lost :-)

# cciss_vol_status -V /dev/ciss0 | egrep 1200 | wc -l
      70

# camcontrol devlist | egrep 1200 | wc -l
      48

(Full output from cciss_vol_status & camcontrol added as attachments)
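
The two pipelines above simply count output lines that mention the drive model string. A minimal Python sketch of the same comparison, assuming the command output has been saved as text. The sample lines are made up for illustration and are not taken from the attachments:

```python
import re

def count_matching(text: str, pattern: str) -> int:
    """Count lines that match pattern, like `egrep PATTERN | wc -l`."""
    regex = re.compile(pattern)
    return sum(1 for line in text.splitlines() if regex.search(line))

# Hypothetical sample lines, only to exercise the function.
sample = "\n".join([
    "<HP MB1200 A>  at scbus4 target 16 lun 0 (pass0,da0)",
    "<HP MB1200 A>  at scbus4 target 17 lun 0 (pass1,da1)",
    "<HP OTHER  B>  at scbus4 target 18 lun 0 (pass2,da2)",
])

print(count_matching(sample, r"1200"))
```

Running the same function over the saved cciss_vol_status and camcontrol outputs would give the two numbers being compared.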
Comment 4 Peter Eriksson 2020-05-09 16:13:47 UTC
Created attachment 214315 [details]
Output from "cciss_vol_status"
Comment 5 Peter Eriksson 2020-05-09 16:45:01 UTC
Ok, with some bits of printf-debugging I found some suspect code in sys/dev/ciss/ciss.c:ciss_cam_action() at the "case XPT_PATH_INQ" section:

  cpi->max_target = sc->ciss_cfg->max_logical_supported;

Notice the "max logical logical volumes: 64" below?

ciss0: PERFORMANT Transport
ciss0:   0 logical drives configured
ciss0:   firmware 5.04
ciss0:   1 SCSI channels
ciss0:   signature 'CISS'
ciss0:   valence 3
ciss0:   supported I/O methods 0x7e000147<READY,simple,performant>
ciss0:   active I/O method 0x5<performant>
ciss0:   4G page base rx00000000
ciss0:   interrupt coalesce delay 0us
ciss0:   interrupt coalesce count 16
ciss0:   max outstanding commands 1024
ciss0:   bus types 0x200000
ciss0:   server name 'CZ3729EX3D'
ciss0:   heartbeat 0xc0
ciss0:   max logical logical volumes: 64
ciss0:   max physical disks supported: 384
ciss0:   max physical disks per logical volume: 128
ciss0:   JBOD Support is Available
ciss0:   JBOD Mode is Enabled
ciss0: 72 physical devices

(72 is 2 too many, but I guess the two extra are the storage drawers)

If I change that line to:

  cpi->max_target = sc->ciss_cfg->max_physical_supported;

then "camcontrol devlist" now shows 69 of 70 drives... Better, but not 100% there.
Comment 6 Peter Eriksson 2020-05-09 19:00:15 UTC
It now probes targets up to around 373 or so (but it takes a looong time), and also detects the SES devices that had been hidden before, since it only probed up to target 63...
                                                                                                                                                              
ses0 at ciss0 bus 33 scbus2 target 119 lun 0                                                                                                                       
ses0: <HPE D6020 1.63> Fixed Enclosure Services SPC-4 SCSI device                                                                                                  
ses0: Serial Number 7CE952P06X                                                                                                                                     
ses0: 135.168MB/s transfers                                                                                                                                        
ses0: SES Device                                                                                                                                                   
                                                                                                                                                                   
ses1 at ciss0 bus 33 scbus2 target 121 lun 0                                                                                                                       
ses1: <HPE D6020 1.63> Fixed Enclosure Services SPC-4 SCSI device                                                                                                  
ses1: Serial Number 7CE952P06X                                                                                                                                     
ses1: 135.168MB/s transfers                                                                                                                                        
ses1: SES Device                      

Due to the probing taking a loong time I also see timeouts:

> run_interrupt_driven_hooks: still waiting after 300 seconds for xpt_config         

and SCSI errors:

(probe0:ciss1:33:351:0): REPORT LUNS. CDB: a0 00 00 00 00 00 00 00 00 10 00 00
(probe0:ciss1:33:351:0): CAM status: CCB request completed with an error
(probe0:ciss1:33:351:0): Retrying command, 4 more tries remain
(probe0:ciss1:33:351:0): REPORT LUNS. CDB: a0 00 00 00 00 00 00 00 00 10 00 00
(probe0:ciss1:33:351:0): CAM status: CCB request completed with an error
(probe0:ciss1:33:351:0): Retrying command, 3 more tries remain

This code feels... broken :-)

Ah well, I'll see if I can modify the driver code to be a bit smarter about how many targets to probe. It really doesn't have to check them all since it knows how many targets there are (72 in my case); it could stop after having detected that many...
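
The "stop once all known targets are found" idea can be illustrated with a small Python model. This is only a sketch of the counting argument, not driver code: the linear scan, the upper bound of 383 and the consecutive target layout are all assumptions made for the example.

```python
def probes_needed(present_targets, max_target, expected_count=None):
    """Return (found, probes) for a linear scan of targets 0..max_target.

    If expected_count is given, the scan stops as soon as that many
    devices have been found.
    """
    present = set(present_targets)
    found = []
    probes = 0
    for target in range(max_target + 1):
        probes += 1
        if target in present:
            found.append(target)
            if expected_count is not None and len(found) >= expected_count:
                break
    return found, probes

# Hypothetical layout: 72 devices on consecutive targets starting at 16.
devices = range(16, 16 + 72)
_, full_scan = probes_needed(devices, max_target=383)
_, early_stop = probes_needed(devices, max_target=383, expected_count=72)
print(full_scan, early_stop)
```

In this model the early stop cuts the number of probed targets from 384 to 88; the real saving depends on how the enclosure actually assigns target IDs.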

(No wonder the Linux folks have replaced their cciss driver with a rewritten one called hpsa).

- Peter
Comment 7 Peter Eriksson 2020-05-09 19:32:47 UTC
The current code probably works reasonably well for cases where the controller is being used in RAID mode (where it works with logical LUNs).

Well, except that it probably fails to detect the SES devices on the D6020 cabinets.

But when used as a "dumb" HBA with more physical drives than the controller supports logical devices (64 in my case for the H241 controller) it will always do the wrong thing.

And since it starts numbering physical targets at 16 (probably because the controller signals "0" as the number of supported logical LUNs, since it's an HBA, and the driver code then falls back to a compile-time default of 16 intended for really old controllers) things become strange... So 64-16 = 48.
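
The 64-16 = 48 arithmetic can be checked with a short Python model. It assumes the drives sit on consecutive target IDs starting at 16 and that targets at or above the limit are never reached, which matches target 63 being the last one seen in the description; the function name is made up for this sketch.

```python
def visible_targets(first_physical_target: int, max_target: int, drives: int):
    """Targets a scan bounded by max_target can reach, assuming drives
    occupy consecutive target IDs from first_physical_target upward."""
    assigned = range(first_physical_target, first_physical_target + drives)
    return [t for t in assigned if t < max_target]

seen = visible_targets(first_physical_target=16, max_target=64, drives=70)
print(len(seen), seen[0], seen[-1])
```

With these inputs the model reports 48 visible drives on targets 16 through 63.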

Okidoki. Time for some code hacking :-)
Comment 8 Peter Eriksson 2020-05-09 23:19:10 UTC
Created attachment 214328 [details]
Patch to fix support for more than 48 drives in HBA mode

The attached patch will fix a couple of bugs in the current ciss driver code where it incorrectly enumerates physical drives if the controller is in JBOD mode.

There are two bugs/problems:

1. If you attach more physical drives to a controller than the number of logical volumes the controller supports (yes, really - totally wrong logic here), the additional drives will not be available, because the driver sets the max_target limit to the number of logical volumes while the enumeration of hardware drives starts at 16. So for a controller that supports, say, 64 logical volumes, only the first (64-16) drives will be detected.

2. The code also sets the initiator_id to the same max logical volume number, so any physical drive that happens to have the same target number will silently be skipped...
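
How the two problems interact can be sketched with a toy bus scan in Python. This is a model under stated assumptions, not the CAM code: the scan visits targets 0..max_target and skips the adapter's own ID, the 70 drives are placed on targets 16..85, and the value 384 used for the "fixed" initiator ID is purely hypothetical.

```python
def scan(present_targets, max_target, initiator_id):
    """Model of a bus scan: visit targets 0..max_target, skipping the
    ID the adapter claims for itself."""
    present = set(present_targets)
    return [t for t in range(max_target + 1)
            if t != initiator_id and t in present]

drives = list(range(16, 86))  # hypothetical: 70 drives on targets 16..85

# Behaviour as described in the list: both values tied to the logical-volume limit.
old = scan(drives, max_target=64, initiator_id=64)
# Raising only max_target still leaves a hole at the initiator ID.
partial = scan(drives, max_target=383, initiator_id=64)
# Hypothetical fix: initiator ID moved outside the range used by drives.
fixed = scan(drives, max_target=383, initiator_id=384)

print(len(old), len(partial), len(fixed))
```

The model yields 48, 69 and 70 drives. Those numbers line up with the 48 and "69 of 70" reported earlier in this bug, but that is an observation about the model, not something confirmed here.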

The patch also enables a little more verbosity.

This patch has been tested with HP H241 controllers in JBOD mode with 70 drives connected to a HP D6020 external SAS enclosure on FreeBSD 12.1-RELEASE-p3. 

This patch has not been tested with controllers in "RAID" mode but the patch should be compatible...
Comment 9 Peter Eriksson 2020-05-09 23:35:29 UTC
This part:

"but the enumeration of hardware drives starts at 16. So for a controller that support say 64 logical volumes, only the first (64-16) drives will be detected."

should probably read:

Depending on how disk enclosures enumerate drives, the exact number of drives allowed might differ. HP D6020 enclosures seem to start enumeration at 16, and with an HP H241 controller that supports 64 logical volumes only the first (64-16) drives will be detected. None of the SES "targets" (one per drawer in the D6020 enclosure) will be detected either, since they are listed last...
Comment 10 Mark Linimon freebsd_committer freebsd_triage 2020-05-10 01:53:04 UTC
Convert to modern way to indicate "patch".
Comment 11 Peter Eriksson 2020-05-10 13:27:23 UTC
Just verified that the patched driver also works on an old HP DL380G5 with an HP Smart Array P400 controller and four drives. Not much of a test, but at least that part behaves the same as before (the old P400 doesn't really support a true JBOD mode anyway).
Comment 12 Peter Eriksson 2020-06-06 14:35:41 UTC
Just a note that I've created a review with the diff at:

  https://reviews.freebsd.org/D25155

This diff fixes the problems mentioned here, plus the bug in 246280 (panic at unplug/replug of devices), plus a bug that prevents SES (storage enclosure services) from enumerating devices behind an HP controller (now "sesutil map" works). It also makes the /boot/loader.conf (and sysctl) tunables visible, and allows the driver to be more verbose at boot time (without having to set boot_verbose="YES" in /boot/loader.conf).
Comment 13 Peter Eriksson 2020-07-05 19:25:57 UTC
Hmm.. Is there something more I need to do for this patch? If the ciss driver isn't really maintained by anyone anymore then I could perhaps take over that responsibility. (We are using it in production and will probably keep using it for at least a number of years more, so I'll have an incentive to keep it working.)
Comment 14 Andriy Gapon freebsd_committer freebsd_triage 2020-07-06 06:04:10 UTC
(In reply to Peter Eriksson from comment #13)
Please nudge mav and imp who approved the review request.
Comment 15 Warner Losh freebsd_committer freebsd_triage 2020-07-06 14:57:41 UTC
I'll see if I can get this committed. I have no ciss cards, however...
Comment 16 Peter Eriksson 2020-11-07 20:34:13 UTC
Any progress on this?
Comment 17 Peter Eriksson 2021-02-20 12:41:57 UTC
(In reply to Warner Losh from comment #15)

> I'll see if I can get this committed. I have no ciss cards, however...

I could probably send you a couple: an HP Smart HBA H241 (external SAS) and an old HP Smart Array P400 (internal controller in an old HP DL380g5), if that helps getting things committed :-)

Or is there something else I can be of assistance with?

- Peter
Comment 18 Peter Eriksson 2022-02-14 15:44:52 UTC
Any chance of getting this into 13-stable so it could be in 13.1-release? :-)

(So I could use the stock kernel in the future :-)
Comment 19 Marek Zarychta 2023-05-04 13:13:54 UTC
(In reply to Peter Eriksson from comment #18)

Thanks for the info. At least the problem is known, and the resolution too; perhaps it can be fixed to celebrate the 3rd birthday of the patch?

Do you have an account on Phabricator, Peter? Would you mind creating a review there (https://reviews.freebsd.org/)? If not, perhaps someone can take care of that?
Comment 20 Peter Eriksson 2023-05-04 14:14:59 UTC
See comment #12 a bit up here :-)
Comment 21 Marek Zarychta 2023-05-08 14:30:42 UTC
(In reply to Peter Eriksson from comment #17)
Peter, could you please test the patch that is in the review again and give some feedback there?
Comment 22 Peter Eriksson 2023-05-08 15:28:30 UTC
Yes, I'll try to test it later today. I'll get back with some results.

- Peter
Comment 23 Peter Eriksson 2023-05-08 19:14:21 UTC
First test on a server with two HP H241 HBA cards and just 5 disks (in two boxes) running FreeBSD 12.4 with the full ciss.c from Phabricator: works fine.

Next I'll test it on another server with 140 disks on two H241 controllers (70 in each enclosure).


root@balur00:/boot # sysctl hw.ciss
hw.ciss.force_interrupt: 0
hw.ciss.force_transport: 0
hw.ciss.nop_message_heartbeat: 0
hw.ciss.expose_hidden_physical: 0
hw.ciss.verbose: 2
hw.ciss.base_transfer_speed: 135168
hw.ciss.initiator_id: -1


root@balur00:/boot # egrep ciss /var/run/dmesg.boot 
ciss0: <HP Smart Array H241> port 0x3000-0x30ff mem 0x95400000-0x954fffff,0x95500000-0x955003ff at device 0.0 numa-domain 0 on pci5
ciss0: PERFORMANT Transport
ciss0: Using 1 MSIX interrupt
ciss0: using 1024 of 1024 available commands
ciss0:   0 logical drives configured
ciss0:   firmware 7.00
ciss0:   1 SCSI channels
ciss0:   0 FC channels
ciss0:   0 enclosures
ciss0:   0 expanders
ciss0:   maximum blocks: 65535
ciss0:   controller clock: 18343
ciss0:   256 MB controller memory
ciss0:   signature 'CISS'
ciss0:   valence 3
ciss0:   supported I/O methods 0x7f000147<READY,simple,performant>
ciss0:   active I/O method 0x5<performant>
ciss0:   4G page base rx00000000
ciss0:   interrupt coalesce delay 0us
ciss0:   interrupt coalesce count 16
ciss0:   max outstanding commands 1024
ciss0:   bus types 0x200000
ciss0:   server name 'CZ3729EX3D'
ciss0:   heartbeat 0xb7
ciss0:   max logical volumes supported: 64
ciss0:   max physical drives supported: 384
ciss0:   max physical drives per logical volume: 128
ciss0:   JBOD Support is Available
ciss0:   JBOD Mode is Enabled
ciss0: 0 physical devices
ciss0: max physical target id: 0
ciss0: 0 logical drives
ciss1: <HP Smart Array H241> port 0x2000-0x20ff mem 0x95200000-0x952fffff,0x95300000-0x953003ff at device 0.0 numa-domain 0 on pci11
ciss1: PERFORMANT Transport
ciss1: Using 1 MSIX interrupt
ciss1: using 1024 of 1024 available commands
ciss1:   0 logical drives configured
ciss1:   firmware 7.00
ciss1:   1 SCSI channels
ciss1:   0 FC channels
ciss1:   2 enclosures
ciss1:   2 expanders
ciss1:   maximum blocks: 65535
ciss1:   controller clock: 18486
ciss1:   256 MB controller memory
ciss1:   signature 'CISS'
ciss1:   valence 3
ciss1:   supported I/O methods 0x7f000147<READY,simple,performant>
ciss1:   active I/O method 0x5<performant>
ciss1:   4G page base rx00000000
ciss1:   interrupt coalesce delay 0us
ciss1:   interrupt coalesce count 16
ciss1:   max outstanding commands 1024
ciss1:   bus types 0x200000
ciss1:   server name 'CZ3729EX3D'
ciss1:   heartbeat 0xb9
ciss1:   max logical volumes supported: 64
ciss1:   max physical drives supported: 384
ciss1:   max physical drives per logical volume: 128
ciss1:   JBOD Support is Available
ciss1:   JBOD Mode is Enabled
ciss1: 7 physical devices
ciss1: max physical target id: 120
ciss1: 0 logical drives
Root mount waiting for:ses0 at ciss1 bus 33 scbus3 target 119 lun 0
ses1 at ciss1 bus 33 scbus3 target 120 lun 0
da2 at ciss1 bus 32 scbus2 target 83 lun 0
da3 at ciss1 bus 32 scbus2 target 84 lun 0
da4 at ciss1 bus 32 scbus2 target 85 lun 0
da1 at ciss1 bus 32 scbus2 target 50 lun 0
uhub3: da0 at ciss1 bus 32 scbus2 target 49 lun 0
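
For comparing boots, the per-controller device count and highest target ID can be pulled out of the dmesg text with a few lines of Python. This is a sketch that assumes the two line formats shown above ("N physical devices" and "max physical target id: N"); the function name is made up.

```python
import re

def summarize_ciss(dmesg_text: str) -> dict:
    """Collect per-controller device count and highest target ID from
    lines like 'ciss1: 7 physical devices'."""
    summary = {}
    count_re = re.compile(r"^(ciss\d+): (\d+) physical devices$")
    max_re = re.compile(r"^(ciss\d+): max physical target id: (\d+)$")
    for line in dmesg_text.splitlines():
        line = line.strip()
        m = count_re.match(line)
        if m:
            summary.setdefault(m.group(1), {})["physical_devices"] = int(m.group(2))
            continue
        m = max_re.match(line)
        if m:
            summary.setdefault(m.group(1), {})["max_target"] = int(m.group(2))
    return summary

# Lines copied from the dmesg output above.
sample = """\
ciss0: 0 physical devices
ciss0: max physical target id: 0
ciss1: 7 physical devices
ciss1: max physical target id: 120
"""
print(summarize_ciss(sample))
```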


root@balur00:/boot # cciss_vol_status -V /dev/ciss1
Controller: Smart HBA H241
  Board ID: 0x21c8103c
  Logical drives: 0
  Running firmware: 7.00
  ROM firmware: 7.00
  Physical drives: 5
         connector 1E box 1 bay 34                 HP      MB010000JWAYK                                    7PH8MXKG     HPD5 OK
         connector 1E box 1 bay 35                 HP      MB010000JWAYK                                    7PH4AJ9G     HPD5 OK
         connector 2E box 1 bay 33                 HP      MB010000JWAYK                                    7PH816MG     HPD5 OK
         connector 2E box 1 bay 34                 HP      MB010000JWAYK                                    7PH8G5PG     HPD5 OK
         connector 2E box 1 bay 35                 HP      MB010000JWAYK                                    7PGTUTHG     HPD5 OK
/dev/ciss1: (Smart HBA H241) Enclosure D6020 (S/N: 7CE714P009) on Bus 2, Physical Port 1E status: OK.
/dev/ciss1: (Smart HBA H241) Enclosure D6020 (S/N: 7CE714P009) on Bus 3, Physical Port 2E status: OK.
/dev/ciss1(Smart HBA H241:0): Non-Volatile Cache status:
                   Cache configured: No

root@balur00:/boot # sesutil -u /dev/ses0 show
ses0: <HPE D6020 2.74>; ID: 5001438030884b80
Desc     Dev     Model                     Ident                Size/Status
{"Name":"Drive bay"} -       -                         -                    Not Installed
{"Name":"DriveBay1"} -       -                         -                    Not Installed
{"Name":"DriveBay2"} -       -                         -                    Not Installed
{"Name":"DriveBay3"} -       -                         -                    Not Installed
{"Name":"DriveBay4"} -       -                         -                    Not Installed
{"Name":"DriveBay5"} -       -                         -                    Not Installed
{"Name":"DriveBay6"} -       -                         -                    Not Installed
{"Name":"DriveBay7"} -       -                         -                    Not Installed
{"Name":"DriveBay8"} -       -                         -                    Not Installed
{"Name":"DriveBay9"} -       -                         -                    Not Installed
{"Name":"DriveBay10"}    -       -                         -                    Not Installed
{"Name":"DriveBay11"}    -       -                         -                    Not Installed
{"Name":"DriveBay12"}    -       -                         -                    Not Installed
{"Name":"DriveBay13"}    -       -                         -                    Not Installed
{"Name":"DriveBay14"}    -       -                         -                    Not Installed
{"Name":"DriveBay15"}    -       -                         -                    Not Installed
{"Name":"DriveBay16"}    -       -                         -                    Not Installed
{"Name":"DriveBay17"}    -       -                         -                    Not Installed
{"Name":"DriveBay18"}    -       -                         -                    Not Installed
{"Name":"DriveBay19"}    -       -                         -                    Not Installed
{"Name":"DriveBay20"}    -       -                         -                    Not Installed
{"Name":"DriveBay21"}    -       -                         -                    Not Installed
{"Name":"DriveBay22"}    -       -                         -                    Not Installed
{"Name":"DriveBay23"}    -       -                         -                    Not Installed
{"Name":"DriveBay24"}    -       -                         -                    Not Installed
{"Name":"DriveBay25"}    -       -                         -                    Not Installed
{"Name":"DriveBay26"}    -       -                         -                    Not Installed
{"Name":"DriveBay27"}    -       -                         -                    Not Installed
{"Name":"DriveBay28"}    -       -                         -                    Not Installed
{"Name":"DriveBay29"}    -       -                         -                    Not Installed
{"Name":"DriveBay30"}    -       -                         -                    Not Installed
{"Name":"DriveBay31"}    -       -                         -                    Not Installed
{"Name":"DriveBay32"}    -       -                         -                    Not Installed
{"Name":"DriveBay33"}    da2     HP MB010000JWAYK          7PH816MG             10T
{"Name":"DriveBay34"}    da3     HP MB010000JWAYK          7PH8G5PG             10T
{"Name":"DriveBay35"}    da4     HP MB010000JWAYK          7PGTUTHG             10T

Temperatures: {"Name":"Temperature sensor"}   : 42 C, {"Name":"LocalIoModule-Sensor[0]"}  : 31 C, {"Name":"LocalIoModule-Sensor[1]"}  : 38 C, {"Name":"LocalExpander-CpuSensor[0]"}   : 42 C, {"Name":"PowerSupply[3]-InletSensor[0]"}: 28 C, {"Name":"PowerSupply[3]-Sensor[0]"} : 32 C, {"Name":"PowerSupply[4]-InletSensor[0]"}: 27 C, {"Name":"PowerSupply[4]-Sensor[0]"} : 31 C, {"Name":"Backplane-Sensor[0]"}  : 25 C, {"Name":"Backplane-Sensor[1]"}  : 23 C, {"Name":"Backplane-Sensor[2]"}  : 24 C, {"Name":"Backplane-Sensor[3]"}  : 27 C, {"Name":"Backplane-Sensor[4]"}  : 24 C, {"Name":"Backplane-Sensor[5]"}  : 23 C, {"Name":"DisplayBoard-Sensor[0]"}   : 25 C


root@balur00:/boot # sesutil -u /dev/ses1 show
ses1: <HPE D6020 2.74>; ID: 5001438030894600
Desc     Dev     Model                     Ident                Size/Status
{"Name":"Drive bay"} -       -                         -                    Not Installed
{"Name":"DriveBay1"} -       -                         -                    Not Installed
{"Name":"DriveBay2"} -       -                         -                    Not Installed
{"Name":"DriveBay3"} -       -                         -                    Not Installed
{"Name":"DriveBay4"} -       -                         -                    Not Installed
{"Name":"DriveBay5"} -       -                         -                    Not Installed
{"Name":"DriveBay6"} -       -                         -                    Not Installed
{"Name":"DriveBay7"} -       -                         -                    Not Installed
{"Name":"DriveBay8"} -       -                         -                    Not Installed
{"Name":"DriveBay9"} -       -                         -                    Not Installed
{"Name":"DriveBay10"}    -       -                         -                    Not Installed
{"Name":"DriveBay11"}    -       -                         -                    Not Installed
{"Name":"DriveBay12"}    -       -                         -                    Not Installed
{"Name":"DriveBay13"}    -       -                         -                    Not Installed
{"Name":"DriveBay14"}    -       -                         -                    Not Installed
{"Name":"DriveBay15"}    -       -                         -                    Not Installed
{"Name":"DriveBay16"}    -       -                         -                    Not Installed
{"Name":"DriveBay17"}    -       -                         -                    Not Installed
{"Name":"DriveBay18"}    -       -                         -                    Not Installed
{"Name":"DriveBay19"}    -       -                         -                    Not Installed
{"Name":"DriveBay20"}    -       -                         -                    Not Installed
{"Name":"DriveBay21"}    -       -                         -                    Not Installed
{"Name":"DriveBay22"}    -       -                         -                    Not Installed
{"Name":"DriveBay23"}    -       -                         -                    Not Installed
{"Name":"DriveBay24"}    -       -                         -                    Not Installed
{"Name":"DriveBay25"}    -       -                         -                    Not Installed
{"Name":"DriveBay26"}    -       -                         -                    Not Installed
{"Name":"DriveBay27"}    -       -                         -                    Not Installed
{"Name":"DriveBay28"}    -       -                         -                    Not Installed
{"Name":"DriveBay29"}    -       -                         -                    Not Installed
{"Name":"DriveBay30"}    -       -                         -                    Not Installed
{"Name":"DriveBay31"}    -       -                         -                    Not Installed
{"Name":"DriveBay32"}    -       -                         -                    Not Installed
{"Name":"DriveBay33"}    -       -                         -                    Not Installed
{"Name":"DriveBay34"}    da0     HP MB010000JWAYK          7PH8MXKG             10T
{"Name":"DriveBay35"}    da1     HP MB010000JWAYK          7PH4AJ9G             10T

Temperatures: {"Name":"Temperature sensor"}   : 43 C, {"Name":"LocalIoModule-Sensor[0]"}  : 34 C, {"Name":"LocalIoModule-Sensor[1]"}  : 39 C, {"Name":"LocalExpander-CpuSensor[0]"}   : 43 C, {"Name":"PowerSupply[1]-InletSensor[0]"}: 32 C, {"Name":"PowerSupply[1]-Sensor[0]"} : 43 C, {"Name":"PowerSupply[2]-InletSensor[0]"}: 33 C, {"Name":"PowerSupply[2]-Sensor[0]"} : 38 C, {"Name":"Backplane-Sensor[0]"}  : 20 C, {"Name":"Backplane-Sensor[1]"}  : 20 C, {"Name":"Backplane-Sensor[2]"}  : 22 C, {"Name":"Backplane-Sensor[3]"}  : 23 C, {"Name":"Backplane-Sensor[4]"}  : 21 C, {"Name":"Backplane-Sensor[5]"}  : 20 C, {"Name":"DisplayBoard-Sensor[0]"}   : 18 C
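
To cross-check the sesutil listing against cciss_vol_status, the populated bays can be extracted with a small Python parser. This is a sketch assuming the row layout shown above, with columns separated by two or more spaces; the names ROW and populated_bays are made up for the example.

```python
import re

ROW = re.compile(r'^\{"Name":"(?P<bay>[^"]+)"\}\s+(?P<dev>\S+)\s+(?P<rest>.*)$')

def populated_bays(sesutil_text: str) -> dict:
    """Map bay name -> (device, model, serial) for bays that hold a disk."""
    bays = {}
    for line in sesutil_text.splitlines():
        m = ROW.match(line.strip())
        if not m or m.group("dev") == "-":
            continue  # header, temperature line, or empty bay
        fields = re.split(r"\s{2,}", m.group("rest").strip())
        if len(fields) >= 2:
            bays[m.group("bay")] = (m.group("dev"), fields[0], fields[1])
    return bays

# Rows copied from the ses0 output above.
sample = """\
{"Name":"DriveBay32"}    -       -                         -                    Not Installed
{"Name":"DriveBay33"}    da2     HP MB010000JWAYK          7PH816MG             10T
{"Name":"DriveBay34"}    da3     HP MB010000JWAYK          7PH8G5PG             10T
"""
print(populated_bays(sample))
```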
Comment 24 Peter Eriksson 2023-05-08 19:50:57 UTC
Result on the bigger server: 

Booted, but then I got a long list of CAM errors ending with "ADAPTER HEARTBEAT FAILED".

...
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da123:ciss2:32:63:0): READ(6). CDB: 08 00 01 40 08 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da123:ciss2:32:63:0): READ(10). CDB: 28 00 00 00 03 00 00 01 00 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 2 more tries remain
(da123:ciss2:32:63:0): READ(6). CDB: 08 00 01 38 08 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da123:ciss2:32:63:0): READ(6). CDB: 08 00 01 30 08 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 01 ce 8c 30 28 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 01 ce 8c 31 28 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 01 ce 8c 32 28 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 01 ce 8c 33 28 00 00 00 38 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
ciss2: ADAPTER HEARTBEAT FAILED

(I could log in and see all disks with "camcontrol devlist", but it started displaying these errors when zfs was importing the pools.)

The da116 and da123 disks are probably bad, but it would be nice if the ciss controller would handle bad drives a bit better in general.

I had to replace the HP H241 controller with an LSI SAS3816 controller on my other "big" HP server since that H241/ciss controller would go into a loop and just spew out retry errors every day or so. The LSI controller handles the bad drives better...
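
To see which disks dominate a retry storm like the one above, the CAM error lines can be tallied per device. A minimal Python sketch, assuming the "(periph:sim:bus:target:lun): CAM status: ..." line format seen in the log; the helper name is made up and the sample is a shortened mix of lines from the excerpt.

```python
import re
from collections import Counter

CAM_ERR = re.compile(
    r"^\((?P<periph>[a-z]+\d+):(?P<sim>[a-z]+\d+):\d+:\d+:\d+\): "
    r"CAM status: CCB request completed with an error$"
)

def errors_per_device(log_text: str) -> Counter:
    """Count 'CCB request completed with an error' lines per peripheral."""
    counts = Counter()
    for line in log_text.splitlines():
        m = CAM_ERR.match(line.strip())
        if m:
            counts[m.group("periph")] += 1
    return counts

sample = """\
(da123:ciss2:32:63:0): READ(6). CDB: 08 00 01 40 08 00
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 01 ce 8c 30 28 00 00 01 00 00 00
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
"""
print(errors_per_device(sample).most_common())
```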
Comment 25 Peter Eriksson 2023-05-08 20:00:33 UTC
Rebooted and tried with hw.ciss.nop_message_heartbeat=1, then logged in via ssh and ran "sesutil show" (before zfs had started importing pools); it then panicked with:

login: (da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 48 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
ciss2: *** Hot-plug drive removed, Port=2E Box=1 Bay=6 SN=            5PGTSWYC
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
ciss2: *** Physical drive failure, Port=2E Box=1 Bay=6 reason=0x14
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 05 74 ff ff 00 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 50 b0 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 03 00 00 01 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 05 74 ff fd 00 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 40 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 38 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 28 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 30 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 20 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Retrying command, 3 more tries remain
(da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10.
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 05 74 ff ff 00 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: SCSI Status Error
(da116:ciss2:32:56:0): SCSI status: Check Condition
(da116:ciss2:32:56:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da116:ciss2:32:56:0): Error 6, Unretryable error
(da116:ciss2:32:56:0): Invalidating pack
(da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10.
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 03 00 00 01 00 00 
(da116:ciss2:32:56:0): CAM status: SCSI Status Error
(da116:ciss2:32:56:0): SCSI status: Check Condition
(da116:ciss2:32:56:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da116:ciss2:32:56:0): Error 6, Unretryable error
(da116:ciss2:32:56:0): READ(16). CDB: 88 00 00 00 00 05 74 ff fd 00 00 00 01 00 00 00 
(da116:ciss2:32:56:0): CAM status: SCSI Status Error
(da116:ciss2:32:56:0): SCSI status: Check Condition
(da116:ciss2:32:56:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da116:ciss2:32:56:0): Error 6, Unretryable error
(da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10.
(da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10.
(da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10.
(da116:ciss2:32:56:0): READ(6)/WRITE(6) not supported, increasing minimum_cmd_size to 10.
da116 at ciss2 bus 32 scbus7 target 56 lun 0
da116: <HP MB012000JWDFD HPD2>  s/n 5PGTSWYC detached
May  8 21:56:32 balur03 ZFS[4858]: vdev probe failure, zpool=$FILUR02 path=$/dev/diskid/DISK-5PGTSWYC
(da116:ciss2:32:56:0): READ(6). CDB: 08 00 01 20 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 48 00 00 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 28 00 00 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 30 00 00 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 38 00 00 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 40 00 00 08 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated
(da116:ciss2:32:56:0): READ(10). CDB: 28 00 00 00 01 50 00 00 b0 00 
(da116:ciss2:32:56:0): CAM status: CCB request completed with an error
(da116:ciss2:32:56:0): Error 5, Periph was invalidated


Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x180
fault code              = supervisor read data, page not present
(da116:ciss2:32:56:0): Periph destroyed
instruction pointer     = 0x20:0xffffffff8256e51d
stack pointer           = 0x28:0xfffffe058a1d59d0
frame pointer           = 0x28:0xfffffe058a1d5a40
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 45 (solthread 0xfffffff)
trap number             = 12
panic: page fault
cpuid = 3
time = 1683575795
KDB: stack backtrace:
#0 0xffffffff80c325f5 at kdb_backtrace+0x65
#1 0xffffffff80be89c8 at vpanic+0x178
#2 0xffffffff80be8843 at panic+0x43
#3 0xffffffff8110408f at trap_fatal+0x38f
#4 0xffffffff811040df at trap_pfault+0x4f
#5 0xffffffff810dbd58 at calltrap+0x8
#6 0xffffffff8256e49a at vdev_dtl_reassess+0x5a
#7 0xffffffff8256e49a at vdev_dtl_reassess+0x5a
#8 0xffffffff825639e2 at spa_vdev_state_exit+0x42
#9 0xffffffff8255d04f at spa_async_thread_vd+0x17f
#10 0xffffffff80ba969e at fork_exit+0x7e
#11 0xffffffff810dcd8e at fork_trampoline+0xe
System is going down.
Uptime: 4m40s

I've seen things you people wouldn't believe.
Attack ships on fire off the shoulder of Orion.
I watched C-beams glitter in the dark near the
Tannhäuser Gate. All those moments will be lost
in time, like tears in rain.

Time to die.
Comment 26 Peter Eriksson 2023-05-08 20:48:31 UTC
On the third attempt I just let it keep trying, without running "sesutil show", and it eventually printed:

....
(da116:ciss2:32:56:0): SCSI status: Check Condition
(da116:ciss2:32:56:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da116:ciss2:32:56:0): Error 6, Unretryable error
da116 at ciss2 bus 32 scbus7 target 56 lun 0
da116: <HP MB012000JWDFD HPD2>  s/n 5PGTSWYC detached
May  8 22:19:40 balur03 ZFS[22519]: vdev probe failure, zpool=$FILUR02 path=$/dev/diskid/DISK-5PGTSWYC
(da116:ciss2:32:56:0): Periph destroyed
(da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 39 58 00 00 01 00 00 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 38 58 00 00 01 00 00 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 37 58 00 00 01 00 00 00 
(da123:ciss2:32:63:0): CAM status: CCB request completed with an error
(da123:ciss2:32:63:0): Retrying command, 3 more tries remain
(da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 39 58 00 00 01 00 00 00 
(da123:ciss2:32:63:0): CAM status: SCSI Status Error
(da123:ciss2:32:63:0): SCSI status: Check Condition
(da123:ciss2:32:63:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da123:ciss2:32:63:0): Command Specific Info: 0x2060000
(da123:ciss2:32:63:0): Error 6, Unretryable error
(da123:ciss2:32:63:0): Invalidating pack
(da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 38 58 00 00 01 00 00 00 
(da123:ciss2:32:63:0): CAM status: SCSI Status Error
(da123:ciss2:32:63:0): SCSI status: Check Condition
(da123:ciss2:32:63:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da123:ciss2:32:63:0): Command Specific Info: 0x2060000
(da123:ciss2:32:63:0): Error 6, Unretryable error
(da123:ciss2:32:63:0): WRITE(16). CDB: 8a 00 00 00 00 03 1b 00 37 58 00 00 01 00 00 00 
(da123:ciss2:32:63:0): CAM status: SCSI Status Error
(da123:ciss2:32:63:0): SCSI status: Check Condition
(da123:ciss2:32:63:0): SCSI sense: ILLEGAL REQUEST asc:25,0 (Logical unit not supported)
(da123:ciss2:32:63:0): Command Specific Info: 0x2060000
(da123:ciss2:32:63:0): Error 6, Unretryable error
da123 at ciss2 bus 32 scbus7 target 63 lun 0
da123: <HP MB012000JWDFD HPD2>  s/n 5PGU79TE detached
May  8 22:26:52 balur03 ZFS[38178]: vdev probe failure, zpool=$FILUR07 path=$/dev/diskid/DISK-5PGU79TE
(da123:ciss2:32:63:0): Periph destroyed

And then zfs started mounting the filesystems.

root@balur03# df -h | wc -l
  169281

It'll be interesting to see if it survives :-)


The fact that the "sesutil" command seemed to trigger a panic might indicate a problem somewhere, but I'm not sure it's in the ciss driver.
Comment 27 Marek Zarychta 2023-05-08 21:54:17 UTC
Thanks for making the effort to test it again, Peter. Without the patch you authored, the tests would end sooner with no panic and 40 instead of 140 drives seen. That's a large setup and the solution is not ideal, but it looks like a step forward.

In the meantime I played a bit with the ciss(4) settings and found that setting hw.ciss.nop_message_heartbeat="1", which you complained about in comment 25, also breaks things for me regardless of whether the patch from D25155 is applied. On the other hand, I find the new sysctl knob "hw.ciss.verbose" introduced by your patch quite useful for operations like drive replacement:

[7372] ciss0: *** Hot-plug drive removed, Port=1I Box=1 Bay=3 SN=            PLXXXXX
[7372] ciss0: *** Physical drive failure, Port=1I Box=1 Bay=3
[7372] ciss0: *** State change, logical drive 2, new state=FAILED
[7372] ciss0: logical drive 2 (da2) changed status OK->failed, spare status 0x0
[7372] da2 at ciss0 bus 0 scbus0 target 2 lun 0
[7372] (da2:ciss0:0:2:0): Periph destroyed
[7444] ciss0: *** Hot-plug drive inserted, Port=1I Box=1 Bay=3 SN=            PLXXXXX
[7444] ciss0: *** Media exchanged detected, logical drive 2
[7444] ciss0: logical drive 2 () media exchanged, ready to go online
[7444] ciss0: *** State change, logical drive 2, new state=OK
[7444] ciss0: logical drive (b0t2): RAID 0, 139776MB online
[7444] ciss0: logical drive 2 () changed status failed->OK, spare status 0x0
[7444] ciss0: logical drive (b0t2): RAID 0, 139776MB online
[7444] da2 at ciss0 bus 0 scbus0 target 2 lun 0

Anyway, the patch proposed in review D25155 fixes a couple of things, was tested in JBOD and RAID mode and seems to not introduce any regression.
Comment 28 Peter Eriksson 2023-05-09 06:22:43 UTC
Actually, with

  hw.ciss.nop_message_heartbeat="0"

my first test ended with 

  ciss2: ADAPTER HEARTBEAT FAILED

and then things deadlocked.



With:

  hw.ciss.nop_message_heartbeat="1"

and without provoking it with "sesutil show", all disks were detected correctly, the two dead ones were failed after a while, and all ZFS filesystems got mounted. The server has now been running for 10 hours without problems.

(Going to remove the problem disks later today)

hw.ciss.verbose = "1"
hw.ciss.nop_message_heartbeat = "1"
hw.ciss.base_transfer_speed = "1200000"

- Peter
Comment 29 Warner Losh freebsd_committer freebsd_triage 2024-10-14 05:32:28 UTC
OK. I've staged the rebased patch.
I broke it down into tiny bites in case there are problems.
I've also noted my extreme reservations about making the cr == NULL
panic just a printf. There's some race that's causing it (I think
with where we set the shutdown flag racing another thread that's setting it w/o ciss_mtx held). Ideally, someone would chase that to ground who had the hardware and the time to do so, but I do not.
This was reviewed, though, by mav@ as well as myself (though he reviewed a prior version).
Most of the change is otherwise uncontroversial.
However, if there are problems, I'm going to be quick to back out.
If everything builds, I'll push the changes soon.
Comment 30 commit-hook freebsd_committer freebsd_triage 2024-10-14 05:41:51 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=45645518ea19ccb4761aee3a525aab2f323d37d4

commit 45645518ea19ccb4761aee3a525aab2f323d37d4
Author:     Peter Eriksson <pen@lysator.liu.se>
AuthorDate: 2024-10-14 04:01:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-14 05:24:15 +0000

    ciss: Add max physical target

    Add support for tracking the maximum physical target and using that to
    override the maximum logical target.

    PR: 246279
    Reviewed by: imp
    Tested by: Marek Zarychta
    Differential Revision: https://reviews.freebsd.org/D25155

 sys/dev/ciss/ciss.c    | 11 ++++++++++-
 sys/dev/ciss/cissvar.h |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)
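
For readers skimming the commit list, here is a minimal userland sketch of the idea behind this commit: remember the highest physical target seen during the scan and report the larger of that value and the logical limit. The names `hba_state`, `note_physical_target` and `max_target_for_cam` are hypothetical and are not taken from ciss.c; the real change may be shaped differently.

```c
/* Hypothetical stand-in for the driver state; not the real ciss(4) softc. */
struct hba_state {
    int max_logical_target;   /* limit derived from the logical drive config */
    int max_physical_target;  /* highest physical target seen while scanning */
};

/* Called once per physical drive found during the scan. */
static void note_physical_target(struct hba_state *sc, int target)
{
    if (target > sc->max_physical_target)
        sc->max_physical_target = target;
}

/* Value the sketch would hand to the SCSI layer as the highest target ID. */
static int max_target_for_cam(const struct hba_state *sc)
{
    if (sc->max_physical_target > sc->max_logical_target)
        return sc->max_physical_target;
    return sc->max_logical_target;
}
```

Assuming a logical limit of 63 (the guess from the original report) and 70 drives numbered from target 16, the sketch would report target 85 as the maximum instead of cutting the scan off at 63.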
Comment 31 commit-hook freebsd_committer freebsd_triage 2024-10-14 05:41:53 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=7c74337e2c3d2269d1559f4e5541c0a3f402d814

commit 7c74337e2c3d2269d1559f4e5541c0a3f402d814
Author:     Peter Eriksson <pen@lysator.liu.se>
AuthorDate: 2024-10-14 04:01:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-14 05:24:06 +0000

    ciss: Expose tunable hw.ciss.force_interrupt as sysctl

    Expose the hw.ciss.force_interrupt tuneable as a sysctl and make it
    writeable at runtime.

    PR: 246279
    Reviewed by: imp
    Tested by: Marek Zarychta
    Differential Revision: https://reviews.freebsd.org/D25155

 sys/dev/ciss/ciss.c | 3 +++
 1 file changed, 3 insertions(+)
Comment 32 commit-hook freebsd_committer freebsd_triage 2024-10-14 05:41:54 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=fd95966af50bab6229bc5e67fadc7ffd915f77f5

commit fd95966af50bab6229bc5e67fadc7ffd915f77f5
Author:     Peter Eriksson <pen@lysator.liu.se>
AuthorDate: 2024-10-14 04:01:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-14 05:37:46 +0000

    ciss: hw.ciss.initiator_id to set the initiator ID

    Add hw.ciss.initiator_id to set the initiator to something other than the
    default.

    PR: 246279
    Reviewed by: imp
    Tested by: Marek Zarychta
    Differential Revision: https://reviews.freebsd.org/D25155

 sys/dev/ciss/ciss.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)
Comment 33 commit-hook freebsd_committer freebsd_triage 2024-10-14 05:41:56 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b339ab1491055d89415f85b6d1a03423193178f9

commit b339ab1491055d89415f85b6d1a03423193178f9
Author:     Peter Eriksson <pen@lysator.liu.se>
AuthorDate: 2024-10-14 04:01:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-14 05:37:46 +0000

    ciss: Don't panic on null CR ciss_dequeue_notify

    Apparently, sometimes on hot plug/unplug, a null cr comes back from
    ciss_dequeue_notify. This is clearly a bug, and by ignoring it we're
    papering over that bug. We only ever wake the thread after enqueuing a
    notification or setting a bit about killing the thread, so once we check
    the bit isn't the cause, cr can't be NULL unless something else has
    dequeued it.

    Ideally, this would be fixed, rather than papered over, but this makes a
    very old card somewhat more useable for external enclosures. I suspect
    it's a race when we set CISS_THREAD_SHUT and another flag (the latter
    w/o ciss_mtx held), but I don't see it and w/o hardware to reproduce
    it would be hard to know for sure.

    PR: 246279
    Reviewed by: imp
    Tested by: Marek Zarychta
    Differential Revision: https://reviews.freebsd.org/D25155

 sys/dev/ciss/ciss.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)
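
The behaviour described in this commit message can be sketched in userland C as follows. This is only an illustration under stated assumptions: `softc`, `request`, `dequeue_notify`, `notify_wakeup` and the `THREAD_SHUT` flag are hypothetical simplifications (the flag is merely named after CISS_THREAD_SHUT), and no locking is modelled, so the suspected race itself is not reproduced here.

```c
#include <stddef.h>
#include <stdio.h>

#define THREAD_SHUT 0x1u   /* hypothetical flag, named after CISS_THREAD_SHUT */

struct request {
    struct request *next;
    int id;
};

struct softc {
    unsigned flags;               /* shutdown bit lives here */
    struct request *notify_head;  /* queue of pending notifications */
    unsigned null_wakeups;        /* how often we woke with nothing queued */
    int last_handled;             /* id of the last request processed */
};

/* Pop the first queued notification, or return NULL if the queue is empty. */
static struct request *dequeue_notify(struct softc *sc)
{
    struct request *cr = sc->notify_head;

    if (cr != NULL)
        sc->notify_head = cr->next;
    return cr;
}

enum wake_result { WAKE_EXIT, WAKE_SPURIOUS, WAKE_HANDLED };

/* One iteration of the notify thread after it has been woken up. */
static enum wake_result notify_wakeup(struct softc *sc)
{
    struct request *cr;

    if (sc->flags & THREAD_SHUT)
        return WAKE_EXIT;

    cr = dequeue_notify(sc);
    if (cr == NULL) {
        /* Previously a panic; here it is only counted and logged. */
        sc->null_wakeups++;
        fprintf(stderr, "notify thread: woken with empty queue, ignoring\n");
        return WAKE_SPURIOUS;
    }

    sc->last_handled = cr->id;
    return WAKE_HANDLED;
}
```

Counting the spurious wakeups, as the sketch does, is an addition of this example; it would make it easier to notice when the ignored condition starts repeating.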
Comment 34 commit-hook freebsd_committer freebsd_triage 2024-10-14 05:41:57 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=cec58bba6425d4b9cf19a9ede9ca0dc00c1d48e3

commit cec58bba6425d4b9cf19a9ede9ca0dc00c1d48e3
Author:     Peter Eriksson <pen@lysator.liu.se>
AuthorDate: 2024-10-14 04:01:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-14 05:23:45 +0000

    ciss: Expose tunable hw.ciss.nop_message_heartbeat as sysctl

    Expose the hw.ciss.nop_message_heartbeat tuneable as a sysctl and make
    it writeable at runtime.

    PR: 246279
    Reviewed by: imp
    Tested by: Marek Zarychta
    Differential Revision: https://reviews.freebsd.org/D25155

 sys/dev/ciss/ciss.c | 3 +++
 1 file changed, 3 insertions(+)
Comment 35 commit-hook freebsd_committer freebsd_triage 2024-10-14 05:41:59 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a35564358ac442d5d7a5c9c2dd0544f07b1963e7

commit a35564358ac442d5d7a5c9c2dd0544f07b1963e7
Author:     Peter Eriksson <pen@lysator.liu.se>
AuthorDate: 2024-10-14 04:01:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-14 05:23:36 +0000

    ciss: Expose tunable hw.ciss.expose_hidden_physical as sysctl

    Expose the hw.ciss.expose_hidden_physical tuneable as a sysctl
    and make it writeable at runtime.

    PR: 246279
    Reviewed by: imp
    Tested by: Marek Zarychta
    Differential Revision: https://reviews.freebsd.org/D25155

 sys/dev/ciss/ciss.c | 3 +++
 1 file changed, 3 insertions(+)
Comment 36 commit-hook freebsd_committer freebsd_triage 2024-10-14 05:42:00 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d8b024673bbfb32259030db7e54f043f3e471abe

commit d8b024673bbfb32259030db7e54f043f3e471abe
Author:     Peter Eriksson <pen@lysator.liu.se>
AuthorDate: 2024-10-14 04:01:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-14 05:23:25 +0000

    ciss: Report more errors at higher ciss_verbose levels

    Report more information on errors, including the opcode.

    PR: 246279
    Reviewed by: imp
    Tested by: Marek Zarychta
    Differential Revision: https://reviews.freebsd.org/D25155

 sys/dev/ciss/ciss.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)
Comment 37 commit-hook freebsd_committer freebsd_triage 2024-10-14 05:42:02 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=f373e6b866b9efafc66ccc5355e1ea0aeeedfb6a

commit f373e6b866b9efafc66ccc5355e1ea0aeeedfb6a
Author:     Peter Eriksson <pen@lysator.liu.se>
AuthorDate: 2024-10-14 04:01:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-14 05:23:17 +0000

    ciss: Add sysctl/tunable hw.ciss.verbose

    Add tuneable to turn on/off verbosity for debugging purposes. This is
    approximately the same as bootverbose, but will print even more
    information when > 1.

    PR: 246279
    Reviewed by: imp
    Tested by: Marek Zarychta
    Differential Revision: https://reviews.freebsd.org/D25155

 sys/dev/ciss/ciss.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)
Comment 38 commit-hook freebsd_committer freebsd_triage 2024-10-14 05:42:04 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=74575d14284fde2c7617ad14fd9a8bac897b4427

commit 74575d14284fde2c7617ad14fd9a8bac897b4427
Author:     Peter Eriksson <pen@lysator.liu.se>
AuthorDate: 2024-10-14 04:01:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-14 05:23:08 +0000

    ciss: Add sysctl/tunable hw.ciss and hw.ciss.base_transfer_speed

    Add a sysctl/tuneable to report a different base transfer speed than the
    default of 132*1024.

    PR: 246279
    Reviewed by: imp
    Tested by: Marek Zarychta
    Differential Revision: https://reviews.freebsd.org/D25155

 sys/dev/ciss/ciss.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
Comment 39 commit-hook freebsd_committer freebsd_triage 2024-10-14 05:42:05 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=77af8c6db25ff6154268eb17f54f082a7eb61ea0

commit 77af8c6db25ff6154268eb17f54f082a7eb61ea0
Author:     Peter Eriksson <pen@lysator.liu.se>
AuthorDate: 2024-10-14 04:01:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-14 05:23:54 +0000

    ciss: Expose tunable hw.ciss.force_transport as sysctl

    Expose the hw.ciss.force_transport tuneable as a sysctl and make it
    writeable at runtime.

    PR: 246279
    Reviewed by: imp
    Tested by: Marek Zarychta
    Differential Revision: https://reviews.freebsd.org/D25155

 sys/dev/ciss/ciss.c | 3 +++
 1 file changed, 3 insertions(+)
Comment 40 commit-hook freebsd_committer freebsd_triage 2024-10-14 05:42:07 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=f03e1a42e92eff76dcf474655b600db37b04ae2b

commit f03e1a42e92eff76dcf474655b600db37b04ae2b
Author:     Peter Eriksson <pen@lysator.liu.se>
AuthorDate: 2024-10-14 04:01:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-14 05:22:01 +0000

    ciss: Minor formatting nit.

    PR: 246279
    Reviewed by: imp
    Tested by: Marek Zarychta
    Differential Revision: https://reviews.freebsd.org/D25155

 sys/dev/ciss/cissvar.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 41 commit-hook freebsd_committer freebsd_triage 2024-10-14 05:42:08 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=cafc839393db5c5d8000fd086118b3c7b47e95c2

commit cafc839393db5c5d8000fd086118b3c7b47e95c2
Author:     Peter Eriksson <pen@lysator.liu.se>
AuthorDate: 2024-10-14 04:01:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-14 05:22:19 +0000

    ciss: Ignore data over/under run on RECEIVE_DIAGNOSTIC

    This appears to be harmless, so ignore data over/under run on
    diagnostics.

    PR: 246279
    Reviewed by: imp
    Tested by: Marek Zarychta
    Differential Revision: https://reviews.freebsd.org/D25155

 sys/dev/ciss/ciss.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Comment 42 Peter Eriksson 2024-10-17 08:27:21 UTC
(In reply to commit-hook from comment #33)

> I've also noted my extreme reservations about making the cr == NULL
> panic just a printf. There's some race that's causing it (I think
> with where we set the shutdown flag racing another thread that's setting it w/o
> ciss_mtx held). Ideally, someone would chase that to ground who had the hardware and
> the time to do so, but I do not.

Perhaps we should make that panic / printf fix selectable via a sysctl knob? I agree that there is a bug somewhere since with the change to printf() I sometimes see a never-ending loop with errors instead. However, that takes longer to "develop" than the panic-reboot so it still makes it a bit more usable... It typically occurs with bad disks though, so another fix is to remove the bad disks :-)
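
A rough userland sketch of what such a knob could look like, purely to make the suggestion concrete. The variable `panic_on_null_cr` and the functions `null_cr_policy` and `handle_null_cr` are hypothetical; nothing in this thread shows that such a sysctl exists in the driver, and `abort()` merely stands in for a kernel panic.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical knob: 0 = log and continue, non-zero = panic as before. */
static int panic_on_null_cr = 0;

enum null_cr_action { NULL_CR_LOG, NULL_CR_PANIC };

/* Map the knob value to the action the notify thread should take. */
static enum null_cr_action null_cr_policy(int knob)
{
    return knob != 0 ? NULL_CR_PANIC : NULL_CR_LOG;
}

/* Called when the notify thread is woken but finds no request queued. */
static void handle_null_cr(void)
{
    if (null_cr_policy(panic_on_null_cr) == NULL_CR_PANIC) {
        fprintf(stderr, "panic: notify thread woken with no request\n");
        abort();  /* userland stand-in for a kernel panic */
    }
    fprintf(stderr, "notify thread woken with no request, continuing\n");
}
```

Defaulting the knob to the log-and-continue behaviour matches what was committed; flipping it would restore the old panic for anyone trying to chase the underlying race.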