Bug 246279 - ciss device driver not allowing more than 48 drives to be detected by the CAM layer
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-bugs mailing list
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2020-05-07 11:12 UTC by Peter Eriksson
Modified: 2020-05-20 08:22 UTC (History)
0 users

See Also:


Attachments
Output from "camcontrol devlist" (4.08 KB, text/plain)
2020-05-08 17:56 UTC, Peter Eriksson
Output from "camcontrol devlist" (4.08 KB, text/plain)
2020-05-08 17:56 UTC, Peter Eriksson
Output from "cciss_vol_status" (9.23 KB, text/plain)
2020-05-09 16:13 UTC, Peter Eriksson
Patch to fix support for more than 48 drives in HBA mode (3.60 KB, text/plain)
2020-05-09 23:19 UTC, Peter Eriksson

Description Peter Eriksson 2020-05-07 11:12:25 UTC
If I connect an HP D6020 external SAS disk enclosure to an HP H241 (HBA mode), with each drawer connected to its own port on the H241, the system sees only the first 48 drives (not all 70). In the output from 'camcontrol devlist', the last drive seen is "at scbus4 target 63 lun 0" and the first is "at scbus4 target 16 lun 0".

If I connect each drawer to a separate H241 controller, then all 70 drives are visible.

According to HP documentation the H241 should handle up to 200 physical drives.

I've tried reading the ciss driver source code and I can't see any "obvious" limits/adjustable knobs but I might be missing something....

The "target 63" number feels like a 64-target limit somewhere.
(I'm guessing the first 16 targets are reserved for logical drives, which we don't use, and that's why numbering starts at 16.)

(Now I want to connect a second D6020 to the server(s) in question, so I can't use the "use two controller cards" workaround anymore.)
Comment 1 Peter Eriksson 2020-05-08 17:56:10 UTC
Created attachment 214288
Output from "camcontrol devlist"
Comment 2 Peter Eriksson 2020-05-08 17:56:57 UTC
Created attachment 214289
Output from "camcontrol devlist"
Comment 3 Peter Eriksson 2020-05-08 17:58:56 UTC
Did some more testing today. Attached one of the new D6020 disk cabinets (with 70 SAS drives) to our test server. It looks like the CISS driver does see all 70 drives, but somewhere on the way up to the CAM layer 22 of them get lost (70 seen by the driver vs. 48 seen by CAM) :-)

# cciss_vol_status -V /dev/ciss0 | egrep 1200 | wc -l
      70

# camcontrol devlist | egrep 1200 | wc -l
      48

(Full output from cciss_vol_status & camcontrol added as attachments)
Comment 4 Peter Eriksson 2020-05-09 16:13:47 UTC
Created attachment 214315
Output from "cciss_vol_status"
Comment 5 Peter Eriksson 2020-05-09 16:45:01 UTC
Ok, with some bits of printf-debugging I found some suspect code in sys/dev/ciss/ciss.c:ciss_cam_action() at the "case XPT_PATH_INQ" section:

  cpi->max_target = sc->ciss_cfg->max_logical_supported;

Notice the "max logical logical volumes: 64" below?

ciss0: PERFORMANT Transport
ciss0:   0 logical drives configured
ciss0:   firmware 5.04
ciss0:   1 SCSI channels
ciss0:   signature 'CISS'
ciss0:   valence 3
ciss0:   supported I/O methods 0x7e000147<READY,simple,performant>
ciss0:   active I/O method 0x5<performant>
ciss0:   4G page base rx00000000
ciss0:   interrupt coalesce delay 0us
ciss0:   interrupt coalesce count 16
ciss0:   max outstanding commands 1024
ciss0:   bus types 0x200000
ciss0:   server name 'CZ3729EX3D'
ciss0:   heartbeat 0xc0
ciss0:   max logical logical volumes: 64
ciss0:   max physical disks supported: 384
ciss0:   max physical disks per logical volume: 128
ciss0:   JBOD Support is Available
ciss0:   JBOD Mode is Enabled
ciss0: 72 physical devices

(72 is 2 too many, but I guess the two extra are the storage drawers)

If I change that line to:

  cpi->max_target = sc->ciss_cfg->max_physical_supported;

then "camcontrol devlist" now shows 69 of 70 drives... Better, but not 100% there.
Comment 6 Peter Eriksson 2020-05-09 19:00:15 UTC
It now probes targets up to around 373 (but it takes a looong time) or so, and also detects the SES devices that have been hidden before since it only probed up to target 63...
                                                                                                                                                              
ses0 at ciss0 bus 33 scbus2 target 119 lun 0
ses0: <HPE D6020 1.63> Fixed Enclosure Services SPC-4 SCSI device
ses0: Serial Number 7CE952P06X
ses0: 135.168MB/s transfers
ses0: SES Device

ses1 at ciss0 bus 33 scbus2 target 121 lun 0
ses1: <HPE D6020 1.63> Fixed Enclosure Services SPC-4 SCSI device
ses1: Serial Number 7CE952P06X
ses1: 135.168MB/s transfers
ses1: SES Device

Due to the probing taking a loong time I also see timeouts:

> run_interrupt_driven_hooks: still waiting after 300 seconds for xpt_config         

and SCSI errors:

(probe0:ciss1:33:351:0): REPORT LUNS. CDB: a0 00 00 00 00 00 00 00 00 10 00 00
(probe0:ciss1:33:351:0): CAM status: CCB request completed with an error
(probe0:ciss1:33:351:0): Retrying command, 4 more tries remain
(probe0:ciss1:33:351:0): REPORT LUNS. CDB: a0 00 00 00 00 00 00 00 00 10 00 00
(probe0:ciss1:33:351:0): CAM status: CCB request completed with an error
(probe0:ciss1:33:351:0): Retrying command, 3 more tries remain

This code feels... broken :-)

Ah well, I'll see if I can modify the driver code to be a bit smarter about how many targets to probe. It really doesn't have to check them all, since it knows how many targets there are (72 in my case) - it could stop after having detected that many...

(No wonder the Linux folks have replaced their cciss driver with a rewritten one called hpsa).

- Peter
Comment 7 Peter Eriksson 2020-05-09 19:32:47 UTC
The current code probably works reasonably well for cases where the controller is being used in RAID mode (where it works with logical LUNs).

Well, except that it probably fails to detect the SES devices on the D6020 cabinets.

But when used as a "dumb" HBA with more physical drives than the controller supports logical devices (64 in my case for the H241 controller), it will always do the wrong thing.

And since it starts numbering physical targets at 16 (probably because the controller reports "0" supported logical LUNs, since it's an HBA, and the driver code then falls back to a compile-time default of 16 intended for really old controllers), things become strange... So 64-16 = 48.

Okidoki. Time for some code hacking :-)
Comment 8 Peter Eriksson 2020-05-09 23:19:10 UTC
Created attachment 214328
Patch to fix support for more than 48 drives in HBA mode

The attached patch will fix a couple of bugs in the current ciss driver code where it incorrectly enumerates physical drives if the controller is in JBOD mode.

There are two bugs/problems:

1. If you attach more physical drives to a controller than the number of logical volumes the controller supports (yes, really - totally wrong logic here), the additional drives will not be available, because the driver sets the max_target limit to the number of logical volumes while the enumeration of hardware drives starts at 16. So for a controller that supports, say, 64 logical volumes, only the first (64-16) drives will be detected.

2. The code also sets the initiator_id to the same max logical volume number, so any physical drive that happens to have the same target number will silently be skipped...

The patch also enables a little more verbosity.

This patch has been tested on FreeBSD 12.1-RELEASE-p3 with HP H241 controllers in JBOD mode, with 70 drives in an HP D6020 external SAS enclosure.

This patch has not been tested with controllers in "RAID" mode but the patch should be compatible...
Comment 9 Peter Eriksson 2020-05-09 23:35:29 UTC
This part:

"but the enumeration of hardware drives starts at 16. So for a controller that support say 64 logical volumes, only the first (64-16) drives will be detected."

should probably read:

Depending on how disk enclosures enumerate drives, the exact number of drives allowed might differ. HP D6020 enclosures seem to start enumeration at 16, so with an HP H241 controller that supports 64 logical volumes only the first (64-16) drives will be detected. None of the SES "targets" (one per drawer in the D6020 enclosure) will be detected either, since they are listed last...
Comment 10 Mark Linimon freebsd_committer freebsd_triage 2020-05-10 01:53:04 UTC
Convert to modern way to indicate "patch".
Comment 11 Peter Eriksson 2020-05-10 13:27:23 UTC
Just verified that the patched driver also works on an old HP DL380G5 with an HP Smart Array P400 controller and four drives. Not much of a test, but at least that part behaves the same as before (the old P400 doesn't really support a true JBOD mode anyway).