Created attachment 186641 [details]
The machine is an HP ProLiant ML350 G5 with an on-board E200i Smart Array, used as a file server.
Since the E200i does not support JBOD, one disk (in a hardware single-disk RAID0) is attached to it as a UFS system disk.
An unmodified Dell H200 controller has been added, and drives the original 8x2.5" drive cage. Those 8 disks (Seagate Barracuda, 2TB) are set up in a raidz2 pool.
With all disks present, the zpool uses da0-da7 off the H200 controller, and da8 is the system disk, using labels for / and swap.
Under 10.3-RELEASE (currently running), everything runs fine.
Under 11.1-RELEASE (previously installed, and re-tested by booting with the installer USB key), the sixth disk off the H200 controller simply disappears; it is not seen by the mps driver (as reported by dmesg), and the drives are renumbered accordingly (da0-da4 unchanged, da6 becomes da5, etc.; the system/Smart Array disk appears as da7).
Tested under multiple configurations with the original SAS drives in the cage, either one by one or as a group; it is always the disk at the same location that goes missing.
Rebooting back to 10.3, all the disks are recognized.
Booting off a 10.4-RELEASE USB install key shows the same symptoms as 11.1: one disk is not detected at boot and the zpool is degraded, missing the fifth disk.
After booting back into the installed 10.3-RELEASE, all disks are seen and zpool status is normal.
Created attachment 187825 [details]
dmesg of 10.4 booted in verbose mode
I have exactly the same issue going on. I was running FreeNAS based on FreeBSD 11.0 and everything worked perfectly. FreeNAS 11.0 saw all 6 of my disks, and my ZFS pool was imported and functioning perfectly. Once I upgraded to FreeNAS 11.1, the last disk (6 of 6) failed to be enumerated. It was still reported as functional and working by the BIOS, but the OS failed to see it. When I reverted to FreeNAS 11.0, my disk magically appeared.
My system is a :
I have exactly the same problem on nearly the same hardware:
Dell PowerEdge R710
6 x 2TB drives.
The problem appeared right after the upgrade to 11.1-RELEASE.
The last drive is not visible to the OS, but looks completely OK in the BIOS.
An attempt to boot from FreeBSD-11.2-PRERELEASE-amd64-20180504-r333226-mini-memstick.img led to the same result: the same disk is missing in dmesg. The disk itself was tested in a USB enclosure and shows no errors. Inserting a brand new disk into the slot did not lead to detection either. It seems a similar problem has existed for ages and is related to the mps driver:
The machine I'm having problems with had worked under FreeBSD 8.2 RELEASE and LSI version of the mps driver without any issues.
Doing some delving suggests that the only change between 10.3 and 10.4 was the following MFC:
Cc:ing ken due to that commit.
Created attachment 193262 [details]
dmesg of verbose boot off FreeBSD-11.0-RELEASE-amd64-mini-memstick.img
Created attachment 193263 [details]
dmesg of verbose boot off FreeBSD-11.1-RELEASE
Booting off a FreeBSD-11.0-RELEASE-amd64-mini-memstick.img shows all disks but generates an error: failure at /usr/src/sys/dev/mps/mps_sas_lsi.c:675/mpssas_add_device()! Could not get ID for device with handle 0x0009
Upgrading the controller's firmware to:
** H200I Adapter with:
** Package version -------------- 07.03.06.00
** Firmware Version ------------- 07.15.08.00
** Option BIOS Version ---------- 7.11.10.00-Dellized_24
** Boot services Driver Version - 7.02.04.00
** NVDATA Version --------------- Version 19
did not help.
Attaching verbose boot logs for 11.0 and 11.1 RELEASE.
The adapter in question appears as mps0.
There are also two extra adapters (LSI SAS 9200-8e), detected as mps1 and mps2.
An external enclosure with 16 disks is attached to mps1. All of its disks are detected.
An attempt to boot from the 11.2-PRERELEASE memstick with the external enclosure disconnected did not fix the problem (the same disk is missing).
The disks are arranged this way (slot labels on the front panel, followed by the corresponding serial numbers):
0 | 2 | 4
1 | 3 | 5
Z1P06JYK | Z1P06K5P | Z1P06JK1
Z1P06N38 | Z1P06N1Y | Z1P06NB2
The disk which is always missing (except when booting 11.0) is 5 (Z1P06NB2).
By the way, it is also missing when booting 12-CURRENT.
Any advice on how to deal with this would be appreciated.
Adding scottl, slm (even though he now works on firmware), mav and asomers to the CC list on this one. Perhaps one of them has seen a similar problem.
There was much more than just change 332661 that went into 10.x between 10.3 and 10.4. slm@ made some significant changes to the mapping code that went in, and that is more likely to have caused drives to not show up.
I'll look around and see if I see anything that looks likely.
One thing that might help is updating your firmware to the latest phase (Phase 20) for the 6Gb LSI boards. It has a lot of bug fixes, and you probably would want to run that in any case.
If you can post the boot messages with the MPS_MAPPING debug sysctl bit on, that might help (see man mps for details). We'll go from there.
Created attachment 193381 [details]
dmesg while dev.mps.0.debug_level set to 0x0203
Thanks for looking at this!
Here is what I got after updating the firmware and BIOS on the
SAS 9200-8e adapters to phase 20
and setting dev.mps.0.debug_level to 0x0203 (dmesg attached).
# cat /boot/loader.conf
# sysctl -a |grep "mps.0"
dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1028 subdevice=0x1f1e class=0x010700
dev.mps.0.%location: slot=0 function=0 dbsf=pci0:3:0:0
dev.mps.0.%desc: Avago Technologies (LSI) SAS2008
Can you try changing the use_phy_num sysctl variable to 0? You should set this globally so all of your instances of mps have use_phy_num set to 0. Then rerun and let's see what happens. I see this in the log:
mps0: Attempting to reuse target id 8 handle 0x0009
which means that a device with target ID 8 has already been added. You can see this a few lines earlier. So, I'm thinking that the first device is overwritten by the second device (making the first device invisible). If you change use_phy_num to 0, the driver will not use the Phy Number as the device ID and the two devices should have different IDs and both should be added successfully. Give it a try and let me know.
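For reference, a global setting of this kind could look like the fragment below in /boot/loader.conf. This is only a sketch: hw.mps.use_phy_num is the name reported as working later in this thread, the per-adapter line is an assumed alternative based on the mps.X.use_phy_num naming used for the attachment, and the reporter's actual file contents are not shown here.

```
# /boot/loader.conf (sketch; not the reporter's actual file)
# Global tunable: applies to every mps instance (mps0, mps1, mps2, ...).
hw.mps.use_phy_num="0"
# Per-adapter alternative (assumed form, one line per controller):
# dev.mps.0.use_phy_num="0"
```

A reboot is needed for loader.conf changes to take effect.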
Created attachment 193412 [details]
dmesg while mps.X.use_phy_num set to 0 for all X
Setting hw.mps.use_phy_num=0 solved the problem.
The missing disk appeared and zpool scrub went well on that pool, but now dmesg shows this:
mps0: failure at /usr/src/sys/dev/mps/mps_sas_lsi.c:700/mpssas_add_device()! Could not get ID for device with handle 0x0009
Is this serious, or can it be ignored?
See the attached dmesg, taken with the following in /boot/loader.conf:
Thanks a lot!
That error means that the device was not added to the mapping table. It shouldn't be a problem, but I'm not sure why it wasn't added. If you turn on all of the debug bits (0x7FF), it should tell us what failed. If that's too much debug output to boot, it would be good to at least see 0x223 (turn on MPS_ERROR). But, better to turn them all on if possible.
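To make the masks mentioned in this thread concrete, here is a minimal sketch of how they combine. It assumes the bit assignments listed in mps(4) (0x001 info, 0x002 faults, 0x020 errors, 0x200 mapping); please verify them against the man page for your release. The MPS_* names mirror the driver's flag names but are just shell variables here.

```sh
#!/bin/sh
# Debug bits as documented in mps(4); treat these values as assumptions
# and check the man page on the release you are running.
MPS_INFO=$((0x001))     # informational prints
MPS_FAULT=$((0x002))    # controller faults
MPS_ERROR=$((0x020))    # I/O and controller errors
MPS_MAPPING=$((0x200))  # device mapping

# 0x0203 as used earlier in the thread: info + faults + mapping.
level_mapping=$((MPS_INFO | MPS_FAULT | MPS_MAPPING))
# 0x0223 adds the MPS_ERROR bit.
level_error=$((level_mapping | MPS_ERROR))
# 0x7FF turns on all eleven bits.
level_all=$(( (1 << 11) - 1 ))

# Print the three values as loader.conf-style lines (format reused per argument).
printf 'dev.mps.0.debug_level="0x%04X"\n' "$level_mapping" "$level_error" "$level_all"
```

Only one of the printed lines would go into /boot/loader.conf at a time; the script writes nothing itself.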
Created attachment 193472 [details]
dmesg while dev.mps.0.debug_level set to 0x0223
Booting with dev.mps.0.debug_level set to 0x7FF produced a never-ending flood of kernel messages on the screen, so I changed it to 0x223. Hope this helps.
Thanks for the help again,
Hmmm. I'm having trouble figuring out why the device with handle 9 is not being added to the mapping table. _mapping_add_new_device() should add the device to the mapping table and log an error if something goes wrong. I'm assuming that you have Persistent Mapping enabled (as opposed to Enclosure/Slot mapping or no mapping), and that the Device Persistent Mapping Page in the controller is enabled. I've never used mpsutil, but you should be able to use this utility to get the IOC Facts and the IOC Page 8 Config Page. Can you get that info for mps0 and I'll see if something pops out at me. Actually, it would also be good to see any devices that are attached too, so if you can get that with this utility it might help.
Created attachment 193507 [details]
output of mpsutil
Here is what I was able to get from mpsutil. As for "Page 8", it seems it does not exist at all:
# mpsutil show cfgpage 8
mpsutil: Error retrieving cfg page: Invalid configuration page
: Input/output error
I went through pages 0 to 30 inclusive and got output from only some of them (see the attached file).
There might be a problem with mpsutil in the way it retrieves config pages. Config pages have a type and a number. The page I was looking for was IOC (type) Page 8. It looks like you can only enter a number, but I'm not sure.
Anyway, I think the issue is probably that the device with handle 9 is an SEP Target device and the driver may handle these differently than a SAS/SATA target device. I haven't tested it to see exactly how those devices are treated, but it looks like they are not put into the mapping table. I don't think this will cause any problems and you should be OK now, with the change to the use_phy_num setting.
(In reply to Denis Polygalov from comment #18)
Did use_phy_num work for you and were all drives enumerated?
I also wanted to thank you for helping to not only solve your problem but mine also!
use_phy_num did the trick for me also! Thanks everyone!
As I remember, the workaround with use_phy_num also worked for me (I cannot confirm this, as the machine has been decommissioned and the pool moved to another server).