Bug 222533 - One HDD not seen by mps (11.1 regression)
Summary: One HDD not seen by mps (11.1 regression)
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 11.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs mailing list
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2017-09-23 14:12 UTC by a.parseg
Modified: 2018-06-19 16:13 UTC
CC List: 7 users

See Also:


Attachments
10.3-RELEASE dmesg (13.01 KB, text/plain)
2017-09-23 14:12 UTC, a.parseg
dmesg of 10.4 booted in verbose mode (18.67 KB, text/plain)
2017-11-07 16:55 UTC, a.parseg
dmesg of verbose boot off FreeBSD-11.0-RELEASE-amd64-mini-memstick.img (88.40 KB, text/plain)
2018-05-11 04:31 UTC, Denis Polygalov
dmesg of verbose boot off FreeBSD-11.1-RELEASE (89.04 KB, text/plain)
2018-05-11 04:32 UTC, Denis Polygalov
dmesg while dev.mps.0.debug_level set to 0x0203 (32.15 KB, text/plain)
2018-05-14 08:49 UTC, Denis Polygalov
dmesg while mps.X.use_phy_num set to 0 for all X (48.49 KB, text/plain)
2018-05-15 06:06 UTC, Denis Polygalov
dmesg while dev.mps.0.debug_level set to 0x0223 (32.41 KB, text/plain)
2018-05-17 01:17 UTC, Denis Polygalov
output of mpsutil (50.64 KB, text/plain)
2018-05-18 08:15 UTC, Denis Polygalov

Description a.parseg 2017-09-23 14:12:48 UTC
Created attachment 186641 [details]
10.3-RELEASE dmesg

The machine is an HP ProLiant ML350 G5 with an on-board E200i Smart Array, used as a file server.

Since the E200i does not support JBOD, one disk (in a hardware single-disk RAID0) is attached to it as a UFS system disk.

An unmodified Dell H200 controller has been added, and drives the original 8x2.5" drive cage. Those 8 disks (Seagate Barracuda, 2TB) are set up in a raidz2 pool.

With all disks present, the zpool uses da0-da7 off the H200 controller, and da8 is the system disk, using labels for / and swap.

Under 10.3-RELEASE (currently running), everything runs fine.

Under 11.1-RELEASE (previously installed, and re-tested by booting with the installer USB key), the sixth disk off the H200 controller simply disappears; it is not seen by the mps driver (as reported by dmesg), and the drives are renumbered accordingly (da0-da4 unchanged, da6 becomes da5, etc.; the system disk on the Smart Array appears as da7).

Tested under multiple configurations with the original SAS drives in the cage, either one by one or as a group; it is always the disk at the same location that goes missing.

Rebooting back to 10.3, all the disks are recognized.
Comment 1 a.parseg 2017-11-05 07:44:34 UTC
Booting off a 10.4-RELEASE USB install key shows the same symptoms as 11.1: one disk is not detected at boot and the zpool is degraded, missing the fifth disk.

After booting back into the installed 10.3-RELEASE, all disks are seen and zpool status is normal.
Comment 2 a.parseg 2017-11-07 16:55:16 UTC
Created attachment 187825 [details]
dmesg of 10.4 booted in verbose mode
Comment 3 Joe Oliveiro 2017-12-14 23:08:57 UTC
I have exactly the same issue going on. I was running FreeNAS based on FreeBSD 11.0 and everything worked perfectly. FreeNAS 11.0 saw all 6 of my disks and my ZFS pool was imported and functioning perfectly. Once I upgraded to FreeNAS 11.1, the last disk (6 of 6) failed to be enumerated. It was still reported as functional and working by the BIOS, but the OS failed to see it. When I reverted to FreeNAS 11.0, my disk magically reappeared.

My system is a:
Dell R710
128GB Memory
H200 HBA 
6x5TB Drives.
Comment 4 Denis Polygalov 2018-05-07 08:13:51 UTC
I have exactly the same problem on nearly the same hardware:

Dell PowerEdge R710
128GB Memory
H200 HBA 
6 x 2TB Drives.

The problem appeared right after the upgrade to 11.1-RELEASE.
The last drive is not visible to the OS, but looks completely OK in the BIOS.
Comment 5 Denis Polygalov 2018-05-07 09:18:23 UTC
An attempt to boot from FreeBSD-11.2-PRERELEASE-amd64-20180504-r333226-mini-memstick.img led to the same result: the same disk is missing in dmesg. The disk itself was tested in a USB box and shows no errors. Inserting a brand-new disk into the slot did not lead to detection either. It seems a similar problem has existed for ages and is related to the mps driver:

https://forums.freebsd.org/threads/not-all-drives-detected-m1015-flashed-to-8211-it-on-freebsd-10-0-release.50311/

The machine I'm having problems with worked under FreeBSD 8.2-RELEASE with the LSI version of the mps driver without any issues.
Comment 6 Mark Linimon freebsd_committer freebsd_triage 2018-05-07 12:57:42 UTC
Doing some delving suggests that the only change between 10.3 and 10.4 was the following MFC:

https://svnweb.freebsd.org/base?view=revision&revision=322661

Cc:ing ken due to that commit.
Comment 7 Denis Polygalov 2018-05-11 04:31:36 UTC
Created attachment 193262 [details]
dmesg of verbose boot off FreeBSD-11.0-RELEASE-amd64-mini-memstick.img
Comment 8 Denis Polygalov 2018-05-11 04:32:26 UTC
Created attachment 193263 [details]
dmesg of verbose boot off FreeBSD-11.1-RELEASE
Comment 9 Denis Polygalov 2018-05-11 04:33:07 UTC
Booting off FreeBSD-11.0-RELEASE-amd64-mini-memstick.img shows all disks but generates an error: failure at /usr/src/sys/dev/mps/mps_sas_lsi.c:675/mpssas_add_device()! Could not get ID for device with handle 0x0009

Upgrading the controller's firmware to:
** H200I Adapter with:
**     Package version -------------- 07.03.06.00
**     Firmware Version ------------- 07.15.08.00
**     Option BIOS Version ---------- 7.11.10.00-Dellized_24
**     Boot services Driver Version - 7.02.04.00
**     NVDATA Version --------------- Version 19
**
did not help.

Attaching verbose boot logs for 11.0 and 11.1 RELEASE.
The adapter in question appears as mps0.
There are also two extra adapters (LSI SAS 9200-8e) detected as mps1 and mps2.
An external enclosure with 16 disks is attached to mps1. All of its disks are detected.
An attempt to boot from the 11.2-PRERELEASE memstick with the external enclosure disconnected did not fix the problem (the same disk is missing).
The disks are arranged this way (labels on the front panel):
0 | 2 | 4
---------
1 | 3 | 5

Serial numbers:

Z1P06JYK | Z1P06K5P | Z1P06JK1
------------------------------
Z1P06N38 | Z1P06N1Y | Z1P06NB2

The disk that is always missing (except when booting 11.0) is 5 (Z1P06NB2).
By the way, it is also missing when booting 12-CURRENT.

Any advice on how to deal with this would be appreciated.

Regards,
Denis
Comment 10 Kenneth D. Merry freebsd_committer 2018-05-11 19:48:50 UTC
Adding scottl, slm (even though he now works on firmware), mav and asomers to the CC list on this one.  Perhaps one of them has seen a similar problem.

There was much more than just change 322661 that went into 10.x between 10.3 and 10.4.  slm@ made some significant changes to the mapping code that went in, and that is more likely to have caused drives to not show up.

I'll look around and see if I see anything that looks likely.

One thing that might help is updating your firmware to the latest phase (Phase 20) for the 6Gb LSI boards.  It has a lot of bug fixes, and you probably would want to run that in any case.
Comment 11 Stephen McConnell freebsd_committer 2018-05-11 21:03:35 UTC
If you can post the boot messages with the MPS_MAPPING debug sysctl bit on, that might help (see man mps for details). We'll go from there.
Comment 12 Denis Polygalov 2018-05-14 08:49:39 UTC
Created attachment 193381 [details]
dmesg while dev.mps.0.debug_level set to 0x0203

Thanks for looking at this!

Here is what I got after updating the firmware and BIOS on the
SAS 9200-8e adapters to Phase 20
and setting dev.mps.0.debug_level to 0x0203 (dmesg attached).

Also, FYI:

# cat /boot/loader.conf
hw.mps.spinup_wait_time=10
dev.mps.0.debug_level=0x0203
kern.maxfiles="32768"
ipmi_load="YES"
geom_mirror_load="YES"
zfs_load="YES"
linux_load="YES"
linprocfs_load="YES"
linsysfs_load="YES"

# sysctl -a |grep "mps.0"
dev.mps.0.use_phy_num: 1
dev.mps.0.encl_table_dump:
dev.mps.0.mapping_table_dump:
dev.mps.0.spinup_wait_time: 10
dev.mps.0.chain_alloc_fail: 0
dev.mps.0.enable_ssu: 1
dev.mps.0.max_io_pages: -1
dev.mps.0.max_chains: 2048
dev.mps.0.chain_free_lowwater: 2008
dev.mps.0.chain_free: 2048
dev.mps.0.io_cmds_highwater: 154
dev.mps.0.io_cmds_active: 0
dev.mps.0.driver_version: 21.02.00.00-fbsd
dev.mps.0.firmware_version: 07.15.08.00
dev.mps.0.disable_msi: 0
dev.mps.0.disable_msix: 0
dev.mps.0.debug_level: 515
dev.mps.0.%parent: pci3
dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1028 subdevice=0x1f1e class=0x010700
dev.mps.0.%location: slot=0 function=0 dbsf=pci0:3:0:0
dev.mps.0.%driver: mps
dev.mps.0.%desc: Avago Technologies (LSI) SAS2008

Regards,
Denis.
Comment 13 Stephen McConnell freebsd_committer 2018-05-14 17:13:00 UTC
Can you try changing the use_phy_num sysctl variable to 0? You should set this globally so all of your instances of mps have use_phy_num set to 0. Then rerun and let's see what happens. I see this in the log:

mps0: Attempting to reuse target id 8 handle 0x0009

which means that a device with target ID 8 has already been added. You can see this a few lines earlier. So, I'm thinking that the first device is overwritten by the second device (making the first device invisible). If you change use_phy_num to 0, the driver will not use the Phy Number as the device ID and the two devices should have different IDs and both should be added successfully. Give it a try and let me know.
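
The collision described above can be illustrated with a toy model. This is only a sketch of the hypothesis, not the driver's code: the `Device` record, the `assign_targets` function, and the first two handle values are hypothetical, and "next free ID" is a stand-in for whatever the mapping table would return. It assumes that with use_phy_num enabled the target ID is taken straight from the phy number, so two devices reporting the same phy end up on one ID.

```python
from collections import namedtuple

# Hypothetical device record: firmware handle and the phy number it reports.
Device = namedtuple("Device", ["handle", "phy_num"])


def assign_targets(devices, use_phy_num):
    """Return {target_id: handle} for a toy ID-assignment scheme."""
    targets = {}
    next_free = 0
    for dev in devices:
        if use_phy_num:
            target_id = dev.phy_num      # ID taken directly from the phy number
        else:
            target_id = next_free        # stand-in for a mapping-table lookup
            next_free += 1
        if target_id in targets:
            print(f"reusing target id {target_id} for handle {dev.handle:#06x}")
        targets[target_id] = dev.handle  # a later device replaces an earlier one
    return targets


# Made-up inventory: one device on phy 0 and two that both report phy 8.
devices = [Device(0x000A, 0), Device(0x000B, 8), Device(0x0009, 8)]

with_phy = assign_targets(devices, use_phy_num=True)
without_phy = assign_targets(devices, use_phy_num=False)
print(len(with_phy), len(without_phy))
```

In this model the run with use_phy_num enabled ends up with one entry fewer than the number of devices, which is the "one disk missing" symptom; with it disabled every device keeps its own ID.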
Comment 14 Denis Polygalov 2018-05-15 06:06:43 UTC
Created attachment 193412 [details]
dmesg while mps.X.use_phy_num set to 0 for all X

Setting hw.mps.use_phy_num=0 solved the problem. 
The missing disk appeared and zpool scrub went well on that pool, but now dmesg shows this:

mps0: failure at /usr/src/sys/dev/mps/mps_sas_lsi.c:700/mpssas_add_device()! Could not get ID for device with handle 0x0009

Is this serious, or can it be ignored?
Attached is the dmesg with the following in /boot/loader.conf:

hw.mps.spinup_wait_time=10

dev.mps.0.debug_level=0x0203
dev.mps.0.use_phy_num=0
dev.mps.1.debug_level=0x0203
dev.mps.1.use_phy_num=0
dev.mps.2.debug_level=0x0203
dev.mps.2.use_phy_num=0

Thanks a lot!
Denis
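
Since the setting has to be off on every mps instance, it can be worth checking the values after a reboot. A minimal sketch, assuming the output of `sysctl dev.mps` has been captured as text in the `dev.mps.N.use_phy_num: V` form shown earlier in this report; the function name `phy_num_settings` and the sample values are hypothetical.

```python
import re


def phy_num_settings(sysctl_text):
    """Map mps unit number -> use_phy_num value found in sysctl output."""
    pattern = re.compile(
        r"^dev\.mps\.(\d+)\.use_phy_num:\s*(-?\d+)\s*$", re.MULTILINE
    )
    return {int(unit): int(value) for unit, value in pattern.findall(sysctl_text)}


# Made-up sample in the same format as the sysctl output above.
sample = """\
dev.mps.0.use_phy_num: 0
dev.mps.0.spinup_wait_time: 10
dev.mps.1.use_phy_num: 0
dev.mps.2.use_phy_num: 1
"""

settings = phy_num_settings(sample)
# Units where the phy-number mapping is still enabled.
still_enabled = sorted(unit for unit, value in settings.items() if value != 0)
print(still_enabled)
```

An empty list means every instance picked up the setting.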
Comment 15 Stephen McConnell freebsd_committer 2018-05-15 15:11:15 UTC
That error means that the device was not added to the mapping table. It shouldn't be a problem, but I'm not sure why it wasn't added. If you turn on all of the debug bits (0x7FF), it should tell us what failed. If that's too much debug output to boot, it would be good to at least see 0x223 (turn on MPS_ERROR). But, better to turn them all on if possible.
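
To make the debug_level values used in this thread easier to read, here is a small decoder. It is a sketch under an assumption: the bit names are the ones listed in mps(4) as far as I recall, and only MPS_MAPPING (0x0200) and MPS_ERROR (0x0020) are confirmed by the comments here, so check the table against the man page on your release. The `decode` helper is hypothetical.

```python
# Bit names as recalled from mps(4); treat this table as an assumption.
DEBUG_BITS = {
    0x0001: "MPS_INFO",
    0x0002: "MPS_FAULT",
    0x0004: "MPS_EVENT",
    0x0008: "MPS_LOG",
    0x0010: "MPS_RECOVERY",
    0x0020: "MPS_ERROR",
    0x0040: "MPS_INIT",
    0x0080: "MPS_XINFO",
    0x0100: "MPS_USER",
    0x0200: "MPS_MAPPING",
    0x0400: "MPS_TRACE",
}


def decode(level):
    """Return the names of the bits set in a debug_level value, lowest bit first."""
    return [name for bit, name in sorted(DEBUG_BITS.items()) if level & bit]


# The three values mentioned in this thread, shown in hex and decimal.
for level in (0x0203, 0x0223, 0x7FF):
    print(f"{level:#06x} ({level}): {', '.join(decode(level))}")
```

Note that sysctl reports the value in decimal, which is why the earlier output shows dev.mps.0.debug_level as 515 after it was set to 0x0203.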
Comment 16 Denis Polygalov 2018-05-17 01:17:05 UTC
Created attachment 193472 [details]
dmesg while dev.mps.0.debug_level set to 0x0223

Booting with dev.mps.0.debug_level set to 0x7FF was just a never-ending flood of kernel messages on the screen, so I changed it to 0x223. Hope this helps.

Thanks for the help again,
Denis.
Comment 17 Stephen McConnell freebsd_committer 2018-05-17 18:10:36 UTC
Hmmm. I'm having trouble figuring out why the device with handle 9 is not being added to the mapping table. _mapping_add_new_device() should add the device to the mapping table and log an error if something goes wrong. I'm assuming that you have Persistent Mapping enabled (as opposed to Enclosure/Slot mapping or no mapping), and that the Device Persistent Mapping Page in the controller is enabled. I've never used mpsutil, but you should be able to use this utility to get the IOC Facts and the IOC Page 8 Config Page. Can you get that info for mps0 and I'll see if something pops out at me. Actually, it would also be good to see any devices that are attached too, so if you can get that with this utility it might help.
Comment 18 Denis Polygalov 2018-05-18 08:15:37 UTC
Created attachment 193507 [details]
output of mpsutil

Here is what I was able to get from mpsutil. As for "Page 8", it seems it does not exist at all:
# mpsutil show cfgpage 8
mpsutil: Error retrieving cfg page: Invalid configuration page
: Input/output error

I went through pages 0 to 30 inclusive and got output from only some of them (see attached file).

Regards,
Denis.
Comment 19 Stephen McConnell freebsd_committer 2018-05-18 17:11:56 UTC
There might be a problem with mpsutil in the way it retrieves config pages. Config pages have a type and a number. The page I was looking for was IOC (type) Page 8. It looks like you can only enter a number, but I'm not sure.

Anyway, I think the issue is probably that the device with handle 9 is an SEP Target device and the driver may handle these differently than a SAS/SATA target device. I haven't tested it to see exactly how those devices are treated, but it looks like they are not put into the mapping table. I don't think this will cause any problems and you should be OK now, with the change to the use_phy_num setting.
Comment 20 Joe Oliveiro 2018-05-20 23:32:37 UTC
(In reply to Denis Polygalov from comment #18)

Did use_phy_num work for you and were all drives enumerated?

I also wanted to thank you for helping to not only solve your problem but mine also!
Comment 21 Joe Oliveiro 2018-05-21 00:38:31 UTC
use_phy_num did the trick for me also! Thanks everyone!
Comment 22 a.parseg 2018-06-19 16:13:31 UTC
As I remember, the workaround with use_phy_num also worked for me (I cannot confirm, as this machine has been decommissioned and the pool moved over to another server).