Bug 248906 - LSI SAS2008 (mps) gets stuck in a reset loop when writing on AMD Epyc 3000
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: Unspecified
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
Reported: 2020-08-25 17:58 UTC by Will Ross
Modified: 2020-09-23 16:51 UTC

Attachments
Successful smartctl -a (12.28 KB, text/plain)
2020-08-26 16:38 UTC, Will Ross
Successful read (90.49 KB, text/plain)
2020-08-26 16:39 UTC, Will Ross
Unsuccessful `smartctl -C -t short /dev/da0` (26.18 KB, text/plain)
2020-08-26 16:40 UTC, Will Ross
Unsuccessful write with `dd if=/dev/zero of=/dev/da0 bs=1m count=10` (241.83 KB, text/plain)
2020-08-26 16:41 UTC, Will Ross

Description Will Ross 2020-08-25 17:58:53 UTC
Overview:
I'm trying to use an LSI SAS2008-based PCIe card with an AMD Epyc 3151 system. As soon as I try to write anything to a drive connected to the card, the mps driver appears to get stuck in a reset loop, repeating messages like this:

mps0: IOC Fault 0x40002622, Resetting
mps0: Reinitializing controller,
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
mps0: mps_reinit finished sc 0xfffffe00014a9000 post 4 free 3
mps0: SAS Address for SATA device = 2a04546ea96c8bac
mps0: SAS Address from SATA device = 2a04546ea96c8bac
mps0: SAS Address for SATA device = d9413b15bbcdcc78
mps0: SAS Address from SATA device = d9413b15bbcdcc78

There's then a pause for a few seconds, and these messages are printed again (none of the values change).
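
For anyone triaging a saved console log, the loop described above can be counted mechanically. This is only a minimal Python sketch: the function names `summarize_reset_loop` and `split_fault_word` are hypothetical, the sample text is copied from the messages above, and splitting the fault word into a top-nibble state and a low 16-bit code is an assumption about the doorbell register layout rather than something confirmed in this report.

```python
import re
from collections import Counter

# Patterns match the mps(4) console messages quoted in this report.
FAULT_RE = re.compile(r"^(mps\d+): IOC Fault (0x[0-9a-fA-F]+), Resetting")
REINIT_RE = re.compile(r"^(mps\d+): mps_reinit finished")


def summarize_reset_loop(lines):
    """Count IOC faults (keyed by fault word) and completed reinits."""
    faults = Counter()
    reinits = 0
    for line in lines:
        line = line.strip()
        match = FAULT_RE.match(line)
        if match:
            faults[int(match.group(2), 16)] += 1
        elif REINIT_RE.match(line):
            reinits += 1
    return faults, reinits


def split_fault_word(word):
    """Assumed layout: bits 31:28 = IOC state, bits 15:0 = fault code."""
    return (word >> 28) & 0xF, word & 0xFFFF


SAMPLE = """\
mps0: IOC Fault 0x40002622, Resetting
mps0: Reinitializing controller,
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: mps_reinit finished sc 0xfffffe00014a9000 post 4 free 3
"""

if __name__ == "__main__":
    faults, reinits = summarize_reset_loop(SAMPLE.splitlines())
    for word, count in faults.items():
        state, code = split_fault_word(word)
        print(f"fault word {word:#010x}: state {state:#x}, code {code:#06x}, seen {count}x")
    print(f"completed reinits: {reinits}")
```

A fault count that keeps growing while the fault word stays identical matches the "none of the values change" observation.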

Reproduction Steps:
1. Set up hardware with a SuperMicro M11SDV-4C-LN4F and an LSI SAS2008 HBA PCIe card that's been reflashed to the IT firmware. Connect a SATA disk to the HBA.
2. Boot FreeBSD (off of install media, another disk, etc).
3. Once booted, check dmesg to see the name of the SATA disk (ex: da0)
4. Run `dd if=/dev/zero of=/dev/da0`

Expected:
Zeros are successfully written to the disk.

Actual:
mps driver gets stuck in a reset loop.

Comments:
* I've tested two different cards (one reflashed by me, another bought off of eBay pre-flashed), and they both exhibit this issue.
* Ubuntu is able to use both cards.
* I've tested both an SSD and HDD, with no difference.
* This machine is specifically running FreeNAS 11.3-U4.1 (FreeBSD 11.3p11 equivalent). I encountered the same issue with FreeBSD 12.1-RELEASE as well.
* I haven't had a chance to try them in an Intel system yet, but will update this issue once I have.
* Reads work fine (tested with `dd if=/dev/da0 of=/tmp/read_test`). The data is as expected.
* smartctl results are mixed:
    * It is able to read SMART values off of drives.
    * A background test runs successfully.
    * A foreground test fails. After waiting 1 minute, smartctl exits. Checking the SMART test log shows that the test was "Interrupted (host reset)" without completing, and these messages are logged by the system:

        (pass1:mps0:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 0c 00 d4 00 00 00 81 00 4f 00 c2 00 b0 00 length 0 SMID 700 Aborting command 0xfffffe00015246c0
        mps0: Sending reset from mpssas_send_abort for target ID 5
        mps0: Unfreezing devq for target ID

The
Comment 1 Will Ross 2020-08-25 18:07:59 UTC
I'm guessing this particular hardware setup isn't very common, as the board's a bit niche. I was able to find one other reference to someone encountering the same issue:

https://forum.level1techs.com/t/ioc-fault-when-importing-pool-mps-reinit-hba-lsi-card-freenas-solved-sorta/153973
Comment 2 Will Ross 2020-08-26 16:38:53 UTC
Created attachment 217541
Successful smartctl -a

Successful `smartctl -a /dev/da0`
Comment 3 Will Ross 2020-08-26 16:39:56 UTC
Created attachment 217542
Successful read

Successful read with `dd if=/dev/da0 of=/tmp/read_test bs=1m count=10`
Comment 4 Will Ross 2020-08-26 16:40:49 UTC
Created attachment 217543
Unsuccessful `smartctl -C -t short /dev/da0`
Comment 5 Will Ross 2020-08-26 16:41:21 UTC
Created attachment 217544
Unsuccessful write with `dd if=/dev/zero of=/dev/da0 bs=1m count=10`
Comment 6 Will Ross 2020-08-26 16:44:32 UTC
I've attached logs with debug output (dev.mps.0.debug_level=2047) for four commands. The first two are successful, the latter two are not, with the final one needing a reboot to stop the reset loop (there's also some extra output in that last log from NUT, and a weird timestamp from devd at the end).