Bug 257670 - mpr(4): SAS3008 PCI-Express Fusion-MPT SAS-3: Fatal unrecoverable error detected with : mpr0: IOC Fault 0x4000265d, Resetting (LOR: CAM device lock (CAM device lock, sleep mutex) @ /usr/src/sys/cam/cam_xpt.c)
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: arm
Version: CURRENT
Hardware: arm64 Any
Importance: --- Affects Some People
Assignee: freebsd-arm (Nobody)
Keywords: ThunderX, crash, needs-qa
Reported: 2021-08-07 07:21 UTC by Daniel Morante
Modified: 2021-08-11 14:32 UTC (History)
koobs: maintainer-feedback? (mav)
koobs: maintainer-feedback? (scottl)
koobs: mfc-stable13?
koobs: mfc-stable12?
koobs: mfc-stable11?


Attachments
capture of boot via serial (67.27 KB, text/plain)
2021-08-07 07:21 UTC, Daniel Morante

Description Daniel Morante 2021-08-07 07:21:34 UTC
Created attachment 227004
capture of boot via serial

I am testing FreeBSD-14.0-CURRENT-arm64-aarch64-20210805-f3a3b061216-248478 on a Cavium ThunderX2 (Gigabyte R281-T91).  This system has an onboard SAS3008 PCI-Express Fusion-MPT SAS-3 controller.  

```
mpr0@pci0:14:0:0:       class=0x010700 rev=0x02 hdr=0x00 vendor=0x1000 device=0x0097 subvendor=0x1458 subdevice=0x3008
    vendor     = 'Broadcom / LSI'
    device     = 'SAS3008 PCI-Express Fusion-MPT SAS-3'
    class      = mass storage
    subclass   = SAS
```

I load the `mpr` driver by having `mpr_load="YES"` in `/boot/loader.conf`.  So far so good, except for the weird messages in dmesg (see attachment).

There are currently 8 HDDs attached to it and I set up 3 ZFS pools.  This goes well until I finally start to put some load on them.  The system kernel panics and halts with the following in dmesg:

```
mpr0: IOC Fault 0x4000265d, Resetting
mpr0: Reinitializing controller
...
RAS CONTROLLER: Fatal unrecoverable error detected
```

This is not to say the problem is with ZFS.  I suspect the mpr driver is just unstable.

The system can no longer boot into multi-user mode.  It kernel panics with the same error as soon as it tries to start ZFS.

```
mountroot: waiting for device /dev/nda0p2...
WARNING: / was not properly dismounted
Dual Console: Video Primary, Serial Secondary
witness_lock_list_get: witness exhausted
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
RAS CONTROLLER: Fatal unrecoverable error detected

        *** NBU Error ***
...
```

In order to get a functional system I disable ZFS in `/etc/rc.conf` while in single-user mode.

Now back in multi-user mode I can do a `service zfs onestart` and try to import one of the pools.  The system then kernel panics again.

I detail the full specs of this system in bug #254651 (where I have a problem with the onboard SATA controllers) and in my forum post at https://forums.freebsd.org/threads/aarch64-trouble-with-cn99xx-ahci-and-fastlinq-ql41000-controllers.79556/ (where I explain the lack of a driver for the onboard Ethernet).

Also, for some weird reason I can no longer boot 13.0-RELEASE on this system.  It panics with "panic: NVME polled command failed to complete within 10s". I think it doesn't like the add-on PCIe NVMe.  However, when it was working (prior to adding in the NVMe) the SAS controller was just as unstable.

Seeing how most of the hardware is still very new, I don't expect FreeBSD (especially arm64) to support it.  I'd like to help any way that I can should someone be interested. The system has an IPMI and I'd be willing to offer remote access to it for as long as it's required via VPN (if that's a thing that's normally done) on a dedicated network with any other required resources.
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2021-08-07 09:58:45 UTC
This looks like a LOR (lock order reversal) rather than a panic, resulting in or contributing to a fatal state for the controller. If this does indeed panic the system, are you able to obtain a backtrace or include/attach core dumps?
Comment 2 Daniel Morante 2021-08-08 00:27:53 UTC
You are correct.  I didn't know what a LOR was prior to you mentioning it here.  The system does not panic, it simply halts on the error.
Comment 3 Andrew Turner freebsd_committer 2021-08-11 14:32:59 UTC
I don't think it's a LOR.

It looks like the firmware has detected a problem when reading from a PCIe device. That is likely because the device is detecting a fault and being reset, i.e. one of the following:
    mpr0: IOC Fault 0x40002667, Resetting
    mpr0: IOC Fault 0x4000265d, Resetting

I'm unsure why it's being reset, or what the faults mean.
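
For what it's worth, the logged values can at least be split into their fields, which makes it easier to compare faults across boots. Below is a minimal sketch, assuming the number mpr(4) prints is the raw doorbell register and that it follows the MPI 2.x layout (IOC state in the top four bits, fault code in the low 16 bits). The mask and state values are my reading of mpi2.h and should be checked against the tree in use; `decode_doorbell` and `fault_codes` are hypothetical helper names, not part of any FreeBSD tool. It does not say what a given fault code means, since those are firmware-specific.

```python
import re

# Assumed MPI 2.x doorbell layout (values as I understand them from mpi2.h).
IOC_STATE_MASK = 0xF0000000   # IOC state lives in the top four bits
FAULT_CODE_MASK = 0x0000FFFF  # firmware fault code lives in the low 16 bits
IOC_STATES = {
    0x00000000: "RESET",
    0x10000000: "READY",
    0x20000000: "OPERATIONAL",
    0x40000000: "FAULT",
}

# Matches lines such as "mpr0: IOC Fault 0x4000265d, Resetting".
FAULT_LINE = re.compile(r"(mpr\d+): IOC Fault (0x[0-9a-fA-F]+)")


def decode_doorbell(value):
    """Split a doorbell value into (state name, fault code)."""
    state = IOC_STATES.get(value & IOC_STATE_MASK, "UNKNOWN")
    return state, value & FAULT_CODE_MASK


def fault_codes(log_text):
    """Return the distinct (device, state, fault code) tuples found in a log."""
    seen = set()
    for dev, raw in FAULT_LINE.findall(log_text):
        state, code = decode_doorbell(int(raw, 16))
        seen.add((dev, state, code))
    return sorted(seen)


if __name__ == "__main__":
    sample = (
        "mpr0: IOC Fault 0x40002667, Resetting\n"
        "mpr0: IOC Fault 0x4000265d, Resetting\n"
    )
    for dev, state, code in fault_codes(sample):
        print(f"{dev}: state={state} fault_code=0x{code:04x}")
```

Running the same function over the attached serial capture would show whether only these two fault codes occur or whether others appear under load.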