Bug 257670

Summary: mpr(4): SAS3008 PCI-Express Fusion-MPT SAS-3: Fatal unrecoverable error detected with : mpr0: IOC Fault 0x4000265d, Resetting (LOR: CAM device lock (CAM device lock, sleep mutex) @ /usr/src/sys/cam/cam_xpt.c)
Product: Base System
Reporter: Daniel Morante <daniel>
Component: arm
Assignee: freebsd-arm (Nobody) <freebsd-arm>
Status: Closed Unable to Reproduce
Severity: Affects Some People
CC: Andrew, imp, mav, scottl
Priority: ---
Keywords: ThunderX, crash, needs-qa
Version: CURRENT
Flags: koobs: maintainer-feedback? (mav), scottl: maintainer-feedback-
Hardware: arm64
OS: Any
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254651
Attachments:
Description Flags
capture of boot via serial none

Description Daniel Morante 2021-08-07 07:21:34 UTC
Created attachment 227004
capture of boot via serial

I am testing FreeBSD-14.0-CURRENT-arm64-aarch64-20210805-f3a3b061216-248478 on a Cavium ThunderX2 (Gigabyte R281-T91).  This system has an onboard SAS3008 PCI-Express Fusion-MPT SAS-3 controller.  

```
mpr0@pci0:14:0:0:       class=0x010700 rev=0x02 hdr=0x00 vendor=0x1000 device=0x0097 subvendor=0x1458 subdevice=0x3008
    vendor     = 'Broadcom / LSI'
    device     = 'SAS3008 PCI-Express Fusion-MPT SAS-3'
    class      = mass storage
    subclass   = SAS
```

I load the `mpr` driver by having `mpr_load="YES"` in `/boot/loader.conf`.  So far so good, except for the weird messages in dmesg (see attachment).

There are currently 8 HDDs attached to it and I set up 3 ZFS pools.  This goes well until I finally start to put some load on them.  The system kernel panics and halts with the following in dmesg:

```
mpr0: IOC Fault 0x4000265d, Resetting
mpr0: Reinitializing controller
...
RAS CONTROLLER: Fatal unrecoverable error detected
```

This is not to say the problem is with ZFS.  I suspect the mpr driver is just unstable.

The system can no longer boot into multi-user mode.  It kernel panics with the same error as soon as it tries to start ZFS.

```
mountroot: waiting for device /dev/nda0p2...
WARNING: / was not properly dismounted
Dual Console: Video Primary, Serial Secondary
witness_lock_list_get: witness exhausted
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
RAS CONTROLLER: Fatal unrecoverable error detected

        *** NBU Error ***
...
```

In order to get a functional system, I disable ZFS in `/etc/rc.conf` while in single-user mode.

Now, back in multi-user mode, I can do a `service zfs onestart` and try to import one of the pools.  The system then kernel panics again.

I detail the full specs of this system in bug #254651 (where I have a problem with the onboard SATA controllers) and in my forum post at https://forums.freebsd.org/threads/aarch64-trouble-with-cn99xx-ahci-and-fastlinq-ql41000-controllers.79556/ (where I explain the lack of a driver for the onboard Ethernet).

Also, for some weird reason I can no longer boot 13.0-RELEASE on this system.  It panics with "panic: NVME polled command failed to complete within 10s". I think it doesn't like the add-on PCIe NVMe drive.  However, when it was working (prior to adding the NVMe drive), the SAS controller was just as unstable.

Seeing how most of the hardware is still very new, I don't expect FreeBSD (especially arm64) to support it.  I'd like to help in any way that I can should someone be interested. The system has an IPMI and I'd be willing to offer remote access to it for as long as it's required via VPN (if that's a thing that's normally done) on a dedicated network with any other required resources.
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2021-08-07 09:58:45 UTC
This appears to be a LOR (rather than a panic?) resulting in or contributing to a fatal state for the controller. If this does indeed panic the system, are you able to obtain a backtrace or include/attach core dumps?
Comment 2 Daniel Morante 2021-08-08 00:27:53 UTC
You are correct.  I didn't know what a LOR was prior to you mentioning it here.  The system does not panic, it simply halts on the error.
Comment 3 Andrew Turner freebsd_committer freebsd_triage 2021-08-11 14:32:59 UTC
I don't think it's a LOR.

It looks like the firmware has detected a problem when reading from a PCIe device. It's likely it's because the device is detecting a fault and is being reset, so one of the following:
    mpr0: IOC Fault 0x40002667, Resetting
    mpr0: IOC Fault 0x4000265d, Resetting

I'm unsure why it's being reset, or what the faults mean.
Comment 4 Warner Losh freebsd_committer freebsd_triage 2022-01-30 21:12:12 UTC
The IOC Fault may be due to a fix I made. I'll be committing it soon.
Comment 5 Warner Losh freebsd_committer freebsd_triage 2022-01-30 21:33:50 UTC
0x4000265d is "A noncritical data TLB interrupt occurred."
0x4000266z is "A bus fault occurred on the IOC-to-host memory move."
This is also with mpr (not mps), so the change I made to mps won't affect that.

It sounds like a bad address is getting sent to the card somehow, but beyond that I'm unsure what might be causing it. A number of fixes have been made to mpr since this bug was filed, so it might not hurt to try again.
Comment 6 Warner Losh freebsd_committer freebsd_triage 2022-02-20 22:52:05 UTC
IOC Fault means the firmware signalled an error. Not the most helpful way to do it. The codes are weird and make me suspect that addresses that were sent down are causing errors when the card tries to read or write them. Unsure beyond that. Is there an IOMMU involved?
Comment 7 Warner Losh freebsd_committer freebsd_triage 2024-02-19 02:34:35 UTC
I have this same card in an Ampere system and don't see issues with it.

But the string 'RAS' doesn't appear in the mpr driver. I suspect it may be due to the beta hardware that this was hit on. It otherwise looks like it recovered.