Summary: | mpr(4): SAS3008 PCI-Express Fusion-MPT SAS-3: Fatal unrecoverable error detected with : mpr0: IOC Fault 0x4000265d, Resetting (LOR: CAM device lock (CAM device lock, sleep mutex) @ /usr/src/sys/cam/cam_xpt.c) | ||||||
---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Daniel Morante <daniel> | ||||
Component: | arm | Assignee: | freebsd-arm (Nobody) <freebsd-arm> | ||||
Status: | Closed Unable to Reproduce | ||||||
Severity: | Affects Some People | CC: | Andrew, imp, mav, scottl | ||||
Priority: | --- | Keywords: | ThunderX, crash, needs-qa | ||||
Version: | CURRENT | Flags: | koobs: maintainer-feedback? (mav), scottl: maintainer-feedback- | ||||
Hardware: | arm64 | ||||||
OS: | Any | ||||||
See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254651 | ||||||
Description
Daniel Morante
2021-08-07 07:21:34 UTC
This appears to be a LOR (rather than a panic?) resulting in or contributing to a fatal state for the controller. If this does indeed panic the system, are you able to obtain a backtrace or include/attach core dumps?

You are correct. I didn't know what a LOR was prior to you mentioning it here. The system does not panic; it simply halts on the error.

I don't think it's a LOR. It looks like the firmware has detected a problem when reading from a PCIe device.

It's likely because the device is detecting a fault and is being reset, so one of the following:

mpr0: IOC Fault 0x40002667, Resetting
mpr0: IOC Fault 0x4000265d, Resetting

I'm unsure why it's being reset, or what the faults mean.

The IOC Fault may be due to a fix I made. I'll be committing it soon.

0x4000265d is "A noncritical data TLB interrupt occurred."
0x4000266z is "A bus fault occurred on the IOC-to-host memory move."

This is also with mpr (not mps), so the change I made to mps won't affect that. It sounds like a bad address is getting sent to the card somehow, but beyond that I'm unsure what might be causing it.

A number of fixes have been made to mpr since this bug was filed, so it might not hurt to try again.

IOC Fault means the firmware signalled an error. Not the most helpful way to do it. The codes are weird and make me suspect that addresses that were sent down are causing errors when the card tries to read or write them. Unsure beyond that. Is there an IOMMU involved? I have this same card in an Ampere system and don't see issues with it.

But the string 'RAS' doesn't appear in the mpr driver. I suspect it may be due to the beta hardware that was hit. It otherwise looks like it recovered.
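For anyone trying to read the values in the "IOC Fault 0x..., Resetting" messages: below is a minimal sketch of how such a value could be split into an IOC state and a fault code. It assumes the printed value is the raw MPI2 doorbell register, with the state in the top four bits and the fault code in the low 16 bits; the mask values are written out by hand here and should be checked against the driver's mpi2.h. The function name and the lookup table are hypothetical, and the table only holds the meanings quoted in this thread (the 0x2667 entry assumes the second description refers to the 0x40002667 fault).

```python
# Assumed MPI2 doorbell layout (verify against mpi2.h before relying on it):
# top nibble = IOC state, low 16 bits = fault code when the state is FAULT.
IOC_STATE_MASK = 0xF0000000
IOC_STATE_FAULT = 0x40000000
FAULT_CODE_MASK = 0x0000FFFF

# Hypothetical lookup table built only from the descriptions quoted in this thread.
FAULT_MEANINGS = {
    0x265D: "A noncritical data TLB interrupt occurred.",
    # Assumption: this description belongs to the 0x40002667 fault.
    0x2667: "A bus fault occurred on the IOC-to-host memory move.",
}


def decode_doorbell(value):
    """Split a doorbell value into fault flag, fault code, and known meaning."""
    state = value & IOC_STATE_MASK
    code = value & FAULT_CODE_MASK
    return {
        "in_fault": state == IOC_STATE_FAULT,
        "fault_code": code,
        "meaning": FAULT_MEANINGS.get(code, "unknown"),
    }


if __name__ == "__main__":
    for raw in (0x4000265D, 0x40002667):
        info = decode_doorbell(raw)
        print(f"0x{raw:08x}: fault={info['in_fault']} "
              f"code=0x{info['fault_code']:04x} ({info['meaning']})")
```

Both values from the log decode to the FAULT state under this layout, so the only part that differs between them is the 16-bit code, which is firmware-defined and not interpreted by the driver itself.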