Bug 243225 - "mpr0: Out of chain frames" boot hang after clang 9.0.1 import (probably timing, not compiler related)
Summary: "mpr0: Out of chain frames" boot hang after clang 9.0.1 import (probably timi...
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.0-STABLE
Hardware: Any Any
: --- Affects Some People
Assignee: freebsd-bugs mailing list
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2020-01-09 19:00 UTC by Terry Kennedy
Modified: 2020-03-10 19:53 UTC (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Terry Kennedy 2020-01-09 19:00:53 UTC
I updated my test system from r356239 to r356557 (which crosses the clang 9.0.1 import) and started receiving "mpr0: Out of chain frames" at boot time, which causes a boot hang with the mpr0 controller being reset and reinitialized, and the error happening again. This happens before the device (tape drive) is detected, and happens regardless of whether anything is connected to the mpr controller.

I had this before (many months ago) on this system and worked with Dell service, replacing boards / cables / tape drive, etc. The solution at that point was to put the controller into a different slot, which apparently hid whatever timing problem is causing the boot hang. That's why I say in the PR title that I don't think it is a clang 9.0.1 problem (incorrect code generation). Presumably clang 9 generates faster (hopefully) or slower code that is triggering the problem.

Escaping to the boot loader and killing time, then saying "boot" without changing anything will sometimes let the system boot normally. Again pointing to a possible timing problem.

The boot messages from r356239 are:

mpr0: <Avago Technologies (LSI) SAS3008> port 0x8000-0x80ff mem 0xc9100000-0xc910ffff,0xc8000000-0xc80fffff irq 64 at device 0.0 on pci17
mpr0: Firmware: 16.00.08.00, Driver: 23.00.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>
mpr0: Found device <c01<SspTarg,Direct>,End Device> <6.0Gbps> handle<0x0009> enclosureHandle<0x0001> slot 7
mpr0: At enclosure level 0 and connector name (1   )
sa0 at mpr0 bus 0 scbus14 target 7 lun 0

In r356557, only the first of those 3 lines appear, followed by:

mpr0: Out of chain frames, consider increasing hw.mpr.max_chains

And then, eventually by:

mpr0: Calling Reinit from mpr_wait_command, timeout=60, elapsed=60
mpr0: Reinitializing controller

At that point we're in a perpetual loop of reinit / timeout.

I can make the problem system available via remote console access (Dell iDRAC 8) or can try any suggestions for debugging this further myself.
Comment 1 Terry Kennedy 2020-01-10 01:42:37 UTC
To provide a little more information about the system this is happening on:

System: Dell PowerEdge R730
CPU: E5-2643 v3 @ 3.40GHz
RAM: 128GB (8 * 16GB DDR4-2133)
OS boot controller / drives: PERC H730 mini, card=0x1f491028 chip=0x005d1000, mrsas driver, 5 * SAS2 SSD
Tape controller: Dell "12Gb/s SAS HBA External", card=0x1f461028 chip=0x00971000, mpr driver
Comment 2 Peter Eriksson 2020-03-08 16:29:48 UTC
Do you have X2APIC enabled in your R730? When I try to enable it on our R730xd's with a similar setup (Dell HBA330 (mpr)) controllers it refuses to boot and just gives me a loop of:

mpr0: Calling Reinit from mpr_wait_command, timeout=60, elapsed=60
mpr0: Reinitializing controller,
mpr0: Firmware: 16.00.08.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>

FreeBSD 11.3-RELEASE-p6
Dell R730xd with 1 HBA330 Mini (mpr) and dual 10GE Intel X710 (ixl)

Switching back to X2APIC disabled allows the server to boot just fine. 


(And some errors about USB too just for the fun of it...)

uhub1: 2 ports with 2 removable, self powered
uhub0: 21 ports with 21 removable, self powered
uhub2: 2 ports with 2 removable, self powered
usb_alloc_device: set address 2 failed (USB_ERR_TIMEOUT, ignored)
usb_alloc_device: set address 2 failed (USB_ERR_TIMEOUT, ignored)
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
ugen2.2: <Unknown > at usbus2 (disconnected)
uhub_reattach_port: could not allocate new device
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
ugen1.2: <Unknown > at usbus1 (disconnected)
uhub_reattach_port: could not allocate new device
usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT
ugen0.2: <Unknown > at usbus0 (disconnected)
uhub_reattach_port: could not allocate new device
Comment 3 Terry Kennedy 2020-03-10 19:53:57 UTC
(In reply to Peter Eriksson from comment #2)

X2APIC has been off on this system since it was installed - I see the same USB symptoms that you reported if I enable it. But I get the mpr(4) errors in either case.

I updated the BIOS and OS Collector (which isn't used on FreeBSD) - everything else was already up-to-date = and repeated the tests, with the same result.

Since I suspect this is timing-related, I enabled Hyperthreading (which I normally force off in the BIOS) and the system booted (at least this one time) without the mpr(4) hang.

Again, this system had been working for some time, but then started hanging with the "out of chain frames" at boot time. Dell support moved the controller to a different slot, which "fixed" the problem, at least until the clang 9.x import, which apparently changed the timing enough to make the problem show up again.

I think all of the above points to some timing issue in the mpr(4) driver. Hopefully someone with knowledge of that driver's inner workings will see this and comment. I can continue to make the system available (via iDRAC port, so the developer can reset it, change BIOS settings, etc.) if needed for someone to diagnose this issue.