Bug 243225 - "mpr0: Out of chain frames" boot hang after clang 9.0.1 import (probably timing, not compiler related)
Summary: "mpr0: Out of chain frames" boot hang after clang 9.0.1 import (probably timi...
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.0-STABLE
Hardware: Any Any
: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2020-01-09 19:00 UTC by Terry Kennedy
Modified: 2020-08-08 22:40 UTC (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Terry Kennedy 2020-01-09 19:00:53 UTC
I updated my test system from r356239 to r356557 (which crosses the clang 9.0.1 import) and started receiving "mpr0: Out of chain frames" at boot time, which causes a boot hang with the mpr0 controller being reset and reinitialized, and the error happening again. This happens before the device (tape drive) is detected, and happens regardless of whether anything is connected to the mpr controller.

I had this before (many months ago) on this system and worked with Dell service, replacing boards / cables / tape drive, etc. The solution at that point was to put the controller into a different slot, which apparently hid whatever timing problem is causing the boot hang. That's why I say in the PR title that I don't think it is a clang 9.0.1 problem (incorrect code generation). Presumably clang 9 generates faster (hopefully) or slower code that is triggering the problem.

Escaping to the boot loader and killing time, then saying "boot" without changing anything will sometimes let the system boot normally. Again pointing to a possible timing problem.

The boot messages from r356239 are:

mpr0: <Avago Technologies (LSI) SAS3008> port 0x8000-0x80ff mem 0xc9100000-0xc910ffff,0xc8000000-0xc80fffff irq 64 at device 0.0 on pci17
mpr0: Firmware: 16.00.08.00, Driver: 23.00.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>
mpr0: Found device <c01<SspTarg,Direct>,End Device> <6.0Gbps> handle<0x0009> enclosureHandle<0x0001> slot 7
mpr0: At enclosure level 0 and connector name (1   )
sa0 at mpr0 bus 0 scbus14 target 7 lun 0

In r356557, only the first of those 3 lines appear, followed by:

mpr0: Out of chain frames, consider increasing hw.mpr.max_chains

And then, eventually by:

mpr0: Calling Reinit from mpr_wait_command, timeout=60, elapsed=60
mpr0: Reinitializing controller

At that point we're in a perpetual loop of reinit / timeout.

I can make the problem system available via remote console access (Dell iDRAC 8) or can try any suggestions for debugging this further myself.
Comment 1 Terry Kennedy 2020-01-10 01:42:37 UTC
To provide a little more information about the system this is happening on:

System: Dell PowerEdge R730
CPU: E5-2643 v3 @ 3.40GHz
RAM: 128GB (8 * 16GB DDR4-2133)
OS boot controller / drives: PERC H730 mini, card=0x1f491028 chip=0x005d1000, mrsas driver, 5 * SAS2 SSD
Tape controller: Dell "12Gb/s SAS HBA External", card=0x1f461028 chip=0x00971000, mpr driver
Comment 2 Peter Eriksson 2020-03-08 16:29:48 UTC
Do you have X2APIC enabled in your R730? When I try to enable it on our R730xd's with a similar setup (Dell HBA330 (mpr)) controllers it refuses to boot and just gives me a loop of:

mpr0: Calling Reinit from mpr_wait_command, timeout=60, elapsed=60
mpr0: Reinitializing controller,
mpr0: Firmware: 16.00.08.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>

FreeBSD 11.3-RELEASE-p6
Dell R730xd with 1 HBA330 Mini (mpr) and dual 10GE Intel X710 (ixl)

Switching back to X2APIC disabled allows the server to boot just fine. 


(And some errors about USB too just for the fun of it...)

uhub1: 2 ports with 2 removable, self powered
uhub0: 21 ports with 21 removable, self powered
uhub2: 2 ports with 2 removable, self powered
usb_alloc_device: set address 2 failed (USB_ERR_TIMEOUT, ignored)
usb_alloc_device: set address 2 failed (USB_ERR_TIMEOUT, ignored)
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored)
ugen2.2: <Unknown > at usbus2 (disconnected)
uhub_reattach_port: could not allocate new device
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
ugen1.2: <Unknown > at usbus1 (disconnected)
uhub_reattach_port: could not allocate new device
usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT
usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT
ugen0.2: <Unknown > at usbus0 (disconnected)
uhub_reattach_port: could not allocate new device
Comment 3 Terry Kennedy 2020-03-10 19:53:57 UTC
(In reply to Peter Eriksson from comment #2)

X2APIC has been off on this system since it was installed - I see the same USB symptoms that you reported if I enable it. But I get the mpr(4) errors in either case.

I updated the BIOS and OS Collector (which isn't used on FreeBSD) - everything else was already up-to-date = and repeated the tests, with the same result.

Since I suspect this is timing-related, I enabled Hyperthreading (which I normally force off in the BIOS) and the system booted (at least this one time) without the mpr(4) hang.

Again, this system had been working for some time, but then started hanging with the "out of chain frames" at boot time. Dell support moved the controller to a different slot, which "fixed" the problem, at least until the clang 9.x import, which apparently changed the timing enough to make the problem show up again.

I think all of the above points to some timing issue in the mpr(4) driver. Hopefully someone with knowledge of that driver's inner workings will see this and comment. I can continue to make the system available (via iDRAC port, so the developer can reset it, change BIOS settings, etc.) if needed for someone to diagnose this issue.
Comment 4 Warner Losh freebsd_committer freebsd_triage 2020-06-03 23:49:20 UTC
Looking at the source we see:

mpr_dprint(sc, MPR_INFO, "Out of chain frames, "
    "consider increasing hw.mpr.max_chains.\n");

Have you increased hw.mpr.max_chains?

We use this sort of controller all the time (though we don't use tape drive) at work and haven't seen this at all.
Comment 5 Terry Kennedy 2020-06-04 00:28:52 UTC
(In reply to Warner Losh from comment #4)

I tried increasing it and it doesn't help. This happens during the initial device probe at boot time. We should only need chain frames if we're doing I/O to a connected peripheral, and this error happens even without the tape drive (or anything else) connected. Changing the slot the controller is in, enabling/disabling hyperthreading, or sometimes just sitting at the loader prompt before proceeding will make the problem disappear. My guess is that it is a timing loop that is close to marginal, depending on the other hardware in the system. When the problem manifests, the system goes down the rabbit hole of "out of chain frames", probably because the code thinks the only reason for an error in that part of the path is running out of chain frames.

This is a Dell R730 (complete description in prior reply) so I can capture the complete boot as a video and make the video (and player app) available if anyone wants to look at it. Alternatively, I can provide remore access (console, reset, etc) via the Dell iDRAC controller.

The problem has persisted from 12.0 to the latest 12-STABLE. All hardware is up-to-date, Dell has replaced the controller multiple times and the tape drive and cable once.
Comment 6 Terry Kennedy 2020-08-08 22:40:11 UTC
(In reply to Terry Kennedy from comment #5)
Following up to my own post.

I just emailed Warner Losh the following (forgetting that the discussion was also going on in this PR):

  The system is a Dell PowerEdge R730 with dual E5-2643 v3 CPUs, 128GB
RAM, PERC H730 Mini RAID controller (for internal SAS drives) and a
Dell "12Gb External SAS Adapter" (LSI SAS 3008) which is the problem
controller. Since I emailed you last, Dell changed it again (they're
getting irritated at this point) with no change.

  With a r364046 kernel from today, the system boots fine with hyper-
threading enabled, but shows thenmprmhang with hyperthreading disabled.
The previous kernel worked with hyperthreading disabled - as I've said,
this appears to be timing-related as it first showed up after the LLVM
9.0.0 import into 12-STABLE.

  A verbose boot didn't appear to show anything, but as the kernel mes-
sage buffer never gets written to disk because of the hang, I have to
play the boot sequence back as a video from the internal remote manage-
ment card. Also, sometimes a verbose boot works, since this is a timing-
related Heisenbug, apparently.

  I have created a directory https://www.glaver.org/transient/imp which
has (so far) dmesg.boot and pciconf.txt from a good boot, bootcapture_
bad.dvc from a non-hyperthreaded boot hang, bootcapture_good.dvc from
a successful hyperthreaded boot. and videoplayer.zip which is the play-
er for .dvc files. It contains a Windows .exe and a jar file and shell
script for Linux (and presumably FreeBSD with Java and a GUI).

  I can provide remote access to the Dell iDRAC 8 Enterprise card which
allows console viewing and all sorts of other stuff, like reset / power
cycle and access to BIOS setup, etc. if needed.

  Let me know if there is anything else you need.