Bug 243225 - "mpr0: Out of chain frames" boot hang after clang 9.0.1 import (probably timing, not compiler related)
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-bugs mailing list
Keywords: regression
Reported: 2020-01-09 19:00 UTC by Terry Kennedy
Modified: 2020-01-10 05:36 UTC

Description Terry Kennedy 2020-01-09 19:00:53 UTC
I updated my test system from r356239 to r356557 (which crosses the clang 9.0.1 import) and started receiving "mpr0: Out of chain frames" at boot time, which causes a boot hang with the mpr0 controller being reset and reinitialized, and the error happening again. This happens before the device (tape drive) is detected, and happens regardless of whether anything is connected to the mpr controller.

I had this before (many months ago) on this system and worked with Dell service, replacing boards / cables / tape drive, etc. The solution at that point was to put the controller into a different slot, which apparently hid whatever timing problem is causing the boot hang. That's why I say in the PR title that I don't think it is a clang 9.0.1 problem (incorrect code generation). Presumably clang 9 generates faster (hopefully) or slower code that is triggering the problem.

Escaping to the boot loader and killing time, then saying "boot" without changing anything, will sometimes let the system boot normally, which again points to a possible timing problem.

The boot messages from r356239 are:

mpr0: <Avago Technologies (LSI) SAS3008> port 0x8000-0x80ff mem 0xc9100000-0xc910ffff,0xc8000000-0xc80fffff irq 64 at device 0.0 on pci17
mpr0: Firmware: 16.00.08.00, Driver: 23.00.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>
mpr0: Found device <c01<SspTarg,Direct>,End Device> <6.0Gbps> handle<0x0009> enclosureHandle<0x0001> slot 7
mpr0: At enclosure level 0 and connector name (1   )
sa0 at mpr0 bus 0 scbus14 target 7 lun 0

In r356557, only the first 3 of those lines appear, followed by:

mpr0: Out of chain frames, consider increasing hw.mpr.max_chains

And then, eventually, by:

mpr0: Calling Reinit from mpr_wait_command, timeout=60, elapsed=60
mpr0: Reinitializing controller

At that point we're in a perpetual loop of reinit / timeout.
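
The error message itself points at the hw.mpr.max_chains loader tunable described in mpr(4). A minimal sketch of raising it in /boot/loader.conf is shown below. The value is an arbitrary illustration, not a tested setting for this system, and if the underlying cause is timing rather than genuine chain frame exhaustion it may make no difference.

```
# /boot/loader.conf
# Assumption: 32768 is only an example value; it has not been tried on this system.
hw.mpr.max_chains="32768"
```

Because the hang occurs during boot, the same variable could presumably also be set for a single boot from the loader prompt with "set hw.mpr.max_chains=32768" followed by "boot".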

I can make the problem system available via remote console access (Dell iDRAC 8) or can try any suggestions for debugging this further myself.
Comment 1 Terry Kennedy 2020-01-10 01:42:37 UTC
To provide a little more information about the system this is happening on:

System: Dell PowerEdge R730
CPU: E5-2643 v3 @ 3.40GHz
RAM: 128GB (8 * 16GB DDR4-2133)
OS boot controller / drives: PERC H730 mini, card=0x1f491028 chip=0x005d1000, mrsas driver, 5 * SAS2 SSD
Tape controller: Dell "12Gb/s SAS HBA External", card=0x1f461028 chip=0x00971000, mpr driver