Summary: | "mpr0: Out of chain frames" boot hang after clang 9.0.1 import (probably timing, not compiler related) | ||
---|---|---|---|
Product: | Base System | Reporter: | Terry Kennedy <terry-freebsd> |
Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> |
Status: | New --- | ||
Severity: | Affects Some People | CC: | imp, pen, terry-freebsd |
Priority: | --- | Keywords: | regression |
Version: | 12.0-STABLE | ||
Hardware: | Any | ||
OS: | Any |
Description
Terry Kennedy
2020-01-09 19:00:53 UTC
To provide a little more information about the system this is happening on: System: Dell PowerEdge R730 CPU: E5-2643 v3 @ 3.40GHz RAM: 128GB (8 * 16GB DDR4-2133) OS boot controller / drives: PERC H730 mini, card=0x1f491028 chip=0x005d1000, mrsas driver, 5 * SAS2 SSD Tape controller: Dell "12Gb/s SAS HBA External", card=0x1f461028 chip=0x00971000, mpr driver Do you have X2APIC enabled in your R730? When I try to enable it on our R730xd's with a similar setup (Dell HBA330 (mpr)) controllers it refuses to boot and just gives me a loop of: mpr0: Calling Reinit from mpr_wait_command, timeout=60, elapsed=60 mpr0: Reinitializing controller, mpr0: Firmware: 16.00.08.00, Driver: 18.03.00.00-fbsd mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray> FreeBSD 11.3-RELEASE-p6 Dell R730xd with 1 HBA330 Mini (mpr) and dual 10GE Intel X710 (ixl) Switching back to X2APIC disabled allows the server to boot just fine. (And some errors about USB too just for the fun of it...) uhub1: 2 ports with 2 removable, self powered uhub0: 21 ports with 21 removable, self powered uhub2: 2 ports with 2 removable, self powered usb_alloc_device: set address 2 failed (USB_ERR_TIMEOUT, ignored) usb_alloc_device: set address 2 failed (USB_ERR_TIMEOUT, ignored) usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) ugen2.2: <Unknown > at usbus2 (disconnected) uhub_reattach_port: could not allocate new device usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT ugen1.2: <Unknown > at usbus1 (disconnected) uhub_reattach_port: could not allocate new device usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT ugen0.2: <Unknown > at usbus0 (disconnected) uhub_reattach_port: could not allocate new device (In reply to Peter Eriksson from comment #2) X2APIC has been off on this system since it was installed - I see the same USB symptoms that you reported if I enable it. But I get the mpr(4) errors in either case. I updated the BIOS and OS Collector (which isn't used on FreeBSD) - everything else was already up-to-date = and repeated the tests, with the same result. Since I suspect this is timing-related, I enabled Hyperthreading (which I normally force off in the BIOS) and the system booted (at least this one time) without the mpr(4) hang. Again, this system had been working for some time, but then started hanging with the "out of chain frames" at boot time. Dell support moved the controller to a different slot, which "fixed" the problem, at least until the clang 9.x import, which apparently changed the timing enough to make the problem show up again. I think all of the above points to some timing issue in the mpr(4) driver. Hopefully someone with knowledge of that driver's inner workings will see this and comment. I can continue to make the system available (via iDRAC port, so the developer can reset it, change BIOS settings, etc.) if needed for someone to diagnose this issue. Looking at the source we see: mpr_dprint(sc, MPR_INFO, "Out of chain frames, " "consider increasing hw.mpr.max_chains.\n"); Have you increased hw.mpr.max_chains? We use this sort of controller all the time (though we don't use tape drive) at work and haven't seen this at all. (In reply to Warner Losh from comment #4) I tried increasing it and it doesn't help. This happens during the initial device probe at boot time. We should only need chain frames if we're doing I/O to a connected peripheral, and this error happens even without the tape drive (or anything else) connected. Changing the slot the controller is in, enabling/disabling hyperthreading, or sometimes just sitting at the loader prompt before proceeding will make the problem disappear. My guess is that it is a timing loop that is close to marginal, depending on the other hardware in the system. When the problem manifests, the system goes down the rabbit hole of "out of chain frames", probably because the code thinks the only reason for an error in that part of the path is running out of chain frames. This is a Dell R730 (complete description in prior reply) so I can capture the complete boot as a video and make the video (and player app) available if anyone wants to look at it. Alternatively, I can provide remore access (console, reset, etc) via the Dell iDRAC controller. The problem has persisted from 12.0 to the latest 12-STABLE. All hardware is up-to-date, Dell has replaced the controller multiple times and the tape drive and cable once. (In reply to Terry Kennedy from comment #5) Following up to my own post. I just emailed Warner Losh the following (forgetting that the discussion was also going on in this PR): The system is a Dell PowerEdge R730 with dual E5-2643 v3 CPUs, 128GB RAM, PERC H730 Mini RAID controller (for internal SAS drives) and a Dell "12Gb External SAS Adapter" (LSI SAS 3008) which is the problem controller. Since I emailed you last, Dell changed it again (they're getting irritated at this point) with no change. With a r364046 kernel from today, the system boots fine with hyper- threading enabled, but shows thenmprmhang with hyperthreading disabled. The previous kernel worked with hyperthreading disabled - as I've said, this appears to be timing-related as it first showed up after the LLVM 9.0.0 import into 12-STABLE. A verbose boot didn't appear to show anything, but as the kernel mes- sage buffer never gets written to disk because of the hang, I have to play the boot sequence back as a video from the internal remote manage- ment card. Also, sometimes a verbose boot works, since this is a timing- related Heisenbug, apparently. I have created a directory https://www.glaver.org/transient/imp which has (so far) dmesg.boot and pciconf.txt from a good boot, bootcapture_ bad.dvc from a non-hyperthreaded boot hang, bootcapture_good.dvc from a successful hyperthreaded boot. and videoplayer.zip which is the play- er for .dvc files. It contains a Windows .exe and a jar file and shell script for Linux (and presumably FreeBSD with Java and a GUI). I can provide remote access to the Dell iDRAC 8 Enterprise card which allows console viewing and all sorts of other stuff, like reset / power cycle and access to BIOS setup, etc. if needed. Let me know if there is anything else you need. |