I updated my test system from r356239 to r356557 (which crosses the clang 9.0.1 import) and started receiving "mpr0: Out of chain frames" at boot time, which causes a boot hang with the mpr0 controller being reset and reinitialized, and the error happening again. This happens before the device (tape drive) is detected, and happens regardless of whether anything is connected to the mpr controller. I had this before (many months ago) on this system and worked with Dell service, replacing boards / cables / tape drive, etc. The solution at that point was to put the controller into a different slot, which apparently hid whatever timing problem is causing the boot hang. That's why I say in the PR title that I don't think it is a clang 9.0.1 problem (incorrect code generation). Presumably clang 9 generates faster (hopefully) or slower code that is triggering the problem. Escaping to the boot loader and killing time, then saying "boot" without changing anything will sometimes let the system boot normally. Again pointing to a possible timing problem. The boot messages from r356239 are: mpr0: <Avago Technologies (LSI) SAS3008> port 0x8000-0x80ff mem 0xc9100000-0xc910ffff,0xc8000000-0xc80fffff irq 64 at device 0.0 on pci17 mpr0: Firmware: 16.00.08.00, Driver: 23.00.00.00-fbsd mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray> mpr0: Found device <c01<SspTarg,Direct>,End Device> <6.0Gbps> handle<0x0009> enclosureHandle<0x0001> slot 7 mpr0: At enclosure level 0 and connector name (1 ) sa0 at mpr0 bus 0 scbus14 target 7 lun 0 In r356557, only the first of those 3 lines appear, followed by: mpr0: Out of chain frames, consider increasing hw.mpr.max_chains And then, eventually by: mpr0: Calling Reinit from mpr_wait_command, timeout=60, elapsed=60 mpr0: Reinitializing controller At that point we're in a perpetual loop of reinit / timeout. I can make the problem system available via remote console access (Dell iDRAC 8) or can try any suggestions for debugging this further myself.
To provide a little more information about the system this is happening on: System: Dell PowerEdge R730 CPU: E5-2643 v3 @ 3.40GHz RAM: 128GB (8 * 16GB DDR4-2133) OS boot controller / drives: PERC H730 mini, card=0x1f491028 chip=0x005d1000, mrsas driver, 5 * SAS2 SSD Tape controller: Dell "12Gb/s SAS HBA External", card=0x1f461028 chip=0x00971000, mpr driver
Do you have X2APIC enabled in your R730? When I try to enable it on our R730xd's with a similar setup (Dell HBA330 (mpr)) controllers it refuses to boot and just gives me a loop of: mpr0: Calling Reinit from mpr_wait_command, timeout=60, elapsed=60 mpr0: Reinitializing controller, mpr0: Firmware: 16.00.08.00, Driver: 18.03.00.00-fbsd mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray> FreeBSD 11.3-RELEASE-p6 Dell R730xd with 1 HBA330 Mini (mpr) and dual 10GE Intel X710 (ixl) Switching back to X2APIC disabled allows the server to boot just fine. (And some errors about USB too just for the fun of it...) uhub1: 2 ports with 2 removable, self powered uhub0: 21 ports with 21 removable, self powered uhub2: 2 ports with 2 removable, self powered usb_alloc_device: set address 2 failed (USB_ERR_TIMEOUT, ignored) usb_alloc_device: set address 2 failed (USB_ERR_TIMEOUT, ignored) usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) usbd_req_re_enumerate: addr=2, set address failed! (USB_ERR_TIMEOUT, ignored) ugen2.2: <Unknown > at usbus2 (disconnected) uhub_reattach_port: could not allocate new device usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT ugen1.2: <Unknown > at usbus1 (disconnected) uhub_reattach_port: could not allocate new device usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT ugen0.2: <Unknown > at usbus0 (disconnected) uhub_reattach_port: could not allocate new device
(In reply to Peter Eriksson from comment #2) X2APIC has been off on this system since it was installed - I see the same USB symptoms that you reported if I enable it. But I get the mpr(4) errors in either case. I updated the BIOS and OS Collector (which isn't used on FreeBSD) - everything else was already up-to-date = and repeated the tests, with the same result. Since I suspect this is timing-related, I enabled Hyperthreading (which I normally force off in the BIOS) and the system booted (at least this one time) without the mpr(4) hang. Again, this system had been working for some time, but then started hanging with the "out of chain frames" at boot time. Dell support moved the controller to a different slot, which "fixed" the problem, at least until the clang 9.x import, which apparently changed the timing enough to make the problem show up again. I think all of the above points to some timing issue in the mpr(4) driver. Hopefully someone with knowledge of that driver's inner workings will see this and comment. I can continue to make the system available (via iDRAC port, so the developer can reset it, change BIOS settings, etc.) if needed for someone to diagnose this issue.
Looking at the source we see: mpr_dprint(sc, MPR_INFO, "Out of chain frames, " "consider increasing hw.mpr.max_chains.\n"); Have you increased hw.mpr.max_chains? We use this sort of controller all the time (though we don't use tape drive) at work and haven't seen this at all.
(In reply to Warner Losh from comment #4) I tried increasing it and it doesn't help. This happens during the initial device probe at boot time. We should only need chain frames if we're doing I/O to a connected peripheral, and this error happens even without the tape drive (or anything else) connected. Changing the slot the controller is in, enabling/disabling hyperthreading, or sometimes just sitting at the loader prompt before proceeding will make the problem disappear. My guess is that it is a timing loop that is close to marginal, depending on the other hardware in the system. When the problem manifests, the system goes down the rabbit hole of "out of chain frames", probably because the code thinks the only reason for an error in that part of the path is running out of chain frames. This is a Dell R730 (complete description in prior reply) so I can capture the complete boot as a video and make the video (and player app) available if anyone wants to look at it. Alternatively, I can provide remore access (console, reset, etc) via the Dell iDRAC controller. The problem has persisted from 12.0 to the latest 12-STABLE. All hardware is up-to-date, Dell has replaced the controller multiple times and the tape drive and cable once.
(In reply to Terry Kennedy from comment #5) Following up to my own post. I just emailed Warner Losh the following (forgetting that the discussion was also going on in this PR): The system is a Dell PowerEdge R730 with dual E5-2643 v3 CPUs, 128GB RAM, PERC H730 Mini RAID controller (for internal SAS drives) and a Dell "12Gb External SAS Adapter" (LSI SAS 3008) which is the problem controller. Since I emailed you last, Dell changed it again (they're getting irritated at this point) with no change. With a r364046 kernel from today, the system boots fine with hyper- threading enabled, but shows thenmprmhang with hyperthreading disabled. The previous kernel worked with hyperthreading disabled - as I've said, this appears to be timing-related as it first showed up after the LLVM 9.0.0 import into 12-STABLE. A verbose boot didn't appear to show anything, but as the kernel mes- sage buffer never gets written to disk because of the hang, I have to play the boot sequence back as a video from the internal remote manage- ment card. Also, sometimes a verbose boot works, since this is a timing- related Heisenbug, apparently. I have created a directory https://www.glaver.org/transient/imp which has (so far) dmesg.boot and pciconf.txt from a good boot, bootcapture_ bad.dvc from a non-hyperthreaded boot hang, bootcapture_good.dvc from a successful hyperthreaded boot. and videoplayer.zip which is the play- er for .dvc files. It contains a Windows .exe and a jar file and shell script for Linux (and presumably FreeBSD with Java and a GUI). I can provide remote access to the Dell iDRAC 8 Enterprise card which allows console viewing and all sorts of other stuff, like reset / power cycle and access to BIOS setup, etc. if needed. Let me know if there is anything else you need.