I have a Dell server with a CS23-SH motherboard. It boots the 10.3 mini-memstick image fine, but it hangs while booting the 11.0-BETA4 installer. When booting in verbose mode, the last thing on the console is:

pcib4: <ACPI PCI-PCI bridge> irq 16 at device 0.0 on pci3
pcib4: attempting to allocate 1 MSI vectors (1 supported)
msi: routing MSI IRQ 256 to local APIC 0 vector 52
pcib4: using IRQ 256 for MSI
pcib4: [GIANT-LOCKED]
pcib4:
Created attachment 173465 [details] console output at hang
Created attachment 173466 [details] pciconf -lv output on an identical machine
Created attachment 173467 [details] dmidecode output on an identical machine
Fixed now, probably by r303781.
Created attachment 174218 [details] Screenshot of hang on FreeBSD 11 RC2
This system worked fine on 11.0-BETA4 with a kernel built from stable/11 at r303879. However, when I upgraded to RC2, the problem returned.
If you build an 11 kernel with DDB, can you break into the debugger when it is hung? Alternatively, I think RC2 has debug printfs that fire on each write to the SLOT_CTL register as well as on each interrupt. Hmm, we did write to SLOT_CTL and then it seems an interrupt storm started. Can you set 'hw.pci.enable_pcie_hp=0' at the loader prompt to boot for now, and capture the output of 'pciconf -lbBce pcib4'?
Created attachment 174219 [details] output of "pciconf -lbBce pcib4" That setting allowed me to boot; thanks. Attached is the pciconf output you asked for.
Is there any chance to fix this before 11.0-RELEASE ?
Would an interrupt storm not produce a flood of "HotPlug interrupt" messages? Does some subtlety prevent those messages in this case? Sorry for the slow reply; I was on vacation last week.
Out of curiosity, what model system is this? (kenv smbios.system.product)
$ kenv smbios.system.product
CS23-SH
Keep re@ in the loop on this RC2+ regression
Created attachment 174474 [details] never read the slot_ctl register Please try this patch. It's against head, but should apply to releng/11.0. I tested it several weeks ago, but not recently. I'll test it later today. Remember to revert the tunable that disabled HotPlug.
I tested the attached patch against head r305539. Everything works (that I am able to test). I would be grateful if you could test it.
I'll test it too. I should be done by 5:00 MDT.
Created attachment 174487 [details] never read slot_ctl; save last write except EIC If you can, please use this updated patch. On your machine, it should behave the same as the previous patch, but this is the one I would intend to commit.
By the way: Does this machine actually have hot-plug slots?
The first patch does not work on my machine. I applied it to releng/11.0 at r305539, but I get exactly the same hang as before. Should I still bother to test the new patch? How do I tell if I have PCIE hotplug slots? I'm guessing I don't.
Don't bother testing the new patch; it would probably do the same. Unfortunately, I don't have any idea how to fix this. I could blacklist this system model or this PCIe switch and ignore its purported HotPlug support. That would have the same effect as the tunable. This PCIe switch reports that it has a Manual Retention Latch. Do any of the slots have a real, physical latch that holds the card in place? It also reports an Attention Button, which would be on or near the slot. We've seen several systems/switches that claim to have these features but actually don't, so I wouldn't be surprised if your system is lying. Thanks for your help so far!
Nope. There's no retention latch, and nothing that looks like an attention switch.
I suspect your slot is a lot like the bogus slots we've worked around before, except that the logic we are currently using only works if a card is in the slot. It would be good to find out what it is doing during the hang. One possibility is to build a kernel with the following:

  options KTR
  options KTR_VERBOSE
  options KTR_COMPILE=KTR_PROC
  options KTR_MASK=KTR_PROC

and boot it with boot -v. In particular, KTR_PROC will log any context switches, including for ithreads. It may be that the machine starts spewing stuff forever in a loop, though. If there's any way to hook up a serial console, either via cable or SOL, so you can log the output, that would probably be very useful.
No infinite spew loop. It just hangs. The hang screen looks the same as the screenshot I posted, except the HotPlug command is "01c0 -> 07ff" instead of "03c0 -> 07ff".
Thanks for your help, Alan. For now, disabling HotPlug is the best fix we can come up with. I plan to put something like the following in the 11.0 release notes: <para revision="299142">Native PCI-express HotPlug support is enabled by default on &arch.amd64;, &arch.arm64;, and &arch.powerpc;. This feature has exposed compatibility issues on some hardware that result in missing devices or a hang during boot. To work around such issues, run <literal>set hw.pci.enable_pcie_hp=0</literal> in the boot loader, and add <literal>hw.pci.enable_pcie_hp=0</literal> to <filename>/boot/loader.conf</filename>.</para>
(In reply to Alan Somers from comment #23) Hmm, interesting that we write '0x07ff'. That would seem to indicate this device doesn't support DLL active notifications. That isn't required by the spec I think, but that is a bit odd. I wonder if having a power controller requires DLL for example (if it does we could use this as a way to blacklist this slot). Eric, can you verify if all of the slots you have access to with power controllers support DLL state change notifications?
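The observation about 0x07ff can be checked by hand. The sketch below assumes the PCI Express Base Specification layout of the Slot Control register, where bit 12 (0x1000) is Data Link Layer State Changed Enable. The macro and function names are hypothetical and are not taken from the FreeBSD sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Slot Control, bit 12: Data Link Layer State Changed Enable (per the spec). */
#define SLOT_CTL_DLL_STATE_CHANGED_ENABLE 0x1000u

/*
 * Hypothetical helper: report whether a value written to Slot Control
 * asks for Data Link Layer state change notifications.
 */
static bool
slot_ctl_enables_dll_events(uint16_t slot_ctl)
{
	return ((slot_ctl & SLOT_CTL_DLL_STATE_CHANGED_ENABLE) != 0);
}

/* Illustrative checks using the value seen in this report. */
static void
slot_ctl_examples(void)
{
	/* 0x07ff sets bits 0 through 10 only, so bit 12 is clear. */
	assert(!slot_ctl_enables_dll_events(0x07ff));
	/* The same value with bit 12 added would request the events. */
	assert(slot_ctl_enables_dll_events(0x17ff));
}
```

A written value of 0x07ff with bit 12 clear is consistent with the driver having decided that this port does not advertise DLL active reporting.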
(In reply to John Baldwin from comment #25) Sorry for the slow reply. My slots support DLL change interrupts. However, I don't think they're necessary for power control: One could simply wait a second and poll the DLL state. That being said, I imagine that--in the real world--all switches with power controllers also provide DLL change interrupts, so it would be fairly safe to use this as a blacklist criterion (at least until proven otherwise).
Eric, can you please tell me how to blacklist this specific PCI ID so I don't have to do it globally in all of our builds?
Created attachment 179483 [details] sys/dev/pci/pci_pci.c patch

Reading from the PCI Express Base Specification:

Data Link Layer Link Active Reporting Capable: For a Downstream Port, this bit must be set to 1b if the component supports the optional capability of reporting the DL_Active state of the Data Link Control and Management State Machine. For a hot-plug capable Downstream Port (as indicated by the Hot-Plug Capable field of the Slot Capabilities register), this bit must be set to 1b.

-----------

Is it possible to conclude, for this ill-behaved device (i.e. '6311ESB/6321ESB PCI Express Downstream Port E1'), that since it reports "Hot-Plug Capable" (in the Slot Capabilities register) but does NOT report "Data Link Layer Link Active Reporting Capable" (in the Link Capabilities register), this device is not truly Hot-Plug Capable? If this conclusion is valid, is the attached patch safe/reasonable? The patch does allow the Dell server with a CS23-SH motherboard to boot without globally disabling hotplug.
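For readers without the attachment, the rule being proposed can be sketched as below. This is an illustration only, not the attached patch. It assumes the spec's bit positions (Hot-Plug Capable is bit 6 of Slot Capabilities, Data Link Layer Link Active Reporting Capable is bit 20 of Link Capabilities), and the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as given in the PCI Express Base Specification. */
#define SLOT_CAP_HOT_PLUG_CAPABLE 0x00000040u /* Slot Capabilities, bit 6 */
#define LINK_CAP_DL_ACTIVE_REPORT 0x00100000u /* Link Capabilities, bit 20 */

/*
 * Hypothetical helper: treat a downstream port as hot-plug capable only
 * if it also advertises Data Link Layer Link Active reporting, which the
 * spec requires of every hot-plug capable downstream port.
 */
static bool
hotplug_claim_is_credible(uint32_t slot_cap, uint32_t link_cap)
{
	if ((slot_cap & SLOT_CAP_HOT_PLUG_CAPABLE) == 0)
		return (false);
	return ((link_cap & LINK_CAP_DL_ACTIVE_REPORT) != 0);
}

/* Illustrative register values; not read from real hardware. */
static void
hotplug_claim_examples(void)
{
	/* Claims hot-plug and reports DL active: accepted. */
	assert(hotplug_claim_is_credible(0x00000040u, 0x00100000u));
	/* Claims hot-plug without DL active reporting: rejected. */
	assert(!hotplug_claim_is_credible(0x00000040u, 0x00000000u));
	/* No hot-plug claim at all: nothing to enable. */
	assert(!hotplug_claim_is_credible(0x00000000u, 0x00100000u));
}
```

Under this rule, a port like the one described here, which sets Hot-Plug Capable but not the DL active reporting bit, would be handled as a slot without hot-plug support, with no need for the global tunable.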
Alan: I'm sorry I didn't reply. I never got an email from bugzilla. Can you test Dave's patch? Dave: Your reading of the spec and your patch both seem obviously correct to me. John: I'd appreciate your opinion.
removing re@
This looks good to me. We should perhaps remove some other checks of this flag then and assume that it is always true. In theory we can remove pcie_link_cap entirely and just use a local variable to check it once in this probe routine. I would be fine with splitting that up into separate changes though and just committing this first.
Also good find. I never found that as I was always focused on reading the description of the slot registers and the Hot Plug overview section. I don't think anywhere in those places does it mention that Hot Plug requires DLL active reporting.
I tested the proposed patch against 12.0-CURRENT with kernel source refreshed this (2/1) afternoon. The Dell server with a CS23-SH motherboard boots without setting hw.pci.enable_pcie_hp=1.
(In reply to Dave Baukus from comment #33) Correction to my last comment; it should read: The Dell server with a CS23-SH motherboard boots without setting hw.pci.enable_pcie_hp=0. That is, it boots without globally disabling hotplug.
A commit references this bug:

Author: jhb
Date: Thu Feb 2 17:29:16 UTC 2017
New revision: 313097
URL: https://svnweb.freebsd.org/changeset/base/313097

Log:
  Require Data Layer Active reporting for native PCI-e HotPlug.

  Some PCI-e bridges report that they support HotPlug in the slot capabilities but do not report support for Data Layer Active events in the link capabilities register. These bridges do not work correctly when HotPlug is used. Further, while the description of HotPlug in the spec does not mention that DL active events are required, the description of the link capabilities register says that DL active is required for HotPlug. Thanks to Dave Baukus for finding that language in the spec.

  PR: 211699
  Submitted by: Dave Baukus <daveb@spectralogic.com>
  Reviewed by: vangyzen
  MFC after: 3 days

Changes:
  head/sys/dev/pci/pci_pci.c
A commit references this bug:

Author: jhb
Date: Tue Feb 7 22:40:38 UTC 2017
New revision: 313408
URL: https://svnweb.freebsd.org/changeset/base/313408

Log:
  MFC 313097: Require Data Layer Active reporting for native PCI-e HotPlug.

  Some PCI-e bridges report that they support HotPlug in the slot capabilities but do not report support for Data Layer Active events in the link capabilities register. These bridges do not work correctly when HotPlug is used. Further, while the description of HotPlug in the spec does not mention that DL active events are required, the description of the link capabilities register says that DL active is required for HotPlug. Thanks to Dave Baukus for finding that language in the spec.

  PR: 211699

Changes:
  _U stable/11/
  stable/11/sys/dev/pci/pci_pci.c