Bug 211699 - 11.0-BETA4 installer hangs on Dell CS23-SH
Summary: 11.0-BETA4 installer hangs on Dell CS23-SH
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.0-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: John Baldwin
URL:
Keywords: needs-patch, needs-qa, regression
Depends on:
Blocks:
 
Reported: 2016-08-09 17:10 UTC by Alan Somers
Modified: 2017-02-07 23:25 UTC

See Also:
koobs: mfc-stable11?


Attachments
console output at hang (219.48 KB, image/jpeg)
2016-08-09 17:15 UTC, Alan Somers
pciconf -lv output on an identical machine (6.88 KB, text/plain)
2016-08-09 17:16 UTC, Alan Somers
dmidecode output on an identical machine (25.04 KB, text/plain)
2016-08-09 17:16 UTC, Alan Somers
Screenshot of hang on FreeBSD 11 RC2 (315.78 KB, image/jpeg)
2016-08-30 15:59 UTC, Alan Somers
output of "pciconf -lbBce pcib4" (759 bytes, text/plain)
2016-08-30 16:52 UTC, Alan Somers
never read the slot_ctl register (1.77 KB, patch)
2016-09-07 15:15 UTC, Eric van Gyzen
never read slot_ctl; save last write except EIC (1.92 KB, patch)
2016-09-07 19:53 UTC, Eric van Gyzen
sys/dev/pci/pci_pci.c patch (415 bytes, patch)
2017-02-01 00:51 UTC, Dave Baukus

Description Alan Somers freebsd_committer freebsd_triage 2016-08-09 17:10:50 UTC
I have a Dell server with a CS23-SH motherboard.  It boots the 10.3 mini-memstick image fine, but it hangs while booting the 11.0-BETA4 installer.  While booting in verbose mode, the last thing on the console is:

pcib4: <ACPI PCI-PCI bridge> irq 16 at device 0.0 on pci3
pcib4: attempting to allocate 1 MSI vectors (1 supported)
msi: routing MSI IRQ 256 to local APIC 0 vector 52
pcib4: using IRQ 256 for MSI
pcib4: [GIANT-LOCKED]
pcib4:
Comment 1 Alan Somers freebsd_committer freebsd_triage 2016-08-09 17:15:38 UTC
Created attachment 173465
console output at hang
Comment 2 Alan Somers freebsd_committer freebsd_triage 2016-08-09 17:16:26 UTC
Created attachment 173466
pciconf -lv output on an identical machine
Comment 3 Alan Somers freebsd_committer freebsd_triage 2016-08-09 17:16:49 UTC
Created attachment 173467
dmidecode output on an identical machine
Comment 4 Alan Somers freebsd_committer freebsd_triage 2016-08-11 17:14:59 UTC
Fixed now, probably by r303781.
Comment 5 Alan Somers freebsd_committer freebsd_triage 2016-08-30 15:59:22 UTC
Created attachment 174218
Screenshot of hang on FreeBSD 11 RC2
Comment 6 Alan Somers freebsd_committer freebsd_triage 2016-08-30 16:01:14 UTC
This system worked fine on 11.0-BETA4 with a kernel built from stable/11 at r303879.  However, when I upgraded to RC2, the problem returned.
Comment 7 John Baldwin freebsd_committer freebsd_triage 2016-08-30 16:38:41 UTC
If you build an 11 kernel with DDB, can you break into the debugger when it is hung?  Alternatively, I think RC2 has debug printfs that fire on each write to the SLOT_CTL register as well as on each interrupt.  Hmm, we did write to SLOT_CTL and then it seems an interrupt storm started.

Can you set 'hw.pci.enable_pcie_hp=0' at the loader prompt to boot for now, and capture the output of 'pciconf -lbBce pcib4'?
Comment 8 Alan Somers freebsd_committer freebsd_triage 2016-08-30 16:52:08 UTC
Created attachment 174219
output of "pciconf -lbBce pcib4"

That setting allowed me to boot; thanks.  Attached is the pciconf output you asked for.
Comment 9 Alan Somers freebsd_committer freebsd_triage 2016-09-01 21:24:00 UTC
Is there any chance to fix this before 11.0-RELEASE?
Comment 10 Eric van Gyzen freebsd_committer freebsd_triage 2016-09-05 15:18:02 UTC
Would an interrupt storm not produce a flood of "HotPlug interrupt" messages?  Does some subtlety prevent those messages in this case?

Sorry for the slow reply; I was on vacation last week.
Comment 11 Eric van Gyzen freebsd_committer freebsd_triage 2016-09-05 15:20:06 UTC
Out of curiosity, what model system is this?  (kenv smbios.system.product)
Comment 12 Alan Somers freebsd_committer freebsd_triage 2016-09-06 16:22:32 UTC
$ kenv smbios.system.product
CS23-SH
Comment 13 Kubilay Kocak freebsd_committer freebsd_triage 2016-09-07 00:15:55 UTC
Keep re@ in the loop on this RC2+ regression
Comment 14 Eric van Gyzen freebsd_committer freebsd_triage 2016-09-07 15:15:19 UTC
Created attachment 174474
never read the slot_ctl register

Please try this patch.  It's against head, but should apply to releng/11.0.  I tested it several weeks ago, but not recently.  I'll test it later today.  Remember to revert the tunable that disabled HotPlug.
Comment 15 Eric van Gyzen freebsd_committer freebsd_triage 2016-09-07 19:23:09 UTC
I tested the attached patch against head r305539.  Everything works (that I am able to test).  I would be grateful if you could test it.
Comment 16 Alan Somers freebsd_committer freebsd_triage 2016-09-07 19:24:22 UTC
I'll test it too.  I should be done by 5:00 MDT.
Comment 17 Eric van Gyzen freebsd_committer freebsd_triage 2016-09-07 19:53:27 UTC
Created attachment 174487
never read slot_ctl; save last write except EIC

If you can, please use this updated patch.  On your machine, it should behave the same as the previous patch, but this is the one I would intend to commit.
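
The idea named in the patch title (never read slot_ctl; remember the last write, except for EIC) might look roughly like the sketch below. This is not the attached patch, whose contents are not shown here. The struct and function names are hypothetical; the only outside fact assumed is that bit 11 of the PCIe Slot Control register is Electromechanical Interlock Control, which toggles the interlock when written as 1 and so should not be replayed from a saved value.

```c
#include <assert.h>
#include <stdint.h>

/* Slot Control bit 11: Electromechanical Interlock Control (write 1 to toggle). */
#define SLOT_CTL_EIC 0x0800u

/* Hypothetical software copy of the register; not the driver's real state. */
struct slot_shadow {
	uint16_t last_ctl;	/* last value written, with EIC cleared */
};

/*
 * Build the next register value without reading the hardware:
 * bits outside `mask` come from the saved copy, bits inside `mask`
 * come from `val`.  The result is saved with EIC stripped so that a
 * later write does not toggle the interlock again.
 */
static uint16_t
slot_ctl_update(struct slot_shadow *s, uint16_t val, uint16_t mask)
{
	uint16_t next;

	assert((val & ~mask) == 0);	/* caller may only set masked bits */
	next = (uint16_t)((s->last_ctl & ~mask) | val);
	s->last_ctl = (uint16_t)(next & ~SLOT_CTL_EIC);
	return (next);
}
```

A caller would pass the returned value to the real register write. Keeping the shadow in software avoids depending on what a misbehaving bridge returns when the register is read back.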
Comment 18 Eric van Gyzen freebsd_committer freebsd_triage 2016-09-07 19:55:04 UTC
By the way:  Does this machine actually have hot-plug slots?
Comment 19 Alan Somers freebsd_committer freebsd_triage 2016-09-07 20:04:02 UTC
The first patch does not work on my machine.  I applied it to releng/11.0 at r305539, but I get exactly the same hang as before.  Should I still bother to test the new patch?

How do I tell if I have PCIE hotplug slots?  I'm guessing I don't.
Comment 20 Eric van Gyzen freebsd_committer freebsd_triage 2016-09-07 20:28:00 UTC
Don't bother testing the new patch; it would probably do the same.

Unfortunately, I don't have any idea how to fix this.  I could blacklist this system model or this PCIe switch and ignore its purported HotPlug support.  That would have the same effect as the tunable.

This PCIe switch reports that it has a Manual Retention Latch.  Do any of the slots have a real, physical latch that holds the card in place?  It also reports an Attention Button, which would be on or near the slot.  We've seen several systems/switches that claim to have these features but actually don't, so I wouldn't be surprised if your system is lying.

Thanks for your help so far!
Comment 21 Alan Somers freebsd_committer freebsd_triage 2016-09-07 21:35:34 UTC
Nope.  There's no retention latch, and nothing that looks like an attention switch.
Comment 22 John Baldwin freebsd_committer freebsd_triage 2016-09-09 16:31:24 UTC
I suspect your slot is a lot like the bogus slots we've worked around before except that the logic we are currently using only works if a card is in the slot.

It would be good to find out what it is doing during the hang.  One possibility is to build a kernel with the following:

options KTR
options KTR_VERBOSE
options KTR_COMPILE=KTR_PROC
options KTR_MASK=KTR_PROC

and boot it with boot -v.

In particular, KTR_PROC will log any context switches including for ithreads.  It may be that the machine starts spewing stuff forever in a loop though.  If there's any way to hook up a serial console either via cable or SOL so you can log the output that would probably be very useful.
Comment 23 Alan Somers freebsd_committer freebsd_triage 2016-09-09 23:46:34 UTC
No infinite spew loop.  It just hangs.  The hang screen looks the same as the screenshot I posted, except the HotPlug command is "01c0 -> 07ff" instead of "03c0 -> 07ff".
Comment 24 Eric van Gyzen freebsd_committer freebsd_triage 2016-09-13 17:57:49 UTC
Thanks for your help, Alan.  For now, disabling HotPlug is the best fix we can come up with.  I plan to put something like the following in the 11.0 release notes:

      <para revision="299142">Native PCI-express HotPlug
        support is enabled by default on &arch.amd64;, &arch.arm64;, and
        &arch.powerpc;.  This feature has exposed compatibility issues
        on some hardware that result in missing devices or a hang
        during boot.  To work around such issues, run
        <literal>set hw.pci.enable_pcie_hp=0</literal> in the boot
        loader, and add <literal>hw.pci.enable_pcie_hp=0</literal> to
        <filename>/boot/loader.conf</filename>.</para>
Comment 25 John Baldwin freebsd_committer freebsd_triage 2016-09-13 23:39:48 UTC
(In reply to Alan Somers from comment #23)
Hmm, interesting that we write '0x07ff'.  That would seem to indicate this device doesn't support DLL active notifications.  That isn't required by the spec I think, but that is a bit odd.  I wonder if having a power controller requires DLL for example (if it does we could use this as a way to blacklist this slot).  Eric, can you verify if all of the slots you have access to with power controllers support DLL state change notifications?
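
For readers following the bit arithmetic: in the PCIe Slot Control register, bit 12 is the Data Link Layer State Changed Enable. A value of 0x07ff has bits 0 through 10 set and bit 12 clear, which is consistent with the reading above. A minimal sketch, with hypothetical macro and function names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Slot Control bit 12: Data Link Layer State Changed Enable. */
#define SLOT_CTL_DLLSCE 0x1000u

/* True if a Slot Control value enables DLL state-change notifications. */
static bool
slot_ctl_enables_dll_events(uint16_t ctl)
{
	return ((ctl & SLOT_CTL_DLLSCE) != 0);
}
```

Applied to the value from the hang screen, `slot_ctl_enables_dll_events(0x07ff)` is false.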
Comment 26 Eric van Gyzen freebsd_committer freebsd_triage 2016-09-26 20:53:13 UTC
(In reply to John Baldwin from comment #25)

Sorry for the slow reply.

My slots support DLL change interrupts.  However, I don't think they're necessary for power control:  One could simply wait a second and poll the DLL state.  That being said, I imagine that--in the real world--all switches with power controllers also provide DLL change interrupts, so it would be fairly safe to use this as a blacklist criterion (at least until proven otherwise).
Comment 27 Alan Somers freebsd_committer freebsd_triage 2017-01-27 17:46:33 UTC
Eric, can you please tell me how to blacklist this specific PCI ID so I don't have to do it globally in all of our builds?
Comment 28 Dave Baukus 2017-02-01 00:51:07 UTC
Created attachment 179483
sys/dev/pci/pci_pci.c patch

Reading from PCI EXPRESS BASE SPECIFICATION:

Data Link Layer Link Active Reporting Capable:

For a Downstream Port, this bit must be set to 1b if the component 
supports the optional capability of reporting the DL_Active state 
of the Data Link Control and Management State Machine.  For a 
hot-plug capable Downstream Port (as indicated by the Hot-Plug 
Capable field of the Slot Capabilities register), this bit must be 
set to 1b. 
-----------

Is it possible to conclude that this ill-behaved device (i.e. '6311ESB/6321ESB PCI Express Downstream Port E1') is not truly Hot-Plug Capable, since it reports "Hot-Plug Capable" (in the Slot Capabilities register) but does NOT report "Data Link Layer Link Active Reporting Capable" (in the Link Capabilities register)?

If this conclusion is valid, is the attached patch safe/reasonable?

The patch does allow the Dell server with a CS23-SH motherboard to boot without globally disabling hotplug.
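
The rule quoted from the spec can be written as a small predicate. The sketch below is illustrative only and is not the attached patch. The macro and function names are hypothetical, and the bit positions assumed are Hot-Plug Capable at bit 6 of Slot Capabilities and Data Link Layer Link Active Reporting Capable at bit 20 of Link Capabilities.

```c
#include <stdbool.h>
#include <stdint.h>

/* Slot Capabilities bit 6: Hot-Plug Capable. */
#define SLOT_CAP_HPC		0x00000040u
/* Link Capabilities bit 20: Data Link Layer Link Active Reporting Capable. */
#define LINK_CAP_DL_ACTIVE	0x00100000u

/*
 * Treat a downstream port as hot-plug capable only if it advertises
 * both bits; a port that sets HPC without DL active reporting
 * contradicts the spec text quoted above.
 */
static bool
hotplug_claim_is_consistent(uint32_t slot_cap, uint32_t link_cap)
{
	if ((slot_cap & SLOT_CAP_HPC) == 0)
		return (false);
	return ((link_cap & LINK_CAP_DL_ACTIVE) != 0);
}
```

With this check, a bridge like the one in this report (HPC set, DL active reporting clear) would be skipped for native HotPlug without the global tunable.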
Comment 29 Eric van Gyzen freebsd_committer freebsd_triage 2017-02-01 02:16:03 UTC
Alan:  I'm sorry I didn't reply.  I never got an email from bugzilla.  Can you test Dave's patch?

Dave:  Your reading of the spec and your patch both seem obviously correct to me.

John:  I'd appreciate your opinion.
Comment 30 Eric van Gyzen freebsd_committer freebsd_triage 2017-02-01 02:17:11 UTC
removing re@
Comment 31 John Baldwin freebsd_committer freebsd_triage 2017-02-01 02:32:57 UTC
This looks good to me.  We should perhaps remove some other checks of this flag then and assume that it is always true.  In theory we can remove pcie_link_cap entirely and just use a local variable to check it once in this probe routine.

I would be fine with splitting that up into separate changes though and just committing this first.
Comment 32 John Baldwin freebsd_committer freebsd_triage 2017-02-01 02:33:54 UTC
Also good find.  I never found that as I was always focused on reading the description of the slot registers and the Hot Plug overview section.  I don't think anywhere in those places does it mention that Hot Plug requires DLL active reporting.
Comment 33 Dave Baukus 2017-02-02 01:07:37 UTC
I tested the proposed patch against 12.0-CURRENT with kernel source refreshed this (2/1) afternoon. The Dell server with a CS23-SH motherboard boots without setting hw.pci.enable_pcie_hp=1.
Comment 34 Dave Baukus 2017-02-02 16:13:50 UTC
(In reply to Dave Baukus from comment #33)

Correction to my last comment; it should read:

The Dell server with a CS23-SH motherboard boots without setting hw.pci.enable_pcie_hp=0.

That is, it boots without globally disabling hotplug.
Comment 35 commit-hook freebsd_committer freebsd_triage 2017-02-02 17:30:01 UTC
A commit references this bug:

Author: jhb
Date: Thu Feb  2 17:29:16 UTC 2017
New revision: 313097
URL: https://svnweb.freebsd.org/changeset/base/313097

Log:
  Require Data Layer Active reporting for native PCI-e HotPlug.

  Some PCI-e bridges report that they support HotPlug in the slot
  capabilities but do not report support for Data Layer Active events
  in the link capabilities register.  These bridges do not work correctly
  when HotPlug is used.  Further, while the description of HotPlug in
  the spec does not mention that DL active events are required, the
  description of the link capabilities register says that DL active is
  required for HotPlug.  Thanks to Dave Baukus for finding that language
  in the spec.

  PR:		211699
  Submitted by:	Dave Baukus <daveb@spectralogic.com>
  Reviewed by:	vangyzen
  MFC after:	3 days

Changes:
  head/sys/dev/pci/pci_pci.c
Comment 36 commit-hook freebsd_committer freebsd_triage 2017-02-07 22:40:55 UTC
A commit references this bug:

Author: jhb
Date: Tue Feb  7 22:40:38 UTC 2017
New revision: 313408
URL: https://svnweb.freebsd.org/changeset/base/313408

Log:
  MFC 313097: Require Data Layer Active reporting for native PCI-e HotPlug.

  Some PCI-e bridges report that they support HotPlug in the slot
  capabilities but do not report support for Data Layer Active events
  in the link capabilities register.  These bridges do not work correctly
  when HotPlug is used.  Further, while the description of HotPlug in
  the spec does not mention that DL active events are required, the
  description of the link capabilities register says that DL active is
  required for HotPlug.  Thanks to Dave Baukus for finding that language
  in the spec.

  PR:		211699

Changes:
_U  stable/11/
  stable/11/sys/dev/pci/pci_pci.c