Bug 196542

Summary: System stops booting with "ACPI APIC Table: <INTEL DENLOW>"
Product: Base System
Reporter: Jonas Keidel <jonas>
Component: kern
Assignee: John Baldwin <jhb>
Status: Closed FIXED
Severity: Affects Some People
CC: chris, herbert, jhb, mm, sebastian
Priority: ---    
Version: 10.1-STABLE   
Hardware: amd64   
OS: Any   
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197756
Attachments:
Intel S1200 FreeBSD 10.1 "ACPI Denlow"
Complete Dmesg with verbose on
Bootup with printf enabled
Printf in start_ap function
ipi_startup printf
mp_machdep.c patch with printf

Description Jonas Keidel 2015-01-06 14:33:24 UTC
FreeBSD 10.1 does not boot correctly on the Intel S1200RPv3 board. We did some tests with mfsBSD based on FreeBSD 10.1 (http://mfsbsd.vx.sk/). The first boot works, but after a following reboot it won't boot and stops with the following message:

ACPI APIC Table: <INTEL DENLOW>
To fix it, we need to boot a Linux system. After Linux has booted correctly, FreeBSD boots correctly the next time. But on the next reboot it won't start again.

Maybe there is a problem initializing the ACPI table, which Linux handles correctly and FreeBSD does not.
Comment 1 John Baldwin freebsd_committer freebsd_triage 2015-01-06 15:25:22 UTC
Are you using boot -v?
Comment 2 Jonas Keidel 2015-01-07 07:17:17 UTC
Created attachment 151448 [details]
Intel S1200 FreeBSD 10.1 "ACPI Denlow"

I started the system with "Verbose" = "on", but no further information was shown.
Comment 3 Jonas Keidel 2015-01-07 10:39:15 UTC
Created attachment 151451 [details]
Complete Dmesg with verbose on

I've added a complete dmesg of an installed FreeBSD 10.1 booting with verbose on.
The system hangs for some time at the Denlow message and after that it sometimes boots through to login. But on some random reboots it hangs there forever (strange).
Comment 4 John Baldwin freebsd_committer freebsd_triage 2015-01-13 19:34:20 UTC
So one thing we are doing between those printfs is starting up the other CPUs.  Can you build a custom kernel with printfs sprinkled in sys/amd64/amd64/mp_machdep.c?  cpu_mp_start() and native_start_all_aps() would be good places to start.
Comment 5 Jonas Keidel 2015-01-14 08:36:15 UTC
Created attachment 151604 [details]
Bootup with printf enabled

I've added some printfs in sys/amd64/amd64/mp_machdep.c to the functions cpu_mp_start() and start_all_aps() (I didn't find native_start_all_aps()...) and booted up the system again.

Each printf prints the comment of the code block it sits in.
Comment 6 John Baldwin freebsd_committer freebsd_triage 2015-01-14 13:55:41 UTC
Comment on attachment 151604 [details]
Bootup with printf enabled

So it looks like it hung trying to start up the second AP?  Normally though we have a timeout here that fails and prints a message (and offers to panic :-/) if the AP doesn't respond in time.  In the past when I've seen similar issues they have either been due to SMI# interference (we disable legacy USB mode on certain Macs early in the boot process to workaround similar issues) or they were resolved via a BIOS update.  Is this a new Intel platform?
Comment 7 Jonas Keidel 2015-01-14 14:20:37 UTC
(In reply to John Baldwin from comment #6)
Well, it's not a new Intel platform. It is a S1200V3RPL server mainboard running a Xeon E3-1270v3 with 32GB of memory. The current BIOS version is installed; maybe it needs some more fixes. If I disable legacy USB it won't boot via PXE, so that isn't an option...
Comment 8 John Baldwin freebsd_committer freebsd_triage 2015-01-14 14:33:48 UTC
I think I saw in a thread on stable@ that you said it works fine if you disable HTT?  Is that true?
Comment 9 Jonas Keidel 2015-01-14 14:36:37 UTC
(In reply to John Baldwin from comment #8)
Someone else said that it worked, but in my case it doesn't help. Disabling HTT isn't an option anyway, because we need HTT.
Comment 10 Jonas Keidel 2015-01-21 07:20:06 UTC
Is there any news on this bug?
Comment 11 John Baldwin freebsd_committer freebsd_triage 2015-01-26 15:59:09 UTC
I assume you've let it sit for more than 5 seconds?  Normally if trying to start an AP times out, it panics after waiting about 5 seconds.  I'm puzzled that it is hanging forever without panic'ing.  The reason to test with legacy USB disabled is to try to narrow down a possible cause (similar to HTT) btw.

If you are up for one more test, can you also instrument the "start_ap" function to verify that it at least gets into the spin loop and is spinning when it dies?  (Maybe print out something periodically in the wait loop, could start with once a second, but if that doesn't output you could make it print more often to see if it is looping at all, or if it is hanging in the DELAY() call)
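
A minimal user-space sketch of the kind of instrumentation suggested above: poll once per millisecond and print a progress line at a fixed interval so it is visible whether the loop is still turning. The names fake_ap, fake_delay_us and wait_for_ap are hypothetical stand-ins, not the contents of mp_machdep.c; in the kernel the delay would be DELAY() and the check would be the AP's startup flag.

```c
#include <assert.h>
#include <stdio.h>

/* Simulated AP (hypothetical): reports "started" once it has been polled
 * more than ready_after times. A negative value models an AP that never
 * comes up. */
struct fake_ap {
    int ready_after;
    int polls;
};

static int fake_ap_started(struct fake_ap *ap)
{
    ap->polls++;
    return ap->ready_after >= 0 && ap->polls > ap->ready_after;
}

/* Stand-in for the kernel's DELAY(); this sketch does not really wait. */
static void fake_delay_us(int us)
{
    (void)us;
}

/*
 * Poll once per (simulated) millisecond for up to timeout_ms and print a
 * progress line every report_ms, so a hang can be narrowed down to either
 * the loop itself or the delay call.
 * Returns 1 if the AP started, 0 if the wait timed out.
 */
static int wait_for_ap(struct fake_ap *ap, int timeout_ms, int report_ms)
{
    assert(timeout_ms >= 0);
    for (int ms = 0; ms < timeout_ms; ms++) {
        if (fake_ap_started(ap))
            return 1;
        if (report_ms > 0 && ms % report_ms == 0)
            printf("start_ap: still waiting, ms=%d\n", ms);
        fake_delay_us(1000);
    }
    return 0;
}
```

If the progress lines keep appearing, the loop is spinning; if they stop after the first one, the delay call itself is the suspect.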
Comment 12 Jonas Keidel 2015-01-27 09:07:51 UTC
Created attachment 152254 [details]
Printf in start_ap function

(In reply to John Baldwin from comment #11)

I've waited longer than 5 seconds. The system stays in this state for a couple of hours.

So I've added some printfs to the start_ap function. As you can see, it starts a couple of APs and then hangs on one.
The message "wait ms: 0" is in the delay while-loop, so it does not seem to hang in the DELAY().
Comment 13 Jonas Keidel 2015-01-27 09:20:45 UTC
Regarding disabling legacy USB:
I've tested that and noticed that the system hangs at the same point as with legacy USB enabled, but then it boots correctly after half a minute or so. Maybe it is related but just does not trigger the error.

I hope that the provided information helps.
Comment 14 John Baldwin freebsd_committer freebsd_triage 2015-01-27 15:50:39 UTC
(In reply to Jonas Keidel from comment #12)

To be clear, does the machine hang at the image you provided?  If so, it appears to be hung in ipi_startup?

Can you instrument ipi_startup?  The calls in there to lapic_ipi_wait(-1) can potentially hang forever.  One thing you can do is to change the '-1's to 1000000 and see if the machine boots.  It seems that Linux does this (it just gives up waiting if the ICR bit doesn't clear).
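
To illustrate the difference between waiting forever and giving up after a spin count, here is a small user-space model. It is only a sketch: fake_icr and ipi_wait are hypothetical names, and the register is simulated rather than read from a local APIC. The one hardware detail assumed is that the delivery-status (pending) flag is bit 12 of the low ICR word.

```c
#include <assert.h>
#include <stdint.h>

/* Delivery-status (pending) flag in the low word of the ICR. */
#define ICR_DELIVERY_PENDING (1u << 12)

/* Simulated ICR (hypothetical): the pending bit clears once the register
 * has been read more than clears_after times. A negative value models a
 * register that stays busy forever. */
struct fake_icr {
    int clears_after;
    int reads;
};

static uint32_t fake_icr_read(struct fake_icr *icr)
{
    icr->reads++;
    if (icr->clears_after >= 0 && icr->reads > icr->clears_after)
        return 0;
    return ICR_DELIVERY_PENDING;
}

/*
 * Spin until the pending bit clears or max_spins reads have been made.
 * max_spins == -1 means "wait forever", which never returns if the bit
 * stays set. The register is always read at least once.
 * Returns 1 if the IPI was dispatched, 0 if the wait gave up.
 */
static int ipi_wait(struct fake_icr *icr, int max_spins)
{
    int spins = 0;

    assert(max_spins >= -1);
    for (;;) {
        if ((fake_icr_read(icr) & ICR_DELIVERY_PENDING) == 0)
            return 1;
        if (max_spins != -1 && ++spins >= max_spins)
            return 0;
    }
}
```

With a register that never clears, ipi_wait(&icr, -1) spins forever, while ipi_wait(&icr, 1000000) returns 0 and lets the caller decide what to do next.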
Comment 15 Jonas Keidel 2015-01-28 12:58:33 UTC
Created attachment 152309 [details]
ipi_startup printf

As you can see in my screenshot, it hangs there (sometimes) forever.
So I've changed the delay time to 1000000 and it boots correctly. Should that be 10 seconds or can it be shorter? Now every boot hangs for quite a long time because of this delay...
Comment 16 Jonas Keidel 2015-01-28 13:01:22 UTC
And is it good to correct the symptoms and not the cause? Why does the ICR not become ready fast enough on this system? I think this is more interesting than setting a static delay which is not necessary on most systems...
Comment 17 John Baldwin freebsd_committer freebsd_triage 2015-01-28 13:48:06 UTC
To be clear, it is hanging in the DELAY() and not in the call to lapic_ipi_raw() to send the first startup IPI?  (That is, your printfs are before the line in question, not after?)

Making the delay longer would seem to contradict that as if it was going to hang in DELAY() it would seem to hang for a long delay the same as a short one (DELAY just spins on the TSC).

The 10 millisecond wait there is what is specified in the original Intel SMP spec as the appropriate delay between INIT and STARTUP.  Also, if you are hanging in the DELAY, then ICR has cleared just fine.  Perhaps post a diff of your changes to mp_machdep.c just so I can be clear on where the logging has been added?

Comment 18 Jonas Keidel 2015-01-28 14:00:11 UTC
Created attachment 152311 [details]
mp_machdep.c patch with printf

(In reply to John Baldwin from comment #17)

This is a diff of my changes in mp_machdep.c. I always put the printf before the appropriate line.
Comment 19 John Baldwin freebsd_committer freebsd_triage 2015-01-28 21:33:55 UTC
Comment on attachment 152311 [details]
mp_machdep.c patch with printf

Ahhh, so you didn't alter the arg passed to DELAY(), but you made the lapic_ipi_wait() calls time out.  Ok.  That is actually what Linux does too, though I think it might specify a timeout in a time unit (e.g. microseconds) rather than a simple spin count.  I will think about this some more to come up with a real patch.
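
A spin count depends on how fast the CPU runs the loop, while a timeout in microseconds does not. The sketch below shows one way a time-based wait could look. It is an illustration only, not the eventual patch: ipi_wait_us, sim, sim_now_us and sim_idle are hypothetical names, and the clock is a simulation in which every clock read advances time by one microsecond.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*now_us_fn)(void *ctx); /* monotonic time in microseconds */
typedef int (*idle_fn)(void *ctx);        /* nonzero once the ICR is idle */

/*
 * Wait until is_idle() reports success or timeout_us microseconds have
 * passed according to now_us(). Returns 1 on success, 0 on timeout.
 */
static int ipi_wait_us(idle_fn is_idle, now_us_fn now_us, void *ctx,
                       uint64_t timeout_us)
{
    uint64_t start = now_us(ctx);

    assert(is_idle != 0 && now_us != 0);
    for (;;) {
        if (is_idle(ctx))
            return 1;
        if (now_us(ctx) - start >= timeout_us)
            return 0;
    }
}

/* Simulated hardware (hypothetical): the ICR goes idle at time idle_at. */
struct sim {
    uint64_t now;
    uint64_t idle_at;
};

/* Each clock read returns the current time, then advances it by 1 us. */
static uint64_t sim_now_us(void *ctx)
{
    struct sim *s = ctx;
    return s->now++;
}

static int sim_idle(void *ctx)
{
    struct sim *s = ctx;
    return s->now >= s->idle_at;
}
```

In this model, a register that needs 60 microseconds to go idle makes ipi_wait_us() fail with a 20 microsecond timeout and succeed with a 100 microsecond one, regardless of how many loop iterations fit into that time.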
Comment 20 Jonas Keidel 2015-01-29 06:54:30 UTC
(In reply to John Baldwin from comment #19)
That sounds nice that you might have a solution for this! I'm looking forward to the patch fixing this problem!
Comment 21 John Baldwin freebsd_committer freebsd_triage 2015-01-29 17:48:50 UTC
So it doesn't help that the original Intel MP spec and Intel's later SDM docs conflict with each other.  Please try the changes in https://reviews.freebsd.org/D1719
Comment 22 Jonas Keidel 2015-02-02 12:32:51 UTC
(In reply to John Baldwin from comment #21)
Thanks for the patch. I've tested it and it works very well. Sometimes it appears to take a little longer to get the IPI initialized, but it does not hang forever.
Maybe there is some more room for improvement? With the Linux kernel there is no delay while initializing the IPI, and on other boards the delay does not appear either.

But with this state of the patch it works very well, thanks a lot.
Comment 23 John Baldwin freebsd_committer freebsd_triage 2015-02-02 15:40:54 UTC
I'm not sure what else to change really.  Linux waits for up to 100 milliseconds if the ICR is stuck, but aside from that it uses the same set of operations as in this patch.

Linux does increment an interrupt counter called "icr_read_retry_count" when it thinks the ICR is stuck.  I'm not sure how it would export it, but perhaps you can see if it is advertised somewhere?
Comment 24 Jonas Keidel 2015-02-03 07:08:28 UTC
(In reply to John Baldwin from comment #23)
Maybe there is a problem while shutting down the system. If I boot a Linux system, followed by booting a FreeBSD 10.1 rescue system (based on mfsBSD), there is no problem. Nothing hangs.
If I reboot and start mfsBSD again, it hangs. So there might be a problem while shutting down FreeBSD. Maybe it sets some registers or similar, which causes the hang during the second boot.
Might this be a starting point for figuring out the problem?
Comment 25 commit-hook freebsd_committer 2015-02-06 18:20:54 UTC
A commit references this bug:

Author: jhb
Date: Fri Feb  6 18:20:01 UTC 2015
New revision: 278325
URL: https://svnweb.freebsd.org/changeset/base/278325

  Revert the IPI startup sequence to match what is described in the
  Intel Multiprocessor Specification v1.4.  The Intel SDM claims that
  the INIT IPIs here are invalid, but other systems follow the MP
  spec instead.

  While here, fix the IPI wait routine to accept a timeout in microseconds
  instead of a raw spin count, and don't spin forever during AP startup.
  Instead, panic if a STARTUP IPI is not delivered after 20 us.

  PR:		196542
  Differential Revision:	https://reviews.freebsd.org/D1719
  MFC after:	2 weeks

Comment 26 John Baldwin freebsd_committer freebsd_triage 2015-02-06 18:55:49 UTC
Does FreeBSD boot fine from a cold boot as well?  We don't do anything super special on the APs during shutdown.  One thing you can try perhaps is changing the enable_intr() in sys/amd64/amd64/vm_machdep.c cpu_reset() in the #ifdef SMP code to a disable_intr() instead.
Comment 27 Christos Chatzaras 2015-03-01 15:18:14 UTC
Does Revision 278325 fix this issue?
Comment 28 John Baldwin freebsd_committer freebsd_triage 2015-03-11 14:35:34 UTC
(In reply to chris from comment #27)

Yes, this fixed the panic on boot.  The remaining issue is that he still sees it take a while after rebooting from FreeBSD (that and another bug report I now have after the commit in question where another machine now panics because the startup IPI takes too long)
Comment 29 Christos Chatzaras 2015-03-11 17:10:23 UTC
Any idea when it will go to 10-STABLE?
Comment 30 John Baldwin freebsd_committer freebsd_triage 2015-03-12 14:31:41 UTC
(In reply to chris from comment #29)

Since this broke other systems on HEAD I want to get that regression fixed before I merge the change.
Comment 31 commit-hook freebsd_committer 2015-04-15 16:52:51 UTC
A commit references this bug:

Author: jhb
Date: Wed Apr 15 16:52:35 UTC 2015
New revision: 281560
URL: https://svnweb.freebsd.org/changeset/base/281560

  MFC 278325,280866:
  Revert the IPI startup sequence to match what is described in the
  Intel Multiprocessor Specification v1.4.  The Intel SDM claims that
  the INIT IPIs here are invalid, but other systems follow the MP
  spec instead.

  While here, fix the IPI wait routine to accept a timeout in microseconds
  instead of a raw spin count, and don't spin forever during AP startup.
  Instead, panic if a STARTUP IPI is not delivered after 20 us.

  Wait 100 microseconds for a local APIC to dispatch each startup-related IPI
  rather than 20.  The MP 1.4 specification states in Appendix B.2:

    "A period of 20 microseconds should be sufficient for IPI dispatch to
     complete under normal operating conditions".

  (Note that this appears to be separate from the 10 millisecond (INIT) and
  200 microsecond (STARTUP) waits after the IPIs are dispatched.)  The
  Intel SDM is silent on this issue as far as I can tell.

  At least some hardware requires 60 microseconds as noted in the PR, so
  bump this to 100 to be on the safe side.

  PR:		196542, 197756

_U  stable/10/
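
The timing budget described in the commit message can be summarized in a short simulation. This is a sketch under assumptions, not the committed code: it assumes the one INIT plus two STARTUP sequence of the MP specification, uses hypothetical names (sim_lapic, send_ipi, start_ap_sketch), and simply returns 0 where the real code is described as panicking.

```c
#include <assert.h>

#define IPI_DISPATCH_TIMEOUT_US 100   /* per-IPI dispatch wait, raised from 20 */
#define INIT_SETTLE_US          10000 /* 10 ms wait after INIT */
#define STARTUP_SETTLE_US       200   /* 200 us wait after each STARTUP */

/* Simulated local APIC (hypothetical): every IPI takes dispatch_us to be
 * dispatched; elapsed_us accumulates the time spent so far. */
struct sim_lapic {
    int dispatch_us;
    long elapsed_us;
};

/*
 * "Send" one IPI: fail if dispatch takes longer than the timeout,
 * otherwise account for the dispatch time plus the settle delay.
 * Returns 1 on success, 0 if the dispatch timed out.
 */
static int send_ipi(struct sim_lapic *l, int settle_us)
{
    assert(l->dispatch_us >= 0);
    if (l->dispatch_us > IPI_DISPATCH_TIMEOUT_US) {
        l->elapsed_us += IPI_DISPATCH_TIMEOUT_US;
        return 0;
    }
    l->elapsed_us += l->dispatch_us + settle_us;
    return 1;
}

/* Assumed sequence: INIT, then two STARTUP IPIs, stopping at the first
 * dispatch that times out. */
static int start_ap_sketch(struct sim_lapic *l)
{
    return send_ipi(l, INIT_SETTLE_US) &&
           send_ipi(l, STARTUP_SETTLE_US) &&
           send_ipi(l, STARTUP_SETTLE_US);
}
```

In this model, hardware that needs 60 microseconds per dispatch completes the sequence in about 10.6 ms, whereas the same hardware would have failed the first dispatch under a 20 microsecond limit.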
Comment 32 Christos Chatzaras 2015-08-21 16:10:07 UTC
I upgraded to 10.2 and the issue still exists.
Comment 33 Christos Chatzaras 2016-10-01 15:40:57 UTC
The issue still exists with 10.3-RELEASE-p7.
Comment 34 John Baldwin freebsd_committer freebsd_triage 2016-10-02 23:17:40 UTC
The patches in question were present in both 10.2 and 10.3.  You can try testing the last RC of 11.0 to see if 11 is also affected.  If so, you will need to follow the same procedure used earlier in this PR to instrument the relevant functions to narrow down where the hang occurs.
Comment 35 Herbert J. Skuhra 2016-10-24 18:06:11 UTC
I have a system running 11.0-STABLE (r307819) with similar symptoms. Every now and then the system fails to (re)boot and hangs at line:


It seems that the code was modified quite a bit. So what patch can I test?
This is a remote server that I can only access via ssh or a rescue system.
Comment 36 Christos Chatzaras 2016-10-24 18:14:44 UTC
Is this a server from Hetzner?
Comment 37 Herbert J. Skuhra 2016-10-24 18:33:06 UTC
Yes, it is.
Comment 38 Christos Chatzaras 2016-10-24 18:38:45 UTC
Ask them to update the BIOS to the latest version and see if this fixes the issue.
Comment 39 Sebastian Oswald (sko) 2020-03-24 10:56:34 UTC
Sorry for posting to a closed bug report, but it seems this bug has been re-introduced (or hasn't been fully fixed).

I'm stuck with the same problem on an Intel NUC7i5BN. The BIOS has been updated to the latest version, and boot hangs for both legacy and UEFI with any [11.3|12.0|12.1|13.0]-[RELEASE|STABLE|CURRENT]-mini-memstick images currently available for download...

Last few lines of boot -v output:

Package ID shift: 4
L3 cache ID shift: 4
L2 cache ID shift: 1
L1 cache ID shift: 1
Core ID shift: 1