Bug 220277 - 11.1-BETA3 hangs on Dell Vostro V130 during boot
Summary: 11.1-BETA3 hangs on Dell Vostro V130 during boot
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.0-STABLE
Hardware: amd64 Any
: --- Affects Some People
Assignee: Mark Johnston
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2017-06-26 06:13 UTC by Zdenek Zavadil
Modified: 2017-07-06 17:37 UTC (History)
3 users (show)

See Also:


Attachments
the last screen of the boot (23.81 KB, image/png)
2017-06-26 19:30 UTC, Zdenek Zavadil
no flags Details
/var/log/messages (88.03 KB, text/plain)
2017-06-26 19:34 UTC, Zdenek Zavadil
no flags Details
DDB - "next" until it hangs (15.89 KB, image/png)
2017-07-02 18:26 UTC, Zdenek Zavadil
no flags Details
DDB - "step" + "tr" (17.73 KB, image/png)
2017-07-02 18:35 UTC, Zdenek Zavadil
no flags Details
DDB - "ps" (39.14 KB, image/png)
2017-07-02 18:55 UTC, Zdenek Zavadil
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Zdenek Zavadil 2017-06-26 06:13:40 UTC
Hi,
I've upgraded my laptops Dell Vostro V13 and V130 to 11.1-BETA3. While on V13 everything is ok, on V130 it hangs during the boot. The last message is:

acpi_tz0: <Thermal Zone> on acpi0

But booting in safe mode works.

I can provide you more details if needed...

TIA, 
Zdenek
Comment 1 Glen Barber freebsd_committer freebsd_triage 2017-06-26 16:22:19 UTC
(In reply to Zdenek Zavadil from comment #0)
> Hi,
> I've upgraded my laptops Dell Vostro V13 and V130 to 11.1-BETA3. While on
> V13 everything is ok, on V130 it hangs during the boot. The last message is:
> 
> acpi_tz0: <Thermal Zone> on acpi0
> 
> But booting in safe mode works.
> 

Can you boot in verbose mode, and provide the output?  (Or pictures, if you do not have serial console.)

At the loader prompt, 'boot -v'.
Comment 2 Zdenek Zavadil 2017-06-26 19:30:46 UTC
Created attachment 183819 [details]
the last screen of the boot

I am only able to capture the last screen of the boot because it locks completely...
Comment 3 Zdenek Zavadil 2017-06-26 19:34:55 UTC
Created attachment 183820 [details]
/var/log/messages

And this is the /var/log/messages file from the normal boot of 11.0-RELEASE (for the comparison).
Comment 4 Zdenek Zavadil 2017-07-01 13:22:10 UTC
I've tried 11.1-RC1, unfortunately the problem persists.

I've made some other experiments: only setting kern.smp.disabled=1 made it to boot again.
Comment 5 Glen Barber freebsd_committer freebsd_triage 2017-07-01 14:02:25 UTC
John, could this somehow be related to EARLY_AP_STARTUP?
Comment 6 John Baldwin freebsd_committer freebsd_triage 2017-07-01 16:44:20 UTC
Possibly.  Can you boot via kern.smp.disabled=1 to get the box installed and then build a custom kernel that includes DDB?  You can install that kernel as a test kernel (e.g. to /boot/test) see if Ctrl-Alt-Esc works to break into the db> prompt when it hangs.  If so, please get the output of 'ps' and 'tr'.
Comment 7 Mark Johnston freebsd_committer freebsd_triage 2017-07-02 18:06:32 UTC
I'm seeing identical symptoms on HEAD on a Dell Inspiron 5748. I'm not sure when the regression was introduced; for the past several months this laptop has been using a custom kernel config with which I'm able to boot. The config omits EARLY_AP_STARTUP, but I haven't verified whether adding it triggers a hang.

I don't appear to be able to break into DDB at this point during boot; the keyboard driver hasn't attached yet.
Comment 8 Zdenek Zavadil 2017-07-02 18:26:50 UTC
Created attachment 184014 [details]
DDB - "next" until it hangs

Thanks for comments.

I've successfully built the custom kernel and here are the outputs from DDB...

The first one is the result of pressing "next" until it hangs.
Comment 9 Zdenek Zavadil 2017-07-02 18:35:38 UTC
Created attachment 184015 [details]
DDB - "step" + "tr"

The next one is the result of pressing "step" just before the hang (the previous screenshot) until it seems looping forever in sched_idled function...

At the end is output of "tr".
Comment 10 Zdenek Zavadil 2017-07-02 18:55:41 UTC
Created attachment 184016 [details]
DDB - "ps"

And the last one is the output of "ps" (3 screenshots awkwardly pasted together) gotten at the same time as previous one.

Would it be helpful to try to build another kernel without EARLY_AP_STARTUP?

Thanks!
Comment 11 Zdenek Zavadil 2017-07-02 20:51:56 UTC
(In reply to Mark Johnston from comment #7)

Mark,
I also wasn't able to break into debugging using Ctrl-Alt-Esc, so I used "boot -d" from the boot loader.
Comment 12 Mark Johnston freebsd_committer freebsd_triage 2017-07-02 20:54:05 UTC
I managed to capture the hang in ddb using boot -d. thread0 is mtx_sleep()ing in AcpiOsAcquireMutex() with a stack that goes through acpi_button_probe(). A taskqueue thread is pause()d with the following stack:

pause_sbt()
AcpiExSystemDoSleep()
AcpiDsExecEndOp
AcpiPsParseLoop()
AcpiPsParseAml()
AcpiPsExecuteMethod()
AcpiNsEvaluate()
AcpiEvaluateObject()
acpi_cmbat_get_bst()
acpi_cmbat_init_battery()
<taskqueue stuff>

I presume that this thread is holding the ACPI mutex that thread0 is waiting for. So it seems that r310336 wasn't sufficient.
Comment 13 Zdenek Zavadil 2017-07-02 21:01:54 UTC
(In reply to Zdenek Zavadil from comment #9)

Just to be more specific, it seems to be looping among these functions (in this order):
sched_idletd
cpu_idle
sched_runnable
cpu_idle
Comment 14 Mark Johnston freebsd_committer freebsd_triage 2017-07-02 21:36:35 UTC
(In reply to Zdenek Zavadil from comment #13)
That would just be an idle thread running. The bug appears to be a sort of deadlock, and the participating threads are not on-CPU while the hang occurs.
Comment 15 John Baldwin freebsd_committer freebsd_triage 2017-07-04 14:10:56 UTC
(In reply to Mark Johnston from comment #12)
It's probably worth looking at the stacks of all the other threads to find which one owns the lock and then determine why it isn't running.

Are there any other threads that are runnable?  It might be that we need to have pause() on thread0 still not sleep via callouts, but perhaps it needs to call mi_switch/sched_yield in a loop to give other runnable threads a chance in conjunction with spinning.
Comment 16 Mark Johnston freebsd_committer freebsd_triage 2017-07-04 20:01:45 UTC
(In reply to John Baldwin from comment #15)
The taskqueue thread owns the lock. There aren't any other runnable threads.

The issue is that the taskqueue thread can't wake up until the eventtimer is configured during SI_SUB_CLOCK. thread0 is blocked on the lock held by the taskqueue thread and is still in SI_SUB_CONFIGURE.
Comment 17 John Baldwin freebsd_committer freebsd_triage 2017-07-05 14:48:51 UTC
(In reply to Mark Johnston from comment #16)
We could defer handling ACPI tasks until timers are working.  That is probably the simplest / shortest solution.  More complicated would be to fix pause() to always use sched_yield + spinning for all threads (not just thread0 as I suggested in my previous comment).  The first approach can be tried by changing the SYSINIT for acpi_taskq_init in sys/dev/acpica/Osd/OsdSchedule.c:

Index: sys/dev/acpica/Osd/OsdSchedule.c
===================================================================
--- sys/dev/acpica/Osd/OsdSchedule.c	(revision 320674)
+++ sys/dev/acpica/Osd/OsdSchedule.c	(working copy)
@@ -128,7 +128,8 @@
     acpi_taskq_started = 1;
 }
 
-SYSINIT(acpi_taskq, SI_SUB_CONFIGURE, SI_ORDER_SECOND, acpi_taskq_init, NULL);
+SYSINIT(acpi_taskq, SI_SUB_KICK_SCHEDULER, SI_ORDER_SECOND, acpi_taskq_init,
+    NULL);
 
 /*
  * Bounce through this wrapper function since ACPI-CA doesn't understand
Comment 18 Mark Johnston freebsd_committer freebsd_triage 2017-07-05 16:48:37 UTC
(In reply to John Baldwin from comment #17)
That seems to do the trick in my case, and is probably the best way to solve this for 11.1.
Comment 19 John Baldwin freebsd_committer freebsd_triage 2017-07-05 16:53:33 UTC
Can you please commit the fix to HEAD?  We defer starting other ACPI threads (such as the ACPI thermal zone threads) until this same time anyway.

If we get another glitch with EARLY_AP_STARTUP for 11.1 we will just disable it though.
Comment 20 Mark Johnston freebsd_committer freebsd_triage 2017-07-05 16:56:18 UTC
(In reply to John Baldwin from comment #19)
Sure, will do.
Comment 21 commit-hook freebsd_committer freebsd_triage 2017-07-05 17:40:03 UTC
A commit references this bug:

Author: markj
Date: Wed Jul  5 17:39:17 UTC 2017
New revision: 320690
URL: https://svnweb.freebsd.org/changeset/base/320690

Log:
  Defer ACPI taskqueue creation to SI_SUB_KICK_SCHEDULER.

  This addresses a deadlock during boot when EARLY_AP_STARTUP is configured:
  a taskqueue thread may call pause() with an ACPI mutex held, and thread0
  may block on this mutex before configuring the eventtimer. In this case
  the taskqueue thread will sleep forever waiting for its callout to fire.

  PR:		220277
  Submitted by:	jhb
  MFC after:	3 days

Changes:
  head/sys/dev/acpica/Osd/OsdSchedule.c
Comment 22 commit-hook freebsd_committer freebsd_triage 2017-07-06 17:21:00 UTC
A commit references this bug:

Author: markj
Date: Thu Jul  6 17:20:36 UTC 2017
New revision: 320744
URL: https://svnweb.freebsd.org/changeset/base/320744

Log:
  MFC r320690:
  Defer ACPI taskqueue creation to SI_SUB_KICK_SCHEDULER.

  PR:	220277

Changes:
_U  stable/11/
  stable/11/sys/dev/acpica/Osd/OsdSchedule.c
Comment 23 commit-hook freebsd_committer freebsd_triage 2017-07-06 17:32:11 UTC
A commit references this bug:

Author: markj
Date: Thu Jul  6 17:31:39 UTC 2017
New revision: 320746
URL: https://svnweb.freebsd.org/changeset/base/320746

Log:
  MFS r320744: MFC r320690:
  Defer ACPI taskqueue creation to SI_SUB_KICK_SCHEDULER.

  PR:		220277
  Approved by:	re (gjb)

Changes:
_U  releng/11.1/
  releng/11.1/sys/dev/acpica/Osd/OsdSchedule.c