Bug 248746

Summary: panic during boot in ACPI Platform Error Interfaces
Product: Base System
Reporter: Curtis Villamizar <curtis>
Component: kern
Assignee: Alexander Motin <mav>
Status: Closed FIXED
Severity: Affects Some People
CC: emaste, gallatin, markj, mav, vvd
Priority: ---
Keywords: panic, regression
Version: 12.0-STABLE   
Hardware: amd64   
OS: Any   
Attachments:
Description                                               Flags
console output on crash                                   none
acpidump -t                                               none
console output verbose                                    none
console output verbose - second try                       none
acpidump -t output from Tyan S8036GM2NE                   none
Switch from acpi_find_table() to AcpiGetTable()           none
confirm pmap bug                                          none
confirm pmap bug                                          none
possible fix                                              none
console output - actual verbose this time                 none
stable/12 port of r364407                                 none
fyi - patches adapted to stable/12 - for temporary use    none

Description Curtis Villamizar 2020-08-19 01:48:52 UTC
Created attachment 217331 [details]
console output on crash

On a Supermicro amd64 EPYC rackmount server with hot-swap drives I get
a panic in stable/12 at r364003 and later.

        *   364003   sys/x86/x86/cpu_machdep.c
        *   364003   sys/x86/include/acpica_machdep.h
        *   364003   sys/dev/pci/pci.c
        *   364003   sys/dev/pci/pcivar.h
        *   364003   sys/dev/acpica/acpi_apei.c
        *   364003   sys/dev/acpica/acpi.c
        *   364003   sys/conf/files
        *   364003   sys/arm64/arm64/machdep.c
        *   364003   sys/arm64/include/acpica_machdep.h
?                    sys/amd64/conf/occnc12.amd64

The kernel conf is the one I normally use (lots of stuff I don't use
commented out).

This works (run from /usr/src/sys):

svn update -r364002
( cd amd64/conf/ && config occnc12.amd64 \
  && cd ../compile/occnc12.amd64 \
  && make cleandepend && make depend \
  && make && mv kernel kernel.new || ( echo ; echo FAIL ; echo ) )
# then copy the kernel onto a flash as kernel.new

This doesn't work:

svn update -r364003
[ .. same as above .. ]

Fatal trap 12: page fault while in kernel mode

Can't get a core dump since drivers are not yet probed.  This is a boot from flash to do the initial install.

The most recent kernel I tried was r364260.
Comment 1 Curtis Villamizar 2020-08-19 01:54:21 UTC
To see the offending diffs:

cd /usr/src
svn update -r 364003 sys
svn diff -r364002 sys > ~/apei_nmi.patch

But that might be obvious.
Comment 2 Curtis Villamizar 2020-08-19 01:55:39 UTC
A potential short-term workaround would be to provide a hints entry that can be set in /boot/loader.conf to disable this feature.  Longer term, a fix would be better.
Comment 3 Mark Linimon freebsd_committer freebsd_triage 2020-08-19 02:11:43 UTC
^Triage: assign and add Keywords.  Note that committer of the regression (mav@) has already been notified.
Comment 4 VVD 2020-08-19 02:31:03 UTC
As a temporary workaround, you can disable the change by commenting out these 2 lines from this diff:
https://svnweb.freebsd.org/base/stable/12/sys/x86/x86/cpu_machdep.c?r1=364003&r2=364002&pathrev=364003

if (apei_nmi != NULL && (*apei_nmi)())
        claimed = true;
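For readers unfamiliar with the pattern in those two lines: the NMI path calls through a function pointer only when some driver has installed one, and treats a nonzero return as "claimed".  The following is a minimal stand-alone sketch of that idea, assuming an int (*)(void) hook signature.  Every toy_ name is hypothetical; this is not the kernel source.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the optional hook; NULL means "no driver attached". */
typedef int (*toy_nmi_hook_t)(void);
toy_nmi_hook_t toy_apei_nmi = NULL;

/* Example hooks: one claims the NMI, the other does not. */
int toy_hook_claims(void)  { return 1; }
int toy_hook_ignores(void) { return 0; }

/* Mirrors the two quoted lines: call the hook only if one is installed. */
bool toy_nmi_dispatch(void)
{
    bool claimed = false;

    if (toy_apei_nmi != NULL && (*toy_apei_nmi)())
        claimed = true;
    return claimed;
}
```

Commenting out the if statement in such a dispatcher only stops the hook from being called at NMI time; it does not stop the driver from running its own identify or attach code.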
Comment 5 Alexander Motin freebsd_committer 2020-08-19 02:34:32 UTC
As for the hint, this tunable should disable the APEI driver: debug.acpi.disabled="apei"

As for real diagnostics, I don't see anything useful in the provided console output, since the panic does not happen in the driver but in atrtc, which is completely unrelated.  Would you please provide the panic with verbose messages enabled, and the output of `acpidump -t`?
Comment 6 Curtis Villamizar 2020-08-19 02:52:49 UTC
Created attachment 217332 [details]
acpidump -t
Comment 7 Curtis Villamizar 2020-08-19 03:28:14 UTC
(In reply to VVD from comment #4)
That would be a good place for the disable code to reside.  Except it didn't work.
Comment 8 Alexander Motin freebsd_committer 2020-08-19 03:40:49 UTC
Have you tried updating your BIOS?  The `acpidump -t` output describes 1800 error sources.  I'm not sure yet why that causes a panic, but I doubt it is normal.
Comment 9 Curtis Villamizar 2020-08-19 04:24:10 UTC
(In reply to Alexander Motin from comment #8)
BIOS is current (H11SSL-i MB BIOS rev 2.1, build date 2/21/2020).
Comment 10 Curtis Villamizar 2020-08-19 04:26:41 UTC
The MB is described at: https://www.supermicro.com/en/products/motherboard/H11SSL-i
It has 2 hot-plug disks and 2 hot-plug SSDs.
Comment 11 Curtis Villamizar 2020-08-19 04:30:09 UTC
Created attachment 217333 [details]
console output verbose

I set -P -v in boot.config but didn't get much.  Is there something else I should be providing to help?
Comment 12 Curtis Villamizar 2020-08-19 04:46:42 UTC
Created attachment 217334 [details]
console output verbose - second try

This got more output.  I disabled security device support (none present) in the BIOS, but that should have no effect.
Comment 13 Andrew Gallatin freebsd_committer 2020-08-19 17:08:17 UTC
I have the same issue on a Tyan S8036GM2NE running V1.01D of the BIOS.
Comment 14 Andrew Gallatin freebsd_committer 2020-08-19 17:13:12 UTC
Created attachment 217343 [details]
acpidump -t output from Tyan S8036GM2NE
Comment 15 Andrew Gallatin freebsd_committer 2020-08-19 17:18:26 UTC
One last note.  For me, the workaround (debug.acpi.disabled="apei") allows me to boot.  I have 2 failure modes, depending on the kernel I've been running.  One is a silent reset around this point in boot:

acpi0: Power Button (fixed)
acpi0: wakeup code va 0xfffffe01a93ff000 pa 0x9900
<.....>
Table 'SSDT' at 0xa7f37000
Table 'WSMT' at 0xa7f36000
Table 'APIC' at 0xa7f34000
Table 'HEST' at 0xa7efb000
HEST: Found table at 0xa7efb000
<hangs for a few seconds then reboots>

The other failure mode is:

Fatal trap 12: page fault while in kernel mode
cpuid = 52; apic id = 34
fault virtual address   = 0xffffffff82612000
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80c29f0e
stack pointer           = 0x28:0xffffffff824256d0
frame pointer           = 0x28:0xffffffff824256f0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 0 (swapper)
trap number             = 12
panic: page fault
cpuid = 52
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff82425390
vpanic() at vpanic+0x182/frame 0xffffffff824253e0
panic() at panic+0x43/frame 0xffffffff82425440
trap_fatal() at trap_fatal+0x387/frame 0xffffffff824254a0
trap_pfault() at trap_pfault+0x4f/frame 0xffffffff824254f0
trap() at trap+0x271/frame 0xffffffff82425600
calltrap() at calltrap+0x8/frame 0xffffffff82425600
--- trap 0xc, rip = 0xffffffff80c29f0e, rsp = 0xffffffff824256d0, rbp = 0xffffffff824256f0 ---
tdq_notify() at tdq_notify+0x2e/frame 0xffffffff824256f0
sched_switch() at sched_switch+0x578/frame 0xffffffff824257c0
mi_switch() at mi_switch+0xc1/frame 0xffffffff824257e0
sched_bind() at sched_bind+0x74/frame 0xffffffff82425800
native_apic_free_vector() at native_apic_free_vector+0x4c/frame 0xffffffff82425830
ioapic_assign_cpu() at ioapic_assign_cpu+0x27c/frame 0xffffffff82425880
intr_assign_cpu() at intr_assign_cpu+0x56/frame 0xffffffff824258b0
_intr_event_bind() at _intr_event_bind+0x120/frame 0xffffffff824258f0
atrtc_attach() at atrtc_attach+0x2b3/frame 0xffffffff82425940
atrtc_acpi_attach() at atrtc_acpi_attach+0x12/frame 0xffffffff82425970
device_attach() at device_attach+0x3dd/frame 0xffffffff824259c0
bus_generic_attach() at bus_generic_attach+0x2d/frame 0xffffffff824259e0
acpi_attach() at acpi_attach+0xd2a/frame 0xffffffff82425af0
device_attach() at device_attach+0x3dd/frame 0xffffffff82425b40
bus_generic_attach() at bus_generic_attach+0x2d/frame 0xffffffff82425b60
device_attach() at device_attach+0x3dd/frame 0xffffffff82425bb0
bus_generic_new_pass() at bus_generic_new_pass+0xf9/frame 0xffffffff82425be0
root_bus_configure() at root_bus_configure+0x36/frame 0xffffffff82425c10
configure() at configure+0x9/frame 0xffffffff82425c20
mi_startup() at mi_startup+0x200/frame 0xffffffff82425c70
btext() at btext+0x2c
KDB: enter: panic
[ thread pid 0 tid 100000 ]
Stopped at      kdb_enter+0x37: movq    $0,0x106b856(%rip)
db> 


Looking at the mappings for the cpuid_to_pcpu[] array: I have 64 CPUs (32 cores, 64 threads), so that array should have valid mappings for entries 0..63.  Using show pte from ddb, I can see that entries 0..40 are unmapped, while entries 41..63 have valid mappings in the page tables.
Comment 16 Alexander Motin freebsd_committer 2020-08-19 17:28:07 UTC
Curtis, unfortunately the new dmesg is also not verbose.  Either enable it in the boot menu, or add boot_verbose="YES" to /boot/loader.conf.

Meanwhile, I don't see the apei driver attach before the crash, even though dmesg is not verbose, so the only part of it that may be executed is the apei_identify() function.  Could you try adding "return;" at the beginning of the function to block it from doing anything?

Alternatively, I am curious what happens if you comment out the bus_bind_intr() call in sys/x86/isa/atrtc.c, where it crashes for some reason.
Comment 17 Alexander Motin freebsd_committer 2020-08-19 17:45:44 UTC
Created attachment 217346 [details]
Switch from acpi_find_table() to AcpiGetTable()

It seems like something goes wrong inside acpi_find_table().  Could you please try this patch, which switches to AcpiGetTable() instead?
Comment 18 Mark Johnston freebsd_committer 2020-08-19 17:54:00 UTC
(In reply to Alexander Motin from comment #17)
I was looking at this a bit yesterday with Drew.  The __pcpu array was getting unmapped somehow.  I notice that acpi_find_table() uses table_map(), which uses pmap_kenter_temporary(), which 1) maps into the crash dump map, immediately preceding the __pcpu map, and 2) does not do any validation of the mapping size.
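To make the theory above concrete, here is a small user-space simulation of a fixed-size temporary mapping window that sits directly in front of another region's page-table slots.  It is a sketch under stated assumptions, not kernel code: the toy_ names are hypothetical, and the 16-page window, the 64-page neighbor (one page per CPU) and the 57-page table size are illustrative values picked so that the outcome resembles what comment 15 reports.  They are not measurements from either machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE      4096u
#define TOY_WINDOW_PAGES   16u  /* assumed size of the fixed temporary window */
#define TOY_NEIGHBOR_PAGES 64u  /* assumed: one page per CPU, placed right after */
#define TOY_TOTAL_PAGES    (TOY_WINDOW_PAGES + TOY_NEIGHBOR_PAGES)
#define TOY_VALID          1u

/* Toy page table: window slots first, the neighbor's slots directly after. */
static uint64_t toy_pte[TOY_TOTAL_PAGES];

/* Start with an empty window and a fully mapped neighbor region. */
void toy_init(void)
{
    for (size_t i = 0; i < TOY_TOTAL_PAGES; i++)
        toy_pte[i] = 0;
    for (size_t i = 0; i < TOY_NEIGHBOR_PAGES; i++)
        toy_pte[TOY_WINDOW_PAGES + i] =
            (0x40000000u + (uint64_t)i * TOY_PAGE_SIZE) | TOY_VALID;
}

/* Number of pages spanned by [pa, pa + len). */
size_t toy_pages_for(uint64_t pa, size_t len)
{
    uint64_t off = pa & (TOY_PAGE_SIZE - 1);

    return (size_t)((off + len + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE);
}

/*
 * Map a table page by page starting at window slot 0, then unmap it again.
 * Nothing compares the page count with TOY_WINDOW_PAGES; that omission is
 * the point of the sketch.
 */
void toy_map_then_unmap(uint64_t pa, size_t len)
{
    uint64_t base = pa & ~(uint64_t)(TOY_PAGE_SIZE - 1);
    size_t n = toy_pages_for(pa, len);

    assert(n <= TOY_TOTAL_PAGES);   /* keeps the toy itself in bounds */
    for (size_t i = 0; i < n; i++)
        toy_pte[i] = (base + i * TOY_PAGE_SIZE) | TOY_VALID;
    for (size_t i = 0; i < n; i++)
        toy_pte[i] = 0;
}

/* Count neighbor slots that no longer hold a valid mapping. */
size_t toy_neighbor_unmapped(void)
{
    size_t lost = 0;

    for (size_t i = 0; i < TOY_NEIGHBOR_PAGES; i++)
        if ((toy_pte[TOY_WINDOW_PAGES + i] & TOY_VALID) == 0)
            lost++;
    return lost;
}
```

After toy_init(), mapping and unmapping a 16-page table leaves the neighbor intact, while a 57-page table leaves the first 41 neighbor slots (0..40) unmapped and slots 41..63 still valid.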
Comment 19 Mark Johnston freebsd_committer 2020-08-19 17:57:35 UTC
Created attachment 217347 [details]
confirm pmap bug

This patch should trigger a panic if my theory is right.  If you are able to reproduce the problem, please give it a try.
Comment 20 Mark Johnston freebsd_committer 2020-08-19 18:24:27 UTC
Created attachment 217348 [details]
confirm pmap bug

Sorry, the last patch isn't right.  pmap_kenter_temporary() provides a strange interface.  Here's an updated patch to test.
Comment 21 Mark Johnston freebsd_committer 2020-08-19 18:43:41 UTC
Created attachment 217349 [details]
possible fix

Assuming that this is indeed the problem, the attached patch should fix it.  This code doesn't need to use the crashdumpmap anymore; we now have better support for creating early device mappings than when this code was originally written.
Comment 22 Andrew Gallatin freebsd_committer 2020-08-19 19:13:39 UTC
Mark's "possible fix" patch fixes the problem for me.
Comment 23 commit-hook freebsd_committer 2020-08-19 19:55:39 UTC
A commit references this bug:

Author: mav
Date: Wed Aug 19 19:55:13 UTC 2020
New revision: 364407
URL: https://svnweb.freebsd.org/changeset/base/364407

Log:
  Unify AcpiGetTable() KPI use in identify, probe and attach.

  While there, change probe order to not call AcpiGetTable() for every
  probed ACPI device.

  PR:		248746
  MFC after:	3 days

Changes:
  head/sys/dev/acpica/acpi_apei.c
Comment 24 Curtis Villamizar 2020-08-19 21:05:16 UTC
Created attachment 217353 [details]
console output - actual verbose this time

My mistake.  This is actually verbose.
Comment 25 Curtis Villamizar 2020-08-19 21:22:05 UTC
(In reply to Mark Johnston from comment #21)

The patch worked for me as well.  By the way, the status is still marked as "New".  The bug seems fixed; leaving it open is fine if that is needed for tracking until the fix gets into production builds.
Comment 26 Curtis Villamizar 2020-08-19 22:46:21 UTC
(In reply to commit-hook from comment #23)

I removed all patches, did svn update -r364407 sys and then compiled.  The resulting kernel produced another crash so this is not fixed by the patch submitted.  Will get console output shortly.
Comment 27 Alexander Motin freebsd_committer 2020-08-19 22:59:10 UTC
(In reply to Curtis Villamizar from comment #26)
The commit above was to FreeBSD head so far, while you are running stable/12 according to provided logs.
Comment 28 Curtis Villamizar 2020-08-19 23:02:53 UTC
Oops.  My bad.  I was on the stable/12 branch and the changes are on head.  I just manually applied the patch and will try when compile is done.
Comment 29 Curtis Villamizar 2020-08-19 23:19:33 UTC
Also failed on the stable/12 branch with this patch.  Will remove the patch, then update to head and try that kernel.  The compile might take a while.  It would be nice to have a patch for the stable/12 branch that avoids this crash.  It seems that more than this one file needs to be updated (or more patches to this file taken from head).
Comment 30 Alexander Motin freebsd_committer 2020-08-19 23:51:18 UTC
Created attachment 217358 [details]
stable/12 port of r364407

Since the r364407 commit does not apply cleanly to stable/12, I am not sure what you have tested.  This is the patch with the conflict resolved.  I haven't tested it yet, though; I have only built it.
Comment 31 Curtis Villamizar 2020-08-20 00:29:34 UTC
Got this trying to just build a head kernel under stable/12:

ERROR: version of config(8) does not match kernel!
config version = 600016, version required = 600018

Make sure that /usr/src/usr.sbin/config is in sync
with your /usr/src/sys and install a new config binary
before trying this again.

So I have to go through the more rigorous full build process, new toolchain, etc.  I'll update in a few hours.
Comment 32 Curtis Villamizar 2020-08-20 00:32:23 UTC
(In reply to Alexander Motin from comment #30)
If you could make the binary available (/boot/kernel/kernel) that would save me some time.  oob would be best.
Comment 33 commit-hook freebsd_committer 2020-08-20 00:53:28 UTC
A commit references this bug:

Author: markj
Date: Thu Aug 20 00:52:54 UTC 2020
New revision: 364411
URL: https://svnweb.freebsd.org/changeset/base/364411

Log:
  Use pmap_mapbios() to map ACPI tables on amd64 and i386.

  The ACPI table-mapping code used pmap_kenter_temporary() to create
  mappings, which in turn uses the fixed-size crashdump map.  Moreover,
  the code was not verifying that the table fits in this map, so when
  mapping large tables we could clobber adjacent mappings.  This use of
  pmap_kenter_temporary() appears to predate support in pmap_mapbios() for
  creating early mappings, but that restriction no longer applies.

  PR:		248746
  Reviewed by:	kib, mav
  Tested by:	gallatin, Curtis Villamizar <curtis@ipv6.occnc.com>
  MFC after:	3 days
  Sponsored by:	The FreeBSD Foundation
  Differential Revision:	https://reviews.freebsd.org/D26125

Changes:
  head/sys/amd64/acpica/acpi_machdep.c
  head/sys/i386/acpica/acpi_machdep.c
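The log above notes that the old code never verified that a table fits in the fixed-size map.  The commit resolves this by switching to pmap_mapbios() rather than by adding a check, so the following is only an illustrative sketch of what such a size check amounts to.  The toy_ names, the 4 KiB page size and the window size passed in are assumptions, not values taken from the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u  /* assumed page size */

/* Pages spanned by [pa, pa + len), counting the offset into the first page. */
size_t toy_span_pages(uint64_t pa, size_t len)
{
    uint64_t off = pa & (TOY_PAGE_SIZE - 1);

    return (size_t)((off + len + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE);
}

/* True only if the whole table can be mapped inside a window of the given size. */
bool toy_fits_window(uint64_t pa, size_t len, size_t window_pages)
{
    return toy_span_pages(pa, len) <= window_pages;
}
```

Note that a table which is not page-aligned can need one page more than its length alone suggests, which is why the offset within the first page is included in the count.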
Comment 34 Curtis Villamizar 2020-08-20 01:33:54 UTC
This is kludgy (run on a stable/12 build) but worked:

pushd usr.sbin/config/
make
ls /usr/obj/usr/releng/r12-20200815/head/amd64.amd64/usr.sbin/config/config
popd
( cd amd64/conf/ \
  && /usr/obj/usr/releng/r12-20200815/head/amd64.amd64/usr.sbin/config/config \
      GENERIC \
  && cd ../compile/GENERIC && make cleandepend && make depend \
  && make && mv kernel kernel.new || ( echo ; echo FAIL ; echo ) )

then use ssh to get the kernel.new file over to the flash drive and boot.

uname -a
FreeBSD flash-amd64.v6cc2.occnc.com 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r364408: Wed Aug 19 20:49:02 EDT 2020     root@r12-devel-amd64-bh4.southbend.occnc.com:/usr/releng/r12-20200815/head/sys/amd64/compile/GENERIC  amd64

cat /boot/loader.conf
#  set boot banner type and delay
autoboot_delay="3"
beastie_disable="YES"
#  root filesystem uses ufs
rootdev="da0p2"
vfs.root.mountfrom="ufs:da0p2"
##  hk01 uses the re driver from ports
#if_re_load="YES"
#if_re_name="/boot/modules/if_re.ko"
##  this is needed for re driver to avoid mbuf fragmentation hangs
#hw.re.max_rx_mbuf_sz="2048"
#debug.acpi.disabled="apei"
boot_verbose="YES"

I saved the console log in case you need it, but I doubt you will, so I didn't attach it.
Comment 35 Curtis Villamizar 2020-08-21 03:36:36 UTC
Created attachment 217405 [details]
fyi - patches adapted to stable/12 - for temporary use

This is the patch I'm using for stable/12.  It is the same set of patches applied to the head branch, with the changes from the .rej files applied manually.  It works.  The only other changes were fixing some comments and deleting some debug code that was not in the head branch (and might be the reason for the rejects).  In case someone else reads this bug report before the changes make it to stable/12, it might help.  The workaround of adding debug.acpi.disabled="apei" to loader.conf also works.
By the way, thanks for the quick turnaround on this bug report.
Comment 36 commit-hook freebsd_committer 2020-08-22 00:43:09 UTC
A commit references this bug:

Author: mav
Date: Sat Aug 22 00:42:34 UTC 2020
New revision: 364471
URL: https://svnweb.freebsd.org/changeset/base/364471

Log:
  MFC r364407: Unify AcpiGetTable() KPI use in identify, probe and attach.

  While there, change probe order to not call AcpiGetTable() for every
  probed ACPI device.

  PR:		248746

Changes:
_U  stable/12/
  stable/12/sys/dev/acpica/acpi_apei.c
Comment 37 commit-hook freebsd_committer 2020-08-23 17:34:53 UTC
A commit references this bug:

Author: markj
Date: Sun Aug 23 17:34:21 UTC 2020
New revision: 364501
URL: https://svnweb.freebsd.org/changeset/base/364501

Log:
  MFC r364411:
  Use pmap_mapbios() to map ACPI tables on amd64 and i386.

  PR:	248746

Changes:
_U  stable/12/
  stable/12/sys/amd64/acpica/acpi_machdep.c
  stable/12/sys/i386/acpica/acpi_machdep.c