|Summary:||panic during boot in ACPI Platform Error Interfaces|
|Product:||Base System||Reporter:||Curtis Villamizar <curtis>|
|Component:||kern||Assignee:||Alexander Motin <mav>|
|Severity:||Affects Some People||CC:||emaste, gallatin, markj, mav, vvd|
Description Curtis Villamizar 2020-08-19 01:48:52 UTC
Created attachment 217331 [details] console output on crash

On a Supermicro AMD64 EPYC rackmount server with hot swap drives I get a panic in stable/12 with r >= 364003.

* 364003 sys/x86/x86/cpu_machdep.c
* 364003 sys/x86/include/acpica_machdep.h
* 364003 sys/dev/pci/pci.c
* 364003 sys/dev/pci/pcivar.h
* 364003 sys/dev/acpica/acpi_apei.c
* 364003 sys/dev/acpica/acpi.c
* 364003 sys/conf/files
* 364003 sys/arm64/arm64/machdep.c
* 364003 sys/arm64/include/acpica_machdep.h
? sys/amd64/conf/occnc12.amd64

The kernel conf is the one I normally use (lots of stuff I don't use commented out).

This works (run from /usr/src/sys):

    svn update -r364002
    ( cd amd64/conf/ && config occnc12.amd64 \
      && cd ../compile/occnc12.amd64 \
      && make cleandepend && make depend \
      && make && mv kernel kernel.new || ( echo ; echo FAIL ; echo ) )
    # then copy the kernel onto a flash as kernel.new

This doesn't work:

    svn update -r364003
    [ .. same as above .. ]

    Fatal trap 12: page fault while in kernel mode

Can't get a core dump since drivers are not yet probed. This is a boot from flash to do the initial install. The most recent kernel I tried was 364260.
Comment 1 Curtis Villamizar 2020-08-19 01:54:21 UTC
To see the offending diffs:

    cd /usr/src
    svn update -r 364003 sys
    svn diff -r364002 sys > ~/apei_nmi.patch

But that might be obvious.
Comment 2 Curtis Villamizar 2020-08-19 01:55:39 UTC
A potential short term workaround would be to provide a hints entry that can be set in boot/loader.conf to disable this feature. Longer term a fix would be better.
Comment 3 Mark Linimon 2020-08-19 02:11:43 UTC
^Triage: assign and add Keywords. Note that committer of the regression (mav@) has already been notified.
Comment 4 VVD 2020-08-19 02:31:03 UTC
As a temporary workaround you can disable the patch by commenting out these 2 lines from this diff: https://svnweb.freebsd.org/base/stable/12/sys/x86/x86/cpu_machdep.c?r1=364003&r2=364002&pathrev=364003

    if (apei_nmi != NULL && (*apei_nmi)())
            claimed = true;
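The two lines quoted above follow a common optional-hook pattern: a function pointer that is NULL until a driver installs a handler. The following is only a minimal userland sketch of that pattern with an added off switch. The names `apei_nmi_hook`, `apei_nmi_enabled`, `nmi_try_apei` and `demo_handler` are hypothetical and are not taken from the FreeBSD source.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the optional hook a driver may install. */
static int (*apei_nmi_hook)(void) = NULL;

/* Hypothetical off switch; this is not a real FreeBSD tunable. */
static bool apei_nmi_enabled = true;

/* Returns true if the optional handler exists, is enabled, and claims the NMI. */
static bool
nmi_try_apei(void)
{
	bool claimed = false;

	/* Same shape as the quoted lines, guarded by the extra switch. */
	if (apei_nmi_enabled && apei_nmi_hook != NULL && (*apei_nmi_hook)())
		claimed = true;
	return (claimed);
}

/* Demonstration handler that always claims the NMI. */
static int
demo_handler(void)
{
	return (1);
}
```

With no hook installed `nmi_try_apei()` returns false; after `apei_nmi_hook = demo_handler` it returns true until `apei_nmi_enabled` is cleared. Note that this path only runs when an NMI arrives, so guarding it would not affect code that runs earlier during device identification.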
Comment 5 Alexander Motin 2020-08-19 02:34:32 UTC
As for the hint, this tunable should disable the APEI driver: debug.acpi.disabled="apei"

As for real diagnostics, I don't see anything in the provided console output, since the panic does not happen in the driver but in atrtc, which is completely unrelated. Would you please provide the panic with verbose messages enabled, along with the output of `acpidump -t`?
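To illustrate how a tunable such as debug.acpi.disabled="apei" can gate a subsystem, here is a small userland sketch that checks whether a name appears as a whole token in a whitespace-separated list. The function name `subsystem_disabled` is hypothetical, and treating the value "all" as matching everything is an assumption of this sketch rather than a statement about the kernel's exact behavior.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if `name` appears as a whole token in the
 * whitespace-separated list `disabled`, or if the list is exactly "all"
 * (the "all" shortcut is an assumption of this sketch).
 * `name` must be a non-NULL string; `disabled` may be NULL.
 */
static bool
subsystem_disabled(const char *disabled, const char *name)
{
	size_t len = strlen(name);
	const char *p = disabled;

	if (disabled == NULL || len == 0)
		return (false);
	if (strcmp(disabled, "all") == 0)
		return (true);
	while (*p != '\0') {
		/* Skip separators, then measure the next token. */
		p += strspn(p, " \t");
		size_t tok = strcspn(p, " \t");

		/* Match only on equal length, so "apeix" does not match "apei". */
		if (tok == len && strncmp(p, name, len) == 0)
			return (true);
		p += tok;
	}
	return (false);
}
```

A driver's early entry point could then return immediately when `subsystem_disabled(value, "apei")` is true, where `value` is whatever string the loader supplied.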
Comment 7 Curtis Villamizar 2020-08-19 03:28:14 UTC
(In reply to VVD from comment #4) That would be a good place for the disable code to reside. Except it didn't work.
Comment 8 Alexander Motin 2020-08-19 03:40:49 UTC
Have you tried updating your BIOS? The `acpidump -t` output describes 1800 error sources. I'm not sure yet why that caused a panic, but I doubt it is normal.
Comment 9 Curtis Villamizar 2020-08-19 04:24:10 UTC
(In reply to Alexander Motin from comment #8) Bios is current (H11SSL-i MB bios rev 2.1, build date 2/21/2020).
Comment 10 Curtis Villamizar 2020-08-19 04:26:41 UTC
MB described at: https://www.supermicro.com/en/products/motherboard/H11SSL-i has 2 hot plug disk, 2 hot plug ssd
Comment 11 Curtis Villamizar 2020-08-19 04:30:09 UTC
Created attachment 217333 [details] console output verbose I set -P -v in boot.config but didn't get much. Is there something else I should be providing to help?
Comment 12 Curtis Villamizar 2020-08-19 04:46:42 UTC
Created attachment 217334 [details] console output verbose - second try This got more output. I disabled security device support (none present) in bios but that should have no effect.
Comment 13 Andrew Gallatin 2020-08-19 17:08:17 UTC
I have the same issue on a Tyan S8036GM2NE running V1.01D of the BIOS
Comment 14 Andrew Gallatin 2020-08-19 17:13:12 UTC
Created attachment 217343 [details] acpidump -t output from Tyan S8036GM2NE
Comment 15 Andrew Gallatin 2020-08-19 17:18:26 UTC
One last note. For me, the workaround (debug.acpi.disabled="apei") allows me to boot.

I have 2 failure modes, depending on the kernel I've been running. One is a silent reset during boot around this point:

    acpi0: Power Button (fixed)
    acpi0: wakeup code va 0xfffffe01a93ff000 pa 0x9900
    <.....>
    Table 'SSDT' at 0xa7f37000
    Table 'WSMT' at 0xa7f36000
    Table 'APIC' at 0xa7f34000
    Table 'HEST' at 0xa7efb000
    HEST: Found table at 0xa7efb000
    <hangs for a few seconds then reboots>

The other failure mode is:

    Fatal trap 12: page fault while in kernel mode
    cpuid = 52; apic id = 34
    fault virtual address = 0xffffffff82612000
    fault code = supervisor read data, page not present
    instruction pointer = 0x20:0xffffffff80c29f0e
    stack pointer = 0x28:0xffffffff824256d0
    frame pointer = 0x28:0xffffffff824256f0
    code segment = base rx0, limit 0xfffff, type 0x1b
                 = DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags = resume, IOPL = 0
    current process = 0 (swapper)
    trap number = 12
    panic: page fault
    cpuid = 52
    time = 1
    KDB: stack backtrace:
    db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff82425390
    vpanic() at vpanic+0x182/frame 0xffffffff824253e0
    panic() at panic+0x43/frame 0xffffffff82425440
    trap_fatal() at trap_fatal+0x387/frame 0xffffffff824254a0
    trap_pfault() at trap_pfault+0x4f/frame 0xffffffff824254f0
    trap() at trap+0x271/frame 0xffffffff82425600
    calltrap() at calltrap+0x8/frame 0xffffffff82425600
    --- trap 0xc, rip = 0xffffffff80c29f0e, rsp = 0xffffffff824256d0, rbp = 0xffffffff824256f0 ---
    tdq_notify() at tdq_notify+0x2e/frame 0xffffffff824256f0
    sched_switch() at sched_switch+0x578/frame 0xffffffff824257c0
    mi_switch() at mi_switch+0xc1/frame 0xffffffff824257e0
    sched_bind() at sched_bind+0x74/frame 0xffffffff82425800
    native_apic_free_vector() at native_apic_free_vector+0x4c/frame 0xffffffff82425830
    ioapic_assign_cpu() at ioapic_assign_cpu+0x27c/frame 0xffffffff82425880
    intr_assign_cpu() at intr_assign_cpu+0x56/frame 0xffffffff824258b0
    _intr_event_bind() at _intr_event_bind+0x120/frame 0xffffffff824258f0
    atrtc_attach() at atrtc_attach+0x2b3/frame 0xffffffff82425940
    atrtc_acpi_attach() at atrtc_acpi_attach+0x12/frame 0xffffffff82425970
    device_attach() at device_attach+0x3dd/frame 0xffffffff824259c0
    bus_generic_attach() at bus_generic_attach+0x2d/frame 0xffffffff824259e0
    acpi_attach() at acpi_attach+0xd2a/frame 0xffffffff82425af0
    device_attach() at device_attach+0x3dd/frame 0xffffffff82425b40
    bus_generic_attach() at bus_generic_attach+0x2d/frame 0xffffffff82425b60
    device_attach() at device_attach+0x3dd/frame 0xffffffff82425bb0
    bus_generic_new_pass() at bus_generic_new_pass+0xf9/frame 0xffffffff82425be0
    root_bus_configure() at root_bus_configure+0x36/frame 0xffffffff82425c10
    configure() at configure+0x9/frame 0xffffffff82425c20
    mi_startup() at mi_startup+0x200/frame 0xffffffff82425c70
    btext() at btext+0x2c
    KDB: enter: panic
    [ thread pid 0 tid 100000 ]
    Stopped at kdb_enter+0x37: movq $0,0x106b856(%rip)
    db>

Looking at the mappings for the cpuid_to_pcpu array: I have 64 logical CPUs (32 cores, 64 threads), so that array should have valid mappings for entries 0..63. Using show pte from ddb, I can see that entries 0..40 are unmapped, and entries 41..63 have valid mappings in the page tables.
Comment 16 Alexander Motin 2020-08-19 17:28:07 UTC
Curtis, unfortunately the new dmesg is also not verbose. Either enable it in the boot menu, or add boot_verbose="YES" to /boot/loader.conf.

Meanwhile, I don't see the apei driver attach before the crash, even though the dmesg is not verbose, so the only part of it that may be executed is the apei_identify() function. Could you try adding "return;" at the beginning of that function to block it from doing anything? Alternatively, I am curious what happens if you comment out the bus_bind_intr() call in sys/x86/isa/atrtc.c, where it crashes for some reason.
Comment 17 Alexander Motin 2020-08-19 17:45:44 UTC
Created attachment 217346 [details] Switch from acpi_find_table() to AcpiGetTable()

It seems that something goes wrong inside acpi_find_table(). Could you please try this patch, which switches to AcpiGetTable() instead?
Comment 18 Mark Johnston 2020-08-19 17:54:00 UTC
(In reply to Alexander Motin from comment #17) I was looking at this a bit yesterday with Drew. The __pcpu array was getting unmapped somehow. I notice that acpi_find_table() uses table_map(), which uses pmap_kenter_temporary(), which 1) maps into the crash dump map, immediately preceding the __pcpu map, and 2) does not do any validation of the mapping size.
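The missing validation described above can be pictured with a little page arithmetic: a physical range needs a certain number of pages, and a fixed-size mapping window can only hold so many. The sketch below is a userland illustration under stated assumptions. The names `pages_spanned` and `fits_in_window` are hypothetical, and the 4 KiB page size and 16-page window are illustrative values, not figures taken from the kernel source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE    4096u  /* assumed page size for this sketch */
#define SKETCH_WINDOW_PAGES 16u    /* illustrative window size, an assumption */

/* Number of pages touched by the physical range [pa, pa + len). */
static size_t
pages_spanned(uint64_t pa, size_t len)
{
	if (len == 0)
		return (0);
	/* Index of the last touched page minus index of the first, plus one. */
	return ((size_t)((pa + len - 1) / SKETCH_PAGE_SIZE -
	    pa / SKETCH_PAGE_SIZE + 1));
}

/* True if the range can be mapped without running past the fixed window. */
static bool
fits_in_window(uint64_t pa, size_t len)
{
	return (pages_spanned(pa, len) <= SKETCH_WINDOW_PAGES);
}
```

A caller that checks `fits_in_window()` before creating page mappings would refuse an oversized table instead of writing past the end of the window into whatever mappings follow it. Under these assumptions, `fits_in_window(0xa7efb000, 17 * SKETCH_PAGE_SIZE)` is false, since 17 pages exceed the 16-page window.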
Comment 19 Mark Johnston 2020-08-19 17:57:35 UTC
Created attachment 217347 [details] confirm pmap bug This patch should trigger a panic if my theory is right. If you are able to reproduce the problem, please give it a try.
Comment 20 Mark Johnston 2020-08-19 18:24:27 UTC
Created attachment 217348 [details] confirm pmap bug Sorry, the last patch isn't right. pmap_kenter_temporary() provides a strange interface. Here's an updated patch to test.
Comment 21 Mark Johnston 2020-08-19 18:43:41 UTC
Created attachment 217349 [details] possible fix

Assuming that this is indeed the problem, the attached patch should fix it. This code doesn't need to use the crashdumpmap anymore; we now have better support for creating early device mappings than when this code was originally written.
Comment 22 Andrew Gallatin 2020-08-19 19:13:39 UTC
Mark's "possible fix" patch fixes the problem for me
Comment 23 commit-hook 2020-08-19 19:55:39 UTC
A commit references this bug:

Author: mav
Date: Wed Aug 19 19:55:13 UTC 2020
New revision: 364407
URL: https://svnweb.freebsd.org/changeset/base/364407

Log:
  Unify AcpiGetTable() KPI use in identify, probe and attach.

  While there, change probe order to not call AcpiGetTable() for every probed ACPI device.

  PR: 248746
  MFC after: 3 days

Changes:
  head/sys/dev/acpica/acpi_apei.c
Comment 24 Curtis Villamizar 2020-08-19 21:05:16 UTC
Created attachment 217353 [details] console output - actual verbose this time My mistake. This is actually verbose.
Comment 25 Curtis Villamizar 2020-08-19 21:22:05 UTC
(In reply to Mark Johnston from comment #21) The patch worked for me as well. btw, the status is still marked as "New". It seems fixed, but it can stay open if that is needed for tracking until the fix gets into production builds.
Comment 26 Curtis Villamizar 2020-08-19 22:46:21 UTC
(In reply to commit-hook from comment #23) I removed all patches, did svn update -r364407 sys and then compiled. The resulting kernel produced another crash so this is not fixed by the patch submitted. Will get console output shortly.
Comment 27 Alexander Motin 2020-08-19 22:59:10 UTC
(In reply to Curtis Villamizar from comment #26) The commit above was to FreeBSD head so far, while you are running stable/12 according to provided logs.
Comment 28 Curtis Villamizar 2020-08-19 23:02:53 UTC
Oops. My bad. I was on the stable/12 branch and the changes are on head. I just manually applied the patch and will try when compile is done.
Comment 29 Curtis Villamizar 2020-08-19 23:19:33 UTC
Also failed on the stable/12 branch with this patch. I will remove the patch, then update to head and try that kernel. The compile might take a while. It would be nice to have a patch for the stable/12 branch that avoids this crash. It seems that more than this one file needs to be updated (or more patches to this file taken from head).
Comment 30 Alexander Motin 2020-08-19 23:51:18 UTC
Created attachment 217358 [details] stable/12 port of r364407

Since the r364407 commit does not apply cleanly to stable/12, I am not sure what you have tested. This is the patch with the conflict resolved, though I haven't tested it yet, only built it.
Comment 31 Curtis Villamizar 2020-08-20 00:29:34 UTC
Got this trying to just build a head kernel under stable/12:

    ERROR: version of config(8) does not match kernel!
    config version = 600016, version required = 600018
    Make sure that /usr/src/usr.sbin/config is in sync with your /usr/src/sys and install a new config binary before trying this again.

So I have to go through the more rigorous full build process, new toolchain, etc. I'll update in a few hours.
Comment 32 Curtis Villamizar 2020-08-20 00:32:23 UTC
(In reply to Alexander Motin from comment #30) If you could make the binary available (/boot/kernel/kernel) that would save me some time. oob would be best.
Comment 33 commit-hook 2020-08-20 00:53:28 UTC
A commit references this bug:

Author: markj
Date: Thu Aug 20 00:52:54 UTC 2020
New revision: 364411
URL: https://svnweb.freebsd.org/changeset/base/364411

Log:
  Use pmap_mapbios() to map ACPI tables on amd64 and i386.

  The ACPI table-mapping code used pmap_kenter_temporary() to create mappings, which in turn uses the fixed-size crashdump map. Moreover, the code was not verifying that the table fits in this map, so when mapping large tables we could clobber adjacent mappings.

  This use of pmap_kenter_temporary() appears to predate support in pmap_mapbios() for creating early mappings, but that restriction no longer applies.

  PR: 248746
  Reviewed by: kib, mav
  Tested by: gallatin, Curtis Villamizar <firstname.lastname@example.org>
  MFC after: 3 days
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D26125

Changes:
  head/sys/amd64/acpica/acpi_machdep.c
  head/sys/i386/acpica/acpi_machdep.c
Comment 34 Curtis Villamizar 2020-08-20 01:33:54 UTC
This is kludgy (run on a stable/12 build) but worked:

    pushd usr.sbin/config/
    make
    ls /usr/obj/usr/releng/r12-20200815/head/amd64.amd64/usr.sbin/config/config
    popd
    ( cd amd64/conf/ \
      && /usr/obj/usr/releng/r12-20200815/head/amd64.amd64/usr.sbin/config/config \
           GENERIC \
      && cd ../compile/GENERIC && make cleandepend && make depend \
      && make && mv kernel kernel.new || ( echo ; echo FAIL ; echo ) )

Then use ssh to get the kernel.new file over to the flash drive and boot.

    uname -a
    FreeBSD flash-amd64.v6cc2.occnc.com 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r364408: Wed Aug 19 20:49:02 EDT 2020 email@example.com:/usr/releng/r12-20200815/head/sys/amd64/compile/GENERIC amd64

    cat /boot/loader.conf
    # set boot banner type and delay
    autoboot_delay="3"
    beastie_disable="YES"
    # root filesystem uses ufs
    rootdev="da0p2"
    vfs.root.mountfrom="ufs:da0p2"
    ## hk01 uses the re driver from ports
    #if_re_load="YES"
    #if_re_name="/boot/modules/if_re.ko"
    ## this is needed for re driver to avoid mbuf fragmentation hangs
    #hw.re.max_rx_mbuf_sz="2048"
    #debug.acpi.disabled="apei"
    boot_verbose="YES"

I saved the console logs in case you need them, but I doubt you will, so I didn't attach them.
Comment 35 Curtis Villamizar 2020-08-21 03:36:36 UTC
Created attachment 217405 [details] fyi - patches adapted to stable/12 - for temporary use

This is the patch I'm using for stable/12. It is the same set of patches applied to the head branch, with the changes from the .rej files applied manually. It works. The only other changes were fixing some comments and deleting some debug output that was not in the head branch (and might be the reason for the rejects). In case someone else reads this bug report before the changes make it to stable/12, it might help.

The workaround of adding debug.acpi.disabled="apei" to loader.conf also works.

btw, thanks for the quick turnaround on this bug report.
Comment 36 commit-hook 2020-08-22 00:43:09 UTC
A commit references this bug:

Author: mav
Date: Sat Aug 22 00:42:34 UTC 2020
New revision: 364471
URL: https://svnweb.freebsd.org/changeset/base/364471

Log:
  MFC r364407: Unify AcpiGetTable() KPI use in identify, probe and attach.

  While there, change probe order to not call AcpiGetTable() for every probed ACPI device.

  PR: 248746

Changes:
  _U stable/12/
  stable/12/sys/dev/acpica/acpi_apei.c
Comment 37 commit-hook 2020-08-23 17:34:53 UTC
A commit references this bug:

Author: markj
Date: Sun Aug 23 17:34:21 UTC 2020
New revision: 364501
URL: https://svnweb.freebsd.org/changeset/base/364501

Log:
  MFC r364411: Use pmap_mapbios() to map ACPI tables on amd64 and i386.

  PR: 248746

Changes:
  _U stable/12/
  stable/12/sys/amd64/acpica/acpi_machdep.c
  stable/12/sys/i386/acpica/acpi_machdep.c