Bug 282860 - Kernel panic at boot on intel i9-7980XE / asus prime x299-A rev1
Summary: Kernel panic at boot on intel i9-7980XE / asus prime x299-A rev1
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 14.2-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Konstantin Belousov
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2024-11-19 11:23 UTC by keivan
Modified: 2025-01-17 18:07 UTC
3 users

See Also:
markj: mfc-stable14+


Attachments
kdb backtrace (282.58 KB, image/jpeg)
2024-11-19 11:23 UTC, keivan
freebsd 12 panic screen (159.38 KB, image/jpeg)
2024-11-20 00:59 UTC, keivan
freebsd 13 panic screen (163.08 KB, image/jpeg)
2024-11-20 01:00 UTC, keivan
freebsd 11 boot (no root dev) (147.79 KB, image/jpeg)
2024-11-20 01:01 UTC, keivan
freebsd 15 current kdb (202.34 KB, image/jpeg)
2024-11-20 01:01 UTC, keivan
netbsd10 boot (101.48 KB, image/jpeg)
2024-11-20 01:02 UTC, keivan
openbsd 7.6 boot (122.69 KB, image/jpeg)
2024-11-20 01:02 UTC, keivan
FreeBSD15-CURRENT GENERIC-KASAN kernel (177.50 KB, image/jpeg)
2024-11-20 19:19 UTC, keivan
debug prints (423 bytes, patch)
2024-11-20 20:23 UTC, Mark Johnston
FreeBSD15-CURRENT printf patch (140.72 KB, image/jpeg)
2024-11-20 20:40 UTC, keivan
more debug prints (2.73 KB, patch)
2024-11-20 21:22 UTC, Mark Johnston
new freebsd15 debug output (316.88 KB, image/jpeg)
2024-11-20 21:42 UTC, keivan
poweroff crash without efirtc (299.36 KB, image/jpeg)
2024-11-21 00:12 UTC, keivan
unrel mutex warnings (186.07 KB, image/jpeg)
2024-11-21 09:07 UTC, keivan
FreeBSD15-CURRENT debug BP pflags output (228.69 KB, image/jpeg)
2024-11-21 12:02 UTC, keivan
with the last revision of the patch the system boots (213.93 KB, image/jpeg)
2024-11-21 16:26 UTC, keivan

Description keivan 2024-11-19 11:23:32 UTC
Created attachment 255299
kdb backtrace

Hello,
FreeBSD 14.2 (both the BETA2 and BETA3 builds) hits a kernel panic at boot and then fails to reboot automatically. FreeBSD 14.1-RELEASE is also affected.

Tested by booting the install image from USB, in both UEFI and legacy mode, on a computer with the following hardware:

ASUS Prime X299-A motherboard (first revision, BIOS release 4001), Intel i9-7980XE CPU, no onboard graphics/IGP, NVIDIA GeForce RTX 3060 GPU

Attached is a picture of the KDB backtrace.
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2024-11-19 15:16:31 UTC
Does any version of FreeBSD boot successfully on this hardware?

Are you able to boot if you enable "safe mode" from the loader menu?
Comment 2 keivan 2024-11-19 15:57:16 UTC
(In reply to Mark Johnston from comment #1)
Hello,
I have never tried FreeBSD on this hardware before.
Tomorrow I will try booting 14.2 in safe mode, and FreeBSD 13, 12 and 11 as well.
Comment 3 keivan 2024-11-20 00:59:45 UTC
Created attachment 255308
freebsd 12 panic screen
Comment 4 keivan 2024-11-20 01:00:45 UTC
Created attachment 255309
freebsd 13 panic screen
Comment 5 keivan 2024-11-20 01:01:28 UTC
Created attachment 255310
freebsd 11 boot (no root dev)
Comment 6 keivan 2024-11-20 01:01:58 UTC
Created attachment 255311
freebsd 15 current kdb
Comment 7 keivan 2024-11-20 01:02:19 UTC
Created attachment 255312
netbsd10 boot
Comment 8 keivan 2024-11-20 01:02:43 UTC
Created attachment 255313
openbsd 7.6 boot
Comment 9 keivan 2024-11-20 01:03:28 UTC
FreeBSD 12.0 through 14.2-BETA3 all crash in roughly the same way (panic screens attached), both with the default boot options and in safe mode.

FreeBSD 11.0 does not panic; however, it is unable to mount the root file system from USB, whether the drive is attached to a USB 3 or a USB 2 port.

FreeBSD 15.0-CURRENT (20241115-79af8f72b3af-273651) crashes in a different way, but the kernel debugger prompt does not accept keyboard input, so I'm unable to make it print more information.

For reference, Linux (Debian 12 with kernel 6.1, *EL 9.4, and more recent distributions like Fedora 41 and Ubuntu 24.10), OpenIndiana, NetBSD 10 and OpenBSD 7.6 all boot successfully, but at the time of writing full ACPI support with graphical acceleration only seems to work on Linux.
Comment 10 keivan 2024-11-20 10:52:15 UTC
While FreeBSD 15-CURRENT does not let me interact with the debugger through the USB keyboard, the program counter value at which it fails appears to be the same one at which FreeBSD 14.2-BETA3 panics: 0x4e6134ec.
FreeBSD 15 prints that the faulting instruction is movq %rcx,%rax.

I don't know whether it is fair to assume that FreeBSD 14.2 is executing the same instruction at that problematic address, but FreeBSD 14.2 also prints the CPU registers, and the destination register, rax, is set to 0000000000000000.
Comment 11 Mark Johnston freebsd_committer freebsd_triage 2024-11-20 16:02:01 UTC
The 15-CURRENT boot is with bootverbose enabled and in safe mode, I believe; could you please also try it with debug.verbose_sysinit=1 set from the loader prompt?  I'm wondering which SYSINIT is triggering the problem.

Does the 15-CURRENT kernel always crash at the same point?
Comment 12 keivan 2024-11-20 16:20:42 UTC
(In reply to Mark Johnston from comment #11)
Yes, it still crashes at the same point (seems to be always the same).

With debug.verbose_sysinit=1 set, it reports:

subsystem 3100000
configure_first(0)... done.
module_register_init(&cam_moduledata)... done.
fbd_evh_init(0)... done.
configure(0)... [here it crashes]
[ thread pid 0 tid 100000 ]
Stopped at 0x4e6134ec: movq %rcx,%rax
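The log above is useful because debug.verbose_sysinit prints each SYSINIT's name before calling it and "done." after it returns, so the last name without a "done." points at the culprit. The following is a minimal user-space model of that pattern, not FreeBSD's code: the struct, function names and order values are hypothetical, and "crashing" is simulated with a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a SYSINIT table entry. */
struct sketch_sysinit {
	unsigned subsystem;	/* e.g. 0x3100000, as in the log above */
	unsigned order;		/* position within the subsystem (made up) */
	const char *name;	/* printed as "name... " before the call */
	bool crashes;		/* simulated outcome of calling the entry */
};

/* Order entries by subsystem first, then by order within it. */
static int
sketch_sysinit_cmp(const void *a, const void *b)
{
	const struct sketch_sysinit *x = a, *y = b;

	if (x->subsystem != y->subsystem)
		return (x->subsystem < y->subsystem ? -1 : 1);
	if (x->order != y->order)
		return (x->order < y->order ? -1 : 1);
	return (0);
}

/*
 * Sort the table and "run" each entry, printing its name before the
 * call and "done." after it.  Returns the index of the first entry
 * that crashes, or -1 if every entry completes.
 */
static long
sketch_run_sysinits(struct sketch_sysinit *tab, size_t n)
{
	unsigned last_sub = 0;

	if (n == 0)
		return (-1);
	assert(tab != NULL);
	qsort(tab, n, sizeof(*tab), sketch_sysinit_cmp);
	for (size_t i = 0; i < n; i++) {
		if (i == 0 || tab[i].subsystem != last_sub) {
			last_sub = tab[i].subsystem;
			printf("subsystem %x\n", last_sub);
		}
		printf("%s... ", tab[i].name);
		if (tab[i].crashes) {
			/* No "done." is ever printed for the culprit. */
			printf("[crash]\n");
			return ((long)i);
		}
		printf("done.\n");
	}
	return (-1);
}
```

Given the four entries from the log in any order, with configure(0) flagged as crashing, the model prints them in sorted order and stops after "configure(0)... ", mirroring the output above.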
Comment 13 Mark Johnston freebsd_committer freebsd_triage 2024-11-20 16:38:08 UTC
(In reply to keivan from comment #12)
I think we're crashing while probing ISA bus devices.

Could you please try booting with

hint.isa.0.disabled="1"
hint.isab.0.disabled="1"

set from the loader?
Comment 14 keivan 2024-11-20 16:57:50 UTC
(In reply to Mark Johnston from comment #13)
Done; disabling ISA probing has no effect.
Comment 15 Mark Johnston freebsd_committer freebsd_triage 2024-11-20 17:05:58 UTC
(In reply to keivan from comment #14)
It still crashes right after printing "configure(0)..."?

Are you able to build a new kernel that can be booted on this system?  If so, it would also be useful to try a GENERIC-KASAN kernel.
Comment 16 keivan 2024-11-20 17:10:59 UTC
(In reply to Mark Johnston from comment #15)
I can try. Which branch do you suggest checking out for this build?
Comment 17 Mark Johnston freebsd_committer freebsd_triage 2024-11-20 17:11:38 UTC
(In reply to keivan from comment #16)
I would suggest trying "main", the default branch.
Comment 18 keivan 2024-11-20 17:22:17 UTC
(In reply to Mark Johnston from comment #17)
OK, I will try to build a KASAN kernel from main and then the memstick.img target (or can I just replace the kernel on the already-written FreeBSD 15 install media with the newly built one?).


Also, I forgot to mention: yes, in the previous test it was still crashing on the same instruction/PC right after configure(), even with:

set debug.verbose_sysinit=1
set hint.isa.0.disabled="1"
set hint.isab.0.disabled="1"

or 

set debug.verbose_sysinit=1
set hint.isa.0.disabled=1
set hint.isab.0.disabled=1
Comment 19 keivan 2024-11-20 19:19:02 UTC
Created attachment 255329
FreeBSD15-CURRENT GENERIC-KASAN kernel
Comment 20 keivan 2024-11-20 19:24:08 UTC
The FreeBSD 15-CURRENT kernel built with 'make buildkernel KERNCONF=GENERIC-KASAN' does not print any additional warnings, and it still panics at the same program counter address as before: 0x4e6134ec.
Maybe the bug is in assembly platform code that KASAN does not cover?
Comment 21 keivan 2024-11-20 20:02:17 UTC
(In reply to keivan from comment #20)
No difference with a GENERIC-KMSAN build either, except for a warning about the WITNESS option being enabled.
Comment 22 Mark Johnston freebsd_committer freebsd_triage 2024-11-20 20:23:21 UTC
Created attachment 255330
debug prints

(In reply to keivan from comment #21)
OK, thanks for your patience so far.

I guess one of the driver identify routines is somehow triggering the crash. I attached a small patch: could you try booting a GENERIC kernel built from main with that patch applied? It should tell us which driver is at fault.
Comment 23 keivan 2024-11-20 20:40:42 UTC
Created attachment 255331
FreeBSD15-CURRENT printf patch

Here it is
Comment 24 Mark Johnston freebsd_committer freebsd_triage 2024-11-20 21:22:16 UTC
Created attachment 255332
more debug prints

OK, let's add some more printf()s.

Please recompile the kernel with WITH_CLEAN=, i.e., "make buildkernel WITH_CLEAN=".

Please also keep booting with debug.verbose_sysinit=1 set from the loader.
Comment 25 keivan 2024-11-20 21:42:23 UTC
Created attachment 255333
new freebsd15 debug output

Here is the new output, with debug.verbose_sysinit=1 set and the new patch applied.
Comment 26 Mark Johnston freebsd_committer freebsd_triage 2024-11-20 21:50:27 UTC
So perhaps there is something strange going on in the efirtc driver.  Could you please remove the "device efirtc" line from GENERIC and try building a new kernel (again with WITH_CLEAN=)?

I don't think setting hint.efirtc.0.disabled=1 will work, as that doesn't stop the driver from probing.
Comment 27 Mark Johnston freebsd_committer freebsd_triage 2024-11-20 21:52:19 UTC
Hmm, there is a hack in efi_init() which looks related:

#if defined(__aarch64__) || defined(__amd64__)
        /*
         * Some UEFI implementations have multiple implementations of the
         * RS->GetTime function. They switch from one we can only use early
         * in the boot process to one valid as a RunTime service only when we
         * call RS->SetVirtualAddressMap. As this is not always the case, e.g.
         * with an old loader.efi, check if the RS->GetTime function is within
         * the EFI map, and fail to attach if not.
         */
        rtdm = (struct efi_rt *)efi_phys_to_kva((uintptr_t)efi_runtime);
        if (rtdm == NULL || !efi_is_in_map(map, ndesc, efihdr->descriptor_size,
            (vm_offset_t)rtdm->rt_gettime)) {
                if (bootverbose)
                        printf(
                         "EFI runtime services table has an invalid pointer\n");
                efi_runtime = NULL;
                efi_destroy_1t1_map();
                return (ENXIO);
        }
#endif
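The check in this hack boils down to: walk the UEFI memory descriptors, keep only those flagged for runtime use, and test whether the GetTime pointer falls inside one of them. Below is a user-space sketch of that idea under stated assumptions; the struct and function names are hypothetical and FreeBSD's actual efi_is_in_map() may differ. It relies on two facts from the UEFI specification: EFI_MEMORY_RUNTIME is bit 63 of a descriptor's Attribute field, and descriptors must be walked with the firmware-reported descriptor size, which can exceed the struct size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* UEFI page size and the EFI_MEMORY_RUNTIME attribute bit (bit 63). */
#define SKETCH_EFI_PAGE_SIZE		4096u
#define SKETCH_EFI_MEMORY_RUNTIME	(UINT64_C(1) << 63)

/* Hypothetical, simplified UEFI memory descriptor. */
struct sketch_efi_md {
	uint32_t type;
	uint64_t phys_start;
	uint64_t virt_start;
	uint64_t num_pages;
	uint64_t attribute;
};

/*
 * Return true if addr lies inside a descriptor marked for runtime use.
 * Descriptors are walked with the firmware-reported stride desc_size,
 * which may be larger than sizeof(struct sketch_efi_md).
 */
static bool
sketch_efi_is_in_map(const void *map, size_t ndesc, size_t desc_size,
    uint64_t addr)
{
	const unsigned char *p = map;

	assert(desc_size >= sizeof(struct sketch_efi_md));
	for (size_t i = 0; i < ndesc; i++, p += desc_size) {
		const struct sketch_efi_md *md = (const void *)p;

		/* Boot-services and conventional memory do not count. */
		if ((md->attribute & SKETCH_EFI_MEMORY_RUNTIME) == 0)
			continue;
		/* Subtraction form avoids overflow near the top of memory. */
		if (addr >= md->virt_start &&
		    addr - md->virt_start < md->num_pages * SKETCH_EFI_PAGE_SIZE)
			return (true);
	}
	return (false);
}
```

The sketch compares against each descriptor's virtual start; with a 1:1 mapping, as used before SetVirtualAddressMap, virtual and physical starts are the same.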
Comment 28 keivan 2024-11-20 22:03:18 UTC
It does boot successfully without "device efirtc".
Comment 29 keivan 2024-11-21 00:12:30 UTC
Created attachment 255337
poweroff crash without efirtc

However, even without the efirtc driver compiled in, it still crashes after issuing 'poweroff' from the shell.
Comment 30 Konstantin Belousov freebsd_committer freebsd_triage 2024-11-21 05:02:17 UTC
Try this https://reviews.freebsd.org/D47694
Comment 31 keivan 2024-11-21 09:07:19 UTC
Created attachment 255354
unrel mutex warnings

The results of testing the D47694 patch are:

- Booting with the patch and efirtc compiled into the kernel: panic at 0x4e6134ec: movq %rcx,%rax during configure(0).

- Booting with the patch and efirtc not compiled in: the system boots. Issuing a shutdown causes a panic at 0x4e6134ec: movq %rcx,%rax.


Pictures are not attached, as the error is always the same as in the original report. Attached instead is a warning about a mutex not behaving as expected, which shows up when booting without efirtc (I think it is unrelated to this report).
Comment 32 keivan 2024-11-21 09:14:55 UTC
Is there anything I can do to identify the offending line of code at instruction pointer 0x4e6134ec, now that the system boots successfully without efirtc?
Comment 33 Konstantin Belousov freebsd_committer freebsd_triage 2024-11-21 10:26:58 UTC
(In reply to keivan from comment #32)
The source code for the faulting instruction is stored somewhere at the BIOS vendor's office. It is an EFI RT call into the BIOS that is causing the #BP exception during execution.

I updated the patch in the review with some debugging info, which should allow me to understand why the check did not detect the exception occurring during the RT call. I need the panic info with the patch applied.
Comment 34 keivan 2024-11-21 12:02:22 UTC
Created attachment 255357
FreeBSD15-CURRENT debug BP pflags output

Attached is the new debug output
Comment 35 Mark Johnston freebsd_committer freebsd_triage 2024-11-21 15:39:56 UTC
I wonder whether Linux's EFI RTC support works properly on this system. That is, is there an rtc-efi clock visible in procfs?
Comment 36 keivan 2024-11-21 15:53:28 UTC
(In reply to Mark Johnston from comment #35)
Debian 12 (Linux 6.1) seems to be using the rtc_cmos driver instead of rtc-efi:

ls -l /dev/rtc0
crw------- 1 root root 251, 0 21 nov 13.00 /dev/rtc0

cat /sys/dev/char/251:0/name
rtc_cmos rtc_cmos

=======================

cat /proc/driver/rtc
rtc_time	: 15:47:28
rtc_date	: 2024-11-21
alrm_time	: 21:37:57
alrm_date	: 2024-11-21
alarm_IRQ	: no
alrm_pending	: no
update IRQ enabled	: no
periodic IRQ enabled	: no
periodic IRQ frequency	: 1024
max user IRQ frequency	: 64
24hr		: yes
periodic_IRQ	: no
update_IRQ	: no
HPET_emulated	: no
BCD		: yes
DST_enable	: no
periodic_freq	: 1024
batt_status	: okay

========================

dmesg | grep -i rtc
[    0.579202] platform rtc_cmos: registered platform RTC device (no PNP device found)
[    0.931525] rtc_cmos rtc_cmos: RTC can wake from S4
[    0.932252] rtc_cmos rtc_cmos: registered as rtc0
[    0.932380] rtc_cmos rtc_cmos: setting system clock to 2024-11-21T11:57:52 UTC (1732190272)
[    0.932401] rtc_cmos rtc_cmos: alarms up to one month, y3k, 114 bytes nvram

========================

lsmod | grep efi
efi_pstore             16384  0
efivarfs               24576  1
Comment 37 Mark Johnston freebsd_committer freebsd_triage 2024-11-21 16:03:43 UTC
(In reply to keivan from comment #36)
Does the Linux dmesg show anything relating to EFI at all?

Are there any BIOS updates available for this motherboard?
Comment 38 keivan 2024-11-21 16:05:46 UTC
(In reply to Mark Johnston from comment #37)
Hello, the motherboard is running the latest BIOS update (that was the first thing I did when I ran into problems a few weeks ago, after getting hold of this system).

EFI-related output in Linux:

dmesg | grep -i efi
[    0.000000] efi: EFI v2.70 by American Megatrends
[    0.000000] efi: TPMFinalLog=0x4d1fd000 ACPI=0x4c14e000 ACPI 2.0=0x4c14e014 SMBIOS=0x4e393000 SMBIOS 3.0=0x4e392000 MEMATTR=0x4718b018 ESRT=0x47e89298 MOKvar=0x4e3c1000 
[    0.012687] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.540481] pci 0000:65:00.0: BAR 1: assigned to efifb
[    0.543068] Registered efivars operations
[    0.936432] efifb: probing for efifb
[    0.936448] efifb: framebuffer at 0xc0000000, using 3072k, total 3072k
[    0.936450] efifb: mode is 1024x768x32, linelength=4096, pages=1
[    0.936452] efifb: scrolling: redraw
[    0.936452] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[    0.937084] fb0: EFI VGA frame buffer device
[    1.026018] integrity: Loading X.509 certificate: UEFI:db
[    1.026442] integrity: Loading X.509 certificate: UEFI:db
[    1.026837] integrity: Loading X.509 certificate: UEFI:db
[    1.026883] integrity: Loaded X.509 cert 'Microsoft Corporation UEFI CA 2011: 13adbf4309bd82709c8cd54f316ed522988a1bd4'
[    1.026886] integrity: Loading X.509 certificate: UEFI:db
[    1.026925] integrity: Loading X.509 certificate: UEFI:db
[    1.027336] integrity: Loading X.509 certificate: UEFI:db
[    1.027371] integrity: Loaded X.509 cert 'Microsoft UEFI CA 2023: 81aa6b3244c935bce0d6628af39827421e32497d'
[    1.027374] integrity: Loading X.509 certificate: UEFI:db
[    1.027411] integrity: Loaded X.509 cert 'Microsoft Corporation: Windows UEFI CA 2023: aefc5fbbbe055d8f8daa585473499417ab5a5272'
[    1.604755] tsc: Refined TSC clocksource calibration: 2592.008 MHz
[  145.017347] systemd[1]: Starting modprobe@efi_pstore.service - Load Kernel Module efi_pstore...
[  145.028456] pstore: Registered efi as persistent store backend
[  145.029412] systemd[1]: modprobe@efi_pstore.service: Deactivated successfully.
[  145.029525] systemd[1]: Finished modprobe@efi_pstore.service - Load Kernel Module efi_pstore.
Comment 39 Konstantin Belousov freebsd_committer freebsd_triage 2024-11-21 16:08:06 UTC
Try the updated patch in the review.
Comment 40 keivan 2024-11-21 16:26:13 UTC
Created attachment 255361
with the last revision of the patch the system boots

With the last revision of the trap handler patch, the system with the efirtc driver compiled in boots successfully into the installer.

It does not seem to end up using the EFI RTC, but it no longer panics.

The kernel panic on shutdown no longer happens either; the system shuts down normally.
Comment 41 commit-hook freebsd_committer freebsd_triage 2024-11-21 22:06:27 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e6ec41fa86d88f80bd663e55455a6844619a9b24

commit e6ec41fa86d88f80bd663e55455a6844619a9b24
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2024-11-21 04:57:58 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-11-21 22:05:28 +0000

    amd64 efi rt: handle #BP

    PR:     282860
    Reviewed by:    markj
    Sponsored by:   The FreeBSD Foundation
    MFC after:      1 week
    Differential revision:  https://reviews.freebsd.org/D47694

 sys/amd64/amd64/trap.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
Comment 42 keivan 2024-11-22 11:34:52 UTC
Thank you for your support!
I have also tested the same patched kernel/installer image on some other systems that had no boot problems before, and the preliminary trap handler patch does not break them.

Once the patch is finalized I'm willing to test it again.

Also, the last time I remember FreeBSD correctly resuming from S3, on another computer I owned, was around FreeBSD 8. On this system, even with the trap handler patch, the video output is not reactivated after resuming from S3 with the vesa driver; setting hw.acpi.reset_video=1 before suspending and running kldload vesa.ko after resuming have no effect. The same happens on another Ryzen PC I tested and on my Intel 12th-gen laptop. Since the system this bug report is about has a handy COM/serial socket on the motherboard that I could connect to, I'm willing to troubleshoot the ACPI S3 problems in the future as well, with some guidance.
Comment 43 keivan 2024-11-24 00:16:53 UTC
Hello, I cloned main today and it works without applying any patches.
I think you can close this bug report.

As a side note, I also got the system to resume correctly from S3, but not with the vt_vga driver on any of my systems.
All the computers I tested resumed with the screen off, although ssh still worked.
On an Intel laptop running in text mode, S3 resume was fixed by using drm-kmod/i915kms.

The system from this bug report also resumes correctly from ACPI S3 when the nvidia-drm driver is loaded (whether suspending from a console or from X11), or with the nvidia-modeset driver, although the latter only works when suspending from X11 or when X11 is running somewhere.

So maybe something is wrong in vt_vga or vt with some moderately modern hardware
(probably a topic for a new bug report).
Comment 44 Mark Johnston freebsd_committer freebsd_triage 2024-11-25 16:48:51 UTC
(In reply to keivan from comment #42)
Thanks for helping us narrow down the problem.

(In reply to keivan from comment #43)
Yes, this ought to be a new bug report.  I believe there are some known problems with vt and suspend-to-S3 but I don't know the details.
Comment 45 commit-hook freebsd_committer freebsd_triage 2024-11-28 13:30:30 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=5f5b47e37416df46b52e0d43b94d9c2f37d15397

commit 5f5b47e37416df46b52e0d43b94d9c2f37d15397
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2024-11-21 04:57:58 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-11-28 12:53:17 +0000

    amd64 efi rt: handle #BP

    PR:     282860

    (cherry picked from commit e6ec41fa86d88f80bd663e55455a6844619a9b24)

 sys/amd64/amd64/trap.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)