Created attachment 222630 [details]
FreeBSD 13.0 BETA2 hwpmc core dump text

Steps to repeat:
1. Fresh install FreeBSD 13.0 BETA2 on Fusion 12.1.0 VM (FreeBSD 12 64bit guest type).
2. kldload hwpmc
3. pmccontrol -l

Kernel panic core dump text summary:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0xc
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff82c25a90
stack pointer = 0x0:0xfffffe007c8ed9b0
frame pointer = 0x0:0xfffffe007c8edac0
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 832 (pmccontrol)
trap number = 12
panic: page fault
cpuid = 0
time = 1613757734
KDB: stack backtrace:
#0 0xffffffff80c56695 at kdb_backtrace+0x65
#1 0xffffffff80c09261 at vpanic+0x181
#2 0xffffffff80c090d3 at panic+0x43
#3 0xffffffff810891a7 at trap_fatal+0x387
#4 0xffffffff810891ff at trap_pfault+0x4f
#5 0xffffffff8108885d at trap+0x27d
#6 0xffffffff8105fc38 at calltrap+0x8
#7 0xffffffff8108a0f5 at amd64_syscall+0x755
#8 0xffffffff8106055e at fast_syscall_common+0xf8
Uptime: 1m34s
Dumping 174 out of 471 MB:..10%..19%..28%..37%..46%..56%..65%..74%..83%..92%
Can you get a back trace for the exact line number of the functions? Also, the panic happens in an odd place, not where I'd expect such a panic to happen.
(kgdb) bt
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c08e56 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80c092d0 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80c090d3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff810891a7 in trap_fatal (frame=0xfffffe007c8ed8f0, eva=12) at /usr/src/sys/amd64/amd64/trap.c:915
#6 0xffffffff810891ff in trap_pfault (frame=frame@entry=0xfffffe007c8ed8f0, usermode=false, signo=<optimized out>, signo@entry=0x0, ucode=<optimized out>, ucode@entry=0x0) at /usr/src/sys/amd64/amd64/trap.c:732
#7 0xffffffff8108885d in trap (frame=0xfffffe007c8ed8f0) at /usr/src/sys/amd64/amd64/trap.c:398
#8 <signal handler called>
#9 pmc_syscall_handler (td=<optimized out>, syscall_args=<optimized out>) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:3679
#10 0xffffffff8108a0f5 in syscallenter (td=0xfffffe007d395700) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:161
#11 amd64_syscall (td=0xfffffe007d395700, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1156
#12 <signal handler called>
#13 0x00000008009bc48a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe9c8
(kgdb) frame 9
#9 pmc_syscall_handler (td=<optimized out>, syscall_args=<optimized out>) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:3679
3679 if ((error = pcd->pcd_describe(cpu, ari, p, &pm)) != 0)
(kgdb) print pcd
$5 = (struct pmc_classdep *) 0x0
(kgdb)
Some additional debug info:

(kgdb) print md->pmd_npmc
$1 = 33
(kgdb) print p - pmcinfo
$2 = 24
(kgdb) print pmc_rowindex_to_classdep[23]
$3 = (struct pmc_classdep *) 0xfffff8000440a5a0
(kgdb) print pmc_rowindex_to_classdep[24]
$4 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[25]
$5 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[26]
$6 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[27]
$7 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[28]
$8 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[29]
$9 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[30]
$10 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[31]
$11 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[32]
$12 = (struct pmc_classdep *) 0x0
(kgdb)

Hope it helps.
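The kgdb output above shows 33 rows in total but only rows 0 through 23 mapped to a class. A minimal sketch of how such a gap can arise, assuming the per-class row counts match the hwpmc console log quoted later in this report (SOFT 16, TSC 1, IAP 4, IAF 3, UCP 8, UCF 1) and assuming the table is filled by walking only the first `nclasses` classes. The names `class_model`, `total_rows` and `fill_row_table` are hypothetical; this is a model for illustration, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a PMC class descriptor. */
struct class_model {
	const char *name;	/* class name, e.g. "IAP" */
	int num;		/* number of counter rows in this class */
};

/* Row counts taken from the hwpmc console log quoted in this report. */
static const struct class_model classes[] = {
	{ "SOFT", 16 }, { "TSC", 1 }, { "IAP", 4 },
	{ "IAF", 3 }, { "UCP", 8 }, { "UCF", 1 },
};
#define	NCLASSES_TOTAL	(sizeof(classes) / sizeof(classes[0]))

/* Total number of rows across every class (models md->pmd_npmc). */
static int
total_rows(void)
{
	int n = 0;

	for (size_t c = 0; c < NCLASSES_TOTAL; c++)
		n += classes[c].num;
	return (n);
}

/*
 * Fill a row-index-to-class table by walking only the first `nclasses`
 * classes. Rows that belong to later classes stay NULL. Returns the
 * number of rows that were mapped.
 */
static int
fill_row_table(const struct class_model **table, int nrows, size_t nclasses)
{
	int n = 0;

	assert(nclasses <= NCLASSES_TOTAL);
	for (int i = 0; i < nrows; i++)
		table[i] = NULL;
	for (size_t c = 0; c < nclasses; c++)
		for (int ri = 0; ri < classes[c].num && n < nrows; ri++)
			table[n++] = &classes[c];
	return (n);
}
```

With `nclasses` set to 4, the model maps 24 rows and leaves rows 24 through 32 NULL, which is the pattern seen in the debugger; with all 6 classes, every row is mapped.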
It is still an issue on FreeBSD 13.0-RC1.
(In reply to Zhenlei Huang from comment #3)

Hi,

I suspect hwpmc has some kind of misidentification or misconfiguration based on the specific processor model. Could you provide the output of the following?

# sysctl kern.hwpmc.cpuid

and

# cpucontrol -i 0xA /dev/cpuctl0

You may need to kldload cpuctl for the latter.

Thanks.
(In reply to Mitchell Horne from comment #5)

Hi,

# kldload hwpmc cpuctl
# sysctl kern.hwpmc.cpuid
kern.hwpmc.cpuid: GenuineIntel-6-3D-4
# cpucontrol -i 0xA /dev/cpuctl0
cpuid level 0xa: 0x07300403 0x00000000 0x00000000 0x00000603

Some info about the host: it is a MacBook Pro (Retina, 13-inch, Early 2015), the model identifier is MacBookPro12,1, and the processor is a 2.7 GHz Dual-Core Intel Core i5.
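The CPUID leaf 0xA values above can be decoded by hand. The sketch below follows the architectural performance monitoring layout of that leaf and assumes the four words printed by cpucontrol are EAX, EBX, ECX and EDX in that order. The struct and function names are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of CPUID leaf 0xA (architectural performance monitoring). */
struct pmu_info {
	unsigned version;		/* EAX[7:0]: version ID */
	unsigned gp_counters;		/* EAX[15:8]: programmable counters per logical CPU */
	unsigned gp_width;		/* EAX[23:16]: programmable counter bit width */
	unsigned ebx_len;		/* EAX[31:24]: length of the EBX event bit vector */
	unsigned fixed_counters;	/* EDX[4:0]: number of fixed-function counters */
	unsigned fixed_width;		/* EDX[12:5]: fixed-function counter bit width */
};

/* Split the EAX and EDX words of leaf 0xA into their fields. */
static struct pmu_info
decode_cpuid_0a(uint32_t eax, uint32_t edx)
{
	struct pmu_info info;

	info.version = eax & 0xff;
	info.gp_counters = (eax >> 8) & 0xff;
	info.gp_width = (eax >> 16) & 0xff;
	info.ebx_len = (eax >> 24) & 0xff;
	info.fixed_counters = edx & 0x1f;
	info.fixed_width = (edx >> 5) & 0xff;
	return (info);
}

/* Checks the decoder against the register values posted in this report. */
static void
check_reported_values(void)
{
	struct pmu_info info = decode_cpuid_0a(0x07300403, 0x00000603);

	assert(info.version == 3);
	assert(info.gp_counters == 4);		/* four programmable counters */
	assert(info.gp_width == 48);
	assert(info.fixed_counters == 3);	/* three fixed-function counters */
	assert(info.fixed_width == 48);
}
```

Under these assumptions the guest reports four 48-bit programmable counters and three 48-bit fixed counters, which agrees with the IAP/4/48 and IAF/3/48 entries in the hwpmc console log posted later in this report.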
Created attachment 226244 [details]
Set proper nclasses value for Broadwell CPUs

(In reply to Zhenlei Huang from comment #6)

Thanks for the info. I believe I have identified the problem, but it would be helpful if you could confirm the fix.

Please apply the attached patch to a checkout of the src tree, then compile and load the hwpmc module. There should be no need to build the kernel in its entirety; it is enough to do:

# cd /path/to/freebsd-src
# make -C sys/modules/hwpmc
# make -C sys/modules/hwpmc install
# kldload /boot/modules/hwpmc.ko
(In reply to Mitchell Horne from comment #7) Thanks for the fast fix :) I'll report back as soon as possible.
(In reply to Mitchell Horne from comment #7) I applied the patch to current/14 and stable/13, and `pmccontrol -l` now works well on both. I cannot recall why I set the 'regression' keyword on this issue, but stable/12 is also affected. Applying the patch also fixes stable/12.
I can confirm release/11.4 is also affected. The patch also fixes release/11.4.
Oops, I observed a kernel general protection fault while testing PMCs with the patched hwpmc. This can be reproduced on release/11.4, stable/12, stable/13 and current/14.

Steps to repeat:
1. kldload /path/to/patched/hwpmc.ko
2. pmcstudy -T

Kernel panic core dump text summary (obtained from a stable/13 VM):

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff82c2e335
stack pointer = 0x28:0xfffffe0088a569a0
frame pointer = 0x28:0xfffffe0088a569a0
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 22576 (pmcstat)
trap number = 9
panic: general protection fault
cpuid = 0
time = 1625673289
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0088a566b0
vpanic() at vpanic+0x181/frame 0xfffffe0088a56700
panic() at panic+0x43/frame 0xfffffe0088a56760
trap_fatal() at trap_fatal+0x387/frame 0xfffffe0088a567c0
trap() at trap+0x8b/frame 0xfffffe0088a568d0
calltrap() at calltrap+0x8/frame 0xfffffe0088a568d0
--- trap 0x9, rip = 0xffffffff82c2e335, rsp = 0xfffffe0088a569a0, rbp = 0xfffffe0088a569a0 ---
ucp_start_pmc() at ucp_start_pmc+0xd5/frame 0xfffffe0088a569a0
pmc_syscall_handler() at pmc_syscall_handler+0x1e16/frame 0xfffffe0088a56ac0
amd64_syscall() at amd64_syscall+0x755/frame 0xfffffe0088a56bf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0088a56bf0
--- syscall (0, FreeBSD ELF64, nosys), rip = 0x800a8e48a, rsp = 0x7fffffffe3b8, rbp = 0x7fffffffe3e0 ---
KDB: enter: panic
Uptime: 11m1s
Dumping 152 out of 471 MB:..11%..21%..32%..42%..53%..63%..74%..84%..95%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
(kgdb) bt
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c21b2b in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80c21fb0 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80c21db3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff810bbce7 in trap_fatal (frame=0xfffffe0088a568e0, eva=0) at /usr/src/sys/amd64/amd64/trap.c:943
#6 0xffffffff810bb1cb in trap (frame=0xfffffe0088a568e0) at /usr/src/sys/amd64/amd64/trap.c:246
#7 <signal handler called>
#8 wrmsr (msr=913, newval=1) at /usr/src/sys/amd64/include/cpufunc.h:404
#9 ucp_start_pmc (cpu=<optimized out>, ri=0) at /usr/src/sys/dev/hwpmc/hwpmc_uncore.c:712
#10 0xffffffff82c26786 in pmc_start (pm=0xfffff80017ebd000) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:3252
#11 pmc_syscall_handler (td=<optimized out>, syscall_args=<optimized out>) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:4524
#12 0xffffffff810bcc35 in syscallenter (td=0xfffffe00898103a0) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:161
#13 amd64_syscall (td=0xfffffe00898103a0, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1184
#14 <signal handler called>
#15 0x0000000800a8e48a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe3b8
(kgdb) frame 8
#8 wrmsr (msr=913, newval=1) at /usr/src/sys/amd64/include/cpufunc.h:404
404 /usr/src/sys/amd64/include/cpufunc.h: No such file or directory.
(kgdb) frame 9
#9 ucp_start_pmc (cpu=<optimized out>, ri=0) at /usr/src/sys/dev/hwpmc/hwpmc_uncore.c:712
712 /usr/src/sys/dev/hwpmc/hwpmc_uncore.c: No such file or directory.
Created attachment 226312 [details]
Select alternate uncore MSR for Broadwell also

(In reply to Zhenlei Huang from comment #11)

Thanks for your testing.

This appears to be a separate issue. It looks like it is trying to program the wrong MSR, thus raising the protection fault.

Please try the new attached patch (in addition to the other).

Mitchell
(In reply to Mitchell Horne from comment #12)

I tried the new attached patch and encountered a different general protection fault. This was obtained from current/14:

FreeBSD 14.0-CURRENT FreeBSD 14.0-CURRENT #10 main-n247819-bd597b814933: Fri Jul 9 10:28:46 CST 2021 root@:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-DEBUG amd64

panic: general protection fault

GNU gdb (GDB) 10.2 [GDB v10.2 for FreeBSD]
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd13.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...
Unread portion of the kernel message buffer:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff8322f595
stack pointer = 0x28:0xfffffe0087f66990
frame pointer = 0x28:0xfffffe0087f66990
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 67912 (pmcstat)
trap number = 9
panic: general protection fault
cpuid = 0
time = 1625830747
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0087f666a0
vpanic() at vpanic+0x181/frame 0xfffffe0087f666f0
panic() at panic+0x43/frame 0xfffffe0087f66750
trap_fatal() at trap_fatal+0x387/frame 0xfffffe0087f667b0
trap() at trap+0xa4/frame 0xfffffe0087f668c0
calltrap() at calltrap+0x8/frame 0xfffffe0087f668c0
--- trap 0x9, rip = 0xffffffff8322f595, rsp = 0xfffffe0087f66990, rbp = 0xfffffe0087f66990 ---
ucp_start_pmc() at ucp_start_pmc+0xd5/frame 0xfffffe0087f66990
pmc_syscall_handler() at pmc_syscall_handler+0x1ed1/frame 0xfffffe0087f66ac0
amd64_syscall() at amd64_syscall+0x749/frame 0xfffffe0087f66bf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0087f66bf0
--- syscall (0, FreeBSD ELF64, nosys), rip = 0x800a8e48a, rsp = 0x7fffffffe3b8, rbp = 0x7fffffffe3e0 ---
KDB: enter: panic
Uptime: 14m37s
Dumping 277 out of 465 MB:..6%..12%..24%..35%..41%..52%..64%..76%..81%..93%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
(kgdb)
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c113a0 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80c11800 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80c11553 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff810ced87 in trap_fatal (frame=0xfffffe0087f668d0, eva=0) at /usr/src/sys/amd64/amd64/trap.c:943
#6 0xffffffff810ce204 in trap (frame=0xfffffe0087f668d0) at /usr/src/sys/amd64/amd64/trap.c:246
#7 <signal handler called>
#8 0xffffffff8322f595 in tsc_config_pmc (cpu=-2006632640, ri=0, pm=0x10918) at /usr/src/sys/dev/hwpmc/hwpmc_tsc.c:110
#9 0xffffffff83226671 in pmc_ri_to_classdep (md=0x1, ri=0, adjri=<optimized out>) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:611
#10 pmc_syscall_handler (td=0xfffffe0087f66990, syscall_args=<optimized out>) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:4254
#11 0xffffffff810cfd69 in syscallenter (td=<optimized out>) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:161
#12 amd64_syscall (td=0xfffffe0088653740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1184
#13 <signal handler called>
#14 0x0000000800a8e48a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe3b8
(kgdb)

PS: the hwpmc console log:
hwpmc: SOFT/16/64/0x67<INT,USR,SYS,REA,WRI> TSC/1/64/0x20<REA> IAP/4/48/0x3ff<INT,USR,SYS,EDG,THR,REA,WRI,INV,QUA,PRC> IAF/3/48/0x67<INT,USR,SYS,REA,WRI> UCP/8/48/0x3f8<EDG,THR,REA,WRI,INV,QUA,PRC> UCF/1/48/0x60<REA,WRI>
(In reply to Zhenlei Huang from comment #13) Okay this one is quite strange. First, it is odd that there are two backtraces that are similar but distinct. The 'cpu' argument in frame 8 of the kgdb backtrace also looks suspicious. I would expect a single digit number. Was this obtained with pmcstudy -T? Does the panic happen consistently, and do the backtraces look similar each time?
(In reply to Mitchell Horne from comment #14)

The backtrace from comment #11 was observed with only the patch "Set proper nclasses value for Broadwell CPUs" applied. It can be consistently repeated.

The backtrace from comment #13 was observed with both the "Set proper nclasses value for Broadwell CPUs" and "Select alternate uncore MSR for Broadwell also" patches applied. I've only tested this a few times; I'll run it more times to confirm whether it can be consistently repeated.

> Was this obtained with pmcstudy -T?
Yes.
The source code might have been out of sync. I updated the source to the latest main, applied the patches, and made a clean build.

Since 'pmcstudy -T' forks 'pmcstat', the core dump shows 'pmcstat' as the current process. Validating every PMC is slow, so I narrowed down the steps that trigger the panic. The panic happens consistently.

1. Cold boot, which prevents the 'dmesg' message buffer from containing info from the last boot.
2. kldload hwpmc
3. pmcstat -s CPU_CLK_UNHALTED.THREAD_P -s BR_MISP_RETIRED.ALL_BRANCHES -s MACHINE_CLEARS.CYCLES -s UOPS_ISSUED.ANY -s UOPS_RETIRED.RETIRE_SLOTS

NOTE: For step 3, any combination of five or more valid event-specs triggers the panic. Validation works fine with fewer than five event-specs.

Kernel panic core dump text summary:

Tue Jul 13 00:36:08 CST 2021

FreeBSD 14.0-CURRENT FreeBSD 14.0-CURRENT #11 bugfix/253687-n247872-90823878749c: Mon Jul 12 15:22:17 CST 2021 root@:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-DEBUG amd64

panic: general protection fault

Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...
Unread portion of the kernel message buffer:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff83230a15
stack pointer = 0x28:0xfffffe0087f39970
frame pointer = 0x28:0xfffffe0087f39980
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 20674 (pmcstat)
trap number = 9
panic: general protection fault
cpuid = 0
time = 1626107735
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0087f39680
vpanic() at vpanic+0x181/frame 0xfffffe0087f396d0
panic() at panic+0x43/frame 0xfffffe0087f39730
trap_fatal() at trap_fatal+0x387/frame 0xfffffe0087f39790
trap() at trap+0xa4/frame 0xfffffe0087f398a0
calltrap() at calltrap+0x8/frame 0xfffffe0087f398a0
--- trap 0x9, rip = 0xffffffff83230a15, rsp = 0xfffffe0087f39970, rbp = 0xfffffe0087f39980 ---
ucp_start_pmc() at ucp_start_pmc+0x115/frame 0xfffffe0087f39980
pmc_syscall_handler() at pmc_syscall_handler+0x182b/frame 0xfffffe0087f39ac0
amd64_syscall() at amd64_syscall+0x749/frame 0xfffffe0087f39bf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0087f39bf0
--- syscall (0, FreeBSD ELF64, nosys), rip = 0x800a8e48a, rsp = 0x7fffffffe2a8, rbp = 0x7fffffffe2d0 ---
KDB: enter: panic
Uptime: 5m52s
Dumping 169 out of 465 MB:..10%..19%..29%..38%..48%..57%..67%..76%..86%..95%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
(kgdb)
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c1e6e0 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80c1eb40 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80c1e893 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff810dbd87 in trap_fatal (frame=0xfffffe0087f398b0, eva=0) at /usr/src/sys/amd64/amd64/trap.c:944
#6 0xffffffff810db204 in trap (frame=0xfffffe0087f398b0) at /usr/src/sys/amd64/amd64/trap.c:249
#7 <signal handler called>
#8 wrmsr (msr=913, newval=1) at /usr/src/sys/amd64/include/cpufunc.h:404
#9 ucp_start_pmc (cpu=0, ri=0) at /usr/src/sys/dev/hwpmc/hwpmc_uncore.c:714
#10 0xffffffff83226d5b in pmc_start (pm=<optimized out>) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:3252
#11 pmc_syscall_handler (td=<optimized out>, syscall_args=<optimized out>) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:4524
#12 0xffffffff810dcd69 in syscallenter (td=<optimized out>) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:161
#13 amd64_syscall (td=0xfffffe00886f6560, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1185
#14 <signal handler called>
#15 0x0000000800a8e48a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe2a8
(kgdb)
(In reply to Zhenlei Huang from comment #16) And the backtraces look similar each time.
(In reply to Zhenlei Huang from comment #16) The event-specs in step 3 are from https://cgit.freebsd.org/src/tree/usr.sbin/pmcstudy/pmcstudy.c#n1945
Created attachment 226779 [details]
Add class validation to ICP and UCP pmc allocation methods

(In reply to Zhenlei Huang from comment #18)

Hi again, thanks for the info.

I spent some time looking into this further, and I've found numerous issues in our uncore implementation, some of which can lead to the panic you are seeing.

Still, the specific line in pmcstudy that is failing shouldn't trigger this panic, since it does not allocate any uncore PMCs. I'd expect that allocating the 5th counter should fail, since your system supports four programmable counters per core. I've attached a patch that I think should fix this, so please apply it alongside the others and try again.
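The expectation described above (a request is rejected when it names the wrong class, and a fifth request fails once the four programmable counters are taken) can be sketched as follows. This is a hypothetical model, not the attached patch: the names `counter_bank` and `bank_allocate`, and the choice of EINVAL and EBUSY as error values, are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical class tags: core programmable vs. uncore programmable. */
enum class_tag { CLASS_CORE, CLASS_UNCORE };

#define	NUM_PROGRAMMABLE	4	/* four programmable counters per core */

/* Hypothetical bank of counters that all belong to one class. */
struct counter_bank {
	enum class_tag cls;
	bool in_use[NUM_PROGRAMMABLE];
};

/*
 * Allocate one counter from the bank. Returns the row index on
 * success, -EINVAL if the request names a different class than the
 * bank serves, or -EBUSY if every counter is already taken.
 */
static int
bank_allocate(struct counter_bank *bank, enum class_tag requested)
{
	assert(bank != NULL);

	/* Class validation: refuse requests meant for another class. */
	if (requested != bank->cls)
		return (-EINVAL);

	for (int ri = 0; ri < NUM_PROGRAMMABLE; ri++) {
		if (!bank->in_use[ri]) {
			bank->in_use[ri] = true;
			return (ri);
		}
	}
	return (-EBUSY);
}
```

In this model, four core requests succeed with row indices 0 through 3, a fifth returns an error instead of spilling into another class, and an uncore request against the core bank is refused up front.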
(In reply to Mitchell Horne from comment #19)

Thanks!

I tested the latest current/14 with the patch 'Add class validation to ICP and UCP pmc allocation methods'; there are no panics when pmcstat validates more than four event-specs.

Strangely, 'pmcstudy -T' still panics. It stops at 'unc_cbo_xsnp_response.miss_xcore'. I will check whether this is a regression caused by recent changes to hwpmc and report later.

The core text dump:

dumped core - see /var/crash/vmcore.7

Sat Jul 31 02:06:28 CST 2021

FreeBSD 14.0-CURRENT FreeBSD 14.0-CURRENT #17 bugfix/253687-n248379-1e16bfc58152: Fri Jul 30 14:19:18 CST 2021 root@:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-DEBUG amd64

panic: general protection fault

Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...
Unread portion of the kernel message buffer:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff83230a05
stack pointer = 0x28:0xfffffe008f040970
frame pointer = 0x28:0xfffffe008f040980
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 92141 (pmcstat)
trap number = 9
panic: general protection fault
cpuid = 0
time = 1627659218
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008f040680
vpanic() at vpanic+0x181/frame 0xfffffe008f0406d0
panic() at panic+0x43/frame 0xfffffe008f040730
trap_fatal() at trap_fatal+0x387/frame 0xfffffe008f040790
trap() at trap+0xa4/frame 0xfffffe008f0408a0
calltrap() at calltrap+0x8/frame 0xfffffe008f0408a0
--- trap 0x9, rip = 0xffffffff83230a05, rsp = 0xfffffe008f040970, rbp = 0xfffffe008f040980 ---
ucp_start_pmc() at ucp_start_pmc+0x115/frame 0xfffffe008f040980
pmc_syscall_handler() at pmc_syscall_handler+0x182b/frame 0xfffffe008f040ac0
amd64_syscall() at amd64_syscall+0x753/frame 0xfffffe008f040bf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe008f040bf0
--- syscall (0, FreeBSD ELF64, nosys), rip = 0x800a8e48a, rsp = 0x7fffffffe308, rbp = 0x7fffffffe330 ---
KDB: enter: panic
Uptime: 9m24s
Dumping 208 out of 977 MB:..8%..16%..24%..31%..47%..54%..62%..77%..85%..93%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
(kgdb)
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c19370 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80c197d0 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80c19523 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff810d9fa7 in trap_fatal (frame=0xfffffe008f0408b0, eva=0) at /usr/src/sys/amd64/amd64/trap.c:944
#6 0xffffffff810d9424 in trap (frame=0xfffffe008f0408b0) at /usr/src/sys/amd64/amd64/trap.c:249
#7 <signal handler called>
#8 wrmsr (msr=913, newval=1) at /usr/src/sys/amd64/include/cpufunc.h:404
#9 ucp_start_pmc (cpu=0, ri=0) at /usr/src/sys/dev/hwpmc/hwpmc_uncore.c:715
#10 0xffffffff83226d5b in pmc_start (pm=<optimized out>) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:3252
#11 pmc_syscall_handler (td=<optimized out>, syscall_args=<optimized out>) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:4523
#12 0xffffffff810dafa3 in syscallenter (td=<optimized out>) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:161
#13 amd64_syscall (td=0xfffffe008eaec740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1186
#14 <signal handler called>
#15 0x0000000800a8e48a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe308
(kgdb)
Looking through the recent changes to hwpmc, they are all related to PowerPC and ARM.

I can confirm that the panic can be reproduced with the following steps:
1. kldload hwpmc
2. pmcstat -s UNC_CBO_XSNP_RESPONSE.MISS_XCORE
(In reply to Zhenlei Huang from comment #21)

I still expected this panic. The short explanation is that our support for "uncore" PMC events (mostly L3 cache events) is incomplete. It would take a larger effort to fix this, so for now my solution is to disable this feature on CPUs where it's broken: https://reviews.freebsd.org/D31389

If you can apply the patch and test it out, that would be great. I expect that pmcstudy -T will not complete successfully, but it should at least avoid the panics.
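The approach described above, as summarized by the commit message in this report (uncore support written for Nehalem and Westmere, disabled on Sandy Bridge and newer), can be modeled roughly as below. This is only a sketch: the enum, the function names, and the class counts (four core-side classes plus two uncore classes, taken from the hwpmc console log in this report) are assumptions for illustration and are not the code in D31389.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical microarchitecture tags, oldest first. */
enum uarch_model {
	UARCH_NEHALEM,
	UARCH_WESTMERE,
	UARCH_SANDYBRIDGE,
	UARCH_HASWELL,
	UARCH_BROADWELL
};

/*
 * Enable the uncore classes only on the generations the code was
 * written for; everything newer is treated as unsupported.
 */
static bool
uncore_class_enabled(enum uarch_model uarch)
{
	assert(uarch >= UARCH_NEHALEM && uarch <= UARCH_BROADWELL);

	switch (uarch) {
	case UARCH_NEHALEM:
	case UARCH_WESTMERE:
		return (true);
	default:
		return (false);
	}
}

/* Number of PMC classes to register for a given microarchitecture. */
static int
class_count(enum uarch_model uarch)
{
	int n = 4;		/* SOFT, TSC, IAP, IAF */

	if (uncore_class_enabled(uarch))
		n += 2;		/* UCP, UCF */
	return (n);
}
```

Keeping the decision in one predicate means the class count and the initialization path cannot disagree about whether the uncore classes exist, which is the kind of mismatch behind the first panic in this report.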
(In reply to Mitchell Horne from comment #22) Tested D31389 with the latest current/14; no more panics :) Thanks for your efforts!
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4f35e8cba232d9256ab1399b8adfb761988e5eff

commit 4f35e8cba232d9256ab1399b8adfb761988e5eff
Author: Mitchell Horne <mhorne@FreeBSD.org>
AuthorDate: 2021-08-04 17:31:36 +0000
Commit: Mitchell Horne <mhorne@FreeBSD.org>
CommitDate: 2021-08-04 18:23:22 +0000

    hwpmc: disable uncore class on Sandy Bridge and newer

    It was written for Nehalem and Westmere, with minor but incomplete updates for Sandy Bridge in 78d763a29b15. The uncore architecture changed significantly with this generation, bringing new layouts and locations for some MSRs. Misprogramming these MSRs in ucp_start_pmc() may panic the system, and this is trivially reproducible via pmcstat(8) on at least Broadwell and Haswell.

    Disable the class on these CPUs until it can be updated more completely and leave a TODO comment detailing some of the work required. Note that the nclasses value for Broadwell was already incorrect and doesn't need changing.

    The result is that any uncore events listed by pmcstat -L will no longer be allocatable, but this is already the case for newer generations of Intel CPUs.

    PR: 253687
    Reported by: Zhenlei Huang <zlei.huang@gmail.com>
    Reviewed by: kib
    MFC after: 1 week
    Sponsored by: The FreeBSD Foundation
    Differential Revision: https://reviews.freebsd.org/D31389

 sys/dev/hwpmc/hwpmc_intel.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=8399d923a5697b7c194dbd44c33018c94ec42c4e

commit 8399d923a5697b7c194dbd44c33018c94ec42c4e
Author: Mitchell Horne <mhorne@FreeBSD.org>
AuthorDate: 2021-08-04 17:37:05 +0000
Commit: Mitchell Horne <mhorne@FreeBSD.org>
CommitDate: 2021-08-04 18:23:22 +0000

    hwpmc_intel: assert for correct nclasses value

    This variable is set based on the exact CPU model detected. If this value is set too small, it could lead to a NULL-dereference from an improperly initialized pmc_rowindex_to_classdep array. Though it has been fixed, this was previously the case for Broadwell.

    Add two asserts to catch this in DEBUG kernels, as it represents a configuration error that may be hard to uncover otherwise.

    PR: 253687
    Reported by: Zhenlei Huang <zlei.huang@gmail.com>
    Sponsored by: The FreeBSD Foundation

 sys/dev/hwpmc/hwpmc_intel.c | 2 ++
 1 file changed, 2 insertions(+)
(In reply to Zhenlei Huang from comment #23) Thanks for your detailed reporting and testing. I forgot to give a 'Tested by' credit to you in the main commit, but hopefully the 'Reported by' is enough ;) I have at least one follow-up commit to make still, and I will merge the changes back to stable/13 and stable/12 in a week or so.
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=2e50ba70742bac34f144832ceb5f6816fcc06be2

commit 2e50ba70742bac34f144832ceb5f6816fcc06be2
Author: Mitchell Horne <mhorne@FreeBSD.org>
AuthorDate: 2021-08-04 17:31:36 +0000
Commit: Mitchell Horne <mhorne@FreeBSD.org>
CommitDate: 2021-08-11 16:49:44 +0000

    hwpmc: disable uncore class on Sandy Bridge and newer

    It was written for Nehalem and Westmere, with minor but incomplete updates for Sandy Bridge in 78d763a29b15. The uncore architecture changed significantly with this generation, bringing new layouts and locations for some MSRs. Misprogramming these MSRs in ucp_start_pmc() may panic the system, and this is trivially reproducible via pmcstat(8) on at least Broadwell and Haswell.

    Disable the class on these CPUs until it can be updated more completely and leave a TODO comment detailing some of the work required. Note that the nclasses value for Broadwell was already incorrect and doesn't need changing.

    The result is that any uncore events listed by pmcstat -L will no longer be allocatable, but this is already the case for newer generations of Intel CPUs.

    PR: 253687
    Reported by: Zhenlei Huang <zlei.huang@gmail.com>
    Reviewed by: kib
    MFC after: 1 week
    Sponsored by: The FreeBSD Foundation
    Differential Revision: https://reviews.freebsd.org/D31389

    (cherry picked from commit 4f35e8cba232d9256ab1399b8adfb761988e5eff)

 sys/dev/hwpmc/hwpmc_intel.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=88ef00584bc7f9f9e19546bc3ef8682d5c4d6cb6

commit 88ef00584bc7f9f9e19546bc3ef8682d5c4d6cb6
Author: Mitchell Horne <mhorne@FreeBSD.org>
AuthorDate: 2021-08-04 17:31:36 +0000
Commit: Mitchell Horne <mhorne@FreeBSD.org>
CommitDate: 2021-08-11 17:07:59 +0000

    hwpmc: disable uncore class on Sandy Bridge and newer

    It was written for Nehalem and Westmere, with minor but incomplete updates for Sandy Bridge in 78d763a29b15. The uncore architecture changed significantly with this generation, bringing new layouts and locations for some MSRs. Misprogramming these MSRs in ucp_start_pmc() may panic the system, and this is trivially reproducible via pmcstat(8) on at least Broadwell and Haswell.

    Disable the class on these CPUs until it can be updated more completely and leave a TODO comment detailing some of the work required. Note that the nclasses value for Broadwell was already incorrect and doesn't need changing.

    The result is that any uncore events listed by pmcstat -L will no longer be allocatable, but this is already the case for newer generations of Intel CPUs.

    PR: 253687
    Reported by: Zhenlei Huang <zlei.huang@gmail.com>
    Reviewed by: kib
    MFC after: 1 week
    Sponsored by: The FreeBSD Foundation
    Differential Revision: https://reviews.freebsd.org/D31389

    (cherry picked from commit 4f35e8cba232d9256ab1399b8adfb761988e5eff)

 sys/dev/hwpmc/hwpmc_intel.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=78f8ebe0c343eb43f832adb6061d610b777c6a76

commit 78f8ebe0c343eb43f832adb6061d610b777c6a76
Author: Mitchell Horne <mhorne@FreeBSD.org>
AuthorDate: 2021-08-04 17:37:05 +0000
Commit: Alexander Motin <mav@FreeBSD.org>
CommitDate: 2022-07-04 17:48:08 +0000

    hwpmc_intel: assert for correct nclasses value

    This variable is set based on the exact CPU model detected. If this value is set too small, it could lead to a NULL-dereference from an improperly initialized pmc_rowindex_to_classdep array. Though it has been fixed, this was previously the case for Broadwell.

    Add two asserts to catch this in DEBUG kernels, as it represents a configuration error that may be hard to uncover otherwise.

    PR: 253687
    Reported by: Zhenlei Huang <zlei.huang@gmail.com>
    Sponsored by: The FreeBSD Foundation

    (cherry picked from commit 8399d923a5697b7c194dbd44c33018c94ec42c4e)

 sys/dev/hwpmc/hwpmc_intel.c | 2 ++
 1 file changed, 2 insertions(+)