253687 – [hwpmc] kernel panic when listing available hardware performance counters

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	11.4-RELEASE
Hardware:	amd64 Any

Importance:	--- Affects Only Me
Assignee:	Mitchell Horne

Keywords:	crash, hwpmc

Reported:	2021-02-19 10:14 UTC by Zhenlei Huang
Modified:	2022-10-12 00:49 UTC
CC List:	5 users

Attachments
FreeBSD 13.0 BETA2 hwpmc core dump text (61.49 KB, text/plain) 2021-02-19 10:14 UTC, Zhenlei Huang
Set proper nclasses value for Broadwell CPUs (368 bytes, patch) 2021-07-05 17:02 UTC, Mitchell Horne
Select alternate uncore MSR for Broadwell also (817 bytes, patch) 2021-07-08 19:24 UTC, Mitchell Horne
Add class validation to ICP and UCP pmc allocation methods (1.19 KB, patch) 2021-07-29 13:45 UTC, Mitchell Horne

Description Zhenlei Huang freebsd_committer

2021-02-19 10:14:05 UTC

Created attachment 222630
FreeBSD 13.0 BETA2 hwpmc core dump text

Steps to repeat:

1. Fresh install of FreeBSD 13.0 BETA2 in a Fusion 12.1.0 VM (FreeBSD 12 64-bit guest type).
2. kldload hwpmc
3. pmccontrol -l

Kernel panic core dump text summary:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0xc
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff82c25a90
stack pointer           = 0x0:0xfffffe007c8ed9b0
frame pointer           = 0x0:0xfffffe007c8edac0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 832 (pmccontrol)
trap number             = 12
panic: page fault
cpuid = 0
time = 1613757734
KDB: stack backtrace:
#0 0xffffffff80c56695 at kdb_backtrace+0x65
#1 0xffffffff80c09261 at vpanic+0x181
#2 0xffffffff80c090d3 at panic+0x43
#3 0xffffffff810891a7 at trap_fatal+0x387
#4 0xffffffff810891ff at trap_pfault+0x4f
#5 0xffffffff8108885d at trap+0x27d
#6 0xffffffff8105fc38 at calltrap+0x8
#7 0xffffffff8108a0f5 at amd64_syscall+0x755
#8 0xffffffff8106055e at fast_syscall_common+0xf8
Uptime: 1m34s
Dumping 174 out of 471 MB:..10%..19%..28%..37%..46%..56%..65%..74%..83%..92%

Comment 1 John-Mark Gurney freebsd_committer

2021-03-02 01:35:26 UTC

Can you get a backtrace that shows the exact line numbers of the functions?

Also, the panic happens in an odd place, not where I'd expect such a panic to happen.

Comment 2 Zhenlei Huang freebsd_committer

2021-03-02 02:57:57 UTC

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c08e56 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80c092d0 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80c090d3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff810891a7 in trap_fatal (frame=0xfffffe007c8ed8f0, eva=12) at /usr/src/sys/amd64/amd64/trap.c:915
#6  0xffffffff810891ff in trap_pfault (frame=frame@entry=0xfffffe007c8ed8f0, usermode=false, signo=<optimized out>, signo@entry=0x0, ucode=<optimized out>, 
    ucode@entry=0x0) at /usr/src/sys/amd64/amd64/trap.c:732
#7  0xffffffff8108885d in trap (frame=0xfffffe007c8ed8f0) at /usr/src/sys/amd64/amd64/trap.c:398
#8  <signal handler called>
#9  pmc_syscall_handler (td=<optimized out>, syscall_args=<optimized out>) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:3679
#10 0xffffffff8108a0f5 in syscallenter (td=0xfffffe007d395700) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:161
#11 amd64_syscall (td=0xfffffe007d395700, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1156
#12 <signal handler called>
#13 0x00000008009bc48a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe9c8
(kgdb) frame 9
#9  pmc_syscall_handler (td=<optimized out>, syscall_args=<optimized out>) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:3679
3679				if ((error = pcd->pcd_describe(cpu, ari, p, &pm)) != 0)
(kgdb) print pcd
$5 = (struct pmc_classdep *) 0x0
(kgdb)

Comment 3 Zhenlei Huang freebsd_committer

2021-03-02 04:47:58 UTC

Some additional debug info:

(kgdb) print md->pmd_npmc
$1 = 33
(kgdb) print p - pmcinfo
$2 = 24
(kgdb) print pmc_rowindex_to_classdep[23]
$3 = (struct pmc_classdep *) 0xfffff8000440a5a0
(kgdb) print pmc_rowindex_to_classdep[24]
$4 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[25]
$5 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[26]
$6 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[27]
$7 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[28]
$8 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[29]
$9 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[30]
$10 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[31]
$11 = (struct pmc_classdep *) 0x0
(kgdb) print pmc_rowindex_to_classdep[32]
$12 = (struct pmc_classdep *) 0x0
(kgdb)

Hope it helps.
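
The kgdb output above is consistent with a row-index table that was filled for fewer counter classes than the module registered: 33 rows exist, but only rows 0 through 23 have a class descriptor. The following is a minimal, self-contained C sketch of that failure mode, not the hwpmc source. The names class_desc, fill_row_map, and broadwell_classes are hypothetical. The class sizes are taken from the hwpmc console line quoted in comment 13, and the idea that only the first four classes were counted is an assumption that happens to match the first NULL entry at index 24.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a PMC class descriptor. */
struct class_desc {
    const char *name;
    int num;    /* number of counter rows this class owns */
};

/*
 * Map each row index to its owning class, visiting only the first
 * nclasses entries. Rows that are never reached stay NULL.
 * Returns the index of the first unmapped row (npmc if none).
 */
static int
fill_row_map(const struct class_desc *classes, int nclasses,
    const struct class_desc **map, int npmc)
{
    int c, n, ri;

    assert(classes != NULL && map != NULL);
    for (ri = 0; ri < npmc; ri++)
        map[ri] = NULL;
    ri = 0;
    for (c = 0; c < nclasses; c++)
        for (n = 0; n < classes[c].num && ri < npmc; n++)
            map[ri++] = &classes[c];
    return (ri);
}

/* Class sizes as printed in the hwpmc console line in comment 13. */
static const struct class_desc broadwell_classes[] = {
    { "SOFT", 16 }, { "TSC", 1 }, { "IAP", 4 },
    { "IAF", 3 }, { "UCP", 8 }, { "UCF", 1 },
};
```

With nclasses set to 4, fill_row_map() returns 24 and leaves rows 24 through 32 NULL, so any caller that dereferences the entry for row 24 faults. With all six classes counted it returns 33 and every row is mapped.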

Comment 4 Zhenlei Huang freebsd_committer

2021-03-09 05:07:20 UTC

It is still an issue on FreeBSD 13.0-RC1.

Comment 5 Mitchell Horne freebsd_committer

2021-07-01 14:58:29 UTC

(In reply to Zhenlei Huang from comment #3)

Hi,

I suspect hwpmc has some kind of misidentification or misconfiguration based on the specific processor model. Could you provide the output of the following?

# sysctl kern.hwpmc.cpuid

and

# cpucontrol -i 0xA /dev/cpuctl0

You may need to kldload cpuctl for the latter. Thanks.

Comment 6 Zhenlei Huang freebsd_committer

2021-07-02 00:51:14 UTC

(In reply to Mitchell Horne from comment #5)
Hi,

# kldload hwpmc cpuctl
# sysctl kern.hwpmc.cpuid
kern.hwpmc.cpuid: GenuineIntel-6-3D-4

# cpucontrol -i 0xA /dev/cpuctl0
cpuid level 0xa: 0x07300403 0x00000000 0x00000000 0x00000603


Some info about the host:
The host is a MacBook Pro (Retina, 13-inch, Early 2015); the model identifier is MacBookPro12,1, and the processor is a 2.7 GHz Dual-Core Intel Core i5.
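
The CPUID leaf 0xA values above can be decoded by hand. The sketch below follows the field layout Intel documents for this leaf (architectural performance monitoring); the struct and function names are made up for illustration. cpucontrol prints the registers in EAX, EBX, ECX, EDX order, so the first and last words are the ones decoded here.

```c
#include <assert.h>
#include <stdint.h>

/* Selected fields of CPUID leaf 0xA. Names are illustrative only. */
struct pmu_info {
    unsigned version;      /* EAX[7:0]   version ID */
    unsigned gp_count;     /* EAX[15:8]  general-purpose counters per logical CPU */
    unsigned gp_width;     /* EAX[23:16] general-purpose counter width in bits */
    unsigned fixed_count;  /* EDX[4:0]   fixed-function counters */
    unsigned fixed_width;  /* EDX[12:5]  fixed-function counter width in bits */
};

static struct pmu_info
decode_cpuid_0xa(uint32_t eax, uint32_t edx)
{
    struct pmu_info info;

    info.version = eax & 0xff;
    info.gp_count = (eax >> 8) & 0xff;
    info.gp_width = (eax >> 16) & 0xff;
    info.fixed_count = edx & 0x1f;
    info.fixed_width = (edx >> 5) & 0xff;
    return (info);
}

/* Decode the values reported above and check the result. */
static void
check_reported_values(void)
{
    struct pmu_info info = decode_cpuid_0xa(0x07300403, 0x00000603);

    assert(info.version == 3);
    assert(info.gp_count == 4 && info.gp_width == 48);
    assert(info.fixed_count == 3 && info.fixed_width == 48);
}
```

For the reported values this gives version 3, four 48-bit general-purpose counters, and three 48-bit fixed counters, which agrees with the IAP/4/48 and IAF/3/48 entries in the hwpmc console line quoted in comment 13.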

Comment 7 Mitchell Horne freebsd_committer

2021-07-05 17:02:27 UTC

Created attachment 226244
Set proper nclasses value for Broadwell CPUs

(In reply to Zhenlei Huang from comment #6)

Thanks for the info. I believe I have identified the problem, but it would be helpful if you could confirm the fix.

Please apply the attached patch to a checkout of the src tree, then compile and load the hwpmc module. There should be no need to build the kernel in its entirety; it is enough to do:

# cd /path/to/freebsd-src
# make -C sys/modules/hwpmc
# make -C sys/modules/hwpmc install
# kldload /boot/modules/hwpmc.ko

Comment 8 Zhenlei Huang freebsd_committer

2021-07-07 04:56:59 UTC

(In reply to Mitchell Horne from comment #7)
Thanks for the fast fix :)

I'll report back as soon as possible.

Comment 9 Zhenlei Huang freebsd_committer

2021-07-07 09:51:05 UTC

(In reply to Mitchell Horne from comment #7)
I applied the patch to current/14 and stable/13; `pmccontrol -l` works well on both now.

I cannot recall why I set the 'regression' keyword on this issue, but stable/12 is also affected. Applying the patch fixes stable/12 as well.

Comment 10 Zhenlei Huang freebsd_committer

2021-07-08 07:38:19 UTC

I can confirm that release/11.4 is also affected. The patch fixes release/11.4 as well.

Comment 11 Zhenlei Huang freebsd_committer

2021-07-08 08:16:35 UTC

Oops, I observed a kernel general protection fault while testing PMCs with the patched hwpmc.
This can be reproduced on release/11.4, stable/12, stable/13 and current/14.

Steps to repeat:

1. kldload /path/to/patched/hwpmc.ko
2. pmcstudy -T

Kernel panic core dump text summary: (obtained from stable/13 vm)

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff82c2e335
stack pointer	        = 0x28:0xfffffe0088a569a0
frame pointer	        = 0x28:0xfffffe0088a569a0
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 22576 (pmcstat)
trap number		= 9
panic: general protection fault
cpuid = 0
time = 1625673289
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0088a566b0
vpanic() at vpanic+0x181/frame 0xfffffe0088a56700
panic() at panic+0x43/frame 0xfffffe0088a56760
trap_fatal() at trap_fatal+0x387/frame 0xfffffe0088a567c0
trap() at trap+0x8b/frame 0xfffffe0088a568d0
calltrap() at calltrap+0x8/frame 0xfffffe0088a568d0
--- trap 0x9, rip = 0xffffffff82c2e335, rsp = 0xfffffe0088a569a0, rbp = 0xfffffe0088a569a0 ---
ucp_start_pmc() at ucp_start_pmc+0xd5/frame 0xfffffe0088a569a0
pmc_syscall_handler() at pmc_syscall_handler+0x1e16/frame 0xfffffe0088a56ac0
amd64_syscall() at amd64_syscall+0x755/frame 0xfffffe0088a56bf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0088a56bf0
--- syscall (0, FreeBSD ELF64, nosys), rip = 0x800a8e48a, rsp = 0x7fffffffe3b8, rbp = 0x7fffffffe3e0 ---
KDB: enter: panic
Uptime: 11m1s
Dumping 152 out of 471 MB:..11%..21%..32%..42%..53%..63%..74%..84%..95%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55	/usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c21b2b in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80c21fb0 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80c21db3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff810bbce7 in trap_fatal (frame=0xfffffe0088a568e0, eva=0) at /usr/src/sys/amd64/amd64/trap.c:943
#6  0xffffffff810bb1cb in trap (frame=0xfffffe0088a568e0) at /usr/src/sys/amd64/amd64/trap.c:246
#7  <signal handler called>
#8  wrmsr (msr=913, newval=1) at /usr/src/sys/amd64/include/cpufunc.h:404
#9  ucp_start_pmc (cpu=<optimized out>, ri=0) at /usr/src/sys/dev/hwpmc/hwpmc_uncore.c:712
#10 0xffffffff82c26786 in pmc_start (pm=0xfffff80017ebd000) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:3252
#11 pmc_syscall_handler (td=<optimized out>, syscall_args=<optimized out>) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:4524
#12 0xffffffff810bcc35 in syscallenter (td=0xfffffe00898103a0) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:161
#13 amd64_syscall (td=0xfffffe00898103a0, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1184
#14 <signal handler called>
#15 0x0000000800a8e48a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe3b8
(kgdb) frame 8
#8  wrmsr (msr=913, newval=1) at /usr/src/sys/amd64/include/cpufunc.h:404
404	/usr/src/sys/amd64/include/cpufunc.h: No such file or directory.
(kgdb) frame 9
#9  ucp_start_pmc (cpu=<optimized out>, ri=0) at /usr/src/sys/dev/hwpmc/hwpmc_uncore.c:712
712	/usr/src/sys/dev/hwpmc/hwpmc_uncore.c: No such file or directory.

Comment 12 Mitchell Horne freebsd_committer

2021-07-08 19:24:05 UTC

Created attachment 226312
Select alternate uncore MSR for Broadwell also

(In reply to Zhenlei Huang from comment #11)

Thanks for your testing. 

This appears to be a separate issue. It looks like it is trying to program the wrong MSR, thus raising the protection fault. Please try the new attached patch (in addition to the other).

Mitchell

Comment 13 Zhenlei Huang freebsd_committer

2021-07-09 14:30:37 UTC

(In reply to Mitchell Horne from comment #12)
I tried the new attached patch and encountered a different general protection fault. This was obtained from current/14:

FreeBSD  14.0-CURRENT FreeBSD 14.0-CURRENT #10 main-n247819-bd597b814933: Fri Jul  9 10:28:46 CST 2021     root@:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-DEBUG  amd64

panic: general protection fault

GNU gdb (GDB) 10.2 [GDB v10.2 for FreeBSD]
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd13.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xffffffff8322f595
stack pointer           = 0x28:0xfffffe0087f66990
frame pointer           = 0x28:0xfffffe0087f66990
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 67912 (pmcstat)
trap number             = 9
panic: general protection fault
cpuid = 0
time = 1625830747
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0087f666a0
vpanic() at vpanic+0x181/frame 0xfffffe0087f666f0
panic() at panic+0x43/frame 0xfffffe0087f66750
trap_fatal() at trap_fatal+0x387/frame 0xfffffe0087f667b0
trap() at trap+0xa4/frame 0xfffffe0087f668c0
calltrap() at calltrap+0x8/frame 0xfffffe0087f668c0
--- trap 0x9, rip = 0xffffffff8322f595, rsp = 0xfffffe0087f66990, rbp = 0xfffffe0087f66990 ---
ucp_start_pmc() at ucp_start_pmc+0xd5/frame 0xfffffe0087f66990
pmc_syscall_handler() at pmc_syscall_handler+0x1ed1/frame 0xfffffe0087f66ac0
amd64_syscall() at amd64_syscall+0x749/frame 0xfffffe0087f66bf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0087f66bf0
--- syscall (0, FreeBSD ELF64, nosys), rip = 0x800a8e48a, rsp = 0x7fffffffe3b8, rbp = 0x7fffffffe3e0 ---
KDB: enter: panic
Uptime: 14m37s
Dumping 277 out of 465 MB:..6%..12%..24%..35%..41%..52%..64%..76%..81%..93%
__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55      /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=textdump@entry=1)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c113a0 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80c11800 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80c11553 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff810ced87 in trap_fatal (frame=0xfffffe0087f668d0, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:943
#6  0xffffffff810ce204 in trap (frame=0xfffffe0087f668d0)
    at /usr/src/sys/amd64/amd64/trap.c:246
#7  <signal handler called>
#8  0xffffffff8322f595 in tsc_config_pmc (cpu=-2006632640, ri=0, pm=0x10918)
    at /usr/src/sys/dev/hwpmc/hwpmc_tsc.c:110
#9  0xffffffff83226671 in pmc_ri_to_classdep (md=0x1, ri=0, 
    adjri=<optimized out>) at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:611
#10 pmc_syscall_handler (td=0xfffffe0087f66990, syscall_args=<optimized out>)
    at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:4254
#11 0xffffffff810cfd69 in syscallenter (td=<optimized out>)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:161
#12 amd64_syscall (td=0xfffffe0088653740, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1184
#13 <signal handler called>
#14 0x0000000800a8e48a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe3b8
(kgdb)


PS: the hwpmc console log:
hwpmc: SOFT/16/64/0x67<INT,USR,SYS,REA,WRI> TSC/1/64/0x20<REA> IAP/4/48/0x3ff<INT,USR,SYS,EDG,THR,REA,WRI,INV,QUA,PRC> IAF/3/48/0x67<INT,USR,SYS,REA,WRI> UCP/8/48/0x3f8<EDG,THR,REA,WRI,INV,QUA,PRC> UCF/1/48/0x60<REA,WRI>

Comment 14 Mitchell Horne freebsd_committer

2021-07-09 15:09:31 UTC

(In reply to Zhenlei Huang from comment #13)

Okay this one is quite strange. First, it is odd that there are two backtraces that are similar but distinct.

The 'cpu' argument in frame 8 of the kgdb backtrace also looks suspicious. I would expect a single digit number.

Was this obtained with pmcstudy -T? Does the panic happen consistently, and do the backtraces look similar each time?

Comment 15 Zhenlei Huang freebsd_committer

2021-07-10 00:23:55 UTC

(In reply to Mitchell Horne from comment #14)
The backtrace from comment #11 was observed when applying only the patch "Set proper nclasses value for Broadwell CPUs". It can be reproduced consistently.

The backtrace from comment #13 was observed when applying both the "Set proper nclasses value for Broadwell CPUs" and "Select alternate uncore MSR for Broadwell also" patches. I've only tested this a few times; I'll try more to confirm whether it can be reproduced consistently.

> Was this obtained with pmcstudy -T?
Yes.

Comment 16 Zhenlei Huang freebsd_committer

2021-07-12 08:52:41 UTC

The source code might have been out of sync. I updated the source to the latest main, patched it, and made a clean build.

Because 'pmcstudy -T' forks 'pmcstat', the core dump shows 'pmcstat' as the current process.

Validating every PMC is slow, so I narrowed down the steps needed to trigger the panic. The panic happens consistently.

1. Cold boot, which prevents the 'dmesg' message buffer from containing info from the last boot.
2. kldload hwpmc
3. pmcstat -s CPU_CLK_UNHALTED.THREAD_P -s BR_MISP_RETIRED.ALL_BRANCHES -s MACHINE_CLEARS.CYCLES -s UOPS_ISSUED.ANY -s UOPS_RETIRED.RETIRE_SLOTS

NOTE: In step 3, any combination of five or more valid event-specs triggers the panic. Fewer than five event-specs work fine.


Kernel panic core dump text summary:

Tue Jul 13 00:36:08 CST 2021

FreeBSD  14.0-CURRENT FreeBSD 14.0-CURRENT #11 bugfix/253687-n247872-90823878749c: Mon Jul 12 15:22:17 CST 2021     root@:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-DEBUG  amd64

panic: general protection fault

Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xffffffff83230a15
stack pointer           = 0x28:0xfffffe0087f39970
frame pointer           = 0x28:0xfffffe0087f39980
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 20674 (pmcstat)
trap number             = 9
panic: general protection fault
cpuid = 0
time = 1626107735
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0087f39680
vpanic() at vpanic+0x181/frame 0xfffffe0087f396d0
panic() at panic+0x43/frame 0xfffffe0087f39730
trap_fatal() at trap_fatal+0x387/frame 0xfffffe0087f39790
trap() at trap+0xa4/frame 0xfffffe0087f398a0
calltrap() at calltrap+0x8/frame 0xfffffe0087f398a0
--- trap 0x9, rip = 0xffffffff83230a15, rsp = 0xfffffe0087f39970, rbp = 0xfffffe0087f39980 ---
ucp_start_pmc() at ucp_start_pmc+0x115/frame 0xfffffe0087f39980
pmc_syscall_handler() at pmc_syscall_handler+0x182b/frame 0xfffffe0087f39ac0
amd64_syscall() at amd64_syscall+0x749/frame 0xfffffe0087f39bf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0087f39bf0
--- syscall (0, FreeBSD ELF64, nosys), rip = 0x800a8e48a, rsp = 0x7fffffffe2a8, rbp = 0x7fffffffe2d0 ---
KDB: enter: panic
Uptime: 5m52s
Dumping 169 out of 465 MB:..10%..19%..29%..38%..48%..57%..67%..76%..86%..95%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55      /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=textdump@entry=1)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c1e6e0 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80c1eb40 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80c1e893 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff810dbd87 in trap_fatal (frame=0xfffffe0087f398b0, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:944
#6  0xffffffff810db204 in trap (frame=0xfffffe0087f398b0)
    at /usr/src/sys/amd64/amd64/trap.c:249
#7  <signal handler called>
#8  wrmsr (msr=913, newval=1) at /usr/src/sys/amd64/include/cpufunc.h:404
#9  ucp_start_pmc (cpu=0, ri=0) at /usr/src/sys/dev/hwpmc/hwpmc_uncore.c:714
#10 0xffffffff83226d5b in pmc_start (pm=<optimized out>)
    at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:3252
#11 pmc_syscall_handler (td=<optimized out>, syscall_args=<optimized out>)
    at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:4524
#12 0xffffffff810dcd69 in syscallenter (td=<optimized out>)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:161
#13 amd64_syscall (td=0xfffffe00886f6560, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1185
#14 <signal handler called>
#15 0x0000000800a8e48a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe2a8
(kgdb)

Comment 17 Zhenlei Huang freebsd_committer

2021-07-12 08:54:09 UTC

(In reply to Zhenlei Huang from comment #16)
And the backtraces look similar each time.

Comment 18 Zhenlei Huang freebsd_committer

2021-07-12 09:19:22 UTC

(In reply to Zhenlei Huang from comment #16)
The event-specs in step 3 are from https://cgit.freebsd.org/src/tree/usr.sbin/pmcstudy/pmcstudy.c#n1945

Comment 19 Mitchell Horne freebsd_committer

2021-07-29 13:45:23 UTC

Created attachment 226779
Add class validation to ICP and UCP pmc allocation methods

(In reply to Zhenlei Huang from comment #18)

Hi again, thanks for the info. I spent some time looking into this further, and I've found numerous issues in our uncore implementation, some of which can lead to the panic you are seeing.

Still, the specific line in pmcstudy that is failing shouldn't trigger this panic, since it does not allocate any uncore PMCs. I'd expect that allocating the 5th counter should fail, since your system supports four programmable counters per core. I've attached a patch that I think should fix this, so please apply it alongside the others and try again.
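
As a rough illustration of what class validation in an allocation method could look like, the sketch below rejects a request whose class does not match the class handling it, or whose row index lies outside that class. It is not the attached patch: counter_class, pmc_class_id, and validate_alloc are hypothetical names, and returning EINVAL is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical class identifiers; the real hwpmc enum differs. */
enum pmc_class_id { CLASS_IAP, CLASS_IAF, CLASS_UCP, CLASS_UCF };

/* Hypothetical description of one counter class. */
struct counter_class {
    enum pmc_class_id id;
    int num_rows;   /* number of counters in this class */
};

/*
 * Validate an allocation request against the class handling it.
 * Returns 0 if the request may proceed, or an errno value otherwise.
 */
static int
validate_alloc(const struct counter_class *cls, enum pmc_class_id requested,
    int row)
{
    assert(cls != NULL);
    if (requested != cls->id)
        return (EINVAL);    /* request belongs to another class */
    if (row < 0 || row >= cls->num_rows)
        return (EINVAL);    /* no such counter in this class */
    return (0);
}
```

Under these assumptions, a fifth programmable-counter request against a class with four rows fails with EINVAL instead of being accepted by a class it was never meant for.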

Comment 20 Zhenlei Huang freebsd_committer

2021-07-30 10:28:08 UTC

(In reply to Mitchell Horne from comment #19)
Thanks! I tested the latest current/14 with the patch 'Add class validation to ICP and UCP pmc allocation methods'; there are no panics when pmcstat runs with more than four event-specs.

It is weird that 'pmcstudy -T' still panics. 'pmcstudy -T' stops at 'unc_cbo_xsnp_response.miss_xcore'. I will check whether this is a regression caused by recent changes to hwpmc and report back later.

The core text dump:
dumped core - see /var/crash/vmcore.7

Sat Jul 31 02:06:28 CST 2021

FreeBSD  14.0-CURRENT FreeBSD 14.0-CURRENT #17 bugfix/253687-n248379-1e16bfc58152: Fri Jul 30 14:19:18 CST 2021     root@:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-DEBUG  amd64

panic: general protection fault

Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xffffffff83230a05
stack pointer           = 0x28:0xfffffe008f040970
frame pointer           = 0x28:0xfffffe008f040980
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 92141 (pmcstat)
trap number             = 9
panic: general protection fault
cpuid = 0
time = 1627659218
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008f040680
vpanic() at vpanic+0x181/frame 0xfffffe008f0406d0
panic() at panic+0x43/frame 0xfffffe008f040730
trap_fatal() at trap_fatal+0x387/frame 0xfffffe008f040790
trap() at trap+0xa4/frame 0xfffffe008f0408a0
calltrap() at calltrap+0x8/frame 0xfffffe008f0408a0
--- trap 0x9, rip = 0xffffffff83230a05, rsp = 0xfffffe008f040970, rbp = 0xfffffe008f040980 ---
ucp_start_pmc() at ucp_start_pmc+0x115/frame 0xfffffe008f040980
pmc_syscall_handler() at pmc_syscall_handler+0x182b/frame 0xfffffe008f040ac0
amd64_syscall() at amd64_syscall+0x753/frame 0xfffffe008f040bf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe008f040bf0
--- syscall (0, FreeBSD ELF64, nosys), rip = 0x800a8e48a, rsp = 0x7fffffffe308, rbp = 0x7fffffffe330 ---
KDB: enter: panic
Uptime: 9m24s
Dumping 208 out of 977 MB:..8%..16%..24%..31%..47%..54%..62%..77%..85%..93%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55      /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=textdump@entry=1)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c19370 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80c197d0 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80c19523 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff810d9fa7 in trap_fatal (frame=0xfffffe008f0408b0, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:944
#6  0xffffffff810d9424 in trap (frame=0xfffffe008f0408b0)
    at /usr/src/sys/amd64/amd64/trap.c:249
#7  <signal handler called>
#8  wrmsr (msr=913, newval=1) at /usr/src/sys/amd64/include/cpufunc.h:404
#9  ucp_start_pmc (cpu=0, ri=0) at /usr/src/sys/dev/hwpmc/hwpmc_uncore.c:715
#10 0xffffffff83226d5b in pmc_start (pm=<optimized out>)
    at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:3252
#11 pmc_syscall_handler (td=<optimized out>, syscall_args=<optimized out>)
    at /usr/src/sys/dev/hwpmc/hwpmc_mod.c:4523
#12 0xffffffff810dafa3 in syscallenter (td=<optimized out>)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:161
#13 amd64_syscall (td=0xfffffe008eaec740, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1186
#14 <signal handler called>
#15 0x0000000800a8e48a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe308
(kgdb)

Comment 21 Zhenlei Huang freebsd_committer

2021-07-30 14:15:14 UTC

Looking through the recent changes to hwpmc, they are all related to PowerPC and ARM.

I can confirm that the panic can be reproduced with the following steps:

1. kldload hwpmc
2. pmcstat -s UNC_CBO_XSNP_RESPONSE.MISS_XCORE

Comment 22 Mitchell Horne freebsd_committer

2021-08-02 21:35:34 UTC

(In reply to Zhenlei Huang from comment #21)

I still expected this panic. The short explanation is that our support for "uncore" PMC events (mostly L3 cache events) is incomplete. It would take a larger effort to fix this, so for now my solution is to disable this feature on CPUs where it's broken:

https://reviews.freebsd.org/D31389

If you can apply the patch and test it out, that would be great. I expect that pmcstudy -T will not complete successfully, but it should at least avoid the panics.

Comment 23 Zhenlei Huang freebsd_committer

2021-08-03 04:15:02 UTC

(In reply to Mitchell Horne from comment #22)
Tested D31389 with the latest current/14; no more panics :)

Thanks for your efforts!

Comment 24 commit-hook freebsd_committer

2021-08-04 18:25:20 UTC

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4f35e8cba232d9256ab1399b8adfb761988e5eff

commit 4f35e8cba232d9256ab1399b8adfb761988e5eff
Author:     Mitchell Horne <mhorne@FreeBSD.org>
AuthorDate: 2021-08-04 17:31:36 +0000
Commit:     Mitchell Horne <mhorne@FreeBSD.org>
CommitDate: 2021-08-04 18:23:22 +0000

    hwpmc: disable uncore class on Sandy Bridge and newer

    It was written for Nehalem and Westmere, with minor but incomplete
    updates for Sandy Bridge in 78d763a29b15. The uncore architecture
    changed significantly with this generation, bringing new layouts and
    locations for some MSRs.

    Misprogramming these MSRs in ucp_start_pmc() may panic the system, and
    this is trivially reproducible via pmcstat(8) on at least Broadwell and
    Haswell. Disable the class on these CPUs until it can be updated more
    completely and leave a TODO comment detailing some of the work required.
    Note that the nclasses value for Broadwell was already incorrect and
    doesn't need changing.

    The result is that any uncore events listed by pmcstat -L will no longer
    be allocatable, but this is already the case for newer generations of
    Intel CPUs.

    PR:             253687
    Reported by:    Zhenlei Huang <zlei.huang@gmail.com>
    Reviewed by:    kib
    MFC after:      1 week
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D31389

 sys/dev/hwpmc/hwpmc_intel.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)
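
The idea in the commit message above, keeping the uncore classes only on the generations the code was written for, might be sketched as follows. This is an illustration rather than the committed change: the intel_gen enum, both function names, and the assumption that uncore contributes exactly two classes (UCP and UCF, as in the console line from comment 13) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical list of Intel generations, oldest first. */
enum intel_gen {
    GEN_NEHALEM, GEN_WESTMERE, GEN_SANDYBRIDGE, GEN_IVYBRIDGE,
    GEN_HASWELL, GEN_BROADWELL, GEN_SKYLAKE
};

/* Uncore support is kept only where the code was originally written for. */
static bool
uncore_supported(enum intel_gen gen)
{
    return (gen == GEN_NEHALEM || gen == GEN_WESTMERE);
}

/*
 * Number of classes to register: the core classes plus, where supported,
 * the two uncore classes. The count and the set of initialized classes
 * must be derived from the same decision, or the two drift apart.
 */
static int
count_classes(enum intel_gen gen, int base_classes)
{
    assert(base_classes > 0);
    return (base_classes + (uncore_supported(gen) ? 2 : 0));
}
```

Deriving the class count and the decision to initialize uncore from one predicate is the design point here: a mismatch between the two is the kind of configuration error the NULL dereference earlier in this report points to.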

Comment 25 commit-hook freebsd_committer

2021-08-04 18:25:21 UTC

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=8399d923a5697b7c194dbd44c33018c94ec42c4e

commit 8399d923a5697b7c194dbd44c33018c94ec42c4e
Author:     Mitchell Horne <mhorne@FreeBSD.org>
AuthorDate: 2021-08-04 17:37:05 +0000
Commit:     Mitchell Horne <mhorne@FreeBSD.org>
CommitDate: 2021-08-04 18:23:22 +0000

    hwpmc_intel: assert for correct nclasses value

    This variable is set based on the exact CPU model detected. If this
    value is set too small, it could lead to a NULL-dereference from an
    improperly initialized pmc_rowindex_to_classdep array.

    Though it has been fixed, this was previously the case for Broadwell.
    Add two asserts to catch this in DEBUG kernels, as it represents a
    configuration error that may be hard to uncover otherwise.

    PR:             253687
    Reported by:    Zhenlei Huang <zlei.huang@gmail.com>
    Sponsored by:   The FreeBSD Foundation

 sys/dev/hwpmc/hwpmc_intel.c | 2 ++
 1 file changed, 2 insertions(+)
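
To illustrate the failure mode described in the commit above, here is a minimal user-space sketch of how a too-small nclasses can leave part of a row-index-to-class table uninitialized, and how an assert could catch it. It is only a model of the idea: struct class_desc, map_rows(), rows_fully_mapped() and the row counts are hypothetical and are not taken from hwpmc_intel.c or from the kernel's pmc_rowindex_to_classdep handling.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ROWS 16

/* Hypothetical stand-in for a per-class descriptor. */
struct class_desc {
	const char *name;
	int num_rows;		/* counters (rows) provided by this class */
};

/*
 * Map rows to classes using only the first nclasses descriptors.
 * Returns the number of rows that received a class pointer.
 */
static int
map_rows(const struct class_desc *classes, int nclasses,
    const struct class_desc **row_to_class, int max_rows)
{
	int row = 0;

	for (int c = 0; c < nclasses; c++)
		for (int n = 0; n < classes[c].num_rows && row < max_rows; n++)
			row_to_class[row++] = &classes[c];
	return (row);
}

/* Return 1 if each of the first total_rows rows has a class, else 0. */
static int
rows_fully_mapped(const struct class_desc **row_to_class, int total_rows)
{
	for (int r = 0; r < total_rows; r++)
		if (row_to_class[r] == NULL)
			return (0);
	return (1);
}

/* Debug-style check: abort early instead of dereferencing NULL later. */
static void
assert_rows_mapped(const struct class_desc **row_to_class, int total_rows)
{
	assert(rows_fully_mapped(row_to_class, total_rows));
}
```

With four classes providing ten rows in total, calling map_rows() with nclasses set to 3 leaves the trailing rows NULL, so a later lookup of one of those rows would dereference a NULL pointer. Running the check right after initialization turns that into an immediate, easy-to-diagnose failure.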

Comment 26 Mitchell Horne freebsd_committer

2021-08-04 18:29:38 UTC

(In reply to Zhenlei Huang from comment #23)

Thanks for your detailed reporting and testing. I forgot to give a 'Tested by' credit to you in the main commit, but hopefully the 'Reported by' is enough ;)

I have at least one follow-up commit to make still, and I will merge the changes back to stable/13 and stable/12 in a week or so.

Comment 27 commit-hook freebsd_committer

2021-08-11 17:25:34 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=2e50ba70742bac34f144832ceb5f6816fcc06be2

commit 2e50ba70742bac34f144832ceb5f6816fcc06be2
Author:     Mitchell Horne <mhorne@FreeBSD.org>
AuthorDate: 2021-08-04 17:31:36 +0000
Commit:     Mitchell Horne <mhorne@FreeBSD.org>
CommitDate: 2021-08-11 16:49:44 +0000

    hwpmc: disable uncore class on Sandy Bridge and newer

    It was written for Nehalem and Westmere, with minor but incomplete
    updates for Sandy Bridge in 78d763a29b15. The uncore architecture
    changed significantly with this generation, bringing new layouts and
    locations for some MSRs.

    Misprogramming these MSRs in ucp_start_pmc() may panic the system, and
    this is trivially reproducible via pmcstat(8) on at least Broadwell and
    Haswell. Disable the class on these CPUs until it can be updated more
    completely and leave a TODO comment detailing some of the work required.
    Note that the nclasses value for Broadwell was already incorrect and
    doesn't need changing.

    The result is that any uncore events listed by pmcstat -L will no longer
    be allocatable, but this is already the case for newer generations of
    Intel CPUs.

    PR:             253687
    Reported by:    Zhenlei Huang <zlei.huang@gmail.com>
    Reviewed by:    kib
    MFC after:      1 week
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D31389

    (cherry picked from commit 4f35e8cba232d9256ab1399b8adfb761988e5eff)

 sys/dev/hwpmc/hwpmc_intel.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

Comment 28 commit-hook freebsd_committer

2021-08-11 17:25:35 UTC

A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=88ef00584bc7f9f9e19546bc3ef8682d5c4d6cb6

commit 88ef00584bc7f9f9e19546bc3ef8682d5c4d6cb6
Author:     Mitchell Horne <mhorne@FreeBSD.org>
AuthorDate: 2021-08-04 17:31:36 +0000
Commit:     Mitchell Horne <mhorne@FreeBSD.org>
CommitDate: 2021-08-11 17:07:59 +0000

    hwpmc: disable uncore class on Sandy Bridge and newer

    It was written for Nehalem and Westmere, with minor but incomplete
    updates for Sandy Bridge in 78d763a29b15. The uncore architecture
    changed significantly with this generation, bringing new layouts and
    locations for some MSRs.

    Misprogramming these MSRs in ucp_start_pmc() may panic the system, and
    this is trivially reproducible via pmcstat(8) on at least Broadwell and
    Haswell. Disable the class on these CPUs until it can be updated more
    completely and leave a TODO comment detailing some of the work required.
    Note that the nclasses value for Broadwell was already incorrect and
    doesn't need changing.

    The result is that any uncore events listed by pmcstat -L will no longer
    be allocatable, but this is already the case for newer generations of
    Intel CPUs.

    PR:             253687
    Reported by:    Zhenlei Huang <zlei.huang@gmail.com>
    Reviewed by:    kib
    MFC after:      1 week
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D31389

    (cherry picked from commit 4f35e8cba232d9256ab1399b8adfb761988e5eff)

 sys/dev/hwpmc/hwpmc_intel.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

Comment 29 commit-hook freebsd_committer

2022-07-04 18:10:35 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=78f8ebe0c343eb43f832adb6061d610b777c6a76

commit 78f8ebe0c343eb43f832adb6061d610b777c6a76
Author:     Mitchell Horne <mhorne@FreeBSD.org>
AuthorDate: 2021-08-04 17:37:05 +0000
Commit:     Alexander Motin <mav@FreeBSD.org>
CommitDate: 2022-07-04 17:48:08 +0000

    hwpmc_intel: assert for correct nclasses value

    This variable is set based on the exact CPU model detected. If this
    value is set too small, it could lead to a NULL-dereference from an
    improperly initialized pmc_rowindex_to_classdep array.

    Though it has been fixed, this was previously the case for Broadwell.
    Add two asserts to catch this in DEBUG kernels, as it represents a
    configuration error that may be hard to uncover otherwise.

    PR:             253687
    Reported by:    Zhenlei Huang <zlei.huang@gmail.com>
    Sponsored by:   The FreeBSD Foundation

    (cherry picked from commit 8399d923a5697b7c194dbd44c33018c94ec42c4e)

 sys/dev/hwpmc/hwpmc_intel.c | 2 ++
 1 file changed, 2 insertions(+)