Bug 293382 - Dead lock and kernel crash around closefp_impl
Summary: Dead lock and kernel crash around closefp_impl
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 14.3-STABLE
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2026-02-23 12:06 UTC by Paul
Modified: 2026-04-11 19:07 UTC (History)
5 users (show)

See Also:


Attachments
debugging patch (5.01 KB, patch)
2026-03-31 09:30 UTC, Konstantin Belousov
no flags Details | Diff
Kernel module for testing memory allocation and copying (5.42 KB, text/plain)
2026-04-11 01:56 UTC, Kristofer Peterson
no flags Details
Userspace utility for calling memory testing kernel module (1.27 KB, text/plain)
2026-04-11 01:57 UTC, Kristofer Peterson
no flags Details
AMD EPYC 9375F: output of cpuid-etallen -1 (47.93 KB, text/plain)
2026-04-11 02:23 UTC, Kristofer Peterson
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Paul 2026-02-23 12:06:00 UTC
Hi!

We've been using 14.4-STABLE for some time now and today a weird issue has popped up. All of the sudden, our multi-threaded network app has deadlocked on some threads, but not on others. We weren't able to neither attach to it with GDB nor kill it with -9. Hard lock inside the kernel. We've managed to collect a few samples of kernel backtrace for this process with `procstat -kk`. All, basically, identical:

  PID    TID COMM                TDNAME              KSTACK
91545 101569 <redacted>          -                   mi_switch+0xbd _sx_xlock_hard+0x4ef kern_close+0x179 amd64_syscall+0x117 fast_syscall_common+0xf8
91545 102281 <redacted>          -                   mi_switch+0xbd _sx_xlock_hard+0x4ef kern_close+0x179 amd64_syscall+0x117 fast_syscall_common+0xf8
91545 102282 <redacted>          <redacted-1>        mi_switch+0xbd sleepq_catch_signals+0x2a2 sleepq_timedwait_sig+0x12 _sleep+0x1c1 umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x117 fast_syscall_common+0xf8
91545 102283 <redacted>          <redacted-2>        mi_switch+0xbd sleepq_catch_signals+0x2a2 sleepq_timedwait_sig+0x12 _sleep+0x1c1 kqueue_scan+0xa11 kqueue_kevent+0x13b kern_kevent_fp+0x4b kern_kevent_generic+0xdf sys_kevent+0x61 amd64_syscall+0x117 fast_syscall_common+0xf8
91545 102284 <redacted>          <redacted-3>        mi_switch+0xbd _sleep+0x1f3 knote_fdclose+0xac closefp_impl+0xd0 amd64_syscall+0x117 fast_syscall_common+0xf8
91545 102285 <redacted>          <redacted-4>        mi_switch+0xbd sleepq_catch_signals+0x2a2 sleepq_timedwait_sig+0x12 _sleep+0x1c1 kqueue_scan+0xa11 kqueue_kevent+0x13b kern_kevent_fp+0x4b kern_kevent_generic+0xdf sys_kevent+0x61 amd64_syscall+0x117 fast_syscall_common+0xf8
91545 102286 <redacted>          <redacted-5>        mi_switch+0xbd sleepq_catch_signals+0x2a2 sleepq_timedwait_sig+0x12 _sleep+0x1c1 kqueue_scan+0xa11 kqueue_kevent+0x13b kern_kevent_fp+0x4b kern_kevent_generic+0xdf sys_kevent+0x61 amd64_syscall+0x117 fast_syscall_common+0xf8

Apparently, three threads were deadlocked: first two, that are unnamed and `<redacted-3>`. The last one is the thread that is handling inbound socket connections. Hundreds of thousands of them, mostly WebSocket. Two other threads also use sockets, but for outbound connections. During normal operation, sockets are being open and closed as needed, obviously. Seems like in some case this may lead to a deadlock. Where one thread enters some state in kernel where it hangs, holding the lock and preventing others from closing (or modifying descriptors generally). App is async and uses kqueue for networking sockets extensively. We suspect `<redacted-3>` to be the culprit, specifically its backtrace where `closefp_impl` is involved.

And here's why. When this happened and the traffic was switched to a redundancy server, it almost immediately panicked and wend into reboot. Hopefully, we've got the core dump and were able to analyze it somewhat. And there we saw `closefp_impl` from within the same (not physically, different server) thread `<redacted-3>`:

Fatal trap 12: page fault while in kernel mode
cpuid = 22; apic id = 52
fault virtual address	= 0x10
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80572e28
stack pointer	        = 0x28:0xfffffe071c126d70
frame pointer	        = 0x28:0xfffffe071c126dc0
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 58518 (<redacted-3>)
rdi: fffff83402622be0 rsi: 0000000000000000 rdx: 0000000000000000
rcx: 0000000000000000  r8: fffff80160b9c520  r9: fffffe071c127000
rax: 0000000000000000 rbx: 0000000000031361 rbp: fffffe071c126dc0
r10: 0000000000000001 r11: 0000000000002af8 r12: fffff80160b9c000
r13: fffff87af7163e18 r14: fffff83402622be0 r15: fffff87af7163e00
trap number		= 12
panic: page fault
cpuid = 22
time = 1771839128
KDB: stack backtrace:
#0 0xffffffff8061303d at kdb_backtrace+0x5d
#1 0xffffffff805c8091 at vpanic+0x161
#2 0xffffffff805c7f23 at panic+0x43
#3 0xffffffff80972f00 at trap_pfault+0x3e0
#4 0xffffffff8094af68 at calltrap+0x8
#5 0xffffffff8056b750 at closefp_impl+0xd0
#6 0xffffffff80973847 at amd64_syscall+0x117
#7 0xffffffff8094b85b at fast_syscall_common+0xf8

When inspecting it's kernel stack:

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff805c7beb in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:523
#3  0xffffffff805c80e9 in vpanic (fmt=0xffffffff809d2ae7 "%s", ap=ap@entry=0xfffffe071c126c30) at /usr/src/sys/kern/kern_shutdown.c:967
#4  0xffffffff805c7f23 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:891
#5  0xffffffff80972f00 in trap_fatal (frame=<optimized out>, eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:1000
#6  0xffffffff80972f00 in trap_pfault (frame=0xfffffe071c126cb0, usermode=false, signo=<optimized out>, ucode=<optimized out>)
#7  <signal handler called>
#8  0xffffffff80572e28 in knote_drop (kn=0xfffff83402622be0, td=0xfffff80160b9c000) at /usr/src/sys/kern/kern_event.c:2730
#9  knote_fdclose (td=0xfffff80160b9c000, fd=201569) at /usr/src/sys/kern/kern_event.c:2695
#10 0xffffffff8056b750 in closefp_impl (fdp=0xfffffe0d1582a920, fd=0, fp=0xfffff81090d2c5a0, td=0xfffff80160b9c000, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320
#11 0xffffffff80973847 in syscallenter (td=0xfffff80160b9c000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#12 amd64_syscall (td=0xfffff80160b9c000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241
#13 <signal handler called>
#14 0x000000082deed32a in ?? ()
Backtrace stopped: Cannot access memory at address 0x85d08dbc8


Within `knote_drop` we observe a null pointer access:

(kgdb) fr 8
#8  0xffffffff80572e28 in knote_drop (kn=0xfffff83402622be0, td=0xfffff80160b9c000) at /usr/src/sys/kern/kern_event.c:2730
2730			kn->kn_fop->f_detach(kn);
(kgdb) l
2725	static void
2726	knote_drop(struct knote *kn, struct thread *td)
2727	{
2728	
2729		if ((kn->kn_status & KN_DETACHED) == 0)
2730			kn->kn_fop->f_detach(kn);
2731		knote_drop_detached(kn, td);
2732	}
2733	
2734	static void
(kgdb) p kn->kn_fop
$2 = (const struct filterops *) 0x0

If you need more info, please ask. We will be glad to provide it.

---------------

System info:

FreeBSD frv21.ukr.net 14.4-STABLE FreeBSD 14.4-STABLE stable/14-n273658-2f91ff89c56e FRV21 amd64 1404500 1404500
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2026-02-23 17:53:54 UTC
It'd be useful to see the output of

(kgdb) f 8
(kgdb) p/x *kn
(kgdb) f 10
(kgdb) p/x *fp

Is the problem reproducible?

<redacted-3> is blocked, probably waiting for an in-flux knote to stabilize.  It would be helpful to see all of the stacks for this process, as that might tell us why the knote was in-flux to begin with.
Comment 2 Paul 2026-02-24 07:28:45 UTC
Hi Mark,

Thanks for replying so quickly. Here's the output:


(kgdb) f 8
#8  0xffffffff80572e28 in knote_drop (kn=0xfffff83402622be0, td=0xfffff80160b9c000) at /usr/src/sys/kern/kern_event.c:2730
2730			kn->kn_fop->f_detach(kn);
(kgdb) p/x *kn
$2 = {
  kn_link = {
    sle_next = 0x0
  },
  kn_selnext = {
    sle_next = 0x0
  },
  kn_knlist = 0x0,
  kn_tqe = {
    tqe_next = 0x0,
    tqe_prev = 0x0
  },
  kn_kq = 0x0,
  kn_kevent = {
    ident = 0x0,
    filter = 0x0,
    flags = 0x0,
    fflags = 0x0,
    data = 0x0,
    udata = 0x0,
    ext = {0x0, 0x0, 0x0, 0x0}
  },
  kn_hook = 0x0,
  kn_hookid = 0x0,
  kn_status = 0x20,
  kn_influx = 0x1,
  kn_sfflags = 0x0,
  kn_sdata = 0x0,
  kn_ptr = {
    p_fp = 0x0,
    p_proc = 0x0,
    p_aio = 0x0,
    p_lio = 0x0,
    p_v = 0x0
  },
  kn_fop = 0x0
}
(kgdb) f 10
#10 0xffffffff8056b750 in closefp_impl (fdp=0xfffffe0d1582a920, fd=0, fp=0xfffff81090d2c5a0, td=0xfffff80160b9c000, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320
1320			knote_fdclose(td, fd);
(kgdb) p/x *fp
$3 = {
  f_flag = 0x7,
  f_count = 0x1,
  f_data = 0xfffff81a3dbd4000,
  f_ops = 0xffffffff80a673d0,
  f_vnode = 0x0,
  f_cred = 0xfffff8014c5acb00,
  f_type = 0x2,
  f_vflags = 0x0,
  {
    f_seqcount = {0x0, 0x0},
    f_pipegen = 0x0
  },
  f_nextoff = {0x0, 0x0},
  f_vnun = {
    fvn_cdevpriv = 0x0,
    fvn_advice = 0x0
  },
  f_offset = 0x0
}


Unfortunately, it is not reproducible. We've been using this system for around a month under this load. And it's the first time it happened. The other server is using it for around two months. Probably, Monday morning has something to do with it. As it's usually the time when the most users are active at the same time.

We've been using this software for years, on earlier versions of FreeBSD and had basically no critical issues. However, this generation of servers are order of magnitude faster than what we were using before. Additionally, we've go a 100GbE card in there. And we've already faced a kernel panic regarding `int` counter ofverflow within a network stack. That was on 14.3-STABLE and we've reported it in August. Hopefully, shrinking `kern.ipc.maxsockbuf` helped in that case.

I'm saying all this because we suspect that this may be a similar issue: some race condition within a kernel, that was impossible to reproduce on slower hardware, now can be reproduced, occasionally.

Stacks of other threads:

<unnamed-1>

#0  sched_switch (td=td@entry=0xfffff80d4121b740, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2437
#1  0xffffffff805d57ad in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:548
#2  0xffffffff80622c19 in sleepq_switch (wchan=wchan@entry=0xfffff87af7163e00, pri=pri@entry=46) at /usr/src/sys/kern/subr_sleepqueue.c:608
#3  0xffffffff80623042 in sleepq_catch_signals (wchan=wchan@entry=0xfffff87af7163e00, pri=46) at /usr/src/sys/kern/subr_sleepqueue.c:523
#4  0xffffffff80623242 in sleepq_timedwait_sig (wchan=<unavailable>, wchan@entry=0xfffff87af7163e00, pri=<unavailable>) at /usr/src/sys/kern/subr_sleepqueue.c:704
#5  0xffffffff805d4ed1 in _sleep (ident=ident@entry=0xfffff87af7163e00, lock=lock@entry=0xfffff87af7163e00, priority=priority@entry=302, wmesg=<optimized out>, sbt=sbt@entry=4671013802915906, pr=pr@entry=382252, flags=512) at /usr/src/sys/kern/kern_synch.c:217
#6  0xffffffff80574121 in kqueue_scan (kq=kq@entry=0xfffff87af7163e00, maxevents=maxevents@entry=128, k_ops=k_ops@entry=0xfffffe07027f4de0, tsp=tsp@entry=0xfffffe07027f4d38, keva=keva@entry=0xfffffe07027f4a90, td=td@entry=0xfffff80d4121b740) at /usr/src/sys/kern/kern_event.c:1983
#7  0xffffffff80571bdb in kqueue_kevent (kq=kq@entry=0xfffff87af7163e00, td=td@entry=0xfffff80d4121b740, nchanges=<optimized out>, nchanges@entry=0, nevents=<optimized out>, nevents@entry=128, k_ops=k_ops@entry=0xfffffe07027f4de0, timeout=0xfffffe07027f4d38) at /usr/src/sys/kern/kern_event.c:1377
#8  0xffffffff8057193b in kern_kevent_fp (td=td@entry=0xfffff80d4121b740, fp=<optimized out>, nchanges=nchanges@entry=0, nevents=nevents@entry=128, k_ops=k_ops@entry=0xfffffe07027f4de0, timeout=<unavailable>, timeout@entry=0xfffffe07027f4d38) at /usr/src/sys/kern/kern_event.c:1390
#9  0xffffffff805715cf in kern_kevent (td=0xfffff80d4121b740, fd=<optimized out>, nchanges=0, nevents=128, k_ops=0xfffffe07027f4de0, timeout=0xfffffe07027f4d38) at /usr/src/sys/kern/kern_event.c:1330
#10 kern_kevent_generic (td=0xfffff80d4121b740, uap=uap@entry=0xfffffe07027f4db0, k_ops=k_ops@entry=0xfffffe07027f4de0, struct_name=0xffffffff809ea8f0 "kevent") at /usr/src/sys/kern/kern_event.c:1186
#11 0xffffffff80571481 in sys_kevent (td=<unavailable>, uap=<optimized out>) at /usr/src/sys/kern/kern_event.c:1159
#12 0xffffffff80973847 in syscallenter (td=0xfffff80d4121b740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#13 amd64_syscall (td=0xfffff80d4121b740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241
#14 <signal handler called>
#15 0x000000082deed3ea in ?? ()


<unnamed-2>

#0  sched_switch (td=td@entry=0xfffff806652e6000, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2437
#1  0xffffffff805d57ad in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:548
#2  0xffffffff80622c19 in sleepq_switch (wchan=wchan@entry=0xfffff8014c472300, pri=pri@entry=0) at /usr/src/sys/kern/subr_sleepqueue.c:608
#3  0xffffffff80623042 in sleepq_catch_signals (wchan=wchan@entry=0xfffff8014c472300, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:523
#4  0xffffffff80622d89 in sleepq_wait_sig (wchan=<unavailable>, wchan@entry=0xfffff8014c472300, pri=<unavailable>) at /usr/src/sys/kern/subr_sleepqueue.c:671
#5  0xffffffff805d4ef6 in _sleep (ident=0xfffff8014c472300, lock=lock@entry=0xffffffff81105070 <umtxq_chains+43360>, priority=priority@entry=768, wmesg=wmesg@entry=0xffffffff809adf23 "uwait", sbt=sbt@entry=0, pr=pr@entry=0, flags=0) at /usr/src/sys/kern/kern_synch.c:221
#6  0xffffffff805e94ed in umtxq_sleep (uq=uq@entry=0xfffff8014c472300, wmesg=0xffffffff809adf23 "uwait", timo=0x0) at /usr/src/sys/kern/kern_umtx.c:840
#7  0xffffffff805eed24 in do_wait (td=td@entry=0xfffff806652e6000, addr=<optimized out>, id=0, timeout=timeout@entry=0x0, compat32=compat32@entry=1, is_private=is_private@entry=1) at /usr/src/sys/kern/kern_umtx.c:1313
#8  0xffffffff805ecdc4 in __umtx_op_wait_uint_private (td=0xfffff806652e6000, uap=0xfffffe071c199dd8, ops=<optimized out>) at /usr/src/sys/kern/kern_umtx.c:3986
#9  0xffffffff805eae5e in kern__umtx_op (td=<unavailable>, obj=<optimized out>, val=<unavailable>, uaddr1=<unavailable>, uaddr2=<optimized out>, ops=<unavailable>, op=<optimized out>) at /usr/src/sys/kern/kern_umtx.c:4995
#10 sys__umtx_op (td=<unavailable>, uap=<optimized out>) at /usr/src/sys/kern/kern_umtx.c:5020
#11 0xffffffff80973847 in syscallenter (td=0xfffff806652e6000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#12 amd64_syscall (td=0xfffff806652e6000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241
#13 <signal handler called>
#14 0x00000008221050ac in ?? ()


<redacted-1>

#0  sched_switch (td=td@entry=0xfffff80dd2a60740, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2437
#1  0xffffffff805d57ad in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:548
#2  0xffffffff80622c19 in sleepq_switch (wchan=wchan@entry=0xfffff80cadb2ab00, pri=pri@entry=0) at /usr/src/sys/kern/subr_sleepqueue.c:608
#3  0xffffffff80623042 in sleepq_catch_signals (wchan=wchan@entry=0xfffff80cadb2ab00, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:523
#4  0xffffffff80623242 in sleepq_timedwait_sig (wchan=<unavailable>, wchan@entry=0xfffff80cadb2ab00, pri=<unavailable>) at /usr/src/sys/kern/subr_sleepqueue.c:704
#5  0xffffffff805d4ed1 in _sleep (ident=0xfffff80cadb2ab00, lock=lock@entry=0xffffffff8110d590 <umtxq_chains+77440>, priority=priority@entry=768, wmesg=wmesg@entry=0xffffffff809adf23 "uwait", sbt=sbt@entry=4671013812078118, pr=pr@entry=0, flags=512) at /usr/src/sys/kern/kern_synch.c:217
#6  0xffffffff805e94ed in umtxq_sleep (uq=uq@entry=0xfffff80cadb2ab00, wmesg=0xffffffff809adf23 "uwait", timo=0xfffffe0784849d20) at /usr/src/sys/kern/kern_umtx.c:840
#7  0xffffffff805eed24 in do_wait (td=td@entry=0xfffff80dd2a60740, addr=<optimized out>, id=0, timeout=timeout@entry=0xfffffe0784849d90, compat32=compat32@entry=1, is_private=is_private@entry=1) at /usr/src/sys/kern/kern_umtx.c:1313
#8  0xffffffff805ecdc4 in __umtx_op_wait_uint_private (td=0xfffff80dd2a60740, uap=0xfffffe0784849dd8, ops=<optimized out>) at /usr/src/sys/kern/kern_umtx.c:3986
#9  0xffffffff805eae5e in kern__umtx_op (td=<unavailable>, obj=<optimized out>, val=<unavailable>, uaddr1=<unavailable>, uaddr2=<optimized out>, ops=<unavailable>, op=<optimized out>) at /usr/src/sys/kern/kern_umtx.c:4995
#10 sys__umtx_op (td=<unavailable>, uap=<optimized out>) at /usr/src/sys/kern/kern_umtx.c:5020
#11 0xffffffff80973847 in syscallenter (td=0xfffff80dd2a60740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#12 amd64_syscall (td=0xfffff80dd2a60740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241
#13 <signal handler called>
#14 0x00000008221050ac in ?? ()


<redacted-2>

#0  sched_switch (td=td@entry=0xfffff80b2f110740, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2437
#1  0xffffffff805d57ad in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:548
#2  0xffffffff80622c19 in sleepq_switch (wchan=wchan@entry=0xfffff85578222600, pri=pri@entry=46) at /usr/src/sys/kern/subr_sleepqueue.c:608
#3  0xffffffff80623042 in sleepq_catch_signals (wchan=wchan@entry=0xfffff85578222600, pri=46) at /usr/src/sys/kern/subr_sleepqueue.c:523
#4  0xffffffff80623242 in sleepq_timedwait_sig (wchan=<unavailable>, wchan@entry=0xfffff85578222600, pri=<unavailable>) at /usr/src/sys/kern/subr_sleepqueue.c:704
#5  0xffffffff805d4ed1 in _sleep (ident=ident@entry=0xfffff85578222600, lock=lock@entry=0xfffff85578222600, priority=priority@entry=302, wmesg=<optimized out>, sbt=sbt@entry=4671017460503269, pr=pr@entry=263550736, flags=512) at /usr/src/sys/kern/kern_synch.c:217
#6  0xffffffff80574121 in kqueue_scan (kq=kq@entry=0xfffff85578222600, maxevents=maxevents@entry=128, k_ops=k_ops@entry=0xfffffe071c18fde0, tsp=tsp@entry=0xfffffe071c18fd38, keva=keva@entry=0xfffffe071c18fa90, td=td@entry=0xfffff80b2f110740) at /usr/src/sys/kern/kern_event.c:1983
#7  0xffffffff80571bdb in kqueue_kevent (kq=kq@entry=0xfffff85578222600, td=td@entry=0xfffff80b2f110740, nchanges=<optimized out>, nchanges@entry=0, nevents=<optimized out>, nevents@entry=128, k_ops=k_ops@entry=0xfffffe071c18fde0, timeout=0xfffffe071c18fd38) at /usr/src/sys/kern/kern_event.c:1377
#8  0xffffffff8057193b in kern_kevent_fp (td=td@entry=0xfffff80b2f110740, fp=<optimized out>, nchanges=nchanges@entry=0, nevents=nevents@entry=128, k_ops=k_ops@entry=0xfffffe071c18fde0, timeout=<unavailable>, timeout@entry=0xfffffe071c18fd38) at /usr/src/sys/kern/kern_event.c:1390
#9  0xffffffff805715cf in kern_kevent (td=0xfffff80b2f110740, fd=<optimized out>, nchanges=0, nevents=128, k_ops=0xfffffe071c18fde0, timeout=0xfffffe071c18fd38) at /usr/src/sys/kern/kern_event.c:1330
#10 kern_kevent_generic (td=0xfffff80b2f110740, uap=uap@entry=0xfffffe071c18fdb0, k_ops=k_ops@entry=0xfffffe071c18fde0, struct_name=0xffffffff809ea8f0 "kevent") at /usr/src/sys/kern/kern_event.c:1186
#11 0xffffffff80571481 in sys_kevent (td=<unavailable>, uap=<optimized out>) at /usr/src/sys/kern/kern_event.c:1159
#12 0xffffffff80973847 in syscallenter (td=0xfffff80b2f110740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#13 amd64_syscall (td=0xfffff80b2f110740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241
#14 <signal handler called>
#15 0x000000082deed3ea in ?? ()


<redacted-3>    <--- Cause of crash

#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff805c7beb in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:523
#3  0xffffffff805c80e9 in vpanic (fmt=0xffffffff809d2ae7 "%s", ap=ap@entry=0xfffffe071c126c30) at /usr/src/sys/kern/kern_shutdown.c:967
#4  0xffffffff805c7f23 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:891
#5  0xffffffff80972f00 in trap_fatal (frame=<optimized out>, eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:1000
#6  0xffffffff80972f00 in trap_pfault (frame=0xfffffe071c126cb0, usermode=false, signo=<optimized out>, ucode=<optimized out>)
#7  <signal handler called>
#8  0xffffffff80572e28 in knote_drop (kn=0xfffff83402622be0, td=0xfffff80160b9c000) at /usr/src/sys/kern/kern_event.c:2730
#9  knote_fdclose (td=0xfffff80160b9c000, fd=201569) at /usr/src/sys/kern/kern_event.c:2695
#10 0xffffffff8056b750 in closefp_impl (fdp=0xfffffe0d1582a920, fd=0, fp=0xfffff81090d2c5a0, td=0xfffff80160b9c000, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320
#11 0xffffffff80973847 in syscallenter (td=0xfffff80160b9c000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#12 amd64_syscall (td=0xfffff80160b9c000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241
#13 <signal handler called>
#14 0x000000082deed32a in ?? ()


<redacted-4>

#0  sched_switch (td=td@entry=0xfffff80dd249a740, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2437
#1  0xffffffff805d57ad in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:548
#2  0xffffffff80622c19 in sleepq_switch (wchan=wchan@entry=0xfffff86a10f16b00, pri=pri@entry=46) at /usr/src/sys/kern/subr_sleepqueue.c:608
#3  0xffffffff80623042 in sleepq_catch_signals (wchan=wchan@entry=0xfffff86a10f16b00, pri=46) at /usr/src/sys/kern/subr_sleepqueue.c:523
#4  0xffffffff80623242 in sleepq_timedwait_sig (wchan=<unavailable>, wchan@entry=0xfffff86a10f16b00, pri=<unavailable>) at /usr/src/sys/kern/subr_sleepqueue.c:704
#5  0xffffffff805d4ed1 in _sleep (ident=ident@entry=0xfffff86a10f16b00, lock=lock@entry=0xfffff86a10f16b00, priority=priority@entry=302, wmesg=<optimized out>, sbt=sbt@entry=4671013829273089, pr=pr@entry=2046015, flags=512) at /usr/src/sys/kern/kern_synch.c:217
#6  0xffffffff80574121 in kqueue_scan (kq=kq@entry=0xfffff86a10f16b00, maxevents=maxevents@entry=128, k_ops=k_ops@entry=0xfffffe0784853de0, tsp=tsp@entry=0xfffffe0784853d38, keva=keva@entry=0xfffffe0784853a90, td=td@entry=0xfffff80dd249a740) at /usr/src/sys/kern/kern_event.c:1983
#7  0xffffffff80571bdb in kqueue_kevent (kq=kq@entry=0xfffff86a10f16b00, td=td@entry=0xfffff80dd249a740, nchanges=<optimized out>, nchanges@entry=0, nevents=<optimized out>, nevents@entry=128, k_ops=k_ops@entry=0xfffffe0784853de0, timeout=0xfffffe0784853d38) at /usr/src/sys/kern/kern_event.c:1377
#8  0xffffffff8057193b in kern_kevent_fp (td=td@entry=0xfffff80dd249a740, fp=<optimized out>, nchanges=nchanges@entry=0, nevents=nevents@entry=128, k_ops=k_ops@entry=0xfffffe0784853de0, timeout=<unavailable>, timeout@entry=0xfffffe0784853d38) at /usr/src/sys/kern/kern_event.c:1390
#9  0xffffffff805715cf in kern_kevent (td=0xfffff80dd249a740, fd=<optimized out>, nchanges=0, nevents=128, k_ops=0xfffffe0784853de0, timeout=0xfffffe0784853d38) at /usr/src/sys/kern/kern_event.c:1330
#10 kern_kevent_generic (td=0xfffff80dd249a740, uap=uap@entry=0xfffffe0784853db0, k_ops=k_ops@entry=0xfffffe0784853de0, struct_name=0xffffffff809ea8f0 "kevent") at /usr/src/sys/kern/kern_event.c:1186
#11 0xffffffff80571481 in sys_kevent (td=<unavailable>, uap=<optimized out>) at /usr/src/sys/kern/kern_event.c:1159
#12 0xffffffff80973847 in syscallenter (td=0xfffff80dd249a740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#13 amd64_syscall (td=0xfffff80dd249a740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241
#14 <signal handler called>
#15 0x000000082deed3ea in ?? ()

<redacted-5>

#0  sched_switch (td=td@entry=0xfffff80dd2e41740, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2437
#1  0xffffffff805d57ad in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:548
#2  0xffffffff80622c19 in sleepq_switch (wchan=wchan@entry=0xfffff85d0d270b00, pri=pri@entry=46) at /usr/src/sys/kern/subr_sleepqueue.c:608
#3  0xffffffff80623042 in sleepq_catch_signals (wchan=wchan@entry=0xfffff85d0d270b00, pri=46) at /usr/src/sys/kern/subr_sleepqueue.c:523
#4  0xffffffff80623242 in sleepq_timedwait_sig (wchan=<unavailable>, wchan@entry=0xfffff85d0d270b00, pri=<unavailable>) at /usr/src/sys/kern/subr_sleepqueue.c:704
#5  0xffffffff805d4ed1 in _sleep (ident=ident@entry=0xfffff85d0d270b00, lock=lock@entry=0xfffff85d0d270b00, priority=priority@entry=302, wmesg=<optimized out>, sbt=sbt@entry=4671014242900761, pr=pr@entry=36144028, flags=512) at /usr/src/sys/kern/kern_synch.c:217
#6  0xffffffff80574121 in kqueue_scan (kq=kq@entry=0xfffff85d0d270b00, maxevents=maxevents@entry=128, k_ops=k_ops@entry=0xfffffe071c1a3de0, tsp=tsp@entry=0xfffffe071c1a3d38, keva=keva@entry=0xfffffe071c1a3a90, td=td@entry=0xfffff80dd2e41740) at /usr/src/sys/kern/kern_event.c:1983
#7  0xffffffff80571bdb in kqueue_kevent (kq=kq@entry=0xfffff85d0d270b00, td=td@entry=0xfffff80dd2e41740, nchanges=<optimized out>, nchanges@entry=0, nevents=<optimized out>, nevents@entry=128, k_ops=k_ops@entry=0xfffffe071c1a3de0, timeout=0xfffffe071c1a3d38) at /usr/src/sys/kern/kern_event.c:1377
#8  0xffffffff8057193b in kern_kevent_fp (td=td@entry=0xfffff80dd2e41740, fp=<optimized out>, nchanges=nchanges@entry=0, nevents=nevents@entry=128, k_ops=k_ops@entry=0xfffffe071c1a3de0, timeout=<unavailable>, timeout@entry=0xfffffe071c1a3d38) at /usr/src/sys/kern/kern_event.c:1390
#9  0xffffffff805715cf in kern_kevent (td=0xfffff80dd2e41740, fd=<optimized out>, nchanges=0, nevents=128, k_ops=0xfffffe071c1a3de0, timeout=0xfffffe071c1a3d38) at /usr/src/sys/kern/kern_event.c:1330
#10 kern_kevent_generic (td=0xfffff80dd2e41740, uap=uap@entry=0xfffffe071c1a3db0, k_ops=k_ops@entry=0xfffffe071c1a3de0, struct_name=0xffffffff809ea8f0 "kevent") at /usr/src/sys/kern/kern_event.c:1186
#11 0xffffffff80571481 in sys_kevent (td=<unavailable>, uap=<optimized out>) at /usr/src/sys/kern/kern_event.c:1159
#12 0xffffffff80973847 in syscallenter (td=0xfffff80dd2e41740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#13 amd64_syscall (td=0xfffff80dd2e41740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241
#14 <signal handler called>
#15 0x000000082deed3ea in ?? ()
Comment 3 Paul 2026-02-24 09:30:48 UTC
Just to emphasize.

Original output of `procstat -kk` is different from the kernel crash.

Original output was from one server. The crash is from the other.

Original server, where the deadlock has occurred, has crashed soon thereafter. Unfortunately, it did not produce the dump for some reason.

Second server had no symptoms, but crashed immediately when traffic was re-routed towards it. And it did produce the core dump. And it't the one we're analyzing in here. And seeing `kqueue_scan` in both cases makes us suspect that these are two symptoms of the same issue.
Comment 4 Mark Johnston freebsd_committer freebsd_triage 2026-02-24 15:54:31 UTC
(In reply to Paul from comment #3)
> And seeing `kqueue_scan` in both cases makes us suspect that these are two symptoms of the same issue.

Yes, I think they are likely related.

I wonder if you could try running your application on a GENERIC-DEBUG kernel?  The output from comment 2 shows that the knote is a marker knote used during list scans, it's pretty weird that knote_fdclose() would encounter one, and I suspect that it reflects a bug elsewhere in the kqueue code.

Using a debug kernel would have a performance penalty but will hopefully make it much easier to track this down.
Comment 5 Paul 2026-02-25 18:06:38 UTC
Hi Mark,

Not a problem at all. We are now running with GENERIC-DEBUG kernel. Hopefully it won't take too much time to reproduce. We'll stay in touch and report back as soon as it happens again.
Comment 6 Paul 2026-03-05 08:10:03 UTC
Finally caught another crash. Alas I was a bit busy at the time and only reporting back now.

The bad thing is: somehow it did not produce the /var/crash/vmcore.1. After the reboot it has spitted out:

Starting syslogd.
savecore 1735 - - reboot after panic: general protection fault
Mar  3 12:14:25 frv21 savecore[1735]: reboot after panic: general protection fault
savecore 1735 - - writing core to /var/crash/textdump.tar.1
Writing crash summary to /var/crash/core.txt.1.
Starting chronyd.

but...

$ cat core.txt.1
/var/crash/vmcore.1 not found


Available files are:

config.txt     core.txt.1     ddb.txt        msgbuf.txt     panic.txt      textdump.tar.1


Last messages:

$ tail -n 50 msgbuf.txt
<5>Limiting tcp reset response from 67253 to 49997 packets/sec
<5>Limiting tcp reset response from 67169 to 49995 packets/sec
<5>Limiting tcp reset response from 65293 to 49993 packets/sec
<5>Limiting tcp reset response from 64794 to 50015 packets/sec
<5>Limiting tcp reset response from 65880 to 50012 packets/sec
<5>Limiting tcp reset response from 65243 to 50000 packets/sec
<5>Limiting tcp reset response from 71695 to 49993 packets/sec
<5>Limiting tcp reset response from 79815 to 49998 packets/sec
<5>Limiting tcp reset response from 71724 to 49997 packets/sec
<5>Limiting tcp reset response from 67537 to 50005 packets/sec
<5>Limiting tcp reset response from 65417 to 50000 packets/sec
<5>Limiting tcp reset response from 75635 to 50009 packets/sec
<5>Limiting tcp reset response from 71016 to 50005 packets/sec
<5>Limiting tcp reset response from 78743 to 49986 packets/sec
<4>TCP syncache overflow detected; using syncookies for the next 15 seconds
<118>Mar  3 00:40:02 frv21 kernel: TCP syncache overflow detected; using syncookies for the next 15 seconds


Fatal trap 9: general protection fault while in kernel mode
cpuid = 3; apic id = 03
instruction pointer	= 0x20:0xffffffff80b590fd
stack pointer	        = 0x28:0xfffffe07e5854d70
frame pointer	        = 0x28:0xfffffe07e5854d70
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 86706 (<redacted-3>)
rdi: deadc0dedeadc0f6 rsi: 0000000000000004 rdx: ffffffff811ab239
rcx: 0000000000000122  r8: 0000000000000001  r9: ffffffff81e1cba8
rax: fffff801be6b4000 rbx: 00000000000774c3 rbp: fffffe07e5854d70
r10: 0000000000000000 r11: 0000000000000004 r12: fffff8010bbfa418
r13: fffff84f0bab13c0 r14: 00000000000774c3 r15: fffff8010bbfa400
trap number		= 9
panic: general protection fault
cpuid = 3
time = 1772532522
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe07e5854af0
vpanic() at vpanic+0x161/frame 0xfffffe07e5854c20
panic() at panic+0x43/frame 0xfffffe07e5854c80
trap_fatal() at trap_fatal+0x68/frame 0xfffffe07e5854ca0
calltrap() at calltrap+0x8/frame 0xfffffe07e5854ca0
--- trap 0x9, rip = 0xffffffff80b590fd, rsp = 0xfffffe07e5854d70, rbp = 0xfffffe07e5854d70 ---
__mtx_assert() at __mtx_assert+0x3d/frame 0xfffffe07e5854d70
knote_fdclose() at knote_fdclose+0x12e/frame 0xfffffe07e5854dc0
closefp_impl() at closefp_impl+0x96/frame 0xfffffe07e5854e00
amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe07e5854f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe07e5854f30
--- syscall (6, FreeBSD ELF64, close), rip = 0x82d3ea32a, rsp = 0x860151bb8, rbp = 0x860151bd0 ---


Crash point still looks the same.

We aren't sure why there is no core dump. Have we configured/built the kernel improperly?
But, the last time, only one of the two crashes have produced it. 
Seems like it chance related and we have to keep trying?
Comment 7 Paul 2026-03-05 09:41:54 UTC
And it crashed again, just now. Still no core dump. But this time it looks pretty interesting. It may be the hint of the problem: a missing lock, hence the 'racy' nature of the issue.


<118>Starting syslogd.
<118>savecore 1735 - - reboot after panic: general protection fault
<118>Mar  3 12:14:25 frv21 savecore[1735]: reboot after panic: general protection fault
<118>savecore 1735 - - writing core to /var/crash/textdump.tar.1
<118>Writing crash summary to /var/crash/core.txt.1.
<118>Starting chronyd.
<118>Starting local daemons:add net 10.0.0.0: gateway 10.10.255.254
<118>.
<118>Mounting late filesystems:.
<118>Starting powerd.
<118>Starting snmpd.
<118>Starting collectd.
<118>Starting nrpe.
<118>Starting cron.
<118>Performing sanity check on sshd configuration.
<118>Starting sshd.
<118>Starting background file system checks in 60 seconds.
<118>
<118>Tue Mar  3 12:14:25 EET 2026
panic: mutex kqueue not owned at /usr/src/sys/kern/kern_event.c:290
cpuid = 24
time = 1772701913
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe07171f5810
vpanic() at vpanic+0x161/frame 0xfffffe07171f5940
panic() at panic+0x43/frame 0xfffffe07171f59a0
__mtx_assert() at __mtx_assert+0x9f/frame 0xfffffe07171f59b0
kqueue_register() at kqueue_register+0x758/frame 0xfffffe07171f5a30
kqueue_kevent() at kqueue_kevent+0xc9/frame 0xfffffe07171f5c90
kern_kevent_fp() at kern_kevent_fp+0x9b/frame 0xfffffe07171f5ce0
kern_kevent() at kern_kevent+0x82/frame 0xfffffe07171f5d40
kern_kevent_generic() at kern_kevent_generic+0x70/frame 0xfffffe07171f5da0
sys_kevent() at sys_kevent+0x61/frame 0xfffffe07171f5e00
amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe07171f5f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe07171f5f30
--- syscall (560, FreeBSD ELF64, kevent), rip = 0x82d3df3ea, rsp = 0x85ca5eb58, rbp = 0x85ca5ec40 ---
KDB: enter: panic
Comment 8 Mark Johnston freebsd_committer freebsd_triage 2026-03-08 01:39:22 UTC
(In reply to Paul from comment #7)
From the dmesg lines, it looks like you're saving a textdump instead of a binary core dump.  Are you setting the debug.ddb.textdump.pending sysctl?
Comment 9 Paul 2026-03-09 07:28:20 UTC
These are our sysctl parameters:

$ sysctl debug.ddb.textdump
debug.ddb.textdump.do_version: 1
debug.ddb.textdump.do_panic: 1
debug.ddb.textdump.do_msgbuf: 1
debug.ddb.textdump.do_ddb: 1
debug.ddb.textdump.do_config: 1
debug.ddb.textdump.pending: 0


It can't be that, as dumps do appear, occasionally.
Comment 10 Paul 2026-03-11 07:49:54 UTC
Hi,

Absence of mem dumps was indeed an issue on our side. Finally we've got another panic with a memory dump, this time. This is the culprit:

#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff804a4718 in db_fncall_generic (nargs=0, args=0xfffffe2364ef9240, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:626
#3  db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:674
#4  0xffffffff804a418d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:504
#5  0xffffffff804a42d6 in db_command_script (command=command@entry=0xffffffff81bba6e2 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:569
#6  0xffffffff804a9578 in db_script_exec (scriptname=scriptname@entry=0xfffffe2364ef9410 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:302
#7  0xffffffff804a9472 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:324
#8  0xffffffff804a7531 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:267
#9  0xffffffff80bd0950 in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe2364ef9750) at /usr/src/sys/kern/subr_kdb.c:790
#10 0xffffffff810b3a07 in trap (frame=0xfffffe2364ef9750) at /usr/src/sys/amd64/amd64/trap.c:639
#11 <signal handler called>
#12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556
#13 0xffffffff80b7fc2d in vpanic (fmt=0xffffffff81257899 "mutex %s not owned at %s:%d", ap=ap@entry=0xfffffe2364ef9980) at /usr/src/sys/kern/kern_shutdown.c:953
#14 0xffffffff80b7f9f3 in panic (fmt=0xffffffff81d853a0 <cnputs_mtx> "\233\327\031\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:891
#15 0xffffffff80b5915f in __mtx_assert (c=<optimized out>, what=what@entry=4, file=0xffffffff812c1e07 "/usr/src/sys/kern/kern_cons.c", line=-1070498644, line@entry=290) at /usr/src/sys/kern/kern_mutex.c:1092
#16 0xffffffff80b23738 in kn_enter_flux (kn=0xfffff873f4f1e820) at /usr/src/sys/kern/kern_event.c:290
#17 kqueue_register (kq=kq@entry=0xfffff80189cb1c00, kev=kev@entry=0xfffffe2364ef9a40, td=td@entry=0xfffff80155304740, mflag=mflag@entry=2) at /usr/src/sys/kern/kern_event.c:1699
#18 0xffffffff80b24729 in kqueue_kevent (kq=kq@entry=0xfffff80189cb1c00, td=td@entry=0xfffff80155304740, nchanges=nchanges@entry=1, nevents=nevents@entry=0, k_ops=k_ops@entry=0xfffffe2364ef9de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1359
#19 0xffffffff80b245eb in kern_kevent_fp (td=td@entry=0xfffff80155304740, fp=<optimized out>, nchanges=nchanges@entry=1, nevents=nevents@entry=0, k_ops=k_ops@entry=0xfffffe2364ef9de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1390
#20 0xffffffff80b24502 in kern_kevent (td=td@entry=0xfffff80155304740, fd=<optimized out>, nchanges=1, nevents=0, k_ops=k_ops@entry=0xfffffe2364ef9de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1330
#21 0xffffffff80b24200 in kern_kevent_generic (td=0xfffff80155304740, uap=uap@entry=0xfffffe2364ef9db0, k_ops=k_ops@entry=0xfffffe2364ef9de0, struct_name=0xffffffff8127f5b9 "kevent") at /usr/src/sys/kern/kern_event.c:1186
#22 0xffffffff80b240f1 in sys_kevent (td=0xffffffff81d853a0 <cnputs_mtx>, uap=<optimized out>) at /usr/src/sys/kern/kern_event.c:1159
#23 0xffffffff810b4f0a in syscallenter (td=0xfffff80155304740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#24 amd64_syscall (td=0xfffff80155304740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241
#25 <signal handler called>
#26 0x000000082f6dc3ea in ?? ()

Seems like a missing mutex assert is failing, which perfectly explains the racy nature of the issue.

(kgdb) fr 16
#16 0xffffffff80b23738 in kn_enter_flux (kn=0xfffff873f4f1e820) at /usr/src/sys/kern/kern_event.c:290
290		KQ_OWNED(kn->kn_kq);
(kgdb) l
285	
286	static void
287	kn_enter_flux(struct knote *kn)
288	{
289	
290		KQ_OWNED(kn->kn_kq);
291		MPASS(kn->kn_influx < INT_MAX);
292		kn->kn_influx++;
293	}

(kgdb) p *kn
$3 = {
  kn_link = {
    sle_next = 0x0
  },
  kn_selnext = {
    sle_next = 0x0
  },
  kn_knlist = 0xfffff83c698aa438,
  kn_tqe = {
    tqe_next = 0xffffffffffffffff,
    tqe_prev = 0xffffffffffffffff
  },
  kn_kq = 0xfffff80164651600,
  kn_kevent = {
    ident = 267406,
    filter = -1,
    flags = 32,
    fflags = 0,
    data = 0,
    udata = 0x2c0901461480,
    ext = {0, 0, 0, 0}
  },
  kn_hook = 0x0,
  kn_hookid = 0,
  kn_status = 0,
  kn_influx = 0,
  kn_sfflags = 0,
  kn_sdata = 0,
  kn_ptr = {
    p_fp = 0xfffff83f7bcf6b40,
    p_proc = 0xfffff83f7bcf6b40,
    p_aio = 0xfffff83f7bcf6b40,
    p_lio = 0xfffff83f7bcf6b40,
    p_v = 0xfffff83f7bcf6b40
  },
  kn_fop = 0xffffffff81437d38 <soread_filtops>
}

(kgdb) fr 18
#18 0xffffffff80b24729 in kqueue_kevent (kq=kq@entry=0xfffff80189cb1c00, td=td@entry=0xfffff80155304740, nchanges=nchanges@entry=1, nevents=nevents@entry=0, k_ops=k_ops@entry=0xfffffe2364ef9de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1359
1359				error = kqueue_register(kq, kevp, td, M_WAITOK);
(kgdb) p *kevp
$9 = {
  ident = 312462,
  filter = -1,
  flags = 2,
  fflags = 0,
  data = 0,
  udata = 0x0,
  ext = {0, 0, 0, 0}
}

Please, tell us, if you need more.
Comment 11 Mark Johnston freebsd_committer freebsd_triage 2026-03-14 02:31:25 UTC
https://reviews.freebsd.org/D55852 might fix this.  I see a race which I think can cause these assertion failures, but I think it would be triggered by different userspace threads simultaneously registering an fd with a kqueue while closing the fd, which is presumably an application bug.

Note, I haven't tested the patch yet, so please don't try it unless you can recover from a broken kernel.
Comment 12 Paul 2026-03-16 08:57:21 UTC
Hi mark, thanks!

It very well might be the case that some FD is being simultaneously registered in kqueue and closed. I wouldn't go as far as to say it's the user-space application bug. After all, EBADF is a pretty legitimate error and a condition. We use asio library, and multiple threads. It may, indeed not have the `close()` serialized with a kqueue requests to the kernel. And I still hold my belief that they are right, as I can't remember reading anywhere about close + kqueue being concurrently forbidden: after all, kernel must *always* play on defensive and never trust or give tools to user-space, that can cause panic or crash.

Please, ignore my rabble above. We unmeasurably appreciate your efforts and help!

We are going to take your patch into testing now.

----

As a side note, just a bit of concern: this patch seems to only 'refresh the state' on certain condition and not use an actual lock. Would it not be susceptible to ABA still and just reduce the likelihood of a race, not eliminating it entirely?
Comment 13 Paul 2026-03-16 10:15:17 UTC
Please, ignore my last remark. I've realized, albeit too late, that it's all happening under the protection of a lock.
Comment 14 Paul 2026-03-17 09:48:52 UTC
Sadly, it crashed again.

Fatal trap 9: general protection fault while in kernel mode
cpuid = 28; apic id = 70
instruction pointer     = 0x20:0xffffffff80b5915d
stack pointer           = 0x28:0xfffffe0718b5ed70
frame pointer           = 0x28:0xfffffe0718b5ed70
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 2968 (asy:http:s)
rdi: deadc0dedeadc0f6 rsi: 0000000000000004 rdx: ffffffff811ab239
rcx: 0000000000000121  r8: 0000000000000001  r9: ffffffff81e1efc8
rax: fffff80586c4e740 rbx: 000000000004a19a rbp: fffffe0718b5ed70
r10: 0000000000000000 r11: 0000000000000004 r12: fffff801090cad18
r13: fffff83937739f00 r14: 000000000004a19a r15: fffff801090cad00
trap number             = 9
panic: general protection fault
cpuid = 28
time = 1773739454
KDB: stack backtrace:
...


(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff804a4718 in db_fncall_generic (nargs=0, args=0xfffffe0718b5e520, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:626
#3  db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:674
#4  0xffffffff804a418d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:504
#5  0xffffffff804a42d6 in db_command_script (command=command@entry=0xffffffff81bba6e2 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:569
#6  0xffffffff804a9578 in db_script_exec (scriptname=scriptname@entry=0xfffffe0718b5e6f0 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:302
#7  0xffffffff804a9472 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:324
#8  0xffffffff804a7531 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:267
#9  0xffffffff80bd09b0 in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe0718b5ea30) at /usr/src/sys/kern/subr_kdb.c:790
#10 0xffffffff810b3a07 in trap (frame=0xfffffe0718b5ea30) at /usr/src/sys/amd64/amd64/trap.c:639
#11 <signal handler called>
#12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556
#13 0xffffffff80b7fc8d in vpanic (fmt=0xffffffff81237367 "%s", ap=ap@entry=0xfffffe0718b5ec60) at /usr/src/sys/kern/kern_shutdown.c:953
#14 0xffffffff80b7fa53 in panic (fmt=0xffffffff81d853a0 <cnputs_mtx> "\233\327\031\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:891
#15 0xffffffff810b40b8 in trap_fatal (frame=0xfffffe0718b5ecb0, eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:1000
#16 <signal handler called>
#17 __mtx_assert (c=0xdeadc0dedeadc0f6, what=what@entry=4, file=0xffffffff811ab239 "/usr/src/sys/kern/kern_event.c", line=line@entry=289) at /usr/src/sys/kern/kern_mutex.c:1091
#18 0xffffffff80b25c9e in kn_enter_flux (kn=<optimized out>) at /usr/src/sys/kern/kern_event.c:289
#19 knote_fdclose (td=td@entry=0xfffff80586c4e740, fd=fd@entry=303514) at /usr/src/sys/kern/kern_event.c:2704
#20 0xffffffff80b1dbd6 in closefp_impl (fdp=0xfffffe07143ca920, fd=303514, fp=0xfffff82112ac8e10, td=0xfffff80586c4e740, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320
#21 0xffffffff810b4f0a in syscallenter (td=0xfffff80586c4e740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#22 amd64_syscall (td=0xfffff80586c4e740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241
#23 <signal handler called>
#24 0x000000082c76832a in ?? ()
Backtrace stopped: Cannot access memory at address 0x8624d1b98

(kgdb) fr 18
#18 0xffffffff80b25c9e in kn_enter_flux (kn=<optimized out>) at /usr/src/sys/kern/kern_event.c:289
289		KQ_OWNED(kn->kn_kq);
(kgdb) p *kn->kn_kq
value has been optimized out
(kgdb) up
#19 knote_fdclose (td=td@entry=0xfffff80586c4e740, fd=fd@entry=303514) at /usr/src/sys/kern/kern_event.c:2704
2704				kn_enter_flux(kn);
(kgdb) p *kn
$2 = {
  kn_link = {
    sle_next = 0xdeadc0dedeadc0de
  },
  kn_selnext = {
    sle_next = 0xdeadc0dedeadc0de
  },
  kn_knlist = 0xdeadc0dedeadc0de,
  kn_tqe = {
    tqe_next = 0xdeadc0dedeadc0de,
    tqe_prev = 0xdeadc0dedeadc0de
  },
  kn_kq = 0xdeadc0dedeadc0de,
  kn_kevent = {
    ident = 16045693110842147038,
    filter = -16162,
    flags = 57005,
    fflags = 3735929054,
    data = -2401050962867404578,
    udata = 0xdeadc0dedeadc0de,
    ext = {16045693110842147038, 16045693110842147038, 16045693110842147038, 16045693110842147038}
  },
  kn_hook = 0xdeadc0dedeadc0de,
  kn_hookid = -559038242,
  kn_status = -559038242,
  kn_influx = -559038242,
  kn_sfflags = -559038242,
  kn_sdata = -2401050962867404578,
  kn_ptr = {
    p_fp = 0xdeadc0dedeadc0de,
    p_proc = 0xdeadc0dedeadc0de,
    p_aio = 0xdeadc0dedeadc0de,
    p_lio = 0xdeadc0dedeadc0de,
    p_v = 0xdeadc0dedeadc0de
  },
  kn_fop = 0xdeadc0dedeadc0de
}
(kgdb) p *kn->kn_kq
Cannot access memory at address 0xdeadc0dedeadc0de
(kgdb) 
#19 knote_fdclose (td=td@entry=0xfffff80586c4e740, fd=fd@entry=303514) at /usr/src/sys/kern/kern_event.c:2704
2704				kn_enter_flux(kn);
(kgdb) up
#20 0xffffffff80b1dbd6 in closefp_impl (fdp=0xfffffe07143ca920, fd=303514, fp=0xfffff82112ac8e10, td=0xfffff80586c4e740, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320
1320			knote_fdclose(td, fd);
(kgdb) p *fp
$4 = {
  f_flag = 7,
  f_count = 1,
  f_data = 0xfffff868aabeb000,
  f_ops = 0xffffffff81436808 <socketops>,
  f_vnode = 0x0,
  f_cred = 0xfffff804a3517700,
  f_type = 2,
  f_vflags = 0,
  {
    f_seqcount = {0, 0},
    f_pipegen = 0
  },
  f_nextoff = {0, 0},
  f_vnun = {
    fvn_cdevpriv = 0x0,
    fvn_advice = 0x0
  },
  f_offset = 0
}
(kgdb) p *fdp
$5 = {
  fd_files = 0xfffffe0836e00000,
  fd_map = 0xfffffe080448a000,
  fd_freefile = 3,
  fd_refcnt = 1,
  fd_holdcnt = 1,
  fd_sx = {
    lock_object = {
      lo_name = 0xffffffff812b4244 "filedesc structure",
      lo_flags = 36896768,
      lo_data = 0,
      lo_witness = 0xfffff8804bd94380
    },
    sx_lock = 18446735301352417088
  },
  fd_kqlist = {
    tqh_first = 0xfffff80150f62d00,
    tqh_last = 0xfffff801090b5628
  },
  fd_holdleaderscount = 0,
  fd_holdleaderswakeup = 0
}


Please, tell us, if more useful data can be extracted from this dump.
Comment 15 Konstantin Belousov freebsd_committer freebsd_triage 2026-03-17 15:28:36 UTC
In the frame 19, there should be a local var kq.
Please print out the *kq.

Is 303514 sounds like a possibly correct fd value?  Do you have filedesc
limits bumped up?

Could you try your load with the HEAD kernel with all debugging turned on?
No need to install userspace, only kernel.
Comment 16 Paul 2026-03-17 18:19:15 UTC
Alas, it was optimized out. Probably because its value is in a register?

(kgdb) fr 19
#19 knote_fdclose (td=td@entry=0xfffff80586c4e740, fd=fd@entry=303514) at /usr/src/sys/kern/kern_event.c:2704
2704				kn_enter_flux(kn);
(kgdb) p *kq
value has been optimized out
(kgdb) p kq
$4 = <optimized out>

Using a brute force approach, looks like r15 is a correct guess:

(kgdb) i r
rax            0xfffff80586c4e740  -8772357134528
rbx            0x4a19a             303514
rcx            0x121               289
rdx            0xffffffff811ab239  -2128956871
rsi            0x4                 4
rdi            0xdeadc0dedeadc0f6  -2401050962867404554
rbp            0xfffffe0718b5edc0  0xfffffe0718b5edc0
rsp            0xfffffe0718b5ed80  0xfffffe0718b5ed80
r8             0x1                 1
r9             0xffffffff81e1efc8  -2115899448
r10            0x0                 0
r11            0x4                 4
r12            0xfffff801090cad18  -8791646229224
r13            0xfffff83937739f00  -8550349562112
r14            0x4a19a             303514
r15            0xfffff801090cad00  -8791646229248
rip            0xffffffff80b25c9e  0xffffffff80b25c9e <knote_fdclose+302>
eflags         0x10297             [ CF PF AF SF IF RF ]
cs             0x20                32
ss             0x28                40
ds             <unavailable>
es             <unavailable>
fs             <unavailable>
gs             <unavailable>
fs_base        <unavailable>
gs_base        <unavailable>
(kgdb) p *((struct kqueue*)$r15)
$14 = {
  kq_lock = {
    lock_object = {
      lo_name = 0xffffffff812bbf6c "kqueue",
      lo_flags = 21168128,
      lo_data = 0,
      lo_witness = 0xfffff8804bd8da80
    },
    mtx_lock = 18446735301352417088
  },
  kq_refcnt = 0,
  kq_list = {
    tqe_next = 0xfffff80155923a00,
    tqe_prev = 0xfffff80150f62d28
  },
  kq_head = {
    tqh_first = 0x0,
    tqh_last = 0xfffff801090cad38
  },
  kq_count = 0,
  kq_sel = {
    si_tdlist = {
      tqh_first = 0x0,
      tqh_last = 0x0
    },
    si_note = {
      kl_list = {
        slh_first = 0x0
      },
      kl_lock = 0xffffffff80b254e0 <knlist_mtx_lock>,
      kl_unlock = 0xffffffff80b25500 <knlist_mtx_unlock>,
      kl_assert_lock = 0xffffffff80b25520 <knlist_mtx_assert_lock>,
      kl_lockarg = 0xfffff801090cad00,
      kl_autodestroy = 0
    },
    si_mtx = 0x0
  },
  kq_sigio = 0x0,
  kq_fdp = 0xfffffe07143ca920,
  kq_state = 0,
  kq_knlistsize = 700672,
  kq_knlist = 0xfffffe09b7213000,
  kq_knhashmask = 0,
  kq_knhash = 0x0,
  kq_task = {
    ta_link = {
      stqe_next = 0x0
    },
    ta_pending = 0,
    ta_priority = 0 '\000',
    ta_flags = 0 '\000',
    ta_func = 0xffffffff80b26060 <kqueue_task>,
    ta_context = 0xfffff801090cad00
  },
  kq_cred = 0xfffff804a3517700
}


Yeah, 303514 is definitely a valid descriptor value. We do handle hundreds of thousands of websocket connections normally. And yes, our limits are adjusted to several millions, just in case...

We will consider using HEAD for a test run, without updating the world. We will stay in touch and report back as soon as there is something to report.

In any case, if you need more info, we will be glad to provide it.
Comment 17 Konstantin Belousov freebsd_committer freebsd_triage 2026-03-17 19:16:15 UTC
(In reply to Paul from comment #16)
If believing the guessed location for kq, and there is a reason to believe
because the lock name looks right, it seems that the kq was drained/closed
in parallel with closing the fd.  This might explain why the knote dereferenced
was garbage.

Besides trying the HEAD, you could try to apply the following rude patch.
If it helps, I might consider using a sentinel to resume iteration instead
of restarting from scratch, after the kq relock.

diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c
index e8e670d39d09..4bbd9063b6ef 100644
--- a/sys/kern/kern_event.c
+++ b/sys/kern/kern_event.c
@@ -2830,10 +2830,9 @@ knote_fdclose(struct thread *td, int fd)
 	 * We shouldn't have to worry about new kevents appearing on fd
 	 * since filedesc is locked.
 	 */
+again:
 	TAILQ_FOREACH(kq, &fdp->fd_kqlist, kq_list) {
 		KQ_LOCK(kq);
-
-again:
 		influx = 0;
 		while (kq->kq_knlistsize > fd &&
 		    (kn = SLIST_FIRST(&kq->kq_knlist[fd])) != NULL) {
Comment 18 Paul 2026-03-17 20:25:30 UTC
Thanks, we will start testing this patch first.
Comment 19 Paul 2026-03-18 10:10:18 UTC
Sadly, it happens still, even with the latest patch of kern_event.c:


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff80b5914d
stack pointer	        = 0x28:0xfffffe0718977d60
frame pointer	        = 0x28:0xfffffe0718977d60
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 3115 (asy:http:s)
rdi: deadc0dedeadc0f6 rsi: 0000000000000004 rdx: ffffffff811ab239
rcx: 0000000000000121  r8: 0000000000000001  r9: ffffffff81e1ec98
rax: fffff803c20c3740 rbx: 000000000008fa97 rbp: fffffe0718977d60
r10: 0000000000000000 r11: 0000000000000004 r12: fffff80155c37718
r13: fffff819bc941960 r14: 000000000008fa97 r15: fffff80155c37700
trap number		= 9
panic: general protection fault
cpuid = 0
time = 1773824580
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0718977ae0
vpanic() at vpanic+0x161/frame 0xfffffe0718977c10
panic() at panic+0x43/frame 0xfffffe0718977c70
trap_fatal() at trap_fatal+0x68/frame 0xfffffe0718977c90
calltrap() at calltrap+0x8/frame 0xfffffe0718977c90
--- trap 0x9, rip = 0xffffffff80b5914d, rsp = 0xfffffe0718977d60, rbp = 0xfffffe0718977d60 ---
__mtx_assert() at __mtx_assert+0x3d/frame 0xfffffe0718977d60
knote_fdclose() at knote_fdclose+0x11e/frame 0xfffffe0718977dc0
closefp_impl() at closefp_impl+0x96/frame 0xfffffe0718977e00
amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe0718977f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0718977f30
--- syscall (6, FreeBSD ELF64, close), rip = 0x82ddf932a, rsp = 0x85fb5eb88, rbp = 0x85fb5eba0 ---
KDB: enter: panic

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff804a4718 in db_fncall_generic (nargs=0, args=0xfffffe0718977510, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:626
#3  db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:674
#4  0xffffffff804a418d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:504
#5  0xffffffff804a42d6 in db_command_script (command=command@entry=0xffffffff81bba6e2 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:569
#6  0xffffffff804a9578 in db_script_exec (scriptname=scriptname@entry=0xfffffe07189776e0 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:302
#7  0xffffffff804a9472 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:324
#8  0xffffffff804a7531 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:267
#9  0xffffffff80bd09a0 in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe0718977a20) at /usr/src/sys/kern/subr_kdb.c:790
#10 0xffffffff810b3a07 in trap (frame=0xfffffe0718977a20) at /usr/src/sys/amd64/amd64/trap.c:639
#11 <signal handler called>
#12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556
#13 0xffffffff80b7fc7d in vpanic (fmt=0xffffffff81237367 "%s", ap=ap@entry=0xfffffe0718977c50) at /usr/src/sys/kern/kern_shutdown.c:953
#14 0xffffffff80b7fa43 in panic (fmt=0xffffffff81d853a0 <cnputs_mtx> "\233\327\031\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:891
#15 0xffffffff810b40b8 in trap_fatal (frame=0xfffffe0718977ca0, eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:1000
#16 <signal handler called>
#17 __mtx_assert (c=0xdeadc0dedeadc0f6, what=what@entry=4, file=0xffffffff811ab239 "/usr/src/sys/kern/kern_event.c", line=line@entry=289) at /usr/src/sys/kern/kern_mutex.c:1091
#18 0xffffffff80b25c8e in kn_enter_flux (kn=<optimized out>) at /usr/src/sys/kern/kern_event.c:289
#19 knote_fdclose (td=td@entry=0xfffff803c20c3740, fd=fd@entry=588439) at /usr/src/sys/kern/kern_event.c:2703
#20 0xffffffff80b1dbd6 in closefp_impl (fdp=0xfffffe0713371430, fd=588439, fp=0xfffff86e9b7ee190, td=0xfffff803c20c3740, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320
#21 0xffffffff810b4f0a in syscallenter (td=0xfffff803c20c3740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#22 amd64_syscall (td=0xfffff803c20c3740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241
#23 <signal handler called>
#24 0x000000082ddf932a in ?? ()
Backtrace stopped: Cannot access memory at address 0x85fb5eb88

(kgdb) l /usr/src/sys/kern/kern_event.c:2690
2685		/*
2686		 * We shouldn't have to worry about new kevents appearing on fd
2687		 * since filedesc is locked.
2688		 */
2689	again:
2690		TAILQ_FOREACH(kq, &fdp->fd_kqlist, kq_list) {
2691			KQ_LOCK(kq);
2692			influx = 0;
2693			while (kq->kq_knlistsize > fd &&
2694			    (kn = SLIST_FIRST(&kq->kq_knlist[fd])) != NULL) {
(kgdb) fr 18
#18 0xffffffff80b25c8e in kn_enter_flux (kn=<optimized out>) at /usr/src/sys/kern/kern_event.c:289
289		KQ_OWNED(kn->kn_kq);
(kgdb) p *kn->kn_kq
value has been optimized out
(kgdb) up
#19 knote_fdclose (td=td@entry=0xfffff803c20c3740, fd=fd@entry=588439) at /usr/src/sys/kern/kern_event.c:2703
2703				kn_enter_flux(kn);
(kgdb) p kn
$4 = (struct knote *) 0xfffff819bc941960
(kgdb) p *kn
$1 = {
  kn_link = {
    sle_next = 0xdeadc0dedeadc0de
  },
  kn_selnext = {
    sle_next = 0xdeadc0dedeadc0de
  },
  kn_knlist = 0xdeadc0dedeadc0de,
  kn_tqe = {
    tqe_next = 0xdeadc0dedeadc0de,
    tqe_prev = 0xdeadc0dedeadc0de
  },
  kn_kq = 0xdeadc0dedeadc0de,
  kn_kevent = {
    ident = 16045693110842147038,
    filter = -16162,
    flags = 57005,
    fflags = 3735929054,
    data = -2401050962867404578,
    udata = 0xdeadc0dedeadc0de,
    ext = {16045693110842147038, 16045693110842147038, 16045693110842147038, 16045693110842147038}
  },
  kn_hook = 0xdeadc0dedeadc0de,
  kn_hookid = -559038242,
  kn_status = -559038242,
  kn_influx = -559038242,
  kn_sfflags = -559038242,
  kn_sdata = -2401050962867404578,
  kn_ptr = {
    p_fp = 0xdeadc0dedeadc0de,
    p_proc = 0xdeadc0dedeadc0de,
    p_aio = 0xdeadc0dedeadc0de,
    p_lio = 0xdeadc0dedeadc0de,
    p_v = 0xdeadc0dedeadc0de
  },
  kn_fop = 0xdeadc0dedeadc0de
}
(kgdb) p *kn->kn_kq
Cannot access memory at address 0xdeadc0dedeadc0de
#20 0xffffffff80b1dbd6 in closefp_impl (fdp=0xfffffe0713371430, fd=588439, fp=0xfffff86e9b7ee190, td=0xfffff803c20c3740, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320
1320			knote_fdclose(td, fd);
(kgdb) p *fp
$1 = {
  f_flag = 7,
  f_count = 1,
  f_data = 0xfffff82e0210c000,
  f_ops = 0xffffffff81436808 <socketops>,
  f_vnode = 0x0,
  f_cred = 0xfffff804daf23a00,
  f_type = 2,
  f_vflags = 0,
  {
    f_seqcount = {0, 0},
    f_pipegen = 0
  },
  f_nextoff = {0, 0},
  f_vnun = {
    fvn_cdevpriv = 0x0,
    fvn_advice = 0x0
  },
  f_offset = 0
}
(kgdb) p *fdp
$2 = {
  fd_files = 0xfffffe094f9fb000,
  fd_map = 0xfffffe094d255000,
  fd_freefile = 3,
  fd_refcnt = 1,
  fd_holdcnt = 1,
  fd_sx = {
    lock_object = {
      lo_name = 0xffffffff812b4244 "filedesc structure",
      lo_flags = 36896768,
      lo_data = 0,
      lo_witness = 0xfffff8804bd94380
    },
    sx_lock = 18446735293757011776
  },
  fd_kqlist = {
    tqh_first = 0xfffff8010c5ba200,
    tqh_last = 0xfffff80155c37728
  },
  fd_holdleaderscount = 0,
  fd_holdleaderswakeup = 0
}
(kgdb) fr 19
#19 knote_fdclose (td=td@entry=0xfffff803c20c3740, fd=fd@entry=588439) at /usr/src/sys/kern/kern_event.c:2703
2703				kn_enter_flux(kn);
(kgdb) p *kq
value has been optimized out
(kgdb) i r
rax            0xfffff803c20c3740  -8779952539840
rbx            0x8fa97             588439
rcx            0x121               289
rdx            0xffffffff811ab239  -2128956871
rsi            0x4                 4
rdi            0xdeadc0dedeadc0f6  -2401050962867404554
rbp            0xfffffe0718977dc0  0xfffffe0718977dc0
rsp            0xfffffe0718977d70  0xfffffe0718977d70
r8             0x1                 1
r9             0xffffffff81e1ec98  -2115900264
r10            0x0                 0
r11            0x4                 4
r12            0xfffff80155c37718  -8790359181544
r13            0xfffff819bc941960  -8685555017376
r14            0x8fa97             588439
r15            0xfffff80155c37700  -8790359181568
rip            0xffffffff80b25c8e  0xffffffff80b25c8e <knote_fdclose+286>
eflags         0x10297             [ CF PF AF SF IF RF ]
cs             0x20                32
ss             0x28                40
ds             <unavailable>
es             <unavailable>
fs             <unavailable>
gs             <unavailable>
fs_base        <unavailable>
gs_base        <unavailable>
(kgdb) p *((struct kqueue*)$r15)
$3 = {
  kq_lock = {
    lock_object = {
      lo_name = 0xffffffff812bbf6c "kqueue",
      lo_flags = 21168128,
      lo_data = 0,
      lo_witness = 0xfffff8804bd8da80
    },
    mtx_lock = 18446735293757011776
  },
  kq_refcnt = 1,
  kq_list = {
    tqe_next = 0x0,
    tqe_prev = 0xfffff80150e0d528
  },
  kq_head = {
    tqh_first = 0x0,
    tqh_last = 0xfffff80155c37738
  },
  kq_count = 0,
  kq_sel = {
    si_tdlist = {
      tqh_first = 0x0,
      tqh_last = 0x0
    },
    si_note = {
      kl_list = {
        slh_first = 0x0
      },
      kl_lock = 0xffffffff80b254e0 <knlist_mtx_lock>,
      kl_unlock = 0xffffffff80b25500 <knlist_mtx_unlock>,
      kl_assert_lock = 0xffffffff80b25520 <knlist_mtx_assert_lock>,
      kl_lockarg = 0xfffff80155c37700,
      kl_autodestroy = 0
    },
    si_mtx = 0x0
  },
  kq_sigio = 0x0,
  kq_fdp = 0xfffffe0713371430,
  kq_state = 2,
  kq_knlistsize = 680960,
  kq_knlist = 0xfffffe0987b7a000,
  kq_knhashmask = 0,
  kq_knhash = 0x0,
  kq_task = {
    ta_link = {
      stqe_next = 0x0
    },
    ta_pending = 0,
    ta_priority = 0 '\000',
    ta_flags = 0 '\000',
    ta_func = 0xffffffff80b26050 <kqueue_task>,
    ta_context = 0xfffff80155c37700
  },
  kq_cred = 0xfffff804daf23a00
}

Weirdest thing is (might this be a hint of a problem?) that in frame 19, `kn` points to some memory address that contains exactly the same, byte-by-byte content as in previous crash, seemingly a garbage. Is this some 'kernel constants' data segment, or is it expected and not a garbage?
Comment 20 Kyle Evans freebsd_committer freebsd_triage 2026-03-18 15:49:28 UTC
re: deadlock, I don't really understand this part of knote_fdclose: we set influx=0 at the beginning and only `wakuep(kq)` if we personally dropped a knote from this kq.  Isn't that just asking for a deadlock on the first knote?  Assuming we have something like:

Thread A: kn_enter_flux, drop the lock
Thread B: pick up the lock, find it in flux, enter FLUXWQAIT
Thread C: knote_fdclose(), acquire lock, also find it in flux, doesn't wakeup()

Looking closer, I can kind of see that the pattern should be that thread A will wakeup() after a kn_leave_flux(), but I think we only have one scenario where that's *not* the case: kqueue_fork_copy() -> kqueue_fork_copy_list() will transition a number of knotes and neither would issue a wakeup().  The final KQ_UNLOCK there in kqueue_fork_copy() should probably just assume that it transitioned at least one knote through a flux state and switch to KQ_UNLOCK_FLUX to be sure.

I think with those bases covered, influx + wakeup in knote_fdclose() should probably just go away.  knote_drop will already wakeup any waiters after its removed the knote it transitioned from the list- another wakeup seems spurious.
Comment 21 Konstantin Belousov freebsd_committer freebsd_triage 2026-03-19 02:52:56 UTC
(In reply to Paul from comment #19)
*kn is filled with '0xdeadc0de' which is the pattern for the freed memory in
debugging kernels.

In fact, my patch had a bug, it leaked the kq lock on each flux restart, but it seems
that you did not hit it, which is interesting.  I put the updated patch, with some
additional assert, at the end of my reply.  The assert might be more interesting
than the other part of the patch.

I do ask you to switch to HEAD with all debugging enabled.

(In reply to Kyle Evans from comment #20)
You might be quite right, and might be we should do what you suggest.
Still, it cannot be an issue that is reported there because copy on fork is
not in 14 (at least as far as I remember).

diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c
index e8e670d39d09..38928a68dd94 100644
--- a/sys/kern/kern_event.c
+++ b/sys/kern/kern_event.c
@@ -2830,19 +2830,20 @@ knote_fdclose(struct thread *td, int fd)
 	 * We shouldn't have to worry about new kevents appearing on fd
 	 * since filedesc is locked.
 	 */
+again:
 	TAILQ_FOREACH(kq, &fdp->fd_kqlist, kq_list) {
 		KQ_LOCK(kq);
-
-again:
 		influx = 0;
 		while (kq->kq_knlistsize > fd &&
 		    (kn = SLIST_FIRST(&kq->kq_knlist[fd])) != NULL) {
+			MPASS(kn->kn_kq == kq);
 			if (kn_in_flux(kn)) {
 				/* someone else might be waiting on our knote */
 				if (influx)
 					wakeup(kq);
 				kq->kq_state |= KQ_FLUXWAIT;
-				msleep(kq, &kq->kq_lock, PSOCK, "kqflxwt", 0);
+				msleep(kq, &kq->kq_lock, PSOCK | PDROP,
+				    "kqflxwt", 0);
 				goto again;
 			}
 			kn_enter_flux(kn);
Comment 22 Paul 2026-03-23 09:52:56 UTC
Hi,

Just in case, we have tested new patch with our current kernel version, and it panicked. So we've finally switched to HEAD, as suggested earlier (we understand that this simplifies things a lot). And basically the same thing happened there.

It's about the new assert, added in the latest patch.


Unread portion of the kernel message buffer:
panic: Assertion kn->kn_kq == kq failed at /usr/src/sys/kern/kern_event.c:2852
cpuid = 8
time = 1774258230
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0699ccbbd0
vpanic() at vpanic+0x136/frame 0xfffffe0699ccbd00
panic() at panic+0x43/frame 0xfffffe0699ccbd60
knote_fdclose() at knote_fdclose+0x236/frame 0xfffffe0699ccbdc0
closefp_impl() at closefp_impl+0xa8/frame 0xfffffe0699ccbe00
amd64_syscall() at amd64_syscall+0x169/frame 0xfffffe0699ccbf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0699ccbf30
--- syscall (6, FreeBSD ELF64, close), rip = 0x82d4a332a, rsp = 0x85dfa6b98, rbp = 0x85dfa6bb0 ---
KDB: enter: panic

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff804b60a8 in db_fncall_generic (nargs=0, args=0xfffffe0699ccb5f0, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:631
#3  db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:679
#4  0xffffffff804b5b2d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:508
#5  0xffffffff804b5c76 in db_command_script (command=command@entry=0xffffffff81bd7722 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:573
#6  0xffffffff804bba58 in db_script_exec (scriptname=scriptname@entry=0xfffffe0699ccb7c0 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:301
#7  0xffffffff804bb952 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:323
#8  0xffffffff804b91e1 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:266
#9  0xffffffff80c1ce5f in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe0699ccbb10) at /usr/src/sys/kern/subr_kdb.c:790
#10 0xffffffff8112a96d in trap (frame=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:675
#11 <signal handler called>
#12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556
#13 0xffffffff80bc9ddb in vpanic (fmt=0xffffffff812ec6bb "Assertion %s failed at %s:%d", ap=ap@entry=0xfffffe0699ccbd40) at /usr/src/sys/kern/kern_shutdown.c:962
#14 0xffffffff80bc9c43 in panic (fmt=0xffffffff81da2290 <cnputs_mtx> "\254\214!\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:887
#15 0xffffffff80b6bc76 in knote_fdclose (td=td@entry=0xff0100018d9b4000, fd=fd@entry=161249) at /usr/src/sys/kern/kern_event.c:2852
#16 0xffffffff80b63468 in closefp_impl (fdp=0xfffffe0693882000, fd=161249, fp=0xff010002dd9fb230, td=0xff0100018d9b4000, audit=true) at /usr/src/sys/kern/kern_descrip.c:1413
#17 0xffffffff8112b739 in syscallenter (td=0xff0100018d9b4000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#18 amd64_syscall (td=0xff0100018d9b4000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1244
#19 <signal handler called>
#20 0x000000082d4a332a in ?? ()
Backtrace stopped: Cannot access memory at address 0x85dfa6b98
(kgdb) fr 15
#15 0xffffffff80b6bc76 in knote_fdclose (td=td@entry=0xff0100018d9b4000, fd=fd@entry=161249) at /usr/src/sys/kern/kern_event.c:2852
2852				MPASS(kn->kn_kq == kq);
(kgdb) p *kn
$1 = {
  kn_link = {
    sle_next = 0xdeadc0dedeadc0de
  },
  kn_selnext = {
    sle_next = 0xdeadc0dedeadc0de
  },
  kn_knlist = 0xdeadc0dedeadc0de,
  kn_tqe = {
    tqe_next = 0xdeadc0dedeadc0de,
    tqe_prev = 0xdeadc0dedeadc0de
  },
  kn_kq = 0xdeadc0dedeadc0de,
  kn_kevent = {
    ident = 16045693110842147038,
    filter = -16162,
    flags = 57005,
    fflags = 3735929054,
    data = -2401050962867404578,
    udata = 0xdeadc0dedeadc0de,
    ext = {16045693110842147038, 16045693110842147038, 16045693110842147038, 16045693110842147038}
  },
  kn_hook = 0xdeadc0dedeadc0de,
  kn_hookid = -559038242,
  kn_status = -559038242,
  kn_influx = -559038242,
  kn_sfflags = 3735929054,
  kn_sdata = -2401050962867404578,
  kn_ptr = {
    p_fp = 0xdeadc0dedeadc0de,
    p_proc = 0xdeadc0dedeadc0de,
    p_aio = 0xdeadc0dedeadc0de,
    p_lio = 0xdeadc0dedeadc0de,
    p_prison = 0xdeadc0dedeadc0de,
    p_v = 0xdeadc0dedeadc0de
  },
  kn_fop = 0xdeadc0dedeadc0de
}
(kgdb) p *kq
value has been optimized out
(kgdb) i r
rax            0x12                18
rbx            0x275e1             161249
rcx            0xba5f4feebeda7d64  -5017203573044642460
rdx            0xffffffff813451fb  -2127277573
rsi            0xfffffe0699ccba90  -2170673120624
rdi            0xffffffff81da2290  -2116410736
rbp            0xfffffe0699ccbdc0  0xfffffe0699ccbdc0
rsp            0xfffffe0699ccbd70  0xfffffe0699ccbd70
r8             0x12                18
r9             0x20                32
r10            0x0                 0
r11            0x0                 0
r12            0xff010001bdd19b18  -71776111581619432
r13            0xff0100488988c0a0  -71775807516131168
r14            0x275e1             161249
r15            0xff010001bdd19b00  -71776111581619456
rip            0xffffffff80b6bc76  0xffffffff80b6bc76 <knote_fdclose+566>
eflags         0x86                [ PF SF ]
cs             0x20                32
ss             0x28                40
ds             <unavailable>
es             <unavailable>
fs             <unavailable>
gs             <unavailable>
fs_base        <unavailable>
gs_base        <unavailable>
(kgdb) p *((struct kqueue*)$r15)
$2 = {
  kq_lock = {
    lock_object = {
      lo_name = 0xffffffff8133f15f "kqueue",
      lo_flags = 21168128,
      lo_data = 0,
      lo_witness = 0xff0100804bd8db80
    },
    mtx_lock = 18374967961319063552
  },
  kq_refcnt = 0,
  kq_list = {
    tqe_next = 0xff0100014c3afe00,
    tqe_prev = 0xff010001075a7528
  },
  kq_head = {
    tqh_first = 0x0,
    tqh_last = 0xff010001bdd19b38
  },
  kq_count = 0,
  kq_sel = {
    si_tdlist = {
      tqh_first = 0x0,
      tqh_last = 0x0
    },
    si_note = {
      kl_list = {
        slh_first = 0x0
      },
      kl_lock = 0xffffffff80b6b3a0 <knlist_mtx_lock>,
      kl_unlock = 0xffffffff80b6b3c0 <knlist_mtx_unlock>,
      kl_assert_lock = 0xffffffff80b6b3e0 <knlist_mtx_assert_lock>,
      kl_lockarg = 0xff010001bdd19b00,
      kl_autodestroy = 0
    },
    si_mtx = 0x0
  },
  kq_sigio = 0x0,
  kq_fdp = 0xfffffe0693882000,
  kq_state = 0,
  kq_knlistsize = 695296,
  kq_knlist = 0xfffffe0a76665000,
  kq_knhashmask = 0,
  kq_knhash = 0x0,
  kq_task = {
    ta_link = {
      stqe_next = 0x0
    },
    ta_pending = 0,
    ta_priority = 0 '\000',
    ta_flags = 0 '\000',
    ta_func = 0xffffffff80b6db40 <kqueue_task>,
    ta_context = 0xff010001bdd19b00
  },
  kq_cred = 0xff01000107bc5780,
  kq_forksrc = 0x0
}


Please, tell us if you need anything else.
Comment 23 Konstantin Belousov freebsd_committer freebsd_triage 2026-03-25 01:51:28 UTC
My belief is that kn is freed somewhere, which explains 0xdeadc0de.  I cannot easily
guess where such thing could happen.  Lets try the brute force approach then.

Below is the patch, to be applied on top of my previous patch.  It is debug only.
When the panic due to the new assert (kn->kn_kq != kq) occurs, the kn var should
point to struct eknote instead of knote.  I am interested in the *eknote content,
and in the lookup of the source lines for the captured stack trace, which should
point to the place where the free of kn occured.

I do not have better idea how to proceed ATM.

commit 6536e2123f1249a3e8fa8e4895b01abc95381633
Author: Konstantin Belousov <kib@FreeBSD.org>
Date:   Wed Mar 25 03:47:59 2026 +0200

    debug

diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c
index 38928a68dd94..70f9ba07716d 100644
--- a/sys/kern/kern_event.c
+++ b/sys/kern/kern_event.c
@@ -230,6 +230,13 @@ static const struct filterops user_filtops = {
 	.f_copy = knote_triv_copy,
 };
 
+#include <sys/stack.h>
+struct eknote {
+	struct knote k;
+	struct knote c;
+	struct stack s;
+};
+
 static uma_zone_t	knote_zone;
 static unsigned int __exclusive_cache_line	kq_ncallouts;
 static unsigned int 	kq_calloutmax = 4 * 1024;
@@ -2962,8 +2969,8 @@ static void
 knote_init(void *dummy __unused)
 {
 
-	knote_zone = uma_zcreate("KNOTE", sizeof(struct knote), NULL, NULL,
-	    NULL, NULL, UMA_ALIGN_PTR, 0);
+	knote_zone = uma_zcreate("KNOTE", sizeof(struct eknote), NULL, NULL,
+	    NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE);
 	ast_register(TDA_KQUEUE, ASTR_ASTF_REQUIRED, 0, ast_kqueue);
 	prison0.pr_klist = knlist_alloc(&prison0.pr_mtx);
 }
@@ -2972,15 +2979,22 @@ SYSINIT(knote, SI_SUB_PSEUDO, SI_ORDER_ANY, knote_init, NULL);
 static struct knote *
 knote_alloc(int mflag)
 {
+	struct eknote *e;
 
-	return (uma_zalloc(knote_zone, mflag | M_ZERO));
+	e = uma_zalloc(knote_zone, mflag | M_ZERO);
+	return (&e->k);
 }
 
 static void
 knote_free(struct knote *kn)
 {
+	struct eknote *e;
 
-	uma_zfree(knote_zone, kn);
+	e = __containerof(kn, struct eknote, k);
+	e->c = e->k;
+	stack_save(&e->s);
+	memset(&e->k, 0xdeadc0de, sizeof(e->k));
+	uma_zfree(knote_zone, e);
 }
 
 /*
Comment 24 Paul 2026-03-25 12:02:25 UTC
Hi, 

Thanks! There was a small issue: NULL can actually be passed to `knote_free()`. So we have adapted the patch a bit. Also, we've never discarded all previous patches. So, to be of the same page, here's the final version that we're testing now, on the HEAD:

diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c
index e8e670d39d09..bbb09789e41f 100644
--- a/sys/kern/kern_event.c
+++ b/sys/kern/kern_event.c
@@ -28,7 +28,6 @@
  * SUCH DAMAGE.
  */
 
-#include <sys/cdefs.h>
 #include "opt_ktrace.h"
 #include "opt_kqueue.h"
 
@@ -230,6 +229,13 @@ static const struct filterops user_filtops = {
 	.f_copy = knote_triv_copy,
 };
 
+#include <sys/stack.h>
+struct eknote {
+	struct knote k;
+	struct knote c;
+	struct stack s;
+};
+
 static uma_zone_t	knote_zone;
 static unsigned int __exclusive_cache_line	kq_ncallouts;
 static unsigned int 	kq_calloutmax = 4 * 1024;
@@ -1803,6 +1809,19 @@ kqueue_register(struct kqueue *kq, struct kevent *kev, struct thread *td,
 				error = ENOMEM;
 				goto done;
 			}
+
+    			/*
+	    		* Now that the kqueue is locked, make sure the fd
+	    		* didn't change out from under us.
+	    		*/
+			if (fops->f_isfd &&
+			    fget_noref_unlocked(td->td_proc->p_fd,
+	    		    kev->ident) != fp) {
+			    KQ_UNLOCK(kq);
+			    tkn = kn;
+			    error = EBADF;
+			    goto done;
+			}
 			kn->kn_fp = fp;
 			kn->kn_kq = kq;
 			kn->kn_fop = fops;
@@ -2830,19 +2849,21 @@ knote_fdclose(struct thread *td, int fd)
 	 * We shouldn't have to worry about new kevents appearing on fd
 	 * since filedesc is locked.
 	 */
+again:
 	TAILQ_FOREACH(kq, &fdp->fd_kqlist, kq_list) {
 		KQ_LOCK(kq);
 
-again:
 		influx = 0;
 		while (kq->kq_knlistsize > fd &&
 		    (kn = SLIST_FIRST(&kq->kq_knlist[fd])) != NULL) {
+			MPASS(kn->kn_kq == kq);
 			if (kn_in_flux(kn)) {
 				/* someone else might be waiting on our knote */
 				if (influx)
 					wakeup(kq);
 				kq->kq_state |= KQ_FLUXWAIT;
-				msleep(kq, &kq->kq_lock, PSOCK, "kqflxwt", 0);
+				msleep(kq, &kq->kq_lock, PSOCK | PDROP,
+				    "kqflxwt", 0);
 				goto again;
 			}
 			kn_enter_flux(kn);
@@ -2961,8 +2982,8 @@ static void
 knote_init(void *dummy __unused)
 {
 
-	knote_zone = uma_zcreate("KNOTE", sizeof(struct knote), NULL, NULL,
-	    NULL, NULL, UMA_ALIGN_PTR, 0);
+	knote_zone = uma_zcreate("KNOTE", sizeof(struct eknote), NULL, NULL,
+	    NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE);
 	ast_register(TDA_KQUEUE, ASTR_ASTF_REQUIRED, 0, ast_kqueue);
 	prison0.pr_klist = knlist_alloc(&prison0.pr_mtx);
 }
@@ -2971,15 +2992,24 @@ SYSINIT(knote, SI_SUB_PSEUDO, SI_ORDER_ANY, knote_init, NULL);
 static struct knote *
 knote_alloc(int mflag)
 {
+	struct eknote *e;
 
-	return (uma_zalloc(knote_zone, mflag | M_ZERO));
+	e = uma_zalloc(knote_zone, mflag | M_ZERO);
+	return (&e->k);
 }
 
 static void
 knote_free(struct knote *kn)
 {
+	if (kn) {
+		struct eknote *e;
 
-	uma_zfree(knote_zone, kn);
+		e = __containerof(kn, struct eknote, k);
+		e->c = e->k;
+		stack_save(&e->s);
+		memset(&e->k, 0xdeadc0de, sizeof(e->k));
+		uma_zfree(knote_zone, e);
+	}
 }
 
 /*
diff --git a/sys/sys/filedesc.h b/sys/sys/filedesc.h
index 4817855443af..9ea5239afae7 100644
--- a/sys/sys/filedesc.h
+++ b/sys/sys/filedesc.h
@@ -212,6 +212,8 @@ struct filedesc_to_leader {
 
 #ifdef _KERNEL
 
+#include <machine/atomic.h>
+
 /* Operation types for kern_dup(). */
 enum {
 	FDDUP_NORMAL,		/* dup() behavior. */
@@ -303,6 +305,21 @@ int	fget_only_user(struct filedesc *fdp, int fd,
 	MPASS(refcount_load(&fp->f_count) > 0);				\
 })
 
+/*
+* Look up a file description without requiring a lock.  In general the result
+ * may be immediately invalidated after the function returns, the caller must
+ * handle this.
+ */
+static inline struct file *
+fget_noref_unlocked(struct filedesc *fdp, int fd)
+{
+    if (__predict_false(
+        (u_int)fd >= (u_int)atomic_load_int(&fdp->fd_nfiles)))
+	return (NULL);
+
+    return (atomic_load_ptr(&fdp->fd_ofiles[fd].fde_file));
+}
+
 /* Requires a FILEDESC_{S,X}LOCK held and returns without a ref. */
 static __inline struct file *
 fget_noref(struct filedesc *fdp, int fd)
Comment 25 Konstantin Belousov freebsd_committer freebsd_triage 2026-03-25 22:03:07 UTC
(In reply to Paul from comment #24)
Yes indeed free must handle NULL arg.

BTW, Mark, why not commit your change already?
Comment 26 commit-hook freebsd_committer freebsd_triage 2026-03-27 00:31:02 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=8f3227f527567aef53da845ab78da8e16d9051c1

commit 8f3227f527567aef53da845ab78da8e16d9051c1
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2026-03-27 00:24:18 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2026-03-27 00:24:18 +0000

    kqueue: Fix a race when adding an fd-based knote to a queue

    When registering a new kevent backed by a file descriptor, we first look
    up the file description with fget(), then lock the kqueue, then see if a
    corresponding knote is already registered.  If not, and KN_ADD is
    specified, we add the knote to the kqueue.

    closefp_impl() interlocks with this process by calling knote_fdclose(),
    which locks each kqueue and checks to see if the fd is registered with a
    knote.  But, if userspace closes an fd while a different thread is
    registering it, i.e., after fget() succeeds but before the kqueue is
    locked, then we may end up with a mismatch in the knote table, where the
    knote kn_fp field points to a different file description than the knote
    ident.

    Fix the problem by double-checking before registering a knote.  Add a
    new fget_noref_unlocked() helper for this purpose.  It is a clone of
    fget_noref().  We could simply use fget_noref(), but I like having an
    explicit unlocked variant.

    PR:             293382
    Reviewed by:    kib
    MFC after:      2 weeks
    Differential Revision:  https://reviews.freebsd.org/D55852

 sys/kern/kern_event.c | 14 +++++++++++++-
 sys/sys/filedesc.h    | 17 +++++++++++++++++
 2 files changed, 30 insertions(+), 1 deletion(-)
Comment 27 Paul 2026-03-30 08:30:38 UTC
Hi!

It seems like latest changes have reduced the likelihood of the issue. My guess is: we are dealing with the use-after-free race and just by adding some additional (slow? like 'stack_save()'?) code to the 'free' function, we made it much, much less likely to occur.

But anyway, it finally happened.


Unread portion of the kernel message buffer:
panic: Assertion kn->kn_kq == kq failed at /usr/src/sys/kern/kern_event.c:2859
cpuid = 3
time = 1774857902
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe069abdfbd0
vpanic() at vpanic+0x136/frame 0xfffffe069abdfd00
panic() at panic+0x43/frame 0xfffffe069abdfd60
knote_fdclose() at knote_fdclose+0x236/frame 0xfffffe069abdfdc0
closefp_impl() at closefp_impl+0xa8/frame 0xfffffe069abdfe00
amd64_syscall() at amd64_syscall+0x169/frame 0xfffffe069abdff30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe069abdff30
--- syscall (6, FreeBSD ELF64, close), rip = 0x82d5be32a, rsp = 0x85b519b98, rbp = 0x85b519bb0 ---
KDB: enter: panic


(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff804b60a8 in db_fncall_generic (nargs=0, args=0xfffffe069abdf5f0, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:631
#3  db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:679
#4  0xffffffff804b5b2d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:508
#5  0xffffffff804b5c76 in db_command_script (command=command@entry=0xffffffff81bd7722 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:573
#6  0xffffffff804bba58 in db_script_exec (scriptname=scriptname@entry=0xfffffe069abdf7c0 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:301
#7  0xffffffff804bb952 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:323
#8  0xffffffff804b91e1 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:266
#9  0xffffffff80c1d01f in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe069abdfb10) at /usr/src/sys/kern/subr_kdb.c:790
#10 0xffffffff8112a96d in trap (frame=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:675
#11 <signal handler called>
#12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556
#13 0xffffffff80bc9f9b in vpanic (fmt=0xffffffff812ec6bb "Assertion %s failed at %s:%d", ap=ap@entry=0xfffffe069abdfd40) at /usr/src/sys/kern/kern_shutdown.c:962
#14 0xffffffff80bc9e03 in panic (fmt=0xffffffff81da2290 <cnputs_mtx> "\254\214!\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:887
#15 0xffffffff80b6bd26 in knote_fdclose (td=td@entry=0xff0100026dc5d780, fd=fd@entry=544408) at /usr/src/sys/kern/kern_event.c:2859
#16 0xffffffff80b63468 in closefp_impl (fdp=0xfffffe02c31fec90, fd=544408, fp=0xff01002710a27640, td=0xff0100026dc5d780, audit=true) at /usr/src/sys/kern/kern_descrip.c:1413
#17 0xffffffff8112b739 in syscallenter (td=0xff0100026dc5d780) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#18 amd64_syscall (td=0xff0100026dc5d780, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1244
#19 <signal handler called>
#20 0x000000082d5be32a in ?? ()
Backtrace stopped: Cannot access memory at address 0x85b519b98
(kgdb) fr 15
#15 0xffffffff80b6bd26 in knote_fdclose (td=td@entry=0xff0100026dc5d780, fd=fd@entry=544408) at /usr/src/sys/kern/kern_event.c:2859
2859				MPASS(kn->kn_kq == kq);
(kgdb) p *((struct kqueue*)$r15)
$11 = {
  kq_lock = {
    lock_object = {
      lo_name = 0xffffffff8133f15f "kqueue",
      lo_flags = 21168128,
      lo_data = 0,
      lo_witness = 0xff0100804bd8db80
    },
    mtx_lock = 18374967965079951232
  },
  kq_refcnt = 0,
  kq_list = {
    tqe_next = 0xff0100010ac21b00,
    tqe_prev = 0xff010002a0adb128
  },
  kq_head = {
    tqh_first = 0x0,
    tqh_last = 0xff0100010aac4b38
  },
  kq_count = 0,
  kq_sel = {
    si_tdlist = {
      tqh_first = 0x0,
      tqh_last = 0x0
    },
    si_note = {
      kl_list = {
        slh_first = 0x0
      },
      kl_lock = 0xffffffff80b6b420 <knlist_mtx_lock>,
      kl_unlock = 0xffffffff80b6b440 <knlist_mtx_unlock>,
      kl_assert_lock = 0xffffffff80b6b460 <knlist_mtx_assert_lock>,
      kl_lockarg = 0xff0100010aac4b00,
      kl_autodestroy = 0
    },
    si_mtx = 0x0
  },
  kq_sigio = 0x0,
  kq_fdp = 0xfffffe02c31fec90,
  kq_state = 0,
  kq_knlistsize = 685056,
  kq_knlist = 0xfffffe0bbf526000,
  kq_knhashmask = 0,
  kq_knhash = 0x0,
  kq_task = {
    ta_link = {
      stqe_next = 0x0
    },
    ta_pending = 0,
    ta_priority = 0 '\000',
    ta_flags = 0 '\000',
    ta_func = 0xffffffff80b6dd00 <kqueue_task>,
    ta_context = 0xff0100010aac4b00
  },
  kq_cred = 0xff0100010a9eb600,
  kq_forksrc = 0x0
}

(kgdb) p *((struct eknote*)kn)
$3 = {
  k = {
    kn_link = {
      sle_next = 0xdededededededede
    },
    kn_selnext = {
      sle_next = 0xdededededededede
    },
    kn_knlist = 0xdededededededede,
    kn_tqe = {
      tqe_next = 0xdededededededede,
      tqe_prev = 0xdededededededede
    },
    kn_kq = 0xdededededededede,
    kn_kevent = {
      ident = 16059518370053021406,
      filter = -8482,
      flags = 57054,
      fflags = 3739147998,
      data = -2387225703656530210,
      udata = 0xdededededededede,
      ext = {16059518370053021406, 16059518370053021406, 16059518370053021406, 16059518370053021406}
    },
    kn_hook = 0xdededededededede,
    kn_hookid = -555819298,
    kn_status = -555819298,
    kn_influx = -555819298,
    kn_sfflags = 3739147998,
    kn_sdata = -2387225703656530210,
    kn_ptr = {
      p_fp = 0xdededededededede,
      p_proc = 0xdededededededede,
      p_aio = 0xdededededededede,
      p_lio = 0xdededededededede,
      p_prison = 0xdededededededede,
      p_v = 0xdededededededede
    },
    kn_fop = 0xdededededededede
  },
  c = {
    kn_link = {
      sle_next = 0x0
    },
    kn_selnext = {
      sle_next = 0xffffffffffffffff
    },
    kn_knlist = 0x0,
    kn_tqe = {
      tqe_next = 0xffffffffffffffff,
      tqe_prev = 0xffffffffffffffff
    },
    kn_kq = 0xff0100010aac4b00,
    kn_kevent = {
      ident = 138904,
      filter = -1,
      flags = 32,
      fflags = 0,
      data = 0,
      udata = 0xdda7cbe67c0,
      ext = {0, 0, 0, 0}
    },
    kn_hook = 0x0,
    kn_hookid = 0,
    kn_status = 8,
    kn_influx = 1,
    kn_sfflags = 0,
    kn_sdata = 0,
    kn_ptr = {
      p_fp = 0x0,
      p_proc = 0x0,
      p_aio = 0x0,
      p_lio = 0x0,
      p_prison = 0x0,
      p_v = 0x0
    },
    kn_fop = 0x0
  },
  s = {
    depth = 5,
    pcs = {18446744071574043258, 18446744071574043755, 18446744071574008936, 18446744071580071737, 18446744071579870491, 0 <repeats 13 times>}
  }
}
(kgdb) p (void*)18446744071574043258
$4 = (void *) 0xffffffff80b6ba7a <knote_drop_detached+554>
(kgdb) p (void*)18446744071574043755
$5 = (void *) 0xffffffff80b6bc6b <knote_fdclose+379>
(kgdb) p (void*)18446744071574008936
$6 = (void *) 0xffffffff80b63468 <closefp_impl+168>
(kgdb) p (void*)18446744071580071737
$7 = (void *) 0xffffffff8112b739 <amd64_syscall+361>
(kgdb) p (void*)18446744071579870491
$8 = (void *) 0xffffffff810fa51b <fast_syscall_common+248>


Please, tell us if you need anything else.
Comment 28 Paul 2026-03-30 09:25:24 UTC
Figured you might also want this.


(kgdb) p (void*)$r15
$2 = (void *) 0xff0100010aac4b00
(kgdb) fr 16
#16 0xffffffff80b63468 in closefp_impl (fdp=0xfffffe02c31fec90, fd=544408, fp=0xff01002710a27640, td=0xff0100026dc5d780, audit=true) at /usr/src/sys/kern/kern_descrip.c:1413
1413			knote_fdclose(td, fd);
(kgdb) p *fp
$5 = {
  f_flag = 7,
  f_count = 1,
  f_data = 0xff01005c9babe000,
  f_ops = 0xffffffff814d4aa0 <socketops>,
  f_vnode = 0x0,
  f_cred = 0xff0100010a9eb600,
  f_type = 2,
  f_vflags = 0,
  {
    f_seqcount = {0, 0},
    f_pipegen = 0
  },
  f_nextoff = {0, 0},
  f_vnun = {
    fvn_cdevpriv = 0x0,
    fvn_advice = 0x0
  },
  f_offset = 0
}
Comment 29 Konstantin Belousov freebsd_committer freebsd_triage 2026-03-30 12:13:32 UTC
(In reply to Paul from comment #28)
The saved knote has ident = 138904, while stack trace shows fd=544408.
This should not happen, the file descriptor must be same as the knote which
we find on the descriptor-indexed list.

Suppose that the debugging info is correct.  This means that the knote existed
on the list for both 544408 and 138904?  This could be if e.g. both heads pointed
to the knote.  Then if fd 138904 got the knote already removed or even not found
on that list (as recorded by the stack trace), it is freed, but the pointer
on the list 544408 is dandling. Then we get the situation in the panic.

If the speculation above holds, then somehow we manage to insert same knote
into two different fd's lists.  Again, I do not understand how could this happen.
I modified the debugging patch to check just that, that knotes are never inserted
twice into kn_link.

Could you run HEAD with only the patch below applied, please?  No need to apply
any of my previous patches, but do update to the latest HEAD since Mark already
committed his fix.  I am interested if you hit the added panics.

commit 8b90be098a85975dc437c53294987c0d7106fcf9
Author: Konstantin Belousov <kib@FreeBSD.org>
Date:   Wed Mar 25 03:47:59 2026 +0200

    debug

diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c
index 1a719f7c945c..7d99f00eeb8c 100644
--- a/sys/kern/kern_event.c
+++ b/sys/kern/kern_event.c
@@ -229,6 +229,12 @@ static const struct filterops user_filtops = {
 	.f_copy = knote_triv_copy,
 };
 
+struct eknote {
+	struct knote k;
+	struct knote c;
+	int on_kn_link;
+};
+
 static uma_zone_t	knote_zone;
 static unsigned int __exclusive_cache_line	kq_ncallouts;
 static unsigned int 	kq_calloutmax = 4 * 1024;
@@ -2887,6 +2893,9 @@ knote_attach(struct knote *kn, struct kqueue *kq)
 			return (ENOMEM);
 		list = &kq->kq_knhash[KN_HASH(kn->kn_id, kq->kq_knhashmask)];
 	}
+	if (atomic_fetchadd_int(&__containerof(kn, struct eknote,
+	    k)->on_kn_link, 1) != 0)
+		panic("Double-linking knote %p", kn);
 	SLIST_INSERT_HEAD(list, kn, kn_link);
 	return (0);
 }
@@ -2928,8 +2937,11 @@ knote_drop_detached(struct knote *kn, struct thread *td)
 	else
 		list = &kq->kq_knhash[KN_HASH(kn->kn_id, kq->kq_knhashmask)];
 
-	if (!SLIST_EMPTY(list))
-		SLIST_REMOVE(list, kn, knote, kn_link);
+	if (atomic_fetchadd_int(&__containerof(kn, struct eknote,
+	    k)->on_kn_link, -1) != 1)
+		panic("removal of double-linked knote %p", kn);
+	SLIST_REMOVE(list, kn, knote, kn_link);
+
 	if (kn->kn_status & KN_QUEUED)
 		knote_dequeue(kn);
 	KQ_UNLOCK_FLUX(kq);
@@ -2974,8 +2986,8 @@ static void
 knote_init(void *dummy __unused)
 {
 
-	knote_zone = uma_zcreate("KNOTE", sizeof(struct knote), NULL, NULL,
-	    NULL, NULL, UMA_ALIGN_PTR, 0);
+	knote_zone = uma_zcreate("KNOTE", sizeof(struct eknote), NULL, NULL,
+	    NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE);
 	ast_register(TDA_KQUEUE, ASTR_ASTF_REQUIRED, 0, ast_kqueue);
 	prison0.pr_klist = knlist_alloc(&prison0.pr_mtx);
 }
@@ -2984,15 +2996,24 @@ SYSINIT(knote, SI_SUB_PSEUDO, SI_ORDER_ANY, knote_init, NULL);
 static struct knote *
 knote_alloc(int mflag)
 {
+	struct eknote *e;
 
-	return (uma_zalloc(knote_zone, mflag | M_ZERO));
+	e = uma_zalloc(knote_zone, mflag);
+	memset(e, 0, sizeof(*e));
+	return (&e->k);
 }
 
 static void
 knote_free(struct knote *kn)
 {
+	struct eknote *e;
 
-	uma_zfree(knote_zone, kn);
+	if (kn == NULL)
+		return;
+	e = __containerof(kn, struct eknote, k);
+	e->c = e->k;
+	memset(&e->k, 0xdeadc0de, sizeof(e->k));
+	uma_zfree(knote_zone, e);
 }
 
 /*
Comment 30 Paul 2026-03-30 13:04:25 UTC
Thank you for a prompt reply! We are definitely not tampering with the core dump. And only using the HEAD + patch. No custom kernel modules, nothing like that.

Curiously, may be just a coincidence, but I've noticed that 544408 - 138904 = 405504, which is a multiple of 4096. Specifically 99 * 4096. Could it be that this is due to some hash-table optimization and them ending up in the same bucket? This may explain why seemingly unrelated fds end up colliding somehow. Though, may just be a coincidence. And 4096 seems a plausible 'magick number' to be used somewhere. Like in a simplified hash function (remainder of division) or just a result of an exponential HT growth with a factor of 2.

Regardless, we are now switching to HEAD with only the latest patch which you provided. Thanks!
Comment 31 Konstantin Belousov freebsd_committer freebsd_triage 2026-03-30 13:16:41 UTC
(In reply to Paul from comment #30)
By 'trusting the debug info' I mean that backtraces are known to have issues
with calculating the correct values for the function's arguments and locals.
The matter improved recently with more correct debuginfo in clang, but it is
still sometimes wrong.  So the fd value for close from the backtrace is hopefully
right, which is the basis for my theory, but it is not guaranteed.

WRT the multiple of the page size distance between two fds, I am not sure if
this is of any significance.  For fd-based knotes, in your case, read filter
on socket, no hashing is done.  The kqueues allocates a large array of single-linked
list heads, one for each fd index. If fd table is expanded and a knote is allocated
for fd that is larger than the kqueue array of heads, the array is reallocated.
This is done under the kq lock,  so hopefully is not an issue.
Comment 32 Paul 2026-03-30 13:22:46 UTC
Just to clarify, due to features of modular arithmetic:

  544408 mod 2^n == 138904 mod 2^n

  for n in [0, 12]

So this might as well be a much greater coincidence (them having same modulo up to 2^12) and the real magic number here is smaller than 4096, like 1024, or:

#define   KN_HASHSIZE             64              /* XXX should be tunable */
Comment 33 Konstantin Belousov freebsd_committer freebsd_triage 2026-03-31 06:44:19 UTC
(In reply to Paul from comment #32)
As I said above, filedescriptor-based events are not hashed.  KN_HASH() is only
relevant for knotes that use something else than files for their idents.

So unless there is somewhere an awful mishap of mixing files- and non-files-knotes,
I do not think that KN_HASH() is at play there.
Comment 34 Paul 2026-03-31 08:31:45 UTC
Hi,

It crashed again. Though in a different place now. But that's because the old place (`MPASS(kn->kn_kq == kq);`) is no longer present in the latest patch.

We are currently using e220af9cee74082841db9e80cbb73f570f3a570f + the latest patch.


Unread portion of the kernel message buffer:
panic: mutex kqueue not owned at /usr/src/sys/kern/kern_event.c:316
cpuid = 30
time = 1774943888
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe069b454810
vpanic() at vpanic+0x136/frame 0xfffffe069b454940
panic() at panic+0x43/frame 0xfffffe069b4549a0
__mtx_assert() at __mtx_assert+0xa9/frame 0xfffffe069b4549b0
kqueue_register() at kqueue_register+0x82a/frame 0xfffffe069b454a30
kqueue_kevent() at kqueue_kevent+0xc9/frame 0xfffffe069b454c90
kern_kevent_fp() at kern_kevent_fp+0x9b/frame 0xfffffe069b454ce0
kern_kevent() at kern_kevent+0x82/frame 0xfffffe069b454d40
kern_kevent_generic() at kern_kevent_generic+0x70/frame 0xfffffe069b454da0
sys_kevent() at sys_kevent+0x61/frame 0xfffffe069b454e00
amd64_syscall() at amd64_syscall+0x169/frame 0xfffffe069b454f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe069b454f30
--- syscall (560, FreeBSD ELF64, kevent), rip = 0x82cf723ea, rsp = 0x858888b18, rbp = 0x858888c00 ---
KDB: enter: panic

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff804b60a8 in db_fncall_generic (nargs=0, args=0xfffffe069b454230, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:631
#3  db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:679
#4  0xffffffff804b5b2d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:508
#5  0xffffffff804b5c76 in db_command_script (command=command@entry=0xffffffff81bd7722 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:573
#6  0xffffffff804bba58 in db_script_exec (scriptname=scriptname@entry=0xfffffe069b454400 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:301
#7  0xffffffff804bb952 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:323
#8  0xffffffff804b91e1 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:266
#9  0xffffffff80c23aef in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe069b454750) at /usr/src/sys/kern/subr_kdb.c:790
#10 0xffffffff811318fd in trap (frame=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:697
#11 <signal handler called>
#12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556
#13 0xffffffff80bd0a6b in vpanic (fmt=0xffffffff812def18 "mutex %s not owned at %s:%d", ap=ap@entry=0xfffffe069b454980) at /usr/src/sys/kern/kern_shutdown.c:962
#14 0xffffffff80bd08d3 in panic (fmt=0xffffffff81da22a0 <cnputs_mtx> "\325\376!\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:887
#15 0xffffffff80ba9729 in __mtx_assert (c=<optimized out>, what=what@entry=4, file=0xffffffff8134c5ca "/usr/src/sys/kern/kern_cons.c", line=-1129344347, line@entry=316) at /usr/src/sys/kern/kern_mutex.c:1108
#16 0xffffffff80b7004a in kn_enter_flux (kn=0xff0100303150bcd0) at /usr/src/sys/kern/kern_event.c:316
#17 kqueue_register (kq=kq@entry=0xff010002208b5b00, kev=kev@entry=0xfffffe069b454a40, td=td@entry=0xff010001e7d30000, mflag=mflag@entry=2) at /usr/src/sys/kern/kern_event.c:1868
#18 0xffffffff80b71239 in kqueue_kevent (kq=kq@entry=0xff010002208b5b00, td=td@entry=0xff010001e7d30000, nchanges=nchanges@entry=1, nevents=nevents@entry=0, k_ops=k_ops@entry=0xfffffe069b454de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1515
#19 0xffffffff80b710fb in kern_kevent_fp (td=td@entry=0xff010001e7d30000, fp=<optimized out>, nchanges=nchanges@entry=1, nevents=nevents@entry=0, k_ops=k_ops@entry=0xfffffe069b454de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1546
#20 0xffffffff80b71012 in kern_kevent (td=td@entry=0xff010001e7d30000, fd=<optimized out>, nchanges=1, nevents=0, k_ops=k_ops@entry=0xfffffe069b454de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1486
#21 0xffffffff80b70d10 in kern_kevent_generic (td=0xff010001e7d30000, uap=uap@entry=0xfffffe069b454db0, k_ops=k_ops@entry=0xfffffe069b454de0, struct_name=0xffffffff81308166 "kevent") at /usr/src/sys/kern/kern_event.c:1342
#22 0xffffffff80b70c01 in sys_kevent (td=0xffffffff81da22a0 <cnputs_mtx>, uap=<optimized out>) at /usr/src/sys/kern/kern_event.c:1315
#23 0xffffffff81132739 in syscallenter (td=0xff010001e7d30000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#24 amd64_syscall (td=0xff010001e7d30000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1267
#25 <signal handler called>
#26 0x000000082cf723ea in ?? ()
Backtrace stopped: Cannot access memory at address 0x858888b18

(kgdb) fr 16
#16 0xffffffff80b7004a in kn_enter_flux (kn=0xff0100303150bcd0) at /usr/src/sys/kern/kern_event.c:316
316		KQ_OWNED(kn->kn_kq);
(kgdb) p *((struct eknote*)kn)
$1 = {
  k = {
    kn_link = {
      sle_next = 0x0
    },
    kn_selnext = {
      sle_next = 0x0
    },
    kn_knlist = 0xff010011b7fb0838,
    kn_tqe = {
      tqe_next = 0xffffffffffffffff,
      tqe_prev = 0xffffffffffffffff
    },
    kn_kq = 0xff0100021013b000,
    kn_kevent = {
      ident = 298277,
      filter = -1,
      flags = 32,
      fflags = 0,
      data = 0,
      udata = 0x35934102e9d0,
      ext = {0, 0, 0, 0}
    },
    kn_hook = 0x0,
    kn_hookid = 0,
    kn_status = 0,
    kn_influx = 0,
    kn_sfflags = 0,
    kn_sdata = 0,
    kn_ptr = {
      p_fp = 0xff01005c34494280,
      p_proc = 0xff01005c34494280,
      p_aio = 0xff01005c34494280,
      p_lio = 0xff01005c34494280,
      p_prison = 0xff01005c34494280,
      p_v = 0xff01005c34494280
    },
    kn_fop = 0xffffffff814dd950 <soread_filtops>
  },
  c = {
    kn_link = {
      sle_next = 0x0
    },
    kn_selnext = {
      sle_next = 0x0
    },
    kn_knlist = 0x0,
    kn_tqe = {
      tqe_next = 0x0,
      tqe_prev = 0x0
    },
    kn_kq = 0x0,
    kn_kevent = {
      ident = 0,
      filter = 0,
      flags = 0,
      fflags = 0,
      data = 0,
      udata = 0x0,
      ext = {0, 0, 0, 0}
    },
    kn_hook = 0x0,
    kn_hookid = 0,
    kn_status = 0,
    kn_influx = 0,
    kn_sfflags = 0,
    kn_sdata = 0,
    kn_ptr = {
      p_fp = 0x0,
      p_proc = 0x0,
      p_aio = 0x0,
      p_lio = 0x0,
      p_prison = 0x0,
      p_v = 0x0
    },
    kn_fop = 0x0
  },
  on_kn_link = 1
}
(kgdb) fr 17
#17 kqueue_register (kq=kq@entry=0xff010002208b5b00, kev=kev@entry=0xfffffe069b454a40, td=td@entry=0xff010001e7d30000, mflag=mflag@entry=2) at /usr/src/sys/kern/kern_event.c:1868
1868			kn_enter_flux(kn);
(kgdb) p *kq
$3 = {
  kq_lock = {
    lock_object = {
      lo_name = 0xffffffff813464b6 "kqueue",
      lo_flags = 21168128,
      lo_data = 0,
      lo_witness = 0xff0100804bd8db80
    },
    mtx_lock = 18374967962832666624
  },
  kq_refcnt = 1,
  kq_list = {
    tqe_next = 0xff010002039d2400,
    tqe_prev = 0xff010002039cfb28
  },
  kq_head = {
    tqh_first = 0x0,
    tqh_last = 0xff010002208b5b38
  },
  kq_count = 0,
  kq_sel = {
    si_tdlist = {
      tqh_first = 0x0,
      tqh_last = 0x0
    },
    si_note = {
      kl_list = {
        slh_first = 0x0
      },
      kl_lock = 0xffffffff80b71f90 <knlist_mtx_lock>,
      kl_unlock = 0xffffffff80b71fb0 <knlist_mtx_unlock>,
      kl_assert_lock = 0xffffffff80b71fd0 <knlist_mtx_assert_lock>,
      kl_lockarg = 0xff010002208b5b00,
      kl_autodestroy = 0
    },
    si_mtx = 0x0
  },
  kq_sigio = 0x0,
  kq_fdp = 0xfffffe0694a7bc90,
  kq_state = 0,
  kq_knlistsize = 683008,
  kq_knlist = 0xfffffe09db9fe000,
  kq_knhashmask = 0,
  kq_knhash = 0x0,
  kq_task = {
    ta_link = {
      stqe_next = 0x0
    },
    ta_pending = 0,
    ta_priority = 0 '\000',
    ta_flags = 0 '\000',
    ta_func = 0xffffffff80b747e0 <kqueue_task>,
    ta_context = 0xff010002208b5b00
  },
  kq_cred = 0xff010027f901b480,
  kq_forksrc = 0x0
}
(kgdb) p *kev
$4 = {
  ident = 297765,
  filter = -1,
  flags = 2,
  fflags = 0,
  data = 0,
  udata = 0x0,
  ext = {0, 0, 0, 0}
}

Seems like knote and kevent don't actually match one another, but both have filter equal to -1, which seems to be sufficient to mistake them. As seen in kqueue_register(): 

 if (kev->filter == kn->kn_filter)
     break;

Seems like the simple fix should be to improve the matching strategy and account for -1?
Comment 35 Konstantin Belousov freebsd_committer freebsd_triage 2026-03-31 09:30:35 UTC
(In reply to Paul from comment #34)
No, the problem is the knote found on the kq_knlist[ident] belongs to different kqueue.
In other words, kn->kn_kq != kq.  This is why the assert about owning the kq lock
failed.

This is very strange thing to happen, but it is consistent with the previous report.
Lets add asserts everywhere that kn->kn_kq is matching the code expectation.  Also I am somewhat suspicious of kqueue_expand(), I added a check after bzero().

What is the CPU that you use?  Show the CPU ident lines from the verbose dmesg.
Or/and, show the output from 'cpuid-etallen -1'.
Comment 36 Konstantin Belousov freebsd_committer freebsd_triage 2026-03-31 09:30:58 UTC
Created attachment 269251 [details]
debugging patch
Comment 37 Paul 2026-03-31 10:51:32 UTC
I see, sorry for pushing my bad analysis.

This is a pretty recent (and decent) CPU. We are using it on many other servers with a bit different kind of load (not that many open sockets per process), but with much more stress overall. And only in this case the issue occurs. Doubtful that's the CPU hardware bug. We will now be applying you latest patch, thanks for helping!

# cpuid-etallen -1
CPU:
   vendor_id = "AuthenticAMD"
   version information (1/eax):
      processor type  = primary processor (0)
      family          = 0xf (15)
      model           = 0x2 (2)
      stepping id     = 0x1 (1)
      extended family = 0xb (11)
      extended model  = 0x0 (0)
      (family synth)  = 0x1a (26)
      (model synth)   = 0x2 (2)
   miscellaneous (1/ebx):
      process local APIC physical ID = 0x33 (51)
      maximum IDs for CPUs in pkg    = 0x20 (32)
      CLFLUSH line size              = 0x8 (8)
      brand index                    = 0x0 (0)
   brand id = 0x00 (0): unknown
   feature information (1/edx):
      x87 FPU on chip                        = true
      VME: virtual-8086 mode enhancement     = true
      DE: debugging extensions               = true
      PSE: page size extensions              = true
      TSC: time stamp counter                = true
      RDMSR and WRMSR support                = true
      PAE: physical address extensions       = true
      MCE: machine check exception           = true
      CMPXCHG8B inst.                        = true
      APIC on chip                           = true
      SYSENTER and SYSEXIT                   = true
      MTRR: memory type range registers      = true
      PTE global bit                         = true
      MCA: machine check architecture        = true
      CMOV: conditional move/compare instr   = true
      PAT: page attribute table              = true
      PSE-36: page size extension            = true
      PSN: processor serial number           = false
      CLFLUSH instruction                    = true
      DS: debug store                        = false
      ACPI: thermal monitor and clock ctrl   = false
      MMX Technology                         = true
      FXSAVE/FXRSTOR                         = true
      SSE extensions                         = true
      SSE2 extensions                        = true
      SS: self snoop                         = false
      hyper-threading / multi-core supported = true
      TM: therm. monitor                     = false
      IA64                                   = false
      PBE: pending break event               = false
   feature information (1/ecx):
      PNI/SSE3: Prescott New Instructions     = true
      PCLMULDQ instruction                    = true
      DTES64: 64-bit debug store              = false
      MONITOR/MWAIT                           = true
      CPL-qualified debug store               = false
      VMX: virtual machine extensions         = false
      SMX: safer mode extensions              = false
      Enhanced Intel SpeedStep Technology     = false
      TM2: thermal monitor 2                  = false
      SSSE3 extensions                        = true
      context ID: adaptive or shared L1 data  = false
      SDBG: IA32_DEBUG_INTERFACE              = false
      FMA instruction                         = true
      CMPXCHG16B instruction                  = true
      xTPR disable                            = false
      PDCM: perfmon and debug                 = false
      PCID: process context identifiers       = true
      DCA: direct cache access                = false
      SSE4.1 extensions                       = true
      SSE4.2 extensions                       = true
      x2APIC: extended xAPIC support          = true
      MOVBE instruction                       = true
      POPCNT instruction                      = true
      time stamp counter deadline             = false
      AES instruction                         = true
      XSAVE/XSTOR states                      = true
      OS-enabled XSAVE/XSTOR                  = true
      AVX: advanced vector extensions         = true
      F16C half-precision convert instruction = true
      RDRAND instruction                      = true
      hypervisor guest status                 = false
   cache and TLB information (2):
   processor serial number = 00B0-0F21-0000-0000-0000-0000
   deterministic cache parameters (4):
      --- cache 0 ---
      cache type                         = no more caches (0)
   MONITOR/MWAIT (5):
      smallest monitor-line size (bytes)       = 0x40 (64)
      largest monitor-line size (bytes)        = 0x40 (64)
      enum of Monitor-MWAIT exts supported     = true
      supports intrs as break-event for MWAIT  = true
      monitorless MWAIT supported              = false
      number of C0 sub C-states using MWAIT    = 0x1 (1)
      number of C1 sub C-states using MWAIT    = 0x2 (2)
      number of C2 sub C-states using MWAIT    = 0x0 (0)
      number of C3 sub C-states using MWAIT    = 0x0 (0)
      number of C4 sub C-states using MWAIT    = 0x0 (0)
      number of C5 sub C-states using MWAIT    = 0x0 (0)
      number of C6 sub C-states using MWAIT    = 0x0 (0)
      number of C7 sub C-states using MWAIT    = 0x0 (0)
   Thermal and Power Management Features (6):
      digital thermometer                     = false
      Intel Turbo Boost Technology            = false
      ARAT always running APIC timer          = true
      PLN power limit notification            = false
      ECMD extended clock modulation duty     = false
      PTM package thermal management          = false
      HWP base registers                      = false
      HWP notification                        = false
      HWP activity window                     = false
      HWP energy performance preference       = false
      HWP package level request               = false
      HDC base registers                      = false
      Intel Turbo Boost Max Technology 3.0    = false
      HWP capabilities                        = false
      HWP PECI override                       = false
      flexible HWP                            = false
      IA32_HWP_REQUEST MSR fast access mode   = false
      HW_FEEDBACK MSRs supported              = false
      ignoring idle logical processor HWP req = false
      IA32_HWP_CTL MSR supported              = false
      Thread Director                         = false
      IA32_HW_FEEDBACK_THREAD_CONFIG bit 25   = false
      digital thermometer thresholds          = 0x0 (0)
      hardware coordination feedback          = true
      ACNT2 available                         = false
      performance-energy bias capability      = false
      number of enh hardware feedback classes = 0x0 (0)
      performance capability reporting        = false
      energy efficiency capability reporting  = false
      size of feedback struct (4KB pages)     = 0x1 (1)
      index of CPU's row in feedback struct   = 0x0 (0)
   extended feature flags (7):
      FSGSBASE instructions                    = true
      IA32_TSC_ADJUST MSR supported            = true
      SGX: Software Guard Extensions supported = false
      BMI1 instructions                        = true
      HLE hardware lock elision                = false
      AVX2: advanced vector extensions 2       = true
      FDP_EXCPTN_ONLY                          = false
      SMEP supervisor mode exec protection     = true
      BMI2 instructions                        = true
      enhanced REP MOVSB/STOSB                 = true
      INVPCID instruction                      = true
      RTM: restricted transactional memory     = false
      RDT-CMT/PQoS cache monitoring            = true
      deprecated FPU CS/DS                     = false
      MPX: intel memory protection extensions  = false
      RDT-CAT/PQE cache allocation             = true
      AVX512F: AVX-512 foundation instructions = true
      AVX512DQ: double & quadword instructions = true
      RDSEED instruction                       = true
      ADX instructions                         = true
      SMAP: supervisor mode access prevention  = true
      AVX512IFMA: integer fused multiply add   = true
      PCOMMIT instruction                      = false
      CLFLUSHOPT instruction                   = true
      CLWB instruction                         = true
      Intel processor trace                    = false
      AVX512PF: prefetch instructions          = false
      AVX512ER: exponent & reciprocal instrs   = false
      AVX512CD: conflict detection instrs      = true
      SHA instructions                         = true
      AVX512BW: byte & word instructions       = true
      AVX512VL: vector length                  = true
      PREFETCHWT1                              = false
      AVX512VBMI: vector byte manipulation     = true
      UMIP: user-mode instruction prevention   = true
      PKU protection keys for user-mode        = true
      OSPKE CR4.PKE and RDPKRU/WRPKRU          = true
      WAITPKG instructions                     = false
      AVX512_VBMI2: byte VPCOMPRESS, VPEXPAND  = true
      CET_SS: CET shadow stack                 = true
      GFNI: Galois Field New Instructions      = true
      VAES instructions                        = true
      VPCLMULQDQ instruction                   = true
      AVX512_VNNI: neural network instructions = true
      AVX512_BITALG: bit count/shiffle         = true
      TME: Total Memory Encryption             = false
      AVX512: VPOPCNTDQ instruction            = true
      LA57: 57-bit addrs & 5-level paging      = true
      BNDLDX/BNDSTX MAWAU value in 64-bit mode = 0x0 (0)
      RDPID: read processor ID supported       = true
      KL: key locker                           = false
      bus lock detection                       = true
      CLDEMOTE supports cache line demote      = false
      MOVDIRI instruction                      = true
      MOVDIR64B instruction                    = true
      ENQCMD instruction                       = false
      SGX_LC: SGX launch config supported      = false
      PKS: supervisor protection keys          = false
      SGX-KEYS: SGX attestation services       = false
      AVX512_4VNNIW: neural network instrs     = false
      AVX512_4FMAPS: multiply acc single prec  = false
      fast short REP MOV                       = true
      UINTR: user interrupts                   = false
      AVX512_VP2INTERSECT: intersect mask regs = true
      IA32_MCU_OPT_CTRL SRBDS mitigation MSR   = false
      VERW MD_CLEAR microcode support          = false
      RTM transaction always aborts            = false
      IA32_TSX_FORCE_ABORT MSR                 = false
      SERIALIZE instruction                    = false
      hybrid part                              = false
      TSXLDTRK: TSX suspend load addr tracking = false
      PCONFIG instruction                      = false
      LBR: architectural last branch records   = false
      CET_IBT: CET indirect branch tracking    = false
      AMX-BF16: tile bfloat16 support          = false
      AVX512_FP16: fp16 support                = false
      AMX-TILE: tile architecture support      = false
      AMX-INT8: tile 8-bit integer support     = false
      IBRS/IBPB: indirect branch restrictions  = false
      STIBP: 1 thr indirect branch predictor   = false
      L1D_FLUSH: IA32_FLUSH_CMD MSR            = true
      IA32_ARCH_CAPABILITIES MSR               = false
      IA32_CORE_CAPABILITIES MSR               = false
      SSBD: speculative store bypass disable   = false
      SHA512 instructions                      = false
      SM3 instructions                         = false
      SM4 instructions                         = false
      RAO-INT atomic instructions              = false
      AVX-VNNI: AVX VNNI neural network instrs = true
      AVX512_BF16: bfloat16 instructions       = true
      LASS: linear address space separation    = false
      CMPccXADD instructions                   = false
      ArchPerfmonExt leaf 0x23 is valid        = false
      fast zero-length REP MOVSB               = false
      fast short REP STOSB                     = false
      fast short REP CMPSB, REP SCASB          = false
      FRED transitions & MSRs                  = false
      LKGS instruction                         = false
      WRMSRNS instruction                      = false
      NMI-source reporting                     = false
      AMX-FP16: FP16 tile operations           = false
      HRESET: history reset support            = false
      AVX-IFMA: integer fused multiply add     = false
      LAM: linear address masking              = false
      RDMSRLIST, WRMSRLIST instructions        = false
      INVD prevention after BIOS done          = false
      MOVRS instructions                       = false
      IA32_PPIN & IA32_PPIN_CTL MSRs supported = false
      PBNDKB instruction                       = false
      IA32_MISC_ENABLE cannot limit CPUID max  = false
      asymmetric RDT-M monitoring              = false
      asymmetric RDT-A allocation              = false
      RDMSR/WRMSRNS immediates supported       = false
      AVX-VNNI-INT8 instructions               = false
      AVX-NE-CONVERT instructions              = false
      AMX-COMPLEX instructions                 = false
      AVX-VNNI-INT16 instructions              = false
      user-timer events                        = false
      PREFETCHIT0, PREFETCHIT1 instructions    = false
      URDMSR, UWRMSR instructions              = false
      UIRET flexibly updates UIF               = false
      CET_SSS: shadow stacks w/o page faults   = false
      AVX10 instructions                       = false
      APX advanced performance extensions      = false
      MWAIT instruction                        = false
   Direct Cache Access Parameters (9):
      PLATFORM_DCA_CAP MSR bits = 0
   Architecture Performance Monitoring Features (0xa):
      version ID                               = 0x0 (0)
      number of counters per logical processor = 0x0 (0)
      bit width of counter                     = 0x0 (0)
      length of EBX bit vector                 = 0x0 (0)
      core cycle event                         = not available
      instruction retired event                = not available
      reference cycles event                   = not available
      last-level cache ref event               = not available
      last-level cache miss event              = not available
      branch inst retired event                = not available
      branch mispred retired event             = not available
      topdown slots event                      = not available
      topdown backend bound                    = not available
      topdown bad speculation                  = not available
      topdown frontend bound                   = not available
      topdown retiring                         = not available
      LBR inserts                              = not available
      fixed counter  0 supported               = false
      fixed counter  1 supported               = false
      fixed counter  2 supported               = false
      fixed counter  3 supported               = false
      fixed counter  4 supported               = false
      fixed counter  5 supported               = false
      fixed counter  6 supported               = false
      fixed counter  7 supported               = false
      fixed counter  8 supported               = false
      fixed counter  9 supported               = false
      fixed counter 10 supported               = false
      fixed counter 11 supported               = false
      fixed counter 12 supported               = false
      fixed counter 13 supported               = false
      fixed counter 14 supported               = false
      fixed counter 15 supported               = false
      fixed counter 16 supported               = false
      fixed counter 17 supported               = false
      fixed counter 18 supported               = false
      fixed counter 19 supported               = false
      fixed counter 20 supported               = false
      fixed counter 21 supported               = false
      fixed counter 22 supported               = false
      fixed counter 23 supported               = false
      fixed counter 24 supported               = false
      fixed counter 25 supported               = false
      fixed counter 26 supported               = false
      fixed counter 27 supported               = false
      fixed counter 28 supported               = false
      fixed counter 29 supported               = false
      fixed counter 30 supported               = false
      fixed counter 31 supported               = false
      number of contiguous fixed counters      = 0x0 (0)
      bit width of fixed counters              = 0x0 (0)
      anythread deprecation                    = false
   x2APIC features / processor topology (0xb):
      extended APIC ID                      = 51
      --- level 0 ---
      level number                          = 0x0 (0)
      level type                            = thread (1)
      bit width of level & previous levels  = 0x0 (0)
      number of logical processors at level = 0x1 (1)
      --- level 1 ---
      level number                          = 0x1 (1)
      level type                            = core (2)
      bit width of level & previous levels  = 0x7 (7)
      number of logical processors at level = 0x20 (32)
      --- level 2 ---
      level number                          = 0x2 (2)
      level type                            = invalid (0)
      bit width of level & previous levels  = 0x0 (0)
      number of logical processors at level = 0x0 (0)
   XSAVE features (0xd/0):
      XCR0 valid bit field mask               = 0x00000000000002e7
         x87 state                            = true
         SSE state                            = true
         AVX state                            = true
         MPX BNDREGS                          = false
         MPX BNDCSR                           = false
         AVX-512 opmask                       = true
         AVX-512 ZMM_Hi256                    = true
         AVX-512 Hi16_ZMM                     = true
         PKRU state                           = true
         XTILECFG state                       = false
         XTILEDATA state                      = false
      bytes required by fields in XCR0        = 0x00000988 (2440)
      bytes required by XSAVE/XRSTOR area     = 0x00000988 (2440)
      XSAVEOPT instruction                    = true
      XSAVEC instruction                      = true
      XGETBV instruction                      = true
      XSAVES/XRSTORS instructions             = true
      XFD: extended feature disable supported = false
      SAVE area size in bytes                 = 0x00000988 (2440)
      IA32_XSS valid bit field mask           = 0x0000000000001800
         PT state                             = false
         PASID state                          = false
         CET_U user state                     = true
         CET_S supervisor state               = true
         HDC state                            = false
         UINTR state                          = false
         LBR state                            = false
         HWP state                            = false
   AVX/YMM features (0xd/2):
      AVX/YMM save state byte size             = 0x00000100 (256)
      AVX/YMM save state byte offset           = 0x00000240 (576)
      supported in IA32_XSS or XCR0            = XCR0 (user state)
      64-byte alignment in compacted XSAVE     = false
      XFD faulting supported                   = false
   AVX-512 opmask features (0xd/5):
      AVX-512 opmask save state byte size      = 0x00000040 (64)
      AVX-512 opmask save state byte offset    = 0x00000340 (832)
      supported in IA32_XSS or XCR0            = XCR0 (user state)
      64-byte alignment in compacted XSAVE     = false
      XFD faulting supported                   = false
   AVX-512 ZMM_Hi256 features (0xd/6):
      AVX-512 ZMM_Hi256 save state byte size   = 0x00000200 (512)
      AVX-512 ZMM_Hi256 save state byte offset = 0x00000380 (896)
      supported in IA32_XSS or XCR0            = XCR0 (user state)
      64-byte alignment in compacted XSAVE     = false
      XFD faulting supported                   = false
   AVX-512 Hi16_ZMM features (0xd/7):
      AVX-512 Hi16_ZMM save state byte size    = 0x00000400 (1024)
      AVX-512 Hi16_ZMM save state byte offset  = 0x00000580 (1408)
      supported in IA32_XSS or XCR0            = XCR0 (user state)
      64-byte alignment in compacted XSAVE     = false
      XFD faulting supported                   = false
   PKRU features (0xd/9):
      PKRU save state byte size                = 0x00000008 (8)
      PKRU save state byte offset              = 0x00000980 (2432)
      supported in IA32_XSS or XCR0            = XCR0 (user state)
      64-byte alignment in compacted XSAVE     = false
      XFD faulting supported                   = false
   CET_U user features (0xd/0xb):
      CET_U user save state byte size          = 0x00000010 (16)
      CET_U user save state byte offset        = 0x00000000 (0)
      supported in IA32_XSS or XCR0            = IA32_XSS (supervisor state)
      64-byte alignment in compacted XSAVE     = false
      XFD faulting supported                   = false
   CET_S supervisor features (0xd/0xc):
      CET_S supervisor save state byte size    = 0x00000018 (24)
      CET_S supervisor save state byte offset  = 0x00000000 (0)
      supported in IA32_XSS or XCR0            = IA32_XSS (supervisor state)
      64-byte alignment in compacted XSAVE     = false
      XFD faulting supported                   = false
   Quality of Service Monitoring Resource Type (0xf/0):
      Maximum range of RMID = 4095
      supports L3 cache monitoring = true
   L3 Cache Quality of Service Monitoring (0xf/1):
      Conversion factor from IA32_QM_CTR to bytes = 64
      Maximum range of RMID                       = 4095
      Counter width                               = 44
      QoS monitoring counter size                 = 0x2c (44)
      IA32_QM_CTR bit 61 is overflow              = false
      non-CPU agent cache monitoring (CMT)        = false
      non-CPU agent mem bandwidth mon (MBM)       = false
      (QoS monitoring counter size synth)         = 44
      supports L3 occupancy monitoring            = true
      supports L3 total bandwidth monitoring      = true
      supports L3 local bandwidth monitoring      = true
   Resource Director Technology Allocation (0x10/0):
      L3 cache allocation technology supported = true
      L2 cache allocation technology supported = false
      memory bandwidth allocation supported    = false
      cache bandwidth allocation supported     = false
   L3 Cache Allocation Technology (0x10/1):
      length of capacity bit mask              = 0x10 (16)
      Bit-granular map of isolation/contention = 0x00000000
      non-CPU agent support                    = false
      code and data prioritization supported   = true
      non-contiguous bitmask supported         = false
      highest COS number supported             = 0xf (15)
   extended processor signature (0x80000001/eax):
      family/generation = 0xf (15)
      model           = 0x2 (2)
      stepping id     = 0x1 (1)
      extended family = 0xb (11)
      extended model  = 0x0 (0)
      (family synth)  = 0x1a (26)
      (model synth)   = 0x2 (2)
   extended feature flags (0x80000001/edx):
      x87 FPU on chip                       = true
      virtual-8086 mode enhancement         = true
      debugging extensions                  = true
      page size extensions                  = true
      time stamp counter                    = true
      RDMSR and WRMSR support               = true
      physical address extensions           = true
      machine check exception               = true
      CMPXCHG8B inst.                       = true
      APIC on chip                          = true
      SYSCALL and SYSRET instructions       = true
      memory type range registers           = true
      global paging extension               = true
      machine check architecture            = true
      conditional move/compare instruction  = true
      page attribute table                  = true
      page size extension                   = true
      multiprocessing capable               = false
      no-execute page protection            = true
      AMD multimedia instruction extensions = true
      MMX Technology                        = true
      FXSAVE/FXRSTOR                        = true
      SSE extensions                        = true
      1-GB large page support               = true
      RDTSCP                                = true
      long mode (AA-64)                     = true
      3DNow! instruction extensions         = false
      3DNow! instructions                   = false
   extended brand id (0x80000001/ebx):
      raw     = 0x40000000 (1073741824)
      BrandId = 0x0 (0)
      PkgType = 0x4 (4)
   AMD feature flags (0x80000001/ecx):
      LAHF/SAHF supported in 64-bit mode     = true
      CMP Legacy                             = true
      SVM: secure virtual machine            = true
      extended APIC space                    = true
      AltMovCr8                              = true
      LZCNT advanced bit manipulation        = true
      SSE4A support                          = true
      misaligned SSE mode                    = true
      3DNow! PREFETCH/PREFETCHW instructions = true
      OS visible workaround                  = true
      instruction based sampling             = true
      XOP support                            = false
      SKINIT/STGI support                    = true
      watchdog timer support                 = true
      lightweight profiling support          = false
      4-operand FMA instruction              = false
      TCE: translation cache extension       = true
      NodeId MSR C001100C                    = false
      TBM support                            = false
      topology extensions                    = true
      core performance counter extensions    = true
      NB/DF performance counter extensions   = true
      data breakpoint extension              = true
      performance time-stamp counter support = false
      LLC performance counter extensions     = true
      MWAITX/MONITORX supported              = true
      Address mask extension support         = true
   brand = "AMD EPYC 9355 32-Core Processor                "
   L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
      instruction # entries     = 0x40 (64)
      instruction associativity = 0xff (255)
      data # entries            = 0x60 (96)
      data associativity        = 0xff (255)
   L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
      instruction # entries     = 0x40 (64)
      instruction associativity = 0xff (255)
      data # entries            = 0x60 (96)
      data associativity        = 0xff (255)
   L1 data cache information (0x80000005/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag     = 0x1 (1)
      associativity     = 0xc (12)
      size (KB)         = 0x30 (48)
   L1 instruction cache information (0x80000005/edx):
      line size (bytes) = 0x40 (64)
      lines per tag     = 0x1 (1)
      associativity     = 0x8 (8)
      size (KB)         = 0x20 (32)
   L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
      instruction # entries     = 0x40 (64)
      instruction associativity = 2-way (2)
      data # entries            = 0x80 (128)
      data associativity        = 4 to 5-way (4)
   L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
      instruction # entries     = 0x40 (64)
      instruction associativity = 4 to 5-way (4)
      data # entries            = 0x80 (128)
      data associativity        = 8 to 15-way (6)
   L2 unified cache information (0x80000006/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag     = 0x1 (1)
      associativity     = 16 to 31-way (8)
      size (KB)         = 0x400 (1024)
   L3 cache information (0x80000006/edx):
      line size (bytes)     = 0x40 (64)
      lines per tag         = 0x1 (1)
      associativity         = 0x9 (9)
      size (in 512KB units) = 0x200 (512)
   RAS Capability (0x80000007/ebx):
      MCA overflow recovery support = true
      SUCCOR support                = true
      HWA: hardware assert support  = false
      scalable MCA support          = true
   Advanced Power Management Features (0x80000007/ecx):
      CmpUnitPwrSampleTimeRatio = 0x0 (0)
   Advanced Power Management Features (0x80000007/edx):
      TS: temperature sensing diode           = true
      FID: frequency ID control               = false
      VID: voltage ID control                 = false
      TTP: thermal trip                       = true
      TM: thermal monitor                     = true
      STC: software thermal control           = false
      100 MHz multiplier control              = false
      hardware P-State control                = true
      TscInvariant                            = true
      CPB: core performance boost             = true
      read-only effective frequency interface = true
      processor feedback interface            = false
      APM power reporting                     = false
      connected standby                       = true
      RAPL: running average power limit       = true
      fast CPPC                               = false
   Physical Address and Linear Address Size (0x80000008/eax):
      maximum physical address bits         = 0x34 (52)
      maximum linear (virtual) address bits = 0x39 (57)
      maximum guest physical address bits   = 0x0 (0)
   Extended Feature Extensions ID (0x80000008/ebx):
      CLZERO instruction                       = true
      instructions retired count support       = true
      always save/restore error pointers       = true
      INVLPGB instruction                      = true
      RDPRU instruction                        = true
      memory bandwidth enforcement             = true
      MCOMMIT instruction                      = false
      WBNOINVD instruction                     = true
      IBPB: indirect branch prediction barrier = true
      interruptible WBINVD, WBNOINVD           = true
      IBRS: indirect branch restr speculation  = true
      STIBP: 1 thr indirect branch predictor   = true
      CPU prefers: IBRS always on              = false
      CPU prefers: STIBP always on             = true
      IBRS preferred over software solution    = true
      IBRS provides same mode protection       = true
      EFER[LMSLE] not supported                = true
      INVLPGB supports TLB flush guest nested  = true
      ppin processor id number supported       = true
      SSBD: speculative store bypass disable   = true
      virtualized SSBD                         = false
      SSBD fixed in hardware                   = false
      CPPC: collaborative processor perf ctrl  = true
      PSFD: predictive store forward disable   = true
      not vulnerable to branch type confusion  = true
      IBPB_RET: ret addr predictor cleared     = true
      branch sampling feature support          = false
      (vuln to branch type confusion synth)    = false
   Size Identifiers (0x80000008/ecx):
      number of threads                   = 0x20 (32)
      ApicIdCoreIdSize                    = 0x7 (7)
      performance time-stamp counter size = 40 bits (0)
   Feature Extended Size (0x80000008/edx):
      max page count for INVLPGB instruction = 0x7 (7)
      RDPRU instruction max input support    = 0x1 (1)
   SVM Secure Virtual Machine (0x8000000a/eax):
      SvmRev: SVM revision = 0x1 (1)
   SVM Secure Virtual Machine (0x8000000a/edx):
      nested paging                           = true
      LBR virtualization                      = true
      SVM lock                                = true
      NRIP save                               = true
      MSR based TSC rate control              = true
      VMCB clean bits support                 = true
      flush by ASID                           = true
      decode assists                          = true
      PMC virtualization                      = true
      SSSE3/SSE5 opcode set disable           = false
      pause intercept filter                  = true
      pause filter threshold                  = true
      AVIC: AMD virtual interrupt controller  = true
      virtualized VMLOAD/VMSAVE               = true
      virtualized global interrupt flag (GIF) = true
      GMET: guest mode execute trap           = true
      X2AVIC: virtualized X2APIC              = true
      supervisor shadow stack                 = true
      guest Spec_ctl support                  = true
      ROGPT: read-only guest page table       = true
      host MCE override                       = true
      INVLPGB/TLBSYNC hyperv interc enable    = true
      VNMI: NMI virtualization                = true
      IBS virtualization                      = true
      extended LVT AVIC access changes        = true
      guest VMCB addr check                   = true
      bus lock threshold                      = true
      idle HLT intercept                      = true
      EXITINFO1 non-interceptible shutdown    = true
   NASID: number of address space identifiers = 0x8000 (32768):
   L1 TLB information: 1G pages (0x80000019/eax):
      instruction # entries     = 0x40 (64)
      instruction associativity = full (15)
      data # entries            = 0x60 (96)
      data associativity        = full (15)
   L2 TLB information: 1G pages (0x80000019/ebx):
      instruction # entries     = 0x0 (0)
      instruction associativity = L2 off (0)
      data # entries            = 0x20 (32)
      data associativity        = 4 to 5-way (4)
   Performance Optimization Identifiers (0x8000001a/eax):
      128-bit SSE executed full-width = false
      MOVU* better than MOVL*/MOVH*   = true
      256-bit SSE executed full-width = false
      512-bit SSE executed full-width = true
   Instruction Based Sampling Identifiers (0x8000001b/eax):
      IBS feature flags valid                  = true
      IBS fetch sampling                       = true
      IBS execution sampling                   = true
      read write of op counter                 = true
      op counting mode                         = true
      branch target address reporting          = true
      IbsOpCurCnt and IbsOpMaxCnt extend 7     = true
      invalid RIP indication support           = true
      fused branch micro-op indication support = true
      IBS fetch control extended MSR support   = true
      IBS op data 4 MSR support                = false
      IBS L3 miss filtering support            = true
      IBS load latency filtering support       = true
      simplified DTLB page size & miss report  = true
   Lightweight Profiling Capabilities: Availability (0x8000001c/eax):
      lightweight profiling                  = false
      LWPVAL instruction                     = false
      instruction retired event              = false
      branch retired event                   = false
      DC miss event                          = false
      core clocks not halted event           = false
      core reference clocks not halted event = false
      continuous mode sampling               = false
      tsc in event record                    = false
      interrupt on threshold overflow        = false
   Lightweight Profiling Capabilities: Supported (0x8000001c/edx):
      lightweight profiling                  = false
      LWPVAL instruction                     = false
      instruction retired event              = false
      branch retired event                   = false
      DC miss event                          = false
      core clocks not halted event           = false
      core reference clocks not halted event = false
      continuous mode sampling               = false
      tsc in event record                    = false
      interrupt on threshold overflow        = false
   Lightweight Profiling Capabilities (0x8000001c/ebx):
      LWPCB byte size             = 0x0 (0)
      event record byte size      = 0x0 (0)
      maximum EventId             = 0x0 (0)
      EventInterval1 field offset = 0x0 (0)
   Lightweight Profiling Capabilities (0x8000001c/ecx):
      latency counter bit size          = 0x0 (0)
      data cache miss address valid     = false
      amount cache latency is rounded   = 0x0 (0)
      LWP implementation version        = 0x0 (0)
      event ring buffer size in records = 0x0 (0)
      branch prediction filtering       = false
      IP filtering                      = false
      cache level filtering             = false
      cache latency filteing            = false
   Cache Properties (0x8000001d):
      --- cache 0 ---
      type                            = data (1)
      level                           = 0x1 (1)
      self-initializing               = true
      fully associative               = false
      extra cores sharing this cache  = 0x0 (0)
      line size in bytes              = 0x40 (64)
      physical line partitions        = 0x1 (1)
      number of ways                  = 0xc (12)
      number of sets                  = 64
      write-back invalidate           = false
      cache inclusive of lower levels = false
      (synth size)                    = 49152 (48 KB)
      --- cache 1 ---
      type                            = instruction (2)
      level                           = 0x1 (1)
      self-initializing               = true
      fully associative               = false
      extra cores sharing this cache  = 0x0 (0)
      line size in bytes              = 0x40 (64)
      physical line partitions        = 0x1 (1)
      number of ways                  = 0x8 (8)
      number of sets                  = 64
      write-back invalidate           = false
      cache inclusive of lower levels = false
      (synth size)                    = 32768 (32 KB)
      --- cache 2 ---
      type                            = unified (3)
      level                           = 0x2 (2)
      self-initializing               = true
      fully associative               = false
      extra cores sharing this cache  = 0x0 (0)
      line size in bytes              = 0x40 (64)
      physical line partitions        = 0x1 (1)
      number of ways                  = 0x10 (16)
      number of sets                  = 1024
      write-back invalidate           = false
      cache inclusive of lower levels = true
      (synth size)                    = 1048576 (1024 KB)
      --- cache 3 ---
      type                            = unified (3)
      level                           = 0x3 (3)
      self-initializing               = true
      fully associative               = false
      extra cores sharing this cache  = 0x3 (3)
      line size in bytes              = 0x40 (64)
      physical line partitions        = 0x1 (1)
      number of ways                  = 0x10 (16)
      number of sets                  = 32768
      write-back invalidate           = true
      cache inclusive of lower levels = false
      (synth size)                    = 33554432 (32 MB)
   extended APIC ID = 51
   Core Identifiers (0x8000001e/ebx):
      core ID          = 0x33 (51)
      threads per core = 0x1 (1)
   Node Identifiers (0x8000001e/ecx):
      node ID             = 0x0 (0)
      nodes per processor = 0x1 (1)
   AMD Secure Encryption (0x8000001f):
      SME: secure memory encryption support    = true
      SEV: secure encrypted virtualize support = true
      VM page flush MSR support                = false
      SEV-ES: SEV encrypted state support      = true
      SEV-SNP: SEV secure nested paging        = true
      VMPL: VM permission levels               = true
      RMPQUERY instruction support             = true
      VMPL supervisor shadow stack support     = true
      Secure TSC supported                     = true
      virtual TSC_AUX supported                = true
      hardware cache coher across enc domains  = true
      SEV guest exec only from 64-bit host     = true
      restricted injection                     = true
      alternate injection                      = true
      full debug state swap for SEV-ES/SEV-SNP = true
      disallowing IBS use by host              = true
      VTE: SEV virtual transparent encryption  = true
      VMGEXIT parameter support                = true
      virtual TOM MSR support                  = true
      IBS virtual support for SEV-ES/SEV-SNP   = true
      PMC virtual support for SEV-ES/SEV-SNP   = true
      RMPREAD instruction                      = true
      guest intercept control support          = true
      segmented RMP support                    = true
      VMSA register protection support         = true
      SMT protection support                   = true
      secure AVIC support                      = true
      allowed SEV features support             = true
      SVSM communication page MSR support      = false
      VIRT_RMPUPDATE & VIRT_PSMASH MSR support = false
      write to hypervisor in-used allowed      = true
      IBPB on entry support                    = true
      encryption bit position in PTE           = 0x33 (51)
      physical address space width reduction   = 0x6 (6)
      number of VM permission levels           = 0x4 (4)
      number of SEV-enabled guests supported   = 0x3ee (1006)
      minimum SEV guest ASID                   = 0x1 (1)
   PQoS Extended Features (0x80000020):
      L3 bandwidth enforcement                 = true
      L3 slow memory bandwidth enforcement     = true
      bandwidth monitoring event configuration = true
      L3 range reservation support             = true
      assignable bandwidth monitoring counters = true
      SDCI allocation enforcement              = true
   PQoS L3 Memory Bandwidth Enforcement (0x80000020/1):
      capacity bitmask length      = 0xd (13)
      number of classes of service = 0xf (15)
   PQoS L3 Slow Memory Bandwidth Enforcement (0x80000020/2):
      capacity bitmask length      = 0xd (13)
      number of classes of service = 0xf (15)
   PQoS Bandwidth Monitoring Event Configuration (0x80000020/3):
      number of bandwidth events available   = 0x2 (2)
      reads to local NUMA                    = true
      reads to non-local NUMA                = true
      non-temporal writes to local NUMA      = true
      non-temporal writes to non-local NUMA  = true
      reads to slow memory in local NUMA     = true
      reads to slow memory in non-local NUMA = true
      dirty victims writes                   = true
   0x80000020 0x04: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   PQoS Assignable Bandwidth Monitoring Counters (0x80000020/5):
      QM_CTR counter size-24            = 0x14 (20)
      QM_CTR bit 61 is overflow         = false
      (QM_CTR counter size)              = 44
      maximum supported ABCM counter ID = 0x1f (31)
      can measure COS bandwidth         = true
   0x80000020 0x06: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   Extended Feature 2 (0x80000021):
      no nested data-breakpoints               = true
      FsGsKernelGsBaseNonSerializing           = true
      LFENCE always serializing                = true
      SMM paging configuration lock support    = true
      null selector clears base                = true
      upper address ignore support             = true
      automatic IBRS                           = true
      SMM_CTL MSR not supported                = true
      FSRS: fast short REP STOSB support       = true
      FSRC: fast short REP CMPSB support       = true
      prefetch control MSR support             = true
      L2TLB sizes are multiples of 32          = true
      AMD enhanced REP MOVSB/STOSB             = true
      reserves 0F 01/7 for AMD use             = true
      CPUID disable for non-privileged         = true
      enhanced predictive store forwarding     = true
      fast short REP SCASB support             = true
      IC PREFETCH support                      = true
      FP512 to FP256 downgrade support         = true
      workload OS feedback support             = false
      ret addr predictor security support      = true
      guest: selective branch pred barrier     = true
      guest: PRED_CMD[IBPB] flushes br predict = true
      unaffected by spec return stack overflow = false
      unaffected by SRSO at user-kernel bound  = true
      BP_CFG can mitigate other SRSO cases     = true
      microcode patch size                     = 14368 (0x3820)
      return addr predictor size               = 64 (0x40)
   Extended Performance Monitoring and Debugging (0x80000022):
      AMD performance monitoring V2         = true
      AMD LBR V2                            = true
      AMD LBR stack & PMC freezing          = true
      number of core perf ctrs              = 0x6 (6)
      number of LBR stack entries           = 0x10 (16)
      number of avail Northbridge perf ctrs = 0x10 (16)
      number of available UMC PMCs          = 0x20 (32)
      active UMCs bitmask                   = 0x6db
   Multi-Key Encrypted Memory Capabilities (0x80000023):
      secure host multi-key memory support = true
      number of encryption key IDs         = 0x3f (63)
   0x80000024 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   Segmented RMP Table (0x80000025):
      RMP segment size minimum supported = 0x24 (36)
      RMP segment size maximum supported = 0x2a (42)
      cacheable RMP segment definitions  = 0x10 (16)
      cached segments hard limit         = false
   AMD Extended CPU Topology (0x80000026):
      extended APIC ID                        = 51
      --- level 0 ---
      level number                            = 0x0 (0)
      level type                              = core (1)
      bit width of level                      = 0x0 (0)
      power efficiency ranking available      = false
      cores heterogeneous at this level       = false
      components have varying number of cores = false
      number of logical processors at level   = 0x1 (1)
   0x80000027 0x00: eax=0x00000003 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000028 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   (instruction supported synth):
      CMPXCHG8B                = true
      conditional move/compare = true
      PREFETCH/PREFETCHW       = true
   (multi-processing synth) = multi-core (c=32)
   (multi-processing method) = AMD leaf 0xb
   (APIC widths synth): CORE_width=7 SMT_width=0
   (APIC synth): PKG_ID=0 CORE_ID=51 SMT_ID=0
   (uarch synth) = AMD Zen 5, TSMC N4P
   (synth) = AMD EPYC (5th Gen) (Turin C1) [Zen 5], TSMC N4P
Comment 38 Kyle Evans freebsd_committer freebsd_triage 2026-03-31 23:51:34 UTC
(In reply to Paul from comment #37)

While we're waiting to see if kib' latest patch surfaces anything of interest, can you clarify/characterize the app's kqueue usage a little bit?  Is it purely adding events to the kqueue and letting them drop out organically from being closed, mix of that and EV_DISABLE/EV_ONESHOT, etc?

Are all of the events in this kqueue tied to fds, or is it also driving timers and other non-fd things?

I don't think all of the above is going to be particularly relevant, but I think it's good to try and get the lay of the land in case we're not thinking outside of the box enough somehow.
Comment 39 Kyle Evans freebsd_committer freebsd_triage 2026-03-31 23:51:57 UTC
(In reply to Kyle Evans from comment #38)

s/EV_DISABLE/EV_DELETE/, of course
Comment 40 Konstantin Belousov freebsd_committer freebsd_triage 2026-04-01 02:42:13 UTC
(In reply to Paul from comment #37)
I was interested in the presence of the support for ERMS, to see
which memset() implementation is selected at runtime.  This is why I asked
about CPU identification.  CPU or RAM failures usually look very differently than
what you report.
Comment 41 Paul 2026-04-01 06:53:33 UTC
(In reply to Kyle Evans from comment #38)

Basically, this is a HTTP/Websocket micro-service app. It handles hundreds of thousands of long-living incoming connections. Approximately 98% of them are Websocket and the rest is plain HTTP. Other than that, there are also a lot of outbound HTTP connections, somewhere in range of 200k-300k. Yeah, to be able to implement this we need timers, a lot of them.

All of it is based on a standalone asio C++ library. It does the heavy lifting under the hood. We use several `io_context` for scale. And I believe each of them has its own kqueue, which seems reasonable.

Not sure about EV_DELETE/EV_ONESHOT, as we're not dealing with kqueue directly. Looking at source code I see no EV_ONESHOT and even see that earlier they've abandoned its use (https://www.boost.org/doc/libs/latest/doc/html/boost_asio/history.html#boost_asio.history.asio_1_4_9___boost_1_46_1). It does use EV_DELETE, seemingly on every descriptor close, so... a lot. Connections appear and fall off organically. A lot of them on timeout, as this is the real world and not all browsers bother to do the clean Websocket shutdown.

This service is nothing new for us. We've been using it for years. Even today, we're still keeping a fallback server that runs FreeBSD 13.1-STABLE but, and this is a big but: with a different server platform : 2x E5-2660. So, not only have we switched to a newer FreeBSD version, but also dramatically changed the platform. And the current platform is much, much faster. My gut feeling is: there was a problem already, it just became apparent. Because, a faster CPU makes things, both in kernel and userspace, run faster. And some, highly improbable races, become likely.

We've been running this platform and OS version for months now, under different kinds of load: not that many sockets, but much more CPU usage overall. And it runs completely stable. On the other hand, for this specific role, we now have 2 (of 2) servers that exhibit these crashes. Which doesn't rule out some CPU issues yes, but makes it highly unlikely to be the cause.
Comment 42 Paul 2026-04-01 06:55:39 UTC
(In reply to Konstantin Belousov from comment #40)

I see, no problem. If you need more info, just ask.
Comment 43 Paul 2026-04-08 14:59:32 UTC
Hi!

We have another panic.

Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 13
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80b72503
stack pointer	        = 0x28:0xfffffe069ae28d40
frame pointer	        = 0x28:0xfffffe069ae28d70
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 29500 (asy:http:s)
rdi: ff01000107b11500 rsi: 0000000000000008 rdx: 0000000000000001
rcx: 0000000000000000  r8: 0000000000000002  r9: ffffffff82252ef0
rax: 0000000000000000 rbx: ff0100772fd78668 rbp: fffffe069ae28d70
r10: 0000000000000000 r11: 0000000000000004 r12: ff01000107b11500
r13: ff0100772fd78668 r14: ff01007278e4e780 r15: ff01000107b11518
trap number		= 12
panic: page fault
cpuid = 7
time = 1775658023
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe069ae28a70
vpanic() at vpanic+0x136/frame 0xfffffe069ae28ba0
panic() at panic+0x43/frame 0xfffffe069ae28c00
trap_pfault() at trap_pfault+0x422/frame 0xfffffe069ae28c70
calltrap() at calltrap+0x8/frame 0xfffffe069ae28c70
--- trap 0xc, rip = 0xffffffff80b72503, rsp = 0xfffffe069ae28d40, rbp = 0xfffffe069ae28d70 ---
knote_drop_detached() at knote_drop_detached+0x113/frame 0xfffffe069ae28d70
knote_fdclose() at knote_fdclose+0x17f/frame 0xfffffe069ae28dc0
closefp_impl() at closefp_impl+0xa8/frame 0xfffffe069ae28e00
amd64_syscall() at amd64_syscall+0x169/frame 0xfffffe069ae28f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe069ae28f30
--- syscall (6, FreeBSD ELF64, close), rip = 0x82d1d232a, rsp = 0x858670b98, rbp = 0x858670bb0 ---
KDB: enter: panic

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff804b60a8 in db_fncall_generic (nargs=0, args=0xfffffe069ae28490, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:631
#3  db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:679
#4  0xffffffff804b5b2d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:508
#5  0xffffffff804b5c76 in db_command_script (command=command@entry=0xffffffff81bd7722 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:573
#6  0xffffffff804bba58 in db_script_exec (scriptname=scriptname@entry=0xfffffe069ae28660 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:301
#7  0xffffffff804bb952 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:323
#8  0xffffffff804b91e1 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:266
#9  0xffffffff80c23c0f in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe069ae289b0) at /usr/src/sys/kern/subr_kdb.c:790
#10 0xffffffff811318fd in trap (frame=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:697
#11 <signal handler called>
#12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556
#13 0xffffffff80bd0b8b in vpanic (fmt=0xffffffff812bd9d3 "%s", ap=ap@entry=0xfffffe069ae28be0) at /usr/src/sys/kern/kern_shutdown.c:962
#14 0xffffffff80bd09f3 in panic (fmt=0xffffffff81da22a0 <cnputs_mtx> "\325\376!\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:887
#15 0xffffffff81132082 in trap_fatal (frame=<optimized out>, eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:1028
#16 0xffffffff81132082 in trap_pfault (frame=0xfffffe069ae28c80, usermode=false, signo=<optimized out>, ucode=<optimized out>)
#17 <signal handler called>
#18 0xffffffff80b72503 in knote_drop_detached (kn=kn@entry=0xff0100772fd78668, td=td@entry=0xff01007278e4e780) at /usr/src/sys/kern/kern_event.c:2950
#19 0xffffffff80b7284f in knote_drop (td=0xff01007278e4e780, kn=<optimized out>) at /usr/src/sys/kern/kern_event.c:2915
#20 knote_fdclose (td=td@entry=0xff01007278e4e780, fd=fd@entry=211098) at /usr/src/sys/kern/kern_event.c:2875
#21 0xffffffff80b69fd8 in closefp_impl (fdp=0xfffffe0694c620c0, fd=211098, fp=0xff010004c3517c80, td=0xff01007278e4e780, audit=true) at /usr/src/sys/kern/kern_descrip.c:1413
#22 0xffffffff81132739 in syscallenter (td=0xff01007278e4e780) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#23 amd64_syscall (td=0xff01007278e4e780, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1267
#24 <signal handler called>
#25 0x000000082d1d232a in ?? ()
Backtrace stopped: Cannot access memory at address 0x858670b98
(kgdb) fr 18
#18 0xffffffff80b72503 in knote_drop_detached (kn=kn@entry=0xff0100772fd78668, td=td@entry=0xff01007278e4e780) at /usr/src/sys/kern/kern_event.c:2950
2950		SLIST_REMOVE(list, kn, knote, kn_link);
(kgdb) p *((struct eknote*)kn)
$1 = {
  k = {
    kn_link = {
      sle_next = 0x0
    },
    kn_selnext = {
      sle_next = 0xffffffffffffffff
    },
    kn_knlist = 0x0,
    kn_tqe = {
      tqe_next = 0xffffffffffffffff,
      tqe_prev = 0xffffffffffffffff
    },
    kn_kq = 0xff01000107b11500,
    kn_kevent = {
      ident = 76954,
      filter = -1,
      flags = 32,
      fflags = 0,
      data = 0,
      udata = 0x1b2102fcfc40,
      ext = {0, 0, 0, 0}
    },
    kn_hook = 0x0,
    kn_hookid = 0,
    kn_status = 8,
    kn_influx = 1,
    kn_sfflags = 0,
    kn_sdata = 0,
    kn_ptr = {
      p_fp = 0xff010062f4444af0,
      p_proc = 0xff010062f4444af0,
      p_aio = 0xff010062f4444af0,
      p_lio = 0xff010062f4444af0,
      p_prison = 0xff010062f4444af0,
      p_v = 0xff010062f4444af0
    },
    kn_fop = 0xffffffff814dd960 <soread_filtops>
  },
  c = {
    kn_link = {
      sle_next = 0x0
    },
    kn_selnext = {
      sle_next = 0x0
    },
    kn_knlist = 0x0,
    kn_tqe = {
      tqe_next = 0x0,
      tqe_prev = 0x0
    },
    kn_kq = 0x0,
    kn_kevent = {
      ident = 0,
      filter = 0,
      flags = 0,
      fflags = 0,
      data = 0,
      udata = 0x0,
      ext = {0, 0, 0, 0}
    },
    kn_hook = 0x0,
    kn_hookid = 0,
    kn_status = 0,
    kn_influx = 0,
    kn_sfflags = 0,
    kn_sdata = 0,
    kn_ptr = {
      p_fp = 0x0,
      p_proc = 0x0,
      p_aio = 0x0,
      p_lio = 0x0,
      p_prison = 0x0,
      p_v = 0x0
    },
    kn_fop = 0x0
  },
  on_kn_link = 0
}
(kgdb) p kq
$2 = (struct kqueue *) 0xff01000107b11500
(kgdb) p *kq
$3 = {
  kq_lock = {
    lock_object = {
      lo_name = 0xffffffff813464c6 "kqueue",
      lo_flags = 21168128,
      lo_data = 0,
      lo_witness = 0xff0100804bd8db80
    },
    mtx_lock = 18374968446302873472
  },
  kq_refcnt = 0,
  kq_list = {
    tqe_next = 0xff010001dd4fac00,
    tqe_prev = 0xff010077e823d828
  },
  kq_head = {
    tqh_first = 0x0,
    tqh_last = 0xff01000107b11538
  },
  kq_count = 0,
  kq_sel = {
    si_tdlist = {
      tqh_first = 0x0,
      tqh_last = 0x0
    },
    si_note = {
      kl_list = {
        slh_first = 0x0
      },
      kl_lock = 0xffffffff80b71fc0 <knlist_mtx_lock>,
      kl_unlock = 0xffffffff80b71fe0 <knlist_mtx_unlock>,
      kl_assert_lock = 0xffffffff80b72000 <knlist_mtx_assert_lock>,
      kl_lockarg = 0xff01000107b11500,
      kl_autodestroy = 0
    },
    si_mtx = 0x0
  },
  kq_sigio = 0x0,
  kq_fdp = 0xfffffe0694c620c0,
  kq_state = 0,
  kq_knlistsize = 288512,
  kq_knlist = 0xfffffe09ce3fe000,
  kq_knhashmask = 0,
  kq_knhash = 0x0,
  kq_task = {
    ta_link = {
      stqe_next = 0x0
    },
    ta_pending = 0,
    ta_priority = 0 '\000',
    ta_flags = 0 '\000',
    ta_func = 0xffffffff80b748a0 <kqueue_task>,
    ta_context = 0xff01000107b11500
  },
  kq_cred = 0xff010001dd445900,
  kq_forksrc = 0x0
}
(kgdb) p list
$4 = <optimized out>
p kq->kq_knlist[kn->kn_kevent.ident]
$6 = {
  slh_first = 0x0
}
(kgdb) p &kq->kq_knlist[kn->kn_kevent.ident]
$7 = (struct klist *) 0xfffffe09ce4944d0


Please, tell us if you need anything else.
Comment 44 Kyle Evans freebsd_committer freebsd_triage 2026-04-08 19:57:26 UTC
I guess there's still some kind of confusion:

#21 0xffffffff80b69fd8 in closefp_impl (fdp=0xfffffe0694c620c0, fd=211098, fp=0xff010004c3517c80, td=0xff01007278e4e780, audit=true)
[...]
    kn_ptr = {
      p_fp = 0xff010062f4444af0,
      p_proc = 0xff010062f4444af0,
      p_aio = 0xff010062f4444af0,
      p_lio = 0xff010062f4444af0,
      p_prison = 0xff010062f4444af0,
      p_v = 0xff010062f4444af0
    },


Does 0xff010062f4444af0 still look like a valid `struct file *`, and maybe more importantly: does that match what's at fdp->fd_files->fdt_ofiles[211098].fde_file or fdp->fd_files->fdt_ofiles[76954].fde_file?
Comment 45 Paul 2026-04-09 06:26:14 UTC
Looks like it:

(kgdb) p *((struct file *)0xff010062f4444af0)
$1 = {
  f_flag = 7,
  f_count = 2,
  f_data = 0xff01004a38dc5c00,
  f_ops = 0xffffffff814dc3a0 <socketops>,
  f_vnode = 0x0,
  f_cred = 0xff010001dd445900,
  f_type = 2,
  f_vflags = 0,
  {
    f_seqcount = {0, 0},
    f_pipegen = 0
  },
  f_nextoff = {0, 0},
  f_vnun = {
    fvn_cdevpriv = 0x0,
    fvn_advice = 0x0
  },
  f_offset = 0
}
Comment 46 Paul 2026-04-09 06:27:55 UTC
(In reply to Paul from comment #45)

Sorry, trigger happy fingers... this is the followup.

(kgdb) p fdp->fd_files->fdt_ofiles[211098].fde_file
$2 = (struct file *) 0x0
(kgdb) p fdp->fd_files->fdt_ofiles[76954].fde_file
$3 = (struct file *) 0xff010062f4444af0
(kgdb) p *fdp->fd_files->fdt_ofiles[76954].fde_file
$4 = {
  f_flag = 7,
  f_count = 2,
  f_data = 0xff01004a38dc5c00,
  f_ops = 0xffffffff814dc3a0 <socketops>,
  f_vnode = 0x0,
  f_cred = 0xff010001dd445900,
  f_type = 2,
  f_vflags = 0,
  {
    f_seqcount = {0, 0},
    f_pipegen = 0
  },
  f_nextoff = {0, 0},
  f_vnun = {
    fvn_cdevpriv = 0x0,
    fvn_advice = 0x0
  },
  f_offset = 0
}
Comment 47 Paul 2026-04-09 06:46:36 UTC
(In reply to Paul from comment #46)

I couldn't help but notice that the difference between 211098 and 76954 is 131 * 1024.

So far we have:

211098 - 76954 = 131 * 1024

298277 − 297765 = 512

544408 − 138904 = 99 * 4096

How likely is it that this is just a coincidence? Looks like 512 is the common base, so far. So, the likelihood of this being just a fluke is like 1 / (512 ^ 3). Very unlikely.
Comment 48 Kyle Evans freebsd_committer freebsd_triage 2026-04-09 12:20:12 UTC
(In reply to Paul from comment #46)

For shits and giggles, can we take a look at kq->kq_knlist[idx] for both fds as well, please? I guess there's likely only one knote at most per fd here
Comment 49 Paul 2026-04-09 13:13:34 UTC
(In reply to Kyle Evans from comment #48)

No problem at all, and...

(kgdb) p kq->kq_knlist[76954]
$4 = {
  slh_first = 0x0
}
(kgdb) p kq->kq_knlist[211098]
$5 = {
  slh_first = 0xff0100772fd78668
}
(kgdb) p *kq->kq_knlist[211098].slh_first
$6 = {
  kn_link = {
    sle_next = 0x0
  },
  kn_selnext = {
    sle_next = 0xffffffffffffffff
  },
  kn_knlist = 0x0,
  kn_tqe = {
    tqe_next = 0xffffffffffffffff,
    tqe_prev = 0xffffffffffffffff
  },
  kn_kq = 0xff01000107b11500,
  kn_kevent = {
    ident = 76954,
    filter = -1,
    flags = 32,
    fflags = 0,
    data = 0,
    udata = 0x1b2102fcfc40,
    ext = {0, 0, 0, 0}
  },
  kn_hook = 0x0,
  kn_hookid = 0,
  kn_status = 8,
  kn_influx = 1,
  kn_sfflags = 0,
  kn_sdata = 0,
  kn_ptr = {
    p_fp = 0xff010062f4444af0,
    p_proc = 0xff010062f4444af0,
    p_aio = 0xff010062f4444af0,
    p_lio = 0xff010062f4444af0,
    p_prison = 0xff010062f4444af0,
    p_v = 0xff010062f4444af0
  },
  kn_fop = 0xffffffff814dd960 <soread_filtops>
}


76954 seems to be on the list of 211098. But this perfectly matches preceding cases: some mixing of unrelated fds.
Comment 50 Konstantin Belousov freebsd_committer freebsd_triage 2026-04-09 20:19:10 UTC
Please drop any previous patches you have.
Then update to the latest main sources, which have several asserts added by
Kyle and me.
Then apply only the following change and see if the issue is reproducible.

diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c
index bb5203b8cb0b..ff3ae96a452f 100644
--- a/sys/kern/kern_event.c
+++ b/sys/kern/kern_event.c
@@ -2043,8 +2043,13 @@ kqueue_expand(struct kqueue *kq, const struct filterops *fops, uintptr_t ident,
 				to_free = list;
 			} else {
 				if (kq->kq_knlist != NULL) {
-					bcopy(kq->kq_knlist, list,
-					    kq->kq_knlistsize * sizeof(*list));
+					volatile char *src, *dst;
+					src = (void *)kq->kq_knlist;
+					dst = (void *)list;
+					for (size_t i = 0;
+					    i < kq->kq_knlistsize * sizeof(*list);
+					    i++)
+						dst[i] = src[i];
 					to_free = kq->kq_knlist;
 					kq->kq_knlist = NULL;
 				}
Comment 51 Kyle Evans freebsd_committer freebsd_triage 2026-04-10 00:02:15 UTC
(In reply to Paul from comment #49)

Just thinking out loud for a minute, we know these facts:

 1.) In slot 211098 of this kq we found this knote that thinks it is for fd 76954 and has an fp that matches fd 76954 in the fd table
 2.) The kq_knlist only ever grows, never shrinks, and in exactly one place: kqueue_expand
 3.) kq_knlist / kn_link usage is actually pretty minimal and reasonably easy to audit
 4.) knote_attach is the only place adding anything to kq_knlist

knote_attach() is incredibly straightforward and clearly under the kq lock:

2880         if (kn->kn_fop->f_isfd) {                                                                        
2881                 if (kn->kn_id >= kq->kq_knlistsize)                                                      
2882                         return (ENOMEM);                                                                 
2883                 list = &kq->kq_knlist[kn->kn_id];                                                        
2884         } else {                                                                                         
2885                 if (kq->kq_knhash == NULL)                                                               
2886                         return (ENOMEM);                                                                 
2887                 list = &kq->kq_knhash[KN_HASH(kn->kn_id, kq->kq_knhashmask)];                            
2888         }                                                                                                
2889         SLIST_INSERT_HEAD(list, kn, kn_link);                                                            

That's clearly checking the size correctly beforre it inserts the knote.  The only removal is in knote_drop_detached():

2925         if (kn->kn_fop->f_isfd)                                                                                                                                                                                           
2926                 list = &kq->kq_knlist[kn->kn_id];                                                                                                                                                                         
2927         else                                                                                                                                                                                                              
2928                 list = &kq->kq_knhash[KN_HASH(kn->kn_id, kq->kq_knhashmask)];                                                                                                                                             
2929                                                                                                                                                                                                                           
2930         if (!SLIST_EMPTY(list))                                                                                                                                                                                           
2931                 SLIST_REMOVE(list, kn, knote, kn_link);                                                                                                                                                                   

I think we all agree that !SLIST_EMPTY(list) should actually be an assertion at best, there's no path before the call to knote_attach() in kqueue_register() that would call knote_drop*() and you can't double-drop a knote.

I really don't see many plausible causes here, outside of kqueue_expand going wrong somehow, but the disassembly seems sane (and not obviously eliding important things) and the bzero calculations, while kind of convoluted, seem fine.  I probably would've written it as bzero(&list[kq->kq_knlistsize], (size - kq->kq_knlistsize) * sizeof(*list)) personally, but I don't see a world in which those aren't equivalent.
Comment 52 Paul 2026-04-10 07:53:48 UTC
(In reply to Konstantin Belousov from comment #50)

Hi,

Thanks for the patch. Though, could't help but also include the assert implied by Kyle. It should't hurt to have an extra check, I hope. This is the final patch that we've applied to the current HEAD:


diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c
index 01731ca46b6b..38b67d81dff2 100644
--- a/sys/kern/kern_event.c
+++ b/sys/kern/kern_event.c
@@ -2037,8 +2037,13 @@ kqueue_expand(struct kqueue *kq, const struct filterops *fops, uintptr_t ident,
                                to_free = list;
                        } else {
                                if (kq->kq_knlist != NULL) {
-                                       bcopy(kq->kq_knlist, list,
-                                           kq->kq_knlistsize * sizeof(*list));
+                                       volatile char *src, *dst;
+                                       src = (void *)kq->kq_knlist;
+                                       dst = (void *)list;
+                                       for (size_t i = 0;
+                                           i < kq->kq_knlistsize * sizeof(*list);
+                                           i++)
+                                               dst[i] = src[i];
                                        to_free = kq->kq_knlist;
                                        kq->kq_knlist = NULL;
                                }
@@ -2953,8 +2958,9 @@ knote_drop_detached(struct knote *kn, struct thread *td)
        else
                list = &kq->kq_knhash[KN_HASH(kn->kn_id, kq->kq_knhashmask)];
 
-       if (!SLIST_EMPTY(list))
-               SLIST_REMOVE(list, kn, knote, kn_link);
+       KASSERT(!SLIST_EMPTY(list),
+               ("knote %p is not on the list of %lu", kn, kn->kn_id));
+       SLIST_REMOVE(list, kn, knote, kn_link);
        if (kn->kn_status & KN_QUEUED)
                knote_dequeue(kn);
        KQ_UNLOCK_FLUX(kq);
Comment 53 Konstantin Belousov freebsd_committer freebsd_triage 2026-04-10 09:13:29 UTC
(In reply to Kyle Evans from comment #51)
If you see my last debugging patch, my main suspicious place right now is bcopy().
The values of the fd as reported give quite huge size for bzero, and I suspect that
this is the exceptionally large value for bcopy, if not unique.  So I am trying to
open-code bcopy() to see if memcpy_erms() might be to blame.  It might be not bcopy()
itself but a cpu bug with extra-large %rcx for rep;movsq for instance.  Just
speculating.

(In reply to Paul from comment #52)
This is fine.  But SLIST_REMOVE() must deref NULL if the element is not found.
Also see https://reviews.freebsd.org/D56341
Comment 54 Kyle Evans freebsd_committer freebsd_triage 2026-04-10 11:43:56 UTC
(In reply to Konstantin Belousov from comment #53)

FWIW- the person with that fdgrowtable issue I had mailed about has also narrowed in on the possibility of it being weird collateral damage from memcpy, also with erms (on select AMD hardware). Their case looks a little different, though.
Comment 55 Konstantin Belousov freebsd_committer freebsd_triage 2026-04-10 12:10:19 UTC
(In reply to Kyle Evans from comment #54)
Note that rep;movs and like are very special kind of instructions in x86 ISA.
They are interruptible, and CPU must update %rsi%rdi/%rcx to reflect the progress.
Of course for short moves interrupts are kept pending, but for large they are not,
and I believe that the case of the reporter which copies hundreds of KBs, if not some
MBs, should be.

That said, could you share some details about the case, and how it was solved?
Did the microcode update resolved it?  I lost access to AMD internal docs some
time ago.
Comment 56 Kyle Evans freebsd_committer freebsd_triage 2026-04-10 12:27:21 UTC
(In reply to Konstantin Belousov from comment #55)

Their investigation is still ongoing and they haven't proven that it's erms yet, but here are their observations:

At https://cgit.freebsd.org/src/tree/sys/kern/kern_descrip.c#n2028 in fdgrowtable() ntable is allocated and ntable->fdt_nfiles is assigned.  Before the memcpy, fdt_nfiles can be observed to be the correct value.   A sample from their latest known panic was 7133184 bytes, and they note that the destination is never page unaligned (0xfffffe0423607008 in this last one) as a maybe-relevant fact for their case.

Immediately after the memcpy into ntable->fdt_ofiles, the start of `ntable` appears to be damaged -- fdt_nfiles is suddenly zeroed out, and in this last sample the damage was to the tune of ~2 pages at the beginning zeroed out as well.  They noted that a significant number of entries at the end of the old table were zeroed out, so it may be the case that the damage is actually those overwriting the beginning.
Comment 57 Kyle Evans freebsd_committer freebsd_triage 2026-04-10 12:28:07 UTC
(In reply to Kyle Evans from comment #56)

Also re: micrrocode, they have the latest applied in all cases.
Comment 58 Konstantin Belousov freebsd_committer freebsd_triage 2026-04-10 12:36:34 UTC
(In reply to Kyle Evans from comment #57)
Can you obtain the output from etallen-cpuid -1?
I am curious if this is the same model as in this PR.
Comment 59 Kyle Evans freebsd_committer freebsd_triage 2026-04-10 17:20:05 UTC
(In reply to Konstantin Belousov from comment #58)

I've asked them to report the results to this PR; this was their summary of high-level models it's reproduced on from earlier messages:

 - Have reproduced on: EPYC 9274F w/768GB, EPYC 9375F w/768GB, EPYC 8224P w/192GB
 - Have not reproduced on: EPYC 8024P w/64GB, Ryzen 9 7950X w/128GB, Ryzen 9 5950X w/128GB, Ryzen TRP 3975WX w/256GB
Comment 60 Kristofer Peterson 2026-04-11 01:52:41 UTC
Hi, I don't know if this is related but a while ago I experienced kernel panics under 14.3 and 15.0 on AMD EPYC 9274F and 9375F boxes with 768GB RAM running a multithreaded app with a large number of open files (some sockets but mostly memory mapped files) and a high rate of file descriptor "churn" (mmap'd files are frequently created, written, unmapped and deleted). The panics occurred during fork/exec calls and examination of the crash dumps showed corrupted file tables.

When running debug kernels, the one place that has consistently come up in the kernel panics is do_fork/fdcopy/fdgrowtable, where fdgrowtable is repeatedly called to create and expand a file table as other threads in the parent continue to open new files. fdgrowtable repeatedly allocates and copies an ever increasing file table until eventually fdt_nfiles at the start of the growing table is read back inexplicably as zero which fails the KASSERT at the start of fdgrowtable. The file tables are typically in the range of 250k to 500k entries in size when this happens which, given that the entries are 48 bytes in size, means the memory allocations and copies are on the order of several megabytes.

I have written a small kernel module and a test program that repeatedly allocates, fills, copies, checks and frees memory which I will shortly attach to this bug report.

I have only seen this problem with AMD Zen4 & Zen5 EPYC machines, I have not been able to reproduce this on older CPUs.
Comment 61 Kristofer Peterson 2026-04-11 01:56:38 UTC
Created attachment 269604 [details]
Kernel module for testing memory allocation and copying

Kernel module that implements a syscall for testing memory allocation and copying.
Comment 62 Kristofer Peterson 2026-04-11 01:57:46 UTC
Created attachment 269605 [details]
Userspace utility for calling memory testing kernel module
Comment 63 Kristofer Peterson 2026-04-11 02:17:29 UTC
I have attached kmod2.c (a kernel module) and call2.c (a program that invokes kmod2.c).

kmod2.ko can be built with the following makefile.
-------- kmod2.mk --------
KMOD = kmod2
SRCS = kmod2.c
.include <bsd.kmod.mk>
--------------------------

call2 can be built with the following command line:
cc -o call2 call2.c -lthr

call2 assumes the syscall in kmod2 is assigned syscall number 210 on loading; if this not the cases on your machine, edit call2.c accordingly.

This will run 1 thread, zeroed allocations, starting memory size of 64k, 32 iterations, increase of 4k each iteration:
$ call2 1 5 $((1<<16)) 32 4096

This will run 3 threads, not-zeroed allocations, starting memory size of 16M, 128 iterations, increase of 16k each iteration:
$ call2 3 5 $((1<<24)) 128 16384

dmesg will normally show lines such as:
memkpy/syscall: enter pid:32426 tid:117355 4 16777216 128 16384
memkpy/syscall: exit pid:32426 tid:117355 4 16777216 128 16384 ilog:516 index:515 tsc:5520429094997775 event:4

If the memory checks fail, dmesg output (with kern.msgbuf_show_timestamp="2") will look like this:
[375862.366839] memkpy/syscall: enter pid:9651 tid:191508 4 33554432 128 16384
[375862.787488] memkpy/check_block: failed at ptr:0xfffffe10d50e4000 data:0xfffffe10d50a8000 size:35356672
...
[375863.168260] memkpy/check_block: failed at ptr:0xfffffe10d513d000 data:0xfffffe10d5138000 size:35651584
[375863.178042] memkpy/syscall: exit pid:9651 tid:191508 4 33554432 128 16384 ilog:516 index:515 tsc:1428744768499442 event:4

This is an unfinished work in progress but hopefully it is of some assistance.
Comment 64 Kristofer Peterson 2026-04-11 02:23:39 UTC
Created attachment 269607 [details]
AMD EPYC 9375F: output of cpuid-etallen -1
Comment 65 Konstantin Belousov freebsd_committer freebsd_triage 2026-04-11 19:06:00 UTC
(In reply to Kristofer Peterson from comment #64)
So we have some lead.  There is a public errata document
https://docs.amd.com/v/u/en-US/58251_PUB_1.30

Look for the errata
1567 Instruction State May be Corrupted
during REP CMPSB or REP SCASB
Description

During the execution of a REP CMPSB or REP SCASB, the instruction state may be corrupted.

Potential Effect on System

Unpredictable system behavior

Suggested Workaround

System software may contain the workaround for this erratum.

Fix Planned
No fix planned

It only says something about cmps and scas, not about movs.  But this is very close.
Also, the revision is from July 2025, I did not found anything newer.