Hi! We've been using 14.4-STABLE for some time now and today a weird issue has popped up. All of the sudden, our multi-threaded network app has deadlocked on some threads, but not on others. We weren't able to neither attach to it with GDB nor kill it with -9. Hard lock inside the kernel. We've managed to collect a few samples of kernel backtrace for this process with `procstat -kk`. All, basically, identical: PID TID COMM TDNAME KSTACK 91545 101569 <redacted> - mi_switch+0xbd _sx_xlock_hard+0x4ef kern_close+0x179 amd64_syscall+0x117 fast_syscall_common+0xf8 91545 102281 <redacted> - mi_switch+0xbd _sx_xlock_hard+0x4ef kern_close+0x179 amd64_syscall+0x117 fast_syscall_common+0xf8 91545 102282 <redacted> <redacted-1> mi_switch+0xbd sleepq_catch_signals+0x2a2 sleepq_timedwait_sig+0x12 _sleep+0x1c1 umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x117 fast_syscall_common+0xf8 91545 102283 <redacted> <redacted-2> mi_switch+0xbd sleepq_catch_signals+0x2a2 sleepq_timedwait_sig+0x12 _sleep+0x1c1 kqueue_scan+0xa11 kqueue_kevent+0x13b kern_kevent_fp+0x4b kern_kevent_generic+0xdf sys_kevent+0x61 amd64_syscall+0x117 fast_syscall_common+0xf8 91545 102284 <redacted> <redacted-3> mi_switch+0xbd _sleep+0x1f3 knote_fdclose+0xac closefp_impl+0xd0 amd64_syscall+0x117 fast_syscall_common+0xf8 91545 102285 <redacted> <redacted-4> mi_switch+0xbd sleepq_catch_signals+0x2a2 sleepq_timedwait_sig+0x12 _sleep+0x1c1 kqueue_scan+0xa11 kqueue_kevent+0x13b kern_kevent_fp+0x4b kern_kevent_generic+0xdf sys_kevent+0x61 amd64_syscall+0x117 fast_syscall_common+0xf8 91545 102286 <redacted> <redacted-5> mi_switch+0xbd sleepq_catch_signals+0x2a2 sleepq_timedwait_sig+0x12 _sleep+0x1c1 kqueue_scan+0xa11 kqueue_kevent+0x13b kern_kevent_fp+0x4b kern_kevent_generic+0xdf sys_kevent+0x61 amd64_syscall+0x117 fast_syscall_common+0xf8 Apparently, three threads were deadlocked: first two, that are unnamed and `<redacted-3>`. The last one is the thread that is handling inbound socket connections. Hundreds of thousands of them, mostly WebSocket. Two other threads also use sockets, but for outbound connections. During normal operation, sockets are being open and closed as needed, obviously. Seems like in some case this may lead to a deadlock. Where one thread enters some state in kernel where it hangs, holding the lock and preventing others from closing (or modifying descriptors generally). App is async and uses kqueue for networking sockets extensively. We suspect `<redacted-3>` to be the culprit, specifically its backtrace where `closefp_impl` is involved. And here's why. When this happened and the traffic was switched to a redundancy server, it almost immediately panicked and wend into reboot. Hopefully, we've got the core dump and were able to analyze it somewhat. And there we saw `closefp_impl` from within the same (not physically, different server) thread `<redacted-3>`: Fatal trap 12: page fault while in kernel mode cpuid = 22; apic id = 52 fault virtual address = 0x10 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff80572e28 stack pointer = 0x28:0xfffffe071c126d70 frame pointer = 0x28:0xfffffe071c126dc0 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 58518 (<redacted-3>) rdi: fffff83402622be0 rsi: 0000000000000000 rdx: 0000000000000000 rcx: 0000000000000000 r8: fffff80160b9c520 r9: fffffe071c127000 rax: 0000000000000000 rbx: 0000000000031361 rbp: fffffe071c126dc0 r10: 0000000000000001 r11: 0000000000002af8 r12: fffff80160b9c000 r13: fffff87af7163e18 r14: fffff83402622be0 r15: fffff87af7163e00 trap number = 12 panic: page fault cpuid = 22 time = 1771839128 KDB: stack backtrace: #0 0xffffffff8061303d at kdb_backtrace+0x5d #1 0xffffffff805c8091 at vpanic+0x161 #2 0xffffffff805c7f23 at panic+0x43 #3 0xffffffff80972f00 at trap_pfault+0x3e0 #4 0xffffffff8094af68 at calltrap+0x8 #5 0xffffffff8056b750 at closefp_impl+0xd0 #6 0xffffffff80973847 at amd64_syscall+0x117 #7 0xffffffff8094b85b at fast_syscall_common+0xf8 When inspecting it's kernel stack: (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:405 #2 0xffffffff805c7beb in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:523 #3 0xffffffff805c80e9 in vpanic (fmt=0xffffffff809d2ae7 "%s", ap=ap@entry=0xfffffe071c126c30) at /usr/src/sys/kern/kern_shutdown.c:967 #4 0xffffffff805c7f23 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:891 #5 0xffffffff80972f00 in trap_fatal (frame=<optimized out>, eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:1000 #6 0xffffffff80972f00 in trap_pfault (frame=0xfffffe071c126cb0, usermode=false, signo=<optimized out>, ucode=<optimized out>) #7 <signal handler called> #8 0xffffffff80572e28 in knote_drop (kn=0xfffff83402622be0, td=0xfffff80160b9c000) at /usr/src/sys/kern/kern_event.c:2730 #9 knote_fdclose (td=0xfffff80160b9c000, fd=201569) at /usr/src/sys/kern/kern_event.c:2695 #10 0xffffffff8056b750 in closefp_impl (fdp=0xfffffe0d1582a920, fd=0, fp=0xfffff81090d2c5a0, td=0xfffff80160b9c000, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320 #11 0xffffffff80973847 in syscallenter (td=0xfffff80160b9c000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #12 amd64_syscall (td=0xfffff80160b9c000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241 #13 <signal handler called> #14 0x000000082deed32a in ?? () Backtrace stopped: Cannot access memory at address 0x85d08dbc8 Within `knote_drop` we observe a null pointer access: (kgdb) fr 8 #8 0xffffffff80572e28 in knote_drop (kn=0xfffff83402622be0, td=0xfffff80160b9c000) at /usr/src/sys/kern/kern_event.c:2730 2730 kn->kn_fop->f_detach(kn); (kgdb) l 2725 static void 2726 knote_drop(struct knote *kn, struct thread *td) 2727 { 2728 2729 if ((kn->kn_status & KN_DETACHED) == 0) 2730 kn->kn_fop->f_detach(kn); 2731 knote_drop_detached(kn, td); 2732 } 2733 2734 static void (kgdb) p kn->kn_fop $2 = (const struct filterops *) 0x0 If you need more info, please ask. We will be glad to provide it. --------------- System info: FreeBSD frv21.ukr.net 14.4-STABLE FreeBSD 14.4-STABLE stable/14-n273658-2f91ff89c56e FRV21 amd64 1404500 1404500
It'd be useful to see the output of (kgdb) f 8 (kgdb) p/x *kn (kgdb) f 10 (kgdb) p/x *fp Is the problem reproducible? <redacted-3> is blocked, probably waiting for an in-flux knote to stabilize. It would be helpful to see all of the stacks for this process, as that might tell us why the knote was in-flux to begin with.
Hi Mark, Thanks for replying so quickly. Here's the output: (kgdb) f 8 #8 0xffffffff80572e28 in knote_drop (kn=0xfffff83402622be0, td=0xfffff80160b9c000) at /usr/src/sys/kern/kern_event.c:2730 2730 kn->kn_fop->f_detach(kn); (kgdb) p/x *kn $2 = { kn_link = { sle_next = 0x0 }, kn_selnext = { sle_next = 0x0 }, kn_knlist = 0x0, kn_tqe = { tqe_next = 0x0, tqe_prev = 0x0 }, kn_kq = 0x0, kn_kevent = { ident = 0x0, filter = 0x0, flags = 0x0, fflags = 0x0, data = 0x0, udata = 0x0, ext = {0x0, 0x0, 0x0, 0x0} }, kn_hook = 0x0, kn_hookid = 0x0, kn_status = 0x20, kn_influx = 0x1, kn_sfflags = 0x0, kn_sdata = 0x0, kn_ptr = { p_fp = 0x0, p_proc = 0x0, p_aio = 0x0, p_lio = 0x0, p_v = 0x0 }, kn_fop = 0x0 } (kgdb) f 10 #10 0xffffffff8056b750 in closefp_impl (fdp=0xfffffe0d1582a920, fd=0, fp=0xfffff81090d2c5a0, td=0xfffff80160b9c000, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320 1320 knote_fdclose(td, fd); (kgdb) p/x *fp $3 = { f_flag = 0x7, f_count = 0x1, f_data = 0xfffff81a3dbd4000, f_ops = 0xffffffff80a673d0, f_vnode = 0x0, f_cred = 0xfffff8014c5acb00, f_type = 0x2, f_vflags = 0x0, { f_seqcount = {0x0, 0x0}, f_pipegen = 0x0 }, f_nextoff = {0x0, 0x0}, f_vnun = { fvn_cdevpriv = 0x0, fvn_advice = 0x0 }, f_offset = 0x0 } Unfortunately, it is not reproducible. We've been using this system for around a month under this load. And it's the first time it happened. The other server is using it for around two months. Probably, Monday morning has something to do with it. As it's usually the time when the most users are active at the same time. We've been using this software for years, on earlier versions of FreeBSD and had basically no critical issues. However, this generation of servers are order of magnitude faster than what we were using before. Additionally, we've go a 100GbE card in there. And we've already faced a kernel panic regarding `int` counter ofverflow within a network stack. That was on 14.3-STABLE and we've reported it in August. Hopefully, shrinking `kern.ipc.maxsockbuf` helped in that case. I'm saying all this because we suspect that this may be a similar issue: some race condition within a kernel, that was impossible to reproduce on slower hardware, now can be reproduced, occasionally. Stacks of other threads: <unnamed-1> #0 sched_switch (td=td@entry=0xfffff80d4121b740, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2437 #1 0xffffffff805d57ad in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:548 #2 0xffffffff80622c19 in sleepq_switch (wchan=wchan@entry=0xfffff87af7163e00, pri=pri@entry=46) at /usr/src/sys/kern/subr_sleepqueue.c:608 #3 0xffffffff80623042 in sleepq_catch_signals (wchan=wchan@entry=0xfffff87af7163e00, pri=46) at /usr/src/sys/kern/subr_sleepqueue.c:523 #4 0xffffffff80623242 in sleepq_timedwait_sig (wchan=<unavailable>, wchan@entry=0xfffff87af7163e00, pri=<unavailable>) at /usr/src/sys/kern/subr_sleepqueue.c:704 #5 0xffffffff805d4ed1 in _sleep (ident=ident@entry=0xfffff87af7163e00, lock=lock@entry=0xfffff87af7163e00, priority=priority@entry=302, wmesg=<optimized out>, sbt=sbt@entry=4671013802915906, pr=pr@entry=382252, flags=512) at /usr/src/sys/kern/kern_synch.c:217 #6 0xffffffff80574121 in kqueue_scan (kq=kq@entry=0xfffff87af7163e00, maxevents=maxevents@entry=128, k_ops=k_ops@entry=0xfffffe07027f4de0, tsp=tsp@entry=0xfffffe07027f4d38, keva=keva@entry=0xfffffe07027f4a90, td=td@entry=0xfffff80d4121b740) at /usr/src/sys/kern/kern_event.c:1983 #7 0xffffffff80571bdb in kqueue_kevent (kq=kq@entry=0xfffff87af7163e00, td=td@entry=0xfffff80d4121b740, nchanges=<optimized out>, nchanges@entry=0, nevents=<optimized out>, nevents@entry=128, k_ops=k_ops@entry=0xfffffe07027f4de0, timeout=0xfffffe07027f4d38) at /usr/src/sys/kern/kern_event.c:1377 #8 0xffffffff8057193b in kern_kevent_fp (td=td@entry=0xfffff80d4121b740, fp=<optimized out>, nchanges=nchanges@entry=0, nevents=nevents@entry=128, k_ops=k_ops@entry=0xfffffe07027f4de0, timeout=<unavailable>, timeout@entry=0xfffffe07027f4d38) at /usr/src/sys/kern/kern_event.c:1390 #9 0xffffffff805715cf in kern_kevent (td=0xfffff80d4121b740, fd=<optimized out>, nchanges=0, nevents=128, k_ops=0xfffffe07027f4de0, timeout=0xfffffe07027f4d38) at /usr/src/sys/kern/kern_event.c:1330 #10 kern_kevent_generic (td=0xfffff80d4121b740, uap=uap@entry=0xfffffe07027f4db0, k_ops=k_ops@entry=0xfffffe07027f4de0, struct_name=0xffffffff809ea8f0 "kevent") at /usr/src/sys/kern/kern_event.c:1186 #11 0xffffffff80571481 in sys_kevent (td=<unavailable>, uap=<optimized out>) at /usr/src/sys/kern/kern_event.c:1159 #12 0xffffffff80973847 in syscallenter (td=0xfffff80d4121b740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #13 amd64_syscall (td=0xfffff80d4121b740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241 #14 <signal handler called> #15 0x000000082deed3ea in ?? () <unnamed-2> #0 sched_switch (td=td@entry=0xfffff806652e6000, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2437 #1 0xffffffff805d57ad in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:548 #2 0xffffffff80622c19 in sleepq_switch (wchan=wchan@entry=0xfffff8014c472300, pri=pri@entry=0) at /usr/src/sys/kern/subr_sleepqueue.c:608 #3 0xffffffff80623042 in sleepq_catch_signals (wchan=wchan@entry=0xfffff8014c472300, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:523 #4 0xffffffff80622d89 in sleepq_wait_sig (wchan=<unavailable>, wchan@entry=0xfffff8014c472300, pri=<unavailable>) at /usr/src/sys/kern/subr_sleepqueue.c:671 #5 0xffffffff805d4ef6 in _sleep (ident=0xfffff8014c472300, lock=lock@entry=0xffffffff81105070 <umtxq_chains+43360>, priority=priority@entry=768, wmesg=wmesg@entry=0xffffffff809adf23 "uwait", sbt=sbt@entry=0, pr=pr@entry=0, flags=0) at /usr/src/sys/kern/kern_synch.c:221 #6 0xffffffff805e94ed in umtxq_sleep (uq=uq@entry=0xfffff8014c472300, wmesg=0xffffffff809adf23 "uwait", timo=0x0) at /usr/src/sys/kern/kern_umtx.c:840 #7 0xffffffff805eed24 in do_wait (td=td@entry=0xfffff806652e6000, addr=<optimized out>, id=0, timeout=timeout@entry=0x0, compat32=compat32@entry=1, is_private=is_private@entry=1) at /usr/src/sys/kern/kern_umtx.c:1313 #8 0xffffffff805ecdc4 in __umtx_op_wait_uint_private (td=0xfffff806652e6000, uap=0xfffffe071c199dd8, ops=<optimized out>) at /usr/src/sys/kern/kern_umtx.c:3986 #9 0xffffffff805eae5e in kern__umtx_op (td=<unavailable>, obj=<optimized out>, val=<unavailable>, uaddr1=<unavailable>, uaddr2=<optimized out>, ops=<unavailable>, op=<optimized out>) at /usr/src/sys/kern/kern_umtx.c:4995 #10 sys__umtx_op (td=<unavailable>, uap=<optimized out>) at /usr/src/sys/kern/kern_umtx.c:5020 #11 0xffffffff80973847 in syscallenter (td=0xfffff806652e6000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #12 amd64_syscall (td=0xfffff806652e6000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241 #13 <signal handler called> #14 0x00000008221050ac in ?? () <redacted-1> #0 sched_switch (td=td@entry=0xfffff80dd2a60740, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2437 #1 0xffffffff805d57ad in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:548 #2 0xffffffff80622c19 in sleepq_switch (wchan=wchan@entry=0xfffff80cadb2ab00, pri=pri@entry=0) at /usr/src/sys/kern/subr_sleepqueue.c:608 #3 0xffffffff80623042 in sleepq_catch_signals (wchan=wchan@entry=0xfffff80cadb2ab00, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:523 #4 0xffffffff80623242 in sleepq_timedwait_sig (wchan=<unavailable>, wchan@entry=0xfffff80cadb2ab00, pri=<unavailable>) at /usr/src/sys/kern/subr_sleepqueue.c:704 #5 0xffffffff805d4ed1 in _sleep (ident=0xfffff80cadb2ab00, lock=lock@entry=0xffffffff8110d590 <umtxq_chains+77440>, priority=priority@entry=768, wmesg=wmesg@entry=0xffffffff809adf23 "uwait", sbt=sbt@entry=4671013812078118, pr=pr@entry=0, flags=512) at /usr/src/sys/kern/kern_synch.c:217 #6 0xffffffff805e94ed in umtxq_sleep (uq=uq@entry=0xfffff80cadb2ab00, wmesg=0xffffffff809adf23 "uwait", timo=0xfffffe0784849d20) at /usr/src/sys/kern/kern_umtx.c:840 #7 0xffffffff805eed24 in do_wait (td=td@entry=0xfffff80dd2a60740, addr=<optimized out>, id=0, timeout=timeout@entry=0xfffffe0784849d90, compat32=compat32@entry=1, is_private=is_private@entry=1) at /usr/src/sys/kern/kern_umtx.c:1313 #8 0xffffffff805ecdc4 in __umtx_op_wait_uint_private (td=0xfffff80dd2a60740, uap=0xfffffe0784849dd8, ops=<optimized out>) at /usr/src/sys/kern/kern_umtx.c:3986 #9 0xffffffff805eae5e in kern__umtx_op (td=<unavailable>, obj=<optimized out>, val=<unavailable>, uaddr1=<unavailable>, uaddr2=<optimized out>, ops=<unavailable>, op=<optimized out>) at /usr/src/sys/kern/kern_umtx.c:4995 #10 sys__umtx_op (td=<unavailable>, uap=<optimized out>) at /usr/src/sys/kern/kern_umtx.c:5020 #11 0xffffffff80973847 in syscallenter (td=0xfffff80dd2a60740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #12 amd64_syscall (td=0xfffff80dd2a60740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241 #13 <signal handler called> #14 0x00000008221050ac in ?? () <redacted-2> #0 sched_switch (td=td@entry=0xfffff80b2f110740, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2437 #1 0xffffffff805d57ad in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:548 #2 0xffffffff80622c19 in sleepq_switch (wchan=wchan@entry=0xfffff85578222600, pri=pri@entry=46) at /usr/src/sys/kern/subr_sleepqueue.c:608 #3 0xffffffff80623042 in sleepq_catch_signals (wchan=wchan@entry=0xfffff85578222600, pri=46) at /usr/src/sys/kern/subr_sleepqueue.c:523 #4 0xffffffff80623242 in sleepq_timedwait_sig (wchan=<unavailable>, wchan@entry=0xfffff85578222600, pri=<unavailable>) at /usr/src/sys/kern/subr_sleepqueue.c:704 #5 0xffffffff805d4ed1 in _sleep (ident=ident@entry=0xfffff85578222600, lock=lock@entry=0xfffff85578222600, priority=priority@entry=302, wmesg=<optimized out>, sbt=sbt@entry=4671017460503269, pr=pr@entry=263550736, flags=512) at /usr/src/sys/kern/kern_synch.c:217 #6 0xffffffff80574121 in kqueue_scan (kq=kq@entry=0xfffff85578222600, maxevents=maxevents@entry=128, k_ops=k_ops@entry=0xfffffe071c18fde0, tsp=tsp@entry=0xfffffe071c18fd38, keva=keva@entry=0xfffffe071c18fa90, td=td@entry=0xfffff80b2f110740) at /usr/src/sys/kern/kern_event.c:1983 #7 0xffffffff80571bdb in kqueue_kevent (kq=kq@entry=0xfffff85578222600, td=td@entry=0xfffff80b2f110740, nchanges=<optimized out>, nchanges@entry=0, nevents=<optimized out>, nevents@entry=128, k_ops=k_ops@entry=0xfffffe071c18fde0, timeout=0xfffffe071c18fd38) at /usr/src/sys/kern/kern_event.c:1377 #8 0xffffffff8057193b in kern_kevent_fp (td=td@entry=0xfffff80b2f110740, fp=<optimized out>, nchanges=nchanges@entry=0, nevents=nevents@entry=128, k_ops=k_ops@entry=0xfffffe071c18fde0, timeout=<unavailable>, timeout@entry=0xfffffe071c18fd38) at /usr/src/sys/kern/kern_event.c:1390 #9 0xffffffff805715cf in kern_kevent (td=0xfffff80b2f110740, fd=<optimized out>, nchanges=0, nevents=128, k_ops=0xfffffe071c18fde0, timeout=0xfffffe071c18fd38) at /usr/src/sys/kern/kern_event.c:1330 #10 kern_kevent_generic (td=0xfffff80b2f110740, uap=uap@entry=0xfffffe071c18fdb0, k_ops=k_ops@entry=0xfffffe071c18fde0, struct_name=0xffffffff809ea8f0 "kevent") at /usr/src/sys/kern/kern_event.c:1186 #11 0xffffffff80571481 in sys_kevent (td=<unavailable>, uap=<optimized out>) at /usr/src/sys/kern/kern_event.c:1159 #12 0xffffffff80973847 in syscallenter (td=0xfffff80b2f110740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #13 amd64_syscall (td=0xfffff80b2f110740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241 #14 <signal handler called> #15 0x000000082deed3ea in ?? () <redacted-3> <--- Cause of crash #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:405 #2 0xffffffff805c7beb in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:523 #3 0xffffffff805c80e9 in vpanic (fmt=0xffffffff809d2ae7 "%s", ap=ap@entry=0xfffffe071c126c30) at /usr/src/sys/kern/kern_shutdown.c:967 #4 0xffffffff805c7f23 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:891 #5 0xffffffff80972f00 in trap_fatal (frame=<optimized out>, eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:1000 #6 0xffffffff80972f00 in trap_pfault (frame=0xfffffe071c126cb0, usermode=false, signo=<optimized out>, ucode=<optimized out>) #7 <signal handler called> #8 0xffffffff80572e28 in knote_drop (kn=0xfffff83402622be0, td=0xfffff80160b9c000) at /usr/src/sys/kern/kern_event.c:2730 #9 knote_fdclose (td=0xfffff80160b9c000, fd=201569) at /usr/src/sys/kern/kern_event.c:2695 #10 0xffffffff8056b750 in closefp_impl (fdp=0xfffffe0d1582a920, fd=0, fp=0xfffff81090d2c5a0, td=0xfffff80160b9c000, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320 #11 0xffffffff80973847 in syscallenter (td=0xfffff80160b9c000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #12 amd64_syscall (td=0xfffff80160b9c000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241 #13 <signal handler called> #14 0x000000082deed32a in ?? () <redacted-4> #0 sched_switch (td=td@entry=0xfffff80dd249a740, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2437 #1 0xffffffff805d57ad in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:548 #2 0xffffffff80622c19 in sleepq_switch (wchan=wchan@entry=0xfffff86a10f16b00, pri=pri@entry=46) at /usr/src/sys/kern/subr_sleepqueue.c:608 #3 0xffffffff80623042 in sleepq_catch_signals (wchan=wchan@entry=0xfffff86a10f16b00, pri=46) at /usr/src/sys/kern/subr_sleepqueue.c:523 #4 0xffffffff80623242 in sleepq_timedwait_sig (wchan=<unavailable>, wchan@entry=0xfffff86a10f16b00, pri=<unavailable>) at /usr/src/sys/kern/subr_sleepqueue.c:704 #5 0xffffffff805d4ed1 in _sleep (ident=ident@entry=0xfffff86a10f16b00, lock=lock@entry=0xfffff86a10f16b00, priority=priority@entry=302, wmesg=<optimized out>, sbt=sbt@entry=4671013829273089, pr=pr@entry=2046015, flags=512) at /usr/src/sys/kern/kern_synch.c:217 #6 0xffffffff80574121 in kqueue_scan (kq=kq@entry=0xfffff86a10f16b00, maxevents=maxevents@entry=128, k_ops=k_ops@entry=0xfffffe0784853de0, tsp=tsp@entry=0xfffffe0784853d38, keva=keva@entry=0xfffffe0784853a90, td=td@entry=0xfffff80dd249a740) at /usr/src/sys/kern/kern_event.c:1983 #7 0xffffffff80571bdb in kqueue_kevent (kq=kq@entry=0xfffff86a10f16b00, td=td@entry=0xfffff80dd249a740, nchanges=<optimized out>, nchanges@entry=0, nevents=<optimized out>, nevents@entry=128, k_ops=k_ops@entry=0xfffffe0784853de0, timeout=0xfffffe0784853d38) at /usr/src/sys/kern/kern_event.c:1377 #8 0xffffffff8057193b in kern_kevent_fp (td=td@entry=0xfffff80dd249a740, fp=<optimized out>, nchanges=nchanges@entry=0, nevents=nevents@entry=128, k_ops=k_ops@entry=0xfffffe0784853de0, timeout=<unavailable>, timeout@entry=0xfffffe0784853d38) at /usr/src/sys/kern/kern_event.c:1390 #9 0xffffffff805715cf in kern_kevent (td=0xfffff80dd249a740, fd=<optimized out>, nchanges=0, nevents=128, k_ops=0xfffffe0784853de0, timeout=0xfffffe0784853d38) at /usr/src/sys/kern/kern_event.c:1330 #10 kern_kevent_generic (td=0xfffff80dd249a740, uap=uap@entry=0xfffffe0784853db0, k_ops=k_ops@entry=0xfffffe0784853de0, struct_name=0xffffffff809ea8f0 "kevent") at /usr/src/sys/kern/kern_event.c:1186 #11 0xffffffff80571481 in sys_kevent (td=<unavailable>, uap=<optimized out>) at /usr/src/sys/kern/kern_event.c:1159 #12 0xffffffff80973847 in syscallenter (td=0xfffff80dd249a740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #13 amd64_syscall (td=0xfffff80dd249a740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241 #14 <signal handler called> #15 0x000000082deed3ea in ?? () <redacted-5> #0 sched_switch (td=td@entry=0xfffff80dd2e41740, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2437 #1 0xffffffff805d57ad in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:548 #2 0xffffffff80622c19 in sleepq_switch (wchan=wchan@entry=0xfffff85d0d270b00, pri=pri@entry=46) at /usr/src/sys/kern/subr_sleepqueue.c:608 #3 0xffffffff80623042 in sleepq_catch_signals (wchan=wchan@entry=0xfffff85d0d270b00, pri=46) at /usr/src/sys/kern/subr_sleepqueue.c:523 #4 0xffffffff80623242 in sleepq_timedwait_sig (wchan=<unavailable>, wchan@entry=0xfffff85d0d270b00, pri=<unavailable>) at /usr/src/sys/kern/subr_sleepqueue.c:704 #5 0xffffffff805d4ed1 in _sleep (ident=ident@entry=0xfffff85d0d270b00, lock=lock@entry=0xfffff85d0d270b00, priority=priority@entry=302, wmesg=<optimized out>, sbt=sbt@entry=4671014242900761, pr=pr@entry=36144028, flags=512) at /usr/src/sys/kern/kern_synch.c:217 #6 0xffffffff80574121 in kqueue_scan (kq=kq@entry=0xfffff85d0d270b00, maxevents=maxevents@entry=128, k_ops=k_ops@entry=0xfffffe071c1a3de0, tsp=tsp@entry=0xfffffe071c1a3d38, keva=keva@entry=0xfffffe071c1a3a90, td=td@entry=0xfffff80dd2e41740) at /usr/src/sys/kern/kern_event.c:1983 #7 0xffffffff80571bdb in kqueue_kevent (kq=kq@entry=0xfffff85d0d270b00, td=td@entry=0xfffff80dd2e41740, nchanges=<optimized out>, nchanges@entry=0, nevents=<optimized out>, nevents@entry=128, k_ops=k_ops@entry=0xfffffe071c1a3de0, timeout=0xfffffe071c1a3d38) at /usr/src/sys/kern/kern_event.c:1377 #8 0xffffffff8057193b in kern_kevent_fp (td=td@entry=0xfffff80dd2e41740, fp=<optimized out>, nchanges=nchanges@entry=0, nevents=nevents@entry=128, k_ops=k_ops@entry=0xfffffe071c1a3de0, timeout=<unavailable>, timeout@entry=0xfffffe071c1a3d38) at /usr/src/sys/kern/kern_event.c:1390 #9 0xffffffff805715cf in kern_kevent (td=0xfffff80dd2e41740, fd=<optimized out>, nchanges=0, nevents=128, k_ops=0xfffffe071c1a3de0, timeout=0xfffffe071c1a3d38) at /usr/src/sys/kern/kern_event.c:1330 #10 kern_kevent_generic (td=0xfffff80dd2e41740, uap=uap@entry=0xfffffe071c1a3db0, k_ops=k_ops@entry=0xfffffe071c1a3de0, struct_name=0xffffffff809ea8f0 "kevent") at /usr/src/sys/kern/kern_event.c:1186 #11 0xffffffff80571481 in sys_kevent (td=<unavailable>, uap=<optimized out>) at /usr/src/sys/kern/kern_event.c:1159 #12 0xffffffff80973847 in syscallenter (td=0xfffff80dd2e41740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #13 amd64_syscall (td=0xfffff80dd2e41740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241 #14 <signal handler called> #15 0x000000082deed3ea in ?? ()
Just to emphasize. Original output of `procstat -kk` is different from the kernel crash. Original output was from one server. The crash is from the other. Original server, where the deadlock has occurred, has crashed soon thereafter. Unfortunately, it did not produce the dump for some reason. Second server had no symptoms, but crashed immediately when traffic was re-routed towards it. And it did produce the core dump. And it't the one we're analyzing in here. And seeing `kqueue_scan` in both cases makes us suspect that these are two symptoms of the same issue.
(In reply to Paul from comment #3) > And seeing `kqueue_scan` in both cases makes us suspect that these are two symptoms of the same issue. Yes, I think they are likely related. I wonder if you could try running your application on a GENERIC-DEBUG kernel? The output from comment 2 shows that the knote is a marker knote used during list scans, it's pretty weird that knote_fdclose() would encounter one, and I suspect that it reflects a bug elsewhere in the kqueue code. Using a debug kernel would have a performance penalty but will hopefully make it much easier to track this down.
Hi Mark, Not a problem at all. We are now running with GENERIC-DEBUG kernel. Hopefully it won't take too much time to reproduce. We'll stay in touch and report back as soon as it happens again.
Finally caught another crash. Alas I was a bit busy at the time and only reporting back now. The bad thing is: somehow it did not produce the /var/crash/vmcore.1. After the reboot it has spitted out: Starting syslogd. savecore 1735 - - reboot after panic: general protection fault Mar 3 12:14:25 frv21 savecore[1735]: reboot after panic: general protection fault savecore 1735 - - writing core to /var/crash/textdump.tar.1 Writing crash summary to /var/crash/core.txt.1. Starting chronyd. but... $ cat core.txt.1 /var/crash/vmcore.1 not found Available files are: config.txt core.txt.1 ddb.txt msgbuf.txt panic.txt textdump.tar.1 Last messages: $ tail -n 50 msgbuf.txt <5>Limiting tcp reset response from 67253 to 49997 packets/sec <5>Limiting tcp reset response from 67169 to 49995 packets/sec <5>Limiting tcp reset response from 65293 to 49993 packets/sec <5>Limiting tcp reset response from 64794 to 50015 packets/sec <5>Limiting tcp reset response from 65880 to 50012 packets/sec <5>Limiting tcp reset response from 65243 to 50000 packets/sec <5>Limiting tcp reset response from 71695 to 49993 packets/sec <5>Limiting tcp reset response from 79815 to 49998 packets/sec <5>Limiting tcp reset response from 71724 to 49997 packets/sec <5>Limiting tcp reset response from 67537 to 50005 packets/sec <5>Limiting tcp reset response from 65417 to 50000 packets/sec <5>Limiting tcp reset response from 75635 to 50009 packets/sec <5>Limiting tcp reset response from 71016 to 50005 packets/sec <5>Limiting tcp reset response from 78743 to 49986 packets/sec <4>TCP syncache overflow detected; using syncookies for the next 15 seconds <118>Mar 3 00:40:02 frv21 kernel: TCP syncache overflow detected; using syncookies for the next 15 seconds Fatal trap 9: general protection fault while in kernel mode cpuid = 3; apic id = 03 instruction pointer = 0x20:0xffffffff80b590fd stack pointer = 0x28:0xfffffe07e5854d70 frame pointer = 0x28:0xfffffe07e5854d70 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 86706 (<redacted-3>) rdi: deadc0dedeadc0f6 rsi: 0000000000000004 rdx: ffffffff811ab239 rcx: 0000000000000122 r8: 0000000000000001 r9: ffffffff81e1cba8 rax: fffff801be6b4000 rbx: 00000000000774c3 rbp: fffffe07e5854d70 r10: 0000000000000000 r11: 0000000000000004 r12: fffff8010bbfa418 r13: fffff84f0bab13c0 r14: 00000000000774c3 r15: fffff8010bbfa400 trap number = 9 panic: general protection fault cpuid = 3 time = 1772532522 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe07e5854af0 vpanic() at vpanic+0x161/frame 0xfffffe07e5854c20 panic() at panic+0x43/frame 0xfffffe07e5854c80 trap_fatal() at trap_fatal+0x68/frame 0xfffffe07e5854ca0 calltrap() at calltrap+0x8/frame 0xfffffe07e5854ca0 --- trap 0x9, rip = 0xffffffff80b590fd, rsp = 0xfffffe07e5854d70, rbp = 0xfffffe07e5854d70 --- __mtx_assert() at __mtx_assert+0x3d/frame 0xfffffe07e5854d70 knote_fdclose() at knote_fdclose+0x12e/frame 0xfffffe07e5854dc0 closefp_impl() at closefp_impl+0x96/frame 0xfffffe07e5854e00 amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe07e5854f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe07e5854f30 --- syscall (6, FreeBSD ELF64, close), rip = 0x82d3ea32a, rsp = 0x860151bb8, rbp = 0x860151bd0 --- Crash point still looks the same. We aren't sure why there is no core dump. Have we configured/built the kernel improperly? But, the last time, only one of the two crashes have produced it. Seems like it chance related and we have to keep trying?
And it crashed again, just now. Still no core dump. But this time it looks pretty interesting. It may be the hint of the problem: a missing lock, hence the 'racy' nature of the issue. <118>Starting syslogd. <118>savecore 1735 - - reboot after panic: general protection fault <118>Mar 3 12:14:25 frv21 savecore[1735]: reboot after panic: general protection fault <118>savecore 1735 - - writing core to /var/crash/textdump.tar.1 <118>Writing crash summary to /var/crash/core.txt.1. <118>Starting chronyd. <118>Starting local daemons:add net 10.0.0.0: gateway 10.10.255.254 <118>. <118>Mounting late filesystems:. <118>Starting powerd. <118>Starting snmpd. <118>Starting collectd. <118>Starting nrpe. <118>Starting cron. <118>Performing sanity check on sshd configuration. <118>Starting sshd. <118>Starting background file system checks in 60 seconds. <118> <118>Tue Mar 3 12:14:25 EET 2026 panic: mutex kqueue not owned at /usr/src/sys/kern/kern_event.c:290 cpuid = 24 time = 1772701913 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe07171f5810 vpanic() at vpanic+0x161/frame 0xfffffe07171f5940 panic() at panic+0x43/frame 0xfffffe07171f59a0 __mtx_assert() at __mtx_assert+0x9f/frame 0xfffffe07171f59b0 kqueue_register() at kqueue_register+0x758/frame 0xfffffe07171f5a30 kqueue_kevent() at kqueue_kevent+0xc9/frame 0xfffffe07171f5c90 kern_kevent_fp() at kern_kevent_fp+0x9b/frame 0xfffffe07171f5ce0 kern_kevent() at kern_kevent+0x82/frame 0xfffffe07171f5d40 kern_kevent_generic() at kern_kevent_generic+0x70/frame 0xfffffe07171f5da0 sys_kevent() at sys_kevent+0x61/frame 0xfffffe07171f5e00 amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe07171f5f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe07171f5f30 --- syscall (560, FreeBSD ELF64, kevent), rip = 0x82d3df3ea, rsp = 0x85ca5eb58, rbp = 0x85ca5ec40 --- KDB: enter: panic
(In reply to Paul from comment #7) From the dmesg lines, it looks like you're saving a textdump instead of a binary core dump. Are you setting the debug.ddb.textdump.pending sysctl?
These are our sysctl parameters: $ sysctl debug.ddb.textdump debug.ddb.textdump.do_version: 1 debug.ddb.textdump.do_panic: 1 debug.ddb.textdump.do_msgbuf: 1 debug.ddb.textdump.do_ddb: 1 debug.ddb.textdump.do_config: 1 debug.ddb.textdump.pending: 0 It can't be that, as dumps do appear, occasionally.
Hi, Absence of mem dumps was indeed an issue on our side. Finally we've got another panic with a memory dump, this time. This is the culprit: #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:405 #2 0xffffffff804a4718 in db_fncall_generic (nargs=0, args=0xfffffe2364ef9240, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:626 #3 db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:674 #4 0xffffffff804a418d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:504 #5 0xffffffff804a42d6 in db_command_script (command=command@entry=0xffffffff81bba6e2 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:569 #6 0xffffffff804a9578 in db_script_exec (scriptname=scriptname@entry=0xfffffe2364ef9410 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:302 #7 0xffffffff804a9472 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:324 #8 0xffffffff804a7531 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:267 #9 0xffffffff80bd0950 in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe2364ef9750) at /usr/src/sys/kern/subr_kdb.c:790 #10 0xffffffff810b3a07 in trap (frame=0xfffffe2364ef9750) at /usr/src/sys/amd64/amd64/trap.c:639 #11 <signal handler called> #12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556 #13 0xffffffff80b7fc2d in vpanic (fmt=0xffffffff81257899 "mutex %s not owned at %s:%d", ap=ap@entry=0xfffffe2364ef9980) at /usr/src/sys/kern/kern_shutdown.c:953 #14 0xffffffff80b7f9f3 in panic (fmt=0xffffffff81d853a0 <cnputs_mtx> "\233\327\031\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:891 #15 0xffffffff80b5915f in __mtx_assert (c=<optimized out>, what=what@entry=4, file=0xffffffff812c1e07 "/usr/src/sys/kern/kern_cons.c", line=-1070498644, line@entry=290) at /usr/src/sys/kern/kern_mutex.c:1092 #16 0xffffffff80b23738 in kn_enter_flux (kn=0xfffff873f4f1e820) at /usr/src/sys/kern/kern_event.c:290 #17 kqueue_register (kq=kq@entry=0xfffff80189cb1c00, kev=kev@entry=0xfffffe2364ef9a40, td=td@entry=0xfffff80155304740, mflag=mflag@entry=2) at /usr/src/sys/kern/kern_event.c:1699 #18 0xffffffff80b24729 in kqueue_kevent (kq=kq@entry=0xfffff80189cb1c00, td=td@entry=0xfffff80155304740, nchanges=nchanges@entry=1, nevents=nevents@entry=0, k_ops=k_ops@entry=0xfffffe2364ef9de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1359 #19 0xffffffff80b245eb in kern_kevent_fp (td=td@entry=0xfffff80155304740, fp=<optimized out>, nchanges=nchanges@entry=1, nevents=nevents@entry=0, k_ops=k_ops@entry=0xfffffe2364ef9de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1390 #20 0xffffffff80b24502 in kern_kevent (td=td@entry=0xfffff80155304740, fd=<optimized out>, nchanges=1, nevents=0, k_ops=k_ops@entry=0xfffffe2364ef9de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1330 #21 0xffffffff80b24200 in kern_kevent_generic (td=0xfffff80155304740, uap=uap@entry=0xfffffe2364ef9db0, k_ops=k_ops@entry=0xfffffe2364ef9de0, struct_name=0xffffffff8127f5b9 "kevent") at /usr/src/sys/kern/kern_event.c:1186 #22 0xffffffff80b240f1 in sys_kevent (td=0xffffffff81d853a0 <cnputs_mtx>, uap=<optimized out>) at /usr/src/sys/kern/kern_event.c:1159 #23 0xffffffff810b4f0a in syscallenter (td=0xfffff80155304740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #24 amd64_syscall (td=0xfffff80155304740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241 #25 <signal handler called> #26 0x000000082f6dc3ea in ?? () Seems like a missing mutex assert is failing, which perfectly explains the racy nature of the issue. (kgdb) fr 16 #16 0xffffffff80b23738 in kn_enter_flux (kn=0xfffff873f4f1e820) at /usr/src/sys/kern/kern_event.c:290 290 KQ_OWNED(kn->kn_kq); (kgdb) l 285 286 static void 287 kn_enter_flux(struct knote *kn) 288 { 289 290 KQ_OWNED(kn->kn_kq); 291 MPASS(kn->kn_influx < INT_MAX); 292 kn->kn_influx++; 293 } (kgdb) p *kn $3 = { kn_link = { sle_next = 0x0 }, kn_selnext = { sle_next = 0x0 }, kn_knlist = 0xfffff83c698aa438, kn_tqe = { tqe_next = 0xffffffffffffffff, tqe_prev = 0xffffffffffffffff }, kn_kq = 0xfffff80164651600, kn_kevent = { ident = 267406, filter = -1, flags = 32, fflags = 0, data = 0, udata = 0x2c0901461480, ext = {0, 0, 0, 0} }, kn_hook = 0x0, kn_hookid = 0, kn_status = 0, kn_influx = 0, kn_sfflags = 0, kn_sdata = 0, kn_ptr = { p_fp = 0xfffff83f7bcf6b40, p_proc = 0xfffff83f7bcf6b40, p_aio = 0xfffff83f7bcf6b40, p_lio = 0xfffff83f7bcf6b40, p_v = 0xfffff83f7bcf6b40 }, kn_fop = 0xffffffff81437d38 <soread_filtops> } (kgdb) fr 18 #18 0xffffffff80b24729 in kqueue_kevent (kq=kq@entry=0xfffff80189cb1c00, td=td@entry=0xfffff80155304740, nchanges=nchanges@entry=1, nevents=nevents@entry=0, k_ops=k_ops@entry=0xfffffe2364ef9de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1359 1359 error = kqueue_register(kq, kevp, td, M_WAITOK); (kgdb) p *kevp $9 = { ident = 312462, filter = -1, flags = 2, fflags = 0, data = 0, udata = 0x0, ext = {0, 0, 0, 0} } Please, tell us, if you need more.
https://reviews.freebsd.org/D55852 might fix this. I see a race which I think can cause these assertion failures, but I think it would be triggered by different userspace threads simultaneously registering an fd with a kqueue while closing the fd, which is presumably an application bug. Note, I haven't tested the patch yet, so please don't try it unless you can recover from a broken kernel.
Hi mark, thanks! It very well might be the case that some FD is being simultaneously registered in kqueue and closed. I wouldn't go as far as to say it's the user-space application bug. After all, EBADF is a pretty legitimate error and a condition. We use asio library, and multiple threads. It may, indeed not have the `close()` serialized with a kqueue requests to the kernel. And I still hold my belief that they are right, as I can't remember reading anywhere about close + kqueue being concurrently forbidden: after all, kernel must *always* play on defensive and never trust or give tools to user-space, that can cause panic or crash. Please, ignore my rabble above. We unmeasurably appreciate your efforts and help! We are going to take your patch into testing now. ---- As a side note, just a bit of concern: this patch seems to only 'refresh the state' on certain condition and not use an actual lock. Would it not be susceptible to ABA still and just reduce the likelihood of a race, not eliminating it entirely?
Please, ignore my last remark. I've realized, albeit too late, that it's all happening under the protection of a lock.
Sadly, it crashed again. Fatal trap 9: general protection fault while in kernel mode cpuid = 28; apic id = 70 instruction pointer = 0x20:0xffffffff80b5915d stack pointer = 0x28:0xfffffe0718b5ed70 frame pointer = 0x28:0xfffffe0718b5ed70 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 2968 (asy:http:s) rdi: deadc0dedeadc0f6 rsi: 0000000000000004 rdx: ffffffff811ab239 rcx: 0000000000000121 r8: 0000000000000001 r9: ffffffff81e1efc8 rax: fffff80586c4e740 rbx: 000000000004a19a rbp: fffffe0718b5ed70 r10: 0000000000000000 r11: 0000000000000004 r12: fffff801090cad18 r13: fffff83937739f00 r14: 000000000004a19a r15: fffff801090cad00 trap number = 9 panic: general protection fault cpuid = 28 time = 1773739454 KDB: stack backtrace: ... (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:405 #2 0xffffffff804a4718 in db_fncall_generic (nargs=0, args=0xfffffe0718b5e520, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:626 #3 db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:674 #4 0xffffffff804a418d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:504 #5 0xffffffff804a42d6 in db_command_script (command=command@entry=0xffffffff81bba6e2 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:569 #6 0xffffffff804a9578 in db_script_exec (scriptname=scriptname@entry=0xfffffe0718b5e6f0 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:302 #7 0xffffffff804a9472 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:324 #8 0xffffffff804a7531 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:267 #9 0xffffffff80bd09b0 in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe0718b5ea30) at /usr/src/sys/kern/subr_kdb.c:790 #10 0xffffffff810b3a07 in trap (frame=0xfffffe0718b5ea30) at /usr/src/sys/amd64/amd64/trap.c:639 #11 <signal handler called> #12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556 #13 0xffffffff80b7fc8d in vpanic (fmt=0xffffffff81237367 "%s", ap=ap@entry=0xfffffe0718b5ec60) at /usr/src/sys/kern/kern_shutdown.c:953 #14 0xffffffff80b7fa53 in panic (fmt=0xffffffff81d853a0 <cnputs_mtx> "\233\327\031\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:891 #15 0xffffffff810b40b8 in trap_fatal (frame=0xfffffe0718b5ecb0, eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:1000 #16 <signal handler called> #17 __mtx_assert (c=0xdeadc0dedeadc0f6, what=what@entry=4, file=0xffffffff811ab239 "/usr/src/sys/kern/kern_event.c", line=line@entry=289) at /usr/src/sys/kern/kern_mutex.c:1091 #18 0xffffffff80b25c9e in kn_enter_flux (kn=<optimized out>) at /usr/src/sys/kern/kern_event.c:289 #19 knote_fdclose (td=td@entry=0xfffff80586c4e740, fd=fd@entry=303514) at /usr/src/sys/kern/kern_event.c:2704 #20 0xffffffff80b1dbd6 in closefp_impl (fdp=0xfffffe07143ca920, fd=303514, fp=0xfffff82112ac8e10, td=0xfffff80586c4e740, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320 #21 0xffffffff810b4f0a in syscallenter (td=0xfffff80586c4e740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #22 amd64_syscall (td=0xfffff80586c4e740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241 #23 <signal handler called> #24 0x000000082c76832a in ?? () Backtrace stopped: Cannot access memory at address 0x8624d1b98 (kgdb) fr 18 #18 0xffffffff80b25c9e in kn_enter_flux (kn=<optimized out>) at /usr/src/sys/kern/kern_event.c:289 289 KQ_OWNED(kn->kn_kq); (kgdb) p *kn->kn_kq value has been optimized out (kgdb) up #19 knote_fdclose (td=td@entry=0xfffff80586c4e740, fd=fd@entry=303514) at /usr/src/sys/kern/kern_event.c:2704 2704 kn_enter_flux(kn); (kgdb) p *kn $2 = { kn_link = { sle_next = 0xdeadc0dedeadc0de }, kn_selnext = { sle_next = 0xdeadc0dedeadc0de }, kn_knlist = 0xdeadc0dedeadc0de, kn_tqe = { tqe_next = 0xdeadc0dedeadc0de, tqe_prev = 0xdeadc0dedeadc0de }, kn_kq = 0xdeadc0dedeadc0de, kn_kevent = { ident = 16045693110842147038, filter = -16162, flags = 57005, fflags = 3735929054, data = -2401050962867404578, udata = 0xdeadc0dedeadc0de, ext = {16045693110842147038, 16045693110842147038, 16045693110842147038, 16045693110842147038} }, kn_hook = 0xdeadc0dedeadc0de, kn_hookid = -559038242, kn_status = -559038242, kn_influx = -559038242, kn_sfflags = -559038242, kn_sdata = -2401050962867404578, kn_ptr = { p_fp = 0xdeadc0dedeadc0de, p_proc = 0xdeadc0dedeadc0de, p_aio = 0xdeadc0dedeadc0de, p_lio = 0xdeadc0dedeadc0de, p_v = 0xdeadc0dedeadc0de }, kn_fop = 0xdeadc0dedeadc0de } (kgdb) p *kn->kn_kq Cannot access memory at address 0xdeadc0dedeadc0de (kgdb) #19 knote_fdclose (td=td@entry=0xfffff80586c4e740, fd=fd@entry=303514) at /usr/src/sys/kern/kern_event.c:2704 2704 kn_enter_flux(kn); (kgdb) up #20 0xffffffff80b1dbd6 in closefp_impl (fdp=0xfffffe07143ca920, fd=303514, fp=0xfffff82112ac8e10, td=0xfffff80586c4e740, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320 1320 knote_fdclose(td, fd); (kgdb) p *fp $4 = { f_flag = 7, f_count = 1, f_data = 0xfffff868aabeb000, f_ops = 0xffffffff81436808 <socketops>, f_vnode = 0x0, f_cred = 0xfffff804a3517700, f_type = 2, f_vflags = 0, { f_seqcount = {0, 0}, f_pipegen = 0 }, f_nextoff = {0, 0}, f_vnun = { fvn_cdevpriv = 0x0, fvn_advice = 0x0 }, f_offset = 0 } (kgdb) p *fdp $5 = { fd_files = 0xfffffe0836e00000, fd_map = 0xfffffe080448a000, fd_freefile = 3, fd_refcnt = 1, fd_holdcnt = 1, fd_sx = { lock_object = { lo_name = 0xffffffff812b4244 "filedesc structure", lo_flags = 36896768, lo_data = 0, lo_witness = 0xfffff8804bd94380 }, sx_lock = 18446735301352417088 }, fd_kqlist = { tqh_first = 0xfffff80150f62d00, tqh_last = 0xfffff801090b5628 }, fd_holdleaderscount = 0, fd_holdleaderswakeup = 0 } Please, tell us, if more useful data can be extracted from this dump.
In the frame 19, there should be a local var kq. Please print out the *kq. Is 303514 sounds like a possibly correct fd value? Do you have filedesc limits bumped up? Could you try your load with the HEAD kernel with all debugging turned on? No need to install userspace, only kernel.
Alas, it was optimized out. Probably because its value is in a register? (kgdb) fr 19 #19 knote_fdclose (td=td@entry=0xfffff80586c4e740, fd=fd@entry=303514) at /usr/src/sys/kern/kern_event.c:2704 2704 kn_enter_flux(kn); (kgdb) p *kq value has been optimized out (kgdb) p kq $4 = <optimized out> Using a brute force approach, looks like r15 is a correct guess: (kgdb) i r rax 0xfffff80586c4e740 -8772357134528 rbx 0x4a19a 303514 rcx 0x121 289 rdx 0xffffffff811ab239 -2128956871 rsi 0x4 4 rdi 0xdeadc0dedeadc0f6 -2401050962867404554 rbp 0xfffffe0718b5edc0 0xfffffe0718b5edc0 rsp 0xfffffe0718b5ed80 0xfffffe0718b5ed80 r8 0x1 1 r9 0xffffffff81e1efc8 -2115899448 r10 0x0 0 r11 0x4 4 r12 0xfffff801090cad18 -8791646229224 r13 0xfffff83937739f00 -8550349562112 r14 0x4a19a 303514 r15 0xfffff801090cad00 -8791646229248 rip 0xffffffff80b25c9e 0xffffffff80b25c9e <knote_fdclose+302> eflags 0x10297 [ CF PF AF SF IF RF ] cs 0x20 32 ss 0x28 40 ds <unavailable> es <unavailable> fs <unavailable> gs <unavailable> fs_base <unavailable> gs_base <unavailable> (kgdb) p *((struct kqueue*)$r15) $14 = { kq_lock = { lock_object = { lo_name = 0xffffffff812bbf6c "kqueue", lo_flags = 21168128, lo_data = 0, lo_witness = 0xfffff8804bd8da80 }, mtx_lock = 18446735301352417088 }, kq_refcnt = 0, kq_list = { tqe_next = 0xfffff80155923a00, tqe_prev = 0xfffff80150f62d28 }, kq_head = { tqh_first = 0x0, tqh_last = 0xfffff801090cad38 }, kq_count = 0, kq_sel = { si_tdlist = { tqh_first = 0x0, tqh_last = 0x0 }, si_note = { kl_list = { slh_first = 0x0 }, kl_lock = 0xffffffff80b254e0 <knlist_mtx_lock>, kl_unlock = 0xffffffff80b25500 <knlist_mtx_unlock>, kl_assert_lock = 0xffffffff80b25520 <knlist_mtx_assert_lock>, kl_lockarg = 0xfffff801090cad00, kl_autodestroy = 0 }, si_mtx = 0x0 }, kq_sigio = 0x0, kq_fdp = 0xfffffe07143ca920, kq_state = 0, kq_knlistsize = 700672, kq_knlist = 0xfffffe09b7213000, kq_knhashmask = 0, kq_knhash = 0x0, kq_task = { ta_link = { stqe_next = 0x0 }, ta_pending = 0, ta_priority = 0 '\000', ta_flags = 0 '\000', ta_func = 0xffffffff80b26060 <kqueue_task>, ta_context = 0xfffff801090cad00 }, kq_cred = 0xfffff804a3517700 } Yeah, 303514 is definitely a valid descriptor value. We do handle hundreds of thousands of websocket connections normally. And yes, our limits are adjusted to several millions, just in case... We will consider using HEAD for a test run, without updating the world. We will stay in touch and report back as soon as there is something to report. In any case, if you need more info, we will be glad to provide it.
(In reply to Paul from comment #16) If believing the guessed location for kq, and there is a reason to believe because the lock name looks right, it seems that the kq was drained/closed in parallel with closing the fd. This might explain why the knote dereferenced was garbage. Besides trying the HEAD, you could try to apply the following rude patch. If it helps, I might consider using a sentinel to resume iteration instead of restarting from scratch, after the kq relock. diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c index e8e670d39d09..4bbd9063b6ef 100644 --- a/sys/kern/kern_event.c +++ b/sys/kern/kern_event.c @@ -2830,10 +2830,9 @@ knote_fdclose(struct thread *td, int fd) * We shouldn't have to worry about new kevents appearing on fd * since filedesc is locked. */ +again: TAILQ_FOREACH(kq, &fdp->fd_kqlist, kq_list) { KQ_LOCK(kq); - -again: influx = 0; while (kq->kq_knlistsize > fd && (kn = SLIST_FIRST(&kq->kq_knlist[fd])) != NULL) {
Thanks, we will start testing this patch first.
Sadly, it happens still, even with the latest patch of kern_event.c: Fatal trap 9: general protection fault while in kernel mode cpuid = 0; apic id = 00 instruction pointer = 0x20:0xffffffff80b5914d stack pointer = 0x28:0xfffffe0718977d60 frame pointer = 0x28:0xfffffe0718977d60 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 3115 (asy:http:s) rdi: deadc0dedeadc0f6 rsi: 0000000000000004 rdx: ffffffff811ab239 rcx: 0000000000000121 r8: 0000000000000001 r9: ffffffff81e1ec98 rax: fffff803c20c3740 rbx: 000000000008fa97 rbp: fffffe0718977d60 r10: 0000000000000000 r11: 0000000000000004 r12: fffff80155c37718 r13: fffff819bc941960 r14: 000000000008fa97 r15: fffff80155c37700 trap number = 9 panic: general protection fault cpuid = 0 time = 1773824580 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0718977ae0 vpanic() at vpanic+0x161/frame 0xfffffe0718977c10 panic() at panic+0x43/frame 0xfffffe0718977c70 trap_fatal() at trap_fatal+0x68/frame 0xfffffe0718977c90 calltrap() at calltrap+0x8/frame 0xfffffe0718977c90 --- trap 0x9, rip = 0xffffffff80b5914d, rsp = 0xfffffe0718977d60, rbp = 0xfffffe0718977d60 --- __mtx_assert() at __mtx_assert+0x3d/frame 0xfffffe0718977d60 knote_fdclose() at knote_fdclose+0x11e/frame 0xfffffe0718977dc0 closefp_impl() at closefp_impl+0x96/frame 0xfffffe0718977e00 amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe0718977f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0718977f30 --- syscall (6, FreeBSD ELF64, close), rip = 0x82ddf932a, rsp = 0x85fb5eb88, rbp = 0x85fb5eba0 --- KDB: enter: panic (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:405 #2 0xffffffff804a4718 in db_fncall_generic (nargs=0, args=0xfffffe0718977510, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:626 #3 db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:674 #4 0xffffffff804a418d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:504 #5 0xffffffff804a42d6 in db_command_script (command=command@entry=0xffffffff81bba6e2 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:569 #6 0xffffffff804a9578 in db_script_exec (scriptname=scriptname@entry=0xfffffe07189776e0 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:302 #7 0xffffffff804a9472 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:324 #8 0xffffffff804a7531 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:267 #9 0xffffffff80bd09a0 in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe0718977a20) at /usr/src/sys/kern/subr_kdb.c:790 #10 0xffffffff810b3a07 in trap (frame=0xfffffe0718977a20) at /usr/src/sys/amd64/amd64/trap.c:639 #11 <signal handler called> #12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556 #13 0xffffffff80b7fc7d in vpanic (fmt=0xffffffff81237367 "%s", ap=ap@entry=0xfffffe0718977c50) at /usr/src/sys/kern/kern_shutdown.c:953 #14 0xffffffff80b7fa43 in panic (fmt=0xffffffff81d853a0 <cnputs_mtx> "\233\327\031\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:891 #15 0xffffffff810b40b8 in trap_fatal (frame=0xfffffe0718977ca0, eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:1000 #16 <signal handler called> #17 __mtx_assert (c=0xdeadc0dedeadc0f6, what=what@entry=4, file=0xffffffff811ab239 "/usr/src/sys/kern/kern_event.c", line=line@entry=289) at /usr/src/sys/kern/kern_mutex.c:1091 #18 0xffffffff80b25c8e in kn_enter_flux (kn=<optimized out>) at /usr/src/sys/kern/kern_event.c:289 #19 knote_fdclose (td=td@entry=0xfffff803c20c3740, fd=fd@entry=588439) at /usr/src/sys/kern/kern_event.c:2703 #20 0xffffffff80b1dbd6 in closefp_impl (fdp=0xfffffe0713371430, fd=588439, fp=0xfffff86e9b7ee190, td=0xfffff803c20c3740, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320 #21 0xffffffff810b4f0a in syscallenter (td=0xfffff803c20c3740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #22 amd64_syscall (td=0xfffff803c20c3740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1241 #23 <signal handler called> #24 0x000000082ddf932a in ?? () Backtrace stopped: Cannot access memory at address 0x85fb5eb88 (kgdb) l /usr/src/sys/kern/kern_event.c:2690 2685 /* 2686 * We shouldn't have to worry about new kevents appearing on fd 2687 * since filedesc is locked. 2688 */ 2689 again: 2690 TAILQ_FOREACH(kq, &fdp->fd_kqlist, kq_list) { 2691 KQ_LOCK(kq); 2692 influx = 0; 2693 while (kq->kq_knlistsize > fd && 2694 (kn = SLIST_FIRST(&kq->kq_knlist[fd])) != NULL) { (kgdb) fr 18 #18 0xffffffff80b25c8e in kn_enter_flux (kn=<optimized out>) at /usr/src/sys/kern/kern_event.c:289 289 KQ_OWNED(kn->kn_kq); (kgdb) p *kn->kn_kq value has been optimized out (kgdb) up #19 knote_fdclose (td=td@entry=0xfffff803c20c3740, fd=fd@entry=588439) at /usr/src/sys/kern/kern_event.c:2703 2703 kn_enter_flux(kn); (kgdb) p kn $4 = (struct knote *) 0xfffff819bc941960 (kgdb) p *kn $1 = { kn_link = { sle_next = 0xdeadc0dedeadc0de }, kn_selnext = { sle_next = 0xdeadc0dedeadc0de }, kn_knlist = 0xdeadc0dedeadc0de, kn_tqe = { tqe_next = 0xdeadc0dedeadc0de, tqe_prev = 0xdeadc0dedeadc0de }, kn_kq = 0xdeadc0dedeadc0de, kn_kevent = { ident = 16045693110842147038, filter = -16162, flags = 57005, fflags = 3735929054, data = -2401050962867404578, udata = 0xdeadc0dedeadc0de, ext = {16045693110842147038, 16045693110842147038, 16045693110842147038, 16045693110842147038} }, kn_hook = 0xdeadc0dedeadc0de, kn_hookid = -559038242, kn_status = -559038242, kn_influx = -559038242, kn_sfflags = -559038242, kn_sdata = -2401050962867404578, kn_ptr = { p_fp = 0xdeadc0dedeadc0de, p_proc = 0xdeadc0dedeadc0de, p_aio = 0xdeadc0dedeadc0de, p_lio = 0xdeadc0dedeadc0de, p_v = 0xdeadc0dedeadc0de }, kn_fop = 0xdeadc0dedeadc0de } (kgdb) p *kn->kn_kq Cannot access memory at address 0xdeadc0dedeadc0de #20 0xffffffff80b1dbd6 in closefp_impl (fdp=0xfffffe0713371430, fd=588439, fp=0xfffff86e9b7ee190, td=0xfffff803c20c3740, audit=true) at /usr/src/sys/kern/kern_descrip.c:1320 1320 knote_fdclose(td, fd); (kgdb) p *fp $1 = { f_flag = 7, f_count = 1, f_data = 0xfffff82e0210c000, f_ops = 0xffffffff81436808 <socketops>, f_vnode = 0x0, f_cred = 0xfffff804daf23a00, f_type = 2, f_vflags = 0, { f_seqcount = {0, 0}, f_pipegen = 0 }, f_nextoff = {0, 0}, f_vnun = { fvn_cdevpriv = 0x0, fvn_advice = 0x0 }, f_offset = 0 } (kgdb) p *fdp $2 = { fd_files = 0xfffffe094f9fb000, fd_map = 0xfffffe094d255000, fd_freefile = 3, fd_refcnt = 1, fd_holdcnt = 1, fd_sx = { lock_object = { lo_name = 0xffffffff812b4244 "filedesc structure", lo_flags = 36896768, lo_data = 0, lo_witness = 0xfffff8804bd94380 }, sx_lock = 18446735293757011776 }, fd_kqlist = { tqh_first = 0xfffff8010c5ba200, tqh_last = 0xfffff80155c37728 }, fd_holdleaderscount = 0, fd_holdleaderswakeup = 0 } (kgdb) fr 19 #19 knote_fdclose (td=td@entry=0xfffff803c20c3740, fd=fd@entry=588439) at /usr/src/sys/kern/kern_event.c:2703 2703 kn_enter_flux(kn); (kgdb) p *kq value has been optimized out (kgdb) i r rax 0xfffff803c20c3740 -8779952539840 rbx 0x8fa97 588439 rcx 0x121 289 rdx 0xffffffff811ab239 -2128956871 rsi 0x4 4 rdi 0xdeadc0dedeadc0f6 -2401050962867404554 rbp 0xfffffe0718977dc0 0xfffffe0718977dc0 rsp 0xfffffe0718977d70 0xfffffe0718977d70 r8 0x1 1 r9 0xffffffff81e1ec98 -2115900264 r10 0x0 0 r11 0x4 4 r12 0xfffff80155c37718 -8790359181544 r13 0xfffff819bc941960 -8685555017376 r14 0x8fa97 588439 r15 0xfffff80155c37700 -8790359181568 rip 0xffffffff80b25c8e 0xffffffff80b25c8e <knote_fdclose+286> eflags 0x10297 [ CF PF AF SF IF RF ] cs 0x20 32 ss 0x28 40 ds <unavailable> es <unavailable> fs <unavailable> gs <unavailable> fs_base <unavailable> gs_base <unavailable> (kgdb) p *((struct kqueue*)$r15) $3 = { kq_lock = { lock_object = { lo_name = 0xffffffff812bbf6c "kqueue", lo_flags = 21168128, lo_data = 0, lo_witness = 0xfffff8804bd8da80 }, mtx_lock = 18446735293757011776 }, kq_refcnt = 1, kq_list = { tqe_next = 0x0, tqe_prev = 0xfffff80150e0d528 }, kq_head = { tqh_first = 0x0, tqh_last = 0xfffff80155c37738 }, kq_count = 0, kq_sel = { si_tdlist = { tqh_first = 0x0, tqh_last = 0x0 }, si_note = { kl_list = { slh_first = 0x0 }, kl_lock = 0xffffffff80b254e0 <knlist_mtx_lock>, kl_unlock = 0xffffffff80b25500 <knlist_mtx_unlock>, kl_assert_lock = 0xffffffff80b25520 <knlist_mtx_assert_lock>, kl_lockarg = 0xfffff80155c37700, kl_autodestroy = 0 }, si_mtx = 0x0 }, kq_sigio = 0x0, kq_fdp = 0xfffffe0713371430, kq_state = 2, kq_knlistsize = 680960, kq_knlist = 0xfffffe0987b7a000, kq_knhashmask = 0, kq_knhash = 0x0, kq_task = { ta_link = { stqe_next = 0x0 }, ta_pending = 0, ta_priority = 0 '\000', ta_flags = 0 '\000', ta_func = 0xffffffff80b26050 <kqueue_task>, ta_context = 0xfffff80155c37700 }, kq_cred = 0xfffff804daf23a00 } Weirdest thing is (might this be a hint of a problem?) that in frame 19, `kn` points to some memory address that contains exactly the same, byte-by-byte content as in previous crash, seemingly a garbage. Is this some 'kernel constants' data segment, or is it expected and not a garbage?
re: deadlock, I don't really understand this part of knote_fdclose: we set influx=0 at the beginning and only `wakuep(kq)` if we personally dropped a knote from this kq. Isn't that just asking for a deadlock on the first knote? Assuming we have something like: Thread A: kn_enter_flux, drop the lock Thread B: pick up the lock, find it in flux, enter FLUXWQAIT Thread C: knote_fdclose(), acquire lock, also find it in flux, doesn't wakeup() Looking closer, I can kind of see that the pattern should be that thread A will wakeup() after a kn_leave_flux(), but I think we only have one scenario where that's *not* the case: kqueue_fork_copy() -> kqueue_fork_copy_list() will transition a number of knotes and neither would issue a wakeup(). The final KQ_UNLOCK there in kqueue_fork_copy() should probably just assume that it transitioned at least one knote through a flux state and switch to KQ_UNLOCK_FLUX to be sure. I think with those bases covered, influx + wakeup in knote_fdclose() should probably just go away. knote_drop will already wakeup any waiters after its removed the knote it transitioned from the list- another wakeup seems spurious.
(In reply to Paul from comment #19) *kn is filled with '0xdeadc0de' which is the pattern for the freed memory in debugging kernels. In fact, my patch had a bug, it leaked the kq lock on each flux restart, but it seems that you did not hit it, which is interesting. I put the updated patch, with some additional assert, at the end of my reply. The assert might be more interesting than the other part of the patch. I do ask you to switch to HEAD with all debugging enabled. (In reply to Kyle Evans from comment #20) You might be quite right, and might be we should do what you suggest. Still, it cannot be an issue that is reported there because copy on fork is not in 14 (at least as far as I remember). diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c index e8e670d39d09..38928a68dd94 100644 --- a/sys/kern/kern_event.c +++ b/sys/kern/kern_event.c @@ -2830,19 +2830,20 @@ knote_fdclose(struct thread *td, int fd) * We shouldn't have to worry about new kevents appearing on fd * since filedesc is locked. */ +again: TAILQ_FOREACH(kq, &fdp->fd_kqlist, kq_list) { KQ_LOCK(kq); - -again: influx = 0; while (kq->kq_knlistsize > fd && (kn = SLIST_FIRST(&kq->kq_knlist[fd])) != NULL) { + MPASS(kn->kn_kq == kq); if (kn_in_flux(kn)) { /* someone else might be waiting on our knote */ if (influx) wakeup(kq); kq->kq_state |= KQ_FLUXWAIT; - msleep(kq, &kq->kq_lock, PSOCK, "kqflxwt", 0); + msleep(kq, &kq->kq_lock, PSOCK | PDROP, + "kqflxwt", 0); goto again; } kn_enter_flux(kn);
Hi, Just in case, we have tested new patch with our current kernel version, and it panicked. So we've finally switched to HEAD, as suggested earlier (we understand that this simplifies things a lot). And basically the same thing happened there. It's about the new assert, added in the latest patch. Unread portion of the kernel message buffer: panic: Assertion kn->kn_kq == kq failed at /usr/src/sys/kern/kern_event.c:2852 cpuid = 8 time = 1774258230 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0699ccbbd0 vpanic() at vpanic+0x136/frame 0xfffffe0699ccbd00 panic() at panic+0x43/frame 0xfffffe0699ccbd60 knote_fdclose() at knote_fdclose+0x236/frame 0xfffffe0699ccbdc0 closefp_impl() at closefp_impl+0xa8/frame 0xfffffe0699ccbe00 amd64_syscall() at amd64_syscall+0x169/frame 0xfffffe0699ccbf30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0699ccbf30 --- syscall (6, FreeBSD ELF64, close), rip = 0x82d4a332a, rsp = 0x85dfa6b98, rbp = 0x85dfa6bb0 --- KDB: enter: panic (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:399 #2 0xffffffff804b60a8 in db_fncall_generic (nargs=0, args=0xfffffe0699ccb5f0, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:631 #3 db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:679 #4 0xffffffff804b5b2d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:508 #5 0xffffffff804b5c76 in db_command_script (command=command@entry=0xffffffff81bd7722 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:573 #6 0xffffffff804bba58 in db_script_exec (scriptname=scriptname@entry=0xfffffe0699ccb7c0 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:301 #7 0xffffffff804bb952 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:323 #8 0xffffffff804b91e1 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:266 #9 0xffffffff80c1ce5f in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe0699ccbb10) at /usr/src/sys/kern/subr_kdb.c:790 #10 0xffffffff8112a96d in trap (frame=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:675 #11 <signal handler called> #12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556 #13 0xffffffff80bc9ddb in vpanic (fmt=0xffffffff812ec6bb "Assertion %s failed at %s:%d", ap=ap@entry=0xfffffe0699ccbd40) at /usr/src/sys/kern/kern_shutdown.c:962 #14 0xffffffff80bc9c43 in panic (fmt=0xffffffff81da2290 <cnputs_mtx> "\254\214!\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:887 #15 0xffffffff80b6bc76 in knote_fdclose (td=td@entry=0xff0100018d9b4000, fd=fd@entry=161249) at /usr/src/sys/kern/kern_event.c:2852 #16 0xffffffff80b63468 in closefp_impl (fdp=0xfffffe0693882000, fd=161249, fp=0xff010002dd9fb230, td=0xff0100018d9b4000, audit=true) at /usr/src/sys/kern/kern_descrip.c:1413 #17 0xffffffff8112b739 in syscallenter (td=0xff0100018d9b4000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #18 amd64_syscall (td=0xff0100018d9b4000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1244 #19 <signal handler called> #20 0x000000082d4a332a in ?? () Backtrace stopped: Cannot access memory at address 0x85dfa6b98 (kgdb) fr 15 #15 0xffffffff80b6bc76 in knote_fdclose (td=td@entry=0xff0100018d9b4000, fd=fd@entry=161249) at /usr/src/sys/kern/kern_event.c:2852 2852 MPASS(kn->kn_kq == kq); (kgdb) p *kn $1 = { kn_link = { sle_next = 0xdeadc0dedeadc0de }, kn_selnext = { sle_next = 0xdeadc0dedeadc0de }, kn_knlist = 0xdeadc0dedeadc0de, kn_tqe = { tqe_next = 0xdeadc0dedeadc0de, tqe_prev = 0xdeadc0dedeadc0de }, kn_kq = 0xdeadc0dedeadc0de, kn_kevent = { ident = 16045693110842147038, filter = -16162, flags = 57005, fflags = 3735929054, data = -2401050962867404578, udata = 0xdeadc0dedeadc0de, ext = {16045693110842147038, 16045693110842147038, 16045693110842147038, 16045693110842147038} }, kn_hook = 0xdeadc0dedeadc0de, kn_hookid = -559038242, kn_status = -559038242, kn_influx = -559038242, kn_sfflags = 3735929054, kn_sdata = -2401050962867404578, kn_ptr = { p_fp = 0xdeadc0dedeadc0de, p_proc = 0xdeadc0dedeadc0de, p_aio = 0xdeadc0dedeadc0de, p_lio = 0xdeadc0dedeadc0de, p_prison = 0xdeadc0dedeadc0de, p_v = 0xdeadc0dedeadc0de }, kn_fop = 0xdeadc0dedeadc0de } (kgdb) p *kq value has been optimized out (kgdb) i r rax 0x12 18 rbx 0x275e1 161249 rcx 0xba5f4feebeda7d64 -5017203573044642460 rdx 0xffffffff813451fb -2127277573 rsi 0xfffffe0699ccba90 -2170673120624 rdi 0xffffffff81da2290 -2116410736 rbp 0xfffffe0699ccbdc0 0xfffffe0699ccbdc0 rsp 0xfffffe0699ccbd70 0xfffffe0699ccbd70 r8 0x12 18 r9 0x20 32 r10 0x0 0 r11 0x0 0 r12 0xff010001bdd19b18 -71776111581619432 r13 0xff0100488988c0a0 -71775807516131168 r14 0x275e1 161249 r15 0xff010001bdd19b00 -71776111581619456 rip 0xffffffff80b6bc76 0xffffffff80b6bc76 <knote_fdclose+566> eflags 0x86 [ PF SF ] cs 0x20 32 ss 0x28 40 ds <unavailable> es <unavailable> fs <unavailable> gs <unavailable> fs_base <unavailable> gs_base <unavailable> (kgdb) p *((struct kqueue*)$r15) $2 = { kq_lock = { lock_object = { lo_name = 0xffffffff8133f15f "kqueue", lo_flags = 21168128, lo_data = 0, lo_witness = 0xff0100804bd8db80 }, mtx_lock = 18374967961319063552 }, kq_refcnt = 0, kq_list = { tqe_next = 0xff0100014c3afe00, tqe_prev = 0xff010001075a7528 }, kq_head = { tqh_first = 0x0, tqh_last = 0xff010001bdd19b38 }, kq_count = 0, kq_sel = { si_tdlist = { tqh_first = 0x0, tqh_last = 0x0 }, si_note = { kl_list = { slh_first = 0x0 }, kl_lock = 0xffffffff80b6b3a0 <knlist_mtx_lock>, kl_unlock = 0xffffffff80b6b3c0 <knlist_mtx_unlock>, kl_assert_lock = 0xffffffff80b6b3e0 <knlist_mtx_assert_lock>, kl_lockarg = 0xff010001bdd19b00, kl_autodestroy = 0 }, si_mtx = 0x0 }, kq_sigio = 0x0, kq_fdp = 0xfffffe0693882000, kq_state = 0, kq_knlistsize = 695296, kq_knlist = 0xfffffe0a76665000, kq_knhashmask = 0, kq_knhash = 0x0, kq_task = { ta_link = { stqe_next = 0x0 }, ta_pending = 0, ta_priority = 0 '\000', ta_flags = 0 '\000', ta_func = 0xffffffff80b6db40 <kqueue_task>, ta_context = 0xff010001bdd19b00 }, kq_cred = 0xff01000107bc5780, kq_forksrc = 0x0 } Please, tell us if you need anything else.
My belief is that kn is freed somewhere, which explains 0xdeadc0de. I cannot easily guess where such thing could happen. Lets try the brute force approach then. Below is the patch, to be applied on top of my previous patch. It is debug only. When the panic due to the new assert (kn->kn_kq != kq) occurs, the kn var should point to struct eknote instead of knote. I am interested in the *eknote content, and in the lookup of the source lines for the captured stack trace, which should point to the place where the free of kn occured. I do not have better idea how to proceed ATM. commit 6536e2123f1249a3e8fa8e4895b01abc95381633 Author: Konstantin Belousov <kib@FreeBSD.org> Date: Wed Mar 25 03:47:59 2026 +0200 debug diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c index 38928a68dd94..70f9ba07716d 100644 --- a/sys/kern/kern_event.c +++ b/sys/kern/kern_event.c @@ -230,6 +230,13 @@ static const struct filterops user_filtops = { .f_copy = knote_triv_copy, }; +#include <sys/stack.h> +struct eknote { + struct knote k; + struct knote c; + struct stack s; +}; + static uma_zone_t knote_zone; static unsigned int __exclusive_cache_line kq_ncallouts; static unsigned int kq_calloutmax = 4 * 1024; @@ -2962,8 +2969,8 @@ static void knote_init(void *dummy __unused) { - knote_zone = uma_zcreate("KNOTE", sizeof(struct knote), NULL, NULL, - NULL, NULL, UMA_ALIGN_PTR, 0); + knote_zone = uma_zcreate("KNOTE", sizeof(struct eknote), NULL, NULL, + NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE); ast_register(TDA_KQUEUE, ASTR_ASTF_REQUIRED, 0, ast_kqueue); prison0.pr_klist = knlist_alloc(&prison0.pr_mtx); } @@ -2972,15 +2979,22 @@ SYSINIT(knote, SI_SUB_PSEUDO, SI_ORDER_ANY, knote_init, NULL); static struct knote * knote_alloc(int mflag) { + struct eknote *e; - return (uma_zalloc(knote_zone, mflag | M_ZERO)); + e = uma_zalloc(knote_zone, mflag | M_ZERO); + return (&e->k); } static void knote_free(struct knote *kn) { + struct eknote *e; - uma_zfree(knote_zone, kn); + e = __containerof(kn, struct eknote, k); + e->c = e->k; + stack_save(&e->s); + memset(&e->k, 0xdeadc0de, sizeof(e->k)); + uma_zfree(knote_zone, e); } /*
Hi, Thanks! There was a small issue: NULL can actually be passed to `knote_free()`. So we have adapted the patch a bit. Also, we've never discarded all previous patches. So, to be of the same page, here's the final version that we're testing now, on the HEAD: diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c index e8e670d39d09..bbb09789e41f 100644 --- a/sys/kern/kern_event.c +++ b/sys/kern/kern_event.c @@ -28,7 +28,6 @@ * SUCH DAMAGE. */ -#include <sys/cdefs.h> #include "opt_ktrace.h" #include "opt_kqueue.h" @@ -230,6 +229,13 @@ static const struct filterops user_filtops = { .f_copy = knote_triv_copy, }; +#include <sys/stack.h> +struct eknote { + struct knote k; + struct knote c; + struct stack s; +}; + static uma_zone_t knote_zone; static unsigned int __exclusive_cache_line kq_ncallouts; static unsigned int kq_calloutmax = 4 * 1024; @@ -1803,6 +1809,19 @@ kqueue_register(struct kqueue *kq, struct kevent *kev, struct thread *td, error = ENOMEM; goto done; } + + /* + * Now that the kqueue is locked, make sure the fd + * didn't change out from under us. + */ + if (fops->f_isfd && + fget_noref_unlocked(td->td_proc->p_fd, + kev->ident) != fp) { + KQ_UNLOCK(kq); + tkn = kn; + error = EBADF; + goto done; + } kn->kn_fp = fp; kn->kn_kq = kq; kn->kn_fop = fops; @@ -2830,19 +2849,21 @@ knote_fdclose(struct thread *td, int fd) * We shouldn't have to worry about new kevents appearing on fd * since filedesc is locked. */ +again: TAILQ_FOREACH(kq, &fdp->fd_kqlist, kq_list) { KQ_LOCK(kq); -again: influx = 0; while (kq->kq_knlistsize > fd && (kn = SLIST_FIRST(&kq->kq_knlist[fd])) != NULL) { + MPASS(kn->kn_kq == kq); if (kn_in_flux(kn)) { /* someone else might be waiting on our knote */ if (influx) wakeup(kq); kq->kq_state |= KQ_FLUXWAIT; - msleep(kq, &kq->kq_lock, PSOCK, "kqflxwt", 0); + msleep(kq, &kq->kq_lock, PSOCK | PDROP, + "kqflxwt", 0); goto again; } kn_enter_flux(kn); @@ -2961,8 +2982,8 @@ static void knote_init(void *dummy __unused) { - knote_zone = uma_zcreate("KNOTE", sizeof(struct knote), NULL, NULL, - NULL, NULL, UMA_ALIGN_PTR, 0); + knote_zone = uma_zcreate("KNOTE", sizeof(struct eknote), NULL, NULL, + NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE); ast_register(TDA_KQUEUE, ASTR_ASTF_REQUIRED, 0, ast_kqueue); prison0.pr_klist = knlist_alloc(&prison0.pr_mtx); } @@ -2971,15 +2992,24 @@ SYSINIT(knote, SI_SUB_PSEUDO, SI_ORDER_ANY, knote_init, NULL); static struct knote * knote_alloc(int mflag) { + struct eknote *e; - return (uma_zalloc(knote_zone, mflag | M_ZERO)); + e = uma_zalloc(knote_zone, mflag | M_ZERO); + return (&e->k); } static void knote_free(struct knote *kn) { + if (kn) { + struct eknote *e; - uma_zfree(knote_zone, kn); + e = __containerof(kn, struct eknote, k); + e->c = e->k; + stack_save(&e->s); + memset(&e->k, 0xdeadc0de, sizeof(e->k)); + uma_zfree(knote_zone, e); + } } /* diff --git a/sys/sys/filedesc.h b/sys/sys/filedesc.h index 4817855443af..9ea5239afae7 100644 --- a/sys/sys/filedesc.h +++ b/sys/sys/filedesc.h @@ -212,6 +212,8 @@ struct filedesc_to_leader { #ifdef _KERNEL +#include <machine/atomic.h> + /* Operation types for kern_dup(). */ enum { FDDUP_NORMAL, /* dup() behavior. */ @@ -303,6 +305,21 @@ int fget_only_user(struct filedesc *fdp, int fd, MPASS(refcount_load(&fp->f_count) > 0); \ }) +/* +* Look up a file description without requiring a lock. In general the result + * may be immediately invalidated after the function returns, the caller must + * handle this. + */ +static inline struct file * +fget_noref_unlocked(struct filedesc *fdp, int fd) +{ + if (__predict_false( + (u_int)fd >= (u_int)atomic_load_int(&fdp->fd_nfiles))) + return (NULL); + + return (atomic_load_ptr(&fdp->fd_ofiles[fd].fde_file)); +} + /* Requires a FILEDESC_{S,X}LOCK held and returns without a ref. */ static __inline struct file * fget_noref(struct filedesc *fdp, int fd)
(In reply to Paul from comment #24) Yes indeed free must handle NULL arg. BTW, Mark, why not commit your change already?
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=8f3227f527567aef53da845ab78da8e16d9051c1 commit 8f3227f527567aef53da845ab78da8e16d9051c1 Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2026-03-27 00:24:18 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2026-03-27 00:24:18 +0000 kqueue: Fix a race when adding an fd-based knote to a queue When registering a new kevent backed by a file descriptor, we first look up the file description with fget(), then lock the kqueue, then see if a corresponding knote is already registered. If not, and KN_ADD is specified, we add the knote to the kqueue. closefp_impl() interlocks with this process by calling knote_fdclose(), which locks each kqueue and checks to see if the fd is registered with a knote. But, if userspace closes an fd while a different thread is registering it, i.e., after fget() succeeds but before the kqueue is locked, then we may end up with a mismatch in the knote table, where the knote kn_fp field points to a different file description than the knote ident. Fix the problem by double-checking before registering a knote. Add a new fget_noref_unlocked() helper for this purpose. It is a clone of fget_noref(). We could simply use fget_noref(), but I like having an explicit unlocked variant. PR: 293382 Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D55852 sys/kern/kern_event.c | 14 +++++++++++++- sys/sys/filedesc.h | 17 +++++++++++++++++ 2 files changed, 30 insertions(+), 1 deletion(-)
Hi! It seems like latest changes have reduced the likelihood of the issue. My guess is: we are dealing with the use-after-free race and just by adding some additional (slow? like 'stack_save()'?) code to the 'free' function, we made it much, much less likely to occur. But anyway, it finally happened. Unread portion of the kernel message buffer: panic: Assertion kn->kn_kq == kq failed at /usr/src/sys/kern/kern_event.c:2859 cpuid = 3 time = 1774857902 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe069abdfbd0 vpanic() at vpanic+0x136/frame 0xfffffe069abdfd00 panic() at panic+0x43/frame 0xfffffe069abdfd60 knote_fdclose() at knote_fdclose+0x236/frame 0xfffffe069abdfdc0 closefp_impl() at closefp_impl+0xa8/frame 0xfffffe069abdfe00 amd64_syscall() at amd64_syscall+0x169/frame 0xfffffe069abdff30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe069abdff30 --- syscall (6, FreeBSD ELF64, close), rip = 0x82d5be32a, rsp = 0x85b519b98, rbp = 0x85b519bb0 --- KDB: enter: panic (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:399 #2 0xffffffff804b60a8 in db_fncall_generic (nargs=0, args=0xfffffe069abdf5f0, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:631 #3 db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:679 #4 0xffffffff804b5b2d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:508 #5 0xffffffff804b5c76 in db_command_script (command=command@entry=0xffffffff81bd7722 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:573 #6 0xffffffff804bba58 in db_script_exec (scriptname=scriptname@entry=0xfffffe069abdf7c0 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:301 #7 0xffffffff804bb952 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:323 #8 0xffffffff804b91e1 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:266 #9 0xffffffff80c1d01f in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe069abdfb10) at /usr/src/sys/kern/subr_kdb.c:790 #10 0xffffffff8112a96d in trap (frame=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:675 #11 <signal handler called> #12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556 #13 0xffffffff80bc9f9b in vpanic (fmt=0xffffffff812ec6bb "Assertion %s failed at %s:%d", ap=ap@entry=0xfffffe069abdfd40) at /usr/src/sys/kern/kern_shutdown.c:962 #14 0xffffffff80bc9e03 in panic (fmt=0xffffffff81da2290 <cnputs_mtx> "\254\214!\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:887 #15 0xffffffff80b6bd26 in knote_fdclose (td=td@entry=0xff0100026dc5d780, fd=fd@entry=544408) at /usr/src/sys/kern/kern_event.c:2859 #16 0xffffffff80b63468 in closefp_impl (fdp=0xfffffe02c31fec90, fd=544408, fp=0xff01002710a27640, td=0xff0100026dc5d780, audit=true) at /usr/src/sys/kern/kern_descrip.c:1413 #17 0xffffffff8112b739 in syscallenter (td=0xff0100026dc5d780) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #18 amd64_syscall (td=0xff0100026dc5d780, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1244 #19 <signal handler called> #20 0x000000082d5be32a in ?? () Backtrace stopped: Cannot access memory at address 0x85b519b98 (kgdb) fr 15 #15 0xffffffff80b6bd26 in knote_fdclose (td=td@entry=0xff0100026dc5d780, fd=fd@entry=544408) at /usr/src/sys/kern/kern_event.c:2859 2859 MPASS(kn->kn_kq == kq); (kgdb) p *((struct kqueue*)$r15) $11 = { kq_lock = { lock_object = { lo_name = 0xffffffff8133f15f "kqueue", lo_flags = 21168128, lo_data = 0, lo_witness = 0xff0100804bd8db80 }, mtx_lock = 18374967965079951232 }, kq_refcnt = 0, kq_list = { tqe_next = 0xff0100010ac21b00, tqe_prev = 0xff010002a0adb128 }, kq_head = { tqh_first = 0x0, tqh_last = 0xff0100010aac4b38 }, kq_count = 0, kq_sel = { si_tdlist = { tqh_first = 0x0, tqh_last = 0x0 }, si_note = { kl_list = { slh_first = 0x0 }, kl_lock = 0xffffffff80b6b420 <knlist_mtx_lock>, kl_unlock = 0xffffffff80b6b440 <knlist_mtx_unlock>, kl_assert_lock = 0xffffffff80b6b460 <knlist_mtx_assert_lock>, kl_lockarg = 0xff0100010aac4b00, kl_autodestroy = 0 }, si_mtx = 0x0 }, kq_sigio = 0x0, kq_fdp = 0xfffffe02c31fec90, kq_state = 0, kq_knlistsize = 685056, kq_knlist = 0xfffffe0bbf526000, kq_knhashmask = 0, kq_knhash = 0x0, kq_task = { ta_link = { stqe_next = 0x0 }, ta_pending = 0, ta_priority = 0 '\000', ta_flags = 0 '\000', ta_func = 0xffffffff80b6dd00 <kqueue_task>, ta_context = 0xff0100010aac4b00 }, kq_cred = 0xff0100010a9eb600, kq_forksrc = 0x0 } (kgdb) p *((struct eknote*)kn) $3 = { k = { kn_link = { sle_next = 0xdededededededede }, kn_selnext = { sle_next = 0xdededededededede }, kn_knlist = 0xdededededededede, kn_tqe = { tqe_next = 0xdededededededede, tqe_prev = 0xdededededededede }, kn_kq = 0xdededededededede, kn_kevent = { ident = 16059518370053021406, filter = -8482, flags = 57054, fflags = 3739147998, data = -2387225703656530210, udata = 0xdededededededede, ext = {16059518370053021406, 16059518370053021406, 16059518370053021406, 16059518370053021406} }, kn_hook = 0xdededededededede, kn_hookid = -555819298, kn_status = -555819298, kn_influx = -555819298, kn_sfflags = 3739147998, kn_sdata = -2387225703656530210, kn_ptr = { p_fp = 0xdededededededede, p_proc = 0xdededededededede, p_aio = 0xdededededededede, p_lio = 0xdededededededede, p_prison = 0xdededededededede, p_v = 0xdededededededede }, kn_fop = 0xdededededededede }, c = { kn_link = { sle_next = 0x0 }, kn_selnext = { sle_next = 0xffffffffffffffff }, kn_knlist = 0x0, kn_tqe = { tqe_next = 0xffffffffffffffff, tqe_prev = 0xffffffffffffffff }, kn_kq = 0xff0100010aac4b00, kn_kevent = { ident = 138904, filter = -1, flags = 32, fflags = 0, data = 0, udata = 0xdda7cbe67c0, ext = {0, 0, 0, 0} }, kn_hook = 0x0, kn_hookid = 0, kn_status = 8, kn_influx = 1, kn_sfflags = 0, kn_sdata = 0, kn_ptr = { p_fp = 0x0, p_proc = 0x0, p_aio = 0x0, p_lio = 0x0, p_prison = 0x0, p_v = 0x0 }, kn_fop = 0x0 }, s = { depth = 5, pcs = {18446744071574043258, 18446744071574043755, 18446744071574008936, 18446744071580071737, 18446744071579870491, 0 <repeats 13 times>} } } (kgdb) p (void*)18446744071574043258 $4 = (void *) 0xffffffff80b6ba7a <knote_drop_detached+554> (kgdb) p (void*)18446744071574043755 $5 = (void *) 0xffffffff80b6bc6b <knote_fdclose+379> (kgdb) p (void*)18446744071574008936 $6 = (void *) 0xffffffff80b63468 <closefp_impl+168> (kgdb) p (void*)18446744071580071737 $7 = (void *) 0xffffffff8112b739 <amd64_syscall+361> (kgdb) p (void*)18446744071579870491 $8 = (void *) 0xffffffff810fa51b <fast_syscall_common+248> Please, tell us if you need anything else.
Figured you might also want this. (kgdb) p (void*)$r15 $2 = (void *) 0xff0100010aac4b00 (kgdb) fr 16 #16 0xffffffff80b63468 in closefp_impl (fdp=0xfffffe02c31fec90, fd=544408, fp=0xff01002710a27640, td=0xff0100026dc5d780, audit=true) at /usr/src/sys/kern/kern_descrip.c:1413 1413 knote_fdclose(td, fd); (kgdb) p *fp $5 = { f_flag = 7, f_count = 1, f_data = 0xff01005c9babe000, f_ops = 0xffffffff814d4aa0 <socketops>, f_vnode = 0x0, f_cred = 0xff0100010a9eb600, f_type = 2, f_vflags = 0, { f_seqcount = {0, 0}, f_pipegen = 0 }, f_nextoff = {0, 0}, f_vnun = { fvn_cdevpriv = 0x0, fvn_advice = 0x0 }, f_offset = 0 }
(In reply to Paul from comment #28) The saved knote has ident = 138904, while stack trace shows fd=544408. This should not happen, the file descriptor must be same as the knote which we find on the descriptor-indexed list. Suppose that the debugging info is correct. This means that the knote existed on the list for both 544408 and 138904? This could be if e.g. both heads pointed to the knote. Then if fd 138904 got the knote already removed or even not found on that list (as recorded by the stack trace), it is freed, but the pointer on the list 544408 is dandling. Then we get the situation in the panic. If the speculation above holds, then somehow we manage to insert same knote into two different fd's lists. Again, I do not understand how could this happen. I modified the debugging patch to check just that, that knotes are never inserted twice into kn_link. Could you run HEAD with only the patch below applied, please? No need to apply any of my previous patches, but do update to the latest HEAD since Mark already committed his fix. I am interested if you hit the added panics. commit 8b90be098a85975dc437c53294987c0d7106fcf9 Author: Konstantin Belousov <kib@FreeBSD.org> Date: Wed Mar 25 03:47:59 2026 +0200 debug diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c index 1a719f7c945c..7d99f00eeb8c 100644 --- a/sys/kern/kern_event.c +++ b/sys/kern/kern_event.c @@ -229,6 +229,12 @@ static const struct filterops user_filtops = { .f_copy = knote_triv_copy, }; +struct eknote { + struct knote k; + struct knote c; + int on_kn_link; +}; + static uma_zone_t knote_zone; static unsigned int __exclusive_cache_line kq_ncallouts; static unsigned int kq_calloutmax = 4 * 1024; @@ -2887,6 +2893,9 @@ knote_attach(struct knote *kn, struct kqueue *kq) return (ENOMEM); list = &kq->kq_knhash[KN_HASH(kn->kn_id, kq->kq_knhashmask)]; } + if (atomic_fetchadd_int(&__containerof(kn, struct eknote, + k)->on_kn_link, 1) != 0) + panic("Double-linking knote %p", kn); SLIST_INSERT_HEAD(list, kn, kn_link); return (0); } @@ -2928,8 +2937,11 @@ knote_drop_detached(struct knote *kn, struct thread *td) else list = &kq->kq_knhash[KN_HASH(kn->kn_id, kq->kq_knhashmask)]; - if (!SLIST_EMPTY(list)) - SLIST_REMOVE(list, kn, knote, kn_link); + if (atomic_fetchadd_int(&__containerof(kn, struct eknote, + k)->on_kn_link, -1) != 1) + panic("removal of double-linked knote %p", kn); + SLIST_REMOVE(list, kn, knote, kn_link); + if (kn->kn_status & KN_QUEUED) knote_dequeue(kn); KQ_UNLOCK_FLUX(kq); @@ -2974,8 +2986,8 @@ static void knote_init(void *dummy __unused) { - knote_zone = uma_zcreate("KNOTE", sizeof(struct knote), NULL, NULL, - NULL, NULL, UMA_ALIGN_PTR, 0); + knote_zone = uma_zcreate("KNOTE", sizeof(struct eknote), NULL, NULL, + NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE); ast_register(TDA_KQUEUE, ASTR_ASTF_REQUIRED, 0, ast_kqueue); prison0.pr_klist = knlist_alloc(&prison0.pr_mtx); } @@ -2984,15 +2996,24 @@ SYSINIT(knote, SI_SUB_PSEUDO, SI_ORDER_ANY, knote_init, NULL); static struct knote * knote_alloc(int mflag) { + struct eknote *e; - return (uma_zalloc(knote_zone, mflag | M_ZERO)); + e = uma_zalloc(knote_zone, mflag); + memset(e, 0, sizeof(*e)); + return (&e->k); } static void knote_free(struct knote *kn) { + struct eknote *e; - uma_zfree(knote_zone, kn); + if (kn == NULL) + return; + e = __containerof(kn, struct eknote, k); + e->c = e->k; + memset(&e->k, 0xdeadc0de, sizeof(e->k)); + uma_zfree(knote_zone, e); } /*
Thank you for a prompt reply! We are definitely not tampering with the core dump. And only using the HEAD + patch. No custom kernel modules, nothing like that. Curiously, may be just a coincidence, but I've noticed that 544408 - 138904 = 405504, which is a multiple of 4096. Specifically 99 * 4096. Could it be that this is due to some hash-table optimization and them ending up in the same bucket? This may explain why seemingly unrelated fds end up colliding somehow. Though, may just be a coincidence. And 4096 seems a plausible 'magick number' to be used somewhere. Like in a simplified hash function (remainder of division) or just a result of an exponential HT growth with a factor of 2. Regardless, we are now switching to HEAD with only the latest patch which you provided. Thanks!
(In reply to Paul from comment #30) By 'trusting the debug info' I mean that backtraces are known to have issues with calculating the correct values for the function's arguments and locals. The matter improved recently with more correct debuginfo in clang, but it is still sometimes wrong. So the fd value for close from the backtrace is hopefully right, which is the basis for my theory, but it is not guaranteed. WRT the multiple of the page size distance between two fds, I am not sure if this is of any significance. For fd-based knotes, in your case, read filter on socket, no hashing is done. The kqueues allocates a large array of single-linked list heads, one for each fd index. If fd table is expanded and a knote is allocated for fd that is larger than the kqueue array of heads, the array is reallocated. This is done under the kq lock, so hopefully is not an issue.
Just to clarify, due to features of modular arithmetic: 544408 mod 2^n == 138904 mod 2^n for n in [0, 12] So this might as well be a much greater coincidence (them having same modulo up to 2^12) and the real magic number here is smaller than 4096, like 1024, or: #define KN_HASHSIZE 64 /* XXX should be tunable */
(In reply to Paul from comment #32) As I said above, filedescriptor-based events are not hashed. KN_HASH() is only relevant for knotes that use something else than files for their idents. So unless there is somewhere an awful mishap of mixing files- and non-files-knotes, I do not think that KN_HASH() is at play there.
Hi, It crashed again. Though in a different place now. But that's because the old place (`MPASS(kn->kn_kq == kq);`) is no longer present in the latest patch. We are currently using e220af9cee74082841db9e80cbb73f570f3a570f + the latest patch. Unread portion of the kernel message buffer: panic: mutex kqueue not owned at /usr/src/sys/kern/kern_event.c:316 cpuid = 30 time = 1774943888 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe069b454810 vpanic() at vpanic+0x136/frame 0xfffffe069b454940 panic() at panic+0x43/frame 0xfffffe069b4549a0 __mtx_assert() at __mtx_assert+0xa9/frame 0xfffffe069b4549b0 kqueue_register() at kqueue_register+0x82a/frame 0xfffffe069b454a30 kqueue_kevent() at kqueue_kevent+0xc9/frame 0xfffffe069b454c90 kern_kevent_fp() at kern_kevent_fp+0x9b/frame 0xfffffe069b454ce0 kern_kevent() at kern_kevent+0x82/frame 0xfffffe069b454d40 kern_kevent_generic() at kern_kevent_generic+0x70/frame 0xfffffe069b454da0 sys_kevent() at sys_kevent+0x61/frame 0xfffffe069b454e00 amd64_syscall() at amd64_syscall+0x169/frame 0xfffffe069b454f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe069b454f30 --- syscall (560, FreeBSD ELF64, kevent), rip = 0x82cf723ea, rsp = 0x858888b18, rbp = 0x858888c00 --- KDB: enter: panic (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:399 #2 0xffffffff804b60a8 in db_fncall_generic (nargs=0, args=0xfffffe069b454230, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:631 #3 db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:679 #4 0xffffffff804b5b2d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:508 #5 0xffffffff804b5c76 in db_command_script (command=command@entry=0xffffffff81bd7722 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:573 #6 0xffffffff804bba58 in db_script_exec (scriptname=scriptname@entry=0xfffffe069b454400 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:301 #7 0xffffffff804bb952 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:323 #8 0xffffffff804b91e1 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:266 #9 0xffffffff80c23aef in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe069b454750) at /usr/src/sys/kern/subr_kdb.c:790 #10 0xffffffff811318fd in trap (frame=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:697 #11 <signal handler called> #12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556 #13 0xffffffff80bd0a6b in vpanic (fmt=0xffffffff812def18 "mutex %s not owned at %s:%d", ap=ap@entry=0xfffffe069b454980) at /usr/src/sys/kern/kern_shutdown.c:962 #14 0xffffffff80bd08d3 in panic (fmt=0xffffffff81da22a0 <cnputs_mtx> "\325\376!\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:887 #15 0xffffffff80ba9729 in __mtx_assert (c=<optimized out>, what=what@entry=4, file=0xffffffff8134c5ca "/usr/src/sys/kern/kern_cons.c", line=-1129344347, line@entry=316) at /usr/src/sys/kern/kern_mutex.c:1108 #16 0xffffffff80b7004a in kn_enter_flux (kn=0xff0100303150bcd0) at /usr/src/sys/kern/kern_event.c:316 #17 kqueue_register (kq=kq@entry=0xff010002208b5b00, kev=kev@entry=0xfffffe069b454a40, td=td@entry=0xff010001e7d30000, mflag=mflag@entry=2) at /usr/src/sys/kern/kern_event.c:1868 #18 0xffffffff80b71239 in kqueue_kevent (kq=kq@entry=0xff010002208b5b00, td=td@entry=0xff010001e7d30000, nchanges=nchanges@entry=1, nevents=nevents@entry=0, k_ops=k_ops@entry=0xfffffe069b454de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1515 #19 0xffffffff80b710fb in kern_kevent_fp (td=td@entry=0xff010001e7d30000, fp=<optimized out>, nchanges=nchanges@entry=1, nevents=nevents@entry=0, k_ops=k_ops@entry=0xfffffe069b454de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1546 #20 0xffffffff80b71012 in kern_kevent (td=td@entry=0xff010001e7d30000, fd=<optimized out>, nchanges=1, nevents=0, k_ops=k_ops@entry=0xfffffe069b454de0, timeout=timeout@entry=0x0) at /usr/src/sys/kern/kern_event.c:1486 #21 0xffffffff80b70d10 in kern_kevent_generic (td=0xff010001e7d30000, uap=uap@entry=0xfffffe069b454db0, k_ops=k_ops@entry=0xfffffe069b454de0, struct_name=0xffffffff81308166 "kevent") at /usr/src/sys/kern/kern_event.c:1342 #22 0xffffffff80b70c01 in sys_kevent (td=0xffffffff81da22a0 <cnputs_mtx>, uap=<optimized out>) at /usr/src/sys/kern/kern_event.c:1315 #23 0xffffffff81132739 in syscallenter (td=0xff010001e7d30000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #24 amd64_syscall (td=0xff010001e7d30000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1267 #25 <signal handler called> #26 0x000000082cf723ea in ?? () Backtrace stopped: Cannot access memory at address 0x858888b18 (kgdb) fr 16 #16 0xffffffff80b7004a in kn_enter_flux (kn=0xff0100303150bcd0) at /usr/src/sys/kern/kern_event.c:316 316 KQ_OWNED(kn->kn_kq); (kgdb) p *((struct eknote*)kn) $1 = { k = { kn_link = { sle_next = 0x0 }, kn_selnext = { sle_next = 0x0 }, kn_knlist = 0xff010011b7fb0838, kn_tqe = { tqe_next = 0xffffffffffffffff, tqe_prev = 0xffffffffffffffff }, kn_kq = 0xff0100021013b000, kn_kevent = { ident = 298277, filter = -1, flags = 32, fflags = 0, data = 0, udata = 0x35934102e9d0, ext = {0, 0, 0, 0} }, kn_hook = 0x0, kn_hookid = 0, kn_status = 0, kn_influx = 0, kn_sfflags = 0, kn_sdata = 0, kn_ptr = { p_fp = 0xff01005c34494280, p_proc = 0xff01005c34494280, p_aio = 0xff01005c34494280, p_lio = 0xff01005c34494280, p_prison = 0xff01005c34494280, p_v = 0xff01005c34494280 }, kn_fop = 0xffffffff814dd950 <soread_filtops> }, c = { kn_link = { sle_next = 0x0 }, kn_selnext = { sle_next = 0x0 }, kn_knlist = 0x0, kn_tqe = { tqe_next = 0x0, tqe_prev = 0x0 }, kn_kq = 0x0, kn_kevent = { ident = 0, filter = 0, flags = 0, fflags = 0, data = 0, udata = 0x0, ext = {0, 0, 0, 0} }, kn_hook = 0x0, kn_hookid = 0, kn_status = 0, kn_influx = 0, kn_sfflags = 0, kn_sdata = 0, kn_ptr = { p_fp = 0x0, p_proc = 0x0, p_aio = 0x0, p_lio = 0x0, p_prison = 0x0, p_v = 0x0 }, kn_fop = 0x0 }, on_kn_link = 1 } (kgdb) fr 17 #17 kqueue_register (kq=kq@entry=0xff010002208b5b00, kev=kev@entry=0xfffffe069b454a40, td=td@entry=0xff010001e7d30000, mflag=mflag@entry=2) at /usr/src/sys/kern/kern_event.c:1868 1868 kn_enter_flux(kn); (kgdb) p *kq $3 = { kq_lock = { lock_object = { lo_name = 0xffffffff813464b6 "kqueue", lo_flags = 21168128, lo_data = 0, lo_witness = 0xff0100804bd8db80 }, mtx_lock = 18374967962832666624 }, kq_refcnt = 1, kq_list = { tqe_next = 0xff010002039d2400, tqe_prev = 0xff010002039cfb28 }, kq_head = { tqh_first = 0x0, tqh_last = 0xff010002208b5b38 }, kq_count = 0, kq_sel = { si_tdlist = { tqh_first = 0x0, tqh_last = 0x0 }, si_note = { kl_list = { slh_first = 0x0 }, kl_lock = 0xffffffff80b71f90 <knlist_mtx_lock>, kl_unlock = 0xffffffff80b71fb0 <knlist_mtx_unlock>, kl_assert_lock = 0xffffffff80b71fd0 <knlist_mtx_assert_lock>, kl_lockarg = 0xff010002208b5b00, kl_autodestroy = 0 }, si_mtx = 0x0 }, kq_sigio = 0x0, kq_fdp = 0xfffffe0694a7bc90, kq_state = 0, kq_knlistsize = 683008, kq_knlist = 0xfffffe09db9fe000, kq_knhashmask = 0, kq_knhash = 0x0, kq_task = { ta_link = { stqe_next = 0x0 }, ta_pending = 0, ta_priority = 0 '\000', ta_flags = 0 '\000', ta_func = 0xffffffff80b747e0 <kqueue_task>, ta_context = 0xff010002208b5b00 }, kq_cred = 0xff010027f901b480, kq_forksrc = 0x0 } (kgdb) p *kev $4 = { ident = 297765, filter = -1, flags = 2, fflags = 0, data = 0, udata = 0x0, ext = {0, 0, 0, 0} } Seems like knote and kevent don't actually match one another, but both have filter equal to -1, which seems to be sufficient to mistake them. As seen in kqueue_register(): if (kev->filter == kn->kn_filter) break; Seems like the simple fix should be to improve the matching strategy and account for -1?
(In reply to Paul from comment #34) No, the problem is the knote found on the kq_knlist[ident] belongs to different kqueue. In other words, kn->kn_kq != kq. This is why the assert about owning the kq lock failed. This is very strange thing to happen, but it is consistent with the previous report. Lets add asserts everywhere that kn->kn_kq is matching the code expectation. Also I am somewhat suspicious of kqueue_expand(), I added a check after bzero(). What is the CPU that you use? Show the CPU ident lines from the verbose dmesg. Or/and, show the output from 'cpuid-etallen -1'.
Created attachment 269251 [details] debugging patch
I see, sorry for pushing my bad analysis. This is a pretty recent (and decent) CPU. We are using it on many other servers with a bit different kind of load (not that many open sockets per process), but with much more stress overall. And only in this case the issue occurs. Doubtful that's the CPU hardware bug. We will now be applying you latest patch, thanks for helping! # cpuid-etallen -1 CPU: vendor_id = "AuthenticAMD" version information (1/eax): processor type = primary processor (0) family = 0xf (15) model = 0x2 (2) stepping id = 0x1 (1) extended family = 0xb (11) extended model = 0x0 (0) (family synth) = 0x1a (26) (model synth) = 0x2 (2) miscellaneous (1/ebx): process local APIC physical ID = 0x33 (51) maximum IDs for CPUs in pkg = 0x20 (32) CLFLUSH line size = 0x8 (8) brand index = 0x0 (0) brand id = 0x00 (0): unknown feature information (1/edx): x87 FPU on chip = true VME: virtual-8086 mode enhancement = true DE: debugging extensions = true PSE: page size extensions = true TSC: time stamp counter = true RDMSR and WRMSR support = true PAE: physical address extensions = true MCE: machine check exception = true CMPXCHG8B inst. = true APIC on chip = true SYSENTER and SYSEXIT = true MTRR: memory type range registers = true PTE global bit = true MCA: machine check architecture = true CMOV: conditional move/compare instr = true PAT: page attribute table = true PSE-36: page size extension = true PSN: processor serial number = false CLFLUSH instruction = true DS: debug store = false ACPI: thermal monitor and clock ctrl = false MMX Technology = true FXSAVE/FXRSTOR = true SSE extensions = true SSE2 extensions = true SS: self snoop = false hyper-threading / multi-core supported = true TM: therm. monitor = false IA64 = false PBE: pending break event = false feature information (1/ecx): PNI/SSE3: Prescott New Instructions = true PCLMULDQ instruction = true DTES64: 64-bit debug store = false MONITOR/MWAIT = true CPL-qualified debug store = false VMX: virtual machine extensions = false SMX: safer mode extensions = false Enhanced Intel SpeedStep Technology = false TM2: thermal monitor 2 = false SSSE3 extensions = true context ID: adaptive or shared L1 data = false SDBG: IA32_DEBUG_INTERFACE = false FMA instruction = true CMPXCHG16B instruction = true xTPR disable = false PDCM: perfmon and debug = false PCID: process context identifiers = true DCA: direct cache access = false SSE4.1 extensions = true SSE4.2 extensions = true x2APIC: extended xAPIC support = true MOVBE instruction = true POPCNT instruction = true time stamp counter deadline = false AES instruction = true XSAVE/XSTOR states = true OS-enabled XSAVE/XSTOR = true AVX: advanced vector extensions = true F16C half-precision convert instruction = true RDRAND instruction = true hypervisor guest status = false cache and TLB information (2): processor serial number = 00B0-0F21-0000-0000-0000-0000 deterministic cache parameters (4): --- cache 0 --- cache type = no more caches (0) MONITOR/MWAIT (5): smallest monitor-line size (bytes) = 0x40 (64) largest monitor-line size (bytes) = 0x40 (64) enum of Monitor-MWAIT exts supported = true supports intrs as break-event for MWAIT = true monitorless MWAIT supported = false number of C0 sub C-states using MWAIT = 0x1 (1) number of C1 sub C-states using MWAIT = 0x2 (2) number of C2 sub C-states using MWAIT = 0x0 (0) number of C3 sub C-states using MWAIT = 0x0 (0) number of C4 sub C-states using MWAIT = 0x0 (0) number of C5 sub C-states using MWAIT = 0x0 (0) number of C6 sub C-states using MWAIT = 0x0 (0) number of C7 sub C-states using MWAIT = 0x0 (0) Thermal and Power Management Features (6): digital thermometer = false Intel Turbo Boost Technology = false ARAT always running APIC timer = true PLN power limit notification = false ECMD extended clock modulation duty = false PTM package thermal management = false HWP base registers = false HWP notification = false HWP activity window = false HWP energy performance preference = false HWP package level request = false HDC base registers = false Intel Turbo Boost Max Technology 3.0 = false HWP capabilities = false HWP PECI override = false flexible HWP = false IA32_HWP_REQUEST MSR fast access mode = false HW_FEEDBACK MSRs supported = false ignoring idle logical processor HWP req = false IA32_HWP_CTL MSR supported = false Thread Director = false IA32_HW_FEEDBACK_THREAD_CONFIG bit 25 = false digital thermometer thresholds = 0x0 (0) hardware coordination feedback = true ACNT2 available = false performance-energy bias capability = false number of enh hardware feedback classes = 0x0 (0) performance capability reporting = false energy efficiency capability reporting = false size of feedback struct (4KB pages) = 0x1 (1) index of CPU's row in feedback struct = 0x0 (0) extended feature flags (7): FSGSBASE instructions = true IA32_TSC_ADJUST MSR supported = true SGX: Software Guard Extensions supported = false BMI1 instructions = true HLE hardware lock elision = false AVX2: advanced vector extensions 2 = true FDP_EXCPTN_ONLY = false SMEP supervisor mode exec protection = true BMI2 instructions = true enhanced REP MOVSB/STOSB = true INVPCID instruction = true RTM: restricted transactional memory = false RDT-CMT/PQoS cache monitoring = true deprecated FPU CS/DS = false MPX: intel memory protection extensions = false RDT-CAT/PQE cache allocation = true AVX512F: AVX-512 foundation instructions = true AVX512DQ: double & quadword instructions = true RDSEED instruction = true ADX instructions = true SMAP: supervisor mode access prevention = true AVX512IFMA: integer fused multiply add = true PCOMMIT instruction = false CLFLUSHOPT instruction = true CLWB instruction = true Intel processor trace = false AVX512PF: prefetch instructions = false AVX512ER: exponent & reciprocal instrs = false AVX512CD: conflict detection instrs = true SHA instructions = true AVX512BW: byte & word instructions = true AVX512VL: vector length = true PREFETCHWT1 = false AVX512VBMI: vector byte manipulation = true UMIP: user-mode instruction prevention = true PKU protection keys for user-mode = true OSPKE CR4.PKE and RDPKRU/WRPKRU = true WAITPKG instructions = false AVX512_VBMI2: byte VPCOMPRESS, VPEXPAND = true CET_SS: CET shadow stack = true GFNI: Galois Field New Instructions = true VAES instructions = true VPCLMULQDQ instruction = true AVX512_VNNI: neural network instructions = true AVX512_BITALG: bit count/shiffle = true TME: Total Memory Encryption = false AVX512: VPOPCNTDQ instruction = true LA57: 57-bit addrs & 5-level paging = true BNDLDX/BNDSTX MAWAU value in 64-bit mode = 0x0 (0) RDPID: read processor ID supported = true KL: key locker = false bus lock detection = true CLDEMOTE supports cache line demote = false MOVDIRI instruction = true MOVDIR64B instruction = true ENQCMD instruction = false SGX_LC: SGX launch config supported = false PKS: supervisor protection keys = false SGX-KEYS: SGX attestation services = false AVX512_4VNNIW: neural network instrs = false AVX512_4FMAPS: multiply acc single prec = false fast short REP MOV = true UINTR: user interrupts = false AVX512_VP2INTERSECT: intersect mask regs = true IA32_MCU_OPT_CTRL SRBDS mitigation MSR = false VERW MD_CLEAR microcode support = false RTM transaction always aborts = false IA32_TSX_FORCE_ABORT MSR = false SERIALIZE instruction = false hybrid part = false TSXLDTRK: TSX suspend load addr tracking = false PCONFIG instruction = false LBR: architectural last branch records = false CET_IBT: CET indirect branch tracking = false AMX-BF16: tile bfloat16 support = false AVX512_FP16: fp16 support = false AMX-TILE: tile architecture support = false AMX-INT8: tile 8-bit integer support = false IBRS/IBPB: indirect branch restrictions = false STIBP: 1 thr indirect branch predictor = false L1D_FLUSH: IA32_FLUSH_CMD MSR = true IA32_ARCH_CAPABILITIES MSR = false IA32_CORE_CAPABILITIES MSR = false SSBD: speculative store bypass disable = false SHA512 instructions = false SM3 instructions = false SM4 instructions = false RAO-INT atomic instructions = false AVX-VNNI: AVX VNNI neural network instrs = true AVX512_BF16: bfloat16 instructions = true LASS: linear address space separation = false CMPccXADD instructions = false ArchPerfmonExt leaf 0x23 is valid = false fast zero-length REP MOVSB = false fast short REP STOSB = false fast short REP CMPSB, REP SCASB = false FRED transitions & MSRs = false LKGS instruction = false WRMSRNS instruction = false NMI-source reporting = false AMX-FP16: FP16 tile operations = false HRESET: history reset support = false AVX-IFMA: integer fused multiply add = false LAM: linear address masking = false RDMSRLIST, WRMSRLIST instructions = false INVD prevention after BIOS done = false MOVRS instructions = false IA32_PPIN & IA32_PPIN_CTL MSRs supported = false PBNDKB instruction = false IA32_MISC_ENABLE cannot limit CPUID max = false asymmetric RDT-M monitoring = false asymmetric RDT-A allocation = false RDMSR/WRMSRNS immediates supported = false AVX-VNNI-INT8 instructions = false AVX-NE-CONVERT instructions = false AMX-COMPLEX instructions = false AVX-VNNI-INT16 instructions = false user-timer events = false PREFETCHIT0, PREFETCHIT1 instructions = false URDMSR, UWRMSR instructions = false UIRET flexibly updates UIF = false CET_SSS: shadow stacks w/o page faults = false AVX10 instructions = false APX advanced performance extensions = false MWAIT instruction = false Direct Cache Access Parameters (9): PLATFORM_DCA_CAP MSR bits = 0 Architecture Performance Monitoring Features (0xa): version ID = 0x0 (0) number of counters per logical processor = 0x0 (0) bit width of counter = 0x0 (0) length of EBX bit vector = 0x0 (0) core cycle event = not available instruction retired event = not available reference cycles event = not available last-level cache ref event = not available last-level cache miss event = not available branch inst retired event = not available branch mispred retired event = not available topdown slots event = not available topdown backend bound = not available topdown bad speculation = not available topdown frontend bound = not available topdown retiring = not available LBR inserts = not available fixed counter 0 supported = false fixed counter 1 supported = false fixed counter 2 supported = false fixed counter 3 supported = false fixed counter 4 supported = false fixed counter 5 supported = false fixed counter 6 supported = false fixed counter 7 supported = false fixed counter 8 supported = false fixed counter 9 supported = false fixed counter 10 supported = false fixed counter 11 supported = false fixed counter 12 supported = false fixed counter 13 supported = false fixed counter 14 supported = false fixed counter 15 supported = false fixed counter 16 supported = false fixed counter 17 supported = false fixed counter 18 supported = false fixed counter 19 supported = false fixed counter 20 supported = false fixed counter 21 supported = false fixed counter 22 supported = false fixed counter 23 supported = false fixed counter 24 supported = false fixed counter 25 supported = false fixed counter 26 supported = false fixed counter 27 supported = false fixed counter 28 supported = false fixed counter 29 supported = false fixed counter 30 supported = false fixed counter 31 supported = false number of contiguous fixed counters = 0x0 (0) bit width of fixed counters = 0x0 (0) anythread deprecation = false x2APIC features / processor topology (0xb): extended APIC ID = 51 --- level 0 --- level number = 0x0 (0) level type = thread (1) bit width of level & previous levels = 0x0 (0) number of logical processors at level = 0x1 (1) --- level 1 --- level number = 0x1 (1) level type = core (2) bit width of level & previous levels = 0x7 (7) number of logical processors at level = 0x20 (32) --- level 2 --- level number = 0x2 (2) level type = invalid (0) bit width of level & previous levels = 0x0 (0) number of logical processors at level = 0x0 (0) XSAVE features (0xd/0): XCR0 valid bit field mask = 0x00000000000002e7 x87 state = true SSE state = true AVX state = true MPX BNDREGS = false MPX BNDCSR = false AVX-512 opmask = true AVX-512 ZMM_Hi256 = true AVX-512 Hi16_ZMM = true PKRU state = true XTILECFG state = false XTILEDATA state = false bytes required by fields in XCR0 = 0x00000988 (2440) bytes required by XSAVE/XRSTOR area = 0x00000988 (2440) XSAVEOPT instruction = true XSAVEC instruction = true XGETBV instruction = true XSAVES/XRSTORS instructions = true XFD: extended feature disable supported = false SAVE area size in bytes = 0x00000988 (2440) IA32_XSS valid bit field mask = 0x0000000000001800 PT state = false PASID state = false CET_U user state = true CET_S supervisor state = true HDC state = false UINTR state = false LBR state = false HWP state = false AVX/YMM features (0xd/2): AVX/YMM save state byte size = 0x00000100 (256) AVX/YMM save state byte offset = 0x00000240 (576) supported in IA32_XSS or XCR0 = XCR0 (user state) 64-byte alignment in compacted XSAVE = false XFD faulting supported = false AVX-512 opmask features (0xd/5): AVX-512 opmask save state byte size = 0x00000040 (64) AVX-512 opmask save state byte offset = 0x00000340 (832) supported in IA32_XSS or XCR0 = XCR0 (user state) 64-byte alignment in compacted XSAVE = false XFD faulting supported = false AVX-512 ZMM_Hi256 features (0xd/6): AVX-512 ZMM_Hi256 save state byte size = 0x00000200 (512) AVX-512 ZMM_Hi256 save state byte offset = 0x00000380 (896) supported in IA32_XSS or XCR0 = XCR0 (user state) 64-byte alignment in compacted XSAVE = false XFD faulting supported = false AVX-512 Hi16_ZMM features (0xd/7): AVX-512 Hi16_ZMM save state byte size = 0x00000400 (1024) AVX-512 Hi16_ZMM save state byte offset = 0x00000580 (1408) supported in IA32_XSS or XCR0 = XCR0 (user state) 64-byte alignment in compacted XSAVE = false XFD faulting supported = false PKRU features (0xd/9): PKRU save state byte size = 0x00000008 (8) PKRU save state byte offset = 0x00000980 (2432) supported in IA32_XSS or XCR0 = XCR0 (user state) 64-byte alignment in compacted XSAVE = false XFD faulting supported = false CET_U user features (0xd/0xb): CET_U user save state byte size = 0x00000010 (16) CET_U user save state byte offset = 0x00000000 (0) supported in IA32_XSS or XCR0 = IA32_XSS (supervisor state) 64-byte alignment in compacted XSAVE = false XFD faulting supported = false CET_S supervisor features (0xd/0xc): CET_S supervisor save state byte size = 0x00000018 (24) CET_S supervisor save state byte offset = 0x00000000 (0) supported in IA32_XSS or XCR0 = IA32_XSS (supervisor state) 64-byte alignment in compacted XSAVE = false XFD faulting supported = false Quality of Service Monitoring Resource Type (0xf/0): Maximum range of RMID = 4095 supports L3 cache monitoring = true L3 Cache Quality of Service Monitoring (0xf/1): Conversion factor from IA32_QM_CTR to bytes = 64 Maximum range of RMID = 4095 Counter width = 44 QoS monitoring counter size = 0x2c (44) IA32_QM_CTR bit 61 is overflow = false non-CPU agent cache monitoring (CMT) = false non-CPU agent mem bandwidth mon (MBM) = false (QoS monitoring counter size synth) = 44 supports L3 occupancy monitoring = true supports L3 total bandwidth monitoring = true supports L3 local bandwidth monitoring = true Resource Director Technology Allocation (0x10/0): L3 cache allocation technology supported = true L2 cache allocation technology supported = false memory bandwidth allocation supported = false cache bandwidth allocation supported = false L3 Cache Allocation Technology (0x10/1): length of capacity bit mask = 0x10 (16) Bit-granular map of isolation/contention = 0x00000000 non-CPU agent support = false code and data prioritization supported = true non-contiguous bitmask supported = false highest COS number supported = 0xf (15) extended processor signature (0x80000001/eax): family/generation = 0xf (15) model = 0x2 (2) stepping id = 0x1 (1) extended family = 0xb (11) extended model = 0x0 (0) (family synth) = 0x1a (26) (model synth) = 0x2 (2) extended feature flags (0x80000001/edx): x87 FPU on chip = true virtual-8086 mode enhancement = true debugging extensions = true page size extensions = true time stamp counter = true RDMSR and WRMSR support = true physical address extensions = true machine check exception = true CMPXCHG8B inst. = true APIC on chip = true SYSCALL and SYSRET instructions = true memory type range registers = true global paging extension = true machine check architecture = true conditional move/compare instruction = true page attribute table = true page size extension = true multiprocessing capable = false no-execute page protection = true AMD multimedia instruction extensions = true MMX Technology = true FXSAVE/FXRSTOR = true SSE extensions = true 1-GB large page support = true RDTSCP = true long mode (AA-64) = true 3DNow! instruction extensions = false 3DNow! instructions = false extended brand id (0x80000001/ebx): raw = 0x40000000 (1073741824) BrandId = 0x0 (0) PkgType = 0x4 (4) AMD feature flags (0x80000001/ecx): LAHF/SAHF supported in 64-bit mode = true CMP Legacy = true SVM: secure virtual machine = true extended APIC space = true AltMovCr8 = true LZCNT advanced bit manipulation = true SSE4A support = true misaligned SSE mode = true 3DNow! PREFETCH/PREFETCHW instructions = true OS visible workaround = true instruction based sampling = true XOP support = false SKINIT/STGI support = true watchdog timer support = true lightweight profiling support = false 4-operand FMA instruction = false TCE: translation cache extension = true NodeId MSR C001100C = false TBM support = false topology extensions = true core performance counter extensions = true NB/DF performance counter extensions = true data breakpoint extension = true performance time-stamp counter support = false LLC performance counter extensions = true MWAITX/MONITORX supported = true Address mask extension support = true brand = "AMD EPYC 9355 32-Core Processor " L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax): instruction # entries = 0x40 (64) instruction associativity = 0xff (255) data # entries = 0x60 (96) data associativity = 0xff (255) L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx): instruction # entries = 0x40 (64) instruction associativity = 0xff (255) data # entries = 0x60 (96) data associativity = 0xff (255) L1 data cache information (0x80000005/ecx): line size (bytes) = 0x40 (64) lines per tag = 0x1 (1) associativity = 0xc (12) size (KB) = 0x30 (48) L1 instruction cache information (0x80000005/edx): line size (bytes) = 0x40 (64) lines per tag = 0x1 (1) associativity = 0x8 (8) size (KB) = 0x20 (32) L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax): instruction # entries = 0x40 (64) instruction associativity = 2-way (2) data # entries = 0x80 (128) data associativity = 4 to 5-way (4) L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx): instruction # entries = 0x40 (64) instruction associativity = 4 to 5-way (4) data # entries = 0x80 (128) data associativity = 8 to 15-way (6) L2 unified cache information (0x80000006/ecx): line size (bytes) = 0x40 (64) lines per tag = 0x1 (1) associativity = 16 to 31-way (8) size (KB) = 0x400 (1024) L3 cache information (0x80000006/edx): line size (bytes) = 0x40 (64) lines per tag = 0x1 (1) associativity = 0x9 (9) size (in 512KB units) = 0x200 (512) RAS Capability (0x80000007/ebx): MCA overflow recovery support = true SUCCOR support = true HWA: hardware assert support = false scalable MCA support = true Advanced Power Management Features (0x80000007/ecx): CmpUnitPwrSampleTimeRatio = 0x0 (0) Advanced Power Management Features (0x80000007/edx): TS: temperature sensing diode = true FID: frequency ID control = false VID: voltage ID control = false TTP: thermal trip = true TM: thermal monitor = true STC: software thermal control = false 100 MHz multiplier control = false hardware P-State control = true TscInvariant = true CPB: core performance boost = true read-only effective frequency interface = true processor feedback interface = false APM power reporting = false connected standby = true RAPL: running average power limit = true fast CPPC = false Physical Address and Linear Address Size (0x80000008/eax): maximum physical address bits = 0x34 (52) maximum linear (virtual) address bits = 0x39 (57) maximum guest physical address bits = 0x0 (0) Extended Feature Extensions ID (0x80000008/ebx): CLZERO instruction = true instructions retired count support = true always save/restore error pointers = true INVLPGB instruction = true RDPRU instruction = true memory bandwidth enforcement = true MCOMMIT instruction = false WBNOINVD instruction = true IBPB: indirect branch prediction barrier = true interruptible WBINVD, WBNOINVD = true IBRS: indirect branch restr speculation = true STIBP: 1 thr indirect branch predictor = true CPU prefers: IBRS always on = false CPU prefers: STIBP always on = true IBRS preferred over software solution = true IBRS provides same mode protection = true EFER[LMSLE] not supported = true INVLPGB supports TLB flush guest nested = true ppin processor id number supported = true SSBD: speculative store bypass disable = true virtualized SSBD = false SSBD fixed in hardware = false CPPC: collaborative processor perf ctrl = true PSFD: predictive store forward disable = true not vulnerable to branch type confusion = true IBPB_RET: ret addr predictor cleared = true branch sampling feature support = false (vuln to branch type confusion synth) = false Size Identifiers (0x80000008/ecx): number of threads = 0x20 (32) ApicIdCoreIdSize = 0x7 (7) performance time-stamp counter size = 40 bits (0) Feature Extended Size (0x80000008/edx): max page count for INVLPGB instruction = 0x7 (7) RDPRU instruction max input support = 0x1 (1) SVM Secure Virtual Machine (0x8000000a/eax): SvmRev: SVM revision = 0x1 (1) SVM Secure Virtual Machine (0x8000000a/edx): nested paging = true LBR virtualization = true SVM lock = true NRIP save = true MSR based TSC rate control = true VMCB clean bits support = true flush by ASID = true decode assists = true PMC virtualization = true SSSE3/SSE5 opcode set disable = false pause intercept filter = true pause filter threshold = true AVIC: AMD virtual interrupt controller = true virtualized VMLOAD/VMSAVE = true virtualized global interrupt flag (GIF) = true GMET: guest mode execute trap = true X2AVIC: virtualized X2APIC = true supervisor shadow stack = true guest Spec_ctl support = true ROGPT: read-only guest page table = true host MCE override = true INVLPGB/TLBSYNC hyperv interc enable = true VNMI: NMI virtualization = true IBS virtualization = true extended LVT AVIC access changes = true guest VMCB addr check = true bus lock threshold = true idle HLT intercept = true EXITINFO1 non-interceptible shutdown = true NASID: number of address space identifiers = 0x8000 (32768): L1 TLB information: 1G pages (0x80000019/eax): instruction # entries = 0x40 (64) instruction associativity = full (15) data # entries = 0x60 (96) data associativity = full (15) L2 TLB information: 1G pages (0x80000019/ebx): instruction # entries = 0x0 (0) instruction associativity = L2 off (0) data # entries = 0x20 (32) data associativity = 4 to 5-way (4) Performance Optimization Identifiers (0x8000001a/eax): 128-bit SSE executed full-width = false MOVU* better than MOVL*/MOVH* = true 256-bit SSE executed full-width = false 512-bit SSE executed full-width = true Instruction Based Sampling Identifiers (0x8000001b/eax): IBS feature flags valid = true IBS fetch sampling = true IBS execution sampling = true read write of op counter = true op counting mode = true branch target address reporting = true IbsOpCurCnt and IbsOpMaxCnt extend 7 = true invalid RIP indication support = true fused branch micro-op indication support = true IBS fetch control extended MSR support = true IBS op data 4 MSR support = false IBS L3 miss filtering support = true IBS load latency filtering support = true simplified DTLB page size & miss report = true Lightweight Profiling Capabilities: Availability (0x8000001c/eax): lightweight profiling = false LWPVAL instruction = false instruction retired event = false branch retired event = false DC miss event = false core clocks not halted event = false core reference clocks not halted event = false continuous mode sampling = false tsc in event record = false interrupt on threshold overflow = false Lightweight Profiling Capabilities: Supported (0x8000001c/edx): lightweight profiling = false LWPVAL instruction = false instruction retired event = false branch retired event = false DC miss event = false core clocks not halted event = false core reference clocks not halted event = false continuous mode sampling = false tsc in event record = false interrupt on threshold overflow = false Lightweight Profiling Capabilities (0x8000001c/ebx): LWPCB byte size = 0x0 (0) event record byte size = 0x0 (0) maximum EventId = 0x0 (0) EventInterval1 field offset = 0x0 (0) Lightweight Profiling Capabilities (0x8000001c/ecx): latency counter bit size = 0x0 (0) data cache miss address valid = false amount cache latency is rounded = 0x0 (0) LWP implementation version = 0x0 (0) event ring buffer size in records = 0x0 (0) branch prediction filtering = false IP filtering = false cache level filtering = false cache latency filteing = false Cache Properties (0x8000001d): --- cache 0 --- type = data (1) level = 0x1 (1) self-initializing = true fully associative = false extra cores sharing this cache = 0x0 (0) line size in bytes = 0x40 (64) physical line partitions = 0x1 (1) number of ways = 0xc (12) number of sets = 64 write-back invalidate = false cache inclusive of lower levels = false (synth size) = 49152 (48 KB) --- cache 1 --- type = instruction (2) level = 0x1 (1) self-initializing = true fully associative = false extra cores sharing this cache = 0x0 (0) line size in bytes = 0x40 (64) physical line partitions = 0x1 (1) number of ways = 0x8 (8) number of sets = 64 write-back invalidate = false cache inclusive of lower levels = false (synth size) = 32768 (32 KB) --- cache 2 --- type = unified (3) level = 0x2 (2) self-initializing = true fully associative = false extra cores sharing this cache = 0x0 (0) line size in bytes = 0x40 (64) physical line partitions = 0x1 (1) number of ways = 0x10 (16) number of sets = 1024 write-back invalidate = false cache inclusive of lower levels = true (synth size) = 1048576 (1024 KB) --- cache 3 --- type = unified (3) level = 0x3 (3) self-initializing = true fully associative = false extra cores sharing this cache = 0x3 (3) line size in bytes = 0x40 (64) physical line partitions = 0x1 (1) number of ways = 0x10 (16) number of sets = 32768 write-back invalidate = true cache inclusive of lower levels = false (synth size) = 33554432 (32 MB) extended APIC ID = 51 Core Identifiers (0x8000001e/ebx): core ID = 0x33 (51) threads per core = 0x1 (1) Node Identifiers (0x8000001e/ecx): node ID = 0x0 (0) nodes per processor = 0x1 (1) AMD Secure Encryption (0x8000001f): SME: secure memory encryption support = true SEV: secure encrypted virtualize support = true VM page flush MSR support = false SEV-ES: SEV encrypted state support = true SEV-SNP: SEV secure nested paging = true VMPL: VM permission levels = true RMPQUERY instruction support = true VMPL supervisor shadow stack support = true Secure TSC supported = true virtual TSC_AUX supported = true hardware cache coher across enc domains = true SEV guest exec only from 64-bit host = true restricted injection = true alternate injection = true full debug state swap for SEV-ES/SEV-SNP = true disallowing IBS use by host = true VTE: SEV virtual transparent encryption = true VMGEXIT parameter support = true virtual TOM MSR support = true IBS virtual support for SEV-ES/SEV-SNP = true PMC virtual support for SEV-ES/SEV-SNP = true RMPREAD instruction = true guest intercept control support = true segmented RMP support = true VMSA register protection support = true SMT protection support = true secure AVIC support = true allowed SEV features support = true SVSM communication page MSR support = false VIRT_RMPUPDATE & VIRT_PSMASH MSR support = false write to hypervisor in-used allowed = true IBPB on entry support = true encryption bit position in PTE = 0x33 (51) physical address space width reduction = 0x6 (6) number of VM permission levels = 0x4 (4) number of SEV-enabled guests supported = 0x3ee (1006) minimum SEV guest ASID = 0x1 (1) PQoS Extended Features (0x80000020): L3 bandwidth enforcement = true L3 slow memory bandwidth enforcement = true bandwidth monitoring event configuration = true L3 range reservation support = true assignable bandwidth monitoring counters = true SDCI allocation enforcement = true PQoS L3 Memory Bandwidth Enforcement (0x80000020/1): capacity bitmask length = 0xd (13) number of classes of service = 0xf (15) PQoS L3 Slow Memory Bandwidth Enforcement (0x80000020/2): capacity bitmask length = 0xd (13) number of classes of service = 0xf (15) PQoS Bandwidth Monitoring Event Configuration (0x80000020/3): number of bandwidth events available = 0x2 (2) reads to local NUMA = true reads to non-local NUMA = true non-temporal writes to local NUMA = true non-temporal writes to non-local NUMA = true reads to slow memory in local NUMA = true reads to slow memory in non-local NUMA = true dirty victims writes = true 0x80000020 0x04: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000 PQoS Assignable Bandwidth Monitoring Counters (0x80000020/5): QM_CTR counter size-24 = 0x14 (20) QM_CTR bit 61 is overflow = false (QM_CTR counter size) = 44 maximum supported ABCM counter ID = 0x1f (31) can measure COS bandwidth = true 0x80000020 0x06: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000 Extended Feature 2 (0x80000021): no nested data-breakpoints = true FsGsKernelGsBaseNonSerializing = true LFENCE always serializing = true SMM paging configuration lock support = true null selector clears base = true upper address ignore support = true automatic IBRS = true SMM_CTL MSR not supported = true FSRS: fast short REP STOSB support = true FSRC: fast short REP CMPSB support = true prefetch control MSR support = true L2TLB sizes are multiples of 32 = true AMD enhanced REP MOVSB/STOSB = true reserves 0F 01/7 for AMD use = true CPUID disable for non-privileged = true enhanced predictive store forwarding = true fast short REP SCASB support = true IC PREFETCH support = true FP512 to FP256 downgrade support = true workload OS feedback support = false ret addr predictor security support = true guest: selective branch pred barrier = true guest: PRED_CMD[IBPB] flushes br predict = true unaffected by spec return stack overflow = false unaffected by SRSO at user-kernel bound = true BP_CFG can mitigate other SRSO cases = true microcode patch size = 14368 (0x3820) return addr predictor size = 64 (0x40) Extended Performance Monitoring and Debugging (0x80000022): AMD performance monitoring V2 = true AMD LBR V2 = true AMD LBR stack & PMC freezing = true number of core perf ctrs = 0x6 (6) number of LBR stack entries = 0x10 (16) number of avail Northbridge perf ctrs = 0x10 (16) number of available UMC PMCs = 0x20 (32) active UMCs bitmask = 0x6db Multi-Key Encrypted Memory Capabilities (0x80000023): secure host multi-key memory support = true number of encryption key IDs = 0x3f (63) 0x80000024 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000 Segmented RMP Table (0x80000025): RMP segment size minimum supported = 0x24 (36) RMP segment size maximum supported = 0x2a (42) cacheable RMP segment definitions = 0x10 (16) cached segments hard limit = false AMD Extended CPU Topology (0x80000026): extended APIC ID = 51 --- level 0 --- level number = 0x0 (0) level type = core (1) bit width of level = 0x0 (0) power efficiency ranking available = false cores heterogeneous at this level = false components have varying number of cores = false number of logical processors at level = 0x1 (1) 0x80000027 0x00: eax=0x00000003 ebx=0x00000000 ecx=0x00000000 edx=0x00000000 0x80000028 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000 (instruction supported synth): CMPXCHG8B = true conditional move/compare = true PREFETCH/PREFETCHW = true (multi-processing synth) = multi-core (c=32) (multi-processing method) = AMD leaf 0xb (APIC widths synth): CORE_width=7 SMT_width=0 (APIC synth): PKG_ID=0 CORE_ID=51 SMT_ID=0 (uarch synth) = AMD Zen 5, TSMC N4P (synth) = AMD EPYC (5th Gen) (Turin C1) [Zen 5], TSMC N4P
(In reply to Paul from comment #37) While we're waiting to see if kib' latest patch surfaces anything of interest, can you clarify/characterize the app's kqueue usage a little bit? Is it purely adding events to the kqueue and letting them drop out organically from being closed, mix of that and EV_DISABLE/EV_ONESHOT, etc? Are all of the events in this kqueue tied to fds, or is it also driving timers and other non-fd things? I don't think all of the above is going to be particularly relevant, but I think it's good to try and get the lay of the land in case we're not thinking outside of the box enough somehow.
(In reply to Kyle Evans from comment #38) s/EV_DISABLE/EV_DELETE/, of course
(In reply to Paul from comment #37) I was interested in the presence of the support for ERMS, to see which memset() implementation is selected at runtime. This is why I asked about CPU identification. CPU or RAM failures usually look very differently than what you report.
(In reply to Kyle Evans from comment #38) Basically, this is a HTTP/Websocket micro-service app. It handles hundreds of thousands of long-living incoming connections. Approximately 98% of them are Websocket and the rest is plain HTTP. Other than that, there are also a lot of outbound HTTP connections, somewhere in range of 200k-300k. Yeah, to be able to implement this we need timers, a lot of them. All of it is based on a standalone asio C++ library. It does the heavy lifting under the hood. We use several `io_context` for scale. And I believe each of them has its own kqueue, which seems reasonable. Not sure about EV_DELETE/EV_ONESHOT, as we're not dealing with kqueue directly. Looking at source code I see no EV_ONESHOT and even see that earlier they've abandoned its use (https://www.boost.org/doc/libs/latest/doc/html/boost_asio/history.html#boost_asio.history.asio_1_4_9___boost_1_46_1). It does use EV_DELETE, seemingly on every descriptor close, so... a lot. Connections appear and fall off organically. A lot of them on timeout, as this is the real world and not all browsers bother to do the clean Websocket shutdown. This service is nothing new for us. We've been using it for years. Even today, we're still keeping a fallback server that runs FreeBSD 13.1-STABLE but, and this is a big but: with a different server platform : 2x E5-2660. So, not only have we switched to a newer FreeBSD version, but also dramatically changed the platform. And the current platform is much, much faster. My gut feeling is: there was a problem already, it just became apparent. Because, a faster CPU makes things, both in kernel and userspace, run faster. And some, highly improbable races, become likely. We've been running this platform and OS version for months now, under different kinds of load: not that many sockets, but much more CPU usage overall. And it runs completely stable. On the other hand, for this specific role, we now have 2 (of 2) servers that exhibit these crashes. Which doesn't rule out some CPU issues yes, but makes it highly unlikely to be the cause.
(In reply to Konstantin Belousov from comment #40) I see, no problem. If you need more info, just ask.
Hi! We have another panic. Fatal trap 12: page fault while in kernel mode cpuid = 7; apic id = 13 fault virtual address = 0x0 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff80b72503 stack pointer = 0x28:0xfffffe069ae28d40 frame pointer = 0x28:0xfffffe069ae28d70 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 29500 (asy:http:s) rdi: ff01000107b11500 rsi: 0000000000000008 rdx: 0000000000000001 rcx: 0000000000000000 r8: 0000000000000002 r9: ffffffff82252ef0 rax: 0000000000000000 rbx: ff0100772fd78668 rbp: fffffe069ae28d70 r10: 0000000000000000 r11: 0000000000000004 r12: ff01000107b11500 r13: ff0100772fd78668 r14: ff01007278e4e780 r15: ff01000107b11518 trap number = 12 panic: page fault cpuid = 7 time = 1775658023 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe069ae28a70 vpanic() at vpanic+0x136/frame 0xfffffe069ae28ba0 panic() at panic+0x43/frame 0xfffffe069ae28c00 trap_pfault() at trap_pfault+0x422/frame 0xfffffe069ae28c70 calltrap() at calltrap+0x8/frame 0xfffffe069ae28c70 --- trap 0xc, rip = 0xffffffff80b72503, rsp = 0xfffffe069ae28d40, rbp = 0xfffffe069ae28d70 --- knote_drop_detached() at knote_drop_detached+0x113/frame 0xfffffe069ae28d70 knote_fdclose() at knote_fdclose+0x17f/frame 0xfffffe069ae28dc0 closefp_impl() at closefp_impl+0xa8/frame 0xfffffe069ae28e00 amd64_syscall() at amd64_syscall+0x169/frame 0xfffffe069ae28f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe069ae28f30 --- syscall (6, FreeBSD ELF64, close), rip = 0x82d1d232a, rsp = 0x858670b98, rbp = 0x858670bb0 --- KDB: enter: panic (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:399 #2 0xffffffff804b60a8 in db_fncall_generic (nargs=0, args=0xfffffe069ae28490, addr=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:631 #3 db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:679 #4 0xffffffff804b5b2d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:508 #5 0xffffffff804b5c76 in db_command_script (command=command@entry=0xffffffff81bd7722 <db_recursion_data+18> "call doadump") at /usr/src/sys/ddb/db_command.c:573 #6 0xffffffff804bba58 in db_script_exec (scriptname=scriptname@entry=0xfffffe069ae28660 "kdb.enter.panic", warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:301 #7 0xffffffff804bb952 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:323 #8 0xffffffff804b91e1 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:266 #9 0xffffffff80c23c0f in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe069ae289b0) at /usr/src/sys/kern/subr_kdb.c:790 #10 0xffffffff811318fd in trap (frame=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:697 #11 <signal handler called> #12 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556 #13 0xffffffff80bd0b8b in vpanic (fmt=0xffffffff812bd9d3 "%s", ap=ap@entry=0xfffffe069ae28be0) at /usr/src/sys/kern/kern_shutdown.c:962 #14 0xffffffff80bd09f3 in panic (fmt=0xffffffff81da22a0 <cnputs_mtx> "\325\376!\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:887 #15 0xffffffff81132082 in trap_fatal (frame=<optimized out>, eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:1028 #16 0xffffffff81132082 in trap_pfault (frame=0xfffffe069ae28c80, usermode=false, signo=<optimized out>, ucode=<optimized out>) #17 <signal handler called> #18 0xffffffff80b72503 in knote_drop_detached (kn=kn@entry=0xff0100772fd78668, td=td@entry=0xff01007278e4e780) at /usr/src/sys/kern/kern_event.c:2950 #19 0xffffffff80b7284f in knote_drop (td=0xff01007278e4e780, kn=<optimized out>) at /usr/src/sys/kern/kern_event.c:2915 #20 knote_fdclose (td=td@entry=0xff01007278e4e780, fd=fd@entry=211098) at /usr/src/sys/kern/kern_event.c:2875 #21 0xffffffff80b69fd8 in closefp_impl (fdp=0xfffffe0694c620c0, fd=211098, fp=0xff010004c3517c80, td=0xff01007278e4e780, audit=true) at /usr/src/sys/kern/kern_descrip.c:1413 #22 0xffffffff81132739 in syscallenter (td=0xff01007278e4e780) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #23 amd64_syscall (td=0xff01007278e4e780, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1267 #24 <signal handler called> #25 0x000000082d1d232a in ?? () Backtrace stopped: Cannot access memory at address 0x858670b98 (kgdb) fr 18 #18 0xffffffff80b72503 in knote_drop_detached (kn=kn@entry=0xff0100772fd78668, td=td@entry=0xff01007278e4e780) at /usr/src/sys/kern/kern_event.c:2950 2950 SLIST_REMOVE(list, kn, knote, kn_link); (kgdb) p *((struct eknote*)kn) $1 = { k = { kn_link = { sle_next = 0x0 }, kn_selnext = { sle_next = 0xffffffffffffffff }, kn_knlist = 0x0, kn_tqe = { tqe_next = 0xffffffffffffffff, tqe_prev = 0xffffffffffffffff }, kn_kq = 0xff01000107b11500, kn_kevent = { ident = 76954, filter = -1, flags = 32, fflags = 0, data = 0, udata = 0x1b2102fcfc40, ext = {0, 0, 0, 0} }, kn_hook = 0x0, kn_hookid = 0, kn_status = 8, kn_influx = 1, kn_sfflags = 0, kn_sdata = 0, kn_ptr = { p_fp = 0xff010062f4444af0, p_proc = 0xff010062f4444af0, p_aio = 0xff010062f4444af0, p_lio = 0xff010062f4444af0, p_prison = 0xff010062f4444af0, p_v = 0xff010062f4444af0 }, kn_fop = 0xffffffff814dd960 <soread_filtops> }, c = { kn_link = { sle_next = 0x0 }, kn_selnext = { sle_next = 0x0 }, kn_knlist = 0x0, kn_tqe = { tqe_next = 0x0, tqe_prev = 0x0 }, kn_kq = 0x0, kn_kevent = { ident = 0, filter = 0, flags = 0, fflags = 0, data = 0, udata = 0x0, ext = {0, 0, 0, 0} }, kn_hook = 0x0, kn_hookid = 0, kn_status = 0, kn_influx = 0, kn_sfflags = 0, kn_sdata = 0, kn_ptr = { p_fp = 0x0, p_proc = 0x0, p_aio = 0x0, p_lio = 0x0, p_prison = 0x0, p_v = 0x0 }, kn_fop = 0x0 }, on_kn_link = 0 } (kgdb) p kq $2 = (struct kqueue *) 0xff01000107b11500 (kgdb) p *kq $3 = { kq_lock = { lock_object = { lo_name = 0xffffffff813464c6 "kqueue", lo_flags = 21168128, lo_data = 0, lo_witness = 0xff0100804bd8db80 }, mtx_lock = 18374968446302873472 }, kq_refcnt = 0, kq_list = { tqe_next = 0xff010001dd4fac00, tqe_prev = 0xff010077e823d828 }, kq_head = { tqh_first = 0x0, tqh_last = 0xff01000107b11538 }, kq_count = 0, kq_sel = { si_tdlist = { tqh_first = 0x0, tqh_last = 0x0 }, si_note = { kl_list = { slh_first = 0x0 }, kl_lock = 0xffffffff80b71fc0 <knlist_mtx_lock>, kl_unlock = 0xffffffff80b71fe0 <knlist_mtx_unlock>, kl_assert_lock = 0xffffffff80b72000 <knlist_mtx_assert_lock>, kl_lockarg = 0xff01000107b11500, kl_autodestroy = 0 }, si_mtx = 0x0 }, kq_sigio = 0x0, kq_fdp = 0xfffffe0694c620c0, kq_state = 0, kq_knlistsize = 288512, kq_knlist = 0xfffffe09ce3fe000, kq_knhashmask = 0, kq_knhash = 0x0, kq_task = { ta_link = { stqe_next = 0x0 }, ta_pending = 0, ta_priority = 0 '\000', ta_flags = 0 '\000', ta_func = 0xffffffff80b748a0 <kqueue_task>, ta_context = 0xff01000107b11500 }, kq_cred = 0xff010001dd445900, kq_forksrc = 0x0 } (kgdb) p list $4 = <optimized out> p kq->kq_knlist[kn->kn_kevent.ident] $6 = { slh_first = 0x0 } (kgdb) p &kq->kq_knlist[kn->kn_kevent.ident] $7 = (struct klist *) 0xfffffe09ce4944d0 Please, tell us if you need anything else.
I guess there's still some kind of confusion: #21 0xffffffff80b69fd8 in closefp_impl (fdp=0xfffffe0694c620c0, fd=211098, fp=0xff010004c3517c80, td=0xff01007278e4e780, audit=true) [...] kn_ptr = { p_fp = 0xff010062f4444af0, p_proc = 0xff010062f4444af0, p_aio = 0xff010062f4444af0, p_lio = 0xff010062f4444af0, p_prison = 0xff010062f4444af0, p_v = 0xff010062f4444af0 }, Does 0xff010062f4444af0 still look like a valid `struct file *`, and maybe more importantly: does that match what's at fdp->fd_files->fdt_ofiles[211098].fde_file or fdp->fd_files->fdt_ofiles[76954].fde_file?
Looks like it: (kgdb) p *((struct file *)0xff010062f4444af0) $1 = { f_flag = 7, f_count = 2, f_data = 0xff01004a38dc5c00, f_ops = 0xffffffff814dc3a0 <socketops>, f_vnode = 0x0, f_cred = 0xff010001dd445900, f_type = 2, f_vflags = 0, { f_seqcount = {0, 0}, f_pipegen = 0 }, f_nextoff = {0, 0}, f_vnun = { fvn_cdevpriv = 0x0, fvn_advice = 0x0 }, f_offset = 0 }
(In reply to Paul from comment #45) Sorry, trigger happy fingers... this is the followup. (kgdb) p fdp->fd_files->fdt_ofiles[211098].fde_file $2 = (struct file *) 0x0 (kgdb) p fdp->fd_files->fdt_ofiles[76954].fde_file $3 = (struct file *) 0xff010062f4444af0 (kgdb) p *fdp->fd_files->fdt_ofiles[76954].fde_file $4 = { f_flag = 7, f_count = 2, f_data = 0xff01004a38dc5c00, f_ops = 0xffffffff814dc3a0 <socketops>, f_vnode = 0x0, f_cred = 0xff010001dd445900, f_type = 2, f_vflags = 0, { f_seqcount = {0, 0}, f_pipegen = 0 }, f_nextoff = {0, 0}, f_vnun = { fvn_cdevpriv = 0x0, fvn_advice = 0x0 }, f_offset = 0 }
(In reply to Paul from comment #46) I couldn't help but notice that the difference between 211098 and 76954 is 131 * 1024. So far we have: 211098 - 76954 = 131 * 1024 298277 − 297765 = 512 544408 − 138904 = 99 * 4096 How likely is it that this is just a coincidence? Looks like 512 is the common base, so far. So, the likelihood of this being just a fluke is like 1 / (512 ^ 3). Very unlikely.
(In reply to Paul from comment #46) For shits and giggles, can we take a look at kq->kq_knlist[idx] for both fds as well, please? I guess there's likely only one knote at most per fd here
(In reply to Kyle Evans from comment #48) No problem at all, and... (kgdb) p kq->kq_knlist[76954] $4 = { slh_first = 0x0 } (kgdb) p kq->kq_knlist[211098] $5 = { slh_first = 0xff0100772fd78668 } (kgdb) p *kq->kq_knlist[211098].slh_first $6 = { kn_link = { sle_next = 0x0 }, kn_selnext = { sle_next = 0xffffffffffffffff }, kn_knlist = 0x0, kn_tqe = { tqe_next = 0xffffffffffffffff, tqe_prev = 0xffffffffffffffff }, kn_kq = 0xff01000107b11500, kn_kevent = { ident = 76954, filter = -1, flags = 32, fflags = 0, data = 0, udata = 0x1b2102fcfc40, ext = {0, 0, 0, 0} }, kn_hook = 0x0, kn_hookid = 0, kn_status = 8, kn_influx = 1, kn_sfflags = 0, kn_sdata = 0, kn_ptr = { p_fp = 0xff010062f4444af0, p_proc = 0xff010062f4444af0, p_aio = 0xff010062f4444af0, p_lio = 0xff010062f4444af0, p_prison = 0xff010062f4444af0, p_v = 0xff010062f4444af0 }, kn_fop = 0xffffffff814dd960 <soread_filtops> } 76954 seems to be on the list of 211098. But this perfectly matches preceding cases: some mixing of unrelated fds.
Please drop any previous patches you have. Then update to the latest main sources, which have several asserts added by Kyle and me. Then apply only the following change and see if the issue is reproducible. diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c index bb5203b8cb0b..ff3ae96a452f 100644 --- a/sys/kern/kern_event.c +++ b/sys/kern/kern_event.c @@ -2043,8 +2043,13 @@ kqueue_expand(struct kqueue *kq, const struct filterops *fops, uintptr_t ident, to_free = list; } else { if (kq->kq_knlist != NULL) { - bcopy(kq->kq_knlist, list, - kq->kq_knlistsize * sizeof(*list)); + volatile char *src, *dst; + src = (void *)kq->kq_knlist; + dst = (void *)list; + for (size_t i = 0; + i < kq->kq_knlistsize * sizeof(*list); + i++) + dst[i] = src[i]; to_free = kq->kq_knlist; kq->kq_knlist = NULL; }
(In reply to Paul from comment #49) Just thinking out loud for a minute, we know these facts: 1.) In slot 211098 of this kq we found this knote that thinks it is for fd 76954 and has an fp that matches fd 76954 in the fd table 2.) The kq_knlist only ever grows, never shrinks, and in exactly one place: kqueue_expand 3.) kq_knlist / kn_link usage is actually pretty minimal and reasonably easy to audit 4.) knote_attach is the only place adding anything to kq_knlist knote_attach() is incredibly straightforward and clearly under the kq lock: 2880 if (kn->kn_fop->f_isfd) { 2881 if (kn->kn_id >= kq->kq_knlistsize) 2882 return (ENOMEM); 2883 list = &kq->kq_knlist[kn->kn_id]; 2884 } else { 2885 if (kq->kq_knhash == NULL) 2886 return (ENOMEM); 2887 list = &kq->kq_knhash[KN_HASH(kn->kn_id, kq->kq_knhashmask)]; 2888 } 2889 SLIST_INSERT_HEAD(list, kn, kn_link); That's clearly checking the size correctly beforre it inserts the knote. The only removal is in knote_drop_detached(): 2925 if (kn->kn_fop->f_isfd) 2926 list = &kq->kq_knlist[kn->kn_id]; 2927 else 2928 list = &kq->kq_knhash[KN_HASH(kn->kn_id, kq->kq_knhashmask)]; 2929 2930 if (!SLIST_EMPTY(list)) 2931 SLIST_REMOVE(list, kn, knote, kn_link); I think we all agree that !SLIST_EMPTY(list) should actually be an assertion at best, there's no path before the call to knote_attach() in kqueue_register() that would call knote_drop*() and you can't double-drop a knote. I really don't see many plausible causes here, outside of kqueue_expand going wrong somehow, but the disassembly seems sane (and not obviously eliding important things) and the bzero calculations, while kind of convoluted, seem fine. I probably would've written it as bzero(&list[kq->kq_knlistsize], (size - kq->kq_knlistsize) * sizeof(*list)) personally, but I don't see a world in which those aren't equivalent.
(In reply to Konstantin Belousov from comment #50) Hi, Thanks for the patch. Though, could't help but also include the assert implied by Kyle. It should't hurt to have an extra check, I hope. This is the final patch that we've applied to the current HEAD: diff --git a/sys/kern/kern_event.c b/sys/kern/kern_event.c index 01731ca46b6b..38b67d81dff2 100644 --- a/sys/kern/kern_event.c +++ b/sys/kern/kern_event.c @@ -2037,8 +2037,13 @@ kqueue_expand(struct kqueue *kq, const struct filterops *fops, uintptr_t ident, to_free = list; } else { if (kq->kq_knlist != NULL) { - bcopy(kq->kq_knlist, list, - kq->kq_knlistsize * sizeof(*list)); + volatile char *src, *dst; + src = (void *)kq->kq_knlist; + dst = (void *)list; + for (size_t i = 0; + i < kq->kq_knlistsize * sizeof(*list); + i++) + dst[i] = src[i]; to_free = kq->kq_knlist; kq->kq_knlist = NULL; } @@ -2953,8 +2958,9 @@ knote_drop_detached(struct knote *kn, struct thread *td) else list = &kq->kq_knhash[KN_HASH(kn->kn_id, kq->kq_knhashmask)]; - if (!SLIST_EMPTY(list)) - SLIST_REMOVE(list, kn, knote, kn_link); + KASSERT(!SLIST_EMPTY(list), + ("knote %p is not on the list of %lu", kn, kn->kn_id)); + SLIST_REMOVE(list, kn, knote, kn_link); if (kn->kn_status & KN_QUEUED) knote_dequeue(kn); KQ_UNLOCK_FLUX(kq);
(In reply to Kyle Evans from comment #51) If you see my last debugging patch, my main suspicious place right now is bcopy(). The values of the fd as reported give quite huge size for bzero, and I suspect that this is the exceptionally large value for bcopy, if not unique. So I am trying to open-code bcopy() to see if memcpy_erms() might be to blame. It might be not bcopy() itself but a cpu bug with extra-large %rcx for rep;movsq for instance. Just speculating. (In reply to Paul from comment #52) This is fine. But SLIST_REMOVE() must deref NULL if the element is not found. Also see https://reviews.freebsd.org/D56341
(In reply to Konstantin Belousov from comment #53) FWIW- the person with that fdgrowtable issue I had mailed about has also narrowed in on the possibility of it being weird collateral damage from memcpy, also with erms (on select AMD hardware). Their case looks a little different, though.
(In reply to Kyle Evans from comment #54) Note that rep;movs and like are very special kind of instructions in x86 ISA. They are interruptible, and CPU must update %rsi%rdi/%rcx to reflect the progress. Of course for short moves interrupts are kept pending, but for large they are not, and I believe that the case of the reporter which copies hundreds of KBs, if not some MBs, should be. That said, could you share some details about the case, and how it was solved? Did the microcode update resolved it? I lost access to AMD internal docs some time ago.
(In reply to Konstantin Belousov from comment #55) Their investigation is still ongoing and they haven't proven that it's erms yet, but here are their observations: At https://cgit.freebsd.org/src/tree/sys/kern/kern_descrip.c#n2028 in fdgrowtable() ntable is allocated and ntable->fdt_nfiles is assigned. Before the memcpy, fdt_nfiles can be observed to be the correct value. A sample from their latest known panic was 7133184 bytes, and they note that the destination is never page unaligned (0xfffffe0423607008 in this last one) as a maybe-relevant fact for their case. Immediately after the memcpy into ntable->fdt_ofiles, the start of `ntable` appears to be damaged -- fdt_nfiles is suddenly zeroed out, and in this last sample the damage was to the tune of ~2 pages at the beginning zeroed out as well. They noted that a significant number of entries at the end of the old table were zeroed out, so it may be the case that the damage is actually those overwriting the beginning.
(In reply to Kyle Evans from comment #56) Also re: micrrocode, they have the latest applied in all cases.
(In reply to Kyle Evans from comment #57) Can you obtain the output from etallen-cpuid -1? I am curious if this is the same model as in this PR.
(In reply to Konstantin Belousov from comment #58) I've asked them to report the results to this PR; this was their summary of high-level models it's reproduced on from earlier messages: - Have reproduced on: EPYC 9274F w/768GB, EPYC 9375F w/768GB, EPYC 8224P w/192GB - Have not reproduced on: EPYC 8024P w/64GB, Ryzen 9 7950X w/128GB, Ryzen 9 5950X w/128GB, Ryzen TRP 3975WX w/256GB
Hi, I don't know if this is related but a while ago I experienced kernel panics under 14.3 and 15.0 on AMD EPYC 9274F and 9375F boxes with 768GB RAM running a multithreaded app with a large number of open files (some sockets but mostly memory mapped files) and a high rate of file descriptor "churn" (mmap'd files are frequently created, written, unmapped and deleted). The panics occurred during fork/exec calls and examination of the crash dumps showed corrupted file tables. When running debug kernels, the one place that has consistently come up in the kernel panics is do_fork/fdcopy/fdgrowtable, where fdgrowtable is repeatedly called to create and expand a file table as other threads in the parent continue to open new files. fdgrowtable repeatedly allocates and copies an ever increasing file table until eventually fdt_nfiles at the start of the growing table is read back inexplicably as zero which fails the KASSERT at the start of fdgrowtable. The file tables are typically in the range of 250k to 500k entries in size when this happens which, given that the entries are 48 bytes in size, means the memory allocations and copies are on the order of several megabytes. I have written a small kernel module and a test program that repeatedly allocates, fills, copies, checks and frees memory which I will shortly attach to this bug report. I have only seen this problem with AMD Zen4 & Zen5 EPYC machines, I have not been able to reproduce this on older CPUs.
Created attachment 269604 [details] Kernel module for testing memory allocation and copying Kernel module that implements a syscall for testing memory allocation and copying.
Created attachment 269605 [details] Userspace utility for calling memory testing kernel module
I have attached kmod2.c (a kernel module) and call2.c (a program that invokes kmod2.c). kmod2.ko can be built with the following makefile. -------- kmod2.mk -------- KMOD = kmod2 SRCS = kmod2.c .include <bsd.kmod.mk> -------------------------- call2 can be built with the following command line: cc -o call2 call2.c -lthr call2 assumes the syscall in kmod2 is assigned syscall number 210 on loading; if this not the cases on your machine, edit call2.c accordingly. This will run 1 thread, zeroed allocations, starting memory size of 64k, 32 iterations, increase of 4k each iteration: $ call2 1 5 $((1<<16)) 32 4096 This will run 3 threads, not-zeroed allocations, starting memory size of 16M, 128 iterations, increase of 16k each iteration: $ call2 3 5 $((1<<24)) 128 16384 dmesg will normally show lines such as: memkpy/syscall: enter pid:32426 tid:117355 4 16777216 128 16384 memkpy/syscall: exit pid:32426 tid:117355 4 16777216 128 16384 ilog:516 index:515 tsc:5520429094997775 event:4 If the memory checks fail, dmesg output (with kern.msgbuf_show_timestamp="2") will look like this: [375862.366839] memkpy/syscall: enter pid:9651 tid:191508 4 33554432 128 16384 [375862.787488] memkpy/check_block: failed at ptr:0xfffffe10d50e4000 data:0xfffffe10d50a8000 size:35356672 ... [375863.168260] memkpy/check_block: failed at ptr:0xfffffe10d513d000 data:0xfffffe10d5138000 size:35651584 [375863.178042] memkpy/syscall: exit pid:9651 tid:191508 4 33554432 128 16384 ilog:516 index:515 tsc:1428744768499442 event:4 This is an unfinished work in progress but hopefully it is of some assistance.
Created attachment 269607 [details] AMD EPYC 9375F: output of cpuid-etallen -1
(In reply to Kristofer Peterson from comment #64) So we have some lead. There is a public errata document https://docs.amd.com/v/u/en-US/58251_PUB_1.30 Look for the errata 1567 Instruction State May be Corrupted during REP CMPSB or REP SCASB Description During the execution of a REP CMPSB or REP SCASB, the instruction state may be corrupted. Potential Effect on System Unpredictable system behavior Suggested Workaround System software may contain the workaround for this erratum. Fix Planned No fix planned It only says something about cmps and scas, not about movs. But this is very close. Also, the revision is from July 2025, I did not found anything newer.