Summary: | panic: got NULL turnstile on rwlock 0xfffff8021123be90 passedv 1 v 1 | ||
---|---|---|---|
Product: | Base System | Reporter: | Lexi Winter <lexi> |
Component: | kern | Assignee: | Mateusz Guzik <mjg> |
Status: | In Progress --- | ||
Severity: | Affects Only Me | CC: | kbowling, mjg |
Priority: | --- | Keywords: | crash, regression |
Version: | 15.0-CURRENT | ||
Hardware: | amd64 | ||
OS: | Any | ||
See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=282424 |
Description
Lexi Winter
2024-10-28 08:26:09 UTC
checking logs, it seems like this system is panicking about once a day with the same message/trace, and this has been happening since it was updated to this 15.0 build (from an older 15.0 build).

possibly related settings from /boot/loader.conf:

net.isr.maxthreads=-1
net.isr.bindthreads=1
net.isr.dispatch=hybrid

i've commented these out to see if it makes a difference.

^Triage: panic appears to be in the network stack.

Maybe related https://redmine.pfsense.org/issues/15601#change-74811 "Routes with IPv6 Address as Next Hop for IPv4 Destination Causes Kernel Panic"

i do have one of these:

Internet:
Destination Gateway Flags Netif Expire
10.254.3.1 fe80::2%epair1a UGH epair1a

we use these extensively here so i'm surprised only this particular system has run into a problem. i've built a new release with https://reviews.freebsd.org/D45913 applied and will deploy this to test tonight.

this might not be a (recent) regression; i think the issue here might be that we recently made a network change that caused more NDs to be sent for this address.

That patch does NOT fix the problem, whatever it may happen to be. Instead of that, can you please apply the following: https://people.freebsd.org/~mjg/nd6_debug.diff

It should still crash, but provide more information. Can you share your CPU model?

okay, i replaced that patch with your patch and will reboot later tonight.
CPU: Intel(R) Atom(TM) CPU C3758R @ 2.40GHz (2400.22-MHz K8-class CPU)
  Origin="GenuineIntel" Id=0x506f1 Family=0x6 Model=0x5f Stepping=1
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4ff8ebbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,SDBG,CX16,xTPR,PDCM,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x101<LAHF,Prefetch>
  Structured Extended Features=0x2294e283<FSGSBASE,TSCADJ,SMEP,ERMS,NFPUSG,MPX,PQE,RDSEED,SMAP,CLFLUSHOPT,PROCTRACE,SHA>
  Structured Extended Features3=0xac000000<IBPB,STIBP,ARCH_CAP,SSBD>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  IA32_ARCH_CAPS=0x9<RDCL_NO,SKIP_L1DFL_VME>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
  TSC: P-state invariant, performance statistics

Thanks. How is the testing going so far?

Unread portion of the kernel message buffer:
panic: lle 0xfffff8015f8e8000 not locked @ /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c:2368!
cpuid = 1
time = 1730156941
KDB: stack backtrace:
#0 0xffffffff804676bd at kdb_backtrace+0x5d
#1 0xffffffff8041d1df at vpanic+0x13f
#2 0xffffffff8041d093 at panic+0x43
#3 0xffffffff805bc173 at nd6_get_llentry+0x3a3
#4 0xffffffff805ba9fb at nd6_resolve_slow+0xfb
#5 0xffffffff805ba7d5 at nd6_resolve+0x125
#6 0xffffffff80530da2 at ether_output+0x502
#7 0xffffffff8055dc95 at ip_tryforward+0x505
#8 0xffffffff80560570 at ip_input+0x310
#9 0xffffffff805366f8 at swi_net+0x138
#10 0xffffffff803e23f9 at ithread_loop+0x239
#11 0xffffffff803dea5b at fork_exit+0x7b
#12 0xffffffff806a37ee at fork_trampoline+0xe
Uptime: 1h42m17s
Dumping 1338 out of 16325 MB:..2%..11%..21%..32%..41%..51%..61%..71%..81%..91%

(kgdb) bt
#0 __curthread () at /data/build/src/freebsd/lf/main/sys/amd64/include/pcpu_aux.h:57
#1 doadump (textdump=<optimized out>) at /data/build/src/freebsd/lf/main/sys/kern/kern_shutdown.c:404
#2 0xffffffff8041cd64 in kern_reboot (howto=260) at /data/build/src/freebsd/lf/main/sys/kern/kern_shutdown.c:524
#3 0xffffffff8041d24c in vpanic (fmt=0xffffffff80719f49 "lle %p not locked @ %s:%d!", ap=ap@entry=0xfffffe00d9967af0) at /data/build/src/freebsd/lf/main/sys/kern/kern_shutdown.c:979
#4 0xffffffff8041d093 in panic (fmt=<unavailable>) at /data/build/src/freebsd/lf/main/sys/kern/kern_shutdown.c:892
#5 0xffffffff805bc173 in nd6_get_llentry (ifp=<optimized out>, addr=<optimized out>, family=<optimized out>) at /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c:2368
#6 0xffffffff805ba9fb in nd6_resolve_slow (ifp=ifp@entry=0xfffff800246d2000, family=family@entry=2, flags=flags@entry=0, m=m@entry=0xfffff802091b6c00, dst=dst@entry=0xfffff8002480fe04, desten=desten@entry=0xfffffe00d9967cb0 "", pflags=0xfffffe00d9967c8c, plle=0x0) at /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c:2408
#7 0xffffffff805ba7d5 in nd6_resolve (ifp=0xfffff800246d2000, gw_flags=<optimized out>, m=m@entry=0xfffff802091b6c00, sa_dst=sa_dst@entry=0xfffff8002480fe04, desten=desten@entry=0xfffffe00d9967cb0 "", pflags=pflags@entry=0xfffffe00d9967c8c, plle=0x0) at /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c:2298
#8 0xffffffff80530da2 in ether_resolve_addr (phdr=0xfffffe00d9967cb0 "", plle=<optimized out>, ifp=<optimized out>, m=<optimized out>, dst=<optimized out>, ro=<optimized out>, pflags=<optimized out>) at /data/build/src/freebsd/lf/main/sys/net/if_ethersubr.c:243
#9 ether_output (ifp=<unavailable>, ifp@entry=<error reading variable: value is not available>, m=0xfffff802091b6c00, m@entry=<error reading variable: value is not available>, dst=<unavailable>, dst@entry=<error reading variable: value is not available>, ro=<unavailable>, ro@entry=<error reading variable: value is not available>) at /data/build/src/freebsd/lf/main/sys/net/if_ethersubr.c:349
#10 0xffffffff8055dc95 in ip_tryforward (m=0xfffff802091b6c00) at /data/build/src/freebsd/lf/main/sys/netinet/ip_fastfwd.c:483
#11 0xffffffff80560570 in ip_input (m=0xfffff802091b6c00) at /data/build/src/freebsd/lf/main/sys/netinet/ip_input.c:590
#12 0xffffffff805366f8 in netisr_process_workstream_proto (nwsp=0xfffffe002056ba00, proto=1) at /data/build/src/freebsd/lf/main/sys/net/netisr.c:927
#13 swi_net (arg=0xfffffe002056ba00) at /data/build/src/freebsd/lf/main/sys/net/netisr.c:974
#14 0xffffffff803e23f9 in intr_event_execute_handlers (ie=0xfffff80001289400, p=<optimized out>) at /data/build/src/freebsd/lf/main/sys/kern/kern_intr.c:1183
#15 ithread_execute_handlers (ie=0xfffff80001289400, p=<optimized out>) at /data/build/src/freebsd/lf/main/sys/kern/kern_intr.c:1196
#16 ithread_loop (arg=arg@entry=0xfffff8000126cb00) at /data/build/src/freebsd/lf/main/sys/kern/kern_intr.c:1289
#17 0xffffffff803dea5b in fork_exit (callout=0xffffffff803e21c0 <ithread_loop>, arg=0xfffff8000126cb00, frame=0xfffffe00d9967f40) at /data/build/src/freebsd/lf/main/sys/kern/kern_fork.c:1151
#18 <signal handler called>

#5 0xffffffff805bc173 in nd6_get_llentry (ifp=<optimized out>, addr=<optimized out>, family=<optimized out>) at /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c:2368
warning: 2368 /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c: No such file or directory
(kgdb) info locals
child_lle = <optimized out>
lle = 0xfffff8015f8e8000
lle_tmp = 0xfffff8015f8e8000
(kgdb) print *lle
$1 = {lle_next = {cle_next = 0x0, cle_prev = 0x0},
  r_l3addr = {addr4 = {s_addr = 285245694}, addr6 = {__u6_addr = {__u6_addr8 = "\376\200\000\021", '\000' <repeats 11 times>, "\002", __u6_addr16 = {33022, 4352, 0, 0, 0, 0, 0, 512}, __u6_addr32 = {285245694, 0, 0, 33554432}}}},
  r_linkdata = '\000' <repeats 23 times>, r_hdrlen = 0 '\000', r_family = 2 '\002', spare0 = "\000", r_flags = 0, r_skip_req = 0,
  lle_tbl = 0xfffff800242ceb00, lle_head = 0x0, lle_free = 0xffffffff8059e7a0 <in6_lltable_destroy_lle>,
  la_hold = 0xfffff8019caa4b00, la_numheld = 1, la_expire = 0, la_flags = 194, la_asked = 0, la_preempt = 0,
  ln_state = 0, ln_router = 0, ln_ntick = 0, lle_remtime = 0, lle_hittime = 0, lle_refcnt = 1, ll_addr = 0x0,
  lle_children = {cslh_first = 0x0}, lle_child_next = { csle_next = 0x0}, lle_parent = 0xfffff801a8651900,
  lle_chain = {cle_next = 0x0, cle_prev = 0x0},
  lle_timer = {c_links = {le = {le_next = 0x0, le_prev = 0x0}, sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, c_precision = 0, c_arg = 0x0, c_func = 0x0, c_lock = 0x0, c_flags = 0, c_iflags = 16, c_cpu = 0},
  lle_lock = {lock_object = {lo_name = 0xffffffff80708160 "lle", lo_flags = 90374144, lo_data = 0, lo_witness = 0x0}, rw_lock = 1},
  req_mtx = {lock_object = {lo_name = 0xffffffff8073325f "lle req", lo_flags = 16973824, lo_data = 0, lo_witness = 0x0}, mtx_lock = 0},
  lle_epoch_ctx = {data = {0x0, 0x0}}}
(kgdb)

Thanks.
I'm going to ship you with either a fix or some more debug later today, hopefully the former :)

Can you please restore the stock source and apply this instead: https://people.freebsd.org/~mjg/nd6_fix.diff

it should fix the problem but also complain some more if it does not

I forgot to mention it would be great if you could build a debug kernel -- with witness, invariants et al, but this is not a hard requirement.

thanks, i'll build a new kernel with this patch and debugging enabled and deploy it tonight (in about 12 hours) -- however the system in question hasn't panicked for over a day now so it might take a while to see if this actually fixes the problem. (if it's caused by ND, i'll try to set something up to cause a lot of NDs to be sent for this address.)

It is nd-related. I'll give you an extended patch which will point out the problematic codepath was reached, give me a few.

https://people.freebsd.org/~mjg/nd6_fix2.diff

this will print a "found lle_tmp" message once in dmesg. if it is there, the codepath was hit and if there was no panic we are set :)

the latest patch is now applied and running on a debug kernel from src ~22429a464a5f4f6bb5a056aae1353985db83b721 (i had to update for an unrelated issue), i'll report back results asap.

i've been running this patch for the last few hours and the system has neither panicked nor printed the magic message. i am completely happy to just leave this running for a few days and see what happens, but if there's anything you'd like me to try to trigger the issue, i'm happy to try it.

Just leave it running for a little bit, thank you. It's all about neighbour discovery, but I don't have a specific reproducer. Hopefully your network will be doing what it was already doing. :)

any updates? maybe the printfs showed up?

unfortunately neither:

[3!] willow ~# uptime
4:06PM up 5 days, 19:57, 2 users, load averages: 0.21, 0.13, 0.12
[4!] willow ~# dmesg|grep lle_tmp
[1? 5!] willow ~#

this is despite the fact that i added a new IPv4 route with IPv6 nexthop for our NAT64 jail. i am considering rebooting with a non-debug kernel, in case it's some timing issue that the debug kernel hides.

that should be of no significance. thanks for reporting.

okay, do you want to close this then? i'm happy to keep running this as a local patch since it seems to have fixed the specific problem i was having.

i'm going to sleep on it, i don't want to close since there is *definitely* a bug here, but maybe i'll just commit the fix i came up with

i rebooted into a non-debug kernel and triggered the printf within 5 minutes of booting:

Nov 16 06:41:17 willow kernel: nd6_get_llentry: found lle_tmp 0xfffff8003bf8ac00

using:

FreeBSD willow.eden.le-fay.org 15.0-CURRENT FreeBSD 15.0-CURRENT #2 lf/main-n269078-561fbdac790: Sun Nov 3 16:32:20 GMT 2024 srcmastr@hemlock.eden.le-fay.org:/data/build/obj/freebsd/data/build/src/freebsd/lf/main/amd64.amd64/sys/LF amd64

the system did not panic.

Huh. Thanks. I'm going to commit soon(tm).