Bug 282378

Summary: panic: got NULL turnstile on rwlock 0xfffff8021123be90 passedv 1 v 1
Product: Base System Reporter: Lexi Winter <lexi>
Component: kern Assignee: Mateusz Guzik <mjg>
Status: In Progress ---    
Severity: Affects Only Me CC: kbowling, mjg
Priority: --- Keywords: crash, regression
Version: 15.0-CURRENT   
Hardware: amd64   
OS: Any   
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=282424

Description Lexi Winter freebsd_triage 2024-10-28 08:26:09 UTC
using main from ~53314e34d5e8e7f781ab990805b22f7a56bc0580, panic observed during normal operation:

panic: got NULL turnstile on rwlock 0xfffff8021123be90 passedv 1 v 1
cpuid = 5
time = 1730103094
KDB: stack backtrace:
#0 0xffffffff804676bd at kdb_backtrace+0x5d
#1 0xffffffff8041d1df at vpanic+0x13f
#2 0xffffffff8041d093 at panic+0x43
#3 0xffffffff804189ec at __rw_wunlock_hard+0xec
#4 0xffffffff805babdd at nd6_resolve_slow+0x2dd
#5 0xffffffff805ba7d5 at nd6_resolve+0x125
#6 0xffffffff80530da2 at ether_output+0x502
#7 0xffffffff8055dc95 at ip_tryforward+0x505
#8 0xffffffff80560570 at ip_input+0x310
#9 0xffffffff805366f8 at swi_net+0x138
#10 0xffffffff803e23f9 at ithread_loop+0x239
#11 0xffffffff803dea5b at fork_exit+0x7b
#12 0xffffffff806a37ae at fork_trampoline+0xe
Uptime: 20h43m36s
Dumping 1674 out of 16325 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%---<<BOOT>>---

(kgdb) bt
#0  __curthread () at /data/build/src/freebsd/lf/main/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=<optimized out>) at /data/build/src/freebsd/lf/main/sys/kern/kern_shutdown.c:404
#2  0xffffffff8041cd64 in kern_reboot (howto=260) at /data/build/src/freebsd/lf/main/sys/kern/kern_shutdown.c:524
#3  0xffffffff8041d24c in vpanic (fmt=0xffffffff8073c8b1 "got NULL turnstile on rwlock %p passedv %zx v %zx", ap=ap@entry=0xfffffe00d997bb30) at /data/build/src/freebsd/lf/main/sys/kern/kern_shutdown.c:979
#4  0xffffffff8041d093 in panic (fmt=<unavailable>) at /data/build/src/freebsd/lf/main/sys/kern/kern_shutdown.c:892
#5  0xffffffff804189ec in __rw_wunlock_hard (c=c@entry=0xfffff8021123bea8, v=1) at /data/build/src/freebsd/lf/main/sys/kern/kern_rwlock.c:1277
#6  0xffffffff805babdd in nd6_resolve_slow (ifp=ifp@entry=0xfffff800030ca800, family=family@entry=2, flags=flags@entry=0, m=m@entry=0xfffff8003221c100, dst=dst@entry=0xfffff80025aa5d04, 
    desten=desten@entry=0xfffffe00d997bcb0 "", pflags=0xfffffe00d997bc8c, plle=0x0) at /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c:2475
#7  0xffffffff805ba7d5 in nd6_resolve (ifp=0xfffff800030ca800, gw_flags=<optimized out>, m=m@entry=0xfffff8003221c100, sa_dst=sa_dst@entry=0xfffff80025aa5d04, desten=desten@entry=0xfffffe00d997bcb0 "", 
    pflags=pflags@entry=0xfffffe00d997bc8c, plle=0x0) at /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c:2298
#8  0xffffffff80530da2 in ether_resolve_addr (phdr=0xfffffe00d997bcb0 "", plle=<optimized out>, ifp=<optimized out>, m=<optimized out>, dst=<optimized out>, ro=<optimized out>, pflags=<optimized out>)
    at /data/build/src/freebsd/lf/main/sys/net/if_ethersubr.c:243
#9  ether_output (ifp=<unavailable>, ifp@entry=<error reading variable: value is not available>, m=0xfffff8003221c100, m@entry=<error reading variable: value is not available>, dst=<unavailable>, 
    dst@entry=<error reading variable: value is not available>, ro=<unavailable>, ro@entry=<error reading variable: value is not available>) at /data/build/src/freebsd/lf/main/sys/net/if_ethersubr.c:349
#10 0xffffffff8055dc95 in ip_tryforward (m=0xfffff8003221c100) at /data/build/src/freebsd/lf/main/sys/netinet/ip_fastfwd.c:483
#11 0xffffffff80560570 in ip_input (m=0xfffff8003221c100) at /data/build/src/freebsd/lf/main/sys/netinet/ip_input.c:590
#12 0xffffffff805366f8 in netisr_process_workstream_proto (nwsp=0xfffffe00205a3a00, proto=1) at /data/build/src/freebsd/lf/main/sys/net/netisr.c:927
#13 swi_net (arg=0xfffffe00205a3a00) at /data/build/src/freebsd/lf/main/sys/net/netisr.c:974
#14 0xffffffff803e23f9 in intr_event_execute_handlers (ie=0xfffff8000128ab00, p=<optimized out>) at /data/build/src/freebsd/lf/main/sys/kern/kern_intr.c:1183
#15 ithread_execute_handlers (ie=0xfffff8000128ab00, p=<optimized out>) at /data/build/src/freebsd/lf/main/sys/kern/kern_intr.c:1196
#16 ithread_loop (arg=arg@entry=0xfffff80001268880) at /data/build/src/freebsd/lf/main/sys/kern/kern_intr.c:1289
#17 0xffffffff803dea5b in fork_exit (callout=0xffffffff803e21c0 <ithread_loop>, arg=0xfffff80001268880, frame=0xfffffe00d997bf40) at /data/build/src/freebsd/lf/main/sys/kern/kern_fork.c:1151
#18 <signal handler called>
#19 0x1a870f36f88366d0 in ?? ()
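
Note on the lock word: the panic message reports `v 1`. The sketch below decodes an rwlock word under the assumption that the flag layout matches what `sys/sys/rwlock.h` defines (the constants are reproduced from memory and should be checked against the tree in use; `describe_rw_lock` is a hypothetical helper, not a kernel function). Under that assumption, a value of 1 means "read mode, zero readers", i.e. the lock was not held at all when `__rw_wunlock_hard` ran.

```python
# Flag layout ASSUMED from sys/sys/rwlock.h; verify against your source tree.
RW_LOCK_READ = 0x01             # word is in "read" mode (owner field counts readers)
RW_LOCK_READ_WAITERS = 0x02
RW_LOCK_WRITE_WAITERS = 0x04
RW_LOCK_WRITE_SPINNER = 0x08
RW_LOCK_WRITER_RECURSED = 0x10
RW_LOCK_FLAGMASK = 0x1F
RW_READERS_SHIFT = 5
RW_DESTROYED = RW_LOCK_READ_WAITERS | RW_LOCK_WRITE_WAITERS


def describe_rw_lock(v: int) -> str:
    """Hypothetical helper: render an rwlock word as human-readable text."""
    if v == RW_DESTROYED:
        return "destroyed"
    notes = []
    if v & RW_LOCK_READ_WAITERS:
        notes.append("read waiters")
    if v & RW_LOCK_WRITE_WAITERS:
        notes.append("write waiters")
    if v & RW_LOCK_WRITE_SPINNER:
        notes.append("write spinner")
    if v & RW_LOCK_READ:
        readers = (v & ~RW_LOCK_FLAGMASK) >> RW_READERS_SHIFT
        state = "unlocked" if readers == 0 else f"read-locked by {readers} reader(s)"
    else:
        state = f"write-locked, owner thread {v & ~RW_LOCK_FLAGMASK:#x}"
        if v & RW_LOCK_WRITER_RECURSED:
            notes.append("recursed")
    return state if not notes else f"{state} ({', '.join(notes)})"


if __name__ == "__main__":
    print(describe_rw_lock(1))  # the value from the panic message
```

If this reading is right, the write-unlock was attempted on a lock nobody held, so the slow path had no waiters and therefore no turnstile to find.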
Comment 1 Lexi Winter freebsd_triage 2024-10-28 08:27:59 UTC
checking logs, it seems like this system is panicking about once a day with the same message/trace, and this has been happening since it was updated to this 15.0 build (from an older 15.0 build).
Comment 2 Lexi Winter freebsd_triage 2024-10-28 08:30:00 UTC
possibly related settings from /boot/loader.conf:

net.isr.maxthreads=-1
net.isr.bindthreads=1
net.isr.dispatch=hybrid

i've commented these out to see if it makes a difference.
Comment 3 Mark Linimon freebsd_committer freebsd_triage 2024-10-28 08:46:14 UTC
^Triage: panic appears to be in the network stack.
Comment 4 Kevin Bowling freebsd_committer freebsd_triage 2024-10-28 09:41:47 UTC
Maybe related https://redmine.pfsense.org/issues/15601#change-74811
Comment 5 Lexi Winter freebsd_triage 2024-10-28 09:45:01 UTC
"Routes with IPv6 Address as Next Hop for IPv4 Destination Causes Kernel Panic"

i do have one of these:

Internet:
Destination        Gateway            Flags         Netif Expire
10.254.3.1         fe80::2%epair1a    UGH         epair1a

we use these extensively here so i'm surprised only this particular system has run into a problem.
Comment 6 Lexi Winter freebsd_triage 2024-10-28 11:36:12 UTC
i've built a new release with https://reviews.freebsd.org/D45913 applied and will deploy this to test tonight.

this might not be a (recent) regression; i think the issue here might be that we recently made a network change that caused more NDs to be sent for this address.
Comment 7 Mateusz Guzik freebsd_committer freebsd_triage 2024-10-28 11:42:40 UTC
That patch does NOT fix the problem, whatever it may happen to be.

Instead of that, can you please apply the following: https://people.freebsd.org/~mjg/nd6_debug.diff

It should still crash, but provide more information.

Can you share your CPU model?
Comment 8 Lexi Winter freebsd_triage 2024-10-28 11:49:18 UTC
okay, i replaced that patch with your patch and will reboot later tonight.

CPU: Intel(R) Atom(TM) CPU C3758R @ 2.40GHz (2400.22-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x506f1  Family=0x6  Model=0x5f  Stepping=1
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4ff8ebbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,SDBG,CX16,xTPR,PDCM,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x101<LAHF,Prefetch>
  Structured Extended Features=0x2294e283<FSGSBASE,TSCADJ,SMEP,ERMS,NFPUSG,MPX,PQE,RDSEED,SMAP,CLFLUSHOPT,PROCTRACE,SHA>
  Structured Extended Features3=0xac000000<IBPB,STIBP,ARCH_CAP,SSBD>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  IA32_ARCH_CAPS=0x9<RDCL_NO,SKIP_L1DFL_VME>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
  TSC: P-state invariant, performance statistics
Comment 9 Mateusz Guzik freebsd_committer freebsd_triage 2024-10-29 19:41:01 UTC
Thanks.

How is the testing going so far?
Comment 10 Lexi Winter freebsd_triage 2024-10-30 10:00:02 UTC
Unread portion of the kernel message buffer:
panic: lle 0xfffff8015f8e8000 not locked @ /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c:2368!
cpuid = 1
time = 1730156941
KDB: stack backtrace:
#0 0xffffffff804676bd at kdb_backtrace+0x5d
#1 0xffffffff8041d1df at vpanic+0x13f
#2 0xffffffff8041d093 at panic+0x43
#3 0xffffffff805bc173 at nd6_get_llentry+0x3a3
#4 0xffffffff805ba9fb at nd6_resolve_slow+0xfb
#5 0xffffffff805ba7d5 at nd6_resolve+0x125
#6 0xffffffff80530da2 at ether_output+0x502
#7 0xffffffff8055dc95 at ip_tryforward+0x505
#8 0xffffffff80560570 at ip_input+0x310
#9 0xffffffff805366f8 at swi_net+0x138
#10 0xffffffff803e23f9 at ithread_loop+0x239
#11 0xffffffff803dea5b at fork_exit+0x7b
#12 0xffffffff806a37ee at fork_trampoline+0xe
Uptime: 1h42m17s
Dumping 1338 out of 16325 MB:..2%..11%..21%..32%..41%..51%..61%..71%..81%..91%

(kgdb) bt
#0  __curthread () at /data/build/src/freebsd/lf/main/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=<optimized out>) at /data/build/src/freebsd/lf/main/sys/kern/kern_shutdown.c:404
#2  0xffffffff8041cd64 in kern_reboot (howto=260) at /data/build/src/freebsd/lf/main/sys/kern/kern_shutdown.c:524
#3  0xffffffff8041d24c in vpanic (fmt=0xffffffff80719f49 "lle %p not locked @ %s:%d!", ap=ap@entry=0xfffffe00d9967af0) at /data/build/src/freebsd/lf/main/sys/kern/kern_shutdown.c:979
#4  0xffffffff8041d093 in panic (fmt=<unavailable>) at /data/build/src/freebsd/lf/main/sys/kern/kern_shutdown.c:892
#5  0xffffffff805bc173 in nd6_get_llentry (ifp=<optimized out>, addr=<optimized out>, family=<optimized out>) at /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c:2368
#6  0xffffffff805ba9fb in nd6_resolve_slow (ifp=ifp@entry=0xfffff800246d2000, family=family@entry=2, flags=flags@entry=0, m=m@entry=0xfffff802091b6c00, dst=dst@entry=0xfffff8002480fe04, 
    desten=desten@entry=0xfffffe00d9967cb0 "", pflags=0xfffffe00d9967c8c, plle=0x0) at /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c:2408
#7  0xffffffff805ba7d5 in nd6_resolve (ifp=0xfffff800246d2000, gw_flags=<optimized out>, m=m@entry=0xfffff802091b6c00, sa_dst=sa_dst@entry=0xfffff8002480fe04, desten=desten@entry=0xfffffe00d9967cb0 "", 
    pflags=pflags@entry=0xfffffe00d9967c8c, plle=0x0) at /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c:2298
#8  0xffffffff80530da2 in ether_resolve_addr (phdr=0xfffffe00d9967cb0 "", plle=<optimized out>, ifp=<optimized out>, m=<optimized out>, dst=<optimized out>, ro=<optimized out>, pflags=<optimized out>)
    at /data/build/src/freebsd/lf/main/sys/net/if_ethersubr.c:243
#9  ether_output (ifp=<unavailable>, ifp@entry=<error reading variable: value is not available>, m=0xfffff802091b6c00, m@entry=<error reading variable: value is not available>, dst=<unavailable>, 
    dst@entry=<error reading variable: value is not available>, ro=<unavailable>, ro@entry=<error reading variable: value is not available>) at /data/build/src/freebsd/lf/main/sys/net/if_ethersubr.c:349
#10 0xffffffff8055dc95 in ip_tryforward (m=0xfffff802091b6c00) at /data/build/src/freebsd/lf/main/sys/netinet/ip_fastfwd.c:483
#11 0xffffffff80560570 in ip_input (m=0xfffff802091b6c00) at /data/build/src/freebsd/lf/main/sys/netinet/ip_input.c:590
#12 0xffffffff805366f8 in netisr_process_workstream_proto (nwsp=0xfffffe002056ba00, proto=1) at /data/build/src/freebsd/lf/main/sys/net/netisr.c:927
#13 swi_net (arg=0xfffffe002056ba00) at /data/build/src/freebsd/lf/main/sys/net/netisr.c:974
#14 0xffffffff803e23f9 in intr_event_execute_handlers (ie=0xfffff80001289400, p=<optimized out>) at /data/build/src/freebsd/lf/main/sys/kern/kern_intr.c:1183
#15 ithread_execute_handlers (ie=0xfffff80001289400, p=<optimized out>) at /data/build/src/freebsd/lf/main/sys/kern/kern_intr.c:1196
#16 ithread_loop (arg=arg@entry=0xfffff8000126cb00) at /data/build/src/freebsd/lf/main/sys/kern/kern_intr.c:1289
#17 0xffffffff803dea5b in fork_exit (callout=0xffffffff803e21c0 <ithread_loop>, arg=0xfffff8000126cb00, frame=0xfffffe00d9967f40) at /data/build/src/freebsd/lf/main/sys/kern/kern_fork.c:1151
#18 <signal handler called>

#5  0xffffffff805bc173 in nd6_get_llentry (ifp=<optimized out>, addr=<optimized out>, family=<optimized out>) at /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c:2368
warning: 2368   /data/build/src/freebsd/lf/main/sys/netinet6/nd6.c: No such file or directory
(kgdb) info locals
child_lle = <optimized out>
lle = 0xfffff8015f8e8000
lle_tmp = 0xfffff8015f8e8000
(kgdb) print *lle
$1 = {lle_next = {cle_next = 0x0, cle_prev = 0x0}, r_l3addr = {addr4 = {s_addr = 285245694}, addr6 = {__u6_addr = {__u6_addr8 = "\376\200\000\021", '\000' <repeats 11 times>, "\002", __u6_addr16 = {33022, 
          4352, 0, 0, 0, 0, 0, 512}, __u6_addr32 = {285245694, 0, 0, 33554432}}}}, r_linkdata = '\000' <repeats 23 times>, r_hdrlen = 0 '\000', r_family = 2 '\002', spare0 = "\000", r_flags = 0, 
  r_skip_req = 0, lle_tbl = 0xfffff800242ceb00, lle_head = 0x0, lle_free = 0xffffffff8059e7a0 <in6_lltable_destroy_lle>, la_hold = 0xfffff8019caa4b00, la_numheld = 1, la_expire = 0, la_flags = 194, 
  la_asked = 0, la_preempt = 0, ln_state = 0, ln_router = 0, ln_ntick = 0, lle_remtime = 0, lle_hittime = 0, lle_refcnt = 1, ll_addr = 0x0, lle_children = {cslh_first = 0x0}, lle_child_next = {
    csle_next = 0x0}, lle_parent = 0xfffff801a8651900, lle_chain = {cle_next = 0x0, cle_prev = 0x0}, lle_timer = {c_links = {le = {le_next = 0x0, le_prev = 0x0}, sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, 
        tqe_prev = 0x0}}, c_time = 0, c_precision = 0, c_arg = 0x0, c_func = 0x0, c_lock = 0x0, c_flags = 0, c_iflags = 16, c_cpu = 0}, lle_lock = {lock_object = {lo_name = 0xffffffff80708160 "lle", 
      lo_flags = 90374144, lo_data = 0, lo_witness = 0x0}, rw_lock = 1}, req_mtx = {lock_object = {lo_name = 0xffffffff8073325f "lle req", lo_flags = 16973824, lo_data = 0, lo_witness = 0x0}, mtx_lock = 0}, 
  lle_epoch_ctx = {data = {0x0, 0x0}}}
(kgdb)
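
Note on the address in the dump: the `__u6_addr16` words printed by kgdb are host-order (little-endian on amd64) views of the address bytes. The sketch below repacks them and, assuming the usual KAME-style convention that the kernel embeds the interface index in the second 16-bit word of a link-local address, splits out that embedded scope. `decode_u6_addr16` is a hypothetical helper written for this note.

```python
import ipaddress
import struct


def decode_u6_addr16(words):
    """Hypothetical helper: turn kgdb's __u6_addr16 array into (address, scope).

    Assumes a little-endian host (amd64) and KAME-style embedded scope ids
    in link-local addresses.
    """
    raw = struct.pack("<8H", *words)          # recover the 16 raw address bytes
    addr = ipaddress.IPv6Address(raw)
    scope = 0
    if addr.is_link_local:
        scope = int.from_bytes(raw[2:4], "big")  # embedded interface index
        raw = raw[:2] + b"\x00\x00" + raw[4:]    # clear it for display
        addr = ipaddress.IPv6Address(raw)
    return str(addr), scope


if __name__ == "__main__":
    # Values copied from `print *lle` above.
    print(decode_u6_addr16([33022, 4352, 0, 0, 0, 0, 0, 512]))
    # -> ('fe80::2', 17)
```

The entry is therefore for fe80::2 on interface index 17, which is presumably epair1a, matching the IPv4-over-IPv6-nexthop route from comment 5. The same dump also shows `rw_lock = 1` for `lle_lock`.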
Comment 11 Mateusz Guzik freebsd_committer freebsd_triage 2024-10-30 11:32:14 UTC
Thanks. I'm going to send you either a fix or some more debug later today, hopefully the former :)
Comment 12 Mateusz Guzik freebsd_committer freebsd_triage 2024-10-30 11:45:13 UTC
Can you please restore the stock source and apply this instead: https://people.freebsd.org/~mjg/nd6_fix.diff

it should fix the problem but also complain some more if it does not
Comment 13 Mateusz Guzik freebsd_committer freebsd_triage 2024-10-30 11:50:17 UTC
I forgot to mention it would be great if you could build a debug kernel -- with witness, invariants et al, but this is not a hard requirement.
Comment 14 Lexi Winter freebsd_triage 2024-10-30 11:56:47 UTC
thanks, i'll build a new kernel with this patch and debugging enabled and deploy it tonight (in about 12 hours) -- however the system in question hasn't panicked for over a day now so it might take a while to see if this actually fixes the problem.

(if it's caused by ND, i'll try to set something up to cause a lot of NDs to be sent for this address.)
Comment 15 Mateusz Guzik freebsd_committer freebsd_triage 2024-10-30 12:11:25 UTC
It is nd-related.

I'll give you an extended patch which will point out whether the problematic codepath was reached, give me a few.
Comment 16 Mateusz Guzik freebsd_committer freebsd_triage 2024-10-30 12:24:39 UTC
https://people.freebsd.org/~mjg/nd6_fix2.diff

this will print a "found lle_tmp" message once in dmesg. if it is there, the codepath was hit and if there was no panic we are set :)
Comment 17 Lexi Winter freebsd_triage 2024-10-30 22:46:24 UTC
the latest patch is now applied and running on a debug kernel from src ~22429a464a5f4f6bb5a056aae1353985db83b721 (i had to update for an unrelated issue), i'll report back results asap.
Comment 18 Lexi Winter freebsd_triage 2024-10-31 03:18:02 UTC
i've been running this patch for the last few hours and the system has neither panicked nor printed the magic message.

i am completely happy to just leave this running for a few days and see what happens, but if there's anything you'd like me to try to trigger the issue, i'm happy to try it.
Comment 19 Mateusz Guzik freebsd_committer freebsd_triage 2024-10-31 07:06:12 UTC
Just leave it running for a little bit, thank you.

It's all about neighbour discovery, but I don't have a specific reproducer. Hopefully your network will be doing what it was already doing. :)
Comment 20 Mateusz Guzik freebsd_committer freebsd_triage 2024-11-07 16:05:30 UTC
any updates? maybe the printfs showed up?
Comment 21 Lexi Winter freebsd_triage 2024-11-07 16:07:37 UTC
unfortunately neither:

[3!] willow ~# uptime
 4:06PM  up 5 days, 19:57, 2 users, load averages: 0.21, 0.13, 0.12
[4!] willow ~# dmesg|grep lle_tmp
[1? 5!] willow ~# 

this is despite my having added a new IPv4 route with an IPv6 nexthop for our NAT64 jail.

i am considering rebooting with a non-debug kernel, in case it's some timing issue that debug kernel hides.
Comment 22 Mateusz Guzik freebsd_committer freebsd_triage 2024-11-07 16:09:09 UTC
that should be of no significance.

thanks for reporting.
Comment 23 Lexi Winter freebsd_triage 2024-11-07 16:15:42 UTC
okay, do you want to close this then?  i'm happy to keep running this as a local patch since it seems to have fixed the specific problem i was having.
Comment 24 Mateusz Guzik freebsd_committer freebsd_triage 2024-11-07 16:19:14 UTC
i'm going to sleep on it, i don't want to close since there is *definitely* a bug here, but maybe i'll just commit the fix i came up with
Comment 25 Lexi Winter freebsd_triage 2024-11-16 06:47:26 UTC
i rebooted into a non-debug kernel and triggered the printf within 5 minutes of booting:

Nov 16 06:41:17 willow kernel: nd6_get_llentry: found lle_tmp 0xfffff8003bf8ac00

using:

FreeBSD willow.eden.le-fay.org 15.0-CURRENT FreeBSD 15.0-CURRENT #2 lf/main-n269078-561fbdac790: Sun Nov  3 16:32:20 GMT 2024     srcmastr@hemlock.eden.le-fay.org:/data/build/obj/freebsd/data/build/src/freebsd/lf/main/amd64.amd64/sys/LF amd64

the system did not panic.
Comment 26 Mateusz Guzik freebsd_committer freebsd_triage 2024-11-16 06:54:44 UTC
Huh. Thanks.

I'm going to commit soon(tm).