Bug 257067 - panic: Assertion in_epoch(net_epoch_preempt) failed at /usr/src/sys/net/if_vlan.c:1185
Summary: panic: Assertion in_epoch(net_epoch_preempt) failed at /usr/src/sys/net/if_vl...
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 13.0-STABLE
Hardware: arm64 Any
: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2021-07-09 00:15 UTC by Niels Bakker
Modified: 2021-07-11 11:35 UTC (History)
5 users (show)

See Also:
koobs: maintainer-feedback? (afedorov)


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Niels Bakker 2021-07-09 00:15:41 UTC
# cd /var/crash && kgdb /boot/kernel/kernel vmcore.last
GNU gdb (GDB) 10.2 [GDB v10.2 for FreeBSD]
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd13.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:
panic: Assertion in_epoch(net_epoch_preempt) failed at /usr/src/sys/net/if_vlan.c:1185
cpuid = 1
time = 1625786527
KDB: stack backtrace:
#0 0xffffffff80bfece5 at kdb_backtrace+0x65
#1 0xffffffff80bb46d1 at vpanic+0x181
#2 0xffffffff80bb44a3 at panic+0x43
#3 0xffffffff80cd88de at vlan_transmit+0x1de
#4 0xffffffff80cce39b at ether_output_frame+0xab
#5 0xffffffff8210c6d8 at ng_apply_item+0xa8
#6 0xffffffff8210c45e at ng_snd_item+0x2ce
#7 0xffffffff835247f2 at ng_pppoe_disconnect+0x152
#8 0xffffffff8210ac23 at ng_destroy_hook+0x1e3
#9 0xffffffff8210c729 at ng_apply_item+0xf9
#10 0xffffffff8210c45e at ng_snd_item+0x2ce
#11 0xffffffff8210aae9 at ng_destroy_hook+0xa9
#12 0xffffffff8210a8c5 at ng_rmnode+0x65
#13 0xffffffff8210cb03 at ng_apply_item+0x4d3
#14 0xffffffff8210c45e at ng_snd_item+0x2ce
#15 0xffffffff82f35dcf at ngc_send+0x1af
#16 0xffffffff80c51c47 at sosend_generic+0x5f7
#17 0xffffffff80c51f56 at sosend+0x66
Uptime: 7d5h6m15s
Dumping 3807 out of 16148 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55		__asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80bb42fa in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80bb4740 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80bb44a3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff80cd88de in vlan_transmit (ifp=0xfffff8003742e800, m=0xfffff8004de1a900)
    at /usr/src/sys/net/if_vlan.c:1185
#6  0xffffffff80cce39b in ether_output_frame (ifp=0xfffff8003742e800, 
    ifp@entry=<error reading variable: value is not available>, m=<unavailable>, 
    m@entry=<error reading variable: value is not available>) at /usr/src/sys/net/if_ethersubr.c:508
#7  0xffffffff8210c6d8 in ng_apply_item (node=<unavailable>, node@entry=0xfffff800423e8700, 
    item=<unavailable>, item@entry=0xfffff8038da1ad80, rw=0) at /usr/src/sys/netgraph/ng_base.c:2407
#8  0xffffffff8210c45e in ng_snd_item (item=item@entry=0xfffff8038da1ad80, flags=flags@entry=0)
    at /usr/src/sys/netgraph/ng_base.c:2324
#9  0xffffffff835247f2 in ng_pppoe_disconnect (hook=0xfffff80037b6bf00, 
    hook@entry=<error reading variable: value is not available>) at /usr/src/sys/netgraph/ng_pppoe.c:2070
#10 0xffffffff8210ac23 in ng_destroy_hook (hook=0xfffff80037b6bf00, 
    hook@entry=<error reading variable: value is not available>) at /usr/src/sys/netgraph/ng_base.c:1224
#11 0xffffffff8210c729 in ng_apply_item (node=<unavailable>, node@entry=0xfffff80054599d00, 
    item=<unavailable>, item@entry=0xfffff800a06dad80, rw=1) at /usr/src/sys/netgraph/ng_base.c:2478
#12 0xffffffff8210c45e in ng_snd_item (item=0xfffff800a06dad80, flags=0, flags@entry=934723328)
    at /usr/src/sys/netgraph/ng_base.c:2324
#13 0xffffffff8210e471 in ng_send_fn1 (node=0x0, hook=<optimized out>, hook@entry=0xfffff80037b6bf00, 
    fn=<optimized out>, arg1=<optimized out>, arg1@entry=0x0, arg2=-511, arg2@entry=0, flags=<optimized out>, 
    flags@entry=0) at /usr/src/sys/netgraph/ng_base.c:3730
#14 0xffffffff8210aae9 in ng_send_fn (node=<unavailable>, hook=0xfffff80037b6bf00, fn=<unavailable>, 
    arg1=0x0, arg2=0) at /usr/src/sys/netgraph/ng_base.c:3708
#15 ng_rmhook_self (hook=0xfffff80037b6bf00) at /usr/src/sys/netgraph/ng_base.c:1639
#16 ng_destroy_hook (hook=0xfffff80034c07380) at /usr/src/sys/netgraph/ng_base.c:1199
#17 0xffffffff8210a8c5 in ng_rmnode (node=0xfffff80037e6c500, dummy1=<optimized out>, dummy2=<optimized out>, 
    dummy3=<optimized out>) at /usr/src/sys/netgraph/ng_base.c:745
#18 0xffffffff8210cb03 in ng_generic_msg (here=0xfffff80037e6c500, item=0xfffff801210fd480, 
    lasthook=<optimized out>) at /usr/src/sys/netgraph/ng_base.c:2527
#19 ng_apply_item (node=<optimized out>, node@entry=0xfffff80037e6c500, item=<optimized out>, 
    item@entry=0xfffff801210fd480, rw=<optimized out>) at /usr/src/sys/netgraph/ng_base.c:2441
#20 0xffffffff8210c45e in ng_snd_item (item=item@entry=0xfffff801210fd480, flags=flags@entry=0)
    at /usr/src/sys/netgraph/ng_base.c:2324
#21 0xffffffff82f35dcf in ngc_send (so=<unavailable>, 
    so@entry=<error reading variable: value is not available>, flags=<unavailable>, 
    flags@entry=<error reading variable: value is not available>, m=0xfffff803d5b65400, 
    m@entry=<error reading variable: value is not available>, addr=<unavailable>, 
    addr@entry=<error reading variable: value is not available>, control=<unavailable>, 
    control@entry=<error reading variable: value is not available>, td=<unavailable>, 
    td@entry=<error reading variable: value is not available>) at /usr/src/sys/netgraph/ng_socket.c:345
#22 0xffffffff80c51c47 in sosend_generic (so=0xfffff80042011760, 
    so@entry=<error reading variable: value is not available>, addr=0xfffff80141708f80, 
    addr@entry=<error reading variable: value is not available>, uio=<optimized out>, 
    uio@entry=<error reading variable: value is not available>, top=0xfffff803d5b65400, 
    top@entry=<error reading variable: value is not available>, control=<optimized out>, 
    control@entry=<error reading variable: value is not available>, flags=0, 
    flags@entry=<error reading variable: value is not available>, td=0xfffffe0110c06c80, 
    td@entry=<error reading variable: value is not available>) at /usr/src/sys/kern/uipc_socket.c:1755
#23 0xffffffff80c51f56 in sosend (so=<unavailable>, so@entry=0xfffff80042011760, addr=<unavailable>, 
    uio=<unavailable>, uio@entry=0xfffffe01112de9a8, top=<unavailable>, top@entry=0x0, 
    control=control@entry=0x0, flags=<unavailable>, flags@entry=0, td=0xfffffe0110c06c80)
    at /usr/src/sys/kern/uipc_socket.c:1805
#24 0xffffffff80c58ceb in kern_sendit (td=<optimized out>, td@entry=0xfffffe0110c06c80, s=5, 
    mp=<optimized out>, mp@entry=0xfffffe01112dea80, flags=0, control=0x0, segflg=segflg@entry=UIO_USERSPACE)
    at /usr/src/sys/kern/uipc_syscalls.c:798
#25 0xffffffff80c5908b in sendit (td=0xfffffe0110c06c80, td@entry=<unavailable>, s=<unavailable>, 
    mp=mp@entry=0xfffffe01112dea80, flags=<unavailable>) at /usr/src/sys/kern/uipc_syscalls.c:723
#26 0xffffffff80c58e9d in sys_sendto (td=<unavailable>, 
    td@entry=<error reading variable: value is not available>, uap=<unavailable>, 
    uap@entry=<error reading variable: value is not available>) at /usr/src/sys/kern/uipc_syscalls.c:841
#27 0xffffffff8101b59e in syscallenter (td=<optimized out>)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#28 amd64_syscall (td=0xfffffe0110c06c80, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1184
#29 <signal handler called>
#30 0x00000008008bf6da in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdfffdab8
(kgdb) f 5
#5  0xffffffff80cd88de in vlan_transmit (ifp=0xfffff8003742e800, m=0xfffff8004de1a900)
    at /usr/src/sys/net/if_vlan.c:1185
1185		NET_EPOCH_ASSERT();
(kgdb) p *ifp
$1 = {if_link = {cstqe_next = 0xfffff800378a3800}, if_clones = {le_next = 0x0, le_prev = 0xfffff800378a3808}, 
  if_groups = {cstqh_first = 0xfffff800422da7f0, cstqh_last = 0xfffff800422da7c8}, if_alloctype = 6 '\006', 
  if_numa_domain = 255 '\377', if_softc = 0xfffff800423d1e00, if_llsoftc = 0x0, 
  if_l2com = 0xfffff800423e8700, if_dname = 0xffffffff8158fd60 <vlanname> "vlan", if_dunit = 0, if_index = 5, 
  if_index_reserved = 0, if_xname = "vlan0\000\000\000\000\000\000\000\000\000\000", if_description = 0x0, 
  if_flags = 34947, if_drv_flags = 64, if_capabilities = 524546, if_capenable = 524546, if_linkmib = 0x0, 
  if_linkmiblen = 0, if_refcount = 2, if_type = 135 '\207', if_addrlen = 6 '\006', if_hdrlen = 4 '\004', 
  if_link_state = 2 '\002', if_mtu = 1508, if_metric = 0, if_baudrate = 100000000, if_hwassist = 4119, 
  if_epoch = 18, if_lastchange = {tv_sec = 1625163373, tv_usec = 405364}, if_snd = {ifq_head = 0x0, 
    ifq_tail = 0x0, ifq_len = 0, ifq_maxlen = 50, ifq_mtx = {lock_object = {
        lo_name = 0xfffff8003742e858 "vlan0", lo_flags = 16973824, lo_data = 0, lo_witness = 0x0}, 
      mtx_lock = 0}, ifq_drv_head = 0x0, ifq_drv_tail = 0x0, ifq_drv_len = 0, ifq_drv_maxlen = 0, 
    altq_type = 0, altq_flags = 0, altq_disc = 0x0, altq_ifp = 0xfffff8003742e800, altq_enqueue = 0x0, 
    altq_dequeue = 0x0, altq_request = 0x0, altq_clfier = 0x0, altq_classify = 0x0, altq_tbr = 0x0, 
    altq_cdnr = 0x0}, if_linktask = {ta_link = {stqe_next = 0x0}, ta_pending = 0, ta_priority = 0 '\000', 
    ta_flags = 0 '\000', ta_func = 0xffffffff80cc4270 <do_link_state_change>, 
    ta_context = 0xfffff8003742e800}, if_addmultitask = {ta_link = {stqe_next = 0x0}, ta_pending = 0, 
    ta_priority = 0 '\000', ta_flags = 0 '\000', ta_func = 0xffffffff80cc4580 <if_siocaddmulti>, 
    ta_context = 0xfffff8003742e800}, if_addr_lock = {lock_object = {
      lo_name = 0xffffffff81102568 "if_addr_lock", lo_flags = 16973824, lo_data = 0, lo_witness = 0x0}, 
    mtx_lock = 0}, if_addrhead = {cstqh_first = 0xfffff800423e8d00, cstqh_last = 0xfffff800423e8d28}, 
  if_multiaddrs = {cstqh_first = 0x0, cstqh_last = 0xfffff8003742e9d8}, if_amcount = 0, 
  if_addr = 0xfffff800423e8d00, if_hw_addr = 0xfffff800422da7d0, 
  if_broadcastaddr = 0xffffffff8158ee70 <etherbroadcastaddr> "\377\377\377\377\377\377", if_afdata_lock = {
    lock_object = {lo_name = 0xffffffff81199fe8 "if_afdata", lo_flags = 16973824, lo_data = 0, 
      lo_witness = 0x0}, mtx_lock = 0}, if_afdata = {0x0, 0x0, 0xfffff800422d03e0, 0x0 <repeats 25 times>, 
    0xfffff80037d20dc0, 0x0 <repeats 14 times>}, if_afdata_initialized = 2, if_fib = 0, 
  if_vnet = 0xfffff8000419d600, if_home_vnet = 0xfffff8000419d600, if_vlantrunk = 0x0, 
  if_bpf = 0xfffff800423d1c80, if_pcount = 0, if_bridge = 0x0, if_lagg = 0x0, if_pf_kif = 0x0, if_carp = 0x0, 
  if_label = 0x0, if_netmap = 0x0, if_output = 0xffffffff80cd9a00 <vlan_output>, 
  if_input = 0xffffffff80cce980 <ether_input>, if_bridge_input = 0x0, if_bridge_output = 0x0, 
  if_bridge_linkstate = 0x0, if_start = 0x0, if_ioctl = 0xffffffff80cd8930 <vlan_ioctl>, 
  if_init = 0xffffffff80cd86f0 <vlan_init>, if_resolvemulti = 0xffffffff80cceb20 <ether_resolvemulti>, 
  if_qflush = 0xffffffff80cd8920 <vlan_qflush>, if_transmit = 0xffffffff80cd8700 <vlan_transmit>, 
  if_reassign = 0xffffffff80cced00 <ether_reassign>, 
  if_get_counter = 0xffffffff80cc46a0 <if_get_counter_default>, 
  if_requestencap = 0xffffffff80ccec30 <ether_requestencap>, if_counters = {0xfffffe018dcf6228, 
    0xfffffe018dcf6220, 0xfffffe018dcf6218, 0xfffffe018dcf6210, 0xfffffe018dcf6208, 0xfffffe018dcf6200, 
    0xfffffe018dcf61f8, 0xfffffe018dcf61f0, 0xfffffe018dcf61e8, 0xfffffe018dcf61e0, 0xfffffe018dcf61d8, 
    0xfffffe018dcf61d0}, if_hw_tsomax = 65518, if_hw_tsomaxsegcount = 35, if_hw_tsomaxsegsize = 2048, 
  if_snd_tag_alloc = 0xffffffff80cd8ff0 <vlan_snd_tag_alloc>, 
  if_snd_tag_modify = 0xffffffff80cd90f0 <vlan_snd_tag_modify>, 
  if_snd_tag_query = 0xffffffff80cd9110 <vlan_snd_tag_query>, 
  if_snd_tag_free = 0xffffffff80cd9130 <vlan_snd_tag_free>, 
  if_next_snd_tag = 0xffffffff80cd9180 <vlan_next_snd_tag>, 
  if_ratelimit_query = 0xffffffff80cd9190 <vlan_ratelimit_query>, if_ratelimit_setup = 0x0, 
  if_pcp = 0 '\000', if_debugnet_methods = 0x0, if_epoch_ctx = {data = {0x0, 0x0}}, if_ispare = {0, 0, 0, 0}}
(kgdb) 

This host runs FreeBSD stable/13-n246143-6aee7855180-dirty from a week and a half ago, GENERIC but with INVARIANTS enabled. Interface alc0 has a VLAN interface vlan0 over which mpd5 runs PPPoE to my ISP, which had an outage, thus a LCP echo request timeout occurred; the panic occurred as mpd was in the process of tearing down its ng0 interface. Its last log message was:

Jul  9 01:22:01 server mpd[997]: [P] LCP: LayerDown
Comment 1 Aleksandr Fedorov freebsd_committer 2021-07-09 08:27:46 UTC
I think this is a bug in the ng_ppoe(4) code. Disconnecting hooks are called outside of NET_EPOCH, but ng_pppoe_disconnect() calls NG_SEND_DATA_ONLY() which should be called in NET_EPOCH.

Please try the following patch:

diff --git a/sys/netgraph/ng_pppoe.c b/sys/netgraph/ng_pppoe.c
index 295a136cc55..e07f77b9d54 100644
--- a/sys/netgraph/ng_pppoe.c
+++ b/sys/netgraph/ng_pppoe.c
@@ -2037,6 +2037,7 @@ ng_pppoe_disconnect(hook_p hook)
                                log(LOG_NOTICE, "ng_pppoe[%x]: session out of "
                                    "mbufs\n", node->nd_ID);
                        else {
+                               struct epoch_tracker et;
                                struct pppoe_full_hdr *wh;
                                struct pppoe_tag *tag;
                                int     msglen = strlen(SIGNOFF);
@@ -2067,8 +2068,11 @@ ng_pppoe_disconnect(hook_p hook)
                                m->m_pkthdr.len = m->m_len = sizeof(*wh) + sizeof(*tag) +
                                    msglen;
                                wh->ph.length = htons(sizeof(*tag) + msglen);
+
+                               NET_EPOCH_ENTER(et);
                                NG_SEND_DATA_ONLY(error,
                                        privp->ethernet_hook, m);
+                               NET_EPOCH_EXIT(et);
                        }
                }
                if (sp->state == PPPOE_LISTENING)
Comment 2 Niels Bakker 2021-07-09 12:17:34 UTC
Test methodology: Unplug cable carrying the PPPoE session from the NTU, plug into a switch so link stays up, wait.

Observed behaviour pre patch: same panic as ng0 gets torn down
Observed behaviour with above patch applied: normal operation continues after mpd tears down the link, and the PPPoE link comes back up as soon as the cable is plugged back into the NTU.

Success! Thanks, Alexandr Fedorov!
Comment 3 Niels Bakker 2021-07-09 12:18:16 UTC
Apologies for the typo, Aleksandr
Comment 4 Mark Johnston freebsd_committer 2021-07-09 13:43:43 UTC
(In reply to Aleksandr Fedorov from comment #1)
There are several other places where ng_pppoe sends data on the ethernet hook without entering an epoch section - I think they all need to be updated?
Comment 5 Aleksandr Fedorov freebsd_committer 2021-07-09 14:37:34 UTC
(In reply to Mark Johnston from comment #4)

I think the previous fix closed all similar problems: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=248958 and https://reviews.freebsd.org/D26226

I have not yet found other problem areas in the ng_pppoe(4) code.

The main idea behind introducing NET_EPOCH in netgraph (4) is:
- Data processing (NG_SEND_DATA_ *) should always be in the NET_EPOCH section.
- Processing of service messages (NG_SEND_MSG_ *) should not be in the NET_EPOCH section. Because message handling (* _rcvmsg, connect/disconnect hooks, node constructor/destructor) allows sleeping and other forbidden operations for NET_EPOCH section.

But on many nodes, some data processing routines generate messages, and message processing routines generate data (as in this PR).

Unfortunately, there are cases that are not so easy to fix without reworking ng_base.c.
Comment 6 Kubilay Kocak freebsd_committer freebsd_triage 2021-07-11 01:19:30 UTC
(In reply to Aleksandr Fedorov from comment #5)

Is this the same issue as bug 248958? If so, I note the changed weren't tagged for MFC. Should those changes be merged to stable/* branches? If so, please re-open bug 248958, set mfc-stable{13,12} flags to "?" there, and close this issue as a duplicate of it

Thanks
Comment 7 Niels Bakker 2021-07-11 11:33:54 UTC
It's not the same bug. It manifests in a completely different file. At most it's the same class of bug introduced with epoch(9).
Comment 8 Niels Bakker 2021-07-11 11:35:28 UTC
Same file but a different line, apologies again