Bug 255775 - panic with ipfw turned on at boot time
Summary: panic with ipfw turned on at boot time
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-ipfw (Nobody)
URL:
Keywords: crash, ipfilter
Depends on:
Blocks:
 
Reported: 2021-05-11 05:09 UTC by Michael Meiszl
Modified: 2021-08-27 14:09 UTC
CC List: 3 users

See Also:


Attachments
crashlog full version (32.08 KB, application/x-zip-compressed)
2021-05-11 05:11 UTC, Michael Meiszl
ruleset mam (12.30 KB, text/plain)
2021-06-19 16:45 UTC, Michael Meiszl

Description Michael Meiszl 2021-05-11 05:09:25 UTC
As suggested by Mark Johnston, I am filing this as a "new" bug because, after some tests, it does not seem to be related to #255104.

Description: 13.0 stock Kernel crashes within a few mins if ipfw has been turned on in rc.conf.
If turned off in rc.conf and started later on by root manually, ipfw works flawlessly and the machine is stable for weeks!

There is no fancy setup for ipfw, no divert, no nat, just plain "deny if it comes from addr x" rules.

As I have been told already, I created a kernel with latest patches (including 255104) and turned on INVARIANTS.

I attach the core.txt file at the end; here is a brief summary:

panic: Assertion m->m_nextpkt == NULL failed at /root/src/sys/net/iflib.c:4087
cpuid = 0
time = 1620674444
KDB: stack backtrace:
#0 0xffffffff80c400e5 at kdb_backtrace+0x65
#1 0xffffffff80bf5be1 at vpanic+0x181
#2 0xffffffff80bf59b3 at panic+0x43
#3 0xffffffff80d29c5b at iflib_if_transmit+0x15b
#4 0xffffffff80d0fb9b at ether_output_frame+0xab
#5 0xffffffff80d0faa1 at ether_output+0x6b1
#6 0xffffffff80da58ef at ip_output_send+0x8f
#7 0xffffffff80da55a5 at ip_output+0x1495
#8 0xffffffff80d12350 at gif_transmit+0x2f0
#9 0xffffffff80df2b9b at ip6_forward+0x95b
#10 0xffffffff80df4414 at ip6_input+0xf04
#11 0xffffffff80d2cb11 at netisr_dispatch_src+0xb1
#12 0xffffffff80d0fd3e at ether_demux+0x17e
#13 0xffffffff80d113cc at ether_nh_input+0x40c
#14 0xffffffff80d2cb11 at netisr_dispatch_src+0xb1
#15 0xffffffff80d10231 at ether_input+0xa1
#16 0xffffffff80d28bd7 at iflib_rxeof+0xe07
#17 0xffffffff80d2274a at _task_fn_rx+0x7a
Uptime: 25s
Dumping 1160 out of 32617 MB:..2%..12%..21%..31%..42%..51%..61%..71%..82%..91%

__curthread () at /root/src/sys/amd64/include/pcpu_aux.h:55
55		__asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /root/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>)
    at /root/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80bf580b in kern_reboot (howto=260)
    at /root/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80bf5c50 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /root/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80bf59b3 in panic (fmt=<unavailable>)
    at /root/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff80d29c5b in iflib_if_transmit (ifp=0xfffff80003dff800, 
    m=0xfffff8005ce3ce00) at /root/src/sys/net/iflib.c:4087
#6  0xffffffff80d0fb9b in ether_output_frame (
    ifp=ifp@entry=0xfffff80003dff800, m=<unavailable>)
    at /root/src/sys/net/if_ethersubr.c:511
#7  0xffffffff80d0faa1 in ether_output (ifp=<optimized out>, 
    ifp@entry=<error reading variable: value is not available>, 
    m=<unavailable>, 
    m@entry=<error reading variable: value is not available>, 
    dst=0xfffffe003499c5a0, 
    dst@entry=<error reading variable: value is not available>, 
    ro=<optimized out>, 
    ro@entry=<error reading variable: value is not available>)
    at /root/src/sys/net/if_ethersubr.c:438
#8  0xffffffff80da58ef in ip_output_send (inp=inp@entry=0x0, 
    ifp=<unavailable>, ifp@entry=0xfffff80003dff800, 
    m=m@entry=0xfffff8005ce3ce00, gw=gw@entry=0xfffffe003499c5a0, 
    ro=<unavailable>, ro@entry=0x0, stamp_tag=<optimized out>)
    at /root/src/sys/netinet/ip_output.c:275
#9  0xffffffff80da55a5 in ip_output (m=0xfffff8005ce3ce00, m@entry=0x0, 
    opt=opt@entry=0x0, ro=<optimized out>, ro@entry=0x0, 
    flags=<optimized out>, flags@entry=0, imo=imo@entry=0x0, 
    inp=<optimized out>, inp@entry=0x0)
    at /root/src/sys/netinet/ip_output.c:812
#10 0xffffffff80d92c59 in in_gif_output (ifp=ifp@entry=0xfffff80134802000, 
    m=<optimized out>, m@entry=0xfffff8005cc87200, proto=<optimized out>, 
    ecn=<optimized out>) at /root/src/sys/netinet/in_gif.c:306
#11 0xffffffff80d12350 in gif_transmit (ifp=0xfffff80134802000, 
    m=0xfffff8005cc87200) at /root/src/sys/net/if_gif.c:380
#12 0xffffffff80df2b9b in ip6_forward (m=<unavailable>, srcrt=srcrt@entry=0)
    at /root/src/sys/netinet6/ip6_forward.c:387
#13 0xffffffff80df4414 in ip6_input (m=<unavailable>, 
    m@entry=<error reading variable: value is not available>)
    at /root/src/sys/netinet6/ip6_input.c:897
#14 0xffffffff80d2cb11 in netisr_dispatch_src (proto=6, 
    source=source@entry=0, m=0xfffff8005cc87200)
    at /root/src/sys/net/netisr.c:1143
#15 0xffffffff80d2ce5f in netisr_dispatch (proto=<unavailable>, 
    m=<unavailable>) at /root/src/sys/net/netisr.c:1234
#16 0xffffffff80d0fd3e in ether_demux (ifp=ifp@entry=0xfffff80003dff800, 
    m=<unavailable>) at /root/src/sys/net/if_ethersubr.c:923
#17 0xffffffff80d113cc in ether_input_internal (ifp=0xfffff80003dff800, 
    m=<unavailable>) at /root/src/sys/net/if_ethersubr.c:709
#18 ether_nh_input (m=<optimized out>, 
    m@entry=<error reading variable: value is not available>)
    at /root/src/sys/net/if_ethersubr.c:739
#19 0xffffffff80d2cb11 in netisr_dispatch_src (proto=proto@entry=5, 
    source=source@entry=0, m=m@entry=0xfffff8005cc87200)
    at /root/src/sys/net/netisr.c:1143
#20 0xffffffff80d2ce5f in netisr_dispatch (proto=<unavailable>, 
    proto@entry=5, m=<unavailable>, m@entry=0xfffff8005cc87200)
    at /root/src/sys/net/netisr.c:1234
#21 0xffffffff80d10231 in ether_input (ifp=0xfffff80003dff800, 
    ifp@entry=<error reading variable: value is not available>, 
    m=0xfffff8005cc87200, 
    m@entry=<error reading variable: value is not available>)
    at /root/src/sys/net/if_ethersubr.c:830
#22 0xffffffff80d28bd7 in iflib_rxeof (rxq=<optimized out>, 
    rxq@entry=0xfffff80003dcc000, budget=<optimized out>)
    at /root/src/sys/net/iflib.c:3006
#23 0xffffffff80d2274a in _task_fn_rx (context=0xfffff80003dcc000)
    at /root/src/sys/net/iflib.c:3949
#24 0xffffffff80c3ea77 in gtaskqueue_run_locked (
    queue=queue@entry=0xfffff80003988100)
    at /root/src/sys/kern/subr_gtaskqueue.c:371
#25 0xffffffff80c3e874 in gtaskqueue_thread_loop (
    arg=arg@entry=0xfffffe00379de008)
    at /root/src/sys/kern/subr_gtaskqueue.c:547
#26 0xffffffff80bb1f00 in fork_exit (
    callout=0xffffffff80c3e7e0 <gtaskqueue_thread_loop>, 
    arg=0xfffffe00379de008, frame=0xfffffe003499cc00)
    at /root/src/sys/kern/kern_fork.c:1069
#27 <signal handler called>
(kgdb)
Comment 1 Michael Meiszl 2021-05-11 05:11:22 UTC
Created attachment 224828
crashlog full version
Comment 2 Michael Meiszl 2021-05-11 05:27:50 UTC
I was wondering what the difference is between being started by rc.conf or manually by root afterwards.

Scrolling through the crashlog showed me that with the rc start, the tunnel interface gif0 has not yet been created and attached. But ipfw contains a lot of rules for gif0 (actually its main job is to keep the baddies from the net out of this machine and grant access only to certain services/machines).

Maybe this is a hint about where to look for the failed m_nextpkt == NULL check?
Comment 3 Andrey V. Elsukov freebsd_committer freebsd_triage 2021-05-12 09:11:01 UTC
Your panic doesn't seem to be related to ipfw. The backtrace shows that your host received an IPv6 packet that was forwarded into an IP-IP tunnel. The panic was then triggered by a KASSERT in iflib because the packet's mbuf had an unexpected non-NULL m_nextpkt field.
Comment 4 Michael Meiszl 2021-05-12 09:16:29 UTC
Thanks for the analysis, but what am I supposed to do now?

This mbuf stuff only occurs if ipfw is loaded BEFORE the tunnel is even created.

(Or maybe it's being created just then and the first packets are coming in. I can't decide; it looks like the machine survives until the tunnel starts receiving.)
Comment 5 Andrey V. Elsukov freebsd_committer freebsd_triage 2021-05-12 09:36:04 UTC
(In reply to Michael Meiszl from comment #4)
Can you show some output from kgdb?
I think these commands should work to obtain needed info:

# cd /var/crash/
# kgdb -q /boot/kernel/kernel vmcore.0

f 11
p/x *m
f 8
p/x *m

You can also try patching the kernel like this, as a test workaround:

--- a/sys/netinet/ip_output.c
+++ b/sys/netinet/ip_output.c
@@ -807,6 +807,7 @@ ip_output(struct mbuf *m, struct mbuf *opt, struct route *ro, int flags,
                 * Reset layer specific mbuf flags
                 * to avoid confusing lower layers.
                 */
+               m->m_nextpkt = NULL;
                m_clrprotoflags(m);
                IP_PROBE(send, NULL, NULL, ip, ifp, ip, NULL);
                error = ip_output_send(inp, ifp, m, gw, ro,
Comment 6 Michael Meiszl 2021-05-12 11:30:54 UTC
The Data you have asked for:
[root@l3router ~]# cd /var/crash/
[root@l3router /var/crash]# kgdb -q /boot/kernel/kernel vmcore.0
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...
__curthread () at /root/src/sys/amd64/include/pcpu_aux.h:55
55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) f 11
#11 0xffffffff80d12350 in gif_transmit (ifp=0xfffff80134802000, m=0xfffff8005cc87200) at /root/src/sys/net/if_gif.c:380
380                     error = in_gif_output(ifp, m, proto, ecn);
(kgdb) p/x *m
$1 = {{m_next = 0x0, m_slist = {sle_next = 0x0}, m_stailq = {stqe_next = 0x0}}, {m_nextpkt = 0x0, m_slistpkt = {sle_next = 0x0},
    m_stailqpkt = {stqe_next = 0x0}}, m_data = 0xfffff8005cc872a2, m_len = 0x14, m_type = 0x1, m_flags = 0x0, {{{m_pkthdr = {{
            snd_tag = 0xfffff80003dff800, rcvif = 0xfffff80003dff800}, tags = {slh_first = 0x0}, len = 0x50, flowid = 0x7a3d245c,
          csum_flags = 0xc000000, fibnum = 0x0, numa_domain = 0xff, rsstype = 0xbf, {rcv_tstmp = 0x0, {l2hlen = 0x0, l3hlen = 0x0,
              l4hlen = 0x0, l5hlen = 0x0, inner_l2hlen = 0x0, inner_l3hlen = 0x0, inner_l4hlen = 0x0, inner_l5hlen = 0x0}}, PH_per = {
            eight = {0x0, 0x0, 0x0, 0x0, 0x1c, 0x0, 0x0, 0x0}, sixteen = {0x0, 0x0, 0x1c, 0x0}, thirtytwo = {0x0, 0x1c}, sixtyfour = {
              0x1c00000000}, unintptr = {0x1c00000000}, ptr = 0x1c00000000}, PH_loc = {eight = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
            sixteen = {0x0, 0x0, 0x0, 0x0}, thirtytwo = {0x0, 0x0}, sixtyfour = {0x0}, unintptr = {0x0}, ptr = 0x0}}, {m_epg_npgs = 0x0,
          m_epg_nrdy = 0xf8, m_epg_hdrlen = 0xdf, m_epg_trllen = 0x3, m_epg_1st_off = 0xf800, m_epg_last_len = 0xffff, m_epg_flags = 0x0,
          m_epg_record_type = 0x0, __spare = {0x0, 0x0}, m_epg_enc_cnt = 0x0, m_epg_tls = 0x7a3d245c00000050,
          m_epg_so = 0xbfff00000c000000, m_epg_seqno = 0x0, m_epg_stailq = {stqe_next = 0x1c00000000}}}, {m_ext = {{
            ext_count = 0x209f36a0, ext_cnt = 0x26702e30209f36a0}, ext_size = 0x1c1fd705, ext_type = 0x86, ext_flags = 0x60dd, {{
              ext_buf = 0x1203f0628000000, ext_arg2 = 0xb41d0100af707004}, {extpg_pa = {0x1203f0628000000, 0xb41d0100af707004,
                0x2aa924398bd2db, 0x2f0801405014, 0xb2a1032000000000}, extpg_trail = {0x1, 0xbb, 0xbf, 0xf9, 0x9e, 0xae, 0x0, 0x0, 0x0,
                0x0, 0xa0, 0x2, 0xff, 0xff, 0xb8, 0xf1, 0x0, 0x0, 0x2, 0x4, 0x5, 0xa0, 0x4, 0x2, 0x8, 0xa, 0x1, 0x74, 0xa6, 0x8b, 0x0,
                0x0, 0x0, 0x0, 0x1, 0x3, 0x3, 0x6, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde,
                0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde}, extpg_hdr = {0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0,
                0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad}}},
          ext_free = 0xdeadc0dedeadc0de, ext_arg1 = 0xdeadc0dedeadc0de}, m_pktdat = 0xfffff8005cc87258}}, m_dat = 0xfffff8005cc87220}}
(kgdb) f 8
#8  0xffffffff80da58ef in ip_output_send (inp=inp@entry=0x0, ifp=<unavailable>, ifp@entry=0xfffff80003dff800,
    m=m@entry=0xfffff8005ce3ce00, gw=gw@entry=0xfffffe003499c5a0, ro=<unavailable>, ro@entry=0x0, stamp_tag=<optimized out>)
    at /root/src/sys/netinet/ip_output.c:275
275             error = (*ifp->if_output)(ifp, m, (const struct sockaddr *)gw, ro);
(kgdb) p/x *m
$2 = {{m_next = 0xdeadc0dedeadc0de, m_slist = {sle_next = 0xdeadc0dedeadc0de}, m_stailq = {stqe_next = 0xdeadc0dedeadc0de}}, {
    m_nextpkt = 0xdeadc0dedeadc0de, m_slistpkt = {sle_next = 0xdeadc0dedeadc0de}, m_stailqpkt = {stqe_next = 0xdeadc0dedeadc0de}},
  m_data = 0xdeadc0dedeadc0de, m_len = 0xdeadc0de, m_type = 0xde, m_flags = 0xdeadc0, {{{m_pkthdr = {{snd_tag = 0xdeadc0dedeadc0de,
            rcvif = 0xdeadc0dedeadc0de}, tags = {slh_first = 0xdeadc0dedeadc0de}, len = 0xdeadc0de, flowid = 0xdeadc0de,
          csum_flags = 0xdeadc0de, fibnum = 0xc0de, numa_domain = 0xad, rsstype = 0xde, {rcv_tstmp = 0xdeadc0dedeadc0de, {l2hlen = 0xde,
              l3hlen = 0xc0, l4hlen = 0xad, l5hlen = 0xde, inner_l2hlen = 0xde, inner_l3hlen = 0xc0, inner_l4hlen = 0xad,
              inner_l5hlen = 0xde}}, PH_per = {eight = {0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde}, sixteen = {0xc0de, 0xdead,
              0xc0de, 0xdead}, thirtytwo = {0xdeadc0de, 0xdeadc0de}, sixtyfour = {0xdeadc0dedeadc0de}, unintptr = {0xdeadc0dedeadc0de},
            ptr = 0xdeadc0dedeadc0de}, PH_loc = {eight = {0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde}, sixteen = {0xc0de, 0xdead,
              0xc0de, 0xdead}, thirtytwo = {0xdeadc0de, 0xdeadc0de}, sixtyfour = {0xdeadc0dedeadc0de}, unintptr = {0xdeadc0dedeadc0de},
            ptr = 0xdeadc0dedeadc0de}}, {m_epg_npgs = 0xde, m_epg_nrdy = 0xc0, m_epg_hdrlen = 0xad, m_epg_trllen = 0xde,
          m_epg_1st_off = 0xc0de, m_epg_last_len = 0xdead, m_epg_flags = 0xde, m_epg_record_type = 0xc0, __spare = {0xad, 0xde},
          m_epg_enc_cnt = 0xdeadc0de, m_epg_tls = 0xdeadc0dedeadc0de, m_epg_so = 0xdeadc0dedeadc0de, m_epg_seqno = 0xdeadc0dedeadc0de,
          m_epg_stailq = {stqe_next = 0xdeadc0dedeadc0de}}}, {m_ext = {{ext_count = 0xdeadc0de, ext_cnt = 0xdeadc0dedeadc0de},
          ext_size = 0xdeadc0de, ext_type = 0xde, ext_flags = 0xdeadc0, {{ext_buf = 0xdeadc0dedeadc0de, ext_arg2 = 0xdeadc0dedeadc0de}, {
              extpg_pa = {0xdeadc0dedeadc0de, 0xdeadc0dedeadc0de, 0xdeadc0dedeadc0de, 0xdeadc0dedeadc0de, 0xdeadc0dedeadc0de},
              extpg_trail = {0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0,
                0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0,
                0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0,
                0xad, 0xde, 0xde, 0xc0, 0xad, 0xde}, extpg_hdr = {0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde,
                0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad, 0xde, 0xde, 0xc0, 0xad}}}, ext_free = 0xdeadc0dedeadc0de,
          ext_arg1 = 0xdeadc0dedeadc0de}, m_pktdat = 0xfffff8005ce3ce58}}, m_dat = 0xfffff8005ce3ce20}}
(kgdb)

I'm afraid I can't try out the kernel patch today anymore; the machine is busy and cannot be rebooted for now (or else you might hear in the TV news tomorrow "admin got killed by upset users").
I will try it as soon as possible and report back.
Comment 7 Michael Meiszl 2021-05-12 11:33:16 UTC
btw: cute :-)
I don't understand anything from the output, but "deadcode" sounds funny, even written in hex :-)))
Comment 8 Andrey V. Elsukov freebsd_committer freebsd_triage 2021-05-12 11:43:46 UTC
(In reply to Michael Meiszl from comment #7)

I think there is no need to test the suggested patch. The problem seems to be a "use after free": the mbuf had already been freed and its memory was filled with the 0xdeadc0de pattern. This is why the KASSERT was triggered.
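To illustrate how the second kgdb dump above can be read: if a freed object is overwritten with the 32-bit pattern 0xdeadc0de, every field shows a slice of that pattern, and the slice depends only on the field's size and offset. The following is a minimal user-space sketch, not kernel code. It assumes little-endian byte order as on amd64, and the names POISON, poison_fill and read_le are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pattern seen in the dump of the freed mbuf. */
#define POISON 0xdeadc0deU

/* Fill buf with the 32-bit pattern, stored little-endian as on amd64. */
static void
poison_fill(uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] = (uint8_t)(POISON >> (8 * (i % 4)));
}

/* Read an n-byte (n <= 8) little-endian unsigned value at buf + off. */
static uint64_t
read_le(const uint8_t *buf, size_t off, size_t n)
{
	uint64_t v = 0;

	for (size_t i = 0; i < n; i++)
		v |= (uint64_t)buf[off + i] << (8 * i);
	return (v);
}
```

With a buffer filled this way, an 8-byte read gives 0xdeadc0dedeadc0de (the value shown for m_nextpkt, which is why a check for NULL fails), a 2-byte read at a 4-byte-aligned offset gives 0xc0de (as shown for fibnum), and single bytes read 0xde, 0xc0 or 0xad. The m_type = 0xde and m_flags = 0xdeadc0 values are consistent with an 8-bit and a 24-bit field sharing one 32-bit word.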
Comment 9 Michael Meiszl 2021-05-12 12:16:30 UTC
ok, as you wish.

But a "use after free" may be very complicated to find, sorry to have triggered it somehow.

If you need more kgdb output, just tell me what to run, or I can upload the whole /var/crash files somewhere if you want to look at more details.
Comment 10 Michael Meiszl 2021-05-12 12:28:58 UTC
Just a note: in the file you pointed out for patching, a few lines further down at line 834, you find:
        error = ip_fragment(ip, &m, mtu, ifp->if_hwassist);
        if (error)
                goto bad;
        for (; m; m = m0) {
                m0 = m->m_nextpkt;
                m->m_nextpkt = 0;       /* <<<<<<< this line !!!! */
                if (error == 0) {
                        /* Record statistics for this interface address. */
                        if (ia != NULL) {
                                counter_u64_add(ia->ia_ifa.ifa_opackets, 1);
                                counter_u64_add(ia->ia_ifa.ifa_obytes,
                                    m->m_pkthdr.len);
                        }

Although legal, "m->m_nextpkt = 0;" does not look right. It would be better written as "m->m_nextpkt = NULL;" I think.

but 0 is surely not 0xdeadc0de...
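For reference, in C the constant 0 assigned to a pointer is a null pointer constant, so "= 0" and "= NULL" store the same value and the difference is purely one of style. Below is a minimal user-space sketch of the same unlink loop. The struct pkt type is a hypothetical stand-in for the kernel's struct mbuf, and unlink_chain is a name invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mbuf: only the packet-chain link. */
struct pkt {
	struct pkt *nextpkt;
};

/*
 * Walk a chain of packets linked through nextpkt, detach each packet
 * from its successor, and return the number of packets visited.
 */
static int
unlink_chain(struct pkt *m)
{
	struct pkt *m0;
	int n = 0;

	for (; m != NULL; m = m0) {
		m0 = m->nextpkt;	/* remember the successor first */
		m->nextpkt = 0;		/* same effect as assigning NULL */
		n++;
	}
	return (n);
}
```

After the loop, every packet's nextpkt compares equal to NULL, which is the condition the iflib assertion checks for.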
Comment 11 Michael Meiszl 2021-05-15 07:24:51 UTC
after running for some days (with fw started manually) it crashed again yesterday, but on a totally different function:
Unread portion of the kernel message buffer:
panic: Assertion stp->st_flags == 0 failed at /root/src/sys/kern/sys_generic.c:1942
cpuid = 1
time = 1620999784
KDB: stack backtrace:
#0 0xffffffff80c400e5 at kdb_backtrace+0x65
#1 0xffffffff80bf5be1 at vpanic+0x181
#2 0xffffffff80bf59b3 at panic+0x43
#3 0xffffffff80c63b20 at seltdfini+0xa0
#4 0xffffffff80bac8fa at exit1+0x49a
#5 0xffffffff80bbddda at kproc_exit+0xaa
#6 0xffffffff82b5116e at smb_iod_thread+0x37e
#7 0xffffffff80bb1f00 at fork_exit+0x80
#8 0xffffffff8105c6ae at fork_trampoline+0xe
Uptime: 2d2h27m29s

I guess this has nothing to do with the main issue, but it made me revert to the original, unpatched 13.0 kernel for now.

My current approach is to start the fw with a combination of cron and at:

CronEntry: @reboot at -f /root/startfirewall now+3min

Startfirewall script: 
#!/bin/sh
/usr/sbin/service ipfw onestart

totally simple, but it seems to work for now. I did not notice the panic yesterday so the whole net ran without fw protection for almost a day. This is not acceptable. Cron+at limit the dangerous time after a panic or reboot to 3mins
Comment 12 Mark Johnston freebsd_committer freebsd_triage 2021-06-16 13:37:26 UTC
Do you have the net.link.ether.ipfw sysctl set to 1 by any chance?
Comment 13 Michael Meiszl 2021-06-16 13:42:28 UTC
(In reply to Mark Johnston from comment #12)
no, it is 0 here
Comment 14 Michael Meiszl 2021-06-16 14:10:55 UTC
My suggested workaround still works flawlessly. When started "later" ipfw does not crash here. The machine now is up for weeks without any problems and it also comes back up alive after a reboot.
So I did not investigate any further during the last weeks.
But if you tell me to try out some new changes, I will.

Sadly, there is no L2 filtering going on here at all.
Comment 15 Mark Johnston freebsd_committer freebsd_triage 2021-06-16 14:16:49 UTC
(In reply to Michael Meiszl from comment #14)
I think to make progress on this we'd have to look at a vmcore from one of the panics.  For this I'd also need a copy of the matching /boot/kernel and /usr/lib/debug/boot/kernel directories.
Comment 16 Michael Meiszl 2021-06-16 15:36:39 UTC
Yeah, it still happens if I switch on ipfw at boot time.

Where do you want all those files to be uploaded to?
Comment 17 Mark Johnston freebsd_committer freebsd_triage 2021-06-19 14:51:58 UTC
(In reply to Michael Meiszl from comment #16)
I don't have a good place to upload vmcores, sorry.  Google drive is used sometimes.  Reading through again, I'm not sure that a vmcore will be very useful.  I suspect your comment 2 is a good clue.  I don't quite follow though: the firewall rules reference gif0, and the rules are loaded before gif0 is created?  I would have assumed that this would not be permitted.  It may be more useful to share the exact ruleset you're using.
Comment 18 Michael Meiszl 2021-06-19 16:44:12 UTC
(In reply to Mark Johnston from comment #17)
No big problem, the rules are straightforward. See attached rulesfile.
(I needed to blur out some IPs, they are not good for the public)

Originally I had Table(1/2/3) filled by fail2ban, but I converted them to static rules in the hope that the bug was inside the table management (did not work, sniff).

Anyway, it's a very restrictive setting. 80% of the world is locked out. For V6, only certain ports to certain hosts are offered to the outside (there is no V4 restriction on ports because that is handled by a different machine).
Comment 19 Michael Meiszl 2021-06-19 16:45:07 UTC
Created attachment 225933
ruleset mam
Comment 20 Michael Meiszl 2021-06-19 16:51:24 UTC
About the load sequence: I was also surprised when I checked it with "rcorder ..."
But it does not seem to be important. I've tried changing the order to ensure that gif0 is created and up before ipfw is loaded; it made no difference to the crashes.
Comment 21 Michael Meiszl 2021-06-24 17:32:51 UTC
one more idea where to search:

this machine uses "Intel(R) Ethernet Converged Network Adapter X550-T2" Adaptor (more ONE, because it is a dual-port card) running on 10Gbe Copper Connection.
After enabling the card it takes really long (>10s) to establish the link with the switch.
Maybe during this time some "strange" (wrong, cut off or something else) packets come in and drive the firewall crazy???
Maybe that patch is already good for me even if I do not check for L2 packets yet???

If I get some spare time next week, I will give it a try again and see if the bug has already vanished. I will report...
Comment 22 Michael Meiszl 2021-06-26 03:47:05 UTC
Yeah Kubilay, we know that already.
But so far we think it does not apply, because I have net.link.ether.ipfw set to 0, effectively turning off all that L2 processing in ipfw.
Comment 23 Michael Meiszl 2021-08-26 08:07:33 UTC
After waiting some weeks (and some Kernel Patches :-) ) I gave it a new chance again this morning.
IT WORKED !!!
it did not crash anymore and is now up and running fine for at least 2hrs.
(remember: before it did not survive for 30s after a reboot).

So I guess we can close this strange case now. 

tnx for your patience
MAM
Comment 24 Mark Johnston freebsd_committer freebsd_triage 2021-08-27 13:10:36 UTC
(In reply to Michael Meiszl from comment #23)
Thanks for following up.  I had stared at this for quite a while but wasn't able to make much progress.  To be clear, you're no longer able to reproduce the problem on recent stable/13?
Comment 25 Michael Meiszl 2021-08-27 13:16:07 UTC
Yeah, since yesterday (or maybe already some patches earlier; I did not test again after each update, call me lazy :-) ) I can start ipfw from rc.conf again without any crash. The machine has been up for 24hrs now; I rebooted it once more yesterday to prove that the bug is gone.
Still a bit curious what triggered it, but then, I'm pragmatic, as long as it works, I don't care.

I guess I can now dare to update the external production machine too (if they would crash I have no means to get to the console and start into single user mode, so I am very very careful when it comes to kernel crashes and instant reboots)

MAM
Comment 26 Mark Johnston freebsd_committer freebsd_triage 2021-08-27 13:19:46 UTC
(In reply to Michael Meiszl from comment #25)
> I guess I can now dare to update the external production machine too (if they would crash I have no means to get to the console and start into single user mode, so I am very very careful when it comes to kernel crashes and instant reboots)

If you're building kernels from source you might try installing like this:

# make installkernel INSTKERNNAME=kernel.test
# nextboot -k kernel.test
# shutdown -r now

so if the new kernel (installed to /boot/kernel.test) panics and reboots, the system will boot back into the old kernel without intervention in single-user mode.  Be sure to try this before updating userland.
Comment 27 Michael Meiszl 2021-08-27 14:09:32 UTC
Tnx for your suggestions, but they are not usable for me:

a) I usually don't build my own kernel (unless I am told to, to activate debugging for catching a bug)

b) this happened to me on the update from 12 to 13. So userland had to be updated too before it occurred, and then there was no way back other than restoring the complete backup

c) the bug only showed up when going multiuser.

d) I need the network to be up and running to access the machine. The console is several hundred miles away from me and my arms are not that long :-)))

(But the hint is good if I ever plan to run my own kernel, although I have no idea why I should use anything other than GENERIC. Since additional drivers can be added as loadable modules, there is no need for a custom kernel, at least for me.)