Bug 248046

Summary: if_bridge(4): Panic when creating an interface: malloc(M_WAITOK) in epoch context
Product: Base System Reporter: Ryan Moeller <freqlabs>
Component: kernAssignee: Kristof Provost <kp>
Status: Closed FIXED    
Severity: Affects Some People CC: dmilith, eugen, kevans, kp, markj, net, zeising
Priority: --- Keywords: crash, needs-patch, needs-qa
Version: 12.1-STABLEFlags: markj: mfc-stable12+
Hardware: amd64   
OS: Any   

Description Ryan Moeller freebsd_committer freebsd_triage 2020-07-17 14:37:21 UTC
This was also reported on freebsd-stable as "Kernel panic on recent 12.1 on bridge creation."

Tested on a kernel built with INVARIANTS, on
FreeBSD-12.1-STABLE-amd64-20200709-r363017-disc1.iso

Repro: ifconfig bridge create

Unread portion of the kernel message buffer:
panic: malloc(M_WAITOK) in epoch context
cpuid = 1
time = 1594992085
KDB: stack backtrace:
#0 0xffffffff80bda5b5 at kdb_backtrace+0x65
#1 0xffffffff80b934ab at vpanic+0x17b
#2 0xffffffff80b932b3 at panic+0x43
#3 0xffffffff80b6e414 at malloc_dbg+0xd4
#4 0xffffffff80b6e1c1 at malloc+0x21
#5 0xffffffff82a16b69 at bridge_ioctl_gifs+0xb9
#6 0xffffffff82a1845b at bridge_ioctl+0x32b
#7 0xffffffff80c893ea at ifioctl+0x9ea
#8 0xffffffff80bf93d5 at kern_ioctl+0x2d5
#9 0xffffffff80bf9057 at sys_ioctl+0x127
#10 0xffffffff81055874 at amd64_syscall+0x2a4
#11 0xffffffff8102c880 at fast_syscall_common+0x101
Comment 1 Kristof Provost freebsd_committer freebsd_triage 2020-07-17 16:56:46 UTC
That seems fairly straightforward.
The only odd thing is that this doesn't trigger panics on current. It looks like epoch is a bit more tolerant of sleeps there.

Still, it looks to be easy to avoid the malloc(M_WAITOK), so let's just do that everywhere.

I'm testing this patch now:

diff --git a/sys/net/if_bridge.c b/sys/net/if_bridge.c
index 19d8d8964d9..51ee9d29906 100644
--- a/sys/net/if_bridge.c
+++ b/sys/net/if_bridge.c
@@ -1393,9 +1393,9 @@ bridge_ioctl_gifs(struct bridge_softc *sc, void *arg)
                bifc->ifbic_len = buflen;
                return (0);
        }
-       BRIDGE_UNLOCK(sc);
-       outbuf = malloc(buflen, M_TEMP, M_WAITOK | M_ZERO);
-       BRIDGE_LOCK(sc);
+       outbuf = malloc(buflen, M_TEMP, M_NOWAIT | M_ZERO);
+       if (outbuf == NULL)
+               return (ENOMEM);

        count = 0;
        buf = outbuf;
@@ -1455,9 +1455,9 @@ bridge_ioctl_rts(struct bridge_softc *sc, void *arg)
                count++;
        buflen = sizeof(bareq) * count;

-       BRIDGE_UNLOCK(sc);
-       outbuf = malloc(buflen, M_TEMP, M_WAITOK | M_ZERO);
-       BRIDGE_LOCK(sc);
+       outbuf = malloc(buflen, M_TEMP, M_NOWAIT | M_ZERO);
+       if (outbuf == NULL)
+               return (ENOMEM);

        count = 0;
        buf = outbuf;
@@ -1783,9 +1783,9 @@ bridge_ioctl_gifsstp(struct bridge_softc *sc, void *arg)
                return (0);
        }

-       BRIDGE_UNLOCK(sc);
-       outbuf = malloc(buflen, M_TEMP, M_WAITOK | M_ZERO);
-       BRIDGE_LOCK(sc);
+       outbuf = malloc(buflen, M_TEMP, M_NOWAIT | M_ZERO);
+       if (outbuf == NULL)
+               return (ENOMEM);

        count = 0;
        buf = outbuf;
Comment 2 dmilith 2020-07-17 19:57:56 UTC
I did build from last stable/12 and I noticed that if_bridge module seems to be "loaded" even after I explicitly commented out "device if_bridge" in kernel configuration (no loader.conf / loader.conf.local lines with if_bridge) and can't be unloaded:

kldunload if_bridge # => kldunload: can't find file if_bridge

kldload if_bridge # => kldload: can't load if_bridge: module already loaded or in kernel

Unsure it makes a difference, but maybe it's worth mention.
Comment 3 dmilith 2020-07-17 20:31:29 UTC
(In reply to Kristof Provost from comment #1)
With your patch applied it's also KP
Comment 4 Ryan Moeller freebsd_committer freebsd_triage 2020-07-17 20:52:51 UTC
FWIW I tried also testing this in a recent TrueNAS 12 internal build (which is basically stable/12) and hit a different panic in the bridge epochification instead:

<6>bridge0: Ethernet address: 02:dd:a7:c5:f6:00
<6>bridge0: link state changed to UP
<6>em0: promiscuous mode enabled
<6>tap0: Ethernet address: 58:9c:fc:10:53:4a
<6>tap0: changing name to 'vnet0'
<6>vnet0: promiscuous mode enabled
panic: Assertion in_epoch(net_epoch_preempt) failed at /truenas-12.0-internal/freenas/_BE/os/sys/net/if_bridge.c:2411
cpuid = 3
time = 1595009276
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe003f9c3800
vpanic() at vpanic+0x17b/frame 0xfffffe003f9c3850
panic() at panic+0x43/frame 0xfffffe003f9c38b0
bridge_input() at bridge_input+0x7ba/frame 0xfffffe003f9c3910
ether_nh_input() at ether_nh_input+0x279/frame 0xfffffe003f9c3970
netisr_dispatch_src() at netisr_dispatch_src+0x9d/frame 0xfffffe003f9c39d0
ether_input() at ether_input+0x86/frame 0xfffffe003f9c3a00
iflib_rxeof() at iflib_rxeof+0xc29/frame 0xfffffe003f9c3ae0
_task_fn_rx() at _task_fn_rx+0x4b/frame 0xfffffe003f9c3b20
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x145/frame 0xfffffe003f9c3b80
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x94/frame 0xfffffe003f9c3bb0
fork_exit() at fork_exit+0x80/frame 0xfffffe003f9c3bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe003f9c3bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic


The offending line of code is the MPASS(in_epoch(net_epoch_preempt)); at the top of bridge_input().
Comment 5 commit-hook freebsd_committer freebsd_triage 2020-07-18 12:43:47 UTC
A commit references this bug:

Author: kp
Date: Sat Jul 18 12:43:11 UTC 2020
New revision: 363308
URL: https://svnweb.freebsd.org/changeset/base/363308

Log:
  bridge: Don't sleep during epoch

  While it doesn't trigger INVARIANTS or WITNESS on head it does in stable/12.
  There's also no reason for it, as we can easily report the out of memory error
  to the caller (i.e. userspace). All of these can already fail.

  PR:		248046
  MFC after:	3 days

Changes:
  head/sys/net/if_bridge.c
Comment 6 Kristof Provost freebsd_committer freebsd_triage 2020-07-18 12:49:40 UTC
(In reply to Ryan Moeller from comment #4)
Thanks. I can reproduce that. It looks like the code to widen epoch didn't make it into stable/12, so we need to enter epoch on bridge_input()/bridge_output() ourselves.

See https://reviews.freebsd.org/D25715
Comment 7 dmilith 2020-07-18 14:06:12 UTC
(In reply to Kristof Provost from comment #6)
I confirm that patch fixes KP with my build. Thanks!
Comment 8 commit-hook freebsd_committer freebsd_triage 2020-07-22 19:44:56 UTC
A commit references this bug:

Author: kp
Date: Wed Jul 22 19:43:56 UTC 2020
New revision: 363429
URL: https://svnweb.freebsd.org/changeset/base/363429

Log:
  MFC r363308:

  bridge: Don't sleep during epoch

  While it doesn't trigger INVARIANTS or WITNESS on head it does in stable/12.
  There's also no reason for it, as we can easily report the out of memory error
  to the caller (i.e. userspace). All of these can already fail.

  PR:		248046

Changes:
_U  stable/12/
  stable/12/sys/net/if_bridge.c
Comment 9 commit-hook freebsd_committer freebsd_triage 2020-07-22 20:14:05 UTC
A commit references this bug:

Author: kp
Date: Wed Jul 22 20:13:13 UTC 2020
New revision: 363430
URL: https://svnweb.freebsd.org/changeset/base/363430

Log:
  bridge: Enter epoch for bridge_input()/bridge_output()

  In stable/12 epoch is not as wide as it is in head. The network stack isn't yet
  in epoch when bridge_input()/bridge_output() get called, so rather than assert
  this we must enter it ourselves.

  This is a direct commit to stable/12.

  PR:		248046
  Differential Revision:	https://reviews.freebsd.org/D25715

Changes:
  stable/12/sys/net/if_bridge.c
Comment 10 dmilith 2020-07-24 15:47:10 UTC
Bug is still here. Not with ifconfig bridge0 create but on destroy.

Steps to reproduce:

ifconfig bridge0 create
ifconfig bridge0 addm em0 up
ifconfig bridge0 destroy

KP:

[33] KDB: stack backtrace:
[33] #0 0xffffffff80b975db at kdb_backtrace+0x6b
[33] #1 0xffffffff80b512b0 at vpanic+0x180
[33] #2 0xffffffff80b510b3 at panic+0x43
[33] #3 0xffffffff80b2dc84 at malloc_dbg+0xd4
[33] #4 0xffffffff80b2da21 at malloc+0x21
[33] #5 0xffffffff83134b79 at bridge_ioctl_gifs+0xb9
[33] #6 0xffffffff83136476 at bridge_ioctl+0x336
[33] #7 0xffffffff80c44c03 at ifioctl+0x763
[33] #8 0xffffffff80bb6b71 at kern_ioctl+0x301
[33] #9 0xffffffff80bb67c7 at sys_ioctl+0x127
[33] #10 0xffffffff810181a6 at amd64_syscall+0x2b6
[33] #11 0xffffffff80fefcd0 at fast_syscall_common+0x101
Comment 11 dmilith 2020-07-24 15:52:12 UTC
Panic String: Assertion in_epoch(net_epoch_preempt) failed at /usr/src/sys/net/if_bridge.c:3744
Comment 12 Kristof Provost freebsd_committer freebsd_triage 2020-07-24 17:27:40 UTC
(In reply to dmilith from comment #11)
Right, so the backtrace in #10 is bogus, but I can reproduce the problem at least.
Comment 13 commit-hook freebsd_committer freebsd_triage 2020-07-24 20:10:00 UTC
A commit references this bug:

Author: kp
Date: Fri Jul 24 20:09:52 UTC 2020
New revision: 363491
URL: https://svnweb.freebsd.org/changeset/base/363491

Log:
  bridge: Fix mismerges from r360345

  In r362650 we merged r360345. This required manual changes due to the
  differences in EPOCH macros between head and stable/12, and was done
  imperfectly.

  This is a direct commit to stable/12.

  PR:		248046

Changes:
  stable/12/sys/net/if_bridge.c
Comment 14 commit-hook freebsd_committer freebsd_triage 2020-07-24 20:11:04 UTC
A commit references this bug:

Author: kp
Date: Fri Jul 24 20:10:28 UTC 2020
New revision: 363492
URL: https://svnweb.freebsd.org/changeset/base/363492

Log:
  bridge: Enter epoch for bridge_transmit()

  Just like the change done for bridge_input()/bridge_output() in r363430
  we must enter epoch ourselves, because its coverage is not as wide as in
  head.

  This is a direct commit to stable/12.

  PR:		248046

Changes:
  stable/12/sys/net/if_bridge.c
Comment 15 dmilith 2020-07-25 17:17:20 UTC
I built last stable/12 again, and issue is there after both cases:

ifconfig bridge0 create
ifconfig bridge0 addm em0 up
ifconfig bridge0 destroy

but I've found it also with:

ifconfig bridge0 create
ifconfig vlan0 create
ifconfig bridge0 addm em0 up
ifconfig bridge0 addm vlan0 up
Comment 16 Kristof Provost freebsd_committer freebsd_triage 2020-07-26 12:19:17 UTC
(In reply to dmilith from comment #15)
I cannot reproduce that with either scenario.

Please attach full dmesg output, any content of /etc/src.conf or make.conf, any local patches, content of /etc/sysctl.conf and the exact revision you're building from.
Comment 17 dmilith 2020-07-26 15:36:04 UTC
(In reply to Kristof Provost from comment #16)

I did rebuild again, full command that caused KP:

ifconfig bridge0 create; ifconfig tap0 create; ifconfig tap1 create; ifconfig bridge0 addm tap0 up; ifconfig bridge0 addm tap1 up; ifconfig bridge0 addm em0 up

NOTE: em0 is my main network card, virtualized under vmware fusion 7.1.3 pro

It's production build, so dmesg log is limited:

[44] KDB: stack backtrace:
[44] #0 0xffffffff8098a9d5 at kdb_backtrace+0x65
[44] #1 0xffffffff80947790 at vpanic+0x180
[44] #2 0xffffffff809475a3 at panic+0x43
[44] #3 0xffffffff80951fdd at _sleep+0x3dd
[44] #4 0xffffffff80952331 at pause_sbt+0xf1
[44] #5 0xffffffff804fa177 at e1000_reset_hw_82540+0x177
[44] #6 0xffffffff804f114b at em_if_stop+0x1b
[44] #7 0xffffffff80a43a71 at iflib_stop+0xc1
[44] #8 0xffffffff80a50d6c at iflib_if_ioctl+0x39c
[44] #9 0xffffffff82f2c5c5 at bridge_mutecaps+0x145
[44] #10 0xffffffff82f28467 at bridge_ioctl_add+0x467
[44] #11 0xffffffff82f2a37b at bridge_ioctl+0x32b
[44] #12 0xffffffff80a35ce1 at ifioctl+0x701
[44] #13 0xffffffff809a9895 at kern_ioctl+0x2d5
[44] #14 0xffffffff809a9517 at sys_ioctl+0x127
[44] #15 0xffffffff80d71b84 at amd64_syscall+0x2a4
[44] #16 0xffffffff80d4b1ae at fast_syscall_common+0xf8
[44] Uptime: 44s
Comment 18 dmilith 2020-07-26 15:40:55 UTC
(In reply to Kristof Provost from comment #16)
revision - last available from 16:30 CEST,
branch - stable/12
src.conf - https://github.com/VerKnowSys/svdOS/blob/master/etc/src.conf
make.conf - https://github.com/VerKnowSys/svdOS/blob/master/etc/make.conf
no custom patches
Comment 19 Kristof Provost freebsd_committer freebsd_triage 2020-07-26 15:47:44 UTC
(In reply to dmilith from comment #17)
**FULL** dmesg, please. You're cutting off part of the panic message, making this harder than it needs to be.

There may be more things in there that are useful. Please, pretty please stop trying to pre-filter this information. It doesn't help.

Also, an SVN revision number (or git hash, whatever), not a timestamp please.
Comment 20 dmilith 2020-07-26 16:04:17 UTC
(In reply to Kristof Provost from comment #19)

https://gist.github.com/dmilith/f6454244f273accf6905f6cba812b3b5
Comment 21 Kristof Provost freebsd_committer freebsd_triage 2020-07-26 16:39:56 UTC
(In reply to dmilith from comment #20)
That's a different operating system...

Don't try to hide that during a bug submission please. It'll always come out in the end, and you're just wasting both of our time.

This almost certainly also affects base FreeBSD, but *DO NOT HIDE RELEVANT INFORMATION IN BUG REPORTS*. You're just making the people who want to fix these bugs angry, and that's not the right state of mind to fix bug. Or be helpful.

The issue here is that the bridge code tries to set the physical interface in promiscuous mode, which sleeps in the em code. The stable/12 epoch code doesn't allow sleeps during epoch sections (head does), which triggers this panic. (Which we wouldn't see on a default FreeBSD stable/12, because that requires INVARIANTS iirc....)

I'll see if that can be fixed, but I may end up just reverting the epoch code in stable/12.
Comment 22 dmilith 2020-07-26 16:49:28 UTC
(In reply to Kristof Provost from comment #21)
Not avoiding or hiding anything. It's exactly same operating system, synced every 6 hours… (not to mention HBSD patches don't touch anything related to if_bridge).

Also I have both INVARIANTS and INVARIANT_SUPPORT options enabled in my kernel configuration.
Comment 23 Eugene Grosbein freebsd_committer freebsd_triage 2020-07-27 08:41:22 UTC
(In reply to dmilith from comment #22)

HBSD is not same as FreeBSD. HBSD has significant differences.
Comment 24 dmilith 2020-07-27 14:38:12 UTC
After Kristof's revert, issue is fixed and this thread can be closed.