Bug 240609

Summary: iflib: Panic with INVARIANTS: sleeping in an epoch section (12.1-pre-QA) (vlan + lagg involved)
Product: Base System
Component: kern
Version: 12.0-STABLE
Hardware: Any
OS: Any
Status: Open
Severity: Affects Some People
Priority: ---
Keywords: crash, needs-qa
Reporter: Harald Schmalzbauer <bugzilla.freebsd>
Assignee: Gleb Smirnoff <glebius>
CC: afedorov, chris, emaste, erj, ferrao, garga, glebius, hselasky, marius, ml, net, pi, sergey.dyatko
Flags: koobs: maintainer-feedback? (glebius); koobs: mfc-stable12?; koobs: mfc-stable11?
Bug Depends on:
Bug Blocks: 240700

Attachments:
Description                Flags
Possible workaround        none
Suggested patch for lagg   none

Description Harald Schmalzbauer 2019-09-16 08:24:13 UTC
Hello,

here's an iflib-related panic I get on my real-world cold-standby setup with 12.1-prerelease and a debug kernel.
It happens when creating a vlan(4) child with an if_igb(4) pair as the lagg(4) parent:

<6>vlan0: link state changed to UP
panic: sleeping in an epoch section
cpuid = 1
time = 1568620268
KDB: stack backtrace:                                                                                                                  
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000058b380                                                         
vpanic() at vpanic+0x19d/frame 0xfffffe000058b3d0
panic() at panic+0x43/frame 0xfffffe000058b430
_sleep() at _sleep+0x466/frame 0xfffffe000058b4d0
pause_sbt() at pause_sbt+0x10f/frame 0xfffffe000058b510
e1000_reset_hw_82580() at e1000_reset_hw_82580+0x1cc/frame 0xfffffe000058b550
em_if_stop() at em_if_stop+0x1b/frame 0xfffffe000058b570
iflib_stop() at iflib_stop+0xc3/frame 0xfffffe000058b5c0
iflib_vlan_register() at iflib_vlan_register+0xad/frame 0xfffffe000058b600
lagg_register_vlan() at lagg_register_vlan+0xda/frame 0xfffffe000058b660
vlan_config() at vlan_config+0x50b/frame 0xfffffe000058b6c0
vlan_clone_create() at vlan_clone_create+0x29b/frame 0xfffffe000058b730
if_clone_createif() at if_clone_createif+0x4a/frame 0xfffffe000058b780
ifioctl() at ifioctl+0x6fe/frame 0xfffffe000058b850
kern_ioctl() at kern_ioctl+0x2b0/frame 0xfffffe000058b8b0
sys_ioctl() at sys_ioctl+0x15d/frame 0xfffffe000058b980
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe000058bab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe000058bab0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80047439a, rsp = 0x7fffffffe348, rbp = 0x7fffffffe350 ---
KDB: enter: panic

#9  0xffffffff805cf4ca in vpanic (fmt=<value optimized out>, ap=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/kern_shutdown.c:866
#10 0xffffffff805cf273 in panic (fmt=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/kern_shutdown.c:804
#11 0xffffffff805da0b6 in _sleep (ident=0xffffffff80ef0941, lock=0x0, priority=0, wmesg=<value optimized out>, sbt=42949672, 
    pr=0, flags=256) at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/kern_synch.c:150
#12 0xffffffff805da4af in pause_sbt (wmesg=<value optimized out>, sbt=42949672, pr=<value optimized out>, 
    flags=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/kern_synch.c:332
#13 0xffffffff81b3e7cc in e1000_reset_hw_82580 (hw=0xfffffe004b7eb008) at RELENG_12/src/sys/dev/e1000/e1000_osdep.h:97
#14 0xffffffff81b0c86b in em_if_stop (ctx=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/dev/e1000/if_em.c:1867
#15 0xffffffff806f74f3 in iflib_stop (ctx=0xfffff8000291f800) at ifdi_if.h:268
#16 0xffffffff80704e3d in iflib_vlan_register (arg=0xfffff8000291f800, ifp=0xfffff8000295a800, vtag=232)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/net/iflib.c:3883
#17 0xffffffff806eb94a in lagg_register_vlan (arg=<value optimized out>, ifp=<value optimized out>, vtag=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/net/if_lagg.c:452
#18 0xffffffff806f68fb in vlan_config (ifv=0xfffff80002555c00, p=0xfffff80002895000, vid=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/net/if_vlan.c:1431
#19 0xffffffff806f596b in vlan_clone_create (ifc=0xfffff800024dec00, name=0xfffffe000058b8d0 "vlan0", len=18446735277655190528, 
    params=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_12/src/sys/net/if_vlan.c:1066
#20 0xffffffff806e1c3a in if_clone_createif (ifc=0xfffff800024dec00, name=0xfffffe000058b8d0 "vlan0", len=16, 
    params=0x22db40 <Address 0x22db40 out of bounds>) at /usr/local/share/deploy-tools/RELENG_12/src/sys/net/if_clone.c:229
#21 0xffffffff806d90be in ifioctl (so=<value optimized out>, cmd=3223349628, data=0xfffffe000058b8d0 "vlan0", 
    td=0xfffff800037a55e0) at /usr/local/share/deploy-tools/RELENG_12/src/sys/net/if.c:3097
#22 0xffffffff8063d870 in kern_ioctl (td=0xfffff800037a55e0, fd=<value optimized out>, com=3223349628, 
    data=<value optimized out>) at RELENG_12/src/sys/sys/file.h:337
#23 0xffffffff8063d54d in sys_ioctl (td=0xfffff800037a55e0, uap=0xfffff800037a59a0)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/sys_generic.c:712
#24 0xffffffff8093abe6 in amd64_syscall (td=0xfffff800037a55e0, traced=0)
    at RELENG_12/src/sys/amd64/amd64/../../kern/subr_syscall.c:135
#25 0xffffffff80912550 in fast_syscall_common () at /usr/local/share/deploy-tools/RELENG_12/src/sys/amd64/amd64/exception.S:581
#26 0x000000080047439a in ?? ()
Previous frame inner to this frame (corrupt stack?)

It's almost identical to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232362
Only this line/function isn't listed with the new hardware (then Kawela, 82576 -> now StonyLake, i350):
#13 0xffffffff80569511 in e1000_disable_pcie_master_generic (hw=0xfffffe0000790008)

I'll mark the old one as duplicate.

Thanks,
-harry
Comment 1 Harald Schmalzbauer 2019-09-16 08:25:16 UTC
*** Bug 232362 has been marked as a duplicate of this bug. ***
Comment 2 Eric Joyner freebsd_committer 2019-10-09 17:44:40 UTC
I'm not an expert on locking issues, but it appears that lagg_register_vlan() enters an epoch (via the now-confusingly named LAGG_RLOCK() macro) that iflib_vlan_register() is run inside of, and the msec_delay()->safe_pause_ms()->pause() in the em driver is causing the "sleeping in an epoch section" panic.

A quick fix would be to make em *not sleep* during that e1000_reset_hw_82580() function, but that doesn't seem ideal; why should it not be allowed to sleep?
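
To make the call chain described above concrete, here is a minimal user-space sketch of it. It is only a model: every model_* name is hypothetical and is not a FreeBSD kernel API, the epoch is reduced to a counter, and where an INVARIANTS kernel panics the model merely records a violation.

```c
/* User-space model only; none of these names exist in the FreeBSD kernel. */
#include <assert.h>
#include <stdbool.h>

int model_epoch_depth;  /* > 0 while inside a modelled epoch section */
int model_violations;   /* sleep attempts made inside an epoch section */

void model_epoch_enter(void) { model_epoch_depth++; }

void model_epoch_exit(void)
{
    assert(model_epoch_depth > 0);
    model_epoch_depth--;
}

/* Stand-in for pause(): refuses to sleep while an epoch section is open. */
bool model_pause(void)
{
    if (model_epoch_depth > 0) {
        model_violations++;   /* an INVARIANTS kernel would panic here */
        return false;
    }
    return true;
}

/* Stand-in for the driver stop path that ends in msec_delay() -> pause(). */
bool model_driver_stop(void) { return model_pause(); }

/* Stand-in for the VLAN registration path that calls the driver under epoch. */
bool model_register_vlan_under_epoch(void)
{
    bool ok;

    model_epoch_enter();
    ok = model_driver_stop();
    model_epoch_exit();
    return ok;
}
```

Calling model_driver_stop() on its own succeeds, while model_register_vlan_under_epoch() fails and bumps model_violations. That matches the observation in this comment that the driver's sleep is only a problem because of the context it is called from.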
Comment 3 Hans Petter Selasky freebsd_committer 2019-10-09 17:48:03 UTC
Created attachment 208207 [details]
Possible workaround

It might be possible to simply test for epoch.

In my experience, if_ioctl()s that sleep usually cause problems.
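
The attachment itself is not reproduced in this thread, so the following is only a guess at the general shape of "test for epoch", written as a user-space sketch. All sketch_* names are hypothetical. The idea is that a delay routine checks whether it is running inside an epoch section and, if so, spins instead of sleeping.

```c
/* User-space illustration only; not the contents of the attached patch. */
#include <assert.h>
#include <stdbool.h>

enum sketch_wait { SKETCH_WAIT_SLEEP, SKETCH_WAIT_SPIN };

int sketch_epoch_depth;  /* > 0 while inside a modelled epoch section */

/* Hypothetical predicate: are we currently inside an epoch section? */
bool sketch_in_epoch(void) { return sketch_epoch_depth > 0; }

/* Pick a waiting strategy that is legal in the current context. */
enum sketch_wait sketch_choose_wait(void)
{
    /* Sleeping is forbidden inside an epoch section, so busy-wait there. */
    return sketch_in_epoch() ? SKETCH_WAIT_SPIN : SKETCH_WAIT_SLEEP;
}
```

The trade-off of such an approach is that the caller burns CPU for the duration of the delay whenever it is inside an epoch section, which is why it reads as a workaround rather than a fix.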
Comment 4 Harald Schmalzbauer 2019-10-12 14:06:01 UTC
(In reply to Hans Petter Selasky from comment #3)

Happy to confirm that your patch prevents the machine from panicking during vlan(4) child setup.
I haven't done further tests, but from my very limited understanding of the change, side effects are very unlikely.

Thanks,

-harry
Comment 5 ml 2020-04-09 06:36:50 UTC
Just a "me too" here.

I'm running 12.1p3/amd64 (with vlan + lagg + em, and even bridge thrown in) and I'm experiencing deadlocks (VFS-related, I suspect).
So I turned on INVARIANTS, WITNESS, etc., but could not boot without this patch.
Comment 6 Gleb Smirnoff freebsd_committer 2020-09-11 17:01:44 UTC
Created attachment 217889 [details]
Suggested patch for lagg

Here is a patch (against head) that stops lagg reconfiguration from using epoch and uses a sleepable lock instead.
Comment 7 Aleksandr Fedorov freebsd_committer 2020-12-08 06:33:25 UTC
Hi, the panic is still here.

It's easily reproduced using the bhyve + e1000 device emulation.

# sh /usr/share/examples/bhyve/vmrun.sh -c 2 -m 1024M -n e1000 -t tap0 -t tap1 -d head.img freebsd-head

root@vm-13:~ # ifconfig em0 up
root@vm-13:~ # ifconfig em1 up
root@vm-13:~ # ifconfig lagg create
lagg0
root@vm-13:~ # ifconfig lagg0 laggproto lacp laggport em0 laggport em1 192.168.1.1 netmask 255.255.255.0
root@vm-13:~ # ifconfig vlan create vlan 1001 vlandev lagg0
panic: sleepq_add: td 0xfffffe00497c6e00 to sleep on wchan 0xffffffff815ac3d1 with sleeping prohibited
cpuid = 1
time = 1607417902
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0049b0a470
vpanic() at vpanic+0x181/frame 0xfffffe0049b0a4c0
panic() at panic+0x43/frame 0xfffffe0049b0a520
sleepq_add() at sleepq_add+0x359/frame 0xfffffe0049b0a570
_sleep() at _sleep+0x20c/frame 0xfffffe0049b0a620
pause_sbt() at pause_sbt+0xfe/frame 0xfffffe0049b0a650
e1000_reset_hw_82540() at e1000_reset_hw_82540+0x177/frame 0xfffffe0049b0a680
em_if_stop() at em_if_stop+0x1b/frame 0xfffffe0049b0a6a0
iflib_stop() at iflib_stop+0xbd/frame 0xfffffe0049b0a6f0
iflib_vlan_register() at iflib_vlan_register+0xe8/frame 0xfffffe0049b0a730
lagg_register_vlan() at lagg_register_vlan+0x102/frame 0xfffffe0049b0a790
vlan_config() at vlan_config+0x553/frame 0xfffffe0049b0a7f0
vlan_clone_create() at vlan_clone_create+0x2a2/frame 0xfffffe0049b0a860
if_clone_createif() at if_clone_createif+0x4a/frame 0xfffffe0049b0a8b0
ifioctl() at ifioctl+0x783/frame 0xfffffe0049b0a980
kern_ioctl() at kern_ioctl+0x289/frame 0xfffffe0049b0a9f0
sys_ioctl() at sys_ioctl+0x12a/frame 0xfffffe0049b0aac0
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe0049b0abf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0049b0abf0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80042629a, rsp = 0x7fffffffe168, rbp = 0x7fffffffe180 ---
KDB: enter: panic
[ thread pid 891 tid 100072 ]
Stopped at      kdb_enter+0x37: movq    $0,0x10aa456(%rip)
db> 

The patch from Gleb solves the issue. So, maybe commit it?
Comment 8 Sergey V. Dyatko 2020-12-08 14:43:23 UTC
With the patch provided by Gleb I no longer observe this panic.
This is 13-CURRENT, r366075.

cloned_interfaces="lagg0 vlan101"
ifconfig_lagg0="laggproto lacp laggport em0 laggport em1 212.8.x.y netmask 255.255.255.240"
ifconfig_vlan101="vlan 101 vlandev lagg0 192.168.1.29/24"
Comment 9 commit-hook freebsd_committer 2020-12-08 16:46:20 UTC
A commit references this bug:

Author: glebius
Date: Tue Dec  8 16:46:01 UTC 2020
New revision: 368448
URL: https://svnweb.freebsd.org/changeset/base/368448

Log:
  The list of ports in configuration path shall be protected by locks,
  epoch shall be used only for fast path.  Thus use LAGG_XLOCK() in
  lagg_[un]register_vlan.  This fixes sleeping in epoch panic.

  PR:		240609

Changes:
  head/sys/net/if_lagg.c
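
The rule stated in the commit log (locks for the configuration path, epoch only for the fast path) can be illustrated with a small user-space sketch. This is not the committed diff: all fix_* names are hypothetical, the exclusive lock is reduced to a flag, and the epoch to a counter.

```c
/* User-space model of the locking rule; not the actual if_lagg.c change. */
#include <assert.h>
#include <stdbool.h>

int  fix_epoch_depth;  /* > 0 inside a modelled epoch section (fast path) */
bool fix_xlock_held;   /* modelled sleepable exclusive lock (config path) */

/* Sleeping is allowed only when no epoch section is open. */
bool fix_may_sleep(void) { return fix_epoch_depth == 0; }

/* Configuration path: takes the exclusive lock and never enters epoch. */
bool fix_register_vlan(void)
{
    bool slept;

    assert(!fix_xlock_held);
    fix_xlock_held = true;
    slept = fix_may_sleep();  /* a driver reset may pause here */
    fix_xlock_held = false;
    return slept;
}

/* Fast path: traversal under epoch, where sleeping stays forbidden. */
bool fix_fast_path_may_sleep(void)
{
    bool ok;

    fix_epoch_depth++;
    ok = fix_may_sleep();
    fix_epoch_depth--;
    return ok;
}
```

In this model fix_register_vlan() is free to sleep because it holds only the sleepable lock, while fix_fast_path_may_sleep() still reports that sleeping is not allowed under epoch.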
Comment 10 ferrao 2020-12-25 10:33:03 UTC
Can someone kindly confirm whether this is the same bug? I've hit this with TrueNAS 12.0-U1 and now my system is offline because of it.

https://ibb.co/xLnnYmn

It seems that TrueNAS 12.0-U1 is built against 12.2-RELEASE-p2: FreeBSD freenas.win.versatushpc.com.br 12.2-RELEASE-p2 FreeBSD 12.2-RELEASE-p2 663e6b09467(HEAD) TRUENAS  amd64

Sorry for not posting the issue as text, but I only have this console right now and serial is, for whatever reason, not working.
Comment 11 commit-hook freebsd_committer 2021-03-09 22:39:31 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4058265d605de7e6e66d9ad5153ac496f4f3c628

commit 4058265d605de7e6e66d9ad5153ac496f4f3c628
Author:     Gleb Smirnoff <glebius@FreeBSD.org>
AuthorDate: 2020-12-08 16:46:00 +0000
Commit:     Alexander Motin <mav@FreeBSD.org>
CommitDate: 2021-03-09 22:39:06 +0000

    The list of ports in configuration path shall be protected by locks,
    epoch shall be used only for fast path.  Thus use LAGG_XLOCK() in
    lagg_[un]register_vlan.  This fixes sleeping in epoch panic.

    PR:             240609
    (cherry picked from commit e1074ed6a08033ee571b4bedb3ffe6049a4a7361)

 sys/net/if_lagg.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)