Bug 240609 - iflib: Panic with INVARIANTS: sleeping in an epoch section (12.1-pre-QA) (vlan + lagg involved)
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Gleb Smirnoff
Keywords: crash, needs-qa
Duplicates: 232362
Depends on:
Blocks: 240700
Reported: 2019-09-16 08:24 UTC by Harald Schmalzbauer
Modified: 2020-09-11 17:01 UTC (History)
CC List: 10 users

See Also:
koobs: mfc-stable12?
koobs: mfc-stable11?

Possible workaround (587 bytes, patch)
2019-10-09 17:48 UTC, Hans Petter Selasky
Suggested patch for lagg (826 bytes, patch)
2020-09-11 17:01 UTC, Gleb Smirnoff

Description Harald Schmalzbauer 2019-09-16 08:24:13 UTC

Here's an iflib-related panic I get on my real-world cold-standby setup with 12.1-PRERELEASE and a debug kernel.
It happens when creating a vlan(4) child with an if_igb(4) pair as the lagg(4) parent:

<6>vlan0: link state changed to UP
panic: sleeping in an epoch section
cpuid = 1
time = 1568620268
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000058b380
vpanic() at vpanic+0x19d/frame 0xfffffe000058b3d0
panic() at panic+0x43/frame 0xfffffe000058b430
_sleep() at _sleep+0x466/frame 0xfffffe000058b4d0
pause_sbt() at pause_sbt+0x10f/frame 0xfffffe000058b510
e1000_reset_hw_82580() at e1000_reset_hw_82580+0x1cc/frame 0xfffffe000058b550
em_if_stop() at em_if_stop+0x1b/frame 0xfffffe000058b570
iflib_stop() at iflib_stop+0xc3/frame 0xfffffe000058b5c0
iflib_vlan_register() at iflib_vlan_register+0xad/frame 0xfffffe000058b600
lagg_register_vlan() at lagg_register_vlan+0xda/frame 0xfffffe000058b660
vlan_config() at vlan_config+0x50b/frame 0xfffffe000058b6c0
vlan_clone_create() at vlan_clone_create+0x29b/frame 0xfffffe000058b730
if_clone_createif() at if_clone_createif+0x4a/frame 0xfffffe000058b780
ifioctl() at ifioctl+0x6fe/frame 0xfffffe000058b850
kern_ioctl() at kern_ioctl+0x2b0/frame 0xfffffe000058b8b0
sys_ioctl() at sys_ioctl+0x15d/frame 0xfffffe000058b980
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe000058bab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe000058bab0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80047439a, rsp = 0x7fffffffe348, rbp = 0x7fffffffe350 ---
KDB: enter: panic

#9  0xffffffff805cf4ca in vpanic (fmt=<value optimized out>, ap=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/kern_shutdown.c:866
#10 0xffffffff805cf273 in panic (fmt=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/kern_shutdown.c:804
#11 0xffffffff805da0b6 in _sleep (ident=0xffffffff80ef0941, lock=0x0, priority=0, wmesg=<value optimized out>, sbt=42949672, 
    pr=0, flags=256) at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/kern_synch.c:150
#12 0xffffffff805da4af in pause_sbt (wmesg=<value optimized out>, sbt=42949672, pr=<value optimized out>, 
    flags=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/kern_synch.c:332
#13 0xffffffff81b3e7cc in e1000_reset_hw_82580 (hw=0xfffffe004b7eb008) at RELENG_12/src/sys/dev/e1000/e1000_osdep.h:97
#14 0xffffffff81b0c86b in em_if_stop (ctx=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/dev/e1000/if_em.c:1867
#15 0xffffffff806f74f3 in iflib_stop (ctx=0xfffff8000291f800) at ifdi_if.h:268
#16 0xffffffff80704e3d in iflib_vlan_register (arg=0xfffff8000291f800, ifp=0xfffff8000295a800, vtag=232)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/net/iflib.c:3883
#17 0xffffffff806eb94a in lagg_register_vlan (arg=<value optimized out>, ifp=<value optimized out>, vtag=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/net/if_lagg.c:452
#18 0xffffffff806f68fb in vlan_config (ifv=0xfffff80002555c00, p=0xfffff80002895000, vid=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/net/if_vlan.c:1431
#19 0xffffffff806f596b in vlan_clone_create (ifc=0xfffff800024dec00, name=0xfffffe000058b8d0 "vlan0", len=18446735277655190528, 
    params=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_12/src/sys/net/if_vlan.c:1066
#20 0xffffffff806e1c3a in if_clone_createif (ifc=0xfffff800024dec00, name=0xfffffe000058b8d0 "vlan0", len=16, 
    params=0x22db40 <Address 0x22db40 out of bounds>) at /usr/local/share/deploy-tools/RELENG_12/src/sys/net/if_clone.c:229
#21 0xffffffff806d90be in ifioctl (so=<value optimized out>, cmd=3223349628, data=0xfffffe000058b8d0 "vlan0", 
    td=0xfffff800037a55e0) at /usr/local/share/deploy-tools/RELENG_12/src/sys/net/if.c:3097
#22 0xffffffff8063d870 in kern_ioctl (td=0xfffff800037a55e0, fd=<value optimized out>, com=3223349628, 
    data=<value optimized out>) at RELENG_12/src/sys/sys/file.h:337
#23 0xffffffff8063d54d in sys_ioctl (td=0xfffff800037a55e0, uap=0xfffff800037a59a0)
    at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/sys_generic.c:712
#24 0xffffffff8093abe6 in amd64_syscall (td=0xfffff800037a55e0, traced=0)
    at RELENG_12/src/sys/amd64/amd64/../../kern/subr_syscall.c:135
#25 0xffffffff80912550 in fast_syscall_common () at /usr/local/share/deploy-tools/RELENG_12/src/sys/amd64/amd64/exception.S:581
#26 0x000000080047439a in ?? ()
Previous frame inner to this frame (corrupt stack?)

It's almost identical to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232362
Only the following line/function isn't listed with the new hardware (then Kawela, 82576; now StonyLake, i350):
#13 0xffffffff80569511 in e1000_disable_pcie_master_generic (hw=0xfffffe0000790008)

I'll mark the old one as a duplicate.

Comment 1 Harald Schmalzbauer 2019-09-16 08:25:16 UTC
*** Bug 232362 has been marked as a duplicate of this bug. ***
Comment 2 Eric Joyner freebsd_committer 2019-10-09 17:44:40 UTC
I'm not an expert on locking issues, but it appears that lagg_register_vlan() enters an epoch section (via the now confusingly named LAGG_RLOCK() macro) and runs iflib_vlan_register() inside it. The msec_delay()->safe_pause_ms()->pause() chain in the em driver then sleeps within that section, which causes the "sleeping in an epoch section" panic.

A quick fix would be to make em *not sleep* during that e1000_reset_hw_82580() function, but that doesn't seem ideal; why should it not be allowed to sleep?
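
For readers unfamiliar with the rule being violated, here is a minimal user-space sketch of the call pattern described above. It is not kernel code: every `model_*` name is hypothetical, the epoch is reduced to a nesting counter, and the panic is reduced to a `false` return plus a message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space stand-ins; none of these are kernel symbols. */
static int model_epoch_depth;	/* > 0 while inside an epoch section */

static void model_epoch_enter(void)
{
	model_epoch_depth++;
}

static void model_epoch_exit(void)
{
	assert(model_epoch_depth > 0);
	model_epoch_depth--;
}

/*
 * Models a sleeping primitive such as pause(). Returns false where an
 * INVARIANTS kernel would panic, i.e. when called inside an epoch section.
 */
static bool model_pause_ms(int ms)
{
	(void)ms;
	if (model_epoch_depth > 0) {
		fprintf(stderr, "model: sleeping in an epoch section\n");
		return false;
	}
	return true;
}

/* Models the driver stop/reset path, which delays by sleeping. */
static bool model_driver_stop(void)
{
	return model_pause_ms(1);
}

/* Models lagg VLAN registration: the port callback runs under the epoch. */
static bool model_lagg_register_vlan(void)
{
	bool ok;

	model_epoch_enter();
	ok = model_driver_stop();
	model_epoch_exit();
	return ok;
}
```

In this model, calling model_driver_stop() directly succeeds, while the same call reached through model_lagg_register_vlan() fails, which mirrors the backtrace in the description.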
Comment 3 Hans Petter Selasky freebsd_committer 2019-10-09 17:48:03 UTC
Created attachment 208207 [details]
Possible workaround

It might be possible to simply test for epoch.

In my experience, if_ioctl()s that sleep usually cause problems.
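
The attachment itself is not reproduced on this page, so the following is only a guess at the general shape of a "test for epoch" workaround, written as a user-space model. All names are hypothetical; the idea shown is simply that the delay helper picks a non-sleeping busy-wait when it detects an epoch section and a sleeping pause otherwise.

```c
#include <assert.h>
#include <stdbool.h>

/* Which kind of delay the helper decided to use. */
enum model_delay_kind {
	MODEL_DELAY_SLEEP,	/* may yield the CPU, like pause() */
	MODEL_DELAY_SPIN	/* busy-wait, never sleeps */
};

/* Hypothetical flag standing in for "am I inside an epoch section?". */
static bool model_in_epoch;

/*
 * Models a driver delay helper: spin when sleeping is forbidden,
 * sleep otherwise. Only the decision is modelled, not the delay itself.
 */
static enum model_delay_kind model_safe_delay_ms(int ms)
{
	assert(ms >= 0);
	if (model_in_epoch)
		return MODEL_DELAY_SPIN;
	return MODEL_DELAY_SLEEP;
}
```

The trade-off of this approach is that the CPU spins for the whole delay whenever the caller holds the epoch, which is why it is described as a workaround rather than a fix.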
Comment 4 Harald Schmalzbauer 2019-10-12 14:06:01 UTC
(In reply to Hans Petter Selasky from comment #3)

Happy to confirm that your patch prevents the machine from panicking during vlan(4) child setup.
I haven't done further tests, but from my very limited understanding of the change, side effects seem very unlikely.


Comment 5 ml 2020-04-09 06:36:50 UTC
Just a "me too" here.

I'm running 12.1p3/amd64 (with vlan + lagg + em, and even bridge thrown in) and I'm experiencing deadlocks (VFS-related, I suspect).
So I turned on INVARIANTS, WITNESS, etc., but could not boot without this patch.
Comment 6 Gleb Smirnoff freebsd_committer 2020-09-11 17:01:44 UTC
Created attachment 217889 [details]
Suggested patch for lagg

Here is a patch (against head) that stops lagg reconfiguration from using the epoch and uses a sleepable lock instead.
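
The patch is attached rather than quoted, so the sketch below only illustrates the general idea in user space, under stated assumptions: a pthread mutex stands in for a sleepable kernel lock, the epoch is a counter that the configuration path never touches, and all `model_*` names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for a sleepable lock that serialises reconfiguration. */
static pthread_mutex_t model_cfg_lock = PTHREAD_MUTEX_INITIALIZER;
static int model_epoch_depth;		/* stays 0 on the configuration path */
static int model_vlans_registered;

/* Models a port driver callback that needs to sleep while resetting. */
static bool model_port_register_vlan(void)
{
	if (model_epoch_depth > 0)
		return false;	/* would panic under INVARIANTS */
	model_vlans_registered++;
	return true;
}

/* Models VLAN registration done under the sleepable lock, not the epoch. */
static bool model_lagg_register_vlan_sleepable(void)
{
	bool ok;
	int error;

	error = pthread_mutex_lock(&model_cfg_lock);
	assert(error == 0);
	ok = model_port_register_vlan();
	error = pthread_mutex_unlock(&model_cfg_lock);
	assert(error == 0);
	(void)error;
	return ok;
}
```

Because the callback never runs with the epoch counter raised, the driver in this model is free to sleep, and the epoch can remain reserved for paths that must not sleep.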