Bug 264549 - HardenedBSD: panic: e1000/bridge: "sleep on wchan ... with sleeping prohibited"
Summary: HardenedBSD: panic: e1000/bridge: "sleep on wchan ... with sleeping prohibited"
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 13.1-STABLE
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: IntelNetworking, crash
Depends on:
Blocks:
 
Reported: 2022-06-08 16:46 UTC by mfbott
Modified: 2023-11-28 10:33 UTC (History)
4 users (show)

See Also:


Attachments
patch (526 bytes, patch)
2022-06-08 16:49 UTC, mfbott
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description mfbott 2022-06-08 16:46:59 UTC
When employed as members of a bridge such as this one:


bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 58:9c:fc:10:ff:d4
        inet 192.168.219.1 netmask 0xffffff00 broadcast 192.168.219.255
        id b1:c1:11:30:35:51 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id b1:c1:11:30:35:51 priority 32768 ifcost 0 port 0
        member: igb1 flags=147<LEARNING,DISCOVER,STP,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 2 priority 128 path cost 2000000 proto rstp
                role disabled state discarding
        member: igb0 flags=147<LEARNING,DISCOVER,STP,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 1 priority 128 path cost 2000000 proto rstp
                role disabled state discarding
        nd6 options=9<PERFORMNUD,IFDISABLED>
        groups: bridge


network adapters igb0/igb1 trigger a "sleep on wchan ... with sleeping
prohibited"-KASSERT whenever I pull any LAN cables.


The stack backtrace reads:
---------------------------------------------------

Unread portion of the kernel message buffer:
[148] panic: sleepq_add: td 0xfffffe00380fd740 to sleep on wchan 0xffffffff8164ae00 with sleeping prohibited
[148] cpuid = 0
[148] time = 1654687988
[148] __HardenedBSD_version = 1300062 __FreeBSD_version = 1301503
[148] version = FreeBSD 13.1-STABLE-HBSD #1  hardened/13-stable/master-n191189-be9f2a97c09: Wed Jun  8 11:14:09 UTC 2022
[148]     (Redacted)
[148] KDB: stack backtrace:
[148] #0 0xffffffff80c7a3fb at kdb_backtrace+0x6b
[148] #1 0xffffffff80c2f964 at vpanic+0x184
[148] #2 0xffffffff80c2f733 at panic+0x43
[148] #3 0xffffffff80c8a4b8 at sleepq_add+0x358
[148] #4 0xffffffff80c3bb44 at _sleep+0x194
[148] #5 0xffffffff80c3c160 at pause_sbt+0x110
[148] #6 0xffffffff82560efc at e1000_write_phy_reg_mdic+0xec
[148] #7 0xffffffff825653b9 at e1000_read_phy_reg_gs40g+0x39
[148] #8 0xffffffff825626e0 at e1000_phy_has_link_generic+0x70
[148] #9 0xffffffff82569862 at e1000_check_for_copper_link_generic+0x22
[148] #10 0xffffffff8255884c at em_if_update_admin_status+0x18c
[148] #11 0xffffffff80d685df at iflib_media_status+0x6f
[148] #12 0xffffffff80d52624 at ifmedia_ioctl+0x174
[148] #13 0xffffffff83431189 at bstp_ifupdstatus+0x79
[148] #14 0xffffffff83431052 at bstp_linkstate+0x22
[148] #15 0xffffffff8342d494 at bridge_linkstate+0x84
[148] #16 0xffffffff80d4372c at do_link_state_change+0xcc
[148] #17 0xffffffff80c8f990 at taskqueue_run_locked+0xb0
[148] Uptime: 2m28s
[148] Dumping 1534 out of 32592 MB:..2%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:401
#2  0xffffffff80c2f574 in kern_reboot (howto=16644)
    at /usr/src/sys/kern/kern_shutdown.c:489
#3  0xffffffff80c2f9d3 in vpanic (
    fmt=0xffffffff812bb4ad "%s: td %p to sleep on wchan %p with sleeping prohibited", ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:925
#4  0xffffffff80c2f733 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:846
#5  0xffffffff80c8a4b8 in sleepq_add (
    wchan=wchan@entry=0xffffffff8164ae00 <pause_wchan>, lock=lock@entry=0x0,
    wmesg=wmesg@entry=0xffffffff8258ec2c "e1000_delay", flags=2,
    flags@entry=<unavailable>, queue=queue@entry=0)
    at /usr/src/sys/kern/subr_sleepqueue.c:328
#6  0xffffffff80c3bb44 in _sleep (ident=0xffffffff8164ae00 <pause_wchan>,
    lock=lock@entry=0x0, priority=0, wmesg=<unavailable>,
    sbt=sbt@entry=4294967, pr=0, flags=256)
    at /usr/src/sys/kern/kern_synch.c:208
#7  0xffffffff80c3c160 in pause_sbt (wmesg=<unavailable>, sbt=4294967,
    pr=<unavailable>, pr@entry=0, flags=flags@entry=256)
    at /usr/src/sys/kern/kern_synch.c:340
#8  0xffffffff82560efc in safe_pause_us (x=50)
    at /usr/src/sys/dev/e1000/e1000_osdep.h:88
#9  e1000_write_phy_reg_mdic (hw=hw@entry=0xfffffe00dfdd2000,
    offset=offset@entry=22, data=<optimized out>)
    at /usr/src/sys/dev/e1000/e1000_phy.c:384
#10 0xffffffff825653b9 in e1000_read_phy_reg_gs40g (hw=0xfffffe00dfdd2000,
    offset=<optimized out>, data=0xfffffe00351e6b66)
    at /usr/src/sys/dev/e1000/e1000_phy.c:4103
#11 0xffffffff825626e0 in e1000_phy_has_link_generic (hw=0xfffffe00dfdd2000,
    iterations=1, usec_interval=0, success=0xfffffe00351e6ba7)
    at /usr/src/sys/dev/e1000/e1000_phy.c:2376
#12 0xffffffff82569862 in e1000_check_for_copper_link_generic (
    hw=0xfffffe00dfdd2000) at /usr/src/sys/dev/e1000/e1000_mac.c:737
#13 0xffffffff8255884c in em_if_update_admin_status (ctx=0xfffff800020bd800)
    at /usr/src/sys/dev/e1000/if_em.c:1803
#14 0xffffffff80d685df in IFDI_UPDATE_ADMIN_STATUS (_ctx=0xfffff800020bd800)
    at ./ifdi_if.h:492
#15 iflib_media_status (ifp=<optimized out>, ifmr=0xfffffe00351e6c90)
    at /usr/src/sys/net/iflib.c:2557
#16 0xffffffff80d52624 in ifmedia_ioctl (ifp=<unavailable>,
    ifr=0xfffffe00351e6c90, ifm=0xfffff800020bdb18, cmd=<optimized out>)
    at /usr/src/sys/net/if_media.c:294
#17 0xffffffff83431189 in bstp_ifupdstatus (arg=0xfffff80023988018,
    pending=pending@entry=0) at /usr/src/sys/net/bridgestp.c:1821
#18 0xffffffff83431052 in bstp_linkstate (bp=bp@entry=0xfffff80023988018)
    at /usr/src/sys/net/bridgestp.c:1802
#19 0xffffffff8342d494 in bridge_linkstate (ifp=0xfffff800020d0000)
    at /usr/src/sys/net/if_bridge.c:3729
#20 0xffffffff80d4372c in do_link_state_change (arg=0xfffff800020d0000,
    pending=1) at /usr/src/sys/net/if.c:2317
#21 0xffffffff80c8f990 in taskqueue_run_locked (
    queue=queue@entry=0xfffff80001ac8b00)
    at /usr/src/sys/kern/subr_taskqueue.c:477
#22 0xffffffff80c8f89d in taskqueue_run (queue=0xfffff80001ac8b00)
    at /usr/src/sys/kern/subr_taskqueue.c:492
#23 0xffffffff80bed9ee in intr_event_execute_handlers (ie=0xfffff80001ac8a00,
    p=<optimized out>) at /usr/src/sys/kern/kern_intr.c:1167
#24 ithread_execute_handlers (ie=0xfffff80001ac8a00, p=<optimized out>)
    at /usr/src/sys/kern/kern_intr.c:1180
#25 ithread_loop (arg=<optimized out>) at /usr/src/sys/kern/kern_intr.c:1268
#26 0xffffffff80bea955 in fork_exit (
    callout=0xffffffff80bed790 <ithread_loop>, arg=0xfffff80001af10e0,
    frame=0xfffffe00351e6f40) at /usr/src/sys/kern/kern_fork.c:1109
#27 <signal handler called>
#28 mi_startup () at /usr/src/sys/kern/init_main.c:324
Backtrace stopped: Cannot access memory at address 0x2fd9008
(kgdb)

---------------------------------------------------


I've solved this issue by replacing "pause" with "DELAY" inside the e1000 driver
(see attachment). This certainly works for me, but could point to a deeper
problem within anything bridge-related. (I couldn't reproduce this panic
outside a bridge.)

There's possibly a better solution in which we allow "pause" under some
circumstances (unlike my blanket replacement), but I haven't found one.
Comment 1 mfbott 2022-06-08 16:49:58 UTC
Created attachment 234558 [details]
patch
Comment 2 Zhenlei Huang freebsd_committer freebsd_triage 2023-11-28 10:33:57 UTC
(In reply to mfbott from comment #0)
> I've solved this issue by replacing "pause" with "DELAY" inside the e1000 driver
> (see attachment).
> This certainly works for me, but could point to a deeper problem within anything
> bridge-related. (I couldn't reproduce this panic outside a bridge.)

IIUC this is an issue that combines use _sleep() with epoch_enter_preempt().

bridge_linkstate() will enter net epoch and then calls bstp_linkstate() and eventually e1000_write_phy_reg_mdic() / pause() / _sleep().

As per EPOCH(9),
        EPOCH_PREEMPT
		     The  epoch	 will  allow preemption	during sections.  Only
		     non-sleepable locks may be	acquired during	a  preemptible
		     epoch.	  The	  functions	epoch_enter_preempt(),
		     epoch_exit_preempt(), and	epoch_wait_preempt()  must  be
		     used   in	 place	of  epoch_enter(),  epoch_exit(),  and
		     epoch_wait(), respectively.

it is wrong to sleep within net epoch (allow preemption).


> There's possibly a better solution in which we allow "pause" under some
> circumstances (unlike my blanket replacement), but I haven't found one.

There's also problem report by Jean-Sébastien in https://reviews.freebsd.org/D14984 which has the same cause.

CC the author @Matt Macy