Bug 230510 - iflib/vlan panic: sleeping thread
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-net (Nobody)
URL: https://reviews.freebsd.org/D16808
Keywords: crash, needs-qa
Duplicates: 230655
Depends on:
Blocks:
 
Reported: 2018-08-10 19:04 UTC by Harald Schmalzbauer
Modified: 2018-09-24 19:31 UTC (History)
CC List: 3 users

See Also:
Flags: koobs: mfc-stable11?


Description Harald Schmalzbauer 2018-08-10 19:04:24 UTC
If I use rc.network(8) to create a vlan(4) child of an iflib em(4) parent device (em0), I get a very large number of LORs (most likely they are all LORs; they scroll by too fast to read).  I estimate 500-2000 LORs before the panic.

Here's the backtrace:
#1  0xffffffff803ecb2b in db_dump (dummy=<value optimized out>, dummy2=<value optimized out>, dummy3=<value optimized out>, 
    dummy4=<value optimized out>) at /usr/local/share/deploy-tools/HEAD/src/sys/ddb/db_command.c:574
#2  0xffffffff803ec8f9 in db_command (cmd_table=<value optimized out>)
    at /usr/local/share/deploy-tools/HEAD/src/sys/ddb/db_command.c:481
#3  0xffffffff803ec674 in db_command_loop () at /usr/local/share/deploy-tools/HEAD/src/sys/ddb/db_command.c:534
#4  0xffffffff803ef8ff in db_trap (type=<value optimized out>, code=<value optimized out>)
    at /usr/local/share/deploy-tools/HEAD/src/sys/ddb/db_main.c:252
#5  0xffffffff80834923 in kdb_trap (type=3, code=0, tf=<value optimized out>)
    at /usr/local/share/deploy-tools/HEAD/src/sys/kern/subr_kdb.c:693
#6  0xffffffff80b4a3ef in trap (frame=0xfffffe00751e8560) at /usr/local/share/deploy-tools/HEAD/src/sys/amd64/amd64/trap.c:605
#7  0xffffffff80b25d95 in calltrap () at /usr/local/share/deploy-tools/HEAD/src/sys/amd64/amd64/exception.S:232
#8  0xffffffff80833ffb in kdb_enter (why=0xffffffff80ca81cc "panic", msg=<value optimized out>) at cpufunc.h:65
#9  0xffffffff807e9c00 in vpanic (fmt=<value optimized out>, ap=0xfffffe00751e86d0)
    at /usr/local/share/deploy-tools/HEAD/src/sys/kern/kern_shutdown.c:852
#10 0xffffffff807e9c93 in panic (fmt=<value optimized out>)
    at /usr/local/share/deploy-tools/HEAD/src/sys/kern/kern_shutdown.c:790
#11 0xffffffff8084bab5 in propagate_priority (td=<value optimized out>)
    at /usr/local/share/deploy-tools/HEAD/src/sys/kern/subr_turnstile.c:228
#12 0xffffffff8084c56d in turnstile_wait (ts=0xfffff80003089e40, owner=0xfffff800035da580, queue=<value optimized out>)
    at /usr/local/share/deploy-tools/HEAD/src/sys/kern/subr_turnstile.c:783
#13 0xffffffff807c7cf1 in __mtx_lock_sleep (c=0xfffff80003636ee0, v=<value optimized out>, opts=<value optimized out>, 
    file=<value optimized out>, line=<value optimized out>) at /usr/local/share/deploy-tools/HEAD/src/sys/kern/kern_mutex.c:639
#14 0xffffffff807c7a79 in __mtx_lock_flags (c=0xfffff80003636ee0, opts=<value optimized out>, file=<value optimized out>, 
    line=<value optimized out>) at /usr/local/share/deploy-tools/HEAD/src/sys/kern/kern_mutex.c:255
#15 0xffffffff807e3725 in _rm_wlock (rm=0xfffff80003636e88) at /usr/local/share/deploy-tools/HEAD/src/sys/kern/kern_rmlock.c:540
#16 0xffffffff807e3a94 in _rm_wlock_debug (rm=0xfffff80003636e88, 
    file=0xffffffff80cb26fe "/usr/local/share/deploy-tools/HEAD/src/sys/net/if_vlan.c", line=1639)
    at /usr/local/share/deploy-tools/HEAD/src/sys/kern/kern_rmlock.c:605
#17 0xffffffff80904f9b in vlan_link_state (ifp=0xfffff80003ba3000)
    at /usr/local/share/deploy-tools/HEAD/src/sys/net/if_vlan.c:1639
#18 0xffffffff808efd90 in do_link_state_change (arg=0xfffff80003ba3000, pending=1)
    at /usr/local/share/deploy-tools/HEAD/src/sys/net/if.c:2332
#19 0xffffffff808484bc in taskqueue_run_locked (queue=0xfffff800030b3500)
    at /usr/local/share/deploy-tools/HEAD/src/sys/kern/subr_taskqueue.c:465
#20 0xffffffff8084832a in taskqueue_run (queue=0xfffff800030b3500)
    at /usr/local/share/deploy-tools/HEAD/src/sys/kern/subr_taskqueue.c:484
#21 0xffffffff807ab180 in ithread_loop (arg=<value optimized out>)
    at /usr/local/share/deploy-tools/HEAD/src/sys/kern/kern_intr.c:1043
#22 0xffffffff807a8004 in fork_exit (callout=0xffffffff807ab040 <ithread_loop>, arg=0xfffff8000359e080, frame=0xfffffe00751e8ac0)
    at /usr/local/share/deploy-tools/HEAD/src/sys/kern/kern_fork.c:1057
#23 0xffffffff80b26d6e in fork_trampoline () at /usr/local/share/deploy-tools/HEAD/src/sys/amd64/amd64/exception.S:990
#24 0x0000000000000000 in ?? ()

Unfortunately, I don't have the source easily accessible on that system, nor an easy way to capture console output.
Please tell me if additional information is needed to analyze the problem, and I'll provide the missing parts.

Thanks,
-harry

P.S.: If I manually create the vlan(4) child from a multi-user shell, there are LORs but _no_ panic. The vlan(4) device also works afterwards.
Comment 1 Harald Schmalzbauer 2018-08-10 19:08:15 UTC
P.P.S.: The backtrace is from today's sources, but I first saw the panic much earlier, maybe monthly since April this year.  I hadn't found time to report it earlier, sorry.

-harry
Comment 2 Kevin Bowling freebsd_committer freebsd_triage 2018-08-20 01:23:36 UTC
https://reviews.freebsd.org/D16808
Comment 3 Harald Schmalzbauer 2018-08-20 17:44:35 UTC
Kudos!

No more LORs, and the described problem is solved with D16808 applied to r338093.
However, vlan(4) doesn't work as expected:
I have to reduce the MTU to 1468 on the vlan(4) device to get frames passed out.
I haven't checked much further, since there were some offloading changes recently and I'm not sure whether if_vlan(4) is known to be under rework or broken.

I have if_em(4) (I217-V) as the parent device.
So far I haven't disabled any offloading features, so the interfaces involved look like this:

em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
	options=81249b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRO,WOL_MAGIC,VLAN_HWFILTER>
	ether 56:be:f7:0b:d7:4e
	hwaddr 
	inet 192.0.2.1 netmask 0xffffff00 broadcast 192.0.2.255 
	inet6 2001:db8:1::3:1 prefixlen 64 
	inet6 fe80::54be:f7ff:fe0b:d74e%em0 prefixlen 64 scopeid 0x1 
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

vlegn: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1468
	options=403<RXCSUM,TXCSUM,LRO>
	ether 56:be:f7:0b:d7:4e
	inet 169.254.0.1 netmask 0xffffff00 broadcast 169.254.0.255 
	inet6 2001:db8:2::3:2 prefixlen 64 
	inet6 fe80::54be:f7ff:fe0b:d74e%vlegn prefixlen 64 scopeid 0x3 
	groups: vlan 
	vlan: 1234 vlanpcp: 0 parent interface: em0
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

Usually vlegn (the if_vlan(4) child of em0) should work with the inherited MTU of 9000.

Shall I file a separate PR,
and test different offloading scenarios before doing so?

Thanks,

-harry
Comment 4 Harald Schmalzbauer 2018-08-27 19:08:01 UTC
(In reply to Harald Schmalzbauer from comment #3)

Just wanted to confirm that the functionality/MTU problem was unrelated to the tested D16808 and seems to be fixed in r338305.
Quick local if_vlan(4) tests passed with D16808 applied to r338305.

thanks!
Comment 5 commit-hook freebsd_committer freebsd_triage 2018-09-21 01:38:10 UTC
A commit references this bug:

Author: mmacy
Date: Fri Sep 21 01:37:09 UTC 2018
New revision: 338850
URL: https://svnweb.freebsd.org/changeset/base/338850

Log:
  fix vlan locking to permit sx acquisition in ioctl calls

  - update vlan(9) to handle changes earlier this year in multicast locking

  Tested by: np@, darkfiberu at gmail.com

  PR:	230510
  Reviewed by:	mjoras@, shurd@, sbruno@
  Approved by:	re (gjb@)
  Sponsored by:	Limelight Networks
  Differential Revision:	https://reviews.freebsd.org/D16808

Changes:
  head/sys/net/if_var.h
  head/sys/net/if_vlan.c
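
For readers unfamiliar with the panic in the backtrace: "sleeping thread" is raised from propagate_priority() when a thread blocks on a mutex whose owner is asleep. The commit log above suggests that the vlan code could reach a sleepable sx acquisition while a non-sleepable lock was held; the exact call path is not shown in this report. What follows is only a minimal user-space model of that rule in C, not kernel code. struct thread_model and the model_* helpers are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical user-space model (NOT FreeBSD kernel code).
 * It tracks how many non-sleepable locks (mutex-class, e.g. the
 * mutex behind an rmlock write lock) a thread currently holds.
 */
struct thread_model {
	int nonsleepable_held;
};

/* Model taking a lock that must not be held across a sleep. */
static void
model_mtx_lock(struct thread_model *td)
{
	td->nonsleepable_held++;
}

/* Model releasing such a lock; unbalanced unlocks are a bug. */
static void
model_mtx_unlock(struct thread_model *td)
{
	assert(td->nonsleepable_held > 0);
	td->nonsleepable_held--;
}

/*
 * An sx-style lock may put the caller to sleep, so in this model
 * acquiring it is only allowed when no non-sleepable lock is held.
 */
static bool
model_sx_xlock_allowed(const struct thread_model *td)
{
	return (td->nonsleepable_held == 0);
}
```

In this model, calling model_sx_xlock_allowed() between model_mtx_lock() and model_mtx_unlock() returns false, which corresponds to the pattern that ends in the panic; after the unlock it returns true. In a real kernel, this kind of check is normally done by WITNESS rather than by hand.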
Comment 6 Navdeep Parhar freebsd_committer freebsd_triage 2018-09-24 19:06:09 UTC
*** Bug 230655 has been marked as a duplicate of this bug. ***