Bug 211689 - panic with lagg failover wireless ath and iwm
Summary: panic with lagg failover wireless ath and iwm
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: wireless
Version: 11.0-RC1
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-net (Nobody)
URL:
Keywords: crash, regression
Depends on:
Blocks:
 
Reported: 2016-08-09 08:13 UTC by Dominic Fandrey
Modified: 2020-02-16 20:48 UTC

See Also:


Attachments
Textdump of panic with iwm device (12.31 KB, application/x-xz)
2016-08-09 08:13 UTC, Dominic Fandrey
Textdump of panic with ath device (7.85 KB, application/x-xz)
2016-08-09 08:14 UTC, Dominic Fandrey
remove pending lladdr change when destroying lagg device (2.00 KB, patch)
2016-08-09 20:52 UTC, Jan Kokemüller
panic: ifconfig lagg0 destroy (144.26 KB, text/plain)
2016-12-27 18:50 UTC, Dominic Fandrey
Remove the sc_lladdr_task taskqueue (7.47 KB, patch)
2017-01-17 23:58 UTC, Alan Somers

Description Dominic Fandrey freebsd_committer freebsd_triage 2016-08-09 08:13:05 UTC
Created attachment 173440 [details]
Textdump of panic with iwm device

The system:
> FreeBSD AprilRyan.norad 11.0-BETA4 FreeBSD 11.0-BETA4 #0 r303827: Mon Aug  8 11:15:48 CEST 2016     root@AprilRyan.norad:/usr/obj/S403/amd64/usr/src/sys/S403  amd64

My setup was lagg failover from re0 to wlan0. After updating from stable/10 to stable/11 yesterday, the Atheros AR5BHB92 in my system could no longer connect to networks (scanning works, connecting fails).

So I replaced it with an Intel 7260AC wireless device (iwm driver). I got it to connect to a network but the system panicked when I brought the lagg0 interface up. Then I switched back to the Atheros device and reproduced the panic.

Now I got rid of lagg failover and both wireless devices seem to work fine (I'm currently submitting this bug using the Intel device).
Comment 1 Dominic Fandrey freebsd_committer freebsd_triage 2016-08-09 08:14:15 UTC
Created attachment 173441 [details]
Textdump of panic with ath device
Comment 2 Adrian Chadd freebsd_committer freebsd_triage 2016-08-09 08:43:39 UTC
Comment on attachment 173441 [details]
Textdump of panic with ath device

hi,

I can't really see what's going on. It's a custom kernel; the msgbuf doesn't give me the real backtrace and I don't know where in the source that panic occurred.

Did you get an actual coredump? If not, please enable coredumps, get a coredump with the AR9280 in your machine, and then post the generated 'core.X.txt' here. :)

Thanks!


-adrian
Comment 3 Dominic Fandrey freebsd_committer freebsd_triage 2016-08-09 11:52:29 UTC
Can do, but might have to wait till the weekend.
Comment 4 Jan Kokemüller 2016-08-09 20:52:22 UTC
Created attachment 173477 [details]
remove pending lladdr change when destroying lagg device

Can you try the attached patch? I get similar crashes in "swi6: task queue" when destroying a lagg device (em0 & iwn0/wlan0 in my case). Maybe this is related. I'll open a new bug report if not.

In "lagg_clone_destroy" there can be tasks running on the sc_lladdr_task queue that operate on the main lagg interface. Those tasks run at the same time the main lagg interface is destroyed in "ether_ifdetach" and "if_free". Tasks that operate on the lagg ports are correctly removed in "lagg_port_destroy". But tasks that operate on the main lagg interface are not removed. The attached patch fixes this.
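
The race described above can be modelled in userland as a minimal sketch. All names here (struct iface, taskq_cancel_for, iface_destroy, and so on) are hypothetical stand-ins and are not taken from if_lagg.c or the attached patch; the real kernel code uses the taskqueue(9) API and runs concurrently, whereas this single-threaded model only illustrates the ordering rule: pending work that points at an object has to be removed before the object is freed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these names come from if_lagg.c. */
struct iface {
	int lladdr_updates;	/* deferred updates that ran on this interface */
};

struct task {
	struct iface *target;	/* object the deferred work will dereference */
	struct task *next;
};

struct taskq {
	struct task *head;
};

/* Queue deferred work against `target`. Returns 0 on success, -1 on OOM. */
static int
taskq_enqueue(struct taskq *q, struct iface *target)
{
	struct task *t = malloc(sizeof(*t));

	if (t == NULL)
		return (-1);
	t->target = target;
	t->next = q->head;
	q->head = t;
	return (0);
}

/* Drop every pending task that still points at `target`; return the count. */
static int
taskq_cancel_for(struct taskq *q, struct iface *target)
{
	struct task **pp = &q->head;
	int removed = 0;

	while (*pp != NULL) {
		struct task *t = *pp;

		if (t->target == target) {
			*pp = t->next;
			free(t);
			removed++;
		} else {
			pp = &t->next;
		}
	}
	return (removed);
}

/* Run and free all pending tasks; return how many ran. */
static int
taskq_run(struct taskq *q)
{
	int ran = 0;

	while (q->head != NULL) {
		struct task *t = q->head;

		q->head = t->next;
		t->target->lladdr_updates++;	/* use-after-free if target is gone */
		free(t);
		ran++;
	}
	return (ran);
}

/* Safe teardown: cancel pending work first, only then free the object. */
static int
iface_destroy(struct taskq *q, struct iface *ifp)
{
	int cancelled = taskq_cancel_for(q, ifp);

	free(ifp);
	return (cancelled);
}
```

If iface_destroy() freed the interface without calling taskq_cancel_for() first, a later taskq_run() would dereference freed memory, which is the kind of failure the backtrace below shows in the task queue thread.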

I can reproduce this crash by running "service netif restart" a few times on an unmodified 11.0-BETA4 kernel.

Also, can you try to change your lagg configuration in /etc/rc.conf to look more like this:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211436#c2
This may fix/work around the infinite UP/DOWN loop when bringing up the lagg interface.

My configuration looks like this (replace "xx:xx:xx:xx:xx:xx" with the MAC address of the wifi card). The lagg interface comes up reliably.
wlans_iwn0="wlan0"
ifconfig_wlan0="WPA"
ifconfig_em0="ether xx:xx:xx:xx:xx:xx"
cloned_interfaces="lagg0"
ifconfig_lagg0="DHCP laggproto failover laggport em0 laggport wlan0"



Here is a backtrace:

#0  doadump (textdump=<value optimized out>) at pcpu.h:221
#1  0xffffffff80ad6d99 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80ad734b in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3  0xffffffff80ad7183 in panic (fmt=0x0) at /usr/src/sys/kern/kern_shutdown.c:690
#4  0xffffffff80f9cd51 in trap_fatal (frame=0xfffffe0227aef850, eva=0) at /usr/src/sys/amd64/amd64/trap.c:841
#5  0xffffffff80f9cf43 in trap_pfault (frame=0xfffffe0227aef850, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:691
#6  0xffffffff80f9c4ec in trap (frame=0xfffffe0227aef850) at /usr/src/sys/amd64/amd64/trap.c:442
#7  0xffffffff80f7fc01 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:236
#8  0xffffffff80c45800 in arp_iflladdr (arg=0x0, ifp=0xfffff800311d0800) at /usr/src/sys/netinet/if_ether.c:1339
#9  0xffffffff82660516 in lagg_port_setlladdr (arg=<value optimized out>, pending=<value optimized out>) at /usr/src/sys/modules/if_lagg/../../net/if_lagg.c:718
#10 0xffffffff80b31dda in taskqueue_run_locked (queue=<value optimized out>) at /usr/src/sys/kern/subr_taskqueue.c:465
#11 0xffffffff80b31bcf in taskqueue_run (queue=0xfffff80003f81200) at /usr/src/sys/kern/subr_taskqueue.c:484
#12 0xffffffff80a9142f in intr_event_execute_handlers (p=<value optimized out>, ie=<value optimized out>) at /usr/src/sys/kern/kern_intr.c:1262
#13 0xffffffff80a91696 in ithread_loop (arg=<value optimized out>) at /usr/src/sys/kern/kern_intr.c:1275
#14 0xffffffff80a8e075 in fork_exit (callout=0xffffffff80a915d0 <ithread_loop>, arg=0xfffff800041fb880, frame=0xfffffe0227aefac0) at /usr/src/sys/kern/kern_fork.c:1038
#15 0xffffffff80f8013e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:611
#16 0x0000000000000000 in ?? ()
Comment 5 Dominic Fandrey freebsd_committer freebsd_triage 2016-08-25 13:55:40 UTC
In the meantime, wifi behind lagg works in principle. I had the MAC address setting in ifconfig_ath0/ifconfig_iwm0 instead of ifconfig_wlan0 because I migrated from the stable/10 branch.

But using lagg failover still inevitably crashes the system. The backtraces just don't make any sense: I don't think I'm getting a backtrace of the right stack, as there's not even a call to panic() in the trace. Maybe the stack gets smashed.
Comment 6 Dominic Fandrey freebsd_committer freebsd_triage 2016-12-27 17:46:07 UTC
Seems to be fixed.

I'll test some more locations and close this in January unless I run into more problems.
Comment 7 Dominic Fandrey freebsd_committer freebsd_triage 2016-12-27 18:50:46 UTC
Created attachment 178330 [details]
panic: ifconfig lagg0 destroy

It panics when I try to destroy lagg0.
Comment 8 Jan Kokemüller 2016-12-27 21:53:57 UTC
The patch in comment #4 should fix the panic when destroying a lagg device.
Comment 9 Robin Randhawa 2017-01-13 16:44:42 UTC
I'd like to confirm that without the patch in comment #4, my system panics immediately on invoking:

$ service netif restart

Backtrace is similar to that reported in comment #4:

(kgdb) bt
#0  doadump (textdump=1) at pcpu.h:222
#1  0xffffffff80a30305 in kern_reboot (howto=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:386
#2  0xffffffff80a308e0 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:779
#3  0xffffffff80a30923 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:710
#4  0xffffffff80ea01f2 in trap_fatal (frame=0xfffffe07c4cc5760, eva=32) at /usr/src/sys/amd64/amd64/trap.c:801
#5  0xffffffff80ea03d8 in trap_pfault (frame=0xfffffe07c4cc5760, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:658
#6  0xffffffff80e9f9f8 in trap (frame=0xfffffe07c4cc5760) at /usr/src/sys/amd64/amd64/trap.c:421
#7  0xffffffff80e80261 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:236
#8  0xffffffff80c4b85b in nd6_iflladdr (arg=0x0, ifp=0xfffff8023448f000) at /usr/src/sys/netinet6/nd6.c:208
#9  0xffffffff835fe5de in lagg_port_setlladdr (arg=<value optimized out>, pending=<value optimized out>) at /usr/src/sys/modules/if_lagg/../../net/if_lagg.c:719
#10 0xffffffff80a82bac in taskqueue_run_locked (queue=<value optimized out>) at /usr/src/sys/kern/subr_taskqueue.c:454
#11 0xffffffff80a82a2a in taskqueue_run (queue=0xfffff800071b0400) at /usr/src/sys/kern/subr_taskqueue.c:473
#12 0xffffffff809f7ae6 in intr_event_execute_handlers (p=<value optimized out>, ie=<value optimized out>) at /usr/src/sys/kern/kern_intr.c:1262
#13 0xffffffff809f8156 in ithread_loop (arg=<value optimized out>) at /usr/src/sys/kern/kern_intr.c:1275
#14 0xffffffff809f53e4 in fork_exit (callout=0xffffffff809f80b0 <ithread_loop>, arg=0xfffff80007748ec0, frame=0xfffffe07c4cc59c0) at /usr/src/sys/kern/kern_fork.c:1038
#15 0xffffffff80e8079e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:611
#16 0x0000000000000000 in ?? ()
.
.
$ uname -a

FreeBSD vulcan 12.0-CURRENT FreeBSD 12.0-CURRENT #1 r312026 (fix-lagg): Fri Jan 13 12:05:45 GMT 2017     root@vulcan:/usr/obj/usr/src/sys/GENERIC  amd64

So this is current as of today.

With the patch applied the problem is gone.

May I know if and when this patch will be submitted to current for inclusion?

Thanks.
Comment 10 Alan Somers freebsd_committer freebsd_triage 2017-01-13 18:30:47 UTC
Jan,
    I've run into this problem, too.  My solution was to completely eliminate the sc_lladdr_task.  It worked well on stable/10.  I'll see if it works on head too.
Comment 11 Alan Somers freebsd_committer freebsd_triage 2017-01-17 23:58:57 UTC
Created attachment 179015 [details]
Remove the sc_lladdr_task taskqueue

Please try the attached patch.  It removes the sc_lladdr_task taskqueue, and simply removes the link local addresses inline with lagg_port_destroy.

The lp->detaching check in lagg_port_lladdr is there to avoid a recursive lock acquisition panic that happens when sending the gratuitous ARP caused by removing the link-local address.  It's ok to skip sending the gratuitous ARP because we're about to destroy the lagg port, anyway.
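
As a rough illustration of the guard described above, here is a userland sketch under stated assumptions. The types and functions (struct softc_lock, port_restore_lladdr, port_announce, port_destroy) are hypothetical and do not reproduce the patch; in particular, it is an assumption of this model that teardown runs with a non-recursive lock held and that the announcement path tries to take the same lock.

```c
#include <stdbool.h>

/* Hypothetical model; field and function names are illustrative only. */
struct softc_lock {
	bool held;		/* non-recursive: taking it twice is a bug */
};

struct port {
	struct softc_lock *lock;
	bool detaching;		/* set once teardown of this port has begun */
	int lladdr_restored;	/* number of address restores */
	int announcements;	/* number of notifications sent */
};

/* Return false (instead of panicking) when the lock is already held. */
static bool
lock_acquire(struct softc_lock *l)
{
	if (l->held)
		return (false);
	l->held = true;
	return (true);
}

static void
lock_release(struct softc_lock *l)
{
	l->held = false;
}

/* Announcing needs the lock, so it fails if the caller already holds it. */
static bool
port_announce(struct port *p)
{
	if (!lock_acquire(p->lock))
		return (false);
	p->announcements++;
	lock_release(p->lock);
	return (true);
}

/* Restore the address; return false only on a recursive lock attempt. */
static bool
port_restore_lladdr(struct port *p)
{
	p->lladdr_restored++;
	if (p->detaching)
		return (true);	/* skip the announcement: port is going away */
	return (port_announce(p));
}

/* Teardown holds the lock, so it must not re-enter the announce path. */
static bool
port_destroy(struct port *p)
{
	bool ok;

	if (!lock_acquire(p->lock))
		return (false);
	p->detaching = true;
	ok = port_restore_lladdr(p);
	lock_release(p->lock);
	return (ok);
}
```

Without the detaching check, port_destroy() in this model would reach port_announce() while already holding the lock, and the second acquisition would fail; in a kernel built with lock assertions the equivalent situation is a panic.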

The change to in6.c is more complicated, and I need to check with some IPv6 experts to be sure that it's valid.  But without that change, I find that a lagg port can end up with a link-local address even though ND6_IFF_IFDISABLED is set.
Comment 12 Robin Randhawa 2017-01-18 12:56:48 UTC
Thanks Alan.

I can confirm that by rolling back the patch alluded to in comment #4 and using the patch supplied by you in comment #11 there are:

1. no panics across multiple netif restarts across multiple ethernet/wifi link up/down sequences

2. no perceivable issue with networking on this system (Thinkpad P50)

I assume you will now go through the motions to get this patch included in current? That would be ideal. Please let me know if you need any further testing.

Cheers.
Comment 13 Robin Randhawa 2017-01-18 14:47:42 UTC
It appears I spoke too soon.

While networking seems fine in general, I find that with the patch included, an ACPI suspend to state S3 causes the system to panic silently while suspending and eventually reboot. The backtrace from the crash dump is as follows:

(kgdb) bt
#0  doadump (textdump=1) at pcpu.h:222
#1  0xffffffff80a30125 in kern_reboot (howto=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:386
#2  0xffffffff80a30700 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:779
#3  0xffffffff80a30536 in kassert_panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:669
#4  0xffffffff80a11b4c in __mtx_lock_flags (c=0xfffffe00011da080, opts=0, file=<value optimized out>, line=1614)
    at /usr/src/sys/kern/kern_mutex.c:279
#5  0xffffffff80b7ab88 in ieee80211_suspend_all (ic=0xfffffe00011da048) at /usr/src/sys/net80211/ieee80211_proto.c:1614
#6  0xffffffff822dddff in iwm_suspend (dev=<value optimized out>) at /usr/src/sys/modules/iwm/../../dev/iwm/if_iwm.c:6193
#7  0xffffffff80a64714 in bus_generic_suspend_child (dev=<value optimized out>, child=0xfffff80007b5d500) at device_if.h:275
#8  0xffffffff806c562d in pci_suspend_child (dev=0xfffff80007b5d600, child=0xfffff80007b5d500)
    at /usr/src/sys/dev/pci/pci.c:4247
#9  0xffffffff80a647f7 in bus_generic_suspend (dev=<value optimized out>) at bus_if.h:975
#10 0xffffffff80a64714 in bus_generic_suspend_child (dev=<value optimized out>, child=0xfffff80007b5d600) at device_if.h:275
#11 0xffffffff80a647f7 in bus_generic_suspend (dev=<value optimized out>) at bus_if.h:975
#12 0xffffffff80a64714 in bus_generic_suspend_child (dev=<value optimized out>, child=0xfffff80007b5e900) at device_if.h:275
#13 0xffffffff806c562d in pci_suspend_child (dev=0xfffff80007b5f300, child=0xfffff80007b5e900)
    at /usr/src/sys/dev/pci/pci.c:4247
#14 0xffffffff80a647f7 in bus_generic_suspend (dev=<value optimized out>) at bus_if.h:975
#15 0xffffffff80a64714 in bus_generic_suspend_child (dev=<value optimized out>, child=0xfffff80007b5f300) at device_if.h:275
#16 0xffffffff80a647f7 in bus_generic_suspend (dev=<value optimized out>) at bus_if.h:975
#17 0xffffffff80a64714 in bus_generic_suspend_child (dev=<value optimized out>, child=0xfffff800077f6000) at device_if.h:275
#18 0xffffffff80a647f7 in bus_generic_suspend (dev=<value optimized out>) at bus_if.h:975
#19 0xffffffff803b960f in acpi_suspend (dev=0xfffff800077f7700) at /usr/src/sys/dev/acpica/acpi.c:729
#20 0xffffffff80a64714 in bus_generic_suspend_child (dev=<value optimized out>, child=0xfffff800077f7700) at device_if.h:275
#21 0xffffffff80a647f7 in bus_generic_suspend (dev=<value optimized out>) at bus_if.h:975
#22 0xffffffff80a64714 in bus_generic_suspend_child (dev=<value optimized out>, child=0xfffff800077f7c00) at device_if.h:275
#23 0xffffffff80a647f7 in bus_generic_suspend (dev=<value optimized out>) at bus_if.h:975
#24 0xffffffff803b71bf in acpi_EnterSleepState (sc=<value optimized out>, state=<value optimized out>) at device_if.h:275
#25 0xffffffff803b7b60 in acpi_AckSleepState (clone=<value optimized out>, error=<value optimized out>)
    at /usr/src/sys/dev/acpica/acpi.c:2780
#26 0xffffffff8090aaa3 in devfs_ioctl (ap=<value optimized out>) at /usr/src/sys/fs/devfs/devfs_vnops.c:831
#27 0xffffffff81002d50 in VOP_IOCTL_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:1067
#28 0xffffffff80b01634 in vn_ioctl (fp=0xfffff8000d2b1960, com=<value optimized out>, data=0xfffffe08595f4780,
    active_cred=0xfffff80007701d00, td=<value optimized out>) at vnode_if.h:448
#29 0xffffffff8090b1cf in devfs_ioctl_f (fp=<value optimized out>, com=<value optimized out>, data=<value optimized out>,
    cred=<value optimized out>, td=0xfffff8000d09d500) at /usr/src/sys/fs/devfs/devfs_vnops.c:789
#30 0xffffffff80a94870 in kern_ioctl (td=<value optimized out>, fd=<value optimized out>, com=<value optimized out>,
    data=<value optimized out>) at file.h:321
#31 0xffffffff80a9450f in sys_ioctl (td=<value optimized out>, uap=0xfffffe08595f4930) at /usr/src/sys/kern/sys_generic.c:746
#32 0xffffffff80ea09b9 in amd64_syscall (td=0xfffff8000d09d500, traced=0) at subr_syscall.c:135
#33 0xffffffff80e806db in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:396
#34 0x000000080097cbaa in ?? ()
Previous frame inner to this frame (corrupt stack?)

Perhaps there is some reliance on the sc_lladdr_task for some housekeeping action in the suspend path?

I've tested this across three suspend runs and the same panic occurred all three times. With the patch removed, the panic doesn't happen and suspend-resume cycles succeed.
Comment 14 Alan Somers freebsd_committer freebsd_triage 2017-01-18 15:47:36 UTC
This panic is surprising, Randy.  Could you please post the following?
1) output of "ifconfig" from just before you suspend
2) the panic message
Additionally, do you happen to know if the suspend process destroys lagg interfaces?  Because if it doesn't, then I don't see how my patch could make a difference.
Comment 15 Robin Randhawa 2017-01-19 11:14:06 UTC
(In reply to Alan Somers from comment #14)

Before I rebuild and deploy kernels to get you the pre-suspend ifconfig output, I thought I'd share my thoughts on your second question, in case it means I don't have to go through the deployment cycle (which is not a problem, just a little bit of bother).

I'm pretty sure that a "$ service netif restart" invocation will end up taking the lagg interface down via the /etc/rc.conf cloned_interfaces="lagg0" specification. AFAICT, that means that the netif rc.d glue will end up invoking clone_down() [see /etc/network.subr], which looks like it does indeed tear the lagg interface down.

Does that imply that your patch will need some modifications to accommodate this? Please do let me know and I'll follow through as needed.
Comment 16 Alan Somers freebsd_committer freebsd_triage 2017-01-19 16:27:57 UTC
Randy, you are correct that "service netif restart lagg0" will destroy the cloned interface.  I just didn't realize that suspending restarted network interfaces.  In any case, your panic isn't directly in the lagg or inet6 code.  That's why I want to see the exact panic string.  I may need access to the core file too, if you're willing and able to provide it.
Comment 17 Jan Kokemüller 2017-02-18 08:40:35 UTC
A fix has landed in current (https://svnweb.freebsd.org/base?view=revision&revision=312979).
Comment 18 Kubilay Kocak freebsd_committer freebsd_triage 2019-05-07 03:43:56 UTC
Closing, apparently resolved by loos in base r312979

@Luiz If your commit was indeed a fix for this bug, please assign yourself

@Dominic If this is still an issue, please re-open with further information.
Comment 19 commit-hook freebsd_committer freebsd_triage 2020-02-16 20:48:36 UTC
A commit references this bug:

Author: yuri
Date: Sun Feb 16 20:47:57 UTC 2020
New revision: 526321
URL: https://svnweb.freebsd.org/changeset/ports/526321

Log:
  math/libnormaliz: Update 3.8.3 -> 3.8.4

  PR:		211689
  Submitted by:	salvadore (maintainer)

Changes:
  head/math/libnormaliz/Makefile
  head/math/libnormaliz/distinfo