I'm facing some random crash when I'm adding a port into an LACP aggregate.
In the coredump, I found that the crash is due to a race condition when the interface is added to the aggregate.
The condition is the following:
The thread 1 (ifconfig lagg0 laggport em0) is adding an interface to an LACP aggregate. The code enter the function lagg_port_create in the file if_lagg.c.
That function "publish" an update of the interface with the following code:
/* Change the interface type */
lp->lp_iftype = ifp->if_type;
ifp->if_type = IFT_IEEE8023ADLAG;
ifp->if_lagg = lp;
lp->lp_ioctl = ifp->if_ioctl;
ifp->if_ioctl = lagg_port_ioctl;
lp->lp_output = ifp->if_output;
ifp->if_output = lagg_port_output;
In the thread 2, thereafter that publication, a packet come on the interface em0, use the new state of the interface, so lagg_input_p is called. Because the setup of the LACP is not finished (the call to lagg_proto_addport is after the publication), the LACP code uses a NULL pointer and the kernel crashes.
Can I change the code of the lagg_port_create function and put the "publication" of the new state of the interface at the end of the function ?
Created attachment 201036 [details]
use LAGG_PORT_DISABLED for a port until it fully added
lagg_input() drops incoming mbuf until LAGG_PORT_DISABLED is cleared
Please try attached patch. It should replace a panic with packed drop if an interface being added to an aggregate is up and receives a packet in process of inclusion to the aggregate.
Apply it with command:
cd /usr/src && patch < /path/to/patch
Then rebuild if_lagg.ko kernel module if you use it:
make obj depend && make all install
Or rebuild the kernel if your kernel configuration includes device lagg.
The race was introduced by the commit 333612. That commit is only in version 12 and up.
I updated the ticket, sorry for that mistake.
I'm testing your patch (with minor changes to be applyed on 12.0 version).
Assign to author of r333612.
Stephen, please take a look at the PR 234846 that describes a panic after your change to lagg(4).
After some test, the crash seems disapear when adding port.
But there is the same race in the function lagg_port_destroy, that delete the LACP port then restore the interface type.
Which part of LACP uses a NULL pointer? Do you have a backtrace?
I have only the network thread, the ifconfig thread have its backtrace truncated, but the td->td_proc show me that the command was "ifconfig lagg0 aggport em0"
The crash is due to the "lp_psc" field that is NULL into the "struct lagg_port"
There it is:
#0 __curthread () at ./machine/pcpu.h:219
#1 doadump (textdump=<optimized out>) at ../../../kern/kern_shutdown.c:298
#2 0xffffffff80459680 in kern_reboot (howto=260) at ../../../kern/kern_shutdown.c:486
#3 0xffffffff80459a25 in vpanic (fmt=<optimized out>, ap=<optimized out>) at ../../../kern/kern_shutdown.c:889
#4 0xffffffff804598f3 in panic (fmt=<unavailable>) at ../../../kern/kern_shutdown.c:818
#5 0xffffffff8067e848 in trap_fatal (frame=<optimized out>, eva=<optimized out>) at ../../../amd64/amd64/trap.c:858
#6 0xffffffff8067ec27 in trap_pfault (frame=<unavailable>, usermode=<optimized out>) at
#7 0xffffffff8067e24a in trap (frame=0xfffffe0d900b4630) at ../../../amd64/amd64/trap.c:447
#8 <signal handler called>
#9 lacp_iscollecting (lgp=0xfffff80572adbe00) at ../../../net/ieee8023ad_lacp.h:324
#10 lagg_lacp_input (sc=<optimized out>, lp=0xfffff80572adbe00, m=<optimized out>) at ../../../net/if_lagg.c:2176
#11 0xffffffff80521155 in lagg_proto_input (sc=<optimized out>, m=<optimized out>, sc=<optimized
out>, lp=<optimized out>, m=<optimized out>) at ../../../net/if_lagg.c:362
#12 lagg_input (ifp=0xfffff8034bfc3800, m=0xfffff8065cb60a00) at ../../../net/if_lagg.c:1667
#13 0xffffffff8051d555 in ether_input (ifp=0xfffff8034bfc3800, m=0xfffff8065cb60a00) at ../../../net/if_ethersubr.c:794
#14 0xffffffff8052d0fa in pollif_recv () at ../../../net/netisr_pollif.c:529
#15 0xffffffff8052c7a5 in swi_net (arg=0xfffffe10ca797880) at ../../../net/netisr.c:824
#16 0xffffffff8042fe5b in intr_event_execute_handlers (p=<optimized out>, ie=0xfffff8034bf12a00)
#17 0xffffffff80430456 in ithread_execute_handlers (ie=<optimized out>, p=<optimized out>)
#18 ithread_loop (arg=0xfffff8034befd7c0) at ../../../kern/kern_intr.c:1394
#19 0xffffffff8042d595 in fork_exit (callout=0xffffffff804303c0 <ithread_loop>, arg=0xfffff8034befd7c0,
frame=0xfffffe0d900b49c0) at ../../../kern/kern_fork.c:1027
#20 <signal handler called>
Thanks, I'm having problems reproducing this, do you have a similar panic in lagg_port_destroy()?
Created attachment 201200 [details]
Naive initial cut at a patch
Initial cut at a patch using LAGG_RLOCK(), which requires moving the lacp_linkstate() call out of lacp_port_create().
Created attachment 201203 [details]
Naive initial cut at a patch
(In reply to Stephen Hurd from comment #8)
Yes, we also have the race when removing the port, but more unlikely. In that case, the backtrace looks the same.
(In reply to Alexandre martins from comment #11)
What revision or release is that backtrace from?
I'm at the r321477 :
(In reply to Alexandre martins from comment #13)
Ok, I'm a bit confused here, I must be missing something. Here's my (mis)understanding... please correct me where I'm wrong.
- This bug was initially filed against 11.2.
- Comment #3 says that the issue is caused by r333612.
- Bug was assigned to me as a result.
- Version was changed to 12.0-STABLE as a result.
- Panic occurs with r321477.
- r321477 < r333612, so the r333612 change is not present when the panic occurs.
- r321477 is 12.0-CURRENT (should show in uname -r)
If my understanding is correct, can you test with 13.0-CURRENT, 12.0-RELEASE or 12-STABLE?
I'm sorry, I have made a mistake when I give the revision of the bug, It's the 317696 that introduce the bug, when the lock was split in two (SX plus RM).
I'm resume all of my story:
- We are working on a base of FreeBSD 10.3, and faced crashes with LACP
- We updated the if_lagg code to the revision 321477, which solved our initial problem
- However, another crash is occurring (the one that I put the backtrace here)
- When I searched the cause of the race, I take the wrong lock change (333612 instead of 321477)
The root cause of the problem remain the same: ether_input call lagg_input with the lacp port not fully initialized.
The function lagg_input locks the (RM / EPOCK) in read mode but that lock is not taken in lagg_port_create/lagg_port_destroy, who is locked by the SX.
The first patch with LAGG_PORT_DISABLED works for us.
My apologies for the confusion.
Reassigning and re-opening.
Alexander, please take a look at the PR 234846 reporting panic in lagg(4) after your commit https://svnweb.freebsd.org/base?view=revision&revision=317696