| Summary: | ng_eiface: Kernel panic due to racecondition in ng_eiface shutdown | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Base System | Reporter: | Vladyslav V. Prodan <admin> | ||||||||||
| Component: | kern | Assignee: | Aleksandr Fedorov <afedorov> | ||||||||||
| Status: | Closed FIXED | ||||||||||||
| Severity: | Affects Some People | CC: | afedorov, donner, eugen, glebius, hselasky, kevans, net | ||||||||||
| Priority: | --- | Keywords: | crash | ||||||||||
| Version: | 12.1-STABLE | Flags: | koobs:
mfc-stable12?
koobs: mfc-stable11- |
||||||||||
| Hardware: | Any | ||||||||||||
| OS: | Any | ||||||||||||
| URL: | https://reviews.freebsd.org/D24557 | ||||||||||||
| Attachments: |
|
||||||||||||
Created attachment 211777 [details]
kgdb output
Created attachment 211778 [details]
File /etc/rc.conf output
static void
ng_eiface_mediastatus(struct ifnet *ifp, struct ifmediareq *ifmr)
{
const priv_p priv = (priv_p)ifp->if_softc;
struct ifmedia *ifm = &priv->media;
--> if (ifm->ifm_cur->ifm_media == (IFM_ETHER | IFM_AUTO) &&
(priv->link_status & IFM_ACTIVE))
ifmr->ifm_active = IFM_ETHER | IFM_1000_T | IFM_FDX;
else
ifmr->ifm_active = ifm->ifm_cur->ifm_media;
ifmr->ifm_status = priv->link_status;
return;
}
May I ask you to compile the test system with NETGRAPH_DEBUG?
This will prevent the netgraph system from freeing all allocated memory.
I just want to make sure, that we do have a problem with the netgraph framework, not with a race condition within ng_eiface.
If you are on this path, may you please add INVARIANT_SUPPORT and INVARIANTS?
This will catch several types of errors before they might happen.
(In reply to lutz from comment #3) Rebuild the kernel with options: options NETGRAPH_DEBUG options INVARIANTS options INVARIANT_SUPPORT ? Maybe add more other options? options KASSERT_PANIC_OPTIONAL options DIAGNOSTIC options REGRESSION options TSLOG options TSLOGSIZE=262144 (In reply to Vladislav V. Prodan from comment #4) After the first removal of the vlans, I got kernel panic ... Created attachment 211803 [details]
kgdb info
Crash dump of the new kernel after kernel panic.
The new kernel is built with additional options:
options NETGRAPH_DEBUG
options INVARIANTS
options INVARIANT_SUPPORT
options KASSERT_PANIC_OPTIONAL
options DIAGNOSTIC
options REGRESSION
options TSLOG
options TSLOGSIZE=262144
The crash seems to be caused by a dhclient process which interleaves with the destruction of the node. I do not claim, that there is no bug with this, but it's not a bug with the destruction of nodes, it's more like a racecondition in destroying the interface. Can you please remove the "ifconfig_DEFAULT" line from rc.conf? This spans an dhclient for each ngeth* (In reply to lutz from comment #7) Commenting out the line 'ifconfig_DEFAULT = "SYNCDHCP" in the /etc/rc.conf file with a reboot of the VM no longer causes kernel panic. Please rename this bug to "racecondition in ng_eiface shutdown" or close it. It seems that there are a race in the function ng_eiface_rmnode().
613 static int
614 ng_eiface_rmnode(node_p node)
615 {
616 const priv_p priv = NG_NODE_PRIVATE(node);
617 struct ifnet *const ifp = priv->ifp;
618
619 /*
620 * the ifnet may be in a different vnet than the netgraph node,
621 * hence we have to change the current vnet context here.
622 */
623 CURVNET_SET_QUIET(ifp->if_vnet);
624 ifmedia_removeall(&priv->media);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Remove media
625 ether_ifdetach(ifp);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Detach interface. Remove interface from ifnet's list which protected by WLOCK().
626 if_free(ifp);
627 CURVNET_RESTORE();
628 free_unr(V_ng_eiface_unit, priv->unit);
629 free(priv, M_NETGRAPH);
630 NG_NODE_SET_PRIVATE(node, NULL);
631 NG_NODE_UNREF(node);
632 return (0);
633 }
So, the media is already removed, but the interface is still available.
I think the order should be different, like other interfaces do:
1) Detach interface.
2) Free used resources.
Can you test the next patch:
Index: sys/netgraph/ng_eiface.c
===================================================================
--- sys/netgraph/ng_eiface.c (revision 358308)
+++ sys/netgraph/ng_eiface.c (working copy)
@@ -621,9 +621,9 @@
* hence we have to change the current vnet context here.
*/
CURVNET_SET_QUIET(ifp->if_vnet);
- ifmedia_removeall(&priv->media);
ether_ifdetach(ifp);
if_free(ifp);
+ ifmedia_removeall(&priv->media);
CURVNET_RESTORE();
free_unr(V_ng_eiface_unit, priv->unit);
free(priv, M_NETGRAPH);
I applied the patch and rebuilt the kernel. Kernel panic is no longer there, but in the logs /var/log/messages messages of the form: ... Feb 26 21:43:17 core kernel: ngeth504: link state changed to DOWN Feb 26 21:43:17 core dhclient[3168]: receive_packet failed on ngeth504: Device not configured Feb 26 21:43:17 core dhclient[3168]: ioctl(SIOCGIFFLAGS) on ngeth504: Operation not permitted Feb 26 21:43:17 core kernel: if_delmulti_locked: detaching ifnet instance 0xfffff80059e8b000 Feb 26 21:43:17 core syslogd: last message repeated 1 times Feb 26 21:43:17 core kernel: ifaddr cache = 0xfffff8000ddf5600 is deleted Feb 26 21:43:17 core kernel: if_delmulti_locked: detaching ifnet instance 0xfffff80059e8b000 Feb 26 21:43:17 core syslogd: last message repeated 3 times Feb 26 21:43:17 core dhclient[3168]: Interface ngeth504 no longer appears valid. Feb 26 21:43:17 core dhclient[3168]: No live interfaces to poll on - exiting. Feb 26 21:43:17 core dhclient[3168]: exiting. Feb 26 21:43:17 core dhclient[3168]: connection closed Feb 26 21:43:17 core dhclient[3168]: exiting. Feb 26 21:43:17 core root[3259]: /etc/rc.d/dhclient: WARNING: failed to start dhclient ... Feb 26 21:46:11 core kernel: ngeth0: link state changed to DOWN Feb 26 21:46:11 core dhclient[3521]: receive_packet failed on ngeth0: Device not configured Feb 26 21:46:11 core dhclient[3521]: ioctl(SIOCGIFFLAGS) on ngeth0: Operation not permitted Feb 26 21:46:11 core dhclient[3521]: Interface ngeth0 no longer appears valid. Feb 26 21:46:11 core dhclient[3521]: No live interfaces to poll on - exiting. Feb 26 21:46:11 core dhclient[3521]: exiting. Feb 26 21:46:11 core dhclient[3521]: connection closed Feb 26 21:46:11 core dhclient[3521]: exiting. Feb 26 21:46:11 core root[10207]: /etc/rc.d/dhclient: WARNING: failed to start dhclient Feb 26 21:46:11 core kernel: if_delmulti_locked: detaching ifnet instance 0xfffff80059e8e000 Feb 26 21:46:11 core kernel: ifaddr cache = 0xfffff80059db2800 is deleted Feb 26 21:46:11 core kernel: if_delmulti_locked: detaching ifnet instance 0xfffff80059e8e000 Feb 26 21:46:11 core syslogd: last message repeated 4 times ... (In reply to Vladislav V. Prodan from comment #11) And output debug.log: ... Dec 29 04:20:59 core newsyslog[883]: logfile first created Feb 19 18:28:50 core dhclient[1023]: em3 link state up -> down Feb 19 18:28:50 core dhclient[1000]: em2 link state up -> down Feb 19 18:28:50 core dhclient[977]: em1 link state up -> down Feb 19 18:28:50 core dhclient[899]: em0 link state up -> down Feb 19 18:28:53 core dhclient[1023]: em3 link state down -> up Feb 19 18:28:53 core dhclient[1000]: em2 link state down -> up Feb 19 18:28:53 core dhclient[977]: em1 link state down -> up Feb 19 18:28:53 core dhclient[899]: em0 link state down -> up Feb 20 00:52:33 core dhclient[16136]: ioctl(SIOCGIFMEDIA) on ngeth490: Device not configured Feb 20 00:53:09 core dhclient[20775]: ioctl(SIOCGIFMEDIA) on ngeth500: Device not configured Feb 20 19:45:32 core dhclient[9911]: ioctl(SIOCGIFMEDIA) on ngeth560: Device not configured Feb 20 20:00:49 core dhclient[1665]: ngeth0 link state up -> down Feb 20 20:12:18 core dhclient[1619]: ioctl(SIOCGIFMEDIA) on ngeth0: Device not configured Feb 20 20:12:22 core dhclient[3063]: ioctl(SIOCGIFMEDIA) on ngeth409: Device not configured Feb 20 20:22:27 core dhclient[1538]: ioctl(SIOCGIFMEDIA) on ngeth0: Device not configured Feb 20 20:22:52 core dhclient[7627]: ioctl(SIOCGIFMEDIA) on ngeth0: Device not configured Feb 20 20:23:19 core dhclient[9421]: ioctl(SIOCGIFMEDIA) on ngeth0: Device not configured Feb 26 21:43:09 core dhclient[2734]: ioctl(SIOCGIFMEDIA) on ngeth266: Device not configured Feb 26 21:43:15 core dhclient[2963]: ioctl(SIOCGIFMEDIA) on ngeth408: Device not configured Feb 26 21:43:17 core dhclient[3168]: ioctl(SIOCGIFMEDIA) on ngeth504: Device not configured Feb 26 21:46:11 core dhclient[3521]: ngeth0 link state up -> down ... (In reply to Vladislav V. Prodan from comment #12) root@core:~ # date Wed Feb 26 21:52:03 EET 2020 These messages looks to me as expected: Feb 26 21:46:11 core dhclient[3521]: receive_packet failed on ngeth0: Device not configured Feb 26 21:46:11 core dhclient[3521]: ioctl(SIOCGIFFLAGS) on ngeth0: Operation not permitted Feb 26 21:46:11 core dhclient[3521]: Interface ngeth0 no longer appears valid. Feb 26 21:46:11 core dhclient[3521]: No live interfaces to poll on - exiting. Feb 26 21:46:11 core dhclient[3521]: exiting. Feb 26 21:46:11 core dhclient[3521]: connection closed Feb 26 21:46:11 core dhclient[3521]: exiting. But these need further research: Feb 26 21:43:17 core kernel: ifaddr cache = 0xfffff8000ddf5600 is deleted Feb 26 21:43:17 core kernel: if_delmulti_locked: detaching ifnet instance 0xfffff80059e8b000 A commit references this bug: Author: afedorov Date: Mon Apr 27 10:00:46 UTC 2020 New revision: 360372 URL: https://svnweb.freebsd.org/changeset/base/360372 Log: ng_eiface: fix kernel panic due to the racecondition in ng_eiface shutdown. PR: 244247 Reported by: Vladislav V. Prodan <admin@support.od.ua> Reviewed by: vmaffione, lutz_donnerhacke.de Approved by: vmaffione (mentor) Sponsored by: vstack.com Differential Revision: https://reviews.freebsd.org/D24557 Changes: head/sys/netgraph/ng_eiface.c ^Triage: Assign to committer resolving Sorry for the long reply. I will try to make MFC to the 12-STABLE. (In reply to Aleksandr Fedorov from comment #17) Aleksandr, will you be able to merge it to stable/12 soon? ^Triage: committed back in 2020. The mfc-stable* flags' values are now OBE. |
Created attachment 211776 [details] ifconfig output Kernel panic when removing 700 vlans via netgraph I have a virtual machine on Virtualbox with 2 core CPU and 2 GP RAM. I have non-GENERIC kernel. Problem Source - PR 187835. # uname -a FreeBSD core.domain.com 12.1-STABLE FreeBSD 12.1-STABLE r358094 SUPPORT-12-1-0 amd64 # kldstat Id Refs Address Size Name 1 10 0xffffffff80200000 25a7a78 kernel 2 1 0xffffffff827a8000 3ad5b8 zfs.ko 3 2 0xffffffff82b56000 a3f0 opensolaris.ko How to play kernel panic: 1) Add tuning sysctl #sysctl net.graph.recvspace=2048000 #sysctl net.graph.maxdgram=2048000 #sysctl kern.ipc.nmbclusters=4000000 #sysctl kern.ipc.maxsockbuf=83886080 2) adding 564 vlan #cat << EOF >> test_creating_vlan.sh #!/bin/sh # man ng_vlan ETHER_IF=em2 ngctl -f- <<EOF mkpeer ${ETHER_IF}: vlan lower downstream name ${ETHER_IF}:lower vlan-${ETHER_IF} connect ${ETHER_IF}: vlan-${ETHER_IF}: upper nomatch EOF for i in `jot 564 100`; do ngctl mkpeer vlan-${ETHER_IF}: eiface vlan$i ether ngctl msg vlan-${ETHER_IF}: addfilter \{ vlan=$i hook=\"vlan$i\" \} done EOF #sh test_creating_vlan.sh 3) delete all freshly created vlans #ngctl list | awk ' $2 ~ /ngeth/ {print "ngctl shutdown "$2":";}' | sh 4) add already 700 vlan #cat << EOF >> test_creating_vlan.2.sh #!/bin/sh # man ng_vlan ETHER_IF=em2 ngctl -f- <<EOF mkpeer ${ETHER_IF}: vlan lower downstream name ${ETHER_IF}:lower vlan-${ETHER_IF} connect ${ETHER_IF}: vlan-${ETHER_IF}: upper nomatch EOF for i in `jot 700 100`; do ngctl mkpeer vlan-${ETHER_IF}: eiface vlan$i ether ngctl msg vlan-${ETHER_IF}: addfilter \{ vlan=$i hook=\"vlan$i\" \} done EOF #sh test_creating_vlan.2.sh 5) Again we delete these 700 vlans ngctl list | awk ' $2 ~ /ngeth/ {print "ngctl shutdown "$2":";}' | sh 6) and kernel panic