Bug 244247 - Kernel panic due to race condition in ng_eiface shutdown
Summary: Kernel panic due to race condition in ng_eiface shutdown
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-STABLE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-net mailing list
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2020-02-20 17:04 UTC by Vladislav V. Prodan
Modified: 2020-02-27 17:18 UTC (History)
5 users

See Also:


Attachments:
ifconfig output (2.54 KB, text/plain), 2020-02-20 17:04 UTC, Vladislav V. Prodan
kgdb output (4.48 KB, text/plain), 2020-02-20 17:05 UTC, Vladislav V. Prodan
/etc/rc.conf (511 bytes, text/plain), 2020-02-20 17:31 UTC, Vladislav V. Prodan
kgdb info (133.98 KB, text/plain), 2020-02-21 17:06 UTC, Vladislav V. Prodan

Description Vladislav V. Prodan 2020-02-20 17:04:46 UTC
Created attachment 211776 [details]
ifconfig output

Kernel panic when removing 700 VLANs via netgraph

I have a virtual machine on VirtualBox with a 2-core CPU and 2 GB of RAM.
I am running a non-GENERIC kernel.
Problem source: PR 187835.

# uname -a
FreeBSD core.domain.com 12.1-STABLE FreeBSD 12.1-STABLE r358094 SUPPORT-12-1-0  amd64

# kldstat
Id Refs Address                Size Name
 1   10 0xffffffff80200000  25a7a78 kernel
 2    1 0xffffffff827a8000   3ad5b8 zfs.ko
 3    2 0xffffffff82b56000     a3f0 opensolaris.ko


How to reproduce the kernel panic:

1) Apply sysctl tuning

#sysctl net.graph.recvspace=2048000
#sysctl net.graph.maxdgram=2048000
#sysctl kern.ipc.nmbclusters=4000000
#sysctl kern.ipc.maxsockbuf=83886080

2) Add 564 VLANs

#cat << 'SCRIPT' >> test_creating_vlan.sh
#!/bin/sh

# man ng_vlan

ETHER_IF=em2

ngctl -f- <<EOF
     mkpeer ${ETHER_IF}: vlan lower downstream
     name ${ETHER_IF}:lower vlan-${ETHER_IF}
     connect ${ETHER_IF}: vlan-${ETHER_IF}: upper nomatch
EOF

for i in `jot 564 100`; do
    ngctl mkpeer vlan-${ETHER_IF}: eiface vlan$i ether
    ngctl msg vlan-${ETHER_IF}: addfilter \{ vlan=$i hook=\"vlan$i\" \}
done
SCRIPT

#sh test_creating_vlan.sh

3) Delete all freshly created VLANs
#ngctl list | awk ' $2 ~ /ngeth/ {print "ngctl shutdown "$2":";}' | sh

4) Now add 700 VLANs

#cat << 'SCRIPT' >> test_creating_vlan.2.sh
#!/bin/sh

# man ng_vlan

ETHER_IF=em2

ngctl -f- <<EOF
     mkpeer ${ETHER_IF}: vlan lower downstream
     name ${ETHER_IF}:lower vlan-${ETHER_IF}
     connect ${ETHER_IF}: vlan-${ETHER_IF}: upper nomatch
EOF

for i in `jot 700 100`; do
    ngctl mkpeer vlan-${ETHER_IF}: eiface vlan$i ether
    ngctl msg vlan-${ETHER_IF}: addfilter \{ vlan=$i hook=\"vlan$i\" \}
done
SCRIPT

#sh test_creating_vlan.2.sh

5) Delete these 700 VLANs again
#ngctl list | awk ' $2 ~ /ngeth/ {print "ngctl shutdown "$2":";}' | sh

6) Kernel panic.
Comment 1 Vladislav V. Prodan 2020-02-20 17:05:25 UTC
Created attachment 211777 [details]
kgdb output
Comment 2 Vladislav V. Prodan 2020-02-20 17:31:04 UTC
Created attachment 211778 [details]
File /etc/rc.conf output
Comment 3 lutz 2020-02-21 14:00:19 UTC
static void
ng_eiface_mediastatus(struct ifnet *ifp, struct ifmediareq *ifmr)
{
        const priv_p priv = (priv_p)ifp->if_softc;
        struct ifmedia *ifm = &priv->media;

-->     if (ifm->ifm_cur->ifm_media == (IFM_ETHER | IFM_AUTO) &&
            (priv->link_status & IFM_ACTIVE))
                ifmr->ifm_active = IFM_ETHER | IFM_1000_T | IFM_FDX;
        else
                ifmr->ifm_active = ifm->ifm_cur->ifm_media;
        ifmr->ifm_status = priv->link_status;

        return;
}

May I ask you to compile the test system with NETGRAPH_DEBUG?
This prevents the netgraph system from freeing all allocated memory.

I just want to make sure that we have a problem in the netgraph framework itself, not a race condition within ng_eiface.

While you are at it, could you please also add INVARIANT_SUPPORT and INVARIANTS?
These catch several classes of errors before they can do damage.
Comment 4 Vladislav V. Prodan 2020-02-21 15:57:38 UTC
(In reply to lutz from comment #3)

Rebuild the kernel with these options:

options		NETGRAPH_DEBUG
options 	INVARIANTS
options 	INVARIANT_SUPPORT

?

Should I add other options as well?

options 	KASSERT_PANIC_OPTIONAL
options 	DIAGNOSTIC
options 	REGRESSION
options 	TSLOG
options 	TSLOGSIZE=262144
Comment 5 Vladislav V. Prodan 2020-02-21 16:56:12 UTC
(In reply to Vladislav V. Prodan from comment #4)

After the first removal of the VLANs, I got a kernel panic...
Comment 6 Vladislav V. Prodan 2020-02-21 17:06:23 UTC
Created attachment 211803 [details]
kgdb info

Crash dump of the new kernel after the kernel panic.

The new kernel is built with additional options:
options		NETGRAPH_DEBUG
options 	INVARIANTS
options 	INVARIANT_SUPPORT

options 	KASSERT_PANIC_OPTIONAL
options 	DIAGNOSTIC
options 	REGRESSION
options 	TSLOG
options 	TSLOGSIZE=262144
Comment 7 lutz 2020-02-21 22:34:47 UTC
The crash seems to be caused by a dhclient process that interleaves with the destruction of the node. I do not claim that there is no bug here, but it is not a bug in the destruction of nodes; it looks more like a race condition in destroying the interface.

Can you please remove the "ifconfig_DEFAULT" line from rc.conf?
It spawns a dhclient for each ngeth* interface.
Comment 8 Vladislav V. Prodan 2020-02-22 03:23:24 UTC
(In reply to lutz from comment #7)

After commenting out the ifconfig_DEFAULT="SYNCDHCP" line in /etc/rc.conf and rebooting the VM, the kernel panic no longer occurs.
Comment 9 lutz 2020-02-22 09:23:29 UTC
Please rename this bug to "race condition in ng_eiface shutdown" or close it.
Comment 10 Aleksandr Fedorov 2020-02-25 06:13:12 UTC
It seems that there is a race in ng_eiface_rmnode():

static int
ng_eiface_rmnode(node_p node)
{
        const priv_p priv = NG_NODE_PRIVATE(node);
        struct ifnet *const ifp = priv->ifp;

        /*
         * the ifnet may be in a different vnet than the netgraph node,
         * hence we have to change the current vnet context here.
         */
        CURVNET_SET_QUIET(ifp->if_vnet);
        ifmedia_removeall(&priv->media);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The media is removed first.

        ether_ifdetach(ifp);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Only then is the interface detached, i.e. removed from the ifnet list, which is protected by WLOCK().

        if_free(ifp);
        CURVNET_RESTORE();
        free_unr(V_ng_eiface_unit, priv->unit);
        free(priv, M_NETGRAPH);
        NG_NODE_SET_PRIVATE(node, NULL);
        NG_NODE_UNREF(node);
        return (0);
}

So the media is already removed, but the interface is still reachable.

I think the order should be the other way around, as other interface drivers do it:
1) Detach the interface.
2) Free the resources it used.

Can you test the following patch:

Index: sys/netgraph/ng_eiface.c
===================================================================
--- sys/netgraph/ng_eiface.c    (revision 358308)
+++ sys/netgraph/ng_eiface.c    (working copy)
@@ -621,9 +621,9 @@
         * hence we have to change the current vnet context here.
         */
        CURVNET_SET_QUIET(ifp->if_vnet);
-       ifmedia_removeall(&priv->media);
        ether_ifdetach(ifp);
        if_free(ifp);
+       ifmedia_removeall(&priv->media);
        CURVNET_RESTORE();
        free_unr(V_ng_eiface_unit, priv->unit);
        free(priv, M_NETGRAPH);
Comment 11 Vladislav V. Prodan 2020-02-26 17:49:36 UTC
I applied the patch and rebuilt the kernel.

The kernel panic is gone, but /var/log/messages now contains messages of the form:

...
Feb 26 21:43:17 core kernel: ngeth504: link state changed to DOWN
Feb 26 21:43:17 core dhclient[3168]: receive_packet failed on ngeth504: Device not configured
Feb 26 21:43:17 core dhclient[3168]: ioctl(SIOCGIFFLAGS) on ngeth504: Operation not permitted
Feb 26 21:43:17 core kernel: if_delmulti_locked: detaching ifnet instance 0xfffff80059e8b000
Feb 26 21:43:17 core syslogd: last message repeated 1 times
Feb 26 21:43:17 core kernel: ifaddr cache = 0xfffff8000ddf5600 is deleted
Feb 26 21:43:17 core kernel: if_delmulti_locked: detaching ifnet instance 0xfffff80059e8b000
Feb 26 21:43:17 core syslogd: last message repeated 3 times
Feb 26 21:43:17 core dhclient[3168]: Interface ngeth504 no longer appears valid.
Feb 26 21:43:17 core dhclient[3168]: No live interfaces to poll on - exiting.
Feb 26 21:43:17 core dhclient[3168]: exiting.
Feb 26 21:43:17 core dhclient[3168]: connection closed
Feb 26 21:43:17 core dhclient[3168]: exiting.
Feb 26 21:43:17 core root[3259]: /etc/rc.d/dhclient: WARNING: failed to start dhclient
...

Feb 26 21:46:11 core kernel: ngeth0: link state changed to DOWN
Feb 26 21:46:11 core dhclient[3521]: receive_packet failed on ngeth0: Device not configured
Feb 26 21:46:11 core dhclient[3521]: ioctl(SIOCGIFFLAGS) on ngeth0: Operation not permitted
Feb 26 21:46:11 core dhclient[3521]: Interface ngeth0 no longer appears valid.
Feb 26 21:46:11 core dhclient[3521]: No live interfaces to poll on - exiting.
Feb 26 21:46:11 core dhclient[3521]: exiting.
Feb 26 21:46:11 core dhclient[3521]: connection closed
Feb 26 21:46:11 core dhclient[3521]: exiting.
Feb 26 21:46:11 core root[10207]: /etc/rc.d/dhclient: WARNING: failed to start dhclient
Feb 26 21:46:11 core kernel: if_delmulti_locked: detaching ifnet instance 0xfffff80059e8e000
Feb 26 21:46:11 core kernel: ifaddr cache = 0xfffff80059db2800 is deleted
Feb 26 21:46:11 core kernel: if_delmulti_locked: detaching ifnet instance 0xfffff80059e8e000
Feb 26 21:46:11 core syslogd: last message repeated 4 times
...
Comment 12 Vladislav V. Prodan 2020-02-26 17:51:17 UTC
(In reply to Vladislav V. Prodan from comment #11)

And the output of debug.log:
...
Dec 29 04:20:59 core newsyslog[883]: logfile first created
Feb 19 18:28:50 core dhclient[1023]: em3 link state up -> down
Feb 19 18:28:50 core dhclient[1000]: em2 link state up -> down
Feb 19 18:28:50 core dhclient[977]: em1 link state up -> down
Feb 19 18:28:50 core dhclient[899]: em0 link state up -> down
Feb 19 18:28:53 core dhclient[1023]: em3 link state down -> up
Feb 19 18:28:53 core dhclient[1000]: em2 link state down -> up
Feb 19 18:28:53 core dhclient[977]: em1 link state down -> up
Feb 19 18:28:53 core dhclient[899]: em0 link state down -> up
Feb 20 00:52:33 core dhclient[16136]: ioctl(SIOCGIFMEDIA) on ngeth490: Device not configured
Feb 20 00:53:09 core dhclient[20775]: ioctl(SIOCGIFMEDIA) on ngeth500: Device not configured
Feb 20 19:45:32 core dhclient[9911]: ioctl(SIOCGIFMEDIA) on ngeth560: Device not configured
Feb 20 20:00:49 core dhclient[1665]: ngeth0 link state up -> down
Feb 20 20:12:18 core dhclient[1619]: ioctl(SIOCGIFMEDIA) on ngeth0: Device not configured
Feb 20 20:12:22 core dhclient[3063]: ioctl(SIOCGIFMEDIA) on ngeth409: Device not configured
Feb 20 20:22:27 core dhclient[1538]: ioctl(SIOCGIFMEDIA) on ngeth0: Device not configured
Feb 20 20:22:52 core dhclient[7627]: ioctl(SIOCGIFMEDIA) on ngeth0: Device not configured
Feb 20 20:23:19 core dhclient[9421]: ioctl(SIOCGIFMEDIA) on ngeth0: Device not configured
Feb 26 21:43:09 core dhclient[2734]: ioctl(SIOCGIFMEDIA) on ngeth266: Device not configured
Feb 26 21:43:15 core dhclient[2963]: ioctl(SIOCGIFMEDIA) on ngeth408: Device not configured
Feb 26 21:43:17 core dhclient[3168]: ioctl(SIOCGIFMEDIA) on ngeth504: Device not configured
Feb 26 21:46:11 core dhclient[3521]: ngeth0 link state up -> down
...
Comment 13 Vladislav V. Prodan 2020-02-26 17:52:34 UTC
(In reply to Vladislav V. Prodan from comment #12)

root@core:~ # date
Wed Feb 26 21:52:03 EET 2020
Comment 14 Aleksandr Fedorov 2020-02-27 17:18:58 UTC
These messages look expected to me:

Feb 26 21:46:11 core dhclient[3521]: receive_packet failed on ngeth0: Device not configured
Feb 26 21:46:11 core dhclient[3521]: ioctl(SIOCGIFFLAGS) on ngeth0: Operation not permitted
Feb 26 21:46:11 core dhclient[3521]: Interface ngeth0 no longer appears valid.
Feb 26 21:46:11 core dhclient[3521]: No live interfaces to poll on - exiting.
Feb 26 21:46:11 core dhclient[3521]: exiting.
Feb 26 21:46:11 core dhclient[3521]: connection closed
Feb 26 21:46:11 core dhclient[3521]: exiting.


But these need further research:
Feb 26 21:43:17 core kernel: ifaddr cache = 0xfffff8000ddf5600 is deleted
Feb 26 21:43:17 core kernel: if_delmulti_locked: detaching ifnet instance 0xfffff80059e8b000