Bug 213410 - [carp] service netif restart causes hang only when carp is enabled
Summary: [carp] service netif restart causes hang only when carp is enabled
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 11.0-STABLE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: needs-qa, patch
Depends on:
Blocks:
 
Reported: 2016-10-12 08:16 UTC by Dave Cottlehuber
Modified: 2017-02-25 22:08 UTC
CC List: 2 users

See Also:
koobs: mfc-stable11?
koobs: mfc-stable10?
koobs: mfc-stable9?


Attachments
dmesg (16.38 KB, text/plain)
2016-10-12 08:16 UTC, Dave Cottlehuber
Change assert into check (431 bytes, patch)
2016-11-26 14:15 UTC, Kristof Provost

Description Dave Cottlehuber freebsd_committer freebsd_triage 2016-10-12 08:16:22 UTC
Created attachment 175654
dmesg

# steps

FreeBSD 11.0-RELEASE-p1 amd64

- dmesg attached
- ifconfig (IPs masked)

igb0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 78:45:c4:fa:d2:12
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
igb1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 78:45:c4:fa:d2:12
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
	options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
	inet6 ::1 prefixlen 128
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
	inet 127.0.0.1 netmask 0xff000000
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
	groups: lo
lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 78:45:c4:fa:d2:12
	inet 10.0.9.83 netmask 0xfffffff0 broadcast 10.0.9.95
	inet 10.0.9.84 netmask 0xffffffff broadcast 10.0.9.84 vhid 1
	inet 10.0.9.85 netmask 0xffffffff broadcast 10.0.9.85 vhid 3
	inet6 fe80::7a45:c4ff:fefa:d212%lagg0 prefixlen 64 scopeid 0x4
	inet6 3000:3050:3000:4::83 prefixlen 64
	inet6 3000:3050:3000:4::84 prefixlen 64 vhid 2
	inet6 3000:3050:3000:4::85 prefixlen 64 vhid 4
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: active
	carp: BACKUP vhid 1 advbase 1 advskew 100
	carp: BACKUP vhid 3 advbase 1 advskew 0
	carp: BACKUP vhid 2 advbase 1 advskew 100
	carp: BACKUP vhid 4 advbase 1 advskew 0
	groups: lagg
	laggproto lacp lagghash l2,l3,l4
	laggport: igb0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
	laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>

issue `service netif restart`

This was initially done over a net/mosh connection with tmux running inside it,
then repeated with direct console access (KVM remote management tool).

## actual results

the system hangs, 100% reproducible.

- no keyboard entry
- no ability to Alt-F3 to switch tabs
- no ping over network
- a hard reboot is required to regain control
- final message in log appears to be 
    Oct 12 08:01:22 bridget kernel: lagg0: link state changed to DOWN


### console

Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: checkyesno: netif_enable is set to YES.
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: checkyesno: netif_enable is set to YES.
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: run_rc_command: doit: netif_stop
Oct 12 08:01:21 bridget kernel: ifa_maintain_loopback_route: deletion failed for interface lo0: 48

### /var/log/messages
Oct 12 08:00:00 bridget newsyslog[1525]: logfile turned over due to size>100K
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: checkyesno: netif_enable is set to YES.
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: checkyesno: netif_enable is set to YES.
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: run_rc_command: doit: netif_stop
Oct 12 08:01:21 bridget kernel: ifa_maintain_loopback_route: deletion failed for interface lo0: 48
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: checkyesno: ipv6_gateway_enable is set to NO.
Oct 12 08:01:21 bridget kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0: 3
Oct 12 08:01:21 bridget kernel: carp: 2@lagg0: BACKUP -> INIT (hardware interface up)
Oct 12 08:01:21 bridget kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0: 3
Oct 12 08:01:21 bridget kernel: carp: 4@lagg0: MASTER -> INIT (hardware interface up)
Oct 12 08:01:21 bridget kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0: 3
Oct 12 08:01:21 bridget last message repeated 3 times
Oct 12 08:01:21 bridget kernel: carp: 1@lagg0: BACKUP -> INIT (hardware interface up)
Oct 12 08:01:21 bridget kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0: 3
Oct 12 08:01:21 bridget last message repeated 2 times
Oct 12 08:01:21 bridget kernel: carp: 3@lagg0: MASTER -> INIT (hardware interface up)
Oct 12 08:01:21 bridget kernel: igb0: promiscuous mode disabled
Oct 12 08:01:21 bridget kernel: igb1: promiscuous mode disabled
Oct 12 08:01:21 bridget kernel: lagg0: promiscuous mode disabled
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: The following interfaces were not configured:
Oct 12 08:01:21 bridget kernel: ifa_maintain_loopback_route: deletion failed for interface lagg0: 3
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: Destroyed wlan(4)s:
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: checkyesno: cloned_interfaces_sticky is set to NO.
Oct 12 08:01:21 bridget kernel: lagg0: link state changed to DOWN
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: Destroyed clones: lagg0
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: checkyesno: netif_enable is set to YES.
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: run_rc_command: doit: netif_start
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: Created wlan(4)s:
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: Cloned: lagg0
Oct 12 08:01:21 bridget root: /etc/pccard_ether: DEBUG: run_rc_command: start_precmd: checkauto
Oct 12 08:01:21 bridget root: /etc/pccard_ether: DEBUG: run_rc_command: doit: pccard_ether_start
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: checkyesno: ipv6_activate_all_interfaces is set to NO.
Oct 12 08:01:21 bridget root: /etc/rc.d/netif: DEBUG: checkyesno: netif_enable is set to YES.
Oct 12 08:01:21 bridget root: /etc/rc.d/netif: DEBUG: run_rc_command: doit: netif_start lagg0
Oct 12 08:01:21 bridget root: /etc/rc.d/netif: DEBUG: Created wlan(4)s:
Oct 12 08:01:21 bridget root: /etc/rc.d/netif: DEBUG: Cloned:
Oct 12 08:01:21 bridget root: /etc/rc.d/netif: DEBUG: checkyesno: ipv6_activate_all_interfaces is set to NO.
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: checkyesno: ipv6_activate_all_interfaces is set to NO.
Oct 12 08:01:21 bridget kernel: lagg0: link state changed to UP
Oct 12 08:01:21 bridget root: /etc/rc.d/netif: DEBUG: checkyesno: ipv6_gateway_enable is set to NO.
Oct 12 08:01:21 bridget kernel: igb0: promiscuous mode enabled
Oct 12 08:01:21 bridget kernel: igb1: promiscuous mode enabled
Oct 12 08:01:21 bridget kernel: lagg0: promiscuous mode enabled
Oct 12 08:01:21 bridget kernel: igb0: link state changed to DOWN
Oct 12 08:01:21 bridget kernel: carp: 1@lagg0: INIT -> BACKUP (initialization complete)
Oct 12 08:01:21 bridget kernel: carp: 3@lagg0: INIT -> BACKUP (initialization complete)
Oct 12 08:01:21 bridget kernel: carp: 2@lagg0: INIT -> BACKUP (initialization complete)
Oct 12 08:01:21 bridget kernel: carp: 4@lagg0: INIT -> BACKUP (initialization complete)
Oct 12 08:01:21 bridget dch: /etc/rc.d/netif: DEBUG: checkyesno: ipv6_activate_all_interfaces is set to NO.
Oct 12 08:01:22 bridget dch: /etc/rc.d/netif: DEBUG: checkyesno: ipv6_activate_all_interfaces is set to NO.
Oct 12 08:01:22 bridget kernel: igb1: link state changed to DOWN
Oct 12 08:01:22 bridget kernel: carp: 1@lagg0: BACKUP -> INIT (hardware interface down)
Oct 12 08:01:22 bridget kernel: carp: demoted by 240 to 240 (interface down)
Oct 12 08:01:22 bridget kernel: carp: 3@lagg0: BACKUP -> INIT (hardware interface down)
Oct 12 08:01:22 bridget kernel: carp: demoted by 240 to 480 (interface down)
Oct 12 08:01:22 bridget kernel: carp: 2@lagg0: BACKUP -> INIT (hardware interface down)
Oct 12 08:01:22 bridget kernel: carp: demoted by 240 to 720 (interface down)
Oct 12 08:01:22 bridget kernel: carp: 4@lagg0: BACKUP -> INIT (hardware interface down)
Oct 12 08:01:22 bridget kernel: carp: demoted by 240 to 960 (interface down)
Oct 12 08:01:22 bridget kernel: lagg0: link state changed to DOWN
Oct 12 08:01:24 bridget root: /etc/rc.d/netif: DEBUG: checkyesno: rc_startmsgs is set to YES.

# expected results

after a short period of downtime, the network is re-established.

# notes

If the carp config is disabled and the system is rebooted, `service netif restart` works as expected.

# config

```
# /etc/rc.conf on 1st node
hostname="one.my.domain"
ifconfig_igb0="up"
ifconfig_igb1="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="inet 10.0.9.82 netmask 255.255.255.240 laggproto lacp laggport igb0 laggport igb1"
ifconfig_lagg0_ipv6="inet6 3000:3050:3000:4::82/64"
# ifconfig_lo1="inet 10.0.0.254 netmask 255.255.255.0"
defaultrouter="10.0.9.81"
ipv6_defaultrouter="3000:3050:3000:4::1"
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="AUTO"
zfs_enable="YES"

# carp on
kld_list="carp"
ifconfig_lagg0_aliases="\
        inet  vhid 1 advskew   0 pass pwd1 10.0.9.84/32 \
        inet6 vhid 2 advskew   0 pass pwd2 3000:3050:3000:4::84/64 \
        inet  vhid 3 advskew 100 pass pwd3 10.0.9.85/32 \
        inet6 vhid 4 advskew 100 pass pwd4 3000:3050:3000:4::85/64"

# debugging rc.d scripts
rc_debug="YES"
rc_startmsgs="YES"
```

```
# /etc/rc.conf on 2nd node
hostname="two.my.domain"
ifconfig_igb0="up"
ifconfig_igb1="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="inet 10.0.9.83 netmask 255.255.255.240 laggproto lacp laggport igb0 laggport igb1"
ifconfig_lagg0_ipv6="inet6 3000:3050:3000:4::83/64"
defaultrouter="10.0.9.81"
ipv6_defaultrouter="3000:3050:3000:4::1"
# Set dumpdev to "AUTO" to enable crash dumps, "NO" to disable
dumpdev="AUTO"
zfs_enable="YES"

# carp on
kld_list="carp"
ifconfig_lagg0_aliases="\
        inet  vhid 1 advskew 100 pass pwd1 10.0.9.84/32 \
        inet6 vhid 2 advskew 100 pass pwd2 3000:3050:3000:4::84/64 \
        inet  vhid 3 advskew   0 pass pwd3 10.0.9.85/32 \
        inet6 vhid 4 advskew   0 pass pwd4 3000:3050:3000:4::85/64"

# debugging rc.d scripts
rc_debug="YES"
rc_startmsgs="YES"
```

```
# /boot/loader.conf
# storage
# zfs won't start mounting volumes without this
zfs_load="YES"
kern.geom.label.gptid.enable="0"

# hardware
coretemp_load="YES"

# console
# ensure console in IPMI mode remains accessible instead of going all white
hw.vga.textmode=1

# bhyve and jails
vmm_load="YES"
nmdm_load="YES"
if_bridge_load="YES"
if_tap_load="YES"
kern.racct.enable=1

# debug super powers
dtraceall_load="YES"

# runtime
# maxfiles
kern.maxfiles="25000"

# network
# fibs
# https://blog.feld.me/posts/2015/06/routing-a-freebsd-jail-through-openvpn/
# https://www.freebsd.org/cgi/man.cgi?query=setfib
net.fibs=2
# from https://calomel.org/freebsd_network_tuning.html
accf_data_load="YES"
accf_dns_load="YES"
autoboot_delay="3"
ahci_load="YES"
aio_load="YES"
cc_htcp_load="YES"
net.tcp.hostcache.cachelimit="0"
```


```
# /etc/sysctl.conf
# carp tweaks
net.inet.carp.preempt=1
```
Comment 1 Kristof Provost freebsd_committer freebsd_triage 2016-11-25 16:04:59 UTC
I’ve had a very quick look, and at first glance it seems like an overly strict KASSERT() more than anything else.
Basically, during service netif restart the scripts try to set up carp on an address that’s already got it configured. That runs into the assert and panics the box (or actually panics later on if INVARIANTS is not set).

Simply replacing the KASSERT with a check (and returning errors) prevents the panic.
I don’t have a carp test setup, but this should make things a lot better already.

Can you check if this works for you?

diff --git a/sys/netinet/ip_carp.c b/sys/netinet/ip_carp.c
index 7855af2..ea27f0a 100644
--- a/sys/netinet/ip_carp.c
+++ b/sys/netinet/ip_carp.c
@@ -1804,7 +1804,8 @@ carp_attach(struct ifaddr *ifa, int vhid)
        struct carp_softc *sc;
        int index, error;

-       KASSERT(ifa->ifa_carp == NULL, ("%s: ifa %p attached", __func__, ifa));
+       if (ifa->ifa_carp != NULL)
+               return (EBUSY);

        switch (ifa->ifa_addr->sa_family) {
 #ifdef INET
Comment 2 Kubilay Kocak freebsd_committer freebsd_triage 2016-11-26 07:59:24 UTC
@Kristof Could you include the patch in comment 1 as an attachment, please?
Comment 3 Kristof Provost freebsd_committer freebsd_triage 2016-11-26 14:15:15 UTC
Created attachment 177414
Change assert into check