Bug 200319 - Bridge+CARP crashes/freezes
Summary: Bridge+CARP crashes/freezes
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.1-RELEASE
Hardware: Any Any
Importance: Normal Affects Many People
Assignee: freebsd-net mailing list
URL: https://reviews.freebsd.org/D3133
Keywords: crash, needs-qa
Depends on:
Blocks:
 
Reported: 2015-05-19 08:20 UTC by Ermal Luçi
Modified: 2019-06-27 15:43 UTC
CC List: 24 users

See Also:
koobs: mfc-stable11?
koobs: mfc-stable12?


Attachments
Fixes on carp+bridge freeze (6.35 KB, text/plain)
2015-05-19 08:20 UTC, Ermal Luçi
script to demonstrate the carp-on-bridge deadlock (1.83 KB, application/x-shellscript)
2019-02-01 08:09 UTC, nvass

Description Ermal Luçi 2015-05-19 08:20:35 UTC
Created attachment 156928
Fixes on carp+bridge freeze

The interaction between carp and bridge code makes the OS hang.

If a bridge has a member interface that carries a CARP IP and a non-trivial amount of traffic is flowing through the system, the result is a system hang and sometimes a crash.

Reference https://redmine.pfsense.org/issues/4607

Analysis suggests that the carp code uses taskqueue_swi to schedule demotion events.
The carp CIF lock had a lot of contention.
ether_input was doing duplicate checks of bridge_input causing even more contention on the CIF lock.

The attached patch converts the CIF lock to an RW lock, avoids the duplicate checks from ether_input when a bridge is involved, and schedules the task on taskqueue_thread rather than on a SWI thread.
Comment 1 Jason Unovitch freebsd_committer 2015-07-19 21:39:43 UTC
Is this on the radar to get put in 10.2?  I've got some feedback over in the forums that the patch has resolved the issue.
https://forums.freebsd.org/threads/carp-bridge-crashes-freezes-on-freebsd-10.52427
Comment 2 cmb 2015-07-20 16:16:45 UTC
The fix here has been confirmed by a number of pfSense users, as we've had it in our builds since the creation of this PR. It's in production on at least several dozen formerly-impacted systems, at least a dozen of which have upwards of 1000 users connected to the bridge interface using CARP IP as their gateway. All the associated problems in 10.x are gone with this. So it should be safe.
Comment 3 Michael Galati 2016-10-12 08:42:00 UTC
Any chance of https://reviews.freebsd.org/D3133 getting committed?  I can confirm this is still a problem on stable/11 (r306832).  My machine was deadlocking in if_carp in as little as an hour (though it would usually take more like 7 to 9 hours).  After applying D3133, I'm on 53 hours and counting with no more deadlocks.
Comment 4 Brian Fundakowski Feldman 2016-10-26 21:47:32 UTC
The deadlock occurs immediately for me on 11.0-RELEASE when I use CARP on if_bridge instead of just a regular interface.  The patch resolves the deadlock completely.  Thank you for your work!
Comment 5 nvass 2016-12-06 14:44:55 UTC
12-CURRENT is also affected.
Comment 6 hcoin 2017-11-29 23:36:51 UTC
Confirmed in 11.1 / pfsense 2.4.2.  

/root: ps -axHdwwo "pid ppid %cpu tt stat systime blocked state mwchan command" 
  PID  PPID  %CPU TT  STAT    SYSTIME  BLOCKED STAT MWCHAN   COMMAND
    0     0   0.0  -  DLs  8588:17.70        0 DLs  swapin   [kernel/swapper]
    1     0   0.0  -  ILs     0:00.03 9e7c9014 ILs  wait     - /sbin/init --
  307     1   0.0  -  Ss      0:01.26        0 Ss   kqread   |-- php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
72116   307   0.0  -  I       0:00.01        0 I    accept   | `-- php-fpm: pool nginx (php-fpm)
  321     1   0.0  -  INs     0:00.05 7ff6befe INs  kqread   |-- /usr/local/sbin/check_reload_status
  323   321   0.0  -  IN      0:00.00 7ff6befe IN   kqread   | `-- check_reload_status: Monitoring daemon of check_reload_status
  336     1   0.0  -  Is      0:00.06        0 Is   select   |-- /sbin/devd -q -f /etc/pfSense-devd.conf
 7031     1   0.0  -  Is      0:00.00        0 Is   uwait    |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7031     1   0.0  -  Is      0:00.14        0 Is   sbwait   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7031     1   0.0  -  Ls      0:00.33        0 Ls   if_bridg |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7031     1   0.0  -  Ss      0:00.09        0 Ss   nanslp   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7031     1   0.0  -  Is      0:00.03        0 Is   accept   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7325     1   0.0  -  Is      0:00.00        0 Is   uwait    |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7325     1   0.0  -  Is      0:00.48        0 Is   sbwait   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7325     1   0.0  -  Ls      0:01.43        0 Ls   if_bridg |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7325     1   0.0  -  Ss      0:00.61        0 Ss   nanslp   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7325     1   0.0  -  Is      0:00.01        0 Is   accept   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7653     1   0.0  -  Is      0:00.00        0 Is   uwait    |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7653     1   0.0  -  Ss      0:00.92        0 Ss   sbwait   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7653     1   0.0  -  Ss      0:02.27        0 Ss   nanslp   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7653     1   0.0  -  Ss      0:00.62        0 Ss   nanslp   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 7653     1   0.0  -  Is      0:00.02        0 Is   accept   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 8067     1   0.0  -  Is      0:00.00        0 Is   uwait    |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 8067     1   0.0  -  Ss      0:00.91        0 Ss   sbwait   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 8067     1   0.0  -  Ss      0:02.51        0 Ss   nanslp   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 8067     1   0.0  -  Ss      0:00.62        0 Ss   nanslp   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 8067     1   0.0  -  Is      0:00.03        0 Is   accept   |-- /usr/local/bin/dpinger -S -r 0 -i <far side of vpn>
 8326     1   0.0  -  Is      0:00.00        0 Is   uwait    |-- /usr/local/bin/dpinger -S -r 0 -i <isp>
 8326     1   0.0  -  Ss      0:01.00        0 Ss   sbwait   |-- /usr/local/bin/dpinger -S -r 0 -i <isp>
 8326     1   0.0  -  Ss      0:02.29        0 Ss   nanslp   |-- /usr/local/bin/dpinger -S -r 0 -i <isp>
 8326     1   0.0  -  Ss      0:00.67        0 Ss   nanslp   |-- /usr/local/bin/dpinger -S -r 0 -i <isp>
 8326     1   0.0  -  Is      0:00.02        0 Is   accept   |-- /usr/local/bin/dpinger -S -r 0 -i <isp>
 9373     1   0.0  -  Is      0:00.06        0 Is   select   |-- dhclient: vtnet1 [priv] (dhclient)
13403     1   0.0  -  Ss      0:00.12        0 Ss   select   |-- dhclient: vtnet1 (dhclient)
14509     1   0.0  -  Is      0:00.00        0 Is   select   |-- /usr/sbin/sshd
28061     1   0.0  -  Ss      0:00.51        0 Ss   select   |-- /usr/local/sbin/openvpn --config /var/etc/openvpn/client2.conf
30047     1   0.0  -  Ss      0:19.31        0 Ss   select   |-- /usr/local/sbin/openvpn --config /var/etc/openvpn/client3.conf
42540     1   0.0  -  Ss      0:06.95        0 Ss   select   |-- /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcpd -cf /etc/dhcpd.conf -pf /var/run/dhcpd.pid vtnet0 vtnet0.30
43814     1   0.0  -  Ss      0:04.97        0 Ss   select   |-- /usr/local/sbin/dhcpd -6 -user dhcpd -group _dhcp -chroot /var/dhcpd -cf /etc/dhcpdv6.conf -pf /var/run/dhcpdv6.pid vtnet0
44137     1   0.0  -  Is      0:00.00        0 Is   kqread   |-- /usr/local/sbin/dhcpleases6 -c /usr/local/bin/php-cgi -f /usr/local/sbin/prefixes.php|/bin/sh -l /var/dhcpd/var/db/dhcpd6.leases
46990     1   0.0  -  IN      0:16.69        0 IN   wait     |-- /bin/sh /var/db/rrd/updaterrd.sh
61474 46990   0.0  -  IN      0:00.00        0 IN   nanslp   | `-- sleep 60
55338     1   0.0  -  Ss      0:00.82        0 Ss   select   |-- /usr/local/sbin/radvd -p /var/run/radvd.pid -C /var/etc/radvd.conf -m syslog
59394     1   0.0  -  Ss      0:00.69        0 Ss   select   |-- /usr/local/sbin/openvpn --config /var/etc/openvpn/client4.conf
69070     1   0.0  -  Ls      0:07.47        0 Ls   carp_sof |-- /usr/local/sbin/openvpn --config /var/etc/openvpn/client1.conf
79855     1   0.0  -  Is      0:00.35        0 Is   select   |-- /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -P /var/run/syslog.pid -f /etc/syslog.conf
33891 79855   0.0  -  Is      0:00.01     2001 Is   piperd   | |-- /usr/local/sbin/sshlockout_pf 15
33891 79855   0.0  -  Is      0:00.00     2001 Is   nanslp   | `-- /usr/local/sbin/sshlockout_pf 15
84053     1   0.0  -  Ss      0:00.82        0 Ss   bpf      |-- /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
84114     1   0.0  -  Is      0:00.00        0 Is   wait     |-- /usr/local/bin/minicron 240 /var/run/ping_hosts.pid /usr/local/bin/ping_hosts.sh
84433 84114   0.0  -  I       0:00.01    80006 I    wait     | `-- minicron: helper /usr/local/bin/ping_hosts.sh  (minicron)
56555 84433   0.0  -  I       0:00.00        0 I    piperd   |   `-- /bin/sh /usr/local/bin/ping_hosts.sh
56661 56555   0.0  -  I       0:00.00        0 I    wait     |     `-- /bin/sh /usr/local/bin/ping_hosts.sh
56696 56661   0.0  -  L       0:00.00        0 L    carp_if  |       |-- ifconfig
56959 56661   0.0  -  I       0:00.01        0 I    piperd   |       |-- grep carp: BACKUP vhid
57227 56661   0.0  -  I       0:00.01        0 I    piperd   |       `-- wc -l
84776     1   0.0  -  Is      0:00.00        0 Is   select   |-- /usr/local/sbin/xinetd -syslog daemon -f /var/etc/xinetd.conf -pidfile /var/run/xinetd.pid
84821     1   0.0  -  Is      0:00.00        0 Is   wait     |-- /usr/local/bin/minicron 3600 /var/run/expire_accounts.pid /usr/local/sbin/fcgicli -f /etc/rc.expireaccounts
85313 84821   0.0  -  I       0:00.00        0 I    nanslp   | `-- minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.expireaccounts  (minicron)
85571     1   0.0  -  Is      0:00.00        0 Is   wait     |-- /usr/local/bin/minicron 86400 /var/run/update_alias_url_data.pid /usr/local/sbin/fcgicli -f /etc/rc.update_alias_url_data
86081 85571   0.0  -  I       0:00.00        0 I    nanslp   | `-- minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.update_alias_url_data  (minicron)
95019     1   0.0  -  S       0:00.33        0 S    select   |-- /usr/local/sbin/dnsmasq --all-servers -C /dev/null --rebind-localhost-ok --stop-dns-rebind --dns-forward-max=5000 --cache-size=10000 --local-ttl=1
97966     1   0.0  -  Is      0:00.01        0 Is   pause    |-- nginx: master process /usr/local/sbin/nginx -c /var/etc/nginx-webConfigurator.conf (nginx)
98108 97966   0.0  -  I       0:00.03        0 I    kqread   | |-- nginx: worker process (nginx)
98295 97966   0.0  -  I       0:00.06        0 I    kqread   | `-- nginx: worker process (nginx)
98523     1   0.0  -  Is      0:00.11        0 Is   nanslp   |-- /usr/sbin/cron -s
99086     1   0.0  -  Ss      0:01.69        0 Ss   select   |-- /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.pid
32113     1   0.0 u0  Is      0:00.02        0 Is   wait     |-- login [pam] (login)
34228 32113   0.0 u0  I       0:00.00        0 I    wait     | `-- -sh (sh)
34720 34228   0.0 u0  I       0:00.02        0 I    wait     |   `-- /bin/sh /etc/rc.initial
85320 34720   0.0 u0  S       0:00.46        2 S    pause    |     `-- /bin/tcsh
61714 85320   0.0 u0  R+      0:00.01        0 R+   -        |       `-- ps -axHdwwo pid ppid %cpu tt stat systime blocked state mwchan command
30815     1   0.0 v0  Is      0:00.01        0 Is   wait     |-- login [pam] (login)
34064 30815   0.0 v0  I       0:00.01        0 I    wait     | `-- -sh (sh)
34985 34064   0.0 v0  I+      0:00.01        0 I+   ttyin    |   `-- /bin/sh /etc/rc.initial
30878     1   0.0 v1  Is+     0:00.01        0 Is+  ttyin    |-- /usr/libexec/getty Pc ttyv1
31007     1   0.0 v2  Is+     0:00.01        0 Is+  ttyin    |-- /usr/libexec/getty Pc ttyv2
31330     1   0.0 v3  Is+     0:00.00        0 Is+  ttyin    |-- /usr/libexec/getty Pc ttyv3
31518     1   0.0 v4  Is+     0:00.01        0 Is+  ttyin    |-- /usr/libexec/getty Pc ttyv4
31772     1   0.0 v5  Is+     0:00.01        0 Is+  ttyin    |-- /usr/libexec/getty Pc ttyv5
31831     1   0.0 v6  Is+     0:00.01        0 Is+  ttyin    |-- /usr/libexec/getty Pc ttyv6
31993     1   0.0 v7  Is+     0:00.00        0 Is+  ttyin    `-- /usr/libexec/getty Pc ttyv7
    2     0   0.0  -  DL      0:00.00        0 DL   crypto_w - [crypto]
    3     0   0.0  -  DL      0:00.00        0 DL   crypto_r - [crypto returns]
    4     0   0.0  -  DL      0:00.00        0 DL   -        - [cam/doneq0]
    4     0   0.0  -  DL      0:00.22        0 DL   -        - [cam/scanner]
    5     0   0.0  -  DL      0:00.01        0 DL   -        - [soaiod1]
    6     0   0.0  -  DL      0:00.01        0 DL   -        - [soaiod2]
    7     0   0.0  -  DL      0:00.01        0 DL   -        - [soaiod3]
    8     0   0.0  -  DL      0:00.01        0 DL   -        - [soaiod4]
    9     0   0.0  -  DL      0:00.00        0 DL   waiting_ - [sctp_iterator]
   10     0   0.0  -  DL      0:00.00        0 DL   audit_wo - [audit]
   11     0 100.0  -  RL    227:37.09        0 RL   -        - [idle/idle: cpu0]
   11     0 100.0  -  RL    227:21.51        0 RL   -        - [idle/idle: cpu1]
   12     0   0.0  -  WL      0:00.02        0 WL   -        - [intr/swi1: netisr 0]
   12     0   0.0  -  WL      0:00.01        0 WL   -        - [intr/swi1: netisr 1]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/swi3: vm]
   12     0   0.0  -  LL      0:00.00        0 LL   if_bridg - [intr/swi4: clock (0]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/swi4: clock (1]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/swi6: task que]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/swi6: Giant ta]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/swi5: fast tas]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq14: ata0]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq15: ata1]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq256: virtio]
   12     0   0.0  -  LL      0:00.00        0 LL   if_bridg - [intr/irq257: virtio]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq258: virtio]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq259: virtio]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq260: virtio]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq261: virtio]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq262: virtio]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq263: virtio]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq264: virtio]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq11: uhci0 u]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq10: uhci2 e]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq265: virtio]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq266: virtio]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq1: atkbd0]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/irq12: psm0]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/swi0: uart]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/swi1: pf send]
   12     0   0.0  -  WL      0:00.00        0 WL   -        - [intr/swi1: pfsync]
   13     0   0.0  -  DL      0:00.00        0 DL   sleep    - [ng_queue/ng_queue0]
   13     0   0.0  -  DL      0:00.00        0 DL   sleep    - [ng_queue/ng_queue1]
   14     0   0.0  -  DL      0:00.01        0 DL   -        - [geom/g_event]
   14     0   0.0  -  DL      0:00.00        0 DL   -        - [geom/g_up]
   14     0   0.0  -  DL      0:00.01        0 DL   -        - [geom/g_down]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus0]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus0]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus0]
   15     0   0.0  -  DL      0:00.21        0 DL   -        - [usb/usbus0]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus0]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus1]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus1]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus1]
   15     0   0.0  -  DL      0:00.21        0 DL   -        - [usb/usbus1]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus1]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus2]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus2]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus2]
   15     0   0.0  -  DL      0:00.20        0 DL   -        - [usb/usbus2]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus2]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus3]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus3]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus3]
   15     0   0.0  -  DL      0:00.39        0 DL   -        - [usb/usbus3]
   15     0   0.0  -  DL      0:00.00        0 DL   -        - [usb/usbus3]
   16     0   0.0  -  DL      0:06.75        0 DL   pftm     - [pf purge]
   17     0   0.0  -  DL      0:02.72        0 DL   -        - [rand_harvestq]
   18     0   0.0  -  DL      0:00.67        0 DL   psleep   - [pagedaemon/pagedaem]
   18     0   0.0  -  DL      0:00.00        0 DL   launds   - [pagedaemon/laundry:]
   18     0   0.0  -  DL      0:00.00        0 DL   umarcl   - [pagedaemon/uma]
   19     0   0.0  -  DL      0:00.00        0 DL   psleep   - [vmdaemon]
   20     0   0.0  -  DL      0:00.00        0 DL   pgzero   - [pagezero]
   21     0   0.0  -  DL      0:00.31        0 DL   -        - [bufspacedaemon]
   22     0   0.0  -  DL      0:00.30        0 DL   psleep   - [bufdaemon/bufdaemon]
   22     0   0.0  -  DL      0:00.82        0 DL   sdflush  - [bufdaemon// worker]
   23     0   0.0  -  DL      0:00.31        0 DL   vlruwt   - [vnlru]
   24     0   0.0  -  DL      0:03.75        0 DL   syncer   - [syncer]
   57     0   0.0  -  DL      0:00.08        0 DL   mdwait   - [md0]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/if_config_tq]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/if_io_tqg_0]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/if_io_tqg_1]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/kqueue_ctx t]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/aiod_kick ta]
    0     0   0.0  -  DLs     0:01.02        0 DLs  -        [kernel/thread taskq]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/firmware tas]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/vtnet0 rxq 0]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/vtnet0 txq 0]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/vtnet1 rxq 0]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/vtnet1 txq 0]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/vtnet2 rxq 0]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/vtnet2 txq 0]
    0     0   0.0  -  DLs     0:00.00        0 DLs  vtbslp   [kernel/virtio_ballo]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/mca taskq]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/acpi_task_0]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/acpi_task_1]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/acpi_task_2]
    0     0   0.0  -  DLs     0:00.00        0 DLs  -        [kernel/CAM taskq]
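
In the listing above, the threads of interest are the ones whose STAT starts with L (per ps(1), waiting to acquire a lock); their MWCHAN column names what they are waiting on. A minimal sketch for pulling them out of a long listing, assuming the same -o column order as the command above (STAT is field 5, MWCHAN is field 9). The function name lock_waiters is made up for illustration.

```shell
# Read ps output (columns: pid ppid %cpu tt stat systime blocked state
# mwchan command) on stdin and print "PID MWCHAN COMMAND" for every
# thread whose STAT field begins with L, i.e. is waiting for a lock.
lock_waiters() {
    awk '$5 ~ /^L/ {
        # Re-join the command, which may itself contain spaces.
        cmd = $10
        for (i = 11; i <= NF; i++) cmd = cmd " " $i
        print $1, $9, cmd
    }'
}
```

Usage: ps -axHdwwo "pid ppid %cpu tt stat systime blocked state mwchan command" | lock_waiters. On the listing above this leaves only the dpinger, openvpn, ifconfig and intr threads blocked on if_bridg, carp_sof and carp_if.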
Comment 7 hcoin 2017-11-29 23:38:52 UTC
This recurs at intervals that vary from seconds to minutes to hours.
Comment 8 Kubilay Kocak freebsd_committer freebsd_triage 2018-11-19 08:31:37 UTC
Reset assignee, open to take.

This is a high-visibility bug (deadlock) in a common configuration for a relatively high-value feature (CARP).

This issue has a patch (see also: review URL) with several confirmations of successful resolution. This is not to say the patch is ready to commit; it may require at least rebasing on CURRENT given its age.

CC'ing differential reviewers/commenters as well in case they're keen on taking it to resolution.
Comment 9 Luiz Otavio O Souza freebsd_committer 2018-11-21 11:18:03 UTC
The patch (in that review) is misleading: it fixes the issue by accident.

The right fix needs a bit of work.

pfSense has an updated version of the original workaround, which could be committed for the time being.

I'll try to update this issue in the following days.
Comment 10 Thomas Steen Rasmussen / Tykling 2019-01-30 16:15:01 UTC
Hello,

I just wanted to let everyone know that this issue is still very much a problem on 12. I've just spent a couple of days(!) narrowing it down to be a carp+bridge issue, and finally found this PR.

We wanted to combine CARP with using v4 space "efficiently" as described in https://www.freebsd.org/doc/en/books/handbook/network-bridging.html "31.6.3. Bridge Interface Parameters" in the "sticky" section. But this issue has completely stalled the firewall project (which we've been working on for months).

Is there anything at all I can do to expedite this, any testing or such required? I can hardly believe that this has been open for almost 4 years now.

Thanks! :)
Comment 11 Kristof Provost freebsd_committer 2019-01-30 20:15:16 UTC
(In reply to Thomas Steen Rasmussen / Tykling from comment #10)
I'm not too familiar with the relevant code, but if you can create a test script to reproduce the problem I can try taking a look.

(As a general rule: anything that makes the problem easier to reproduce will greatly help getting it fixed.)
Comment 12 nvass 2019-02-01 08:09:17 UTC
Created attachment 201585
script to demonstrate the carp-on-bridge deadlock

Hi,

I wrote a small script to reproduce the problem per Kristof's request.
Comment 13 Kristof Provost freebsd_committer 2019-02-01 14:40:23 UTC
(In reply to nvass from comment #12)
For anyone else who wants to try: the script doesn't load the carp module, so load it manually first.
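
Loading the module can be made safe to repeat. A small sketch, assuming the stock kldstat(8) and kldload(8) tools and the module name carp; the function name ensure_carp_loaded is arbitrary.

```shell
# Load the carp kernel module unless it is already present.
ensure_carp_loaded() {
    # kldstat -q -m exits non-zero when the named module is not loaded,
    # in which case kldload is run.
    kldstat -q -m carp || kldload carp
}
```

Call ensure_carp_loaded as root before running the script.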

The manifestation of it isn't exactly what I expected. I don't see an OS hang, but I do see ping report errors: "ping: sendto: No buffer space available".
Comment 14 nvass 2019-02-01 18:24:48 UTC
(In reply to Kristof Provost from comment #13)
At the same time the network stack is not functioning anymore. No IP, ARP, TCP etc. For example, if you get inside the CARP jails and type "ifconfig bridge0", ifconfig will never exit but wait forever:

root@moby:~ # jexec 2 ifconfig bridge0
bridge0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 9e:e2:5e:40:30:1e
        inet 10.255.255.2 netmask 0xfffffff8 broadcast 10.255.255.7 
        inet 10.255.255.1 netmask 0xfffffff8 broadcast 10.255.255.7 vhid 1 
load: 0.28  cmd: ifconfig 1024 [*if_bridge] 8.96r 0.00u 0.00s 0% 2948k
load: 0.28  cmd: ifconfig 1024 [*if_bridge] 9.26r 0.00u 0.00s 0% 2948k
load: 0.28  cmd: ifconfig 1024 [*if_bridge] 9.54r 0.00u 0.00s 0% 2948k
load: 0.28  cmd: ifconfig 1024 [*if_bridge] 9.80r 0.00u 0.00s 0% 2948k
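
The [*if_bridge] in the status lines above names what the stuck ifconfig is waiting on; the kernel stack of such a process shows how it got there. A sketch for collecting those stacks, assuming pgrep(1) and procstat(1) from the base system and root privileges. kstacks_of is a made-up helper name.

```shell
# Print the kernel stacks of every process whose name matches exactly.
kstacks_of() {
    for pid in $(pgrep -x "$1"); do
        # -kk prints the kernel stack of each thread, with offsets.
        procstat -kk "$pid"
    done
}
```

Usage: kstacks_of ifconfig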
Comment 15 Oleg Sharoyko 2019-04-28 21:29:29 UTC
Hello,

I can reliably reproduce this problem on three systems running 12.0 and 11.2. Reproducing it requires sending a non-negligible amount of traffic towards the system under test, so I cannot provide an exact script, but I hope my instructions are clear and I’m happy to provide as many details as I can.

To reproduce the problem, configure the system as follows:

ifconfig ue0 up # Or any other ethernet interface that you have in the system
ifconfig bridge0 create
ifconfig bridge0 addm ue0
ifconfig bridge0 inet 192.168.0.111/24 vhid 1 advskew 100 pass mekmitasdigoat
pkg install iperf3
iperf3 -s

From another machine, send traffic towards the system under test:

iperf3 -c 192.168.0.111 -u -t 120 -Z -b 100m 

In all of my tests this resulted in an immediate (within 2 seconds) lockup of the network stack on the target system. Taking the bridge interface out of the picture and configuring CARP directly on top of ue0 (also tested with rl0) eliminates the problem.

I will try to look more into this problem but I'm not familiar with locking in the kernel. Are there any documents/examples/kernel options which can help find where the deadlock happens?

Also, https://reviews.freebsd.org/D3133 has had no attention for over a year, but it looks like it has been applied in pfSense and works there (https://redmine.pfsense.org/issues/8056). Is that something worth looking into?

Kind regards,
Oleg
Comment 16 Kristof Provost freebsd_committer 2019-04-28 23:05:05 UTC
Thanks Oleg,

The demo script nvass attached also worked for me to trigger the problem. I've not yet tried your setup, but I'm confident it'll also work.

I think I understand the problem in general terms (but I need to look at details for the full story). Basically, carp and bridge deadlock against each other because carp calls into bridge which calls into carp (or vice versa).

I believe the patch in D3133 is not correct. (See also ae@'s comment there.) It removes locking protection where it should be present, and this is what seems to fix the problem.

This issue is still on my todo list, but I can't promise when I'll get the time to look into it.
Comment 17 mlavkin 2019-05-22 19:20:26 UTC
Got it on 12.0-p4 :(