On FreeBSD 12-STABLE r354698 amd64, with LACP laggs on top of Solarflare sfxge NICs (7k and 8k series cards) attached to Arista MLAGs (7050s or 7150s running EOS 4.0.19 and newer), the FreeBSD side sporadically stops setting the LACP distribution flag, causing the switch to detect a link flap.

The setup is as follows. Dell R630s and R640s and Supermicro X9DRT Ivy Bridge boards, each set up with one Solarflare SFN7122F or SFN8522 running Solarflare firmware 7.1 and 7.4.4. Each server is set up as a router, with pf being used to NAT and filter traffic. There is one LACP lagg made up of the two ports going to two upstream Arista 7050 or 7150 switches in an MLAG setup. The lagg carries anywhere from 5 to 50 VLANs at a time.

Now the complicated part. This issue happens at what appear to be random times, both on routers we have set up and left to "burn in" over a weekend with little or no traffic, and on routers that are performing moderate amounts of work. This issue did not happen on 10.3-STABLE amd64.

Sysctls
=============================
#security.bsd.see_other_uids=0
net.inet.tcp.mssdflt=1460
net.inet.tcp.minmss=536
net.inet.tcp.rfc6675_pipe=1
net.inet.tcp.syncache.rexmtlimit=0 # (default 3)
net.inet.tcp.per_cpu_timers=1
net.inet.ip.fastforwarding=1
#kern.random.harvest.mask=65887
kern.random.harvest.mask=65537
kern.random.sys.harvest.ethernet=0
kern.random.sys.harvest.point_to_point=0
kern.random.sys.harvest.interrupt=0
hw.intr_storm_threshold=10000
kern.ipc.maxsockbuf=16777216 # socket buffers
net.inet.tcp.recvspace=4194304
net.inet.tcp.sendspace=2097152
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendbuf_auto=1
net.inet.tcp.recvbuf_auto=1
net.inet.tcp.sendbuf_inc=16384
net.inet.tcp.recvbuf_inc=524288
net.inet.tcp.cc.algorithm=htcp
net.inet.ip.intr_queue_maxlen=2048
net.route.netisr_maxqlen=2048
# Do not send IP redirects (enable fastforwarding path)
net.inet.ip.redirect=0
net.inet6.ip6.redirect=0

===loader.conf===
ipmi_load="YES"
boot_multicons="YES"
boot_serial="YES"
console="comconsole,vidconsole"
net.inet.tcp.tso="0"
autoboot_delay="5"
hw.mfi.mrsas_enable="1"
hw.usb.no_pf="1" # Disable USB packet filtering
hw.usb.no_shutdown_wait="1"
hw.vga.textmode="1" # Text mode
machdep.hyperthreading_allowed="0"
geom_mirror_load="YES"
kern.ipc.nmbclusters="1000000"
net.isr.maxqlimit="1000000"
kern.ipc.nmbjumbop=524288
net.isr.bindthreads="0"
net.isr.maxthreads="-1"
net.link.ifqmaxlen="2048"
net.pf.source_nodes_hashsize="1048576"
net.isr.defaultqlimit="2048"
net.inet.tcp.syncache.hashsize="1024"
net.inet.tcp.syncache.bucketlimit="100"
net.inet.tcp.tcbhashsize="65536"
vm.pmap.pti=0
hw.ibrs_disable=1

===LAGG Config===
ifconfig lagg0 laggproto lacp lagghash l2,l3 laggport sfxge0 laggport sfxge1

===sfxge tunings===
kenv hw.sfxge.${NIC0_ID}.max_rss_channels=7
kenv hw.sfxge.${NIC1_ID}.max_rss_channels=7
kenv hw.sfxge.tx_ring=2048
kenv hw.sfxge.rx_ring=4096
kenv hw.sfxge.tx_dpl_get_non_tcp_max=4096
kenv hw.sfxge.tx_dpl_put_max=2048
# This turns off AIM
sysctl dev.sfxge.${NIC0_ID}.int_mod=0
sysctl dev.sfxge.${NIC1_ID}.int_mod=0

PCAPs available on request.

(In reply to nonesuch from comment #0)

Were you able to narrow down this issue? I'm having a similar issue with FreeBSD 12.1-RELEASE-p7 and Arista 7280 switches in a MLAGG setup.

> The FreeBSD side sporadically stops setting the lacp distribution flag.

How did you check this?

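For anyone comparing captures: the "distribution flag" discussed here is the Distributing bit (0x20) of the actor state byte carried in each LACPDU. The following is only a minimal sketch for decoding a state byte copied out of a capture, assuming the standard IEEE 802.1AX bit layout. The helper names `lacp_decode_state` and `lacp_is_distributing` are hypothetical and are not part of FreeBSD or Wireshark.

```sh
#!/bin/sh
# Decode an LACP actor/partner state byte (bit layout per IEEE 802.1AX).
# Both function names are hypothetical helpers made up for this sketch.

lacp_decode_state() {
    state=$(( $1 ))
    flags=""
    if [ $(( state & 0x01 )) -ne 0 ]; then flags="$flags ACTIVITY"; fi
    if [ $(( state & 0x02 )) -ne 0 ]; then flags="$flags TIMEOUT"; fi
    if [ $(( state & 0x04 )) -ne 0 ]; then flags="$flags AGGREGATION"; fi
    if [ $(( state & 0x08 )) -ne 0 ]; then flags="$flags SYNC"; fi
    if [ $(( state & 0x10 )) -ne 0 ]; then flags="$flags COLLECTING"; fi
    if [ $(( state & 0x20 )) -ne 0 ]; then flags="$flags DISTRIBUTING"; fi
    if [ $(( state & 0x40 )) -ne 0 ]; then flags="$flags DEFAULTED"; fi
    if [ $(( state & 0x80 )) -ne 0 ]; then flags="$flags EXPIRED"; fi
    # Strip the leading space before printing the flag names.
    echo "${flags# }"
}

# Succeeds (exit status 0) only when the Distributing bit is set.
lacp_is_distributing() {
    [ $(( $1 & 0x20 )) -ne 0 ]
}

lacp_decode_state 0x3d   # ACTIVITY AGGREGATION SYNC COLLECTING DISTRIBUTING
lacp_decode_state 0x1d   # ACTIVITY AGGREGATION SYNC COLLECTING
```

A healthy active member would typically show a state like 0x3d; a frame like 0x1d (Collecting set, Distributing cleared) is the kind of LACPDU the original report describes.
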
Alex,

I have a SPAN port set up on each switch going into a tap / agg switch that feeds a Corvil packet capture box. From the Corvil I can pull the traffic by MAC and look at what was going on using Wireshark, going back about 3 weeks. The second thing I did was capture the details from net.link.lagg.lacp.debug=1. Comparing the two sources, I can see that the FreeBSD side seems to miss a beat by slowing down somehow. I am not exactly sure how or why this is happening.

So far I have two ideas why this is going on. Somewhere in 11.x, the default of the sysctl net.link.lagg.default_use_flowid was changed from 1 to 0. I suspect this may be causing the issue. Setting use_flowid to 1 appears to keep all of the ARP, LACP and non-IP protocols on queue 0. This prioritizes the traffic enough for the kernel LACP bits to respond fast enough.

Here is the post about the LACP change. Maybe we can ping someone from Multiplay.
https://www.mail-archive.com/svn-src-all@freebsd.org/msg155156.html
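
For anyone wanting to try the same thing, a rough sketch of the commands involved follows. It assumes the lagg is named lagg0 as in comment #0 and that the commands are run as root. The `use_flowid` option to ifconfig is what changes an already-created lagg; the sysctl only sets the default for laggs created afterwards. This is a config fragment to experiment with, not a confirmed fix.

```sh
# Enable LACP debug output (the sysctl named above).
sysctl net.link.lagg.lacp.debug=1

# Restore the old default for laggs created from now on.
sysctl net.link.lagg.default_use_flowid=1

# Turn on flowid use for the existing lagg (lagg0 is an assumption).
ifconfig lagg0 use_flowid
```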