Bug 246660 - Sporadic LACP Lagg Flap
Summary: Sporadic LACP Lagg Flap
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-net (Nobody)
Reported: 2020-05-22 14:11 UTC by nonesuch
Modified: 2020-09-16 02:41 UTC

Description nonesuch 2020-05-22 14:11:53 UTC
On FreeBSD 12-STABLE r354698 amd64, with LACP laggs on top of Solarflare sfxge NICs (7k and 8k series cards) attached to Arista MLAGGs (7050s or 7150s running EOS 4.0.19 and newer), the FreeBSD side sporadically stops setting the LACP distribution flag. This causes the switch to detect a link flap.


The setup is as follows: Dell R630s and R640s and Supermicro X9DRT Ivy Bridge boards, each set up with one Solarflare SFN7122F or SFN8522 running Solarflare firmware 7.1 and 7.4.4. Each server is set up as a router, with pf being used to NAT and filter traffic. There is one LACP lagg made up of the two ports going to two upstream Arista 7050 or 7150 switches in an MLAGG setup. The lagg carries anywhere from 5 to 50 VLANs at a time.

Now the complicated part. This issue happens at what appear to be random times, both on routers we have set up and left to "burn in" over a weekend with little or no traffic and on routers that are performing moderate amounts of work.
This issue also did not happen on 10.3-STABLE amd64.


Sysctls
=============================

#security.bsd.see_other_uids=0
net.inet.tcp.mssdflt=1460
net.inet.tcp.minmss=536
net.inet.tcp.rfc6675_pipe=1
net.inet.tcp.syncache.rexmtlimit=0  # (default 3)
net.inet.tcp.per_cpu_timers=1
net.inet.ip.fastforwarding=1
#kern.random.harvest.mask=65887
kern.random.harvest.mask=65537
kern.random.sys.harvest.ethernet=0
kern.random.sys.harvest.point_to_point=0
kern.random.sys.harvest.interrupt=0
hw.intr_storm_threshold=10000
kern.ipc.maxsockbuf=16777216
# socket buffers
net.inet.tcp.recvspace=4194304
net.inet.tcp.sendspace=2097152
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendbuf_auto=1
net.inet.tcp.recvbuf_auto=1
net.inet.tcp.sendbuf_inc=16384
net.inet.tcp.recvbuf_inc=524288
net.inet.tcp.cc.algorithm=htcp
net.inet.ip.intr_queue_maxlen=2048
net.route.netisr_maxqlen=2048
# Do not send IP redirects (enable fastforwarding path)
net.inet.ip.redirect=0
net.inet6.ip6.redirect=0

===loader.conf===
ipmi_load="YES"
boot_multicons="YES"
boot_serial="YES"
console="comconsole,vidconsole"
net.inet.tcp.tso="0"
autoboot_delay="5"
hw.mfi.mrsas_enable="1"
hw.usb.no_pf="1"        # Disable USB packet filtering
hw.usb.no_shutdown_wait="1"
hw.vga.textmode="1"     # Text mode
machdep.hyperthreading_allowed="0"
geom_mirror_load="YES"
kern.ipc.nmbclusters="1000000"
net.isr.maxqlimit="1000000"
kern.ipc.nmbjumbop=524288
net.isr.bindthreads="0"
net.isr.maxthreads="-1"
net.link.ifqmaxlen="2048"
net.pf.source_nodes_hashsize="1048576"
net.isr.defaultqlimit="2048"
net.inet.tcp.syncache.hashsize="1024"
net.inet.tcp.syncache.bucketlimit="100"
net.inet.tcp.tcbhashsize="65536"
vm.pmap.pti=0
hw.ibrs_disable=1


===LAGG Config===

ifconfig lagg0 laggproto lacp lagghash l2,l3 laggport sfxge0 laggport sfxge1


===sfxge tunings===
kenv hw.sfxge.${NIC0_ID}.max_rss_channels=7
kenv hw.sfxge.${NIC1_ID}.max_rss_channels=7
kenv hw.sfxge.tx_ring=2048
kenv hw.sfxge.rx_ring=4096
kenv hw.sfxge.tx_dpl_get_non_tcp_max=4096
kenv hw.sfxge.tx_dpl_put_max=2048
# This turns off AIM
sysctl dev.sfxge.${NIC0_ID}.int_mod=0
sysctl dev.sfxge.${NIC1_ID}.int_mod=0


PCAPs available on request.
Comment 1 Aleks Bunin 2020-08-22 16:24:45 UTC
(In reply to nonesuch from comment #0)
Were you able to narrow down this issue? I'm having a similar issue with FreeBSD 12.1-RELEASE-p7 and Arista 7280 switches in a MLAGG setup.

> The FreeBSD side sporadically stops setting the lacp distribution flag.

How did you check this?
Comment 2 nonesuch 2020-08-23 15:43:11 UTC
Aleks,
  I have a SPAN port set up on each switch going into a tap/aggregation switch that feeds a Corvil packet capture box. From the Corvil I can pull the traffic by MAC and look at what was going on using Wireshark, going back about 3 weeks. The second thing I did was capture the details from net.link.lagg.lacp.debug = 1. Looking at the two sources, I can see that the FreeBSD side of things seems to miss a beat by slowing down somehow. I am not exactly sure how or why this is happening.
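
  For anyone who wants to run the same check without a Corvil, the distribution flag is bit 5 (0x20) of the one-byte actor state field in each LACPDU. The following is only a minimal sketch, assuming the state byte has already been read out of a capture (for example from Wireshark, or from something like `tcpdump -i sfxge0 -e -vv ether proto 0x8809`). The function names `lacp_state_flags` and `lacp_is_distributing` are made up for illustration and are not part of any FreeBSD tool.

```shell
#!/bin/sh
# Decode the one-byte LACP actor (or partner) state field.
# Bit order, least significant bit first, follows IEEE 802.1AX.
# Hypothetical helper names; not part of FreeBSD.
lacp_state_flags() {
    state=$(( $1 ))          # accepts decimal or 0x-prefixed hex
    out=""
    bit=0
    for name in Activity Timeout Aggregation Synchronization \
                Collecting Distributing Defaulted Expired; do
        if [ $(( (state >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out,}$name"
        fi
        bit=$(( bit + 1 ))
    done
    echo "${out:-none}"
}

# Succeeds (exit status 0) only when the Distributing bit (0x20) is set.
lacp_is_distributing() {
    [ $(( $1 & 0x20 )) -ne 0 ]
}
```

  With these loaded, `lacp_state_flags 0x3d` prints `Activity,Aggregation,Synchronization,Collecting,Distributing`, while a state byte of 0x1d has the Distributing bit cleared, so `lacp_is_distributing 0x1d` fails.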

  So far I have two ideas why this is going on. Somewhere in 11.x the sysctl net.link.lagg.default_use_flowid was changed from a default of 1 to 0.
I suspect this may be causing the issue. Setting use_flowid to 1 appears to keep all of the ARP, LACP and non-IP protocols on queue 0. This prioritizes the traffic enough for the kernel LACP bits to respond fast enough.
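
For anyone who wants to try the flowid setting, a rough sketch follows. It assumes the lagg is named lagg0 as in the config above. The wrapper names `run` and `enable_lagg_flowid` are hypothetical, and the wrapper only prints the commands unless DRY_RUN=0 is set, so that nothing changes on a production router by accident. This is an experiment to test the theory, not a confirmed fix.

```shell
#!/bin/sh
# Sketch: turn flowid-based hashing back on for a lagg.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

enable_lagg_flowid() {
    lagg="${1:-lagg0}"       # assumed interface name
    # Default applied to laggs created after this point.
    run sysctl net.link.lagg.default_use_flowid=1
    # Per-interface switch for a lagg that already exists.
    run ifconfig "$lagg" use_flowid
}
```

Calling `enable_lagg_flowid lagg0` in dry-run mode prints the two commands. To keep the sysctl across reboots it would also need to go into /etc/sysctl.conf.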

Here is the post about the LACP change. Maybe we can ping someone from Multiplay.
https://www.mail-archive.com/svn-src-all@freebsd.org/msg155156.html