On a BGP router I am running FreeBSD 10.1-STABLE, r281235, and it has worked fine for several years now. After upgrading to any newer version I start having vlan TX errors on the exact same hardware, just by booting an SSD with a newer system.

Details:

We route around 4Gbit/s and 1.8Mpps at peak, while per port interface we peak at 300Kpps.

Our quality metrics are measured with:

    ping -s 1472 -i 0.1 <our-other-ibgp-router>

as well as bidirectional iperf.

Systems working w/o problem:
- 10.1-STABLE / r281235

Systems tested with drops:
- 10.2-STABLE / r292035M
- 10.3-STABLE / r298705
- 11.0-CURRENT / r295683 (downloaded snapshot from ftp.freebsd.org)
- 11.0-CURRENT Melifaro Routing Branch / r297731M

While testing, when errors happen I can see output errs on the vlan port in the output of "netstat -w1 -I vlan6":

                input          vlan6           output
     packets errs idrops   bytes  packets errs     bytes colls
           1    0      0      66    30557    2  33310968     0
           1    0      0     105    31458    3  33912219     0
           2    0      0    2954    32001    8  34983986     0
           1    0      0    1512    33150    6  35942558     0
           1    0      0    1512    33654    4  37311862     0
           1    0      0    1512    34825    3  38213793     0
           3    0      0    1683    35376    4  39488912     0
           5    0      0    7280    32423    3  35551869     0

Problems may happen under high load (~200Kpps) or low load (~30Kpps) on a vlan port.

The observed frame loss never happens on untagged ports, only on vlan ports.

The loss happens with packets of 900 bytes and above, and the loss rate is noticeably higher with packets close to 1400 bytes (1472 is my reference size).

On every listed system other than r281235 the loss rate is 9-19% with ping(1) and iperf, while it is 0% (no loss, or negligible loss) on r281235.

Hardware tried:
- Intel 82599EB 10-Gigabit SFI/SFP+ Network Connection (2x2 on x8 PCIe bus, total 4x10G)
- Chelsio T520, 2x2 on x8 PCIe bus, total 4x10G

Both show exactly the same behavior, so it is not Intel related/exclusive.

Same hardware: I always test on the very same hardware. I have two SSD drives in this router, one with the 10.1 system which runs fine and the other to test the various versions of FreeBSD.
Sysctl/loader: Only minor loader and sysctl confs are tweaked:

    kern.hz=2000
    net.inet.ip.redirect=1             # do not send IP redirects
    net.inet.ip.accept_sourceroute=0   # drop source routed packets since they ca
    net.inet.ip.sourceroute=0          # if source routed packets are accepted th
    net.inet.tcp.drop_synfin=1         # SYN/FIN packets get dropped on initial c
    net.inet.udp.blackhole=1           # drop udp packets destined for closed soc
    net.inet.tcp.blackhole=2           # drop tcp packets destined for closed por
    security.bsd.see_other_uids=0

Netstat output when errors happen: identical to the "netstat -w1 -I vlan6" sample for vlan6 given earlier in this report.

No relevant errors occur on the physical ix(4) or cxl(4) ports.

It's very easy to reproduce in my environment: I just need to boot a newer system, and very soon some vlans start to drop packets which are not dropped on 10.1-STABLE. I can be contacted if a developer wants to ssh in. I can also update this PR with more information if needed.
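The per-second "netstat -w1 -I vlan6" samples in the report can be reduced to a single output error ratio, which makes runs on different revisions easier to compare. The following is a minimal sketch, not part of the original report: the function name `output_error_rate` is hypothetical, the sample rows are copied from the report, and dividing `errs` by `packets` is an assumption about how to normalise the counters.

```python
# Sample rows copied from the vlan6 netstat output in the report.
SAMPLE = """\
input vlan6 output
packets errs idrops bytes packets errs bytes colls
1 0 0 66 30557 2 33310968 0
1 0 0 105 31458 3 33912219 0
2 0 0 2954 32001 8 34983986 0
1 0 0 1512 33150 6 35942558 0
1 0 0 1512 33654 4 37311862 0
1 0 0 1512 34825 3 38213793 0
3 0 0 1683 35376 4 39488912 0
5 0 0 7280 32423 3 35551869 0
"""


def output_error_rate(netstat_text):
    """Sum output packets and output errs over all sample rows.

    Returns (packets, errs, errs / packets). Assumes the 8-column
    layout shown in the report: columns 5 and 6 are output packets
    and output errs.
    """
    total_pkts = 0
    total_errs = 0
    for line in netstat_text.splitlines():
        fields = line.split()
        # Data rows have exactly 8 numeric columns; header rows are skipped.
        if len(fields) != 8 or not all(f.isdigit() for f in fields):
            continue
        total_pkts += int(fields[4])
        total_errs += int(fields[5])
    ratio = total_errs / total_pkts if total_pkts else 0.0
    return total_pkts, total_errs, ratio


if __name__ == "__main__":
    pkts, errs, ratio = output_error_rate(SAMPLE)
    print(f"output packets={pkts} errs={errs} ratio={ratio:.6f}")
```

For the eight samples in the report this sums to 263444 output packets and 33 output errs.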
Hi, I am the owner of the server. Thanks for your help in solving this problem. I believe that solving it will make FreeBSD stronger and give it more performance under demanding traffic. Everything suggests that the problem is related to vlan, but as I am not a developer I cannot say for sure that this is really the cause. What I have noticed is that the problem does not happen on connections without vlan.
Can you run this for a few seconds (when the output errors are occurring) and provide the output? You may have to "kldload dtraceall" first.

    # dtrace -n 'fbt::*_transmit:return {@[probefunc, arg1] = count()}'
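In the one-liner above, arg1 on an fbt return probe is the function's return value, so each aggregation key pairs a *_transmit function with the code it returned; a non-zero value is normally an errno. A minimal sketch for turning those numbers into names, assuming it is run on the FreeBSD host itself (errno numbering differs between operating systems); `describe_return` is a hypothetical helper name, and ENOBUFS and ENETDOWN are only illustrative codes, not values observed in this PR:

```python
import errno
import os


def describe_return(code):
    """Translate a *_transmit return value (arg1 in the DTrace keys) into text."""
    if code == 0:
        return "0 (success)"
    # errno.errorcode maps the local OS's errno numbers to symbolic names.
    name = errno.errorcode.get(code, "unknown errno")
    return f"{name} ({code}): {os.strerror(code)}"


if __name__ == "__main__":
    # Illustrative codes only; substitute the arg1 values from the DTrace output.
    for code in (0, errno.ENOBUFS, errno.ENETDOWN):
        print(describe_return(code))
```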
Hi Navdeep, Unfortunately we could not wait and had to replace our FreeBSD router with a Juniper MX 104. We are sad about this, because we wanted to help solve this problem, which is sure to come up again for someone who has high traffic and uses vlans.