Bug 209351

Summary: VLAN TX errors, possible performance regression after 10.1-STABLE (r281235)
Product: Base System
Component: kern
Status: New
Severity: Affects Some People
Priority: ---
Version: 11.0-STABLE
Hardware: amd64
OS: Any
Reporter: Jose Claudio Pastore <zclaudio>
Assignee: freebsd-net (Nobody) <net>
CC: gondim, np
Keywords: regression

Description Jose Claudio Pastore 2016-05-07 00:08:50 UTC
We run a BGP router on FreeBSD 10.1-STABLE (r281235), and it has worked fine for several years. After upgrading to any newer version I start seeing VLAN TX errors on the exact same hardware, simply by booting from an SSD with the newer system.

Details:

We route around 4 Gbit/s and 1.8 Mpps at peak; each individual port peaks at about 300 Kpps.

Our quality metrics are measured with:

ping -s 1472 -i 0.1 <our-other-ibgp-router>

As well as bidirectional iperf.

Systems working w/o problem:
- 10.1-STABLE / r281235

Systems tested with drops:
- 10.2-STABLE / r292035M
- 10.3-STABLE / r298705
- 11.0-CURRENT / r295683 (downloaded snapshot from ftp.freebsd.org)
- 11.0-CURRENT Melifaro Routing Branch / r297731M

While testing, when errors happen I can see output errors on the vlan port in the output of "netstat -w1 -I vlan6":

           input          vlan6           output
   packets  errs idrops      bytes    packets  errs      bytes colls
         1     0     0         66      30557     2   33310968     0
         1     0     0        105      31458     3   33912219     0
         2     0     0       2954      32001     8   34983986     0
         1     0     0       1512      33150     6   35942558     0
         1     0     0       1512      33654     4   37311862     0
         1     0     0       1512      34825     3   38213793     0
         3     0     0       1683      35376     4   39488912     0
         5     0     0       7280      32423     3   35551869     0
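
To put the per-second samples above in proportion, the output errs column can be summed against the output packets column. A minimal sketch, assuming the samples were saved to a file and that the column layout matches the header shown above; the function name oerr_summary and the file name netstat-vlan6.txt are hypothetical:

```sh
#!/bin/sh
# Sum output packets and output errs from saved "netstat -w1 -I <ifname>" samples.
# Assumed column order (from the header above):
#   in-packets in-errs idrops in-bytes out-packets out-errs out-bytes colls
oerr_summary() {
    # Only lines with 8 fields and a numeric first field are data rows;
    # this skips the two header lines and any blank lines.
    awk 'NF == 8 && $1 ~ /^[0-9]+$/ { pkts += $5; errs += $6 }
         END { printf "%d %d\n", pkts, errs }'
}

# Usage (hypothetical file name):
#   oerr_summary < netstat-vlan6.txt
```

For the eight samples above this prints "263444 33", that is, 33 output errors against 263444 output packets.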

Problems may happen under high load (~200Kpps) or low load (~30Kpps) on a vlan port. 

The observed frame loss never happens on untagged ports, only on VLAN ports.

The loss occurs with packets of 900 bytes and above, and the loss rate is noticeably higher with packets close to 1400 bytes (1472 is my reference size).

Loss rate on all listed systems other than r281235 is 9-19% with ping(1) and iperf, while it is 0% (no loss, or negligible loss) on r281235.
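
A sketch of how the size dependence could be measured in a repeatable way, assuming FreeBSD ping(1) run as root (intervals below one second need superuser) and its usual "N% packet loss" summary line. The function names, the packet count and the list of sizes are illustrative choices, not what was used to obtain the numbers above:

```sh
#!/bin/sh
# Extract the loss percentage from a ping(1) summary read on stdin.
loss_pct() {
    sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Ping one host with several payload sizes and print "size loss%" per line.
# Usage: sweep_sizes <our-other-ibgp-router>
sweep_sizes() {
    host=$1
    for size in 500 900 1200 1400 1472; do
        # -q prints only the summary, -c limits the run to 1000 packets.
        loss=$(ping -q -c 1000 -i 0.1 -s "$size" "$host" | loss_pct)
        echo "$size ${loss:-?}"
    done
}
```

Running sweep_sizes once on r281235 and once on a newer system would give two tables that can be compared line by line.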

Hardware tried:

- Intel 82599EB 10-Gigabit SFI/SFP+ Network Connection (2x2 on x8 PCIe bus, total 4x10G).
- Chelsio T520, 2x2 on x8PCIe bus, total 4x10G

The behavior is exactly the same on both, so it is not Intel-specific.

Same hardware:

I always test on the very same hardware. The router has two SSD drives: one with 10.1, which runs fine, and the other for testing the various versions of FreeBSD.

Sysctl/loader:

Only minor loader and sysctl confs are tweaked:

kern.hz=2000
net.inet.ip.redirect=1                # do not send IP redirects
net.inet.ip.accept_sourceroute=0      # drop source routed packets since they ca
net.inet.ip.sourceroute=0             # if source routed packets are accepted th
net.inet.tcp.drop_synfin=1            # SYN/FIN packets get dropped on initial c
net.inet.udp.blackhole=1              # drop udp packets destined for closed soc
net.inet.tcp.blackhole=2              # drop tcp packets destined for closed por
security.bsd.see_other_uids=0

No relevant errors occur on the physical ix(4) or cxl(4) ports.

It's very easy to reproduce in my environment: I just need to boot a newer system, and very soon some vlans start to drop packets that are not dropped on 10.1-STABLE. I can be contacted if a developer wants to ssh in. I can also update this PR with more information if needed.
Comment 1 gondim 2016-06-01 13:43:46 UTC
Hi,

I am the owner of the server. Thanks for your help in solving this problem. I believe that with the solution of this problem our FreeBSD will get stronger, providing more performance in demanding traffic.


Everything suggests that this is related to a problem with vlan, but as I am not a developer I cannot say for sure whether that is really the problem.

What I noticed is that on connections without vlan this problem does not happen.
Comment 2 Navdeep Parhar freebsd_committer freebsd_triage 2016-06-03 16:44:42 UTC
Can you run this for a few seconds (when the output errors are occurring) and provide the output?  You may have to "kldload dtraceall" first.

# dtrace -n 'fbt::*_transmit:return {@[probefunc, arg1] = count()}'
Comment 3 gondim 2016-08-25 15:31:10 UTC
Hi Navdeep,

Unfortunately we could not wait and had to replace our FreeBSD router with a Juniper MX 104. We are sad because we wanted to help solve this problem, which is sure to come up again for someone with high traffic who is using vlan.