Hi,

I'm chasing a performance problem when using FreeBSD as a router. The system has 4 Mellanox ConnectX-3 cards, bonded into 2 LAGGs with LACP: mlxen0 and mlxen1 form lagg0, and mlxen2 and mlxen3 form lagg1. Linux boxes, also with ConnectX-3 cards, are attached directly to these interfaces.

This is the ifconfig output from the router:

mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=8d00b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE>
    ether 24:8a:07:f7:5b:30
    hwaddr 24:8a:07:f7:5b:30
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    media: Ethernet autoselect (10Gbase-CX4 <full-duplex,rxpause,txpause>)
    status: active
mlxen1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=8d00b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE>
    ether 24:8a:07:f7:5b:30
    hwaddr 24:8a:07:f7:5b:31
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    media: Ethernet autoselect (10Gbase-CX4 <full-duplex,rxpause,txpause>)
    status: active
mlxen2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=8d00b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE>
    ether 24:8a:07:f7:5f:10
    hwaddr 24:8a:07:f7:5f:10
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    media: Ethernet autoselect (10Gbase-CX4 <full-duplex,rxpause,txpause>)
    status: active
mlxen3: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=8d00b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE>
    ether 24:8a:07:f7:5f:10
    hwaddr 24:8a:07:f7:5f:11
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    media: Ethernet autoselect (10Gbase-CX4 <full-duplex,rxpause,txpause>)
    status: active
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=8d00b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE>
    ether 24:8a:07:f7:5b:30
    inet6 fe80::268a:7ff:fef7:5b30%lagg0 prefixlen 64 scopeid 0xb
    inet 10.22.1.1 netmask 0xffffff00 broadcast 10.22.1.255
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    media: Ethernet autoselect
    status: active
    groups: lagg
    laggproto lacp lagghash l2,l3,l4
    laggport: mlxen0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
    laggport: mlxen1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
lagg1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=8d00b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE>
    ether 24:8a:07:f7:5f:10
    inet6 fe80::268a:7ff:fef7:5f10%lagg1 prefixlen 64 scopeid 0xc
    inet 10.22.2.1 netmask 0xffffff00 broadcast 10.22.2.255
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    media: Ethernet autoselect
    status: active
    groups: lagg
    laggproto lacp lagghash l2,l3,l4
    laggport: mlxen2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
    laggport: mlxen3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>

When I connect the 2 Linux boxes directly to each other, I can easily achieve 20 Gbit/s.

iperf server: iperf3 -V -p 5000 -f m -s
iperf client: iperf3 -p 5000 -f m -V -c 10.22.2.10 -t 30 -P 10

When I put the BSD router in between, I can only achieve 10 Gbit/s. The ifconfig man page states:

lagghash option[,option]
    Set the packet layers to hash for aggregation protocols which load balance. The default is "l2,l3,l4". The options can be combined using commas.
    l2    src/dst mac address and optional vlan number.
    l3    src/dst address for IPv4 or IPv6.
    l4    src/dst port for TCP/UDP/SCTP.

The problem is that the l4 part does not seem to hold: iperf in multistream mode (-P 10) uses multiple source ports, yet the load arrives on lagg0 split at 5 Gbit/s each across mlxen0 and mlxen1, and leaves via lagg1 through only one of its interfaces at 10 Gbit/s. If I start a second iperf instance listening on a different port and run both, the traffic flows at 20 Gbit/s, correctly shared.

Searching the bug tracker and forums and asking on IRC didn't give any good answer. I have already played with LACP strict mode, and enabling/disabling flowid does not help. Right now I'm not sure whether this is a bug in the kernel or in the documentation, but it would be cool if we could include src and dst ports in the hash calculation.

Thanks
Michael
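To make the man page excerpt concrete, here is a minimal Python sketch of how a layer-based hash could map flows to lagg ports. It is only a model under stated assumptions: the kernel uses its own hash function rather than CRC32, and the MAC addresses, the client address 10.22.1.10 and the source port range are made-up values (only the server address 10.22.2.10 and destination port 5000 come from the report).

```python
import zlib


def pick_port(flow, layers, n_ports):
    """Map a flow to a port index by hashing only the selected layers.

    CRC32 is a stand-in for illustration; it is NOT the kernel's hash.
    """
    src_mac, dst_mac, src_ip, dst_ip, src_port, dst_port = flow
    fields = []
    if "l2" in layers:
        fields += [src_mac, dst_mac]
    if "l3" in layers:
        fields += [src_ip, dst_ip]
    if "l4" in layers:
        fields += [str(src_port), str(dst_port)]
    key = "|".join(fields).encode()
    return zlib.crc32(key) % n_ports


# Ten iperf3-style streams: identical MACs, IPs and destination port,
# differing only in source port (all values hypothetical except
# 10.22.2.10 and port 5000).
flows = [
    ("aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb",
     "10.22.1.10", "10.22.2.10", 50000 + i, 5000)
    for i in range(10)
]

with_l4 = [pick_port(f, ("l2", "l3", "l4"), 2) for f in flows]
without_l4 = [pick_port(f, ("l2", "l3"), 2) for f in flows]

print("l2,l3,l4:", with_l4)
print("l2,l3   :", without_l4)
```

Without l4 the hash key is identical for every stream, so all ten necessarily map to the same port. With l4 the key differs per stream, so the streams can be spread over both ports; the exact split depends on the hash function.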
Please run:

    ifconfig lagg0 -use_flowid
    ifconfig lagg1 -use_flowid

Then repeat your tests. This should fix it.
A commit references this bug:

Author: eugen
Date: Sat Dec 22 11:38:55 UTC 2018
New revision: 342367
URL: https://svnweb.freebsd.org/changeset/base/342367

Log:
  ifconfig.8, lagg.4: fix documentation bug: -use_flowid needs to be used to force local hash computation and disable usage of RSS hash provided by driver.

  PR: 234242
  MFC after: 1 week

Changes:
  head/sbin/ifconfig/ifconfig.8
  head/share/man/man4/lagg.4
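The two code paths named in this commit log can be illustrated with a simplified Python model. With use_flowid enabled, the port is derived from the flow ID supplied by the NIC driver, shifted right by flowid_shift (16 in the output in this thread); with -use_flowid, lagg hashes the packet headers itself. The flow ID values below are invented purely to show the effect of the shift. Nothing here claims that the ConnectX-3 driver actually produces such IDs.

```python
def port_from_flowid(flowid, flowid_shift, n_ports):
    """Simplified model of the use_flowid path: shift the driver-supplied
    flow ID, then reduce it modulo the number of active ports."""
    return (flowid >> flowid_shift) % n_ports


# Hypothetical flow IDs that differ only in their low 16 bits.
flowids = [0x0000A000 + i for i in range(10)]

shifted = [port_from_flowid(f, 16, 2) for f in flowids]
unshifted = [port_from_flowid(f, 0, 2) for f in flowids]

print("flowid_shift=16:", shifted)
print("flowid_shift=0 :", unshifted)
```

In this model, flow IDs whose variation sits entirely below the shifted-out bits all select the same port, however many distinct flows exist. Whether that applies to a given setup depends on how the driver fills in the flow ID.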
Dear Eugene,

Thanks for taking the time to look into this. I applied the commands, but throughput is still only 9.4 Gbit/s.

root@Router:~ # sysctl -a | egrep 'lagg|lacp'
net.link.lagg.lacp.default_strict_mode: 1
net.link.lagg.lacp.debug: 0
net.link.lagg.default_flowid_shift: 16
net.link.lagg.default_use_flowid: 0
net.link.lagg.failover_rx_all: 0

I can also supply screenshots that show packets coming in on mlxen0 and mlxen1 and going out on mlxen2 only.

Best,
Michael
When you run a test between two hosts, the MAC and IP addresses are always the same, so they do not supply any variance; only the L4 headers (ports) can add it. Please double-check that your test creates multiple flows:

    pkg install trafshow
    trafshow -a 32 -npi lagg0
Created attachment 200386: trafshow
The iperf command with option -P 10 creates 10 simultaneous streams, so the MAC addresses, IP addresses and destination port are always the same: there are 10 streams with 10 different source ports going to one destination port. When I start a second server instance on a different destination port, the balancing works fine.

That was my initial question: the man page states that l4 includes the src/dst port in the hash calculation, so the traffic should already be distributed, since the source port differs.

Enclosed are the requested trafshow screenshot, the iperf command running at about 9 Gbit/s, and a tcpdump showing the different source ports.

Again, I'm not sure whether this is a bug, a misbehavior or just wrong documentation.

Thank you for your time, very much appreciated! :)

Michael
Please show the output of these commands:

    ifconfig -v lagg0
    ifconfig -v lagg1

Note the "-v" flag, which enables more detail.
root@Router:~ # ifconfig -v lagg0
lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    ether 24:8a:07:f7:5b:30
    inet6 fe80::268a:7ff:fef7:5b30%lagg0 prefixlen 64 scopeid 0xb
    inet 10.22.1.1 netmask 0xffffff00 broadcast 10.22.1.255
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    media: Ethernet autoselect
    status: active
    groups: lagg
    laggproto lacp lagghash l2,l3,l4
    lagg options:
        flags=10<LACP_STRICT>
        flowid_shift: 16
    lagg statistics:
        active ports: 2
        flapping: 0
    lag id: [(8000,24-8A-07-F7-5B-30,0172,0000,0000),
             (FFFF,24-8A-07-F7-5C-01,000F,0000,0000)]
    laggport: mlxen0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
        [(8000,24-8A-07-F7-5B-30,0172,8000,0007),
         (FFFF,24-8A-07-F7-5C-01,000F,00FF,0002)]
    laggport: mlxen1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
        [(8000,24-8A-07-F7-5B-30,0172,8000,0008),
         (FFFF,24-8A-07-F7-5C-01,000F,00FF,0001)]

root@Router:~ # ifconfig -v lagg1
lagg1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    ether 24:8a:07:f7:5f:10
    inet6 fe80::268a:7ff:fef7:5f10%lagg1 prefixlen 64 scopeid 0xc
    inet 10.22.2.1 netmask 0xffffff00 broadcast 10.22.2.255
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    media: Ethernet autoselect
    status: active
    groups: lagg
    laggproto lacp lagghash l2,l3,l4
    lagg options:
        flags=10<LACP_STRICT>
        flowid_shift: 16
    lagg statistics:
        active ports: 2
        flapping: 0
    lag id: [(8000,24-8A-07-F7-5F-10,0192,0000,0000),
             (FFFF,24-8A-07-F7-5B-01,000F,0000,0000)]
    laggport: mlxen2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
        [(8000,24-8A-07-F7-5F-10,0192,8000,0009),
         (FFFF,24-8A-07-F7-5B-01,000F,00FF,0002)]
    laggport: mlxen3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
        [(8000,24-8A-07-F7-5F-10,0192,8000,000A),
         (FFFF,24-8A-07-F7-5B-01,000F,00FF,0001)]
Does it help if you change lagghash to L4 only?

    ifconfig lagg0 lagghash l4
    ifconfig lagg1 lagghash l4
By default, the FreeBSD kernel does not randomize source ports for applications like iperf3, and it may happen that all ports assigned in the regular way hash to a single LACP port. Please enable port randomization using "sysctl net.inet.ip.random_id=1" and retry the test.

Also, your trafshow screenshot shows only a single traffic stream in each direction, which is odd.
Dear Eugene,

Thanks for the tip with lagghash l4; the result is the same. trafshow showed only one stream because of -a 32, which aggregates per IP.

I'll attach 3 screenshots: one of trafshow without aggregation, showing multiple streams, and 2 showing the output of netstat -hw1 -I XXXX. The first netstat screenshot covers the two interfaces of the lagg with inbound traffic, where you can see the traffic split across both interfaces (5 Gbit/s each). The second covers the lagg with outbound traffic, where the full 10 Gbit/s goes through only one interface.

randomize_ip was also enabled.
Created attachment 200392: trafshow2
Created attachment 200393: lagg inbound
Created attachment 200394: lagg outbound
Ports are still not randomized. If you run iperf3 on a Linux host, you need to enable randomization there.
Dear Eugene,

Ubuntu doesn't offer a sysctl parameter for port randomization; I can only widen the ephemeral port range.

I made a little progress: when I raise the number of parallel streams to 20, roughly every 10th test reaches 18 Gbit/s. I don't know exactly how the source port selection works; it seems to work in general, but not in a predictable manner. For backup software that uses multiple streams to reach higher throughput, even balancing is not guaranteed. Linux seems to do this a bit differently.

I'll install FreeBSD on the clients after the holidays to check whether it works better. Can I leave this open until final testing?

Thanks for your time! :)
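For a rough sense of what "not predictable" means here, the following Python sketch computes how likely various splits are if each stream were assigned to one of two ports independently and uniformly at random. That uniformity is an assumption made for the sake of the estimate; the real hash is deterministic for a given set of addresses and ports, and streams need not carry equal load.

```python
from math import comb


def prob_all_on_one_port(n_streams, n_ports=2):
    """Probability that every stream lands on the same port,
    assuming independent uniform assignment."""
    return n_ports * (1 / n_ports) ** n_streams


def prob_smaller_side_at_least(n_streams, min_streams):
    """For 2 ports: probability that the less-loaded port still
    carries at least `min_streams` streams."""
    ways = sum(
        comb(n_streams, k)
        for k in range(min_streams, n_streams - min_streams + 1)
    )
    return ways / 2 ** n_streams


for n in (10, 20):
    print(n, "streams, all on one port:", prob_all_on_one_port(n))
    print(n, "streams, perfectly even split:",
          prob_smaller_side_at_least(n, n // 2))
```

Under this model, ten streams all landing on one port would happen in about 0.2% of runs, so seeing it in every run suggests that the value being hashed did not vary between streams. A perfectly even split is also far from guaranteed: it occurs in only about 25% of runs with 10 streams and about 18% with 20.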
A commit references this bug:

Author: eugen
Date: Sat Dec 29 00:41:21 UTC 2018
New revision: 342584
URL: https://svnweb.freebsd.org/changeset/base/342584

Log:
  MFC r342367: ifconfig.8, lagg.4: fix documentation bug: -use_flowid needs to be used to force local hash computation and disable usage of RSS hash provided by driver.

  PR: 234242

Changes:
_U  stable/12/
  stable/12/sbin/ifconfig/ifconfig.8
  stable/12/share/man/man4/lagg.4
A commit references this bug:

Author: eugen
Date: Sat Dec 29 00:42:11 UTC 2018
New revision: 342585
URL: https://svnweb.freebsd.org/changeset/base/342585

Log:
  MFC r342367: ifconfig.8, lagg.4: fix documentation bug: -use_flowid needs to be used to force local hash computation and disable usage of RSS hash provided by driver.

  PR: 234242

Changes:
_U  stable/11/
  stable/11/sbin/ifconfig/ifconfig.8
  stable/11/share/man/man4/lagg.4
A commit references this bug:

Author: eugen
Date: Sat Dec 29 00:44:12 UTC 2018
New revision: 342586
URL: https://svnweb.freebsd.org/changeset/base/342586

Log:
  MFC r342367: ifconfig.8, lagg.4: fix documentation bug: -use_flowid needs to be used to force local hash computation and disable usage of RSS hash provided by driver.

  PR: 234242

Changes:
_U  stable/10/
  stable/10/sbin/ifconfig/ifconfig.8
  stable/10/share/man/man4/lagg.4
Dear Eugene,

It seems I made a mistake: I set the sysctl for use_flowid and ran the tests, but since it is a default_ sysctl, it only takes effect after a reboot. This may have changed with 11.X: googling for flowid shows people setting this value directly via .X.use_flowid, where X is the lagg interface number.

In my ifconfig -vvvvv lagg0 output I saw:

    lagg options: flags=10<LACP_STRICT>

Now I set:

    ifconfig lagg0 use_flowid
    ifconfig lagg1 use_flowid

and I consistently get around 15-18 Gbit/s with each test.

Sorry if I wasted your time!

Thanks,
Michael
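Since the default_ sysctl and the per-interface setting can disagree, it helps to read the effective state from the "lagg options" line of `ifconfig -v`. Below is a small Python sketch that extracts the flag names from that output. The flag name USE_FLOWID and the second sample string are assumptions for illustration; only the flags=10<LACP_STRICT> sample is taken from this thread.

```python
import re


def lagg_option_flags(ifconfig_output):
    """Return the set of flag names listed under 'lagg options:' in
    `ifconfig -v laggN` output, or an empty set if none are found."""
    match = re.search(
        r"lagg options:\s*flags=[0-9a-fA-F]+<([^>]*)>", ifconfig_output
    )
    if match is None:
        return set()
    return {name for name in match.group(1).split(",") if name}


# Sample taken from the output posted in this thread.
seen = "lagg options: flags=10<LACP_STRICT> flowid_shift: 16"
# Hypothetical output with the flow ID option enabled (assumed flag name).
assumed = "lagg options: flags=11<USE_FLOWID,LACP_STRICT> flowid_shift: 16"

print(lagg_option_flags(seen))
print("USE_FLOWID" in lagg_option_flags(assumed))
```

Checking the interface itself in this way avoids relying on net.link.lagg.default_use_flowid, which describes the default for new lagg interfaces and not the state of existing ones.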