Bug 234242 - LACP l2,l3,l4 load sharing only respects dst port
Summary: LACP l2,l3,l4 load sharing only respects dst port
Status: Closed Not A Bug
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-STABLE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-net (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-12-21 10:20 UTC by Michael Muenz
Modified: 2018-12-31 06:38 UTC

See Also:


Attachments
trafshow (43.61 KB, image/png)
2018-12-23 09:42 UTC, Michael Muenz
trafshow2 (31.26 KB, image/png)
2018-12-23 14:39 UTC, Michael Muenz
lagg inbound (32.45 KB, image/png)
2018-12-23 14:40 UTC, Michael Muenz
lagg outbound (31.60 KB, image/png)
2018-12-23 14:40 UTC, Michael Muenz

Description Michael Muenz 2018-12-21 10:20:11 UTC
Hi,

I'm chasing a performance problem when using FreeBSD as a router.
The system has 4 Mellanox ConnectX-3 cards, bonded into 2 LAGGs with LACP.

mlxen0 and mlxen1 form lagg0, and mlxen2 and mlxen3 form lagg1.

Linux boxes, also with ConnectX-3 cards, are connected directly to these interfaces.

This is the ifconfig from the router:


mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8d00b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE>
        ether 24:8a:07:f7:5b:30
        hwaddr 24:8a:07:f7:5b:30
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-CX4 <full-duplex,rxpause,txpause>)
        status: active
mlxen1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8d00b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE>
        ether 24:8a:07:f7:5b:30
        hwaddr 24:8a:07:f7:5b:31
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-CX4 <full-duplex,rxpause,txpause>)
        status: active
mlxen2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8d00b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE>
        ether 24:8a:07:f7:5f:10
        hwaddr 24:8a:07:f7:5f:10
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-CX4 <full-duplex,rxpause,txpause>)
        status: active
mlxen3: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8d00b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE>
        ether 24:8a:07:f7:5f:10
        hwaddr 24:8a:07:f7:5f:11
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-CX4 <full-duplex,rxpause,txpause>)
        status: active
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8d00b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE>
        ether 24:8a:07:f7:5b:30
        inet6 fe80::268a:7ff:fef7:5b30%lagg0 prefixlen 64 scopeid 0xb
        inet 10.22.1.1 netmask 0xffffff00 broadcast 10.22.1.255
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        laggport: mlxen0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: mlxen1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
lagg1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8d00b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE>
        ether 24:8a:07:f7:5f:10
        inet6 fe80::268a:7ff:fef7:5f10%lagg1 prefixlen 64 scopeid 0xc
        inet 10.22.2.1 netmask 0xffffff00 broadcast 10.22.2.255
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        laggport: mlxen2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: mlxen3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>


When I connect the 2 Linux boxes directly to each other, I can easily achieve 20 Gbit/s.

iperf server: iperf3 -V -p 5000 -f m -s
iperf client: iperf3 -p 5000 -f m -V -c 10.22.2.10 -t 30 -P 10

When I put the BSD router in between, I can only achieve 10 Gbit/s.

The ifconfig man page states:

     lagghash option[,option]
             Set the packet layers to hash for aggregation protocols which
             load balance.  The default is "l2,l3,l4".  The options can be
             combined using commas.

             l2      src/dst mac address and optional vlan number.
             l3      src/dst address for IPv4 or IPv6.
             l4      src/dst port for TCP/UDP/SCTP.

The problem is that the l4 description does not seem to hold: iperf in multistream mode (-P 10) uses multiple source ports, yet the load arrives on lagg0 split across mlxen0 and mlxen1 at 5 Gbit/s each and leaves via lagg1 through only one of its interfaces at 10 Gbit/s.

If I start a second iperf instance listening on a different port and run both, the traffic is shared correctly at 20 Gbit/s.
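
For readers following along, here is a minimal sketch of what hashing on l3 and l4 fields is expected to do with such a test. It is an illustration only: `zlib.crc32` stands in for the kernel's own hash function, and the client address `10.22.1.10` and the source ports are made up. Only the server address and destination port come from the iperf3 commands above.

```python
import zlib
from collections import Counter

def pick_port(src_ip, dst_ip, src_port, dst_port, n_ports=2):
    """Map a flow to a member port by hashing its L3/L4 fields.

    zlib.crc32 is only a stand-in; the kernel uses its own hash function.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_ports

# Ten hypothetical iperf3 streams: same addresses and destination port,
# different (made-up) source ports.
flows = [("10.22.1.10", "10.22.2.10", 50000 + i, 5000) for i in range(10)]

# Count how many flows land on each of the two member ports.
spread = Counter(pick_port(*f) for f in flows)
print(dict(spread))
```

If the source port really takes part in the hash, the ten flows should not all map to the same member port except by bad luck. The exact split depends on the hash function and on the ports chosen.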


Searching the bug tracker and forums and asking on IRC didn't give any good answer. I have already played with LACP strict mode, and enabling/disabling flowid does not help.

Right now I'm not sure whether this is a bug in the kernel or in the documentation, but it would be cool if we could include src and dst ports in the hash calculation.


Thanks
Michael
Comment 1 Eugene Grosbein freebsd_committer freebsd_triage 2018-12-22 11:22:16 UTC
Please do:

ifconfig lagg0 -use_flowid
ifconfig lagg1 -use_flowid

Then repeat your tests. This should fix it.
Comment 2 commit-hook freebsd_committer freebsd_triage 2018-12-22 11:39:13 UTC
A commit references this bug:

Author: eugen
Date: Sat Dec 22 11:38:55 UTC 2018
New revision: 342367
URL: https://svnweb.freebsd.org/changeset/base/342367

Log:
  ifconfig.4, lagg.4: fix documentation bug: -use_flowid needs to be used
  to force local hash computation and disable usage of RSS hash
  provided by driver.

  PR:		234242
  MFC after:	1 week

Changes:
  head/sbin/ifconfig/ifconfig.8
  head/share/man/man4/lagg.4
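
The distinction this commit documents can be pictured with a simplified model. This is a sketch of the idea rather than the kernel code: the function and parameter names are hypothetical, the flow IDs are made up, and the shift of 16 is taken from the `flowid_shift: 16` value shown elsewhere in this report.

```python
def select_tx_port(flowid, local_hash, n_ports, use_flowid, flowid_shift=16):
    """Simplified model of choosing an egress member port.

    flowid:     hash attached to the packet by the NIC driver (None if absent)
    local_hash: hash computed by lagg from the fields named in lagghash
    """
    if use_flowid and flowid is not None:
        h = flowid >> flowid_shift  # only the bits above the shift are used
    else:
        h = local_hash
    return h % n_ports

# Two made-up flow IDs that differ only in their low 16 bits pick the same port.
a = select_tx_port(0x00010001, local_hash=7, n_ports=2, use_flowid=True)
b = select_tx_port(0x00010002, local_hash=8, n_ports=2, use_flowid=True)
print(a, b)  # 1 1

# With use_flowid off, the locally computed hash decides instead.
c = select_tx_port(0x00010001, local_hash=7, n_ports=2, use_flowid=False)
d = select_tx_port(0x00010002, local_hash=8, n_ports=2, use_flowid=False)
print(c, d)  # 1 0
```

In this model the lagghash setting only matters on the second path, which is why the flag and the hash layers have to be looked at together.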
Comment 3 Michael Muenz 2018-12-22 13:48:53 UTC
Dear Eugene,

Thanks for taking the time to look into this.
I applied the commands, but it's still at only 9.4 Gbit/s.

root@Router:~ # sysctl -a | egrep 'lagg|lacp'
net.link.lagg.lacp.default_strict_mode: 1
net.link.lagg.lacp.debug: 0
net.link.lagg.default_flowid_shift: 16
net.link.lagg.default_use_flowid: 0
net.link.lagg.failover_rx_all: 0

I can also supply screenshots where you can see packets incoming on mlxen0 and mlxen1 and outgoing on mlxen2 only. 


Best,
Michael
Comment 4 Eugene Grosbein freebsd_committer freebsd_triage 2018-12-23 08:57:47 UTC
While running a test between two hosts, the MAC and IP addresses are always the same, so they do not supply any variance and only the L4 headers (ports) can add it. Please double-check that your test creates multiple flows:

pkg install trafshow
trafshow -a 32 -npi lagg0
Comment 5 Michael Muenz 2018-12-23 09:42:18 UTC
Created attachment 200386 [details]
trafshow
Comment 6 Michael Muenz 2018-12-23 09:42:34 UTC
The iperf command with option -P 10 creates 10 simultaneous streams, so the MAC, IP and destination port are always the same. There are 10 streams with 10 different source ports to one destination port.

When starting a second server instance on a different destination port, the balancing works fine. That was my initial question: the man page states that the l4 hash includes src/dst port, so the traffic should already be distributed, as the src port differs.

Enclosed is the requested screenshot of trafshow, also the iperf command at 9,GBit and a tcpdump showing different source ports.


Again, I'm not sure whether it's a bug, a misbehavior or just wrong documentation.
Thank you for your time, it's much appreciated! :)


Michael
Comment 7 Eugene Grosbein freebsd_committer freebsd_triage 2018-12-23 10:08:17 UTC
Please show output of commands:

ifconfig -v lagg0
ifconfig -v lagg1

Note the "-v" flag, which enables more detail.
Comment 8 Michael Muenz 2018-12-23 10:29:21 UTC
root@Router:~ # ifconfig -v lagg0
lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 24:8a:07:f7:5b:30
        inet6 fe80::268a:7ff:fef7:5b30%lagg0 prefixlen 64 scopeid 0xb
        inet 10.22.1.1 netmask 0xffffff00 broadcast 10.22.1.255
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        lagg options:
                flags=10<LACP_STRICT>
                flowid_shift: 16
        lagg statistics:
                active ports: 2
                flapping: 0
        lag id: [(8000,24-8A-07-F7-5B-30,0172,0000,0000),
                 (FFFF,24-8A-07-F7-5C-01,000F,0000,0000)]
        laggport: mlxen0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
                [(8000,24-8A-07-F7-5B-30,0172,8000,0007),
                 (FFFF,24-8A-07-F7-5C-01,000F,00FF,0002)]
        laggport: mlxen1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
                [(8000,24-8A-07-F7-5B-30,0172,8000,0008),
                 (FFFF,24-8A-07-F7-5C-01,000F,00FF,0001)]
root@Router:~ # ifconfig -v lagg1
lagg1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 24:8a:07:f7:5f:10
        inet6 fe80::268a:7ff:fef7:5f10%lagg1 prefixlen 64 scopeid 0xc
        inet 10.22.2.1 netmask 0xffffff00 broadcast 10.22.2.255
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp lagghash l2,l3,l4
        lagg options:
                flags=10<LACP_STRICT>
                flowid_shift: 16
        lagg statistics:
                active ports: 2
                flapping: 0
        lag id: [(8000,24-8A-07-F7-5F-10,0192,0000,0000),
                 (FFFF,24-8A-07-F7-5B-01,000F,0000,0000)]
        laggport: mlxen2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
                [(8000,24-8A-07-F7-5F-10,0192,8000,0009),
                 (FFFF,24-8A-07-F7-5B-01,000F,00FF,0002)]
        laggport: mlxen3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3d<ACTIVITY,AGGREGATION,SYNC,COLLECTING,DISTRIBUTING>
                [(8000,24-8A-07-F7-5F-10,0192,8000,000A),
                 (FFFF,24-8A-07-F7-5B-01,000F,00FF,0001)]
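
The part of this output that matters for the flowid question is the `lagg options` block. A small helper like the following could pull the flag names out of saved `ifconfig -v` output. It is a sketch: the function name is hypothetical, and the assumption that an enabled flowid would appear there as a flag named `USE_FLOWID` is not confirmed anywhere in this report.

```python
import re

def lagg_option_flags(ifconfig_output):
    """Return the flag names listed under 'lagg options:' in ifconfig -v output."""
    m = re.search(r"lagg options:\s*flags=[0-9a-f]+<([^>]*)>", ifconfig_output)
    return m.group(1).split(",") if m and m.group(1) else []

# Excerpt copied from the lagg0 output above.
sample = """\
        laggproto lacp lagghash l2,l3,l4
        lagg options:
                flags=10<LACP_STRICT>
                flowid_shift: 16
"""

flags = lagg_option_flags(sample)
print(flags)                  # ['LACP_STRICT']
print("USE_FLOWID" in flags)  # False (assumed flag name, see note above)
```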
Comment 9 Eugene Grosbein freebsd_committer freebsd_triage 2018-12-23 11:10:57 UTC
Does it help if you change lagghash to L4 only?

ifconfig lagg0 lagghash l4
ifconfig lagg1 lagghash l4
Comment 10 Eugene Grosbein freebsd_committer freebsd_triage 2018-12-23 11:37:35 UTC
By default, the FreeBSD kernel does not randomize source ports for applications like iperf3, and it may happen that all ports assigned in the regular way are hashed to a single LACP port. Please enable port randomization using "sysctl net.inet.ip.random_id=1" and retry the test.

Also, your trafshow screenshot shows only a single traffic stream in each direction, and that's odd.
Comment 11 Michael Muenz 2018-12-23 14:39:02 UTC
Dear Eugene,

Thanks for the tip with lagghash l4; the result is the same.
trafshow shows only one stream because -a 32 aggregates per IP.

I'll attach 3 screenshots: one of trafshow without aggregation, showing multiple streams, and 2 showing the output of

netstat -hw1 -I XXXX

One shows the two lagg interfaces with inbound traffic, where you can see it is split across both interfaces (5 Gbit/s each), and the other shows the lagg with outbound traffic, where the full 10 Gbit/s is on only one interface.

randomize_ip was also enabled.
Comment 12 Michael Muenz 2018-12-23 14:39:34 UTC
Created attachment 200392 [details]
trafshow2
Comment 13 Michael Muenz 2018-12-23 14:40:01 UTC
Created attachment 200393 [details]
lagg inbound
Comment 14 Michael Muenz 2018-12-23 14:40:24 UTC
Created attachment 200394 [details]
lagg outbound
Comment 15 Eugene Grosbein freebsd_committer freebsd_triage 2018-12-23 22:07:19 UTC
Ports are still not randomized. If you run iperf3 on a Linux host, you need to enable randomization there.
Comment 16 Michael Muenz 2018-12-24 11:25:36 UTC
Dear Eugene,

Ubuntu doesn't support a randomization parameter via sysctl; I can only widen the range for ephemeral port selection. I made a little progress:

When I raise the number of parallel streams to 20, roughly every 10th test reaches 18 Gbit/s. I have no idea how the source port calculation works exactly, but it seems to work in general, just not in a really predictable manner.
For backup software that uses multiple streams to reach more throughput, even balancing is not guaranteed. It seems that Linux does this a bit differently.

I'll install FreeBSD on the clients after holidays to check if it works better. Can I leave this open until final testing? 

Thanks for your time! :)
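
As a rough yardstick for how even the balancing can be expected to be, the following sketch computes the odds for a two-link lagg under the idealised assumption that every flow is hashed independently and uniformly onto one of the links. Real hashes and real port allocation need not behave this way, and an even flow count does not guarantee even throughput.

```python
from math import comb

def split_probabilities(n_flows):
    """Chance of each outcome when n_flows are hashed uniformly onto 2 links."""
    total = 2 ** n_flows
    # Exactly half of the flows on each link (only possible for even counts).
    even = comb(n_flows, n_flows // 2) / total if n_flows % 2 == 0 else 0.0
    # Every flow on the same link (either of the two).
    one_link = 2 / total
    return even, one_link

for n in (10, 20):
    even, one_link = split_probabilities(n)
    print(f"{n} flows: even split {even:.3f}, all on one link {one_link:.6f}")
# 10 flows: even split 0.246, all on one link 0.001953
# 20 flows: even split 0.176, all on one link 0.000002
```

Under this assumption a perfectly even split is the exception, but all flows landing on a single link should be rare as well.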
Comment 17 commit-hook freebsd_committer freebsd_triage 2018-12-29 00:41:52 UTC
A commit references this bug:

Author: eugen
Date: Sat Dec 29 00:41:21 UTC 2018
New revision: 342584
URL: https://svnweb.freebsd.org/changeset/base/342584

Log:
  MFC r342367: ifconfig.8, lagg.4: fix documentation bug: -use_flowid
  needs to be used to force local hash computation and disable usage
  of RSS hash provided by driver.

  PR:		234242

Changes:
_U  stable/12/
  stable/12/sbin/ifconfig/ifconfig.8
  stable/12/share/man/man4/lagg.4
Comment 18 commit-hook freebsd_committer freebsd_triage 2018-12-29 00:42:56 UTC
A commit references this bug:

Author: eugen
Date: Sat Dec 29 00:42:11 UTC 2018
New revision: 342585
URL: https://svnweb.freebsd.org/changeset/base/342585

Log:
  MFC r342367: ifconfig.8, lagg.4: fix documentation bug: -use_flowid
  needs to be used to force local hash computation and disable usage
  of RSS hash provided by driver.

  PR:		234242

Changes:
_U  stable/11/
  stable/11/sbin/ifconfig/ifconfig.8
  stable/11/share/man/man4/lagg.4
Comment 19 commit-hook freebsd_committer freebsd_triage 2018-12-29 00:44:59 UTC
A commit references this bug:

Author: eugen
Date: Sat Dec 29 00:44:12 UTC 2018
New revision: 342586
URL: https://svnweb.freebsd.org/changeset/base/342586

Log:
  MFC r342367: ifconfig.8, lagg.4: fix documentation bug: -use_flowid
  needs to be used to force local hash computation and disable usage
  of RSS hash provided by driver.

  PR:		234242

Changes:
_U  stable/10/
  stable/10/sbin/ifconfig/ifconfig.8
  stable/10/share/man/man4/lagg.4
Comment 20 Michael Muenz 2018-12-31 06:38:09 UTC
Dear Eugene,

it seems I made a mistake: I set the sysctl for use_flowid and did the testing, but as it's a default_ sysctl, it only takes effect after a reboot.

This may have changed with 11.X, since googling for flowid shows people setting this value directly with .X.use_flowid, where X is the lagg interface number.

In my ifconfig -vvvvv lagg0 I saw 

        lagg options:
                flags=10<LACP_STRICT>

Now I set ..

ifconfig lagg0 use_flowid
ifconfig lagg1 use_flowid

.. and I always get around 15-18 Gbit with each test.

Sorry if I wasted your time!

Thanks,
Michael