Bug 248652 - netmap: pkt-gen TX huge pps difference between 11-STABLE and 12-STABLE/CURRENT on ix & ixl NIC
Summary: netmap: pkt-gen TX huge pps difference between 11-STABLE and 12-STABLE/CURRENT on ix & ixl NIC
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: iflib, needs-qa, performance, regression
Depends on:
Blocks:
 
Reported: 2020-08-14 08:52 UTC by Sylvain Galliano
Modified: 2020-09-20 21:11 UTC
5 users

See Also:
koobs: maintainer-feedback? (vmaffione)
koobs: maintainer-feedback? (freebsd)
koobs: mfc-stable12?
koobs: mfc-stable11-


Attachments

Description Sylvain Galliano 2020-08-14 08:52:04 UTC
I'm testing netmap TX performance between 11-STABLE and CURRENT (same results as 12-STABLE) with 2 NICs:
Intel X520 (10G) and Intel XL710 (40G).
Here are my tests and results for the different OS versions, NICs, and numbers of queues.

*******************************************

Testing NIC Intel X520, 1 queue configured
pkt-gen -i ix1 -f tx -S a0:36:9f:3e:57:1a -D 3c:fd:fe:a2:22:91 -s 192.168.0.1 -d 192.168.0.2

11-STABLE:
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> port 0xece0-0xecff mem 0xdb600000-0xdb6fffff,0xdb7fc000-0xdb7fffff irq 53 at device 0.1 numa-domain 0 on pci5
ix1: Using MSI-X interrupts with 2 vectors
ix1: Ethernet address: a0:36:9f:51:c9:66
ix1: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: netmap queues/slots: TX 1/2048, RX 1/2048

pkt-gen result:
297.988718 main_thread [2639] 14.151 Mpps (15.049 Mpkts 6.792 Gbps in 1063439 usec) 510.11 avg_batch 0 min_space
14Mpps

CURRENT:
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver> port 0xece0-0xecff mem 0xdb600000-0xdb6fffff,0xdb7fc000-0xdb7fffff irq 53 at device 0.1 numa-domain 0 on pci5
ix1: Using 2048 TX descriptors and 2048 RX descriptors
ix1: Using 1 RX queues 1 TX queues
ix1: Using MSI-X interrupts with 2 vectors
ix1: allocated for 1 queues
ix1: allocated for 1 rx queues
ix1: Ethernet address: a0:36:9f:51:c9:66
ix1: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: netmap queues/slots: TX 1/2048, RX 1/2048

pkt-gen result:
198.445241 main_thread [2639] 2.615 Mpps (2.620 Mpkts 1.255 Gbps in 1001871 usec) 466.26 avg_batch 99999 min_space

2.6Mpps: much slower than 11-STABLE (14Mpps)

*******************************************

Testing NIC Intel XL710, 6 queues configured
pkt-gen -i ixl0 -f tx -S 9c:69:b4:60:ef:44 -D 9c:69:b4:60:35:ac -s 192.168.2.1 -d 192.168.2.2

11-STABLE:
ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.11.9-k> mem 0xd5000000-0xd57fffff,0xd6ff0000-0xd6ff7fff irq 40 at device 0.0 numa-domain 0 on pci2
ixl0: using 2048 tx descriptors and 2048 rx descriptors
ixl0: fw 6.0.48442 api 1.7 nvm 6.01 etid 800034a4 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSIX 129, VF MSIX 5, QPs 768, I2C
ixl0: Using MSIX interrupts with 7 vectors
ixl0: Allocating 8 queues for PF LAN VSI; 6 queues active
ixl0: Ethernet address: 9c:69:b4:60:ef:44
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 6/2048, RX 6/2048
ixl0: TSO4 requires txcsum, disabling both...

pkt-gen result:
515.210701 main_thread [2639] 42.566 Mpps (45.248 Mpkts 20.432 Gbps in 1062998 usec) 395.17 avg_batch 99999 min_space

42Mpps


CURRENT:
ixl0: <Intel(R) Ethernet Controller XL710 for 40GbE QSFP+ - 2.2.0-k> mem 0xd5000000-0xd57fffff,0xd6ff0000-0xd6ff7fff irq 40 at device 0.0 numa-domain 0 on pci2
ixl0: fw 6.0.48442 api 1.7 nvm 6.01 etid 800034a4 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSI-X 129, VF MSI-X 5, QPs 768, I2C
ixl0: Using 2048 TX descriptors and 2048 RX descriptors
ixl0: Using 6 RX queues 6 TX queues
ixl0: Using MSI-X interrupts with 7 vectors
ixl0: Ethernet address: 9c:69:b4:60:ef:44
ixl0: Allocating 8 queues for PF LAN VSI; 6 queues active
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 6/2048, RX 6/2048
ixl0: Media change is not supported.
ixl0: Link is up, 40 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: True, Flow Control: None

pkt-gen result:
941.463329 main_thread [2639] 13.564 Mpps (13.741 Mpkts 6.511 Gbps in 1013001 usec) 16.04 avg_batch 99999 min_space

13Mpps: much slower than 11-STABLE (42Mpps)


*******************************************
And one last test, this one showing better performance on CURRENT than on 11-STABLE :)

Testing NIC Intel XL710, 1 queue configured
pkt-gen -i ixl0 -f tx -S 9c:69:b4:60:ef:44 -D 9c:69:b4:60:35:ac -s 192.168.2.1 -d 192.168.2.2

11-STABLE:
ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.11.9-k> mem 0xd5000000-0xd57fffff,0xd6ff0000-0xd6ff7fff irq 40 at device 0.0 numa-domain 0 on pci2
ixl0: using 2048 tx descriptors and 2048 rx descriptors
ixl0: fw 6.0.48442 api 1.7 nvm 6.01 etid 800034a4 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSIX 129, VF MSIX 5, QPs 768, I2C
ixl0: Using MSIX interrupts with 2 vectors
ixl0: Allocating 1 queues for PF LAN VSI; 1 queues active
ixl0: Ethernet address: 9c:69:b4:60:ef:44
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 1/2048, RX 1/2048
ixl0: TSO4 requires txcsum, disabling both...

pkt-gen result:
609.889550 main_thread [2639] 8.413 Mpps (8.617 Mpkts 4.038 Gbps in 1024294 usec) 511.42 avg_batch 0 min_space

8Mpps

CURRENT:
ixl0: <Intel(R) Ethernet Controller XL710 for 40GbE QSFP+ - 2.2.0-k> mem 0xd5000000-0xd57fffff,0xd6ff0000-0xd6ff7fff irq 40 at device 0.0 numa-domain 0 on pci2
ixl0: fw 6.0.48442 api 1.7 nvm 6.01 etid 800034a4 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSI-X 129, VF MSI-X 5, QPs 768, I2C
ixl0: Using 2048 TX descriptors and 2048 RX descriptors
ixl0: Using 1 RX queues 1 TX queues
ixl0: Using MSI-X interrupts with 2 vectors
ixl0: Ethernet address: 9c:69:b4:60:ef:44
ixl0: Allocating 1 queues for PF LAN VSI; 1 queues active
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 1/2048, RX 1/2048
ixl0: Media change is not supported.
ixl0: Link is up, 40 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: True, Flow Control: None

pkt-gen result:
526.299416 main_thread [2639] 12.228 Mpps (12.240 Mpkts 5.870 Gbps in 1001000 usec) 14.37 avg_batch 99999 min_space

12Mpps: much better than 11-STABLE (8Mpps)
Comment 1 Vincenzo Maffione freebsd_committer 2020-08-14 14:23:02 UTC
Thanks for reporting.
What I can tell you for sure is that the difference is due to the conversion of the Intel drivers (em, ix, ixl) to iflib.
This affects netmap because netmap support for iflib drivers (the Intel ones, vmx, mgb, bnxt) is provided directly by the iflib core. In other words, no driver-specific netmap code remains in the drivers themselves.

I would say some inherent performance drop is to be expected, due to the additional indirection introduced by iflib. However, the drop should not be as large as your experiments report.
The 2.6 Mpps you get in the first comparison makes me think you may have accidentally left Ethernet flow control enabled, maybe?
Moreover, the last experiment is rather confusing, since there you actually see a performance improvement... which makes me think the configuration may not be 100% aligned between the two cases?

Have you tried disabling all the offloads? In 11-STABLE the driver-specific netmap code does not program the offloads, whereas in CURRENT (and 12) the iflib callbacks program the offloads even in netmap mode.

  # ifconfig ix0 -txcsum -rxcsum -tso4 -tso6 -lro -txcsum6 -rxcsum6
Comment 2 Sylvain Galliano 2020-08-14 15:21:40 UTC
Hi Vincenzo, thanks for your quick reply.

I've disabled all offloads in both 11-STABLE and CURRENT and I got the same results.

I did another test that may help you:

I've recompiled pkt-gen on CURRENT after adding:
 #define BUSYWAIT
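
(For reference, a minimal sketch of what the BUSYWAIT toggle changes in the sender loop; this illustrates the pattern rather than pkt-gen's exact code, and the interface name and function name below are placeholders. With BUSYWAIT the loop forces a TX sync via ioctl(NIOCTXSYNC) on every iteration instead of sleeping in poll() until the kernel reports free slots, so it no longer depends on interrupts or the iflib timer to learn about completed transmissions.)

  /* Illustration of the BUSYWAIT idea, not pkt-gen's actual source. */
  #define NETMAP_WITH_LIBS
  #include <net/netmap_user.h>
  #include <sys/ioctl.h>
  #include <poll.h>

  static void
  tx_loop(struct nm_desc *d)    /* d from nm_open("netmap:ix1", ...) */
  {
          struct pollfd pfd = { .fd = d->fd, .events = POLLOUT };

          for (;;) {
  #ifdef BUSYWAIT
                  /* Busy-wait: push/reclaim TX slots on every iteration. */
                  ioctl(d->fd, NIOCTXSYNC, NULL);
  #else
                  /* Sleep until the kernel reports free TX slots. */
                  poll(&pfd, 1, 2000);
  #endif
                  /* ... fill the free slots of NETMAP_TXRING(d->nifp, 0) ... */
          }
  }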

Testing NIC Intel X520, 1 queue configured

default pkt-gen:
696.194470 main_thread [2641] 2.560 Mpps (2.570 Mpkts 1.229 Gbps in 1004000 usec) 465.45 avg_batch 99999 min_space

with busywait:
733.764470 main_thread [2641] 14.881 Mpps (15.172 Mpkts 7.143 Gbps in 1019565 usec) 344.22 avg_batch 99999 min_space

14Mpps, same as 11-STABLE
Comment 3 Vincenzo Maffione freebsd_committer 2020-08-14 15:32:50 UTC
It looks like you get 2.6 Mpps because you are not getting enough interrupts... have you tried measuring the interrupt rate in the two cases (CURRENT vs 11, no busy wait)?
Intel NICs have tunables to set interrupt coalescing, for both TX and RX. Maybe playing with those changes the picture?
Comment 4 Sylvain Galliano 2020-08-14 19:51:25 UTC
You're right, the interrupt rate limits the pps on CURRENT with the X520 NIC:

pkt-gen, no busy wait

11-stable:  27500 irq/s
CURRENT:    5500  irq/s

pkt-gen, with busy wait on CURRENT: +30000 irq/s

Regarding the NIC interrupt tunables, the only one related to 'ix' looks good:

# sysctl hw.ix.max_interrupt_rate
hw.ix.max_interrupt_rate: 31250
Comment 5 Vincenzo Maffione freebsd_committer 2020-08-14 21:08:26 UTC
OK, thanks for the feedback. That means the issue is that iflib is not requesting TX descriptor writebacks often enough. This needs some investigation in the iflib txsync routine.
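
For illustration, a generic sketch of the writeback trade-off (placeholder names and a made-up REPORT_INTERVAL, not the real ix/ixl descriptor layout or the iflib encap path): if a completion writeback is requested only every Nth descriptor and nothing else cleans the ring, netmap learns about free TX credits at most that often.

  /* Generic illustration only: placeholder types and names. */
  #include <stdbool.h>
  #include <stdint.h>

  #define REPORT_INTERVAL 32      /* hypothetical writeback cadence */

  struct demo_tx_desc {
          uint64_t addr;
          uint32_t len;
          bool     report_status; /* e.g. the RS bit on Intel NICs */
  };

  static void
  demo_encap(struct demo_tx_desc *ring, uint32_t idx, uint64_t paddr, uint32_t len)
  {
          ring[idx].addr = paddr;
          ring[idx].len = len;
          /*
           * Request a completion writeback only every REPORT_INTERVAL slots;
           * the fewer writebacks requested, the less often TX credits become
           * visible unless a timer cleans the ring in between.
           */
          ring[idx].report_status = (idx % REPORT_INTERVAL) == 0;
  }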
Comment 6 Kubilay Kocak freebsd_committer freebsd_triage 2020-08-15 05:00:57 UTC
(In reply to Vincenzo Maffione from comment #5)

Is this more specific/scoped to:

 - netmap & iflib, or 
 - ix/ixl and/or NIC driver & iflib, or
 - iflib framework (generally)
Comment 7 Vincenzo Maffione freebsd_committer 2020-08-18 08:41:41 UTC
(In reply to Kubilay Kocak from comment #6)
I would say
  ix/ixl and/or NIC driver & iflib
because it's not something related to the netmap module itself; it's an optimization that came from the ix/ixl netmap support code, which is now included within iflib.
Comment 8 Sylvain Galliano 2020-08-28 19:58:15 UTC
After looking at iflib_netmap_timer_adjust() & iflib_netmap_txsync() in sys/net/iflib.c,
I did some tuning of kern.hz:

Still using X520 with 1 queue
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix0: netmap queues/slots: TX 1/2048, RX 1/2048

*****************

/boot/loader.conf:
kern.hz=1000  (default)

pkt-gen result:
204.153802 main_thread [2639] 2.562 Mpps (2.567 Mpkts 1.230 Gbps in 1001994 usec) 465.32 avg_batch 99999 min_space
205.155321 main_thread [2639] 2.561 Mpps (2.565 Mpkts 1.229 Gbps in 1001519 usec) 465.45 avg_batch 99999 min_space

5500 irq/s

*****************

/boot/loader.conf:
kern.hz=1999

pkt-gen result:
41.375049 main_thread [2639] 5.117 Mpps (5.222 Mpkts 2.456 Gbps in 1020510 usec) 465.45 avg_batch 99999 min_space
42.375546 main_thread [2639] 5.118 Mpps (5.121 Mpkts 2.457 Gbps in 1000497 usec) 465.42 avg_batch 99999 min_space

11000 irq/s

2x the performance & irq/s

*****************

/boot/loader.conf:
kern.hz=2000

pkt-gen result:
797.608080 main_thread [2639] 2.560 Mpps (2.563 Mpkts 1.229 Gbps in 1001001 usec) 465.50 avg_batch 99999 min_space
798.609079 main_thread [2639] 2.560 Mpps (2.563 Mpkts 1.229 Gbps in 1000999 usec) 465.41 avg_batch 99999 min_space

5500 irq/s

Same performance & irq/s as kern.hz=1000 (due to the 2000 limit in iflib_netmap_timer_adjust & iflib_netmap_txsync)

*****************

For the last test, I forced the 'ticks' parameter passed to callout_reset_on() to '1' in iflib_netmap_timer_adjust & iflib_netmap_txsync
by increasing the 2000 limit to 20000 in both functions,
and set an insane value for kern.hz:

/boot/loader.conf:
kern.hz=10000

pkt-gen result:
345.415939 main_thread [2639] 14.880 Mpps (14.890 Mpkts 7.142 Gbps in 1000699 usec) 430.97 avg_batch 99999 min_space
346.429134 main_thread [2639] 14.880 Mpps (15.076 Mpkts 7.142 Gbps in 1013196 usec) 432.17 avg_batch 99999 min_space

29000 irq/s

Same performance as FreeBSD 11

Looks like the callout_reset_on() calls that schedule iflib_timer have a rather high delay.
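
For reference, a sketch of the rearm logic that would explain these numbers (placeholder names, reconstructed from the behaviour observed above rather than copied from sys/net/iflib.c; the assumption is that the tick count passed to callout_reset_on() is 1 when hz < 2000 and hz/1000 otherwise):

  /* Sketch only: placeholder names, not the actual iflib.c code. */
  #include <sys/param.h>
  #include <sys/systm.h>
  #include <sys/kernel.h>
  #include <sys/callout.h>

  struct demo_txq {
          struct callout  timer;
          int             cpu;
  };

  static void demo_txq_timer(void *arg);

  static void
  demo_txq_rearm(struct demo_txq *txq)
  {
          int reset_on;

          /*
           * A tick-based callout cannot fire more than hz times per second,
           * and the hz/1000 branch pins the period to roughly 1 ms once
           * hz >= 2000; that would explain why kern.hz=2000 behaves like
           * kern.hz=1000 while kern.hz=1999 doubles the rate.
           */
          if (hz < 2000)
                  reset_on = 1;           /* one tick, i.e. 1/hz seconds */
          else
                  reset_on = hz / 1000;   /* ~1 ms regardless of hz */

          callout_reset_on(&txq->timer, reset_on, demo_txq_timer, txq, txq->cpu);
  }

  static void
  demo_txq_timer(void *arg)
  {
          struct demo_txq *txq = arg;

          /* ... clean the TX ring / update TX credits here ... */
          demo_txq_rearm(txq);
  }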
Comment 9 Vincenzo Maffione freebsd_committer 2020-08-31 22:05:59 UTC
Thanks a lot for the tests.
I think the way netmap tx is handled right now needs improvement.

As far as I can tell, in your setup TX interrupts are simply not used (ix and ixl seem to use softirq for TX interrupt processing).
Your experiments with increasing kern.hz raise the rate of the OS timer interrupt, which causes the iflib_timer() routine to be called more often. Being called more often, the TX ring is cleaned up (TX credits updated) more often, so the application can submit new TX packets more often, hence the improved pps.

However, increasing kern.hz is clearly not a viable approach.
I think we should try to use a separate timer for the netmap TX credits update, with higher resolution (e.g. callout_reset_sbt_on()), and maybe try to dynamically adjust the timer period so that it becomes smaller when transmitting at a high rate and larger when transmitting at a low rate.
I'll try to come up with an experimental patch in the next few days.
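
A minimal sketch of that direction (placeholder names and illustrative period values, not a committed patch): rearm a dedicated netmap TX credit-update timer through callout_reset_sbt_on(), whose period is an sbintime_t and therefore independent of kern.hz, and shrink or grow the period based on recent TX activity.

  /* Sketch only: placeholder names, not the actual fix. */
  #include <sys/param.h>
  #include <sys/systm.h>
  #include <sys/callout.h>
  #include <sys/time.h>

  struct demo_nm_txq {
          struct callout  nm_timer;
          int             nm_cpu;
          sbintime_t      nm_period;      /* current credit-update period */
  };

  static void
  demo_nm_txq_timer(void *arg)
  {
          struct demo_nm_txq *txq = arg;
          bool busy;

          /* ... update TX credits, wake up the netmap TX ring if needed ... */
          busy = true;    /* placeholder: derive from how many slots completed */

          /* Shrink the period under load, grow it when the ring is idle. */
          if (busy)
                  txq->nm_period = SBT_1MS / 4;
          else
                  txq->nm_period = 2 * SBT_1MS;

          /* Sub-tick rearm; the precision argument lets the kernel batch wakeups. */
          callout_reset_sbt_on(&txq->nm_timer, txq->nm_period,
              txq->nm_period / 2, demo_nm_txq_timer, txq, txq->nm_cpu, 0);
  }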
Comment 10 Eric Joyner freebsd_committer 2020-09-15 17:31:38 UTC
(In reply to Vincenzo Maffione from comment #9)

Any update? This might be something to get into 12.2 if we can.
Comment 11 Vincenzo Maffione freebsd_committer 2020-09-20 21:11:21 UTC
(In reply to Eric Joyner from comment #10)

Not yet. I've been AFK for a couple of weeks. I should be able to work on it this week.