I'm testing netmap tx performance between 11-STABLE and CURRENT (same results as 12-STABLE) with 2 NICs: Intel X520 (10G) and Intel XL710 (40G).

Here are my tests and the results using different OS versions, NICs and numbers of queues.

*******************************************
Testing NIC Intel X520, 1 queue configured

pkt-gen -i ix1 -f tx -S a0:36:9f:3e:57:1a -D 3c:fd:fe:a2:22:91 -s 192.168.0.1 -d 192.168.0.2

11-STABLE:
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> port 0xece0-0xecff mem 0xdb600000-0xdb6fffff,0xdb7fc000-0xdb7fffff irq 53 at device 0.1 numa-domain 0 on pci5
ix1: Using MSI-X interrupts with 2 vectors
ix1: Ethernet address: a0:36:9f:51:c9:66
ix1: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: netmap queues/slots: TX 1/2048, RX 1/2048

pkt-gen result:
297.988718 main_thread [2639] 14.151 Mpps (15.049 Mpkts 6.792 Gbps in 1063439 usec) 510.11 avg_batch 0 min_space

14Mpps

CURRENT:
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver> port 0xece0-0xecff mem 0xdb600000-0xdb6fffff,0xdb7fc000-0xdb7fffff irq 53 at device 0.1 numa-domain 0 on pci5
ix1: Using 2048 TX descriptors and 2048 RX descriptors
ix1: Using 1 RX queues 1 TX queues
ix1: Using MSI-X interrupts with 2 vectors
ix1: allocated for 1 queues
ix1: allocated for 1 rx queues
ix1: Ethernet address: a0:36:9f:51:c9:66
ix1: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: netmap queues/slots: TX 1/2048, RX 1/2048

pkt-gen result:
198.445241 main_thread [2639] 2.615 Mpps (2.620 Mpkts 1.255 Gbps in 1001871 usec) 466.26 avg_batch 99999 min_space

2.6Mpps: much slower than 11-STABLE (14Mpps)

*******************************************
Testing NIC Intel XL710, 6 queues configured

pkt-gen -i ixl0 -f tx -S 9c:69:b4:60:ef:44 -D 9c:69:b4:60:35:ac -s 192.168.2.1 -d 192.168.2.2

11-STABLE:
ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.11.9-k> mem 0xd5000000-0xd57fffff,0xd6ff0000-0xd6ff7fff irq 40 at device 0.0 numa-domain 0 on pci2
ixl0: using 2048 tx descriptors and 2048 rx descriptors
ixl0: fw 6.0.48442 api 1.7 nvm 6.01 etid 800034a4 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSIX 129, VF MSIX 5, QPs 768, I2C
ixl0: Using MSIX interrupts with 7 vectors
ixl0: Allocating 8 queues for PF LAN VSI; 6 queues active
ixl0: Ethernet address: 9c:69:b4:60:ef:44
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 6/2048, RX 6/2048
ixl0: TSO4 requires txcsum, disabling both...

pkt-gen result:
515.210701 main_thread [2639] 42.566 Mpps (45.248 Mpkts 20.432 Gbps in 1062998 usec) 395.17 avg_batch 99999 min_space

42Mpps

CURRENT:
ixl0: <Intel(R) Ethernet Controller XL710 for 40GbE QSFP+ - 2.2.0-k> mem 0xd5000000-0xd57fffff,0xd6ff0000-0xd6ff7fff irq 40 at device 0.0 numa-domain 0 on pci2
ixl0: fw 6.0.48442 api 1.7 nvm 6.01 etid 800034a4 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSI-X 129, VF MSI-X 5, QPs 768, I2C
ixl0: Using 2048 TX descriptors and 2048 RX descriptors
ixl0: Using 6 RX queues 6 TX queues
ixl0: Using MSI-X interrupts with 7 vectors
ixl0: Ethernet address: 9c:69:b4:60:ef:44
ixl0: Allocating 8 queues for PF LAN VSI; 6 queues active
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 6/2048, RX 6/2048
ixl0: Media change is not supported.
ixl0: Link is up, 40 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: True, Flow Control: None

pkt-gen result:
941.463329 main_thread [2639] 13.564 Mpps (13.741 Mpkts 6.511 Gbps in 1013001 usec) 16.04 avg_batch 99999 min_space

13Mpps: much slower than 11-STABLE (42Mpps)

*******************************************
And a last test, this one showing better performance in CURRENT vs 11-STABLE :)

Testing NIC Intel XL710, 1 queue configured

pkt-gen -i ixl0 -f tx -S 9c:69:b4:60:ef:44 -D 9c:69:b4:60:35:ac -s 192.168.2.1 -d 192.168.2.2

11-STABLE:
ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.11.9-k> mem 0xd5000000-0xd57fffff,0xd6ff0000-0xd6ff7fff irq 40 at device 0.0 numa-domain 0 on pci2
ixl0: using 2048 tx descriptors and 2048 rx descriptors
ixl0: fw 6.0.48442 api 1.7 nvm 6.01 etid 800034a4 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSIX 129, VF MSIX 5, QPs 768, I2C
ixl0: Using MSIX interrupts with 2 vectors
ixl0: Allocating 1 queues for PF LAN VSI; 1 queues active
ixl0: Ethernet address: 9c:69:b4:60:ef:44
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 1/2048, RX 1/2048
ixl0: TSO4 requires txcsum, disabling both...

pkt-gen result:
609.889550 main_thread [2639] 8.413 Mpps (8.617 Mpkts 4.038 Gbps in 1024294 usec) 511.42 avg_batch 0 min_space

8Mpps

CURRENT:
ixl0: <Intel(R) Ethernet Controller XL710 for 40GbE QSFP+ - 2.2.0-k> mem 0xd5000000-0xd57fffff,0xd6ff0000-0xd6ff7fff irq 40 at device 0.0 numa-domain 0 on pci2
ixl0: fw 6.0.48442 api 1.7 nvm 6.01 etid 800034a4 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSI-X 129, VF MSI-X 5, QPs 768, I2C
ixl0: Using 2048 TX descriptors and 2048 RX descriptors
ixl0: Using 1 RX queues 1 TX queues
ixl0: Using MSI-X interrupts with 2 vectors
ixl0: Ethernet address: 9c:69:b4:60:ef:44
ixl0: Allocating 1 queues for PF LAN VSI; 1 queues active
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 1/2048, RX 1/2048
ixl0: Media change is not supported.
ixl0: Link is up, 40 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: True, Flow Control: None

pkt-gen result:
526.299416 main_thread [2639] 12.228 Mpps (12.240 Mpkts 5.870 Gbps in 1001000 usec) 14.37 avg_batch 99999 min_space

12Mpps: much better than 11-STABLE (8Mpps)
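To put the Mpps figures in this report into context, here is a minimal sketch that computes the theoretical line rate for minimum-size frames and the fraction of it that each run reached. It assumes 60-byte frames as counted by pkt-gen (this matches the reported ratio of Gbps to Mpps, which is 480 bits per packet), plus the standard Ethernet CRC, preamble and inter-frame gap on the wire. The function names are illustrative only.

```python
# Per-frame overhead on the wire, in bytes (standard Ethernet values).
CRC = 4
PREAMBLE_SFD = 8
INTER_FRAME_GAP = 12


def line_rate_pps(link_bps, frame_len=60):
    """Maximum packets per second for frames of frame_len bytes (CRC excluded)."""
    wire_bits = (frame_len + CRC + PREAMBLE_SFD + INTER_FRAME_GAP) * 8
    return link_bps / wire_bits


def percent_of_line_rate(measured_mpps, link_bps, frame_len=60):
    """Share of the theoretical line rate reached by a measured rate in Mpps."""
    return 100.0 * measured_mpps * 1e6 / line_rate_pps(link_bps, frame_len)


if __name__ == "__main__":
    # (label, measured Mpps from the report, link speed in bit/s)
    runs = [
        ("X520 1q 11-STABLE", 14.151, 10e9),
        ("X520 1q CURRENT", 2.615, 10e9),
        ("XL710 6q 11-STABLE", 42.566, 40e9),
        ("XL710 6q CURRENT", 13.564, 40e9),
    ]
    for label, mpps, link in runs:
        print(f"{label}: {percent_of_line_rate(mpps, link):.1f}% of line rate")
```

Under these assumptions a 10G link tops out at about 14.88 Mpps and a 40G link at about 59.52 Mpps, so the X520 on 11-STABLE is close to line rate while the same NIC on CURRENT is well below it.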
Thanks for reporting.

What I can tell you for sure is that the difference is to be attributed to the conversion of Intel drivers (em, ix, ixl) to iflib. This impacted netmap because netmap support for iflib drivers (Intel ones, vmx, mgb, bnxt) is provided directly within the iflib core. In other words, no explicit netmap code remains within the drivers.

I would say some inherent performance drop is to be expected, due to the additional indirection introduced by iflib. However, the performance drop should not be as large as reported in your experiments.

The 2.6 Mpps you get in the first comparison makes me think that you may have accidentally left Ethernet flow control enabled. Moreover, the last experiment is rather confusing, since you actually get a performance improvement. This makes me think that the configuration may not be 100% aligned between the two cases.

Have you tried to disable all the offloads? In 11-stable the driver-specific netmap code does not program the offloads, whereas in CURRENT (and 12) the iflib callbacks program the offloads in the netmap case too.

# ifconfig ix0 -txcsum -rxcsum -tso4 -tso6 -lro -txcsum6 -rxcsum6
Hi Vincenzo, thanks for your quick reply.

I've disabled all offloads in both 11-STABLE and CURRENT and I got the same results.

I did another test that may help you: I've recompiled pkt-gen on CURRENT after adding:

#define BUSYWAIT

Testing NIC Intel X520, 1 queue configured

default pkt-gen:
696.194470 main_thread [2641] 2.560 Mpps (2.570 Mpkts 1.229 Gbps in 1004000 usec) 465.45 avg_batch 99999 min_space

with busywait:
733.764470 main_thread [2641] 14.881 Mpps (15.172 Mpkts 7.143 Gbps in 1019565 usec) 344.22 avg_batch 99999 min_space

14Mpps, same as 11-STABLE
It looks like you get 2.6 Mpps because you are not getting enough interrupts... have you tried to measure the interrupt rate in the two cases (current vs 11, no busy wait)? Intel NICs have tunables to set interrupt coalescing, for both TX and RX. Maybe playing with those changes the game?
You're right, the interrupt rate limits the pps on CURRENT + X520 NIC:

pkt-gen, no busy wait
11-stable: 27500 irq/s
CURRENT: 5500 irq/s

pkt-gen, with busy wait on CURRENT: +30000 irq/s

Regarding NIC irq tunables, the only one related to 'ix' looks good:

# sysctl hw.ix.max_interrupt_rate
hw.ix.max_interrupt_rate: 31250
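The relation between interrupt rate and throughput reported here can be checked with a few lines of arithmetic. This is a rough sketch that assumes every interrupt wakes the sender exactly once and that each wakeup results in one transmitted batch; the irq/s values are the rounded figures quoted in this thread, and the function name is illustrative.

```python
def packets_per_interrupt(pps, irq_per_sec):
    """Average number of packets the sender must push per wakeup."""
    return pps / irq_per_sec


if __name__ == "__main__":
    # CURRENT, no busy wait: 2.560 Mpps at roughly 5500 irq/s.
    print(f"CURRENT:   {packets_per_interrupt(2.560e6, 5500):.2f} pkts/irq")
    # 11-STABLE: 14.151 Mpps at roughly 27500 irq/s.
    print(f"11-STABLE: {packets_per_interrupt(14.151e6, 27500):.2f} pkts/irq")
```

Both results land close to the avg_batch values printed by pkt-gen (about 465 and 510), which is consistent with the batch size staying roughly constant while the wakeup rate is what limits the pps.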
Ok, thanks for the feedback. That means that the issue is that iflib is not asking for enough TX descriptor writebacks. The iflib txsync routine needs some investigation.
(In reply to Vincenzo Maffione from comment #5) Is this more specific/scoped to: - netmap & iflib, or - ix/ixl and/or NIC driver & iflib, or - iflib framework (generally)
(In reply to Kubilay Kocak from comment #6) I would say ix/ixl and/or NIC driver & iflib because it's not something related to the netmap module itself, and it is an optimization which derives from ix/ixl netmap support code, which now is included within iflib.
After looking at iflib_netmap_timer_adjust() & iflib_netmap_txsync() in sys/net/iflib.c, I did some tuning of kern.hz.

Still using X520 with 1 queue:
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix0: netmap queues/slots: TX 1/2048, RX 1/2048

*****************
/boot/loader.conf: kern.hz=1000 (default)

pkt-gen result:
204.153802 main_thread [2639] 2.562 Mpps (2.567 Mpkts 1.230 Gbps in 1001994 usec) 465.32 avg_batch 99999 min_space
205.155321 main_thread [2639] 2.561 Mpps (2.565 Mpkts 1.229 Gbps in 1001519 usec) 465.45 avg_batch 99999 min_space

5500 irq/s

*****************
/boot/loader.conf: kern.hz=1999

pkt-gen result:
41.375049 main_thread [2639] 5.117 Mpps (5.222 Mpkts 2.456 Gbps in 1020510 usec) 465.45 avg_batch 99999 min_space
42.375546 main_thread [2639] 5.118 Mpps (5.121 Mpkts 2.457 Gbps in 1000497 usec) 465.42 avg_batch 99999 min_space

11000 irq/s
X2 performance & irq/s

*****************
/boot/loader.conf: kern.hz=2000

pkt-gen result:
797.608080 main_thread [2639] 2.560 Mpps (2.563 Mpkts 1.229 Gbps in 1001001 usec) 465.50 avg_batch 99999 min_space
798.609079 main_thread [2639] 2.560 Mpps (2.563 Mpkts 1.229 Gbps in 1000999 usec) 465.41 avg_batch 99999 min_space

5500 irq/s
Same performance & irq/s as kern.hz=1000 (due to the limit at 2000 in iflib_netmap_timer_adjust & iflib_netmap_txsync)

*****************
Last test: this time I forced the 'ticks' parameter of callout_reset_on to '1' in iflib_netmap_timer_adjust & iflib_netmap_txsync, by increasing the 2000 limit to 20000 in both functions, and I set an insane value for kern.hz.

/boot/loader.conf: kern.hz=10000

pkt-gen result:
345.415939 main_thread [2639] 14.880 Mpps (14.890 Mpkts 7.142 Gbps in 1000699 usec) 430.97 avg_batch 99999 min_space
346.429134 main_thread [2639] 14.880 Mpps (15.076 Mpkts 7.142 Gbps in 1013196 usec) 432.17 avg_batch 99999 min_space

29000 irq/s
Same performance as FreeBSD 11

It looks like the callout_reset_on() call for iflib_timer results in a rather high delay.
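The kern.hz results can be explained by looking at the timer period that results from the tick count. The sketch below models the selection logic as described in this comment: one tick below the limit, and (this part is an assumption inferred from hz=2000 behaving like hz=1000) hz/1000 ticks at or above it. It is not a copy of the iflib source, and the function names are illustrative.

```python
def callout_ticks(hz, limit=2000):
    """Tick count handed to the callout, as modeled from the thread."""
    # Assumption: above the limit the code aims for about one millisecond.
    return 1 if hz < limit else hz // 1000


def timer_period_us(hz, limit=2000):
    """Resulting period between two timer firings, in microseconds."""
    return callout_ticks(hz, limit) * 1_000_000 / hz


if __name__ == "__main__":
    for hz, limit in [(1000, 2000), (1999, 2000), (2000, 2000), (10000, 20000)]:
        print(f"hz={hz:5d} limit={limit:5d} -> "
              f"{callout_ticks(hz, limit)} tick(s), {timer_period_us(hz, limit):.0f} us")
```

With this model hz=1999 halves the period compared to hz=1000, hz=2000 falls back to one millisecond, and hz=10000 with the raised limit gives 100 us, which lines up with the doubling and the line-rate result measured above.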
Thanks a lot for the tests. I think the way netmap tx is handled right now needs improvement.

As far as I can tell, in your setup TX interrupts are simply not used (ix and ixl seem to use softirq for TX interrupt processing). Your experiments with increasing kern.hz cause the interrupt rate of the OS timer to increase, and therefore cause the iflib_timer() routine to be called more often. Being called more often, the TX ring is cleaned up (TX credits update) more often, and therefore the application can submit new TX packets more often, hence the improved pps.

However, increasing kern.hz is clearly not a viable approach. I think we should try to use a separate timer for netmap TX credits update, using higher resolution (e.g. callout_reset_sbt_on()), and maybe try to dynamically adjust the timer period so that it becomes shorter when transmitting at a high rate and longer when transmitting at a low rate. I'll try to come up with an experimental patch in the next few days.
(In reply to Vincenzo Maffione from comment #9) Any update? This might be something to get into 12.2 if we can.
(In reply to Eric Joyner from comment #10) Not yet. I've been AFK for a couple of weeks. I should be able to work on it this week.
Is there any progress about this issue? Unfortunately I'm blocked on old opnsense & sensei version because with 20.7 the performance is really bad (<300Mbit/s). With 20.1 I can reach wirespeed (1GbE) without problems. Let me know, if I/we can test something to help solve this.
(In reply to vistalba from comment #12) I started to work on it, however I've no suitable hardware to test. This means that I will need to patch qemu to modify the emulation of an iflib-backed device with MSI-X interrupts (such as vmxnet3) in such a way that I can reproduce the problem (e.g. by making the transmission asynchronous w.r.t the register write that triggers it, like in real hardware). I will for sure ask for tests on real hardware, but first I need to make some basic experiments on my own.
Created attachment 218723 [details] Draft patch to test the netmap tx timer This is a draft patch that adds support for a per-tx-queue timer dedicated to netmap. The timer interval is still not adaptive, but controlled by a per-interface sysctl, e.g.: sysctl dev.ix.0.iflib.nm_tx_tmr_us=500 It would be useful to test pkt-gen transmission on ixl/ix NICs, playing on the tunable to hopefully see the pps increase. Values too large should cause the pps to drop. Values too short should cause the CPU utilization to go up (and possibly the pps to drop a little bit). Can anyone test this?
^Triage: Switch Version to earliest affected version/branch
(In reply to Vincenzo Maffione from comment #14)

Here are the results:

X520 with 1 queue
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix0: netmap queues/slots: TX 1/2048, RX 1/2048

*******************************************
sysctl dev.ix.0.iflib.nm_tx_tmr_us=0 (default value)
pkt-gen:
683.502433 main_thread [2639] 4.215 Mpps (4.227 Mpkts 2.023 Gbps in 1002819 usec) 465.43 avg_batch 99999 min_space

*******************************************
sysctl dev.ix.0.iflib.nm_tx_tmr_us=300
pkt-gen:
750.688608 main_thread [2639] 6.496 Mpps (6.646 Mpkts 3.118 Gbps in 1023000 usec) 465.45 avg_batch 99999 min_space

*******************************************
sysctl dev.ix.0.iflib.nm_tx_tmr_us=200
pkt-gen:
771.736855 main_thread [2639] 8.907 Mpps (9.112 Mpkts 4.275 Gbps in 1022999 usec) 465.45 avg_batch 99999 min_space

*******************************************
sysctl dev.ix.0.iflib.nm_tx_tmr_us=100
pkt-gen:
804.554603 main_thread [2639] 14.136 Mpps (14.147 Mpkts 6.785 Gbps in 1000748 usec) 465.45 avg_batch 99999 min_space

-> close to 10G line rate

*******************************************
sysctl dev.ix.0.iflib.nm_tx_tmr_us=90
pkt-gen:
872.156329 main_thread [2639] 14.880 Mpps (15.054 Mpkts 7.142 Gbps in 1011721 usec) 466.96 avg_batch 99999 min_space

Now using the same X520 NIC with 4 queues.
ix1: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: netmap queues/slots: TX 4/2048, RX 4/2048

*******************************************
sysctl dev.ix.1.iflib.nm_tx_tmr_us=0 (default)
pkt-gen:
047.988586 main_thread [2639] 13.596 Mpps (13.623 Mpkts 6.526 Gbps in 1002002 usec) 443.03 avg_batch 99999 min_space

-> close to max speed (thanks to 4 queues)

*******************************************
sysctl dev.ix.1.iflib.nm_tx_tmr_us=400
pkt-gen:
094.224581 main_thread [2639] 14.887 Mpps (14.904 Mpkts 7.146 Gbps in 1001173 usec) 440.75 avg_batch 99999 min_space

Looks really good for the X520 NIC, whatever the number of queues I used.

Now the same tests using the XL710 NIC (40G) with 1 queue:
ixl1: PCI Express Bus: Speed 8.0GT/s Width x8
ixl1: netmap queues/slots: TX 1/1024, RX 1/1024

*******************************************
sysctl dev.ixl.1.iflib.nm_tx_tmr_us=0 (default)
pkt-gen:
324.883066 main_thread [2639] 12.270 Mpps (13.044 Mpkts 5.890 Gbps in 1063000 usec) 16.53 avg_batch 99999 min_space

*******************************************
sysctl dev.ixl.1.iflib.nm_tx_tmr_us=100
pkt-gen:
350.497566 main_thread [2639] 12.246 Mpps (12.258 Mpkts 5.878 Gbps in 1001003 usec) 16.48 avg_batch 99999 min_space

No changes.

Now testing XL710 with 4 queues:
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: netmap queues/slots: TX 4/1024, RX 4/1024

*******************************************
sysctl dev.ixl.0.iflib.nm_tx_tmr_us=0 (default)
pkt-gen:
614.766048 main_thread [2639] 13.671 Mpps (14.539 Mpkts 6.562 Gbps in 1063494 usec) 15.75 avg_batch 99999 min_space

*******************************************
sysctl dev.ixl.0.iflib.nm_tx_tmr_us=100
pkt-gen:
640.652549 main_thread [2639] 13.672 Mpps (13.795 Mpkts 6.562 Gbps in 1009001 usec) 15.79 avg_batch 99999 min_space

No changes using the XL710 NIC (as a reminder, using FreeBSD 11 without iflib, I can reach +40Mpps on XL710 using pkt-gen)
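The single-queue X520 numbers in this comment follow a simple pattern: if the sender sleeps until the netmap timer fires and can then refill at most one full ring, the packet rate is bounded by ring size divided by timer period, capped at line rate. The sketch below computes that bound. It is only a rough model under those assumptions (it does not apply to the ixl runs, where the timer value made no difference), and the function names and the 14.88 Mpps cap for 10G are illustrative choices.

```python
def ring_bound_mpps(ring_slots, timer_us):
    """Upper bound in Mpps when one ring is refilled per timer period."""
    return ring_slots / timer_us  # slots per microsecond equals Mpps


def predicted_mpps(ring_slots, timer_us, line_rate_mpps=14.88):
    """Bound from the timer period, capped at the link's line rate."""
    return min(ring_bound_mpps(ring_slots, timer_us), line_rate_mpps)


if __name__ == "__main__":
    # (timer period in us, measured Mpps from the single-queue X520 runs)
    measured = [(300, 6.496), (200, 8.907), (100, 14.136), (90, 14.880)]
    for timer_us, mpps in measured:
        bound = predicted_mpps(2048, timer_us)
        print(f"nm_tx_tmr_us={timer_us:3d}: bound {bound:6.3f} Mpps, measured {mpps:6.3f} Mpps")
```

All four measured values stay at or below the bound, and the 2048-slot ring only stops being the limit once the period drops to about 100 us.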
(In reply to Vincenzo Maffione from comment #14)

Is there an easy way to test this on my opnsense VM with vmx interfaces? As far as I know, my netmap issue on vmx is related to this timer issue as well. I'm not so familiar with FreeBSD.
(In reply to vistalba from comment #17)

- Install Vanilla FreeBSD12
- pkg install git
- cd /usr && git clone https://github.com/opnsense/tools
- cd tools && make update
- make kernel

You can also just create an image by following the guides on https://github.com/opnsense/tools; this might be easier.
(In reply to Sylvain Galliano from comment #16)

Thanks a lot. In the XL710 case, have you tried lower values of the timer, such as 50 us down to 5 us? Is there any visible change?

Also, have you looked at pkt-gen CPU utilization? That tells you whether you are CPU limited (unlikely) or still limited by the "pseudo-interrupt rate" being too low. For instance, how does pkt-gen CPU utilization look in the case of XL710 and 1 queue (for simplicity, so that you have just a single thread)?
(In reply to vistalba from comment #17)

Of course you will need to apply the attached patch before "make kernel", e.g.

$ cd /path/to/freebsd/kernel/sources
$ patch -p1 < /path/to/attached/patch.diff
(In reply to Vincenzo Maffione from comment #19)

Result using 1 queue:
ixl1: PCI Express Bus: Speed 8.0GT/s Width x8
ixl1: netmap queues/slots: TX 1/1024, RX 1/1024

sysctl dev.ixl.0.iflib.nm_tx_tmr_us=50
194.333930 main_thread [2638] 11.896 Mpps (11.902 Mpkts 5.710 Gbps in 1000497 usec) 341.33 avg_batch 99999 min_space
pkt-gen cpu usage: 35%

sysctl dev.ixl.0.iflib.nm_tx_tmr_us=5
235.070929 main_thread [2638] 14.754 Mpps (15.543 Mpkts 7.082 Gbps in 1053521 usec) 390.14 avg_batch 99999 min_space
pkt-gen cpu usage: 56%

sysctl dev.ixl.0.iflib.nm_tx_tmr_us=1
266.392925 main_thread [2638] 14.748 Mpps (14.762 Mpkts 7.079 Gbps in 1000998 usec) 407.41 avg_batch 99999 min_space
pkt-gen cpu usage: 66%

With 6 queues configured, max pps is 17Mpps, even when using a low nm_tx_tmr_us value (1 us).
pkt-gen cpu usage: 82%
Created attachment 218866 [details] Netmap tx timer + timely credits update A small extension of the previous patch, which adds up the timely update of tx credits and timer start.
(In reply to Sylvain Galliano from comment #21) Thanks. The CPU utilization at least tells us that we are not CPU bound. Could you please perform some tests with the second patch? It's basically the same as the first one, with a couple of changes that should ensure a more timely recovery of consumed TX slots (and a more timely timer firing).
(In reply to Vincenzo Maffione from comment #23) After using 'Netmap tx timer + timely credits update' patch, I didn't notice any difference on results. Should I make some specific tests to confirm changes between 2 patches ?
Sorry, my bad. I read the code the wrong way, so the second patch is indeed useless. Please forget about that. The patch is not ensuring timely TX slots recovery (contrary to what I said in comment #23).

So it seems that the situation where we are losing against 11-stable is ixl with 6 queues (or more generally, with more than 1 queue). The other combinations (ix, or ixl/1q) are on par. Is this correct?

Now, focusing on the ixl/6q case, and using the first patch I provided, do you see a significant difference in average batch (as reported by pkt-gen) and pkt-gen CPU utilization? The avg_batch metric tells us how many packets we were able to send for each txsync syscall. So the higher the better (at least up to 100/200).
(In reply to Vincenzo Maffione from comment #25)

You are correct: using 1 queue on both ix/ixl NICs + tuning the new sysctl, we have the same result as FreeBSD 11.

Regarding avg_batch values on ixl, they are very low, not in the range you expected:

6 queues, iflib.nm_tx_tmr_us=5
070.664142 main_thread [2639] 13.588 Mpps (13.602 Mpkts 6.522 Gbps in 1001000 usec) 20.87 avg_batch 99999 min_space
cpu usage: 100%

Even with 1 queue I got a low value for avg_batch:
283.855379 main_thread [2639] 12.147 Mpps (12.757 Mpkts 5.831 Gbps in 1050245 usec) 13.97 avg_batch 99999 min_space
cpu usage: 100%

The ix (X520) NIC has a good avg_batch (whatever the number of queues):
404.130120 main_thread [2639] 14.880 Mpps (14.895 Mpkts 7.143 Gbps in 1000999 usec) 436.04 avg_batch 99999 min_space

I don't know if this can help you, but I did another test specific to the ixl NIC: setting hw.ixl.enable_head_writeback=0

avg_batch is higher, pps a little better and we are not CPU bound this time:

6 queues:
603.651106 main_thread [2639] 17.384 Mpps (17.402 Mpkts 8.345 Gbps in 1001003 usec) 308.10 avg_batch 99999 min_space
cpu usage: 71%

1 queue:
730.590104 main_thread [2639] 15.084 Mpps (15.416 Mpkts 7.241 Gbps in 1022004 usec) 442.64 avg_batch 99999 min_space
cpu usage: 57%

Same test but using more threads on pkt-gen (-p 6), 6 queues:
995.887010 main_thread [2639] 17.327 Mpps (17.339 Mpkts 8.317 Gbps in 1000693 usec) 286.35 avg_batch 599994 min_space
cpu usage: 197%

top -H:
  PID USERNAME PRI NICE  SIZE  RES STATE  C  TIME   WCPU COMMAND
70876 root     -92    0  348M  16M select 3  0:23 35.46% pkt-gen{pkt-gen}
70876 root     -92    0  348M  16M CPU4   4  0:23 33.58% pkt-gen{pkt-gen}
70876 root     -92    0  348M  16M RUN    6  0:23 33.44% pkt-gen{pkt-gen}
70876 root     -92    0  348M  16M select 2  0:23 31.69% pkt-gen{pkt-gen}
70876 root      45    0  348M  16M CPU9   9  0:23 31.64% pkt-gen{pkt-gen}
70876 root     -92    0  348M  16M select 6  0:23 31.24% pkt-gen{pkt-gen}
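Since avg_batch is the number of packets sent per txsync call, dividing the packet rate by avg_batch gives a rough estimate of how many txsync calls per second the sender makes. The sketch below applies this to three of the runs in this comment; the helper name is illustrative, and the estimate assumes the reported averages hold over the whole one-second interval.

```python
def txsync_calls_per_sec(mpps, avg_batch):
    """Approximate txsync rate implied by a pkt-gen report line."""
    return mpps * 1e6 / avg_batch


if __name__ == "__main__":
    runs = [
        ("ixl 6q, nm_tx_tmr_us=5", 13.588, 20.87),
        ("ix (X520)", 14.880, 436.04),
        ("ixl 6q, head writeback off", 17.384, 308.10),
    ]
    for label, mpps, avg_batch in runs:
        print(f"{label}: {txsync_calls_per_sec(mpps, avg_batch):,.0f} txsync/s")
```

The ixl run with small batches needs roughly 650 thousand calls per second, about twenty times more than the X520 run, which fits the 100% CPU usage reported for it.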
Created attachment 218932 [details] netmap tx timer + honor IPI_TX_INTR in ixl txd_encap Adds an unrelated fix on top of the first patch (218723). The new fix should remove a major regression introduced by the ixl porting to iflib.
(In reply to Sylvain Galliano from comment #26)

Thanks. I just figured out that there may be a major flaw introduced by the porting of ixl to iflib. This flaw would cause too many writebacks from the NIC to report completed transmissions, even if iflib only asks for a writeback once in a while.

Can you please run your tests on ixl again with the latest patch?
(In reply to Vincenzo Maffione from comment #28)

Yes, this is much better:

6 queues, nm_tx_tmr_us=5:
983.492185 main_thread [2639] 37.907 Mpps (37.945 Mpkts 18.196 Gbps in 1001000 usec) 512.00 avg_batch 99999 min_space
cpu usage: 100%

But with this patch, something is wrong when using 1 queue:

110.079117 main_thread [2639] 0.000 pps (0.000 pkts 0.000 bps in 1003920 usec) 0.00 avg_batch 99999 min_space
111.080184 main_thread [2639] 0.000 pps (0.000 pkts 0.000 bps in 1001066 usec) 0.00 avg_batch 99999 min_space
111.714181 sender_body [1663] poll error on queue 0: timeout
112.089179 main_thread [2639] 0.000 pps (0.000 pkts 0.000 bps in 1008996 usec) 0.00 avg_batch 99999 min_space
113.116178 main_thread [2639] 0.000 pps (0.000 pkts 0.000 bps in 1026999 usec) 0.00 avg_batch 99999 min_space

(I've double checked by reverting to the previous patch: no issue)

Same errors when using 6 queues and pkt-gen with 6 threads (-p 6).
(In reply to Vincenzo Maffione from comment #28)

Hi Vincenzo,

Good catch! Thanks a lot! The non-iflib version of ixl also sets a request status flag on all 'End of packet' descriptors. I'm guessing that the difference in performance is related to dynamic interrupt moderation, which is disabled by default in the iflib version of the driver. I'm a bit concerned though about the 1 queue case. I think it would be good to put this fix in a separate review and let our validation team run some tests. Would you like me to do it or do you prefer to do it yourself?
(In reply to Krzysztof Galazka from comment #30)

Hi Krzysztof,

I agree, and created a separate review for this possible change: https://reviews.freebsd.org/D26896

It would be nice if you guys could run some tests to validate the change against normal TCP/IP stack usage (e.g. non-netmap).

Speaking about the non-iflib driver, I guess it is acceptable for the ixl_xmit routine to always set the report flag on the EOP packet. However, this is not acceptable for netmap, and indeed the non-iflib netmap ixl driver only sets it twice per ring (see https://github.com/freebsd/freebsd/blob/stable/11/sys/dev/netmap/if_ixl_netmap.h#L221-L223). This is, in my opinion, what explains the huge difference in performance between non-iflib and iflib for ixl.

If my understanding is correct, and according to our past experience with netmap, the report flag will cause the NIC to initiate a DMA transaction to either set the DD bit in the descriptor, or perform a memory write to update the shadow TDH. This is particularly expensive in netmap when done for each descriptor, especially because netmap uses single-descriptor packets.

Interrupt moderation can also help a lot to mitigate the CPU overhead, but as far as I can see it does not limit the writeback DMA transactions, and therefore it does not help in the netmap use-case. Moreover, my understanding is that iflib is not using hardware interrupts on ixl (nor ix), but rather is using "softirqs", so I guess that interrupt moderation does not play a role here. I may be wrong on this last point.

I suspect the 1-queue pkt-gen hang problem may be due to something else, rather than the report flag change. As a matter of fact, the same logic was working fine on the non-iflib driver.
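To illustrate the difference between the two report-flag policies discussed here, the following sketch counts how many descriptors would carry the report flag (and so trigger a writeback) for a stream of single-descriptor packets. It is a simplified model, not driver code: the policy names are made up for this example, and the "twice per ring" rule is modeled as flagging slot 0 and the slot at half the ring size.

```python
def count_writebacks(n_packets, ring_slots, policy):
    """Count descriptors that would request a completion writeback."""
    count = 0
    for pkt in range(n_packets):
        slot = pkt % ring_slots  # netmap uses one descriptor per packet
        if policy == "every_descriptor":
            count += 1
        elif policy == "twice_per_ring":
            # Modeled rule: flag only slot 0 and the mid-ring slot.
            if slot == 0 or slot == ring_slots // 2:
                count += 1
        else:
            raise ValueError(f"unknown policy: {policy}")
    return count


if __name__ == "__main__":
    ring = 2048
    packets = ring * 10  # ten full passes over the ring
    for policy in ("every_descriptor", "twice_per_ring"):
        print(f"{policy}: {count_writebacks(packets, ring, policy)} writebacks")
```

In this model, ten passes over a 2048-slot ring request 20480 writebacks with a flag on every descriptor, against 20 when the flag is set twice per ring.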
(In reply to Sylvain Galliano from comment #29)

Thanks again for your tests.

I'm inclined to think that the pkt-gen hang issue that you see is not directly caused by the ixl patch. Would you please try to test what happens if you only apply the ixl patch (discarding all the changes related to the netmap timer)? In the end, the change to ixl is orthogonal to the netmap timer issue.

Also, it would be useful to understand whether the hang problem comes from some sort of race condition or not. For this purpose, you may try to use the -R argument of pkt-gen (this time with the timer+ixl patch) to specify a maximum rate in packets per second (pps). E.g. you could start from 1000 pps, check that it does not hang, double the rate and repeat the process until you find a critical rate that causes the hang. Unless this is not a race condition and the hang happens at any rate.

When the hang happens, it may help to see the ring state, e.g. with the following patch to pkt-gen. I expect to see head, cur and tail having the same value.

diff --git a/apps/pkt-gen/pkt-gen.c b/apps/pkt-gen/pkt-gen.c
index ef876f4f..19497fe9 100644
--- a/apps/pkt-gen/pkt-gen.c
+++ b/apps/pkt-gen/pkt-gen.c
@@ -1675,6 +1675,10 @@ sender_body(void *data)
 				break;
 			D("poll error on queue %d: %s", targ->me,
 				rv ? strerror(errno) : "timeout");
+			for (i = targ->nmd->first_tx_ring; i <= targ->nmd->last_tx_ring; i++) {
+				txring = NETMAP_TXRING(nifp, i);
+				D("txring %u %u %u", txring->head, txring->cur, txring->tail);
+			}
 			// goto quit;
 		}
 		if (pfd.revents & POLLERR) {
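The rate-doubling procedure suggested above can be sketched as a small search loop. This is only an outline: the `hangs` callback is a placeholder for "run pkt-gen with -R at this rate and report whether it stalled", the interface name and the 18 Mpps threshold in the demo are made up for illustration, and the upper limit is an arbitrary choice.

```python
def doubling_rates(start_pps=1000, max_pps=64_000_000):
    """Yield start_pps, 2*start_pps, 4*start_pps, ... up to max_pps."""
    rate = start_pps
    while rate <= max_pps:
        yield rate
        rate *= 2


def find_critical_rate(hangs, start_pps=1000, max_pps=64_000_000):
    """Return the first tested rate for which hangs(rate) is true, else None."""
    for rate in doubling_rates(start_pps, max_pps):
        if hangs(rate):
            return rate
    return None


def pkt_gen_command(ifname, rate):
    """Command line for one test run (interface name is a placeholder)."""
    return f"pkt-gen -i {ifname} -f tx -R {rate}"


if __name__ == "__main__":
    # Stand-in for a real run: pretend the hang shows up at 18 Mpps and above.
    fake_hangs = lambda rate: rate >= 18_000_000
    critical = find_critical_rate(fake_hangs)
    print("first hanging rate:", critical)
    print("command to re-run:", pkt_gen_command("ixl0", critical))
```

Doubling only brackets the critical rate between the last good and the first bad value; a finer search between those two would be needed to narrow it down.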
(In reply to Vincenzo Maffione from comment #32)

Here are the results:

ixl only patch, 6 queues, pkt-gen WITHOUT -R:
710.623764 main_thread [2642] 38.621 Mpps (38.698 Mpkts 18.538 Gbps in 1002000 usec) 512.00 avg_batch 99999 min_space

ixl only patch, 1 queue, pkt-gen WITHOUT -R:
670.168017 main_thread [2642] 0.000 pps (0.000 pkts 0.000 bps in 1009185 usec) 0.00 avg_batch 99999 min_space
671.181833 main_thread [2642] 0.000 pps (0.000 pkts 0.000 bps in 1013816 usec) 0.00 avg_batch 99999 min_space
672.171832 sender_body [1662] poll error on queue 0: timeout
672.171838 sender_body [1665] txring 513 513 513
672.191833 main_thread [2642] 0.000 pps (0.000 pkts 0.000 bps in 1010000 usec) 0.00 avg_batch 99999 min_space

ixl only patch, 1 queue, pkt-gen WITH -R:

-R 1000:
813.372070 main_thread [2642] 1.001 Kpps (1.002 Kpkts 480.718 Kbps in 1000503 usec) 3.00 avg_batch 99999 min_space

-R 2000:
860.807010 main_thread [2642] 2.006 Kpps (2.010 Kpkts 962.692 Kbps in 1002190 usec) 6.00 avg_batch 99999 min_space

... (all intermediate -R values worked) ...

-R 17000000:
057.160242 main_thread [2642] 17.001 Mpps (18.072 Mpkts 8.160 Gbps in 1063000 usec) 512.00 avg_batch 99999 min_space

-R 18000000:
030.167994 sender_body [1662] poll error on queue 0: timeout
030.168001 sender_body [1665] txring 513 513 513

ixl + timer patches, 1 queue, pkt-gen WITH -R:
sysctl nm_tx_tmr_us=5

-R 1000:
261.886507 main_thread [2642] 1.001 Kpps (1.065 Kpkts 480.679 Kbps in 1063496 usec) 3.00 avg_batch 99999 min_space

-R 2000:
279.365024 main_thread [2642] 2.000 Kpps (2.034 Kpkts 960.219 Kbps in 1016768 usec) 6.00 avg_batch 99999 min_space

... (all intermediate -R values worked) ...

-R 17000000
388.372451 main_thread [2642] 17.000 Mpps (18.079 Mpkts 8.160 Gbps in 1063431 usec) 512.00 avg_batch 99999 min_space

-R 18000000
894.421917 main_thread [2642] 18.000 Mpps (18.036 Mpkts 8.640 Gbps in 1002001 usec) 512.00 avg_batch 99999 min_space
and sometimes an error

-R 19000000
991.012912 sender_body [1662] poll error on queue 0: timeout
991.012920 sender_body [1665] txring 513 513 513

another run:
968.011919 sender_body [1662] poll error on queue 0: timeout
968.011926 sender_body [1665] txring 235 235 235

and another one:
112.008840 sender_body [1662] poll error on queue 0: timeout
112.008848 sender_body [1665] txring 95 95 95
(In reply to Sylvain Galliano from comment #33)

Ok, thanks. At this point it's clear that there are two independent issues that slow down netmap-iflib on ix/ixl. The first is the lack of a per-tx-queue netmap timer (or taskqueue). The second is the lack of descriptor writeback moderation in ixl. We can start by merging the timer patch, and then work on the separate ixl issue.
Created attachment 219062 [details] netmap tx timer w/queue intr enable + honor IPCP_TX_INTR in ixl_txd_encap Extension of the last one (218932), to also call the IFDI tx queue interrupt enable, similarly to what the iflib_timer() code already does.
(In reply to Vincenzo Maffione from comment #35) I would ask for advice from the Intel guys here... I'm trying to compare stable/11 vs current, regarding how TX interrupts are handled. It looks like in stable/11 MSI-x handlers are shared for the TX and RX queue, while in current TX interrupts are not used. Also, in stable/11 the interrupt handler seems to do a disable_queue and then enable_queue, while on current I only see the enable_queue step (IFDI_TX_QUEUE_INTR_ENABLE). Therefore, in the last patch I also add the enable_queue step in the netmap timer routine. It may be worth giving a try to see if this fixes the ixl issue.
(In reply to Vincenzo Maffione from comment #36) I have tested last patch (netmap tx timer w/queue intr enable + honor IPCP_TX_INTR in ixl_txd_encap): same results
Created attachment 219121 [details] Cleaned up netmap tx timer patch (no sysctl)
(In reply to Sylvain Galliano from comment #37)

Ok thanks. It was worth a try. I guess we'll need some help from Intel here.

In the meantime, I would like to commit the netmap tx timer change only. I attached a cleaned up patch, with a hardcoded value for the netmap timer. I would rather not add a new sysctl for something that may be changed again soon. In any case, the patch is meant to greatly improve the current situation for both ix and ixl.

Could you please run your tests again on ix and ixl to check that you get numbers that are consistent with the ones you reported in comment #16?
(In reply to Vincenzo Maffione from comment #39) results are all good for ix (X520) NIC (+14M pps, same as FreeBSD 11) No changes in ixl (same results as comment #16)
(In reply to Sylvain Galliano from comment #40)

Thank you for confirming. In the meantime I'll commit this change. Maybe we should open a separate issue for the ixl regression? Now we know that it is caused by the RS flag being set on all the TX descriptors.
A commit references this bug:

Author: vmaffione
Date: Tue Oct 27 21:53:33 UTC 2020
New revision: 367093
URL: https://svnweb.freebsd.org/changeset/base/367093

Log:
  iflib: add per-tx-queue netmap timer

  The way netmap TX is handled in iflib when TX interrupts are not used (IFC_NETMAP_TX_IRQ not set) has some issues:
  - The netmap_tx_irq() function gets called by iflib_timer(), which gets scheduled with tick granularity (hz). This is not frequent enough for 10Gbps NICs and beyond (e.g., ixgbe or ixl). The end result is that the transmitting netmap application is not woken up fast enough to saturate the link with small packets.
  - The iflib_timer() functions also calls isc_txd_credits_update() to ask for more TX completion updates. However, this violates the netmap requirement that only txsync can access the TX queue for datapath operations. Only netmap_tx_irq() may be called out of the txsync context.

  This change introduces per-tx-queue netmap timers, using microsecond granularity to ensure that netmap_tx_irq() can be called often enough to allow for maximum packet rate. The timer routine simply calls netmap_tx_irq() to wake up the netmap application. The latter will wake up and call txsync to collect TX completion updates.

  This change brings back line rate speed with small packets for ixgbe. For the time being, timer expiration is hardcoded to 90 microseconds, in order to avoid introducing a new sysctl. We may eventually implement an adaptive expiration period or use another deferred work mechanism in place of timers.

  Also, fix the timers usage to make sure that each queue is serviced by a different CPU.

  PR: 248652
  Reported by: sg@efficientip.com
  MFC after: 2 weeks

Changes:
  head/sys/net/iflib.c
I ran the same tests on VMware with a vmxnet NIC and the latest patch, and got a panic:

spin lock 0xfffff80003079cc0 (turnstile lock) held by 0xfffffe0009607e00 (tid 100006) too long
panic: spin lock held too long
cpuid = 1
time = 1603884508
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0008480680
vpanic() at vpanic+0x182/frame 0xfffffe00084806d0
panic() at panic+0x43/frame 0xfffffe0008480730
_mtx_lock_indefinite_check() at _mtx_lock_indefinite_check+0x64/frame 0xfffffe0008480740
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xd5/frame 0xfffffe00084807b0
turnstile_trywait() at turnstile_trywait+0xe3/frame 0xfffffe00084807e0
__mtx_lock_sleep() at __mtx_lock_sleep+0x119/frame 0xfffffe0008480870
doselwakeup() at doselwakeup+0x179/frame 0xfffffe00084808c0
nm_os_selwakeup() at nm_os_selwakeup+0x13/frame 0xfffffe00084808e0
netmap_notify() at netmap_notify+0x3d/frame 0xfffffe0008480900
softclock_call_cc() at softclock_call_cc+0x13d/frame 0xfffffe00084809a0
callout_process() at callout_process+0x1c0/frame 0xfffffe0008480a10
handleevents() at handleevents+0x188/frame 0xfffffe0008480a50
timercb() at timercb+0x24e/frame 0xfffffe0008480aa0
lapic_handle_timer() at lapic_handle_timer+0x9b/frame 0xfffffe0008480ad0
Xtimerint() at Xtimerint+0xb1/frame 0xfffffe0008480ad0
--- interrupt, rip = 0xffffffff80f5bd46, rsp = 0xfffffe0008480ba0, rbp = 0xfffffe0008480ba0 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe0008480ba0
acpi_cpu_idle() at acpi_cpu_idle+0x2eb/frame 0xfffffe0008480bf0
cpu_idle_acpi() at cpu_idle_acpi+0x3e/frame 0xfffffe0008480c10
cpu_idle() at cpu_idle+0x9f/frame 0xfffffe0008480c30
sched_idletd() at sched_idletd+0x2e4/frame 0xfffffe0008480cf0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0008480d30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0008480d30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

I then retried with the first patch you sent (Draft patch to test the netmap tx timer): no issue this time.
The only major difference I can see between the two patches (apart from the sysctl) is:
+ txq->ift_timer.c_cpu = cpu;
and
+ txq->ift_netmap_timer.c_cpu = cpu;
Created attachment 219179 [details] Cleaned up netmap tx timer (bugfixes)
(In reply to Sylvain Galliano from comment #43) Ugh. Thanks for reporting. I indeed introduced a subtle typo bug, using callout_reset_sbt() rather than callout_reset_sbt_on() (as intended). Therefore I was passing the "cpu" value to the "flags" argument, resulting in a disaster. In your test this probably triggered the C_DIRECT_EXEC flag of callout(9), so that the timer was being executed in hardware interrupt context. I uploaded the patch that is now consistent with the src tree (that I'm going to fix right away).
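For readers following along, the mix-up described above can be illustrated with a small userspace mock. This is not the iflib code: the mock_* functions and the struct are hypothetical stand-ins that only mirror the trailing arguments of callout_reset_sbt(9) and callout_reset_sbt_on(9) (the callout, function and argument parameters are omitted). The C_DIRECT_EXEC value is, as far as I know, the one from sys/callout.h, and the -1 used for "no CPU selected" is an assumption of the mock.

```c
#include <stdint.h>

/* Flag bit as defined in FreeBSD's sys/callout.h (to my knowledge). */
#define C_DIRECT_EXEC	0x0001

typedef int64_t sbintime_t;	/* stand-in for the kernel type */

/* Records what a mock call received, so the mix-up is observable. */
struct mock_call {
	int cpu;	/* CPU the timer would be bound to, -1 if none */
	int flags;	/* callout flags as seen by the callee */
};

/* Mock with the trailing (cpu, flags) arguments of callout_reset_sbt_on(). */
static struct mock_call
mock_reset_sbt_on(sbintime_t sbt, sbintime_t pr, int cpu, int flags)
{
	struct mock_call c = { cpu, flags };

	(void)sbt;
	(void)pr;
	return c;
}

/* Mock of callout_reset_sbt(): no cpu argument, the last one is flags. */
static struct mock_call
mock_reset_sbt(sbintime_t sbt, sbintime_t pr, int flags)
{
	return mock_reset_sbt_on(sbt, pr, -1, flags);
}

/* Nonzero if the flags ask for execution directly from the interrupt. */
static int
wants_direct_exec(struct mock_call c)
{
	return (c.flags & C_DIRECT_EXEC) != 0;
}
```

Calling mock_reset_sbt(sbt, pr, cpu) with cpu = 1 compiles without any warning, since both values are plain ints, but the callee sees flags = 1, which has the C_DIRECT_EXEC bit set, and no CPU binding at all. The intended mock_reset_sbt_on(sbt, pr, cpu, 0) binds to CPU 1 with no flags set.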
A commit references this bug:

Author: vmaffione
Date: Wed Oct 28 21:06:18 UTC 2020
New revision: 367117
URL: https://svnweb.freebsd.org/changeset/base/367117

Log:
  iflib: fix typo bug introduced by r367093

  Code was supposed to call callout_reset_sbt_on() rather than
  callout_reset_sbt(). This resulted into passing a "cpu" value
  to a "flag" argument. A recipe for subtle errors.

  PR: 248652
  Reported by: sg@efficientip.com
  MFC with: r367093

Changes:
  head/sys/net/iflib.c
A commit references this bug:

Author: vmaffione
Date: Wed Nov 11 21:27:17 UTC 2020
New revision: 367599
URL: https://svnweb.freebsd.org/changeset/base/367599

Log:
  MFC r367093, r367117

  iflib: add per-tx-queue netmap timer

  The way netmap TX is handled in iflib when TX interrupts are not
  used (IFC_NETMAP_TX_IRQ not set) has some issues:
    - The netmap_tx_irq() function gets called by iflib_timer(), which
      gets scheduled with tick granularity (hz). This is not frequent
      enough for 10Gbps NICs and beyond (e.g., ixgbe or ixl). The end
      result is that the transmitting netmap application is not woken
      up fast enough to saturate the link with small packets.
    - The iflib_timer() functions also calls isc_txd_credits_update()
      to ask for more TX completion updates. However, this violates
      the netmap requirement that only txsync can access the TX queue
      for datapath operations. Only netmap_tx_irq() may be called out
      of the txsync context.

  This change introduces per-tx-queue netmap timers, using microsecond
  granularity to ensure that netmap_tx_irq() can be called often enough
  to allow for maximum packet rate. The timer routine simply calls
  netmap_tx_irq() to wake up the netmap application. The latter will
  wake up and call txsync to collect TX completion updates.

  This change brings back line rate speed with small packets for ixgbe.
  For the time being, timer expiration is hardcoded to 90 microseconds,
  in order to avoid introducing a new sysctl.
  We may eventually implement an adaptive expiration period or use
  another deferred work mechanism in place of timers.

  Also, fix the timers usage to make sure that each queue is serviced
  by a different CPU.

  PR: 248652
  Reported by: sg@efficientip.com

Changes:
  _U stable/12/
  stable/12/sys/net/iflib.c
The original issue has been addressed. Over the course of the debugging and testing, an additional separate issue specific to ixl has been found. The latter can be followed up here: https://reviews.freebsd.org/D26896