Bug 248652 - iflib: netmap pkt-gen large TX performance difference between 11-STABLE and 12-STABLE/CURRENT on ix & ixl NIC
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Vincenzo Maffione
URL:
Keywords: iflib, needs-qa, performance, regression
Depends on:
Blocks:
 
Reported: 2020-08-14 08:52 UTC by Sylvain Galliano
Modified: 2020-11-11 21:27 UTC

See Also:
koobs: maintainer-feedback? (vmaffione)
koobs: maintainer-feedback? (freebsd)
koobs: mfc-stable12?
koobs: mfc-stable11-


Attachments
Draft patch to test the netmap tx timer (6.82 KB, patch)
2020-10-13 21:14 UTC, Vincenzo Maffione
Netmap tx timer + timely credits update (7.27 KB, patch)
2020-10-18 20:36 UTC, Vincenzo Maffione
netmap tx timer + honor IPI_TX_INTR in ixl txd_encap (7.90 KB, patch)
2020-10-20 21:14 UTC, Vincenzo Maffione
netmap tx timer w/queue intr enable + honor IPCP_TX_INTR in ixl_txd_encap (8.01 KB, patch)
2020-10-25 15:30 UTC, Vincenzo Maffione
Cleaned up netmap tx timer patch (no sysctl) (6.83 KB, patch)
2020-10-26 21:39 UTC, Vincenzo Maffione
Cleaned up netmap tx timer (bugfixes) (7.01 KB, patch)
2020-10-28 20:50 UTC, Vincenzo Maffione

Description Sylvain Galliano 2020-08-14 08:52:04 UTC
I'm comparing netmap TX performance between 11-STABLE and CURRENT (12-STABLE gives the same results as CURRENT) with 2 NICs:
Intel X520 (10G) and Intel XL710 (40G)
Here are my tests and the results for different OS versions, NICs and numbers of queues

*******************************************

Testing NIC Intel X520, 1 queue configured
pkt-gen -i ix1 -f tx -S a0:36:9f:3e:57:1a -D 3c:fd:fe:a2:22:91 -s 192.168.0.1 -d 192.168.0.2

11-STABLE:
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> port 0xece0-0xecff mem 0xdb600000-0xdb6fffff,0xdb7fc000-0xdb7fffff irq 53 at device 0.1 numa-domain 0 on pci5
ix1: Using MSI-X interrupts with 2 vectors
ix1: Ethernet address: a0:36:9f:51:c9:66
ix1: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: netmap queues/slots: TX 1/2048, RX 1/2048

pkt-gen result:
297.988718 main_thread [2639] 14.151 Mpps (15.049 Mpkts 6.792 Gbps in 1063439 usec) 510.11 avg_batch 0 min_space
14Mpps

CURRENT:
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver> port 0xece0-0xecff mem 0xdb600000-0xdb6fffff,0xdb7fc000-0xdb7fffff irq 53 at device 0.1 numa-domain 0 on pci5
ix1: Using 2048 TX descriptors and 2048 RX descriptors
ix1: Using 1 RX queues 1 TX queues
ix1: Using MSI-X interrupts with 2 vectors
ix1: allocated for 1 queues
ix1: allocated for 1 rx queues
ix1: Ethernet address: a0:36:9f:51:c9:66
ix1: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: netmap queues/slots: TX 1/2048, RX 1/2048

pkt-gen result:
198.445241 main_thread [2639] 2.615 Mpps (2.620 Mpkts 1.255 Gbps in 1001871 usec) 466.26 avg_batch 99999 min_space

2.6Mpps: much slower than 11-STABLE (14Mpps)

*******************************************

Testing NIC Intel XL710, 6 queues configured
pkt-gen -i ixl0 -f tx -S 9c:69:b4:60:ef:44 -D 9c:69:b4:60:35:ac -s 192.168.2.1 -d 192.168.2.2

11-STABLE:
ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.11.9-k> mem 0xd5000000-0xd57fffff,0xd6ff0000-0xd6ff7fff irq 40 at device 0.0 numa-domain 0 on pci2
ixl0: using 2048 tx descriptors and 2048 rx descriptors
ixl0: fw 6.0.48442 api 1.7 nvm 6.01 etid 800034a4 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSIX 129, VF MSIX 5, QPs 768, I2C
ixl0: Using MSIX interrupts with 7 vectors
ixl0: Allocating 8 queues for PF LAN VSI; 6 queues active
ixl0: Ethernet address: 9c:69:b4:60:ef:44
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 6/2048, RX 6/2048
ixl0: TSO4 requires txcsum, disabling both...

pkt-gen result:
515.210701 main_thread [2639] 42.566 Mpps (45.248 Mpkts 20.432 Gbps in 1062998 usec) 395.17 avg_batch 99999 min_space

42Mpps


CURRENT:
ixl0: <Intel(R) Ethernet Controller XL710 for 40GbE QSFP+ - 2.2.0-k> mem 0xd5000000-0xd57fffff,0xd6ff0000-0xd6ff7fff irq 40 at device 0.0 numa-domain 0 on pci2
ixl0: fw 6.0.48442 api 1.7 nvm 6.01 etid 800034a4 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSI-X 129, VF MSI-X 5, QPs 768, I2C
ixl0: Using 2048 TX descriptors and 2048 RX descriptors
ixl0: Using 6 RX queues 6 TX queues
ixl0: Using MSI-X interrupts with 7 vectors
ixl0: Ethernet address: 9c:69:b4:60:ef:44
ixl0: Allocating 8 queues for PF LAN VSI; 6 queues active
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 6/2048, RX 6/2048
ixl0: Media change is not supported.
ixl0: Link is up, 40 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: True, Flow Control: None

pkt-gen result:
941.463329 main_thread [2639] 13.564 Mpps (13.741 Mpkts 6.511 Gbps in 1013001 usec) 16.04 avg_batch 99999 min_space

13Mpps: much slower than 11-STABLE (42Mpps)


*******************************************
And a last test, this one showing better performance in CURRENT vs 11-STABLE :)

Testing NIC Intel XL710, 1 queue configured
pkt-gen -i ixl0 -f tx -S 9c:69:b4:60:ef:44 -D 9c:69:b4:60:35:ac -s 192.168.2.1 -d 192.168.2.2

11-STABLE:
ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.11.9-k> mem 0xd5000000-0xd57fffff,0xd6ff0000-0xd6ff7fff irq 40 at device 0.0 numa-domain 0 on pci2
ixl0: using 2048 tx descriptors and 2048 rx descriptors
ixl0: fw 6.0.48442 api 1.7 nvm 6.01 etid 800034a4 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSIX 129, VF MSIX 5, QPs 768, I2C
ixl0: Using MSIX interrupts with 2 vectors
ixl0: Allocating 1 queues for PF LAN VSI; 1 queues active
ixl0: Ethernet address: 9c:69:b4:60:ef:44
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 1/2048, RX 1/2048
ixl0: TSO4 requires txcsum, disabling both...

pkt-gen result:
609.889550 main_thread [2639] 8.413 Mpps (8.617 Mpkts 4.038 Gbps in 1024294 usec) 511.42 avg_batch 0 min_space

8Mpps

CURRENT:
ixl0: <Intel(R) Ethernet Controller XL710 for 40GbE QSFP+ - 2.2.0-k> mem 0xd5000000-0xd57fffff,0xd6ff0000-0xd6ff7fff irq 40 at device 0.0 numa-domain 0 on pci2
ixl0: fw 6.0.48442 api 1.7 nvm 6.01 etid 800034a4 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSI-X 129, VF MSI-X 5, QPs 768, I2C
ixl0: Using 2048 TX descriptors and 2048 RX descriptors
ixl0: Using 1 RX queues 1 TX queues
ixl0: Using MSI-X interrupts with 2 vectors
ixl0: Ethernet address: 9c:69:b4:60:ef:44
ixl0: Allocating 1 queues for PF LAN VSI; 1 queues active
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 1/2048, RX 1/2048
ixl0: Media change is not supported.
ixl0: Link is up, 40 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: None, Autoneg: True, Flow Control: None

pkt-gen result:
526.299416 main_thread [2639] 12.228 Mpps (12.240 Mpkts 5.870 Gbps in 1001000 usec) 14.37 avg_batch 99999 min_space

12Mpps: much better than 11-STABLE (8Mpps)
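
For anyone who wants to compare runs without reading the numbers off by eye, here is a minimal sketch that pulls the packet rate and avg_batch out of a pkt-gen statistics line. The regular expression is only modelled on the lines quoted in this report (it is an assumption, not a specification of the pkt-gen output format), and `parse_stat` is a hypothetical helper name.

```python
import re

# Pattern modelled on the main_thread lines quoted in this report.
# This is an assumption about the format, not a pkt-gen specification.
STAT_RE = re.compile(
    r"(?P<rate>[\d.]+) (?P<unit>[KMG]?)pps .*? in (?P<usec>\d+) usec\) "
    r"(?P<batch>[\d.]+) avg_batch"
)
SCALE = {"": 1.0, "K": 1e3, "M": 1e6, "G": 1e9}


def parse_stat(line):
    """Return (packets_per_second, avg_batch), or None if the line does not match."""
    m = STAT_RE.search(line)
    if m is None:
        return None
    return float(m["rate"]) * SCALE[m["unit"]], float(m["batch"])


# X520, 1 queue: lines copied from the report above.
old = parse_stat("297.988718 main_thread [2639] 14.151 Mpps (15.049 Mpkts "
                 "6.792 Gbps in 1063439 usec) 510.11 avg_batch 0 min_space")
new = parse_stat("198.445241 main_thread [2639] 2.615 Mpps (2.620 Mpkts "
                 "1.255 Gbps in 1001871 usec) 466.26 avg_batch 99999 min_space")
print(f"11-STABLE / CURRENT pps ratio: {old[0] / new[0]:.2f}")
```

With the two X520 lines above, the ratio comes out at roughly 5.4, which is the size of the regression being reported.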
Comment 1 Vincenzo Maffione freebsd_committer 2020-08-14 14:23:02 UTC
Thanks for reporting.
What I can tell you for sure is that the difference is to be attributed to the conversion of Intel drivers (em, ix, ixl) to iflib.
This impacted netmap because netmap support for iflib drivers (Intel ones, vmx, mgb, bnxt) is provided directly within the iflib core. In other words, no explicit netmap code stays within the drivers.

I would say some inherent performance drop is to be expected, due to the additional indirection introduced by iflib. However, the performance drop should not be as large as reported in your experiments.
The 2.6 Mpps you get in the first comparison makes me think that you may have accidentally left Ethernet flow control enabled.
Moreover, the last experiment is rather confusing, since you actually get a performance improvement... this makes me think that maybe the configuration is not 100% aligned between the two cases?

Have you tried to disable all the offloads? In 11-stable the driver-specific netmap code does not program the offloads, whereas in CURRENT (and 12) the iflib callbacks actually program the offloads also in case of netmap.

  # ifconfig ix0 -txcsum -rxcsum -tso4 -tso6 -lro -txcsum6 -rxcsum6
Comment 2 Sylvain Galliano 2020-08-14 15:21:40 UTC
Hi Vincenzo, thanks for your quick reply.

I've disabled all offloads in both 11-STABLE and CURRENT and I got the same results.

I did another test that may help you:

I've recompiled pkt-gen on current after adding:
 #define BUSYWAIT

Testing NIC Intel X520, 1 queue configured

default pkt-gen:
696.194470 main_thread [2641] 2.560 Mpps (2.570 Mpkts 1.229 Gbps in 1004000 usec) 465.45 avg_batch 99999 min_space

with busywait:
733.764470 main_thread [2641] 14.881 Mpps (15.172 Mpkts 7.143 Gbps in 1019565 usec) 344.22 avg_batch 99999 min_space

14Mpps, same as 11-STABLE
Comment 3 Vincenzo Maffione freebsd_committer 2020-08-14 15:32:50 UTC
It looks like you get 2.6 Mpps because you are not getting enough interrupts... have you tried to measure the interrupt rate in the two cases (current vs 11, no busy wait)?
Intel NICs have tunables to set interrupt coalescing, for both TX and RX. Maybe playing with those changes the game?
Comment 4 Sylvain Galliano 2020-08-14 19:51:25 UTC
You're right, the interrupt rate is what limits the pps on CURRENT with the X520 NIC:

pkt-gen, no busy wait

11-stable:  27500 irq/s
CURRENT:    5500  irq/s

pkt-gen, with busy wait on CURRENT: +30000 irq/s

Regarding NIC irq tunable, the only one related to 'ix' looks good:

# sysctl hw.ix.max_interrupt_rate
hw.ix.max_interrupt_rate: 31250
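
A quick back-of-the-envelope check of these numbers, as a sketch only: it divides the pps figures from the original description by the interrupt rates measured here. The two sets of figures come from different runs, so the result is approximate, and `packets_per_interrupt` is just an illustrative helper.

```python
def packets_per_interrupt(pps, irq_per_sec):
    """Average number of packets transmitted between two interrupts."""
    return pps / irq_per_sec


# X520, 1 queue, no busy wait (pps from the description, irq/s from this comment).
stable11 = packets_per_interrupt(14.151e6, 27500)
current = packets_per_interrupt(2.615e6, 5500)

print(f"11-STABLE: {stable11:.0f} packets per interrupt")
print(f"CURRENT:   {current:.0f} packets per interrupt")
```

Both values land close to the avg_batch that pkt-gen reported (about 510 and 466), which suggests that each interrupt releases roughly one batch and that pps scales with the interrupt rate rather than with anything pkt-gen does.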
Comment 5 Vincenzo Maffione freebsd_committer 2020-08-14 21:08:26 UTC
Ok, thanks for the feedback. That means that the issue is that iflib is not asking for enough TX descriptor writebacks. The iflib txsync routine needs some investigation.
Comment 6 Kubilay Kocak freebsd_committer freebsd_triage 2020-08-15 05:00:57 UTC
(In reply to Vincenzo Maffione from comment #5)

Is this more specific/scoped to:

 - netmap & iflib, or 
 - ix/ixl and/or NIC driver & iflib, or
 - iflib framework (generally)
Comment 7 Vincenzo Maffione freebsd_committer 2020-08-18 08:41:41 UTC
(In reply to Kubilay Kocak from comment #6)
I would say
  ix/ixl and/or NIC driver & iflib
because it's not something related to the netmap module itself, and it is an optimization which derives from ix/ixl netmap support code, which now is included within iflib.
Comment 8 Sylvain Galliano 2020-08-28 19:58:15 UTC
After looking at iflib_netmap_timer_adjust() & iflib_netmap_txsync() in sys/net/iflib.c,
I did some tuning of kern.hz:

Still using X520 with 1 queue
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix0: netmap queues/slots: TX 1/2048, RX 1/2048

*****************

/boot/loader.conf:
kern.hz=1000  (default)

pkt-gen result:
204.153802 main_thread [2639] 2.562 Mpps (2.567 Mpkts 1.230 Gbps in 1001994 usec) 465.32 avg_batch 99999 min_space
205.155321 main_thread [2639] 2.561 Mpps (2.565 Mpkts 1.229 Gbps in 1001519 usec) 465.45 avg_batch 99999 min_space

5500 irq/s

*****************

/boot/loader.conf:
kern.hz=1999

pkt-gen result:
41.375049 main_thread [2639] 5.117 Mpps (5.222 Mpkts 2.456 Gbps in 1020510 usec) 465.45 avg_batch 99999 min_space
42.375546 main_thread [2639] 5.118 Mpps (5.121 Mpkts 2.457 Gbps in 1000497 usec) 465.42 avg_batch 99999 min_space

11000 irq/s

2x the performance and irq/s

*****************

/boot/loader.conf:
kern.hz=2000

pkt-gen result:
797.608080 main_thread [2639] 2.560 Mpps (2.563 Mpkts 1.229 Gbps in 1001001 usec) 465.50 avg_batch 99999 min_space
798.609079 main_thread [2639] 2.560 Mpps (2.563 Mpkts 1.229 Gbps in 1000999 usec) 465.41 avg_batch 99999 min_space

5500 irq/s

Same performance & irq/s as kern.hz=1000 (due to limit at 2000 in iflib_netmap_timer_adjust & iflib_netmap_txsync)

*****************

For the last test, I forced the 'ticks' parameter of callout_reset_on() to '1' in iflib_netmap_timer_adjust & iflib_netmap_txsync
by increasing the 2000 limit to 20000 in both functions,
and set an insane value for kern.hz

/boot/loader.conf:
kern.hz=10000

pkt-gen result:
345.415939 main_thread [2639] 14.880 Mpps (14.890 Mpkts 7.142 Gbps in 1000699 usec) 430.97 avg_batch 99999 min_space
346.429134 main_thread [2639] 14.880 Mpps (15.076 Mpkts 7.142 Gbps in 1013196 usec) 432.17 avg_batch 99999 min_space

29000 irq/s

Same performance as FreeBSD 11

It looks like the callout_reset_on() call that schedules iflib_timer uses too long a delay.
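
The pattern in these four tests can be reproduced with a small model. This is a sketch under one assumption, inferred from the measurements and not copied from sys/net/iflib.c: the callout is armed with 1 tick when kern.hz is below the limit, and with hz / 1000 ticks otherwise. `timer_period_us` is a hypothetical helper.

```python
def timer_period_us(hz, limit=2000):
    """Period of the TX cleanup callout in microseconds.

    Assumed rule (inferred from the tests above, not taken from iflib.c):
    1 tick when hz is below the limit, hz / 1000 ticks otherwise.
    """
    ticks = 1 if hz < limit else hz // 1000
    return ticks * 1_000_000 / hz


for hz, limit in [(1000, 2000), (1999, 2000), (2000, 2000), (10000, 20000)]:
    print(f"kern.hz={hz:<5} limit={limit:<5} -> {timer_period_us(hz, limit):7.2f} us")
```

Under that assumption the period is 1000 us for kern.hz=1000, about 500 us for 1999, back to 1000 us for 2000, and 100 us for kern.hz=10000 with the limit raised to 20000, which matches the 1x / 2x / 1x / line-rate progression measured above.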
Comment 9 Vincenzo Maffione freebsd_committer 2020-08-31 22:05:59 UTC
Thanks a lot for the tests.
I think the way netmap tx is handled right now needs improvement.

As far as I can tell, in your setup TX interrupts are simply not used (ix and ixl seem to use softirq for TX interrupt processing).
Your experiments with increasing kern.hz cause the interrupt rate of the OS timer to increase, and therefore cause the iflib_timer() routine to be called more often. Because it is called more often, the TX ring is cleaned up (TX credits update) more often and therefore the application can submit new TX packets more often, hence the improved pps.

However, clearly increasing kern.hz is not a viable approach.
I think we should try to use a separate timer for netmap TX credits update, using higher resolution (e.g. callout_reset_sbt_on()), and maybe try to dynamically adjust the timer period to become shorter when transmitting at a high rate, and longer when transmitting at a low rate.
I'll try to come up with an experimental patch in the next few days.
Comment 10 Eric Joyner freebsd_committer 2020-09-15 17:31:38 UTC
(In reply to Vincenzo Maffione from comment #9)

Any update? This might be something to get into 12.2 if we can.
Comment 11 Vincenzo Maffione freebsd_committer 2020-09-20 21:11:21 UTC
(In reply to Eric Joyner from comment #10)

Not yet. I've been AFK for a couple of weeks. I should be able to work on it this week.
Comment 12 vistalba 2020-10-11 11:47:53 UTC
Is there any progress on this issue?
Unfortunately I'm stuck on an old opnsense & sensei version because with 20.7 the performance is really bad (<300Mbit/s). With 20.1 I can reach wire speed (1GbE) without problems.

Let me know, if I/we can test something to help solve this.
Comment 13 Vincenzo Maffione freebsd_committer 2020-10-11 16:27:03 UTC
(In reply to vistalba from comment #12)
I started to work on it, however I've no suitable hardware to test.

This means that I will need to patch qemu to modify the emulation of an iflib-backed device with MSI-X interrupts (such as vmxnet3) in such a way that I can reproduce the problem (e.g. by making the transmission asynchronous w.r.t the register write that triggers it, like in real hardware).

I will for sure ask for tests on real hardware, but first I need to make some basic experiments on my own.
Comment 14 Vincenzo Maffione freebsd_committer 2020-10-13 21:14:32 UTC
Created attachment 218723 [details]
Draft patch to test the netmap tx timer

This is a draft patch that adds support for a per-tx-queue timer dedicated to netmap.
The timer interval is still not adaptive, but controlled by a per-interface sysctl, e.g.:

  sysctl dev.ix.0.iflib.nm_tx_tmr_us=500

It would be useful to test pkt-gen transmission on ixl/ix NICs, playing with the tunable to hopefully see the pps increase.
Values that are too large should cause the pps to drop. Values that are too small should cause the CPU utilization to go up (and possibly the pps to drop a little bit).

Can anyone test this?
Comment 15 Kubilay Kocak freebsd_committer freebsd_triage 2020-10-14 04:19:22 UTC
^Triage: Switch Version to earliest affected version/branch
Comment 16 Sylvain Galliano 2020-10-14 12:08:36 UTC
(In reply to Vincenzo Maffione from comment #14)

Here are the results:

X520 with 1 queue
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix0: netmap queues/slots: TX 1/2048, RX 1/2048

*******************************************

sysctl dev.ix.0.iflib.nm_tx_tmr_us=0  (default value)

pkt-gen:
683.502433 main_thread [2639] 4.215 Mpps (4.227 Mpkts 2.023 Gbps in 1002819 usec) 465.43 avg_batch 99999 min_space

*******************************************

sysctl dev.ix.0.iflib.nm_tx_tmr_us=300

pkt-gen:
750.688608 main_thread [2639] 6.496 Mpps (6.646 Mpkts 3.118 Gbps in 1023000 usec) 465.45 avg_batch 99999 min_space

*******************************************

sysctl dev.ix.0.iflib.nm_tx_tmr_us=200

pkt-gen:
771.736855 main_thread [2639] 8.907 Mpps (9.112 Mpkts 4.275 Gbps in 1022999 usec) 465.45 avg_batch 99999 min_space

*******************************************

sysctl dev.ix.0.iflib.nm_tx_tmr_us=100

pkt-gen:
804.554603 main_thread [2639] 14.136 Mpps (14.147 Mpkts 6.785 Gbps in 1000748 usec) 465.45 avg_batch 99999 min_space
-> close to 10G line rate

*******************************************

sysctl dev.ix.0.iflib.nm_tx_tmr_us=90

pkt-gen:
872.156329 main_thread [2639] 14.880 Mpps (15.054 Mpkts 7.142 Gbps in 1011721 usec) 466.96 avg_batch 99999 min_space
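
The 1-queue X520 numbers above fit a simple ceiling model. This is only a sketch under stated assumptions: the whole ring is reclaimed once per timer period (so at most one ring's worth of slots can be sent per period), frames are minimum size (64 bytes on the wire plus 20 bytes of preamble and inter-frame gap), and the link is 10 Gbit/s. All function names are illustrative.

```python
def line_rate_pps(link_bps, frame_bytes=64, overhead_bytes=20):
    """Ethernet packet rate limit; 20 bytes = 8 preamble/SFD + 12 inter-frame gap."""
    return link_bps / ((frame_bytes + overhead_bytes) * 8)


def timer_bound_pps(ring_slots, period_us):
    """At most one full ring can be transmitted per timer period (assumed model)."""
    return ring_slots * 1_000_000 / period_us


def predicted_ceiling_pps(ring_slots, period_us, link_bps=10e9):
    return min(line_rate_pps(link_bps), timer_bound_pps(ring_slots, period_us))


# nm_tx_tmr_us -> Mpps measured above (X520, 1 queue, 2048 slots).
OBSERVED_MPPS = {300: 6.496, 200: 8.907, 100: 14.136, 90: 14.880}
for period_us, mpps in OBSERVED_MPPS.items():
    ceiling = predicted_ceiling_pps(2048, period_us) / 1e6
    print(f"{period_us:>3} us: ceiling {ceiling:6.3f} Mpps, observed {mpps:6.3f} Mpps")
```

Every measured value stays at or below the ceiling, and the ceiling only reaches the 14.88 Mpps line rate once the period drops to around 100 us or less, consistent with the results above.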



Now using same X520 NIC using 4 queues.

ix1: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: netmap queues/slots: TX 4/2048, RX 4/2048

*******************************************

sysctl dev.ix.1.iflib.nm_tx_tmr_us=0 (default)

pkt-gen:
047.988586 main_thread [2639] 13.596 Mpps (13.623 Mpkts 6.526 Gbps in 1002002 usec) 443.03 avg_batch 99999 min_space
-> close to max speed (thanks to 4 queues)

*******************************************

sysctl dev.ix.1.iflib.nm_tx_tmr_us=400

pkt-gen:
094.224581 main_thread [2639] 14.887 Mpps (14.904 Mpkts 7.146 Gbps in 1001173 usec) 440.75 avg_batch 99999 min_space


Looks really good for the X520 NIC, regardless of the number of queues I used.



Now same tests using XL710 NIC (40G) using 1 queue:

ixl1: PCI Express Bus: Speed 8.0GT/s Width x8
ixl1: netmap queues/slots: TX 1/1024, RX 1/1024

*******************************************

sysctl dev.ixl.1.iflib.nm_tx_tmr_us=0 (default)

pkt-gen:
324.883066 main_thread [2639] 12.270 Mpps (13.044 Mpkts 5.890 Gbps in 1063000 usec) 16.53 avg_batch 99999 min_space

*******************************************

sysctl dev.ixl.1.iflib.nm_tx_tmr_us=100

pkt-gen:
350.497566 main_thread [2639] 12.246 Mpps (12.258 Mpkts 5.878 Gbps in 1001003 usec) 16.48 avg_batch 99999 min_space

no changes.


Now testing XL710 with 4 queues:

ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: netmap queues/slots: TX 4/1024, RX 4/1024

*******************************************

sysctl dev.ixl.0.iflib.nm_tx_tmr_us=0 (default)

pkt-gen:
614.766048 main_thread [2639] 13.671 Mpps (14.539 Mpkts 6.562 Gbps in 1063494 usec) 15.75 avg_batch 99999 min_space

*******************************************

sysctl dev.ixl.0.iflib.nm_tx_tmr_us=100

pkt-gen:
640.652549 main_thread [2639] 13.672 Mpps (13.795 Mpkts 6.562 Gbps in 1009001 usec) 15.79 avg_batch 99999 min_space


No changes using XL710 NIC (as a reminder, using FreeBSD 11 without iflib, I can reach +40Mpps on XL710 using pkt-gen)
Comment 17 vistalba 2020-10-14 14:18:20 UTC
(In reply to Vincenzo Maffione from comment #14)

Is there an easy way to test this on my opnsense VM with vmx interfaces? As far as I know, my netmap issue on vmx is related to this timer issue as well.
I'm not very familiar with FreeBSD.
Comment 18 Michael Muenz 2020-10-14 15:27:55 UTC
(In reply to vistalba from comment #17)

- Install Vanilla FreeBSD12
- pkg install git
- cd /usr && git clone https://github.com/opnsense/tools
- cd tools && make update
- make kernel

Alternatively, you can just create an image, which might be easier; follow the guides at https://github.com/opnsense/tools
Comment 19 Vincenzo Maffione freebsd_committer 2020-10-18 15:50:57 UTC
(In reply to Sylvain Galliano from comment #16)

Thanks a lot.
In the XL710 case, have you tried with lower values of the timer, such as 50us down to 5 us?
Is there any visible change?

Also, have you looked at pkt-gen CPU utilization? That's something that tells you if you are CPU limited (unlikely) or rather still limited by the "pseudo-interrupt rate" being too low.
For instance, how does pkt-gen CPU utilization look in the case of XL710 and 1 queue (for simplicity, so that you have just a single thread)?
Comment 20 Vincenzo Maffione freebsd_committer 2020-10-18 15:53:18 UTC
(In reply to vistalba from comment #17)
Of course you will need to apply the attached patch before "make kernel", e.g.
  $ cd /path/to/freebsd/kernel/sources
  $ patch -p1 < /path/to/attached/patch.diff
Comment 21 Sylvain Galliano 2020-10-18 17:10:35 UTC
(In reply to Vincenzo Maffione from comment #19)

result using 1 queue:

ixl1: PCI Express Bus: Speed 8.0GT/s Width x8
ixl1: netmap queues/slots: TX 1/1024, RX 1/1024

sysctl dev.ixl.0.iflib.nm_tx_tmr_us=50

194.333930 main_thread [2638] 11.896 Mpps (11.902 Mpkts 5.710 Gbps in 1000497 usec) 341.33 avg_batch 99999 min_space

pkt-gen cpu usage: 35%


sysctl dev.ixl.0.iflib.nm_tx_tmr_us=5

235.070929 main_thread [2638] 14.754 Mpps (15.543 Mpkts 7.082 Gbps in 1053521 usec) 390.14 avg_batch 99999 min_space

pkt-gen cpu usage: 56%



sysctl dev.ixl.0.iflib.nm_tx_tmr_us=1

266.392925 main_thread [2638] 14.748 Mpps (14.762 Mpkts 7.079 Gbps in 1000998 usec) 407.41 avg_batch 99999 min_space

pkt-gen cpu usage: 66%



With 6 queues configured, max pps is 17Mpps, even when using a low nm_tx_tmr_us value (1 us)
pkt-gen cpu usage: 82%
Comment 22 Vincenzo Maffione freebsd_committer 2020-10-18 20:36:35 UTC
Created attachment 218866 [details]
Netmap tx timer + timely credits update

A small extension of the previous patch, which adds a timely update of TX credits and timer start.
Comment 23 Vincenzo Maffione freebsd_committer 2020-10-18 20:41:15 UTC
(In reply to Sylvain Galliano from comment #21)
Thanks.
The CPU utilization at least tells us that we are not CPU bound.
Could you please perform some tests with the second patch?
It's basically the same as the first one, with a couple of changes that should ensure a more timely recovery of consumed TX slots (and a more timely timer firing).
Comment 24 Sylvain Galliano 2020-10-19 08:57:42 UTC
(In reply to Vincenzo Maffione from comment #23)

After using 'Netmap tx timer + timely credits update' patch, I didn't notice any difference on results.
Should I run some specific tests to confirm the changes between the two patches?
Comment 25 Vincenzo Maffione freebsd_committer 2020-10-19 21:05:18 UTC
Sorry, my bad.
I read the code the wrong way, so the second patch is indeed useless. Please forget about that. The patch is not ensuring timely TX slots recovery (contrary to what I claimed in comment #23).

So it seems that the situation where we are losing against 11-stable is ixl with 6 queues (or more in general, with more than 1 queue). The other combinations (ix, or ixl/1q) are on par. Is this correct?

Now, focusing on the ixl/6q case, and using the first patch I provided, do you see a significant difference in average batch (as reported by pkt-gen) and pkt-gen CPU utilization?
The avg_batch metric tells us how many packets we were able to send for each txsync syscall. So the higher the better (at least up to 100/200).
Comment 26 Sylvain Galliano 2020-10-20 10:26:57 UTC
(In reply to Vincenzo Maffione from comment #25)

You are correct: using 1 queue on both ix/ixl NICs and tuning the new sysctl, we get the same result as FreeBSD 11.

Regarding avg_batch values on ixl, they are very low, not in the range you expected:

6 queues, iflib.nm_tx_tmr_us=5
070.664142 main_thread [2639] 13.588 Mpps (13.602 Mpkts 6.522 Gbps in 1001000 usec) 20.87 avg_batch 99999 min_space
cpu usage: 100%

even with 1 queue I got a low value for avg_batch:
283.855379 main_thread [2639] 12.147 Mpps (12.757 Mpkts 5.831 Gbps in 1050245 usec) 13.97 avg_batch 99999 min_space
cpu usage: 100%

The ix (X520) NIC has a good avg_batch (regardless of the number of queues):
404.130120 main_thread [2639] 14.880 Mpps (14.895 Mpkts 7.143 Gbps in 1000999 usec) 436.04 avg_batch 99999 min_space

I don't know if this can help you, but I did another test specific to the ixl NIC:
setting hw.ixl.enable_head_writeback=0

avg_batch is higher, pps a little better and we are not CPU bound this time:

6 queues:
603.651106 main_thread [2639] 17.384 Mpps (17.402 Mpkts 8.345 Gbps in 1001003 usec) 308.10 avg_batch 99999 min_space
cpu usage: 71%

1 queue:
730.590104 main_thread [2639] 15.084 Mpps (15.416 Mpkts 7.241 Gbps in 1022004 usec) 442.64 avg_batch 99999 min_space
cpu usage: 57%


Same test but using more threads on pkt-gen (-p 6), 6 queues:
995.887010 main_thread [2639] 17.327 Mpps (17.339 Mpkts 8.317 Gbps in 1000693 usec) 286.35 avg_batch 599994 min_space
cpu usage: 197%

top -H:
  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
70876 root        -92    0   348M    16M select   3   0:23  35.46% pkt-gen{pkt-gen}
70876 root        -92    0   348M    16M CPU4     4   0:23  33.58% pkt-gen{pkt-gen}
70876 root        -92    0   348M    16M RUN      6   0:23  33.44% pkt-gen{pkt-gen}
70876 root        -92    0   348M    16M select   2   0:23  31.69% pkt-gen{pkt-gen}
70876 root         45    0   348M    16M CPU9     9   0:23  31.64% pkt-gen{pkt-gen}
70876 root        -92    0   348M    16M select   6   0:23  31.24% pkt-gen{pkt-gen}
Comment 27 Vincenzo Maffione freebsd_committer 2020-10-20 21:14:04 UTC
Created attachment 218932 [details]
netmap tx timer + honor IPI_TX_INTR in ixl txd_encap

Adds an unrelated fix on top of the first patch (218723).
The new fix should remove a major regression introduced by the ixl porting to iflib.
Comment 28 Vincenzo Maffione freebsd_committer 2020-10-20 21:17:19 UTC
(In reply to Sylvain Galliano from comment #26)
Thanks.

I just figured out that there may be a major flaw introduced by the porting of ixl to iflib. This flaw should cause too many writebacks from the NIC to report completed transmissions, even if iflib asks for a writeback only once in a while.

Can you please run again your tests on ixl with the latest patch?
Comment 29 Sylvain Galliano 2020-10-20 21:49:00 UTC
(In reply to Vincenzo Maffione from comment #28)

Yes, this is much better:

6 queues, nm_tx_tmr_us=5:
983.492185 main_thread [2639] 37.907 Mpps (37.945 Mpkts 18.196 Gbps in 1001000 usec) 512.00 avg_batch 99999 min_space
cpu usage: 100%

but with this patch, something is wrong when using 1 queue:
110.079117 main_thread [2639] 0.000 pps (0.000 pkts 0.000 bps in 1003920 usec) 0.00 avg_batch 99999 min_space
111.080184 main_thread [2639] 0.000 pps (0.000 pkts 0.000 bps in 1001066 usec) 0.00 avg_batch 99999 min_space
111.714181 sender_body [1663] poll error on queue 0: timeout
112.089179 main_thread [2639] 0.000 pps (0.000 pkts 0.000 bps in 1008996 usec) 0.00 avg_batch 99999 min_space
113.116178 main_thread [2639] 0.000 pps (0.000 pkts 0.000 bps in 1026999 usec) 0.00 avg_batch 99999 min_space

(I've double-checked by reverting to the previous patch: no issue)

same errors when using 6 queues and pkt-gen with 6 threads (-p 6)
Comment 30 Krzysztof Galazka 2020-10-21 07:48:36 UTC
(In reply to Vincenzo Maffione from comment #28)

Hi Vincenzo,

Good catch! Thanks a lot! The non-iflib version of ixl also sets a request status flag on all 'End of packet' descriptors. I'm guessing that the difference in performance is related to dynamic interrupt moderation, which is disabled by default in the iflib version of the driver. I'm a bit concerned though about the 1 queue case. I think it would be good to put this fix in a separate review and let our validation team run some tests. Would you like me to do it or do you prefer to do it yourself?
Comment 31 Vincenzo Maffione freebsd_committer 2020-10-21 20:53:53 UTC
(In reply to Krzysztof Galazka from comment #30)
Hi Krzysztof,
  I agree, and created a separate review for this possible change:
https://reviews.freebsd.org/D26896
It would be nice if you guys could run some tests to validate the change against normal TCP/IP stack usage (e.g. non-netmap).

Speaking about the non-iflib driver, I guess it is acceptable for the ixl_xmit routine to always set the report flag on the EOP packet.
However, this is not acceptable for netmap, and indeed the non-iflib netmap ixl driver is only setting it twice per ring (see https://github.com/freebsd/freebsd/blob/stable/11/sys/dev/netmap/if_ixl_netmap.h#L221-L223).
This is, in my opinion, what explains the huge difference in performance between non-iflib and iflib for ixl.

If my understanding is correct, and according to our past experience with netmap, the report flag will cause the NIC to initiate a DMA transaction to either set the DD bit in the descriptor, or perform a memory write to update the shadow TDH.
This is particularly expensive in netmap when done for each descriptor, especially because netmap uses single-descriptor packets.
Interrupt moderation can also help a lot to mitigate the CPU overhead, but as far as I see it does not limit the writeback DMA transactions, and therefore it does not help in the netmap use-case. Moreover, my understanding is that iflib is not using hardware interrupts on ixl (nor ix), but rather is using "softirqs", so I guess that interrupt moderation does not play a role here. I may be wrong on this last point.

I suspect the 1-queue pkt-gen hang problem may be due to something else, rather than the report flag change. As a matter of fact, the same logic was working fine on the non-iflib driver.
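
To give a feel for the difference between the two report-flag policies, here is a rough sketch that counts the writebacks each one would request. The "twice per ring" policy is modelled as slot 0 plus the middle slot, which is an assumption for illustration (the exact slots are in the linked 11-STABLE header), and every packet is assumed to use a single descriptor as stated above.

```python
def report_slots_every_packet(ring_slots):
    """Report flag on every descriptor (one descriptor per packet in netmap)."""
    return set(range(ring_slots))


def report_slots_twice_per_ring(ring_slots):
    """Assumed policy: request a writeback on slot 0 and on the middle slot."""
    return {0, ring_slots // 2}


def writebacks_per_sec(pps, ring_slots, report_slots):
    """Writebacks requested per second when the ring is cycled at the given pps."""
    return pps * len(report_slots) / ring_slots


SLOTS = 2048
PPS = 38e6  # roughly the rate reached with the ixl fix in comment 29
print(f"every packet:   {writebacks_per_sec(PPS, SLOTS, report_slots_every_packet(SLOTS)):,.0f}/s")
print(f"twice per ring: {writebacks_per_sec(PPS, SLOTS, report_slots_twice_per_ring(SLOTS)):,.0f}/s")
```

At that rate the per-packet policy asks for about 38 million writebacks per second, against about 37 thousand for the twice-per-ring policy, a factor of 1024 with a 2048-slot ring.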
Comment 32 Vincenzo Maffione freebsd_committer 2020-10-21 21:18:49 UTC
(In reply to Sylvain Galliano from comment #29)
Thanks again for your tests.

I'm inclined to think that the pkt-gen hang issue that you see is not directly caused by the ixl patch.
Would you please try to test what happens if you only apply the ixl patch (discarding all the changes related to the netmap timer)?
In the end, the change to ixl is orthogonal to the netmap timer issue.

Also, it would be useful to understand whether the hang problem comes from some sort of race condition or not. For this purpose, you may try to use the -R argument of pkt-gen (this time with the timer+ixl patch) to specify a maximum rate in packets per second (pps). E.g. you could start from 1000 pps, check that it does not hang, double the rate and repeat the process until you find a critical rate that causes the hang. If this is not a race condition, the hang will instead happen at any rate.
When the hang happens, it may help to see the ring state, e.g. with the following patch to pkt-gen. I expect to see head, cur and tail having the same value. 

diff --git a/apps/pkt-gen/pkt-gen.c b/apps/pkt-gen/pkt-gen.c
index ef876f4f..19497fe9 100644
--- a/apps/pkt-gen/pkt-gen.c
+++ b/apps/pkt-gen/pkt-gen.c
@@ -1675,6 +1675,10 @@ sender_body(void *data)
                                break;
                        D("poll error on queue %d: %s", targ->me,
                                rv ? strerror(errno) : "timeout");
+                       for (i = targ->nmd->first_tx_ring; i <= targ->nmd->last_tx_ring; i++) {
+                               txring = NETMAP_TXRING(nifp, i);
+                               D("txring %u %u %u", txring->head, txring->cur, txring->tail);
+                       }
                        // goto quit;
                }
                if (pfd.revents & POLLERR) {
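
The rate-doubling procedure described in this comment can be sketched as a small driver loop. `hangs_at` is a hypothetical callback: in a real test it would run pkt-gen with `-R <rate>` for a few seconds and return True if the "poll error ... timeout" message shows up. The `max_pps` cutoff is likewise an arbitrary assumption.

```python
def find_critical_rate(hangs_at, start_pps=1000, max_pps=64_000_000):
    """Double the rate until hangs_at(rate) reports a hang.

    Returns the first rate that hangs, or None if no hang is seen up to max_pps.
    hangs_at is a caller-supplied (hypothetical) probe, e.g. a wrapper that
    runs pkt-gen with -R and watches for the poll timeout message.
    """
    rate = start_pps
    while rate <= max_pps:
        if hangs_at(rate):
            return rate
        rate *= 2
    return None


# Stand-in probe for demonstration only: pretend the hang starts at 18 Mpps.
print(find_critical_rate(lambda rate: rate >= 18_000_000))
```

A hang at the very first rate would point away from a rate-dependent race, while a hang that only appears above some threshold narrows down where to look.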
Comment 33 Sylvain Galliano 2020-10-22 09:56:29 UTC
(In reply to Vincenzo Maffione from comment #32)

Here are the results:

ixl only patch, 6 queues, pkt-gen WITHOUT -R:
710.623764 main_thread [2642] 38.621 Mpps (38.698 Mpkts 18.538 Gbps in 1002000 usec) 512.00 avg_batch 99999 min_space

ixl only patch, 1 queue, pkt-gen WITHOUT -R:

670.168017 main_thread [2642] 0.000 pps (0.000 pkts 0.000 bps in 1009185 usec) 0.00 avg_batch 99999 min_space
671.181833 main_thread [2642] 0.000 pps (0.000 pkts 0.000 bps in 1013816 usec) 0.00 avg_batch 99999 min_space
672.171832 sender_body [1662] poll error on queue 0: timeout
672.171838 sender_body [1665] txring 513 513 513
672.191833 main_thread [2642] 0.000 pps (0.000 pkts 0.000 bps in 1010000 usec) 0.00 avg_batch 99999 min_space

ixl only patch, 1 queue, pkt-gen WITH -R:

-R 1000:
813.372070 main_thread [2642] 1.001 Kpps (1.002 Kpkts 480.718 Kbps in 1000503 usec) 3.00 avg_batch 99999 min_space
-R 2000:
860.807010 main_thread [2642] 2.006 Kpps (2.010 Kpkts 962.692 Kbps in 1002190 usec) 6.00 avg_batch 99999 min_space
...
(all intermediate -R values worked)
...
-R 17000000:
057.160242 main_thread [2642] 17.001 Mpps (18.072 Mpkts 8.160 Gbps in 1063000 usec) 512.00 avg_batch 99999 min_space
-R 18000000:
030.167994 sender_body [1662] poll error on queue 0: timeout
030.168001 sender_body [1665] txring 513 513 513



ixl + timer patches, 1 queue, pkt-gen WITH -R:
sysctl nm_tx_tmr_us=5

-R 1000:
261.886507 main_thread [2642] 1.001 Kpps (1.065 Kpkts 480.679 Kbps in 1063496 usec) 3.00 avg_batch 99999 min_space
-R 2000:
279.365024 main_thread [2642] 2.000 Kpps (2.034 Kpkts 960.219 Kbps in 1016768 usec) 6.00 avg_batch 99999 min_space
...
(all intermediate -R values worked)
...
-R 17000000
388.372451 main_thread [2642] 17.000 Mpps (18.079 Mpkts 8.160 Gbps in 1063431 usec) 512.00 avg_batch 99999 min_space
-R 18000000
894.421917 main_thread [2642] 18.000 Mpps (18.036 Mpkts 8.640 Gbps in 1002001 usec) 512.00 avg_batch 99999 min_space
and sometimes an error
-R 19000000
991.012912 sender_body [1662] poll error on queue 0: timeout
991.012920 sender_body [1665] txring 513 513 513

another run:
968.011919 sender_body [1662] poll error on queue 0: timeout
968.011926 sender_body [1665] txring 235 235 235

and another one:
112.008840 sender_body [1662] poll error on queue 0: timeout
112.008848 sender_body [1665] txring 95 95 95
Comment 34 Vincenzo Maffione freebsd_committer 2020-10-23 20:57:30 UTC
(In reply to Sylvain Galliano from comment #33)
Ok, thanks. At this point it's clear that there are two independent issues that slow down netmap-iflib on ix/ixl. The first is the lack of a per-tx-queue netmap timer (or taskqueue). The second is the lack of descriptor writeback moderation in ixl.
We can start by merging the timer patch, and then work on the separate ixl issue.
Comment 35 Vincenzo Maffione freebsd_committer 2020-10-25 15:30:30 UTC
Created attachment 219062 [details]
netmap tx timer w/queue intr enable + honor IPCP_TX_INTR in ixl_txd_encap

Extension of the last one (218932), to also call the IFDI tx queue interrupt enable, similarly to what the iflib_timer() code already does.
Comment 36 Vincenzo Maffione freebsd_committer 2020-10-25 15:38:52 UTC
(In reply to Vincenzo Maffione from comment #35)
I would ask for advice from the Intel guys here...
I'm trying to compare stable/11 vs current, regarding how TX interrupts are handled. It looks like in stable/11 MSI-x handlers are shared for the TX and RX queue, while in current TX interrupts are not used.
Also, in stable/11 the interrupt handler seems to do a disable_queue and then enable_queue, while on current I only see the enable_queue step (IFDI_TX_QUEUE_INTR_ENABLE).

Therefore, in the last patch I also added the enable_queue step to the netmap timer routine. It may be worth giving it a try to see if this fixes the ixl issue.
Comment 37 Sylvain Galliano 2020-10-26 08:29:35 UTC
(In reply to Vincenzo Maffione from comment #36)

I have tested the last patch (netmap tx timer w/queue intr enable + honor IPCP_TX_INTR in ixl_txd_encap): same results
Comment 38 Vincenzo Maffione freebsd_committer 2020-10-26 21:39:17 UTC
Created attachment 219121 [details]
Cleaned up netmap tx timer patch (no sysctl)
Comment 39 Vincenzo Maffione freebsd_committer 2020-10-26 21:47:20 UTC
(In reply to Sylvain Galliano from comment #37)
Ok thanks. It was worth a try. I guess we'll need some help from Intel here.

In the meantime, I would like to commit only the netmap tx timer change.
I attached a cleaned-up patch with a hardcoded value for the netmap timer.
I would rather not add a new sysctl for something that may change again soon.

In any case, the patch is meant to substantially improve the current situation for both ix and ixl.
Could you please run your tests again on ix and ixl to check that you get numbers that are consistent with the ones you reported in comment n. 16?
Comment 40 Sylvain Galliano 2020-10-27 17:17:34 UTC
(In reply to Vincenzo Maffione from comment #39)

results are all good for ix (X520) NIC (+14M pps, same as FreeBSD 11)

No changes in ixl (same results as comment #16)
Comment 41 Vincenzo Maffione freebsd_committer 2020-10-27 21:21:36 UTC
(In reply to Sylvain Galliano from comment #40)
Thank you for confirming.
In the meanwhile I'll commit this change.

Maybe we should open a separate issue for the ixl regression? Now we know that it is caused by the RS flag being set on all the TX descriptors.
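To illustrate what "writeback moderation" means here, the following is a generic sketch that counts how many completion reports a batch generates when the report request (RS) is set on every descriptor versus only on every Nth one. It is not ixl driver code: want_report(), count_reports() and the interval policy are hypothetical names and choices made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical policy: request a completion report on every
 * `interval`-th descriptor, and always on the last descriptor of the
 * batch so that the tail of the batch is eventually reported.
 * `interval` must be nonzero.
 */
static bool
want_report(uint32_t idx_in_batch, uint32_t batch_len, uint32_t interval)
{
	if (idx_in_batch + 1 == batch_len)
		return true;
	return ((idx_in_batch + 1) % interval) == 0;
}

/* Number of descriptors in a batch that would carry the report request. */
static uint32_t
count_reports(uint32_t batch_len, uint32_t interval)
{
	uint32_t i, n = 0;

	for (i = 0; i < batch_len; i++) {
		if (want_report(i, batch_len, interval))
			n++;
	}
	return n;
}
```

For a 512-packet batch, an interval of 1 (RS on every descriptor) requests 512 reports, while an assumed interval of 32 requests 16.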
Comment 42 commit-hook freebsd_committer 2020-10-27 21:54:08 UTC
A commit references this bug:

Author: vmaffione
Date: Tue Oct 27 21:53:33 UTC 2020
New revision: 367093
URL: https://svnweb.freebsd.org/changeset/base/367093

Log:
  iflib: add per-tx-queue netmap timer

  The way netmap TX is handled in iflib when TX interrupts are not
  used (IFC_NETMAP_TX_IRQ not set) has some issues:
    - The netmap_tx_irq() function gets called by iflib_timer(), which
      gets scheduled with tick granularity (hz). This is not frequent
      enough for 10Gbps NICs and beyond (e.g., ixgbe or ixl). The end
      result is that the transmitting netmap application is not woken
      up fast enough to saturate the link with small packets.
    - The iflib_timer() functions also calls isc_txd_credits_update()
      to ask for more TX completion updates. However, this violates
      the netmap requirement that only txsync can access the TX queue
      for datapath operations. Only netmap_tx_irq() may be called out
      of the txsync context.

  This change introduces per-tx-queue netmap timers, using microsecond
  granularity to ensure that netmap_tx_irq() can be called often enough
  to allow for maximum packet rate. The timer routine simply calls
  netmap_tx_irq() to wake up the netmap application. The latter will
  wake up and call txsync to collect TX completion updates.

  This change brings back line rate speed with small packets for ixgbe.
  For the time being, timer expiration is hardcoded to 90 microseconds,
  in order to avoid introducing a new sysctl.
  We may eventually implement an adaptive expiration period or use another
  deferred work mechanism in place of timers.

  Also, fix the timers usage to make sure that each queue is serviced
  by a different CPU.

  PR:	248652
  Reported by:	sg@efficientip.com
  MFC after:	2 weeks

Changes:
  head/sys/net/iflib.c
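A rough back-of-the-envelope sketch of why the wakeup period matters, under a deliberately pessimistic assumption: the sender refills the ring at most once per wakeup. The function name max_pps(), the 2048-slot ring size and hz = 1000 (one tick = 1000 us) are assumptions for illustration, not values taken from this report.

```c
#include <stdint.h>

/*
 * Upper bound on packets per second if the sender can refill the ring
 * only once per wakeup: at most ring_slots packets per period.
 * wakeup_period_us must be nonzero.
 */
static uint64_t
max_pps(uint32_t ring_slots, uint32_t wakeup_period_us)
{
	return ((uint64_t)ring_slots * 1000000ULL) / wakeup_period_us;
}
```

Under these assumptions, max_pps(2048, 1000) gives about 2 Mpps for tick-granularity wakeups, well below the roughly 14.88 Mpps line rate of 10GbE with minimum-size frames, while max_pps(2048, 90) gives about 22.7 Mpps for the 90 microsecond timer.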
Comment 43 Sylvain Galliano 2020-10-28 17:02:09 UTC
I ran the same tests on VMware with a vmxnet NIC and the latest patch, and I got a panic:

spin lock 0xfffff80003079cc0 (turnstile lock) held by 0xfffffe0009607e00 (tid 100006) too long
panic: spin lock held too long
cpuid = 1
time = 1603884508
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0008480680
vpanic() at vpanic+0x182/frame 0xfffffe00084806d0
panic() at panic+0x43/frame 0xfffffe0008480730
_mtx_lock_indefinite_check() at _mtx_lock_indefinite_check+0x64/frame 0xfffffe0008480740
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xd5/frame 0xfffffe00084807b0
turnstile_trywait() at turnstile_trywait+0xe3/frame 0xfffffe00084807e0
__mtx_lock_sleep() at __mtx_lock_sleep+0x119/frame 0xfffffe0008480870
doselwakeup() at doselwakeup+0x179/frame 0xfffffe00084808c0
nm_os_selwakeup() at nm_os_selwakeup+0x13/frame 0xfffffe00084808e0
netmap_notify() at netmap_notify+0x3d/frame 0xfffffe0008480900
softclock_call_cc() at softclock_call_cc+0x13d/frame 0xfffffe00084809a0
callout_process() at callout_process+0x1c0/frame 0xfffffe0008480a10
handleevents() at handleevents+0x188/frame 0xfffffe0008480a50
timercb() at timercb+0x24e/frame 0xfffffe0008480aa0
lapic_handle_timer() at lapic_handle_timer+0x9b/frame 0xfffffe0008480ad0
Xtimerint() at Xtimerint+0xb1/frame 0xfffffe0008480ad0
--- interrupt, rip = 0xffffffff80f5bd46, rsp = 0xfffffe0008480ba0, rbp = 0xfffffe0008480ba0 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe0008480ba0
acpi_cpu_idle() at acpi_cpu_idle+0x2eb/frame 0xfffffe0008480bf0
cpu_idle_acpi() at cpu_idle_acpi+0x3e/frame 0xfffffe0008480c10
cpu_idle() at cpu_idle+0x9f/frame 0xfffffe0008480c30
sched_idletd() at sched_idletd+0x2e4/frame 0xfffffe0008480cf0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0008480d30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0008480d30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

I used the first patch you sent (Draft patch to test the netmap tx timer), no issue this time.

The only major difference I can see between the two patches (apart from the sysctl) is:
+               txq->ift_timer.c_cpu = cpu;
and
+               txq->ift_netmap_timer.c_cpu = cpu;
Comment 44 Vincenzo Maffione freebsd_committer 2020-10-28 20:50:39 UTC
Created attachment 219179 [details]
Cleaned up netmap tx timer (bugfixes)
Comment 45 Vincenzo Maffione freebsd_committer 2020-10-28 20:55:25 UTC
(In reply to Sylvain Galliano from comment #43)
Ugh.
Thanks for reporting.

I indeed introduced a subtle typo bug, using callout_reset_sbt() rather than callout_reset_sbt_on() (as intended). Therefore I was passing the "cpu" value to the "flags" argument, resulting in a disaster. In your test this probably triggered the C_DIRECT_EXEC flag of callout(9), so that the timer was being executed in hardware interrupt context.

I uploaded the patch that is now consistent with the src tree (that I'm going to fix right away).
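For readers unfamiliar with the two callout(9) variants, here is a userspace mock of the argument mix-up. The mock_* functions are hypothetical and keep only the trailing arguments of callout_reset_sbt() (..., flags) and callout_reset_sbt_on() (..., cpu, flags). The value 0x0001 for C_DIRECT_EXEC and the use of -1 as the "no CPU preference" default are assumptions to be checked against sys/callout.h.

```c
#include <stdbool.h>

/* Assumed to match C_DIRECT_EXEC in FreeBSD's sys/callout.h. */
#define MOCK_C_DIRECT_EXEC 0x0001

/* What a mock scheduling call ended up with. */
struct mock_sched {
	int cpu;   /* CPU the timer is bound to, -1 for no preference */
	int flags; /* callout flags */
};

/* Mock with the trailing arguments of callout_reset_sbt_on(): cpu, flags. */
static struct mock_sched
mock_reset_sbt_on(int cpu, int flags)
{
	struct mock_sched s = { cpu, flags };

	return s;
}

/*
 * Mock with the trailing argument of callout_reset_sbt(): flags only.
 * There is no cpu parameter, so whatever the caller passes last is
 * interpreted as flags.
 */
static struct mock_sched
mock_reset_sbt(int flags)
{
	return mock_reset_sbt_on(-1, flags);
}

/* True if the flags ask for direct execution of the callout. */
static bool
is_direct_exec(struct mock_sched s)
{
	return (s.flags & MOCK_C_DIRECT_EXEC) != 0;
}
```

Passing a CPU number of 1 to mock_reset_sbt() sets the direct-execution bit and loses the CPU binding, whereas mock_reset_sbt_on(1, 0) binds to CPU 1 with no flags. In this mock a CPU number of 0 or 2 would not set that bit, which may be why the problem did not show up in every setup.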
Comment 46 commit-hook freebsd_committer 2020-10-28 21:06:36 UTC
A commit references this bug:

Author: vmaffione
Date: Wed Oct 28 21:06:18 UTC 2020
New revision: 367117
URL: https://svnweb.freebsd.org/changeset/base/367117

Log:
  iflib: fix typo bug introduced by r367093

  Code was supposed to call callout_reset_sbt_on() rather than
  callout_reset_sbt(). This resulted into passing a "cpu" value
  to a "flag" argument. A recipe for subtle errors.

  PR:	248652
  Reported by:	sg@efficientip.com
  MFC with: r367093

Changes:
  head/sys/net/iflib.c
Comment 47 commit-hook freebsd_committer 2020-11-11 21:27:28 UTC
A commit references this bug:

Author: vmaffione
Date: Wed Nov 11 21:27:17 UTC 2020
New revision: 367599
URL: https://svnweb.freebsd.org/changeset/base/367599

Log:
  MFC r367093, r367117

  iflib: add per-tx-queue netmap timer

  The way netmap TX is handled in iflib when TX interrupts are not
  used (IFC_NETMAP_TX_IRQ not set) has some issues:
    - The netmap_tx_irq() function gets called by iflib_timer(), which
      gets scheduled with tick granularity (hz). This is not frequent
      enough for 10Gbps NICs and beyond (e.g., ixgbe or ixl). The end
      result is that the transmitting netmap application is not woken
      up fast enough to saturate the link with small packets.
    - The iflib_timer() functions also calls isc_txd_credits_update()
      to ask for more TX completion updates. However, this violates
      the netmap requirement that only txsync can access the TX queue
      for datapath operations. Only netmap_tx_irq() may be called out
      of the txsync context.

  This change introduces per-tx-queue netmap timers, using microsecond
  granularity to ensure that netmap_tx_irq() can be called often enough
  to allow for maximum packet rate. The timer routine simply calls
  netmap_tx_irq() to wake up the netmap application. The latter will
  wake up and call txsync to collect TX completion updates.

  This change brings back line rate speed with small packets for ixgbe.
  For the time being, timer expiration is hardcoded to 90 microseconds,
  in order to avoid introducing a new sysctl.
  We may eventually implement an adaptive expiration period or use another
  deferred work mechanism in place of timers.

  Also, fix the timers usage to make sure that each queue is serviced
  by a different CPU.

  PR:     248652
  Reported by:    sg@efficientip.com

Changes:
_U  stable/12/
  stable/12/sys/net/iflib.c