Bug 203856 - [igb] PPPoE RX traffic is limited to one queue
Summary: [igb] PPPoE RX traffic is limited to one queue
Status: Closed Works As Intended
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.1-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-net mailing list
URL:
Keywords: IntelNetworking
Depends on:
Blocks:
 
Reported: 2015-10-18 16:21 UTC by laiclair
Modified: 2018-07-27 18:02 UTC
CC: 11 users

See Also:


Attachments

Description laiclair 2015-10-18 16:21:02 UTC
On PPPoE interface packets are only received on one NIC driver queue (queue0).
This is hurting system performance.

A similar problem is explained on the FreeBSD wiki.
See the "Traffic flow" chapter on the page https://wiki.freebsd.org/NetworkPerformanceTuning

Can the proposed patch be tested/integrated?

Thank you.
Comment 1 anoteros 2017-03-01 21:47:30 UTC
This is still a problem and the link to the patch has gone dead.
Is this still in the works?
Comment 2 anoteros 2017-03-01 21:48:05 UTC
(In reply to anoteros from comment #1)
Here is a link to the same issue in the pfSense bug tracker which has more information. https://redmine.pfsense.org/issues/4821
Comment 3 Vladimir 2017-03-17 04:06:45 UTC
Actually it affects everyone who has a PPPoE (or any other non-IP traffic) connection that cannot be handled by a single CPU core on an igb Intel card, e.g. 1Gbit PPPoE connections all over the world. This issue is specific to the igb driver and not to the em driver. There are test results posted on the pfSense forum: same hardware, both cards.
https://forum.pfsense.org/index.php?topic=107610.0
See last post.
Are there any plans to fix it?
Comment 4 Gimli 2018-03-16 18:09:00 UTC
Now that Spectre and Meltdown patches are coming out and CPUs are losing considerable performance, this becomes even more critical to fix. For example, it's now impossible to run a gigabit PPPoE connection on pfSense because of this FreeBSD bug.
Comment 5 ricsip 2018-07-16 12:40:56 UTC
Is there any chance this defect will ever be fixed? As gigabit ISPs gain traction worldwide, FTTH keeps growing rapidly, and many providers (unfortunately) deliver service over PPPoE, this will be a huge issue, if it is not one already!

Home-built SBC routers like the PC Engines APU1/2/3 series are completely useless, and simply a waste of money, if running pfSense/OPNsense instead of a Linux-based distro. Unfortunately, you cannot expect a resolution from the pfSense/OPNsense people if the issue lies in the combination of the underlying FreeBSD OS and the Intel NIC driver.

The quad-core 1 GHz AMD CPU (or similar) in such boards obviously becomes a bottleneck if only one core is forced to process the entire 1 Gbit of traffic while the other three cores sit 90+% idle most of the time. With the load properly shared among the other three cores it would perform perfectly well, so there is no real need to buy server-grade hardware for a firewall/router-only box that eats 20 times more electricity and produces far more noise (the SBC has no fan).

Or, if you expect that this issue will not be solved in the foreseeable future (say, until 13.0), at least make this limitation visible in various places in your documentation: list it as a known defect in the igb driver manual page, list it as a defect in the mpd5 manual when using PPPoE, and note it in the performance/routing section of the Handbook. pfSense/OPNsense should act similarly, but that is up to their developers.

Thanks for your support.
Comment 6 Jan Bramkamp 2018-07-16 12:51:58 UTC
Do you expect a serious reply?
Comment 7 ricsip 2018-07-16 12:58:19 UTC
(In reply to Jan Bramkamp from comment #6)

I am not familiar with how things work in this forum, but based on your comment I assume the fix is unlikely.

Just wondering: is it not affecting enough users (no critical mass), or is the issue not considered serious enough?
Comment 8 Mark Linimon freebsd_committer freebsd_triage 2018-07-16 13:18:26 UTC
Attempt to assign to a more useful mailing list.
Comment 9 Jan Bramkamp 2018-07-16 13:49:51 UTC
(In reply to ricsip from comment #7)

The problem exists and can be fixed, but it would require non-trivial changes. I suspect that the problem isn't specific to the Intel NIC driver and that you'll encounter the same behaviour on an APU1 board.

I asked if you expect a serious answer because your comment reads as if the world owes you a quick fix to your problem.
Comment 10 ricsip 2018-07-16 14:17:36 UTC
(In reply to Jan Bramkamp from comment #9)

Your reply is greatly appreciated. My intention was to drop a stone into the lake: it may create some waves, and progress may happen.

TBH I was unaware of this issue until I faced it personally (before going into production, my testbed didn't show the expected results, and the web wasn't clear about this issue), and googling revealed many similar cases. So I had to dig deeper into the current state of low-powered SoC-based firewalls running FreeBSD. Then I came across the well-known FreeBSD forwarding performance guide (https://bsdrp.net/documentation/technical_docs/performance), where I found disturbingly strange results for the APU2 board, and the date of that page suggested this is a long-known issue.


Tricky situation, as there are multiple parties directly or indirectly involved:

a) the SoC vendor PC Engines, who advertise (quote) "PC Engines apu boards are small form factor system boards optimized for wireless routing and network security applications" --> one would assume they are optimized for routing/firewalling. A warning sign for the future: if you don't see clear, credible benchmark numbers on a vendor's site!

b) the software vendor, pfSense/OPNsense, who deliver appliance-grade software that supports the above SoC board --> if bugs appear during use that are not clearly a HW defect (e.g. overheating or a faulty CPU/RAM chip), one would assume they can support you in fixing them

Then come the various "backstage" players. You don't deal with them directly, but your problem may land on their territory at the end of the day.

c) FreeBSD, which delivers the underlying OS --> if the bug is not at the pfSense/OPNsense level, you have to turn to the FreeBSD community and hope the issue will be fixed

d) the HW vendor AMD --> if there is some performance issue, maybe it's an AMD issue --> turn to AMD for a fix

e) the HW vendor Intel --> if the issue relates to the NIC itself, it's an Intel issue --> turn to Intel for a fix

f) whoever else I may have forgotten
Comment 11 Eugene Grosbein freebsd_committer 2018-07-16 14:32:01 UTC
There seems to be a common misunderstanding of how hardware receive queues work in igb(4) chipsets.

First, one should read Intel's datasheet for the NIC. For example, for an 82576-based NIC this is https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82576eb-gigabit-ethernet-controller-datasheet.pdf

Section 7.1.1.7 of the datasheet states that the NIC "supports a single hash function, as defined by Microsoft RSS". Reading on, one learns this means that only frames containing IPv4 or IPv6 packets are hashed, using their IP addresses and, optionally, TCP port numbers as hash function arguments.

This means that incoming PPPoE ethernet frames are NOT hashed by such a NIC in hardware, like any other frames carrying neither plain IPv4 nor IPv6 packets. This is why all incoming PPPoE ethernet frames land in the same (zero) queue.
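The behaviour can be illustrated with a short sketch of the Microsoft Toeplitz hash these NICs implement. This is illustrative Python, not driver code; the secret key and the TcpIPv4 verification tuple are the ones published in the Microsoft RSS specification and reproduced in the Intel datasheets, and `rx_queue` is a deliberately simplified model of the hardware's queue selection:

```python
# Sketch of the Microsoft RSS (Toeplitz) hash, showing why non-IP frames
# (such as PPPoE, ethertype 0x8863/0x8864) are never hashed and therefore
# all land on RX queue 0. Key and verification tuple are from the
# Microsoft RSS specification (also reproduced in Intel datasheets).

RSS_KEY = bytes([
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
])

def toeplitz_hash(data: bytes, key: bytes = RSS_KEY) -> int:
    """For every set bit of the input, XOR in the 32-bit window of the
    key starting at that bit position (the Toeplitz construction)."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for pos in range(len(data) * 8):
        if data[pos // 8] & (0x80 >> (pos % 8)):
            result ^= (key_int >> (key_bits - 32 - pos)) & 0xFFFFFFFF
    return result

def rx_queue(ethertype: int, rss_input: bytes, nqueues: int = 4) -> int:
    """Simplified queue selection: real hardware indexes a redirection
    table with the low hash bits; here we just take hash % nqueues."""
    if ethertype not in (0x0800, 0x86DD):  # not IPv4/IPv6: RSS is skipped
        return 0                           # hardware default: queue 0
    return toeplitz_hash(rss_input) % nqueues

# TcpIPv4 verification tuple: 66.9.149.187:2794 -> 161.142.100.80:1766;
# the spec's verification suite lists 0x51ccc178 for this input.
tcp4 = (bytes([66, 9, 149, 187]) + bytes([161, 142, 100, 80])
        + (2794).to_bytes(2, "big") + (1766).to_bytes(2, "big"))
print(hex(toeplitz_hash(tcp4)))
print(rx_queue(0x8864, b""))  # PPPoE session frame: always queue 0
```

Every PPPoE frame short-circuits at the ethertype check, so no amount of driver tuning can spread that traffic across hardware queues.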

The igb(4) driver has nothing to do with this problem, and the mentioned "patch" cannot solve it either. However, there are other ways.

The most performant way for production use is to combine several igb NICs into a lagg(4) logical channel connected to a managed switch configured to distribute traffic flows between the ports of the logical channel based on the source MAC address of a frame. This is useful for mass-servicing of clients, when one has multiple PPPoE clients each generating flows of PPPoE frames from a distinct MAC address. It is not really useful for a PPPoE client receiving all frames from a single PPPoE server.
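For reference, the FreeBSD side of such a lagg(4) setup might look like the following rc.conf fragment (interface names and the LACP protocol choice are illustrative; the switch must additionally be configured to balance the channel by source MAC address):

```shell
# /etc/rc.conf fragment -- combine two igb ports into one lagg(4) channel.
# Interface names (igb0/igb1) are illustrative; adjust to your hardware.
ifconfig_igb0="up"
ifconfig_igb1="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport igb0 laggport igb1 up"
```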

There is another way. By default, the FreeBSD kernel performs all processing of a received PPPoE frame within the driver interrupt context: decapsulation, optional decompression/decryption, network address translation, routing lookups, packet filtering and so on. This can overload a single CPU core in the default configuration, when sysctl net.isr.dispatch=direct. Since FreeBSD 8 we have the netisr(9) network dispatch service, which allows any NIC driver to just enqueue a received ethernet frame and cease further processing, freeing that CPU core. Other kernel threads on other CPU cores then dequeue received frames to complete decapsulation etc., loading all CPU cores evenly.

So, one should just make sure that "net.isr.maxthreads" and "net.isr.numthreads" are greater than 1 and switch net.isr.dispatch to "deferred", which permits NIC drivers to use netisr(9) queues to distribute the load between CPU cores.
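A minimal sketch of that configuration follows (the values are illustrative; note that net.isr.maxthreads is a boot-time loader tunable, while net.isr.dispatch can also be changed at runtime):

```shell
# /boot/loader.conf -- netisr thread count must be set at boot time.
net.isr.maxthreads="4"      # e.g. one netisr thread per CPU core
net.isr.bindthreads="1"     # pin each netisr thread to its own core

# /etc/sysctl.conf -- defer protocol processing out of interrupt context.
net.isr.dispatch=deferred
```

After a reboot, `netstat -Q` shows the active dispatch policy, the number of netisr threads, and per-CPU queue statistics, which makes it easy to verify the load is actually being spread.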
Comment 12 Jan Bramkamp 2018-07-16 14:53:39 UTC
It may work as intended, but that doesn't mean it works well: if I remember correctly there is a single lock in each PPPoE instance as well, so you have just moved the bottleneck a little bit up the network stack.
Comment 13 Eugene Grosbein freebsd_committer 2018-07-16 15:04:21 UTC
(In reply to Jan Bramkamp from comment #12)

I'm not sure which lock you are talking about.

Anyway, the problem has nothing to do with the igb driver, as the NICs supported by the igb(4) driver have no hardware support for per-queue distribution of PPPoE, period.

Though, feel free to open a new PR if you have performance problems with netisr(9) queues or the "PPPoE instance lock" (whatever that is).
Comment 14 Vladimir 2018-07-16 16:10:42 UTC
If the problem only shows up with the igb driver, couldn't it be an igb driver problem?
And why then was there a patch for igb, which is now missing?
https://wiki.freebsd.org/NetworkPerformanceTuning (Traffic flow section)
Comment 15 Eugene Grosbein freebsd_committer 2018-07-16 16:29:31 UTC
(In reply to Vladimir from comment #14)

The problem is that the hardware card does not support per-queue load distribution for PPPoE, not the driver. Why would anyone think that igb-supported NICs are capable of doing that in the first place? That's plain wrong.

And the mentioned patch never made things better, nor could it.
Comment 16 Eugene Grosbein freebsd_committer 2018-07-16 16:32:37 UTC
(In reply to Vladimir from comment #14)

I guess you can read Russian, please take a look at my post https://dadv.livejournal.com/139170.html , it may make things clearer for you.
Comment 17 Vladimir 2018-07-16 16:39:28 UTC
Thanks, Eugene.
Comment 18 Vladimir 2018-07-16 17:00:17 UTC
I think most people expect igb hardware to work the same way under FreeBSD as it does under other OSes like Linux or Windows. That is also what led me to think it was a driver problem; now I see that it is not, or at least that this behaviour is expected under certain conditions.
Comment 19 ricsip 2018-07-16 19:46:42 UTC
(In reply to Eugene Grosbein from comment #16)

Gents, if igb (and any other NIC on the planet) is so fundamentally broken for multi-queue + PPPoE, at least state this clearly in the driver's "Known issues" section. It would help many people avoid using SBCs for a gigabit firewall.
Comment 20 Eugene Grosbein freebsd_committer 2018-07-17 03:52:38 UTC
(In reply to ricsip from comment #19)

Again, it is not the igb(4) driver that is "broken"; the corresponding network cards have no hardware support for distributing PPPoE traffic per-queue. This is already documented in the manufacturer's (Intel's) datasheets. Why would someone blindly assume the NIC has such support and not read the datasheet?
Comment 22 ricsip 2018-07-17 08:36:36 UTC
(In reply to Eugene Grosbein from comment #20)

Eugene: thanks for the clarification. Forgive my ignorance, that I was not aware that the feature "multiple TX/RX queue with RSS support" for any NIC on the market is a pure marketing gimmick, if the connection type on the NIC will be set as ="PPPoE" and not as "pure-IP". Indeed, somebody with the necessary network background could easily decrypt the true meaning of the various tables (quote from Intel i210 datasheet):

RSS and MSI-X to lower CPU utilization in multi-core systems
Receive Side Scaling (RSS) number of queues per port: Up to 4
Total number of Rx queues per port: 4
Total number of TX queues per port: 4
RSS — Receive Side Scaling distributes packet processing between several processor cores by assigning packets into different descriptor queues. RSS assigns to each received packet an RSS index. Packets are routed to a queue out of a set of Rx queues based on their RSS index and other considerations.

7.1.2.10.1 RSS Hash Function
Section 7.1.2.10.1 provides a verification suite used to validate that the hash function is computed according to Microsoft* nomenclature.
The I210 hash function follows Microsoft* definition. A single hash function is defined with several variations for the following cases:
• TcpIPv4 — The I210 parses the packet to identify an IPv4 packet containing a TCP segment per the criteria described later in this section. If the packet is not an IPv4 packet containing a TCP segment, RSS is not done for the packet.
• IPv4 — The I210 parses the packet to identify an IPv4 packet. If the packet is not an IPv4 packet, RSS is not done for the packet.
• TcpIPv6 — The I210 parses the packet to identify an IPv6 packet containing a TCP segment per the criteria described later in this section. If the packet is not an IPv6 packet containing a TCP segment, RSS is not done for the packet.
• TcpIPv6Ex — The I210 parses the packet to identify an IPv6 packet containing a TCP segment with extensions per the criteria described later in this section. If the packet is not an IPv6 packet containing a TCP segment, RSS is not done for the packet. Extension headers should be parsed for a Home-Address-Option field (for source address) or the Routing-Header-Type-2 field (for destination address).
• IPv6Ex — The I210 parses the packet to identify an IPv6 packet. Extension headers should be parsed for a Home-Address-Option field (for source address) or the Routing-Header-Type-2 field (for destination address). Note that the packet is not required to contain any of these extension headers to be hashed by this function. In this case, the IPv6 hash is used. If the packet is not an IPv6 packet, RSS is not done for the packet.
• IPv6 — The I210 parses the packet to identify an IPv6 packet. If the packet is not an IPv6 packet, receive-side-scaling is not done for the packet.

The following additional cases are NOT part of the Microsoft* RSS specification:
• UdpIPV4 — The I210 parses the packet to identify a packet with UDP over IPv4.
• UdpIPV6 — The I210 parses the packet to identify a packet with UDP over IPv6.
• UdpIPV6Ex — The I210 parses the packet to identify a packet with UDP over IPv6 with extensions.
A packet is identified as containing a TCP segment if all of the following conditions are met:
• The transport layer protocol is TCP (not UDP, ICMP, IGMP, etc.).
• The TCP segment can be parsed (such as IP options can be parsed, packet not encrypted).
• The packet is not fragmented (even if the fragment contains a complete TCP header)

For most people, these deep technical statements and values are not easy to convert into a real-world sense of "how will my i210-based firewall perform at 1 Gbit speed over a PPPoE connection".

Intel could have said in their datasheets that "this NIC is not recommended for PPPoE-type service". Or should Microsoft be blamed, since Intel implemented Microsoft's Receive Side Scaling algorithm, and it is Microsoft's definition that doesn't support anything apart from TCP/IP? You see, even Microsoft has entered this arena now :(

Just out of curiosity (it's an off-topic discussion from here on, again forgive me): what other network types lose the multi-queue TX/RX capability if they are not strictly considered "pure IP"?

Thanks again, situation disappointing.
Comment 23 Eugene Grosbein freebsd_committer 2018-07-17 10:31:04 UTC
(In reply to ricsip from comment #22)

This greatly depends on local conditions. There is exactly the same problem for a PPTP-connected client using GRE encapsulation (not plain TCP), where all traffic is delivered from the single IP address of the PPTP server to the single local IP address of the PPTP client. Same with L2TP-over-UDP encapsulation.
Comment 24 ricsip 2018-07-25 14:12:02 UTC
(In reply to Eugene Grosbein from comment #23)

Hi Eugene,

As I was not satisfied with the outcome here, I installed the IPFire Linux distribution on my APU2 to see what performance it can achieve versus the low performance seen under FreeBSD/OPNsense. Same purely IP-routing-based testing as with OPNsense, as I don't have the knowledge to build a proper PPPoE simulator lab.

Well, using IPFire I could reach 850-900 Mbit/s on single-flow iperf, which was only possible with multi-flow iperf under OPNsense. Even the load was much lighter compared to OPNsense: according to top, the box was 70% idle during a 900 Mbit single-flow iperf session. So I think this whole thread about RSS-locked-to-a-single-core is either incorrect, or the Linux kernel does some magic much, much better than FreeBSD.

So I no longer know whom to believe on this topic...
Comment 25 Jan Bramkamp 2018-07-25 14:15:53 UTC
Does Linux use RSS to achieve this performance or does it drain the NIC queue in a single interrupt and load balance the rest? Did you try the netisr workaround?
Comment 26 Eugene Grosbein freebsd_committer 2018-07-25 18:08:45 UTC
(In reply to ricsip from comment #24)

Please use our mailing lists or web forums for general support questions or discussion, and leave Bugzilla for bug reports. Again, the problem has nothing to do with this igb driver problem report. You can easily link to this PR and even to individual comments, though. Write to freebsd-net@freebsd.org, for example.

And yes, you should first try to make use of netisr(9) queues.
Comment 27 Vladimir 2018-07-25 18:21:29 UTC
Do you mean something like net.isr.dispatch=deferred ?
Comment 28 Eugene Grosbein freebsd_committer 2018-07-25 18:24:52 UTC
(In reply to Vladimir from comment #27)

Yes. Have you missed comment #11 describing possible solutions including this?
Comment 29 Vladimir 2018-07-25 18:32:45 UTC
(In reply to Eugene Grosbein from comment #28)
No, just wanted to confirm.
Comment 30 Kurt Jaeger freebsd_committer 2018-07-27 17:03:42 UTC
(In reply to anoteros from comment #1)

This link for a patch seems valid:

http://static.ipfw.ru/patches/igb_flowid.diff

Did anyone test with that? Does it still apply?
Comment 31 Eugene Grosbein freebsd_committer 2018-07-27 18:02:16 UTC
(In reply to Kurt Jaeger from comment #30)

This patch does not apply in any sense: it won't apply textually, and it was an (incomplete) attempt to solve a different problem in the first place: it tried to add a sysctl to disable flowid generation by the igb(4) driver based on the hardware flow ID assigned by the chip (which is always zero for PPPoE). It was meaningless from the beginning.