203524 – TCP checksum failed on igb network adapter

Summary: TCP checksum failed on igb network adapter

Status:	Closed Not A Bug

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	10.1-RELEASE
Hardware:	amd64 Any

Importance:	--- Affects Some People
Assignee:	freebsd-net (Nobody)

URL:
Keywords:

Depends on:
Blocks:

Reported:	2015-10-03 11:24 UTC by Sossi Andrej
Modified:	2016-01-26 13:11 UTC
CC List:	6 users

See Also:

Description Sossi Andrej 2015-10-03 11:24:57 UTC

Hello,

I have a weird network problem which I believe may be caused by the
FreeBSD igb driver or perhaps even the network adapter.

Let me try to explain the scenario in brief:
I have a FreeBSD 10.0-RELEASE-p10  server with a public IP address, in
which N virtual machines are installed through JAIL; the machines hold
private IP addresses on the loopback1 adapter. The VMs access the
internet through NATting on the public IP via ipfw:

nat 1 config ip X.Y.Z.W if igb0 unreg_only same_ports

add 60000 nat 1 ip from 192.168.250.0/24 to any out xmit igb0 keep-state
add 60001 nat 1 ip from any to X.Y.Z.W in recv igb0

In addition, port forwarding is configured on the real machine towards
the VMs in order to support public services (Apache httpd, database, etc.)

The network adapter is:
igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO>
        ether 00:45:80:dd:32:30
        inet X.Y.Z.W netmask 0xffffff00 broadcast X.Y.Z.W
        inet6 XX::YY:ZZ:WWW:VVV%igb0 prefixlen 64 scopeid 0x1
        inet6 XX:YY:ZZ:WWW::1 prefixlen 64
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active

The loopback1 adapter, to which the VMs' IPs are assigned, also has an MTU of 1500.

So far so good, in the sense that almost everything works as expected.
Occasionally, requests from the VMs to internet servers (http, sftp,
etc.) end in a timeout. The very same requests, if executed by the real
machine, complete correctly with a response.
After countless experiments I have managed to reproduce the problem
deterministically.

Through a tcpdump executed on the request's recipient I have noticed
that all TCP packets with a payload between 101 and 106 (inclusive)
bytes in size arrive with a wrong TCP checksum and as such are rejected.
Subsequent retransmissions of the same packet continue to bear a wrong
checksum, and this continues until the connection timeout is reached.
The IP checksum, instead, is always correct. Packets smaller than 101
bytes are transmitted and received with the correct checksum, as are
packets with a payload in excess of 116 bytes in size.
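
For reference, the check that the receiving host applies can be sketched in userspace C. This is a minimal illustration of the standard Internet checksum over the TCP pseudo-header and segment, not code from the FreeBSD sources. The function names (sum16, fold, tcp_checksum, checksum_demo) and the example addresses are assumptions made for this sketch. It only shows why a segment whose checksum no longer matches its addresses is rejected; it does not model the cause of the bad checksums reported here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add 16-bit big-endian words to a running one's-complement sum. */
static uint32_t sum16(const uint8_t *p, size_t len, uint32_t acc)
{
    while (len > 1) {
        acc += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)
        acc += (uint32_t)p[0] << 8;   /* odd trailing byte, zero padded */
    return acc;
}

/* Fold the carries back in and return the one's complement. */
static uint16_t fold(uint32_t acc)
{
    while (acc >> 16)
        acc = (acc & 0xffff) + (acc >> 16);
    return (uint16_t)~acc;
}

/*
 * Checksum over the IPv4 pseudo-header plus the TCP segment.
 * With the checksum field zeroed this yields the value to store;
 * with the field filled in, a result of 0 means "valid".
 */
static uint16_t tcp_checksum(const uint8_t src[4], const uint8_t dst[4],
                             const uint8_t *seg, size_t seglen)
{
    uint8_t pseudo[12];

    memcpy(pseudo, src, 4);
    memcpy(pseudo + 4, dst, 4);
    pseudo[8] = 0;
    pseudo[9] = 6;                          /* IPPROTO_TCP */
    pseudo[10] = (uint8_t)(seglen >> 8);
    pseudo[11] = (uint8_t)(seglen & 0xff);

    return fold(sum16(seg, seglen, sum16(pseudo, sizeof pseudo, 0)));
}

/* Returns 0 when the sketch behaves as described in the lead-in. */
int checksum_demo(void)
{
    const uint8_t priv[4] = {192, 168, 250, 10}; /* example jail address */
    const uint8_t pub[4]  = {203, 0, 113, 5};    /* placeholder public address */
    const uint8_t dst[4]  = {198, 51, 100, 7};   /* placeholder remote server */
    uint8_t seg[20 + 101] = {0};  /* 20-byte TCP header + 101-byte payload */

    seg[12] = 5 << 4;             /* data offset: 5 32-bit words */
    memset(seg + 20, 'A', 101);

    /* Sender side: compute with a zeroed field, then store the result. */
    uint16_t c = tcp_checksum(priv, dst, seg, sizeof seg);
    seg[16] = (uint8_t)(c >> 8);
    seg[17] = (uint8_t)(c & 0xff);

    /* Receiver side: valid for the original addresses... */
    int ok_before = tcp_checksum(priv, dst, seg, sizeof seg) == 0;
    /* ...but not once the source address changes and the field is stale. */
    int ok_after = tcp_checksum(pub, dst, seg, sizeof seg) == 0;

    return (ok_before && !ok_after) ? 0 : 1;
}
```

Calling checksum_demo() from a main() and checking for a return value of 0 exercises both the valid and the stale case.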

If TSO[46] and [TR]XCSUM are disabled in the igb options, the problem disappears.

I have the same problem on a second server with the same configuration
and hardware, but with FreeBSD 10.0-RELEASE-p1.

I believe the above behavior is caused by an error in the driver, as on
a third machine, with an identical configuration of jails behind NAT
but with an em driver, the checksum problem did not appear.

Comment 1 Eugene Grosbein 2015-10-05 01:27:34 UTC

Please read the ipfw(8) manual page, section BUGS; this is a known and documented problem of ipfw nat:

Due to the architecture of libalias(3), ipfw nat is not compatible with
the TCP segmentation offloading (TSO).  Thus, to reliably nat your
network traffic, please disable TSO on your NICs using ifconfig(8).

This PR should be closed.

Comment 2 fodillemlinkarim 2016-01-13 20:10:29 UTC

Hi,

I believe I have fixed something like this recently. Please see a post I just made:

Hi,

I've hit a very interesting problem with ipfw-nat and local TCP traffic that has enough TCP options to hit a special case in m_megapullup(). Here is the story:

I am using the following NIC:

igb0@pci0:4:0:0:        class=0x020000 card=0x00008086 chip=0x150e8086 rev=0x01 hdr=0x00

And when I do ipfw nat on locally emitted packets, I see packets that the igb driver does not process for HW checksum. A quick search for m_pullup in the igb driver code shows that our igb driver expects a contiguous Ethernet + IP header in igb_tx_ctx_setup(). The friendly m_megapullup() in alias.c, however, does not reserve any space before the IP header for the Ethernet header after its call to m_getcl, the way tcp_output.c does (see m->m_data += max_linkhdr in tcp_output.c).

So the call to M_PREPEND() in ether_output() is forced to prepend a new mbuf for the Ethernet header, leading to a non-contiguous Ethernet + IP header. This in turn leads to a failure to properly read the IP protocol in the igb driver and apply the proper HW checksum function. In particular, this statement in igb_tx_ctx_setup() is affected: ip = (struct ip *)(mp->m_data + ehdrlen);
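
The contiguity problem described above can be illustrated with a toy userspace model. This is only a sketch and not kernel code: struct toy_mbuf, first_buf_holds(), chain_byte() and chain_demo() are hypothetical names, and the real struct mbuf carries far more state. It shows that when the Ethernet header sits in its own buffer, the IP protocol byte is not reachable through the first buffer's data pointer, while a chain-aware read still finds it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HLEN  14  /* Ethernet header length */
#define IP_P_OFF   9  /* offset of the protocol field in an IPv4 header */

/* Toy stand-in for an mbuf: a chain of (pointer, length) pieces. */
struct toy_mbuf {
    struct toy_mbuf *next;
    const uint8_t   *data;  /* plays the role of m_data */
    size_t           len;   /* plays the role of m_len */
};

/* Is it safe to read `need` bytes straight from the first buffer? */
static int first_buf_holds(const struct toy_mbuf *m, size_t need)
{
    return m != NULL && m->len >= need;
}

/* Chain-aware read of one byte at packet offset `off`; -1 if past the end. */
static int chain_byte(const struct toy_mbuf *m, size_t off)
{
    for (; m != NULL; m = m->next) {
        if (off < m->len)
            return m->data[off];
        off -= m->len;
    }
    return -1;
}

/* Returns 0 when the sketch behaves as described in the lead-in. */
int chain_demo(void)
{
    uint8_t frame[ETH_HLEN + 20] = {0};
    frame[ETH_HLEN] = 0x45;            /* IPv4, 20-byte header */
    frame[ETH_HLEN + IP_P_OFF] = 6;    /* protocol = TCP */

    /* Case 1: Ethernet + IP header in a single buffer. */
    struct toy_mbuf whole = { NULL, frame, sizeof frame };

    /* Case 2: Ethernet header prepended as a separate buffer. */
    struct toy_mbuf ip  = { NULL, frame + ETH_HLEN, 20 };
    struct toy_mbuf eth = { &ip, frame, ETH_HLEN };

    size_t need = ETH_HLEN + IP_P_OFF + 1;

    int contiguous_ok  = first_buf_holds(&whole, need);   /* direct read is fine */
    int split_ok       = first_buf_holds(&eth, need);     /* direct read would overrun */
    int proto_whole    = chain_byte(&whole, ETH_HLEN + IP_P_OFF);
    int proto_split    = chain_byte(&eth, ETH_HLEN + IP_P_OFF);

    return (contiguous_ok && !split_ok &&
            proto_whole == 6 && proto_split == 6) ? 0 : 1;
}
```

In the split case, a direct m_data + ehdrlen style access would point past the 14 valid bytes of the first buffer, which is the situation the paragraph above describes.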

To reproduce the issue I simply create a NAT rule for an igb interface and initiate a TCP connection locally going out through that interface (it should go through NAT obviously) something like:

ipfw nat 1 config igb0 reset
ipfw add 10 nat 1 via igb0

You do need to make sure the SYN packet is large enough to trigger the allocation of new memory in m_megapullup(). You can do this by using enough TCP options that the packet fills almost all of the 256-byte mbuf, or by making RESERVE something like 300 bytes in alias.c.

The fix I propose is very simple and faster for all drivers, including the ones that do perform a check for ether + ip to be contiguous upon accessing the IP header. If the leading space is available it doesn't allocate any extra space (as it should for most cases) but if for some reason the mbuf used doesn't have 100 bytes (RESERVE in megapullup) of free space it will reserve some at the front too. If the leading space isn't necessary then it won't cause any harm.


-Subproject commit cfe39807fe9b1a23c13f73aabde302046736fa1c
+Subproject commit cfe39807fe9b1a23c13f73aabde302046736fa1c-dirty
diff --git a/freebsd/sys/netinet/libalias/alias.c b/freebsd/sys/netinet/libalias/alias.c
index 876e958..dc424a6 100644
--- a/freebsd/sys/netinet/libalias/alias.c
+++ b/freebsd/sys/netinet/libalias/alias.c
@@ -1757,7 +1757,8 @@ m_megapullup(struct mbuf *m, int len) {
         * writable and has some extra space for expansion.
         * XXX: Constant 100bytes is completely empirical. */
 #define        RESERVE 100
-   if (m->m_next == NULL && M_WRITABLE(m) && M_TRAILINGSPACE(m) >= RESERVE)
+ if (m->m_next == NULL && M_WRITABLE(m) &&
+                 M_TRAILINGSPACE(m) >= RESERVE && M_LEADINGSPACE(m) >= max_linkhdr)
                return (m);

        if (len <= MCLBYTES - RESERVE) {
@@ -1779,6 +1780,7 @@ m_megapullup(struct mbuf *m, int len) {
                goto bad;

        m_move_pkthdr(mcl, m);
+ mcl->m_data += max_linkhdr;
        m_copydata(m, 0, len, mtod(mcl, caddr_t));
        mcl->m_len = mcl->m_pkthdr.len = len;
        m_freem(m);
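
The effect of reserving leading space, as the patch above does with mcl->m_data += max_linkhdr, can be modeled in userspace as follows. This is a hypothetical sketch: struct toy_buf, toy_init(), toy_prepend() and prepend_demo() are made up for illustration, and TOY_LINKHDR is an assumed stand-in for the kernel's max_linkhdr, whose actual value depends on the system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BUFSZ   256
#define TOY_LINKHDR  16   /* assumed stand-in for max_linkhdr */

/* Toy buffer: valid bytes start at `data`, somewhere inside `storage`. */
struct toy_buf {
    uint8_t  storage[TOY_BUFSZ];
    uint8_t *data;
    size_t   len;
};

/* Free bytes in front of the data, in the spirit of M_LEADINGSPACE(). */
static size_t leading_space(const struct toy_buf *b)
{
    return (size_t)(b->data - b->storage);
}

/* Copy a packet in, leaving `reserve` free bytes at the front. */
static void toy_init(struct toy_buf *b, size_t reserve,
                     const uint8_t *pkt, size_t len)
{
    assert(reserve + len <= TOY_BUFSZ);
    b->data = b->storage + reserve;
    memcpy(b->data, pkt, len);
    b->len = len;
}

/*
 * Try to prepend a header in place. Returns 1 on success, or 0 when
 * there is no room and a separate buffer would have to be chained in
 * front (the case that breaks header contiguity).
 */
static int toy_prepend(struct toy_buf *b, const uint8_t *hdr, size_t hlen)
{
    if (leading_space(b) < hlen)
        return 0;
    b->data -= hlen;
    memcpy(b->data, hdr, hlen);
    b->len += hlen;
    return 1;
}

/* Returns 0 when the sketch behaves as described in the lead-in. */
int prepend_demo(void)
{
    const uint8_t ip[20]  = {0x45};   /* start of an IPv4 header */
    const uint8_t eth[14] = {0x02};   /* placeholder Ethernet header */
    struct toy_buf plain, reserved;

    toy_init(&plain, 0, ip, sizeof ip);               /* no leading space */
    toy_init(&reserved, TOY_LINKHDR, ip, sizeof ip);  /* space reserved */

    int r_plain    = toy_prepend(&plain, eth, sizeof eth);
    int r_reserved = toy_prepend(&reserved, eth, sizeof eth);

    /* With reserved space, Ethernet + IP end up back to back in one buffer. */
    return (!r_plain && r_reserved &&
            reserved.len == sizeof eth + sizeof ip &&
            reserved.data[0] == 0x02 &&
            reserved.data[sizeof eth] == 0x45) ? 0 : 1;
}
```

With the reservation in place the prepend succeeds inside the same buffer, so a driver that reads the IP header at a fixed offset from the start of the first buffer finds it there.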

It would be nice if a FreeBSD committer could review this patch and hopefully add it to FreeBSD.

Thank you,

Karim.

Comment 3 Gleb Smirnoff (freebsd_committer) 2016-01-25 23:22:12 UTC

Karim, what I see in the FreeBSD head is that m_megapullup() runs m_align() on the mbuf, so the patch is no longer needed.

P.S. I filed a bug for igb(4).