Bug 212283 - oversized IP datagrams on raw socket with IP_RAWOUTPUT hang network interface drivers
Summary: oversized IP datagrams on raw socket with IP_RAWOUTPUT hang network interface...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.0-STABLE
Hardware: amd64 Any
: --- Affects Many People
Assignee: Michael Tuexen
URL:
Keywords: regression
Depends on:
Blocks: 208205
  Show dependency treegraph
 
Reported: 2016-08-31 12:21 UTC by Mathieu Arnold
Modified: 2019-01-21 19:35 UTC (History)
12 users (show)

See Also:
koobs: mfc-stable11?


Attachments
Proposed patch for net/ntraceroute port (510 bytes, patch)
2016-08-31 15:52 UTC, Andrey V. Elsukov
no flags Details | Diff
Proposed patch (untested) (883 bytes, patch)
2016-09-03 18:05 UTC, Andrey V. Elsukov
no flags Details | Diff
Proposed patch (untested) (884 bytes, patch)
2016-09-03 18:13 UTC, Andrey V. Elsukov
no flags Details | Diff
Proposed patch for net/ntraceroute port (1.01 KB, patch)
2016-12-27 19:38 UTC, Andrey V. Elsukov
no flags Details | Diff
Proposed patch (1.25 KB, patch)
2016-12-30 15:13 UTC, Andrey V. Elsukov
no flags Details | Diff
Simple test program (1.51 KB, text/plain)
2016-12-30 22:11 UTC, Michael Tuexen
no flags Details
another test program (1.69 KB, text/plain)
2016-12-30 22:42 UTC, Andrey V. Elsukov
no flags Details
Proposed patch to enforce consistent lengths (529 bytes, patch)
2016-12-31 12:40 UTC, Michael Tuexen
no flags Details | Diff
Simple UDP based test program (1.16 KB, text/plain)
2017-01-13 11:02 UTC, Michael Tuexen
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mathieu Arnold freebsd_committer 2016-08-31 12:21:17 UTC
(it is for 11.0-RC2, but it also works on 11.0-RC1 and 11.0-RC2 is not in the list)

When I run ntraceroute -uOAM google.com (the host doesn't matter) (from net/ntraceroute)

On my dev box, it does:
re0: watchdog timeout
re0: link state changed to DOWN
re0: link state changed to UP

On my poudriere build box, it does:
bge0: watchdog timeout -- resetting
bge0: link state changed to DOWN
bge0.666: link state changed to DOWN
[...more vlans]
bge0: link state changed to UP
bge0.666: link state changed to UP
[...more vlans]

I have no clue of what it really does, but it's bad.
Comment 1 Mathieu Arnold freebsd_committer 2016-08-31 12:23:11 UTC
As a regular user, it fails with:

$ ntraceroute -uOAM google.com
traceroute: icmp socket: Operation not permitted

so, this needs to be run as root.
Comment 2 Andrey V. Elsukov freebsd_committer 2016-08-31 15:52:31 UTC
Created attachment 174255 [details]
Proposed patch for net/ntraceroute port

You can use this patch for port, but the kernel still need to be fixed.
Comment 3 Andrey V. Elsukov freebsd_committer 2016-08-31 17:32:21 UTC
From my ktrace results:
 36643 ntraceroute CALL  sendto(0x4,0x801270380,0x8021,0,0x614e14,0x10)
 36643 ntraceroute GIO   fd 4 wrote 4096 bytes
       0x0000 4500 2180 0000 0040 0

sendto() does send 32801 bytes, but in the IP header this is 8576 bytes. Since DF flag also stored in the wrong byte order we fall into ip_fragment(), where we create several fragments. Then at the end of ip_fragment() we call m_adj() to trim mbuf chain from the tail. But since real length is ~32801, after trimming we will still have several mbufs in the chain and a lot of data to send. 
A driver calls m_collapse() and it fails. This leads to triggering of watchdog.

# dtrace -n 'fbt::m_collapse:return { printf("%p", (struct mbuf *)arg1); stack(10); }'
dtrace: description 'fbt::m_collapse:return ' matched 1 probe
CPU     ID                    FUNCTION:NAME
              if_iwn.ko`iwn_tx_data+0x7d9
              if_iwn.ko`iwn_transmit+0x9e
              kernel`ieee80211_parent_xmitpkt+0x13
              kernel`ieee80211_vap_pkt_send_dest+0x29d
              kernel`ieee80211_vap_transmit+0x332
              kernel`ether_output+0x763
              kernel`ip_output+0x15f0
              kernel`rip_output+0x2d3
              kernel`sosend_generic+0x5e7
              kernel`kern_sendit+0x22a

  1  37159                m_collapse:return 0
              if_iwn.ko`iwn_tx_data+0x7d9
              if_iwn.ko`iwn_transmit+0x9e
              kernel`ieee80211_parent_xmitpkt+0x13
              kernel`ieee80211_vap_pkt_send_dest+0x29d
              kernel`ieee80211_vap_transmit+0x332
              kernel`ether_output+0x763
              kernel`ip_output+0x15f0
              kernel`rip_output+0x2d3
              kernel`sosend_generic+0x5e7
              kernel`kern_sendit+0x22a
1  37159                m_collapse:return 0

In console I see these messages:
iwn0: iwn_tx_data: could not defrag mbuf
iwn0: iwn_tx_data: could not defrag mbuf
iwn0: iwn_tx_data: could not defrag mbuf
iwn0: iwn_tx_data: could not defrag mbuf
iwn0: iwn_intr: fatal firmware error
firmware error log:
  error type      = "UNKNOWN" (0x00000EDC)
  program counter = 0x000222C4
  source line     = 0x00000AF4
  error data      = 0x0000000000000000
  branch link     = 0x0002224C0002224C
  interrupt link  = 0x0000D6BE00000000
  time            = 3419201031
driver status:
  tx ring  0: qid=0  cur=19  queued=9  
  tx ring  1: qid=1  cur=0   queued=0  
  tx ring  2: qid=2  cur=0   queued=0  
  tx ring  3: qid=3  cur=17  queued=0  
  tx ring  4: qid=4  cur=0   queued=0  
  tx ring  5: qid=5  cur=0   queued=0  
  tx ring  6: qid=6  cur=0   queued=0  
  tx ring  7: qid=7  cur=0   queued=0  
  tx ring  8: qid=8  cur=0   queued=0  
  tx ring  9: qid=9  cur=199 queued=0  
  tx ring 10: qid=10 cur=0   queued=0  
  tx ring 11: qid=11 cur=0   queued=0  
  tx ring 12: qid=12 cur=0   queued=0  
  tx ring 13: qid=13 cur=0   queued=0  
  tx ring 14: qid=14 cur=0   queued=0  
  tx ring 15: qid=15 cur=0   queued=0  
  tx ring 16: qid=16 cur=0   queued=0  
  tx ring 17: qid=17 cur=0   queued=0  
  tx ring 18: qid=18 cur=0   queued=0  
  tx ring 19: qid=19 cur=0   queued=0  
  rx ring: cur=60
iwn0: iwn_panicked: controller panicked, iv_state = 5; restarting

In your case seems re(4) behaves in the same way.
Comment 4 Andrey V. Elsukov freebsd_committer 2016-08-31 17:43:29 UTC
I think we need to ponder and discuss how much is it acceptable, invoke ip_fragment() in case when we have IP_RAWOUTPUT flag set.
Comment 5 Sean Bruno freebsd_committer 2016-09-01 14:36:15 UTC
Seemingly unrelated, we are seeing re(4) timeouts in more recent head.

FreeBSD 11.0-ALPHA6 (CLUSTER11) #0 r302331: Sun Jul  3 23:03:04 UTC 2016

We have two gateway machines using pf/pfsync/carp in failover mode.  The primary machine (igw0) seems to hit watchdog timeouts almost immediately.  Is there something we should try doing to alleviate this failure?  It seems that the primary/master igw0 hits the problem more reliably, and I haven't seen this on the backup.
Comment 6 commit-hook freebsd_committer 2016-09-02 21:12:48 UTC
A commit references this bug:

Author: mat
Date: Fri Sep  2 21:12:22 UTC 2016
New revision: 421276
URL: https://svnweb.freebsd.org/changeset/ports/421276

Log:
  Fix on 11.

  PR:		212283
  Submitted by:	ae
  Sponsored by:	Absolight

Changes:
  head/net/ntraceroute/Makefile
  head/net/ntraceroute/files/patch-traceroute.c
Comment 7 Mathieu Arnold freebsd_committer 2016-09-02 21:56:33 UTC
So, I tried your patch for ntraceroute, the problem is still there, the bge0 gets a watchdog timeout and the network still freezes for a minute or so.
Comment 8 Andrey V. Elsukov freebsd_committer 2016-09-03 18:05:36 UTC
Created attachment 174350 [details]
Proposed patch (untested)

I only tried ntraceroute on head/ and it worked for me. 
Since nobody showed interest, can you try this patch to the system?
Comment 9 Andrey V. Elsukov freebsd_committer 2016-09-03 18:13:29 UTC
Created attachment 174351 [details]
Proposed patch (untested)

Fix a typo.
Comment 10 Mathieu Arnold freebsd_committer 2016-09-05 14:26:54 UTC
I can't test a kernel patch on both boxes I have that currently run 11.0-RC2.

I can try to find the time to install something like vmware on my mac to install a FreeBSD to test this patch, but I don't see this happening this week or the next.
Comment 11 Gleb Smirnoff freebsd_committer 2016-09-06 19:48:35 UTC
I think we shouldn't tweak the ip_len in the kernel. If userland
Comment 12 Gleb Smirnoff freebsd_committer 2016-09-06 19:49:47 UTC
I think we shouldn't tweak the ip_len in the kernel. If userland supplies oversized datagram, greater than interface MTU, we should just drop it silently.
Comment 13 Gleb Smirnoff freebsd_committer 2016-09-06 19:52:02 UTC
Not silently. Better return EFBIG to userland.
Comment 14 Mathieu Arnold freebsd_committer 2016-09-17 09:05:20 UTC
This is still a problem on 11.0-RC3.
Comment 15 Mathieu Arnold freebsd_committer 2016-09-27 22:23:33 UTC
This is still a problem on 11.0-RELEASE.
Comment 16 Mathieu Arnold freebsd_committer 2016-10-14 10:59:33 UTC
This is still a problem on 11.0-RELEASE-p1
Comment 17 Michael Tuexen freebsd_committer 2016-10-14 12:01:43 UTC
(In reply to Gleb Smirnoff from comment #12)
Normally the kernel does fragmentation for raw IP packets. So I would prefer to keep that. But can't we check that the length provided in the send call matches the length in the IP header. If there is a mismatch, return an error to the user (indicating EINVAL)? That way the user gets an indication that some information was inconsistent.
Comment 18 Mathieu Arnold freebsd_committer 2016-10-29 10:53:06 UTC
Still a problem with 11.0-RELEASE-p2.
Comment 19 Mathieu Arnold freebsd_committer 2016-12-21 12:59:31 UTC
Still a problem with 11.0-RELEASE-p5.
Comment 20 Andrey V. Elsukov freebsd_committer 2016-12-27 19:38:25 UTC
Created attachment 178331 [details]
Proposed patch for net/ntraceroute port
Comment 21 Andrey V. Elsukov freebsd_committer 2016-12-27 19:38:52 UTC
(In reply to Mathieu Arnold from comment #7)
> So, I tried your patch for ntraceroute, the problem is still there, the bge0
> gets a watchdog timeout and the network still freezes for a minute or so.

I realized that when it is builded from ports, defined options have no effect due to __FreeBSD_version is not defined. Can you try new patch?
Comment 22 commit-hook freebsd_committer 2016-12-28 11:38:54 UTC
A commit references this bug:

Author: mat
Date: Wed Dec 28 11:38:12 UTC 2016
New revision: 429739
URL: https://svnweb.freebsd.org/changeset/ports/429739

Log:
  Really fix on 11+.

  PR:		212283
  Submitted by:	ae
  Sponsored by:	Absolight

Changes:
  head/net/ntraceroute/Makefile
  head/net/ntraceroute/files/patch-traceroute.c
Comment 23 Mathieu Arnold freebsd_committer 2016-12-28 11:39:24 UTC
Ah, this does fix ntraceroute, thanks.

I still think something should happen kernel-wise so that an unpatched ntraceroute returns an error instead of crashing the network card :-)
Comment 24 commit-hook freebsd_committer 2016-12-28 11:40:59 UTC
A commit references this bug:

Author: mat
Date: Wed Dec 28 11:40:00 UTC 2016
New revision: 429740
URL: https://svnweb.freebsd.org/changeset/ports/429740

Log:
  MFH: r429739

  Really fix on 11+.

  PR:		212283
  Submitted by:	ae
  Sponsored by:	Absolight

Changes:
_U  branches/2016Q4/
  branches/2016Q4/net/ntraceroute/Makefile
  branches/2016Q4/net/ntraceroute/files/patch-traceroute.c
Comment 25 Andrey V. Elsukov freebsd_committer 2016-12-29 21:56:50 UTC
So, I did some tests on my home machine with Intel em(4) card and with unpatched ntraceroute.
First of I added this debugging code into ip_fragment:
------
@@ -894,7 +895,19 @@ smart_frag_failure:
 		ip->ip_sum = in_cksum(m0, hlen);
 		m0->m_pkthdr.csum_flags &= ~CSUM_IP;
 	}
+	{
+		struct mbuf *m, *p;
 
+		printf("%s: nfrags %d, ip_len %d\n    ", __func__, nfrags, ip_len);
+		for (p = m0; p != NULL; p = p->m_nextpkt) {
+			for (m = p, nfrags = 0, len = 0; m != NULL; m = m->m_next) {
+				nfrags++;
+				len += m->m_len;
+			}
+			printf("%d(%d/%d) ", nfrags, p->m_pkthdr.len, len);
+		}
+		printf("\n");
+	}
 done:
 	*m_frag = m0;
 	return error;
------

Then I used this dtrace script:

------
#!/usr/sbin/dtrace -s

BEGIN
{
        packets = 0;
}

ip:::send
/args[2]->ip_ver == 4 && args[4]->ipv4_flags && args[4]->ipv4_offset/
{
        printf("%d: ip_len %d, flags 0x%02x, offset 0x%04x",
            ++packets, args[4]->ipv4_length, args[4]->ipv4_flags,
            args[4]->ipv4_offset);
}
------
After running this command we have:
# ntraceroute -uOAM google.com
------

ip_fragment: nfrags 6, ip_len 8576
    7(1500/25725) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1176/1176) 
ip_fragment: nfrags 3, ip_len 4288
    12(1500/46380) 2(1500/1500) 3(1328/1328) 
ip_fragment: nfrags 2, ip_len 2016
    14(1500/56835) 2(536/536) 

CPU     ID                    FUNCTION:NAME
  2  62685                        none:send 1: ip_len 1500, flags 0x20, offset 0x0040
  2  62685                        none:send 2: ip_len 1500, flags 0x20, offset 0x00f9
  2  62685                        none:send 3: ip_len 1500, flags 0x20, offset 0x01b2
  2  62685                        none:send 4: ip_len 1500, flags 0x20, offset 0x026b
  2  62685                        none:send 5: ip_len 1500, flags 0x20, offset 0x0324
  2  62685                        none:send 6: ip_len 1500, flags 0x20, offset 0x0040
  2  62685                        none:send 7: ip_len 1500, flags 0x20, offset 0x00f9
  2  62685                        none:send 8: ip_len 1500, flags 0x20, offset 0x0040

em0: Watchdog timeout Queue[0]-- resetting
Interface is RUNNING and ACTIVE
em0: TX Queue 0 ------
em0: hw tdh = 126, hw tdt = 361
em0: Tx Queue Status = -2147483648
em0: TX descriptors avail = 787
em0: Tx Descriptors avail failure = 0
em0: RX Queue 0 ------
em0: hw rdh = 112, hw rdt = 111
em0: RX discarded packets = 0
em0: RX Next to Check = 112
em0: RX Next to Refresh = 111
em0: link state changed to DOWN
em0: link state changed to UP

There is ktrace output for corresponding sendto:

1) ntraceroute CALL  sendto(0x4,0x80125f980,0x8021,0,0x614e84,0x10)
   ntraceroute RET   sendto 32801/0x8021
2) ntraceroute CALL  sendto(0x4,0x80125f980,0xc010,0,0x614e84,0x10)
   ntraceroute RET   sendto 49168/0xc010
3) ntraceroute CALL  sendto(0x4,0x80125f980,0xe007,0,0x614e84,0x10)
   ntraceroute RET   sendto 57351/0xe007
4) ntraceroute CALL  sendto(0x4,0x80125f980,0xf003,0,0x614e84,0x10)
   ntraceroute RET   sendto 63489/0xf801
5) ntraceroute CALL  sendto(0x4,0x80125f980,0xfc00,0,0x614e84,0x10)
   ntraceroute RET   sendto 64512/0xfc00
6) ntraceroute CALL  sendto(0x4,0x80125f980,0xfdff,0,0x614e84,0x10)
   ntraceroute RET   sendto -1 errno 22 Invalid argument
   
As you can see, the ip_fragment() is totally confused by values that IP header contains and produces wrong mbuf chain.
First mbuf in the chain has m_pkthdr.len = 1500 and total length 25725.

-----

Next experiment. Always use m_pkthdr.len as ip_len in ip_fragment().

-----
@@ -741,7 +742,7 @@ ip_fragment(struct ip *ip, struct mbuf **m_frag, i
 	int nfrags;
 	uint16_t ip_len, ip_off;
 
-	ip_len = ntohs(ip->ip_len);
+	ip_len = m0->m_pkthdr.len;
 	ip_off = ntohs(ip->ip_off);
 
 	if (ip_off & IP_DF) {	/* Fragmentation not allowed */
-----

The same commands produce the following output:

-----

ip_fragment: nfrags 23, ip_len 32801
    1(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(241/241) 
ip_fragment: nfrags 34, ip_len 49168
    1(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(328/328) 
ip_fragment: nfrags 39, ip_len 57351
    1(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1111/1111) 

CPU     ID                    FUNCTION:NAME
  2  62685                        none:send 1: ip_len 1500, flags 0x20, offset 0x0040
  2  62685                        none:send 2: ip_len 1500, flags 0x20, offset 0x00f9
  2  62685                        none:send 3: ip_len 1500, flags 0x20, offset 0x01b2
  2  62685                        none:send 4: ip_len 1500, flags 0x20, offset 0x026b
  2  62685                        none:send 5: ip_len 1500, flags 0x20, offset 0x0324
  2  62685                        none:send 6: ip_len 1500, flags 0x20, offset 0x03dd
  2  62685                        none:send 7: ip_len 1500, flags 0x20, offset 0x0496
.....
  2  62685                        none:send 88: ip_len 1500, flags 0x20, offset 0x1760
  2  62685                        none:send 89: ip_len 1500, flags 0x20, offset 0x1819
  2  62685                        none:send 90: ip_len 1500, flags 0x20, offset 0x18d2
  2  62685                        none:send 91: ip_len 1500, flags 0x20, offset 0x198b
  2  62685                        none:send 92: ip_len 1500, flags 0x20, offset 0x1a44
  2  62685                        none:send 93: ip_len 1500, flags 0x20, offset 0x1afd

em0: Watchdog timeout Queue[0]-- resetting
Interface is RUNNING and ACTIVE
em0: TX Queue 0 ------
em0: hw tdh = 353, hw tdt = 568
em0: Tx Queue Status = -2147483648
em0: TX descriptors avail = 808
em0: Tx Descriptors avail failure = 0
em0: RX Queue 0 ------
em0: hw rdh = 127, hw rdt = 126
em0: RX discarded packets = 0
em0: RX Next to Check = 127
em0: RX Next to Refresh = 126
em0: link state changed to DOWN
em0: link state changed to UP
-----

Now ip_fragment() produces correct mbuf chains. But anyway this leads to watchdog timeout.
I think at least this patch should be committed. Also we need to think how optimize ip_fragment() to not produce the chains for each fragment. I think this should not be so hard to implement.
Comment 26 Andrey V. Elsukov freebsd_committer 2016-12-29 22:02:51 UTC
re(4) also has the same problem, when 93 packets are queued its watchdog handles timeout.
Comment 27 Andrey V. Elsukov freebsd_committer 2016-12-29 22:12:06 UTC
(In reply to Andrey V. Elsukov from comment #26)
> optimize ip_fragment() to not produce the chains for each fragment. I think
> this should not be so hard to implement.

It looks like it isn't so easy :)
Because these mbufs use shared clusters.
Comment 28 Andrey V. Elsukov freebsd_committer 2016-12-30 15:13:31 UTC
Created attachment 178398 [details]
Proposed patch

I still think that we should deny sending oversized datagramms for RAW sockets with enabled IP_HDRINCL option. The rationale for this restriction is the desire of the user to have own built IP header. If you want own IP header, be ready to make IP fragmentation :-)
It looks like linux behaves the same. When RAW socket doesn't set IP_HDRINCL option, the kernel will do IP fragmentation if needed.
Comment 29 Michael Tuexen freebsd_committer 2016-12-30 15:23:41 UTC
Can you define "oversized packets"?

Best regards
Michael
Comment 30 Andrey V. Elsukov freebsd_committer 2016-12-30 15:37:24 UTC
(In reply to Michael Tuexen from comment #29)
> Can you define "oversized packets"?

In this case this will be all packets that exceeds MTU for outbound interface.
Note, all packets that were crafted with IP_HDRINCL option. If you didn't use IP_HDRINCL and IP_DONTFRAG options, packet will be fragmented.

There are a lot of code that expects IP header's fields are sane, and we can't guarantee this, because ip_hl and ip_id are the only what we really check. I only checked the unicast path, multicast path is, probably, also vulnerable.
Comment 31 Michael Tuexen freebsd_committer 2016-12-30 15:46:02 UTC
(In reply to Andrey V. Elsukov from comment #30)
I was assuming that on the send path fragmentation is done by the kernel even if I have the IP_HDRINCL option enabled.

Wasn't the original problem that the length given by the send() call is not in tune with the length given in the IP header (because in the header is wasn't in the expected byte order)?
Comment 32 Andrey V. Elsukov freebsd_committer 2016-12-30 16:05:00 UTC
(In reply to Michael Tuexen from comment #31)
> (In reply to Andrey V. Elsukov from comment #30)
> Wasn't the original problem that the length given by the send() call is not
> in tune with the length given in the IP header (because in the header is
> wasn't in the expected byte order)?

Yes, this is half of the problem. ip_off is also in wrong byte order. And this is the cause of generating of hundreds mbufs. Application wants determine MTU and it starts trying from 32k with DF flag set. But the kernel doesn't see this DF flag, it does IP fragmentation. App sends a much more data (it is interesting, so big MTU!), this produces much more fragments, and again, and again. So we get watchdog timeout and NIC reset.

> I was assuming that on the send path fragmentation is done by the kernel
> even if I have the IP_HDRINCL option enabled.

Then I don't know how to fix this.
I tried the unpatched ntraceroute on ixgbe(4). It doesn't complain, but it just hangs, only ifconfig ix0 down/up helped ressurect it.
Comment 33 Michael Tuexen freebsd_committer 2016-12-30 16:20:18 UTC
(In reply to Andrey V. Elsukov from comment #32)
Let me write a small test program to test this on Linux and FreeBSD. I think we agree on an inconsistent length in the header and the provided buffer should be detected.

However, if the user provides a large packet, this results in a lot of fragments provided to the driver. Isn't this the same when the user sends a large UDP packet?

Best regards
Michael
Comment 34 Andrey V. Elsukov freebsd_committer 2016-12-30 16:25:12 UTC
(In reply to Michael Tuexen from comment #33)
> (In reply to Andrey V. Elsukov from comment #32)
> Let me write a small test program to test this on Linux and FreeBSD. I think
> we agree on an inconsistent length in the header and the provided buffer
> should be detected.

I think you can use some in the tools/tools/netrate/ 

> However, if the user provides a large packet, this results in a lot of
> fragments provided to the driver. Isn't this the same when the user sends a
> large UDP packet?

Yes, ntraceroute also uses UDP.
Comment 35 Michael Tuexen freebsd_committer 2016-12-30 22:11:13 UTC
Created attachment 178409 [details]
Simple test program

A simple test program running on FreeBSD, Linux, Mac OS X, and Solaris.
Comment 36 Michael Tuexen freebsd_committer 2016-12-30 22:18:20 UTC
I did some testing with the provided test program. You can provide the buffer length (buf_len) and the length put in the IP header (hdr_len) as command line arguments. I tested this on FreeBSD head, Linux, Mac OS X, and Solaris. The results:

FreeBSD:
	* Supports fragmentation and the sending of buffers up to net.inet.raw.maxdgram bytes.
	* I tested also with net.inet.raw.maxdgram set to 65535 succesfully (using an igb interface).
	* The packet contains buf_len bytes, IP header reports hdr_len bytes
	* EINVAL returned if hdr_len > buf_len.

Linux:
	* Does not support fragmentation and the sending of buffers up to an MTU.
	* Always uses the buf_len (overwrites the hdr_len).

Mac OS X:
	* Same as FreeBSD.

Solaris:
	* supports fragmentation and the sending of buffers up to 65535 bytes.
	* Always uses the buf_len (overwrites the hdr_len).

What am I missing? I'm happy to test more. If the problem comes from ip_off, I think the kernel should ignore that, since the kernel should do the fragmentation...
Comment 37 Andrey V. Elsukov freebsd_committer 2016-12-30 22:22:48 UTC
(In reply to Michael Tuexen from comment #36)
> What am I missing? I'm happy to test more. If the problem comes from ip_off,
> I think the kernel should ignore that, since the kernel should do the
> fragmentation...

So, 'ping -s 64000' also works as expected.
Did you try to do sendto() in a loop?
Comment 38 Andrey V. Elsukov freebsd_committer 2016-12-30 22:42:19 UTC
Created attachment 178410 [details]
another test program
Comment 39 Andrey V. Elsukov freebsd_committer 2016-12-30 22:43:59 UTC
(In reply to Michael Tuexen from comment #36)

I modified your program to show how ntraceroute does it.

# ./test www.google.com 1000 64000 25

This kills em(4) and iwm(4) on my notebook (via lagg).
Comment 40 Michael Tuexen freebsd_committer 2016-12-31 00:45:32 UTC
It doesn't kill my igb card. Please note that running
./test www.google.com 1000 64000 25
send packets to 255.255.255.255 (which I tested), but it is not intended.
So could you see if
./test a.b.c.d 1000 64000 25
using an IPv4 address a.b.c.d also kills your interface?
Assuming yes, what does
./test a.b.c.d 64000 64000 25
do?
Comment 41 Andrey V. Elsukov freebsd_committer 2016-12-31 10:53:55 UTC
(In reply to Michael Tuexen from comment #40)
> It doesn't kill my igb card. Please note that running
> ./test www.google.com 1000 64000 25
> send packets to 255.255.255.255 (which I tested), but it is not intended.
> So could you see if
> ./test a.b.c.d 1000 64000 25
> using an IPv4 address a.b.c.d also kills your interface?

Ok, I did several tests on all machines what I have for tests :)

------------------
1) Some old 11.0-CURRENT #0 r291555
ale0: <Atheros AR8121/AR8113/AR8114 PCIe Ethernet>

# ./test 10.9.8.11 1000 64000 1
sends one packet (length 1000)

# ./test 10.9.8.11 1000 64000 25
sends 25 packets (length 1000)

# ./test 10.9.8.11 64000 64000 1
does IP fragmentation

# ./test 10.9.8.11 64000 64000 25
does IP fragmentation and then returns error:
send: No buffer space available

So, it works without problems.
------------------
2) 12.0-CURRENT #16 r310781 (projects/ipsec)
em0: <Intel(R) PRO/1000 Network Connection 7.6.1-k>
re0: <D-Link DGE-528(T) Gigabit Ethernet Adapter>

# ./test 10.9.8.2 1000 64000 1
sends one packet (length 1000)
This leads to:
em0: Watchdog timeout Queue[0]-- resetting
Interface is RUNNING and ACTIVE
em0: TX Queue 0 ------
em0: hw tdh = 49, hw tdt = 97
em0: Tx Queue Status = -2147483648
em0: TX descriptors avail = 975
em0: Tx Descriptors avail failure = 0
em0: RX Queue 0 ------
em0: hw rdh = 58, hw rdt = 57
em0: RX discarded packets = 0
em0: RX Next to Check = 58
em0: RX Next to Refresh = 57
em0: link state changed to DOWN

Now the same test on re(4):
# ./test 10.9.8.2 1000 64000 1
sends one packet (length 1000)
This leads to:
re0: watchdog timeout
------------------
3) The same hardware, but our patched version of 9.3-STABLE 
9.3-STABLE #1 6cbfd06-dirty (AFAIR, we didn't changed anything related to sockets).
# ./test 10.9.8.2 1000 64000 1
does IP fragmentation

# ./test 10.9.8.2 1000 64000 25
does IP fragmentation, then returns error:
send: No buffer space available

./test 10.9.8.2 64000 64000 1
This leads to:
re0: watchdog timeout

Now the same with em(4):
# ./test 10.9.8.2 1000 64000 1
Hm.. This sends only one packet. I tried again and this leads to:

em0: Watchdog timeout -- resetting
em0: Queue(0) tdh = 115, hw tdt = 270
em0: TX(0) desc avail = 867,Next TX to Clean = 113
em0: link state changed to DOWN

But fragments were sent.
------------------
4) The same hardware. 10.1-STABLE #1 r284010

# ./test 10.9.8.2 1000 64000 1
# ./test 10.9.8.2 1000 64000 1
em0: Watchdog timeout -- resetting
em0: Queue(0) tdh = 53, hw tdt = 107
em0: TX(0) desc avail = 969,Next TX to Clean = 52
em0: link state changed to DOWN

The same with re(4).
# ./test 10.9.8.2 1000 64000 50
send: No buffer space available

# ./test 10.9.8.2 64000 64000 1
re0: watchdog timeout
------------------
5) The same hardware. 11.0-CURRENT #34 (this our version of 11, it is based on r285435).

# ./test 10.9.8.2 1000 64000 1
re0: watchdog timeout

./test 10.9.8.2 64000 64000 50
send: No buffer space available

The same on em(4):

# ./test 10.9.8.2 1000 64000 1
Interface is RUNNING and ACTIVE
em0: TX Queue 0 ------
em0: hw tdh = 40, hw tdt = 73
em0: Tx Queue Status = -2147483648
em0: TX descriptors avail = 990
em0: Tx Descriptors avail failure = 0
em0: RX Queue 0 ------
em0: hw rdh = 39, hw rdt = 38
em0: RX discarded packets = 0
em0: RX Next to Check = 39
em0: RX Next to Refresh = 38
em0: link state changed to DOWN

# ./test 10.9.8.2 64000 64000 50
send: No buffer space available
------------------
6) 12.0-CURRENT #4 43d876e(projects/ipsec)-dirty
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.1.13-k>
cxl0: t5nex0: <Chelsio T580-LP-SO-CR>
mlx5_core0: <mlx5_core>: mce0: MT27710 Family [ConnectX-4 Lx]

ixgbe(4):
# ./test 87.250.242.144 1000 64000 25
sends packets with length 1000

# ./test 87.250.242.144 64000 64000 1000
send: No buffer space available

# ./test 87.250.242.144 1000 64000 1000
:)
This leads to ix0 TX hangs. "ifconfig ix0 down up" helps. Good to have ipmi :)

cxgbe(4):
# ifconfig cxl0 inet 10.9.8.1/24
# arp -s 10.9.8.2 00:11:22:33:44:55
# ./test 10.9.8.2 1000 64000 10000
send: No buffer space available

I've got ENOBUFS, but it seems cxgbe's TX also hangs.
# ifconfig cxl0 down up
^T
load: 0.00  cmd: ifconfig 11052 [qflush] 9.14r 0.00u 0.02s 0% 1964k

Ok. What if we'll try to do kldunload if_cxgbe? :)
# kldunload if_cxgbe
^T
load: 0.00  cmd: kldunload 11054 [qflush] 145.02r 0.00u 0.45s 0% 1500k

begin_synchronized_op with the following non-sleepable locks held:
exclusive sleep mutex bpf global lock (bpf global lock) r = 0 (0xffffffff81dcc378) locked @ /usr/src/sys/net/bpf.c:777
stack backtrace:
#0 0xffffffff80aa6d10 at witness_debugger+0x70
#1 0xffffffff80aa7fd7 at witness_warn+0x397
#2 0xffffffff8288d3fd at begin_synchronized_op+0x3d
#3 0xffffffff828972b0 at cxgbe_ioctl+0x980
#4 0xffffffff80b35052 at if_setflag+0xf2
#5 0xffffffff80b34efc at ifpromisc+0x2c
#6 0xffffffff80b2b098 at bpf_detachd_locked+0x1e8
#7 0xffffffff80b2da5b at bpf_dtor+0x8b
#8 0xffffffff8091fe7b at devfs_destroy_cdevpriv+0x8b
#9 0xffffffff809234c5 at devfs_close_f+0x65
#10 0xffffffff809f683a at _fdrop+0x1a
#11 0xffffffff809f9950 at closef+0x1f0
#12 0xffffffff809f6de3 at closefp+0xa3
#13 0xffffffff80eadab9 at amd64_syscall+0x2f9
#14 0xffffffff80e8d6ab at Xfast_syscall+0xfb

After reboot:
# ./test 87.250.242.144 64000 64000 1000
send: No buffer space available

It doesn't hangs, but it sends nothing.

mce0:
# ifconfig mce0 inet 10.9.8.1/24
# arp -s 10.9.8.2 00:11:22:33:44:55
# ./test 10.9.8.2 1000 64000 10000
10000 packets with length 1000 were sent. No errors. TX still work.

# ./test 10.9.8.2 64000 64000 1000
send: No buffer space available
11290 packets captured. TX still work.
------------------

I think that's enough for today. Happy new year! :)
Comment 42 Michael Tuexen freebsd_committer 2016-12-31 12:40:27 UTC
Created attachment 178417 [details]
Proposed patch to enforce consistent lengths
Comment 43 Michael Tuexen freebsd_committer 2016-12-31 12:49:26 UTC
Thanks for testing. I think it does not make much sense to use the raw ip interface to send IP packets with inconsistent lengths. The two lengths we have are the length of the buffer provided in the send call and the length field in the IP header header provided. Solaris and Linux enforce consistency by using the buffer length to overwrite the length in the IP header. I'm not sure I like this auto-fixing... Therefore I'm proposing a patch which checks the lengths for consistency and return EINVAL if they are not consistent.

This might not fix all issues we are seeing, but it should bring the raw ip interface with the IP_HDRINCL option on par with the raw ip interface without the option and with UDP or UDP-Lite sockets. And it reduces the testing a lot...

I would be interested on your opinion on the patch and on results on your interfaces when using consistent length. If I read your test results correctly, this would indicate that only the re driver has an issue...

Happy New Year!
Comment 44 Michael Tuexen freebsd_committer 2017-01-12 20:13:11 UTC
I put the consistency patch up on Phabricator:
https://reviews.freebsd.org/D9161
Comment 45 commit-hook freebsd_committer 2017-01-13 10:56:22 UTC
A commit references this bug:

Author: tuexen
Date: Fri Jan 13 10:55:27 UTC 2017
New revision: 312063
URL: https://svnweb.freebsd.org/changeset/base/312063

Log:
  Ensure that the buffer length and the length provided in the IPv4
  header match when using a raw socket to send IPv4 packets and
  providing the header. If they don't match, let send return -1
  and set errno to EINVAL.

  Before this patch is was only enforced that the length in the header
  is not larger then the buffer length.

  PR:			212283
  Reviewed by:		ae, gnn
  MFC after:		1 week
  Differential Revision:	https://reviews.freebsd.org/D9161

Changes:
  head/sys/netinet/raw_ip.c
Comment 46 Michael Tuexen freebsd_committer 2017-01-13 11:02:53 UTC
Created attachment 178855 [details]
Simple UDP based test program

Since we now enfore consistent lengths for raw IP packet, I think raw sockets should behave the same as UDP sockets. I attached a test program for this.
If UDP sockets and raw IP sockets behave the same now (and they do for the interfaces I can test with) and some ethernet cards have problems with such a load, I think this is a driver issue and should be addresses at that level. But testing is needed first.

So could you retest using raw IP sockets and UDP sockets using r312063 or higher? Or just uses consistent lengths...
Comment 47 Pyun YongHyeon freebsd_committer 2017-01-15 04:30:17 UTC
(In reply to Michael Tuexen from comment #46)
I'm not sure I fully understand this PR and analysis of NIC test
results(it's too long).  Generally, drivers blindly trust mbuf
chains(virtual addresses and lengths) passed from upper stack and
set up DMA descriptors after converting virtual addresses to
physical addresses.  DMA engine of NIC do TX work based on DMA
descriptors and generate an TX completion interrupt after
transmission.  Some firmware of NIC also parse IP/TCP/UDP headers
to compute checksum, TSO/UFO and etc.  If some fields of protocol
header or mbuf chain addresses/lengths are not consistent with
total length of packet, DMA engine/NIC firmware will access invalid
physical addresses.  If NIC firmware parser detects some 
inconsistency in protocol header before doing actual DMA work it
shall abort TX operation such that it leads to watchdog timeouts.
If DMA engine accesses invalid physical addresses it will 
completely lockup NIC or crash the system.  I think the various NIC
tests were done with broken mbuf chains so it's expected result.

Michael's test program should work on all network drivers since
it's the same type of operation as NFS over UDP.  re(4)/ale(4) was
extensively tested with NFS over UDP in the past.  em(4) was also
tested with NFS over UDP when I was em(4) maintainer.  I don't have
handy CURRENT box to test re(4) at this moment but I can do re(4)
test again if you really need it. It may take several weeks though.
Comment 48 Michael Tuexen freebsd_committer 2017-01-15 09:09:29 UTC
(In reply to Pyun YongHyeon from comment #47)
I pretty much agree. We are now forcing the mbuf chain to be consistent and I don't see any problems with em or igb drivers anymore. If the test programs still result in problems on particular NICs, it is a driver problem if the NIC, not a problem of the IP stack.
From my point of view we can close this PR and open NIC specific ones if particular NICs have problems with a high load of fragmented IP packets.
Comment 49 commit-hook freebsd_committer 2017-06-01 08:04:25 UTC
A commit references this bug:

Author: tuexen
Date: Thu Jun  1 08:04:10 UTC 2017
New revision: 319392
URL: https://svnweb.freebsd.org/changeset/base/319392

Log:
  MFC r312063:

  Ensure that the buffer length and the length provided in the IPv4
  header match when using a raw socket to send IPv4 packets and
  providing the header. If they don't match, let send return -1
  and set errno to EINVAL.

  Before this patch is was only enforced that the length in the header
  is not larger then the buffer length.

  PR:			212283
  Reviewed by:		ae, gnn
  Differential Revision:	https://reviews.freebsd.org/D9161

Changes:
_U  stable/11/
  stable/11/sys/netinet/raw_ip.c
Comment 50 Oleksandr Tymoshenko freebsd_committer freebsd_triage 2019-01-21 19:20:15 UTC
Michael, can this PR be closed or is there work pending?

Thanks
Comment 51 Michael Tuexen freebsd_committer 2019-01-21 19:35:42 UTC
(In reply to Oleksandr Tymoshenko from comment #50)
I think whatever needs to be fixed in the IP stack is fixed. If there are still driver specific problems, they need to be addressed in a different way.
Since there where no problem reports for two years, I'm closing it know as fixed.