Summary:   oversized IP datagrams on raw socket with IP_RAWOUTPUT hang network interface drivers
Product:   Base System
Component: kern
Version:   11.0-STABLE
Hardware:  amd64
OS:        Any
Status:    Closed FIXED
Severity:  Affects Many People
Priority:  ---
Keywords:  regression
Reporter:  Mathieu Arnold <mat>
Assignee:  Michael Tuexen <tuexen>
CC:        ae, erj, glebius, gonzo, hselasky, jhb, np, re, rwatson, sbruno, tuexen, yongari
Flags:     koobs: mfc-stable11?
Description
Mathieu Arnold
2016-08-31 12:21:17 UTC
As a regular user, it fails with:

    $ ntraceroute -uOAM google.com
    traceroute: icmp socket: Operation not permitted

so this needs to be run as root.

Created attachment 174255 [details]
Proposed patch for net/ntraceroute port

You can use this patch for the port, but the kernel still needs to be fixed.
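For context on what such a port patch has to change: with IP_HDRINCL, historical BSD raw sockets expected ip_len and ip_off in host byte order, while FreeBSD 11 switched to network byte order, which is why the ktrace below shows a length of 0x8021 being reinterpreted as 0x2180. A minimal sketch of filling the header for FreeBSD 11 or newer (illustrative only, not the attached patch):

    /*
     * Hedged sketch: filling an IP header for a raw socket with
     * IP_HDRINCL on FreeBSD 11+, where ip_len and ip_off are expected
     * in network byte order (older FreeBSD wanted host byte order).
     */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <stdint.h>
    #include <string.h>

    static void
    fill_ip_header(struct ip *ip, size_t packlen, struct in_addr dst)
    {
    	memset(ip, 0, sizeof(*ip));
    	ip->ip_v = IPVERSION;
    	ip->ip_hl = sizeof(*ip) >> 2;
    	ip->ip_len = htons((uint16_t)packlen); /* host order on FreeBSD < 11 */
    	ip->ip_id = 0;                         /* 0: kernel fills it in */
    	ip->ip_off = htons(IP_DF);             /* same byte-order rule */
    	ip->ip_ttl = 64;
    	ip->ip_p = IPPROTO_UDP;
    	ip->ip_dst = dst;
    	/* ip_src and ip_sum may be left zero; ip_output() fills them in. */
    }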
From my ktrace results:

    36643 ntraceroute CALL  sendto(0x4,0x801270380,0x8021,0,0x614e14,0x10)
    36643 ntraceroute GIO   fd 4 wrote 4096 bytes
    0x0000 4500 2180 0000 0040 0

sendto() sends 32801 bytes, but according to the IP header this is 8576 bytes. Since the DF flag is also stored in the wrong byte order, we fall into ip_fragment(), where we create several fragments. Then at the end of ip_fragment() we call m_adj() to trim the mbuf chain from the tail. But since the real length is ~32801 bytes, after trimming we will still have several mbufs in the chain and a lot of data to send. The driver calls m_collapse() and it fails. This triggers the watchdog.

    # dtrace -n 'fbt::m_collapse:return { printf("%p", (struct mbuf *)arg1); stack(10); }'
    dtrace: description 'fbt::m_collapse:return ' matched 1 probe
    CPU     ID                    FUNCTION:NAME
      1  37159            m_collapse:return 0
              if_iwn.ko`iwn_tx_data+0x7d9
              if_iwn.ko`iwn_transmit+0x9e
              kernel`ieee80211_parent_xmitpkt+0x13
              kernel`ieee80211_vap_pkt_send_dest+0x29d
              kernel`ieee80211_vap_transmit+0x332
              kernel`ether_output+0x763
              kernel`ip_output+0x15f0
              kernel`rip_output+0x2d3
              kernel`sosend_generic+0x5e7
              kernel`kern_sendit+0x22a
      1  37159            m_collapse:return 0
              if_iwn.ko`iwn_tx_data+0x7d9
              if_iwn.ko`iwn_transmit+0x9e
              kernel`ieee80211_parent_xmitpkt+0x13
              kernel`ieee80211_vap_pkt_send_dest+0x29d
              kernel`ieee80211_vap_transmit+0x332
              kernel`ether_output+0x763
              kernel`ip_output+0x15f0
              kernel`rip_output+0x2d3
              kernel`sosend_generic+0x5e7
              kernel`kern_sendit+0x22a

In the console I see these messages:

    iwn0: iwn_tx_data: could not defrag mbuf
    iwn0: iwn_tx_data: could not defrag mbuf
    iwn0: iwn_tx_data: could not defrag mbuf
    iwn0: iwn_tx_data: could not defrag mbuf
    iwn0: iwn_intr: fatal firmware error
    firmware error log:
      error type      = "UNKNOWN" (0x00000EDC)
      program counter = 0x000222C4
      source line     = 0x00000AF4
      error data      = 0x0000000000000000
      branch link     = 0x0002224C0002224C
      interrupt link  = 0x0000D6BE00000000
      time            = 3419201031
    driver status:
      tx ring  0: qid=0  cur=19  queued=9
      tx ring  1: qid=1  cur=0   queued=0
      tx ring  2: qid=2  cur=0   queued=0
      tx ring  3: qid=3  cur=17  queued=0
      tx ring  4: qid=4  cur=0   queued=0
      tx ring  5: qid=5  cur=0   queued=0
      tx ring  6: qid=6  cur=0   queued=0
      tx ring  7: qid=7  cur=0   queued=0
      tx ring  8: qid=8  cur=0   queued=0
      tx ring  9: qid=9  cur=199 queued=0
      tx ring 10: qid=10 cur=0   queued=0
      tx ring 11: qid=11 cur=0   queued=0
      tx ring 12: qid=12 cur=0   queued=0
      tx ring 13: qid=13 cur=0   queued=0
      tx ring 14: qid=14 cur=0   queued=0
      tx ring 15: qid=15 cur=0   queued=0
      tx ring 16: qid=16 cur=0   queued=0
      tx ring 17: qid=17 cur=0   queued=0
      tx ring 18: qid=18 cur=0   queued=0
      tx ring 19: qid=19 cur=0   queued=0
      rx ring: cur=60
    iwn0: iwn_panicked: controller panicked, iv_state = 5; restarting

It seems re(4) behaves the same way in your case. I think we need to ponder and discuss whether it is acceptable to invoke ip_fragment() at all when the IP_RAWOUTPUT flag is set.

Seemingly unrelated, we are seeing re(4) timeouts on more recent head: FreeBSD 11.0-ALPHA6 (CLUSTER11) #0 r302331: Sun Jul 3 23:03:04 UTC 2016. We have two gateway machines using pf/pfsync/carp in failover mode. The primary machine (igw0) seems to hit watchdog timeouts almost immediately. Is there something we should try doing to alleviate this failure? It seems that the primary/master igw0 hits the problem more reliably; I haven't seen this on the backup.

A commit references this bug:

    Author: mat
    Date: Fri Sep 2 21:12:22 UTC 2016
    New revision: 421276
    URL: https://svnweb.freebsd.org/changeset/ports/421276

    Log:
      Fix on 11.

      PR:           212283
      Submitted by: ae
      Sponsored by: Absolight

    Changes:
      head/net/ntraceroute/Makefile
      head/net/ntraceroute/files/patch-traceroute.c

So, I tried your patch for ntraceroute, and the problem is still there: bge0 gets a watchdog timeout and the network still freezes for a minute or so.

Created attachment 174350 [details]
Proposed patch (untested)
I only tried ntraceroute on head/ and it worked for me.
Since nobody showed interest, can you try this patch to the system?
Created attachment 174351 [details]
Proposed patch (untested)
Fix a typo.
I can't test a kernel patch on both boxes I have, as they currently run 11.0-RC2. I can try to find the time to install something like VMware on my Mac to install a FreeBSD to test this patch, but I don't see this happening this week or the next.

I think we shouldn't tweak the ip_len in the kernel. If userland supplies an oversized datagram, greater than the interface MTU, we should just drop it silently.

Not silently. Better to return EFBIG to userland.

This is still a problem on 11.0-RC3.

This is still a problem on 11.0-RELEASE.

This is still a problem on 11.0-RELEASE-p1.

(In reply to Gleb Smirnoff from comment #12)
Normally the kernel does fragmentation for raw IP packets, so I would prefer to keep that. But can't we check that the length provided in the send call matches the length in the IP header? If there is a mismatch, return an error to the user (indicating EINVAL). That way the user gets an indication that some information was inconsistent.

Still a problem with 11.0-RELEASE-p2.

Still a problem with 11.0-RELEASE-p5.

Created attachment 178331 [details]
Proposed patch for net/ntraceroute port
(In reply to Mathieu Arnold from comment #7)
> So, I tried your patch for ntraceroute, the problem is still there, the bge0
> gets a watchdog timeout and the network still freezes for a minute or so.

I realized that when it is built from ports, the defined options have no effect because __FreeBSD_version is not defined. Can you try the new patch?

A commit references this bug:

    Author: mat
    Date: Wed Dec 28 11:38:12 UTC 2016
    New revision: 429739
    URL: https://svnweb.freebsd.org/changeset/ports/429739

    Log:
      Really fix on 11+.

      PR:           212283
      Submitted by: ae
      Sponsored by: Absolight

    Changes:
      head/net/ntraceroute/Makefile
      head/net/ntraceroute/files/patch-traceroute.c

Ah, this does fix ntraceroute, thanks. I still think something should happen kernel-wise so that an unpatched ntraceroute returns an error instead of crashing the network card :-)

A commit references this bug:

    Author: mat
    Date: Wed Dec 28 11:40:00 UTC 2016
    New revision: 429740
    URL: https://svnweb.freebsd.org/changeset/ports/429740

    Log:
      MFH: r429739
      Really fix on 11+.

      PR:           212283
      Submitted by: ae
      Sponsored by: Absolight

    Changes:
      _U branches/2016Q4/
      branches/2016Q4/net/ntraceroute/Makefile
      branches/2016Q4/net/ntraceroute/files/patch-traceroute.c

So, I did some tests on my home machine with an Intel em(4) card and with unpatched ntraceroute. First I added this debugging code into ip_fragment():

    @@ -894,7 +895,19 @@ smart_frag_failure:
     		ip->ip_sum = in_cksum(m0, hlen);
     		m0->m_pkthdr.csum_flags &= ~CSUM_IP;
     	}
    +	{
    +		struct mbuf *m, *p;
    +		printf("%s: nfrags %d, ip_len %d\n ", __func__, nfrags, ip_len);
    +		for (p = m0; p != NULL; p = p->m_nextpkt) {
    +			for (m = p, nfrags = 0, len = 0; m != NULL; m = m->m_next) {
    +				nfrags++;
    +				len += m->m_len;
    +			}
    +			printf("%d(%d/%d) ", nfrags, p->m_pkthdr.len, len);
    +		}
    +		printf("\n");
    +	}
     done:
     	*m_frag = m0;
     	return error;

Then I used this dtrace script:

    #!/usr/sbin/dtrace -s

    BEGIN
    {
    	packets = 0;
    }

    ip:::send
    /args[2]->ip_ver == 4 && args[4]->ipv4_flags && args[4]->ipv4_offset/
    {
    	printf("%d: ip_len %d, flags 0x%02x, offset 0x%04x", ++packets,
    	    args[4]->ipv4_length, args[4]->ipv4_flags, args[4]->ipv4_offset);
    }

After running the following command we have:

    # ntraceroute -uOAM google.com

    ip_fragment: nfrags 6, ip_len 8576
     7(1500/25725) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1176/1176)
    ip_fragment: nfrags 3, ip_len 4288
     12(1500/46380) 2(1500/1500) 3(1328/1328)
    ip_fragment: nfrags 2, ip_len 2016
     14(1500/56835) 2(536/536)

    CPU     ID                    FUNCTION:NAME
      2  62685                none:send 1: ip_len 1500, flags 0x20, offset 0x0040
      2  62685                none:send 2: ip_len 1500, flags 0x20, offset 0x00f9
      2  62685                none:send 3: ip_len 1500, flags 0x20, offset 0x01b2
      2  62685                none:send 4: ip_len 1500, flags 0x20, offset 0x026b
      2  62685                none:send 5: ip_len 1500, flags 0x20, offset 0x0324
      2  62685                none:send 6: ip_len 1500, flags 0x20, offset 0x0040
      2  62685                none:send 7: ip_len 1500, flags 0x20, offset 0x00f9
      2  62685                none:send 8: ip_len 1500, flags 0x20, offset 0x0040

    em0: Watchdog timeout Queue[0]-- resetting
    Interface is RUNNING and ACTIVE
    em0: TX Queue 0 ------
    em0: hw tdh = 126, hw tdt = 361
    em0: Tx Queue Status = -2147483648
    em0: TX descriptors avail = 787
    em0: Tx Descriptors avail failure = 0
    em0: RX Queue 0 ------
    em0: hw rdh = 112, hw rdt = 111
    em0: RX discarded packets = 0
    em0: RX Next to Check = 112
    em0: RX Next to Refresh = 111
    em0: link state changed to DOWN
    em0: link state changed to UP

Here is the ktrace output for the corresponding sendto() calls:

    1) ntraceroute CALL  sendto(0x4,0x80125f980,0x8021,0,0x614e84,0x10)
       ntraceroute RET   sendto 32801/0x8021
    2) ntraceroute CALL  sendto(0x4,0x80125f980,0xc010,0,0x614e84,0x10)
       ntraceroute RET   sendto 49168/0xc010
    3) ntraceroute CALL  sendto(0x4,0x80125f980,0xe007,0,0x614e84,0x10)
       ntraceroute RET   sendto 57351/0xe007
    4) ntraceroute CALL  sendto(0x4,0x80125f980,0xf003,0,0x614e84,0x10)
       ntraceroute RET   sendto 63489/0xf801
    5) ntraceroute CALL  sendto(0x4,0x80125f980,0xfc00,0,0x614e84,0x10)
       ntraceroute RET   sendto 64512/0xfc00
    6) ntraceroute CALL  sendto(0x4,0x80125f980,0xfdff,0,0x614e84,0x10)
       ntraceroute RET   sendto -1 errno 22 Invalid argument

As you can see, ip_fragment() is totally confused by the values the IP header contains and produces a wrong mbuf chain. The first mbuf in the chain has m_pkthdr.len = 1500 but a total length of 25725.

Next experiment: always use m_pkthdr.len as ip_len in ip_fragment().

    @@ -741,7 +742,7 @@ ip_fragment(struct ip *ip, struct mbuf **m_frag, i
     	int nfrags;
     	uint16_t ip_len, ip_off;

    -	ip_len = ntohs(ip->ip_len);
    +	ip_len = m0->m_pkthdr.len;
     	ip_off = ntohs(ip->ip_off);
     	if (ip_off & IP_DF) {	/* Fragmentation not allowed */

The same commands produce the following output:

    ip_fragment: nfrags 23, ip_len 32801
     1(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500)
     2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500)
     2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500)
     2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(241/241)
    ip_fragment: nfrags 34, ip_len 49168
     1(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500)
     2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500)
     2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500)
     2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500)
     3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500)
     3(1500/1500) 2(1500/1500) 2(1500/1500) 3(328/328)
    ip_fragment: nfrags 39, ip_len 57351
     1(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500)
     2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500)
     2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500)
     2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500)
     3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 2(1500/1500)
     3(1500/1500) 2(1500/1500) 2(1500/1500) 3(1500/1500) 2(1500/1500) 3(1500/1500)
     2(1500/1500) 2(1500/1500) 3(1111/1111)

    CPU     ID                    FUNCTION:NAME
      2  62685                none:send 1: ip_len 1500, flags 0x20, offset 0x0040
      2  62685                none:send 2: ip_len 1500, flags 0x20, offset 0x00f9
      2  62685                none:send 3: ip_len 1500, flags 0x20, offset 0x01b2
      2  62685                none:send 4: ip_len 1500, flags 0x20, offset 0x026b
      2  62685                none:send 5: ip_len 1500, flags 0x20, offset 0x0324
      2  62685                none:send 6: ip_len 1500, flags 0x20, offset 0x03dd
      2  62685                none:send 7: ip_len 1500, flags 0x20, offset 0x0496
    .....
      2  62685                none:send 88: ip_len 1500, flags 0x20, offset 0x1760
      2  62685                none:send 89: ip_len 1500, flags 0x20, offset 0x1819
      2  62685                none:send 90: ip_len 1500, flags 0x20, offset 0x18d2
      2  62685                none:send 91: ip_len 1500, flags 0x20, offset 0x198b
      2  62685                none:send 92: ip_len 1500, flags 0x20, offset 0x1a44
      2  62685                none:send 93: ip_len 1500, flags 0x20, offset 0x1afd

    em0: Watchdog timeout Queue[0]-- resetting
    Interface is RUNNING and ACTIVE
    em0: TX Queue 0 ------
    em0: hw tdh = 353, hw tdt = 568
    em0: Tx Queue Status = -2147483648
    em0: TX descriptors avail = 808
    em0: Tx Descriptors avail failure = 0
    em0: RX Queue 0 ------
    em0: hw rdh = 127, hw rdt = 126
    em0: RX discarded packets = 0
    em0: RX Next to Check = 127
    em0: RX Next to Refresh = 126
    em0: link state changed to DOWN
    em0: link state changed to UP

Now ip_fragment() produces correct mbuf chains, but this still leads to a watchdog timeout. I think at least this patch should be committed. Also, we need to think about how to optimize ip_fragment() so that it does not produce a separate chain for each fragment; I think this should not be too hard to implement. re(4) also has the same problem: when 93 packets are queued, its watchdog handles a timeout.

(In reply to Andrey V. Elsukov from comment #26)
> optimize ip_fragment() to not produce the chains for each fragment. I think
> this should not be so hard to implement.

It looks like it isn't so easy :) Because these mbufs use shared clusters.

Created attachment 178398 [details]
Proposed patch
I still think that we should deny sending oversized datagrams on RAW sockets with the IP_HDRINCL option enabled. The rationale for this restriction is the user's desire to have their own hand-built IP header. If you want your own IP header, be ready to do your own IP fragmentation :-)
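To illustrate what "doing your own fragmentation" would mean for such an application, here is a rough sketch (a hypothetical helper, not part of any patch in this PR): each fragment carries its offset in 8-byte units in ip_off, and every fragment except the last sets IP_MF.

    /*
     * Illustrative only: userland IP fragmentation for an IP_HDRINCL
     * socket.  Hypothetical code; IP options and error handling are
     * glossed over, and the header checksum is left to the kernel.
     */
    #include <sys/socket.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <stdint.h>
    #include <string.h>

    static int
    send_fragmented(int fd, const struct ip *templ, const char *payload,
        size_t paylen, size_t mtu, const struct sockaddr_in *to)
    {
    	union { struct ip ip; char raw[65535]; } pkt;
    	size_t hlen = templ->ip_hl << 2;
    	size_t flen = (mtu - hlen) & ~(size_t)7; /* 8-byte multiples */

    	for (size_t off = 0; off < paylen; off += flen) {
    		size_t n = paylen - off < flen ? paylen - off : flen;
    		uint16_t foff = (uint16_t)(off >> 3); /* 8-byte units */

    		if (off + n < paylen)
    			foff |= IP_MF;            /* more fragments follow */
    		memcpy(pkt.raw, templ, hlen);
    		pkt.ip.ip_len = htons((uint16_t)(hlen + n));
    		pkt.ip.ip_off = htons(foff);      /* network order on 11+ */
    		memcpy(pkt.raw + hlen, payload + off, n);
    		if (sendto(fd, pkt.raw, hlen + n, 0,
    		    (const struct sockaddr *)to, sizeof(*to)) == -1)
    			return (-1);
    	}
    	return (0);
    }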
It looks like Linux behaves the same. When a RAW socket doesn't set the IP_HDRINCL option, the kernel will do IP fragmentation if needed.
Can you define "oversized packets"? Best regards, Michael

(In reply to Michael Tuexen from comment #29)
> Can you define "oversized packets"?

In this case it will be all packets that exceed the MTU of the outbound interface. Note: all packets that were crafted with the IP_HDRINCL option. If you don't use the IP_HDRINCL and IP_DONTFRAG options, the packet will be fragmented. There is a lot of code that expects the IP header's fields to be sane, and we can't guarantee this, because ip_hl and ip_id are the only ones we really check. I only checked the unicast path; the multicast path is probably also vulnerable.

(In reply to Andrey V. Elsukov from comment #30)
I was assuming that on the send path fragmentation is done by the kernel even if I have the IP_HDRINCL option enabled. Wasn't the original problem that the length given by the send() call is not in tune with the length given in the IP header (because in the header it wasn't in the expected byte order)?

(In reply to Michael Tuexen from comment #31)
> Wasn't the original problem that the length given by the send() call is not
> in tune with the length given in the IP header (because in the header it
> wasn't in the expected byte order)?

Yes, this is half of the problem. ip_off is also in the wrong byte order, and this is what generates hundreds of mbufs. The application wants to determine the MTU, and it starts trying from 32k with the DF flag set. But the kernel doesn't see this DF flag, so it does IP fragmentation. The application then sends even more data (interesting, such a big MTU!), which produces even more fragments, and again, and again. So we get a watchdog timeout and a NIC reset.

> I was assuming that on the send path fragmentation is done by the kernel
> even if I have the IP_HDRINCL option enabled.

Then I don't know how to fix this. I tried the unpatched ntraceroute on ixgbe(4). It doesn't complain, but it just hangs; only "ifconfig ix0 down/up" helped resurrect it.

(In reply to Andrey V. Elsukov from comment #32)
Let me write a small test program to test this on Linux and FreeBSD. I think we agree that an inconsistent length in the header and the provided buffer should be detected. However, if the user provides a large packet, this results in a lot of fragments provided to the driver. Isn't this the same when the user sends a large UDP packet? Best regards, Michael

(In reply to Michael Tuexen from comment #33)
> Let me write a small test program to test this on Linux and FreeBSD. I think
> we agree that an inconsistent length in the header and the provided buffer
> should be detected.

I think you can use something in tools/tools/netrate/.

> However, if the user provides a large packet, this results in a lot of
> fragments provided to the driver. Isn't this the same when the user sends a
> large UDP packet?

Yes, ntraceroute also uses UDP.
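The probing pattern described above amounts to something like the following loop, a hedged sketch of the behaviour rather than ntraceroute's actual code (send_probe() is a hypothetical helper that sends a datagram of the given size with DF set). The sizes visible in the ktrace above (32801, 49168, 57351, ...) are consistent with such a binary search:

    /*
     * Sketch of path-MTU probing: binary-search the datagram size,
     * expecting EMSGSIZE once a DF-marked probe exceeds the path MTU.
     * On a kernel that misreads DF because of the byte-order mix-up,
     * every probe "succeeds" (it is fragmented instead of rejected),
     * so the search keeps growing the size and floods the interface
     * with fragments.
     */
    size_t lo = 576, hi = 65535;

    while (lo + 1 < hi) {
    	size_t size = (lo + hi) / 2;

    	if (send_probe(fd, size) == 0)
    		lo = size;	/* probe fit: try a larger size */
    	else if (errno == EMSGSIZE)
    		hi = size;	/* DF rejection: try a smaller size */
    	else
    		break;		/* unrelated error */
    }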
Created attachment 178409 [details]
Simple test program

A simple test program running on FreeBSD, Linux, Mac OS X, and Solaris.
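The attachment itself is not reproduced here; a guess at the shape of such a program (argument order and names are illustrative) is: send a buf_len-byte buffer over a raw socket whose IP header claims hdr_len bytes.

    /* Sketch only, not the actual attachment 178409. */
    #include <sys/socket.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int
    main(int argc, char *argv[])
    {
    	static union { struct ip ip; char raw[65535]; } pkt;
    	struct sockaddr_in dst;
    	int fd, on = 1;

    	if (argc != 4) {
    		fprintf(stderr, "usage: %s addr buf_len hdr_len\n", argv[0]);
    		return (1);
    	}
    	size_t buf_len = strtoul(argv[2], NULL, 0);
    	size_t hdr_len = strtoul(argv[3], NULL, 0);

    	memset(&dst, 0, sizeof(dst));
    	dst.sin_family = AF_INET;
    	dst.sin_addr.s_addr = inet_addr(argv[1]);

    	if ((fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW)) == -1 ||
    	    setsockopt(fd, IPPROTO_IP, IP_HDRINCL, &on, sizeof(on)) == -1) {
    		perror("socket");
    		return (1);
    	}
    	pkt.ip.ip_v = IPVERSION;
    	pkt.ip.ip_hl = sizeof(struct ip) >> 2;
    	pkt.ip.ip_len = htons((uint16_t)hdr_len); /* may disagree with buf_len */
    	pkt.ip.ip_ttl = 64;
    	pkt.ip.ip_p = IPPROTO_UDP;
    	pkt.ip.ip_dst = dst.sin_addr;
    	if (sendto(fd, pkt.raw, buf_len, 0, (struct sockaddr *)&dst,
    	    sizeof(dst)) == -1)
    		perror("send");
    	return (0);
    }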
I did some testing with the provided test program. You can provide the buffer length (buf_len) and the length put in the IP header (hdr_len) as command line arguments. I tested this on FreeBSD head, Linux, Mac OS X, and Solaris. The results:

FreeBSD:
* Supports fragmentation and the sending of buffers up to net.inet.raw.maxdgram bytes.
* I also tested with net.inet.raw.maxdgram set to 65535 successfully (using an igb interface).
* The packet contains buf_len bytes; the IP header reports hdr_len bytes.
* EINVAL is returned if hdr_len > buf_len.

Linux:
* Does not support fragmentation; supports the sending of buffers only up to the MTU.
* Always uses buf_len (overwrites hdr_len).

Mac OS X:
* Same as FreeBSD.

Solaris:
* Supports fragmentation and the sending of buffers up to 65535 bytes.
* Always uses buf_len (overwrites hdr_len).

What am I missing? I'm happy to test more. If the problem comes from ip_off, I think the kernel should ignore it, since the kernel should do the fragmentation...

(In reply to Michael Tuexen from comment #36)
> What am I missing? I'm happy to test more. If the problem comes from ip_off,
> I think the kernel should ignore that, since the kernel should do the
> fragmentation...

So, 'ping -s 64000' also works as expected. Did you try to do sendto() in a loop?

Created attachment 178410 [details]
another test program
(In reply to Michael Tuexen from comment #36)
I modified your program to show how ntraceroute does it.

    # ./test www.google.com 1000 64000 25

This kills em(4) and iwm(4) on my notebook (via lagg).

It doesn't kill my igb card. Please note that running

    ./test www.google.com 1000 64000 25

sends packets to 255.255.255.255 (which I tested), but that is not intended. So could you see if

    ./test a.b.c.d 1000 64000 25

using an IPv4 address a.b.c.d also kills your interface? Assuming yes, what does

    ./test a.b.c.d 64000 64000 25

do?

(In reply to Michael Tuexen from comment #40)
> It doesn't kill my igb card. Please note that running
> ./test www.google.com 1000 64000 25
> send packets to 255.255.255.255 (which I tested), but it is not intended.
> So could you see if
> ./test a.b.c.d 1000 64000 25
> using an IPv4 address a.b.c.d also kills your interface?

OK, I did several tests on all the machines I have available for testing :)

1) Some old 11.0-CURRENT #0 r291555
   ale0: <Atheros AR8121/AR8113/AR8114 PCIe Ethernet>

    # ./test 10.9.8.11 1000 64000 1
      sends one packet (length 1000)
    # ./test 10.9.8.11 1000 64000 25
      sends 25 packets (length 1000)
    # ./test 10.9.8.11 64000 64000 1
      does IP fragmentation
    # ./test 10.9.8.11 64000 64000 25
      does IP fragmentation, then returns: send: No buffer space available

So it works without problems.

2) 12.0-CURRENT #16 r310781 (projects/ipsec)
   em0: <Intel(R) PRO/1000 Network Connection 7.6.1-k>
   re0: <D-Link DGE-528(T) Gigabit Ethernet Adapter>

    # ./test 10.9.8.2 1000 64000 1
      sends one packet (length 1000)

This leads to:

    em0: Watchdog timeout Queue[0]-- resetting
    Interface is RUNNING and ACTIVE
    em0: TX Queue 0 ------
    em0: hw tdh = 49, hw tdt = 97
    em0: Tx Queue Status = -2147483648
    em0: TX descriptors avail = 975
    em0: Tx Descriptors avail failure = 0
    em0: RX Queue 0 ------
    em0: hw rdh = 58, hw rdt = 57
    em0: RX discarded packets = 0
    em0: RX Next to Check = 58
    em0: RX Next to Refresh = 57
    em0: link state changed to DOWN

Now the same test on re(4):

    # ./test 10.9.8.2 1000 64000 1
      sends one packet (length 1000)

This leads to:

    re0: watchdog timeout

3) The same hardware, but our patched version of 9.3-STABLE: 9.3-STABLE #1 6cbfd06-dirty (AFAIR, we didn't change anything related to sockets).

    # ./test 10.9.8.2 1000 64000 1
      does IP fragmentation
    # ./test 10.9.8.2 1000 64000 25
      does IP fragmentation, then returns: send: No buffer space available
    # ./test 10.9.8.2 64000 64000 1

This leads to:

    re0: watchdog timeout

Now the same with em(4):

    # ./test 10.9.8.2 1000 64000 1

Hm, this sends only one packet. I tried again, and this leads to:

    em0: Watchdog timeout -- resetting
    em0: Queue(0) tdh = 115, hw tdt = 270
    em0: TX(0) desc avail = 867, Next TX to Clean = 113
    em0: link state changed to DOWN

But the fragments were sent.

4) The same hardware. 10.1-STABLE #1 r284010

    # ./test 10.9.8.2 1000 64000 1
    # ./test 10.9.8.2 1000 64000 1

    em0: Watchdog timeout -- resetting
    em0: Queue(0) tdh = 53, hw tdt = 107
    em0: TX(0) desc avail = 969, Next TX to Clean = 52
    em0: link state changed to DOWN

The same with re(4):

    # ./test 10.9.8.2 1000 64000 50
      send: No buffer space available
    # ./test 10.9.8.2 64000 64000 1
      re0: watchdog timeout

5) The same hardware. 11.0-CURRENT #34 (this is our version of 11, based on r285435).
    # ./test 10.9.8.2 1000 64000 1
      re0: watchdog timeout
    # ./test 10.9.8.2 64000 64000 50
      send: No buffer space available

The same on em(4):

    # ./test 10.9.8.2 1000 64000 1

    Interface is RUNNING and ACTIVE
    em0: TX Queue 0 ------
    em0: hw tdh = 40, hw tdt = 73
    em0: Tx Queue Status = -2147483648
    em0: TX descriptors avail = 990
    em0: Tx Descriptors avail failure = 0
    em0: RX Queue 0 ------
    em0: hw rdh = 39, hw rdt = 38
    em0: RX discarded packets = 0
    em0: RX Next to Check = 39
    em0: RX Next to Refresh = 38
    em0: link state changed to DOWN

    # ./test 10.9.8.2 64000 64000 50
      send: No buffer space available

6) 12.0-CURRENT #4 43d876e(projects/ipsec)-dirty
   ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.1.13-k>
   cxl0: t5nex0: <Chelsio T580-LP-SO-CR>
   mlx5_core0: <mlx5_core>: mce0: MT27710 Family [ConnectX-4 Lx]

ixgbe(4):

    # ./test 87.250.242.144 1000 64000 25
      sends packets with length 1000
    # ./test 87.250.242.144 64000 64000 1000
      send: No buffer space available
    # ./test 87.250.242.144 1000 64000 1000
      :)

This leads to ix0 TX hangs. "ifconfig ix0 down up" helps. Good to have IPMI :)

cxgbe(4):

    # ifconfig cxl0 inet 10.9.8.1/24
    # arp -s 10.9.8.2 00:11:22:33:44:55
    # ./test 10.9.8.2 1000 64000 10000
      send: No buffer space available

I got ENOBUFS, but it seems cxgbe's TX also hangs.

    # ifconfig cxl0 down up
    ^T
    load: 0.00  cmd: ifconfig 11052 [qflush] 9.14r 0.00u 0.02s 0% 1964k

OK. What if we try to kldunload if_cxgbe? :)

    # kldunload if_cxgbe
    ^T
    load: 0.00  cmd: kldunload 11054 [qflush] 145.02r 0.00u 0.45s 0% 1500k
    begin_synchronized_op with the following non-sleepable locks held:
    exclusive sleep mutex bpf global lock (bpf global lock) r = 0 (0xffffffff81dcc378) locked @ /usr/src/sys/net/bpf.c:777
    stack backtrace:
    #0 0xffffffff80aa6d10 at witness_debugger+0x70
    #1 0xffffffff80aa7fd7 at witness_warn+0x397
    #2 0xffffffff8288d3fd at begin_synchronized_op+0x3d
    #3 0xffffffff828972b0 at cxgbe_ioctl+0x980
    #4 0xffffffff80b35052 at if_setflag+0xf2
    #5 0xffffffff80b34efc at ifpromisc+0x2c
    #6 0xffffffff80b2b098 at bpf_detachd_locked+0x1e8
    #7 0xffffffff80b2da5b at bpf_dtor+0x8b
    #8 0xffffffff8091fe7b at devfs_destroy_cdevpriv+0x8b
    #9 0xffffffff809234c5 at devfs_close_f+0x65
    #10 0xffffffff809f683a at _fdrop+0x1a
    #11 0xffffffff809f9950 at closef+0x1f0
    #12 0xffffffff809f6de3 at closefp+0xa3
    #13 0xffffffff80eadab9 at amd64_syscall+0x2f9
    #14 0xffffffff80e8d6ab at Xfast_syscall+0xfb

After reboot:

    # ./test 87.250.242.144 64000 64000 1000
      send: No buffer space available

It doesn't hang, but it sends nothing.

mce0:

    # ifconfig mce0 inet 10.9.8.1/24
    # arp -s 10.9.8.2 00:11:22:33:44:55
    # ./test 10.9.8.2 1000 64000 10000

10000 packets with length 1000 were sent. No errors. TX still works.

    # ./test 10.9.8.2 64000 64000 1000
      send: No buffer space available

11290 packets captured. TX still works.

I think that's enough for today. Happy new year! :)

Created attachment 178417 [details]
Proposed patch to enforce consistent lengths
Thanks for testing. I think it does not make much sense to use the raw IP interface to send IP packets with inconsistent lengths. The two lengths we have are the length of the buffer provided in the send call and the length field in the provided IP header. Solaris and Linux enforce consistency by using the buffer length to overwrite the length in the IP header. I'm not sure I like this auto-fixing... Therefore I'm proposing a patch which checks the lengths for consistency and returns EINVAL if they are not consistent. This might not fix all issues we are seeing, but it should bring the raw IP interface with the IP_HDRINCL option on par with the raw IP interface without the option and with UDP or UDP-Lite sockets. And it reduces the testing a lot... I would be interested in your opinion on the patch and in results on your interfaces when using consistent lengths. If I read your test results correctly, this would indicate that only the re driver has an issue... Happy New Year!

I put the consistency patch up on Phabricator: https://reviews.freebsd.org/D9161

A commit references this bug:

    Author: tuexen
    Date: Fri Jan 13 10:55:27 UTC 2017
    New revision: 312063
    URL: https://svnweb.freebsd.org/changeset/base/312063

    Log:
      Ensure that the buffer length and the length provided in the IPv4 header
      match when using a raw socket to send IPv4 packets and providing the
      header. If they don't match, let send return -1 and set errno to EINVAL.
      Before this patch it was only enforced that the length in the header is
      not larger than the buffer length.

      PR:           212283
      Reviewed by:  ae, gnn
      MFC after:    1 week
      Differential Revision: https://reviews.freebsd.org/D9161

    Changes:
      head/sys/netinet/raw_ip.c
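The committed check boils down to something like the following in rip_output(); this is a paraphrased sketch of r312063, not the verbatim diff (see head/sys/netinet/raw_ip.c for the authoritative code):

    /*
     * With IP_HDRINCL, ip_len arrives in network byte order on
     * FreeBSD 11+ and must now match the amount of data actually
     * passed to send().
     */
    struct ip *ip = mtod(m, struct ip *);

    if (m->m_pkthdr.len != ntohs(ip->ip_len)) {
    	m_freem(m);
    	return (EINVAL);
    }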
Created attachment 178855 [details]
Simple UDP based test program

Since we now enforce consistent lengths for raw IP packets, I think raw sockets should behave the same as UDP sockets. I attached a test program for this.
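A rough sketch of what such a UDP load test looks like (the real attachment's interface is not reproduced here; argument names and the port are illustrative):

    /* Sketch only, not the actual attachment 178855: send `count`
     * UDP datagrams of `len` bytes and let the kernel fragment them. */
    #include <sys/socket.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int
    main(int argc, char *argv[])
    {
    	static char buf[65507];		/* max UDP payload over IPv4 */
    	struct sockaddr_in dst;

    	if (argc != 4) {
    		fprintf(stderr, "usage: %s addr len count\n", argv[0]);
    		return (1);
    	}
    	size_t len = strtoul(argv[2], NULL, 0);
    	long count = strtol(argv[3], NULL, 0);

    	memset(&dst, 0, sizeof(dst));
    	dst.sin_family = AF_INET;
    	dst.sin_port = htons(33434);	/* traceroute-style port */
    	dst.sin_addr.s_addr = inet_addr(argv[1]);

    	int fd = socket(AF_INET, SOCK_DGRAM, 0);
    	if (fd == -1) {
    		perror("socket");
    		return (1);
    	}
    	for (long i = 0; i < count; i++)
    		if (sendto(fd, buf, len, 0, (struct sockaddr *)&dst,
    		    sizeof(dst)) == -1)
    			perror("send");
    	return (0);
    }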
If UDP sockets and raw IP sockets behave the same now (and they do for the interfaces I can test with) and some ethernet cards have problems with such a load, I think this is a driver issue and should be addressed at that level. But testing is needed first.
So could you retest using raw IP sockets and UDP sockets using r312063 or higher? Or just use consistent lengths...
(In reply to Michael Tuexen from comment #46)
I'm not sure I fully understand this PR and the analysis of the NIC test results (it's too long). Generally, drivers blindly trust the mbuf chains (virtual addresses and lengths) passed down from the upper stack and set up DMA descriptors after converting virtual addresses to physical addresses. The DMA engine of the NIC does the TX work based on those DMA descriptors and generates a TX completion interrupt after transmission. Some NIC firmware also parses IP/TCP/UDP headers to compute checksums, do TSO/UFO, etc. If some fields of the protocol header or the mbuf chain addresses/lengths are not consistent with the total length of the packet, the DMA engine or NIC firmware will access invalid physical addresses. If the NIC firmware parser detects an inconsistency in the protocol header before doing the actual DMA work, it will abort the TX operation, which leads to watchdog timeouts. If the DMA engine accesses invalid physical addresses, it will completely lock up the NIC or crash the system. I think the various NIC tests were done with broken mbuf chains, so this is the expected result. Michael's test program should work on all network drivers since it generates the same type of load as NFS over UDP. re(4)/ale(4) were extensively tested with NFS over UDP in the past. em(4) was also tested with NFS over UDP when I was the em(4) maintainer. I don't have a handy CURRENT box to test re(4) at this moment, but I can do the re(4) test again if you really need it. It may take several weeks though.

(In reply to Pyun YongHyeon from comment #47)
I pretty much agree. We are now forcing the mbuf chain to be consistent and I don't see any problems with the em or igb drivers anymore. If the test programs still result in problems on particular NICs, it is a problem of the driver of the NIC, not a problem of the IP stack. From my point of view we can close this PR and open NIC-specific ones if particular NICs have problems with a high load of fragmented IP packets.

A commit references this bug:

    Author: tuexen
    Date: Thu Jun 1 08:04:10 UTC 2017
    New revision: 319392
    URL: https://svnweb.freebsd.org/changeset/base/319392

    Log:
      MFC r312063:
      Ensure that the buffer length and the length provided in the IPv4 header
      match when using a raw socket to send IPv4 packets and providing the
      header. If they don't match, let send return -1 and set errno to EINVAL.
      Before this patch it was only enforced that the length in the header is
      not larger than the buffer length.

      PR:           212283
      Reviewed by:  ae, gnn
      Differential Revision: https://reviews.freebsd.org/D9161

    Changes:
      _U stable/11/
      stable/11/sys/netinet/raw_ip.c

Michael, can this PR be closed or is there work pending? Thanks.

(In reply to Oleksandr Tymoshenko from comment #50)
I think whatever needs to be fixed in the IP stack is fixed. If there are still driver-specific problems, they need to be addressed in a different way. Since there were no problem reports for two years, I'm closing it now as fixed.
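For reference, the driver-side pattern Pyun describes above, whose failure mode opened this PR (iwn(4)'s "could not defrag mbuf"), generally looks like the following. This is a generic sketch, not taken from any particular driver; mydev, MYDEV_MAX_SEGS, and struct mydev_txring are hypothetical names:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>
    #include <sys/bus.h>
    #include <machine/bus.h>

    #define	MYDEV_MAX_SEGS	32	/* hypothetical per-driver segment limit */

    struct mydev_txring {		/* hypothetical, pared down */
    	bus_dma_tag_t	tag;
    	bus_dmamap_t	map;
    };

    static int
    mydev_encap(struct mydev_txring *txr, struct mbuf **m_head)
    {
    	bus_dma_segment_t segs[MYDEV_MAX_SEGS];
    	int error, nsegs;

    	/* Try to map the mbuf chain for DMA as-is. */
    	error = bus_dmamap_load_mbuf_sg(txr->tag, txr->map, *m_head,
    	    segs, &nsegs, BUS_DMA_NOWAIT);
    	if (error == EFBIG) {
    		/* Too many segments: copy into fewer, larger clusters. */
    		struct mbuf *m = m_collapse(*m_head, M_NOWAIT,
    		    MYDEV_MAX_SEGS);
    		if (m == NULL) {
    			/* The "could not defrag mbuf" case. */
    			m_freem(*m_head);
    			*m_head = NULL;
    			return (ENOBUFS);
    		}
    		*m_head = m;
    		error = bus_dmamap_load_mbuf_sg(txr->tag, txr->map, *m_head,
    		    segs, &nsegs, BUS_DMA_NOWAIT);
    	}
    	if (error != 0)
    		return (error);
    	/* ... fill nsegs DMA descriptors from segs[] and kick the NIC ... */
    	return (0);
    }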