Tested on 12.1-RELEASE, 12.1-RELEASE-p1, and with the TSO patch from https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236999 applied.

All VMs are running on the same host with VMware ESXi 6.7.0, build 15160138:
VM 1 - FreeBSD 12.1
VM 2 - any OS (tested with several; which OS the second VM runs is irrelevant)

After some network load, packets in the interface queue begin to get stuck. Reproduced in different environments with different kinds of network load. The easiest way to reproduce it is to copy a large file from the second VM to the FreeBSD VM using Samba or FTP.

Some output examples (172.31.255.2 - FreeBSD, 172.31.255.3 - second VM).

Ping:
8008 bytes from 172.31.255.3: icmp_seq=321 ttl=128 time=0.375 ms
8008 bytes from 172.31.255.3: icmp_seq=322 ttl=128 time=0.229 ms
8008 bytes from 172.31.255.3: icmp_seq=323 ttl=128 time=0.179 ms
8008 bytes from 172.31.255.3: icmp_seq=324 ttl=128 time=0.203 ms
8008 bytes from 172.31.255.3: icmp_seq=325 ttl=128 time=0.278 ms
8008 bytes from 172.31.255.3: icmp_seq=326 ttl=128 time=0.178 ms
8008 bytes from 172.31.255.3: icmp_seq=327 ttl=128 time=2.204 ms
8008 bytes from 172.31.255.3: icmp_seq=328 ttl=128 time=0.226 ms
8008 bytes from 172.31.255.3: icmp_seq=329 ttl=128 time=0.221 ms
8008 bytes from 172.31.255.3: icmp_seq=330 ttl=128 time=0.466 ms
8008 bytes from 172.31.255.3: icmp_seq=331 ttl=128 time=0.253 ms
8008 bytes from 172.31.255.3: icmp_seq=350 ttl=128 time=1020.494 ms  <-- start of file copy
8008 bytes from 172.31.255.3: icmp_seq=362 ttl=128 time=1037.839 ms
8008 bytes from 172.31.255.3: icmp_seq=381 ttl=128 time=4162.085 ms
8008 bytes from 172.31.255.3: icmp_seq=382 ttl=128 time=5202.940 ms
8008 bytes from 172.31.255.3: icmp_seq=383 ttl=128 time=6200.121 ms
8008 bytes from 172.31.255.3: icmp_seq=385 ttl=128 time=7261.939 ms
8008 bytes from 172.31.255.3: icmp_seq=386 ttl=128 time=7200.745 ms
8008 bytes from 172.31.255.3: icmp_seq=390 ttl=128 time=4100.259 ms

tcpdump after the problem appears:
# tcpdump -i vmx1 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmx1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:09:41.700833 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 23149, seq 0, length 8008
16:09:42.727307 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 23149, seq 1, length 8008
16:09:43.761866 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 23149, seq 2, length 8008
16:09:43.762117 IP 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 23149, seq 0, length 8008
16:09:44.771901 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 23149, seq 3, length 8008
16:09:44.978178 IP truncated-ip - 7810 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 23149, seq 1, length 8008
16:09:45.788996 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 23149, seq 4, length 8008
16:09:46.039983 IP truncated-ip - 7928 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 23149, seq 2, length 8008
16:09:46.290583 IP truncated-ip - 7928 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 23149, seq 3, length 8008
16:09:46.801911 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 23149, seq 5, length 8008
16:09:47.861870 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 23149, seq 6, length 8008
16:09:48.071886 IP truncated-ip - 7822 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 23149, seq 4, length 8008
16:09:48.871905 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 23149, seq 7, length 8008
16:09:49.802033 IP truncated-ip - 7928 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 23149, seq 5, length 8008
16:09:49.891036 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 23149, seq 8, length 8008
16:09:50.802001 IP truncated-ip - 7928 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 23149, seq 6, length 8008
16:09:50.901923 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 23149, seq 9, length 8008
16:09:51.961848 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 23149, seq 10, length 8008

It looks like newly arrived packets push old packets out of the queue. After some time, the gap between sending a packet and receiving the reply can exceed 40 seconds. tcpdump running at the same time on the second VM shows the response being sent immediately after the request is received. Disabling/enabling any driver features (TSO, checksum offload, etc.) or changing the MTU makes no difference to the problem.
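For anyone trying to reproduce this, the observations above boil down to roughly the following recipe (a sketch; the addresses and the interface name are the ones from this report, and any bulk transfer can serve as the load generator):

```shell
# On the FreeBSD guest (172.31.255.2): watch ICMP latency on the vmx
# interface while the load is running.
ping -s 8000 172.31.255.3 &
tcpdump -i vmx1 icmp

# On the second VM (172.31.255.3): generate sustained receive load on the
# FreeBSD guest, e.g. push a large file over Samba or FTP.
```

Once the problem triggers, the ping round-trip times jump from sub-millisecond to multiple seconds and tcpdump starts reporting truncated-ip replies, as shown above.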
I am seeing a somewhat similar problem. When scp-ing a large file from a hardware host to a VM with a vmxnet3 interface, I see that the copy starts very fast, but then it stalls for many seconds, and then suddenly it becomes fast again and the copy completes. If the file is very large, the stall-and-recovery cycle can happen a few times before the copy is complete.
Have you tried running an image with this patch in place: https://svnweb.freebsd.org/base?view=revision&revision=356703 ?

If you do not have that patch, you should check whether you can reproduce your issue without TSO enabled, specifically without a TSO packet ever being attempted (for example, by ensuring TSO is disabled for that interface in /etc/rc.conf). Without the above patch in place and with TSO enabled, it may be possible (that is, it has not been ruled out) for the state of the virtual device to become corrupted by the sending of TSO packets, so disabling TSO on the interface later might not make a difference.
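Disabling TSO persistently in /etc/rc.conf might look like the following. This is only a sketch: the interface name, address, and MTU match this report, but the exact option list for a given setup may differ.

```shell
# /etc/rc.conf (excerpt) - hypothetical example for the vmx1 interface.
# -tso4/-tso6 disable TCP segmentation offload for IPv4/IPv6, and
# -vlanhwtso disables hardware TSO on VLANs, so that no TSO packet
# is ever attempted on this interface from boot onward.
ifconfig_vmx1="inet 172.31.255.2 netmask 255.255.255.0 mtu 9000 -tso4 -tso6 -vlanhwtso"
```

Setting this in rc.conf (rather than with a live ifconfig command) matters here because, per the comment above, device state corrupted by an earlier TSO packet might not be repaired by disabling TSO later.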
(In reply to Patrick Kelsey from comment #2) Ah, sorry, just saw that you stated you are running with that patch in place.
I did some more tests; hope this provides some more information.

The first test is with a kernel recompiled with the TSO patch. The network load was a file being copied to the server over Samba.

# ifconfig vmx1
vmx1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=e403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:50:56:be:f0:13
        inet 172.31.255.2 netmask 0xffffff00 broadcast 172.31.255.255
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
# tcpdump -i vmx1 icmp &
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmx1, link-type EN10MB (Ethernet), capture size 262144 bytes
# ping -s 8000 172.31.255.3
PING 172.31.255.3 (172.31.255.3): 8000 data bytes
11:59:07.108253 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 0, length 8008
11:59:07.108425 IP 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 0, length 8008
8008 bytes from 172.31.255.3: icmp_seq=0 ttl=128 time=0.226 ms
11:59:08.126583 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 1, length 8008
11:59:08.126754 IP 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 1, length 8008
8008 bytes from 172.31.255.3: icmp_seq=1 ttl=128 time=0.213 ms
--- skipped some lines ---
12:00:20.401492 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 71, length 8008
8008 bytes from 172.31.255.3: icmp_seq=71 ttl=128 time=0.550 ms
12:00:20.402010 IP 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 71, length 8008
12:00:21.408758 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 72, length 8008
8008 bytes from 172.31.255.3: icmp_seq=72 ttl=128 time=2.303 ms
12:00:21.410995 IP 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 72, length 8008
12:00:24.527165 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 73, length 8008
8008 bytes from 172.31.255.3: icmp_seq=73 ttl=128 time=133.291 ms
12:00:24.592341 IP 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 73, length 8008
12:00:25.569300 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 74, length 8008
12:00:25.662953 IP 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 74, length 8008
--- after seq 73, replies are received by the kernel and seen with tcpdump, but not delivered to the ping process ---
--- skipped some lines ---
12:01:27.114142 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 134, length 8008
12:01:27.160943 IP 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 134, length 8008
12:01:28.125972 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 135, length 8008
12:01:28.126346 IP truncated-ip - 7982 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 135, length 8008
--- malformed L2 frames received from seq 135 onward ---
12:01:29.198552 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 136, length 8008
12:01:29.223302 IP truncated-ip - 7810 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 136, length 8008
12:01:30.214849 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 137, length 8008
12:01:30.221687 IP truncated-ip - 7822 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 137, length 8008
12:01:31.246460 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 138, length 8008
--- skipped some lines ---
12:01:37.514942 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 144, length 8008
12:01:37.517865 IP truncated-ip - 7808 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 144, length 8008
12:01:38.579626 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 145, length 8008
12:01:38.615120 IP truncated-ip - 7928 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 145, length 8008
12:01:39.603253 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 146, length 8008
12:01:40.614996 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 147, length 8008
12:01:40.615183 IP 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 146, length 8008
--- about 1 second between sending request 146 and receiving its reply ---
12:01:40.615201 IP truncated-ip - 7928 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 147, length 8008
8008 bytes from 172.31.255.3: icmp_seq=146 ttl=128 time=1011.985 ms
12:01:41.657600 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 148, length 8008
12:01:42.701072 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 149, length 8008
12:01:42.701321 IP 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 148, length 8008
8008 bytes from 172.31.255.3: icmp_seq=148 ttl=128 time=1043.763 ms
12:01:43.615120 IP truncated-ip - 7928 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 149, length 8008
12:01:43.714982 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 150, length 8008
12:01:43.988367 IP truncated-ip - 7808 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 150, length 8008
12:01:44.787457 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 151, length 8008
12:01:44.788966 IP truncated-ip - 7782 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 151, length 8008
12:01:45.815011 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 152, length 8008
12:01:45.970727 IP truncated-ip - 7976 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 152, length 8008
12:01:46.834089 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 153, length 8008
12:01:47.615212 IP truncated-ip - 7928 bytes missing! 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 153, length 8008
12:01:47.897600 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 154, length 8008
12:01:48.914981 IP 172.31.255.2 > 172.31.255.3: ICMP echo request, id 30548, seq 155, length 8008
12:01:48.915192 IP 172.31.255.3 > 172.31.255.2: ICMP echo reply, id 30548, seq 154, length 8008
8008 bytes from 172.31.255.3: icmp_seq=154 ttl=128 time=1017.638 ms
--- some packets received undamaged, but delayed by about 1 second ---

When I tried using iperf as the network load source, in most cases the result was a kernel panic:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x0
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff80cef252
stack pointer           = 0x28:0xfffffe00753547c0
frame pointer           = 0x28:0xfffffe00753548a0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (if_io_tqg_1)
trap number             = 12
panic: page fault
cpuid = 1
time = 1579255990
KDB: stack backtrace:
#0 0xffffffff80c1d297 at kdb_backtrace+0x67
#1 0xffffffff80bd05cd at vpanic+0x19d
#2 0xffffffff80bd0423 at panic+0x43
#3 0xffffffff810a7d2c at trap_fatal+0x39c
#4 0xffffffff810a7d79 at trap_pfault+0x49
#5 0xffffffff810a736f at trap+0x29f
#6 0xffffffff81081a0c at calltrap+0x8
#7 0xffffffff80ce9be5 at _task_fn_rx+0x75
#8 0xffffffff80c1bb54 at gtaskqueue_run_locked+0x144
#9 0xffffffff80c1b7b8 at gtaskqueue_thread_loop+0x98
#10 0xffffffff80b90c23 at fork_exit+0x83
#11 0xffffffff81082a4e at fork_trampoline+0xe
Uptime: 16m42s

Then I reverted to the default kernel, disabled TSO, and rebooted:

# uname -a
FreeBSD ******************* 12.1-RELEASE-p1 FreeBSD 12.1-RELEASE-p1 GENERIC amd64
# ifconfig vmx1
vmx1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=e400bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:50:56:be:f0:13
        inet 172.31.255.2 netmask 0xffffff00 broadcast 172.31.255.255
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
# iperf3 -c 172.31.255.2 -p 1234
Connecting to host 172.31.255.2, port 1234
[  5] local 172.31.255.5 port 32466 connected to 172.31.255.2 port 1234
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.03   sec   497 MBytes  4.05 Gbits/sec   11   8.74 KBytes
[  5]   1.03-2.07   sec  0.00 Bytes   0.00 bits/sec     3   8.74 KBytes
[  5]   2.07-3.06   sec  0.00 Bytes   0.00 bits/sec     1   8.74 KBytes
[  5]   3.06-4.02   sec  0.00 Bytes   0.00 bits/sec     1   8.74 KBytes
[  5]   4.02-5.01   sec  0.00 Bytes   0.00 bits/sec     0   8.74 KBytes
[  5]   5.01-6.03   sec  0.00 Bytes   0.00 bits/sec     1   8.74 KBytes
[  5]   6.03-7.06   sec  0.00 Bytes   0.00 bits/sec     0   8.74 KBytes
[  5]   7.06-8.04   sec  0.00 Bytes   0.00 bits/sec     0   8.74 KBytes
[  5]   8.04-9.07   sec  0.00 Bytes   0.00 bits/sec     0   8.74 KBytes
[  5]   9.07-10.01  sec  0.00 Bytes   0.00 bits/sec     1   8.74 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.01  sec   497 MBytes   416 Mbits/sec   18          sender
[  5]   0.00-10.60  sec   496 MBytes   393 Mbits/sec              receiver
# ping -s 8000 172.31.255.5
PING 172.31.255.5 (172.31.255.5): 8000 data bytes
8008 bytes from 172.31.255.5: icmp_seq=0 ttl=64 time=0.322 ms
12:22:09.903151 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 0, length 8008
12:22:09.903253 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 0, length 8008
12:22:10.922205 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 1, length 8008
12:22:10.922300 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 1, length 8008
8008 bytes from 172.31.255.5: icmp_seq=1 ttl=64 time=0.147 ms
12:22:11.969930 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 2, length 8008
12:22:11.970035 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 2, length 8008
8008 bytes from 172.31.255.5: icmp_seq=2 ttl=64 time=0.159 ms
12:22:12.997254 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 3, length 8008
12:22:12.997386 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 3, length 8008
8008 bytes from 172.31.255.5: icmp_seq=3 ttl=64 time=0.175 ms
12:22:14.029823 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 4, length 8008
12:22:14.030017 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 4, length 8008
8008 bytes from 172.31.255.5: icmp_seq=4 ttl=64 time=0.237 ms
12:22:15.058570 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 5, length 8008
12:22:15.058769 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 5, length 8008
8008 bytes from 172.31.255.5: icmp_seq=5 ttl=64 time=0.241 ms
12:22:16.096803 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 6, length 8008
12:22:16.096896 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 6, length 8008
8008 bytes from 172.31.255.5: icmp_seq=6 ttl=64 time=0.139 ms
12:22:17.136966 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 7, length 8008
12:22:17.137224 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 7, length 8008
12:22:18.164014 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 8, length 8008
12:22:18.164194 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 8, length 8008
--- replies stop being delivered to the ping process ---
--- skipped some lines ---
--- but after some time replies are delivered to the ping process again ---
12:26:15.636917 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 238, length 8008
12:26:15.637147 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 238, length 8008
12:26:16.696907 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 239, length 8008
12:26:16.697100 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 239, length 8008
8008 bytes from 172.31.255.5: icmp_seq=239 ttl=64 time=0.256 ms
12:26:17.756044 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 240, length 8008
12:26:17.756178 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 240, length 8008
8008 bytes from 172.31.255.5: icmp_seq=240 ttl=64 time=0.190 ms
12:26:18.796861 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 241, length 8008
12:26:18.796982 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 241, length 8008
8008 bytes from 172.31.255.5: icmp_seq=241 ttl=64 time=0.176 ms
12:26:19.836847 IP 172.31.255.2 > 172.31.255.5: ICMP echo request, id 10122, seq 242, length 8008
12:26:19.836981 IP 172.31.255.5 > 172.31.255.2: ICMP echo reply, id 10122, seq 242, length 8008
8008 bytes from 172.31.255.5: icmp_seq=242 ttl=64 time=0.192 ms
^Triage: Correct driver (component) name
I probably should open a new bug for my side of things, as my problem may be different from the original problem reported here. But for the moment, here is a new bit of information.

I tried to gather a bit more information with tcpdump and got a very surprising result. If I run tcpdump on the vmx side, the copy speed never degrades. If I start copying before starting tcpdump, the speed can degrade, but as soon as I start tcpdump (on the vmx side) the speed immediately recovers. I have tested this many times with the same result. It seems that enabling bpf changes the packet processing flow and works around whatever the underlying problem is. Does anyone have an idea what that change could be and where to dig further? Thanks!
When I run tcpdump on the other end, there is no effect on the speed. There are some potentially interesting snippets in the capture. Here is one, where 10.25.200.199 is the vmx side and 10.180.0.51 is the other host that sends data to the vmx host:

115 02:16:28.787281 IP 10.180.0.51.17739 > 10.25.200.199.22: Flags [.], seq 81774:83160, ack 1, win 1026, options [nop,nop,TS val 3595053000 ecr 1184721692], length 1386
116 02:16:28.787284 IP 10.180.0.51.17739 > 10.25.200.199.22: Flags [.], seq 83160:84546, ack 1, win 1026, options [nop,nop,TS val 3595053000 ecr 1184721692], length 1386
117 02:16:28.787304 IP 10.180.0.51.17739 > 10.25.200.199.22: Flags [.], seq 84546:85932, ack 1, win 1026, options [nop,nop,TS val 3595053000 ecr 1184721692], length 1386
118 02:16:28.787307 IP 10.180.0.51.17739 > 10.25.200.199.22: Flags [.], seq 85932:87318, ack 1, win 1026, options [nop,nop,TS val 3595053000 ecr 1184721692], length 1386
119 02:16:28.883545 IP 10.25.200.199.22 > 10.180.0.51.17739: Flags [.], ack 83160, win 2192, options [nop,nop,TS val 1184721792 ecr 3595053000], length 0
120 02:16:28.883576 IP 10.180.0.51.17739 > 10.25.200.199.22: Flags [.], seq 87318:88704, ack 1, win 1026, options [nop,nop,TS val 3595053097 ecr 1184721792], length 1386
121 02:16:28.887035 IP 10.25.200.199.22 > 10.180.0.51.17739: Flags [.], ack 83160, win 2192, options [nop,nop,TS val 1184721792 ecr 3595053000,nop,nop,sack 1 {87318:88704}], length 0
122 02:16:29.336381 IP 10.180.0.51.17739 > 10.25.200.199.22: Flags [.], seq 83160:84546, ack 1, win 1026, options [nop,nop,TS val 3595053549 ecr 1184721792], length 1386

My reading of the above:
Packets 115-118: data packets.
Packet 119: ack for packet 115.
Packet 120: another data packet.
Packet 121: appears to be an exact duplicate of packet 119 (the ack).
Packet 122: looks like a retransmission of packet 116 after a noticeable delay (0.44s).

Well, on another look, packets 119 and 121 are not exactly the same: 121 has "sack 1 {87318:88704}", which 119 does not have. And it seems like packet 121 is what "confuses" the sending side, given the pause and retransmission that follow. I am not a TCP expert, so I don't know whether anything is wrong here.
(In reply to Andriy Gapon from comment #7)
It seems like packets 116 (seq 84546) and 117 (seq 85932) were never ack-ed and were thus lost (presumably on the vmx side)? Packet 115 was ack-ed; packets 118 and 120 were sack-ed.
A few additional data points. Running tcpdump with -p (no promiscuous mode) does not prevent the speed degradation. tcpdump shows that many incoming TCP packets are received out of order. Disabling LRO works around the problem.
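For reference, the LRO workaround can be applied at runtime and persisted across reboots roughly as follows. This is a sketch: the interface name vmx0 is an assumption (the comments above only say "the vmx side"), and the address in the rc.conf line is a placeholder.

```shell
# Disable large receive offload on the running interface (assumed vmx0).
ifconfig vmx0 -lro

# To persist the workaround, add -lro to the interface line in /etc/rc.conf,
# e.g. (address is a placeholder for your own configuration):
#   ifconfig_vmx0="inet 10.25.200.199 netmask 255.255.255.0 -lro"
```

Note that this only works around the receive-side symptom described here; it does not address the underlying driver bug.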
Disabling all possible options except jumbo frames does not work around the problem. I edited rc.conf and rebooted. After receiving approximately 300-400 MBytes, the network becomes unavailable.

# ifconfig vmx1
vmx1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=800020<JUMBO_MTU>
        ether 00:50:56:be:f0:13
        inet 172.31.255.2 netmask 0xffffff00 broadcast 172.31.255.255
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
# iperf3 -s -p 1234
-----------------------------------------------------------
Server listening on 1234
-----------------------------------------------------------
Accepted connection from 172.31.255.5, port 52124
[  5] local 172.31.255.2 port 1234 connected to 172.31.255.5 port 51022
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.02   sec   320 MBytes  2.64 Gbits/sec
[  5]   1.02-2.02   sec  0.00 Bytes   0.00 bits/sec
[  5]   2.02-3.02   sec  0.00 Bytes   0.00 bits/sec
[  5]   3.02-4.01   sec  0.00 Bytes   0.00 bits/sec
[  5]   4.01-5.02   sec  0.00 Bytes   0.00 bits/sec
[  5]   5.02-6.02   sec  0.00 Bytes   0.00 bits/sec
[  5]   6.02-7.01   sec  0.00 Bytes   0.00 bits/sec
[  5]   7.01-8.02   sec  0.00 Bytes   0.00 bits/sec
[  5]   8.02-9.01   sec  0.00 Bytes   0.00 bits/sec
[  5]   9.01-10.01  sec  0.00 Bytes   0.00 bits/sec
[  5]  10.01-10.19  sec  0.00 Bytes   0.00 bits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.19  sec   320 MBytes   263 Mbits/sec          receiver
-----------------------------------------------------------
Server listening on 1234
-----------------------------------------------------------
Additionally: ifconfig down & up temporarily fixes the problem, but after again receiving approximately 300-500 MBytes in a short period, the network becomes unavailable again, or the kernel panics (about 50/50).

# ifconfig vmx1 down && ifconfig vmx1 up && iperf3 -s -p 1234
-----------------------------------------------------------
Server listening on 1234
-----------------------------------------------------------
Accepted connection from 172.31.255.5, port 39861
[  5] local 172.31.255.2 port 1234 connected to 172.31.255.5 port 39030

--- kernel panic ---
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x0
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff80cef252
stack pointer           = 0x28:0xfffffe00753547c0
frame pointer           = 0x28:0xfffffe00753548a0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (if_io_tqg_1)
trap number             = 12
panic: page fault
cpuid = 1
time = 1580287973
KDB: stack backtrace:
#0 0xffffffff80c1d297 at kdb_backtrace+0x67
#1 0xffffffff80bd05cd at vpanic+0x19d
#2 0xffffffff80bd0423 at panic+0x43
#3 0xffffffff810a7dcc at trap_fatal+0x39c
#4 0xffffffff810a7e19 at trap_pfault+0x49
#5 0xffffffff810a740f at trap+0x29f
#6 0xffffffff81081a0c at calltrap+0x8
#7 0xffffffff80ce9be5 at _task_fn_rx+0x75
#8 0xffffffff80c1bb54 at gtaskqueue_run_locked+0x144
#9 0xffffffff80c1b7b8 at gtaskqueue_thread_loop+0x98
#10 0xffffffff80b90c23 at fork_exit+0x83
#11 0xffffffff81082a4e at fork_trampoline+0xe
Uptime: 11m47s

And from the sending side:

root@bsd_template:/usr/home/ao # iperf3 -c 172.31.255.2 -p 1234
Connecting to host 172.31.255.2, port 1234
[  5] local 172.31.255.5 port 39030 connected to 172.31.255.2 port 1234
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.01   sec   614 MBytes  5.07 Gbits/sec    1   8.74 KBytes
[  5]   1.01-2.03   sec  0.00 Bytes   0.00 bits/sec     3   8.74 KBytes
[  5]   2.03-3.02   sec  0.00 Bytes   0.00 bits/sec     1   8.74 KBytes
[  5]   3.02-4.03   sec  0.00 Bytes   0.00 bits/sec     1   8.74 KBytes
[  5]   4.03-5.03   sec  0.00 Bytes   0.00 bits/sec     0   8.74 KBytes
[  5]   5.03-6.02   sec  0.00 Bytes   0.00 bits/sec     1   8.74 KBytes
[  5]   6.02-7.03   sec  0.00 Bytes   0.00 bits/sec     0   8.74 KBytes
[  5]   7.03-8.03   sec  0.00 Bytes   0.00 bits/sec     0   8.74 KBytes
[  5]   8.03-9.03   sec  0.00 Bytes   0.00 bits/sec     0   8.74 KBytes
A commit references this bug:

Author: pkelsey
Date: Sat Mar 14 19:43:44 UTC 2020
New revision: 358995
URL: https://svnweb.freebsd.org/changeset/base/358995

Log:
  Fix iflib freelist state corruption

  This fixes a bug in iflib freelist management that breaks the required
  correspondence between freelist indexes and driver ring slots.

  PR:           243126, 243392, 240628
  Reported by:  avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer
  Reviewed by:  avg, gallatin
  MFC after:    1 week
  Differential Revision:  https://reviews.freebsd.org/D23943

Changes:
  head/sys/net/iflib.c
A commit references this bug:

Author: pkelsey
Date: Sat Mar 14 19:55:06 UTC 2020
New revision: 358997
URL: https://svnweb.freebsd.org/changeset/base/358997

Log:
  Remove freelist contiguous-indexes assertion from rxd_frag_to_sd()

  The vmx driver is an example of an iflib driver that might report
  packets using non-contiguous descriptors (with unused descriptors
  either between received packets or between the fragments of a received
  packet), so this assertion needs to be removed.

  For such drivers, the freelist producer and consumer indexes don't
  relate directly to driver ring slots (the driver deals directly with
  freelist buffer indexes supplied by iflib during refill, and reports
  them with each fragment during packet reception), but do continue to
  be used by iflib for accounting, such as determining the number of
  ring slots that are refillable.

  PR:           243126, 243392, 240628
  Reported by:  avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer
  Reviewed by:  gallatin
  MFC after:    1 week
  Differential Revision:  https://reviews.freebsd.org/D23946

Changes:
  head/sys/net/iflib.c
A commit references this bug:

Author: pkelsey
Date: Sat Mar 14 20:08:04 UTC 2020
New revision: 359000
URL: https://svnweb.freebsd.org/changeset/base/359000

Log:
  Fix if_vmx receive checksum offload bug and harden against the device
  skipping receive descriptors

  This fixes a bug where the checksum offload status of received packets
  was being taken from the first descriptor instead of the last, which
  affected LRO packets.

  The driver has been hardened against the device skipping receive
  descriptors, although it is not believed that this can occur given the
  way this implementation configures the receive rings. Additionally,
  for packets received with the error indicator set, the driver now
  forces the length of all fragments in that packet to zero prior to
  passing it to iflib. Such packets should wind up being discarded at
  some point in the stack anyway, but this removes any questions by
  killing them in the driver.

  Counters have been added (and exposed via sysctls) for skipped receive
  descriptors, zero-length packets received, and packets received with
  the error indicator set so that these conditions can be easily
  observed in the field.

  PR:           243126, 243392, 240628
  Reported by:  avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer
  Reviewed by:  gallatin
  MFC after:    1 week
  Differential Revision:  https://reviews.freebsd.org/D23949

Changes:
  head/sys/dev/vmware/vmxnet3/if_vmx.c
  head/sys/dev/vmware/vmxnet3/if_vmxvar.h
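Once a kernel with r359000 is running, the new counters should appear in the device's sysctl tree. The commit message does not give the exact node names, so a generic listing can be used to locate them; the unit number (vmx.0) and the grep pattern below are assumptions, not the driver's documented names.

```shell
# List all sysctls for the first vmx device and filter for the counters
# added in r359000 (skipped descriptors, zero-length packets, rx errors).
# Exact node names may differ; inspect the unfiltered output if unsure.
sysctl dev.vmx.0 | grep -Ei 'skip|zero|error'
```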
^Triage: Assign to committer resolving, pending MFC. If this doesn't need to go to stable/11, set mfc-stable11 to - Thanks Patrick!
It has been more than a month. Is there any way to speed up the MFC?
Thanks for the fix, but when is the MFC to stable/12 planned? According to the comments, the stated MFC period is already over.
A commit references this bug:

Author: mav
Date: Mon Jul 13 17:20:21 UTC 2020
New revision: 363163
URL: https://svnweb.freebsd.org/changeset/base/363163

Log:
  MFC r358995 (by pkelsey):
  Fix iflib freelist state corruption

  This fixes a bug in iflib freelist management that breaks the required
  correspondence between freelist indexes and driver ring slots.

  PR:  243126, 243392, 240628

Changes:
  _U  stable/12/
  stable/12/sys/net/iflib.c