Bug 234754 - FreeBSD 12.0 ena Network Driver on AWS EC2 Packet Loss
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-virtualization mailing list
 
Reported: 2019-01-08 16:04 UTC by Mike Walker
Modified: 2019-03-31 01:46 UTC
CC List: 5 users

Description Mike Walker 2019-01-08 16:04:45 UTC
Overview: 

Given a single connection transferring a 100MB file, using scp or plain netcat, there will eventually be enough packet loss on the server side that the connection stalls and is eventually reset.


Steps to Reproduce:

1. Boot the standard FreeBSD 12.0 AMI ami-04331586c79df8e01 in us-west-2
2. Attempt to use scp to copy a 100MB file onto the server
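
For step 2, a concrete pair of commands looks like this (the host below is a placeholder for the instance's public address; ec2-user is, as far as I know, the default user on the FreeBSD AMIs):

################
# client:
dd if=/dev/urandom of=rand100M.bin bs=1M count=100
scp rand100M.bin ec2-user@<instance-public-ip>:/tmp/
################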

Additional Information:

I've been able to replicate this against a bone-stock boot of the FreeBSD 12.0 AMI on every instance+region combination I've tried:

* eu-west-1 / t3.medium
* us-west-2 / t3.small
* us-west-2 / t3.medium
* us-west-2 / t3.large
* us-west-2 / m5.large
* us-west-2 / m5a.large
* us-west-2 / m5a.xlarge

I was able to replicate this from a residential (~50Mbps) connection as well as from a machine hosted at another provider with a 1Gbps connection, though I cannot replicate it when the client is a machine in the same VPC as the server.

This really seems like the network driver is just black-holing packets. I see no error messages in dmesg, /var/log/, or on the console, and nothing untoward in the dev.ena.* sysctl values :(
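
For concreteness, these are the sorts of checks I mean (all standard base tools; netstat -i is an extra per-interface counter check beyond what I listed above):

################
dmesg | grep -i ena            # driver messages
tail -n 100 /var/log/messages  # system log
netstat -i                     # Ipkts/Ierrs/Idrop/Oerrs counters per interface
sysctl dev.ena                 # driver statistics (dumped below)
################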

Here's `sysctl dev.ena` from a t3.small instance after a number of TCP connections experienced packet loss:

dev.ena.0.admin_stats.no_completion: 0
dev.ena.0.admin_stats.out_of_space: 0
dev.ena.0.admin_stats.completed_cmd: 37
dev.ena.0.admin_stats.sumbitted_cmd: 37
dev.ena.0.admin_stats.aborted_cmd: 0
dev.ena.0.hw_stats.rx_drops: 0
dev.ena.0.hw_stats.tx_bytes: 5681243
dev.ena.0.hw_stats.rx_bytes: 47398849
dev.ena.0.hw_stats.tx_packets: 27863
dev.ena.0.hw_stats.rx_packets: 42686
dev.ena.0.queue1.rx_ring.empty_rx_ring: 0
dev.ena.0.queue1.rx_ring.bad_req_id: 0
dev.ena.0.queue1.rx_ring.bad_desc_num: 0
dev.ena.0.queue1.rx_ring.dma_mapping_err: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mbuf_alloc_fail: 0
dev.ena.0.queue1.rx_ring.bad_csum: 0
dev.ena.0.queue1.rx_ring.refil_partial: 0
dev.ena.0.queue1.rx_ring.bytes: 33640679
dev.ena.0.queue1.rx_ring.count: 28882
dev.ena.0.queue1.tx_ring.mbuf_collapse_err: 0
dev.ena.0.queue1.tx_ring.mbuf_collapses: 0
dev.ena.0.queue1.tx_ring.bad_req_id: 0
dev.ena.0.queue1.tx_ring.missing_tx_comp: 0
dev.ena.0.queue1.tx_ring.doorbells: 18803
dev.ena.0.queue1.tx_ring.dma_mapping_err: 0
dev.ena.0.queue1.tx_ring.prepare_ctx_err: 0
dev.ena.0.queue1.tx_ring.bytes: 4843829
dev.ena.0.queue1.tx_ring.count: 18812
dev.ena.0.queue0.rx_ring.empty_rx_ring: 0
dev.ena.0.queue0.rx_ring.bad_req_id: 0
dev.ena.0.queue0.rx_ring.bad_desc_num: 0
dev.ena.0.queue0.rx_ring.dma_mapping_err: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mbuf_alloc_fail: 0
dev.ena.0.queue0.rx_ring.bad_csum: 0
dev.ena.0.queue0.rx_ring.refil_partial: 0
dev.ena.0.queue0.rx_ring.bytes: 13758170
dev.ena.0.queue0.rx_ring.count: 13804
dev.ena.0.queue0.tx_ring.mbuf_collapse_err: 0
dev.ena.0.queue0.tx_ring.mbuf_collapses: 0
dev.ena.0.queue0.tx_ring.bad_req_id: 0
dev.ena.0.queue0.tx_ring.missing_tx_comp: 0
dev.ena.0.queue0.tx_ring.doorbells: 9057
dev.ena.0.queue0.tx_ring.dma_mapping_err: 0
dev.ena.0.queue0.tx_ring.prepare_ctx_err: 0
dev.ena.0.queue0.tx_ring.bytes: 838748
dev.ena.0.queue0.tx_ring.count: 9057
dev.ena.0.admin_q_pause: 0
dev.ena.0.interface_down: 1
dev.ena.0.interface_up: 2
dev.ena.0.wd_expired: 0
dev.ena.0.missing_tx_threshold: 128
dev.ena.0.missing_tx_max_queues: 4
dev.ena.0.missing_tx_timeout: 21474836480
dev.ena.0.keep_alive_timeout: 25769803776
dev.ena.0.wd_active: 1
dev.ena.0.%parent: pci0
dev.ena.0.%pnpinfo: vendor=0x1d0f device=0xec20 subvendor=0x0000 subdevice=0x0000 class=0x020000
dev.ena.0.%location: slot=5 function=0 dbsf=pci0:0:5:0 handle=\_SB_.PCI0.S05_
dev.ena.0.%driver: ena
dev.ena.0.%desc: ENA adapter
dev.ena.%parent:
Comment 1 Mike Walker 2019-01-08 16:21:53 UTC
I can also replicate this with UDP, not just TCP.


Example TCP transfer test-case commands; right now this reliably stalls around 12-15MB into the transfer at 1.36MiB/s:

################
# client:
dd if=/dev/urandom of=rand100M.bin bs=1M count=100
pv -pterbT rand100M.bin | nc -v 34.242.134.36 31337

# server:
nc  -v -k -l 31337 | pv -W -s 100000000 -p -t -e -r -b -T > /tmp/rand100M.bin
################

Example UDP transfer test-case commands; right now this reliably stalls around 2-3MB into the transfer:

################
# client:
dd if=/dev/urandom of=rand100M.bin bs=1M count=100
pv -pterbT rand100M.bin | nc -v -u 34.242.134.36 31337

# server:
nc -u -v -k -l 31337 | pv -W -s 100000000 -p -t -e -r -b -T > /tmp/rand100M.bin
################
Comment 2 Kaspars Bankovskis 2019-01-08 16:49:36 UTC
Thanks, I was just about to report the same issue. I'm experiencing the same problem with large HTTP/HTTPS transfers.

To reproduce, I created a new EC2 instance (type c5.large) from AMI ami-01fe4421da59ecb30 in eu-west-1, then tried to fetch a FreeBSD ISO from the official download site. It never succeeds.

$ fetch https://download.freebsd.org/ftp/releases/amd64/amd64/ISO-IMAGES/12.0/FreeBSD-12.0-RELEASE-amd64-dvd1.iso
FreeBSD-12.0-RELEASE-amd64-dvd1.iso            30% of 3704 MB 3063 kBps 14m26s
fetch: https://download.freebsd.org/ftp/releases/amd64/amd64/ISO-IMAGES/12.0/FreeBSD-12.0-RELEASE-amd64-dvd1.iso: Connection reset by peer

I also see no error messages in logs.
Comment 3 Alex Dupre (FreeBSD committer) 2019-01-08 17:25:28 UTC
FWIW I wasn't able to replicate it on a t3.xlarge machine in us-east-1, nor on a t3.large in eu-west-1, by just using scp to copy a 100MB file from a remote location.

% scp rand100M.bin t3:
rand100M.bin            100%  100MB   7.3MB/s   00:13


As for the download issue, I think it's related to download.freebsd.org; I get connection resets from machines outside AWS as well.
Comment 4 Mike Walker 2019-01-08 17:31:44 UTC
(In reply to Alex Dupre from comment #3)

For the machine that worked for you, can you post the output of `sysctl dev.ena.0.%pnpinfo` ?

I know there are 4 variants of the ENA device; I'm wondering if this issue only affects some of them.
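
(For anyone comparing without digging through the sysctl tree, the same PCI IDs show up in pciconf; the chip field encodes device 0xec20 / vendor 0x1d0f:)

################
pciconf -lv | grep -A 4 ena0
################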
Comment 5 Mike Walker 2019-01-08 17:33:32 UTC
Also, a curious thing I noticed: if I set a low enough bandwidth limit in whatever tool I'm using, I can prevent the issue from showing up at all.

Testing from my residential connection with either scp or pv, if I set the bandwidth limit to just below my available bandwidth, I see no packet loss and the transfers succeed.
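
Concretely, the limits I'm setting are the tools' own rate caps; for example (the scp host is a placeholder):

################
# pv: -L caps pipe throughput and takes K/M/G suffixes; 1M = 1MByte/s
pv -L 1M -pterbT rand100M.bin | nc -v 34.242.134.36 31337

# scp: -l takes a limit in Kbit/s (8000 is roughly 1MByte/s)
scp -l 8000 rand100M.bin ec2-user@<server>:
################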
Comment 6 Kaspars Bankovskis 2019-01-08 17:35:43 UTC
(In reply to Alex Dupre from comment #3)

I used download.freebsd.org only as an example. It happens with any sufficiently large file, regardless of its location. I've also tested sites such as http://ftp.gnu.org, as well as downloading files from an S3 bucket.
Comment 7 Colin Percival (FreeBSD committer) 2019-01-08 23:48:11 UTC
There's a known "TCP connections can get stuck after experiencing packet loss" issue (errata notice coming shortly) which might be contributing to this.  But the fact that this is showing up with UDP indicates that it's not just that TCP bug.

Can you test with stable/12 at r342378 or later?  That will take the TCP bug out of the equation -- it would be good to know if that affects what you're seeing at all.
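
If it helps, here is a rough sketch of getting such a kernel onto an instance (assuming the stock GENERIC config and svnlite from base; adjust the -j value to the instance size):

################
# check out stable/12 at r342378 (into an empty /usr/src), then build as root:
svnlite checkout -r 342378 https://svn.freebsd.org/base/stable/12 /usr/src
cd /usr/src
make -j4 buildkernel
make installkernel
shutdown -r now
################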
Comment 8 Alex Dupre (FreeBSD committer) 2019-01-09 10:13:36 UTC
(In reply to Mike Walker from comment #4)

Same as yours, on both machines:

dev.ena.0.%pnpinfo: vendor=0x1d0f device=0xec20 subvendor=0x0000 subdevice=0x0000 class=0x020000
Comment 9 Kaspars Bankovskis 2019-01-09 10:14:28 UTC
(In reply to Colin Percival from comment #7)

Thanks, Colin. That really does seem to fix the issue. I checked out the latest stable/12, rebuilt, and reinstalled the kernel, and so far I haven't been able to reproduce the bug. Do you have any estimate of when this fix will hit the binary updates?
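
(For anyone double-checking: after the reboot, the running kernel can be verified with the usual base tools, e.g.:)

################
uname -a             # should report 12.0-STABLE at r342378 or later
freebsd-version -ku  # installed kernel and userland versions
################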
Comment 10 Colin Percival (FreeBSD committer) 2019-01-09 10:31:24 UTC
(In reply to Kaspars Bankovskis from comment #9)

Thanks for confirming.  I'm not part of the release engineering or security officer teams, but I know they were working on errata notice text for this (and other issues, one of which caused hangs in TCP CUBIC) in the last few days of 2018.  My best guess would be next week.

Mike: Can you confirm that you're seeing UDP issues?  AFAIK you're the only person to reproduce this with UDP.
Comment 11 Mike Walker 2019-01-09 18:43:41 UTC
(In reply to Colin Percival from comment #10)

Hi Colin,

Whew, I'm glad to hear the TCP issue is hopefully resolved in stable/12. I'm building that now and will test it shortly.

Regarding the UDP issue, I'm personally less concerned about that since I don't have a production use of UDP that would be impacted, but I'm curious if anyone else can replicate it.

In testing UDP with a 100MB file I get the following results (the rate-limited sender invocation is sketched after the results):

Sending side: residential ISP connection
Receiving side: AWS eu-west-1 t3.medium FreeBSD 12.0-RELEASE

----------------------------------------------

Full bandwidth (same results to within a few percent across 5 tries):

Sending side: reports sending entire file @ 41.6MiB/s
Receiving side: Transfer stalled @ 1.9MB

----------------------------------------------

Bandwidth Limited, sending side, pv set to 250KBytes/sec:

Sending side: reports full transfer @ 244KiB/s
Receiving side: Transfer complete, no errors

----------------------------------------------

Bandwidth Limited, sending side, pv set to 1MBytes/sec:

Sending side: reports full transfer @ 976KiB/s
Receiving side: Transfer complete, no errors

----------------------------------------------

Bandwidth Limited, sending side, pv set to 2MBytes/sec:

Sending side: reports full transfer @ 1.91MiB/s
Receiving side: Transfer stalled @ 68.2MB

----------------------------------------------

Bandwidth Limited, sending side, pv set to 5MBytes/sec:

Sending side: reports full transfer @ 4.78MiB/s
Receiving side: Transfer stalled @ 20.2MB
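
For reference, the rate-limited runs above cap the sending side with pv's -L flag; e.g. the 250KBytes/sec case looks like this (server command unchanged from comment 1):

################
# client:
pv -L 250K -pterbT rand100M.bin | nc -v -u 34.242.134.36 31337
################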
Comment 12 Mike Walker 2019-01-09 19:58:18 UTC
(In reply to Colin Percival from comment #7)

I was able to confirm 12.0-STABLE fixes the TCP issue I was experiencing! 🙌