| Summary: | FreeBSD 12.0 ena Network Driver on AWS EC2 Packet Loss | | |
|---|---|---|---|
| Product: | Base System | Reporter: | Mike Walker <mike.walker> |
| Component: | kern | Assignee: | freebsd-virtualization (Nobody) <virtualization> |
| Status: | Closed FIXED | | |
| Severity: | Affects Only Me | CC: | ale, cperciva, derekverlee, kaspars, mmpestorich, monder, pete |
| Priority: | --- | | |
| Version: | 12.0-RELEASE | | |
| Hardware: | amd64 | | |
| OS: | Any | | |

Description
Mike Walker
2019-01-08 16:04:45 UTC
Also I can replicate this with TCP as well as UDP.

Example TCP transfer test-case commands, right now this reliably stalls around 12-15MB into the transfer, going at 1.36MiB/s:

################
# client:
dd if=/dev/urandom of=rand100M.bin bs=1M count=100
pv -pterbT rand100M.bin | nc -v 34.242.134.36 31337

# server:
nc -v -k -l 31337 | pv -W -s 100000000 -p -t -e -r -b -T > /tmp/rand100M.bin
################

Example UDP transfer test-case commands, right now this reliably stalls around 2-3MB into the transfer:

################
# client:
dd if=/dev/urandom of=rand100M.bin bs=1M count=100
pv -pterbT rand100M.bin | nc -v -u 34.242.134.36 31337

# server:
nc -u -v -k -l 31337 | pv -W -s 100000000 -p -t -e -r -b -T > /tmp/rand100M.bin
################

Thanks, I was just going to report the same issue. I'm experiencing the same problem with large data transfers over http/https. To reproduce, I created a new EC2 instance (type c5.large) using the AMI ami-01fe4421da59ecb30 in the region eu-west-1, and then tried to fetch the FreeBSD ISO file from the official download site. It never succeeds.

$ fetch https://download.freebsd.org/ftp/releases/amd64/amd64/ISO-IMAGES/12.0/FreeBSD-12.0-RELEASE-amd64-dvd1.iso
FreeBSD-12.0-RELEASE-amd64-dvd1.iso 30% of 3704 MB 3063 kBps 14m26s
fetch: https://download.freebsd.org/ftp/releases/amd64/amd64/ISO-IMAGES/12.0/FreeBSD-12.0-RELEASE-amd64-dvd1.iso: Connection reset by peer

I also see no error messages in logs.

FWIW I wasn't able to replicate it on a t3.xlarge machine in us-east-1, nor on a t3.large in eu-west-1, by just using scp to copy a 100MB file from a remote location.

% scp rand100M.bin t3:
rand100M.bin 100% 100MB 7.3MB/s 00:13

I think the download issue is related to download.freebsd.org; I also get connection resets from machines not in AWS.

(In reply to Alex Dupre from comment #3)
For the machine that worked for you, can you post the output of `sysctl dev.ena.0.%pnpinfo`?
I know there are 4 variants of the ENA device, I'm wondering if this issue only affects some of them.

Also, a curious thing I noticed is that if I'm able to set a low-enough bandwidth limit in whatever tool I'm using I can prevent the issue from popping up. Testing from my residential connection with either scp or pv, if I set the bandwidth to be just below my available bandwidth, I don't see any packet loss, and the transfers succeed.

(In reply to Alex Dupre from comment #3)
I used download.freebsd.org only as an example. It happens to me with any significantly large file, regardless of its location. I've tested also such sites as http://ftp.gnu.org, as well as downloading files from S3 bucket.

There's a known "TCP connections can get stuck after experiencing packet loss" issue (errata notice coming shortly) which might be contributing to this. But the fact that this is showing up with UDP indicates that it's not just that TCP bug.

Can you test with stable/12 at r342378 or later? That will take the TCP bug out of the equation -- it would be good to know if that affects what you're seeing at all.

(In reply to Mike Walker from comment #4)
Same as yours, on both machines:
dev.ena.0.%pnpinfo: vendor=0x1d0f device=0xec20 subvendor=0x0000 subdevice=0x0000 class=0x020000

(In reply to Colin Percival from comment #7)
Thanks, Colin. It seems that it really fixes the issue. Checked out the latest stable/12, rebuilt and reinstalled the kernel. So far haven't reproduced the bug anymore. Do you have any estimates when this fix is going to hit the binary updates?

(In reply to Kaspars Bankovskis from comment #9)
Thanks for confirming. I'm not part of the release engineering or security officer teams, but I know they were working on errata notice text for this (and other issues, one of which caused hangs in TCP CUBIC) in the last few days of 2018. My best guess would be next week.

Mike: Can you confirm that you're seeing UDP issues?
AFAIK you're the only person to reproduce this with UDP.

(In reply to Colin Percival from comment #10)
Hi Colin,

Whew, I'm glad to hear the TCP issue is hopefully resolved in stable/12. I'm building that now and will test it shortly.

Regarding the UDP issue, I'm personally less concerned about that since I don't have a production use of UDP that would be impacted, but I'm curious if anyone else can replicate it.

In testing UDP with a 100M byte file I get the following results:

Sending side: residential ISP connection
Receiving side: AWS eu-west-1 t3.medium FreeBSD 12.0-RELEASE

----------------------------------------------
Full Bandwidth: (Same results to within a few percent trying 5 times)
Sending side: reports sending entire file @ 41.6MiB/s
Receiving side: Transfer stalled @ 1.9MB
----------------------------------------------
Bandwidth Limited, sending side, pv set to 250KBytes/sec:
Sending side: reports full transfer @ 244KiB/s
Receiving side: Transfer complete, no errors
----------------------------------------------
Bandwidth Limited, sending side, pv set to 1MBytes/sec:
Sending side: reports full transfer @ 976KiB/s
Receiving side: Transfer complete, no errors
----------------------------------------------
Bandwidth Limited, sending side, pv set to 2MBytes/sec:
Sending side: reports full transfer @ 1.91MiB/s
Receiving side: Transfer stalled @ 68.2MB
----------------------------------------------
Bandwidth Limited, sending side, pv set to 5MBytes/sec:
Sending side: reports full transfer @ 4.78MiB/s
Receiving side: Transfer stalled @ 20.2MB

(In reply to Colin Percival from comment #7)
I was able to confirm 12.0-STABLE fixes the TCP issue I was experiencing! 🙌

I'll share my experience, as I think it might be a different manifestation of this bug. I'm running a FreeBSD 12.0-RELEASE instance configured as a VPN router in EC2, with the ENI registered in the subnet's route table, and "source/dest check" disabled.
Gateway is enabled in rc.conf, as well as pf, with some NAT and filtering rules. Pinging the VPN server's local address from another instance in the subnet (Linux) works, and pinging the VPN client from the VPN server works, but pinging the client from the Linux host experienced >99% packet loss, with one reply arriving in many thousands. tcpdump showed the echo request getting all the way to the client, and the reply apparently emitted on ena0, but never arriving at the Linux interface. This was the same with a t3a.small and t3.small instance in us-east-1. Switching the instance type to t2.small (xn driver) solved the issue.

FreeBSD 12.0-RELEASE-amd64 (ami-03b0f822e17669866), us-east-2

If I get a chance to try an instance with STABLE I will post back.

Closing this since it appears to have been a TCP stack issue unrelated to the ENA driver or EC2.
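
Editor's sketch, not part of the original report: the receive-side results in this thread ("Transfer stalled @ 1.9MB", "Transfer complete, no errors") were read off the pv progress display. For anyone re-running the nc/pv test cases, a small helper along the following lines could confirm the outcome from the size of the received file instead. The function name `check_transfer` is hypothetical, and the expected size of 104857600 bytes assumes the file was created with the `dd ... bs=1M count=100` command from the test case.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): compare the size of a
# received file against the number of bytes the sender wrote.
# Usage: check_transfer FILE EXPECTED_BYTES
check_transfer() {
    file=$1
    expected=$2
    # wc -c pads its output with spaces on some systems, so strip whitespace.
    actual=$(wc -c < "$file" | tr -d '[:space:]')
    if [ "$actual" -eq "$expected" ]; then
        echo "complete: $actual bytes"
        return 0
    fi
    echo "short: $actual of $expected bytes received"
    return 1
}

# Example (assumes 100 blocks of 1 MiB from the dd command in the test case):
# check_transfer /tmp/rand100M.bin 104857600
```

A size check only shows that bytes went missing, which is enough to tell a stalled run from a complete one; it does not show where in the path the packets were dropped.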