Summary:   ena driver causing kernel panics on AWS EC2
Product:   Base System
Component: kern
Version:   11.1-RELEASE
Hardware:  amd64
OS:        Any
Status:    Closed FIXED
Severity:  Affects Only Me
Priority:  ---
Reporter:  Mike Walker <mike.walker>
Assignee:  freebsd-virtualization (Nobody) <virtualization>
CC:        ale, andreas.sommer87, berend, cperciva, crane76, elij+freebsd, in2, leif, mmpestorich, pete, pstef, richard, terje
Description
Mike Walker
2018-02-09 15:38:47 UTC
We're also affected by this, running c5.large, handling about 13,000 connections through haproxy, then varnish, and on to other systems. Activity was about 4,000 requests per minute leading up to the crash, which doesn't seem all that high. It's possible it spiked shortly before the crash without that making it into the logs.

This is:

FreeBSD [host snipped] 11.1-RELEASE-p8 FreeBSD 11.1-RELEASE-p8 #0: Tue Mar 13 17:07:05 UTC 2018 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

It's a lightly modified/configured version of one of the usual FreeBSD AMIs; I don't recall the exact AMI ID, sorry. The kernel etc. is stock -- we've only made additions in terms of software for our own AMI.

We have two virtually identical machines exposed under the same hostname, receiving near-identical load, and have so far only noticed this on one of them. Could be coincidental, but it seemed worthwhile to mention. It also strikes me as noteworthy that the data rate was only about 700 kB/s at the last data point I have before the crash. Unfortunately I don't know anything about the packet rate, and again there could have been a peak leading up to the crash that never made it into the logs.

If anyone is interested in any other data from this, please do let me know. Also, this is part of a redundant setup, allowing some extra room for moving things around if anyone wants anything tested or tried on it.

>> Crash itself:

Limiting open port RST response from 457 to 200 packets/sec
Limiting open port RST response from 487 to 200 packets/sec
Limiting open port RST response from 541 to 200 packets/sec
Limiting open port RST response from 517 to 200 packets/sec
Limiting open port RST response from 586 to 200 packets/sec
Limiting open port RST response from 237 to 200 packets/sec
ena0: Found a Tx that wasn't completed on time, qid 1, index 324.
pid 3639 (varnishd), uid 429: exited on signal 6
Limiting open port RST response from 259 to 200 packets/sec
Limiting open port RST response from 380 to 200 packets/sec
ena0: Found a Tx that wasn't completed on time, qid 1, index 181.

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x1c
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff82173f8c
stack pointer           = 0x28:0xfffffe0110f43180
frame pointer           = 0x28:0xfffffe0110f43260
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (irq261: ena0)
trap number             = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80aadac7 at kdb_backtrace+0x67
#1 0xffffffff80a6bba6 at vpanic+0x186
#2 0xffffffff80a6ba13 at panic+0x43
#3 0xffffffff80ee3092 at trap_fatal+0x322
#4 0xffffffff80ee30eb at trap_pfault+0x4b
#5 0xffffffff80ee290a at trap+0x2ca
#6 0xffffffff80ec3d40 at calltrap+0x8
#7 0xffffffff80a321ec at intr_event_execute_handlers+0xec
#8 0xffffffff80a324d6 at ithread_loop+0xd6
#9 0xffffffff80a2f845 at fork_exit+0x85
#10 0xffffffff80ec4a0e at fork_trampoline+0xe
Uptime: 8d22h59m55s
Rebooting...

>> boot log:

Copyright (c) 1992-2017 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.1-RELEASE-p8 #0: Tue Mar 13 17:07:05 UTC 2018
    root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0)
VT(vga): text 80x25
CPU: HammerEM64T (3000.05-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x50653  Family=0x6  Model=0x55  Stepping=3
  Features=0x1f83fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,SS,HTT>
  Features2=0xfffa3203<SSE3,PCLMULQDQ,SSSE3,FMA,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND,HV>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0xd11f4fbb<FSGSBASE,TSCADJ,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,MPX,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLWB,AVX512CD,AVX512BW>
  Structured Extended Features2=0x8<PKU>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
TSC: P-state invariant, performance statistics
Hypervisor: Origin = "KVMKVMKVM"
real memory  = 5114953728 (4878 MB)
avail memory = 3844890624 (3666 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <AMAZON AMZNAPIC>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 1 core(s) x 2 hardware threads
random: unblocking device.
ioapic0 <Version 1.1> irqs 0-23 on motherboard
SMP: AP CPU #1 Launched!
random: entropy device external interface
kbd1 at kbdmux0
netmap: loaded module
module_register_init: MOD_LOAD (vesa, 0xffffffff80f5eb40, 0) error 19
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
nexus0
vtvga0: <VT VGA driver> on motherboard
cryptosoft0: <software crypto> on motherboard
acpi0: <AMAZON AMZNRSDT> on motherboard
acpi0: Power Button (fixed)
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
atrtc0: <AT realtime clock> port 0x70-0x71,0x72-0x77 irq 8 on acpi0
Event timer "RTC" frequency 32768 Hz quality 0
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0xb008-0xb00b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
isab0: <PCI-ISA bridge> at device 1.0 on pci0
isa0: <ISA bus> on isab0
pci0: <old, non-VGA display device> at device 1.3 (no driver attached)
vgapci0: <VGA-compatible display> mem 0xfe400000-0xfe7fffff at device 3.0 on pci0
vgapci0: Boot video device
nvme0: <Generic NVMe Device> mem 0xfebf0000-0xfebf3fff irq 11 at device 4.0 on pci0
ena0: <ENA adapter> mem 0xfebf4000-0xfebf7fff at device 5.0 on pci0
ena0: Elastic Network Adapter (ENA)ena v0.7.0
ena0: initalize 2 io queues
ena0: Ethernet address: 02:2b:3a:f4:70:8c
ena0: Allocated msix_entries, vectors (cnt: 3)
nvme1: <Generic NVMe Device> mem 0xfebf8000-0xfebfbfff irq 11 at device 31.0 on pci0
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
uart0: <Non-standard ns8250 class UART with FIFOs> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart0: console (115200,n,8,1)
orm0: <ISA Option ROM> at iomem 0xef000-0xeffff on isa0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
attimer0: <AT timer> at port 0x40 on isa0
Timecounter "i8254" frequency 1193182 Hz quality 0
attimer0: Can't map interrupt.
ppc0: cannot reserve I/O port range
ena0: link is UP
ena0: link state changed to UP
Timecounters tick every 1.000 msec
usb_needs_explore_all: no devclass
nvme cam probe device init
nvme0: temperature threshold not supported
nvd0: <Amazon Elastic Block Store> NVMe namespace
nvd0: 20480MB (41943040 512 byte sectors)
nvme1: temperature threshold not supported
nvd1: <Amazon Elastic Block Store> NVMe namespace
GEOM: nvd1: corrupt or invalid GPT detected.
nvd1: 20480MB (41943040 512 byte sectors)
GEOM: nvd1: GPT rejected -- may not be recoverable.
Trying to mount root from ufs:/dev/gpt/rootfs [rw]...

We're seeing this now since we migrated our instance between instance types. We were previously on a c5.9xlarge and recently scaled back down (we'd rebuilt and scaled up earlier in the year to deal with seasonal load). This instance is our primary DB instance, and the kernel panics seem to happen either during the DB dump process (which runs when we have the least DB traffic) or, as has happened once so far, during the peak of the daily load.

Our log output is almost the same as what has already been submitted, except that our current process line reads:

```current process = 12 (irq269: ena0)```

The two previous reports had the same IRQ number and are on the same class of instance types, whereas we're on an M5-class instance type and get a slightly different IRQ number. Also, this was happening on 11.1-p4, but we've since upgraded to p10 and the issue still occurs. I'm planning to upgrade a clone to 11.2-p0 later this week to check whether there's a newer version of the ENA driver (from https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena) in that build.
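As an aside, the running driver version can be confirmed directly from the boot messages quoted above; a small sketch (the sample dmesg lines are inlined here -- on a live instance, pipe the output of `dmesg` in instead):

```shell
# Extract the ENA driver version string from kernel boot messages.
# On a real instance: dmesg | grep -o 'ena v[0-9.]*'
grep -o 'ena v[0-9.]*' <<'EOF'
ena0: Elastic Network Adapter (ENA)ena v0.7.0
ena0: Ethernet address: 02:2b:3a:f4:70:8c
EOF
# prints: ena v0.7.0
```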
(In reply to Richard Paul from comment #2)

Just a quick grab of our DB server's current throughput with `systat -ifstat -pps`:

ena0   in     4.498 Kp/s   4.498 Kp/s  279.347 Mp
       out   12.956 Kp/s  12.956 Kp/s  674.161 Mp

We have a couple of Varnish servers in front of this platform (r4.2xlarge) which are rock solid, but they don't see anything like this kind of throughput, which is possibly why this is the only server where we're seeing the issue.

Sorry, I'm coming to this late -- somehow I never saw this PR earlier. It's possible that this is an ENA driver bug, but C5 also has the added complication of using an entirely different virtualization platform, and I'm a bit suspicious of the backtrace here. Can one of you try to reproduce this on a different instance type -- m4.16xlarge or r4.* would probably be best -- so we can see if it's specifically an *ENA* problem or a *C5* problem?

Hi Colin,

Thanks for responding to this issue. You're right to point out that this may be down to the difference in hypervisors. M5 seems to be based on the same platform, so for our purposes an r4.2xlarge would be our nearest alternative in the r4 range of instances. We're waiting for the inevitable failure of our current instance, as it seems to be falling over every 6-7 days.

What we need is some kind of reproducible test case to better diagnose the issue. As the other reporters in this ticket say, this can take hours to a day to reproduce -- in our case multiple days -- which makes finding such a test case time-consuming and difficult. That's what I'm currently attempting with a cloned M5 instance of our DB server, running the database dump and then trying to load the server heavily, but it's not producing much in the way of results at the moment.

Okay, with a bit of effort -- this is testing against the v0.7.0 ENA driver on FreeBSD 11.1-p11 -- I got a reboot. But this was not easy to reproduce.
Here's what I did to get this to die... eventually (it took about 3 hours):

* Downsized the instance to an m5.large
* Installed nginx on the DB server clone and started it (config below)
* Installed the phoronix-test-suite and pts/blogbench
* Kicked off a backup of our 115GB database to a local (800GB EBS volume) ZFS partition (the DB is held on a mirrored ZFS set on another pair of 250GB EBS volumes)
* Kicked off a stress run of the phoronix blogbench
* With two t2.medium instances in the same VPC, ran: wrk -d 12h -c 2k -t2 http://10.0.0.10/

Additional configuration etc.:

----------------------------------------------------------------------------------
root@os-upgrade-test-db:~ # setenv PTS_CONCURRENT_TEST_RUNS 8
root@os-upgrade-test-db:~ # setenv TOTAL_LOOP_TIME 30
root@os-upgrade-test-db:~ # phoronix-test-suite stress-run pts/blogbench
Choose Option 3 for Test All Options
----------------------------------------------------------------------------------

----------------------------------------------------------------------------------
pkg info nginx-full
nginx-full-1.12.2_11,2
Name           : nginx-full
Version        : 1.12.2_11,2
Installed on   : Thu Jun 28 09:37:29 2018 UTC

___nginx.conf___
worker_processes auto;
events {
    worker_connections 2048;
}
http {
    include mime.types;
    default_type application/octet-stream;
    sendfile on;
    tcp_nopush on;
    keepalive_timeout 65;
    gzip on;
    server {
        listen 80;
        server_name localhost;
        location / {
            root /usr/local/www/nginx;
            index index.html index.htm;
        }
        location = /50x.html {
            root /usr/local/www/nginx-dist;
        }
    }
}
----------------------------------------------------------------------------------

This is just to give you an idea of what we're doing; the backup is actually a whole backup script that does this for each database in the RDBMS and then rsyncs to the offsite server.

__Postgres dump__
sudo -u pgsql pg_dump -j 16 -Fd dbname -f /var/backups/outfile
----------------------------------------------------------------------------------

I missed a couple of configuration options we'd set:

------------------------------------------------------------------------------
sysctl:
kern.ipc.soacceptqueue: 8192
------------------------------------------------------------------------------

------------------------------------------------------------------------------
postgresql.conf:
max_connections = 120
shared_buffers = 2GB
effective_cache_size = 6GB
checkpoint_completion_target = 0.9
checkpoint_timeout = 1h
work_mem = 2MB
maintenance_work_mem = 256MB
max_locks_per_transaction = 128
random_page_cost = 1.1
max_worker_processes = 2
------------------------------------------------------------------------------

I've changed the instance type to an r4.large; I'll tweak some of the postgresql settings for the additional memory on the r4 instance and set it off again to try to get it to fall over.

Just to help you out with pts: to run the benchmark you need to install it using the phoronix-test-suite package:

#~ pkg install phoronix-test-suite
#~ phoronix-test-suite install pts/blogbench

So, I managed to produce sufficient load to force this to happen on an m5.large instance. I have tried to replicate this on an r4.large instance and have failed to do so so far, but I will keep trying. As such it may not be the ENA adapter at all that is causing this issue; read more at http://www.daemonology.net/blog/2017-11-17-FreeBSD-EC2-C5-instances.html

Have you been attaching/detaching EBS volumes while your [mc]5 instances are running? AFAIK the nvme driver is completely stable aside from that.

Hi Colin,

I read the article about these drives earlier this year, thanks. (P.S. is this more usable now in 11.2, or are we still waiting on the ability to hot-remove drives?) Specifically on this test instance, no, I haven't. We did do this on the original C5 server when we wanted to replace an IO2 drive which was costing us a fortune, and had to schedule some downtime to detach it. But on the instance I'm testing, no, I didn't mess about with the volumes.

However, thinking about this, have you tried testing with a larger number of EBS volumes attached, with load going to most of them at the same time? We have: a UFS root volume, 2x mirrored ZFS for the DB, 1x ZFS for logs, and 1x ZFS for local backups.

Sadly nvme hotplug/unplug is still broken in 11.2 -- unfortunately it turned out that some of the people who would have been fixing that were also the people who needed to work on fixing the Spectre/Meltdown/etc. issues, so this got pushed off. Right now I'm hoping that we'll have the bugs worked out in time for 12.0. The extent of the testing I've done is a few buildworlds on a single disk -- I've been busy chasing other issues (e.g., the IPI issue which was causing userland data corruption) so I haven't been able to do much testing here. Any you can do will be much appreciated...

Okay, so we thought this was resolved by the move to an r4.2xlarge. It wasn't; the server page-faulted last night with the same old page fault process/trap number etc. Our next approach, I guess, will be building the latest version of the ENA driver on this instance, since we seem to have ruled out issues with the change in drivers for HVM.

Hi, I'm running 11.2-RELEASE on EC2 Singapore. While no panic has been encountered yet, ENA keeps going down and up every EXACTLY 30 minutes and causes about 0.5% packet loss.
On server 1, a c5.2xlarge with about 50Mbps of traffic:

Aug 2 00:24:40 ip-10-251-18-192 kernel: ena0: device is going DOWN
Aug 2 00:24:40 ip-10-251-18-192 kernel: ena0: device is going UP
Aug 2 00:54:40 ip-10-251-18-192 kernel: ena0: device is going DOWN
Aug 2 00:54:40 ip-10-251-18-192 kernel: ena0: device is going UP
Aug 2 01:24:41 ip-10-251-18-192 kernel: ena0: device is going DOWN
Aug 2 01:24:41 ip-10-251-18-192 kernel: ena0: device is going UP
Aug 2 01:54:40 ip-10-251-18-192 kernel: ena0: device is going DOWN
Aug 2 01:54:41 ip-10-251-18-192 kernel: ena0: device is going UP

On server 2, a c5.large with <1Mbps of traffic:

Aug 2 00:18:00 proxy621 kernel: ena0: device is going DOWN
Aug 2 00:18:00 proxy621 kernel: ena0: device is going UP
Aug 2 00:48:00 proxy621 kernel: ena0: device is going DOWN
Aug 2 00:48:00 proxy621 kernel: ena0: device is going UP
Aug 2 01:18:00 proxy621 kernel: ena0: device is going DOWN
Aug 2 01:18:00 proxy621 kernel: ena0: device is going UP
Aug 2 01:48:00 proxy621 kernel: ena0: device is going DOWN
Aug 2 01:48:00 proxy621 kernel: ena0: device is going UP

`grep ena /var/log/messages` output is here:
server 1: https://nopaste.xyz/?ac03ff403e167965#pg6GYMdb+yReKI4OFiR7vmXqVy7fCsYI5e9TX2hdqTA=
server 2: https://nopaste.xyz/?4b43d08c79c5bc32#gIcXQRyZTFZ0e7M9aW8NQQLatv78UBD3p6Gu7ZQ0QPs=

ENA flapping every 30 minutes is almost certainly due to the MTU being set, thanks to DHCP announcing support for jumbograms. That particular bug is fixed in HEAD (r333454). AFAIK this should not cause any of the other reported issues, but it would be good if someone who is experiencing problems could confirm that they don't happen at 30-minute intervals.

Can anyone reproduce this on HEAD? If this is still broken I'd like to make sure it's fixed before 12.0-RELEASE, but so far this seems quite elusive.

(In reply to Colin Percival from comment #16)

I will try to reproduce later today, or this weekend. I was able to reproduce about a month ago IIRC, but will test with the latest 12-CURRENT checkout.
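For anyone stuck on an affected release before r333454 ships, one possible stopgap -- assuming the flapping really is triggered by the DHCP-announced jumbo MTU, which is this reporter's assumption rather than anything confirmed in this PR -- is to keep dhclient from applying the offered MTU. A sketch (both file paths are standard; whether the `supersede` is needed depends on the local dhclient.conf):

```
# /etc/rc.conf -- request DHCP but pin the MTU explicitly (sketch)
ifconfig_ena0="DHCP mtu 1500"
```

```
# /etc/dhclient.conf -- ignore the server-offered interface-mtu option (sketch)
interface "ena0" {
    supersede interface-mtu 1500;
}
```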
(In reply to pete from comment #16)

I've been able to reproduce this repeatedly (but not predictably) on 11.2 on an r4.large. Not to state the blindingly obvious, but smaller instances such as t2.* aren't affected since they use xn instead of ena. It seems to be most likely at times of high network IO, which again risks stating the forehead-slappingly obvious. :)

Multiple times, the crash included the same backtrace shown in this bug. However, at least once it panicked on a double fault, which, if related, suggests that the bug in ena could be incurring memory corruption. Now granted, I only know of one incidence of a double fault, so it could have been running on a host with faulty RAM or something at the time. However, after each panic I'd stop/start the instance rather than reboot, to provoke it to move to new hardware, so I'm not suggesting that the whole bug is merely from faulty host hardware.

I might beg that the fix be patched into 11.2, or at least included in 11.3 so it won't have to wait for 12. Otherwise, AWS users will find themselves stuck on 11.1, and the approaching EOL of 11.1 will leave them without security updates, which in turn makes this an indirect security issue. However, I understand there are other considerations at play, and very much appreciate the relentless work of the security team (not to mention the work on AWS support and FreeBSD in general).

Probably too much detail: the particular case was our standby MySQL database on an r4.large. It was stable on 11.1, and problematic after I upgraded it to 11.2 (with `freebsd-update upgrade`); after five or so crashes in a month, I downgraded it back to 11.1 (again with `freebsd-update upgrade`), after which it has been perfectly stable for a couple of weeks now. It's in master-master replication with our production replica, and normally gets a fairly low but steady stream of activity from the replication.
However, we have several nightly jobs that crank away at updating a model and cause a large volume of traffic in the replication stream. I don't have proper metrics on bytes/sec, so I have no idea whether it saturates the interface. It's enough that replication falls behind for up to a few hours, but I wouldn't call our system "huge" in terms of network traffic by any means. The reason I included all that detail is to point out: (1) it seems to be a regression between 11.1 and 11.2, (2) r4.* are for sure affected, and (3) the problem may be more likely to trigger on moderate or bursty network traffic with much task-switching between MySQL threads than on a simple stream of a high-speed file transfer, for example.

-Leif

The reason I was asking about HEAD is that we're currently at 12.0-ALPHA5 -- we're going to have 12.0-RELEASE before the release engineering team goes back and does the next release from stable/11 (a.k.a. 11.3-RELEASE). In other words, I'd like to make sure this is fixed in the next release, but the first step towards that is knowing whether it's still broken. There have been some driver updates since 11.2, and one of them might have fixed this accidentally.

(In reply to Colin Percival from comment #19)

Cool. I may be able to clone that machine to 12 later this week and try to reproduce it... I need to finish some urgent work first.

(In reply to Colin Percival from comment #15)

I tested 12.0-ALPHA2 and 12.0-ALPHA5 on c5.large and t3.micro and did not see any ena up/down messages again, so I think this issue has been fixed.

(In reply to Ling from comment #21)

I think the up/down state flapping is unrelated to the panics other people were seeing, so I'd like to know if other people can reproduce the issues they saw.

I have a c5.large EC2 instance I'm running to test this using 12.0-ALPHA5. In the same VPC I have a system using a xn ethernet interface.
I am running iperf3 between these two systems and getting just shy of 1 Gbps network throughput, and ~13 Kp/s. The c5.large system with ena interfaces has not had any problems so far. I've run several iperf3 TCP tests for 10 minutes each with no errors. As mentioned earlier, the interface flapping errors have gone away as well. If there are other artificial benchmarks I should run to help validate that this configuration has stabilized, let me know and I can run them today.

A bit OT, but is there a particular reason for the FreeBSD 11.2 AMI to support C5 instances and not M5?

This has been sat on my to-do list for a while; I'm hoping that if I can get my next job out of the way this week I'll revisit this. Further to my previous posts, we seem to see the reboots happen more regularly on instances faced with moderate memory pressure while also seeing reasonable amounts of writing to disks backed by ZFS datasets. We don't see this where there is only memory pressure -- e.g., on Varnish servers running purely in memory we haven't seen it even though memory usage is very close to 100%. If we double the memory on a crashing instance, the issue goes away. As such, I'm going to attempt to force memory pressure on a test server with an additional disk holding a zpool and ZFS dataset, to try to reproduce this on a recent 12.0 instance.

I have the same problem (ena interface going down and up repeatedly).
# uname -a
FreeBSD db-20 11.2-RELEASE-p4 FreeBSD 11.2-RELEASE-p4 #0: Thu Sep 27 08:16:24 UTC 2018 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

# ifconfig ena0
ena0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=422<TXCSUM,JUMBO_MTU,LRO>
        ether 06:4d:4b:64:e1:86
        hwaddr 06:4d:4b:64:e1:86
        inet6 fe80::44d:4bff:fe64:e186%ena0 prefixlen 64 scopeid 0x1
        inet 10.1.20.20 netmask 0xffffff00 broadcast 10.1.20.255
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-T <full-duplex>)
        status: active

This is an AWS r5.large instance. It was on 11.1-RELEASE; I upgraded to 11.2 with freebsd-update. But my other instance, an r4.large, is very stable:

# uname -a
FreeBSD web-10 11.1-RELEASE-p1 FreeBSD 11.1-RELEASE-p1 #0: Wed Aug 9 11:55:48 UTC 2017 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

# ifconfig ena0
ena0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=422<TXCSUM,JUMBO_MTU,LRO>
        ether 06:01:57:54:03:a2
        hwaddr 06:01:57:54:03:a2
        inet6 fe80::401:57ff:fe54:3a2%ena0 prefixlen 64 scopeid 0x1
        inet 10.1.20.10 netmask 0xffffff00 broadcast 10.1.20.255
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-T <full-duplex>)
        status: active

@jaehak: That is not this issue, and it has already been rectified for 12. 12 should be out in December, and this problem will go away for you then (and for us -- we're actually seeing disruption on our production systems because of this bug, since the application unexpectedly can't reach the cache and database layers while the network interface is down, and we're hit pretty hard because we receive around 1M requests per day, so there's a lot of opportunity for this to happen).
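The "exactly every 30 minutes" pattern reported above is easy to verify from /var/log/messages; a small sketch that tallies link-down events per hour (sample log lines are inlined here -- on a live system, feed `grep 'going DOWN' /var/log/messages` in instead):

```shell
# Count ena0 link-down events per hour from kernel log lines.
# $3 is the syslog timestamp (HH:MM:SS); bucket by the hour field.
awk '/device is going DOWN/ { split($3, t, ":"); cnt[t[1]]++ }
     END { for (h in cnt) print h, cnt[h] }' <<'EOF'
Aug 2 00:24:40 ip-10-251-18-192 kernel: ena0: device is going DOWN
Aug 2 00:54:40 ip-10-251-18-192 kernel: ena0: device is going DOWN
Aug 2 01:24:41 ip-10-251-18-192 kernel: ena0: device is going DOWN
EOF
```

With the sample input this prints one "hour count" line per hour seen (two events in hour 00, one in hour 01); a steady two-per-hour tally is consistent with the 30-minute DHCP renewal theory.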
I did some testing yesterday but couldn't manage to reproduce the issue on either 11.2 or 12.0-BETA1. However, the problem does still exist on the current 11.2 release: I had a test instance with jails on it in which I was building our application stack (a convoluted stack with a lot of files being uploaded to S3 as part of the build), and I'd been having issues with it rebooting. Yesterday it failed on startup, wanting to drop into single-user mode due to a UFS checksum issue. Obviously that isn't possible on AWS, as you don't get console access, so this instance had to be written off.

Seeing exactly the same thing on m5.large. 100% repeatable (full zfs send/recv from another server).

Dec 21 21:07:10 nfs1 kernel: Fatal trap 12: page fault while in kernel mode
Dec 21 21:07:10 nfs1 kernel: cpuid = 0; apic id = 00
Dec 21 21:07:10 nfs1 kernel: fault virtual address   = 0x1c
Dec 21 21:07:10 nfs1 kernel: fault code              = supervisor write data, page not present
Dec 21 21:07:10 nfs1 kernel: instruction pointer     = 0x20:0xffffffff82269f5c
Dec 21 21:07:10 nfs1 kernel: stack pointer           = 0x0:0xfffffe02259ac180
Dec 21 21:07:10 nfs1 kernel: frame pointer           = 0x0:0xfffffe02259ac260
Dec 21 21:07:10 nfs1 kernel: code segment            = base rx0, limit 0xfffff, type 0x1b
Dec 21 21:07:10 nfs1 kernel:                         = DPL 0, pres 1, long 1, def32 0, gran 1
Dec 21 21:07:10 nfs1 kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
Dec 21 21:07:10 nfs1 kernel: current process         = 12 (irq260: ena0)
Dec 21 21:07:10 nfs1 kernel: trap number             = 12
Dec 21 21:07:10 nfs1 kernel: panic: page fault
Dec 21 21:07:10 nfs1 kernel: cpuid = 0
Dec 21 21:07:10 nfs1 kernel: KDB: stack backtrace:
Dec 21 21:07:10 nfs1 kernel: #0 0xffffffff80b3d577 at kdb_backtrace+0x67
Dec 21 21:07:10 nfs1 kernel: #1 0xffffffff80af6b17 at vpanic+0x177
Dec 21 21:07:10 nfs1 kernel: #2 0xffffffff80af6993 at panic+0x43
Dec 21 21:07:10 nfs1 kernel: #3 0xffffffff80f77fdf at trap_fatal+0x35f
Dec 21 21:07:10 nfs1 kernel: #4 0xffffffff80f78039 at trap_pfault+0x49
Dec 21 21:07:10 nfs1 kernel: #5 0xffffffff80f77807 at trap+0x2c7
Dec 21 21:07:10 nfs1 kernel: #6 0xffffffff80f5808c at calltrap+0x8
Dec 21 21:07:10 nfs1 kernel: #7 0xffffffff80abcd69 at intr_event_execute_handlers+0xe9
Dec 21 21:07:10 nfs1 kernel: #8 0xffffffff80abd047 at ithread_loop+0xe7
Dec 21 21:07:10 nfs1 kernel: #9 0xffffffff80aba093 at fork_exit+0x83
Dec 21 21:07:10 nfs1 kernel: #10 0xffffffff80f58fae at fork_trampoline+0xe
Dec 21 21:07:10 nfs1 kernel: Uptime: 11h49m3s
Dec 21 21:07:10 nfs1 kernel: Rebooting...

This is on 11.2-RELEASE-p7.

"You'll need to try this out on 12.0" will be the response, berend, I think. We have completed our migration from AWS to GCP now, so we can't make any more progress on this.

ena in 12.0-RELEASE is still not stable on AWS. We see these kernel messages a few times per day:

ena_com_prepare_tx() [TID:100766]: Not enough space in the tx queue
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100766]: Not enough space in the tx queue
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100378]: Not enough space in the tx queue
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100376]: Not enough space in the tx queue
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100844]: Not enough space in the tx queue
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100765]: Not enough space in the tx queue
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100363]: Not enough space in the tx queue
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100523]: Not enough space in the tx queue
ena0: failed to prepare tx bufs
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: device is going UP
ena0: link is UP
ena_com_prepare_tx() [TID:100401]: Not enough space in the tx queue
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100477]: Not enough space in the tx queue
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100634]: Not enough space in the tx queue
ena0: failed to prepare tx bufs

Leif, could you open a new PR for that and CC me? I'll get some people to look at it, but I think it's an unrelated issue so I don't want to force them to wade through this entire thread.

I'm experiencing packet loss with ENA and FreeBSD 12.0; relevant bug report here: bug #234754

(In reply to Colin Percival from comment #32)

You bet. Sorry, I was away for a few days. I opened https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=234838

I believe that this patch should fix the underlying problem, which is in the ENA driver: https://reviews.freebsd.org/D24546 -- if you're able to build a custom kernel, please test that patch and report results in that review or via email (cperciva@).

A commit references this bug:

Author: mw
Date: Thu May 7 11:28:40 UTC 2020
New revision: 360777
URL: https://svnweb.freebsd.org/changeset/base/360777

Log:
  Optimize ENA Rx refill for low memory conditions

  Sometimes, especially when there is not much memory in the system left,
  allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time
  and it is not guaranteed that it'll succeed. In that situation, the
  fallback will work, but if the refill needs to take a place for a lot of
  descriptors at once, the time spent in m_getjcl looking for memory can
  cause system unresponsiveness due to high priority of the Rx task. This
  can also lead to driver reset, because Tx cleanup routine is being
  blocked and timer service could detect that Tx packets aren't cleaned up.
  The reset routine can further create another unresponsiveness - Rx rings
  are being refilled there, so m_getjcl will again burn the CPU.
  This was causing NVMe driver timeouts and resets, because network driver
  is having higher priority.

  Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are
  enough - ENA MTU is being set to 9K anyway, so it's very unlikely that
  more space than 9KB will be needed. However, 9KB jumbo clusters can
  still cause issues, so by default the page size mbuf cluster will be
  used for the Rx descriptors. This can have a small (~2%) impact on the
  throughput of the device, so to restore original behavior, one must
  change sysctl "hw.ena.enable_9k_mbufs" to "1" in "/boot/loader.conf" file.

  As a part of this patch (important fix), the version of the driver was
  updated to v2.1.2.

Submitted by: cperciva
Reviewed by: Michal Krawczyk <mk@semihalf.com>
Reviewed by: Ido Segev <idose@amazon.com>
Reviewed by: Guy Tzalik <gtzalik@amazon.com>
MFC after: 3 days
PR: 225791, 234838, 235856, 236989, 243531
Differential Revision: https://reviews.freebsd.org/D24546

Changes:
  head/sys/dev/ena/ena.c
  head/sys/dev/ena/ena.h
  head/sys/dev/ena/ena_sysctl.c
  head/sys/dev/ena/ena_sysctl.h

A commit references this bug:

Author: mw
Date: Tue May 12 18:44:41 UTC 2020
New revision: 360985
URL: https://svnweb.freebsd.org/changeset/base/360985

Log:
  MFC r360777:
  Optimize ENA Rx refill for low memory conditions

  Sometimes, especially when there is not much memory in the system left,
  allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time
  and it is not guaranteed that it'll succeed. In that situation, the
  fallback will work, but if the refill needs to take a place for a lot of
  descriptors at once, the time spent in m_getjcl looking for memory can
  cause system unresponsiveness due to high priority of the Rx task. This
  can also lead to driver reset, because Tx cleanup routine is being
  blocked and timer service could detect that Tx packets aren't cleaned up.
  The reset routine can further create another unresponsiveness - Rx rings
  are being refilled there, so m_getjcl will again burn the CPU.
  This was causing NVMe driver timeouts and resets, because network driver
  is having higher priority.

  Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are
  enough - ENA MTU is being set to 9K anyway, so it's very unlikely that
  more space than 9KB will be needed. However, 9KB jumbo clusters can
  still cause issues, so by default the page size mbuf cluster will be
  used for the Rx descriptors. This can have a small (~2%) impact on the
  throughput of the device, so to restore original behavior, one must
  change sysctl "hw.ena.enable_9k_mbufs" to "1" in "/boot/loader.conf" file.

  As a part of this patch (important fix), the version of the driver was
  updated to v2.1.2.

Submitted by: cperciva
PR: 225791, 234838, 235856, 236989, 243531

Changes:
_U  stable/12/
  stable/12/sys/dev/ena/ena.c
  stable/12/sys/dev/ena/ena.h
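Per the commit message above, driver v2.1.2 defaults to page-size mbuf clusters for the Rx ring; operators who want the previous 9KB-cluster behavior back (accepting the low-memory allocation stalls the commit describes, in exchange for ~2% throughput) can set the loader tunable the patch introduces:

```
# /boot/loader.conf
# Restore pre-2.1.2 ENA Rx behavior (9KB mbuf clusters), per r360777.
hw.ena.enable_9k_mbufs="1"
```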