Summary: | AWS EC2 lockups "Missing interrupt" | ||||||
---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Charles O'Donnell <cao> | ||||
Component: | kern | Assignee: | freebsd-virtualization (Nobody) <virtualization> | ||||
Status: | Closed FIXED | ||||||
Severity: | Affects Only Me | CC: | cao, chuck, cperciva, freebsd, imp, otis, rgrimes | ||||
Priority: | --- | ||||||
Version: | 12.0-RELEASE | ||||||
Hardware: | amd64 | ||||||
OS: | Any | ||||||
Attachments: |
|
Description
Charles O'Donnell
2019-04-03 13:31:10 UTC
Pulling in Chuck Tuffli as am NVMe expert Created attachment 204824 [details]
syslog entries
Log entries.
I still have not come up with a way to reliably reproduce this error, but is continues to be a major issue on our productions systems. In the meantime, I thought I might include the excerpt above from /var/log/messages which includes some additional noise. I don't know if these entries are directly related or indirectly related. As an addendum, the system has been kept up to date with freebsd-update. FreeBSD 12.0-RELEASE-p4 GENERIC This issue continues to be a major problem with FreeBSD 12 on AWS EC2 on busy production servers. Add'l log excerpts. Please advise any other information that would be helpful. Aug 5 23:00:44 s4 kernel: nvme5: cpl does not map to outstanding cmd Aug 5 23:00:44 s4 kernel: cdw0:00000000 sqhd:0041 sqid:0007 cid:007d p:1 sc:00 sct:0 m:0 dnr:0 Aug 5 23:00:44 s4 kernel: nvme5: Missing interrupt Aug 5 23:00:44 s4 kernel: nvme5: Resetting controller due to a timeout. Aug 5 23:00:44 s4 kernel: nvme5: resetting controller Aug 5 23:00:44 s4 kernel: nvme5: aborting outstanding i/o Aug 5 23:00:44 s4 kernel: nvme5: WRITE sqid:1 cid:125 nsid:1 lba:10245542 len:256 Aug 5 23:00:44 s4 kernel: nvme5: ABORTED - BY REQUEST (00/07) sqid:1 cid:125 cdw0:0 Aug 5 23:00:44 s4 kernel: nvme5: aborting outstanding i/o Aug 5 23:00:44 s4 kernel: nvme5: WRITE sqid:1 cid:124 nsid:1 lba:10245798 len:48 Aug 5 23:00:44 s4 kernel: nvme5: ABORTED - BY REQUEST (00/07) sqid:1 cid:124 cdw0:0 Aug 5 23:00:44 s4 kernel: nvme5: resubmitting queued i/o Aug 5 23:00:44 s4 kernel: nvme5: WRITE sqid:1 cid:0 nsid:1 lba:10245846 len:256 Aug 5 23:00:44 s4 kernel: nvme5: aborting outstanding i/o Aug 5 23:00:44 s4 kernel: nvme5: WRITE sqid:7 cid:125 nsid:1 lba:10240126 len:256 Aug 5 23:00:44 s4 kernel: nvme5: ABORTED - BY REQUEST (00/07) sqid:7 cid:125 cdw0:0 Is there a workload that's known to trigger this bug? There's a few issues in -current that might not be in 12.0R yet that I'd like to see if I can trigger the bug and then try again with 13 to see if they are fixed and maybe get a patch into 12.0 Hello Warner, I have note been able to reliably reproduce the problem. There may be a connection with heavy disk activity. Hi Charles, Have you seen this recently? Ideally, on FreeBSD 12.1 or HEAD? Also, can you tell me whether you use AMD instances, and if yes whether any of them are affected? Hello Colin, Thank you for your interest. As of last week the issue was still present. The affected hosts are 12.0-RELEASE kept up to date with the most recent patch level using freebsd-update. I just updated the affected hosts to 12.1-RELEASE using 'freebsd-update upgrade'. If the issue is still present I expect to see it appear in 5-10 days. The original AMI was released by FreeBSD: ami-00e61b69944d5d2a3 FreeBSD 12.0-RELEASE-amd64-f5af2713-6d84-41f0-95ea-2eaee78f2af4-ami-03b0f822e17669866.4 FreeBSD/amd64 releng/12.0@341666 Let me know if there is any other information I can provide. Thank you. Best. (In reply to Charles O'Donnell from comment #10) Hi Charles, Please let me know ASAP if you experience this on 12.1-RELEASE. Some problems were fixed between 12.0 and 12.1 but I have some reports of problems still occurring on 12.1 -- so I'm trying to track down exactly what conditions are needed to cause issues with 12.1. Other details which I'll want to know (if it recurs with 12.1): * Instance type * Type of EBS volumes being used * Filesystem (all the reports I've had on 12.1 have been ZFS) * Workload * Whether there's significant network traffic (all the reports I've had on 12.1 have featured significant disk *and* network loads). Thanks! Instance type is c5d.2xlarge. EBS volume type is gp2. Filesystem is ZFS (mirror). Workload is HTTP, Database, Image processing. Significant network load and disk activity. Best. Most recent incident: Feb 27 04:23:31 s4 kernel: nvme4: Missing interrupt Feb 27 04:24:34 s4 syslogd: last message repeated 1 times Feb 27 04:24:34 s4 kernel: nvme1: Missing interrupt Feb 27 04:24:34 s4 kernel: nvme0: Missing interrupt Feb 27 04:24:34 s4 syslogd: last message repeated 1 times Feb 27 04:24:34 s4 kernel: nvme1: nvme1: cpl does not map to outstanding cmd Feb 27 04:24:34 s4 kernel: cdw0:00000000 sqhd:0012 sqid:0001 cid:0017 p:0 sc:00 sct:0 m:0 dnr:0 Feb 27 04:24:34 s4 kernel: nvme1: Missing interrupt Feb 27 04:24:34 s4 kernel: nvme4: nvme1: Missing interrupt Feb 27 04:24:34 s4 kernel: Missing interrupt Feb 27 04:24:34 s4 syslogd: last message repeated 1 times Feb 27 04:24:34 s4 kernel: nvme4: Missing interrupt Feb 27 04:24:34 s4 kernel: nvme0: Missing interrupt Feb 27 04:24:34 s4 kernel: nvme1: nvme1: nvme4: Missing interrupt Feb 27 04:24:34 s4 kernel: Missing interrupt # freebsd-version -ukr 12.1-RELEASE-p2 12.1-RELEASE-p2 12.1-RELEASE-p2 # ls -lahtuT /etc/periodic/daily/ total 128 -rwxr-xr-x 1 root wheel 811B Feb 27 03:05:34 2020 999.local -rwxr-xr-x 1 root wheel 2.8K Feb 27 03:05:34 2020 800.scrub-zfs -rwxr-xr-x 1 root wheel 845B Feb 27 03:05:34 2020 510.status-world-kernel -rwxr-xr-x 1 root wheel 737B Feb 27 03:05:34 2020 500.queuerun -rwxr-xr-x 1 root wheel 498B Feb 27 03:05:34 2020 480.status-ntpd -rwxr-xr-x 1 root wheel 451B Feb 27 03:05:34 2020 480.leapfile-ntpd -rwxr-xr-x 1 root wheel 2.0K Feb 27 03:05:34 2020 460.status-mail-rejects -rwxr-xr-x 1 root wheel 1.0K Feb 27 03:01:00 2020 450.status-security -rwxr-xr-x 1 root wheel 1.4K Feb 27 03:01:00 2020 440.status-mailq -rwxr-xr-x 1 root wheel 705B Feb 27 03:01:00 2020 430.status-uptime -rwxr-xr-x 1 root wheel 611B Feb 27 03:01:00 2020 420.status-network -rwxr-xr-x 1 root wheel 684B Feb 27 03:01:00 2020 410.status-mfi -rwxr-xr-x 1 root wheel 590B Feb 27 03:01:00 2020 409.status-gconcat -rwxr-xr-x 1 root wheel 590B Feb 27 03:01:00 2020 408.status-gstripe -rwxr-xr-x 1 root wheel 591B Feb 27 03:01:00 2020 407.status-graid3 -rwxr-xr-x 1 root wheel 596B Feb 27 03:01:00 2020 406.status-gmirror -rwxr-xr-x 1 root wheel 807B Feb 27 03:01:00 2020 404.status-zfs -rwxr-xr-x 1 root wheel 583B Feb 27 03:01:00 2020 401.status-graid -rwxr-xr-x 1 root wheel 773B Feb 27 03:01:00 2020 400.status-disks -rwxr-xr-x 1 root wheel 724B Feb 27 03:01:00 2020 330.news -r-xr-xr-x 1 root wheel 1.4K Feb 27 03:01:00 2020 310.accounting -rwxr-xr-x 1 root wheel 693B Feb 27 03:01:00 2020 300.calendar -rwxr-xr-x 1 root wheel 1.0K Feb 27 03:01:00 2020 210.backup-aliases -rwxr-xr-x 1 root wheel 1.7K Feb 27 03:01:00 2020 200.backup-passwd -rwxr-xr-x 1 root wheel 603B Feb 27 03:01:00 2020 150.clean-hoststat -rwxr-xr-x 1 root wheel 1.0K Feb 27 03:01:00 2020 140.clean-rwho -rwxr-xr-x 1 root wheel 709B Feb 27 03:01:00 2020 130.clean-msgs -rwxr-xr-x 1 root wheel 1.1K Feb 27 03:01:00 2020 120.clean-preserve -rwxr-xr-x 1 root wheel 1.5K Feb 27 03:01:00 2020 110.clean-tmps -rwxr-xr-x 1 root wheel 1.3K Feb 27 03:01:00 2020 100.clean-disks drwxr-xr-x 2 root wheel 1.0K Dec 7 01:23:36 2018 . drwxr-xr-x 6 root wheel 512B Dec 7 01:23:36 2018 .. I just ran into the "nvmeX: Missing interrupt" on FreeNAS under VMWare when testing something out tonight. It happened right at boot, and repeated quite often (to the point I couldn't even fully boot the OS after ~20 minutes). FreeNAS is still on 11.3-RELEASE. But this is a homelab system that I can easily build test cases with. nvme drives are Intel Optane M10 16GB. These are cheap on eBay right now. No data on them at all yet (empty partition table). I was using PCIe passthrough on ESXi-6.7.0-20190802001-standard. I'll start testing various 12.x and 13 kernels to see if conditions are repeatable on any of those with this config too (if it is, I'll post more details) Vincent: There are known issues with the nvme driver on 11.3 and 12.0. If you can reproduce this on 12.1 I'll be *very* interested but I think it's very likely the problem will go away when you upgrade. I actually get a kernel panic in 12.1, but it works in 13 https://twitter.com/DarkainMX/status/1239715138027962368 I believe that this patch should fix the underlying problem, which is in the ENA driver: https://reviews.freebsd.org/D24546 If you're able to build a custom kernel, please test that patch and report results in that review or via email (cperciva@). Hi Colin, patch applied. It may be as long as two weeks before I can report results. With regards to running these queries: vmstat -z | grep mbuf_jumbo; sysctl -a | grep mjum_alloc_fail Let me know if this should be run periodically, or only when the machine is experiencing the issue. Thank you. Best. Hi Charles, Can you run those periodically (maybe every hour)? What I'm expecting is that the sysctls will stay at zero, while the 9k mbuf "FAIL" count (second last value) will stay at zero or perhaps increase slowly, then jump up rapidly when the system would have gotten stuck. If the patch works as expected, the system *won't* get stuck at that point. ;-) Colin Percival A commit references this bug: Author: mw Date: Thu May 7 11:28:41 UTC 2020 New revision: 360777 URL: https://svnweb.freebsd.org/changeset/base/360777 Log: Optimize ENA Rx refill for low memory conditions Sometimes, especially when there is not much memory in the system left, allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time and it is not guaranteed that it'll succeed. In that situation, the fallback will work, but if the refill needs to take a place for a lot of descriptors at once, the time spent in m_getjcl looking for memory can cause system unresponsiveness due to high priority of the Rx task. This can also lead to driver reset, because Tx cleanup routine is being blocked and timer service could detect that Tx packets aren't cleaned up. The reset routine can further create another unresponsiveness - Rx rings are being refilled there, so m_getjcl will again burn the CPU. This was causing NVMe driver timeouts and resets, because network driver is having higher priority. Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are enough - ENA MTU is being set to 9K anyway, so it's very unlikely that more space than 9KB will be needed. However, 9KB jumbo clusters can still cause issues, so by default the page size mbuf cluster will be used for the Rx descriptors. This can have a small (~2%) impact on the throughput of the device, so to restore original behavior, one must change sysctl "hw.ena.enable_9k_mbufs" to "1" in "/boot/loader.conf" file. As a part of this patch (important fix), the version of the driver was updated to v2.1.2. Submitted by: cperciva Reviewed by: Michal Krawczyk <mk@semihalf.com> Reviewed by: Ido Segev <idose@amazon.com> Reviewed by: Guy Tzalik <gtzalik@amazon.com> MFC after: 3 days PR: 225791, 234838, 235856, 236989, 243531 Differential Revision: https://reviews.freebsd.org/D24546 Changes: head/sys/dev/ena/ena.c head/sys/dev/ena/ena.h head/sys/dev/ena/ena_sysctl.c head/sys/dev/ena/ena_sysctl.h Colin, At this point we are at 10 days uptime. Below are select excerpts from vmstat logging. It appears to follow the path you predicted. I can attach the full hourly log if it is of interest. Best. DAY 8 (first uptick of ===> Tue May 5 17:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 0, 31,40308027, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7668, 286,58743917, 7, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 DAY 9 ===> Wed May 6 04:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 97, 446,41133523, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7645, 412,59865306, 9, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 ===> Wed May 6 05:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 367, 449,41875931, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7253, 108,60877108, 241, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 ===> Wed May 6 06:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 34, 41,42115282, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7719, 267,61217632, 243, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 DAY 10 ===> Thu May 7 07:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 0, 65,45067065, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7647, 322,65796230,1640, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 Addendum to previous post, no downtime or freezes as of yet. (In reply to Charles O'Donnell from comment #22) Thanks! We ended up going with a slightly more conservative patch than the one I provided above (defaulting to 4k buffers rather than switching, since we weren't sure if we could switch fast enough) but your statistics are useful (and confirm the decision to go with the more conservative option -- you have 1640 failed 9k allocations which will have added some network latency during busy periods). I suggest waiting until we get the Errata Notice out and then switching over to the patch we send out with that. New development. See three notes below. N.B. the system appears to have fully recovered. Normally I would have expected a freeze. 1. CPU alarm from a custom AWS monitor at 16:43 UTC (12:43 PM ET): Alarm Details: - Name: Starch CPU - Description: - State Change: OK -> ALARM - Reason for State Change: Threshold Crossed: 1 datapoint [31.4 (07/05/20 16:38:00)] was greater than or equal to the threshold (30.0). - Timestamp: Thursday 07 May, 2020 16:43:35 UTC - AWS Account: 539612714288 - Alarm Arn: arn:aws:cloudwatch:us-east-1:539612714288:alarm:Starch CPU Threshold: - The alarm is in the ALARM state when the metric is GreaterThanOrEqualToThreshold 30.0 for 300 seconds. 2. Sudden jump in failed 9k mbufs between 12:00 and 13:00 ET: ===> Thu May 7 10:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 0, 56,45464111, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7538, 450,66361278,1640, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 ===> Thu May 7 11:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 16, 113,45658689, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7592, 397,66645310,1642, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 ===> Thu May 7 12:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 182, 31,45730287, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7461, 259,66753693,1693, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 ===> Thu May 7 13:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 119, 109,46249719, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7863, 207,67594999,2577, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 3: ena0 reset at 12:43 ET: May 7 12:43:19 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 7 12:43:19 s4 kernel: ena0: Trigger reset is on May 7 12:43:19 s4 kernel: ena0: device is going DOWN May 7 12:43:22 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x319ena0: free uncompleted tx mbuf qid 7 idx 0x173 May 7 12:43:23 s4 kernel: ena0: ena0: device is going UP May 7 12:43:23 s4 kernel: link is UP May 7 12:45:00 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 7 12:45:00 s4 kernel: ena0: Trigger reset is on May 7 12:45:00 s4 kernel: ena0: device is going DOWN May 7 12:45:04 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x102 May 7 12:45:04 s4 kernel: ena0: ena0: device is going UP May 7 12:45:04 s4 kernel: link is UP May 7 12:45:26 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 7 12:45:26 s4 kernel: ena0: Trigger reset is on May 7 12:45:26 s4 kernel: ena0: device is going DOWN May 7 12:45:29 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x3c7ena0: free uncompleted tx mbuf qid 2 idx 0x2c5ena0: free uncompleted tx mbuf qid 6 idx 0x2abena0: free uncompleted tx mbuf qid 7 idx 0x241 May 7 12:45:30 s4 kernel: May 7 12:45:30 s4 kernel: stray irq265 May 7 12:45:30 s4 kernel: ena0: ena0: device is going UP May 7 12:45:30 s4 kernel: link is UP May 7 12:46:05 s4 kernel: ena0: Keep alive watchdog timeout. May 7 12:46:05 s4 kernel: ena0: Trigger reset is on May 7 12:46:05 s4 kernel: ena0: device is going DOWN May 7 12:46:07 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x123ena0: free uncompleted tx mbuf qid 3 idx 0xeeena0: free uncompleted tx mbuf qid 6 idx 0x208 May 7 12:46:08 s4 kernel: ena0: ena0: device is going UP May 7 12:46:08 s4 kernel: link is UP May 7 12:46:36 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 7 12:46:36 s4 kernel: ena0: Trigger reset is on May 7 12:46:36 s4 kernel: ena0: device is going DOWN May 7 12:46:37 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x2c2ena0: free uncompleted tx mbuf qid 1 idx 0x135ena0: free uncompleted tx mbuf qid 2 idx 0xeeena0: free uncompleted tx mbuf qid 3 idx 0x373ena0: free uncompleted tx mbuf qid 4 idx 0x88ena0: free uncompleted t> May 7 12:46:38 s4 kernel: ena0: ena0: device is going UP May 7 12:46:38 s4 kernel: link is UP Aha! That's exactly the "not switching from 9k to 4k fast enough" case we were worried about. Good to know that we were justified in going with the more conservative option. Not sure if this is expected behavior with the patched kernel. Picking up from day 6 above: ===> Thu May 7 12:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 182, 31,45730287, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7461, 259,66753693,1693, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 ===> Fri May 8 12:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 82, 71,52207112, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7551, 351,75614231,2625, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 ===> Sat May 9 12:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 139, 69,56246475, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7799, 339,83036796,2784, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 ===> Sun May 10 12:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 39, 51,59639876, 0, 0 mbuf_jumbo_9k: 9216, 145465, 7663, 222,85652082,2963, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 ===> Mon May 11 12:00:00 EDT 2020 mbuf_jumbo_page: 4096, 490945, 1598, 223,63502113, 0, 0 mbuf_jumbo_9k: 9216, 145465, 6230, 90,91087037,4248, 0 mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0 dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0 dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0 System spinning. Had to reboot. Related /var/log/messages activity show below. May 11 12:20:34 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:20:34 s4 kernel: ena0: Trigger reset is on May 11 12:20:34 s4 kernel: ena0: device is going DOWN May 11 12:20:39 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x2e1 May 11 12:20:40 s4 kernel: ena0: device is going UP May 11 12:20:40 s4 kernel: ena0: link is UP May 11 12:21:55 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:21:55 s4 kernel: ena0: Trigger reset is on May 11 12:21:55 s4 kernel: ena0: device is going DOWN May 11 12:22:02 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x172ena0: free uncompleted tx mbuf qid 3 idx 0x66ena0: free> May 11 12:22:03 s4 kernel: May 11 12:22:03 s4 kernel: stray irq265 May 11 12:22:03 s4 kernel: ena0: ena0: device is going UP May 11 12:22:03 s4 kernel: link is UP May 11 12:22:29 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:22:29 s4 kernel: ena0: Trigger reset is on May 11 12:22:29 s4 kernel: ena0: device is going DOWN May 11 12:22:32 s4 kernel: ena0: free uncompleted tx mbuf qid 2 idx 0x32bena0: free uncompleted tx mbuf qid 4 idx 0x338ena0: fre> May 11 12:22:33 s4 kernel: May 11 12:22:33 s4 kernel: stray irq265 May 11 12:22:33 s4 kernel: ena0: ena0: device is going UP May 11 12:22:33 s4 kernel: link is UP May 11 12:23:04 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:23:04 s4 kernel: ena0: Trigger reset is on May 11 12:23:04 s4 kernel: ena0: device is going DOWN May 11 12:23:11 s4 kernel: ena0: free uncompleted tx mbuf qid 6 idx 0x10e May 11 12:23:12 s4 kernel: ena0: ena0: device is going UP May 11 12:23:12 s4 kernel: link is UP May 11 12:25:33 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:25:33 s4 kernel: ena0: Trigger reset is on May 11 12:25:33 s4 kernel: ena0: device is going DOWN May 11 12:25:38 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x0 May 11 12:25:39 s4 kernel: ena0: ena0: device is going UP May 11 12:25:39 s4 kernel: link is UP May 11 12:28:36 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:28:36 s4 kernel: ena0: Trigger reset is on May 11 12:28:36 s4 kernel: ena0: device is going DOWN May 11 12:28:39 s4 kernel: ena0: device is going UP May 11 12:28:39 s4 kernel: ena0: link is UP May 11 12:30:04 s4 kernel: nvme4: cpl does not map to outstanding cmd May 11 12:30:04 s4 kernel: cdw0:00000000 sqhd:0013 sqid:0002 cid:0012 p:0 sc:00 sct:0 m:0 dnr:0 May 11 12:30:04 s4 kernel: nvme4: Missing interrupt May 11 12:30:04 s4 kernel: nvme1: Missing interrupt May 11 12:30:34 s4 kernel: nvme4: Resetting controller due to a timeout. May 11 12:30:34 s4 kernel: nvme4: resetting controller May 11 12:30:47 s4 kernel: nvme4: temperature threshold not supported May 11 12:30:47 s4 kernel: nvme4: aborting outstanding i/o May 11 12:30:47 s4 syslogd: last message repeated 3 times May 11 12:31:26 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:31:26 s4 kernel: ena0: Trigger reset is on May 11 12:31:26 s4 kernel: ena0: device is going DOWN May 11 12:31:27 s4 kernel: ena0: free uncompleted tx mbuf qid 2 idx 0x20fena0: free uncompleted tx mbuf qid 4 idx 0x8fena0: free> May 11 12:31:28 s4 kernel: May 11 12:31:28 s4 kernel: stray irq266 May 11 12:31:28 s4 kernel: stray irq272 May 11 12:31:28 s4 kernel: stray irq265 May 11 12:31:28 s4 kernel: ena0: device is going UP May 11 12:31:28 s4 kernel: ena0: link is UP May 11 12:32:46 s4 kernel: nvme1: Missing interrupt May 11 12:32:46 s4 kernel: nvme4: cpl does not map to outstanding cmd May 11 12:32:46 s4 kernel: cdw0:00000000 sqhd:0007 sqid:0001 cid:0015 p:1 sc:00 sct:0 m:0 dnr:0 May 11 12:32:46 s4 kernel: nvme4: Missing interrupt May 11 12:32:46 s4 kernel: nvme2: Missing interrupt May 11 12:33:16 s4 kernel: nvme4: Resetting controller due to a timeout. May 11 12:33:16 s4 kernel: nvme4: resetting controller May 11 12:33:17 s4 kernel: nvme4: temperature threshold not supported May 11 12:33:17 s4 kernel: nvme4: aborting outstanding i/o May 11 12:33:17 s4 syslogd: last message repeated 2 times May 11 12:33:40 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:33:40 s4 kernel: ena0: Trigger reset is on May 11 12:33:40 s4 kernel: ena0: device is going DOWN May 11 12:33:43 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x0 May 11 12:33:44 s4 kernel: ena0: ena0: device is going UP May 11 12:33:44 s4 kernel: link is UP May 11 12:34:15 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:34:15 s4 kernel: ena0: Trigger reset is on May 11 12:34:15 s4 kernel: ena0: device is going DOWN May 11 12:34:21 s4 kernel: ena0: ena0: device is going UP May 11 12:34:21 s4 kernel: link is UP May 11 12:35:33 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:35:33 s4 kernel: ena0: Trigger reset is on May 11 12:35:33 s4 kernel: ena0: device is going DOWN May 11 12:35:42 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x34 May 11 12:35:43 s4 kernel: ena0: ena0: device is going UP May 11 12:35:43 s4 kernel: link is UP May 11 12:36:16 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:36:16 s4 kernel: ena0: Trigger reset is on May 11 12:36:16 s4 kernel: ena0: device is going DOWN May 11 12:36:21 s4 kernel: ena0: free uncompleted tx mbuf qid 7 idx 0x20a May 11 12:36:22 s4 kernel: ena0: ena0: device is going UP May 11 12:36:22 s4 kernel: link is UP May 11 12:37:05 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:37:05 s4 kernel: ena0: Trigger reset is on May 11 12:37:05 s4 kernel: ena0: device is going DOWN May 11 12:37:09 s4 kernel: ena0: free uncompleted tx mbuf qid 4 idx 0x23f May 11 12:37:10 s4 kernel: ena0: ena0: device is going UP May 11 12:37:10 s4 kernel: link is UP May 11 12:38:38 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:38:38 s4 kernel: ena0: Trigger reset is on May 11 12:38:38 s4 kernel: ena0: device is going DOWN May 11 12:38:43 s4 kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x1b3 May 11 12:38:44 s4 kernel: May 11 12:38:44 s4 kernel: too many stray irq 265's: not logging anymore May 11 12:38:44 s4 kernel: ena0: ena0: device is going UP May 11 12:38:44 s4 kernel: link is UP May 11 12:39:17 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:39:17 s4 kernel: ena0: Trigger reset is on May 11 12:39:17 s4 kernel: ena0: device is going DOWN May 11 12:39:25 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x238ena0: free uncompleted tx mbuf qid 6 idx 0x249 May 11 12:39:26 s4 kernel: ena0: ena0: device is going UP May 11 12:39:26 s4 kernel: link is UP May 11 12:40:20 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:40:20 s4 kernel: ena0: Trigger reset is on May 11 12:40:20 s4 kernel: ena0: device is going DOWN May 11 12:40:24 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x3c0 May 11 12:40:25 s4 kernel: ena0: device is going UP May 11 12:40:25 s4 kernel: ena0: link is UP May 11 12:41:17 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:41:17 s4 kernel: ena0: Trigger reset is on May 11 12:41:17 s4 kernel: ena0: device is going DOWN May 11 12:41:22 s4 kernel: ena0: device is going UP May 11 12:41:22 s4 kernel: ena0: link is UP May 11 12:42:17 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:42:17 s4 kernel: ena0: Trigger reset is on May 11 12:42:17 s4 kernel: ena0: device is going DOWN May 11 12:42:17 s4 sm-msp-queue[70209]: 04BGPgiQ066459: SYSERR(root): timeout writing message to [127.0.0.1] May 11 12:42:21 s4 kernel: ena0: ena0: device is going UP May 11 12:42:21 s4 kernel: link is UP May 11 12:43:01 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:43:01 s4 kernel: ena0: Trigger reset is on May 11 12:43:01 s4 kernel: ena0: device is going DOWN May 11 12:43:06 s4 kernel: ena0: free uncompleted tx mbuf qid 2 idx 0x359 May 11 12:43:07 s4 kernel: May 11 12:43:07 s4 kernel: stray irq266 May 11 12:43:07 s4 kernel: ena0: ena0: device is going UP May 11 12:43:07 s4 kernel: link is UP May 11 12:45:23 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:45:23 s4 kernel: ena0: Trigger reset is on May 11 12:45:23 s4 kernel: ena0: device is going DOWN May 11 12:45:33 s4 kernel: ena0: ena0: device is going UP May 11 12:45:33 s4 kernel: link is UP May 11 12:48:08 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:48:08 s4 kernel: ena0: Trigger reset is on May 11 12:48:08 s4 kernel: ena0: device is going DOWN May 11 12:48:11 s4 kernel: ena0: ena0: device is going UP May 11 12:48:11 s4 kernel: link is UP May 11 12:48:51 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:48:51 s4 kernel: ena0: Trigger reset is on May 11 12:48:51 s4 kernel: ena0: device is going DOWN May 11 12:48:55 s4 kernel: ena0: free uncompleted tx mbuf qid 7 idx 0x266 May 11 12:48:55 s4 kernel: ena0: ena0: device is going UP May 11 12:48:55 s4 kernel: link is UP May 11 12:50:38 s4 kernel: ena0: Keep alive watchdog timeout. May 11 12:50:38 s4 kernel: ena0: Trigger reset is on May 11 12:50:38 s4 kernel: ena0: device is going DOWN May 11 12:50:41 s4 kernel: ena0: ena0: device is going UP May 11 12:50:41 s4 kernel: link is UP May 11 12:51:43 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:51:43 s4 kernel: ena0: Trigger reset is on May 11 12:51:43 s4 kernel: ena0: device is going DOWN May 11 12:51:50 s4 kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x358 May 11 12:51:51 s4 kernel: ena0: ena0: device is going UP May 11 12:51:51 s4 kernel: link is UP May 11 12:53:24 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:53:24 s4 kernel: ena0: Trigger reset is on May 11 12:53:24 s4 kernel: ena0: device is going DOWN May 11 12:53:28 s4 kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x100 May 11 12:53:29 s4 kernel: ena0: ena0: device is going UP May 11 12:53:29 s4 kernel: link is UP May 11 12:56:19 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device May 11 12:56:19 s4 kernel: ena0: Trigger reset is on May 11 12:56:19 s4 kernel: ena0: device is going DOWN May 11 12:56:22 s4 kernel: ena0: free uncompleted tx mbuf qid 4 idx 0x3d8 May 11 12:56:23 s4 kernel: ena0: ena0: device is going UP May 11 12:56:23 s4 kernel: link is UP May 11 12:57:36 s4 kernel: . A commit references this bug: Author: mw Date: Tue May 12 18:44:42 UTC 2020 New revision: 360985 URL: https://svnweb.freebsd.org/changeset/base/360985 Log: MFC r360777: Optimize ENA Rx refill for low memory conditions Sometimes, especially when there is not much memory in the system left, allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time and it is not guaranteed that it'll succeed. In that situation, the fallback will work, but if the refill needs to take a place for a lot of descriptors at once, the time spent in m_getjcl looking for memory can cause system unresponsiveness due to high priority of the Rx task. This can also lead to driver reset, because Tx cleanup routine is being blocked and timer service could detect that Tx packets aren't cleaned up. The reset routine can further create another unresponsiveness - Rx rings are being refilled there, so m_getjcl will again burn the CPU. This was causing NVMe driver timeouts and resets, because network driver is having higher priority. Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are enough - ENA MTU is being set to 9K anyway, so it's very unlikely that more space than 9KB will be needed. However, 9KB jumbo clusters can still cause issues, so by default the page size mbuf cluster will be used for the Rx descriptors. This can have a small (~2%) impact on the throughput of the device, so to restore original behavior, one must change sysctl "hw.ena.enable_9k_mbufs" to "1" in "/boot/loader.conf" file. As a part of this patch (important fix), the version of the driver was updated to v2.1.2. Submitted by: cperciva PR: 225791, 234838, 235856, 236989, 243531 Changes: _U stable/12/ stable/12/sys/dev/ena/ena.c stable/12/sys/dev/ena/ena.h (In reply to Charles O'Donnell from comment #26) It's slightly surprising that your system is getting loaded enough to cause the errors above -- but it may be that each memory allocation failure is taking longer since your system is much larger than the t3 instance I was able to reproduce this on earlier. In any case, it you apply the patch which went into stable/12 (r360985) instead of the patch I sent you earlier, this issue should go away. (If it doesn't, we DEFINITELY want to know -- this patch will be going out as an Errata Notice for FreeBSD 12.1.) Colin Percival FWIW, I've just run into the same issue on bare metal, Dell r740xd, with 20xNVME and 2xSSD, running 1-month-old STABLE. Any info that I can provide to help investigation, please? (In reply to Juraj Lutter from comment #29) This was a specific issue with EC2 instances (with the ENA driver in particular) which has now been fixed. Indeed. Corrected and gone: 6:41AM up 93 days, 23:21, 1 user, load averages: 0.80, 0.68, 0.70 Above uptime from an EC2 instance that previously couldn't get past 4-5 days of uptime. Cheers Colin. |