Bug 236989

Summary: AWS EC2 lockups "Missing interrupt"
Product: Base System
Reporter: Charles O'Donnell <cao>
Component: kern
Assignee: freebsd-virtualization (Nobody) <virtualization>
Status: Closed FIXED
Severity: Affects Only Me
CC: cao, chuck, cperciva, freebsd, imp, otis, rgrimes
Priority: ---    
Version: 12.0-RELEASE   
Hardware: amd64   
OS: Any   
Attachments:
  syslog entries (flags: none)

Description Charles O'Donnell 2019-04-03 13:31:10 UTC
I am experiencing lockups on a production c5d.2xlarge instance running FreeBSD 12.0-RELEASE. Frequency is about once a week.

The harbinger of these lockups is the appearance of "nvmeX: Missing interrupt" in the logs:

Apr  3 00:56:32 host kernel: nvme0: Missing interrupt
Apr  3 00:57:43 host syslogd: last message repeated 1 times
Apr  3 00:58:43 host kernel: nvme4: Missing interrupt
Apr  3 00:58:43 host kernel: nvme1: Missing interrupt
Apr  3 00:58:43 host kernel: nvme0: Missing interrupt
Apr  3 00:58:43 host kernel: nvme4: Missing interrupt
Apr  3 00:58:43 host kernel: nvme1: Missing interrupt
Apr  3 00:58:43 host kernel: nvme0: Missing interrupt
Apr  3 00:59:43 host kernel: nvme4: Missing interrupt
Apr  3 00:59:43 host kernel: nvme1: Missing interrupt
Apr  3 00:59:43 host kernel: nvme0: Missing interrupt
Apr  3 00:59:43 host kernel: nvme1: nvme4: Missing interrupt
Apr  3 00:59:43 host kernel: Missing interrupt
Apr  3 00:59:43 host kernel: nvme0: Missing interrupt
Apr  3 00:59:44 host kernel: nvme1: Missing interrupt
Apr  3 01:00:05 host kernel: nvme0: Missing interrupt
Apr  3 01:20:01 host kernel: nvme0: 
Apr  3 01:20:01 host kernel: Missing interrupt
Apr  3 01:22:10 host kernel: sonewconn: pcb 0xfffff802988adb00: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
Apr  3 01:24:33 host kernel: sonewconn: pcb 0xfffff802988adb00: Listen queue overflow: 151 already in queue awaiting acceptance (6 occurrences)
Apr  3 01:25:35 host kernel: sonewconn: pcb 0xfffff802988adb00: Listen queue overflow: 151 already in queue awaiting acceptance (4 occurrences)
Apr  3 01:26:45 host syslogd: last message repeated 1 times
Apr  3 01:27:49 host syslogd: last message repeated 1 times


Within a few hours the machine becomes unresponsive, with the CPU pegged at 100% and high disk reads and writes. It will not respond to an EC2 "stop" command and requires a forced (hard) reset.


c5d.2xlarge
FreeBSD 12.0-RELEASE-p3 GENERIC amd64
ZFS is in use on some drives, but not all.

I am running several instances with this same configuration, but only one of them has had this issue so far, and it happens to be the host that has the highest disk activity.
Comment 1 Rodney W. Grimes freebsd_committer freebsd_triage 2019-04-22 21:50:39 UTC
Pulling in Chuck Tuffli as an NVMe expert.
Comment 2 Charles O'Donnell 2019-06-04 20:27:51 UTC
Created attachment 204824 [details]
syslog entries

Log entries.
Comment 3 Charles O'Donnell 2019-06-04 20:28:24 UTC
I still have not come up with a way to reliably reproduce this error, but it continues to be a major issue on our production systems.

In the meantime, I thought I might include the excerpt above from /var/log/messages, which includes some additional noise. I don't know whether these entries are directly or only indirectly related.
Comment 4 Charles O'Donnell 2019-06-04 20:33:54 UTC
As an addendum, the system has been kept up to date with freebsd-update.

FreeBSD 12.0-RELEASE-p4 GENERIC
Comment 5 Charles O'Donnell 2019-08-06 03:30:35 UTC
This issue continues to be a major problem with FreeBSD 12 on AWS EC2 on busy production servers.
Comment 6 Charles O'Donnell 2019-08-06 13:13:16 UTC
Additional log excerpts below. Please let me know what other information would be helpful.


Aug  5 23:00:44 s4 kernel: nvme5: cpl does not map to outstanding cmd
Aug  5 23:00:44 s4 kernel: cdw0:00000000 sqhd:0041 sqid:0007 cid:007d p:1 sc:00 sct:0 m:0 dnr:0
Aug  5 23:00:44 s4 kernel: nvme5: Missing interrupt
Aug  5 23:00:44 s4 kernel: nvme5: Resetting controller due to a timeout.
Aug  5 23:00:44 s4 kernel: nvme5: resetting controller
Aug  5 23:00:44 s4 kernel: nvme5: aborting outstanding i/o
Aug  5 23:00:44 s4 kernel: nvme5: WRITE sqid:1 cid:125 nsid:1 lba:10245542 len:256
Aug  5 23:00:44 s4 kernel: nvme5: ABORTED - BY REQUEST (00/07) sqid:1 cid:125 cdw0:0
Aug  5 23:00:44 s4 kernel: nvme5: aborting outstanding i/o
Aug  5 23:00:44 s4 kernel: nvme5: WRITE sqid:1 cid:124 nsid:1 lba:10245798 len:48
Aug  5 23:00:44 s4 kernel: nvme5: ABORTED - BY REQUEST (00/07) sqid:1 cid:124 cdw0:0
Aug  5 23:00:44 s4 kernel: nvme5: resubmitting queued i/o
Aug  5 23:00:44 s4 kernel: nvme5: WRITE sqid:1 cid:0 nsid:1 lba:10245846 len:256
Aug  5 23:00:44 s4 kernel: nvme5: aborting outstanding i/o
Aug  5 23:00:44 s4 kernel: nvme5: WRITE sqid:7 cid:125 nsid:1 lba:10240126 len:256
Aug  5 23:00:44 s4 kernel: nvme5: ABORTED - BY REQUEST (00/07) sqid:7 cid:125 cdw0:0
Comment 7 Warner Losh freebsd_committer freebsd_triage 2019-09-03 04:51:13 UTC
Is there a workload that's known to trigger this bug? There are a few fixes in -current that might not be in 12.0R yet. I'd like to trigger the bug, then try again with 13 to see whether they fix it, and maybe get a patch into 12.0.
Comment 8 Charles O'Donnell 2019-09-03 13:49:11 UTC
Hello Warner, I have not been able to reliably reproduce the problem. There may be a connection with heavy disk activity.
Comment 9 Colin Percival freebsd_committer freebsd_triage 2020-02-15 22:22:07 UTC
Hi Charles,

Have you seen this recently?  Ideally, on FreeBSD 12.1 or HEAD?

Also, can you tell me whether you use AMD instances and, if so, whether any of them are affected?
Comment 10 Charles O'Donnell 2020-02-16 20:39:44 UTC
Hello Colin,

Thank you for your interest.

As of last week the issue was still present. The affected hosts are 12.0-RELEASE kept up to date with the most recent patch level using freebsd-update.

I just updated the affected hosts to 12.1-RELEASE using 'freebsd-update upgrade'. If the issue is still present I expect to see it appear in 5-10 days.
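
For reference, the standard freebsd-update binary-upgrade flow looks roughly like this (a sketch only; exact prompts and the post-reboot install step may vary):

  freebsd-update -r 12.1-RELEASE upgrade
  freebsd-update install
  shutdown -r now
  # after the reboot, run "freebsd-update install" again to finish updating userland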

The original AMI was released by FreeBSD:

ami-00e61b69944d5d2a3
FreeBSD 12.0-RELEASE-amd64-f5af2713-6d84-41f0-95ea-2eaee78f2af4-ami-03b0f822e17669866.4
FreeBSD/amd64 releng/12.0@341666

Let me know if there is any other information I can provide.

Thank you.

Best.
Comment 11 Colin Percival freebsd_committer freebsd_triage 2020-02-16 20:47:09 UTC
(In reply to Charles O'Donnell from comment #10)

Hi Charles,

Please let me know ASAP if you experience this on 12.1-RELEASE.  Some problems were fixed between 12.0 and 12.1 but I have some reports of problems still occurring on 12.1 -- so I'm trying to track down exactly what conditions are needed to cause issues with 12.1.

Other details which I'll want to know (if it recurs with 12.1):

* Instance type
* Type of EBS volumes being used
* Filesystem (all the reports I've had on 12.1 have been ZFS)
* Workload
* Whether there's significant network traffic (all the reports I've had on 12.1 have featured significant disk *and* network loads).

Thanks!
Comment 12 Charles O'Donnell 2020-02-16 20:59:32 UTC
Instance type is c5d.2xlarge.
EBS volume type is gp2.
Filesystem is ZFS (mirror).
Workload is HTTP, Database, Image processing.
Significant network load and disk activity.

Best.
Comment 13 Charles O'Donnell 2020-02-27 12:59:25 UTC
Most recent incident:

Feb 27 04:23:31 s4 kernel: nvme4: Missing interrupt
Feb 27 04:24:34 s4 syslogd: last message repeated 1 times
Feb 27 04:24:34 s4 kernel: nvme1: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme0: Missing interrupt
Feb 27 04:24:34 s4 syslogd: last message repeated 1 times
Feb 27 04:24:34 s4 kernel: nvme1: nvme1: cpl does not map to outstanding cmd
Feb 27 04:24:34 s4 kernel: cdw0:00000000 sqhd:0012 sqid:0001 cid:0017 p:0 sc:00 sct:0 m:0 dnr:0
Feb 27 04:24:34 s4 kernel: nvme1: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme4: nvme1: Missing interrupt
Feb 27 04:24:34 s4 kernel: Missing interrupt
Feb 27 04:24:34 s4 syslogd: last message repeated 1 times
Feb 27 04:24:34 s4 kernel: nvme4: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme0: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme1: nvme1: nvme4: Missing interrupt
Feb 27 04:24:34 s4 kernel: Missing interrupt

# freebsd-version -ukr
12.1-RELEASE-p2
12.1-RELEASE-p2
12.1-RELEASE-p2

# ls -lahtuT /etc/periodic/daily/
total 128
-rwxr-xr-x  1 root  wheel   811B Feb 27 03:05:34 2020 999.local
-rwxr-xr-x  1 root  wheel   2.8K Feb 27 03:05:34 2020 800.scrub-zfs
-rwxr-xr-x  1 root  wheel   845B Feb 27 03:05:34 2020 510.status-world-kernel
-rwxr-xr-x  1 root  wheel   737B Feb 27 03:05:34 2020 500.queuerun
-rwxr-xr-x  1 root  wheel   498B Feb 27 03:05:34 2020 480.status-ntpd
-rwxr-xr-x  1 root  wheel   451B Feb 27 03:05:34 2020 480.leapfile-ntpd
-rwxr-xr-x  1 root  wheel   2.0K Feb 27 03:05:34 2020 460.status-mail-rejects
-rwxr-xr-x  1 root  wheel   1.0K Feb 27 03:01:00 2020 450.status-security
-rwxr-xr-x  1 root  wheel   1.4K Feb 27 03:01:00 2020 440.status-mailq
-rwxr-xr-x  1 root  wheel   705B Feb 27 03:01:00 2020 430.status-uptime
-rwxr-xr-x  1 root  wheel   611B Feb 27 03:01:00 2020 420.status-network
-rwxr-xr-x  1 root  wheel   684B Feb 27 03:01:00 2020 410.status-mfi
-rwxr-xr-x  1 root  wheel   590B Feb 27 03:01:00 2020 409.status-gconcat
-rwxr-xr-x  1 root  wheel   590B Feb 27 03:01:00 2020 408.status-gstripe
-rwxr-xr-x  1 root  wheel   591B Feb 27 03:01:00 2020 407.status-graid3
-rwxr-xr-x  1 root  wheel   596B Feb 27 03:01:00 2020 406.status-gmirror
-rwxr-xr-x  1 root  wheel   807B Feb 27 03:01:00 2020 404.status-zfs
-rwxr-xr-x  1 root  wheel   583B Feb 27 03:01:00 2020 401.status-graid
-rwxr-xr-x  1 root  wheel   773B Feb 27 03:01:00 2020 400.status-disks
-rwxr-xr-x  1 root  wheel   724B Feb 27 03:01:00 2020 330.news
-r-xr-xr-x  1 root  wheel   1.4K Feb 27 03:01:00 2020 310.accounting
-rwxr-xr-x  1 root  wheel   693B Feb 27 03:01:00 2020 300.calendar
-rwxr-xr-x  1 root  wheel   1.0K Feb 27 03:01:00 2020 210.backup-aliases
-rwxr-xr-x  1 root  wheel   1.7K Feb 27 03:01:00 2020 200.backup-passwd
-rwxr-xr-x  1 root  wheel   603B Feb 27 03:01:00 2020 150.clean-hoststat
-rwxr-xr-x  1 root  wheel   1.0K Feb 27 03:01:00 2020 140.clean-rwho
-rwxr-xr-x  1 root  wheel   709B Feb 27 03:01:00 2020 130.clean-msgs
-rwxr-xr-x  1 root  wheel   1.1K Feb 27 03:01:00 2020 120.clean-preserve
-rwxr-xr-x  1 root  wheel   1.5K Feb 27 03:01:00 2020 110.clean-tmps
-rwxr-xr-x  1 root  wheel   1.3K Feb 27 03:01:00 2020 100.clean-disks
drwxr-xr-x  2 root  wheel   1.0K Dec  7 01:23:36 2018 .
drwxr-xr-x  6 root  wheel   512B Dec  7 01:23:36 2018 ..
Comment 14 Vincent Milum Jr 2020-03-15 06:16:59 UTC
I just ran into the "nvmeX: Missing interrupt" on FreeNAS under VMWare when testing something out tonight. It happened right at boot, and repeated quite often (to the point I couldn't even fully boot the OS after ~20 minutes). FreeNAS is still on 11.3-RELEASE. But this is a homelab system that I can easily build test cases with.

nvme drives are Intel Optane M10 16GB. These are cheap on eBay right now. No data on them at all yet (empty partition table). I was using PCIe passthrough on ESXi-6.7.0-20190802001-standard.

I'll start testing various 12.x and 13 kernels to see whether the issue is reproducible on any of those with this config too (if it is, I'll post more details).
Comment 15 Colin Percival freebsd_committer freebsd_triage 2020-03-15 16:46:20 UTC
Vincent: There are known issues with the nvme driver on 11.3 and 12.0.  If you can reproduce this on 12.1 I'll be *very* interested but I think it's very likely the problem will go away when you upgrade.
Comment 16 Vincent Milum Jr 2020-03-17 00:56:44 UTC
I actually get a kernel panic in 12.1, but it works in 13

https://twitter.com/DarkainMX/status/1239715138027962368
Comment 17 Colin Percival freebsd_committer freebsd_triage 2020-04-22 23:35:47 UTC
I believe that this patch should fix the underlying problem, which is in the ENA driver: https://reviews.freebsd.org/D24546

If you're able to build a custom kernel, please test that patch and report results in that review or via email (cperciva@).
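
For anyone following along, a rough sketch of one way to test a review patch like this with a source-built GENERIC kernel (the diff path and -j value are only examples, and the patch -p level depends on how the diff was exported):

  # raw diff from the review saved as /tmp/D24546.diff (example path)
  cd /usr/src
  patch -p0 < /tmp/D24546.diff
  make -j8 buildkernel KERNCONF=GENERIC
  make installkernel KERNCONF=GENERIC
  shutdown -r now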
Comment 18 Charles O'Donnell 2020-04-27 11:47:50 UTC
Hi Colin, patch applied. It may be as long as two weeks before I can report results.

With regard to running these queries:

  vmstat -z | grep mbuf_jumbo; sysctl -a | grep mjum_alloc_fail

Let me know if this should be run periodically, or only when the machine is experiencing the issue.

Thank you.

Best.
Comment 19 Colin Percival freebsd_committer freebsd_triage 2020-04-27 20:08:25 UTC
Hi Charles,

Can you run those periodically (maybe every hour)?  What I'm expecting is that the sysctls will stay at zero, while the 9k mbuf "FAIL" count (second-to-last value) will stay at zero or perhaps increase slowly, then jump up rapidly when the system would otherwise have gotten stuck.
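
A minimal sketch for collecting this hourly from root's crontab (the log path is only an example):

  # crontab entry: append a timestamped sample every hour
  0 * * * * ( date; vmstat -z | grep mbuf_jumbo; sysctl -a | grep mjum_alloc_fail ) >> /var/log/mbuf-stats.log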

If the patch works as expected, the system *won't* get stuck at that point. ;-)

Colin Percival
Comment 20 commit-hook freebsd_committer freebsd_triage 2020-05-07 11:28:52 UTC
A commit references this bug:

Author: mw
Date: Thu May  7 11:28:41 UTC 2020
New revision: 360777
URL: https://svnweb.freebsd.org/changeset/base/360777

Log:
  Optimize ENA Rx refill for low memory conditions

  Sometimes, especially when there is not much memory in the system left,
  allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time
  and it is not guaranteed that it'll succeed. In that situation, the
  fallback will work, but if the refill needs to take a place for a lot of
  descriptors at once, the time spent in m_getjcl looking for memory can
  cause system unresponsiveness due to high priority of the Rx task. This
  can also lead to driver reset, because Tx cleanup routine is being
  blocked and timer service could detect that Tx packets aren't cleaned
  up. The reset routine can further create another unresponsiveness - Rx
  rings are being refilled there, so m_getjcl will again burn the CPU.
  This was causing NVMe driver timeouts and resets, because network driver
  is having higher priority.

  Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are
  enough - ENA MTU is being set to 9K anyway, so it's very unlikely that
  more space than 9KB will be needed.

  However, 9KB jumbo clusters can still cause issues, so by default the
  page size mbuf cluster will be used for the Rx descriptors. This can have a
  small (~2%) impact on the throughput of the device, so to restore
  original behavior, one must change sysctl "hw.ena.enable_9k_mbufs" to
  "1" in "/boot/loader.conf" file.

  As a part of this patch (important fix), the version of the driver
  was updated to v2.1.2.

  Submitted by:   cperciva
  Reviewed by:    Michal Krawczyk <mk@semihalf.com>
  Reviewed by:    Ido Segev <idose@amazon.com>
  Reviewed by:    Guy Tzalik <gtzalik@amazon.com>
  MFC after:      3 days
  PR:             225791, 234838, 235856, 236989, 243531
  Differential Revision: https://reviews.freebsd.org/D24546

Changes:
  head/sys/dev/ena/ena.c
  head/sys/dev/ena/ena.h
  head/sys/dev/ena/ena_sysctl.c
  head/sys/dev/ena/ena_sysctl.h
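
For reference, restoring the previous 9KB-cluster behavior after this change means setting the tunable named above at boot, e.g.:

  # /boot/loader.conf
  hw.ena.enable_9k_mbufs="1"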
Comment 21 Charles O'Donnell 2020-05-07 12:00:44 UTC
Colin,

At this point we are at 10 days of uptime. Below are selected excerpts from the hourly vmstat logging. It appears to follow the path you predicted. I can attach the full hourly log if it is of interest.

Best.


DAY 8 (first uptick of the 9k mbuf FAIL count)
===> Tue May  5 17:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,       0,      31,40308027,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7668,     286,58743917,   7,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

DAY 9
===> Wed May  6 04:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,      97,     446,41133523,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7645,     412,59865306,   9,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

===> Wed May  6 05:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,     367,     449,41875931,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7253,     108,60877108, 241,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

===> Wed May  6 06:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,      34,      41,42115282,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7719,     267,61217632, 243,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

DAY 10
===> Thu May  7 07:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,       0,      65,45067065,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7647,     322,65796230,1640,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0
Comment 22 Charles O'Donnell 2020-05-07 13:48:13 UTC
Addendum to the previous post: no downtime or freezes as of yet.
Comment 23 Colin Percival freebsd_committer freebsd_triage 2020-05-07 17:13:52 UTC
(In reply to Charles O'Donnell from comment #22)

Thanks!  We ended up going with a slightly more conservative patch than the one I provided above (defaulting to 4k buffers rather than switching, since we weren't sure if we could switch fast enough) but your statistics are useful (and confirm the decision to go with the more conservative option -- you have 1640 failed 9k allocations which will have added some network latency during busy periods).

I suggest waiting until we get the Errata Notice out and then switching over to the patch we send out with that.
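
Once that Errata Notice is out, a host running the stock GENERIC kernel should be able to pick it up with the usual binary-patch flow (sketch):

  freebsd-update fetch
  freebsd-update install
  shutdown -r now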
Comment 24 Charles O'Donnell 2020-05-07 17:31:00 UTC
New development. See three notes below.

N.B. the system appears to have fully recovered. Normally I would have expected a freeze.



1. CPU alarm from a custom AWS monitor at 16:43 UTC (12:43 PM ET):

Alarm Details:
- Name:                       Starch CPU
- Description:                
- State Change:               OK -> ALARM
- Reason for State Change:    Threshold Crossed: 1 datapoint [31.4 (07/05/20 16:38:00)] was greater than or equal to the threshold (30.0).
- Timestamp:                  Thursday 07 May, 2020 16:43:35 UTC
- AWS Account:                539612714288
- Alarm Arn:                  arn:aws:cloudwatch:us-east-1:539612714288:alarm:Starch CPU

Threshold:
- The alarm is in the ALARM state when the metric is GreaterThanOrEqualToThreshold 30.0 for 300 seconds. 



2. Sudden jump in failed 9k mbufs between 12:00 and 13:00 ET:

===> Thu May  7 10:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,       0,      56,45464111,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7538,     450,66361278,1640,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0
===> Thu May  7 11:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,      16,     113,45658689,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7592,     397,66645310,1642,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0
===> Thu May  7 12:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,     182,      31,45730287,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7461,     259,66753693,1693,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0
===> Thu May  7 13:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,     119,     109,46249719,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7863,     207,67594999,2577,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0



3. ena0 reset at 12:43 ET:

May  7 12:43:19 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May  7 12:43:19 s4 kernel: ena0: Trigger reset is on
May  7 12:43:19 s4 kernel: ena0: device is going DOWN
May  7 12:43:22 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x319ena0: free uncompleted tx mbuf qid 7 idx 0x173
May  7 12:43:23 s4 kernel: ena0: ena0: device is going UP
May  7 12:43:23 s4 kernel: link is UP
May  7 12:45:00 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May  7 12:45:00 s4 kernel: ena0: Trigger reset is on
May  7 12:45:00 s4 kernel: ena0: device is going DOWN
May  7 12:45:04 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x102
May  7 12:45:04 s4 kernel: ena0: ena0: device is going UP
May  7 12:45:04 s4 kernel: link is UP
May  7 12:45:26 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May  7 12:45:26 s4 kernel: ena0: Trigger reset is on
May  7 12:45:26 s4 kernel: ena0: device is going DOWN
May  7 12:45:29 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x3c7ena0: free uncompleted tx mbuf qid 2 idx 0x2c5ena0: free uncompleted tx mbuf qid 6 idx 0x2abena0: free uncompleted tx mbuf qid 7 idx 0x241
May  7 12:45:30 s4 kernel: 
May  7 12:45:30 s4 kernel: stray irq265
May  7 12:45:30 s4 kernel: ena0: ena0: device is going UP
May  7 12:45:30 s4 kernel: link is UP
May  7 12:46:05 s4 kernel: ena0: Keep alive watchdog timeout.
May  7 12:46:05 s4 kernel: ena0: Trigger reset is on
May  7 12:46:05 s4 kernel: ena0: device is going DOWN
May  7 12:46:07 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x123ena0: free uncompleted tx mbuf qid 3 idx 0xeeena0: free uncompleted tx mbuf qid 6 idx 0x208
May  7 12:46:08 s4 kernel: ena0: ena0: device is going UP
May  7 12:46:08 s4 kernel: link is UP
May  7 12:46:36 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May  7 12:46:36 s4 kernel: ena0: Trigger reset is on
May  7 12:46:36 s4 kernel: ena0: device is going DOWN
May  7 12:46:37 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x2c2ena0: free uncompleted tx mbuf qid 1 idx 0x135ena0: free uncompleted tx mbuf qid 2 idx 0xeeena0: free uncompleted tx mbuf qid 3 idx 0x373ena0: free uncompleted tx mbuf qid 4 idx 0x88ena0: free uncompleted t>
May  7 12:46:38 s4 kernel: ena0: ena0: device is going UP
May  7 12:46:38 s4 kernel: link is UP
Comment 25 Colin Percival freebsd_committer freebsd_triage 2020-05-07 17:41:33 UTC
Aha!  That's exactly the "not switching from 9k to 4k fast enough" case we were worried about.  Good to know that we were justified in going with the more conservative option.
Comment 26 Charles O'Donnell 2020-05-11 20:56:09 UTC
Not sure if this is expected behavior with the patched kernel.

Picking up from day 10 above:

===> Thu May  7 12:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,     182,      31,45730287,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7461,     259,66753693,1693,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

===> Fri May  8 12:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,      82,      71,52207112,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7551,     351,75614231,2625,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

===> Sat May  9 12:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,     139,      69,56246475,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7799,     339,83036796,2784,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

===> Sun May 10 12:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,      39,      51,59639876,   0,   0
mbuf_jumbo_9k:         9216, 145465,    7663,     222,85652082,2963,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0

===> Mon May 11 12:00:00 EDT 2020
mbuf_jumbo_page:       4096, 490945,    1598,     223,63502113,   0,   0
mbuf_jumbo_9k:         9216, 145465,    6230,      90,91087037,4248,   0
mbuf_jumbo_16k:       16384,  81824,       0,       0,       0,   0,   0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0


System spinning. Had to reboot.

Related /var/log/messages activity shown below.


May 11 12:20:34 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:20:34 s4 kernel: ena0: Trigger reset is on
May 11 12:20:34 s4 kernel: ena0: device is going DOWN
May 11 12:20:39 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x2e1
May 11 12:20:40 s4 kernel: ena0: device is going UP
May 11 12:20:40 s4 kernel: ena0: link is UP
May 11 12:21:55 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:21:55 s4 kernel: ena0: Trigger reset is on
May 11 12:21:55 s4 kernel: ena0: device is going DOWN
May 11 12:22:02 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x172ena0: free uncompleted tx mbuf qid 3 idx 0x66ena0: free>
May 11 12:22:03 s4 kernel: 
May 11 12:22:03 s4 kernel: stray irq265
May 11 12:22:03 s4 kernel: ena0: ena0: device is going UP
May 11 12:22:03 s4 kernel: link is UP
May 11 12:22:29 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:22:29 s4 kernel: ena0: Trigger reset is on
May 11 12:22:29 s4 kernel: ena0: device is going DOWN
May 11 12:22:32 s4 kernel: ena0: free uncompleted tx mbuf qid 2 idx 0x32bena0: free uncompleted tx mbuf qid 4 idx 0x338ena0: fre>
May 11 12:22:33 s4 kernel: 
May 11 12:22:33 s4 kernel: stray irq265
May 11 12:22:33 s4 kernel: ena0: ena0: device is going UP
May 11 12:22:33 s4 kernel: link is UP
May 11 12:23:04 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:23:04 s4 kernel: ena0: Trigger reset is on
May 11 12:23:04 s4 kernel: ena0: device is going DOWN
May 11 12:23:11 s4 kernel: ena0: free uncompleted tx mbuf qid 6 idx 0x10e
May 11 12:23:12 s4 kernel: ena0: ena0: device is going UP
May 11 12:23:12 s4 kernel: link is UP
May 11 12:25:33 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:25:33 s4 kernel: ena0: Trigger reset is on
May 11 12:25:33 s4 kernel: ena0: device is going DOWN
May 11 12:25:38 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x0
May 11 12:25:39 s4 kernel: ena0: ena0: device is going UP
May 11 12:25:39 s4 kernel: link is UP
May 11 12:28:36 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:28:36 s4 kernel: ena0: Trigger reset is on
May 11 12:28:36 s4 kernel: ena0: device is going DOWN
May 11 12:28:39 s4 kernel: ena0: device is going UP
May 11 12:28:39 s4 kernel: ena0: link is UP
May 11 12:30:04 s4 kernel: nvme4: cpl does not map to outstanding cmd
May 11 12:30:04 s4 kernel: cdw0:00000000 sqhd:0013 sqid:0002 cid:0012 p:0 sc:00 sct:0 m:0 dnr:0
May 11 12:30:04 s4 kernel: nvme4: Missing interrupt
May 11 12:30:04 s4 kernel: nvme1: Missing interrupt
May 11 12:30:34 s4 kernel: nvme4: Resetting controller due to a timeout.
May 11 12:30:34 s4 kernel: nvme4: resetting controller
May 11 12:30:47 s4 kernel: nvme4: temperature threshold not supported
May 11 12:30:47 s4 kernel: nvme4: aborting outstanding i/o
May 11 12:30:47 s4 syslogd: last message repeated 3 times
May 11 12:31:26 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:31:26 s4 kernel: ena0: Trigger reset is on
May 11 12:31:26 s4 kernel: ena0: device is going DOWN
May 11 12:31:27 s4 kernel: ena0: free uncompleted tx mbuf qid 2 idx 0x20fena0: free uncompleted tx mbuf qid 4 idx 0x8fena0: free>
May 11 12:31:28 s4 kernel: 
May 11 12:31:28 s4 kernel: stray irq266
May 11 12:31:28 s4 kernel: stray irq272
May 11 12:31:28 s4 kernel: stray irq265
May 11 12:31:28 s4 kernel: ena0: device is going UP
May 11 12:31:28 s4 kernel: ena0: link is UP
May 11 12:32:46 s4 kernel: nvme1: Missing interrupt
May 11 12:32:46 s4 kernel: nvme4: cpl does not map to outstanding cmd
May 11 12:32:46 s4 kernel: cdw0:00000000 sqhd:0007 sqid:0001 cid:0015 p:1 sc:00 sct:0 m:0 dnr:0
May 11 12:32:46 s4 kernel: nvme4: Missing interrupt
May 11 12:32:46 s4 kernel: nvme2: Missing interrupt
May 11 12:33:16 s4 kernel: nvme4: Resetting controller due to a timeout.
May 11 12:33:16 s4 kernel: nvme4: resetting controller
May 11 12:33:17 s4 kernel: nvme4: temperature threshold not supported
May 11 12:33:17 s4 kernel: nvme4: aborting outstanding i/o
May 11 12:33:17 s4 syslogd: last message repeated 2 times
May 11 12:33:40 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:33:40 s4 kernel: ena0: Trigger reset is on
May 11 12:33:40 s4 kernel: ena0: device is going DOWN
May 11 12:33:43 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x0
May 11 12:33:44 s4 kernel: ena0: ena0: device is going UP
May 11 12:33:44 s4 kernel: link is UP
May 11 12:34:15 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:34:15 s4 kernel: ena0: Trigger reset is on
May 11 12:34:15 s4 kernel: ena0: device is going DOWN
May 11 12:34:21 s4 kernel: ena0: ena0: device is going UP
May 11 12:34:21 s4 kernel: link is UP
May 11 12:35:33 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:35:33 s4 kernel: ena0: Trigger reset is on
May 11 12:35:33 s4 kernel: ena0: device is going DOWN
May 11 12:35:42 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x34
May 11 12:35:43 s4 kernel: ena0: ena0: device is going UP
May 11 12:35:43 s4 kernel: link is UP
May 11 12:36:16 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:36:16 s4 kernel: ena0: Trigger reset is on
May 11 12:36:16 s4 kernel: ena0: device is going DOWN
May 11 12:36:21 s4 kernel: ena0: free uncompleted tx mbuf qid 7 idx 0x20a
May 11 12:36:22 s4 kernel: ena0: ena0: device is going UP
May 11 12:36:22 s4 kernel: link is UP
May 11 12:37:05 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:37:05 s4 kernel: ena0: Trigger reset is on
May 11 12:37:05 s4 kernel: ena0: device is going DOWN
May 11 12:37:09 s4 kernel: ena0: free uncompleted tx mbuf qid 4 idx 0x23f
May 11 12:37:10 s4 kernel: ena0: ena0: device is going UP
May 11 12:37:10 s4 kernel: link is UP
May 11 12:38:38 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:38:38 s4 kernel: ena0: Trigger reset is on
May 11 12:38:38 s4 kernel: ena0: device is going DOWN
May 11 12:38:43 s4 kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x1b3
May 11 12:38:44 s4 kernel: 
May 11 12:38:44 s4 kernel: too many stray irq 265's: not logging anymore
May 11 12:38:44 s4 kernel: ena0: ena0: device is going UP
May 11 12:38:44 s4 kernel: link is UP
May 11 12:39:17 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:39:17 s4 kernel: ena0: Trigger reset is on
May 11 12:39:17 s4 kernel: ena0: device is going DOWN
May 11 12:39:25 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x238ena0: free uncompleted tx mbuf qid 6 idx 0x249
May 11 12:39:26 s4 kernel: ena0: ena0: device is going UP
May 11 12:39:26 s4 kernel: link is UP
May 11 12:40:20 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:40:20 s4 kernel: ena0: Trigger reset is on
May 11 12:40:20 s4 kernel: ena0: device is going DOWN
May 11 12:40:24 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x3c0
May 11 12:40:25 s4 kernel: ena0: device is going UP
May 11 12:40:25 s4 kernel: ena0: link is UP
May 11 12:41:17 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:41:17 s4 kernel: ena0: Trigger reset is on
May 11 12:41:17 s4 kernel: ena0: device is going DOWN
May 11 12:41:22 s4 kernel: ena0: device is going UP
May 11 12:41:22 s4 kernel: ena0: link is UP
May 11 12:42:17 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:42:17 s4 kernel: ena0: Trigger reset is on
May 11 12:42:17 s4 kernel: ena0: device is going DOWN
May 11 12:42:17 s4 sm-msp-queue[70209]: 04BGPgiQ066459: SYSERR(root): timeout writing message to [127.0.0.1]
May 11 12:42:21 s4 kernel: ena0: ena0: device is going UP
May 11 12:42:21 s4 kernel: link is UP
May 11 12:43:01 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:43:01 s4 kernel: ena0: Trigger reset is on
May 11 12:43:01 s4 kernel: ena0: device is going DOWN
May 11 12:43:06 s4 kernel: ena0: free uncompleted tx mbuf qid 2 idx 0x359
May 11 12:43:07 s4 kernel: 
May 11 12:43:07 s4 kernel: stray irq266
May 11 12:43:07 s4 kernel: ena0: ena0: device is going UP
May 11 12:43:07 s4 kernel: link is UP
May 11 12:45:23 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:45:23 s4 kernel: ena0: Trigger reset is on
May 11 12:45:23 s4 kernel: ena0: device is going DOWN
May 11 12:45:33 s4 kernel: ena0: ena0: device is going UP
May 11 12:45:33 s4 kernel: link is UP
May 11 12:48:08 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:48:08 s4 kernel: ena0: Trigger reset is on
May 11 12:48:08 s4 kernel: ena0: device is going DOWN
May 11 12:48:11 s4 kernel: ena0: ena0: device is going UP
May 11 12:48:11 s4 kernel: link is UP
May 11 12:48:51 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:48:51 s4 kernel: ena0: Trigger reset is on
May 11 12:48:51 s4 kernel: ena0: device is going DOWN
May 11 12:48:55 s4 kernel: ena0: free uncompleted tx mbuf qid 7 idx 0x266
May 11 12:48:55 s4 kernel: ena0: ena0: device is going UP
May 11 12:48:55 s4 kernel: link is UP
May 11 12:50:38 s4 kernel: ena0: Keep alive watchdog timeout.
May 11 12:50:38 s4 kernel: ena0: Trigger reset is on
May 11 12:50:38 s4 kernel: ena0: device is going DOWN
May 11 12:50:41 s4 kernel: ena0: ena0: device is going UP
May 11 12:50:41 s4 kernel: link is UP
May 11 12:51:43 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:51:43 s4 kernel: ena0: Trigger reset is on
May 11 12:51:43 s4 kernel: ena0: device is going DOWN
May 11 12:51:50 s4 kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x358
May 11 12:51:51 s4 kernel: ena0: ena0: device is going UP
May 11 12:51:51 s4 kernel: link is UP
May 11 12:53:24 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:53:24 s4 kernel: ena0: Trigger reset is on
May 11 12:53:24 s4 kernel: ena0: device is going DOWN
May 11 12:53:28 s4 kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x100
May 11 12:53:29 s4 kernel: ena0: ena0: device is going UP
May 11 12:53:29 s4 kernel: link is UP
May 11 12:56:19 s4 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
May 11 12:56:19 s4 kernel: ena0: Trigger reset is on
May 11 12:56:19 s4 kernel: ena0: device is going DOWN
May 11 12:56:22 s4 kernel: ena0: free uncompleted tx mbuf qid 4 idx 0x3d8
May 11 12:56:23 s4 kernel: ena0: ena0: device is going UP
May 11 12:56:23 s4 kernel: link is UP
May 11 12:57:36 s4 kernel: .
Comment 27 commit-hook freebsd_committer freebsd_triage 2020-05-12 18:46:09 UTC
A commit references this bug:

Author: mw
Date: Tue May 12 18:44:42 UTC 2020
New revision: 360985
URL: https://svnweb.freebsd.org/changeset/base/360985

Log:
  MFC r360777: Optimize ENA Rx refill for low memory conditions

  Sometimes, especially when there is not much memory in the system left,
  allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time
  and it is not guaranteed that it'll succeed. In that situation, the
  fallback will work, but if the refill needs to take a place for a lot of
  descriptors at once, the time spent in m_getjcl looking for memory can
  cause system unresponsiveness due to high priority of the Rx task. This
  can also lead to driver reset, because Tx cleanup routine is being
  blocked and timer service could detect that Tx packets aren't cleaned
  up. The reset routine can further create another unresponsiveness - Rx
  rings are being refilled there, so m_getjcl will again burn the CPU.
  This was causing NVMe driver timeouts and resets, because network driver
  is having higher priority.

  Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are
  enough - ENA MTU is being set to 9K anyway, so it's very unlikely that
  more space than 9KB will be needed.

  However, 9KB jumbo clusters can still cause issues, so by default the
  page size mbuf cluster will be used for the Rx descriptors. This can have a
  small (~2%) impact on the throughput of the device, so to restore
  original behavior, one must change sysctl "hw.ena.enable_9k_mbufs" to
  "1" in "/boot/loader.conf" file.

  As a part of this patch (important fix), the version of the driver
  was updated to v2.1.2.

  Submitted by:   cperciva
  PR:             225791, 234838, 235856, 236989, 243531

Changes:
_U  stable/12/
  stable/12/sys/dev/ena/ena.c
  stable/12/sys/dev/ena/ena.h
Comment 28 Colin Percival freebsd_committer freebsd_triage 2020-05-13 01:16:02 UTC
(In reply to Charles O'Donnell from comment #26)

It's slightly surprising that your system is getting loaded enough to cause the errors above -- but it may be that each memory allocation failure is taking longer since your system is much larger than the t3 instance I was able to reproduce this on earlier.

In any case, if you apply the patch which went into stable/12 (r360985) instead of the patch I sent you earlier, this issue should go away.  (If it doesn't, we DEFINITELY want to know -- this patch will be going out as an Errata Notice for FreeBSD 12.1.)
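
One way to pick that change up on an svn checkout of stable/12 is to update the tree to (at least) that revision and rebuild the kernel (a sketch; paths and -j value are examples):

  cd /usr/src
  svnlite update -r 360985
  make -j8 buildkernel KERNCONF=GENERIC
  make installkernel KERNCONF=GENERIC
  shutdown -r now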

Colin Percival
Comment 29 Juraj Lutter freebsd_committer freebsd_triage 2020-10-25 19:05:32 UTC
FWIW, I've just run into the same issue on bare metal, Dell r740xd, with 20xNVME and 2xSSD, running 1-month-old STABLE.

Is there any info I can provide to help the investigation?
Comment 30 Colin Percival freebsd_committer freebsd_triage 2020-10-25 19:07:41 UTC
(In reply to Juraj Lutter from comment #29)

This was a specific issue with EC2 instances (with the ENA driver in particular) which has now been fixed.
Comment 31 Charles O'Donnell 2020-10-26 10:45:07 UTC
Indeed. Corrected and gone:

 6:41AM  up 93 days, 23:21, 1 user, load averages: 0.80, 0.68, 0.70

The above uptime is from an EC2 instance that previously couldn't stay up more than 4-5 days.

Cheers Colin.