Bug 243531

Summary:   Unstable ena and nvme on AWS
Product:   Base System
Component: kern
Version:   12.1-RELEASE
Hardware:  amd64
OS:        Any
Status:    Closed FIXED
Severity:  Affects Some People
Priority:  ---
Reporter:  Leif Pedersen <leif>
Assignee:  freebsd-virtualization (Nobody) <virtualization>
CC:        cperciva, imp, tomas

Attachments:
  console screenshot of crash on 2020-02-21 (flags: none)
  syslog from crash on 2020-02-21 (flags: none)
  syslog from verbose boot on 2020-02-27 (flags: none)

Description Leif Pedersen 2020-01-22 23:28:22 UTC
We recently upgraded our systems on AWS to 12.1, and we're seeing errors from nvme0 and ena0. Typically these errors manifest together. My sample size is 34 instances, all t3.medium or r4.large. I have others that are t2.small, and those are fine, since they don't use the ena or nvme drivers. Of the 34, a group of 15 is almost idle and never throws these errors. Of the remaining 19 that carry load, only 5 ran without errors; the other 14 threw these errors at various times over the last 4.5 days. 3 machines have crashed during this period, so the errors seem to often be nonfatal, but not always. Based on that, the problem seems load-related and does not appear to be an isolated, occasional hardware problem.

Here's a sample of the log from one of the machines that crashed:

Jan 22 00:17:05 jdas-dev kernel: nvme0: cpl does not map to outstanding cmd
Jan 22 00:17:05 jdas-dev kernel: cdw0:00000000 sqhd:000d sqid:0001 cid:0012 p:1 sc:00 sct:0 m:0 dnr:0
Jan 22 00:17:05 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 00:17:05 jdas-dev kernel: nvme0: resetting controller
Jan 22 00:17:05 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 00:17:05 jdas-dev kernel: nvme0: aborting outstanding i/o
Jan 22 00:17:05 jdas-dev kernel: nvme0: resubmitting queued i/o
Jan 22 00:17:05 jdas-dev kernel: nvme0: WRITE sqid:2 cid:0 nsid:1 lba:691756520 len:8
Jan 22 00:17:21 jdas-dev kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Jan 22 00:17:21 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 00:17:21 jdas-dev kernel: ena0: device is going DOWN
Jan 22 00:17:21 jdas-dev kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x154
Jan 22 00:17:22 jdas-dev kernel: ena0: device is going UP
Jan 22 00:17:22 jdas-dev kernel: ena0: link is UP
Jan 22 00:17:52 jdas-dev kernel: nvme0: Missing interrupt
Jan 22 00:18:46 jdas-dev kernel: nvme0: Missing interrupt
Jan 22 00:19:34 jdas-dev kernel: nvme0: cpl does not map to outstanding cmd
Jan 22 00:19:34 jdas-dev kernel: cdw0:00000000 sqhd:001a sqid:0002 cid:001b p:0 sc:00 sct:0 m:0 dnr:0
Jan 22 00:19:34 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 00:19:34 jdas-dev kernel: nvme0: resetting controller
Jan 22 00:19:35 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 00:19:35 jdas-dev kernel: nvme0: resubmitting queued i/o
Jan 22 00:19:35 jdas-dev kernel: nvme0: WRITE sqid:1 cid:0 nsid:1 lba:405055248 len:8
Jan 22 00:19:35 jdas-dev kernel: nvme0: aborting outstanding i/o

At this point, we rebooted the machine.

Jan 22 09:02:30 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 09:02:30 jdas-dev kernel: nvme0: resetting controller
Jan 22 09:02:30 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 09:02:30 jdas-dev kernel: nvme0: aborting outstanding i/o
Jan 22 09:02:30 jdas-dev kernel: nvme0: DATASET MANAGEMENT sqid:2 cid:27 nsid:0
Jan 22 09:02:30 jdas-dev kernel: nvme0: INVALID OPCODE (00/01) sqid:2 cid:27 cdw0:0
Jan 22 09:02:30 jdas-dev kernel: ena0: Keep alive watchdog timeout.
Jan 22 09:02:30 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 09:02:30 jdas-dev kernel: ena0: device is going DOWN
Jan 22 09:02:30 jdas-dev kernel: ena0: ena0: device is going UP
Jan 22 09:02:30 jdas-dev kernel: link is UP
Jan 22 09:02:30 jdas-dev kernel: ena0: Keep alive watchdog timeout.
Jan 22 09:02:30 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 09:02:30 jdas-dev kernel: ena0: device is going DOWN
Jan 22 09:02:30 jdas-dev kernel: ena0: ena0: device is going UP
Jan 22 09:02:30 jdas-dev kernel: link is UP
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: 90 second watchdog timeout expired. Shutdown terminated.
Jan 22 09:02:30 jdas-dev kernel: Wed Jan 22 08:58:58 CST 2020
Jan 22 09:02:30 jdas-dev kernel: 2020-01-22T08:59:01.658827-06:00 jdas-dev.aws0.pla-net.cc init 1 - - /etc/rc.shutdown terminated abnormally, going to single user mode
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: 2020-01-22T08:59:23.108170-06:00 jdas-dev.aws0.pla-net.cc init 1 - - some processes would not die; ps axl advised
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: ---<<BOOT>>---
Jan 22 09:02:30 jdas-dev kernel: Copyright (c) 1992-2019 The FreeBSD Project.
Jan 22 09:02:30 jdas-dev kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Comment 1 Colin Percival 2020-02-11 06:37:02 UTC
I haven't been able to reproduce this yet... any more tips on how to replicate your environment?
Comment 2 Leif Pedersen 2020-02-21 18:44:43 UTC
I'm at a bit of a loss to come up with anything particularly helpful. A few thoughts, although they're mostly naive observations and wild speculation:

It kind of seems like when one machine has a problem, several others do also. This suggests that it could be triggered by a shared event in the host's networking or EBS. (None of our instances have local storage.) I don't have enough machines or samples to show that it's not just a coincidence though.

The nvme errors are always (or almost always?) accompanied by ena errors, but ena errors happen without nvme errors sometimes. That suggests it might be triggered by a network event in the AWS hosting infrastructure, like a network topology change or something.

I'll attach /var/log/all.log and the console screenshot from a crash that happened today. There's probably nothing new there. This time the machine did not panic, but rather wedged, after Nagios reported its CPU load at 9. There's nothing running on this one besides the hourly zfs snapshot transfers, so I think the load came from processes piling up waiting for I/O.

The timing of the error messages stretches out over many minutes: the ena errors start at 02:20:16, and the nvme errors finally appear at 02:28:06. That seems odd, like a problem that ramps up slowly rather than an abrupt crash.

It's also interesting that the messages in the console screenshot made it into syslog, so I/O must have recovered, if only briefly.
Comment 3 Leif Pedersen 2020-02-21 18:45:40 UTC
Created attachment 211805 [details]
console screenshot of crash on 2020-02-21
Comment 4 Leif Pedersen 2020-02-21 18:47:03 UTC
Created attachment 211806 [details]
syslog from crash on 2020-02-21
Comment 5 Colin Percival 2020-02-21 19:55:36 UTC
Hi Leif,

Can you tell me the instance ID from that one?  And is that timestamp in UTC?
Comment 6 Colin Percival 2020-02-24 21:09:56 UTC
Another request: Can you provide a verbose boot?  We're wondering if there's something funky going on with interrupt routing.
Comment 7 Leif Pedersen 2020-02-27 21:08:15 UTC
Created attachment 212007 [details]
syslog from verbose boot on 2020-02-27
Comment 8 Leif Pedersen 2020-02-27 21:09:01 UTC
(In reply to Colin Percival from comment #6)

Sorry, I didn't notice your message sooner. :(

The instance ID of extra1-dev is i-0ce4598513b3acbb2. I've attached the log from a verbose boot.
Comment 9 Leif Pedersen 2020-02-27 21:12:13 UTC
My timestamps are in America/Chicago (-0600 now, -0500 in summer).
Comment 10 Colin Percival 2020-04-22 23:36:00 UTC
I believe that this patch should fix the underlying problem, which is in the ENA driver: https://reviews.freebsd.org/D24546

If you're able to build a custom kernel, please test that patch and report results in that review or via email (cperciva@).
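
For reference, one way to test it is sketched below, assuming a stable/12 source tree in /usr/src; the Phabricator download URL, the patch strip level, and the GENERIC kernel config are assumptions that may need adjusting for your setup.

# Fetch the proposed ENA change as a raw diff and apply it to /usr/src.
cd /usr/src
fetch -o D24546.diff 'https://reviews.freebsd.org/D24546?download=true'
patch -p0 < D24546.diff

# Rebuild and install the kernel, then reboot into it.
make -j"$(sysctl -n hw.ncpu)" buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now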
Comment 11 commit-hook 2020-05-07 11:28:54 UTC
A commit references this bug:

Author: mw
Date: Thu May  7 11:28:41 UTC 2020
New revision: 360777
URL: https://svnweb.freebsd.org/changeset/base/360777

Log:
  Optimize ENA Rx refill for low memory conditions

  Sometimes, especially when there is not much memory in the system left,
  allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time
  and it is not guaranteed that it'll succeed. In that situation, the
  fallback will work, but if the refill needs to take a place for a lot of
  descriptors at once, the time spent in m_getjcl looking for memory can
  cause system unresponsiveness due to high priority of the Rx task. This
  can also lead to driver reset, because Tx cleanup routine is being
  blocked and timer service could detect that Tx packets aren't cleaned
  up. The reset routine can further create another unresponsiveness - Rx
  rings are being refilled there, so m_getjcl will again burn the CPU.
  This was causing NVMe driver timeouts and resets, because network driver
  is having higher priority.

  Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are
  enough - ENA MTU is being set to 9K anyway, so it's very unlikely that
  more space than 9KB will be needed.

  However, 9KB jumbo clusters can still cause issues, so by default the
  page size mbuf cluster will be used for the Rx descriptors. This can have a
  small (~2%) impact on the throughput of the device, so to restore
  original behavior, one must change sysctl "hw.ena.enable_9k_mbufs" to
  "1" in "/boot/loader.conf" file.

  As a part of this patch (important fix), the version of the driver
  was updated to v2.1.2.

  Submitted by:   cperciva
  Reviewed by:    Michal Krawczyk <mk@semihalf.com>
  Reviewed by:    Ido Segev <idose@amazon.com>
  Reviewed by:    Guy Tzalik <gtzalik@amazon.com>
  MFC after:      3 days
  PR:             225791, 234838, 235856, 236989, 243531
  Differential Revision: https://reviews.freebsd.org/D24546

Changes:
  head/sys/dev/ena/ena.c
  head/sys/dev/ena/ena.h
  head/sys/dev/ena/ena_sysctl.c
  head/sys/dev/ena/ena_sysctl.h
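
The knob quoted in the log message above is a loader tunable; a minimal sketch of the corresponding /boot/loader.conf entry, using only the name and value given in the commit message, looks like this:

# /boot/loader.conf
# Restore 9KB jumbo clusters for the ENA Rx buffers (the driver's new default
# is page-size clusters, at a roughly 2% throughput cost per the commit message).
hw.ena.enable_9k_mbufs="1"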
Comment 12 commit-hook 2020-05-12 18:46:11 UTC
A commit references this bug:

Author: mw
Date: Tue May 12 18:44:42 UTC 2020
New revision: 360985
URL: https://svnweb.freebsd.org/changeset/base/360985

Log:
  MFC r360777: Optimize ENA Rx refill for low memory conditions

  Sometimes, especially when there is not much memory in the system left,
  allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time
  and it is not guaranteed that it'll succeed. In that situation, the
  fallback will work, but if the refill needs to take a place for a lot of
  descriptors at once, the time spent in m_getjcl looking for memory can
  cause system unresponsiveness due to high priority of the Rx task. This
  can also lead to driver reset, because Tx cleanup routine is being
  blocked and timer service could detect that Tx packets aren't cleaned
  up. The reset routine can further create another unresponsiveness - Rx
  rings are being refilled there, so m_getjcl will again burn the CPU.
  This was causing NVMe driver timeouts and resets, because network driver
  is having higher priority.

  Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are
  enough - ENA MTU is being set to 9K anyway, so it's very unlikely that
  more space than 9KB will be needed.

  However, 9KB jumbo clusters can still cause issues, so by default the
  page size mbuf cluster will be used for the Rx descriptors. This can have a
  small (~2%) impact on the throughput of the device, so to restore
  original behavior, one must change sysctl "hw.ena.enable_9k_mbufs" to
  "1" in "/boot/loader.conf" file.

  As a part of this patch (important fix), the version of the driver
  was updated to v2.1.2.

  Submitted by:   cperciva
  PR:             225791, 234838, 235856, 236989, 243531

Changes:
_U  stable/12/
  stable/12/sys/dev/ena/ena.c
  stable/12/sys/dev/ena/ena.h
Comment 13 Tomáš Čiernik 2020-06-05 10:46:43 UTC
Hello,

today I had a service outage on 12.1-RELEASE-p5. After about 10 minutes, the system and all services went back to normal. Here is the output from /var/log/messages:

Jun  5 08:58:37 mail kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Jun  5 08:58:37 mail kernel: ena0: Trigger reset is on
Jun  5 08:58:37 mail kernel: ena0: device is going DOWN
Jun  5 08:58:37 mail kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 475845, size: 28672
Jun  5 08:58:37 mail kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x8eena0: free uncompleted tx mbuf qid 1 idx 0x0swap_pager: indefinite wait buffer: bufobj: 0, blkno: 295657, size: 28672
Jun  5 08:58:38 mail kernel: ena0: ena0: device is going UP
Jun  5 08:58:38 mail kernel: link is UP


What can I do to confirm whether this is related to this bug?
Comment 14 Colin Percival 2020-06-05 17:46:22 UTC
(In reply to Tomáš Čiernik from comment #13)

Hi Tomáš,

That definitely looks like the same issue.  Does `sysctl dev.ena | grep mbuf_alloc_fail` show nonzero values?  (If no, it's not the same issue; if yes, it probably is.)
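
If you want to watch those counters over time rather than check them once, a small loop is enough; this is only a sketch, and the one-minute interval is an arbitrary choice:

# Poll the per-ring mbuf allocation failure counters once a minute; counts
# that keep growing under load point at the Rx refill problem fixed above.
while true; do
    date
    sysctl dev.ena | grep mbuf_alloc_fail
    sleep 60
done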

Colin Percival
Comment 15 Tomáš Čiernik 2020-06-08 14:22:13 UTC
Hello,

sorry for my late answer. It seems it's a different issue.

% sysctl dev.ena | grep mbuf_alloc_fail
dev.ena.0.queue1.rx_ring.mbuf_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mbuf_alloc_fail: 0


Many thanks for your input - now I know I should focus on another point ("just" to find it ...).
Comment 16 Warner Losh 2021-07-10 22:20:24 UTC
This issue has been fixed in -CURRENT and MFC'd to stable/12. Since the other report here turned out not to be the same problem, I'm closing this bug as fixed.