Bug 243531 - Unstable ena and nvme on AWS
Summary: Unstable ena and nvme on AWS
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-virtualization mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-01-22 23:28 UTC by Leif Pedersen
Modified: 2020-02-27 21:12 UTC
CC List: 1 user

See Also:


Attachments
console screenshot of crash on 2020-02-21 (98.16 KB, image/jpeg), 2020-02-21 18:45 UTC, Leif Pedersen
syslog from crash on 2020-02-21 (297.33 KB, text/plain), 2020-02-21 18:47 UTC, Leif Pedersen
syslog from verbose boot on 2020-02-27 (61.41 KB, text/plain), 2020-02-27 21:08 UTC, Leif Pedersen

Description Leif Pedersen 2020-01-22 23:28:22 UTC
We recently upgraded our systems on AWS to 12.1, and we're seeing errors from nvme0 and ena0. Typically these errors manifest together. My sample size is 34 instances, all t3.medium or r4.large. I have others that are t2.small, and those are fine, since they don't use the ena or nvme drivers. Of the 34, a group of 15 are almost idle and never throw these errors. Of the remaining 19 that carry load, only 5 ran without errors; the other 14 threw these errors at scattered times over the last 4.5 days. 3 machines crashed during this period, so the errors are often, but not always, nonfatal. Based on that, it seems load-related, and it doesn't look like an occasional hardware problem isolated to one machine.
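For counting purposes, I treated a machine as affected if its syslog contained any of the nvme or ena reset messages quoted below. Roughly how I tallied it (the host names here are placeholders, not our real inventory):

# count nvme/ena reset messages per instance; host names are placeholders
for h in host1 host2 host3; do
    printf '%s ' "$h"
    ssh "$h" "grep -cE 'nvme0: Resetting controller|ena0: Trigger reset is on' /var/log/messages"
done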

Here's a sample of the log from one of the machines that crashed:

Jan 22 00:17:05 jdas-dev kernel: nvme0: cpl does not map to outstanding cmd
Jan 22 00:17:05 jdas-dev kernel: cdw0:00000000 sqhd:000d sqid:0001 cid:0012 p:1 sc:00 sct:0 m:0 dnr:0
Jan 22 00:17:05 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 00:17:05 jdas-dev kernel: nvme0: resetting controller
Jan 22 00:17:05 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 00:17:05 jdas-dev kernel: nvme0: aborting outstanding i/o
Jan 22 00:17:05 jdas-dev kernel: nvme0: resubmitting queued i/o
Jan 22 00:17:05 jdas-dev kernel: nvme0: WRITE sqid:2 cid:0 nsid:1 lba:691756520 len:8
Jan 22 00:17:21 jdas-dev kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Jan 22 00:17:21 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 00:17:21 jdas-dev kernel: ena0: device is going DOWN
Jan 22 00:17:21 jdas-dev kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x154
Jan 22 00:17:22 jdas-dev kernel: ena0: device is going UP
Jan 22 00:17:22 jdas-dev kernel: ena0: link is UP
Jan 22 00:17:52 jdas-dev kernel: nvme0: Missing interrupt
Jan 22 00:18:46 jdas-dev kernel: nvme0: Missing interrupt
Jan 22 00:19:34 jdas-dev kernel: nvme0: cpl does not map to outstanding cmd
Jan 22 00:19:34 jdas-dev kernel: cdw0:00000000 sqhd:001a sqid:0002 cid:001b p:0 sc:00 sct:0 m:0 dnr:0
Jan 22 00:19:34 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 00:19:34 jdas-dev kernel: nvme0: resetting controller
Jan 22 00:19:35 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 00:19:35 jdas-dev kernel: nvme0: resubmitting queued i/o
Jan 22 00:19:35 jdas-dev kernel: nvme0: WRITE sqid:1 cid:0 nsid:1 lba:405055248 len:8
Jan 22 00:19:35 jdas-dev kernel: nvme0: aborting outstanding i/o

At this point, we rebooted the machine.

Jan 22 09:02:30 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 09:02:30 jdas-dev kernel: nvme0: resetting controller
Jan 22 09:02:30 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 09:02:30 jdas-dev kernel: nvme0: aborting outstanding i/o
Jan 22 09:02:30 jdas-dev kernel: nvme0: DATASET MANAGEMENT sqid:2 cid:27 nsid:0
Jan 22 09:02:30 jdas-dev kernel: nvme0: INVALID OPCODE (00/01) sqid:2 cid:27 cdw0:0
Jan 22 09:02:30 jdas-dev kernel: ena0: Keep alive watchdog timeout.
Jan 22 09:02:30 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 09:02:30 jdas-dev kernel: ena0: device is going DOWN
Jan 22 09:02:30 jdas-dev kernel: ena0: ena0: device is going UP
Jan 22 09:02:30 jdas-dev kernel: link is UP
Jan 22 09:02:30 jdas-dev kernel: ena0: Keep alive watchdog timeout.
Jan 22 09:02:30 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 09:02:30 jdas-dev kernel: ena0: device is going DOWN
Jan 22 09:02:30 jdas-dev kernel: ena0: ena0: device is going UP
Jan 22 09:02:30 jdas-dev kernel: link is UP
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: 90 second watchdog timeout expired. Shutdown terminated.
Jan 22 09:02:30 jdas-dev kernel: Wed Jan 22 08:58:58 CST 2020
Jan 22 09:02:30 jdas-dev kernel: 2020-01-22T08:59:01.658827-06:00 jdas-dev.aws0.pla-net.cc init 1 - - /etc/rc.shutdown terminated abnormally, going to single user mode
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: 2020-01-22T08:59:23.108170-06:00 jdas-dev.aws0.pla-net.cc init 1 - - some processes would not die; ps axl advised
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: ---<<BOOT>>---
Jan 22 09:02:30 jdas-dev kernel: Copyright (c) 1992-2019 The FreeBSD Project.
Jan 22 09:02:30 jdas-dev kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Comment 1 Colin Percival freebsd_committer 2020-02-11 06:37:02 UTC
I haven't been able to reproduce this yet... any more tips on how to replicate your environment?
Comment 2 Leif Pedersen 2020-02-21 18:44:43 UTC
I'm at a bit of a loss to come up with anything particularly helpful. A few thoughts, though they're mostly naive observations and wild speculation:

It kind of seems like when one machine has a problem, several others do also. This suggests that it could be triggered by a shared event in the host's networking or EBS. (None of our instances have local storage.) I don't have enough machines or samples to show that it's not just a coincidence though.

The nvme errors are always (or almost always?) accompanied by ena errors, but ena errors happen without nvme errors sometimes. That suggests it might be triggered by a network event in the AWS hosting infrastructure, like a network topology change or something.

I'll attach a copy of /var/log/all.log and the console screenshot from a crash that happened today. Probably nothing new there. This time, the machine did not panic, but rather wedged after Nagios reported its CPU load at 9. There's nothing running on this one besides the hourly zfs snapshot transfers, so I think the load came from processes piling up waiting for I/O.
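For reference, the hourly transfer job is essentially an incremental zfs send piped over ssh to another instance. A simplified sketch, not our actual script; the dataset, snapshot, and host names are placeholders:

# simplified version of the hourly snapshot transfer; names are placeholders
now=$(date +%Y%m%d%H)
zfs snapshot -r tank/data@"$now"
# @previous stands for the prior hour's snapshot
zfs send -R -i tank/data@previous tank/data@"$now" | ssh backup-host zfs receive -Fdu tank/backup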

The timing of the error messages stretches out over many minutes: the ena errors start at 02:20:16, and the nvme errors don't appear until 02:28:06. That seems odd, like a problem that ramps up slowly rather than an abrupt crash.

It's also interesting that the messages in the console screenshot made it into syslog, so I/O must have recovered, if only briefly.
Comment 3 Leif Pedersen 2020-02-21 18:45:40 UTC
Created attachment 211805 [details]
console screenshot of crash on 2020-02-21
Comment 4 Leif Pedersen 2020-02-21 18:47:03 UTC
Created attachment 211806 [details]
syslog from crash on 2020-02-21
Comment 5 Colin Percival freebsd_committer 2020-02-21 19:55:36 UTC
Hi Leif,

Can you tell me the instance ID from that one?  And is that timestamp in UTC?
Comment 6 Colin Percival freebsd_committer 2020-02-24 21:09:56 UTC
Another request: Can you provide a verbose boot?  We're wondering if there's something funky going on with interrupt routing.
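If it's easier than catching the loader prompt to type "boot -v", the verbose flag can be set persistently and the interrupt assignments dumped afterwards; roughly, as one way to do it:

# enable verbose boot for subsequent reboots (remove the line when done)
echo 'boot_verbose="YES"' >> /boot/loader.conf
shutdown -r now

# once the machine is back up, capture the verbose dmesg and the interrupt counts
dmesg -a > /tmp/verbose-boot.txt
vmstat -i > /tmp/interrupts.txt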
Comment 7 Leif Pedersen 2020-02-27 21:08:15 UTC
Created attachment 212007 [details]
syslog from verbose boot on 2020-02-27
Comment 8 Leif Pedersen 2020-02-27 21:09:01 UTC
(In reply to Colin Percival from comment #6)

Sorry, I didn't notice your message sooner. :(

The instance ID of extra1-dev is i-0ce4598513b3acbb2. I've attached the log from a verbose boot.
Comment 9 Leif Pedersen 2020-02-27 21:12:13 UTC
My timestamps are in America/Chicago (-0600 now, -0500 in summer).
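So the 02:20:16 in the log from the crash on the 21st corresponds to 08:20:16 UTC. A quick way to convert with FreeBSD date(1), as a sketch:

# convert a -0600 local timestamp to UTC; prints 2020-02-21T08:20:16Z
TZ=UTC date -j -f '%Y-%m-%d %H:%M:%S %z' '2020-02-21 02:20:16 -0600' '+%Y-%m-%dT%H:%M:%SZ'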