Bug 243531 - Unstable ena and nvme on AWS
Summary: Unstable ena and nvme on AWS
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.1-RELEASE
Hardware: amd64 Any
: --- Affects Some People
Assignee: freebsd-virtualization mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-01-22 23:28 UTC by Leif Pedersen
Modified: 2020-02-11 06:37 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Leif Pedersen 2020-01-22 23:28:22 UTC
We just recently upgraded our systems on AWS to 12.1, and we're seeing errors with nvme0 and ena0. Typically, these errors manifest together. My sample size is 34 instances, all t3.medium or r4.large. I have others that are t2.small, and those are fine being that they don't use ena or nvme drivers. Of the 34, there's a group of 15 that are almost idle, and never throw these errors. Of the remaining that have load, only 5 ran without errors and 14 of them threw these errors scattered about various times in the last 4.5 days. 3 machines have crashed during this period, so the errors seem to often be nonfatal, but not always. So based on that it seems load related, and does not seem isolated to an occasional hardware problem.

Here's a sample of the log from one of the machines that crashed:

Jan 22 00:17:05 jdas-dev kernel: nvme0: cpl does not map to outstanding cmd
Jan 22 00:17:05 jdas-dev kernel: cdw0:00000000 sqhd:000d sqid:0001 cid:0012 p:1 sc:00 sct:0 m:0 dnr:0
Jan 22 00:17:05 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 00:17:05 jdas-dev kernel: nvme0: resetting controller
Jan 22 00:17:05 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 00:17:05 jdas-dev kernel: nvme0: aborting outstanding i/o
Jan 22 00:17:05 jdas-dev kernel: nvme0: resubmitting queued i/o
Jan 22 00:17:05 jdas-dev kernel: nvme0: WRITE sqid:2 cid:0 nsid:1 lba:691756520 len:8
Jan 22 00:17:21 jdas-dev kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Jan 22 00:17:21 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 00:17:21 jdas-dev kernel: ena0: device is going DOWN
Jan 22 00:17:21 jdas-dev kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x154
Jan 22 00:17:22 jdas-dev kernel: ena0: device is going UP
Jan 22 00:17:22 jdas-dev kernel: ena0: link is UP
Jan 22 00:17:52 jdas-dev kernel: nvme0: Missing interrupt
Jan 22 00:18:46 jdas-dev kernel: nvme0: Missing interrupt
Jan 22 00:19:34 jdas-dev kernel: nvme0: cpl does not map to outstanding cmd
Jan 22 00:19:34 jdas-dev kernel: cdw0:00000000 sqhd:001a sqid:0002 cid:001b p:0 sc:00 sct:0 m:0 dnr:0
Jan 22 00:19:34 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 00:19:34 jdas-dev kernel: nvme0: resetting controller
Jan 22 00:19:35 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 00:19:35 jdas-dev kernel: nvme0: resubmitting queued i/o
Jan 22 00:19:35 jdas-dev kernel: nvme0: WRITE sqid:1 cid:0 nsid:1 lba:405055248 len:8
Jan 22 00:19:35 jdas-dev kernel: nvme0: aborting outstanding i/o

At this point, we rebooted the machine.

Jan 22 09:02:30 jdas-dev kernel: nvme0: Resetting controller due to a timeout.
Jan 22 09:02:30 jdas-dev kernel: nvme0: resetting controller
Jan 22 09:02:30 jdas-dev kernel: nvme0: temperature threshold not supported
Jan 22 09:02:30 jdas-dev kernel: nvme0: aborting outstanding i/o
Jan 22 09:02:30 jdas-dev kernel: nvme0: DATASET MANAGEMENT sqid:2 cid:27 nsid:0
Jan 22 09:02:30 jdas-dev kernel: nvme0: INVALID OPCODE (00/01) sqid:2 cid:27 cdw0:0
Jan 22 09:02:30 jdas-dev kernel: ena0: Keep alive watchdog timeout.
Jan 22 09:02:30 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 09:02:30 jdas-dev kernel: ena0: device is going DOWN
Jan 22 09:02:30 jdas-dev kernel: ena0: ena0: device is going UP
Jan 22 09:02:30 jdas-dev kernel: link is UP
Jan 22 09:02:30 jdas-dev kernel: ena0: Keep alive watchdog timeout.
Jan 22 09:02:30 jdas-dev kernel: ena0: Trigger reset is on
Jan 22 09:02:30 jdas-dev kernel: ena0: device is going DOWN
Jan 22 09:02:30 jdas-dev kernel: ena0: ena0: device is going UP
Jan 22 09:02:30 jdas-dev kernel: link is UP
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: 90 second watchdog timeout expired. Shutdown terminated.
Jan 22 09:02:30 jdas-dev kernel: Wed Jan 22 08:58:58 CST 2020
Jan 22 09:02:30 jdas-dev kernel: 2020-01-22T08:59:01.658827-06:00 jdas-dev.aws0.pla-net.cc init 1 - - /etc/rc.shutdown terminated abnormally, going to single user mode
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: 2020-01-22T08:59:23.108170-06:00 jdas-dev.aws0.pla-net.cc init 1 - - some processes would not die; ps axl advised
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21854, size: 57344
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 372873, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 118775, size: 8192
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315370, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312763, size: 4096
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 445, size: 12288
Jan 22 09:02:30 jdas-dev kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21337, size: 4096
Jan 22 09:02:30 jdas-dev kernel: ---<<BOOT>>---
Jan 22 09:02:30 jdas-dev kernel: Copyright (c) 1992-2019 The FreeBSD Project.
Jan 22 09:02:30 jdas-dev kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Comment 1 Colin Percival freebsd_committer 2020-02-11 06:37:02 UTC
I haven't been able to reproduce this yet... any more tips on how to replicate your environment?