Bug 236989 - AWS EC2 lockups "Missing interrupt"
Summary: AWS EC2 lockups "Missing interrupt"
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.0-RELEASE
Hardware: amd64 Any
: --- Affects Only Me
Assignee: freebsd-virtualization mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-04-03 13:31 UTC by Charles O'Donnell
Modified: 2020-03-17 00:56 UTC (History)
6 users (show)

See Also:


Attachments
syslog entries (2.66 KB, text/plain)
2019-06-04 20:27 UTC, Charles O'Donnell
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Charles O'Donnell 2019-04-03 13:31:10 UTC
I am experiencing lockups on a production c5d.2xlarge instance running FreeBSD 12.0-RELEASE. Frequency is about once a week.

The harbinger of these lockups is the appearance of "nvmX Missing interrupt" in the logs:

Apr  3 00:56:32 host kernel: nvme0: Missing interrupt
Apr  3 00:57:43 host syslogd: last message repeated 1 times
Apr  3 00:58:43 host kernel: nvme4: Missing interrupt
Apr  3 00:58:43 host kernel: nvme1: Missing interrupt
Apr  3 00:58:43 host kernel: nvme0: Missing interrupt
Apr  3 00:58:43 host kernel: nvme4: Missing interrupt
Apr  3 00:58:43 host kernel: nvme1: Missing interrupt
Apr  3 00:58:43 host kernel: nvme0: Missing interrupt
Apr  3 00:59:43 host kernel: nvme4: Missing interrupt
Apr  3 00:59:43 host kernel: nvme1: Missing interrupt
Apr  3 00:59:43 host kernel: nvme0: Missing interrupt
Apr  3 00:59:43 host kernel: nvme1: nvme4: Missing interrupt
Apr  3 00:59:43 host kernel: Missing interrupt
Apr  3 00:59:43 host kernel: nvme0: Missing interrupt
Apr  3 00:59:44 host kernel: nvme1: Missing interrupt
Apr  3 01:00:05 host kernel: nvme0: Missing interrupt
Apr  3 01:20:01 host kernel: nvme0: 
Apr  3 01:20:01 host kernel: Missing interrupt
Apr  3 01:22:10 host kernel: sonewconn: pcb 0xfffff802988adb00: Listen queue overflow: 151 already in queue awaiting acceptance (1 occurrences)
Apr  3 01:24:33 host kernel: sonewconn: pcb 0xfffff802988adb00: Listen queue overflow: 151 already in queue awaiting acceptance (6 occurrences)
Apr  3 01:25:35 host kernel: sonewconn: pcb 0xfffff802988adb00: Listen queue overflow: 151 already in queue awaiting acceptance (4 occurrences)
Apr  3 01:26:45 host syslogd: last message repeated 1 times
Apr  3 01:27:49 host syslogd: last message repeated 1 times


Within a few hours the machine will become unresponsive, CPU pegged at 100%, high disk reads and writes. It will not respond to an EC2 "stop" command and requires a forced (hard) reset.


c5d.2xlarge
FreeBSD 12.0-RELEASE-p3 GENERIC amd64
zfs is in use on some drives, but not all.

I am running several instances with this same configuration, but only one of them has had this issue so far, and it happens to be the host that has the highest disk activity.
Comment 1 Rodney W. Grimes freebsd_committer 2019-04-22 21:50:39 UTC
Pulling in Chuck Tuffli as am NVMe expert
Comment 2 Charles O'Donnell 2019-06-04 20:27:51 UTC
Created attachment 204824 [details]
syslog entries

Log entries.
Comment 3 Charles O'Donnell 2019-06-04 20:28:24 UTC
I still have not come up with a way to reliably reproduce this error, but is continues to be a major issue on our productions systems.

In the meantime, I thought I might include the excerpt above from /var/log/messages which includes some additional noise. I don't know if these entries are directly related or indirectly related.
Comment 4 Charles O'Donnell 2019-06-04 20:33:54 UTC
As an addendum, the system has been kept up to date with freebsd-update.

FreeBSD 12.0-RELEASE-p4 GENERIC
Comment 5 Charles O'Donnell 2019-08-06 03:30:35 UTC
This issue continues to be a major problem with FreeBSD 12 on AWS EC2 on busy production servers.
Comment 6 Charles O'Donnell 2019-08-06 13:13:16 UTC
Add'l log excerpts. Please advise any other information that would be helpful.


Aug  5 23:00:44 s4 kernel: nvme5: cpl does not map to outstanding cmd
Aug  5 23:00:44 s4 kernel: cdw0:00000000 sqhd:0041 sqid:0007 cid:007d p:1 sc:00 sct:0 m:0 dnr:0
Aug  5 23:00:44 s4 kernel: nvme5: Missing interrupt
Aug  5 23:00:44 s4 kernel: nvme5: Resetting controller due to a timeout.
Aug  5 23:00:44 s4 kernel: nvme5: resetting controller
Aug  5 23:00:44 s4 kernel: nvme5: aborting outstanding i/o
Aug  5 23:00:44 s4 kernel: nvme5: WRITE sqid:1 cid:125 nsid:1 lba:10245542 len:256
Aug  5 23:00:44 s4 kernel: nvme5: ABORTED - BY REQUEST (00/07) sqid:1 cid:125 cdw0:0
Aug  5 23:00:44 s4 kernel: nvme5: aborting outstanding i/o
Aug  5 23:00:44 s4 kernel: nvme5: WRITE sqid:1 cid:124 nsid:1 lba:10245798 len:48
Aug  5 23:00:44 s4 kernel: nvme5: ABORTED - BY REQUEST (00/07) sqid:1 cid:124 cdw0:0
Aug  5 23:00:44 s4 kernel: nvme5: resubmitting queued i/o
Aug  5 23:00:44 s4 kernel: nvme5: WRITE sqid:1 cid:0 nsid:1 lba:10245846 len:256
Aug  5 23:00:44 s4 kernel: nvme5: aborting outstanding i/o
Aug  5 23:00:44 s4 kernel: nvme5: WRITE sqid:7 cid:125 nsid:1 lba:10240126 len:256
Aug  5 23:00:44 s4 kernel: nvme5: ABORTED - BY REQUEST (00/07) sqid:7 cid:125 cdw0:0
Comment 7 Warner Losh freebsd_committer 2019-09-03 04:51:13 UTC
Is there a workload that's known to trigger this bug? There's a few issues in -current that might not be in 12.0R yet that I'd like to see if I can trigger the bug and then try again with 13 to see if they are fixed and maybe get a patch into 12.0
Comment 8 Charles O'Donnell 2019-09-03 13:49:11 UTC
Hello Warner, I have note been able to reliably reproduce the problem. There may be a connection with heavy disk activity.
Comment 9 Colin Percival freebsd_committer 2020-02-15 22:22:07 UTC
Hi Charles,

Have you seen this recently?  Ideally, on FreeBSD 12.1 or HEAD?

Also, can you tell me whether you use AMD instances, and if yes whether any of them are affected?
Comment 10 Charles O'Donnell 2020-02-16 20:39:44 UTC
Hello Colin,

Thank you for your interest.

As of last week the issue was still present. The affected hosts are 12.0-RELEASE kept up to date with the most recent patch level using freebsd-update.

I just updated the affected hosts to 12.1-RELEASE using 'freebsd-update upgrade'. If the issue is still present I expect to see it appear in 5-10 days.

The original AMI was released by FreeBSD:

ami-00e61b69944d5d2a3
FreeBSD 12.0-RELEASE-amd64-f5af2713-6d84-41f0-95ea-2eaee78f2af4-ami-03b0f822e17669866.4
FreeBSD/amd64 releng/12.0@341666

Let me know if there is any other information I can provide.

Thank you.

Best.
Comment 11 Colin Percival freebsd_committer 2020-02-16 20:47:09 UTC
(In reply to Charles O'Donnell from comment #10)

Hi Charles,

Please let me know ASAP if you experience this on 12.1-RELEASE.  Some problems were fixed between 12.0 and 12.1 but I have some reports of problems still occurring on 12.1 -- so I'm trying to track down exactly what conditions are needed to cause issues with 12.1.

Other details which I'll want to know (if it recurs with 12.1):

* Instance type
* Type of EBS volumes being used
* Filesystem (all the reports I've had on 12.1 have been ZFS)
* Workload
* Whether there's significant network traffic (all the reports I've had on 12.1 have featured significant disk *and* network loads).

Thanks!
Comment 12 Charles O'Donnell 2020-02-16 20:59:32 UTC
Instance type is c5d.2xlarge.
EBS volume type is gp2.
Filesystem is ZFS (mirror).
Workload is HTTP, Database, Image processing.
Significant network load and disk activity.

Best.
Comment 13 Charles O'Donnell 2020-02-27 12:59:25 UTC
Most recent incident:

Feb 27 04:23:31 s4 kernel: nvme4: Missing interrupt
Feb 27 04:24:34 s4 syslogd: last message repeated 1 times
Feb 27 04:24:34 s4 kernel: nvme1: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme0: Missing interrupt
Feb 27 04:24:34 s4 syslogd: last message repeated 1 times
Feb 27 04:24:34 s4 kernel: nvme1: nvme1: cpl does not map to outstanding cmd
Feb 27 04:24:34 s4 kernel: cdw0:00000000 sqhd:0012 sqid:0001 cid:0017 p:0 sc:00 sct:0 m:0 dnr:0
Feb 27 04:24:34 s4 kernel: nvme1: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme4: nvme1: Missing interrupt
Feb 27 04:24:34 s4 kernel: Missing interrupt
Feb 27 04:24:34 s4 syslogd: last message repeated 1 times
Feb 27 04:24:34 s4 kernel: nvme4: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme0: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme1: nvme1: nvme4: Missing interrupt
Feb 27 04:24:34 s4 kernel: Missing interrupt

# freebsd-version -ukr
12.1-RELEASE-p2
12.1-RELEASE-p2
12.1-RELEASE-p2

# ls -lahtuT /etc/periodic/daily/
total 128
-rwxr-xr-x  1 root  wheel   811B Feb 27 03:05:34 2020 999.local
-rwxr-xr-x  1 root  wheel   2.8K Feb 27 03:05:34 2020 800.scrub-zfs
-rwxr-xr-x  1 root  wheel   845B Feb 27 03:05:34 2020 510.status-world-kernel
-rwxr-xr-x  1 root  wheel   737B Feb 27 03:05:34 2020 500.queuerun
-rwxr-xr-x  1 root  wheel   498B Feb 27 03:05:34 2020 480.status-ntpd
-rwxr-xr-x  1 root  wheel   451B Feb 27 03:05:34 2020 480.leapfile-ntpd
-rwxr-xr-x  1 root  wheel   2.0K Feb 27 03:05:34 2020 460.status-mail-rejects
-rwxr-xr-x  1 root  wheel   1.0K Feb 27 03:01:00 2020 450.status-security
-rwxr-xr-x  1 root  wheel   1.4K Feb 27 03:01:00 2020 440.status-mailq
-rwxr-xr-x  1 root  wheel   705B Feb 27 03:01:00 2020 430.status-uptime
-rwxr-xr-x  1 root  wheel   611B Feb 27 03:01:00 2020 420.status-network
-rwxr-xr-x  1 root  wheel   684B Feb 27 03:01:00 2020 410.status-mfi
-rwxr-xr-x  1 root  wheel   590B Feb 27 03:01:00 2020 409.status-gconcat
-rwxr-xr-x  1 root  wheel   590B Feb 27 03:01:00 2020 408.status-gstripe
-rwxr-xr-x  1 root  wheel   591B Feb 27 03:01:00 2020 407.status-graid3
-rwxr-xr-x  1 root  wheel   596B Feb 27 03:01:00 2020 406.status-gmirror
-rwxr-xr-x  1 root  wheel   807B Feb 27 03:01:00 2020 404.status-zfs
-rwxr-xr-x  1 root  wheel   583B Feb 27 03:01:00 2020 401.status-graid
-rwxr-xr-x  1 root  wheel   773B Feb 27 03:01:00 2020 400.status-disks
-rwxr-xr-x  1 root  wheel   724B Feb 27 03:01:00 2020 330.news
-r-xr-xr-x  1 root  wheel   1.4K Feb 27 03:01:00 2020 310.accounting
-rwxr-xr-x  1 root  wheel   693B Feb 27 03:01:00 2020 300.calendar
-rwxr-xr-x  1 root  wheel   1.0K Feb 27 03:01:00 2020 210.backup-aliases
-rwxr-xr-x  1 root  wheel   1.7K Feb 27 03:01:00 2020 200.backup-passwd
-rwxr-xr-x  1 root  wheel   603B Feb 27 03:01:00 2020 150.clean-hoststat
-rwxr-xr-x  1 root  wheel   1.0K Feb 27 03:01:00 2020 140.clean-rwho
-rwxr-xr-x  1 root  wheel   709B Feb 27 03:01:00 2020 130.clean-msgs
-rwxr-xr-x  1 root  wheel   1.1K Feb 27 03:01:00 2020 120.clean-preserve
-rwxr-xr-x  1 root  wheel   1.5K Feb 27 03:01:00 2020 110.clean-tmps
-rwxr-xr-x  1 root  wheel   1.3K Feb 27 03:01:00 2020 100.clean-disks
drwxr-xr-x  2 root  wheel   1.0K Dec  7 01:23:36 2018 .
drwxr-xr-x  6 root  wheel   512B Dec  7 01:23:36 2018 ..
Comment 14 Vincent Milum Jr 2020-03-15 06:16:59 UTC
I just ran into the "nvmeX: Missing interrupt" on FreeNAS under VMWare when testing something out tonight. It happened right at boot, and repeated quite often (to the point I couldn't even fully boot the OS after ~20 minutes). FreeNAS is still on 11.3-RELEASE. But this is a homelab system that I can easily build test cases with.

nvme drives are Intel Optane M10 16GB. These are cheap on eBay right now. No data on them at all yet (empty partition table). I was using PCIe passthrough on ESXi-6.7.0-20190802001-standard.

I'll start testing various 12.x and 13 kernels to see if conditions are repeatable on any of those with this config too (if it is, I'll post more details)
Comment 15 Colin Percival freebsd_committer 2020-03-15 16:46:20 UTC
Vincent: There are known issues with the nvme driver on 11.3 and 12.0.  If you can reproduce this on 12.1 I'll be *very* interested but I think it's very likely the problem will go away when you upgrade.
Comment 16 Vincent Milum Jr 2020-03-17 00:56:44 UTC
I actually get a kernel panic in 12.1, but it works in 13

https://twitter.com/DarkainMX/status/1239715138027962368