Bug 235856 - FreeBSD freezes on AWS EC2 t3 machines
Summary: FreeBSD freezes on AWS EC2 t3 machines
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.0-RELEASE
Hardware: amd64 Any
: --- Affects Only Me
Assignee: freebsd-virtualization (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-02-19 08:30 UTC by Alex Dupre
Modified: 2020-06-10 07:59 UTC (History)
4 users (show)

See Also:


Attachments
VM screenshot (138.66 KB, image/png)
2019-02-19 08:30 UTC, Alex Dupre
no flags Details
VM System Log (4.42 KB, text/plain)
2019-02-19 08:30 UTC, Alex Dupre
no flags Details
VM screenshot 2 (105.84 KB, image/jpeg)
2019-02-21 09:37 UTC, Alex Dupre
no flags Details
dmesg with crash on t3.xlarge (20.91 KB, text/plain)
2019-04-04 07:46 UTC, Alex Dupre
no flags Details
more complete log (3.46 KB, text/plain)
2019-11-25 09:08 UTC, mail
no flags Details
verbose_boot_zfs01_25_02 (34.81 KB, text/plain)
2020-03-09 08:27 UTC, mail
no flags Details
messages (348.14 KB, text/plain)
2020-03-10 08:43 UTC, mail
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Dupre freebsd_committer 2019-02-19 08:30:16 UTC
Created attachment 202156 [details]
VM screenshot

After a few days (it may happen after a day or after 2 weeks) the machine becomes unresponsive. 1/2 ec2 status check fail and the reboot doesn't seem to do anything. A stop/restart of the instance is necessary. The machine workload has constant use of CPU, network and disk with a few spikes, but nothing exaggerate.

VM screenshot and system log attached.
Comment 1 Alex Dupre freebsd_committer 2019-02-19 08:30:43 UTC
Created attachment 202157 [details]
VM System Log
Comment 2 Colin Percival freebsd_committer 2019-02-21 01:01:24 UTC
Thanks for reporting this.  A few questions:

1. How repeatable is this?  Does it happen to every instance you launch (after a variable number of days)?

2. Have you tried different instance types?

3. What sort of disk is this?
Comment 3 Alex Dupre freebsd_committer 2019-02-21 09:36:58 UTC
1) it doesn't happen to all t3 instances I have, but it happens quite regularly on a small subset that have similar workloads. This specific machine is running a few testnet blockchain nodes (bitcoin, ethereum, etc.), and it just freezed again.

2) I think it's happening, less frequently, also on a c5 instance, but I have aws automatic recover on that one.

3) All these machines have multiple gp2 disks only.
Comment 4 Alex Dupre freebsd_committer 2019-02-21 09:37:25 UTC
Created attachment 202224 [details]
VM screenshot 2
Comment 5 Alex Dupre freebsd_committer 2019-04-04 07:46:02 UTC
Created attachment 203370 [details]
dmesg with crash on t3.xlarge

I've changed the instance from a t3.large to a t3.xlarge, and this is the latest crash I got. This time it wasn't a freeze, the machine rebooted.

No issues on my c5 machines in the last month.
Comment 6 Colin Percival freebsd_committer 2019-08-30 00:46:05 UTC
Are you still experiencing this?  There has been at least one bug fix on Amazon's side which might have corrected this, but based on other reports I think there's still something going on which affects some systems.
Comment 7 Alex Dupre freebsd_committer 2019-08-30 07:19:55 UTC
I'm not 100% sure, but it looks like it didn't happen recently, so it might be fixed.

The last reboot of that machine was caused by this:

nvme1: Missing interrupt
nvme0: Missing interrupt
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
nvme1: cpl does not map to outstanding cmd
cdw0:00000000 sqhd:000a sqid:0001 cid:000b p:0 sc:00 sct:0 m:0 dnr:0
nvme1: Resetting controller due to a timeout.
nvme1: resetting controller
nvme1: temperature threshold not supported
nvme1: aborting outstanding i/o
nvme1: READ sqid:1 cid:11 nsid:1 lba:82665664 len:64
nvme1: ABORTED - BY REQUEST (00/07) sqid:1 cid:11 cdw0:0
nvme1: Missing interrupt
ena0: ena0: device is going UP
link is UP
nvme1: Resetting controller due to a timeout.
nvme1: resetting controller
nvme1: temperature threshold not supported
nvme1: aborting outstanding i/o
nvme1: READ sqid:1 cid:11 nsid:1 lba:101778688 len:64
nvme1: ABORTED - BY REQUEST (00/07) sqid:1 cid:11 cdw0:0
nvme1: aborting outstanding i/o
nvme1: READ sqid:1 cid:21 nsid:1 lba:93070656 len:64
nvme1: ABORTED - BY REQUEST (00/07) sqid:1 cid:21 cdw0:0
nvme1: aborting outstanding i/o
nvme1: WRITE sqid:1 cid:4 nsid:1 lba:87852928 len:64
nvme1: ABORTED - BY REQUEST (00/07) sqid:1 cid:4 cdw0:0


Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address   = 0x18
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff80ae362e
stack pointer           = 0x28:0xfffffe00741bc1c0
frame pointer           = 0x28:0xfffffe00741bc200
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (irq258: nvme0)
trap number             = 12
panic: page fault
cpuid = 2
time = 1563188396
KDB: stack backtrace:
#0 0xffffffff80be7957 at kdb_backtrace+0x67
#1 0xffffffff80b9b543 at vpanic+0x1a3
#2 0xffffffff80b9b393 at panic+0x43
#3 0xffffffff81074bff at trap_fatal+0x35f
#4 0xffffffff81074c59 at trap_pfault+0x49
#5 0xffffffff8107427e at trap+0x29e
#6 0xffffffff8104f6f5 at calltrap+0x8
#7 0xffffffff807674f6 at nvme_qpair_complete_tracker+0x166
#8 0xffffffff807672b4 at nvme_qpair_process_completions+0x94
#9 0xffffffff80b5edf7 at ithread_loop+0x1a7
#10 0xffffffff80b5bf13 at fork_exit+0x83
#11 0xffffffff810506ee at fork_trampoline+0xe
Uptime: 5d3h3m9s
Comment 8 Charles O'Donnell 2019-08-30 13:47:38 UTC
Same symptoms still occurring regularly here. Begins with "nvmeX: Missing interrupt" in the logs, which can go on for a few hours, then the server becomes unresponsive. Typically occurs 4-6 days after boot. Only the most heavily loaded production servers are affected. Other servers have never exhibited this behavior. Have tried a variety of instance types. At this point I have it monitored so I can do a graceful cycle before it locks up. 

See also bug #236989

Uname:

FreeBSD 12.0-RELEASE-p10 FreeBSD 12.0-RELEASE-p10 GENERIC  amd64

Log excerpt:

Aug 27 03:02:45 host kernel: nvme4: Missing interrupt
Aug 27 03:02:56 host kernel: nvme1: Missing interrupt
Aug 27 04:19:11 host kernel: nvme1: nvme4: Missing interrupt
Aug 27 04:19:11 host kernel: Missing interrupt
Aug 27 04:19:30 host kernel: nvme0: Missing interrupt
Aug 27 05:16:58 host kernel: nvme1: Missing interrupt
Aug 27 05:16:58 host kernel: nvme4: Missing interrupt
Aug 27 14:58:13 host kernel: ena0: Keep alive watchdog timeout.
Aug 27 14:58:13 host kernel: ena0: Trigger reset is on
Aug 27 14:58:13 host kernel: ena0: device is going DOWN
Aug 27 14:58:13 host kernel: nvme4: Missing interrupt
Aug 27 14:58:13 host kernel: nvme1: Missing interrupt
Aug 27 14:58:16 host dhclient[417]: send_packet6: Network is down
Aug 27 14:58:16 host dhclient[417]: dhc6: send_packet6() sent -1 of 52 bytes
Aug 27 14:58:17 host kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x25c
Aug 27 14:58:17 host kernel: ena0: ena0: device is going UP
Aug 27 14:58:17 host kernel: link is UP
Comment 9 mail 2019-11-25 08:22:24 UTC
(In reply to Charles O'Donnell from comment #8)

Hi,

We just suffered from the same issue. This particular FreeBSD 12.1 r5.large instance with an additional gp2 disk (ZFS)  was up for about 2 weeks when this occured.

These are the last entries in the "logs" :

Nov 25 03:04:36 zfs01 kernel: nvme1: cpl does not map to outstanding cmd
Nov 25 03:04:36 zfs01 kernel: cdw0:00000000 sqhd:001a sqid:0001 cid:0018
p:0 sc:00 sct:0 m:0 dnr:0
Nov 25 03:04:36 zfs01 kernel: nvme1: Missing interrupt
Nov 25 03:04:36 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Nov 25 03:04:36 zfs01 kernel: nvme1: resetting controller
Nov 25 03:04:36 zfs01 kernel: nvme1: temperature threshold not supported
Nov 25 03:04:36 zfs01 kernel: nvme1: aborting outstanding i/o
Nov 25 03:05:07 zfs01 kernel: nvme1: Missing interrupt
Nov 25 03:05:36 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Nov 25 03:05:36 zfs01 kernel: nvme1: resetting controller
Nov 25 03:05:37 zfs01 kernel: nvme1: temperature threshold not supported
Nov 25 03:05:37 zfs01 kernel: nvme1: aborting outstanding i/o
Nov 25 03:05:37 zfs01 syslogd: last message repeated 4 times
Nov 25 03:06:07 zfs01 kernel: nvme1: Missing interrupt
Nov 25 03:09:08 zfs01 dhclient[84191]: send_packet: Host is down
Nov 25 03:09:52 zfs01 syslogd: last message repeated 4 times

@Colin/@Alex: Do you know if this occurs on "older" instance types as well? We would definately switch to older instance-types if this issue doesn't occur on older generations.

Kind regards,

Ruben
Comment 10 mail 2019-11-25 08:41:53 UTC
Hmm. We have a standby machine for this instance (same specs) that has similar log entries, but has not "frozen" yet:

Nov 18 16:51:32 zfs02 kernel: link is UP
Nov 18 16:55:50 zfs02 kernel: nvme1: Missing interrupt
Nov 18 17:01:39 zfs02 dhclient[75125]: send_packet6: Network is down
Nov 18 17:01:39 zfs02 dhclient[75125]: dhc6: send_packet6() sent -1 of 52 bytes
Nov 19 03:30:59 zfs02 kernel: nvme1: cpl does not map to outstanding cmd
Nov 19 03:30:59 zfs02 kernel: cdw0:00000000 sqhd:0006 sqid:0001 cid:000d p:0 sc:00 sct:0 m:0 dnr:0
Nov 19 03:30:59 zfs02 kernel: nvme1: Resetting controller due to a timeout.
Nov 19 03:30:59 zfs02 kernel: nvme1: resetting controller
Nov 19 03:30:59 zfs02 kernel: nvme1: temperature threshold not supported
Nov 19 03:30:59 zfs02 kernel: nvme1: aborting outstanding i/o
Nov 19 03:31:28 zfs02 dhclient[75125]: send_packet6: Network is down
Nov 19 03:31:28 zfs02 dhclient[75125]: dhc6: send_packet6() sent -1 of 52 bytes
Nov 19 03:31:29 zfs02 kernel: 
Nov 19 03:31:29 zfs02 kernel: stray irq260

I'm curious when the "freezing" will initiate... 

Is there any way to increase the verbosity / usefullness of the logs?
Comment 11 Alex Dupre freebsd_committer 2019-11-25 09:04:35 UTC
(In reply to mail from comment #9)
Older instances haven't nvme drives (and ena network cards), so these specific issues cannot occur.
Comment 12 mail 2019-11-25 09:08:04 UTC
Created attachment 209401 [details]
more complete log

A more complete log (also shows ena0 issues, perhaps there is a relation?)
Comment 13 mail 2019-11-25 09:10:55 UTC
(In reply to Alex Dupre from comment #11)

Hi Alex,

Ah, that is good news.

We will downgrade the zfs01 instance and keep the zfs02 instance on the problematic generation.

Thank you for your feedback :)

Kind regards,

Ruben
Comment 14 Alex Dupre freebsd_committer 2019-11-25 10:04:24 UTC
(In reply to mail from comment #13)
The bad news is that HW performance are worse on previous generations.
Comment 15 mail 2019-11-25 10:11:16 UTC
(In reply to Alex Dupre from comment #14)

Hi Alex,

Yes, that's very unfortunate. Stable lower performance vs unstable higher performance is not a difficult call (for this production platform) though...

We will keep the other node on the newer generation AWS hardware. Perhaps that way we can help wiser / more knowledgeable people out with debugging this issue. If anybody would like us to us to perform tests / grant access: please let us know!

Kind regards,

Ruben
Comment 16 Colin Percival freebsd_committer 2020-02-15 22:28:38 UTC
A few questions:

1. Have you seen this recently?

2. Have you seen this on FreeBSD 12.1 or 13.0?

3. Have you tried AMD instances, and if yes has this problem appeared there?

4. Can you reproduce this reasonably quickly (say, by running an instance for a day or two)?
Comment 17 Alex Dupre freebsd_committer 2020-02-17 08:48:32 UTC
(In reply to Colin Percival from comment #16)

Unfortunately I had to switch off those EC2 instances, so I cannot say if it'd still happen or not.

I didn't try FreeBSD 13 nor AMD instances.
Comment 18 mail 2020-02-17 10:47:29 UTC
Hi Colin,

> 1. How repeatable is this?  Does it happen to every instance you launch (after a variable number of days)?

Unfortunately we are not in the habit of often redeplying our zfs nodes (since they provide storage for whole platforms) :( It still happens to the affected nodes though:

 bzgrep "nvme1: Missing interrupt" /var/log/messages.0.bz2 ; grep  "nvme1: Missing interrupt" /var/log/messages
Nov 16 03:04:18 zfs01 kernel: nvme1: Missing interrupt
Nov 16 03:05:19 zfs01 kernel: nvme1: Missing interrupt
Nov 25 03:04:36 zfs01 kernel: nvme1: Missing interrupt
Nov 25 03:05:07 zfs01 kernel: nvme1: Missing interrupt
Nov 25 03:06:07 zfs01 kernel: nvme1: Missing interrupt
Dec 13 03:04:34 zfs01 kernel: nvme1: Missing interrupt
Dec 13 03:05:35 zfs01 kernel: nvme1: Missing interrupt
Dec 13 03:06:26 zfs01 kernel: nvme1: Missing interrupt
Dec 13 03:06:57 zfs01 kernel: nvme1: Missing interrupt
Dec 13 03:07:58 zfs01 kernel: nvme1: Missing interrupt
Jan 25 03:06:02 zfs01 kernel: nvme1: Missing interrupt
Jan 25 03:07:02 zfs01 kernel: nvme1: Missing interrupt
Feb 11 03:05:32 zfs01 kernel: nvme1: Missing interrupt
Feb 11 03:07:01 zfs01 kernel: nvme1: Missing interrupt
Feb 17 03:06:29 zfs01 kernel: nvme1: Missing interrupt

===

bzgrep "nvme1: Missing interrupt" /var/log/messages.0.bz2 ; grep "nvme1: Missing interrupt" /var/log/messages
Jan 25 04:29:03 volume3 kernel: nvme1: Missing interrupt
Feb  4 04:04:45 volume3 kernel: nvme1: Missing interrupt
Feb 11 04:04:48 volume3 kernel: nvme1: Missing interrupt

Kind of interesting that zfs01 and volume03 have totally different customers, usage patterns but have a collission of 2 dates :| 


> 2. Have you tried different instance types?

Yes. This issue is not manifesting itself on an r4.xlarge instance. Same ami on r5.large: problems...

> 3. What sort of disk is this?

We use cloudformation/ansible to deploy these servers, so they are all kind of identically configured (apart from sizing).  Both instances suffering from this issue are 500GB+ EBS GP2 or IO disks with GPT and a zpool configured onto them. 

Please let me know if you would like to receive more information.

Kind regards,

Ruben
Comment 19 mail 2020-02-17 11:14:55 UTC
Hi Colin,

After some further thought: If you'd like, we could provision an additional set of r4/r5 machines and give you access to them?

Kind regards,

Ruben
Comment 20 Colin Percival freebsd_committer 2020-02-17 19:14:40 UTC
Hi Ruben,

I'd love to be able to reproduce this -- do you have some sort of "canned workload" which will do it?  I hesitate about asking for access to "live" systems since as soon as I can reproduce it I'll want to do things like building new kernels with extra debugging/logging added, and I don't want to interfere with production services -- and everything I'm seeing so far suggests that this is workload dependent, so simply spinning up an instance without any workload won't help. :-/

Colin
Comment 21 Colin Percival freebsd_committer 2020-02-17 19:19:57 UTC
(In reply to mail from comment #18)

Ruben: Looking at those logs... is the second system in a different timezone from the first?  The first one looks very much like it's glitching when the 'periodic daily' cron job runs at 3:01 AM, while the second is consistently an hour later.
Comment 22 mail 2020-02-19 12:24:35 UTC
(In reply to Colin Percival from comment #20)

Hi Colin,

We've spun up some additional instances and we are trying to trigger the issue by configuring a workload that should be similar enough to do so.

Giving access to those nodes it not a problem whatsoever since we spun them up specifically to investigate this issue.

Kind regards,

Ruben
Comment 23 mail 2020-02-19 13:20:36 UTC
(In reply to Colin Percival from comment #21)

Hmm. The difference of exactly 1 hour doesn't seem to have a relationship with different timezones...

Today we had another occurance on one of the machines:

Feb 19 03:24:53 volume3 kernel: nvme1: cpl does not map to outstanding cmd
Feb 19 03:24:53 volume3 kernel: cdw0:00000000 sqhd:000c sqid:0002 cid:0017 p:0 sc:00 sct:0 m:0 dnr:0
Feb 19 03:24:53 volume3 kernel: nvme1: Missing interrupt
Feb 19 03:24:53 volume3 kernel: nvme1: Resetting controller due to a timeout.
Feb 19 03:24:53 volume3 kernel: nvme1: resetting controller
Feb 19 03:24:54 volume3 kernel: nvme1: temperature threshold not supported
Feb 19 03:24:54 volume3 kernel: nvme1: aborting outstanding i/o

Comparing 03:24:53 with the access times of the daily scripts though:

ls -lahtuT /etc/periodic/daily/
total 128
-rwxr-xr-x  1 root  wheel   1.0K Feb 19 12:31:47 2020 450.status-security
-rwxr-xr-x  1 root  wheel   811B Feb 19 05:32:14 2020 999.local
-rwxr-xr-x  1 root  wheel   2.8K Feb 19 05:32:14 2020 800.scrub-zfs
-rwxr-xr-x  1 root  wheel   845B Feb 19 05:32:14 2020 510.status-world-kernel
-rwxr-xr-x  1 root  wheel   737B Feb 19 05:32:14 2020 500.queuerun
-rwxr-xr-x  1 root  wheel   498B Feb 19 05:32:14 2020 480.status-ntpd
-rwxr-xr-x  1 root  wheel   451B Feb 19 05:32:14 2020 480.leapfile-ntpd
-rwxr-xr-x  1 root  wheel   2.0K Feb 19 05:32:14 2020 460.status-mail-rejects
-rwxr-xr-x  1 root  wheel   1.4K Feb 19 03:01:00 2020 440.status-mailq
-rwxr-xr-x  1 root  wheel   705B Feb 19 03:01:00 2020 430.status-uptime
-rwxr-xr-x  1 root  wheel   611B Feb 19 03:01:00 2020 420.status-network
-rwxr-xr-x  1 root  wheel   684B Feb 19 03:01:00 2020 410.status-mfi
-rwxr-xr-x  1 root  wheel   590B Feb 19 03:01:00 2020 409.status-gconcat
-rwxr-xr-x  1 root  wheel   590B Feb 19 03:01:00 2020 408.status-gstripe
-rwxr-xr-x  1 root  wheel   591B Feb 19 03:01:00 2020 407.status-graid3
-rwxr-xr-x  1 root  wheel   596B Feb 19 03:01:00 2020 406.status-gmirror
-rwxr-xr-x  1 root  wheel   807B Feb 19 03:01:00 2020 404.status-zfs
-rwxr-xr-x  1 root  wheel   583B Feb 19 03:01:00 2020 401.status-graid
-rwxr-xr-x  1 root  wheel   773B Feb 19 03:01:00 2020 400.status-disks
-rwxr-xr-x  1 root  wheel   724B Feb 19 03:01:00 2020 330.news
-r-xr-xr-x  1 root  wheel   1.4K Feb 19 03:01:00 2020 310.accounting
-rwxr-xr-x  1 root  wheel   693B Feb 19 03:01:00 2020 300.calendar
-rwxr-xr-x  1 root  wheel   1.0K Feb 19 03:01:00 2020 210.backup-aliases
-rwxr-xr-x  1 root  wheel   1.7K Feb 19 03:01:00 2020 200.backup-passwd
-rwxr-xr-x  1 root  wheel   603B Feb 19 03:01:00 2020 150.clean-hoststat
-rwxr-xr-x  1 root  wheel   1.0K Feb 19 03:01:00 2020 140.clean-rwho
-rwxr-xr-x  1 root  wheel   709B Feb 19 03:01:00 2020 130.clean-msgs
-rwxr-xr-x  1 root  wheel   1.1K Feb 19 03:01:00 2020 120.clean-preserve
-rwxr-xr-x  1 root  wheel   1.5K Feb 19 03:01:00 2020 110.clean-tmps
-rwxr-xr-x  1 root  wheel   1.3K Feb 19 03:01:00 2020 100.clean-disks
drwxr-xr-x  2 root  wheel   1.0K Nov  1 07:06:41 2019 .
drwxr-xr-x  6 root  wheel   512B Nov  1 07:06:41 2019 ..

but if the periodic framework executes the jobs serially I see no link with  440.status-mailq (that does not sound like high io) :S.

I think there definately is a link between this bug and high disk-/network-io so the periodic framework probably classifies as a nice trigger (especially the security bits with the find commands)....

We will continue to cross-reference the access times of the daily scripts with the "Missing interrupt" occurences and post updates.

Kind regards,

Ruben
Comment 24 Colin Percival freebsd_committer 2020-02-19 20:27:16 UTC
(In reply to mail from comment #23)

> Comparing 03:24:53 with the access times of the daily scripts though:
>
> ls -lahtuT /etc/periodic/daily/
> [snip]

Looks like it was probably something from the daily security run (which is
kicked off by 450.status-security).  Can you look at the timestamps on
/etc/periodic/security/ ?

> I think there definately is a link between this bug and high disk-/network-io
> so the periodic framework probably classifies as a nice trigger (especially the
> security bits with the find commands)....

Yeah, I'm guessing that it's related to find causing lots of I/O.
Comment 25 mail 2020-02-19 20:41:23 UTC
(In reply to Colin Percival from comment #24)

Hi Colin,

Since the troubleshooting earlier today I ran several find commands on the /etc/ directory :(

Ill make sure to ls all directories next time!

Apologies!

Regards,

Ruben
Comment 26 mail 2020-02-24 10:15:14 UTC
(In reply to mail from comment #25)

I've seen a couple nvme messages on the test-setup:

root@zfs01:/usr/home/ruben # grep nvme1 /var/log/messages 
Feb 19 14:11:34 freebsd kernel: nvme1: <Generic NVMe Device> mem 0xfebf8000-0xfebfbfff irq 11 at device 31.0 on pci0
Feb 19 14:11:34 freebsd kernel: nvme1: temperature threshold not supported
Feb 22 23:16:44 zfs01 kernel: nvme1: Missing interrupt
Feb 24 05:24:52 zfs01 kernel: nvme1: Missing interrupt
root@zfs01:/usr/home/ruben # uptime
10:08AM  up 4 days, 19:57, 1 user, load averages: 1.92, 2.28, 2.23
root@zfs01:/usr/home/ruben # date
Mon Feb 24 10:08:54 UTC 2020
root@zfs01:/usr/home/ruben # 

Will try to further tune the test workload this week.
Comment 27 Colin Percival freebsd_committer 2020-02-24 20:14:30 UTC
Can you confirm that the "Missing interrupt" message is coming from FreeBSD 12.1?  There was an issue with that symptom which was fixed between 12.0 and 12.1.
Comment 28 Colin Percival freebsd_committer 2020-02-24 21:10:52 UTC
Can you provide a verbose boot?  We're wondering if there's something going on with interrupt routing.
Comment 29 mail 2020-02-24 21:16:07 UTC
(In reply to Colin Percival from comment #27)

Hi Colin,

Yes, both production instances are running FreeBSD 12.1 :

root@volume3:/usr/home/ruben # freebsd-version -ukr
12.1-RELEASE-p1
12.1-RELEASE-p1
12.1-RELEASE-p1
root@volume3:/usr/home/ruben # 

root@zfs01:/usr/home/ruben # freebsd-version -ukr
12.1-RELEASE
12.1-RELEASE
12.1-RELEASE
root@zfs01:/usr/home/ruben # 

as is our test-setup:

root@zfs01:/usr/home/ruben # freebsd-version -ukr
12.1-RELEASE-p2
12.1-RELEASE-p2
12.1-RELEASE-p2
root@zfs01:/usr/home/ruben # 

Regards,

Ruben
Comment 30 mail 2020-02-24 21:20:12 UTC
(In reply to Colin Percival from comment #28)

Hi Colin,

Are you referring to sharing the contents of /var/run/dmesg.boot while 

boot_verbose="YES"

has been set in /boot/loader.conf?

If so, then I will be able to do so tomorrow for at least one of the production instances.

Regards,

Ruben
Comment 31 Colin Percival freebsd_committer 2020-02-24 21:32:32 UTC
Yes, that's right.  Ideal would be if you can get the verbose dump and confirm that the instance later suffers from this glitch -- just to rule out some sort of "interrupts *sometimes* get set up wrong at boot time" issue.
Comment 32 mail 2020-02-25 12:02:14 UTC
(In reply to Colin Percival from comment #31)

Hi,

I've managed to reboot all servers with the verbose setting active. If the issue occurs again ill be sure to upload the information you requested (unless you want to have a look at it "now"?).

One of the machines seemed weirdly impacted by a similar issue whilst receiving a stream of 5 MB/s (zfs snapshot):

Feb 25 11:04:01 volume3 kernel: ena0: Keep alive watchdog timeout.
Feb 25 11:04:01 volume3 kernel: ena0: Trigger reset is on
Feb 25 11:04:01 volume3 kernel: ena0: device is going DOWN
Feb 25 11:04:19 volume3 kernel: ena0: ena0: device is going UP
Feb 25 11:04:19 volume3 kernel: link is UP
Feb 25 11:05:10 volume3 dhclient[33054]: send_packet6: Network is down
Feb 25 11:05:10 volume3 dhclient[33054]: dhc6: send_packet6() sent -1 of 52 bytes
Feb 25 11:26:32 volume3 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Feb 25 11:26:32 volume3 kernel: ena0: Trigger reset is on
Feb 25 11:26:32 volume3 kernel: ena0: device is going DOWN
Feb 25 11:26:33 volume3 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x182
Feb 25 11:26:33 volume3 kernel: 
Feb 25 11:26:33 volume3 kernel: stray irq260
Feb 25 11:26:33 volume3 kernel: ena0: ena0: device is going UP
Feb 25 11:26:33 volume3 kernel: link is UP
Feb 25 11:34:06 volume3 shutdown[76043]: reboot by ruben: 

Symptoms : very high load, unresponsive system (freezes/hangs on console). 
No nvme messages though. Unfortunately, I did not yet have the verbose boot information you requested at that point. Did not have time to further debug either, since we need this machine up and running (initiated the reboot at 11:34) :(

Regards,

Ruben
Comment 33 Charles O'Donnell 2020-02-27 12:57:35 UTC
Most recent incident:

Feb 27 04:23:31 s4 kernel: nvme4: Missing interrupt
Feb 27 04:24:34 s4 syslogd: last message repeated 1 times
Feb 27 04:24:34 s4 kernel: nvme1: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme0: Missing interrupt
Feb 27 04:24:34 s4 syslogd: last message repeated 1 times
Feb 27 04:24:34 s4 kernel: nvme1: nvme1: cpl does not map to outstanding cmd
Feb 27 04:24:34 s4 kernel: cdw0:00000000 sqhd:0012 sqid:0001 cid:0017 p:0 sc:00 sct:0 m:0 dnr:0
Feb 27 04:24:34 s4 kernel: nvme1: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme4: nvme1: Missing interrupt
Feb 27 04:24:34 s4 kernel: Missing interrupt
Feb 27 04:24:34 s4 syslogd: last message repeated 1 times
Feb 27 04:24:34 s4 kernel: nvme4: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme0: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme1: nvme1: nvme4: Missing interrupt
Feb 27 04:24:34 s4 kernel: Missing interrupt

# freebsd-version -ukr
12.1-RELEASE-p2
12.1-RELEASE-p2
12.1-RELEASE-p2

# ls -lahtuT /etc/periodic/daily/
total 128
-rwxr-xr-x  1 root  wheel   811B Feb 27 03:05:34 2020 999.local
-rwxr-xr-x  1 root  wheel   2.8K Feb 27 03:05:34 2020 800.scrub-zfs
-rwxr-xr-x  1 root  wheel   845B Feb 27 03:05:34 2020 510.status-world-kernel
-rwxr-xr-x  1 root  wheel   737B Feb 27 03:05:34 2020 500.queuerun
-rwxr-xr-x  1 root  wheel   498B Feb 27 03:05:34 2020 480.status-ntpd
-rwxr-xr-x  1 root  wheel   451B Feb 27 03:05:34 2020 480.leapfile-ntpd
-rwxr-xr-x  1 root  wheel   2.0K Feb 27 03:05:34 2020 460.status-mail-rejects
-rwxr-xr-x  1 root  wheel   1.0K Feb 27 03:01:00 2020 450.status-security
-rwxr-xr-x  1 root  wheel   1.4K Feb 27 03:01:00 2020 440.status-mailq
-rwxr-xr-x  1 root  wheel   705B Feb 27 03:01:00 2020 430.status-uptime
-rwxr-xr-x  1 root  wheel   611B Feb 27 03:01:00 2020 420.status-network
-rwxr-xr-x  1 root  wheel   684B Feb 27 03:01:00 2020 410.status-mfi
-rwxr-xr-x  1 root  wheel   590B Feb 27 03:01:00 2020 409.status-gconcat
-rwxr-xr-x  1 root  wheel   590B Feb 27 03:01:00 2020 408.status-gstripe
-rwxr-xr-x  1 root  wheel   591B Feb 27 03:01:00 2020 407.status-graid3
-rwxr-xr-x  1 root  wheel   596B Feb 27 03:01:00 2020 406.status-gmirror
-rwxr-xr-x  1 root  wheel   807B Feb 27 03:01:00 2020 404.status-zfs
-rwxr-xr-x  1 root  wheel   583B Feb 27 03:01:00 2020 401.status-graid
-rwxr-xr-x  1 root  wheel   773B Feb 27 03:01:00 2020 400.status-disks
-rwxr-xr-x  1 root  wheel   724B Feb 27 03:01:00 2020 330.news
-r-xr-xr-x  1 root  wheel   1.4K Feb 27 03:01:00 2020 310.accounting
-rwxr-xr-x  1 root  wheel   693B Feb 27 03:01:00 2020 300.calendar
-rwxr-xr-x  1 root  wheel   1.0K Feb 27 03:01:00 2020 210.backup-aliases
-rwxr-xr-x  1 root  wheel   1.7K Feb 27 03:01:00 2020 200.backup-passwd
-rwxr-xr-x  1 root  wheel   603B Feb 27 03:01:00 2020 150.clean-hoststat
-rwxr-xr-x  1 root  wheel   1.0K Feb 27 03:01:00 2020 140.clean-rwho
-rwxr-xr-x  1 root  wheel   709B Feb 27 03:01:00 2020 130.clean-msgs
-rwxr-xr-x  1 root  wheel   1.1K Feb 27 03:01:00 2020 120.clean-preserve
-rwxr-xr-x  1 root  wheel   1.5K Feb 27 03:01:00 2020 110.clean-tmps
-rwxr-xr-x  1 root  wheel   1.3K Feb 27 03:01:00 2020 100.clean-disks
drwxr-xr-x  2 root  wheel   1.0K Dec  7 01:23:36 2018 .
drwxr-xr-x  6 root  wheel   512B Dec  7 01:23:36 2018 ..
Comment 34 Colin Percival freebsd_committer 2020-02-27 17:58:46 UTC
Are these "ephemeral" disks or EBS attached disks?

Can you ls -lahtuT /etc/periodic/security/ ?

What's the instance ID?
Comment 35 Charles O'Donnell 2020-03-01 12:49:06 UTC
> Are these "ephemeral" disks or EBS attached disks?
EBS attached.
EBS volume type is gp2.
Filesystem is zfs.


> What's the instance ID?
Instance type is c5d.2xlarge.

ami-00e61b69944d5d2a3
FreeBSD 12.0-RELEASE-amd64-f5af2713-6d84-41f0-95ea-2eaee78f2af4-ami-03b0f822e17669866.4
FreeBSD/amd64 releng/12.0@341666

Workload is HTTP, Database, Image processing.
Significant network load and disk activity.

See also bug #236989


> Can you ls -lahtuT /etc/periodic/security/ ?
ls -lahtuT /etc/periodic/security/
total 68
-r--r--r--  1 root  wheel   2.8K Mar  1 05:30:01 2020 security.functions
-rwxr-xr-x  1 root  wheel   2.3K Mar  1 05:30:01 2020 900.tcpwrap
-rwxr-xr-x  1 root  wheel   2.3K Mar  1 05:30:01 2020 800.loginfail
-rwxr-xr-x  1 root  wheel   1.9K Mar  1 05:30:01 2020 700.kernelmsg
-rwxr-xr-x  1 root  wheel   2.0K Mar  1 05:30:01 2020 610.ipf6denied
-rwxr-xr-x  1 root  wheel   2.2K Mar  1 05:30:01 2020 550.ipfwlimit
-rwxr-xr-x  1 root  wheel   2.1K Mar  1 05:30:01 2020 520.pfdenied
-rwxr-xr-x  1 root  wheel   1.9K Mar  1 05:30:01 2020 510.ipfdenied
-rwxr-xr-x  1 root  wheel   2.0K Mar  1 05:30:01 2020 500.ipfwdenied
-rwxr-xr-x  1 root  wheel   1.9K Mar  1 05:30:01 2020 410.logincheck
-rwxr-xr-x  1 root  wheel   1.9K Mar  1 05:30:01 2020 400.passwdless
-rwxr-xr-x  1 root  wheel   1.9K Mar  1 05:30:01 2020 300.chkuid0
-rwxr-xr-x  1 root  wheel   2.3K Mar  1 05:30:01 2020 200.chkmounts
-rwxr-xr-x  1 root  wheel   2.2K Mar  1 05:30:01 2020 110.neggrpperm
-rwxr-xr-x  1 root  wheel   2.2K Mar  1 05:30:01 2020 100.chksetuid
drwxr-xr-x  2 root  wheel   512B Dec  7 01:23:36 2018 .
drwxr-xr-x  6 root  wheel   512B Dec  7 01:23:36 2018 ..
Comment 36 mail 2020-03-09 08:24:40 UTC
Hi,

This weekend the issue manifested itself again on one of our 12.1 instances (with an EBS volume attached):

Mar  7 03:05:47 zfs01 kernel: nvme1: cpl does not map to outstanding cmd
Mar  7 03:05:47 zfs01 kernel: cdw0:00000000 sqhd:001b sqid:0001 cid:001b p:0 sc:00 sct:0 m:0 dnr:0
Mar  7 03:05:47 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Mar  7 03:05:47 zfs01 kernel: nvme1: resetting controller
Mar  7 03:05:47 zfs01 kernel: nvme1: temperature threshold not supported
Mar  7 03:05:47 zfs01 kernel: nvme1: aborting outstanding i/o
Mar  7 03:06:18 zfs01 kernel: nvme1: Missing interrupt
Mar  7 03:06:48 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Mar  7 03:06:48 zfs01 kernel: nvme1: resetting controller
Mar  7 03:06:48 zfs01 kernel: nvme1: temperature threshold not supported
Mar  7 03:06:48 zfs01 kernel: nvme1: aborting outstanding i/o
Mar  7 03:06:48 zfs01 syslogd: last message repeated 5 times
Mar  7 03:07:20 zfs01 kernel: nvme1: VERIFY sqid:1 cid:27 nsid:0 lba:0 len:1
Mar  7 03:07:20 zfs01 kernel: nvme1: INVALID OPCODE (00/01) sqid:1 cid:27 cdw0:0
Mar  7 03:07:20 zfs01 kernel: nvme1: Missing interrupt
Mar  7 03:07:20 zfs01 kernel: nvme1: VERIFY sqid:1 cid:27 nsid:0 lba:0 len:1
Mar  7 03:07:20 zfs01 kernel: nvme1: INVALID OPCODE (00/01) sqid:1 cid:27 cdw0:0
Mar  7 03:08:23 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Mar  7 03:08:23 zfs01 kernel: ena0: Trigger reset is on
Mar  7 03:08:23 zfs01 kernel: ena0: device is going DOWN
Mar  7 03:08:23 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Mar  7 03:08:23 zfs01 kernel: nvme1: resetting controller
Mar  7 03:08:23 zfs01 dhclient[40936]: send_packet6: Network is down
Mar  7 03:08:23 zfs01 dhclient[40936]: dhc6: send_packet6() sent -1 of 52 bytes
Mar  7 03:08:23 zfs01 kernel: nvme1: aborting outstanding admin command
Mar  7 03:08:23 zfs01 kernel: nvme1: CREATE IO SQ (01) sqid:0 cid:24 nsid:1 cdw10:1cb72e58 cdw11:00000000
Mar  7 03:08:23 zfs01 kernel: nvme1: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
Mar  7 03:08:23 zfs01 kernel: nvme1: temperature threshold not supported
Mar  7 03:08:23 zfs01 kernel: nvme1: aborting outstanding i/o
Mar  7 03:08:23 zfs01 syslogd: last message repeated 2 times
Mar  7 03:08:53 zfs01 kernel: nvme1: WRITE sqid:1 cid:18 nsid:1 lba:3856318 len:64
Mar  7 03:08:53 zfs01 kernel: nvme1: INVALID OPCODE (00/01) sqid:1 cid:27 cdw0:0
Mar  7 03:08:53 zfs01 kernel: nvme1: Missing interrupt
Mar  7 03:09:15 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x58
Mar  7 03:09:16 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Mar  7 03:09:16 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52
Mar  7 03:09:16 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53
Mar  7 03:09:16 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54
Mar  7 03:09:16 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X
Mar  7 03:09:16 zfs01 kernel: ena0: device is going UP
Mar  7 03:09:16 zfs01 kernel: ena0: link is UP
Mar  7 03:10:30 zfs01 dhclient[40936]: send_packet6: Network is down
Mar  7 03:10:30 zfs01 dhclient[40936]: dhc6: send_packet6() sent -1 of 52 bytes
Mar  7 03:10:32 zfs01 dhclient[69248]: send_packet: Network is down
Mar  7 03:11:16 zfs01 syslogd: last message repeated 4 times
Mar  7 03:11:33 zfs01 syslogd: last message repeated 1 times
Mar  7 03:13:31 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Mar  7 03:13:31 zfs01 kernel: ena0: Trigger reset is on
Mar  7 03:13:31 zfs01 kernel: ena0: device is going DOWN
Mar  7 03:14:25 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x134
Mar  7 03:14:26 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Mar  7 03:14:26 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52

root@zfs01:/usr/home/ruben # ls -lahtuT /etc/periodic/daily/
total 128
-rwxr-xr-x  1 root  wheel   1.0K Mar  7 03:01:00 2020 450.status-security
-rwxr-xr-x  1 root  wheel   1.4K Mar  7 03:01:00 2020 440.status-mailq
-rwxr-xr-x  1 root  wheel   705B Mar  7 03:01:00 2020 430.status-uptime
-rwxr-xr-x  1 root  wheel   611B Mar  7 03:01:00 2020 420.status-network
-rwxr-xr-x  1 root  wheel   684B Mar  7 03:01:00 2020 410.status-mfi
-rwxr-xr-x  1 root  wheel   590B Mar  7 03:01:00 2020 409.status-gconcat
-rwxr-xr-x  1 root  wheel   590B Mar  7 03:01:00 2020 408.status-gstripe
-rwxr-xr-x  1 root  wheel   591B Mar  7 03:01:00 2020 407.status-graid3
-rwxr-xr-x  1 root  wheel   596B Mar  7 03:01:00 2020 406.status-gmirror
-rwxr-xr-x  1 root  wheel   807B Mar  7 03:01:00 2020 404.status-zfs
-rwxr-xr-x  1 root  wheel   583B Mar  7 03:01:00 2020 401.status-graid
-rwxr-xr-x  1 root  wheel   773B Mar  7 03:01:00 2020 400.status-disks
-rwxr-xr-x  1 root  wheel   724B Mar  7 03:01:00 2020 330.news
-r-xr-xr-x  1 root  wheel   1.4K Mar  7 03:01:00 2020 310.accounting
-rwxr-xr-x  1 root  wheel   693B Mar  7 03:01:00 2020 300.calendar
-rwxr-xr-x  1 root  wheel   1.0K Mar  7 03:01:00 2020 210.backup-aliases
-rwxr-xr-x  1 root  wheel   1.7K Mar  7 03:01:00 2020 200.backup-passwd
-rwxr-xr-x  1 root  wheel   603B Mar  7 03:01:00 2020 150.clean-hoststat
-rwxr-xr-x  1 root  wheel   1.0K Mar  7 03:01:00 2020 140.clean-rwho
-rwxr-xr-x  1 root  wheel   709B Mar  7 03:01:00 2020 130.clean-msgs
-rwxr-xr-x  1 root  wheel   1.1K Mar  7 03:01:00 2020 120.clean-preserve
-rwxr-xr-x  1 root  wheel   1.5K Mar  7 03:01:00 2020 110.clean-tmps
-rwxr-xr-x  1 root  wheel   1.3K Mar  7 03:01:00 2020 100.clean-disks
-rwxr-xr-x  1 root  wheel   811B Mar  5 03:21:29 2020 999.local
-rwxr-xr-x  1 root  wheel   2.8K Mar  5 03:21:29 2020 800.scrub-zfs
-rwxr-xr-x  1 root  wheel   845B Mar  5 03:21:29 2020 510.status-world-kernel
-rwxr-xr-x  1 root  wheel   737B Mar  5 03:21:29 2020 500.queuerun
-rwxr-xr-x  1 root  wheel   498B Mar  5 03:21:29 2020 480.status-ntpd
-rwxr-xr-x  1 root  wheel   451B Mar  5 03:03:36 2020 480.leapfile-ntpd
-rwxr-xr-x  1 root  wheel   2.0K Mar  5 03:03:18 2020 460.status-mail-rejects
drwxr-xr-x  2 root  wheel   1.0K Dec  7 06:23:36 2018 .
drwxr-xr-x  6 root  wheel   512B Dec  7 06:23:36 2018 ..
root@zfs01:/usr/home/ruben # 

root@zfs01:/usr/home/ruben # ls -lahtuT /etc/periodic/security/
total 68
-rwxr-xr-x  1 root  wheel   2.3K Mar  7 03:01:48 2020 900.tcpwrap
-rwxr-xr-x  1 root  wheel   2.3K Mar  7 03:01:48 2020 800.loginfail
-rwxr-xr-x  1 root  wheel   1.9K Mar  7 03:01:48 2020 700.kernelmsg
-r--r--r--  1 root  wheel   2.8K Mar  7 03:01:48 2020 security.functions
-rwxr-xr-x  1 root  wheel   2.0K Mar  7 03:01:48 2020 610.ipf6denied
-rwxr-xr-x  1 root  wheel   2.2K Mar  7 03:01:48 2020 550.ipfwlimit
-rwxr-xr-x  1 root  wheel   2.1K Mar  7 03:01:48 2020 520.pfdenied
-rwxr-xr-x  1 root  wheel   1.9K Mar  7 03:01:48 2020 510.ipfdenied
-rwxr-xr-x  1 root  wheel   2.0K Mar  7 03:01:48 2020 500.ipfwdenied
-rwxr-xr-x  1 root  wheel   1.9K Mar  7 03:01:48 2020 410.logincheck
-rwxr-xr-x  1 root  wheel   1.9K Mar  7 03:01:48 2020 400.passwdless
-rwxr-xr-x  1 root  wheel   1.9K Mar  7 03:01:48 2020 300.chkuid0
-rwxr-xr-x  1 root  wheel   2.3K Mar  7 03:01:48 2020 200.chkmounts
-rwxr-xr-x  1 root  wheel   2.2K Mar  7 03:01:25 2020 110.neggrpperm
-rwxr-xr-x  1 root  wheel   2.2K Mar  7 03:01:00 2020 100.chksetuid
drwxr-xr-x  2 root  wheel   512B Dec  7 06:23:36 2018 .
drwxr-xr-x  6 root  wheel   512B Dec  7 06:23:36 2018 ..
root@zfs01:/usr/home/ruben # 

the NIC had been going up/down ever since (for 2 days) until a coworker rebooted it this morning.

There does seem to be a relationship with the periodic framework, with the issues occuring at 03:05 while the last timestamp updates around 03:01 ...

Will attach the verbose boot log. Feel free to request any additional details!

Kind regards,

Ruben
Comment 37 mail 2020-03-09 08:27:46 UTC
Created attachment 212278 [details]
verbose_boot_zfs01_25_02
Comment 38 Colin Percival freebsd_committer 2020-03-09 19:23:04 UTC
Hi Ruben,

Can you confirm that "cpl does not map to outstanding cmd" was the first unusual line to show up?  (i.e., there weren't more lines before that which got lost?)  We had a hypothetical explanation for this which should have also involved having a "missing interrupt" message at the same time...

Colin Percival
Comment 39 mail 2020-03-10 08:42:42 UTC
(In reply to Colin Percival from comment #38)
Hi Colin,

There were issues with the nic (perhaps intterupt related?) prior to the "cpl does not map to outstanding cmd" log entry.

I will upload the full /var/log/messages contents between a couple of incidents.

Please let me know if you would like me to provide further context/configure additional logging etc.

Kind regards,

Ruben
Comment 40 mail 2020-03-10 08:43:27 UTC
Created attachment 212295 [details]
messages
Comment 41 mail 2020-03-12 11:18:20 UTC
Another occurance this night:

Mar 11 07:14:28 zfs01 ntpd[6009]: leapsecond file ('/var/db/ntpd.leap-seconds.list'): expired less than 75 days ago
Mar 12 03:05:45 zfs01 kernel: nvme1: cpl does not map to outstanding cmd
Mar 12 03:05:45 zfs01 kernel: cdw0:00000000 sqhd:001e sqid:0001 cid:0016 p:0 sc:00 sct:0 m:0 dnr:0
Mar 12 03:05:45 zfs01 kernel: nvme1: Missing interrupt
Mar 12 03:05:45 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Mar 12 03:05:45 zfs01 kernel: nvme1: resetting controller
Mar 12 03:05:45 zfs01 kernel: nvme1: temperature threshold not supported
Mar 12 03:05:45 zfs01 kernel: nvme1: aborting outstanding i/o
Mar 12 03:06:15 zfs01 kernel: nvme1: Missing interrupt
Mar 12 03:07:08 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Mar 12 03:07:08 zfs01 kernel: nvme1: resetting controller
Mar 12 03:07:08 zfs01 kernel: nvme1: temperature threshold not supported
Mar 12 03:07:08 zfs01 kernel: nvme1: aborting outstanding i/o
Mar 12 03:07:08 zfs01 syslogd: last message repeated 9 times
Mar 12 03:07:13 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Mar 12 03:07:13 zfs01 kernel: ena0: Trigger reset is on
Mar 12 03:07:13 zfs01 kernel: ena0: device is going DOWN
Mar 12 03:07:38 zfs01 kernel: nvme1: Missing interrupt
Mar 12 03:08:05 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x344
Mar 12 03:08:05 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Mar 12 03:08:05 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52
Mar 12 03:08:05 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53
Mar 12 03:08:05 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54
Mar 12 03:08:05 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X
Mar 12 03:08:05 zfs01 kernel: ena0: device is going UP
Mar 12 03:08:05 zfs01 kernel: ena0: link is UP
Mar 12 03:09:15 zfs01 dhclient[816]: send_packet6: Network is down
Mar 12 03:09:15 zfs01 dhclient[816]: dhc6: send_packet6() sent -1 of 52 bytes
Mar 12 03:09:50 zfs01 dhclient[67871]: send_packet: Network is down
Mar 12 03:10:12 zfs01 syslogd: last message repeated 3 times
Mar 12 03:14:23 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Mar 12 03:14:23 zfs01 kernel: ena0: Trigger reset is on
Mar 12 03:14:23 zfs01 kernel: ena0: device is going DOWN
Mar 12 03:15:18 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x176
Mar 12 03:15:19 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Mar 12 03:15:19 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52
Mar 12 03:15:19 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53
Mar 12 03:15:19 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54
Mar 12 03:15:19 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X
Mar 12 03:15:19 zfs01 kernel: ena0: device is going UP
Mar 12 03:15:19 zfs01 kernel: ena0: link is UP
Mar 12 03:15:23 zfs01 dhclient[816]: send_packet6: Network is down
Mar 12 03:15:23 zfs01 dhclient[816]: dhc6: send_packet6() sent -1 of 52 bytes

root@zfs01:~ # ls -lahtuT /etc/periodic/security/
total 68
-r--r--r--  1 root  wheel   2.8K Mar 12 03:01:48 2020 security.functions
-rwxr-xr-x  1 root  wheel   2.3K Mar 12 03:01:48 2020 900.tcpwrap
-rwxr-xr-x  1 root  wheel   2.3K Mar 12 03:01:48 2020 800.loginfail
-rwxr-xr-x  1 root  wheel   1.9K Mar 12 03:01:48 2020 700.kernelmsg
-rwxr-xr-x  1 root  wheel   2.0K Mar 12 03:01:48 2020 610.ipf6denied
-rwxr-xr-x  1 root  wheel   2.2K Mar 12 03:01:48 2020 550.ipfwlimit
-rwxr-xr-x  1 root  wheel   2.1K Mar 12 03:01:48 2020 520.pfdenied
-rwxr-xr-x  1 root  wheel   1.9K Mar 12 03:01:48 2020 510.ipfdenied
-rwxr-xr-x  1 root  wheel   2.0K Mar 12 03:01:48 2020 500.ipfwdenied
-rwxr-xr-x  1 root  wheel   1.9K Mar 12 03:01:48 2020 410.logincheck
-rwxr-xr-x  1 root  wheel   1.9K Mar 12 03:01:48 2020 400.passwdless
-rwxr-xr-x  1 root  wheel   1.9K Mar 12 03:01:48 2020 300.chkuid0
-rwxr-xr-x  1 root  wheel   2.3K Mar 12 03:01:48 2020 200.chkmounts
-rwxr-xr-x  1 root  wheel   2.2K Mar 12 03:01:24 2020 110.neggrpperm
-rwxr-xr-x  1 root  wheel   2.2K Mar 12 03:01:00 2020 100.chksetuid
drwxr-xr-x  2 root  wheel   512B Dec  7 06:23:36 2018 .
drwxr-xr-x  6 root  wheel   512B Dec  7 06:23:36 2018 ..
root@zfs01:~ # ls -lahtuT /etc/periodic/daily/
total 128
-rwxr-xr-x  1 root  wheel   811B Mar 12 03:12:55 2020 999.local
-rwxr-xr-x  1 root  wheel   2.8K Mar 12 03:12:55 2020 800.scrub-zfs
-rwxr-xr-x  1 root  wheel   845B Mar 12 03:12:55 2020 510.status-world-kernel
-rwxr-xr-x  1 root  wheel   737B Mar 12 03:12:55 2020 500.queuerun
-rwxr-xr-x  1 root  wheel   498B Mar 12 03:12:55 2020 480.status-ntpd
-rwxr-xr-x  1 root  wheel   451B Mar 12 03:02:07 2020 480.leapfile-ntpd
-rwxr-xr-x  1 root  wheel   2.0K Mar 12 03:01:57 2020 460.status-mail-rejects
-rwxr-xr-x  1 root  wheel   1.0K Mar 12 03:01:00 2020 450.status-security
-rwxr-xr-x  1 root  wheel   1.4K Mar 12 03:01:00 2020 440.status-mailq
-rwxr-xr-x  1 root  wheel   705B Mar 12 03:01:00 2020 430.status-uptime
-rwxr-xr-x  1 root  wheel   611B Mar 12 03:01:00 2020 420.status-network
-rwxr-xr-x  1 root  wheel   684B Mar 12 03:01:00 2020 410.status-mfi
-rwxr-xr-x  1 root  wheel   590B Mar 12 03:01:00 2020 409.status-gconcat
-rwxr-xr-x  1 root  wheel   590B Mar 12 03:01:00 2020 408.status-gstripe
-rwxr-xr-x  1 root  wheel   591B Mar 12 03:01:00 2020 407.status-graid3
-rwxr-xr-x  1 root  wheel   596B Mar 12 03:01:00 2020 406.status-gmirror
-rwxr-xr-x  1 root  wheel   807B Mar 12 03:01:00 2020 404.status-zfs
-rwxr-xr-x  1 root  wheel   583B Mar 12 03:01:00 2020 401.status-graid
-rwxr-xr-x  1 root  wheel   773B Mar 12 03:01:00 2020 400.status-disks
-rwxr-xr-x  1 root  wheel   724B Mar 12 03:01:00 2020 330.news
-r-xr-xr-x  1 root  wheel   1.4K Mar 12 03:01:00 2020 310.accounting
-rwxr-xr-x  1 root  wheel   693B Mar 12 03:01:00 2020 300.calendar
-rwxr-xr-x  1 root  wheel   1.0K Mar 12 03:01:00 2020 210.backup-aliases
-rwxr-xr-x  1 root  wheel   1.7K Mar 12 03:01:00 2020 200.backup-passwd
-rwxr-xr-x  1 root  wheel   603B Mar 12 03:01:00 2020 150.clean-hoststat
-rwxr-xr-x  1 root  wheel   1.0K Mar 12 03:01:00 2020 140.clean-rwho
-rwxr-xr-x  1 root  wheel   709B Mar 12 03:01:00 2020 130.clean-msgs
-rwxr-xr-x  1 root  wheel   1.1K Mar 12 03:01:00 2020 120.clean-preserve
-rwxr-xr-x  1 root  wheel   1.5K Mar 12 03:01:00 2020 110.clean-tmps
-rwxr-xr-x  1 root  wheel   1.3K Mar 12 03:01:00 2020 100.clean-disks
drwxr-xr-x  2 root  wheel   1.0K Dec  7 06:23:36 2018 .
drwxr-xr-x  6 root  wheel   512B Dec  7 06:23:36 2018 ..
root@zfs01:~ #
Comment 42 Colin Percival freebsd_committer 2020-03-12 13:11:20 UTC
Can you try setting hw.nvme.timeout_period in /boot/loader.conf?  In particular:

1. Setting it to "1" may result in more errors being logged but less hanging, and
2. Setting it to "3600" (or something similarly big) may result in fewer errors but longer hangs.

I don't think either of these is a solution but it would be very enlightening to know if either of these has the anticipated effect.
Comment 43 mail 2020-03-12 13:19:18 UTC
(In reply to Colin Percival from comment #42)
Hi Colin,

I've just added this entry to /boot/loader.conf:

hw.nvme.timeout_period=1

and rebooted the instance. Will provide feedback on perceived effects.

Kind Regards,

Ruben
Comment 44 mail 2020-03-30 14:11:52 UTC
(In reply to mail from comment #43)
Well, the instance is still online. It appears to be in some sort of trouble, probably since:

Mar 25 03:05:38 zfs01 kernel: nvme1: Missing interrupt
Mar 25 03:07:04 zfs01 syslogd: last message repeated 1 times
Mar 25 03:07:13 zfs01 syslogd: last message repeated 1 times

logging on with ssh works, but there is a huge lag noticable. There is a 5-15 second delay between characters I type on my keyboard and them appearing on the terminal.

There are no other obvious log entries and top shows:

root@zfs01:/usr/home/ruben # top -n -b 10
last pid:  6417;  load averages:  0.29,  0.56,  0.53  up 18+00:53:17    14:09:45
49 processes:  1 running, 48 sleeping
CPU:  0.8% user,  0.0% nice,  2.9% system,  0.1% interrupt, 96.2% idle
Mem: 23M Active, 3300K Inact, 25M Laundry, 14G Wired, 714M Buf, 1539M Free
ARC: 10G Total, 6090M MFU, 4031M MRU, 420K Anon, 80M Header, 508M Other
     9037M Compressed, 12G Uncompressed, 1.35:1 Ratio
Swap: 8192M Total, 8192M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
92501 root          1  25    0    21M    11M select   1   0:02   6.98% ssh
97476 root          1  22    0    11M  2528K select   0   0:01   2.98% pv
 1145 root          1  21    0    11M  2528K select   0   0:01   0.98% nc
46323 root         20  20    0    11M  2528K rpcsvc   1  84:50   0.00% nfsd
73646 ntpd          1  20    0    16M    16M select   0   1:20   0.00% ntpd
30467 _pflogd       1  20    0    12M  3072K bpf      0   1:15   0.00% pflogd
81756 root          1  21    0    12M  3892K wait     0   0:13   0.00% sh
70675 root          1  34    0    11M   968K nanslp   1   0:07   0.00% cron
58098 root          1  20    0    11M  2672K select   0   0:07   0.00% syslogd
29034 root          1  20    0    15M  5648K select   0   0:05   0.00% nrpe3

root@zfs01:/usr/home/ruben # 

The NFS clients for this nfs server also seem to be performing their normal tasks on the dataset.

Strange.
Comment 45 Colin Percival freebsd_committer 2020-03-30 17:17:11 UTC
Hi Ruben,

Getting more "missing interrupt" messages is an expected result of reducing the timeout period, so that's kind of a good sign.

I don't know what would be making the system very slow though -- not unless this is a T3 and you've run out of burstable CPU quota.  Very strange indeed.

Any errors logged recently relating to ENA?
Comment 46 mail 2020-04-02 22:01:39 UTC
(In reply to Colin Percival from comment #45)
Hi Colin,

Well, another crash this morning. A coworker also had to reboot the instance again.

There are a lot of messages regarding ENA in the logs. This is /var/log/messages from the event:

Apr  1 13:33:45 zfs01 ntpd[73646]: leapsecond file ('/var/db/ntpd.leap-seconds.list'): expired less than 96 days ago
Apr  2 03:04:51 zfs01 kernel: nvme1: Missing interrupt
Apr  2 03:04:58 zfs01 syslogd: last message repeated 1 times
Apr  2 03:05:12 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Apr  2 03:05:12 zfs01 kernel: ena0: Trigger reset is on
Apr  2 03:05:12 zfs01 kernel: ena0: device is going DOWN
Apr  2 03:05:13 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x25d
Apr  2 03:05:14 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Apr  2 03:05:14 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52
Apr  2 03:05:14 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53
Apr  2 03:05:14 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54
Apr  2 03:05:14 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X
Apr  2 03:05:14 zfs01 kernel: ena0: device is going UP
Apr  2 03:05:14 zfs01 kernel: ena0: link is UP
Apr  2 03:05:26 zfs01 kernel: nvme1: Missing interrupt
Apr  2 03:05:40 zfs01 syslogd: last message repeated 1 times
Apr  2 03:09:20 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Apr  2 03:09:20 zfs01 kernel: ena0: Trigger reset is on
Apr  2 03:09:20 zfs01 kernel: ena0: device is going DOWN
Apr  2 03:09:20 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 03:09:20 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 03:09:20 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x25bena0: free uncompleted tx mbuf qid 1 idx 0x2ad
Apr  2 03:09:21 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Apr  2 03:09:21 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52
Apr  2 03:09:21 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53
Apr  2 03:09:21 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54
Apr  2 03:09:21 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X
Apr  2 03:09:21 zfs01 kernel: ena0: device is going UP
Apr  2 03:09:21 zfs01 kernel: ena0: link is UP
Apr  2 03:11:29 zfs01 kernel: nvme1: Missing interrupt
Apr  2 03:11:40 zfs01 kernel: nvme1: cpl does not map to outstanding cmd
Apr  2 03:11:40 zfs01 kernel: cdw0:00000000 sqhd:0013 sqid:0001 cid:001c p:0 sc:00 sct:0 m:0 dnr:0
Apr  2 03:11:40 zfs01 kernel: nvme1: Missing interrupt
Apr  2 03:11:40 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Apr  2 03:11:40 zfs01 kernel: nvme1: resetting controller
Apr  2 03:11:40 zfs01 kernel: nvme1: temperature threshold not supported
Apr  2 03:11:40 zfs01 kernel: nvme1: aborting outstanding i/o
Apr  2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o
Apr  2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401380000 len:256
Apr  2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o
Apr  2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401380800 len:256
Apr  2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o
Apr  2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401381576 len:256
Apr  2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o
Apr  2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401382400 len:256
Apr  2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o
Apr  2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401405864 len:256
Apr  2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o
Apr  2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401408264 len:256
Apr  2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o
Apr  2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401412720 len:256
Apr  2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o
Apr  2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401413240 len:256
Apr  2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o
Apr  2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401414504 len:256
Apr  2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o
Apr  2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1938636792 len:8
Apr  2 03:11:46 zfs01 kernel: nvme1: Missing interrupt
Apr  2 03:11:56 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Apr  2 03:11:56 zfs01 kernel: nvme1: resetting controller
Apr  2 03:11:56 zfs01 kernel: nvme1: temperature threshold not supported
Apr  2 03:11:56 zfs01 kernel: nvme1: aborting outstanding i/o
Apr  2 03:11:56 zfs01 syslogd: last message repeated 10 times
Apr  2 03:12:02 zfs01 kernel: nvme1: WRITE sqid:1 cid:5 nsid:1 lba:3847102 len:64
Apr  2 03:12:02 zfs01 kernel: nvme1: INVALID OPCODE (00/01) sqid:1 cid:28 cdw0:0
Apr  2 03:12:02 zfs01 kernel: nvme1: Missing interrupt
Apr  2 03:14:18 zfs01 kernel: ena0: Keep alive watchdog timeout.
Apr  2 03:14:18 zfs01 kernel: ena0: Trigger reset is on
Apr  2 03:14:18 zfs01 kernel: ena0: device is going DOWN
Apr  2 03:15:36 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 03:15:36 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 03:15:39 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Apr  2 03:15:39 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52
Apr  2 03:15:39 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53
Apr  2 03:15:39 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54
Apr  2 03:15:39 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X
Apr  2 03:15:39 zfs01 kernel: ena0: ena0: device is going UP
Apr  2 03:15:39 zfs01 kernel: link is UP
Apr  2 03:17:22 zfs01 dhclient[69443]: send_packet: Network is down
Apr  2 03:17:40 zfs01 syslogd: last message repeated 4 times
Apr  2 03:17:44 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 03:17:44 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 03:17:50 zfs01 dhclient[69443]: send_packet: Network is down
Apr  2 03:18:21 zfs01 syslogd: last message repeated 1 times
Apr  2 03:19:37 zfs01 syslogd: last message repeated 2 times
Apr  2 03:19:39 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 03:19:39 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 03:23:20 zfs01 kernel: ena0: Keep alive watchdog timeout.
Apr  2 03:23:20 zfs01 kernel: ena0: Trigger reset is on
Apr  2 03:23:20 zfs01 kernel: ena0: device is going DOWN
Apr  2 03:23:20 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 03:23:20 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 03:23:23 zfs01 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x2ac
Apr  2 03:23:24 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Apr  2 03:23:24 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52
Apr  2 03:23:24 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53
Apr  2 03:23:24 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54
Apr  2 03:23:24 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X
Apr  2 03:23:24 zfs01 kernel: ena0: device is going UP
Apr  2 03:23:24 zfs01 kernel: ena0: link is UP
Apr  2 03:25:09 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 03:25:09 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 03:27:00 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 03:27:00 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 03:28:49 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 03:28:49 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 04:37:52 zfs01 nrpe[41394]: Could not read request from client 172.28.8.16, bailing out...
Apr  2 04:43:45 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Apr  2 04:43:45 zfs01 kernel: ena0: Trigger reset is on
Apr  2 04:43:45 zfs01 kernel: ena0: device is going DOWN
Apr  2 04:44:22 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 04:44:22 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 04:44:25 zfs01 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x28f
Apr  2 04:44:26 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Apr  2 04:44:26 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52
Apr  2 04:44:26 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53
Apr  2 04:44:26 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54
Apr  2 04:44:26 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X
Apr  2 04:44:26 zfs01 kernel: stray irq260
Apr  2 04:44:26 zfs01 kernel: ena0: ena0: device is going UP
Apr  2 04:44:26 zfs01 kernel: link is UP
Apr  2 04:46:33 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 04:46:33 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 04:48:36 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 04:48:36 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 04:50:25 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 04:50:25 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 04:52:37 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 04:52:37 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 04:54:40 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 04:54:40 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 05:09:52 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Apr  2 05:09:52 zfs01 kernel: ena0: Trigger reset is on
Apr  2 05:09:52 zfs01 kernel: ena0: device is going DOWN
Apr  2 05:11:03 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 05:11:03 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 05:11:04 zfs01 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x22b
Apr  2 05:11:05 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Apr  2 05:11:05 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52
Apr  2 05:11:05 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53
Apr  2 05:11:05 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54
Apr  2 05:11:05 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X
Apr  2 05:11:05 zfs01 kernel: stray irq260
Apr  2 05:11:05 zfs01 kernel: ena0: device is going UP
Apr  2 05:11:05 zfs01 kernel: ena0: link is UP
Apr  2 05:13:00 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 05:13:00 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 05:14:57 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 05:14:57 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 05:16:46 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 05:16:46 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 05:18:39 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 05:18:39 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 05:20:41 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 05:20:41 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 05:22:33 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 05:22:33 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 05:28:52 zfs01 kernel: ena0: Keep alive watchdog timeout.
Apr  2 05:28:52 zfs01 kernel: ena0: Trigger reset is on
Apr  2 05:28:52 zfs01 kernel: ena0: device is going DOWN
Apr  2 05:28:52 zfs01 dhclient[7435]: send_packet6: Network is down
Apr  2 05:28:52 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes
Apr  2 05:29:24 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Apr  2 05:29:24 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52
Apr  2 05:29:24 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53
Apr  2 05:29:24 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54
Apr  2 05:29:24 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X
Apr  2 05:29:24 zfs01 kernel: stray irq260
Apr  2 05:29:24 zfs01 kernel: ena0: ena0: device is going UP
Apr  2 05:29:24 zfs01 kernel: link is UP
Apr  2 05:57:05 zfs01 syslogd: kernel boot file is /boot/kernel/kernel
Apr  2 05:57:05 zfs01 kernel: pflog0: promiscuous mode disabled
Apr  2 05:57:05 zfs01 kernel: 
Apr  2 05:57:05 zfs01 syslogd: last message repeated 1 times
Apr  2 05:57:05 zfs01 kernel: Fatal trap 12: page fault while in kernel mode
Apr  2 05:57:05 zfs01 kernel: cpuid = 0; apic id = 00
Apr  2 05:57:05 zfs01 kernel: fault virtual address     = 0x78
Apr  2 05:57:05 zfs01 kernel: fault code                = supervisor read data, page not present
Apr  2 05:57:05 zfs01 kernel: instruction pointer       = 0x20:0xffffffff80b17c97
Apr  2 05:57:05 zfs01 kernel: stack pointer             = 0x28:0xfffffe000045d1e0
Apr  2 05:57:05 zfs01 kernel: frame pointer             = 0x28:0xfffffe000045d210
Apr  2 05:57:05 zfs01 kernel: code segment              = base rx0, limit 0xfffff, type 0x1b
Apr  2 05:57:05 zfs01 kernel:                   = DPL 0, pres 1, long 1, def32 0, gran 1
Apr  2 05:57:05 zfs01 kernel: processor eflags  = interrupt enabled, resume, IOPL = 0
Apr  2 05:57:05 zfs01 kernel: current process           = 12 (irq257: nvme0:io0)
Apr  2 05:57:05 zfs01 kernel: trap number               = 12
Apr  2 05:57:05 zfs01 kernel: panic: page fault
Apr  2 05:57:05 zfs01 kernel: cpuid = 0
Apr  2 05:57:05 zfs01 kernel: time = 1585806979
Apr  2 05:57:05 zfs01 kernel: KDB: stack backtrace:
Apr  2 05:57:05 zfs01 kernel: #0 0xffffffff80c1d2b7 at kdb_backtrace+0x67
Apr  2 05:57:05 zfs01 kernel: #1 0xffffffff80bd05ed at vpanic+0x19d
Apr  2 05:57:05 zfs01 kernel: #2 0xffffffff80bd0443 at panic+0x43
Apr  2 05:57:05 zfs01 kernel: #3 0xffffffff810a7dcc at trap_fatal+0x39c
Apr  2 05:57:05 zfs01 kernel: #4 0xffffffff810a7e19 at trap_pfault+0x49
Apr  2 05:57:05 zfs01 kernel: #5 0xffffffff810a740f at trap+0x29f
Apr  2 05:57:05 zfs01 kernel: #6 0xffffffff81081a2c at calltrap+0x8
Apr  2 05:57:05 zfs01 kernel: #7 0xffffffff8079c5ac at nvme_qpair_complete_tracker+0x1bc
Apr  2 05:57:05 zfs01 kernel: #8 0xffffffff8079c2c4 at nvme_qpair_process_completions+0xd4
Apr  2 05:57:05 zfs01 kernel: #9 0xffffffff80b93dd4 at ithread_loop+0x1d4
Apr  2 05:57:05 zfs01 kernel: #10 0xffffffff80b90c43 at fork_exit+0x83
Apr  2 05:57:05 zfs01 kernel: #11 0xffffffff81082a6e at fork_trampoline+0xe
Apr  2 05:57:05 zfs01 kernel: Uptime: 20d16h39m50s
Apr  2 05:57:05 zfs01 kernel: Rebooting...
Apr  2 05:57:05 zfs01 kernel: ---<<BOOT>>---

This occured only minutes again after running the periodic framework:

root@zfs01:~ # ls -lahtuT /etc/periodic/daily/
total 128
-rwxr-xr-x  1 root  wheel   451B Apr  2 03:02:09 2020 480.leapfile-ntpd
-rwxr-xr-x  1 root  wheel   2.0K Apr  2 03:01:58 2020 460.status-mail-rejects
-rwxr-xr-x  1 root  wheel   1.0K Apr  2 03:01:00 2020 450.status-security
-rwxr-xr-x  1 root  wheel   1.4K Apr  2 03:01:00 2020 440.status-mailq
-rwxr-xr-x  1 root  wheel   705B Apr  2 03:01:00 2020 430.status-uptime
-rwxr-xr-x  1 root  wheel   611B Apr  2 03:01:00 2020 420.status-network
-rwxr-xr-x  1 root  wheel   684B Apr  2 03:01:00 2020 410.status-mfi
-rwxr-xr-x  1 root  wheel   590B Apr  2 03:01:00 2020 409.status-gconcat
-rwxr-xr-x  1 root  wheel   590B Apr  2 03:01:00 2020 408.status-gstripe
-rwxr-xr-x  1 root  wheel   591B Apr  2 03:01:00 2020 407.status-graid3
-rwxr-xr-x  1 root  wheel   596B Apr  2 03:01:00 2020 406.status-gmirror
-rwxr-xr-x  1 root  wheel   807B Apr  2 03:01:00 2020 404.status-zfs
-rwxr-xr-x  1 root  wheel   583B Apr  2 03:01:00 2020 401.status-graid
-rwxr-xr-x  1 root  wheel   773B Apr  2 03:01:00 2020 400.status-disks
-rwxr-xr-x  1 root  wheel   724B Apr  2 03:01:00 2020 330.news
-r-xr-xr-x  1 root  wheel   1.4K Apr  2 03:01:00 2020 310.accounting
-rwxr-xr-x  1 root  wheel   693B Apr  2 03:01:00 2020 300.calendar
-rwxr-xr-x  1 root  wheel   1.0K Apr  2 03:01:00 2020 210.backup-aliases
-rwxr-xr-x  1 root  wheel   1.7K Apr  2 03:01:00 2020 200.backup-passwd
-rwxr-xr-x  1 root  wheel   603B Apr  2 03:01:00 2020 150.clean-hoststat
-rwxr-xr-x  1 root  wheel   1.0K Apr  2 03:01:00 2020 140.clean-rwho
-rwxr-xr-x  1 root  wheel   709B Apr  2 03:01:00 2020 130.clean-msgs
-rwxr-xr-x  1 root  wheel   1.1K Apr  2 03:01:00 2020 120.clean-preserve
-rwxr-xr-x  1 root  wheel   1.5K Apr  2 03:01:00 2020 110.clean-tmps
-rwxr-xr-x  1 root  wheel   1.3K Apr  2 03:01:00 2020 100.clean-disks
-rwxr-xr-x  1 root  wheel   811B Apr  1 03:54:15 2020 999.local
-rwxr-xr-x  1 root  wheel   2.8K Apr  1 03:54:15 2020 800.scrub-zfs
-rwxr-xr-x  1 root  wheel   845B Apr  1 03:54:15 2020 510.status-world-kernel
-rwxr-xr-x  1 root  wheel   737B Apr  1 03:54:15 2020 500.queuerun
-rwxr-xr-x  1 root  wheel   498B Apr  1 03:54:15 2020 480.status-ntpd
drwxr-xr-x  2 root  wheel   1.0K Dec  7 06:23:36 2018 .
drwxr-xr-x  6 root  wheel   512B Dec  7 06:23:36 2018 ..
root@zfs01:~ # ls -lahtuT /etc/periodic/security/
total 68
-r--r--r--  1 root  wheel   2.8K Apr  2 03:01:49 2020 security.functions
-rwxr-xr-x  1 root  wheel   2.3K Apr  2 03:01:49 2020 900.tcpwrap
-rwxr-xr-x  1 root  wheel   2.3K Apr  2 03:01:49 2020 800.loginfail
-rwxr-xr-x  1 root  wheel   1.9K Apr  2 03:01:49 2020 700.kernelmsg
-rwxr-xr-x  1 root  wheel   2.0K Apr  2 03:01:49 2020 610.ipf6denied
-rwxr-xr-x  1 root  wheel   2.2K Apr  2 03:01:49 2020 550.ipfwlimit
-rwxr-xr-x  1 root  wheel   2.1K Apr  2 03:01:49 2020 520.pfdenied
-rwxr-xr-x  1 root  wheel   1.9K Apr  2 03:01:49 2020 510.ipfdenied
-rwxr-xr-x  1 root  wheel   2.0K Apr  2 03:01:49 2020 500.ipfwdenied
-rwxr-xr-x  1 root  wheel   1.9K Apr  2 03:01:49 2020 410.logincheck
-rwxr-xr-x  1 root  wheel   1.9K Apr  2 03:01:49 2020 400.passwdless
-rwxr-xr-x  1 root  wheel   1.9K Apr  2 03:01:49 2020 300.chkuid0
-rwxr-xr-x  1 root  wheel   2.3K Apr  2 03:01:49 2020 200.chkmounts
-rwxr-xr-x  1 root  wheel   2.2K Apr  2 03:01:25 2020 110.neggrpperm
-rwxr-xr-x  1 root  wheel   2.2K Apr  2 03:01:00 2020 100.chksetuid
drwxr-xr-x  2 root  wheel   512B Dec  7 06:23:36 2018 .
drwxr-xr-x  6 root  wheel   512B Dec  7 06:23:36 2018 ..
root@zfs01:~ # 

I don't know how long we will be able to keep this installation running under these circumstances; the platform it provides NFS for will be "LIVE" in a couple of weeks and this issue needs to be resolved prior to that.

Kind regards,

Ruben
Comment 47 Colin Percival freebsd_committer 2020-04-02 22:06:08 UTC
Hi Ruben,

Good timing.  Can you test the patch in https://reviews.freebsd.org/D24256 ?  It will probably add "nvme_timeout" messages to your console log -- I'd be interested to know how many -- but it also fixes a nasty race condition which I suspect is causing some of the breakage you're seeing.
Comment 48 mail 2020-04-04 14:15:05 UTC
(In reply to Colin Percival from comment #47)

Hi Colin,

I've checked out the releng/12.1 branch,, applied the patch, compiled, installed and rebooted.

uname -a
FreeBSD zfs01.kpnmsg.net 12.1-RELEASE-p3 FreeBSD 12.1-RELEASE-p3 #0 r359631M: Sat Apr  4 13:41:41 UTC 2020     root@zfs01.kpnmsg.net:/data/obj/data/src/amd64.amd64/sys/GENERIC  amd64

I still have the 

hw.nvme.timeout_period=1

active in /boot/loader.conf , is that OK or should I switch that to something else ?

Will keep you updated on /var/log/messages contents. Hope the issue will trigger next week, because after that we really have to switch instance types (for the instance that got triggered the most ( until now :P) ).

Regards,

Ruben
Comment 49 Colin Percival freebsd_committer 2020-04-04 17:34:14 UTC
Thanks!  Yes, it would be great if you can keep that tunable in place so we can see how it affects this.  I'm particularly interested in seeing anything logged concerning nvme -- there will probably be additional "nvme_timeout" messages being logged even when nothing is visibly going wrong, and I'd like to see how often (and under what conditions) those appear.
Comment 50 mail 2020-04-20 08:35:29 UTC
(In reply to Colin Percival from comment #49)

Hi Colin,

The machine that was exhibiting the most freezes will be tested extensively during the next couple of days. No errors on that instance yet. We will probably have to decommission this particular instance end of this week though :(

I just had a look on one of the other machines and there is an error in one of the logs:

Apr  5 09:31:41 volume3 ec2[7704]: -----END SSH HOST KEY FINGERPRINTS-----
Apr  5 09:31:41 volume3 ec2[14622]: #############################################################
Apr 16 04:08:31 volume3 kernel: nvme1: nvme_timeout
Apr 16 04:08:31 volume3 kernel: nvme1: 
Apr 16 04:08:31 volume3 kernel: Resetting controller due to a timeout.
Apr 16 04:08:31 volume3 kernel: nvme1: resetting controller
Apr 16 04:08:31 volume3 kernel: nvme1: temperature threshold not supported
Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o
Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:2029427280 len:8
Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o
Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:1776954392 len:8
Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o
Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:1315250264 len:8
Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o
Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:871868818 len:8
Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o
Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:423018743 len:7
Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o
Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:2561980904 len:8
Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o
Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:1987706424 len:8
Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o
Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:115481696 len:8
Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o
Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:1095204568 len:8
Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o
Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:948211334 len:12

No freezing or any other symptoms yet. This machine runs the patched kernel but doesn't have a "hw.nvme.timeout_period" setting configured in its /boot/loader.conf .

Will provide feedback on new developments.

Regards,

Ruben(In reply to Colin Percival from comment #49)
(In reply to Colin Percival from comment #49)
Comment 51 Colin Percival freebsd_committer 2020-04-20 21:09:10 UTC
(In reply to mail from comment #50)

Hi Ruben,

Thanks!  This is consistent with my current understanding, which is that the patch I provided fixes the data structure corruption issues (which showed up as some of the more "interesting" warnings and also sometimes as kernel panics).

The timeouts and freezing appear to be due to the ENA driver having trouble allocating jumbo mbuf clusters during low-memory conditions.  Can you check "sysctl -a | grep mjum_alloc_fail" (ideally, log it every minute and see if it jumps around the times you see hanging)?

I have a patch which I believe fixes this which I'd be happy to share if you want to try something experimental -- we're still discussing details about the best way to address this, though.

Colin
Comment 52 Charles O'Donnell 2020-04-20 21:16:18 UTC
Would "sysctl -a | grep mjum_alloc_fail" provide productive results on an unpatched 12.1-R kernel? I am not running the patches, but I am getting these same freezes, and I can run queries prior to rebooting.

Best.
Comment 53 Colin Percival freebsd_committer 2020-04-20 21:21:32 UTC
Charles: Yes, a stock FreeBSD 12.1-RELEASE kernel should have those sysctls (they're in if_ena.ko technically but you should have that loaded...) and it would be great to hear if those are non-zero and/or increasing around when you see the system freezing.
Comment 54 Colin Percival freebsd_committer 2020-04-22 23:34:42 UTC
I believe that this patch should fix the underlying problem, which is in the ENA driver: https://reviews.freebsd.org/D24546

If you're able to build a custom kernel, please test that patch and report results in that review or via email (cperciva@).
Comment 55 commit-hook freebsd_committer 2020-05-07 11:28:55 UTC
A commit references this bug:

Author: mw
Date: Thu May  7 11:28:41 UTC 2020
New revision: 360777
URL: https://svnweb.freebsd.org/changeset/base/360777

Log:
  Optimize ENA Rx refill for low memory conditions

  Sometimes, especially when there is not much memory in the system left,
  allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time
  and it is not guaranteed that it'll succeed. In that situation, the
  fallback will work, but if the refill needs to take a place for a lot of
  descriptors at once, the time spent in m_getjcl looking for memory can
  cause system unresponsiveness due to high priority of the Rx task. This
  can also lead to driver reset, because Tx cleanup routine is being
  blocked and timer service could detect that Tx packets aren't cleaned
  up. The reset routine can further create another unresponsiveness - Rx
  rings are being refilled there, so m_getjcl will again burn the CPU.
  This was causing NVMe driver timeouts and resets, because network driver
  is having higher priority.

  Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are
  enough - ENA MTU is being set to 9K anyway, so it's very unlikely that
  more space than 9KB will be needed.

  However, 9KB jumbo clusters can still cause issues, so by default the
  page size mbuf cluster will be used for the Rx descriptors. This can have a
  small (~2%) impact on the throughput of the device, so to restore
  original behavior, one must change sysctl "hw.ena.enable_9k_mbufs" to
  "1" in "/boot/loader.conf" file.

  As a part of this patch (important fix), the version of the driver
  was updated to v2.1.2.

  Submitted by:   cperciva
  Reviewed by:    Michal Krawczyk <mk@semihalf.com>
  Reviewed by:    Ido Segev <idose@amazon.com>
  Reviewed by:    Guy Tzalik <gtzalik@amazon.com>
  MFC after:      3 days
  PR:             225791, 234838, 235856, 236989, 243531
  Differential Revision: https://reviews.freebsd.org/D24546

Changes:
  head/sys/dev/ena/ena.c
  head/sys/dev/ena/ena.h
  head/sys/dev/ena/ena_sysctl.c
  head/sys/dev/ena/ena_sysctl.h
Comment 56 mail 2020-05-07 12:13:08 UTC
(In reply to Colin Percival from comment #54)

Hi Colin,

Good to see the problem has been identified, and patched!

Our machines have not yet frozen again after applying the patch you mentioned on 2020-04-02 .

Will this commit (360777) be released as an errata (I don't know of a more official term I'm affraid)? I'm asking because I would like to know whether or not we can switch back to using freebsd-update for implementing the fix or should continue to compile our kernels manually.

Regards,

Ruben
Comment 57 Colin Percival freebsd_committer 2020-05-07 17:15:07 UTC
(In reply to mail from comment #56)

Yes, we're planning on doing an Errata Notice for this.  I suggest waiting until that comes out, then getting the fix with freebsd-update.
Comment 58 commit-hook freebsd_committer 2020-05-12 18:46:07 UTC
A commit references this bug:

Author: mw
Date: Tue May 12 18:44:41 UTC 2020
New revision: 360985
URL: https://svnweb.freebsd.org/changeset/base/360985

Log:
  MFC r360777: Optimize ENA Rx refill for low memory conditions

  Sometimes, especially when there is not much memory in the system left,
  allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time
  and it is not guaranteed that it'll succeed. In that situation, the
  fallback will work, but if the refill needs to take a place for a lot of
  descriptors at once, the time spent in m_getjcl looking for memory can
  cause system unresponsiveness due to high priority of the Rx task. This
  can also lead to driver reset, because Tx cleanup routine is being
  blocked and timer service could detect that Tx packets aren't cleaned
  up. The reset routine can further create another unresponsiveness - Rx
  rings are being refilled there, so m_getjcl will again burn the CPU.
  This was causing NVMe driver timeouts and resets, because network driver
  is having higher priority.

  Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are
  enough - ENA MTU is being set to 9K anyway, so it's very unlikely that
  more space than 9KB will be needed.

  However, 9KB jumbo clusters can still cause issues, so by default the
  page size mbuf cluster will be used for the Rx descriptors. This can have a
  small (~2%) impact on the throughput of the device, so to restore
  original behavior, one must change sysctl "hw.ena.enable_9k_mbufs" to
  "1" in "/boot/loader.conf" file.

  As a part of this patch (important fix), the version of the driver
  was updated to v2.1.2.

  Submitted by:   cperciva
  PR:             225791, 234838, 235856, 236989, 243531

Changes:
_U  stable/12/
  stable/12/sys/dev/ena/ena.c
  stable/12/sys/dev/ena/ena.h
Comment 59 mail 2020-06-10 07:59:20 UTC
I think the patches have materialized into FreeBSD-EN-20:11.ena .

( https://www.freebsd.org/security/advisories/FreeBSD-EN-20:11.ena.asc )

We will update our instances with freebsd-update (as suggested). 

Thanks Colin for your efforts :)