Summary: FreeBSD freezes on AWS EC2 t3 machines
Product: Base System
Component: kern
Version: 12.0-RELEASE
Hardware: amd64
OS: Any
Status: New
Severity: Affects Only Me
Priority: ---
Reporter: Alex Dupre <ale>
Assignee: freebsd-virtualization (Nobody) <virtualization>
CC: cao, cperciva, mail, sigsys
Attachments:
Created attachment 202157 [details]
VM System Log
Thanks for reporting this. A few questions:
1. How repeatable is this? Does it happen to every instance you launch (after a variable number of days)?
2. Have you tried different instance types?
3. What sort of disk is this?

1) It doesn't happen to all t3 instances I have, but it happens quite regularly on a small subset with similar workloads. This specific machine is running a few testnet blockchain nodes (bitcoin, ethereum, etc.), and it just froze again.
2) I think it's also happening, less frequently, on a c5 instance, but I have AWS automatic recovery enabled on that one.
3) All these machines have multiple gp2 disks only.

Created attachment 202224 [details]
VM screenshot 2
Created attachment 203370 [details]
dmesg with crash on t3.xlarge
I've changed the instance from a t3.large to a t3.xlarge, and this is the latest crash I got. This time it wasn't a freeze; the machine rebooted.
No issues on my c5 machines in the last month.
Are you still experiencing this? There has been at least one bug fix on Amazon's side which might have corrected this, but based on other reports I think there's still something going on which affects some systems.

I'm not 100% sure, but it looks like it didn't happen recently, so it might be fixed. The last reboot of that machine was caused by this:

nvme1: Missing interrupt
nvme0: Missing interrupt
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
nvme1: cpl does not map to outstanding cmd
cdw0:00000000 sqhd:000a sqid:0001 cid:000b p:0 sc:00 sct:0 m:0 dnr:0
nvme1: Resetting controller due to a timeout.
nvme1: resetting controller
nvme1: temperature threshold not supported
nvme1: aborting outstanding i/o
nvme1: READ sqid:1 cid:11 nsid:1 lba:82665664 len:64
nvme1: ABORTED - BY REQUEST (00/07) sqid:1 cid:11 cdw0:0
nvme1: Missing interrupt
ena0: device is going UP
ena0: link is UP
nvme1: Resetting controller due to a timeout.
nvme1: resetting controller
nvme1: temperature threshold not supported
nvme1: aborting outstanding i/o
nvme1: READ sqid:1 cid:11 nsid:1 lba:101778688 len:64
nvme1: ABORTED - BY REQUEST (00/07) sqid:1 cid:11 cdw0:0
nvme1: aborting outstanding i/o
nvme1: READ sqid:1 cid:21 nsid:1 lba:93070656 len:64
nvme1: ABORTED - BY REQUEST (00/07) sqid:1 cid:21 cdw0:0
nvme1: aborting outstanding i/o
nvme1: WRITE sqid:1 cid:4 nsid:1 lba:87852928 len:64
nvme1: ABORTED - BY REQUEST (00/07) sqid:1 cid:4 cdw0:0

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0x18
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80ae362e
stack pointer = 0x28:0xfffffe00741bc1c0
frame pointer = 0x28:0xfffffe00741bc200
code segment = base rx0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq258: nvme0)
trap number = 12
panic: page fault
cpuid = 2
time = 1563188396
KDB: stack backtrace:
#0 0xffffffff80be7957 at kdb_backtrace+0x67
#1 0xffffffff80b9b543 at vpanic+0x1a3
#2 0xffffffff80b9b393 at panic+0x43
#3 0xffffffff81074bff at trap_fatal+0x35f
#4 0xffffffff81074c59 at trap_pfault+0x49
#5 0xffffffff8107427e at trap+0x29e
#6 0xffffffff8104f6f5 at calltrap+0x8
#7 0xffffffff807674f6 at nvme_qpair_complete_tracker+0x166
#8 0xffffffff807672b4 at nvme_qpair_process_completions+0x94
#9 0xffffffff80b5edf7 at ithread_loop+0x1a7
#10 0xffffffff80b5bf13 at fork_exit+0x83
#11 0xffffffff810506ee at fork_trampoline+0xe
Uptime: 5d3h3m9s

Same symptoms still occurring regularly here. It begins with "nvmeX: Missing interrupt" in the logs, which can go on for a few hours, and then the server becomes unresponsive. It typically occurs 4-6 days after boot. Only the most heavily loaded production servers are affected; other servers have never exhibited this behavior. I have tried a variety of instance types. At this point I have it monitored so I can do a graceful cycle before it locks up. See also bug #236989

Uname: FreeBSD 12.0-RELEASE-p10 FreeBSD 12.0-RELEASE-p10 GENERIC amd64

Log excerpt:
Aug 27 03:02:45 host kernel: nvme4: Missing interrupt
Aug 27 03:02:56 host kernel: nvme1: Missing interrupt
Aug 27 04:19:11 host kernel: nvme1: nvme4: Missing interrupt
Aug 27 04:19:11 host kernel: Missing interrupt
Aug 27 04:19:30 host kernel: nvme0: Missing interrupt
Aug 27 05:16:58 host kernel: nvme1: Missing interrupt
Aug 27 05:16:58 host kernel: nvme4: Missing interrupt
Aug 27 14:58:13 host kernel: ena0: Keep alive watchdog timeout.
Aug 27 14:58:13 host kernel: ena0: Trigger reset is on
Aug 27 14:58:13 host kernel: ena0: device is going DOWN
Aug 27 14:58:13 host kernel: nvme4: Missing interrupt
Aug 27 14:58:13 host kernel: nvme1: Missing interrupt
Aug 27 14:58:16 host dhclient[417]: send_packet6: Network is down
Aug 27 14:58:16 host dhclient[417]: dhc6: send_packet6() sent -1 of 52 bytes
Aug 27 14:58:17 host kernel: ena0: free uncompleted tx mbuf qid 5 idx 0x25c
Aug 27 14:58:17 host kernel: ena0: ena0: device is going UP
Aug 27 14:58:17 host kernel: link is UP

(In reply to Charles O'Donnell from comment #8)
Hi,
We just suffered from the same issue. This particular FreeBSD 12.1 r5.large instance with an additional gp2 disk (ZFS) had been up for about two weeks when this occurred. These are the last entries in the "logs":
Nov 25 03:04:36 zfs01 kernel: nvme1: cpl does not map to outstanding cmd
Nov 25 03:04:36 zfs01 kernel: cdw0:00000000 sqhd:001a sqid:0001 cid:0018 p:0 sc:00 sct:0 m:0 dnr:0
Nov 25 03:04:36 zfs01 kernel: nvme1: Missing interrupt
Nov 25 03:04:36 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Nov 25 03:04:36 zfs01 kernel: nvme1: resetting controller
Nov 25 03:04:36 zfs01 kernel: nvme1: temperature threshold not supported
Nov 25 03:04:36 zfs01 kernel: nvme1: aborting outstanding i/o
Nov 25 03:05:07 zfs01 kernel: nvme1: Missing interrupt
Nov 25 03:05:36 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Nov 25 03:05:36 zfs01 kernel: nvme1: resetting controller
Nov 25 03:05:37 zfs01 kernel: nvme1: temperature threshold not supported
Nov 25 03:05:37 zfs01 kernel: nvme1: aborting outstanding i/o
Nov 25 03:05:37 zfs01 syslogd: last message repeated 4 times
Nov 25 03:06:07 zfs01 kernel: nvme1: Missing interrupt
Nov 25 03:09:08 zfs01 dhclient[84191]: send_packet: Host is down
Nov 25 03:09:52 zfs01 syslogd: last message repeated 4 times
@Colin/@Alex: Do you know if this occurs on "older" instance types as well?
We would definitely switch to older instance types if this issue doesn't occur on older generations.
Kind regards, Ruben

Hmm. We have a standby machine for this instance (same specs) that has similar log entries, but has not "frozen" yet:
Nov 18 16:51:32 zfs02 kernel: link is UP
Nov 18 16:55:50 zfs02 kernel: nvme1: Missing interrupt
Nov 18 17:01:39 zfs02 dhclient[75125]: send_packet6: Network is down
Nov 18 17:01:39 zfs02 dhclient[75125]: dhc6: send_packet6() sent -1 of 52 bytes
Nov 19 03:30:59 zfs02 kernel: nvme1: cpl does not map to outstanding cmd
Nov 19 03:30:59 zfs02 kernel: cdw0:00000000 sqhd:0006 sqid:0001 cid:000d p:0 sc:00 sct:0 m:0 dnr:0
Nov 19 03:30:59 zfs02 kernel: nvme1: Resetting controller due to a timeout.
Nov 19 03:30:59 zfs02 kernel: nvme1: resetting controller
Nov 19 03:30:59 zfs02 kernel: nvme1: temperature threshold not supported
Nov 19 03:30:59 zfs02 kernel: nvme1: aborting outstanding i/o
Nov 19 03:31:28 zfs02 dhclient[75125]: send_packet6: Network is down
Nov 19 03:31:28 zfs02 dhclient[75125]: dhc6: send_packet6() sent -1 of 52 bytes
Nov 19 03:31:29 zfs02 kernel: stray irq260
I'm curious when the "freezing" will start... Is there any way to increase the verbosity / usefulness of the logs?

(In reply to mail from comment #9)
Older instance types don't have NVMe drives (or ena network cards), so these specific issues cannot occur on them.

Created attachment 209401 [details]
more complete log
A more complete log (also shows ena0 issues, perhaps there is a relation?)
(In reply to Alex Dupre from comment #11)
Hi Alex,
Ah, that is good news. We will downgrade the zfs01 instance and keep the zfs02 instance on the problematic generation. Thank you for your feedback :)
Kind regards, Ruben

(In reply to mail from comment #13)
The bad news is that hardware performance is worse on the previous generations.

(In reply to Alex Dupre from comment #14)
Hi Alex,
Yes, that's very unfortunate. Stable lower performance vs. unstable higher performance is not a difficult call (for this production platform) though... We will keep the other node on the newer-generation AWS hardware. Perhaps that way we can help wiser / more knowledgeable people out with debugging this issue. If anybody would like us to perform tests / grant access: please let us know!
Kind regards, Ruben

A few questions:
1. Have you seen this recently?
2. Have you seen this on FreeBSD 12.1 or 13.0?
3. Have you tried AMD instances, and if yes, has this problem appeared there?
4. Can you reproduce this reasonably quickly (say, by running an instance for a day or two)?

(In reply to Colin Percival from comment #16)
Unfortunately I had to switch off those EC2 instances, so I cannot say whether it would still happen or not. I didn't try FreeBSD 13 or AMD instances.

Hi Colin,
> 1. How repeatable is this? Does it happen to every instance you launch (after a variable number of days)?
Unfortunately we are not in the habit of redeploying our zfs nodes very often (since they provide storage for whole platforms) :( It still happens on the affected nodes though:

bzgrep "nvme1: Missing interrupt" /var/log/messages.0.bz2 ; grep "nvme1: Missing interrupt" /var/log/messages
Nov 16 03:04:18 zfs01 kernel: nvme1: Missing interrupt
Nov 16 03:05:19 zfs01 kernel: nvme1: Missing interrupt
Nov 25 03:04:36 zfs01 kernel: nvme1: Missing interrupt
Nov 25 03:05:07 zfs01 kernel: nvme1: Missing interrupt
Nov 25 03:06:07 zfs01 kernel: nvme1: Missing interrupt
Dec 13 03:04:34 zfs01 kernel: nvme1: Missing interrupt
Dec 13 03:05:35 zfs01 kernel: nvme1: Missing interrupt
Dec 13 03:06:26 zfs01 kernel: nvme1: Missing interrupt
Dec 13 03:06:57 zfs01 kernel: nvme1: Missing interrupt
Dec 13 03:07:58 zfs01 kernel: nvme1: Missing interrupt
Jan 25 03:06:02 zfs01 kernel: nvme1: Missing interrupt
Jan 25 03:07:02 zfs01 kernel: nvme1: Missing interrupt
Feb 11 03:05:32 zfs01 kernel: nvme1: Missing interrupt
Feb 11 03:07:01 zfs01 kernel: nvme1: Missing interrupt
Feb 17 03:06:29 zfs01 kernel: nvme1: Missing interrupt
===
bzgrep "nvme1: Missing interrupt" /var/log/messages.0.bz2 ; grep "nvme1: Missing interrupt" /var/log/messages
Jan 25 04:29:03 volume3 kernel: nvme1: Missing interrupt
Feb 4 04:04:45 volume3 kernel: nvme1: Missing interrupt
Feb 11 04:04:48 volume3 kernel: nvme1: Missing interrupt

It is kind of interesting that zfs01 and volume3 have totally different customers and usage patterns, but still collide on two dates :|

> 2. Have you tried different instance types?

Yes. This issue does not manifest itself on an r4.xlarge instance. The same AMI on an r5.large: problems...

> 3. What sort of disk is this?

We use cloudformation/ansible to deploy these servers, so they are all more or less identically configured (apart from sizing). Both instances suffering from this issue have 500GB+ EBS GP2 or IO disks with GPT and a zpool configured on them.
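As an aside, the paired bzgrep/grep invocation above only covers one rotated log. A small sketch like the following (log path and bzip2 rotation scheme are assumptions, not taken from this report) would tally "Missing interrupt" events per calendar day across the current log and all rotations:

```shell
#!/bin/sh
# Hypothetical helper: count "Missing interrupt" events per day across
# /var/log/messages and its bzip2-compressed rotations.
logdir=${1:-/var/log}
{
  for f in "$logdir"/messages.*.bz2; do
    # The glob stays literal when no rotated logs exist, so test first.
    [ -e "$f" ] && bzcat "$f"
  done
  cat "$logdir"/messages
} | grep 'Missing interrupt' | awk '{ print $1, $2 }' | sort | uniq -c
```

Run as e.g. `sh count-missing-irq.sh /var/log`; the per-day counts make the "always shortly after 03:01" pattern discussed below easy to spot.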
Please let me know if you would like to receive more information.
Kind regards, Ruben

Hi Colin,
After some further thought: if you'd like, we could provision an additional set of r4/r5 machines and give you access to them?
Kind regards, Ruben

Hi Ruben,
I'd love to be able to reproduce this -- do you have some sort of "canned workload" which will do it? I hesitate to ask for access to "live" systems, since as soon as I can reproduce it I'll want to do things like building new kernels with extra debugging/logging added, and I don't want to interfere with production services -- and everything I'm seeing so far suggests that this is workload dependent, so simply spinning up an instance without any workload won't help. :-/
Colin

(In reply to mail from comment #18)
Ruben: Looking at those logs... is the second system in a different timezone from the first? The first one looks very much like it's glitching when the 'periodic daily' cron job runs at 3:01 AM, while the second is consistently an hour later.

(In reply to Colin Percival from comment #20)
Hi Colin,
We've spun up some additional instances and we are trying to trigger the issue by configuring a workload that should be similar enough to do so. Giving access to those nodes is not a problem whatsoever, since we spun them up specifically to investigate this issue.
Kind regards, Ruben

(In reply to Colin Percival from comment #21)
Hmm. The difference of exactly one hour doesn't seem to have a relationship with different timezones... Today we had another occurrence on one of the machines:
Feb 19 03:24:53 volume3 kernel: nvme1: cpl does not map to outstanding cmd
Feb 19 03:24:53 volume3 kernel: cdw0:00000000 sqhd:000c sqid:0002 cid:0017 p:0 sc:00 sct:0 m:0 dnr:0
Feb 19 03:24:53 volume3 kernel: nvme1: Missing interrupt
Feb 19 03:24:53 volume3 kernel: nvme1: Resetting controller due to a timeout.
Feb 19 03:24:53 volume3 kernel: nvme1: resetting controller
Feb 19 03:24:54 volume3 kernel: nvme1: temperature threshold not supported
Feb 19 03:24:54 volume3 kernel: nvme1: aborting outstanding i/o

Comparing 03:24:53 with the access times of the daily scripts though:

ls -lahtuT /etc/periodic/daily/
total 128
-rwxr-xr-x 1 root wheel 1.0K Feb 19 12:31:47 2020 450.status-security
-rwxr-xr-x 1 root wheel 811B Feb 19 05:32:14 2020 999.local
-rwxr-xr-x 1 root wheel 2.8K Feb 19 05:32:14 2020 800.scrub-zfs
-rwxr-xr-x 1 root wheel 845B Feb 19 05:32:14 2020 510.status-world-kernel
-rwxr-xr-x 1 root wheel 737B Feb 19 05:32:14 2020 500.queuerun
-rwxr-xr-x 1 root wheel 498B Feb 19 05:32:14 2020 480.status-ntpd
-rwxr-xr-x 1 root wheel 451B Feb 19 05:32:14 2020 480.leapfile-ntpd
-rwxr-xr-x 1 root wheel 2.0K Feb 19 05:32:14 2020 460.status-mail-rejects
-rwxr-xr-x 1 root wheel 1.4K Feb 19 03:01:00 2020 440.status-mailq
-rwxr-xr-x 1 root wheel 705B Feb 19 03:01:00 2020 430.status-uptime
-rwxr-xr-x 1 root wheel 611B Feb 19 03:01:00 2020 420.status-network
-rwxr-xr-x 1 root wheel 684B Feb 19 03:01:00 2020 410.status-mfi
-rwxr-xr-x 1 root wheel 590B Feb 19 03:01:00 2020 409.status-gconcat
-rwxr-xr-x 1 root wheel 590B Feb 19 03:01:00 2020 408.status-gstripe
-rwxr-xr-x 1 root wheel 591B Feb 19 03:01:00 2020 407.status-graid3
-rwxr-xr-x 1 root wheel 596B Feb 19 03:01:00 2020 406.status-gmirror
-rwxr-xr-x 1 root wheel 807B Feb 19 03:01:00 2020 404.status-zfs
-rwxr-xr-x 1 root wheel 583B Feb 19 03:01:00 2020 401.status-graid
-rwxr-xr-x 1 root wheel 773B Feb 19 03:01:00 2020 400.status-disks
-rwxr-xr-x 1 root wheel 724B Feb 19 03:01:00 2020 330.news
-r-xr-xr-x 1 root wheel 1.4K Feb 19 03:01:00 2020 310.accounting
-rwxr-xr-x 1 root wheel 693B Feb 19 03:01:00 2020 300.calendar
-rwxr-xr-x 1 root wheel 1.0K Feb 19 03:01:00 2020 210.backup-aliases
-rwxr-xr-x 1 root wheel 1.7K Feb 19 03:01:00 2020 200.backup-passwd
-rwxr-xr-x 1 root wheel 603B Feb 19 03:01:00 2020 150.clean-hoststat
-rwxr-xr-x 1 root wheel 1.0K Feb 19 03:01:00 2020 140.clean-rwho
-rwxr-xr-x 1 root wheel 709B Feb 19 03:01:00 2020 130.clean-msgs
-rwxr-xr-x 1 root wheel 1.1K Feb 19 03:01:00 2020 120.clean-preserve
-rwxr-xr-x 1 root wheel 1.5K Feb 19 03:01:00 2020 110.clean-tmps
-rwxr-xr-x 1 root wheel 1.3K Feb 19 03:01:00 2020 100.clean-disks
drwxr-xr-x 2 root wheel 1.0K Nov 1 07:06:41 2019 .
drwxr-xr-x 6 root wheel 512B Nov 1 07:06:41 2019 ..

But if the periodic framework executes the jobs serially, I see no link with 440.status-mailq (which does not sound like high I/O) :S. I think there definitely is a link between this bug and high disk/network I/O, so the periodic framework probably qualifies as a nice trigger (especially the security bits with the find commands)... We will continue to cross-reference the access times of the daily scripts with the "Missing interrupt" occurrences and post updates.
Kind regards, Ruben

(In reply to mail from comment #23)
> Comparing 03:24:53 with the access times of the daily scripts though:
> ls -lahtuT /etc/periodic/daily/
> [snip]

Looks like it was probably something from the daily security run (which is kicked off by 450.status-security). Can you look at the timestamps on /etc/periodic/security/ ?

> I think there definitely is a link between this bug and high disk/network I/O,
> so the periodic framework probably qualifies as a nice trigger (especially the
> security bits with the find commands)...

Yeah, I'm guessing that it's related to find causing lots of I/O.

(In reply to Colin Percival from comment #24)
Hi Colin,
Since the troubleshooting earlier today I ran several find commands on the /etc/ directory :( I'll make sure to ls all the directories next time! Apologies!
Regards, Ruben

(In reply to mail from comment #25)
I've seen a couple of nvme messages on the test setup:

root@zfs01:/usr/home/ruben # grep nvme1 /var/log/messages
Feb 19 14:11:34 freebsd kernel: nvme1: <Generic NVMe Device> mem 0xfebf8000-0xfebfbfff irq 11 at device 31.0 on pci0
Feb 19 14:11:34 freebsd kernel: nvme1: temperature threshold not supported
Feb 22 23:16:44 zfs01 kernel: nvme1: Missing interrupt
Feb 24 05:24:52 zfs01 kernel: nvme1: Missing interrupt
root@zfs01:/usr/home/ruben # uptime
10:08AM up 4 days, 19:57, 1 user, load averages: 1.92, 2.28, 2.23
root@zfs01:/usr/home/ruben # date
Mon Feb 24 10:08:54 UTC 2020
root@zfs01:/usr/home/ruben #

Will try to further tune the test workload this week.

Can you confirm that the "Missing interrupt" message is coming from FreeBSD 12.1? There was an issue with that symptom which was fixed between 12.0 and 12.1. Can you provide a verbose boot? We're wondering if there's something going on with interrupt routing.

(In reply to Colin Percival from comment #27)
Hi Colin,
Yes, both production instances are running FreeBSD 12.1:

root@volume3:/usr/home/ruben # freebsd-version -ukr
12.1-RELEASE-p1
12.1-RELEASE-p1
12.1-RELEASE-p1
root@volume3:/usr/home/ruben #
root@zfs01:/usr/home/ruben # freebsd-version -ukr
12.1-RELEASE
12.1-RELEASE
12.1-RELEASE
root@zfs01:/usr/home/ruben #

as is our test setup:

root@zfs01:/usr/home/ruben # freebsd-version -ukr
12.1-RELEASE-p2
12.1-RELEASE-p2
12.1-RELEASE-p2
root@zfs01:/usr/home/ruben #

Regards, Ruben

(In reply to Colin Percival from comment #28)
Hi Colin,
Are you referring to sharing the contents of /var/run/dmesg.boot while boot_verbose="YES" has been set in /boot/loader.conf? If so, then I will be able to do so tomorrow for at least one of the production instances.
Regards, Ruben

Yes, that's right.
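For reference, the verbose-boot setup being confirmed here amounts to one standard loader variable; it only takes effect from the next boot, after which the verbose boot messages can be read back from /var/run/dmesg.boot:

```
# /boot/loader.conf
boot_verbose="YES"
```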
Ideal would be if you can get the verbose dump and confirm that the instance later suffers from this glitch -- just to rule out some sort of "interrupts *sometimes* get set up wrong at boot time" issue.

(In reply to Colin Percival from comment #31)
Hi,
I've managed to reboot all servers with the verbose setting active. If the issue occurs again I'll be sure to upload the information you requested (unless you want to have a look at it "now"?).

One of the machines seemed weirdly impacted by a similar issue whilst receiving a stream of 5 MB/s (zfs snapshot):
Feb 25 11:04:01 volume3 kernel: ena0: Keep alive watchdog timeout.
Feb 25 11:04:01 volume3 kernel: ena0: Trigger reset is on
Feb 25 11:04:01 volume3 kernel: ena0: device is going DOWN
Feb 25 11:04:19 volume3 kernel: ena0: ena0: device is going UP
Feb 25 11:04:19 volume3 kernel: link is UP
Feb 25 11:05:10 volume3 dhclient[33054]: send_packet6: Network is down
Feb 25 11:05:10 volume3 dhclient[33054]: dhc6: send_packet6() sent -1 of 52 bytes
Feb 25 11:26:32 volume3 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Feb 25 11:26:32 volume3 kernel: ena0: Trigger reset is on
Feb 25 11:26:32 volume3 kernel: ena0: device is going DOWN
Feb 25 11:26:33 volume3 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x182
Feb 25 11:26:33 volume3 kernel: stray irq260
Feb 25 11:26:33 volume3 kernel: ena0: ena0: device is going UP
Feb 25 11:26:33 volume3 kernel: link is UP
Feb 25 11:34:06 volume3 shutdown[76043]: reboot by ruben:

Symptoms: very high load, unresponsive system (freezes/hangs on console). No nvme messages though. Unfortunately, I did not yet have the verbose boot information you requested at that point.
Did not have time to further debug either, since we need this machine up and running (initiated the reboot at 11:34) :(
Regards, Ruben

Most recent incident:
Feb 27 04:23:31 s4 kernel: nvme4: Missing interrupt
Feb 27 04:24:34 s4 syslogd: last message repeated 1 times
Feb 27 04:24:34 s4 kernel: nvme1: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme0: Missing interrupt
Feb 27 04:24:34 s4 syslogd: last message repeated 1 times
Feb 27 04:24:34 s4 kernel: nvme1: nvme1: cpl does not map to outstanding cmd
Feb 27 04:24:34 s4 kernel: cdw0:00000000 sqhd:0012 sqid:0001 cid:0017 p:0 sc:00 sct:0 m:0 dnr:0
Feb 27 04:24:34 s4 kernel: nvme1: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme4: nvme1: Missing interrupt
Feb 27 04:24:34 s4 kernel: Missing interrupt
Feb 27 04:24:34 s4 syslogd: last message repeated 1 times
Feb 27 04:24:34 s4 kernel: nvme4: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme0: Missing interrupt
Feb 27 04:24:34 s4 kernel: nvme1: nvme1: nvme4: Missing interrupt
Feb 27 04:24:34 s4 kernel: Missing interrupt

# freebsd-version -ukr
12.1-RELEASE-p2
12.1-RELEASE-p2
12.1-RELEASE-p2

# ls -lahtuT /etc/periodic/daily/
total 128
-rwxr-xr-x 1 root wheel 811B Feb 27 03:05:34 2020 999.local
-rwxr-xr-x 1 root wheel 2.8K Feb 27 03:05:34 2020 800.scrub-zfs
-rwxr-xr-x 1 root wheel 845B Feb 27 03:05:34 2020 510.status-world-kernel
-rwxr-xr-x 1 root wheel 737B Feb 27 03:05:34 2020 500.queuerun
-rwxr-xr-x 1 root wheel 498B Feb 27 03:05:34 2020 480.status-ntpd
-rwxr-xr-x 1 root wheel 451B Feb 27 03:05:34 2020 480.leapfile-ntpd
-rwxr-xr-x 1 root wheel 2.0K Feb 27 03:05:34 2020 460.status-mail-rejects
-rwxr-xr-x 1 root wheel 1.0K Feb 27 03:01:00 2020 450.status-security
-rwxr-xr-x 1 root wheel 1.4K Feb 27 03:01:00 2020 440.status-mailq
-rwxr-xr-x 1 root wheel 705B Feb 27 03:01:00 2020 430.status-uptime
-rwxr-xr-x 1 root wheel 611B Feb 27 03:01:00 2020 420.status-network
-rwxr-xr-x 1 root wheel 684B Feb 27 03:01:00 2020 410.status-mfi
-rwxr-xr-x 1 root wheel 590B Feb 27 03:01:00 2020 409.status-gconcat
-rwxr-xr-x 1 root wheel 590B Feb 27 03:01:00 2020 408.status-gstripe
-rwxr-xr-x 1 root wheel 591B Feb 27 03:01:00 2020 407.status-graid3
-rwxr-xr-x 1 root wheel 596B Feb 27 03:01:00 2020 406.status-gmirror
-rwxr-xr-x 1 root wheel 807B Feb 27 03:01:00 2020 404.status-zfs
-rwxr-xr-x 1 root wheel 583B Feb 27 03:01:00 2020 401.status-graid
-rwxr-xr-x 1 root wheel 773B Feb 27 03:01:00 2020 400.status-disks
-rwxr-xr-x 1 root wheel 724B Feb 27 03:01:00 2020 330.news
-r-xr-xr-x 1 root wheel 1.4K Feb 27 03:01:00 2020 310.accounting
-rwxr-xr-x 1 root wheel 693B Feb 27 03:01:00 2020 300.calendar
-rwxr-xr-x 1 root wheel 1.0K Feb 27 03:01:00 2020 210.backup-aliases
-rwxr-xr-x 1 root wheel 1.7K Feb 27 03:01:00 2020 200.backup-passwd
-rwxr-xr-x 1 root wheel 603B Feb 27 03:01:00 2020 150.clean-hoststat
-rwxr-xr-x 1 root wheel 1.0K Feb 27 03:01:00 2020 140.clean-rwho
-rwxr-xr-x 1 root wheel 709B Feb 27 03:01:00 2020 130.clean-msgs
-rwxr-xr-x 1 root wheel 1.1K Feb 27 03:01:00 2020 120.clean-preserve
-rwxr-xr-x 1 root wheel 1.5K Feb 27 03:01:00 2020 110.clean-tmps
-rwxr-xr-x 1 root wheel 1.3K Feb 27 03:01:00 2020 100.clean-disks
drwxr-xr-x 2 root wheel 1.0K Dec 7 01:23:36 2018 .
drwxr-xr-x 6 root wheel 512B Dec 7 01:23:36 2018 ..

Are these "ephemeral" disks or EBS attached disks? Can you ls -lahtuT /etc/periodic/security/ ? What's the instance ID?

> Are these "ephemeral" disks or EBS attached disks?

EBS attached. EBS volume type is gp2. Filesystem is zfs.

> What's the instance ID?

Instance type is c5d.2xlarge. ami-00e61b69944d5d2a3 FreeBSD 12.0-RELEASE-amd64-f5af2713-6d84-41f0-95ea-2eaee78f2af4-ami-03b0f822e17669866.4 FreeBSD/amd64 releng/12.0@341666
Workload is HTTP, Database, Image processing. Significant network load and disk activity. See also bug #236989

> Can you ls -lahtuT /etc/periodic/security/ ?
ls -lahtuT /etc/periodic/security/
total 68
-r--r--r-- 1 root wheel 2.8K Mar 1 05:30:01 2020 security.functions
-rwxr-xr-x 1 root wheel 2.3K Mar 1 05:30:01 2020 900.tcpwrap
-rwxr-xr-x 1 root wheel 2.3K Mar 1 05:30:01 2020 800.loginfail
-rwxr-xr-x 1 root wheel 1.9K Mar 1 05:30:01 2020 700.kernelmsg
-rwxr-xr-x 1 root wheel 2.0K Mar 1 05:30:01 2020 610.ipf6denied
-rwxr-xr-x 1 root wheel 2.2K Mar 1 05:30:01 2020 550.ipfwlimit
-rwxr-xr-x 1 root wheel 2.1K Mar 1 05:30:01 2020 520.pfdenied
-rwxr-xr-x 1 root wheel 1.9K Mar 1 05:30:01 2020 510.ipfdenied
-rwxr-xr-x 1 root wheel 2.0K Mar 1 05:30:01 2020 500.ipfwdenied
-rwxr-xr-x 1 root wheel 1.9K Mar 1 05:30:01 2020 410.logincheck
-rwxr-xr-x 1 root wheel 1.9K Mar 1 05:30:01 2020 400.passwdless
-rwxr-xr-x 1 root wheel 1.9K Mar 1 05:30:01 2020 300.chkuid0
-rwxr-xr-x 1 root wheel 2.3K Mar 1 05:30:01 2020 200.chkmounts
-rwxr-xr-x 1 root wheel 2.2K Mar 1 05:30:01 2020 110.neggrpperm
-rwxr-xr-x 1 root wheel 2.2K Mar 1 05:30:01 2020 100.chksetuid
drwxr-xr-x 2 root wheel 512B Dec 7 01:23:36 2018 .
drwxr-xr-x 6 root wheel 512B Dec 7 01:23:36 2018 ..

Hi,
This weekend the issue manifested itself again on one of our 12.1 instances (with an EBS volume attached):
Mar 7 03:05:47 zfs01 kernel: nvme1: cpl does not map to outstanding cmd
Mar 7 03:05:47 zfs01 kernel: cdw0:00000000 sqhd:001b sqid:0001 cid:001b p:0 sc:00 sct:0 m:0 dnr:0
Mar 7 03:05:47 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Mar 7 03:05:47 zfs01 kernel: nvme1: resetting controller
Mar 7 03:05:47 zfs01 kernel: nvme1: temperature threshold not supported
Mar 7 03:05:47 zfs01 kernel: nvme1: aborting outstanding i/o
Mar 7 03:06:18 zfs01 kernel: nvme1: Missing interrupt
Mar 7 03:06:48 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Mar 7 03:06:48 zfs01 kernel: nvme1: resetting controller
Mar 7 03:06:48 zfs01 kernel: nvme1: temperature threshold not supported
Mar 7 03:06:48 zfs01 kernel: nvme1: aborting outstanding i/o
Mar 7 03:06:48 zfs01 syslogd: last message repeated 5 times
Mar 7 03:07:20 zfs01 kernel: nvme1: VERIFY sqid:1 cid:27 nsid:0 lba:0 len:1
Mar 7 03:07:20 zfs01 kernel: nvme1: INVALID OPCODE (00/01) sqid:1 cid:27 cdw0:0
Mar 7 03:07:20 zfs01 kernel: nvme1: Missing interrupt
Mar 7 03:07:20 zfs01 kernel: nvme1: VERIFY sqid:1 cid:27 nsid:0 lba:0 len:1
Mar 7 03:07:20 zfs01 kernel: nvme1: INVALID OPCODE (00/01) sqid:1 cid:27 cdw0:0
Mar 7 03:08:23 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Mar 7 03:08:23 zfs01 kernel: ena0: Trigger reset is on
Mar 7 03:08:23 zfs01 kernel: ena0: device is going DOWN
Mar 7 03:08:23 zfs01 kernel: nvme1: Resetting controller due to a timeout.
Mar 7 03:08:23 zfs01 kernel: nvme1: resetting controller
Mar 7 03:08:23 zfs01 dhclient[40936]: send_packet6: Network is down
Mar 7 03:08:23 zfs01 dhclient[40936]: dhc6: send_packet6() sent -1 of 52 bytes
Mar 7 03:08:23 zfs01 kernel: nvme1: aborting outstanding admin command
Mar 7 03:08:23 zfs01 kernel: nvme1: CREATE IO SQ (01) sqid:0 cid:24 nsid:1 cdw10:1cb72e58 cdw11:00000000
Mar 7 03:08:23 zfs01 kernel: nvme1: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
Mar 7 03:08:23 zfs01 kernel: nvme1: temperature threshold not supported
Mar 7 03:08:23 zfs01 kernel: nvme1: aborting outstanding i/o
Mar 7 03:08:23 zfs01 syslogd: last message repeated 2 times
Mar 7 03:08:53 zfs01 kernel: nvme1: WRITE sqid:1 cid:18 nsid:1 lba:3856318 len:64
Mar 7 03:08:53 zfs01 kernel: nvme1: INVALID OPCODE (00/01) sqid:1 cid:27 cdw0:0
Mar 7 03:08:53 zfs01 kernel: nvme1: Missing interrupt
Mar 7 03:09:15 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x58
Mar 7 03:09:16 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Mar 7 03:09:16 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52
Mar 7 03:09:16 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53
Mar 7 03:09:16 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54
Mar 7 03:09:16 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X
Mar 7 03:09:16 zfs01 kernel: ena0: device is going UP
Mar 7 03:09:16 zfs01 kernel: ena0: link is UP
Mar 7 03:10:30 zfs01 dhclient[40936]: send_packet6: Network is down
Mar 7 03:10:30 zfs01 dhclient[40936]: dhc6: send_packet6() sent -1 of 52 bytes
Mar 7 03:10:32 zfs01 dhclient[69248]: send_packet: Network is down
Mar 7 03:11:16 zfs01 syslogd: last message repeated 4 times
Mar 7 03:11:33 zfs01 syslogd: last message repeated 1 times
Mar 7 03:13:31 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Mar 7 03:13:31 zfs01 kernel: ena0: Trigger reset is on
Mar 7 03:13:31 zfs01 kernel: ena0: device is going DOWN
Mar 7 03:14:25 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x134
Mar 7 03:14:26 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported)
Mar 7 03:14:26 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52

root@zfs01:/usr/home/ruben # ls -lahtuT /etc/periodic/daily/
total 128
-rwxr-xr-x 1 root wheel 1.0K Mar 7 03:01:00 2020 450.status-security
-rwxr-xr-x 1 root wheel 1.4K Mar 7 03:01:00 2020 440.status-mailq
-rwxr-xr-x 1 root wheel 705B Mar 7 03:01:00 2020 430.status-uptime
-rwxr-xr-x 1 root wheel 611B Mar 7 03:01:00 2020 420.status-network
-rwxr-xr-x 1 root wheel 684B Mar 7 03:01:00 2020 410.status-mfi
-rwxr-xr-x 1 root wheel 590B Mar 7 03:01:00 2020 409.status-gconcat
-rwxr-xr-x 1 root wheel 590B Mar 7 03:01:00 2020 408.status-gstripe
-rwxr-xr-x 1 root wheel 591B Mar 7 03:01:00 2020 407.status-graid3
-rwxr-xr-x 1 root wheel 596B Mar 7 03:01:00 2020 406.status-gmirror
-rwxr-xr-x 1 root wheel 807B Mar 7 03:01:00 2020 404.status-zfs
-rwxr-xr-x 1 root wheel 583B Mar 7 03:01:00 2020 401.status-graid
-rwxr-xr-x 1 root wheel 773B Mar 7 03:01:00 2020 400.status-disks
-rwxr-xr-x 1 root wheel 724B Mar 7 03:01:00 2020 330.news
-r-xr-xr-x 1 root wheel 1.4K Mar 7 03:01:00 2020 310.accounting
-rwxr-xr-x 1 root wheel 693B Mar 7 03:01:00 2020 300.calendar
-rwxr-xr-x 1 root wheel 1.0K Mar 7 03:01:00 2020 210.backup-aliases
-rwxr-xr-x 1 root wheel 1.7K Mar 7 03:01:00 2020 200.backup-passwd
-rwxr-xr-x 1 root wheel 603B Mar 7 03:01:00 2020 150.clean-hoststat
-rwxr-xr-x 1 root wheel 1.0K Mar 7 03:01:00 2020 140.clean-rwho
-rwxr-xr-x 1 root wheel 709B Mar 7 03:01:00 2020 130.clean-msgs
-rwxr-xr-x 1 root wheel 1.1K Mar 7 03:01:00 2020 120.clean-preserve
-rwxr-xr-x 1 root wheel 1.5K Mar 7 03:01:00 2020 110.clean-tmps
-rwxr-xr-x 1 root wheel 1.3K Mar 7 03:01:00 2020 100.clean-disks
-rwxr-xr-x 1 root wheel 811B Mar 5 03:21:29 2020 999.local
-rwxr-xr-x 1 root wheel 2.8K Mar 5 03:21:29 2020 800.scrub-zfs
-rwxr-xr-x 1 root wheel 845B Mar 5 03:21:29 2020 510.status-world-kernel
-rwxr-xr-x 1 root wheel 737B Mar 5 03:21:29 2020 500.queuerun
-rwxr-xr-x 1 root wheel 498B Mar 5 03:21:29 2020 480.status-ntpd
-rwxr-xr-x 1 root wheel 451B Mar 5 03:03:36 2020 480.leapfile-ntpd
-rwxr-xr-x 1 root wheel 2.0K Mar 5 03:03:18 2020 460.status-mail-rejects
drwxr-xr-x 2 root wheel 1.0K Dec 7 06:23:36 2018 .
drwxr-xr-x 6 root wheel 512B Dec 7 06:23:36 2018 ..
root@zfs01:/usr/home/ruben # root@zfs01:/usr/home/ruben # ls -lahtuT /etc/periodic/security/ total 68 -rwxr-xr-x 1 root wheel 2.3K Mar 7 03:01:48 2020 900.tcpwrap -rwxr-xr-x 1 root wheel 2.3K Mar 7 03:01:48 2020 800.loginfail -rwxr-xr-x 1 root wheel 1.9K Mar 7 03:01:48 2020 700.kernelmsg -r--r--r-- 1 root wheel 2.8K Mar 7 03:01:48 2020 security.functions -rwxr-xr-x 1 root wheel 2.0K Mar 7 03:01:48 2020 610.ipf6denied -rwxr-xr-x 1 root wheel 2.2K Mar 7 03:01:48 2020 550.ipfwlimit -rwxr-xr-x 1 root wheel 2.1K Mar 7 03:01:48 2020 520.pfdenied -rwxr-xr-x 1 root wheel 1.9K Mar 7 03:01:48 2020 510.ipfdenied -rwxr-xr-x 1 root wheel 2.0K Mar 7 03:01:48 2020 500.ipfwdenied -rwxr-xr-x 1 root wheel 1.9K Mar 7 03:01:48 2020 410.logincheck -rwxr-xr-x 1 root wheel 1.9K Mar 7 03:01:48 2020 400.passwdless -rwxr-xr-x 1 root wheel 1.9K Mar 7 03:01:48 2020 300.chkuid0 -rwxr-xr-x 1 root wheel 2.3K Mar 7 03:01:48 2020 200.chkmounts -rwxr-xr-x 1 root wheel 2.2K Mar 7 03:01:25 2020 110.neggrpperm -rwxr-xr-x 1 root wheel 2.2K Mar 7 03:01:00 2020 100.chksetuid drwxr-xr-x 2 root wheel 512B Dec 7 06:23:36 2018 . drwxr-xr-x 6 root wheel 512B Dec 7 06:23:36 2018 .. root@zfs01:/usr/home/ruben # The NIC had been going up/down ever since (for 2 days) until a coworker rebooted it this morning. There does seem to be a relationship with the periodic framework, with the issues occurring at 03:05 while the last timestamp updates around 03:01 ... Will attach the verbose boot log. Feel free to request any additional details! Kind regards, Ruben Created attachment 212278 [details]
verbose_boot_zfs01_25_02
Hi Ruben, Can you confirm that "cpl does not map to outstanding cmd" was the first unusual line to show up? (i.e., there weren't more lines before that which got lost?) We had a hypothetical explanation for this which should have also involved having a "missing interrupt" message at the same time... Colin Percival (In reply to Colin Percival from comment #38) Hi Colin, There were issues with the NIC (perhaps interrupt related?) prior to the "cpl does not map to outstanding cmd" log entry. I will upload the full /var/log/messages contents between a couple of incidents. Please let me know if you would like me to provide further context/configure additional logging, etc. Kind regards, Ruben Created attachment 212295 [details]
messages
Another occurrence this night: Mar 11 07:14:28 zfs01 ntpd[6009]: leapsecond file ('/var/db/ntpd.leap-seconds.list'): expired less than 75 days ago Mar 12 03:05:45 zfs01 kernel: nvme1: cpl does not map to outstanding cmd Mar 12 03:05:45 zfs01 kernel: cdw0:00000000 sqhd:001e sqid:0001 cid:0016 p:0 sc:00 sct:0 m:0 dnr:0 Mar 12 03:05:45 zfs01 kernel: nvme1: Missing interrupt Mar 12 03:05:45 zfs01 kernel: nvme1: Resetting controller due to a timeout. Mar 12 03:05:45 zfs01 kernel: nvme1: resetting controller Mar 12 03:05:45 zfs01 kernel: nvme1: temperature threshold not supported Mar 12 03:05:45 zfs01 kernel: nvme1: aborting outstanding i/o Mar 12 03:06:15 zfs01 kernel: nvme1: Missing interrupt Mar 12 03:07:08 zfs01 kernel: nvme1: Resetting controller due to a timeout. Mar 12 03:07:08 zfs01 kernel: nvme1: resetting controller Mar 12 03:07:08 zfs01 kernel: nvme1: temperature threshold not supported Mar 12 03:07:08 zfs01 kernel: nvme1: aborting outstanding i/o Mar 12 03:07:08 zfs01 syslogd: last message repeated 9 times Mar 12 03:07:13 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). 
Reset the device Mar 12 03:07:13 zfs01 kernel: ena0: Trigger reset is on Mar 12 03:07:13 zfs01 kernel: ena0: device is going DOWN Mar 12 03:07:38 zfs01 kernel: nvme1: Missing interrupt Mar 12 03:08:05 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x344 Mar 12 03:08:05 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported) Mar 12 03:08:05 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52 Mar 12 03:08:05 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53 Mar 12 03:08:05 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54 Mar 12 03:08:05 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X Mar 12 03:08:05 zfs01 kernel: ena0: device is going UP Mar 12 03:08:05 zfs01 kernel: ena0: link is UP Mar 12 03:09:15 zfs01 dhclient[816]: send_packet6: Network is down Mar 12 03:09:15 zfs01 dhclient[816]: dhc6: send_packet6() sent -1 of 52 bytes Mar 12 03:09:50 zfs01 dhclient[67871]: send_packet: Network is down Mar 12 03:10:12 zfs01 syslogd: last message repeated 3 times Mar 12 03:14:23 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). 
Reset the device Mar 12 03:14:23 zfs01 kernel: ena0: Trigger reset is on Mar 12 03:14:23 zfs01 kernel: ena0: device is going DOWN Mar 12 03:15:18 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x176 Mar 12 03:15:19 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported) Mar 12 03:15:19 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52 Mar 12 03:15:19 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53 Mar 12 03:15:19 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54 Mar 12 03:15:19 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X Mar 12 03:15:19 zfs01 kernel: ena0: device is going UP Mar 12 03:15:19 zfs01 kernel: ena0: link is UP Mar 12 03:15:23 zfs01 dhclient[816]: send_packet6: Network is down Mar 12 03:15:23 zfs01 dhclient[816]: dhc6: send_packet6() sent -1 of 52 bytes root@zfs01:~ # ls -lahtuT /etc/periodic/security/ total 68 -r--r--r-- 1 root wheel 2.8K Mar 12 03:01:48 2020 security.functions -rwxr-xr-x 1 root wheel 2.3K Mar 12 03:01:48 2020 900.tcpwrap -rwxr-xr-x 1 root wheel 2.3K Mar 12 03:01:48 2020 800.loginfail -rwxr-xr-x 1 root wheel 1.9K Mar 12 03:01:48 2020 700.kernelmsg -rwxr-xr-x 1 root wheel 2.0K Mar 12 03:01:48 2020 610.ipf6denied -rwxr-xr-x 1 root wheel 2.2K Mar 12 03:01:48 2020 550.ipfwlimit -rwxr-xr-x 1 root wheel 2.1K Mar 12 03:01:48 2020 520.pfdenied -rwxr-xr-x 1 root wheel 1.9K Mar 12 03:01:48 2020 510.ipfdenied -rwxr-xr-x 1 root wheel 2.0K Mar 12 03:01:48 2020 500.ipfwdenied -rwxr-xr-x 1 root wheel 1.9K Mar 12 03:01:48 2020 410.logincheck -rwxr-xr-x 1 root wheel 1.9K Mar 12 03:01:48 2020 400.passwdless -rwxr-xr-x 1 root wheel 1.9K Mar 12 03:01:48 2020 300.chkuid0 -rwxr-xr-x 1 root wheel 2.3K Mar 12 03:01:48 2020 200.chkmounts -rwxr-xr-x 1 root wheel 2.2K Mar 12 03:01:24 2020 110.neggrpperm -rwxr-xr-x 1 root wheel 2.2K Mar 12 03:01:00 2020 100.chksetuid drwxr-xr-x 2 root wheel 512B Dec 7 06:23:36 2018 . drwxr-xr-x 6 root wheel 512B Dec 7 06:23:36 2018 .. 
root@zfs01:~ # ls -lahtuT /etc/periodic/daily/ total 128 -rwxr-xr-x 1 root wheel 811B Mar 12 03:12:55 2020 999.local -rwxr-xr-x 1 root wheel 2.8K Mar 12 03:12:55 2020 800.scrub-zfs -rwxr-xr-x 1 root wheel 845B Mar 12 03:12:55 2020 510.status-world-kernel -rwxr-xr-x 1 root wheel 737B Mar 12 03:12:55 2020 500.queuerun -rwxr-xr-x 1 root wheel 498B Mar 12 03:12:55 2020 480.status-ntpd -rwxr-xr-x 1 root wheel 451B Mar 12 03:02:07 2020 480.leapfile-ntpd -rwxr-xr-x 1 root wheel 2.0K Mar 12 03:01:57 2020 460.status-mail-rejects -rwxr-xr-x 1 root wheel 1.0K Mar 12 03:01:00 2020 450.status-security -rwxr-xr-x 1 root wheel 1.4K Mar 12 03:01:00 2020 440.status-mailq -rwxr-xr-x 1 root wheel 705B Mar 12 03:01:00 2020 430.status-uptime -rwxr-xr-x 1 root wheel 611B Mar 12 03:01:00 2020 420.status-network -rwxr-xr-x 1 root wheel 684B Mar 12 03:01:00 2020 410.status-mfi -rwxr-xr-x 1 root wheel 590B Mar 12 03:01:00 2020 409.status-gconcat -rwxr-xr-x 1 root wheel 590B Mar 12 03:01:00 2020 408.status-gstripe -rwxr-xr-x 1 root wheel 591B Mar 12 03:01:00 2020 407.status-graid3 -rwxr-xr-x 1 root wheel 596B Mar 12 03:01:00 2020 406.status-gmirror -rwxr-xr-x 1 root wheel 807B Mar 12 03:01:00 2020 404.status-zfs -rwxr-xr-x 1 root wheel 583B Mar 12 03:01:00 2020 401.status-graid -rwxr-xr-x 1 root wheel 773B Mar 12 03:01:00 2020 400.status-disks -rwxr-xr-x 1 root wheel 724B Mar 12 03:01:00 2020 330.news -r-xr-xr-x 1 root wheel 1.4K Mar 12 03:01:00 2020 310.accounting -rwxr-xr-x 1 root wheel 693B Mar 12 03:01:00 2020 300.calendar -rwxr-xr-x 1 root wheel 1.0K Mar 12 03:01:00 2020 210.backup-aliases -rwxr-xr-x 1 root wheel 1.7K Mar 12 03:01:00 2020 200.backup-passwd -rwxr-xr-x 1 root wheel 603B Mar 12 03:01:00 2020 150.clean-hoststat -rwxr-xr-x 1 root wheel 1.0K Mar 12 03:01:00 2020 140.clean-rwho -rwxr-xr-x 1 root wheel 709B Mar 12 03:01:00 2020 130.clean-msgs -rwxr-xr-x 1 root wheel 1.1K Mar 12 03:01:00 2020 120.clean-preserve -rwxr-xr-x 1 root wheel 1.5K Mar 12 03:01:00 2020 110.clean-tmps 
-rwxr-xr-x 1 root wheel 1.3K Mar 12 03:01:00 2020 100.clean-disks drwxr-xr-x 2 root wheel 1.0K Dec 7 06:23:36 2018 . drwxr-xr-x 6 root wheel 512B Dec 7 06:23:36 2018 .. root@zfs01:~ # Can you try setting hw.nvme.timeout_period in /boot/loader.conf? In particular: 1. Setting it to "1" may result in more errors being logged but less hanging, and 2. Setting it to "3600" (or something similarly big) may result in fewer errors but longer hangs. I don't think either of these is a solution but it would be very enlightening to know if either of these has the anticipated effect. (In reply to Colin Percival from comment #42) Hi Colin, I've just added this entry to /boot/loader.conf: hw.nvme.timeout_period=1 and rebooted the instance. Will provide feedback on perceived effects. Kind Regards, Ruben (In reply to mail from comment #43) Well, the instance is still online. It appears to be in some sort of trouble, probably since: Mar 25 03:05:38 zfs01 kernel: nvme1: Missing interrupt Mar 25 03:07:04 zfs01 syslogd: last message repeated 1 times Mar 25 03:07:13 zfs01 syslogd: last message repeated 1 times Logging on with ssh works, but there is a huge, noticeable lag. There is a 5-15 second delay between characters I type on my keyboard and them appearing on the terminal. 
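(For readers following along: hw.nvme.timeout_period is a loader tunable, so it is set in /boot/loader.conf and takes effect at the next boot. A minimal sketch of the setting being discussed here — the value "1" is the aggressive-logging option Colin suggests:)

```
# /boot/loader.conf -- NVMe I/O timeout in seconds (stock default is much
# larger); "1" logs errors sooner, "3600" suppresses them but hangs longer.
hw.nvme.timeout_period="1"
```

After rebooting, the active value can be confirmed with `sysctl hw.nvme.timeout_period`.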
There are no other obvious log entries and top shows: root@zfs01:/usr/home/ruben # top -n -b 10 last pid: 6417; load averages: 0.29, 0.56, 0.53 up 18+00:53:17 14:09:45 49 processes: 1 running, 48 sleeping CPU: 0.8% user, 0.0% nice, 2.9% system, 0.1% interrupt, 96.2% idle Mem: 23M Active, 3300K Inact, 25M Laundry, 14G Wired, 714M Buf, 1539M Free ARC: 10G Total, 6090M MFU, 4031M MRU, 420K Anon, 80M Header, 508M Other 9037M Compressed, 12G Uncompressed, 1.35:1 Ratio Swap: 8192M Total, 8192M Free PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND 92501 root 1 25 0 21M 11M select 1 0:02 6.98% ssh 97476 root 1 22 0 11M 2528K select 0 0:01 2.98% pv 1145 root 1 21 0 11M 2528K select 0 0:01 0.98% nc 46323 root 20 20 0 11M 2528K rpcsvc 1 84:50 0.00% nfsd 73646 ntpd 1 20 0 16M 16M select 0 1:20 0.00% ntpd 30467 _pflogd 1 20 0 12M 3072K bpf 0 1:15 0.00% pflogd 81756 root 1 21 0 12M 3892K wait 0 0:13 0.00% sh 70675 root 1 34 0 11M 968K nanslp 1 0:07 0.00% cron 58098 root 1 20 0 11M 2672K select 0 0:07 0.00% syslogd 29034 root 1 20 0 15M 5648K select 0 0:05 0.00% nrpe3 root@zfs01:/usr/home/ruben # The NFS clients for this nfs server also seem to be performing their normal tasks on the dataset. Strange. Hi Ruben, Getting more "missing interrupt" messages is an expected result of reducing the timeout period, so that's kind of a good sign. I don't know what would be making the system very slow though -- not unless this is a T3 and you've run out of burstable CPU quota. Very strange indeed. Any errors logged recently relating to ENA? (In reply to Colin Percival from comment #45) Hi Colin, Well, another crash this morning. A coworker also had to reboot the instance again. There are a lot of messages regarding ENA in the logs. 
This is /var/log/messages from the event: Apr 1 13:33:45 zfs01 ntpd[73646]: leapsecond file ('/var/db/ntpd.leap-seconds.list'): expired less than 96 days ago Apr 2 03:04:51 zfs01 kernel: nvme1: Missing interrupt Apr 2 03:04:58 zfs01 syslogd: last message repeated 1 times Apr 2 03:05:12 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device Apr 2 03:05:12 zfs01 kernel: ena0: Trigger reset is on Apr 2 03:05:12 zfs01 kernel: ena0: device is going DOWN Apr 2 03:05:13 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x25d Apr 2 03:05:14 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported) Apr 2 03:05:14 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52 Apr 2 03:05:14 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53 Apr 2 03:05:14 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54 Apr 2 03:05:14 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X Apr 2 03:05:14 zfs01 kernel: ena0: device is going UP Apr 2 03:05:14 zfs01 kernel: ena0: link is UP Apr 2 03:05:26 zfs01 kernel: nvme1: Missing interrupt Apr 2 03:05:40 zfs01 syslogd: last message repeated 1 times Apr 2 03:09:20 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). 
Reset the device Apr 2 03:09:20 zfs01 kernel: ena0: Trigger reset is on Apr 2 03:09:20 zfs01 kernel: ena0: device is going DOWN Apr 2 03:09:20 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 03:09:20 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 03:09:20 zfs01 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x25bena0: free uncompleted tx mbuf qid 1 idx 0x2ad Apr 2 03:09:21 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported) Apr 2 03:09:21 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52 Apr 2 03:09:21 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53 Apr 2 03:09:21 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54 Apr 2 03:09:21 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X Apr 2 03:09:21 zfs01 kernel: ena0: device is going UP Apr 2 03:09:21 zfs01 kernel: ena0: link is UP Apr 2 03:11:29 zfs01 kernel: nvme1: Missing interrupt Apr 2 03:11:40 zfs01 kernel: nvme1: cpl does not map to outstanding cmd Apr 2 03:11:40 zfs01 kernel: cdw0:00000000 sqhd:0013 sqid:0001 cid:001c p:0 sc:00 sct:0 m:0 dnr:0 Apr 2 03:11:40 zfs01 kernel: nvme1: Missing interrupt Apr 2 03:11:40 zfs01 kernel: nvme1: Resetting controller due to a timeout. 
Apr 2 03:11:40 zfs01 kernel: nvme1: resetting controller Apr 2 03:11:40 zfs01 kernel: nvme1: temperature threshold not supported Apr 2 03:11:40 zfs01 kernel: nvme1: aborting outstanding i/o Apr 2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o Apr 2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401380000 len:256 Apr 2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o Apr 2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401380800 len:256 Apr 2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o Apr 2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401381576 len:256 Apr 2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o Apr 2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401382400 len:256 Apr 2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o Apr 2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401405864 len:256 Apr 2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o Apr 2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401408264 len:256 Apr 2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o Apr 2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401412720 len:256 Apr 2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o Apr 2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401413240 len:256 Apr 2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o Apr 2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1401414504 len:256 Apr 2 03:11:40 zfs01 kernel: nvme1: resubmitting queued i/o Apr 2 03:11:40 zfs01 kernel: nvme1: WRITE sqid:2 cid:0 nsid:1 lba:1938636792 len:8 Apr 2 03:11:46 zfs01 kernel: nvme1: Missing interrupt Apr 2 03:11:56 zfs01 kernel: nvme1: Resetting controller due to a timeout. 
Apr 2 03:11:56 zfs01 kernel: nvme1: resetting controller Apr 2 03:11:56 zfs01 kernel: nvme1: temperature threshold not supported Apr 2 03:11:56 zfs01 kernel: nvme1: aborting outstanding i/o Apr 2 03:11:56 zfs01 syslogd: last message repeated 10 times Apr 2 03:12:02 zfs01 kernel: nvme1: WRITE sqid:1 cid:5 nsid:1 lba:3847102 len:64 Apr 2 03:12:02 zfs01 kernel: nvme1: INVALID OPCODE (00/01) sqid:1 cid:28 cdw0:0 Apr 2 03:12:02 zfs01 kernel: nvme1: Missing interrupt Apr 2 03:14:18 zfs01 kernel: ena0: Keep alive watchdog timeout. Apr 2 03:14:18 zfs01 kernel: ena0: Trigger reset is on Apr 2 03:14:18 zfs01 kernel: ena0: device is going DOWN Apr 2 03:15:36 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 03:15:36 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 03:15:39 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported) Apr 2 03:15:39 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52 Apr 2 03:15:39 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53 Apr 2 03:15:39 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54 Apr 2 03:15:39 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X Apr 2 03:15:39 zfs01 kernel: ena0: ena0: device is going UP Apr 2 03:15:39 zfs01 kernel: link is UP Apr 2 03:17:22 zfs01 dhclient[69443]: send_packet: Network is down Apr 2 03:17:40 zfs01 syslogd: last message repeated 4 times Apr 2 03:17:44 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 03:17:44 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 03:17:50 zfs01 dhclient[69443]: send_packet: Network is down Apr 2 03:18:21 zfs01 syslogd: last message repeated 1 times Apr 2 03:19:37 zfs01 syslogd: last message repeated 2 times Apr 2 03:19:39 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 03:19:39 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 03:23:20 zfs01 kernel: ena0: Keep alive watchdog timeout. 
Apr 2 03:23:20 zfs01 kernel: ena0: Trigger reset is on Apr 2 03:23:20 zfs01 kernel: ena0: device is going DOWN Apr 2 03:23:20 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 03:23:20 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 03:23:23 zfs01 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x2ac Apr 2 03:23:24 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported) Apr 2 03:23:24 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52 Apr 2 03:23:24 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53 Apr 2 03:23:24 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54 Apr 2 03:23:24 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X Apr 2 03:23:24 zfs01 kernel: ena0: device is going UP Apr 2 03:23:24 zfs01 kernel: ena0: link is UP Apr 2 03:25:09 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 03:25:09 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 03:27:00 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 03:27:00 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 03:28:49 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 03:28:49 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 04:37:52 zfs01 nrpe[41394]: Could not read request from client 172.28.8.16, bailing out... Apr 2 04:43:45 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). 
Reset the device Apr 2 04:43:45 zfs01 kernel: ena0: Trigger reset is on Apr 2 04:43:45 zfs01 kernel: ena0: device is going DOWN Apr 2 04:44:22 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 04:44:22 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 04:44:25 zfs01 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x28f Apr 2 04:44:26 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported) Apr 2 04:44:26 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52 Apr 2 04:44:26 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53 Apr 2 04:44:26 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54 Apr 2 04:44:26 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X Apr 2 04:44:26 zfs01 kernel: stray irq260 Apr 2 04:44:26 zfs01 kernel: ena0: ena0: device is going UP Apr 2 04:44:26 zfs01 kernel: link is UP Apr 2 04:46:33 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 04:46:33 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 04:48:36 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 04:48:36 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 04:50:25 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 04:50:25 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 04:52:37 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 04:52:37 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 04:54:40 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 04:54:40 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 05:09:52 zfs01 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). 
Reset the device Apr 2 05:09:52 zfs01 kernel: ena0: Trigger reset is on Apr 2 05:09:52 zfs01 kernel: ena0: device is going DOWN Apr 2 05:11:03 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 05:11:03 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 05:11:04 zfs01 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x22b Apr 2 05:11:05 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported) Apr 2 05:11:05 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52 Apr 2 05:11:05 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53 Apr 2 05:11:05 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54 Apr 2 05:11:05 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X Apr 2 05:11:05 zfs01 kernel: stray irq260 Apr 2 05:11:05 zfs01 kernel: ena0: device is going UP Apr 2 05:11:05 zfs01 kernel: ena0: link is UP Apr 2 05:13:00 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 05:13:00 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 05:14:57 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 05:14:57 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 05:16:46 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 05:16:46 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 05:18:39 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 05:18:39 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 05:20:41 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 05:20:41 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 05:22:33 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 05:22:33 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 05:28:52 zfs01 kernel: ena0: Keep alive watchdog timeout. 
Apr 2 05:28:52 zfs01 kernel: ena0: Trigger reset is on Apr 2 05:28:52 zfs01 kernel: ena0: device is going DOWN Apr 2 05:28:52 zfs01 dhclient[7435]: send_packet6: Network is down Apr 2 05:28:52 zfs01 dhclient[7435]: dhc6: send_packet6() sent -1 of 52 bytes Apr 2 05:29:24 zfs01 kernel: ena0: attempting to allocate 3 MSI-X vectors (9 supported) Apr 2 05:29:24 zfs01 kernel: msi: routing MSI-X IRQ 259 to local APIC 0 vector 52 Apr 2 05:29:24 zfs01 kernel: msi: routing MSI-X IRQ 260 to local APIC 0 vector 53 Apr 2 05:29:24 zfs01 kernel: msi: routing MSI-X IRQ 261 to local APIC 0 vector 54 Apr 2 05:29:24 zfs01 kernel: ena0: using IRQs 259-261 for MSI-X Apr 2 05:29:24 zfs01 kernel: stray irq260 Apr 2 05:29:24 zfs01 kernel: ena0: ena0: device is going UP Apr 2 05:29:24 zfs01 kernel: link is UP Apr 2 05:57:05 zfs01 syslogd: kernel boot file is /boot/kernel/kernel Apr 2 05:57:05 zfs01 kernel: pflog0: promiscuous mode disabled Apr 2 05:57:05 zfs01 kernel: Apr 2 05:57:05 zfs01 syslogd: last message repeated 1 times Apr 2 05:57:05 zfs01 kernel: Fatal trap 12: page fault while in kernel mode Apr 2 05:57:05 zfs01 kernel: cpuid = 0; apic id = 00 Apr 2 05:57:05 zfs01 kernel: fault virtual address = 0x78 Apr 2 05:57:05 zfs01 kernel: fault code = supervisor read data, page not present Apr 2 05:57:05 zfs01 kernel: instruction pointer = 0x20:0xffffffff80b17c97 Apr 2 05:57:05 zfs01 kernel: stack pointer = 0x28:0xfffffe000045d1e0 Apr 2 05:57:05 zfs01 kernel: frame pointer = 0x28:0xfffffe000045d210 Apr 2 05:57:05 zfs01 kernel: code segment = base rx0, limit 0xfffff, type 0x1b Apr 2 05:57:05 zfs01 kernel: = DPL 0, pres 1, long 1, def32 0, gran 1 Apr 2 05:57:05 zfs01 kernel: processor eflags = interrupt enabled, resume, IOPL = 0 Apr 2 05:57:05 zfs01 kernel: current process = 12 (irq257: nvme0:io0) Apr 2 05:57:05 zfs01 kernel: trap number = 12 Apr 2 05:57:05 zfs01 kernel: panic: page fault Apr 2 05:57:05 zfs01 kernel: cpuid = 0 Apr 2 05:57:05 zfs01 kernel: time = 1585806979 Apr 2 05:57:05 
zfs01 kernel: KDB: stack backtrace: Apr 2 05:57:05 zfs01 kernel: #0 0xffffffff80c1d2b7 at kdb_backtrace+0x67 Apr 2 05:57:05 zfs01 kernel: #1 0xffffffff80bd05ed at vpanic+0x19d Apr 2 05:57:05 zfs01 kernel: #2 0xffffffff80bd0443 at panic+0x43 Apr 2 05:57:05 zfs01 kernel: #3 0xffffffff810a7dcc at trap_fatal+0x39c Apr 2 05:57:05 zfs01 kernel: #4 0xffffffff810a7e19 at trap_pfault+0x49 Apr 2 05:57:05 zfs01 kernel: #5 0xffffffff810a740f at trap+0x29f Apr 2 05:57:05 zfs01 kernel: #6 0xffffffff81081a2c at calltrap+0x8 Apr 2 05:57:05 zfs01 kernel: #7 0xffffffff8079c5ac at nvme_qpair_complete_tracker+0x1bc Apr 2 05:57:05 zfs01 kernel: #8 0xffffffff8079c2c4 at nvme_qpair_process_completions+0xd4 Apr 2 05:57:05 zfs01 kernel: #9 0xffffffff80b93dd4 at ithread_loop+0x1d4 Apr 2 05:57:05 zfs01 kernel: #10 0xffffffff80b90c43 at fork_exit+0x83 Apr 2 05:57:05 zfs01 kernel: #11 0xffffffff81082a6e at fork_trampoline+0xe Apr 2 05:57:05 zfs01 kernel: Uptime: 20d16h39m50s Apr 2 05:57:05 zfs01 kernel: Rebooting... 
Apr 2 05:57:05 zfs01 kernel: ---<<BOOT>>--- This occurred only minutes after running the periodic framework: root@zfs01:~ # ls -lahtuT /etc/periodic/daily/ total 128 -rwxr-xr-x 1 root wheel 451B Apr 2 03:02:09 2020 480.leapfile-ntpd -rwxr-xr-x 1 root wheel 2.0K Apr 2 03:01:58 2020 460.status-mail-rejects -rwxr-xr-x 1 root wheel 1.0K Apr 2 03:01:00 2020 450.status-security -rwxr-xr-x 1 root wheel 1.4K Apr 2 03:01:00 2020 440.status-mailq -rwxr-xr-x 1 root wheel 705B Apr 2 03:01:00 2020 430.status-uptime -rwxr-xr-x 1 root wheel 611B Apr 2 03:01:00 2020 420.status-network -rwxr-xr-x 1 root wheel 684B Apr 2 03:01:00 2020 410.status-mfi -rwxr-xr-x 1 root wheel 590B Apr 2 03:01:00 2020 409.status-gconcat -rwxr-xr-x 1 root wheel 590B Apr 2 03:01:00 2020 408.status-gstripe -rwxr-xr-x 1 root wheel 591B Apr 2 03:01:00 2020 407.status-graid3 -rwxr-xr-x 1 root wheel 596B Apr 2 03:01:00 2020 406.status-gmirror -rwxr-xr-x 1 root wheel 807B Apr 2 03:01:00 2020 404.status-zfs -rwxr-xr-x 1 root wheel 583B Apr 2 03:01:00 2020 401.status-graid -rwxr-xr-x 1 root wheel 773B Apr 2 03:01:00 2020 400.status-disks -rwxr-xr-x 1 root wheel 724B Apr 2 03:01:00 2020 330.news -r-xr-xr-x 1 root wheel 1.4K Apr 2 03:01:00 2020 310.accounting -rwxr-xr-x 1 root wheel 693B Apr 2 03:01:00 2020 300.calendar -rwxr-xr-x 1 root wheel 1.0K Apr 2 03:01:00 2020 210.backup-aliases -rwxr-xr-x 1 root wheel 1.7K Apr 2 03:01:00 2020 200.backup-passwd -rwxr-xr-x 1 root wheel 603B Apr 2 03:01:00 2020 150.clean-hoststat -rwxr-xr-x 1 root wheel 1.0K Apr 2 03:01:00 2020 140.clean-rwho -rwxr-xr-x 1 root wheel 709B Apr 2 03:01:00 2020 130.clean-msgs -rwxr-xr-x 1 root wheel 1.1K Apr 2 03:01:00 2020 120.clean-preserve -rwxr-xr-x 1 root wheel 1.5K Apr 2 03:01:00 2020 110.clean-tmps -rwxr-xr-x 1 root wheel 1.3K Apr 2 03:01:00 2020 100.clean-disks -rwxr-xr-x 1 root wheel 811B Apr 1 03:54:15 2020 999.local -rwxr-xr-x 1 root wheel 2.8K Apr 1 03:54:15 2020 800.scrub-zfs -rwxr-xr-x 1 root wheel 845B Apr 1 03:54:15 2020 
510.status-world-kernel -rwxr-xr-x 1 root wheel 737B Apr 1 03:54:15 2020 500.queuerun -rwxr-xr-x 1 root wheel 498B Apr 1 03:54:15 2020 480.status-ntpd drwxr-xr-x 2 root wheel 1.0K Dec 7 06:23:36 2018 . drwxr-xr-x 6 root wheel 512B Dec 7 06:23:36 2018 .. root@zfs01:~ # ls -lahtuT /etc/periodic/security/ total 68 -r--r--r-- 1 root wheel 2.8K Apr 2 03:01:49 2020 security.functions -rwxr-xr-x 1 root wheel 2.3K Apr 2 03:01:49 2020 900.tcpwrap -rwxr-xr-x 1 root wheel 2.3K Apr 2 03:01:49 2020 800.loginfail -rwxr-xr-x 1 root wheel 1.9K Apr 2 03:01:49 2020 700.kernelmsg -rwxr-xr-x 1 root wheel 2.0K Apr 2 03:01:49 2020 610.ipf6denied -rwxr-xr-x 1 root wheel 2.2K Apr 2 03:01:49 2020 550.ipfwlimit -rwxr-xr-x 1 root wheel 2.1K Apr 2 03:01:49 2020 520.pfdenied -rwxr-xr-x 1 root wheel 1.9K Apr 2 03:01:49 2020 510.ipfdenied -rwxr-xr-x 1 root wheel 2.0K Apr 2 03:01:49 2020 500.ipfwdenied -rwxr-xr-x 1 root wheel 1.9K Apr 2 03:01:49 2020 410.logincheck -rwxr-xr-x 1 root wheel 1.9K Apr 2 03:01:49 2020 400.passwdless -rwxr-xr-x 1 root wheel 1.9K Apr 2 03:01:49 2020 300.chkuid0 -rwxr-xr-x 1 root wheel 2.3K Apr 2 03:01:49 2020 200.chkmounts -rwxr-xr-x 1 root wheel 2.2K Apr 2 03:01:25 2020 110.neggrpperm -rwxr-xr-x 1 root wheel 2.2K Apr 2 03:01:00 2020 100.chksetuid drwxr-xr-x 2 root wheel 512B Dec 7 06:23:36 2018 . drwxr-xr-x 6 root wheel 512B Dec 7 06:23:36 2018 .. root@zfs01:~ # I don't know how long we will be able to keep this installation running under these circumstances; the platform it provides NFS for will be "LIVE" in a couple of weeks and this issue needs to be resolved prior to that. Kind regards, Ruben Hi Ruben, Good timing. Can you test the patch in https://reviews.freebsd.org/D24256 ? It will probably add "nvme_timeout" messages to your console log -- I'd be interested to know how many -- but it also fixes a nasty race condition which I suspect is causing some of the breakage you're seeing. 
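Applying a Phabricator review to a FreeBSD source tree for testing typically looks something like the sketch below. This is illustrative, not a prescribed procedure: the checkout path, job count, and the `?download=true` raw-diff URL form are assumptions on my part; D24256 is the review referenced in the comment above.

```sh
# Check out the release branch being run on the instance (releng/12.1 here).
svnlite checkout https://svn.freebsd.org/base/releng/12.1 /usr/src
cd /usr/src

# Fetch the raw diff for the review under test and apply it to the tree.
fetch -o D24256.diff "https://reviews.freebsd.org/D24256?download=true"
patch -p0 < D24256.diff

# Rebuild and install the kernel, then reboot into it.
make -j"$(sysctl -n hw.ncpu)" buildkernel installkernel
shutdown -r now
```

The resulting kernel shows up with an "M" (modified) revision suffix in `uname -a`, as in the r359631M reported in the next comment.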
(In reply to Colin Percival from comment #47) Hi Colin, I've checked out the releng/12.1 branch, applied the patch, compiled, installed and rebooted. uname -a FreeBSD zfs01.kpnmsg.net 12.1-RELEASE-p3 FreeBSD 12.1-RELEASE-p3 #0 r359631M: Sat Apr 4 13:41:41 UTC 2020 root@zfs01.kpnmsg.net:/data/obj/data/src/amd64.amd64/sys/GENERIC amd64 I still have the hw.nvme.timeout_period=1 active in /boot/loader.conf; is that OK or should I switch that to something else? Will keep you updated on /var/log/messages contents. Hope the issue will trigger next week, because after that we really have to switch instance types (for the instance that got triggered the most (until now :P)). Regards, Ruben Thanks! Yes, it would be great if you can keep that tunable in place so we can see how it affects this. I'm particularly interested in seeing anything logged concerning nvme -- there will probably be additional "nvme_timeout" messages being logged even when nothing is visibly going wrong, and I'd like to see how often (and under what conditions) those appear. (In reply to Colin Percival from comment #49) Hi Colin, The machine that was exhibiting the most freezes will be tested extensively during the next couple of days. No errors on that instance yet. We will probably have to decommission this particular instance at the end of this week though :( I just had a look on one of the other machines and there is an error in one of the logs: Apr 5 09:31:41 volume3 ec2[7704]: -----END SSH HOST KEY FINGERPRINTS----- Apr 5 09:31:41 volume3 ec2[14622]: ############################################################# Apr 16 04:08:31 volume3 kernel: nvme1: nvme_timeout Apr 16 04:08:31 volume3 kernel: nvme1: Apr 16 04:08:31 volume3 kernel: Resetting controller due to a timeout. 
Apr 16 04:08:31 volume3 kernel: nvme1: resetting controller Apr 16 04:08:31 volume3 kernel: nvme1: temperature threshold not supported Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:2029427280 len:8 Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:1776954392 len:8 Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:1315250264 len:8 Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:871868818 len:8 Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:423018743 len:7 Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:2561980904 len:8 Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:1987706424 len:8 Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:115481696 len:8 Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:1095204568 len:8 Apr 16 04:08:31 volume3 kernel: nvme1: resubmitting queued i/o Apr 16 04:08:31 volume3 kernel: nvme1: READ sqid:2 cid:0 nsid:1 lba:948211334 len:12 No freezing or any other symptoms yet. This machine runs the patched kernel but doesn't have a "hw.nvme.timeout_period" setting configured in its /boot/loader.conf. Will provide feedback on new developments. Regards, Ruben (In reply to mail from comment #50) Hi Ruben, Thanks! 
This is consistent with my current understanding, which is that the patch I provided fixes the data structure corruption issues (which showed up as some of the more "interesting" warnings and also sometimes as kernel panics). The timeouts and freezing appear to be due to the ENA driver having trouble allocating jumbo mbuf clusters during low-memory conditions. Can you check "sysctl -a | grep mjum_alloc_fail" (ideally, log it every minute and see if it jumps around the times you see hanging)?

I have a patch which I believe fixes this, which I'd be happy to share if you want to try something experimental -- we're still discussing the details of the best way to address this, though.

Colin

Would "sysctl -a | grep mjum_alloc_fail" provide productive results on an unpatched 12.1-R kernel? I am not running the patches, but I am getting these same freezes, and I can run queries prior to rebooting.

Best.

Charles: Yes, a stock FreeBSD 12.1-RELEASE kernel should have those sysctls (technically they're in if_ena.ko, but you should have that loaded...), and it would be great to hear if those are non-zero and/or increasing around when you see the system freezing.

I believe this patch should fix the underlying problem, which is in the ENA driver: https://reviews.freebsd.org/D24546

If you're able to build a custom kernel, please test that patch and report results in that review or via email (cperciva@).

A commit references this bug:

Author: mw
Date: Thu May 7 11:28:41 UTC 2020
New revision: 360777
URL: https://svnweb.freebsd.org/changeset/base/360777

Log:
Optimize ENA Rx refill for low memory conditions

Sometimes, especially when there is not much memory in the system left, allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time and it is not guaranteed that it'll succeed.
In that situation, the fallback will work, but if the refill needs to take a place for a lot of descriptors at once, the time spent in m_getjcl looking for memory can cause system unresponsiveness due to high priority of the Rx task. This can also lead to driver reset, because Tx cleanup routine is being blocked and timer service could detect that Tx packets aren't cleaned up. The reset routine can further create another unresponsiveness - Rx rings are being refilled there, so m_getjcl will again burn the CPU. This was causing NVMe driver timeouts and resets, because network driver is having higher priority.

Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are enough - ENA MTU is being set to 9K anyway, so it's very unlikely that more space than 9KB will be needed. However, 9KB jumbo clusters can still cause issues, so by default the page size mbuf cluster will be used for the Rx descriptors. This can have a small (~2%) impact on the throughput of the device, so to restore original behavior, one must change sysctl "hw.ena.enable_9k_mbufs" to "1" in "/boot/loader.conf" file.

As a part of this patch (important fix), the version of the driver was updated to v2.1.2.

Submitted by: cperciva
Reviewed by: Michal Krawczyk <mk@semihalf.com>
Reviewed by: Ido Segev <idose@amazon.com>
Reviewed by: Guy Tzalik <gtzalik@amazon.com>
MFC after: 3 days
PR: 225791, 234838, 235856, 236989, 243531
Differential Revision: https://reviews.freebsd.org/D24546

Changes:
head/sys/dev/ena/ena.c
head/sys/dev/ena/ena.h
head/sys/dev/ena/ena_sysctl.c
head/sys/dev/ena/ena_sysctl.h

(In reply to Colin Percival from comment #54)

Hi Colin,

Good to see the problem has been identified and patched! Our machines have not frozen again since applying the patch you mentioned on 2020-04-02. Will this commit (360777) be released as an errata (I don't know of a more official term, I'm afraid)?
I'm asking because I would like to know whether we can switch back to using freebsd-update to apply the fix, or whether we should continue to compile our kernels manually.

Regards,
Ruben

(In reply to mail from comment #56)

Yes, we're planning on issuing an Errata Notice for this. I suggest waiting until that comes out, then getting the fix with freebsd-update.

A commit references this bug:

Author: mw
Date: Tue May 12 18:44:41 UTC 2020
New revision: 360985
URL: https://svnweb.freebsd.org/changeset/base/360985

Log:
MFC r360777:

Optimize ENA Rx refill for low memory conditions

Sometimes, especially when there is not much memory in the system left, allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time and it is not guaranteed that it'll succeed.

In that situation, the fallback will work, but if the refill needs to take a place for a lot of descriptors at once, the time spent in m_getjcl looking for memory can cause system unresponsiveness due to high priority of the Rx task. This can also lead to driver reset, because Tx cleanup routine is being blocked and timer service could detect that Tx packets aren't cleaned up. The reset routine can further create another unresponsiveness - Rx rings are being refilled there, so m_getjcl will again burn the CPU. This was causing NVMe driver timeouts and resets, because network driver is having higher priority.

Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are enough - ENA MTU is being set to 9K anyway, so it's very unlikely that more space than 9KB will be needed. However, 9KB jumbo clusters can still cause issues, so by default the page size mbuf cluster will be used for the Rx descriptors. This can have a small (~2%) impact on the throughput of the device, so to restore original behavior, one must change sysctl "hw.ena.enable_9k_mbufs" to "1" in "/boot/loader.conf" file.

As a part of this patch (important fix), the version of the driver was updated to v2.1.2.
Submitted by: cperciva
PR: 225791, 234838, 235856, 236989, 243531

Changes:
_U stable/12/
stable/12/sys/dev/ena/ena.c
stable/12/sys/dev/ena/ena.h

I think the patches have materialized into FreeBSD-EN-20:11.ena (https://www.freebsd.org/security/advisories/FreeBSD-EN-20:11.ena.asc). We will update our instances with freebsd-update (as suggested). Thanks, Colin, for your efforts :)
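For anyone who wants to gather the diagnostics Colin asks for earlier in the thread before updating, the counter check can be scripted. This is a minimal sketch and an assumption, not something posted in the thread itself; the only grounded detail is Colin's "sysctl -a | grep mjum_alloc_fail" suggestion and his interest in counters that are non-zero or increasing.

```shell
#!/bin/sh
# Minimal sketch (an assumption, not from the thread): take two samples of
# the ENA mjum_alloc_fail counters and report whether any of them changed,
# which is the "non-zero and/or increasing" signal asked about above.
# On a system without the ENA driver loaded, the grep matches nothing.

sample() { sysctl -a 2>/dev/null | grep mjum_alloc_fail; }

a=$(sample)
sleep 5   # the thread suggests sampling once a minute; shortened for the sketch
b=$(sample)

if [ "$a" != "$b" ]; then
    echo "mjum_alloc_fail counters changed:"
    printf '%s\n' "$b"
else
    echo "no change in mjum_alloc_fail counters"
fi
```

Run from cron (or a simple loop) and correlate the timestamps of any counter jumps with the times the system hangs.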
Created attachment 202156 [details]
VM screenshot

After a few days (it may happen after a day or after two weeks) the machine becomes unresponsive. One of the two EC2 status checks fails, and a reboot doesn't seem to do anything; a stop/start of the instance is necessary. The machine's workload makes constant use of CPU, network, and disk, with a few spikes, but nothing excessive. VM screenshot and system log attached.
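Since a reboot does not recover the wedged instance and only a stop/start cycle does, that recovery can be automated. A hedged sketch using the AWS CLI, assuming the CLI is installed and configured; the instance ID is a placeholder:

```shell
#!/bin/sh
# Hedged sketch: recover a frozen instance with a full stop/start cycle,
# which (unlike a reboot) re-provisions the instance on the AWS side.
# i-0123456789abcdef0 is a placeholder instance ID.
INSTANCE_ID="${1:-i-0123456789abcdef0}"

if command -v aws >/dev/null 2>&1; then
    aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
    aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID"
    aws ec2 start-instances --instance-ids "$INSTANCE_ID"
else
    echo "aws CLI not found; commands shown for reference only"
fi
```

Note that a stop/start (unlike a reboot) changes the public IP of instances without an Elastic IP, so any DNS pointing at the instance may need updating afterwards.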