Bug 234838 - ena drop-outs on 12.0-RELEASE
Summary: ena drop-outs on 12.0-RELEASE
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-RELEASE
Hardware: amd64
OS: Any
Importance: --- Affects Only Me
Assignee: freebsd-virtualization mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-10 23:08 UTC by Leif Pedersen
Modified: 2020-05-12 18:45 UTC
CC List: 7 users

See Also:


Attachments
aws console log (9.47 KB, text/plain)
2020-04-18 04:33 UTC, Darin Luckie

Description Leif Pedersen 2019-01-10 23:08:40 UTC
We're observing problems with the ena driver on AWS in 12.0-RELEASE. We see these kernel messages a few times per day, and, not surprisingly, the NIC appears to stop working for a few minutes and then recover. (We notice this because our Nagios instance alarms and then recovers.) This particular instance is an r4.large.

ena_com_prepare_tx() [TID:100766]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100766]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100378]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100376]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100844]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100765]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100363]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100523]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: device is going UP
ena0: link is UP
ena_com_prepare_tx() [TID:100401]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100477]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100634]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 0 idx 0xb9ena0: ena0: device is going UP
link is UP
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: device is going UP
ena0: link is UP
ena_com_prepare_tx() [TID:100360]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100587]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100365]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 1 idx 0x2bbena0: device is going UP
ena0: link is UP
ena_com_prepare_tx() [TID:100375]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100892]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 0 idx 0x234ena0: ena0: device is going UP
link is UP
ena_com_prepare_tx() [TID:100365]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100832]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 1 idx 0xf5ena0: ena0: device is going UP
link is UP
ena_com_prepare_tx() [TID:100474]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: device is going UP
ena0: link is UP
Comment 1 Colin Percival freebsd_committer 2019-01-10 23:54:54 UTC
Thanks for reporting this.  Can you tell me anything about the workload this instance is seeing?  Amount of bandwidth, TCP vs UDP, large packets vs small packets, anything else which seems like it might be relevant?
Comment 2 Leif Pedersen 2019-01-11 04:20:22 UTC
Of course! Thanks for looking!

It's our standby MySQL database, so there's a light but steady stream of network IO for the DB replication. Its filesystems are ZFS. Once an hour, six other machines (developer sandboxes) pull the latest ZFS snapshot of the DB in parallel, so our developers can clone any recent hourly snapshot for testing. These messages appear about three times per day, at the top of the hour while this pull would be running. They also occasionally appear at other times: once on Jan 4 and once on Dec 19, looking at logs going back to Dec 12. So it probably isn't just the six concurrent `zfs send`s that cause it.
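
For illustration, the hourly pull has roughly this shape; the host, pool, dataset, and snapshot names below are placeholders:

    # run on each of the six sandboxes at the top of the hour, roughly in parallel
    ssh db-standby zfs send -i tank/mysql@prev-hour tank/mysql@latest-hour \
        | zfs recv -F tank/mysql-mirror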

Also at the top of the hour, it runs a mysqldump cron job that takes ~10 minutes. That has nothing to do with the network; I mention it only because it increases CPU load significantly.

There are batch jobs that send a lot of transactions through the DB replication stream, but they don't seem to correlate. I think those cause only minimal network IO but high CPU & disk load.

The machine doesn't do anything else.

Hopefully that gives a useful picture of the workload. This is the only machine I've upgraded to 12; our other machines are running 11.1. It's my canary, and I don't dare upgrade the more important machines while this symptom persists.

Also, occasionally it will panic with the message "Fatal double fault" and no backtrace. I can get you more on that if you want. Sometimes this panic happens when the machine is booting, before it has even configured the network. Related? I dunno. Just giving you anything that might be useful.
Comment 3 Leif Pedersen 2019-01-11 18:13:02 UTC
Just FYI, I updated to 12.0-RELEASE-p2, which I saw announced a networking fix. Overnight, the problem appeared twice more, so apparently that didn't fix it.

Jan 11 02:03:21 db1 kernel: ena_com_prepare_tx() [TID:100482]: Not enough space in the tx queue
Jan 11 02:03:21 db1 kernel:  
Jan 11 02:03:21 db1 kernel: ena0: failed to prepare tx bufs
Jan 11 03:03:54 db1 kernel: ena_com_prepare_tx() [TID:100385]: Not enough space in the tx queue
Jan 11 03:03:54 db1 kernel:  
Jan 11 03:03:54 db1 kernel: ena0: failed to prepare tx bufs
Comment 4 Colin Percival freebsd_committer 2019-01-16 02:04:14 UTC
(In reply to Leif Pedersen from comment #3)

Can you build a kernel with the patch from r343071?  Apparently the 'failed to prepare tx bufs' situation is harmless (and the message should only be printed when debugging is turned on) but it's possible that the mere act of logging the warning is causing timeouts -- so it would be good to know if you see any sign of the device resets after applying this patch.
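
For reference, a minimal sketch of doing that, assuming the 12.0 sources are in /usr/src (the diff is against head, so it may need minor adjustment to apply cleanly):

    # fetch r343071 as a unified diff and apply it to the source tree
    svnlite diff -c 343071 https://svn.freebsd.org/base/head > /tmp/r343071.diff
    cd /usr/src && patch -p0 < /tmp/r343071.diff
    # rebuild and install the kernel, then reboot
    make -j"$(sysctl -n hw.ncpu)" buildkernel KERNCONF=GENERIC
    make installkernel KERNCONF=GENERIC
    shutdown -r now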
Comment 5 Colin Percival freebsd_committer 2019-08-30 00:43:56 UTC
Closing this as "feedback timeout".  Feel free to reopen if you see this with a newer kernel.
Comment 6 Ryan Langseth 2019-10-22 18:41:46 UTC
It seems there is still an issue with this. We are running FreeBSD 12.0-RELEASE-p10 on a c5.2xlarge instance and have had the system reset the network device twice in the last 24 hours.

The current traffic to it is a zfs recv over ssh running at ~40 MiB/s. The first time it dropped out, only the ena device was affected; the second time I got nvme 'Missing interrupt' messages as well. The system has 6 gp2 volumes for the zpool.

`grep kern.crit /var/log/messages`
Oct 21 12:26:52 <kern.crit> apache-00 kernel: Trying to mount root from ufs:/dev/gpt/rootfs [rw]...
Oct 21 12:26:52 <kern.crit> apache-00 kernel: ena0: device is going UP
Oct 21 12:26:52 <kern.crit> apache-00 kernel: ena0: device is going DOWN
Oct 21 12:26:52 <kern.crit> apache-00 kernel: ena0: device is going UP
Oct 21 12:26:52 <kern.crit> apache-00 kernel: intsmb0: <Intel PIIX4 SMBUS Interface> port 0xb100-0xb10f at device 1.3 on pci0
Oct 21 12:26:52 <kern.crit> apache-00 kernel: intsmb0: intr IRQ 9 enabled revision 255
Oct 21 12:26:52 <kern.crit> apache-00 kernel: smbus0: <System Management Bus> on intsmb0
Oct 21 12:26:53 <kern.crit> apache-00 kernel: Security policy loaded: MAC/ntpd (mac_ntpd)

Oct 22 06:33:55 <kern.crit> apache-00 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Oct 22 06:33:55 <kern.crit> apache-00 kernel: ena0: Trigger reset is on
Oct 22 06:33:55 <kern.crit> apache-00 kernel: ena0: device is going DOWN
Oct 22 06:34:02 <kern.crit> apache-00 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x1f2
Oct 22 06:34:03 <kern.crit> apache-00 kernel: ena0: ena0: device is going UP
Oct 22 06:34:03 <kern.crit> apache-00 kernel: link is UP

Oct 22 13:18:10 <kern.crit> apache-00 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Oct 22 13:18:10 <kern.crit> apache-00 kernel: ena0: Trigger reset is on
Oct 22 13:18:10 <kern.crit> apache-00 kernel: ena0: device is going DOWN
Oct 22 13:18:16 <kern.crit> apache-00 kernel: ena0: free uncompleted tx mbuf qid 4 idx 0x3a6
Oct 22 13:18:16 <kern.crit> apache-00 kernel: 
Oct 22 13:18:16 <kern.crit> apache-00 kernel: ena0: device is going UP
Oct 22 13:18:16 <kern.crit> apache-00 kernel: ena0: link is UP
Oct 22 13:18:47 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:18:51 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:18:51 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:19:17 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme6: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme4: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme5: nvme2: nvme4: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 syslogd: last message repeated 1 times
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme0: 
Oct 22 13:22:17 <kern.crit> apache-00 kernel: 
Oct 22 13:22:17 <kern.crit> apache-00 kernel: Missing interrupt
Oct 22 13:22:21 <kern.crit> apache-00 kernel: nvme2: nvme6: Missing interrupt
Oct 22 13:22:21 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:22:21 <kern.crit> apache-00 kernel: Missing interrupt
Oct 22 13:22:21 <kern.crit> apache-00 kernel: nvme4: Missing interrupt
Oct 22 13:22:51 <kern.crit> apache-00 kernel: nvme6: nvme4: Missing interrupt
Oct 22 13:22:51 <kern.crit> apache-00 kernel: Missing interrupt
Oct 22 13:22:51 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:22:51 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:23:16 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:23:21 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:23:21 <kern.crit> apache-00 kernel: nvme6: Missing interrupt
Oct 22 13:23:26 <kern.crit> apache-00 kernel: nvme4: Missing interrupt

I will add that this instance was originally a FreeBSD 11.x system that was freebsd-update'd to 12. As an 11 system it was panicking on the transfer every 3-4 hours. I am bringing up a fresh 12.x system to do additional testing.
Comment 7 Colin Percival freebsd_committer 2019-10-23 21:26:36 UTC
Please let me know if you see this on 12.1.  There are at least two relevant bug fixes which are in stable/12 but were not in 12.0-RELEASE.
Comment 8 Ryan Langseth 2019-11-07 16:10:22 UTC
I started a zfs send/recv via ssh between two instances upgraded to 12.1-RELEASE. The transfer did not stop, but I did get this on the sending system once over 8 TB of data transferred:

ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 2 idx 0x199
stray irq272
stray irq265
ena0: ena0: device is going UP
link is UP
Comment 9 Colin Percival freebsd_committer 2019-11-07 21:58:38 UTC
I'm inclined to say that if the worst we see is a single driver reset after 8 TB of traffic, we're probably doing ok.  Just to confirm, based on your experience with 12.0 you would have seen far worse after sending that much traffic, right?
Comment 10 Dmitri Goutnik freebsd_committer 2019-11-26 20:22:20 UTC
This is still a serious issue on 12.1-RELEASE; I see it semi-regularly (approx. once a week on average) on a busy m5.xlarge instance running PostgreSQL. The ena reset itself might be relatively harmless, but sometimes it causes a loss of connection to the EBS volumes, which is disastrous on a database server. Currently, this makes any instance type with an ena interface unusable for high-traffic applications.
Comment 11 Dmitri Goutnik freebsd_committer 2019-11-26 20:25:17 UTC
Reopening as it is still an issue on 12.1.
Comment 12 Leif Pedersen 2020-01-07 23:58:16 UTC
It's decidedly still a problem on 12.1. I see these kernel messages:

Jan  7 04:15:15 beta kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Jan  7 04:15:15 beta kernel: ena0: Trigger reset is on
Jan  7 04:15:15 beta kernel: ena0: device is going DOWN
Jan  7 04:15:17 beta kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x2ba
Jan  7 04:15:18 beta kernel: ena0: device is going UP
Jan  7 04:15:18 beta kernel: ena0: link is UP

And during shutdown:

Jan  7 05:15:45 beta kernel: ena_dma_alloc() [TID:100031]: bus_dmamem_alloc(16384) failed: 12
Jan  7 05:15:45 beta kernel:  
Jan  7 05:15:45 beta kernel: ena_com_init_io_sq() [TID:100031]: memory allocation failed 
Jan  7 05:15:45 beta kernel: ena0: Failed to create io TX queue #0 rc: 12
Jan  7 05:15:45 beta kernel: ena_up() [TID:100031]: create IO queues failed 
Jan  7 05:15:46 beta kernel: ena0: Failed to create I/O queues
Jan  7 05:15:46 beta kernel: ena0: ENA reset failed!
Jan  7 05:54:50 beta kernel: ena0: device is not running!
Jan  7 09:10:27 beta kernel: Stopping cron.
Comment 13 Leif Pedersen 2020-01-07 23:59:46 UTC
To elaborate: the resets happen every few minutes, and they do cause the machine to become unavailable at least some of the time.
Comment 14 Colin Percival freebsd_committer 2020-01-08 00:15:03 UTC
Ooh, every few minutes is *definitely* interesting.  Can you tell me how to reproduce this (maybe via email if there's anything private involved)?  It's hard to debug "approx. once a week on average" but if I can trigger this on a timescale of minutes I can add debugging, make effective use of dtrace, etc.
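
Purely as a sketch of the kind of hook that could help here (the probe name is an assumption and may be unavailable if the function is inlined):

    # aggregate kernel stacks that reach the tx prepare routine seen in the log messages
    dtrace -n 'fbt::ena_com_prepare_tx:entry { @[stack()] = count(); }'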
Comment 15 Leif Pedersen 2020-01-09 16:59:23 UTC
(In reply to Colin Percival from comment #14)
I'm sorry. I had a misunderstanding, and this machine was actually still on 11.3 when this happened. I'll post again if it happens on a machine that is actually on 12.1. Sorry again for the noise.
Comment 16 Colin Percival freebsd_committer 2020-02-15 22:33:11 UTC
(In reply to Dmitri Goutnik from comment #10)

I'm trying to reproduce this with 12.1-RELEASE.  Can you email me (cperciva@) with details about your environment?
Comment 17 Darin Luckie 2020-04-18 03:06:30 UTC
hi guys,

Colin, I don't know if this helps any, but I recently fubar'd my WordPress blog (m5a.large) on 12.1-RELEASE.

The problem I ran into was installing openvpn from ports (my php setup has some custom ports from a while back, so I'm just sticking with what I know).

Anyway, after adding the VPN, I get connected but am unable to NAT anything THROUGH the instance.

I spent days looking at tcpdumps, noticing I was NATing via pf and sending but not getting any replies.

In the end, I tested out an 11.3-RELEASE AMI (ami-01edfd2c712aaa20e), copied over my pf and VPN configuration, and behold, it works!

So, now I'm downgrading the 12 instance via source.

My guess? It's a problem with the new ena driver, since 11.3 runs via xn instead of ena...

HTH
Comment 18 Darin Luckie 2020-04-18 04:33:07 UTC
Created attachment 213518
aws console log

console log from m5.large unable to NAT using ena adapter
Comment 19 Darin Luckie 2020-04-18 04:34:14 UTC
Here's the console log. Downgrading was a no-go. Gonna try 12-stable before I migrate.
Comment 20 Colin Percival freebsd_committer 2020-04-18 19:18:27 UTC
Darin: That looks like a completely unrelated issue.
Comment 21 Colin Percival freebsd_committer 2020-04-22 23:34:57 UTC
I believe that this patch should fix the underlying problem, which is in the ENA driver: https://reviews.freebsd.org/D24546

If you're able to build a custom kernel, please test that patch and report results in that review or via email (cperciva@).
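
A rough sketch of testing it, assuming a source tree in /usr/src (save the raw diff from the review page first; the -p level depends on how the diff was exported):

    # apply the review diff, then rebuild, install, and reboot
    cd /usr/src && patch -p0 < /tmp/D24546.diff
    make -j"$(sysctl -n hw.ncpu)" buildkernel
    make installkernel
    shutdown -r now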
Comment 22 commit-hook freebsd_committer 2020-05-07 11:28:50 UTC
A commit references this bug:

Author: mw
Date: Thu May  7 11:28:40 UTC 2020
New revision: 360777
URL: https://svnweb.freebsd.org/changeset/base/360777

Log:
  Optimize ENA Rx refill for low memory conditions

  Sometimes, especially when there is not much memory left in the system,
  allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time
  and it is not guaranteed to succeed. In that situation the fallback will
  work, but if the refill needs to take place for a lot of descriptors at
  once, the time spent in m_getjcl looking for memory can cause system
  unresponsiveness due to the high priority of the Rx task. This can also
  lead to a driver reset, because the Tx cleanup routine is blocked and
  the timer service may detect that Tx packets are not being cleaned up.
  The reset routine can in turn cause further unresponsiveness: the Rx
  rings are refilled there, so m_getjcl will again burn the CPU. This was
  causing NVMe driver timeouts and resets, because the network driver has
  higher priority.

  Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are
  enough; the ENA MTU is set to 9K anyway, so it is very unlikely that
  more than 9KB will be needed.

  However, 9KB jumbo clusters can still cause issues, so by default
  page-size mbuf clusters will be used for the Rx descriptors. This can
  have a small (~2%) impact on the throughput of the device, so to
  restore the original behavior, set the sysctl "hw.ena.enable_9k_mbufs"
  to "1" in the /boot/loader.conf file.

  As part of this patch (important fix), the driver version was updated
  to v2.1.2.

  Submitted by:   cperciva
  Reviewed by:    Michal Krawczyk <mk@semihalf.com>
  Reviewed by:    Ido Segev <idose@amazon.com>
  Reviewed by:    Guy Tzalik <gtzalik@amazon.com>
  MFC after:      3 days
  PR:             225791, 234838, 235856, 236989, 243531
  Differential Revision: https://reviews.freebsd.org/D24546

Changes:
  head/sys/dev/ena/ena.c
  head/sys/dev/ena/ena.h
  head/sys/dev/ena/ena_sysctl.c
  head/sys/dev/ena/ena_sysctl.h
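
For anyone who wants the pre-patch behavior back after updating, the tunable named in the log above can be set with a one-liner (a sketch using sysrc):

    # restore 9KB Rx mbuf clusters on next boot; the new default is page-size clusters
    sysrc -f /boot/loader.conf hw.ena.enable_9k_mbufs="1"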
Comment 23 commit-hook freebsd_committer 2020-05-12 18:45:45 UTC
A commit references this bug:

Author: mw
Date: Tue May 12 18:44:41 UTC 2020
New revision: 360985
URL: https://svnweb.freebsd.org/changeset/base/360985

Log:
  MFC r360777: Optimize ENA Rx refill for low memory conditions

  Sometimes, especially when there is not much memory left in the system,
  allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time
  and it is not guaranteed to succeed. In that situation the fallback will
  work, but if the refill needs to take place for a lot of descriptors at
  once, the time spent in m_getjcl looking for memory can cause system
  unresponsiveness due to the high priority of the Rx task. This can also
  lead to a driver reset, because the Tx cleanup routine is blocked and
  the timer service may detect that Tx packets are not being cleaned up.
  The reset routine can in turn cause further unresponsiveness: the Rx
  rings are refilled there, so m_getjcl will again burn the CPU. This was
  causing NVMe driver timeouts and resets, because the network driver has
  higher priority.

  Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are
  enough; the ENA MTU is set to 9K anyway, so it is very unlikely that
  more than 9KB will be needed.

  However, 9KB jumbo clusters can still cause issues, so by default
  page-size mbuf clusters will be used for the Rx descriptors. This can
  have a small (~2%) impact on the throughput of the device, so to
  restore the original behavior, set the sysctl "hw.ena.enable_9k_mbufs"
  to "1" in the /boot/loader.conf file.

  As part of this patch (important fix), the driver version was updated
  to v2.1.2.

  Submitted by:   cperciva
  PR:             225791, 234838, 235856, 236989, 243531

Changes:
_U  stable/12/
  stable/12/sys/dev/ena/ena.c
  stable/12/sys/dev/ena/ena.h