Bug 234838 - ena drop-outs on 12.0-RELEASE
Summary: ena drop-outs on 12.0-RELEASE
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.0-RELEASE
Hardware: amd64 Any
: --- Affects Only Me
Assignee: freebsd-virtualization mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-10 23:08 UTC by Leif Pedersen
Modified: 2019-11-26 20:25 UTC (History)
7 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Leif Pedersen 2019-01-10 23:08:40 UTC
We're observing problems with the ena driver on AWS in 12.0-RELEASE. We see these kernel messages a few times per day, and not surprisingly, the NIC appears to quit working for a few minutes and then recover. (We see this because our Nagios instance alarms and then recovers.) This particular instance is an r4.large.

ena_com_prepare_tx() [TID:100766]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100766]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100378]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100376]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100844]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100765]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100363]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100523]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: device is going UP
ena0: link is UP
ena_com_prepare_tx() [TID:100401]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100477]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100634]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 0 idx 0xb9ena0: ena0: device is going UP
link is UP
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: device is going UP
ena0: link is UP
ena_com_prepare_tx() [TID:100360]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100587]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100365]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 1 idx 0x2bbena0: device is going UP
ena0: link is UP
ena_com_prepare_tx() [TID:100375]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100892]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 0 idx 0x234ena0: ena0: device is going UP
link is UP
ena_com_prepare_tx() [TID:100365]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100832]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 1 idx 0xf5ena0: ena0: device is going UP
link is UP
ena_com_prepare_tx() [TID:100474]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: device is going UP
ena0: link is UP
Comment 1 Colin Percival freebsd_committer 2019-01-10 23:54:54 UTC
Thanks for reporting this.  Can you tell me anything about the workload this instance is seeing?  Amount of bandwidth, TCP vs UDP, large packets vs small packets, anything else which seems like it might be relevant?
Comment 2 Leif Pedersen 2019-01-11 04:20:22 UTC
Of course! Thanks for looking!

It's our standby MySQL database, so there's a light but steady stream of network IO for the DB replication. Its filesystems are ZFS. Once an hour, six other machines (developer sandboxes) pull the latest ZFS snapshot of the DB (in parallel) so our developers can clone any recent hourly snapshot for testing. These messages happen about three times per day at the top of the hour when this hourly pull would be running. They happen sometimes at other odd times: once on Jan 4, and once on Dec 19, looking at logs going back to Dec 12. So it probably isn't just the 6 concurrent `zfs send`s that cause it.

Also at the top of the hour, it runs a mysqldump cron which takes ~10 minutes. That has nothing to do with the network; just full disclosure that it increases CPU load significantly.

There are batch jobs that send a lot of transactions through the DB replication stream, but they don't seem to correlate. I think those cause only minimal network IO but high CPU & disk load.

The machine doesn't do anything else.

Hopefully that's a helpful idea of the work load. It's the only machine I've upgraded to 12. Our other machines are running 11.1. This one is my canary; I dare not upgrade the more important machines with this symptom.

Also, occasionally it will panic with the message "Fatal double fault" and no backtrace. I can get you more on that if you want. Sometimes this panic happens when the machine is booting, before it has even configured the network. Related? I dunno. Just giving you anything that might be useful.
Comment 3 Leif Pedersen 2019-01-11 18:13:02 UTC
Just fyi, I updated to 12.0-RELEASE-p2, which I saw had announced a fix for networking. Over night, the problem appeared again twice, so apparently that didn't fix it.

Jan 11 02:03:21 db1 kernel: ena_com_prepare_tx() [TID:100482]: Not enough space in the tx queue
Jan 11 02:03:21 db1 kernel:  
Jan 11 02:03:21 db1 kernel: ena0: failed to prepare tx bufs
Jan 11 03:03:54 db1 kernel: ena_com_prepare_tx() [TID:100385]: Not enough space in the tx queue
Jan 11 03:03:54 db1 kernel:  
Jan 11 03:03:54 db1 kernel: ena0: failed to prepare tx bufs
Comment 4 Colin Percival freebsd_committer 2019-01-16 02:04:14 UTC
(In reply to Leif Pedersen from comment #3)

Can you build a kernel with the patch from r343071?  Apparently the 'failed to prepare tx bufs' situation is harmless (and the message should only be printed when debugging is turned on) but it's possible that the mere act of logging the warning is causing timeouts -- so it would be good to know if you see any sign of the device resets after applying this patch.
Comment 5 Colin Percival freebsd_committer 2019-08-30 00:43:56 UTC
Closing this as "feedback timeout".  Feel free to reopen if you see this with a newer kernel.
Comment 6 Ryan Langseth 2019-10-22 18:41:46 UTC
It seems like there is still an issue with this. We are running FreeBSD 12.0-RELEASE-p10 on a c5.2xl instance. And have had the system reset the network device twice in the last 24 hours.

The current traffic to it is a zfs recv over ssh running at ~40MiB/s. The first time it dropped out it was just the ena device. The second time I also got nvme 'Missing Interrupts' as well. The system has 6 gp2 volumes for the zpool.

`grep kern.crit /var/log/messages`
Oct 21 12:26:52 <kern.crit> apache-00 kernel: Trying to mount root from ufs:/dev/gpt/rootfs [rw]...
Oct 21 12:26:52 <kern.crit> apache-00 kernel: ena0: device is going UP
Oct 21 12:26:52 <kern.crit> apache-00 kernel: ena0: device is going DOWN
Oct 21 12:26:52 <kern.crit> apache-00 kernel: ena0: device is going UP
Oct 21 12:26:52 <kern.crit> apache-00 kernel: intsmb0: <Intel PIIX4 SMBUS Interface> port 0xb100-0xb10f at device 1.3 on pci0
Oct 21 12:26:52 <kern.crit> apache-00 kernel: intsmb0: intr IRQ 9 enabled revision 255
Oct 21 12:26:52 <kern.crit> apache-00 kernel: smbus0: <System Management Bus> on intsmb0
Oct 21 12:26:53 <kern.crit> apache-00 kernel: Security policy loaded: MAC/ntpd (mac_ntpd)

Oct 22 06:33:55 <kern.crit> apache-00 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Oct 22 06:33:55 <kern.crit> apache-00 kernel: ena0: Trigger reset is on
Oct 22 06:33:55 <kern.crit> apache-00 kernel: ena0: device is going DOWN
Oct 22 06:34:02 <kern.crit> apache-00 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x1f2
Oct 22 06:34:03 <kern.crit> apache-00 kernel: ena0: ena0: device is going UP
Oct 22 06:34:03 <kern.crit> apache-00 kernel: link is UP

Oct 22 13:18:10 <kern.crit> apache-00 kernel: ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
Oct 22 13:18:10 <kern.crit> apache-00 kernel: ena0: Trigger reset is on
Oct 22 13:18:10 <kern.crit> apache-00 kernel: ena0: device is going DOWN
Oct 22 13:18:16 <kern.crit> apache-00 kernel: ena0: free uncompleted tx mbuf qid 4 idx 0x3a6
Oct 22 13:18:16 <kern.crit> apache-00 kernel: 
Oct 22 13:18:16 <kern.crit> apache-00 kernel: ena0: device is going UP
Oct 22 13:18:16 <kern.crit> apache-00 kernel: ena0: link is UP
Oct 22 13:18:47 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:18:51 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:18:51 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:19:17 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme6: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:20:47 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme4: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme5: nvme2: nvme4: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 syslogd: last message repeated 1 times
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:22:17 <kern.crit> apache-00 kernel: nvme0: 
Oct 22 13:22:17 <kern.crit> apache-00 kernel: 
Oct 22 13:22:17 <kern.crit> apache-00 kernel: Missing interrupt
Oct 22 13:22:21 <kern.crit> apache-00 kernel: nvme2: nvme6: Missing interrupt
Oct 22 13:22:21 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:22:21 <kern.crit> apache-00 kernel: Missing interrupt
Oct 22 13:22:21 <kern.crit> apache-00 kernel: nvme4: Missing interrupt
Oct 22 13:22:51 <kern.crit> apache-00 kernel: nvme6: nvme4: Missing interrupt
Oct 22 13:22:51 <kern.crit> apache-00 kernel: Missing interrupt
Oct 22 13:22:51 <kern.crit> apache-00 kernel: nvme5: Missing interrupt
Oct 22 13:22:51 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:23:16 <kern.crit> apache-00 kernel: nvme0: Missing interrupt
Oct 22 13:23:21 <kern.crit> apache-00 kernel: nvme2: Missing interrupt
Oct 22 13:23:21 <kern.crit> apache-00 kernel: nvme6: Missing interrupt
Oct 22 13:23:26 <kern.crit> apache-00 kernel: nvme4: Missing interrupt

I will add that this instance was originally a FreeBSD 11.x system that was freebsd-update'd to 12. As a 11 system it was panicing on the transfer every 3-4 hours. I am bringing up a fresh 12.x system to do additional testing.
Comment 7 Colin Percival freebsd_committer 2019-10-23 21:26:36 UTC
Please let me know if you see this on 12.1.  There's at least two relevant bug fixes which are in stable/12 but were not in 12.0-RELEASE.
Comment 8 Ryan Langseth 2019-11-07 16:10:22 UTC
I started a zfs send/recv via ssh between two instances upgraded to 12.1-RELEASE. The transfer did not stop but I did get this on the sending system once in 8TB of data transferred:

ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 2 idx 0x199
stray irq272
stray irq265
ena0: ena0: device is going UP
link is UP
Comment 9 Colin Percival freebsd_committer 2019-11-07 21:58:38 UTC
I'm inclined to say that if the worst we see is a single driver reset after 8 TB of traffic, we're probably doing ok.  Just to confirm, based on your experience with 12.0 you would have seen far worse after sending that much traffic, right?
Comment 10 Dmitri Goutnik freebsd_committer 2019-11-26 20:22:20 UTC
This is still a serious issue on 12.1-RELEASE, I see it semi-regularly (approx. once a week on average) on a busy m5.xlarge instance running PostgreSQL. The ena reset itself might be relatively harmless but sometimes it causes a loss of connection to EBS volumes which is disastrous on the database server. Currently, this makes any instance type that has ena interface unusable for high-traffic applications.
Comment 11 Dmitri Goutnik freebsd_committer 2019-11-26 20:25:17 UTC
Reopening as it is still an issue on 12.1.