Bug 234838 - ena drop-outs on 12.0-RELEASE
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-virtualization mailing list
Reported: 2019-01-10 23:08 UTC by Leif Pedersen
Modified: 2019-02-07 16:15 UTC

Description Leif Pedersen 2019-01-10 23:08:40 UTC
We're observing problems with the ena driver on AWS in 12.0-RELEASE. We see these kernel messages a few times per day, and not surprisingly, the NIC appears to quit working for a few minutes and then recover. (We see this because our Nagios instance alarms and then recovers.) This particular instance is an r4.large.

ena_com_prepare_tx() [TID:100766]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100766]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100378]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100376]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100844]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100765]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100363]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100523]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: device is going UP
ena0: link is UP
ena_com_prepare_tx() [TID:100401]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100477]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100634]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 0 idx 0xb9ena0: ena0: device is going UP
link is UP
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: device is going UP
ena0: link is UP
ena_com_prepare_tx() [TID:100360]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100587]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100365]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 1 idx 0x2bbena0: device is going UP
ena0: link is UP
ena_com_prepare_tx() [TID:100375]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100892]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 0 idx 0x234ena0: ena0: device is going UP
link is UP
ena_com_prepare_tx() [TID:100365]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena_com_prepare_tx() [TID:100832]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: free uncompleted tx mbuf qid 1 idx 0xf5ena0: ena0: device is going UP
link is UP
ena_com_prepare_tx() [TID:100474]: Not enough space in the tx queue
 
ena0: failed to prepare tx bufs
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: device is going UP
ena0: link is UP
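
The log above mixes two different reset causes ("Keep alive watchdog timeout" and "lost tx completion is above the threshold"). A minimal Python sketch for tallying them from a saved copy of the kernel messages might look like the following. The message fragments are copied from the log above; the label names, the `summarize` helper, and the `SAMPLE` text are illustrative assumptions, not part of the ena driver or any FreeBSD tool.

```python
import re
from collections import Counter

# Message fragments taken from the log in this report.
# The dictionary keys are labels chosen for this sketch only.
PATTERNS = {
    "tx_queue_full": re.compile(r"Not enough space in the tx queue"),
    "keepalive_timeout": re.compile(r"Keep alive watchdog timeout"),
    "lost_tx_completion": re.compile(r"lost tx completion is above the threshold"),
    "reset_triggered": re.compile(r"Trigger reset is on"),
}


def summarize(log_text):
    """Count how many lines contain each known ena message fragment."""
    counts = Counter()
    for line in log_text.splitlines():
        for label, pattern in PATTERNS.items():
            # search() rather than match(): some lines in the log are run
            # together (a message missing its newline), so the fragment may
            # not start at the beginning of the line.
            if pattern.search(line):
                counts[label] += 1
    return counts


# A short excerpt in the same shape as the log above (hypothetical sample).
SAMPLE = """\
ena_com_prepare_tx() [TID:100523]: Not enough space in the tx queue
ena0: failed to prepare tx bufs
ena0: Keep alive watchdog timeout.
ena0: Trigger reset is on
ena0: device is going DOWN
ena0: device is going UP
ena0: link is UP
ena0: The number of lost tx completion is above the threshold (129 > 128). Reset the device
ena0: Trigger reset is on
"""

if __name__ == "__main__":
    for label, n in summarize(SAMPLE).most_common():
        print(f"{label}: {n}")
```

Comparing the "reset_triggered" count against the two cause counts shows whether every reset is accounted for by one of the two known triggers.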
Comment 1 Colin Percival freebsd_committer 2019-01-10 23:54:54 UTC
Thanks for reporting this.  Can you tell me anything about the workload this instance is seeing?  Amount of bandwidth, TCP vs UDP, large packets vs small packets, anything else which seems like it might be relevant?
Comment 2 Leif Pedersen 2019-01-11 04:20:22 UTC
Of course! Thanks for looking!

It's our standby MySQL database, so there's a light but steady stream of network IO for the DB replication. Its filesystems are ZFS. Once an hour, six other machines (developer sandboxes) pull the latest ZFS snapshot of the DB (in parallel) so our developers can clone any recent hourly snapshot for testing. These messages happen about three times per day at the top of the hour when this hourly pull would be running. They happen sometimes at other odd times: once on Jan 4, and once on Dec 19, looking at logs going back to Dec 12. So it probably isn't just the 6 concurrent `zfs send`s that cause it.

Also at the top of the hour, it runs a mysqldump cron which takes ~10 minutes. That has nothing to do with the network; just full disclosure that it increases CPU load significantly.

There are batch jobs that send a lot of transactions through the DB replication stream, but they don't seem to correlate. I think those cause only minimal network IO but high CPU & disk load.

The machine doesn't do anything else.

Hopefully that gives a helpful idea of the workload. It's the only machine I've upgraded to 12. Our other machines are running 11.1. This one is my canary; I dare not upgrade the more important machines with this symptom.

Also, occasionally it will panic with the message "Fatal double fault" and no backtrace. I can get you more on that if you want. Sometimes this panic happens when the machine is booting, before it has even configured the network. Related? I dunno. Just giving you anything that might be useful.
Comment 3 Leif Pedersen 2019-01-11 18:13:02 UTC
Just FYI, I updated to 12.0-RELEASE-p2, whose announcement mentioned a networking fix. Overnight, the problem appeared twice more, so apparently that didn't fix it.

Jan 11 02:03:21 db1 kernel: ena_com_prepare_tx() [TID:100482]: Not enough space in the tx queue
Jan 11 02:03:21 db1 kernel:  
Jan 11 02:03:21 db1 kernel: ena0: failed to prepare tx bufs
Jan 11 03:03:54 db1 kernel: ena_com_prepare_tx() [TID:100385]: Not enough space in the tx queue
Jan 11 03:03:54 db1 kernel:  
Jan 11 03:03:54 db1 kernel: ena0: failed to prepare tx bufs
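
To check whether these messages really cluster at the top of the hour, one option is to bucket the timestamped syslog lines by minute past the hour. The Python sketch below assumes the default syslog line format shown above ("Mon DD HH:MM:SS host kernel: message"); the `minutes_past_hour` helper and the regular expression are illustrative assumptions, not an existing tool.

```python
import re
from collections import Counter

# Assumed layout of a kernel line in /var/log/messages, as in the excerpt above.
SYSLOG_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2}) "
    r"(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2}) "
    r"(?P<host>\S+) kernel: (?P<msg>.*)$"
)


def minutes_past_hour(lines, needle="Not enough space in the tx queue"):
    """Histogram of the minute-of-hour at which `needle` was logged."""
    hist = Counter()
    for line in lines:
        m = SYSLOG_RE.match(line)
        if m and needle in m.group("msg"):
            hist[int(m.group("minute"))] += 1
    return hist


# The two occurrences quoted in this comment.
SAMPLE_LINES = [
    "Jan 11 02:03:21 db1 kernel: ena_com_prepare_tx() [TID:100482]: Not enough space in the tx queue",
    "Jan 11 02:03:21 db1 kernel: ena0: failed to prepare tx bufs",
    "Jan 11 03:03:54 db1 kernel: ena_com_prepare_tx() [TID:100385]: Not enough space in the tx queue",
    "Jan 11 03:03:54 db1 kernel: ena0: failed to prepare tx bufs",
]

if __name__ == "__main__":
    for minute, n in sorted(minutes_past_hour(SAMPLE_LINES).items()):
        print(f"minute {minute:02d}: {n}")
```

Run over a longer log, a histogram concentrated in the first few minutes of the hour would support the link to the hourly snapshot pull, while scattered minutes would point elsewhere.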
Comment 4 Colin Percival freebsd_committer 2019-01-16 02:04:14 UTC
(In reply to Leif Pedersen from comment #3)

Can you build a kernel with the patch from r343071?  Apparently the 'failed to prepare tx bufs' situation is harmless (and the message should only be printed when debugging is turned on) but it's possible that the mere act of logging the warning is causing timeouts -- so it would be good to know if you see any sign of the device resets after applying this patch.