Bug 227100 - [epair] epair interface stops working when it reaches the hardware queue limit
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Bjoern A. Zeeb
Keywords: patch
Depends on:
Reported: 2018-03-30 04:21 UTC by Reshad Patuck
Modified: 2020-07-22 16:50 UTC
CC List: 4 users

the output of netstat and dtrace (1.83 KB, text/plain)
2018-03-30 04:21 UTC, Reshad Patuck
add SDT in epair (3.62 KB, patch)
2018-03-30 04:25 UTC, Reshad Patuck
source of patched if_epair (28.86 KB, text/plain)
2018-03-30 04:26 UTC, Reshad Patuck
dtrace script to get enqueued error (182 bytes, text/plain)
2018-03-30 04:27 UTC, Reshad Patuck

Description Reshad Patuck 2018-03-30 04:21:56 UTC
Created attachment 191964
the output of netstat and dtrace

When the epair interface reaches the hardware queue limit, epairs stop transferring data.

This bug refers to the following mailing list conversation: https://lists.freebsd.org/pipermail/freebsd-net/2018-March/050077.html

So far, using the patch and the if_epair source file attached to this bug, we can tell that the error occurs in this block of code:

	if ((epair_dpcpu->epair_drv_flags & IFF_DRV_OACTIVE) != 0) {
		/*
		 * Our hardware queue is full, try to fall back
		 * queuing to the ifq but do not call ifp->if_start.
		 * Either we are lucky or the packet is gone.
		 */
		IFQ_ENQUEUE(&ifp->if_snd, m, error);
		if (!error)
			(void)epair_add_ifp_for_draining(ifp);
		SDT_PROBE3(if_epair, transmit, epair_transmit_locked, enqueued,
				ifp, m, error);
		return (error);
	}

Here the value of 'error' is 55, which is ENOBUFS on FreeBSD.
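
To make the failure mode easier to reason about, here is a minimal user-space sketch of the fallback path, not the kernel code itself. All names (model_ifq, model_epair, model_transmit, model_send_while_stuck) are hypothetical and exist only for illustration, and the queue limit is a plain parameter that is not claimed to correspond to any particular sysctl. It assumes that once the "hardware queue full" flag is set and nothing clears it, every packet goes to the software queue, and that the enqueue returns ENOBUFS when that queue is full.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in if_epair.c. */
struct model_ifq {
	size_t len;	/* packets currently queued */
	size_t maxlen;	/* queue limit */
	size_t drops;	/* packets rejected because the queue was full */
};

struct model_epair {
	bool oactive;		/* stands in for IFF_DRV_OACTIVE */
	struct model_ifq snd;	/* stands in for ifp->if_snd */
	size_t delivered;	/* packets handed straight to the peer */
};

/* Same result convention as the report: 0 on success, ENOBUFS when full. */
static int
model_ifq_enqueue(struct model_ifq *q)
{
	if (q->len >= q->maxlen) {
		q->drops++;
		return (ENOBUFS);
	}
	q->len++;
	return (0);
}

/* Transmit one packet through the model. */
static int
model_transmit(struct model_epair *sc)
{
	if (sc->oactive) {
		/* "Hardware" queue is full: fall back to the software queue. */
		return (model_ifq_enqueue(&sc->snd));
	}
	sc->delivered++;
	return (0);
}

/* Send n packets while the flag stays set; return how many were rejected. */
static size_t
model_send_while_stuck(size_t maxlen, size_t n)
{
	struct model_epair sc = { .oactive = true, .snd = { .maxlen = maxlen } };
	size_t failed = 0;

	for (size_t i = 0; i < n; i++)
		if (model_transmit(&sc) == ENOBUFS)
			failed++;
	assert(sc.snd.len <= maxlen);
	return (failed);
}
```

In this model, model_send_while_stuck(4, 10) rejects the last six packets. Once the software queue is full and nothing drains it, every further transmit fails, which matches the symptom of the interface no longer passing traffic.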

Setting 'net.link.epair.netisr_maxqlen' to a very small value makes this occur faster.

This issue seems to be happening in the wild only on one of my servers.
Other servers under more load in different environments do not seem to exhibit this behaviour.

@Kristof please chime in if I have missed something out.

- commands.txt
- epair-sdt-diff.patch 
- epair_transmit_locked:enqueued-error-code.d
- if_epair.c
Comment 1 Reshad Patuck 2018-03-30 04:25:28 UTC
Created attachment 191965
add SDT in epair

Kristof's patch adding DTrace SDT probes to if_epair
Comment 2 Reshad Patuck 2018-03-30 04:26:03 UTC
Created attachment 191966
source of patched if_epair
Comment 3 Reshad Patuck 2018-03-30 04:27:12 UTC
Created attachment 191967
dtrace script to get enqueued error
Comment 4 Bjoern A. Zeeb freebsd_committer 2020-03-10 18:52:58 UTC
There have been reports of this from all over during the last few years.

I hit the problem myself again today (after looking at it multiple times before).
I *think* I have a good sense of several things that might be wrong.

Stay tuned for patches to try soon.
Comment 5 Bjoern A. Zeeb freebsd_committer 2020-03-11 21:09:08 UTC
I just posted https://reviews.freebsd.org/D24033

The problems so far identified in the no longer compiled code:
- queuing the drain for the wrong cpuid in a multi-thread netisr setup
- not always (re-)adding the ifp for drain when needed
- not checking the result of epair_start_locked() (the code currently has no way to do so) to properly decide whether we successfully queued another packet.  If we did not, we have to free the packets, because the netisr callback will only happen if another active interface on the same CPU successfully calls netisr_queue() for a packet; otherwise we'll be stuck.
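
The first item in the list above can be illustrated with a toy user-space model. This is only a sketch under stated assumptions, not the driver's logic: the names (model_if, model_cpu, model_add_for_draining, model_drained_cpu, model_stuck_after_drain) are hypothetical, there are two model CPUs, and the drain callback is assumed to run only on CPU 0 because that is where the interface's packets were queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NCPU	2
#define MODEL_MAXIF	4

/* Hypothetical model types; not taken from if_epair.c. */
struct model_if {
	bool oactive;	/* stands in for IFF_DRV_OACTIVE */
};

struct model_cpu {
	struct model_if *drain[MODEL_MAXIF];	/* interfaces awaiting a drain */
	size_t ndrain;
};

/* Register an interface on one CPU's drain list. */
static void
model_add_for_draining(struct model_cpu *cpu, struct model_if *ifp)
{
	assert(cpu->ndrain < MODEL_MAXIF);
	cpu->drain[cpu->ndrain++] = ifp;
}

/* Drain callback: only touches interfaces registered on this CPU's list. */
static void
model_drained_cpu(struct model_cpu *cpu)
{
	for (size_t i = 0; i < cpu->ndrain; i++)
		cpu->drain[i]->oactive = false;
	cpu->ndrain = 0;
}

/*
 * The interface is marked full and registered for draining on
 * registered_cpu.  Only CPU 0 held its packets, so only CPU 0's drain
 * callback ever runs.  Returns true if the interface is still stuck.
 */
static bool
model_stuck_after_drain(int registered_cpu)
{
	struct model_cpu cpus[MODEL_NCPU] = { 0 };
	struct model_if ifp = { .oactive = true };

	assert(registered_cpu >= 0 && registered_cpu < MODEL_NCPU);
	model_add_for_draining(&cpus[registered_cpu], &ifp);
	model_drained_cpu(&cpus[0]);
	return (ifp.oactive);
}
```

Registering on CPU 0 lets the callback clear the flag, while registering on CPU 1 leaves the interface marked full with no callback left to clear it. The second and third items describe other ways of reaching the same state that this model does not cover.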

I have a local work in progress to mitigate these shortcomings but I am still seeing callbacks not always happening so there's yet at least one more case I haven't caught.
I am not sure if it will make sense to keep fixing this properly.

If there are people with a lot of queue drops (netstat -Q) or OErrors on the epair interfaces then it might make sense.

The alternative is to spend a few days and rework epair for the 2020s...