Summary: | [epair] epair interface stops working when it reaches the hardware queue limit | |
---|---|---|---
Product: | Base System | Reporter: | Reshad Patuck <reshadpatuck1>
Component: | kern | Assignee: | Kristof Provost <kp>
Status: | Open --- | |
Severity: | Affects Only Me | CC: | bz, grembo, hausen, kristof, markj, matthias+freebsd+bugzilla
Priority: | --- | Keywords: | patch
Version: | CURRENT | |
Hardware: | amd64 | |
OS: | Any | |
Description
Reshad Patuck
2018-03-30 04:21:56 UTC
Created attachment 191965
add SDT in epair
Kristof's patch to apply dtrace SDT probes to if_epair
Created attachment 191966
source of patched if_epair
Created attachment 191967
dtrace script to get enqueued error
There have been reports of this all over the last years. I hit the problem myself again today (after looking at it multiple times before). I *think* I have a good feeling for multiple things which might be wrong. Stay tuned for patches to try soon.

I just posted https://reviews.freebsd.org/D24033

The problems so far identified in the no longer compiled code:
- queuing the drain for the wrong cpuid in a multi-thread netisr setup
- not always (re-)adding the ifp for drain when needed
- not checking the epair_start_locked() results (the code currently has no way to do so) to properly decide whether we successfully queued another packet. If we did not, we have to free the packets, as the netisr callback will only happen if there is another active interface on the CPU which will netisr_queue() a packet successfully; otherwise we'll be stuck.

I have a local work in progress to mitigate these shortcomings, but I am still seeing callbacks not always happening, so there's at least one more case I haven't caught.

I am not sure if it makes sense to keep fixing this properly. If there are people with a lot of queue drops (netstat -Q) or OErrors on the epair interfaces then it might make sense. The alternative is to spend a few days and rework epair for the 2020s...

It's been ages, and I had a bit of time when I couldn't concentrate much, so I started to clean up old work spaces... Please check this review (there is also a link there for a full drop-in file replacement). As you can see in the comments of the review, I had started this last year for a quick test, and it could be improved. Maybe one of you has time or interest in this. It seems to be holding up though. https://reviews.freebsd.org/D31077

Hand over to @kp as he keeps working forward based on my initial work and now his own.

Some changes have since gone into the tree; it would be good to record the commit hashes here and then see if this can be closed?

(In reply to Bjoern A. Zeeb from comment #7)
That'd be https://cgit.freebsd.org/src/commit/?id=3dd5760aa5f8 (main), https://cgit.freebsd.org/src/commit/?id=f4aba8c9f0cb (stable/13), and https://cgit.freebsd.org/src/commit/?id=7c2b681b33fc (stable/12). I'd expect this problem to be fixed with those.

After updating from 12.2 to 12.3 I'm hit by this problem regularly (every 1-2 days). I use
FreeBSD serx05.xenet.de 12.3-STABLE FreeBSD 12.3-STABLE stable/12-n234916-1be600552e5 XENET amd64
and use epair for connecting VNET jails. tcpdump on both epair ends shows traffic flowing from the host to the jail's epair, but no traffic from the jail to the host. I can see traffic going out on the VNET jail's epair, but it does not show up on the host's epair.

Have you tried a version with the listed changes from comment #8? 13.1-RC?

(In reply to Kristof Provost from comment #10)
I'm on stable/12 src from Mar 18.
# sysctl net.link.epair.netisr_maxqlen
sysctl: unknown oid 'net.link.epair.netisr_maxqlen'
So I think I am using a version with the changes.

More info:
- epairXa is on the host, member of a bridge.
- epairXb is in the jail.
- The setup works for some days (~3), then stops.
- tcpdump on epairXa and epairXb shows: traffic epairXa -> epairXb gets through; traffic epairXb -> epairXa does NOT get through.
- Inside the jail, tcpdump on epairXb shows traffic going out; systat -ifstat shows 0 traffic going out.
- ifconfig down/up on both epairs does not fix the issue.
- Only 1 of 7 jails is hit by this bug (random?). It runs a nameserver (bind) and is the only jail with high UDP traffic.

(In reply to matthias+freebsd+bugzilla from comment #12)
Test on 13.1-RC2, or stable/13 post f6138d93b5115ff560b24200d1ea002cdc46bb64, or main post 0bf7acd6b7047537a38e2de391a461e4e8956630. It'll be fixed there.
The fix will not land in 12, for boring reasons (see https://cgit.freebsd.org/src/commit/sys/net/if_epair.c?h=stable/12&id=56dc95b249dceb30367a77dccd0231cbb08dc1f7). I do not intend to work on this for stable/12. Upgrade to 13.

(In reply to Kristof Provost from comment #13)
I am fine with that. I think I am not hit by the original bug but by some fallout of the fix on stable/12. Now that this gets reverted in stable/12, I expect it to work as before. I'm building an up-to-date stable/12 at the moment, will test it for a few days, and will report back.

(In reply to Kristof Provost from comment #13)
(selecting random comment)
Seems like I've been affected by this today (13.1-RELEASE-p11).
Setup was: physical interface -> vlan interface -> bridge -> epair0a/b <- vnet jail
Symptoms were:
- No traffic flowing between host and jail
- Logically, no traffic flowing between any other machines
- Out of buffer space reported inside of the jail
- Restarting the jail didn't help
- down/up devices didn't help
- Stopping the jail and kldunloading if_epair *did* help
This happened during a network-heavy build. It might have been amplified by a pf antispoof rule that might have blocked legitimate traffic (did not verify). I didn't dig deeper and didn't try to apply any patches yet, just leaving it here as a datapoint (and adding myself to CC).

(In reply to Michael Gmelin from comment #15)
I'm going to assume you mean 13.0-p11, because otherwise I want to borrow your time machine. Early afternoon yesterday please.
In which case the advice remains: test on 13.1. Once it's released. Don't mess with the timeline.

(In reply to Kristof Provost from comment #16)
> I'm going to assume you mean 13.0-p11, because otherwise
> I want to borrow your time machine. Early afternoon yesterday
> please.
Of course ^_^ Always ahead of the curve... (and craving an "edit" button)
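For anyone trying to follow the analysis earlier in this thread, here is a minimal user-space sketch of the pattern it describes: a bounded queue whose transmit path checks the enqueue result, frees the packet on failure, and always re-arms the drain. This is not the if_epair code and not the fix from the commits listed above; every name in it (toy_queue, toy_enqueue, toy_transmit, toy_drain, TOY_QLEN) is hypothetical and only illustrates the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_QLEN 4 /* hypothetical queue limit, not a real epair value */

struct toy_pkt { int id; };

struct toy_queue {
	struct toy_pkt *slots[TOY_QLEN]; /* ring buffer of pending packets */
	size_t head, count;
	bool drain_scheduled;            /* stands in for "drain callback queued" */
	unsigned long drops;             /* packets freed because the queue was full */
};

/* Returns 0 on success, -1 if the queue is full (caller keeps ownership). */
static int
toy_enqueue(struct toy_queue *q, struct toy_pkt *p)
{
	if (q->count == TOY_QLEN)
		return (-1);
	q->slots[(q->head + q->count) % TOY_QLEN] = p;
	q->count++;
	return (0);
}

/*
 * Transmit path: the enqueue result is checked. On failure the packet is
 * freed and counted as a drop instead of being left behind. The drain is
 * (re-)armed in both cases, so a full queue can always empty itself.
 */
static void
toy_transmit(struct toy_queue *q, int id)
{
	struct toy_pkt *p = malloc(sizeof(*p));

	if (p == NULL) {
		q->drops++;
		return;
	}
	p->id = id;
	if (toy_enqueue(q, p) != 0) {
		free(p);
		q->drops++;
	}
	q->drain_scheduled = true;
}

/* Drain callback: delivers (here: frees) everything queued. */
static size_t
toy_drain(struct toy_queue *q)
{
	size_t delivered = 0;

	assert(q->count <= TOY_QLEN);
	while (q->count > 0) {
		free(q->slots[q->head]);
		q->head = (q->head + 1) % TOY_QLEN;
		q->count--;
		delivered++;
	}
	q->drain_scheduled = false;
	return (delivered);
}
```

In this toy model, sending six packets into an empty queue leaves four queued and two counted as drops, and the interface keeps working after the next drain. If toy_transmit() skipped setting drain_scheduled when the queue was full, nothing would ever empty the queue again, which is the "stuck" state reported here.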