Bug 211962 - bxe driver queue soft hangs and flooding tx_soft_errors
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.3-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: David C Somayajulu
URL:
Keywords: needs-qa, patch
Depends on:
Blocks:
 
Reported: 2016-08-18 11:25 UTC by aler
Modified: 2018-05-02 18:50 UTC

See Also:
koobs: mfc-stable11?
koobs: mfc-stable10?
koobs: mfc-stable9?


Attachments
patch to fix bug (1.41 KB, patch)
2016-08-18 11:26 UTC, aler

Description aler 2016-08-18 11:25:26 UTC
Hardware (from dmesg): QLogic NetXtreme II BCM57711E 10GbE (A0) BXE v:1.78.79 (though this does not seem to matter here)

Sometimes (at random) the bxe driver drops its sending speed to very low values, and the dev.bxe.#.queue.#.tx_soft_errors counter of one or more queues increases rapidly. The only way to stop this and return to normal operation is to restart the interface (ifconfig down + ifconfig up).

It seems I have found the bug that causes this.
In bxe_tx_mq_start_locked(), when there are both pending packets and a new packet, the function first tries to enqueue the new packet and exits on failure. Only after a successful enqueue does it dequeue a pending packet and try to handle it.
Sometimes the tx queue overflows by accident. That in itself is harmless; the problem is what happens next: bxe_tx_mq_start_locked() gets called again and again with new packets that the TCP layer is trying to send. Each time it is called, it tries to enqueue the new packet, fails, and exits, so it never gets a chance to handle the pending packets and reduce the queue length.

I'm new to the bxe driver source code and may be wrong, but the patch seems to have fixed the problem.
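
To make the failure mode described above concrete, here is a small userspace model in C. It is only a sketch under stated assumptions, not the attached patch and not the driver's code: toy_txq, RING_SLOTS, hw_free and both start_* functions are hypothetical names, the ring stands in for the driver's software queue, and the soft_errors field merely imitates the tx_soft_errors counter.

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 4            /* hypothetical capacity of the software ring */

/* Toy stand-in for one tx queue; every field here is an assumption. */
struct toy_txq {
    int ring[RING_SLOTS];       /* queued "packets" */
    size_t head, count;         /* index of oldest packet, packets queued */
    size_t hw_free;             /* free hardware descriptors */
    unsigned long soft_errors;  /* imitates tx_soft_errors */
    unsigned long sent;         /* packets handed to the hardware */
};

static bool ring_enqueue(struct toy_txq *q, int pkt)
{
    if (q->count == RING_SLOTS)
        return false;           /* ring is full */
    q->ring[(q->head + q->count) % RING_SLOTS] = pkt;
    q->count++;
    return true;
}

/* Move pending packets to the hardware while descriptors are available. */
static void ring_drain(struct toy_txq *q)
{
    while (q->count > 0 && q->hw_free > 0) {
        q->head = (q->head + 1) % RING_SLOTS;
        q->count--;
        q->hw_free--;
        q->sent++;
    }
}

/* Order described in the report: enqueue first, exit on failure. */
static bool start_enqueue_first(struct toy_txq *q, int pkt)
{
    if (!ring_enqueue(q, pkt)) {
        q->soft_errors++;
        return false;           /* pending packets are never drained */
    }
    ring_drain(q);
    return true;
}

/* One possible reordering: drain what is pending, then enqueue. */
static bool start_drain_first(struct toy_txq *q, int pkt)
{
    ring_drain(q);
    if (!ring_enqueue(q, pkt)) {
        q->soft_errors++;
        return false;
    }
    ring_drain(q);
    return true;
}

/* Fill the ring while the hardware is busy, free 8 descriptors, send 3 more. */
unsigned long demo_soft_errors(bool drain_first)
{
    struct toy_txq q = {0};
    for (int i = 0; i < RING_SLOTS; i++)
        ring_enqueue(&q, i);
    q.hw_free = 8;              /* hardware has caught up in the meantime */
    for (int i = 0; i < 3; i++) {
        if (drain_first)
            start_drain_first(&q, 100 + i);
        else
            start_enqueue_first(&q, 100 + i);
    }
    return q.soft_errors;
}
```

In this model, demo_soft_errors(false) counts an error on every call even though the hardware has room again, while demo_soft_errors(true) recovers on the first call. Whether the real driver behaves exactly like this depends on details the model leaves out.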
Comment 1 aler 2016-08-18 11:26:03 UTC
Created attachment 173818
patch to fix bug
Comment 2 Matt Joras 2016-08-18 17:54:14 UTC
So this is a symptom of a problem we have been working to address at Isilon, and I've done some pretty significant refactors of the tx path to avoid problems like this.

As you touched on, the fundamental problem here is that nothing guarantees the continual draining of the drbr, so the driver can get stuck in various states of not transmitting as fast as it can. I addressed this mostly by adding a deferred tx task, which guarantees that as long as there are packets sitting on the drbr there will be a tx task scheduled to drain them.
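
The invariant described above (packets left on the ring imply a scheduled drain task) can be sketched in a few lines of userspace C. This is a toy model under loud assumptions, not the committed change: toy_fp and all toy_* functions are hypothetical names, and the task_scheduled flag stands in for whatever kernel deferral mechanism the real driver uses.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy per-queue state; all names here are assumptions for illustration. */
struct toy_fp {
    size_t pending;         /* packets waiting on the software ring */
    size_t hw_free;         /* free hardware descriptors */
    bool task_scheduled;    /* a deferred drain task is queued */
    unsigned long sent;     /* packets handed to the hardware */
};

/* Drain as much as possible, then restore the invariant:
 * leftover packets always leave a drain task scheduled. */
static void toy_drain(struct toy_fp *fp)
{
    while (fp->pending > 0 && fp->hw_free > 0) {
        fp->pending--;
        fp->hw_free--;
        fp->sent++;
    }
    fp->task_scheduled = (fp->pending > 0);
}

/* Transmit entry point: queue the packet and try to drain right away. */
static void toy_transmit(struct toy_fp *fp)
{
    fp->pending++;
    toy_drain(fp);
}

/* Hardware completion: n descriptors become free again. */
static void toy_tx_complete(struct toy_fp *fp, size_t n)
{
    fp->hw_free += n;
}

/* What the deferred task does when it eventually runs. */
static void toy_run_deferred(struct toy_fp *fp)
{
    if (!fp->task_scheduled)
        return;
    fp->task_scheduled = false;
    toy_drain(fp);          /* reschedules itself if packets remain */
}

/* Three packets, one descriptor; the rest drain with no new transmit call. */
unsigned long demo_deferred_sent(void)
{
    struct toy_fp fp = {0};
    fp.hw_free = 1;
    toy_transmit(&fp);      /* sent immediately */
    toy_transmit(&fp);      /* stays pending, task scheduled */
    toy_transmit(&fp);      /* stays pending, task still scheduled */
    toy_tx_complete(&fp, 4);
    toy_run_deferred(&fp);  /* drains the two leftover packets */
    return fp.sent;
}
```

The point of the model is that the two leftover packets are sent without any further call from the upper layer, so progress no longer depends on new traffic arriving. A real implementation would also have to avoid rescheduling in a tight loop while no descriptors are free, which this sketch ignores.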

We've tested these changes internally and they have been submitted to Qlogic for regression testing. They have stated they plan to commit them once this regression testing is done (optimistically around next week).
Comment 3 Kubilay Kocak freebsd_committer freebsd_triage 2016-08-20 09:50:11 UTC
Thank you, Matt. Can you make those committers aware of this issue, if you haven't already, so they can assign it to themselves and coordinate it through MFCs and release engineering (if it applies to 11.0-RELEASE)?
Comment 4 Matt Joras 2016-08-25 16:46:05 UTC
(In reply to Kubilay Kocak from comment #3)
Sure, I'll make sure they (davidcs) are aware of this issue.

Unfortunately, there is no update on when they will be done with their QA.
Comment 5 Matt Joras 2016-10-19 16:25:55 UTC
The changeset I authored for Isilon has been committed by davidcs. Hopefully this will alleviate the issues in the PR.

https://svnweb.freebsd.org/base?view=revision&revision=307578
Comment 6 Eugene Grosbein freebsd_committer freebsd_triage 2018-04-30 22:30:07 UTC
Dear submitter, is this issue still relevant?
Comment 7 David C Somayajulu freebsd_committer freebsd_triage 2018-05-02 18:50:43 UTC
This bug has been confirmed as fixed.