Bug 272924 - cxgbei drops connections during write, with 16+ sessions
Summary: cxgbei drops connections during write, with 16+ sessions
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-08-03 21:24 UTC by Alan Somers
Modified: 2023-08-18 18:08 UTC
CC List: 1 user

See Also:


Description Alan Somers 2023-08-03 21:24:46 UTC
On both FreeBSD 13.2 and 14.0 I can use cxgbei offload for my iSCSI sessions.  There is no trouble with reads.  Writes work too, with up to 9 iSCSI sessions.  But somewhere around 14 simultaneous sessions, overall bandwidth becomes terrible (10s to 100s of MB/s) and the server's log is spammed with messages like this:

WARNING: 172.32.10.78 (iqn.1994-09.org.freebsd:MYINITIATOR.MYDOMAIN.net): no ping reply (NOP-Out) after 5 seconds; dropping connection

and this:

2023-08-03T20:32:53.730407+00:00 MYSERVER.MYDOMAIN.net ctld[6453] 172.32.10.79 (iqn.1994-09.org.freebsd:MYINITIATOR.MYDOMAIN.net): error returned from CTL iSCSI limits request: cfiscsi_ioctl_limits: icl_limits failed with error 6; dropping connection


Meanwhile, the initiator's log is spammed with messages like these:

2023-08-03T20:33:57.756484+00:00 MYINITIATOR kernel: WARNING: 172.33.10.58 (iqn.2018-10.net.MYDOMAIN.MYSERVER:zd17): login timed out after 61 seconds; reconnecting
2023-08-03T20:33:57.756486+00:00 slc-rb19b-ss kernel: WARNING: MYSERVER.MYDOMAIN.net (iqn.2018-10.net.MYDOMAIN.MYSERVER:zd17): login timed out after 61 seconds; reconnecting

Is there some kind of undocumented limit that I'm running into?  FWIW the limit doesn't seem to be related to the number of connected sessions, just the number of active sessions with traffic.  When the failures start, I think I have about 50 total outstanding commands, and around 2-3 GBps of traffic.
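For reference, a minimal sketch of how cxgbei offload is typically enabled on both sides (this is an assumption about the setup, not the reporter's actual configuration; the module names follow cxgbei(4), and the "offload" statements are the ones documented in ctl.conf(5) and iscsi.conf(5)):

# loader.conf, on both target and initiator
t4_tom_load="YES"
cxgbei_load="YES"

# ctl.conf on the target, inside the relevant portal-group
offload "cxgbei"

# iscsi.conf on the initiator, inside each session block
offload = "cxgbei"

The TOE capability must also be enabled on the Chelsio port (e.g. "ifconfig cc0 toe", where cc0 stands in for the actual port name).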
Comment 1 Navdeep Parhar 2023-08-04 14:55:58 UTC
By default the driver allocates just 2 TOE txq/rxq (these are the queues used by all the offload drivers, including cxgbei).  The first thing to try would be to increase these to 4 or 8, depending on the number of cores in the system.  Try this in loader.conf:

hw.cxgbe.nofldtxq="-8"
hw.cxgbe.nofldrxq="-8"

Keep an eye on "netstat -dI <ifnet> -w1" when the system is under load.  It may be that there are packet drops leading to retransmits and loss of performance.
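A hedged sketch of how one might confirm the new queue counts and watch for drops after rebooting with those tunables (the cc0 interface name is a placeholder, and the dmesg/sysctl checks are assumptions based on what cxgbe(4) normally reports at attach time):

# per-port attach messages usually list the NIC and TOE queue counts
dmesg | grep -E 'txq|rxq'
# the tunables should show up here if they are exported as read-only sysctls
sysctl hw.cxgbe | grep nofld
# one-second interval statistics, including drops, for the offload port
netstat -dI cc0 -w 1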
Comment 2 Alan Somers 2023-08-04 15:00:42 UTC
What's the "-" for?  Should that be:

hw.cxgbe.nofldtxq="8"
hw.cxgbe.nofldrxq="8"

And is there a recommended relationship between core count and queue count?  FWIW this is a 24-core CPU, with hyperthreading disabled.
Comment 3 Navdeep Parhar 2023-08-04 15:34:26 UTC
(In reply to Alan Somers from comment #2)

The "-n" means "upto n, but no more than the number of cores".  On a 24 core system both 8 and -8 will result in 8 queues.  But if the system had 4 cores for example then -8 would result in 4 queues (not 8).
Comment 4 Alan Somers 2023-08-04 21:14:29 UTC
No dice.  When I set those knobs to "-8", I saw the same behavior.  netstat showed no dropped packets.  And when I tried setting them to "-24", it quickly produced panics.  See bug #272947.  BTW, do those settings also apply to the ccr crypto offload driver?
Comment 5 Navdeep Parhar 2023-08-18 18:08:41 UTC
What exact card is this, and how are you generating the load?  If it's an fio- or iozone-type command, please post it here and I'll try the same thing.  Are you running cxgbei-offloaded iSCSI on both sides, or only on the initiator or the target?

I tried with 100G cards and cxgbei on both sides and didn't run into this problem.  I'm going to try with 25G cards next.  2-3 GBps would be close to line rate for those cards, and it may be that performance falls off a cliff when you get close to line rate and start seeing pause frames/drops.
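For what it's worth, a rough sketch of the kind of write load that could be used to reproduce this from the initiator side (purely illustrative; the device name, job count, and block size are assumptions, not the reporter's actual command):

fio --name=iscsi-writes --filename=/dev/da0 --rw=write --bs=1m \
    --ioengine=posixaio --iodepth=4 --direct=1 --numjobs=16 \
    --runtime=60 --time_based --group_reporting

As a sanity check on the line-rate point: 25 Gb/s is about 25/8 = 3.125 GB/s, so 2-3 GBps of aggregate traffic would indeed be close to saturating a single 25G port.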