Summary: | tcp_output() sends beyond receivers window | ||||||
---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Gleb Smirnoff <glebius> | ||||
Component: | kern | Assignee: | Richard Scheffenegger <rscheff> | ||||
Status: | In Progress --- | ||||||
Severity: | Affects Some People | CC: | markj, rscheff, tuexen | ||||
Priority: | --- | Flags: | markj:
mfc-stable14?
|
||||
Version: | 15.0-CURRENT | ||||||
Hardware: | Any | ||||||
OS: | Any | ||||||
See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=284588 | ||||||
Attachments: |
|
Description
Gleb Smirnoff
![]() ![]() Sorry, forgot to provide full pf debug for the packet. Missing line: Dec 26 22:45:58 cell kernel: pf: State failure on: 1 | So, the src->seqhi is smaller than seq + data_len. Either seqhi of the tracked connection is incorrectly low, or our stack is sending too much. The more I look at it, the more it seems to me it might be a duplicate of bug 282605. Note that this kernel is compiled without INVARIANTS, so it doesn't panic. Nope, it is not a duplicate of bug 282605. We aren't sending 2 segments, we are sending one, but with too much data. I checked that enabling this particular KASSERT. Here the connection state that leads to a bogus packet: TF_FASTRECOVERY is set snd_wnd = 384 snd_cwnd quite large snd_nxt == snd_una SACK state: sack_bytes_rexmit 1448 sacked_bytes 124528 lost_bytes 0 In this state we start with sendwin=0 at line 262, then it is set to snd_wnd at line 277. We enter the fast recovery block and tcp_compute_pipe() returns us large negative -121632 at line 301. Subtracting large negative from small positive ends in large cwin. At this point we totally forget that we have small snd_wnd. So whatever branch for the len calculation we take in the rest of recovery block, we may overcommit snd_wnd. So, I'm 90% sure that regression comes from 7dc78150c730e90fa2afdaba3aa645932b30c429 this chunk: @@ -292,10 +294,13 @@ tcp_default_output(struct tcpcb *tp) if ((tp->t_flags & TF_SACK_PERMIT) && (IN_FASTRECOVERY(tp->t_flags) || SEQ_LT(tp->snd_nxt, tp->snd_max)) && (p = tcp_sack_output(tp, &sack_bytes_rxmt))) { - uint32_t cwin; + int32_t cwin; - cwin = - imax(min(tp->snd_wnd, tp->snd_cwnd) - sack_bytes_rxmt, 0); + if (IN_FASTRECOVERY(tp->t_flags)) { + cwin = imax(sendwin - tcp_compute_pipe(tp), 0); + } else { + cwin = imax(sendwin - off, 0); + } /* Do not retransmit SACK segments beyond snd_recover */ if (SEQ_GT(p->end, tp->snd_recover)) { /* I really can't understand logic of tcp_compute_pipe(). We are subtracting sacked_bytes from sack_bytes_rexmit. But the latter is a gauge value, that goes up and down as we retransmit holes and new holes appear. And the former is always growing value, that is reset to 0 only when we clear all holes. For a long living lossy connection this clearance may not happen for long periods of time, so we would accumulate a huge sacked_bytes with a relatively small sack_bytes_rexmit. After that we are going to get negative result from tcp_compute_pipe(). But this code is pretty old and didn't cause problems before. Until f359d6ebbcced08a13060c07168856ffe6c9f734 in 2020 the sacked_bytes were recalculated on every packet, so they couldn't leak on a long and always lossy connection. I think this revision broke the RFC6675 support really. Also, today I'm not sure that the problem is a new regression. Previously my dlna server was using sendfile(2) instead of write(2). In this case the EACCESS from pf was delivered to the I/O completion thread and ignored. The TCP of course recovered from the dropped packet. Recent package update switched minidlnad to use write(2), so EACCESS is delivered to the process and it terminates. Created attachment 256224 [details]
patch to properly reset sackhint state
With some debugging I have found that when SACK scoreboard collapses, it has hole_bytes cleared, but the rest of the fields are not. That's why we may get bogus return values from tcp_compute_pipe(). I created a patch to properly clear scoreboard state. Now, with my debugging I see that problem no longer happens.
HOWEVER! I still get EACCESS from pf(4) for the same reason. Our TCP tries to send beyond advertised windows of Samsung. But this seems to be a different problem. First, now it takes much longer to reproduce. Second, none of my D scripts that I used to catch the above condition do catch anything with patched kernel. Third, the len argument reported by pf(4) sometimes is larger than t_maxseg (TSO is on), but I never have seen this with the first bug.
Ok, I have nailed down two problems and now I can run TCP session with lossy Samsung for hours. Previously either problem would reproduce within a minute or two. https://reviews.freebsd.org/D48236 https://reviews.freebsd.org/D48237 A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=1043b36b2054ffb32aadfe1c276684bf463db59a commit 1043b36b2054ffb32aadfe1c276684bf463db59a Author: Gleb Smirnoff <glebius@FreeBSD.org> AuthorDate: 2025-01-13 18:13:54 +0000 Commit: Gleb Smirnoff <glebius@FreeBSD.org> CommitDate: 2025-01-13 18:13:54 +0000 tcp: don't send beyond receivers advertised window When calculating length of data to be sent, we may do some congestion calculations, but we shall never send a byte beyond (snd_una + snd_wnd). In 7dc78150c730e we started to use tcp_compute_pipe() instead of (snd_wnd - off). This function makes an estimate of how much data is in flight. It can return a value smaller and larger than (snd_nxt - snd_una). It will return a value larger when we have retransmitted some data from SACK holes, and smaller once those retransmitted SACK holes are acked. We may use tcp_compute_pipe() for length calculation, but always capped by the send offset 'off', which (snd_nxt - snd_una). PR: 283649 Reviewed by: rscheff Differential Revision: https://reviews.freebsd.org/D48237 Fixes: 7dc78150c730e90fa2afdaba3aa645932b30c429 sys/netinet/tcp_output.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=b84f41b4e82df373f8e682d45791b6ab636cd94e commit b84f41b4e82df373f8e682d45791b6ab636cd94e Author: Gleb Smirnoff <glebius@FreeBSD.org> AuthorDate: 2025-01-13 18:13:45 +0000 Commit: Gleb Smirnoff <glebius@FreeBSD.org> CommitDate: 2025-01-13 18:13:45 +0000 tcp: properly reset sackhint values when SACK recovery is done When the SACK scoreboard collapses, properly clear all the counters. The counters are used in tcp_compute_pipe(), which can be called anytime later after the SACK recovery. The returned result can be totally bogus, including both too large and too small values. PR: 283649 Reviewed by: rscheff Differential Revision: https://reviews.freebsd.org/D48236 sys/netinet/tcp_sack.c | 27 +++++++++++++++++---------- 1 file changed, 17 insertions(+), 10 deletions(-) Richard, thanks for reviewing my revisions! I have checked them in CURRENT. May I leave you responsible for the bug? Cause there are two more things to handle, and I believe, you'd do it better than me: 1) Check if fixes need to be merges to stable/14. 2) Check one more branch that uses tcp_compute_pipe() to calculate how much are we going to send. Maybe it also needs capping, but I never enter fast recovery between my minidlna and the TV. The branch is located at tcp_output.c:300. I tried to understand do we cap the obtain value later or not, but got lost. *** Bug 284588 has been marked as a duplicate of this bug. *** |