Bug 283649 - tcp_output() sends beyond receivers window
Summary: tcp_output() sends beyond receivers window
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 15.0-CURRENT
Hardware: Any Any
: --- Affects Some People
Assignee: Richard Scheffenegger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-12-27 07:24 UTC by Gleb Smirnoff
Modified: 2025-02-05 16:50 UTC (History)
3 users (show)

See Also:
markj: mfc-stable14?


Attachments
patch to properly reset sackhint state (2.05 KB, patch)
2024-12-28 07:18 UTC, Gleb Smirnoff
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Gleb Smirnoff freebsd_committer freebsd_triage 2024-12-27 07:24:45 UTC
Running FreeBSD-CURRENT at ff4c19bb5427 aka main-stabweek-2024-Nov. This definitely is a regression, since the bug was most likely not present in October stabweek snapshot and I'm 100% sure it was not present in the September stabweek snapshot. It is still unclear to me is it a regression in pf(4) or in TCP sending machine.

I have a local minidlna server that serves content over TCP. It basically sends data to a local client in an endless write(2) loop. The pf rules on the internal interface are pretty relaxed:

pass in on $iif from $local to any

However, once in a while pf(4) returns PFIL_DROPPED to the sending syscall. In my case it usually takes 1-5 minutes of watching a movie over minidlna, which is sending about 10-100 megabytes of data. With loud mode, pf would print:

Dec 26 22:45:58 cell kernel: pf: BAD state: TCP in wire: 10.1.10.180:44977 10.1.10.1:8200 stack: - [lo=657121872 high=657187600 win=14209 modulator=0 wscale=7] [lo=124961762 high=124962586 win=1027 modulator=0 wscale=6] 4:4 @3 A seq=124961762 (124961762) ack=657121872 len=1448 ackskew=0 pkts=7246:7242 dir=out,rev

The other side is a Samsung TV, and looks like it has its own weirdness. For example it would ACK with a sequence number, that is was in the middle of the sent packet. Note that there is no fragmentation on the network. It will also report SACK holes that cross. Here is excerpt from the tcpdump around seq=124961762:

22:45:58.153441 IP 10.1.10.180.44977 > 10.1.10.1.8200: Flags [.], ack 124838682, win 968, options [nop,nop,TS val 15322271 ecr 418474971,nop,nop,sack 3 {124871986:124961762}{124867642:124870538}{124844474:124866194}], length 0
22:45:58.153573 IP 10.1.10.180.44977 > 10.1.10.1.8200: Flags [.], ack 124840130, win 957, options [nop,nop,TS val 15322271 ecr 418474984,nop,nop,sack 3 {124871986:124961762}{124867642:124870538}{124844474:124866194}], length 0
22:45:58.153606 IP 10.1.10.1.8200 > 10.1.10.180.44977: Flags [.], ack 657121872, win 1027, options [nop,nop,TS val 418474993 ecr 15322271], length 0
22:45:58.153820 IP 10.1.10.180.44977 > 10.1.10.1.8200: Flags [.], ack 124841578, win 946, options [nop,nop,TS val 15322271 ecr 418474984,nop,nop,sack 3 {124871986:124961762}{124867642:124870538}{124844474:124866194}], length 0
22:45:58.153849 IP 10.1.10.1.8200 > 10.1.10.180.44977: Flags [.], ack 657121872, win 1027, options [nop,nop,TS val 418474994 ecr 15322271], length 0
22:45:58.153865 IP 10.1.10.180.44977 > 10.1.10.1.8200: Flags [.], ack 124843026, win 935, options [nop,nop,TS val 15322271 ecr 418474985,nop,nop,sack 3 {124871986:124961762}{124867642:124870538}{124844474:124866194}], length 0
22:45:58.153892 IP 10.1.10.1.8200 > 10.1.10.180.44977: Flags [.], ack 657121872, win 1027, options [nop,nop,TS val 418474994 ecr 15322271], length 0
22:45:58.153920 IP 10.1.10.180.44977 > 10.1.10.1.8200: Flags [.], ack 124866194, win 754, options [nop,nop,TS val 15322271 ecr 418474985,nop,nop,sack 2 {124871986:124961762}{124867642:124870538}], length 0
22:45:58.153947 IP 10.1.10.1.8200 > 10.1.10.180.44977: Flags [.], ack 657121872, win 1027, options [nop,nop,TS val 418474994 ecr 15322271], length 0
22:45:58.154043 IP 10.1.10.180.44977 > 10.1.10.1.8200: Flags [.], ack 124870538, win 721, options [nop,nop,TS val 15322271 ecr 418474986,nop,nop,sack 1 {124871986:124961762}], length 0
22:45:58.154070 IP 10.1.10.1.8200 > 10.1.10.180.44977: Flags [.], ack 657121872, win 1027, options [nop,nop,TS val 418474994 ecr 15322271], length 0
22:45:58.154188 IP 10.1.10.180.44977 > 10.1.10.1.8200: Flags [.], ack 124961762, win 9, options [nop,nop,TS val 15322271 ecr 418474986], length 0
22:45:58.154230 IP 10.1.10.1.8200 > 10.1.10.180.44977: Flags [.], seq 124961762:124962914, ack 657121872, win 1027, options [nop,nop,TS val 418474994 ecr 15322271], length 1152
22:45:58.190890 IP 10.1.10.180.44977 > 10.1.10.1.8200: Flags [.], ack 124962914, win 0, options [nop,nop,TS val 15322281 ecr 418474994], length 0
22:45:58.210143 IP 10.1.10.180.44977 > 10.1.10.1.8200: Flags [.], ack 124962914, win 1003, options [nop,nop,TS val 15322285 ecr 418474994], length 0
22:45:58.210171 IP 10.1.10.1.8200 > 10.1.10.180.44977: Flags [.], seq 124962914:124964362, ack 657121872, win 1027, options [nop,nop,TS val 418475050 ecr 15322285], length 1448
22:45:58.211119 IP 10.1.10.180.44977 > 10.1.10.1.8200: Flags [.], ack 124965810, win 1116, options [nop,nop,TS val 15322286 ecr 418475050], length 0
22:45:58.211157 IP 10.1.10.1.8200 > 10.1.10.180.44977: Flags [.], seq 124965810:124967258, ack 657121872, win 1027, options [nop,nop,TS val 418475051 ecr 15322286], length 1448
22:45:58.211614 IP 10.1.10.180.44977 > 10.1.10.1.8200: Flags [.], ack 124968706, win 1229, options [nop,nop,TS val 15322286 ecr 418475051], length 0
22:45:58.211651 IP 10.1.10.1.8200 > 10.1.10.180.44977: Flags [.], seq 124971602:124973050, ack 657121872, win 1027, options [nop,nop,TS val 418475052 ecr 15322286], length 1448
22:45:58.250909 IP 10.1.10.180.44977 > 10.1.10.1.8200: Flags [.], ack 124977394, win 1186, options [nop,nop,TS val 15322296 ecr 418475051], length 0

Of course packet dropped by pf is missing from the above dump. I can also share full tcpdump file via private email.
Comment 1 Gleb Smirnoff freebsd_committer freebsd_triage 2024-12-27 08:16:32 UTC
Sorry, forgot to provide full pf debug for the packet. Missing line:

Dec 26 22:45:58 cell kernel: pf: State failure on: 1       |    

So, the src->seqhi is smaller than seq + data_len. Either seqhi of the tracked connection is incorrectly low, or our stack is sending too much.
Comment 2 Gleb Smirnoff freebsd_committer freebsd_triage 2024-12-27 08:32:44 UTC
The more I look at it, the more it seems to me it might be a duplicate of bug 282605. Note that this kernel is compiled without INVARIANTS, so it doesn't panic.
Comment 3 Gleb Smirnoff freebsd_committer freebsd_triage 2024-12-27 16:43:45 UTC
Nope, it is not a duplicate of bug 282605. We aren't sending 2 segments, we are sending one, but with too much data. I checked that enabling this particular KASSERT.
Comment 4 Gleb Smirnoff freebsd_committer freebsd_triage 2024-12-27 18:16:56 UTC
Here the connection state that leads to a bogus packet:

TF_FASTRECOVERY is set
snd_wnd = 384
snd_cwnd quite large
snd_nxt == snd_una
SACK state: sack_bytes_rexmit 1448 sacked_bytes 124528 lost_bytes 0

In this state we start with sendwin=0 at line 262, then it is set to snd_wnd at line 277.  We enter the fast recovery block and tcp_compute_pipe() returns us large negative -121632 at line 301. Subtracting large negative from small positive ends in large cwin. At this point we totally forget that we have small snd_wnd. So whatever branch for the len calculation we take in the rest of recovery block, we may overcommit snd_wnd.

So, I'm 90% sure that regression comes from 7dc78150c730e90fa2afdaba3aa645932b30c429 this chunk:
@@ -292,10 +294,13 @@ tcp_default_output(struct tcpcb *tp)
        if ((tp->t_flags & TF_SACK_PERMIT) &&
            (IN_FASTRECOVERY(tp->t_flags) || SEQ_LT(tp->snd_nxt, tp->snd_max)) &&
            (p = tcp_sack_output(tp, &sack_bytes_rxmt))) {
-               uint32_t cwin;
+               int32_t cwin;
 
-               cwin =
-                   imax(min(tp->snd_wnd, tp->snd_cwnd) - sack_bytes_rxmt, 0);
+               if (IN_FASTRECOVERY(tp->t_flags)) {
+                       cwin = imax(sendwin - tcp_compute_pipe(tp), 0);
+               } else {
+                       cwin = imax(sendwin - off, 0);
+               }
                /* Do not retransmit SACK segments beyond snd_recover */
                if (SEQ_GT(p->end, tp->snd_recover)) {
                        /*
Comment 5 Gleb Smirnoff freebsd_committer freebsd_triage 2024-12-27 19:04:22 UTC
I really can't understand logic of tcp_compute_pipe(). We are subtracting sacked_bytes from sack_bytes_rexmit. But the latter is a gauge value, that goes up and down as we retransmit holes and new holes appear. And the former is always growing value, that is reset to 0 only when we clear all holes. For a long living lossy connection this clearance may not happen for long periods of time, so we would accumulate a huge sacked_bytes with a relatively small sack_bytes_rexmit. After that we are going to get negative result from tcp_compute_pipe(). But this code is pretty old and didn't cause problems before.
Comment 6 Gleb Smirnoff freebsd_committer freebsd_triage 2024-12-28 03:29:38 UTC
Until f359d6ebbcced08a13060c07168856ffe6c9f734 in 2020 the sacked_bytes were recalculated on every packet, so they couldn't leak on a long and always lossy connection. I think this revision broke the RFC6675 support really.

Also, today I'm not sure that the problem is a new regression. Previously my dlna server was using sendfile(2) instead of write(2). In this case the EACCESS from pf was delivered to the I/O completion thread and ignored. The TCP of course recovered from the dropped packet. Recent package update switched minidlnad to use write(2), so EACCESS is delivered to the process and it terminates.
Comment 7 Gleb Smirnoff freebsd_committer freebsd_triage 2024-12-28 07:18:01 UTC
Created attachment 256224 [details]
patch to properly reset sackhint state

With some debugging I have found that when SACK scoreboard collapses, it has hole_bytes cleared, but the rest of the fields are not. That's why we may get bogus return values from tcp_compute_pipe(). I created a patch to properly clear scoreboard state. Now, with my debugging I see that problem no longer happens.

HOWEVER! I still get EACCESS from pf(4) for the same reason. Our TCP tries to send beyond advertised windows of Samsung. But this seems to be a different problem. First, now it takes much longer to reproduce. Second, none of my D scripts that I used to catch the above condition do catch anything with patched kernel. Third, the len argument reported by pf(4) sometimes is larger than t_maxseg (TSO is on), but I never have seen this with the first bug.
Comment 9 Gleb Smirnoff freebsd_committer freebsd_triage 2024-12-29 00:30:57 UTC
Ok, I have nailed down two problems and now I can run TCP session with lossy Samsung for hours. Previously either problem would reproduce within a minute or two.

https://reviews.freebsd.org/D48236
https://reviews.freebsd.org/D48237
Comment 10 commit-hook freebsd_committer freebsd_triage 2025-01-13 18:15:55 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=1043b36b2054ffb32aadfe1c276684bf463db59a

commit 1043b36b2054ffb32aadfe1c276684bf463db59a
Author:     Gleb Smirnoff <glebius@FreeBSD.org>
AuthorDate: 2025-01-13 18:13:54 +0000
Commit:     Gleb Smirnoff <glebius@FreeBSD.org>
CommitDate: 2025-01-13 18:13:54 +0000

    tcp: don't send beyond receivers advertised window

    When calculating length of data to be sent, we may do some congestion
    calculations, but we shall never send a byte beyond (snd_una + snd_wnd).
    In 7dc78150c730e we started to use tcp_compute_pipe() instead of (snd_wnd
    - off).  This function makes an estimate of how much data is in flight. It
    can return a value smaller and larger than (snd_nxt - snd_una).  It will
    return a value larger when we have retransmitted some data from SACK
    holes, and smaller once those retransmitted SACK holes are acked.

    We may use tcp_compute_pipe() for length calculation, but always capped
    by the send offset 'off', which (snd_nxt - snd_una).

    PR:                     283649
    Reviewed by:            rscheff
    Differential Revision:  https://reviews.freebsd.org/D48237
    Fixes:                  7dc78150c730e90fa2afdaba3aa645932b30c429

 sys/netinet/tcp_output.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
Comment 11 commit-hook freebsd_committer freebsd_triage 2025-01-13 18:15:57 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b84f41b4e82df373f8e682d45791b6ab636cd94e

commit b84f41b4e82df373f8e682d45791b6ab636cd94e
Author:     Gleb Smirnoff <glebius@FreeBSD.org>
AuthorDate: 2025-01-13 18:13:45 +0000
Commit:     Gleb Smirnoff <glebius@FreeBSD.org>
CommitDate: 2025-01-13 18:13:45 +0000

    tcp: properly reset sackhint values when SACK recovery is done

    When the SACK scoreboard collapses, properly clear all the counters.
    The counters are used in tcp_compute_pipe(), which can be called
    anytime later after the SACK recovery.  The returned result can be
    totally bogus, including both too large and too small values.

    PR:                     283649
    Reviewed by:            rscheff
    Differential Revision:  https://reviews.freebsd.org/D48236

 sys/netinet/tcp_sack.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)
Comment 12 Gleb Smirnoff freebsd_committer freebsd_triage 2025-01-13 18:27:57 UTC
Richard,

thanks for reviewing my revisions! I have checked them in CURRENT.

May I leave you responsible for the bug? Cause there are two more things to
handle, and I believe, you'd do it better than me:

1) Check if fixes need to be merges to stable/14.
2) Check one more branch that uses tcp_compute_pipe() to calculate how
   much are we going to send. Maybe it also needs capping, but I never enter
   fast recovery between my minidlna and the TV.  The branch is located at
   tcp_output.c:300.  I tried to understand do we cap the obtain value later
   or not, but got lost.
Comment 13 Gleb Smirnoff freebsd_committer freebsd_triage 2025-02-05 16:45:05 UTC
*** Bug 284588 has been marked as a duplicate of this bug. ***