Sending stress TCP with offloaded NAT-T configuration panic: tcp_do_segment: sent too much BT #0 __curthread () at /usr/kernel_git/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=textdump@entry=1) at /usr/kernel_git/sys/kern/kern_shutdown.c:404 #2 0xffffffff80b3b7a0 in kern_reboot (howto=260) at /usr/kernel_git/sys/kern/kern_shutdown.c:524 #3 0xffffffff80b3bcd7 in vpanic (fmt=0xffffffff81172d6e "%s: sent too much", ap=ap@entry=0xfffffe010a82db20) at /usr/kernel_git/sys/kern/kern_shutdown.c:979 #4 0xffffffff80b3bb03 in panic (fmt=<unavailable>) at /usr/kernel_git/sys/kern/kern_shutdown.c:892 #5 0xffffffff80d34372 in tcp_do_segment (tp=0xfffff803402baa80, tp@entry=<error reading variable: value is not available>, m=<optimized out>, m@entry=<error reading variable: value is not available>, th=0xfffff802fd295e84, th@entry=<error reading variable: value is not available>, drop_hdrlen=64, drop_hdrlen@entry=<error reading variable: value is not available>, tlen=0, tlen@entry=<error reading variable: value is not available>, iptos=<unavailable>, iptos@entry=<error reading variable: value is not available>) at /usr/kernel_git/sys/netinet/tcp_input.c:1548 #6 0xffffffff80d30d98 in tcp_input_with_port (mp=<optimized out>, offp=<optimized out>, proto=<optimized out>, port=port@entry=0) at /usr/kernel_git/sys/netinet/tcp_input.c:1158 #7 0xffffffff80d31a5b in tcp_input (mp=<unavailable>, offp=<unavailable>, proto=<unavailable>) at /usr/kernel_git/sys/netinet/tcp_input.c:1502 #8 0xffffffff80d1e3af in ip_input (m=0x0, m@entry=<error reading variable: value is not available>) at /usr/kernel_git/sys/netinet/ip_input.c:857 #9 0xffffffff80c98b7b in netisr_process_workstream_proto (nwsp=0xffffffff823c0a00, proto=1) at /usr/kernel_git/sys/net/netisr.c:927 #10 swi_net (arg=0xffffffff823c0a00) at /usr/kernel_git/sys/net/netisr.c:974 #11 0xffffffff80af3b56 in intr_event_execute_handlers (ie=0xfffff800035ee600, p=<optimized out>) at /usr/kernel_git/sys/kern/kern_intr.c:1183 #12 ithread_execute_handlers (ie=0xfffff800035ee600, p=<optimized out>) at /usr/kernel_git/sys/kern/kern_intr.c:1196 #13 ithread_loop (arg=arg@entry=0xfffff800033cdf60) at /usr/kernel_git/sys/kern/kern_intr.c:1289 #14 0xffffffff80aeff52 in fork_exit (callout=0xffffffff80af38f0 <ithread_loop>, arg=0xfffff800033cdf60, frame=0xfffffe010a82df40) at /usr/kernel_git/sys/kern/kern_fork.c:1151 #15 <signal handler called> SHA ID of the kernel: 2ce493e1693b55a330079ac5fce8beb66e26ddeb
There have been a couple of TCP commits since your revision - is it possible that this is already fixed?
Created attachment 255012 [details] packetdrill reproducer
(In reply to Mark Johnston from comment #1) It is not yet fixed. I attached a packetdrill reproducer. I am right now testing review D47474 which seems to avoid triggering the relevant KASSERT so far. The reproducer also does not trigger the KASSERT anymore with review D47474. But it is not fully understood, what is going on.
Agree, D47474 masks the problem, which appears to be an interaction between the new tcp transmission selection code, and the combination of RTO followed by a SACK loss recovery in close succession.
I just run into this panic, on the ipv6 side I would guess (current as of 2024-10-30-120714): ---snip--- [365136] panic: tcp_do_segment: sent too much [365136] cpuid = 1 [365136] time = 1731354815 [365136] KDB: stack backtrace: [365136] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe04314f7790 [365136] vpanic() at vpanic+0x136/frame 0xfffffe04314f78c0 [365136] panic() at panic+0x43/frame 0xfffffe04314f7920 [365136] tcp_do_segment() at tcp_do_segment+0x2852/frame 0xfffffe04314f79f0 [365136] tcp_input_with_port() at tcp_input_with_port+0x10e2/frame 0xfffffe04314f7b40 [365136] tcp6_input_with_port() at tcp6_input_with_port+0x6a/frame 0xfffffe04314f7b70 [365136] tcp6_input() at tcp6_input+0xb/frame 0xfffffe04314f7b80 [365136] ip6_input() at ip6_input+0xc76/frame 0xfffffe04314f7c60 [365136] netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe04314f7cc0 [365136] ether_demux() at ether_demux+0x16a/frame 0xfffffe04314f7cf0 [365136] ether_nh_input() at ether_nh_input+0x3cf/frame 0xfffffe04314f7d40 [365136] netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe04314f7da0 [365136] ether_input() at ether_input+0xd5/frame 0xfffffe04314f7e00 [365136] epair_tx_start_deferred() at epair_tx_start_deferred+0xd4/frame 0xfffffe04314f7e40 [365136] taskqueue_run_locked() at taskqueue_run_locked+0x1c7/frame 0xfffffe04314f7ec0 [365136] taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe04314f7ef0 [365136] fork_exit() at fork_exit+0x87/frame 0xfffffe04314f7f30 [365136] fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe04314f7f30 [365136] --- trap 0x3de64570, rip = 0, rsp = 0, rbp = 0 --- [365136] Uptime: 4d5h25m36s [365136] Dumping 50824 out of 73621 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% ---snip-- Parts of the crashdump output (full output available on request): ---snip--- #5 0xffffffff806b6b72 in tcp_do_segment (tp=0xfffff80da7abfa80, m=<optimized out>, th=0xfffff804c5190a96, drop_hdrlen=72, tlen=0, iptos=<optimized out>) at /space/system/usr_src/sys/netinet/tcp_input.c:1548 to = {to_flags = 128, to_tsval = 4294965249, to_tsecr = 4546, to_sacks = 0xfffff804c5190aae "\aa\217\345\aa\225}\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255", <incomplete sequence \336>, to_signature = 0x4 <error: Cannot access memory at address 0x4>, to_tfo_cookie = 0xfffffe04314f79c0 "\340yO1", to_mss = 22215, to_wscale = 111 'o', to_nsacks = 1 '\001', to_tfo_len = 255 '\377', to_spare = 2154633376} maxseg = 1432 inp = 0xfffff80da7abfa80 needoutput = 0 incforsyn = <optimized out> so = 0xfffff8051a0ffc00 inc = <optimized out> thflags = <optimized out> sack_changed = <optimized out> nsegs = 1 s = <optimized out> tiwin = <optimized out> rstreason = <optimized out> todrop = <optimized out> acked = <optimized out> tfo_syn = <optimized out> mfree = <optimized out> ourfinisacked = <optimized out> win = <optimized out> close = <optimized out> #6 0xffffffff806b3602 in tcp_input_with_port (mp=mp@entry=0xfffffe04314f7bc8, offp=offp@entry=0xfffffe04314f7bc0, proto=<optimized out>, port=0) at /space/system/usr_src/sys/netinet/tcp_input.c:1158 so = 0xfffff8051a0ffc00 to = {to_flags = 0, to_tsval = 0, to_tsecr = 719386432, to_sacks = 0xfffff80727e1a038 "\001", to_signature = 0xfffff80727e1a084 "", to_tfo_cookie = 0x5bbafa8e00000073 <error: Cannot access memory at address 0x5bbafa8e00000073>, to_mss = 37456, to_wscale = 5 '\005', to_nsacks = 0 '\000', to_tfo_len = 0 '\000', to_spare = 590855} m = 0xfffff804c5190a00 th = 0xfffff804c5190a96 ip = 0x0 inp = <optimized out> tp = <unavailable> optp = 0xfffff804c5190aaa "\001\001\005\n\aa\217\345\aa\225}\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255", <incomplete sequence \336> optlen = 12 tlen = <optimized out> rstreason = <optimized out> fwd_tag = 0x0 ip6 = 0xfffff804c5190a6e s = 0x0 off0 = <optimized out> iptos = 0 '\000' off = <optimized out> len = <optimized out> ipttl = <optimized out> thflags = <optimized out> drop_hdrlen = 72 lookupflag = <optimized out> isipv6 = <optimized out> #7 0xffffffff806b247a in tcp6_input_with_port (mp=0xfffffe04314f7bc8, offp=0xfffffe04314f7bc0, proto=<optimized out>, port=port@entry=0) at /space/system/usr_src/sys/netinet/tcp_input.c:594 m = 0xfffff804c5190a00 ip6 = <optimized out> ia6 = <unavailable> #8 0xffffffff806b3d5b in tcp6_input (mp=<unavailable>, offp=<unavailable>, proto=<unavailable>) at /space/system/usr_src/sys/netinet/tcp_input.c:601 No locals. ---snip---
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=8f5a2e216f4cb955150c8f88ab21eaecc5adc8b9 commit 8f5a2e216f4cb955150c8f88ab21eaecc5adc8b9 Author: Richard Scheffenegger <rscheff@FreeBSD.org> AuthorDate: 2024-11-14 08:19:34 +0000 Commit: Richard Scheffenegger <rscheff@FreeBSD.org> CommitDate: 2024-11-14 08:19:49 +0000 tcp: fix cwnd recalculation during limited transmit Properly calculate the expected flight size (cwnd) during limited transmit. Exclude the SACK scoreboard from consideration when still in limited transmit. PR: 282605 Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D47541 sys/netinet/tcp_input.c | 2 +- sys/netinet/tcp_output.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-)
stable-13 is unaffected. stable-14 may have some exposure: - D43470 was not put in stable-14, which is the main trigger - D43355 (which was backported) does some tweaking of cwnd in the limited transmit branch, where this KASSERT triggers, but I think this has not led to any panics in stable-14. All panics observed where after D43470 was committed to MAIN afaik.
(In reply to Richard Scheffenegger from comment #7) Note that on stable/14, you won't see assertion failures unless running the GENERIC-DEBUG kernel config. In particular, assertions are disabled by default on stable branches.
(In reply to Richard Scheffenegger from comment #7) Let me run the reproducer on stable/14 and report back.
(In reply to Michael Tuexen from comment #9) The reproducer does NOT trigger the panic when running on stable/14 with a GENERIC-DEBUG config.
(In reply to Richard Scheffenegger from comment #7) I have several panics from stable-14 but I haven't dug into them.
(In reply to Peter Jeremy from comment #11) Well, without the backtrace and possibly the core, it will be hard to understand why your environment would be excessively affected. The reproducer of this panic from earlier reports was fixed by D47541 (and masked by D47474). Are you on stable-14 or main or some mixture?
This is 14-STABLE from 2024-11-18 (df1b835333ff). I have some local patches but nothing that should be impacting TCP. It crashed twice in reasonably quick succession but I haven't seen it since then (and I have both cores). Both times, it looked like below. I had a look at applying D47541 but it didn't apply cleanly to 14-stable - I've just had a closer look and I think that was just "patch" getting confused by the context. db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01d43a8770 vpanic() at vpanic+0x131/frame 0xfffffe01d43a88a0 panic() at panic+0x43/frame 0xfffffe01d43a8900 tcp_do_segment() at tcp_do_segment+0x29fd/frame 0xfffffe01d43a89e0 tcp_input_with_port() at tcp_input_with_port+0xf94/frame 0xfffffe01d43a8b20 tcp_input() at tcp_input+0xb/frame 0xfffffe01d43a8b30 ip_input() at ip_input+0x287/frame 0xfffffe01d43a8b80 netisr_dispatch_src() at netisr_dispatch_src+0xad/frame 0xfffffe01d43a8be0 ether_demux() at ether_demux+0x17a/frame 0xfffffe01d43a8c10 ether_nh_input() at ether_nh_input+0x3da/frame 0xfffffe01d43a8c60 netisr_dispatch_src() at netisr_dispatch_src+0xad/frame 0xfffffe01d43a8cc0 ether_input() at ether_input+0xe5/frame 0xfffffe01d43a8d20 _task_fn_rx() at _task_fn_rx+0xea4/frame 0xfffffe01d43a8e40 gtaskqueue_run_locked() at gtaskqueue_run_locked+0x18e/frame 0xfffffe01d43a8ec0 gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xd3/frame 0xfffffe01d43a8ef0 fork_exit() at fork_exit+0x82/frame 0xfffffe01d43a8f30 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01d43a8f30 I've made the full core.txt available at https://home.rulingia.com/~peter/private/core.txt.51
I got the panic again on the stabweek revision ff4c19bb5427, which includes 8f5a2e216f4c aka D47541. Re-opening the bug. Richard, I got a core and can send you any debugging info. Please email me in private what do I need to get from the core for you.
(In reply to Gleb Smirnoff from comment #14) Could you enable BlackBox logging using sysctl net.inet.tcp.bb.log_auto_all=1 sysctl net.inet.tcp.bb.log_auto_ratio=1 sysctl net.inet.tcp.bb.log_auto_mode=1 Also have the tcplog_dumper running. You can use tcplog_dumper_enable="YES" in /etc/rc.conf Using these settings (mode = TCP_LOG_STATE_TAIL), keeps the trailing events of all TCP connections, but does not dump anything to disk. If the kernel panics, we can get the BBLog information from the core. I might be able to construct a packetdrill reproducer from the BBLog file. At least I was able to do it last time.
I'm now seeing this every couple of days on 14-stable (src-c291194-g30bcbf8a2fef) with basically no TCP load. I've had a look at the suggestions in #15 but can't find tcplog_dumper. I have core dumps if that's of any interest.
(In reply to Peter Jeremy from comment #16) You should be able to install the tcplog_dumper by using pkg install tcplog_dumper and then follow the instructions in #15. Once you have then a core file with BBLog information, we can extract it.
I suspect the problematic commit might be the one pasted below. HardenedBSD has multiple users that have reported frequent crashes, except when the below-referenced commit is reverted. The affected HardenedBSD users are not able to produce a crash dump. Please let me know if/how I can be of any assistance. I can have the affected users test candidate patches. Note that it might take several days to get testing results. commit 6b2977c5978b58718d9cddc0a8c31e1bd9aa0361 Author: Richard Scheffenegger <rscheff@FreeBSD.org> AuthorDate: Thu Oct 10 12:50:22 2024 +0200 Commit: Richard Scheffenegger <rscheff@FreeBSD.org> CommitDate: Fri Oct 18 09:51:38 2024 +0200 tcp: fix duplicate retransmissions when RTO happens during SACK loss recovery When snd_nxt doesn't track snd_max, partial SACK ACKs may elicit unexpected duplicate retransmissions. This is usually masked by LRO not necessarily ACKing every individual segment, and prior to RFC6675 SACK loss recovery, harder to trigger even when an RTO happens while SACK loss recovery is ongoing. Address this by improving the logic when to start a SACK loss recovery and how to deal with a RTO, as well as improvements to the adjusted congestion window during transmission selection. Reviewed By: tuexen, cc, #transport Sponsored by: NetApp, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D43355 (cherry picked from commit 440f4ba18e3ab7be912858bbcb96a419fcf14809)
Created attachment 256878 [details] packetdrill reproducer for stable/14
(In reply to Shawn Webb from comment #18) Is this for stable/14 or for CURRENT?
(In reply to Michael Tuexen from comment #20) This is in 14-stable.
(In reply to Shawn Webb from comment #21) Thanks for the confirmation. I am looking into it.
*** Bug 284588 has been marked as a duplicate of this bug. ***
I just hit this on the latest main branch 2025/2/25: panic: tcp_do_segment: sent too much cpuid = 3 time = 1740452298 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00dc30b6e0 vpanic() at vpanic+0x136/frame 0xfffffe00dc30b810 panic() at panic+0x43/frame 0xfffffe00dc30b870 tcp_do_segment() at tcp_do_segment+0x2998/frame 0xfffffe00dc30b950 tcp_input_with_port() at tcp_input_with_port+0x1140/frame 0xfffffe00dc30baa0 tcp_input() at tcp_input+0xb/frame 0xfffffe00dc30bab0 ip_input() at ip_input+0x28d/frame 0xfffffe00dc30bb10 netisr_dispatch_src() at netisr_dispatch_src+0xb4/frame 0xfffffe00dc30bb70 ether_demux() at ether_demux+0x16a/frame 0xfffffe00dc30bba0 ether_nh_input() at ether_nh_input+0x3cd/frame 0xfffffe00dc30bbf0 netisr_dispatch_src() at netisr_dispatch_src+0xb4/frame 0xfffffe00dc30bc50 ether_input() at ether_input+0xd5/frame 0xfffffe00dc30bcb0 hn_xpnt_vf_input() at hn_xpnt_vf_input+0x126/frame 0xfffffe00dc30bd20 tcp_lro_flush() at tcp_lro_flush+0x34f/frame 0xfffffe00dc30bd50 tcp_lro_flush_all() at tcp_lro_flush_all+0x1f3/frame 0xfffffe00dc30bda0 mana_poll() at mana_poll+0x9db/frame 0xfffffe00dc30be40 taskqueue_run_locked() at taskqueue_run_locked+0x1c2/frame 0xfffffe00dc30bec0 taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe00dc30bef0 fork_exit() at fork_exit+0x82/frame 0xfffffe00dc30bf30 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00dc30bf30 --- trap 0x17ee7fe0, rip = 0x3fe908dbc388, rsp = 0x17ee7cb800000001, rbp = 0 --- KDB: enter: panic [ thread pid 0 tid 100194 ] Stopped at kdb_enter+0x33: movq $0,0x1046992(%rip)
I just tripped over this in EC2 with a kernel built from 12 days ago (FreeBSD 15.0-CURRENT #0 main-n275409-6156da866e7d: Thu Feb 13 07:15:24 UTC 2025): tcp_do_segment: sent too much cpuid = 0 time = 1740512855 KDB: stack backtrace: db_trace_self() at db_trace_self db_trace_self_wrapper() at db_trace_self_wrapper+0x38 vpanic() at vpanic+0x1a0 panic() at panic+0x48 tcp_do_segment() at tcp_do_segment+0x2778 tcp_input_with_port() at tcp_input_with_port+0xfd4 tcp_input() at tcp_input+0x10 ip_input() at ip_input+0x354 netisr_dispatch_src() at netisr_dispatch_src+0xd8 ether_demux() at ether_demux+0x174 ether_nh_input() at ether_nh_input+0x374 netisr_dispatch_src() at netisr_dispatch_src+0xd8 ether_input() at ether_input+0xdc ena_cleanup() at ena_cleanup+0x4fc taskqueue_run_locked() at taskqueue_run_locked+0x17c taskqueue_thread_loop() at taskqueue_thread_loop+0xc0 fork_exit() at fork_exit+0x78 fork_trampoline() at fork_trampoline+0x18 KDB: enter: panic [ thread pid 0 tid 100211 ] Stopped at kdb_enter+0x48: str xzr, [x19, #2048] I can leave the system running for a while and have a serial console open if it would be useful to poke at anything before rebooting. I'll kill the instance in a few days if I don't hear from anyone.
Will put up a fix for review shortly.
ran into this today on recent 15.0, screenshot of stack trace: https://i.imgur.com/gmTDD7q.png i'm happy to test a patch once there is one.
I can repro this in ~ 5 minutes on my arm64 router, while doing some web loadgen testing through it using wrk2 (not in ports yet). $ wrk -d 120s -c 200 -t 8 -L -R 800 https://whiskey.skunkwerks.at/ - 15.0-CURRENT GENERIC arm64 - built from 20250315 at d2a55e6a9348 (upstream/main) openssl: update ASM and version info for 3.0.16 import (kgdb) bt #0 0xffff0000004b8ce8 in doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:404 #1 0xffff0000004b8aa8 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:524 #2 0xffff0000004b9028 in vpanic (fmt=<optimized out>, ap=...) at /usr/src/sys/kern/kern_shutdown.c:979 #3 0xffff0000004b8dcc in panic (fmt=0xffff000000ce7000 <M_PLIMIT+64> "") at /usr/src/sys/kern/kern_shutdown.c:892 #4 0xffff000000666234 in tcp_do_segment (tp=0xffffa085c2f52000, m=<optimized out>, th=0xffff0001631630e2, drop_hdrlen=<optimized out>, tlen=0, iptos=<optimized out>) at /usr/src/sys/netinet/tcp_input.c:1545 #5 0xffff000000662d88 in tcp_input_with_port (mp=<optimized out>, offp=<optimized out>, proto=<optimized out>, port=0, port@entry=42192) at /usr/src/sys/netinet/tcp_input.c:816 #6 0xffff000000663aa8 in tcp_input (mp=0xffff000000ce7000 <M_PLIMIT+64>, offp=0xffff000000db9000 <sdt_vfs_vop_vop_spare5_entry+96>, proto=1) at /usr/src/sys/netinet/tcp_input.c:1499 #7 0xffff00000064e2ac in ip_input (m=0x0) at /usr/src/sys/netinet/ip_input.c:861 #8 0xffff000000612344 in netisr_dispatch_src (proto=proto@entry=1, source=0, source@entry=18446462603272168832, m=0xffffa0804716a800) at /usr/src/sys/net/netisr.c:1151 #9 0xffff000000612728 in netisr_dispatch (proto=13529088, proto@entry=1, m=0x1) at /usr/src/sys/net/netisr.c:1242 #10 0xffff0000005f4214 in ether_demux (ifp=ifp@entry=0xffffa0800440d000, m=0xffffa0804716a800) at /usr/src/sys/net/if_ethersubr.c:942 #11 0xffff0000005f58c0 in ether_input_internal (ifp=0xffffa0800440d000, m=0xffffa0804716a800) at /usr/src/sys/net/if_ethersubr.c:706 #12 ether_nh_input (m=<optimized out>) at /usr/src/sys/net/if_ethersubr.c:736 #13 0xffff000000612344 in netisr_dispatch_src (proto=proto@entry=5, source=0, source@entry=18446462603272169024, m=0xffffa0804716a800) at /usr/src/sys/net/netisr.c:1151 #14 0xffff000000612728 in netisr_dispatch (proto=13529088, proto@entry=5, m=0x1) at /usr/src/sys/net/netisr.c:1242 #15 0xffff0000005f4664 in ether_input (ifp=0xffffa0800440d000, m=0xffff000000db9000 <sdt_vfs_vop_vop_spare5_entry+96>) at /usr/src/sys/net/if_ethersubr.c:847 #16 0xffff0000008be654 in dpaa2_ni_rx (ch=ch@entry=0xffff000113aaf000, fq=0xffff000113aaf180, fd=<optimized out>, ctx=ctx@entry=0xffff00010e90a788) at /usr/src/sys/dev/dpaa2/dpaa2_ni.c:3179 #17 0xffff0000008bdc34 in dpaa2_ni_consume_frames (chan=0xffff000113aaf000, src=<optimized out>, consumed=<optimized out>) at /usr/src/sys/dev/dpaa2/dpaa2_ni.c:3051 #18 dpaa2_ni_rx_cleanup (ch=0xffff000113aaf000) at /usr/src/sys/dev/dpaa2/dpaa2_ni.c:2868 #19 dpaa2_ni_cleanup_task (arg=<optimized out>, count=<optimized out>) at /usr/src/sys/dev/dpaa2/dpaa2_ni.c:2827 #20 0xffff000000521de4 in taskqueue_run_locked (queue=queue@entry=0xffffa08004720480) at /usr/src/sys/kern/subr_taskqueue.c:517 #21 0xffff000000522d1c in taskqueue_thread_loop (arg=arg@entry=0xffff000113ab1e80) at /usr/src/sys/kern/subr_taskqueue.c:829 #22 0xffff000000468ef4 in fork_exit (callout=0xffff000000522c58 <taskqueue_thread_loop>, arg=0xffff000113ab1e80, frame=0xffff00010e90aa00) at /usr/src/sys/kern/kern_fork.c:1152 #23 <signal handler called>
Created attachment 258715 [details] patch for testing
Can you please test the patch I added to the attachments and report the result? It is not the patch I will commit, but it is the part which is intended to fix the panic.
Alternatively, you can also apply https://reviews.freebsd.org/D48652 - which does not fix the issue of incorrectly unwinding an RTO, but once an RTO was declared spurious, fixes the subsequent cwnd calculation by resetting the state more completely.
(In reply to Richard Scheffenegger from comment #31) But to fix the root cause, I need feedback on the patch I provided.
(In reply to Alexander Leidinger from comment #5) I also found similar panic with ipv6 on 15.0 based on 2024-10-13: (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:404 #2 0xffffffff80b534c0 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:524 #3 0xffffffff80b539dc in vpanic (fmt=0xffffffff811fbf6d "%s: sent too much", ap=ap@entry=0xfffffe01060526b0) at /usr/src/sys/kern/kern_shutdown.c:979 #4 0xffffffff80b53823 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:892 #5 0xffffffff80d4b102 in tcp_do_segment (tp=0xfffff80399e5fa80, tp@entry=<error reading variable: value is not available>, m=<optimized out>, m@entry=<error reading variable: value is not available>, th=0xfffff8004426169a, th@entry=<error reading variable: value is not available>, drop_hdrlen=72, drop_hdrlen@entry=<error reading variable: value is not available>, tlen=0, tlen@entry=<error reading variable: value is not available>, iptos=<unavailable>, iptos@entry=<error reading variable: value is not available>) at /usr/src/sys/netinet/tcp_input.c:1548 #6 0xffffffff80d47aa5 in tcp_input_with_port (mp=mp@entry=0xfffffe0106052988, offp=offp@entry=0xfffffe0106052980, proto=<optimized out>, port=0) at /usr/src/sys/netinet/tcp_input.c:1158 #7 0xffffffff80d4691a in tcp6_input_with_port (mp=0xfffffe0106052988, offp=0xfffffe0106052980, proto=<optimized out>, port=port@entry=0) at /usr/src/sys/netinet/tcp_input.c:594 #8 0xffffffff80d481eb in tcp6_input (mp=<unavailable>, offp=<unavailable>, proto=<unavailable>) at /usr/src/sys/netinet/tcp_input.c:601 #9 0xffffffff80d929c0 in ip6_input (m=0x0, m@entry=<error reading variable: value is not available>) at /usr/src/sys/netinet6/ip6_input.c:930 #10 0xffffffff80caf29d in netisr_dispatch_src (proto=proto@entry=6, source=source@entry=0, m=0xfffff80044261600) at /usr/src/sys/net/netisr.c:1152 #11 0xffffffff80caf61f in netisr_dispatch (proto=<unavailable>, proto@entry=6, m=<unavailable>) at /usr/src/sys/net/netisr.c:1243 #12 0xffffffff80c9180a in ether_demux (ifp=ifp@entry=0xfffff80014bd1000, m=0xfffff80044261600) at /usr/src/sys/net/if_ethersubr.c:946 #13 0xffffffff80c92efd in ether_input_internal (ifp=0xfffff80014bd1000, m=0xfffff80044261600) at /usr/src/sys/net/if_ethersubr.c:710 #14 ether_nh_input (m=<unavailable>, m@entry=<error reading variable: value is not available>) at /usr/src/sys/net/if_ethersubr.c:740 #15 0xffffffff80caf29d in netisr_dispatch_src (proto=proto@entry=5, source=source@entry=0, m=0xfffff80044261600) at /usr/src/sys/net/netisr.c:1152 #16 0xffffffff80caf61f in netisr_dispatch (proto=<unavailable>, proto@entry=5, m=<unavailable>) at /usr/src/sys/net/netisr.c:1243 #17 0xffffffff80c91c65 in ether_input (ifp=0xfffff80014bd1000, ifp@entry=<error reading variable: value is not available>, m=<unavailable>, m@entry=<error reading variable: value is not available>) at /usr/src/sys/net/if_ethersubr.c:851 #18 0xffffffff80c9175b in ether_demux (ifp=ifp@entry=0xfffff80004619000, m=0xfffff80044261600) at /usr/src/sys/net/if_ethersubr.c:897 #19 0xffffffff80c92efd in ether_input_internal (ifp=0xfffff80004619000, m=0xfffff80044261600) at /usr/src/sys/net/if_ethersubr.c:710 #20 ether_nh_input (m=<unavailable>, m@entry=<error reading variable: value is not available>) at /usr/src/sys/net/if_ethersubr.c:740 #21 0xffffffff80caf29d in netisr_dispatch_src (proto=proto@entry=5, source=source@entry=0, m=0xfffff80044261600) at /usr/src/sys/net/netisr.c:1152 #22 0xffffffff80caf61f in netisr_dispatch (proto=<unavailable>, proto@entry=5, m=<unavailable>) at /usr/src/sys/net/netisr.c:1243 #23 0xffffffff80c91c65 in ether_input (ifp=0xfffff80004619000, m=<unavailable>) at /usr/src/sys/net/if_ethersubr.c:851 #24 0xffffffff80cab203 in iflib_rxeof (rxq=rxq@entry=0xfffff80004e73680, budget=<optimized out>) at /usr/src/sys/net/iflib.c:3081 #25 0xffffffff80ca53aa in _task_fn_rx (context=0xfffff80004e73680) at /usr/src/sys/net/iflib.c:4155 #26 0xffffffff80ba0eee in gtaskqueue_run_locked (queue=queue@entry=0xfffff800042d0600) at /usr/src/sys/kern/subr_gtaskqueue.c:368 #27 0xffffffff80ba0c23 in gtaskqueue_thread_loop (arg=arg@entry=0xfffffe003c109038) at /usr/src/sys/kern/subr_gtaskqueue.c:544 #28 0xffffffff80b080e2 in fork_exit (callout=0xffffffff80ba0b50 <gtaskqueue_thread_loop>, arg=0xfffffe003c109038, frame=0xfffffe0106052f40) at /usr/src/sys/kern/kern_fork.c:1151 #29 <signal handler called> (kgdb) f 5 #5 0xffffffff80d4b102 in tcp_do_segment (tp=0xfffff80399e5fa80, tp@entry=<error reading variable: value is not available>, m=<optimized out>, m@entry=<error reading variable: value is not available>, th=0xfffff8004426169a, th@entry=<error reading variable: value is not available>, drop_hdrlen=72, drop_hdrlen@entry=<error reading variable: value is not available>, tlen=0, tlen@entry=<error reading variable: value is not available>, iptos=<unavailable>, iptos@entry=<error reading variable: value is not available>) at /usr/src/sys/netinet/tcp_input.c:1548 1548 KASSERT(tp->t_state != TCPS_TIME_WAIT, ("%s: TCPS_TIME_WAIT", (kgdb) i lo to = {to_flags = 16, to_tsval = 2267638854, to_tsecr = 249601034, to_sacks = 0xfffff80399e5fa80 "", to_signature = 0x4 <error: Cannot access memory at address 0x4>, to_tfo_cookie = 0xfffffe0106052780 "\004", to_mss = 55127, to_wscale = 216 '\330', to_nsacks = 128 '\200', to_tfo_len = 255 '\377', to_spare = 0} maxseg = 1328 inp = 0xfffff80399e5fa80 needoutput = 0 incforsyn = <optimized out> so = 0xfffff80399781800 inc = <optimized out> thflags = <optimized out> sack_changed = <optimized out> nsegs = 1 s = <optimized out> rstreason = <optimized out> tiwin = <optimized out> todrop = <optimized out> acked = <optimized out> tfo_syn = <optimized out> mfree = <optimized out> ourfinisacked = <optimized out> win = <optimized out> close = <optimized out>
hi Michael after that patch, I ran wrk 3x and no panics. I'll leave it in a loop for a while.
(In reply to Andrey V. Elsukov from comment #33) I don't expect the problem to be specific to IPv4 or IPv6... Just a matter of which network layer you use.
(In reply to Dave Cottlehuber from comment #34) Perfect, thank you. I do know that the code I removed is a problem, but I want to make sure that there are not more problems...
Michael, Richard, do we have any chances for this panic to be fixed before March stabweek (next week)? It is already over 4 months since the panic was discovered and it affects many people.
(In reply to Michael Tuexen from comment #30) I'll coordinate testing a new build with the HardenedBSD community members being impacted by this bug. I'll try to get them a new test build this weekend.
(In reply to Gleb Smirnoff from comment #37) I can provide a patch tomorrow. Need to look at RACK and BBR...
(In reply to Michael Tuexen from comment #29) Before this patch, I was getting the panic about once every few weeks. I've been running the patch for the last day. No problems so far, so I know that it didn't make things substantially worse.
A patch is under review D49414.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=fbcf3b74e8f2c0c5ba37f1839bfe9395eb2fd0b1 commit fbcf3b74e8f2c0c5ba37f1839bfe9395eb2fd0b1 Author: Michael Tuexen <tuexen@FreeBSD.org> AuthorDate: 2025-03-20 15:17:40 +0000 Commit: Michael Tuexen <tuexen@FreeBSD.org> CommitDate: 2025-03-20 15:17:40 +0000 tcp: fix detection of bad RTOs If timestamps are enabled, the actions performed by a retransmission timeout were rolled back, when they should not. It is needed to make sure the incoming segment advances SND.UNA. To do this, remove the incorrect upfront check and extend the check in the fast path to handle also the case of timestamps. PR: 282605 Reviewed by: cc, rscheff, Peter Lei MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D49414 sys/netinet/tcp_input.c | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=6f6c07813b38ab04d8b1b2bb87c0291dbae25a25 commit 6f6c07813b38ab04d8b1b2bb87c0291dbae25a25 Author: Richard Scheffenegger <rscheff@FreeBSD.org> AuthorDate: 2025-03-21 04:23:30 +0000 Commit: Richard Scheffenegger <rscheff@FreeBSD.org> CommitDate: 2025-03-21 04:28:31 +0000 tcp: fix reverting of spurious timeouts (RTO) One variable that became critical to correctly calculate the cwnd during limited transmit was not properly reverted on detection of spurious timeouts. PR: 282605 Reviewed By: cc, tuexen, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D48652 sys/netinet/tcp_input.c | 1 + 1 file changed, 1 insertion(+)
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=4a328b8049962976d13b8fb9abe2d9a488f2a894 commit 4a328b8049962976d13b8fb9abe2d9a488f2a894 Author: Richard Scheffenegger <rscheff@FreeBSD.org> AuthorDate: 2025-03-21 04:23:30 +0000 Commit: Richard Scheffenegger <rscheff@FreeBSD.org> CommitDate: 2025-03-24 05:35:43 +0000 tcp: fix reverting of spurious timeouts (RTO) One variable that became critical to correctly calculate the cwnd during limited transmit was not properly reverted on detection of spurious timeouts. PR: 282605 Reviewed By: cc, tuexen, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D48652 (cherry picked from commit 6f6c07813b38ab04d8b1b2bb87c0291dbae25a25) sys/netinet/tcp_input.c | 1 + 1 file changed, 1 insertion(+)
It's starting to look like the issue has been fixed in the recent commits. At the very least, I haven't had a kernel panic in two days.
I just hit a case this morning on a kernel at 67c1c4dfd1, which has both of these included: fbcf3b74e8f2c0c5 ("tcp: fix detection of bad RTOs") 6f6c07813b38ab04 ("tcp: fix reverting of spurious timeouts (RTO)") panic: tcp_do_segment: sent too much cpuid = 1 time = 1742996303 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe015b6b55b0 vpanic() at vpanic+0x136/frame 0xfffffe015b6b56e0 panic() at panic+0x43/frame 0xfffffe015b6b5740 tcp_do_segment() at tcp_do_segment+0x28ed/frame 0xfffffe015b6b5820 tcp_input_with_port() at tcp_input_with_port+0x1140/frame 0xfffffe015b6b5970 tcp6_input_with_port() at tcp6_input_with_port+0x6a/frame 0xfffffe015b6b59a0 tcp6_input() at tcp6_input+0xb/frame 0xfffffe015b6b59b0 ip6_input() at ip6_input+0xc98/frame 0xfffffe015b6b5a90 netisr_dispatch_src() at netisr_dispatch_src+0xb4/frame 0xfffffe015b6b5af0 ether_demux() at ether_demux+0x16a/frame 0xfffffe015b6b5b20 ether_nh_input() at ether_nh_input+0x3cd/frame 0xfffffe015b6b5b70 netisr_dispatch_src() at netisr_dispatch_src+0xb4/frame 0xfffffe015b6b5bd0 ether_input() at ether_input+0xd5/frame 0xfffffe015b6b5c30 ieee80211_deliver_data() at ieee80211_deliver_data+0xeb/frame 0xfffffe015b6b5c80 sta_input() at sta_input+0xcda/frame 0xfffffe015b6b5d40 ieee80211_input_mimo() at ieee80211_input_mimo+0x20c/frame 0xfffffe015b6b5df0 lkpi_80211_lhw_rxq_task() at lkpi_80211_lhw_rxq_task+0x143/frame 0xfffffe015b6b5e40 taskqueue_run_locked() at taskqueue_run_locked+0x1c2/frame 0xfffffe015b6b5ec0 taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe015b6b5ef0 fork_exit() at fork_exit+0x82/frame 0xfffffe015b6b5f30 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe015b6b5f30 --- trap 0xd229bf4a, rip = 0x79b36f1efc3c4ce3, rsp = 0xd7e96fa6591692e4, rbp = 0xa979fdd81e679dee ---
(In reply to Kyle Evans from comment #46) Was BBLoging enabled?
(In reply to Michael Tuexen from comment #47) Yes, allegedly- not sure how to confirm/extract anything useful from it, though.
(In reply to Kyle Evans from comment #48) Please do the following: 1. Download tcplog.py from review D48705. 2. Start kgdb with the relevant core. 3. source tcplog.py 4. go up the stack until you have the tp as a variable 5. tcplog_dump tp 6. quit kgdb.Now you should have a *.bin file. Please send this to tuexen@freebsd.org. If this does not work, please report.
Just got this crash as well, just on the older stable/14, so probably no new useful info from me. I have the vmcore, just in case. panic: tcp_do_segment: sent too much cpuid = 3 time = 1743052191 KDB: stack backtrace: db_trace_self_wrapper() at 0xffffffff8061eadb = db_trace_self_wrapper+0x2b/frame 0xfffffe0120fda760 kdb_backtrace() at 0xffffffff80956a57 = kdb_backtrace+0x37/frame 0xfffffe0120fda810 vpanic() at 0xffffffff80907629 = vpanic+0x169/frame 0xfffffe0120fda950 panic() at 0xffffffff80907403 = panic+0x43/frame 0xfffffe0120fda9b0 tcp_do_segment() at 0xffffffff80a82864 = tcp_do_segment+0x1c24/frame 0xfffffe0120fdaa90 tcp_input_with_port() at 0xffffffff80a8066d = tcp_input_with_port+0x12bd/frame 0xfffffe0120fdabc0 tcp_input() at 0xffffffff80a80c2b = tcp_input+0xb/frame 0xfffffe0120fdabd0 ip_input() at 0xffffffff80a735ae = ip_input+0x25e/frame 0xfffffe0120fdac10 netisr_dispatch_src() at 0xffffffff80a436ff = netisr_dispatch_src+0x8f/frame 0xfffffe0120fdac70 netisr_dispatch() at 0xffffffff80a43a6e = netisr_dispatch+0xe/frame 0xfffffe0120fdac80 ether_demux() at 0xffffffff80a39cc9 = ether_demux+0x169/frame 0xfffffe0120fdacb0 ether_nh_input() at 0xffffffff80a3afe0 = ether_nh_input+0x2c0/frame 0xfffffe0120fdace0 netisr_dispatch_src() at 0xffffffff80a436ff = netisr_dispatch_src+0x8f/frame 0xfffffe0120fdad40 netisr_dispatch() at 0xffffffff80a43a6e = netisr_dispatch+0xe/frame 0xfffffe0120fdad50 ether_input() at 0xffffffff80a3a115 = ether_input+0xa5/frame 0xfffffe0120fdada0 if_input() at 0xffffffff80a2fa2a = if_input+0xa/frame 0xfffffe0120fdadb0 re_rxeof() at 0xffffffff806ff873 = re_rxeof+0x463/frame 0xfffffe0120fdae20 re_intr_msi() at 0xffffffff806fd747 = re_intr_msi+0xc7/frame 0xfffffe0120fdae60 ithread_loop() at 0xffffffff808c67ca = ithread_loop+0x2aa/frame 0xfffffe0120fdaef0 fork_exit() at 0xffffffff808c2b47 = fork_exit+0xc7/frame 0xfffffe0120fdaf30 fork_trampoline() at 0xffffffff80c5034e = fork_trampoline+0xe/frame 0xfffffe0120fdaf30
syzkaller just hit the bug on most recent CURRENT: https://syzkaller.appspot.com/bug?extid=e4418498cc95b1cf5c16
(In reply to Gleb Smirnoff from comment #51) Yepp. However, syzkaller does not help here. However, I do have a BBLog file from Kyle which I will use to try to get a reproducer using packetdrill. There is a least one more codepath triggering the assertion...
I run HARDENEDBSD-14-STABLE and have been seeing this. I grabbed source for stable-14 and compiled a kernel on 26 March, validating that tcp_input.c was post 4a328b8049962976d13b8fb9abe2d9a488f2a894. The crashes seem decidedly fewer, but i got this one yesterday: [1692] panic: tcp_do_segment: sent too much [1692] cpuid = 2 [1692] time = 1742987244 [1692] __HardenedBSD_version = 1400007 __FreeBSD_version = 1402501 [1692] version = FreeBSD 14.2-STABLE-HBSD HARDENEDBSD-14-STABLE [1692] KDB: stack backtrace: [1692] #0 0xffffffff80bb8aeb at kdb_backtrace+0x7b [1692] #1 0xffffffff80b69af6 at vpanic+0x166 [1692] #2 0xffffffff80b698e3 at panic+0x43 [1692] #3 0xffffffff80d56ec1 at tcp_do_segment+0x2b91 [1692] #4 0xffffffff80d537f0 at tcp_input_with_port+0x11f0 [1692] #5 0xffffffff80d5431b at tcp_input+0xb [1692] #6 0xffffffff80d414f3 at ip_input+0x2c3 [1692] #7 0xffffffff80cbd4c4 at netisr_dispatch_src+0xc4 [1692] #8 0xffffffff80c9f47a at ether_demux+0x17a [1692] #9 0xffffffff80ca0c27 at ether_nh_input+0x417 [1692] #10 0xffffffff80cbd4c4 at netisr_dispatch_src+0xc4 [1692] #11 0xffffffff80c9f8e9 at ether_input+0xe9 [1692] #12 0xffffffff80d5db4f at tcp_lro_flush+0x35f [1692] #13 0xffffffff80d5e027 at tcp_lro_rx_done+0x37 [1692] #14 0xffffffff80d5dd48 at tcp_lro_flush_all+0x178 [1692] #15 0xffffffff80cb92ad at iflib_rxeof+0xead [1692] #16 0xffffffff80cb2ff5 at _task_fn_rx+0x85 [1692] #17 0xffffffff80bb73c3 at gtaskqueue_run_locked+0x1a3 I have applied the BlackBox Logging sysctls and am running tcplog_dumper, and will report back if (when) another crash occurs.
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=e754d89bba8c526fd1ff792616ff4cbf9c6e093e commit e754d89bba8c526fd1ff792616ff4cbf9c6e093e Author: Michael Tuexen <tuexen@FreeBSD.org> AuthorDate: 2025-03-20 15:17:40 +0000 Commit: Michael Tuexen <tuexen@FreeBSD.org> CommitDate: 2025-04-04 09:05:10 +0000 tcp: fix detection of bad RTOs If timestamps are enabled, the actions performed by a retransmission timeout were rolled back, when they should not. It is needed to make sure the incoming segment advances SND.UNA. To do this, remove the incorrect upfront check and extend the check in the fast path to handle also the case of timestamps. PR: 282605 Reviewed by: cc, rscheff, Peter Lei Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D49414 (cherry picked from commit fbcf3b74e8f2c0c5ba37f1839bfe9395eb2fd0b1) sys/netinet/tcp_input.c | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-)
(In reply to Mike from comment #53) Please update again to include the fix I just MFCed to stable/14. Please continue testing, because we are aware of one other path triggering the KASSERT. However, we have only reports for CURRENT, not stable/14. I would be great to know, if stable/14 is also affected.
(In reply to Michael Tuexen from comment #55) I realize I have been sloppy with my data and previous post, and need to correct/clarify the info I provided before, and call attention to the fact that the dump trace i provided is from *before* the 24 March patch, which I mistakenly associated with a crash on 3 April on the patched kernel. Data Summary (too much information) ============ - The system had 13 unexpected reboots between an intentional reboot 6 Jan and 5 April at intervals varying from 13 minutes to 17 days - Crashdumps were captured 3 times, all with a "savecore[xxxxx] reboot after panic: tcp_do_segment: sent too much" message - The last crash on an unpatched kernel was applied was 26 March which is the dump trace I provided - A kernel with the patch from 24 March was installed 26 March after that crash - The system crashed 3 April on the patched kernel, and did not leave a dump behind, so there is no direct evidence that it is caused by this issue, but the affected system has never crashed for *any* reason on earlier kernels. Current status ============== - The system is running the HBSD 14-STABLE in which your commit is the last applied: 14.2-STABLE-HBSD FreeBSD 14.2-STABLE-HBSD hardened/14-stable/master-n193911-ce071b70191e HARDENEDBSD-14-STABLE amd64 - I have applied the BlackBox Logging sysctls and am running tcplog_dumper, and will report back if (when) another crash occurs.
A couple days ago I got this panic with a -CURRENT kernel built after the recent bugfix commits. The machine was busy building packages and my internet connection was flakey. Mon Apr 14 18:19:44 PDT 2025 FreeBSD zipper.catspoiler.org 15.0-CURRENT FreeBSD 15.0-CURRENT #6 n276073-67c1c 4dfd1cc: Mon Mar 24 19:27:24 PDT 2025 dl@zipper.catspoiler.org:/usr/obj/usr/ src/amd64.amd64/sys/GENERIC amd64 panic: tcp_do_segment: sent too much Reading symbols from /boot/kernel/kernel... Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug... Reading symbols from /boot/kernel/linux.ko... Reading symbols from /usr/lib/debug//boot/kernel/linux.ko.debug... Reading symbols from /boot/kernel/linux_common.ko... Reading symbols from /usr/lib/debug//boot/kernel/linux_common.ko.debug... Reading symbols from /boot/kernel/mqueuefs.ko... Reading symbols from /usr/lib/debug//boot/kernel/mqueuefs.ko.debug... Reading symbols from /boot/kernel/linux64.ko... Reading symbols from /usr/lib/debug//boot/kernel/linux64.ko.debug... Reading symbols from /boot/kernel/geom_mirror.ko... Reading symbols from /usr/lib/debug//boot/kernel/geom_mirror.ko.debug... Reading symbols from /boot/kernel/zfs.ko... Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug... Reading symbols from /boot/kernel/fdescfs.ko... Reading symbols from /usr/lib/debug//boot/kernel/fdescfs.ko.debug... Reading symbols from /boot/kernel/acpi_wmi.ko... Reading symbols from /usr/lib/debug//boot/kernel/acpi_wmi.ko.debug... Reading symbols from /boot/kernel/intpm.ko... Reading symbols from /usr/lib/debug//boot/kernel/intpm.ko.debug... Reading symbols from /boot/kernel/smbus.ko... Reading symbols from /usr/lib/debug//boot/kernel/smbus.ko.debug... Reading symbols from /boot/kernel/uhid.ko... Reading symbols from /usr/lib/debug//boot/kernel/uhid.ko.debug... Reading symbols from /boot/kernel/ums.ko... Reading symbols from /usr/lib/debug//boot/kernel/ums.ko.debug... Reading symbols from /boot/kernel/nullfs.ko... Reading symbols from /usr/lib/debug//boot/kernel/nullfs.ko.debug... Reading symbols from /boot/kernel/mac_ntpd.ko... Reading symbols from /usr/lib/debug//boot/kernel/mac_ntpd.ko.debug... Reading symbols from /boot/kernel/linprocfs.ko... Reading symbols from /usr/lib/debug//boot/kernel/linprocfs.ko.debug... __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 57 __asm("movq %%gs:%c1,%0" : "=r" (td) (kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 td = <optimized out> #1 doadump (textdump=textdump@entry=0) at /usr/src/sys/kern/kern_shutdown.c:404 error = 0 coredump = <optimized out> #2 0xffffffff804a34ba in db_dump (dummy=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:596 error = <optimized out> #3 0xffffffff804a32ad in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=true) at /usr/src/sys/ddb/db_command.c:508 modif = "\000\000\000\000\000\000\000\000\024P\235\201\377\377\377\377x\ 006D\003\000\370\377\377\000\000\000\000\000\000\000\000\b\000\000\0000\000\000\ 000\320q\253Z\001\376\377\377pq\253Z\001\376\377\377\024P\235\201\377\377\377\37 7\020r\253Z\001\376\377\377dYJ\200\377\377\377\377@W\373\003\001\370\377\3770\36 3.\201\377\377\377\377\020P\235\201\377\377\377\377\000\000\000\000\000\000\000\ 000@W\373\003\001\370\377\377" addr = -2135263629 count = -1 cmd = 0xffffffff8181a8b8 <db_cmds+616> have_addr = <optimized out> t = <optimized out> result = <optimized out> #4 0xffffffff804a2f6d in db_command_loop () at /usr/src/sys/ddb/db_command.c:555 No locals. #5 0xffffffff804a6946 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:267 jb = {{_jb = {0, -2193207102920, -2193207102784, -8791731251392, -2122206128, -2193207102912, -2193207102144, -2142607148, -2193207102816, -2137415265, -2118532256, 3}}} bkpt = false watchpt = false prev_jb = 0x0 why = <optimized out> #6 0xffffffff80ba80ff in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe015aab7540) at /usr/src/sys/kern/subr_kdb.c:790 __pc = 0x0 __pc = 0x0 other_cpus = {__bits = {18446744071581860125, 18446741880502449296, 13736095756378376981, 0, 17, 18446744071574325360, 18446744071580898254, 18446741880502449344, 18446741880502449328, 18446744071574320390, 4294967294, 44470858976, 0, 18446744071580898253, 2167275808, 0}} be = 0xffffffff8181b450 <ddb_dbbe> intr = 70 did_stop_cpus = <optimized out> handled = <optimized out> #7 0xffffffff8109353b in trap (frame=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:608 __pc = 0x0 __pc = 0x0 __pc = 0x0 ksi = {ksi_link = {tqe_next = 0x11804a6adc, tqe_prev = 0xfffffe015aab7478}, ksi_info = {si_signo = 1521185992, si_errno = -511, si_code = 0, si_pid = 0, si_uid = 1521185976, si_status = -511, si_addr = 0xbea06a503faf0b15, si_value = { sival_int = 1004, sival_ptr = 0x3ec, sigval_int = 1004, sigval_ptr = 0x3ec}, _reason = {_fault = {_trapno = 1521188672}, _timer = {_timerid = 1521188672, _overrun = -511}, _mesgq = { _mqd = 1521188672}, _poll = {_band = -2193207099584}, _capsicum = {_syscall = 1521188672}, __spare__ = { __spare1__ = -2193207099584, __spare2__ = {0, 0, 0, 0, 1, 0, 1068436245}}}}, ksi_flags = 256, ksi_sigq = 0x0} signo = 1509952099 ucode = -511 td = <optimized out> p = <optimized out> dr6 = 0 type = 3 addr = <optimized out> pf = <optimized out> i = <optimized out> #8 <signal handler called> No locals. #9 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556 No locals. #10 0xffffffff80b57f0b in vpanic (fmt=0xffffffff81210007 "%s: sent too much", ap=ap@entry=0xfffffe015aab7770) at /usr/src/sys/kern/kern_shutdown.c:967 buf = "tcp_do_segment: sent too much", '\000' <repeats 226 times> __pc = 0x0 __pc = 0x0 __pc = 0x0 other_cpus = {__bits = {65534, 0 <repeats 15 times>}} td = 0xfffff80103fb5740 bootopt = <optimized out> newpanic = <optimized out> #11 0xffffffff80b57d73 in panic ( fmt=0xffffffff81b9c3a0 <cnputs_mtx> "G?\027\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:892 ap = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0xfffffe015aab77a0, reg_save_area = 0xfffffe015aab7740}} #12 0xffffffff80d5279d in tcp_do_segment (tp=0xfffff8045f238000, m=<optimized out>, th=0xfffff8008a06c396, drop_hdrlen=72, tlen=0, iptos=<optimized out>) at /usr/src/sys/netinet/tcp_input.c:1546 to = {to_flags = 16, to_tsval = 1014279682, to_tsecr = 1048849, to_sacks = 0x106 <error: Cannot access memory at address 0x106>, to_signature = 0xffffffff81cded58 <mac_policy_rm> "7v#\201\377\377\377 \377", to_tfo_cookie = 0xffffffff811eab40 "/usr/src/sys/security/mac/mac_fram ework.c", to_mss = 30720, to_wscale = 171 '\253', to_nsacks = 90 'Z', to_tfo_len = 1 '\001', to_spare = 2159349026} maxseg = 1408 inp = 0xfffff8045f238000 needoutput = 0 incforsyn = <optimized out> so = 0xfffff8007b541400 inc = <optimized out> thflags = <optimized out> sack_changed = <optimized out> nsegs = 1 s = <optimized out> tiwin = <optimized out> rstreason = <optimized out> todrop = <optimized out> acked = <optimized out> tfo_syn = <optimized out> mfree = <optimized out> ourfinisacked = <optimized out> win = <optimized out> close = <optimized out> #13 0xffffffff80d4f1e0 in tcp_input_with_port (mp=mp@entry=0xfffffe015aab7a48, offp=offp@entry=0xfffffe015aab7a40, proto=<optimized out>, port=0) at /usr/src/sys/netinet/tcp_input.c:1165 so = 0xfffff8007b541400 to = {to_flags = 0, to_tsval = 4294967295, to_tsecr = 2195783824, to_sacks = 0xfffff8008aa76000 "\205\351\"\201\377\377\377\377", to_signature = 0x4133a5d681cded58 <error: Cannot access memory at addr ess 0x4133a5d681cded58>, to_tfo_cookie = 0x1 <error: Cannot access memory at address 0x1>, to_mss = 63232, to_wscale = 88 'X', to_nsacks = 138 '\212', to_tfo_len = 0 '\000', to_spare = 2321086358} m = 0xfffff8008a06c300 th = 0xfffff8008a06c396 ip = 0x0 inp = <optimized out> tp = 0xffffffff81b9c3a0 <cnputs_mtx> optp = 0xfffff8008a06c3aa "\001\001\b\n<t\256\002\320\0038{\255\336\336\ 300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\ 300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\ 300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\ 300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\ 300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\ 300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\ 300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\ 300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\ 300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\ 300\255\336\336\300"... optlen = 12 tlen = <optimized out> rstreason = <optimized out> fwd_tag = 0x0 ip6 = 0xfffff8008a06c36e s = 0x0 off0 = <optimized out> iptos = 128 '\200' off = <optimized out> len = <optimized out> ipttl = <optimized out> thflags = <optimized out> drop_hdrlen = 72 lookupflag = <optimized out> isipv6 = <optimized out> #14 0xffffffff80d4dffa in tcp6_input_with_port (mp=0xfffffe015aab7a48, offp=0xfffffe015aab7a40, proto=<optimized out>, port=port@entry=0) at /usr/src/sys/netinet/tcp_input.c:600 m = 0xfffff8008a06c300 ip6 = <optimized out> ia6 = 0x12 #15 0xffffffff80d4f94b in tcp6_input (mp=0xffffffff81b9c3a0 <cnputs_mtx>, offp=0xfffffe015aab74c0, proto=-2128026873) at /usr/src/sys/netinet/tcp_input.c:607 No locals. #16 0xffffffff80d9a5e8 in ip6_input (m=0x0) at /usr/src/sys/netinet6/ip6_input.c:941 odst = {__u6_addr = { __u6_addr8 = "\000\367X\212\000\370\377\377\001\000\000\000\000\000\ 000", __u6_addr16 = {63232, 35416, 63488, 65535, 1, 0, 0, 0}, __u6_addr32 = { 2321086208, 4294965248, 1, 0}}} plen = 32 rtalert = 4294967295 off = 40 ours = 1 nxt = <optimized out> srcrt = <optimized out> rcvif = 0xfffff80117541800 ip6 = <optimized out> ia = <optimized out> nest = 1 #17 0xffffffff80cb4b64 in netisr_dispatch_src (proto=proto@entry=6, source=source@entry=0, m=0xfffff8008a06c300) at /usr/src/sys/net/netisr.c:1151 __pc = 0x0 __pc = 0x0 __pc = 0x0 cpuid = 0 dosignal = 0 npp = 0xffffffff81cc8070 <netisr_proto+336> dispatch_policy = 2166940423 nwsp = <optimized out> npwp = <optimized out> error = <optimized out> out_unlock = <optimized out> #18 0xffffffff80cb4eef in netisr_dispatch (proto=2176435104, proto@entry=6, m=0xffffffff8128e307) at /usr/src/sys/net/netisr.c:1242 No locals. #19 0xffffffff80c9737a in ether_demux (ifp=ifp@entry=0xfffff80117541800, m=0xfffff8008a06c300) at /usr/src/sys/net/if_ethersubr.c:942 i = <optimized out> eh = <optimized out> ether_type = <optimized out> isr = 6 #20 0xffffffff80c98a3d in ether_input_internal (ifp=0xfffff80117541800, m=0xfffff8008a06c300) at /usr/src/sys/net/if_ethersubr.c:706 eh = 0xfffff8008a06c360 saved_vnet = 0xfffff8000342cb80 etype = <optimized out> evl = <optimized out> #21 ether_nh_input (m=<optimized out>) at /usr/src/sys/net/if_ethersubr.c:736 No locals. #22 0xffffffff80cb4b64 in netisr_dispatch_src (proto=proto@entry=5, source=source@entry=0, m=0xfffff8008a06c300) at /usr/src/sys/net/netisr.c:1151 __pc = 0x0 __pc = 0x0 __pc = 0x0 cpuid = 0 dosignal = 838860800 npp = 0xffffffff81cc8038 <netisr_proto+280> dispatch_policy = 2166940423 nwsp = <optimized out> npwp = <optimized out> error = <optimized out> out_unlock = <optimized out> #23 0xffffffff80cb4eef in netisr_dispatch (proto=2176435104, proto@entry=5, m=0xffffffff8128e307) at /usr/src/sys/net/netisr.c:1242 No locals. #24 0xffffffff80c977c5 in ether_input (ifp=0xfffff80117541800, m=0xfffffe015aab74c0) at /usr/src/sys/net/if_ethersubr.c:847 printedonce = false et = {et_link = {tqe_next = 0x0, tqe_prev = 0x0}, et_td = 0x0, et_section = {bucket = 0}, et_old_priority = 0 '\000'} needs_epoch = <optimized out> saved_vnet = 0x0 mn = 0x0 #25 0xffffffff80d5971f in tcp_lro_flush (lc=lc@entry=0xfffff80117533830, le=le@entry=0xfffffe015d88e690) at /usr/src/sys/netinet/tcp_lro.c:1117 No locals. #26 0xffffffff80d59993 in tcp_lro_rx_done (lc=<optimized out>) at /usr/src/sys/netinet/tcp_lro.c:592 le = 0xfffffe015d88e690 #27 tcp_lro_flush_all (lc=lc@entry=0xfffff80117533830) at /usr/src/sys/netinet/tcp_lro.c:1260 __pc = 0x0 __pc = 0x0 saved_vnet = <optimized out> seq = <optimized out> x = <optimized out> nseq = <optimized out> done = <optimized out> #28 0xffffffff80cb0787 in iflib_rxeof (rxq=rxq@entry=0xfffff80117533800, budget=<optimized out>) at /usr/src/sys/net/iflib.c:3095 ri = {iri_qsidx = 0, iri_vtag = 0, iri_len = 86, iri_cidx = 325, iri_ifp = 0xfffff80117541800, iri_frags = 0xfffff801175339c0, iri_flowid = 911985026, iri_csum_flags = 201326592, iri_csum_data = 65535, iri_flags = 0 '\000', iri_nfrags = 1 '\001', iri_rsstype = 134 '\206', iri_pad = 0 '\000'} ctx = 0xfffff80117546c00 sctx = 0xffffffff81836340 <igb_sctx_init> scctx = 0xfffff80117546ca8 retval = <optimized out> v6_forwarding = <optimized out> v4_forwarding = <optimized out> lro_possible = <optimized out> ifp = 0xfffff80117541800 mt = <optimized out> mh = <optimized out> rx_bytes = <optimized out> rx_pkts = <optimized out> cidxp = 0xfffff8011752aa00 avail = 0 saved_vnet = 0x0 budget_left = <optimized out> err = <optimized out> m = <optimized out> i = <optimized out> fl = <optimized out> mf = <optimized out> lro_enabled = <optimized out> #29 0xffffffff80caa99a in _task_fn_rx (context=0xfffff80117533800) at /usr/src/sys/net/iflib.c:4158 __pc = 0x0 work = 0 rxq = 0xfffff80117533800 ctx = 0xfffff80117546c00 nmirq = <optimized out> more = <optimized out> budget = 29888 #30 0xffffffff80ba62ae in gtaskqueue_run_locked ( queue=queue@entry=0xfffff800036d6000) at /usr/src/sys/kern/subr_gtaskqueue.c:368 et = {et_link = {tqe_next = 0x0, tqe_prev = 0xfffffe0069f49ad8}, et_td = 0xfffff80103fb5740, et_section = {bucket = 0}, et_old_priority = 8 '\b'} tb = {tb_running = 0xfffff801175338b0, tb_seq = 3188829, tb_link = { le_next = 0x0, le_prev = 0xfffff800036d6010}} in_net_epoch = true in_net_epoch = true gtask = <optimized out> #31 0xffffffff80ba5fe3 in gtaskqueue_thread_loop ( arg=arg@entry=0xfffffe0116741008) at /usr/src/sys/kern/subr_gtaskqueue.c:544 tqp = <optimized out> tq = 0xfffff800036d6000 #32 0xffffffff80b0b152 in fork_exit ( callout=0xffffffff80ba5f10 <gtaskqueue_thread_loop>, arg=0xfffffe0116741008, frame=0xfffffe015aab7f40) at /usr/src/sys/kern/kern_fork.c:1152 __pc = 0x0 __pc = 0x0 td = 0xfffff80103fb5740 p = 0xffffffff81b9b060 <proc0> dtd = <optimized out> #33 <signal handler called> No locals.
Created attachment 259712 [details] Another reproducer
(In reply to Michael Tuexen from comment #58) Thanks to Kyle I was able to produce a reproducer for another problem which triggers the "sent too much" KASSERT. Please note that this happens on current and stable/14. (0) The TCP connection is established with SACK, without TS, an MSS of 1220, and TCP_NODELAY is enabled. (1) The application sends 2020 bytes. (2) The TCP stacks sends a segment with 1220 bytes and one with 800 bytes. Both segments are lost. (3) The timer based retransmission of the first segment happens. CWND is now 1220. (4) When the ACK is received, the second segment is retransmitted. CWND is now 2440. (5) When the ACK is received, CWND becomes 3240. This is the precondition. The important point is that 2 * MSS < CWND < 3 * MSS holds. Now the following happens: (6) The application sends 1018 bytes. (7) The TCP stacks sends a segment with 1018 bytes. This segment is lost. (8) The application sends 2703 bytes. (9) The TCP stack sends a segment with 1220 bytes. The number of bytes not yet sent is 2703 - 1220 = 1483, the space left in the congestion window is 3240 - 1018 - 1220 = 1002. So no more segments are sent right now. (10) The segment with 1220 bytes is SACKed. (11) Therefore, the TCP stack sends out the next 1220 bytes of the remaining 1483 bytes. (12) Since the Nagle algorithm is turned off, the remaining 263 bytes are also send out, because the congestion control allows it. The KASSERT triggers, sind in (11) and (12) more than an MSS are sent out. For me, the sending looks completely valid. So we need to weaken the KASSERT to cover these cases or remove it at all.
A patch is under review D49922.
(In reply to Michael Tuexen from comment #59) Awesome, thanks for investigating that!
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=934caaec3afc43638c2a1da8fbe3b672566db4fe commit 934caaec3afc43638c2a1da8fbe3b672566db4fe Author: Michael Tuexen <tuexen@FreeBSD.org> AuthorDate: 2025-05-01 16:11:03 +0000 Commit: Michael Tuexen <tuexen@FreeBSD.org> CommitDate: 2025-05-01 16:11:03 +0000 tcp: improve KASSERT in limited retransmit When doing a limited retransmit, allow up to 2 * MSS - 1 if the Nagle algorithm has been disabled. PR: 282605 Reviewed by: cc, Peter Lei MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D49922 sys/netinet/tcp_input.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=0906658c3409996b26518e67df48c01052ef934c commit 0906658c3409996b26518e67df48c01052ef934c Author: Michael Tuexen <tuexen@FreeBSD.org> AuthorDate: 2025-05-01 16:11:03 +0000 Commit: Michael Tuexen <tuexen@FreeBSD.org> CommitDate: 2025-05-06 19:15:39 +0000 tcp: improve KASSERT in limited retransmit When doing a limited retransmit, allow up to 2 * MSS - 1 if the Nagle algorithm has been disabled. PR: 282605 Reviewed by: cc, Peter Lei Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D49922 (cherry picked from commit 934caaec3afc43638c2a1da8fbe3b672566db4fe) sys/netinet/tcp_input.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
A commit in branch releng/14.3 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=976aa65e3ea48f8864149dcbfc6f912a7ca2b67a commit 976aa65e3ea48f8864149dcbfc6f912a7ca2b67a Author: Michael Tuexen <tuexen@FreeBSD.org> AuthorDate: 2025-05-01 16:11:03 +0000 Commit: Michael Tuexen <tuexen@FreeBSD.org> CommitDate: 2025-05-06 19:18:24 +0000 tcp: improve KASSERT in limited retransmit When doing a limited retransmit, allow up to 2 * MSS - 1 if the Nagle algorithm has been disabled. PR: 282605 Approved by: re (cperciva) Reviewed by: cc, Peter Lei Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D49922 (cherry picked from commit 934caaec3afc43638c2a1da8fbe3b672566db4fe) (cherry picked from commit 0906658c3409996b26518e67df48c01052ef934c) sys/netinet/tcp_input.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
Why is the see also link added? Looks like a different problem to me.
(In reply to Michael Tuexen from comment #65) Yeah, sorry. It looked very similar.
(In reply to Andrey V. Elsukov from comment #66) Thanks for the quick reaction.