Sending stress TCP with offloaded NAT-T configuration panic: tcp_do_segment: sent too much BT #0 __curthread () at /usr/kernel_git/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=textdump@entry=1) at /usr/kernel_git/sys/kern/kern_shutdown.c:404 #2 0xffffffff80b3b7a0 in kern_reboot (howto=260) at /usr/kernel_git/sys/kern/kern_shutdown.c:524 #3 0xffffffff80b3bcd7 in vpanic (fmt=0xffffffff81172d6e "%s: sent too much", ap=ap@entry=0xfffffe010a82db20) at /usr/kernel_git/sys/kern/kern_shutdown.c:979 #4 0xffffffff80b3bb03 in panic (fmt=<unavailable>) at /usr/kernel_git/sys/kern/kern_shutdown.c:892 #5 0xffffffff80d34372 in tcp_do_segment (tp=0xfffff803402baa80, tp@entry=<error reading variable: value is not available>, m=<optimized out>, m@entry=<error reading variable: value is not available>, th=0xfffff802fd295e84, th@entry=<error reading variable: value is not available>, drop_hdrlen=64, drop_hdrlen@entry=<error reading variable: value is not available>, tlen=0, tlen@entry=<error reading variable: value is not available>, iptos=<unavailable>, iptos@entry=<error reading variable: value is not available>) at /usr/kernel_git/sys/netinet/tcp_input.c:1548 #6 0xffffffff80d30d98 in tcp_input_with_port (mp=<optimized out>, offp=<optimized out>, proto=<optimized out>, port=port@entry=0) at /usr/kernel_git/sys/netinet/tcp_input.c:1158 #7 0xffffffff80d31a5b in tcp_input (mp=<unavailable>, offp=<unavailable>, proto=<unavailable>) at /usr/kernel_git/sys/netinet/tcp_input.c:1502 #8 0xffffffff80d1e3af in ip_input (m=0x0, m@entry=<error reading variable: value is not available>) at /usr/kernel_git/sys/netinet/ip_input.c:857 #9 0xffffffff80c98b7b in netisr_process_workstream_proto (nwsp=0xffffffff823c0a00, proto=1) at /usr/kernel_git/sys/net/netisr.c:927 #10 swi_net (arg=0xffffffff823c0a00) at /usr/kernel_git/sys/net/netisr.c:974 #11 0xffffffff80af3b56 in intr_event_execute_handlers (ie=0xfffff800035ee600, p=<optimized out>) at /usr/kernel_git/sys/kern/kern_intr.c:1183 #12 ithread_execute_handlers (ie=0xfffff800035ee600, p=<optimized out>) at /usr/kernel_git/sys/kern/kern_intr.c:1196 #13 ithread_loop (arg=arg@entry=0xfffff800033cdf60) at /usr/kernel_git/sys/kern/kern_intr.c:1289 #14 0xffffffff80aeff52 in fork_exit (callout=0xffffffff80af38f0 <ithread_loop>, arg=0xfffff800033cdf60, frame=0xfffffe010a82df40) at /usr/kernel_git/sys/kern/kern_fork.c:1151 #15 <signal handler called> SHA ID of the kernel: 2ce493e1693b55a330079ac5fce8beb66e26ddeb
There have been a couple of TCP commits since your revision - is it possible that this is already fixed?
Created attachment 255012 [details] packetdrill reproducer
(In reply to Mark Johnston from comment #1) It is not yet fixed. I attached a packetdrill reproducer. I am right now testing review D47474 which seems to avoid triggering the relevant KASSERT so far. The reproducer also does not trigger the KASSERT anymore with review D47474. But it is not fully understood, what is going on.
Agree, D47474 masks the problem, which appears to be an interaction between the new tcp transmission selection code, and the combination of RTO followed by a SACK loss recovery in close succession.
I just run into this panic, on the ipv6 side I would guess (current as of 2024-10-30-120714): ---snip--- [365136] panic: tcp_do_segment: sent too much [365136] cpuid = 1 [365136] time = 1731354815 [365136] KDB: stack backtrace: [365136] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe04314f7790 [365136] vpanic() at vpanic+0x136/frame 0xfffffe04314f78c0 [365136] panic() at panic+0x43/frame 0xfffffe04314f7920 [365136] tcp_do_segment() at tcp_do_segment+0x2852/frame 0xfffffe04314f79f0 [365136] tcp_input_with_port() at tcp_input_with_port+0x10e2/frame 0xfffffe04314f7b40 [365136] tcp6_input_with_port() at tcp6_input_with_port+0x6a/frame 0xfffffe04314f7b70 [365136] tcp6_input() at tcp6_input+0xb/frame 0xfffffe04314f7b80 [365136] ip6_input() at ip6_input+0xc76/frame 0xfffffe04314f7c60 [365136] netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe04314f7cc0 [365136] ether_demux() at ether_demux+0x16a/frame 0xfffffe04314f7cf0 [365136] ether_nh_input() at ether_nh_input+0x3cf/frame 0xfffffe04314f7d40 [365136] netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe04314f7da0 [365136] ether_input() at ether_input+0xd5/frame 0xfffffe04314f7e00 [365136] epair_tx_start_deferred() at epair_tx_start_deferred+0xd4/frame 0xfffffe04314f7e40 [365136] taskqueue_run_locked() at taskqueue_run_locked+0x1c7/frame 0xfffffe04314f7ec0 [365136] taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe04314f7ef0 [365136] fork_exit() at fork_exit+0x87/frame 0xfffffe04314f7f30 [365136] fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe04314f7f30 [365136] --- trap 0x3de64570, rip = 0, rsp = 0, rbp = 0 --- [365136] Uptime: 4d5h25m36s [365136] Dumping 50824 out of 73621 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% ---snip-- Parts of the crashdump output (full output available on request): ---snip--- #5 0xffffffff806b6b72 in tcp_do_segment (tp=0xfffff80da7abfa80, m=<optimized out>, th=0xfffff804c5190a96, drop_hdrlen=72, tlen=0, iptos=<optimized out>) at /space/system/usr_src/sys/netinet/tcp_input.c:1548 to = {to_flags = 128, to_tsval = 4294965249, to_tsecr = 4546, to_sacks = 0xfffff804c5190aae "\aa\217\345\aa\225}\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255", <incomplete sequence \336>, to_signature = 0x4 <error: Cannot access memory at address 0x4>, to_tfo_cookie = 0xfffffe04314f79c0 "\340yO1", to_mss = 22215, to_wscale = 111 'o', to_nsacks = 1 '\001', to_tfo_len = 255 '\377', to_spare = 2154633376} maxseg = 1432 inp = 0xfffff80da7abfa80 needoutput = 0 incforsyn = <optimized out> so = 0xfffff8051a0ffc00 inc = <optimized out> thflags = <optimized out> sack_changed = <optimized out> nsegs = 1 s = <optimized out> tiwin = <optimized out> rstreason = <optimized out> todrop = <optimized out> acked = <optimized out> tfo_syn = <optimized out> mfree = <optimized out> ourfinisacked = <optimized out> win = <optimized out> close = <optimized out> #6 0xffffffff806b3602 in tcp_input_with_port (mp=mp@entry=0xfffffe04314f7bc8, offp=offp@entry=0xfffffe04314f7bc0, proto=<optimized out>, port=0) at /space/system/usr_src/sys/netinet/tcp_input.c:1158 so = 0xfffff8051a0ffc00 to = {to_flags = 0, to_tsval = 0, to_tsecr = 719386432, to_sacks = 0xfffff80727e1a038 "\001", to_signature = 0xfffff80727e1a084 "", to_tfo_cookie = 0x5bbafa8e00000073 <error: Cannot access memory at address 0x5bbafa8e00000073>, to_mss = 37456, to_wscale = 5 '\005', to_nsacks = 0 '\000', to_tfo_len = 0 '\000', to_spare = 590855} m = 0xfffff804c5190a00 th = 0xfffff804c5190a96 ip = 0x0 inp = <optimized out> tp = <unavailable> optp = 0xfffff804c5190aaa "\001\001\005\n\aa\217\345\aa\225}\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255\336\336\300\255", <incomplete sequence \336> optlen = 12 tlen = <optimized out> rstreason = <optimized out> fwd_tag = 0x0 ip6 = 0xfffff804c5190a6e s = 0x0 off0 = <optimized out> iptos = 0 '\000' off = <optimized out> len = <optimized out> ipttl = <optimized out> thflags = <optimized out> drop_hdrlen = 72 lookupflag = <optimized out> isipv6 = <optimized out> #7 0xffffffff806b247a in tcp6_input_with_port (mp=0xfffffe04314f7bc8, offp=0xfffffe04314f7bc0, proto=<optimized out>, port=port@entry=0) at /space/system/usr_src/sys/netinet/tcp_input.c:594 m = 0xfffff804c5190a00 ip6 = <optimized out> ia6 = <unavailable> #8 0xffffffff806b3d5b in tcp6_input (mp=<unavailable>, offp=<unavailable>, proto=<unavailable>) at /space/system/usr_src/sys/netinet/tcp_input.c:601 No locals. ---snip---
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=8f5a2e216f4cb955150c8f88ab21eaecc5adc8b9 commit 8f5a2e216f4cb955150c8f88ab21eaecc5adc8b9 Author: Richard Scheffenegger <rscheff@FreeBSD.org> AuthorDate: 2024-11-14 08:19:34 +0000 Commit: Richard Scheffenegger <rscheff@FreeBSD.org> CommitDate: 2024-11-14 08:19:49 +0000 tcp: fix cwnd recalculation during limited transmit Properly calculate the expected flight size (cwnd) during limited transmit. Exclude the SACK scoreboard from consideration when still in limited transmit. PR: 282605 Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D47541 sys/netinet/tcp_input.c | 2 +- sys/netinet/tcp_output.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-)
stable-13 is unaffected. stable-14 may have some exposure: - D43470 was not put in stable-14, which is the main trigger - D43355 (which was backported) does some tweaking of cwnd in the limited transmit branch, where this KASSERT triggers, but I think this has not led to any panics in stable-14. All panics observed where after D43470 was committed to MAIN afaik.
(In reply to Richard Scheffenegger from comment #7) Note that on stable/14, you won't see assertion failures unless running the GENERIC-DEBUG kernel config. In particular, assertions are disabled by default on stable branches.
(In reply to Richard Scheffenegger from comment #7) Let me run the reproducer on stable/14 and report back.
(In reply to Michael Tuexen from comment #9) The reproducer does NOT trigger the panic when running on stable/14 with a GENERIC-DEBUG config.
(In reply to Richard Scheffenegger from comment #7) I have several panics from stable-14 but I haven't dug into them.
(In reply to Peter Jeremy from comment #11) Well, without the backtrace and possibly the core, it will be hard to understand why your environment would be excessively affected. The reproducer of this panic from earlier reports was fixed by D47541 (and masked by D47474). Are you on stable-14 or main or some mixture?
This is 14-STABLE from 2024-11-18 (df1b835333ff). I have some local patches but nothing that should be impacting TCP. It crashed twice in reasonably quick succession but I haven't seen it since then (and I have both cores). Both times, it looked like below. I had a look at applying D47541 but it didn't apply cleanly to 14-stable - I've just had a closer look and I think that was just "patch" getting confused by the context. db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01d43a8770 vpanic() at vpanic+0x131/frame 0xfffffe01d43a88a0 panic() at panic+0x43/frame 0xfffffe01d43a8900 tcp_do_segment() at tcp_do_segment+0x29fd/frame 0xfffffe01d43a89e0 tcp_input_with_port() at tcp_input_with_port+0xf94/frame 0xfffffe01d43a8b20 tcp_input() at tcp_input+0xb/frame 0xfffffe01d43a8b30 ip_input() at ip_input+0x287/frame 0xfffffe01d43a8b80 netisr_dispatch_src() at netisr_dispatch_src+0xad/frame 0xfffffe01d43a8be0 ether_demux() at ether_demux+0x17a/frame 0xfffffe01d43a8c10 ether_nh_input() at ether_nh_input+0x3da/frame 0xfffffe01d43a8c60 netisr_dispatch_src() at netisr_dispatch_src+0xad/frame 0xfffffe01d43a8cc0 ether_input() at ether_input+0xe5/frame 0xfffffe01d43a8d20 _task_fn_rx() at _task_fn_rx+0xea4/frame 0xfffffe01d43a8e40 gtaskqueue_run_locked() at gtaskqueue_run_locked+0x18e/frame 0xfffffe01d43a8ec0 gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xd3/frame 0xfffffe01d43a8ef0 fork_exit() at fork_exit+0x82/frame 0xfffffe01d43a8f30 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01d43a8f30 I've made the full core.txt available at https://home.rulingia.com/~peter/private/core.txt.51
I got the panic again on the stabweek revision ff4c19bb5427, which includes 8f5a2e216f4c aka D47541. Re-opening the bug. Richard, I got a core and can send you any debugging info. Please email me in private what do I need to get from the core for you.
(In reply to Gleb Smirnoff from comment #14) Could you enable BlackBox logging using sysctl net.inet.tcp.bb.log_auto_all=1 sysctl net.inet.tcp.bb.log_auto_ratio=1 sysctl net.inet.tcp.bb.log_auto_mode=1 Also have the tcplog_dumper running. You can use tcplog_dumper_enable="YES" in /etc/rc.conf Using these settings (mode = TCP_LOG_STATE_TAIL), keeps the trailing events of all TCP connections, but does not dump anything to disk. If the kernel panics, we can get the BBLog information from the core. I might be able to construct a packetdrill reproducer from the BBLog file. At least I was able to do it last time.
I'm now seeing this every couple of days on 14-stable (src-c291194-g30bcbf8a2fef) with basically no TCP load. I've had a look at the suggestions in #15 but can't find tcplog_dumper. I have core dumps if that's of any interest.
(In reply to Peter Jeremy from comment #16) You should be able to install the tcplog_dumper by using pkg install tcplog_dumper and then follow the instructions in #15. Once you have then a core file with BBLog information, we can extract it.
I suspect the problematic commit might be the one pasted below. HardenedBSD has multiple users that have reported frequent crashes, except when the below-referenced commit is reverted. The affected HardenedBSD users are not able to produce a crash dump. Please let me know if/how I can be of any assistance. I can have the affected users test candidate patches. Note that it might take several days to get testing results. commit 6b2977c5978b58718d9cddc0a8c31e1bd9aa0361 Author: Richard Scheffenegger <rscheff@FreeBSD.org> AuthorDate: Thu Oct 10 12:50:22 2024 +0200 Commit: Richard Scheffenegger <rscheff@FreeBSD.org> CommitDate: Fri Oct 18 09:51:38 2024 +0200 tcp: fix duplicate retransmissions when RTO happens during SACK loss recovery When snd_nxt doesn't track snd_max, partial SACK ACKs may elicit unexpected duplicate retransmissions. This is usually masked by LRO not necessarily ACKing every individual segment, and prior to RFC6675 SACK loss recovery, harder to trigger even when an RTO happens while SACK loss recovery is ongoing. Address this by improving the logic when to start a SACK loss recovery and how to deal with a RTO, as well as improvements to the adjusted congestion window during transmission selection. Reviewed By: tuexen, cc, #transport Sponsored by: NetApp, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D43355 (cherry picked from commit 440f4ba18e3ab7be912858bbcb96a419fcf14809)
Created attachment 256878 [details] packetdrill reproducer for stable/14
(In reply to Shawn Webb from comment #18) Is this for stable/14 or for CURRENT?
(In reply to Michael Tuexen from comment #20) This is in 14-stable.
(In reply to Shawn Webb from comment #21) Thanks for the confirmation. I am looking into it.
*** Bug 284588 has been marked as a duplicate of this bug. ***
I just hit this on the latest main branch 2025/2/25: panic: tcp_do_segment: sent too much cpuid = 3 time = 1740452298 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00dc30b6e0 vpanic() at vpanic+0x136/frame 0xfffffe00dc30b810 panic() at panic+0x43/frame 0xfffffe00dc30b870 tcp_do_segment() at tcp_do_segment+0x2998/frame 0xfffffe00dc30b950 tcp_input_with_port() at tcp_input_with_port+0x1140/frame 0xfffffe00dc30baa0 tcp_input() at tcp_input+0xb/frame 0xfffffe00dc30bab0 ip_input() at ip_input+0x28d/frame 0xfffffe00dc30bb10 netisr_dispatch_src() at netisr_dispatch_src+0xb4/frame 0xfffffe00dc30bb70 ether_demux() at ether_demux+0x16a/frame 0xfffffe00dc30bba0 ether_nh_input() at ether_nh_input+0x3cd/frame 0xfffffe00dc30bbf0 netisr_dispatch_src() at netisr_dispatch_src+0xb4/frame 0xfffffe00dc30bc50 ether_input() at ether_input+0xd5/frame 0xfffffe00dc30bcb0 hn_xpnt_vf_input() at hn_xpnt_vf_input+0x126/frame 0xfffffe00dc30bd20 tcp_lro_flush() at tcp_lro_flush+0x34f/frame 0xfffffe00dc30bd50 tcp_lro_flush_all() at tcp_lro_flush_all+0x1f3/frame 0xfffffe00dc30bda0 mana_poll() at mana_poll+0x9db/frame 0xfffffe00dc30be40 taskqueue_run_locked() at taskqueue_run_locked+0x1c2/frame 0xfffffe00dc30bec0 taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe00dc30bef0 fork_exit() at fork_exit+0x82/frame 0xfffffe00dc30bf30 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00dc30bf30 --- trap 0x17ee7fe0, rip = 0x3fe908dbc388, rsp = 0x17ee7cb800000001, rbp = 0 --- KDB: enter: panic [ thread pid 0 tid 100194 ] Stopped at kdb_enter+0x33: movq $0,0x1046992(%rip)
I just tripped over this in EC2 with a kernel built from 12 days ago (FreeBSD 15.0-CURRENT #0 main-n275409-6156da866e7d: Thu Feb 13 07:15:24 UTC 2025): tcp_do_segment: sent too much cpuid = 0 time = 1740512855 KDB: stack backtrace: db_trace_self() at db_trace_self db_trace_self_wrapper() at db_trace_self_wrapper+0x38 vpanic() at vpanic+0x1a0 panic() at panic+0x48 tcp_do_segment() at tcp_do_segment+0x2778 tcp_input_with_port() at tcp_input_with_port+0xfd4 tcp_input() at tcp_input+0x10 ip_input() at ip_input+0x354 netisr_dispatch_src() at netisr_dispatch_src+0xd8 ether_demux() at ether_demux+0x174 ether_nh_input() at ether_nh_input+0x374 netisr_dispatch_src() at netisr_dispatch_src+0xd8 ether_input() at ether_input+0xdc ena_cleanup() at ena_cleanup+0x4fc taskqueue_run_locked() at taskqueue_run_locked+0x17c taskqueue_thread_loop() at taskqueue_thread_loop+0xc0 fork_exit() at fork_exit+0x78 fork_trampoline() at fork_trampoline+0x18 KDB: enter: panic [ thread pid 0 tid 100211 ] Stopped at kdb_enter+0x48: str xzr, [x19, #2048] I can leave the system running for a while and have a serial console open if it would be useful to poke at anything before rebooting. I'll kill the instance in a few days if I don't hear from anyone.
Will put up a fix for review shortly.
ran into this today on recent 15.0, screenshot of stack trace: https://i.imgur.com/gmTDD7q.png i'm happy to test a patch once there is one.
I can repro this in ~ 5 minutes on my arm64 router, while doing some web loadgen testing through it using wrk2 (not in ports yet). $ wrk -d 120s -c 200 -t 8 -L -R 800 https://whiskey.skunkwerks.at/ - 15.0-CURRENT GENERIC arm64 - built from 20250315 at d2a55e6a9348 (upstream/main) openssl: update ASM and version info for 3.0.16 import (kgdb) bt #0 0xffff0000004b8ce8 in doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:404 #1 0xffff0000004b8aa8 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:524 #2 0xffff0000004b9028 in vpanic (fmt=<optimized out>, ap=...) at /usr/src/sys/kern/kern_shutdown.c:979 #3 0xffff0000004b8dcc in panic (fmt=0xffff000000ce7000 <M_PLIMIT+64> "") at /usr/src/sys/kern/kern_shutdown.c:892 #4 0xffff000000666234 in tcp_do_segment (tp=0xffffa085c2f52000, m=<optimized out>, th=0xffff0001631630e2, drop_hdrlen=<optimized out>, tlen=0, iptos=<optimized out>) at /usr/src/sys/netinet/tcp_input.c:1545 #5 0xffff000000662d88 in tcp_input_with_port (mp=<optimized out>, offp=<optimized out>, proto=<optimized out>, port=0, port@entry=42192) at /usr/src/sys/netinet/tcp_input.c:816 #6 0xffff000000663aa8 in tcp_input (mp=0xffff000000ce7000 <M_PLIMIT+64>, offp=0xffff000000db9000 <sdt_vfs_vop_vop_spare5_entry+96>, proto=1) at /usr/src/sys/netinet/tcp_input.c:1499 #7 0xffff00000064e2ac in ip_input (m=0x0) at /usr/src/sys/netinet/ip_input.c:861 #8 0xffff000000612344 in netisr_dispatch_src (proto=proto@entry=1, source=0, source@entry=18446462603272168832, m=0xffffa0804716a800) at /usr/src/sys/net/netisr.c:1151 #9 0xffff000000612728 in netisr_dispatch (proto=13529088, proto@entry=1, m=0x1) at /usr/src/sys/net/netisr.c:1242 #10 0xffff0000005f4214 in ether_demux (ifp=ifp@entry=0xffffa0800440d000, m=0xffffa0804716a800) at /usr/src/sys/net/if_ethersubr.c:942 #11 0xffff0000005f58c0 in ether_input_internal (ifp=0xffffa0800440d000, m=0xffffa0804716a800) at /usr/src/sys/net/if_ethersubr.c:706 #12 ether_nh_input (m=<optimized out>) at /usr/src/sys/net/if_ethersubr.c:736 #13 0xffff000000612344 in netisr_dispatch_src (proto=proto@entry=5, source=0, source@entry=18446462603272169024, m=0xffffa0804716a800) at /usr/src/sys/net/netisr.c:1151 #14 0xffff000000612728 in netisr_dispatch (proto=13529088, proto@entry=5, m=0x1) at /usr/src/sys/net/netisr.c:1242 #15 0xffff0000005f4664 in ether_input (ifp=0xffffa0800440d000, m=0xffff000000db9000 <sdt_vfs_vop_vop_spare5_entry+96>) at /usr/src/sys/net/if_ethersubr.c:847 #16 0xffff0000008be654 in dpaa2_ni_rx (ch=ch@entry=0xffff000113aaf000, fq=0xffff000113aaf180, fd=<optimized out>, ctx=ctx@entry=0xffff00010e90a788) at /usr/src/sys/dev/dpaa2/dpaa2_ni.c:3179 #17 0xffff0000008bdc34 in dpaa2_ni_consume_frames (chan=0xffff000113aaf000, src=<optimized out>, consumed=<optimized out>) at /usr/src/sys/dev/dpaa2/dpaa2_ni.c:3051 #18 dpaa2_ni_rx_cleanup (ch=0xffff000113aaf000) at /usr/src/sys/dev/dpaa2/dpaa2_ni.c:2868 #19 dpaa2_ni_cleanup_task (arg=<optimized out>, count=<optimized out>) at /usr/src/sys/dev/dpaa2/dpaa2_ni.c:2827 #20 0xffff000000521de4 in taskqueue_run_locked (queue=queue@entry=0xffffa08004720480) at /usr/src/sys/kern/subr_taskqueue.c:517 #21 0xffff000000522d1c in taskqueue_thread_loop (arg=arg@entry=0xffff000113ab1e80) at /usr/src/sys/kern/subr_taskqueue.c:829 #22 0xffff000000468ef4 in fork_exit (callout=0xffff000000522c58 <taskqueue_thread_loop>, arg=0xffff000113ab1e80, frame=0xffff00010e90aa00) at /usr/src/sys/kern/kern_fork.c:1152 #23 <signal handler called>
Created attachment 258715 [details] patch for testing
Can you please test the patch I added to the attachments and report the result? It is not the patch I will commit, but it is the part which is intended to fix the panic.
Alternatively, you can also apply https://reviews.freebsd.org/D48652 - which does not fix the issue of incorrectly unwinding an RTO, but once an RTO was declared spurious, fixes the subsequent cwnd calculation by resetting the state more completely.
(In reply to Richard Scheffenegger from comment #31) But to fix the root cause, I need feedback on the patch I provided.
(In reply to Alexander Leidinger from comment #5) I also found similar panic with ipv6 on 15.0 based on 2024-10-13: (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:404 #2 0xffffffff80b534c0 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:524 #3 0xffffffff80b539dc in vpanic (fmt=0xffffffff811fbf6d "%s: sent too much", ap=ap@entry=0xfffffe01060526b0) at /usr/src/sys/kern/kern_shutdown.c:979 #4 0xffffffff80b53823 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:892 #5 0xffffffff80d4b102 in tcp_do_segment (tp=0xfffff80399e5fa80, tp@entry=<error reading variable: value is not available>, m=<optimized out>, m@entry=<error reading variable: value is not available>, th=0xfffff8004426169a, th@entry=<error reading variable: value is not available>, drop_hdrlen=72, drop_hdrlen@entry=<error reading variable: value is not available>, tlen=0, tlen@entry=<error reading variable: value is not available>, iptos=<unavailable>, iptos@entry=<error reading variable: value is not available>) at /usr/src/sys/netinet/tcp_input.c:1548 #6 0xffffffff80d47aa5 in tcp_input_with_port (mp=mp@entry=0xfffffe0106052988, offp=offp@entry=0xfffffe0106052980, proto=<optimized out>, port=0) at /usr/src/sys/netinet/tcp_input.c:1158 #7 0xffffffff80d4691a in tcp6_input_with_port (mp=0xfffffe0106052988, offp=0xfffffe0106052980, proto=<optimized out>, port=port@entry=0) at /usr/src/sys/netinet/tcp_input.c:594 #8 0xffffffff80d481eb in tcp6_input (mp=<unavailable>, offp=<unavailable>, proto=<unavailable>) at /usr/src/sys/netinet/tcp_input.c:601 #9 0xffffffff80d929c0 in ip6_input (m=0x0, m@entry=<error reading variable: value is not available>) at /usr/src/sys/netinet6/ip6_input.c:930 #10 0xffffffff80caf29d in netisr_dispatch_src (proto=proto@entry=6, source=source@entry=0, m=0xfffff80044261600) at /usr/src/sys/net/netisr.c:1152 #11 0xffffffff80caf61f in netisr_dispatch (proto=<unavailable>, proto@entry=6, m=<unavailable>) at /usr/src/sys/net/netisr.c:1243 #12 0xffffffff80c9180a in ether_demux (ifp=ifp@entry=0xfffff80014bd1000, m=0xfffff80044261600) at /usr/src/sys/net/if_ethersubr.c:946 #13 0xffffffff80c92efd in ether_input_internal (ifp=0xfffff80014bd1000, m=0xfffff80044261600) at /usr/src/sys/net/if_ethersubr.c:710 #14 ether_nh_input (m=<unavailable>, m@entry=<error reading variable: value is not available>) at /usr/src/sys/net/if_ethersubr.c:740 #15 0xffffffff80caf29d in netisr_dispatch_src (proto=proto@entry=5, source=source@entry=0, m=0xfffff80044261600) at /usr/src/sys/net/netisr.c:1152 #16 0xffffffff80caf61f in netisr_dispatch (proto=<unavailable>, proto@entry=5, m=<unavailable>) at /usr/src/sys/net/netisr.c:1243 #17 0xffffffff80c91c65 in ether_input (ifp=0xfffff80014bd1000, ifp@entry=<error reading variable: value is not available>, m=<unavailable>, m@entry=<error reading variable: value is not available>) at /usr/src/sys/net/if_ethersubr.c:851 #18 0xffffffff80c9175b in ether_demux (ifp=ifp@entry=0xfffff80004619000, m=0xfffff80044261600) at /usr/src/sys/net/if_ethersubr.c:897 #19 0xffffffff80c92efd in ether_input_internal (ifp=0xfffff80004619000, m=0xfffff80044261600) at /usr/src/sys/net/if_ethersubr.c:710 #20 ether_nh_input (m=<unavailable>, m@entry=<error reading variable: value is not available>) at /usr/src/sys/net/if_ethersubr.c:740 #21 0xffffffff80caf29d in netisr_dispatch_src (proto=proto@entry=5, source=source@entry=0, m=0xfffff80044261600) at /usr/src/sys/net/netisr.c:1152 #22 0xffffffff80caf61f in netisr_dispatch (proto=<unavailable>, proto@entry=5, m=<unavailable>) at /usr/src/sys/net/netisr.c:1243 #23 0xffffffff80c91c65 in ether_input (ifp=0xfffff80004619000, m=<unavailable>) at /usr/src/sys/net/if_ethersubr.c:851 #24 0xffffffff80cab203 in iflib_rxeof (rxq=rxq@entry=0xfffff80004e73680, budget=<optimized out>) at /usr/src/sys/net/iflib.c:3081 #25 0xffffffff80ca53aa in _task_fn_rx (context=0xfffff80004e73680) at /usr/src/sys/net/iflib.c:4155 #26 0xffffffff80ba0eee in gtaskqueue_run_locked (queue=queue@entry=0xfffff800042d0600) at /usr/src/sys/kern/subr_gtaskqueue.c:368 #27 0xffffffff80ba0c23 in gtaskqueue_thread_loop (arg=arg@entry=0xfffffe003c109038) at /usr/src/sys/kern/subr_gtaskqueue.c:544 #28 0xffffffff80b080e2 in fork_exit (callout=0xffffffff80ba0b50 <gtaskqueue_thread_loop>, arg=0xfffffe003c109038, frame=0xfffffe0106052f40) at /usr/src/sys/kern/kern_fork.c:1151 #29 <signal handler called> (kgdb) f 5 #5 0xffffffff80d4b102 in tcp_do_segment (tp=0xfffff80399e5fa80, tp@entry=<error reading variable: value is not available>, m=<optimized out>, m@entry=<error reading variable: value is not available>, th=0xfffff8004426169a, th@entry=<error reading variable: value is not available>, drop_hdrlen=72, drop_hdrlen@entry=<error reading variable: value is not available>, tlen=0, tlen@entry=<error reading variable: value is not available>, iptos=<unavailable>, iptos@entry=<error reading variable: value is not available>) at /usr/src/sys/netinet/tcp_input.c:1548 1548 KASSERT(tp->t_state != TCPS_TIME_WAIT, ("%s: TCPS_TIME_WAIT", (kgdb) i lo to = {to_flags = 16, to_tsval = 2267638854, to_tsecr = 249601034, to_sacks = 0xfffff80399e5fa80 "", to_signature = 0x4 <error: Cannot access memory at address 0x4>, to_tfo_cookie = 0xfffffe0106052780 "\004", to_mss = 55127, to_wscale = 216 '\330', to_nsacks = 128 '\200', to_tfo_len = 255 '\377', to_spare = 0} maxseg = 1328 inp = 0xfffff80399e5fa80 needoutput = 0 incforsyn = <optimized out> so = 0xfffff80399781800 inc = <optimized out> thflags = <optimized out> sack_changed = <optimized out> nsegs = 1 s = <optimized out> rstreason = <optimized out> tiwin = <optimized out> todrop = <optimized out> acked = <optimized out> tfo_syn = <optimized out> mfree = <optimized out> ourfinisacked = <optimized out> win = <optimized out> close = <optimized out>
hi Michael after that patch, I ran wrk 3x and no panics. I'll leave it in a loop for a while.
(In reply to Andrey V. Elsukov from comment #33) I don't expect the problem to be specific to IPv4 or IPv6... Just a matter of which network layer you use.
(In reply to Dave Cottlehuber from comment #34) Perfect, thank you. I do know that the code I removed is a problem, but I want to make sure that there are not more problems...
Michael, Richard, do we have any chances for this panic to be fixed before March stabweek (next week)? It is already over 4 months since the panic was discovered and it affects many people.
(In reply to Michael Tuexen from comment #30) I'll coordinate testing a new build with the HardenedBSD community members being impacted by this bug. I'll try to get them a new test build this weekend.
(In reply to Gleb Smirnoff from comment #37) I can provide a patch tomorrow. Need to look at RACK and BBR...
(In reply to Michael Tuexen from comment #29) Before this patch, I was getting the panic about once every few weeks. I've been running the patch for the last day. No problems so far, so I know that it didn't make things substantially worse.
A patch is under review D49414.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=fbcf3b74e8f2c0c5ba37f1839bfe9395eb2fd0b1 commit fbcf3b74e8f2c0c5ba37f1839bfe9395eb2fd0b1 Author: Michael Tuexen <tuexen@FreeBSD.org> AuthorDate: 2025-03-20 15:17:40 +0000 Commit: Michael Tuexen <tuexen@FreeBSD.org> CommitDate: 2025-03-20 15:17:40 +0000 tcp: fix detection of bad RTOs If timestamps are enabled, the actions performed by a retransmission timeout were rolled back, when they should not. It is needed to make sure the incoming segment advances SND.UNA. To do this, remove the incorrect upfront check and extend the check in the fast path to handle also the case of timestamps. PR: 282605 Reviewed by: cc, rscheff, Peter Lei MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D49414 sys/netinet/tcp_input.c | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=6f6c07813b38ab04d8b1b2bb87c0291dbae25a25 commit 6f6c07813b38ab04d8b1b2bb87c0291dbae25a25 Author: Richard Scheffenegger <rscheff@FreeBSD.org> AuthorDate: 2025-03-21 04:23:30 +0000 Commit: Richard Scheffenegger <rscheff@FreeBSD.org> CommitDate: 2025-03-21 04:28:31 +0000 tcp: fix reverting of spurious timeouts (RTO) One variable that became critical to correctly calculate the cwnd during limited transmit was not properly reverted on detection of spurious timeouts. PR: 282605 Reviewed By: cc, tuexen, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D48652 sys/netinet/tcp_input.c | 1 + 1 file changed, 1 insertion(+)
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=4a328b8049962976d13b8fb9abe2d9a488f2a894 commit 4a328b8049962976d13b8fb9abe2d9a488f2a894 Author: Richard Scheffenegger <rscheff@FreeBSD.org> AuthorDate: 2025-03-21 04:23:30 +0000 Commit: Richard Scheffenegger <rscheff@FreeBSD.org> CommitDate: 2025-03-24 05:35:43 +0000 tcp: fix reverting of spurious timeouts (RTO) One variable that became critical to correctly calculate the cwnd during limited transmit was not properly reverted on detection of spurious timeouts. PR: 282605 Reviewed By: cc, tuexen, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D48652 (cherry picked from commit 6f6c07813b38ab04d8b1b2bb87c0291dbae25a25) sys/netinet/tcp_input.c | 1 + 1 file changed, 1 insertion(+)
It's starting to look like the issue has been fixed in the recent commits. At the very least, I haven't had a kernel panic in two days.
I just hit a case this morning on a kernel at 67c1c4dfd1, which has both of these included: fbcf3b74e8f2c0c5 ("tcp: fix detection of bad RTOs") 6f6c07813b38ab04 ("tcp: fix reverting of spurious timeouts (RTO)") panic: tcp_do_segment: sent too much cpuid = 1 time = 1742996303 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe015b6b55b0 vpanic() at vpanic+0x136/frame 0xfffffe015b6b56e0 panic() at panic+0x43/frame 0xfffffe015b6b5740 tcp_do_segment() at tcp_do_segment+0x28ed/frame 0xfffffe015b6b5820 tcp_input_with_port() at tcp_input_with_port+0x1140/frame 0xfffffe015b6b5970 tcp6_input_with_port() at tcp6_input_with_port+0x6a/frame 0xfffffe015b6b59a0 tcp6_input() at tcp6_input+0xb/frame 0xfffffe015b6b59b0 ip6_input() at ip6_input+0xc98/frame 0xfffffe015b6b5a90 netisr_dispatch_src() at netisr_dispatch_src+0xb4/frame 0xfffffe015b6b5af0 ether_demux() at ether_demux+0x16a/frame 0xfffffe015b6b5b20 ether_nh_input() at ether_nh_input+0x3cd/frame 0xfffffe015b6b5b70 netisr_dispatch_src() at netisr_dispatch_src+0xb4/frame 0xfffffe015b6b5bd0 ether_input() at ether_input+0xd5/frame 0xfffffe015b6b5c30 ieee80211_deliver_data() at ieee80211_deliver_data+0xeb/frame 0xfffffe015b6b5c80 sta_input() at sta_input+0xcda/frame 0xfffffe015b6b5d40 ieee80211_input_mimo() at ieee80211_input_mimo+0x20c/frame 0xfffffe015b6b5df0 lkpi_80211_lhw_rxq_task() at lkpi_80211_lhw_rxq_task+0x143/frame 0xfffffe015b6b5e40 taskqueue_run_locked() at taskqueue_run_locked+0x1c2/frame 0xfffffe015b6b5ec0 taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe015b6b5ef0 fork_exit() at fork_exit+0x82/frame 0xfffffe015b6b5f30 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe015b6b5f30 --- trap 0xd229bf4a, rip = 0x79b36f1efc3c4ce3, rsp = 0xd7e96fa6591692e4, rbp = 0xa979fdd81e679dee ---
(In reply to Kyle Evans from comment #46) Was BBLoging enabled?
(In reply to Michael Tuexen from comment #47) Yes, allegedly- not sure how to confirm/extract anything useful from it, though.
(In reply to Kyle Evans from comment #48) Please do the following: 1. Download tcplog.py from review D48705. 2. Start kgdb with the relevant core. 3. source tcplog.py 4. go up the stack until you have the tp as a variable 5. tcplog_dump tp 6. quit kgdb.Now you should have a *.bin file. Please send this to tuexen@freebsd.org. If this does not work, please report.
Just got this crash as well, just on the older stable/14, so probably no new useful info from me. I have the vmcore, just in case. panic: tcp_do_segment: sent too much cpuid = 3 time = 1743052191 KDB: stack backtrace: db_trace_self_wrapper() at 0xffffffff8061eadb = db_trace_self_wrapper+0x2b/frame 0xfffffe0120fda760 kdb_backtrace() at 0xffffffff80956a57 = kdb_backtrace+0x37/frame 0xfffffe0120fda810 vpanic() at 0xffffffff80907629 = vpanic+0x169/frame 0xfffffe0120fda950 panic() at 0xffffffff80907403 = panic+0x43/frame 0xfffffe0120fda9b0 tcp_do_segment() at 0xffffffff80a82864 = tcp_do_segment+0x1c24/frame 0xfffffe0120fdaa90 tcp_input_with_port() at 0xffffffff80a8066d = tcp_input_with_port+0x12bd/frame 0xfffffe0120fdabc0 tcp_input() at 0xffffffff80a80c2b = tcp_input+0xb/frame 0xfffffe0120fdabd0 ip_input() at 0xffffffff80a735ae = ip_input+0x25e/frame 0xfffffe0120fdac10 netisr_dispatch_src() at 0xffffffff80a436ff = netisr_dispatch_src+0x8f/frame 0xfffffe0120fdac70 netisr_dispatch() at 0xffffffff80a43a6e = netisr_dispatch+0xe/frame 0xfffffe0120fdac80 ether_demux() at 0xffffffff80a39cc9 = ether_demux+0x169/frame 0xfffffe0120fdacb0 ether_nh_input() at 0xffffffff80a3afe0 = ether_nh_input+0x2c0/frame 0xfffffe0120fdace0 netisr_dispatch_src() at 0xffffffff80a436ff = netisr_dispatch_src+0x8f/frame 0xfffffe0120fdad40 netisr_dispatch() at 0xffffffff80a43a6e = netisr_dispatch+0xe/frame 0xfffffe0120fdad50 ether_input() at 0xffffffff80a3a115 = ether_input+0xa5/frame 0xfffffe0120fdada0 if_input() at 0xffffffff80a2fa2a = if_input+0xa/frame 0xfffffe0120fdadb0 re_rxeof() at 0xffffffff806ff873 = re_rxeof+0x463/frame 0xfffffe0120fdae20 re_intr_msi() at 0xffffffff806fd747 = re_intr_msi+0xc7/frame 0xfffffe0120fdae60 ithread_loop() at 0xffffffff808c67ca = ithread_loop+0x2aa/frame 0xfffffe0120fdaef0 fork_exit() at 0xffffffff808c2b47 = fork_exit+0xc7/frame 0xfffffe0120fdaf30 fork_trampoline() at 0xffffffff80c5034e = fork_trampoline+0xe/frame 0xfffffe0120fdaf30