Summary: [tcp] Panic: Fatal trap 12: page fault while in kernel mode (if_io_tqg_4) - m_copydata ... at /usr/src/sys/kern/uipc_mbuf.c:659
Product: Base System
Reporter: iron.udjin
Component: kern
Assignee: freebsd-net (Nobody) <net>
Status: Closed FIXED
Severity: Affects Only Me
CC: chris, clearscreen, ddobrev85, freebsd, grahamperrin, hselasky, jhb, markj, net, pi, rscheff, supportme, tpearson, tuexen, zarychtam
Priority: ---
Keywords: crash, needs-qa
Version: 13.1-RELEASE
Flags: koobs: maintainer-feedback? (net), koobs: maintainer-feedback? (hselasky), markj: maintainer-feedback-, koobs: maintainer-feedback? (jhb), tuexen: mfc-stable13+, tuexen: mfc-stable12+
Hardware: amd64
OS: Any
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263445, https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260393
Bug Depends on:
Bug Blocks: 264030
Attachments:
Description
iron.udjin
2022-05-26 16:00:56 UTC
(kgdb) list *0xffffffff80cae31d
0xffffffff80cae31d is in m_copydata (/usr/src/sys/kern/uipc_mbuf.c:659).
654             off -= m->m_len;
655             m = m->m_next;
656     }
657     while (len > 0) {
658             KASSERT(m != NULL, ("m_copydata, length > size of mbuf chain"));
659             count = min(m->m_len - off, len);
660             if ((m->m_flags & M_EXTPG) != 0)
661                     m_copyfromunmapped(m, off, count, cp);
662             else
663                     bcopy(mtod(m, caddr_t) + off, cp, count);

Can you provide information on how to reproduce it?

(In reply to Michael Tuexen from comment #2) Unfortunately, I don't know how to reproduce this bug. The server was working for a week or so without any problem. Today it suddenly panicked and restarted. The workload is typical: web server, PostgreSQL, S3 (minio).

(In reply to Michael Tuexen from comment #2) Do you need all of my sysctls from /etc/sysctl.conf and /boot/loader.conf?

Looks like it walked off the end of the mbuf chain as it tried to copy one byte too many. You could try going up to frame 8 (tcp_output) to see if the arguments passed to m_copydata() are still around (it looks like we know the mbuf chain via m@entry for frame 9, but knowing the original length and offset and confirming it walked off the end would be good). You'd have to figure out why the length was wrong, though, and that might need more digging in the tp or the like.

(In reply to John Baldwin from comment #5) Oh... I'm not sure that I can do it by myself. I'm not so experienced in debugging. Can I send you a compressed core dump file?

Hmm, unfortunately it would need to be the contents of /boot/kernel and /usr/lib/debug/boot/kernel along with the vmcore. However, if you can stick those all in a tarball I can take a look at them. It might be simpler to store them at a URL and mail that to me directly.

(In reply to John Baldwin from comment #7) Emailed you a link to the archive.

(In reply to iron.udjin from comment #8) You might also send the link to me (tuexen@freebsd.org) and Richard (rscheff@freebsd.org), if you are willing to do so.
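The failure mode described above ("walked off the end of the mbuf chain as it tried to copy one byte too many") can be sketched in user space. The following is a simplified model, not the kernel code: names such as `m_copydata_model` and `demo_copy` are invented for illustration, and the struct is a stripped-down stand-in for the real `struct mbuf`. The point it demonstrates: if the requested length exceeds the bytes in the chain by even one, `m` becomes NULL before `len` reaches zero, and the subsequent read of `m->m_len` is exactly the NULL-plus-small-offset dereference seen in the panic (on INVARIANTS kernels the KASSERT at line 658 fires first).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Minimal user-space stand-in for a kernel mbuf chain (illustrative only). */
struct mbuf {
    struct mbuf *m_next;
    int          m_len;
    char        *m_data;
};

/*
 * Mirror of the copy loop around uipc_mbuf.c:659: skip `off` bytes,
 * then copy min(m->m_len - off, len) from each mbuf in turn.  Where
 * the kernel KASSERTs (or faults on a NULL mbuf), this model returns
 * -1 instead of crashing.
 */
static int m_copydata_model(struct mbuf *m, int off, int len, char *cp)
{
    while (off > 0) {
        if (m == NULL)
            return -1;
        if (off < m->m_len)
            break;
        off -= m->m_len;
        m = m->m_next;
    }
    while (len > 0) {
        if (m == NULL)          /* "length > size of mbuf chain" */
            return -1;
        int count = (m->m_len - off < len) ? m->m_len - off : len;
        memcpy(cp, m->m_data + off, count);
        len -= count;
        cp += count;
        off = 0;
        m = m->m_next;
    }
    return 0;
}

/* Build a fixed 10-byte chain ("hello" + "world") and copy `len` bytes. */
int demo_copy(int len)
{
    char b1[] = "hello", b2[] = "world", out[32];
    struct mbuf m2 = { NULL, 5, b2 };
    struct mbuf m1 = { &m2, 5, b1 };
    return m_copydata_model(&m1, 0, len, out);
}

int main(void)
{
    printf("copy 10 bytes: %d\n", demo_copy(10)); /* 0: fits exactly */
    printf("copy 11 bytes: %d\n", demo_copy(11)); /* -1: one byte too many */
    return 0;
}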
(In reply to Michael Tuexen from comment #9) Done.

(In reply to iron.udjin from comment #10) Thanks! I could download the tarball and will also have a look at it.

Are you using KTLS?

(In reply to Michael Tuexen from comment #12) Yes, with www/nginx-devel

(In reply to iron.udjin from comment #13) Do you observe the panics on a regular basis? If yes, could you disable KTLS and see if they still happen? Can you also post /etc/sysctl.conf and /boot/loader.conf?

(In reply to Michael Tuexen from comment #14) It doesn't panic on a regular basis. The server has been working for 11 days without a panic with KTLS enabled.

Created attachment 234498 [details]
sysctl.conf
Created attachment 234499 [details]
loader.conf
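For reference, the tunables that come up repeatedly in this thread can be inspected and changed at runtime with sysctl. A sketch only, using the FreeBSD 13.x OID names discussed in this report:

```sh
# Congestion control algorithm in use (the HTCP question that follows)
sysctl net.inet.tcp.cc.algorithm

# RFC 6675 pipe accounting / SACK rescue retransmissions.
# 1 enables the enhanced behaviour; 0 is the 13.x default.
sysctl net.inet.tcp.rfc6675_pipe

# The mitigation discussed later in this thread:
sysctl net.inet.tcp.rfc6675_pipe=0
```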
(In reply to iron.udjin from comment #17) Thanks. Are you actually using HTCP as the congestion module? Which network interface cards are you using? Are they doing HW TLS?

(In reply to Michael Tuexen from comment #18)
ixl0@pci0:33:0:0: class=0x020000 rev=0x02 hdr=0x00 vendor=0x8086 device=0x15ff subvendor=0x1849 subdevice=0x0000
    vendor = 'Intel Corporation'
    device = 'Ethernet Controller X710 for 10GBASE-T'
    class = network
    subclass = ethernet
The network card isn't doing HW TLS; I use the ktls_ocf module. Currently I don't use HTCP, but I was using it before.
# sysctl net.inet.tcp.cc.algorithm
net.inet.tcp.cc.algorithm: newreno

(In reply to iron.udjin from comment #19) OK, thanks for the information. My current thinking is that the problem might be related to using KTLS in a situation where TCP sends a FIN but still has outstanding data. KTLS is tested a lot with the RACK stack, which does NOT send a FIN if there is outstanding data. This could be the problem with len = 1, since the FIN conceptually takes one byte in the sequence number space, but there is no corresponding byte in the data stream...

(In reply to Michael Tuexen from comment #12) My workload is also web-related - nginx, grafana, influxdb. No KTLS, so I doubt it's related, and I have the same panic. Last occurrences: on server #1 - 54 days, on server #2 - 154 days.

(In reply to Dobri Dobrev from comment #21) Yes, it's most likely not related to KTLS, because I experienced such crashes on 13.0-STABLE with no KTLS enabled (see #258183).

(In reply to iron.udjin from comment #22) Thanks for providing the information. We are trying to figure out what might be going on... The current thinking is that SACK rescue retransmissions (in FreeBSD 13 this is gated by net.inet.tcp.rfc6675_pipe=1) very rarely create an entry which is apparently beyond the valid data range.
While under most common circumstances a final FIN bit in the sequence space is taken care of, it seems that there may be some double-counting for the FIN bit. In most of the inspected cores, we found:
- TCP state: LAST_ACK (FIN received and also FIN sent)
- SACK loss recovery triggered
- A cumulative ACK before all outstanding data was received
- The remote client "disappears" for a significant amount of time (7 to 12 retransmission timeouts), but may reappear again just prior
- snd_max consistently 2 counts above the last data, instead of the expected 1 (for the FIN bit)

However, it is still unclear under what circumstances this double-counting happens - possibly when the persist timer triggers and a few other conditions are also fulfilled; maybe a race condition between normal packet processing and a timer firing.

In short: disabling the RFC 6675 enhanced SACK features (more correct pipe accounting, rescue retransmissions) should address the cause of the panic, while not addressing the root cause of when/why there is the double-accounting of the FIN bit...

Would you be willing to run an instrumented kernel which either panics (full core dump) or spews out various state when inconsistencies are detected in this space - while ignoring/addressing them "on the fly" without panicking?

(In reply to Richard Scheffenegger from comment #24) I can build and run a custom kernel. But the problem is that I cannot reproduce the crash. The panic occurred when my server was under a DDoS attack. After that it worked for almost a month without any problem with net.inet.tcp.rfc6675_pipe=1 and KTLS enabled.

(In reply to iron.udjin from comment #25) Interesting information. Do you know what the traffic pattern was when being under attack? Any chance you have a .pcap file and are willing to share it (privately)?
(In reply to iron.udjin from comment #25) I've prepared a patch against main (it may need some manual tweaking to apply to 13.1-RELEASE as of now):

wget https://reviews.freebsd.org/D35446?id=106838&download=true

If the kernel is built with INVARIANTS, it should panic early on, once an inconsistency between the socket send buffer and the TCP state variables is detected - instead of panicking a few packets later, when that inconsistency results in an invalid pointer access... If the kernel is built without INVARIANTS, the kernel log buffer (dmesg) should provide some hints as to when/where the inconsistency first occurred, which may give more indirect clues. But it would address the inconsistency right away and continue operation.

If the panic was observed during a DDoS, this strengthens the clue that there exists a race condition (double-accounting for the FIN bit). However, prior to the introduction of SACK rescue retransmissions, this never materially affected TCP operations, as the socket buffer data would be used directly to see what sequence range to send, rather than the SACK scoreboard data.

(In reply to Michael Tuexen from comment #26) When my website started experiencing overload because of thousands of distributed requests like "GET / HTTP/2.0", I switched the website to Cloudflare and configured filters which allowed the website to stay alive. So there wasn't any direct traffic from clients to the server. Regarding the traffic pattern: all traffic was TCP/HTTPS/443 port from Cloudflare. Currently there is no DDoS attack and no crashes; I guess making a .pcap file now would be useless.

(In reply to Richard Scheffenegger from comment #27) Tell me please which options should be enabled when I rebuild the kernel. Anything else except INVARIANTS?

(In reply to iron.udjin from comment #29) In addition, I would also suggest

options TCPPCAP

That would allow us to see the last n packets in/out per endpoint in the kernel dump. Just compiling it in does not enable the logging.
If we actually want to do the logging, we need to do something like

sudo sysctl net.inet.tcp.tcp_pcap_packets=10

to store the last 10 packets in each direction. We can see if your system can handle this, since it uses mbufs.

(In reply to Richard Scheffenegger from comment #27) Failed to apply the patch to the 13.1-RELEASE source code:

|Index: sys/netinet/tcp_sack.c
|===================================================================
|--- sys/netinet/tcp_sack.c
|+++ sys/netinet/tcp_sack.c
--------------------------
Patching file sys/netinet/tcp_sack.c using Plan A...
Hunk #1 succeeded at 879 with fuzz 1 (offset -19 lines).
Hunk #2 failed at 962.
1 out of 2 hunks failed--saving rejects to sys/netinet/tcp_sack.c.rej
done

Created attachment 234626 [details]
D35446 patch to apply clean against 13.1-RELEASE
Comment on attachment 234626 [details]
D35446 patch to apply clean against 13.1-RELEASE
Missed removing the goto label to make the compiler happy.
Created attachment 234627 [details]
clean patch against 13.1-RELEASE
(In reply to iron.udjin from comment #31) Try the patch attached just now, which should apply cleanly.

Note that if the hypothesis about snd_max infrequently getting incremented twice holds, you may observe more frequent panics in an INVARIANTS kernel - prior to these added checks, that oddity would not materially affect the TCP session. Only the new feature of SACK rescue retransmissions - which should also happen very infrequently, at the same time as a FIN bit - exposed this. In that case, you may want to switch to a kernel without INVARIANTS (which would still log those occurrences, but not panic), or the default 13.1-RELEASE kernel with rfc6675_pipe disabled (to disable rescue retransmissions) for stable operation.

Today I had a kernel panic on a server. Is it possibly the same bug?

Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 07
fault virtual address = 0x18
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80cae31d
stack pointer = 0x28:0xfffffe00e00445f0
frame pointer = 0x28:0xfffffe00e0044660
code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (if_io_tqg_7)
trap number = 12
panic: page fault
cpuid = 7
time = 1654981222
KDB: stack backtrace:
#0 0xffffffff80c69465 at kdb_backtrace+0x65
#1 0xffffffff80c1bb1f at vpanic+0x17f
#2 0xffffffff80c1b993 at panic+0x43
#3 0xffffffff810afdf5 at trap_fatal+0x385
#4 0xffffffff810afe4f at trap_pfault+0x4f
#5 0xffffffff81087528 at calltrap+0x8
#6 0xffffffff80de07c9 at tcp_output+0x1339
#7 0xffffffff80dd7eed at tcp_do_segment+0x2cfd
#8 0xffffffff80dd44b1 at tcp_input_with_port+0xb61
#9 0xffffffff80dd515b at tcp_input+0xb
#10 0xffffffff80dc691f at ip_input+0x11f
#11 0xffffffff80d53089 at netisr_dispatch_src+0xb9
#12 0xffffffff80d36ea8 at ether_demux+0x138
#13 0xffffffff80d38235 at ether_nh_input+0x355
#14 0xffffffff80d53089 at netisr_dispatch_src+0xb9
#15 0xffffffff80d372d9 at ether_input+0x69 #16 0xffffffff80ddd9f4 at tcp_lro_flush+0x2f4 #17 0xffffffff80dddd3b at tcp_lro_flush_all+0x1bb Uptime: 20d1h1m33s Dumping 2502 out of 32501 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55 warning: Source file is more recent than executable. 55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu, (kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55 #1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399 #2 0xffffffff80c1b71c in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:487 #3 0xffffffff80c1bb8e in vpanic (fmt=0xffffffff811b4fb9 "%s", ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920 #4 0xffffffff80c1b993 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:844 #5 0xffffffff810afdf5 in trap_fatal (frame=0xfffffe00e0044530, eva=24) at /usr/src/sys/amd64/amd64/trap.c:944 #6 0xffffffff810afe4f in trap_pfault (frame=0xfffffe00e0044530, usermode=false, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:763 #7 <signal handler called> #8 m_copydata (m=0x0, m@entry=0xfffff80405398400, off=0, len=1, cp=<optimized out>) at /usr/src/sys/kern/uipc_mbuf.c:659 #9 0xffffffff80de07c9 in tcp_output (tp=<optimized out>) at /usr/src/sys/netinet/tcp_output.c:1081 #10 0xffffffff80dd7eed in tcp_do_segment (m=<optimized out>, th=<optimized out>, so=<optimized out>, tp=0xfffffe013f603000, drop_hdrlen=40, tlen=<optimized out>, iptos=0 '\000') at /usr/src/sys/netinet/tcp_input.c:2637 #11 0xffffffff80dd44b1 in tcp_input_with_port (mp=<optimized out>, offp=<optimized out>, proto=<optimized out>, port=port@entry=0) at /usr/src/sys/netinet/tcp_input.c:1400 #12 0xffffffff80dd515b in tcp_input (mp=0xfffff80405398400, offp=0x0, proto=1) at /usr/src/sys/netinet/tcp_input.c:1496 #13 0xffffffff80dc691f in ip_input (m=0x0) at /usr/src/sys/netinet/ip_input.c:839 #14 
0xffffffff80d53089 in netisr_dispatch_src (proto=1, source=source@entry=0, m=0xfffff8002f330900) at /usr/src/sys/net/netisr.c:1143 #15 0xffffffff80d5345f in netisr_dispatch (proto=87655424, m=0x1) at /usr/src/sys/net/netisr.c:1234 #16 0xffffffff80d36ea8 in ether_demux (ifp=ifp@entry=0xfffff80004454000, m=0x0) at /usr/src/sys/net/if_ethersubr.c:921 #17 0xffffffff80d38235 in ether_input_internal (ifp=0xfffff80004454000, m=0x0) at /usr/src/sys/net/if_ethersubr.c:707 #18 ether_nh_input (m=<optimized out>) at /usr/src/sys/net/if_ethersubr.c:737 #19 0xffffffff80d53089 in netisr_dispatch_src (proto=proto@entry=5, source=source@entry=0, m=m@entry=0xfffff8002f330900) at /usr/src/sys/net/netisr.c:1143 #20 0xffffffff80d5345f in netisr_dispatch (proto=87655424, proto@entry=5, m=0x1, m@entry=0xfffff8002f330900) at /usr/src/sys/net/netisr.c:1234 #21 0xffffffff80d372d9 in ether_input (ifp=<optimized out>, m=0xfffff8002f330900) at /usr/src/sys/net/if_ethersubr.c:828 #22 0xffffffff80ddd9f4 in tcp_lro_flush (lc=lc@entry=0xfffff80003cf5830, le=0xfffffe0103f3f690) at /usr/src/sys/netinet/tcp_lro.c:1375 #23 0xffffffff80dddd3b in tcp_lro_rx_done (lc=0xfffff80003cf5830) at /usr/src/sys/netinet/tcp_lro.c:566 #24 tcp_lro_flush_all (lc=lc@entry=0xfffff80003cf5830) at /usr/src/sys/netinet/tcp_lro.c:1532 #25 0xffffffff80d4f503 in iflib_rxeof (rxq=<optimized out>, rxq@entry=0xfffff80003cf5800, budget=<optimized out>) at /usr/src/sys/net/iflib.c:3058 #26 0xffffffff80d49b22 in _task_fn_rx (context=0xfffff80003cf5800) at /usr/src/sys/net/iflib.c:3990 #27 0xffffffff80c67e9d in gtaskqueue_run_locked ( queue=queue@entry=0xfffff80003ac2000) at /usr/src/sys/kern/subr_gtaskqueue.c:371 #28 0xffffffff80c67b12 in gtaskqueue_thread_loop (arg=<optimized out>, arg@entry=0xfffffe00387fb0b0) at /usr/src/sys/kern/subr_gtaskqueue.c:547 #29 0xffffffff80bd8a5e in fork_exit ( callout=0xffffffff80c67a50 <gtaskqueue_thread_loop>, arg=0xfffffe00387fb0b0, frame=0xfffffe00e0044f40) at /usr/src/sys/kern/kern_fork.c:1093 
#30 <signal handler called> #31 mi_startup () at /usr/src/sys/kern/init_main.c:322 Backtrace stopped: Cannot access memory at address 0x1d (kgdb)

Can you provide the vmcore (and kernel.debug) files? Or check:
f 10
p *tp
p *tp->sackhint.nexthole
p tp->snd_una + tp->t_inpcb->inp_socket->so_snd.sb_ccc
p/x tp->t_flags
p/x tp->t_flags2
p *tp->t_timers

So far, this seems to be the signature:
- t_state == 8 (LAST_ACK)
- snd_max == snd_una + so_snd.sb_ccc + 2 (! should be one to account for the FIN)
- snd_fack < sackhole.end (rescue retransmission - this is new in fbsd13)
- t_rxtshift == 6..12 (many retransmission timeouts, indicating the client disappeared - temporarily)
- t_rtseq == snd_una + so_snd.sb_ccc
- snd_recover == snd_max (indicating the double-accounting for the FIN happened prior to entering loss recovery, or "many" packets prior to the actual panic)
- t_flags = 0x20100274 (TF_CONGRECOVERY, TF_FASTRECOVERY, TF_SACK_PERMIT, TF_SENTFIN, TF_REQ_SCALE, TF_RCVD_SCALE, TF_NODELAY, TF_ACKNOW)

(In reply to Richard Scheffenegger from comment #35) I have a problem with the kernel rebuild at the linking stage.
I just copied GENERIC to KP and added:

ident DEBUG
options TCPPCAP
options WITNESS
options INVARIANTS

...and got errors:

linking kernel.full
ld: error: undefined symbol: __mtx_assert
>>> referenced by cam_periph.c:351 (/usr/src/sys/cam/cam_periph.c:351)
>>> cam_periph.o:(cam_periph_find)
>>> referenced by cam_periph.c:444 (/usr/src/sys/cam/cam_periph.c:444)
>>> cam_periph.o:(cam_periph_release_locked_buses)
>>> referenced by cam_periph.c:678 (/usr/src/sys/cam/cam_periph.c:678)
>>> cam_periph.o:(cam_periph_release_locked_buses)
>>> referenced 4252 more times
ld: error: undefined symbol: _sx_assert
>>> referenced by filedesc.h:292 (/usr/src/sys/sys/filedesc.h:292)
>>> freebsd32_capability.o:(freebsd32_cap_ioctls_get)
>>> referenced by acpi_battery.c:465 (/usr/src/sys/dev/acpica/acpi_battery.c:465)
>>> acpi_battery.o:(acpi_battery_register)
>>> referenced by acpi_cmbat.c:342 (/usr/src/sys/dev/acpica/acpi_cmbat.c:342)
>>> acpi_cmbat.o:(acpi_cmbat_get_bix)
>>> referenced 590 more times
ld: error: undefined symbol: __rw_assert
>>> referenced by agp.c:613 (/usr/src/sys/dev/agp/agp.c:613)
>>> agp.o:(agp_generic_bind_memory)
>>> referenced by tmpfs_subr.c:169 (/usr/src/sys/fs/tmpfs/tmpfs_subr.c:169)
>>> tmpfs_subr.o:(tmpfs_pager_update_writecount)
>>> referenced by tmpfs_subr.c:185 (/usr/src/sys/fs/tmpfs/tmpfs_subr.c:185)
>>> tmpfs_subr.o:(tmpfs_pager_release_writecount)
>>> referenced 567 more times
ld: error: undefined symbol: _lockmgr_assert
>>> referenced by msdosfs_fat.c:404 (/usr/src/sys/fs/msdosfs/msdosfs_fat.c:404)
>>> msdosfs_fat.o:(usemap_free)
>>> referenced by msdosfs_fat.c:762 (/usr/src/sys/fs/msdosfs/msdosfs_fat.c:762)
>>> msdosfs_fat.o:(clusteralloc)
>>> referenced by msdosfs_fat.c:901 (/usr/src/sys/fs/msdosfs/msdosfs_fat.c:901)
>>> msdosfs_fat.o:(fillinusemap)
>>> referenced 14 more times
ld: error: undefined symbol: _rm_assert
>>> referenced by kern_rmlock.c:120 (/usr/src/sys/kern/kern_rmlock.c:120)
>>> kern_rmlock.o:(assert_rm)
>>> referenced by
kern_rmlock.c:151 (/usr/src/sys/kern/kern_rmlock.c:151)
>>> kern_rmlock.o:(unlock_rm)
>>> referenced by kern_rmlock.c:322 (/usr/src/sys/kern/kern_rmlock.c:322)
>>> kern_rmlock.o:(rm_destroy)
>>> referenced 38 more times
ld: error: undefined symbol: _rangelock_cookie_assert
>>> referenced by uipc_shm.c:743 (/usr/src/sys/kern/uipc_shm.c:743)
>>> uipc_shm.o:(shm_dotruncate_cookie)
>>> referenced by uipc_shm.c:641 (/usr/src/sys/kern/uipc_shm.c:641)
>>> uipc_shm.o:(shm_dotruncate_locked)
*** [kernel.full] Error code 1

/etc/make.conf:
CFLAGS= -O2 -pipe -march=native

(In reply to Richard Scheffenegger from comment #37) Hello Richard, I tried to e-mail you at rscheff@freebsd.org, but Office 365 blocked it. Can you e-mail me at chris@cretaforce.gr so I can send you the link to download it?

(In reply to Richard Scheffenegger from comment #37) This looks similar to what I reported last year: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254725

(In reply to Christos Chatzaras from comment #40) Indeed. Unlike the other cores, the session is still in the FIN_WAIT_1 state (6), not LAST_ACK. Also, there is an entire chain of 1-byte holes in the SACK scoreboard (which got retransmitted). No sign of a rescue retransmission - snd_fack seems to have covered snd_max (the FIN bit?), but there is a gap of 1 byte in the sequence stream: snd_max is still off from so_snd by 2 instead of the expected 1. So not quite the same, but maybe the same root cause.

(In reply to Christos Chatzaras from comment #39) Can you also send the link to tuexen@freebsd.org or, if that is blocked, to tuexen@fh-muenster.de?

(In reply to Michael Tuexen from comment #42) Links sent. Thank you.

(In reply to iron.udjin from comment #38) Can you also add

options INVARIANT_SUPPORT

(In reply to Richard Scheffenegger from comment #44) OK, the kernel has been rebuilt. I'll let you know in case I see something interesting in dmesg.
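For what it's worth, the undefined __mtx_assert/_sx_assert/__rw_assert/... symbols above are the classic sign of building with options INVARIANTS but without options INVARIANT_SUPPORT, which compiles in the bodies of those assertion helpers. A kernel config along these lines should link (a sketch only; the ident name is arbitrary):

```
include GENERIC
ident   KP

options TCPPCAP
options WITNESS
options INVARIANTS
options INVARIANT_SUPPORT
```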
(In reply to iron.udjin from comment #45) Can you try something like

sudo sysctl net.inet.tcp.tcp_pcap_packets=10

If your system runs out of mbufs, use less than 10...

(In reply to Michael Tuexen from comment #46) Yes, I did it already. Thanks.

(In reply to iron.udjin from comment #47) Perfect!

(In reply to Richard Scheffenegger from comment #37) Good day to all. We most likely ran into exactly the same problem and backtrace. Here is some information that we hope will be of some help. If there is anything else we can help with, please just let me know.

13.1-STABLE FreeBSD 13.1-STABLE #0 stable/13-n251001-41ce229505a: Sat Jun 4 19:47:50 EEST 2022

kgdb /usr/lib/debug/boot/kernel/kernel.debug /var/crash/vmcore.0 ...

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address = 0x18
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff806e5569
stack pointer = 0x28:0xfffffe027ac53690
frame pointer = 0x28:0xfffffe027ac53700
code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (if_io_tqg_2)
trap number = 12
panic: page fault
cpuid = 2
time = 1655099122
KDB: stack backtrace:
#0 0xffffffff806a0cc5 at kdb_backtrace+0x65
#1 0xffffffff80657a0f at vpanic+0x17f
#2 0xffffffff80657883 at panic+0x43
#3 0xffffffff80a03837 at trap_fatal+0x387
#4 0xffffffff80a0388f at trap_pfault+0x4f
#5 0xffffffff809dcbe8 at calltrap+0x8
#6 0xffffffff807cbee9 at tcp_output+0x1329
#7 0xffffffff807c332b at tcp_do_segment+0x29db
#8 0xffffffff807bfc21 at tcp_input_with_port+0xb61
#9 0xffffffff807c08bb at tcp_input+0xb
#10 0xffffffff807b2118 at ip_input+0x118
#11 0xffffffff807890b9 at netisr_dispatch_src+0xb9
#12 0xffffffff8076d554 at ether_demux+0x144
#13 0xffffffff8076e8b6 at ether_nh_input+0x346
#14 0xffffffff807890b9 at netisr_dispatch_src+0xb9
#15 0xffffffff8076d979 at ether_input+0x69
#16 0xffffffff80780bf1 at
_task_fn_rx+0xc31 #17 0xffffffff8069f6dd at gtaskqueue_run_locked+0x15d Uptime: 8d12h42m8s Dumping 15966 out of 262007 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55 55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu, (kgdb) f 10 #10 0xffffffff807c332b in tcp_do_segment (m=0xfffff8020c425400, th=0xfffff8020c42547a, so=0xfffff8179dcc1760, tp=0xfffffe09a5da7950, drop_hdrlen=41, tlen=<optimized out>, iptos=0 '\000') at /usr/src/sys/netinet/tcp_input.c:2637 2637 (void) tp->t_fb->tfb_tcp_output(tp); (kgdb) p *tp $12 = {t_inpcb = 0xfffff81612d9f1f0, t_fb = 0xffffffff80ebe670 <tcp_def_funcblk>, t_fb_ptr = 0x0, t_maxseg = 1440, t_logstate = 0, t_port = 0, t_state = 8, t_idle_reduce = 0, t_delayed_ack = 0, t_fin_is_rst = 0, t_log_state_set = 0, bits_spare = 0, t_flags = 537920116, snd_una = 893181596, snd_max = 893186736, snd_nxt = 893186736, snd_up = 893181596, snd_wnd = 66240, snd_cwnd = 7996, t_peakrate_thr = 0, ts_offset = 0, rfbuf_ts = 736745954, rcv_numsacks = 0, t_tsomax = 0, t_tsomaxsegcount = 0, t_tsomaxsegsize = 0, rcv_nxt = 2834403673, rcv_adv = 2834408281, rcv_wnd = 4608, t_flags2 = 1026, t_srtt = 0, t_rttvar = 4000, ts_recent = 0, snd_scale = 2 '\002', rcv_scale = 9 '\t', snd_limited = 0 '\000', request_r_scale = 9 '\t', last_ack_sent = 2834403673, t_rcvtime = 2883811337, rcv_up = 2834403673, t_segqlen = 0, t_segqmbuflen = 0, t_segq = {tqh_first = 0x0, tqh_last = 0xfffffe09a5da79e0}, t_in_pkt = 0x0, t_tail_pkt = 0x0, t_timers = 0xfffffe09a5da7bf8, t_vnet = 0xfffff801016b1380, snd_ssthresh = 2880, snd_wl1 = 2834403673, snd_wl2 = 893181596, irs = 2834403154, iss = 893181595, t_acktime = 0, t_sndtime = 2883799567, ts_recent_age = 0, snd_recover = 893186735, cl4_spare = 0, t_oobflags = 0 '\000', t_iobc = 0 '\000', t_rxtcur = 64000, t_rxtshift = 7, t_rtttime = 2883799567, t_rtseq = 893186734, t_starttime = 2883628642, t_fbyte_in = 2883628647, t_fbyte_out = 2883628648, 
t_pmtud_saved_maxseg = 0, t_blackhole_enter = 0, t_blackhole_exit = 0, t_rttmin = 30, t_rttbest = 0, t_softerror = 0, max_sndwnd = 66240, snd_cwnd_prev = 23040, snd_ssthresh_prev = 1073725440, snd_recover_prev = 893181595, t_sndzerowin = 0, t_rttupdated = 0, snd_numholes = 3, t_badrxtwin = 2883629648, snd_holes = {tqh_first = 0xfffff808705bae80, tqh_last = 0xfffff8037d9e6110}, snd_fack = 893186735, sackblks = {{ start = 2834403672, end = 2834403673}, {start = 0, end = 0}, {start = 0, end = 0}, {start = 0, end = 0}, {start = 0, end = 0}, {start = 0, end = 0}}, sackhint = {nexthole = 0xfffff8037d9e6100, sack_bytes_rexmit = 4250, last_sack_ack = 0, delivered_data = 0, sacked_bytes = 888, recover_fs = 5138, prr_delivered = 6636, prr_out = 20102}, t_rttlow = 0, rfbuf_cnt = 517, tod = 0x0, t_sndrexmitpack = 85, t_rcvoopack = 0, t_toe = 0x0, cc_algo = 0xffffffff80ebb1b0 <newreno_cc_algo>, ccv = 0xfffffe09a5da7d40, osd = 0xfffffe09a5da7d68, t_bytes_acked = 0, t_maxunacktime = 0, t_keepinit = 0, t_keepidle = 0, t_keepintvl = 0, t_keepcnt = 0, t_dupacks = 52, t_lognum = 0, t_loglimit = 5000, t_pacing_rate = -1, t_logs = {stqh_first = 0x0, stqh_last = 0xfffffe09a5da7b68}, t_lin = 0x0, t_lib = 0x0, t_output_caller = 0x0, t_stats = 0x0, t_logsn = 0, gput_ts = 0, gput_seq = 0, gput_ack = 0, t_stats_gput_prev = 0, t_maxpeakrate = 0, t_sndtlppack = 0, t_sndtlpbyte = 0, t_sndbytes = 13753, t_snd_rxt_bytes = 37794, t_tfo_client_cookie_len = 0 '\000', t_end_info_status = 0, t_tfo_pending = 0x0, t_tfo_cookie = { client = '\000' <repeats 15 times>, server = 0}, {t_end_info_bytes = "\000\000\000\000\000\000\000", t_end_info = 0}} (kgdb) p *tp->sackhint.nexthole $14 = {start = 893186733, end = 893186734, rxmit = 893186733, scblink = {tqe_next = 0x0, tqe_prev = 0xfffff80870696d90}} (kgdb) p tp->snd_una + tp->t_inpcb->inp_socket->so_snd.sb_ccc $15 = 893186733 (kgdb) p/x tp->t_flags $16 = 0x20100274 (kgdb) p/x tp->t_flags2 $17 = 0x402 (kgdb) p *tp->t_timers $18 = {tt_rexmt = {c_links = {le 
= {le_next = 0xfffffe03c8712b18, le_prev = 0xfffffe01a40f36f8}, sle = {sle_next = 0xfffffe03c8712b18}, tqe = {tqe_next = 0xfffffe03c8712b18, tqe_prev = 0xfffffe01a40f36f8}}, c_time = 3165308754918543, c_precision = 17179868000, c_arg = 0xfffffe09a5da7950, c_func = 0xffffffff807dde30 <tcp_timer_rexmt>, c_lock = 0x0, c_flags = 2, c_iflags = 20, c_cpu = 0}, tt_persist = {c_links = {le = {le_next = 0x0, le_prev = 0x0}, sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, c_precision = 0, c_arg = 0x0, c_func = 0x0, c_lock = 0x0, c_flags = 0, c_iflags = 16, c_cpu = 0}, tt_keep = {c_links = {le = {le_next = 0xfffffe0cfc422408, le_prev = 0xfffffe05c8f73328}, sle = {sle_next = 0xfffffe0cfc422408}, tqe = {tqe_next = 0xfffffe0cfc422408, tqe_prev = 0xfffffe05c8f73328}}, c_time = 3195223522196068, c_precision = 1932735150000, c_arg = 0xfffffe09a5da7950, c_func = 0xffffffff807dd730 <tcp_timer_keep>, c_lock = 0x0, c_flags = 2, c_iflags = 20, c_cpu = 0}, tt_2msl = {c_links = {le = {le_next = 0x0, le_prev = 0x0}, sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, c_precision = 0, c_arg = 0x0, c_func = 0x0, c_lock = 0x0, c_flags = 0, c_iflags = 16, c_cpu = 0}, tt_delack = {c_links = {le = {le_next = 0x0, le_prev = 0x0}, sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, c_precision = 0, c_arg = 0x0, c_func = 0x0, c_lock = 0x0, c_flags = 0, c_iflags = 16, c_cpu = 0}, tt_flags = 0, tt_draincnt = 0} (kgdb) After applying the patch comment#34 and with options INVARIANTS options INVARIANT_SUPPORT in kernel, system goin to panic in 5-40 minutes (tried 3 times, all the same place), with following trace: Unread portion of the kernel message buffer: panic: tcp_output: snd_max beyond so_snd cpuid = 12 time = 1655213044 KDB: stack backtrace: #0 0xffffffff808d8f01 at kdb_backtrace+0x71 #1 0xffffffff8086f797 at vpanic+0x227 #2 0xffffffff8086f2be at panic+0x4e #3 0xffffffff80ab3551 at tcp_output+0x32a1 #4 
0xffffffff80aa2722 at tcp_do_segment+0x2e72 #5 0xffffffff80a9ec35 at tcp_input_with_port+0x1be5 #6 0xffffffff80a9f777 at tcp_input+0x27 #7 0xffffffff80a87061 at ip_input+0xdd1 #8 0xffffffff80a4023f at netisr_dispatch_src+0x1df #9 0xffffffff80a407a1 at netisr_dispatch+0x21 #10 0xffffffff80a11266 at ether_demux+0x306 #11 0xffffffff80a13c10 at ether_input_internal+0x9e0 #12 0xffffffff80a13221 at ether_nh_input+0xb1 #13 0xffffffff80a4023f at netisr_dispatch_src+0x1df #14 0xffffffff80a407a1 at netisr_dispatch+0x21 #15 0xffffffff80a11b09 at ether_input+0x1a9 #16 0xffffffff80a3a925 at iflib_rxeof+0x895 #17 0xffffffff80a2e4e5 at _task_fn_rx+0xd5 Uptime: 43m43s Dumping 9369 out of 261999 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:399 399 dumptid = curthread->td_tid; (kgdb) bt #0 doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:399 #1 0xffffffff8086efd3 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:487 #2 0xffffffff8086f84f in vpanic (fmt=0xffffffff80f390c8 "%s: snd_max beyond so_snd", ap=0xfffffe027ac92320) at /usr/src/sys/kern/kern_shutdown.c:920 #3 0xffffffff8086f2be in panic (fmt=0xffffffff80f390c8 "%s: snd_max beyond so_snd") at /usr/src/sys/kern/kern_shutdown.c:844 #4 0xffffffff80ab3551 in tcp_output (tp=0xfffffe04709abca8) at /usr/src/sys/netinet/tcp_output.c:1583 #5 0xffffffff80aa2722 in tcp_do_segment (m=0xfffff801ef8be500, th=0xfffff801ef8be57a, so=0xfffff8061cdc8b10, tp=0xfffffe04709abca8, drop_hdrlen=41, tlen=0, iptos=0 '\000') at /usr/src/sys/netinet/tcp_input.c:2713 #6 0xffffffff80a9ec35 in tcp_input_with_port (mp=0xfffffe027ac929c8, offp=0xfffffe027ac92968, proto=6, port=0) at /usr/src/sys/netinet/tcp_input.c:1400 #7 0xffffffff80a9f777 in tcp_input (mp=0xfffffe027ac929c8, offp=0xfffffe027ac92968, proto=6) at /usr/src/sys/netinet/tcp_input.c:1496 #8 0xffffffff80a87061 in ip_input (m=0x0) at /usr/src/sys/netinet/ip_input.c:839 #9 0xffffffff80a4023f in 
netisr_dispatch_src (proto=1, source=0, m=0xfffff801ef8be500) at /usr/src/sys/net/netisr.c:1143 #10 0xffffffff80a407a1 in netisr_dispatch (proto=1, m=0xfffff801ef8be500) at /usr/src/sys/net/netisr.c:1234 #11 0xffffffff80a11266 in ether_demux (ifp=0xfffff820816e3800, m=0xfffff801ef8be500) at /usr/src/sys/net/if_ethersubr.c:921 #12 0xffffffff80a13c10 in ether_input_internal (ifp=0xfffff820816e3800, m=0xfffff801ef8be500) at /usr/src/sys/net/if_ethersubr.c:707 #13 0xffffffff80a13221 in ether_nh_input (m=0xfffff801ef8be500) at /usr/src/sys/net/if_ethersubr.c:737 #14 0xffffffff80a4023f in netisr_dispatch_src (proto=5, source=0, m=0xfffff801ef8be500) at /usr/src/sys/net/netisr.c:1143 #15 0xffffffff80a407a1 in netisr_dispatch (proto=5, m=0xfffff801ef8be500) at /usr/src/sys/net/netisr.c:1234 #16 0xffffffff80a11b09 in ether_input (ifp=0xfffff8010650a000, m=0xfffff801ef8be500) at /usr/src/sys/net/if_ethersubr.c:828 #17 0xffffffff80a3a925 in iflib_rxeof (rxq=0xfffffe01b7551080, budget=16) at /usr/src/sys/net/iflib.c:3047 #18 0xffffffff80a2e4e5 in _task_fn_rx (context=0xfffffe01b7551080) at /usr/src/sys/net/iflib.c:3990 #19 0xffffffff808d7427 in gtaskqueue_run_locked (queue=0xfffff80104d7e200) at /usr/src/sys/kern/subr_gtaskqueue.c:371 #20 0xffffffff808d6fad in gtaskqueue_thread_loop (arg=0xfffffe01b71a7128) at /usr/src/sys/kern/subr_gtaskqueue.c:547 #21 0xffffffff808053f2 in fork_exit (callout=0xffffffff808d6f00 <gtaskqueue_thread_loop>, arg=0xfffffe01b71a7128, frame=0xfffffe027ac92f40) at /usr/src/sys/kern/kern_fork.c:1093 #22 <signal handler called> #23 0xffffffff8129ea18 in periodic_resettodr_sys_init () Backtrace stopped: Cannot access memory at address 0x0 (kgdb) fr 4 #4 0xffffffff80ab3551 in tcp_output (tp=0xfffffe04709abca8) at /usr/src/sys/netinet/tcp_output.c:1583 1583 KASSERT(SEQ_LEQ(tp->snd_max, top+1), (kgdb) p tp->snd_max $1 = 3141897257 (kgdb) p top $2 = 3141897255 (kgdb) No KTLS enabled\used. Adapter is Intel X710 (if_ixl). 
If there is anything else we can help with, please just let me know. Thanks a lot! Can you provide the core + kernel.debug files? I've extended the logging in a revised patch, in case that is easier. Created attachment 234683 [details]
more logging
extended logging (w/ panic/KASSERT)
(In reply to Richard Scheffenegger from comment #51) Sent the link in an e-mail to: rscheff@freebsd.org tuexen@freebsd.org (In reply to Dmitriy from comment #53) Thanks. Did you enable options TCPPCAP in the kernel config? It looks like it is not enabled... Thanks for the core. For stable operation, please use an unpatched kernel, and without net.inet.tcp.rfc6675_pipe=0 The patched cores confirm that during the very final phases, the stack increments one of the variables multiple times for the FIN bit, when it should only have incremented it once. Together with SACK rescue retransmissions (enabled by the above sysctl) this leads to an attempt to send out non-existing data. The TCPPCAP option plus sudo sysctl net.inet.tcp.tcp_pcap_packets=10 would retain the last 10 packets of every session in memory, in order to create a reproduction script, as the timing (relative to retransmission timeouts, persist and keepalive timers) of the client's packets apparently has something to do with this problem. (In reply to Michael Tuexen from comment #54) You are right, the option was not enabled. Added these settings and sent new core files in an e-mail to: rscheff@freebsd.org tuexen@freebsd.org Thank you! (In reply to Richard Scheffenegger from comment #55) You mean that he has to use "net.inet.tcp.rfc6675_pipe=0" for stable operation, right? (In reply to Christos Chatzaras from comment #57) But "net.inet.tcp.rfc6675_pipe" was already "0" and was never changed. It remained "0" when the crash occurred. (In reply to Dmitriy from comment #58) Yes, "0" is the default. So the only workaround is to disable SACK? (In reply to Christos Chatzaras from comment #57) Sorry for being unclear: A plain (unpatched) 13.1-RELEASE kernel with net.inet.tcp.rfc6675_pipe=0 would not exhibit a panic. The patches added here really are to gather more information - and deliberately panic (when built with INVARIANTS) or retroactively adjust the incorrect TCP state. 
Stable operation in a production environment would only need the standard kernel, and it's safe to use SACK as long as rfc6675_pipe is not enabled. (The root cause - apparent double-accounting for FIN bits, and possibly even sending FIN bits at two different, final sequence numbers - would still be present, but not cause a crash). (In reply to Dmitriy from comment #56) Thank you! I'm trying to recreate the problem locally using packetdrill. (In reply to Dmitriy from comment #56) I can confirm that options TCPPCAP was enabled in the kernel. However, it seems you did not execute sudo sysctl net.inet.tcp.tcp_pcap_packets=10 At least that would explain why no packets were captured. Can you double check? More than 10 would be useful (like 20 or 30), but I don't know if that would work for your workload. (In reply to Michael Tuexen from comment #62) Option was added to sysctl.conf, system rebooted, and after the reboot it was definitely checked: sysctl net.inet.tcp.tcp_pcap_packets net.inet.tcp.tcp_pcap_packets: 10 Going to set net.inet.tcp.tcp_pcap_packets=30 for sure. Waiting for a new core. Off-topic: in the meantime the kernel was rebuilt without INVARIANTS to continue running without panicking, but with logging. Hope the results would be helpful too: Jun 15 13:55:02 hostname kernel: tcp_output#1587: snd_max 1422777791 > so_snd+1 1422777790 adjusting. Jun 15 13:55:47 hostname kernel: tcp_output#1587: snd_max 1422777792 > so_snd+1 1422777790 adjusting. Jun 15 14:13:29 hostname kernel: tcp_output#1587: snd_max 75162767 > so_snd+1 75162766 adjusting. Jun 15 14:14:15 hostname kernel: tcp_output#1587: snd_max 75162768 > so_snd+1 75162766 adjusting. Jun 15 15:06:32 hostname kernel: tcp_output#1587: snd_max 2137549205 > so_snd+1 2137549204 adjusting. Jun 15 15:15:53 hostname kernel: tcp_output#1587: snd_max 2960832867 > so_snd+1 2960832866 adjusting. Jun 15 15:16:32 hostname kernel: tcp_output#1587: snd_max 1274542388 > so_snd+1 1274542387 adjusting. 
Jun 15 15:16:32 hostname kernel: tcp_output#1587: snd_max 1274542389 > so_snd+1 1274542387 adjusting. Jun 15 15:30:31 hostname kernel: tcp_output#1587: snd_max 2283463051 > so_snd+1 2283463050 adjusting. Jun 15 15:35:43 hostname kernel: tcp_output#1587: snd_max 336646110 > so_snd+1 336646109 adjusting. Jun 15 15:38:35 hostname kernel: tcp_output#1587: snd_max 1227541195 > so_snd+1 1227541194 adjusting. Jun 15 15:38:35 hostname kernel: tcp_output#1587: snd_max 1227541196 > so_snd+1 1227541194 adjusting. Jun 15 16:56:46 hostname kernel: tcp_output#1587: snd_max 382752252 > so_snd+1 382752251 adjusting. Jun 15 16:56:46 hostname kernel: tcp_output#1587: snd_max 382752253 > so_snd+1 382752251 adjusting. Jun 15 17:08:42 hostname kernel: tcp_output#1587: snd_max 359933951 > so_snd+1 359933950 adjusting. Jun 15 17:08:42 hostname kernel: tcp_output#1587: snd_max 359933952 > so_snd+1 359933950 adjusting. Jun 15 17:09:31 hostname kernel: tcp_output#1587: snd_max 205782007 > so_snd+1 205782006 adjusting. Jun 15 17:09:41 hostname kernel: tcp_output#1587: snd_max 205782008 > so_snd+1 205782006 adjusting. Jun 15 17:09:53 hostname kernel: tcp_output#1587: snd_max 205782009 > so_snd+1 205782006 adjusting. Jun 15 17:10:03 hostname kernel: tcp_output#1587: snd_max 205782010 > so_snd+1 205782006 adjusting. Jun 15 17:22:07 hostname kernel: tcp_output#1587: snd_max 4147118877 > so_snd+1 4147118876 adjusting. Jun 15 17:36:46 hostname kernel: tcp_output#1587: snd_max 494998134 > so_snd+1 494998133 adjusting. Jun 15 18:27:14 hostname kernel: tcp_output#1587: snd_max 2387061338 > so_snd+1 2387061337 adjusting. Jun 15 19:22:28 hostname kernel: tcp_output#1587: snd_max 725541353 > so_snd+1 725541352 adjusting. Jun 15 19:23:53 hostname kernel: tcp_output#1587: snd_max 607018670 > so_snd+1 607018669 adjusting. Jun 15 19:24:03 hostname kernel: tcp_output#1587: snd_max 607018671 > so_snd+1 607018669 adjusting. 
Jun 15 19:24:14 hostname kernel: tcp_output#1587: snd_max 607018672 > so_snd+1 607018669 adjusting. Jun 15 19:24:24 hostname kernel: tcp_output#1587: snd_max 607018673 > so_snd+1 607018669 adjusting. Jun 15 19:24:34 hostname kernel: tcp_output#1587: snd_max 607018674 > so_snd+1 607018669 adjusting. Jun 15 19:24:44 hostname kernel: tcp_output#1587: snd_max 607018675 > so_snd+1 607018669 adjusting. Created attachment 234709 [details]
Fix TCPPCAP
(In reply to Dmitriy from comment #63) OK, I tested it locally. To get TCPPCAP working you need to apply the patch I just added with the name "Fix TCPPCAP" Can you apply that patch, build and install an updated kernel and provide a kernel dump after that patch has been applied? (In reply to Michael Tuexen from comment #65) Done. Please find the link with new corefile in e-mail. Thank you! (In reply to Dmitriy from comment #66) Thank you very much. The new core file contains the packets handled by the end point. Today I had another crash in a different server. Fatal trap 12: page fault while in kernel mode cpuid = 7; apic id = 07 fault virtual address = 0x18 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff80cae31d stack pointer = 0x28:0xfffffe01141445c0 frame pointer = 0x28:0xfffffe0114144630 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 0 (if_io_tqg_7) trap number = 12 panic: page fault cpuid = 7 time = 1655767279 KDB: stack backtrace: #0 0xffffffff80c69465 at kdb_backtrace+0x65 #1 0xffffffff80c1bb1f at vpanic+0x17f #2 0xffffffff80c1b993 at panic+0x43 #3 0xffffffff810afdf5 at trap_fatal+0x385 #4 0xffffffff810afe4f at trap_pfault+0x4f #5 0xffffffff81087528 at calltrap+0x8 #6 0xffffffff80de07c9 at tcp_output+0x1339 #7 0xffffffff80dd7eed at tcp_do_segment+0x2cfd #8 0xffffffff80dd44b1 at tcp_input_with_port+0xb61 #9 0xffffffff80dd515b at tcp_input+0xb #10 0xffffffff80dc691f at ip_input+0x11f #11 0xffffffff80d53089 at netisr_dispatch_src+0xb9 #12 0xffffffff80d36ea8 at ether_demux+0x138 #13 0xffffffff80d38235 at ether_nh_input+0x355 #14 0xffffffff80d53089 at netisr_dispatch_src+0xb9 #15 0xffffffff80d372d9 at ether_input+0x69 #16 0xffffffff80ddeaa5 at tcp_push_and_replace+0x25 #17 0xffffffff80ddd74c at tcp_lro_flush+0x4c Uptime: 29d3h36m11s Dumping 4275 out of 65278 
MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55 warning: Source file is more recent than executable. 55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu, (kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55 #1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399 #2 0xffffffff80c1b71c in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:487 #3 0xffffffff80c1bb8e in vpanic (fmt=0xffffffff811b4fb9 "%s", ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920 #4 0xffffffff80c1b993 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:844 #5 0xffffffff810afdf5 in trap_fatal (frame=0xfffffe0114144500, eva=24) at /usr/src/sys/amd64/amd64/trap.c:944 #6 0xffffffff810afe4f in trap_pfault (frame=0xfffffe0114144500, usermode=false, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:763 #7 <signal handler called> #8 m_copydata (m=0x0, m@entry=0xfffff80c219ce500, off=0, len=1, cp=<optimized out>) at /usr/src/sys/kern/uipc_mbuf.c:659 #9 0xffffffff80de07c9 in tcp_output (tp=<optimized out>) at /usr/src/sys/netinet/tcp_output.c:1081 #10 0xffffffff80dd7eed in tcp_do_segment (m=<optimized out>, th=<optimized out>, so=<optimized out>, tp=0xfffffe01990a1000, drop_hdrlen=64, tlen=<optimized out>, iptos=0 '\000') at /usr/src/sys/netinet/tcp_input.c:2637 #11 0xffffffff80dd44b1 in tcp_input_with_port (mp=<optimized out>, offp=<optimized out>, proto=<optimized out>, port=port@entry=0) at /usr/src/sys/netinet/tcp_input.c:1400 #12 0xffffffff80dd515b in tcp_input (mp=0xfffff80c219ce500, offp=0x0, proto=1) at /usr/src/sys/netinet/tcp_input.c:1496 #13 0xffffffff80dc691f in ip_input (m=0x0) at /usr/src/sys/netinet/ip_input.c:839 #14 0xffffffff80d53089 in netisr_dispatch_src (proto=1, source=source@entry=0, m=0xfffff80e00395400) at /usr/src/sys/net/netisr.c:1143 #15 0xffffffff80d5345f in netisr_dispatch (proto=563930368, 
m=0x1) at /usr/src/sys/net/netisr.c:1234 #16 0xffffffff80d36ea8 in ether_demux (ifp=ifp@entry=0xfffff80004659000, m=0x0) at /usr/src/sys/net/if_ethersubr.c:921 #17 0xffffffff80d38235 in ether_input_internal (ifp=0xfffff80004659000, m=0x0) at /usr/src/sys/net/if_ethersubr.c:707 #18 ether_nh_input (m=<optimized out>) at /usr/src/sys/net/if_ethersubr.c:737 #19 0xffffffff80d53089 in netisr_dispatch_src (proto=proto@entry=5, source=source@entry=0, m=m@entry=0xfffff80e00395400) at /usr/src/sys/net/netisr.c:1143 #20 0xffffffff80d5345f in netisr_dispatch (proto=563930368, proto@entry=5, m=0x1, m@entry=0xfffff80e00395400) at /usr/src/sys/net/netisr.c:1234 #21 0xffffffff80d372d9 in ether_input (ifp=<optimized out>, m=0xfffff80e00395400) at /usr/src/sys/net/if_ethersubr.c:828 #22 0xffffffff80ddeaa5 in tcp_push_and_replace (lc=0xfffff80c219ce500, lc@entry=0xfffff80003ef2830, le=le@entry=0xfffffe0158387690, m=m@entry=0xfffff80f2b178300) at /usr/src/sys/netinet/tcp_lro.c:923 #23 0xffffffff80ddd74c in tcp_lro_condense (lc=0xfffff80003ef2830, le=0xfffffe0158387690) at /usr/src/sys/netinet/tcp_lro.c:1011 #24 tcp_lro_flush (lc=lc@entry=0xfffff80003ef2830, le=0xfffffe0158387690) at /usr/src/sys/netinet/tcp_lro.c:1374 #25 0xffffffff80dddd3b in tcp_lro_rx_done (lc=0xfffff80003ef2830) at /usr/src/sys/netinet/tcp_lro.c:566 #26 tcp_lro_flush_all (lc=lc@entry=0xfffff80003ef2830) at /usr/src/sys/netinet/tcp_lro.c:1532 #27 0xffffffff80d4f503 in iflib_rxeof (rxq=<optimized out>, rxq@entry=0xfffff80003ef2800, budget=<optimized out>) at /usr/src/sys/net/iflib.c:3058 #28 0xffffffff80d49b22 in _task_fn_rx (context=0xfffff80003ef2800) at /usr/src/sys/net/iflib.c:3990 #29 0xffffffff80c67e9d in gtaskqueue_run_locked ( queue=queue@entry=0xfffff80003cbf000) at /usr/src/sys/kern/subr_gtaskqueue.c:371 #30 0xffffffff80c67b12 in gtaskqueue_thread_loop (arg=<optimized out>, arg@entry=0xfffffe01142820b0) at /usr/src/sys/kern/subr_gtaskqueue.c:547 #31 0xffffffff80bd8a5e in fork_exit ( 
callout=0xffffffff80c67a50 <gtaskqueue_thread_loop>, arg=0xfffffe01142820b0, frame=0xfffffe0114144f40) at /usr/src/sys/kern/kern_fork.c:1093 #32 <signal handler called> #33 mi_startup () at /usr/src/sys/kern/init_main.c:322 Backtrace stopped: Cannot access memory at address 0x1d (kgdb) Are you using net.inet.tcp.rfc6675_pipe=1? While we are investigating the root cause for snd_max to grow beyond what it is supposed to ever grow to, in the meantime disabling sack rescue retransmissions should prevent these panics. (Note: even though we have one core with the (effectively) full packet trace, the misbehavior could not yet be reproduced properly. (In reply to Christos Chatzaras from comment #68) From the provided core: snd_una: 2568161219 snd_nxt: 2568161624 snd_max: 2568161624 snd_fack: 2568161624 (!!) so_snd: 403 snd_una + so_snd: 2568161622 *tp->sackhint->nexthole {start = 2568161219, end = 2568161623, rxmit = 2568161622, ... This indicates, that the remote client SACKed the 2nd FIN bit (which apparently was sent at sequence number 2568161624, not 2568161623 as expected). The SACK for this 2nd FIN created a SACK scoreboard hole, which included the sequence number of the 1st FIN bit - which does not exist in the socket sendbuffer. Thus when trying to retransmit this non-existing byte, the panic happens. Created attachment 234833 [details]
symptomatic logging patch
This patch symptomatically addresses the case where snd_max grows beyond the FIN bit, without panicking. Also, a few tidbits of data get logged to dmesg.
(In reply to Richard Scheffenegger from comment #69) 1) net.inet.tcp.rfc6675_pipe is disabled. 2) I sent the vmcore + debug symbols to your e-mail. 3) I temporarily disabled SACK. I see that iron.udjin's loader.conf has wireguard enabled. I use wireguard too. Is this possibly related to wireguard? The offending TCP sessions we've seen so far all seem to be regular https sessions. For (yet unknown) reasons, rarely the FIN bit seems to get accounted more than once - up to 6 times, from one of the logging-patched kernels. With SACK rescue retransmissions, that can lead to the sequence number of one or more FIN bits being included in the block which is to be retransmitted - and the panic happens when the attempt is made to retransmit the data byte at the sequence number of the FIN bit... In your case, even though you didn't have SACK rescue retransmissions enabled, the client of the offending session appears to have SACK'ed the 2nd retransmission of the (empty) FIN packet with a "too high" sequence number, in effect resulting in the same issue. This should happen much less frequently than with the SACK rescue retransmissions. At the same time, it appears that double-counting of FIN bits happens quite frequently, but is not easy to reproduce. Thus we are currently working on a patch which exposes this (in INVARIANTS kernels with panics, in production kernels by logging that unexpected state, and clearing it "on the fly"). See D35446 (In reply to Christos Chatzaras from comment #73) I don't think that it's somehow related to wireguard because I had similar panics on a server which doesn't have wireguard installed/used. (In reply to iron.udjin from comment #75) Thanks for the data pointer. I guess whatever modifications a tunnel or link layer does unintentionally would look like packet corruption to TCP and should be handled fine. I also guess such an issue would have come up on a much broader scale. 
(In reply to Richard Scheffenegger from comment #70) I found this old review which I believe is not committed: https://reviews.freebsd.org/D2970 Does this try to fix my panic? From the pcap information we were able to reconstruct the packet sequence, but not the exact timing. Unfortunately, we were not able to reproduce the issue. However, it seems hard to extend the pcap logging. So I suggest using the BlackBox Logging instead. Could you:
* Add options TCP_BLACKBOX to your kernel configuration file and recompile the kernel. You can remove the TCPPCAP option, if you want.
* Add
net.inet.tcp.bb.log_auto_all=1
net.inet.tcp.bb.log_auto_ratio=1
net.inet.tcp.bb.log_auto_mode=1
net.inet.tcp.bb.log_session_limit=200
This should enable BlackBox Logging on all TCP connections and it will keep the last 200 events per TCP connection stored in the endpoint. Once the system has crashed, please provide a core. This time we should not only get the packets being sent/received, but also internal TCP state variables and information about the retransmission timer. If this information is not enough, we can extend the TCP Black Box Logging ad hoc. Hello all, On the few servers we're running 13.1 we see these trap 12 errors too. The most interesting bit I can contribute is we only see it on servers utilizing if_bridge. My guess is the servers using if_bridge are seeing a lot more traffic across the wire and are therefore more likely to experience the packet sequence required to cause the trap 12 errors. When we rank servers by trap 12 error occurrence we see a correlation with bandwidth utilization (more bandwidth utilization, greater trap 12 error frequency). Happy to help in any way I can. (In reply to Chad Smith from comment #79) Do the affected servers report IP-layer output errors? Or low mbuf situations? (netstat -s after a bit of uptime may show increasing IP output error counters). 
Background: As the full packet trace (without timing) by itself is not sufficient to reproduce the issue, one possibility is that very infrequent error handling is involved in the introduction of the problematic state - and this may only happen on quite busy machines where mbuf allocation or packet output issues show up. (In reply to Richard Scheffenegger from comment #80) Yes, errors from ip_output() are logged: https://cgit.freebsd.org/src/tree/sys/netinet/tcp_output.c?h=stable/13#n1670 This is not the same way as in RACK or BBR, but we can clean that up later. (In reply to Michael Tuexen from comment #78) Just sent the link with the new core file in an e-mail. Kernel was built with "options TCP_BLACKBOX", without any debug/diagnostic patches. Thank you! (In reply to Dmitriy from comment #82) Thank you very much for the core. Let me have a look, I'll report my findings. (In reply to Richard Scheffenegger from comment #80) Grabbed the netstat -s output from a recent crash file where the system had been up for an hour before it crashed again, also posting some additional info. We have a test bench set up and we are trying to reproduce this behavior on LAN. It seems that in perfect network conditions a massively parallel iperf run across a bridged interface for 24 hours does not trigger this. Looking for ideas on how to simulate internet-like network conditions. Short of hooking the other end of our test bench up to an internet connection in another city I am out of ideas. Open to suggestions. 
# uname -a FreeBSD 13.1-RELEASE FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212 GENERIC amd64 Fatal trap 12: page fault while in kernel mode cpuid = 4; apic id = 04 fault virtual address = 0x10 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff8234a783 stack pointer = 0x0:0xfffffe00c45a5a50 frame pointer = 0x0:0xfffffe00c45a5a60 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 0 (if_io_tqg_4) trap number = 12 panic: page fault cpuid = 4 time = 1657212869 KDB: stack backtrace: #0 0xffffffff80c69465 at kdb_backtrace+0x65 #1 0xffffffff80c1bb1f at vpanic+0x17f #2 0xffffffff80c1b993 at panic+0x43 #3 0xffffffff810afdf5 at trap_fatal+0x385 #4 0xffffffff810afe4f at trap_pfault+0x4f #5 0xffffffff81087528 at calltrap+0x8 #6 0xffffffff80d55ec7 at pfil_run_hooks+0x97 #7 0xffffffff8239af37 at bridge_pfil+0x497 #8 0xffffffff8239d5a3 at bridge_forward+0x323 #9 0xffffffff8239cef1 at bridge_input+0x4c1 #10 0xffffffff80d380fd at ether_nh_input+0x21d #11 0xffffffff80d53089 at netisr_dispatch_src+0xb9 #12 0xffffffff80d372d9 at ether_input+0x69 #13 0xffffffff80d4f4d7 at iflib_rxeof+0xc27 #14 0xffffffff80d49b22 at _task_fn_rx+0x72 #15 0xffffffff80c67e9d at gtaskqueue_run_locked+0x15d #16 0xffffffff80c67b12 at gtaskqueue_thread_loop+0xc2 #17 0xffffffff80bd8a5e at fork_exit+0x7e ------------------------------------------------------------------------ netstat -s tcp: 144916 packets sent 10016 data packets (765551 bytes) 0 data packets (0 bytes) retransmitted 0 data packets unnecessarily retransmitted 0 resends initiated by MTU discovery 133061 ack-only packets (306 delayed) 0 URG only packets 0 window probe packets 7 window update packets 1832 control packets 275476 packets received 13046 acks (for 767206 bytes) 17 duplicate acks 0 UDP tunneled pkts 0 UDP tunneled pkt cnt with errors 0 acks for unsent data 269667 packets (337208234 
bytes) received in-sequence 17 completely duplicate packets (1304 bytes) 0 old duplicate packets 0 packets with some dup. data (0 bytes duped) 13 out-of-order packets (16952 bytes) 0 packets (0 bytes) of data after window 0 window probes 0 window update packets 0 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 0 discarded due to full reassembly queue 177 connection requests 1561 connection accepts 0 bad connection attempts 0 listen queue overflows 0 ignored RSTs in the windows 1661 connections established (including accepts) 6 times used RTT from hostcache 6 times used RTT variance from hostcache 0 times used slow-start threshold from hostcache 1680 connections closed (including 0 drops) 6 connections updated cached RTT on close 6 connections updated cached RTT variance on close 0 connections updated cached ssthresh on close 3 embryonic connections dropped 13046 segments updated rtt (of 9912 attempts) 74 retransmit timeouts 0 connections dropped by rexmit timeout 0 persist timeouts 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 0 keepalive timeouts 0 keepalive probes sent 0 connections dropped by keepalive 254 correct ACK header predictions 260118 correct data packet header predictions 1561 syncache entries added 0 retransmitted 0 dupsyn 0 dropped 1561 completed 0 bucket overflow 0 cache overflow 0 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 1561 cookies sent 0 cookies received 1 hostcache entry added 0 bucket overflow 0 SACK recovery episodes 0 segment rexmits in SACK recovery episodes 0 byte rexmits in SACK recovery episodes 0 SACK options (SACK blocks) received 0 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 0 packets with matching 
signature received 0 packets with bad signature received 0 times failed to make signature due to no SA 0 times unexpected signature received 0 times no signature provided by segment 0 Path MTU discovery black hole detection activations 0 Path MTU discovery black hole detection min MSS activations 0 Path MTU discovery black hole detection failures 154 times connection in TIME-WAIT responded with ACK 0 times connection in TIME-WAIT was actively recycled 0 times connection in TIME-WAIT responded with RST TCP connection count by state: 0 connections in CLOSED state 8 connections in LISTEN state 0 connections in SYN_SENT state 0 connections in SYN_RCVD state 6 connections in ESTABLISHED state 0 connections in CLOSE_WAIT state 0 connections in FIN_WAIT_1 state 0 connections in CLOSING state 0 connections in LAST_ACK state 0 connections in FIN_WAIT_2 state 64 connections in TIME_WAIT state udp: 288632 datagrams received 0 with incomplete header 0 with bad data length field 0 with bad checksum 0 with no checksum 13740 dropped due to no socket 2422 broadcast/multicast datagrams undelivered 0 dropped due to full socket buffers 0 not for hashed pcb 272470 delivered 179590 datagrams output 0 times multicast source filter matched ip: 648231 total packets received 0 bad header checksums 0 with size smaller than minimum 0 with data size < data length 0 with ip length > max ip packet size 0 with header length < data size 0 with data length < header length 0 with bad options 0 with incorrect version number 0 fragments received 0 fragments dropped (dup or out of space) 0 fragments dropped after timeout 0 packets reassembled ok 564854 packets for this host 75426 packets for unknown/unsupported protocol 0 packets forwarded (0 packets fast forwarded) 7951 packets not forwardable 0 packets received for unknown multicast group 0 redirects sent 324597 packets sent from this host 0 packets sent with fabricated ip header 0 output packets dropped due to no bufs, etc. 
2 output packets discarded due to no route 0 output datagrams fragmented 0 fragments created 0 datagrams that can't be fragmented 0 tunneling packets that can't find gif 0 datagrams with bad address in header icmp: 75428 calls to icmp_error 0 errors not generated in response to an icmp message Output histogram: echo reply: 80 destination unreachable: 2 0 messages with bad code fields 0 messages less than the minimum length 0 messages with bad checksum 0 messages with bad length 0 multicast echo requests ignored 0 multicast timestamp requests ignored Input histogram: destination unreachable: 65 echo: 80 80 message responses generated 0 invalid return addresses 0 no return routes arp: 3 ARP requests sent 1 ARP request failed to sent 72 ARP replies sent 20625 ARP requests received 773 ARP replies received 74733 ARP packets received 0 total packets dropped due to no ARP entry 0 ARP entrys timed out 0 Duplicate IPs seen ip6: 0 total packets received 0 with size smaller than minimum 0 with data size < data length 0 with bad options 0 with incorrect version number 0 fragments received 0 fragments dropped (dup or out of space) 0 fragments dropped after timeout 0 fragments that exceeded limit 0 atomic fragments 0 packets reassembled ok 0 packets for this host 0 packets forwarded 0 packets not forwardable 0 redirects sent 0 packets sent from this host 0 packets sent with fabricated ip header 0 output packets dropped due to no bufs, etc. 
0 output packets discarded due to no route 0 output datagrams fragmented 0 fragments created 0 datagrams that can't be fragmented 0 packets that violated scope rules 0 multicast packets which we don't join Mbuf statistics: 0 one mbuf 0 one ext mbuf 0 two or more ext mbuf 0 packets whose headers are not contiguous 0 tunneling packets that can't find gif 0 packets discarded because of too many headers 0 failures of source address selection icmp6: 0 calls to icmp6_error 0 errors not generated in response to an icmp6 message 0 errors not generated because of rate limitation 0 messages with bad code fields 0 messages < minimum length 0 bad checksums 0 messages with bad length Histogram of error messages to be generated: 0 no route 0 administratively prohibited 0 beyond scope 0 address unreachable 0 port unreachable 0 packet too big 0 time exceed transit 0 time exceed reassembly 0 erroneous header field 0 unrecognized next header 0 unrecognized option 0 redirect 0 unknown 0 message responses generated 0 messages with too many ND options 0 messages with bad ND options 0 bad neighbor solicitation messages 0 bad neighbor advertisement messages 0 bad router solicitation messages 0 bad router advertisement messages 0 bad redirect messages 0 default routers overflows 0 prefix overflows 0 neighbour entries overflows 0 redirect overflows 0 messages with invalid hop limit 0 path MTU changes rip6: 0 messages received 0 checksum calculations on inbound 0 messages with bad checksum 0 messages dropped due to no socket 0 multicast messages dropped due to no socket 0 messages dropped due to full socket buffers 0 delivered 0 datagrams output pfkey: 0 requests sent from userland 0 bytes sent from userland 0 messages with invalid length field 0 messages with invalid version field 0 messages with invalid message type field 0 messages too short 0 messages with memory allocation failure 0 messages with duplicate extension 0 messages with invalid extension type 0 messages with invalid sa type 
0 messages with invalid address extension 0 requests sent to userland 0 bytes sent to userland 0 messages toward single socket 0 messages toward all sockets 0 messages toward registered sockets 0 messages with memory allocation failure Hi Chad, Thank you very much for your effort. Are you sure about this panic - in the backtrace, it looks like an issue happened in #6 0xffffffff80d55ec7 at pfil_run_hooks+0x97 #7 0xffffffff8239af37 at bridge_pfil+0x497 (bridging code, not tcp). In the few instances we have seen in more detail, it looked like a TLS handshake was being performed when - for one reason or another - there was no response from the client for an extended period of time (probably 1-2 min), before seeing ACKs (and SACKs) again. The IP stats in your case seem pretty clean - we suspected that a missing adjustment in a local error handling path may have something to do with this, but after doing some error injection prior to and after addressing this, the behavior seems sufficiently different that I'm not convinced that is the culprit either. 
Fix offset computation
Hi Dmitriy, there are several things going on that are unexpected, but one that I think is definitely wrong is the computation of an offset. Can you try the patch "Fix offset computation" and report whether the kernel still crashes? Comment on attachment 236664 [details]
Fix offset computation
The patch doesn't apply to tcp_output.c in 13-stable.
		} else {
			len = ((int32_t)ulmin(cwin,
			    SEQ_SUB(p->end, p->rxmit)));
		}
		off = SEQ_SUB(p->rxmit, tp->snd_una);
		KASSERT(off >= 0,("%s: sack block to the left of una : %d",
		    __func__, off));
		if (len > 0) {
			sack_rxmit = 1;
			sendalot = 1;
			TCPSTAT_INC(tcps_sack_rexmits);
			TCPSTAT_ADD(tcps_sack_rexmit_bytes,
			    min(len, tcp_maxseg(tp)));
		}
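Judging from the commit message of 6d9e911 quoted later in this report ("Only update the offset if actually retransmitting from the scoreboard") and the note in comment #94 about also moving a KASSERT, the fix appears to move the offset computation from the snippet above inside the len > 0 branch - roughly this shape (a sketch of the change, not the verbatim diff):

```
		if (len > 0) {
			off = SEQ_SUB(p->rxmit, tp->snd_una);
			KASSERT(off >= 0,
			    ("%s: sack block to the left of una : %d",
			    __func__, off));
			sack_rxmit = 1;
			sendalot = 1;
			TCPSTAT_INC(tcps_sack_rexmits);
			TCPSTAT_ADD(tcps_sack_rexmit_bytes,
			    min(len, tcp_maxseg(tp)));
		}
```

With the assignment outside the branch, off could be clobbered from a SACK hole even when no scoreboard retransmission actually happens, later steering a transmission past the end of the send buffer.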
Created attachment 236665 [details]
Fix offset computation (stable/13)
(In reply to Dobri Dobrev from comment #88) Please use patch for stable/13 Created attachment 236666 [details]
Updated: Fix offset computation (stable/13)
Created attachment 236667 [details]
Updated: Fix offset computation (releng/13.1)
I also forgot to move a KASSERT. But it shouldn't be critical if you don't move it... A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=6d9e911fbadf3b409802a211c1dae9b47cb5a2b8 commit 6d9e911fbadf3b409802a211c1dae9b47cb5a2b8 Author: Michael Tuexen <tuexen@FreeBSD.org> AuthorDate: 2022-09-19 10:42:43 +0000 Commit: Michael Tuexen <tuexen@FreeBSD.org> CommitDate: 2022-09-19 10:49:31 +0000 tcp: fix computation of offset Only update the offset if actually retransmitting from the scoreboard. If not done correctly, this may result in trying to (re)-transmit data not being being in the socket buffe and therefore resulting in a panic. PR: 264257 PR: 263445 PR: 260393 Reviewed by: rscheff@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D36626 sys/netinet/tcp_output.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) 24 hours - no panic yet. Looks promising. (In reply to Dobri Dobrev from comment #95) How long did it take normally until a panic happened? (In reply to Michael Tuexen from comment #96) Last time it took 10-15 minutes, however, I've had cases where it took 24+ hours. Atm still running w/o a panic. It looks like setting "net.inet.tcp.rfc6675_pipe=1" was a workaround, could you confirm, please? This issue looks similar to the already fixed bug 254244. (In reply to Marek Zarychta from comment #98) > It looks like setting "net.inet.tcp.rfc6675_pipe=1" was a workaround, could you confirm, please? The workaround was "net.inet.tcp.rfc6675_pipe=0" for bug 254244, to disable "SACK rescue retransmissions". > This issue looks similar to the already fixed bug 254244. Similar yes, but not identical. In Bug 254244, it was an oversight on my part to deal with the final <FIN> bit in the sequence number space, if a rescue retransmission is to be done after the stack had already sent this <FIN>. 
The implicit assumption is that at the tail end of a TCP session, only a single <FIN> exists, at one specific sequence number - as it should be. However, the complex of bug 260393, bug 263445 and this bug 264257 comes from an issue which existed "forever", but never, or extremely rarely, manifested itself as a server-side panic. It does have some potential to lead to erroneous bytes at the tail end of a TCP session, which, depending on the application using the data, may lead to client-side data inconsistencies.

That issue is that, under specific circumstances, a TCP session will send multiple <FIN>s at advancing sequence numbers, "overwriting" the previously sent <FIN> with a (not fully) arbitrary data byte. In one instance, logging showed up to 6 consecutive <FIN>s, each at a new position - but even just two such <FIN>s at consecutive positions are problematic.

This day-zero issue (most likely affecting / having affected all BSD-derived TCP stacks with active SACK) is fixed in FreeBSD with https://reviews.freebsd.org/D36626

While investigating further, two more bugs showed up:
https://reviews.freebsd.org/D36631
https://reviews.freebsd.org/D36637
but these do not corrupt the tcpcb state or lead to erroneous transmission of different information at the same sequence number - they only affect when and which segments are sent. (A regression in MAIN was also found while looking closely at the test cases - one of which would not fully validate the expected state - but that is not directly related to this problem.)
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=5ae83e0d871bc7cbe4dcc9a33d37eb689e631efe

commit 5ae83e0d871bc7cbe4dcc9a33d37eb689e631efe
Author:     Michael Tuexen <tuexen@FreeBSD.org>
AuthorDate: 2022-09-22 10:12:11 +0000
Commit:     Michael Tuexen <tuexen@FreeBSD.org>
CommitDate: 2022-09-22 10:12:11 +0000

    tcp: send ACKs when requested

    When doing Limited Transmit, send an ACK when needed by the protocol
    processing (like sending ACKs with a DSACK block).

    PR:             264257
    PR:             263445
    PR:             260393
    Reviewed by:    rscheff@
    MFC after:      3 days
    Sponsored by:   Netflix, Inc.
    Differential Revision: https://reviews.freebsd.org/D36631

 sys/netinet/tcp_input.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a743fc8826fa348b09d219632594c537f8e5690e

commit a743fc8826fa348b09d219632594c537f8e5690e
Author:     Richard Scheffenegger <rscheff@FreeBSD.org>
AuthorDate: 2022-09-22 10:55:25 +0000
Commit:     Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2022-09-22 11:28:43 +0000

    tcp: fix cwnd restricted SACK retransmission loop

    While sending the initial SACK retransmission segment while heavily
    cwnd-constrained, tcp_output can erroneously send out the entire send
    buffer again. This may happen after a retransmission timeout, which
    resets snd_nxt to snd_una while the SACK scoreboard is still populated.

    Reviewed By:    tuexen, #transport
    PR:             264257
    PR:             263445
    PR:             260393
    MFC after:      3 days
    Sponsored by:   NetApp, Inc.
    Differential Revision: https://reviews.freebsd.org/D36637

 sys/netinet/tcp_output.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

(In reply to Richard Scheffenegger from comment #99)
I asked because I have never experienced such a panic on the systems I am maintaining, and since net.inet.tcp.rfc6675_pipe=1 is set almost everywhere, I was guessing that maybe it was a remedy. Thank you for an exhaustive explanation of these intricacies.
(In reply to Marek Zarychta from comment #102)
This setting got enabled by default in D28702 (https://reviews.freebsd.org/rGd1de2b05a001d3d80f633f576f4909c2686dda3d). Prior to this, RFC 6675 (SACK loss recovery) was implemented more fully with D18985 (https://reviews.freebsd.org/rG3c40e1d52cd86168779cf99dbabe58df465d7e3f).

This combination - enhanced features, combined with a very long lingering issue in handling rare corner cases of SACK loss recovery & retransmission timeout (RTO) interaction - led to the much higher incidence rate, and the actual panics reported in these three associated bugs.

As the sequence of events which leads to this issue includes non-responsive clients or network paths with extreme packet loss rates, this effect is unlikely to be observed in stable, low-loss environments with reactive clients.

Nevertheless, there is still a discussion point as to whether there should be a (binary) patch for this, or whether providing a patch and including it in STABLE-13 / STABLE-12 is sufficient. So far, most affected deployments could include the patches.

A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=3651c4f42285644938e2f5bc924ab8c7ed857f83

commit 3651c4f42285644938e2f5bc924ab8c7ed857f83
Author:     Richard Scheffenegger <rscheff@FreeBSD.org>
AuthorDate: 2022-09-22 10:55:25 +0000
Commit:     Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2022-09-25 08:52:56 +0000

    tcp: fix cwnd restricted SACK retransmission loop

    While sending the initial SACK retransmission segment while heavily
    cwnd-constrained, tcp_output can erroneously send out the entire send
    buffer again. This may happen after a retransmission timeout, which
    resets snd_nxt to snd_una while the SACK scoreboard is still populated.

    Reviewed By:    tuexen, #transport
    PR:             264257
    PR:             263445
    PR:             260393
    MFC after:      3 days
    Sponsored by:   NetApp, Inc.
    Differential Revision: https://reviews.freebsd.org/D36637

(cherry picked from commit a743fc8826fa348b09d219632594c537f8e5690e)

 sys/netinet/tcp_output.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9e69e009c86f259653610f3c337253b79381c7a7

commit 9e69e009c86f259653610f3c337253b79381c7a7
Author:     Michael Tuexen <tuexen@FreeBSD.org>
AuthorDate: 2022-09-22 10:12:11 +0000
Commit:     Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2022-09-25 08:46:54 +0000

    tcp: send ACKs when requested

    When doing Limited Transmit, send an ACK when needed by the protocol
    processing (like sending ACKs with a DSACK block).

    PR:             264257
    PR:             263445
    PR:             260393
    Reviewed by:    rscheff@
    MFC after:      3 days
    Sponsored by:   Netflix, Inc.
    Differential Revision: https://reviews.freebsd.org/D36631

(cherry picked from commit 5ae83e0d871bc7cbe4dcc9a33d37eb689e631efe)

 sys/netinet/tcp_input.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=26370413d43bfd65500270ff331ae6bdf0f54133

commit 26370413d43bfd65500270ff331ae6bdf0f54133
Author:     Michael Tuexen <tuexen@FreeBSD.org>
AuthorDate: 2022-09-19 10:42:43 +0000
Commit:     Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2022-09-25 08:41:54 +0000

    tcp: fix computation of offset

    Only update the offset if actually retransmitting from the scoreboard.
    If not done correctly, this may result in trying to (re)transmit data
    not being in the socket buffer and therefore resulting in a panic.

    PR:             264257
    PR:             263445
    PR:             260393
    Reviewed by:    rscheff@
    MFC after:      3 days
    Sponsored by:   Netflix, Inc.
    Differential Revision: https://reviews.freebsd.org/D36626

(cherry picked from commit 6d9e911fbadf3b409802a211c1dae9b47cb5a2b8)

 sys/netinet/tcp_output.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=c1f9a81e7bfe354dfa4f191d5180426f76bc514b

commit c1f9a81e7bfe354dfa4f191d5180426f76bc514b
Author:     Richard Scheffenegger <rscheff@FreeBSD.org>
AuthorDate: 2022-09-22 10:55:25 +0000
Commit:     Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2022-09-25 08:56:28 +0000

    tcp: fix cwnd restricted SACK retransmission loop

    While sending the initial SACK retransmission segment while heavily
    cwnd-constrained, tcp_output can erroneously send out the entire send
    buffer again. This may happen after a retransmission timeout, which
    resets snd_nxt to snd_una while the SACK scoreboard is still populated.

    Reviewed By:    tuexen, #transport
    PR:             264257
    PR:             263445
    PR:             260393
    MFC after:      3 days
    Sponsored by:   NetApp, Inc.
    Differential Revision: https://reviews.freebsd.org/D36637

(cherry picked from commit a743fc8826fa348b09d219632594c537f8e5690e)

 sys/netinet/tcp_output.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=f9edad0054652e020b8214f61c0e454fd48101a6

commit f9edad0054652e020b8214f61c0e454fd48101a6
Author:     Michael Tuexen <tuexen@FreeBSD.org>
AuthorDate: 2022-09-22 10:12:11 +0000
Commit:     Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2022-09-25 08:55:41 +0000

    tcp: send ACKs when requested

    When doing Limited Transmit, send an ACK when needed by the protocol
    processing (like sending ACKs with a DSACK block).

    PR:             264257
    PR:             263445
    PR:             260393
    Reviewed by:    rscheff@
    MFC after:      3 days
    Sponsored by:   Netflix, Inc.
    Differential Revision: https://reviews.freebsd.org/D36631

(cherry picked from commit 5ae83e0d871bc7cbe4dcc9a33d37eb689e631efe)

 sys/netinet/tcp_input.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0612d3000b974f31de15c90c77bf43f121fc8656

commit 0612d3000b974f31de15c90c77bf43f121fc8656
Author:     Michael Tuexen <tuexen@FreeBSD.org>
AuthorDate: 2022-09-19 10:42:43 +0000
Commit:     Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2022-09-25 08:54:18 +0000

    tcp: fix computation of offset

    Only update the offset if actually retransmitting from the scoreboard.
    If not done correctly, this may result in trying to (re)transmit data
    not being in the socket buffer and therefore resulting in a panic.

    PR:             264257
    PR:             263445
    PR:             260393
    Reviewed by:    rscheff@
    MFC after:      3 days
    Sponsored by:   Netflix, Inc.
    Differential Revision: https://reviews.freebsd.org/D36626

(cherry picked from commit 6d9e911fbadf3b409802a211c1dae9b47cb5a2b8)

 sys/netinet/tcp_output.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

(In reply to Michael Tuexen from comment #87)
Hello, after applying the patch named "Fix offset computation", the system now has 12 days of uptime with the same workload that, before the patch, caused a crash after 5-8 days of uptime. So it seems that was the root cause and it has been fixed! Thank you!

Hello, is it possible this patch fixes the issue in comment #36?

(In reply to Christos Chatzaras from comment #111)
Yes, this is very likely - the "fix offset" patch addresses the problematic sequence of events leading to this off-by-one calculation.

MAIN: https://reviews.freebsd.org/rG6d9e911fbadf3b409802a211c1dae9b47cb5a2b8
STABLE-13: https://reviews.freebsd.org/rG0612d3000b974f31de15c90c77bf43f121fc8656
STABLE-12: https://reviews.freebsd.org/rG26370413d43bfd65500270ff331ae6bdf0f54133

The change is effectively identical in all variants.
(Without the other two fixes, the exact timing of packets being sent may not fully conform to TCP specs, but this doesn't materially impact stability or consistency at all).

Closing this as it seems to be fixed. If the problem still exists, please reopen.

We're still seeing a very similar panic on a FreeBSD 13 router system that includes this patch in the kernel:

Tracing command kernel pid 0 tid 100012 td 0xfffffe001ed50720 (CPU 1)
kdb_enter() at kdb_enter+0x37/frame 0xfffffe001bd3f5b0
vpanic() at vpanic+0x1b0/frame 0xfffffe001bd3f600
panic() at panic+0x43/frame 0xfffffe001bd3f660
trap_fatal() at trap_fatal+0x385/frame 0xfffffe001bd3f6c0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe001bd3f720
calltrap() at calltrap+0x8/frame 0xfffffe001bd3f720
--- trap 0xc, rip = 0xffffffff80d3d250, rsp = 0xfffffe001bd3f7f0, rbp = 0xfffffe001bd3f850 ---
m_copym() at m_copym+0x30/frame 0xfffffe001bd3f850
ip_fragment() at ip_fragment+0x24f/frame 0xfffffe001bd3f8f0
ip_output() at ip_output+0x13d5/frame 0xfffffe001bd3fa30
ip_forward() at ip_forward+0x3cf/frame 0xfffffe001bd3fae0
ip_input() at ip_input+0x79e/frame 0xfffffe001bd3fb70
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe001bd3fbc0
ether_demux() at ether_demux+0x138/frame 0xfffffe001bd3fbf0
ether_nh_input() at ether_nh_input+0x355/frame 0xfffffe001bd3fc50
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe001bd3fca0
ether_input() at ether_input+0x69/frame 0xfffffe001bd3fd00
iflib_rxeof() at iflib_rxeof+0xc27/frame 0xfffffe001bd3fe00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe001bd3fe40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe001bd3fec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe001bd3fef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe001bd3ff30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001bd3ff30
--- trap 0, rip = 0xffffffff80c313af, rsp = 0, rbp = 0x3 ---
mi_startup() at mi_startup+0xdf/frame 0x3

Will try the workaround to
see if the issue still appears. Note it can be several weeks between panics, so we won't know whether the workaround is a solution until the end of April.

The stack trace here looks entirely different, with the local system's TCP stack not being involved (as this is a router, forwarded traffic doesn't normally get processed by the TCP stack - only by the IP layer, as is visible in the stack trace). I suggest opening a separate bug for this, and also providing the core dumps if possible (note that the core dumps may contain sensitive information in the mbuf chains).
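For reference, the sysctl-based workaround discussed earlier in this thread can be applied as below. Note this is the mitigation that was confirmed for the related bug 254244 (disabling SACK rescue retransmissions), not a verified fix for every panic variant reported here; the real fix is the patch series above.

```shell
# Inspect the current setting (1 = enabled, the default since D28702).
sysctl net.inet.tcp.rfc6675_pipe

# Workaround from bug 254244: disable at runtime.
sysctl net.inet.tcp.rfc6675_pipe=0

# Persist across reboots.
echo 'net.inet.tcp.rfc6675_pipe=0' >> /etc/sysctl.conf
```

This only reduces exposure to the problematic SACK retransmission path; patched kernels can leave the sysctl at its default.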