I ran 13.0-RC4 for a few days and got this panic:
Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 07
fault virtual address = 0x18
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80c9b7d8
stack pointer = 0x0:0xfffffe00357a51c0
frame pointer = 0x0:0xfffffe00357a5230
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (if_io_tqg_7)
trap number = 12
panic: page fault
cpuid = 7
time = 1617377524
KDB: stack backtrace:
#0 0xffffffff80c57345 at kdb_backtrace+0x65
#1 0xffffffff80c09d21 at vpanic+0x181
#2 0xffffffff80c09b93 at panic+0x43
#3 0xffffffff8108a187 at trap_fatal+0x387
#4 0xffffffff8108a1df at trap_pfault+0x4f
#5 0xffffffff8108983d at trap+0x27d
#6 0xffffffff81061768 at calltrap+0x8
#7 0xffffffff80dc8a33 at tcp_output+0x10b3
#8 0xffffffff80dc0fcb at tcp_do_segment+0x301b
#9 0xffffffff80dbd1ee at tcp_input+0xabe
#10 0xffffffff80dafbe5 at ip_input+0x125
#11 0xffffffff80d3f2ca at netisr_dispatch_src+0xca
#12 0xffffffff80d23a58 at ether_demux+0x148
#13 0xffffffff80d24ddc at ether_nh_input+0x34c
#14 0xffffffff80d3f2ca at netisr_dispatch_src+0xca
#15 0xffffffff80d23ea9 at ether_input+0x69
#16 0xffffffff80dc6a61 at tcp_flush_out_le+0x221
#17 0xffffffff80dc67fd at tcp_lro_flush+0x2ad
Uptime: 2d15h58m1s
Dumping 2453 out of 32505 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c09916 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff80c09d90 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff80c09b93 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff8108a187 in trap_fatal (frame=0xfffffe00357a5100, eva=24) at /usr/src/sys/amd64/amd64/trap.c:915
#6 0xffffffff8108a1df in trap_pfault (frame=frame@entry=0xfffffe00357a5100, usermode=false, signo=<optimized out>, signo@entry=0x0, ucode=<optimized out>, ucode@entry=0x0) at /usr/src/sys/amd64/amd64/trap.c:732
#7 0xffffffff8108983d in trap (frame=0xfffffe00357a5100) at /usr/src/sys/amd64/amd64/trap.c:398
#8 <signal handler called>
#9 m_copydata (m=m@entry=0x0, off=0, len=1, cp=<optimized out>) at /usr/src/sys/kern/uipc_mbuf.c:656
#10 0xffffffff80dc8a33 in tcp_output (tp=0xfffffe013eac04d8) at /usr/src/sys/netinet/tcp_output.c:1068
#11 0xffffffff80dc0fcb in tcp_do_segment (m=0xfffff804e393ca00, th=<optimized out>, so=<optimized out>, tp=0xfffffe013eac04d8, drop_hdrlen=64, tlen=<optimized out>, iptos=0 '\000') at /usr/src/sys/sys/libkern.h:91
#12 0xffffffff80dbd1ee in tcp_input (mp=<optimized out>, offp=<optimized out>, proto=<optimized out>) at /usr/src/sys/netinet/tcp_input.c:1382
#13 0xffffffff80dafbe5 in ip_input (m=0x0) at /usr/src/sys/netinet/ip_input.c:829
#14 0xffffffff80d3f2ca in netisr_dispatch_src (proto=1, source=<optimized out>, source@entry=0, m=0xfffff801e35a659c) at /usr/src/sys/net/netisr.c:1143
#15 0xffffffff80d3f5bf in netisr_dispatch (proto=0, m=0x1) at /usr/src/sys/net/netisr.c:1234
#16 0xffffffff80d23a58 in ether_demux (ifp=ifp@entry=0xfffff80004075000, m=0x0) at /usr/src/sys/net/if_ethersubr.c:923
#17 0xffffffff80d24ddc in ether_input_internal (ifp=0xfffff80004075000, m=0x0) at /usr/src/sys/net/if_ethersubr.c:709
#18 ether_nh_input (m=<optimized out>) at /usr/src/sys/net/if_ethersubr.c:739
#19 0xffffffff80d3f2ca in netisr_dispatch_src (proto=proto@entry=5, source=<optimized out>, source@entry=0, m=0xfffff801e35a659c, m@entry=0xfffff804e393ca00) at /usr/src/sys/net/netisr.c:1143
#20 0xffffffff80d3f5bf in netisr_dispatch (proto=0, proto@entry=5, m=0x1, m@entry=0xfffff804e393ca00) at /usr/src/sys/net/netisr.c:1234
#21 0xffffffff80d23ea9 in ether_input (ifp=<optimized out>, m=0xfffff804e393ca00) at /usr/src/sys/net/if_ethersubr.c:830
#22 0xffffffff80dc6a61 in tcp_flush_out_le (tp=0x0, lc=lc@entry=0xfffff8000405f830, le=le@entry=0xfffffe0104118498, locked=0) at /usr/src/sys/netinet/tcp_lro.c:569
#23 0xffffffff80dc67fd in tcp_lro_flush (lc=lc@entry=0xfffff8000405f830, le=0xfffffe0104118498) at /usr/src/sys/netinet/tcp_lro.c:978
#24 0xffffffff80dc6bab in tcp_lro_rx_done (lc=0xfffff8000405f830) at /usr/src/sys/netinet/tcp_lro.c:356
#25 tcp_lro_flush_all (lc=lc@entry=0xfffff8000405f830) at /usr/src/sys/netinet/tcp_lro.c:1123
#26 0xffffffff80d3ba22 in iflib_rxeof (rxq=<optimized out>, rxq@entry=0xfffff8000405f800, budget=<optimized out>) at /usr/src/sys/net/iflib.c:3017
#27 0xffffffff80d35d32 in _task_fn_rx (context=0xfffff8000405f800) at /usr/src/sys/net/iflib.c:3949
#28 0xffffffff80c55dad in gtaskqueue_run_locked (queue=queue@entry=0xfffff80003988800) at /usr/src/sys/kern/subr_gtaskqueue.c:371
#29 0xffffffff80c55a4c in gtaskqueue_thread_loop (arg=<optimized out>, arg@entry=0xfffffe00387e40b0) at /usr/src/sys/kern/subr_gtaskqueue.c:547
#30 0xffffffff80bc7c5e in fork_exit (callout=0xffffffff80c559a0 <gtaskqueue_thread_loop>, arg=0xfffffe00387e40b0, frame=0xfffffe00357a5b00) at /usr/src/sys/kern/kern_fork.c:1069
#31 <signal handler called>
(kgdb)
Maybe these are related: https://lists.freebsd.org/pipermail/freebsd-current/2021-January/078136.html https://lists.freebsd.org/pipermail/freebsd-current/2021-January/078553.html
This may be related to some known issues being worked on. Are you using the "TCPHPTS" option in your kernel configuration file? --HPS
(In reply to Hans Petter Selasky from comment #2) I use a GENERIC configuration file and didn't find anything related to TCPHPTS there.
(In reply to Christos Chatzaras from comment #3) It seems that with net.inet.tcp.sack.enable disabled the panics are gone.
This issue may look similar to: https://reviews.freebsd.org/D29315 I see the fix is not in RC4 yet. --HPS
(In reply to Sergey V. Dyatko from comment #4) "sysctl net.inet.tcp.sack.enable=0" is the first thing I did after the server came back online. I also disabled LRO. What version of FreeBSD do you use now? I see you had the same or a similar panic with 13-CURRENT 2 months ago.
(In reply to Hans Petter Selasky from comment #5) If I am not wrong, to land in releng/13.0 it needs permission from re@FreeBSD.org, and then @rscheff can commit it.
https://reviews.freebsd.org/D18985 (RFC6675 SACK rescue retransmission) is not in 13.0-RC4, thus https://reviews.freebsd.org/D29315 (fixing an off-by-one, NULL pointer free, introduced above) should not play a role here... https://reviews.freebsd.org/D29083 could also lead to a panic in tcp_output, but should be in RC4... (checked in Mar 8, RC4 was Mar 29). Why libkern.h:92 for tcp_do_segment in frame 11? Any possibility to get the core? Or the full tcpcb of frame 11?
(In reply to Richard Scheffenegger from comment #8) Hello Richard, I tried to contact you at rscheff at freebsd.org, which forwards the message to your private address, and it was blocked: "A custom mail flow rule created by an admin at xxxxxxx.onmicrosoft.com has blocked your message." Do you want me to send you the vmcore (2.4GB) using WeTransfer, or should I give you SSH access to a server and upload the core there? The server has /usr/src from RC4 and kgdb.
Received the core and kernel.debug symbols. The panic is likely an off-by-one related to the FIN bit, though not in the rescue retransmission but rather in PRR. The TCP state indicates that only the last data byte and the final FIN bit are unacknowledged.
frame 11:
(kgdb) p *tp->sackhint.nexthole
$48 = {start = 935342315, end = 935342316, rxmit = 935342315, scblink = {tqe_next = 0x0, tqe_prev = 0xfffffe013eac0618}}
(kgdb) p tp->snd_max
$49 = 935342317
(kgdb) p tp->snd_una
$50 = 935342315
(kgdb) p/x tp->t_flags
$6 = 0x603003f4
====> TF_SENTFIN is set.
Now, the FIN bit occupies the last Seq# (..317). The SACK hole should therefore be a valid 1-byte hole which hasn't been retransmitted...
The incoming SACK appears to be SACKing the FIN bit?
(kgdb) p/x *(struct sackblk *)to.to_sacks
$53 = {start = 0xec30c037, end = 0xed30c037}
(kgdb) p 0x37c030ec
$54 = 935342316
(kgdb) p 0x37c030ed
$55 = 935342317
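For anyone following along without kgdb at hand, the arithmetic in the session above can be re-checked with a short Python sketch. It only replays the numbers already printed in this comment. The TF_SENTFIN value (0x00000010) is taken from FreeBSD's netinet/tcp_var.h and is assumed to be unchanged in 13.0; the helper name ntohl_le is made up for this illustration.

```python
import struct

# Values copied from the kgdb session in this comment.
SND_UNA = 935342315
SND_MAX = 935342317
T_FLAGS = 0x603003F4
RAW_SACK = (0xEC30C037, 0xED30C037)  # as printed by p/x on a little-endian host

# TF_SENTFIN as defined in netinet/tcp_var.h (assumed unchanged in 13.0).
TF_SENTFIN = 0x00000010


def ntohl_le(value: int) -> int:
    """Byte-swap a 32-bit network-order value that a little-endian host read raw."""
    return struct.unpack(">I", struct.pack("<I", value))[0]


sack_start, sack_end = (ntohl_le(v) for v in RAW_SACK)
fin_sent = bool(T_FLAGS & TF_SENTFIN)

print(f"SACK block: {sack_start}:{sack_end}")
print(f"offset of SACK start from snd_una: {sack_start - SND_UNA}")
print(f"SACK right edge equals snd_max: {sack_end == SND_MAX}")
print(f"TF_SENTFIN set: {fin_sent}")
```

The swap is needed because the SACK option is still in network byte order when it is printed, which is why the thread does it by hand with "p 0x37c030ec".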
Two more questions: Have you observed this type of panic once, or multiple times? How are net.inet.tcp.rfc6675_pipe and net.inet.tcp.do_prr set? If this is repeatable in your environment, you may want to re-enable SACK but disable PRR (which is the new mechanism in 13). If this is repeatable, would you be willing to enable blackbox logging, or alternatively take a packet capture of what leads up to this event?
(In reply to Richard Scheffenegger from comment #11) I have RC3-p1 / RC4 on 100+ servers. It happened only once, but only 5 days have passed since the upgrade.
---------
sysctl values:
net.inet.tcp.rfc6675_pipe: 0
net.inet.tcp.do_prr: 1
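With 100+ servers, comparing the reported values against the suggested experiment (SACK on, PRR off) by eye gets tedious. Below is a rough Python sketch that parses "name: value" output as pasted above and prints the sysctl commands still needed. The helper names and the SUGGESTED table are made up for this illustration; the only sysctl syntax relied on is the "sysctl name=value" form already used in this thread.

```python
def parse_sysctl(text):
    """Parse 'name: value' pairs as pasted from sysctl output; skip other lines."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().lstrip("-").isdigit():
            values[name.strip()] = int(value)
    return values


# Hypothetical target for the suggested experiment: SACK enabled, PRR disabled.
SUGGESTED = {"net.inet.tcp.sack.enable": 1, "net.inet.tcp.do_prr": 0}


def pending_changes(current, wanted=SUGGESTED):
    """Return a sysctl command for every setting that differs or is not reported."""
    return [
        f"sysctl {name}={value}"
        for name, value in wanted.items()
        if current.get(name) != value
    ]


reported = parse_sysctl("net.inet.tcp.rfc6675_pipe: 0\nnet.inet.tcp.do_prr: 1\n")
for command in pending_changes(reported):
    print(command)
```

A setting that is missing from the pasted output is treated as unknown and listed anyway, so nothing is silently assumed about it.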
(In reply to Richard Scheffenegger from comment #11) I enabled SACK and LRO and disabled PRR. ----------- (In reply to Sergey V. Dyatko from comment #4) In your kgdb output I see that PRR is mentioned: https://lists.freebsd.org/pipermail/freebsd-current/2021-January/078553.html Can you enable net.inet.tcp.sack.enable and disable net.inet.tcp.do_prr? Also, how often did this panic happen on your servers?
Extracted a more complete set of packet headers belonging to the problematic session from the privately provided core.
The session is ECN-enabled.
At the time of the panic, SENTFIN was set.
Based on the Timestamp option of the incoming ACKs, serious reordering and spurious retransmissions were going on.
The final packet with FIN originally has a payload of 1 byte (TSopt val ..5625), but that is apparently lost and not received by the client.
Subsequently (based on TSopt val), just the FIN is retransmitted twice, with TSopt val ..5861 and ..5979 (e.g. when a transmission opportunity would be there, but no new data is available).
The RTT appears to be nearly 100ms from the very last round; sRTT is averaged at 275ms.
At the panic, TSval would have been ..5988.
This is for retransmitting the final payload byte, as the client only SACKed the 1st FIN retransmission. However, for some reason that byte is no longer available in the send socket buffer, causing the crash.
Srv -> clnt F. 9999:10000(1) //dropped
Clnt -> Srv E. 1:1(0) ack -26seg <sack -25seg:9999>
(unobserved retransmission Srv->Clnt)
Clnt -> Srv E. 1:1(0) ack 9999
Srv -> clnt F. 10000:10000(0)
Clnt -> Srv E. 1:1(0) ack 9999 <sack 10000:10001>
attempt to retransmit 10000:10001(1) -> crash.
However, attempts so far to recreate this misbehavior have not reproduced the panic.
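To make the last step of that simplified trace concrete, here is a toy Python model of a 1-byte retransmission copy. It is not the FreeBSD code: the function name, the exception and the bytes-based buffer are all made up for illustration, and it does not settle which condition actually occurred on the affected server. It only shows that a 1-byte copy has nothing to read when the requested sequence number is the FIN's, or when the byte has already left the send buffer.

```python
class SendBufferError(Exception):
    """Raised where the toy model has no buffered data to copy."""


def copy_for_retransmit(send_buffer, snd_una, seq, length):
    """Copy `length` payload bytes starting at sequence number `seq`.

    `send_buffer` holds the unacknowledged payload; its first byte has
    sequence number `snd_una`. A FIN consumes a sequence number but has
    no byte in the buffer.
    """
    offset = seq - snd_una
    if offset < 0 or offset + length > len(send_buffer):
        raise SendBufferError(
            f"seq {seq}+{length} is outside the {len(send_buffer)} buffered byte(s)"
        )
    return send_buffer[offset:offset + length]


SND_UNA = 9999  # the client still acks 9999 in the simplified trace

# Case 1: the last payload byte is still buffered, so retransmitting it works.
print(copy_for_retransmit(b"x", SND_UNA, 9999, 1))

# Case 2: a 1-byte copy at the FIN's sequence number finds no payload.
# Case 3: the payload byte is no longer in the buffer at all.
for buffer, seq in ((b"x", 10000), (b"", 9999)):
    try:
        copy_for_retransmit(buffer, SND_UNA, seq, 1)
    except SendBufferError as err:
        print("would fault:", err)
```

In the real kernel there is no such range check on this path in the backtrace shown earlier, which is consistent with m_copydata being called with m=0x0, off=0, len=1.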
Another crash:
Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 07
fault virtual address = 0x18
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80cae31d
stack pointer = 0x28:0xfffffe01141445c0
frame pointer = 0x28:0xfffffe0114144630
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (if_io_tqg_7)
trap number = 12
panic: page fault
cpuid = 7
time = 1655767279
KDB: stack backtrace:
#0 0xffffffff80c69465 at kdb_backtrace+0x65
#1 0xffffffff80c1bb1f at vpanic+0x17f
#2 0xffffffff80c1b993 at panic+0x43
#3 0xffffffff810afdf5 at trap_fatal+0x385
#4 0xffffffff810afe4f at trap_pfault+0x4f
#5 0xffffffff81087528 at calltrap+0x8
#6 0xffffffff80de07c9 at tcp_output+0x1339
#7 0xffffffff80dd7eed at tcp_do_segment+0x2cfd
#8 0xffffffff80dd44b1 at tcp_input_with_port+0xb61
#9 0xffffffff80dd515b at tcp_input+0xb
#10 0xffffffff80dc691f at ip_input+0x11f
#11 0xffffffff80d53089 at netisr_dispatch_src+0xb9
#12 0xffffffff80d36ea8 at ether_demux+0x138
#13 0xffffffff80d38235 at ether_nh_input+0x355
#14 0xffffffff80d53089 at netisr_dispatch_src+0xb9
#15 0xffffffff80d372d9 at ether_input+0x69
#16 0xffffffff80ddeaa5 at tcp_push_and_replace+0x25
#17 0xffffffff80ddd74c at tcp_lro_flush+0x4c
Uptime: 29d3h36m11s
Dumping 4275 out of 65278 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
warning: Source file is more recent than executable.
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c1b71c in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:487
#3 0xffffffff80c1bb8e in vpanic (fmt=0xffffffff811b4fb9 "%s", ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4 0xffffffff80c1b993 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:844
#5 0xffffffff810afdf5 in trap_fatal (frame=0xfffffe0114144500, eva=24) at /usr/src/sys/amd64/amd64/trap.c:944
#6 0xffffffff810afe4f in trap_pfault (frame=0xfffffe0114144500, usermode=false, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:763
#7 <signal handler called>
#8 m_copydata (m=0x0, m@entry=0xfffff80c219ce500, off=0, len=1, cp=<optimized out>) at /usr/src/sys/kern/uipc_mbuf.c:659
#9 0xffffffff80de07c9 in tcp_output (tp=<optimized out>) at /usr/src/sys/netinet/tcp_output.c:1081
#10 0xffffffff80dd7eed in tcp_do_segment (m=<optimized out>, th=<optimized out>, so=<optimized out>, tp=0xfffffe01990a1000, drop_hdrlen=64, tlen=<optimized out>, iptos=0 '\000') at /usr/src/sys/netinet/tcp_input.c:2637
#11 0xffffffff80dd44b1 in tcp_input_with_port (mp=<optimized out>, offp=<optimized out>, proto=<optimized out>, port=port@entry=0) at /usr/src/sys/netinet/tcp_input.c:1400
#12 0xffffffff80dd515b in tcp_input (mp=0xfffff80c219ce500, offp=0x0, proto=1) at /usr/src/sys/netinet/tcp_input.c:1496
#13 0xffffffff80dc691f in ip_input (m=0x0) at /usr/src/sys/netinet/ip_input.c:839
#14 0xffffffff80d53089 in netisr_dispatch_src (proto=1, source=source@entry=0, m=0xfffff80e00395400) at /usr/src/sys/net/netisr.c:1143
#15 0xffffffff80d5345f in netisr_dispatch (proto=563930368, m=0x1) at /usr/src/sys/net/netisr.c:1234
#16 0xffffffff80d36ea8 in ether_demux (ifp=ifp@entry=0xfffff80004659000, m=0x0) at /usr/src/sys/net/if_ethersubr.c:921
#17 0xffffffff80d38235 in ether_input_internal (ifp=0xfffff80004659000, m=0x0) at /usr/src/sys/net/if_ethersubr.c:707
#18 ether_nh_input (m=<optimized out>) at /usr/src/sys/net/if_ethersubr.c:737
#19 0xffffffff80d53089 in netisr_dispatch_src (proto=proto@entry=5, source=source@entry=0, m=m@entry=0xfffff80e00395400) at /usr/src/sys/net/netisr.c:1143
#20 0xffffffff80d5345f in netisr_dispatch (proto=563930368, proto@entry=5, m=0x1, m@entry=0xfffff80e00395400) at /usr/src/sys/net/netisr.c:1234
#21 0xffffffff80d372d9 in ether_input (ifp=<optimized out>, m=0xfffff80e00395400) at /usr/src/sys/net/if_ethersubr.c:828
#22 0xffffffff80ddeaa5 in tcp_push_and_replace (lc=0xfffff80c219ce500, lc@entry=0xfffff80003ef2830, le=le@entry=0xfffffe0158387690, m=m@entry=0xfffff80f2b178300) at /usr/src/sys/netinet/tcp_lro.c:923
#23 0xffffffff80ddd74c in tcp_lro_condense (lc=0xfffff80003ef2830, le=0xfffffe0158387690) at /usr/src/sys/netinet/tcp_lro.c:1011
#24 tcp_lro_flush (lc=lc@entry=0xfffff80003ef2830, le=0xfffffe0158387690) at /usr/src/sys/netinet/tcp_lro.c:1374
#25 0xffffffff80dddd3b in tcp_lro_rx_done (lc=0xfffff80003ef2830) at /usr/src/sys/netinet/tcp_lro.c:566
#26 tcp_lro_flush_all (lc=lc@entry=0xfffff80003ef2830) at /usr/src/sys/netinet/tcp_lro.c:1532
#27 0xffffffff80d4f503 in iflib_rxeof (rxq=<optimized out>, rxq@entry=0xfffff80003ef2800, budget=<optimized out>) at /usr/src/sys/net/iflib.c:3058
#28 0xffffffff80d49b22 in _task_fn_rx (context=0xfffff80003ef2800) at /usr/src/sys/net/iflib.c:3990
#29 0xffffffff80c67e9d in gtaskqueue_run_locked (queue=queue@entry=0xfffff80003cbf000) at /usr/src/sys/kern/subr_gtaskqueue.c:371
#30 0xffffffff80c67b12 in gtaskqueue_thread_loop (arg=<optimized out>, arg@entry=0xfffffe01142820b0) at /usr/src/sys/kern/subr_gtaskqueue.c:547
#31 0xffffffff80bd8a5e in fork_exit (callout=0xffffffff80c67a50 <gtaskqueue_thread_loop>, arg=0xfffffe01142820b0, frame=0xfffffe0114144f40) at /usr/src/sys/kern/kern_fork.c:1093
#32 <signal handler called>
#33 mi_startup () at /usr/src/sys/kern/init_main.c:322
Backtrace stopped: Cannot access memory at address 0x1d
(kgdb)
What version of FreeBSD did this crash happen on?
This PR looks like it is already fixed in 13-stable. Try using a 13-stable kernel. --HPS
It happened with 13.0 and 13.1. For now I have net.inet.tcp.sack.enable=0, but I plan to enable it again after 13.2 is released to see if it's fixed. Also, the issue only happens on servers with specific, older hardware. I have servers with newer hardware and net.inet.tcp.sack.enable=1, and the issue didn't happen there (or I was just lucky).
(In reply to Christos Chatzaras from comment #18) Any news, since 13.2 has been released? Best regards Michael
(In reply to Michael Tuexen from comment #19) Hello Michael. I have already upgraded most of the servers to newer hardware, and on these servers I have SACK enabled. No kernel panics on these servers so far. On older servers I still have SACK disabled, and I plan to keep it disabled until I upgrade them to newer hardware. Maybe we can close the PR?
(In reply to Christos Chatzaras from comment #20) Thanks for the quick feedback. Yepp, I guess closing is the most appropriate option. If the problem comes up again, you can re-open. Best regards Michael