Hello, I'm running the latest 13-STABLE. I already mentioned this panic in #256538, but I couldn't reproduce or debug it. I think something similar is described in #254735. On one of my servers the kernel panics every time, right after the server starts up. Here is the trace:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0xffffffff0000002a
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff81629071
stack pointer = 0x28:0xfffffe0202a046a0
frame pointer = 0x28:0xfffffe0202a04990
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 11 (swi1: hpts)
trap number = 12
panic: page fault
cpuid = 0
time = 1626303252
KDB: stack backtrace:
#0 0xffffffff80646505 at kdb_backtrace+0x65
#1 0xffffffff80602661 at vpanic+0x181
#2 0xffffffff806024d3 at panic+0x43
#3 0xffffffff8085a857 at trap_fatal+0x387
#4 0xffffffff8085a8af at trap_pfault+0x4f
#5 0xffffffff80859f63 at trap+0x253
#6 0xffffffff80833d8e at calltrap+0x8
#7 0xffffffff8075fc10 at tcp_hptsi+0x7d0
#8 0xffffffff80760ddc at tcp_hpts_thread+0x11c
#9 0xffffffff805cb221 at ithread_loop+0x191
#10 0xffffffff805c8541 at fork_exit+0x71
#11 0xffffffff80834e1e at fork_trampoline+0xe
Uptime: 27s
Dumping 4308 out of 130940 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) bt
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff8060228e in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff806026d0 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff806024d3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff8085a857 in trap_fatal (frame=0xfffffe0202a045e0, eva=18446744069414584362) at /usr/src/sys/amd64/amd64/trap.c:943
#6 0xffffffff8085a8af in trap_pfault (frame=frame@entry=0xfffffe0202a045e0, usermode=false, signo=<optimized out>, signo@entry=0x0, ucode=<optimized out>, ucode@entry=0x0) at /usr/src/sys/amd64/amd64/trap.c:760
#7 0xffffffff80859f63 in trap (frame=0xfffffe0202a045e0) at /usr/src/sys/amd64/amd64/trap.c:438
#8 <signal handler called>
#9 0xffffffff81629071 in rack_output () from /boot/kernel/tcp_rack.ko
#10 0xfffff805f2218e00 in ?? ()
#11 0x000c000000000000 in ?? ()
#12 0x0000000000000000 in ?? ()
(kgdb)

Let me know if you need any other info.
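A note on reading this trace: kgdb prints the eva argument of trap_fatal in frame #5 in decimal, while the panic message prints the fault virtual address in hex. The short Python check below uses only values copied from the trace and confirms that both refer to the same address, split into its upper and lower 32-bit halves.

```python
# Values copied verbatim from the panic message and kgdb frame #5 above.
FAULT_VA_HEX = 0xffffffff0000002a      # "fault virtual address" line
EVA_DECIMAL = 18446744069414584362     # trap_fatal(..., eva=...) in kgdb


def split_address(addr):
    """Return (upper 32 bits, lower 32 bits) of a 64-bit address."""
    return addr >> 32, addr & 0xFFFFFFFF


upper, lower = split_address(EVA_DECIMAL)
print(hex(EVA_DECIMAL))        # same value as the "fault virtual address" line
print(hex(upper), hex(lower))  # upper half all ones, lower half a small offset
```

The second trace in this report shows the identical fault virtual address and eva value; in the later rack_setup_offset_for_rsm trace, eva=24 likewise corresponds to its fault virtual address 0x18.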
Created attachment 226472: sysctl.conf
Created attachment 226473: loader.conf
Created attachment 226474: KERNEL-config
One more trace (possibly more informative):

Fatal trap 12: page fault while in kernel mode
cpuid = 4; apic id = 04
fault virtual address = 0xffffffff0000002a
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff81608071
stack pointer = 0x28:0xfffffe0202a186a0
frame pointer = 0x28:0xfffffe0202a18990
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 11 (swi1: hpts)
trap number = 12
panic: page fault
cpuid = 4
time = 1626306588
KDB: stack backtrace:
#0 0xffffffff80646505 at kdb_backtrace+0x65
#1 0xffffffff80602661 at vpanic+0x181
#2 0xffffffff806024d3 at panic+0x43
#3 0xffffffff8085a857 at trap_fatal+0x387
#4 0xffffffff8085a8af at trap_pfault+0x4f
#5 0xffffffff80859f63 at trap+0x253
#6 0xffffffff80833d8e at calltrap+0x8
#7 0xffffffff8075fc10 at tcp_hptsi+0x7d0
#8 0xffffffff80760ddc at tcp_hpts_thread+0x11c
#9 0xffffffff805cb221 at ithread_loop+0x191
#10 0xffffffff805c8541 at fork_exit+0x71
#11 0xffffffff80834e1e at fork_trampoline+0xe
Uptime: 25s
Dumping 4278 out of 130940 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) bt
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff8060228e in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff806026d0 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff806024d3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff8085a857 in trap_fatal (frame=0xfffffe0202a185e0, eva=18446744069414584362) at /usr/src/sys/amd64/amd64/trap.c:943
#6 0xffffffff8085a8af in trap_pfault (frame=frame@entry=0xfffffe0202a185e0, usermode=false, signo=<optimized out>, signo@entry=0x0, ucode=<optimized out>, ucode@entry=0x0) at /usr/src/sys/amd64/amd64/trap.c:760
#7 0xffffffff80859f63 in trap (frame=0xfffffe0202a185e0) at /usr/src/sys/amd64/amd64/trap.c:438
#8 <signal handler called>
#9 0xffffffff81608071 in rack_output (tp=<optimized out>) at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:16540
#10 0xffffffff8075fc10 in tcp_hptsi (hpts=hpts@entry=0xfffff8010398c780, from_callout=from_callout@entry=1) at /usr/src/sys/netinet/tcp_hpts.c:1662
#11 0xffffffff80760ddc in tcp_hpts_thread (ctx=0xfffff8010398c780) at /usr/src/sys/netinet/tcp_hpts.c:2035
#12 0xffffffff805cb221 in intr_event_execute_handlers (p=<optimized out>, ie=0xfffff8010398d500) at /usr/src/sys/kern/kern_intr.c:1168
#13 ithread_execute_handlers (p=<optimized out>, ie=0xfffff8010398d500) at /usr/src/sys/kern/kern_intr.c:1181
#14 ithread_loop (arg=arg@entry=0xfffff8010397d680) at /usr/src/sys/kern/kern_intr.c:1269
#15 0xffffffff805c8541 in fork_exit (callout=0xffffffff805cb090 <ithread_loop>, arg=0xfffff8010397d680, frame=0xfffffe0202a18c00) at /usr/src/sys/kern/kern_fork.c:1083
#16 <signal handler called>
The previous trace was captured after I switched the congestion control algorithm from HTCP to NewReno.
(In reply to iron.udjin from comment #5) Interesting. I have seen a panic like the one in comment #4 on one of my arm64 servers running FreeBSD main, but I thought it was arm64-specific, since I hadn't seen it on amd64. Your report shows that it is platform independent. Since we keep FreeBSD main and stable/13 in sync as much as possible, it is not unexpected that you see the problem on stable/13 and I saw it on main. Do you have steps that trigger the panic deterministically after the system has come up? It would help me to be able to trigger the problem on an amd64 system as well. Can you also provide the output of ifconfig? I'm wondering whether LRO or TSO is involved...
(In reply to Michael Tuexen from comment #6)
> Do you have steps that trigger the panic deterministically after the system has come up?

Unfortunately, no. The system panics a few seconds after the server starts up.

> Can you also provide the output of ifconfig?

igb0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
	options=4e527bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
	ether 9c:5c:8e:4f:6a:7d
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
	options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
	inet 127.0.0.1 netmask 0xffffffff
	groups: lo
lo1: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
	options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
	inet 192.168.0.1 netmask 0xffffffff
	groups: lo

Should I try changing the MTU back to 1500 or disabling TSO/LRO?
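To see which offloads the igb0 options= value actually enables (relevant to the LRO/TSO question), the hex mask can be decoded bit by bit. A sketch in Python follows; the bit-to-name table is transcribed from our recollection of the IFCAP_* definitions in FreeBSD's sys/net/if.h, using the names ifconfig prints, and should be treated as an assumption to verify against your own source tree.

```python
# Assumed IFCAP_* bit values (from memory of sys/net/if.h); verify locally.
IFCAP_BITS = {
    0x00000001: "RXCSUM",
    0x00000002: "TXCSUM",
    0x00000004: "NETCONS",
    0x00000008: "VLAN_MTU",
    0x00000010: "VLAN_HWTAGGING",
    0x00000020: "JUMBO_MTU",
    0x00000040: "POLLING",
    0x00000080: "VLAN_HWCSUM",
    0x00000100: "TSO4",
    0x00000200: "TSO6",
    0x00000400: "LRO",
    0x00000800: "WOL_UCAST",
    0x00001000: "WOL_MCAST",
    0x00002000: "WOL_MAGIC",
    0x00004000: "TOE4",
    0x00008000: "TOE6",
    0x00010000: "VLAN_HWFILTER",
    0x00040000: "VLAN_HWTSO",
    0x00080000: "LINKSTATE",
    0x00100000: "NETMAP",
    0x00200000: "RXCSUM_IPV6",
    0x00400000: "TXCSUM_IPV6",
    0x00800000: "HWSTATS",
    0x01000000: "TXRTLMT",
    0x02000000: "HWRXTSTMP",
    0x04000000: "NOMAP",
}


def decode_options(mask):
    """Return the names of all known capability bits set in mask, lowest bit first."""
    return [name for bit, name in sorted(IFCAP_BITS.items()) if mask & bit]


igb0 = decode_options(0x4e527bb)   # options= value of igb0 above
print(",".join(igb0))
print("LRO" in igb0, "TSO4" in igb0, "TSO6" in igb0)
```

On this mask the decode reports LRO, TSO4 and TSO6 as enabled. It also reports bit 0x800000 (HWSTATS), which the ifconfig line above does not print; presumably ifconfig's flag string simply omits that bit. To rule the offloads out, ifconfig(8) accepts -lro and -tso to disable them on an interface (for example, ifconfig igb0 -lro -tso).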
(In reply to iron.udjin from comment #7) I also see the panic on an igb interface, but I'm using an MTU of 1500 bytes. Let me experiment...
Are you using VNETs? If not, you can comment out the "options VIMAGE" line in your kernel configuration and rebuild the kernel. Testing on my server indicates that the problem only shows up with VIMAGE-enabled kernels.
(In reply to Michael Tuexen from comment #9) I've rebuilt the kernel without VIMAGE. No panic after restart so far.

P.S.: There is still a problem with SSH (as I already described in #256538). I'll debug this issue and create a new bug report when I have time for it. Quick workaround:
# sysctl net.inet.tcp.functions_default=freebsd
# service sshd restart
# sysctl net.inet.tcp.functions_default=rack
(In reply to iron.udjin from comment #10) Regarding your SSH problem: what is the output of
sysctl net.inet.tcp.tolerate_missing_ts
And what OS is running on the system that is ssh-ing into the box?
(In reply to Michael Tuexen from comment #11)
# sysctl net.inet.tcp.tolerate_missing_ts
net.inet.tcp.tolerate_missing_ts: 1

The client runs the same OS version as the host. When connecting to the server from Windows, the problem does not happen.
(In reply to iron.udjin from comment #12) Assuming the peer also has net.inet.tcp.tolerate_missing_ts: 1, we need to look into this. Please open a separate issue for it (when time permits).
(In reply to iron.udjin from comment #12)
OS: stable/13-n246050-07ef7a034965

On another server I caught one more panic, but it has a slightly different trace:

Fatal trap 12: page fault while in kernel mode
cpuid = 39; apic id = 33
fault virtual address = 0x18
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80fc1c20
stack pointer = 0x0:0xfffffe0321555e90
frame pointer = 0x0:0xfffffe0321555ed0
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 11 (swi1: hpts)
trap number = 12
panic: page fault
cpuid = 39
time = 1624174594
KDB: stack backtrace:
#0 0xffffffff805f37a5 at kdb_backtrace+0x65
#1 0xffffffff805a9931 at vpanic+0x181
#2 0xffffffff805a97a3 at panic+0x43
#3 0xffffffff80852617 at trap_fatal+0x387
#4 0xffffffff8085266f at trap_pfault+0x4f
#5 0xffffffff80851ce3 at trap+0x253
#6 0xffffffff8082ac18 at calltrap+0x8
#7 0xffffffff80fb183c at rack_log_output+0xec
#8 0xffffffff80fa9a33 at rack_output+0x6ca3
#9 0xffffffff80718835 at tcp_hpts_thread+0x725
#10 0xffffffff8056cfed at ithread_loop+0x24d
#11 0xffffffff80569ebd at fork_exit+0x7d
#12 0xffffffff8082bc9e at fork_trampoline+0xe
Uptime: 9h40m48s
Dumping 21243 out of 196233 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb)
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff805a9525 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3 0xffffffff805a99a0 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4 0xffffffff805a97a3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5 0xffffffff80852617 in trap_fatal (frame=0xfffffe0321555dd0, eva=24) at /usr/src/sys/amd64/amd64/trap.c:943
#6 0xffffffff8085266f in trap_pfault (frame=frame@entry=0xfffffe0321555dd0, usermode=false, signo=<optimized out>, signo@entry=0x0, ucode=<optimized out>, ucode@entry=0x0) at /usr/src/sys/amd64/amd64/trap.c:760
#7 0xffffffff80851ce3 in trap (frame=0xfffffe0321555dd0) at /usr/src/sys/amd64/amd64/trap.c:438
#8 <signal handler called>
#9 rack_setup_offset_for_rsm (src_rsm=0xfffff814ec3da230, rsm=0xfffff81f552bebd0) at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:6024
#10 rack_clone_rsm (rack=<optimized out>, nrsm=0xfffff81f552bebd0, rsm=0xfffff814ec3da230, start=3444253360) at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:6076
#11 rack_update_entry (tp=tp@entry=0xfffffe07d12dc870, rack=0xfffffe07c8e3cd00, rsm=0xfffff814ec3da230, ts=34848115395, lenp=lenp@entry=0xfffffe0321555f14, add_flag=<optimized out>) at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:7169
#12 0xffffffff80fb183c in rack_log_output (tp=tp@entry=0xfffffe07d12dc870, to=<optimized out>, len=len@entry=253, seq_out=3444253107, th_flags=<optimized out>, th_flags@entry=16 '\020', err=err@entry=0, cts=34848115395, hintrsm=0x0, add_flag=16384, s_mb=0xfffff80df0cd4800, s_moff=1) at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:7384
#13 0xffffffff80fa9a33 in rack_fast_rsm_output (tp=<optimized out>, rack=<optimized out>, rsm=<optimized out>, ts_val=<optimized out>, cts=488377027, ms_cts=34848115, tv=0xfffffe0321556018, len=<optimized out>) at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:15404
#14 rack_output (tp=<optimized out>) at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:16417
#15 0xffffffff80718835 in tcp_hptsi (hpts=0xfffff8184d9f3700) at /usr/src/sys/netinet/tcp_hpts.c:1613
#16 tcp_hpts_thread (ctx=0xfffff8184d9f3700) at /usr/src/sys/netinet/tcp_hpts.c:1832
#17 0xffffffff8056cfed in intr_event_execute_handlers (p=<optimized out>, ie=0xfffff8184d9d0c00) at /usr/src/sys/kern/kern_intr.c:1168
#18 ithread_execute_handlers (p=<optimized out>, ie=0xfffff8184d9d0c00) at /usr/src/sys/kern/kern_intr.c:1181
#19 ithread_loop (arg=arg@entry=0xfffff8184d9e3640) at /usr/src/sys/kern/kern_intr.c:1269
#20 0xffffffff80569ebd in fork_exit (callout=0xffffffff8056cda0 <ithread_loop>, arg=0xfffff8184d9e3640, frame=0xfffffe0321556480) at /usr/src/sys/kern/kern_fork.c:1083
#21 <signal handler called>
(kgdb)

VIMAGE is enabled on this server as well.
I think review D31212 will fix the first issue you reported. At least it explains the issue, and it resolves it in my testing with a VIMAGE-enabled kernel. It would be great if you could test it and report back.
(In reply to iron.udjin from comment #14) I think this is a different issue. Do you have a way to reproduce this?
(In reply to Michael Tuexen from comment #16) No. I found a crash dump from a panic that happened a month ago. The server restarted automatically after the panic, so I didn't even know about it.
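The "time =" field in each panic message is a Unix timestamp, so the age of the rack_setup_offset_for_rsm crash dump can be checked directly against the panics reported at the top of this bug. A small Python sketch, assuming the servers' clocks were correct when they panicked:

```python
from datetime import datetime, timezone

# "time = ..." fields copied from the three panic messages in this report.
PANIC_TIMES = {
    "rack_output, HTCP": 1626303252,
    "rack_output, NewReno": 1626306588,
    "rack_setup_offset_for_rsm": 1624174594,
}


def to_utc(epoch_seconds):
    """Convert a Unix timestamp to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


for label, ts in PANIC_TIMES.items():
    print(f"{label}: {to_utc(ts)}")

# Age of the older crash dump relative to the first panic in the report.
gap_days = (PANIC_TIMES["rack_output, HTCP"]
            - PANIC_TIMES["rack_setup_offset_for_rsm"]) / 86400
print(f"gap: {gap_days:.1f} days")
```

This places the rack_setup_offset_for_rsm panic on 2021-06-20 (UTC), about 24.6 days before the first panic in this report on 2021-07-14, which is consistent with "a month ago" and with that kernel having been built from older sources.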
(In reply to iron.udjin from comment #17) A month ago, the sources were older. I would suggest updating and checking whether the problem still exists.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a730d82378d3cdf5356775ec0c23ad2ca40c5edb

commit a730d82378d3cdf5356775ec0c23ad2ca40c5edb
Author: Michael Tuexen <tuexen@FreeBSD.org>
AuthorDate: 2021-07-19 22:29:18 +0000
Commit: Michael Tuexen <tuexen@FreeBSD.org>
CommitDate: 2021-07-19 22:29:18 +0000

    tcp: fix RACK and BBR when using VIMAGE enabled kernel

    Fix a bug in VNET handling, which occurs when using specific NICs.

    PR: 257195
    Reviewed by: rrs
    MFC after: 3 days
    Sponsored by: Netflix, Inc.
    Differential Revision: https://reviews.freebsd.org/D31212

 sys/netinet/tcp_stacks/rack_bbr_common.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
(In reply to commit-hook from comment #19) I just tested your patch. The server no longer panics. Everything looks good.
^Triage: Assign to committer resolving and mark (un)affected branches.
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9b1219b24a5adaca44833287ac2727e3523e3b62

commit 9b1219b24a5adaca44833287ac2727e3523e3b62
Author: Michael Tuexen <tuexen@FreeBSD.org>
AuthorDate: 2021-07-19 22:29:18 +0000
Commit: Michael Tuexen <tuexen@FreeBSD.org>
CommitDate: 2021-07-22 09:13:31 +0000

    tcp: fix RACK and BBR when using VIMAGE enabled kernel

    Fix a bug in VNET handling, which occurs when using specific NICs.

    PR: 257195
    Reviewed by: rrs
    Sponsored by: Netflix, Inc.
    Differential Revision: https://reviews.freebsd.org/D31212

    (cherry picked from commit a730d82378d3cdf5356775ec0c23ad2ca40c5edb)

 sys/netinet/tcp_stacks/rack_bbr_common.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)