Bug 257195 - [tcp] Panic when RACK enabled: tcp_hptsi at /usr/src/sys/netinet/tcp_hpts.c:1662
Summary: [tcp] Panic when RACK enabled: tcp_hptsi at /usr/src/sys/netinet/tcp_hpts.c:1662
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Michael Tuexen
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2021-07-14 23:33 UTC by iron.udjin
Modified: 2021-07-22 09:15 UTC (History)
CC List: 5 users

See Also:
tuexen: mfc-stable13+
koobs: mfc-stable12-
koobs: mfc-stable11-


Attachments
sysctl.conf (3.53 KB, text/plain)
2021-07-14 23:34 UTC, iron.udjin
loader.conf (1.17 KB, text/plain)
2021-07-14 23:35 UTC, iron.udjin
KERNEL-config (14.75 KB, text/plain)
2021-07-14 23:36 UTC, iron.udjin

Description iron.udjin 2021-07-14 23:33:44 UTC
Hello,

I'm running the latest 13-STABLE. I already mentioned a panic in #256538, but I couldn't reproduce or debug it. I think a similar panic is described in #254735.

On one of my servers, the kernel panics every time, right after the server starts up.

Here is the trace:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0xffffffff0000002a
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff81629071
stack pointer	       = 0x28:0xfffffe0202a046a0
frame pointer	       = 0x28:0xfffffe0202a04990
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 11 (swi1: hpts)
trap number		= 12
panic: page fault
cpuid = 0
time = 1626303252
KDB: stack backtrace:
#0 0xffffffff80646505 at kdb_backtrace+0x65
#1 0xffffffff80602661 at vpanic+0x181
#2 0xffffffff806024d3 at panic+0x43
#3 0xffffffff8085a857 at trap_fatal+0x387
#4 0xffffffff8085a8af at trap_pfault+0x4f
#5 0xffffffff80859f63 at trap+0x253
#6 0xffffffff80833d8e at calltrap+0x8
#7 0xffffffff8075fc10 at tcp_hptsi+0x7d0
#8 0xffffffff80760ddc at tcp_hpts_thread+0x11c
#9 0xffffffff805cb221 at ithread_loop+0x191
#10 0xffffffff805c8541 at fork_exit+0x71
#11 0xffffffff80834e1e at fork_trampoline+0xe
Uptime: 27s
Dumping 4308 out of 130940 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55		__asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff8060228e in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff806026d0 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff806024d3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff8085a857 in trap_fatal (frame=0xfffffe0202a045e0, eva=18446744069414584362) at /usr/src/sys/amd64/amd64/trap.c:943
#6  0xffffffff8085a8af in trap_pfault (frame=frame@entry=0xfffffe0202a045e0, usermode=false, signo=<optimized out>, signo@entry=0x0, ucode=<optimized out>, ucode@entry=0x0) at /usr/src/sys/amd64/amd64/trap.c:760
#7  0xffffffff80859f63 in trap (frame=0xfffffe0202a045e0) at /usr/src/sys/amd64/amd64/trap.c:438
#8  <signal handler called>
#9  0xffffffff81629071 in rack_output () from /boot/kernel/tcp_rack.ko
#10 0xfffff805f2218e00 in ?? ()
#11 0x000c000000000000 in ?? ()
#12 0x0000000000000000 in ?? ()
(kgdb)

Let me know if you need any other info.
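
A side note on reading the trace above: the eva argument that kgdb prints for trap_fatal is the same address as the "fault virtual address" line, only in decimal. The sketch below is not from the kernel sources. It is plain Python arithmetic, and the helper names eva_to_hex and split_halves are made up for illustration.

```python
# Illustrative only: check that kgdb's decimal `eva` matches the hex fault address.
EVA_DECIMAL = 18446744069414584362      # from trap_fatal(..., eva=...) in the kgdb trace
FAULT_ADDRESS = "0xffffffff0000002a"    # from the "fault virtual address" line


def eva_to_hex(eva):
    """Format a 64-bit address as 0x followed by 16 hex digits."""
    return "0x{:016x}".format(eva & 0xFFFFFFFFFFFFFFFF)


def split_halves(eva):
    """Return (upper 32 bits, lower 32 bits) of a 64-bit address."""
    return (eva >> 32) & 0xFFFFFFFF, eva & 0xFFFFFFFF


print(eva_to_hex(EVA_DECIMAL))
high, low = split_halves(EVA_DECIMAL)
print(hex(high), hex(low))
```

The upper half is all ones and the lower half is 0x2a, so the address is in the kernel half of the address space but not a small offset from NULL.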
Comment 1 iron.udjin 2021-07-14 23:34:58 UTC
Created attachment 226472
sysctl.conf
Comment 2 iron.udjin 2021-07-14 23:35:53 UTC
Created attachment 226473
loader.conf
Comment 3 iron.udjin 2021-07-14 23:36:47 UTC
Created attachment 226474
KERNEL-config
Comment 4 iron.udjin 2021-07-15 00:00:46 UTC
One more trace (possibly more informative):


Fatal trap 12: page fault while in kernel mode
cpuid = 4; apic id = 04
fault virtual address	= 0xffffffff0000002a
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff81608071
stack pointer	       = 0x28:0xfffffe0202a186a0
frame pointer	       = 0x28:0xfffffe0202a18990
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 11 (swi1: hpts)
trap number		= 12
panic: page fault
cpuid = 4
time = 1626306588
KDB: stack backtrace:
#0 0xffffffff80646505 at kdb_backtrace+0x65
#1 0xffffffff80602661 at vpanic+0x181
#2 0xffffffff806024d3 at panic+0x43
#3 0xffffffff8085a857 at trap_fatal+0x387
#4 0xffffffff8085a8af at trap_pfault+0x4f
#5 0xffffffff80859f63 at trap+0x253
#6 0xffffffff80833d8e at calltrap+0x8
#7 0xffffffff8075fc10 at tcp_hptsi+0x7d0
#8 0xffffffff80760ddc at tcp_hpts_thread+0x11c
#9 0xffffffff805cb221 at ithread_loop+0x191
#10 0xffffffff805c8541 at fork_exit+0x71
#11 0xffffffff80834e1e at fork_trampoline+0xe
Uptime: 25s
Dumping 4278 out of 130940 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55		__asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff8060228e in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff806026d0 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff806024d3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff8085a857 in trap_fatal (frame=0xfffffe0202a185e0, eva=18446744069414584362) at /usr/src/sys/amd64/amd64/trap.c:943
#6  0xffffffff8085a8af in trap_pfault (frame=frame@entry=0xfffffe0202a185e0, usermode=false, signo=<optimized out>, signo@entry=0x0, ucode=<optimized out>, ucode@entry=0x0) at /usr/src/sys/amd64/amd64/trap.c:760
#7  0xffffffff80859f63 in trap (frame=0xfffffe0202a185e0) at /usr/src/sys/amd64/amd64/trap.c:438
#8  <signal handler called>
#9  0xffffffff81608071 in rack_output (tp=<optimized out>) at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:16540
#10 0xffffffff8075fc10 in tcp_hptsi (hpts=hpts@entry=0xfffff8010398c780, from_callout=from_callout@entry=1) at /usr/src/sys/netinet/tcp_hpts.c:1662
#11 0xffffffff80760ddc in tcp_hpts_thread (ctx=0xfffff8010398c780) at /usr/src/sys/netinet/tcp_hpts.c:2035
#12 0xffffffff805cb221 in intr_event_execute_handlers (p=<optimized out>, ie=0xfffff8010398d500) at /usr/src/sys/kern/kern_intr.c:1168
#13 ithread_execute_handlers (p=<optimized out>, ie=0xfffff8010398d500) at /usr/src/sys/kern/kern_intr.c:1181
#14 ithread_loop (arg=arg@entry=0xfffff8010397d680) at /usr/src/sys/kern/kern_intr.c:1269
#15 0xffffffff805c8541 in fork_exit (callout=0xffffffff805cb090 <ithread_loop>, arg=0xfffff8010397d680, frame=0xfffffe0202a18c00) at /usr/src/sys/kern/kern_fork.c:1083
#16 <signal handler called>
Comment 5 iron.udjin 2021-07-15 00:02:50 UTC
The previous trace was taken after I switched the congestion control (CC) algorithm from HTCP to NEWRENO.
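
The instruction pointers in the description and in comment #4 differ, but they share the same offset within a 4 KiB page and are a whole number of pages apart. That would be consistent with the same instruction in rack_output faulting both times, with tcp_rack.ko loaded at a different address on the second boot. This is an inference, not something confirmed in this report. A minimal sketch of the check, with hypothetical helper names and an assumed 4 KiB page size:

```python
PAGE_SIZE = 4096  # assumed amd64 base page size

IP_DESCRIPTION = 0xffffffff81629071  # instruction pointer in the description
IP_COMMENT_4 = 0xffffffff81608071    # instruction pointer in comment #4


def page_offset(addr):
    """Offset of an address within its page."""
    return addr % PAGE_SIZE


def pages_apart(a, b):
    """Distance between two addresses in whole pages, or None if not page-aligned."""
    delta = abs(a - b)
    if delta % PAGE_SIZE != 0:
        return None
    return delta // PAGE_SIZE


print(hex(page_offset(IP_DESCRIPTION)), hex(page_offset(IP_COMMENT_4)))
print(pages_apart(IP_DESCRIPTION, IP_COMMENT_4))
```

Both offsets come out as 0x71, and the two addresses are 33 pages (0x21000 bytes) apart.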
Comment 6 Michael Tuexen freebsd_committer 2021-07-15 07:34:08 UTC
(In reply to iron.udjin from comment #5)

Interesting. I have seen a panic like the one in comment #4 on one of my arm64 servers running FreeBSD main, but I thought it was specific to arm64, since I hadn't seen it on amd64. Your report shows that it is platform independent.
Since we keep FreeBSD main and stable/13 in sync as much as possible, it is not unexpected that you see the problem on stable/13 and I saw it on main.

Do you have steps that trigger the panic deterministically after the system has come up? It would be helpful for me to be able to trigger the problem also on an amd64 system.

Can you also provide the output of ifconfig? I'm wondering if LRO or TSO is involved...
Comment 7 iron.udjin 2021-07-15 09:46:57 UTC
(In reply to Michael Tuexen from comment #6)

>Do you have steps that trigger the panic deterministically after the system has come up?

Unfortunately, no. The system panics a few seconds after the server starts up.

>Can you also provide the output of ifconfig?

igb0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
	options=4e527bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
	ether 9c:5c:8e:4f:6a:7d
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
	options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
	inet 127.0.0.1 netmask 0xffffffff
	groups: lo
lo1: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
	options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
	inet 192.168.0.1 netmask 0xffffffff
	groups: lo

Should I try changing the MTU back to 1500 or disabling TSO/LRO?
Comment 8 Michael Tuexen freebsd_committer 2021-07-15 17:46:36 UTC
(In reply to iron.udjin from comment #7)
I do see the panic also on an igb interface, but I'm using an MTU of 1500 bytes. Let me try to experiment...
Comment 9 Michael Tuexen freebsd_committer 2021-07-16 20:49:19 UTC
Are you using vnets? If not, you can comment out the

options VIMAGE

line and rebuild the kernel. Testing on my server indicates that the problem only shows up with VIMAGE kernels.
Comment 10 iron.udjin 2021-07-16 21:15:26 UTC
(In reply to Michael Tuexen from comment #9)

I've rebuilt the kernel without VIMAGE. No panic after restart so far.

P.S.: there is still a problem with SSH (as I already described in #256538). I'll debug this issue and create a new bug report when I have time for it.
Quick workaround:
# sysctl net.inet.tcp.functions_default=freebsd
# service sshd restart
# sysctl net.inet.tcp.functions_default=rack
Comment 11 Michael Tuexen freebsd_committer 2021-07-16 21:28:16 UTC
(In reply to iron.udjin from comment #10)
Regarding your ssh problem: what is the output of
sysctl net.inet.tcp.tolerate_missing_ts

What OS is running on the system that is ssh-ing into the box?
Comment 12 iron.udjin 2021-07-16 21:33:49 UTC
(In reply to Michael Tuexen from comment #11)

# sysctl net.inet.tcp.tolerate_missing_ts
net.inet.tcp.tolerate_missing_ts: 1

The client runs the same OS version as the host. When connecting to the server from Windows, the problem does not occur.
Comment 13 Michael Tuexen freebsd_committer 2021-07-16 21:52:19 UTC
(In reply to iron.udjin from comment #12)
Assuming that on the peer we also have
net.inet.tcp.tolerate_missing_ts: 1
then we need to look into this. Please open a separate issue for that (when time permits).
Comment 14 iron.udjin 2021-07-18 04:07:43 UTC
(In reply to iron.udjin from comment #12)

OS: stable/13-n246050-07ef7a034965

On another server I caught one more panic, but it has a slightly different trace:

Fatal trap 12: page fault while in kernel mode
cpuid = 39; apic id = 33
fault virtual address   = 0x18
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80fc1c20
stack pointer           = 0x0:0xfffffe0321555e90
frame pointer           = 0x0:0xfffffe0321555ed0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 11 (swi1: hpts)
trap number             = 12
panic: page fault
cpuid = 39
time = 1624174594
KDB: stack backtrace:
#0 0xffffffff805f37a5 at kdb_backtrace+0x65
#1 0xffffffff805a9931 at vpanic+0x181
#2 0xffffffff805a97a3 at panic+0x43
#3 0xffffffff80852617 at trap_fatal+0x387
#4 0xffffffff8085266f at trap_pfault+0x4f
#5 0xffffffff80851ce3 at trap+0x253
#6 0xffffffff8082ac18 at calltrap+0x8
#7 0xffffffff80fb183c at rack_log_output+0xec
#8 0xffffffff80fa9a33 at rack_output+0x6ca3
#9 0xffffffff80718835 at tcp_hpts_thread+0x725
#10 0xffffffff8056cfed at ithread_loop+0x24d
#11 0xffffffff80569ebd at fork_exit+0x7d
#12 0xffffffff8082bc9e at fork_trampoline+0xe
Uptime: 9h40m48s
Dumping 21243 out of 196233 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff805a9525 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff805a99a0 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff805a97a3 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff80852617 in trap_fatal (frame=0xfffffe0321555dd0, eva=24)
    at /usr/src/sys/amd64/amd64/trap.c:943
#6  0xffffffff8085266f in trap_pfault (frame=frame@entry=0xfffffe0321555dd0, 
    usermode=false, signo=<optimized out>, signo@entry=0x0, 
    ucode=<optimized out>, ucode@entry=0x0)
    at /usr/src/sys/amd64/amd64/trap.c:760
#7  0xffffffff80851ce3 in trap (frame=0xfffffe0321555dd0)
    at /usr/src/sys/amd64/amd64/trap.c:438
#8  <signal handler called>
#9  rack_setup_offset_for_rsm (src_rsm=0xfffff814ec3da230, 
    rsm=0xfffff81f552bebd0)
    at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:6024
#10 rack_clone_rsm (rack=<optimized out>, nrsm=0xfffff81f552bebd0, 
    rsm=0xfffff814ec3da230, start=3444253360)
    at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:6076
#11 rack_update_entry (tp=tp@entry=0xfffffe07d12dc870, 
    rack=0xfffffe07c8e3cd00, rsm=0xfffff814ec3da230, ts=34848115395, 
    lenp=lenp@entry=0xfffffe0321555f14, add_flag=<optimized out>)
    at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:7169
#12 0xffffffff80fb183c in rack_log_output (tp=tp@entry=0xfffffe07d12dc870, 
    to=<optimized out>, len=len@entry=253, seq_out=3444253107, 
    th_flags=<optimized out>, th_flags@entry=16 '\020', err=err@entry=0, 
    cts=34848115395, hintrsm=0x0, add_flag=16384, s_mb=0xfffff80df0cd4800, 
    s_moff=1)
    at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:7384
#13 0xffffffff80fa9a33 in rack_fast_rsm_output (tp=<optimized out>, 
    rack=<optimized out>, rsm=<optimized out>, ts_val=<optimized out>, 
    cts=488377027, ms_cts=34848115, tv=0xfffffe0321556018, 
    len=<optimized out>)
    at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:15404
#14 rack_output (tp=<optimized out>)
    at /usr/src/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:16417
#15 0xffffffff80718835 in tcp_hptsi (hpts=0xfffff8184d9f3700)
    at /usr/src/sys/netinet/tcp_hpts.c:1613
#16 tcp_hpts_thread (ctx=0xfffff8184d9f3700)
    at /usr/src/sys/netinet/tcp_hpts.c:1832
#17 0xffffffff8056cfed in intr_event_execute_handlers (p=<optimized out>, 
    ie=0xfffff8184d9d0c00) at /usr/src/sys/kern/kern_intr.c:1168
#18 ithread_execute_handlers (p=<optimized out>, ie=0xfffff8184d9d0c00)
    at /usr/src/sys/kern/kern_intr.c:1181
#19 ithread_loop (arg=arg@entry=0xfffff8184d9e3640)
    at /usr/src/sys/kern/kern_intr.c:1269
#20 0xffffffff80569ebd in fork_exit (
    callout=0xffffffff8056cda0 <ithread_loop>, arg=0xfffff8184d9e3640, 
    frame=0xfffffe0321556480) at /usr/src/sys/kern/kern_fork.c:1083
#21 <signal handler called>
(kgdb)

VIMAGE is also enabled there.
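
In this second trace the fault address is 0x18 (eva=24), which is far smaller than a page. A common reading of such an address is a field access at a small offset through a NULL pointer, though the dump alone does not prove that here. A rough sketch of that heuristic follows. The function name and the 4 KiB threshold are assumptions for illustration.

```python
PAGE_SIZE = 4096  # assumed amd64 base page size


def looks_like_null_offset(fault_addr, limit=PAGE_SIZE):
    """True if the address lies in the first page, i.e. NULL plus a small offset."""
    return 0 <= fault_addr < limit


SECOND_PANIC = 0x18                # fault virtual address in this comment
FIRST_PANIC = 0xffffffff0000002a   # fault virtual address in the description

print(looks_like_null_offset(SECOND_PANIC))  # True
print(looks_like_null_offset(FIRST_PANIC))   # False
```

By this measure the two panics fault on different kinds of addresses.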
Comment 15 Michael Tuexen freebsd_committer 2021-07-18 10:06:01 UTC
I think review D31212 will fix the first issue you reported. At least it explains it and resolves it in my testing when using a kernel with VIMAGE enabled.

It would be great if you could test it and report back.
Comment 16 Michael Tuexen freebsd_committer 2021-07-18 10:07:45 UTC
(In reply to iron.udjin from comment #14)
I think this is a different issue. Do you have a way to reproduce this?
Comment 17 iron.udjin 2021-07-18 10:13:32 UTC
(In reply to Michael Tuexen from comment #16)

No. I found a crash dump from a panic that happened a month ago. The server restarted automatically after the panic; I didn't even know about it.
Comment 18 Michael Tuexen freebsd_committer 2021-07-18 10:37:59 UTC
(In reply to iron.udjin from comment #17)
I think we had older sources a month ago. I would suggest updating and checking whether the problem still exists.
Comment 19 commit-hook freebsd_committer 2021-07-19 22:33:41 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a730d82378d3cdf5356775ec0c23ad2ca40c5edb

commit a730d82378d3cdf5356775ec0c23ad2ca40c5edb
Author:     Michael Tuexen <tuexen@FreeBSD.org>
AuthorDate: 2021-07-19 22:29:18 +0000
Commit:     Michael Tuexen <tuexen@FreeBSD.org>
CommitDate: 2021-07-19 22:29:18 +0000

    tcp: fix RACK and BBR when using VIMAGE enabled kernel

    Fix a bug in VNET handling, which occurs when using specific NICs.
    PR:                     257195
    Reviewed by:            rrs
    MFC after:              3 days
    Sponsored by:           Netflix, Inc.
    Differential Revision:  https://reviews.freebsd.org/D31212

 sys/netinet/tcp_stacks/rack_bbr_common.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
Comment 20 iron.udjin 2021-07-20 01:13:04 UTC
(In reply to commit-hook from comment #19)

Just tested your patch. The server doesn't panic. All seems good.
Comment 21 Kubilay Kocak freebsd_committer freebsd_triage 2021-07-20 01:59:33 UTC
^Triage: Assign to committer resolving and mark (un)affected branches.
Comment 22 commit-hook freebsd_committer 2021-07-22 09:14:42 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9b1219b24a5adaca44833287ac2727e3523e3b62

commit 9b1219b24a5adaca44833287ac2727e3523e3b62
Author:     Michael Tuexen <tuexen@FreeBSD.org>
AuthorDate: 2021-07-19 22:29:18 +0000
Commit:     Michael Tuexen <tuexen@FreeBSD.org>
CommitDate: 2021-07-22 09:13:31 +0000

    tcp: fix RACK and BBR when using VIMAGE enabled kernel

    Fix a bug in VNET handling, which occurs when using specific NICs.
    PR:                     257195
    Reviewed by:            rrs
    Sponsored by:           Netflix, Inc.
    Differential Revision:  https://reviews.freebsd.org/D31212

    (cherry picked from commit a730d82378d3cdf5356775ec0c23ad2ca40c5edb)

 sys/netinet/tcp_stacks/rack_bbr_common.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)