Bug 254309

Summary:

[tcp] frequent panics in tcp_output

Product:

Base System

Reporter:

Ivan Rozhuk <rozhuk.im>

Component:

kern

Assignee:

Richard Scheffenegger <rscheff>

Status:

Closed FIXED

Severity:

Affects Only Me

CC:

ae, cy, hselasky, rozhuk.im, rscheff, thj, tuexen

Priority:

---

Keywords:

crash

Version:

13.0-STABLE

Hardware:

amd64

OS:

Any

See Also:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254244
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254015
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263445

Attachments:

Description	Flags
crash log	none
log 2	none

Description Ivan Rozhuk 2021-03-15 15:41:50 UTC

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff805ee215 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff805ee680 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff805ee483 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff808e58f7 in trap_fatal (frame=0xfffffe01140f6540, eva=24)
    at /usr/src/sys/amd64/amd64/trap.c:915
#6  0xffffffff808e594f in trap_pfault (frame=frame@entry=0xfffffe01140f6540,
    usermode=false, signo=<optimized out>, signo@entry=0x0,
    ucode=<optimized out>, ucode@entry=0x0)
    at /usr/src/sys/amd64/amd64/trap.c:732
#7  0xffffffff808e5116 in trap (frame=0xfffffe01140f6540)
    at /usr/src/sys/amd64/amd64/trap.c:398
#8  <signal handler called>
#9  m_copydata (m=0x0, m@entry=0xfffff80576236400, off=0, len=1,
    cp=<optimized out>) at /usr/src/sys/kern/uipc_mbuf.c:656
#10 0xffffffff8076eb5a in tcp_output (tp=0xfffffe0169d87ca8)
    at /usr/src/sys/netinet/tcp_output.c:1068
#11 0xffffffff80765fbb in tcp_do_segment (m=<optimized out>,
    th=<optimized out>, so=0xfffff802a3cb03b0, tp=0xfffffe0169d87ca8,
    drop_hdrlen=52, tlen=<optimized out>, iptos=0 '\000')
    at /usr/src/sys/netinet/tcp_input.c:2817
#12 0xffffffff80763588 in tcp_input (mp=<optimized out>,
    offp=<optimized out>, proto=<optimized out>)
    at /usr/src/sys/netinet/tcp_input.c:1135
#13 0xffffffff80757912 in ip_input (m=0x0)
    at /usr/src/sys/netinet/ip_input.c:833
#14 0xffffffff8072c318 in netisr_process_workstream_proto (
    nwsp=<optimized out>, proto=1) at /usr/src/sys/net/netisr.c:919
#15 swi_net (arg=<optimized out>) at /usr/src/sys/net/netisr.c:966
#16 0xffffffff805bb045 in intr_event_execute_handlers (p=<optimized out>,
    ie=0xfffff800029d6500) at /usr/src/sys/kern/kern_intr.c:1168
#17 ithread_execute_handlers (p=<optimized out>, ie=0xfffff800029d6500)
    at /usr/src/sys/kern/kern_intr.c:1181
#18 ithread_loop (arg=0xfffff80002a05020)
    at /usr/src/sys/kern/kern_intr.c:1269
#19 0xffffffff805b7c77 in fork_exit (
    callout=0xffffffff805bad30 <ithread_loop>, arg=0xfffff80002a05020,
    frame=0xfffffe01140f6c00) at /usr/src/sys/kern/kern_fork.c:1069
#20 <signal handler called>



__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff805ee6a5 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff805eeb10 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff805ee913 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff808e5d57 in trap_fatal (frame=0xfffffe0113ba9540, eva=80)
    at /usr/src/sys/amd64/amd64/trap.c:915
#6  0xffffffff808e5daf in trap_pfault (frame=frame@entry=0xfffffe0113ba9540,
    usermode=false, signo=<optimized out>, signo@entry=0x0,
    ucode=<optimized out>, ucode@entry=0x0)
    at /usr/src/sys/amd64/amd64/trap.c:732
#7  0xffffffff808e5576 in trap (frame=0xfffffe0113ba9540)
    at /usr/src/sys/amd64/amd64/trap.c:398
#8  <signal handler called>
#9  0xffffffff80650d6c in turnstile_wait (ts=0xfffff8000229c780,
    owner=<optimized out>, queue=queue@entry=0)
    at /usr/src/sys/kern/subr_turnstile.c:794
#10 0xffffffff805d5d75 in __mtx_lock_sleep (c=0xfffff80004cf5618,
    v=<optimized out>) at /usr/src/sys/kern/kern_mutex.c:664
#11 0xffffffff80771f9a in tcp_hpts_thread (ctx=0xfffff80004cf5600)
    at /usr/src/sys/netinet/tcp_hpts.c:1816
#12 0xffffffff80609166 in softclock_call_cc (c=0xfffff80004cf56c0,
    cc=cc@entry=0xffffffff80c6bd40 <cc_cpu+4800>, direct=direct@entry=1)
    at /usr/src/sys/kern/kern_timeout.c:696
#13 0xffffffff80608f3f in callout_process (now=now@entry=8227146370453)
    at /usr/src/sys/kern/kern_timeout.c:479
#14 0xffffffff80590355 in handleevents (now=8227146370453, fake=fake@entry=0)
    at /usr/src/sys/kern/kern_clocksource.c:213
#15 0xffffffff8059011c in hardclockintr ()
    at /usr/src/sys/kern/kern_clocksource.c:148
#16 0xffffffff808b78a1 in ipi_bitmap_handler (frame=...)
    at /usr/src/sys/x86/x86/mp_x86.c:1318
#17 <signal handler called>
#18 acpi_cpu_c1 () at /usr/src/sys/x86/x86/cpu_machdep.c:211
#19 0xffffffff804180cb in acpi_cpu_idle (sbt=<optimized out>)
    at /usr/src/sys/dev/acpica/acpi_cpu.c:1185
#20 0xffffffff808acde1 in cpu_idle_acpi (sbt=0)
    at /usr/src/sys/x86/x86/cpu_machdep.c:509
#21 0xffffffff808ace97 in cpu_idle (busy=0)
    at /usr/src/sys/x86/x86/cpu_machdep.c:629
#22 0xffffffff8061fcb4 in sched_idletd (dummy=<optimized out>)
    at /usr/src/sys/kern/sched_ule.c:2874
#23 0xffffffff805b7c77 in fork_exit (
    callout=0xffffffff8061f920 <sched_idletd>, arg=0x0,
    frame=0xfffffe0113ba9c00) at /usr/src/sys/kern/kern_fork.c:1069
#24 <signal handler called>

Comment 1 Tom Jones freebsd_committer

2021-03-15 16:19:10 UTC

Can you include more information about your FreeBSD version (git hash) and TCP configuration (congestion control, TCP stack), and describe your workload and when the panics are triggered?

Comment 2 Michael Tuexen freebsd_committer

2021-03-15 16:50:35 UTC

It would be great to see the panic message and to have a way to reproduce it. Can you describe how to reproduce the issue? Does the same problem occur when you are using CURRENT?

Comment 3 Ivan Rozhuk 2021-03-15 17:08:02 UTC

Created attachment 223295 [details]
crash log

http://www.netlab.linkpc.net/download/software/os_cfg/FBSD/13/base/usr/src/sys/amd64/conf/
srv+base

http://www.netlab.linkpc.net/download/software/os_cfg/FBSD/13/base/etc/sysctl.conf

Other configs available:
http://www.netlab.linkpc.net/download/software/os_cfg/FBSD/13/
base+srv

This is a home NAS+++ server running a web server, Samba, rtorrent, etc., connected to the Internet via IPv4 + IPv6.

This has happened a few times, about once per day.
I cannot reproduce it on demand.

FreeBSD 13 amd64, built from sources that are a few days old.

Comment 4 Ivan Rozhuk 2021-03-15 17:08:21 UTC

Created attachment 223296 [details]
log 2

Comment 5 Ivan Rozhuk 2021-03-15 17:09:52 UTC

igb0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
	options=4e527bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
	ether ***********
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
	nd6 options=9<PERFORMNUD,IFDISABLED>

Comment 6 Ivan Rozhuk 2021-03-16 11:52:49 UTC

Got the first crash (@m_copydata (m=0x0, m@entry=0xfffff80576236400, off=0, len=1, cp=<optimized out>) at /usr/src/sys/kern/uipc_mbuf.c:656) on 2 different installations.

I suspect it is caused by some change in the last 1-2 months, or by my own change: I added
options		RATELIMIT		#o TX rate limiting support

Comment 7 Ivan Rozhuk 2021-03-16 20:40:12 UTC

#10 0xffffffff8076ec3a in tcp_output (tp=0xfffffe01137c60c0) at /usr/src/sys/netinet/tcp_output.c:1068
1068				m_copydata(mb, moff, len,
(kgdb) info locals
moff = 0
mb = 0xfffff802a20ba800
msb = <optimized out>
opt = "\001\001\b\n\325\035\060\267+\223/\264\001\376\377\377\001M\373\267\000\000\000\000\340`|\023\001\376\377\377\250\005\000\000\000\000\000"
to = {to_flags = 16, to_tsval = 3073383893, to_tsecr = 3023016747, to_sacks = 0xfffffe001eacac00 "\300P\254\036", 
  to_signature = 0x11ea8d0c0 <error: Cannot access memory at address 0x11ea8d0c0>, to_tfo_cookie = 0xfe <error: Cannot access memory at address 0xfe>, 
  to_mss = 8288, to_wscale = 229 '\345', to_nsacks = 197 '\305', to_tfo_len = 0 '\000', to_spare = 3535675904}
hw_tls = false
isipv6 = <optimized out>
ip6 = 0x0
dont_sendalot = 0
wanted_cookie = 0
ip = 0xfffff8020e774668
if_hw_tsomaxsegsize = 0
if_hw_tsomaxsegcount = 0
error = <optimized out>
so = <optimized out>
idle = 0
sendalot = 1
tso = 0
flags = 17
recwin = 2098020
sack_rxmit = 1
p = 0xfffff802a256ef20
off = 32622
mtu = 0
sendwin = <optimized out>
sack_bytes_rxmt = <optimized out>
len = 1
ipoptlen = <optimized out>
optlen = <optimized out>
hdrlen = <optimized out>
curticks = <optimized out>
m = 0xfffff8020eecec00
th = <optimized out>

Comment 8 Ivan Rozhuk 2021-03-17 11:26:45 UTC

I tried a kernel without:
options		RATELIMIT		#o TX rate limiting support
options		TCP_OFFLOAD		#o TCP offload
options 	TCP_BLACKBOX		#o Enhanced TCP event logging
options 	TCP_HHOOK		#o hhook(9) framework for TCP
options		TCP_RFC7413		#o Server-side implementation of TCP Fast Open (TFO) [RFC7413]
options		TCP_RFC7413_MAX_KEYS=2	#o 
options		TCPHPTS			#o high precision timer system for tcp.

It did not help.

Comment 9 Hans Petter Selasky freebsd_committer

2021-03-17 12:09:20 UTC

Hi,

Try to set:
sysctl net.inet.tcp.sack.enable=0

For now.

--HPS

Comment 10 Richard Scheffenegger freebsd_committer

2021-03-17 15:05:36 UTC

Better: set net.inet.tcp.rfc6675_pipe=0 to disable it while retaining SACK.

Comment 11 Richard Scheffenegger freebsd_committer

2021-03-17 15:30:16 UTC

See https://reviews.freebsd.org/D29315 

flags = 17 -> 0x11 = 0x10 | 0x01, i.e. TF_SENTFIN and TF_ACKNOW,

and 6675pipe is enabled, which enables the new rescue retransmission.
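
As a rough illustration of the decoding above, here is a minimal C sketch that splits 17 into its bits. The table and the helper decode_flags() are hypothetical and only mirror the two bit values mentioned here; they are not taken from the kernel headers. One assumption worth flagging: if the local `flags` variable in tcp_output holds TCP header flags rather than connection state, the same two bits would read as TH_FIN and TH_ACK. Either reading points at a FIN being involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One known bit and its name under two possible readings of the value. */
struct flag_name {
	unsigned int bit;
	const char *tf_name;	/* if the value is connection state (TF_*) */
	const char *th_name;	/* if the value is TCP header flags (TH_*) */
};

/* Only the two bits discussed here; values are assumptions of this sketch. */
static const struct flag_name flag_names[] = {
	{ 0x01, "TF_ACKNOW",  "TH_FIN" },
	{ 0x10, "TF_SENTFIN", "TH_ACK" },
};

/*
 * Write the names of the known bits set in `value` into `buf`, joined
 * by '|'. Output is silently truncated if `buf` is too small.
 * Returns the bits that were set but not recognized.
 */
unsigned int
decode_flags(unsigned int value, int header_names, char *buf, size_t buflen)
{
	size_t i, used;

	assert(buf != NULL && buflen > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(flag_names) / sizeof(flag_names[0]); i++) {
		if ((value & flag_names[i].bit) == 0)
			continue;
		used = strlen(buf);
		snprintf(buf + used, buflen - used, "%s%s",
		    used > 0 ? "|" : "",
		    header_names ? flag_names[i].th_name :
		    flag_names[i].tf_name);
		value &= ~flag_names[i].bit;
	}
	return (value);
}
```

Calling decode_flags(17, 0, buf, sizeof(buf)) fills buf with the TF_* names for both bits and returns 0, meaning no unrecognized bits remain.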

Further, this is stated to be a web server, where it is likely that HTTP/1.0 TCP sessions are closed right after an object is sent. If the very last segment, carrying the FIN, is dropped by the network, the rescue retransmission code tries to include the "data byte" of the FIN, which does not really exist: the FIN only occupies the last octet of the sequence space.
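
To make the FIN overlap concrete, here is a small C model of the sequence space. It is only a sketch under stated assumptions, not the change made in sys/netinet/tcp_sack.c: struct send_model and both helpers are hypothetical names invented for illustration. It shows that a one-octet retransmission starting exactly at the FIN's sequence number has zero payload bytes behind it, so a copy of len=1 from the send buffer has nothing to read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a send queue whose stream ended with a FIN. */
struct send_model {
	uint32_t snd_una;	/* first unacknowledged sequence number */
	uint32_t data_len;	/* payload bytes queued, starting at snd_una */
	int fin_sent;		/* FIN occupies sequence snd_una + data_len */
};

/*
 * Payload bytes that really exist for a retransmission starting at
 * `seq` and spanning `len` units of sequence space.
 */
uint32_t
rexmit_data_len(const struct send_model *sm, uint32_t seq, uint32_t len)
{
	uint32_t off, avail;

	assert(sm != NULL);
	off = seq - sm->snd_una;	/* modular, so wraparound is fine */
	if (off >= sm->data_len)
		return (0);		/* at or past the FIN: no payload */
	avail = sm->data_len - off;
	return (len < avail ? len : avail);
}

/* Does the requested range include the FIN's sequence number? */
int
rexmit_covers_fin(const struct send_model *sm, uint32_t seq, uint32_t len)
{
	uint64_t off;

	assert(sm != NULL);
	off = (uint32_t)(seq - sm->snd_una);
	return (sm->fin_sent && off <= sm->data_len &&
	    off + len > sm->data_len);
}
```

With snd_una = 1000 and data_len = 500, a request for seq = 1500, len = 1 covers the FIN but yields 0 payload bytes, which is the situation described above; a caller would have to send the FIN flag alone rather than copy a byte.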

Comment 12 commit-hook freebsd_committer

2021-03-17 16:44:17 UTC

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e9f029831fa5747ae1b405f5716c52cb4ebf1e04

commit e9f029831fa5747ae1b405f5716c52cb4ebf1e04
Author:     Richard Scheffenegger <rscheff@FreeBSD.org>
AuthorDate: 2021-03-17 15:44:29 +0000
Commit:     Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2021-03-17 16:12:04 +0000

    fix panic when rescue retransmission and FIN overlap

    PR:           254244
    PR:           254309
    Reviewed By:  #transport, hselasky, tuexen
    MFC after:    3 days
    Sponsored By: NetApp, Inc.
    Differential Revision: https://reviews.freebsd.org/D29315

 sys/netinet/tcp_sack.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Comment 13 commit-hook freebsd_committer

2021-03-17 19:33:56 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=703419774f86525a2441d615733993a6fddcd047

commit 703419774f86525a2441d615733993a6fddcd047
Author:     Richard Scheffenegger <rscheff@FreeBSD.org>
AuthorDate: 2021-03-17 15:44:29 +0000
Commit:     Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2021-03-17 19:05:33 +0000

    fix panic when rescue retransmission and FIN overlap

    PR:           254244
    PR:           254309
    Reviewed By:  #transport, hselasky, tuexen
    Approved by:  re (cperciva)
    MFC after:    immediately
    Sponsored By: NetApp, Inc.
    Differential Revision: https://reviews.freebsd.org/D29315

    (cherry picked from commit e9f029831fa5747ae1b405f5716c52cb4ebf1e04)

 sys/netinet/tcp_sack.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Comment 14 Ivan Rozhuk 2021-03-19 19:58:48 UTC

2 days of uptime without a panic; it looks fixed.

Thanks!