Bug 252958 - [tcp] Kernel panic in tcp_prr_partialack()
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Richard Scheffenegger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-01-24 02:57 UTC by Martin Matuska
Modified: 2021-02-01 19:15 UTC

See Also:
tuexen: mfc-stable13+


Attachments
Backtrace (2.07 KB, text/plain)
2021-01-24 02:57 UTC, Martin Matuska

Description Martin Matuska freebsd_committer 2021-01-24 02:57:44 UTC
Created attachment 221859
Backtrace

I am encountering a kernel panic with kernel:
FreeBSD 13.0-ALPHA2 stable/13-c256207-gd1c39af0ec3

It does not happen with kernels before:
bc7ee8e5bc555c246bad8bbb9cdf964fa0a08f41
Comment 1 Martin Matuska freebsd_committer 2021-01-24 08:41:18 UTC
Used sysctls:

kern.ipc.maxsockbuf=33554432
net.inet.tcp.sendbuf_inc=32768
net.inet.tcp.cc.algorithm="htcp"
net.inet.tcp.cc.htcp.rtt_scaling=0
net.inet.tcp.cc.htcp.adaptive_backoff=1

# Required for proper PF operation
kern.timecounter.hardware="HPET"

# Kernel TLS
kern.ipc.mb_use_ext_pgs=1
kern.ipc.tls.ifnet.permitted=1
kern.ipc.tls.enable=1
Comment 2 Michael Tuexen freebsd_committer 2021-01-24 17:58:14 UTC
Assigning this to Richard, since he authored the patch. Leaving the bug for him.

Initial question: do you have a way to reproduce this? Are there steps to follow to recreate this locally?
Comment 3 Martin Matuska freebsd_committer 2021-01-24 21:08:14 UTC
This is somewhat difficult. We are not able to reproduce it artificially on our test system; we need "live" traffic on our 100 Gbit link, but very little streaming traffic from various systems (Linux, Windows, Android, ...) is enough to trigger it. The panic happens between 3 and 30 minutes after the system with the "faulty" kernel goes online, with only a few hundred megabits to single-digit gigabits of traffic.

It is a pity that we have not set up a swap partition large enough for a kernel dump, so I only have the backtrace from a screenshot of a debug-enabled kernel. If there is anything else I can do to help, we could perhaps redesign the boot disk to have a larger partition able to store a kernel dump.
Comment 4 Richard Scheffenegger freebsd_committer 2021-01-24 21:32:02 UTC
I'm not sure how we end up in tcp_prr_partialack() without recover_fs being initialized.

Potentially with ACK reordering (unlikely), or a spurious RTO rollback (where TF_FASTRECOVERY may be set, but recover_fs was already cleared).

Do you observe a non-zero "data packets unnecessarily retransmitted" counter in the output of netstat -snp tcp?

https://reviews.freebsd.org/D28326 has a patch to fix the div/0 in that section of code, although knowing exactly how the system ends up there would be interesting.

If the frequency of the above counter incrementing (with no more panics under that patch) matches the typical runtime before the panic happened, it is likely to have to do with RTO rollbacks.
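
For illustration only, the kind of guard discussed above might look like the following minimal sketch. It is not the D28326 patch and not the FreeBSD source: the struct prr_state type, the prr_sndcnt() function, and the flight_now fallback are hypothetical names invented here, and the arithmetic follows the proportional part of PRR as described in RFC 6937. The only point is that the divisor is checked before the division.

```c
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for per-connection PRR state.
 * The recover_fs name is taken from this thread; the other fields
 * follow the variable names in RFC 6937. This is not the FreeBSD struct.
 */
struct prr_state {
	uint32_t recover_fs;    /* flight size when recovery started; 0 if never set */
	uint32_t prr_delivered; /* bytes newly delivered during recovery */
	uint32_t prr_out;       /* bytes sent during recovery */
	uint32_t ssthresh;      /* target congestion window after recovery */
};

/*
 * Proportional part of PRR:
 *   sndcnt = CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out
 *
 * flight_now is the caller's estimate of bytes currently outstanding.
 * It is an assumption of this sketch, used only as a fallback divisor
 * when recover_fs was never initialized (the div/0 case in this bug).
 */
static int64_t
prr_sndcnt(struct prr_state *s, uint32_t flight_now)
{
	/* Guard: never divide by an uninitialized (zero) recover_fs. */
	if (s->recover_fs == 0)
		s->recover_fs = flight_now > 0 ? flight_now : 1;

	/* 64-bit intermediate so the product cannot overflow. */
	uint64_t num = (uint64_t)s->prr_delivered * s->ssthresh;
	uint64_t target = (num + s->recover_fs - 1) / s->recover_fs; /* ceiling */
	int64_t sndcnt = (int64_t)target - (int64_t)s->prr_out;

	/* Clamping to zero is a choice made for this sketch. */
	return sndcnt > 0 ? sndcnt : 0;
}
```

A guard like this only hides the symptom; it does not explain how the connection reaches the partial-ACK path with recover_fs unset, which is the open question above.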
Comment 5 Michael Tuexen freebsd_committer 2021-01-24 21:36:57 UTC
Martin, are you able to test whether D28326 fixes the issue? If you can't, that is fine, but if you can, it would be great to know the result.
Comment 6 Michael Tuexen freebsd_committer 2021-01-26 16:11:09 UTC
https://cgit.FreeBSD.org/src/commit/?id=6a376af0cd212be4e16d013d35a0e2eec1dbb8ae should fix this issue.
Comment 7 Michael Tuexen freebsd_committer 2021-02-01 18:45:56 UTC
https://cgit.FreeBSD.org/src/commit/?id=76dd854f47f4aea703093647a158f280d383ea6d fixes it in stable/13. Therefore closing the issue.