Bug 254244 - panics after upgrade to stable/13-n244861-b9773574371
Summary: panics after upgrade to stable/13-n244861-b9773574371
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Richard Scheffenegger
URL:
Keywords: crash, regression
Depends on:
Blocks:
 
Reported: 2021-03-12 20:03 UTC by Marek Zarychta
Modified: 2022-10-12 00:48 UTC
CC List: 8 users

See Also:
koobs: mfc-stable13+
koobs: mfc-stable12-
koobs: mfc-stable11-


Attachments
Some messages caught by syslog (11.58 KB, text/plain), 2021-03-12 20:03 UTC, Marek Zarychta
backtrace 1 (6.01 KB, text/plain), 2021-03-15 16:57 UTC, Marek Zarychta
Patch to try (497 bytes, patch), 2021-03-15 18:34 UTC, Hans Petter Selasky
Patch to try #2 (1.52 KB, patch), 2021-03-15 19:22 UTC, Hans Petter Selasky
backtrace 2 (5.47 KB, text/plain), 2021-03-16 16:25 UTC, Marek Zarychta

Description Marek Zarychta 2021-03-12 20:03:50 UTC
Created attachment 223220
Some messages caught by syslog

Yesterday I upgraded a bunch of machines to stable/13-n244861-b9773574371 and they have been panicking since. I cannot get core dumps at the moment; I will work on this after the weekend. Syslog caught what looks like a prelude to some of the panics:

Mar 12 13:33:44 opnr kernel: [6004] Fatal trap 12: page fault while in kernel mode
Mar 12 13:33:44 opnr kernel: [6004] cpuid = 3; apic id = 06
Mar 12 13:33:44 opnr kernel: [6004] fault virtual address       = 0x18
Mar 12 13:33:44 opnr kernel: [6004] fault code          = supervisor read data, page not present
Mar 12 13:33:44 opnr kernel: [6004] instruction pointer = 0x20:0xffffffff80c77c28
Mar 12 13:33:44 opnr kernel: [6004] stack pointer               = 0x28:0xfffffe00c723f340
Mar 12 13:33:44 opnr kernel: [6004] frame pointer               = 0x28:0xfffffe00c723f3b0
Mar 12 13:33:44 opnr kernel: [6004] code segment                = base rx0, limit 0xfffff, type 0x1b
Mar 12 13:33:44 opnr kernel: [6004]                     = DPL 0, pres 1, long 1, def32 0, gran 1
Mar 12 13:33:44 opnr kernel: [6004] processor eflags    = interrupt enabled, resume, IOPL = 0
Mar 12 13:33:44 opnr kernel: [6004] current process             = 0 (if_io_tqg_3)

More traces in the file attached.

Both affected machines have Atom C2758 CPU. I have been testing 13.0-STABLE on them for a month with no issues so far. The last working build was stable/13-n244798-05083436a6e from Sun Mar  7.

Access to the machines is limited, swap is configured on ZFS, and netdump is not possible due to the network configuration (vlan(4) interfaces over an LACP lagg(4)).

I will try to configure a USB drive as the crash dump device, if that turns out to be possible.
Comment 1 Hans Petter Selasky freebsd_committer freebsd_triage 2021-03-13 11:54:15 UTC
Looks like a classic NULL pointer dereference. Do you have the backtrace of the panic?

--HPS
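
For readers less familiar with this kind of trap: a fault virtual address of 0x18 is what one would expect when code reads a member located 24 bytes into a structure through a pointer that is NULL. As a rough illustration only, the Python sketch below models the first few members of struct mbuf in the order kgdb prints them later in this report, assuming amd64 sizes (8-byte pointers, 32-bit m_len). The class name MbufHead and the helper fault_address are hypothetical, and the sketch does not show which access actually faulted in this panic.

```python
import ctypes


class MbufHead(ctypes.Structure):
    # Hypothetical model of the leading members of struct mbuf on amd64.
    # Pointers are modelled as explicit 64-bit integers so the layout does
    # not depend on the machine running this script.
    _fields_ = [
        ("m_next", ctypes.c_uint64),
        ("m_nextpkt", ctypes.c_uint64),
        ("m_data", ctypes.c_uint64),
        ("m_len", ctypes.c_int32),
    ]


def fault_address(base, member):
    """Address touched when reading `member` through a pointer equal to `base`."""
    return base + getattr(MbufHead, member).offset


for name, _ in MbufHead._fields_:
    print(f"{name:10s} NULL + {fault_address(0, name):#x}")
```

Under these assumptions m_len lands at offset 0x18, so a NULL mbuf pointer is one candidate explanation for the trap; any other structure with a member at offset 24 would produce the same fault address.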
Comment 2 Marek Zarychta 2021-03-13 15:27:04 UTC
(In reply to Hans Petter Selasky from comment #1)
Unfortunately, there is nothing more I can provide right now. I will test the setup and enable the USB dump device to catch the panic when it happens again.
Comment 4 Marek Zarychta 2021-03-15 16:57:43 UTC
Created attachment 223294
backtrace 1

Here's a fresh backtrace.
Comment 5 Hans Petter Selasky freebsd_committer freebsd_triage 2021-03-15 18:33:52 UTC
Hi,

It looks to me like iflib's rxeof has produced an invalid chain of mbufs.

I only see one bug in there, and that is, if iflib_fixup_rx() gets a new mbuf header, the m_nextpkt field doesn't get zeroed.

If this is easy to reproduce, can you try the attached patch in the meantime?

--HPS
Comment 6 Hans Petter Selasky freebsd_committer freebsd_triage 2021-03-15 18:34:19 UTC
Created attachment 223301
Patch to try
Comment 7 Hans Petter Selasky freebsd_committer freebsd_triage 2021-03-15 18:46:12 UTC
Looks like I'm wrong. m_init() clears m_nextpkt, so it must be something else.
Comment 8 Mark Johnston freebsd_committer freebsd_triage 2021-03-15 19:02:56 UTC
There are a number of similar-looking reports on stable/13.  I wonder if some MFC was missed.
Comment 9 Marek Zarychta 2021-03-15 19:04:42 UTC
(In reply to Hans Petter Selasky from comment #7)

Too late, the patch was applied and the new kernel is installed. I will revert it then.

Could it be an LRO issue? I have LRO enabled on two igb(4) interfaces aggregated with LACP lagg(4), and this lagg(4) is the parent for some vlan(4) interfaces.
Comment 10 Mark Johnston freebsd_committer freebsd_triage 2021-03-15 19:11:52 UTC
(In reply to Marek Zarychta from comment #9)
Yes, disabling LRO would be a reasonable first step.  I looked at the commits between the problematic revision and the known-good revision and nothing really stands out.
Comment 11 Marek Zarychta 2021-03-15 19:20:26 UTC
I have been running 13.0-STABLE on this machine since 8 Feb 2021 and upgrading once or twice a week without issues. The machine was running production workloads: firewall (PF + IPFW with dummynet), HTTP proxy, some VPN solutions etc., and never experienced panics. As mentioned above, the last non-panicking build was stable/13-n244798-05083436a6e from Sun Mar 7.
Comment 12 Hans Petter Selasky freebsd_committer freebsd_triage 2021-03-15 19:22:38 UTC
Created attachment 223304
Patch to try #2

Try this patch instead.

@markj: There are some things I don't understand. Why is tcp_lro_queue_mbuf() not used when tcp_lro_flush_all() is? Looks buggy to me.

Also, I see that the lro_possible flag may be stale, i.e. the value from the previous mbuf is used for the new mbuf.

--HPS
Comment 13 Hans Petter Selasky freebsd_committer freebsd_triage 2021-03-15 19:29:16 UTC
The last commit that touched this code path was:

commit 35e4e998d8187c1d4d413bdc13a79a6415a30a18
Author: Stephen Hurd <shurd@FreeBSD.org>
Date:   Mon Nov 6 16:23:21 2017 +0000

    Only chain non-LRO mbufs when LRO is not possible
    
    Preserve packet order between tcp_lro_rx() and if_input() to avoid
    creating extra corner cases. If no packets can be LROed, combine them
    into one chain for submission via if_input(). If any packet can
    potentially be LROed however, retain old behaviour and call if_input()
    for each packet.
    
    This should keep the 12% improvement for small packet forwarding intact,
    but mostly avoids impacting the LRO case.
    
    Reviewed by:    cem, sbruno
    Approved by:    sbruno (mentor)
    Sponsored by:   Limelight Networks
    Differential Revision:  https://reviews.freebsd.org/D12876

Notes:
    svn path=/head/; revision=325487
Comment 14 Hans Petter Selasky freebsd_committer freebsd_triage 2021-03-15 19:38:32 UTC
Comment on attachment 223294
backtrace 1

In the debugger (at the kgdb prompt) you can try to dump this mbuf:

print ((struct mbuf *)0xfffff802b067d590)[0]

Then try to follow m_nextpkt:

print ((struct mbuf *)0xfffff802b067d590)[0].m_nextpkt[0]

And see where you end up.

--HPS
Comment 15 Marek Zarychta 2021-03-15 19:49:08 UTC
(In reply to Hans Petter Selasky from comment #12)
I am sorry, I can't test it. The patch doesn't apply to the latest stable/13 sources:
Rejected hunk #2.
Comment 16 Hans Petter Selasky freebsd_committer freebsd_triage 2021-03-15 19:56:14 UTC
Did you revert the previous patch?

Works fine here:

cat iflib.diff | patch  -p1
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|diff --git a/sys/net/iflib.c b/sys/net/iflib.c
|index 05e99ba318d..c6a8ec9e25e 100644
|--- a/sys/net/iflib.c
|+++ b/sys/net/iflib.c
--------------------------
Patching file sys/net/iflib.c using Plan A...
Hunk #1 succeeded at 2888 (offset -13 lines).
Hunk #2 succeeded at 2975 (offset -13 lines).
done

commit 763fb2fda0144e3630de74b918d06a96b7968ee2 (HEAD -> stable/13, freebsd/stable/13)
Author: Mark Johnston <markj@FreeBSD.org>
Date:   Mon Mar 8 12:39:05 2021 -0500

    dumpon.8: Ask DDB to call doadump() rather than calling it directly
    
    Sponsored by:   The FreeBSD Foundation
    
    (cherry picked from commit af06ff55535d9b2de253103e974558104e0a3d97)
Comment 17 Marek Zarychta 2021-03-15 20:55:15 UTC
(In reply to Hans Petter Selasky from comment #14)
I am sorry for the mix-up with this patch; I downloaded the wrong file (the diff against the previous diff).
The patch has now been applied and the system rebooted. To match the initial conditions, LRO is still enabled.

I am getting this:

(kgdb) print ((struct mbuf *)0xfffff802b067d590)[0]
$1 = {{m_next = 0xd266469b94fc022a, m_slist = {sle_next = 0xd266469b94fc022a}, m_stailq = {stqe_next = 0xd266469b94fc022a}}, {m_nextpkt = 0xd9d8353f106f5dcf, m_slistpkt = {
      sle_next = 0xd9d8353f106f5dcf}, m_stailqpkt = {stqe_next = 0xd9d8353f106f5dcf}}, m_data = 0x4bf25f3f2b29ab25 <error: Cannot access memory at address 0x4bf25f3f2b29ab25>, 
  m_len = -75255363, m_type = 63, m_flags = 7210736, {{{m_pkthdr = {{snd_tag = 0xf2fa50cacc9239d9, rcvif = 0xf2fa50cacc9239d9}, tags = {slh_first = 0xcce1f9cb1f09b95b}, 
          len = -750803441, flowid = 2961906515, csum_flags = 2906213133, fibnum = 45135, numa_domain = 18 '\022', rsstype = 165 '\245', {rcv_tstmp = 12294156060680234325, {
              l2hlen = 85 'U', l3hlen = 165 '\245', l4hlen = 15 '\017', l5hlen = 202 '\312', inner_l2hlen = 204 '\314', inner_l3hlen = 157 '\235', inner_l4hlen = 157 '\235', 
              inner_l5hlen = 170 '\252'}}, PH_per = {eight = "\365\333\365ɇ\231\337*", sixteen = {56309, 51701, 39303, 10975}, thirtytwo = {3388333045, 719296903}, sixtyfour = {
              3089356677887417333}, unintptr = {3089356677887417333}, ptr = 0x2adf9987c9f5dbf5}, PH_loc = {eight = "a\375\325)\366{\277\206", sixteen = {64865, 10709, 31734, 34495}, 
            thirtytwo = {701889889, 2260696054}, sixtyfour = {9709615618828139873}, unintptr = {9709615618828139873}, ptr = 0x86bf7bf629d5fd61}}, {m_epg_npgs = 217 '\331', 
          m_epg_nrdy = 57 '9', m_epg_hdrlen = 146 '\222', m_epg_trllen = 204 '\314', m_epg_1st_off = 20682, m_epg_last_len = 62202, m_epg_flags = 91 '[', 
          m_epg_record_type = 185 '\271', __spare = "\t\037", m_epg_enc_cnt = -857605685, m_epg_tls = 0xb08b1b53d33fa60f, m_epg_so = 0xa512b04fad394b0d, 
          m_epg_seqno = 12294156060680234325, m_epg_stailq = {stqe_next = 0x2adf9987c9f5dbf5}}}, {m_ext = {{ext_count = 0, ext_cnt = 0x0}, ext_size = 0, ext_type = 0, ext_flags = 0, {
            {ext_buf = 0x7ea3057a88b7afb5 <error: Cannot access memory at address 0x7ea3057a88b7afb5>, ext_arg2 = 0x0}, {extpg_pa = {9125143293820645301, 0, 0, 18446735289166059064, 
                60129542144}, 
              extpg_trail = "\024\000\000\000\000\000\000\000\a\000\000\000.\304\033;>]0\212\000\000\000\000\000\000\000\017\000\000\377\202\000\000\000\000\000\000\000\000\a@\000\000\377\377\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000@", extpg_hdr = "\000\020\000\000\003\001\000\000\000 9\233\002\370\377\377\000\000\000\000\000\000"}}, 
          ext_free = 0xa92482001880, ext_arg1 = 0x4214bea60a080101}, m_pktdat = 0xfffff802b067d5e8 ""}}, 
    m_dat = 0xfffff802b067d5b0 "\331\071\222\314\312P\372\362[\271\t\037\313\371\341\314\017\246?\323S\033\213\260\rK9\255O\260\022\245U\245\017\312̝\235\252\365\333\365ɇ\231\337*a\375\325)\366{\277\206"}}
(kgdb) print ((struct mbuf *)0xfffff802b067d590)[0].m_nextpkt[0]
Cannot access memory at address 0xd9d8353f106f5dcf
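
A quick way to see that this mbuf holds garbage rather than a valid chain: on amd64 with 48-bit virtual addresses, a usable pointer must have bits 47 through 63 all equal. The Python sketch below applies that test to pointer values copied from the dump above. The helper name is_canonical is made up for illustration, and the 48-bit width assumes 4-level paging.

```python
def is_canonical(addr, va_bits=48):
    """True if bits (va_bits - 1)..63 of a 64-bit address are all equal."""
    top = (addr & 0xFFFFFFFFFFFFFFFF) >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


# Values copied from the kgdb output in this comment.
pointers = {
    "mbuf address": 0xfffff802b067d590,
    "m_next": 0xd266469b94fc022a,
    "m_nextpkt": 0xd9d8353f106f5dcf,
    "m_data": 0x4bf25f3f2b29ab25,
}

for name, value in pointers.items():
    print(f"{name:13s} {value:#018x} canonical={is_canonical(value)}")
```

Only the mbuf address itself passes. The three member values could never be dereferenced, which matches the "Cannot access memory" errors kgdb reports.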
Comment 18 Sergey V. Dyatko 2021-03-16 07:37:06 UTC
(In reply to Mark Johnston from comment #10)
I already tried disabling LRO on r368448, without luck, and I can't get a dump on the host with netdumpd running.
Last Friday I tried net/intel-ix-kmod and got a panic again, again without a dump :(
Unfortunately, since December 2020 I still haven't learned how to reproduce this problem.
The next step I plan to test is adding 'nodevice iflib' and reinstalling the kernel.

BTW, the last revision known (to me) without this panic is head@r360004.
Hope this helps.
Comment 19 Marek Zarychta 2021-03-16 16:25:13 UTC
Created attachment 223325
backtrace 2

The kernel with patch 2 applied still finds a way to panic. Please compare the new backtrace.
Comment 20 Hans Petter Selasky freebsd_committer freebsd_triage 2021-03-16 17:04:13 UTC
Comment on attachment 223325
backtrace 2

Can you type the following in kgdb101 (please install GDB from ports or pkg):

frame 27
print *m

--HPS
Comment 21 Marek Zarychta 2021-03-16 17:31:27 UTC
(In reply to Hans Petter Selasky from comment #20)
I am happy to serve with this dump.

(kgdb) frame 27
#27 0xffffffff80cfbac9 in ether_input (ifp=<optimized out>, m=0xfffff8025a85aa00) at /usr/src/sys/net/if_ethersubr.c:827
827			netisr_dispatch(NETISR_ETHER, m);
(kgdb) print *m
$1 = {{m_next = 0x0, m_slist = {sle_next = 0x0}, m_stailq = {stqe_next = 0x0}}, {m_nextpkt = 0x0, m_slistpkt = {
      sle_next = 0x0}, m_stailqpkt = {stqe_next = 0x0}}, m_data = 0xfffff8025a85aa66 "E", m_len = 40, m_type = 1, 
  m_flags = 2, {{{m_pkthdr = {{snd_tag = 0xfffff80045091000, rcvif = 0xfffff80045091000}, tags = {slh_first = 0x0}, 
          len = 40, flowid = 3264704486, csum_flags = 251658240, fibnum = 0, numa_domain = 255 '\377', rsstype = 130 '\202', 
          {rcv_tstmp = 0, {l2hlen = 0 '\000', l3hlen = 0 '\000', l4hlen = 0 '\000', l5hlen = 0 '\000', 
              inner_l2hlen = 0 '\000', inner_l3hlen = 0 '\000', inner_l4hlen = 0 '\000', inner_l5hlen = 0 '\000'}}, 
          PH_per = {eight = "\002\000\000\000\377\377\000", sixteen = {2, 0, 65535, 0}, thirtytwo = {2, 65535}, sixtyfour = {
              281470681743362}, unintptr = {281470681743362}, ptr = 0xffff00000002}, PH_loc = {
            eight = "\000\000\000\000\000\000\000", sixteen = {0, 0, 0, 0}, thirtytwo = {0, 0}, sixtyfour = {0}, unintptr = {
              0}, ptr = 0x0}}, {m_epg_npgs = 0 '\000', m_epg_nrdy = 16 '\020', m_epg_hdrlen = 9 '\t', m_epg_trllen = 69 'E', 
          m_epg_1st_off = 63488, m_epg_last_len = 65535, m_epg_flags = 0 '\000', m_epg_record_type = 0 '\000', 
          __spare = "\000", m_epg_enc_cnt = 0, m_epg_tls = 0xc2976fe600000028, m_epg_so = 0x82ff00000f000000, 
          m_epg_seqno = 0, m_epg_stailq = {stqe_next = 0xffff00000002}}}, {m_ext = {{ext_count = 40574892, 
            ext_cnt = 0x1fac88a6026b1fac}, ext_size = 3634365035, ext_type = 8, ext_flags = 17664, {{
              ext_buf = 0x67a0040c27a2800 <error: Cannot access memory at address 0x67a0040c27a2800>, 
              ext_arg2 = 0xbc592715801f3f1a}, {extpg_pa = {466685789526960128, 13571921925355028250, 6125337632998583261, 
                1175541087717038345, 258}, 
              extpg_trail = "\232\265\000\000\001\001\b\n\220\324a\240\322\343z\325\000\000\367}\317\351\347\361\026^\271\217\270\000ӉX[G\002t\225pT\032\003g\201\327y%\200\240\022/u\363D|:\220\211\257\325\023,>\265", 
              extpg_hdr = "\275gp\205#\027\003\363Ō\216\212\v\261\000\000\273\233\272t߾\236"}}, 
          ext_free = 0xce60062281102bfe, ext_arg1 = 0x1facc45091565000}, 
        m_pktdat = 0xfffff8025a85aa58 "\254\037k\002\246\210\254\037k\002\240\330\b"}}, m_dat = 0xfffff8025a85aa20 ""}}
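
Unlike the mbuf in comment 17, this one looks sane, and a few of its fields can be cross-checked by hand. The Python sketch below is one hedged reading of the dump, assuming the frame is IPv4 over untagged Ethernet: it computes how far m_data sits past m_pktdat and decodes the first data byte ("E", 0x45) as an IPv4 version/IHL byte. The helper name describe is hypothetical.

```python
ETHER_HDR_LEN = 14  # bytes in an untagged Ethernet header

# Values copied from the kgdb output above.
m_pktdat = 0xfffff8025a85aa58   # start of the in-mbuf packet storage
m_data = 0xfffff8025a85aa66     # current start of packet data
first_byte = ord("E")           # kgdb shows m_data pointing at "E"
m_len = 40


def describe(m_pktdat, m_data, first_byte):
    """Return (data offset, IP version, IP header length in bytes)."""
    offset = m_data - m_pktdat
    version = first_byte >> 4
    header_len = (first_byte & 0x0F) * 4  # IHL counts 32-bit words
    return offset, version, header_len


offset, version, header_len = describe(m_pktdat, m_data, first_byte)
print(f"m_data is {offset} bytes past m_pktdat")
print(f"if IPv4: version={version}, header length={header_len} bytes")
print(f"bytes left after the IP header: {m_len - header_len}")
```

Under these assumptions m_data has been advanced past exactly one Ethernet header, and the 20 bytes that remain after the IP header would be just enough for a TCP header without options.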
Comment 22 Hans Petter Selasky freebsd_committer freebsd_triage 2021-03-16 18:02:10 UTC
Next, in frame 27, try:

print /x *(uint8_t [128] *)((struct mbuf *)m)->m_data

That will hopefully dump the packet.

--HPS
Comment 23 Marek Zarychta 2021-03-17 06:35:27 UTC
(In reply to Hans Petter Selasky from comment #22)
I would rather not disclose it here; I will send it in a message.
Comment 24 Richard Scheffenegger freebsd_committer freebsd_triage 2021-03-17 15:41:35 UTC
The tp->t_flags value in the backtrace mentioned earlier decodes to:
6130 03F4
        TF_NODELAY
       TF_SENTFIN
       TF_REQSCALE
       TF_RCVDSCALE
       TF_REQ_TSTMP
      TF_RCVD_TSTMP
      TF_SACK_PERMIT
  TF_FASTRECOVERY
  TF_WASFRECOVERY
 TF_TSO
TF_CONGRECOVERY
TF_WASCRECOVERY

Is net.inet.tcp.rfc6675_pipe enabled? If yes, see https://reviews.freebsd.org/D29315
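
The decoding above can be reproduced mechanically. The Python sketch below is only an illustration: the flag names are spelled as in this comment, and the bit values are assumed from sys/netinet/tcp_var.h on stable/13, so they should be checked against the tree in use. The table lists only the flags named here, not the full TF_* set, and the helper name decode is hypothetical.

```python
# Bit values assumed from sys/netinet/tcp_var.h (stable/13); verify before
# relying on them. Names are spelled as in the comment above.
TF_FLAGS = {
    0x00000004: "TF_NODELAY",
    0x00000010: "TF_SENTFIN",
    0x00000020: "TF_REQSCALE",
    0x00000040: "TF_RCVDSCALE",
    0x00000080: "TF_REQ_TSTMP",
    0x00000100: "TF_RCVD_TSTMP",
    0x00000200: "TF_SACK_PERMIT",
    0x00100000: "TF_FASTRECOVERY",
    0x00200000: "TF_WASFRECOVERY",
    0x01000000: "TF_TSO",
    0x20000000: "TF_CONGRECOVERY",
    0x40000000: "TF_WASCRECOVERY",
}


def decode(t_flags):
    """Return (names of known set flags, bits not covered by the table)."""
    names = []
    unknown = t_flags
    for bit, name in sorted(TF_FLAGS.items()):
        if t_flags & bit:
            names.append(name)
            unknown &= ~bit
    return names, unknown


names, unknown = decode(0x613003F4)
for name in names:
    print(name)
print(f"bits not in the table: {unknown:#x}")
```

With the value 0x613003F4 every set bit maps to one of the twelve names listed above. The combination of TF_SENTFIN with the recovery flags is what points at the FIN and rescue retransmission overlap addressed in D29315.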
Comment 25 Marek Zarychta 2021-03-17 15:58:34 UTC
(In reply to Richard Scheffenegger from comment #24)

Hans, Richard, thank you for the deep insight into this.

Yes, net.inet.tcp.rfc6675_pipe=1 has been set on this machine, probably since early FreeBSD 11. The patch from D29315 was applied and the system rebooted with the original sysctl settings (net.inet.tcp.sack.enable=1, net.inet.tcp.rfc6675_pipe=1).
Comment 26 Richard Scheffenegger freebsd_committer freebsd_triage 2021-03-17 16:05:40 UTC
I am not sure this really is due to rescue retransmissions.

The linked dump ( http://freebsd.1045724.x6.nabble.com/13-0-CURRENT-r368448-panic-td6443146.html ) is from Jan 20, but SACK rescue retransmission was only checked in to stable/13 on March 2 with 
https://reviews.freebsd.org/R10:15a7c88058d419e3347673ab891ae77ba28ae1bd.

So there might be more to this?
Comment 27 commit-hook freebsd_committer freebsd_triage 2021-03-17 16:44:15 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e9f029831fa5747ae1b405f5716c52cb4ebf1e04

commit e9f029831fa5747ae1b405f5716c52cb4ebf1e04
Author:     Richard Scheffenegger <rscheff@FreeBSD.org>
AuthorDate: 2021-03-17 15:44:29 +0000
Commit:     Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2021-03-17 16:12:04 +0000

    fix panic when rescue retransmission and FIN overlap

    PR:           254244
    PR:           254309
    Reviewed By:  #transport, hselasky, tuexen
    MFC after:    3 days
    Sponsored By: NetApp, Inc.
    Differential Revision: https://reviews.freebsd.org/D29315

 sys/netinet/tcp_sack.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
Comment 28 commit-hook freebsd_committer freebsd_triage 2021-03-17 19:33:57 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=703419774f86525a2441d615733993a6fddcd047

commit 703419774f86525a2441d615733993a6fddcd047
Author:     Richard Scheffenegger <rscheff@FreeBSD.org>
AuthorDate: 2021-03-17 15:44:29 +0000
Commit:     Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2021-03-17 19:05:33 +0000

    fix panic when rescue retransmission and FIN overlap

    PR:           254244
    PR:           254309
    Reviewed By:  #transport, hselasky, tuexen
    Approved by:  re (cperciva)
    MFC after:    immediately
    Sponsored By: NetApp, Inc.
    Differential Revision: https://reviews.freebsd.org/D29315

    (cherry picked from commit e9f029831fa5747ae1b405f5716c52cb4ebf1e04)

 sys/netinet/tcp_sack.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
Comment 29 Kubilay Kocak freebsd_committer freebsd_triage 2021-03-22 01:52:06 UTC
^Triage:

 * Assign to committer that resolves
 * Track stable/* merge(s)