254244 – panics after upgrade to stable/13-n244861-b9773574371

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	13.0-STABLE
Hardware:	amd64 Any

Importance:	--- Affects Some People
Assignee:	Richard Scheffenegger

URL:
Keywords:	crash, regression

Depends on:
Blocks:

Reported:	2021-03-12 20:03 UTC by Marek Zarychta
Modified:	2022-10-12 00:48 UTC
CC List:	8 users

See Also:	254309 254015

Flags:	koobs: mfc-stable13+ koobs: mfc-stable12- koobs: mfc-stable11-

Attachments
Some messages caught by syslog (11.58 KB, text/plain) 2021-03-12 20:03 UTC, Marek Zarychta	no flags
backtrace 1 (6.01 KB, text/plain) 2021-03-15 16:57 UTC, Marek Zarychta	no flags
Patch to try (497 bytes, patch) 2021-03-15 18:34 UTC, Hans Petter Selasky	no flags
Patch to try #2 (1.52 KB, patch) 2021-03-15 19:22 UTC, Hans Petter Selasky	no flags
backtrace 2 (5.47 KB, text/plain) 2021-03-16 16:25 UTC, Marek Zarychta	no flags

Description Marek Zarychta 2021-03-12 20:03:50 UTC

Created attachment 223220 [details]
Some messages caught by syslog

Yesterday I upgraded a bunch of machines to stable/13-n244861-b9773574371 and they have been panicking since. I cannot get core dumps at the moment; I will work on that after the weekend. Syslog caught what looks like a prelude to some of the panics:

Mar 12 13:33:44 opnr kernel: [6004] Fatal trap 12: page fault while in kernel mode
Mar 12 13:33:44 opnr kernel: [6004] cpuid = 3; apic id = 06
Mar 12 13:33:44 opnr kernel: [6004] fault virtual address       = 0x18
Mar 12 13:33:44 opnr kernel: [6004] fault code          = supervisor read data, page not present
Mar 12 13:33:44 opnr kernel: [6004] instruction pointer = 0x20:0xffffffff80c77c28
Mar 12 13:33:44 opnr kernel: [6004] stack pointer               = 0x28:0xfffffe00c723f340
Mar 12 13:33:44 opnr kernel: [6004] frame pointer               = 0x28:0xfffffe00c723f3b0
Mar 12 13:33:44 opnr kernel: [6004] code segment                = base rx0, limit 0xfffff, type 0x1b
Mar 12 13:33:44 opnr kernel: [6004]                     = DPL 0, pres 1, long 1, def32 0, gran 1
Mar 12 13:33:44 opnr kernel: [6004] processor eflags    = interrupt enabled, resume, IOPL = 0
Mar 12 13:33:44 opnr kernel: [6004] current process             = 0 (if_io_tqg_3)

More traces in the file attached.

Both affected machines have Atom C2758 CPUs. I had been testing 13.0-STABLE on them for a month with no issues until now. The last working build was stable/13-n244798-05083436a6e from Sun Mar 7.

Access to the machines is limited, swap is configured on ZFS, and netdumps are not possible because of the network configuration (vlan(4)s over LACP lagg(4)).

I will try to configure a USB drive as the dump device, if that turns out to be possible.

Comment 1 Hans Petter Selasky freebsd_committer 2021-03-13 11:54:15 UTC

Looks like a classic NULL pointer dereference. Do you have a backtrace of the panic?

--HPS
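
The reported fault virtual address of 0x18 fits that reading: reading a member through a NULL structure pointer faults at the member's offset. The sketch below shows the arithmetic only. struct example_obj and fault_address_for_target() are hypothetical names made up for illustration; the thread does not identify which kernel structure was involved, and the layout assumes an LP64 platform such as amd64.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical structure, NOT a real kernel type. Three 8-byte members
 * precede 'target', so on LP64 it lives at offset 0x18.
 */
struct example_obj {
	void		*a;		/* offset 0x00 */
	void		*b;		/* offset 0x08 */
	uint64_t	 c;		/* offset 0x10 */
	void		*target;	/* offset 0x18 */
};

/*
 * Address the CPU would access for base->target. With base == 0 the
 * result is just the member offset, which is the kind of small value
 * that appears as "fault virtual address" in a trap message.
 */
uintptr_t
fault_address_for_target(uintptr_t base)
{
	return (base + offsetof(struct example_obj, target));
}
```

Calling fault_address_for_target(0) yields 0x18 on amd64, so a small fault address points at "some member 0x18 bytes into a structure whose pointer was NULL" rather than at a wild pointer.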

Comment 2 Marek Zarychta 2021-03-13 15:27:04 UTC

(In reply to Hans Petter Selasky from comment #1)
Unfortunately, there is nothing more I can provide right now. I will test the setup and possibly enable the USB dump device to catch the panic when it happens again.

Comment 3 Sergey V. Dyatko 2021-03-14 17:01:41 UTC

http://freebsd.1045724.x6.nabble.com/13-0-CURRENT-r368448-panic-td6443146.html

Comment 4 Marek Zarychta 2021-03-15 16:57:43 UTC

Created attachment 223294 [details]
backtrace 1

Here's a fresh backtrace.

Comment 5 Hans Petter Selasky freebsd_committer 2021-03-15 18:33:52 UTC

Hi,

It looks to me like iflib's rxeof has produced an invalid chain of mbufs.

I only see one bug in there, and that is, if iflib_fixup_rx() gets a new mbuf header, the m_nextpkt field doesn't get zeroed.

If this is easy to reproduce, can you try the attached patch, meanwhile?

--HPS

Comment 6 Hans Petter Selasky freebsd_committer 2021-03-15 18:34:19 UTC

Created attachment 223301 [details]
Patch to try

Comment 7 Hans Petter Selasky freebsd_committer 2021-03-15 18:46:12 UTC

Looks like I'm wrong. m_init() clears m_nextpkt, must be something else.

Comment 8 Mark Johnston freebsd_committer 2021-03-15 19:02:56 UTC

There are a number of similar-looking reports on stable/13.  I wonder if some MFC was missed.

Comment 9 Marek Zarychta 2021-03-15 19:04:42 UTC

(In reply to Hans Petter Selasky from comment #7)

Too late, the patch was applied and the new kernel is installed. I will revert it then.

Could it be an LRO issue? I have LRO enabled here on two igb(4)s aggregated with LACP lagg(4), and this lagg(4) is the parent of some vlan(4) interfaces.

Comment 10 Mark Johnston freebsd_committer 2021-03-15 19:11:52 UTC

(In reply to Marek Zarychta from comment #9)
Yes, disabling LRO would be a reasonable first step.  I looked at the commits between the problematic revision and the known-good revision and nothing really stands out.
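
The two build identifiers quoted in this report bound the range of commits to inspect. As a rough sketch, and assuming the n-prefixed number in an identifier such as stable/13-n244861-b9773574371 is a per-branch commit counter (an assumption, not something stated in this thread), the size of that range can be computed as below. build_commit_count() is a hypothetical helper written for this illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract N from a string shaped like "stable/13-nN-HASH".
 * Returns -1 if no "-n<digits>" component is found at the first "-n".
 */
long
build_commit_count(const char *ident)
{
	const char *p = strstr(ident, "-n");
	long n;

	if (p == NULL || sscanf(p, "-n%ld", &n) != 1)
		return (-1);
	return (n);
}
```

Under that assumption, subtracting the counter of the last good build (n244798) from that of the first bad build (n244861) gives 63 commits to look through.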

Comment 11 Marek Zarychta 2021-03-15 19:20:26 UTC

I have been running 13.0-STABLE on this machine since 8 Feb 2021, upgrading once or twice a week without issues. The machine runs production workloads (firewall with PF + IPFW with dummynet, HTTP proxy, some VPN solutions, etc.) and had never panicked before. As I mentioned above, the last non-panicking build was stable/13-n244798-05083436a6e from Sun Mar 7.

Comment 12 Hans Petter Selasky freebsd_committer 2021-03-15 19:22:38 UTC

Created attachment 223304 [details]
Patch to try #2

Try this patch instead.

@markj: There are some things I don't understand. Why is tcp_lro_queue_mbuf() not used when tcp_lro_flush_all() is? Looks buggy to me.

Also, I see that the lro_possible flag may be stale, i.e. the value from the previous mbuf is used for the new mbuf.

--HPS
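
The stale-flag concern can be illustrated in isolation. The following is a generic sketch of that class of bug, not iflib's actual code: struct pkt, its is_tcp member, and both functions are hypothetical, and "is TCP" merely stands in for whatever condition makes a packet an LRO candidate.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a received packet; not the kernel's mbuf. */
struct pkt {
	bool is_tcp;	/* pretend only TCP packets are LRO candidates */
};

/*
 * Stale pattern: the flag is set but never cleared, so once one packet
 * qualifies, every later packet inherits 'true'.
 * Returns how many packets were treated as LRO candidates.
 */
size_t
count_lro_stale(const struct pkt *pkts, size_t n)
{
	bool lro_possible = false;
	size_t hits = 0;

	for (size_t i = 0; i < n; i++) {
		if (pkts[i].is_tcp)
			lro_possible = true;
		if (lro_possible)
			hits++;
	}
	return (hits);
}

/* Fresh pattern: the flag is recomputed for every packet. */
size_t
count_lro_fresh(const struct pkt *pkts, size_t n)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++) {
		bool lro_possible = pkts[i].is_tcp;

		if (lro_possible)
			hits++;
	}
	return (hits);
}
```

For a batch of one TCP packet followed by two non-TCP packets, the stale variant treats all three as candidates while the fresh variant treats only the first as one.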

Comment 13 Hans Petter Selasky freebsd_committer 2021-03-15 19:29:16 UTC

The last commit that touched this code path was:

commit 35e4e998d8187c1d4d413bdc13a79a6415a30a18
Author: Stephen Hurd <shurd@FreeBSD.org>
Date:   Mon Nov 6 16:23:21 2017 +0000

    Only chain non-LRO mbufs when LRO is not possible
    
    Preserve packet order between tcp_lro_rx() and if_input() to avoid
    creating extra corner cases. If no packets can be LROed, combine them
    into one chain for submission via if_input(). If any packet can
    potentially be LROed however, retain old behaviour and call if_input()
    for each packet.
    
    This should keep the 12% improvement for small packet forwarding intact,
    but mostly avoids impacting the LRO case.
    
    Reviewed by:    cem, sbruno
    Approved by:    sbruno (mentor)
    Sponsored by:   Limelight Networks
    Differential Revision:  https://reviews.freebsd.org/D12876

Notes:
    svn path=/head/; revision=325487

Comment 14 Hans Petter Selasky freebsd_committer 2021-03-15 19:38:32 UTC

Comment on attachment 223294 [details]
backtrace 1

In the debugger (kgdb prompt) you can try to dump this mbuf:

print ((struct mbuf *)0xfffff802b067d590)[0]

Then try to follow m_nextpkt:

print ((struct mbuf *)0xfffff802b067d590)[0].m_nextpkt[0]

And see where you end up.

--HPS
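
Following m_nextpkt by hand is the manual version of a bounded chain walk. A minimal sketch of such a walk is below. struct toy_mbuf is a simplified, hypothetical structure that keeps only the two link fields named in this thread; it is not the kernel's struct mbuf, and count_packets() is an illustrative helper.

```c
#include <stddef.h>

/* Simplified, hypothetical packet header with only the link fields. */
struct toy_mbuf {
	struct toy_mbuf *m_next;	/* next buffer of the same packet */
	struct toy_mbuf *m_nextpkt;	/* first buffer of the next packet */
};

/*
 * Count the packets on a chain by following m_nextpkt, giving up after
 * 'limit' hops so that a corrupted (for example circular) chain cannot
 * loop forever. Returns the count, or -1 if the limit was reached.
 */
int
count_packets(const struct toy_mbuf *m, int limit)
{
	int n = 0;

	while (m != NULL) {
		if (n == limit)
			return (-1);
		n++;
		m = m->m_nextpkt;
	}
	return (n);
}
```

A hop limit only guards against loops. A stale m_nextpkt that points at unmapped memory still faults on the next dereference.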

Comment 15 Marek Zarychta 2021-03-15 19:49:08 UTC

(In reply to Hans Petter Selasky from comment #12)
I am sorry, I can't test it. The patch doesn't apply to the latest stable/13 sources:
Rejected hunk #2.

Comment 16 Hans Petter Selasky freebsd_committer 2021-03-15 19:56:14 UTC

Did you revert the previous patch?

Works fine here:

cat iflib.diff | patch  -p1
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|diff --git a/sys/net/iflib.c b/sys/net/iflib.c
|index 05e99ba318d..c6a8ec9e25e 100644
|--- a/sys/net/iflib.c
|+++ b/sys/net/iflib.c
--------------------------
Patching file sys/net/iflib.c using Plan A...
Hunk #1 succeeded at 2888 (offset -13 lines).
Hunk #2 succeeded at 2975 (offset -13 lines).
done

commit 763fb2fda0144e3630de74b918d06a96b7968ee2 (HEAD -> stable/13, freebsd/stable/13)
Author: Mark Johnston <markj@FreeBSD.org>
Date:   Mon Mar 8 12:39:05 2021 -0500

    dumpon.8: Ask DDB to call doadump() rather than calling it directly
    
    Sponsored by:   The FreeBSD Foundation
    
    (cherry picked from commit af06ff55535d9b2de253103e974558104e0a3d97)

Comment 17 Marek Zarychta 2021-03-15 20:55:15 UTC

(In reply to Hans Petter Selasky from comment #14)
Sorry for the confusion over the patch; I had downloaded the wrong one (the diff against the previous diff).
The patch has now been applied and the system rebooted. To match the initial conditions, LRO is still enabled.

I am getting this:

(kgdb) print ((struct mbuf *)0xfffff802b067d590)[0]
$1 = {{m_next = 0xd266469b94fc022a, m_slist = {sle_next = 0xd266469b94fc022a}, m_stailq = {stqe_next = 0xd266469b94fc022a}}, {m_nextpkt = 0xd9d8353f106f5dcf, m_slistpkt = {
      sle_next = 0xd9d8353f106f5dcf}, m_stailqpkt = {stqe_next = 0xd9d8353f106f5dcf}}, m_data = 0x4bf25f3f2b29ab25 <error: Cannot access memory at address 0x4bf25f3f2b29ab25>, 
  m_len = -75255363, m_type = 63, m_flags = 7210736, {{{m_pkthdr = {{snd_tag = 0xf2fa50cacc9239d9, rcvif = 0xf2fa50cacc9239d9}, tags = {slh_first = 0xcce1f9cb1f09b95b}, 
          len = -750803441, flowid = 2961906515, csum_flags = 2906213133, fibnum = 45135, numa_domain = 18 '\022', rsstype = 165 '\245', {rcv_tstmp = 12294156060680234325, {
              l2hlen = 85 'U', l3hlen = 165 '\245', l4hlen = 15 '\017', l5hlen = 202 '\312', inner_l2hlen = 204 '\314', inner_l3hlen = 157 '\235', inner_l4hlen = 157 '\235', 
              inner_l5hlen = 170 '\252'}}, PH_per = {eight = "\365\333\365ɇ\231\337*", sixteen = {56309, 51701, 39303, 10975}, thirtytwo = {3388333045, 719296903}, sixtyfour = {
              3089356677887417333}, unintptr = {3089356677887417333}, ptr = 0x2adf9987c9f5dbf5}, PH_loc = {eight = "a\375\325)\366{\277\206", sixteen = {64865, 10709, 31734, 34495}, 
            thirtytwo = {701889889, 2260696054}, sixtyfour = {9709615618828139873}, unintptr = {9709615618828139873}, ptr = 0x86bf7bf629d5fd61}}, {m_epg_npgs = 217 '\331', 
          m_epg_nrdy = 57 '9', m_epg_hdrlen = 146 '\222', m_epg_trllen = 204 '\314', m_epg_1st_off = 20682, m_epg_last_len = 62202, m_epg_flags = 91 '[', 
          m_epg_record_type = 185 '\271', __spare = "\t\037", m_epg_enc_cnt = -857605685, m_epg_tls = 0xb08b1b53d33fa60f, m_epg_so = 0xa512b04fad394b0d, 
          m_epg_seqno = 12294156060680234325, m_epg_stailq = {stqe_next = 0x2adf9987c9f5dbf5}}}, {m_ext = {{ext_count = 0, ext_cnt = 0x0}, ext_size = 0, ext_type = 0, ext_flags = 0, {
            {ext_buf = 0x7ea3057a88b7afb5 <error: Cannot access memory at address 0x7ea3057a88b7afb5>, ext_arg2 = 0x0}, {extpg_pa = {9125143293820645301, 0, 0, 18446735289166059064, 
                60129542144}, 
              extpg_trail = "\024\000\000\000\000\000\000\000\a\000\000\000.\304\033;>]0\212\000\000\000\000\000\000\000\017\000\000\377\202\000\000\000\000\000\000\000\000\a@\000\000\377\377\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000@", extpg_hdr = "\000\020\000\000\003\001\000\000\000 9\233\002\370\377\377\000\000\000\000\000\000"}}, 
          ext_free = 0xa92482001880, ext_arg1 = 0x4214bea60a080101}, m_pktdat = 0xfffff802b067d5e8 ""}}, 
    m_dat = 0xfffff802b067d5b0 "\331\071\222\314\312P\372\362[\271\t\037\313\371\341\314\017\246?\323S\033\213\260\rK9\255O\260\022\245U\245\017\312̝\235\252\365\333\365ɇ\231\337*a\375\325)\366{\277\206"}}
(kgdb) print ((struct mbuf *)0xfffff802b067d590)[0].m_nextpkt[0]
Cannot access memory at address 0xd9d8353f106f5dcf
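
The pointer fields in this dump (m_next, m_nextpkt, m_data) do not look like kernel addresses at all, which suggests the memory at that address no longer held a valid mbuf by the time the dump was taken. One quick check is whether a value is even a canonical amd64 address. The sketch below assumes 48-bit virtual addressing (4-level paging); is_canonical_48() is a hypothetical helper, and a canonical result does not by itself mean the address is mapped.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * With 48-bit virtual addresses, bits 63..47 must be all zeros or all
 * ones. Anything else cannot be a valid pointer on such a system.
 */
bool
is_canonical_48(uint64_t va)
{
	uint64_t top = va >> 47;	/* the upper 17 bits */

	return (top == 0 || top == 0x1ffff);
}
```

Under that assumption, the mbuf address 0xfffff802b067d590 is canonical, while the m_nextpkt value 0xd9d8353f106f5dcf is not, which matches kgdb's refusal to read it.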

Comment 18 Sergey V. Dyatko 2021-03-16 07:37:06 UTC

(In reply to Mark Johnston from comment #10)
I've already tried disabling LRO on r368448, without luck, and I can't get a dump on the host with netdumpd running.
Last Friday I tried net/intel-ix-kmod and got a panic again, and again without a dump :(
Unfortunately, since December 2020 I still haven't found a way to reproduce this problem.
The next step I plan to test is adding 'nodevice iflib' and reinstalling the kernel.

BTW, the last revision known to me without this panic is head@r360004.
Hope this helps.

Comment 19 Marek Zarychta 2021-03-16 16:25:13 UTC

Created attachment 223325 [details]
backtrace 2

The kernel with patch 2 applied still finds a way to panic. Please compare the new backtrace.

Comment 20 Hans Petter Selasky freebsd_committer 2021-03-16 17:04:13 UTC

Comment on attachment 223325 [details]
backtrace 2

Can you type the following in kgdb101 (please install GDB from ports or pkg):

frame 27
print *m

--HPS

Comment 21 Marek Zarychta 2021-03-16 17:31:27 UTC

(In reply to Hans Petter Selasky from comment #20)
Happy to provide this dump.

(kgdb) frame 27
#27 0xffffffff80cfbac9 in ether_input (ifp=<optimized out>, m=0xfffff8025a85aa00) at /usr/src/sys/net/if_ethersubr.c:827
827			netisr_dispatch(NETISR_ETHER, m);
(kgdb) print *m
$1 = {{m_next = 0x0, m_slist = {sle_next = 0x0}, m_stailq = {stqe_next = 0x0}}, {m_nextpkt = 0x0, m_slistpkt = {
      sle_next = 0x0}, m_stailqpkt = {stqe_next = 0x0}}, m_data = 0xfffff8025a85aa66 "E", m_len = 40, m_type = 1, 
  m_flags = 2, {{{m_pkthdr = {{snd_tag = 0xfffff80045091000, rcvif = 0xfffff80045091000}, tags = {slh_first = 0x0}, 
          len = 40, flowid = 3264704486, csum_flags = 251658240, fibnum = 0, numa_domain = 255 '\377', rsstype = 130 '\202', 
          {rcv_tstmp = 0, {l2hlen = 0 '\000', l3hlen = 0 '\000', l4hlen = 0 '\000', l5hlen = 0 '\000', 
              inner_l2hlen = 0 '\000', inner_l3hlen = 0 '\000', inner_l4hlen = 0 '\000', inner_l5hlen = 0 '\000'}}, 
          PH_per = {eight = "\002\000\000\000\377\377\000", sixteen = {2, 0, 65535, 0}, thirtytwo = {2, 65535}, sixtyfour = {
              281470681743362}, unintptr = {281470681743362}, ptr = 0xffff00000002}, PH_loc = {
            eight = "\000\000\000\000\000\000\000", sixteen = {0, 0, 0, 0}, thirtytwo = {0, 0}, sixtyfour = {0}, unintptr = {
              0}, ptr = 0x0}}, {m_epg_npgs = 0 '\000', m_epg_nrdy = 16 '\020', m_epg_hdrlen = 9 '\t', m_epg_trllen = 69 'E', 
          m_epg_1st_off = 63488, m_epg_last_len = 65535, m_epg_flags = 0 '\000', m_epg_record_type = 0 '\000', 
          __spare = "\000", m_epg_enc_cnt = 0, m_epg_tls = 0xc2976fe600000028, m_epg_so = 0x82ff00000f000000, 
          m_epg_seqno = 0, m_epg_stailq = {stqe_next = 0xffff00000002}}}, {m_ext = {{ext_count = 40574892, 
            ext_cnt = 0x1fac88a6026b1fac}, ext_size = 3634365035, ext_type = 8, ext_flags = 17664, {{
              ext_buf = 0x67a0040c27a2800 <error: Cannot access memory at address 0x67a0040c27a2800>, 
              ext_arg2 = 0xbc592715801f3f1a}, {extpg_pa = {466685789526960128, 13571921925355028250, 6125337632998583261, 
                1175541087717038345, 258}, 
              extpg_trail = "\232\265\000\000\001\001\b\n\220\324a\240\322\343z\325\000\000\367}\317\351\347\361\026^\271\217\270\000ӉX[G\002t\225pT\032\003g\201\327y%\200\240\022/u\363D|:\220\211\257\325\023,>\265", 
              extpg_hdr = "\275gp\205#\027\003\363Ō\216\212\v\261\000\000\273\233\272t߾\236"}}, 
          ext_free = 0xce60062281102bfe, ext_arg1 = 0x1facc45091565000}, 
        m_pktdat = 0xfffff8025a85aa58 "\254\037k\002\246\210\254\037k\002\240\330\b"}}, m_dat = 0xfffff8025a85aa20 ""}}

Comment 22 Hans Petter Selasky freebsd_committer 2021-03-16 18:02:10 UTC

Next, in frame 27, try:

print /x *(uint8_t [128] *)((struct mbuf *)m)->m_data

That will hopefully dump the packet.

--HPS
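
The kgdb expression above prints 128 bytes starting at m_data as hex. For anyone post-processing a saved buffer outside the debugger, an equivalent helper might look like the sketch below. hex_bytes() is a hypothetical function written for this illustration. Note that in the mbuf printed in comment 21 m_len is 40, so bytes beyond the first 40 are not necessarily part of that packet.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format up to 'len' bytes from 'data' as space-separated hex pairs in
 * 'out'. Stops early rather than overflow 'out'.
 * Returns the number of bytes formatted.
 */
size_t
hex_bytes(const uint8_t *data, size_t len, char *out, size_t outsz)
{
	size_t i, off = 0;

	if (outsz == 0)
		return (0);
	out[0] = '\0';
	for (i = 0; i < len; i++) {
		/* Worst case per byte: separator + 2 digits + NUL. */
		if (off + 4 > outsz)
			break;
		off += (size_t)snprintf(out + off, outsz - off, "%s%02x",
		    i > 0 ? " " : "", (unsigned int)data[i]);
	}
	return (i);
}
```

Fed the four made-up sample bytes 0x45, 0x00, 0x00, 0x28 (chosen to resemble the start of an IPv4 header, not taken from the reporter's packet), it produces the string "45 00 00 28".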

Comment 23 Marek Zarychta 2021-03-17 06:35:27 UTC

(In reply to Hans Petter Selasky from comment #22)
I would rather not disclose it here; I will send it in a message.

Comment 24 Richard Scheffenegger freebsd_committer 2021-03-17 15:41:35 UTC

The tp->t_flags value in the backtrace mentioned earlier decodes to
6130 03F4
        TF_NODELAY
       TF_SENTFIN
       TF_REQ_SCALE
       TF_RCVD_SCALE
       TF_REQ_TSTMP
      TF_RCVD_TSTMP
      TF_SACK_PERMIT
  TF_FASTRECOVERY
  TF_WASFRECOVERY
 TF_TSO
TF_CONGRECOVERY
TF_WASCRECOVERY

Is net.inet.tcp.rfc6675_pipe enabled? If yes, see https://reviews.freebsd.org/D29315
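
The decoding of 0x613003F4 above can be reproduced mechanically. The sketch below is table-driven and lists only the twelve flags named in this comment. The bit values and the spellings TF_REQ_SCALE and TF_RCVD_SCALE are recalled from sys/netinet/tcp_var.h of that era and are assumptions to be checked against the source tree in use; decode_t_flags() itself is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

struct tf_name {
	uint32_t	 bit;
	const char	*name;
};

/* Assumed values; verify against sys/netinet/tcp_var.h before relying on them. */
static const struct tf_name tf_names[] = {
	{ 0x00000004, "TF_NODELAY" },
	{ 0x00000010, "TF_SENTFIN" },
	{ 0x00000020, "TF_REQ_SCALE" },
	{ 0x00000040, "TF_RCVD_SCALE" },
	{ 0x00000080, "TF_REQ_TSTMP" },
	{ 0x00000100, "TF_RCVD_TSTMP" },
	{ 0x00000200, "TF_SACK_PERMIT" },
	{ 0x00100000, "TF_FASTRECOVERY" },
	{ 0x00200000, "TF_WASFRECOVERY" },
	{ 0x01000000, "TF_TSO" },
	{ 0x20000000, "TF_CONGRECOVERY" },
	{ 0x40000000, "TF_WASCRECOVERY" },
};

/*
 * Store the names of the set flags in 'out' (at most 'max' entries) and
 * return how many were stored. Bits that are not in the table come back
 * through 'unknown'.
 */
size_t
decode_t_flags(uint32_t flags, const char **out, size_t max,
    uint32_t *unknown)
{
	size_t i, n = 0;

	for (i = 0; i < sizeof(tf_names) / sizeof(tf_names[0]); i++) {
		if ((flags & tf_names[i].bit) == 0)
			continue;
		if (n < max)
			out[n++] = tf_names[i].name;
		flags &= ~tf_names[i].bit;
	}
	*unknown = flags;
	return (n);
}
```

With the table as given, 0x613003F4 decodes to exactly these twelve names with no bits left over. The combination of TF_SENTFIN with the recovery flags is what makes the question about rfc6675_pipe relevant.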

Comment 25 Marek Zarychta 2021-03-17 15:58:34 UTC

(In reply to Richard Scheffenegger from comment #24)

Hans, Richard, thank you for the deep insight into this.

Yes, net.inet.tcp.rfc6675_pipe=1 has been set on this machine, probably since early FreeBSD 11. The patch from D29315 was applied and the system rebooted with the original sysctl settings (net.inet.tcp.sack.enable=1, net.inet.tcp.rfc6675_pipe=1).

Comment 26 Richard Scheffenegger freebsd_committer 2021-03-17 16:05:40 UTC

I am not so sure this is really due to rescue retransmissions.

The linked dump ( http://freebsd.1045724.x6.nabble.com/13-0-CURRENT-r368448-panic-td6443146.html ) is from Jan 20, but SACK rescue retransmission was only checked in to stable/13 on March 2 with 
https://reviews.freebsd.org/R10:15a7c88058d419e3347673ab891ae77ba28ae1bd.

So there might be more to this here?

Comment 27 commit-hook freebsd_committer 2021-03-17 16:44:15 UTC

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e9f029831fa5747ae1b405f5716c52cb4ebf1e04

commit e9f029831fa5747ae1b405f5716c52cb4ebf1e04
Author:     Richard Scheffenegger <rscheff@FreeBSD.org>
AuthorDate: 2021-03-17 15:44:29 +0000
Commit:     Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2021-03-17 16:12:04 +0000

    fix panic when rescue retransmission and FIN overlap

    PR:           254244
    PR:           254309
    Reviewed By:  #transport, hselasky, tuexen
    MFC after:    3 days
    Sponsored By: NetApp, Inc.
    Differential Revision: https://reviews.freebsd.org/D29315

 sys/netinet/tcp_sack.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Comment 28 commit-hook freebsd_committer 2021-03-17 19:33:57 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=703419774f86525a2441d615733993a6fddcd047

commit 703419774f86525a2441d615733993a6fddcd047
Author:     Richard Scheffenegger <rscheff@FreeBSD.org>
AuthorDate: 2021-03-17 15:44:29 +0000
Commit:     Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2021-03-17 19:05:33 +0000

    fix panic when rescue retransmission and FIN overlap

    PR:           254244
    PR:           254309
    Reviewed By:  #transport, hselasky, tuexen
    Approved by:  re (cperciva)
    MFC after:    immediately
    Sponsored By: NetApp, Inc.
    Differential Revision: https://reviews.freebsd.org/D29315

    (cherry picked from commit e9f029831fa5747ae1b405f5716c52cb4ebf1e04)

 sys/netinet/tcp_sack.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Comment 29 Kubilay Kocak freebsd_committer 2021-03-22 01:52:06 UTC

^Triage:

 * Assign to committer that resolves
 * Track stable/* merge(s)