Created attachment 223220 [details]
Some messages caught by syslog

Yesterday I upgraded a bunch of machines to stable/13-n244861-b9773574371 and they have been panicking since. I cannot get core dumps at the moment; I will work on that after the weekend. Syslog caught what looks like a prelude to some of the panics:

Mar 12 13:33:44 opnr kernel: [6004] Fatal trap 12: page fault while in kernel mode
Mar 12 13:33:44 opnr kernel: [6004] cpuid = 3; apic id = 06
Mar 12 13:33:44 opnr kernel: [6004] fault virtual address = 0x18
Mar 12 13:33:44 opnr kernel: [6004] fault code = supervisor read data, page not present
Mar 12 13:33:44 opnr kernel: [6004] instruction pointer = 0x20:0xffffffff80c77c28
Mar 12 13:33:44 opnr kernel: [6004] stack pointer = 0x28:0xfffffe00c723f340
Mar 12 13:33:44 opnr kernel: [6004] frame pointer = 0x28:0xfffffe00c723f3b0
Mar 12 13:33:44 opnr kernel: [6004] code segment = base rx0, limit 0xfffff, type 0x1b
Mar 12 13:33:44 opnr kernel: [6004] = DPL 0, pres 1, long 1, def32 0, gran 1
Mar 12 13:33:44 opnr kernel: [6004] processor eflags = interrupt enabled, resume, IOPL = 0
Mar 12 13:33:44 opnr kernel: [6004] current process = 0 (if_io_tqg_3)

More traces are in the attached file. Both affected machines have an Atom C2758 CPU. I had been testing 13.0-STABLE on them for a month with no issues. The last working build was stable/13-n244798-05083436a6e from Sun Mar 7. Access to the machines is limited, swap is on ZFS, and netdump is not possible because of the network configuration (vlan(4) interfaces over an LACP lagg(4)). I will try to configure a USB drive as the crash dump device, if that turns out to be possible.
Looks like a classic NULL pointer. Do you have the backtrace of the panic? --HPS
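To make the "NULL pointer" reading concrete: a fault address as small as 0x18 usually means the kernel read a field at offset 0x18 through a base pointer that was NULL. The sketch below models this with Python's ctypes. MbufHead is a hypothetical stand-in laid out like the first fields of struct mbuf on amd64 (three 8-byte pointers followed by a 32-bit length). Both the layout and the link to this particular panic are assumptions; the syslog excerpt alone does not prove which structure was involved.

```python
import ctypes


class MbufHead(ctypes.Structure):
    """Hypothetical model of the first fields of a 64-bit struct mbuf.

    Fixed-width integers stand in for amd64 pointers so that the
    offsets do not depend on the machine running this script.
    """
    _fields_ = [
        ("m_next", ctypes.c_uint64),     # offset 0x00
        ("m_nextpkt", ctypes.c_uint64),  # offset 0x08
        ("m_data", ctypes.c_uint64),     # offset 0x10
        ("m_len", ctypes.c_int32),       # offset 0x18
    ]


def fault_address(base, field):
    """Address touched when reading `field` through pointer `base`."""
    return base + getattr(MbufHead, field).offset


if __name__ == "__main__":
    # Reading m_len through a NULL pointer touches address 0x18.
    print(hex(fault_address(0, "m_len")))
```

Under these assumptions, a read of m_len through a NULL mbuf pointer would fault at exactly the address reported, which is one plausible reading of the trap and not a diagnosis.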
(In reply to Hans Petter Selasky from comment #1) Unfortunately, there is nothing more I can provide right now. I will test the setup and eventually enable the USB dump device to catch the panic when it happens again.
http://freebsd.1045724.x6.nabble.com/13-0-CURRENT-r368448-panic-td6443146.html
Created attachment 223294 [details] backtrace 1 Here's fresh backtrace
Hi, it looks to me like iflib's rxeof has produced an invalid chain of mbufs. I only see one bug in there: if iflib_fixup_rx() gets a new mbuf header, the m_nextpkt field doesn't get zeroed. If this is easy to reproduce, can you try the attached patch in the meantime? --HPS
Created attachment 223301 [details] Patch to try
Looks like I'm wrong. m_init() clears m_nextpkt, so it must be something else.
There's a number of similar-looking reports on stable/13. I wonder if some MFC was missed.
(In reply to Hans Petter Selasky from comment #7) Too late, the patch was already applied and the new kernel is installed. I will revert it then. Could it be an LRO issue? I have LRO enabled on two igb(4) interfaces aggregated with an LACP lagg(4), and this lagg(4) is the parent of some vlan(4) interfaces.
(In reply to Marek Zarychta from comment #9) Yes, disabling LRO would be a reasonable first step. I looked at the commits between the problematic revision and the known-good revision and nothing really stands out.
I have been running 13.0-STABLE on this machine since 8 Feb 2021, upgrading once or twice a week without issues. The machine was running production workloads (firewall with PF + IPFW and dummynet, HTTP proxy, some VPN solutions, etc.) and never panicked. As I mentioned above, the last non-panicking build was stable/13-n244798-05083436a6e from Sun Mar 7.
Created attachment 223304 [details]
Patch to try #2

Try this patch instead. @markj: There are some things I don't understand. Why is tcp_lro_queue_mbuf() not used when tcp_lro_flush_all() is? Looks buggy to me. I also see that the lro_possible flag may be stale, i.e. the value from the previous mbuf is used for the new mbuf. --HPS
The last commit that touched this code path was:

commit 35e4e998d8187c1d4d413bdc13a79a6415a30a18
Author: Stephen Hurd <shurd@FreeBSD.org>
Date: Mon Nov 6 16:23:21 2017 +0000

Only chain non-LRO mbufs when LRO is not possible

Preserve packet order between tcp_lro_rx() and if_input() to avoid creating extra corner cases.

If no packets can be LROed, combine them into one chain for submission via if_input(). If any packet can potentially be LROed however, retain old behaviour and call if_input() for each packet.

This should keep the 12% improvement for small packet forwarding intact, but mostly avoids impacting the LRO case.

Reviewed by: cem, sbruno
Approved by: sbruno (mentor)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D12876

Notes: svn path=/head/; revision=325487
Comment on attachment 223294 [details] backtrace 1

In the debugger (at the kgdb prompt) you can try to dump this mbuf:
print ((struct mbuf *)0xfffff802b067d590)[0]
Then try to follow m_nextpkt:
print ((struct mbuf *)0xfffff802b067d590)[0].m_nextpkt[0]
And see where you end up.
--HPS
(In reply to Hans Petter Selasky from comment #12) I am sorry, I can't test it. The patch doesn't apply to the latest stable/13 sources: hunk #2 is rejected.
Did you revert the previous patch? Works fine here:

cat iflib.diff | patch -p1
Hmm... Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|diff --git a/sys/net/iflib.c b/sys/net/iflib.c
|index 05e99ba318d..c6a8ec9e25e 100644
|--- a/sys/net/iflib.c
|+++ b/sys/net/iflib.c
--------------------------
Patching file sys/net/iflib.c using Plan A...
Hunk #1 succeeded at 2888 (offset -13 lines).
Hunk #2 succeeded at 2975 (offset -13 lines).
done

commit 763fb2fda0144e3630de74b918d06a96b7968ee2 (HEAD -> stable/13, freebsd/stable/13)
Author: Mark Johnston <markj@FreeBSD.org>
Date: Mon Mar 8 12:39:05 2021 -0500

dumpon.8: Ask DDB to call doadump() rather than calling it directly

Sponsored by: The FreeBSD Foundation
(cherry picked from commit af06ff55535d9b2de253103e974558104e0a3d97)
(In reply to Hans Petter Selasky from comment #14) I am sorry for borking this patch, downloaded the wrong patch (diff to the previous diff). The patch was applied and the system rebooted. To meet initial conditions LRO is still enabled. I am getting this: (kgdb) print ((struct mbuf *)0xfffff802b067d590)[0] $1 = {{m_next = 0xd266469b94fc022a, m_slist = {sle_next = 0xd266469b94fc022a}, m_stailq = {stqe_next = 0xd266469b94fc022a}}, {m_nextpkt = 0xd9d8353f106f5dcf, m_slistpkt = { sle_next = 0xd9d8353f106f5dcf}, m_stailqpkt = {stqe_next = 0xd9d8353f106f5dcf}}, m_data = 0x4bf25f3f2b29ab25 <error: Cannot access memory at address 0x4bf25f3f2b29ab25>, m_len = -75255363, m_type = 63, m_flags = 7210736, {{{m_pkthdr = {{snd_tag = 0xf2fa50cacc9239d9, rcvif = 0xf2fa50cacc9239d9}, tags = {slh_first = 0xcce1f9cb1f09b95b}, len = -750803441, flowid = 2961906515, csum_flags = 2906213133, fibnum = 45135, numa_domain = 18 '\022', rsstype = 165 '\245', {rcv_tstmp = 12294156060680234325, { l2hlen = 85 'U', l3hlen = 165 '\245', l4hlen = 15 '\017', l5hlen = 202 '\312', inner_l2hlen = 204 '\314', inner_l3hlen = 157 '\235', inner_l4hlen = 157 '\235', inner_l5hlen = 170 '\252'}}, PH_per = {eight = "\365\333\365ɇ\231\337*", sixteen = {56309, 51701, 39303, 10975}, thirtytwo = {3388333045, 719296903}, sixtyfour = { 3089356677887417333}, unintptr = {3089356677887417333}, ptr = 0x2adf9987c9f5dbf5}, PH_loc = {eight = "a\375\325)\366{\277\206", sixteen = {64865, 10709, 31734, 34495}, thirtytwo = {701889889, 2260696054}, sixtyfour = {9709615618828139873}, unintptr = {9709615618828139873}, ptr = 0x86bf7bf629d5fd61}}, {m_epg_npgs = 217 '\331', m_epg_nrdy = 57 '9', m_epg_hdrlen = 146 '\222', m_epg_trllen = 204 '\314', m_epg_1st_off = 20682, m_epg_last_len = 62202, m_epg_flags = 91 '[', m_epg_record_type = 185 '\271', __spare = "\t\037", m_epg_enc_cnt = -857605685, m_epg_tls = 0xb08b1b53d33fa60f, m_epg_so = 0xa512b04fad394b0d, m_epg_seqno = 12294156060680234325, m_epg_stailq = {stqe_next = 
0x2adf9987c9f5dbf5}}}, {m_ext = {{ext_count = 0, ext_cnt = 0x0}, ext_size = 0, ext_type = 0, ext_flags = 0, { {ext_buf = 0x7ea3057a88b7afb5 <error: Cannot access memory at address 0x7ea3057a88b7afb5>, ext_arg2 = 0x0}, {extpg_pa = {9125143293820645301, 0, 0, 18446735289166059064, 60129542144}, extpg_trail = "\024\000\000\000\000\000\000\000\a\000\000\000.\304\033;>]0\212\000\000\000\000\000\000\000\017\000\000\377\202\000\000\000\000\000\000\000\000\a@\000\000\377\377\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000@", extpg_hdr = "\000\020\000\000\003\001\000\000\000 9\233\002\370\377\377\000\000\000\000\000\000"}}, ext_free = 0xa92482001880, ext_arg1 = 0x4214bea60a080101}, m_pktdat = 0xfffff802b067d5e8 ""}}, m_dat = 0xfffff802b067d5b0 "\331\071\222\314\312P\372\362[\271\t\037\313\371\341\314\017\246?\323S\033\213\260\rK9\255O\260\022\245U\245\017\312̝\235\252\365\333\365ɇ\231\337*a\375\325)\366{\277\206"}} (kgdb) print ((struct mbuf *)0xfffff802b067d590)[0].m_nextpkt[0] Cannot access memory at address 0xd9d8353f106f5dcf
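One quick way to read a dump like the one above is to check whether the pointer fields could be valid at all. The following is a minimal sketch, assuming amd64 with 48-bit virtual addresses, where a canonical address has bits 63..47 all equal. The helper names are made up for illustration, and treating "upper half" as "kernel" is a simplification. Applied to the values printed above, the mbuf address itself is canonical, while m_next and m_nextpkt are not, which is consistent with the structure having been overwritten or freed.

```python
def is_canonical_amd64(addr, va_bits=48):
    """True if bit (va_bits - 1) and every bit above it are identical.

    va_bits=48 is an assumption (4-level paging); CPUs with 5-level
    paging use 57 bits instead.
    """
    top = addr >> (va_bits - 1)               # sign bit plus all bits above it
    all_ones = (1 << (64 - va_bits + 1)) - 1  # that many one-bits
    return top == 0 or top == all_ones


def looks_like_kernel_pointer(addr):
    """Simplified check: canonical and in the upper half of the address space."""
    return (addr >> 63) == 1 and is_canonical_amd64(addr)


if __name__ == "__main__":
    # Values copied from the kgdb output above.
    for name, value in [
        ("mbuf address", 0xFFFFF802B067D590),
        ("m_next", 0xD266469B94FC022A),
        ("m_nextpkt", 0xD9D8353F106F5DCF),
    ]:
        print(name, hex(value), looks_like_kernel_pointer(value))
```

A non-canonical pointer can never be dereferenced, which matches the "Cannot access memory" error kgdb reports when asked to follow m_nextpkt.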
(In reply to Mark Johnston from comment #10) I've already tried disabling LRO on r368448, without luck, and I can't get a dump on the host with netdumpd running. Last Friday I tried net/intel-ix-kmod and got a panic again, and again without a dump :( Unfortunately, since December 2020 I still haven't learned how to reproduce this problem. The next step I plan to test is adding 'nodevice iflib' and re-installing the kernel. BTW, the last revision known to me without this panic is head@r360004. Hope this helps.
Created attachment 223325 [details] backtrace 2 The kernel with patch 2 applied still finds a way to panic. Please compare the new backtrace.
Comment on attachment 223325 [details] backtrace 2

Can you type the following in kgdb101 (please install GDB from ports or pkg):
frame 27
print *m
--HPS
(In reply to Hans Petter Selasky from comment #20) I am happy to serve with this dump. (kgdb) frame 27 #27 0xffffffff80cfbac9 in ether_input (ifp=<optimized out>, m=0xfffff8025a85aa00) at /usr/src/sys/net/if_ethersubr.c:827 827 netisr_dispatch(NETISR_ETHER, m); (kgdb) print *m $1 = {{m_next = 0x0, m_slist = {sle_next = 0x0}, m_stailq = {stqe_next = 0x0}}, {m_nextpkt = 0x0, m_slistpkt = { sle_next = 0x0}, m_stailqpkt = {stqe_next = 0x0}}, m_data = 0xfffff8025a85aa66 "E", m_len = 40, m_type = 1, m_flags = 2, {{{m_pkthdr = {{snd_tag = 0xfffff80045091000, rcvif = 0xfffff80045091000}, tags = {slh_first = 0x0}, len = 40, flowid = 3264704486, csum_flags = 251658240, fibnum = 0, numa_domain = 255 '\377', rsstype = 130 '\202', {rcv_tstmp = 0, {l2hlen = 0 '\000', l3hlen = 0 '\000', l4hlen = 0 '\000', l5hlen = 0 '\000', inner_l2hlen = 0 '\000', inner_l3hlen = 0 '\000', inner_l4hlen = 0 '\000', inner_l5hlen = 0 '\000'}}, PH_per = {eight = "\002\000\000\000\377\377\000", sixteen = {2, 0, 65535, 0}, thirtytwo = {2, 65535}, sixtyfour = { 281470681743362}, unintptr = {281470681743362}, ptr = 0xffff00000002}, PH_loc = { eight = "\000\000\000\000\000\000\000", sixteen = {0, 0, 0, 0}, thirtytwo = {0, 0}, sixtyfour = {0}, unintptr = { 0}, ptr = 0x0}}, {m_epg_npgs = 0 '\000', m_epg_nrdy = 16 '\020', m_epg_hdrlen = 9 '\t', m_epg_trllen = 69 'E', m_epg_1st_off = 63488, m_epg_last_len = 65535, m_epg_flags = 0 '\000', m_epg_record_type = 0 '\000', __spare = "\000", m_epg_enc_cnt = 0, m_epg_tls = 0xc2976fe600000028, m_epg_so = 0x82ff00000f000000, m_epg_seqno = 0, m_epg_stailq = {stqe_next = 0xffff00000002}}}, {m_ext = {{ext_count = 40574892, ext_cnt = 0x1fac88a6026b1fac}, ext_size = 3634365035, ext_type = 8, ext_flags = 17664, {{ ext_buf = 0x67a0040c27a2800 <error: Cannot access memory at address 0x67a0040c27a2800>, ext_arg2 = 0xbc592715801f3f1a}, {extpg_pa = {466685789526960128, 13571921925355028250, 6125337632998583261, 1175541087717038345, 258}, extpg_trail = 
"\232\265\000\000\001\001\b\n\220\324a\240\322\343z\325\000\000\367}\317\351\347\361\026^\271\217\270\000ӉX[G\002t\225pT\032\003g\201\327y%\200\240\022/u\363D|:\220\211\257\325\023,>\265", extpg_hdr = "\275gp\205#\027\003\363Ō\216\212\v\261\000\000\273\233\272t߾\236"}}, ext_free = 0xce60062281102bfe, ext_arg1 = 0x1facc45091565000}, m_pktdat = 0xfffff8025a85aa58 "\254\037k\002\246\210\254\037k\002\240\330\b"}}, m_dat = 0xfffff8025a85aa20 ""}}
Further, try this in frame 27:
print /x *(uint8_t [128] *)((struct mbuf *)m)->m_data
That will hopefully dump the packet.
--HPS
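Once the bytes are dumped, the start of the packet can be decoded by hand. In the mbuf printed in the previous comment, m_data begins with "E" (0x45) and m_len is 40. 0x45 is what the first byte of an IPv4 header without options looks like (version 4, header length 5 words), and 40 bytes would fit a 20-byte IPv4 header plus a 20-byte TCP header with no payload. The sketch below parses a 20-byte IPv4 header; the sample bytes are made up for illustration (documentation addresses, zero checksum) and are not taken from the actual dump.

```python
import struct


def parse_ipv4_header(buf):
    """Decode the fixed 20-byte part of an IPv4 header."""
    if len(buf) < 20:
        raise ValueError("need at least 20 bytes for an IPv4 header")
    (ver_ihl, _tos, total_len, _ident, _frag,
     ttl, proto, _csum, src, dst) = struct.unpack("!BBHHHBBH4s4s", buf[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,  # IHL counts 32-bit words
        "total_len": total_len,
        "ttl": ttl,
        "protocol": proto,                   # 6 means TCP
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }


if __name__ == "__main__":
    # Hypothetical sample: first byte 0x45, total length 40, protocol TCP.
    sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, 6, 0,
                         bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]))
    print(parse_ipv4_header(sample))
```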
(In reply to Hans Petter Selasky from comment #22) I am not willing to disclose it here; I will send it in a message.
The tp->t_flags in the backtrace mentioned earlier here decodes as 0x613003F4: TF_NODELAY TF_SENTFIN TF_REQ_SCALE TF_RCVD_SCALE TF_REQ_TSTMP TF_RCVD_TSTMP TF_SACK_PERMIT TF_FASTRECOVERY TF_WASFRECOVERY TF_TSO TF_CONGRECOVERY TF_WASCRECOVERY. Is net.inet.tcp.rfc6675_pipe enabled? If yes, see https://reviews.freebsd.org/D29315
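The decoding above can be reproduced mechanically. The following is a minimal sketch: the table lists only the twelve flags named in this comment, and the bit values are my recollection of sys/netinet/tcp_var.h, so treat them as assumptions and check them against the source tree in use. They do at least sum to the quoted value 0x613003F4.

```python
# Subset of TF_* bits; values assumed from sys/netinet/tcp_var.h.
TF_FLAGS = {
    0x00000004: "TF_NODELAY",
    0x00000010: "TF_SENTFIN",
    0x00000020: "TF_REQ_SCALE",
    0x00000040: "TF_RCVD_SCALE",
    0x00000080: "TF_REQ_TSTMP",
    0x00000100: "TF_RCVD_TSTMP",
    0x00000200: "TF_SACK_PERMIT",
    0x00100000: "TF_FASTRECOVERY",
    0x00200000: "TF_WASFRECOVERY",
    0x01000000: "TF_TSO",
    0x20000000: "TF_CONGRECOVERY",
    0x40000000: "TF_WASCRECOVERY",
}


def decode_t_flags(value):
    """Return (names of known set bits, leftover bits not in the table)."""
    names = [name for bit, name in sorted(TF_FLAGS.items()) if value & bit]
    known_mask = sum(TF_FLAGS)  # sum of the dict keys; bits are disjoint
    return names, value & ~known_mask


if __name__ == "__main__":
    names, leftover = decode_t_flags(0x613003F4)
    print(" ".join(names))
    print("undecoded bits:", hex(leftover))
```

The combination of TF_SENTFIN with the recovery flags is what points at the overlap between a FIN and loss recovery discussed in D29315.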
(In reply to Richard Scheffenegger from comment #24) Hans, Richard, thank you for the deep insight into this. Yes, net.inet.tcp.rfc6675_pipe=1 has been set on this machine, probably since early FreeBSD 11. The patch from D29315 was applied and the system rebooted with the original sysctl settings (net.inet.tcp.sack.enable=1, net.inet.tcp.rfc6675_pipe=1).
I'm not so sure this is really due to rescue retransmissions. The linked dump ( http://freebsd.1045724.x6.nabble.com/13-0-CURRENT-r368448-panic-td6443146.html ) is from Jan 20, but SACK rescue retransmission was only committed to stable/13 on March 2 with https://reviews.freebsd.org/R10:15a7c88058d419e3347673ab891ae77ba28ae1bd. So there might be more to this?
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e9f029831fa5747ae1b405f5716c52cb4ebf1e04

commit e9f029831fa5747ae1b405f5716c52cb4ebf1e04
Author: Richard Scheffenegger <rscheff@FreeBSD.org>
AuthorDate: 2021-03-17 15:44:29 +0000
Commit: Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2021-03-17 16:12:04 +0000

fix panic when rescue retransmission and FIN overlap

PR: 254244
PR: 254309
Reviewed By: #transport, hselasky, tuexen
MFC after: 3 days
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29315

sys/netinet/tcp_sack.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=703419774f86525a2441d615733993a6fddcd047

commit 703419774f86525a2441d615733993a6fddcd047
Author: Richard Scheffenegger <rscheff@FreeBSD.org>
AuthorDate: 2021-03-17 15:44:29 +0000
Commit: Richard Scheffenegger <rscheff@FreeBSD.org>
CommitDate: 2021-03-17 19:05:33 +0000

fix panic when rescue retransmission and FIN overlap

PR: 254244
PR: 254309
Reviewed By: #transport, hselasky, tuexen
Approved by: re (cperciva)
MFC after: immediately
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29315

(cherry picked from commit e9f029831fa5747ae1b405f5716c52cb4ebf1e04)

sys/netinet/tcp_sack.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
^Triage:
* Assign to committer that resolves
* Track stable/* merge(s)