Bug 241403

Summary: iflib_debugnet_init sometimes provides bogus zero values shortly after ifnet_link_event:LINK_STATE_UP event
Product: Base System Reporter: Navdeep Parhar <np>
Component: kern Assignee: Mark Johnston <markj>
Status: Open ---    
Severity: Affects Some People CC: afedorov, krzysztof.galazka, markj, ohartmann, pi, rickmanbritney, zeising
Priority: --- Flags: zeising: mfc-stable12-
zeising: mfc-stable11-
Version: CURRENT   
Hardware: Any   
OS: Any   

Description Navdeep Parhar freebsd_committer freebsd_triage 2019-10-22 00:54:51 UTC
I updated multiple systems to r353877 today and two of them (out of 5) panicked
on reboot.  The rest booted up properly.

panic: m_getzone: invalid cluster size 0
cpuid = 1
time = 1571705154
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe006df2b930
vpanic() at vpanic+0x17e/frame 0xfffffe006df2b990
panic() at panic+0x43/frame 0xfffffe006df2b9f0
debugnet_mbuf_reinit() at debugnet_mbuf_reinit+0x21b/frame 0xfffffe006df2ba30
debugnet_any_ifnet_update() at debugnet_any_ifnet_update+0xfa/frame 0xfffffe006df2ba80
do_link_state_change() at do_link_state_change+0x1eb/frame 0xfffffe006df2bad0
taskqueue_run_locked() at taskqueue_run_locked+0x103/frame 0xfffffe006df2bb30
taskqueue_run() at taskqueue_run+0x6f/frame 0xfffffe006df2bb50
ithread_loop() at ithread_loop+0x1db/frame 0xfffffe006df2bbb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe006df2bbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe006df2bbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 12 tid 100020 ]
Stopped at      kdb_enter+0x37: movq    $0,0x106feb6(%rip)
Comment 1 Conrad Meyer freebsd_committer freebsd_triage 2019-10-22 01:19:57 UTC
I think that means a NIC driver ::dn_init() is returning zero for clsize and something non-zero for nrxr or ncl.  Any chance it's obvious which NIC the two
systems share?  We can mask this in `debugnet_any_ifnet_update()` but it seems better to fix the support in the NIC.
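
To illustrate the masking option, here is a minimal userland sketch of rejecting bogus parameters before they reach the mbuf allocator. The helper names and the SK_ constants are hypothetical stand-ins (the amd64 values of MCLBYTES, MJUMPAGESIZE, MJUM9BYTES and MJUM16BYTES); this is not the code in debugnet.c.

```c
#include <stdbool.h>

/* Userland stand-ins for the kernel's mbuf cluster sizes (amd64 values). */
#define SK_MCLBYTES     2048
#define SK_MJUMPAGESIZE 4096
#define SK_MJUM9BYTES   (9 * 1024)
#define SK_MJUM16BYTES  (16 * 1024)

/* Hypothetical helper: true if clsize corresponds to a real cluster zone. */
static bool
sk_clsize_valid(int clsize)
{
	switch (clsize) {
	case SK_MCLBYTES:
	case SK_MJUMPAGESIZE:
	case SK_MJUM9BYTES:
	case SK_MJUM16BYTES:
		return (true);
	default:
		/* Any other size, including 0, has no zone behind it. */
		return (false);
	}
}

/*
 * Hypothetical helper: decide whether the values a driver's init
 * callback reported are usable at all.
 */
static bool
sk_dn_params_valid(int nrxr, int ncl, int clsize)
{
	return (nrxr > 0 && ncl > 0 && sk_clsize_valid(clsize));
}
```

A caller could skip the mbuf pool reinitialization when sk_dn_params_valid() returns false, which hides the symptom but leaves the driver reporting a zero cluster size.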
Comment 2 Navdeep Parhar freebsd_committer freebsd_triage 2019-10-22 01:25:50 UTC
(In reply to Conrad Meyer from comment #1)
There are two em's in the first system and two igb's in the second one.  The
panics seem to occur right after the em1/igb1 link comes up on either system.
So this probably affects iflib drivers.

Both were in the middle of starting the network when the link-up message
popped up mid-output, followed by a panic.


system 1:
...
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=481249b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRem1: link state changed to UP
O,WOL_MAGIC,VLANpanic: m_getzone: invalid cluster size 0
cpuid = 1
time = 1571705154

system 2:
...
igb0: flags=8843<UP,BROADCAST,igb1: link state changed to UP
RUNNING,SIMPLEX,panic: m_getzone: invalid cluster size 0
cpuid = 2
time = 1571704676
KDB: stack backtrace:
Comment 3 Conrad Meyer freebsd_committer freebsd_triage 2019-10-22 01:58:53 UTC
This isn't in 11 or 12, please calm down with the MFC flags.
Comment 4 Kubilay Kocak freebsd_committer freebsd_triage 2019-10-22 02:17:21 UTC
(In reply to Conrad Meyer from comment #3)

One (triage) can't always know, but I *should* have realised the potential for it to be CURRENT only given the description in comment 0.

Note too that setting mfc flags (or any other flag) doesn't mean 'merge', it means "is this relevant for this branch?", so we'll continue to set them if and when we are not sure, or it's not entirely clear.

Cancelling those flags, or better, setting them to - to mean "not relevant for this branch", is the correct action in those cases where a merge is not required/relevant/appropriate.
Comment 5 Krzysztof Galazka 2019-10-23 10:01:54 UTC
(In reply to Navdeep Parhar from comment #2)

This could be caused by the order of operations in iflib_init_locked. IFDI_INIT is called before iflib_fl_setup, so the link may go up before ctx->ifc_rxqs[0].ifr_fl->ifl_buf_size is set. I think using 'ctx->ifc_rx_mbuf_sz' in iflib_debugnet_init instead should help. It is set before the IFDI_INIT call.
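
The ordering hypothesis can be modelled in userland as below. This is only a sketch: struct sk_ctx and the sk_ functions are hypothetical, the link-up handler is called synchronously (in the kernel it runs from a taskqueue), and the assumption that free-list setup copies the precomputed size is mine. Note that Comment 7 reports the suggested substitution did not help under bhyve.

```c
/* Hypothetical model of the fields involved in the suspected race. */
struct sk_ctx {
	int rx_mbuf_sz;   /* stands in for ctx->ifc_rx_mbuf_sz */
	int fl_buf_size;  /* stands in for ifr_fl->ifl_buf_size */
	int seen_clsize;  /* what a link-up handler read as clsize */
};

/* Link-up handler: samples the free-list buffer size, as the report does. */
static void
sk_link_up(struct sk_ctx *c)
{
	c->seen_clsize = c->fl_buf_size;
}

/* Models the init order: size computed, driver init, free-list setup. */
static void
sk_init_locked(struct sk_ctx *c, int link_up_during_init)
{
	c->rx_mbuf_sz = 2048;           /* computed before driver init */
	if (link_up_during_init)
		sk_link_up(c);          /* driver init may raise the link */
	c->fl_buf_size = c->rx_mbuf_sz; /* free-list setup runs last */
}
```

In this model a link-up that arrives during driver init observes fl_buf_size == 0 while rx_mbuf_sz is already set.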
Comment 6 O. Hartmann 2019-10-23 14:15:07 UTC
The same here. The NICs in question are built-in igb (i350) and an add-in card i350-T2.

I posted this error to the CURRENT list, I copy from there to this location.

I have no access to boxes of that kind until next week. I would have copied the Intel igb chipset version if I had known it was related to iflib and NICs.


[...]
The last known good update of CURRENT on a Fujitsu Primergy RX2530-M5 (only one
of two sockets equipped, 64 GB RAM) was October 17th, 2019, before 15 o'clock;
I suppose that was r353680 at that time. Today's update to r353881 resulted in an
immediate crash when the network (igb0-igb3, two built-in i350 NICs and two
i350 NICs placed on an i350-T2 server adapter) comes up, just when the rc scripts
configure the NICs.

Last message I see is something like m_getzone: Invalid cluster size 0 and
"debugnet" or similar. The crash wrecked the installation: after updating, the
UFS filesystem had, as so often, inconsistencies, so even after a full
fsck -yf on all partitions I cannot start vi or other applications. Those
programs fail with a serious trap stating that the ELF is corrupt (I can't
remember the exact message). We do not have debugging facilities enabled
on that kernel suite, so I cannot provide more detailed information.

For emergency rescue we downloaded the latest CURRENT memstick image,
FreeBSD-13.0-CURRENT-amd64-20191018-r353709-memstick.img dated Oct., 18th, which
also shows the bug described above. 

It seems that I have to go back to the memstick image
FreeBSD-13.0-CURRENT-amd64-20191011-r353427-memstick.img, which dates to 11th
October 2019.
Since the crash resulted in serious damage to the base filesystem and the
installation, I first need to copy the installation tarballs from the install
memstick into place and then try to rebuild the system with sources up to the
version which is deemed working. Then I'll report, hopefully, more information.

Kind regards,
oh

Addendum:

r353680 works
r353709 doesn't work


[...]
/etc # more /var/crash/info.last 
Dump header from device: /dev/da0p2
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 2952835072
  Blocksize: 512
  Compression: none
  Dumptime: Tue Oct 22 12:13:19 2019
  Hostname: wotan.lan101.bundesimmobilien.intern
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 13.0-CURRENT #11 r353877: Tue Oct 22 11:02:32 CEST
2019 root@:/usr/obj/usr/src/amd64.amd64/sys/WOTAN
  Panic String: m_getzone: invalid cluster size 0
  Dump Parity: 2027469319
  Bounds: 0
  Dump Status: good
[...]

[...]

 # more /var/crash/core.txt.0 
/dev/stdin:1: Error in sourced command file:
Cannot access memory at address 0x65657246
/dev/stdin:1: Error in sourced command file:
Cannot access memory at address 0x65657246
/dev/stdin:1: Error in sourced command file:
Cannot access memory at address 0x65657246
/dev/stdin:1: Error in sourced command file:
Cannot access memory at address 0x65657246
/dev/stdin:1: Error in sourced command file:
Cannot access memory at address 0x65657246


[...]


[...]
---<<BOOT>>---
Copyright (c) 1992-2019 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 13.0-CURRENT #14 r353680: Wed Oct 23 08:50:04 CEST 2019
    root@wotan.lan101.bundesimmobilien.intern:/usr/obj/usr/src/amd64.amd64/sys/WOTAN
amd64 FreeBSD clang version 9.0.0 (tags/RELEASE_900/final 372316) (based on
LLVM 9.0.0) VT(efifb): resolution 1280x1024
CPU microcode: no matching update found
CPU: Intel(R) Xeon(R) Gold 5217 CPU @ 3.00GHz (2993.05-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x50657  Family=0x6  Model=0x55  Stepping=7
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffefbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended
Features=0xd39ffffb<FSGSBASE,TSCADJ,BMI1,HLE,AVX2,FDPEXC,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,NFPUSG,MPX,PQE,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PROCTRACE,AVX512CD,AVX512BW,AVX512VL>
Structured Extended Features2=0x808<PKU,AVX512VNNI> Structured Extended
Features3=0xbc000400<MD_CLEAR,IBPB,STIBP,L1DFL,ARCH_CAP,SSBD> XSAVE
Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
IA32_ARCH_CAPS=0x2b<RDCL_NO,IBRS_ALL,SKIP_L1DFL_VME> VT-x:
PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr TSC: P-state invariant, performance
statistics real memory  = 68719476736 (65536 MB)
avail memory = 66361274368 (63287 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <FUJ    D3383-B1>
FreeBSD/SMP: Multiprocessor System Detected: 16 CPUs
FreeBSD/SMP: 1 package(s) x 8 core(s) x 2 hardware threads
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
random: unblocking device.
Security policy loaded: MAC/ntpd (mac_ntpd)
ioapic0 <Version 2.0> irqs 0-23
ioapic1 <Version 2.0> irqs 24-31
ioapic2 <Version 2.0> irqs 32-39
ioapic3 <Version 2.0> irqs 40-47
ioapic4 <Version 2.0> irqs 48-55
Launching APs: 1 13 5 12 9 14 8 7 10 6 11 15 3 4 2
Timecounter "TSC-low" frequency 1496523352 Hz quality 1000
random: entropy device external interface
Comment 7 Aleksandr Fedorov freebsd_committer freebsd_triage 2019-10-23 14:31:11 UTC
I discovered a similar kernel panic.

To reproduce, just run CURRENT in bhyve with e1000 network backend.

vga0: <Generic ISA VGA> at port 0x3b0-0x3bb iomem 0xb0000-0xb7fff pnpid PNP0900 on isa0
Timecounters tick every 10.000 msec
usb_needs_explore_all: no devclass
em0: link state changed to UP
panic: m_getzone: invalid cluster size 0
cpuid = 0
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0011b8d7f0
vpanic() at vpanic+0x17e/frame 0xfffffe0011b8d850
panic() at panic+0x43/frame 0xfffffe0011b8d8b0
debugnet_mbuf_reinit() at debugnet_mbuf_reinit+0x21b/frame 0xfffffe0011b8d8f0
debugnet_any_ifnet_update() at debugnet_any_ifnet_update+0x107/frame 0xfffffe0011b8d940
do_link_state_change() at do_link_state_change+0x1b3/frame 0xfffffe0011b8d990
taskqueue_run_locked() at taskqueue_run_locked+0x10c/frame 0xfffffe0011b8d9f0
taskqueue_run() at taskqueue_run+0x4a/frame 0xfffffe0011b8da10
ithread_loop() at ithread_loop+0x1c6/frame 0xfffffe0011b8da70
fork_exit() at fork_exit+0x80/frame 0xfffffe0011b8dab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0011b8dab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 12 tid 100010 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1098a86(%rip)
db> 

For me, using 'ctx->ifc_rx_mbuf_sz' in iflib_debugnet_init doesn't help.

I use the following patch, as a workaround:

diff --git a/sys/net/iflib.c b/sys/net/iflib.c
index 73606981a492..1caf3505932a 100644
--- a/sys/net/iflib.c
+++ b/sys/net/iflib.c
@@ -6729,7 +6729,8 @@ iflib_debugnet_init(if_t ifp, int *nrxr, int *ncl, int *clsize)
        CTX_LOCK(ctx);
        *nrxr = NRXQSETS(ctx);
        *ncl = ctx->ifc_rxqs[0].ifr_fl->ifl_size;
-       *clsize = ctx->ifc_rxqs[0].ifr_fl->ifl_buf_size;
+       iflib_calc_rx_mbuf_sz(ctx);
+       *clsize = iflib_get_rx_mbuf_sz(ctx);
        CTX_UNLOCK(ctx);
 }

This patch relies on the value of if_softc_ctx::isc_max_frame_size.

It seems this variable is initialized before the ifnet_link_event is generated.
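
A minimal sketch of why a size derived from the maximum frame size is never zero. The function name and the two-tier threshold are assumptions for illustration (amd64 values of MCLBYTES and MJUMPAGESIZE), not a copy of iflib_calc_rx_mbuf_sz():

```c
/* Userland stand-ins for the kernel's cluster sizes (amd64 values). */
#define SK_MCLBYTES     2048
#define SK_MJUMPAGESIZE 4096

/*
 * Hypothetical helper: pick a receive cluster size from the maximum
 * frame size alone.  Because the result depends only on that one input
 * and every branch returns a valid cluster size, the caller gets a
 * usable value even if the free lists have not been set up yet.
 */
static int
sk_rx_mbuf_sz(int max_frame_size)
{
	if (max_frame_size <= SK_MCLBYTES)
		return (SK_MCLBYTES);
	return (SK_MJUMPAGESIZE);
}
```

For example, a standard 1518-byte frame maps to the 2048-byte cluster, and a 9018-byte jumbo frame maps to the page-sized cluster in this sketch.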
Comment 8 commit-hook freebsd_committer freebsd_triage 2019-10-23 16:48:31 UTC
A commit references this bug:

Author: cem
Date: Wed Oct 23 16:48:23 UTC 2019
New revision: 353934
URL: https://svnweb.freebsd.org/changeset/base/353934

Log:
  Prevent a panic when a driver provides bogus debugnet parameters

  This is just a bandaid; we should fix the driver(s) too.  Introduced in
  r353685.

  PR:		241403
  X-MFC-With:	r353685
  Reported by:	np and others

Changes:
  head/sys/net/debugnet.c
Comment 9 Conrad Meyer freebsd_committer freebsd_triage 2019-10-23 16:49:15 UTC
r353934 gets the panic out of the way.  We should fix iflib, but we can take the time to figure out the right approach.
Comment 10 Kurt Jaeger freebsd_committer freebsd_triage 2022-07-17 20:55:44 UTC
While updating a 13.0p11 to 13.1 (amd64), I got a similar crash:

ipmi0: Establishing power cycle handler
lo0: link state changed to UP
ix0: link state changed to UP
ix0.253: linkpanic: m_getzone: invalid cluster size 0
cpuid = 2
time = 1658088559
KDB: stack backtrace:
#0 0xffffffff80c69465 at kdb_backtrace+0x65
#1 0xffffffff80c1bb1f at vpanic+0x17f
#2 0xffffffff80c1b993 at panic+0x43
#3 0xffffffff80bf5d68 at m_getjcl+0x148
#4 0xffffffff8213b63b at ixgbe_refresh_mbufs+0xcb
#5 0xffffffff8213b526 at ixgbe_rxeof+0x756
#6 0xffffffff8213794f at ixgbe_msix_que+0x9f
#7 0xffffffff80bdbcba at ithread_loop+0x25a
#8 0xffffffff80bd8a5e at fork_exit+0x7e
#9 0xffffffff8108859e at fork_trampoline+0xe

during boot. This is surprising, and I do not know if this is the same cause ?
Comment 11 Kurt Jaeger freebsd_committer freebsd_triage 2022-07-17 20:57:16 UTC
(In reply to Kurt Jaeger from comment #10)
I had to go back to 13.0, and the problem was reproducible.

Any suggestions for additional debugging? I can provide the crash dump; the update was via freebsd-update.
Comment 12 Mark Johnston freebsd_committer freebsd_triage 2022-07-18 13:22:25 UTC
(In reply to Kurt Jaeger from comment #10)
That looks like a different problem.  Is this Intel's ixgbe driver?
Comment 13 Kurt Jaeger freebsd_committer freebsd_triage 2022-07-18 13:54:59 UTC
(In reply to Mark Johnston from comment #12)
Yes, the 13.0 host had intel-ix-kmod-3.3.24 loaded.

Could freebsd-update cause API drift if we have a 13.1 kernel running with
the intel-ix-kmod-3.3.24 kmod built for 13.0? So do I have to disable loading
if_ix_updated_load="YES"
in /boot/loader.conf?
Comment 14 Mark Johnston freebsd_committer freebsd_triage 2022-07-18 14:48:12 UTC
(In reply to Kurt Jaeger from comment #13)
In general 13.1 should have a kernel ABI that's compatible with 13.0.  In other words, it should not be necessary to recompile the driver, but there could be a bug.  In any case, it is certainly a different bug.
Comment 15 Kurt Jaeger freebsd_committer freebsd_triage 2022-07-18 18:45:12 UTC
(In reply to Mark Johnston from comment #14)
Thanks. Created a new PR for this, PR#265300
Comment 16 Mark Johnston freebsd_committer freebsd_triage 2022-07-18 18:55:25 UTC
The original bug is still open.