Summary: | iflib_debugnet_init sometimes provides bogus zero values shortly after ifnet_link_event:LINK_STATE_UP event | ||
---|---|---|---|
Product: | Base System | Reporter: | Navdeep Parhar <np> |
Component: | kern | Assignee: | Mark Johnston <markj> |
Status: | Open --- | ||
Severity: | Affects Some People | CC: | afedorov, krzysztof.galazka, markj, ohartmann, pi, rickmanbritney, zeising |
Priority: | --- | Flags: | zeising: mfc-stable12-, zeising: mfc-stable11- |
Version: | CURRENT | ||
Hardware: | Any | ||
OS: | Any |
Description
Navdeep Parhar
2019-10-22 00:54:51 UTC
I think that means a NIC driver ::dn_init() is returning zero for clsize and something non-zero for nrxr or ncl. Any chance it's obvious which NIC the two systems share? We can mask this in `debugnet_any_ifnet_update()` but it seems better to fix the support in the NIC.

(In reply to Conrad Meyer from comment #1)
There are two em's in the first system and two igb's in the second one. The panics seem to occur right after the em1/igb1 link comes up on either system. So this probably affects iflib drivers. Both were in the middle of starting the network when we see the link up message pop up in the middle and then a panic.

system 1:
...
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=481249b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRem1: link state changed to UP
O,WOL_MAGIC,VLANpanic: m_getzone: invalid cluster size 0
cpuid = 1
time = 1571705154

system 2:
...
igb0: flags=8843<UP,BROADCAST,igb1: link state changed to UP
RUNNING,SIMPLEX,panic: m_getzone: invalid cluster size 0
cpuid = 2
time = 1571704676
KDB: stack backtrace:

This isn't in 11 or 12, please calm down with the MFC flags.

(In reply to Conrad Meyer from comment #3)
One (triage) can't always know, but I *should* have realised the potential for it to be CURRENT only given the description in comment 0. Note too that setting mfc flags (or any other flag) doesn't mean 'merge', it means "is this relevant for this branch?", so we'll continue to set them if and when we're not sure, or it's not entirely clear. Cancelling those flags, or better, setting them to - to mean "not relevant for these branches", is the correct action in those cases where a merge is not required/relevant/appropriate.

(In reply to Navdeep Parhar from comment #2)
This could be caused by the order of operations in iflib_init_locked. IFDI_INIT is called before iflib_fl_setup, so the link may go up before ctx->ifc_rxqs[0].ifr_fl->ifl_buf_size is set.
I think using 'ctx->ifc_rx_mbuf_sz' in iflib_debug_init instead should help. It is set before the IFDI_INIT call.

The same here. The NICs in question are a built-in igb (i350) and an add-in card i350-T2. I posted this error to the CURRENT list; I copy from there to this location. I have no access to boxes of that kind until next week. I would have copied the Intel igb chipset version if I had known it's related to iflib and NICs.

[...]
The last known good update of CURRENT on a Fujitsu Primergy RX2530-M5 (only one of two sockets equipped, 64 GB RAM) was October 17th, 2019, before 15 o'clock; I suppose that was r353680 at that time. Today's update to r353881 resulted in an immediate crash when the network (igb0-igb3, two built-in i350 NICs and two i350 NICs placed on an i350-T2 server adapter) comes up, just when rc scripts configure the NICs. The last message I see is something like "m_getzone: Invalid cluster size 0" and "debugnet" or similar.

The crash wrecked the installation: it seems that after updating, the UFS filesystem received, as so often, inconsistencies, so I cannot start vi or other applications after a full fsck -yf on all partitions. Those programs fail with some serious trap, stating that ELF is corrupt (I can't remember the exact message).

We do not have debugging facilities enabled on that kernel suite, so I cannot provide more proper information.

For emergency rescue we downloaded the latest CURRENT memstick image, FreeBSD-13.0-CURRENT-amd64-20191018-r353709-memstick.img, dated October 18th, which also shows the bug described above. It seems that I have to go back to the memstick image FreeBSD-13.0-CURRENT-amd64-20191011-r353427-memstick.img, which dates to October 11th, 2019.

Since the crash resulted in serious damage to the base filesystem and the installation, I first need to copy the installation tarballs from the install memstick into place and then try to rebuild the system with sources up to the version which is deemed working.
Then I'll report, hopefully, more information.

Kind regards,

oh

Addendum:
r353680 works
r353709 doesn't work

[...]
/etc # more /var/crash/info.last
Dump header from device: /dev/da0p2
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 2952835072
  Blocksize: 512
  Compression: none
  Dumptime: Tue Oct 22 12:13:19 2019
  Hostname: wotan.lan101.bundesimmobilien.intern
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 13.0-CURRENT #11 r353877: Tue Oct 22 11:02:32 CEST 2019
    root@:/usr/obj/usr/src/amd64.amd64/sys/WOTAN
  Panic String: m_getzone: invalid cluster size 0
  Dump Parity: 2027469319
  Bounds: 0
  Dump Status: good
[...]

[...]
# more /var/crash/core.txt.0
/dev/stdin:1: Error in sourced command file:
Cannot access memory at address 0x65657246
/dev/stdin:1: Error in sourced command file:
Cannot access memory at address 0x65657246
/dev/stdin:1: Error in sourced command file:
Cannot access memory at address 0x65657246
/dev/stdin:1: Error in sourced command file:
Cannot access memory at address 0x65657246
/dev/stdin:1: Error in sourced command file:
Cannot access memory at address 0x65657246
[...]

[...]
---<<BOOT>>---
Copyright (c) 1992-2019 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 13.0-CURRENT #14 r353680: Wed Oct 23 08:50:04 CEST 2019
    root@wotan.lan101.bundesimmobilien.intern:/usr/obj/usr/src/amd64.amd64/sys/WOTAN amd64
FreeBSD clang version 9.0.0 (tags/RELEASE_900/final 372316) (based on LLVM 9.0.0)
VT(efifb): resolution 1280x1024
CPU microcode: no matching update found
CPU: Intel(R) Xeon(R) Gold 5217 CPU @ 3.00GHz (2993.05-MHz K8-class CPU)
  Origin="GenuineIntel" Id=0x50657 Family=0x6 Model=0x55 Stepping=7
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffefbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0xd39ffffb<FSGSBASE,TSCADJ,BMI1,HLE,AVX2,FDPEXC,SMEP,BMI2,ERMS,INVPCID,RTM,PQM,NFPUSG,MPX,PQE,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PROCTRACE,AVX512CD,AVX512BW,AVX512VL>
  Structured Extended Features2=0x808<PKU,AVX512VNNI>
  Structured Extended Features3=0xbc000400<MD_CLEAR,IBPB,STIBP,L1DFL,ARCH_CAP,SSBD>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  IA32_ARCH_CAPS=0x2b<RDCL_NO,IBRS_ALL,SKIP_L1DFL_VME>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
  TSC: P-state invariant, performance statistics
real memory = 68719476736 (65536 MB)
avail memory = 66361274368 (63287 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <FUJ D3383-B1>
FreeBSD/SMP: Multiprocessor System Detected: 16 CPUs
FreeBSD/SMP: 1 package(s) x 8 core(s) x 2 hardware threads
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
random: unblocking device.
Security policy loaded: MAC/ntpd (mac_ntpd)
ioapic0 <Version 2.0> irqs 0-23
ioapic1 <Version 2.0> irqs 24-31
ioapic2 <Version 2.0> irqs 32-39
ioapic3 <Version 2.0> irqs 40-47
ioapic4 <Version 2.0> irqs 48-55
Launching APs: 1 13 5 12 9 14 8 7 10 6 11 15 3 4 2
Timecounter "TSC-low" frequency 1496523352 Hz quality 1000
random: entropy device external interface

I discovered a similar kernel panic. To reproduce, just run CURRENT in bhyve with the e1000 network backend.

vga0: <Generic ISA VGA> at port 0x3b0-0x3bb iomem 0xb0000-0xb7fff pnpid PNP0900 on isa0
Timecounters tick every 10.000 msec
usb_needs_explore_all: no devclass
em0: link state changed to UP
panic: m_getzone: invalid cluster size 0
cpuid = 0
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0011b8d7f0
vpanic() at vpanic+0x17e/frame 0xfffffe0011b8d850
panic() at panic+0x43/frame 0xfffffe0011b8d8b0
debugnet_mbuf_reinit() at debugnet_mbuf_reinit+0x21b/frame 0xfffffe0011b8d8f0
debugnet_any_ifnet_update() at debugnet_any_ifnet_update+0x107/frame 0xfffffe0011b8d940
do_link_state_change() at do_link_state_change+0x1b3/frame 0xfffffe0011b8d990
taskqueue_run_locked() at taskqueue_run_locked+0x10c/frame 0xfffffe0011b8d9f0
taskqueue_run() at taskqueue_run+0x4a/frame 0xfffffe0011b8da10
ithread_loop() at ithread_loop+0x1c6/frame 0xfffffe0011b8da70
fork_exit() at fork_exit+0x80/frame 0xfffffe0011b8dab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0011b8dab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 12 tid 100010 ]
Stopped at kdb_enter+0x37: movq $0,0x1098a86(%rip)
db>

For me, using 'ctx->ifc_rx_mbuf_sz' in iflib_debug_init doesn't help.
I use the following patch as a workaround:

diff --git a/sys/net/iflib.c b/sys/net/iflib.c
index 73606981a492..1caf3505932a 100644
--- a/sys/net/iflib.c
+++ b/sys/net/iflib.c
@@ -6729,7 +6729,8 @@ iflib_debugnet_init(if_t ifp, int *nrxr, int *ncl, int *clsize)
 	CTX_LOCK(ctx);
 	*nrxr = NRXQSETS(ctx);
 	*ncl = ctx->ifc_rxqs[0].ifr_fl->ifl_size;
-	*clsize = ctx->ifc_rxqs[0].ifr_fl->ifl_buf_size;
+	iflib_calc_rx_mbuf_sz(ctx);
+	*clsize = iflib_get_rx_mbuf_sz(ctx);
 	CTX_UNLOCK(ctx);
 }

This patch relies on the value of if_softc_ctx::isc_max_frame_size. It seems this variable is initialized before the ifnet_link_event is generated.

A commit references this bug:

Author: cem
Date: Wed Oct 23 16:48:23 UTC 2019
New revision: 353934
URL: https://svnweb.freebsd.org/changeset/base/353934

Log:
  Prevent a panic when a driver provides bogus debugnet parameters

  This is just a bandaid; we should fix the driver(s) too.

  Introduced in r353685.

  PR: 241403
  X-MFC-With: r353685
  Reported by: np and others

Changes:
  head/sys/net/debugnet.c

r353934 gets the panic out of the way. We should fix iflib, but we can take the time to figure out the right approach.

While updating a 13.0p11 to 13.1 (amd64), I got a similar crash during boot:

ipmi0: Establishing power cycle handler
lo0: link state changed to UP
ix0: link state changed to UP
ix0.253: linkpanic: m_getzone: invalid cluster size 0
cpuid = 2
time = 1658088559
KDB: stack backtrace:
#0 0xffffffff80c69465 at kdb_backtrace+0x65
#1 0xffffffff80c1bb1f at vpanic+0x17f
#2 0xffffffff80c1b993 at panic+0x43
#3 0xffffffff80bf5d68 at m_getjcl+0x148
#4 0xffffffff8213b63b at ixgbe_refresh_mbufs+0xcb
#5 0xffffffff8213b526 at ixgbe_rxeof+0x756
#6 0xffffffff8213794f at ixgbe_msix_que+0x9f
#7 0xffffffff80bdbcba at ithread_loop+0x25a
#8 0xffffffff80bd8a5e at fork_exit+0x7e
#9 0xffffffff8108859e at fork_trampoline+0xe

This is surprising, and I do not know if this has the same cause.
(In reply to Kurt Jaeger from comment #10)
I had to go back to 13.0, and the problem was reproducible. Any suggestions for additional debugging? I can provide the crash dump; the update was via freebsd-update.

(In reply to Kurt Jaeger from comment #10)
That looks like a different problem. Is this Intel's ixgbe driver?

(In reply to Mark Johnston from comment #12)
Yes, the 13.0 host had intel-ix-kmod-3.3.24 loaded. Might freebsd-update cause an API drift if we have a 13.1 kernel running with the intel-ix-kmod-3.3.24 kmod built for 13.0? So do I have to disable loading if_ix_updated_load="YES" in /boot/loader.conf?

(In reply to Kurt Jaeger from comment #13)
In general 13.1 should have a kernel ABI that's compatible with 13.0. In other words, it should not be necessary to recompile the driver, but there could be a bug. In any case, it is certainly a different bug.

(In reply to Mark Johnston from comment #14)
Thanks. Created a new PR for this, PR#265300.

The original bug is still open.
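
For readers following the original issue, the ordering problem and the band-aid discussed in this report can be modeled in plain userspace C. This is only a sketch under stated assumptions: `struct model_ctx` and every `model_*` function are hypothetical stand-ins invented for illustration, not the real iflib or debugnet code, and the guard only mirrors the idea in the r353934 log message (ignore bogus parameters instead of panicking), not its actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-interface state; not the real iflib context. */
struct model_ctx {
	int nrxqsets;    /* number of receive queue sets */
	int fl_size;     /* buffers per free list */
	int fl_buf_size; /* cluster size; stays 0 until the free list is set up */
};

/* Models a dn_init-style callback that reports the current values verbatim. */
static void
model_dn_init(const struct model_ctx *ctx, int *nrxr, int *ncl, int *clsize)
{
	assert(ctx != NULL && nrxr != NULL && ncl != NULL && clsize != NULL);
	*nrxr = ctx->nrxqsets;
	*ncl = ctx->fl_size;
	*clsize = ctx->fl_buf_size;
}

/* Guard: a parameter set is usable only if every value is positive. */
static bool
model_params_usable(int nrxr, int ncl, int clsize)
{
	return (nrxr > 0 && ncl > 0 && clsize > 0);
}

/* Models the later setup step that finally fills in the cluster size. */
static void
model_fl_setup(struct model_ctx *ctx, int buf_size)
{
	ctx->fl_buf_size = buf_size;
}

/*
 * Models a link-up event handler: query the driver, then decide whether
 * the reported parameters may be used. Returns false when they are bogus.
 */
static bool
model_link_up_event(const struct model_ctx *ctx)
{
	int nrxr, ncl, clsize;

	model_dn_init(ctx, &nrxr, &ncl, &clsize);
	return (model_params_usable(nrxr, ncl, clsize));
}
```

Calling `model_link_up_event()` on a context whose `fl_buf_size` is still 0 models a link event that arrives before the free-list setup step; calling it again after `model_fl_setup()` models the well-ordered case. The guard only hides the symptom, which matches the remark in the thread that the driver side should still be fixed.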