|Summary:||iflib: Panic in ether_output_frame on ESXi|
|Product:||Base System||Reporter:||Juraj Lutter <juraj>|
|Component:||kern||Assignee:||Mark Johnston <markj>|
|Severity:||Affects Some People||CC:||avg, chris, emaste, flo, juraj, marius, markj, net, pkelsey, swills, virtualization|
|Bug Depends on:|
Description Juraj Lutter 2019-07-10 19:00:43 UTC
There is a reocurring panic with 12-STABLE (at least r349857) running as an ESXi guest: Jul 10 19:11:45 svc1 kernel: fault code = supervisor read data, page not present Jul 10 19:11:45 svc1 kernel: instruction pointer = 0x20:0xffffffff811a7f0c Jul 10 19:11:45 svc1 kernel: stack pointer = 0x28:0xfffffe004c4628f0 Jul 10 19:11:45 svc1 kernel: frame pointer = 0x28:0xfffffe004c4628f0 Jul 10 19:11:45 svc1 kernel: code segment = base rx0, limit 0xfffff, type 0x1b Jul 10 19:11:45 svc1 kernel: = DPL 0, pres 1, long 1, def32 0, gran 1 Jul 10 19:11:45 svc1 kernel: processor eflags = resume, IOPL = 0 Jul 10 19:11:45 svc1 kernel: current process = 17606 (imap-login) Jul 10 19:11:45 svc1 kernel: trap number = 12 Jul 10 19:11:45 svc1 kernel: panic: page fault Jul 10 19:11:45 svc1 kernel: cpuid = 0 Jul 10 19:11:45 svc1 kernel: time = 1562778547 Jul 10 19:11:45 svc1 kernel: KDB: stack backtrace: Jul 10 19:11:45 svc1 kernel: #0 0xffffffff80c18257 at kdb_backtrace+0x67 Jul 10 19:11:45 svc1 kernel: #1 0xffffffff80bcbbdd at vpanic+0x19d Jul 10 19:11:45 svc1 kernel: #2 0xffffffff80bcba33 at panic+0x43 Jul 10 19:11:45 svc1 kernel: #3 0xffffffff810a04bc at trap_fatal+0x39c Jul 10 19:11:45 svc1 kernel: #4 0xffffffff810a0509 at trap_pfault+0x49 Jul 10 19:11:45 svc1 kernel: #5 0xffffffff8109faff at trap+0x29f Jul 10 19:11:45 svc1 kernel: #6 0xffffffff8107a7e5 at calltrap+0x8 Jul 10 19:11:45 svc1 kernel: #7 0xffffffff80ce48a3 at iflib_fast_intr_rxtx+0xc3 Jul 10 19:11:45 svc1 kernel: #8 0xffffffff80b8f63a at intr_event_handle+0xba Jul 10 19:11:45 svc1 kernel: #9 0xffffffff8120bff7 at intr_execute_handlers+0x57 Jul 10 19:11:45 svc1 kernel: #10 0xffffffff81212133 at lapic_handle_intr+0x43 Jul 10 19:11:45 svc1 kernel: #11 0xffffffff8107baa9 at Xapic_isr1+0xd9 Jul 10 19:11:45 svc1 kernel: #12 0xffffffff80ce67c2 at iflib_completed_tx_reclaim+0x62 Jul 10 19:11:45 svc1 kernel: #13 0xffffffff80ce5e2c at iflib_txq_drain+0x6c Jul 10 19:11:45 svc1 kernel: #14 0xffffffff80cecafe at drain_ring_lockless+0x6e Jul 10 19:11:45 svc1 kernel: #15 0xffffffff80cec9e1 at ifmp_ring_enqueue+0x2b1 Jul 10 19:11:45 svc1 kernel: #16 0xffffffff80ceb2e0 at iflib_if_transmit+0x90 Jul 10 19:11:45 svc1 kernel: #17 0xffffffff80cd21e4 at ether_output_frame+0xb4 Jul 10 19:11:45 svc1 kernel: Uptime: 1d7h30m15s Jul 10 19:11:45 svc1 kernel: Dumping 2212 out of 8153 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% Jul 10 19:11:45 svc1 kernel: Dump complete Hardware is: FreeBSD 12.0-STABLE r349857 GENERIC amd64 FreeBSD clang version 8.0.0 (tags/RELEASE_800/final 356365) (based on LLVM 8.0.0) VT(vga): text 80x25 CPU: Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz (2294.61-MHz K8-class CPU) Origin="GenuineIntel" Id=0x50654 Family=0x6 Model=0x55 Stepping=4 Features=0xf83fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,SS> Features2=0xfffa3203<SSE3,PCLMULQDQ,SSSE3,FMA,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND,HV> AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM> AMD Features2=0x121<LAHF,ABM,Prefetch> Structured Extended Features=0xd19f6ffb<FSGSBASE,TSCADJ,BMI1,HLE,AVX2,FDPEXC,SMEP,BMI2,ERMS,INVPCID,RTM,NFPUSG,MPX,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,AVX512CD,AVX512BW,AVX512VL> Structured Extended Features2=0x8<PKU> Structured Extended Features3=0x2c000000<IBPB,STIBP,ARCH_CAP> XSAVE Features=0xb<XSAVEOPT,XSAVEC,XSAVES> IA32_ARCH_CAPS=0x4<RSBA> TSC: P-state invariant Hypervisor: Origin = "VMwareVMware" real memory = 8589934592 (8192 MB) avail memory = 8274690048 (7891 MB) ACPI APIC Table: <PTLTD APIC > FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs FreeBSD/SMP: 2 package(s) x 1 core(s) acpi0: <INTEL 440BX> on motherboard acpi0: Power Button (fixed) hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0 Timecounter "HPET" frequency 14318180 Hz quality 950 cpu0: <ACPI CPU> numa-domain 0 on acpi0 attimer0: <AT timer> port 0x40-0x43 irq 0 on acpi0 Timecounter "i8254" frequency 1193182 Hz quality 0 Event timer "i8254" frequency 1193182 Hz quality 100 atrtc0: <AT realtime clock> port 0x70-0x71 irq 8 on acpi0 atrtc0: registered as a time-of-day clock, resolution 1.000000s Event timer "RTC" frequency 32768 Hz quality 0 Timecounter "ACPI-fast" frequency 3579545 Hz quality 900 acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0 vmx0: <VMware VMXNET3 Ethernet Adapter> port 0x5000-0x500f mem 0xfd4fc000-0xfd4fcfff,0xfd4fd000-0xfd4fdfff,0xfd4fe000-0xfd4fffff irq 19 at device 0.0 on pci4 vmx0: Using 512 TX descriptors and 256 RX descriptors vmx0: Using 2 RX queues 2 TX queues vmx0: failed to allocate 3 MSI-X vectors, err: 6 vmx0: Using an MSI interrupt vmx0: Ethernet address: 00:50:56:a6:9e:f2 vmx0: netmap queues/slots: TX 1/512, RX 1/512 I can provide further information if needed.
Comment 1 Juraj Lutter 2019-07-10 20:15:07 UTC
Created attachment 205671 [details] savecore crash description
Comment 2 Juraj Lutter 2019-07-18 19:50:14 UTC
Please! Has anyone had any chance to look at this? Thanks.
Comment 3 Mark Johnston 2019-07-18 20:19:45 UTC
Created attachment 205876 [details] proposed patch Based on the line numbers, vmxnet3_isc_txd_credits_update() is being reentered. I think there is a window where vxcr_next is an invalid queue index; if we take an interrupt during that window, we may reference invalid memory. Please try the attached patch, which attempts to bandage the problem.
Comment 4 Patrick Kelsey 2019-07-18 22:41:28 UTC
(In reply to Mark Johnston from comment #3) > Created attachment 205876 [details] > proposed patch > > Based on the line numbers, vmxnet3_isc_txd_credits_update() is being > reentered. I think there is a window where vxcr_next is an invalid queue > index; if we take an interrupt during that window, we may reference invalid > memory. > > Please try the attached patch, which attempts to bandage the problem. The problem isn't that vmxnet3_isc_txd_credits_update() needs to be re-enterable, the problem is that r347221 (merged to stable/12 in r349112) changed iflib_legacy_setup() in a way that assumes that all iflib drivers operate their interrupts in combined RXTX mode. The vmx driver is one such driver that does not. For vmx, this change results in access to the tx state from multiple contexts (interrupt and group task context) whereas the tx state management is designed to only be accessed from a single context (group task context). iflib_legacy_setup() needs to be fixed.
Comment 5 Mark Johnston 2019-07-19 00:04:02 UTC
(In reply to Patrick Kelsey from comment #4) Marius, could you take a look at this?
Comment 6 Juraj Lutter 2019-07-24 20:29:09 UTC
Just for the record, the "proposed patch" from Mark did help, panics are now gone. If iflib should be fixed, I'm willing to help.
Comment 7 Marius Strobl 2019-07-31 15:27:41 UTC
I'd say that with an INTx-type interrupt or a single MSI vector, i. e. all the "legacy" interrupt support iflib is about, there's just no other way than to operate in combined RXTX mode (as opposed to multiple MSI-X vectors which can be associated to either RX or TX on a per-vector basis). Thus, - as actually already written in my private e-mail to pkelsey@ and the submitter on June 25th - I fully agree with markj@ analysis that vmxnet3_isc_txd_credits_update() currently isn't reentrant as well as with his fix (I'd probably code it to be more in line with the index updating in vmxnet3_isc_rxd_available(), though). However, as I also already wrote in said e-mail, on top of that it isn't obvious to me that it's safe to update txc->vxcr_next and txc->vxcr_gen non-atomically/lockless since r343291 dropped the locking around these operations.
Comment 8 Marius Strobl 2019-07-31 15:36:54 UTC
That said, if a driver has a way to further distinguish the interrupt cause, i. e. RX or TX, of a combined RXTX interrupt source for a particular device and wishes to deal with RX and TX separately it certainly may do so on its own, but shouldn't rely on common/generic iflib code for that.
Comment 9 Patrick Kelsey 2019-07-31 15:50:24 UTC
(In reply to Marius Strobl from comment #7) > I'd say that with an INTx-type interrupt or a single MSI vector, i. e. all > the "legacy" interrupt support iflib is about, there's just no other way > than to operate in combined RXTX mode (as opposed to multiple MSI-X vectors > which can be associated to either RX or TX on a per-vector basis). > Thus, - as actually already written in my private e-mail to pkelsey@ and the > submitter on June 25th - I fully agree with markj@ analysis that > vmxnet3_isc_txd_credits_update() currently isn't reentrant as well as with > his fix (I'd probably code it to be more in line with the index updating in > vmxnet3_isc_rxd_available(), though). > However, as I also already wrote in said e-mail, on top of that it isn't > obvious to me that it's safe to update txc->vxcr_next and txc->vxcr_gen > non-atomically/lockless since r343291 dropped the locking around these > operations. Marius, you are missing the point, which is that you have failed to consider the case of drivers that don't use TX interrupts at all, of which the vmx driver is an example. Drivers that don't use TX interrupts still only have one queue-related interrupt in legacy mode. By assuming that all drivers want to (i.e., are designed to) do TX processing during their RX interrupts, even when those drivers have clearly indicated they are not implementing TX mode or RXTX mode hardware interrupts in their interface to iflib, you are breaking drivers such as vmx. I fully agree that vmxnet3_isc_txd_credits_update() is not re-entrant - because it was explicitly designed on the basis that it does not have to be. Anyone of course is welcome to redesign the vmx driver under changed design assumptions, and I would caution anyone going down that path that the whole vmx driver should be re-examined in this case. I think it would be better to fix the iflib legacy mode interrupt setup so it does not assume it can invoke TX processing from the RX interrupt unless the driver has indicated it is using TXRX mode, or perhaps indicated that it is using hardware-based TX interrupts.
Comment 10 Mark Johnston 2019-08-13 00:30:34 UTC
Are either of you planning to work on this?
Comment 11 Juraj Lutter 2019-08-25 20:52:54 UTC
Do you happen to know if r351276 can fix this?
Comment 13 Marius Strobl 2019-08-28 18:00:44 UTC
Could you please test whether using the patch from attachment #206985 [details] instead of #205876 solves the problem?
Comment 14 Juraj Lutter 2019-09-03 08:05:52 UTC
The patch provided was, obviously, against HEAD. I adopted it for stable/12, applied and so far the system is stable. I will allow 2 more days (as the vacations are now over and that system will be under high load within next days) for final conclusion.
Comment 15 Juraj Lutter 2019-09-07 11:03:16 UTC
I guess we are good. System with this patch survived almost one week of normal operation.
Comment 16 Mark Johnston 2019-09-23 21:54:33 UTC
Is there any reason not to commit this? It would need to be MFCed soon for 12.1.
Comment 17 Kubilay Kocak 2019-09-24 03:41:40 UTC
^Triage: - Set to more appropriate default Assignee, keep virtualization CC'd - Needs a (real) Assignee to be In Progress Once committed to head and merged, please set maintainer-approval on the attachment to ? email@example.com for an explicit approval request
Comment 19 commit-hook 2019-09-30 15:59:40 UTC
A commit references this bug: Author: markj Date: Mon Sep 30 15:59:08 UTC 2019 New revision: 352906 URL: https://svnweb.freebsd.org/changeset/base/352906 Log: Add IFLIB_SINGLE_IRQ_RX_ONLY. As of r347221 the iflib legacy interrupt mode setup assumes that drivers perform both receive and transmit processing from the interrupt handler. This assumption is invalid in the vmxnet3 driver, so introduce the IFLIB_SINGLE_IRQ_RX_ONLY flag to make iflib avoid tx processing in the interrupt handler. PR: 239118 Reported and tested by: Juraj Lutter <firstname.lastname@example.org> Obtained from: marius Reviewed by: gallatin MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D21831 Changes: head/sys/dev/vmware/vmxnet3/if_vmx.c head/sys/net/iflib.c head/sys/net/iflib.h
Comment 20 Kubilay Kocak 2019-10-01 02:22:14 UTC
Comment on attachment 206985 [details] patch ^Triage: Obsoleted by https://reviews.freebsd.org/D21831
Comment 21 commit-hook 2019-10-03 14:55:43 UTC
A commit references this bug: Author: markj Date: Thu Oct 3 14:55:08 UTC 2019 New revision: 353051 URL: https://svnweb.freebsd.org/changeset/base/353051 Log: MFC r352906: Add IFLIB_SINGLE_IRQ_RX_ONLY. PR: 239118 Changes: _U stable/12/ stable/12/sys/dev/vmware/vmxnet3/if_vmx.c stable/12/sys/net/iflib.c stable/12/sys/net/iflib.h
Comment 22 commit-hook 2019-10-03 15:23:48 UTC
A commit references this bug: Author: markj Date: Thu Oct 3 15:23:39 UTC 2019 New revision: 353052 URL: https://svnweb.freebsd.org/changeset/base/353052 Log: MFS r353051: Add IFLIB_SINGLE_IRQ_RX_ONLY. PR: 239118 Approved by: re (gjb) Changes: _U releng/12.1/ releng/12.1/sys/dev/vmware/vmxnet3/if_vmx.c releng/12.1/sys/net/iflib.c releng/12.1/sys/net/iflib.h