Summary: | if_vmx(4): iflib - Panic with INVARIANTS: Memory modified after free (12.1-pre-QA) | ||
---|---|---|---|
Product: | Base System | Reporter: | Harald Schmalzbauer <bugzilla.freebsd> |
Component: | kern | Assignee: | Patrick Kelsey <pkelsey> |
Status: | Closed FIXED | ||
Severity: | Affects Some People | CC: | avg, chris, dmilith, dpetrov67, emaste, erj, marius, markj, ncrogers, net, pkelsey |
Priority: | --- | Keywords: | crash |
Version: | 12.0-STABLE | Flags: | koobs:
mfc-stable12?
koobs: mfc-stable11? |
Hardware: | Any | ||
OS: | Any | ||
Bug Depends on: | |||
Bug Blocks: | 240700 |
Description
Harald Schmalzbauer
2019-09-16 07:45:31 UTC
I guess this panic with an non-INVARIANTS kernel (wihtout any other debug OPTIONS) is related. This is happening quiet early after some traffic tests utilizing if_vmx(4). 12.1-prerelease without debug options: Fatal trap 12: page fault while in kernel mode cpuid = 1; apic id = 01 fault virtual address = 0x0 fault code = supervisor write data, page not present instruction pointer = 0x20:0xffffffff806ffdc2 stack pointer = 0x28:0xfffffe0040557900 frame pointer = 0x28:0xfffffe00405579e0 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 0 (if_io_tqg_1) trap number = 12 panic: page fault cpuid = 1 time = 1568626299 KDB: stack backtrace: #0 0xffffffff8061f047 at kdb_backtrace+0x67 #1 0xffffffff805d33bd at vpanic+0x19d #2 0xffffffff805d3213 at panic+0x43 #3 0xffffffff80941d2c at trap_fatal+0x39c #4 0xffffffff80941d79 at trap_pfault+0x49 #5 0xffffffff8094136f at trap+0x29f #6 0xffffffff8091b92c at calltrap+0x8 #7 0xffffffff8061d904 at gtaskqueue_run_locked+0x144 #8 0xffffffff8061d568 at gtaskqueue_thread_loop+0x98 #9 0xffffffff8059b103 at fork_exit+0x83 #10 0xffffffff8091c96e at fork_trampoline+0xe #3 0xffffffff805d3213 in panic (fmt=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/kern_shutdown.c:804 #4 0xffffffff80941d2c in trap_fatal (frame=<value optimized out>, eva=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_12/src/sys/amd64/amd64/trap.c:943 #5 0xffffffff80941d79 in trap_pfault (frame=0xfffffe0040557840, usermode=0) at RELENG_12/src/sys/amd64/include/pcpu.h:234 #6 0xffffffff8094136f in trap (frame=0xfffffe0040557840) at /usr/local/share/deploy-tools/RELENG_12/src/sys/amd64/amd64/trap.c:443 #7 0xffffffff8091b92c in calltrap () at /usr/local/share/deploy-tools/RELENG_12/src/sys/amd64/amd64/exception.S:289 #8 0xffffffff806ffdc2 in _task_fn_rx (context=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_12/src/sys/net/iflib.c:2614 #9 0xffffffff8061d904 in gtaskqueue_run_locked (queue=0xfffff8000206c100) at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/subr_gtaskqueue.c:378 #10 0xffffffff8061d568 in gtaskqueue_thread_loop (arg=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/subr_gtaskqueue.c:559 #11 0xffffffff8059b103 in fork_exit (callout=0xffffffff8061d4d0 <gtaskqueue_thread_loop>, arg=0xfffffe00025fe020, frame=0xfffffe0040557ac0) at /usr/local/share/deploy-tools/RELENG_12/src/sys/kern/kern_fork.c:1065 #12 0xffffffff8091c96e in fork_trampoline () at /usr/local/share/deploy-tools/RELENG_12/src/sys/amd64/amd64/exception.S:1077 #13 0x0000000000000000 in ?? () Thanks for any help, -harry Maybe your issue was similar to the one found in this bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239118, which got fixed. Could you retest? (In reply to Eric Joyner from comment #2) Thanks for the hint, I also thought this could solve/influence the panic, but won't find time before the weekend. I also found that it only happens when I use a MTU of 9000 bytes - if I don't confuse things from memory… Will update soon. -harry tested with RC1: Still same panic as soon as if_vmx(4) get's load _and_ jumbo frames are in use (mtu 9000). Unfortunately, solid vmxnet3 support is crucial to my setups. I managed to patch the vmware vmxnet3 guest driver to work with FreeBSD 12, but it also suffers from panics… The vmware vmxnet3 driver provides doubled transfer rates compared to if_vmx(3), but my skills and time don't last to fix any of both myself. In case anyone is interested in the vmxnet3 patch I'll provide on request of course, but won't pollute this bug report with anything completely different. -harry (In reply to Harald Schmalzbauer from comment #4) @Herald Can you confirm (explicitly) that the panic does *not* occur with jumbo frames *disabled/not configured* ? (In reply to Kubilay Kocak from comment #5) As far as I could briefly test (some iperf3 streams), using if_vmx(4) does _not_ lead to panic if frames are <= 4k (and interface MTU is set to 9000). Also, using if_igb(4) with interface MTU of 9000 and frames as big as 9k is working fine with FreeBSD-12_RC1. Maybe 240700-block can be released if man page mentions the 4k frame size limit? Thanks, -harry (In reply to Harald Schmalzbauer from comment #6) We can leave/remove the blocker depending on the necessary/chosen resolution. In my opinion though, crashes (panics) should be resolved whether or not there are intended limits in specific scenarios, ie: "all crashes are bugs" I am seeing the same problem with FreeBSD CURRENT as of r351653 (early September). As far as I can tell, jumbo frames are not used. The interface mtu is 1500 and there are no routes with non-default mtu. The interface driver is also vmxnet3. Unread portion of the kernel message buffer: panic: Memory modified after free 0xfffff8017f069800(2048) val=9d671660 @ 0xfffff8017f069800 cpuid = 0 time = 1578053135 KDB: stack backtrace: stack1 db_trace_self_wrapper+0x2b vpanic+0x1a4 panic+0x43 trash_ctor+0x49 mb_ctor_clust+0x18 uma_zalloc_arg+0xa4d m_cljget+0x8a _iflib_fl_refill+0x511 _task_fn_rx+0xc06 gtaskqueue_run_locked+0x139 gtaskqueue_thread_loop+0x88 fork_exit+0x84 fork_trampoline+0xe KDB: enter: panic __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55 55 /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory. (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55 #1 doadump (textdump=4620928) at /usr/src/sys/kern/kern_shutdown.c:467 #2 0xffffffff8039641c in db_fncall_generic (addr=<optimized out>, args=<optimized out>, rv=<optimized out>, nargs=<optimized out>) at /usr/src/sys/ddb/db_command.c:620 #3 db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:668 #4 0xffffffff80395d39 in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=0) at /usr/src/sys/ddb/db_command.c:492 #5 0xffffffff8039acd8 in db_script_exec (scriptname=<optimized out>, warnifnotfound=<optimized out>) at /usr/src/sys/ddb/db_script.c:304 #6 0xffffffff8039ab23 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:326 #7 0xffffffff80398d2a in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:251 #8 0xffffffff807c90fc in kdb_trap (type=3, code=0, tf=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:696 #9 0xffffffff80abe2d4 in trap (frame=0xfffffe0000468580) at /usr/src/sys/amd64/amd64/trap.c:621 #10 <signal handler called> #11 kdb_enter (why=0xffffffff80bdbcb7 "panic", msg=0xffffffff80bdbcb7 "panic") at /usr/src/sys/kern/subr_kdb.c:483 #12 0xffffffff8077cff1 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:987 #13 0xffffffff8077d073 in panic (fmt=0xffffffff8115abd0 <cnputs_mtx+24> "") at /usr/src/sys/kern/kern_shutdown.c:921 #14 0xffffffff80a0d399 in trash_ctor (mem=0x80, size=4621632, arg=<optimized out>, flags=<optimized out>) at /usr/src/sys/vm/uma_dbg.c:82 #15 0xffffffff8075ad68 in mb_ctor_clust (mem=0xfffff8017f069800, size=2048, arg=0x0, how=128) at /usr/src/sys/kern/kern_mbuf.c:738 #16 0xffffffff80a0783d in uma_zalloc_arg (zone=0xfffff8000289c000, udata=0x0, flags=<optimized out>) at /usr/src/sys/vm/uma_core.c:2441 #17 0xffffffff80759aca in m_cljget (m=0x0, how=1, size=2048) at /usr/src/sys/kern/kern_mbuf.c:1392 #18 0xffffffff808b5d71 in _iflib_fl_refill (ctx=0xfffff80002a4b000, fl=<optimized out>, count=<optimized out>) at /usr/src/sys/net/iflib.c:1986 #19 0xffffffff808b17d6 in __iflib_fl_refill_lt (ctx=<optimized out>, fl=0xfffff80002a42000, max=24) at /usr/src/sys/net/iflib.c:2079 #20 iflib_rxeof (rxq=0xfffff80002a32000, budget=24) at /usr/src/sys/net/iflib.c:2823 #21 _task_fn_rx (context=0xfffff80002a32000) at /usr/src/sys/net/iflib.c:3795 #22 0xffffffff807c72a9 in gtaskqueue_run_locked (queue=0xfffff800020a6100) at /usr/src/sys/kern/subr_gtaskqueue.c:377 #23 0xffffffff807c7028 in gtaskqueue_thread_loop (arg=<optimized out>) at /usr/src/sys/kern/subr_gtaskqueue.c:558 #24 0xffffffff8073ce34 in fork_exit (callout=0xffffffff807c6fa0 <gtaskqueue_thread_loop>, arg=0xfffffe00003f9008, frame=0xfffffe0000468ac0) at /usr/src/sys/kern/kern_fork.c:1048 I have a crash dump and I can provide any data requested. I can also do some debugging myself if given some directions. We (Panzura) are still getting this panic semi-regularly. I have been trying to root cause the bug, but I am failing at it so far. I am afraid that we will to revert vmxnet3 code to the pre-iflib state. It's kind of a lose-lose situation, unfortunately. Not sure if that's relevant, but one thing I noticed is that the converted driver lost handling of rxcd->error field. The old code would do: if (rxcd->error) { rxq->vxrxq_stats.vmrxs_ierrors++; m_freem(m); return; } But in the iflib world the error is simply ignored. Also, to be honest, I do not understand how ifl_fragidx and ifl_rx_bitmap on the one hand and ifl_pidx and ifl_cidx on the other work together. The map potentially allows for an arbitrary mix of free and busy descriptors while cidx/pidx assume strictly linear iteration. I see one potential problem related to ifl_fragidx. _iflib_fl_refill() has this logic: frag_idx = fl->ifl_fragidx; <loop> bit_ffc_at(fl->ifl_rx_bitmap, frag_idx, fl->ifl_size, &frag_idx); if (frag_idx < 0) bit_ffc(fl->ifl_rx_bitmap, fl->ifl_size, &frag_idx); bit_set(fl->ifl_rx_bitmap, frag_idx); <end of loop> fl->ifl_fragidx = frag_idx; So, ifl_fragidx is used to store the latest set bit in ifl_rx_bitmap. bit_ffc_at() finds the first cleared bit at or after the start position. Typically, that means /after/ as the bit /at/ frag_idx is set. But let's consider this scenario. Somehow the hardware consumes descriptors very fast, faster than they are refilled. Let's say we refilled descriptors up to index N, so ifl_fragidx=N and ifl_pidx=N+1. Let's say the hardware consumed all descriptors available to it. That means that the whole ifl_rx_bitmap is clear and ifl_cidx=N+1. Now we do the next refill and we start searching from ifl_fragidx. That position is free now, so bit_ffc_at() will return it. At this point ifl_fragidx and ifl_pidx get out of sync. We populate various software resources by frag_idx, but we program the hardware by pidx. For example, we will allocate a cluster at index N, but program its bus address in a descriptor at index N+1. That will mess up things for a driver that expects that indexes are always advanced linearly. There is a simple solution. Either we should store frag_idx + 1 to ifl_fragidx for the benefit of the next refill or we should call bit_ffc_at() with frag_idx + 1 as a starting position. As to why the hardware can exhaust all descriptors in a free list. I think that there is a certain impedance mismatch. iflib_rxeof() takes its budget in terms of full packets and also iflib_rxd_avail() works in the same terms. __iflib_fl_refill_lt() is called with a limit equal to budget + 8. But the descriptors are used by packet fragments and a single packet may have many fragments. Also, a driver like vmx may waste some descriptors[*] for reasons that are known only to vmware. For example, in one crash dumps that I have here I see that iflib_rxeof() was processing its 11th packet (rx_pkts = 11, avail = 7, budget = 16) and that packet had iri_nfrags = 14. It's conceivable that such a batch could exhaust all descriptors populated by the last refill. [*] eop=1, sop=1, len=0 but a command descriptor 141 in rx queue 0 is "consumed": (kgdb) p $19.vxcr_u.rxcd[475] $22 = {rxd_idx = 141, pad1 = 0, eop = 1, sop = 1, qid = 0, rss_type = 0, no_csum = 0, pad2 = 0, rss_hash = 0, len = 0, error = 0, vlan = 0, vtag = 0, csum = 0, csum_ok = 0, udp = 0, tcp = 0, ipcsum_ok = 0, ipv6 = 0, ipv4 = 0, fragment = 0, fcs = 0, type = 3, gen = 1} (In reply to Andriy Gapon from comment #11) Regarding ignoring rxd->error in the iflib driver: Just as a piece of background information, one thing that is true is that rxd->error can only be set when rx->eop is set (according to reference driver sources). In the error case, we still need iflib to do all of the related completion queue/free list processing for descriptors from sop to eop, but we also have knowledge that the packet is 'bad'. The current treatment of the error indication case is based on the following. There is no facility in the iflib interface (now, or when this driver was written) to indicate this case to iflib. So the vmx driver submits the packet anyway, under the reasoning that it is already possible to receive damaged packets from the network (that pass the checksum checks and so on). It is perhaps not strictly optimal to not take advantage of the early knowledge that the packet is bad, but this does not introduce a new possible condition for the network stack, and this is a rare event. Considering this again, this approach does rely on the virtual device to not provide damaged fragment length indicators (meaning something greater than the configured buffer size) in the error case, and perhaps we should not trust it to do so, in which case the minimal change to what we currently have would be to check rxd->error, and if set, consider the length of the eop fragment to be zero. iflib probably needs an audit on all of the cases of receiving zero length fragments. I may have some time next week to re-analyze the current iflib descriptor/free list mechanism and how the vmx driver interacts with it in order to determine whether there is a bug in the vmx driver, the iflib implementation has shifted since the vmx driver was written in a way that breaks the vmx driver, or everything looks OK and there is some other root cause here. The following commits address this bug (the commit hook missed them because there was a typo in the reference to this bug in their commit logs): Author: pkelsey Date: Sat Mar 14 19:43:44 UTC 2020 New revision: 358995 URL: https://svnweb.freebsd.org/changeset/base/358995 Log: Fix iflib freelist state corruption This fixes a bug in iflib freelist management that breaks the required correspondence between freelist indexes and driver ring slots. PR: 243126, 243392, 240628 Reported by: avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer Reviewed by: avg, gallatin MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23943 Changes: head/sys/net/iflib.c Author: pkelsey Date: Sat Mar 14 19:55:06 UTC 2020 New revision: 358997 URL: https://svnweb.freebsd.org/changeset/base/358997 Log: Remove freelist contiguous-indexes assertion from rxd_frag_to_sd() The vmx driver is an example of an iflib driver that might report packets using non-contiguous descriptors (with unused descriptors either between received packets or between the fragments of a received packet), so this assertion needs to be removed. For such drivers, the freelist producer and consumer indexes don't relate directly to driver ring slots (the driver deals directly with freelist buffer indexes supplied by iflib during refill, and reports them with each fragment during packet reception), but do continue to be used by iflib for accounting, such as determining the number of ring slots that are refillable. PR: 243126, 243392, 240628 Reported by: avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer Reviewed by: gallatin MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23946 Changes: head/sys/net/iflib.c Author: pkelsey Date: Sat Mar 14 20:08:05 UTC 2020 New revision: 359000 URL: https://svnweb.freebsd.org/changeset/base/359000 Log: Fix if_vmx receive checksum offload bug and harden against the device skipping receive descriptors This fixes a bug where the checksum offload status of received packets was being taken from the first descriptor instead of the last, which affected LRO packets. The driver has been hardened against the device skipping receive descriptors, although it is not believed that this can occur given the way this implementation configures the receive rings. Additionally, for packets received with the error indicator set, the driver now forces the length of all fragments in that packet to zero prior to passing it to iflib. Such packets should wind up being discarded at some point in the stack anyway, but this removes any questions by killing them in the driver. Counters have been added (and exposed via sysctls) for skipped receive descriptors, zero-length packets received, and packets received with the error indicator set so that these conditions can be easily observed in the field. PR: 243126, 243392, 240628 Reported by: avg, alexandr.oleynikov@gmail.com, Harald Schmalzbauer Reviewed by: gallatin MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23949 Changes: head/sys/dev/vmware/vmxnet3/if_vmx.c head/sys/dev/vmware/vmxnet3/if_vmxvar.h ^Triage: Assign to committer resolving, pending MFC. If this doesn't need to go to stable/11, set mfc-stable11 to - Thanks Patrick! Very similar KP happens for me (every several hours) on 12.1/12.2 (build from 12/stable branch), but I don't even use vmnet but em network driver + vlan + bridge + tuntap. It's production build so I have only this in messages: ``` panic: Memory modified after free 0xfffff80004c25120(32) val=0 @ 0xfffff80004c25120 ``` iflib/vmx fixes were merged to stable/12 in r363844 and r363163. Please feel free to reopen if you believe there are lingering problems in 12.2. |