Summary: | [panic] [UFS] use-after-free panic (0xdeadc0dedeadc0de) | ||||||
---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Glen Barber <gjb> | ||||
Component: | kern | Assignee: | freebsd-fs (Nobody) <fs> | ||||
Status: | Closed FIXED | ||||||
Severity: | Affects Many People | CC: | Andrew, bdrewery, cem, emaste, kbowling, kib, mckusick, peter, re | ||||
Priority: | --- | ||||||
Version: | 11.0-RC1 | ||||||
Hardware: | arm64 | ||||||
OS: | Any | ||||||
Attachments: |
|
From: Konstantin Belousov <kostikbel@gmail.com> Date: Fri, 26 Aug 2016 13:51:00 +0300 To: Kirk McKusick <mckusick@mckusick.com> Subject: Re: [Bug 212168] [panic] [UFS] use-after-free panic (0xdeadc0dedeadc0de) Cc: gjb@freebsd.org On Thu, Aug 25, 2016 at 10:56:20PM -0700, Kirk McKusick wrote: >> From: bugzilla-noreply@freebsd.org >> Date: Fri, 26 Aug 2016 05:19:37 +0000 >> To: freebsd-fs@FreeBSD.org >> Subject: [Bug 212168] [panic] [UFS] use-after-free panic (0xdeadc0dedeadc0de) >> >> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212168 >> >> Glen Barber <gjb@FreeBSD.org> changed: >> >> What |Removed |Added >> ---------------------------------------------------------------------------- >> Assignee|freebsd-bugs@FreeBSD.org |freebsd-fs@FreeBSD.org > > My first guess is that a buffer is being allocated and is not initializing > the bp->b_dep list. Any thoughts? Which buffer would it be ? Note that buf_alloc() unconditionally performs LIST_INIT(&bp->b_dep). More, buf_free() calls buf_deallocate() for non-empty b_dep, which must panic much earlier. I can only think of pbuf which has dandling b_dep for your hypothesis, but then I do not understand how the dependencies could get attached to it, even if the current buffer is cluster. I put a trivial patch below to be extra careful with b_dep for pbufs. Please try it. But really I do think that this is for normal buffer and the issue is somewhere else. Of course, the fact that intensive testing on x86 and somewhat less extensive testing on ARM and other platform did not show anything similar makes the arm64 specific bug much more feasible theory. diff --git a/sys/vm/vm_pager.c b/sys/vm/vm_pager.c index fccd1c8..3e6c447 100644 --- a/sys/vm/vm_pager.c +++ b/sys/vm/vm_pager.c @@ -375,6 +375,7 @@ initpbuf(struct buf *bp) bp->b_ioflags = 0; bp->b_iodone = NULL; bp->b_error = 0; + LIST_INIT(&bp->b_dep); BUF_LOCK(bp, LK_EXCLUSIVE, NULL); } @@ -472,6 +473,7 @@ relpbuf(struct buf *bp, int *pfreecnt) KASSERT(bp->b_vp == NULL, ("relpbuf with vp")); KASSERT(bp->b_bufobj == NULL, ("relpbuf with bufobj")); + KASSERT(LIST_EMPTY(&bp->b_dep), ("relpbuf with deps")); BUF_UNLOCK(bp); I will try this patch on the second of two machines. On the first, I have turned off SU+J for now, as well as WITNESS/INVARIANTS/etc., as there is a time constraint on package availability. We do have one machine with which testing can be done to diagnose this more, as I have seen the same issue on both of two of these. Can you get the backtrace and ps output from ddb when this happens? From an earlier console log that Glen has been capturing: Stopped at softdep_disk_io_initiation+0x50: ldr x21, [x8, #664] db> set $lines 0 db> bt Tracing pid 44795 tid 101060 td 0xfffffd0027a859c0 db_trace_self() at db_stack_trace+0xf0 db_stack_trace() at db_command+0x254 db_command() at db_command_loop+0x60 db_command_loop() at db_trap+0xf0 db_trap() at kdb_trap+0x180 kdb_trap() at data_abort+0x19c data_abort() at handle_el1h_sync+0x68 handle_el1h_sync() at ffs_geom_strategy+0x108 ffs_geom_strategy() at ffs_geom_strategy+0x108 ffs_geom_strategy() at bufwrite+0x210 bufwrite() at softdep_freefile+0x66c softdep_freefile() at ffs_vfree+0x28 ffs_vfree() at ufs_inactive+0x2a4 ufs_inactive() at VOP_INACTIVE_APV+0xcc VOP_INACTIVE_APV() at vinactive+0xb8 vinactive() at vputx+0x30c vputx() at vn_close+0xf0 vn_close() at vn_closefile+0x58 vn_closefile() at closef+0x1f8 closef() at fdescfree_fds+0x8c fdescfree_fds() at fdescfree+0x468 fdescfree() at exit1+0x48c exit1() at sys_sys_exit+0x10 sys_sys_exit() at do_el0_sync+0x4c0 do_el0_sync() at handle_el0_sync+0x64 handle_el0_sync() at 0x41b670 db> Every one of these I've seen has been in the exit->close->inactive->freefile->ffs_geom_strategy->softdep_disk_io_initiation path, and always at that precise location. It's always a [0xdeadc0dedeadc0de + offset] memory read. Unfortunately I can't see any ps output in the console for any of these. (In reply to Peter Wemm from comment #4) There should be missed ffs_freefile() frame in the backtrace between bufwrite() and softdep_freefile(). If I am right, the failing buffer is actually for the inode block. It is very confusing as the compiler is doing extensive inlining. As near as I can tell, the area you asked about is like this: softdep_freefile(): handle_bufwait(inodedep, &freefile->fx_jwork); 7568 clear_unlinked_inodedep(inodedep); -> clear_unlinked_inodedep(): 9685: bwrite(bp); ACQUIRE_LOCK(ump); -> bwrite(): 393: return (BO_WRITE(bp->b_bufobj, bp)); -> I'm confused after here. The return address of the frame is softdep_freefile+0x66c, which is the inlined sys/buf.h BO_WRITE macro. I *think* the chain continues from BO_WRITE() -> ffs_bufwrite() -> bufwrite() -> bstrategy()-> ffs_geom_strategy(). I am 100% confident that it is originating from clear_unlinked_inodedep(). 99% sure it is the first bwrite(), not the second one at line 9699. Is this of any help? FWIW; I'm comparing 'objdump --disassemble' vs adding the '-S -fverbose-asm' flags to the compile phases. That plus picking fragments out of the console log of past crashes. The machine is running RC1 binaries so you can recreate them and look at them too. One more comment: ffs_vfree() appears to be compiled as a tail jump into the body of ffs_freefile() rather than a call/return. This may lead to "interesting" backtraces. I think you can add '-fno-optimize-sibling-calls' to CFLAGS to disable the tail jumps, e.g. in sys/conf/Makefile.arm64. From my reading of the asm wk is valid, wk->wk_mp is invalid. The VFSTOUFS macro contains: #define VFSTOUFS(mp) ((struct ufsmount *)((mp)->mnt_data)) It appears that there is an item on the worklist that has been prematurely freed. This freeing should be impossible as anything on a worklist has its ONWORKLIST flag set and the worklist freeing routine will panic if the ONWORKLIST flag is set. However I have written the following check to see if we are finding freed items on the worklist in softdep_disk_io_initiation(): Index: sys/ufs/ffs/ffs_softdep.c =================================================================== --- sys/ufs/ffs/ffs_softdep.c (revision 305057) +++ sys/ufs/ffs/ffs_softdep.c (working copy) @@ -9949,6 +9949,13 @@ if ((wk = LIST_FIRST(&bp->b_dep)) == NULL) return; + if (wk->wk_type > D_LAST + 1) { + printf("dead worklist: %p next %p prev %p type %s state 0x%X\n", + wk, wk->wk_list.le_next, wk->wk_list.le_prev, + TYPENAME(wk->wk_type), wk->wk_state); + LIST_INIT(&bp->b_dep); + return; + } ump = VFSTOUFS(wk->wk_mp); marker.wk_type = D_LAST + 1; /* Not a normal workitem */ I would appreciate your running with this patch and reporting if you get the "dead worklist" error message. I see that WORKLIST_INSERT will set ONWORKLIST in wk_state. Is this always locked? The existence of WORKLIST_INSERT_UNLOCKED seems to imply it's not. As arm64 has a weaker memory model than x86 the ordering of setting this bit and adding the item to the list is not guaranteed. (In reply to Andrew Turner from comment #11) WORKLIST_INSERT_UNLOCKED() is used in only one place, where workhead is private. Also, from what I understand, the workitems from that list cannot be attached to the inodeblock. (In reply to Kirk McKusick from comment #10) > It appears that there is an item on the worklist that has been prematurely > freed. This freeing should be impossible as anything on a worklist has its > ONWORKLIST flag set and the worklist freeing routine will panic if the > ONWORKLIST flag is set. However I have written the following check to see if > we are finding freed items on the worklist in softdep_disk_io_initiation(): > > Index: sys/ufs/ffs/ffs_softdep.c > =================================================================== > --- sys/ufs/ffs/ffs_softdep.c (revision 305057) > +++ sys/ufs/ffs/ffs_softdep.c (working copy) > @@ -9949,6 +9949,13 @@ > > if ((wk = LIST_FIRST(&bp->b_dep)) == NULL) > return; > + if (wk->wk_type > D_LAST + 1) { > + printf("dead worklist: %p next %p prev %p type %s state > 0x%X\n", > + wk, wk->wk_list.le_next, wk->wk_list.le_prev, > + TYPENAME(wk->wk_type), wk->wk_state); > + LIST_INIT(&bp->b_dep); > + return; > + } > ump = VFSTOUFS(wk->wk_mp); > > marker.wk_type = D_LAST + 1; /* Not a normal workitem */ > > I would appreciate your running with this patch and reporting if you get the > "dead worklist" error message. I do not know what information you want from ddb, and I never thought I'd ever say this, but "yay, the machine finally crashed again." I will leave the machine in its current state unless instructed otherwise. dead worklist: 0xfffffd080edd5400 next 0xdeadc0dedeadc0de prev 0xdeadc0dedeadc0de type ??? state 0xDEADC0 panic: Bad link elm 0xfffffd00b7e0ba00 prev->next != elm Please let me know what additional information you want. I have setup one of the netperf ThunderX machines with UFS SU+J. I can trigger this issue with a buildworld in a loop. I tried adding logging to see what memory was allocated but didn't manage to trigger the issue. I then tried adding an atomic memory barrier into worklist_insert and worklist_remove. I'm currently testing this change, and have yet to see the issue after about 2 hours. (In reply to Andrew Turner from comment #14) Is trim enabled on the disk where the UFS volume is mounted, and on the volume itself (tunefs -t enable) ? If yes, does the issue go away if you disable trim with tunefs ? (In reply to Konstantin Belousov from comment #15) The disk doesn't support TRIM, and it's not enabled in the filesystem. root@cavium2:~ # camcontrol identify /dev/ada0 pass0: <WDC WD5001AALS-00E3A0 05.01D05> ATA8-ACS SATA 2.x device pass0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes) protocol ATA/ATAPI-8 SATA 2.x device model WDC WD5001AALS-00E3A0 firmware revision 05.01D05 serial number WD-WCATR2780527 WWN 50014ee2af913f22 cylinders 16383 heads 16 sectors/track 63 sector size logical 512, physical 512, offset 0 LBA supported 268435455 sectors LBA48 supported 976773168 sectors PIO supported PIO4 DMA supported WDMA2 UDMA6 Feature Support Enabled Value Vendor read ahead yes yes write cache yes yes flush cache yes yes overlap no Tagged Command Queuing (TCQ) no no Native Command Queuing (NCQ) yes 32 tags NCQ Queue Management no NCQ Streaming no Receive & Send FPDMA Queued no SMART yes yes microcode download yes yes security yes no power management yes yes advanced power management no no automatic acoustic management yes no 254/0xFE 128/0x80 media status notification no no power-up in Standby yes no write-read-verify no no unload no no general purpose logging yes yes free-fall no no Data Set Management (DSM/TRIM) no Host Protected Area (HPA) yes no 976773168/976773168 HPA - Security no root@cavium2:~ # tunefs -p /dev/ufs/thunderxroot tunefs: POSIX.1e ACLs: (-a) disabled tunefs: NFSv4 ACLs: (-N) disabled tunefs: MAC multilabel: (-l) disabled tunefs: soft updates: (-n) enabled tunefs: soft update journaling: (-j) enabled tunefs: gjournal: (-J) disabled tunefs: trim: (-t) disabled tunefs: maximum blocks per file in a cylinder group: (-e) 4096 tunefs: average file size: (-f) 16384 tunefs: average number of files in a directory: (-s) 64 tunefs: minimum percentage of free space: (-m) 8% tunefs: space to hold for metadata blocks: (-k) 6408 tunefs: optimization preference: (-o) time tunefs: volume label: (-L) thunderxroot (In reply to Andrew Turner from comment #16) Ok. If you copy/paste WORKLIST_INSERT_UNLOCKED and only add the barrier there, does the issue disappear as well ? There is only one use of WORKLIST_INSERT_UNLOCKED in the ffs_softdep.c, and no uses of WORKLIST_REMOVE_UNLOCKED at all. All other calls are for WORKLIST_INSERT/REMOVE(), which assert that the mount point' softdep lock is owned. Of course, it might be some other access for read which is not under softdep lock. The only use of WORKLIST_INSERT_UNLOCKED() is for the ffs_blkfree(), where some local worklist is formed from the items. The list is processed in the same thread (passed to softdep_setup_blkfree()) in ffs_blkfree()->ffs_blkfree_cg() for !TRIM case. (In reply to Andrew Turner from comment #14) Has your addition of an atomic memory barrier into worklist_insert and worklist_remove solved the problem? If not, have you had a chance to test the patch that I sent to you to isolate the soft updates malloc pool? Is this issue OBE? (In reply to Ed Maste from comment #19) Reading through the comments for this bug, it appears that it was being caused by the lack of a memory barrier on the ARM64. The fact that it was not appearing on any other architectures and that Andrew Turner's addition of a memory barrier instruction seemed to fix the problem leads me to believe that this report has been resolved (at least if Andrew committed his change). If Andrew's change has not been committed, then it seems like it should be. This problem affects only arm64 and was fixed as described in comment 14. |
Created attachment 174085 [details] ufs_panic.txt The 11.0-RC1 FreeBSD/aarch64 architecture currently still has WITNESS and INVARIANTS enabled (whether intentional or otherwise), however this appears to be architecture-agnostic. The root filesystem is: /dev/ada0p2 on / (ufs, local, journaled soft-updates) During a buildworld loop on two systems with this setup, a panic was observed on multiple occasions. The backtrace is attached, and the panic message on the console was: Fatal data abort: x0: ffff000aa48ff4c0 x1: ffff000aa48ff4c0 x2: ffff0000005c90ff x3: 10b6 x4: 0 x5: 16 x6: ffff0000005c90ff x7: db8 x8: deadc0dedeadc0de x9: 1c x10: 8000 x11: 0 x12: 2 x13: ffff0000007c07ec x14: ffff0000007c07a0 x15: b x16: 2710 x17: e00 x18: ffff000b90150370 x19: ffff000aa48ff4c0 x20: ffff0000005e66d9 x21: fffffd00278e0ce8 x22: 0 x23: a0020020 x24: fffffd00278e0db8 x25: fffffd0e7aa588f0 x26: fffffd0027914600 x27: a8b0d2f x28: a8b0d73 x29: ffff000b901503f0 x30: ffff000b901503f0 sp: ffff000b90150370 lr: ffff0000004d1438 elr: ffff0000004b542c spsr: 60000345 far: deadc0dedeadc376 esr: 96000004 [ thread pid 44795 tid 101060 ] Stopped at softdep_disk_io_initiation+0x50: ldr x21, [x8, #664] Following a cursory investigation by Peter: FYI; I looked at the disassembled code from the cluster build, and observed that the panic is right here: if ((wk = LIST_FIRST(&bp->b_dep)) == NULL) return; ump = VFSTOUFS(wk->wk_mp); ^^^^^^^^^^^^^^^^^^^ There are a series of dereferences if bp->... so that's not it. However, wk is the problem. LIST_FIRST is returning the value 0xdeadc0dedeadc0de so that means there is a use-after free. There are a couple of possibilities: * there is an aarch64 specific bug in the interrupt handling or locking somehow. However, we have had multiple exact crashes on exactly this so it really does not look like a race or locking bug. * WITNESS / INVARIANTS are exposing a previously undetected use-after-free softdep bug. The act of having INVARIANTS/WITNESS on is causing an escalation from a normally harmless bug to a full crash. If I had to guess, compiling without INVARIANTS/WITNESS will likely sweep the problem back under the rug so you can get package builds done. If this changes things then there is definitely a softdep bug in there.