Bug 212168

Summary: [panic] [UFS] use-after-free panic (0xdeadc0dedeadc0de)
Product: Base System Reporter: Glen Barber <gjb>
Component: kernAssignee: freebsd-fs (Nobody) <fs>
Status: Closed FIXED    
Severity: Affects Many People CC: Andrew, bdrewery, cem, emaste, kbowling, kib, mckusick, peter, re
Priority: ---    
Version: 11.0-RC1   
Hardware: arm64   
OS: Any   
Attachments:
Description Flags
ufs_panic.txt none

Description Glen Barber freebsd_committer freebsd_triage 2016-08-26 05:17:58 UTC
Created attachment 174085 [details]
ufs_panic.txt

The 11.0-RC1 FreeBSD/aarch64 architecture currently still has WITNESS and INVARIANTS enabled (whether intentional or otherwise), however this appears to be architecture-agnostic.

The root filesystem is:

 /dev/ada0p2 on / (ufs, local, journaled soft-updates)

During a buildworld loop on two systems with this setup, a panic was observed on multiple occasions.

The backtrace is attached, and the panic message on the console was:

Fatal data abort:
  x0: ffff000aa48ff4c0
  x1: ffff000aa48ff4c0
  x2: ffff0000005c90ff
  x3:             10b6
  x4:                0
  x5:               16
  x6: ffff0000005c90ff
  x7:              db8
  x8: deadc0dedeadc0de
  x9:               1c
 x10:             8000
 x11:                0
 x12:                2
 x13: ffff0000007c07ec
 x14: ffff0000007c07a0
 x15:                b
 x16:             2710
 x17:              e00
 x18: ffff000b90150370
 x19: ffff000aa48ff4c0
 x20: ffff0000005e66d9
 x21: fffffd00278e0ce8
 x22:                0
 x23:         a0020020
 x24: fffffd00278e0db8
 x25: fffffd0e7aa588f0
 x26: fffffd0027914600
 x27:          a8b0d2f
 x28:          a8b0d73
 x29: ffff000b901503f0
 x30: ffff000b901503f0
  sp: ffff000b90150370
  lr: ffff0000004d1438
 elr: ffff0000004b542c
spsr:         60000345
 far: deadc0dedeadc376
 esr:         96000004
[ thread pid 44795 tid 101060 ]
Stopped at      softdep_disk_io_initiation+0x50:        ldr     x21, [x8, #664]

Following a cursory investigation by Peter:

FYI; I looked at the disassembled code from the cluster build, and observed
that the panic is right here:

        if ((wk = LIST_FIRST(&bp->b_dep)) == NULL)
                return;
        ump = VFSTOUFS(wk->wk_mp);
       ^^^^^^^^^^^^^^^^^^^

There are a series of dereferences if bp->... so that's not it.  However, wk
is the problem. LIST_FIRST is returning the value 0xdeadc0dedeadc0de so that
means there is a use-after free.

There are a couple of possibilities:

* there is an aarch64 specific bug in the interrupt handling or locking
somehow.  However, we have had multiple exact crashes on exactly this so it
really does not look like a race or locking bug.

* WITNESS / INVARIANTS are exposing a previously undetected use-after-free
softdep bug.  The act of having INVARIANTS/WITNESS on is causing an escalation
from a normally harmless bug to a full crash.

If I had to guess, compiling without INVARIANTS/WITNESS will likely sweep the
problem back under the rug so you can get package builds done.  If this
changes things then there is definitely a softdep bug in there.
Comment 1 Kirk McKusick freebsd_committer freebsd_triage 2016-08-26 15:25:41 UTC
From:    Konstantin Belousov <kostikbel@gmail.com>
Date:    Fri, 26 Aug 2016 13:51:00 +0300
To:      Kirk McKusick <mckusick@mckusick.com>
Subject: Re: [Bug 212168] [panic] [UFS] use-after-free panic
         (0xdeadc0dedeadc0de)
Cc:      gjb@freebsd.org

On Thu, Aug 25, 2016 at 10:56:20PM -0700, Kirk McKusick wrote:
>> From:    bugzilla-noreply@freebsd.org
>> Date:    Fri, 26 Aug 2016 05:19:37 +0000
>> To:      freebsd-fs@FreeBSD.org
>> Subject: [Bug 212168] [panic] [UFS] use-after-free panic (0xdeadc0dedeadc0de)
>>
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212168
>>
>> Glen Barber <gjb@FreeBSD.org> changed:
>>
>>            What    |Removed                     |Added
>> ----------------------------------------------------------------------------
>>            Assignee|freebsd-bugs@FreeBSD.org    |freebsd-fs@FreeBSD.org
>
> My first guess is that a buffer is being allocated and is not initializing
> the bp->b_dep list. Any thoughts?

Which buffer would it be ?

Note that buf_alloc() unconditionally performs LIST_INIT(&bp->b_dep).
More, buf_free() calls buf_deallocate() for non-empty b_dep, which must
panic much earlier.

I can only think of pbuf which has dandling b_dep for your hypothesis,
but then I do not understand how the dependencies could get attached to
it, even if the current buffer is cluster. I put a trivial patch below
to be extra careful with b_dep for pbufs. Please try it.

But really I do think that this is for normal buffer and the issue is
somewhere else. Of course, the fact that intensive testing on x86 and
somewhat less extensive testing on ARM and other platform did not show
anything similar makes the arm64 specific bug much more feasible theory.

diff --git a/sys/vm/vm_pager.c b/sys/vm/vm_pager.c
index fccd1c8..3e6c447 100644
--- a/sys/vm/vm_pager.c
+++ b/sys/vm/vm_pager.c
@@ -375,6 +375,7 @@ initpbuf(struct buf *bp)
        bp->b_ioflags = 0;
        bp->b_iodone = NULL;
        bp->b_error = 0;
+       LIST_INIT(&bp->b_dep);
        BUF_LOCK(bp, LK_EXCLUSIVE, NULL);
 }

@@ -472,6 +473,7 @@ relpbuf(struct buf *bp, int *pfreecnt)

        KASSERT(bp->b_vp == NULL, ("relpbuf with vp"));
        KASSERT(bp->b_bufobj == NULL, ("relpbuf with bufobj"));
+       KASSERT(LIST_EMPTY(&bp->b_dep), ("relpbuf with deps"));

        BUF_UNLOCK(bp);
Comment 2 Glen Barber freebsd_committer freebsd_triage 2016-08-26 19:00:59 UTC
I will try this patch on the second of two machines.  On the first, I have turned off SU+J for now, as well as WITNESS/INVARIANTS/etc., as there is a time constraint on package availability.

We do have one machine with which testing can be done to diagnose this more, as I have seen the same issue on both of two of these.
Comment 3 Andrew Turner freebsd_committer freebsd_triage 2016-08-27 11:07:35 UTC
Can you get the backtrace and ps output from ddb when this happens?
Comment 4 Peter Wemm freebsd_committer freebsd_triage 2016-08-27 19:17:36 UTC
From an earlier console log that Glen has been capturing:
Stopped at      softdep_disk_io_initiation+0x50:        ldr     x21, [x8, #664]
db> set $lines 0
db> bt
Tracing pid 44795 tid 101060 td 0xfffffd0027a859c0
db_trace_self() at db_stack_trace+0xf0
db_stack_trace() at db_command+0x254
db_command() at db_command_loop+0x60
db_command_loop() at db_trap+0xf0
db_trap() at kdb_trap+0x180
kdb_trap() at data_abort+0x19c
data_abort() at handle_el1h_sync+0x68
handle_el1h_sync() at ffs_geom_strategy+0x108
ffs_geom_strategy() at ffs_geom_strategy+0x108
ffs_geom_strategy() at bufwrite+0x210
bufwrite() at softdep_freefile+0x66c
softdep_freefile() at ffs_vfree+0x28
ffs_vfree() at ufs_inactive+0x2a4
ufs_inactive() at VOP_INACTIVE_APV+0xcc
VOP_INACTIVE_APV() at vinactive+0xb8
vinactive() at vputx+0x30c
vputx() at vn_close+0xf0
vn_close() at vn_closefile+0x58
vn_closefile() at closef+0x1f8
closef() at fdescfree_fds+0x8c
fdescfree_fds() at fdescfree+0x468
fdescfree() at exit1+0x48c
exit1() at sys_sys_exit+0x10
sys_sys_exit() at do_el0_sync+0x4c0
do_el0_sync() at handle_el0_sync+0x64
handle_el0_sync() at 0x41b670
db>

Every one of these I've seen has been in the exit->close->inactive->freefile->ffs_geom_strategy->softdep_disk_io_initiation path, and always at that precise location.  It's always a [0xdeadc0dedeadc0de + offset] memory read.

Unfortunately I can't see any ps output in the console for any of these.
Comment 5 Konstantin Belousov freebsd_committer freebsd_triage 2016-08-27 19:46:23 UTC
(In reply to Peter Wemm from comment #4)
There should be missed ffs_freefile() frame in the backtrace between bufwrite() and softdep_freefile().  If I am right, the failing buffer is actually for the inode block.
Comment 6 Peter Wemm freebsd_committer freebsd_triage 2016-08-28 05:30:20 UTC
It is very confusing as the compiler is doing extensive inlining.

As near as I can tell, the area you asked about is like this:

softdep_freefile():
                        handle_bufwait(inodedep, &freefile->fx_jwork);
7568                    clear_unlinked_inodedep(inodedep);

->
clear_unlinked_inodedep():
9685:           bwrite(bp);
                ACQUIRE_LOCK(ump);
-> 
bwrite():
393:      return (BO_WRITE(bp->b_bufobj, bp));
->
I'm confused after here.  The return address of the frame is softdep_freefile+0x66c, which is the inlined sys/buf.h BO_WRITE macro.

I *think* the chain continues from BO_WRITE() -> ffs_bufwrite() -> bufwrite() -> bstrategy()-> ffs_geom_strategy().

I am 100% confident that it is originating from clear_unlinked_inodedep().  99% sure it is the first bwrite(), not the second one at line 9699.

Is this of any help?
Comment 7 Peter Wemm freebsd_committer freebsd_triage 2016-08-28 05:38:07 UTC
FWIW; I'm comparing 'objdump --disassemble' vs adding the '-S -fverbose-asm' flags to the compile phases. That plus picking fragments out of the console log of past crashes.  The machine is running RC1 binaries so you can recreate them and look at them too.
Comment 8 Peter Wemm freebsd_committer freebsd_triage 2016-08-28 05:46:56 UTC
One more comment:
ffs_vfree() appears to be compiled as a tail jump into the body of ffs_freefile() rather than a call/return. This may lead to "interesting" backtraces.
Comment 9 Andrew Turner freebsd_committer freebsd_triage 2016-08-28 20:06:57 UTC
I think you can add '-fno-optimize-sibling-calls' to CFLAGS to disable the tail jumps, e.g. in sys/conf/Makefile.arm64.

From my reading of the asm wk is valid, wk->wk_mp is invalid. The VFSTOUFS macro contains:

#define VFSTOUFS(mp) ((struct ufsmount *)((mp)->mnt_data))
Comment 10 Kirk McKusick freebsd_committer freebsd_triage 2016-08-30 23:55:26 UTC
It appears that there is an item on the worklist that has been prematurely freed. This freeing should be impossible as anything on a worklist has its ONWORKLIST flag set and the worklist freeing routine will panic if the ONWORKLIST flag is set. However I have written the following check to see if we are finding freed items on the worklist in softdep_disk_io_initiation():

Index: sys/ufs/ffs/ffs_softdep.c
===================================================================
--- sys/ufs/ffs/ffs_softdep.c   (revision 305057)
+++ sys/ufs/ffs/ffs_softdep.c   (working copy)
@@ -9949,6 +9949,13 @@
 
        if ((wk = LIST_FIRST(&bp->b_dep)) == NULL)
                return;
+       if (wk->wk_type > D_LAST + 1) {
+               printf("dead worklist: %p next %p prev %p type %s state 0x%X\n",
+                   wk, wk->wk_list.le_next, wk->wk_list.le_prev,
+                   TYPENAME(wk->wk_type), wk->wk_state);
+               LIST_INIT(&bp->b_dep);
+               return;
+       }
        ump = VFSTOUFS(wk->wk_mp);
 
        marker.wk_type = D_LAST + 1;    /* Not a normal workitem */

I would appreciate your running with this patch and reporting if you get the "dead worklist" error message.
Comment 11 Andrew Turner freebsd_committer freebsd_triage 2016-08-31 12:12:37 UTC
I see that WORKLIST_INSERT will set ONWORKLIST in wk_state. Is this always locked? The existence of WORKLIST_INSERT_UNLOCKED seems to imply it's not. As arm64 has a weaker memory model than x86 the ordering of setting this bit and adding the item to the list is not guaranteed.
Comment 12 Konstantin Belousov freebsd_committer freebsd_triage 2016-08-31 15:06:24 UTC
(In reply to Andrew Turner from comment #11)
WORKLIST_INSERT_UNLOCKED() is used in only one place, where workhead is private.  Also, from what I understand, the workitems from that list cannot be attached to the inodeblock.
Comment 13 Glen Barber freebsd_committer freebsd_triage 2016-09-02 23:13:07 UTC
(In reply to Kirk McKusick from comment #10)
> It appears that there is an item on the worklist that has been prematurely
> freed. This freeing should be impossible as anything on a worklist has its
> ONWORKLIST flag set and the worklist freeing routine will panic if the
> ONWORKLIST flag is set. However I have written the following check to see if
> we are finding freed items on the worklist in softdep_disk_io_initiation():
> 
> Index: sys/ufs/ffs/ffs_softdep.c
> ===================================================================
> --- sys/ufs/ffs/ffs_softdep.c   (revision 305057)
> +++ sys/ufs/ffs/ffs_softdep.c   (working copy)
> @@ -9949,6 +9949,13 @@
>  
>         if ((wk = LIST_FIRST(&bp->b_dep)) == NULL)
>                 return;
> +       if (wk->wk_type > D_LAST + 1) {
> +               printf("dead worklist: %p next %p prev %p type %s state
> 0x%X\n",
> +                   wk, wk->wk_list.le_next, wk->wk_list.le_prev,
> +                   TYPENAME(wk->wk_type), wk->wk_state);
> +               LIST_INIT(&bp->b_dep);
> +               return;
> +       }
>         ump = VFSTOUFS(wk->wk_mp);
>  
>         marker.wk_type = D_LAST + 1;    /* Not a normal workitem */
> 
> I would appreciate your running with this patch and reporting if you get the
> "dead worklist" error message.

I do not know what information you want from ddb, and I never thought I'd ever say this, but "yay, the machine finally crashed again."

I will leave the machine in its current state unless instructed otherwise.

dead worklist: 0xfffffd080edd5400 next 0xdeadc0dedeadc0de prev 0xdeadc0dedeadc0de type ??? state 0xDEADC0
panic: Bad link elm 0xfffffd00b7e0ba00 prev->next != elm

Please let me know what additional information you want.
Comment 14 Andrew Turner freebsd_committer freebsd_triage 2016-09-25 15:42:12 UTC
I have setup one of the netperf ThunderX machines with UFS SU+J. I can trigger this issue with a buildworld in a loop. I tried adding logging to see what memory was allocated but didn't manage to trigger the issue.

I then tried adding an atomic memory barrier into worklist_insert and worklist_remove. I'm currently testing this change, and have yet to see the issue after about 2 hours.
Comment 15 Konstantin Belousov freebsd_committer freebsd_triage 2016-09-25 16:08:51 UTC
(In reply to Andrew Turner from comment #14)
Is trim enabled on the disk where the UFS volume is mounted, and on the volume itself (tunefs -t enable) ?
If yes, does the issue go away if you disable trim with tunefs ?
Comment 16 Andrew Turner freebsd_committer freebsd_triage 2016-09-25 16:32:35 UTC
(In reply to Konstantin Belousov from comment #15)
The disk doesn't support TRIM, and it's not enabled in the filesystem.

root@cavium2:~ # camcontrol identify /dev/ada0
pass0: <WDC WD5001AALS-00E3A0 05.01D05> ATA8-ACS SATA 2.x device
pass0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)

protocol              ATA/ATAPI-8 SATA 2.x
device model          WDC WD5001AALS-00E3A0
firmware revision     05.01D05
serial number         WD-WCATR2780527
WWN                   50014ee2af913f22
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 512, offset 0
LBA supported         268435455 sectors
LBA48 supported       976773168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6 

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
NCQ Queue Management           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    no
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no
automatic acoustic management  yes      no      254/0xFE        128/0x80
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              no       no
unload                         no       no
general purpose logging        yes      yes
free-fall                      no       no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA)      yes      no      976773168/976773168
HPA - Security                 no
root@cavium2:~ # tunefs -p /dev/ufs/thunderxroot
tunefs: POSIX.1e ACLs: (-a)                                disabled
tunefs: NFSv4 ACLs: (-N)                                   disabled
tunefs: MAC multilabel: (-l)                               disabled
tunefs: soft updates: (-n)                                 enabled
tunefs: soft update journaling: (-j)                       enabled
tunefs: gjournal: (-J)                                     disabled
tunefs: trim: (-t)                                         disabled
tunefs: maximum blocks per file in a cylinder group: (-e)  4096
tunefs: average file size: (-f)                            16384
tunefs: average number of files in a directory: (-s)       64
tunefs: minimum percentage of free space: (-m)             8%
tunefs: space to hold for metadata blocks: (-k)            6408
tunefs: optimization preference: (-o)                      time
tunefs: volume label: (-L)                                 thunderxroot
Comment 17 Konstantin Belousov freebsd_committer freebsd_triage 2016-09-25 19:37:15 UTC
(In reply to Andrew Turner from comment #16)
Ok.

If you copy/paste WORKLIST_INSERT_UNLOCKED and only add the barrier there, does the issue disappear as well ?

There is only one use of WORKLIST_INSERT_UNLOCKED in the ffs_softdep.c, and no uses of WORKLIST_REMOVE_UNLOCKED at all.  All other calls are for WORKLIST_INSERT/REMOVE(), which assert that the mount point' softdep lock is owned.  Of course, it might be some other access for read which is not under softdep lock.

The only use of WORKLIST_INSERT_UNLOCKED() is for the ffs_blkfree(), where some local worklist is formed from the items.  The list is processed in the same thread (passed to softdep_setup_blkfree()) in ffs_blkfree()->ffs_blkfree_cg() for !TRIM case.
Comment 18 Kirk McKusick freebsd_committer freebsd_triage 2016-09-30 13:27:34 UTC
(In reply to Andrew Turner from comment #14)
Has your addition of an atomic memory barrier into worklist_insert and worklist_remove solved the problem? If not, have you had a chance to test the patch that I sent to you to isolate the soft updates malloc pool?
Comment 19 Ed Maste freebsd_committer freebsd_triage 2022-04-03 15:02:57 UTC
Is this issue OBE?
Comment 20 Kirk McKusick freebsd_committer freebsd_triage 2022-04-03 18:27:28 UTC
(In reply to Ed Maste from comment #19)
Reading through the comments for this bug, it appears that it was being caused by the lack of a memory barrier on the ARM64. The fact that it was not appearing on any other architectures and that Andrew Turner's addition of a memory barrier instruction seemed to fix the problem leads me to believe that this report has been resolved (at least if Andrew committed his change). If Andrew's change has not been committed, then it seems like it should be.
Comment 21 Kirk McKusick freebsd_committer freebsd_triage 2022-04-15 00:21:18 UTC
This problem affects only arm64 and was fixed as described in comment 14.