Bug 233414

Summary: [PowerPC64] OPTIONS DEBUG_MEMGUARD results in unbootable kernel
Product: Base System Reporter: Sean Bruno <sbruno>
Component: kernAssignee: freebsd-ppc mailing list <ppc>
Status: New ---    
Severity: Affects Some People CC: alfredo.junior, jhibbits, kbowling, leandro.lupori, leonardo.bianconi, markj, mmacy, nwhitehorn
Priority: ---    
Version: CURRENT   
Hardware: powerpc   
OS: Any   
Description Flags
memguard hack
memguard hack none

Description Sean Bruno freebsd_committer 2018-11-22 20:50:45 UTC
In an effort to debug a user-after-free problem, I enabled DEBUG_MEMGUARD to try and pin this issue down.

Petitboot or OPAL cannot load the kernel after this is enabled and results in no output or messages logged to the console.
Comment 1 Mark Johnston freebsd_committer 2018-11-25 19:24:38 UTC
Could you please remove the body of memguard_init(), compile a DEBUG_MEMGUARD kernel and try to boot it?  That code doesn't run particularly early during boot, so I'm not sure why it would cause the boot to fail without seeing any console output, but maybe there's some later initialization step that's needed in order to see console output?
Comment 2 Mark Johnston freebsd_committer 2018-11-30 18:25:21 UTC
(In reply to Mark Johnston from comment #1)
Disregard this, apparently the problem only happens after enabling memguard of a particular zone/malloc type.
Comment 3 Leandro Lupori 2018-11-30 18:38:32 UTC
I started taking a look at this, also to be able to debug a user-after-free problem. In my case, however, the system boots and crashes only after I enable memguard of an UMA region via sysctl, as following:
sysctl vm.memguard.desc='128 Bucket'.

Then, if I run make -C /usr/src, for instance, I get a stack like this:

#0  vpanic (fmt=0xc0000000010a6980 "%s: recursing but non-recursive rw %s @ %s:%d\n", ap=0xe0000000ce0f12f8 "\300") at /usr/home/luporl/base/head/sys/kern/kern_shutdown.c:813
#1  0xc0000000006abf18 in panic (fmt=<optimized out>) at /usr/home/luporl/base/head/sys/kern/kern_shutdown.c:804
#2  0xc0000000006a6148 in __rw_wlock_hard (c=<optimized out>, v=13835058055423348736, file=0xc0000000010e73f8 "/usr/home/luporl/base/head/sys/vm/vm_kern.c", line=471)
    at /usr/home/luporl/base/head/sys/kern/kern_rwlock.c:954
#3  0xc0000000006a6a8c in _rw_wlock_cookie (c=<optimized out>, file=0xc0000000010e73f8 "/usr/home/luporl/base/head/sys/vm/vm_kern.c", line=471)
    at /usr/home/luporl/base/head/sys/kern/kern_rwlock.c:286
#4  0xc000000000a33664 in kmem_back_domain (domain=0, object=<optimized out>, addr=16140901064502083584, size=4096, flags=<optimized out>) at /usr/home/luporl/base/head/sys/vm/vm_kern.c:471
#5  0xc000000000a33924 in kmem_back (object=0xc0000000019194a8 <kernel_object_store>, addr=16140901064502083584, size=<optimized out>, flags=513)
    at /usr/home/luporl/base/head/sys/vm/vm_kern.c:540
#6  0xc000000000a2d5b4 in memguard_alloc (req_size=1024, flags=513) at /usr/home/luporl/base/head/sys/vm/memguard.c:351
#7  0xc000000000a2abd8 in uma_zalloc_arg (zone=0xc0000001ffffdb00, udata=0x80000020, flags=513) at /usr/home/luporl/base/head/sys/vm/uma_core.c:2436
#8  0xc000000000a2b528 in bucket_alloc (zone=0xc000000002000b00, udata=0x80000020, flags=513) at /usr/home/luporl/base/head/sys/vm/uma_core.c:428
#9  0xc000000000a2b0a0 in zone_alloc_bucket (flags=<optimized out>, domain=<optimized out>, udata=<optimized out>, zone=<optimized out>) at /usr/home/luporl/base/head/sys/vm/uma_core.c:2982
#10 uma_zalloc_arg (zone=0xc000000002000b00, udata=0x0, flags=1) at /usr/home/luporl/base/head/sys/vm/uma_core.c:2590
#11 0xc000000000a76194 in uma_zalloc (flags=<optimized out>, zone=<optimized out>) at /usr/home/luporl/base/head/sys/vm/uma.h:362
#12 alloc_pvo_entry (bootstrap=<optimized out>) at /usr/home/luporl/base/head/sys/powerpc/aim/mmu_oea64.c:374
#13 0xc000000000a7a354 in moea64_enter (mmu=0xc000000001a8e268 <mmu_kernel_obj>, pmap=0xc000000001a8eba8 <kernel_pmap_store>, va=16140901064502071296, m=0xc0000001f469d400, prot=3 '\003',
    flags=515, psind=<optimized out>) at /usr/home/luporl/base/head/sys/powerpc/aim/mmu_oea64.c:1365
#14 0xc000000000ab2658 in MMU_ENTER (_psind=<optimized out>, _flags=<optimized out>, _prot=<optimized out>, _p=<optimized out>, _va=<optimized out>, _pmap=<optimized out>,
    _mmu=0xc000000001a8e268 <mmu_kernel_obj>) at ./mmu_if.h:169
#15 pmap_enter (pmap=0xc000000001a8eba8 <kernel_pmap_store>, va=16140901064502071296, p=0xc0000001f469d400, prot=3 '\003', flags=515, psind=0 '\000')
    at /usr/home/luporl/base/head/sys/powerpc/powerpc/pmap_dispatch.c:150
#16 0xc000000000a33784 in kmem_back_domain (domain=0, object=<optimized out>, addr=16140901064502071296, size=4096, flags=<optimized out>) at /usr/home/luporl/base/head/sys/vm/vm_kern.c:498
#17 0xc000000000a33924 in kmem_back (object=0xc0000000019194a8 <kernel_object_store>, addr=16140901064502071296, size=<optimized out>, flags=513)
    at /usr/home/luporl/base/head/sys/vm/vm_kern.c:540
#18 0xc000000000a2d5b4 in memguard_alloc (req_size=1024, flags=513) at /usr/home/luporl/base/head/sys/vm/memguard.c:351
#19 0xc000000000a2abd8 in uma_zalloc_arg (zone=0xc0000001ffffdb00, udata=0x80000020, flags=513) at /usr/home/luporl/base/head/sys/vm/uma_core.c:2436
#20 0xc000000000a2b528 in bucket_alloc (zone=0xc000000002000b00, udata=0x80000020, flags=513) at /usr/home/luporl/base/head/sys/vm/uma_core.c:428
#21 0xc000000000a2b0a0 in zone_alloc_bucket (flags=<optimized out>, domain=<optimized out>, udata=<optimized out>, zone=<optimized out>) at /usr/home/luporl/base/head/sys/vm/uma_core.c:2982
#22 uma_zalloc_arg (zone=0xc000000002000b00, udata=0x0, flags=1) at /usr/home/luporl/base/head/sys/vm/uma_core.c:2590
#23 0xc000000000a76194 in uma_zalloc (flags=<optimized out>, zone=<optimized out>) at /usr/home/luporl/base/head/sys/vm/uma.h:362
#24 alloc_pvo_entry (bootstrap=<optimized out>) at /usr/home/luporl/base/head/sys/powerpc/aim/mmu_oea64.c:374
#25 0xc000000000a7a354 in vm.memguard.des (mmu=0xc000000001a8e268 <mmu_kernel_obj>, pmap=0xc000000002221130, va=34635493376, m=0xc0000001f469d460, prot=3 '\003', flags=1, psind=<optimized out>)
    at /usr/home/luporl/base/head/sys/powerpc/aim/mmu_oea64.c:1365
#26 0xc000000000ab2658 in MMU_ENTER (_psind=<optimized out>, _flags=<optimized out>, _prot=<optimized out>, _p=<optimized out>, _va=<optimized out>, _pmap=<optimized out>,
    _mmu=0xc000000001a8e268 <mmu_kernel_obj>) at ./mmu_if.h:169
#27 pmap_enter (pmap=0xc000000002221130, va=34635493376, p=0xc0000001f469d460, prot=3 '\003', flags=1, psind=0 '\000') at /usr/home/luporl/base/head/sys/powerpc/powerpc/pmap_dispatch.c:150
#28 0xc000000000a30d4c in vm_fault_hold (map=0xc000000002221000, vaddr=34635493376, fault_type=1 '\001', fault_flags=0, m_hold=0x0) at /usr/home/luporl/base/head/sys/vm/vm_fault.c:1296
#29 0xc000000000a31414 in vm_fault (map=0xc000000002221000, vaddr=34635493376, fault_type=1 '\001', fault_flags=0) at /usr/home/luporl/base/head/sys/vm/vm_fault.c:536
#30 0xc000000000ab493c in trap_pfault (frame=0xe0000000ce0f2840, user=1) at /usr/home/luporl/base/head/sys/powerpc/powerpc/trap.c:809
#31 0xc000000000ab5014 in trap (frame=0xe0000000ce0f2840) at /usr/home/luporl/base/head/sys/powerpc/powerpc/trap.c:272
#32 0xc000000000aa9fb4 in powerpc_interrupt (framep=0xe0000000ce0f2840) at /usr/home/luporl/base/head/sys/powerpc/powerpc/interrupt.c:127
#33 0xc000000000102ee0 in trapagain () at /usr/home/luporl/base/head/sys/powerpc/aim/trap_subr64.S:831

This is from a VM. I also happens on a physical host, but DDB stack trace doesn't have as much information.

What seems to me here is that moea64_enter() ends up using uma_zalloc() to allocate a pvo entry, that uses memguard_alloc(), that uses kmem_back(), that calls moea64_enter(). This loop is interrupted by the panic on the non-recursive kmem_back_domain() lock.
Comment 4 Mark Johnston freebsd_committer 2018-12-02 17:59:40 UTC
Created attachment 199759 [details]

(In reply to Leandro Lupori from comment #3)
This is not a powerpc-specific problem, it's a limitation of memguard: some of its internal operation involve allocating from UMA, which might call into memguard again.  The attached patch should allow you to guard most UMA buckets.  It's a total hack and isn't intended to be committed, but might help you make progress with the use-after-free.
Comment 5 Leandro Lupori 2018-12-03 16:35:41 UTC
(In reply to Mark Johnston from comment #4)

Ok, thanks for clarifying memguard operation.
I'm using the patch now. The kernel stopped panic'ing.
Now let me run poudriere for some time and check if memguard catches the use-after-free.
Comment 6 Leandro Lupori 2018-12-03 16:59:55 UTC
(In reply to Leandro Lupori from comment #5)

Unfortunately I hit the recursive call to kmem_back_domain() issue again.

root@ppcdevref:~ # panic: __rw_wlock_hard: recursing but non-recursive rw kernel vm object @ /usr/src/sys/vm/vm_kern.c:472

cpuid = 70
time = 1543855595
KDB: stack backtrace:
0xe000000173207c10: at .kdb_backtrace+0x5c
0xe000000173207d40: at .vpanic+0x1b4
0xe000000173207e00: at .panic+0x38
0xe000000173207e90: at .__rw_wlock_hard+0x174
0xe000000173207fa0: at ._rw_wlock_cookie+0x110
0xe000000173208040: at .kmem_back_domain+0x134
0xe000000173208130: at .kmem_back_domain+0x438
0xe0000001732081f0: at .kmem_back+0x18
0xe000000173208270: at .memguard_alloc+0x188
0xe000000173208360: at .uma_zalloc_arg+0xe0
0xe000000173208420: at .uma_zalloc_pcpu_arg+0x174
0xe0000001732084c0: at .uma_zfree_arg+0x4c4
0xe000000173208570: at .uma_zfree_pcpu_arg+0xc4
0xe000000173208600: at .uma_zalloc_arg+0x3c0
0xe0000001732086c0: at .moea64_get_unique_vsid+0x2e4
0xe000000173208740: at .moea64_enter+0xac
0xe000000173208840: at .pmap_enter+0xb4
0xe0000001732088f0: at .kmem_back_domain+0x27c
0xe0000001732089e0: at .kmem_malloc_domainset+0xc8
0xe000000173208ad0: at .uma_large_malloc_domain+0xa4
0xe000000173208b60: at .uma_large_malloc+0x1c
0xe000000173208be0: at .malloc+0x134
0xe000000173208c90: at .zfs_kmem_alloc+0x34
0xe000000173208d10: at .zio_buf_alloc+0xec
0xe000000173208db0: at .abd_alloc_linear+0xec
0xe000000173208e50: at .abd_alloc_for_io+0x34
0xe000000173208ed0: at .vdev_queue_io_to_issue+0x8c4
0xe000000173209420: at .vdev_queue_io_done+0x20c
0xe0000001732094f0: at .zio_vdev_io_done+0x12c
0xe000000173209590: at .zio_execute+0x1c4
0xe000000173209670: at .taskq_run_ent+0x30
0xe0000001732096f0: at .taskqueue_poll_is_busy+0x250
0xe0000001732097c0: at .taskqueue_thread_loop+0xac
0xe000000173209850: at .fork_exit+0xd0
0xe0000001732098f0: at .fork_trampoline+0x10
0xe000000173209920: at -0x4
KDB: enter: panic
[ thread pid 0 tid 101000 ]
Stopped at      .kdb_enter+0x60:        ld      r2, r1, 0x28

It is interesting that this time memguard_alloc() was not called twice, but it was called before the first call to kmem_back_domain() returned.

Any idea on how to work around this? Maybe force memguard to also guard malloc?
Comment 7 Leandro Lupori 2018-12-05 16:58:29 UTC
I've added the following check to cover the previous panic case, that was kmem_back_domain() being first called through zfs allocation routines and then through memguard_alloc():

At the beginning of kmem_back_domain():

if (VM_OBJECT_WOWNED(object))
    locked = true;

While it apparently solved the previous issue, I've hit another similar panic:

root@ppcdevref:~ # panic: _mtx_lock_sleep: recursed on non-recursive mutex page pv @ /usr/src/sys/powerpc/aim/mmu_oea64.c:1383

cpuid = 70
time = 1544027411
KDB: stack backtrace:
0xe0000001d37f8a00: at .kdb_backtrace+0x5c
0xe0000001d37f8b30: at .vpanic+0x1b4
0xe0000001d37f8bf0: at .panic+0x38
0xe0000001d37f8c80: at .__mtx_lock_sleep+0x178
0xe0000001d37f8d70: at .__mtx_lock_flags+0x160
0xe0000001d37f8e20: at .moea64_enter+0x21c
0xe0000001d37f8f20: at .pmap_enter+0xb4
0xe0000001d37f8fd0: at .kmem_back_domain+0x298
0xe0000001d37f90c0: at .kmem_back_domain+0x454
0xe0000001d37f9180: at .kmem_back+0x18
0xe0000001d37f9200: at .memguard_alloc+0x188
0xe0000001d37f92f0: at .uma_zalloc_arg+0xe0
0xe0000001d37f93b0: at .uma_zalloc_pcpu_arg+0x174
0xe0000001d37f9450: at .uma_zalloc_arg+0x5a8
0xe0000001d37f9510: at .slb_free_user_cache+0xb4
0xe0000001d37f95a0: at .allocate_user_vsid+0x30c
0xe0000001d37f9650: at .va_to_vsid+0x90
0xe0000001d37f96e0: at .moea64_extract+0x254
0xe0000001d37f9770: at .moea64_enter+0x254
0xe0000001d37f9870: at .moea64_enter_object+0xa8
0xe0000001d37f9920: at .pmap_enter_object+0xa8
0xe0000001d37f99d0: at ._vm_map_unlock+0x4dc
0xe0000001d37f9aa0: at .vm_map_insert+0x550
0xe0000001d37f9ba0: at .vm_map_fixed+0x134
0xe0000001d37f9c70: at .elf64_coredump+0x1050
0xe0000001d37f9d60: at .elf64_coredump+0x1230
0xe0000001d37f9e40: at .elf64_coredump+0x162c
0xe0000001d37f9f50: at .elf64_coredump+0x2494
0xe0000001d37fa0d0: at .kern_execve+0x6fc
0xe0000001d37fa4b0: at .sys_execve+0x74
0xe0000001d37fa5b0: at .trap+0x664
0xe0000001d37fa770: at .powerpc_interrupt+0x290
0xe0000001d37fa810: user SC trap by 0x810212ec8: srr1=0x900000000000f032
            r1=0x3fffffffffff8050 cr=0x44002022 xer=0x20000000 ctr=0x810212ec0 r2=0x81030ea90
KDB: enter: panic
[ thread pid 87812 tid 101772 ]
Stopped at      .kdb_enter+0x60:        ld      r2, r1, 0x28

I didn't investigate this further, but I start to question our current approach, as we may find many more issues like this.

So, I'm wondering if is there another change we could do in memguard to avoid having to work around non-recursive mutexes?

Or if is there another way to catch use-after-free issues without memguard?
Comment 8 Mark Johnston freebsd_committer 2018-12-11 02:35:17 UTC
Created attachment 200018 [details]
memguard hack

Ok, let's try a different approach.  Instead, when allocating buckets, avoid memguarded bucket zones when selecting for a zone used by the VM subsystem.  Please give the attached patch a try.  I was able to use it to guard a few different bucket zones while running postgres benchmarks in the background to generate load.
Comment 9 Leandro Lupori 2018-12-12 12:32:58 UTC
(In reply to Mark Johnston from comment #8)

With this new patch I was able to use memguard to guard the '128 Bucket' zone for more than an hour of package building. Unfortunately, later I got this panic:

panic: _mtx_lock_sleep: recursed on non-recursive mutex memguard arena @ /usr/src/sys/kern/subr_vmem.c:1184

cpuid = 40
time = 1544566934
KDB: stack backtrace:
0xe0000001dbb4e250: at .kdb_backtrace+0x5c
0xe0000001dbb4e380: at .vpanic+0x1b4
0xe0000001dbb4e440: at .panic+0x38
0xe0000001dbb4e4d0: at .__mtx_lock_sleep+0x178
0xe0000001dbb4e5c0: at .__mtx_lock_flags+0x160
0xe0000001dbb4e670: at .vmem_xalloc+0x2d0
0xe0000001dbb4e7b0: at .memguard_alloc+0xdc
0xe0000001dbb4e890: at .uma_zalloc_arg+0xe0
0xe0000001dbb4e950: at .uma_zalloc_pcpu_arg+0x178
0xe0000001dbb4e9f0: at .uma_zfree_arg+0x4c4
0xe0000001dbb4eaa0: at .uma_zfree_pcpu_arg+0xc8
0xe0000001dbb4eb30: at .uma_zalloc_arg+0x3c0
0xe0000001dbb4ebf0: at .uma_zalloc_pcpu_arg+0x178
0xe0000001dbb4ec90: at .uma_zalloc_arg+0x5a8
0xe0000001dbb4ed50: at .vmem_size+0x1400
0xe0000001dbb4ee10: at .vmem_xalloc+0x320
0xe0000001dbb4ef50: at .memguard_alloc+0xdc
0xe0000001dbb4f030: at .uma_zalloc_arg+0xe0
0xe0000001dbb4f0f0: at .uma_zalloc_pcpu_arg+0x178
0xe0000001dbb4f190: at .uma_zfree_arg+0x4c4
0xe0000001dbb4f240: at ._fdrop+0xa8
0xe0000001dbb4f2d0: at .closef+0x27c
0xe0000001dbb4f3d0: at .fdsetugidsafety+0x35c
0xe0000001dbb4f480: at .kern_close+0x1e8
0xe0000001dbb4f530: at .sys_close+0x18
0xe0000001dbb4f5b0: at .trap+0x664
0xe0000001dbb4f770: at .powerpc_interrupt+0x290
0xe0000001dbb4f810: user SC trap by 0x104829e8: srr1=0x900000000000f032
            r1=0x3fffffffffff3aa0 cr=0x24824042 xer=0x20000000 ctr=0x103b9910 r2=0x10671858
KDB: enter: panic
Comment 10 Mark Johnston freebsd_committer 2018-12-12 16:28:18 UTC
Created attachment 200057 [details]
memguard hack

There is a corner case in my previous patch.  Please try this one.
Comment 11 Leandro Lupori 2018-12-14 11:47:46 UTC
(In reply to Mark Johnston from comment #10)

MemGuard seems to be very stable now, with several hours of build, guarding different UMA bucket zones, with no panics.

Unfortunately, the original "memory modified after free" issue also stopped happening, with memguard enabled.

Sean Bruno, maybe you could try this patch on your POWER8 machines? You seem to be able to reproduce it more easily than I.
Comment 12 Sean Bruno freebsd_committer 2018-12-14 14:17:46 UTC
(In reply to Mark Johnston from comment #10)
Just so that its clear, this patch works for me on the Tyan P8 in the cluster.  I can boot a MemGuard kernel and turn the knobs on/off.  I'll leave it to you folks to decide if how you want to commit the fixes.

Copyright (c) 1992-2018 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 13.0-CURRENT #0 r342079M: Fri Dec 14 13:41:21 UTC 2018
    sbruno@build-13.freebsd.org:/usr/obj/powerpc.powerpc64/usr/src/sys/CLUSTER13 powerpc
gcc version 4.2.1 20070831 patched [FreeBSD]
WARNING: WITNESS option enabled, expect reduced performance.
        MEMGUARD map base: 0xe000000001024000
        MEMGUARD map size: 3355444 KBytes
VT: init without driver.
cpu0: IBM POWER8 revision 2.0, 4024.00 MHz
cpu0: Features2 ef000000<ARCH207,HTM,DSCR,ISEL,TAR,VCRYPTO,HTMNOSC>

As for debugging the original panic, I've set sysctl vm.memguard.desc="128 Bucket"
Comment 13 Sean Bruno freebsd_committer 2018-12-17 20:08:35 UTC
I'm guessing that memguard is making things too slow to catch whatever was crashing before.  I have been running the memguard patch from comment #10 for 3 days without issue:

sbruno@pylon.nyi:~ % sysctl vm.memguard
vm.memguard.maplimit: 3435974656
vm.memguard.mapstart: 16140901064512782336
vm.memguard.divisor: 10
vm.memguard.desc: 16 Bucket
vm.memguard.cursor: 16140901067941314560
vm.memguard.mapsize: 3435974656
vm.memguard.phys_limit: 1641385984
vm.memguard.wasted: 2380800
vm.memguard.wrapcnt: 134780
vm.memguard.numalloc: 673191
vm.memguard.fail_kva: 105
vm.memguard.fail_pgs: 0
vm.memguard.options: 1
vm.memguard.minsize: 0
vm.memguard.minsize_reject: 0
vm.memguard.frequency: 0
vm.memguard.frequency_hits: 0
sbruno@pylon.nyi:~ % uptime
 8:07PM  up 3 days,  2:45, 2 users, load averages: 4.03, 5.05, 5.15