In an effort to debug a use-after-free problem, I enabled DEBUG_MEMGUARD to try to pin the issue down. With this enabled, Petitboot or OPAL cannot load the kernel, and no output or messages are logged to the console.
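For reference, this boils down to one kernel config option plus a runtime sysctl; a minimal sketch based on memguard(9), using a UMA bucket zone as an example target:

options 	DEBUG_MEMGUARD		# MemGuard debugging allocator

# after boot, choose what to guard (a malloc(9) type or UMA zone name):
sysctl vm.memguard.desc='128 Bucket'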
Could you please remove the body of memguard_init(), compile a DEBUG_MEMGUARD kernel and try to boot it? That code doesn't run particularly early during boot, so I'm not sure why it would cause the boot to fail without any console output; maybe there's some later initialization step needed before console output appears?
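Something along these lines should do for the test (a sketch only; it assumes the memguard_init() prototype from sys/vm/memguard.h, which takes the parent vmem arena):

/* sys/vm/memguard.c: temporarily gut initialization for the boot test. */
void
memguard_init(vmem_t *parent)
{
	/* Body removed for debugging: skip carving out the MemGuard arena. */
}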
(In reply to Mark Johnston from comment #1)
Disregard this; apparently the problem only happens after enabling memguard for a particular zone/malloc type.
I started taking a look at this, also to be able to debug a use-after-free problem. In my case, however, the system boots and crashes only after I enable memguard on a UMA zone via sysctl, as follows: sysctl vm.memguard.desc='128 Bucket'. Then, if I run make -C /usr/src, for instance, I get a backtrace like this:

#0  vpanic (fmt=0xc0000000010a6980 "%s: recursing but non-recursive rw %s @ %s:%d\n", ap=0xe0000000ce0f12f8 "\300") at /usr/home/luporl/base/head/sys/kern/kern_shutdown.c:813
#1  0xc0000000006abf18 in panic (fmt=<optimized out>) at /usr/home/luporl/base/head/sys/kern/kern_shutdown.c:804
#2  0xc0000000006a6148 in __rw_wlock_hard (c=<optimized out>, v=13835058055423348736, file=0xc0000000010e73f8 "/usr/home/luporl/base/head/sys/vm/vm_kern.c", line=471) at /usr/home/luporl/base/head/sys/kern/kern_rwlock.c:954
#3  0xc0000000006a6a8c in _rw_wlock_cookie (c=<optimized out>, file=0xc0000000010e73f8 "/usr/home/luporl/base/head/sys/vm/vm_kern.c", line=471) at /usr/home/luporl/base/head/sys/kern/kern_rwlock.c:286
#4  0xc000000000a33664 in kmem_back_domain (domain=0, object=<optimized out>, addr=16140901064502083584, size=4096, flags=<optimized out>) at /usr/home/luporl/base/head/sys/vm/vm_kern.c:471
#5  0xc000000000a33924 in kmem_back (object=0xc0000000019194a8 <kernel_object_store>, addr=16140901064502083584, size=<optimized out>, flags=513) at /usr/home/luporl/base/head/sys/vm/vm_kern.c:540
#6  0xc000000000a2d5b4 in memguard_alloc (req_size=1024, flags=513) at /usr/home/luporl/base/head/sys/vm/memguard.c:351
#7  0xc000000000a2abd8 in uma_zalloc_arg (zone=0xc0000001ffffdb00, udata=0x80000020, flags=513) at /usr/home/luporl/base/head/sys/vm/uma_core.c:2436
#8  0xc000000000a2b528 in bucket_alloc (zone=0xc000000002000b00, udata=0x80000020, flags=513) at /usr/home/luporl/base/head/sys/vm/uma_core.c:428
#9  0xc000000000a2b0a0 in zone_alloc_bucket (flags=<optimized out>, domain=<optimized out>, udata=<optimized out>, zone=<optimized out>) at /usr/home/luporl/base/head/sys/vm/uma_core.c:2982
#10 uma_zalloc_arg (zone=0xc000000002000b00, udata=0x0, flags=1) at /usr/home/luporl/base/head/sys/vm/uma_core.c:2590
#11 0xc000000000a76194 in uma_zalloc (flags=<optimized out>, zone=<optimized out>) at /usr/home/luporl/base/head/sys/vm/uma.h:362
#12 alloc_pvo_entry (bootstrap=<optimized out>) at /usr/home/luporl/base/head/sys/powerpc/aim/mmu_oea64.c:374
#13 0xc000000000a7a354 in moea64_enter (mmu=0xc000000001a8e268 <mmu_kernel_obj>, pmap=0xc000000001a8eba8 <kernel_pmap_store>, va=16140901064502071296, m=0xc0000001f469d400, prot=3 '\003', flags=515, psind=<optimized out>) at /usr/home/luporl/base/head/sys/powerpc/aim/mmu_oea64.c:1365
#14 0xc000000000ab2658 in MMU_ENTER (_psind=<optimized out>, _flags=<optimized out>, _prot=<optimized out>, _p=<optimized out>, _va=<optimized out>, _pmap=<optimized out>, _mmu=0xc000000001a8e268 <mmu_kernel_obj>) at ./mmu_if.h:169
#15 pmap_enter (pmap=0xc000000001a8eba8 <kernel_pmap_store>, va=16140901064502071296, p=0xc0000001f469d400, prot=3 '\003', flags=515, psind=0 '\000') at /usr/home/luporl/base/head/sys/powerpc/powerpc/pmap_dispatch.c:150
#16 0xc000000000a33784 in kmem_back_domain (domain=0, object=<optimized out>, addr=16140901064502071296, size=4096, flags=<optimized out>) at /usr/home/luporl/base/head/sys/vm/vm_kern.c:498
#17 0xc000000000a33924 in kmem_back (object=0xc0000000019194a8 <kernel_object_store>, addr=16140901064502071296, size=<optimized out>, flags=513) at /usr/home/luporl/base/head/sys/vm/vm_kern.c:540
#18 0xc000000000a2d5b4 in memguard_alloc (req_size=1024, flags=513) at /usr/home/luporl/base/head/sys/vm/memguard.c:351
#19 0xc000000000a2abd8 in uma_zalloc_arg (zone=0xc0000001ffffdb00, udata=0x80000020, flags=513) at /usr/home/luporl/base/head/sys/vm/uma_core.c:2436
#20 0xc000000000a2b528 in bucket_alloc (zone=0xc000000002000b00, udata=0x80000020, flags=513) at /usr/home/luporl/base/head/sys/vm/uma_core.c:428
#21 0xc000000000a2b0a0 in zone_alloc_bucket (flags=<optimized out>, domain=<optimized out>, udata=<optimized out>, zone=<optimized out>) at /usr/home/luporl/base/head/sys/vm/uma_core.c:2982
#22 uma_zalloc_arg (zone=0xc000000002000b00, udata=0x0, flags=1) at /usr/home/luporl/base/head/sys/vm/uma_core.c:2590
#23 0xc000000000a76194 in uma_zalloc (flags=<optimized out>, zone=<optimized out>) at /usr/home/luporl/base/head/sys/vm/uma.h:362
#24 alloc_pvo_entry (bootstrap=<optimized out>) at /usr/home/luporl/base/head/sys/powerpc/aim/mmu_oea64.c:374
#25 0xc000000000a7a354 in moea64_enter (mmu=0xc000000001a8e268 <mmu_kernel_obj>, pmap=0xc000000002221130, va=34635493376, m=0xc0000001f469d460, prot=3 '\003', flags=1, psind=<optimized out>) at /usr/home/luporl/base/head/sys/powerpc/aim/mmu_oea64.c:1365
#26 0xc000000000ab2658 in MMU_ENTER (_psind=<optimized out>, _flags=<optimized out>, _prot=<optimized out>, _p=<optimized out>, _va=<optimized out>, _pmap=<optimized out>, _mmu=0xc000000001a8e268 <mmu_kernel_obj>) at ./mmu_if.h:169
#27 pmap_enter (pmap=0xc000000002221130, va=34635493376, p=0xc0000001f469d460, prot=3 '\003', flags=1, psind=0 '\000') at /usr/home/luporl/base/head/sys/powerpc/powerpc/pmap_dispatch.c:150
#28 0xc000000000a30d4c in vm_fault_hold (map=0xc000000002221000, vaddr=34635493376, fault_type=1 '\001', fault_flags=0, m_hold=0x0) at /usr/home/luporl/base/head/sys/vm/vm_fault.c:1296
#29 0xc000000000a31414 in vm_fault (map=0xc000000002221000, vaddr=34635493376, fault_type=1 '\001', fault_flags=0) at /usr/home/luporl/base/head/sys/vm/vm_fault.c:536
#30 0xc000000000ab493c in trap_pfault (frame=0xe0000000ce0f2840, user=1) at /usr/home/luporl/base/head/sys/powerpc/powerpc/trap.c:809
#31 0xc000000000ab5014 in trap (frame=0xe0000000ce0f2840) at /usr/home/luporl/base/head/sys/powerpc/powerpc/trap.c:272
#32 0xc000000000aa9fb4 in powerpc_interrupt (framep=0xe0000000ce0f2840) at /usr/home/luporl/base/head/sys/powerpc/powerpc/interrupt.c:127
#33 0xc000000000102ee0 in trapagain () at /usr/home/luporl/base/head/sys/powerpc/aim/trap_subr64.S:831

This is from a VM. It also happens on a physical host, but the DDB stack trace doesn't have as much information. What seems to be happening here is that moea64_enter() ends up using uma_zalloc() to allocate a pvo entry, which uses memguard_alloc(), which uses kmem_back(), which calls moea64_enter() again. This loop is interrupted by the panic on the non-recursive kernel object lock taken in kmem_back_domain().
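The loop trips the panic because the kernel object lock is a plain, non-recursive rw lock: the moment the same thread tries to write-lock it a second time, __rw_wlock_hard() detects the recursion and panics rather than silently deadlocking. As a userland analogy only (not kernel code), an error-checking pthread mutex shows the same behaviour:

#include <pthread.h>
#include <stdio.h>

int
main(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t m;

	pthread_mutexattr_init(&attr);
	/* Error-checking mutexes reject same-thread recursion, much like a
	 * non-recursive kernel rw lock. */
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&m, &attr);

	pthread_mutex_lock(&m);
	if (pthread_mutex_lock(&m) != 0)	/* second lock by the same thread */
		printf("recursive lock attempt rejected\n");
	pthread_mutex_unlock(&m);

	pthread_mutex_destroy(&m);
	pthread_mutexattr_destroy(&attr);
	return (0);
}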
Created attachment 199759 [details] workaround

(In reply to Leandro Lupori from comment #3)
This is not a powerpc-specific problem; it's a limitation of memguard: some of its internal operations involve allocating from UMA, which might call into memguard again. The attached patch should allow you to guard most UMA buckets. It's a total hack and isn't intended to be committed, but it might help you make progress with the use-after-free.
(In reply to Mark Johnston from comment #4)
Ok, thanks for clarifying memguard's operation. I'm using the patch now and the kernel stopped panicking. Now let me run poudriere for some time and check whether memguard catches the use-after-free.
(In reply to Leandro Lupori from comment #5)
Unfortunately I hit the recursive call to kmem_back_domain() issue again.

root@ppcdevref:~ # panic: __rw_wlock_hard: recursing but non-recursive rw kernel vm object @ /usr/src/sys/vm/vm_kern.c:472
cpuid = 70
time = 1543855595
KDB: stack backtrace:
0xe000000173207c10: at .kdb_backtrace+0x5c
0xe000000173207d40: at .vpanic+0x1b4
0xe000000173207e00: at .panic+0x38
0xe000000173207e90: at .__rw_wlock_hard+0x174
0xe000000173207fa0: at ._rw_wlock_cookie+0x110
0xe000000173208040: at .kmem_back_domain+0x134
0xe000000173208130: at .kmem_back_domain+0x438
0xe0000001732081f0: at .kmem_back+0x18
0xe000000173208270: at .memguard_alloc+0x188
0xe000000173208360: at .uma_zalloc_arg+0xe0
0xe000000173208420: at .uma_zalloc_pcpu_arg+0x174
0xe0000001732084c0: at .uma_zfree_arg+0x4c4
0xe000000173208570: at .uma_zfree_pcpu_arg+0xc4
0xe000000173208600: at .uma_zalloc_arg+0x3c0
0xe0000001732086c0: at .moea64_get_unique_vsid+0x2e4
0xe000000173208740: at .moea64_enter+0xac
0xe000000173208840: at .pmap_enter+0xb4
0xe0000001732088f0: at .kmem_back_domain+0x27c
0xe0000001732089e0: at .kmem_malloc_domainset+0xc8
0xe000000173208ad0: at .uma_large_malloc_domain+0xa4
0xe000000173208b60: at .uma_large_malloc+0x1c
0xe000000173208be0: at .malloc+0x134
0xe000000173208c90: at .zfs_kmem_alloc+0x34
0xe000000173208d10: at .zio_buf_alloc+0xec
0xe000000173208db0: at .abd_alloc_linear+0xec
0xe000000173208e50: at .abd_alloc_for_io+0x34
0xe000000173208ed0: at .vdev_queue_io_to_issue+0x8c4
0xe000000173209420: at .vdev_queue_io_done+0x20c
0xe0000001732094f0: at .zio_vdev_io_done+0x12c
0xe000000173209590: at .zio_execute+0x1c4
0xe000000173209670: at .taskq_run_ent+0x30
0xe0000001732096f0: at .taskqueue_poll_is_busy+0x250
0xe0000001732097c0: at .taskqueue_thread_loop+0xac
0xe000000173209850: at .fork_exit+0xd0
0xe0000001732098f0: at .fork_trampoline+0x10
0xe000000173209920: at -0x4
KDB: enter: panic
[ thread pid 0 tid 101000 ]
Stopped at .kdb_enter+0x60: ld r2, r1, 0x28
db>

It is interesting that this time memguard_alloc() was not called twice, but it was called before the first call to kmem_back_domain() returned. Any idea on how to work around this? Maybe force memguard to also guard malloc?
I've added the following check to cover the previous panic case, in which kmem_back_domain() was first called through the ZFS allocation routines and then again through memguard_alloc(). At the beginning of kmem_back_domain():

	if (VM_OBJECT_WOWNED(object))
		locked = true;

While it apparently solved the previous issue, I've hit another, similar panic:

root@ppcdevref:~ # panic: _mtx_lock_sleep: recursed on non-recursive mutex page pv @ /usr/src/sys/powerpc/aim/mmu_oea64.c:1383
cpuid = 70
time = 1544027411
KDB: stack backtrace:
0xe0000001d37f8a00: at .kdb_backtrace+0x5c
0xe0000001d37f8b30: at .vpanic+0x1b4
0xe0000001d37f8bf0: at .panic+0x38
0xe0000001d37f8c80: at .__mtx_lock_sleep+0x178
0xe0000001d37f8d70: at .__mtx_lock_flags+0x160
0xe0000001d37f8e20: at .moea64_enter+0x21c
0xe0000001d37f8f20: at .pmap_enter+0xb4
0xe0000001d37f8fd0: at .kmem_back_domain+0x298
0xe0000001d37f90c0: at .kmem_back_domain+0x454
0xe0000001d37f9180: at .kmem_back+0x18
0xe0000001d37f9200: at .memguard_alloc+0x188
0xe0000001d37f92f0: at .uma_zalloc_arg+0xe0
0xe0000001d37f93b0: at .uma_zalloc_pcpu_arg+0x174
0xe0000001d37f9450: at .uma_zalloc_arg+0x5a8
0xe0000001d37f9510: at .slb_free_user_cache+0xb4
0xe0000001d37f95a0: at .allocate_user_vsid+0x30c
0xe0000001d37f9650: at .va_to_vsid+0x90
0xe0000001d37f96e0: at .moea64_extract+0x254
0xe0000001d37f9770: at .moea64_enter+0x254
0xe0000001d37f9870: at .moea64_enter_object+0xa8
0xe0000001d37f9920: at .pmap_enter_object+0xa8
0xe0000001d37f99d0: at ._vm_map_unlock+0x4dc
0xe0000001d37f9aa0: at .vm_map_insert+0x550
0xe0000001d37f9ba0: at .vm_map_fixed+0x134
0xe0000001d37f9c70: at .elf64_coredump+0x1050
0xe0000001d37f9d60: at .elf64_coredump+0x1230
0xe0000001d37f9e40: at .elf64_coredump+0x162c
0xe0000001d37f9f50: at .elf64_coredump+0x2494
0xe0000001d37fa0d0: at .kern_execve+0x6fc
0xe0000001d37fa4b0: at .sys_execve+0x74
0xe0000001d37fa5b0: at .trap+0x664
0xe0000001d37fa770: at .powerpc_interrupt+0x290
0xe0000001d37fa810: user SC trap by 0x810212ec8: srr1=0x900000000000f032 r1=0x3fffffffffff8050 cr=0x44002022 xer=0x20000000 ctr=0x810212ec0 r2=0x81030ea90
KDB: enter: panic
[ thread pid 87812 tid 101772 ]
Stopped at .kdb_enter+0x60: ld r2, r1, 0x28
db>

I didn't investigate this further, but I'm starting to question our current approach, as we may find many more issues like this. So I'm wondering whether there is another change we could make in memguard to avoid having to work around non-recursive mutexes, or whether there is another way to catch use-after-free issues without memguard.
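For what it's worth, the check above amounts to this pattern: record whether the current thread already write-owns the object and, if so, skip the lock/unlock pair so the re-entered call doesn't recurse on the lock. A self-contained userland illustration of that pattern (hypothetical names; the real check is just the VM_OBJECT_WOWNED() one-liner above):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t obj_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t obj_owner;
static bool obj_owned;

/* Equivalent of VM_OBJECT_WOWNED(): does the calling thread own the lock? */
static bool
obj_wowned(void)
{
	return (obj_owned && pthread_equal(obj_owner, pthread_self()));
}

static void
obj_wlock(void)
{
	pthread_mutex_lock(&obj_lock);
	obj_owner = pthread_self();
	obj_owned = true;
}

static void
obj_wunlock(void)
{
	obj_owned = false;
	pthread_mutex_unlock(&obj_lock);
}

/* Stand-in for kmem_back_domain(): may be re-entered by the guarded path. */
static void
back_range(int depth)
{
	bool locked;

	/* The workaround: only take the lock if we don't already own it. */
	locked = obj_wowned();
	if (!locked)
		obj_wlock();

	printf("backing range, depth %d\n", depth);
	if (depth == 0)
		back_range(1);	/* simulate memguard re-entering us */

	if (!locked)
		obj_wunlock();
}

int
main(void)
{
	back_range(0);
	return (0);
}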
Created attachment 200018 [details] memguard hack

Ok, let's try a different approach: when allocating buckets, avoid memguarded bucket zones when selecting a bucket zone for a zone used by the VM subsystem. Please give the attached patch a try. I was able to use it to guard a few different bucket zones while running postgres benchmarks in the background to generate load.
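The idea, roughly: when UMA picks a bucket zone on behalf of a zone the VM subsystem itself allocates from, it must never pick the bucket zone that memguard is currently guarding, otherwise the bucket allocation can recurse straight back into memguard_alloc()/kmem_back(). A toy model of that selection rule (hypothetical names and structures, not the attached patch):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct bucket_zone {
	const char *name;
	int entries;
};

static struct bucket_zone bucket_zones[] = {
	{ "16 Bucket", 16 },
	{ "32 Bucket", 32 },
	{ "128 Bucket", 128 },
};
#define	NBUCKET_ZONES	(int)(sizeof(bucket_zones) / sizeof(bucket_zones[0]))

/* What vm.memguard.desc is currently set to. */
static const char *memguard_desc = "128 Bucket";

/* Prefer the largest bucket zone, but never hand the guarded bucket zone to
 * a zone that the VM subsystem depends on. */
static struct bucket_zone *
pick_bucket_zone(bool caller_is_vm_zone)
{
	int i;

	for (i = NBUCKET_ZONES - 1; i >= 0; i--) {
		if (caller_is_vm_zone &&
		    strcmp(bucket_zones[i].name, memguard_desc) == 0)
			continue;
		return (&bucket_zones[i]);
	}
	return (NULL);
}

int
main(void)
{
	printf("VM-internal zone gets: %s\n", pick_bucket_zone(true)->name);
	printf("ordinary zone gets:    %s\n", pick_bucket_zone(false)->name);
	return (0);
}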
(In reply to Mark Johnston from comment #8) With this new patch I was able to use memguard to guard the '128 Bucket' zone for more than an hour of package building. Unfortunately, later I got this panic: panic: _mtx_lock_sleep: recursed on non-recursive mutex memguard arena @ /usr/src/sys/kern/subr_vmem.c:1184 cpuid = 40 time = 1544566934 KDB: stack backtrace: 0xe0000001dbb4e250: at .kdb_backtrace+0x5c 0xe0000001dbb4e380: at .vpanic+0x1b4 0xe0000001dbb4e440: at .panic+0x38 0xe0000001dbb4e4d0: at .__mtx_lock_sleep+0x178 0xe0000001dbb4e5c0: at .__mtx_lock_flags+0x160 0xe0000001dbb4e670: at .vmem_xalloc+0x2d0 0xe0000001dbb4e7b0: at .memguard_alloc+0xdc 0xe0000001dbb4e890: at .uma_zalloc_arg+0xe0 0xe0000001dbb4e950: at .uma_zalloc_pcpu_arg+0x178 0xe0000001dbb4e9f0: at .uma_zfree_arg+0x4c4 0xe0000001dbb4eaa0: at .uma_zfree_pcpu_arg+0xc8 0xe0000001dbb4eb30: at .uma_zalloc_arg+0x3c0 0xe0000001dbb4ebf0: at .uma_zalloc_pcpu_arg+0x178 0xe0000001dbb4ec90: at .uma_zalloc_arg+0x5a8 0xe0000001dbb4ed50: at .vmem_size+0x1400 0xe0000001dbb4ee10: at .vmem_xalloc+0x320 0xe0000001dbb4ef50: at .memguard_alloc+0xdc 0xe0000001dbb4f030: at .uma_zalloc_arg+0xe0 0xe0000001dbb4f0f0: at .uma_zalloc_pcpu_arg+0x178 0xe0000001dbb4f190: at .uma_zfree_arg+0x4c4 0xe0000001dbb4f240: at ._fdrop+0xa8 0xe0000001dbb4f2d0: at .closef+0x27c 0xe0000001dbb4f3d0: at .fdsetugidsafety+0x35c 0xe0000001dbb4f480: at .kern_close+0x1e8 0xe0000001dbb4f530: at .sys_close+0x18 0xe0000001dbb4f5b0: at .trap+0x664 0xe0000001dbb4f770: at .powerpc_interrupt+0x290 0xe0000001dbb4f810: user SC trap by 0x104829e8: srr1=0x900000000000f032 r1=0x3fffffffffff3aa0 cr=0x24824042 xer=0x20000000 ctr=0x103b9910 r2=0x10671858 KDB: enter: panic
Created attachment 200057 [details] memguard hack

There is a corner case in my previous patch. Please try this one.
(In reply to Mark Johnston from comment #10)
MemGuard seems to be very stable now, with several hours of builds guarding different UMA bucket zones and no panics. Unfortunately, the original "memory modified after free" issue also stopped happening with memguard enabled. Sean Bruno, maybe you could try this patch on your POWER8 machines? You seem to be able to reproduce it more easily than I do.
(In reply to Mark Johnston from comment #10)
Just so that it's clear, this patch works for me on the Tyan P8 in the cluster. I can boot a MemGuard kernel and turn the knobs on/off. I'll leave it to you folks to decide if and how you want to commit the fixes.

---<<BOOT>>---
Copyright (c) 1992-2018 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 13.0-CURRENT #0 r342079M: Fri Dec 14 13:41:21 UTC 2018
    sbruno@build-13.freebsd.org:/usr/obj/powerpc.powerpc64/usr/src/sys/CLUSTER13 powerpc
gcc version 4.2.1 20070831 patched [FreeBSD]
WARNING: WITNESS option enabled, expect reduced performance.
MEMGUARD DEBUGGING ALLOCATOR INITIALIZED:
	MEMGUARD map base: 0xe000000001024000
	MEMGUARD map size: 3355444 KBytes
VT: init without driver.
cpu0: IBM POWER8 revision 2.0, 4024.00 MHz
cpu0: Features dc007182<PPC32,PPC64,ALTIVEC,FPU,MMU,SMT,ISNOOP,ARCH205,ARCH206,VSX,TRUELE>
cpu0: Features2 ef000000<ARCH207,HTM,DSCR,ISEL,TAR,VCRYPTO,HTMNOSC>

As for debugging the original panic, I've set sysctl vm.memguard.desc="128 Bucket".
I'm guessing that memguard is making things too slow to catch whatever was crashing before. I have been running the memguard patch from comment #10 for 3 days without issue:

sbruno@pylon.nyi:~ % sysctl vm.memguard
vm.memguard.maplimit: 3435974656
vm.memguard.mapstart: 16140901064512782336
vm.memguard.divisor: 10
vm.memguard.desc: 16 Bucket
vm.memguard.cursor: 16140901067941314560
vm.memguard.mapsize: 3435974656
vm.memguard.phys_limit: 1641385984
vm.memguard.wasted: 2380800
vm.memguard.wrapcnt: 134780
vm.memguard.numalloc: 673191
vm.memguard.fail_kva: 105
vm.memguard.fail_pgs: 0
vm.memguard.options: 1
vm.memguard.minsize: 0
vm.memguard.minsize_reject: 0
vm.memguard.frequency: 0
vm.memguard.frequency_hits: 0
sbruno@pylon.nyi:~ % uptime
 8:07PM up 3 days, 2:45, 2 users, load averages: 4.03, 5.05, 5.15