db:0:pho> bt
Tracing pid 785 tid 100153 td 0xfffff8002a52c490
uma_find_refcnt() at uma_find_refcnt+0x33/frame 0xfffffe17288d2590
mb_ctor_clust() at mb_ctor_clust+0x8f/frame 0xfffffe17288d25c0
uma_zalloc_arg() at uma_zalloc_arg+0x164/frame 0xfffffe17288d2660
m_getjcl() at m_getjcl+0xa3/frame 0xfffffe17288d26b0
m_getm2() at m_getm2+0xe7/frame 0xfffffe17288d2700
m_uiotombuf() at m_uiotombuf+0xa4/frame 0xfffffe17288d2770
sosend_generic() at sosend_generic+0x6cc/frame 0xfffffe17288d2820
sosend() at sosend+0x5d/frame 0xfffffe17288d2880
soo_write() at soo_write+0x42/frame 0xfffffe17288d28b0
dofilewrite() at dofilewrite+0x88/frame 0xfffffe17288d2900
kern_writev() at kern_writev+0x68/frame 0xfffffe17288d2950
sys_write() at sys_write+0x63/frame 0xfffffe17288d29a0
amd64_syscall() at amd64_syscall+0x278/frame 0xfffffe17288d2ab0

How to repeat:
sysctl vm.memguard.options=3; sysctl vm.memguard.desc=allocdirect
+ ssh activity

Details: http://people.freebsd.org/~pho/stress/log/memguard4.txt
Dear Peter,

I managed to find the root cause. The bug can be reproduced by setting "sysctl vm.memguard.options=2" together with ssh activity:

1. memguard.options=2 enables memguard protection for all allocations larger than PAGE_SIZE.
2. ssh activity allocates mbufs from a zone with the UMA_ZONE_REFCNT flag, and that zone is protected by memguard.

However, these two features store values in the same union (plinks) in struct vm_page:

1. memguard saves the allocation size in vm_page->plinks.memguard.v.
2. UMA_ZONE_REFCNT saves the refcount in vm_page->plinks.s.pv.

The following patch can work around this bug:

Index: sys/vm/memguard.c
===================================================================
--- sys/vm/memguard.c	(revision 276729)
+++ sys/vm/memguard.c	(working copy)
@@ -506,6 +506,9 @@
 	    zone->uz_flags & UMA_ZONE_NOFREE)
 		return (0);
 
+	if (zone->uz_flags & UMA_ZONE_REFCNT)
+		return (0);
+
 	if (memguard_cmp(zone->uz_size))
 		return (1);
pho reported that this change worked for him when running stress2. Either benno or I can take this bug and commit it to head. Thank you very much for the patch!
Great! Thanks for your effort.
Did some more testing with your patch and found that setting vm.memguard.frequency=1000 triggers a suspicious number of different panics. For example:

http://people.freebsd.org/~pho/stress/log/memguard.frequency.txt
CR: https://reviews.freebsd.org/D1865
Unfortunately a new test of this patch shows the same problem as before:

root@t1:~ # sysctl vm.memguard.options=3; sysctl vm.memguard.desc=allocdirect
vm.memguard.options: 1 -> 3
vm.memguard.desc:  -> allocdirect
root@t1:~ # ssh pho@localhost
Memory modified after free 0xfffffe0000411000(4096) val=0 @ 0xfffffe0000411000

Fatal trap 12: page fault while in kernel mode
cpuid = 4; apic id = 04
fault virtual address	= 0x3000
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf0053
stack pointer		= 0x28:0xfffffe17287981b0
frame pointer		= 0x28:0xfffffe1728798200
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 922 (sshd)
[ thread pid 922 tid 100205 ]
Stopped at	uma_find_refcnt+0x33:	movq	(%rax),%rax
db> x/s version
version: FreeBSD 11.0-CURRENT #3 r278882M: Tue Feb 17 12:42:28 CET 2015\012 pho@t1.osted.lan:/usr/src/sys/amd64/compile/MEMGUARD\012
db>
Passing bug I'm not actively working on back to the general pool.
(In reply to Peter Holm from comment #6)

I'm not able to reproduce this bug on FreeBSD 12-CURRENT using your steps. Are there any other configuration steps you've taken before running those commands? It could be that this bug has been fixed in CURRENT.

uname output:
FreeBSD 12.0-CURRENT 2621be48c91(master): Mon Aug 28 14:45:10 EDT 2017
(In reply to Siva Mahadevan from comment #8)

Hard for me to say if the original panic is still there. With the same scenario I see:

panic: MemGuard detected double-free of 0xfffffe000075e000
cpuid = 2
time = 1504166229
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe2ebbde5db0
vpanic() at vpanic+0x19c/frame 0xfffffe2ebbde5e30
panic() at panic+0x43/frame 0xfffffe2ebbde5e90
memguard_free() at memguard_free+0x14f/frame 0xfffffe2ebbde5ed0
bufkva_free() at bufkva_free+0xf8/frame 0xfffffe2ebbde5ef0
buf_free() at buf_free+0xd5/frame 0xfffffe2ebbde5f40
brelse() at brelse+0x5c0/frame 0xfffffe2ebbde5fd0
bufdone_finish() at bufdone_finish+0xd4/frame 0xfffffe2ebbde5ff0
bufdone() at bufdone+0xe3/frame 0xfffffe2ebbde6020
biodone() at biodone+0x188/frame 0xfffffe2ebbde6060
g_io_deliver() at g_io_deliver+0x5e4/frame 0xfffffe2ebbde6140
biodone() at biodone+0x188/frame 0xfffffe2ebbde6180
g_io_deliver() at g_io_deliver+0x5e4/frame 0xfffffe2ebbde6260
biodone() at biodone+0x188/frame 0xfffffe2ebbde62a0
g_io_deliver() at g_io_deliver+0x5e4/frame 0xfffffe2ebbde6380
g_disk_done() at g_disk_done+0x1ee/frame 0xfffffe2ebbde6400
biodone() at biodone+0x188/frame 0xfffffe2ebbde6440
dadone() at dadone+0x194b/frame 0xfffffe2ebbde69a0
xpt_done_process() at xpt_done_process+0x35f/frame 0xfffffe2ebbde69e0
xpt_done_td() at xpt_done_td+0x136/frame 0xfffffe2ebbde6a30
fork_exit() at fork_exit+0x13b/frame 0xfffffe2ebbde6ab0

Details @ https://people.freebsd.org/~pho/stress/log/memguard8.txt
There was a problem at one point with guarding mbufs using memguard; it should be fixed by https://cgit.freebsd.org/src/commit/?id=bc9d08e1cfe381f67fea89eff8f6235a15022494

I'm not sure what's going on in comment 9. That looks a bit like a bug in memguard itself. Does it still occur on a recent head?
No problems seen with this test scenario on main-n245775-9aef4e7c2bd.