Created attachment 156885 [details] 1st panic
I've now gotten 3 panics of this flavor. I have all 3 vmcores and will attach all 3 core.txt.? files. I even get the panic when compiling (no -j) world. There is other activity on the system. I ran a ZFS scrub and it completed successfully and noted no errors. What else do I need to supply here?
Created attachment 156886 [details] 2nd Panic
Created attachment 156887 [details] 3rd Panic
borg.lerctr.org /home/ler $ uname -a
FreeBSD borg.lerctr.org 11.0-CURRENT FreeBSD 11.0-CURRENT #40 r283007: Sat May 16 07:23:43 CDT 2015     root@borg.lerctr.org:/usr/obj/usr/src/sys/VT-LER  amd64
borg.lerctr.org /home/ler $
I have the same panic on my laptop:

# cat info5.txt
cat: info5.txt: No such file or directory
[root@ativ /var/crash]# cat info.5
Dump header from device: /dev/gpt/swap
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 454008832
  Blocksize: 512
  Dumptime: Mon May 18 03:59:06 2015
  Hostname: ativ
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 11.0-CURRENT #0 r283003M: Sat May 16 13:14:12 CEST 2015
    root@ativ:/usr/obj/usr/src/sys/GENERIC
  Panic String: Most recently used by solaris
  Dump Parity: 16138761
  Bounds: 5
  Dump Status: good

Attached the panic.

Regards
Maurizio
Created attachment 156913 [details] Panic
Please recompile the kernel with "options DEBUG_MEMGUARD" if possible so this can be tracked down a bit further. When was the last time your system functioned for you without this panic?
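For reference, a minimal sketch of what that involves, assuming the VT-LER config visible in the uname output above (adjust the config name and paths to your own setup):

    # add to the kernel config, e.g. sys/amd64/conf/VT-LER
    options         DEBUG_MEMGUARD

    # then rebuild, install and reboot
    cd /usr/src
    make buildkernel installkernel KERNCONF=VT-LER
    reboot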
It had been running fine for a few months. I did replace a disk in the array last week, but that was uneventful. I'll TRY to compile a kernel with DEBUG_MEMGUARD, but so far even building WORLD has provoked the panic. I **DO** have all 3 vmcores available if someone wants to traipse through them.
I was able to get a kernel compiled/booted with DEBUG_MEMGUARD. I'll attach that core.txt.
Created attachment 156917 [details] PANIC with DEBUG_MEMGUARD
Created attachment 156962 [details] 2nd Panic with DEBUG_MEMGUARD on buildworld
What else can I supply? I have all the vmcores.
(In reply to Larry Rosenman from comment #10)
I have looked at all these backtraces and they are almost identical. Unfortunately, errors like this are relatively hard to track down: it's like looking at a dead body (the assertion, so we already know where the deadly wound is) but not at the smoking gun (the code that touches freed memory). Debugging this can be really time consuming. I don't know why memguard didn't catch the issue (it should have). Did you add the loader.conf option vm.memguard.desc="solaris"? I *think* the issue is probably a decreasing counter, but a quick glance at zfs_rlock.c didn't reveal one, and the code has been there for quite a while.
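For anyone following along, the MemGuard selector can be set either as a loader tunable or at runtime; a minimal sketch (see memguard(9) for the authoritative knobs):

    # /boot/loader.conf
    vm.memguard.desc="solaris"

    # or at runtime, before starting the workload:
    sysctl vm.memguard.desc=solaris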
I had NOT added vm.memguard.desc="solaris" but have now -- will get a new crash -- it's easy to repro :(
Can you tell me what you are doing to get the crash? The box crashed twice for me, but stopped before I could get a useful core from it. I wonder if I can speed things up by mimicking your workload.
(In reply to Alexander Kabaev from comment #14)
I have boinc running, and then do a buildworld, plus my "normal" stuff. I have a core being savecore'd now with solaris in the memguard list. I'll post the core.txt in a couple of minutes.
Looking at the allocation path, I think it tries to allocate a structure of 96 bytes, which should put the allocation into the '128' malloc zone. You might be better off watching that instead of 'solaris'. 'solaris' is an umbrella type that covers all of the allocations in the code imported from OpenSolaris. On my machine, vmstat -z gives 123470 active allocations in said bucket, while vmstat -m reports 1614027 active allocations for the whole solaris type, about 13x more. Due to the way memguard is implemented, it might be more practical to use a selector with fewer active entries. In reality, I think one will need to hack memguard_cmp_mtp to match 'solaris' _and_ a size equal to sizeof(struct rl).
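A rough sketch of what that hack might look like. memguard_cmp_mtp lives in sys/vm/memguard.c, but the memguard_cmp, vm_memguard_desc and ks_shortdesc names below are from memory of that file and should be checked against the tree, and the 96-byte size is the guess above rather than a measured sizeof(struct rl):

    int
    memguard_cmp_mtp(struct malloc_type *mtp, unsigned long size)
    {

            if (memguard_cmp(size))
                    return (1);

            /*
             * Only guard 'solaris' allocations of the size that holds the
             * range-lock structure, so the guarded pool is not exhausted
             * by the ~1.6M other 'solaris' allocations.
             */
            if (strcmp(mtp->ks_shortdesc, vm_memguard_desc) == 0 &&
                size == 96)         /* assumed sizeof(struct rl) on amd64 */
                    return (1);

            return (0);
    }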
Created attachment 156969 [details] core.txt with solaris in the memguard list
I'll have a 2nd core.txt here in a bit -- the system panic'd whilst it was just doing its background boinc stuff.
(In reply to Larry Rosenman from comment #18)
Any chance you could go to frame 10 and run 'x/i $rip' in kgdb? I was successful in crashing my machine, but it turns out I now need to update my boot code, which also means I will need a bootable memstick with a recent image on it, so my capabilities are reduced to remote debug only :( If nobody picks this up, I'll continue once my dev box is up again.
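(A sketch of the kgdb steps for anyone reproducing this; the kernel and vmcore paths are placeholders for whatever savecore wrote on your box:)

    kgdb /boot/kernel/kernel /var/crash/vmcore.N
    (kgdb) frame 10
    (kgdb) x/i $rip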
As soon as the box is finished savecore'ing. (It's slow with memguard.) I can also give you ssh access if that would help.
(In reply to Larry Rosenman from comment #20)
I have all I need for the moment, but will take you up on that offer tomorrow, assuming nobody beats me to it.
So, by looking at the zfs range lock code, I see these constructs:

    if (rl->r_write_wanted) {
            cv_broadcast(&rl->r_wr_cv);
            cv_destroy(&rl->r_wr_cv);
    }
    if (rl->r_read_wanted) {
            cv_broadcast(&rl->r_rd_cv);
            cv_destroy(&rl->r_rd_cv);
    }
    kmem_free(rl, sizeof (rl_t));

The code appears to wake up the waiters sleeping on the cv and then immediately proceeds to destroy the container structure. Looking at the recent change in kern_condvar.c, it does appear that the thread that awakens attempts to decrement the cv_waiters counter in memory that the other thread has already freed. Hilarity ensues. Please try backing out revision 282971 by jhb and see if that brings your stability back. FWIW, I think the code in zfs is hilarious.
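To spell out the suspected interleaving (my reading of r282971 combined with the snippet above; the exact timeline is an inference, not something pulled from the cores):

    unlocking thread                      awakened waiter
    ----------------                      ---------------
    cv_broadcast(&rl->r_wr_cv)            returns from its cv_wait*() sleep
    cv_destroy(&rl->r_wr_cv)
    kmem_free(rl, sizeof (rl_t))
                                          decrements cv_waiters inside the
                                          now-freed rl_t -> the "Most recently
                                          used by solaris" panic above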
Backed out the revision -- building a kernel. Have 2 more core.txt's if you want them. We'll see what happens after the kernel boots.
Added jhb to the CC list since it's his revision that seems(!) to be causing the grief.
booted the new kernel, and we've made it WAY past where it was crashing on a make buildworld. I think Alexander has nailed the issue.
for completeness:

    (kgdb) x/i $rip
    0xffffffff80402d39 <zfs_range_lock+569>:    mov    %r15,%rsi
    (kgdb)
I can insta-repro this by running fio on a zvol with mixed r/w, e.g.:

    fio --filename=/dev/zvol/xxxxxx --direct=1 --rw=rw --refill_buffers \
        --norandommap --randrepeat=0 --bs=8k \
        --rwmixread=70 --iodepth=8 --numjobs=24 --runtime=60 \
        --ioengine=pvsync \
        --group_reporting --name=crashtest

I can also confirm that reverting the condvar change stopped the crashes.
jhb@ just committed r283250 which reverts the commit that caused the problem: https://svnweb.freebsd.org/changeset/base/r283250