Created attachment 162472 [details] Stack trace of the panic I'm hitting the following panic on the Cavium ThunderX in the cluster when testing ZFS: panic: solaris assert: zrl->zr_refcount == 0 (0x6 == 0x0), file: /usr/home/andrew/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zrlock.c, line: 64 I can trigger this with Poudriere cleaning up after it failes to create a jail.
This is on: FreeBSD cavium 11.0-CURRENT FreeBSD 11.0-CURRENT #3 r289872M: Sat Oct 24 13:38:33 UTC 2015 andrew@cavium:/usr/home/andrew/obj/usr/home/andrew/head/sys/GENERIC arm64
Created attachment 162782 [details] Possible fix
I filed an issue against zrlock a while ago: https://www.illumos.org/issues/3746 Here is a rebased version of the original patch against FreeBSD/head: http://people.freebsd.org/~will/patches/zrlock-fix-atomicity.diff I haven't tested this specific version of the patch, but another version of it has been in our tree in production for 3 years.
I think the issue is atomic_cas_32 is broken on arm64. We use the version in sys/cddl/compat/opensolaris/kern/opensolaris_atomic.c when we need to write it in assembler to use the correct atomic instructions. I've hacked in the code from atomic_cmpset_32 for testing.
(In reply to Andrew Turner from comment #4) Why cannot the cddl code use FreeBSD native atomic.h implementation ? As in, calling the inlined atomic_add_32 ?
(In reply to Konstantin Belousov from comment #5) It does, other than the few places it can't (or couldn't when written). It needs: atomic_add_64_nv, atomic_cas_32, atomic_cas_64, atomic_or_8_nv, and membar_producer. The _nv functions return the new value, after the operation. The cas functions return the value in the target before the operation. None of these have an equivalent atomic function. I think membar_producer can be updated to use an atomic fence.
(In reply to Andrew Turner from comment #6) atomic_add_nv() can be easily implemented with atomic_fetchadd_32(), and I think it is already done this way in some situations. WRT atomic_cas, I used the following code to emulate casueword32() for the kernel-memory accesses: for (;;) { ov = *(volatile const uint32_t *)base; if (ov != oldval) break; if (atomic_cmpset_32(base, oldval, newval)) break; } *oldvalp = ov; I believe that we should implement all solaris atomics with the freebsd one, and remove MD-specific code.
A commit references this bug: Author: andrew Date: Thu Nov 5 16:55:27 UTC 2015 New revision: 290397 URL: https://svnweb.freebsd.org/changeset/base/290397 Log: Fix the open solaris atomic functions on arm64. Without this we may use the wrong value in the comparison, leading to incorrectly setting the new value. This has been observed in the ZFS code. Without this we can lose track of the reference count in a zrlock object. We should move to use the generic atomic functions, however as this has been observed I would prefer to have this working, then move to the generic functions. PR: 204037 Sponsored by: ABT Systems Ltd Changes: head/sys/cddl/contrib/opensolaris/common/atomic/aarch64/ head/sys/cddl/contrib/opensolaris/common/atomic/aarch64/opensolaris_atomic.S head/sys/conf/files.arm64
For whatever it's worth, I ran into this panic on my amd64 machine this morning, running a build of r290447M. The first time it panic'd, I had just issued a "shutdown -r now" command. I was able to collect a textdump of that panic, and reboot the machine. I tried to run 'shutdown -r now' again, and it panic'd again, in the same place. I collected a full crashdump this time. I have reverted back to a prior kernel (r290391M) and am rebuilding a kernel with the patch from comment #3 in this bug.
The kernel with the patch from comment #3 in this bug panics the same way on my amd64 host. I will attach a backtrace in just a minute.
Created attachment 162859 [details] kgdb backtrace for crash
I ended up reverting to an older kernel so I could compile a more up-to-date version of -head on my amd64 machine. With these sources: FreeBSD busybox.pix.net 11.0-CURRENT FreeBSD 11.0-CURRENT #92 r291335M: Wed Nov 25 14:59:52 EST 2015 lidl@busybox.pix.net:/usr/obj/usr/src/sys/BUSYBOX amd64 I no longer see this panic on my amd64 host.
I'm seeing this same problem on an arm processor (Raspberry Pi 2). FreeBSD rpi2 11.0-ALPHA1 FreeBSD 11.0-ALPHA1 #0 r300905M: Sat May 28 13:06:17 MDT 2016 panic: solaris assert: zrl->zr-refcount == 0 (0x1 == 0x0), file: /usr/home/heath/freebsd-current/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zrlock.c, line: 64 I also tried the patch in comment #3 but it still panic'd.
A commit references this bug: Author: avg Date: Thu Oct 27 07:38:08 UTC 2016 New revision: 307994 URL: https://svnweb.freebsd.org/changeset/base/307994 Log: 3746 ZRLs are racy illumos/illumos-gate@260af64db74a52d64de8c6c5f67dd0a71d228ca5 https://github.com/illumos/illumos-gate/commit/260af64db74a52d64de8c6c5f67dd0a71d228ca5 https://www.illumos.org/issues/3746 From the original change log: It was possible for a reference to be added even with the lock held, and for references added just after a lock release to be lost. This bug was also independently found and reported in wesunsolve.net issues 6985013 6995524. In zrl_add(), always use an atomic operation to update the refcount. The mutex in the ZRL only guarantees that wakeups occur for waiters on the lock. It offers no protection against concurrent updates of the refcount. The only refcount transition that is safe to perform without an atomic operation is from ZRL_LOCKED back to 0, since this can only be performed by the thread which has the ZRL locked. Authored by: Will Andrews <will@freebsd.org> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed by: Pavel Zakharov <pavel.zakha@gmail.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Youzhong Yang <yyang@mathworks.com> PR: 204037 MFC after: 1 week Changes: _U head/sys/cddl/contrib/opensolaris/ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zrlock.c
A commit references this bug: Author: avg Date: Thu Nov 3 08:54:04 UTC 2016 New revision: 308245 URL: https://svnweb.freebsd.org/changeset/base/308245 Log: MFC r307994: 3746 ZRLs are racy PR: 204037 Changes: _U stable/11/ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zrlock.c
A commit references this bug: Author: avg Date: Thu Nov 3 08:54:36 UTC 2016 New revision: 308246 URL: https://svnweb.freebsd.org/changeset/base/308246 Log: MFC r307994: 3746 ZRLs are racy PR: 204037 Changes: _U stable/10/ stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zrlock.c
Closing this bug now. If the problem comes back, please open a new bug.