Bug 204037 - Poudriere panics zfs with 'solaris assert: zrl->zr_refcount == 0 (0x6 == 0x0)'
Summary: Poudriere panics zfs with 'solaris assert: zrl->zr_refcount == 0 (0x6 == 0x0)'
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: arm64 Any
: --- Affects Some People
Assignee: Andrew Turner
Depends on:
Blocks: 203349
  Show dependency treegraph
Reported: 2015-10-26 15:05 UTC by Andrew Turner
Modified: 2016-11-03 09:10 UTC (History)
6 users (show)

See Also:

Stack trace of the panic (1.78 KB, text/plain)
2015-10-26 15:05 UTC, Andrew Turner
no flags Details
Possible fix (551 bytes, patch)
2015-11-04 17:04 UTC, Konstantin Belousov
no flags Details | Diff
kgdb backtrace for crash (9.01 KB, text/plain)
2015-11-06 21:08 UTC, Kurt Lidl
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andrew Turner freebsd_committer 2015-10-26 15:05:06 UTC
Created attachment 162472 [details]
Stack trace of the panic

I'm hitting the following panic on the Cavium ThunderX in the cluster when testing ZFS:

panic: solaris assert: zrl->zr_refcount == 0 (0x6 == 0x0), file: /usr/home/andrew/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zrlock.c, line: 64

I can trigger this with Poudriere cleaning up after it failes to create a jail.
Comment 1 Andrew Turner freebsd_committer 2015-10-26 15:08:03 UTC
This is on: FreeBSD cavium 11.0-CURRENT FreeBSD 11.0-CURRENT #3 r289872M: Sat Oct 24 13:38:33 UTC 2015     andrew@cavium:/usr/home/andrew/obj/usr/home/andrew/head/sys/GENERIC  arm64
Comment 2 Konstantin Belousov freebsd_committer 2015-11-04 17:04:11 UTC
Created attachment 162782 [details]
Possible fix
Comment 3 Will Andrews freebsd_committer 2015-11-04 17:28:45 UTC
I filed an issue against zrlock a while ago: https://www.illumos.org/issues/3746

Here is a rebased version of the original patch against FreeBSD/head:

I haven't tested this specific version of the patch, but another version of it has been in our tree in production for 3 years.
Comment 4 Andrew Turner freebsd_committer 2015-11-05 11:48:48 UTC
I think the issue is atomic_cas_32 is broken on arm64. We use the version in sys/cddl/compat/opensolaris/kern/opensolaris_atomic.c when we need to write it in assembler to use the correct atomic instructions. I've hacked in the code from atomic_cmpset_32 for testing.
Comment 5 Konstantin Belousov freebsd_committer 2015-11-05 13:23:58 UTC
(In reply to Andrew Turner from comment #4)
Why cannot the cddl code use FreeBSD native atomic.h implementation ?  As in, calling the inlined atomic_add_32 ?
Comment 6 Andrew Turner freebsd_committer 2015-11-05 13:39:42 UTC
(In reply to Konstantin Belousov from comment #5)
It does, other than the few places it can't (or couldn't when written). It needs: atomic_add_64_nv, atomic_cas_32, atomic_cas_64, atomic_or_8_nv, and membar_producer.

The _nv functions return the new value, after the operation. The cas functions return the value in the target before the operation. None of these have an equivalent atomic function.

I think membar_producer can be updated to use an atomic fence.
Comment 7 Konstantin Belousov freebsd_committer 2015-11-05 13:49:41 UTC
(In reply to Andrew Turner from comment #6)
atomic_add_nv() can be easily implemented with atomic_fetchadd_32(), and I think it is already done this way in some situations.

WRT atomic_cas, I used the following code to emulate casueword32() for the kernel-memory accesses:

	for (;;) {
		ov = *(volatile const uint32_t *)base;
		if (ov != oldval)
		if (atomic_cmpset_32(base, oldval, newval))
	*oldvalp = ov;

I believe that we should implement all solaris atomics with the freebsd one, and remove MD-specific code.
Comment 8 commit-hook freebsd_committer 2015-11-05 16:56:04 UTC
A commit references this bug:

Author: andrew
Date: Thu Nov  5 16:55:27 UTC 2015
New revision: 290397
URL: https://svnweb.freebsd.org/changeset/base/290397

  Fix the open solaris atomic functions on arm64. Without this we may use the
  wrong value in the comparison, leading to incorrectly setting the new

  This has been observed in the ZFS code. Without this we can lose track of
  the reference count in a zrlock object.

  We should move to use the generic atomic functions, however as this has
  been observed I would prefer to have this working, then move to the generic

  PR:		204037
  Sponsored by:	ABT Systems Ltd

Comment 9 Kurt Lidl freebsd_committer 2015-11-06 20:34:09 UTC
For whatever it's worth, I ran into this panic on my amd64 machine this morning,
running a build of r290447M.

The first time it panic'd, I had just issued a "shutdown -r now" command.
I was able to collect a textdump of that panic, and reboot the machine.

I tried to run 'shutdown -r now' again, and it panic'd again, in the
same place.

I collected a full crashdump this time.

I have reverted back to a prior kernel (r290391M) and am rebuilding a kernel
with the patch from comment #3 in this bug.
Comment 10 Kurt Lidl freebsd_committer 2015-11-06 20:56:31 UTC
The kernel with the patch from comment #3 in this bug panics the same way
on my amd64 host.

I will attach a backtrace in just a minute.
Comment 11 Kurt Lidl freebsd_committer 2015-11-06 21:08:28 UTC
Created attachment 162859 [details]
kgdb backtrace for crash
Comment 12 Kurt Lidl freebsd_committer 2015-11-27 17:43:39 UTC
I ended up reverting to an older kernel so I could compile a more up-to-date
version of -head on my amd64 machine.  With these sources:

FreeBSD busybox.pix.net 11.0-CURRENT FreeBSD 11.0-CURRENT #92 r291335M: Wed Nov 25 14:59:52 EST 2015     lidl@busybox.pix.net:/usr/obj/usr/src/sys/BUSYBOX  amd64

I no longer see this panic on my amd64 host.
Comment 13 heathn 2016-05-28 21:44:40 UTC
I'm seeing this same problem on an arm processor (Raspberry Pi 2).

FreeBSD rpi2 11.0-ALPHA1 FreeBSD 11.0-ALPHA1 #0 r300905M: Sat May 28 13:06:17 MDT 2016

panic: solaris assert: zrl->zr-refcount == 0 (0x1 == 0x0), file: /usr/home/heath/freebsd-current/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zrlock.c, line: 64

I also tried the patch in comment #3 but it still panic'd.
Comment 14 commit-hook freebsd_committer 2016-10-27 07:39:07 UTC
A commit references this bug:

Author: avg
Date: Thu Oct 27 07:38:08 UTC 2016
New revision: 307994
URL: https://svnweb.freebsd.org/changeset/base/307994

  3746 ZRLs are racy


    From the original change log:
    It was possible for a reference to be added even with the lock held, and
    for references added just after a lock release to be lost.
    This bug was also independently found and reported in wesunsolve.net
    issues 6985013 6995524.
    In zrl_add(), always use an atomic operation to update the refcount.
    The mutex in the ZRL only guarantees that wakeups occur for waiters on the
    lock. It offers no protection against concurrent updates of the refcount.
    The only refcount transition that is safe to perform without an atomic
    operation is from ZRL_LOCKED back to 0, since this can only be performed
    by the thread which has the ZRL locked.

  Authored by: Will Andrews <will@freebsd.org>
  Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
  Reviewed by: Pavel Zakharov <pavel.zakha@gmail.com>
  Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
  Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com>
  Approved by: Matt Ahrens <mahrens@delphix.com>
  Author: Youzhong Yang <yyang@mathworks.com>
  PR:		204037
  MFC after:	1 week

_U  head/sys/cddl/contrib/opensolaris/
Comment 15 commit-hook freebsd_committer 2016-11-03 08:55:02 UTC
A commit references this bug:

Author: avg
Date: Thu Nov  3 08:54:04 UTC 2016
New revision: 308245
URL: https://svnweb.freebsd.org/changeset/base/308245

  MFC r307994: 3746 ZRLs are racy

  PR:		204037

_U  stable/11/
Comment 16 commit-hook freebsd_committer 2016-11-03 08:55:05 UTC
A commit references this bug:

Author: avg
Date: Thu Nov  3 08:54:36 UTC 2016
New revision: 308246
URL: https://svnweb.freebsd.org/changeset/base/308246

  MFC r307994: 3746 ZRLs are racy

  PR:		204037

_U  stable/10/
Comment 17 Andriy Gapon freebsd_committer 2016-11-03 09:10:38 UTC
Closing this bug now.
If the problem comes back, please open a new bug.