Bug 180060 - [zfs] [panic] ZFS kernel panic, solaris assert on dsl_prop_unregister
Summary: [zfs] [panic] ZFS kernel panic, solaris assert on dsl_prop_unregister
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: Unspecified
Hardware: Any Any
: Normal Affects Only Me
Assignee: Andriy Gapon
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-06-28 16:20 UTC by James TD Smith
Modified: 2015-09-11 13:28 UTC (History)
2 users (show)

See Also:


Attachments
console screenshot of panic (26.18 KB, image/png)
2015-06-01 08:30 UTC, spork
no flags Details
proposed patch for testing (1.05 KB, patch)
2015-06-16 11:25 UTC, Andriy Gapon
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description James TD Smith 2013-06-28 16:20:00 UTC
I've had two ZFS-related kernel panics over the last few days. Both have happened since I upgraded to 9.1-RELEASE-4 on the 18th. PR 154447 has similar backtraces but was on 8.2, I can't find any other references to a similar problem.

It's a mixed UFS/ZFS system, root is on UFS with a single 4-disk RAIDZ1 pool used for backups, a media library, poudriere and a few other jails. 

I replaced one of the disks in the pool earlier this week which had been failing for a while and have been tuning the ZFS settings based on this guide: 
http://icesquare.com/wordpress/how-to-improve-zfs-performance/ after I upgraded to 4Gb RAM.

Jun 25 08:12:57 masada kernel: panic: solaris assert: dsl_prop_unregister(ds, "atime", atime_changed_cb, zfsvfs) == 0, file: /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c, line: 1199
Jun 25 08:12:57 masada kernel: cpuid = 0
Jun 25 08:12:57 masada kernel: KDB: stack backtrace:
Jun 25 08:12:57 masada kernel: #0 0xffffffff80926c96 at kdb_backtrace+0x66
Jun 25 08:12:57 masada kernel: #1 0xffffffff808f0cae at panic+0x1ce
Jun 25 08:12:57 masada kernel: #2 0xffffffff816b5647 at zfs_unregister_callbacks+0x1b7
Jun 25 08:12:57 masada kernel: #3 0xffffffff816b58c5 at zfsvfs_teardown+0x175
Jun 25 08:12:57 masada kernel: #4 0xffffffff816b5a2b at zfs_suspend_fs+0x1b
Jun 25 08:12:57 masada kernel: #5 0xffffffff816aa229 at zfs_ioc_rollback+0xf9
Jun 25 08:12:57 masada kernel: #6 0xffffffff816abd46 at zfsdev_ioctl+0xe6
Jun 25 08:12:57 masada kernel: #7 0xffffffff807e077b at devfs_ioctl_f+0x7b
Jun 25 08:12:57 masada kernel: #8 0xffffffff80938715 at kern_ioctl+0x115
Jun 25 08:12:57 masada kernel: #9 0xffffffff8093894d at sys_ioctl+0xfd
Jun 25 08:12:57 masada kernel: #10 0xffffffff80be71d6 at amd64_syscall+0x546

Jun 28 04:08:02 masada kernel: panic: solaris assert: dsl_prop_unregister(ds, "snapdir", snapdir_changed_cb, zfsvfs) == 0, file: /usr/src/sys/modules/zfs/../..
/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c, line: 1217
Jun 28 04:08:02 masada kernel: cpuid = 1
Jun 28 04:08:02 masada kernel: KDB: stack backtrace:
Jun 28 04:08:02 masada kernel: #0 0xffffffff80926c96 at kdb_backtrace+0x66
Jun 28 04:08:02 masada kernel: #1 0xffffffff808f0cae at panic+0x1ce
Jun 28 04:08:02 masada kernel: #2 0xffffffff816b570d at zfs_unregister_callbacks+0x27d
Jun 28 04:08:02 masada kernel: #3 0xffffffff816b58c5 at zfsvfs_teardown+0x175
Jun 28 04:08:02 masada kernel: #4 0xffffffff816b5a2b at zfs_suspend_fs+0x1b
Jun 28 04:08:02 masada kernel: #5 0xffffffff816aa229 at zfs_ioc_rollback+0xf9
Jun 28 04:08:02 masada kernel: #6 0xffffffff816abd46 at zfsdev_ioctl+0xe6
Jun 28 04:08:02 masada kernel: #7 0xffffffff807e077b at devfs_ioctl_f+0x7b
Jun 28 04:08:02 masada kernel: #8 0xffffffff80938715 at kern_ioctl+0x115
Jun 28 04:08:02 masada kernel: #9 0xffffffff8093894d at sys_ioctl+0xfd
Jun 28 04:08:02 masada kernel: #10 0xffffffff80be71d6 at amd64_syscall+0x546

How-To-Repeat: I think it's more likely to happen under heavy IO load. This machine has rsnaphot backups scheduled to run every 4 hours, both of the crashes have occurred shortly after a backup began. It was also building packages with poudriere at the same time in both instances.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2013-06-28 20:57:40 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-zfs-devel


Over to maintainer(s).
Comment 2 longwitz 2013-07-25 22:52:54 UTC
Do you have mountd running ?

I have a similiar configuration and the same problem with

 FreeBSD dsssrvt4.incore 8.4-STABLE FreeBSD 8.4-STABLE #0 r253040:amd64.

and found the panic is caused by mountd. This daemon starts working on
every successfull mount (sends signal 1 to mountd) and tries to delete
all existing exports for all local file systems using the flag
MNT_DELEXPORT. mountd calls nmount() and sometimes I see messages like

mountd[56874]: can't delete exports for
/home/poudriere/data/build/j8sp32-PT2/02/new_packages: Invalid argument

and once a week I have a panic:

panic: solaris assert: dsl_prop_unregister(ds, "aclinherit",
acl_inherit_changed_cb, zfsvfs) == 0, file: /usr/src/sys/module
s/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c,
line: 1279

cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
kdb_backtrace() at kdb_backtrace+0x37
panic() at panic+0x1ce
assfail() at assfail+0x1a
zfs_unregister_callbacks() at zfs_unregister_callbacks+0x1c8
zfsvfs_teardown() at zfsvfs_teardown+0x179
zfs_suspend_fs() at zfs_suspend_fs+0x1b
zfs_ioc_rollback() at zfs_ioc_rollback+0x24
zfsdev_ioctl() at zfsdev_ioctl+0x57f
devfs_ioctl_f() at devfs_ioctl_f+0x7b
kern_ioctl() at kern_ioctl+0x102
ioctl() at ioctl+0xfd
amd64_syscall() at amd64_syscall+0x1f4(kgdb) where(kgdb) where
Xfast_syscall() at Xfast_syscall+0xfc
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x180fd0c1c, rsp =
0x7fffffffce48, rbp = 0x7fffffffcf50 ---
KDB: enter: panic

This process panics on dsl_prop_unregister of "aclinherit" during a
rollback:
39024 38985 81834     0  R+      CPU 0                 initial thread

(kgdb) where
#9  0xffffffff804328fb in kdb_enter (why=0xffffffff8069543a "panic",
msg=0xa <Address 0xa out of bounds>) at cpufunc.h:63
#10 0xffffffff803ff367 in panic (fmt=<value optimized out>) at
/usr/src/sys/kern/kern_shutdown.c:616
#11 0xffffffff80d5712a in assfail (a=<value optimized out>, f=<value
optimized out>, l=<value optimized out>)    at
/usr/src/sys/modules/opensolaris/../../cddl/compat/opensolaris/kern/opensolaris_cmn_err.c:81
#12 0xffffffff80cbd0b8 in zfs_unregister_callbacks
(zfsvfs=0xffffff01f5664000)    at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1278
#13 0xffffffff80cbd319 in zfsvfs_teardown (zfsvfs=0xffffff01f5664000,
unmounting=0)    at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1919
#14 0xffffffff80cbd42b in zfs_suspend_fs (zfsvfs=0xffffff01f5664000)  at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:2225
#15 0xffffffff80cb0364 in zfs_ioc_rollback (zc=<value optimized out>) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:3565
#16 0xffffffff80cb4abf in zfsdev_ioctl (dev=<value optimized out>,
zcmd=<value optimized out>, arg=0xffffff00720a81a0 "\003",
flag=3,td=<value optimized out>)    at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:5936
#17 0xffffffff8037703b in devfs_ioctl_f (fp=0xffffff0008cce0f0,
com=3222821401, data=<value optimized out>,
    cred=<value optimized out>, td=0xffffff02439d5000) at
/usr/src/sys/fs/devfs/devfs_vnops.c:695
#18 0xffffffff80444902 in kern_ioctl (td=<value optimized out>,
fd=<value optimized out>, com=3222821401,
    data=0xffffff00720a81a0 "\003") at file.h:277
#19 0xffffffff80444b3d in ioctl (td=0xffffff02439d5000,
uap=0xffffff84859e1bb0) at /usr/src/sys/kern/sys_generic.c:679
#20 0xffffffff805db824 in amd64_syscall (td=0xffffff02439d5000,
traced=0) at subr_syscall.c:114
#21 0xffffffff805c2e9c in Xfast_syscall () at
/usr/src/sys/amd64/amd64/exception.S:387

This process has just done dsl_prop_register "aclinherit" and sleeps:
56874     1 56874     0  Ss      zio->io_ 0xffffff02f41a8e48 mountd

(kgdb) thread 304
[Switching to thread 304 (Thread 100357)]#0  sched_switch
(td=0xffffff0013717000, newtd=<value optimized out>,
    flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1932
1932                    cpuid = PCPU_GET(cpuid);
(kgdb) where
#0  sched_switch (td=0xffffff0013717000, newtd=<value optimized out>,
flags=<value optimized out>)
    at /usr/src/sys/kern/sched_ule.c:1932
#1  0xffffffff80407d16 in mi_switch (flags=260, newtd=0x0) at
/usr/src/sys/kern/kern_synch.c:466
#2  0xffffffff8043e5c2 in sleepq_wait (wchan=0xffffff02c000ee48, pri=0)
at /usr/src/sys/kern/subr_sleepqueue.c:613
#3  0xffffffff803b9971 in _cv_wait (cvp=0xffffff02c000ee48,
lock=0xffffff02c000ee28) at /usr/src/sys/kern/kern_condvar.c:139
#4  0xffffffff80c9e9f1 in zio_wait (zio=0xffffff02c000eb10)  at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1332
#5  0xffffffff80c35588 in dbuf_read (db=0xffffff01f7dbca80, zio=0x0,
flags=10)  at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:639
#6  0xffffffff80c3bcc0 in dmu_buf_hold (os=<value optimized out>,
object=<value optimized out>, offset=<value optimized out>,
    tag=0x0, dbp=0xffffff8484d1b150, flags=<value optimized out>) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:151
#7  0xffffffff80c8f4f8 in zap_lockdir (os=0xffffff0013274000, obj=2558,
tx=0x0, lti=RW_READER, fatreader=1, adding=0,zapp=0xffffff8484d1b1b0) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:467
#8  0xffffffff80c90b15 in zap_lookup_norm (os=<value optimized out>,
zapobj=<value optimized out>, name=0xffffffff80d31204 "aclinherit",
integer_size=8, num_integers=1, buf=0xffffff8484d1b340, mt=MT_EXACT,
realname=0x0, rn_len=0, ncp=0x0) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:788
#9  0xffffffff80c90cbe in zap_lookup (os=<value optimized out>,
zapobj=<value optimized out>, name=<value optimized out>,
    integer_size=<value optimized out>, num_integers=<value optimized
out>, buf=<value optimized out>) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:773
9  0xffffffff80c90cbe in zap_lookup (os=<value optimized out>,
zapobj=<value optimized out>, name=<value optimized out>,
    integer_size=<value optimized out>, num_integers=<value optimized
out>, buf=<value optimized out>)
    at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:773
#10 0xffffffff80c5e6d0 in dsl_prop_get_dd (dd=0xffffff03d070c800,
propname=0xffffffff80d31204 "aclinherit", intsz=8, numints=1,
    buf=0xffffff8484d1b340, setpoint=0x0, snapshot=0)  at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c:107
#11 0xffffffff80c5e90b in dsl_prop_get_ds (ds=0xffffff0126912c00,
propname=0xffffffff80d31204 "aclinherit", intsz=8, numints=1,
    buf=0xffffff8484d1b340, setpoint=0x0)  at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c:216
#12 0xffffffff80c5f496 in dsl_prop_register (ds=0xffffff0126912c00,
propname=0xffffffff80d31204 "aclinherit",    callback=0xffffffff80cbca40
<acl_inherit_changed_cb>, cbarg=0xffffff01f5664000) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c:239
#13 0xffffffff80cbdd6b in zfs_register_callbacks (vfsp=<value optimized
out>) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:518
#14 0xffffffff80cc0048 in zfs_mount (vfsp=0xffffff03e49398d0) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1706
#15 0xffffffff8048d0a7 in vfs_donmount (td=<value optimized out>,
fsflags=268505104, fsoptions=0xffffff03b8026c00)
    at /usr/src/sys/kern/vfs_mount.c:1011
#16 0xffffffff8048dc53 in nmount (td=0xffffff0013717000,
uap=0xffffff8484d1bbb0) at /usr/src/sys/kern/vfs_mount.c:447
#17 0xffffffff805db824 in amd64_syscall (td=0xffffff0013717000,
traced=0) at subr_syscall.c:114
#18 0xffffffff805c2e9c in Xfast_syscall () at
/usr/src/sys/amd64/amd64/exception.S:387

(kgdb) f 15
#15 0xffffffff8048d0a7 in vfs_donmount (td=<value optimized out>,
fsflags=268505104, fsoptions=0xffffff03b8026c00)
    at /usr/src/sys/kern/vfs_mount.c:1011
1011            error = VFS_MOUNT(mp);
(kgdb) info local
optlist = (struct vfsoptlist *) 0xffffff008edc20b0
opt = <value optimized out>
tmp_opt = <value optimized out>
fstype = 0xffffff008eda9050 "zfs"
fspath = 0xffffff03011760c0 "/home/poudriere/data/build/j8sp32-PT2/01"
errmsg = 0xffffff03abfae800 ""
error = <value optimized out>
fstypelen = 4
fspathlen = 41
errmsg_len = 255
errmsg_pos = 5
has_rw = 0
has_noro = 0

I can give more information from ddb output or kernel dump.

-- 
Andreas Longwitz
Comment 3 James TD Smith 2013-07-29 16:17:23 UTC
On 2013-07-25 23:52:54(+0200), Andreas Longwitz wrote:
> Do you have mountd running ?

Yes.
 
> I have a similiar configuration and the same problem with
> 
>  FreeBSD dsssrvt4.incore 8.4-STABLE FreeBSD 8.4-STABLE #0 r253040:amd64.
> 
> and found the panic is caused by mountd. This daemon starts working on
> every successfull mount (sends signal 1 to mountd) and tries to delete
> all existing exports for all local file systems using the flag
> MNT_DELEXPORT. mountd calls nmount() and sometimes I see messages like
> 
> mountd[56874]: can't delete exports for
> /home/poudriere/data/build/j8sp32-PT2/02/new_packages: Invalid argument
> 
> and once a week I have a panic:

I think I also had errors from mountd around the time of the panics but I
haven't had one for a while (mainly due to reducing the frequency of poudriere
runs) so I can't be sure. I will try disabling mountd and see if that eliminates
the problem, all my NFS clients are using NFSv4 so I don't need it running.

James

-- 
}-<James TD Smith>-<email/ahktenzero@mohorovi.cc>-{
Comment 4 longwitz 2013-08-12 11:28:17 UTC
In the meantime I did some more analysis of the problem and can explain
why the panic happens. Two threads are involved in the problem, one runs
the zfs command and wants to do a rollback using an ioctl() on /dev/zfs,
the other runs mountd and tries to do an "unmount exports" with
"export.ex_flags = MNT_DELEXPORT". Both threads are working on the same
dataset ds=0xffffff0126912c00.

The rollback thread wants to unregister his aclinherit property and
panics, because this property does not exist in the list of properties
anymore:

(kgdb) f 12
#12 0xffffffff80cbd0b8 in zfs_unregister_callbacks
(zfsvfs=0xffffff01f5664000) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1278
1278      VERIFY(dsl_prop_unregister(ds, "aclinherit",
(kgdb) l
1273            zfsvfs) == 0);
1274
1275      VERIFY(dsl_prop_unregister(ds, "aclmode", acl_mode_changed_cb,
1276            zfsvfs) == 0);
1277
1278      VERIFY(dsl_prop_unregister(ds, "aclinherit",
1279            acl_inherit_changed_cb, zfsvfs) == 0);
1280
1281      VERIFY(dsl_prop_unregister(ds, "vscan",
1282            vscan_changed_cb, zfsvfs) == 0);

The mountd thread wants to register some properties, but he first
unregisters everything (see comment in the source):

(kgdb) f 14
#14 0xffffffff80cc0048 in zfs_mount (vfsp=0xffffff03e49398d0) at
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1706
1706                    error = zfs_register_callbacks(vfsp);
(kgdb) list
1700  * When doing a remount, we simply refresh our temporary properties
1701  * according to those options set in the current VFS options.
1702  */
1703       if (vfsp->vfs_flag & MS_REMOUNT) {
1704               /* refresh mount options */
1705               zfs_unregister_callbacks(vfsp->vfs_data);
1706               error = zfs_register_callbacks(vfsp);
1707               goto out;
1708       }
1709

There is a little time gap between line 1705 and 1706 where the property
aclinherit is not registered. If another thread tries to unregister this
property during this gap he will panic.

I don't know how to fix this proper. Simple to remove the VERIFY on
dsl_prop_unregister() is easy, but I hope that one of the ZFS gurus will
have a look at this and we will get a better solution.

-- 
Andreas Longwitz
Comment 5 Andriy Gapon freebsd_committer freebsd_triage 2013-10-09 16:20:05 UTC
Please see the following suggestion:
http://article.gmane.org/gmane.os.illumos.zfs/2833
Will you be able to work on a patch that implements the idea?

-- 
Andriy Gapon
Comment 6 longwitz 2013-10-11 00:26:56 UTC
Andriy Gapon wrote
> Please see the following suggestion:
> http://article.gmane.org/gmane.os.illumos.zfs/2833
> Will you be able to work on a patch that implements the idea?

Sorry, but that is out of my scope.

Andreas Longwitz
Comment 7 spork 2015-06-01 08:30:13 UTC
Created attachment 157329 [details]
console screenshot of panic
Comment 8 Andriy Gapon freebsd_committer freebsd_triage 2015-06-16 11:25:11 UTC
Created attachment 157799 [details]
proposed patch for testing

commit 7fb40408aea1be27984cc25dd80e707c9dbc2f05
Author: Andriy Gapon <avg@icyb.net.ua>
If anyone is still running into the issue, could you please test the attached patch?

    zfs_mount(MS_REMOUNT): protect zfs_(un)register_callbacks calls
    
    We now take z_teardown_lock as a writer to ensure that there is no
    I/O while the filesystem state is in a flux.
    Also, zfs_suspend_fs() -> zfsvfs_teardown() call zfs_unregister_callbacks()
    and zfs_resume_fs() -> zfsvfs_setup() call zfs_unregister_callbacks().
    Previously there was no synchronization between those calls and the calls
    in the re-mounting case.  That could lead to concurrent execution
    and a crash.
    
    PR:		180060
    Suggested by:		mahrens
Comment 9 Andriy Gapon freebsd_committer freebsd_triage 2015-06-25 11:13:39 UTC
See review D2865
Comment 10 commit-hook freebsd_committer freebsd_triage 2015-07-02 08:32:52 UTC
A commit references this bug:

Author: avg
Date: Thu Jul  2 08:32:03 UTC 2015
New revision: 285021
URL: https://svnweb.freebsd.org/changeset/base/285021

Log:
  zfs_mount(MS_REMOUNT): protect zfs_(un)register_callbacks calls

  We now take z_teardown_lock as a writer to ensure that there is no I/O
  while the filesystem state is in a flux.  Also, zfs_suspend_fs() ->
  zfsvfs_teardown() call zfs_unregister_callbacks() and zfs_resume_fs() ->
  zfsvfs_setup() call zfs_unregister_callbacks().  Previously there was no
  synchronization between those calls and the calls in the re-mounting
  case.  That could lead to concurrent execution and a crash.

  PR:		180060
  Differential Revision:	https://reviews.freebsd.org/D2865
  Suggested by:	mahrens
  Reviewed by:	delphij, pho, mahrens, will
  MFC after:	13 days
  Sponsored by:	ClusterHQ

Changes:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
Comment 11 Andriy Gapon freebsd_committer freebsd_triage 2015-09-11 13:28:15 UTC
Should be fixed now.