Bug 229887 - zfs quota related panic: solaris assert: 0 == dmu_tx_assign(tx, TXG_WAIT),
Summary: zfs quota related panic: solaris assert: 0 == dmu_tx_assign(tx, TXG_WAIT),
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Alexander Motin
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-19 10:46 UTC by Harald Schmalzbauer
Modified: 2018-08-22 16:34 UTC
Description Harald Schmalzbauer 2018-07-19 10:46:41 UTC
Hello,

today my PC locked up while using local Xorg.
After hard reset, I got the following panic: solaris assert: 0 == dmu_tx_assign(tx, TXG_WAIT), file: /usr/local/share/deploy-tools/HEAD/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c, line: 330

assfail() at assfail+0x1a/frame 0xfffffe008cd842e0
zfs_unlinked_drain() at zfs_unlinked_drain+0x175/frame 0xfffffe008cd844b0
zfsvfs_setup() at zfsvfs_setup+0x5b/frame 0xfffffe008cd844e0
zfs_mount() at zfs_mount+0x731/frame 0xfffffe008cd84670
vfs_domount() at vfs_domount+0x730/frame 0xfffffe008cd84890
vfs_donmount() at vfs_donmount+0x807/frame 0xfffffe008cd84940
sys_nmount() at sys_nmount+0x72/frame 0xfffffe008cd84980
amd64_syscall() at amd64_syscall+0x281/frame 0xfffffe008cd84ab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe008cd84ab0
--- syscall (378, FreeBSD ELF64, sys_nmount), rip = 0x80037684a, rsp = 0x7fffffffcc68, rbp = 0x7fffffffcce0 ---


#11 0xffffffff81eb523a in assfail () from /boot/kernel/opensolaris.ko
#12 0xffffffff81b806a5 in zfs_unlinked_drain () from /boot/kernel/zfs.ko
#13 0xffffffff81b997fb in zfsvfs_setup () from /boot/kernel/zfs.ko
#14 0xffffffff81b972a1 in zfs_mount () from /boot/kernel/zfs.ko
#15 0xffffffff808b44f0 in vfs_domount (td=0xfffffe008cd84368, fstype=<value optimized out>, fspath=<value optimized out>, 
    fsflags=<value optimized out>, optlist=0xfffffe008cd848d8) at /usr/local/share/deploy-tools/HEAD/src/sys/kern/vfs_mount.c:892
#16 0xffffffff808b37b7 in vfs_donmount (td=0xfffff800110d2000, fsflags=0, fsoptions=0xfffff8000e16e100)
    at /usr/local/share/deploy-tools/HEAD/src/sys/kern/vfs_mount.c:726
#17 0xffffffff808b2f82 in sys_nmount (td=0xfffff800110d2000, uap=0xfffff800110d23c0)
    at /usr/local/share/deploy-tools/HEAD/src/sys/kern/vfs_mount.c:431
#18 0xffffffff80b39771 in amd64_syscall (td=0xfffff800110d2000, traced=0) at subr_syscall.c:135
#19 0xffffffff80b1586d in fast_syscall_common () at /usr/local/share/deploy-tools/HEAD/src/sys/amd64/amd64/exception.S:500
#20 0x000000080037684a in ?? ()


After I found the suspicious dataset which caused that panic due to quota exhaustion, there was another (single user) panic, which I haven't dumped to swap because this one wasn't saved yet...
So here's just the kdb lines, transcribed manually from a picture I took:

panic: solaris assert: delta > 0 ? dsl_dir_phys(dd)->dd_used_breakdown[oldtype] >= delt :-( missing rest of the line :-(
/dsl_dir.c, line 1564
…
assfail()
dsl_dir_transfer_space() at dsl_dir_transfer_space
dsl_dataset_block_born() at dsl_dataset_block_born
dbuf_write_done() at dbuf_write_done
arc_write_done() at arc_write_done
zio_done() at zio_done
zio_execute() at zio_execute
taskqueue_run_locked() at taskqueue_run_locked
taskqueue_thread_loop() at taskqueue_thread_loop
fork_exit() at fork_exit
fork_trampoline() at fork_trampoline


Since it's a custom kernel, the latter is useless I guess, but in case somebody considers this kind of crash worth fixing, the second, partial trace might point to other places in the same context.

I will try to get to the sources and add a more useful backtrace; I don't have them around and ran out of time now...

Thanks,

-harry
Comment 1 Alexander Motin freebsd_committer freebsd_triage 2018-07-19 14:36:41 UTC
The first panic indeed looks like a combination of r334810 making dmu_tx_assign() errors fatal and a quota overflow triggering such an error.  We need to handle those errors somehow.

I am not sure about the second panic, but I found this interesting comment above zfs_unlinked_add():
 * When dealing with the unlinked set, we dmu_tx_hold_zap(), but we
 * don't specify the name of the entry that we will be manipulating.  We
 * also fib and say that we won't be adding any new entries to the
 * unlinked set, even though we might (this is to lower the minimum file
 * size that can be deleted in a full filesystem).  So on the small
 * chance that the nlink list is using a fat zap (ie. has more than
 * 2000 entries), we *may* not pre-read a block that's needed.
 * Therefore it is remotely possible for some of the assertions
 * regarding the unlinked set below to fail due to i/o error.  On a
 * nondebug system, this will result in the space being leaked.
I am curious whether that could be the case here.
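
To illustrate the first panic, here is a minimal userspace sketch of why asserting on a fallible call turns an over-quota condition into a fatal one. Everything in it is hypothetical: `toy_dataset`, `toy_tx_assign()` and `toy_assert_would_fire()` are stand-ins invented for this example, and returning EDQUOT on quota overflow is assumed for illustration rather than taken from the ZFS sources.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy space accounting for one dataset; hypothetical, not a ZFS structure. */
struct toy_dataset {
	size_t used;	/* bytes currently charged to the dataset */
	size_t quota;	/* quota in bytes */
};

/*
 * Hypothetical stand-in for dmu_tx_assign(): returns 0 when the requested
 * bytes fit under the quota and EDQUOT otherwise (assumed error code).
 */
static int
toy_tx_assign(const struct toy_dataset *ds, size_t bytes)
{
	assert(ds != NULL);
	if (ds->used > ds->quota || bytes > ds->quota - ds->used)
		return (EDQUOT);
	return (0);
}

/*
 * Reports whether an assertion of the form "0 == assign(...)" would fire.
 * In the kernel a firing assertion means a panic rather than an error
 * the caller can recover from.
 */
static bool
toy_assert_would_fire(const struct toy_dataset *ds, size_t bytes)
{
	return (toy_tx_assign(ds, bytes) != 0);
}
```

With a dataset that is already at its quota, `toy_assert_would_fire()` returns true for any non-zero request, which mirrors the situation described in the report: the mount path cannot make progress without tripping the assertion.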
Comment 2 commit-hook freebsd_committer freebsd_triage 2018-08-22 16:33:47 UTC
A commit references this bug:

Author: mav
Date: Wed Aug 22 16:32:53 UTC 2018
New revision: 338206
URL: https://svnweb.freebsd.org/changeset/base/338206

Log:
  Add dmu_tx_assign() error handling in zfs_unlinked_drain().

  The error handling got lost during r334810, while according to the report
  error there may happen in case of dataset being over quota.  In such case
  just leave the node in the unlinked list to be freed sometimes later.

  PR:		229887
  Sponsored by:	iXsystems, Inc.

Changes:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c
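
The behaviour described in the log can be sketched in userspace as follows. This is not the committed patch: `toy_entry`, `toy_tx_assign()` and `toy_unlinked_drain()` are hypothetical names, and skipping a failed entry and continuing with the next one is an assumption; the log only states that the node is left in the unlinked list to be freed later.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical entry in a toy "unlinked set"; not the on-disk ZAP format. */
struct toy_entry {
	size_t bytes_needed;	/* space the freeing transaction must reserve */
	bool freed;		/* set once the entry has been processed */
};

/* Hypothetical stand-in for dmu_tx_assign(); EDQUOT is an assumed error. */
static int
toy_tx_assign(size_t available, size_t bytes)
{
	return (bytes > available ? EDQUOT : 0);
}

/*
 * Walk the set and free every entry whose transaction can be assigned.
 * On an assignment error the entry is left in place instead of asserting,
 * so a later drain can retry it.  Returns the number of entries left behind.
 */
static size_t
toy_unlinked_drain(struct toy_entry *set, size_t n, size_t available)
{
	size_t left = 0;

	assert(set != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (set[i].freed)
			continue;
		if (toy_tx_assign(available, set[i].bytes_needed) != 0) {
			left++;		/* leave it for a later pass */
			continue;
		}
		set[i].freed = true;
	}
	return (left);
}
```

Calling `toy_unlinked_drain()` a second time once more space is available processes the entries that were skipped on the first pass, which is the "freed sometime later" part of the log message.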
Comment 3 Alexander Motin freebsd_committer freebsd_triage 2018-08-22 16:34:59 UTC
I was unable to reproduce the problem, but I believe the committed patch should fix it.