Bug 203864 - ZFS deadlock between zfs send, zfs rename and ctrl-C
Summary: ZFS deadlock between zfs send, zfs rename and ctrl-C
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any
OS: Any
Importance: --- Affects Some People
Assignee: Andriy Gapon
URL: https://reviews.freebsd.org/D23478
Keywords: dogfood, needs-qa
Depends on:
Blocks:
 
Reported: 2015-10-19 07:12 UTC by Peter Wemm
Modified: 2020-09-15 08:54 UTC
13 users

See Also:


Attachments
procstat -kk -a output (67.88 KB, text/plain)
2016-02-21 12:45 UTC, roel

Description Peter Wemm freebsd_committer freebsd_triage 2015-10-19 07:12:59 UTC
I tried to abort a zfs send:

07:06:26   2.28G   zroot/j/jails/admin-svn@daily.31
07:06:27   2.29G   zroot/j/jails/admin-svn@daily.31
^C
load: 1.27  cmd: zfs 72318 [tx->tx_sync_done_cv] 101.46r 0.00u 1.89s 0% 4212k
load: 0.84  cmd: zfs 72318 [tx->tx_sync_done_cv] 127.51r 0.00u 1.89s 0% 4212k
load: 0.84  cmd: zfs 72318 [tx->tx_sync_done_cv] 127.98r 0.00u 1.89s 0% 4212k

It turns out the freebsd-snapshot cron job was running at that very instant and it led to a deadlock:

# ps -axl | grep tx
  0 74866 72290   0  20  0  40336   3528 tx->tx_s D     -     0:00.00 zfs rename zroot/j/jails/monitor@hourly.15 zroot/j/jails/monitor@hourly.16
  0 72318 69290   0  20  0  42520   4228 tx->tx_s DE+   0     0:01.89 zfs send -Rv zroot/j/jails/admin-svn@sync.00

I've seen this before but never quite at such a bad time.  Neither can be kill -9'ed.
Comment 1 Peter Wemm freebsd_committer freebsd_triage 2015-10-19 07:16:46 UTC
FWIW, the machine couldn't reboot either.  It required a break to ddb and 'call cpu_reset'.
Comment 2 Enji Cooper freebsd_committer freebsd_triage 2015-10-19 07:25:03 UTC
uname -a?

FWIW, I ran into ZFS deadlocks with r288943 when running tests/sys/acl. When I upgraded to r289441, this went away.

I was running into apparent deadlocks with the spa namespace lock though, not tx->tx_sync_done_cv.

mav's modifications to zfs send/recv might need to be fixed.
Comment 3 Peter Wemm freebsd_committer freebsd_triage 2015-10-19 07:29:37 UTC
It's later than the revision you indicated:

FreeBSD halo.ysv.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r289488: Sun Oct 18 06:47:35 UTC 2015     peter@build-11.freebsd.org:/usr/obj/usr/src/sys/CLUSTER11  amd64
Comment 4 Andriy Gapon freebsd_committer freebsd_triage 2015-10-19 07:47:34 UTC
https://wiki.freebsd.org/AvgZfsDeadlockDebug
Comment 5 roel 2016-02-21 12:45:21 UTC
Created attachment 167247 [details]
procstat -kk -a output

output from procstat -kk -a at the moment of the deadlock
Comment 6 roel 2016-02-21 12:47:25 UTC
We've run into this exact same issue using the zrep script. It's easily reproducible. The deadlock occurs when a "zfs rename" is executed while there is still a "zfs send" process running.

In zrep, this occurs when a zfs receive on the remote host fails and zrep immediately attempts to rename a snapshot. However, the zfs send process is still running (cleaning up?) and this causes an immediate deadlock.

Any zfs process run after that immediately blocks in state "spa_namespace_lock". I/O continues to work, but fails at some point in the future. I've not diagnosed the exact trigger that causes I/O to seize up.

The only resolution is to power cycle the system.

I've just attached the output of a procstat -kk -a from yesterday when I ran into this.

Running on 10.3-PRERELEASE FreeBSD 10.3-PRERELEASE #1 r294572
Comment 7 Andriy Gapon freebsd_committer freebsd_triage 2016-02-21 15:00:13 UTC
(In reply to roel from comment #6)
It would be interesting to run kgdb and examine spa_namespace_lock. Forcing a crash dump for post-mortem examination is probably more useful because it would be easier to answer any possible follow-up questions.
Comment 8 Cassiano Peixoto 2016-06-30 23:56:43 UTC
(In reply to roel from comment #6)
Hi guys,

Any news about this issue? I have the same issue when running the zrep script, as roel said. Roel, did you find any workaround for this?

Thanks.
Comment 9 roel 2016-07-01 11:39:14 UTC
Yes, I just introduced a short sleep in the script as a workaround.
Comment 10 Cassiano Peixoto 2016-07-01 15:40:09 UTC
(In reply to roel from comment #9)
Hi Roel,

Can you send the patch to my email address? I'd like to apply it. I think it would be a good idea to submit a patch to the zrep port.

The worst thing about this issue is that when it happens I have to force a server reboot, and when it comes back I have lost all data written since the issue began. It's unbelievable, I know, but it's happening.
Comment 11 roel 2016-07-01 16:07:40 UTC
That's not that hard:

*** /home/roel/zrep     Wed Jan 13 22:12:27 2016
--- zrep        Sun Feb 21 13:24:35 2016
***************
*** 1160,1164 ****
        if [[ $? -ne 0 ]] ; then
                zfs rename ${newsnap} ${newsnap}_unsent
                zrep_errquit Problem doing sync for $newsnap. renamed to ${newsnap}_unsent
        fi
---
        if [[ $? -ne 0 ]] ; then
+               echo "Error on send.. waiting 60 seconds to prevent deadlock"
+               sleep 60
                zfs rename ${newsnap} ${newsnap}_unsent
                zrep_errquit Problem doing sync for $newsnap. renamed to ${newsnap}_unsent
        fi
Comment 12 Cassiano Peixoto 2016-07-04 12:46:01 UTC
(In reply to roel from comment #11)
Hi Roel,

Thanks for sending it to me. Do you know of a way to simulate this issue? I tried just taking my second server down and running zrep, but that didn't work. Thanks.
Comment 13 Andriy Gapon freebsd_committer freebsd_triage 2016-07-08 14:03:48 UTC
I think that I've been able to reproduce this problem or, at least, something that looks very much like it.  I did the standard procstat debugging and I noticed something that did not appear in any of the previous reports:

    6 100572 zfskern          txg_thread_enter mi_switch+0x167 sleepq_switch+0xe7 sleepq_wait+0x43 _sx_xlock_hard+0x49d _sx_xlock+0xc5 zvol_rename_minors+0x104 dsl_dataset_rename_snapshot_sync_impl+0x308 dsl_dataset_rename_snapshot_sync+0xc1 dsl_sync_task_sync+0xef dsl_pool_sync+0x45b spa_sync+0x7c7 txg_sync_thread+0x383 fork_exit+0x84 fork_trampoline+0xe

 1226 100746 zfs              -                mi_switch+0x167 sleepq_switch+0xe7 sleepq_wait+0x43 _cv_wait+0x1e4 txg_wait_synced+0x13b dsl_sync_task+0x205 dsl_dataset_user_release_impl+0x1cf dsl_dataset_user_release_onexit+0x86 zfs_onexit_destroy+0x56 zfsdev_close+0x88 devfs_destroy_cdevpriv+0x8b devfs_close_f+0x65 _fdrop+0x1a closef+0x200 closefp+0xa3 amd64_syscall+0x2db Xfast_syscall+0xfb

 1228 100579 zfs              -                mi_switch+0x167 sleepq_switch+0xe7 sleepq_wait+0x43 _cv_wait+0x1e4 txg_wait_synced+0x13b dsl_sync_task+0x205 dsl_dataset_rename_snapshot+0x3a zfs_ioc_rename+0x157 zfsdev_ioctl+0x635 devfs_ioctl_f+0x156 kern_ioctl+0x246 sys_ioctl+0x171 amd64_syscall+0x2db Xfast_syscall+0xfb

Thread 100746 is the culprit.  zfsdev_close() holds spa_namespace_lock and then calls dsl_sync_task() -> txg_wait_synced().  On the other hand, the sync thread (100572) gets stuck on spa_namespace_lock in a call to zvol_rename_minors().

My opinion is that the sync thread must never try to take spa_namespace_lock.  The problem seems to have been introduced quite a while ago, in base r219317.  Some later commits, like base r272474, followed the same pattern.  The problem is certainly FreeBSD-specific, as illumos handles ZVOL names in a very different manner.

Also, the problem is rather deep-rooted and at the moment I do not see any easy way to fix it without breaking ZVOL name tracking.
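
To make the lock interaction easier to follow, here is a minimal userland sketch of the shape of this deadlock, written with POSIX threads.  All names are invented stand-ins for the kernel primitives (spa_namespace_lock, tx->tx_sync_done_cv, zfsdev_close(), the txg sync thread); this only models the interaction, it is not the actual kernel code.

/*
 * "closer" stands in for the zfs process stuck in zfsdev_close(),
 * "syncer" for the txg sync thread.  Illustration only.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t namespace_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t tx_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t tx_sync_done_cv = PTHREAD_COND_INITIALIZER;
static bool synced = false;

static void *
closer(void *arg)
{
	/* zfsdev_close() takes the namespace lock... */
	pthread_mutex_lock(&namespace_lock);
	/* ...and then blocks in txg_wait_synced() until the syncer signals. */
	pthread_mutex_lock(&tx_lock);
	while (!synced)
		pthread_cond_wait(&tx_sync_done_cv, &tx_lock);
	pthread_mutex_unlock(&tx_lock);
	pthread_mutex_unlock(&namespace_lock);
	return (NULL);
}

static void *
syncer(void *arg)
{
	sleep(1);	/* let "closer" grab the namespace lock first */

	/* zvol_rename_minors() needs the namespace lock held by "closer"... */
	pthread_mutex_lock(&namespace_lock);
	pthread_mutex_unlock(&namespace_lock);

	/* ...so the wakeup below is never delivered: deadlock. */
	pthread_mutex_lock(&tx_lock);
	synced = true;
	pthread_cond_broadcast(&tx_sync_done_cv);
	pthread_mutex_unlock(&tx_lock);
	return (NULL);
}

int
main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, closer, NULL);
	pthread_create(&t2, NULL, syncer, NULL);
	pthread_join(t1, NULL);		/* never returns */
	pthread_join(t2, NULL);
	printf("not reached\n");
	return (0);
}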

P.S.
A bit of information from ddb:
db> p spa_namespace_lock
ffffffff822b1ee0
db> show lock 0xffffffff822b1ee0
 class: sx
 name: spa_namespace_lock
 state: XLOCK: 0xfffff8001da60500 (tid 100746, pid 1226, "zfs")
 waiters: exclusive
db> thread 100746
[ thread pid 1226 tid 100746 ]
sched_switch+0x48a:     movl    %gs:0x34,%eax
db> bt
Tracing pid 1226 tid 100746 td 0xfffff8001da60500
sched_switch() at sched_switch+0x48a/frame 0xfffffe004def4590
mi_switch() at mi_switch+0x167/frame 0xfffffe004def45c0
sleepq_switch() at sleepq_switch+0xe7/frame 0xfffffe004def4600
sleepq_wait() at sleepq_wait+0x43/frame 0xfffffe004def4630
_cv_wait() at _cv_wait+0x1e4/frame 0xfffffe004def4690
txg_wait_synced() at txg_wait_synced+0x13b/frame 0xfffffe004def46d0
dsl_sync_task() at dsl_sync_task+0x205/frame 0xfffffe004def4790
dsl_dataset_user_release_impl() at dsl_dataset_user_release_impl+0x1cf/frame 0xfffffe004def4910
dsl_dataset_user_release_onexit() at dsl_dataset_user_release_onexit+0x86/frame 0xfffffe004def4940
zfs_onexit_destroy() at zfs_onexit_destroy+0x56/frame 0xfffffe004def4970
zfsdev_close() at zfsdev_close+0x88/frame 0xfffffe004def4990
devfs_destroy_cdevpriv() at devfs_destroy_cdevpriv+0x8b/frame 0xfffffe004def49b0
devfs_close_f() at devfs_close_f+0x65/frame 0xfffffe004def49e0
_fdrop() at _fdrop+0x1a/frame 0xfffffe004def4a00
closef() at closef+0x200/frame 0xfffffe004def4a90
closefp() at closefp+0xa3/frame 0xfffffe004def4ae0
amd64_syscall() at amd64_syscall+0x2db/frame 0xfffffe004def4bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe004def4bf0
--- syscall (6, FreeBSD ELF64, sys_close), rip = 0x8013f996a, rsp = 0x7fffffffd438, rbp = 0x7fffffffd450 ---
Comment 14 Steven Hartland freebsd_committer freebsd_triage 2016-07-08 14:32:32 UTC
Your conclusion seems valid there, Andriy, but as you say there doesn't seem to be an easy way to fix it.
Comment 15 Andriy Gapon freebsd_committer freebsd_triage 2016-07-09 05:58:42 UTC
(In reply to Steven Hartland from comment #14)
One possible approach is to move all the "minors" manipulations out of the sync thread and back into the ZFS ioctl handlers, the only place they were originally called from.
This sounds rather straightforward, but could turn out to be a bit laborious.
Comment 16 Andriy Gapon freebsd_committer freebsd_triage 2016-07-09 06:02:44 UTC
(In reply to Steven Hartland from comment #14)
Another possibility is to apply all zvol changes asynchronously, either by queuing them or by having a dedicated thread scan for them.  The former should be almost trivial to implement.
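
A very rough userland sketch of the queuing variant, assuming a single dedicated thread applies the queued changes in order.  All names (zvol_task, zvol_async_*) are made up for illustration and do not correspond to any existing code:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct zvol_task {
	struct zvol_task *next;
	char op[16];		/* "create", "rename", "remove", ... */
	char name[256];		/* dataset the operation applies to */
};

static struct zvol_task *head, *tail;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cv = PTHREAD_COND_INITIALIZER;

/* Called from the sync context: only ever blocks on the short queue lock. */
static void
zvol_async_enqueue(const char *op, const char *name)
{
	struct zvol_task *t = calloc(1, sizeof(*t));

	strlcpy(t->op, op, sizeof(t->op));
	strlcpy(t->name, name, sizeof(t->name));
	pthread_mutex_lock(&q_lock);
	if (tail != NULL)
		tail->next = t;
	else
		head = t;
	tail = t;
	pthread_cond_signal(&q_cv);
	pthread_mutex_unlock(&q_lock);
}

/* Dedicated thread: the only context that touches the lock-heavy zvol state. */
static void *
zvol_async_thread(void *arg)
{
	struct zvol_task *t;

	for (;;) {
		pthread_mutex_lock(&q_lock);
		while (head == NULL)
			pthread_cond_wait(&q_cv, &q_lock);
		t = head;
		head = t->next;
		if (head == NULL)
			tail = NULL;
		pthread_mutex_unlock(&q_lock);

		/* Here the real code would create/rename/destroy the minor. */
		printf("applying zvol %s for %s\n", t->op, t->name);
		free(t);
	}
	return (NULL);
}

int
main(void)
{
	pthread_t td;

	pthread_create(&td, NULL, zvol_async_thread, NULL);
	zvol_async_enqueue("rename", "tank/vol0@old");
	zvol_async_enqueue("create", "tank/vol1");
	sleep(1);	/* give the worker a moment to drain the queue */
	return (0);
}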
Comment 17 Andriy Gapon freebsd_committer freebsd_triage 2016-07-09 11:08:52 UTC
Another observation is that the current code seems to be broken in a couple of places as well.

1. dsl_dataset_rename_snapshot_sync_impl() uses ddrsa_fsname to construct the old and new snapshot names, but this would be wrong in the case of a recursive snapshot rename (zfs rename -r x@snap1 x@snap2), as ddrsa_fsname always points to the name of the top-level filesystem.

2. dsl_dataset_promote_sync(): as far as I can see, oldname is never populated, so the code can do arbitrary renames.
Comment 18 Andriy Gapon freebsd_committer freebsd_triage 2016-07-10 17:10:58 UTC
And another deadlock of the same nature, but with an even more obvious lock order reversal (LOR):

 5087 100846 zfs              -                mi_switch+0x167 sleepq_switch+0xe7 sleepq_wait+0x43 _cv_wait+0x1e4 rrw_enter_read_impl+0x9b dsl_pool_hold+0x77 dmu_objset_own+0x31 zvol_create_minor+0x16f zvol_create_minors+0x9d zfs_ioc_create+0x3df zfsdev_ioctl+0x65c devfs_ioctl_f+0x156 kern_ioctl+0x246 sys_ioctl+0x171 amd64_syscall+0x2db Xfast_syscall+0xfb
    6 100572 zfskern          txg_thread_enter mi_switch+0x167 sleepq_switch+0xe7 sleepq_wait+0x43 _sx_xlock_hard+0x49d _sx_xlock+0xc5 zvol_rename_minors+0x104 dsl_dir_rename_sync+0x406 dsl_sync_task_sync+0xef dsl_pool_sync+0x45b spa_sync+0x7c7 txg_sync_thread+0x383 fork_exit+0x84 fork_trampoline+0xe


Thread 100846 took spa_namespace_lock in zvol_create_minor() and wants to get dp_config_rwlock (as a reader) in dsl_pool_hold().
Thread 100572 took dp_config_rwlock (as a writer) in dsl_sync_task_sync() and wants to take spa_namespace_lock in zvol_rename_minors().
Comment 19 Andriy Gapon freebsd_committer freebsd_triage 2016-07-10 17:14:21 UTC
The deadlock from comment #13 can happen even if there are no zvols in a pool at all, while the deadlock from comment #18 happens when a zvol is created.
Comment 20 Andriy Gapon freebsd_committer freebsd_triage 2016-07-11 11:44:15 UTC
I've coded the suggestion from comment #15 here https://reviews.freebsd.org/D7179
The change should also fix bugs described in comment #17.
Comment 21 Laurie Odgers 2016-07-14 07:34:02 UTC
Hi All
We are seeing the same problem using NPM 6.2.2. This has been a problem for some time, but seems to pop up randomly. We have run for months without this error happening; however, today it happens repeatedly and is relatively reproducible.
Procstat trace below.


# procstat -kk 12590
 PID    TID COMM             TDNAME           KSTACK                       
12590 101356 node             -                mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d kern_kevent_fp+0x399 kern_kevent+0x9f sys_kevent+0x12a amd64_syscall+0x40f Xfast_syscall+0xfb 
12590 101358 node             -                mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f Xfast_syscall+0xfb 
12590 101359 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f Xfast_syscall+0xfb 
12590 101360 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f Xfast_syscall+0xfb 
12590 101361 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f Xfast_syscall+0xfb 
12590 101362 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f Xfast_syscall+0xfb 
12590 101363 node             -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0x91a vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 vget+0x73 cache_lookup+0x5d5 vfs_cache_lookup+0xac VOP_LOOKUP_APV+0xa1 lookup+0x5a1 namei+0x4d4 kern_renameat+0x1b7 amd64_syscall+0x40f Xfast_syscall+0xfb 
12590 101364 node             -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0xca0 vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 vputx+0x21f
Comment 22 Andriy Gapon freebsd_committer freebsd_triage 2016-07-14 07:39:32 UTC
(In reply to Laurie Odgers from comment #21)
I do not think that you are seeing the same problem. The data that you provided is far too incomplete to tell, but it is more likely that you are running into bug #209158. Or something else altogether.
Comment 23 Laurie Odgers 2016-07-14 23:09:45 UTC
(In reply to Andriy Gapon from comment #22)

Yes, apologies; I had both threads open and accidentally posted in this one.
Comment 24 Anthoine Bourgeois 2018-03-15 17:54:42 UTC
I updated Andriy's patch with zvol update ordering. The patch is here: https://reviews.freebsd.org/D14669
Comment 25 Andriy Gapon freebsd_committer freebsd_triage 2020-02-04 07:02:05 UTC
New version of a prospective solution: https://reviews.freebsd.org/D23478
Comment 26 commit-hook freebsd_committer freebsd_triage 2020-06-11 10:42:19 UTC
A commit references this bug:

Author: avg
Date: Thu Jun 11 10:41:34 UTC 2020
New revision: 362047
URL: https://svnweb.freebsd.org/changeset/base/362047

Log:
  rework how ZVOLs are updated in response to DSL operations

  With this change all ZVOL updates are initiated from the SPA sync
  context instead of a mix of the sync and open contexts.  The updates are
  queued to be applied by a dedicated thread in the original order.  This
  should ensure that ZVOLs always accurately reflect the corresponding
  datasets.  ZFS ioctl operations wait on the mentioned thread to complete
  its work.  Thus, the illusion of the synchronous ZVOL update is
  preserved.  At the same time, the SPA sync thread never blocks on ZVOL
  related operations avoiding problems like reported in bug 203864.

  This change is based on earlier work in the same direction: D7179 and
  D14669 by Anthoine Bourgeois.  D7179 tried to perform ZVOL operations
  in the open context and that opened races between them.  D14669 uses a
  design very similar to this change but with different implementation
  details.

  This change also heavily borrows from similar code in ZoL, but there are
  many differences too.  See:
  - https://github.com/zfsonlinux/zfs/commit/a0bd735adb1b1eb81fef10b4db102ee051c4d4ff
  - https://github.com/zfsonlinux/zfs/issues/3681
  - https://github.com/zfsonlinux/zfs/issues/2217

  PR:		203864
  MFC after:	5 weeks
  Sponsored by:	CyberSecure
  Differential Revision: https://reviews.freebsd.org/D23478

Changes:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_destroy.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa_impl.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zvol.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
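
To make the design in the commit message a little more concrete, here is a small userland sketch of the "synchronous illusion" part: each queued update carries a sequence number, and the ioctl path waits until the dedicated thread reports that everything up to that number has been applied.  The names and structure are illustrative only and are not taken from the actual change.

#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done_cv = PTHREAD_COND_INITIALIZER;
static uint64_t queued_seq;	/* last sequence number handed out */
static uint64_t applied_seq;	/* last sequence number the worker finished */

/* Sync context: record the change; never blocks on zvol state itself. */
uint64_t
zvol_update_enqueue(void)
{
	uint64_t seq;

	pthread_mutex_lock(&q_lock);
	seq = ++queued_seq;
	/* ...append the actual task to a queue and wake the worker here... */
	pthread_mutex_unlock(&q_lock);
	return (seq);
}

/* Dedicated thread, once it has applied the task with this number. */
void
zvol_update_applied(uint64_t seq)
{
	pthread_mutex_lock(&q_lock);
	applied_seq = seq;
	pthread_cond_broadcast(&done_cv);
	pthread_mutex_unlock(&q_lock);
}

/* ioctl path: block until every update it triggered has been applied. */
void
zvol_update_wait(uint64_t seq)
{
	pthread_mutex_lock(&q_lock);
	while (applied_seq < seq)
		pthread_cond_wait(&done_cv, &q_lock);
	pthread_mutex_unlock(&q_lock);
}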
Comment 27 Andriy Gapon freebsd_committer freebsd_triage 2020-09-15 08:53:51 UTC
Merged to stable/12 in base r364917.