Bug 225337 - z_teardown_inactive_lock held inordinately long
Summary: z_teardown_inactive_lock held inordinately long
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-01-20 07:03 UTC by Garrett Wollman
Modified: 2019-05-07 10:48 UTC
CC List: 1 user

See Also:


Description Garrett Wollman freebsd_committer 2018-01-20 07:03:56 UTC
On one of our large NFS servers, it seems that some process holds zfsvfs->z_teardown_inactive_lock far too long -- on the order of ten minutes or more -- causing all filesystem activity to hang.  The exact same configuration and activity patterns did not have such a hang under 10.3.  I believe from web searches that this lock is implicated in zfs dataset rollback, and consequently in zfs recv -F, but the hang only seems to take place when we have both pull replication (zfs recv) *and* active (through-the-filesystem) backups running at the same time, which usually only happens late at night.  There are no console messages or other indications of faults in the underlying storage system.  The system as a whole becomes completely unusable and our monitoring system raises alarms, but it doesn't actually crash, and whatever is holding the lock eventually does complete without visible errors.

I'm temporarily disabling the replication job to see if that truly is the smoking gun.  Or rather, I'm going to do that once I get access to the filesystem again.

Example, taken from my ssh session over the past hour (these are all waiting for the same shell script to *begin executing*):

load: 0.82  cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 7.42r 0.00u 0.00s 0% 3624k
load: 0.71  cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 23.00r 0.00u 0.00s 0% 3624k
load: 0.59  cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 38.85r 0.00u 0.00s 0% 3624k
load: 1.02  cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 88.32r 0.00u 0.00s 0% 3624k
load: 0.81  cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 149.97r 0.00u 0.00s 0% 3624k
load: 0.76  cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 181.17r 0.00u 0.00s 0% 3624k
load: 1.51  cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 243.76r 0.00u 0.00s 0% 3624k
load: 0.96  cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 282.39r 0.00u 0.00s 0% 3624k
load: 1.50  cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 333.94r 0.00u 0.00s 0% 3624k
load: 0.93  cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 392.77r 0.00u 0.00s 0% 3624k
load: 0.84  cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 457.04r 0.00u 0.00s 0% 3624k
load: 0.85  cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 526.06r 0.00u 0.00s 0% 3624k
load: 0.40  cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 588.82r 0.00u 0.00s 0% 3624k

My suspicion is that the primary vector is zfs recv on a dataset that is currently being backed up, but why this causes all other filesystem activity to become blocked is a bit unclear to me.  (Race to the root?  I think the backup software uses openat(2) and shouldn't cause that sort of problem, but maybe random NFS clients can.)
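
A minimal sketch of the suspected combination (dataset, snapshot, and host names are placeholders, not taken from this setup):

  # Suspected trigger, roughly: a forced receive (rollback) of a dataset
  # while something else is walking the same dataset.  Names are examples only.
  # On the sending host:
  zfs send -i @snap1 tank/data@snap2 | ssh nfsserver zfs recv -F tank/data
  # Meanwhile, on the receiving NFS server (stand-in for the backup traversal):
  tar -cf /dev/null /tank/data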
Comment 1 Andriy Gapon freebsd_committer 2018-01-20 09:57:00 UTC
(In reply to Garrett Wollman from comment #0)
Please use procstat to obtain the stack trace(s).
Comment 2 Garrett Wollman freebsd_committer 2018-01-20 18:10:09 UTC
(In reply to Andriy Gapon from comment #1)
Is there a "continuous mode" of procstat?  Because the shell can't exec() anything when the filesystem is wedged -- that's what the ^T output above is demonstrating.
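
One possible workaround is to arm a sampling loop before the nightly window so nothing new has to be exec'd once the filesystem wedges; a rough sketch, assuming procstat, date, and the log destination live on a filesystem that is not affected:

  # Pre-armed sampler, started before the backup/recv window.
  while :; do
      date
      procstat -kk -a        # kernel stacks of all threads
      sleep 30
  done > /var/tmp/procstat-kk.log 2>&1 &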
Comment 3 Garrett Wollman freebsd_committer 2018-01-20 18:11:34 UTC
(In reply to Garrett Wollman from comment #2)
I have remote console access, if there's a way to extract the same information via DDB (and it doesn't take too long to do so), I can try that.
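
For the DDB route, the stock commands should be enough; a rough sequence (56646 is just the stuck bash from the ^T output above):

  # Enter the debugger from the console (sysctl debug.kdb.enter=1,
  # or the break-to-debugger sequence), then roughly:
  ps                # list processes, find the ones stuck on the lock
  trace 56646       # kernel stack of one stuck process
  alltrace          # stacks of every thread (long output)
  c                 # continue; the system keeps running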
Comment 4 Garrett Wollman freebsd_committer 2018-01-24 04:51:37 UTC
I changed the schedule of the replication cron job (which runs every 15 minutes) so that it does *not* run during the usual backup window (which I don't have any control over, but which is fairly consistent).  So far I have not seen the hang, so I think I can reasonably conclude that the combination of "zfs recv -F" and an active backup traversal (or any other heavy use of the filesystem?) is toxic.

The filesystems in question are a special case for us, because the active write site is on another group's server; we're doing the replication for backup/DR purposes.
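
For reference, the cron change amounts to something like this (the window and script name are illustrative only, not the ones on this server):

  # /etc/crontab -- before: pull replication every 15 minutes, around the clock
  #*/15 *      * * *  root  /usr/local/sbin/zfs-pull-replicate.sh
  # after: skip the (illustrative) 01:00-05:59 backup window
  */15  0,6-23 * * *  root  /usr/local/sbin/zfs-pull-replicate.sh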
Comment 5 Steve Wills freebsd_committer 2019-05-01 15:14:09 UTC
I'm seeing this too and wondering whether this upstream issue and commit are relevant:

https://github.com/zfsonlinux/zfs/issues/2869

https://github.com/zfsonlinux/zfs/commit/cafbd2aca3fd17be78577348d9661f4142d1eb70

Any opinions?
Comment 6 Andriy Gapon freebsd_committer 2019-05-02 06:25:23 UTC
Still no stack traces...
Comment 7 Steve Wills freebsd_committer 2019-05-06 16:56:23 UTC
(In reply to Andriy Gapon from comment #6)
I'm not sure if this is what you're looking for, but I have procstat -kk output. It seems to vary, but I see these two:

mi_switch+0xe2 sleepq_wait+0x2c _cv_wait+0xf3 arc_get_data_impl+0x2df arc_buf_alloc_impl+0x204 arc_alloc_buf+0x41 dbuf_read+0x54e dnode_hold_impl+0x1af dmu_bonus_hold+0x1d zfs_rezget+0x10b zfs_resume_fs+0x1a8 zfs_ioc_rollback+0xf0 zfsdev_ioctl+0x75e devfs_ioctl+0xc7 VOP_IOCTL_APV+0x56 vn_ioctl+0x161 devfs_ioctl_f+0x1f kern_ioctl+0x238 


mi_switch+0xe2 sleepq_wait+0x2c _cv_wait+0xf3 arc_get_data_impl+0x2df arc_buf_alloc_impl+0x204 arc_alloc_buf+0x41 dbuf_read+0x54e __dbuf_hold_impl+0x579 dbuf_hold+0x7f dnode_hold_impl+0x134 dmu_bonus_hold+0x1d zfs_rezget+0x10b zfs_resume_fs+0x1a8 zfs_ioc_rollback+0xf0 zfsdev_ioctl+0x75e devfs_ioctl+0xc7 VOP_IOCTL_APV+0x56 vn_ioctl+0x161
Comment 8 Andriy Gapon freebsd_committer 2019-05-06 20:28:56 UTC
(In reply to Steve Wills from comment #7)
First of all, I am not 100% sure if this is it. It could be. It's likely.


It looks like the lock is held because zfs_rezget -> ... -> arc_get_data_impl is waiting for something in the ARC state.  Probably it is waiting for the ARC size to go below its limit (arc_is_overflowing()).  So, I would look at the ARC variables and at what the ARC reclaim thread is doing.
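
Concretely, the kind of thing one could look at (standard FreeBSD sysctls and procstat usage, not output from the affected machine):

  # Is the ARC at or above its target size?
  sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c \
         kstat.zfs.misc.arcstats.c_max
  # How much free memory does ZFS insist on keeping, and how much is actually free?
  sysctl vfs.zfs.arc_free_target vm.stats.vm.v_free_count
  # What is the ARC reclaim thread doing?  It lives in the zfskern kernel process.
  procstat -kk $(pgrep -S zfskern)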
Comment 9 Steve Wills freebsd_committer 2019-05-07 10:48:36 UTC
(In reply to Andriy Gapon from comment #8)
Ok, thanks. I may have had vfs.zfs.arc_free_target set too high. Turning that down seems to have helped avoid the issue.
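
For anyone else landing here: the tunable can be compared with the VM system's own free-page target and lowered at runtime (the value below is only an example):

  sysctl vfs.zfs.arc_free_target vm.v_free_target   # compare the two (both in pages)
  sysctl vfs.zfs.arc_free_target=56000              # example value, pages not bytes
  # add the same setting to /etc/sysctl.conf to make it persistent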