Bug 244465 - ZFS recursive snapshot with refquota-full filesystems causes DoS for NFS users
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: misc
Version: 11.3-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-fs (Nobody)
 
Reported: 2020-02-27 07:42 UTC by Peter Eriksson
Modified: 2020-05-07 14:25 UTC


Attachments
Patch to "zfs" command to add a "-m minfree" option for "snapshot" (2.97 KB, patch)
2020-02-28 10:38 UTC, Peter Eriksson
New patch to add -v, -n & -m<minfree> options to "zfs snapshot" (4.67 KB, patch)
2020-02-29 21:30 UTC, Peter Eriksson

Description Peter Eriksson 2020-02-27 07:42:56 UTC
This is probably related to the ZFS behavior of throttling down the transaction size when writing data to very nearly full (refquota) filesystems, but anyway.

System:
Servers with many ZFS filesystems (HOME directories) for many users, shared via NFSv4 and SMB (Samba).

We are seeing DoS issues for NFS users on the same machine where some _other_ user has filled their HOME directory so that it is at, or within a few MB of, its refquota limit at the time we take our hourly recursive snapshots of all filesystems on those servers.

   ("zfs snapshot -r DATA/homes" basically)

The NFS users (who have their HOME directories mounted from those servers) see freezes of over a minute at those times, during which the NFS server basically seems unresponsive. We also see NFS mount requests being denied (or rather, timing out) at the same times.

We haven't seen the same type of bug reports from our SMB users, but their usage patterns are a bit different, so it might just be that they haven't noticed the same issue.

Possible workarounds (not tested yet):

1. Don't use "zfs snapshot -r"; instead, manually loop over all filesystems and skip taking snapshots of the ones that are nearly full (see the sketch after this list). (Or modify the source for the "zfs" command to add an option to skip nearly-full filesystems when doing a recursive snapshot.)

2. Temporarily increase the refquota of problematic filesystems before running "zfs snapshot -r". (Problem: we might not be able to lower the quota again afterwards.)

3. Modify the ZFS snapshot code in the kernel to ignore the quotas (since it's run by the root user I think it might be reasonable, but possibly not so easy to implement).
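
Something like this (untested) sketch would implement workaround 1; the parent dataset, snapshot name and 10 GiB threshold below are just placeholder examples:

  #!/bin/sh
  # Snapshot each filesystem under PARENT individually, skipping any
  # that has less than MINFREE bytes of free space available.
  PARENT="DATA/homes"
  SNAPNAME="hourly-$(date +%Y%m%d%H%M)"
  MINFREE=$((10 * 1024 * 1024 * 1024))   # 10 GiB

  # -H: no headers, tab-separated; -p: raw byte values; -r: recurse
  zfs list -Hpr -o name,avail "$PARENT" | while read -r name avail; do
      if [ "$avail" -lt "$MINFREE" ]; then
          echo "skipping $name ($avail bytes free)" >&2
          continue
      fi
      zfs snapshot "${name}@${SNAPNAME}"
  done

Note that unlike "zfs snapshot -r" this takes the snapshots one dataset at a time, so they are no longer atomic across the whole tree.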

In general, this slowing down of everything ZFS-related when a filesystem is nearly full is starting to _really_ become a pain in the...


Comment 1 Peter Eriksson 2020-02-28 10:38:17 UTC
Created attachment 212018 [details]
Patch to "zfs" command to add a "-m minfree" option for "snapshot"
Comment 2 Peter Eriksson 2020-02-28 10:40:16 UTC
Uploaded a patch that implements a "-m minfree" option for "zfs snapshot". If specified, any filesystem with less than minfree free space available will be skipped. Useful if you have thousands of filesystems and are doing recursive snapshots:

  zfs snapshot -r -m 10G DATA/users
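
For comparison, which datasets a given threshold would skip can be previewed with the stock tools (10737418240 bytes = 10 GiB):

  zfs list -Hpr -o name,avail DATA/users | awk '$2 < 10737418240 { print $1 }'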
Comment 3 Peter Eriksson 2020-02-29 21:30:36 UTC
Created attachment 212053 [details]
New patch to add -v, -n & -m<minfree> options to "zfs snapshot"

Updated version of the patch that adds -n, -v & -m<minfree> flags to "zfs snapshot".
Also updates the man page for zfs.
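
Assuming the new flags behave like their namesakes elsewhere in the zfs command (-n for a dry run, -v for verbose output), previewing an hourly run could then look like:

  zfs snapshot -r -n -v -m 10G DATA/users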
Comment 4 Peter Eriksson 2020-05-07 14:25:46 UTC
I'm closing this. I have a workaround (a modified "zfs" command that can intelligently skip near-full datasets), and it's better to see whether ZoL handles this better in the future.

- Peter