Bug 203877 - NFS threads get blocked when writing to ZFS dataset that has reached quota
Summary: NFS threads get blocked when writing to ZFS dataset that has reached quota
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-10-19 18:43 UTC by Garrett Wollman
Modified: 2020-07-09 15:56 UTC

See Also:


Attachments
procstat -kk -a output (229.59 KB, text/plain)
2017-08-17 17:25 UTC, Garrett Wollman

Description Garrett Wollman freebsd_committer freebsd_triage 2015-10-19 18:43:40 UTC
Figured it was about time to actually get this into the bug database, since I've made no progress figuring it out.

We've seen this bug for a long time, since at least 9.2.  The setup: a set of clients (in a cluster) write logs to a file or a small set of files over NFS (v3), ignoring write errors, and a FreeBSD NFS server uses ZFS as its backing store.  The ZFS dataset in question is below quota when the files are first opened for writing, but reaches its quota while the clients are actively writing.  Eventually something inside ZFS slows to a crawl (synchronized to txg openings, perhaps?).  When this happens, all of the NFS service threads eventually get tied up deep in ZFS and no new requests can be processed from any client, even the well-behaved ones.  FHA makes this happen faster, but it still happens with FHA disabled.
Comment 1 Garrett Wollman freebsd_committer freebsd_triage 2015-10-19 18:45:38 UTC
(I finally figured out that it was somewhere deep in ZFS by running "procstat -a -kk | fgrep nfsd" the last time this happened, but of course it always happens when we need to get the server back up and serving files, so we just increased the quota on the problem dataset.)
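A minimal sketch of how the "procstat -a -kk | fgrep nfsd" check could be summarized automatically. It assumes the usual column order (PID, TID, COMM, ...) and stack frames printed as name+0xoffset. The list of ZFS-looking prefixes is a heuristic chosen for illustration, and the sample text is synthetic, not taken from the attachment on this bug.

```python
import re
from collections import Counter

# Heuristic (an assumption, not an official list): frame-name prefixes
# that suggest a thread is currently inside ZFS code.
ZFS_PREFIXES = ("zfs_", "dmu_", "txg_", "zio_", "dsl_")

# Matches one kernel stack frame such as "txg_wait_open+0x85".
FRAME_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_.]*)\+0x[0-9a-fA-F]+")


def summarize_nfsd_stacks(procstat_text):
    """Count nfsd threads by the first ZFS-looking frame listed on their stack."""
    counts = Counter()
    for line in procstat_text.splitlines():
        fields = line.split()
        # Third column is COMM; keep only nfsd threads (skips the header too).
        if len(fields) < 3 or fields[2] != "nfsd":
            continue
        frames = FRAME_RE.findall(line)
        zfs_frames = [f for f in frames if f.startswith(ZFS_PREFIXES)]
        counts[zfs_frames[0] if zfs_frames else "(not in ZFS)"] += 1
    return counts


# Synthetic sample for illustration only; PIDs, TIDs and offsets are made up.
SAMPLE = """\
  PID    TID COMM             TDNAME           KSTACK
  900 100401 nfsd             nfsd: master     mi_switch+0xe1 sleepq_wait+0x3a _cv_wait+0x16d svc_run_internal+0x8be
  900 100402 nfsd             nfsd: service    mi_switch+0xe1 sleepq_wait+0x3a _cv_wait+0x16d txg_wait_open+0x85 dmu_tx_wait+0x2b0 zfs_write+0x4a1
  900 100403 nfsd             nfsd: service    mi_switch+0xe1 sleepq_wait+0x3a _cv_wait+0x16d txg_wait_open+0x85 dmu_tx_wait+0x2b0 zfs_write+0x4a1
  123 100050 syslogd          -                mi_switch+0xe1 sleepq_wait+0x3a
"""

if __name__ == "__main__":
    for frame, n in summarize_nfsd_stacks(SAMPLE).most_common():
        print(n, frame)
```

If most nfsd threads land in the same ZFS bucket, that is consistent with the whole pool being blocked on one dataset; it does not by itself identify the cause.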
Comment 2 Garrett Wollman freebsd_committer freebsd_triage 2017-08-17 17:24:20 UTC
Updating, since 10.3 still has the problem.
Comment 3 Garrett Wollman freebsd_committer freebsd_triage 2017-08-17 17:25:40 UTC
Created attachment 185534
procstat -kk -a output
Comment 4 Mark Linimon freebsd_committer freebsd_triage 2020-07-09 06:58:14 UTC
Reassign.

To submitter: is this aging PR still relevant?
Comment 5 Alexander Motin freebsd_committer freebsd_triage 2020-07-09 13:35:55 UTC
I remember an issue like that: we hit it on FreeNAS, and it was fixed a few years ago.  Let's close it.
Comment 6 Garrett Wollman freebsd_committer freebsd_triage 2020-07-09 15:26:14 UTC
So far as I'm aware, this problem has never been fixed.  Writing into a ZFS dataset that is near its quota is very slow because write requests are serialized, and it's worse if compression is enabled.  This then causes the nfsd worker threads to get stuck in a way that prevents any client requests from being served.  Increasing the quota resolves the situation instantly.  As I recall, a ZFS expert said that the former behavior, at least, was expected, although I don't remember the explanation.

Due to the pandemic we have seen a lot less research activity more or less coincidentally with our upgrade to 12.1, so I can't say for sure that it's still present there.  It was definitely still a problem in 11.3.
Comment 7 Alexander Motin freebsd_committer freebsd_triage 2020-07-09 15:39:25 UTC
One thing is that it gets slower, which is expected, since ZFS needs to complete writes and commit the transaction before it knows how much space is actually left.  But another problem that existed before is that it lost the data in that case, continuing that slow writing indefinitely without reporting an out-of-space error.  That part has been fixed.
Comment 8 Garrett Wollman freebsd_committer freebsd_triage 2020-07-09 15:56:25 UTC
Slow writes are, if not desirable, at least understandable.  The NFS server being unable to serve *any* requests, even to entirely unrelated filesystems, is not.
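
The complaint above (a fixed pool of service threads all tied up by slow writes to one dataset, so that unrelated requests cannot even start) can be illustrated with a toy queueing model. This is only a sketch under stated assumptions: it is not how nfsd or FHA is implemented, the function name start_times is hypothetical, and the worker count and service times are made-up numbers.

```python
import heapq


def start_times(requests, n_workers):
    """Return the time each request begins service on a FIFO worker pool.

    requests: list of (arrival_time, service_time), sorted by arrival_time.
    Each request is handed to whichever worker becomes idle first.
    """
    free_at = [0.0] * n_workers  # time at which each worker is next idle
    heapq.heapify(free_at)
    starts = []
    for arrival, service in requests:
        worker_free = heapq.heappop(free_at)
        start = max(arrival, worker_free)  # wait for the request and a worker
        starts.append(start)
        heapq.heappush(free_at, start + service)
    return starts


if __name__ == "__main__":
    workers = 4
    # Hypothetical numbers: four writes to the over-quota dataset each block
    # a worker for 30 s; a read for an unrelated filesystem arrives at t=1 s.
    stalled = [(0, 30)] * workers + [(1, 0.01)]
    healthy = [(0, 0.01)] * workers + [(1, 0.01)]
    print("unrelated read starts at", start_times(stalled, workers)[-1])
    print("same read, healthy pool ", start_times(healthy, workers)[-1])
```

In this model the unrelated read cannot start until one of the stalled writes finishes, no matter how cheap the read itself is. Raising the quota corresponds to shrinking the 30 s service time, which frees the pool immediately.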