Figured it was about time to actually get this into the bug database, since I've made no progress figuring it out.
We've seen this bug for a long time, since at least 9.2. The setup: a set of clients (in a cluster) write logs to a file, or a small set of files, over NFS (v3), ignoring write errors; the FreeBSD NFS server uses ZFS as the backing store; and the particular ZFS dataset is below quota when the files are initially opened for write, but reaches its quota while the clients are still actively writing. Eventually something inside ZFS slows to a crawl (synchronized to txg openings, perhaps?). When this happens, all of the NFS service threads eventually get tied up deep in ZFS, and no new requests can be processed from any client, even the well-behaved ones. FHA makes this happen faster, but it will still happen even with FHA disabled.
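For the record, a minimal sketch of how the condition can be set up (the dataset name, mount point, quota size, and write sizes below are made-up placeholders for illustration, not our actual configuration):

```shell
# On the FreeBSD server: a ZFS dataset with a small quota, exported
# over NFS.  "tank/logs" and the 100M quota are hypothetical values.
zfs create tank/logs
zfs set quota=100M tank/logs
# (export the dataset via /etc/exports or the sharenfs property,
#  then restart mountd/nfsd)

# On each client: mount over NFSv3 and append to the same log file
# in a loop, discarding write errors, so writes keep arriving even
# after the dataset hits its quota.
mount -t nfs -o nfsv3 server:/tank/logs /mnt/logs
while true; do
    dd if=/dev/zero bs=64k count=16 >> /mnt/logs/app.log 2>/dev/null
done
```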
(I finally figured out that it was somewhere deep in ZFS by running "procstat -a -kk | fgrep nfsd" the last time this happened -- but of course it always happens when we need to get the server back up and serving files, so we just increased the quota on the problem dataset.)
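In case it helps anyone else hitting this, a rough way to spot the condition from a shell (the stack-frame substrings grepped for here are plausible guesses at ZFS symbol prefixes, not a definitive list):

```shell
# Count nfsd threads whose kernel stacks are down inside ZFS.
# zfs/zio_/txg_/dmu_ are just likely-looking symbol-name substrings
# to match in the procstat -kk stack output.
procstat -kk -a | grep nfsd | egrep -c 'zfs|zio_|txg_|dmu_'
```

If that count climbs toward the total number of nfsd service threads while the server stops answering, it is probably this hang rather than a network-side problem.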
Updating, since 10.3 still has the problem.
Created attachment 185534
procstat -kk -a output