Figured it was about time to actually get this into the bug database, since I've made no progress figuring it out.
We've seen this bug for a long time, since at least 9.2. The setup: a set of clients (in a cluster) write logs to a file, or a small set of files, over NFS (v3), ignoring write errors; the FreeBSD NFS server uses ZFS as the backing store; and the particular ZFS dataset is below quota when the files are initially opened for write, but reaches its quota while the clients are still actively writing. Eventually something inside ZFS slows to a crawl (synchronized to txg openings, perhaps?). When this happens, all of the NFS service threads eventually get tied up deep in ZFS, and no new requests can be processed from any client, even the well-behaved ones. FHA makes this happen faster, but it will still happen even with FHA disabled.
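For the record, a minimal sketch of how the condition can be set up (the dataset name, mount point, quota size, and write sizes below are made-up placeholders for illustration, not our actual configuration):

```shell
# On the FreeBSD server: a ZFS dataset with a small quota, exported
# over NFS.  "tank/logs" and the 100M quota are hypothetical values.
zfs create tank/logs
zfs set quota=100M tank/logs
# (export the dataset via /etc/exports or the sharenfs property,
#  then restart mountd/nfsd)

# On each client: mount over NFSv3 and append to the same log file
# in a loop, discarding write errors, so writes keep arriving even
# after the dataset hits its quota.
mount -t nfs -o nfsv3 server:/tank/logs /mnt/logs
while true; do
    dd if=/dev/zero bs=64k count=16 >> /mnt/logs/app.log 2>/dev/null
done
```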
(I finally figured out that it was somewhere deep in ZFS by running "procstat -a -kk | fgrep nfsd" the last time this happened -- but of course it always happens when we need to get the server back up and serving files, so we just increased the quota on the problem dataset.)
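In case it helps anyone else hitting this, a rough way to spot the condition from a shell (the stack-frame substrings grepped for here are plausible guesses at ZFS symbol prefixes, not a definitive list):

```shell
# Count nfsd threads whose kernel stacks are down inside ZFS.
# zfs/zio_/txg_/dmu_ are just likely-looking symbol-name substrings
# to match in the procstat -kk stack output.
procstat -kk -a | grep nfsd | egrep -c 'zfs|zio_|txg_|dmu_'
```

If that count climbs toward the total number of nfsd service threads while the server stops answering, it is probably this hang rather than a network-side problem.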
Updating, since 10.3 still has the problem.
Created attachment 185534
procstat -kk -a output