Bug 261448 - ZFS VOP_RECLAIM causes long stalls during package builds
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-01-24 22:13 UTC by Mark Johnston
Modified: 2022-06-07 14:01 UTC

Description Mark Johnston freebsd_committer freebsd_triage 2022-01-24 22:13:05 UTC
I noticed that a system running "poudriere bulk -a" often appears to stall for several seconds.  During a stall, all CPUs are nearly idle even though the system is ostensibly building NCPU packages in parallel.  After some digging, I found that the stalls happen when a package build finishes and poudriere does a "zfs rollback" of some dataset.

The kernel implementation of zfs rollback calls zfsvfs_teardown(), which uses the per-dataset "teardown lock" to suspend all VOPs on the dataset while it walks the list of vnodes associated with the mount point (multiple times) to invalidate the name cache and cached data.  Any user threads performing operations on files in the dataset are blocked for this period.

The second part of the problem is getnewvnode(), which may call vn_alloc_hard() to reclaim a used vnode if the total number of vnodes in the system is above the "desiredvnodes" threshold.  vn_alloc_hard() performs direct reclamation from the vnode free list, and as a part of this recycling it must call VOP_RECLAIM.  ZFS' VOP_RECLAIM implementation, like all VOPs, acquires the teardown lock, so it can block for a long time.  If we get unlucky and a batch of vnodes belonging to a suspended dataset is at the head of the free list, then any thread that needs a new vnode, even one working on an unrelated dataset, blocks in VOP_RECLAIM, and the system very quickly gets stuck until the rollback completes.
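To judge whether a system is in the regime where new vnode allocations have to recycle old ones, the current vnode count can be compared against the limit. The following is a minimal sketch, not part of the original report: the helper name vnode_headroom is hypothetical, and it assumes that the vfs.numvnodes and kern.maxvnodes sysctls reflect the vnode count and the "desiredvnodes" threshold respectively. It does not reproduce the exact condition the kernel checks.

```shell
# Hypothetical helper: compare a vnode count against the configured limit.
# Arguments: $1 = current number of vnodes, $2 = limit (desiredvnodes).
# This only approximates the kernel's check; it is a diagnostic aid.
vnode_headroom() {
    num=$1
    max=$2
    if [ "$num" -gt "$max" ]; then
        echo "above limit: new vnodes may require recycling ($num/$max)"
    else
        echo "below limit: $((max - num)) vnodes of headroom ($num/$max)"
    fi
}

# On FreeBSD the inputs can be read with sysctl(8), assuming these
# sysctl names are present on the running kernel:
#   vnode_headroom "$(sysctl -n vfs.numvnodes)" "$(sysctl -n kern.maxvnodes)"
```

Sampling this once a second alongside "vmstat 1" would show whether the stalls line up with periods where the count sits at the limit.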

Logging CPU usage during poudriere runs shows lots of idle CPU caused by this problem.  The amount varies depending on how quickly package builds are finishing.  Logging output from "vmstat 1" overnight gives:

$ cat vmstat-log | awk '/^[0-9]/{user += $(NF - 2); sys += $(NF - 1); idle += $(NF)} END{printf "user %d sys %d idle %d total %d\n", user, sys, idle, user + sys + idle}'
user 2123240 sys 259657 idle 126830 total 2509727

so roughly 5% of CPU time (126830 of 2509727) is lost to idling.  Data collected over a shorter time period during which there were lots of small package builds shows it can be much worse, with about 26% idle:

user 112162 sys 25396 idle 47896 total 185454
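
The idle fraction can also be printed directly rather than worked out from the totals by hand. This is a small sketch along the lines of the one-liner above, wrapped in a hypothetical function named cpu_breakdown. Like the original, it assumes the last three columns of each "vmstat 1" data row are user, system and idle time. Unlike the original, it also accepts rows that begin with spaces.

```shell
# Hypothetical helper: summarize "vmstat 1" output as percentages.
# Reads the named log file(s), or standard input if none are given.
cpu_breakdown() {
    awk '/^ *[0-9]/ {
            # Data rows start with a number; header rows are skipped.
            user += $(NF - 2); sys += $(NF - 1); idle += $NF
         }
         END {
            total = user + sys + idle
            if (total == 0) { print "no samples"; exit 1 }
            printf "user %.1f%% sys %.1f%% idle %.1f%%\n",
                100 * user / total, 100 * sys / total, 100 * idle / total
         }' "$@"
}

# Usage:
#   cpu_breakdown vmstat-log
```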