Bug 275594 - High CPU usage by arc_prune; analysis and fix
Summary: High CPU usage by arc_prune; analysis and fix
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 14.0-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Olivier Certner
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-12-07 10:37 UTC by Seigo Tanimura
Modified: 2024-05-16 07:06 UTC
CC List: 26 users

See Also:


Attachments
The proposed fix and additional patch. (4.71 KB, application/x-xz)
2023-12-07 10:37 UTC, Seigo Tanimura
The measurement result charts for Comment 14. (958.71 KB, application/x-7z-compressed)
2023-12-25 10:17 UTC, Seigo Tanimura
The additional measurement result charts for Comment 16. (167.75 KB, application/x-7z-compressed)
2023-12-25 10:57 UTC, Seigo Tanimura
The test result charts for Comment 18. (346.21 KB, application/x-7z-compressed)
2023-12-27 10:28 UTC, Seigo Tanimura
The result charts for Comment 22. (1 / 2) (318.84 KB, application/x-7z-compressed)
2024-01-05 17:21 UTC, Seigo Tanimura
The result charts for Comment 22. (2 / 2) (841.93 KB, application/x-7z-compressed)
2024-01-05 17:22 UTC, Seigo Tanimura
The result charts for Comment 28. (1 / 2) (395.59 KB, application/x-7z-compressed)
2024-01-10 06:14 UTC, Seigo Tanimura
The result charts for Comment 28. (2 / 2) (762.99 KB, application/x-7z-compressed)
2024-01-10 06:14 UTC, Seigo Tanimura
The patches for Comment 28. (7.91 KB, application/x-xz)
2024-01-10 06:15 UTC, Seigo Tanimura
The result charts for Comment 30. (115.87 KB, application/x-7z-compressed)
2024-01-10 07:42 UTC, Seigo Tanimura
The result charts for Comment 35. (268.56 KB, application/x-7z-compressed)
2024-01-24 10:39 UTC, Seigo Tanimura
The result charts for Comment 40. (386.99 KB, application/x-7z-compressed)
2024-01-25 03:47 UTC, Seigo Tanimura
The result charts for Comment 51. (679.67 KB, application/x-7z-compressed)
2024-02-13 06:55 UTC, Seigo Tanimura
The result charts for Comment 51. (680.24 KB, application/x-7z-compressed)
2024-02-13 06:55 UTC, Seigo Tanimura
The result charts for Comment 51. (560.92 KB, application/x-7z-compressed)
2024-02-13 06:56 UTC, Seigo Tanimura
The result charts for Comment 51. (627.94 KB, application/x-7z-compressed)
2024-02-13 06:56 UTC, Seigo Tanimura
The result charts for Comment 51. (595.20 KB, application/x-7z-compressed)
2024-02-13 06:57 UTC, Seigo Tanimura
The result charts for Comment 51. (467.71 KB, application/x-7z-compressed)
2024-02-13 06:57 UTC, Seigo Tanimura
The result charts for Comment 55. (543.91 KB, application/x-7z-compressed)
2024-02-13 16:08 UTC, Seigo Tanimura
The result charts for Comment 55. (552.20 KB, application/x-7z-compressed)
2024-02-13 16:09 UTC, Seigo Tanimura
The result charts for Comment 55. (489.87 KB, application/x-7z-compressed)
2024-02-13 16:09 UTC, Seigo Tanimura

Description Seigo Tanimura 2023-12-07 10:37:42 UTC
Created attachment 246849 [details]
The proposed fix and additional patch.

This is the followup to bug #275063 and bug #274698.

After applying the fix published in FreeBSD-EN-23:18.openzfs, I have seen the issue reproduce again.

I have been tracking this issue since the release of 14.0-RELEASE, and I am now ready to share a more promising fix.  Please review it and plan the fix accordingly.

* Test Environment: Hypervisor
- CPU: Intel Core i7-13700KF 3.4GHz (24 threads)
- RAM: 128 GB
- OS: Windows 10
- Storage: NVMe and SATA HDDs
- Hypervisor: VMWare Workstation 17.5

* Test Environment: VM & OS
- vCPUs: 16
- RAM: 16 GB
- Swap: 128 GB on NVMe
- OS: FreeBSD 14.0-RELEASE
- Storage & Filesystems: ZFS mainly
  - Main pool: 1.5G on SATA HDD
  - ZIL: 16 GB on NVMe
  - L2ARC: 64 GB on NVMe
- sysctl(3) tunings:
  - vfs.vnode.param.limit=4000000
  - vfs.vnode.vnlru.max_free_per_call=100000
  - vfs.zfs.arc_max=4294967296

* Application
- poudriere
  - Number of ports to build: 2128 (including dependencies)
  - Major configurations for port building
    - poudriere.conf
      - #NO_ZFS=yes (ZFS enabled)
      - USE_PORTLINT=no
      - USE_TMPFS="wrkdir data localbase"
      - TMPFS_LIMIT=32
      - DISTFILES_CACHE=(configured in ZFS)
      - CCACHE_DIR=(configured in ZFS)
        - The cache is filled in advance.
      - CCACHE_STATIC_PREFIX=/usr/local
      - PARALLEL_JOBS=8 (actually given via "poudriere bulk -J")
    - make.conf
      - MAKE_JOBS_NUMBER=2

* Steps
1. Remove the package output directory, so that all packages are built.
2. Run 'poudriere bulk' to start the parallel build.
3. Observe the system and build progress by top(1), poudriere web UI, cmdwatch(1) + sysctl(8), etc.

* Observed behaviors during building
- In 10 - 15 minutes, the ARC pruning started.
  - No effect on the performance.
- In about 30 minutes, the ARC pruning started to miss the pruning target.
  - 100% CPU usage by arc_prune was observed occasionally for a few seconds.
- In about 2 hours, the large ports (lang/rust, lang/gcc12) started to build.
  - 100% CPU usage by arc_prune was observed often for 5 - 10 seconds.
  - Several other threads also exhibited 100% CPU usage.
- Build time: 06:53:33 (309 pkgs / hr)

* Analysis
The true root cause is the consecutive execution of ARC pruning.  When there are no vnodes ready to reclaim, the ARC pruning walks through all vnodes with vnode_list_lock held.

The detail is described in:
https://github.com/altimeter-130ft/freebsd-freebsd-src/commit/f1fa73f4d5943efa874fa3ede49dd73bb8ef4bb4

* Proposed fix
- Enforce an interval between ARC pruning executions. (A sketch of the idea follows this list.)
  - Patch (in the attached archive): openzfs-arc_prune-interval-fix.diff
  - GitHub: https://github.com/altimeter-130ft/freebsd-freebsd-src/tree/release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-fix
    - Branch base: https://github.com/altimeter-130ft/freebsd-freebsd-src/commit/06497fbd52e2f138b7d590c8499d9cebad182850
      - releng/14.0 down to FreeBSD-SA-23:17.pf and version bumping.
- NB this fix is meant for FreeBSD only.  Please refer to the open issues as well.
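
The idea of the interval enforcement, as a minimal self-contained C sketch (this is not the patch itself; only the tunable name vfs.zfs.arc.prune_interval comes from the patch, the rest of the names are hypothetical):

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical userland illustration of the proposed contract: skip a new
 * ARC pruning pass unless at least prune_interval milliseconds have elapsed
 * since the previous pass.
 */
static struct timespec last_prune;
static long prune_interval = 1000;      /* vfs.zfs.arc.prune_interval [ms] */

static bool
arc_prune_interval_elapsed(void)
{
        struct timespec now;
        int64_t elapsed_ms;

        clock_gettime(CLOCK_MONOTONIC, &now);
        elapsed_ms = (int64_t)(now.tv_sec - last_prune.tv_sec) * 1000 +
            (now.tv_nsec - last_prune.tv_nsec) / 1000000;
        if (elapsed_ms < prune_interval)
                return (false);         /* too soon: defer this pruning pass */
        last_prune = now;
        return (true);
}

Setting prune_interval to 0 in this sketch admits every pass, which corresponds to the "vfs.zfs.arc.prune_interval: 0 (cancel the ARC pruning interval)" setting used in the tests below.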

* Additional patch
- The sysctl(3) counters to observe the vnode recycling behavior.
  - Patch (in the attached archive): openzfs-arc_prune-interval-counters.diff
  - GitHub: https://github.com/altimeter-130ft/freebsd-freebsd-src/tree/release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters
    - Branch base: the branch for the proposed fix.
  - The following counters may be committed as a debugging and tuning aid:
    - vfs.vnode.free.free_call
      - The calls to vnlru_free_impl().
    - vfs.vnode.free.free_retry
      - The retries from the vnode list head in vnlru_free_impl().
    - vfs.vnode.free.free_giveup
      - The giveups in vnlru_free_impl().
    - Under the heavy ZFS access, free_retry and free_giveup increase along with free_call, indicating the misses on the vnode reclaim target.

* Observed behaviors with the proposed fix during building
- The arc_prune kernel thread did not exhibit the 100% CPU usage.
  - Max: 30 - 35%.
  - The continuous CPU usage mostly disappeared.
  - The vnlru kernel thread ran in parallel with arc_prune.
- Build time: 06:37:03 (322 pkgs / hr)
  - Improved by ~8.5%.

* Open issues
- Please also refer to the fix commit log.
- Who should implement the fix?
  - OpenZFS taskq should be fixed if the issue is seen and resolvable on Linux as well.
- Is the proposed design contract upon the ARC pruning reasonable?
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2023-12-07 15:04:49 UTC
> - vfs.vnode.vnlru.max_free_per_call=100000

This value is 10 times larger than the default.  Why did you change this parameter?  What happens if you leave it at the default value?
Comment 2 Seigo Tanimura 2023-12-07 19:06:03 UTC
> Why did you change this parameter?

In my test case, the ARC pruning process risks hitting many non-ZFS vnodes.  vfs.vnode.vnlru.max_free_per_call has hence been increased in the hope of reaching the ZFS vnodes with a better chance within a single iteration.

> USE_TMPFS="wrkdir data localbase"

The directories under the really heavy load are covered by tmpfs(5).


> What happens if you leave it at the default value?

I started "poudriere bulk" about an hour ago with:

- vfs.vnode.vnlru.max_free_per_call: 10000 (out-of-box)
- vfs.zfs.arc.prune_interval: 0 (cancel the ARC pruning interval)

The rest of the setup is the same as in the description.

After about 50 minutes from the start, arc_prune started to hog the CPU up to 90 - 100%.  The free_retry and free_giveup counters are increasing at the same pace as free_call.

I will continue this build until the end, but with no hope for the recovery of arc_prune.
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2023-12-07 19:33:59 UTC
(In reply to Seigo Tanimura from comment #2)
Does arc_prune stop running once the build finishes?  The problem that EN-23:18.openzfs tries to address is that arc_prune will continue endlessly, even when the system is idle and there is no pressure for free memory or vnodes.

arc_prune_async() is rather dumb on FreeBSD, as you point out: it tries to reclaim vnodes from the global free list, but doing so might not alleviate pressure.  Really we want some way to shrink a per-mountpoint or per-filesystem cache.
Comment 4 Seigo Tanimura 2023-12-08 01:00:00 UTC
(In reply to Mark Johnston from comment #3)

A quick update:

7 hours passed; building emulators/mame, the final port.
arc_prune keeps running at 50 - 100% CPU usage.
Comment 5 Seigo Tanimura 2023-12-08 04:50:55 UTC
(In reply to Mark Johnston from comment #3)

The build has completed.

Build time: 07:40:56 (278 pkgs / hr)

arc_prune stopped shortly after poudriere finished.  The pileup of arc_prune has indeed been fixed by FreeBSD-EN-23:18.openzfs, but the essential problem should be somewhere else.

Right now, I am testing with the following setup after the reboot:

- vfs.vnode.vnlru.max_free_per_call: 10000 (out-of-box)
- vfs.zfs.arc.prune_interval: 1000 (my fix enabled)

About 2 hours after the start, the CPU usage of arc_prune was at 20 - 25% with the occasional drops.  Poudriere was working on lang/rust and lang/gcc12 at that time.

A correction of the description:

> * Test Environment: VM & OS
>   - RAM: 20 GB (not 16 GB)

A note on the ZFS configuration:

> vfs.zfs.arc_max=4294967296 (4GiB)

This limit has been added because this host is a build server, not a file server.  AFAIK, ZFS tends to take up to about 1/4 of the available RAM for the ARC.  While that may be fair as a file server, an application server wants more RAM in general.

Under the limit above, the demand upon the ARC pruning is expected and the OS must be ready to deal with that.


> arc_prune_async() is rather dumb on FreeBSD, as you point out: it tries to reclaim vnodes from the global free list, but doing so might not alleviate pressure.  Really we want some way to shrink a per-mountpoint or per-filesystem cache.

I thought you would say that; I almost thought of the same thing more than 20 years ago while implementing the initial version of vnlru along with Matt Dillon :)

The per-mountpoint / per-filesystem vnode design has at least two challenges:

A) Balancing the vnodes across the mountpoints / filesystems, and
B) Splitting the name cache.

I suspect B) is the more difficult one.  As of now, the global name cache allows the vnode lookup in a single place with just one pass.  The behaviour and performance under the per-mountpoint / per-filesystem name cache would depend on the interaction across multiple filesystems, and hence be very complicated to analyse and tune.

The interval between the ARC pruning executions is much more simple and yet effective, under my key findings out of the first test in the description:

- The ARC pruning indeed works as long as that is a one-shot run.
- The modern hardware is fast enough to walk through all vnodes, again as long as that is a one-shot run.
- The ARC pruning and vnlru are the vnode maintainers, not the users.  They must guarantee the fairness upon the vnode use to the true vnode users, namely the user processes and threads. (and maybe the NFS server threads for a network file server)

After the current build, I will try vfs.vnode.vnlru.max_free_per_call=4000000.  This value is the same as vfs.vnode.param.limit, so there will be no limit upon the ARC pruning workload except for the giveup condition.
Comment 6 Seigo Tanimura 2023-12-08 10:17:49 UTC
(In reply to Seigo Tanimura from comment #5)

The build under the following setting has completed:

- vfs.vnode.vnlru.max_free_per_call: 10000 (out-of-box)
- vfs.zfs.arc.prune_interval: 1000 (my fix enabled)

Build time: 07:11:02 (292 pkgs / hr)
Max vfs.vnode.stats.count: ~2.2M
Max ARC memory size: ~5.6GB

NB devel/ocl-icd failed because pkg-static was killed by the kernel for taking too long to page in.  31 ports were skipped because of this failure.  This error was often seen on 14.0-RELEASE-p0, indicating an obstacle upon the executable file access.

This result is better than the baseline (14.0-RELEASE-p2) and worse than my original fix shown in the description.  Although prune_interval somehow avoided the contention on vnode_list_mtx, this setup also limited the ARC pruning performance, introducing other pressure, including the overcommit of the ARC memory size.

I conclude this setup is neither optimal nor recommended.

-----

Ongoing test:

- vfs.vnode.vnlru.max_free_per_call: 4000000 (== vfs.vnode.param.limit)
- vfs.zfs.arc.prune_interval: 1000 (my fix enabled)

This setup allows the unlimited workload to the ARC pruning under the configured interval.

Another objective of this test is the measurement of the number of vnodes that ZFS requests the OS to reclaim.  As long as this value is below 100000 (vfs.vnode.vnlru.max_free_per_call in my first test), the system behaviour and test results are expected to be the same as my first test.

A glance on 30 minutes after the build start:

- The activity of arc_prune is mostly the same as the first test; the CPU usage occasionally surges up to 30%, but it does not stay for more than 1 second so far.
- The average number of the vnodes ZFS requests to reclaim: ~44K.
  - vfs.vnode.stats.count: ~1.2M.
  - The default vfs.vnode.vnlru.max_free_per_call of 10K did regulate the ARC pruning work.
  - I will keep my eyes on this figure, especially if it exceeds 100K.
- The ARC memory size is strictly regulated as configured by vfs.zfs.arc_max.
  - The ARC pruning starts when the ARC memory size reaches ~4.1GB.
  - The ARC pruning does not happen as long as the ARC memory size is below 4.0GB.

The finding regarding the ARC memory size is something new to me.  Maybe the vnode number requested for the reclaim by ZFS is calculated very carefully and precisely, so we should actually honour that figure to keep the system healthy.

I first treated this test as an extreme case, but maybe this should be evaluated as a working setup.
Comment 7 Seigo Tanimura 2023-12-08 11:04:25 UTC
(In reply to Seigo Tanimura from comment #6)

Update:

1:25:00 since the build start, building lang/gcc12 and lang/rust.

vfs.vnode.stats.count is ~1.4M.  arc_prune tends to stick at about 20 - 25% of the CPU with the occasional drops.  The ARC memory is stable at ~4.6GB.

The average vnode number requested for the reclaim by ZFS has exceeded 100K.

I understand that the files generated by the builds exceed the upper limit of the ARC memory.  Arc_prune hence has to run continuously, though the interval avoids the starvation upon vnode_list_mtx.
Comment 8 Seigo Tanimura 2023-12-08 17:47:28 UTC
(In reply to Seigo Tanimura from comment #7)

Done.

Build time: 07:14:10 (295 pkgs / hr)
ARC memory size: ~5.8GB just when the final build finished, ~4.0 GB after cleaning up all of the poudriere builders.

The final performance was rather like the case of vfs.vnode.vnlru.max_free_per_call=10000.  arc_prune kept the CPU usage of 20 - 40% after the build of lang/gcc12 until the end.

Inside the ARC memory, "Other" of top(1) recorded ~2.0GB just when the final build finished.  This value is the sum of:

- kstat.zfs.misc.arcstats.bonus_size
- kstat.zfs.misc.arcstats.dnode_size
- kstat.zfs.misc.arcstats.dbuf_size

Maybe I have to work out a way to track them...
Comment 9 Mark Johnston freebsd_committer freebsd_triage 2023-12-09 16:15:17 UTC
> I thought you would say that; I almost thought of the same thing more than 20 years ago while implementing the initial version of vnlru along with Matt Dillon :)
>
> The per-mountpoint / per-filesystem vnode design has at least two challenges:
>
> A) Balancing the vnodes across the mountpoints / filesystems, and
> B) Splitting the name cache.
>
> I suspect B) is the more difficult one.  As of now, the global name cache allows the vnode lookup in a single place with just one pass.

I'm not a VFS expert by any means, but I don't see what this has to do with the name cache.  vnodes live on a global list, chained by v_vnodelist, and this list appears to be used purely for reclamation.  Suppose we instead use a per-mountpoint LRU (and some strategy to select a mountpoint+num vnodes to reclaim) instead.  How would this affect the name cache?

> The interval between the ARC pruning executions is much more simple and yet effective, under my key findings out of the first test in the description:

Sorry, I don't understand.  The trigger for arc_prune is whether the ARC is holding "too much" metadata, or ZFS is holding "too many" dnodes in memory.  If arc_prune() is spending most of its time reclaiming tmpfs vnodes, then it does nothing to address its targets; it may as well do nothing.  Rate-limiting just gets us closer to doing nothing, or I am misunderstanding something about the patch.

Suppose that arc_prune is disabled outright.  How does your test fare?
Comment 10 Seigo Tanimura 2023-12-11 08:45:26 UTC
(In reply to Mark Johnston from comment #9)

> vnodes live on a global list, chained by v_vnodelist, and this list appears to be used purely for reclamation.

The free vnodes are indeed chained to vnode_list in sys/kern/vfs_subr.c, but this "free" means "not opened by any user processes," i.e., vp->v_usecount == 0.

Besides the user processes, the kernel may use a "free" vnode for its own purposes.  In such a case, the kernel "holds" the vnode by vhold(9), making vp->v_holdcnt > 0.  A vnode held by the kernel in this way cannot be recycled even if it is not opened by any user process.

vnlru_free_impl() checks if the vnode in question is held, and skips recycling if so.  I have seen, out of the tests so far, that vnlru_free_impl() tends to skip many vnodes, especially during the late phase of "poudriere bulk".  The results and findings are shown at the end of this comment.
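
A minimal self-contained sketch of the skip behaviour described above (the structure below is a simplified stand-in, not the kernel's struct vnode, and the function is illustrative only):

#include <sys/queue.h>

struct fake_vnode {
        int v_usecount;                 /* opens by user processes */
        int v_holdcnt;                  /* holds, e.g. via vhold(9) */
        TAILQ_ENTRY(fake_vnode) v_vnodelist;
};

static TAILQ_HEAD(, fake_vnode) fake_vnode_list =
    TAILQ_HEAD_INITIALIZER(fake_vnode_list);

/*
 * Walk the list and count how many vnodes could be recycled; a held vnode
 * is skipped, just like the "phase 2" skips counted below.
 */
static int
count_recyclable(void)
{
        struct fake_vnode *vp;
        int recyclable = 0;

        TAILQ_FOREACH(vp, &fake_vnode_list, v_vnodelist) {
                if (vp->v_holdcnt > 0)
                        continue;       /* held: cannot be recycled */
                recyclable++;
        }
        return (recyclable);
}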

-----

> If arc_prune() is spending most of its time reclaiming tmpfs vnodes, then it does nothing to address its targets; it may as well do nothing.

Again, the mixed use of tmpfs and ZFS has actually turned out to be rather a minor problem.  Please refer to my findings.

Also, there are some easier workarounds that can be tried first, if this is really the issue:

- Perform the test of vp->v_mount->mnt_op before vp->v_holdcnt.  This should work somehow for now because ZFS is the only filesystem that calls vnlru_free_vfsops() with the valid mnt_op.
- After a preconfigured number of consecutive skips, move the marker vnode to the restart point, release vnode_list_mtx and yield the CPU.  This actually happens when a vnode is recycled, which may block.

> Suppose that arc_prune is disabled outright.  How does your test fare?

Difficult to tell.  I am sure the ARC size should keep increasing first, but cannot tell if it eventually comes to an equilibrium point because of the builder cleanup or keeps rising.

-----

In order to investigate the detail of the held vnodes found in vnlru_free_impl(), I have conducted another test with some additional counters.

Source on GitHub:
- Repo: https://github.com/altimeter-130ft/freebsd-freebsd-src/tree/release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters
- Branch: release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters

Test setup:
The same as "Ongoing test" in bug #275594, comment #6.

- vfs.vnode.vnlru.max_free_per_call: 4000000 (== vfs.vnode.param.limit)
- vfs.zfs.arc.prune_interval: 1000 (my fix enabled)

Build time:
06:32:57 (325 pkgs / hr)

Counters after completing the build, with some remarks:
# The iteration attempts in vnlru_free_impl().
# This includes the retry from the head of vnode_list.
vfs.vnode.free.free_attempt: 29695926809

# The number of the vnodes recycled successfully, including vtryrecycle().
vfs.vnode.free.free_success: 30841748

# The number of the iteration skips due to a held vnode. ("phase 2" hereafter)
vfs.vnode.free.free_phase2_retry: 11909948307

# The number of the phase 2 skips upon the VREG (regular file) vnodes.
vfs.vnode.free.free_phase2_retry_reg: 7877197761

# The number of the phase 2 skips upon the VBAD (being recycled) vnodes.
vfs.vnode.free.free_phase2_retry_bad: 3101137010

# The number of the phase 2 skips upon the VDIR (directory) vnodes.
vfs.vnode.free.free_phase2_retry_dir: 899106296

# The number of the phase 2 skips upon the VNON (being created) vnodes.
vfs.vnode.free.free_phase2_retry_non: 2046379

# The number of the phase 2 skips upon the doomed (being destroyed) vnodes.
vfs.vnode.free.free_phase2_retry_doomed: 3101137196

# The number of the iteration skips due to the filesystem mismatch. ("phase 3" hereafter)
vfs.vnode.free.free_phase3_retry: 17755077891

Analysis and Findings:
Out of ~30G iteration attempts in vnlru_free_impl(), ~12G failed in phase 2.  (Phase 3 failure is ~18G, but there are some workaround ideas shown above)

Among the phase 2 failures, the most dominant vnode type is VREG.  For this type, I suspect the resident VM pages alive in the kernel; a VM object holds the backing vnode if the object has at least one page allocated out of it.  Please refer to vm_page_insert_after() and vm_page_insert_radixdone() for the implementation.

Technically, such vnodes can be recycled as long as the prerequisites checked in vtryrecycle() are met with the sufficient locks, which do not include the resident VM pages.  vnode_destroy_vobject(), called in vgonel(), takes care of those pages.  I suppose we have to do this if more work is required in vnlru_free_impl(), maybe during the retry after reaching the end of vnode_list.

The further fix above assumes that ZFS takes the appropriate work to reduce the ARC size upon reclaiming a ZFS vnode.

The rest of the cases are either difficult or impossible for any further work.

A VDIR vnode is held by the name cache to improve the path resolution performance, both forward and backward.  While the vnodes of this kind can be reclaimed somehow, a significant performance penalty is expected upon the path resolution.

VBAD and VNON are actually states rather than types of the vnodes.  Neither state is eligible for recycling by design.
Comment 11 Seigo Tanimura 2023-12-11 08:55:25 UTC
(In reply to Seigo Tanimura from comment #10)

Correction:
> Please refer to vm_page_insert_after() and vm_page_insert_radixdone() for the implementation.

I mean vm_page_insert_radixdone() and vm_page_object_remove().  The former inserts a new page to an object, and the latter removes a page from an object.
Comment 12 Seigo Tanimura 2023-12-14 06:58:31 UTC
(In reply to Seigo Tanimura from comment #10)

I have added the fix to enable the extra vnode recycling and tested with the same setup.

Source on GitHub:
- Repo: https://github.com/altimeter-130ft/freebsd-freebsd-src
- Branches
  - Fix: release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-fix
  - Counters atop Fix: release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters

Test setup:
The same as "Ongoing test" in bug #275594, comment #6.

- vfs.vnode.vnlru.max_free_per_call: 4000000 (== vfs.vnode.param.limit)
- vfs.zfs.arc.prune_interval: 1000 (my fix for arc_prune interval enabled)
- vfs.vnode.vnlru.extra_recycle: 1 (extra vnode recycle fix enabled)

Build time:
06:50:05 (312 pkgs / hr)

Counters after completing the build, with some remarks:

# The iteration attempts in vnlru_free_impl().
# This includes the retry from the head of vnode_list.
vfs.vnode.free.free_attempt: 33934506866

# The number of the vnodes recycled successfully, including vtryrecycle().
vfs.vnode.free.free_success: 42945537

# The number of the successful recycles in phase 2 upon the VREG (regular file) vnodes.
# - cleanbuf_vmpage_only: the vnodes held by the clean bufs and resident VM pages only.
# - cleanbuf_only: the vnodes held by the clean bufs only.
vfs.vnode.free.free_phase2_retry_reg_cleanbuf_vmpage_only: 845659
vfs.vnode.free.free_phase2_retry_reg_cleanbuf_only: 3

# The number of the iteration skips due to a held vnode. ("phase 2" hereafter)
# NB the successful recycles in phase 2 are not included.
vfs.vnode.free.free_phase2_retry: 8923850577

# The number of the phase 2 skips upon the VREG vnodes.
vfs.vnode.free.free_phase2_retry_reg: 8085735334

# The number of the phase 2 skips upon the VREG vnodes in use.
# Almost all phase 2 skips upon VREG fell into this.
vfs.vnode.free.free_phase2_retry_reg_inuse: 8085733060

# The number of the successful recycles in phase 2 upon the VDIR (directory) vnodes.
# - free_phase2_retry_dir_nc_src_only: the vnodes held by the namecache entries only.
vfs.vnode.free.free_phase2_retry_dir_nc_src_only: 2234194

# The number of the phase 2 skips upon the VDIR vnodes.
vfs.vnode.free.free_phase2_retry_dir: 834902819

# The number of the phase 2 skips upon the VDIR vnodes in use.
# Almost all phase 2 skips upon VDIR fell into this.
vfs.vnode.free.free_phase2_retry_dir_inuse: 834902780

Other findings:

- The behaviour of the arc_prune thread CPU usage was mostly the same.
  - The peak dropped by just a few percent, so this is not likely to be the essential fix.

- The namecache hit ratio degraded by about 10 - 20%.
  - Maybe the recycled vnodes are looked up again, especially the directories.

-----

The issue still essentially exists even with the extra vnode recycle.  Maybe the root cause is in ZFS rather than the OS.

There are some suspicious findings on the in-memory dnode behaviour during the tests so far:

- vfs.zfs.arc_max does not enforce the max size of kstat.zfs.misc.arcstats.dnode_size.
  - vfs.zfs.arc_max: 4GB
  - vfs.zfs.arc.dnode_limit_percent: 10 (default)
  - sizeof(struct dnode_t): 808 bytes
    - Found by "vmstat -z | grep dnode_t".
  - kstat.zfs.misc.arcstats.arc_dnode_limit: 400MB (default, vfs.zfs.arc.dnode_limit_percent percent of vfs.zfs.arc_max)
    - ~495K dnodes.
  - kstat.zfs.misc.arcstats.dnode_size, max: ~ 1.8GB
    - ~2.2M dnodes.
    - Almost equal to the max observed number of the vnodes.

- The dnode_t zone of uma(9) does not have the limit.

From above, the number of the in-memory dnodes looks like the bottleneck.  Maybe the essential solution is to configure vfs.zfs.arc.dnode_limit explicitly so that ZFS can hold all dnodes required by the application in the memory.
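
A quick cross-check of the dnode figures above (plain arithmetic only, using the sizes quoted above):

#include <stdio.h>

int
main(void)
{
        const double dnode_size = 808.0;        /* bytes, from "vmstat -z | grep dnode_t" */

        /* ~495K dnodes fit under the 400MB arc_dnode_limit... */
        printf("dnodes at arc_dnode_limit: %.0fK\n", 400e6 / dnode_size / 1e3);
        /* ...while the observed peak of ~1.8GB holds ~2.2M dnodes. */
        printf("dnodes at the observed peak: %.1fM\n", 1.8e9 / dnode_size / 1e6);
        return (0);
}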
Comment 13 Seigo Tanimura 2023-12-25 10:17:35 UTC
Created attachment 247238 [details]
The measurement result charts for Comment 14.
Comment 14 Seigo Tanimura 2023-12-25 10:21:27 UTC
(In reply to Seigo Tanimura from comment #12)

Thanks for your patience; I have spent quite a few days setting up a more precise and convenient measurement pipeline with Fluent Bit, Elasticsearch and Kibana.  It is now working, and I am ready to share the test results and my analysis of them.

-----

Sources on GitHub:

- FreeBSD
  - Repo
    - https://github.com/altimeter-130ft/freebsd-freebsd-src
  - Branches
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-fix
	  - No changes since the last test.
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters
	  - Add the vnode tag counting. (disabled by default)
	  - Refactor the ZFS ARC stats for the precise investigation.
	  - The changes may be squashed or removed in the future.

- Fluent Bit
  - Repo
    - https://github.com/altimeter-130ft/fluent-fluent-bit
  - Branch
    - topic-plugin-in-sysctl
	  - The input plugin for sysctl(3).
	  - The other fixes included.
	  - No documents yet.
  - Note
    - "-DFLB_CORO_STACK_SIZE=1048576" recommended on cmake(1).
	  - The default coroutine thread stack size is only a few kilobytes on FreeBSD.

-----

Test Summary:

- Date: 21 Dec 2023 17:50Z - 22 Dec 2023 00:50Z
- Build time: 07:00:59 (304 pkgs / hr)
- Failed ports: 3
- Setup
  - vfs.vnode.vnlru.max_free_per_call: 4000000 (== vfs.vnode.param.limit)
  - vfs.zfs.arc.prune_interval: 1000 (my fix for arc_prune interval enabled)
  - vfs.vnode.vnlru.extra_recycle: 1 (extra vnode recycle fix enabled)
  - vfs.zfs.arc.dnode_limit=2684354560 (2.5G, larger than the max actual value observed so far)

-----

Findings Summary:

- Vnode Behavior
  - The vnodes circulate between the creation and pruning.
  - Almost all of the unrecyclable vnodes have one use count, which makes one extra hold count as well.

- ZFS Behaviour
  - The direct trigger of the ARC pruning is the low ARC meta value.
  - There is a massive amount of ghost data hits.

-----

Finding Detail:

The discussion below refers to the charts in the archive attached as poudriere-bulk-2023-12-22_02h55m25s.

The times on the chart horizontal axes are in the UTC.

- Vnode Behavior
  - The vnodes circulate between the creation and pruning.

* Charts: number-of-vnodes.png, zfs-vnode-free-calls.png, zfs-freed-vnodes.png.

Although zfs-freed-vnodes.png shows the steady vnode recycling in some time sections, the number of the vnodes is almost flat. (Typical case: 23:20Z - 00:40Z)  This means that the vnodes are both recycled and created at almost the same rate.

  - Almost all of the unrecyclable vnodes have one use count, which makes one extra hold count as well.

* Charts: zfs-vnode-recycle-phase2-reg-retries.png, zfs-vnode-recycle-phase2-dir-retries.png.

The non-zero traces on the charts are the overlaps of:
  - free_phase2_retry_*: the skipped vnodes,
  - free_phase2_retry_*_inuse: the vnodes with the non-zero v_usecount, and
  - free_phase2_retry_*_unknown1: the vnodes with one unknown extra v_holdcnt.

The non-zero v_usecount is sufficient to explain the "unknown" extra one v_holdcnt.  The current vnode recycling code does not work on the vnodes with the non-zero v_usecount, and one v_usecount makes one v_holdcnt on FreeBSD.

- ZFS Behaviour
  - The direct trigger of the ARC pruning is the low ARC meta value.

* Charts: zfs-arc-meta.png, zfs-arc-dnode-size.png.

arc_evict() triggers the async ARC pruning when either of the following conditions stand:

A) More than 3/4 of the wanted metadata size target in the ARC is not evictable, or
B) The size allocated for the dnode exceeds vfs.zfs.arc.dnode_limit.
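
Paraphrased as a plain C sketch (not the actual arc_evict() code; the function and parameter names here are mine):

#include <stdbool.h>
#include <stdint.h>

static bool
arc_should_prune(uint64_t meta_target, uint64_t meta_not_evictable,
    uint64_t dnode_size, uint64_t dnode_limit)
{
        /* A) more than 3/4 of the wanted metadata size is not evictable. */
        if (meta_not_evictable > meta_target / 4 * 3)
                return (true);
        /* B) the dnode allocation exceeds vfs.zfs.arc.dnode_limit. */
        if (dnode_size > dnode_limit)
                return (true);
        return (false);
}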

B) is not the case in this test; the max ZFS ARC dnode size was ~1.6GB, which was less than vfs.zfs.arc.dnode_limit. (2.5GB)

The ZFS ARC meta is the ratio of the wanted metadata size with respect to the whole ARC size, initialized to 25% and scaled up to 1G as a 32-bit fixed-point decimal.  A higher (lower) value means more (less) space for the ARC metadata.  The reason for the sharp drop of this value is discussed next.

  - There is a massive amount of ghost data hits.

* Charts: zfs-arc-mru-ghost-data-hits.png, zfs-arc-mfu-ghost-data-hits.png, zfs-arc-mru-ghost-metadata-hits.png, zfs-arc-mfu-ghost-metadata-hits.png.

Please watch out for the vertical axes of the charts:
- The "hits" are the sizes, not the counts in ZFS.
- The scales of the data hits are about 10 times larger than the metadata hits.

The ZFS ARC meta value is automatically tuned by the sizes of the ghost hits passed to arc_evict_adj(), called in arc_evict().  The idea is to favour whichever of the data or metadata is experiencing more ghost hits.

The ghost hits on the MFU and MRU data are about 10 times larger than their counterparts on the metadata.  This explains the sharp drop of the ZFS ARC meta starting at about 18:20Z.  Also, large ghost data hits happened occasionally, while the ghost metadata hits seldom exceeded the data.  This explains why the ZFS ARC meta stayed low until the end of the build test.

-----

Next Work:

- Check if the vnodes with the non-zero v_usecount are actually open(2)ed.
  - Compare the count to kern.openfiles.
  - Estimation out of the current results:
    - Max free_phase2_retry_reg_inuse per 10 minutes: ~270K @ 00:20Z.
    - The ZFS vnode free call period at the time above: ~1.57 [s]. (<= 1 / (382.5 / (10 * 60)))
    - The hits into the vnodes in use per ZFS vnode free call: ~707. (<= 270K / (10 * 60) * 1.57)
    - If this figure is reasonably close to kern.openfiles, the phase2 retries are indeed caused by the open(2)ed vnodes and such the vnodes cannot be evicted.
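
A plain C cross-check of the estimation above (the inputs are the figures quoted above; nothing new is measured here):

#include <stdio.h>

int
main(void)
{
        const double bucket = 10.0 * 60.0;      /* chart granularity [s] */
        const double free_calls = 382.5;        /* ZFS vnode free calls / 10 mins */
        const double retries = 270e3;           /* free_phase2_retry_reg_inuse / 10 mins */
        const double period = 1.0 / (free_calls / bucket);

        printf("ZFS vnode free call period: %.2f s\n", period);        /* ~1.57 */
        printf("in-use hits per free call : %.0f\n",
            retries / bucket * period);                                /* ~707 */
        return (0);
}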
Comment 15 Seigo Tanimura 2023-12-25 10:57:17 UTC
Created attachment 247245 [details]
The additional measurement result charts for Comment 16.
Comment 16 Seigo Tanimura 2023-12-25 10:58:46 UTC
(In reply to Seigo Tanimura from comment #14)

* Notes on the comparison to kern.openfiles

I will assume that ~70% of kern.openfiles is for ZFS in the next test.  This ratio has been observed from a preliminary test conducted separately from the last one, where the vnode tag counting is enabled.

* Chart Archive: poudriere-bulk-2023-12-21_18h42m38s.7z
* Charts: vnode-vnodes.png, vnode-tags.png.

vnode-tags.png shows the vnode counts classified by their tags.  The section of 21 Dec 2023 15:00Z - 16:30Z has ~2.5M vnodes, out of which ~1.8M are for ZFS. (The yellow area on vnode-vnodes.png)  1.8M / 2.5M == 0.72.

The vnode tag counting will be disabled in the next test (and so by default) because it puts a load as heavy as the ZFS ARC pruning and is hence likely to affect the test results.
Comment 17 Seigo Tanimura 2023-12-27 10:28:59 UTC
Created attachment 247285 [details]
The test result charts for Comment 18.
Comment 18 Seigo Tanimura 2023-12-27 10:31:56 UTC
(In reply to Seigo Tanimura from comment #16)

* The results of the comparison between the estimated ZFS open files and kern.openfiles

Test Summary:

- Date: 26 Dec 2023 00:50Z - 26 Dec 2023 06:42Z
- Build time: 06:41:18 (319 pkgs / hr)
- Failed ports: 4
- Setup
  - vfs.vnode.vnlru.max_free_per_call: 4000000 (== vfs.vnode.param.limit)
  - vfs.zfs.arc.prune_interval: 1000 (my fix for arc_prune interval enabled)
  - vfs.vnode.vnlru.extra_recycle: 1 (extra vnode recycle fix enabled)
  - vfs.zfs.arc.dnode_limit=2684354560 (2.5G, larger than the max actual value observed so far)

Results:

* Estimated ZFS open files

         | (A)                        | (B)                          | (C)
         |                            | Phase 2 regular file retries |
         |                            | (Estimated ZFS open files    | ZFS open files
UTC Time | Vnode free call period [s] | seen by vnlru_free_impl())   | estimated by kern.openfiles
=========+============================+==============================+=================================
  02:00Z |                       1.27 |                          354 |                             491
---------+----------------------------+------------------------------+---------------------------------
  03:00Z |                       1.32 |                          411 |                             439
---------+----------------------------+------------------------------+---------------------------------
  04:00Z |                       1.35 |                          477 |                             425
---------+----------------------------+------------------------------+---------------------------------
  05:00Z |                       1.69 |                          193 |                             242
---------+----------------------------+------------------------------+---------------------------------
  06:00Z |                       1.88 |                          702 |                             232
---------+----------------------------+------------------------------+---------------------------------
  07:00Z |                       1.54 |                          299 |                             237

where

(A): 1 / ((vnode free calls) / (5 * 60))

(5 * 60) is the time granularity on the chart in seconds.  This applies to (B) as well.

(B): (number of retries) / (5 * 60) * (A)

(C): 0.7 * (kern.openfiles value)

0.7 is the observed general ratio of the ZFS vnodes in the kernel. (bug #275594 comment #16)
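
For reference, the three formulas above as plain C helpers (just the arithmetic; the function names are mine):

/* (A): the average period between ZFS vnode free calls, in seconds. */
static double
free_call_period(double free_calls, double bucket)
{
        return (1.0 / (free_calls / bucket));
}

/* (B): ZFS open files estimated from the phase 2 VREG retries. */
static double
est_open_files_from_retries(double retries, double bucket, double period_a)
{
        return (retries / bucket * period_a);
}

/*
 * (C): ZFS open files estimated from kern.openfiles, using the ~70% ZFS
 * share of the open files observed in comment #16.
 */
static double
est_open_files_from_openfiles(double openfiles)
{
        return (0.7 * openfiles);
}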

* Chart archive: poudriere-bulk-2023-12-26_09h50m17s.7z
* Charts: zfs-vnode-free-calls.png, zfs-vnode-recycle-phase2-reg-retries.png, kernel-open-files.png.

(B) and (C) sometimes match on the most significant figure, and do not at other times.  From these results, I understand that the unrecyclable ZFS vnodes are caused by files being opened in an indirect way.  The detail of the "indirect" way is discussed next.

-----

* The ZFS vnodes in use by nullfs(5)

Nullfs(5), involved in my poudriere jail setup, is now suspected to cause the unrecyclable ZFS vnodes.

My poudriere setup uses "-m null" on the poudriere jail.

> root@pkgfactory2:/home/pkgfactory2/tanimura/work/freebsd-git/ports-head # poudriere jail -l
> release-13_2_0 13.2-RELEASE amd64 null   2023-04-13 03:14:26 /home/poudriere.jailroot/release-13.2.0
> release-14_0_0 14.0-RELEASE amd64 null   2023-11-23 15:14:17 /home/poudriere.jailroot/release-14.0.0
> root@pkgfactory2:/home/pkgfactory2/tanimura/work/freebsd-git/ports-head # 

Under this setup, poudriere-bulk(8) mounts the jail filesystems onto each builder by nullfs(5).  A nullfs(5) vnode adds one v_usecount to the lower vnode (asserted in null_nodeget()) so that the pointer to the lower vnode does not dangle.  This lasts even after the nullfs(5) vnode is inactivated and put onto the free list, until the nullfs(5) vnode gets reclaimed.

The nullfs(5) design above explains the results of the estimation upon the unrecyclable ZFS vnodes.  As the builders open more files in ZFS via nullfs(5), more unrecyclable ZFS vnodes are made.  In detail, however, the estimation has errors because multiple builders can open the same ZFS file.

The massive freeing of the vnodes after the build is also explained by the nullfs(5) design.  The cleanup of the builder filesystems dismisses a lot of nullfs(5) vnodes, which, in turn, drops the v_usecount of the lower ZFS vnodes so that they can be evicted.

-----

The finding above introduces a new question: should the ZFS vnodes used by nullfs(5) be recycled?

My answer is no.  The major hurdle is the search of the vnode stacking links.  They essentially form a tree with the ZFS (or any non-nullfs(5)) vnode as the root, spanning multiple nullfs(5) vnode leaves and depth levels.  The search is likely to be even more complex than the linear scan of the vnode list.

In addition, all vnodes in the tree must be recyclable for the ZFS vnode at the tree root to be recyclable as well.  This is likely to impose a complex dependency on the ZFS vnode recycling.

-----

My investigation so far, including this one, has proven that it costs too much to scan all vnodes without any positive estimate in advance.  We need a way to check whether the ARC pruning will yield a fruitful result, in a way much cheaper than the vnode scan.

It may be good to account for the number of the ZFS vnodes not in use.  Before starting an ARC pruning, we can check that count and defer the pruning if it is too low.  This has already been implemented in arc_evict_impl() for the eviction of the ARC data and metadata by checking the evictable size.  The ARC data and metadata eviction is skipped if there are zero evictable bytes.

* My next work

Figure out the requirement and design of the accounting above.
Comment 19 Seigo Tanimura 2023-12-27 10:35:24 UTC
(In reply to Seigo Tanimura from comment #18)

Correction:

* The results of the comparison between the estimated ZFS open files and kern.openfiles

Test Summary:

- Failed ports: 2
Comment 20 Seigo Tanimura 2024-01-05 17:21:40 UTC
Created attachment 247465 [details]
The result charts for Comment 22. (1 / 2)
Comment 21 Seigo Tanimura 2024-01-05 17:22:05 UTC
Created attachment 247466 [details]
The result charts for Comment 22. (2 / 2)
Comment 22 Seigo Tanimura 2024-01-05 17:27:20 UTC
(In reply to Seigo Tanimura from comment #18)

> It may be good to account the number of the ZFS vnodes not in use.  Before starting an ARC pruning, we can check that count and defer pruning if that is too low.
> (snip)
> Figure out the requirement and design of the accounting above.

Done.


* Sources on GitHub:

- Repo
  - https://github.com/altimeter-130ft/freebsd-freebsd-src
- Branches
  - Fix only
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-fix
  - Fix and counters
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters


* New ZFS vnode / znode counters for ARC pruning

- Total ZFS vnodes
  - sysctl(3): vfs.zfs.znode.count
    - Counter variable: zfs_znode_count

- ZFS vnodes in use (v_usecount > 0)
  - sysctl(3): vfs.zfs.znode.inuse
    - Counter variable: zfs_znode_inuse_count

- ARC-prunable ZFS vnodes
  - sysctl(3): vfs.zfs.znode.prunable
    - Formula: zfs_znode_count - zfs_znode_inuse_count

- ARC pruning requests
  - sysctl(3): vfs.zfs.znode.pruning_requested
    - Counter variable: zfs_znode_pruning_requested

- Skipped ARC pruning requests
  - sysctl(3): vfs.zfs.znode.pruning_skipped
    - Counter variable: zfs_znode_pruning_skipped


* Design of counter operations

- Total ZFS vnodes (zfs_znode_count)
  - Increment upon creating a new ZFS vnode.
    - zfs_znode_alloc()
  - Decrement upon reclaiming a ZFS vnode.
    - zfs_freebsd_reclaim()

- ZFS vnodes in use ("v_usecount > 0")
  - Export to the VFS via mnt_fsvninusep of struct mount.
    - Both the VFS and ZFS have to operate on the counter.
    - struct vnode cannot be expanded anymore.
  - Increment upon inserting a new ZFS vnode into a ZFS mountpoint.
    - zfs_mknode()
    - zfs_zget()
  - Increment upon vget() and alike.
    - vget_finish_ref()
  - Decrement upon vput() and alike.
    - vput_final()


* Design of ARC pruning regulation

- Required condition
  - zfs_prune_task(uint64_t nr_to_scan, void *arg __unused)
    - Condition: (zfs_znode_count - zfs_znode_inuse_count) * dn / zfs_znode_inuse_count >= nr_to_scan
      - dn: The number of the dnodes.
      - Scale the prunable znodes to the dnodes linearly because a znode may span across multiple dnodes.
    - Call vnlru_free_vfsops() only when the condition above is satisfied. (See the sketch at the end of this section.)

- Other changes on ARC pruning
  - Refactor the extra vnode recycling into two toggleable features.
    - sysctl(3): vfs.vnode.vnlru.recycle_bufs_pages
      - Recycle the vnodes with the clean buffers and clean/dirty VM pages.
    - sysctl(3): vfs.vnode.vnlru.recycle_nc_src
      - Recycle the vnodes working as the namecache sources.
    - Both enabled by default.
  - Retire the interval between the ARC pruning, the initial fix.
    - The ARC pruning regulation above is more precise.


* Test results

Test Summary:

- Date: 03 Jan 2024 01:30Z - 03 Jan 2024 08:13Z
- Build time: 06:43:25 (317 pkgs / hr)
- Failed port(s): 1
- Setup
  - sysctl(3)
    - vfs.vnode.vnlru.max_free_per_call: 4000000
      - == vfs.vnode.param.limit.
    - vfs.zfs.arc_max: 4294967296
      - 4GB.
    - vfs.zfs.arc.dnode_limit=8080000000
      - 2.5 * (vfs.vnode.param.limit) * sizeof(dnode_t)
        - 2.5: experimental average dnodes per znode (2.0) + margin (0.5)
  - poudriere-bulk(8)
    - USE_TMPFS="wrkdir data localbase"


Result Chart Archive (1 / 2): (poudriere-bulk-2024-01-03_10h30m00s.7z)

- zfs-znodes-and-dnodes.png
  - The counts of the ZFS znodes and dnodes.
- zfs-dnodes-and-freeing-activity.png
  - The freeing activity of the ZFS znodes and dnodes.
- vnode-free-calls.png
  - The calls to the ZFS vnode freeing functions.


Result Chart Archive (2 / 2): (poudriere-bulk-2024-01-03_10h30m00s-zfs-arc.7z)

- zfs-arc/zfs-arc-meta.png
  - The balancing of the ZFS ARC metadata and data.
- zfs-arc/zfs-arc-(A)-(B)-(C).png
  - The ZFS ARC stats.
    (A): MRU (mru) or MFU. (mfu)
    (B): Metadata (metadata) or data (data); the "ghost-" prefix denotes the evicted cache.
    (C): Size (size) or hits (hits); the hits count the hit sizes, not the hit counts.


Finding Summary:

- The ZFS ARC meta was lowered strongly, contradicting the high metadata demand in the ZFS ARC.
  - They are both the designed behaviours.
- The low ZFS ARC meta value triggered the aggressive ARC pruning.
  - Again, this is as designed.
- The ARC pruning regulation throttled the load as expected.
  - Virtually no load happened when only one or two builders were running.
  - The fruitless pruning was eliminated.


Analysis in Detail:

- ZFS znode and dnode counts (zfs-znodes-and-dnodes.png)

The green and blue traces show the counts of the total and in-use ZFS znodes, respectively.  The gap between these lines denotes the prunable ZFS znodes, also shown as the red trace.  Those traces show that there are almost no prunable znodes, so it is useless to perform the ARC pruning too often.

- ZFS znode and dnode freeing activity (zfs-dnodes-and-freeing-activity.png)

The red trace is the count of the znodes freed by the ARC pruning.  It worked in the first hour because the build happened upon many small ports, where the builder cleaning released many znodes.  After that, the build moved to the big long ones (lang/rust, lang/gcc12, ...) and the znode release ceased.  A couple of occasional bumps happened upon the builder cleanups after finishing the build of such ports.

- Vnode free calls (vnode-free-calls.png)

The non-zero traces are vfs.zfs.znode.pruning_requested and vfs.zfs.znode.pruning_skipped, which overlap almost completely.  After 02:45Z, there were no counts on vfs.vnode.free.* shown by the median.  This means the ARC pruning was either not performed at all or only very rarely.

The magnitude of vfs.zfs.znode.pruning_requested shows the high pressure of the ARC pruning requests from ZFS.  The top peak at 02:20Z is ~1.8M / 5 mins == 6K / sec.  The ARC pruning requests definitely need solid throttling because a typical ARC pruning pass takes up to ~0.2 seconds when there are actually no prunable vnodes. [1]  Even under a steady light load in 06:25Z - 08:05Z (working on emulators/mame, where ccache does not work somehow), vfs.zfs.znode.pruning_requested recorded ~50K / 5 mins =~ 167 / sec.

[1] Observed under my first fix where the interval of 1 second was enforced between each ARC pruning.  The max ARC pruning rate was ~0.8 / sec.

- The ZFS ARC stats (zfs-arc/zfs-arc-*.png)

The ZFS ARC stats show how the high pressure upon the ARC pruning happened.

The ZFS ARC stats of the sizes (zfs-arc/zfs-arc-*-size.png) show the following properties:

  - Except for the first hour, there were almost no evictable sizes.
  - The metadata stayed solidly while the data was driven away.
  - The trace of the ZFS ARC MRU metadata size (zfs-arc-mru-metadata-size.png) is similar to that of the znode and dnode counts.

Out of these properties, I suspect that the znodes and dnodes in use dominated the ARC.  Although not confirmed by a code walk, it makes sense to keep such metadata in memory because it is likely to be updated often.

Another parameter affecting the ZFS ARC is the balancing of the metadata and data.  The ZFS ARC meta (zfs-arc-meta.png) is the auto-tuned target ratio of the metadata size, stored as a 32-bit fixed-point decimal.  Since vfs.zfs.arc_max is 4GB in my setup, this value can be read directly as the metadata size target in bytes.
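
A quick arithmetic check of that reading (the 2^32 == 100% scale is inferred from the "initialized to 25% scaled up to 1G" description in comment #14, so treat it as an assumption):

#include <stdio.h>

int
main(void)
{
        const double scale = 4294967296.0;      /* assumed: 2^32 == 100% in the fixed-point meta value */
        const double arc_max = 4294967296.0;    /* vfs.zfs.arc_max: 4GB */
        const double meta = 1073741824.0;       /* default ARC meta: 25% scaled up, i.e. 1G */

        /* Because arc_max == 2^32, fraction * arc_max equals the raw meta value. */
        printf("meta fraction: %.2f\n", meta / scale);                   /* 0.25 */
        printf("meta target  : %.0f bytes\n", meta / scale * arc_max);   /* == the raw value */
        return (0);
}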

The ZFS ARC meta is tuned by the ghost-hit sizes. (zfs-arc-*-ghost-*-hits.png)  It is designed to favour whichever of the metadata or data has the larger ghost-hit sizes, so that further caching lessens them.  As the data was so dominant in the ghost-hit sizes, the ZFS ARC meta was pushed very low; the minimum was ~197M at 05:20Z, and it stayed mostly below 1G, the default (1/4 of vfs.zfs.arc_max), except for the first hour.  The low metadata size target then caused the aggressive ARC pruning as implemented in arc_evict(), in conjunction with the high demand for unevictable metadata.
Comment 23 Seigo Tanimura 2024-01-05 17:29:36 UTC
(In reply to Mark Johnston from comment #1)

> > - vfs.vnode.vnlru.max_free_per_call=100000
> This value is 10 times larger than the default.  Why did you change this parameter?  What happens if you leave it at the default value?

Out of the tests so far, I now believe that the default value of vfs.vnode.vnlru.max_free_per_call (10K) was chosen in order to limit the load under uncontrolled and unreasonable ARC pruning requests.

Now that the ARC pruning is precisely throttled for the efficient execution, it should be all right to increase vfs.vnode.vnlru.max_free_per_call up to vfs.vnode.param.limit and let ZFS determine the actual work load.  vnode_list_mtx is an expensive lock, so once we acquire it, we should prune as many vnodes as we can.
Comment 24 Allan Jude freebsd_committer freebsd_triage 2024-01-06 18:27:50 UTC
Something that might be worth looking at is these changes from upstream:
https://github.com/openzfs/zfs/pull/15511
https://github.com/openzfs/zfs/pull/15659

They could result in a lot more vnodes being reclaimable.
Comment 25 Seigo Tanimura 2024-01-10 06:14:14 UTC
Created attachment 247560 [details]
The result charts for Comment 28. (1 / 2)
Comment 26 Seigo Tanimura 2024-01-10 06:14:39 UTC
Created attachment 247561 [details]
The result charts for Comment 28. (2 / 2)
Comment 27 Seigo Tanimura 2024-01-10 06:15:08 UTC
Created attachment 247562 [details]
The patches for Comment 28.
Comment 28 Seigo Tanimura 2024-01-10 06:20:54 UTC
(In reply to Seigo Tanimura from comment #22)

There was an implementation bug in the test of comment #22.  The updated results with the bugfix attached.


* Sources on GitHub:

- Repo
  - https://github.com/altimeter-130ft/freebsd-freebsd-src
- Branches
  - Fix only
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-regulation-fix
  - Fix and counters
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-regulation-counters
  - The branches have been renamed to reflect the fix nature.
  - All of the commits in each branch have been squashed into one.
  - The commit logs have been revised into the style alike the release notes.
- Patches
  - Archive: openzfs-arc_prune-regulation-20240110.tar.xz
  - Each patch is named after its branch.


* Bugfix

- zfs_prune_task() passed the number of the dnodes to vnlru_free_vfsops() and
  hence vnlru_free_impl().
  - The ARC pruning may overcommit because there are generally more dnodes than
    znodes.
  - Fix: Convert the number of the dnodes into that of the znodes before
    passing it to vnlru_free_vfsops().
  - This bug is also present in the original source.
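
The conversion described above, as a plain arithmetic sketch (illustrative only; this is not the real zfs_prune_task() code, and the real vnlru_free_vfsops() call is not reproduced here):

#include <stdint.h>

/*
 * Scale the dnode count ZFS asks to prune down to a znode count, using the
 * currently observed dnodes-per-znode ratio, before asking the VFS to free
 * that many vnodes.
 */
static uint64_t
dnodes_to_znodes(uint64_t nr_dnodes_to_scan, uint64_t znode_count,
    uint64_t dnode_count)
{
        if (dnode_count == 0)
                return (nr_dnodes_to_scan);
        return (nr_dnodes_to_scan * znode_count / dnode_count);
}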


* Test results

Result Chart Archive (1 / 2): (poudriere-bulk-2024-01-09_10h19m59s.7z)

- zfs-znodes-and-dnodes.png
  - The counts of the ZFS znodes and dnodes.
- zfs-arc-pruning-regulation.png
  - The counts of the ARC prune triggers by ZFS and the skips by the fix.
- zfs-dnodes-and-freeing-activity.png
  - The freeing activity of the ZFS znodes and dnodes.
- vnode-free-calls.png
  - The calls to the ZFS vnode freeing functions.


Result Chart Archive (2 / 2): (poudriere-bulk-2024-01-09_10h19m59s-zfs-arc.7z)

- zfs-arc/zfs-arc-meta.png
  - The balancing of the ZFS ARC metadata and data.
- zfs-arc/zfs-arc-(A)-(B)-(C).png
  - The ZFS ARC stats.
    (A): MRU (mru) or MFU. (mfu)
    (B): Metadata (metadata) or data (data); the "ghost-" prefix denotes the evicted cache.
    (C): Size (size) or hits (hits); the hits count the hit sizes, not the hit counts.


Findings and Analysis:

- The giveups of vnlru_free_impl() dropped to virtually zero.
  - No CPU time was wasted on fruitless ARC pruning.
    - The logical bug has been fixed.
  - Some retries are inevitable in order to walk through the vnode list completely.

- The actual ARC pruning executions were about 10,000 times fewer than the requests.
  - Actual ARC pruning: < 100 / 10 mins.
  - Max ARC pruning requests: on the order of 1M / 10 mins.
  - Regulation is essential to avoid uncontrolled behaviour.

- Except for the transient time at the start and end of the build, the prunable znodes stayed under 10% of the total znodes.
  - 10%: The default value of zfs_arc_dnode_reduce_percent.
  - The ARC pruning exhibited its best performance under the configured limitation.
Comment 29 Seigo Tanimura 2024-01-10 07:42:52 UTC
Created attachment 247565 [details]
The result charts for Comment 30.
Comment 30 Seigo Tanimura 2024-01-10 07:45:11 UTC
(In reply to Allan Jude from comment #24)

> Something that might be worth looking at is these changes from upstream:
> https://github.com/openzfs/zfs/pull/15511

I have tested PR 15511 applied to the fix branch, but there were no significant changes in the dnode sizes.

Result Chart Archive: (poudriere-bulk-openzfs-pr-11551-20240110.7z, Attachment #247565 [details])

- zfs-arc-dnode-size-before.png
  - The total dnode size in bytes, before applying PR 15511.
- zfs-arc-dnode-size-after.png
  - The total dnode size in bytes, after applying PR 15511.

Contrary to the test steps in PR 15511, the nullfs(5) vnodes used in my poudriere setup keep the use counts on the lower ZFS vnodes.  This behaviour leaves far fewer ARC-prunable vnodes compared to the total znode count.  Please refer to zfs-znodes-and-dnodes.png in Attachment #247560 [details]; the area between the blue and green traces is prunable.
Comment 31 Seigo Tanimura 2024-01-12 03:10:31 UTC
(In reply to Mark Johnston from comment #9)

Now with the somewhat better answer:

> Sorry, I don't understand.  The trigger for arc_prune is whether the ARC is holding "too much" metadata, or ZFS is holding "too many" dnodes in memory.  If arc_prune() is spending most of its time reclaiming tmpfs vnodes, then it does nothing to address its targets; it may as well do nothing.  Rate-limiting just gets us closer to doing nothing, or I am misunderstanding something about the patch.

The rate limiting in my fix now comes with the maximum yet optimal workload.  That is required because both the overcommit and the undercommit of the ARC pruning waste CPU time, on the fruitless vnode list lock and on the overhead of calling vnlru_free_impl(), respectively.  The elimination of the giveups in vnlru_free_impl() and the prunable znodes staying under zfs_arc_dnode_reduce_percent are positive evidence.

The pileup of the ARC metadata with no evictable portion should still be investigated, as that is the direct cause of the out-of-control ARC pruning requests.  The unevictable ARC content is caused by the ZIO on it, but it is not clear how the ZIO keeps running so long.
Comment 32 Thomas Mueller 2024-01-19 08:35:12 UTC
After upgrading from 12-STABLE to 13-STABLE (due to ports changes
triggered by 12 being EOLed), I'm now observing high CPU usage of
kernel{arc_prune} on 13-STABLE too.

This is on a system with 16GB of RAM which boots from a SATA attached SSD
with UFS containing the OS, /usr/src, and /usr/local. A ZFS pool containing
data, home, poudriere jails and data, /usr/ports is located on a GELI
encrypted 1TB NVMe. ARC is limited to 2GB.
This system is used as desktop development system.

The arc_prune high CPU usage appears to be triggered by heavy file I/O on
the UFS file systems, for example git operations on /usr/src. 
Once, when /usr/ports was also still on UFS, the
  "Inspecting ports tree for modifications to git checkout..."
step of poudriere took more than 3 hours!

Questions:
Would migrating to ZFS on root mitigate the issues?
Is 13-STABLE in focus for this PR?
Comment 33 Seigo Tanimura 2024-01-20 01:43:28 UTC
(In reply to Thomas Mueller from comment #32)
> After upgrading from 12-STABLE to 13-STABLE (due to ports changes
> triggered by 12 being EOLed), I'm now observing high CPU usage of
> kernel{arc_prune} on 13-STABLE too.

Do you see any other threads using the CPU as much as kernel{arc_prune}? eg

- vnlru
- Any threads that access files somehow while running poudriere-bulk(8) (eg cc1)

If so, what you have seen is the same as my case.  kernel{arc_prune} and the threads above contend for the vnode list lock.  Each of them spins in the kernel until it acquires the lock, which can be observed with top(1) if you have any idle CPUs.  You may have to reduce the number of builders to let top(1) work.

I was not aware of it at the time of the last massive poudriere-bulk(8) on 13.2-RELEASE, but it is now likely that the same issue occurred there as well.

The comparison of my poudriere-bulk(8) results, both on the same host except for the OS versions:

                         | 13.2-RELEASE | 14.0-RELEASE
-------------------------+--------------+-------------
Build Date               |  13 Apr 2023 |  19 Jan 2024
ZFS Fix                  |           No |          Yes
# of Packages            |         1147 |         2128
# of Successful Packages |         1136 |         2127
Elapsed Time             |     18:44:33 |     06:54:28
Packages / Hour          |           61 |          309


> Questions:
> Would migrating to ZFS on root mitigate the issues?

I would say no; that would put even more pressure on the ARC.

> Is 13-STABLE in focus for this PR?

Not for now, but it should be.  In addition, FreeBSD-EN-23:18.openzfs should include 13-STABLE as well.

I have one baremetal 13.2-RELEASE host with ZFS, but it has not suffered from the issue so far.  This host mainly serves volumes to bhyve(8) VMs, so it does not use vnodes heavily.
Comment 34 Thomas Mueller 2024-01-20 07:41:09 UTC
(In reply to Seigo Tanimura from comment #33)

On Sat, 20 Jan 2024 01:43:28 +0000, bugzilla-noreply@freebsd.org wrote:

> Do you see any other threads using the CPU as much as kernel{arc_prune}? eg
> 
> - vnlru
> - Any threads that access files somehow while running poudriere-bulk(8) (eg
> cc1)

Yes, I've occasionally observed vnlru CPU usage of 30-40% for longer
stretches when arc_prune was at 90-100%.

With 12-STABLE it was possible to have poudriere running at idle priority
on two of the four CPUs and use the system for everyday work in parallel
(X11 UI, MUA, Firefox, or even Virtualbox). With 13-STABLE, the system
bogs down, video playback drops frames and/or audio, etc.

> If so, what you have seen is the same as mine.  Kernel{arc_prune} and the
> threads above contend for the vnode list lock.  Each of them spins in the
> kernel until it acquires the lock, which can be found by top(1) if you have any
> idle CPUs.  You may have to reduce the builders to let top(1) work.

Exactly.

What's also new in 13-STABLE is that sometimes when the issue occurs the
system runs into memory pressure: pagedaemon can be observed with a
remarkable CPU load, and processes with high memory usage get killed
(firefox and virtualbox, for example). That might be caused by some
changes in the poudriere default configuration, so I can't quite tell
whether it would also have appeared on 12.

Also not observed on 12-STABLE: occasional build errors with
"Bad file descriptor" which then cannot be reproduced after
restarting the build. Example:

 [stable13amd64-default-job-02] |   `-- Extracting python39-3.9.18: .........
 pkg-static: Fail to chmod /wsgiref/__pycache__/__init__.cpython-39.opt-1.pyc:Bad file descriptor
 [stable13amd64-default-job-02] |   `-- Extracting python39-3.9.18... done

 Failed to install the following 1 package(s): /packages/All/meson-1.3.1.pkg
 *** Error code 1

> I was not aware at the time of the last massive poudriere-bulk(8) on
> 13.2-RELEASE, but it is now likely that the same issue occured on it as well.
> 
> The comparision of my poudriere-bulk(8) results, both on the same host except
> for the OS versions:
> 
>                          | 13.2-RELEASE | 14.0-RELEASE
> -------------------------+--------------+-------------
> Build Date               |  13 Apr 2023 |  19 Jan 2024
> ZFS Fix                  |           No |          Yes
> # of Packages            |         1147 |         2128
> # of Successful Packages |         1136 |         2127
> Elapsed Time             |     18:44:33 |     06:54:28
> Packages / Hour          |           61 |          309

Looks familiar.

> > Questions:
> > Would migrating to ZFS on root mitigate the issues?  
> 
> I would say no; that would give even move pressure to ARC.

Thanks.

> > Is 13-STABLE in focus for this PR?  
> 
> Not for now, but it should be.  In addition, FreeBSD-EN-23:18.openzfs should
> include 13-STABLE as well.
> 
> I have one baremetal 13.2-RELEASE host with ZFS, but it does not suffer from
> the issue as of now.  This host serves the volumes to the bhyve(8) VMs mainly,
> so it does not use vnodes heavily.

Thanks for analysing this!
Comment 35 Seigo Tanimura 2024-01-24 10:39:42 UTC
Created attachment 247921 [details]
The result charts for Comment 35.
Comment 36 Seigo Tanimura 2024-01-24 10:47:34 UTC
(In reply to Thomas Mueller from comment #34)

I have backported the fix to stable/13 (13.3-PRERELEASE) and tested poudriere-bulk(8).

The fix has also been applied to the main and stable/14 branches without any changes.

Thomas, would you mind testing the backported fix to see if poudriere's build time changes in any way?

* Sources on GitHub:

- Repo
  - https://github.com/altimeter-130ft/freebsd-freebsd-src
- Branches
  - main (Current)
    - Fix only
      - topic-openzfs-arc_prune-regulation-fix
    - Fix and counters
      - topic-openzfs-arc_prune-regulation-counters
    - No changes from the fix on 14.0.0-RELEASE-p2.
  - stable/14 (14-STABLE)
    - Fix only
      - stable/14-topic-openzfs-arc_prune-regulation-fix
    - Fix and counters
      - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-regulation-counters
    - No changes from the fix on 14.0.0-RELEASE-p2.
  - releng/14.0 (14.0-RELEASE)
    - Fix only
      - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-regulation-fix
    - Fix and counters
      - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-regulation-counters
    - The original fix branches.
  - stable/13 (13-STABLE / 13.3-PRERELEASE)
    - Fix only
      - stable/13-topic-openzfs-arc_prune-regulation-fix
    - Fix and counters
      - stable/13-topic-openzfs-arc_prune-regulation-counters
    - Backported changes
      - A fix equivalent to FreeBSD-EN-23:18.openzfs.
        - The ARC pruning task pileup is avoided by a single flag and the atomic operations on it.
      - Seigo's fix.
        - The ZFS vnode accounting, including the counters.
        - The ARC pruning regulation.
        - The improvement on vnlru_free_impl()
    - Changes not backported
      - Seigo's fix.
        - The counters related to the autotuning of ZFS ARC meta, the balancing parameter between the ARC data and metadata.
          - Those counters have changed significantly between 13-STABLE and 14-STABLE.

* Test results

Test Summary:

- Branch: stable/13-topic-openzfs-arc_prune-regulation-counters
- Date: 24 Jan 2024 00:10Z - 24 Jan 2024 05:59Z
- Build time: 05:48:30 (367 pkgs / hr)
- Failed port(s): 2
- Skipped port(s): 2
- Setup
  - sysctl(3)
    - vfs.zfs.arc_max: 4294967296
      - 4GB.
    - vfs.zfs.arc.dnode_limit=8080000000
      - 2.5 * (vfs.vnode.param.limit) * sizeof(dnode_t)
        - 2.5: experimental average dnodes per znode (2.0) + margin (0.5)
  - poudriere-bulk(8)
    - USE_TMPFS="wrkdir data localbase"

Result Chart Archive: (poudriere-bulk-13_3_prerelease-2024-01-24_09h10m00s.7z, Attachment #247921 [details])

- zfs-znodes-and-dnodes.png
  - The counts of the ZFS znodes and dnodes.
- zfs-arc-pruning-regulation.png
  - The counts of the ARC prune triggers by ZFS and the skips by the fix.
- zfs-dnodes-and-freeing-activity.png
  - The freeing activity of the ZFS znodes and dnodes.
- vnode-free-calls.png
  - The calls to the ZFS vnode freeing functions.

* Findings and Analysis

- The build time was shorter than on 14.0-RELEASE because emulators/mame, started at the 4.5-hour mark, benefited from ccache and completed in just 10 minutes.  ccache does not help there on 14.0-RELEASE, so all sources have to be rebuilt.
  - If the emulators/mame build did not use ccache, it would take ~2.5 hours and the whole poudriere-bulk(8) would complete in ~7 hours, about the same time as on 14.0-RELEASE.

- No ARC pruning happened during poudriere-bulk(8).
  - The only pruning happened while the system was settling down before poudriere-bulk(8).
  - On OpenZFS 2.1, the ARC pruning is not triggered by an excess unevictable size in the ARC.
    - That trigger exists on OpenZFS 2.2 in 14-STABLE.
    - Only overcommitted dnodes and metadata sizes trigger the ARC pruning on OpenZFS 2.1.
  - vfs.zfs.arc.dnode_limit in my setup effectively disabled the ARC pruning on OpenZFS 2.1.
    - Maybe this should be reverted to the default and retested.

- The zfskern{arc_evict} thread used up to 100% CPU in the final ~1 hour of the build.
  - The reason is not clear.
  - There were no significant effects on the system.
Comment 37 Seigo Tanimura 2024-01-24 11:44:14 UTC
(In reply to Seigo Tanimura from comment #36)

> Thomas, would you mind testing the backported fix to see if poudriere's build time changes in any way?

Please hold this test until further notice.

There is a bug in the backported fix of FreeBSD-EN-23:18.openzfs that blocks the ARC pruning when it should run.
Comment 38 Seigo Tanimura 2024-01-24 12:43:14 UTC
(In reply to Seigo Tanimura from comment #37)

The fix has been confirmed; the branches for stable/13 are now ready for testing.  My apologies for the trouble.

The flag for the running ARC pruning is now cleared by atomic_store_rel_int().  atomic_set_rel_int() was called by mistake; per atomic(9) it performs an atomic bitwise OR, so it can never clear the flag.
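
For illustration, a minimal userland sketch of that difference (FreeBSD-specific, using atomic(9) via <machine/atomic.h>; the flag name is made up for illustration and is not the identifier used in the fix):

#include <sys/types.h>
#include <machine/atomic.h>
#include <stdio.h>

/* Hypothetical "ARC pruning in progress" flag, currently set. */
static volatile u_int prune_running = 1;

int
main(void)
{
	/* Wrong: OR-ing bits into the flag can never clear it. */
	atomic_set_rel_int(&prune_running, 0);
	printf("after atomic_set_rel_int:   %u\n", prune_running);   /* still 1 */

	/* Right: storing 0 actually clears the flag. */
	atomic_store_rel_int(&prune_running, 0);
	printf("after atomic_store_rel_int: %u\n", prune_running);   /* now 0 */

	return (0);
}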

The fixed result will be shared as soon as the build test finishes on my side.
Comment 39 Seigo Tanimura 2024-01-25 03:47:08 UTC
Created attachment 247941 [details]
The result charts for Comment 40.
Comment 40 Seigo Tanimura 2024-01-25 03:50:23 UTC
(In reply to Seigo Tanimura from comment #38)

The results from the fixed stable/13 (13.3-PRERELEASE) branch are now ready to share.

Thomas, could you please reproduce the build with this kernel and see if the build time improves?  If so, I will work on merging the fix.  Thanks in advance.

* Sources on GitHub:

The same as comment #38.

* Test results

Test Summary:

- Branch and commit: stable/13-topic-openzfs-arc_prune-regulation-counters, ef898378041a1c67cd102e8e5eaca123a543029c
- Date: 24 Jan 2024 12:10Z - 24 Jan 2024 18:02Z
- Build time: 05:51:26 (363 pkgs / hr)
- Failed port(s): 2
- Skipped port(s): 2
- Setup
  - sysctl(3)
    - vfs.zfs.arc_max: 4294967296
      - 4GB.
    - vfs.zfs.arc.dnode_limit: 0 (default)
      - kstat.zfs.misc.arcstats.arc_dnode_limit: 322122547 (calculated automatically)
  - poudriere-bulk(8)
    - USE_TMPFS="wrkdir data localbase"

Result Chart Archive: (poudriere-bulk-13_3_prerelease-2024-01-24_21h20m00s.7z, Attachment #247941 [details])

- zfs-znodes-and-dnodes.png
  - The counts of the ZFS znodes and dnodes.
- zfs-arc-pruning-regulation.png
  - The counts of the ARC prune triggers by ZFS and the skips by the fix.
- zfs-dnodes-and-freeing-activity.png
  - The freeing activity of the ZFS znodes and dnodes.
- vnode-free-calls.png
  - The calls to the ZFS vnode freeing functions.

* Findings and Analysis

- The ARC pruning has worked in the same way as 14.0-RELEASE.
  - The prunable znodes were pruned down to less than 10% of the dnodes.
  - The behaviour after 18:10Z was due to the nightly cron job started at 18:01Z.

- The build time was virtually the same as comment #38.
  - Also virtually the same as 14.0-RELEASE.

- The zfskern{arc_evict} thread used up to 100% CPU in the final ~1 hour of the build.
  - The reason is not clear.
  - There were no significant effects on the system.
  - zfskern{arc_evict} stopped running upon completing the build.
Comment 41 Thomas Mueller 2024-01-25 06:31:53 UTC
I added changes from commits
  a57d4914c11f4bc6d5ed33e146f2664315f64701
  4efe36b1428a9956a049fc5fc5f19d4a001d51bf 
from your stable/13-topic-openzfs-arc_prune-regulation-fix branch
to my 13-STABLE kernel and things look much better now.

I have no exact comparable data though, but from observation
 - while poudriere is running, kernel{arc_prune} could not be 
   observed consuming 100% CPU or any other unusually high 
   values for longer amounts of time
 - while poudriere is running, parallel heavy disk I/O 
   (e.g. by periodic daily/weekly) will no longer slow the 
   system down
 - no spurious build failures ("bad file descriptor") observed
   for now
 - after a poudriere run, "pkg upgrade" of large packages
   (for example texlive-texmf) no longer triggers high
   CPU usage of kernel{arc_prune} and also no longer
   takes an unusually long time
 - poudriere pkg/hour went from 5-20 up to 55 (no exact
   comparable builds), just an observation from previous
   samples and one build with ~400 packages on the patched 
   system
 - no negative impact of poudriere (2 jails) running at idle 
   priority on productive work or video playback
Comment 42 Seigo Tanimura 2024-01-31 07:52:56 UTC
(In reply to Thomas Mueller from comment #41)

Thank you so much, and sorry for the delayed reply.  I now believe the fix works in your case as well.

* Fix Status

I am now checking the ARC eviction and pruning triggered by the vm_lowmem kernel event.  There was a case where the system stalled with pagedaemon{uma} running at a WCPU of ~0.3% in the late stage of poudriere-bulk(8).  I suspect that a pagedaemon thread put itself on the arc_evict_waiters list in arc_wait_for_eviction(), called by arc_lowmem() to handle the vm_lowmem kernel event.  In such a case, the ARC pruning has to work more eagerly so that the pagedaemon thread does not block forever.

The issue above has been confirmed experimentally by checking for a non-empty arc_evict_waiters list in {arc,zfs}_prune_task().  Also, arc_lowmem() already has code for that case:

> static void
> arc_lowmem(void *arg __unused, int howto __unused)
> {
> 	(snip)
> 	/*
> 	 * It is unsafe to block here in arbitrary threads, because we can come
> 	 * here from ARC itself and may hold ARC locks and thus risk a deadlock
> 	 * with ARC reclaim thread.
> 	 */
> 	if (curproc == pageproc)
> 		arc_wait_for_eviction(to_free, B_FALSE);
> }

After the current test, I will see if this issue can be worked out somehow.
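
For reference, a rough sketch of the kind of experimental check mentioned above, assuming the OpenZFS globals arc_evict_lock and arc_evict_waiters from module/zfs/arc.c; the function name and the dprintf() are illustrative only, not the actual diagnostic in the patch:

static boolean_t
arc_prune_has_eviction_waiters(void)
{
	boolean_t waiters;

	/* arc_evict_waiters is protected by arc_evict_lock. */
	mutex_enter(&arc_evict_lock);
	waiters = !list_is_empty(&arc_evict_waiters);
	mutex_exit(&arc_evict_lock);

	if (waiters)
		dprintf("arc_prune: ARC eviction waiter(s) present\n");

	return (waiters);
}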
Comment 43 Peter Much 2024-02-06 21:35:29 UTC
Just installed 13.3-BETA1, machine is doing effectively nothing:

System Memory:
        11.05%  864.11  MiB Active,     33.27%  2.54    GiB Inact
        28.36%  2.16    GiB Wired,      0.00%   0       Bytes Cache
        21.25%  1.62    GiB Free,       6.06%   474.05  MiB Gap

        Real Installed:                         8.00    GiB
        Real Available:                 98.15%  7.85    GiB
        Real Managed:                   97.22%  7.63    GiB

        Logical Total:                          8.00    GiB
        Logical Used:                   47.97%  3.84    GiB
        Logical Free:                   52.03%  4.16    GiB

ARC Size:                               3.49%   244.88  MiB
        Target Size: (Adaptive)         4.04%   283.59  MiB
        Min Size (Hard Limit):          3.58%   251.26  MiB
        Max Size (High Water):          27:1    6.85    GiB
        Compressed Data Size:                   160.01  MiB
        Decompressed Data Size:                 347.37  MiB
        Compression Factor:                     2.17

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
    0 root         -8    -     0B    10M CPU2     2 406:21 100.00% kernel{arc_prune}


This is not useable.
Comment 44 Seigo Tanimura 2024-02-08 03:32:05 UTC
(In reply to Seigo Tanimura from comment #42)

* Fix Status

- Backport to releng/13.3.

Done locally.


- Stall upon low memory

When the vm_lowmem kernel event happens in a situation where the ARC pruning cannot evict sufficient ZFS vnodes, a pagedaemon thread may wait for the ARC eviction indefinitely.  This causes a partial system stall.

Accelerate the ARC pruning in such a case.

The fix has been tested locally.


- Nullfs(5) node recycling

This is the fix targeting at poudriere-bulk(8).

Recycle the nullfs(5) vnodes not in use in the same way as the znodes, so that the lower ZFS vnodes can be recycled as well.  The implementation is partly shared with the accounting of the ZFS in-use znodes.

This has made a drastic improvement in the ZFS behaviour, including:
  - The ARC dnode size has been reduced greatly; it no longer grows monotonically during poudriere-bulk(8).
  - The ARC metadata and data now always have some evictable size.  At least, they no longer fall to zero.
  - There is always some number of prunable ZFS vnodes.

I believe this is how ZFS is supposed to work.

The fix has been tested locally.


- In-use counter overshoot and undershoot

An overshoot of the nullfs(5) in-use node counter (introduced for the nullfs(5) node recycling) has been found.  This may cause a wraparound in the vnlru_free_vfsops() argument and hence lead to out-of-control behaviour.
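
To illustrate the failure mode, here is a generic, self-contained example; it assumes the count handed to vnlru_free_vfsops() is derived by a subtraction of this kind, and the numbers and names are purely illustrative:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t total_nodes = 1000;	/* hypothetical numbers */
	uint64_t in_use_nodes = 1003;	/* in-use counter overshoots the total */
	uint64_t to_free, safe_to_free;

	to_free = total_nodes - in_use_nodes;	/* unsigned underflow: wraps */
	printf("to_free = %ju\n", (uintmax_t)to_free);
	/* Prints 18446744073709551613 (2^64 - 3): a runaway free request. */

	/* A saturating subtraction avoids the runaway request. */
	safe_to_free = (total_nodes > in_use_nodes) ?
	    total_nodes - in_use_nodes : 0;
	printf("safe_to_free = %ju\n", (uintmax_t)safe_to_free);

	return (0);
}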

The fix has been applied to nullfs(5) and ZFS.

The local test is in progress.


I will publish the updated git repo once the local test above completes.

Hope there are no more blockers...
Comment 45 Seigo Tanimura 2024-02-13 06:55:17 UTC
Created attachment 248418 [details]
The result charts for Comment 51.
Comment 46 Seigo Tanimura 2024-02-13 06:55:48 UTC
Created attachment 248419 [details]
The result charts for Comment 51.
Comment 47 Seigo Tanimura 2024-02-13 06:56:22 UTC
Created attachment 248420 [details]
The result charts for Comment 51.
Comment 48 Seigo Tanimura 2024-02-13 06:56:49 UTC
Created attachment 248421 [details]
The result charts for Comment 51.
Comment 49 Seigo Tanimura 2024-02-13 06:57:13 UTC
Created attachment 248422 [details]
The result charts for Comment 51.
Comment 50 Seigo Tanimura 2024-02-13 06:57:32 UTC
Created attachment 248423 [details]
The result charts for Comment 51.
Comment 51 Seigo Tanimura 2024-02-13 07:05:09 UTC
(In reply to Seigo Tanimura from comment #44)

The updated fix is now ready to share.

As of this writing, the test on releng/13.2 is running well without any unexpected build errors.  Hopefully all of the latest fixes are now ready for review and merging.

* Updates

- Accelerate the ZFS vnode recycling when there are any ARC eviction waiters.
- Recycle the nullfs vnodes, triggered by the vm_lowmem kernel events.


* Github Sources

All of the sources are under https://github.com/altimeter-130ft/freebsd-freebsd-src.

- Branches and Git Commit Hashes

            |                                                         | Git Commit Hash
            | Fix Branch                                              |
Base Branch | Fix + Counter Branch                                    | Base
============+=========================================================+=================
main        | topic-openzfs-arc_prune-regulation-fix                  | 57ddfad884
            | topic-openzfs-arc_prune-regulation-counters             | 
------------+---------------------------------------------------------+-----------------
stable/14   | stable/14-topic-openzfs-arc_prune-regulation-fix        | 20a6f4779a
            | stable/14-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/14.0 | releng/14.0-topic-openzfs-arc_prune-regulation-fix      | 4edf3b8073
            | releng/14.0-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
stable/13   | stable/13-topic-openzfs-arc_prune-regulation-fix        | 9d2f548bbe
            | stable/13-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/13.3 | releng/13.3-topic-openzfs-arc_prune-regulation-fix      | 24eb518714
            | releng/13.3-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
releng/13.2 | releng/13.2-topic-openzfs-arc_prune-regulation-fix      | c78c31d2ef
            | releng/13.2-topic-openzfs-arc_prune-regulation-counters | 

            | Git Commit Hash
            | FreeBSD-EN-23:18.openzfs | ZFS & VFS  | Nullfs     | Counters
Base Branch | Backport                 | Fix        | Fix        | (Not for merging)
============+==========================+============+============+===================
main        | N/A                      | cb4a97bd6b | 7bbbce6313 | ed4932a2b4
------------+--------------------------+------------+------------+-------------------
stable/14   | N/A                      | f1e1c524a3 | 69f7866c36 | 4e5471ade7
------------+--------------------------+------------+------------+-------------------
releng/14.0 | N/A                      | af25ebd2d0 | d13f4d74a0 | 09103b5f9c
------------+--------------------------+------------+------------+-------------------
stable/13   | c733dcff2e               | 9c65b44d3d | e2bac2ef1e | d52eb31227
------------+--------------------------+------------+------------+-------------------
releng/13.3 | ca178f626f               | 90bfaef58d | bd4cceca37 | aad948883e
------------+--------------------------+------------+------------+-------------------
releng/13.2 | c733dcff2e               | 9c65b44d3d | e2bac2ef1e | d52eb31227


* poudriere-bulk(8) Results

- Contents of Result Chart Archives
  - poudriere-bulk-(base-branch)-(timestamp).7z
    - vnodes.png
      - The overall vnodes.
    - vnode-free-calls.png
      - The calls to the vnode recycle functions.
    - freed-vnodes.png
      - The vnodes freed for recycling.
    - nullfs-nodes.png
      - The nullfs vnodes.
    - vm_lowmem-kernel-events.png
      - The vm_lowmem kernel events.
  - poudriere-bulk-(base-branch)-(timestamp)-zfs.7z
    - zfs-arc-dnode-size.png
      - The ZFS ARC dnode size in bytes.
    - zfs-arc-pruning-regulation.png
      - The ZFS ARC pruning requests and regulation.
    - zfs-znodes-and-dnodes.png
      - The ZFS ARC znodes and dnodes.
    - zfs-dnodes-and-freeing-activity.png
      - The ZFS ARC dnodes and pruning activity.
  - poudriere-bulk-(base-branch)-(timestamp)-zfs-sizes.7z
    - zfs-arc-mru-metadata-size.png
      - The ZFS ARC MRU metadata size.
    - zfs-arc-mru-data-size.png
      - The ZFS ARC MRU data size.
    - zfs-arc-mfu-metadata-size.png
      - The ZFS ARC MFU metadata size.
    - zfs-arc-mfu-data-size.png
      - The ZFS ARC MFU data size.
  - Some charts and traces are not available in 13.x due to the major version difference.


- Common Setup
  - sysctl(3)
    - vfs.zfs.arc_max: 4294967296
      - 4GB.
    - vfs.zfs.arc.dnode_limit: 0 (default)
      - kstat.zfs.misc.arcstats.arc_dnode_limit: 322122547 (calculated automatically)
  - poudriere-bulk(8)
    - USE_TMPFS="wrkdir data localbase"

- releng/14.0

  - Date
    - 09 Feb 2024 00:30Z - 09 Feb 2024 09:07Z
  - Build time
    - 09:06:22 (234 pkgs / hr)

  - Failed port(s): 3
    - japanese/xv: Expected, port Makefile problem?
    - graphics/gimp-app: Expected, a compiler internal error.
    - java/eclipse: Expected, JVM out of heap.
  - Skipped port(s): 2
    - graphics/gimp: graphics/gimp-app.
    - print/gimp-gutenprint: graphics/gimp-app.

  - Result Chart Archive
    - poudriere-bulk-14_0_release_p4-2024-02-09_09h30m00s.7z, Attachment #248418 [details]
    - poudriere-bulk-14_0_release_p4-2024-02-09_09h30m00s-zfs.7z, Attachment #248419 [details]
    - poudriere-bulk-14_0_release_p4-2024-02-09_09h30m00s-zfs-sizes.7z, Attachment #248420 [details]


- releng/13.3

  - Date
    - 08 Feb 2024 03:20Z - 09 Feb 2024 12:01Z
  - Build time
    - 09:00:46 (237 pkgs / hr)

  - Failed port(s): 1
    - japanese/xv: Expected, port Makefile problem?
  - Skipped port(s): 0

  - Result Chart Archive
    - poudriere-bulk-13_3_beta_1-2024-02-08_12h20m00s.7z, Attachment #248421 [details]
    - poudriere-bulk-13_3_beta_1-2024-02-08_12h20m00s-zfs.7z, Attachment #248422 [details]
    - poudriere-bulk-13_3_beta_1-2024-02-08_12h20m00s-zfs-sizes.7z, Attachment #248423 [details]
      - Missing traces
        - The total sizes of each ARC states and types. (Data and metadata)

- releng/13.2

  - In progress.
    - Started at 13 Feb 2024 02:40Z.


- main, stable/14, stable/13

  - No plan for the poudriere-bulk(8) tests for now.
    - stable/14 and stable/13 are expected to behave like releng/14.0 and releng/13.3, respectively.


- Analysis and Findings

  - The ZFS ARC dnodes stayed at most ~400MB.
    - Before the nullfs fix: monotonic increase in general.
    - The ZFS ARC dnodes decreased when only a few builders were running.
  - The vnodes stayed at most ~1.2M.
    - Before the nullfs fix: up to ~3M.
  - The nullfs node count is controlled well according to the build load.
  - There are often ARC eviction waiters in the late stage of poudriere-bulk(8).
    - The tmpfs wrkdirs hold a lot of files for both the build and prerequisites.
  - There are almost always some ARC-prunable vnodes.
  - The ZFS ARC always has some evictable parts.
    - Almost all data in the ZFS ARC is evictable.
Comment 52 Seigo Tanimura 2024-02-13 16:08:36 UTC
Created attachment 248437 [details]
The result charts for Comment 55.
Comment 53 Seigo Tanimura 2024-02-13 16:09:03 UTC
Created attachment 248438 [details]
The result charts for Comment 55.
Comment 54 Seigo Tanimura 2024-02-13 16:09:22 UTC
Created attachment 248439 [details]
The result charts for Comment 55.
Comment 55 Seigo Tanimura 2024-02-13 16:10:58 UTC
(In reply to Seigo Tanimura from comment #51)

The test on releng/13.2 has completed with the expected results.


- releng/13.2

  - Date
    - 13 Feb 2024 02:40Z - 13 Feb 2024 11:06Z
  - Build time
    - 08:26:43 (252 pkgs / hr)

  - Failed port(s): 2
    - security/aws-vault: Expected, occasional network problem.
    - japanese/xv: Expected, port Makefile problem?
  - Skipped port(s): 0

  - Result Chart Archive
    - poudriere-bulk-13_2_release-2024-02-13_11h40m00s.7z, Attachment #248437 [details]
      - Missing traces
        - The total vnodes.
    - poudriere-bulk-13_2_release-2024-02-13_11h40m00s-zfs.7z, Attachment #248438 [details]
    - poudriere-bulk-13_2_release-2024-02-13_11h40m00s-zfs-sizes.7z, Attachment #248439 [details]
      - Missing traces
        - The total sizes of each ARC states and types. (Data and metadata)

  - The overall tendency of the results is the same as releng/14.0 and releng/13.3.


- Version Dependency of Fixes

The fixes for main, stable/14 and releng/14.0 are essentially the same.

The fixes for stable/13 and releng/13.3 have some differences from above because of the ZFS version difference.

In addition to the above, the releng/13.2 fix also has some changes in the VFS.
Comment 56 Seigo Tanimura 2024-02-16 15:47:39 UTC
(In reply to Seigo Tanimura from comment #55)

One small update:

* Update(s)

- Split the VFS part of the per-filesystem vnode counts into a separate commit.
  - This commit adds a new field to struct mount.  __FreeBSD_version may hence
    have to be bumped.
- No functional changes in the fixes.
  - *-fix and all branches depending on it are not changed except for the
    commit hashes due to rebasing.


* Github Sources

All of the sources are under https://github.com/altimeter-130ft/freebsd-freebsd-src.

- Branches and Git Commit Hashes

            |                                                         | Git Commit Hash
            | Fix Branch                                              |
Base Branch | Fix + Counter Branch                                    | Base
============+=========================================================+=================
main        | topic-openzfs-arc_prune-regulation-fix                  | 57ddfad884
            | topic-openzfs-arc_prune-regulation-counters             | 
------------+---------------------------------------------------------+-----------------
stable/14   | stable/14-topic-openzfs-arc_prune-regulation-fix        | 20a6f4779a
            | stable/14-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/14.0 | releng/14.0-topic-openzfs-arc_prune-regulation-fix      | 4edf3b8073
            | releng/14.0-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
stable/13   | stable/13-topic-openzfs-arc_prune-regulation-fix        | 9d2f548bbe
            | stable/13-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/13.3 | releng/13.3-topic-openzfs-arc_prune-regulation-fix      | 24eb518714
            | releng/13.3-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
releng/13.2 | releng/13.2-topic-openzfs-arc_prune-regulation-fix      | c78c31d2ef
            | releng/13.2-topic-openzfs-arc_prune-regulation-counters | 

            | Git Commit Hash
            |                          | Per-filesystem |            |            |
            | FreeBSD-EN-23:18.openzfs | Vnode Counters | ZFS & VFS  | Nullfs     | Counters
Base Branch | Backport                 | (VFS part)     | Fix        | Fix        | (Not for merging)
============+==========================+================+============+============+===================
main        | N/A                      | ee039d8ac5     | 79221e0ef1 | 207bd6942b | 319570b6b8
------------+--------------------------+----------------+------------+------------+-------------------
stable/14   | N/A                      | 2e4eb52fd9     | 7b7bf0eeda | 38baf54058 | 9ba27996f6
------------+--------------------------+----------------+------------+------------+-------------------
releng/14.0 | N/A                      | 5de24ff9b4     | de37505406 | 33ba6525ef | 3d2e1a59c9
------------+--------------------------+----------------+------------+------------+-------------------
stable/13   | 5c62c48b7a               | d68540aa6e     | 17ca0ab252 | f004098adf | 58053d6f59
------------+--------------------------+----------------+------------+------------+-------------------
releng/13.3 | 0d279f56c3               | fb47a19236     | 4af563a184 | 5d309ad92f | 5814ad74f5
------------+--------------------------+----------------+------------+------------+-------------------
releng/13.2 | b784851090               | 7ef6cf8f72     | 5e979de6ca | 14702db1b8 | 185077cf44
Comment 57 Adam McDougall 2024-02-16 23:35:33 UTC
Thank you for the patches and the effort. I am testing them on at least two version 13 systems that have had problems; one of them, today included, has definitely had a runaway arc_prune.
Comment 58 Thomas Mueller 2024-02-19 08:51:45 UTC
I've been running the patches announced in comment#51 for a while 
and now those from comment#56 without noticing any negative effects,
in fact running port builds (at idle priority) in the background leaves
the systems perfectly usable (again).
Thanks!
Comment 59 Peter Much 2024-02-19 13:22:06 UTC
Endless loop in arc_prune has happened again. 

There is no poudriere and no builds in use here, just a (mostly) idle desktop with (mostly) default settings, still at 13.3-BETA1.

I then randomly changed some sysctls (with not much hope) and intended to pull a crashdump once I was done with my open editors. Then, after about half an hour, the weirdness just stopped and arc_prune went back to its usual kstack; no idea whether that was due to the sysctl changes or to other reasons.

I'll work my way through all the material here when I find the time, but I am not interested in fixing poudriere (which I do not use).
Comment 60 Seigo Tanimura 2024-02-19 23:28:52 UTC
(In reply to Peter Much from comment #59)

Could you please run "sysctl vfs.zfs.znode" a few times at an interval of about 10 seconds while your issue is reproducing?

I want to check what triggers the ARC pruning and how often that happens in your case.
Comment 61 Seigo Tanimura 2024-02-19 23:36:51 UTC
(In reply to Seigo Tanimura from comment #60)

Along with "sysctl vfs.zfs.znode", could you please also take vfs.nullfs.recycle?

It has the counters of the vm_lowmem kernel events.
(EVENTHANDLER(9) does not have the generic counters, so I have hacked them into the nullfs fix.)
Comment 62 Peter Much 2024-02-20 00:19:29 UTC
(In reply to Seigo Tanimura from comment #60)

Okay, thanks. These OIDs seem to be available neither in 13.3-BETA1 nor in git.freebsd.org. I'll see where to find them. Might take a bit of time.
Comment 63 Seigo Tanimura 2024-02-20 00:46:11 UTC
(In reply to Peter Much from comment #62)

Those OIDs are in my fix branch.

https://github.com/altimeter-130ft/freebsd-freebsd-src/tree/releng/13.3-topic-openzfs-arc_prune-regulation-fix
(Based on releng/13.3, 24eb518714)

Please "git clone" that branch, build and install the kernel out of it to see those OIDs.
Comment 64 Seigo Tanimura 2024-02-20 00:54:26 UTC
(In reply to Peter Much from comment #62)

Just to be sure, are your results in comment #59 and comment #43 from the source with my fixes, or the official 13.3-BETA1 build?

If you are not sure, please let me see "uname -a".  It has the git branch and commit hash from which the kernel is built.
Comment 65 Seigo Tanimura 2024-02-21 04:58:53 UTC
My best appreciation to everyone who joined, tested, reviewed and advised on this issue.

I submitted the email to update FreeBSD-EN-23:18.openzfs just now.
(cc: the committers on this case)

I will keep this case open until the handling of the fix is determined, and until it is merged if that is decided.
Comment 66 Peter Much 2024-02-21 13:20:39 UTC
(In reply to Seigo Tanimura from comment #64)
They are from the official distribution 6c2137593990 (plus local fixes):

FreeBSD disp.***** 13.3-BETA1 FreeBSD 13.3-BETA1[0d65a8c79a4a=6c2137593990+27] D6R13V1 amd64

This installation comes from a deploy engine, we're usually not building locally.
(Building locally means starting a test case, and to do so I must first get rid of my currently open test case, otherwise I no longer know what exactly was changed where. This is why things proceed slowly here - I usually do root cause analysis.)

I am mainly worried about the endless loop. An endless loop in a kernel thread MUST NOT happen (for whatever reason), because it cannot be contained with rctl(8); it will keep the CPU at max and may bring thermal load steering into disarray.
Comment 67 Peter Much 2024-02-23 19:25:52 UTC
So, now I read all the material here. Great work!

I had upgraded my deploy engine from 13.2-RELEASE to 13.3-BETA, and found (among some spurious messages from git) that it can no longer build gcc12.

There is apparently no problem with rust or llvm15, but trying to build gcc12 reproducibly crashes the machine (10 cores, 16081M RAM). Apparently the crash happens when gcc fully powers up its LTO for the first time:

last pid: 37369;  load averages:  9.35,  9.93,  9.27    up 0+03:15:25  07:21:42
417 threads:   14 running, 379 sleeping, 24 waiting
CPU: 55.4% user,  0.0% nice, 35.6% system,  0.1% interrupt,  8.8% idle
Mem: 7047M Active, 6121M Inact, 2392M Wired, 984M Buf, 60M Free
ARC: 518M Total, 45M MFU, 451M MRU, 128K Anon, 3990K Header, 17M Other
     467M Compressed, 997M Uncompressed, 2.14:1 Ratio
Swap: 15G Total, 15G Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
    0 root         -8    -     0B  2432K CPU4     4   3:14  99.79% kernel{arc_p
    7 root        -16    -     0B    48K CPU6     6   2:45  99.79% pagedaemon{d
   15 root         52    -     0B    16K CPU0     0   3:00  99.70% vnlru
37334 root         52    0   891M   789M pfault   1   0:37  89.24% lto1
37270 root         52    0  1017M   915M pfault   3   0:43  88.63% lto1
37324 root         52    0   831M   770M pfault   8   0:39  88.59% lto1
37338 root         52    0   843M   785M pfault   2   0:36  88.50% lto1
37333 root         52    0   889M   788M pfault   7   0:37  82.76% lto1
37269 root         52    0  1001M   882M pfault   5   0:42  82.09% lto1
37274 root         52    0  1004M   885M pfault   9   0:42  80.24% lto1
    5 root         20    -     0B  1568K t->zth   9   0:02   1.02% zfskern{arc_
37360 root         20    0    14M  4940K CPU9     9   0:00   0.87% top

This is the last output; at this point the system becomes unresponsive and, when allowed neither to OOM-kill nor to panic, continues to consume 300% CPU. Apparently these are the three visible apocalyptic riders (arc_prune, pagedaemon, vnlru) entertaining themselves. :/

Implementing the patch (i.e. five new git commits from the github repo) solves the issue, and afterwards it looks like this:

last pid: 11944;  load averages:  7.13,  5.29,  5.77    up 0+03:48:45  16:12:46
424 threads:   19 running, 381 sleeping, 24 waiting
CPU: 67.9% user,  0.0% nice,  5.1% system,  0.0% interrupt, 27.0% idle
Mem: 9308M Active, 2285M Inact, 20M Laundry, 3643M Wired, 865M Buf, 336M Free
ARC: 1638M Total, 855M MFU, 575M MRU, 128K Anon, 11M Header, 198M Other
     1305M Compressed, 2980M Uncompressed, 2.28:1 Ratio
Swap: 15G Total, 15G Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
11579 root        103    0  1269M  1066M CPU6     6   4:09 100.00% lto1
11605 root        103    0  1263M  1052M CPU3     3   4:08  99.87% lto1
11589 root        103    0  1295M  1091M CPU8     8   4:09  99.87% lto1
11599 root        103    0  1259M  1027M CPU9     9   4:08  99.87% lto1
11588 root        103    0  1263M  1035M CPU7     7   4:09  99.87% lto1
11590 root        103    0  1287M  1058M CPU5     5   4:08  99.87% lto1
11598 root        103    0  1311M  1082M CPU1     1   4:08  99.74% lto1
    0 root         -8    -     0B  2448K -        6   0:03   6.83% kernel{arc_p
    5 root         -8    -     0B  1568K RUN      9   0:03   5.80% zfskern{arc_
    7 root        -16    -     0B    48K psleep   2   0:37   3.11% pagedaemon{d

I'm a bit worried the thing is still reluctant to page out, but otherwise this looks good.
Comment 68 Seigo Tanimura 2024-02-29 05:04:27 UTC
(In reply to Peter Much from comment #67)

Thanks a lot for your test results, and my apologies for the delayed response.

Your results look just as expected.

> I'm a bit worried the thing is still reluctant to page out, but otherwise this looks good.

Some CPU usage by the ARC eviction, pruning and pagedaemon threads is still expected.  My fix is meant to keep them from running out of control.

In the FreeBSD ZFS implementation, there are two triggers for the ARC eviction and pruning:

1. The size reduction of ARC after adding new blocks to it.

This is part of the original ARC implementation in ZFS, intended to maintain the ARC size.  It usually happens along with ZFS activity.

2. The vm_lowmem kernel event.

This event fires when either the free VM pages or the kernel memory (malloc(9), uma(9), etc.) runs low, and it reclaims the kernel memory used by ZFS.  You can hence see this behaviour even when ZFS is idle.  Some other kernel processes and threads also act on this trigger, including pagedaemon and, with my fix, nullfs(5).
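
For those unfamiliar with the mechanism, a minimal sketch of how a kernel subsystem can hook the vm_lowmem event with EVENTHANDLER(9); arc_lowmem() is registered with this same event in the FreeBSD ZFS code, but the handler body and names below are illustrative only:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/eventhandler.h>

static eventhandler_tag example_lowmem_tag;

/* Called by the kernel when the VM system reports memory pressure. */
static void
example_lowmem(void *arg __unused, int flags __unused)
{
	/* Release whatever caches this subsystem can give back. */
}

/* Register the handler, typically from the subsystem's init path. */
static void
example_lowmem_init(void)
{
	example_lowmem_tag = EVENTHANDLER_REGISTER(vm_lowmem,
	    example_lowmem, NULL, EVENTHANDLER_PRI_FIRST);
}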

Case 2 has to be handled carefully.  On FreeBSD ZFS, the pagedaemon thread blocks until ZFS either achieves some eviction or gives up completely.  The ARC eviction and pruning must work more aggressively in such a case, while avoiding starving the application threads.  My fix addresses this issue in the following ways:

- Prune every single ZFS vnode. (Work aggressively)
- Pause 1 second before repeating the above. (Avoid starving applications)

The results of this fix include dynamic changes in the CPU usage of the ARC eviction and pruning threads.  This may be seen more clearly with top(1) using a fractional interval. (I have not tried that, though.)  A rough sketch of the low-memory pruning loop described above follows.
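
A rough sketch of that loop (hypothetical helper and loop structure, not the committed code; vnlru_free_vfsops() and zfs_vfsops are the existing FreeBSD interfaces, the rest is illustrative):

static void
zfs_prune_aggressively(void)
{
	/*
	 * arc_has_eviction_waiters() is a hypothetical predicate for
	 * "some thread is blocked in arc_wait_for_eviction()".
	 */
	while (arc_has_eviction_waiters()) {
		/* Work aggressively: ask the VFS to recycle every ZFS vnode it can. */
		vnlru_free_vfsops(INT_MAX, &zfs_vfsops);

		/* Avoid starving applications: pause one second, then retry. */
		pause("zfsprn", hz);
	}
}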
Comment 69 Seigo Tanimura 2024-03-01 18:25:02 UTC
(In reply to Seigo Tanimura from comment #65)

The changes have now been uploaded to https://reviews.freebsd.org/ and are now ready for review there.

Revisions: D44170 - D44178
(Some revisions are intended for multiple branches.)

Some branches on GitHub have been updated to follow the commit log rule for the review.  No functional changes.
Comment 70 Seigo Tanimura 2024-03-05 09:06:34 UTC
(In reply to Seigo Tanimura from comment #69)

A quick note that the nullfs fix has been updated as per kib's review on https://reviews.freebsd.org/D44177.

The nullfs fix is now implemented as a sysctl toggle (vfs.nullfs.cache_nodes) that controls the nullfs vnode caching, which is enabled by default.  Set it to false (0) to disable the nullfs vnode caching so that the lower vnodes can be recycled smoothly.

* Github Sources

All of the sources are under https://github.com/altimeter-130ft/freebsd-freebsd-src.

- Branches and Git Commit Hashes

            |                                                         | Git Commit Hash
            | Fix Branch                                              |
Base Branch | Fix + Counter Branch                                    | Base
============+=========================================================+=================
main        | topic-openzfs-arc_prune-regulation-fix                  | 57ddfad884
            | topic-openzfs-arc_prune-regulation-counters             | 
------------+---------------------------------------------------------+-----------------
stable/14   | stable/14-topic-openzfs-arc_prune-regulation-fix        | 20a6f4779a
            | stable/14-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/14.0 | releng/14.0-topic-openzfs-arc_prune-regulation-fix      | 4edf3b8073
            | releng/14.0-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
stable/13   | stable/13-topic-openzfs-arc_prune-regulation-fix        | 9d2f548bbe
            | stable/13-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/13.3 | releng/13.3-topic-openzfs-arc_prune-regulation-fix      | 24eb518714
            | releng/13.3-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
releng/13.2 | releng/13.2-topic-openzfs-arc_prune-regulation-fix      | c78c31d2ef
            | releng/13.2-topic-openzfs-arc_prune-regulation-counters | 

            | Git Commit Hash
            |                          | Per-filesystem |            |            
            | FreeBSD-EN-23:18.openzfs | Vnode Counters | ZFS & VFS  | Nullfs
Base Branch | Backport                 | (VFS part)     | Fix        | Fix
============+==========================+================+============+============
main        | N/A                      | ee039d8ac5     | 79221e0ef1 | 15e81087f5
------------+--------------------------+----------------+------------+------------
stable/14   | N/A                      | 2e4eb52fd9     | 7b7bf0eeda | 37404c6e9e
------------+--------------------------+----------------+------------+------------
releng/14.0 | N/A                      | 5de24ff9b4     | de37505406 | 0a2e6025ce
------------+--------------------------+----------------+------------+------------
stable/13   | 5c62c48b7a               | d68540aa6e     | 17ca0ab252 | ae550ed68a
------------+--------------------------+----------------+------------+------------
releng/13.3 | 0d279f56c3               | fb47a19236     | 4af563a184 | 767b47b774
------------+--------------------------+----------------+------------+------------
releng/13.2 | b784851090               | 7ef6cf8f72     | 5e979de6ca | 93cefa9124

            | Git Commit Hash
            | Counters
Base Branch | (Not for merging)
============+===================
main        | 4d5ae0f6aa
------------+-------------------
stable/14   | bf4b2e420d
------------+-------------------
releng/14.0 | 9de8754bcf
------------+-------------------
stable/13   | 8b9a9ad167
------------+-------------------
releng/13.3 | d0389390ce
------------+-------------------
releng/13.2 | 5301e244af
Comment 71 Anton Saietskii 2024-03-06 10:49:40 UTC
Same issue here on releng/13.3, on an i7-7820HQ with 64G RAM. But there's a difference as well -- my machine even freezes completely while arc_prune+vnlru are eating CPU.
If I run 9 jobs, the freeze duration is like tens of minutes, but even with a single job the machine freezes for about a minute at some points of the poudriere build.
Comment 72 Mark Millard 2024-03-06 17:05:28 UTC
(In reply to Anton Saietskii from comment #71)

One of the things that can happen when free RAM is low is
for inactive processes to have their kernel stacks swapped
out. That can include the processes used for interacting with
the system (input and output). Until the kernel stacks of
the relevant processes are swapped back in, interaction via
those processes is blocked.

Do you know what kind of hang you are seeing? Might it be
an example of such?

For reference, in /etc/sysctl.conf I use:

#
# Together this pair avoids swapping out the process kernel stacks.
# This avoids processes for interacting with the system from being
# hung-up by such.
vm.swap_enabled=0
vm.swap_idle_enabled=0

Both are writable live:

# sysctl -Wd vm.swap_enabled vm.swap_idle_enabled
vm.swap_enabled: Enable entire process swapout
vm.swap_idle_enabled: Allow swapout on idle criteria

but are not tunables ( avoid in /boot/loader.conf ):

# sysctl -Td vm.swap_enabled vm.swap_idle_enabled
#
Comment 73 Anton Saietskii 2024-03-06 18:12:20 UTC
(In reply to Mark Millard from comment #72)

TBH, I didn't really get what you mean by "kind" of hang, but perhaps the following will be helpful:

(Vanilla releng/13.3 currently.)
The machine is running a few daemons with a very stable memory footprint (built-in ones like sshd and ntpd, plus a couple of 3rd-party ones like net/wifibox and the net-p2p/transmission daemon). Nothing related to memory has been tuned, and there's always ~2-3G of free RAM and zero swap usage.
Then I start `poudriere bulk`, and as soon as some big distfile (e.g. firefox, libreoffice) begins extraction to tmpfs, arc_prune+vnlru start to recklessly eat CPU^W^W^W do their job trying to evict some ARC for me. The system becomes slow and stuttering, followed by full unresponsiveness: the network stops, disk activity stops, and there is no reaction on the physical console.
After several minutes (I guess when the distfile finishes extraction), everything returns to normal as if nothing had happened, with no log messages at all. The poudriere build then continues, with arc_prune+vnlru occasionally running as tmpfs fills with object files produced by the compiler. Not a single byte of swap usage was observed during this.

So, it's like what you described, but for me the issue hits faster and harder. It looks like the CPU just has no time to execute anything beyond those two and possibly some other kernel threads.
Comment 74 Mark Millard 2024-03-06 18:26:39 UTC
(In reply to Anton Saietskii from comment #73)

Thanks for the description of what the hang is like. Since
it does finish, it might be a form of livelock rather than
an overall deadlock (progress was made).

It sounds like the things that suspend are not ones that
vm.swap_enabled=0 and vm.swap_idle_enabled=0 would change
anything for. So my note is probably not of much use in
this context.
Comment 75 Anton Saietskii 2024-03-06 18:35:38 UTC
(In reply to Mark Millard from comment #74)

Yes, luckily it comes back to normal.
Before I discovered this PR, the plan was to try limiting the ARC size so that instead of 2-3G of free memory I'd have something like ~12G (that plan actually isn't cancelled).
I'm also considering pulling patches from your tree, but I haven't dug into the repository deeply enough yet (I need to find the right branch, which patches exactly to apply, etc.).
Comment 76 mike 2024-03-06 19:18:57 UTC
(In reply to Seigo Tanimura from comment #70)
Hi,
thank you for all your work on this. If I wanted to try this locally on a RELENG_14 box, what is the best way to patch a system ?
Comment 77 Peter Much 2024-03-07 05:19:26 UTC
(In reply to mike from comment #76)
That might depend on your general handling of emergency fixes. I, for my part, cloned the repo from Seigo Tanimura, found the five most recent commits on the appropriate branch, extracted them with git format-patch, and applied these onto my source tree. Then I built a new kernel and things were fine; problems solved.
Comment 78 Seigo Tanimura 2024-03-07 07:32:15 UTC
(In reply to mike from comment #76)

I was checking the sample git steps just when I had to leave my place.  I will share the steps in a few more hours when I get home.
(Comment #77: Peter, thanks!)
Comment 79 Seigo Tanimura 2024-03-07 15:59:57 UTC
(In reply to mike from comment #76)

Until my fix gets merged to the official source, you can keep the fix as a git branch in your local repository and build the kernel out of it.

Example steps for keeping the fix along with the official source:

- Local fix branch
  releng/14.0-topic-openzfs-arc_prune-regulation-fix-local
- Official FreeBSD branch
  releng/14.0
- Remote name of my fix repository
  a130ft


A) Clone the git repository and create the fix branch.

Take these steps to set up the git repository.  Perform them only once.

1. Clone the FreeBSD source repository.

The following steps clone the FreeBSD source repository into ~/freebsd-zfs-fix-localtracking.freebsd-src.

gitrepo@silver:~ % mkdir ~/freebsd-zfs-fix-localtracking
gitrepo@silver:~ % pushd ~/freebsd-zfs-fix-localtracking
~/freebsd-zfs-fix-localtracking ~
gitrepo@silver:~/freebsd-zfs-fix-localtracking % git clone https://github.com/freebsd/freebsd-src.git
Cloning into 'freebsd-src'...
(snip)
Updating files: 100% (99431/99431), done.
gitrepo@silver:~/freebsd-zfs-fix-localtracking % cd freebsd-src
gitrepo@silver:~/freebsd-zfs-fix-localtracking/freebsd-src % 


2. Add the remote repository of my fix and create the local fix branch.

The following steps add my fix repository as a remote repository and make the local fix branch.  Also, the local branch tracking the official FreeBSD branch is created to make the tracking easy.

gitrepo@silver:~/freebsd-zfs-fix-localtracking/freebsd-src % git remote add a130ft https://github.com/altimeter-130ft/freebsd-freebsd-src.git
gitrepo@silver:~/freebsd-zfs-fix-localtracking/freebsd-src % git fetch a130ft
(snip)
From https://github.com/altimeter-130ft/freebsd-freebsd-src
(snip)
 * [new branch]                releng/14.0-topic-openzfs-arc_prune-regulation-fix                           -> a130ft/releng/14.0-topic-openzfs-arc_prune-regulation-fix
(snip)
gitrepo@silver:~/freebsd-zfs-fix-localtracking/freebsd-src % git branch releng/14.0 origin/releng/14.0
branch 'releng/14.0' set up to track 'origin/releng/14.0'.
gitrepo@silver:~/freebsd-zfs-fix-localtracking/freebsd-src % git branch releng/14.0-topic-openzfs-arc_prune-regulation-fix-local a130ft/releng/14.0-topic-openzfs-arc_prune-regulation-fix
branch 'releng/14.0-topic-openzfs-arc_prune-regulation-fix-local' set up to track 'a130ft/releng/14.0-topic-openzfs-arc_prune-regulation-fix'.
gitrepo@silver:~/freebsd-zfs-fix-localtracking/freebsd-src % 


B) Maintain the fix branch.

Perform these steps to update your local fix branch to the official FreeBSD branch after pulling it from the upstream.

3. Rebase the fix branch onto the official FreeBSD branch you want to track.

gitrepo@silver:~/freebsd-zfs-fix-localtracking/freebsd-src % git pull releng/14.0
(snip)
gitrepo@silver:~/freebsd-zfs-fix-localtracking/freebsd-src % git switch releng/14.0-topic-openzfs-arc_prune-regulation-fix-local
Updating files: 100% (13175/13175), done.
Switched to branch 'releng/14.0-topic-openzfs-arc_prune-regulation-fix-local'
Your branch is up to date with 'a130ft/releng/14.0-topic-openzfs-arc_prune-regulation-fix'.
gitrepo@silver:~/freebsd-zfs-fix-localtracking/freebsd-src % git log
(Count and check the fix commits.)
gitrepo@silver:~/freebsd-zfs-fix-localtracking/freebsd-src % git rebase -i --onto releng/14.0 releng/14.0-topic-openzfs-arc_prune-regulation-fix-local~3 releng/14.0-topic-openzfs-arc_prune-regulation-fix-local
(Check the rebased commits.)
(Resolve the conflicts as required.)
gitrepo@silver:~/freebsd-zfs-fix-localtracking/freebsd-src %
Comment 80 mike 2024-03-07 16:02:23 UTC
(In reply to Seigo Tanimura from comment #79)
Thank you for the detailed steps!
Comment 81 Anton Saietskii 2024-03-07 18:54:24 UTC
(In reply to Anton Saietskii from comment #73)

After applying the patches onto releng/13.3 (without any sysctl tuning), it works like a charm so far! I can't see either arc_prune or vnlru in top now, only a bit of pagedaemon (which occasionally may eat up to 100%, but not for long and without stalling the system); snip follows:

last pid: 77896;  load averages:  2.34,  4.42,  4.76; battery: 100%                                                                                       up 0+02:36:12  19:29:16
1089 threads:  10 running, 1049 sleeping, 30 waiting
CPU 0:  4.7% user,  0.0% nice, 14.1% system,  0.0% interrupt, 81.2% idle
CPU 1:  6.3% user,  0.0% nice, 20.9% system,  0.0% interrupt, 72.8% idle
CPU 2: 19.9% user,  0.0% nice, 17.3% system,  0.0% interrupt, 62.8% idle
CPU 3: 12.8% user,  0.0% nice,  9.7% system,  0.0% interrupt, 77.4% idle
CPU 4: 23.6% user,  0.0% nice, 14.7% system,  0.0% interrupt, 61.8% idle
CPU 5: 18.8% user,  0.0% nice, 11.5% system,  0.0% interrupt, 69.6% idle
CPU 6: 13.1% user,  0.0% nice, 12.6% system,  1.0% interrupt, 73.3% idle
CPU 7: 10.6% user,  0.0% nice, 18.0% system,  0.5% interrupt, 70.9% idle
Mem: 101M Active, 998M Inact, 1600M Laundry, 58G Wired, 524K Buf, 1651M Free
ARC: 55G Total, 29G MFU, 25G MRU, 83M Anon, 225M Header, 617M Other
     52G Compressed, 53G Uncompressed, 1.02:1 Ratio
Swap: 8192M Total, 8192M Free

  PID   JID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
77368     5 pbuild      100    0   106M    73M CPU6     6   0:25  99.87% /usr/bin/tar -xf /portdistfiles//firefox-123.0.1.source.tar.xz --no-same-owner --no-same-permissions (bs
  472     0 root         52  -20   375M   205M vmidle   2  87:38  59.19% bhyve: wifibox (bhyve){vcpu 0}
 1677     0 transmissi   52    0   488M   286M kqread   1  67:36  51.56% /usr/local/bin/transmission-daemon -g /usr/local/etc/transmission/ -e /var/log/transmission.log -x /var/
  472     0 root         20  -20   375M   205M uwait    5   7:44   5.01% bhyve: wifibox (bhyve){e82545-5:0 tx}
    7     0 root        -16    -     0B    48K psleep   6   2:16   4.77% [pagedaemon{dom0}]
  472     0 root         20  -20   375M   205M kqread   4   6:48   4.15% bhyve: wifibox (bhyve){mevent}
   12     0 root        -88    -     0B   480K WAIT     6   1:40   0.68% [intr{irq128: ahci0}]
    0     0 root        -16    -     0B    11M -        4   0:02   0.54% [kernel{z_rd_int_1_2}]
    0     0 root        -16    -     0B    11M -        7   0:15   0.47% [kernel{z_rd_int_0_1}]
   69     0 root         20    -     0B    16K geli:w   7   0:32   0.40% [g_eli[7] diskid/DIS]

And a shorter snip during actual Fx build:
Mem: 5844M Active, 1941M Inact, 2457M Laundry, 50G Wired, 524K Buf, 1666M Free
ARC: 47G Total, 27G MFU, 19G MRU, 67M Anon, 192M Header, 581M Other
     44G Compressed, 45G Uncompressed, 1.02:1 Ratio
Swap: 8192M Total, 62M Used, 8130M Free

  PID   JID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
10614     5 pbuild      103    0  3632M  2803M CPU3     3   0:28  99.94% /usr/local/bin/rustc --crate-name style --edition=2018 servo/components/style/lib.rs --error-format=json
  472     0 root         52  -20   375M   205M vmidle   5  98:39  54.51% bhyve: wifibox (bhyve){vcpu 0}
 1677     0 transmissi   48    0   488M   283M kqread   2  74:52  30.67% /usr/local/bin/transmission-daemon -g /usr/local/etc/transmission/ -e /var/log/transmission.log -x /var/
  472     0 root         20  -20   375M   205M RUN      0   8:42   5.37% bhyve: wifibox (bhyve){e82545-5:0 tx}
  472     0 root         20  -20   375M   205M kqread   6   7:35   4.02% bhyve: wifibox (bhyve){mevent}

A bit of swap is being used, but again: works like a charm, no stalls at all.

Seigo, thanks a lot for your effort, I greatly appreciate it!
Comment 82 Seigo Tanimura 2024-03-08 04:17:44 UTC
(In reply to Seigo Tanimura from comment #79)

> B) Maintain the fix branch.
> 3. Rebase the fix branch onto the official FreeBSD branch you want to track.
> gitrepo@silver:~/freebsd-zfs-fix-localtracking/freebsd-src % git pull releng/14.0

This should be "git pull origin releng/14.0".

As this repository has multiple remote upstreams, it is better to specify both
the remote and the branch explicitly.
Comment 83 Seigo Tanimura 2024-03-08 17:56:25 UTC
(In reply to Seigo Tanimura from comment #70)

Again, thanks to everyone who has joined testing and reviewing my fix!

* Fix Status

The nullfs fix has been committed to main by 7 Mar 2024 (c849eb8f19).

Please set sysctl OID vfs.nullfs.cache_nodes to false (0) to enable the fix.
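
For example, on a running system (a minimal sketch, assuming the knob is writable at run time; putting the same line into /etc/sysctl.conf keeps the setting across reboots in the usual way):

# sysctl vfs.nullfs.cache_nodes=0
# echo 'vfs.nullfs.cache_nodes=0' >> /etc/sysctl.conf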


* GitHub Sources

All branches have been rebased by 08 Mar 2024, mainly to catch up to 13.3-RELEASE.

The nullfs fix has been removed from the main fix branch.

All of the sources are under https://github.com/altimeter-130ft/freebsd-freebsd-src.

- Branches and Git Commit Hashes

            |                                                         | Git Commit Hash
            | Fix Branch                                              |
Base Branch | Fix + Counter Branch                                    | Base
============+=========================================================+=================
main        | topic-openzfs-arc_prune-regulation-fix                  | 32c7350beb
            | topic-openzfs-arc_prune-regulation-counters             | 
------------+---------------------------------------------------------+-----------------
stable/14   | stable/14-topic-openzfs-arc_prune-regulation-fix        | 275aee513b
            | stable/14-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/14.0 | releng/14.0-topic-openzfs-arc_prune-regulation-fix      | adfda3c395
            | releng/14.0-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
stable/13   | stable/13-topic-openzfs-arc_prune-regulation-fix        | 8b84d2da9a
            | stable/13-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/13.3 | releng/13.3-topic-openzfs-arc_prune-regulation-fix      | 80d2b634dd
            | releng/13.3-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
releng/13.2 | releng/13.2-topic-openzfs-arc_prune-regulation-fix      | a839681443
            | releng/13.2-topic-openzfs-arc_prune-regulation-counters | 

            | Git Commit Hash
            |                          | Per-filesystem |            |            
            | FreeBSD-EN-23:18.openzfs | Vnode Counters | ZFS & VFS  | Nullfs
Base Branch | Backport                 | (VFS part)     | Fix        | Fix
============+==========================+================+============+==============
main        | N/A                      | b4808d64d6     | 8c966c0eab | (c849eb8f19)
------------+--------------------------+----------------+------------+--------------
stable/14   | N/A                      | af86ddef88     | cf8738a6f3 | 1f89f33465
------------+--------------------------+----------------+------------+--------------
releng/14.0 | N/A                      | 4a8bea1f75     | c9d9803b73 | cd5243b6df
------------+--------------------------+----------------+------------+--------------
stable/13   | 7f19707bb4               | 968ce986e0     | 3787ca314b | 762a970264
------------+--------------------------+----------------+------------+--------------
releng/13.3 | a33aedb344               | ff1452a099     | 092fcd8c9f | f4e31b6a5e
------------+--------------------------+----------------+------------+--------------
releng/13.2 | bb006c3aef               | e9624c35e7     | f25524526d | 5ae2d6d83b

            | Git Commit Hash
            | Counters
Base Branch | (Not for merging)
============+===================
main        | c8b0a042c4
------------+-------------------
stable/14   | cfc4bfdf5f
------------+-------------------
releng/14.0 | b481bcb399
------------+-------------------
stable/13   | f312688408
------------+-------------------
releng/13.3 | d4fa366558
------------+-------------------
releng/13.2 | ef8a156c28
Comment 84 karl 2024-03-08 18:39:46 UTC
(In reply to Seigo Tanimura from comment #83)

Just to be clear, this is NOT yet merged into 13/STABLE itself (or what will be 13.3-RELEASE) and needs to be done as a custom kernel?

I checked the commit logs on a current -STABLE pull as of right now and do not see those commit strings or anything clearly denoting this problem.
Comment 85 Anton Saietskii 2024-03-08 18:42:56 UTC
(In reply to karl from comment #84)

Karl, nice to meet you again — I remember your investigation of a somewhat related issue ~10 years ago. :-)

Yes, patches are on author's GitHub currently.
Comment 86 karl 2024-03-08 18:44:39 UTC
(In reply to Anton Saietskii from comment #85)

Uh, yep..... which was never merged and ultimately became OBE :-)

This one hasn't bit me..... yet.
Comment 87 Seigo Tanimura 2024-03-14 12:54:11 UTC
(In reply to Seigo Tanimura from comment #83)

* Fix Status

The nullfs fix has been MFCed to stable/14 and stable/13 by 14 Mar 2024.


* GitHub Sources

The main, stable/14 and stable/13 fix branches have been rebased by 14 Mar 2024, mainly to catch up with the MFC above.

The releng/* fix branches are not changed.

The nullfs fix has been removed from the stable/14 and stable/13 fix branches.

All of the sources are under https://github.com/altimeter-130ft/freebsd-freebsd-src.

- Branches and Git Commit Hashes

            |                                                         | Git Commit Hash
            | Fix Branch                                              |
Base Branch | Fix + Counter Branch                                    | Base
============+=========================================================+=================
main        | topic-openzfs-arc_prune-regulation-fix                  | 63a7c4be4a
            | topic-openzfs-arc_prune-regulation-counters             | 
------------+---------------------------------------------------------+-----------------
stable/14   | stable/14-topic-openzfs-arc_prune-regulation-fix        | 47ee352ffd
            | stable/14-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/14.0 | releng/14.0-topic-openzfs-arc_prune-regulation-fix      | adfda3c395
            | releng/14.0-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
stable/13   | stable/13-topic-openzfs-arc_prune-regulation-fix        | 6bf21b4c0c
            | stable/13-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/13.3 | releng/13.3-topic-openzfs-arc_prune-regulation-fix      | 80d2b634dd
            | releng/13.3-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
releng/13.2 | releng/13.2-topic-openzfs-arc_prune-regulation-fix      | a839681443
            | releng/13.2-topic-openzfs-arc_prune-regulation-counters | 

            | Git Commit Hash
            |                          | Per-filesystem |            |            
            | FreeBSD-EN-23:18.openzfs | Vnode Counters | ZFS & VFS  | Nullfs
Base Branch | Backport                 | (VFS part)     | Fix        | Fix
============+==========================+================+============+==============
main        | N/A                      | c067467182     | be5edb9e5c | (c849eb8f19)
------------+--------------------------+----------------+------------+--------------
stable/14   | N/A                      | 6ca1378457     | bf500b63f9 | (3ba93b50d6)
------------+--------------------------+----------------+------------+--------------
releng/14.0 | N/A                      | 4a8bea1f75     | c9d9803b73 | cd5243b6df
------------+--------------------------+----------------+------------+--------------
stable/13   | 77dbf7e5d7               | 01d8f33f39     | b7a1e499b4 | (d119f5a194)
------------+--------------------------+----------------+------------+--------------
releng/13.3 | a33aedb344               | ff1452a099     | 092fcd8c9f | f4e31b6a5e
------------+--------------------------+----------------+------------+--------------
releng/13.2 | bb006c3aef               | e9624c35e7     | f25524526d | 5ae2d6d83b

            | Git Commit Hash
            | Counters
Base Branch | (Not for merging)
============+===================
main        | 0e6e14f0c5
------------+-------------------
stable/14   | b484b20d21
------------+-------------------
releng/14.0 | b481bcb399
------------+-------------------
stable/13   | e40ae582d0
------------+-------------------
releng/13.3 | d4fa366558
------------+-------------------
releng/13.2 | ef8a156c28
Comment 88 Thomas Mueller 2024-03-15 06:39:54 UTC
(In reply to Seigo Tanimura from comment #87)

> The nullfs fix has been MFCed to stable/14 and stable/13 by 14 Mar 2024.

Looks like stable/13 has conflicts committed in the mount_nullfs(8) manual page.
Comment 89 Felix Palmen freebsd_committer freebsd_triage 2024-03-15 07:59:06 UTC
(In reply to Seigo Tanimura from comment #87)

On my home server (amd64, 4x 4TB spinning HDDs with GELI partitions and a raid-z1 pool on top, arc_max 32GiB, neither l2arc nor zil), I had issues with stalls (1 to several seconds) under heavy I/O load (with poudriere and nullfs involved) ever since upgrading to FreeBSD 13.x.

With the 13.3 kernel, the issue escalated from "pretty annoying" to "completely unusable".

Now, after applying all your patches, it is a *lot* better. I could still provoke stalls by running a large poudriere build with heavy ccache usage while at the same time using a Windows Server VM with a zvol storage backend on the same pool ... but I guess there are limits somewhere ;) I will continue to observe the overall behavior.

so:
1. Thanks a lot for this awesome work!
2. Please get it merged to all affected branches :)
Comment 90 Seigo Tanimura 2024-03-15 10:01:15 UTC
(In reply to Thomas Mueller from comment #88)

The conflict comes from ff4196ad0b in stable/13.

I will report the issue to kib.
Comment 91 geoffroy desvernay 2024-03-15 10:30:33 UTC
On 13.3-RELEASE, since upgrading from 13.2-RELEASE-p10, just copying /usr/src from ZFS to NFS gave me the same (in top +H +S):
    0 root         -8    -     0B  6096K CPU8     8   3:32  99.74% kernel{arc_prune}
(and poudriere all but locking up the whole machine during a poudriere bulk - especially while compiling rust)

Just updated kernel to releng/13.3 + the two patches of #87 on stable/13.3.

The machine is now compiling everything *and* copying /usr/src many times in a loop, thank you Seigo !!!

I'll wait for errata on 13.3 (at least) to upgrade our servers here…
Comment 92 Seigo Tanimura 2024-03-15 18:23:56 UTC
(In reply to Seigo Tanimura from comment #90)

* GitHub Sources

The stable/13 fix branch has been rebased to e1341e5318 for the mount_nullfs(8) man page conflict fix.

The main, stable/14 and releng/* fix branches are not changed.

All of the sources are under https://github.com/altimeter-130ft/freebsd-freebsd-src.

- Branches and Git Commit Hashes

            |                                                         | Git Commit Hash
            | Fix Branch                                              |
Base Branch | Fix + Counter Branch                                    | Base
============+=========================================================+=================
main        | topic-openzfs-arc_prune-regulation-fix                  | 63a7c4be4a
            | topic-openzfs-arc_prune-regulation-counters             | 
------------+---------------------------------------------------------+-----------------
stable/14   | stable/14-topic-openzfs-arc_prune-regulation-fix        | 47ee352ffd
            | stable/14-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/14.0 | releng/14.0-topic-openzfs-arc_prune-regulation-fix      | adfda3c395
            | releng/14.0-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
stable/13   | stable/13-topic-openzfs-arc_prune-regulation-fix        | e1341e5318
            | stable/13-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/13.3 | releng/13.3-topic-openzfs-arc_prune-regulation-fix      | 80d2b634dd
            | releng/13.3-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
releng/13.2 | releng/13.2-topic-openzfs-arc_prune-regulation-fix      | a839681443
            | releng/13.2-topic-openzfs-arc_prune-regulation-counters | 

            | Git Commit Hash
            |                          | Per-filesystem |            |            
            | FreeBSD-EN-23:18.openzfs | Vnode Counters | ZFS & VFS  | Nullfs
Base Branch | Backport                 | (VFS part)     | Fix        | Fix
============+==========================+================+============+==============
main        | N/A                      | c067467182     | be5edb9e5c | (c849eb8f19)
------------+--------------------------+----------------+------------+--------------
stable/14   | N/A                      | 6ca1378457     | bf500b63f9 | (3ba93b50d6)
------------+--------------------------+----------------+------------+--------------
releng/14.0 | N/A                      | 4a8bea1f75     | c9d9803b73 | cd5243b6df
------------+--------------------------+----------------+------------+--------------
stable/13   | 53fe620a90               | eaedc2a0f4     | c87f0609d9 | (d119f5a194)
------------+--------------------------+----------------+------------+--------------
releng/13.3 | a33aedb344               | ff1452a099     | 092fcd8c9f | f4e31b6a5e
------------+--------------------------+----------------+------------+--------------
releng/13.2 | bb006c3aef               | e9624c35e7     | f25524526d | 5ae2d6d83b

            | Git Commit Hash
            | Counters
Base Branch | (Not for merging)
============+===================
main        | 0e6e14f0c5
------------+-------------------
stable/14   | b484b20d21
------------+-------------------
releng/14.0 | b481bcb399
------------+-------------------
stable/13   | 541aff1ae1
------------+-------------------
releng/13.3 | d4fa366558
------------+-------------------
releng/13.2 | ef8a156c28
Comment 93 Graham Perrin freebsd_committer freebsd_triage 2024-03-16 00:30:22 UTC
> 14.0-RELEASE

See also: 277717 for 13.3-RELEASE.
Comment 94 Felix Palmen freebsd_committer freebsd_triage 2024-03-22 15:47:02 UTC
(In reply to Felix Palmen from comment #89)
> will continue to observe the overall behavior.

After a pretty busy (for my ZFS pool) week with the patches applied, I'd say the behavior is very good now. Provoking stalls requires crazy things like multiple poudriere builds in I/O-heavy phases (like *-depends) at the same time while also doing other I/O (e.g. in bhyve VMs). A "normal" bulk build with full parallel jobs *and* MAKE_JOBS allowed is barely noticeable when scheduled on idprio.
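
For reference, "scheduled on idprio" can look like the following sketch (the jail and ports tree names are placeholders, not taken from this report):

# idprio 31 poudriere bulk -j 13amd64 -p default editors/libreoffice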

As I said, it was always sub-optimal on 13.x, but became utterly unusable with 13.3-RELEASE. What I forgot to mention is: on 13.3, there were even EBADFs suddenly popping up for build jobs, which could very well be some "hidden" bug that's only triggered by excessively bad performance...

All in all, I personally think this issue desperately needs fixing; depending on your hardware(?) and workload pattern, you can't really use 13.3 without patches right now.

Is anyone actively reviewing the code?
Comment 95 mark burdett 2024-03-22 18:36:48 UTC
After upgrading to 13.3, what I saw, on hardware that occasionally has spikes of memory usage and uses ZFS swap (I know, it's not really recommended to use swap at all, much less ZFS swap :) was that a process would be killed due to slow memory allocation: (mysqld), jid 0, uid 88, was killed: a thread waited too long to allocate a page

At this point, mysqld was automatically restarted with much lower memory usage, and thus there was no need to swap or prune the ARC, and the problem was temporarily solved.  This happened before I had a chance to look at specific parameters, like CPU usage, but I am assuming it might be related to this issue.  For more peace of mind, I made various changes to avoid unexpected high memory usage on this system, and things have been stable.
Comment 96 Peter Much 2024-03-23 16:40:20 UTC
(In reply to Felix Palmen from comment #94)
I dunno, you tell me. From what I can see, we have procedures in place
for security incidents, and procedures for adding new features (if
people are interested in the feature, they will review it), but
apparently we have no procedures for fixing bugs and regressions.
Comment 97 Seigo Tanimura 2024-03-28 04:22:30 UTC
Again, my apologies for the delayed comment.

Now that the nullfs fix (https://reviews.freebsd.org/D44217) has been merged into stable/13 and stable/14, the next diff is https://reviews.freebsd.org/D44170, the backport of the FreeBSD-EN-23:18.openzfs fix to stable/13.  This is essentially the functional reimplementation of 799e09f75a on main (https://github.com/openzfs/zfs/commit/799e09f75a31e80a1702a850838c79879af8b917) and 3ec4ea68d4 on zfs-2.2-release (https://github.com/openzfs/zfs/commit/3ec4ea68d491a82c8de3360d50032bdecd53608f) of OpenZFS, focusing on avoiding the pileup on the arc_prune kernel thread.

Among the FreeBSD committers, I have found the names of mav and markj in the logs of the commits above.  I believe they are suitable for the diff review.

The rest of the diffs:

- https://reviews.freebsd.org/D44171 (kern/vfs: Add the per-filesystem vnode counter to struct mount.)
- https://reviews.freebsd.org/D44173 (kern/openzfs: Regulate the ZFS ARC pruning process precisely.)

are challenging because they address the interaction problem between OpenZFS and the OS (FreeBSD) kernel.  In my view, reviewers with insight into both OpenZFS and FreeBSD are desired.

If the review of these diffs is too difficult, an alternative is to add a sysctl(3) toggle that controls the fix feature in D44173 so that the fix can be merged without enabling it by default.  Thanks to the many testers on this issue, I now believe the fix is ready for a more extensive public test.

-----

Besides the review, my analysis has spotted quite a few findings regarding the healthy operation of the OpenZFS ARC and its pruning and eviction.  It would be great to document them somehow.  They should also be kept in mind when reviewing D44171 and D44173.

* OpenZFS ARC buffers and their evictability

- An ARC buffer is separate for reading and writing.
  - A read ARC buffer must be copied into a write ARC buffer in order to "update" it in the copy-on-write manner.
- A read ARC buffer is not evictable until its content has been read from the pool.
- A write ARC buffer is not evictable until its content has been written into the pool.
  - A write ARC buffer depending on the write of another write ARC buffer may remain unevictable for a long time.
- Under healthy operation, almost all ARC read and write buffers for data are evictable.
  - Some of the ARC read and write buffers for metadata are not evictable because of the internal dependencies required by the OpenZFS design.
- The write ARC buffers of the vnodes in use (v_usecount > 0) have been found to remain unevictable until they are no longer in use.
  - This is the direct cause of the excessive ARC pruning during poudriere-bulk(8); the nullfs filesystems cached the OpenZFS vnodes by taking v_usecount references.
  - A similar issue may occur from a different cause, e.g. too many open OpenZFS files.

* Limitations of OpenZFS ARC pruning and eviction on FreeBSD

- ARC pruning cannot account for the OpenZFS znodes (i.e. FreeBSD vnodes) that are unprunable because of requirements on the OS side.
  - Vnodes with a non-zero v_usecount or v_holdcnt (or both) fall into this case.
  - Attempting to recycle such vnodes causes the global vnode list lock to be held for a long time.
- The pagedaemon kernel threads may block excessively while waiting for ARC eviction progress.
  - OpenZFS lets kernel threads wait for a desired amount of ARC eviction progress.
    - The waiting kernel threads are resumed either when the desired ARC eviction progress happens or when there are no evictable ARC buffers at all.
  - Under a heavy load upon OpenZFS, it often manages to evict only a much smaller, but non-zero, amount of ARC buffers than desired.
    - The waiting kernel threads can neither meet the desired ARC eviction progress nor give up quickly.
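
As a rough way to observe the conditions above on a live system, the stock counters can be sampled with sysctl(8) (a minimal sketch; the exact set of kstat.zfs.misc.arcstats fields may vary between OpenZFS versions):

# sysctl vfs.numvnodes vfs.freevnodes kern.maxvnodes
# sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.mru_size kstat.zfs.misc.arcstats.mfu_size
# sysctl kstat.zfs.misc.arcstats.evict_skip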
Comment 98 Olivier Certner freebsd_committer freebsd_triage 2024-03-28 09:45:23 UTC
Hello Seigo,

Thanks a lot for your work.  Going to handle this.  First, let me review all that has been put in Phabricator and your findings in this long bug.

Best.
Comment 99 Seigo Tanimura 2024-04-02 06:56:46 UTC
(In reply to Olivier Certner from comment #98)

Thanks a lot for reviewing.

It would be great if you started with D44170, which should relieve the unwanted load on stable/13 with a minimal change.
Comment 100 Seigo Tanimura 2024-04-02 09:11:53 UTC
(In reply to Seigo Tanimura from comment #99)

FYI, I am now revising the source to follow style(9), mainly on D44173.  This does not include any functional changes.

I will rebase the diffs once that is done.
Comment 101 Olivier Certner freebsd_committer freebsd_triage 2024-04-02 12:40:27 UTC
Hi Seigo,

Sorry, I had other things to look into in the meantime, so I did not make much progress up to now.  That said, from now on I should be able to work on it nearly full-time.

(In reply to Seigo Tanimura from comment #99)

That is my first target indeed.  Seems fairly simple.

(In reply to Seigo Tanimura from comment #100)

Great.

First of all, let me say I've never personally experienced the bug reported here.  I'm using poudriere for big builds from time to time, and with ZFS, but on machines with plenty of RAM, so that may explain why.  Bottom-line is that I don't have any reproducer on my own at the moment.  Will try to spin up some VM with a limited amount of RAM, limit the number of vnodes in the system and the ARC, and see how that goes.

Second, it seems that this problem materializes often enough and with enough severity that it is quite urgent to find/commit fixes for it.  It might be that some of the changes proposed here cannot be imported as is because of subtle impacts or design issues (I'm just saying this generally, this may not apply in the end).  If that happens to be the case, we'll work on proposing alternatives that are easier to review/get accepted, and defer the other changes.  Longer term, the vnode recycling mechanism will probably be completely rethought, but hopefully the whole of it shouldn't be a requirement to solve the case at hand.

So, please bear with me a little until I get up to speed with all your patches and results, and the reports of testers in this bug and some connected ones.
Comment 102 Anton Saietskii 2024-04-02 14:27:49 UTC
(In reply to Olivier Certner from comment #101)

> First of all, let me say I've never personally experienced the bug
> reported here. I'm using poudriere for big builds from time to time,
> and with ZFS, but on machines with plenty of RAM, so that may
> explain why. 

Olivier, I'm unsure if it can be considered plenty, but I do experience this on an i7-7820HQ with 64G of RAM.
Poudriere is configured with USE_TMPFS="data localbase wrkdir". I can reproduce this with the following few steps (a command sketch follows the list):
1. Wait until ARC fills RAM so I have like ~2 gigs left.
2. Start 'poudriere bulk' on anything with a big distfile (e.g. firefox, rust, libreoffice). The job count doesn't matter, even 1 "works".
3. Almost as soon as that big distfile starts unpacking to tmpfs -- voila, the issue pops up.
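
A command-level sketch of step 2 (the jail and ports tree names are placeholders for illustration only; -J 1 mirrors the "even 1 works" remark):

# poudriere bulk -J 1 -j 13amd64 -p default www/firefox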
Comment 103 karl 2024-04-02 16:40:42 UTC
Interesting....

I'm on FreeBSD 13.3-STABLE stable/13-n257639-51b2556ee60a KSD-SMP, just rebuild world/kernel/nanoBSD for some of my pcEngines machines.

The builds are complete but I have a load average over 12 (!!) and the only thing that is sucking up all manner of CPU is:

  17  -  DL     7:51.59 [vnlru]

which definitely is (the machine was just rebooted, so that's rather significant all things considered.)

Is this related?  Not sure, but it looks like it might be.... I don't generally run poudriere builds on this box, but I do build world and kernel on it pretty regularly, plus cross-builds using Crochet for Pis.
Comment 104 Mark Millard 2024-04-02 17:08:46 UTC
(In reply to Anton Saietskii from comment #102)

I'll note that rust builds are massive users of temporary disk space in wrkdir,
even for single-builder/single-make-job builds: I've seen nearly 30 GiByte and
have seen well over 20 GiBytes for a long time. When USE_TMPFS= effectively
includes wrkdir, this competes for RAM+SWAP via the TMPFS use. The highest
usage is towards the end (packaging).

Another issue is that historically, when the builder completes, such TMPFS space
use sticks around until/unless the specific builder starts another build.

I have historically provided lots of SWAP so that RAM need not cover all of that and
RAM+SWAP has lots of room.

But there is also the poudriere.conf technique of:

# List of package globs that are not allowed to use tmpfs for their WRKDIR
# Note that you *must* set TMPFS_BLACKLIST_TMPDIR
# EXAMPLE: TMPFS_BLACKLIST="rust"
TMPFS_BLACKLIST="rust"

# The host path where tmpfs-blacklisted packages can be built in.
# A temporary directory will be generated here and be null-mounted as the
# WRKDIR for any packages listed in TMPFS_BLACKLIST.
# EXAMPLE: TMPFS_BLACKLIST_TMPDIR=${BASEFS}/data/cache/tmp
TMPFS_BLACKLIST_TMPDIR=${BASEFS}/data/cache/tmp

that avoids the TMPFS based wrkdir space use just for rust. (Of course,
one needs the actual storage space in such cases.)

With rust avoiding TMPFS wrkdir use, I've observed llvm18 taking more
RAM+SWAP than rust, even with some llvm18 default options disabled.

(I do not have a list of packages with such large wrkdir space requirements
but would not be surprised if there are several more around, some possibly
not being compiler toolchains. But I do not normally build a wide variety
of large-builder software that are not compiler toolchain related.)
Comment 105 karl 2024-04-02 23:52:56 UTC
(In reply to karl from comment #103)

Update: Yes, this is the same thing -- the kernel's arc thread and vnlru are what are whacking the CPU, and it continues for a long time after the build completes.  In fact, just building a memstick image after buildworld/buildkernel is done and the system has settled down causes it to occur again.
Comment 106 Olivier Certner freebsd_committer freebsd_triage 2024-04-03 21:08:28 UTC
Still could not spend much time on it today; tomorrow should be OK.  I'll start with D44170.

(In reply to Anton Saietskii from comment #102)

Yes, 64GB is already plenty.  Actually, this is what I have on the machines running poudriere.  The main difference to your configuration is that I'm using `USE_TMPFS=no`.  Thanks for detailing your reproducer, I'll try something similar and report.

(In reply to Mark Millard from comment #104)

Seems to be in line with Anton's report.  I may try with some LLVM, which should be easier (fewer dependencies).

(In reply to karl from comment #105)

You're mentioning the ARC thread.  Are both your source and object directories on ZFS?  How much RAM?  Are you able to trigger the problem after a fresh boot and immediately running buildworld and buildkernel?
Comment 107 karl 2024-04-03 21:49:34 UTC
(In reply to Olivier Certner from comment #106)

Source and object on ZFS, drives are SATA SSDs (enterprise models; Micron and Kingston) on ada (board channels, not on a host adapter), 32GB memory and a 6-core Xeon.  2-drive mirror, both geli-encrypted -- that pool has the root, source and object directories on it.

[karl@NewFS /usr/src/release]$ zpool status zsr
  pool: zsr
 state: ONLINE
  scan: scrub repaired 0B in 00:30:24 with 0 errors on Fri Mar 29 03:56:04 2024
config:

        NAME            STATE     READ WRITE CKSUM
        zsr             ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            ada0p4.eli  ONLINE       0     0     0
            ada1p4.eli  ONLINE       0     0     0

errors: No known data errors


Copyright (c) 1992-2021 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 13.3-STABLE stable/13-n257639-51b2556ee60a KSD-SMP amd64
FreeBSD clang version 17.0.6 (https://github.com/llvm/llvm-project.git llvmorg-17.0.6-0-g6009708b4367)
VT(efifb): resolution 1024x768
CPU: Intel(R) Xeon(R) E-2146G CPU @ 3.50GHz (3500.00-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x906ea  Family=0x6  Model=0x9e  Stepping=10
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffafbff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0x29c6fbf<FSGSBASE,TSCADJ,SGX,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,NFPUSG,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PROCTRACE>
  Structured Extended Features2=0x40000000<SGXLC>
  Structured Extended Features3=0x9c002400<MD_CLEAR,TSXFA,IBPB,STIBP,L1DFL,SSBD>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 34359738368 (32768 MB)
avail memory = 33186074624 (31648 MB)

It is repeatable on a build from a clean boot.

It also misbehaves on a clean boot if, after buildworld/buildkernel, I build a memstick image (cd /usr/src/release; make memstick) after updating to the most current -STABLE.  I usually do this since I want the recovery USB stick to be current once I know the system runs OK, and I got the same behavior right after "make installkernel / reboot single-user / make installworld / reboot multiuser."

In both cases the system sits with the kernel arc thread and vnlru both consuming crazy amounts of CPU; load average is upward of 6 for 10+ minutes after the build completes but it does eventually clear.  The machine is not locked up from a user perspective but it is getting hammered.
Comment 108 Peter Much 2024-04-04 00:44:13 UTC
My main workhorse (all purpose server, 48 GB, 5000+ threads) is also not affected at all. But everything else here is. 

There are also cases where nothing really bad happens, but something is wrong nevertheless (here a small server with 4 GB RAM for my offsite backup, which does not do much otherwise):

operator@pole:~ $ ps axH | egrep "prune|dummynet"
    0  -  DLs     0:01.52 [kernel/arc_prune]
    0  -  DLs    23:29.91 [kernel/dummynet]

This is after I installed the patches. Before that, the CPU time on arc_prune was about the same as on dummynet (dummynet is the firewall, which should be mostly unchanging) - so while there was no malfunction yet, there were still lots of wasted CPU cycles.
Comment 109 mike 2024-04-04 12:16:00 UTC
(In reply to Olivier Certner from comment #106)
FYI, I am able to trigger this on a small 8G ZFS box acting as an SMB server. On a Windows machine I generate a 200G vhdx file onto the FreeBSD SMB server. That part is OK. Then on the FreeBSD box (RELENG_13.2) I run qemu-img to convert the vhdx file to qcow2.  After that is done, I see the arc_prune issue.
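
For reference, that conversion step is along the lines of the following sketch (the file names are placeholders):

# qemu-img convert -p -f vhdx -O qcow2 disk.vhdx disk.qcow2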

last pid: 49326;  load averages:  2.80,  2.42,  2.32                                                                                                                                  up 57+19:02:07  08:10:42
253 threads:   7 running, 226 sleeping, 20 waiting
CPU: 10.1% user,  0.0% nice, 59.7% system,  0.2% interrupt, 30.0% idle
Mem: 704K Active, 22M Inact, 12M Laundry, 7444M Wired, 619M Buf, 191M Free
ARC: 5955M Total, 70M MFU, 5577M MRU, 61K Anon, 53M Header, 1020K Other
     5475M Compressed, 19G Uncompressed, 3.54:1 Ratio
Swap: 4096M Total, 155M Used, 3941M Free, 3% Inuse

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
    0 root         -8    -     0B  1456K CPU3     3  69:20  99.91% kernel{arc_prune}
   17 root        -16    -     0B    16K CPU2     2  48:54  64.44% vnlru

CPU: Intel(R) Celeron(R) N5105 @ 2.00GHz (1996.80-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x906c0  Family=0x6  Model=0x9c  Stepping=0
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4ff8ebbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,SDBG,CX16,xTPR,PDCM,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,RDRAND>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x101<LAHF,Prefetch>
  Structured Extended Features=0x2394a2c3<FSGSBASE,TSCADJ,FDPEXC,SMEP,ERMS,NFPUSG,PQE,RDSEED,SMAP,CLFLUSHOPT,CLWB,PROCTRACE,SHA>
  Structured Extended Features2=0x18400124<UMIP,WAITPKG,GFNI,RDPID,MOVDIRI,MOVDIR64B>
  Structured Extended Features3=0xfc000400<MD_CLEAR,IBPB,STIBP,L1DFL,ARCH_CAP,CORE_CAP,SSBD>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  IA32_ARCH_CAPS=0x20c6b<RDCL_NO,IBRS_ALL,SKIP_L1DFL_VME,MDS_NO>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
  TSC: P-state invariant, performance statistics
real memory  = 8589934592 (8192 MB)
avail memory = 8002285568 (7631 MB)
CPU microcode: updated from 0x1d to 0x24000024
Event timer "LAPIC" quality 600
Comment 110 Olivier Certner freebsd_committer freebsd_triage 2024-04-05 22:03:47 UTC
A small report on what I've done so far for this bug:
- Reviewed D44170, comparing it with EN-23:18.openzfs and indeed validating that it does the same thing in spirit (see the review).
- Been reproducing some of the problems reported here with a 4GB VM with 3 cores running a very recent 13-STABLE, building devel/llvm18 with poudriere, but I'm also stumbling on new ones.  I'm observing quite a high CPU time for the "arc_prune" thread and the "vnlru" proc, provided that tmpfs is not deactivated in poudriere (i.e., keeping the default configuration), when building dependencies of devel/llvm18.  I'm not even able to build the latter itself since its build-dependencies step fails with spurious EBADF errors on chown of files 'pkg' is installing (started looking at the tmpfs code, but didn't find anything obvious).  With tmpfs disabled, devel/llvm18 builds (well, I have not seen it finish yet but it seems it will go to the end) and I see almost no increase in CPU time for the two above-mentioned thread and process.
Comment 111 Felix Palmen freebsd_committer freebsd_triage 2024-04-06 05:53:44 UTC
(In reply to Olivier Certner from comment #110)
FYI, EBADF errors have been observed by several people including myself (for me only since 13.3-RELEASE, but on a more powerful machine with 64GiB that's just pretty busy overall), see also
https://forums.freebsd.org/threads/rsync-bad-file-descriptor.92733/

I follow the reasoning given by cracauer and pmc there that this could likely be a different bug just hidden in normal (well-performing) operation.

Thanks a lot for looking into this!
Comment 112 Seigo Tanimura 2024-04-09 10:52:35 UTC
(In reply to Seigo Tanimura from comment #92)

The fix has been updated to meet style(9) and rebased.

* Changes

- Follow style(9).
- No functional changes.

* GitHub Sources

All of the sources are under https://github.com/altimeter-130ft/freebsd-freebsd-src.

- Branches and Git Commit Hashes

            |                                                         | Git Commit Hash
            | Fix Branch                                              |
Base Branch | Fix + Counter Branch                                    | Base
============+=========================================================+=================
main        | topic-openzfs-arc_prune-regulation-fix                  | f4d93b6761
            | topic-openzfs-arc_prune-regulation-counters             | 
------------+---------------------------------------------------------+-----------------
stable/14   | stable/14-topic-openzfs-arc_prune-regulation-fix        | 7b86d14bfc
            | stable/14-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/14.0 | releng/14.0-topic-openzfs-arc_prune-regulation-fix      | d338712beb
            | releng/14.0-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
stable/13   | stable/13-topic-openzfs-arc_prune-regulation-fix        | 1fd6ef0cb2
            | stable/13-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/13.3 | releng/13.3-topic-openzfs-arc_prune-regulation-fix      | 7a0d63c909
            | releng/13.3-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
releng/13.2 | releng/13.2-topic-openzfs-arc_prune-regulation-fix      | f5ac4e174f
            | releng/13.2-topic-openzfs-arc_prune-regulation-counters | 

            | Git Commit Hash
            |                          | Per-filesystem |            |            
            | FreeBSD-EN-23:18.openzfs | Vnode Counters | ZFS & VFS  | Nullfs
Base Branch | Backport                 | (VFS part)     | Fix        | Fix
============+==========================+================+============+==============
main        | N/A                      | c0a5149e51     | 20a1474304 | (c849eb8f19)
------------+--------------------------+----------------+------------+--------------
stable/14   | N/A                      | 6ad3e7f06f     | 7d3bb0fe51 | (3ba93b50d6)
------------+--------------------------+----------------+------------+--------------
releng/14.0 | N/A                      | 7ce096307a     | debb86e65a | fee257f966
------------+--------------------------+----------------+------------+--------------
stable/13   | 5dd192dd97               | 93f6ba7501     | 8488d19bf9 | (d119f5a194)
------------+--------------------------+----------------+------------+--------------
releng/13.3 | d3c2407ee3               | bc3133070f     | 18964b5d3c | 889ccc06e2
------------+--------------------------+----------------+------------+--------------
releng/13.2 | 8df5bda7f6               | 3afa14e361     | 31e31a183b | ca2d73e380

            | Git Commit Hash
            | Counters
Base Branch | (Not for merging)
============+===================
main        | 8b8d1d3add
------------+-------------------
stable/14   | 07f7af7546
------------+-------------------
releng/14.0 | 9711919fe1
------------+-------------------
stable/13   | 770b614369
------------+-------------------
releng/13.3 | 57c9ec7634
------------+-------------------
releng/13.2 | 4ddbadf025
Comment 113 Olivier Certner freebsd_committer freebsd_triage 2024-04-11 21:44:31 UTC
Some news!

Once again, progress has been significantly slowed down because I've been ill since last Sunday, so I could work only intermittently.  Obviously this was out of my control, but I'm sorry for the people here that have been awaiting resolution for a long time now.

Here's for the report:
- Fix in D44170 works on my reproducer, as expected.
- Asked upstream about their plans. After a few follow-ups, things are moving on, see https://github.com/openzfs/zfs/pull/16083. Basically, the factorization of the ARC pruning common code, with the side effect of limiting calls to FreeBSD's vnlru_free_vfsops(), is going to be merged to OpenZFS 2.1.x, which is regularly merged into stable/13.
- Tested that new change too, which also works.  Consequently, this fix is going to be chosen in preference to D44170 to minimize diffs with upstream and ease future merges.
- Tried to find the origin of EBADF with dtrace (a generic one-liner sketch follows this list).  Due to a number of difficulties with dtrace and elusiveness in reproducing the problem, I don't have a definitive answer here, but only a single exploitable trace that, if not obscured too much by drops, may indicate a problem in tmpfs rather than namei() (name resolution). Obtaining these errors also seems very sensitive to tweaks to OOM parameters.
- Started reviewing D44171 to D44176.  I didn't comment there, but here are first impressions/remarks.  I understand the intent of the proposed changes (and have been having some similar in mind concerning vnode recycling) and they show some interesting ideas.  A priori, I doubt that these are going to be acceptable as they stand for several reasons.  First, some revisions are a mix of loosely connected changes pertaining to OpenZFS proper and others to FreeBSD only, and these are going to need to be separated.  Second, a lot of ZFS-specific changes are mostly about gathering statistics, and I think the relevancy of these will have to be discussed with upstream (additionally, I'm not much versed in ZFS at the moment, but will try to get up to speed on basics quite quickly).  Third, concerning vnode recycling, which I had analyzed in a different context about 2 years ago, besides some portions of code that may be erroneous (requiring more research on my part), one of the main interrogation points is likely to be the proposed additions to vget_finish_ref() and vput_final(), which are functions called a lot.  Is this overhead worth it, or are there alternatives?  I'm probably going to have to think more about the general design, after having caught up with the last round of changes in this area (changes by mjg@ from August to October 2023).
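
A generic DTrace sketch for the EBADF hunt mentioned above (not the exact script used here; 9 is the EBADF errno value, and the aggregation simply counts failing syscalls per process and syscall name):

# dtrace -n 'syscall:::return /errno == 9/ { @[execname, probefunc] = count(); }'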

So the high CPU usage by [kernel/arc_prune] is going to be fixed very soon in stable/13 (it is already in stable/14 and main, and in 14.0 via the EN issued at the end of 2023).  I'm just waiting for some code review upstream, but will commit it anyway if that doesn't happen quickly (a day or two), as I have already done a review myself which I think is enough.

What I would like to know now (or just after the fix above is committed and tested by those using 13.x or stable/13) is whether there are other pressing remaining problems.  There may be answers to this in some of the comments above, but I haven't reviewed all of them in detail yet (next on my to-do list).  So far, I have the impression that the other prominent problem is usage of nullfs and vnode recycling (nullfs vnodes over ZFS vnodes may prevent the latter from being recycled).  My current understanding is that it can be avoided thanks to https://reviews.freebsd.org/D44217 and setting the new sysctl knob to 0 to avoid nullfs "vnode caching".  But I only consider this a workaround, and we are facing more fundamental problems here (vnode stacking, name resolution) that will probably require more pondering as well.

So the short-term follow-up is going to depend a lot on your feedback about remaining blocker-level problems.

Thanks.
Comment 114 karl 2024-04-11 23:51:08 UTC
(In reply to Olivier Certner from comment #113)

Assuming you post an update here when the first commit hits 13-STABLE I can pull, build, reboot and run a buildworld/buildkernel (which reproduces the problem here) quite quickly.  I should be available to do that (and the box should be available to do that) any time during the next week or so.

Of concern is the path to get this into a freebsd-update revision for 13.3, as I have multiple systems out there on -RELEASE using binary updates; as of right now I am rather concerned, although at present I've not had any of them go completely bananas on me.
Comment 115 commit-hook freebsd_committer freebsd_triage 2024-04-12 13:01:43 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=330954bdb822af6bc07d487b1ecd7f8fda9c4def

commit 330954bdb822af6bc07d487b1ecd7f8fda9c4def
Author:     Alexander Motin <mav@FreeBSD.org>
AuthorDate: 2023-10-30 23:56:04 +0000
Commit:     Olivier Certner <olce@FreeBSD.org>
CommitDate: 2024-04-12 13:00:11 +0000

    Unify arc_prune_async() code, fix excessive ARC pruning

    There is no sense to have separate implementations for FreeBSD and Linux.  Make
    Linux code shared as more functional and just register FreeBSD-specific prune
    callback with arc_add_prune_callback() API.

    Aside of code cleanup this fixes excessive pruning on FreeBSD.

    [olce: This code comes from the OpenZFS pull request:
    https://github.com/openzfs/zfs/pull/16083, vendor-merged into our tree.  Its
    commit message has been slightly adapted to the present context.  The upstream
    pull request has been reviewed and merged into 'zfs-2.1.16-staging' as
    5b81b1bf5e6d6aeb8a87175dcb12b529185cac2f, which should come into our tree at the
    next vendor import.  This is the same code that was merged into stable/14 and
    main as part of vendor merges, and released as an EN (FreeBSD-EN-23:18.openzfs)
    over releng/14.0 by markj@.]

    PR:             275594, 274698
    Reported by:    Seigo Tanimura <seigo.tanimura@gmail.com>, markj, and others
    Tested by:      olce
    Approved by:    emaste (mentor)
    Obtained from:  OpenZFS
    Sponsored by:   iXsystems, Inc.
    Sponsored by:   The FreeBSD Foundation
    Signed-off-by:  Alexander Motin <mav@FreeBSD.org>

 sys/contrib/openzfs/include/os/linux/zfs/sys/zpl.h |  2 +-
 sys/contrib/openzfs/include/sys/arc.h              |  2 +-
 sys/contrib/openzfs/include/sys/arc_impl.h         |  1 -
 sys/contrib/openzfs/module/os/freebsd/zfs/arc_os.c | 62 ----------------------
 .../openzfs/module/os/freebsd/zfs/zfs_vfsops.c     | 32 +++++++++++
 sys/contrib/openzfs/module/os/linux/zfs/arc_os.c   | 51 ------------------
 .../openzfs/module/os/linux/zfs/zpl_super.c        |  2 +-
 sys/contrib/openzfs/module/zfs/arc.c               | 52 ++++++++++++++++++
 8 files changed, 87 insertions(+), 117 deletions(-)
Comment 116 Olivier Certner freebsd_committer freebsd_triage 2024-04-12 13:08:14 UTC
So the fix has just been committed to stable/13.  Please test!  With USE_TMPFS=yes (the default), and, if possible, with as many nullfs mounts in the equation as possible (e.g., build-dedicated ports option directories for poudriere, multiple poudriere instances running in parallel, etc.).

A note for Seigo related to comment #113 I wrote: I'm not expecting any special work on your side at this stage.  As said, I need more time to review your work and then we will determine which parts could be imported, in which form and where.
Comment 117 karl 2024-04-12 14:54:25 UTC
(In reply to Olivier Certner from comment #116)

Kernel rebuilt on 13-STABLE, then an immediate buildworld -j8 on a cold boot and the anomalous behavior is gone -- load and kernel activity went back to normal immediately on completion and this is the arc accumulated time since boot:

root@NewFS:/usr/src # ps axH | grep arc
    0  -  DLs   0:00.00 [kernel/arc_prune]
    5  -  DL    0:00.00 [zfskern/arc_evict]
    5  -  DL    0:00.02 [zfskern/arc_reap]
    5  -  DL    0:02.17 [zfskern/l2arc_feed_thread]
  317  0  S+    0:00.00 grep arc

Appears fixed for my use case at first blush, but will run some more tests.

Will this end up as an errata on 13.3-RELEASE or do I need to pull at least kernel source and go "off-piste" for those systems I have in the field on binary updates?
Comment 118 Olivier Certner freebsd_committer freebsd_triage 2024-04-12 15:10:23 UTC
(In reply to karl from comment #117)

We will most likely issue an errata.  The only reason not to do it immediately is that I would like to know if there is some other important issue remaining and in that case if it can be fixed quickly/safely enough to be integrated in the same EN batch.
Comment 119 Olivier Certner freebsd_committer freebsd_triage 2024-04-12 15:13:00 UTC
(In reply to karl from comment #117)

Could you show the elapsed time for the [vnlru] process as well?

More in-depth testing is generally welcome before issuing an EN.
Comment 120 karl 2024-04-12 16:02:42 UTC
(In reply to Olivier Certner from comment #119)
Also normal -- sorry I didn't include it:

[karl@NewFS ~]$ ps ax|grep vnl
   17  -  DL      0:00.04 [vnlru]

Since boot, now:
12:02PM  up  2:10, 1 user, load averages: 0.37, 0.28, 0.26
Comment 121 Peter Much 2024-04-12 17:37:45 UTC
(In reply to Olivier Certner from comment #116)

Thank you for this. Sadly, the only machine where I do excessive nullfs mounts (or nullfs mounts from ZFS at all) is also my only machine that is not affected by the issue, and in fact does not run arc_prune at all. So testing appears quite pointless in this case.
Comment 122 Vladimir Druzenko freebsd_committer freebsd_triage 2024-04-13 01:50:38 UTC
Just got this on 13.3-p1 amd64 and I had to reboot this server. :-(
Comment 123 Tomohiro Hosaka 2024-04-13 09:34:26 UTC
I am having the same problem.
I recently upgraded 13 machines to 13.3-p1.
I have this problem on multiple machines.
I would like to see errata released as 13.3-p2.
Comment 124 Vladimir Druzenko freebsd_committer freebsd_triage 2024-04-13 13:30:32 UTC
(In reply to Tomohiro Hosaka from comment #123)
I got the same on a 2nd server yesterday late in the evening and even planned a reboot for today, but the 100% CPU kernel{arc_prune} disappeared during the night…
Comment 125 Tomohiro Hosaka 2024-04-14 01:46:05 UTC
One of the 13 machines has been freezing every day at 20-24 hours of uptime.
OOM is triggered and kills most of the processes within one second, without the timestamp changing, as shown below.

Apr 14 04:16:50 host1 kernel: kernel: pid 95049 (postgres), jid 0, uid 770, was killed: failed to reclaim memory
Apr 14 04:16:50 host1 kernel: kernel: pid ..... (........), jid 0, uid ..., was killed: failed to reclaim memory
# ack 'Apr 14 04:16:50.*was killed:' /var/log/messages | wc -l 
     138
# ack 'Apr 14 04:16:5[012].*was killed:' /var/log/messages | wc -l
     185

sshd and processes that consume only a small amount of memory are also killed (cron, getty, sh, etc.).
The system freezes without any change, even usb keyboard input is not accepted.
It is rebooted by resetting the power button.
The previous 12.4-p9 uptime was long and perfectly fine.

I tried vm.pageout_oom_seq, vm.pfault_oom_wait, vm.pfault_oom_attempts, vfs.zfs.arc.min, vfs.zfs.arc.max, etc.
No improvement.
It does not seem like the intended oom behavior.
It would be nice if a few memory-consuming processes could be killed and the system would continue to run, but it does not work that way.
There is no way around it.

It looks like we are very close to getting freebsd-update to do the great job above, but it is not going to happen in time.
Will consider 13.2 or 14.0 or build.

Thanks.
Comment 126 Trev 2024-04-14 02:37:14 UTC
(In reply to Tomohiro Hosaka from comment #125)

See: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277717#c8

Same issue BEFORE the recent stable fixes for arc_prune.

I replaced the server running FreeBSD 13.3-STABLE #4 stable/13-b5e7969b2: Fri Mar 29 20:50:35 AEDT 2024 with another (due to UFS file system corruption caused by the freezes and forced power-offs), installed FreeBSD 13.3-RELEASE, and have seen no issues, not even arc_prune.

The only difference between the servers is that the replacement server has no UFS file system; it is all ZFS, but it still backs up to a UFS external USB drive.
Comment 127 karl 2024-04-14 02:52:34 UTC
Is this what, at the core, it looks like it is?

https://github.com/openzfs/zfs/pull/16083/commits/a14dc1bae868fb604ff872aa4a06b2e4ff21d686#diff-ee0b30bb667845dbf75eb4e94a2bf4806f5a0d63182393ded72f17cd9ae88e0fL52-R55

An awful lot of these changes appear to be type changes from signed to unsigned integers..... which leads to the obvious question:

"Was the root of the issue, when you get down to it, signed underflow (or overflow)?"
Comment 128 Olivier Certner freebsd_committer freebsd_triage 2024-04-15 08:07:49 UTC
(In reply to karl from comment #127)

The root of the issue has absolutely nothing to do with signedness problems.  Those signedness changes are simply tiny theoretical correctness ones without any practical impact. The root cause is simply that ARC pruning is triggered too many times, and the committed change just ensures that pruning requests are not piled up when one is incoming while another is being serviced.  There are lots of other potential problems in there (ARC pruning may be too aggressive; new requests may not correctly account for running requests' effects; in FreeBSD, pruning is relatively costly; in FreeBSD, the number of vnodes to prune is directly the number of bytes to free, which doesn't make any sense, and which Seigo's work attempts to fix; etc.).
Comment 129 Olivier Certner freebsd_committer freebsd_triage 2024-04-15 11:22:21 UTC
FYI, preparing an EN (see bug #278375).
Comment 130 karl 2024-04-15 11:30:24 UTC
(In reply to Olivier Certner from comment #128)
Got it.  I did see the other changes and haven't gone through the various interactions to fully understand them, but the signedness change stood out to me immediately, thus the question.
Comment 131 karl 2024-04-15 11:34:41 UTC
BTW I was running 13.2-STABLE for quite a long period of time and did *not* see this behavior.  It may have been triggerable, but I personally was never impacted by it until I went to 13.3.
Comment 132 Anton Saietskii 2024-04-15 12:13:07 UTC
(In reply to karl from comment #131)

Perhaps related to this (from 13.3 relnotes):
> OpenZFS has been upgraded to version 2.1.14
Comment 133 karl 2024-04-15 12:19:24 UTC
(In reply to Anton Saietskii from comment #132)
Maybe.

13.2-RELEASE had a rather not-nice problem with one of my instances at a cloud provider where, once in a while, it would throw what was CLAIMED to be a set of zfs failed-write block errors (yes, incrementing the error counts!) but a scrub revealed nothing, and the data was never actually damaged (!!!!)  It occurred about once a month on that particular machine, and when it did, the kernel also logged a bunch of errors of the form "calcru: runtime went backwards".

I do wonder if what was actually going on there was related to this general issue.  Not sure.  It ceased when I went to 13.3 -- this particular machine didn't run .2 for very long, as I'm pretty conservative with upgrading it (I really don't want to have to go backward and take the downtime hit involved in doing so), and it was definitely a head-scratcher, as it was a zfs-only issue -- the machines I have in the field not running zfs never saw it, even though they were on the same release and patch level.  It wasn't universal either; I never saw it on my build box, for example, nor on many others, but one very specific machine that also has a postgres database and a public-facing web application that hits it pretty hard did.
Comment 134 Peter Much 2024-04-15 13:21:35 UTC
(In reply to karl from comment #131)
Same here. 13.2 did work over its lifetime, whereas 13.3 immediately crashed my build engines and stalled my desktop.

However, we should also consider Felix's comment #89. While I cannot pinpoint specific issues, I have the strong impression that the patches from here are an overall improvement, also compared to 13.2.
They have been running here for almost two months now, and things are just running fine, I love it!

Therefore I would very much appreciate it if Seigo and Olivier could work together and identify *all* the actual changes from the patch set.
I tried to do that myself, analyzing the patch set line by line, but gave up after a couple of pages, as this relates to many subsystems of the FS layer, and if one hasn't dived into these before, it is more than a day's work, probably more than a week's...
Comment 135 Tomohiro Hosaka 2024-04-16 12:41:21 UTC
(In reply to Trev from comment #126)

Thanks for your reply.

After trying 13.3-STABLE n257699-c39938ddd3a7, the above-mentioned problem was resolved and I regained 12.4-p9 stability.

Thank you so much.
Comment 136 Seigo Tanimura 2024-04-22 06:43:50 UTC
(In reply to Olivier Certner from comment #116)

Hello Olivier,

My apologies for being so busy.  Confirmed commit 330954bdb8 in stable/13, thanks.

Is it OK if I rebase my stable/13 branch and drop my own fix from it?  I understand that commit 5dd192dd97 in my fix as of comment #112 is effectively in stable/13 now, so there is no need to keep it on my side.
Comment 137 Olivier Certner freebsd_committer freebsd_triage 2024-04-22 07:02:08 UTC
(In reply to Seigo Tanimura from comment #136)

Hello Seigo,

No problem.  Yes, you can rebase as you see fit, 5dd192dd97 / D44170 is indeed what is replaced by the ZFS fix's backport.

Just to confirm, since I'm not sure whether that is what you meant: have you already tested with a recent stable/13 (recent enough to include 330954bdb8)?  I'm not especially worried about the fix, but the more data points the better.

Thanks and regards.
Comment 138 commit-hook freebsd_committer freebsd_triage 2024-04-24 20:21:24 UTC
A commit in branch releng/13.3 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=266b3bd3f26d30f7be56b7ec9d31f3db2285b4ce

commit 266b3bd3f26d30f7be56b7ec9d31f3db2285b4ce
Author:     Alexander Motin <mav@FreeBSD.org>
AuthorDate: 2023-10-30 23:56:04 +0000
Commit:     Gordon Tetlow <gordon@FreeBSD.org>
CommitDate: 2024-04-24 20:06:16 +0000

    Unify arc_prune_async() code, fix excessive ARC pruning

    There is no sense to have separate implementations for FreeBSD and Linux.  Make
    Linux code shared as more functional and just register FreeBSD-specific prune
    callback with arc_add_prune_callback() API.

    Aside of code cleanup this fixes excessive pruning on FreeBSD.

    [olce: This code comes from the OpenZFS pull request:
    https://github.com/openzfs/zfs/pull/16083, vendor-merged into our tree.  Its
    commit message has been slightly adapted to the present context.  The upstream
    pull request has been reviewed and merged into 'zfs-2.1.16-staging' as
    5b81b1bf5e6d6aeb8a87175dcb12b529185cac2f, which should come into our tree at the
    next vendor import.  This is the same code that was merged into stable/14 and
    main as part of vendor merges, and released as an EN (FreeBSD-EN-23:18.openzfs)
    over releng/14.0 by markj@.]

    PR:             275594, 274698
    Reported by:    Seigo Tanimura <seigo.tanimura@gmail.com>, markj, and others
    Tested by:      olce
    Approved by:    emaste (mentor)
    Approved by:    so
    Obtained from:  OpenZFS
    Sponsored by:   iXsystems, Inc.
    Sponsored by:   The FreeBSD Foundation
    Signed-off-by:  Alexander Motin <mav@FreeBSD.org>

    (cherry picked from commit 330954bdb822af6bc07d487b1ecd7f8fda9c4def)

 sys/contrib/openzfs/include/os/linux/zfs/sys/zpl.h |  2 +-
 sys/contrib/openzfs/include/sys/arc.h              |  2 +-
 sys/contrib/openzfs/include/sys/arc_impl.h         |  1 -
 sys/contrib/openzfs/module/os/freebsd/zfs/arc_os.c | 62 ----------------------
 .../openzfs/module/os/freebsd/zfs/zfs_vfsops.c     | 32 +++++++++++
 sys/contrib/openzfs/module/os/linux/zfs/arc_os.c   | 51 ------------------
 .../openzfs/module/os/linux/zfs/zpl_super.c        |  2 +-
 sys/contrib/openzfs/module/zfs/arc.c               | 52 ++++++++++++++++++
 8 files changed, 87 insertions(+), 117 deletions(-)
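
For readers trying to follow the commit above: the mechanism it unifies is a simple callback registry.  A filesystem registers a prune function together with an opaque argument via arc_add_prune_callback(), and the ARC later asks every registered callback to release some number of objects.  The user-space sketch below only models that registration/notification pattern; all of its names (prune_func_t, prune_register, prune_notify, fs_prune_cb) are made up for illustration and are not the OpenZFS API, whose actual prototypes live in sys/contrib/openzfs/include/sys/arc.h.

/*
 * Stand-alone model of the prune-callback pattern described above.
 * Nothing here is OpenZFS code; every name is invented for illustration.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef void prune_func_t(uint64_t nr_to_scan, void *priv);

struct prune_cb {
	prune_func_t	*pc_func;	/* consumer's reclaim routine */
	void		*pc_arg;	/* opaque argument for the consumer */
	struct prune_cb	*pc_next;
};

static struct prune_cb *prune_cbs;	/* single shared list of consumers */

/* Register a consumer; plays the role of arc_add_prune_callback(). */
static struct prune_cb *
prune_register(prune_func_t *func, void *arg)
{
	struct prune_cb *cb;

	cb = malloc(sizeof(*cb));
	if (cb == NULL)
		abort();
	cb->pc_func = func;
	cb->pc_arg = arg;
	cb->pc_next = prune_cbs;
	prune_cbs = cb;
	return (cb);
}

/* Ask every registered consumer to try to release nr_to_scan objects. */
static void
prune_notify(uint64_t nr_to_scan)
{
	struct prune_cb *cb;

	for (cb = prune_cbs; cb != NULL; cb = cb->pc_next)
		cb->pc_func(nr_to_scan, cb->pc_arg);
}

/* A mock filesystem callback standing in for the vnode pruner. */
static void
fs_prune_cb(uint64_t nr_to_scan, void *priv)
{
	printf("%s: asked to release up to %llu objects\n",
	    (const char *)priv, (unsigned long long)nr_to_scan);
}

int
main(void)
{
	(void)prune_register(fs_prune_cb, "zfs");
	prune_notify(128);
	return (0);
}

As I understand the real code, the notification is dispatched asynchronously from the ARC rather than in a synchronous loop as above; the loop is only there to keep the model short.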
Comment 139 Anton Saietskii 2024-04-25 12:46:11 UTC
(In reply to Anton Saietskii from comment #102)

After upgrading to 13.3-RELEASE-p2 releng/13.3-n257433-be4f1894ef3, I did exactly the same again.  No issues occurred, and the troublemakers used almost no CPU time:
$ ps auwwxH | egrep '[a]rc_prune|[v]nlru'
root             0  0.0  0.0       0  11248  -  DLs  13:50    0:00.11 [kernel/arc_prune]
root            16  0.0  0.0       0     16  -  DL   13:50    0:00.09 [vnlru]

Looks like this is fixed (at least for me).
Comment 140 Olivier Certner freebsd_committer freebsd_triage 2024-04-25 12:48:59 UTC
(In reply to Anton Saietskii from comment #139)

Great.  Thanks for confirming.
Comment 141 karl 2024-04-25 12:50:47 UTC
(In reply to Anton Saietskii from comment #139)

I concur -- I did a binary update yesterday afternoon on one of my rather busy systems in the field.  Unlike my build box, it was not prone to the "brain freeze" behavior, but the update still materially improved performance and dropped the load average.
Comment 142 Vladimir Druzenko freebsd_committer freebsd_triage 2024-04-26 17:11:26 UTC
(In reply to Olivier Certner from comment #137)
You can close this PR as Fixed.
Comment 143 Vladimir Druzenko freebsd_committer freebsd_triage 2024-04-26 17:13:00 UTC
Misclick…

(In reply to Seigo Tanimura from comment #136)
You can close this PR as Fixed.
Comment 144 Olivier Certner freebsd_committer freebsd_triage 2024-04-27 15:16:53 UTC
(In reply to Vladimir Druzenko from comment #143)

This bug has become more than just the reported problem of the [kernel/arc_prune] thread taking 100% CPU and considerably slowing down other vnode operations, which is now indeed fixed on all supported versions (see bug 274698 for a chronology of merges/commits).

Seigo has contributed a number of ideas and fixes for other important problems we have, even if they are not as visible as the [kernel/arc_prune] problem.  I'm currently working on analyzing the fixes and contributing my own ideas and code (among other things).

Given also the number of testimonies and data points contributed in this bug, it is probably simpler to keep it as the tracking bug for this work.  I'll change the title of this PR accordingly.

This means that this bug will stay open, even though EN-24:09.zfs references it and says the fix for the [kernel/arc_prune] part is in stable/13 and releng/13.3.  There is in reality no contradiction here; on the surface it may look confusing, and it is probably not how things are usually handled, but that is where the history of these problems, the number of people involved and how they interacted have left us.  The change of title should clear up most doubts for those visiting this bug from the EN.
Comment 145 Seigo Tanimura 2024-05-01 05:53:08 UTC
(In reply to Seigo Tanimura from comment #112)

Now that FreeBSD-EN-24:09.zfs has been issued on stable/13 and releng/13.3, I have rebased the branches and removed the merged changes.

* Changes

- Remove the FreeBSD-EN-24:09.zfs changes from stable/13 and releng/13.3.

* GitHub Sources

All of the sources are under https://github.com/altimeter-130ft/freebsd-freebsd-src.

- Branches and Git Commit Hashes

            |                                                         | Git Commit Hash
            | Fix Branch                                              |
Base Branch | Fix + Counter Branch                                    | Base
============+=========================================================+=================
main        | topic-openzfs-arc_prune-regulation-fix                  | f0e59ecff8
            | topic-openzfs-arc_prune-regulation-counters             | 
------------+---------------------------------------------------------+-----------------
stable/14   | stable/14-topic-openzfs-arc_prune-regulation-fix        | a2eaf1cdd6
            | stable/14-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/14.0 | releng/14.0-topic-openzfs-arc_prune-regulation-fix      | d338712beb
            | releng/14.0-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
stable/13   | stable/13-topic-openzfs-arc_prune-regulation-fix        | 1002fa246b
            | stable/13-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/13.3 | releng/13.3-topic-openzfs-arc_prune-regulation-fix      | be4f1894ef
            | releng/13.3-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
releng/13.2 | releng/13.2-topic-openzfs-arc_prune-regulation-fix      | f5ac4e174f
            | releng/13.2-topic-openzfs-arc_prune-regulation-counters | 

            | Git Commit Hash
            |                          | Per-filesystem |            |            
            | FreeBSD-EN-23:18.openzfs | Vnode Counters | ZFS & VFS  | Nullfs
Base Branch | Backport                 | (VFS part)     | Fix        | Fix
============+==========================+================+============+==============
main        | N/A                      | 6302f9e599     | 365ae73e65 | (c849eb8f19)
------------+--------------------------+----------------+------------+--------------
stable/14   | N/A                      | 46f5bf117c     | 64320f35b5 | (3ba93b50d6)
------------+--------------------------+----------------+------------+--------------
releng/14.0 | N/A                      | 7ce096307a     | debb86e65a | fee257f966
------------+--------------------------+----------------+------------+--------------
stable/13   | (330954bdb8)             | b5f8ffb7c4     | 5fab3dffef | (d119f5a194)
------------+--------------------------+----------------+------------+--------------
releng/13.3 | (266b3bd3f2)             | ed439d4269     | bc18e6620a | 84a5dc84e0
------------+--------------------------+----------------+------------+--------------
releng/13.2 | 8df5bda7f6               | 3afa14e361     | 31e31a183b | ca2d73e380

            | Git Commit Hash
            | Counters
Base Branch | (Not for merging)
============+===================
main        | 0a56f453fa
------------+-------------------
stable/14   | 4b6caf046a
------------+-------------------
releng/14.0 | 9711919fe1
------------+-------------------
stable/13   | d8a6de5624
------------+-------------------
releng/13.3 | d14b4a08d8
------------+-------------------
releng/13.2 | 4ddbadf025
Comment 146 Seigo Tanimura 2024-05-05 10:31:23 UTC
(In reply to Seigo Tanimura from comment #145)

The releng/14.1 branch has been created.

* Changes

- Create releng/14.1.
- Rebase main, stable/14 and stable/13.

* GitHub Sources

All of the sources are under https://github.com/altimeter-130ft/freebsd-freebsd-src.

- Branches and Git Commit Hashes

            |                                                         | Git Commit Hash
            | Fix Branch                                              |
Base Branch | Fix + Counter Branch                                    | Base
============+=========================================================+=================
main        | topic-openzfs-arc_prune-regulation-fix                  | 1023317ac4
            | topic-openzfs-arc_prune-regulation-counters             | 
------------+---------------------------------------------------------+-----------------
stable/14   | stable/14-topic-openzfs-arc_prune-regulation-fix        | 7de39f926c
            | stable/14-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/14.1 | releng/14.1-topic-openzfs-arc_prune-regulation-fix      | 25c2d762af
            | releng/14.1-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
releng/14.0 | releng/14.0-topic-openzfs-arc_prune-regulation-fix      | d338712beb
            | releng/14.0-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
stable/13   | stable/13-topic-openzfs-arc_prune-regulation-fix        | 76f866f0f6
            | stable/13-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/13.3 | releng/13.3-topic-openzfs-arc_prune-regulation-fix      | be4f1894ef
            | releng/13.3-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
releng/13.2 | releng/13.2-topic-openzfs-arc_prune-regulation-fix      | f5ac4e174f
            | releng/13.2-topic-openzfs-arc_prune-regulation-counters | 

            | Git Commit Hash
            |                          | Per-filesystem |            |            
            | FreeBSD-EN-23:18.openzfs | Vnode Counters | ZFS & VFS  | Nullfs
Base Branch | Backport                 | (VFS part)     | Fix        | Fix
============+==========================+================+============+==============
main        | N/A                      | 7389b5cd8c     | 375e07f6f5 | (c849eb8f19)
------------+--------------------------+----------------+------------+--------------
stable/14   | N/A                      | 2915b9873f     | 93429e52f9 | (3ba93b50d6)
------------+--------------------------+----------------+------------+--------------
releng/14.1 | N/A                      | 48392063ed     | e9e6b62396 | (3ba93b50d6)
------------+--------------------------+----------------+------------+--------------
releng/14.0 | N/A                      | 7ce096307a     | debb86e65a | fee257f966
------------+--------------------------+----------------+------------+--------------
stable/13   | (330954bdb8)             | 4ef38fbec0     | 7eb152767a | (d119f5a194)
------------+--------------------------+----------------+------------+--------------
releng/13.3 | (266b3bd3f2)             | ed439d4269     | bc18e6620a | 84a5dc84e0
------------+--------------------------+----------------+------------+--------------
releng/13.2 | 8df5bda7f6               | 3afa14e361     | 31e31a183b | ca2d73e380

            | Git Commit Hash
            | Counters
Base Branch | (Not for merging)
============+===================
main        | 816048755f
------------+-------------------
stable/14   | ee9ef2ed89
------------+-------------------
releng/14.1 | e5e9b3ba53
------------+-------------------
releng/14.0 | 9711919fe1
------------+-------------------
stable/13   | 820b73b5ca
------------+-------------------
releng/13.3 | d14b4a08d8
------------+-------------------
releng/13.2 | 4ddbadf025
Comment 147 Seigo Tanimura 2024-05-13 08:05:53 UTC
(In reply to Seigo Tanimura from comment #146)

Rebased to 14.1-BETA2.

I have updated my ports tree to 8a66b69ceb with some local fixes, and tested "poudriere bulk -J 16" (up from "-J 8") with MAKE_JOBS_NUMBER=4 (up from 2).
Although these changes put more pressure on the kernel memory, the bulk build survived and built 2357 packages.

Tests under heavy load are again welcome.  What is important is that the ZFS ARC pruning keeps working under high pressure on the kernel memory, which triggers the ARC pruning a lot.

* Changes

- Rebase main, stable/14, releng/14.1 and stable/13.

* GitHub Sources

All of the sources are under https://github.com/altimeter-130ft/freebsd-freebsd-src.

- Branches and Git Commit Hashes

            |                                                         | Git Commit Hash
            | Fix Branch                                              |
Base Branch | Fix + Counter Branch                                    | Base
============+=========================================================+=================
main        | topic-openzfs-arc_prune-regulation-fix                  | c1ebd76c3f
            | topic-openzfs-arc_prune-regulation-counters             | 
------------+---------------------------------------------------------+-----------------
stable/14   | stable/14-topic-openzfs-arc_prune-regulation-fix        | 3c414a8c2f
            | stable/14-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/14.1 | releng/14.1-topic-openzfs-arc_prune-regulation-fix      | e3e57ae30c
            | releng/14.1-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
releng/14.0 | releng/14.0-topic-openzfs-arc_prune-regulation-fix      | d338712beb
            | releng/14.0-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
stable/13   | stable/13-topic-openzfs-arc_prune-regulation-fix        | 85e63d952d
            | stable/13-topic-openzfs-arc_prune-regulation-counters   | 
------------+---------------------------------------------------------+-----------------
releng/13.3 | releng/13.3-topic-openzfs-arc_prune-regulation-fix      | be4f1894ef
            | releng/13.3-topic-openzfs-arc_prune-regulation-counters | 
------------+---------------------------------------------------------+-----------------
releng/13.2 | releng/13.2-topic-openzfs-arc_prune-regulation-fix      | f5ac4e174f
            | releng/13.2-topic-openzfs-arc_prune-regulation-counters | 

            | Git Commit Hash
            |                          | Per-filesystem |            |            
            | FreeBSD-EN-23:18.openzfs | Vnode Counters | ZFS & VFS  | Nullfs
Base Branch | Backport                 | (VFS part)     | Fix        | Fix
============+==========================+================+============+==============
main        | N/A                      | 658e58a217     | 7eb637c841 | (c849eb8f19)
------------+--------------------------+----------------+------------+--------------
stable/14   | N/A                      | 407c44c311     | d1aefc7787 | (3ba93b50d6)
------------+--------------------------+----------------+------------+--------------
releng/14.1 | N/A                      | e213ef74a1     | 44b9e4934f | (3ba93b50d6)
------------+--------------------------+----------------+------------+--------------
releng/14.0 | N/A                      | 7ce096307a     | debb86e65a | fee257f966
------------+--------------------------+----------------+------------+--------------
stable/13   | (330954bdb8)             | f7372a3d6e     | 1e59b1be5d | (d119f5a194)
------------+--------------------------+----------------+------------+--------------
releng/13.3 | (266b3bd3f2)             | ed439d4269     | bc18e6620a | 84a5dc84e0
------------+--------------------------+----------------+------------+--------------
releng/13.2 | 8df5bda7f6               | 3afa14e361     | 31e31a183b | ca2d73e380

            | Git Commit Hash
            | Counters
Base Branch | (Not for merging)
============+===================
main        | b36441e8c1
------------+-------------------
stable/14   | c206ab33a8
------------+-------------------
releng/14.1 | 59591c443a
------------+-------------------
releng/14.0 | 9711919fe1
------------+-------------------
stable/13   | 55585aa812
------------+-------------------
releng/13.3 | d14b4a08d8
------------+-------------------
releng/13.2 | 4ddbadf025
Comment 148 Seigo Tanimura 2024-05-16 07:06:23 UTC
(In reply to Seigo Tanimura from comment #147)

FYI to those who test the fix with poudriere-bulk(8):
There is a separate problem with poudriere-bulk(8), likely to be triggered when nvme(4) is involved.  Please refer to bug #279021.

I had to apply the fix in that PR as well to complete the poudriere-bulk(8) build test.