The ARC limit in /boot/loader.conf is not used.

Contents of /boot/loader.conf about ARC:

vfs.zfs.arc_min=32M
vfs.zfs.arc_max=128M

Current ARC usage from the top -b -o res command:

# top -b -o res
(...)
ARC: 853M Total, 223M MFU, 168M MRU, 272K Anon, 6253K Header, 456M Other
     77M Compressed, 315M Uncompressed, 4.09:1 Ratio
(...)

ARC should be 128 MB at max, but it's about 850 MB ...

... or is there another way to limit ARC usage?

Regards,
vermaden
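(For anyone reproducing this: a quick first check is whether the loader.conf tunables were actually picked up at boot. A minimal sketch, using the sysctls quoted later in this report; the live values are in bytes, so 128M should appear as 134217728:)

# tunables as the kernel sees them (bytes):
$ sysctl vfs.zfs.arc_min vfs.zfs.arc_max
# the ARC's internal cap and its current actual size:
$ sysctl kstat.zfs.misc.arcstats.c_max kstat.zfs.misc.arcstats.size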
Does anyone know of a workaround for this? I'm observing this on servers with arc_max set to 256MB that are OOM-killing things because the ARC is consuming many GB. Also, in my experience it only recently started on fairly mature 11.1 installations, so I wonder if it's because of a recent errata update.
Is vfs.zfs.arc_meta_limit preventing it from going that low? You may also want to consider review D7538; it improves the ARC's willingness to release memory.
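(For reference, a minimal way to compare the metadata limit against current metadata usage; sysctl names as they appear later in this report, all values in bytes:)

# the tunable and the live counters:
$ sysctl vfs.zfs.arc_meta_limit
$ sysctl kstat.zfs.misc.arcstats.arc_meta_limit kstat.zfs.misc.arcstats.arc_meta_used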
(In reply to Shane from comment #2)

> improves ARCs willingness to release.

Willingness? :)

We now have to convince the OS (or part of it) to give back resources that it should not have allocated in the first place? :D

Aren't limits set there to be respected and NOT exceeded?

Regards,
vermaden
(In reply to vermaden from comment #3)

The patch in review is about the ARC releasing its cache when programs want more memory, not about using more than the limit. Currently the system can swap out before it releases ARC; the patch changes that, so that when free RAM is used up, it will release ARC before swapping. This means it may not always be using the arc_max that it is allowed to use.

Also, the Oracle ZFS docs state that arc_min/arc_max have a range of 64M to physical memory. I wonder if there is a minimum amount of memory that a zpool needs based on the number of disks, blocks in the pool, number of files... Maybe this is what prevents the smaller memory use.
I think my idea of a minimum amount of memory needed for a zpool is in effect here. There isn't a lot written about it, but for minimum resources, Handbook section 19.6.2.1 says "A general rule of thumb is 1 GB of RAM for every 1 TB of storage." So if your pool is >= 1TB then you are using the minimum RAM required to use that pool.
The 1GB per TB rule is a stab at how much RAM is needed to cache the working set and deliver RAM-like read performance. It's not really intended to indicate how much RAM you need to operate a pool. (I've imported and used 1PB pools on heads with 32GB of RAM before.) The only real linear limit I know of is that datasets/zvols consume ~1 MB of RAM each. So a system with 4000 datasets would need ~4 GB of RAM just for the dataset metadata.
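(As a back-of-the-envelope check against that ~1 MB-per-dataset figure; a sketch, not an exact accounting:)

# count filesystems and zvols, then multiply by ~1 MB for a rough metadata floor
$ zfs list -H -t filesystem,volume -o name | wc -l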
The machines I have observed this on vary in zpool sizes. With regard to the "rule of thumb", one machine which behaves particularly horribly has a single zpool sized at 256GB. It has only 10GB referenced (meaning non-snapshot data) and less than 20k inodes. A linear interpretation of the rule of thumb suggests that just 10MB should be enough ARC, although I don't expect it to scale down that low. On this one, arc_max is set to 256MB, but the ARC runs well over 1 GB. I don't know how high it would go if left alone, since it only has 2 GB of RAM to begin with, so when it gets that big I have to reboot it. This one is an AWS VM.

For another example, I have a physical machine with 6GB of RAM, with arc_max set to 256MB and top showing the ARC at 2GB. This one is a bit bigger -- it has 1.4TB across 2 zpools. It does rsync-style backups for three other machines, so there's a relatively large number of filenames. The second zpool (for the backups) has roughly 5M inodes with roughly 70-75M filenames (up to 15 names per inode), with most of its inodes read in a short time span. However, I've been running this system with these backups on ZFS for years, at least as far back as FreeBSD 9, without memory problems. While it isn't a huge system, it was always very stable in the past.

While I don't see this issue on larger machines (with 128GB RAM or more, for example), I don't believe this is about a minimum memory requirement, for a few reasons. To begin with, the machines are not insanely tiny or running with a wildly unbalanced disk/RAM ratio. Also, if there were a hard minimum requirement, then sysctl should throw an error. Also, sysctl reports vfs.zfs.arc_meta_limit at ~67MB on both, which is much lower than arc_max.

However, I retract my remark about it maybe being from a recent update, because uname on the AWS machine reports 11.1-RELEASE-p4. (I often don't reboot after updating unless the kernel has a serious vulnerability, and this one has been up for 109 days.) Again, mine are 11.1 with the latest patches by freebsd-update. I could try upgrading to 11.2 if it would be an interesting data point.

> The patch in review is about ARC releasing its cache...

This patch would likely help, particularly since these examples don't have swap. It seems likely to alleviate my need to meddle with arc_max, which would be great. However, I'd argue that it's still a bug that arc_max is apparently completely ignored. And now that I think about it, it's also still a bug that OOM-killing processes is preferred to swapping OR evacuating ARC, unless that patch fixes that as well.

I'd swear I remember that fairly recently, I tried changing arc_max and top immediately showed the ARC chopped off at the new setting; if I remember that right, then this is clearly a regression... but the details of that memory are vague at this point.
<jpaetzel> Is there any case where ARC usage can exceed arc_max legitimately?

<mahrens> I want to say no, but your example is with arc_max=128MB so there may be some exceptions at that small of a size (only 8 maximum-size blocks!). Also stuff can get pinned in memory (and can’t be evicted regardless of ARC settings) when it’s “in use”, which is the case for metadata of open files, mounted filesystems, etc. It’s a hard limit on the *target* ARC size.
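(In sysctl terms, using the counter names dumped later in this report, that distinction is between the target size and the bytes actually held:)

# c_max = hard cap on the target; c = current target; size = bytes actually in the ARC
# (size can sit above c while buffers are pinned by open files, mounts, etc.)
$ sysctl kstat.zfs.misc.arcstats.c_max kstat.zfs.misc.arcstats.c kstat.zfs.misc.arcstats.size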
So consider the numbers in this example: are we saying that this behavior is "correct"?

I have a machine with 6GB of RAM and top shows the ARC is running at 2556 MB. arc_max was set to 256MB at boot, so it was overshooting by ~10x. I bumped arc_max to 2GB with sysctl [1], and the ARC usage stayed at 2556 MB.

[1]
$ sysctl vfs.zfs.arc_max=2147483648
vfs.zfs.arc_max: 268435456 -> 2147483648
Nope. I'm starting to get some traction that this is actually a bug. Can you paste the entire ARC usage lines from top on the 6G system please?
OK, what is going on is understood now: kern.maxvnodes had been increased, and the ZFS code has no way to reclaim vnodes when the ARC needs to shrink.
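(The practical workaround that follows from this is to keep kern.maxvnodes modest on memory-constrained systems; a minimal sketch, where the value 100000 is only an illustration, not a recommendation:)

# apply immediately; the ARC should start shrinking as vnodes are recycled
$ sysctl kern.maxvnodes=100000
# make it persistent across reboots
$ echo 'kern.maxvnodes=100000' >> /etc/sysctl.conf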
Fixing maxvnodes seems to have sorted out the problem. Here's the top header, in case it bears any more investigation.

last pid:  8729;  load averages:  0.16,  0.15,  0.10    up 0+18:24:10  11:07:00
36 processes:  1 running, 35 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 796K Active, 27M Inact, 81M Laundry, 5631M Wired, 155M Free
ARC: 2639M Total, 643M MFU, 317M MRU, 972K Anon, 16M Header, 1663M Other
     195M Compressed, 765M Uncompressed, 3.92:1 Ratio
Swap:

When I cut maxvnodes from 1048576 to 524288, the ARC Total dropped from 2639MB to 1455MB over about 30 seconds. Further reducing maxvnodes brought the Total down below the configured arc_max. Thanks!
So I looked back at this today, and I found that even after having returned maxvnodes to the default, I still see it overrunning arc_max persistently by almost 2x. I can lower maxvnodes below the default (179499), and then arc_max is respected.

Below are stats on the same system, which has 6GB of RAM with arc_max set to 256MB, starting with maxvnodes at the default and then lowering it.

Granted, one can reasonably argue that anyone capable of tinkering with arc_max should be capable of tinkering with maxvnodes also. However, it is rather astonishing, since there are suggestions to crank up maxvnodes to improve performance [wiki/ZFSTuningGuide] and to lower arc_max for small systems, but no mention of one affecting the other. So I'm adding this data to be of help if this is considered a significant bug.

$ sysctl vfs.zfs.arc_max
vfs.zfs.arc_max: 268435456
$ sysctl kern.maxvnodes
kern.maxvnodes: 179499
$ top
last pid: 95789;  load averages:  0.14,  0.16,  0.11    up 8+21:45:16  10:27:57
40 processes:  2 running, 38 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 20M Active, 251M Inact, 2769M Wired, 2856M Free
ARC: 456M Total, 87M MFU, 105M MRU, 1600K Anon, 8149K Header, 255M Other
     40M Compressed, 154M Uncompressed, 3.87:1 Ratio
Swap:

Lowering kern.maxvnodes below the default:

$ sysctl kern.maxvnodes=10240
kern.maxvnodes: 179499 -> 10240
$ top
last pid: 96130;  load averages:  0.06,  0.12,  0.09    up 8+21:50:03  10:32:44
42 processes:  1 running, 41 sleeping
CPU:  0.4% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.2% idle
Mem: 21M Active, 233M Inact, 2773M Wired, 2868M Free
ARC: 148M Total, 43M MFU, 29M MRU, 1600K Anon, 7678K Header, 66M Other
     40M Compressed, 154M Uncompressed, 3.86:1 Ratio
Swap:
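(To watch that coupling directly, the live vnode count can be compared against the cap alongside the ARC size; vfs.numvnodes appears later in this thread, so I'm assuming it is available on these systems as well:)

# vnode cap, vnodes currently in use, and actual ARC bytes
$ sysctl kern.maxvnodes vfs.numvnodes kstat.zfs.misc.arcstats.size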
Can you get these sysctls when your ARC is over the limit?

kstat.zfs.misc.arcstats.other_size
kstat.zfs.misc.arcstats.bonus_size
kstat.zfs.misc.arcstats.dnode_size
kstat.zfs.misc.arcstats.dbuf_size
kstat.zfs.misc.arcstats.metadata_size
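(A convenient one-liner for gathering these; the -i flag tells sysctl to skip unknown OIDs, which I'm assuming matters here since some of these counters may not exist before 12:)

$ sysctl -i kstat.zfs.misc.arcstats.other_size kstat.zfs.misc.arcstats.bonus_size \
    kstat.zfs.misc.arcstats.dnode_size kstat.zfs.misc.arcstats.dbuf_size \
    kstat.zfs.misc.arcstats.metadata_size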
(In reply to Allan Jude from comment #14)

Sure! Actually, the only two systems I have right now on 12 are not overrunning it (but they also use the defaults for kern.maxvnodes and vfs.zfs.arc_max). Looks like most of those sysctls were introduced in 12.

Still, here's what the 6GB system shows, which is still running 11.2. Maybe I can upgrade it to 12 and see what it says after this weekend. It's using the defaults for kern.maxvnodes and vfs.zfs.arc_max, and it's overrunning arc_max slightly.

$ top
Mem: 20M Active, 304M Inact, 5030M Wired, 541M Free
ARC: 996M Total, 188M MFU, 242M MRU, 1376K Anon, 22M Header, 542M Other
     94M Compressed, 337M Uncompressed, 3.57:1 Ratio

$ sysctl kstat.zfs.misc.arcstats.other_size kstat.zfs.misc.arcstats.bonus_size kstat.zfs.misc.arcstats.dnode_size kstat.zfs.misc.arcstats.dbuf_size kstat.zfs.misc.arcstats.metadata_size
kstat.zfs.misc.arcstats.other_size: 567460432
sysctl: unknown oid 'kstat.zfs.misc.arcstats.bonus_size'
sysctl: unknown oid 'kstat.zfs.misc.arcstats.dnode_size'
sysctl: unknown oid 'kstat.zfs.misc.arcstats.dbuf_size'
kstat.zfs.misc.arcstats.metadata_size: 443239936

$ sysctl kern.maxvnodes vfs.zfs.arc_max
kern.maxvnodes: 179499
vfs.zfs.arc_max: 795726848

And just for fun, here's everything in kstat.zfs.misc.arcstats:

$ sysctl kstat.zfs.misc.arcstats
kstat.zfs.misc.arcstats.demand_hit_predictive_prefetch: 432799
kstat.zfs.misc.arcstats.sync_wait_for_async: 26963
kstat.zfs.misc.arcstats.arc_meta_min: 319189248
kstat.zfs.misc.arcstats.arc_meta_max: 1454089288
kstat.zfs.misc.arcstats.arc_meta_limit: 198931712
kstat.zfs.misc.arcstats.arc_meta_used: 1034141472
kstat.zfs.misc.arcstats.memory_throttle_count: 0
kstat.zfs.misc.arcstats.l2_write_buffer_list_null_iter: 0
kstat.zfs.misc.arcstats.l2_write_buffer_list_iter: 0
kstat.zfs.misc.arcstats.l2_write_buffer_bytes_scanned: 0
kstat.zfs.misc.arcstats.l2_write_pios: 0
kstat.zfs.misc.arcstats.l2_write_buffer_iter: 0
kstat.zfs.misc.arcstats.l2_write_full: 0
kstat.zfs.misc.arcstats.l2_write_not_cacheable: 1387042
kstat.zfs.misc.arcstats.l2_write_io_in_progress: 0
kstat.zfs.misc.arcstats.l2_write_in_l2: 0
kstat.zfs.misc.arcstats.l2_write_spa_mismatch: 0
kstat.zfs.misc.arcstats.l2_write_passed_headroom: 0
kstat.zfs.misc.arcstats.l2_write_trylock_fail: 0
kstat.zfs.misc.arcstats.l2_hdr_size: 0
kstat.zfs.misc.arcstats.l2_asize: 0
kstat.zfs.misc.arcstats.l2_size: 0
kstat.zfs.misc.arcstats.l2_io_error: 0
kstat.zfs.misc.arcstats.l2_cksum_bad: 0
kstat.zfs.misc.arcstats.l2_abort_lowmem: 0
kstat.zfs.misc.arcstats.l2_free_on_write: 0
kstat.zfs.misc.arcstats.l2_evict_l1cached: 0
kstat.zfs.misc.arcstats.l2_evict_reading: 0
kstat.zfs.misc.arcstats.l2_evict_lock_retry: 0
kstat.zfs.misc.arcstats.l2_writes_lock_retry: 0
kstat.zfs.misc.arcstats.l2_writes_error: 0
kstat.zfs.misc.arcstats.l2_writes_done: 0
kstat.zfs.misc.arcstats.l2_writes_sent: 0
kstat.zfs.misc.arcstats.l2_write_bytes: 0
kstat.zfs.misc.arcstats.l2_read_bytes: 0
kstat.zfs.misc.arcstats.l2_rw_clash: 0
kstat.zfs.misc.arcstats.l2_feeds: 0
kstat.zfs.misc.arcstats.l2_misses: 0
kstat.zfs.misc.arcstats.l2_hits: 0
kstat.zfs.misc.arcstats.mfu_ghost_evictable_metadata: 226059776
kstat.zfs.misc.arcstats.mfu_ghost_evictable_data: 0
kstat.zfs.misc.arcstats.mfu_ghost_size: 226059776
kstat.zfs.misc.arcstats.mfu_evictable_metadata: 0
kstat.zfs.misc.arcstats.mfu_evictable_data: 0
kstat.zfs.misc.arcstats.mfu_size: 196918784
kstat.zfs.misc.arcstats.mru_ghost_evictable_metadata: 540657664
kstat.zfs.misc.arcstats.mru_ghost_evictable_data: 0
kstat.zfs.misc.arcstats.mru_ghost_size: 540657664
kstat.zfs.misc.arcstats.mru_evictable_metadata: 0
kstat.zfs.misc.arcstats.mru_evictable_data: 0
kstat.zfs.misc.arcstats.mru_size: 254938624
kstat.zfs.misc.arcstats.anon_evictable_metadata: 0
kstat.zfs.misc.arcstats.anon_evictable_data: 0
kstat.zfs.misc.arcstats.anon_size: 1671680
kstat.zfs.misc.arcstats.other_size: 567378976
kstat.zfs.misc.arcstats.metadata_size: 443244032
kstat.zfs.misc.arcstats.data_size: 10285056
kstat.zfs.misc.arcstats.hdr_size: 23518464
kstat.zfs.misc.arcstats.overhead_size: 354117120
kstat.zfs.misc.arcstats.uncompressed_size: 353756672
kstat.zfs.misc.arcstats.compressed_size: 99411456
kstat.zfs.misc.arcstats.size: 1044426528
kstat.zfs.misc.arcstats.c_max: 795726848
kstat.zfs.misc.arcstats.c_min: 638378496
kstat.zfs.misc.arcstats.c: 795726848
kstat.zfs.misc.arcstats.p: 621214208
kstat.zfs.misc.arcstats.hash_chain_max: 4
kstat.zfs.misc.arcstats.hash_chains: 4222
kstat.zfs.misc.arcstats.hash_collisions: 840474
kstat.zfs.misc.arcstats.hash_elements_max: 142601
kstat.zfs.misc.arcstats.hash_elements: 96882
kstat.zfs.misc.arcstats.evict_l2_skip: 0
kstat.zfs.misc.arcstats.evict_l2_ineligible: 34269956096
kstat.zfs.misc.arcstats.evict_l2_eligible: 284770299904
kstat.zfs.misc.arcstats.evict_l2_cached: 0
kstat.zfs.misc.arcstats.evict_not_enough: 49123984
kstat.zfs.misc.arcstats.evict_skip: 6955603713
kstat.zfs.misc.arcstats.access_skip: 142705651
kstat.zfs.misc.arcstats.mutex_miss: 6864739
kstat.zfs.misc.arcstats.deleted: 7038134
kstat.zfs.misc.arcstats.allocated: 59387140
kstat.zfs.misc.arcstats.mfu_ghost_hits: 299078
kstat.zfs.misc.arcstats.mfu_hits: 245858368
kstat.zfs.misc.arcstats.mru_ghost_hits: 1625090
kstat.zfs.misc.arcstats.mru_hits: 105751966
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 3148348
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 8579968
kstat.zfs.misc.arcstats.prefetch_data_misses: 157488
kstat.zfs.misc.arcstats.prefetch_data_hits: 1039
kstat.zfs.misc.arcstats.demand_metadata_misses: 19827718
kstat.zfs.misc.arcstats.demand_metadata_hits: 348572671
kstat.zfs.misc.arcstats.demand_data_misses: 2626794
kstat.zfs.misc.arcstats.demand_data_hits: 2055456
kstat.zfs.misc.arcstats.misses: 25760348
kstat.zfs.misc.arcstats.hits: 359209134
After upgrading to 12 and letting a backup cycle run, it seems to be behaving well with defaults. I will try to aggravate the system a bit and see if it'll run over. Showing the good behavior currently:

$ top
last pid: 11816;  load averages:  0.11,  0.17,  0.19    up 0+22:17:27  23:13:17
40 processes:  1 running, 39 sleeping
CPU:  0.0% user,  0.0% nice,  0.8% system,  0.4% interrupt, 98.9% idle
Mem: 21M Active, 242M Inact, 3365M Wired, 2266M Free
ARC: 736M Total, 204M MFU, 186M MRU, 1120K Anon, 20M Header, 325M Other
     198M Compressed, 713M Uncompressed, 3.60:1 Ratio
Swap: 46G Total, 46G Free

$ sysctl vfs.zfs.arc_max kstat.zfs.misc.arcstats.other_size kstat.zfs.misc.arcstats.bonus_size kstat.zfs.misc.arcstats.dnode_size kstat.zfs.misc.arcstats.dbuf_size kstat.zfs.misc.arcstats.metadata_size
vfs.zfs.arc_max: 795202560
kstat.zfs.misc.arcstats.other_size: 340702560
kstat.zfs.misc.arcstats.bonus_size: 83656640
kstat.zfs.misc.arcstats.dnode_size: 191070880
kstat.zfs.misc.arcstats.dbuf_size: 65975040
kstat.zfs.misc.arcstats.metadata_size: 401358848
In 11, I could lower vfs.zfs.arc_max without rebooting. If I lowered it below the current allocation, it'd whack out a bunch of cached data to get within the new setting (maxvnodes problem notwithstanding). In 12, it seems I can't lower it below the current allocation. In practice, this means one will rarely be able to lower it without rebooting (since one wouldn't be trying to lower it on a running system unless it was already so large it was causing a problem). It was nice to be able to change it to make more room for hungry things like InnoDB without rebooting. Is that intentional? I guess there may be a good reason why, or maybe it backs off automatically better than in 11?
(In reply to Leif Pedersen from comment #17)

After upgrading to 12.0, and letting yesterday's cycle run on defaults, I cranked up maxvnodes to 1048576 (default was 179420) and squashed arc_max to 64M (default was 758M). After a backup cycle, it had madly overrun vfs.zfs.arc_max. The ARC usage shows 2449M now. (This machine runs rsync-style backups for me, so it churns through a lot of inodes -- a good test case for this.)

Here are the sysctl stats you asked for:

$ sysctl vfs.zfs.arc_max kern.maxvnodes kstat.zfs.misc.arcstats.other_size kstat.zfs.misc.arcstats.bonus_size kstat.zfs.misc.arcstats.dnode_size kstat.zfs.misc.arcstats.dbuf_size kstat.zfs.misc.arcstats.metadata_size
vfs.zfs.arc_max: 67108864
kern.maxvnodes: 1048576
kstat.zfs.misc.arcstats.other_size: 1614528504
kstat.zfs.misc.arcstats.bonus_size: 398285120
kstat.zfs.misc.arcstats.dnode_size: 906573304
kstat.zfs.misc.arcstats.dbuf_size: 309670080
kstat.zfs.misc.arcstats.metadata_size: 936406016

$ top
last pid: 6873;  load averages:  0.36,  0.24,  0.20    up 0+13:13:01  12:39:03
46 processes:  1 running, 45 sleeping
CPU:  0.4% user,  0.0% nice,  0.8% system,  0.4% interrupt, 98.5% idle
Mem: 24M Active, 291M Inact, 5389M Wired, 190M Free
ARC: 2449M Total, 440M MFU, 453M MRU, 1376K Anon, 15M Header, 1540M Other
     181M Compressed, 713M Uncompressed, 3.94:1 Ratio
Swap: 46G Total, 46G Free
A year or so ago, before I was aware of this bug, I had a sense that things weren't right on a notebook with 15.9 GiB memory. Probably 12.0-CURRENT at the time.

IIRC I chose to set vfs.zfs.arc_max to around 2 G. (I can't recall what made me choose that figure … something in bug 187594, maybe. I vaguely recall 2 G being around half of the default, but that seems inconsistent with what I now read at <https://www.freebsd.org/doc/handbook/zfs-advanced.html>.) Whilst I didn't take measurements, the end result was pleasing.

With a cleaner installation of 13.0-CURRENT (~2018-12), I have not (yet) felt the need to change vfs.zfs.arc_max.

----

Today:

grahamperrin@momh167-gjp4-8570p:~ % date ; uname -v ; uptime
Sat 2 Mar 2019 11:43:54 GMT
FreeBSD 13.0-CURRENT r344443 GENERIC-NODEBUG
11:43a.m.  up 41 mins, 6 users, load averages: 1.02, 0.70, 0.61
grahamperrin@momh167-gjp4-8570p:~ % sysctl -a | grep vfs.zfs.arc
vfs.zfs.arc_min_prescient_prefetch_ms: 6
vfs.zfs.arc_min_prefetch_ms: 1
vfs.zfs.arc_meta_strategy: 0
vfs.zfs.arc_meta_limit: 3883606016
vfs.zfs.arc_free_target: 86433
vfs.zfs.arc_kmem_cache_reap_retry_ms: 0
vfs.zfs.arc_grow_retry: 60
vfs.zfs.arc_shrink_shift: 7
vfs.zfs.arc_average_blocksize: 8192
vfs.zfs.arc_no_grow_shift: 5
vfs.zfs.arc_min: 1941803008
vfs.zfs.arc_max: 15534424064
grahamperrin@momh167-gjp4-8570p:~ % sysctl -a | grep vnode
kern.maxvnodes: 348816
kern.ipc.umtx_vnode_persistent: 0
kern.minvnodes: 87204
Syncing disks, vnodes remaining... 0 0 0 0 0 0 done
Syncing disks, vnodes remaining... 0 0 0 0 0 0 done
Syncing disks, vnodes remaining... 0 0 0 0 0 done
vm.vnode_pbufs: 512
vm.stats.vm.v_vnodepgsout: 3703
vm.stats.vm.v_vnodepgsin: 126117
vm.stats.vm.v_vnodeout: 1842
vm.stats.vm.v_vnodein: 12321
vfs.freevnodes: 9195
vfs.wantfreevnodes: 87204
vfs.vnodes_created: 19046
vfs.numvnodes: 16690
vfs.cache.cache_lock_vnodes_cel_3_failures: 0
vfs.ncpurgeminvnodes: 512
debug.vnode_domainset: <NULL>
debug.sizeof.vnode: 480
debug.fail_point.status_fill_kinfo_vnode__random_path: off
debug.fail_point.fill_kinfo_vnode__random_path: off
grahamperrin@momh167-gjp4-8570p:~ %

----

Now added to /etc/sysctl.conf:

vfs.zfs.arc_max="2147483648"

If I'm to add a setting for kern.maxvnodes (for test purposes), what would you suggest?

TIA
(In reply to Leif Pedersen from comment #17)

> In 11, I could lower vfs.zfs.arc_max without rebooting. If I lowered it below
> the current allocation, it'd whack out a bunch of cached data to get within
> the new setting

The freed memory still remains as Wired in top(1), and free memory you can't get at is of little use, which is probably the reason why lowering arc_max is forbidden in FreeBSD 12.