The ARC limit in /boot/loader.conf is not used.
Contents of /boot/loader.conf about ARC.
Current ARC usage from the top -b -o res command.
# top -b -o res
ARC: 853M Total, 223M MFU, 168M MRU, 272K Anon, 6253K Header, 456M Other
77M Compressed, 315M Uncompressed, 4.09:1 Ratio
ARC should be 128 MB at most, but it's about 850 MB ...
... or is there another way to limit ARC usage?
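To put a number on the overshoot, here is a minimal sketch. The helper names are made up for illustration, and it assumes top(1) size suffixes are powers of 1024.

```python
# Hypothetical helpers; top(1) size suffixes are assumed to be powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> int:
    """Convert a top-style size such as '853M' or '6253K' to bytes."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1].upper()])
    return int(text)


def overshoot(arc_total: str, arc_max: str) -> float:
    """Return how many times larger the ARC is than its configured limit."""
    return parse_size(arc_total) / parse_size(arc_max)


if __name__ == "__main__":
    # Figures from the report above: 853M in use against a 128M limit.
    print(f"{overshoot('853M', '128M'):.2f}x over arc_max")
```

With the figures from the top output above, the ARC is roughly 6.7 times its configured maximum.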
Does anyone know of a workaround for this? I'm observing this on servers with arc_max set to 256MB that're OOM-killing things because the ARC is consuming many GB.
Also, in my experience it only recently started on fairly mature 11.1 installations, so I wonder if it's because of a recent errata update.
Is vfs.zfs.arc_meta_limit preventing it from going that low?
You may also want to consider review D7538, which improves the ARC's willingness to release memory.
(In reply to Shane from comment #2)
> improves ARCs willingness to release.
We now have to convince the OS (or its part) to give back resources that it should not allocate in the first place? :D
Aren't limits set there to be respected and NOT exceeded?
(In reply to vermaden from comment #3)
The patch in review is about ARC releasing its cache when programs want more memory, not about using more than the limit.
Currently the system can start swapping before it releases ARC. The patch changes that, so that when free RAM runs out, it will release ARC before swapping. This means it may not always be using the arc_max that it is allowed to use.
Also, in the Oracle ZFS docs, it states that arc_min/arc_max have a range of 64M to physical mem. I wonder if there is a minimum amount that a zpool needs based on number of disks, blocks in pool, number of files... Maybe this is preventing the smaller memory use.
I think my idea of the minimum memory needed for a zpool is in effect here.
There isn't a lot on it, but for minimum resources, the Handbook 18.104.22.168 says "A general rule of thumb is 1 GB of RAM for every 1 TB of storage." So if your pool is >= 1TB then you are using the minimum RAM required to use that pool.
The 1GB per TB rule is a stab at how much RAM is needed to cache the working set and deliver RAM-like read performance.
It's not really intended to indicate how much RAM you need to operate a pool. (I've imported and used 1PB pools on heads with 32GB of RAM before)
The only real linear limit I know of is that datasets/zvols consume ~1 MB of RAM each.
So a system with 4000 datasets would need ~ 4GB RAM just for the dataset metadata.
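As a quick check of that estimate, a minimal sketch follows. The ~1 MB per dataset figure is the approximation quoted above, not a measured constant, and the function name is illustrative.

```python
MIB = 1024**2
GIB = 1024**3


def dataset_overhead_bytes(n_datasets: int, per_dataset_bytes: int = MIB) -> int:
    """Estimate RAM used by dataset/zvol metadata, assuming a flat per-dataset cost."""
    return n_datasets * per_dataset_bytes


if __name__ == "__main__":
    estimate = dataset_overhead_bytes(4000)
    print(f"4000 datasets: about {estimate / GIB:.1f} GiB")
```

Changing per_dataset_bytes shows how sensitive the total is to the assumed per-dataset cost.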
The machines I have observed this on vary in zpool sizes.
With regard to the "rule of thumb", one machine which behaves particularly horribly has a single zpool sized at 256GB. It has only 10GB referenced (meaning non-snapshot data) and less than 20k inodes. A linear interpretation of the rule of thumb suggests that just 10MB should be enough ARC, although I don't expect it to scale down that low. On this one, arc_max is set to 256MB, but the ARC runs well over 1 GB. I don't know how high it would go if left alone, since it only has 2 GB of RAM to begin with, so when it gets that big I have to reboot it. This one is an AWS VM.
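The linear reading of the rule of thumb can be sketched as below. This only scales "1 GB of RAM per 1 TB of storage" proportionally, using binary units as an assumption; it says nothing about what ZFS actually needs.

```python
GIB = 1024**3
TIB = 1024**4


def rule_of_thumb_ram(storage_bytes: int) -> int:
    """Scale '1 GiB of RAM per 1 TiB of storage' linearly to a given data size."""
    return storage_bytes * GIB // TIB


if __name__ == "__main__":
    for label, size in (("10 GiB referenced", 10 * GIB), ("256 GiB pool", 256 * GIB)):
        print(f"{label}: {rule_of_thumb_ram(size) // 1024**2} MiB")
```

Applied to the whole 256GB pool rather than the referenced data, the same scaling gives 256 MiB, which happens to equal the arc_max configured on that machine.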
For another example, I have a physical machine with 6GB of RAM, with arc_max set to 256MB and top showing the ARC at 2GB. This one is a bit bigger -- it has 1.4TB across 2 zpools. It does rsync-style backups for three other machines, so there's a relatively large number of filenames. The second zpool (for the backups) has roughly 5M inodes with roughly 70-75M filenames (up to 15 names per inode), with most of its inodes read in a short time span. However, I've been running this system with these backups on ZFS for years, at least as far back as FreeBSD 9, without memory problems. While it isn't a huge system, it was always very stable in the past.
While I don't see this issue on larger machines (with 128GB RAM or more, for example), I don't believe this is about a minimum memory requirement for a few reasons. To begin with, the machines are not insanely tiny or running with a wildly unbalanced disk/ram ratio. Also, if there's a hard minimum requirement, then sysctl should throw an error. Also, sysctl reports vfs.zfs.arc_meta_limit at ~67MB on both, which is much lower than arc_max.
However, I retract my remark about it maybe being from a recent update, because uname on the AWS machine reports 11.1-RELEASE-p4. (I often don't reboot after updating unless the kernel has a serious vulnerability, and this one has been up for 109 days.)
Again, mine are 11.1 with the latest patches by freebsd-update. I could try upgrading to 11.2 if it would be an interesting data point.
>The patch in review is about ARC releasing its cache...
This patch would likely help, particularly since these examples don't have swap. It seems likely to alleviate my need to meddle with arc_max, which would be great. However, I'd argue that it's still a bug that arc_max is apparently completely ignored. And now that I think about that, it's also still a bug that OOM-killing processes is preferred to swap OR evacuating ARC, unless that patch is fixing that also.
I'd swear I remember that fairly recently, I tried changing arc_max and top immediately showed the ARC chopped off at the new setting, and if I remember that right then this is clearly a regression...but details of that memory are vague at this point.
<jpaetzel> Is there any case where ARC usage can exceed arc_max legitimately?
<mahrens> I want to say no, but your example is with arc_max=128MB so there may be some exceptions at that small of a size (only 8 maximum-size blocks!). Also stuff can get pinned in memory (and can’t be evicted regardless of ARC settings) when it’s “in use”, which is the case for metadata of open files, mounted filesystems, etc.
It’s a hard limit on the *target* ARC size.
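The "8 maximum-size blocks" remark is just the limit divided by the block size. A minimal sketch, assuming a 16 MiB maximum block size, which is the value implied by the quote (128 MB / 8); the names are illustrative.

```python
MIB = 1024**2

# Assumption: 16 MiB, implied by "arc_max=128MB ... only 8 maximum-size blocks".
ASSUMED_MAX_BLOCK = 16 * MIB


def max_blocks(arc_max_bytes: int, block_bytes: int = ASSUMED_MAX_BLOCK) -> int:
    """How many maximum-size blocks fit within a given ARC limit."""
    return arc_max_bytes // block_bytes


if __name__ == "__main__":
    for limit_mib in (64, 128, 256):
        print(f"arc_max={limit_mib}M holds {max_blocks(limit_mib * MIB)} maximum-size blocks")
```

At limits this small, a handful of large in-use buffers can account for a visible fraction of the target, which is the kind of exception the quote allows for.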
So consider the numbers in this example: Are we saying that this behavior is "correct"?
I have a machine with 6GB of RAM and top shows the ARC is running at 2556 MB. arc_max was set to 256MB at boot, so it was overshooting by ~10x. I bumped arc_max to 2GB with sysctl, and the ARC usage stayed at 2556 MB.
$ sysctl vfs.zfs.arc_max=2147483648
vfs.zfs.arc_max: 268435456 -> 2147483648
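The byte values in that output are easier to read in MiB. A small sketch that parses the "name: old -> new" line sysctl prints when a value is changed; the regular expression and function name are illustrative and only cover integer values.

```python
import re

# Matches lines such as "vfs.zfs.arc_max: 268435456 -> 2147483648".
CHANGE_RE = re.compile(r"^(?P<name>[\w.]+):\s*(?P<old>\d+)\s*->\s*(?P<new>\d+)$")


def parse_change(line: str) -> tuple:
    """Return (name, old_value, new_value) from a sysctl change line."""
    match = CHANGE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sysctl change line: {line!r}")
    return match["name"], int(match["old"]), int(match["new"])


if __name__ == "__main__":
    name, old, new = parse_change("vfs.zfs.arc_max: 268435456 -> 2147483648")
    print(f"{name}: {old // 2**20} MiB -> {new // 2**20} MiB")
```

For the line above this confirms the change was from 256 MiB to 2048 MiB.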
Nope. I'm starting to get some traction that this is actually a bug.
Can you paste the entire ARC usage lines from top on the 6G system please?
Ok, what is going on is understood now.
kern.maxvnodes had been increased. The ZFS system has no way to reclaim vnodes when ARC needs to shrink.
Fixing maxvnodes seems to have sorted out the problem. Here's the top header, in case it bears any more investigation.
last pid: 8729; load averages: 0.16, 0.15, 0.10 up 0+18:24:10 11:07:00
36 processes: 1 running, 35 sleeping 0.0 100
CPU: 8040% user, 0.0% nice, 0.0% system, 0.0% interrupt, 0.0% idle
Mem: 796K Active, 27M Inact, 81M Laundry, 5631M Wired, 155M Free
ARC: 2639M Total, 643M MFU, 317M MRU, 972K Anon, 16M Header, 1663M Other
195M Compressed, 765M Uncompressed, 3.92:1 Ratio
When I cut maxvnodes from 1048576 to 524288, the ARC Total dropped from 2639MB to 1455MB over about 30 seconds. Further reducing maxvnodes brought the Total down below the configured arc_max.
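A back-of-envelope reading of those two data points is sketched below. It assumes, without confirmation, that the vnode table was full at both settings and that the whole drop came from reclaimed vnodes, so treat the result as an upper-bound guess at the ARC cost per cached vnode.

```python
MIB = 1024**2


def bytes_per_vnode(arc_before_mib: int, arc_after_mib: int,
                    vnodes_before: int, vnodes_after: int) -> float:
    """ARC bytes released per vnode removed from the limit (rough estimate)."""
    freed_bytes = (arc_before_mib - arc_after_mib) * MIB
    return freed_bytes / (vnodes_before - vnodes_after)


if __name__ == "__main__":
    # 2639MB -> 1455MB when maxvnodes went from 1048576 to 524288.
    cost = bytes_per_vnode(2639, 1455, 1048576, 524288)
    print(f"about {cost / 1024:.1f} KiB of ARC per vnode")
```

Under those assumptions each cached vnode was holding on to a little over 2 KiB of ARC.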
So I looked back at this today, and I found that even after having returned maxvnodes to the default, I still see it overrunning arc_max persistently by almost 2x. I can lower maxvnodes below the default (179499), and then arc_max is respected. Below are stats on the same system which has 6GB of RAM with arc_max set to 256MB, starting with maxvnodes at the default and then lowering it.
Granted, one can reasonably argue that anyone capable of tinkering with arc_max should be capable of tinkering with maxvnodes also. However, it is rather astonishing since there are suggestions to crank up maxvnodes to improve performance [wiki/ZFSTuningGuide] and to lower arc_max for small systems, but no mention of one affecting the other. So I'm adding this data to be of help if this is considered a significant bug.
$ sysctl vfs.zfs.arc_max
$ sysctl kern.maxvnodes
last pid: 95789; load averages: 0.14, 0.16, 0.11 up 8+21:45:16 10:27:57
40 processes: 2 running, 38 sleeping
CPU: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 20M Active, 251M Inact, 2769M Wired, 2856M Free
ARC: 456M Total, 87M MFU, 105M MRU, 1600K Anon, 8149K Header, 255M Other
40M Compressed, 154M Uncompressed, 3.87:1 Ratio
Lowering kern.maxvnodes below the default:
$ sysctl kern.maxvnodes=10240
kern.maxvnodes: 179499 -> 10240
last pid: 96130; load averages: 0.06, 0.12, 0.09 up 8+21:50:03 10:32:44
42 processes: 1 running, 41 sleeping
CPU: 0.4% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.2% idle
Mem: 21M Active, 233M Inact, 2773M Wired, 2868M Free
ARC: 148M Total, 43M MFU, 29M MRU, 1600K Anon, 7678K Header, 66M Other
40M Compressed, 154M Uncompressed, 3.86:1 Ratio
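To see which ARC component shrank, the two top headers can be compared field by field. A minimal sketch, assuming the "ARC:" line format shown above and 1024-based suffixes; the parser is illustrative and ignores the compression line.

```python
import re

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_arc_line(line: str) -> dict:
    """Parse 'ARC: 456M Total, 87M MFU, ...' into {'Total': bytes, ...}."""
    fields = {}
    for value, unit, label in re.findall(r"(\d+)([KMG])\s+(\w+)", line):
        fields[label] = int(value) * UNITS[unit]
    return fields


# The two headers from this comment: default maxvnodes, then maxvnodes=10240.
before = parse_arc_line(
    "ARC: 456M Total, 87M MFU, 105M MRU, 1600K Anon, 8149K Header, 255M Other")
after = parse_arc_line(
    "ARC: 148M Total, 43M MFU, 29M MRU, 1600K Anon, 7678K Header, 66M Other")

if __name__ == "__main__":
    for label in before:
        delta_mib = (after[label] - before[label]) / 1024**2
        print(f"{label:>6}: {delta_mib:+.1f} MiB")
```

Of the 308M drop in Total, 189M comes from the Other category, with the rest split between MFU and MRU.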
Can you get these sysctls when your ARC is over the limit?
(In reply to Allan Jude from comment #14)
Sure! Actually, the only two systems I have right now on 12 are not overrunning it (but they also use the defaults for kern.maxvnodes and vfs.zfs.arc_max). Looks like most of those sysctls were introduced in 12.
Still, here's what the 6GB system shows, which is still running 11.2. Maybe I can upgrade it to 12 and see what it says after this weekend. It's using the defaults for kern.maxvnodes and vfs.zfs.arc_max, and it's overrunning arc_max slightly.
Mem: 20M Active, 304M Inact, 5030M Wired, 541M Free
ARC: 996M Total, 188M MFU, 242M MRU, 1376K Anon, 22M Header, 542M Other
94M Compressed, 337M Uncompressed, 3.57:1 Ratio
$ sysctl kstat.zfs.misc.arcstats.other_size kstat.zfs.misc.arcstats.bonus_size kstat.zfs.misc.arcstats.dnode_size kstat.zfs.misc.arcstats.dbuf_size kstat.zfs.misc.arcstats.metadata_size
sysctl: unknown oid 'kstat.zfs.misc.arcstats.bonus_size'
sysctl: unknown oid 'kstat.zfs.misc.arcstats.dnode_size'
sysctl: unknown oid 'kstat.zfs.misc.arcstats.dbuf_size'
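When the same list of OIDs is queried on different releases, it helps to separate the values that exist from the ones that do not. A minimal sketch that splits captured sysctl output; the "unknown oid" wording is taken from the output above, while the numeric value in the sample is a placeholder and not from the report.

```python
import re

UNKNOWN_RE = re.compile(r"^sysctl: unknown oid '(?P<oid>[^']+)'$")
VALUE_RE = re.compile(r"^(?P<oid>[\w.]+): (?P<value>.*)$")


def split_sysctl_output(text: str) -> tuple:
    """Return ({oid: value}, [missing_oids]) from combined sysctl stdout/stderr."""
    values, missing = {}, []
    for line in text.splitlines():
        line = line.strip()
        # Check the error form first: it would also match the generic value pattern.
        if m := UNKNOWN_RE.match(line):
            missing.append(m["oid"])
        elif m := VALUE_RE.match(line):
            values[m["oid"]] = m["value"]
    return values, missing


# Placeholder value (12345) for illustration only; the error lines are from above.
SAMPLE = """\
kstat.zfs.misc.arcstats.other_size: 12345
sysctl: unknown oid 'kstat.zfs.misc.arcstats.bonus_size'
sysctl: unknown oid 'kstat.zfs.misc.arcstats.dnode_size'
sysctl: unknown oid 'kstat.zfs.misc.arcstats.dbuf_size'
"""

if __name__ == "__main__":
    values, missing = split_sysctl_output(SAMPLE)
    print("present:", sorted(values))
    print("missing:", missing)
```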
$ sysctl kern.maxvnodes vfs.zfs.arc_max
And just for fun, here's everything in kstat.zfs.misc.arcstats:
$ sysctl kstat.zfs.misc.arcstats
After upgrading to 12 and letting a backup cycle run, it seems to be behaving well with defaults. I will try to aggravate the system a bit and see if it'll run over.
Showing the good behavior currently:
last pid: 11816; load averages: 0.11, 0.17, 0.19 up 0+22:17:27 23:13:17
40 processes: 1 running, 39 sleeping
CPU: 0.0% user, 0.0% nice, 0.8% system, 0.4% interrupt, 98.9% idle
Mem: 21M Active, 242M Inact, 3365M Wired, 2266M Free
ARC: 736M Total, 204M MFU, 186M MRU, 1120K Anon, 20M Header, 325M Other
198M Compressed, 713M Uncompressed, 3.60:1 Ratio
Swap: 46G Total, 46G Free
$ sysctl vfs.zfs.arc_max kstat.zfs.misc.arcstats.other_size kstat.zfs.misc.arcstats.bonus_size kstat.zfs.misc.arcstats.dnode_size kstat.zfs.misc.arcstats.dbuf_size kstat.zfs.misc.arcstats.metadata_size
In 11, I could lower vfs.zfs.arc_max without rebooting. If I lowered it below the current allocation, it'd whack out a bunch of cached data to get within the new setting (maxvnodes problem notwithstanding).
In 12, it seems I can't lower it below the current allocation. In practice, this means one will rarely be able to lower it without rebooting (since one wouldn't be trying to lower it on a running system unless it was already so large it was causing a problem). It was nice to be able to change it to make more room for hungry things like InnoDB without rebooting.
Is that intentional? I guess there may be a good reason why, or maybe it backs off automatically better than in 11?
(In reply to Leif Pedersen from comment #17)
After upgrading to 12.0, and letting yesterday's cycle run on defaults, I cranked up maxvnodes to 1048576 (default was 179420) and squashed arc_max to 64M (default was 758M). After a backup cycle, it had madly overrun vfs.zfs.arc_max. The ARC usage shows 2449M now. (This machine runs rsync-style backups for me, so it churns through a lot of inodes -- a good test case for this.) Here are the sysctl stats you asked for:
$ sysctl vfs.zfs.arc_max kern.maxvnodes kstat.zfs.misc.arcstats.other_size kstat.zfs.misc.arcstats.bonus_size kstat.zfs.misc.arcstats.dnode_size kstat.zfs.misc.arcstats.dbuf_size kstat.zfs.misc.arcstats.metadata_size
last pid: 6873; load averages: 0.36, 0.24, 0.20 up 0+13:13:01 12:39:03
46 processes: 1 running, 45 sleeping
CPU: 0.4% user, 0.0% nice, 0.8% system, 0.4% interrupt, 98.5% idle
Mem: 24M Active, 291M Inact, 5389M Wired, 190M Free
ARC: 2449M Total, 440M MFU, 453M MRU, 1376K Anon, 15M Header, 1540M Other
181M Compressed, 713M Uncompressed, 3.94:1 Ratio
Swap: 46G Total, 46G Free
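One way to compare the top headers posted in this thread is the share of the ARC that sits in the "Other" category. A minimal sketch using the Total and Other figures (in MB) from three of those headers; the labels are my own summary of the conditions described, and linking "Other" to vnode-held metadata is the thread's working hypothesis rather than something this calculation proves.

```python
# (Total MB, Other MB) taken from top headers quoted in this thread.
SNAPSHOTS = {
    "maxvnodes=1048576, before lowering it": (2639, 1663),
    "12.0 with defaults": (736, 325),
    "12.0, maxvnodes=1048576, arc_max=64M": (2449, 1540),
}


def other_share(total_mb: int, other_mb: int) -> float:
    """Fraction of the ARC total reported under 'Other'."""
    return other_mb / total_mb


if __name__ == "__main__":
    for label, (total, other) in SNAPSHOTS.items():
        print(f"{label}: {other_share(total, other):.0%} Other")
```

In both overrun cases "Other" makes up close to two thirds of the ARC, against under half when running with defaults.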
A year or so ago, before I was aware of this bug, I had a sense that things weren't right on a notebook with 15.9 GiB memory. Probably 12.0-CURRENT at the time.
IIRC I chose to set vfs.zfs.arc_max to around 2 G. (I can't recall what made me choose that figure … something in bug 187594, maybe. I vaguely recall 2 G being around half of the default but that seems inconsistent with what I now read at <https://www.freebsd.org/doc/handbook/zfs-advanced.html>.)
Whilst I didn't take measurements, the end result was pleasing.
With a cleaner installation of 13.0-CURRENT (~2018-12), I have not (yet) felt the need to change vfs.zfs.arc_max
grahamperrin@momh167-gjp4-8570p:~ % date ; uname -v ; uptime
Sat 2 Mar 2019 11:43:54 GMT
FreeBSD 13.0-CURRENT r344443 GENERIC-NODEBUG
11:43a.m. up 41 mins, 6 users, load averages: 1.02, 0.70, 0.61
grahamperrin@momh167-gjp4-8570p:~ % sysctl -a | grep vfs.zfs.arc
grahamperrin@momh167-gjp4-8570p:~ % sysctl -a | grep vnode
Syncing disks, vnodes remaining... 0 0 0 0 0 0 done
Syncing disks, vnodes remaining... 0 0 0 0 0 0 done
Syncing disks, vnodes remaining... 0 0 0 0 0 done
Now added to /etc/sysctl.conf:
If I'm to add a setting for kern.maxvnodes (for test purposes), what would you suggest?
(In reply to Leif Pedersen from comment #17)
> In 11, I could lower vfs.zfs.arc_max without rebooting. If I lowered it below
> the current allocation, it'd whack out a bunch of cached data to get within
> the new setting
The freed memory still remains as Wired in top(1). Free memory you can't get at is of little use, which is probably the reason why lowering arc_max is forbidden in FreeBSD 12.