Bug 275436 - tmpfs does not honor memory limits on writes
Summary: tmpfs does not honor memory limits on writes
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 15.0-CURRENT
Hardware: Any
OS: Any
Importance: --- Affects Some People
Assignee: Mike Karels
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-11-29 19:19 UTC by Mike Karels
Modified: 2024-01-10 03:42 UTC
CC List: 1 user

See Also:
linimon: mfc-stable14?
linimon: mfc-stable13?


Attachments
sample patch that works around the problem (1.50 KB, patch)
2023-11-29 19:19 UTC, Mike Karels
no flags
another approach to patching tmpfs_subr.c (1.19 KB, patch)
2023-12-01 23:10 UTC, Mike Karels
no flags
revised patch/proof of concept (2.90 KB, patch)
2023-12-07 18:43 UTC, Mike Karels
no flags

Description Mike Karels 2023-11-29 19:19:02 UTC
Created attachment 246673
sample patch that works around the problem

The tmpfs file system checks for available memory when a file is created, but not during writes.  If I create a file, then run the system nearly out of memory, I can write to that file until processes are killed because the system is out of memory, and in some cases the system is so screwed up that a shutdown cannot be run.  A df shows continuously growing usage and size while showing 0 for free space all along; this can go on for gigabytes.  Note, this is using a tmpfs file system with the default size.

I'll attach a patch that adds a check in the write path, which helps.  The check alone still runs the system out of memory and swap, so the patch makes two additional changes: one counts only half of the available swap space, and the other asks the VM system for the free memory in excess of the free target, using that in place of the raw free-memory count.  I am not convinced that any of these changes is correct, but they help considerably.  With them in place, a large-memory process keeps running even though swap is exhausted by the time the writes fail.  More details or tests are available on request.
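For illustration, here is a minimal sketch of the kind of check just described.  This is not the attached patch: vm_free_count(), vm_cnt.v_free_target, and swap_pager_avail are real kernel symbols, but the helper name and the exact arithmetic here are hypothetical.

    /*
     * Hypothetical availability check: count only free pages above the
     * pagedaemon's free target, plus half of the remaining swap space.
     */
    static bool
    example_tmpfs_mem_ok(size_t req_pages)
    {
            size_t avail;

            avail = vm_free_count() > vm_cnt.v_free_target ?
                vm_free_count() - vm_cnt.v_free_target : 0;
            avail += swap_pager_avail / 2;
            return (avail >= req_pages);
    }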
Comment 1 Konstantin Belousov 2023-11-30 20:51:54 UTC
I do not think this is the right approach.  The free pages/free target numbers from
the VM subsystem are not what people usually take them to be.  They only direct
pagedaemon activity; they are not indicative in any other way.  If the
system has enough swap and inactive pages, those can be converted into reusable
pages without causing OOM/ENOMEM etc.

Similarly, I do not think it is right to do any limiting on tmpfs file resize.
Tmpfs supports holes, and there is no reason to block that.

Tmpfs already has a mechanism to count actually allocated pages and to clamp
the number of pages allocated to a specific mount.  It is the right thing to do, IMO.
Why wouldn't you use it?
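As a hedged illustration of that per-mount clamp: tm_pages_max and tm_pages_used are real fields of struct tmpfs_mount, but this helper is a sketch rather than the actual tmpfs code, and locking is elided.

    /*
     * Illustrative per-mount clamp: refuse an allocation that would push
     * the mount past its configured page limit.  For this sketch,
     * tm_pages_max == 0 is treated as "no explicit limit".
     */
    static bool
    example_pages_check_avail(struct tmpfs_mount *tm, size_t req_pages)
    {
            if (tm->tm_pages_max != 0 &&
                tm->tm_pages_used + req_pages > tm->tm_pages_max)
                    return (false);
            return (true);
    }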
Comment 2 Mike Karels 2023-11-30 22:37:00 UTC
(In reply to Konstantin Belousov from comment #1)
First, the primary purpose of this bug is to document that there is a problem with writes exceeding the stated limit on the file system size.  The fact that I can write multiple gigabytes after df shows remaining space of 0 indicates a problem.

I had no expectation that this workaround would be a correct fix, or I would have put it into a review.

I do think that the memory free target should be taken into account.  If tmpfs attempts to use all of free memory, the VM system is likely to try to raise the free memory, paging/swapping as needed.

Btw, my original reason for looking at this is that I misunderstood the meaning in the man page, which says that a size of 0 represents "the available amount of memory (including main memory and swap space)".  I assumed that this included all of main memory, which is clearly excessive, and instead it means the currently-available memory.  But if we actually use all of memory + swap, the system will be killing processes (in this case, starting with my memory hog but then the shells, nfsd, and more).

Note that under the circumstances I am testing, tmpfs refuses to create a new file, but will write gigabytes to existing files.  This seems to be due to the lack of any check on free space (whether stored or dynamically-computed) in the write path, unlike file creation.  My goal was to adjust the default limit so that it was reasonably safe, which I think also needs to be done.

I don't think my workaround introduces a problem with holes.  The tmpfs_reg_resize routine is used only for writes, so the pages in the calculation will actually be written and recorded in the vm_object.

I think that, for file systems with no specific limit, the available memory limit needs to be shared by any tmpfs file systems.  They can't each use all of memory + swap.  Also, it probably doesn't make sense for file systems with a specific limit to attempt writes when there is insufficient memory available, unless the limit they are given is supposed to be a commitment, allowing other memory users to be killed.  (The workaround doesn't distinguish between file systems with/without a specific size, in part because tmpfs_reg_resize doesn't currently refer to the file system itself.)

Does tmpfs actually store a current/recent memory limit for file systems that don't have a preset size?  If it did, maybe higher levels would block writes. The size limit actually works for file systems with a non-zero size.  However, a size limit would have to be refreshed periodically, or on demand.  df will display an available size of 0, but statfs computes that on the fly.
Comment 3 Konstantin Belousov 2023-12-01 01:14:33 UTC
tmpfs_setattr()->tmpfs_chsize()->tmpfs_truncate()->tmpfs_reg_resize()

The free target has nothing to do with the availability of (managed, AKA
swappable) memory for usermode.  If it were used, it would cause transient 'no
memory' errors, mostly due to the pagedaemon not being able to cope with working-set
changes in a timely manner.  This is the same as the requirement not to return
transient ENOMEMs to userspace from malloc(M_NOWAIT) in the top-level syscall code.

By default, the system allows most consumers to eat as much memory as they want
and can.  For instance, you can mmap(MAP_ANON) an insanely large region and
fault physical pages in on demand (recent Chrome does exactly this: it mmaps
1 TB for its heap).  Or you could create a POSIX shm segment of that size.
Or you can write a file on tmpfs.
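A minimal userland illustration of that overcommit behavior (a sketch; the 1 TiB figure just mirrors the Chrome example above, and a 64-bit system is assumed):

    /*
     * Reserve a huge anonymous region; physical pages are allocated
     * only when the memory is actually touched.
     */
    #include <sys/mman.h>
    #include <stdio.h>

    int
    main(void)
    {
            size_t sz = 1UL << 40;  /* 1 TiB of address space */
            char *p;

            p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_PRIVATE, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return (1);
            }
            p[0] = 1;       /* faults in a single page */
            printf("mapped %zu bytes at %p\n", sz, (void *)p);
            return (0);
    }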

In each case, if the facility could be mis-used to exhaust memory (not physical
RAM but pageable storage), then a specific limit should be applied.  For instance,
for tmpfs the 'size' mount option should be used to limit its total use.
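For completeness, a hedged userland sketch of setting such a limit via nmount(2), roughly equivalent to "mount -t tmpfs -o size=512m tmpfs /mnt/tmp"; the mount point and size value are arbitrary examples:

    #include <sys/param.h>
    #include <sys/mount.h>
    #include <sys/uio.h>
    #include <err.h>
    #include <string.h>

    /* Append one name/value pair to the nmount(2) option list. */
    static void
    add_opt(struct iovec *iov, int *i, const char *name, const char *val)
    {
            iov[*i].iov_base = __DECONST(char *, name);
            iov[*i].iov_len = strlen(name) + 1;
            (*i)++;
            iov[*i].iov_base = __DECONST(char *, val);
            iov[*i].iov_len = strlen(val) + 1;
            (*i)++;
    }

    int
    main(void)
    {
            struct iovec iov[8];
            int i = 0;

            add_opt(iov, &i, "fstype", "tmpfs");
            add_opt(iov, &i, "fspath", "/mnt/tmp");
            add_opt(iov, &i, "from", "tmpfs");
            add_opt(iov, &i, "size", "512m");
            if (nmount(iov, i, 0) == -1)
                    err(1, "nmount");
            return (0);
    }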

The thing that is unclear to me: are you reporting that mount -o size=<limit>
does not work, i.e., that you can create and fill a file larger than the limit?
Comment 4 Mike Karels 2023-12-01 02:19:11 UTC
(In reply to Konstantin Belousov from comment #3)
> tmpfs_setattr()->tmpfs_chsize()->tmpfs_truncate()->tmpfs_reg_resize()

Sorry, I missed that, not sure how.  In any case, a proper fix would account only for page additions.

> The thing that is unclear to me: are you reporting that mount -o size=<limit>
> does not work, i.e., that you can create and fill a file larger than the limit?

No, that case works.  The broken case is when size is omitted/default.  I found the reason: it's in tmpfs_can_alloc_page().  I tried adding a test there, and that works too, but not in time to keep the system from killing off processes.

I'll have to think about the memory competition issues.  Unlike large user mappings, you can't kill tmpfs if it gets too greedy.  And if it is /tmp, or otherwise writable by unprivileged users, such users can cause processes to be killed off and even to hang the system (I just did it).  Maybe it is necessary to sleep for memory sometimes.  But it is not unexpected for a write to return ENOSPC on a nearly-full file system.  And I think tmpfs is over-promising in this case.
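A hedged sketch of the kind of test just mentioned: tmpfs_can_alloc_page() and tmpfs_mem_avail() are real functions in tmpfs_subr.c, but this guard is illustrative, not the actual change, and its placement may differ.

    /*
     * Illustrative shape of a test in the page-allocation path:
     * refuse to allocate once the computed available memory + swap
     * is exhausted.
     */
    static bool
    example_can_alloc_page(void)
    {
            return (tmpfs_mem_avail() >= 1);
    }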
Comment 5 Mike Karels 2023-12-01 23:10:40 UTC
Created attachment 246718
another approach to patching tmpfs_subr.c

I'm attaching a revised patch to tmpfs_subr.c that works more like the existing code path used when the size option is set.  It still doesn't stop early enough, but that's a separate problem.
Comment 6 Konstantin Belousov 2023-12-03 08:04:51 UTC
The VM (almost) always ensures that there are several free pages.  Moreover, it
typically manages to free several pages in a reasonable time.

This is why our OOM is organized the way it is:
- global OOM triggers when the VM cannot get a free page despite an existing
  page shortage, in all domains, for some time.  It is typically triggered
  when the kernel allocates too many unmanaged pages (not the tmpfs case).
- per-process OOM triggers when the page fault handler needs a page and cannot
  allocate it after several cycles of allocation attempts.

I added the second (per-process) OOM since the global OOM (similar to your patch)
was not able to handle the typical situation of usermode sitting on too many
dirty pages.

Now that I have formulated this, I think that for tmpfs a reasonable approach would
be something along the lines of the per-process OOM: try the allocation, and return
ENOSPC if it fails, with some criterion for restarting.  You might look at
vm_fault_allocate_oom() in vm/vm_fault.c.
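A hedged sketch of that try-then-fail pattern: vm_page_alloc() and vm_wait() are real kernel interfaces, but the helper, the retry count, and the restart criterion are hypothetical, and object locking is elided.

    /*
     * Illustrative allocate-with-retry in the spirit of the per-process
     * OOM: retry a few times, letting the pagedaemon make progress,
     * then give up with ENOSPC instead of hanging or killing processes.
     */
    static int
    example_alloc_page(vm_object_t obj, vm_pindex_t idx, vm_page_t *mp)
    {
            int tries;

            for (tries = 0; tries < 3; tries++) {
                    *mp = vm_page_alloc(obj, idx, VM_ALLOC_NORMAL);
                    if (*mp != NULL)
                            return (0);
                    vm_wait(obj);   /* sleep until pages may be freed */
            }
            return (ENOSPC);
    }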
Comment 7 Mike Karels 2023-12-04 19:50:15 UTC
(In reply to Konstantin Belousov from comment #6)

> The VM (almost) always ensures that there are several free pages.  Moreover, it
> typically manages to free several pages in a reasonable time.

Except when it doesn't, e.g. when swap is filling up.  It may also require killing processes in that case.  Given that tmpfs is supposed to limit itself to "available" memory, killing processes means that it has overshot.

> Now that I formulated this, I think that for tmpfs a reasonable approach would
> be something in line of per-process OOM: try the allocation, and return ENOSPC
> if it failed, with some criteria for restart.  You might look at vm/vm_fault.c
> vm_fault_allocate_oom().

I'm not sure, but tmpfs may go down that path if it can't get a page in uiomove_object().  That may be desirable in the case of a transient memory shortage.  I don't know what happens if it fails, though.  But when swap is getting short, it seems better to fail before that point.

I really think the default limit for tmpfs is overly aggressive.  If it really tries to take all of available memory + swap (less a small reserve, currently 4 MB), processes will be killed and the system is likely to require manual intervention.  In the case of a transient memory shortage, blocking for memory is fine (and may be done already?).  I am now looking at a reserve that is scaled to memory + swap space, e.g. allowing tmpfs to use 95%.  The percentage is controlled by a sysctl.  This still includes a check for available space in the write path.  (There is already a check in the create path.)  This seems to work fairly well, based on the limited test circumstances so far.
Comment 8 Konstantin Belousov 2023-12-06 23:36:58 UTC
(In reply to Mike Karels from comment #7)
Being able to free one page is not the same as being able to free the needed number
of pages; this is why the 'global' OOM works only in specific situations.

Proposals to limit tmpfs's maximum page use to some percentage of the total of
memory and swap would not work.  The problem is that some memory is 'managed' and
some is not.  For instance, all file-backed pages should be considered
free for the purpose of the limit calculation; otherwise an app mapping
a large file and reading it would cause unexplainable ENOSPC failures from
tmpfs.  And in reverse, allowing say 80% of total use by tmpfs does not prevent
a situation where we do fill that limit, and then the kernel needs a lot of unmanaged
memory, for instance for socket buffers due to a burst of network activity.

There is simply no reasonable answer to the question 'how to limit tmpfs physical
memory consumption' without knowing the system load pattern.  And if you know
the load and care about memory limiting, there is already the size option.
Comment 9 Mike Karels 2023-12-07 18:42:24 UTC
(In reply to Konstantin Belousov from comment #8)

> Proposals to limit tmpfs's maximum page use to some percentage of the total of
> memory and swap would not work.

My proof-of-concept does not use total memory + swap.  Instead, it computes a reserve of (100 - memlimit)% of (free memory + swap) at tmpfs_init() time.  This seems to work out reasonably in my test cases so far.  I need to try with mapped files too.  I'll attach my current patch.
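A hedged sketch of that computation: tmpfs_pages_reserved is a real tmpfs variable, but this helper and its arithmetic are a reading of the proof of concept, not the committed code.

    /*
     * Illustrative reserve computation at tmpfs_init() time:
     * reserve = (100 - memory_percent)% of (free memory + swap),
     * measured in pages.
     */
    static int tmpfs_memory_percent = 95;   /* sysctl-controlled */

    static void
    example_set_reserve(void)
    {
            size_t avail;

            avail = vm_free_count() + swap_pager_avail;
            tmpfs_pages_reserved =
                avail * (100 - tmpfs_memory_percent) / 100;
    }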

> There is simply no reasonable answer to the question 'how to limit tmpfs physical
> memory consumption' without knowing the system load pattern.

There is no perfect solution, that is true.  But when memory + swap is low enough, allowing tmpfs to proceed makes bad things happen, like killing processes and hanging the system.  Backing off a little makes things better in at least some cases, which seems worthwhile to me.  And tmpfs can report a failure (ENOSPC) without being killed, unlike processes touching memory.  Yes, a compile may fail, etc., but at least that is related to the shortage.  It seems bad if writes to /tmp by an unprivileged user cause root or other important processes to be killed.
Comment 10 Mike Karels 2023-12-07 18:43:38 UTC
Created attachment 246880
revised patch/proof of concept
Comment 11 Konstantin Belousov 2023-12-11 05:50:43 UTC
So what would happen if swap is added or removed?

This cannot handle the situation where a program allocates anonymous memory.  And by
default we do not enforce any limits on anonymous memory use (swap accounting).
We do not even disable overcommit.

I do not object (strongly) to the patch, but to me the introduced
restriction looks very arbitrary.  I might go as far as proposing to add
verbosity to the kernel to announce this limit in tmpfs_set_reserve_from_percent().

If you intend to commit the patch, please put it on phab; I have several
trivial notes about the code.
Comment 12 Mike Karels 2023-12-11 19:13:01 UTC
(In reply to Konstantin Belousov from comment #11)
> So what would happen if swap is added or removed?

In theory the reserve should be increased or decreased, but that does not happen.  However, the available space will be computed based on the new/current swap space available.

> This cannot handle situation when a program allocates anon memory.

Right, it knows nothing about competing memory demands.

I'll run a few more experiments and see if anything better comes to mind.  If not, I'll put this into a review.
Comment 13 commit-hook 2023-12-19 15:34:25 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=ed19c0989fe77ec3c9d7bdb752bab6bbef4c0be6

commit ed19c0989fe77ec3c9d7bdb752bab6bbef4c0be6
Author:     Mike Karels <karels@FreeBSD.org>
AuthorDate: 2023-12-19 15:32:58 +0000
Commit:     Mike Karels <karels@FreeBSD.org>
CommitDate: 2023-12-19 15:32:58 +0000

    tmpfs: enforce size limit on writes when file system size is default

    tmpfs enforced the file system size limit on writes for file systems
    with a specified size, but not when the size was the default.  Add
    enforcement when the size is default: do not allocate additional
    pages if the available memory + swap falls to the reserve level.
    Note, enforcement is also done when attempting to create a file,
    both with and without an explicit file system size.

    PR:             275436
    MFC after:      1 month
    Reviewed by:    cy
    Differential Revision:  https://reviews.freebsd.org/D43010

 sys/fs/tmpfs/tmpfs_subr.c | 2 ++
 1 file changed, 2 insertions(+)
Comment 14 commit-hook 2023-12-19 15:34:26 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=636592343c3ec0feb61a4d8043676381384420dd

commit 636592343c3ec0feb61a4d8043676381384420dd
Author:     Mike Karels <karels@FreeBSD.org>
AuthorDate: 2023-12-19 15:33:33 +0000
Commit:     Mike Karels <karels@FreeBSD.org>
CommitDate: 2023-12-19 15:33:33 +0000

    tmpfs: increase memory reserve to a percent of available memory + swap

    The tmpfs memory reserve defaulted to 4 MB, and other than that,
    all of available memory + swap could be allocated to tmpfs files.
    This was dangerous, as the page daemon attempts to keep some memory
    free, using up swap, and then resulting in processes being killed.
    Increase the reserve to a fraction of available memory + swap at
    file system startup time.  The limit is expressed as a percentage
    of available memory + swap that can be used, and defaults to 95%.
    The percentage can be changed via the vfs.tmpfs.memory_percent sysctl,
    recomputing the reserve with the new percentage but the initial
    available memory + swap.  Note that the reserve can also be set
    directly with an existing sysctl, ignoring the percentage.  The
    previous behavior can be specified by setting vfs.tmpfs.memory_percent
    to 100.

    Add sysctl for vfs.tmpfs.memory_percent and the pre-existing
    vfs.tmpfs.memory_reserved to tmpfs(5).

    PR:             275436
    MFC after:      1 month
    Reviewed by:    rgrimes
    Differential Revision:  https://reviews.freebsd.org/D43011

 share/man/man5/tmpfs.5    | 18 ++++++++++++++++++
 sys/fs/tmpfs/tmpfs.h      |  8 ++++++++
 sys/fs/tmpfs/tmpfs_subr.c | 40 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 65 insertions(+), 1 deletion(-)
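For reference, a hedged userland sketch of reading the two knobs named in the commit above via sysctlbyname(3); the value types (int and long) are assumptions, not confirmed by the source.

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
            int pct;
            long reserved;
            size_t len;

            len = sizeof(pct);
            if (sysctlbyname("vfs.tmpfs.memory_percent", &pct, &len,
                NULL, 0) == 0)
                    printf("memory_percent: %d\n", pct);

            len = sizeof(reserved);
            if (sysctlbyname("vfs.tmpfs.memory_reserved", &reserved, &len,
                NULL, 0) == 0)
                    printf("memory_reserved: %ld\n", reserved);
            return (0);
    }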