Created attachment 246673 [details]
sample patch that works around the problem

The tmpfs file system checks for available memory when a file is created, but not during writes. If I create a file and then run the system nearly out of memory, I can write to that file until processes are killed because the system is out of memory, and in some cases the system is so screwed up that a shutdown cannot be run. df shows continuously growing usage and size while showing 0 for free space all along; this can go on for gigabytes. Note, this is using a tmpfs file system with the default size.

I'll attach a patch that adds a check in the write path, which helps. That alone still runs the system out of memory and swap, so there are two additional changes in the patch: one divides swap space by two, and the other asks the VM system for the free memory in excess of the free target, using that in place of simply free memory. I am not convinced that any of the changes are correct, but they help considerably. With these changes, a large memory process keeps running, although swap is exhausted by the time the writes fail.

More details or tests available on request.
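For concreteness, here is a minimal sketch of the shape of that write-path check, assuming made-up helpers (free_pages_above_target(), swap_pages_free()) in place of the real VM and swap queries; it illustrates the three changes described above and is not the attached patch:

#include <sys/types.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real VM/swap queries. */
extern size_t free_pages_above_target(void);  /* free memory beyond the free target */
extern size_t swap_pages_free(void);          /* free swap space, in pages */

/* Small fixed reserve, roughly the existing 4 MB expressed in 4 KB pages. */
#define RESERVED_PAGES        (4UL * 1024 * 1024 / 4096)

/*
 * Allow a write that grows a tmpfs file by req_pages pages only if the
 * free memory above the VM free target plus half of the free swap can
 * cover the request while leaving the reserve untouched.
 */
static bool
tmpfs_write_pages_ok(size_t req_pages)
{
        size_t avail;

        avail = free_pages_above_target() + swap_pages_free() / 2;
        if (avail <= RESERVED_PAGES)
                return (false);
        return (avail - RESERVED_PAGES >= req_pages);
}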
I do not think this is the right approach.

Free pages/free target from the VM subsystem do not mean what people usually think they mean. The numbers only direct the pagedaemon activity; they are not indicative in any other way. If the system has enough swap and inactive pages, they can be converted into reusable pages without causing OOM/ENOMEM etc.

Similarly, I do not think it is right to do any limiting on tmpfs file resize. Tmpfs supports holes, and there is no reason to block that.

Tmpfs already has the mechanism to count actually allocated pages and to clamp the amount of pages allocated to a specific mount. It is the right thing to do IMO. Why wouldn't you use it?
(In reply to Konstantin Belousov from comment #1)

First, the primary purpose of this bug is to document that there is a problem with writes exceeding the stated limit on the file system size. The fact that I can write multiple gigabytes after df shows remaining space of 0 indicates a problem. I had no expectation that this workaround would be a correct fix, or I would have put it into a review.

I do think that the memory free target should be taken into account. If tmpfs attempts to use all of free memory, the VM system is likely to try to raise the free memory, paging/swapping as needed.

btw, my original reason for looking at this is that I misunderstood the meaning in the man page, which says that a size of 0 represents "the available amount of memory (including main memory and swap space)". I assumed that this included all of main memory, which is clearly excessive; instead it means the currently-available memory. But if we actually use all of memory + swap, the system will be killing processes (in this case, starting with my memory hog but then the shells, nfsd, and more).

Note that under the circumstances I am testing, tmpfs refuses to create a new file, but will write gigabytes to existing files. This seems to be due to the lack of any check on free space (whether stored or dynamically computed) in the write path, unlike file creation. My goal was to adjust the default limit so that it was reasonably safe, which I think also needs to be done.

I don't think my workaround introduces a problem with holes. The tmpfs_reg_resize routine is used only for writes, so the pages in the calculation will actually be written and recorded in the vm_object.

I think that, for file systems with no specific limit, the available memory limit needs to be shared by all tmpfs file systems. They can't each use all of memory + swap. Also, it probably doesn't make sense for file systems with a specific limit to attempt writes when there is insufficient memory available, unless the limit they are given is supposed to be a commitment, allowing other memory users to be killed. (The workaround doesn't distinguish between file systems with/without a specific size, in part because tmpfs_reg_resize doesn't currently refer to the file system itself.)

Does tmpfs actually store a current/recent memory limit for file systems that don't have a preset size? If it did, maybe higher levels would block writes. The size limit actually works for file systems with a non-zero size. However, a size limit would have to be refreshed periodically, or on demand. df will display an available size of 0, but statfs computes that on the fly.
tmpfs_setattr()->tmpfs_chsize()->tmpfs_truncate()->tmpfs_reg_resize()

The free target has nothing to do with the availability of the (managed AKA swappable) memory for usermode. If it is used, it would cause transient 'no memory' errors, mostly due to the pagedaemon not being able to cope with working set changes in a timely manner. This is the same as the requirement not to return transient ENOMEMs to userspace from malloc(M_NOWAIT) in the top-level syscall code.

By default, the system allows most consumers to eat as much memory as they want and can. For instance, you can mmap(MAP_ANON) an insanely large region and fault physical pages on demand (recent Chrome does exactly this, it mmaps 1Tb for heap). Or you could create a posix shm memory segment of that size. Or you can write a file on tmpfs. In each case, if the facility could be mis-used to exhaust memory (not physical RAM but pageable storage), then a specific limit should be applied. For instance, for tmpfs the 'size' mount option should be used to limit its total use.

The thing that is unclear to me: are you reporting that mount -o size=<limit> does not work? So that you can create and fill a file larger than the limit?
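To make the lazy-allocation point above concrete (an illustration of mine, not something from the thread), the reservation itself costs essentially nothing; only touching the memory consumes pages:

#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
        /* Reserve 64 GB of anonymous memory; no physical pages are used yet. */
        size_t len = 64UL * 1024 * 1024 * 1024;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");
                return (1);
        }
        /* Physical pages are allocated only as the region is touched. */
        memset(p, 1, 1024 * 1024);
        printf("reserved %zu bytes, touched 1 MB\n", len);
        return (0);
}

The per-mount counterpart is the limit mentioned above, e.g. something like mount -t tmpfs -o size=1g tmpfs /mnt/tmp.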
(In reply to Konstantin Belousov from comment #3)

> tmpfs_setattr()->tmpfs_chsize()->tmpfs_truncate()->tmpfs_reg_resize()

Sorry, I missed that, not sure how. In any case, a proper fix would account only for page additions.

> The thing that is unclear to me: are you reporting that mount -o size=<limit>
> does not work? So that you can create and fill a file larger than the limit?

No, that case works. The broken case is when size is omitted/default. I found the reason why; it's in tmpfs_can_alloc_page(). I tried adding a test there, and that works too, but not in time to keep the system from killing off processes.

I'll have to think about the memory competition issues. Unlike large user mappings, you can't kill tmpfs if it gets too greedy. And if it is /tmp, or otherwise writable by unprivileged users, such users can cause processes to be killed off and even hang the system (I just did it). Maybe it is necessary to sleep for memory sometimes. But it is not unexpected for a write to return ENOSPC on a nearly-full file system. And I think tmpfs is over-promising in this case.
Created attachment 246718 [details]
another approach to patching tmpfs_subr.c

I'm adding a revised patch to tmpfs_subr.c that works more like what happens when the size option is set. It still doesn't stop early enough, but that's a separate problem.
VM (almost) always ensures that there are several free pages. More, it even typically manages to free several pages in reasonable time. This is why our OOM handling is organized the way it is:

- global OOM triggers when VM cannot get a free page despite the existence of a page shortage, in all domains, for some time. It is typically triggered when the kernel allocates too many unmanaged pages (not the tmpfs case).

- per-process OOM triggers when the page fault handler needs a page and cannot allocate it after several cycles of allocation attempts.

I added the second (per-process) OOM since global OOM (similar to your patch) was not able to handle the typical situation with usermode sitting on too many dirty pages.

Now that I have formulated this, I think that for tmpfs a reasonable approach would be something along the lines of the per-process OOM: try the allocation, and return ENOSPC if it failed, with some criteria for restart. You might look at vm_fault_allocate_oom() in vm/vm_fault.c.
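A rough sketch of that retry-then-ENOSPC shape, with a made-up tmpfs_try_alloc_page() standing in for the non-sleeping allocation attempt and an arbitrary retry count as the restart criterion; this only illustrates the idea, it is not the logic in vm_fault_allocate_oom():

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <vm/vm.h>
#include <vm/vm_object.h>
#include <vm/vm_page.h>
#include <vm/vm_pageout.h>

#define TMPFS_ALLOC_RETRIES     3       /* illustrative restart criterion */

/* Hypothetical non-sleeping attempt to allocate a page for the object. */
extern vm_page_t tmpfs_try_alloc_page(vm_object_t obj, vm_pindex_t idx);

static int
tmpfs_alloc_page_or_enospc(vm_object_t obj, vm_pindex_t idx, vm_page_t *mp)
{
        int tries;

        for (tries = 0; tries < TMPFS_ALLOC_RETRIES; tries++) {
                *mp = tmpfs_try_alloc_page(obj, idx);
                if (*mp != NULL)
                        return (0);
                /* Let the page daemon try to reclaim before the next attempt. */
                vm_wait(obj);
        }
        return (ENOSPC);
}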
(In reply to Konstantin Belousov from comment #6)

> VM (almost) always ensures that there are several free pages. More, it even
> typically manages to free several pages in reasonable time.

Except for when it doesn't, e.g. when swap is filling up. It also may require killing processes in that case. Given that tmpfs is supposed to limit itself to "available" memory, killing processes means that it has overshot.

> Now that I have formulated this, I think that for tmpfs a reasonable approach
> would be something along the lines of the per-process OOM: try the allocation,
> and return ENOSPC if it failed, with some criteria for restart. You might look
> at vm_fault_allocate_oom() in vm/vm_fault.c.

I'm not sure, but tmpfs may go down that path if it can't get a page in uiomove_object(). That may be desirable in case of a transient memory shortage. I don't know what happens if it fails, though. But when swap is getting short, it seems better to fail before that point.

I really think the default limit for tmpfs is overly aggressive. If it really tries to take all of available memory + swap (less a small reserve, currently 4 MB), processes will be killed and the system is likely to require manual intervention. In the case of a transient memory shortage, blocking for memory is fine (and may be done already?).

I am now looking at a reserve that is scaled to memory + swap space, e.g. allowing tmpfs to use 95%. The percentage is controlled by a sysctl. This still includes a check for available space in the write path. (There is already a check in the create path.) This seems to work fairly well, based on limited test circumstances so far.
(In reply to Mike Karels from comment #7)

Being able to free one page is not the same as being able to free the needed number of pages; this is why 'global' OOM works only in specific situations.

Proposals to limit max tmpfs page use to some percentage of the total of memory and swap would not work. The problem is that some memory is 'managed', some is not. For instance, all file-backed pages should be considered free for the purpose of the limit calculation, otherwise an app mapping a large file and reading it would cause inexplicable ENOSPC failures from tmpfs. And conversely, allowing say 80% of total use by tmpfs does not prevent a situation where we do fill that limit, and then the kernel needs a lot of unmanaged memory, for instance for socket buffers due to bursts of network activity.

There is simply no reasonable answer to the question 'how to limit tmpfs physical memory consumption' without knowing the system load pattern. And if you know the load and care about memory limiting, there is already the size option.
(In reply to Konstantin Belousov from comment #8)

> Proposals to limit max tmpfs page use to some percentage of the total of
> memory and swap would not work.

My proof-of-concept does not use total memory + swap. Instead, it computes a reserve which is (100 - memlimit) percent of (free memory + swap) at the time of tmpfs_init(). This seems to work out reasonably in my test cases so far. I need to try with mapped files too. I'll attach my current patch.

> There is simply no reasonable answer to the question 'how to limit tmpfs
> physical memory consumption' without knowing the system load pattern.

There is no perfect solution, that is true. But when memory + swap is low enough, allowing tmpfs to proceed makes bad things happen, like killing processes and hanging the system. Backing off a little makes things better in at least some cases, which seems worthwhile to me. And tmpfs can report a failure (ENOSPC) without being killed, unlike processes touching memory. Yes, a compile may fail, etc., but at least that is related to the shortage. It seems bad if writes to /tmp by an unprivileged user cause root or other important processes to be killed.
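For reference, a sketch of that reserve calculation as described above. tmpfs_mem_avail() is assumed to be the existing helper that reports free memory + swap in pages, and tmpfs_set_reserve_from_percent() is the routine named later in this thread; the body here is my reading of the description, not the attached patch:

#include <sys/param.h>
#include <sys/systm.h>

/* Existing tmpfs helper (assumed): free memory + free swap, in pages. */
extern size_t tmpfs_mem_avail(void);

static int tmpfs_memory_percent = 95;   /* percent of memory + swap usable by tmpfs */
static long tmpfs_pages_reserved;       /* pages kept out of tmpfs's reach */

/*
 * Compute the reserve once, at tmpfs initialization time:
 * (100 - tmpfs_memory_percent) percent of the memory + swap available
 * at that moment.  Writes and file creation then fail with ENOSPC
 * rather than dipping into the reserve.
 */
static void
tmpfs_set_reserve_from_percent(void)
{
        size_t avail;

        avail = tmpfs_mem_avail();
        tmpfs_pages_reserved = howmany(avail * (100 - tmpfs_memory_percent), 100);
}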
Created attachment 246880 [details]
revised patch/proof of concept
So what would happen if swap is added or removed?

This cannot handle the situation where a program allocates anon memory. And by default we do not enforce any limits on anon memory use (swap accounting). We do not even disable overcommit.

I do not object (strongly) against the patch, but for me the introduced restriction looks very arbitrary. I might go as far as to propose adding some verbosity to the kernel to announce this limit in tmpfs_set_reserve_from_percent().

If you intend to commit the patch, please put it on phab; I have several trivial notes about the code.
(In reply to Konstantin Belousov from comment #11)

> So what would happen if swap is added or removed?

In theory the reserve should be increased or decreased, but that does not happen. However, the available space will be computed based on the new/current swap space available.

> This cannot handle the situation where a program allocates anon memory.

Right, it knows nothing about competing memory demands. I'll run a few more experiments and see if anything better comes to mind. If not, I'll put this into a review.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=ed19c0989fe77ec3c9d7bdb752bab6bbef4c0be6

commit ed19c0989fe77ec3c9d7bdb752bab6bbef4c0be6
Author:     Mike Karels <karels@FreeBSD.org>
AuthorDate: 2023-12-19 15:32:58 +0000
Commit:     Mike Karels <karels@FreeBSD.org>
CommitDate: 2023-12-19 15:32:58 +0000

    tmpfs: enforce size limit on writes when file system size is default

    tmpfs enforced the file system size limit on writes for file systems
    with a specified size, but not when the size was the default.  Add
    enforcement when the size is default: do not allocate additional pages
    if the available memory + swap falls to the reserve level.  Note,
    enforcement is also done when attempting to create a file, both with
    and without an explicit file system size.

    PR:             275436
    MFC after:      1 month
    Reviewed by:    cy
    Differential Revision:  https://reviews.freebsd.org/D43010

 sys/fs/tmpfs/tmpfs_subr.c | 2 ++
 1 file changed, 2 insertions(+)
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=636592343c3ec0feb61a4d8043676381384420dd

commit 636592343c3ec0feb61a4d8043676381384420dd
Author:     Mike Karels <karels@FreeBSD.org>
AuthorDate: 2023-12-19 15:33:33 +0000
Commit:     Mike Karels <karels@FreeBSD.org>
CommitDate: 2023-12-19 15:33:33 +0000

    tmpfs: increase memory reserve to a percent of available memory + swap

    The tmpfs memory reserve defaulted to 4 MB, and other than that, all of
    available memory + swap could be allocated to tmpfs files.  This was
    dangerous, as the page daemon attempts to keep some memory free, using
    up swap, and then resulting in processes being killed.  Increase the
    reserve to a fraction of available memory + swap at file system startup
    time.  The limit is expressed as a percentage of available memory +
    swap that can be used, and defaults to 95%.  The percentage can be
    changed via the vfs.tmpfs.memory_percent sysctl, recomputing the
    reserve with the new percentage but the initial available memory +
    swap.  Note that the reserve can also be set directly with an existing
    sysctl, ignoring the percentage.  The previous behavior can be
    specified by setting vfs.tmpfs.memory_percent to 100.

    Add sysctl for vfs.tmpfs.memory_percent and the pre-existing
    vfs.tmpfs.memory_reserved to tmpfs(5).

    PR:             275436
    MFC after:      1 month
    Reviewed by:    rgrimes
    Differential Revision:  https://reviews.freebsd.org/D43011

 share/man/man5/tmpfs.5    | 18 ++++++++++++++++++
 sys/fs/tmpfs/tmpfs.h      |  8 ++++++++
 sys/fs/tmpfs/tmpfs_subr.c | 40 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 65 insertions(+), 1 deletion(-)
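As a usage note (my summary of the commit message above, not separately verified): the percentage can be inspected or changed at runtime with the vfs.tmpfs.memory_percent sysctl, setting it to 100 restores the previous behavior, and the pre-existing vfs.tmpfs.memory_reserved sysctl can still be used to set the reserve directly, ignoring the percentage.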
^Triage: with Bugmeister hat, reset assignee.
Mike Karels committed these changes to head but passed away before doing any MFC's. Since he is not going to be able to handle any fallout from an MFC, I am closing this report and will just let any issues with these changes be handled in head.