MADV_FREE simply cleans pages associated with the current object backing the advisee region so that the act of page reclaimation will remove it if needed. If there is no shadowed object, this is fine: if the page is removed, it will be demand-zeroed, and if it isn't removed, it will be as it was before MADV_FREE. In the case there is a shadowed object (i.e., these pages are CoW; say, they were mapped before fork()), however, if the page is removed, it will not be demand zeroed: it will be copied up from the shadowed object. It is therefore possible to selectively rewind time to before fork() and expose whatever information was in these pages. (I have a test program that demonstrates the effect, but as it relies on inducing memory pressure, it is rude for a test suite; it would be nice if there were a synthetic way to cause the page reclaim subsystem to run even without true memory pressure.) #1851 makes somewhat oblique reference to an issue with MADV_FREE and I presume that it is this behavior that is being commented upon, but it's hard to say for sure. Perhaps this behavior was indeed eliminated in 1997, but it's back now. In any case, I *believe*, not being very familiar with the VM subsystem, that vm_map_madvise() needs to... consider MADV_FREE a modify_map=true operation and... for each map entry within the advisee region, call vm_object_split() and then release the backing object... then call pmap_advise() and vm_object_madvise() on this new object. There may well be a better way to do this; it may, in fact, be desirable to treat MADV_FREE as if it were mmap()-ing a new entry and new object when the advisee object is OBJT_DEFAULT or OBJT_SWAP; I am less sure what MADV_FREE's semantics should be if it's OBJT_VNODE, for example.
Out of curiosity, how did you come to notice this behaviour? I think the solution you proposed would solve the problem. We would want to ensure that the restrictions specified in vm_object_advice_applies() are checked before modifying the object chain. In particular, MADV_FREE only marks pages clean at the MI layer if the object is anonymous and is mapped into no more than one address space. For OBJT_VNODE mappings, MADV_FREE does not clear the MI dirty bits or requeue the page, but it will clear the dirty bit from PTEs in the specified range.
Setting modify_map = true causes clipping. I suspect that this would result in excessive fragmentation of the map entries backing the malloc() store.
(In reply to Konstantin Belousov from comment #2) I suspect we can avoid modifying the map if the entry's object has no backing object.
> Out of curiosity, how did you come to notice this behaviour? We've been studying heap temporal safety and wanted to make fairly strong assertions about the contents of, and aliasing of access to, heap memory. (If you'd like to have a longer conversation, my inbox is open. :) ) In particular, I've been looking deeply at https://github.com/microsoft/snmalloc/ and was prodding at its use of MADV_FREE together with some in-kernel changes that crawls the address space. I was hoping to have this crawler remove MADV_FREE'd pages as junk, but then wondered what would happen and then wrote a test and here we are. Incidentally, I *think* this behaviour implies that pages existing before fork()s will be held by the system (possibly paged out 'n all that, but still, present), even if all of their shadowing pages have been copied up, so long as the shadowing objects continue to exist? I don't suppose there's an easy, even if expensive, way to measure how many such "referenced but would never be found by vm_fault()" pages there are? > I suspect that this would result in excessive fragmentation of the map entries backing the malloc() store. That is certainly a risk. In the case of snmalloc's use of MADV_FREE, though, it is generally followed by a mmap() MAP_ANON of the same space (possibly much later, when it re-allocates the address space). I assume that these anonymous map entries, both arising from mmap() and MADV_FREE's (proposed) vm_object_split()-based changes are subject to simplification when adjacent?
Here's a link to a potential patch against CheriBSD https://github.com/CTSRD-CHERI/cheribsd/pull/327/commits/733326ed7513ed0b9c5a9e5c4496a02f53d5d99f
One possible alternative is to convert MADV_FREE to MADV_DONTNEED if the object has a backing object. pmap_advise() needs to be updated to preserve the modified bits, but this should be fine since vm_page_advise(MADV_FREE) calls vm_page_undirty() anyway. Some basic testing on my workstation showed that virtually all calls to vm_object_madvise(MADV_FREE) are acting on an object with no backing object. Obviously this may not be true in general. # dtrace -n 'fbt::vm_object_madvise:entry /args[0] && args[3] == 5/{@[args[0]->backing_object == NULL] = count();}' ^C 0 13 1 17504 Untested patch here: https://people.freebsd.org/~markj/patches/madvise_cow.diff
A commit references this bug: Author: kib Date: Wed Sep 4 20:28:16 UTC 2019 New revision: 351830 URL: https://svnweb.freebsd.org/changeset/base/351830 Log: madvise(MADV_FREE): Quick fix to time rewind. Don't free pages in a shadowing object. While this degrades MADV_FREE to a no-op (and we could, instead, choose to fall back to MADV_DONTNEED, at the cost of changing pmap_madvise), this is presently considered a temporary fix. We may prefer to risk a little fragmentation of the map by creating a zero/OBJT_DEFAULT entry over top of the existing object and, simultaneously, revert to the existing marking any pages in the former shadowing object in the advised region as reclaimable. At least one consumer of MADV_FREE (snmalloc) may use mmap() to construct zeroed pages "eventually" here anyway, so the fragmentation may be coming anyway. Submitted by: Nathaniel Filardo <nwf20@cl.cam.ac.uk> PR: 240061 Reviewed by: markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21517 Changes: head/sys/vm/vm_map.c
So IMO the next easy step would be to add vm_object_collapse() call to increase chances that we trimmed the inheritance chain after the fork. As another corner case, you can check that no clipping is needed and then do what your original (large) patch did.
A commit references this bug: Author: kib Date: Wed Sep 11 04:55:10 UTC 2019 New revision: 352202 URL: https://svnweb.freebsd.org/changeset/base/352202 Log: MFC r351830: madvise(MADV_FREE): Quick fix to time rewind. PR: 240061 Changes: _U stable/12/ stable/12/sys/vm/vm_map.c
A commit references this bug: Author: kib Date: Wed Sep 11 04:59:27 UTC 2019 New revision: 352203 URL: https://svnweb.freebsd.org/changeset/base/352203 Log: MFC r351830: madvise(MADV_FREE): Quick fix to time rewind. PR: 240061 Changes: _U stable/11/ stable/11/sys/vm/vm_map.c