Bug 240061

Summary: MADV_FREE rewinds time to before fork()
Product: Base System Reporter: Nathaniel Filardo <nwf20>
Component: kernAssignee: Konstantin Belousov <kib>
Status: Closed FIXED    
Severity: Affects Only Me CC: alc, brooks, chris, kib, markj
Priority: ---    
Version: CURRENT   
Hardware: Any   
OS: Any   

Description Nathaniel Filardo 2019-08-23 20:46:33 UTC
MADV_FREE simply cleans pages associated with the current object backing the advisee region so that the act of page reclaimation will remove it if needed.

If there is no shadowed object, this is fine: if the page is removed, it will be demand-zeroed, and if it isn't removed, it will be as it was before MADV_FREE.

In the case there is a shadowed object (i.e., these pages are CoW; say, they were mapped before fork()), however, if the page is removed, it will not be demand zeroed: it will be copied up from the shadowed object.  It is therefore possible to selectively rewind time to before fork() and expose whatever information was in these pages.  (I have a test program that demonstrates the effect, but as it relies on inducing memory pressure, it is rude for a test suite; it would be nice if there were a synthetic way to cause the page reclaim subsystem to run even without true memory pressure.)

#1851 makes somewhat oblique reference to an issue with MADV_FREE and I presume that it is this behavior that is being commented upon, but it's hard to say for sure.  Perhaps this behavior was indeed eliminated in 1997, but it's back now.

In any case, I *believe*, not being very familiar with the VM subsystem, that vm_map_madvise() needs to...

  consider MADV_FREE a modify_map=true operation and...

  for each map entry within the advisee region, call vm_object_split() and then release the backing object...

  then call pmap_advise() and vm_object_madvise() on this new object.

There may well be a better way to do this; it may, in fact, be desirable to treat MADV_FREE as if it were mmap()-ing a new entry and new object when the advisee object is OBJT_DEFAULT or OBJT_SWAP; I am less sure what MADV_FREE's semantics should be if it's OBJT_VNODE, for example.
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2019-08-26 14:30:30 UTC
Out of curiosity, how did you come to notice this behaviour?

I think the solution you proposed would solve the problem.  We would want to ensure that the restrictions specified in vm_object_advice_applies() are checked before modifying the object chain.  In particular, MADV_FREE only marks pages clean at the MI layer if the object is anonymous and is mapped into no more than one address space.  For OBJT_VNODE mappings, MADV_FREE does not clear the MI dirty bits or requeue the page, but it will clear the dirty bit from PTEs in the specified range.
Comment 2 Konstantin Belousov freebsd_committer freebsd_triage 2019-08-26 16:10:33 UTC
Setting modify_map = true causes clipping.  I suspect that this would result in excessive fragmentation of the map entries backing the malloc() store.
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2019-08-26 16:22:51 UTC
(In reply to Konstantin Belousov from comment #2)
I suspect we can avoid modifying the map if the entry's object has no backing object.
Comment 4 Nathaniel Filardo 2019-08-26 20:16:37 UTC
> Out of curiosity, how did you come to notice this behaviour?

We've been studying heap temporal safety and wanted to make fairly strong assertions about the contents of, and aliasing of access to, heap memory.  (If you'd like to have a longer conversation, my inbox is open. :) )  In particular, I've been looking deeply at https://github.com/microsoft/snmalloc/ and was prodding at its use of MADV_FREE together with some in-kernel changes that crawls the address space.  I was hoping to have this crawler remove MADV_FREE'd pages as junk, but then wondered what would happen and then wrote a test and here we are.

Incidentally, I *think* this behaviour implies that pages existing before fork()s will be held by the system (possibly paged out 'n all that, but still, present), even if all of their shadowing pages have been copied up, so long as the shadowing objects continue to exist?  I don't suppose there's an easy, even if expensive, way to measure how many such "referenced but would never be found by vm_fault()" pages there are?

> I suspect that this would result in excessive fragmentation of the map entries backing the malloc() store.

That is certainly a risk.  In the case of snmalloc's use of MADV_FREE, though, it is generally followed by a mmap() MAP_ANON of the same space (possibly much later, when it re-allocates the address space).  I assume that these anonymous map entries, both arising from mmap() and MADV_FREE's (proposed) vm_object_split()-based changes are subject to simplification when adjacent?
Comment 5 Brooks Davis freebsd_committer freebsd_triage 2019-09-02 08:02:16 UTC
Here's a link to a potential patch against CheriBSD https://github.com/CTSRD-CHERI/cheribsd/pull/327/commits/733326ed7513ed0b9c5a9e5c4496a02f53d5d99f
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2019-09-03 15:01:33 UTC
One possible alternative is to convert MADV_FREE to MADV_DONTNEED if the object has a backing object.  pmap_advise() needs to be updated to preserve the modified bits, but this should be fine since vm_page_advise(MADV_FREE) calls vm_page_undirty() anyway.

Some basic testing on my workstation showed that virtually all calls to vm_object_madvise(MADV_FREE) are acting on an object with no backing object.  Obviously this may not be true in general.

# dtrace -n 'fbt::vm_object_madvise:entry /args[0] && args[3] == 5/{@[args[0]->backing_object == NULL] = count();}'
^C
        0               13
        1            17504

Untested patch here: https://people.freebsd.org/~markj/patches/madvise_cow.diff
Comment 7 commit-hook freebsd_committer freebsd_triage 2019-09-04 20:28:32 UTC
A commit references this bug:

Author: kib
Date: Wed Sep  4 20:28:16 UTC 2019
New revision: 351830
URL: https://svnweb.freebsd.org/changeset/base/351830

Log:
  madvise(MADV_FREE): Quick fix to time rewind.

  Don't free pages in a shadowing object.  While this degrades MADV_FREE
  to a no-op (and we could, instead, choose to fall back to
  MADV_DONTNEED, at the cost of changing pmap_madvise), this is
  presently considered a temporary fix. We may prefer to risk a little
  fragmentation of the map by creating a zero/OBJT_DEFAULT entry over
  top of the existing object and, simultaneously, revert to the existing
  marking any pages in the former shadowing object in the advised region
  as reclaimable.  At least one consumer of MADV_FREE (snmalloc) may use
  mmap() to construct zeroed pages "eventually" here anyway, so the
  fragmentation may be coming anyway.

  Submitted by:	Nathaniel Filardo <nwf20@cl.cam.ac.uk>
  PR:	240061
  Reviewed by:	markj
  MFC after:	1 week
  Differential revision:	https://reviews.freebsd.org/D21517

Changes:
  head/sys/vm/vm_map.c
Comment 8 Konstantin Belousov freebsd_committer freebsd_triage 2019-09-04 20:36:31 UTC
So IMO the next easy step would be to add vm_object_collapse() call to increase chances that we trimmed the inheritance chain after the fork.

As another corner case, you can check that no clipping is needed and then do what your original (large) patch did.
Comment 9 commit-hook freebsd_committer freebsd_triage 2019-09-11 04:55:24 UTC
A commit references this bug:

Author: kib
Date: Wed Sep 11 04:55:10 UTC 2019
New revision: 352202
URL: https://svnweb.freebsd.org/changeset/base/352202

Log:
  MFC r351830:
  madvise(MADV_FREE): Quick fix to time rewind.

  PR:	240061

Changes:
_U  stable/12/
  stable/12/sys/vm/vm_map.c
Comment 10 commit-hook freebsd_committer freebsd_triage 2019-09-11 05:00:29 UTC
A commit references this bug:

Author: kib
Date: Wed Sep 11 04:59:27 UTC 2019
New revision: 352203
URL: https://svnweb.freebsd.org/changeset/base/352203

Log:
  MFC r351830:
  madvise(MADV_FREE): Quick fix to time rewind.

  PR:	240061

Changes:
_U  stable/11/
  stable/11/sys/vm/vm_map.c