Bug 230160 - linuxulator doesn't implement madvise(MADV_DONTNEED) and any MADV_ flags with values >= 8 correctly
Summary: linuxulator doesn't implement madvise(MADV_DONTNEED) and any MADV_ flags with...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Some People
Assignee: freebsd-emulation (Nobody)
URL:
Keywords:
Depends on:
Blocks: 247219
  Show dependency treegraph
 
Reported: 2018-07-29 18:47 UTC by Mark Johnston
Modified: 2020-08-24 20:03 UTC (History)
5 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Johnston freebsd_committer freebsd_triage 2018-07-29 18:47:25 UTC
Our emulation of Linux's madvise(2) appears to just call sys_madvise() directly:

{ AS(madvise_args), (sy_call_t *)sys_madvise, AUE_MADVISE, NULL, 0, 0, 0, SY_THR_STATIC },

This sort of works for the "standard" MADV_* constants since their values are the same on FreeBSD and Linux.  However, Linux has a somewhat surprising behaviour for MADV_DONTNEED: if applied to a mapping of anonymous memory, the next access will return a zero-filled page.  That is, applying MADV_DONTNEED causes eager reclamation of pages in the range.  In contrast, our MADV_DONTNEED implementation just causes affected pages to skip LRU, and our MADV_FREE allows lazy reclamation of affected pages.  Some popular software, e.g., jemalloc, expects Linux's behaviour and won't work as expected under the Linuxulator because it just passes the parameters straight through to the native madvise(2) implementation.

Some of Linux's other advice parameters are implemented on FreeBSD using minherit(2) (e.g., MADV_WIPEONFORK, MADV_DONTFORK).
Comment 1 David Chisnall freebsd_committer freebsd_triage 2019-01-08 11:15:32 UTC
It's actually worse than as described.  Linux's value for `MADV_DONTNEED` is 8, which corresponds to FreeBSD's `MADV_NOCORE`, so we're not even getting the FreeBSD `MADV_DONTNEED` behaviour.

This test program demonstrates the problem.  Compiled on Linux, it runs to completion on a real Linux system and dies in the last assert on FreeBSD.

```
#include <sys/mman.h>
#include <assert.h>

int main(void)
{
        char *page = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
        assert(page != MAP_FAILED);
        page[0] = 42;
        assert(page[0] == 42);
        madvise(page, 4096, MADV_DONTNEED);
        assert(page[0] == 0);
}
```

This `madvise` flag is commonly used by memory allocators to guarantee zeroed memory for reuse.  It would be nice if we had a `MADV_ZERO` that did the same thing as Linux's `MADV_DONTNEED` for shared memory as well as anonymous memory.
Comment 2 Mark Johnston freebsd_committer freebsd_triage 2019-01-09 17:54:08 UTC
Yes, we should probably just extend our madvise(2) to implement Linux's MADV_DONTNEED.  Some care is needed, as we currently assume that madvise(2) is advisory and so may be ignored for pages in some transient busy state.  However, we cannot correctly do this when implementing Linux MADV_DONTNEED for private anonymous memory.

I don't really like the idea of having a generic MADV_ZERO.  First, Linux MADV_DONTNEED only "zeroes" the page if it belongs to a private mapping.  Pages belonging to shared mappings should be handled the same way as FreeBSD's MADV_DONTNEED, from my reading of the man page.  So that name would be misleading.  Second, I think it's pretty widely agreed that Linux's implementation choice here is a historical mistake.  We should emulate it, but I don't think it makes much sense to do so in a generic fashion.  I'd just add an undocumented MADV_DONTNEED_LINUX.
Comment 3 David Chisnall freebsd_committer freebsd_triage 2019-01-10 11:00:04 UTC
(In reply to Mark Johnston from comment #2)

Sorry, wanting MADV_ZERO is an unrelated issue: we have some code that would get a very nice perf improvement if we had it.

In our own code, we implement a FreeBSD equivalent of Linux's MADV_DONTNEED by doing an mmap with MAP_FIXED over the address range.  This works fine for anonymous memory mappings (which is all that Linux supports), but it means that we have to fall back to bzero for shared memory (which means that, among other things, we get a unique physical page for each virtual page, even though there's a good chance that the pages are going to stay zero for a while).
Comment 4 Bill Sorenson 2019-01-11 05:54:23 UTC
Illumos added MADV_PURGE to do this very thing for their Linux ABI support. Personally I'd recommend adopting it. https://illumos.org/issues/6818
Comment 5 Bill Sorenson 2019-01-11 05:58:11 UTC
(In reply to Bill Sorenson from comment #4)
For what its worth, newer version of Jemalloc use MADV_FREE on Linux if available (kernels ~4.15 or newer). Hopefully Linux MADV_DONTNEED is going away slowly.
Comment 6 David Chisnall freebsd_committer freebsd_triage 2019-01-11 11:21:04 UTC
(In reply to Bill Sorenson from comment #5)

`MADV_FREE` and Linux's `MADV_DONTNEED` have different use cases.  For C, where malloc is called a lot more often than calloc, `MADV_FREE` provides much better semantics.  For higher-level languages or for higher-security applications where we need to guarantee zero initialisation, `MADV_FREE` is useless because we have to `bzero` on either allocation or deallocation.

As I said, at $WORK, we have a number of use cases where Linux's behaviour gives significantly better performance (less cache churn from redundant zeroing).  We have to fall back to the zeroing behaviour when using anonymous shared memory though and that's a big perf hit for us.  A `MADV_ZERO` would be a big win.

Note, however, that `MADV_FREE` is currently broken in the Linuxulator, because the constant has a different value in FreeBSD and Linux and the Linuxulator just passes the flags through unmodified.
Comment 7 Bill Sorenson 2019-01-11 15:01:29 UTC
(In reply to David Chisnall from comment #6)
I'm aware they have different use cases. My main point is that if we are going to adopt a Linux-MADV_DONTNEED equivalent we use Illumos' MADV_PURGE rather than invent a new argument.

I don't object to adding MADV_PURGE or MADV_ZERO for Linux compatibility but to me it seems like it would usually be better to call munmap() directly than to use some bizarre madvise() semantics to simulate it although admittedly I don't know the specifics.
Comment 8 David Chisnall freebsd_committer freebsd_triage 2019-01-11 18:17:27 UTC
(In reply to Bill Sorenson from comment #7)

As I said above, there is no mechanism for doing this with shared memory segments - we cannot zero pages in the middle of a shared-memory segment without using memset / bzero and this does not allow the kernel to decommit the physical pages.  I haven't tested whether MAV_FREE allows the kernel to lazily replace the pages with zeroed pages, but for our uses we need to guarantee zeroing.

On Linux you can do this with some forms of shared memory using fallocate to punch a hole in the underlying object, though apparently it isn't very reliable.
Comment 9 Mark Johnston freebsd_committer freebsd_triage 2019-01-11 18:35:51 UTC
(In reply to David Chisnall from comment #8)
MADV_FREE has no effect on shared memory (or anything other than private anonymous memory); see vm_object_advice_applies().  We should update the man page to that effect.

Aside from emulating Linux's MADV_DONTNEED, I'm not crazy about following the precedent set by Linux by adding a MADV_ZERO that is required to have specific side effects.  msync(2) might be a better entry point for this functionality?
Comment 10 Edward Tomasz Napierala freebsd_committer freebsd_triage 2020-06-14 22:05:22 UTC
https://reviews.freebsd.org/D25272
Comment 11 commit-hook freebsd_committer freebsd_triage 2020-06-20 18:29:49 UTC
A commit references this bug:

Author: trasz
Date: Sat Jun 20 18:29:25 UTC 2020
New revision: 362440
URL: https://svnweb.freebsd.org/changeset/base/362440

Log:
  Add linux_madvise(2) instead of having Linux apps call the native
  FreeBSD madvise(2) directly.  While some of the flag values match,
  most don't.

  PR:		kern/230160
  Reported by:	markj
  Reviewed by:	markj
  Discussed with:	brooks, kib
  MFC after:	2 weeks
  Sponsored by:	The FreeBSD Foundation
  Differential Revision:	https://reviews.freebsd.org/D25272

Changes:
  head/sys/amd64/linux/linux_machdep.c
  head/sys/amd64/linux/syscalls.master
  head/sys/amd64/linux32/linux32_machdep.c
  head/sys/amd64/linux32/syscalls.master
  head/sys/arm64/linux/linux_machdep.c
  head/sys/arm64/linux/syscalls.master
  head/sys/compat/linux/linux_mmap.c
  head/sys/compat/linux/linux_mmap.h
  head/sys/i386/linux/linux_machdep.c
  head/sys/i386/linux/syscalls.master
  head/sys/sys/syscallsubr.h
  head/sys/vm/vm_mmap.c
Comment 12 commit-hook freebsd_committer freebsd_triage 2020-06-25 20:31:29 UTC
A commit references this bug:

Author: markj
Date: Thu Jun 25 20:30:31 UTC 2020
New revision: 362631
URL: https://svnweb.freebsd.org/changeset/base/362631

Log:
  Implement an approximation of Linux MADV_DONTNEED semantics.

  Linux MADV_DONTNEED is not advisory: it has side effects for anonymous
  memory, and some system software depends on that.  In particular,
  MADV_DONTNEED causes anonymous pages to be discarded.  If the mapping is
  a private mapping of a named object then subsequent faults are to
  repopulate the range from that object, otherwise pages will be
  zero-filled.  For mappings of non-anonymous objects, Linux MADV_DONTNEED
  can be implemented in the same way as our MADV_DONTNEED.

  This implementation differs from Linux semantics in its handling of
  private mappings, inherited through fork(), of non-anonymous objects.
  After applying MADV_DONTNEED, subsequent faults will repopulate the
  mapping from the parent object rather than the root of the shadow chain.

  PR:		230160
  Reviewed by:	alc, kib
  Sponsored by:	The FreeBSD Foundation
  MFC after:	2 weeks
  Differential Revision:	https://reviews.freebsd.org/D25330

Changes:
  head/sys/compat/linux/linux_mmap.c
Comment 13 Ed Maste freebsd_committer freebsd_triage 2020-07-27 19:05:07 UTC
Just MFCs needed now?
Comment 14 Mark Johnston freebsd_committer freebsd_triage 2020-07-27 19:07:51 UTC
(In reply to Ed Maste from comment #13)
Yes, I am waiting for a merge of r362440 first.  r362631 is a non-trivial merge because of some differences in internal VM interfaces between stable/12 and head.
Comment 15 Edward Tomasz Napierala freebsd_committer freebsd_triage 2020-07-27 19:12:27 UTC
I plan to do a few dozen MFCs in a few days from now.
Comment 16 commit-hook freebsd_committer freebsd_triage 2020-08-24 17:26:29 UTC
A commit references this bug:

Author: trasz
Date: Mon Aug 24 17:25:28 UTC 2020
New revision: 364715
URL: https://svnweb.freebsd.org/changeset/base/364715

Log:
  MFC r362440:

  Add linux_madvise(2) instead of having Linux apps call the native
  FreeBSD madvise(2) directly.  While some of the flag values match,
  most don't.

  PR:		kern/230160
  Sponsored by:	The FreeBSD Foundation

Changes:
_U  stable/12/
  stable/12/sys/amd64/linux/linux_machdep.c
  stable/12/sys/amd64/linux/syscalls.master
  stable/12/sys/amd64/linux32/linux32_machdep.c
  stable/12/sys/amd64/linux32/syscalls.master
  stable/12/sys/arm64/linux/linux_machdep.c
  stable/12/sys/arm64/linux/syscalls.master
  stable/12/sys/compat/linux/linux_mmap.c
  stable/12/sys/compat/linux/linux_mmap.h
  stable/12/sys/i386/linux/linux_machdep.c
  stable/12/sys/i386/linux/syscalls.master
  stable/12/sys/sys/syscallsubr.h
  stable/12/sys/vm/vm_mmap.c
Comment 17 commit-hook freebsd_committer freebsd_triage 2020-08-24 20:03:01 UTC
A commit references this bug:

Author: markj
Date: Mon Aug 24 20:02:36 UTC 2020
New revision: 364729
URL: https://svnweb.freebsd.org/changeset/base/364729

Log:
  MFC r362631, r364317:
  Implement an approximation of Linux MADV_DONTNEED semantics.

  PR:	230160

Changes:
_U  stable/12/
  stable/12/sys/compat/linux/linux_mmap.c