Bug 230160

Summary: linuxulator doesn't implement madvise(MADV_DONTNEED) and any MADV_ flags with values >= 8 correctly
Product: Base System Reporter: Mark Johnston <markj>
Component: kern Assignee: freebsd-emulation mailing list <emulation>
Status: New ---    
Severity: Affects Some People CC: cem, emaste, instructionset, theraven
Priority: ---    
Version: CURRENT   
Hardware: Any   
OS: Any   

Description Mark Johnston freebsd_committer 2018-07-29 18:47:25 UTC
Our emulation of Linux's madvise(2) appears to just call sys_madvise() directly:

{ AS(madvise_args), (sy_call_t *)sys_madvise, AUE_MADVISE, NULL, 0, 0, 0, SY_THR_STATIC },

This sort of works for the "standard" MADV_* constants since their values are the same on FreeBSD and Linux.  However, Linux has a somewhat surprising behaviour for MADV_DONTNEED: if applied to a mapping of anonymous memory, the next access will return a zero-filled page.  That is, applying MADV_DONTNEED causes eager reclamation of pages in the range.  In contrast, our MADV_DONTNEED implementation just causes affected pages to skip LRU, and our MADV_FREE allows lazy reclamation of affected pages.  Some popular software, e.g., jemalloc, expects Linux's behaviour and won't work as expected under the Linuxulator because it just passes the parameters straight through to the native madvise(2) implementation.

Some of Linux's other advice parameters are implemented on FreeBSD using minherit(2) (e.g., MADV_WIPEONFORK, MADV_DONTFORK).
Comment 1 David Chisnall freebsd_committer 2019-01-08 11:15:32 UTC
It's actually worse than as described.  Linux's value for `MADV_DONTNEED` is 8, which corresponds to FreeBSD's `MADV_NOCORE`, so we're not even getting the FreeBSD `MADV_DONTNEED` behaviour.

This test program demonstrates the problem.  Compiled on Linux, it runs to completion on a real Linux system and dies in the last assert on FreeBSD.

#include <sys/mman.h>
#include <assert.h>

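int main(void)
{
        /* Map a single private anonymous page. */
        char *page = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
        assert(page != MAP_FAILED);
        page[0] = 42;
        assert(page[0] == 42);
        /* On Linux this discards the page contents immediately. */
        madvise(page, 4096, MADV_DONTNEED);
        /* The next access must therefore see a zero-filled page. */
        assert(page[0] == 0);
        return 0;
}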

This `madvise` flag is commonly used by memory allocators to guarantee zeroed memory for reuse.  It would be nice if we had a `MADV_ZERO` that did the same thing as Linux's `MADV_DONTNEED` for shared memory as well as anonymous memory.
Comment 2 Mark Johnston freebsd_committer 2019-01-09 17:54:08 UTC
Yes, we should probably just extend our madvise(2) to implement Linux's MADV_DONTNEED.  Some care is needed, as we currently assume that madvise(2) is advisory and so may be ignored for pages in some transient busy state.  However, we cannot correctly do this when implementing Linux MADV_DONTNEED for private anonymous memory.

I don't really like the idea of having a generic MADV_ZERO.  First, Linux MADV_DONTNEED only "zeroes" the page if it belongs to a private mapping.  Pages belonging to shared mappings should be handled the same way as FreeBSD's MADV_DONTNEED, from my reading of the man page.  So that name would be misleading.  Second, I think it's pretty widely agreed that Linux's implementation choice here is a historical mistake.  We should emulate it, but I don't think it makes much sense to do so in a generic fashion.  I'd just add an undocumented MADV_DONTNEED_LINUX.
Comment 3 David Chisnall freebsd_committer 2019-01-10 11:00:04 UTC
(In reply to Mark Johnston from comment #2)

Sorry, wanting MADV_ZERO is an unrelated issue: we have some code that would get a very nice perf improvement if we had it.

In our own code, we implement a FreeBSD equivalent of Linux's MADV_DONTNEED by doing an mmap with MAP_FIXED over the address range.  This works fine for anonymous memory mappings (which is all that Linux supports), but it means that we have to fall back to bzero for shared memory (which means that, among other things, we get a unique physical page for each virtual page, even though there's a good chance that the pages are going to stay zero for a while).
Comment 4 Bill Sorenson 2019-01-11 05:54:23 UTC
Illumos added MADV_PURGE to do this very thing for their Linux ABI support. Personally I'd recommend adopting it. https://illumos.org/issues/6818
Comment 5 Bill Sorenson 2019-01-11 05:58:11 UTC
(In reply to Bill Sorenson from comment #4)
For what it's worth, newer versions of jemalloc use MADV_FREE on Linux if available (kernels ~4.15 or newer). Hopefully Linux's MADV_DONTNEED is slowly going away.
Comment 6 David Chisnall freebsd_committer 2019-01-11 11:21:04 UTC
(In reply to Bill Sorenson from comment #5)

`MADV_FREE` and Linux's `MADV_DONTNEED` have different use cases.  For C, where malloc is called a lot more often than calloc, `MADV_FREE` provides much better semantics.  For higher-level languages or for higher-security applications where we need to guarantee zero initialisation, `MADV_FREE` is useless because we have to `bzero` on either allocation or deallocation.

As I said, at $WORK, we have a number of use cases where Linux's behaviour gives significantly better performance (less cache churn from redundant zeroing).  We have to fall back to the zeroing behaviour when using anonymous shared memory though and that's a big perf hit for us.  A `MADV_ZERO` would be a big win.

Note, however, that `MADV_FREE` is currently broken in the Linuxulator, because the constant has a different value in FreeBSD and Linux and the Linuxulator just passes the flags through unmodified.
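
To illustrate the kind of translation the Linuxulator would need instead of passing the advice through, here is a rough sketch. The function name and the two sentinel return values are made up for illustration, and the numeric constants are assumptions taken from my reading of Linux's generic `mman-common.h` and FreeBSD's `sys/mman.h`; they should be checked against the headers (some Linux architectures use different values).

```c
#include <assert.h>

/* Linux advice values (assumed, generic architectures). */
enum {
	LINUX_MADV_NORMAL = 0,
	LINUX_MADV_RANDOM = 1,
	LINUX_MADV_SEQUENTIAL = 2,
	LINUX_MADV_WILLNEED = 3,
	LINUX_MADV_DONTNEED = 4,
	LINUX_MADV_FREE = 8
};

/* FreeBSD advice values (assumed). */
enum {
	BSD_MADV_FREE = 5,
	BSD_MADV_NOCORE = 8
};

/* Hypothetical sentinels for advice that cannot simply be renumbered. */
#define	ADVICE_NEEDS_EMULATION	(-1)	/* same name, different semantics */
#define	ADVICE_UNSUPPORTED	(-2)	/* no translation in this sketch */

/* Hypothetical translation from a Linux advice value to a FreeBSD one. */
static int
linux_to_bsd_advice(int ladv)
{
	switch (ladv) {
	case LINUX_MADV_NORMAL:
	case LINUX_MADV_RANDOM:
	case LINUX_MADV_SEQUENTIAL:
	case LINUX_MADV_WILLNEED:
		return (ladv);		/* identical values and meaning */
	case LINUX_MADV_FREE:
		return (BSD_MADV_FREE);	/* 8 would otherwise mean NOCORE */
	case LINUX_MADV_DONTNEED:
		return (ADVICE_NEEDS_EMULATION);
	default:
		return (ADVICE_UNSUPPORTED);
	}
}
```

The point of the sentinels is that unknown or semantically different advice is rejected or handled specially, instead of being silently reinterpreted as whichever FreeBSD constant happens to share the number.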
Comment 7 Bill Sorenson 2019-01-11 15:01:29 UTC
(In reply to David Chisnall from comment #6)
I'm aware they have different use cases. My main point is that if we are going to adopt an equivalent of Linux's MADV_DONTNEED, we should use Illumos' MADV_PURGE rather than invent a new argument.

I don't object to adding MADV_PURGE or MADV_ZERO for Linux compatibility, but it seems to me that it would usually be better to call munmap() directly than to use some bizarre madvise() semantics to simulate it, although admittedly I don't know the specifics.
Comment 8 David Chisnall freebsd_committer 2019-01-11 18:17:27 UTC
(In reply to Bill Sorenson from comment #7)

As I said above, there is no mechanism for doing this with shared memory segments: we cannot zero pages in the middle of a shared-memory segment without using memset / bzero, and this does not allow the kernel to decommit the physical pages.  I haven't tested whether MADV_FREE allows the kernel to lazily replace the pages with zeroed pages, but for our uses we need to guarantee zeroing.

On Linux you can do this with some forms of shared memory using fallocate to punch a hole in the underlying object, though apparently it isn't very reliable.
Comment 9 Mark Johnston freebsd_committer 2019-01-11 18:35:51 UTC
(In reply to David Chisnall from comment #8)
MADV_FREE has no effect on shared memory (or anything other than private anonymous memory); see vm_object_advice_applies().  We should update the man page to that effect.

Aside from emulating Linux's MADV_DONTNEED, I'm not crazy about following the precedent set by Linux by adding a MADV_ZERO that is required to have specific side effects.  msync(2) might be a better entry point for this functionality?