Our emulation of Linux's madvise(2) appears to just call sys_madvise() directly:

    { AS(madvise_args), (sy_call_t *)sys_madvise, AUE_MADVISE, NULL, 0, 0, 0, SY_THR_STATIC },

This sort of works for the "standard" MADV_* constants since their values are the same on FreeBSD and Linux. However, Linux has a somewhat surprising behaviour for MADV_DONTNEED: if applied to a mapping of anonymous memory, the next access will return a zero-filled page. That is, applying MADV_DONTNEED causes eager reclamation of pages in the range. In contrast, our MADV_DONTNEED implementation just causes affected pages to skip LRU, and our MADV_FREE allows lazy reclamation of affected pages.

Some popular software, e.g., jemalloc, expects Linux's behaviour and won't work as expected under the Linuxulator, because the Linuxulator just passes the parameters straight through to the native madvise(2) implementation.

Some of Linux's other advice parameters are implemented on FreeBSD using minherit(2) (e.g., MADV_WIPEONFORK, MADV_DONTFORK).
It's actually worse than as described. Linux's value for `MADV_DONTNEED` is 8, which corresponds to FreeBSD's `MADV_NOCORE`, so we're not even getting the FreeBSD `MADV_DONTNEED` behaviour.

This test program demonstrates the problem. Compiled on Linux, it runs to completion on a real Linux system and dies in the last assert on FreeBSD.

```
#include <sys/mman.h>
#include <assert.h>

int main(void)
{
    char *page = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
    assert(page != MAP_FAILED);
    page[0] = 42;
    assert(page[0] == 42);
    madvise(page, 4096, MADV_DONTNEED);
    assert(page[0] == 0);
}
```

This `madvise` flag is commonly used by memory allocators to guarantee zeroed memory for reuse. It would be nice if we had a `MADV_ZERO` that did the same thing as Linux's `MADV_DONTNEED` for shared memory as well as anonymous memory.
Yes, we should probably just extend our madvise(2) to implement Linux's MADV_DONTNEED. Some care is needed, as we currently assume that madvise(2) is advisory and so may be ignored for pages in some transient busy state. However, we cannot correctly ignore such pages when implementing Linux MADV_DONTNEED for private anonymous memory.

I don't really like the idea of having a generic MADV_ZERO. First, Linux MADV_DONTNEED only "zeroes" the page if it belongs to a private mapping. Pages belonging to shared mappings should be handled the same way as FreeBSD's MADV_DONTNEED, from my reading of the man page. So that name would be misleading. Second, I think it's pretty widely agreed that Linux's implementation choice here is a historical mistake. We should emulate it, but I don't think it makes much sense to do so in a generic fashion. I'd just add an undocumented MADV_DONTNEED_LINUX.
(In reply to Mark Johnston from comment #2) Sorry, wanting MADV_ZERO is an unrelated issue: we have some code that would get a very nice perf improvement if we had it. In our own code, we implement a FreeBSD equivalent of Linux's MADV_DONTNEED by doing an mmap with MAP_FIXED over the address range. This works fine for anonymous memory mappings (which is all that Linux supports), but it means that we have to fall back to bzero for shared memory (which means that, among other things, we get a unique physical page for each virtual page, even though there's a good chance that the pages are going to stay zero for a while).
Illumos added MADV_PURGE to do this very thing for their Linux ABI support. Personally I'd recommend adopting it. https://illumos.org/issues/6818
(In reply to Bill Sorenson from comment #4) For what it's worth, newer versions of jemalloc use MADV_FREE on Linux if available (kernels 4.5 or newer). Hopefully Linux MADV_DONTNEED is going away slowly.
(In reply to Bill Sorenson from comment #5) `MADV_FREE` and Linux's `MADV_DONTNEED` have different use cases. For C, where malloc is called a lot more often than calloc, `MADV_FREE` provides much better semantics. For higher-level languages or for higher-security applications where we need to guarantee zero initialisation, `MADV_FREE` is useless because we have to `bzero` on either allocation or deallocation. As I said, at $WORK, we have a number of use cases where Linux's behaviour gives significantly better performance (less cache churn from redundant zeroing). We have to fall back to the zeroing behaviour when using anonymous shared memory though and that's a big perf hit for us. A `MADV_ZERO` would be a big win. Note, however, that `MADV_FREE` is currently broken in the Linuxulator, because the constant has a different value in FreeBSD and Linux and the Linuxulator just passes the flags through unmodified.
(In reply to David Chisnall from comment #6) I'm aware they have different use cases. My main point is that if we are going to adopt an equivalent of Linux's MADV_DONTNEED, we should use Illumos' MADV_PURGE rather than invent a new argument. I don't object to adding MADV_PURGE or MADV_ZERO for Linux compatibility, but to me it seems like it would usually be better to call munmap() directly than to use some bizarre madvise() semantics to simulate it. Admittedly, I don't know the specifics.
(In reply to Bill Sorenson from comment #7) As I said above, there is no mechanism for doing this with shared memory segments - we cannot zero pages in the middle of a shared-memory segment without using memset / bzero and this does not allow the kernel to decommit the physical pages. I haven't tested whether MADV_FREE allows the kernel to lazily replace the pages with zeroed pages, but for our uses we need to guarantee zeroing. On Linux you can do this with some forms of shared memory using fallocate to punch a hole in the underlying object, though apparently it isn't very reliable.
(In reply to David Chisnall from comment #8) MADV_FREE has no effect on shared memory (or anything other than private anonymous memory); see vm_object_advice_applies(). We should update the man page to that effect. Aside from emulating Linux's MADV_DONTNEED, I'm not crazy about following the precedent set by Linux by adding a MADV_ZERO that is required to have specific side effects. msync(2) might be a better entry point for this functionality?
https://reviews.freebsd.org/D25272
A commit references this bug:

Author: trasz
Date: Sat Jun 20 18:29:25 UTC 2020
New revision: 362440
URL: https://svnweb.freebsd.org/changeset/base/362440

Log:
  Add linux_madvise(2) instead of having Linux apps call the native
  FreeBSD madvise(2) directly. While some of the flag values match,
  most don't.

  PR: kern/230160
  Reported by: markj
  Reviewed by: markj
  Discussed with: brooks, kib
  MFC after: 2 weeks
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D25272

Changes:
  head/sys/amd64/linux/linux_machdep.c
  head/sys/amd64/linux/syscalls.master
  head/sys/amd64/linux32/linux32_machdep.c
  head/sys/amd64/linux32/syscalls.master
  head/sys/arm64/linux/linux_machdep.c
  head/sys/arm64/linux/syscalls.master
  head/sys/compat/linux/linux_mmap.c
  head/sys/compat/linux/linux_mmap.h
  head/sys/i386/linux/linux_machdep.c
  head/sys/i386/linux/syscalls.master
  head/sys/sys/syscallsubr.h
  head/sys/vm/vm_mmap.c
A commit references this bug:

Author: markj
Date: Thu Jun 25 20:30:31 UTC 2020
New revision: 362631
URL: https://svnweb.freebsd.org/changeset/base/362631

Log:
  Implement an approximation of Linux MADV_DONTNEED semantics.

  Linux MADV_DONTNEED is not advisory: it has side effects for anonymous
  memory, and some system software depends on that. In particular,
  MADV_DONTNEED causes anonymous pages to be discarded. If the mapping is
  a private mapping of a named object then subsequent faults are to
  repopulate the range from that object, otherwise pages will be
  zero-filled. For mappings of non-anonymous objects, Linux MADV_DONTNEED
  can be implemented in the same way as our MADV_DONTNEED.

  This implementation differs from Linux semantics in its handling of
  private mappings, inherited through fork(), of non-anonymous objects.
  After applying MADV_DONTNEED, subsequent faults will repopulate the
  mapping from the parent object rather than the root of the shadow chain.

  PR: 230160
  Reviewed by: alc, kib
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks
  Differential Revision: https://reviews.freebsd.org/D25330

Changes:
  head/sys/compat/linux/linux_mmap.c
Just MFCs needed now?
(In reply to Ed Maste from comment #13) Yes, I am waiting for a merge of r362440 first. r362631 is a non-trivial merge because of some differences in internal VM interfaces between stable/12 and head.
I plan to do a few dozen MFCs in a few days from now.
A commit references this bug:

Author: trasz
Date: Mon Aug 24 17:25:28 UTC 2020
New revision: 364715
URL: https://svnweb.freebsd.org/changeset/base/364715

Log:
  MFC r362440:

  Add linux_madvise(2) instead of having Linux apps call the native
  FreeBSD madvise(2) directly. While some of the flag values match,
  most don't.

  PR: kern/230160
  Sponsored by: The FreeBSD Foundation

Changes:
_U  stable/12/
  stable/12/sys/amd64/linux/linux_machdep.c
  stable/12/sys/amd64/linux/syscalls.master
  stable/12/sys/amd64/linux32/linux32_machdep.c
  stable/12/sys/amd64/linux32/syscalls.master
  stable/12/sys/arm64/linux/linux_machdep.c
  stable/12/sys/arm64/linux/syscalls.master
  stable/12/sys/compat/linux/linux_mmap.c
  stable/12/sys/compat/linux/linux_mmap.h
  stable/12/sys/i386/linux/linux_machdep.c
  stable/12/sys/i386/linux/syscalls.master
  stable/12/sys/sys/syscallsubr.h
  stable/12/sys/vm/vm_mmap.c
A commit references this bug:

Author: markj
Date: Mon Aug 24 20:02:36 UTC 2020
New revision: 364729
URL: https://svnweb.freebsd.org/changeset/base/364729

Log:
  MFC r362631, r364317:

  Implement an approximation of Linux MADV_DONTNEED semantics.

  PR: 230160

Changes:
_U  stable/12/
  stable/12/sys/compat/linux/linux_mmap.c