| Summary: | linuxulator doesn't implement madvise(MADV_DONTNEED) and any MADV_ flags with values >= 8 correctly | | |
|---|---|---|---|
| Product: | Base System | Reporter: | Mark Johnston <markj> |
| Component: | kern | Assignee: | freebsd-emulation mailing list <emulation> |
| Status: | New | | |
| Severity: | Affects Some People | CC: | cem, emaste, instructionset, theraven |
| Priority: | --- | | |
| Version: | CURRENT | | |
| Hardware: | Any | | |
| OS: | Any | | |

Description
Mark Johnston
It's actually worse than as described. Linux's value for `MADV_DONTNEED` is 8, which corresponds to FreeBSD's `MADV_NOCORE`, so we're not even getting the FreeBSD `MADV_DONTNEED` behaviour. This test program demonstrates the problem. Compiled on Linux, it runs to completion on a real Linux system and dies in the last assert on FreeBSD.

```
#include <sys/mman.h>
#include <assert.h>

int main(void)
{
	/* Map one private anonymous page and dirty it. */
	char *page = mmap(0, 4096, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	assert(page != MAP_FAILED);
	page[0] = 42;
	assert(page[0] == 42);

	/* On Linux this discards the page; the next read sees zeroes. */
	madvise(page, 4096, MADV_DONTNEED);
	assert(page[0] == 0);
	return 0;
}
```

This `madvise` flag is commonly used by memory allocators to guarantee zeroed memory for reuse. It would be nice if we had a `MADV_ZERO` that did the same thing as Linux's `MADV_DONTNEED` for shared memory as well as anonymous memory.

Yes, we should probably just extend our madvise(2) to implement Linux's MADV_DONTNEED. Some care is needed, as we currently assume that madvise(2) is advisory and so may be ignored for pages in some transient busy state. However, ignoring the request is not correct when implementing Linux MADV_DONTNEED for private anonymous memory.

I don't really like the idea of having a generic MADV_ZERO. First, Linux MADV_DONTNEED only "zeroes" the page if it belongs to a private mapping. From my reading of the man page, pages belonging to shared mappings should be handled the same way as FreeBSD's MADV_DONTNEED, so that name would be misleading. Second, I think it's pretty widely agreed that Linux's implementation choice here is a historical mistake. We should emulate it, but I don't think it makes much sense to do so in a generic fashion. I'd just add an undocumented MADV_DONTNEED_LINUX.

(In reply to Mark Johnston from comment #2) Sorry, wanting MADV_ZERO is an unrelated issue: we have some code that would get a very nice perf improvement if we had it.
In our own code, we implement a FreeBSD equivalent of Linux's MADV_DONTNEED by doing an mmap with MAP_FIXED over the address range. This works fine for anonymous memory mappings (which is all that Linux supports), but it means that we have to fall back to bzero for shared memory. Among other things, that means we get a unique physical page for each virtual page, even though there's a good chance that the pages are going to stay zero for a while.

Illumos added MADV_PURGE to do this very thing for their Linux ABI support. Personally I'd recommend adopting it. https://illumos.org/issues/6818

(In reply to Bill Sorenson from comment #4) For what it's worth, newer versions of jemalloc use MADV_FREE on Linux if available (kernels ~4.15 or newer). Hopefully Linux MADV_DONTNEED is going away slowly.

(In reply to Bill Sorenson from comment #5) `MADV_FREE` and Linux's `MADV_DONTNEED` have different use cases. For C, where malloc is called a lot more often than calloc, `MADV_FREE` provides much better semantics. For higher-level languages or for higher-security applications where we need to guarantee zero initialisation, `MADV_FREE` is useless because we have to `bzero` on either allocation or deallocation. As I said, at $WORK, we have a number of use cases where Linux's behaviour gives significantly better performance (less cache churn from redundant zeroing). We have to fall back to the zeroing behaviour when using anonymous shared memory though, and that's a big perf hit for us. A `MADV_ZERO` would be a big win.

Note, however, that `MADV_FREE` is currently broken in the Linuxulator, because the constant has a different value in FreeBSD and Linux and the Linuxulator just passes the flags through unmodified.

(In reply to David Chisnall from comment #6) I'm aware they have different use cases. My main point is that if we are going to adopt a Linux-MADV_DONTNEED equivalent, we should use Illumos' MADV_PURGE rather than invent a new argument.
I don't object to adding MADV_PURGE or MADV_ZERO for Linux compatibility, but to me it seems like it would usually be better to call munmap() directly than to use some bizarre madvise() semantics to simulate it, although admittedly I don't know the specifics.

(In reply to Bill Sorenson from comment #7) As I said above, there is no mechanism for doing this with shared memory segments: we cannot zero pages in the middle of a shared-memory segment without using memset / bzero, and this does not allow the kernel to decommit the physical pages. I haven't tested whether MADV_FREE allows the kernel to lazily replace the pages with zeroed pages, but for our uses we need to guarantee zeroing. On Linux you can do this with some forms of shared memory using fallocate to punch a hole in the underlying object, though apparently it isn't very reliable.

(In reply to David Chisnall from comment #8) MADV_FREE has no effect on shared memory (or anything other than private anonymous memory); see vm_object_advice_applies(). We should update the man page to that effect.

Aside from emulating Linux's MADV_DONTNEED, I'm not crazy about following the precedent set by Linux by adding a MADV_ZERO that is required to have specific side effects. msync(2) might be a better entry point for this functionality?
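
For readers following along, the MAP_FIXED workaround mentioned earlier in the thread (mapping fresh anonymous memory over the range instead of relying on Linux's MADV_DONTNEED) might look roughly like the sketch below. It is an illustration only, not the code from anyone's tree: the names `zero_private_range` and `zero_private_range_demo` are hypothetical, and it assumes the range is page-aligned, private, anonymous and mapped read/write. It does nothing useful for shared memory, which is the case the thread is really arguing about.

```c
#define _DEFAULT_SOURCE	/* exposes MAP_ANON on glibc under strict -std= modes */
#include <sys/mman.h>
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (the name is an assumption, not from the thread):
 * replace [addr, addr + len) with a fresh private anonymous mapping, so
 * the next read of the range sees zero-filled pages.  addr and len must
 * be page-aligned.  Whatever was mapped there before is discarded, so
 * this is only appropriate for private anonymous read/write memory.
 * Returns 0 on success, -1 on failure (errno is set by mmap).
 */
int
zero_private_range(void *addr, size_t len)
{
	void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
	    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);

	return (p == MAP_FAILED ? -1 : 0);
}

/*
 * Mirrors the test program from the report, with the madvise() call
 * swapped for the remap.  Returns 0 if the page reads back as zero.
 */
int
zero_private_range_demo(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	char *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	int ok;

	assert(page != MAP_FAILED);
	page[0] = 42;			/* dirty the page */
	if (zero_private_range(page, len) != 0) {
		munmap(page, len);
		return (-1);
	}
	ok = (page[0] == 0);		/* fresh mapping reads as zero */
	munmap(page, len);
	return (ok ? 0 : -1);
}
```

Calling `zero_private_range_demo()` from a small `main` should return 0 on both FreeBSD and Linux, since it relies only on mmap(2) semantics rather than on any `MADV_` constant. The cost is a full remap of the range on every call, and there is no equivalent for pages in the middle of a shared-memory segment.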