With an mmap() call with a large 'size' parameter, a subsequent sysctl(KERN_PROC_VMMAP) call takes too long to complete.

Fix:

According to kib@ this is because we compute RSS accurately, which means we have to visit every page in the range. I don't think you need RSS. To confirm this theory, can you try this patch? It should disable the RSS computation.

Index: sys/kern/kern_proc.c
===================================================================
--- sys/kern/kern_proc.c	(revision 265931)
+++ sys/kern/kern_proc.c	(working copy)
@@ -2182,7 +2182,7 @@ kern_proc_vmmap_out(struct proc *p, struct sbuf *s
 		}
 		kve->kve_resident = 0;
 		addr = entry->start;
-		while (addr < entry->end) {
+		while (0 && addr < entry->end) {
 			locked_pa = 0;
 			mincoreinfo = pmap_mincore(map->pmap, addr, &locked_pa);
 			if (locked_pa != 0)

How-To-Repeat:

#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <dlfcn.h>
#include <fcntl.h>
#include <sys/sysctl.h>
#include <sys/user.h>
#include <sys/mman.h>

int main(void)
{
    int mib[4];
    size_t size;
    int err;
    void *p;

    printf("#1\n");

    /* Reserve a huge (about 8 TB) anonymous region at a fixed address. */
    p = mmap((void*) 0x3ffffffff000, 0x80000001000,
             PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANON | MAP_FIXED | MAP_NORESERVE,
             -1, 0);
    assert(p != MAP_FAILED);

    printf("#2\n");

    /* Ask only for the size of this process's VM map listing. */
    mib[0] = CTL_KERN;
    mib[1] = KERN_PROC;
    mib[2] = KERN_PROC_VMMAP;
    mib[3] = getpid();
    size = 0;
    err = sysctl(mib, 4, NULL, &size, NULL, 0); /* takes about 40 seconds */
    assert(err == 0);

    printf("#3\n");

    return EXIT_SUCCESS;
}
Yes, this is the cause of the slowdown: it takes a long time to iterate over 8 TB one 4 KB page at a time. I confirmed this by commenting out the loop, as in the proof-of-concept patch.
Yes, for our purposes we don't need to care about resident pages so if we could somehow avoid executing the loop, that would solve the problem. Thanks.
To provide some context, this is currently a blocker in getting the clang sanitizers working on FreeBSD.
Created attachment 144398
suggested change

The existing calculation of the resident page count in kern_proc_vmmap_out() does not make sense. It counts the number of installed PTEs in the specified range, which can be less than the number of resident pages if the pages have not been faulted on yet (i.e. the soft-fault case).

The patch does two things:
1. it adds a tunable to disable the calculation of the resident count altogether: sysctl kern.proc_vmmap_skip_resident_count;
2. it changes the algorithm to count the pages that are resident for a read fault. As a consequence, COW copy allocations are counted as resident, although they really are not.

I am on the fence about disabling the calculation by default; the patch as posted does disable it.

One interesting consequence of the new algorithm is that the provided test case executes in zero time even with the residency count calculation enabled. The reason is that a mapping which was never faulted on has no backing object. As a result, the loop is not executed at all. If I change the test case to access at least one page in the mmapped range before calling sysctl, I get around 30 sec runtime on my i7 2600K.
Created attachment 144404
corrected patch, unrelated changes removed
I confirm that current/11.0 with the kernel patch applied solves the issue for both the isolated test case provided above and LLVM's address sanitizer tests, though the tests take a bit longer to pass compared with stable/9.2 with the workaround patch that reads /dev/kmem.
A commit references this bug:

Author: kib
Date: Wed Jul 9 19:11:57 UTC 2014
New revision: 268466
URL: http://svnweb.freebsd.org/changeset/base/268466

Log:
  Current code in sysctl proc.vmmap, which intent is to calculate the
  amount of resident pages, in fact calculates the amount of installed
  pte entries in the region. Resident pages which were not soft-faulted
  yet are not counted.

  Calculate the amount of resident pages by looking in the objects chain
  backing the region.

  Add a knob to disable the residency calculation at all. For large
  sparce regions, either previous or updated algorithm runs for too long
  time, while several introspection tools do not need the (advisory) RSS
  value at all.

  PR: kern/188911
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week

Changes:
  head/sys/kern/kern_proc.c