Summary:    head (e.g. -r315870) arm64 vs. jemalloc without MALLOC_PRODUCTION: various examples of tbin->avail being zero lead to SIGSEGVs; fix is a kernel patch (comment 32)
Product:    Base System
Reporter:   Mark Millard <marklmi26-fbsd>
Component:  bin
Assignee:   freebsd-arm (Nobody) <freebsd-arm>
Status:     Closed DUPLICATE
Severity:   Affects Only Me
CC:         Mark.Martinec, emaste, jasone, shawn.webb
Priority:   ---
Version:    CURRENT
Hardware:   arm64
OS:         Any
Attachments:
Description
Mark Millard
2017-02-19 22:20:10 UTC
I did not make clear one important point: These are occasional failures during my experimental builds. I can not make them fail on demand. At this point I've no clue what may control the intermittent nature of the behavior.

I did not make clear one important point: These are occasional failures during my experimental builds. I can not make them fail on demand. At this point I've no clue what may control the intermittent nature of the behavior.

I've noticed disabling tcache via malloc.conf provides a stable workaround until the underlying cause of the problem is found and fixed. [I have no clue why my prior comment showed up twice.]

I also forgot to list the counts of example core files:

script: 3
powerd: 3
su: 6

Side note, FYI: I've been deliberately gathering evidence about the failures, so for now I'm not disabling tcache in general. The sh jemalloc asserts that I separately reported in 217138 are not tied to tcache as far as I can tell. That is one of the reasons that I split the reports.

Dumb typo fix: powerd, not powerpd. (I've typed powerpc a lot and it shows.)

It turns out that for the script and powerd examples the whole tsd->tcache[0] through tsd->tcache[40] area (__je_nhbins==41) is all zeros. That includes the lg_fill_div members of each. For the su examples there is a next_gc_bin in tsd->tcache[2] that is non-zero:

(lldb) print/x tsd->tcache[2]
(tcache_t) $1 = {
  link = {
    qre_next = 0x0000000000000000
    qre_prev = 0x0000000000000000
  }
  prof_accumbytes = 0x0000000000000000
  gc_ticker = (tick = 0x00000000, nticks = 0x00000000)
  next_gc_bin = 0xffffffff
  tbins = {
    [0] = {
      tstats = (nrequests = 0x0000000000000000)
      low_water = 0x00000000
      lg_fill_div = 0x00000000
      ncached = 0x00000000
      avail = 0x0000000000000000
    }
  }
}

But the rest of the tsd->tcache area is zero.

(In reply to Mark Millard from comment #6)

A note on script failures: I'll note that such script failures are part of the overall activity reported in bugzilla 217138.
217138's report is biased to the sh failures, in part because they happen first. But script is involved.

(In reply to Mark Millard from comment #7)

A note on various su failures: At least some of these are from the bugzilla 217138 context as well. This means that those have the swapped-out then swapped-in context involved, just like the tsd_booted assert failures for sh. Again the su.*.core files have a large area of zeros around the zero tbin->avail. Again it looks like the swapping activity did not re-establish the correct memory content in the area.

As for script: it was involved in the buildworld context that was causing swapping, so it too has this kind of context. (powerd I'm less sure of what its context was like relative to swapping.)

It's possibly/likely unrelated, but the symptoms described here are quite similar to what was happening on FreeBSD 10.0 amd64 (but fixed in 10.1 by an unidentified fix): https://rt.perl.org/Public/Bug/Display.html?id=122199#txn-1328406

- a child process, after doing some work (possibly touching swap), does a normal exit;
- a parent process gets a SIGCHLD signal, handles a wait(), and for some obscure reason some of its memory gets corrupted;
- a parent process forks, creating a new worker child process, which inherits corrupted sections of the parent's memory, consequently later leading to its (the child's) crash if it happens to use that part of the memory (opcodes or data structures) during its normal work.

Any newly born child process inherits the same memory corruption and crashes alike.

(In reply to Mark.Martinec from comment #9)

There is a new example using a ~110 line C program source that shows corruption, see: https://lists.freebsd.org/pipermail/freebsd-arm/2017-March/015872.html

This involves fork then swap-out/swap-in. So far it is the simplest failing context that I've identified.
But it turns out that adding an access to the memory in question between the fork and the wait/sleep (and so the later swap-out) prevents the failure: https://lists.freebsd.org/pipermail/freebsd-arm/2017-March/015875.html

Some other notes are in: https://lists.freebsd.org/pipermail/freebsd-arm/2017-March/015877.html

Bugzilla 217138 is related and may trace back to the same cause(s) as this bugzilla report: both end up with zeroed memory areas. 217138 has a version of the ~110 lines of code in an attachment.

Created attachment 180835 [details]
~132 line program to demonstrate memory corruptions

[This updates the attachment I put in 217138. In some ways 217239 is a better match.]

See: https://lists.freebsd.org/pipermail/freebsd-arm/2017-March/015880.html for the specifics of this variant of the test for memory corruptions associated with fork, then wait/sleep, and then swap-out then swap-in. The source is something like 132 lines long.

The behavior shows aligned, 4 KiByte sized units of behavior, suggesting paging-sized units, not cache lines or other such smaller units. See the comments tied to partial_test_check() and its use.

In my test program I reported:

#define region_size (4u*1024u+1u)
        // Bad dyn_region pattern, parent and child
        // processes:
        // 256u, 2u*1024u, 4u*1024u, 8u*1024u,
        // 9u*1024u, 12u*1024u, 14u*1024u
        // (but see the partial_test_check() call
        // notes above).
        // Works:
        // 14u*1024u+1u, 15u*1024u, 16u*1024u,
        // 32u*1024u, 256u*1024u*1024u

It appears that the 14u*1024u vs. 14u*1024u+1u boundary matches up with jemalloc's contrib/jemalloc/include/jemalloc/internal/size_classes.h:

. . .
 *   size == 1<<lg_grp + ndelta<<lg_delta
. . .
 *   bin: 'yes' if a small bin size class, 'no' otherwise.
. . .
#if (LG_SIZEOF_PTR == 3 && LG_TINY_MIN == 3 && LG_QUANTUM == 3 && LG_PAGE == 12)
#define SIZE_CLASSES \
  /* index, lg_grp, lg_delta, ndelta, psz, bin, lg_delta_lookup */ \
. . .
  SC( 38,     13,       11,      3,  no, yes, no) \
  SC( 39,     13,       11,      4, yes,  no, no) \
. . .
#define SMALL_MAXCLASS ((((size_t)1) << 13) + (((size_t)3) << 11))

So SMALL_MAXCLASS==(14u*1024u). The problem in the example program is tied to small bin size handling in jemalloc.

As far as my test program goes, using:

ln -s 'tcache:false' /etc/malloc.conf

prevents the memory corruptions for the fork-then-swap-out-then-swap-in sequence.

Note: I now have access to both a pine64+ 2GB and a rpi3 with no hardware in common (USB hub or SSD or such). (The rpi3 has less RAM and so does not need as big of a stress run.) The behaviors happen in both contexts: it was not some sort of pine64+ 2GB quirk (or the hub/SSD that I typically use in that context).

(In reply to Mark Millard from comment #13)

Being explicit about the combined conditions for failure for my test program: Both tcache enabled and allocations fitting in SMALL_MAXCLASS are required in order to make the program fail.

Note:

(lldb) print __je_tcache_maxclass
(size_t) $0 = 32768

which is larger than SMALL_MAXCLASS. I've not observed failures for sizes above SMALL_MAXCLASS but not exceeding __je_tcache_maxclass. Thus tcache use by itself does not seem sufficient for my program to get corruption of its dynamically allocated memory: the small allocation size also matters.

Here are procstat -v results from while the two processes from the fork of my test program have already been swapped out (before they run and fail during swap-in). I also show the results from looking at the matching core files.
The start of the address range that ends up all-zero when none of it should be is:

(lldb) print dyn_region
(region *volatile) $0 = 0x0000000040616000

# procstat -v 1954
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1954            0x10000            0x11000 r--    1   51   5   1 CN-- vn /root/c_tests/a.out
 1954            0x20000            0x21000 r-x    1   51   5   1 CN-- vn /root/c_tests/a.out
 1954            0x30000            0x40000 rw-   16    0   1   0 C--- vn /root/c_tests/a.out
 1954            0x40000            0x41000 r--    0    0   2   0 CN-- sw
 1954            0x41000            0x54000 rw-    0    0   2   0 CN-- sw
 1954         0x40030000         0x4004a000 r-x   26   29  34   0 CN-- vn /libexec/ld-elf.so.1
 1954         0x4004a000         0x40051000 rw-    3    0   1   0 C--- sw
 1954         0x4005a000         0x4005c000 rw-    1    0   1   0 C--- sw
 1954         0x4005c000         0x401b3000 r-x  343  384  61  27 CN-- vn /lib/libc.so.7
 1954         0x401b3000         0x401c2000 ---    0    0   2   0 CN-- df
 1954         0x401c2000         0x401cf000 rw-   13    0   2   0 CN-- vn /lib/libc.so.7
 1954         0x401cf000         0x40201000 rw-    1    1   2   0 CN-- sw
 1954         0x40400000         0x40800000 rw-    0    0   2   0 CN-- sw
 1954     0xfffffffdf000     0xfffffffff000 rw-    0    0   1   0 C--D sw
 1954     0xfffffffff000    0x1000000000000 r-x    1    1  37   0 ---- ph

# procstat -v 1955
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1955            0x10000            0x11000 r--    1   51   5   1 CN-- vn /root/c_tests/a.out
 1955            0x20000            0x21000 r-x    1   51   5   1 CN-- vn /root/c_tests/a.out
 1955            0x30000            0x40000 rw-   16    0   1   0 C--- vn /root/c_tests/a.out
 1955            0x40000            0x41000 r--    0    0   2   0 CN-- sw
 1955            0x41000            0x54000 rw-    0    0   2   0 CN-- sw
 1955         0x40030000         0x4004a000 r-x   26   29  34   0 CN-- vn /libexec/ld-elf.so.1
 1955         0x4004a000         0x40051000 rw-    3    0   1   0 C--- sw
 1955         0x4005a000         0x4005c000 rw-    1    0   1   0 C--- sw
 1955         0x4005c000         0x401b3000 r-x  343  384  61  27 CN-- vn /lib/libc.so.7
 1955         0x401b3000         0x401c2000 ---    0    0   2   0 CN-- df
 1955         0x401c2000         0x401cf000 rw-   13    0   2   0 CN-- vn /lib/libc.so.7
 1955         0x401cf000         0x40201000 rw-    1    1   2   0 CN-- sw
 1955         0x40400000         0x40800000 rw-    0    0   2   0 CN-- sw
 1955     0xfffffffdf000     0xfffffffff000 rw-    0    0   1   0 C--D sw
 1955     0xfffffffff000    0x1000000000000 r-x    1    1  37   0 ---- ph

The core file results are:

# procstat -v /var/crash/a.out.1954.core
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1954            0x10000            0x11000 r--    1   51   3   1 CN-- vn /root/c_tests/a.out
 1954            0x20000            0x21000 r-x    1   51   3   1 CN-- vn /root/c_tests/a.out
 1954            0x30000            0x40000 rw-   16    0   1   0 C--- vn /root/c_tests/a.out
 1954            0x40000            0x41000 r--    1    1   1   0 CN-- sw
 1954            0x41000            0x54000 rw-    4    4   1   0 C--- sw
 1954         0x40030000         0x4004a000 r-x   26   29  30   0 CN-- vn /libexec/ld-elf.so.1
 1954         0x4004a000         0x40051000 rw-    7    7   1   0 C--- sw
 1954         0x4005a000         0x4005c000 rw-    2    2   1   0 C--- sw
 1954         0x4005c000         0x401b3000 r-x  343  384  55  25 CN-- vn /lib/libc.so.7
 1954         0x401b3000         0x401c2000 ---    0    0   1   0 CN-- df
 1954         0x401c2000         0x401cf000 rw-   13    0   1   0 C--- vn /lib/libc.so.7
 1954         0x401cf000         0x40201000 rw-   50   50   1   0 CN-- sw
 1954         0x40400000         0x40800000 rw- 1024 1024   1   0 CN-- sw
 1954     0xfffffffdf000     0xfffffffff000 rw-    3    3   1   0 C--D sw
 1954     0xfffffffff000    0x1000000000000 r-x    1    1  33   0 ---- ph

# procstat -v /var/crash/a.out.1955.core
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1955            0x10000            0x11000 r--    1   51   5   1 CN-- vn /root/c_tests/a.out
 1955            0x20000            0x21000 r-x    1   51   5   1 CN-- vn /root/c_tests/a.out
 1955            0x30000            0x40000 rw-   16    0   1   0 C--- vn /root/c_tests/a.out
 1955            0x40000            0x41000 r--    0    0   2   0 CN-- sw
 1955            0x41000            0x54000 rw-    4    0   1   0 C--- sw
 1955         0x40030000         0x4004a000 r-x   26   29  31   0 CN-- vn /libexec/ld-elf.so.1
 1955         0x4004a000         0x40051000 rw-    4    0   1   0 C--- sw
 1955         0x4005a000         0x4005c000 rw-    2    0   1   0 C--- sw
 1955         0x4005c000         0x401b3000 r-x  343  384  56  25 CN-- vn /lib/libc.so.7
 1955         0x401b3000         0x401c2000 ---    0    0   2   0 CN-- df
 1955         0x401c2000         0x401cf000 rw-   13    0   1   0 C--- vn /lib/libc.so.7
 1955         0x401cf000         0x40201000 rw-    1    1   2   0 CN-- sw
 1955         0x40400000         0x40800000 rw-    1    1   2   0 CN-- sw
 1955     0xfffffffdf000     0xfffffffff000 rw-    1    0   1   0 C--D sw
 1955     0xfffffffff000    0x1000000000000 r-x    1    1  34   0 ---- ph

(In reply to Mark Millard from comment #13)

This was noted in comment #3.
(In reply to Shawn Webb from comment #17)

Had you actually inspected a memory region with known/expected content and confirmed it was all zeros with tcache but not without it? My note does not contradict yours. It just has more analysis/investigation context/detail with it.

I'm now going to check whether:

ln -s 'tcache:false' /etc/malloc.conf

moved where the allocations end up (by forcing a core dump even when the test passes and then looking at it), or whether the same memory region is simply no longer zeroed.

There is suggestive evidence that a relatively fixed region of memory is being trashed across wildly different programs when the processes are swapped-in after a full swap-out. Whatever lands in this area (if anything) ends up zeroed.

(In reply to Mark Millard from comment #18)

Well, I just had a failure for my test program with:

# ls -l /etc/malloc.conf
lrwxr-xr-x  1 root  wheel  12 Mar 18 19:49 /etc/malloc.conf -> tcache:false

in place. So my note in comment 13 is wrong in that such a definition is not sufficient to guarantee that no memory region will end up zeroed on swap-in. This time the memory region started at:

(lldb) print dyn_region
(region *volatile) $0 = 0x000000004060d000

and it ended up all zero. So "sometimes but not always". (Drat.)

(In reply to Mark Millard from comment #19)

Ignore comment 18. I accidentally switched context (a window into the wrong computer). I had been distracted by someone. Too bad I can not delete #18.

(In reply to Mark Millard from comment #20)

I meant to ignore comment #19. However. . . I got rid of my other ssh windows, so that only the ones tied to the pine64+ 2GB remained, and did some more trial tests. And sure enough I have example failures despite having the:

/etc/malloc.conf -> tcache:false

in place. At this point I've no clue what makes the difference for it having worked before vs. not doing so now for my test program in this context.
(In reply to Mark Millard from comment #21)

After rebooting the pine64+ 2GB the:

/etc/malloc.conf -> tcache:false

was again effective: the memory pattern compared fine. The memory had not moved relative to the failing cases:

(lldb) print dyn_region
(region *volatile) $0 = 0x000000004060d000

So apparently FreeBSD can get into a state where the tcache:false does no good (ignored?).

Created attachment 181042 [details]
Program showing posix_madvise with POSIX_MADV_WILLNEED can prevent the failure
posix_madvise(.,.,POSIX_MADV_WILLNEED) in the child process
after the fork but before the sleep/swap-out avoids the failure.
In the parent process it makes no difference.
The .,. arguments are the address and size of the memory region
that is to avoid the problem.
(In reply to Mark Millard from comment #23)

It turns out that POSIX_MADV_WILLNEED can instead be POSIX_MADV_DONTNEED and it still prevents the failures.

(In reply to Mark Millard from comment #24)

Ignore the POSIX_MADV_DONTNEED note: that example child had not reached the zero RES(ident memory) status. When it does, it fails (and so do the ancestor processes if they also reached such).

I have confirmed that -r315870 still trashes memory for the sequence:

allocations (tcache and <= SMALL_MAXCLASS)
fork
sleep/wait then swap-out [zero RES(ident memory)]
swap-in

Created attachment 181257 [details]
Test program with thousands of 14 KiByte allocations
The test program that has thousands of 14 KiByte
allocations usually ends up with them all being
zeroed in the parent and child of the fork.
But I've had a couple of runs where a much smaller
prefix was messed up and then there were normal,
expected values.
#define region_size (14u*1024u)
. . .
#define num_regions (256u*1024u*1024u/region_size)
So num_regions==18724, using up most of 256 MiBytes.
Note: each region has its own 14 KiByte allocation.
But dyn_regions[1296].array[0] in one example was
the first normal value.
In another example dyn_regions[2180].array[4096] was
the first normal value.
The last is interesting for being part way through an
allocation's space. That it nonetheless aligns with a 4
KiByte page size would seem odd for a pure-jemalloc
issue.
Created attachment 181497 [details]
Source for allowing swap-before-fork in test in addition to swap-after-fork
I've discovered that if between initializing the allocated memory
and the fork() the process swaps-out then the problem does not
happen later, even if the parent and child are swapped-out after
the fork() as well.
I've found that for the "Source for allowing" attachment I can make variations that change how much of the allocated region's prefix ends up zero vs. stays good. I vary the sleep time between testing the initialized allocations and doing the fork. The longer the sleep, the more zero pages show up:

# diff swap_testing[56].c
1c1
< // swap_testing5.c
---
> // swap_testing6.c
5c5
< // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing5.c
---
> // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing6.c
33c33
<     sleep(30); // Potentialy force swap-out here.
---
>     sleep(150); // Potentialy force swap-out here.
37a38,48
> // For no-swap-out here cases:
> //
> // The longer the sleep here the more allocations
> // that end up as zero.
> //
> // top's Mem Active, Inact, Wired, Buf, Free and
> // Swap Total, Used, and Free stay unchanged.
> // What does change is the process RES decreases
> // while the process SIZE and SWAP stay unchanged
> // during this sleep.
>

NOTE: On other architectures that I've tried (such as armv6/v7) RES does not decrease during the sleep -- and the problem does not happen even for as long of sleeps as I've tried. (I use "stress -m 2 --vm-bytes 900M" on armv6/v7 instead of -m 1 --vm-bytes 1800M because that large in one process is not allowed.)

So watching top's RES during the sleep (longer than a few seconds) predicts the later fails-vs.-not status: if RES decreases (while other things associated with the process status stay the same) then there will be a failure.

(In reply to Mark Millard from comment #29)

I've been able to identify what code sequence is gradually removing the "small_mappings" via some breakpointing in the kernel after reaching the "should be just sleeping" status. Specifically I started with breakpointing when pmap_resident_count_dec was on the call stack, in order to see the call chain(s) that lead to it being called while RES(ident memory) is gradually decreasing during the sleep that is just before forking.
(tid 100067 is [pagedaemon{pagedaemon}], which is in vm_pageout_worker. bt does not show inlined layers.)

[ thread pid 17 tid 100067 ]
Breakpoint at $x.1: undefined d65f03c0
db> bt
Tracing pid 17 tid 100067 td 0xfffffd0001c4aa00
. . .
handle_el1h_sync() at pmap_remove_l3+0xdc
  pc = 0xffff000000604870  lr = 0xffff000000611158
  sp = 0xffff000083a49980  fp = 0xffff000083a49a40
pmap_remove_l3() at pmap_ts_referenced+0x580
  pc = 0xffff000000611158  lr = 0xffff000000615c50
  sp = 0xffff000083a49a50  fp = 0xffff000083a49ac0
pmap_ts_referenced() at vm_pageout+0xe60
  pc = 0xffff000000615c50  lr = 0xffff0000005d1f74
  sp = 0xffff000083a49ad0  fp = 0xffff000083a49b50
vm_pageout() at fork_exit+0x94
  pc = 0xffff0000005d1f74  lr = 0xffff0000002e01c0
  sp = 0xffff000083a49b60  fp = 0xffff000083a49b90
fork_exit() at fork_trampoline+0x10
  pc = 0xffff0000002e01c0  lr = 0xffff0000006177b4
  sp = 0xffff000083a49ba0  fp = 0x0000000000000000

It turns out that pmap_ts_referenced is on its:

small_mappings:
. . .

path for the above, so the pmap_remove_l3 call is the one from that execution path. (Found by more breakpointing after enabling such on the paths.) So this is the path with (breakpoint hook not shown):

/*
 * Wired pages cannot be paged out so
 * doing accessed bit emulation for
 * them is wasted effort. We do the
 * hard work for unwired pages only.
 */
pmap_remove_l3(pmap, pte, pv->pv_va, tpde,
    &free, &lock);
pmap_invalidate_page(pmap, pv->pv_va);
cleared++;
if (pvf == pv)
        pvf = NULL;
pv = NULL;
. . .

From what I can tell this code is eliminating the content of pages that, in the failing tests, have no backing store yet (not swapped out yet). The observed behavior is that the pages this happens to end up as zero pages once swapped out and back in. I do not see anything putting the pages that this happens to into any other lists to keep track of the page content.
The swap-out and swap-in seem to have ignored these pages and to have been based on automatically zeroed pages instead.

Note that the (or a) question might be if these pages should have ever gotten to this code at all. (I'm no expert overall.) But that might get into why POSIX_MADV_WILLNEED spanning each page is sufficient to avoid the zeros issue for work-then-swapout-and-back-in. I'll only write here about what the backtraced code seems to be doing, if I'm interpreting correctly.

One oddity here is that pmap_remove_l3 does its own pmap_invalidate_page to invalidate the same tlb entry as the above pmap_invalidate_page, so a double-invalidate. (I've no clue if such is just suboptimal vs. a form of error.)

pmap_remove_l3 here does things that the analogous sys/arm/arm/pmap-v6.c's pmap_ts_referenced does not do, and pmap-v6 does something this code does not. arm64's pmap_remove_l3 does (in summary):

pmap_invalidate_page
decrements the resident_count
pmap_unwire_l3

(then pmap_ts_referenced's small_mappings code does another pmap_invalidate_page for the same argument values)

arm pmap-v6's pmap_ts_referenced's small_mappings code does:

conditional vm_page_dirty
pte2_clear_bit for PTE2_A
pmap_tlb_flush

There is, for example, no decrement of the resident_count involved (that I found anyway). But I've no clue just what should be analogous vs. what should not between pmap-v6 and arm64's pmap code in this area.

I'll also note that the code before the arm64 small_mappings code also uses pmap_remove_l3 but does not do the decrement nor the extra pmap_invalidate_page (for example). But again I do not know how analogous the two paths should be. Only the small_mappings path seems to have the end-up-with-zeros problem for the later fork-then-swap-out and then swap-back-in context.
(In reply to Mark Millard from comment #30)

I have extended the ddb based investigation to which pmap_ts_referenced call in vm_pageout_scan is involved in the call chain that gets to pmap_resident_count_dec. It is the 2nd one, the one in the loop:

/*
 * Scan the active queue for pages that can be deactivated.  Update
 * the per-page activity counter and use it to identify deactivation
 * candidates.  Held pages may be deactivated.
 */
for (m = TAILQ_FIRST(&pq->pq_pl), scanned = 0; m != NULL &&
    (scanned < min_scan ||
    (inactq_shortage > 0 && scanned < maxscan)); m = next,
    scanned++) {
. . .

Also, in the code before the pmap_ts_referenced call:

/*
 * Check to see "how much" the page has been used.
 */
if ((m->aflags & PGA_REFERENCED) != 0) {
        vm_page_aflag_clear(m, PGA_REFERENCED);
        act_delta = 1;
} else
        act_delta = 0;

act_delta is being set to zero. So the PGA_REFERENCED bit of m->aflags was zero. ("page is referenced" being false.)

So the earlier noted code in pmap_ts_referenced:

pmap_remove_l3(pmap, pte, pv->pv_va, tpde, &free, &lock);
pmap_invalidate_page(pmap, pv->pv_va);

for this context seems to be taking active but unreferenced pages and getting rid of them, to the point of dropping them from the resident_count (despite the only copies of the content being in RAM).

I'll note that the 2nd edition of the Design and Implementation book does not say that the resident memory count is decremented by pmap_ts_referenced. Nor does it indicate that tlb entries are invalidated by/during pmap_ts_referenced. Instead it talks of updating the "active count for a page" and says "it uses pmap_ts_referenced() to count and clear the number of uses of the page since it was last scanned". It looks like the book is out of date in this area.

It looks like the code's general structure goes back to amd64's -r256072 from 2013-Oct-5. But the check-in comments do not seem to cover this aspect of the changes. The overall summary was "Merge projects/bhyve_npt_pmap into head".
The observed (-r315914) behavior as an amd64 virtualbox guest is that the RES(ident memory) does not decrease like it does on arm64. (But that amd64 virtualbox guest is also the only place I use zfs, complicating the comparison. I do not boot any amd64 hardware with FreeBSD [yet?].)

Another amd64 vs. arm64 difference is that when the amd64 test forks, the child immediately has the same RES size as the parent (large). By contrast, for arm64 the child RES is tiny until the child tests the pages after it finishes sleeping. Even giving the FreeBSD guest 2 GiBytes to match the bpim3/arm64 context for RAM does not make it behave like the bpim3 (or a rpi3).

Konstantin Belousov provided a patch that seems to have fixed the issue. . .

On 2017-Apr-9, at 5:27 AM, Konstantin Belousov <kostikbel@gmail.com> wrote:

Hmm, could you try the following patch, I did not even compiled it.

diff --git a/sys/arm64/arm64/pmap.c b/sys/arm64/arm64/pmap.c
index 3d5756ba891..55aa402eb1c 100644
--- a/sys/arm64/arm64/pmap.c
+++ b/sys/arm64/arm64/pmap.c
@@ -2481,6 +2481,11 @@ pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot)
 		    sva += L3_SIZE) {
 			l3 = pmap_load(l3p);
 			if (pmap_l3_valid(l3)) {
+				if ((l3 & ATTR_SW_MANAGED) &&
+				    pmap_page_dirty(l3)) {
+					vm_page_dirty(PHYS_TO_VM_PAGE(l3 &
+					    ~ATTR_MASK));
+				}
 				pmap_set(l3p, ATTR_AP(ATTR_AP_RO));
 				PTE_SYNC(l3p);
 				/* XXX: Use pmap_invalidate_range */

Preliminary testing indicates that this fixes the some-pages-become-zero problem for fork-then-swapout/in. Thanks!

I'll see if a buildworld can go through without being stopped by this type of issue. But that will take a while. (It is how I originally ran into the problem(s) that others had been reporting on the lists.)

(In reply to Mark Millard from comment #32)

buildworld buildkernel completed non-stop for the first time on a BPI-M3 board. Looks good for a check-in to svn to me (head and stable/11).
This, combined with 2017-Feb-15's -r313772 fix to the fork trampoline code's updating of sp_el0, makes arm64 far more stable for my purposes. -r313772 was never MFC'd to stable/11. In my view it should be.

The same fix for this (217239) also fixes bugzilla 217138, so I'm marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 217138 ***

A commit references this bug:

Author: kib
Date: Mon Apr 10 15:32:26 UTC 2017
New revision: 316679
URL: https://svnweb.freebsd.org/changeset/base/316679

Log:
  Do not lose dirty bits for removing PROT_WRITE on arm64.

  Arm64 pmap interprets accessed writable ptes as modified, since ARMv8.0
  does not track Dirty Bit Modifier in hardware. If writable bit is
  removed, page must be marked as dirty for MI VM.

  This change is most important for COW, where fork caused losing content
  of the dirty pages which were not yet scanned by pagedaemon.

  Reviewed by:	alc, andrew
  Reported and tested by:	Mark Millard <markmi@dsl-only.net>
  PR:	217138, 217239
  Sponsored by:	The FreeBSD Foundation
  MFC after:	2 weeks

Changes:
  head/sys/arm64/arm64/pmap.c

(In reply to Mark Millard from comment #33)

I had been thinking of the BPI-M3 for other reasons and typed that instead of the correct: Pine64+ 2GB. (True elsewhere as well.) I do really mean arm64 here, not armv7.

(In reply to commit-hook from comment #35)

stable/11 -r317354 completes the fixes there:

Author: kib
Date: Mon Apr 24 07:52:44 2017
New Revision: 317354
URL: https://svnweb.freebsd.org/changeset/base/317354

Log:
  MFC r316679: Do not lose dirty bits for removing PROT_WRITE on arm64.

Modified:
  stable/11/sys/arm64/arm64/pmap.c
Directory Properties:
  stable/11/ (props changed)
. . .

See: https://lists.freebsd.org/pipermail/svn-src-stable-11/2017-April/003041.html

So now the old rpi3 wiki page material is only for masking some of the problems in older contexts for 12 or for older contexts for 11.

Note: -r313772 from head is now -r317147 in stable/11.
So interrupts will no longer trash the sp_el0 register during the fork-trampoline code. This was the other issue with fork behavior that caused me build problems. With both fixes in place for head I no longer have problems with buildworld and the like in head. [I've not been running 11, just 12.]