Bug 217239 - head (e.g.:) -r315870 arm64 vs. jemalloc without MALLOC_PRODUCTION: various examples of tbin->avail being zero lead to SIGSEGV's; fix is a kernel patch (comment 32)
Summary: head (e.g.:) -r315870 arm64 vs. jemalloc without MALLOC_PRODUCTION: various examples of tbin->avail being zero lead to SIGSEGV's; fix is a kernel patch (comment 32)
Status: Closed DUPLICATE of bug 217138
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: CURRENT
Hardware: arm64 Any
: --- Affects Only Me
Assignee: freebsd-arm (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-19 22:20 UTC by Mark Millard
Modified: 2017-04-25 00:14 UTC (History)
4 users

See Also:


Attachments
~132 line program to demonstrate memory corruptions (4.46 KB, text/plain)
2017-03-15 06:38 UTC, Mark Millard
Program showing posix_madvise with POSIX_MADV_WILLNEED can prevent the failure (4.09 KB, text/plain)
2017-03-22 01:46 UTC, Mark Millard
Test program with thousands of 14 KiByte allocations (4.78 KB, text/plain)
2017-03-28 03:40 UTC, Mark Millard
Source for allowing swap-before-fork in test in addition to swap-after-fork (7.32 KB, text/plain)
2017-04-05 02:16 UTC, Mark Millard

Description Mark Millard 2017-02-19 22:20:10 UTC
Now that the fork trampoline for arm64 no longer allows
interrupts to mess up the stack pointer, things run longer
and other issues show up. This report is for builds without
MALLOC_PRODUCTION defined during buildworld. The kernel
build is production style.

[I've not tried the contrasting case of having
MALLOC_PRODUCTION defined. I'll also note that
I tried powerpc64 without MALLOC_PRODUCTION and
had no problems: this seems arm64 (aarch64)
specific.]

I've accumulated examples of each of the following
getting SIGSEGV in jemalloc code and producing core
files:

script
powerpd
su

(Note: I'm primarily building things from the console
so the variety of activity is fairly limited.)

From register values it appears that in each
case tbin->avail==0 and the calculations subtract a
positive number from that.

All the script examples look like (e.g.):

(lldb) bt
* thread #1: tid = 100143, 0x00000000404e9f08 libc.so.7`__je_tcache_dalloc_large(tsd=0x00000000405fd010, tcache=0x0000000040a0d000, ptr=0x0000000040a25600, size=<unavailable>, slow_path=<unavailable>) + 228 at tcache.h:451, name = 'script', stop reason = signal SIGSEGV
  * frame #0: 0x00000000404e9f08 libc.so.7`__je_tcache_dalloc_large(tsd=0x00000000405fd010, tcache=0x0000000040a0d000, ptr=0x0000000040a25600, size=<unavailable>, slow_path=<unavailable>) + 228 at tcache.h:451
    frame #1: 0x0000000040510cfc libc.so.7`__free(ptr=0x0000000040a25600) + 124 at jemalloc_jemalloc.c:2016
    frame #2: 0x000000004058c5d8 libc.so.7`cleanfile(fp=0x00000000405e4cf0, c=<unavailable>) + 96 at fclose.c:62
    frame #3: 0x000000004058c69c libc.so.7`fclose(fp=0x00000000405e4cf0) + 60 at fclose.c:134
    frame #4: 0x000000000040255c script`done(eno=0) + 268 at script.c:375
    frame #5: 0x000000000040218c script`main [inlined] finish + 2772 at script.c:323
    frame #6: 0x0000000000402154 script`main(argc=<unavailable>, argv=<unavailable>) + 2716 at script.c:299
    frame #7: 0x0000000000401610 script`__start + 360
    frame #8: 0x0000000040414658 ld-elf.so.1`.rtld_start + 24 at rtld_start.S:41
(lldb) down
frame #0: 0x00000000404e9f08 libc.so.7`__je_tcache_dalloc_large(tsd=0x00000000405fd010, tcache=0x0000000040a0d000, ptr=0x0000000040a25600, size=<unavailable>, slow_path=<unavailable>) + 228 at tcache.h:451
   448 		}
   449 		assert(tbin->ncached < tbin_info->ncached_max);
   450 		tbin->ncached++;
-> 451 		*(tbin->avail - tbin->ncached) = ptr;
   452 	
   453 		tcache_event(tsd, tcache);
   454 	}

They are from long running builds with lots of output
in the typescript generated and the crash happens during
the cleanup at the end.
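
For reference, here is a minimal sketch (an assumed struct shape with hypothetical names, not jemalloc's actual declarations) of why a zeroed tcache bin faults on that tcache.h:451 store:

/* Hypothetical shape, for illustration only. */
#include <stddef.h>

typedef struct {
    int    ncached;
    void **avail;   /* normally points just past the bin's stack of cached pointers */
} tcache_bin_sketch_t;

static void
dalloc_sketch(tcache_bin_sketch_t *tbin, void *ptr)
{
    tbin->ncached++;
    /* With the whole bin zeroed out, avail == NULL and ncached == 1, so
     * this stores through (void **)NULL - 1, an unmapped address: SIGSEGV. */
    *(tbin->avail - tbin->ncached) = ptr;
}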

All the powerd examples look like (e.g.):

(lldb) bt
* thread #1: tid = 100099, 0x00000000404eaa10 libc.so.7`__je_tcache_dalloc_small(tsd=0x00000000405fe010, tcache=0x0000000040a0d000, ptr=0x0000000040a1e000, binind=2, slow_path=<unavailable>) + 164 at tcache.h:421, name = 'powerd', stop reason = signal SIGSEGV
  * frame #0: 0x00000000404eaa10 libc.so.7`__je_tcache_dalloc_small(tsd=0x00000000405fe010, tcache=0x0000000040a0d000, ptr=0x0000000040a1e000, binind=2, slow_path=<unavailable>) + 164 at tcache.h:421
    frame #1: 0x0000000040511cfc libc.so.7`__free(ptr=0x0000000040a1e000) + 124 at jemalloc_jemalloc.c:2016
    frame #2: 0x000000000040201c powerd`main(argc=<unavailable>, argv=<unavailable>) + 3332 at powerd.c:786
    frame #3: 0x0000000000401270 powerd`__start + 360
    frame #4: 0x0000000040415658 ld-elf.so.1`.rtld_start + 24 at rtld_start.S:41
(lldb) down
frame #0: 0x00000000404eaa10 libc.so.7`__je_tcache_dalloc_small(tsd=0x00000000405fe010, tcache=0x0000000040a0d000, ptr=0x0000000040a1e000, binind=2, slow_path=<unavailable>) + 164 at tcache.h:421
   418 		}
   419 		assert(tbin->ncached < tbin_info->ncached_max);
   420 		tbin->ncached++;
-> 421 		*(tbin->avail - tbin->ncached) = ptr;
   422 	
   423 		tcache_event(tsd, tcache);
   424 	}

So very similar to the script failures: these are during the
cleanup at the end (but dalloc large vs. small).

All the su examples look like (e.g.):

(lldb) bt
* thread #1: tid = 100156, 0x000000004054b1dc libc.so.7`__je_arena_tcache_fill_small(tsdn=<unavailable>, arena=<unavailable>, tbin=<unavailable>, binind=<unavailable>, prof_accumbytes=<unavailable>) + 212 at jemalloc_arena.c:2442, name = 'su', stop reason = signal SIGSEGV
  * frame #0: 0x000000004054b1dc libc.so.7`__je_arena_tcache_fill_small(tsdn=<unavailable>, arena=<unavailable>, tbin=<unavailable>, binind=<unavailable>, prof_accumbytes=<unavailable>) + 212 at jemalloc_arena.c:2442
    frame #1: 0x000000004052e5a0 libc.so.7`__je_tcache_alloc_small [inlined] __je_tcache_alloc_small_hard(tsdn=<unavailable>, arena=0x0000000040800140, tbin=0x0000000040a0d0a8, binind=4) + 20 at jemalloc_tcache.c:79
    frame #2: 0x000000004052e58c libc.so.7`__je_tcache_alloc_small(tsd=0x0000000040647010, arena=0x0000000040800140, tcache=0x0000000040a0d000, size=<unavailable>, binind=4, zero=false, slow_path=true) + 332 at tcache.h:298
    frame #3: 0x0000000040555184 libc.so.7`__malloc(size=1) + 184 at jemalloc_jemalloc.c:1645
    frame #4: 0x000000004046979c libpam.so.6`openpam_vasprintf(str=0x0000ffffffffe520, fmt="", ap=<unavailable>) + 92 at openpam_vasprintf.c:53
    frame #5: 0x0000000040469714 libpam.so.6`openpam_asprintf(str=<unavailable>, fmt=<unavailable>) + 120 at openpam_asprintf.c:52
    frame #6: 0x000000004046960c libpam.so.6`_openpam_log(level=<unavailable>, func="", fmt="") + 224 at openpam_log.c:125
    frame #7: 0x0000000040466914 libpam.so.6`openpam_dispatch(pamh=<unavailable>, primitive=<unavailable>, flags=<unavailable>) + 1256 at openpam_dispatch.c:182
    frame #8: 0x0000000040463b54 libpam.so.6`pam_setcred(pamh=0x0000000040a44000, flags=2) + 112 at pam_setcred.c:66
    frame #9: 0x0000000040b77730 su`main(argc=<unavailable>, argv=<unavailable>) + 2280 at su.c:475
    frame #10: 0x0000000040b76da0 su`__start + 360
    frame #11: 0x0000000040415658 ld-elf.so.1`.rtld_start + 24 at rtld_start.S:41
(lldb) down
frame #0: 0x000000004054b1dc libc.so.7`__je_arena_tcache_fill_small(tsdn=<unavailable>, arena=<unavailable>, tbin=<unavailable>, binind=<unavailable>, prof_accumbytes=<unavailable>) + 212 at jemalloc_arena.c:2442
   2439				    true);
   2440			}
   2441			/* Insert such that low regions get used first. */
-> 2442			*(tbin->avail - nfill + i) = ptr;
   2443		}
   2444		if (config_stats) {
   2445			bin->stats.nmalloc += i;

So not as similar, but also during cleanup (of the parent
process of the fork) during PAM_END() before exit.

See also bugzilla 217138.
Comment 1 Mark Millard 2017-02-19 22:33:10 UTC
I did not make clear one important point:

These are occasional failures during my experimental
builds. I can not make them fail on demand.

At this point I've no clue what may control the
intermittent nature of the behavior.
Comment 2 Mark Millard 2017-02-19 22:33:10 UTC
I did not make clear one important point:

These are occasional failures during my experimental
builds. I can not make them fail on demand.

At this point I've no clue what may control the
intermittent nature of the behavior.
Comment 3 Shawn Webb 2017-02-19 22:35:08 UTC
I've noticed disabling tcache via malloc.conf provides a stable workaround until the underlying cause of the problem is found and fixed.
Comment 4 Mark Millard 2017-02-19 22:50:17 UTC
[I have no clue why my prior comment showed up twice.]

I also forgot to list the counts of example core files:

script: 3
powerd: 3
su: 6

Side note. . .

FYI:

I've been deliberately gathering evidence about
the failures. So for now I'm not disabling tcache
in general.

The sh jemalloc asserts that I separately reported
in 217138 are not tied to tcache as far as I can
tell. That is one of the reasons that I split the
reports.
Comment 5 Mark Millard 2017-02-19 22:53:42 UTC
Dumb typo fix:

powerd

not powerpd.

(I've typed powerpc a lot and it shows.)
Comment 6 Mark Millard 2017-02-20 21:30:35 UTC
It turns out that for the script and powerd examples the whole
tsd->tcache[0] through tsd->tcache[40] area (__je_nhbins==41) is
all zeros. That includes the lg_fill_div members of each.

For the su examples there is a next_gc_bin in
tsd->tcache[2] that is non-zero:

(lldb) print/x tsd->tcache[2]
(tcache_t) $1 = {
  link = {
    qre_next = 0x0000000000000000
    qre_prev = 0x0000000000000000
  }
  prof_accumbytes = 0x0000000000000000
  gc_ticker = (tick = 0x00000000, nticks = 0x00000000)
  next_gc_bin = 0xffffffff
  tbins = {
    [0] = {
      tstats = (nrequests = 0x0000000000000000)
      low_water = 0x00000000
      lg_fill_div = 0x00000000
      ncached = 0x00000000
      avail = 0x0000000000000000
    }
  }
}

But the rest of the tsd->tcache area is zero.
Comment 7 Mark Millard 2017-02-24 03:28:36 UTC
(In reply to Mark Millard from comment #6)

A note on script failures:

I'll note that such script failures are
part of the overall activity reported
in bugzilla 217138. 217138's report is
biased to the sh failures, in part
because they happen first. But script
is involved.
Comment 8 Mark Millard 2017-02-28 05:08:18 UTC
(In reply to Mark Millard from comment #7)

A note on various su failures:

At least some of these are from the bugzilla 217138
context as well.

This means that those have the swapped-out then
swapped-in context involved, just like the tsd_booted
assert failures for sh.

Again the su.*.core files have a large area of zeros
around the zero tbin->avail . Again it looks like the
swapping activity did not re-establish the correct
memory content in the area.

As for script: it was involved in the buildworld
context that was causing swapping so it too has
this kind of context.

(For powerd I'm less sure what its context was
like relative to swapping.)
Comment 9 Mark.Martinec 2017-03-15 00:27:40 UTC
It's possibly/likely unrelated, but the symptoms described here are
quite similar to what was happening on FreeBSD 10.0 amd64 (fixed
in 10.1 by an unidentified change):

  https://rt.perl.org/Public/Bug/Display.html?id=122199#txn-1328406

  - a child process after doing some work (possibly touching swap)
    does a normal exit;
  - a parent process gets a SIGCHLD signal, handles a wait() and
    for some obscure reason some of its memory gets corrupted;
  - a parent process forks creating a new worker child process,
    which inherits corrupted sections of parent's memory,
    consequently later leading to its (child) crash if it happens
    to use that part of the memory (opcodes or data structures)
    during its normal work. Any newly born child process inherits
    the same memory corruption and crashes alike.
Comment 10 Mark Millard 2017-03-15 01:33:47 UTC
(In reply to Mark.Martinec from comment #9)

There is a new example using a ~110 line C
program source that shows corruption, see:

https://lists.freebsd.org/pipermail/freebsd-arm/2017-March/015872.html

This involves fork then swap-out/swap-in. So
far it is the simplest failing context that
I've identified.

But it turns out that adding an access to
the memory in question between the fork and
the wait/sleep (and so the later swap-out)
prevents the failure:

https://lists.freebsd.org/pipermail/freebsd-arm/2017-March/015875.html

Some other notes are in:

https://lists.freebsd.org/pipermail/freebsd-arm/2017-March/015877.html

Bugzilla 217138 is related and may trace back
to the same cause(s) as this bugzilla report:
both end up with zeroed memory areas. 217138
has a version of the ~110 lines of code in
an attachment.
Comment 11 Mark Millard 2017-03-15 06:38:35 UTC
Created attachment 180835 [details]
~132 line program to demonstrate memory corruptions

[This updates the attachment I put in 217138. In some ways
217239 is a better match.]

See:

https://lists.freebsd.org/pipermail/freebsd-arm/2017-March/015880.html

for the specifics of this variant of the test
for memory corruptions associated with fork then wait/sleep
and then swap-out then swap-in.

The source is something like 132 lines long.

The behavior shows up in aligned 4 KiByte sized units,
suggesting page-sized units, not cache lines
or other such smaller units.

See the comments tied to partial_test_check() and its use.
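
For orientation, here is a minimal sketch of the test's overall structure (names, sizes, and sleep length are illustrative; the real ~132 line attachment does more careful pattern checking, including partial_test_check()):

#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define REGION_SIZE (4u*1024u + 1u)     /* one of the sizes that failed */

int
main(void)
{
    unsigned char *region = malloc(REGION_SIZE);
    if (region == NULL) return 1;
    memset(region, 0xA5, REGION_SIZE);  /* known, non-zero pattern */

    pid_t pid = fork();
    sleep(300);   /* window for a swap-out under memory pressure
                     (e.g. from a big buildworld or a stress run) */

    size_t bad = 0;
    for (size_t i = 0; i < REGION_SIZE; i++)
        if (region[i] != 0xA5) bad++;
    printf("%s: %zu corrupted byte(s)\n", pid == 0 ? "child" : "parent", bad);
    return bad != 0;
}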
Comment 12 Mark Millard 2017-03-17 20:13:20 UTC
In my test program I reported:

#define region_size (4u*1024u+1u)
                        // Bad dyn_region pattern, parent and child
                        // processes:
                        //     256u,  2u*1024u,  4u*1024u, 8u*1024u,
                        // 9u*1024u, 12u*1024u, 14u*1024u
                        // (but see the partial_test_check() call
                        //  notes above).

                        // Works:
                        // 14u*1024u+1u, 15u*1024u, 16u*1024u,
                        // 32u*1024u, 256u*1024u*1024u

It appears that the
14u*1024u vs. 14u*1024u+1u boundary matches
up with jemalloc:

contrib/jemalloc/include/jemalloc/internal/size_classes.h :

. . . size == 1<<lg_grp + ndelta<<lg_delta
. . .\
 *     bin: 'yes' if a small bin size class, 'no' otherwise.
. . .
#if (LG_SIZEOF_PTR == 3 && LG_TINY_MIN == 3 && LG_QUANTUM == 3 && LG_PAGE == 12)
#define SIZE_CLASSES \
  /* index, lg_grp, lg_delta, ndelta, psz, bin, lg_delta_lookup */ \
. . .
    SC( 38,     13,       11,      3,  no, yes, no) \
    SC( 39,     13,       11,      4, yes,  no, no) \
. . .
#define SMALL_MAXCLASS          ((((size_t)1) << 13) + (((size_t)3) << 11))

So SMALL_MAXCLASS==(14u*1024u).
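(That is, (1 << 13) + (3 << 11) = 8192 + 6144 = 14336 = 14u*1024u bytes.)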

The problem in the example program is tied to small bin size
handling in jemalloc.
Comment 13 Mark Millard 2017-03-18 23:00:39 UTC
As far as my test program goes using

ln -s 'tcache:false' /etc/malloc.conf

prevents the memory corruptions for the
fork-then-swap-out-then-swap-in sequence.
Comment 14 Mark Millard 2017-03-18 23:05:47 UTC
Note: I now have access to both a pine64+ 2GB and
a rpi3 with no hardware in common (USB hub or SSD
or such). (The rpi3 has less RAM and so does not
need as big of a stress run.)

The behaviors happen in both contexts: it was not
some sort of pine64+ 2GB quirk (or the hub/SSD that
I typically use in that context).
Comment 15 Mark Millard 2017-03-18 23:20:55 UTC
(In reply to Mark Millard from comment #13)

Being explicit about the combined conditions for failure
for my test program:

Both tcache enabled and allocations fitting in SMALL_MAXCLASS
are required in order to make the program fail.

Note:

(lldb) print __je_tcache_maxclass
(size_t) $0 = 32768

which is larger than SMALL_MAXCLASS. I've not observed
failures for sizes above SMALL_MAXCLASS but not exceeding
__je_tcache_maxclass.

Thus tcache use by itself does not seem sufficient for
my program to get corruption of its dynamically allocated
memory: the small allocation size also matters.
Comment 16 Mark Millard 2017-03-19 00:16:47 UTC
Here are procstat -v results from while the two
processes from the fork of my test program
have already been swapped-out (before they
run and fail during swap-in). I also show
the results from looking at the matching
core files.

The start of the address range that
ends up all-zero (when none of it should
be) is:

(lldb) print dyn_region
(region *volatile) $0 = 0x0000000040616000

# procstat -v 1954
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1954            0x10000            0x11000 r--    1   51   5   1 CN-- vn /root/c_tests/a.out
 1954            0x20000            0x21000 r-x    1   51   5   1 CN-- vn /root/c_tests/a.out
 1954            0x30000            0x40000 rw-   16    0   1   0 C--- vn /root/c_tests/a.out
 1954            0x40000            0x41000 r--    0    0   2   0 CN-- sw 
 1954            0x41000            0x54000 rw-    0    0   2   0 CN-- sw 
 1954         0x40030000         0x4004a000 r-x   26   29  34   0 CN-- vn /libexec/ld-elf.so.1
 1954         0x4004a000         0x40051000 rw-    3    0   1   0 C--- sw 
 1954         0x4005a000         0x4005c000 rw-    1    0   1   0 C--- sw 
 1954         0x4005c000         0x401b3000 r-x  343  384  61  27 CN-- vn /lib/libc.so.7
 1954         0x401b3000         0x401c2000 ---    0    0   2   0 CN-- df 
 1954         0x401c2000         0x401cf000 rw-   13    0   2   0 CN-- vn /lib/libc.so.7
 1954         0x401cf000         0x40201000 rw-    1    1   2   0 CN-- sw 
 1954         0x40400000         0x40800000 rw-    0    0   2   0 CN-- sw 
 1954     0xfffffffdf000     0xfffffffff000 rw-    0    0   1   0 C--D sw 
 1954     0xfffffffff000    0x1000000000000 r-x    1    1  37   0 ---- ph 
# procstat -v 1955
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1955            0x10000            0x11000 r--    1   51   5   1 CN-- vn /root/c_tests/a.out
 1955            0x20000            0x21000 r-x    1   51   5   1 CN-- vn /root/c_tests/a.out
 1955            0x30000            0x40000 rw-   16    0   1   0 C--- vn /root/c_tests/a.out
 1955            0x40000            0x41000 r--    0    0   2   0 CN-- sw 
 1955            0x41000            0x54000 rw-    0    0   2   0 CN-- sw 
 1955         0x40030000         0x4004a000 r-x   26   29  34   0 CN-- vn /libexec/ld-elf.so.1
 1955         0x4004a000         0x40051000 rw-    3    0   1   0 C--- sw 
 1955         0x4005a000         0x4005c000 rw-    1    0   1   0 C--- sw 
 1955         0x4005c000         0x401b3000 r-x  343  384  61  27 CN-- vn /lib/libc.so.7
 1955         0x401b3000         0x401c2000 ---    0    0   2   0 CN-- df 
 1955         0x401c2000         0x401cf000 rw-   13    0   2   0 CN-- vn /lib/libc.so.7
 1955         0x401cf000         0x40201000 rw-    1    1   2   0 CN-- sw 
 1955         0x40400000         0x40800000 rw-    0    0   2   0 CN-- sw 
 1955     0xfffffffdf000     0xfffffffff000 rw-    0    0   1   0 C--D sw 
 1955     0xfffffffff000    0x1000000000000 r-x    1    1  37   0 ---- ph 

The core file results are:

# procstat -v /var/crash/a.out.1954.core
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1954            0x10000            0x11000 r--    1   51   3   1 CN-- vn /root/c_tests/a.out
 1954            0x20000            0x21000 r-x    1   51   3   1 CN-- vn /root/c_tests/a.out
 1954            0x30000            0x40000 rw-   16    0   1   0 C--- vn /root/c_tests/a.out
 1954            0x40000            0x41000 r--    1    1   1   0 CN-- sw 
 1954            0x41000            0x54000 rw-    4    4   1   0 C--- sw 
 1954         0x40030000         0x4004a000 r-x   26   29  30   0 CN-- vn /libexec/ld-elf.so.1
 1954         0x4004a000         0x40051000 rw-    7    7   1   0 C--- sw 
 1954         0x4005a000         0x4005c000 rw-    2    2   1   0 C--- sw 
 1954         0x4005c000         0x401b3000 r-x  343  384  55  25 CN-- vn /lib/libc.so.7
 1954         0x401b3000         0x401c2000 ---    0    0   1   0 CN-- df 
 1954         0x401c2000         0x401cf000 rw-   13    0   1   0 C--- vn /lib/libc.so.7
 1954         0x401cf000         0x40201000 rw-   50   50   1   0 CN-- sw 
 1954         0x40400000         0x40800000 rw- 1024 1024   1   0 CN-- sw 
 1954     0xfffffffdf000     0xfffffffff000 rw-    3    3   1   0 C--D sw 
 1954     0xfffffffff000    0x1000000000000 r-x    1    1  33   0 ---- ph 
# procstat -v /var/crash/a.out.1955.core
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1955            0x10000            0x11000 r--    1   51   5   1 CN-- vn /root/c_tests/a.out
 1955            0x20000            0x21000 r-x    1   51   5   1 CN-- vn /root/c_tests/a.out
 1955            0x30000            0x40000 rw-   16    0   1   0 C--- vn /root/c_tests/a.out
 1955            0x40000            0x41000 r--    0    0   2   0 CN-- sw 
 1955            0x41000            0x54000 rw-    4    0   1   0 C--- sw 
 1955         0x40030000         0x4004a000 r-x   26   29  31   0 CN-- vn /libexec/ld-elf.so.1
 1955         0x4004a000         0x40051000 rw-    4    0   1   0 C--- sw 
 1955         0x4005a000         0x4005c000 rw-    2    0   1   0 C--- sw 
 1955         0x4005c000         0x401b3000 r-x  343  384  56  25 CN-- vn /lib/libc.so.7
 1955         0x401b3000         0x401c2000 ---    0    0   2   0 CN-- df 
 1955         0x401c2000         0x401cf000 rw-   13    0   1   0 C--- vn /lib/libc.so.7
 1955         0x401cf000         0x40201000 rw-    1    1   2   0 CN-- sw 
 1955         0x40400000         0x40800000 rw-    1    1   2   0 CN-- sw 
 1955     0xfffffffdf000     0xfffffffff000 rw-    1    0   1   0 C--D sw 
 1955     0xfffffffff000    0x1000000000000 r-x    1    1  34   0 ---- ph
Comment 17 Shawn Webb 2017-03-19 02:03:22 UTC
(In reply to Mark Millard from comment #13)
This was noted in comment #3.
Comment 18 Mark Millard 2017-03-19 02:48:07 UTC
(In reply to Shawn Webb from comment #17)

Had you actually inspected a memory region with
known/expected content and confirmed it was all
zeros instead with tcache but not without it?

My note does not contradict yours. It just has
more analysis/investigation context/detail with
it.

I'm now going to check whether:

ln -s 'tcache:false' /etc/malloc.conf

moved where the allocations end up, by
forcing a core dump even when the test
passes and then looking at it, or whether
the same memory region is simply no longer
zeroed.

There is suggestive evidence that a relatively fixed
region of memory is being trashed across wildly
different programs when the processes are swapped-in
after a full swap-out. Whatever lands in this area
(if anything) ends up zeroed.
Comment 19 Mark Millard 2017-03-19 03:01:57 UTC
(In reply to Mark Millard from comment #18)

Well, I just had a failure for my test program
with:

# ls -l /etc/malloc.conf 
lrwxr-xr-x  1 root  wheel  12 Mar 18 19:49 /etc/malloc.conf -> tcache:false

in place.

So my note 13 is wrong in that such a definition is
not sufficient to guarantee that no memory region will
end up zeroed on swap-in.

This time the memory region started at:

(lldb) print dyn_region
(region *volatile) $0 = 0x000000004060d000

and it ended up all zero.

So "sometimes but not always". (Drat.)
Comment 20 Mark Millard 2017-03-19 03:10:12 UTC
(In reply to Mark Millard from comment #19)

Ignore comment 18. I accidentally switched context
(window into the wrong computer). I had been
distracted by someone.

Too bad I can not delete #18.
Comment 21 Mark Millard 2017-03-19 03:32:19 UTC
(In reply to Mark Millard from comment #20)

I meant to ignore comment #19. However. . .

I got rid of my other ssh windows, leaving only
the ones tied to the pine64+ 2GB, and did some
more trial tests. And sure enough I have example
failures despite having the:

/etc/malloc.conf -> tcache:false

in place.

At this point I've no clue what makes the
difference for it having worked before vs.
not doing so now for my test program in
this context.
Comment 22 Mark Millard 2017-03-19 03:53:31 UTC
(In reply to Mark Millard from comment #21)

After rebooting the pine64+ 2GB the:

/etc/malloc.conf -> tcache:false

was again effective: the memory
pattern compared fine.

The memory had not moved relative to the
failing cases:

(lldb) print dyn_region
(region *volatile) $0 = 0x000000004060d000

So apparently FreeBSD can get into a state
where the tcache:false does no good (ignored?).
Comment 23 Mark Millard 2017-03-22 01:46:45 UTC
Created attachment 181042 [details]
Program showing posix_madvise with POSIX_MADV_WILLNEED can prevent the failure

posix_madvise(.,.,POSIX_MADV_WILLNEED) in the child process
after the fork but before the sleep/swap-out avoids the failure.

In the parent process it makes no difference.

The .,. stands for the address and size of the memory region that is
to avoid the problem.
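
As a sketch of where the call goes (illustrative names; the attached program is the authoritative version), the child does something like the following between the fork() and the sleep:

#include <stddef.h>
#include <sys/mman.h>

/* Called in the child only, after fork() and before the long sleep;
 * 'region' and 'region_size' stand for the test's dynamically
 * allocated area and its size. */
static void
advise_region(void *region, size_t region_size)
{
    (void)posix_madvise(region, region_size, POSIX_MADV_WILLNEED);
}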
Comment 24 Mark Millard 2017-03-22 07:19:14 UTC
(In reply to Mark Millard from comment #23)

It turns out that POSIX_MADV_WILLNEED can instead
be POSIX_MADV_DONTNEED and it still prevents the
failures.
Comment 25 Mark Millard 2017-03-22 07:30:52 UTC
(In reply to Mark Millard from comment #24)

Ignore the POSIX_MADV_DONTNEED result: the example child
had not reached the zero RES(ident memory) status.
When it does reach it, it fails (and so do the ancestor
processes if they also reached such).
Comment 26 Mark Millard 2017-03-24 08:16:50 UTC
I have confirmed that -r315870 still trashes memory
for the sequence:

allocations (tcache and <= SMALL_MAXCLASS)
fork
sleep/wait then swap-out [zero RES(ident memory)]
swap-in
Comment 27 Mark Millard 2017-03-28 03:40:40 UTC
Created attachment 181257 [details]
Test program with thousands of 14 KiByte allocations

The test program that has thousands of 14 KiByte
allocations usually ends up with them all being
zeroed in the parent and child of the fork.

But I've had a couple of runs where a much smaller
prefix was messed up and then there were normal,
expected values.

#define region_size (14u*1024u)
. . .
#define num_regions (256u*1024u*1024u/region_size)

So num_regions==18724, using up most of 256 MiBytes.

Note: each region has its own 14 KiByte allocation.

But dyn_regions[1296].array[0] in one example was
the first normal value.

In another example dyn_regions[2180].array[4096] was
the first normal value.

The last is interesting for being part way through
an allocation's space. Being part way through, yet
aligned to a 4 KiByte page boundary, would seem odd
for a pure-jemalloc issue.
Comment 28 Mark Millard 2017-04-05 02:16:50 UTC
Created attachment 181497 [details]
Source for allowing swap-before-fork in test in addition to swap-after-fork

I've discovered that if between initializing the allocated memory
and the fork() the process swaps-out then the problem does not
happen later, even if the parent and child are swapped-out after
the fork() as well.
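
As a rough sketch of where the extra swap-out window sits (illustrative names and sleep lengths; the attached swap_testing sources are the real tests):

#include <sys/types.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define REGION_SIZE (14u*1024u)

int
main(void)
{
    unsigned char *region = malloc(REGION_SIZE);
    if (region == NULL) return 1;
    memset(region, 0xA5, REGION_SIZE);

    sleep(30);          /* swap-out window BEFORE the fork: with this,
                           the later pattern checks pass */
    pid_t pid = fork();
    (void)pid;
    sleep(150);         /* swap-out window AFTER the fork, as before */

    /* ... compare region[] against 0xA5 here, as in the earlier sketch ... */
    return 0;
}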
Comment 29 Mark Millard 2017-04-07 07:55:54 UTC
I've found that for the "Source for allowing" attachment
I can make variations that change how much of the
allocated region prefix ends up zero vs. stays good.

I vary the sleep time between testing the initialized
allocations and doing the fork. The longer the sleep
the more zero pages show up:

# diff swap_testing[56].c
1c1
< // swap_testing5.c
---
> // swap_testing6.c
5c5
< // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing5.c
---
> // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing6.c
33c33
<     sleep(30); // Potentialy force swap-out here.
---
>     sleep(150); // Potentialy force swap-out here.
37a38,48
>                // For no-swap-out here cases:
>                //
>                // The longer the sleep here the more allocations
>                // that end up as zero.
>                //
>                // top's Mem Active, Inact, Wired, Bug, Free and
>                // Swap Total, Used, and Free stay unchanged.
>                // What does change is the process RES decreases
>                // while the process SIZE and SWAP stay unchanged
>                // during this sleep.
> 


NOTE: On other architectures that I've tried (such as armv6/v7)
      RES does not decrease during the sleep, and the problem
      does not happen even for sleeps as long as I've tried.

      (I use "stress -m 2 --vm-bytes 900M" on armv6/v7 instead
      of -m 1 --vm-bytes 1800M because that large in one
      process is not allowed.)

So watching top's RES during the sleep (longer than a few
seconds) predicts the later fails-vs.-not status: If RES
decreases (while other things associated with the process
status stay the same) then there will be a failure.
Comment 30 Mark Millard 2017-04-09 00:41:12 UTC
(In reply to Mark Millard from comment #29)

I've been able to identify what code sequence
is gradually removing the "small_mappings" via
some breakpointing in the kernel after reaching
the "should be just sleeping" status. Specifically
I started with breakpointing when
pmap_resident_count_dec was on the call stack
in order to see the call chain(s) that lead to
it being called while RES(ident memory) is
gradually decreasing during the sleep that
is just before forking.

(tid 100067 is [pagedaemon{pagedaemon}], which
is in vm_pageout_worker. bt does not show inlined
layers.)

[ thread pid 17 tid 100067 ]
Breakpoint at   $x.1:   undefined       d65f03c0
db> bt
Tracing pid 17 tid 100067 td 0xfffffd0001c4aa00
. . .
handle_el1h_sync() at pmap_remove_l3+0xdc
        pc = 0xffff000000604870  lr = 0xffff000000611158
        sp = 0xffff000083a49980  fp = 0xffff000083a49a40

pmap_remove_l3() at pmap_ts_referenced+0x580
        pc = 0xffff000000611158  lr = 0xffff000000615c50
        sp = 0xffff000083a49a50  fp = 0xffff000083a49ac0

pmap_ts_referenced() at vm_pageout+0xe60
        pc = 0xffff000000615c50  lr = 0xffff0000005d1f74
        sp = 0xffff000083a49ad0  fp = 0xffff000083a49b50

vm_pageout() at fork_exit+0x94
        pc = 0xffff0000005d1f74  lr = 0xffff0000002e01c0
        sp = 0xffff000083a49b60  fp = 0xffff000083a49b90

fork_exit() at fork_trampoline+0x10
        pc = 0xffff0000002e01c0  lr = 0xffff0000006177b4
        sp = 0xffff000083a49ba0  fp = 0x0000000000000000

It turns out that pmap_ts_referenced is on its:

small_mappings:
. . .

path for the above so the pmap_remove_l3 call is
the one from that execution path. (Found by more
breakpointing after enabling such on the paths.)

So this is the path with:
(breakpoint hook not shown)

                                /*
                                 * Wired pages cannot be paged out so
                                 * doing accessed bit emulation for
                                 * them is wasted effort. We do the
                                 * hard work for unwired pages only.
                                 */
                                pmap_remove_l3(pmap, pte, pv->pv_va, tpde,
                                    &free, &lock);
                                pmap_invalidate_page(pmap, pv->pv_va);
                                cleared++;
                                if (pvf == pv)
                                        pvf = NULL;
                                pv = NULL;
                                . . .

From what I can tell this code is eliminating the
content of pages that, in the failing tests, have
no backing store yet (not swapped-out yet).
The observed behavior is that the pages this
happens to end up as zero pages once they are
swapped-out and back in.

I do not see anything putting the pages that this
happens to into any other lists to keep track of
the page contents.
and swap-in seem to have ignored these pages and
to have been based on automatically zeroed pages
instead.

Note that the (or a) question might be if these
pages should have ever gotten to this code at
all. (I'm no expert overall.) But that might
get into why POSIX_MADV_WILLNEED spanning each
page is sufficient to avoid the zeros issue for
work-then-swapout-and-back-in. I'll only write
here about what the backtrace code seems to be
doing if I'm interpreting correctly.

One oddity here is that pmap_remove_l3 does its own
pmap_invalidate_page to invalidate the same tlb entry as
the above pmap_invalidate_page, so a double-invalidate.
(I've no clue if such is just suboptimal vs. a form of
error.)

pmap_remove_l3 here does things that the analogous
sys/arm/arm/pmap-v6.c's pmap_ts_referenced does not
do and pmap-v6 does something this code does not.

arm64's pmap_remove_l3 does (in summary):

  pmap_invalidate_page
  decrements the resident_count
  pmap_unwire_l3
(then pmap_ts_referenced's small_mappings code
 does another pmap_invalidate_page for the
 same argument values)

arm pmap-v6's pmap_ts_referenced's small_mappings
code does:

  conditional vm_page_dirty
  pte2_clear_bit for PTE2_A
  pmap_tlb_flush

There is, for example, no decrement of the
resident_count involved (that I found anyway). 

But I've no clue just what should be analogous
vs. what should not between pmap-v6 and arm64's
pmap code in this area.

I'll also note that the code before the
arm64 small_mappings code also uses
pmap_remove_l3 but does not do the
decrement nor the extra pmap_invalidate_page
(for example). But again I do not know
how analogous the two paths should be.

Only the small_mappings path seems to have the
end-up-with-zeros problem for the later
fork-then-swap-out and then swap-back-in
context.
Comment 31 Mark Millard 2017-04-09 11:00:53 UTC
(In reply to Mark Millard from comment #30)

I have extended the ddb based investigation to
which pmap_ts_referenced call in vm_pageout_scan
is involved in the call-chain that gets to
pmap_resident_count_dec. It is the the 2nd
one, the one in the loop:

        /*
         * Scan the active queue for pages that can be deactivated.  Update
         * the per-page activity counter and use it to identify deactivation
         * candidates.  Held pages may be deactivated.
         */
        for (m = TAILQ_FIRST(&pq->pq_pl), scanned = 0; m != NULL && (scanned <
            min_scan || (inactq_shortage > 0 && scanned < maxscan)); m = next,
            scanned++) {
. . .

Also in the code before the pmap_ts_referenced
call:

                /*
                 * Check to see "how much" the page has been used.
                 */
                if ((m->aflags & PGA_REFERENCED) != 0) {
                        vm_page_aflag_clear(m, PGA_REFERENCED);
                        act_delta = 1;
                } else
                        act_delta = 0;

act_delta is being set to zero. So the
PGA_REFERENCED bit of m->aflags was zero.
("page is referenced" being false.)


So the earlier noted code in
pmap_ts_referenced:

                                pmap_remove_l3(pmap, pte, pv->pv_va, tpde,
                                    &free, &lock);
                                pmap_invalidate_page(pmap, pv->pv_va);

for this context seems to be taking active but
unreferenced pages and getting rid of them to
the point of dropping them from the resident_count
(despite the only copies of the content being in
RAM).

I'll note that the 2nd edition of the Design and
Implementation book does not say that the resident
memory count is decremented by pmap_ts_referenced.
Nor does it indicate that tlb entries are invalidated
by/during pmap_ts_referenced.

Instead it talks of updating the "active count for
a page" and "it uses pmap_ts_referenced() to count
and clear the number of uses of the page since it
was last scanned". It looks like the book is out
of date in this area.

It looks like the code's general structure
goes back to amd64's -r256072 from 2013-Oct-5.
But the check-in comments do not seem to cover
this aspect of the changes. The overall summary
was "Merge projects/bhyve_npt_pmap into head".

The observed (-r315914) behavior as an amd64
virtualbox guest is that the RES(ident memory)
does not decrease like it does on arm64. (But
that amd64 virtualbox guest is also the
only place I use zfs, complicating the
comparison. I do not boot any amd64 hardware
with FreeBSD [yet?].)

Another amd64 vs. arm64 difference is that
when the amd64 test forks the child immediately
has the same RES size as the parent (large).
By contrast for arm64 the child RES is tiny
until the child tests the pages after it
finishes sleeping. Even giving the FreeBSD
guest 2 GiBytes to match the bpim3/arm64
context for RAM does not make it behave
like the bpim3 (or a rpi3).
Comment 32 Mark Millard 2017-04-10 00:12:05 UTC
Konstantin Belousov provided a patch that seems to
have fixed the issue. . .

On 2017-Apr-9, at 5:27 AM, Konstantin Belousov <kostikbel@gmail.com> wrote:

Hmm, could you try the following patch, I did not even compiled it.

diff --git a/sys/arm64/arm64/pmap.c b/sys/arm64/arm64/pmap.c
index 3d5756ba891..55aa402eb1c 100644
--- a/sys/arm64/arm64/pmap.c
+++ b/sys/arm64/arm64/pmap.c
@@ -2481,6 +2481,11 @@ pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot)
		    sva += L3_SIZE) {
			l3 = pmap_load(l3p);
			if (pmap_l3_valid(l3)) {
+				if ((l3 & ATTR_SW_MANAGED) &&
+				    pmap_page_dirty(l3)) {
+					vm_page_dirty(PHYS_TO_VM_PAGE(l3 &
+					    ~ATTR_MASK));
+				}
				pmap_set(l3p, ATTR_AP(ATTR_AP_RO));
				PTE_SYNC(l3p);
				/* XXX: Use pmap_invalidate_range */

Preliminary testing indicates that this fixes the
some-pages-become-zero problem for fork-then-swapout/in.

Thanks!

I'll see if a buildworld can go through without being stopped
by this type of issue. But that will take a while. (It is how
I originally ran into the problem(s) that others had been
reporting on the lists.)
Comment 33 Mark Millard 2017-04-10 10:05:18 UTC
(In reply to Mark Millard from comment #32)

buildworld buildkernel completed non-stop for the first time
on a BPI-M3 board.

Looks good for a check-in to svn to me (head and stable/11).

This, combined with 2017-Feb-15's -r313772 fix to the fork
trampoline code's updating of sp_el0, makes arm64 far more stable
for my purposes.

-r313772 was never MFC'd to stable/11. In my view it should be.
Comment 34 Mark Millard 2017-04-10 10:11:36 UTC
The same fix for this (217239) also fixes bugzilla 217138, so I'm
marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 217138 ***
Comment 35 commit-hook freebsd_committer freebsd_triage 2017-04-10 15:33:06 UTC
A commit references this bug:

Author: kib
Date: Mon Apr 10 15:32:26 UTC 2017
New revision: 316679
URL: https://svnweb.freebsd.org/changeset/base/316679

Log:
  Do not lose dirty bits for removing PROT_WRITE on arm64.

  Arm64 pmap interprets accessed writable ptes as modified, since
  ARMv8.0 does not track Dirty Bit Modifier in hardware. If writable bit
  is removed, page must be marked as dirty for MI VM.

  This change is most important for COW, where fork caused losing
  content of the dirty pages which were not yet scanned by pagedaemon.

  Reviewed by:	alc, andrew
  Reported and tested by:	Mark Millard <markmi@dsl-only.net>
  PR:	217138, 217239
  Sponsored by:	The FreeBSD Foundation
  MFC after:	2 weeks

Changes:
  head/sys/arm64/arm64/pmap.c
Comment 36 Mark Millard 2017-04-10 20:21:48 UTC
(In reply to Mark Millard from comment #33)

I had been thinking of the BPI-M3 for other reasons
and typed that instead of the correct: Pine64+ 2GB.
(True elsewhere as well.) I do really mean arm64
here, not armv7.
Comment 37 Mark Millard 2017-04-25 00:14:49 UTC
(In reply to commit-hook from comment #35)

stable/11 -r317354 completes the fixes
there:

Author: kib
Date: Mon Apr 24 07:52:44 2017
New Revision: 317354
URL: 
https://svnweb.freebsd.org/changeset/base/317354


Log:
 MFC r316679:
 Do not lose dirty bits for removing PROT_WRITE on arm64.

Modified:
 stable/11/sys/arm64/arm64/pmap.c
Directory Properties:
 stable/11/   (props changed)

. . .


See:

https://lists.freebsd.org/pipermail/svn-src-stable-11/2017-April/003041.html

So now the old rpi3 wiki page material is only for
masking some of the problems in older contexts for
12 or 11.


Note:

-r313772 from head is now -r317147 in stable/11. So
interrupts will no longer trash the sp_el0 register
during the fork-trampoline code. This was the other
issue with fork behavior that caused me build problems.

With both in place for head I no longer have problems
with buildworld and the like in head.


[I've not been running 11, just 12.]