Bug 217138 - head (e.g.) -r315870 for arm64: sh vs. jemalloc asserts: include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted" once swapped in after being swapped out (comment 10); fix is a kernel patch (comment 37)
Summary: head (e.g.) -r315870 for arm64: sh vs. jemalloc asserts: include/jemalloc/int...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: CURRENT
Hardware: arm64 Any
: --- Affects Some People
Assignee: Konstantin Belousov
URL:
Keywords:
: 217239 (view as bug list)
Depends on:
Blocks:
 
Reported: 2017-02-16 12:05 UTC by Mark Millard
Modified: 2017-04-25 00:45 UTC (History)
6 users (show)

See Also:


Attachments
C source for showing fork/swap memory corruption failure (3.12 KB, text/plain)
2017-03-14 08:23 UTC, Mark Millard
no flags Details
Updated source that shows a tie to a 4KByte sized unit of failure (4.46 KB, text/plain)
2017-03-15 06:48 UTC, Mark Millard
no flags Details
Program showing posix_madvise with POSIX_MADV_WILLNEED can prevent the failure (4.09 KB, text/plain)
2017-03-22 02:27 UTC, Mark Millard
no flags Details
Test program with thousands of 14 KiByte allocations (4.77 KB, text/plain)
2017-03-28 03:43 UTC, Mark Millard
no flags Details
Source for allowing swap-before-fork in test in addition to swap-after-fork (7.32 KB, text/plain)
2017-04-05 02:19 UTC, Mark Millard
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Millard 2017-02-16 12:05:54 UTC
For head -r313783 I built with a production arm64 kernel
but world without MALLOC_PRODUCTION . I intermittently
get the following sort of thing when, for example, I use
^z to put a process in the background and to get back
to the shell --or quitting a program and getting back to
the shell. The context involves already having been
su'd to root. I can not cause the crash on demand: it
is intermittent and fairly rare so far.

[Note: This was found while trying to track down why sh
fails sometimes during buildworld on a pine64 when
world was built with MALLOC_PRODUCTION.]

<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"

(lldb) bt
* thread #1: tid = 100164, 0x0000000040554e18 libc.so.7`_thr_kill + 8, name = 'sh', stop reason = signal SIGABRT
  * frame #0: 0x0000000040554e18 libc.so.7`_thr_kill + 8
    frame #1: 0x0000000040554ddc libc.so.7`__raise(s=6) + 64 at raise.c:52
    frame #2: 0x0000000040554d50 libc.so.7`abort + 84 at abort.c:65
    frame #3: 0x0000000040528790 libc.so.7`__je_tsd_fetch [inlined] __je_tsd_get + 248 at tsd.h:687
    frame #4: 0x000000004052876c libc.so.7`__je_tsd_fetch [inlined] __je_tsd_fetch_impl(init=true) at tsd.h:692
    frame #5: 0x000000004052876c libc.so.7`__je_tsd_fetch + 212 at tsd.h:717
    frame #6: 0x0000000040550214 libc.so.7`ialloc_body(size=11, zero=<unavailable>, tsdn=0x0000ffffffffe650, usize=0x0000ffffffffe648, slow_path=true) + 56 at jemalloc_jemalloc.c:1586
    frame #7: 0x0000000040550184 libc.so.7`__malloc(size=1) + 184 at jemalloc_jemalloc.c:1645
    frame #8: 0x000000000041126c sh`ckmalloc(nbytes=<unavailable>) + 32 at memalloc.c:61
    frame #9: 0x000000000041bb6c sh`setvar(name=<unavailable>, val=<unavailable>, flags=<unavailable>) + 176 at var.c:256
    frame #10: 0x0000000000406bf4 sh`evalcommand(cmd=<unavailable>, flags=<unavailable>, backcmd=<unavailable>) + 3468 at eval.c:1180
    frame #11: 0x0000000000405570 sh`evaltree(n=0x0000000040ab9060, flags=<unavailable>) + 212 at eval.c:290
    frame #12: 0x000000000041105c sh`cmdloop(top=<unavailable>) + 252 at main.c:231
    frame #13: 0x0000000000410ed0 sh`main(argc=<unavailable>, argv=<unavailable>) + 660 at main.c:178
    frame #14: 0x0000000000402f30 sh`__start + 360
    frame #15: 0x0000000040434658 ld-elf.so.1`.rtld_start + 24 at rtld_start.S:41
(lldb) up 10
frame #10: 0x0000000000406bf4 sh`evalcommand(cmd=<unavailable>, flags=<unavailable>, backcmd=<unavailable>) + 3468 at eval.c:1180
   1177	
   1178	out:
   1179		if (lastarg)
-> 1180			setvar("_", lastarg, 0);
   1181		if (do_clearcmdentry)
   1182			clearcmdentry();
   1183	}

Unless tsd_booted has been trashed it would appear that
tsd_boot0() never happened before the attempted setvar
above indirectly tries the __je_tsd_get. Supporting
details from the source code:

/usr/src/contrib/jemalloc/include/jemalloc/internal/jemalloc_internal_defs.h
establishes:

#define JEMALLOC_MALLOC_THREAD_CLEANUP 
#define JEMALLOC_TLS 

which is context that is needed when looking things up.

/* malloc_tsd_externs(). */
#ifdef JEMALLOC_MALLOC_THREAD_CLEANUP
#define malloc_tsd_externs(a_name, a_type)                              \
extern __thread a_type  a_name##tsd_tls;                                \
extern __thread bool    a_name##tsd_initialized;                        \
extern bool             a_name##tsd_booted;
. . .
#ifdef JEMALLOC_MALLOC_THREAD_CLEANUP
#define malloc_tsd_data(a_attr, a_name, a_type, a_initializer)          \
. . .                                    \
a_attr bool             a_name##tsd_booted = false;
. . .

#ifdef JEMALLOC_MALLOC_THREAD_CLEANUP
#define malloc_tsd_funcs(a_attr, a_name, a_type, a_initializer,         \
    a_cleanup)                                                          \
. . .
a_name##tsd_boot0(void)                                                 \
{                                                                       \
                                                                        \
        if (a_cleanup != malloc_tsd_no_cleanup) {                       \
                malloc_tsd_cleanup_register(                            \
                    &a_name##tsd_cleanup_wrapper);                      \
        }                                                               \
        a_name##tsd_booted = true;                                      \
        return (false);                                                 \
}                                                                       \
. . .
a_attr bool                                                             \
a_name##tsd_boot(void)                                                  \
{                                                                       \
                                                                        \
        return (a_name##tsd_boot0());                                   \
}                                                                       \
. . .
/* Get/set. */                                                          \
a_attr a_type *                                                         \
a_name##tsd_get(bool init)                                              \
{                                                                       \
                                                                        \
        assert(a_name##tsd_booted);                                     \
        return (&a_name##tsd_tls);                                      \
}                                                                       \
a_attr void                                                             \
a_name##tsd_set(a_type *val)                                            \
{                                                                       \
                                                                        \
        assert(a_name##tsd_booted);                                     \
        a_name##tsd_tls = (*val);                                       \
        if (a_cleanup != malloc_tsd_no_cleanup)                         \
                a_name##tsd_initialized = true;                         \
}
. . .

#if (defined(JEMALLOC_ENABLE_INLINE) || defined(JEMALLOC_TSD_C_))
malloc_tsd_externs(, tsd_t)
malloc_tsd_funcs(JEMALLOC_ALWAYS_INLINE, , tsd_t, tsd_initializer, tsd_cleanup)

. . .

tsd_t *
malloc_tsd_boot0(void)
{
        tsd_t *tsd;

        ncleanups = 0;
        if (tsd_boot0())
                return (NULL);
        tsd = tsd_fetch();
        *tsd_arenas_tdata_bypassp_get(tsd) = true;
        return (tsd);
}
. . .
static bool
malloc_init_hard(void)
{
. . .
        tsd = malloc_tsd_boot0();
. . .

JEMALLOC_ALWAYS_INLINE_C bool
malloc_init(void)
{

        if (unlikely(!malloc_initialized()) && malloc_init_hard())
                return (true);
        malloc_thread_init();

        return (false);
}       
. . .
typedef enum {
        malloc_init_uninitialized       = 3,
        malloc_init_a0_initialized      = 2,
        malloc_init_recursible          = 1,
        malloc_init_initialized         = 0 /* Common case --> jnz. */
} malloc_init_t;
static malloc_init_t    malloc_init_state = malloc_init_uninitialized;
. . .
JEMALLOC_ALWAYS_INLINE_C bool
malloc_initialized(void)
{

        return (malloc_init_state == malloc_init_initialized);
}
Comment 1 Mark Millard 2017-02-16 12:25:11 UTC
(In reply to Mark Millard from comment #0)
It turns out that the sh failure during buildworld
also gets to __je_tsd_get (but a different way) and
then fails the same assertion for "tsd_booted":

<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"

A back trace is:

(lldb) bt
* thread #1: tid = 100194, 0x0000000040554e18 libc.so.7`_thr_kill + 8, name = 'sh', stop reason = signal SIGABRT
  * frame #0: 0x0000000040554e18 libc.so.7`_thr_kill + 8
    frame #1: 0x0000000040554ddc libc.so.7`__raise(s=6) + 64 at raise.c:52
    frame #2: 0x0000000040554d50 libc.so.7`abort + 84 at abort.c:65
    frame #3: 0x0000000040528790 libc.so.7`__je_tsd_fetch [inlined] __je_tsd_get + 248 at tsd.h:687
    frame #4: 0x000000004052876c libc.so.7`__je_tsd_fetch [inlined] __je_tsd_fetch_impl(init=true) at tsd.h:692
    frame #5: 0x000000004052876c libc.so.7`__je_tsd_fetch + 212 at tsd.h:717
    frame #6: 0x0000000040550cc0 libc.so.7`__free(ptr=0x0000000040a17720) + 64 at jemalloc_jemalloc.c:2011
    frame #7: 0x0000000000411328 sh`ckfree(p=<unavailable>) + 32 at memalloc.c:88
    frame #8: 0x0000000000407cd8 sh`clearcmdentry + 76 at exec.c:505
    frame #9: 0x0000000000406bfc sh`evalcommand(cmd=<unavailable>, flags=<unavailable>, backcmd=<unavailable>) + 3476 at eval.c:1182
    frame #10: 0x0000000000405570 sh`evaltree(n=0x0000000040a1cde8, flags=<unavailable>) + 212 at eval.c:290
    frame #11: 0x000000000041105c sh`cmdloop(top=<unavailable>) + 252 at main.c:231
    frame #12: 0x0000000000410ed0 sh`main(argc=<unavailable>, argv=<unavailable>) + 660 at main.c:178
    frame #13: 0x0000000000402f30 sh`__start + 360
    frame #14: 0x0000000040434658 ld-elf.so.1`.rtld_start + 24 at rtld_start.S:41

It appears that setvar was not used but clearcmdentry
(indirectly) gets the same sort of problem when this
happens:

(lldb) up 9
frame #9: 0x0000000000406bfc sh`evalcommand(cmd=<unavailable>, flags=<unavailable>, backcmd=<unavailable>) + 3476 at eval.c:1182
   1179		if (lastarg)
   1180			setvar("_", lastarg, 0);
   1181		if (do_clearcmdentry)
-> 1182			clearcmdentry();
   1183	}
Comment 2 Mark Millard 2017-02-17 02:33:04 UTC
If one is going to look into this in a amd64
context it is important to be using head -r313772
or later in order to avoid fork sometimes not
preserving the stack pointer on the child-process
side of things --at least if experimenting with
port or buildworld buildkernel builds as a means
of testing.

Getting past that stack pointer problem is what
allowed me to see this problem during build
activity, which started me down this exploration.

[My tests for aborting in sh`forkshell if fork changes
the stack pointer are still in place but there have
been no failures so far.]
Comment 3 Mark Millard 2017-02-17 20:10:11 UTC
Before starting a round of updates to a newer version of head
I got a couple of sh core dumps that showed the same sort of
failures. But in these I'd added recording the pid that was
returned by fork.

The result was that both the core dumps were from the parent-
process side of the most recent fork.

[It will be a bit before I'm done with updating to a newer
version of head.]
Comment 4 Mark Millard 2017-02-18 09:03:42 UTC
I got a somewhat different trace back this time:

(lldb) bt
* thread #1: tid = 100105, 0x0000000040554e18 libc.so.7`_thr_kill + 8, name = 'sh', stop reason = signal SIGABRT
  * frame #0: 0x0000000040554e18 libc.so.7`_thr_kill + 8
    frame #1: 0x0000000040554ddc libc.so.7`__raise(s=6) + 64 at raise.c:52
    frame #2: 0x0000000040554d50 libc.so.7`abort + 84 at abort.c:65
    frame #3: 0x0000000040528790 libc.so.7`__je_tsd_fetch [inlined] __je_tsd_get + 248 at tsd.h:687
    frame #4: 0x000000004052876c libc.so.7`__je_tsd_fetch [inlined] __je_tsd_fetch_impl(init=true) at tsd.h:692
    frame #5: 0x000000004052876c libc.so.7`__je_tsd_fetch + 212 at tsd.h:717
    frame #6: 0x0000000040550cc0 libc.so.7`__free(ptr=0x0000000040a390e0) + 64 at jemalloc_jemalloc.c:2011
    frame #7: 0x0000000000411330 sh`ckfree(p=<unavailable>) + 32 at memalloc.c:88
    frame #8: 0x000000000040e6c8 sh`freejob(jp=<unavailable>) + 108 at jobs.c:475
    frame #9: 0x000000000040dd0c sh`waitforjob(jp=<unavailable>, origstatus=<unavailable>) + 352 at jobs.c:1066
    frame #10: 0x0000000000406988 sh`evalcommand(cmd=<unavailable>, flags=<unavailable>, backcmd=<unavailable>) + 2848 at eval.c:1166
    frame #11: 0x0000000000405570 sh`evaltree(n=0x0000000040ab9060, flags=<unavailable>) + 212 at eval.c:290
    frame #12: 0x0000000000411064 sh`cmdloop(top=<unavailable>) + 252 at main.c:231
    frame #13: 0x0000000000410ed8 sh`main(argc=<unavailable>, argv=<unavailable>) + 660 at main.c:178
    frame #14: 0x0000000000402f30 sh`__start + 360
    frame #15: 0x0000000040434658 ld-elf.so.1`.rtld_start + 24 at rtld_start.S:41

(lldb) up 10
frame #10: 0x0000000000406988 sh`evalcommand(cmd=<unavailable>, flags=<unavailable>, backcmd=<unavailable>) + 2848 at eval.c:1166
   1163	parent:	/* parent process gets here (if we forked) */
   1164		if (mode == FORK_FG) {	/* argument to fork */
   1165			INTOFF;
-> 1166			exitstatus = waitforjob(jp, &realstatus);
   1167			INTON;
   1168			if (iflag && loopnest > 0 && WIFSIGNALED(realstatus)) {
   1169				evalskip = SKIPBREAK;
Comment 5 Mark Millard 2017-02-19 02:18:54 UTC
[Note: the kernels after -r313266 are messed up
for at least PowerMac so-called "Quad-Core"s so
the below is based on a -r313864 buildworld and
a -r313266 buildkernel because I had to revert
the kernel to about the last that would avoid
randomly timed panics.]

I did a buildworld of -r313864 for powerpc64
without having MALLOC_PRODUCTION defined and
installed and booted with it in order to see
if powerpc64 had the same sort of problem that
aarch64 is having with jemalloc, in
particular during self-hosted buildworld
buildkernel activity (my test case).

powerpc64 worked fine with no aborts. So. . .

include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"

seems to be aarch64 (arm64) specific in some way.
Comment 6 Mark Millard 2017-02-19 22:23:08 UTC
See also bugzilla 217239 for another type of arm64
specific problem with jemalloc without
MALLOC_PRODUCTION.
Comment 7 Mark Millard 2017-02-23 21:59:31 UTC
The following describes a reproducible sequence
in my context, unfortunately involving hours of
buildworld activity. It fails every time that I
have tried it and at the same places each time.
I give a contrast to a working context as well.

Context: doing buildworld buidlkernel on a pine64+ with
2 GiBytes of RAM. Multiple head revisions, most recently:

# uname -apKU
FreeBSD pine64 12.0-CURRENT FreeBSD 12.0-CURRENT  r313999M  arm64 aarch64 1200021 1200021

The pine64 is running what was a cross build that had
MALLOC_PRODUCTION not defined. (Unlike my usual
way of building.)

Problem: sh core dumps via failing an assert. (script core dumps
as well for other reasons for one stage but I'm focused on the
earliest failures for now: sh failures.)

The following happens when I buildworld buildkernel on the
pine64+ using:

WITH_CLANG=
WITH_CLANG_IS_CC=
WITH_CLANG_FULL=
WITH_CLANG_EXTRAS=
WITH_LLD=
WITH_LLDB=

but not when using:

WITHOUT_CLANG=
WITHOUT_CLANG_IS_CC=
WITHOUT_CLANG_FULL=
WITHOUT_CLANG_EXTRAS=
WITHOUT_LLD=
WITHOUT_LLDB=

(The rest being the same, starting after using cleanworld
in both cases.)

But note that the first failures happen long after the
those have built what they contribute to the _generic_libs
stage. (I have not yet tried isolating subsets.) Similarly
for the later 2nd stage: well after "everything" did its
llvm related activity.

I've tried the failing case under both:

2 GiBytes RAM + 3 GiBytes swap
and:
2 GiBytes RAM + 6 GiBytes swap

It made no difference and there have been no messages
about running out of swap space or other forms of resource
limitation based process killing or the like.


From sysutils/DTraceToolkit 's /usr/local/share/dtrace-toolkit/execsnoop :

. . .
2017 Feb 22 16:37:02     0  61019  61018 make install DIRPRFX=lib/libusb/\0
2017 Feb 22 16:37:02     0  61020  61019 sh -e\0
2017 Feb 22 16:37:02     0  61021  61019 sh -e\0
2017 Feb 22 16:37:02     0  61022  61019 sh -e\0
2017 Feb 22 16:37:02     0  61023  61020 sh /usr/src/tools/install.sh -C -o root -g wheel -m 444 libusb.a /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/lib/\0
2017 Feb 22 16:37:02     0  61024  61021 sh /usr/src/tools/install.sh -o root -g wheel -m 444 /usr/src/lib/libusb/libusb-0.1.pc /usr/src/lib/libusb/libusb-1.0.pc /usr/src/lib/libusb/libusb-2.0.pc /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/libdata/pkgconfig/\0
2017 Feb 22 16:37:02     0  61025  61022 sh /usr/src/tools/install.sh -C -o root -g wheel -m 444 /usr/src/lib/libusb/libusb20.h /usr/src/lib/libusb/libusb20_desc.h /usr/src/lib/libusb/usb.h /usr/src/lib/libusb/libusb.h /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/include/\0
2017 Feb 22 16:37:02     0  61023  61020 sh /usr/src/tools/install.sh -C -o root -g wheel -m 444 libusb.a /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/lib/\0
2017 Feb 22 16:37:02     0  61023  61020 install -p libusb.a /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/lib/\0
2017 Feb 22 16:37:02     0  61024  61021 sh /usr/src/tools/install.sh -o root -g wheel -m 444 /usr/src/lib/libusb/libusb-0.1.pc /usr/src/lib/libusb/libusb-1.0.pc /usr/src/lib/libusb/libusb-2.0.pc /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/libdata/pkgconfig/\0
2017 Feb 22 16:37:02     0  61025  61022 sh /usr/src/tools/install.sh -C -o root -g wheel -m 444 /usr/src/lib/libusb/libusb20.h /usr/src/lib/libusb/libusb20_desc.h /usr/src/lib/libusb/usb.h /usr/src/lib/libusb/libusb.h /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/include/\0
2017 Feb 22 16:37:02     0  61024  61021 install -p /usr/src/lib/libusb/libusb-0.1.pc /usr/src/lib/libusb/libusb-1.0.pc /usr/src/lib/libusb/libusb-2.0.pc /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/libdata/pkgconfig/\0
2017 Feb 22 16:37:02     0  61025  61022 install -p /usr/src/lib/libusb/libusb20.h /usr/src/lib/libusb/libusb20_desc.h /usr/src/lib/libusb/usb.h /usr/src/lib/libusb/libusb.h /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/include/\0
2017 Feb 22 16:37:02     0  61026  61020 sh /usr/src/tools/install.sh -s -o root -g wheel -m 444 libusb.so.3 /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/lib/\0
2017 Feb 22 16:37:02     0  61026  61020 sh /usr/src/tools/install.sh -s -o root -g wheel -m 444 libusb.so.3 /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/lib/\0
2017 Feb 22 16:37:02     0  61026  61020 install -p libusb.so.3 /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/lib/\0
2017 Feb 22 16:37:02     0  61027  61020 sh /usr/src/tools/install.sh -o root -g wheel -m 444 libusb.so.3.debug /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/lib/debug/usr/lib/\0
2017 Feb 22 16:37:02     0  61027  61020 sh /usr/src/tools/install.sh -o root -g wheel -m 444 libusb.so.3.debug /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/lib/debug/usr/lib/\0
2017 Feb 22 16:37:02     0  61027  61020 install -p libusb.so.3.debug /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/lib/debug/usr/lib/\0
2017 Feb 22 16:37:02     0  61028  61020 sh /usr/src/tools/install.sh -l rs libusb.so.3 /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/lib/libusb.so\0
2017 Feb 22 16:37:02     0  61029  61028 ln -fsn libusb.so.3 /usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp/usr/lib/libusb.so\0

(That last was it for the build.) That is the end of the exec
activity for the _generic_libs part of the build (and since the
build stops: the last for the build overall).

(The below ps -daux output is from some time before the problem happened
but with later, related core files listed as well.)

root   91353  0.0  0.1   6856   1500 u0  I+   10:28     0:00.02     `-- /bin/sh /root/sys_build_scripts.pine64-host/make_pine64_nodebug_clang_bootstrap-pine64-host.sh -j 4 buildworld buildkernel
root   91356  0.0  0.1   6204   1560 u0  S+   10:28     0:06.59       `-- script /root/sys_typescripts/typescript_make_pine64_nodebug_clang_bootstrap-pine64-host-2017-02-22:10:28:28 env __MAKE_CONF=/
-rw-------  1 root  wheel  4657152 Feb 22 16:37:04 2017 script.91356.core
(from: ls -ltTU)

root   91357  0.0  0.0   4948    204  1  Ss+  10:28     0:01.87         `-- make -j 4 buildworld buildkernel
root   91373  0.0  0.1   6856   1500  1  I    10:28     0:00.01           `-- sh -ev
-rw-------  1 root  wheel  4702208 Feb 22 16:37:03 2017 sh.91373.core
(from: ls -ltTU)

root   91374  0.0  0.0   4948    204  1  S    10:28     0:01.69             `-- make -m /usr/src/share/mk -f Makefile.inc1 TARGET=arm64 TARGET_ARCH=aarch64 buildworld
root   10803  0.0  0.1   6856   1500  1  I    10:43     0:00.01               `-- sh -ev
-rw-------  1 root  wheel  4702208 Feb 22 16:37:02 2017 sh.10803.core
(from: ls -ltTU)

root   10804  0.0  0.0   4948    200  1  S    10:43     3:00.18                 `-- make -f Makefile.inc1 DESTDIR=/usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp -DNO_FSCHG MK_HTML=no -DNO_LINT MK_MA
root   10811  0.0  0.1   6856   1500  1  I    10:43     0:00.01                   `-- sh -ev
root   38075  0.0  0.0   4948    204  1  S    11:14     0:00.75                     `-- make -f Makefile.inc1 _generic_libs
root   38085  0.0  0.1   6856   1500  1  I    11:14     0:00.01                       `-- sh -ev
. . .

"shutdown -r now" here makes no difference to the below
when done after the reboot.

(Of course there is some llvm related build activity during the
"everything" stage below.)

Doing another buildworld buildkernel to continue the build results
in:

. . .
2017 Feb 22 18:48:46     0  51772  51454 sh -e\0
2017 Feb 22 18:48:46     0  51773  51772 sed -E s,(^| |B|`)svn,\\1svnlite,g /usr/src/contrib/subversion/subversion/svn/svn.1\0
2017 Feb 22 18:48:46     0  51774  51454 sh -e\0
2017 Feb 22 18:48:46     0  51775  51774 gzip -cn svnlite.1\0
2017 Feb 22 18:48:48     0  51776  51454 sh -e\0
2017 Feb 22 18:48:48     0  51777  51776 \0
2017 Feb 22 18:48:48     0  51778  51777 \0
2017 Feb 22 18:48:49     0  51779  51454 sh -e\0
2017 Feb 22 18:48:49     0  51780  51779 /usr/local/aarch64-freebsd/bin/objcopy --only-keep-debug svnlite.full svnlite.debug\0
2017 Feb 22 18:48:50     0  51781  51454 sh -e\0
2017 Feb 22 18:48:50     0  51782  51781 /usr/local/aarch64-freebsd/bin/objcopy --strip-debug --add-gnu-debuglink=svnlite.debug svnlite.full svnlite\0

The above is the end of the "everything" exec activity
but before the buildworld_epilogue (that does not happen).
Again it is the last exec activity for the build because the
build stops.

(Again ps -daux from sometime before the failure mixed
with core file ls -ltTU information below:)

root   61122   0.0  0.1   6856   1500 u0  I+   17:13     0:00.01     `-- /bin/sh /root/sys_build_scripts.pine64-host/make_pine64_nodebug_clang_bootstrap-pine64-host.sh -j 4 buildworld buildkernel
root   61125   0.0  0.1   6204   1560 u0  S+   17:13     0:09.56       `-- script /root/sys_typescripts/typescript_make_pine64_nodebug_clang_bootstrap-pine64-host-2017-02-22:17:13:45 env __MAKE_CONF=
root   61126   0.0  0.0   4948    204  1  Ss+  17:13     0:02.36         `-- make -j 4 buildworld buildkernel
root   61142   0.0  0.1   6856   1500  1  I    17:13     0:00.01           `-- sh -ev
-rw-------  1 root  wheel  4702208 Feb 22 18:48:51 2017 sh.61142.core

root   61143   0.0  0.0   4948    204  1  S    17:13     0:02.08             `-- make -m /usr/src/share/mk -f Makefile.inc1 TARGET=arm64 TARGET_ARCH=aarch64 buildworld
root   81104   0.0  0.1   6856   1500  1  I    17:19     0:00.01               `-- sh -ev
-rw-------  1 root  wheel  4702208 Feb 22 18:48:50 2017 sh.81104.core

root   81105   0.0  0.0   4948    220  1  S    17:19     0:02.57                 `-- make -f Makefile.inc1 DESTDIR=/usr/obj/pine64_clang/arm64.aarch64/usr/src/tmp all
root   13358   0.0  0.1   6856   1500  1  I    17:49     0:00.01                   |-- sh -e
. . .

(Yep: script does  not core dump for this 2nd stage context.)

A 3rd buildworld buildkernel finishes the build, with buildworld
being essentially a large no-op and then doing the buildkernel.



Context details:

# more ~/src.configs/make.conf
CFLAGS.gcc+= -v

(But this was not a gcc based build.)

# more ~/src.configs/src.conf.pine64-clang-bootstrap.pine64-host 
TO_TYPE=aarch64
TOOLS_TO_TYPE=${TO_TYPE}
#
KERNCONF=GENERIC-NODBG
TARGET=arm64
.if ${.MAKE.LEVEL} == 0
TARGET_ARCH=${TO_TYPE}
.export TARGET_ARCH
.endif
#
#WITH_CROSS_COMPILER=
WITH_SYSTEM_COMPILER=
#
#CPUTYPE=soft
WITH_LIBCPLUSPLUS=
WITHOUT_BINUTILS_BOOTSTRAP=
WITHOUT_ELFTOOLCHAIN_BOOTSTRAP=
#WITHOUT_CLANG_BOOTSTRAP=
WITH_CLANG=
WITH_CLANG_IS_CC=
WITH_CLANG_FULL=
WITH_CLANG_EXTRAS=
WITH_LLD=
WITH_LLDB=
#
WITH_BOOT=
WITHOUT_LIB32=
WITHOUT_LIBSOFT=
#
WITHOUT_GCC_BOOTSTRAP=
WITHOUT_GCC=
WITHOUT_GCC_IS_CC=
WITHOUT_GNUCXX=
#
NO_WERROR=
#WERROR=
MALLOC_PRODUCTION=
#
WITH_REPRODUCIBLE_BUILD=
WITH_DEBUG_FILES=
#
CROSS_BINUTILS_PREFIX=/usr/local/${TOOLS_TO_TYPE}-freebsd/bin/
AS=/usr/local/${TOOLS_TO_TYPE}-freebsd/bin/as
AR=/usr/local/${TOOLS_TO_TYPE}-freebsd/bin/ar
LD=/usr/local/${TOOLS_TO_TYPE}-freebsd/bin/ld
NM=/usr/local/${TOOLS_TO_TYPE}-freebsd/bin/nm
OBJCOPY=/usr/local/${TOOLS_TO_TYPE}-freebsd/bin/objcopy
OBJDUMP=/usr/local/${TOOLS_TO_TYPE}-freebsd/bin/objdump
RANLIB=/usr/local/${TOOLS_TO_TYPE}-freebsd/bin/ranlib
SIZE=/usr/local/${TOOLS_TO_TYPE}-freebsd/bin/size
STRINGS=/usr/local/${TOOLS_TO_TYPE}-freebsd/bin/strings
.export AS
.export AR
.export LD
.export NM
.export OBJCOPY
.export OBJDUMP
.export RANLIB
.export SIZE
.export STRINGS


# svnlite status /usr/src/ | sort
?       /usr/src/sys/amd64/conf/GENERIC-DBG
?       /usr/src/sys/amd64/conf/GENERIC-NODBG
?       /usr/src/sys/arm/conf/BPIM3-DBG
?       /usr/src/sys/arm/conf/BPIM3-NODBG
?       /usr/src/sys/arm/conf/RPI2-DBG
?       /usr/src/sys/arm/conf/RPI2-NODBG
?       /usr/src/sys/arm64/conf/GENERIC-DBG
?       /usr/src/sys/arm64/conf/GENERIC-NODBG
?       /usr/src/sys/powerpc/conf/GENERIC64vtsc-DBG
?       /usr/src/sys/powerpc/conf/GENERIC64vtsc-NODBG
?       /usr/src/sys/powerpc/conf/GENERICvtsc-DBG
?       /usr/src/sys/powerpc/conf/GENERICvtsc-NODBG
M       /usr/src/bin/sh/jobs.c
M       /usr/src/bin/sh/miscbltin.c
M       /usr/src/contrib/llvm/lib/Target/PowerPC/PPCInstrInfo.td
M       /usr/src/contrib/llvm/tools/lld/ELF/Target.cpp
M       /usr/src/lib/csu/powerpc64/Makefile
M       /usr/src/libexec/rtld-elf/Makefile
M       /usr/src/sys/arm/arm/gic.c
M       /usr/src/sys/boot/ofw/Makefile.inc
M       /usr/src/sys/boot/powerpc/Makefile.inc
M       /usr/src/sys/boot/powerpc/kboot/Makefile
M       /usr/src/sys/boot/uboot/Makefile.inc
M       /usr/src/sys/conf/Makefile.powerpc
M       /usr/src/sys/conf/kmod.mk
M       /usr/src/sys/ddb/db_main.c
M       /usr/src/sys/ddb/db_script.c
M       /usr/src/sys/powerpc/ofw/ofw_machdep.c

The . . ./conf/*-*DBG files include the standard files
and then make adjustments to have a production
style kernel build, including the arm64 case.

Below the first two files are as they were used to isolate
forks' original lack of preserving the sp value for the
child process side when interrupts happen. (Since fixed
in head but not in stable/11 last I looked.)

# svnlite diff /usr/src/bin/sh/jobs.c /usr/src/bin/sh/miscbltin.c /usr/src/sys/arm/arm/gic.c
Index: /usr/src/bin/sh/jobs.c
===================================================================
--- /usr/src/bin/sh/jobs.c	(revision 313999)
+++ /usr/src/bin/sh/jobs.c	(working copy)
@@ -51,6 +51,9 @@
 #include <stdlib.h>
 #include <unistd.h>
 
+/* JUST FOR TESTING */
+#include <stdint.h>
+
 #include "shell.h"
 #if JOBS
 #include <termios.h>
@@ -833,6 +836,13 @@
  * in a pipeline).
  */
 
+extern uintptr_t example_stack_address(void);
+
+uintptr_t stack_address_before_fork = 0;
+uintptr_t stack_address_after_fork = 0;
+
+pid_t pid_from_fork = -1;
+
 pid_t
 forkshell(struct job *jp, union node *n, int mode)
 {
@@ -845,7 +855,10 @@
 	if (mode == FORK_BG && (jp == NULL || jp->nprocs == 0))
 		checkzombies();
 	flushall();
-	pid = fork();
+	stack_address_before_fork = example_stack_address();
+	pid_from_fork = pid = fork();
+	stack_address_after_fork  = example_stack_address();
+	if (stack_address_after_fork != stack_address_before_fork) abort();
 	if (pid == -1) {
 		TRACE(("Fork failed, errno=%d\n", errno));
 		INTON;
@@ -946,7 +959,6 @@
 	return pid;
 }
 
-
 pid_t
 vforkexecshell(struct job *jp, char **argv, char **envp, const char *path, int idx, int pip[2])
 {
Index: /usr/src/bin/sh/miscbltin.c
===================================================================
--- /usr/src/bin/sh/miscbltin.c	(revision 313999)
+++ /usr/src/bin/sh/miscbltin.c	(working copy)
@@ -64,6 +64,15 @@
 
 #undef eflag
 
+
+/* JUST FOR TESTING */
+uintptr_t example_stack_address(void)
+{
+    volatile uintptr_t test = 0;
+    return (uintptr_t)(void*)&test;
+}
+
+
 int readcmd(int, char **);
 int umaskcmd(int, char **);
 int ulimitcmd(int, char **);
Index: /usr/src/sys/arm/arm/gic.c
===================================================================
--- /usr/src/sys/arm/arm/gic.c	(revision 313999)
+++ /usr/src/sys/arm/arm/gic.c	(working copy)
@@ -672,9 +672,13 @@
 
 	if (irq >= sc->nirqs) {
 #ifdef GIC_DEBUG_SPURIOUS
+#define EXPECTED_SPURIOUS_IRQ 1023
+	    if (irq != EXPECTED_SPURIOUS_IRQ) {
 		device_printf(sc->gic_dev,
-		    "Spurious interrupt detected: last irq: %d on CPU%d\n",
+		    "Spurious interrupt %d detected of %d: last irq: %d on CPU%d\n",
+		    irq, sc->nirqs,
 		    sc->last_irq[PCPU_GET(cpuid)], PCPU_GET(cpuid));
+            }
 #endif
 		return (FILTER_HANDLED);
 	}
@@ -720,6 +724,16 @@
 	if (irq < sc->nirqs)
 		goto dispatch_irq;
 
+	if (irq != EXPECTED_SPURIOUS_IRQ) {
+#undef EXPECTED_SPURIOUS_IRQ
+#ifdef GIC_DEBUG_SPURIOUS
+		device_printf(sc->gic_dev,
+		    "Spurious end interrupt %d detected of %d: last irq: %d on CPU%d\n",
+		    irq, sc->nirqs,
+		    sc->last_irq[PCPU_GET(cpuid)], PCPU_GET(cpuid));
+#endif
+	}
+
 	return (FILTER_HANDLED);
 }
 

The gic.c change just avoids getting uninteresting spurious
interrupt messages on the console.

Other changes are generally tied to my powerpc64 and powerpc
investigations.
Comment 8 Mark Millard 2017-02-24 03:22:21 UTC
(In reply to Mark Millard from comment #7)

I should have explicitly mentioned that the
processes that get the failures were created
and initialized long before the clang/llvm
related parts of the build occur, even though
the failures occur after some of the potential
clang/llvm build activity.
Comment 9 Mark Millard 2017-02-24 03:35:21 UTC
(In reply to Mark Millard from comment #8)

A note on script failures:

I'll note that the script failures are
part of the activity reported in bugzilla
217239, where for script the failures look
in some respects like the powerd and su
failures reported there: they are not
td_booted asserts like reported here.

Still there might be a common cause since
the script failures are repeatable as
reported here in 217138.
Comment 10 Mark Millard 2017-02-28 02:53:20 UTC
I've reduced the testing context to the following
type of example (no longer involving buildworld
buildkernel):

# sh
# sh
# sh
. . .
# sh

(So a bunch of nested sh's in an ssh session that
will mostly be swapped out. I happened to have
done this under 2 users for the example material
below: on I did an su first. Plus I had another
session without such a nesting but with an su.)

After forcing these to mostly swap out (see below)
I use ^D to exit a currently interactive one. Then
the rest of the swapped out ones from the nesting
get the tsd_booted failure.

So, for example:

# stress -m 1 --vm-bytes 1536M

The context is a PINE64+ with 2GB of RAM. The above
was enough in my context to cause the needed swapouts:

# ps -aOuser,flags
  PID USER          F TT  STAT    TIME COMMAND
  688 root       4102 u0  IWs  0:00.00 login [pam] (login)
  689 root   10004002 u0  I+   0:00.06 -sh (sh)
72611 markmi     4002  0  IWs  0:00.00 -sh (sh)
72613 markmi     4002  0  IW   0:00.00 sh
72614 markmi     4002  0  IW   0:00.00 sh
72615 markmi     4002  0  IW   0:00.00 sh
72616 markmi     4002  0  IW   0:00.00 sh
72617 markmi     4002  0  IW   0:00.00 sh
72618 markmi     4002  0  IW   0:00.00 sh
72619 markmi     4002  0  IW   0:00.00 sh
72620 markmi     4002  0  IW   0:00.00 sh
72621 markmi     4002  0  IW   0:00.00 sh
72622 markmi     4002  0  IW   0:00.00 sh
72623 markmi     4002  0  IW   0:00.00 sh
72624 markmi     4002  0  IW   0:00.00 sh
72625 markmi     4002  0  IW   0:00.00 sh
72626 markmi 10004002  0  I+   0:00.01 sh
  167 markmi     4002  1  IWs  0:00.00 -sh (sh)
  169 root       4102  1  IW   0:00.00 su
  170 root       4002  1  IW   0:00.00 su (sh)
  171 root       4002  1  IW   0:00.00 sh
  172 root       4002  1  IW   0:00.00 sh
  173 root       4002  1  IW   0:00.00 sh
  174 root       4002  1  IW   0:00.00 sh
  175 root       4002  1  IW   0:00.00 sh
  176 root       4002  1  IW   0:00.00 sh
  177 root       4002  1  IW   0:00.00 sh
  178 root       4002  1  IW   0:00.00 sh
  179 root   10004002  1  I+   0:00.01 sh
60961 root       4002  2  IW   0:00.00 stress -m 1 --vm-bytes 1536M
60962 root   10000002  2  R    0:29.41 stress -m 1 --vm-bytes 1536M
60964 root   10004002  2  R+   0:00.01 ps -aOuser,flags
82389 markmi     4002  2  IWs  0:00.00 -sh (sh)
82391 root       4102  2  IW   0:00.00 su
82392 root   10004002  2  S    0:00.22 su (sh)

So with that swapped out context established:

(The markmi user case:)

$ ^D<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"


(The su'd user case:)

# ^D<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"

The one without the deeply nested sh's (but an su):

# exit
<jemalloc>: /usr/src/contrib/jemalloc/include/jemalloc/internal/tsd.h:687: Failed assertion: "tsd_booted"
Comment 11 Mark Millard 2017-02-28 03:35:03 UTC
(In reply to Mark Millard from comment #10)

I'll note that the sh.*.core files show that
__je_tsd_booted and a very large area around
it is all zero. (I've been seeing this for a
long time but until the swapped-out-before
discovery I had little information about
when the big block of zeros showed up.)

Some possibilities:

A) The memory was not restored at all.

B) Incorrect content was restored (possibly
   incorrect content was swapped out earlier).

C) The memory and caches and such were
   not forced to be coherent in one or both
   directions of transfer.

There may be other things that could be involved.
Comment 12 Mark Millard 2017-02-28 08:15:32 UTC
Other folks on the lists have reported getting the
Failed assertion: "tsd_booted" notices.

So I'm changing the Importance field to track that
now that a fairly simple, quick way to reproduce the
problem is known.

(Of course my results do involve only the equipment
that I have access to: one PINE64+, powered USB hub,
USB SSD, heat-sink, fan, case. So it is possible
that I'm assuming too much about reproduction based
on comments 10 and later.)
Comment 13 Mark Millard 2017-02-28 08:20:27 UTC
(In reply to Mark Millard from comment #12)

Here is an even stronger reproduction claim
from the lists (for an rpi3 instead of for a
pine64+):


I can confirm process corruption on RPi3 as of r313567.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.
Comment 14 Mark Millard 2017-02-28 20:34:54 UTC
(In reply to Mark Millard from comment #11)

On reading about FreeBSD's swapping I see that
swap-in does certain essentials, including setting
up to demand-page most of the restored material
if/when referenced.

Thus another possibility for the problem is:

D) The configuration to enable this demand-paging
   might be incorrect (possibly incomplete?) such
   that some of demand paging does not happen.
Comment 15 Mark Millard 2017-03-01 03:28:13 UTC
(In reply to Mark Millard from comment #10)

If I try just after booting

stress -m 1 --vm-bytes 1800M

is more like what I need to cause the
swap-outs. 1700M is too little in that
context.


Also:

Trying a debug kernel got the same results with
only one notice: a vfs_bio.c bufwait vs.
ufs_dirhash.c dirhash lock order reversal that
likely is a well known one.

(I guess the LOR is from writing core files into
/var/crash/ .)
Comment 16 Mark Millard 2017-03-13 22:18:16 UTC
(In reply to Mark Millard from comment #15)

The problem only seems to occur when top shows
RES 0K and a positive SWAP for the failing sh
process before it executes.

When top shows both a positive RES and a positive
SWAP before the execution the sh process does not
die.

For example the following two did not die:

  PID USERNAME    THR PRI NICE   SIZE    RES   SWAP STATE   C   TIME     CPU COMMAND
11167 root          1  20    0  6880K   908K  1216K ttyin   0   0:00   0.00% su (sh)
. . .
11144 root          1  20    0  6880K   568K  1556K ttyin   2   0:00   0.00% su (sh)
Comment 17 Mark Millard 2017-03-14 08:15:57 UTC
https://lists.freebsd.org/pipermail/freebsd-arm/2017-March/015869.html

shows an about 110 line C program that shows data
corruption for fork-then-swap-out-then-swap-in.

The known-corrupted data is a memory region that was
malloc'd and that passed its byte pattern test before
the fork.

I'll also add the source as an attachment.
Comment 18 Mark Millard 2017-03-14 08:23:26 UTC
Created attachment 180808 [details]
C source for showing fork/swap memory corruption failure

I'm still at a loss about how to figure out what stages are messed
up. (Memory coherency? Some memory not swapped out? Bad data swapped
out? Wrong data swapped in?)

But at least I've found a much smaller/simpler example to demonstrate
some problem with in my Pine64+_ 2GB context.

The Pine64+ 2GB is the only arm64 context that I have access to.

The attached program fails its check for data
having its expected byte pattern in dynamically
allocated memory after a fork/swap-out/swap-in
sequence. The bytes end up all zero instead.

I'll note that the program sleeps for 60s after
forking to give time to do something else to
cause the parent and child processes to swap
out (RES=0 as seen in top).

A point is the size of the region matters: <= 14K Bytes
fails and > 14K Bytes works for as much has I have tested.

https://lists.freebsd.org/pipermail/freebsd-arm/2017-March/015869.html

has more material, including lldb's disassembly of the
compiled and linked C code.
Comment 19 Mark Millard 2017-03-14 18:29:00 UTC
(In reply to Mark Millard from comment #18)

https://lists.freebsd.org/pipermail/freebsd-arm/2017-March/015869.html

Should have had a subject/title saying arm64, not amd64.

https://lists.freebsd.org/pipermail/freebsd-current/2017-March/065219.html

(for example) has that corrected. (The correction has not shown up
on the arm list yet for some reason.)
Comment 20 Mark Millard 2017-03-15 06:48:42 UTC
Created attachment 180836 [details]
Updated source that shows a tie to a 4KByte sized unit of failure

See the comments tied to partial_test_check.

See:

https://lists.freebsd.org/pipermail/freebsd-arm/2017-March/015880.html
Comment 21 Mark Millard 2017-03-17 20:13:58 UTC
I my test program I reported:

#define region_size (4u*1024u+1u)
                        // Bad dyn_region pattern, parent and child
                        // processes:
                        //     256u,  2u*1024u,  4u*1024u, 8u*1024u,
                        // 9u*1024u, 12u*1024u, 14u*1024u
                        // (but see the partial_test_check() call
                        //  notes above).

                        // Works:
                        // 14u*1024u+1u, 15u*1024u, 16u*1024u,
                        // 32u*1024u, 256u*1024u*1024u

It appears to turn out that the
14u*1024u vs. 14u*1024+1u boundary matches
up with jemalloc:

contrib/jemalloc/include/jemalloc/internal/size_classes.h :

. . . size == 1<<lg_grp + ndelta<<lg_delta
. . .\
 *     bin: 'yes' if a small bin size class, 'no' otherwise.
. . .
#if (LG_SIZEOF_PTR == 3 && LG_TINY_MIN == 3 && LG_QUANTUM == 3 && LG_PAGE == 12)
#define SIZE_CLASSES \
  /* index, lg_grp, lg_delta, ndelta, psz, bin, lg_delta_lookup */ \
. . .
    SC( 38,     13,       11,      3,  no, yes, no) \
    SC( 39,     13,       11,      4, yes,  no, no) \
. . .
#define SMALL_MAXCLASS          ((((size_t)1) << 13) + (((size_t)3) << 11))

So SMALL_MAXCLASS==(14u*1024u).

The problem in the example program is tied to small bin size
handling in jemalloc.
Comment 22 Mark Millard 2017-03-18 22:59:31 UTC
As far as my test program goes using

ln -s 'tcache:false' /etc/malloc.conf

prevents the memory corruptions for the
fork-then-swap-out-then-swap-in sequence.

So it is doing more than just avoid an assert.
Comment 23 Mark Millard 2017-03-18 23:06:30 UTC
Note: I now have access to both a pine64+ 2GB and
a rpi3 with no hardware in common (USB hub or SSD
or such). (The rpi3 has less RAM and so does not
need as big of a stress run.)

The behaviors happen in both contexts: it was not
some sort of pine64+ 2GB quirk (or the hub/SSD that
I typically use in that context).
Comment 24 Mark Millard 2017-03-18 23:20:05 UTC
(In reply to Mark Millard from comment #22)

Being explicit about the combined conditions for failure
for my test program:

Both tcache enabled and allocations fitting in SMALL_MAXCLASS
are required in order to make the program fail.

Note:

lldb) print __je_tcache_maxclass
(size_t) $0 = 32768

which is larger than SMALL_MAXCLASS. I've not observed
failures for sizes above SMALL_MAXCLASS but not exceeding
__je_tcache_maxclass.

Thus tcache use by itself does not seen sufficient for
my program to get corruption of its dynamically allocated
memory: the small allocation size also matters.
Comment 25 Mark Millard 2017-03-19 00:16:26 UTC
Here are procstat -v results from while the two
processes from the fork of my test program
have already been swapped-out (before they
run and fail during swap-in). I also show
the results from looking at the matching
core files.

The start of the address range of that
ends up all-zero when none of it should
be is:

(lldb) print dyn_region
(region *volatile) $0 = 0x0000000040616000

# procstat -v 1954
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1954            0x10000            0x11000 r--    1   51   5   1 CN-- vn /root/c_tests/a.out
 1954            0x20000            0x21000 r-x    1   51   5   1 CN-- vn /root/c_tests/a.out
 1954            0x30000            0x40000 rw-   16    0   1   0 C--- vn /root/c_tests/a.out
 1954            0x40000            0x41000 r--    0    0   2   0 CN-- sw 
 1954            0x41000            0x54000 rw-    0    0   2   0 CN-- sw 
 1954         0x40030000         0x4004a000 r-x   26   29  34   0 CN-- vn /libexec/ld-elf.so.1
 1954         0x4004a000         0x40051000 rw-    3    0   1   0 C--- sw 
 1954         0x4005a000         0x4005c000 rw-    1    0   1   0 C--- sw 
 1954         0x4005c000         0x401b3000 r-x  343  384  61  27 CN-- vn /lib/libc.so.7
 1954         0x401b3000         0x401c2000 ---    0    0   2   0 CN-- df 
 1954         0x401c2000         0x401cf000 rw-   13    0   2   0 CN-- vn /lib/libc.so.7
 1954         0x401cf000         0x40201000 rw-    1    1   2   0 CN-- sw 
 1954         0x40400000         0x40800000 rw-    0    0   2   0 CN-- sw 
 1954     0xfffffffdf000     0xfffffffff000 rw-    0    0   1   0 C--D sw 
 1954     0xfffffffff000    0x1000000000000 r-x    1    1  37   0 ---- ph 
# procstat -v 1955
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1955            0x10000            0x11000 r--    1   51   5   1 CN-- vn /root/c_tests/a.out
 1955            0x20000            0x21000 r-x    1   51   5   1 CN-- vn /root/c_tests/a.out
 1955            0x30000            0x40000 rw-   16    0   1   0 C--- vn /root/c_tests/a.out
 1955            0x40000            0x41000 r--    0    0   2   0 CN-- sw 
 1955            0x41000            0x54000 rw-    0    0   2   0 CN-- sw 
 1955         0x40030000         0x4004a000 r-x   26   29  34   0 CN-- vn /libexec/ld-elf.so.1
 1955         0x4004a000         0x40051000 rw-    3    0   1   0 C--- sw 
 1955         0x4005a000         0x4005c000 rw-    1    0   1   0 C--- sw 
 1955         0x4005c000         0x401b3000 r-x  343  384  61  27 CN-- vn /lib/libc.so.7
 1955         0x401b3000         0x401c2000 ---    0    0   2   0 CN-- df 
 1955         0x401c2000         0x401cf000 rw-   13    0   2   0 CN-- vn /lib/libc.so.7
 1955         0x401cf000         0x40201000 rw-    1    1   2   0 CN-- sw 
 1955         0x40400000         0x40800000 rw-    0    0   2   0 CN-- sw 
 1955     0xfffffffdf000     0xfffffffff000 rw-    0    0   1   0 C--D sw 
 1955     0xfffffffff000    0x1000000000000 r-x    1    1  37   0 ---- ph 

The core file results are:

# procstat -v /var/crash/a.out.1954.core
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1954            0x10000            0x11000 r--    1   51   3   1 CN-- vn /root/c_tests/a.out
 1954            0x20000            0x21000 r-x    1   51   3   1 CN-- vn /root/c_tests/a.out
 1954            0x30000            0x40000 rw-   16    0   1   0 C--- vn /root/c_tests/a.out
 1954            0x40000            0x41000 r--    1    1   1   0 CN-- sw 
 1954            0x41000            0x54000 rw-    4    4   1   0 C--- sw 
 1954         0x40030000         0x4004a000 r-x   26   29  30   0 CN-- vn /libexec/ld-elf.so.1
 1954         0x4004a000         0x40051000 rw-    7    7   1   0 C--- sw 
 1954         0x4005a000         0x4005c000 rw-    2    2   1   0 C--- sw 
 1954         0x4005c000         0x401b3000 r-x  343  384  55  25 CN-- vn /lib/libc.so.7
 1954         0x401b3000         0x401c2000 ---    0    0   1   0 CN-- df 
 1954         0x401c2000         0x401cf000 rw-   13    0   1   0 C--- vn /lib/libc.so.7
 1954         0x401cf000         0x40201000 rw-   50   50   1   0 CN-- sw 
 1954         0x40400000         0x40800000 rw- 1024 1024   1   0 CN-- sw 
 1954     0xfffffffdf000     0xfffffffff000 rw-    3    3   1   0 C--D sw 
 1954     0xfffffffff000    0x1000000000000 r-x    1    1  33   0 ---- ph 
# procstat -v /var/crash/a.out.1955.core
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1955            0x10000            0x11000 r--    1   51   5   1 CN-- vn /root/c_tests/a.out
 1955            0x20000            0x21000 r-x    1   51   5   1 CN-- vn /root/c_tests/a.out
 1955            0x30000            0x40000 rw-   16    0   1   0 C--- vn /root/c_tests/a.out
 1955            0x40000            0x41000 r--    0    0   2   0 CN-- sw 
 1955            0x41000            0x54000 rw-    4    0   1   0 C--- sw 
 1955         0x40030000         0x4004a000 r-x   26   29  31   0 CN-- vn /libexec/ld-elf.so.1
 1955         0x4004a000         0x40051000 rw-    4    0   1   0 C--- sw 
 1955         0x4005a000         0x4005c000 rw-    2    0   1   0 C--- sw 
 1955         0x4005c000         0x401b3000 r-x  343  384  56  25 CN-- vn /lib/libc.so.7
 1955         0x401b3000         0x401c2000 ---    0    0   2   0 CN-- df 
 1955         0x401c2000         0x401cf000 rw-   13    0   1   0 C--- vn /lib/libc.so.7
 1955         0x401cf000         0x40201000 rw-    1    1   2   0 CN-- sw 
 1955         0x40400000         0x40800000 rw-    1    1   2   0 CN-- sw 
 1955     0xfffffffdf000     0xfffffffff000 rw-    1    0   1   0 C--D sw 
 1955     0xfffffffff000    0x1000000000000 r-x    1    1  34   0 ---- ph
Comment 26 Mark Millard 2017-03-19 00:29:32 UTC
Using procstat -v on one of the sh.*.core files
from a process that directly reported the Failed
assertion: "tds_booted" shows the following.

But first the address of the tsd_booted storage:

(lldb) print &__je_tsd_booted
(bool *) $0 = 0x0000000040618520

# procstat -v /var/crash/sh.1766.core 
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1766           0x400000           0x423000 r-x   35   37   7   0 CN-- vn /bin/sh
 1766           0x433000           0x436000 rw-    3    3   1   0 C--- sw 
 1766         0x40433000         0x4044d000 r-x   26   29  32   0 CN-- vn /libexec/ld-elf.so.1
 1766         0x4044d000         0x40454000 rw-    3    3   1   0 C--- sw 
 1766         0x4045d000         0x4045f000 rw-    2    2   1   0 C--- sw 
 1766         0x4045f000         0x4048b000 r-x   44   48  14   7 CN-- vn /lib/libedit.so.7
 1766         0x4048b000         0x4049b000 ---    0    0   1   0 CN-- df 
 1766         0x4049b000         0x4049d000 rw-    2    0   1   0 CN-- vn /lib/libedit.so.7
 1766         0x4049d000         0x404a1000 rw-    0    0   1   0 CN-- sw 
 1766         0x404a1000         0x405f8000 r-x  343  384  58  26 CN-- vn /lib/libc.so.7
 1766         0x405f8000         0x40607000 ---    0    0   1   0 CN-- df 
 1766         0x40607000         0x40614000 rw-   13    0   1   0 C--- vn /lib/libc.so.7
 1766         0x40614000         0x4062d000 rw-    5    5   1   0 C--- sw 
 1766         0x4062d000         0x4067a000 r-x   77   82  16   8 CN-- vn /lib/libncursesw.so.8
 1766         0x4067a000         0x40689000 ---    0    0   1   0 CN-- df 
 1766         0x40689000         0x4068e000 rw-    5    0   1   0 CN-- vn /lib/libncursesw.so.8
 1766         0x4068e000         0x406ad000 rw-    0    0   1   0 C--- sw 
 1766         0x40800000         0x40c00000 rw-    9    9   1   0 C--- sw 
 1766     0xfffffffdf000     0xfffffffff000 rw-    6    6   1   0 C--D sw 
 1766     0xfffffffff000    0x1000000000000 r-x    1    1  34   0 ---- ph 



Similarly for an example of su.*.core that reported the
assertion failure:

(lldb) print &__je_tsd_booted
(bool *) $0 = 0x000000004061d520

# procstat -v /var/crash/su.1765.core 
  PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 1765           0x400000           0x404000 r-x    4    5   3   0 CN-- vn /usr/bin/su
 1765           0x413000           0x414000 rw-    1    1   1   0 C--- sw 
 1765         0x40413000         0x4042d000 r-x   26   29  30   0 CN-- vn /libexec/ld-elf.so.1
 1765         0x4042d000         0x40435000 rw-    8    8   1   0 C--- sw 
 1765         0x4043d000         0x4043f000 rw-    2    2   1   0 C--- sw 
 1765         0x4043f000         0x4044e000 r-x   15   16  33  16 CN-- vn /lib/libutil.so.9
 1765         0x4044e000         0x4045e000 ---    0    0   1   0 CN-- df 
 1765         0x4045e000         0x4045f000 rw-    1    0   1   0 CN-- vn /lib/libutil.so.9
 1765         0x4045f000         0x40461000 rw-    0    0   1   0 CN-- sw 
 1765         0x40461000         0x4046c000 r-x   11   13  21   9 CN-- vn /usr/lib/libpam.so.6
 1765         0x4046c000         0x4047c000 ---    0    0   1   0 CN-- df 
 1765         0x4047c000         0x4047d000 rw-    1    0   1   0 C--- vn /usr/lib/libpam.so.6
 1765         0x4047d000         0x40494000 r-x    8    8  19   8 CN-- vn /usr/lib/libbsm.so.3
 1765         0x40494000         0x404a4000 ---    0    0   1   0 CN-- df 
 1765         0x404a4000         0x404a6000 rw-    0    0   1   0 CN-- vn /usr/lib/libbsm.so.3
 1765         0x404a6000         0x405fd000 r-x  343  384  54  24 CN-- vn /lib/libc.so.7
 1765         0x405fd000         0x4060c000 ---    0    0   1   0 CN-- df 
 1765         0x4060c000         0x40619000 rw-   13    0   1   0 CN-- vn /lib/libc.so.7
 1765         0x40619000         0x40653000 rw-    4    4   1   0 C--- sw 
 1765         0x40653000         0x40654000 r-x    0    0   6   3 CN-- vn /usr/lib/pam_rootok.so.6
 1765         0x40654000         0x40663000 ---    0    0   1   0 CN-- df 
 1765         0x40663000         0x40664000 rw-    0    0   1   0 CN-- vn /usr/lib/pam_rootok.so.6
 1765         0x40664000         0x40665000 r-x    0    0   8   4 CN-- vn /usr/lib/pam_self.so.6
 1765         0x40665000         0x40674000 ---    0    0   1   0 CN-- df 
 1765         0x40674000         0x40675000 rw-    0    0   1   0 CN-- vn /usr/lib/pam_self.so.6
 1765         0x40675000         0x40676000 r-x    0    0   6   3 CN-- vn /usr/lib/pam_group.so.6
 1765         0x40676000         0x40685000 ---    0    0   1   0 CN-- df 
 1765         0x40685000         0x40686000 rw-    0    0   1   0 CN-- vn /usr/lib/pam_group.so.6
 1765         0x40686000         0x40687000 r-x    0    0  17   7 CN-- vn /usr/lib/pam_opie.so.6
 1765         0x40687000         0x40697000 ---    0    0   1   0 CN-- df 
 1765         0x40697000         0x40698000 rw-    0    0   1   0 CN-- vn /usr/lib/pam_opie.so.6
 1765         0x40698000         0x4069e000 r-x    0    0  17   7 CN-- vn /usr/lib/libopie.so.8
 1765         0x4069e000         0x406ae000 ---    0    0   1   0 CN-- df 
 1765         0x406ae000         0x406b1000 rw-    0    0   1   0 CN-- vn /usr/lib/libopie.so.8
 1765         0x406b1000         0x406d1000 r-x    0    0  17   7 CN-- vn /lib/libmd.so.6
 1765         0x406d1000         0x406e0000 ---    0    0   1   0 CN-- df 
 1765         0x406e0000         0x406e1000 rw-    0    0   1   0 CN-- vn /lib/libmd.so.6
 1765         0x406e1000         0x406e2000 r-x    0    0  17   7 CN-- vn /usr/lib/pam_opieaccess.so.6
 1765         0x406e2000         0x406f1000 ---    0    0   1   0 CN-- df 
 1765         0x406f1000         0x406f2000 rw-    0    0   1   0 CN-- vn /usr/lib/pam_opieaccess.so.6
 1765         0x406f2000         0x406f5000 r-x    0    0  45  18 CN-- vn /usr/lib/pam_unix.so.6
 1765         0x406f5000         0x40704000 ---    0    0   1   0 CN-- df 
 1765         0x40704000         0x40705000 rw-    0    0   1   0 CN-- vn /usr/lib/pam_unix.so.6
 1765         0x40705000         0x40710000 r-x    0    0  19   8 CN-- vn /lib/libcrypt.so.5
 1765         0x40710000         0x40720000 ---    0    0   1   0 CN-- df 
 1765         0x40720000         0x40721000 rw-    0    0   1   0 CN-- vn /lib/libcrypt.so.5
 1765         0x40721000         0x40732000 rw-    0    0   0   0 ---- -- 
 1765         0x40732000         0x40736000 r-x    0    0  17   7 CN-- vn /usr/lib/libypclnt.so.4
 1765         0x40736000         0x40745000 ---    0    0   1   0 CN-- df 
 1765         0x40745000         0x40746000 rw-    0    0   1   0 CN-- vn /usr/lib/libypclnt.so.4
 1765         0x40746000         0x4074f000 rw-    0    0   1   0 CN-- sw 
 1765         0x4074f000         0x40751000 r-x    0    0  17   7 CN-- vn /usr/lib/pam_login_access.so.6
 1765         0x40751000         0x40760000 ---    0    0   1   0 CN-- df 
 1765         0x40760000         0x40761000 rw-    0    0   1   0 CN-- vn /usr/lib/pam_login_access.so.6
 1765         0x40761000         0x40764000 r-x    0    0  45  18 CN-- vn /usr/lib/pam_unix.so.6
 1765         0x40764000         0x40773000 ---    0    0   1   0 CN-- df 
 1765         0x40773000         0x40774000 rw-    0    0   1   0 CN-- vn /usr/lib/pam_unix.so.6
 1765         0x40774000         0x40775000 r-x    0    0  21   9 CN-- vn /usr/lib/pam_permit.so.6
 1765         0x40775000         0x40784000 ---    0    0   1   0 CN-- df 
 1765         0x40784000         0x40785000 rw-    0    0   1   0 CN-- vn /usr/lib/pam_permit.so.6
 1765         0x40785000         0x40786000 r-x    0    0  21   9 CN-- vn /usr/lib/pam_permit.so.6
 1765         0x40786000         0x40795000 ---    0    0   1   0 CN-- df 
 1765         0x40795000         0x40796000 rw-    0    0   1   0 CN-- vn /usr/lib/pam_permit.so.6
 1765         0x40800000         0x40c00000 rw-    8    8   1   0 C--- sw 
 1765     0xfffffffdf000     0xfffffffff000 rw-    2    2   1   0 C--D sw 
 1765     0xfffffffff000    0x1000000000000 r-x    1    1  33   0 ---- ph
Comment 27 Mark Millard 2017-03-19 00:38:10 UTC
The earlier procstat -v comments have something interesting
in common other than the procstat results themselves: the
3 separate programs all end up with zeros in the same
general memory area of each process:

(lldb) print dyn_region
(region *volatile) $0 = 0x0000000040616000

(lldb) print &__je_tsd_booted
(bool *) $0 = 0x0000000040618520

(lldb) print &__je_tsd_booted
(bool *) $0 = 0x0000000040618520

The first is from dynamic allocation ending up
in the area. The other two are from libc.so.7
globals/statics ending up in the general area.

It looks like something is trashing a specific
memory area for some reason.
Comment 28 Mark Millard 2017-03-19 02:24:50 UTC
(In reply to Mark Millard from comment #27)

I incorrectly copy/pasted one of the addresses.
The last one below is the correction:

(lldb) print dyn_region
(region *volatile) $0 = 0x0000000040616000

(lldb) print &__je_tsd_booted
(bool *) $0 = 0x0000000040618520

(lldb) print &__je_tsd_booted
(bool *) $0 = 0x000000004061d520
Comment 29 Mark Millard 2017-03-22 02:27:20 UTC
Created attachment 181044 [details]
Program showing posix_madvise with POSIX_MADV_WILLNEED can prevent the failure

posix_madvise(.,.,POSIX_MADV_WILLNEED) in the child process
after the fork but before the sleep/swap-out avoids the failure.

In the parent process it makes no difference.

The .,. is the address and size for a memory region that is
to avoid the problem.
Comment 30 Mark Millard 2017-03-22 07:19:39 UTC
(In reply to Mark Millard from comment #29)

It turns out that POSIX_MADV_WILLNEED can instead
be POSIX_MADV_DONTNEED and it still prevents the
failures.
Comment 31 Mark Millard 2017-03-22 07:31:26 UTC
(In reply to Mark Millard from comment #30)

Ignore the POSIX_MADV_DONTNEED: the example child
had not reached the zero RES(ident memory) status.
When it does it fails (and so to the ancestor
processes if they also reached such).
Comment 32 Mark Millard 2017-03-24 08:16:06 UTC
I have confirmed that -r315870 still trashes memory
for the sequence:

allocations (tcache and <= SMALL_MAXCLASS)
fork
sleep/wait then swap-out [zero RES(ident memory)]
swap-in
Comment 33 Mark Millard 2017-03-28 03:43:41 UTC
Created attachment 181258 [details]
Test program with thousands of 14 KiByte allocations

The test program that has thousands of 14 KiByte
allocations usually ends up with them all being
zeroed in the parent and child of the fork.

But I've had a couple of runs where a much smaller
prefix was messed up and then there were normal,
expected values.

#define region_size (14u*1024u)
. . .
#define num_regions (256u*1024u*1024u/region_size)

So num_regions==18724, using up most of 256 MiBytes.

Note: each region has its own 14 KiByte allocation.

But dyn_regions[1296].array[0] in one example was
the first normal value.

In another example dyn_regions[2180].array[4096] was
the first normal value.

The last is interesting for being part way through
an allocation's space. That but aligning with a 4
KiByte page size would seem odd for a pure-jemalloc
issue.
Comment 34 Mark Millard 2017-04-05 02:19:25 UTC
Created attachment 181498 [details]
Source for allowing swap-before-fork in test in addition to swap-after-fork

I've discovered that if between initializing the allocated memory
and the fork() the process swaps-out then the problem does not
happen later, even if the parent and child are swapped-out after
the fork() as well.
Comment 35 Mark Millard 2017-04-07 07:56:25 UTC
I've found that for the "Source for allowing" attachment
I can make variations that change how much of the
allocated region prefix ends up zero vs. stays good.

I vary the sleep time between testing the initialized
allocations and doing the fork. The longer the sleep
the more zero pages show up:

# diff swap_testing[56].c                                                                                                                                                                               1c1
< // swap_testing5.c
---
> // swap_testing6.c
5c5
< // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing5.c
---
> // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing6.c
33c33
<     sleep(30); // Potentialy force swap-out here.
---
>     sleep(150); // Potentialy force swap-out here.
37a38,48
>                // For no-swap-out here cases:
>                //
>                // The longer the sleep here the more allocations
>                // that end up as zero.
>                //
>                // top's Mem Active, Inact, Wired, Bug, Free and
>                // Swap Total, Used, and Free stay unchanged.
>                // What does change is the process RES decreases
>                // while the process SIZE and SWAP stay unchanged
>                // during this sleep.
> 


NOTE: On other architectures that I've tried (such as armv6/v7)
      RES does not decrease during the sleep --and the problem
      does not happen even for as long of sleeps as I've tried.

      (I use "stress -m 2 --vm-bytes 900M" on armv6/v7 instead
      of -m 1 --vm-bytes 1800M because that large in one
      process is not allowed.)

So watching top's RES during the sleep (longer than a few
seconds) predicts the later fails-vs.-not status: If RES
decreases (while other things associated with the process
status stay the same) then there will be a failure.
Comment 36 Mark Millard 2017-04-09 00:41:43 UTC
(In reply to Mark Millard from comment #35)

I've been able to identify what code sequence
is gradually removing the "small_mappings" via
some breakpointing in the kernel after reaching
the "should be just sleeping" status. Specifically
I started with breakpointing when
pmap_resident_count_dec was on the call stack
in order to see the call chain(s) that lead to
it being called while RES(ident memory) is
gradually decreasing during the sleep that
is just before forking.

(tid 100067 is [pagedaemon{pagedaemon}], which
is in vm_pageout_worker. bt does not show inlined
layers.)

[ thread pid 17 tid 100067 ]
Breakpoint at   $x.1:   undefined       d65f03c0
db> bt
Tracing pid 17 tid 100067 td 0xfffffd0001c4aa00
. . .
handle_el1h_sync() at pmap_remove_l3+0xdc
        pc = 0xffff000000604870  lr = 0xffff000000611158
        sp = 0xffff000083a49980  fp = 0xffff000083a49a40

pmap_remove_l3() at pmap_ts_referenced+0x580
        pc = 0xffff000000611158  lr = 0xffff000000615c50
        sp = 0xffff000083a49a50  fp = 0xffff000083a49ac0

pmap_ts_referenced() at vm_pageout+0xe60
        pc = 0xffff000000615c50  lr = 0xffff0000005d1f74
        sp = 0xffff000083a49ad0  fp = 0xffff000083a49b50

vm_pageout() at fork_exit+0x94
        pc = 0xffff0000005d1f74  lr = 0xffff0000002e01c0
        sp = 0xffff000083a49b60  fp = 0xffff000083a49b90

fork_exit() at fork_trampoline+0x10
        pc = 0xffff0000002e01c0  lr = 0xffff0000006177b4
        sp = 0xffff000083a49ba0  fp = 0x0000000000000000

It turns out that pmap_ts_referenced is on its:

small_mappings:
. . .

path for the above so the pmap_remove_l3 call is
the one from that execution path. (Found by more
breakpointing after enabling such on the paths.)

So this is the path with:
(breakpoint hook not shown)

                                /*
                                 * Wired pages cannot be paged out so
                                 * doing accessed bit emulation for
                                 * them is wasted effort. We do the
                                 * hard work for unwired pages only.
                                 */
                                pmap_remove_l3(pmap, pte, pv->pv_va, tpde,
                                    &free, &lock);
                                pmap_invalidate_page(pmap, pv->pv_va);
                                cleared++;
                                if (pvf == pv)
                                        pvf = NULL;
                                pv = NULL;
                                . . .

From what I can tell this code is eliminating the
content of pages that in the failing tests, ones
with  no backing store yet (not swapped-out yet).
The observed behavior is that the pages that have
the above happen end up as zero pages once
swapped-out and back in.

I do not see anything putting the pages that this
happens to into any other lists to keep track of
the contents of the page content. The swap-out
and swap-in seem to have ignored these pages and
to have been based on automatically zeroed pages
instead.

Note that the (or a) question might be if these
pages should have ever gotten to this code at
all. (I'm no expert overall.) But that might
get into why POSIX_MADV_WILLNEED spanning each
page is sufficient to avoid the zeros issue for
work-then-swapout-and-back-in. I'll only write
here about what the backtrace code seems to be
doing if I'm interpreting correctly.

One oddity here is that pmap_remove_l3 does its own
pmap_invalidate_page to invalidate the same tlb entry as
the above pmap_invalidate_page, so a double-invalidate.
(I've no clue if such is just suboptimal vs. a form of
error.)

pmap_remove_l3 here does things that the analogous
sys/arm/arm/pmap-v6.c's pmap_ts_referenced does not
do and pmap-v6 does something this code does not.

arm64's pmap_remove_l3 does (in summary):

  pmap_invalidate_page
  decrements the resident_count
  pmap_unwire_l3
(then pmap_ts_referenced's small_mappings code
 does another pmap_invalidate_page for the
 same argument values)

arm pmap-v6's pmap_ts_referenced's small_mappings
code does:

  conditional vm_page_dirty
  pte2_clear_bit for PTE2_A
  pmap_tlb_flush

There is, for example, no decrement of the
resident_count involved (that I found anyway). 

But I've no clue just what should be analogous
vs. what should not between pmap-v6 and arm64's
pmap code in this area.

I'll also note that the code before the
arm64 small_mappings code also uses
pmap_remove_l3 but does not do the
decrement nor the extra pmap_invalidate_page
(for example). But again I do not know
how analogous the two paths should be.

Only the small_mappings path seems to have the
end-up-with-zeros problem for the later
fork-then-swap-out and then swap-back-in
context.
Comment 37 Mark Millard 2017-04-10 00:12:57 UTC
Konstantin Belousov provided a patch that seems to
have fixed the issue. . .

On 2017-Apr-9, at 5:27 AM, Konstantin Belousov <kostikbel@gmail.com> wrote:

Hmm, could you try the following patch, I did not even compiled it.

diff --git a/sys/arm64/arm64/pmap.c b/sys/arm64/arm64/pmap.c
index 3d5756ba891..55aa402eb1c 100644
--- a/sys/arm64/arm64/pmap.c
+++ b/sys/arm64/arm64/pmap.c
@@ -2481,6 +2481,11 @@ pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot)
		    sva += L3_SIZE) {
			l3 = pmap_load(l3p);
			if (pmap_l3_valid(l3)) {
+				if ((l3 & ATTR_SW_MANAGED) &&
+				    pmap_page_dirty(l3)) {
+					vm_page_dirty(PHYS_TO_VM_PAGE(l3 &
+					    ~ATTR_MASK));
+				}
				pmap_set(l3p, ATTR_AP(ATTR_AP_RO));
				PTE_SYNC(l3p);
				/* XXX: Use pmap_invalidate_range */
Comment 38 Mark Millard 2017-04-10 10:05:55 UTC
(In reply to Mark Millard from comment #37)

buildworld buildkernel completed non-stop for the first time
on a BPI-M3 board.

Looks good for a check-in to svn to me (head and stable/11).

This combined with 2017-Feb-15's -r313772's fix to the fork
trampline code's updating of sp_el0 makes arm64 far more stable
for my purposes.

-r313772 was never MFC'd to stable/11. In my view it should be.
Comment 39 Mark Millard 2017-04-10 10:11:36 UTC
*** Bug 217239 has been marked as a duplicate of this bug. ***
Comment 40 commit-hook freebsd_committer freebsd_triage 2017-04-10 15:33:04 UTC
A commit references this bug:

Author: kib
Date: Mon Apr 10 15:32:26 UTC 2017
New revision: 316679
URL: https://svnweb.freebsd.org/changeset/base/316679

Log:
  Do not lose dirty bits for removing PROT_WRITE on arm64.

  Arm64 pmap interprets accessed writable ptes as modified, since
  ARMv8.0 does not track Dirty Bit Modifier in hardware. If writable bit
  is removed, page must be marked as dirty for MI VM.

  This change is most important for COW, where fork caused losing
  content of the dirty pages which were not yet scanned by pagedaemon.

  Reviewed by:	alc, andrew
  Reported and tested by:	Mark Millard <markmi@dsl-only.net>
  PR:	217138, 217239
  Sponsored by:	The FreeBSD Foundation
  MFC after:	2 weeks

Changes:
  head/sys/arm64/arm64/pmap.c
Comment 41 Mark Millard 2017-04-10 20:20:56 UTC
(In reply to Mark Millard from comment #38)

I had been thinking of the BPI-M3 for other reasons
and typed that instead of the correct: Pine64+ 2GB.
(True elsewhere as well.) I do really mean arm64
here, not armv7.
Comment 42 Mark Millard 2017-04-25 00:13:50 UTC
(In reply to commit-hook from comment #40)

stable/11 -r317354 completes the fixes
there:

Author: kib
Date: Mon Apr 24 07:52:44 2017
New Revision: 317354
URL: 
https://svnweb.freebsd.org/changeset/base/317354


Log:
 MFC r316679:
 Do not lose dirty bits for removing PROT_WRITE on arm64.

Modified:
 stable/11/sys/arm64/arm64/pmap.c
Directory Properties:
 stable/11/   (props changed)

. . .


See:

https://lists.freebsd.org/pipermail/svn-src-stable-11/2017-April/003041.html

So now the old rpi3 wiki page material is only for
masking some of the problems in older contexts for
12 or for older contexts for 11.


Note:

-r313772 from head is now -r317147 in stable/11 . So
interrupts will no longer trash the sp_el0 register
during the fork-trampoline code. This was the other
issue with fork behavior that caused me build problems.

With both in place for head I no loner have problems
with buildworld and the like ,in head.


[I've not been running 11, just 12.]
Comment 43 Ed Maste freebsd_committer freebsd_triage 2017-04-25 00:45:25 UTC
Fix has now been merged to stable/11 and will be in the 11.1 release. Mark thank you for testing/confirming!