Bug 246630 - stable/11 regression: base.txz reproducibility depends on number of cpu cores
Summary: stable/11 regression: base.txz reproducibility depends on number of cpu cores
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: Unspecified
Hardware: amd64 Any
: --- Affects Some People
Assignee: freebsd-bugs mailing list
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2020-05-21 11:52 UTC by Fabian Keil
Modified: 2020-05-26 18:41 UTC (History)
3 users (show)

See Also:


Attachments
First 600 lines of the text section difference (32.48 KB, patch)
2020-05-22 17:33 UTC, Fabian Keil
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Fabian Keil 2020-05-21 11:52:05 UTC
On a recent FreeBSD stable/11 the rescue binary content varies depending
on the number of cpu cores.

With a 2 core system I can reproduce the binaries from the official
11.4-BETA2 release, with a 4 core system I get different results:

SHA256 (/images/FreeBSD/FreeBSD-11.4-BETA2-amd64-r361067-official/rescue/cp) = 3c96ab697fc721c226f99971863bcadbbbc73ccfec1abe2f04cbce60e3276a89
SHA256 (/images/FreeBSD/FreeBSD-11.4-BETA2-amd64-r361067-rebuild-2cores-t520/rescue/cp) = 3c96ab697fc721c226f99971863bcadbbbc73ccfec1abe2f04cbce60e3276a89
SHA256 (/images/FreeBSD/FreeBSD-11.4-BETA2-amd64-r361067-rebuild-4cores/rescue/cp) = 06e0b2964901fa02ed2f4e02c1334cb82ce1d97f4a6fc7a2632a6f74b35955d1

Diffoscope seems to indicate that some offsets are different.
The output starts with:

--- FreeBSD/FreeBSD-11.4-BETA2-amd64-r361067-official/rescue/cp
+++ FreeBSD/FreeBSD-11.4-BETA2-amd64-r361067-rebuild-4cores/rescue/cp
├── readelf --wide --sections {}
│ @@ -1,16 +1,16 @@
│  There are 19 section headers, starting at offset 0xa18bb0:
│  
│  Section Headers:
│    [Nr] Name              Type            Addr             Off    Size   ES Flg Lk Inf Al
│    [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
│    [ 1] .note.tag         NOTE            0000000000400190 000190 000030 00   A  0   0  4
│    [ 2] .init             PROGBITS        00000000004001c0 0001c0 000018 00  AX  0   0  4
│ -  [ 3] .text             PROGBITS        0000000000400200 000200 6dab08 00  AX  0   0 64
│ -  [ 4] .fini             PROGBITS        0000000000adad08 6dad08 00000e 00  AX  0   0  4
│ +  [ 3] .text             PROGBITS        0000000000400200 000200 6daa88 00  AX  0   0 64
│ +  [ 4] .fini             PROGBITS        0000000000adac88 6dac88 00000e 00  AX  0   0  4
│    [ 5] .rodata           PROGBITS        0000000000adb000 6db000 216bd8 00   A  0   0 4096
│    [ 6] .eh_frame_hdr     PROGBITS        0000000000cf1bd8 8f1bd8 0245ec 00   A  0   0  4
│    [ 7] .eh_frame         X86_64_UNWIND   0000000000d161c8 9161c8 0bb90c 00   A  0   0  8
│    [ 8] .tdata            PROGBITS        0000000000fd2000 9d2000 000068 00 WAT  0   0 16
│    [ 9] .tbss             NOBITS          0000000000fd2070 9d2068 000840 00 WAT  0   0 16
│    [10] .ctors            PROGBITS        0000000000fd2070 9d2070 0000c0 00  WA  0   0  8
│    [11] .dtors            PROGBITS        0000000000fd2130 9d2130 000018 00  WA  0   0  8
├── readelf --wide --debug-dump=frames {}
│ @@ -9894,21179 +9894,21195 @@
│    DW_CFA_nop
│    DW_CFA_nop
│    DW_CFA_nop
│    DW_CFA_nop
│    DW_CFA_nop
│    DW_CFA_nop
│  
│ -00005ee0 0000002c 00005ee4 FDE cie=00000000 pc=0043c780..0043d57c
│ +00005ee0 0000002c 00005ee4 FDE cie=00000000 pc=0043c780..0043d4c3
│    DW_CFA_advance_loc: 1 to 0043c781
│    DW_CFA_def_cfa_offset: 16
│    DW_CFA_offset: r6 (rbp) at cfa-16
│    DW_CFA_advance_loc: 3 to 0043c784
│    DW_CFA_def_cfa_register: r6 (rbp)
│    DW_CFA_advance_loc: 16 to 0043c794
│    DW_CFA_offset: r3 (rbx) at cfa-56
│    DW_CFA_offset: r12 (r12) at cfa-48
│    DW_CFA_offset: r13 (r13) at cfa-40
│    DW_CFA_offset: r14 (r14) at cfa-32
│    DW_CFA_offset: r15 (r15) at cfa-24
│ -  DW_CFA_advance_loc2: 3554 to 0043d576
│ +  DW_CFA_advance_loc2: 3369 to 0043d4bd
│    DW_CFA_def_cfa: r7 (rsp) ofs 8
│ -  DW_CFA_advance_loc: 1 to 0043d577
│ +  DW_CFA_advance_loc: 1 to 0043d4be
│    DW_CFA_def_cfa: r6 (rbp) ofs 16
│    DW_CFA_nop
│    DW_CFA_nop

The whole diff in text format has a size of 217M so I'm not attaching it.

I first saw the issue after rebasing ElectroBSD on r360524 and confirmed
that it still exists in r360986.

The problem does not exist in r357496 and earlier but I don't yet know
the exact revision when the problem was introduced.

I can't reproduce the issue on FreeBSD-13.0-CURRENT amd64
using the 20200514-r361019 snapshot.

I haven't tested 12/stable.
Comment 1 Conrad Meyer freebsd_committer 2020-05-21 13:46:59 UTC
Notably, .text is different (different size 0x6dab08 vs 0x6daa88).  The relocations and framedata seem to correspond to that (though the section-level diff does not show any difference in .eh_frame?).
Comment 2 Ed Maste freebsd_committer 2020-05-22 17:20:03 UTC
diffoscope should disassemble the .text sections and compare the result; would you be able to extract and post a few dozen lines from where .text diverges?

Is this unaffected by -j? Are there other differences between the 2 core and 4 core systems?
Comment 3 Fabian Keil 2020-05-22 17:33:09 UTC
Created attachment 214763 [details]
First 600 lines of the text section difference
Comment 4 Fabian Keil 2020-05-22 17:36:32 UTC
I'm always compiling with -j2.

Once I noticed the issue using completely different machines I
reproduced it with bhyve changing only the core number.
Comment 5 Ed Maste freebsd_committer 2020-05-22 17:38:51 UTC
Ah, the early differences are just different offsets due to later changes,

│ -  40024a:	e8 21 94 69 00       	callq  0xa99670
│ +  40024a:	e8 a1 93 69 00       	callq  0xa995f0

Do you have somewhere you could share the full text diff? Or, find the first non-callq difference, perhaps around 0x9f....
Comment 6 Fabian Keil 2020-05-22 17:52:06 UTC
I uploaded the whole compressed diff to:
https://www.fabiankeil.de/tmp/FreeBSD-11.4-BETA2-amd64-r361067-rescue-cp.diff.gz
(33 MB)
Comment 7 Conrad Meyer freebsd_committer 2020-05-22 18:23:37 UTC
Diff seems to start here (I reordered lines for clarity):

│ @@ -63992,183658 +63992,183673 @@
...
│ -  43c7b5:    49 89 cd                mov    %rcx,%r13
│ +  43c7b5:    49 89 cf                mov    %rcx,%r15

// ^^ r13 -> r15

│ -  43c7b8:    49 29 c5                sub    %rax,%r13
│ +  43c7b8:    49 29 c7                sub    %rax,%r15

│ -  43c7bb:    4c 89 ec                mov    %r13,%rsp
│ +  43c7bb:    4c 89 fc                mov    %r15,%rsp

...

│ -  43c803:    4c 63 c2                movslq %edx,%r8
│ +  43c803:    4c 63 da                movslq %edx,%r11

// ^^ r8 -> r11

│ -  43c806:    49 c1 e0 03             shl    $0x3,%r8
│ +  43c806:    49 c1 e3 03             shl    $0x3,%r11

│ -  43c80a:    4c 03 05 e7 7a bc 00    add    0xbc7ae7(%rip),%r8        # 0x10042f8
│ +  43c80a:    4c 03 1d e7 7a bc 00    add    0xbc7ae7(%rip),%r11        # 0x10042f8

│ -  43c811:    4c 89 05 d0 7a bc 00    mov    %r8,0xbc7ad0(%rip)        # 0x10042e8
│ +  43c811:    4c 89 1d d0 7a bc 00    mov    %r11,0xbc7ad0(%rip)        # 0x10042e8

│ -  43c818:    4c 3b 05 d1 7a bc 00    cmp    0xbc7ad1(%rip),%r8        # 0x10042f0
│ +  43c818:    4c 3b 1d d1 7a bc 00    cmp    0xbc7ad1(%rip),%r11        # 0x10042f0

│    43c81f:    76 07                   jbe    0x43c828
│ -  43c821:    4c 89 05 c8 7a bc 00    mov    %r8,0xbc7ac8(%rip)        # 0x10042f0
│ +  43c821:    4c 89 1d c8 7a bc 00    mov    %r11,0xbc7ac8(%rip)        # 0x10042f0


These particular changes are not the ones that affect code *size* (those must be later), but do show reproducibility problems.
Comment 8 Conrad Meyer freebsd_committer 2020-05-22 18:26:09 UTC
First code-size change is shortly afterwards:

│ -  43c83e:    45 31 c0                xor    %r8d,%r8d
│ -  43c841:    44 8a 3b                mov    (%rbx),%r15b

│ +  43c83e:    45 31 db                xor    %r11d,%r11d
│ +  43c841:    8a 0b                   mov    (%rbx),%cl

%cl (low 8 bits of %ecx) has a shorter mov.b representation than %r15b (low 8 bits of amd64 %r15).
Comment 9 Conrad Meyer freebsd_committer 2020-05-22 18:27:03 UTC
So clearly register selection is non-deterministic for some reason.  Perhaps some optimization pass runs for a non-deterministic amount of time or % of system memory, and scaling -j impacts that?
Comment 10 Ed Maste freebsd_committer 2020-05-22 18:29:33 UTC
Here is where the interesting parts start - I've rearranged the diff -/+ to make it easier to see where differences exist:

│ -  43c803:    4c 63 c2                movslq %edx,%r8
│ +  43c803:    4c 63 da                movslq %edx,%r11
│ -  43c806:    49 c1 e0 03             shl    $0x3,%r8
│ +  43c806:    49 c1 e3 03             shl    $0x3,%r11
│ -  43c80a:    4c 03 05 e7 7a bc 00    add    0xbc7ae7(%rip),%r8        # 0x10042f8
│ +  43c80a:    4c 03 1d e7 7a bc 00    add    0xbc7ae7(%rip),%r11        # 0x10042f8
│ -  43c811:    4c 89 05 d0 7a bc 00    mov    %r8,0xbc7ad0(%rip)        # 0x10042e8
│ +  43c811:    4c 89 1d d0 7a bc 00    mov    %r11,0xbc7ad0(%rip)        # 0x10042e8
│ -  43c818:    4c 3b 05 d1 7a bc 00    cmp    0xbc7ad1(%rip),%r8        # 0x10042f0
│ +  43c818:    4c 3b 1d d1 7a bc 00    cmp    0xbc7ad1(%rip),%r11        # 0x10042f0
│    43c81f:    76 07                   jbe    0x43c828
│ -  43c821:    4c 89 05 c8 7a bc 00    mov    %r8,0xbc7ac8(%rip)        # 0x10042f0
│ +  43c821:    4c 89 1d c8 7a bc 00    mov    %r11,0xbc7ac8(%rip)        # 0x10042f0
│    43c828:    48 b8 00 00 00 00 01    mov    $0x100000000,%rax
│    43c82f:    00 00 00
│    43c832:    49 01 c6                add    %rax,%r14
│    43c835:    49 c1 fe 20             sar    $0x20,%r14
│    43c839:    4c 01 f3                add    %r14,%rbx
│    43c83c:    eb 03                   jmp    0x43c841
│ -  43c83e:    45 31 c0                xor    %r8d,%r8d
│ +  43c83e:    45 31 db                xor    %r11d,%r11d
│ -  43c841:    44 8a 3b                mov    (%rbx),%r15b
│ +  43c841:    8a 0b                   mov    (%rbx),%cl
│ -  43c844:    45 31 c9                xor    %r9d,%r9d
│ +  43c843:    45 31 c9                xor    %r9d,%r9d
│ -  43c847:    45 84 ff                test   %r15b,%r15b
│ +  43c846:    84 c9                   test   %cl,%cl
│ +  43c848:    4c 89 ad 00 ff ff ff    mov    %r13,-0x100(%rbp)
│ -  43c84a:    74 60                   je     0x43c8ac
│ +  43c84f:    74 4f                   je     0x43c8a0
│ -  43c84c:    45 31 f6                xor    %r14d,%r14d
│ +  43c851:    45 31 e4                xor    %r12d,%r12d
│ -  43c84f:    48 b8 01 00 00 00 89    mov    $0x1288900000001,%rax
│ -  43c856:    28 01 00
│ +  43c854:    48 b8 01 00 00 00 89    mov    $0x1288900000001,%rax
│ +  43c85b:    28 01 00
│ -  43c859:    0f 1f 80 00 00 00 00    nopl   0x0(%rax)
│ +  43c85e:    66 90                   xchg   %ax,%ax
│    43c860:    ba 01 00 00 00          mov    $0x1,%edx
│ -  43c865:    44 89 f9                mov    %r15d,%ecx
│ -  43c868:    48 d3 e2                shl    %cl,%rdx
│ +  43c865:    48 d3 e2                shl    %cl,%rdx
│ -  43c86b:    41 80 ff 3f             cmp    $0x3f,%r15b
│ +  43c868:    80 f9 3f                cmp    $0x3f,%cl
│ -  43c86f:    0f 87 73 01 00 00       ja     0x43c9e8
│ +  43c86b:    0f 87 60 01 00 00       ja     0x43c9d1
│ -  43c875:    48 21 c2                and    %rax,%rdx
│ +  43c871:    48 21 c2                and    %rax,%rdx
│ -  43c878:    0f 84 6a 01 00 00       je     0x43c9e8
│ +  43c874:    0f 84 57 01 00 00       je     0x43c9d1
│ -  43c87e:    47 88 7c 35 01          mov    %r15b,0x1(%r13,%r14,1)
│ +  43c87a:    43 88 4c 27 01          mov    %cl,0x1(%r15,%r12,1)
│ -  43c883:    43 c6 44 35 02 00       movb   $0x0,0x2(%r13,%r14,1)
│ +  43c87f:    43 c6 44 27 02 00       movb   $0x0,0x2(%r15,%r12,1)
│ -  43c889:    46 0f b6 7c 33 01       movzbl 0x1(%rbx,%r14,1),%r15d
│ +  43c885:    42 0f b6 4c 23 01       movzbl 0x1(%rbx,%r12,1),%ecx
│ -  43c88f:    49 83 c6 01             add    $0x1,%r14
│ +  43c88b:    49 83 c4 01             add    $0x1,%r12
│ +  43c88f:    84 c9                   test   %cl,%cl
│ -  43c803:    4c 63 c2                movslq %edx,%r8
│ +  43c803:    4c 63 da                movslq %edx,%r11
│ -  43c806:    49 c1 e0 03             shl    $0x3,%r8
│ +  43c806:    49 c1 e3 03             shl    $0x3,%r11
│ -  43c80a:    4c 03 05 e7 7a bc 00    add    0xbc7ae7(%rip),%r8        # 0x10042f8
│ +  43c80a:    4c 03 1d e7 7a bc 00    add    0xbc7ae7(%rip),%r11        # 0x10042f8
│ -  43c811:    4c 89 05 d0 7a bc 00    mov    %r8,0xbc7ad0(%rip)        # 0x10042e8
│ +  43c811:    4c 89 1d d0 7a bc 00    mov    %r11,0xbc7ad0(%rip)        # 0x10042e8
│ -  43c818:    4c 3b 05 d1 7a bc 00    cmp    0xbc7ad1(%rip),%r8        # 0x10042f0
│ +  43c818:    4c 3b 1d d1 7a bc 00    cmp    0xbc7ad1(%rip),%r11        # 0x10042f0
│    43c81f:    76 07                   jbe    0x43c828
│ -  43c821:    4c 89 05 c8 7a bc 00    mov    %r8,0xbc7ac8(%rip)        # 0x10042f0
│ +  43c821:    4c 89 1d c8 7a bc 00    mov    %r11,0xbc7ac8(%rip)        # 0x10042f0
│ -  43c865:    44 89 f9                mov    %r15d,%ecx
│ -  43c868:    48 d3 e2                shl    %cl,%rdx
│ +  43c865:    48 d3 e2                shl    %cl,%rdx
│ -  43c86b:    41 80 ff 3f             cmp    $0x3f,%r15b
│ +  43c868:    80 f9 3f                cmp    $0x3f,%cl
│ -  43c86f:    0f 87 73 01 00 00       ja     0x43c9e8
│ +  43c86b:    0f 87 60 01 00 00       ja     0x43c9d1
│ -  43c875:    48 21 c2                and    %rax,%rdx
│ +  43c871:    48 21 c2                and    %rax,%rdx
│ -  43c878:    0f 84 6a 01 00 00       je     0x43c9e8
│ +  43c874:    0f 84 57 01 00 00       je     0x43c9d1
│ -  43c87e:    47 88 7c 35 01          mov    %r15b,0x1(%r13,%r14,1)
│ +  43c87a:    43 88 4c 27 01          mov    %cl,0x1(%r15,%r12,1)

I have trouble figuring out how core counts would affect the compiler in this way, though.
Comment 11 Ed Maste freebsd_committer 2020-05-22 18:30:36 UTC
(In reply to Conrad Meyer from comment #9)
Perhaps this is compiler nondeterminism introduced in Clang 10?
Comment 12 Conrad Meyer freebsd_committer 2020-05-22 18:32:04 UTC
Beats me; I think dim@ might have more of a clue.  You or dim@ probably have better channels for inquiring further. :-)
Comment 13 Dimitry Andric freebsd_committer 2020-05-23 10:39:51 UTC
This is an interesting bug, but what is the easiest way to reproduce? Does anybody know which particular .c file, or even which particular function is apparently compiled differently?

Note that clang is really single threaded, so I don't understand how the number of cores can affect the result. Maybe the non-determinism (if it really exists) is independent of the number of cores, but changing that influences stack allocations, etc?

And is this *only* on 11.x? Because that is still using BFD ld, as far as I know. Although the linker shouldn't matter...
Comment 14 Fabian Keil 2020-05-23 14:09:16 UTC
I currently use the lazy approach and build complete releases.
Obviously this makes bisecting slow but I haven't had time to
simplify the steps yet.

I've only seen the issue on 11.x but haven't tested 12.x.
Comment 15 Dimitry Andric freebsd_committer 2020-05-23 15:21:09 UTC
Are people using -j here to build /usr/src/rescue, and if so, with which setting?  I've run a plain 'make' on a VM with 4 cores at first, then 2 cores, but the resulting /usr/obj/usr/src/rescue directories are bit-identical then.

Maybe I should retry with -j4 when I configure 4 cores, and -j2 when I configure 2 cores?
Comment 16 Dimitry Andric freebsd_committer 2020-05-23 17:11:55 UTC
Okay, I can reproduce, and it appears to be something in usr.bin/printf/printf.c, which gets built as part of sh. In particular, in the printf_doformat function.

One version gives:

0000000000000440 <printf_doformat>:
     440:       55                      push   %rbp
     441:       48 89 e5                mov    %rsp,%rbp
     444:       41 57                   push   %r15
     446:       41 56                   push   %r14
     448:       41 55                   push   %r13
     44a:       41 54                   push   %r12
     44c:       53                      push   %rbx
     44d:       48 81 ec e8 00 00 00    sub    $0xe8,%rsp
     454:       49 89 f4                mov    %rsi,%r12
     457:       48 89 fb                mov    %rdi,%rbx
     45a:       48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 461 <printf_doformat+0x21>
     461:       48 89 45 d0             mov    %rax,-0x30(%rbp)
     465:       e8 00 00 00 00          callq  46a <printf_doformat+0x2a>

another:

0000000000000440 <printf_doformat>:
     440:       55                      push   %rbp
     441:       48 89 e5                mov    %rsp,%rbp
     444:       41 57                   push   %r15
     446:       41 56                   push   %r14
     448:       41 55                   push   %r13
     44a:       41 54                   push   %r12
     44c:       53                      push   %rbx
     44d:       48 81 ec e8 00 00 00    sub    $0xe8,%rsp
     454:       49 89 f5                mov    %rsi,%r13
     457:       48 89 fb                mov    %rdi,%rbx
     45a:       48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 461 <printf_doformat+0x21>
     461:       48 89 45 d0             mov    %rax,-0x30(%rbp)
     465:       e8 00 00 00 00          callq  46a <printf_doformat+0x2a>

The C function starts with:

static char *
printf_doformat(char *fmt, int *rval)
{
        static const char skip1[] = "#'-+ 0";
        int fieldwidth, haveprec, havewidth, mod_ldbl, precision;
        char convch, nextch;
        char start[strlen(fmt) + 1];
        char **fargv;
        char *dptr;
        int l;

and the starting piece of code corresponds to the dynamic array declaration, char start[strlen(fmt) + 1].

As to why this is happening, no clue yet... I'm trying to build clang with UBSan, to see if it hits something in particular.
Comment 17 Fabian Keil 2020-05-26 18:41:43 UTC
In the initial report I wrote:

> I first saw the issue after rebasing ElectroBSD on
> r360524 and confirmed that it still exists in r360986.

As it turns out that wasn't entirely correct.

While both revisions result in base.txz differences with different cpu cores, the rescue binaries in r360524 are reproducible but the /bin/sh differs. Presumably because of the printf.c issue found by Dimitry.

In r360986 the /bin/sh issue is already gone.

I'll continue bisecting ...