Bug 224917 - www/firefox: bus error on stable/10 with 57.0.3,1
Summary: www/firefox: bus error on stable/10 with 57.0.3,1
Status: Closed FIXED
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s) (show other bugs)
Version: Latest
Hardware: amd64 Any
: --- Affects Only Me
Assignee: freebsd-gecko (Nobody)
URL:
Keywords: regression
Depends on:
Blocks: 224591
  Show dependency treegraph
 
Reported: 2018-01-04 22:04 UTC by Jilles Tjoelker
Modified: 2018-01-12 21:42 UTC (History)
6 users (show)

See Also:
bugzilla: maintainer-feedback? (gecko)
jbeich: merge-quarterly?


Attachments
gdb output (16.25 KB, text/plain)
2018-01-04 22:04 UTC, Jilles Tjoelker
no flags Details
llvm50 update (3.45 KB, patch)
2018-01-06 11:50 UTC, Jan Beich
jbeich: maintainer-approval? (brooks)
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jilles Tjoelker freebsd_committer freebsd_triage 2018-01-04 22:04:19 UTC
Created attachment 189422 [details]
gdb output

After upgrading firefox from 57.0.2_1,1 to 57.0.3,1 it started crashing with bus error frequently (in particular on the about:sessionrestore page but also elsewhere). This is stable/10 amd64 with packages from pkg.freebsd.org. The machine also uses nvidia-driver 384.98.

What happens is that clang has decided to combine two 64-bit stores in mozilla::ipc::MessageChannel::Clear() in libxul.so into one aligned 128-bit store (movaps). However, the object is actually not 128-bit aligned and a general protection fault occurs in the main firefox process. Some gdb output is in an attachment.

As a result, various child processes crash in mozilla::ipc::MessageChannel::OnChannelErrorFromLink(). I think these crashes are a direct result of the original crash, and not interesting by themselves.

The insufficient alignment could be because of a disagreement between various pieces of code about what the alignment should be or because the pointer is bogus. Assuming it is the former, I have modified one byte in /usr/local/lib/firefox/libxul.so to change the movaps instruction to movups so it will work with any alignment. With this change, firefox runs stably again for me (note that, on this machine, I have many tabs open but I do not leave firefox running for more than a day).

More details about the workaround:

--- /tmp/zshXpplPs      2018-01-03 23:47:31.929337000 +0100
+++ /tmp/zshxVT4SU      2018-01-03 23:47:31.929702000 +0100
@@ -1,5 +1,5 @@
 
-/usr/local/lib/firefox/libxul.so:     file format elf64-x86-64-freebsd
+libxul.so.fixed:     file format elf64-x86-64-freebsd
 
 Disassembly of section .init:
 
@@ -1231701,7 +1231701,7 @@
   cebf90:      e8 8b bf 00 00          callq  cf7f20 <_ZNSt3__16__treeINS_12__value_typeImN7mozilla3ipc14MessageChannel13PromiseHolderEEENS_19__map_value_compareImS6_NS_4lessImEELb1EEENS_9allocatorIS6_EEE7destroyEPNS_11__tree_nodeIS6_PvEE>
   cebf95:      4d 89 be 08 01 00 00    mov    %r15,0x108(%r14)
   cebf9c:      0f 57 c0                xorps  %xmm0,%xmm0
-  cebf9f:      41 0f 29 86 10 01 00    movaps %xmm0,0x110(%r14)
+  cebf9f:      41 0f 11 86 10 01 00    movups %xmm0,0x110(%r14)
   cebfa6:      00 
   cebfa7:      49 c7 46 38 00 00 00    movq   $0x0,0x38(%r14)
   cebfae:      00 

root@lion /home/jilles# cmp -l /root/libxul.so.orig /usr/local/lib/firefox/libxul.so
13549474  51  21
Comment 1 Jan Beich freebsd_committer freebsd_triage 2018-01-05 00:23:23 UTC
Probably a regression from bug 224591. Can you confirm building with clang40 works around the issue? I wonder why base clang 5.0 in /stable/11 or /head is not affected.
Comment 2 Jilles Tjoelker freebsd_committer freebsd_triage 2018-01-05 20:55:07 UTC
I rebuilt firefox-57.0.3,1 from a (slightly more recent) ports tree with SVN r457360 reverted. This seems to run stably for 20 minutes and the disassembly of the corresponding part of libxul.so shows that clang 4.0 has used two regular mov instructions instead of SSE:

  ceac73:       75 9b                   jne    ceac10 <_ZN7mozilla3ipc14MessageChannel5ClearEv+0x90>
  ceac75:       49 8b b6 10 01 00 00    mov    0x110(%r14),%rsi
  ceac7c:       48 8b 7d c8             mov    -0x38(%rbp),%rdi
  ceac80:       e8 9b bf 00 00          callq  cf6c20 <_ZNSt3__16__treeINS_12__value_typeImN7mozilla3ipc14MessageChannel13PromiseHolderEEENS_19__map_value_compareImS6_NS_4lessImEELb1EEENS_9allocatorIS6_EEE7destroyEPNS_11__tree_nodeIS6_PvEE>
  ceac85:       49 c7 86 18 01 00 00    movq   $0x0,0x118(%r14)
  ceac8c:       00 00 00 00 
  ceac90:       4d 89 be 08 01 00 00    mov    %r15,0x108(%r14)
  ceac97:       49 c7 86 10 01 00 00    movq   $0x0,0x110(%r14)
  ceac9e:       00 00 00 00 
  ceaca2:       49 c7 46 38 00 00 00    movq   $0x0,0x38(%r14)
  ceaca9:       00 
  ceacaa:       49 8b 7e 30             mov    0x30(%r14),%rdi
  ceacae:       48 85 ff                test   %rdi,%rdi
  ceacb1:       74 06                   je     ceacb9 <_ZN7mozilla3ipc14MessageChannel5ClearEv+0x139>
Comment 3 Jan Beich freebsd_committer freebsd_triage 2018-01-06 11:50:14 UTC
Created attachment 189459 [details]
llvm50 update

devel/llvm50 is at 5.0.0 while 5.0.1 was released >20 days ago. 🤦
Can apply the patch then rebuild both llvm50 and firefox?
Comment 4 Jan Beich freebsd_committer freebsd_triage 2018-01-06 11:54:47 UTC
Comment on attachment 189459 [details]
llvm50 update

Nevermind, firefox still crashes frequently...
Comment 5 Mikhail T. 2018-01-08 06:01:23 UTC
I first had this problem, when I built 57.0.2 with clang50 (by overwriting the then-default compiler).

Now that clang50 is the default, I can not get 57.0.4 to build properly:

 * clang-devel (6.0.0), clang40, clang39 all fail to compile because of syntax errors (something about Args vs. args, as well as char16_t being a built-in type)
 * gcc6 and gcc7 compile, but fail to link -- some hideous-looking C++ symbol not found

clang50 builds the binary, but it fails to start. Unless I build it without /any/ optimization. I normally use -O2, which crashes. I tried -O1 -- still crashes. Only by setting -O0 can I get a sort-of usable binary (very slow).

Will try to upgrade clang -- glad to see, it is not my setup...
Comment 6 Jan Beich freebsd_committer freebsd_triage 2018-01-08 08:10:04 UTC
(In reply to Mikhail T. from comment #5)
> * clang-devel (6.0.0), clang40, clang39 all fail to compile because
> of syntax errors (something about Args vs. args, as well as char16_t
> being a built-in type)

Overriding CC/CXX/CPP on command line may not work, so set those via environment, make.conf or makefile. For one, moz.configure appends -std=gnu99 to CC, -std=gnu++11 to CXX.

> * gcc6 and gcc7 compile, but fail to link -- some hideous-looking
> C++ symbol not found

See bug 221288. Mixing libstdc++ and libc++ isn't currently supported. It only worked somewhat for base libstdc++ because it used libcxxrt. So, to build with lang/gcc* rebuild any dependency that links against libc++.
Comment 7 Jan Beich freebsd_committer freebsd_triage 2018-01-08 10:38:54 UTC
Crashes when built against Clang 5.0.1 appear to be different or maybe it's just because I'm testing bug 223425.

More user reports:
https://forums.freebsd.org/threads/64014/
https://www.bsdforen.de/threads/firefox57-core-dump-beim-verlassen.34032/
https://twitter.com/garrett_wollman/status/950173333261357056
Comment 8 Mikhail T. 2018-01-08 12:25:45 UTC
Rebuilt with clang-5.0.1 and it seems to be working.

Definitely beyond the point where it used to crash. Thanks!
Comment 9 Mikhail T. 2018-01-09 04:48:47 UTC
(In reply to Mikhail T. from comment #8)
> Definitely beyond the point where it used to crash. Thanks!

I was too optimistic. The crash is still here. Rebuilding with clang-4.0 (thanks for the instructions!) worked, however...
Comment 10 Garrett Wollman freebsd_committer freebsd_triage 2018-01-09 05:47:54 UTC
(In reply to Mikhail T. from comment #9)
Confirmed that 5.0.1 doesn't help for me either. Going to try your suggestion of 4.0 now.
Comment 11 Garrett Wollman freebsd_committer freebsd_triage 2018-01-09 05:57:59 UTC
(In reply to Garrett Wollman from comment #10)
Hmmm, actually, I don't see any knobs in ports to select the compiler any more, so I have no idea how I'd even go about doing this. Hopefully someone can come up with a fix that doesn't involve binary patching.
Comment 12 Garrett Wollman freebsd_committer freebsd_triage 2018-01-09 06:10:08 UTC
(In reply to Garrett Wollman from comment #11)
Also confirming that Jilles' binary patching fix (there's only one byte sequence that matches) appears to work for me.
Comment 13 russo 2018-01-10 00:21:27 UTC
(In reply to Garrett Wollman from comment #11)
I can confirm that building firefox with clang40 fixed this issue for me on stable/10.

    env CC=clang40 CXX=clang++40 CPP=clang-cpp40 make deinstall install

in the firefox port directory did the trick.
Comment 14 Mikhail T. 2018-01-10 02:33:50 UTC
Are we sure, this is a bug in clang and not an error in the code?

Even if compiler really is at fault, perhaps, we can find the spot in question and rephrase it as a work-around?
Comment 15 Garrett Wollman freebsd_committer freebsd_triage 2018-01-10 03:09:41 UTC
(In reply to russo from comment #13)
Doesn't help -- my package builds are automated. To make this work I'd need a way to tell bsd.port.mk to use (and build!) clang 4.0 instead of 5.0.

Has anyone filed an LLVM bug report for this yet?
Comment 16 Mikhail T. 2018-01-10 04:19:59 UTC
(In reply to Garrett Wollman from comment #15)
> To make this work I'd need a way to tell bsd.port.mk to use (and build!)
> clang 4.0 instead of 5.0.

Put the following snippet into your /etc/make.conf:

.if ${.CURDIR:T} == "firefox"
CC=     clang40
CXX=    clang++40
CPP=    clang-cpp40
.endif
Comment 17 Dimitry Andric freebsd_committer freebsd_triage 2018-01-10 21:08:58 UTC
(In reply to Jilles Tjoelker from comment #2)
> I rebuilt firefox-57.0.3,1 from a (slightly more recent) ports tree with SVN
> r457360 reverted. This seems to run stably for 20 minutes and the
> disassembly of the corresponding part of libxul.so shows that clang 4.0 has
> used two regular mov instructions instead of SSE:
> 
>   ceac73:       75 9b                   jne    ceac10
> <_ZN7mozilla3ipc14MessageChannel5ClearEv+0x90>
>   ceac75:       49 8b b6 10 01 00 00    mov    0x110(%r14),%rsi
>   ceac7c:       48 8b 7d c8             mov    -0x38(%rbp),%rdi
>   ceac80:       e8 9b bf 00 00          callq  cf6c20
> <_ZNSt3__16__treeINS_12__value_typeImN7mozilla3ipc14MessageChannel13PromiseHo
> lderEEENS_19__map_value_compareImS6_NS_4lessImEELb1EEENS_9allocatorIS6_EEE7de
> stroyEPNS_11__tree_nodeIS6_PvEE>
>   ceac85:       49 c7 86 18 01 00 00    movq   $0x0,0x118(%r14)
>   ceac8c:       00 00 00 00 
>   ceac90:       4d 89 be 08 01 00 00    mov    %r15,0x108(%r14)
>   ceac97:       49 c7 86 10 01 00 00    movq   $0x0,0x110(%r14)
>   ceac9e:       00 00 00 00 
>   ceaca2:       49 c7 46 38 00 00 00    movq   $0x0,0x38(%r14)
>   ceaca9:       00 
>   ceacaa:       49 8b 7e 30             mov    0x30(%r14),%rdi
>   ceacae:       48 85 ff                test   %rdi,%rdi
>   ceacb1:       74 06                   je     ceacb9
> <_ZN7mozilla3ipc14MessageChannel5ClearEv+0x139>

The code in question is, strangely, just the clearing of a std::map<> object:

void
MessageChannel::Clear()
{
[...]
   gUnresolvedPromises -= mPendingPromises.size();
    for (auto& pair : mPendingPromises) {
        pair.second.mRejectFunction(pair.second.mPromise,
                                    PromiseRejectReason::ChannelClosed,
                                    __func__);
    }
    mPendingPromises.clear();  /// <--- [1]

    mWorkerLoop = nullptr;
    delete mLink;
    mLink = nullptr;

The [1] line expands to:

        movq    272(%r14), %rsi
        movq    -56(%rbp), %rdi         # 8-byte Reload
        callq   _ZNSt3__16__treeINS_12__value_typeImN7mozilla3ipc14MessageChannel13PromiseHolderEEENS_19__map_value_compareImS6_NS_4lessImEELb1EEENS_9allocatorIS6_EEE7destroyEPNS_11__tree_nodeIS6_PvEE
        movq    %r15, 264(%r14)
        xorps   %xmm0, %xmm0
        movaps  %xmm0, 272(%r14)

Here, the mangled function is just an internal destroy function of std::map, and the instructions just after that are the last 3 lines of __tree::clear (in /usr/include/c++/v1/__tree):

template <class _Tp, class _Compare, class _Allocator>
void
__tree<_Tp, _Compare, _Allocator>::clear() _NOEXCEPT
{
    destroy(__root());
    size() = 0;
    __begin_node() = __end_node();
    __end_node()->__left_ = nullptr;
}

Strangely, it turned out that rebuilding MessageChannel.cpp with newer libc++ headers made the movaps instruction change to movups, even with clang 5.0.0.  I bisected through the libc++ history, and ended up at this commit:

http://llvm.org/viewvc/llvm-project?view=revision&revision=276003

"""
Fix undefined behavior in __tree

Summary:
This patch attempts to fix the undefined behavior in __tree by changing the node pointer types used throughout. The pointer types are changed for raw pointers in the current ABI and for fancy pointers in ABI V2 (since the fancy pointer types may not be ABI compatible).

The UB in `__tree` arises because tree downcasts the embedded end node and then deferences that pointer. Currently there are 3 node types in __tree.
[... read original message for full story...]
"""

Unfortunately this fix is rather extensive, and hard to apply to the version of libc++ in stable/10, which is still approximately at version 3.4.1.

I will have a look at backporting the fix.  Meanwhile, maybe lowering the optimization level for this particular file, or just using clang 4.0.x might be a workaround.
Comment 18 commit-hook freebsd_committer freebsd_triage 2018-01-11 00:55:01 UTC
A commit references this bug:

Author: jbeich
Date: Thu Jan 11 00:54:00 UTC 2018
New revision: 458705
URL: https://svnweb.freebsd.org/changeset/ports/458705

Log:
  www/firefox: work around crash on FreeBSD 10

  PR:		224917
  Suggested by:	dim

Changes:
  head/www/firefox/Makefile
  head/www/firefox/files/patch-ipc_glue_MessageChannel.cpp
  head/www/waterfox/Makefile
  head/www/waterfox/files/patch-ipc_glue_MessageChannel.cpp
Comment 19 commit-hook freebsd_committer freebsd_triage 2018-01-11 00:56:06 UTC
A commit references this bug:

Author: jbeich
Date: Thu Jan 11 00:55:34 UTC 2018
New revision: 458709
URL: https://svnweb.freebsd.org/changeset/ports/458709

Log:
  MFH: r458705

  www/firefox: work around crash on FreeBSD 10

  PR:		224917
  Suggested by:	dim
  Approved by:	ports-secteam blanket

Changes:
_U  branches/2018Q1/
  branches/2018Q1/www/firefox/Makefile
  branches/2018Q1/www/firefox/files/patch-ipc_glue_MessageChannel.cpp
  branches/2018Q1/www/waterfox/Makefile
  branches/2018Q1/www/waterfox/files/patch-ipc_glue_MessageChannel.cpp
Comment 20 Jan Beich freebsd_committer freebsd_triage 2018-01-11 01:28:54 UTC
Clang lacks optimize attribute, so I've used optnone for MessageChannel::Clear(). Maybe worse than -O1 for the whole file but easier to maintain than un-unifying MessageChannel.cpp from 14 other files in the same directory while limiting to libc++.

Expect the binary package with the workaround sometime tomorrow. Can someone confirm?
http://beefy6.nyi.freebsd.org/build.html?mastername=103amd64-default&build=458707
http://beefy2.nyi.freebsd.org/build.html?mastername=103amd64-quarterly&build=458709
Comment 21 Garrett Wollman freebsd_committer freebsd_triage 2018-01-12 04:21:35 UTC
(In reply to Jan Beich from comment #20)
WORKSFORME
Comment 22 Garrett Wollman freebsd_committer freebsd_triage 2018-01-12 04:23:02 UTC
(In reply to Garrett Wollman from comment #21)
Just for clarity, I built my own package as usual from the updated port -- still with clang 5.0.1 since that's still in my repo.