Created attachment 189422 [details] gdb output After upgrading firefox from 57.0.2_1,1 to 57.0.3,1 it started crashing with bus error frequently (in particular on the about:sessionrestore page but also elsewhere). This is stable/10 amd64 with packages from pkg.freebsd.org. The machine also uses nvidia-driver 384.98. What happens is that clang has decided to combine two 64-bit stores in mozilla::ipc::MessageChannel::Clear() in libxul.so into one aligned 128-bit store (movaps). However, the object is actually not 128-bit aligned and a general protection fault occurs in the main firefox process. Some gdb output is in an attachment. As a result, various child processes crash in mozilla::ipc::MessageChannel::OnChannelErrorFromLink(). I think these crashes are a direct result of the original crash, and not interesting by themselves. The insufficient alignment could be because of a disagreement between various pieces of code about what the alignment should be or because the pointer is bogus. Assuming it is the former, I have modified one byte in /usr/local/lib/firefox/libxul.so to change the movaps instruction to movups so it will work with any alignment. With this change, firefox runs stably again for me (note that, on this machine, I have many tabs open but I do not leave firefox running for more than a day). More details about the workaround: --- /tmp/zshXpplPs 2018-01-03 23:47:31.929337000 +0100 +++ /tmp/zshxVT4SU 2018-01-03 23:47:31.929702000 +0100 @@ -1,5 +1,5 @@ -/usr/local/lib/firefox/libxul.so: file format elf64-x86-64-freebsd +libxul.so.fixed: file format elf64-x86-64-freebsd Disassembly of section .init: @@ -1231701,7 +1231701,7 @@ cebf90: e8 8b bf 00 00 callq cf7f20 <_ZNSt3__16__treeINS_12__value_typeImN7mozilla3ipc14MessageChannel13PromiseHolderEEENS_19__map_value_compareImS6_NS_4lessImEELb1EEENS_9allocatorIS6_EEE7destroyEPNS_11__tree_nodeIS6_PvEE> cebf95: 4d 89 be 08 01 00 00 mov %r15,0x108(%r14) cebf9c: 0f 57 c0 xorps %xmm0,%xmm0 - cebf9f: 41 0f 29 86 10 01 00 movaps %xmm0,0x110(%r14) + cebf9f: 41 0f 11 86 10 01 00 movups %xmm0,0x110(%r14) cebfa6: 00 cebfa7: 49 c7 46 38 00 00 00 movq $0x0,0x38(%r14) cebfae: 00 root@lion /home/jilles# cmp -l /root/libxul.so.orig /usr/local/lib/firefox/libxul.so 13549474 51 21
Probably a regression from bug 224591. Can you confirm building with clang40 works around the issue? I wonder why base clang 5.0 in /stable/11 or /head is not affected.
I rebuilt firefox-57.0.3,1 from a (slightly more recent) ports tree with SVN r457360 reverted. This seems to run stably for 20 minutes and the disassembly of the corresponding part of libxul.so shows that clang 4.0 has used two regular mov instructions instead of SSE: ceac73: 75 9b jne ceac10 <_ZN7mozilla3ipc14MessageChannel5ClearEv+0x90> ceac75: 49 8b b6 10 01 00 00 mov 0x110(%r14),%rsi ceac7c: 48 8b 7d c8 mov -0x38(%rbp),%rdi ceac80: e8 9b bf 00 00 callq cf6c20 <_ZNSt3__16__treeINS_12__value_typeImN7mozilla3ipc14MessageChannel13PromiseHolderEEENS_19__map_value_compareImS6_NS_4lessImEELb1EEENS_9allocatorIS6_EEE7destroyEPNS_11__tree_nodeIS6_PvEE> ceac85: 49 c7 86 18 01 00 00 movq $0x0,0x118(%r14) ceac8c: 00 00 00 00 ceac90: 4d 89 be 08 01 00 00 mov %r15,0x108(%r14) ceac97: 49 c7 86 10 01 00 00 movq $0x0,0x110(%r14) ceac9e: 00 00 00 00 ceaca2: 49 c7 46 38 00 00 00 movq $0x0,0x38(%r14) ceaca9: 00 ceacaa: 49 8b 7e 30 mov 0x30(%r14),%rdi ceacae: 48 85 ff test %rdi,%rdi ceacb1: 74 06 je ceacb9 <_ZN7mozilla3ipc14MessageChannel5ClearEv+0x139>
Created attachment 189459 [details] llvm50 update devel/llvm50 is at 5.0.0 while 5.0.1 was released >20 days ago. 🤦 Can apply the patch then rebuild both llvm50 and firefox?
Comment on attachment 189459 [details] llvm50 update Nevermind, firefox still crashes frequently...
I first had this problem, when I built 57.0.2 with clang50 (by overwriting the then-default compiler). Now that clang50 is the default, I can not get 57.0.4 to build properly: * clang-devel (6.0.0), clang40, clang39 all fail to compile because of syntax errors (something about Args vs. args, as well as char16_t being a built-in type) * gcc6 and gcc7 compile, but fail to link -- some hideous-looking C++ symbol not found clang50 builds the binary, but it fails to start. Unless I build it without /any/ optimization. I normally use -O2, which crashes. I tried -O1 -- still crashes. Only by setting -O0 can I get a sort-of usable binary (very slow). Will try to upgrade clang -- glad to see, it is not my setup...
(In reply to Mikhail T. from comment #5) > * clang-devel (6.0.0), clang40, clang39 all fail to compile because > of syntax errors (something about Args vs. args, as well as char16_t > being a built-in type) Overriding CC/CXX/CPP on command line may not work, so set those via environment, make.conf or makefile. For one, moz.configure appends -std=gnu99 to CC, -std=gnu++11 to CXX. > * gcc6 and gcc7 compile, but fail to link -- some hideous-looking > C++ symbol not found See bug 221288. Mixing libstdc++ and libc++ isn't currently supported. It only worked somewhat for base libstdc++ because it used libcxxrt. So, to build with lang/gcc* rebuild any dependency that links against libc++.
Crashes when built against Clang 5.0.1 appear to be different or maybe it's just because I'm testing bug 223425. More user reports: https://forums.freebsd.org/threads/64014/ https://www.bsdforen.de/threads/firefox57-core-dump-beim-verlassen.34032/ https://twitter.com/garrett_wollman/status/950173333261357056
Rebuilt with clang-5.0.1 and it seems to be working. Definitely beyond the point where it used to crash. Thanks!
(In reply to Mikhail T. from comment #8) > Definitely beyond the point where it used to crash. Thanks! I was too optimistic. The crash is still here. Rebuilding with clang-4.0 (thanks for the instructions!) worked, however...
(In reply to Mikhail T. from comment #9) Confirmed that 5.0.1 doesn't help for me either. Going to try your suggestion of 4.0 now.
(In reply to Garrett Wollman from comment #10) Hmmm, actually, I don't see any knobs in ports to select the compiler any more, so I have no idea how I'd even go about doing this. Hopefully someone can come up with a fix that doesn't involve binary patching.
(In reply to Garrett Wollman from comment #11) Also confirming that Jilles' binary patching fix (there's only one byte sequence that matches) appears to work for me.
(In reply to Garrett Wollman from comment #11) I can confirm that building firefox with clang40 fixed this issue for me on stable/10. env CC=clang40 CXX=clang++40 CPP=clang-cpp40 make deinstall install in the firefox port directory did the trick.
Are we sure, this is a bug in clang and not an error in the code? Even if compiler really is at fault, perhaps, we can find the spot in question and rephrase it as a work-around?
(In reply to russo from comment #13) Doesn't help -- my package builds are automated. To make this work I'd need a way to tell bsd.port.mk to use (and build!) clang 4.0 instead of 5.0. Has anyone filed an LLVM bug report for this yet?
(In reply to Garrett Wollman from comment #15) > To make this work I'd need a way to tell bsd.port.mk to use (and build!) > clang 4.0 instead of 5.0. Put the following snippet into your /etc/make.conf: .if ${.CURDIR:T} == "firefox" CC= clang40 CXX= clang++40 CPP= clang-cpp40 .endif
(In reply to Jilles Tjoelker from comment #2) > I rebuilt firefox-57.0.3,1 from a (slightly more recent) ports tree with SVN > r457360 reverted. This seems to run stably for 20 minutes and the > disassembly of the corresponding part of libxul.so shows that clang 4.0 has > used two regular mov instructions instead of SSE: > > ceac73: 75 9b jne ceac10 > <_ZN7mozilla3ipc14MessageChannel5ClearEv+0x90> > ceac75: 49 8b b6 10 01 00 00 mov 0x110(%r14),%rsi > ceac7c: 48 8b 7d c8 mov -0x38(%rbp),%rdi > ceac80: e8 9b bf 00 00 callq cf6c20 > <_ZNSt3__16__treeINS_12__value_typeImN7mozilla3ipc14MessageChannel13PromiseHo > lderEEENS_19__map_value_compareImS6_NS_4lessImEELb1EEENS_9allocatorIS6_EEE7de > stroyEPNS_11__tree_nodeIS6_PvEE> > ceac85: 49 c7 86 18 01 00 00 movq $0x0,0x118(%r14) > ceac8c: 00 00 00 00 > ceac90: 4d 89 be 08 01 00 00 mov %r15,0x108(%r14) > ceac97: 49 c7 86 10 01 00 00 movq $0x0,0x110(%r14) > ceac9e: 00 00 00 00 > ceaca2: 49 c7 46 38 00 00 00 movq $0x0,0x38(%r14) > ceaca9: 00 > ceacaa: 49 8b 7e 30 mov 0x30(%r14),%rdi > ceacae: 48 85 ff test %rdi,%rdi > ceacb1: 74 06 je ceacb9 > <_ZN7mozilla3ipc14MessageChannel5ClearEv+0x139> The code in question is, strangely, just the clearing of a std::map<> object: void MessageChannel::Clear() { [...] gUnresolvedPromises -= mPendingPromises.size(); for (auto& pair : mPendingPromises) { pair.second.mRejectFunction(pair.second.mPromise, PromiseRejectReason::ChannelClosed, __func__); } mPendingPromises.clear(); /// <--- [1] mWorkerLoop = nullptr; delete mLink; mLink = nullptr; The [1] line expands to: movq 272(%r14), %rsi movq -56(%rbp), %rdi # 8-byte Reload callq _ZNSt3__16__treeINS_12__value_typeImN7mozilla3ipc14MessageChannel13PromiseHolderEEENS_19__map_value_compareImS6_NS_4lessImEELb1EEENS_9allocatorIS6_EEE7destroyEPNS_11__tree_nodeIS6_PvEE movq %r15, 264(%r14) xorps %xmm0, %xmm0 movaps %xmm0, 272(%r14) Here, the mangled function is just an internal destroy function of std::map, and the instructions just after that are the last 3 lines of __tree::clear (in /usr/include/c++/v1/__tree): template <class _Tp, class _Compare, class _Allocator> void __tree<_Tp, _Compare, _Allocator>::clear() _NOEXCEPT { destroy(__root()); size() = 0; __begin_node() = __end_node(); __end_node()->__left_ = nullptr; } Strangely, it turned out that rebuilding MessageChannel.cpp with newer libc++ headers made the movaps instruction change to movups, even with clang 5.0.0. I bisected through the libc++ history, and ended up at this commit: http://llvm.org/viewvc/llvm-project?view=revision&revision=276003 """ Fix undefined behavior in __tree Summary: This patch attempts to fix the undefined behavior in __tree by changing the node pointer types used throughout. The pointer types are changed for raw pointers in the current ABI and for fancy pointers in ABI V2 (since the fancy pointer types may not be ABI compatible). The UB in `__tree` arises because tree downcasts the embedded end node and then deferences that pointer. Currently there are 3 node types in __tree. [... read original message for full story...] """ Unfortunately this fix is rather extensive, and hard to apply to the version of libc++ in stable/10, which is still approximately at version 3.4.1. I will have a look at backporting the fix. Meanwhile, maybe lowering the optimization level for this particular file, or just using clang 4.0.x might be a workaround.
A commit references this bug: Author: jbeich Date: Thu Jan 11 00:54:00 UTC 2018 New revision: 458705 URL: https://svnweb.freebsd.org/changeset/ports/458705 Log: www/firefox: work around crash on FreeBSD 10 PR: 224917 Suggested by: dim Changes: head/www/firefox/Makefile head/www/firefox/files/patch-ipc_glue_MessageChannel.cpp head/www/waterfox/Makefile head/www/waterfox/files/patch-ipc_glue_MessageChannel.cpp
A commit references this bug: Author: jbeich Date: Thu Jan 11 00:55:34 UTC 2018 New revision: 458709 URL: https://svnweb.freebsd.org/changeset/ports/458709 Log: MFH: r458705 www/firefox: work around crash on FreeBSD 10 PR: 224917 Suggested by: dim Approved by: ports-secteam blanket Changes: _U branches/2018Q1/ branches/2018Q1/www/firefox/Makefile branches/2018Q1/www/firefox/files/patch-ipc_glue_MessageChannel.cpp branches/2018Q1/www/waterfox/Makefile branches/2018Q1/www/waterfox/files/patch-ipc_glue_MessageChannel.cpp
Clang lacks optimize attribute, so I've used optnone for MessageChannel::Clear(). Maybe worse than -O1 for the whole file but easier to maintain than un-unifying MessageChannel.cpp from 14 other files in the same directory while limiting to libc++. Expect the binary package with the workaround sometime tomorrow. Can someone confirm? http://beefy6.nyi.freebsd.org/build.html?mastername=103amd64-default&build=458707 http://beefy2.nyi.freebsd.org/build.html?mastername=103amd64-quarterly&build=458709
(In reply to Jan Beich from comment #20) WORKSFORME
(In reply to Garrett Wollman from comment #21) Just for clarity, I built my own package as usual from the updated port -- still with clang 5.0.1 since that's still in my repo.