274927 – Toolchain fails on the __sync_val_compare_and_swap function without -march=native (port biology/seqwish)

Bug 274927 - Toolchain fails on the __sync_val_compare_and_swap function without -march=native (port biology/seqwish)

Summary: Toolchain fails on the __sync_val_compare_and_swap function without -march=na...

Status:	New

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	bin (show other bugs)
Version:	13.2-STABLE
Hardware:	Any Any

Importance:	--- Affects Only Me
Assignee:	freebsd-toolchain (Nobody)

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-11-05 19:37 UTC by Yuri Victorovich
Modified:	2023-11-06 14:11 UTC (History)
CC List:	1 user (show)

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Yuri Victorovich freebsd_committer

2023-11-05 19:37:00 UTC

The port biology/seqwish fails when -march=native is removed.

I couldn't reproduce this on a smaller example.

Un-commenting CMAKE_ARGS=-DEXTRA_FLAGS="" in the port's makefile causes removal of -march=native in the cmake script, which triggers this build failure:

ld: error: undefined symbol: __sync_val_compare_and_swap_16
>>> referenced by transclosure.cpp
>>>               CMakeFiles/seqwish.dir/src/transclosure.cpp.o:(std::__1::__function::__func<seqwish::compute_transitive_closures(seqwish::seqindex_t const&, mmmulti::iitree<unsigned long, unsigned long>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, mmmulti::iitree<unsigned long, unsigned long>&, mmmulti::iitree<unsigned long, unsigned long>&, unsigned long, unsigned long, unsigned long, bool, unsigned long, std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&)::$_5, std::__1::allocator<seqwish::compute_transitive_closures(seqwish::seqindex_t const&, mmmulti::iitree<unsigned long, unsigned long>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, mmmulti::iitree<unsigned long, unsigned long>&, mmmulti::iitree<unsigned long, unsigned long>&, unsigned long, unsigned long, unsigned long, bool, unsigned long, std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&)::$_5>, void (unsigned long)>::operator()(unsigned long&&))
>>> referenced by transclosure.cpp
>>>               CMakeFiles/seqwish.dir/src/transclosure.cpp.o:(seqwish::DisjointSets::unite(unsigned long, unsigned long))
>>> referenced by transclosure.cpp
>>>               CMakeFiles/seqwish.dir/src/transclosure.cpp.o:(seqwish::DisjointSets::unite(unsigned long, unsigned long))
>>> referenced 2 more times
c++: error: linker command failed with exit code 1 (use -v to see invocation)

Comment 1 Mark Millard 2023-11-05 19:53:17 UTC

(In reply to Yuri Victorovich from comment #0)

Hmm. Which toolchain does it use? Also, which platform(s) have
the issue?

When I went looking around, I ran into:

https://llvm.org/doxygen/Atomic_8cpp_source.html

that showed:

. . .
#if defined(__GNUC__) || (defined(__IBMCPP__) && __IBMCPP__ >= 1210)
#define GNU_ATOMICS
#endif
. . .
#if LLVM_HAS_ATOMICS == 0
  sys::cas_flag result = *ptr;
  if (result == old_value)
    *ptr = new_value;
  return result;
#elif defined(GNU_ATOMICS)
  return __sync_val_compare_and_swap(ptr, old_value, new_value);
#elif defined(_MSC_VER)
  return InterlockedCompareExchange(ptr, new_value, old_value);
#else
#  error No compare-and-swap implementation for your platform!
#endif
. . .

FreeBSD tends to not have complete coverage of what comes from
libraries in gcc/g++'s code generation on some platforms,
such as aarch64.

Comment 2 Yuri Victorovich freebsd_committer

2023-11-05 20:23:25 UTC

clang-16.0.6
FreeBSD 13.2 amd64

Comment 3 Mark Millard 2023-11-05 22:55:38 UTC

(In reply to Yuri Victorovich from comment #2)

System clang 16.0.6 ?
devel/llvm16's clang at 16.0.6?

Comment 4 Yuri Victorovich freebsd_committer

2023-11-05 22:59:17 UTC

(In reply to Mark Millard from comment #3)

The default build of the port biology/seqwish is listed as a reproducer.
biology/seqwish doesn't have USES=llvm, therefore this is about the system clang-16.

Comment 5 Mark Millard 2023-11-05 23:59:20 UTC

Not reported about the build were the related warnings about
16 bytes exceeding the lock-free (text from my local reproduction
of the error via poudriere-devel), also tied to transclosure.cpp .
This might be of help for an investigation.

--- CMakeFiles/seqwish.dir/src/transclosure.cpp.o ---
In file included from /wrkdirs/usr/ports/biology/seqwish/work/seqwish-v0.7.9/src/transclosure.cpp:1:
In file included from /wrkdirs/usr/ports/biology/seqwish/work/seqwish-v0.7.9/src/transclosure.hpp:13:
In file included from /wrkdirs/usr/ports/biology/seqwish/work/seqwish-v0.7.9/deps/mmmulti/src/mmiitree.hpp:19:
In file included from /wrkdirs/usr/ports/biology/seqwish/work/seqwish-v0.7.9/deps/ips4o/ips4o.hpp:38:
In file included from /wrkdirs/usr/ports/biology/seqwish/work/seqwish-v0.7.9/deps/ips4o/ips4o/ips4o.hpp:45:
In file included from /wrkdirs/usr/ports/biology/seqwish/work/seqwish-v0.7.9/deps/ips4o/ips4o/memory.hpp:47:
/wrkdirs/usr/ports/biology/seqwish/work/seqwish-v0.7.9/deps/ips4o/ips4o/bucket_pointers.hpp:106:28: warning: large atomic operation may incur significant performance penalty; the access size (16 bytes) exceeds the max lock-free size (8  bytes) [-Watomic-alignment]
            const auto p = __atomic_fetch_sub(&all_,
                           ^
/wrkdirs/usr/ports/biology/seqwish/work/seqwish-v0.7.9/deps/ips4o/ips4o/bucket_pointers.hpp:86:28: warning: large atomic operation may incur significant performance penalty; the access size (16 bytes) exceeds the max lock-free size (8  bytes) [-Watomic-alignment]
            const auto p = __atomic_fetch_add(&all_, Cfg::kBlockSize, __ATOMIC_RELAXED);
                           ^
/wrkdirs/usr/ports/biology/seqwish/work/seqwish-v0.7.9/deps/ips4o/ips4o/bucket_pointers.hpp:106:28: warning: large atomic operation may incur significant performance penalty; the access size (16 bytes) exceeds the max lock-free size (8  bytes) [-Watomic-alignment]
            const auto p = __atomic_fetch_sub(&all_,
                           ^
/wrkdirs/usr/ports/biology/seqwish/work/seqwish-v0.7.9/deps/ips4o/ips4o/bucket_pointers.hpp:86:28: warning: large atomic operation may incur significant performance penalty; the access size (16 bytes) exceeds the max lock-free size (8  bytes) [-Watomic-alignment]
            const auto p = __atomic_fetch_add(&all_, Cfg::kBlockSize, __ATOMIC_RELAXED);
                           ^

Comment 6 Mark Millard 2023-11-06 00:17:51 UTC

https://llvm.org/docs/Atomics.html#libcalls-sync reports that these
routines are special and can not use locks:

QUOTE
. . ., the Target in LLVM can claim support for atomics of an appropriate size, and then implement some subset of the operations via libcalls to a __sync_* function. Such functions must not use locks in their implementation, because unlike the __atomic_* routines used by AtomicExpandPass, these may be mixed-and-matched with native instructions by the target lowering.
END QUOTE

If I read that correctly, there is no generic implementation possible/available.
It has to use hardware instructions that handle the size required. Some amd64
hardware has such, other amd64 hw does not. -march or the like has to indicate
a context with an appropriate instruction for __sync_val_compare_and_swap_16
to use.

So, as far as I can tell, a pre-built biology/seqwish that just works everywhere
for amd64 is not really an option for the port.

Also, use of -march=native is not guaranteed to use an instruction that works
for the general range of amd64 hardware. It just plugs in the definition that
works for FreeBSD build server (that happens to support the operation), not
something generally useful. So just some subset of amd64 hardware likely ends
up handled.

Comment 7 Mark Millard 2023-11-06 01:31:32 UTC

https://clang.llvm.org/docs/UsersManual.html reports:

QUOTE
Several micro-architecture levels as specified by the x86-64 psABI are defined. They are cumulative in the sense that features from previous levels are implicitly included in later levels.

-march=x86-64: CMOV, CMPXCHG8B, FPU, FXSR, MMX, FXSR, SCE, SSE, SSE2
-march=x86-64-v2: (close to Nehalem) CMPXCHG16B, LAHF-SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
-march=x86-64-v3: (close to Haswell) AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, XSAVE
-march=x86-64-v4: AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL
END QUOTE

CMPXCHG16B is the 128 bit (16 Byte) instruction for the lock-free
operation in question, something -march=x86-64 does not have.

So it looks like the Makefile should use: -march=x86-64-v2
if nothing newer than x86-64-v2 is needed. This would be better
than -march-native as far as indicating the widest range of
cpu micro-architectures possible.

However, in general, consideration vs. -march=x86-64-v3 vs.
-march=x86-64-v4 is appropriate since more may be needed. Some
ports might need to add OPTION control over which to use of:

-march=x86-64-v2
-march=x86-64-v3
-march=x86-64-v4

Comment 8 Dimitry Andric freebsd_committer

2023-11-06 12:41:25 UTC

These are all called via seqwish::DisjointSets::unite() (which is in 
https://github.com/ekg/seqwish/blob/master/src/dset64-gccAtomic.hpp):

0000000000000000 <seqwish::DisjointSets::unite(unsigned long, unsigned long)>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
...
  43:   e8 00 00 00 00          call   48 <seqwish::DisjointSets::unite(unsigned long, unsigned long)+0x48>
                        44: R_X86_64_PLT32      __sync_val_compare_and_swap_16-0x4

The file has a comment about this:

 * The implementation in shasta/src/dset64.hpp uses std::atomic<__uint128_t>
 * for lock-free synchronization.
 * On older GCC versions, std::atomic<__uint128_t> is lock-free
 * if compilation is done with -mcx16, which enables the use of the
 * 16-byte (128 bit) compare-and-swap instruction, CMPXCHG16B.
 *
 * Unfortunately, on newer GCC versions, this is no longer true
 * because of gcc bug 80878:
 * https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
 *
 * As a result, there was a significant performance loss in
 * versions of Shasta built with gcc 7,
 * which is used by default on Ubuntu 18.04, when using
 * machines with large number of virtual processors.
 *
 * It is unlikely that this gcc bug will ever be fixed,
 * and to avoid this performance loss this implementation
 * uses gcc primitive __sync_bool_compare_and_swap instead
 * for lock-free synchronization. When compilation
 * is done with -mcx16 and optimization turned on,
 * this primitive uses the CMPXCHG16B instruction
 * and results in optimal speed.
 *
 * The CMPXCHG16B instruction is available on most modern 64-bit x86 processors.
 * Some older processors that don't implement this instruction
 * will crash with an "Illegal instruction" error
 * upon attempting to run this code.

However __sync_bool_compare_and_swap is usually provided by a compiler library such as libgcc or libcompiler-rt. I don't think we have this function for 128 bit integers, though.

As noted in the comment, the code should be compiled with -mxc16 for optimal performance. Processors which do not support CMPXCHG16B are quite ancient now.

Comment 9 Mark Millard 2023-11-06 14:11:55 UTC

(In reply to Dimitry Andric from comment #8)

Interesting.

I'll note that the likes of -march=x86-64-v2 was added in 2020-Oct
for gcc/g++ and clang/clang++ so older notes would not reference
the possibility.

It is associated with:

https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/77566eb03bc6a326811cb7e9a6b9396884b67c7c

For all I know it may have the same issues for lock free code
in the tool chains.