Bug 199557 - Hang on sysconf(_SC_OPEN_MAX)
Summary: Hang on sysconf(_SC_OPEN_MAX)
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-04-20 10:55 UTC by ikosarev
Modified: 2017-04-23 18:25 UTC (History)
5 users (show)

See Also:


Attachments
The test source. (7.59 KB, text/plain)
2015-04-20 10:55 UTC, ikosarev
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description ikosarev 2015-04-20 10:55:16 UTC
Created attachment 155764 [details]
The test source.

The attached source unexpectedly hangs on sysconf(_SC_OPEN_MAX) when run under

FreeBSD localhost 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r280598: Wed Mar 25 15:54:11 UTC 2015     root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

in a VirtualBox session.

How to reproduce:
$ cat run.sh
clang test.cc -lpthread -lc++ || exit 1
while true; do
    date
    ./a.out
done
$ ./run.sh

The problem usually expose itself after a dozen of iterations.

Actual output:
$ ./run.sh
Mon Apr 20 13:53:30 EEST 2015
before sysconf()

Expected output:
Mon Apr 20 13:53:30 EEST 2015
before sysconf()
after sysconf()

before sysconf()
after sysconf()
...
Comment 1 Ed Maste freebsd_committer 2015-04-20 14:25:40 UTC
Reproduced on FreeBSD 10.1-STABLE r280427+86df2de(stable-10) on the second try.

procstat shows:

  PID    TID COMM             TDNAME           KSTACK                       
29430 101383 a.out            -                mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d kern_wait6+0x5f4 sys_wait4+0x72 amd64_syscall+0x357 Xfast_syscall+0xfb 
29431 104036 a.out            -                mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_rw_rdlock+0x39b __umtx_op_rw_rdlock+0x4b amd64_syscall+0x357 Xfast_syscall+0xfb
Comment 2 Ed Maste freebsd_committer 2015-04-20 14:38:39 UTC
Pointed out by kib, when invoking the fork syscall directly the child inherits potentially locked libc mutexes, and it's probably the printf that hangs, not the sysconf.

What was the original underlying issue here?
Comment 3 Ed Maste freebsd_committer 2015-04-20 14:45:03 UTC
userland backtraces:

(lldb) bt
* thread #1: tid = 103971, 0x0000000800f8e198 libc.so.7`__syscall + 8 at __syscall.S:3
  * frame #0: 0x0000000800f8e198 libc.so.7`__syscall + 8 at __syscall.S:3
    frame #1: 0x0000000000400e32 a.out`zz_fork(file=0x0000000000401222, line=269) + 114 at pr199557.cc:244
    frame #2: 0x0000000000400f7b a.out`zz_pthread_create(th=0x00007fffffffe528, attr=0x0000000000000000, callback=0x0000000000401090, param=0x0000000000000000) + 235 at pr199557.cc:269
    frame #3: 0x00000000004010dc a.out`main + 44 at pr199557.cc:294
    frame #4: 0x000000000040087f a.out`_start(ap=<unavailable>, cleanup=<unavailable>) + 367 at crt1.c:78

(lldb) bt
* thread #1: tid = 103861, 0x0000000800833b3c libthr.so.3 at _umtx_op_err.S:37
  * frame #0: 0x0000000800833b3c libthr.so.3 at _umtx_op_err.S:37
    frame #1: 0x00000008008314dc libthr.so.3`_thr_rtld_rlock_acquire [inlined] _thr_rwlock_rdlock(flags=<unavailable>, tsp=<unavailable>) + 76 at thr_umtx.h:196
    frame #2: 0x0000000800831490 libthr.so.3`_thr_rtld_rlock_acquire(lock=0x0000000800a42680) + 64 at thr_rtld.c:123
    frame #3: 0x000000080060bf42 ld-elf.so.1`rlock_acquire(lock=0x000000080081f860, lockstate=0x00007fffffffe358) + 50 at rtld_lock.c:197
    frame #4: 0x0000000800605c5d ld-elf.so.1`_rtld_bind(obj=0x0000000800621000, reloff=0) + 45 at rtld.c:679
    frame #5: 0x00000008006034fd ld-elf.so.1`??? + 45 at rtld_start.S:121
    frame #6: 0x0000000000400d9b a.out`zz_sysconf(file=0x0000000000401222, line=269) + 43 at pr199557.cc:227
    frame #7: 0x0000000000400e01 a.out`zz_fork(file=0x0000000000401222, line=269) + 65 at pr199557.cc:240
    frame #8: 0x0000000000400f7b a.out`zz_pthread_create(th=0x00007fffffffe528, attr=0x0000000000000000, callback=0x0000000000401090, param=0x0000000000000000) + 235 at pr199557.cc:269
    frame #9: 0x00000000004010dc a.out`main + 44 at pr199557.cc:294
    frame #10: 0x000000000040087f a.out`_start(ap=<unavailable>, cleanup=<unavailable>) + 367 at crt1.c:78
Comment 4 Ed Maste freebsd_committer 2015-04-20 14:55:41 UTC
See http://svnweb.freebsd.org/changeset/base/266609 for the fix for this issue in the fork() case.

In any case calling the fork syscall directly is going to be error-prone, so it would be useful to have a brief description of how tsan intends to interact with a target process fork.
Comment 5 ikosarev 2015-04-23 08:29:55 UTC
(In reply to Ed Maste from comment #2)
> Pointed out by kib, when invoking the fork syscall directly the child
> inherits potentially locked libc mutexes, and it's probably the printf
> that hangs, not the sysconf.

Note that the code doesn't leverage libc's printf(), but rather uses its own version of it.

> What was the original underlying issue here?

The original problem is random Tsan tests hanging on high loaded machines like the FreeBSD sanitizers buildbot:

http://lab.llvm.org:8011/builders/sanitizer_x86_64-freebsd/builds/4978/steps/make-check-tsan/logs/stdio

It mostly work fine when the tests run in about 4 threads, but tends to fail constantly when run in about 64 threads--doesn't matter how much cores the hardware has.

This is how it leads to the point of hang: during execution of a Tsan tests the Tsan STL catches a data race or other condition triggering a run-time report. That report contains references to related object files and associated addresses. These file names together with the addresses are then translated into source file name, line numbers and function names by calling the llvm-symbolizer or, if the RTL is unable to locate it, the addr2line command. To catch the output of the symbolizing tool the RTL fork() the process and redirects the streams. This is where the whole thing hangs. Note is that if I comment the sysconf(_SC_OPEN_MAX) call, then it hangs on one of the other subsequent system calls, like read() so it appears to not to be a sysconf()-specific issue.

Another note is that the RTL sources refer to a specific reason about why it tries to use the syscall fork() in place of the libc's intercepted one:

    // Real fork() may call user callbacks registered with pthread_atfork().
    pid = internal_fork();

The internal_fork() function is just the syscall:

int internal_fork() {
#if SANITIZER_USES_CANONICAL_LINUX_SYSCALLS
  return internal_syscall(SYSCALL(clone), SIGCHLD, 0);
#else
  return internal_syscall(SYSCALL(fork));
#endif
}
Comment 6 Konstantin Belousov freebsd_committer 2015-04-23 12:06:52 UTC
(In reply to ikosarev from comment #5)
It is actually rtld' locks which are inherited blocked and thus symbol resolution is non-functional.  As a quick test, try your test with LD_BIND_NOW=1 env var set.

Direct calls to the fork syscall from the threading process is not supposed to work.
Comment 7 Jilles Tjoelker freebsd_committer 2015-04-23 21:32:44 UTC
There is a proposal for an async-signal safe version of fork() called _Fork(), which does not call atfork handlers, at http://austingroupbugs.net/view.php?id=62 . This would help if the only problem with calling fork() is that it executes atfork handlers. It still executes a fair bit of code, but no user code.

To make _Fork() async-signal safe, the malloc handling would have to be disabled as well, making malloc/free in the child more unsafe (but also interfering less with other threads in the parent). The handling of the lock for sem_open() and sem_close() uses pthread_atfork() and would be disabled as well.

This may be useful for this and other situations that want to fork from signal handlers or other strange thread states.

I have not found common implementations of _Fork(), though.