Bug 252579

Summary: fork() causes process to hang in rare circumstances.
Product: Base System
Reporter: shamaz.mazum
Component: threads
Assignee: freebsd-threads (Nobody) <threads>
Status: Closed FIXED
Severity: Affects Only Me
CC: jah
Priority: ---
Version: CURRENT   
Hardware: Any   
OS: Any   
Attachments:
Description                  Flags
NSS module and test program  none
The new test case            none

Description shamaz.mazum 2021-01-11 15:37:48 UTC
Created attachment 221461 [details]
NSS module and test program

This bug was discovered on FreeBSD 13-CURRENT but can also be reproduced on 12.2-RELEASE. It causes a process to hang when fork(2) is called and a specific NSS (Name Service Switch) module is in use.

How to reproduce:

1) Download archive in the attachment.
2) Compile NSS stub module (do not forget .1 at the end of compiled module):
   cc -shared -fPIC -pthread -o nss_stub.so.1 nss_stub.c
3) Copy nss_stub.so.1 to /usr/local/lib
4) Edit /etc/nsswitch.conf and replace 'hosts: files dns' with 'hosts: files dns stub'
5) Compile test program: cc -o bug bug.c
6) Run it. It will hang, and even killall -9 bug won't kill it.

There is a small and unpleasant discussion on the freebsd-net mailing list with Konstantin Belousov, who wanted me to reproduce this bug without editing /etc/nsswitch.conf. I think that is either impossible, because the NSS system is somehow interfering with fork, or beyond my competence. So the provided way to reproduce the bug is as minimal as I can make it.
Comment 1 Jason A. Harmening freebsd_committer freebsd_triage 2021-01-12 05:12:52 UTC
I can reproduce this 100% of the time on a -current VM using the supplied test code.   I noticed a few things:

--for me, the parent process seems to be hanging during fork(); I see no evidence the child process is ever spawned.

--wmesg for the process is 'umtxn', and ddb shows what looks like the main thread attempting to take a userspace lock, going through umtxq_lock(), and sleeping in sleepq_wait_sig()

--I tried to write a smaller test program to reproduce the failure by simulating the locking done by the NS dispatcher and the pthread_create() issued by the stub, but this did not reproduce the hang.

--However, if I just link the original test program against libpthread ('cc -o bug -pthread bug.c'), then I can no longer reproduce the hang. This tells me the problem might have something to do with some bit of static umtx initialization that happens when linking against libpthread/libthr. If this initialization hasn't happened by the time the NS dispatcher (which loads the stub through dlopen()) is invoked, then fork() ends up stuck in a umtx wait that never gets signaled. It might also be related to the __isthreaded checks made by lib/libc/net/nsdispatch.c, which smell fishy to me.

At the very least, it might be possible to make a smaller repro case by writing a test program (that does not link libpthread) which dlopen()s a simple library (which does link libpthread) and calls an entry point that spawns a thread.
Comment 2 shamaz.mazum 2021-01-12 07:05:27 UTC
This is a smaller case: just a library linked against pthread which spawns a thread and a main program which loads that library and forks.

Instructions are similar:

1) Download newtest.tar.gz
2) Compile the library: cc -shared -fPIC -pthread -o bug-lib.so bug-lib.c
3) Compile the main program: cc -o demo demo.c
4) Run it: env LD_LIBRARY_PATH=. ./demo
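
For readers without the attachment, the main-program side of such a test case might look roughly like the sketch below. This is not the attached demo.c: the function name load_and_fork and the idea of passing the library path and entry-point name as arguments are assumptions made for illustration, and the library is assumed to export a void(void) function that calls pthread_create(). The key point is that the program itself is built without -pthread and only pulls in libthr indirectly through dlopen().

```c
#include <dlfcn.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: load a shared library, call an entry point that
 * is assumed to spawn a thread, then fork and reap the child.
 * Returns 0 if fork() completed and the child was reaped, -1 otherwise.
 */
int
load_and_fork(const char *lib_path, const char *entry_name)
{
	void *handle = dlopen(lib_path, RTLD_NOW);
	if (handle == NULL) {
		fprintf(stderr, "dlopen: %s\n", dlerror());
		return (-1);
	}

	/* POSIX permits converting the dlsym() result to a function pointer. */
	void (*entry)(void) = (void (*)(void))dlsym(handle, entry_name);
	if (entry == NULL) {
		fprintf(stderr, "dlsym: %s\n", dlerror());
		dlclose(handle);
		return (-1);
	}

	/* Assumed to create a thread inside the library. */
	entry();

	/* On an affected system, this is the call that never returns. */
	pid_t pid = fork();
	if (pid == -1) {
		perror("fork");
		return (-1);
	}
	if (pid == 0)
		_exit(0);	/* child: nothing to do */

	int status;
	if (waitpid(pid, &status, 0) == -1) {
		perror("waitpid");
		return (-1);
	}
	/* The library stays loaded because its thread may still be running. */
	return (0);
}
```

A main() that calls load_and_fork("bug-lib.so", "start_thread") would complete the program, where start_thread is a made-up entry-point name. Building it with plain cc, without -pthread, matches step 3 above.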
Comment 3 shamaz.mazum 2021-01-12 07:06:02 UTC
Created attachment 221484 [details]
The new test case
Comment 4 commit-hook freebsd_committer freebsd_triage 2021-01-12 10:46:40 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=85d028223bc2768651f4d44881644ceb5dc2a664

commit 85d028223bc2768651f4d44881644ceb5dc2a664
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2021-01-12 09:02:37 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2021-01-12 10:45:44 +0000

    libthr malloc: support recursion on thr_malloc_umtx.

    One possible way the recursion can happen is during fork: suppose
    that fork is called from early code that did not triggered
    jemalloc(3) initialization yet. Then we lock thr_malloc lock, and
    call malloc_prefork() that might require initialization of jemalloc
    pthread_mutexes, calling into libthr malloc. It is safe to allow
    recursion for this occurence.

    PR:     252579
    Reported by:    Vasily Postnicov <shamaz.mazum@gmail.com>
    MFC after:      1 week
    Sponsored by:   The FreeBSD Foundation

 lib/libthr/thread/thr_malloc.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
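
The recursion described in the commit message can be illustrated with a small sketch. This is not the libthr change itself: the names alloc_mtx, alloc_lock, lazy_init and prefork_sketch are hypothetical, and the sketch uses a per-thread depth counter on top of an ordinary pthread mutex rather than the umtx owner tracking that libthr works with. It only shows why allowing the same thread to re-enter the lock avoids a self-deadlock when a code path that already holds the lock triggers lazy initialization that wants it again.

```c
#include <pthread.h>

/* Hypothetical stand-in for the lock guarding an internal allocator. */
static pthread_mutex_t alloc_mtx = PTHREAD_MUTEX_INITIALIZER;

/* How many times the calling thread currently holds alloc_mtx. */
static _Thread_local unsigned int alloc_depth;

void
alloc_lock(void)
{
	/* Only the first acquisition by this thread touches the mutex. */
	if (alloc_depth++ == 0)
		pthread_mutex_lock(&alloc_mtx);
}

void
alloc_unlock(void)
{
	/* Only the outermost release gives the mutex back. */
	if (--alloc_depth == 0)
		pthread_mutex_unlock(&alloc_mtx);
}

unsigned int
alloc_lock_depth(void)
{
	return (alloc_depth);
}

/* Stand-in for lazy initialization that itself needs the allocator lock. */
static int
lazy_init(void)
{
	alloc_lock();
	int depth_seen = (int)alloc_lock_depth();
	alloc_unlock();
	return (depth_seen);
}

/*
 * Stand-in for the fork path: take the lock, then trigger lazy
 * initialization while still holding it. Returns the nesting depth
 * observed inside the nested call.
 */
int
prefork_sketch(void)
{
	alloc_lock();
	int depth_seen = lazy_init();
	alloc_unlock();
	return (depth_seen);
}
```

If alloc_lock() called pthread_mutex_lock() unconditionally, the nested acquisition in lazy_init() would typically block forever on a default mutex, which is the same shape of hang reported here. With the depth counter, prefork_sketch() returns 2 and the lock is fully released afterwards.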
Comment 5 shamaz.mazum 2021-01-12 14:47:04 UTC
Thanks, I confirm that the bug is fixed.
Comment 6 commit-hook freebsd_committer freebsd_triage 2021-01-18 05:41:29 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=20432a4fa157be15465e3aefc7977b494c812584

commit 20432a4fa157be15465e3aefc7977b494c812584
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2021-01-12 09:02:37 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2021-01-18 05:11:07 +0000

    libthr malloc: support recursion on thr_malloc_umtx.

    PR:     252579

    (cherry picked from commit 85d028223bc2768651f4d44881644ceb5dc2a664)

 lib/libthr/thread/thr_malloc.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)