Created attachment 156674 [details] patch While investigating the message coming from "Thread %p has exited with %d leftover", I found that _pthread_setspecific is entered twice with the same TCB, _get_curthread() returned the same thread pointer. Second invocation was incurred by the google tcmalloc library (devel/google-perftools port). > (gdb) bt > #0 yuri_enter (serno=60) at /usr/src/lib/libthr/thread/thr_spec.c:60 > #1 0x0000000804e23e89 in _pthread_setspecific (userkey=3, value=0x740a080) at /usr/src/lib/libthr/thread/thr_spec.c:228 > #2 0x000000080408b95c in ?? () from /usr/local/lib/qt4/libQtCore.so.4 > #3 0x0000000804e1da3a in thread_start (curthread=0x73f6800) at /usr/src/lib/libthr/thread/thr_create.c:288 > #4 0x0000000000000000 in ?? () > Backtrace stopped: Cannot access memory at address 0x7fffdf9fb000 > (gdb) bt > #0 yuri_enter (serno=61) at /usr/src/lib/libthr/thread/thr_spec.c:60 > #1 0x0000000804e23e89 in _pthread_setspecific (userkey=2, value=0x7a9520) at /usr/src/lib/libthr/thread/thr_spec.c:228 > #2 0x00000008008ff4de in perftools_pthread_setspecific(int, void*) () from /usr/local/lib/libtcmalloc.so.4 > #3 0x00000008008f2322 in tcmalloc::ThreadCache::CreateCacheIfNecessary() () from /usr/local/lib/libtcmalloc.so.4 > #4 0x00000008008ea666 in ?? () from /usr/local/lib/libtcmalloc.so.4 > #5 0x00000008008ffd92 in tc_calloc () from /usr/local/lib/libtcmalloc.so.4 > #6 0x0000000804e243a1 in pthread_key_allocate_data () at /usr/src/lib/libthr/thread/thr_spec.c:212 > #7 0x0000000804e23f19 in _pthread_setspecific (userkey=3, value=0x740a080) at /usr/src/lib/libthr/thread/thr_spec.c:231 > #8 0x000000080408b95c in ?? () from /usr/local/lib/qt4/libQtCore.so.4 > #9 0x0000000804e1da3a in thread_start (curthread=0x73f6800) at /usr/src/lib/libthr/thread/thr_create.c:288 > #10 0x0000000000000000 in ?? () This causes some memory loss, and stray message about leftover TLS data. The attached patch corrects for such potential problem by double-checking the pthread's "specific" array after allocating it, and frees the newly allocated copy if it happens to be already allocated. Please keep the comment, because I think these lines may appear very confusing and wrong to people without the knowledge of this specific situation.
Please MFC to 10.X
Thread cannot race with itself. pthread_setspecific() is only touching the current thread 'specific' member, so what race you are talking about ?
Why do you close the bug without understanding? You are trying to blow me off?
"Race condition" is maybe a wrong term, but it is close. Like I said in the description, stray messages are printed in the presence of an external malloc library: "Thread %p has exited with %d leftover". And memory for per-thread key-based "specific" array gets lost.
(In reply to yuri from comment #4) Your patch makes absolutely no sense, you just repeat the code from _pthread_setspecific() into pthread_key_allocate_data(). Checking the same thread-local condition twice cannot help, as well as doing it n times, n > 1. There is nothing wrong with _pthread_setspecific being called several times from the same thread, for the same key.
There is the recursion going on. calloc calls _pthread_setspecific again, and after its return it is already allocated.
(In reply to yuri from comment #6) What does prevent the recursion from going infinite ? And your patch still does not make any sense, since assignment to the curthread->specific only happens after the calloc() returned. This sounds like the application bug anyway, because malloc/calloc interposers must handle the system libraries requirements.
> What does prevent the recursion from going infinite? This question is besides the point of this PR. Recursion happens anyway, with or without this patch. If the malloc library will cause infinite recursion, then this would be a bug. But this doesn't actually happen. > This sounds like the application bug anyway, because malloc/calloc interposers must handle the system libraries requirements. Could you please specify which application are you talking about? And what are the requirements, and where are they specified?
You keep saying that this patch makes no sense. And it does fix the error message that used to be printed before all the time, and also it does fix the memory leak. How can one reconcile your opinion that "it doesn't make sense" with the obvious positive practical consequences that have pretty good theoretical explanation?
(In reply to yuri from comment #8) Requirements for the malloc interposers are established by the system source code. Third-party malloc library has to adhere to the constraints put by the system, if the intent is to replace the system component. I am saying that your statements do not make sense exactly because they do not make sense. The mere fact that a patch results in not printing a message, does not make the patch valid. That said, I think it would be more useful to ease the life of malloc(3) interposers by avoiding the calloc call in the pthread_setspecific(3). The size of the thread-specific is around 1K per thread, which is almost free for the userspace to carry together with the struct thread. Non-trivial threading programs would use explicit TLS for sure. Try the patch I am attaching. I change specific to be a part of the struct thread, and also I cleaned style and removed false or trivial comments.
Created attachment 156695 [details] Allocate specific together with the struct thread, avoids calloc call.
Now it prints this warning: > Warning: old _pthread_cleanup_push was called, stack unwinding is disabled.
Also sizeof(pthread_specific_elem)=12 * PTHREAD_KEYS_MAX=256 = 3kB. Every thread will always have 3kB attached. I am sure there are a lot of threads which don't use pthread_setspecific at all, and an extra 3K will be a burden for them.
(In reply to yuri from comment #13) Around a 2-3KB of data for the thread is not a burden, comparing with the >=1M userspace thread stack allocated for each thread. Also, I believe that any non-trivial threading program does use thread-specific data, not least because libc uses per-thread keys in nss and locale code. But I can suggest to either use the standard libc malloc(3), which does not have the problem, or direct mmap(2) calls. The later would result in some overhead due to unused part of the page for specific array, but this could not be important, since keys are used rarely, according to you previous reply. I tend to think that the mmap(2) is preferred, but __malloc() use would also be fine.
Created attachment 156696 [details] Use mmap for allocator
The mmap patch works, but with his patch every process will have the tiny separate extra mmapped block per thread. Currently every new thread creates one extra mmapped block, and this will be the second extra mmapped block. This doesn't look nice. Currently every new thread gets its own mmaped block of the size 0x20000 = 131kB. So maybe 2.3% addition isn't that bad, I don't know. Especially that this memory isn't the real memory allocation unless actually used. Is it possible to merge them? Adding this 3kB "specific" array block to that 0x2000-sized per-thread block?
(In reply to yuri from comment #16) I am not sure what 'mapped block of size 128K you mean. A new thread gets allocated the stack, which has 1M or 2M size, and struct pthread of size 640B on 32bit arches. I already proposed to put the keys table into the struct pthread, which you did not liked. I still think that making struct pthread and keys live together is the best route if malloc() has to be avoided from pthread_setspecific(). I am also fine with the mmap() patch, despite the fact that some portion of the whole page is wasted. I do not want to write a special allocator for this single purpose.
Oh no, I like it. Go ahead commit it if you think it is good.
Ideally everything in libthr that is fixed-sized should be allocated with the single mmap, I think. But this is apparently not the case. In my process, I see that 0x20000-sized block is always mmapped per-thread. Must be something google tcmalloc is doing. I first thought this is what libthr was doing. But for the time being, we need the solution, so commit your patch making struct pthread and keys live together.
A commit references this bug: Author: kib Date: Fri May 15 08:40:18 UTC 2015 New revision: 282948 URL: https://svnweb.freebsd.org/changeset/base/282948 Log: Some third-party malloc(3) implementations use pthread_setspecific(3) to handle per-thread information. Since our pthread_setspecific() implementation calls calloc(3) to allocate per-thread specific data storage, things get complicated. Switch the allocator to use bare mmap(2). There is some loss of the allocated page, since e.g. on amd64, PTHREAD_KEYS_MAX * sizeof(struct pthread_specific_elem) is 3K (it actually spans whole page due to padding), but I believe it is more acceptable than additional code for specialized allocator(). The alternatives would either to make the specific data array be part of the struct thread, or use internal bindings to call the libc malloc, avoiding interposing. Also do the style pass over the thr_spec.c, esp. simplify the conditionals nesting by returning early when an error detected. Remove trivial comments. Found by: yuri@rawbw.com PR: 200138 Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Changes: head/lib/libthr/thread/thr_spec.c
There is a commit referencing this PR, but it's still not closed and has been inactive for some time. Closing the PR as fixed but feel free to re-open it if the issue hasn't been completely resolved. Thanks