Bug 198452 - libthr/rtld deadlock
Summary: libthr/rtld deadlock
Status: Closed Works As Intended
Alias: None
Product: Base System
Classification: Unclassified
Component: threads
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-threads (Nobody)
 
Reported: 2015-03-09 14:00 UTC by David Shane Holden
Modified: 2015-03-09 19:00 UTC

Description David Shane Holden 2015-03-09 14:00:04 UTC
I've run into what appears to be a deadlock with the following code when linked
against libthr.  Prior to base r245630 this worked as intended.  Reverting
base r245630 and base r249425 in HEAD appears to fix it.


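#include <dlfcn.h>
#include <link.h>
#include <stddef.h>

/* Called by rtld once per loaded object; matches the dl_iterate_phdr(3) callback type. */
int callback(struct dl_phdr_info *info, size_t size, void *data) {
  dlopen(info->dlpi_name, RTLD_LAZY);  /* re-enters rtld from inside the iteration */
  return 0;
}

int main(void) {
  dl_iterate_phdr(callback, NULL);
  return 0;
}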


When linked with libthr the program deadlocks on the call to dlopen().

clang -g -lthr -o main main.c


The backtrace I'm seeing is:

(gdb) bt
#0  0x0000000800833ddc in _umtx_op_err () from /lib/libthr.so.3
#1  0x000000080082a29f in __thr_rwlock_wrlock (rwlock=0x800a42900, tsp=<optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:296
#2  0x00000008008317f3 in _thr_rwlock_wrlock (rwlock=<optimized out>, tsp=<optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.h:204
#3  _thr_rtld_wlock_acquire (lock=0x800a42900) at /usr/src/lib/libthr/thread/thr_rtld.c:141
#4  0x000000080060b5c2 in wlock_acquire (lock=0x80081f8a0 <rtld_locks>, lockstate=0x7fffffffe058) at /usr/src/libexec/rtld-elf/rtld_lock.c:213
#5  0x000000080060905f in dlopen_object () from /libexec/ld-elf.so.1
#6  0x00000008006060a3 in rtld_dlopen (name=0x7fffffffefcc "/root/deadlock/main", fd=-1, mode=<optimized out>) at /usr/src/libexec/rtld-elf/rtld.c:2955
#7  0x000000000040081f in callback (info=0x7fffffffeb08, size=64, data=0x0) at main.c:7
#8  0x0000000800606eeb in dl_iterate_phdr (callback=0x4007e0 <callback>, param=0x0) at /usr/src/libexec/rtld-elf/rtld.c:3416
#9  0x0000000000400858 in main () at main.c:12
Comment 1 Konstantin Belousov freebsd_committer freebsd_triage 2015-03-09 15:07:03 UTC
(In reply to dpejesh from comment #0)

dlopen()/dlclose() cannot work in the context of the dl_iterate_phdr(3) callback, if only because dl_iterate_phdr() walks over the list of loaded shared objects. In language terms, it is an iterator over a collection, and the collection must be stable. The callback API does not allow for the possibility of a restart.

That said, the callback is intended to perform lookups on a per-object basis, e.g. to enumerate exception tables, debug sections and the like.
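
As an illustration of that kind of read-only, per-object lookup, here is a minimal sketch of a callback that only inspects the program headers it is handed. The names load_stats, count_loads and collect_load_stats are hypothetical and exist only for this example; the _GNU_SOURCE define is an assumption needed to build on glibc and should be unnecessary on FreeBSD.

```c
#define _GNU_SOURCE /* exposes dl_iterate_phdr() on glibc; not needed on FreeBSD */
#include <link.h>
#include <stddef.h>

/* Hypothetical tally used only by this sketch. */
struct load_stats {
  int objects;       /* loaded objects visited */
  int load_segments; /* PT_LOAD program headers seen */
};

/* Read-only callback: inspects the headers and never calls back into rtld. */
static int count_loads(struct dl_phdr_info *info, size_t size, void *data) {
  struct load_stats *st = data;
  (void)size;
  st->objects++;
  for (int i = 0; i < info->dlpi_phnum; i++) {
    if (info->dlpi_phdr[i].p_type == PT_LOAD)
      st->load_segments++;
  }
  return 0; /* zero continues the iteration */
}

/* Walks all loaded objects once and returns the tally. */
static struct load_stats collect_load_stats(void) {
  struct load_stats st = {0, 0};
  dl_iterate_phdr(count_loads, &st);
  return st;
}
```

Calling collect_load_stats() from main() should report at least the main program itself. Nothing in the callback modifies the list of loaded objects, so the collection stays stable during the walk.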

The revisions you point out as causing the problem obviously mean that this never worked anyway.
Comment 2 David Shane Holden 2015-03-09 18:10:14 UTC
> dlopen()/dlclose() cannot work in the context of the dl_iterate_phdr(3)
> callback, if only because dl_iterate_phdr() walks over the list of loaded
> shared objects. In language terms, it is an iterator over a collection, and
> the collection must be stable. The callback API does not allow for the
> possibility of a restart.

That makes sense, but this used to work pre-10.0 and currently works on Linux.
It shouldn't be difficult to work around, but the new behavior seems like a
regression, which is why I opened the ticket.  I can understand dlopen() failing
if it has to load a new shared object and update its internal list, but if the
object is already loaded, all it should need to do is update the counters and
still return the handle, right?
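
One possible shape for such a workaround, as a sketch only: record the object names inside the callback and call dlopen() after dl_iterate_phdr() has returned. The names name_list, remember_name and reopen_loaded_objects, and the MAX_OBJS and MAX_NAME caps, are hypothetical and chosen for this example; mapping an empty dlpi_name to dlopen(NULL) is an assumption about platforms that report the main program with an empty name.

```c
#define _GNU_SOURCE /* exposes dl_iterate_phdr() on glibc; not needed on FreeBSD */
#include <dlfcn.h>
#include <link.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_OBJS 64  /* hypothetical cap, chosen for the sketch */
#define MAX_NAME 512 /* hypothetical cap on path length; longer names are truncated */

struct name_list {
  size_t count;
  char names[MAX_OBJS][MAX_NAME];
};

/* Only records the object's name; no dlopen()/dlclose() happens in here. */
static int remember_name(struct dl_phdr_info *info, size_t size, void *data) {
  struct name_list *list = data;
  (void)size;
  if (list->count == MAX_OBJS)
    return 1; /* a nonzero return stops the iteration */
  snprintf(list->names[list->count], MAX_NAME, "%s",
           info->dlpi_name != NULL ? info->dlpi_name : "");
  list->count++;
  return 0;
}

/* Stores up to max handles and returns how many dlopen() calls succeeded. */
static size_t reopen_loaded_objects(void **handles, size_t max) {
  struct name_list list;
  size_t opened = 0;

  list.count = 0;
  dl_iterate_phdr(remember_name, &list);

  /* The iteration has returned, so dlopen() is called outside the callback. */
  for (size_t i = 0; i < list.count && opened < max; i++) {
    /* Assumption: an empty name means the main program, i.e. dlopen(NULL). */
    const char *name = list.names[i][0] != '\0' ? list.names[i] : NULL;
    void *handle = dlopen(name, RTLD_LAZY);
    if (handle != NULL)
      handles[opened++] = handle;
  }
  return opened;
}
```

The caller owns the returned handles and would eventually dlclose() them. The list can be stale by the time dlopen() runs if another thread loads or unloads objects in between, so this is a trade-off rather than an equivalent of the original code.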

> That said, the callback is intended to perform lookups on a per-object
> basis, e.g. to enumerate exception tables, debug sections and the like.

> The revisions you point out as causing the problem obviously mean that this
> never worked anyway.

Not sure why you're saying that.  Before r245630 it worked, and reverting that
change on HEAD fixes it too.  It also works when not linked to libthr both
before and after r245630.
Comment 3 Konstantin Belousov freebsd_committer freebsd_triage 2015-03-09 19:00:25 UTC
(In reply to dpejesh from comment #2)
Technically, what happens is that dl_iterate_phdr() locks the rtld_bind_lock, and dlopen() needs the same lock.  This is reasonable, since the lock protects the structures that are used by the iterator and modified by dlopen(), for example the list of loaded objects.

Generally, we cannot upgrade a read lock to a write lock, since there may be other read-lock owners.  Nor can we drop the read lock to wait for the write lock, since that invalidates the previous iterations.

I said that your reference to the commits is strange because it is definitely not the commits you point out that introduced the behavior.  It has been there from the moment locks were added to dl_iterate_phdr().  You are tripping on the stated revisions because those revisions force the use of real locking when libthr.so is loaded into the process.  Single-threaded processes use a sort of fake locking in rtld.

I.e. the 'bug' has been there for a long time.

WRT Linux, their dl_iterate_phdr() is much less safe, so to speak, than ours.  For instance, their dl_iterate_phdr() should not be used from a signal handler context, which causes trouble for libunwind.  That is one of the reasons why I do not want to change the code to, e.g., drop the lock during the calls to the callback and use a sentinel to remember the iteration position.