Created attachment 188857 [details]
Patch for libthr

I get spontaneous core dumps under heavy mutex use. My application has many threads and locks, and I am quite sure that the locking/unlocking logic is correct: it performs well on other thread implementations (such as NPTL on Linux). On FreeBSD, however, it is terminated by an assertion after running for a long time (4 to 8 hours):

Abort trap (6) with "Fatal error 'mutex is on list'"

The backtrace looks like this:

Program terminated with signal SIGABRT, Aborted.
(gdb) bt
#0 0x000000080161cd2a in thr_kill () from /lib/libc.so.7
#1 0x000000080161cc96 in raise () from /lib/libc.so.7
#2 0x000000080161b489 in abort () from /lib/libc.so.7
#3 0x000000080109e63a in ?? () from /lib/libthr.so.3
#4 0x0000000801099e66 in ?? () from /lib/libthr.so.3
#5 0x000000080242d301 in Lock (this=0x80387ad90)

(gdb) p *(struct pthread_mutex*) 0x80ff60c40
$4 = {m_lock = {m_owner = 100925, m_flags = 0, m_ceilings = {0, 0}, m_spare = {0, 0, 0, 0}}, m_flags = 2, m_owner = 0x8038cc400, m_count = 0, m_spinloops = 0, m_yieldloops = 0, m_qe = {tqe_next = 0x0, tqe_prev = 0x0}}

The "Lock" function is the last call from my application; it calls pthread_mutex_lock(). In this case it just locks a mutex that was created some time earlier in the same thread.

Sometimes the assertion is 'mutex is not on list' instead.

It looks to me as if libthr maintains a list of acquired mutexes for each thread, and the checks made before locking/unlocking fail for an unknown reason. I have a patch for the libthr checking logic. Its only effect is that it takes somewhat longer before the assertion fails again.
(In reply to Andrey Sploshnov from comment #0)
Are you really reporting this against the 9.2 release, as indicated in the bug's metadata? Please reproduce the problem on HEAD or 11.1.

The mutex queue is maintained per thread; in other words, it is thread-local. Since only a locked mutex is put on or removed from the queue, your naive spinlock does not add any protection.

The state of the mutex you printed is indeed strange: it has a lock owner but is not on the queue, as the assertion claims.
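To make the per-thread queue and the two assertion messages concrete, here is a minimal single-threaded model in C. It is only a sketch: every name in it (model_mutex, model_thread, model_lock, model_unlock, the MODEL_* codes) is hypothetical and is not libthr's actual code, and it assumes the checks simply look at the TAILQ linkage fields, as the m_qe dump in comment #0 suggests. Instead of aborting, the model returns a status code, so the "on list" and "not on list" conditions can be observed directly.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical, simplified stand-ins for libthr's internal structures. */
struct model_mutex {
    void *owner;                     /* owning thread, NULL when unlocked */
    TAILQ_ENTRY(model_mutex) qe;     /* linkage on the owner's queue */
};

TAILQ_HEAD(model_mutex_queue, model_mutex);

struct model_thread {
    struct model_mutex_queue owned;  /* mutexes currently held by this thread */
};

enum model_status { MODEL_OK = 0, MODEL_ON_LIST = 1, MODEL_NOT_ON_LIST = 2 };

static void model_thread_init(struct model_thread *t)
{
    assert(t != NULL);
    TAILQ_INIT(&t->owned);
}

static void model_mutex_init(struct model_mutex *m)
{
    assert(m != NULL);
    m->owner = NULL;
    m->qe.tqe_next = NULL;
    m->qe.tqe_prev = NULL;
}

/* Record ownership; refuse if the linkage says the mutex is already queued. */
static enum model_status model_lock(struct model_thread *t, struct model_mutex *m)
{
    assert(t != NULL && m != NULL);
    if (m->qe.tqe_next != NULL || m->qe.tqe_prev != NULL)
        return MODEL_ON_LIST;        /* analogue of 'mutex is on list' */
    m->owner = t;
    TAILQ_INSERT_TAIL(&t->owned, m, qe);
    return MODEL_OK;
}

/* Drop ownership; refuse if the linkage says the mutex is not queued. */
static enum model_status model_unlock(struct model_thread *t, struct model_mutex *m)
{
    assert(t != NULL && m != NULL);
    if (m->qe.tqe_prev == NULL)
        return MODEL_NOT_ON_LIST;    /* analogue of 'mutex is not on list' */
    TAILQ_REMOVE(&t->owned, m, qe);
    m->qe.tqe_next = NULL;           /* clear linkage so later checks are meaningful */
    m->qe.tqe_prev = NULL;
    m->owner = NULL;
    return MODEL_OK;
}
```

In this model, the state printed in comment #0 (an owner recorded, but tqe_next and tqe_prev both NULL) is one that model_unlock() would reject with MODEL_NOT_ON_LIST. Because each queue is touched only by the thread that owns it, the model needs no extra locking around the queue operations.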
(In reply to Konstantin Belousov from comment #1)
I can reproduce it on 9.x and 10.x. 11.x looks unaffected.
(In reply to Andrey Sploshnov from comment #2)
If you provide an isolated test case, I might look at stable/10. Still, the code that maintains mutex state was significantly reworked in HEAD and stable/11 when process-shared and robust mutexes were added.
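As a starting point for such an isolated test case, a stress harness might look roughly like the sketch below. It is not the reporter's application and is not known to trigger the assertion; all names (stress_run, stress_worker_main, STRESS_MAX_THREADS) and the workload shape are assumptions made for illustration. Several threads repeatedly lock and unlock a shared pool of default mutexes, and the function reports whether every pthread call succeeded and the protected counters add up.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define STRESS_MAX_THREADS 64        /* arbitrary cap for this sketch */

struct stress_shared {
    pthread_mutex_t *locks;          /* pool of mutexes shared by all workers */
    long *counters;                  /* counters[k] is protected by locks[k] */
    int nlocks;
    long iterations;
};

struct stress_worker {
    struct stress_shared *shared;
    int id;
    int errors;                      /* failed pthread_mutex_* calls */
};

static void *stress_worker_main(void *arg)
{
    struct stress_worker *w = arg;
    struct stress_shared *s = w->shared;

    assert(s->nlocks > 0);
    for (long i = 0; i < s->iterations; i++) {
        int k = (int)((i + w->id) % s->nlocks);
        if (pthread_mutex_lock(&s->locks[k]) != 0) {
            w->errors++;
            continue;
        }
        s->counters[k]++;
        if (pthread_mutex_unlock(&s->locks[k]) != 0)
            w->errors++;
    }
    return NULL;
}

/* Returns 0 if every call succeeded and the counters add up, -1 otherwise. */
static int stress_run(int nthreads, int nlocks, long iterations)
{
    pthread_t tids[STRESS_MAX_THREADS];
    struct stress_worker workers[STRESS_MAX_THREADS];
    struct stress_shared s;
    long total = 0;
    int k, started, failed = 0;

    if (nthreads < 1 || nthreads > STRESS_MAX_THREADS || nlocks < 1 || iterations < 0)
        return -1;
    s.nlocks = nlocks;
    s.iterations = iterations;
    s.locks = calloc((size_t)nlocks, sizeof(*s.locks));
    s.counters = calloc((size_t)nlocks, sizeof(*s.counters));
    if (s.locks == NULL || s.counters == NULL) {
        free(s.locks);
        free(s.counters);
        return -1;
    }
    for (k = 0; k < nlocks; k++) {
        if (pthread_mutex_init(&s.locks[k], NULL) != 0) {
            while (k-- > 0)
                pthread_mutex_destroy(&s.locks[k]);
            free(s.locks);
            free(s.counters);
            return -1;
        }
    }
    for (started = 0; started < nthreads; started++) {
        workers[started].shared = &s;
        workers[started].id = started;
        workers[started].errors = 0;
        if (pthread_create(&tids[started], NULL, stress_worker_main,
            &workers[started]) != 0) {
            failed = 1;
            break;
        }
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        if (workers[i].errors != 0)
            failed = 1;
    }
    for (k = 0; k < nlocks; k++) {
        total += s.counters[k];
        pthread_mutex_destroy(&s.locks[k]);
    }
    free(s.locks);
    free(s.counters);
    if (total != (long)nthreads * iterations)
        failed = 1;
    return failed ? -1 : 0;
}
```

A caller would invoke something like stress_run(16, 32, 100000000L) from main() and build with cc -pthread. Since the original failure took hours to appear, a real reproducer would probably need a long run and a workload closer to the application's, including creating mutexes at run time and, if the application does so, calling fork().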
I have seen this failed assertion a few times, too. Does this application fork? If so, it's possible that r329254 is related. This is just a guess, but it's worth merging that to the 10.x code where you reproduced this and trying again.
Closing as fixed by any of several changes in this area in more recent releases (e.g. r329254).