Bug 224362 - 'mutex is on list' assertion failed on pthread_mutex_lock/pthread_mutex_unlock (up to FreeBSD 10.x)
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: threads
Version: 9.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-threads (Nobody)
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2017-12-15 09:36 UTC by Andrey Sploshnov
Modified: 2019-09-03 23:25 UTC

See Also:


Attachments
Patch for libthr (2.38 KB, patch)
2017-12-15 09:36 UTC, Andrey Sploshnov

Description Andrey Sploshnov 2017-12-15 09:36:52 UTC
Created attachment 188857
Patch for libthr

I get spontaneous core dumps under heavy mutex use. My application has many threads and locks.
I am quite sure that the locking/unlocking logic is correct: it performs well on other thread implementations (such as NPTL on Linux).
But on FreeBSD it is terminated by an assertion after running for a long time (4 to 8 hours):
Abort trap (6) with "Fatal error 'mutex is on list'"

Backtrace looks like this:
Program terminated with signal SIGABRT, Aborted.
(gdb) bt
#0  0x000000080161cd2a in thr_kill () from /lib/libc.so.7
#1  0x000000080161cc96 in raise () from /lib/libc.so.7
#2  0x000000080161b489 in abort () from /lib/libc.so.7
#3  0x000000080109e63a in ?? () from /lib/libthr.so.3
#4  0x0000000801099e66 in ?? () from /lib/libthr.so.3
#5  0x000000080242d301 in Lock (this=0x80387ad90)

(gdb) p *(struct pthread_mutex*) 0x80ff60c40
$4 = {m_lock = {m_owner = 100925, m_flags = 0, m_ceilings = {0, 0}, m_spare = {0, 0, 0, 0}}, m_flags = 2, m_owner = 0x8038cc400, m_count = 0, m_spinloops = 0, m_yieldloops = 0, m_qe = {tqe_next = 0x0, tqe_prev = 0x0}}

The "Lock" function is the last call made from my application; it calls pthread_mutex_lock().
In this case it just locks a mutex that was created some time earlier in the same thread.

Sometimes the assertion message is 'mutex is not on list' instead.

It looks to me as if libthr maintains a list of acquired mutexes for each thread, and that the checks
performed before locking/unlocking fail for an unknown reason.
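
To illustrate the kind of bookkeeping described above, here is a minimal sketch of a per-thread list of owned mutexes with the two consistency checks. It is not libthr's source: every sketch_* name is hypothetical, the queue uses the TAILQ macros from <sys/queue.h>, and clearing the linkage pointers after removal is an assumption made so that "not on list" can be tested by looking at the entry alone.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical, simplified mutex: only what the invariant needs. */
struct sketch_mutex {
    void *owner;                      /* owning thread, NULL when unlocked */
    TAILQ_ENTRY(sketch_mutex) qe;     /* linkage in the owner's queue */
};

TAILQ_HEAD(sketch_mutex_queue, sketch_mutex);

/* Hypothetical per-thread state: the queue of mutexes this thread holds. */
struct sketch_thread {
    struct sketch_mutex_queue owned;
};

static void sketch_mutex_init(struct sketch_mutex *m)
{
    assert(m != NULL);
    m->owner = NULL;
    m->qe.tqe_next = NULL;
    m->qe.tqe_prev = NULL;
}

/* An entry linked into a TAILQ always has a non-NULL tqe_prev. */
static int sketch_on_list(const struct sketch_mutex *m)
{
    return m->qe.tqe_prev != NULL || m->qe.tqe_next != NULL;
}

/* Returns 0 on success, -1 if busy, -2 where a "mutex is on list"
 * style check would fire. */
static int sketch_lock(struct sketch_thread *t, struct sketch_mutex *m)
{
    if (m->owner != NULL)
        return -1;                    /* a real lock would block here */
    if (sketch_on_list(m))
        return -2;
    m->owner = t;
    TAILQ_INSERT_TAIL(&t->owned, m, qe);
    return 0;
}

/* Returns 0 on success, -1 if not the owner, -2 where a
 * "mutex is not on list" style check would fire. */
static int sketch_unlock(struct sketch_thread *t, struct sketch_mutex *m)
{
    if (m->owner != t)
        return -1;
    if (!sketch_on_list(m))
        return -2;
    TAILQ_REMOVE(&t->owned, m, qe);
    m->qe.tqe_next = NULL;            /* assumption: linkage is cleared */
    m->qe.tqe_prev = NULL;
    m->owner = NULL;
    return 0;
}
```

In this sketch, a mutex whose owner is set while both linkage pointers are NULL (the combination visible in the gdb dump above) is exactly the state in which sketch_unlock() returns -2.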

I have a patch for libthr's check logic. Its only effect is that it takes somewhat longer before the assertion fails again.
Comment 1 Konstantin Belousov freebsd_committer 2017-12-15 22:12:31 UTC
(In reply to Andrey Sploshnov from comment #0)
Are you really reporting this against 9.2-RELEASE, as indicated in the bug's metadata? Reproduce the problem on HEAD or 11.1.

The mutex queue is maintained per thread, in other words it is thread-local, and since only a locked mutex is put on or removed from the queue, your naive spinlock does not add any protection.

The state of the mutex you printed is indeed strange: it has a lock owner but it is not on the queue, as the assertion claims.
Comment 2 Andrey Sploshnov 2017-12-18 09:39:36 UTC
(In reply to Konstantin Belousov from comment #1)
I have the reproduction on versions 9.x and 10.x.
11.x looks unaffected.
Comment 3 Konstantin Belousov freebsd_committer 2017-12-18 13:15:16 UTC
(In reply to Andrey Sploshnov from comment #2)
If you provide an isolated test case, I might look at stable/10.  Still, in HEAD and stable/11 the code that maintains mutex state was significantly reworked due to the addition of process-shared and robust mutexes.
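
For reference, an isolated test case of the kind requested here could start from a sketch like the one below. It is only an assumption about what such a test might look like, not the reporter's application: all sketch_* names, the number of mutexes, and the thread limit are made up for illustration. Several threads repeatedly lock and unlock a small set of shared mutexes, and any error returned by the pthread calls is reported to the caller.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SKETCH_NMUTEX     8    /* hypothetical number of shared mutexes */
#define SKETCH_MAXTHREADS 64   /* hypothetical upper bound on threads */

static pthread_mutex_t sketch_locks[SKETCH_NMUTEX];
static long sketch_counters[SKETCH_NMUTEX];

struct sketch_arg {
    int id;
    long iters;
};

/* Each worker walks over the mutexes, bumping the counter each one guards. */
static void *sketch_worker(void *p)
{
    struct sketch_arg *a = p;

    assert(a != NULL);
    for (long i = 0; i < a->iters; i++) {
        size_t k = (size_t)(a->id + i) % SKETCH_NMUTEX;

        if (pthread_mutex_lock(&sketch_locks[k]) != 0)
            return (void *)1;
        sketch_counters[k]++;
        if (pthread_mutex_unlock(&sketch_locks[k]) != 0)
            return (void *)1;
    }
    return NULL;
}

/* Returns the sum of all counters, or -1 if any pthread call failed. */
static long sketch_run(int nthreads, long iters)
{
    pthread_t tids[SKETCH_MAXTHREADS];
    struct sketch_arg args[SKETCH_MAXTHREADS];
    long total = 0;
    int started = 0, failed = 0;

    if (nthreads < 1 || nthreads > SKETCH_MAXTHREADS)
        return -1;
    for (int k = 0; k < SKETCH_NMUTEX; k++) {
        if (pthread_mutex_init(&sketch_locks[k], NULL) != 0)
            return -1;
        sketch_counters[k] = 0;
    }
    for (; started < nthreads; started++) {
        args[started].id = started;
        args[started].iters = iters;
        if (pthread_create(&tids[started], NULL, sketch_worker,
            &args[started]) != 0) {
            failed = 1;
            break;
        }
    }
    for (int t = 0; t < started; t++) {
        void *res = NULL;

        if (pthread_join(tids[t], &res) != 0 || res != NULL)
            failed = 1;
    }
    for (int k = 0; k < SKETCH_NMUTEX; k++) {
        total += sketch_counters[k];
        pthread_mutex_destroy(&sketch_locks[k]);
    }
    return failed ? -1 : total;
}
```

Since the original failure reportedly took 4 to 8 hours to appear, a test like this would probably need to run in a long loop (and possibly with many more threads and mutexes) before it says anything about the assertion.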
Comment 4 Eric van Gyzen freebsd_committer 2018-04-13 14:44:55 UTC
I have seen this failed assertion a few times, too.  Does this application fork?  If so, it's possible that r329254 is related.  This is just a guess, but it's worth merging that to the 10.x code where you reproduced this and trying again.
Comment 5 Eric van Gyzen freebsd_committer 2019-09-03 23:25:32 UTC
Closing as fixed by any of several changes in this area in more recent releases (e.g. r329254).