237195 – pthread_mutex_unlock crash as unlocked mutex destroyed by signaled thread

Bug 237195 - pthread_mutex_unlock crash as unlocked mutex destroyed by signaled thread

Summary: pthread_mutex_unlock crash as unlocked mutex destroyed by signaled thread

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	threads (show other bugs)
Version:	12.0-RELEASE
Hardware:	amd64 Any

Importance:	--- Affects Many People
Assignee:	freebsd-threads (Nobody)

URL:
Keywords:	crash, needs-qa, patch

Depends on:
Blocks:

Reported:	2019-04-11 09:12 UTC by freebsd
Modified:	2024-10-03 07:28 UTC (History)
CC List:	5 users (show)

See Also:

Attachments
A simple program to reproduce the issue. (14.72 KB, text/plain) 2019-04-11 09:12 UTC, freebsd	no flags	Details
Do not access mutex memory after unlock. (1.02 KB, patch) 2019-04-11 09:37 UTC, Konstantin Belousov	no flags	Details \| Diff
various gdb dumps of the issue (6.72 KB, text/plain) 2019-04-18 14:20 UTC, freebsd	no flags	Details
Test program for pthread (6.03 KB, text/plain) 2022-08-06 20:48 UTC, longwitz	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description freebsd 2019-04-11 09:12:08 UTC

Created attachment 203579 [details]
A simple program to reproduce the issue.

I have this program where N threads are communicating with one thread through messages.
Some of theses messages are used to ensure synchronisation (e.g.: flush).
Each messsage contain a mutex and a condition.
When testing on FreeBSD 12.0, the program randomly crashed.
Every clue was saying the mutex was destroyed while being unlocked.
Given said program is pretty huge, I've written a small, simplified, code with similar behaviour and it crashed the same way.

#0  0x000000080065c93a in ?? () from /lib/libthr.so.3
#1  0x0000000000401dd8 in mcv_progress (mcv=0x801c1ac00) at mutex_test.c:365 (calling pthread_mutex_unlock)
#2  0x0000000000401f46 in read_thread (arg=0x7fffffffdac0) at mutex_test.c:412
#3  0x0000000800654776 in ?? () from /lib/libthr.so.3

I suspect what happens looks something like this:

client/writer                   server/reader

1) queues message               
2) locks message.mutex         1) dequeues message
                               2) process message
3) waits for message.condition
                               3) locks message.mutex
                               4) signals message.condition
4) unlocks message.mutex       5) unlocks message.mutex
5) destroy message memory      6) somehow, still in pthread_mutex_unlock after
                                  the client got the hand back, maybe accesses
6) frees message memory           mutex content and crashes as it has been
                                  trashed

The same program works fine under any load and parameters on several Linux versions and OSX 10.14.4.

If this hypothesis verifies, I suppose it only affects programs rapidly creating and destroying mutexes(+conditions?).

Please find attached the test program.

To reproduce the issue, I ctrl-C the program after 5 seconds if it didn't crash and restart it immediately.  Four or five tries of this are usually enough.

Comment 1 Konstantin Belousov freebsd_committer

2019-04-11 09:37:18 UTC

I was unable to reproduce the issue with your test program.

Regardless of it, there is indeed an access to the mutex memory after the unlock.  Please try the attached patch and report back.  Note that seemingly similar check in _mutex_leave_robust() only compares addresses.

Comment 2 Konstantin Belousov freebsd_committer

2019-04-11 09:37:54 UTC

Created attachment 203581 [details]
Do not access mutex memory after unlock.

Comment 3 commit-hook freebsd_committer

2019-04-12 17:28:09 UTC

A commit references this bug:

Author: kib
Date: Fri Apr 12 17:27:19 UTC 2019
New revision: 346158
URL: https://svnweb.freebsd.org/changeset/base/346158

Log:
  Do not access mutex memory after unlock.

  PR:	237195
  Reported by:	freebsd@hurrikhan.eu
  Sponsored by:	The FreeBSD Foundation
  MFC after:	1 week

Changes:
  head/lib/libthr/thread/thr_mutex.c

Comment 4 freebsd 2019-04-18 14:20:58 UTC

Created attachment 203764 [details]
various gdb dumps of the issue

Comment 5 freebsd 2019-04-18 14:21:17 UTC

Thanks for the patch.

I was able to apply and test it.
This specific issue appears to be fixed, but now the other side seems to have an issue.
I either get an error telling the mutex is invalid, either it crashes during the log.

I'm attaching the stack trace, disassembly and register dump right after this message.

Comment 6 commit-hook freebsd_committer

2019-04-19 12:30:46 UTC

A commit references this bug:

Author: kib
Date: Fri Apr 19 12:30:15 UTC 2019
New revision: 346371
URL: https://svnweb.freebsd.org/changeset/base/346371

Log:
  MFC r346158:
  Do not access mutex memory after unlock.

  PR:	237195

Changes:
_U  stable/12/
  stable/12/lib/libthr/thread/thr_mutex.c

Comment 7 commit-hook freebsd_committer

2019-04-19 12:31:53 UTC

A commit references this bug:

Author: kib
Date: Fri Apr 19 12:31:17 UTC 2019
New revision: 346372
URL: https://svnweb.freebsd.org/changeset/base/346372

Log:
  MFC r346158:
  Do not access mutex memory after unlock.

  PR:	237195

Changes:
_U  stable/11/
  stable/11/lib/libthr/thread/thr_mutex.c

Comment 8 Konstantin Belousov freebsd_committer

2019-04-19 14:27:05 UTC

(In reply to freebsd from comment #5)
It sounds like an access to the freed memory.

Anyway, it is somewhat labor-intensive to read disassembly to track your issue, and since you did not disassemble the whole functions bodies, it is impossible.
Compile libc/libthr/rtld with debugging enabled:
 make -C lib/libc DEBUG_FLAGS=-g all install
same for lib/libthr, libexec/rtld-elf.  Then reproduce the issue and show
the backtrace with debugging symbols, so that the source line numbers are
easily seen.

Comment 9 Peter Eriksson 2021-02-18 19:50:25 UTC

I'm pondering if I might be hitting the same or similar problem in pthread_mutex_unlock() on FreeBSD 12.2-RELEASE-p3 with Samba 4.13.4 when shutting down the smbd processes. Then I (sometimes) get core dumps on our more busy servers. Now, Samba is a big program so the problem might be elsewhere but anyway...

I've now disabled the use of mutexes in Samba's TDB database and will see if the coredumps stop...


Core was generated by `/liu/sbin/smbd --daemon --configfile=/liu/etc/samba/smb.conf'.
Program terminated with signal SIGABRT, Aborted.
#0  0x0000000804e82c2a in thr_kill () from /lib/libc.so.7
(gdb) bt
#0  0x0000000804e82c2a in thr_kill () from /lib/libc.so.7
#1  0x0000000804e81084 in raise () from /lib/libc.so.7
#2  0x0000000804df7279 in abort () from /lib/libc.so.7
#3  0x0000000802a6ff6d in dump_core () at ../../source3/lib/dumpcore.c:338
#4  0x0000000802a84b08 in smb_panic_s3 (why=0x7fffffffcec0 "Signal 10: Bus error") at ../../source3/lib/util.c:849
#5  0x00000008012a3a4f in smb_panic (why=0x7fffffffcec0 "Signal 10: Bus error") at ../../lib/util/fault.c:184
#6  0x00000008012a3605 in fault_report (sig=10) at ../../lib/util/fault.c:82
#7  0x00000008012a361a in sig_fault (sig=10) at ../../lib/util/fault.c:93
#8  0x0000000801516b70 in ?? () from /lib/libthr.so.3
#9  0x000000080151613f in ?? () from /lib/libthr.so.3
#10 <signal handler called>
#11 0x0000000801518e86 in pthread_mutex_unlock () from /lib/libthr.so.3
#12 0x00000008061d90ff in tdb_mutex_unlock (tdb=0x8106c0d20, rw=3, off=9364, len=1, pret=0x7fffffffda00) at ../../lib/tdb/common/mutex.c:347
#13 0x00000008061ccec8 in fcntl_unlock (tdb=0x8106c0d20, rw=3, off=9364, len=1) at ../../lib/tdb/common/lock.c:125
#14 0x00000008061cd0e6 in tdb_brunlock (tdb=0x8106c0d20, rw_type=3, offset=9364, len=1) at ../../lib/tdb/common/lock.c:234
#15 0x00000008061cda59 in tdb_nest_unlock (tdb=0x8106c0d20, offset=9364, ltype=3, mark_lock=false) at ../../lib/tdb/common/lock.c:551
#16 0x00000008061cdb20 in tdb_unlock (tdb=0x8106c0d20, list=2299, ltype=3) at ../../lib/tdb/common/lock.c:577
#17 0x00000008061ce44c in tdb_chainunlock (tdb=0x8106c0d20, key=...) at ../../lib/tdb/common/lock.c:889
#18 0x0000000805fbf4d3 in db_tdb_do_locked (db=0x810f73360, key=..., fn=0x802a538f1 <dbwrap_watched_do_locked_fn>, private_data=0x7fffffffdc70)
    at ../../lib/dbwrap/dbwrap_tdb.c:226
#19 0x0000000805fbb68c in dbwrap_do_locked (db=0x810f73360, key=..., fn=0x802a538f1 <dbwrap_watched_do_locked_fn>, private_data=0x7fffffffdc70)
    at ../../lib/dbwrap/dbwrap.c:566
#20 0x0000000802a53b95 in dbwrap_watched_do_locked (db=0x810f73460, key=..., fn=0x802a58cd4 <g_lock_unlock_fn>, private_data=0x7fffffffdd60)
    at ../../source3/lib/dbwrap/dbwrap_watch.c:419
#21 0x0000000805fbb68c in dbwrap_do_locked (db=0x810f73460, key=..., fn=0x802a58cd4 <g_lock_unlock_fn>, private_data=0x7fffffffdd60)
    at ../../lib/dbwrap/dbwrap.c:566
#22 0x0000000802a5912c in g_lock_unlock (ctx=0x81074ae60, key=...) at ../../source3/lib/g_lock.c:908
#23 0x00000008016950af in share_mode_lock_destructor (lck=0x8181cc0e0) at ../../source3/locking/share_mode_lock.c:978
#24 0x0000000802810831 in _tc_free_internal (tc=0x8181cc080, location=0x801965c78 "../../source3/smbd/close.c:1261") at ../../lib/talloc/talloc.c:1157
#25 0x0000000802810bad in _talloc_free_internal (ptr=0x8181cc0e0, location=0x801965c78 "../../source3/smbd/close.c:1261") at ../../lib/talloc/talloc.c:1247
#26 0x0000000802811f31 in _talloc_free (ptr=0x8181cc0e0, location=0x801965c78 "../../source3/smbd/close.c:1261") at ../../lib/talloc/talloc.c:1791
#27 0x000000080182a2e5 in close_directory (req=0x0, fsp=0x8107c8260, close_type=SHUTDOWN_CLOSE) at ../../source3/smbd/close.c:1261
#28 0x000000080182a4e0 in close_file (req=0x0, fsp=0x8107c8260, close_type=SHUTDOWN_CLOSE) at ../../source3/smbd/close.c:1316
#29 0x000000080179d737 in file_close_user (sconn=0x8106c7560, vuid=3338994192) at ../../source3/smbd/files.c:341
#30 0x000000080189bb46 in smbXsrv_session_logoff (session=0x8106c4580) at ../../source3/smbd/smbXsrv_session.c:1685
#31 0x000000080189a575 in smbXsrv_session_clear_and_logoff (session=0x8106c4580) at ../../source3/smbd/smbXsrv_session.c:1192
#32 0x000000080189c155 in smbXsrv_session_logoff_all_callback (local_rec=0x7fffffffe100, private_data=0x7fffffffe210)
    at ../../source3/smbd/smbXsrv_session.c:1834
#33 0x0000000805fbe65b in db_rbt_traverse_internal (db=0x81074b560, f=0x80189c074 <smbXsrv_session_logoff_all_callback>, private_data=0x7fffffffe210, 
    count=0x7fffffffe188, rw=true) at ../../lib/dbwrap/dbwrap_rbt.c:464
#34 0x0000000805fbe855 in db_rbt_traverse (db=0x81074b560, f=0x80189c074 <smbXsrv_session_logoff_all_callback>, private_data=0x7fffffffe210)
    at ../../lib/dbwrap/dbwrap_rbt.c:522
#35 0x0000000805fbb088 in dbwrap_traverse (db=0x81074b560, f=0x80189c074 <smbXsrv_session_logoff_all_callback>, private_data=0x7fffffffe210, 
    count=0x7fffffffe204) at ../../lib/dbwrap/dbwrap.c:394
#36 0x000000080189bf6d in smbXsrv_session_logoff_all (client=0x8106bf8e0) at ../../source3/smbd/smbXsrv_session.c:1788
#37 0x00000008018a407c in exit_server_common (how=SERVER_EXIT_NORMAL, reason=0x80196f23a "termination signal") at ../../source3/smbd/server_exit.c:149
#38 0x00000008018a44a5 in smbd_exit_server_cleanly (explanation=0x80196f23a "termination signal") at ../../source3/smbd/server_exit.c:237
#39 0x00000008043109fd in exit_server_cleanly (reason=0x80196f23a "termination signal") at ../../source3/lib/smbd_shim.c:121
#40 0x00000008018481fa in smbd_sig_term_handler (ev=0x8106c7060, se=0x8106bf820, signum=15, count=1, siginfo=0x0, private_data=0x8106c7560)
    at ../../source3/smbd/process.c:979
#41 0x00000008021fa40e in tevent_common_invoke_signal_handler (se=0x8106bf820, signum=15, count=1, siginfo=0x0, removed=0x0)
    at ../../lib/tevent/tevent_signal.c:370
#42 0x00000008021fa6b9 in tevent_common_check_signal (ev=0x8106c7060) at ../../lib/tevent/tevent_signal.c:468
#43 0x00000008021f793a in poll_event_loop_poll (ev=0x8106c7060, tvalp=0x7fffffffe450) at ../../lib/tevent/tevent_poll.c:488
#44 0x00000008021f810f in poll_event_loop_once (ev=0x8106c7060, location=0x8019721b8 "../../source3/smbd/process.c:4212")
    at ../../lib/tevent/tevent_poll.c:626
#45 0x00000008021f20b8 in _tevent_loop_once (ev=0x8106c7060, location=0x8019721b8 "../../source3/smbd/process.c:4212") at ../../lib/tevent/tevent.c:772
#46 0x00000008021f23dc in tevent_common_loop_wait (ev=0x8106c7060, location=0x8019721b8 "../../source3/smbd/process.c:4212") at ../../lib/tevent/tevent.c:895
#47 0x00000008021f247f in _tevent_loop_wait (ev=0x8106c7060, location=0x8019721b8 "../../source3/smbd/process.c:4212") at ../../lib/tevent/tevent.c:914
#48 0x00000008018503e2 in smbd_process (ev_ctx=0x8106c7060, msg_ctx=0x8106c0300, dce_ctx=0x8106b10c0, sock_fd=51, interactive=false)
    at ../../source3/smbd/process.c:4212
#49 0x000000000102f56c in smbd_accept_connection (ev=0x8106c7060, fde=0x8106c04c0, flags=1, private_data=0x811004040) at ../../source3/smbd/server.c:1014
#50 0x00000008021f2f2e in tevent_common_invoke_fd_handler (fde=0x8106c04c0, flags=1, removed=0x0) at ../../lib/tevent/tevent_fd.c:138
#51 0x00000008021f7e5e in poll_event_loop_poll (ev=0x8106c7060, tvalp=0x7fffffffe770) at ../../lib/tevent/tevent_poll.c:569
#52 0x00000008021f810f in poll_event_loop_once (ev=0x8106c7060, location=0x10395d0 "../../source3/smbd/server.c:1361") at ../../lib/tevent/tevent_poll.c:626
#53 0x00000008021f20b8 in _tevent_loop_once (ev=0x8106c7060, location=0x10395d0 "../../source3/smbd/server.c:1361") at ../../lib/tevent/tevent.c:772
#54 0x00000008021f23dc in tevent_common_loop_wait (ev=0x8106c7060, location=0x10395d0 "../../source3/smbd/server.c:1361") at ../../lib/tevent/tevent.c:895
#55 0x00000008021f247f in _tevent_loop_wait (ev=0x8106c7060, location=0x10395d0 "../../source3/smbd/server.c:1361") at ../../lib/tevent/tevent.c:914
#56 0x0000000001030331 in smbd_parent_loop (ev_ctx=0x8106c7060, parent=0x8106c0760) at ../../source3/smbd/server.c:1361
#57 0x00000000010326ec in main (argc=3, argv=0x7fffffffece8) at ../../source3/smbd/server.c:2214


gdb) frame 12
#12 0x00000008061d90ff in tdb_mutex_unlock (tdb=0x8106c0d20, rw=3, off=9364, len=1, pret=0x7fffffffda00) at ../../lib/tdb/common/mutex.c:347
347		ret = pthread_mutex_unlock(chain);

(gdb) list
342		if (!tdb_mutex_index(tdb, off, len, &idx)) {
343			return false;
344		}
345		chain = &m->hashchains[idx];
346	
347		ret = pthread_mutex_unlock(chain);
348		if (ret == 0) {
349			*pret = 0;
350			return true;
351		}

(gdb) print chain
$1 = (pthread_mutex_t *) 0x81378c898

(gdb) print *chain
$2 = (pthread_mutex_t) 0x8000000000000001


These mutexes are PTHREAD_MUTEX_ERRORCHECK, PTHREAD_PROCESS_SHARED & PTHREAD_MUTEX_ROBUST if I'm not reading the code wrong.

Comment 10 longwitz 2022-08-06 20:48:30 UTC

Created attachment 235730 [details]
Test program for pthread

mypshared.c test program

Comment 11 longwitz 2022-08-06 20:55:45 UTC

The stack trace with samba given in comment #9 is an other problem because tdb does never use the function pthread_mutex_destroy() for a mutex in
a shared memory, only in local memory for checking the features of the libthr library.

I hit the problem discussed in this bug report when I tried to migrate a stable running IMAP server (cyrus-imapd25 + db48) from FreeBSD 10 to Fre
eBSD 12 r371380. The configure step of db48 finds

in V10: checking for mutexes... x86_64/gcc-assembly
in V12: checking for mutexes... POSIX/pthreads/library/x86_64/gcc-assembly

The difference is because FreeBSD 12 supports PTHREAD_PROCESS_SHARED. It comes out the berkeley db48 software is not compatibel with the share mu
texes in our libthr. When an imap program (e.g. cyr_expire) is finished, db48 calls pthread_mutex_destroy() for all used mutexes. On the next try
of the IMAP server or of another imap program to get a lock with pthread_mutex_lock() the error EINVAL is returned and the environment of berkel
ey db is broken.

Newer versions of berkeley db have the same problem. I can simple avoid the problem with a configure statement for db48 to force only the test-an
d-set mutexes as in V10.

I attach a test program mypshared.c that demonstrates the problem I see with my IMAP server in V12. If mypshared is started on two or three termi
nals then the problem with pthread_mutex_destroy() can be studied. If in the program the variable always_destroy is set to 0, then all started pr
ograms run and stop without error, because only the last finishing program calls pthread_mutex_destroy(). With always_destroy = 1 there are two c
ases when the first program comes to his end:

1. No one of the other programs has the lock. In this case the ending program calls pthread_mutex_destroy() without error but the other programs
get EINVAL at the next call to pthread_mutex_lock(). That is what I saw running my IMAP server.

2. One of the other programs helds the lock. Then the ending program calling
pthread_mutex_destroy() aborts with a message like

Fatal error 'mutex 0x800698000 own 0x8001872a is on list 0x8006861a0 0x0'
at line 154 in file /usr/src/lib/libthr/thread/thr_mutex.c (errno = 0)

This is caused by the call of mutex_assert_not_owned() at line 481 because
_PTHREADS_INVARIANTS is set in the Makefile.

Both cases are bad, the application can not work.

With higher or lower values for UPPER_BOUND_(NO)LOCK in mypshared.c it is easy to get examples for both cases. My mutex is always on addr 0x80069
7000, but the error message gives 0x800698000. Instead of "0x8001872a" sometimes I see "own 0x1872a".

The "first mutex problem" was solved with the commit for revision 297185. A program of an application must not check if it is the first program u
sing a mutex, A simple call of pthread_mutex_init() is enough, the pthread library libthr is smart and handles this. Maybe the "last mutex proble
m" can be solved
too with a similar logic. pthread_mutex_destroy() should only destroy the mutex
if the caller is the last user of the mutex. And pthread_mutex_destroy() always
should return without errors like EBUSY. If a user is not the last one and
calls pthread_mutex_destroy() the mutex should not be affected and the users
state should be the same he had before calling pthread_mtx_init().

If we cannot solve the "last mutex problem" then there should be an option to
build the libthr without PTHREAD_PROCESS_SHARED.

Comment 12 Konstantin Belousov freebsd_committer

2022-08-07 00:03:18 UTC

Both locking a destroyed mutex, and destroying an owned mutex is declared
as undefined behavior by POSIX.  FreeBSD is conforming, the program has bugs.

Comment 13 longwitz 2022-08-08 21:34:27 UTC

OK, thank you for clarification. One bug in my test program was by mistake, an other for demonstration. With following patch the program works as I have intended:

--- mypshared.c.orig    2022-08-06 21:13:03.438699000 +0200
+++ mypshared.c 2022-08-08 10:24:26.021928000 +0200
@@ -34,7 +34,7 @@
 
 pthread_mutexattr_t attr, *attrp = NULL;
 int rc, first, mycounter;
-int init_only_on_first = 0, always_destroy = 1;
+int init_only_on_first = 0, always_destroy = 0;
 
 pthread_mutex_t magic, *magicp = &magic;
 
@@ -120,7 +120,7 @@
 
 void
 start_mutex() {
-   if ( init_only_on_first && first ) {
+   if( !init_only_on_first || first ) {
       rc = pthread_mutex_init((pthread_mutex_t *)&regionp->mutexp, attrp);
       printf("rc of pthread_mutex_init = %d\n", rc);
       if( rc ) exit( rc );

Comment 14 Mark Linimon freebsd_committer

2024-10-03 04:16:07 UTC

^Triage: clear stale flags.

To submitter: is this aging PR still relevant?