Created attachment 203579 [details]
A simple program to reproduce the issue.

I have a program where N threads communicate with one thread through messages. Some of these messages are used to ensure synchronisation (e.g. flush). Each message contains a mutex and a condition variable. When testing on FreeBSD 12.0, the program randomly crashed. Every clue suggested the mutex was being destroyed while it was still being unlocked. Since said program is pretty large, I've written a small, simplified program with similar behaviour, and it crashed the same way.

#0 0x000000080065c93a in ?? () from /lib/libthr.so.3
#1 0x0000000000401dd8 in mcv_progress (mcv=0x801c1ac00) at mutex_test.c:365 (calling pthread_mutex_unlock)
#2 0x0000000000401f46 in read_thread (arg=0x7fffffffdac0) at mutex_test.c:412
#3 0x0000000800654776 in ?? () from /lib/libthr.so.3

I suspect what happens looks something like this:

client/writer                        server/reader
1) queues message
2) locks message.mutex               1) dequeues message
                                     2) processes message
3) waits for message.condition       3) locks message.mutex
                                     4) signals message.condition
4) unlocks message.mutex             5) unlocks message.mutex
5) destroys message state            6) somehow, still in pthread_mutex_unlock after the
6) frees message memory                 client got the hand back, maybe accesses mutex
                                        content and crashes as it has been trashed

The same program works fine under any load and parameters on several Linux versions and on OS X 10.14.4. If this hypothesis is verified, I suppose it only affects programs that rapidly create and destroy mutexes (and possibly condition variables).

Please find attached the test program. To reproduce the issue, I press Ctrl-C after 5 seconds if the program didn't crash and restart it immediately. Four or five tries of this are usually enough.
I was unable to reproduce the issue with your test program. Regardless, there is indeed an access to the mutex memory after the unlock. Please try the attached patch and report back. Note that the seemingly similar check in _mutex_leave_robust() only compares addresses.
Created attachment 203581 [details]
Do not access mutex memory after unlock.
A commit references this bug:

Author: kib
Date: Fri Apr 12 17:27:19 UTC 2019
New revision: 346158
URL: https://svnweb.freebsd.org/changeset/base/346158

Log:
  Do not access mutex memory after unlock.

  PR:           237195
  Reported by:  freebsd@hurrikhan.eu
  Sponsored by: The FreeBSD Foundation
  MFC after:    1 week

Changes:
  head/lib/libthr/thread/thr_mutex.c
Created attachment 203764 [details]
Various gdb dumps of the issue.
Thanks for the patch. I was able to apply and test it. This specific issue appears to be fixed, but now the other side seems to have a problem: I either get an error saying the mutex is invalid, or it crashes while logging the error. I'm attaching the stack trace, disassembly and register dump right after this message.
A commit references this bug:

Author: kib
Date: Fri Apr 19 12:30:15 UTC 2019
New revision: 346371
URL: https://svnweb.freebsd.org/changeset/base/346371

Log:
  MFC r346158:
  Do not access mutex memory after unlock.

  PR:  237195

Changes:
  _U  stable/12/
  stable/12/lib/libthr/thread/thr_mutex.c
A commit references this bug:

Author: kib
Date: Fri Apr 19 12:31:17 UTC 2019
New revision: 346372
URL: https://svnweb.freebsd.org/changeset/base/346372

Log:
  MFC r346158:
  Do not access mutex memory after unlock.

  PR:  237195

Changes:
  _U  stable/11/
  stable/11/lib/libthr/thread/thr_mutex.c
(In reply to freebsd from comment #5)

It sounds like an access to freed memory. Anyway, it is somewhat labor-intensive to track your issue by reading disassembly, and since you did not disassemble the whole function bodies, it is impossible. Compile libc/libthr/rtld with debugging enabled:

make -C lib/libc DEBUG_FLAGS=-g all install

and do the same for lib/libthr and libexec/rtld-elf. Then reproduce the issue and show the backtrace with debugging symbols, so that the source line numbers are easily seen.
I'm wondering whether I might be hitting the same or a similar problem in pthread_mutex_unlock() on FreeBSD 12.2-RELEASE-p3 with Samba 4.13.4 when shutting down the smbd processes: I (sometimes) get core dumps on our busier servers. Now, Samba is a big program, so the problem might be elsewhere, but anyway... I've now disabled the use of mutexes in Samba's TDB database and will see if the core dumps stop.

Core was generated by `/liu/sbin/smbd --daemon --configfile=/liu/etc/samba/smb.conf'.
Program terminated with signal SIGABRT, Aborted.
#0  0x0000000804e82c2a in thr_kill () from /lib/libc.so.7
(gdb) bt
#0  0x0000000804e82c2a in thr_kill () from /lib/libc.so.7
#1  0x0000000804e81084 in raise () from /lib/libc.so.7
#2  0x0000000804df7279 in abort () from /lib/libc.so.7
#3  0x0000000802a6ff6d in dump_core () at ../../source3/lib/dumpcore.c:338
#4  0x0000000802a84b08 in smb_panic_s3 (why=0x7fffffffcec0 "Signal 10: Bus error") at ../../source3/lib/util.c:849
#5  0x00000008012a3a4f in smb_panic (why=0x7fffffffcec0 "Signal 10: Bus error") at ../../lib/util/fault.c:184
#6  0x00000008012a3605 in fault_report (sig=10) at ../../lib/util/fault.c:82
#7  0x00000008012a361a in sig_fault (sig=10) at ../../lib/util/fault.c:93
#8  0x0000000801516b70 in ?? () from /lib/libthr.so.3
#9  0x000000080151613f in ?? () from /lib/libthr.so.3
#10 <signal handler called>
#11 0x0000000801518e86 in pthread_mutex_unlock () from /lib/libthr.so.3
#12 0x00000008061d90ff in tdb_mutex_unlock (tdb=0x8106c0d20, rw=3, off=9364, len=1, pret=0x7fffffffda00) at ../../lib/tdb/common/mutex.c:347
#13 0x00000008061ccec8 in fcntl_unlock (tdb=0x8106c0d20, rw=3, off=9364, len=1) at ../../lib/tdb/common/lock.c:125
#14 0x00000008061cd0e6 in tdb_brunlock (tdb=0x8106c0d20, rw_type=3, offset=9364, len=1) at ../../lib/tdb/common/lock.c:234
#15 0x00000008061cda59 in tdb_nest_unlock (tdb=0x8106c0d20, offset=9364, ltype=3, mark_lock=false) at ../../lib/tdb/common/lock.c:551
#16 0x00000008061cdb20 in tdb_unlock (tdb=0x8106c0d20, list=2299, ltype=3) at ../../lib/tdb/common/lock.c:577
#17 0x00000008061ce44c in tdb_chainunlock (tdb=0x8106c0d20, key=...) at ../../lib/tdb/common/lock.c:889
#18 0x0000000805fbf4d3 in db_tdb_do_locked (db=0x810f73360, key=..., fn=0x802a538f1 <dbwrap_watched_do_locked_fn>, private_data=0x7fffffffdc70) at ../../lib/dbwrap/dbwrap_tdb.c:226
#19 0x0000000805fbb68c in dbwrap_do_locked (db=0x810f73360, key=..., fn=0x802a538f1 <dbwrap_watched_do_locked_fn>, private_data=0x7fffffffdc70) at ../../lib/dbwrap/dbwrap.c:566
#20 0x0000000802a53b95 in dbwrap_watched_do_locked (db=0x810f73460, key=..., fn=0x802a58cd4 <g_lock_unlock_fn>, private_data=0x7fffffffdd60) at ../../source3/lib/dbwrap/dbwrap_watch.c:419
#21 0x0000000805fbb68c in dbwrap_do_locked (db=0x810f73460, key=..., fn=0x802a58cd4 <g_lock_unlock_fn>, private_data=0x7fffffffdd60) at ../../lib/dbwrap/dbwrap.c:566
#22 0x0000000802a5912c in g_lock_unlock (ctx=0x81074ae60, key=...) at ../../source3/lib/g_lock.c:908
#23 0x00000008016950af in share_mode_lock_destructor (lck=0x8181cc0e0) at ../../source3/locking/share_mode_lock.c:978
#24 0x0000000802810831 in _tc_free_internal (tc=0x8181cc080, location=0x801965c78 "../../source3/smbd/close.c:1261") at ../../lib/talloc/talloc.c:1157
#25 0x0000000802810bad in _talloc_free_internal (ptr=0x8181cc0e0, location=0x801965c78 "../../source3/smbd/close.c:1261") at ../../lib/talloc/talloc.c:1247
#26 0x0000000802811f31 in _talloc_free (ptr=0x8181cc0e0, location=0x801965c78 "../../source3/smbd/close.c:1261") at ../../lib/talloc/talloc.c:1791
#27 0x000000080182a2e5 in close_directory (req=0x0, fsp=0x8107c8260, close_type=SHUTDOWN_CLOSE) at ../../source3/smbd/close.c:1261
#28 0x000000080182a4e0 in close_file (req=0x0, fsp=0x8107c8260, close_type=SHUTDOWN_CLOSE) at ../../source3/smbd/close.c:1316
#29 0x000000080179d737 in file_close_user (sconn=0x8106c7560, vuid=3338994192) at ../../source3/smbd/files.c:341
#30 0x000000080189bb46 in smbXsrv_session_logoff (session=0x8106c4580) at ../../source3/smbd/smbXsrv_session.c:1685
#31 0x000000080189a575 in smbXsrv_session_clear_and_logoff (session=0x8106c4580) at ../../source3/smbd/smbXsrv_session.c:1192
#32 0x000000080189c155 in smbXsrv_session_logoff_all_callback (local_rec=0x7fffffffe100, private_data=0x7fffffffe210) at ../../source3/smbd/smbXsrv_session.c:1834
#33 0x0000000805fbe65b in db_rbt_traverse_internal (db=0x81074b560, f=0x80189c074 <smbXsrv_session_logoff_all_callback>, private_data=0x7fffffffe210, count=0x7fffffffe188, rw=true) at ../../lib/dbwrap/dbwrap_rbt.c:464
#34 0x0000000805fbe855 in db_rbt_traverse (db=0x81074b560, f=0x80189c074 <smbXsrv_session_logoff_all_callback>, private_data=0x7fffffffe210) at ../../lib/dbwrap/dbwrap_rbt.c:522
#35 0x0000000805fbb088 in dbwrap_traverse (db=0x81074b560, f=0x80189c074 <smbXsrv_session_logoff_all_callback>, private_data=0x7fffffffe210, count=0x7fffffffe204) at ../../lib/dbwrap/dbwrap.c:394
#36 0x000000080189bf6d in smbXsrv_session_logoff_all (client=0x8106bf8e0) at ../../source3/smbd/smbXsrv_session.c:1788
#37 0x00000008018a407c in exit_server_common (how=SERVER_EXIT_NORMAL, reason=0x80196f23a "termination signal") at ../../source3/smbd/server_exit.c:149
#38 0x00000008018a44a5 in smbd_exit_server_cleanly (explanation=0x80196f23a "termination signal") at ../../source3/smbd/server_exit.c:237
#39 0x00000008043109fd in exit_server_cleanly (reason=0x80196f23a "termination signal") at ../../source3/lib/smbd_shim.c:121
#40 0x00000008018481fa in smbd_sig_term_handler (ev=0x8106c7060, se=0x8106bf820, signum=15, count=1, siginfo=0x0, private_data=0x8106c7560) at ../../source3/smbd/process.c:979
#41 0x00000008021fa40e in tevent_common_invoke_signal_handler (se=0x8106bf820, signum=15, count=1, siginfo=0x0, removed=0x0) at ../../lib/tevent/tevent_signal.c:370
#42 0x00000008021fa6b9 in tevent_common_check_signal (ev=0x8106c7060) at ../../lib/tevent/tevent_signal.c:468
#43 0x00000008021f793a in poll_event_loop_poll (ev=0x8106c7060, tvalp=0x7fffffffe450) at ../../lib/tevent/tevent_poll.c:488
#44 0x00000008021f810f in poll_event_loop_once (ev=0x8106c7060, location=0x8019721b8 "../../source3/smbd/process.c:4212") at ../../lib/tevent/tevent_poll.c:626
#45 0x00000008021f20b8 in _tevent_loop_once (ev=0x8106c7060, location=0x8019721b8 "../../source3/smbd/process.c:4212") at ../../lib/tevent/tevent.c:772
#46 0x00000008021f23dc in tevent_common_loop_wait (ev=0x8106c7060, location=0x8019721b8 "../../source3/smbd/process.c:4212") at ../../lib/tevent/tevent.c:895
#47 0x00000008021f247f in _tevent_loop_wait (ev=0x8106c7060, location=0x8019721b8 "../../source3/smbd/process.c:4212") at ../../lib/tevent/tevent.c:914
#48 0x00000008018503e2 in smbd_process (ev_ctx=0x8106c7060, msg_ctx=0x8106c0300, dce_ctx=0x8106b10c0, sock_fd=51, interactive=false) at ../../source3/smbd/process.c:4212
#49 0x000000000102f56c in smbd_accept_connection (ev=0x8106c7060, fde=0x8106c04c0, flags=1, private_data=0x811004040) at ../../source3/smbd/server.c:1014
#50 0x00000008021f2f2e in tevent_common_invoke_fd_handler (fde=0x8106c04c0, flags=1, removed=0x0) at ../../lib/tevent/tevent_fd.c:138
#51 0x00000008021f7e5e in poll_event_loop_poll (ev=0x8106c7060, tvalp=0x7fffffffe770) at ../../lib/tevent/tevent_poll.c:569
#52 0x00000008021f810f in poll_event_loop_once (ev=0x8106c7060, location=0x10395d0 "../../source3/smbd/server.c:1361") at ../../lib/tevent/tevent_poll.c:626
#53 0x00000008021f20b8 in _tevent_loop_once (ev=0x8106c7060, location=0x10395d0 "../../source3/smbd/server.c:1361") at ../../lib/tevent/tevent.c:772
#54 0x00000008021f23dc in tevent_common_loop_wait (ev=0x8106c7060, location=0x10395d0 "../../source3/smbd/server.c:1361") at ../../lib/tevent/tevent.c:895
#55 0x00000008021f247f in _tevent_loop_wait (ev=0x8106c7060, location=0x10395d0 "../../source3/smbd/server.c:1361") at ../../lib/tevent/tevent.c:914
#56 0x0000000001030331 in smbd_parent_loop (ev_ctx=0x8106c7060, parent=0x8106c0760) at ../../source3/smbd/server.c:1361
#57 0x00000000010326ec in main (argc=3, argv=0x7fffffffece8) at ../../source3/smbd/server.c:2214

(gdb) frame 12
#12 0x00000008061d90ff in tdb_mutex_unlock (tdb=0x8106c0d20, rw=3, off=9364, len=1, pret=0x7fffffffda00) at ../../lib/tdb/common/mutex.c:347
347             ret = pthread_mutex_unlock(chain);
(gdb) list
342             if (!tdb_mutex_index(tdb, off, len, &idx)) {
343                     return false;
344             }
345             chain = &m->hashchains[idx];
346
347             ret = pthread_mutex_unlock(chain);
348             if (ret == 0) {
349                     *pret = 0;
350                     return true;
351             }
(gdb) print chain
$1 = (pthread_mutex_t *) 0x81378c898
(gdb) print *chain
$2 = (pthread_mutex_t) 0x8000000000000001

These mutexes are PTHREAD_MUTEX_ERRORCHECK, PTHREAD_PROCESS_SHARED & PTHREAD_MUTEX_ROBUST, if I'm not reading the code wrong.
Created attachment 235730 [details]
mypshared.c — test program for pthread process-shared mutexes.
The stack trace with Samba given in comment #9 is a different problem, because tdb never uses pthread_mutex_destroy() on a mutex in shared memory, only in local memory for checking the features of the libthr library.

I hit the problem discussed in this bug report when I tried to migrate a stably running IMAP server (cyrus-imapd25 + db48) from FreeBSD 10 to FreeBSD 12 r371380. The configure step of db48 finds:

in V10: checking for mutexes... x86_64/gcc-assembly
in V12: checking for mutexes... POSIX/pthreads/library/x86_64/gcc-assembly

The difference comes from FreeBSD 12 supporting PTHREAD_PROCESS_SHARED. It turns out the Berkeley DB 4.8 software is not compatible with the shared mutexes in our libthr. When an imap program (e.g. cyr_expire) finishes, db48 calls pthread_mutex_destroy() for all used mutexes. On the next attempt of the IMAP server, or of another imap program, to get a lock with pthread_mutex_lock(), the error EINVAL is returned and the Berkeley DB environment is broken. Newer versions of Berkeley DB have the same problem. I can simply avoid the problem with a configure statement for db48 that forces only the test-and-set mutexes, as in V10.

I attach a test program, mypshared.c, that demonstrates the problem I see with my IMAP server in V12. If mypshared is started on two or three terminals, the problem with pthread_mutex_destroy() can be studied. If the variable always_destroy in the program is set to 0, then all started programs run and stop without error, because only the last finishing program calls pthread_mutex_destroy(). With always_destroy = 1 there are two cases when the first program comes to its end:

1. None of the other programs holds the lock. In this case the ending program calls pthread_mutex_destroy() without error, but the other programs get EINVAL on the next call to pthread_mutex_lock(). That is what I saw running my IMAP server.

2. One of the other programs holds the lock. Then the ending program calling pthread_mutex_destroy() aborts with a message like

Fatal error 'mutex 0x800698000 own 0x8001872a is on list 0x8006861a0 0x0' at line 154 in file /usr/src/lib/libthr/thread/thr_mutex.c (errno = 0)

This is caused by the call of mutex_assert_not_owned() at line 481, because _PTHREADS_INVARIANTS is set in the Makefile.

Both cases are bad; the application cannot work. With higher or lower values for UPPER_BOUND_(NO)LOCK in mypshared.c it is easy to produce examples of both cases. My mutex is always at address 0x800697000, but the error message gives 0x800698000. Instead of "0x8001872a" I sometimes see "own 0x1872a".

The "first mutex problem" was solved with the commit for revision 297185: a program of an application must not check whether it is the first program using a mutex, a simple call of pthread_mutex_init() is enough, the pthread library libthr is smart and handles this. Maybe the "last mutex problem" can be solved too with a similar logic. pthread_mutex_destroy() should only destroy the mutex if the caller is its last user, and pthread_mutex_destroy() should always return without errors like EBUSY. If a user is not the last one and calls pthread_mutex_destroy(), the mutex should not be affected, and the user's state should be the same as before calling pthread_mutex_init(). If we cannot solve the "last mutex problem", then there should be an option to build libthr without PTHREAD_PROCESS_SHARED.
Both locking a destroyed mutex and destroying an owned mutex are declared undefined behavior by POSIX. FreeBSD is conforming; the program has bugs.
OK, thank you for the clarification. One bug in my test program was a mistake, the other was for demonstration. With the following patch the program works as I intended:

--- mypshared.c.orig	2022-08-06 21:13:03.438699000 +0200
+++ mypshared.c	2022-08-08 10:24:26.021928000 +0200
@@ -34,7 +34,7 @@
 pthread_mutexattr_t attr, *attrp = NULL;
 int rc, first, mycounter;
-int init_only_on_first = 0, always_destroy = 1;
+int init_only_on_first = 0, always_destroy = 0;
 pthread_mutex_t magic, *magicp = &magic;
@@ -120,7 +120,7 @@
 void
 start_mutex()
 {
-	if ( init_only_on_first && first ) {
+	if( !init_only_on_first || first ) {
 		rc = pthread_mutex_init((pthread_mutex_t *)&regionp->mutexp, attrp);
 		printf("rc of pthread_mutex_init = %d\n", rc);
 		if( rc ) exit( rc );