Created attachment 203579 [details] A simple program to reproduce the issue. I have this program where N threads are communicating with one thread through messages. Some of theses messages are used to ensure synchronisation (e.g.: flush). Each messsage contain a mutex and a condition. When testing on FreeBSD 12.0, the program randomly crashed. Every clue was saying the mutex was destroyed while being unlocked. Given said program is pretty huge, I've written a small, simplified, code with similar behaviour and it crashed the same way. #0 0x000000080065c93a in ?? () from /lib/libthr.so.3 #1 0x0000000000401dd8 in mcv_progress (mcv=0x801c1ac00) at mutex_test.c:365 (calling pthread_mutex_unlock) #2 0x0000000000401f46 in read_thread (arg=0x7fffffffdac0) at mutex_test.c:412 #3 0x0000000800654776 in ?? () from /lib/libthr.so.3 I suspect what happens looks something like this: client/writer server/reader 1) queues message 2) locks message.mutex 1) dequeues message 2) process message 3) waits for message.condition 3) locks message.mutex 4) signals message.condition 4) unlocks message.mutex 5) unlocks message.mutex 5) destroy message memory 6) somehow, still in pthread_mutex_unlock after the client got the hand back, maybe accesses 6) frees message memory mutex content and crashes as it has been trashed The same program works fine under any load and parameters on several Linux versions and OSX 10.14.4. If this hypothesis verifies, I suppose it only affects programs rapidly creating and destroying mutexes(+conditions?). Please find attached the test program. To reproduce the issue, I ctrl-C the program after 5 seconds if it didn't crash and restart it immediately. Four or five tries of this are usually enough.
I was unable to reproduce the issue with your test program. Regardless of it, there is indeed an access to the mutex memory after the unlock. Please try the attached patch and report back. Note that seemingly similar check in _mutex_leave_robust() only compares addresses.
Created attachment 203581 [details] Do not access mutex memory after unlock.
A commit references this bug: Author: kib Date: Fri Apr 12 17:27:19 UTC 2019 New revision: 346158 URL: https://svnweb.freebsd.org/changeset/base/346158 Log: Do not access mutex memory after unlock. PR: 237195 Reported by: freebsd@hurrikhan.eu Sponsored by: The FreeBSD Foundation MFC after: 1 week Changes: head/lib/libthr/thread/thr_mutex.c
Created attachment 203764 [details] various gdb dumps of the issue
Thanks for the patch. I was able to apply and test it. This specific issue appears to be fixed, but now the other side seems to have an issue. I either get an error telling the mutex is invalid, either it crashes during the log. I'm attaching the stack trace, disassembly and register dump right after this message.
A commit references this bug: Author: kib Date: Fri Apr 19 12:30:15 UTC 2019 New revision: 346371 URL: https://svnweb.freebsd.org/changeset/base/346371 Log: MFC r346158: Do not access mutex memory after unlock. PR: 237195 Changes: _U stable/12/ stable/12/lib/libthr/thread/thr_mutex.c
A commit references this bug: Author: kib Date: Fri Apr 19 12:31:17 UTC 2019 New revision: 346372 URL: https://svnweb.freebsd.org/changeset/base/346372 Log: MFC r346158: Do not access mutex memory after unlock. PR: 237195 Changes: _U stable/11/ stable/11/lib/libthr/thread/thr_mutex.c
(In reply to freebsd from comment #5) It sounds like an access to the freed memory. Anyway, it is somewhat labor-intensive to read disassembly to track your issue, and since you did not disassemble the whole functions bodies, it is impossible. Compile libc/libthr/rtld with debugging enabled: make -C lib/libc DEBUG_FLAGS=-g all install same for lib/libthr, libexec/rtld-elf. Then reproduce the issue and show the backtrace with debugging symbols, so that the source line numbers are easily seen.
I'm pondering if I might be hitting the same or similar problem in pthread_mutex_unlock() on FreeBSD 12.2-RELEASE-p3 with Samba 4.13.4 when shutting down the smbd processes. Then I (sometimes) get core dumps on our more busy servers. Now, Samba is a big program so the problem might be elsewhere but anyway... I've now disabled the use of mutexes in Samba's TDB database and will see if the coredumps stop... Core was generated by `/liu/sbin/smbd --daemon --configfile=/liu/etc/samba/smb.conf'. Program terminated with signal SIGABRT, Aborted. #0 0x0000000804e82c2a in thr_kill () from /lib/libc.so.7 (gdb) bt #0 0x0000000804e82c2a in thr_kill () from /lib/libc.so.7 #1 0x0000000804e81084 in raise () from /lib/libc.so.7 #2 0x0000000804df7279 in abort () from /lib/libc.so.7 #3 0x0000000802a6ff6d in dump_core () at ../../source3/lib/dumpcore.c:338 #4 0x0000000802a84b08 in smb_panic_s3 (why=0x7fffffffcec0 "Signal 10: Bus error") at ../../source3/lib/util.c:849 #5 0x00000008012a3a4f in smb_panic (why=0x7fffffffcec0 "Signal 10: Bus error") at ../../lib/util/fault.c:184 #6 0x00000008012a3605 in fault_report (sig=10) at ../../lib/util/fault.c:82 #7 0x00000008012a361a in sig_fault (sig=10) at ../../lib/util/fault.c:93 #8 0x0000000801516b70 in ?? () from /lib/libthr.so.3 #9 0x000000080151613f in ?? () from /lib/libthr.so.3 #10 <signal handler called> #11 0x0000000801518e86 in pthread_mutex_unlock () from /lib/libthr.so.3 #12 0x00000008061d90ff in tdb_mutex_unlock (tdb=0x8106c0d20, rw=3, off=9364, len=1, pret=0x7fffffffda00) at ../../lib/tdb/common/mutex.c:347 #13 0x00000008061ccec8 in fcntl_unlock (tdb=0x8106c0d20, rw=3, off=9364, len=1) at ../../lib/tdb/common/lock.c:125 #14 0x00000008061cd0e6 in tdb_brunlock (tdb=0x8106c0d20, rw_type=3, offset=9364, len=1) at ../../lib/tdb/common/lock.c:234 #15 0x00000008061cda59 in tdb_nest_unlock (tdb=0x8106c0d20, offset=9364, ltype=3, mark_lock=false) at ../../lib/tdb/common/lock.c:551 #16 0x00000008061cdb20 in tdb_unlock (tdb=0x8106c0d20, list=2299, ltype=3) at ../../lib/tdb/common/lock.c:577 #17 0x00000008061ce44c in tdb_chainunlock (tdb=0x8106c0d20, key=...) at ../../lib/tdb/common/lock.c:889 #18 0x0000000805fbf4d3 in db_tdb_do_locked (db=0x810f73360, key=..., fn=0x802a538f1 <dbwrap_watched_do_locked_fn>, private_data=0x7fffffffdc70) at ../../lib/dbwrap/dbwrap_tdb.c:226 #19 0x0000000805fbb68c in dbwrap_do_locked (db=0x810f73360, key=..., fn=0x802a538f1 <dbwrap_watched_do_locked_fn>, private_data=0x7fffffffdc70) at ../../lib/dbwrap/dbwrap.c:566 #20 0x0000000802a53b95 in dbwrap_watched_do_locked (db=0x810f73460, key=..., fn=0x802a58cd4 <g_lock_unlock_fn>, private_data=0x7fffffffdd60) at ../../source3/lib/dbwrap/dbwrap_watch.c:419 #21 0x0000000805fbb68c in dbwrap_do_locked (db=0x810f73460, key=..., fn=0x802a58cd4 <g_lock_unlock_fn>, private_data=0x7fffffffdd60) at ../../lib/dbwrap/dbwrap.c:566 #22 0x0000000802a5912c in g_lock_unlock (ctx=0x81074ae60, key=...) at ../../source3/lib/g_lock.c:908 #23 0x00000008016950af in share_mode_lock_destructor (lck=0x8181cc0e0) at ../../source3/locking/share_mode_lock.c:978 #24 0x0000000802810831 in _tc_free_internal (tc=0x8181cc080, location=0x801965c78 "../../source3/smbd/close.c:1261") at ../../lib/talloc/talloc.c:1157 #25 0x0000000802810bad in _talloc_free_internal (ptr=0x8181cc0e0, location=0x801965c78 "../../source3/smbd/close.c:1261") at ../../lib/talloc/talloc.c:1247 #26 0x0000000802811f31 in _talloc_free (ptr=0x8181cc0e0, location=0x801965c78 "../../source3/smbd/close.c:1261") at ../../lib/talloc/talloc.c:1791 #27 0x000000080182a2e5 in close_directory (req=0x0, fsp=0x8107c8260, close_type=SHUTDOWN_CLOSE) at ../../source3/smbd/close.c:1261 #28 0x000000080182a4e0 in close_file (req=0x0, fsp=0x8107c8260, close_type=SHUTDOWN_CLOSE) at ../../source3/smbd/close.c:1316 #29 0x000000080179d737 in file_close_user (sconn=0x8106c7560, vuid=3338994192) at ../../source3/smbd/files.c:341 #30 0x000000080189bb46 in smbXsrv_session_logoff (session=0x8106c4580) at ../../source3/smbd/smbXsrv_session.c:1685 #31 0x000000080189a575 in smbXsrv_session_clear_and_logoff (session=0x8106c4580) at ../../source3/smbd/smbXsrv_session.c:1192 #32 0x000000080189c155 in smbXsrv_session_logoff_all_callback (local_rec=0x7fffffffe100, private_data=0x7fffffffe210) at ../../source3/smbd/smbXsrv_session.c:1834 #33 0x0000000805fbe65b in db_rbt_traverse_internal (db=0x81074b560, f=0x80189c074 <smbXsrv_session_logoff_all_callback>, private_data=0x7fffffffe210, count=0x7fffffffe188, rw=true) at ../../lib/dbwrap/dbwrap_rbt.c:464 #34 0x0000000805fbe855 in db_rbt_traverse (db=0x81074b560, f=0x80189c074 <smbXsrv_session_logoff_all_callback>, private_data=0x7fffffffe210) at ../../lib/dbwrap/dbwrap_rbt.c:522 #35 0x0000000805fbb088 in dbwrap_traverse (db=0x81074b560, f=0x80189c074 <smbXsrv_session_logoff_all_callback>, private_data=0x7fffffffe210, count=0x7fffffffe204) at ../../lib/dbwrap/dbwrap.c:394 #36 0x000000080189bf6d in smbXsrv_session_logoff_all (client=0x8106bf8e0) at ../../source3/smbd/smbXsrv_session.c:1788 #37 0x00000008018a407c in exit_server_common (how=SERVER_EXIT_NORMAL, reason=0x80196f23a "termination signal") at ../../source3/smbd/server_exit.c:149 #38 0x00000008018a44a5 in smbd_exit_server_cleanly (explanation=0x80196f23a "termination signal") at ../../source3/smbd/server_exit.c:237 #39 0x00000008043109fd in exit_server_cleanly (reason=0x80196f23a "termination signal") at ../../source3/lib/smbd_shim.c:121 #40 0x00000008018481fa in smbd_sig_term_handler (ev=0x8106c7060, se=0x8106bf820, signum=15, count=1, siginfo=0x0, private_data=0x8106c7560) at ../../source3/smbd/process.c:979 #41 0x00000008021fa40e in tevent_common_invoke_signal_handler (se=0x8106bf820, signum=15, count=1, siginfo=0x0, removed=0x0) at ../../lib/tevent/tevent_signal.c:370 #42 0x00000008021fa6b9 in tevent_common_check_signal (ev=0x8106c7060) at ../../lib/tevent/tevent_signal.c:468 #43 0x00000008021f793a in poll_event_loop_poll (ev=0x8106c7060, tvalp=0x7fffffffe450) at ../../lib/tevent/tevent_poll.c:488 #44 0x00000008021f810f in poll_event_loop_once (ev=0x8106c7060, location=0x8019721b8 "../../source3/smbd/process.c:4212") at ../../lib/tevent/tevent_poll.c:626 #45 0x00000008021f20b8 in _tevent_loop_once (ev=0x8106c7060, location=0x8019721b8 "../../source3/smbd/process.c:4212") at ../../lib/tevent/tevent.c:772 #46 0x00000008021f23dc in tevent_common_loop_wait (ev=0x8106c7060, location=0x8019721b8 "../../source3/smbd/process.c:4212") at ../../lib/tevent/tevent.c:895 #47 0x00000008021f247f in _tevent_loop_wait (ev=0x8106c7060, location=0x8019721b8 "../../source3/smbd/process.c:4212") at ../../lib/tevent/tevent.c:914 #48 0x00000008018503e2 in smbd_process (ev_ctx=0x8106c7060, msg_ctx=0x8106c0300, dce_ctx=0x8106b10c0, sock_fd=51, interactive=false) at ../../source3/smbd/process.c:4212 #49 0x000000000102f56c in smbd_accept_connection (ev=0x8106c7060, fde=0x8106c04c0, flags=1, private_data=0x811004040) at ../../source3/smbd/server.c:1014 #50 0x00000008021f2f2e in tevent_common_invoke_fd_handler (fde=0x8106c04c0, flags=1, removed=0x0) at ../../lib/tevent/tevent_fd.c:138 #51 0x00000008021f7e5e in poll_event_loop_poll (ev=0x8106c7060, tvalp=0x7fffffffe770) at ../../lib/tevent/tevent_poll.c:569 #52 0x00000008021f810f in poll_event_loop_once (ev=0x8106c7060, location=0x10395d0 "../../source3/smbd/server.c:1361") at ../../lib/tevent/tevent_poll.c:626 #53 0x00000008021f20b8 in _tevent_loop_once (ev=0x8106c7060, location=0x10395d0 "../../source3/smbd/server.c:1361") at ../../lib/tevent/tevent.c:772 #54 0x00000008021f23dc in tevent_common_loop_wait (ev=0x8106c7060, location=0x10395d0 "../../source3/smbd/server.c:1361") at ../../lib/tevent/tevent.c:895 #55 0x00000008021f247f in _tevent_loop_wait (ev=0x8106c7060, location=0x10395d0 "../../source3/smbd/server.c:1361") at ../../lib/tevent/tevent.c:914 #56 0x0000000001030331 in smbd_parent_loop (ev_ctx=0x8106c7060, parent=0x8106c0760) at ../../source3/smbd/server.c:1361 #57 0x00000000010326ec in main (argc=3, argv=0x7fffffffece8) at ../../source3/smbd/server.c:2214 gdb) frame 12 #12 0x00000008061d90ff in tdb_mutex_unlock (tdb=0x8106c0d20, rw=3, off=9364, len=1, pret=0x7fffffffda00) at ../../lib/tdb/common/mutex.c:347 347 ret = pthread_mutex_unlock(chain); (gdb) list 342 if (!tdb_mutex_index(tdb, off, len, &idx)) { 343 return false; 344 } 345 chain = &m->hashchains[idx]; 346 347 ret = pthread_mutex_unlock(chain); 348 if (ret == 0) { 349 *pret = 0; 350 return true; 351 } (gdb) print chain $1 = (pthread_mutex_t *) 0x81378c898 (gdb) print *chain $2 = (pthread_mutex_t) 0x8000000000000001 These mutexes are PTHREAD_MUTEX_ERRORCHECK, PTHREAD_PROCESS_SHARED & PTHREAD_MUTEX_ROBUST if I'm not reading the code wrong.