I am running 10.3-STABLE FreeBSD 10.3-STABLE #3 r301633 und try to run bconsole from the port bacula5-client (P4Q4). Often after some bconsole commands the bconsole process can not die when entering "quit". With kgdb running on /dev/mem I can see the process: (kgdb) info threads ... 191 Thread 100880 (PID=66009: bconsole) sched_switch (td=0xfffff8008d64d4b0, newtd=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1945 ... (kgdb) thread 191 [Switching to thread 191 (Thread 100880)]#0 sched_switch (td=0xfffff8008d64d4b0, newtd=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1945 1945 cpuid = PCPU_GET(cpuid); (kgdb) bt #0 sched_switch (td=0xfffff8008d64d4b0, newtd=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1945 #1 0xffffffff80664d41 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:491 #2 0xffffffff806a87ab in sleepq_catch_signals (wchan=0xfffff8012a366f00, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:426 #3 0xffffffff806a865f in sleepq_wait_sig (wchan=0x0, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:632 #4 0xffffffff8066474d in _sleep (ident=<value optimized out>, lock=<value optimized out>, priority=<value optimized out>, wmesg=<value optimized out>, sbt=<value optimized out>, pr=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/kern_synch.c:253 #5 0xffffffff80678d55 in umtxq_sleep (uq=0xfffff8012a366f00, wmesg=0xffffffff80a2583c "uwrlck", abstime=0x0) at /usr/src/sys/kern/kern_umtx.c:780 #6 0xffffffff8067c2ef in do_rw_wrlock (td=0xfffff8008d64d4b0, rwlock=0x802270880, timeout=<value optimized out>) at /usr/src/sys/kern/kern_umtx.c:3109 #7 0xffffffff8067ff77 in __umtx_op_rw_wrlock (td=<value optimized out>, uap=<value optimized out>) at /usr/src/sys/kern/kern_umtx.c:3657 #8 0xffffffff808edddf in amd64_syscall (td=0xfffff8008d64d4b0, traced=0) at subr_syscall.c:141 #9 0xffffffff808d29ab in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:398 kgdb) p *(struct thread *)0xfffff8008d64d4b0 $1 = {td_lock = 0xffffffff80e3d220, td_proc = 0xfffff800059a29e0, td_plist = {tqe_next = 0x0, tqe_prev = 0xfffff800059a29f0}, td_runq = {tqe_next = 0x0, tqe_prev = 0xffffffff80e07008}, td_slpq = {tqe_next = 0x0, tqe_prev = 0xfffff800100d5980}, td_lockq = {tqe_next = 0x0, tqe_prev = 0xfffffe0231e18438}, td_hash = {le_next = 0x0, le_prev = 0xfffffe0000835080}, td_cpuset = 0xfffff800042013a8, td_sel = 0xfffff801a28b2500, td_sleepqueue = 0x0, td_turnstile = 0xfffff80002fbd240, td_rlqe = 0xfffff800101aacf8, td_umtxq = 0xfffff8012a366f00, td_tid = 100880, td_sigqueue = {sq_signals = {__bits = {0, 0, 0, 0}}, sq_kill = {__bits = {0, 0, 0, 0}}, sq_list = {tqh_first = 0x0, tqh_last = 0xfffff8008d64d568}, sq_proc = 0xfffff800059a29e0, sq_flags = 1}, td_lend_user_pri = 255 'ÿ', td_flags = 12, td_inhibitors = 2, td_pflags = 0, td_dupfd = 0, td_sqqueue = 0, td_wchan = 0xfffff8012a366f00, td_wmesg = 0xffffffff80a2583c "uwrlck", td_lastcpu = 0 '\0', td_oncpu = 255 'ÿ', td_owepreempt = 0 '\0', td_tsqueue = 0 '\0', td_locks = 399, td_rw_rlocks = 0, td_lk_slocks = 0, td_stopsched = 0, td_blocked = 0x0, td_lockname = 0x0, td_contested = {lh_first = 0x0}, td_sleeplocks = 0x0, td_intr_nesting_level = 0, td_pinned = 0, td_ucred = 0xfffff8004514c700, td_estcpu = 0, td_slptick = 457002771, td_blktick = 0, td_swvoltick = 457002771, td_cow = 104, td_ru = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, ru_maxrss = 6588, ru_ixrss = 252, ru_idrss = 28, ru_isrss = 896, ru_minflt = 325, ru_majflt = 0, ru_nswap = 0, ru_inblock = 0, ru_oublock = 0, ru_msgsnd = 14, ru_msgrcv = 2353, ru_nsignals = 0, ru_nvcsw = 121, ru_nivcsw = 9}, td_rux = {rux_runtime = 151507080, rux_uticks = 5, rux_sticks = 2, rux_iticks = 0, rux_uu = 30135, rux_su = 13070, rux_tu = 42189}, td_incruntime = 0, td_runtime = 151507080, td_pticks = 0, td_sticks = 0, td_iticks = 0, td_uticks = 0, td_intrval = 0, td_oldsigmask = { __bits = {0, 0, 0, 0}}, td_generation = 130, td_sigstk = {ss_sp = 0x0, ss_size = 0, ss_flags = 4}, td_xsig = 0, td_profil_addr = 0, td_profil_ticks = 0, td_name = "bconsole", '\0' <repeats 11 times>, td_fpop = 0x0, td_dbgflags = 0, td_dbgksi = {ksi_link = {tqe_next = 0x0, tqe_prev = 0x0}, ksi_info = { si_signo = 0, si_errno = 0, si_code = 0, si_pid = 0, si_uid = 0, si_status = 0, si_addr = 0x0, si_value = {sival_int = 0, sival_ptr = 0x0, sigval_int = 0, sigval_ptr = 0x0}, _reason = {_fault = {_trapno = 0}, _timer = {_timerid = 0, _overrun = 0}, _mesgq = {_mqd = 0}, _poll = {_band = 0}, __spare__ = { __spare1__ = 0, __spare2__ = {0, 0, 0, 0, 0, 0, 0}}}}, ksi_flags = 0, ksi_sigq = 0x0}, td_ng_outbound = 0, td_osd = {osd_nslots = 0, osd_slots = 0x0, osd_next = {le_next = 0x0, le_prev = 0x0}}, td_map_def_user = 0x0, td_dbg_forked = 0, td_vp_reserv = 0, td_no_sleeping = 0, td_dom_rr_idx = 0, td_sigmask = { __bits = {0, 0, 0, 0}}, td_rqindex = 30 '\036', td_base_pri = 120 'x', td_priority = 120 'x', td_pri_class = 3 '\003', td_user_pri = 120 'x', td_base_user_pri = 120 'x', td_pcb = 0xfffffe0231f2bcc0, td_state = TDS_INHIBITED, td_retval = {0, 0}, td_slpcallout = {c_links = {le = { le_next = 0xfffffe00007eb2c0, le_prev = 0xfffffe0000801de8}, sle = {sle_next = 0xfffffe00007eb2c0}, tqe = {tqe_next = 0xfffffe00007eb2c0, tqe_prev = 0xfffffe0000801de8}}, c_time = 1962816112194023, c_precision = 26843, c_arg = 0xfffff8008d64d4b0, c_func = 0xffffffff806a8220 <sleepq_timeout>, c_lock = 0x0, c_flags = 2, c_iflags = 272, c_cpu = 0}, td_frame = 0xfffffe0231f2bc00, td_kstack_obj = 0xfffff8013d46ea00, td_kstack = 18446741884114206720, td_kstack_pages = 4, td_critnest = 1, td_md = {md_spinlock_count = 1, md_saved_flags = 582, md_spurflt_addr = 34414718976}, td_sched = 0xfffff8008d64d930, td_ar = 0x0, td_lprof = {{lh_first = 0x0}, {lh_first = 0x0}}, td_dtrace = 0xfffff8003e02d000, td_errno = 0, td_vnet = 0x0, td_vnet_lpush = 0x0, td_intr_frame = 0x0, td_rfppwait_p = 0xfffff801a2d224f0, td_ma = 0x0, td_ma_cnt = 0, td_su = 0x0, td_dbg_sc_code = 0, td_dbg_sc_narg = 0, td_emuldata = 0x0} (kgdb) p *(struct proc *)0xfffff800059a29e0 $2 = {p_list = {le_next = 0xfffff80010d209e0, le_prev = 0xfffff8000556c000}, p_threads = {tqh_first = 0xfffff8008d64d4b0, tqh_last = 0xfffff8008d64d4c0}, p_slock = {lock_object = {lo_name = 0xffffffff80a22eb0 "process slock", lo_flags = 720896, lo_data = 0, lo_witness = 0x0}, mtx_lock = 4}, p_ucred = 0xfffff8004514c700, p_fd = 0xfffff801d97ba800, p_fdtol = 0x0, p_stats = 0xfffff80005889800, p_limit = 0xfffff800055cc800, p_limco = {c_links = { le = {le_next = 0x0, le_prev = 0x0}, sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, c_precision = 0, c_arg = 0x0, c_func = 0, c_lock = 0xfffff800059a2ae0, c_flags = 0, c_iflags = 0, c_cpu = 0}, p_sigacts = 0xfffff801a1ce4000, p_flag = 268451970, p_flag2 = 0, p_state = PRS_NORMAL, p_pid = 66009, p_hash = {le_next = 0x0, le_prev = 0xfffffe00006beec8}, p_pglist = {le_next = 0x0, le_prev = 0xfffff80112b3df10}, p_pptr = 0xfffff801a2e549e0, p_sibling = {le_next = 0x0, le_prev = 0xfffff801a2e54ad8}, p_children = {lh_first = 0x0}, p_mtx = {lock_object = { lo_name = 0xffffffff80a22ea3 "process lock", lo_flags = 21168128, lo_data = 0, lo_witness = 0x0}, mtx_lock = 4}, p_ksi = 0xfffff80004e76620, p_sigqueue = { sq_signals = {__bits = {0, 0, 0, 0}}, sq_kill = {__bits = {0, 0, 0, 0}}, sq_list = {tqh_first = 0x0, tqh_last = 0xfffff800059a2b28}, sq_proc = 0xfffff800059a29e0, sq_flags = 1}, p_oppid = 0, p_vmspace = 0xfffff8001032cc40, p_swtick = 456987934, p_realtimer = {it_interval = {tv_sec = 0, tv_usec = 0}, it_value = {tv_sec = 0, tv_usec = 0}}, p_ru = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 0}, ru_maxrss = 0, ru_ixrss = 0, ru_idrss = 0, ru_isrss = 0, ru_minflt = 40, ru_majflt = 0, ru_nswap = 0, ru_inblock = 0, ru_oublock = 0, ru_msgsnd = 0, ru_msgrcv = 0, ru_nsignals = 1, ru_nvcsw = 7, ru_nivcsw = 0}, p_rux = {rux_runtime = 155793645, rux_uticks = 5, rux_sticks = 2, rux_iticks = 0, rux_uu = 30987, rux_su = 13289, rux_tu = 43383}, p_crux = {rux_runtime = 0, rux_uticks = 0, rux_sticks = 0, rux_iticks = 0, rux_uu = 0, rux_su = 0, rux_tu = 0}, p_profthreads = 0, p_exitthreads = 0, p_traceflag = 0, p_tracevp = 0x0, p_tracecred = 0x0, p_textvp = 0xfffff8019c7bb588, p_lock = 0, p_sigiolst = { slh_first = 0x0}, p_sigparent = 20, p_sig = 0, p_code = 0, p_stops = 0, p_stype = 0, p_step = 0 '\0', p_pfsflags = 0 '\0', p_nlminfo = 0x0, p_aioinfo = 0x0, p_singlethread = 0x0, p_suspcount = 0, p_xthread = 0x0, p_boundary_count = 0, p_pendingcnt = 0, p_itimers = 0x0, p_procdesc = 0x0, p_magic = 3203398350, p_osrel = 1003500, p_comm = "bconsole", '\0' <repeats 11 times>, p_pad0 = 0x0, p_sysent = 0xffffffff80d18db8, p_args = 0xfffff801a294b720, p_cpulimit = 9223372036854775807, p_nice = 0 '\0', p_fibnum = 0, p_xstat = 0, p_klist = {kl_list = {slh_first = 0x0}, kl_lock = 0xffffffff80620190 <knlist_mtx_lock>, kl_unlock = 0xffffffff806201d0 <knlist_mtx_unlock>, kl_assert_locked = 0xffffffff80620210 <knlist_mtx_assert_locked>, kl_assert_unlocked = 0xffffffff80620220 <knlist_mtx_assert_unlocked>, kl_lockarg = 0xfffff800059a2ae0}, p_numthreads = 1, p_md = {md_ldt = 0x0, md_ldt_sd = {sd_lolimit = 0, sd_lobase = 0, sd_type = 0, sd_dpl = 0, sd_p = 0, sd_hilimit = 0, sd_xx0 = 0, sd_gran = 0, sd_hibase = 0, sd_xx1 = 0, sd_mbz = 0, sd_xx2 = 0}}, p_itcallout = {c_links = {le = {le_next = 0x0, le_prev = 0x0}, sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0x0}}, c_time = 0, c_precision = 0, c_arg = 0x0, c_func = 0, c_lock = 0xfffff800059a2ae0, c_flags = 0, c_iflags = 0, c_cpu = 0}, p_acflag = 0, p_peers = 0x0, p_leader = 0xfffff800059a29e0, p_emuldata = 0x0, p_label = 0x0, p_sched = 0xfffff800059a2ed0, p_ktr = {stqh_first = 0x0, stqh_last = 0xfffff800059a2e28}, p_mqnotifier = {lh_first = 0x0}, p_dtrace = 0xfffff801d965f280, p_pwait = {cv_description = 0xffffffff80a239ab "ppwait", cv_waiters = 0}, p_dbgwait = { cv_description = 0xffffffff80a239b2 "dbgwait", cv_waiters = 0}, p_prev_runtime = 0, p_racct = 0x0, p_throttled = 0 '\0', p_orphan = {le_next = 0x0, le_prev = 0x0}, p_orphans = {lh_first = 0x0}, p_treeflag = 0, p_reaper = 0xfffff800042104f0, p_reaplist = {lh_first = 0x0}, p_reapsibling = { le_next = 0xfffff80010d209e0, le_prev = 0xfffff8000556c4d0}, p_reapsubtree = 29, p_pgrp = 0xfffff80112b3df00} The software has run without this problem for years on FreeBSD 8.4-STABLE.
I need help with this problem and give some more infos: bconsole creates two threads: watchdog and checker_deadlock. When trying to quit bconsole the main thread sends first a special ping to the watchdog thread and this thread always stops cleanly without any problems. Second the main thread tries to stop the sleeping thread checker_deadlock with pthread_cancel(). Both the main thread and the checker_deadlock thread do some housekeeping in exit routines. I give ktraces of good and bad case starting with pthread_cancel: good (quit exits normally): 54503 100490 bconsole CALL thr_kill(0x18b71,SIG 32) 54503 100490 bconsole RET thr_kill 0 54503 101233 bconsole RET nanosleep -1 errno 4 Interrupted system call 54503 101233 bconsole PSIG SIG 32 caught handler=0x80205b080 mask=0x0 code=SI_LWP 54503 100490 bconsole CALL _umtx_op(0x802270880,UMTX_OP_RW_WRLOCK,0,0,0) 54503 101233 bconsole CALL thr_wake(0x18b71) 54503 100490 bconsole RET _umtx_op 0 54503 101233 bconsole RET thr_wake 0 54503 101233 bconsole CALL thr_wake(0x18b71) 54503 101233 bconsole RET thr_wake 0 54503 101233 bconsole CALL sigreturn(0x7fffdfffda60) 54503 101233 bconsole RET sigreturn JUSTRETURN 54503 100490 bconsole CALL _umtx_op(0x802270880,UMTX_OP_RW_WRLOCK,0,0,0) 54503 100490 bconsole RET _umtx_op 0 54503 101233 bconsole CALL _umtx_op(0x802270880,UMTX_OP_RW_RDLOCK,0,0,0) 54503 101233 bconsole RET _umtx_op 0 54503 100490 bconsole CALL _umtx_op(0x802270880,UMTX_OP_RW_WRLOCK,0,0,0) 54503 100490 bconsole RET _umtx_op 0 54503 101233 bconsole CALL _umtx_op(0x802270880,UMTX_OP_RW_RDLOCK,0,0,0) 54503 101233 bconsole RET _umtx_op 0 54503 100490 bconsole CALL _umtx_op(0x802270880,UMTX_OP_RW_WRLOCK,0,0,0) 54503 101233 bconsole CALL _umtx_op(0x802270880,UMTX_OP_RW_UNLOCK,0,0,0) 54503 101233 bconsole RET _umtx_op 0 54503 100490 bconsole RET _umtx_op 0 54503 101233 bconsole CALL _umtx_op(0x802270880,UMTX_OP_RW_RDLOCK,0,0,0) 54503 101233 bconsole RET _umtx_op 0 54503 101233 bconsole CALL munmap(0x800663000,0x1000) 54503 101233 bconsole RET munmap 0 54503 101233 bconsole CALL madvise(0x803806000,0x8000,MADV_FREE) 54503 101233 bconsole RET madvise 0 54503 100490 bconsole CALL exit(0) 54503 101233 bconsole CALL thr_exit(0x803406800) bad (hang on quit): 62196 100096 bconsole 0.655701 CALL thr_kill(0x186f8,SIG 32) 62196 100096 bconsole 0.655717 RET thr_kill 0 62196 100088 bconsole 0.655726 RET nanosleep -1 errno 4 Interrupted system call 62196 100088 bconsole 0.655742 PSIG SIG 32 caught handler=0x80205b080 mask=0x0 code=SI_LWP 62196 100096 bconsole 0.655766 CALL _umtx_op(0x802270880,UMTX_OP_RW_WRLOCK,0,0,0) 62196 100088 bconsole 0.655770 CALL thr_wake(0x186f8) 62196 100096 bconsole 0.655774 RET _umtx_op 0 62196 100088 bconsole 0.655781 RET thr_wake 0 62196 100088 bconsole 0.655791 CALL thr_wake(0x186f8) 62196 100088 bconsole 0.655795 RET thr_wake 0 62196 100088 bconsole 0.655800 CALL sigreturn(0x7fffdfffda60) 62196 100088 bconsole 0.655807 RET sigreturn JUSTRETURN 62196 100096 bconsole 0.655830 CALL _umtx_op(0x802270880,UMTX_OP_RW_WRLOCK,0,0,0) 62196 100096 bconsole 0.655837 RET _umtx_op 0 62196 100088 bconsole 0.655838 CALL _umtx_op(0x802270880,UMTX_OP_RW_RDLOCK,0,0,0) 62196 100088 bconsole 0.655848 RET _umtx_op 0 62196 100096 bconsole 0.655848 CALL _umtx_op(0x802270880,UMTX_OP_RW_WRLOCK,0,0,0) 62196 100088 bconsole 0.655862 CALL _umtx_op(0x802270880,UMTX_OP_RW_UNLOCK,0,0,0) 62196 100088 bconsole 0.655875 RET _umtx_op 0 62196 100088 bconsole 0.655885 CALL _umtx_op(0x802270880,UMTX_OP_RW_RDLOCK,0,0,0) 62196 100096 bconsole 0.655885 RET _umtx_op 0 62196 100088 bconsole 0.655897 RET _umtx_op -1 errno 4 Interrupted system call 62196 100096 bconsole 0.655901 CALL _umtx_op(0x802270880,UMTX_OP_RW_UNLOCK,0,0,0) 62196 100088 bconsole 0.655903 CALL _umtx_op(0x802270880,UMTX_OP_RW_RDLOCK,0,0,0) 62196 100096 bconsole 0.655912 RET _umtx_op 0 62196 100088 bconsole 0.655915 RET _umtx_op 0 62196 100096 bconsole 0.655930 CALL _umtx_op(0x802270880,UMTX_OP_RW_WRLOCK,0,0,0) 62196 100088 bconsole 0.655930 CALL _umtx_op(0x802270880,UMTX_OP_RW_UNLOCK,0,0,0) 62196 100088 bconsole 0.655942 RET _umtx_op 0 62196 100088 bconsole 0.655952 CALL _umtx_op(0x802270880,UMTX_OP_RW_UNLOCK,0,0,0) 62196 100088 bconsole 0.655956 RET _umtx_op 0 62196 100088 bconsole 0.655966 CALL _umtx_op(0x802270880,UMTX_OP_RW_UNLOCK,0,0,0) 62196 100088 bconsole 0.655971 RET _umtx_op 0 62196 100088 bconsole 0.655978 CALL _umtx_op(0x802270880,UMTX_OP_RW_UNLOCK,0,0,0) 62196 100088 bconsole 0.655982 RET _umtx_op 0 62196 100088 bconsole 0.655989 CALL _umtx_op(0x802270880,UMTX_OP_RW_UNLOCK,0,0,0) 62196 100088 bconsole 0.655993 RET _umtx_op 0 62196 100088 bconsole 0.655999 CALL _umtx_op(0x802270880,UMTX_OP_RW_UNLOCK,0,0,0) 62196 100088 bconsole 0.656005 RET _umtx_op 0 62196 100088 bconsole 0.656010 CALL80,UMTX_OP_RW_UNLOCK,0,0,0) 62196 100088 bconsole 0.656014 RET _umtx_op 0 62196 100088 bconsole 0.656022 CALL _umtx_op(0x802270880,UMTX_OP_RW_UNLOCK,0,0,0) 62196 100088 bconsole 0.656026 RET _umtx_op 0 62196 100088 bconsole 0.656032 CALL munmap(0x800663000,0x1000) 62196 100088 bconsole 0.656042 RET munmap 0 62196 100088 bconsole 0.656089 CALL madvise(0x803806000,0x8000,MADV_FREE) 62196 100088 bconsole 0.656099 RET madvise 0 62196 100088 bconsole 0.656105 CALL thr_exit(0x803406800) and the main thread does not die, he sits on WCHAN uwrlck. Especially I like to know, why the kernel sends a second signal to the checker_deadlock thread. I can avoid the libthr/kernel problem by sleeping some ms in the main thread after pthread_cancel(), the resulting ktrace looks like rather clean, because no _umtx_op calls to the kernel are necessary: 81803 100271 bconsole 8.425361 CALL thr_kill(0x18a20,SIG 32) 81803 100271 bconsole 8.425374 RET thr_kill 0 81803 100271 bconsole 8.425380 CALL nanosleep(0x7fffffffe2f0,0) 81803 100896 bconsole 8.425382 RET nanosleep -1 errno 4 Interrupted system call 81803 100896 bconsole 8.425400 PSIG SIG 32 caught handler=0x80205b080 mask=0x0 code=SI_LWP 81803 100896 bconsole 8.425426 CALL thr_wake(0x18a20) 81803 100896 bconsole 8.425433 RET thr_wake 0 81803 100896 bconsole 8.425438 CALL thr_wake(0x18a20) 81803 100896 bconsole 8.425442 RET thr_wake 0 81803 100896 bconsole 8.425446 CALL sigreturn(0x7fffdfffda60) 81803 100896 bconsole 8.425452 RET sigreturn JUSTRETURN 81803 100896 bconsole 8.425519 CALL munmap(0x800663000,0x1000) 81803 100896 bconsole 8.425529 RET munmap 0 81803 100896 bconsole 8.425575 CALL madvise(0x80380e000,0x22000,MADV_FREE) 81803 100896 bconsole 8.425585 RET madvise 0 81803 100896 bconsole 8.425595 CALL madvise(0x803806000,0x8000,MADV_FREE) 81803 100896 bconsole 8.425603 RET madvise 0 81803 100896 bconsole 8.425609 CALL thr_exit(0x803406800) 81803 100271 bconsole 8.427522 RET nanosleep 0 81803 100271 bconsole 8.427717 CALL exit(0) In my opinion this problem must be solved in libthr/kernel and not in userland.
In the meantime I have extracted the pthread relevant parts of the program bconsole and can give a working program thrtest.c that demonstrates the problem: #include <pthread.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> static int global_ct = 0; static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER; static pthread_t work_thread; void register_thread() { pthread_mutex_lock(&global_mutex); ++global_ct; pthread_mutex_unlock(&global_mutex); } void unregister_thread() { pthread_mutex_lock(&global_mutex); --global_ct; pthread_mutex_unlock(&global_mutex); } void clean_handler(void *a) { unregister_thread(); } void *work_start(void *x) { register_thread(); pthread_cleanup_push(clean_handler, NULL); while (sleep(30)) getpid(); pthread_cleanup_pop(1); exit(global_ct); } void cleanup_main() { pthread_cancel(work_thread); unregister_thread(); pthread_mutex_lock(&global_mutex); getpid(); pthread_mutex_unlock(&global_mutex); } int main(int argc, char *argv[]) { register_thread(); pthread_create(&work_thread, NULL, work_start, NULL); getpid(); sleep(1); cleanup_main(); exit(global_ct); } The relevant parts of ktrace output starting with sleep(1) from main(): 83194 101055 thrtest 0.001482 CALL nanosleep(0x7fffdfffdf08,0x7fffdfffdef8) 83194 100732 thrtest 1.043574 RET nanosleep 0 83194 100732 thrtest 1.043602 CALL thr_kill(0x18abf,SIG 32) 83194 100732 thrtest 1.043617 RET thr_kill 0 83194 101055 thrtest 1.043624 RET nanosleep -1 errno 4 Interrupted system call 83194 100732 thrtest 1.043625 CALL getpid 83194 101055 thrtest 1.043640 PSIG SIG 32 caught handler=0x80082d080 mask=0x0 code=SI_LWP 83194 100732 thrtest 1.043657 RET getpid 83194/0x144fa 83194 101055 thrtest 1.043676 CALL thr_wake(0x18abf) 83194 101055 thrtest 1.043686 RET thr_wake 0 83194 101055 thrtest 1.043691 CALL thr_wake(0x18abf) 83194 101055 thrtest 1.043696 RET thr_wake 0 83194 101055 thrtest 1.043701 CALL sigreturn(0x7fffdfffda70) 83194 101055 thrtest 1.043708 RET sigreturn JUSTRETURN 83194 101055 thrtest 1.043716 CALL _umtx_op(0x800a42880,UMTX_OP_RW_RDLOCK,0,0,0) 83194 101055 thrtest 1.043723 RET _umtx_op 0 83194 100732 thrtest 1.043725 CALL _umtx_op(0x800a42880,UMTX_OP_RW_WRLOCK,0,0,0) 83194 101055 thrtest 1.043743 CALL _umtx_op(0x800a42880,UMTX_OP_RW_UNLOCK,0,0,0) 83194 101055 thrtest 1.043757 RET _umtx_op 0 83194 101055 thrtest 1.043762 CALL _umtx_op(0x800a42880,UMTX_OP_RW_UNLOCK,0,0,0) 83194 101055 thrtest 1.043767 RET _umtx_op 0 83194 101055 thrtest 1.043771 CALL _umtx_op(0x800a42880,UMTX_OP_RW_RDLOCK,0,0,0) 83194 100732 thrtest 1.043771 RET _umtx_op 0 83194 101055 thrtest 1.043779 RET _umtx_op -1 errno 4 Interrupted system call 83194 100732 thrtest 1.043781 CALL _umtx_op(0x800a42880,UMTX_OP_RW_UNLOCK,0,0,0) 83194 101055 thrtest 1.043785 CALL _umtx_op(0x800a42880,UMTX_OP_RW_RDLOCK,0,0,0) 83194 100732 thrtest 1.043789 RET _umtx_op 0 83194 101055 thrtest 1.043792 RET _umtx_op 0 83194 100732 thrtest 1.043797 CALL _umtx_op(0x800a42880,UMTX_OP_RW_WRLOCK,0,0,0) 83194 101055 thrtest 1.043800 CALL _umtx_op(0x800a42880,UMTX_OP_RW_UNLOCK,0,0,0) 83194 101055 thrtest 1.043805 RET _umtx_op 0 83194 101055 thrtest 1.043811 CALL _umtx_op(0x800a42880,UMTX_OP_RW_UNLOCK,0,0,0) 83194 101055 thrtest 1.043813 RET _umtx_op 0 83194 101055 thrtest 1.043816 CALL thr_exit(0x801406800) Now the process 83194 hangs and procstat gives PID TID COMM TDNAME KSTACK 83194 100732 thrtest - mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_rw_wrlock+0x5af __umtx_op_rw_wrlock+0x47 amd64_syscall+0x40f Xfast_syscall+0xfb PID TID COMM TDNAME CPU PRI STATE WCHAN 83194 100732 thrtest - 1 122 sleep uwrlck Please can somebody verify this bug in FreeBSD 10.3 STABLE, my rev is 301633.
I can reproduce a hang with your test program with stable/10 libs on 11 kernel. On the other hand, I run your program in a loop for a hour on HEAD machine and was not able to observe the hang. When the program hangs, I see that the only live thread waits for the rtld_bind_lock, which is declared as write locked. So it looks as if the cancelled thread leaked the rtld bind lock. It might be only appearance of the leak, though. [Switching to LWP 101215 of process 54629] _umtx_op_err () at /usr/src/lib/libthr/arch/i386/i386/_umtx_op_err.S:36 36 SYSCALL_ERR(_umtx_op) (gdb) bt #0 _umtx_op_err () at /usr/src/lib/libthr/arch/i386/i386/_umtx_op_err.S:36 #1 0x2807fe01 in __thr_rwlock_wrlock (rwlock=0x0, tsp=<optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:297 #2 0x280884c0 in _thr_rwlock_wrlock (rwlock=<optimized out>, tsp=<optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.h:204 #3 _thr_rtld_wlock_acquire (lock=0x28097d00) at /usr/src/lib/libthr/thread/thr_rtld.c:141 #4 0x280539ff in wlock_acquire (lock=0x28063f58 <rtld_locks>, lockstate=0xffffd598) at /usr/src/libexec/rtld-elf/rtld_lock.c:217 #5 0x2804da23 in objlist_call_fini (list=<optimized out>, root=<optimized out>, lockstate=<optimized out>) at /usr/src/libexec/rtld-elf/rtld.c:2421 #6 0x2804cb7a in rtld_exit () at /usr/src/libexec/rtld-elf/rtld.c:2801 #7 0x281a68ab in __cxa_finalize (dso=<optimized out>) at /usr/src/lib/libc/stdlib/atexit.c:200 #8 0x28146f7f in exit (status=-10552) at /usr/src/lib/libc/stdlib/exit.c:73 #9 0x080488f6 in main () at pr-211947.c:70 (gdb) frame 4 #4 0x280539ff in wlock_acquire (lock=0x28063f58 <rtld_locks>, lockstate=0xffffd598) at /usr/src/libexec/rtld-elf/rtld_lock.c:217 217 lockinfo.wlock_acquire(lock->handle); (gdb) p/x *(struct urwlock *)(lock->handle) $3 = {rw_state = 0xa0000000, rw_flags = 0x2, rw_blocked_readers = 0x0, rw_blocked_writers = 0x0, rw_spare = {0x0, 0x0, 0x0, 0x0}} 0xa0000000 == URWLOCK_WRITE_OWNER | URWLOCK_READ_WAITERS
I can confirm that the user stack of the hanging process nearly looks like yours, the difference is probably because I use amd64: [Switching to Thread 801406400 (LWP 100956)] _umtx_op_err () at /usr/src/lib/libthr/arch/amd64/amd64/_umtx_op_err.S:37 37 RSYSCALL_ERR(_umtx_op) (gdb) bt #0 _umtx_op_err () at /usr/src/lib/libthr/arch/amd64/amd64/_umtx_op_err.S:37 #1 0x000000080082a267 in __thr_rwlock_wrlock (rwlock=0x800a42880, tsp=<optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:297 #2 0x000000080083186a in _thr_rwlock_wrlock (rwlock=<optimized out>, tsp=<optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.h:204 #3 _thr_rtld_wlock_acquire (lock=0x800a42880) at /usr/src/lib/libthr/thread/thr_rtld.c:141 #4 0x000000080060bc80 in wlock_acquire (lock=0x80081fcc0 <rtld_locks>, lockstate=0x7fffffffe658) at /usr/src/libexec/rtld-elf/rtld_lock.c:217 #5 0x0000000800605dbd in objlist_call_fini (list=<optimized out>, root=<optimized out>, lockstate=<optimized out>) at /usr/src/libexec/rtld-elf/rtld.c:2418 #6 0x0000000800604f39 in rtld_exit () at /usr/src/libexec/rtld-elf/rtld.c:2800 #7 0x0000000800b5f8c6 in __cxa_finalize () from /lib/libc.so.7 #8 0x0000000800b0337c in exit () from /lib/libc.so.7 #9 0x0000000000400bcb in main () (gdb) frame 4 #4 0x000000080060bc80 in wlock_acquire (lock=0x80081fcc0 <rtld_locks>, lockstate=0x7fffffffe658) at /usr/src/libexec/rtld-elf/rtld_lock.c:217 217 lockinfo.wlock_acquire(lock->handle); (gdb) p/x *(struct urwlock *)(lock->handle) $1 = {rw_state = 0xa0000000, rw_flags = 0x2, rw_blocked_readers = 0x0, rw_blocked_writers = 0x0, rw_spare = {0x0, 0x0, 0x0, 0x0}} In this situation I can see the backtrace of the process in the kernel, but "bt full" stops in frame 2: (kgdb) thread 705 [Switching to thread 705 (Thread 100956)]#0 sched_switch (td=0xfffff80052d114b0, newtd=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1945 1945 cpuid = PCPU_GET(cpuid); (kgdb) bt #0 sched_switch (td=0xfffff80052d114b0, newtd=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1945 #1 0xffffffff80551271 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:491 #2 0xffffffff80594ffb in sleepq_catch_signals (wchan=0xfffff80052018c80, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:426 #3 0xffffffff80594eaf in sleepq_wait_sig (wchan=0x0, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:632 #4 0xffffffff80550c7d in _sleep (ident=<value optimized out>, lock=<value optimized out>, priority=<value optimized out>, wmesg=<value optimized out>, sbt=<value optimized out>, pr=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/kern_synch.c:253 #5 0xffffffff80565285 in umtxq_sleep (uq=0xfffff80052018c80, wmesg=0xffffffff808b2ebc "uwrlck", abstime=0x0) at /usr/src/sys/kern/kern_umtx.c:780 #6 0xffffffff8056881f in do_rw_wrlock (td=0xfffff80052d114b0, rwlock=0x800a42880, timeout=<value optimized out>) at /usr/src/sys/kern/kern_umtx.c:3109 #7 0xffffffff8056c4a7 in __umtx_op_rw_wrlock (td=<value optimized out>, uap=<value optimized out>) at /usr/src/sys/kern/kern_umtx.c:3657 #8 0xffffffff807db84f in amd64_syscall (td=0xfffff80052d114b0, traced=0) at subr_syscall.c:141 #9 0xffffffff807c040b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:398 #10 0x0000000800833e8c in ?? () Previous frame inner to this frame (corrupt stack?) (kgdb) bt full #0 sched_switch (td=0xfffff80052d114b0, newtd=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1945 tdq = <value optimized out> ts = <value optimized out> mtx = <value optimized out> #1 0xffffffff80551271 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:491 td = (struct thread *) 0xfffff80052d114b0 new_switchtime = <value optimized out> runtime = <value optimized out> #2 0xffffffff80594ffb in sleepq_catch_signals (wchan=0xfffff80052018c80, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:426 td = <value optimized out> ret = Error accessing memory address 0x4: Bad address.
Created attachment 174031 [details] Unset URWLOCK_READ_WAITERS even when error occured. I still unable to reproduce the problem on HEAD, and machine where I can is not for testing kernel patches. Try this patch, please.
Your patch works fine. I have tested on two different machines both the testprogam thrtest.c and bconsole, no problems anymore. In the testprogram you can remove the sleep(1), it was only to simulate some kind of working in bconsole. Thanks for quick solution !
A commit references this bug: Author: kib Date: Thu Aug 25 16:35:42 UTC 2016 New revision: 304808 URL: https://svnweb.freebsd.org/changeset/base/304808 Log: Prevent leak of URWLOCK_READ_WAITERS flag for urwlocks. If there was some error, e.g. the sleep was interrupted, as in the referenced PR, do_rw_rdlock() did not cleared URWLOCK_READ_WAITERS. Since unlock only wakes up write waiters when there is no read waiters, for URWLOCK_PREFER_READER kind of locks, the result was missed wakeups for writers. In particular, the most visible victims are ld-elf.so locks in processes which loaded libthr, because rtld locks are urwlocks in prefer-reader mode. Normal rwlocks fall into prefer-reader mode only if thread already owns rw lock in read mode, which is not typical and correspondingly less visible. In the PR, unowned rtld bind lock was waited for in the process where only one thread was left alive. Note that do_rw_wrlock() correctly clears URWLOCK_WRITE_WAITERS in case of errors. Reported and tested by: longwitz@incore.de PR: 211947 Sponsored by: The FreeBSD Foundation MFC after: 1 week Changes: head/sys/kern/kern_umtx.c
A commit references this bug: Author: kib Date: Thu Sep 1 07:08:01 UTC 2016 New revision: 305180 URL: https://svnweb.freebsd.org/changeset/base/305180 Log: MFC r304808: Prevent leak of URWLOCK_READ_WAITERS flag for urwlocks. PR: 211947 Changes: _U stable/11/ stable/11/sys/kern/kern_umtx.c
A commit references this bug: Author: kib Date: Thu Sep 1 07:20:51 UTC 2016 New revision: 305184 URL: https://svnweb.freebsd.org/changeset/base/305184 Log: MFC r304808: Prevent leak of URWLOCK_READ_WAITERS flag for urwlocks. PR: 211947 Changes: _U stable/10/ stable/10/sys/kern/kern_umtx.c