Created attachment 169926 [details] Patch to fix race Currently the suspend all implementation in libthr only sets the NEED_SUSPEND flag if a thread does not have the SUSPENDED flag set. However, that thread may be in the process of resuming, and so its NEED_SUSPEND flag should always be set if we wish a thread to stay suspended, even if already has the SUSPENDED flag set.
(In reply to Lawrence Esswood from comment #0) So the issue is that we might leave pthread_resume{_all}_np with the THR_FLAG_SUSPENDED still set. IMO this is the real bug, and it cannot be worked around by trying to set NEED_SUSPEND in the next suspend cycle. Instead, resume functions should wait until the resumed thread is scheduled and has a chance to run enough to clear the flag. One more wait/wake point around thread->cycle would be needed to ensure of proper control pass between resume->resumed->resume threads.
(In reply to Konstantin Belousov from comment #1) This would end up doing the same thing in the suspend logic as always setting the NEED_SUSPEND flag, as the !(thread->flags & THR_FLAGS_SUSPENDED) check would always be true if THR_FLAGS_NEED_SUSPEND was clear and if we insisted that !THR_FLAGS_NEED_SUSPEND => !THR_FLAGS_SUSPENDED after a resume. The only (observable) difference would be extra blocking in resume which we probabably want to avoid.
(In reply to Lawrence Esswood from comment #2) Yes, my intent is to guarantee that the THR_FLAGS_SUSPENDED is never left dandling, since we are not supposed to distinguish between previous and current suspension in suspend_common(). Then, why not simply clear THR_FLAGS_SUSPENDED in resume_common() ? We apparently do not care how much thread moved ahead returning from check_suspend(), and the function unblocks SIGCANCEL only after it is ready to next suspend. Could you provide a test case for the problem ?
Created attachment 169969 [details] Clear THR_FLAGS_SUSPENDED in resume_common()
(In reply to Konstantin Belousov from comment #3) I had a very wonky test that very unreliably failed, now I know why it happens I can probably create a much better one to attach. Your suggestion I think would work. You would also have to change the break condition on the check_suspend loop otherwise if the thread is woken for any reason it will break too early. One question however, what happens if the thread in check_suspend has its NEED_SUSPEND flag set and is signalled somewhere between the loop exit and the _thr_signal_unblock? I think it won't do another check_suspend / check_deferred until it next hits a _thr_ast which might be never. We could maybe extend the thread lock to include the _thr_signal_unblock, or have the end of check_suspend make a recursive call. I will try knock up a test case.
(In reply to Lawrence Esswood from comment #5) Yes, it is a bug that I left the old test in the loop condition. I left only the check for THR_FLAGS_NEED_SUSPEND there, since it is cleared only on resume. WRT SIGCANCEL generated after loop exit and before thr_signal_unblock. The blocked signal list includes SIGCANCEL, thus the SIGCANCEL signal should be only delivered after unblock. Then typically we would re-enter _thr_ast() with the different set of flags.
Created attachment 169971 [details] Clear THR_FLAGS_SUSPENDED in resume_common() v2 Fix loop condition.
Created attachment 169973 [details] test case (In reply to Konstantin Belousov from comment #3) This fails pretty reliably / is the best I can do.
(In reply to Lawrence Esswood from comment #8) Test is good. It also fails reliably for me, and works on the patched libthr. If you have no more comments, I will commit the fix. Some time later I might add the test to our test suite. Would be nice if you add copyright and license into the test source preamble, share/examples/etc/bsd-style-copyright provides text of the preferred license. Thanks.
A commit references this bug: Author: kib Date: Thu May 5 10:20:23 UTC 2016 New revision: 299114 URL: https://svnweb.freebsd.org/changeset/base/299114 Log: Do not leak THR_FLAGS_SUSPENDED from the previous suspend/resume cycle. The flag currently is cleared by the resumed thread. If next suspend request comes before the thread was able to clean the flag, in which case suspender skip the thread. Instead, clear the THR_FLAGS_SUSPEND flag in resume_common(), we do not care how much code was executed in the resumed thread when the pthread_resume_*np(s) functions returned. PR: 209233 Reported by: Lawrence Esswood <le277@cam.ac.uk> MFC after: 1 week Changes: head/lib/libthr/thread/thr_resume_np.c head/lib/libthr/thread/thr_sig.c
A commit references this bug: Author: kib Date: Thu May 12 06:53:23 UTC 2016 New revision: 299521 URL: https://svnweb.freebsd.org/changeset/base/299521 Log: MFC r299114: Do not leak THR_FLAGS_SUSPENDED from the previous suspend/resume cycle. PR: 209233 Changes: _U stable/10/ stable/10/lib/libthr/thread/thr_resume_np.c stable/10/lib/libthr/thread/thr_sig.c
Closing as fixed. Please re-open if there is still work to do.