Bug 128180

Summary: pthread_cond_broadcast(3) lost wakeup
Product: Base System
Reporter: Kurt Miller <kurt>
Component: threads
Assignee: attilio
Status: Closed Feedback Timeout
Severity: Affects Only Me
Priority: Normal
Version: 6.3-RELEASE
Hardware: Any
OS: Any

Description Kurt Miller 2008-10-17 17:50:01 UTC
I've been investigating a deadlock in the jvm that occurs with
the concurrent mark sweep garbage collector. The cause appears
to be due to the kernel failing to wake up all threads waiting
on a condition variable.

I have written a test program that mimics the jvm's underlying
pattern. It reproduces the deadlock quickly and exhibits the
same problem. The general idea is that one thread sends a broadcast
to a group of worker threads. The worker threads perform some tasks,
coordinate their completion, and broadcast on the same condition
variable when they are done. The design leans heavily on the one
condition variable; however, it does appear to be valid, if not
ideal.

The deadlock occurs with the following system setup:

  6.3-RELEASE SMP amd64 kernel
  libthr
  2 or more cores

I have not yet checked other releases or setups.

The test program outputs periodic printf's indicating
progress is being made. When it stops the process is
deadlocked. The lost wakeup can be confirmed by inspecting
the saved_waiters local var in main(). Each time the
deadlock occurs I see that saved_waiters is 8 which tells
me all eight worker threads were waiting on the condition
variable when the broadcast was sent. Then switch to the
thread that is still waiting on the condition variable,
and you can see that the last_cycle local var is one behind
the cycles global var which indicates it didn't receive the
last wakeup.

How-To-Repeat:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>

pthread_mutex_t group_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  group_cond_var = PTHREAD_COND_INITIALIZER;

volatile int tickets;
volatile int waiters;
volatile int finished;

int term_count;

volatile unsigned long cycles;

void *thread_main(void * thread_num);

#define NTHREADS 8
#define NYIELDS 1000

static inline void atomicinc(volatile int* val) {
  __asm__ __volatile__ ("lock addl $1,(%0)" : : "r" (val) : "cc", "memory");
}

int
main( int argc, char *argv[] )
{
    long t_num;
    pthread_t tid[NTHREADS];
    volatile int saved_waiters;

    /* startup threads */
    for (t_num=0; t_num < NTHREADS; t_num++) {
        pthread_create( &tid[t_num], NULL, thread_main, (void *)t_num );
    }

    for(;;) {
        /* monitor progress on stdout */
        if (cycles % 5000 == 0)
                printf("cycles %lu\n", cycles);

        /* broadcast to workers to work */
        pthread_mutex_lock(&group_mutex);
        cycles++;
        term_count = 0;
        finished = 0;
        tickets=NTHREADS;
        saved_waiters = waiters;
        pthread_cond_broadcast(&group_cond_var);
        pthread_mutex_unlock(&group_mutex);

        /* wait for workers to finish */
        pthread_mutex_lock(&group_mutex);
        while (finished != NTHREADS)
           pthread_cond_wait(&group_cond_var, &group_mutex);
        pthread_mutex_unlock(&group_mutex);
    }
    return 0;
}

void *
thread_main(void *thread_num)
{
    unsigned long yield_count=0;
    unsigned long sleep_count=0;
    u_int32_t i, busy_loop = arc4random() & 0x7FFF;
    u_int32_t dummy = busy_loop;
    pthread_cond_t sleep_cond_var;
    pthread_mutex_t sleep_mutex;
    struct timeval tmptime;
    struct timeval delay = {0, 1};
    struct timespec waketime;
    volatile unsigned long last_cycle;

    pthread_mutex_init(&sleep_mutex, NULL);
    pthread_cond_init(&sleep_cond_var, NULL);

    for (;;) {
        pthread_mutex_lock(&group_mutex);

        waiters++;

        while (tickets == 0)
           pthread_cond_wait(&group_cond_var, &group_mutex);

        waiters--;
        tickets--;
        last_cycle = cycles;

        pthread_mutex_unlock(&group_mutex);

        /* do something busy */
        for (i = 0; i < busy_loop; i++)
                dummy *= i;

        /* sync termination */
        atomicinc(&term_count);

        for(;;) {
            if (term_count == NTHREADS)
                break;

            if (yield_count < NYIELDS) {
                yield_count++;
                sched_yield();
            } else {
                yield_count = 0;
                sleep_count++;
                // 1.6 uses pthread_cond_timedwait for sleeping
                gettimeofday(&tmptime, NULL);
                timeradd(&tmptime, &delay, &tmptime);
                waketime.tv_sec = tmptime.tv_sec;
                waketime.tv_nsec = tmptime.tv_usec * 1000;
                pthread_mutex_lock(&sleep_mutex);
                pthread_cond_timedwait(&sleep_cond_var, &sleep_mutex, &waketime);
                pthread_mutex_unlock(&sleep_mutex);
            }
        }

        /* ok all terminated now let everyone know */
        pthread_mutex_lock(&group_mutex);
        finished++;
        pthread_cond_broadcast(&group_cond_var);
        pthread_mutex_unlock(&group_mutex);
    }
    return NULL;
}
Comment 1 Kurt Miller 2008-10-17 20:56:57 UTC
I've reproduced this on the following setup:

6.3-RELEASE SMP i386 kernel
libthr
2 cores

However, the lost wakeup is on the main thread. The last
broadcast wakeup from the worker threads gets lost and
the process deadlocks.

So far the test program has not provoked the lost
wakeup using libpthread on 6.3.

Initial 7.0 (amd64 libthr) testing has not provoked the
issue yet either.
Comment 2 Daniel Eischen freebsd_committer freebsd_triage 2008-10-18 00:44:58 UTC
On Fri, 17 Oct 2008, Kurt Miller wrote:

> The test program outputs periodic printf's indicating
> progress is being made. When it stops the process is
> deadlocked. The lost wakeup can be confirmed by inspecting
> the saved_waiters local var in main(). Each time the
> deadlock occurs I see that saved_waiters is 8 which tells
> me all eight worker threads were waiting on the condition
> variable when the broadcast was sent. Then switch to the
> thread that is still waiting on the condition variable,
> and you can see that the last_cycle local var is one behind
> the cycles global var which indicates it didn't receive the
> last wakeup.

The test program doesn't look correct to me.  It seems possible
for only a few of the threads (as little as 2) to do all the
work.  Thread 1 can start doing work, then wait for a broadcast.
Thread 2 can start doing his work, then broadcast waking thread 1.
I think you need separate condition variables, one to wake up
the main thread when the last worker goes to sleep/finishes,
and one to wake up the workers.

-- 
DE
Comment 3 Kurt Miller 2008-10-18 03:54:11 UTC
Hi Daniel,

Thanks for the review of the test program.

On Friday 17 October 2008 7:44:58 pm Daniel Eischen wrote:
> On Fri, 17 Oct 2008, Kurt Miller wrote:
> 
> > The test program outputs periodic printf's indicating
> > progress is being made. When it stops the process is
> > deadlocked. The lost wakeup can be confirmed by inspecting
> > the saved_waiters local var in main(). Each time the
> > deadlock occurs I see that saved_waiters is 8 which tells
> > me all eight worker threads were waiting on the condition
> > variable when the broadcast was sent. Then switch to the
> > thread that is still waiting on the condition variable,
> > and you can see that the last_cycle local var is one behind
> > the cycles global var which indicates it didn't receive the
> > last wakeup.
> 
> The test program doesn't look correct to me.  It seems possible
> for only a few of the threads (as little as 2) to do all the
> work.  Thread 1 can start doing work, then wait for a broadcast.
> Thread 2 can start doing his work, then broadcast waking thread 1.

I didn't fully describe why the design is the way it is. I
understand some of the reasons why it was designed like this,
but to fully understand it I would need to study the concurrent
mark sweep garbage collector far more. I can explain a bit
more of what I do understand.

The controlling thread in the jvm corresponds to the primordial
thread in my test program. In the jvm the controlling thread
is not in a loop. It just kicks off the worker threads and
waits for them to complete, then returns back to the calling
function. The jvm creates one worker thread per cpu; the workers
wait around for the controlling thread to kick them off. The
garbage collection work is divided amongst them.

The reason my test program has 8 worker threads is that
the problem was first reported to me on a dual quad-core amd64
system. My test systems are just dual core.

> I think you need separate condition variables, one to wake up
> the main thread when the last worker goes to sleep/finishes,
> and one to wake up the workers.

Indeed. In my first attempts to reproduce the lost wakeup
problem I wrote the test program with a separate condition
variable for letting the main thread know when the last worker 
finished. However, that didn't reproduce the deadlock the
jdk was experiencing. Only when I fully mimicked the underlying
design of the jdk, did the deadlock get reproduced by the test
program. Note that the jdk is written in C++, and the abstraction
it provides makes for some pretty ugly code when translated
into plain C.

I could make adjustments to the jvm code to introduce the
second condition variable and incorporate that in future
releases of the jdk. The problem is that the binary release
of the jdk, Diablo, can't be changed without a new formal
release process being followed. 

While the test program and the jdk's use of condition variables
may not be ideal, and may be somewhat unexpected, I do believe
the usage is valid. It works on Solaris, Linux and Windows
without losing wakeups.

With the 6.4 release coming soon, it would be great if the lost
wakeup problem (which is rather serious) could be looked at and
fixed before 6.4 is released.

Regards,
-Kurt
Comment 4 Ed Maste freebsd_committer freebsd_triage 2008-11-04 14:09:38 UTC
> The test program doesn't look correct to me. It seems possible
> for only a few of the threads (as little as 2) to do all the
> work. Thread 1 can start doing work, then wait for a broadcast.
> Thread 2 can start doing his work, then broadcast waking thread 1.

It looks to me like the "tickets" and "finished" globals should make
the test program operate correctly.  If the scenario you describe
happens, thread 1 will just enter pthread_cond_wait again and wait
for the broadcast from the primordial thread.
Comment 5 Ed Maste freebsd_committer freebsd_triage 2008-11-04 18:16:52 UTC
(I posted this to freebsd-threads@; sending to bug-followup for the
benefit of the PR.)

On Tue, Nov 04, 2008 at 07:41:13AM -0800, Alfred Perlstein wrote:

> This bug may have been fixed in 6-stable and 6.4.
> 
> http://svn.freebsd.org/viewvc/base?view=revision&revision=184172
> 
> Can you try upgrading?

I tested the changes in r184172, but I'm still able to reproduce the
problem using the test app.  With the change it does run about 10
times as long, on average, before it hangs, though.
Comment 6 Ed Maste freebsd_committer freebsd_triage 2009-02-09 18:39:15 UTC
It appears that revisions 181099 and 184172 together address the Java
hang; 184172 is after 6.3 but is in 6.4.

-Ed
Comment 7 Ed Maste freebsd_committer freebsd_triage 2009-02-09 18:40:58 UTC
Responsible Changed
From-To: freebsd-threads->attilio

Attilio identified an additional change that was needed in the fix;
this should no longer be a problem as of 6.4.

Attilio, can you add to the PR your corrected test app or a link to it? 
I think there is nothing else that needs to be done for this PR; please 
close it if that's correct.
Comment 8 alfred 2009-02-09 18:51:14 UTC
Can there be a link to Attilio's findings please?

I can't see one in the PR.

thank you,
-Alfred
Comment 9 Daniel Eischen freebsd_committer freebsd_triage 2012-02-11 07:52:46 UTC
State Changed
From-To: open->feedback

Can this PR be closed now?
Comment 10 Mark Linimon freebsd_committer freebsd_triage 2015-01-04 20:25:04 UTC
Feedback timeout (> 1 year).