264441 – Hang with Valgrind on single CPU systems

Bug 264441 - Hang with Valgrind on single CPU systems

Summary: Hang with Valgrind on single CPU systems

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern (show other bugs)
Version:	13.1-RELEASE
Hardware:	amd64 Any

Importance:	--- Affects Some People
Assignee:	Mark Johnston

URL:	https://lists.freebsd.org/archives/fr...
Keywords:	needs-qa

Depends on:
Blocks:

Reported:	2022-06-03 20:05 UTC by Paul Floyd
Modified:	2022-07-14 14:14 UTC (History)
CC List:	3 users (show)

See Also:	https://github.com/paulfloyd/freebsd_valgrind/issues/189 https://github.com/paulfloyd/freebsd_valgrind/issues/188

Attachments
standalone test case (884 bytes, text/plain) 2022-06-06 17:49 UTC, Mark Johnston	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Paul Floyd 2022-06-03 20:05:11 UTC

There are a lot more details here:

https://github.com/paulfloyd/freebsd_valgrind/issues/189
https://github.com/paulfloyd/freebsd_valgrind/issues/188

and discussion here

https://lists.freebsd.org/archives/freebsd-hackers/2022-May/001154.html

It may be somewhat difficult to reproduce the error as it may be quite timing dependent.

Steps to reproduce:

Most importantly, you need a single CPU i386 or amd64 system. That's possible with VirtualBox.

Clone and build the Valgrind source and regressions tests.

clone git://sourceware.org/git/valgrind.git

Backout a workaround for one of the issues

$ git revert a4151207a28b506cdd1e7c095e4aebaaf289a384


Make sure that you have the ports for autoconf, automake, libtool and gmake installed.


Run the folllowing comamnds

If you have GCC installed you need to set the environment variables CC=clang CXX=clang++

$ sh autogen.sh
$ ./configure
$ gmake
$ gmake check



Now you can try the two testcases that were causing problems

$ perl tests/vg_regtest none/tests/tls

and

$ perl tests/vg_regtest none/tests/pth_2sig

If you want to run Valgrind without any wrapper scripts then do the following

export VALGRIND_LAUNCHER=YOUR_PATH/valgrind/coregrind/valgrind
export VALGRIND_LIB=YOUR_PATH/valgrind/.in_place
export VALGRIND_LIB_INNER=YOUR_PATH/valgrind/.in_place


(where YOUR_PATH/valgrind is the path to the git repo cloned above)

cd to none/tests

Run

../../.in_place/none-amd64-freebsd --tool=none ./tls

or

../../.in_place/none-amd64-freebsd --tool=none ./pth_2sig

Comment 1 Mark Johnston freebsd_committer

2022-06-06 17:44:28 UTC

Thanks for the repro steps, I was able to trigger the problem locally.  It's enough to pin all of the valgrind threads to the same CPU:

/tmp/valgrind # cpuset -l 1 perl tests/vg_regtest none/tests/tls

Really there are two problems here.  First, it seems that one of the threads (the one switched out in ast()) is simply getting starved.  There are several other always-runnable threads in the process that have a slightly higher scheduling priority, and they end up monopolizing the CPU.  ULE has decided that the threads are "interactive", and in this case higher priority threads are always scheduled first.

With multiple CPUs available I suppose the problem may still exist, it just becomes harder to trigger since with more CPUs there's less chance that the starved thread will get stuck.  I'm guessing that the stuck thread is the one which is supposed to be writing to the pipe.

The second problem is a livelock in the pipe code.  It seems that pipelock()/pipeunlock() can cause reader threads to wake each other up in a loop even when there's nothing to do.  This is because the wait channel used by readers and writers to signal each other is the same as the one used to serialize I/O operations among multiple concurrent readers.  It's not too hard to write a standalone test program which triggers this, albeit unreliably.  I suspect the livelock causes the scheduler problem, since the reader threads keep yielding the CPU to sleep, and this boosts their interactivity score.

Comment 2 Mark Johnston freebsd_committer

2022-06-06 17:49:54 UTC

Created attachment 234493 [details]
standalone test case

Here's a test program.  Run it with a single parameter, the number of reader threads to create.  It'll create a named pipe, "fifo" in its current directory.  The problem can be triggered by a character to it from a different terminal, watching top(1) to see when CPU usage starts going up.  Sometimes it takes quite a few tries before the problem occurs.  It can help to force all threads onto a single CPU.

Comment 3 commit-hook freebsd_committer

2022-06-14 16:01:27 UTC

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e8955bd643ee852d70a0b065f2a0d1bb3fa99df2

commit e8955bd643ee852d70a0b065f2a0d1bb3fa99df2
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-06-14 14:52:03 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-06-14 16:00:59 +0000

    pipe: Use a distinct wait channel for I/O serialization

    Suppose a thread tries to read from an empty pipe.  pipe_read() does the
    following:

    1. pipelock(), possibly sleeping
    2. check for buffered data
    3. pipeunlock()
    4. set PIPE_WANTR and sleep
    5. goto 1

    pipelock() is an open-coded mutex; if a thread blocks in pipelock(), it
    sleeps until the lock holder calls pipeunlock().

    Both sleeps use the same wait channel.  So if there are multiple threads
    in pipe_read(), a thread T1 in step 3 can wake up a thread T2 sleeping
    in step 4.  Then T1 goes to sleep in step 4, and T2 acquires and
    releases the pipelock, waking up T1 again.  This can go on indefinitely,
    livelocking the process (and potentially starving a would-be writer).

    Fix the problem by using a separate wait channel for pipelock().

    Reported by:    Paul Floyd <paulf2718@gmail.com>
    Reviewed by:    mjg, kib
    PR:             264441
    MFC after:      1 month
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D35415

 sys/kern/sys_pipe.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

Comment 4 commit-hook freebsd_committer

2022-07-14 13:51:05 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a70c03b2d383011dece503716faf504598d5fe29

commit a70c03b2d383011dece503716faf504598d5fe29
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-06-14 14:52:03 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-07-14 13:50:10 +0000

    pipe: Use a distinct wait channel for I/O serialization

    Suppose a thread tries to read from an empty pipe.  pipe_read() does the
    following:

    1. pipelock(), possibly sleeping
    2. check for buffered data
    3. pipeunlock()
    4. set PIPE_WANTR and sleep
    5. goto 1

    pipelock() is an open-coded mutex; if a thread blocks in pipelock(), it
    sleeps until the lock holder calls pipeunlock().

    Both sleeps use the same wait channel.  So if there are multiple threads
    in pipe_read(), a thread T1 in step 3 can wake up a thread T2 sleeping
    in step 4.  Then T1 goes to sleep in step 4, and T2 acquires and
    releases the pipelock, waking up T1 again.  This can go on indefinitely,
    livelocking the process (and potentially starving a would-be writer).

    Fix the problem by using a separate wait channel for pipelock().

    Reported by:    Paul Floyd <paulf2718@gmail.com>
    Reviewed by:    mjg, kib
    PR:             264441
    Sponsored by:   The FreeBSD Foundation

    (cherry picked from commit e8955bd643ee852d70a0b065f2a0d1bb3fa99df2)

 sys/kern/sys_pipe.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)