Bug 185873 - [linux] waitpid() in linux threads fails with ECHILD
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: Unspecified
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: Dmitry Chagin
Reported: 2014-01-19 03:10 UTC by Henry Hu
Modified: 2016-01-09 21:45 UTC
Description Henry Hu 2014-01-19 03:10:00 UTC
If a Linux program
1. uses fork() to fork a child,
2. creates a thread, and
3. uses waitpid() in the new thread to wait for the child,

then waitpid() returns -1 with errno set to ECHILD.

This affects some applications:
1. IntelliJ (experimental port available) with Oracle JDK
2. Android Studio (based on IntelliJ)
and possibly other programs using Oracle JDK.

If you use java.lang.ProcessBuilder to create a child and get its output,
Oracle JDK works this way:

1. use fork() to create a child
2. create a thread to wait for the child
3. the thread calls waitpid() to wait for child exit
4. the thread reads the child's output

Because waitpid() incorrectly returns -1 here before the child exits,
the child may not yet have produced its output, so the caller receives
empty output. The caller may then incorrectly assume that the child
is not working.
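
The sequence above can be sketched in C. This is only an illustration, not the JDK's code: the names spawned, wait_then_read and spawn_and_collect are hypothetical, and the non-blocking read is an assumption standing in for "collect whatever the child has written so far". With a correct waitpid() the captured output is complete; if waitpid() returns early, the buffer stays empty.

```c
#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one spawned child (not from the JDK). */
struct spawned {
	pid_t pid;		/* child process ID */
	int out_fd;		/* read end of the pipe carrying the child's stdout */
	pid_t wait_ret;		/* what waitpid() returned in the waiter thread */
	char output[256];	/* captured output, NUL-terminated */
};

/* Waiter thread: wait for the child, then collect what is already in the pipe. */
static void *wait_then_read(void *arg)
{
	struct spawned *s = arg;
	int status = 0;
	size_t used = 0;
	ssize_t n;

	s->wait_ret = waitpid(s->pid, &status, 0);

	/* Non-blocking on purpose: only output written before this point is seen. */
	fcntl(s->out_fd, F_SETFL, O_NONBLOCK);
	while (used < sizeof(s->output) - 1 &&
	    (n = read(s->out_fd, s->output + used,
	    sizeof(s->output) - 1 - used)) > 0)
		used += (size_t)n;
	s->output[used] = '\0';
	close(s->out_fd);
	return NULL;
}

/* Fork a child that prints one line after a delay; wait for it in a thread. */
static int spawn_and_collect(struct spawned *s)
{
	static const char msg[] = "child output\n";
	int fds[2];
	pthread_t thr;

	if (pipe(fds) != 0)
		return -1;
	s->pid = fork();
	if (s->pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (s->pid == 0) {
		/* Child: send stdout into the pipe, work briefly, then write. */
		close(fds[0]);
		dup2(fds[1], STDOUT_FILENO);
		close(fds[1]);
		sleep(1);
		if (write(STDOUT_FILENO, msg, sizeof(msg) - 1) < 0)
			_exit(1);
		_exit(0);
	}
	close(fds[1]);		/* parent keeps only the read end */
	s->out_fd = fds[0];
	if (pthread_create(&thr, NULL, wait_then_read, s) != 0)
		return -1;
	pthread_join(thr, NULL);
	return 0;
}
```

A caller would declare a struct spawned, call spawn_and_collect() on it, and then inspect wait_ret and output. The sketch only handles output that fits in the pipe buffer; a larger amount would need the pipe drained while waiting.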

IntelliJ calls "git --version" to obtain the git version. Because
the output is empty, it assumes that git is not working and disables
some features.

How-To-Repeat: A simple test program:

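#include <stdio.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* PID of the forked child, shared with the worker thread. */
pid_t child;

/* Runs in a second thread and waits for the child forked by main(). */
void* worker(void* arg) {
	int status = 0;	/* initialized so a failed waitpid() prints a defined value */
	(void)arg;
	printf("worker waiting\n");
	int ret = waitpid(child, &status, 0);
	printf("waitpid ret: %d status: %d\n", ret, status);
	return NULL;
}

int main(void) {
	child = fork();
	if (child == 0) {
		printf("child running\n");
		sleep(3);
		printf("child exit\n");
	} else {
		printf("forked: %d\n", (int)child);
		pthread_t thr;
		pthread_create(&thr, NULL, worker, NULL);
		sleep(5);	/* stay alive longer than the child's sleep(3) */
	}
	return 0;
}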

If it is run natively on a FreeBSD or Linux machine, it outputs:

forked: 98484
child running
worker waiting
child exit
waitpid ret: 98484 status: 0

However, if a Linux build of it is run on a FreeBSD machine, it outputs:

forked: 95940
child running
worker waiting
waitpid ret: -1 status: 0
child exit
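
The test program prints only the return value, not errno. As a rough sketch (the names wait_result, wait_worker and wait_from_thread are hypothetical and not part of the report), the worker could record errno as well, which makes it possible to tell ECHILD apart from other failures:

```c
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical record of one waitpid() call made from a thread. */
struct wait_result {
	pid_t pid;	/* PID to wait for */
	pid_t ret;	/* return value of waitpid() */
	int err;	/* errno if waitpid() failed, otherwise 0 */
	int status;	/* raw status word filled in by waitpid() */
};

/* Thread body: wait for r->pid and capture errno on failure. */
static void *wait_worker(void *arg)
{
	struct wait_result *r = arg;

	r->status = 0;
	r->ret = waitpid(r->pid, &r->status, 0);
	r->err = (r->ret == -1) ? errno : 0;
	return NULL;
}

/* Run wait_worker() in a new thread and wait for it to finish. */
static int wait_from_thread(pid_t pid, struct wait_result *r)
{
	pthread_t thr;

	r->pid = pid;
	if (pthread_create(&thr, NULL, wait_worker, r) != 0)
		return -1;
	return pthread_join(thr, NULL) == 0 ? 0 : -1;
}
```

On a system that behaves correctly, the first wait for a live child returns its PID with err 0. A second wait for the same PID returns -1 with err ECHILD, because the child has already been reaped. The bug reported here produces that ECHILD result on the first wait.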
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2014-04-20 03:58:16 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-emulation


Over to maintainer(s).
Comment 2 Dmitry Chagin freebsd_committer freebsd_triage 2014-04-20 06:17:47 UTC
Responsible Changed
From-To: freebsd-emulation->dchagin

Grab.
Comment 3 Bradley T. Hughes freebsd_committer freebsd_triage 2014-09-28 19:35:21 UTC
The test program, when compiled on CentOS 6.5 and run on FreeBSD 10.1-BETA3, leaves the child a zombie:

$ ./echld
forked: 73161
child running
worker waiting
waitpid ret: -1 status: 0
child exit
...

$ ps ax | grep 73161
73161  0  Z+       0:00.00 <defunct>

This also shows up when running sysutils/linux-crashplan from ports, which leaves multiple zombies lying around, their number slowly growing over time.

I'm interested in digging more into this; let me know any way I can help.
Comment 4 Vadim Ippolitov 2014-09-30 03:39:59 UTC
I'm not seeing any zombie processes when running sysutils/linux-crashplan on FreeBSD 9.2-RELEASE-p11.
So maybe it's a regression somewhere in FreeBSD 10.
Comment 5 Bradley T. Hughes freebsd_committer freebsd_triage 2014-10-01 20:19:20 UTC
Taking a cue from Vadim's comment, I tried the Linux binary on more versions of FreeBSD, all with the same result. 8.4, 9.2, 9.3, 10.0, and 10.1 all return -1 from waitpid() with errno=ECHILD and leave the child as a zombie. At first glance, this seems related to bug 39201, https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=39201, in particular this comment:

"... causes waitpid() to fail with ECHILD, because the original parent/child relationship doesn't hold ..."

I am unsure how to explain the difference between my observations and Vadim's. My gut feeling is that we may be using different JREs.
Comment 6 Dmitry Chagin freebsd_committer freebsd_triage 2014-10-29 10:31:20 UTC
This bug was the main reason to rewrite our Linuxulator from pseudo threading to native threading. That was done in the lemul branch: http://svnweb.freebsd.org/base/user/dchagin/lemul/
I'll close the PR when lemul is merged to HEAD.
Comment 7 Dmitry Chagin freebsd_committer freebsd_triage 2016-01-09 21:39:10 UTC
Hmm, it works now, but waitpid() returns an inadequate status.
Comment 8 Dmitry Chagin freebsd_committer freebsd_triage 2016-01-09 21:45:54 UTC
Oh, no, I just forgot WEXITSTATUS. lemul is merged to stable/10, so closing.
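
For readers puzzled by the last two comments: the status word filled in by waitpid() is an encoded value, not the exit code itself, so it has to be decoded with WIFEXITED() and WEXITSTATUS(). A minimal sketch, with run_and_decode as a hypothetical helper name:

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code`, reap it, and return the decoded
 * exit code. Returns -1 if the child could not be forked or reaped, or
 * did not exit normally. The raw status word is stored in *raw_status
 * when raw_status is not NULL.
 */
static int run_and_decode(int code, int *raw_status)
{
	int status = 0;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(code);		/* child: exit immediately */
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (raw_status != NULL)
		*raw_status = status;
	/* Decode only when the child terminated via exit()/_exit(). */
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Printing the raw status next to the decoded value for a nonzero exit code shows why an undecoded status can look wrong at first glance.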