|Summary:||Flakey test case: bin.sh.builtins.functional_test.kill1|
|Product:||Base System||Reporter:||Li-Wen Hsu <lwhsu>|
|Component:||tests||Assignee:||freebsd-testing (Nobody) <testing>|
|Status:||In Progress ---|
|Severity:||Affects Only Me||CC:||jilles, kib, ngie|
Description Li-Wen Hsu 2018-11-29 21:45:14 UTC
Comment 1 Li-Wen Hsu 2018-12-09 00:29:53 UTC
Comment 2 Enji Cooper 2018-12-27 21:45:43 UTC
I could be wrong... but this seems racy: ``` 1 # $FreeBSD$ 2 3 : & 4 p1=$! 5 : & 6 p2=$! 7 wait $p2 8 kill %1 ``` @jilles: would it be better if sleep 5 was backgrounded instead?
Comment 3 Jilles Tjoelker 2018-12-27 22:45:18 UTC
In the below text, wait(2) means any wait system call; sh(1) uses wait3() which appears as wait4() in ktrace. The test case is meant to test that a terminated, wait(2)ed for but not wait(1)ed for job can be passed to kill(1) without error (the command will do nothing). The part with the second background job, p2 and wait is intended to wait for the first background job to terminate and be wait(2)ed for, without taking excessive time or wait(1)ing for it (which would make the %1 specification invalid). If the first background job is slow to terminate, the kill command will do something but this is harmless. If the first background job terminates but the kernel has not returned it yet via wait(2), the kill command will kill a zombie which per POSIX does nothing successfully. I noticed that the problem is quickly reproduced on head using a loop like while sh builtins/kill1.0; do :; done using head's sh as well as stable/11's sh, while it can run for quite a while on stable/11 using stable/11's sh as well as head's sh built against stable/11. Reproducing with ktrace -i seems hard, but reproducing with plain ktrace works. The below ktrace extract seems to indicate that the kernel is at fault, returning an [ESRCH] error for killing a zombie: 19837 sh CALL fork 19837 sh RET fork 19838/0x4d7e 19837 sh CALL wait4(0xffffffff,0x7fffffffe91c,0x1<WNOHANG>,0) 19837 sh RET wait4 0 19837 sh CALL fork 19837 sh RET fork 19839/0x4d7f 19837 sh CALL sigprocmask(SIG_BLOCK,0x7fffffffe820,0x7fffffffe810) 19837 sh RET sigprocmask 0 19837 sh CALL sigaction(SIGCHLD,0x7fffffffe850,0x7fffffffe830) 19837 sh RET sigaction 0 19837 sh CALL wait4(0xffffffff,0x7fffffffe80c,0x1<WNOHANG>,0) 19837 sh RET wait4 19839/0x4d7f 19837 sh CALL sigaction(SIGCHLD,0x7fffffffe830,0) 19837 sh RET sigaction 0 19837 sh CALL sigprocmask(SIG_SETMASK,0x7fffffffe810,0) 19837 sh RET sigprocmask 0 19837 sh CALL kill(0x4d7e,SIGTERM) 19837 sh RET kill -1 errno 3 No such process Process ID 18007 has not been returned by a wait4() call, so it must either be still running or a zombie. In either case, a kill() on it must succeed. It appears that there is no test that specifically verifies that killing a zombie process succeeds.
Comment 4 Jilles Tjoelker 2018-12-27 22:49:19 UTC
Process ID 18007 in the above comment should be process ID 19838. And @enji, if tests are unclear, the commit message of the commit adding them may clarify things. I had to look up this one that way ;-)
Comment 5 Enji Cooper 2018-12-27 22:52:21 UTC
> And @enji, if tests are unclear, the commit message of the commit adding them may clarify things. I had to look up this one that way ;-) Good call. I took a very brief glance at the commit message and missed the nuance that you brought up. I will look at writing a test for the zombie process issue, but its commit will need to be delayed since my mentors are both on vacation until 01/02/2018, and I'll be on vacation after that until 01/07/2018.
Comment 6 Jilles Tjoelker 2018-12-28 00:40:39 UTC
I created a review for fixing the kernel bug: https://reviews.freebsd.org/D18665 The bug is only present in head and not in any stable branches. The review also includes a test program for killing a zombie in the test plan, but without ATF/Kyua infrastructure.
Comment 7 commit-hook 2018-12-28 13:32:46 UTC
A commit references this bug: Author: jilles Date: Fri Dec 28 13:32:15 UTC 2018 New revision: 342572 URL: https://svnweb.freebsd.org/changeset/base/342572 Log: pfind, pfind_any: Correct zombie logic SVN r340744 erroneously changed pfind() to return any process including zombies and pfind_any() to return only non-zombie processes. In particular, this caused kill() on a zombie process to fail with [ESRCH]. There is no direct test case for this but /usr/tests/bin/sh/builtins/kill1.0 occasionally triggers it (as reported by lwhsu). Conversely, returning zombies from pfind() seems likely to violate invariants and cause panics, but I have not looked at this. PR: 233646 Reviewed by: mjg, kib, ngie Differential Revision: https://reviews.freebsd.org/D18665 Changes: head/sys/kern/kern_proc.c
Comment 8 Li-Wen Hsu 2019-01-21 21:40:05 UTC
The test case doesn't fail for a while after the fix was committed. This ticket could be closed after adding the test case. (I'll work on it later.)