Bug 233646 - Flakey test case: bin.sh.builtins.functional_test.kill1
Summary: Flakey test case: bin.sh.builtins.functional_test.kill1
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: tests (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-testing (Nobody)
Depends on:
Reported: 2018-11-29 21:45 UTC by Li-Wen Hsu
Modified: 2019-01-21 21:40 UTC (History)
3 users (show)

See Also:


Note You need to log in before you can comment on or make changes to this bug.
Comment 2 Enji Cooper freebsd_committer 2018-12-27 21:45:43 UTC
I could be wrong... but this seems racy:

 1 # $FreeBSD$
 3 : &
 4 p1=$!
 5 : &
 6 p2=$!
 7 wait $p2
 8 kill %1

@jilles: would it be better if sleep 5 was backgrounded instead?
Comment 3 Jilles Tjoelker freebsd_committer 2018-12-27 22:45:18 UTC
In the below text, wait(2) means any wait system call; sh(1) uses wait3() which appears as wait4() in ktrace.

The test case is meant to test that a terminated, wait(2)ed for but not wait(1)ed for job can be passed to kill(1) without error (the command will do nothing). The part with the second background job, p2 and wait is intended to wait for the first background job to terminate and be wait(2)ed for, without taking excessive time or wait(1)ing for it (which would make the %1 specification invalid). If the first background job is slow to terminate, the kill command will do something but this is harmless. If the first background job terminates but the kernel has not returned it yet via wait(2), the kill command will kill a zombie which per POSIX does nothing successfully.

I noticed that the problem is quickly reproduced on head using a loop like
  while sh builtins/kill1.0; do :; done
using head's sh as well as stable/11's sh, while it can run for quite a while on stable/11 using stable/11's sh as well as head's sh built against stable/11.

Reproducing with ktrace -i seems hard, but reproducing with plain ktrace works. The below ktrace extract seems to indicate that the kernel is at fault, returning an [ESRCH] error for killing a zombie:

 19837 sh       CALL  fork
 19837 sh       RET   fork 19838/0x4d7e
 19837 sh       CALL  wait4(0xffffffff,0x7fffffffe91c,0x1<WNOHANG>,0)
 19837 sh       RET   wait4 0
 19837 sh       CALL  fork
 19837 sh       RET   fork 19839/0x4d7f
 19837 sh       CALL  sigprocmask(SIG_BLOCK,0x7fffffffe820,0x7fffffffe810)
 19837 sh       RET   sigprocmask 0
 19837 sh       CALL  sigaction(SIGCHLD,0x7fffffffe850,0x7fffffffe830)
 19837 sh       RET   sigaction 0
 19837 sh       CALL  wait4(0xffffffff,0x7fffffffe80c,0x1<WNOHANG>,0)
 19837 sh       RET   wait4 19839/0x4d7f
 19837 sh       CALL  sigaction(SIGCHLD,0x7fffffffe830,0)
 19837 sh       RET   sigaction 0
 19837 sh       CALL  sigprocmask(SIG_SETMASK,0x7fffffffe810,0)
 19837 sh       RET   sigprocmask 0
 19837 sh       CALL  kill(0x4d7e,SIGTERM)
 19837 sh       RET   kill -1 errno 3 No such process

Process ID 18007 has not been returned by a wait4() call, so it must either be still running or a zombie. In either case, a kill() on it must succeed.

It appears that there is no test that specifically verifies that killing a zombie process succeeds.
Comment 4 Jilles Tjoelker freebsd_committer 2018-12-27 22:49:19 UTC
Process ID 18007 in the above comment should be process ID 19838.

And @enji, if tests are unclear, the commit message of the commit adding them may clarify things. I had to look up this one that way ;-)
Comment 5 Enji Cooper freebsd_committer 2018-12-27 22:52:21 UTC
> And @enji, if tests are unclear, the commit message of the commit adding them may clarify things. I had to look up this one that way ;-)

Good call. I took a very brief glance at the commit message and missed the nuance that you brought up.

I will look at writing a test for the zombie process issue, but its commit will need to be delayed since my mentors are both on vacation until 01/02/2018, and I'll be on vacation after that until 01/07/2018.
Comment 6 Jilles Tjoelker freebsd_committer 2018-12-28 00:40:39 UTC
I created a review for fixing the kernel bug: https://reviews.freebsd.org/D18665

The bug is only present in head and not in any stable branches.

The review also includes a test program for killing a zombie in the test plan, but without ATF/Kyua infrastructure.
Comment 7 commit-hook freebsd_committer 2018-12-28 13:32:46 UTC
A commit references this bug:

Author: jilles
Date: Fri Dec 28 13:32:15 UTC 2018
New revision: 342572
URL: https://svnweb.freebsd.org/changeset/base/342572

  pfind, pfind_any: Correct zombie logic

  SVN r340744 erroneously changed pfind() to return any process including
  zombies and pfind_any() to return only non-zombie processes.

  In particular, this caused kill() on a zombie process to fail with [ESRCH].
  There is no direct test case for this but /usr/tests/bin/sh/builtins/kill1.0
  occasionally triggers it (as reported by lwhsu).

  Conversely, returning zombies from pfind() seems likely to violate
  invariants and cause panics, but I have not looked at this.

  PR:		233646
  Reviewed by:	mjg, kib, ngie
  Differential Revision:	https://reviews.freebsd.org/D18665

Comment 8 Li-Wen Hsu freebsd_committer 2019-01-21 21:40:05 UTC
The test case doesn't fail for a while after the fix was committed.

This ticket could be closed after adding the test case. (I'll work on it later.)