Bug 277959 - Refactor of usr.sbin/daemon caused regression in restart parameter
Summary: Refactor of usr.sbin/daemon caused regression in restart parameter
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: 13.3-STABLE
Hardware: Any Any
: --- Affects Many People
Assignee: Kyle Evans
URL: https://reviews.freebsd.org/D47003
Keywords: regression, vendor
: 282694 (view as bug list)
Depends on:
Blocks:
 
Reported: 2024-03-25 20:12 UTC by Andrew Walker
Modified: 2024-12-18 10:43 UTC (History)
8 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Andrew Walker 2024-03-25 20:12:39 UTC
/usr/sbin/daemon was refactored to use kqueue in b64f569e5c051ff1acbfd7fefb3d32005957fc80. 

As a result of this if daemon is being used to supervise process with -r -R, then if supervised child process dies and SIGTERM sent to daemon, it will hang indefinitely (observed 24 hour hang).

For example if daemon is used to monitor collectd as follows:

/usr/sbin/daemon -f -p /var/run/collectd.pid -P/var/run/collectd-daemon.pid -r -R 60 /usr/local/sbin/collectd -f

and then

kill -9 <collectd pid>

and then
kill -15 <collectd-daemon.pid>

last kill reproduces issue.

Reverting to daemon from before above-referenced commit fixes issue.
Comment 1 Ray Bellis 2024-05-15 10:18:47 UTC
Possibly related, but since upgrading to 13.3 we've seen repeated instances of `daemon` exiting but leaving the child process (a Perl script) running.

It's feasible that this is associated with a perl libwww POST that is timing out, perhaps raising a signal that is propagating to `daemon`?

I'm running truss on some of my systems to try to determine root cause, and will raise a separate ticket if it turns out to be unrelated.
Comment 2 Rudolph 2024-06-18 10:38:49 UTC
This seems to happen in FreeBSD 14.0 too. I didn't yet try this in FreeBSD 14.1.
Comment 3 Mark Linimon freebsd_committer freebsd_triage 2024-06-18 11:09:16 UTC
^Triage: adds Keywords; also Cc: committer of
https://cgit.freebsd.org/src/commit/usr.sbin/daemon/daemon.c?id=b64f569e5c051ff1acbfd7fefb3d32005957fc80 20231213 .
Comment 4 Dag-Erling Smørgrav freebsd_committer freebsd_triage 2024-06-18 14:22:04 UTC
I didn't commit this, Kyle Evans did:

https://cgit.freebsd.org/src/commit/usr.sbin/daemon/daemon.c?id=8935a3993219be76c7ea03e9ad4509657d08af6c
Comment 5 Kyle Evans freebsd_committer freebsd_triage 2024-06-18 18:37:34 UTC
In the new model we're effectively dropping SIGTERM on the floor because it's all handled by the event loop, but killing the supervised process pushes us out of the event loop until we restart it again and the kqueue we use isn't persistent across restarts.  SIGTERM and SIGHUP should likely move back into the normal signal handling facilities if we're going to get this right.
Comment 6 Kyle Evans freebsd_committer freebsd_triage 2024-07-02 07:05:07 UTC
I have a tentative patch for this, but I'm still thinking about it a little bit.  My current approach is to pull the kqueue out of the eventloop and stop recreating it every time; the pipe read event will drop out when we close the read-side and leave just the signals.  daemon_sleep() then switches to using a one-shot EVFILT_TIMER instead of nanosleep(2) so that it can still process signals received in the interim following a previously processed SIGCHLD.
Comment 7 Rudolph 2024-08-26 13:13:26 UTC
Any news on this issue?
Comment 8 Rudolph 2024-10-07 14:16:44 UTC
Is there anything I can do to help fix this issue?
Comment 9 Kyle Evans freebsd_committer freebsd_triage 2024-10-07 14:19:39 UTC
(In reply to Rudolph from comment #8)

Sorry, I'll throw it on my list to revisit this shortly.
Comment 10 Kyle Evans freebsd_committer freebsd_triage 2024-10-08 05:32:28 UTC
The stack starting at https://reviews.freebsd.org/D47003 should address and test this.
Comment 11 Rudolph 2024-10-08 05:54:27 UTC
Great! Thanks so much. I'll go ahead and check the patches on my machine, if I can figure out how to compile it.
Comment 12 Rudolph 2024-10-08 06:12:48 UTC
I just successfully compiled and verified that these patches indeed fix the issue.

It would be really great if this fix would land in all currently supported releases, so people can keep reliable use daemon to supervise and restart processes. What's the process for this?
Comment 13 freebsd 2024-11-12 15:32:50 UTC
*** Bug 282694 has been marked as a duplicate of this bug. ***
Comment 14 Rudolph 2024-11-18 13:37:41 UTC
The proposed patches seem to resolve the issue, making `daemon` with restart options functional again.

Is there anything I can do to facilitate merging these patches into the main branch? It would also be great if these fixes could be backported to existing releases.
Comment 15 commit-hook freebsd_committer freebsd_triage 2024-11-19 19:51:52 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=bc1dfc316a2bba97773a14b96f5e976a52524be4

commit bc1dfc316a2bba97773a14b96f5e976a52524be4
Author:     Kyle Evans <kevans@FreeBSD.org>
AuthorDate: 2024-11-19 19:51:27 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2024-11-19 19:51:27 +0000

    daemon: stop rebuilding the kqueue every restart of the child

    We populate the kqueue with all of four kevents: three signal handlers and
    one for read of the child pipe.  Every time we start the child, we rebuild
    this kqueue from scratch for the child and tear it down before we exit and
    check if we need to restart the child.  As a consequence, we effectively
    drop any of the signals we're interested in between restarts.

    Push the kqueue out into the daemon state to avoid losing any signal events
    in the process, and reimplement the restart timer in terms of kqueue timers.
    The pipe read event will be automatically deleted upon last close, which
    leaves us with only the signal events that really get retained between
    restarts of the child.

    PR:             277959
    Reviewed by:    des, markj
    Differential Revision:  https://reviews.freebsd.org/D47004

 usr.sbin/daemon/daemon.c | 121 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 101 insertions(+), 20 deletions(-)
Comment 16 commit-hook freebsd_committer freebsd_triage 2024-11-19 19:51:56 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9ab59e900c1dd693b4d14285389a035e81341789

commit 9ab59e900c1dd693b4d14285389a035e81341789
Author:     Kyle Evans <kevans@FreeBSD.org>
AuthorDate: 2024-11-19 19:51:27 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2024-11-19 19:51:27 +0000

    daemon: tests: add a test for missed SIGTERM

    This is somewhaht hard to test reliably, but we'll give it a shot.  Startup
    a sleep(1) daemon with a hefty restart delay.  In refactoring of daemon(8),
    we inadvertently started dropping SIGTERMs that came in while we were
    waiting to restart the child, so we employ the strategy:

     - Pop the child sleep(1) first
     - Wait for sleep(1) to exit (pid file truncated)
     - Pop the daemon(8) with a SIGTERM
     - Wait for daemon(8) to exit

    The pidfile is specifically truncated outside of the event loop so that we
    don't have a kqueue to catch it in the current model.

    PR:             277959
    Reviewed by:    des, markj
    Differential Revision:  https://reviews.freebsd.org/D47005

 usr.sbin/daemon/tests/daemon_test.sh | 38 ++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)
Comment 17 Kyle Evans freebsd_committer freebsd_triage 2024-11-19 20:08:38 UTC
(In reply to Rudolph from comment #14)

Sorry, that's entirely on me.  The fix will be in this week's main snapshot, I'll give it a couple of days there before MFC'ing it by next week's snapshots- if things go well in stable, I'll see about an EN request after some soak time there.
Comment 18 Rudolph 2024-11-20 07:48:48 UTC
Thanks for all the work and dedication!
Comment 19 commit-hook freebsd_committer freebsd_triage 2024-12-10 23:38:47 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e310a825c933182a5b29261b741467f649951ff1

commit e310a825c933182a5b29261b741467f649951ff1
Author:     Kyle Evans <kevans@FreeBSD.org>
AuthorDate: 2024-11-19 19:51:27 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2024-12-10 23:00:03 +0000

    daemon: tests: add a test for missed SIGTERM

    This is somewhaht hard to test reliably, but we'll give it a shot.  Startup
    a sleep(1) daemon with a hefty restart delay.  In refactoring of daemon(8),
    we inadvertently started dropping SIGTERMs that came in while we were
    waiting to restart the child, so we employ the strategy:

     - Pop the child sleep(1) first
     - Wait for sleep(1) to exit (pid file truncated)
     - Pop the daemon(8) with a SIGTERM
     - Wait for daemon(8) to exit

    The pidfile is specifically truncated outside of the event loop so that we
    don't have a kqueue to catch it in the current model.

    PR:             277959
    Reviewed by:    des, markj

    (cherry picked from commit 9ab59e900c1dd693b4d14285389a035e81341789)

 usr.sbin/daemon/tests/daemon_test.sh | 38 ++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)
Comment 20 commit-hook freebsd_committer freebsd_triage 2024-12-10 23:38:54 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=7ea2874eadf901b1187772670169b6fc3a44d917

commit 7ea2874eadf901b1187772670169b6fc3a44d917
Author:     Kyle Evans <kevans@FreeBSD.org>
AuthorDate: 2024-11-19 19:51:27 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2024-12-10 23:05:46 +0000

    daemon: stop rebuilding the kqueue every restart of the child

    We populate the kqueue with all of four kevents: three signal handlers and
    one for read of the child pipe.  Every time we start the child, we rebuild
    this kqueue from scratch for the child and tear it down before we exit and
    check if we need to restart the child.  As a consequence, we effectively
    drop any of the signals we're interested in between restarts.

    Push the kqueue out into the daemon state to avoid losing any signal events
    in the process, and reimplement the restart timer in terms of kqueue timers.
    The pipe read event will be automatically deleted upon last close, which
    leaves us with only the signal events that really get retained between
    restarts of the child.

    PR:             277959
    Reviewed by:    des, markj

    (cherry picked from commit bc1dfc316a2bba97773a14b96f5e976a52524be4)

 usr.sbin/daemon/daemon.c | 121 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 101 insertions(+), 20 deletions(-)
Comment 21 commit-hook freebsd_committer freebsd_triage 2024-12-10 23:38:57 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=57c60ec86246804a3df5c50ecfd921b39febcfd2

commit 57c60ec86246804a3df5c50ecfd921b39febcfd2
Author:     Kyle Evans <kevans@FreeBSD.org>
AuthorDate: 2024-11-19 19:51:27 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2024-12-10 23:02:40 +0000

    daemon: tests: add a test for missed SIGTERM

    This is somewhaht hard to test reliably, but we'll give it a shot.  Startup
    a sleep(1) daemon with a hefty restart delay.  In refactoring of daemon(8),
    we inadvertently started dropping SIGTERMs that came in while we were
    waiting to restart the child, so we employ the strategy:

     - Pop the child sleep(1) first
     - Wait for sleep(1) to exit (pid file truncated)
     - Pop the daemon(8) with a SIGTERM
     - Wait for daemon(8) to exit

    The pidfile is specifically truncated outside of the event loop so that we
    don't have a kqueue to catch it in the current model.

    PR:             277959
    Reviewed by:    des, markj

    (cherry picked from commit 9ab59e900c1dd693b4d14285389a035e81341789)

 usr.sbin/daemon/tests/daemon_test.sh | 38 ++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)
Comment 22 commit-hook freebsd_committer freebsd_triage 2024-12-10 23:39:00 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4bb1a558a2811a27b1211579cb257a11df49c0e1

commit 4bb1a558a2811a27b1211579cb257a11df49c0e1
Author:     Kyle Evans <kevans@FreeBSD.org>
AuthorDate: 2024-11-19 19:51:27 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2024-12-10 23:06:11 +0000

    daemon: stop rebuilding the kqueue every restart of the child

    We populate the kqueue with all of four kevents: three signal handlers and
    one for read of the child pipe.  Every time we start the child, we rebuild
    this kqueue from scratch for the child and tear it down before we exit and
    check if we need to restart the child.  As a consequence, we effectively
    drop any of the signals we're interested in between restarts.

    Push the kqueue out into the daemon state to avoid losing any signal events
    in the process, and reimplement the restart timer in terms of kqueue timers.
    The pipe read event will be automatically deleted upon last close, which
    leaves us with only the signal events that really get retained between
    restarts of the child.

    PR:             277959
    Reviewed by:    des, markj

    (cherry picked from commit bc1dfc316a2bba97773a14b96f5e976a52524be4)

 usr.sbin/daemon/daemon.c | 121 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 101 insertions(+), 20 deletions(-)