/usr/sbin/daemon was refactored to use kqueue in b64f569e5c051ff1acbfd7fefb3d32005957fc80. As a result of this if daemon is being used to supervise process with -r -R, then if supervised child process dies and SIGTERM sent to daemon, it will hang indefinitely (observed 24 hour hang). For example if daemon is used to monitor collectd as follows: /usr/sbin/daemon -f -p /var/run/collectd.pid -P/var/run/collectd-daemon.pid -r -R 60 /usr/local/sbin/collectd -f and then kill -9 <collectd pid> and then kill -15 <collectd-daemon.pid> last kill reproduces issue. Reverting to daemon from before above-referenced commit fixes issue.
Possibly related, but since upgrading to 13.3 we've seen repeated instances of `daemon` exiting but leaving the child process (a Perl script) running. It's feasible that this is associated with a perl libwww POST that is timing out, perhaps raising a signal that is propagating to `daemon`? I'm running truss on some of my systems to try to determine root cause, and will raise a separate ticket if it turns out to be unrelated.
This seems to happen in FreeBSD 14.0 too. I didn't yet try this in FreeBSD 14.1.
^Triage: adds Keywords; also Cc: committer of https://cgit.freebsd.org/src/commit/usr.sbin/daemon/daemon.c?id=b64f569e5c051ff1acbfd7fefb3d32005957fc80 20231213 .
I didn't commit this, Kyle Evans did: https://cgit.freebsd.org/src/commit/usr.sbin/daemon/daemon.c?id=8935a3993219be76c7ea03e9ad4509657d08af6c
In the new model we're effectively dropping SIGTERM on the floor because it's all handled by the event loop, but killing the supervised process pushes us out of the event loop until we restart it again and the kqueue we use isn't persistent across restarts. SIGTERM and SIGHUP should likely move back into the normal signal handling facilities if we're going to get this right.
I have a tentative patch for this, but I'm still thinking about it a little bit. My current approach is to pull the kqueue out of the eventloop and stop recreating it every time; the pipe read event will drop out when we close the read-side and leave just the signals. daemon_sleep() then switches to using a one-shot EVFILT_TIMER instead of nanosleep(2) so that it can still process signals received in the interim following a previously processed SIGCHLD.
Any news on this issue?
Is there anything I can do to help fix this issue?
(In reply to Rudolph from comment #8) Sorry, I'll throw it on my list to revisit this shortly.
The stack starting at https://reviews.freebsd.org/D47003 should address and test this.
Great! Thanks so much. I'll go ahead and check the patches on my machine, if I can figure out how to compile it.
I just successfully compiled and verified that these patches indeed fix the issue. It would be really great if this fix would land in all currently supported releases, so people can keep reliable use daemon to supervise and restart processes. What's the process for this?
*** Bug 282694 has been marked as a duplicate of this bug. ***
The proposed patches seem to resolve the issue, making `daemon` with restart options functional again. Is there anything I can do to facilitate merging these patches into the main branch? It would also be great if these fixes could be backported to existing releases.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=bc1dfc316a2bba97773a14b96f5e976a52524be4 commit bc1dfc316a2bba97773a14b96f5e976a52524be4 Author: Kyle Evans <kevans@FreeBSD.org> AuthorDate: 2024-11-19 19:51:27 +0000 Commit: Kyle Evans <kevans@FreeBSD.org> CommitDate: 2024-11-19 19:51:27 +0000 daemon: stop rebuilding the kqueue every restart of the child We populate the kqueue with all of four kevents: three signal handlers and one for read of the child pipe. Every time we start the child, we rebuild this kqueue from scratch for the child and tear it down before we exit and check if we need to restart the child. As a consequence, we effectively drop any of the signals we're interested in between restarts. Push the kqueue out into the daemon state to avoid losing any signal events in the process, and reimplement the restart timer in terms of kqueue timers. The pipe read event will be automatically deleted upon last close, which leaves us with only the signal events that really get retained between restarts of the child. PR: 277959 Reviewed by: des, markj Differential Revision: https://reviews.freebsd.org/D47004 usr.sbin/daemon/daemon.c | 121 +++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 101 insertions(+), 20 deletions(-)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=9ab59e900c1dd693b4d14285389a035e81341789 commit 9ab59e900c1dd693b4d14285389a035e81341789 Author: Kyle Evans <kevans@FreeBSD.org> AuthorDate: 2024-11-19 19:51:27 +0000 Commit: Kyle Evans <kevans@FreeBSD.org> CommitDate: 2024-11-19 19:51:27 +0000 daemon: tests: add a test for missed SIGTERM This is somewhaht hard to test reliably, but we'll give it a shot. Startup a sleep(1) daemon with a hefty restart delay. In refactoring of daemon(8), we inadvertently started dropping SIGTERMs that came in while we were waiting to restart the child, so we employ the strategy: - Pop the child sleep(1) first - Wait for sleep(1) to exit (pid file truncated) - Pop the daemon(8) with a SIGTERM - Wait for daemon(8) to exit The pidfile is specifically truncated outside of the event loop so that we don't have a kqueue to catch it in the current model. PR: 277959 Reviewed by: des, markj Differential Revision: https://reviews.freebsd.org/D47005 usr.sbin/daemon/tests/daemon_test.sh | 38 ++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+)
(In reply to Rudolph from comment #14) Sorry, that's entirely on me. The fix will be in this week's main snapshot, I'll give it a couple of days there before MFC'ing it by next week's snapshots- if things go well in stable, I'll see about an EN request after some soak time there.
Thanks for all the work and dedication!
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=e310a825c933182a5b29261b741467f649951ff1 commit e310a825c933182a5b29261b741467f649951ff1 Author: Kyle Evans <kevans@FreeBSD.org> AuthorDate: 2024-11-19 19:51:27 +0000 Commit: Kyle Evans <kevans@FreeBSD.org> CommitDate: 2024-12-10 23:00:03 +0000 daemon: tests: add a test for missed SIGTERM This is somewhaht hard to test reliably, but we'll give it a shot. Startup a sleep(1) daemon with a hefty restart delay. In refactoring of daemon(8), we inadvertently started dropping SIGTERMs that came in while we were waiting to restart the child, so we employ the strategy: - Pop the child sleep(1) first - Wait for sleep(1) to exit (pid file truncated) - Pop the daemon(8) with a SIGTERM - Wait for daemon(8) to exit The pidfile is specifically truncated outside of the event loop so that we don't have a kqueue to catch it in the current model. PR: 277959 Reviewed by: des, markj (cherry picked from commit 9ab59e900c1dd693b4d14285389a035e81341789) usr.sbin/daemon/tests/daemon_test.sh | 38 ++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+)
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=7ea2874eadf901b1187772670169b6fc3a44d917 commit 7ea2874eadf901b1187772670169b6fc3a44d917 Author: Kyle Evans <kevans@FreeBSD.org> AuthorDate: 2024-11-19 19:51:27 +0000 Commit: Kyle Evans <kevans@FreeBSD.org> CommitDate: 2024-12-10 23:05:46 +0000 daemon: stop rebuilding the kqueue every restart of the child We populate the kqueue with all of four kevents: three signal handlers and one for read of the child pipe. Every time we start the child, we rebuild this kqueue from scratch for the child and tear it down before we exit and check if we need to restart the child. As a consequence, we effectively drop any of the signals we're interested in between restarts. Push the kqueue out into the daemon state to avoid losing any signal events in the process, and reimplement the restart timer in terms of kqueue timers. The pipe read event will be automatically deleted upon last close, which leaves us with only the signal events that really get retained between restarts of the child. PR: 277959 Reviewed by: des, markj (cherry picked from commit bc1dfc316a2bba97773a14b96f5e976a52524be4) usr.sbin/daemon/daemon.c | 121 +++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 101 insertions(+), 20 deletions(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=57c60ec86246804a3df5c50ecfd921b39febcfd2 commit 57c60ec86246804a3df5c50ecfd921b39febcfd2 Author: Kyle Evans <kevans@FreeBSD.org> AuthorDate: 2024-11-19 19:51:27 +0000 Commit: Kyle Evans <kevans@FreeBSD.org> CommitDate: 2024-12-10 23:02:40 +0000 daemon: tests: add a test for missed SIGTERM This is somewhaht hard to test reliably, but we'll give it a shot. Startup a sleep(1) daemon with a hefty restart delay. In refactoring of daemon(8), we inadvertently started dropping SIGTERMs that came in while we were waiting to restart the child, so we employ the strategy: - Pop the child sleep(1) first - Wait for sleep(1) to exit (pid file truncated) - Pop the daemon(8) with a SIGTERM - Wait for daemon(8) to exit The pidfile is specifically truncated outside of the event loop so that we don't have a kqueue to catch it in the current model. PR: 277959 Reviewed by: des, markj (cherry picked from commit 9ab59e900c1dd693b4d14285389a035e81341789) usr.sbin/daemon/tests/daemon_test.sh | 38 ++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=4bb1a558a2811a27b1211579cb257a11df49c0e1 commit 4bb1a558a2811a27b1211579cb257a11df49c0e1 Author: Kyle Evans <kevans@FreeBSD.org> AuthorDate: 2024-11-19 19:51:27 +0000 Commit: Kyle Evans <kevans@FreeBSD.org> CommitDate: 2024-12-10 23:06:11 +0000 daemon: stop rebuilding the kqueue every restart of the child We populate the kqueue with all of four kevents: three signal handlers and one for read of the child pipe. Every time we start the child, we rebuild this kqueue from scratch for the child and tear it down before we exit and check if we need to restart the child. As a consequence, we effectively drop any of the signals we're interested in between restarts. Push the kqueue out into the daemon state to avoid losing any signal events in the process, and reimplement the restart timer in terms of kqueue timers. The pipe read event will be automatically deleted upon last close, which leaves us with only the signal events that really get retained between restarts of the child. PR: 277959 Reviewed by: des, markj (cherry picked from commit bc1dfc316a2bba97773a14b96f5e976a52524be4) usr.sbin/daemon/daemon.c | 121 +++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 101 insertions(+), 20 deletions(-)