Bug 214643

Summary: net-mgmt/telegraf: service (re)start hangs when scripted: missing option -f with daemon(8)
Product: Ports & Packages Reporter: Mark.Martinec
Component: Individual Port(s)Assignee: Palle Girgensohn <girgen>
Status: Closed DUPLICATE    
Severity: Affects Some People CC: cheffo
Priority: --- Flags: bugzilla: maintainer-feedback? (girgen)
Version: Latest   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
A patch to add the -f daemon(8) flag to net-mgmt/telegraf/files/telegraf.in none

Description Mark.Martinec 2016-11-18 23:27:13 UTC
Created attachment 177161 [details]
A patch to add the -f daemon(8) flag to net-mgmt/telegraf/files/telegraf.in

I came across this issue when trying to manage a service 'telegraf'
with SaltStack (i.e. sysutils/py-salt). I would expect the same
problem can occur under other management tools like Puppet/Chef/Ansible.

What happens is that when a salt-minion tries to start or restart
a telegraf process by spawning a command: 'service telegraf restart',
the process just hangs, waiting for a restart to happen, even though
the telegraf process does restart successfully and is again running
underneath.

Investigating the issue boils down to this simple and repeatable
test case, which mimics the environment under salt-minion (a Python
process), feeding a stdout to a pipe:

  # service telegraf restart | cat

  Stopping telegraf.
  Waiting for PIDS: 60067.
  Starting telegraf.
    (and hangs, even though the telegraf daemon has indeed
     successfully restarted meanwhile)

With rc_debug="YES" the same thing looks like:  (wrapped for clarity)

  # service telegraf restart | cat

  /usr/local/etc/rc.d/telegraf: DEBUG: checkyesno: telegraf_enable is
    set to YES.
  Stopping telegraf.
  /usr/local/etc/rc.d/telegraf: DEBUG: run_rc_command: doit:
    kill -TERM 919 Waiting for PIDS: 919.
  /usr/local/etc/rc.d/telegraf: DEBUG: pid file (/var/run/telegraf.pid):
    not readable.
  /usr/local/etc/rc.d/telegraf: DEBUG: checkyesno: telegraf_enable is
    set to YES.
  /usr/local/etc/rc.d/telegraf: DEBUG: run_rc_command: start_precmd:
   telegraf_prestart
  Starting telegraf.
  /usr/local/etc/rc.d/telegraf: DEBUG: run_rc_command: doit:
    /usr/sbin/daemon  -crP /var/run/telegraf.pid /usr/local/bin/telegraf  -
      config=/usr/local/etc/telegraf.conf 2>> /var/log/telegraf.log
(and hangs)

The problem is that the rc.d script for a telegraf package uses
the daemon(8) utility to run the telegraf process, but forgets
to specify the '-f' option to /usr/sbin/daemon.

$ man daemon
   daemon — run detached from the controlling terminal
   -f      Redirect standard input, standard output and
           standard error to /dev/null.
   [...] If the -p, -P or -r option is specified the program
   is executed in a spawned child process.

So what happens is:
- The process executing the daemon(8) program (as started by
  the 'service' command) has its stdout directed to a pipe;
- The daemon(8) program disassociates from a controlling terminal
  but DOES NOT close its stdout, and then forks;
- The forked daemon(8) subprocess inherits the open fd to a pipe;
- The parent daemon(8) then exits, but the child daemon(8) still
  has a pipe open, connected to cat(1) (or to salt-minion);
- the child daemon(8) process then exec's the telegraf program,
  which again inherits the pipe as its stdout.

So even though the parent daemon(8) process no longer exists and
the 'service' command could now be expected to finish, the sending
side of the pipe is still connected to a (re)started telegraf process,
and the 'cat' keeps hanging indefinitely, waiting for the pipe on
its stdin to close, which never happens.

The fix is trivial, just add the option -f to the daemon(8)
command on the rc.d script, forcing it to close stdout and stderr
before forking:

 --- /usr/local/etc/rc.d/telegraf~       2016-11-18 23:53:57.046298000 +0100
+++ /usr/local/etc/rc.d/telegraf        2016-11-18 23:54:17.675132000 +0100
@@ -31,3 +31,3 @@
 command=/usr/sbin/daemon
-command_args="-crP ${pidfile} /usr/local/bin/${name} ${telegraf_flags} -config=${telegraf_conf} 2>> /var/log/telegraf.log"
+command_args="-f -crP ${pidfile} /usr/local/bin/${name} ${telegraf_flags} -config=${telegraf_conf} 2>> /var/log/telegraf.log"

(the patch is attached)


Considering that there are at least a dozen of ports using a daemon(8)
utility but forgetting to specify the -f option, I wonder if it would
not be wiser to change the daemon(8) utility to close by default the
stdin/stdout/stderr after disassociating from a controlling terminal
but before forking, as this is something that one expects as an
essential step in daemonizing a process.  IMO this may be preferred
to filing a dozen of bug reports similar to this one for various other
ports, even at the expense of incompatibility.
Comment 1 cheffo 2016-11-22 21:40:41 UTC
Please see bug 214738 - it contains fix and also update to latest version of telegraf. 
Thanks!
Comment 2 Palle Girgensohn freebsd_committer freebsd_triage 2016-11-23 00:17:07 UTC
Hi Stefan,

Great, well spotted. 

Adding the -f option does render the redirects meaningless though, just redirecting /dev/null >> /var/log/telegraf

Now, one could argue that given the stupid log level of telegraf, this is a good thing... :)  But then why not just not redirect?

Isn't the fix suggested in 213412 the correct one, though? Redirecting both stdin and stderr closes the file descriptors before forking. Then we will not miss the logs from telegraf. Adding -quiet to telegraf_flags seems reasonable as well to shut up the endless tirades of meaingless logs.
Comment 3 Palle Girgensohn freebsd_committer freebsd_triage 2016-11-23 00:17:52 UTC

*** This bug has been marked as a duplicate of bug 213412 ***
Comment 4 commit-hook freebsd_committer freebsd_triage 2016-11-23 00:24:34 UTC
A commit references this bug:

Author: girgen
Date: Wed Nov 23 00:23:49 UTC 2016
New revision: 426880
URL: https://svnweb.freebsd.org/changeset/ports/426880

Log:
  Update telegraf to 1.1.1

  Make sure both stdin and stderr are redirected since daemon(8) forks and this
  would otherwise leave scripted restarts haning. See #213412 and #214643 for a
  more complete discussion.

  PR:		214738, 213412, 214643
  Submitted by:	Stefan Lambrev

Changes:
  head/net-mgmt/telegraf/Makefile
  head/net-mgmt/telegraf/distinfo
  head/net-mgmt/telegraf/files/telegraf.in