As smokeping runs as a daemon, at some point it loses its process name. This means that check_pidfile won't be able to find the PID of the main process (even the 2nd call in smokeping_check_pidfile, the one with $command_interpreter). So smokeping is considered dead by `service smokeping status` and my Puppet run attempts to start a new one. The fix from bug 221009 is not enough in such cases.

I found out what happens by running the following loop for a few days:

while true; do date; ps up $(pgrep -u smokeping); sleep 1; done

Below you can see the transition happening. The PID is the same, so the process is not restarted. It just loses its command line.

Fri Aug 4 12:44:45 UTC 2017
USER      PID   %CPU %MEM VSZ    RSS   TT STAT STARTED TIME    COMMAND
smokeping 17248 0.0  0.0  12364  1544  -  S    12:44PM 0:00.00 /usr/bin/fping -C 9 -q -B1 -r1 -i10 -p6000 A.A.A.A B.B.B.B C.C.C.C
smokeping 45265 0.0  0.2  210384 14128 -  Is   2:20PM  0:00.01 /usr/local/bin/perl /usr/local/bin/smokeping --master-url=http://XXX:8080/ --cache-dir=/usr/local/var/smokeping/ --shared-secr
smokeping 45266 0.0  0.3  210384 16620 -  S    2:20PM  0:07.75 /usr/local/bin/smokeping [FPing] (perl)

Fri Aug 4 12:44:46 UTC 2017
USER      PID   %CPU %MEM VSZ    RSS   TT STAT STARTED TIME    COMMAND
smokeping 17248 0.0  0.0  12364  1544  -  S    12:44PM 0:00.00 /usr/bin/fping -C 9 -q -B1 -r1 -i10 -p6000 A.A.A.A B.B.B.B C.C.C.C
smokeping 45265 0.0  0.2  210384 14128 -  Is   2:20PM  0:00.01 /usr/local/bin/perl /usr/local/bin/smokeping --master-url=http://XXX:8080/ --cache-dir=/usr/local/var/smokeping/ --shared-secr
smokeping 45266 0.0  0.3  210384 16620 -  S    2:20PM  0:07.75 /usr/local/bin/smokeping [FPing] (perl)

Fri Aug 4 12:44:47 UTC 2017
USER      PID   %CPU %MEM VSZ    RSS   TT STAT STARTED TIME    COMMAND
smokeping 17248 0.0  0.0  12364  1320  -  S    12:44PM 0:00.00 /usr/bin/fping -C 9 -q -B1 -r1 -i10 -p6000 A.A.A.A B.B.B.B C.C.C.C
smokeping 45265 0.0  0.0  210384 0     -  IWs  -       0:00.00 (perl)
smokeping 45266 0.0  0.1  210384 9284  -  S    2:20PM  0:07.75 /usr/local/bin/smokeping [FPing] (perl)

Fri Aug 4 12:44:48 UTC 2017
USER      PID   %CPU %MEM VSZ    RSS   TT STAT STARTED TIME    COMMAND
smokeping 17248 0.0  0.0  12364  1320  -  S    12:44PM 0:00.00 /usr/bin/fping -C 9 -q -B1 -r1 -i10 -p6000 A.A.A.A B.B.B.B C.C.C.C
smokeping 45265 0.0  0.0  210384 0     -  IWs  -       0:00.00 (perl)
smokeping 45266 0.0  0.1  210384 9284  -  S    2:20PM  0:07.75 /usr/local/bin/smokeping [FPing] (perl)

Is losing the command-line arguments an intrinsic property of swapped-out processes? Being swapped out (RSS 0, STAT IWs) is the thing that changed for this process at that point. Also, it seems that when I managed to unswap the process, it recovered its process name:

[root@XXXX ~]% ps up 45265
USER      PID   %CPU %MEM VSZ    RSS  TT STAT STARTED TIME    COMMAND
smokeping 45265 0.0  0.0  210384 0    -  IWs  -       0:00.00 (perl)
[root@XXXX ~]% kill -HUP 45265
[root@XXXX ~]% ps up 45265
USER      PID   %CPU %MEM VSZ    RSS  TT STAT STARTED TIME    COMMAND
smokeping 45265 0.0  0.1  210384 8160 -  DLs  3Aug17  0:00.02 /usr/local/bin/perl /usr/local/bin/smokeping --master-url=http://XXX:8080/ --cache-dir=/usr/local/var/smokeping/ --shared-secre

That would mean that we must trust the PID in /usr/local/var/smokeping/pid and not perform any additional checks on the command line or interpreter. Another option would be to not use smokeping's own daemonization, run it under daemon(8) instead, and use daemon(8)'s PID for management.
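
To make the first option concrete, here is a minimal sketch of a status check that trusts the pidfile and only asks the kernel whether that PID exists, without comparing the command line or interpreter. `pid_is_alive` is a hypothetical helper name, not something rc.subr provides, and the sketch assumes it runs as root (as rc scripts do), since `kill -0` fails with a permission error when a non-root user probes another user's process.

```shell
#!/bin/sh
# Sketch only: decide liveness from the pidfile alone, ignoring the
# command line that ps reports (which is "(perl)" while swapped out).
# pid_is_alive is a hypothetical helper, not part of rc.subr.
pid_is_alive() {
    pidfile=$1
    pid=

    # No readable pidfile means we cannot claim the daemon is running.
    [ -r "$pidfile" ] || return 1

    # read returns non-zero if the file has no trailing newline,
    # but still fills $pid, so its status is deliberately ignored.
    read -r pid < "$pidfile" || true

    # Reject empty or non-numeric content.
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac

    # Signal 0 sends nothing; it only checks that the PID exists
    # and that we are allowed to signal it.
    kill -0 "$pid" 2>/dev/null
}

# Example (path taken from this report):
#   pid_is_alive /usr/local/var/smokeping/pid && echo "smokeping is running"
```

The obvious weakness is PID reuse: if smokeping dies and an unrelated process later gets the same PID, this check reports it as running, which is exactly what the command-line comparison was meant to prevent.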
Hi,

Maybe I'm wrong, but 45266 is the smokeping daemon, right? And I don't see any changes over the time. If you are still running into this issue, run ps with the ppid column included, to check that we are not tracking a fork. In the meantime, I have a running smokeping and I am performing regular checks on ps and the pid.

Cheers
I'm tracking the master process. Since it is not a bug in smokeping itself, I opened another ticket, 222147, for rc.subr, because this situation can happen to any daemon which gets swapped out. I don't understand what you mean by "And I don't see any changes over the time.". Look at each ps output: the main process (PID 45265) changes its name at one point.
Created attachment 188995 rc script using daemon(8) as a wrapper
I'm attaching a modified rc script which, instead of using Smokeping's native daemonization, uses daemon(8). daemon(8) itself runs with a shorter command line and thus should not be prone to the issue described in PR #222147. The script is not well tested (only in slave mode), and I know that the reload part does not really work. If you believe this is the way to go, I can develop it further.