FreeBSD 10.3-RC3 #0 r297005: Fri Mar 18 21:14:17 AEDT 2016 i386 FreeBSD 10.3-RC3 #0 r297005: Fri Mar 18 16:13:17 AEDT 2016 amd64 As reported to -stable@... I've just upgraded two local servers from 10.3-RC2 to 10.3-RC3. One is i386 (Pentium 4) with 1.5GB memory; the other is amd64 (Xeon E5 x 2) with 16GB. Something seems to be swallowing SIGTERM sometimes. I've never seen this before. rc scripts and service(8) hang when trying to stop some services. If I break out of the script or service command (via ctrl-C) and invoke kill(1), passing the relevant PID, the shell prompt returns immediately but the process doesn't terminate. Sending SIGKILL to the process via kill(1) does its job. I am seeing this with: squid (on both servers) ircd-ratbox (on both servers) git_daemon I am not seeing this (all seems normal) with: cupsd sendmail (on both servers) Looking back through the commits between -RC2 and -RC3, I've decided to try backing out r296976 to see if that may be related. Building now. Will take several hours.
Reversed r296976 and rebuilt (on i386 only). Problem disappeared. In the process of shutting down the i386 system prior to upgrade, I found that mrtg_daemon was another one that wouldn't shut down without kill -9. Sorry, but I'm not going to be able to be around to help with testing anything between now and Monday morning (UTC +11:00). I have left the amd64 system as-is (with problematic SIGTERM) if that can be of any use on Monday.
During a vfork in tcsh the child process changes the signals. Normally this is not an issue but user space signal implementations need special handling due to the child borrowing memory from the parent via the vfork. This was noticed when tcsh as used with libthr. The change to add SAVESIGVEC was supposed to resolve this issue by saving and restoring the signals. Obviously there is an edge case occurring where the signals are not being handled correctly and causing this behavior. I believe what I'm seeing with the squid example are the signals still blocked in the thread state (procstat -j). The other patch discussed (instead of SAVESIGVEC) to resolve the user space signal implementation was to use fork instead of vfork. There is a performance difference between fork and vfork which is why we preferred the SAVESIGVEC change. I just tested a patch to switch over to fork and it appeared to resolve the issue on my system. If you have time please give it a try on your system.
Created attachment 168393 [details] Patch to remove SAVESIGVEC and use fork (instead of vfork)
What are the exact steps to reproduce the SIGTERM problem? As both rc(8) scripts and service(8) use sh(1), it's not obvious how a change to csh(1)/tcsh(1) could affect the former or how to trigger that. Do the signal handling changes in tcsh 6.19.00 solve both problems, i. e. the original one r296416 tried to address as well as SIGTERM, either with SAVESIGVEC enabled/disabled but still using vfork(2)?
(In reply to Mark Peek from comment #2) Thank you Mark. Reverted bin/csh/config_p.h in my wc on the i386 box, applied your patch from comment #3 and rebuilding now. Will report results.
Marius, I have as yet not been able to find a smaller reproduction of this issue. The repro per John's comments is to install/enable squid and then the "service squid stop" will hang. I believe this is due to the thread TERM signal being blocked (viewed via procstat -j). I ran dtrace during the service startup and, as you commented, did not see csh being called directly. I also confirmed the same behavior from a direct login to a bash login shell. # dtrace -qn 'syscall::exec*:return { printf("%Y %s\n",walltimestamp,curpsinfo->pr_psargs); }' 2016 Mar 20 08:45:26 sudo service squid start 2016 Mar 20 08:45:27 /bin/sh /usr/sbin/service squid start 2016 Mar 20 08:45:27 /bin/ps -ww -p 71047 -o jid= 2016 Mar 20 08:45:27 /bin/sh /usr/sbin/service squid start 2016 Mar 20 08:45:27 /bin/sh /usr/sbin/service squid start 2016 Mar 20 08:45:27 /bin/sh /usr/sbin/service squid start 2016 Mar 20 08:45:27 env -i HOME=/ PATH=/sbin:/bin:/usr/sbin:/usr/bin /usr/local/etc/rc.d/squid start 2016 Mar 20 08:45:27 /bin/sh /usr/local/etc/rc.d/squid start 2016 Mar 20 08:45:27 /bin/ps -ww -p 71047 -o jid= 2016 Mar 20 08:45:27 /usr/bin/id -un 2016 Mar 20 08:45:27 /bin/ps -ww -o pid= -o jid= -o command= -p 71006 2016 Mar 20 08:45:27 limits -C daemon su -m squid -c sh -c "cd /var/squid && /usr/local/sbin/squid -f /usr/local/etc/squid/squid.conf " 2016 Mar 20 08:45:27 limits -C daemon su -m squid -c sh -c "cd /var/squid && /usr/local/sbin/squid -f /usr/local/etc/squid/squid.conf " 2016 Mar 20 08:45:27 limits -C daemon su -m squid -c sh -c "cd /var/squid && /usr/local/sbin/squid -f /usr/local/etc/squid/squid.conf " 2016 Mar 20 08:45:27 limits -C daemon su -m squid -c sh -c "cd /var/squid && /usr/local/sbin/squid -f /usr/local/etc/squid/squid.conf " 2016 Mar 20 08:45:27 su -m squid -c sh -c "cd /var/squid && /usr/local/sbin/squid -f /usr/local/etc/squid/squid.conf " 2016 Mar 20 08:45:27 _su -m -c sh -c "cd /var/squid && /usr/local/sbin/squid -f /usr/local/etc/squid/squid.conf " 2016 Mar 20 08:45:27 sh -c cd /var/squid && /usr/local/sbin/squid -f /usr/local/etc/squid/squid.conf 2016 Mar 20 08:45:27 /usr/local/sbin/squid -f /usr/local/etc/squid/squid.conf 2016 Mar 20 08:45:27 /usr/local/sbin/squid -f /usr/local/etc/squid/squid.conf 2016 Mar 20 08:45:27 (squid-1) -f /usr/local/etc/squid/squid.conf 2016 Mar 20 08:45:27 (logfile-daemon) /var/log/squid/access.log ------- The original problem r296416 (MFC r296976) was reported by kib: tom% LD_PRELOAD=libthr.so.3 csh ~ % ls CA bin kp public_html Mail git mm tmp RCS git-private postponed work % ^C tom% There, 'tom%' is the prompt of my zsh. Issue is reproducable without LD_PRELOAD by e.g. using of nss-ldap, which brings libthr.so into the address space as a dependency. Problem is that when vfork use is enabled, csh calls signal(3) in the child. Since libthr maintains the shadow list of signal handlers to avoid calling user-provided handlers in critical sections and for some other reasons, signal(3) overrides the handler both for child and parent. The issue is there probably from the times of 9.0, when davidxu introduced libthr hooking the handlers. ------- The suggested patch by kib was to use fork instead of vfork but the SAVESIGVEC+vfork change fixed the above issue from kib and was in the tcsh code base to handle userland signal handling.
(In reply to Marius Strobl from comment #4) The minimum steps are: Upgrade to 10.3-RC3 $ ksu - # shutdown now (and notice the rc scripts hang on various services) OR # service <foo> stop (where foo is one of the services I mentioned) OR # kill <pid of foo> (where foo is one of the services I mentioned) I imagine the problem is related to root using its default tcsh shell. I don't know the intricacies of this, I'm only a sysadmin. Examples of <foo> are: squid ircd-ratbox git_daemon mrtg_daemon I haven't noticed problems with any base system services. Most other ports are fine.
(In reply to John Marshall from comment #7) I just tried to stop and start each of the jails on the amd64 box in order to identify any more affected services. Here is an updated list (added svnserve and sks): git_daemon ircd-ratbox mrtg_daemon sks svnserve squid
John, thanks for the info. Spot checking some of these (squid, svnserve, mrtg_daemon) use libthr (checked via either ldd or procstat -v). Not sure if that's the correlation on why some have blocked signals (causing the service stop hang) and others don't.
(In reply to Mark Peek from comment #2) Applied the patch from comment #3, rebuilt and upgraded the i386 server. All looks good. Did a 'service <foo> stop' and 'service <foo> start' on each of squid, ircd-ratbox and mrtg_daemon (the other ones aren't installed on that server). Then I did a 'shutdown -r now' and everything shut down in an orderly fashion - no hangs. Thank you. Would you like me to test this on the amd64 server as well? (more of a hassle but much quicker build time and I can do it if it helps)
I've confirmed (via ldd(1)) that all of the services listed in comment #8 link with libthr.so.3. In the case of mrtg, the binary linking against libthr.so.3 is perl.
(In reply to John Marshall from comment #11) What is your login shell ? Does the problem persist if you change the login shell to sh/bash/zsh ? I mean, do the change in /etc/master.passwd, relogin and retry the service commands. It is not enough to spawn sh from your login shell and test (if you use csh there).
(In reply to John Marshall from comment #10) I've reproduced it with amd64 so I don't think it is necessary for you to repro on that architecture as well. Thanks.
(In reply to Konstantin Belousov from comment #12) I created a new login using bash and am seeing the same thing (sudo service). Which is odd given the program call chain from comment #6.
(In reply to Konstantin Belousov from comment #12) I think this might help. I experimented with changing login shell before attempting to kill the daemon and noticed no diference. Not surprising, I suppose, if the daemon was already blocking SIGTERM. I then tried sending SIGKILL to the daemon and STARTING it with a different shell. The signals are no longer blocked. svcs1# svcs1# grep ^root /etc/passwd root:*:0:0:svcs1 &:/root:/bin/csh svcs1# svcs1# service git_daemon start Starting git_daemon. svcs1# pgrep git-daemon 93129 svcs1# ps 93129 PID TT STAT TIME COMMAND 93129 - SsJ 0:00.00 /usr/local/libexec/git-core/git-daemon --strict-paths --base-path=/data/db/git --verbose --reuseaddr --detach --pid-file=/var/run/git/git_daemon.pid /data/db/git/freebsd/pkg.git svcs1# procstat -j 93129 | grep -v '\-\-$' PID TID COMM SIG FLAGS 93129 100315 git-daemon HUP -B 93129 100315 git-daemon QUIT -B 93129 100315 git-daemon TERM -B 93129 100315 git-daemon TSTP -B 93129 100315 git-daemon TTIN -B 93129 100315 git-daemon TTOU -B svcs1# svcs1# kill -9 93129 svcs1# svcs1# cd /etc svcs1# sed -e '/^root/s/csh$/sh/' -i .bak master.passwd svcs1# pwd_mkdb -p master.passwd svcs1# grep ^root /etc/passwd root:*:0:0:svcs1 &:/root:/bin/sh svcs1# svcs1# service git_daemon start Starting git_daemon. svcs1# pgrep git-daemon 93248 svcs1# procstat -j 93248 | grep -v '\-\-$' PID TID COMM SIG FLAGS svcs1# service git_daemon stop Stopping git_daemon. svcs1# svcs1# svcs1# svcs1# pwd /etc svcs1# /usr/local/etc/rc.d/git_daemon start Starting git_daemon. svcs1# pgrep git-daemon 93509 svcs1# procstat -j 93509 | grep -v '\-\-$' PID TID COMM SIG FLAGS svcs1#
(In reply to John Marshall from comment #15) Just to be clear, the tests referred to in comment #15 were on the unpatched amd64 server.
(In reply to John Marshall from comment #15) Right, this is what I expected. The issue is definitely in csh vfork SAVESIGVEC code leaking signal mask after spawn. I am ambivalent about changing csh to use fork/just reverting SAVESIGVEC. In the first case, we could meet another issue, in the second, we definitely get the ldap problem. In fact, I think that a prudent choice is to revert in releng/10.1 only, and then issue EN. The problem is in stable/9 and 9.3, and we want the fix to appear in 9.3, do we ? Meantime, HEAD and stable/10 can test the solution, be it switch to fork() or some fix to SAVESIGVEC code.
(In reply to Konstantin Belousov from comment #17) releng/10.3, of course, sorry for typo.
A commit references this bug: Author: marius Date: Wed Mar 23 00:53:31 UTC 2016 New revision: 297204 URL: https://svnweb.freebsd.org/changeset/base/297204 Log: In preparation for 10.3-RELEASE, revert r296976, i. e. the merge of r296416 (head) and r296969 (stable/10) respectively. With SAVESIGVEC enabled, csh(1) and tcsh(1) leak signal masks after spawning external commands. This causes strange effects like for example SIGTERM not being delivered to rc(8) scripts on shutdown albeit these use sh(1), if csh(1) or tcsh(1) are used as login shell of root. As such r296976 causes way more problems than it solves. It is anticipated that a proper changeset for the original problem will be issued as an Errata Notice post-10.3-RELEASE. PR: 208132 Approved by: re (gjb) Changes: _U releng/10.3/ releng/10.3/bin/csh/config_p.h
Does this problem only apply of 10.3 ? I Just upgraded to 10-STABLE r297184 and am seeing the same behaviour - certain processes no longer terminating on shutdown (memcachd for example). I haven't investigated closely yet, but it seems co-inicdnetal that it should happen just as this was being investigated in the 10.3 branch.
(In reply to pete from comment #20) A fix needs to be applied to both -current and stable/10 as well. I'm planning to switch from the SAVESIGVEC change to using the safer fork fix via the patch in comment #3.
Is planned to rollback commit 296969 in stable/10?
A commit references this bug: Author: mp Date: Thu Apr 7 14:25:15 UTC 2016 New revision: 297673 URL: https://svnweb.freebsd.org/changeset/base/297673 Log: Revert r296416 by removing SAVESIGVEC and switching to fork instead. This fixes usage with system libraries which maintain their own signal state. PR: 208132 Obtained from: kib MFC after: 3 days Changes: head/bin/csh/config_p.h
*** Bug 207873 has been marked as a duplicate of this bug. ***
A commit references this bug: Author: mp Date: Sun Apr 10 18:12:04 UTC 2016 New revision: 297789 URL: https://svnweb.freebsd.org/changeset/base/297789 Log: MFC 297673: Revert r296969 by removing SAVESIGVEC and switching to fork instead. This fixes usage with system libraries which maintain their own signal state. PR: 208132 Changes: _U stable/10/ stable/10/bin/csh/config_p.h
Just a few questions... The bug was marked as closed/fixed, but I'm still seeing the behaviour described in #207873 (closed as duplicate of this bug) on 10.2 and 10.3 (both having winbind in /etc/nsswitch.conf). I checked bin/csh/config_p.h in 10.2, 10.3 and 10-STABLE and it's identical in all three versions. In all of them it seems the above patch was half-applied, as VFORK (and not vfork) is still defined, but SAVESIGVEC does not appear anywhere. So: _ is #207873 really fixed too? _ should I apply the missing part of the attached patch?