Bug 208132

Summary: releng/10.3 (RC3) strange SIGTERM behaviour
Product: Base System
Component: bin
Status: Closed FIXED
Severity: Affects Only Me
Priority: ---
Version: 10.3-BETA2
Hardware: Any
OS: Any
Reporter: John Marshall <john>
Assignee: Mark Peek <mp>
CC: KOT, ebay, fbsd-bugzilla, kbowling, kib, ml, olivier.nicole, pete, pi, re
Attachments:
  Patch to remove SAVESIGVEC and use fork (instead of vfork)

Description John Marshall 2016-03-19 03:56:39 UTC
FreeBSD 10.3-RC3 #0 r297005: Fri Mar 18 21:14:17 AEDT 2016  i386
FreeBSD 10.3-RC3 #0 r297005: Fri Mar 18 16:13:17 AEDT 2016  amd64

As reported to -stable@...

I've just upgraded two local servers from 10.3-RC2 to 10.3-RC3.  One is
i386 (Pentium 4) with 1.5GB memory; the other is amd64 (Xeon E5 x 2)
with 16GB.  Something seems to be swallowing SIGTERM sometimes.  I've
never seen this before.

rc scripts and service(8) hang when trying to stop some services.  If I
break out of the script or service command (via ctrl-C) and invoke
kill(1), passing the relevant PID, the shell prompt returns immediately
but the process doesn't terminate.  Sending SIGKILL to the process via
kill(1) does its job.

I am seeing this with:
  squid        (on both servers)
  ircd-ratbox  (on both servers)
  git_daemon

I am not seeing this (all seems normal) with:
  cupsd
  sendmail     (on both servers)

Looking back through the commits between -RC2 and -RC3, I've decided to try backing out r296976 to see whether it is related. Building now; it will take several hours.
Comment 1 John Marshall 2016-03-19 07:48:13 UTC
Reversed r296976 and rebuilt (on i386 only). Problem disappeared.

In the process of shutting down the i386 system prior to upgrade, I found that mrtg_daemon was another one that wouldn't shut down without kill -9.

Sorry, but I won't be around to help with testing anything between now and Monday morning (UTC +11:00). I have left the amd64 system as-is (with the problematic SIGTERM behaviour) if that can be of any use on Monday.
Comment 2 Mark Peek freebsd_committer freebsd_triage 2016-03-19 18:10:10 UTC
During a vfork in tcsh, the child process changes the signals. Normally this is not an issue, but user-space signal implementations need special handling because the child borrows memory from the parent via vfork. This was noticed when tcsh was used with libthr. The change to add SAVESIGVEC was supposed to resolve this issue by saving and restoring the signals.

Clearly there is an edge case where the signals are not being handled correctly, causing this behavior. I believe what I'm seeing with the squid example is that the signals are still blocked in the thread state (procstat -j).

The other patch discussed (instead of SAVESIGVEC) to resolve the user-space signal problem was to use fork instead of vfork. There is a performance difference between fork and vfork, which is why we preferred the SAVESIGVEC change.
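
A minimal sketch (illustration only, not the csh/tcsh code) of the "borrowing memory" point: under vfork the child runs in the parent's address space, so anything it writes to user-space data, such as a shadow signal-handler table, is the parent's data too; under fork the child would get its own copy.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Illustration only: modifying memory other than the pid variable in a
 * vfork() child is formally undefined behavior, which is precisely why
 * user-space signal state touched by the child can leak into the parent.
 */
static volatile int shadow_slot = 0;    /* stand-in for userland signal bookkeeping */

int
main(void)
{
    pid_t pid = vfork();

    if (pid == 0) {
        shadow_slot = 1;        /* child scribbles on shared state */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    /* With vfork the parent prints 1; with fork it would still print 0. */
    printf("parent sees shadow_slot = %d\n", shadow_slot);
    return (0);
}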

I just tested a patch to switch over to fork and it appeared to resolve the issue on my system. If you have time please give it a try on your system.
Comment 3 Mark Peek freebsd_committer freebsd_triage 2016-03-19 18:11:15 UTC
Created attachment 168393 [details]
Patch to remove SAVESIGVEC and use fork (instead of vfork)
Comment 4 Marius Strobl freebsd_committer freebsd_triage 2016-03-20 15:46:35 UTC
What are the exact steps to reproduce the SIGTERM problem? As both rc(8) scripts and service(8) use sh(1), it's not obvious how a change to csh(1)/tcsh(1) could affect the former or how to trigger that.

Do the signal handling changes in tcsh 6.19.00 solve both problems, i.e. the original one r296416 tried to address as well as the SIGTERM one, either with SAVESIGVEC enabled or disabled but still using vfork(2)?
Comment 5 John Marshall 2016-03-20 20:16:28 UTC
(In reply to Mark Peek from comment #2)
Thank you Mark. Reverted bin/csh/config_p.h in my working copy on the i386 box, applied your patch from comment #3, and am rebuilding now. Will report results.
Comment 6 Mark Peek freebsd_committer freebsd_triage 2016-03-20 22:14:51 UTC
Marius, I have not yet been able to find a smaller reproduction of this issue. The repro, per John's comments, is to install and enable squid; "service squid stop" will then hang. I believe this is because the TERM signal is blocked in the thread state (viewed via procstat -j). I ran dtrace during the service startup and, as you noted, did not see csh being called directly. I also confirmed the same behavior from a direct login with a bash login shell.

# dtrace -qn 'syscall::exec*:return { printf("%Y %s\n",walltimestamp,curpsinfo->pr_psargs); }'
2016 Mar 20 08:45:26 sudo service squid start
2016 Mar 20 08:45:27 /bin/sh /usr/sbin/service squid start
2016 Mar 20 08:45:27 /bin/ps -ww -p 71047 -o jid=
2016 Mar 20 08:45:27 /bin/sh /usr/sbin/service squid start
2016 Mar 20 08:45:27 /bin/sh /usr/sbin/service squid start
2016 Mar 20 08:45:27 /bin/sh /usr/sbin/service squid start
2016 Mar 20 08:45:27 env -i HOME=/ PATH=/sbin:/bin:/usr/sbin:/usr/bin /usr/local/etc/rc.d/squid start
2016 Mar 20 08:45:27 /bin/sh /usr/local/etc/rc.d/squid start
2016 Mar 20 08:45:27 /bin/ps -ww -p 71047 -o jid=
2016 Mar 20 08:45:27 /usr/bin/id -un
2016 Mar 20 08:45:27 /bin/ps -ww -o pid= -o jid= -o command= -p 71006
2016 Mar 20 08:45:27 limits -C daemon su -m squid -c sh -c "cd /var/squid && /usr/local/sbin/squid  -f /usr/local/etc/squid/squid.conf "
2016 Mar 20 08:45:27 limits -C daemon su -m squid -c sh -c "cd /var/squid && /usr/local/sbin/squid  -f /usr/local/etc/squid/squid.conf "
2016 Mar 20 08:45:27 limits -C daemon su -m squid -c sh -c "cd /var/squid && /usr/local/sbin/squid  -f /usr/local/etc/squid/squid.conf "
2016 Mar 20 08:45:27 limits -C daemon su -m squid -c sh -c "cd /var/squid && /usr/local/sbin/squid  -f /usr/local/etc/squid/squid.conf "
2016 Mar 20 08:45:27 su -m squid -c sh -c "cd /var/squid && /usr/local/sbin/squid  -f /usr/local/etc/squid/squid.conf "
2016 Mar 20 08:45:27 _su -m -c sh -c "cd /var/squid && /usr/local/sbin/squid  -f /usr/local/etc/squid/squid.conf "
2016 Mar 20 08:45:27 sh -c cd /var/squid && /usr/local/sbin/squid  -f /usr/local/etc/squid/squid.conf
2016 Mar 20 08:45:27 /usr/local/sbin/squid -f /usr/local/etc/squid/squid.conf
2016 Mar 20 08:45:27 /usr/local/sbin/squid -f /usr/local/etc/squid/squid.conf
2016 Mar 20 08:45:27 (squid-1) -f /usr/local/etc/squid/squid.conf
2016 Mar 20 08:45:27 (logfile-daemon) /var/log/squid/access.log

-------
The original problem addressed by r296416 (MFC'd as r296976) was reported by kib:
tom% LD_PRELOAD=libthr.so.3 csh                                               ~
% ls
CA              bin             kp              public_html
Mail            git             mm              tmp
RCS             git-private     postponed       work
% ^C
tom%

There, 'tom%' is the prompt of my zsh.  The issue is reproducible without
LD_PRELOAD by e.g. using nss-ldap, which brings libthr.so into the
address space as a dependency.

The problem is that when vfork use is enabled, csh calls signal(3) in
the child.  Since libthr maintains a shadow list of signal handlers
(to avoid calling user-provided handlers in critical sections, and for
some other reasons), signal(3) overrides the handler for both child and
parent.  The issue has probably existed since 9.0, when davidxu
introduced the handler hooking in libthr.
-------

The patch kib suggested was to use fork instead of vfork, but the SAVESIGVEC+vfork change fixed the issue above as well and was already present in the tcsh code base for handling userland signal implementations.
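
For context, here is a minimal sketch (not the actual csh/tcsh spawn code and not the attached patch, just the general shape) of why a fork-based spawn side-steps the problem: after fork the child has its own copy of the address space, so signal(3) calls made there, even with libthr's user-space handler list loaded, cannot alter the parent's copy.

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Sketch of a fork-based spawn: the child resets signal dispositions and
 * execs the command.  Because fork copies the address space, any user-space
 * handler bookkeeping (e.g. a libthr-style shadow list) changed here stays
 * in the child; the parent's state is untouched.  With vfork the same
 * signal(3) calls would run in the parent's memory.
 */
static pid_t
spawn(char *const argv[])
{
    pid_t pid = fork();

    if (pid == 0) {
        (void)signal(SIGINT, SIG_DFL);  /* safe: affects only the child's copy */
        (void)signal(SIGQUIT, SIG_DFL);
        execvp(argv[0], argv);
        _exit(127);
    }
    return (pid);
}

int
main(void)
{
    char *argv[] = { "echo", "spawned", NULL };
    int status;

    waitpid(spawn(argv), &status, 0);
    return (0);
}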
Comment 7 John Marshall 2016-03-20 22:31:09 UTC
(In reply to Marius Strobl from comment #4)
The minimum steps are:
  Upgrade to 10.3-RC3
  $ ksu -
  # shutdown now (and notice the rc scripts hang on various services)

OR
  # service <foo> stop (where foo is one of the services I mentioned)

OR
  # kill <pid of foo> (where foo is one of the services I mentioned)

I imagine the problem is related to root using its default tcsh shell. I don't know the intricacies of this, I'm only a sysadmin.

Examples of <foo> are:
  squid
  ircd-ratbox
  git_daemon
  mrtg_daemon

I haven't noticed problems with any base system services. Most other ports are fine.
Comment 8 John Marshall 2016-03-20 23:34:28 UTC
(In reply to John Marshall from comment #7)
I just tried to stop and start each of the jails on the amd64 box in order to identify any more affected services. Here is an updated list (added svnserve and sks):
  git_daemon
  ircd-ratbox
  mrtg_daemon
  sks
  svnserve
  squid
Comment 9 Mark Peek freebsd_committer freebsd_triage 2016-03-21 00:40:40 UTC
John, thanks for the info.

Spot-checking shows that some of these (squid, svnserve, mrtg_daemon) use libthr (checked via either ldd or procstat -v). I'm not sure whether that is the correlation explaining why some have blocked signals (causing the service stop hang) and others don't.
Comment 10 John Marshall 2016-03-21 00:51:39 UTC
(In reply to Mark Peek from comment #2)
Applied the patch from comment #3, rebuilt and upgraded the i386 server. All looks good. Did a 'service <foo> stop' and 'service <foo> start' on each of squid, ircd-ratbox and mrtg_daemon (the other ones aren't installed on that server). Then I did a 'shutdown -r now' and everything shut down in an orderly fashion - no hangs.

Thank you.

Would you like me to test this on the amd64 server as well? (more of a hassle but much quicker build time and I can do it if it helps)
Comment 11 John Marshall 2016-03-21 01:05:37 UTC
I've confirmed (via ldd(1)) that all of the services listed in comment #8 link with libthr.so.3. In the case of mrtg, the binary linking against libthr.so.3 is perl.
Comment 12 Konstantin Belousov freebsd_committer freebsd_triage 2016-03-21 03:21:27 UTC
(In reply to John Marshall from comment #11)
What is your login shell?  Does the problem persist if you change the login shell to sh/bash/zsh?  I mean, make the change in /etc/master.passwd, relogin, and retry the service commands.  It is not enough to spawn sh from your login shell and test (if you use csh there).
Comment 13 Mark Peek freebsd_committer freebsd_triage 2016-03-21 04:13:25 UTC
(In reply to John Marshall from comment #10)
I've reproduced it on amd64, so I don't think it is necessary for you to repro on that architecture as well. Thanks.
Comment 14 Mark Peek freebsd_committer freebsd_triage 2016-03-21 04:16:03 UTC
(In reply to Konstantin Belousov from comment #12)
I created a new login using bash and am seeing the same thing (sudo service), which is odd given the program call chain from comment #6.
Comment 15 John Marshall 2016-03-21 04:24:38 UTC
(In reply to Konstantin Belousov from comment #12)
I think this might help. I experimented with changing the login shell before attempting to kill the daemon and noticed no difference. Not surprising, I suppose, if the daemon was already blocking SIGTERM. I then tried sending SIGKILL to the daemon and STARTING it with a different shell. The signals are no longer blocked.

svcs1# 
svcs1# grep ^root /etc/passwd
root:*:0:0:svcs1 &:/root:/bin/csh
svcs1# 
svcs1# service git_daemon start
Starting git_daemon.
svcs1# pgrep git-daemon
93129
svcs1# ps 93129
  PID TT  STAT    TIME COMMAND
93129  -  SsJ  0:00.00 /usr/local/libexec/git-core/git-daemon --strict-paths --base-path=/data/db/git --verbose --reuseaddr --detach --pid-file=/var/run/git/git_daemon.pid /data/db/git/freebsd/pkg.git
svcs1# procstat -j 93129 | grep -v '\-\-$'
  PID    TID COMM             SIG     FLAGS
93129 100315 git-daemon       HUP      -B
93129 100315 git-daemon       QUIT     -B
93129 100315 git-daemon       TERM     -B
93129 100315 git-daemon       TSTP     -B
93129 100315 git-daemon       TTIN     -B
93129 100315 git-daemon       TTOU     -B
svcs1# 
svcs1# kill -9 93129 
svcs1# 
svcs1# cd /etc
svcs1# sed -e '/^root/s/csh$/sh/' -i .bak master.passwd
svcs1# pwd_mkdb -p master.passwd
svcs1# grep ^root /etc/passwd
root:*:0:0:svcs1 &:/root:/bin/sh
svcs1# 
svcs1# service git_daemon start
Starting git_daemon.
svcs1# pgrep git-daemon
93248
svcs1# procstat -j 93248 | grep -v '\-\-$'
  PID    TID COMM             SIG     FLAGS
svcs1# service git_daemon stop
Stopping git_daemon.
svcs1# 
svcs1# 
svcs1# 
svcs1# pwd
/etc
svcs1# /usr/local/etc/rc.d/git_daemon start
Starting git_daemon.
svcs1# pgrep git-daemon
93509
svcs1# procstat -j 93509 | grep -v '\-\-$'
  PID    TID COMM             SIG     FLAGS
svcs1#
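
To make the "B" flag in the procstat -j output above concrete, here is a small diagnostic sketch (a hypothetical helper, not part of the bug report) that prints whether SIGTERM is currently blocked in the calling process. Because the signal mask is inherited across fork() and preserved across execve(), a mask leaked by the spawning shell is still in effect in the daemon, which is exactly what procstat -j reports here.

#include <signal.h>
#include <stdio.h>

/*
 * Print whether SIGTERM is in this process's blocked mask, i.e. the same
 * information procstat -j shows as the "B" flag.  Run it from the shell
 * under test, or start it from an rc script, to see whether a leaked mask
 * is being inherited.
 */
int
main(void)
{
    sigset_t mask;

    if (sigprocmask(SIG_BLOCK, NULL, &mask) != 0) {
        perror("sigprocmask");
        return (1);
    }
    printf("SIGTERM %s blocked\n",
        sigismember(&mask, SIGTERM) ? "is" : "is not");
    return (0);
}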
Comment 16 John Marshall 2016-03-21 04:48:59 UTC
(In reply to John Marshall from comment #15)
Just to be clear, the tests referred to in comment #15 were on the unpatched amd64 server.
Comment 17 Konstantin Belousov freebsd_committer freebsd_triage 2016-03-21 04:52:55 UTC
(In reply to John Marshall from comment #15)
Right, this is what I expected.

The issue is definitely in the csh vfork SAVESIGVEC code leaking the signal mask after spawn.

I am ambivalent about changing csh to use fork vs. just reverting SAVESIGVEC.  In the first case, we could hit another issue; in the second, we definitely get the LDAP problem back.

In fact, I think that a prudent choice is to revert in releng/10.1 only, and then issue an EN.  The problem is in stable/9 and 9.3, and we want the fix to appear in 9.3, do we?  In the meantime, HEAD and stable/10 can test the solution, be it a switch to fork() or some fix to the SAVESIGVEC code.
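
The pattern at issue looks roughly like the sketch below (an illustration, not the SAVESIGVEC code): the parent must restore its signal mask after the vfork/exec on every path; any path that skips the restore leaves the altered mask in place, and every command the shell spawns afterwards inherits it, which matches the blocked TERM seen in the affected daemons.

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Illustration of the save/restore discipline a vfork-based spawn needs
 * (not the csh SAVESIGVEC code).  If any return path misses the final
 * sigprocmask(SIG_SETMASK, &saved, NULL), SIGTERM stays blocked in the
 * shell and is inherited by every later child across fork and exec.
 */
static void
spawn_with_mask(char *const argv[])
{
    sigset_t block, saved;
    pid_t pid;

    sigemptyset(&block);
    sigaddset(&block, SIGTERM);
    sigaddset(&block, SIGINT);
    sigprocmask(SIG_BLOCK, &block, &saved);     /* save the original mask */

    pid = vfork();
    if (pid == 0) {
        sigprocmask(SIG_SETMASK, &saved, NULL); /* child execs with a clean mask */
        execvp(argv[0], argv);
        _exit(127);
    }
    waitpid(pid, NULL, 0);
    /* The critical step: restore the parent's mask on every path. */
    sigprocmask(SIG_SETMASK, &saved, NULL);
}

int
main(void)
{
    char *argv[] = { "echo", "done", NULL };

    spawn_with_mask(argv);
    return (0);
}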
Comment 18 Konstantin Belousov freebsd_committer freebsd_triage 2016-03-21 04:53:37 UTC
(In reply to Konstantin Belousov from comment #17)
releng/10.3, of course, sorry for typo.
Comment 19 commit-hook freebsd_committer freebsd_triage 2016-03-23 00:54:27 UTC
A commit references this bug:

Author: marius
Date: Wed Mar 23 00:53:31 UTC 2016
New revision: 297204
URL: https://svnweb.freebsd.org/changeset/base/297204

Log:
  In preparation for 10.3-RELEASE, revert r296976, i. e. the merge of
  r296416 (head) and r296969 (stable/10) respectively. With SAVESIGVEC
  enabled, csh(1) and tcsh(1) leak signal masks after spawning external
  commands. This causes strange effects like for example SIGTERM not
  being delivered to rc(8) scripts on shutdown albeit these use sh(1),
  if csh(1) or tcsh(1) are used as login shell of root. As such r296976
  causes way more problems than it solves.
  It is anticipated that a proper changeset for the original problem
  will be issued as an Errata Notice post-10.3-RELEASE.

  PR:		208132
  Approved by:	re (gjb)

Changes:
_U  releng/10.3/
  releng/10.3/bin/csh/config_p.h
Comment 20 Pete French 2016-03-24 19:51:27 UTC
Does this problem only apply to 10.3? I just upgraded to 10-STABLE r297184 and am seeing the same behaviour: certain processes no longer terminate on shutdown (memcached, for example). I haven't investigated closely yet, but it seems coincidental that it should happen just as this was being investigated in the 10.3 branch.
Comment 21 Mark Peek freebsd_committer freebsd_triage 2016-03-27 18:37:51 UTC
(In reply to pete from comment #20)
A fix needs to be applied to -current and stable/10 as well. I'm planning to switch from the SAVESIGVEC change to the safer fork fix via the patch in comment #3.
Comment 22 Dmitry Afanasiev 2016-04-07 08:36:56 UTC
Is it planned to roll back commit r296969 in stable/10?
Comment 23 commit-hook freebsd_committer freebsd_triage 2016-04-07 14:25:28 UTC
A commit references this bug:

Author: mp
Date: Thu Apr  7 14:25:15 UTC 2016
New revision: 297673
URL: https://svnweb.freebsd.org/changeset/base/297673

Log:
  Revert r296416 by removing SAVESIGVEC and switching to fork instead. This
  fixes usage with system libraries which maintain their own signal state.

  PR:             208132
  Obtained from:  kib
  MFC after:	3 days

Changes:
  head/bin/csh/config_p.h
Comment 24 Mark Peek freebsd_committer freebsd_triage 2016-04-07 14:31:13 UTC
*** Bug 207873 has been marked as a duplicate of this bug. ***
Comment 25 commit-hook freebsd_committer freebsd_triage 2016-04-10 18:13:04 UTC
A commit references this bug:

Author: mp
Date: Sun Apr 10 18:12:04 UTC 2016
New revision: 297789
URL: https://svnweb.freebsd.org/changeset/base/297789

Log:
  MFC 297673:

  Revert r296969 by removing SAVESIGVEC and switching to fork instead. This
  fixes usage with system libraries which maintain their own signal state.

  PR:	208132

Changes:
_U  stable/10/
  stable/10/bin/csh/config_p.h
Comment 26 ml 2016-05-21 11:48:30 UTC
Just a few questions...

The bug was marked as closed/fixed, but I'm still seeing the behaviour described in #207873 (closed as duplicate of this bug) on 10.2 and 10.3 (both having winbind in /etc/nsswitch.conf).

I checked bin/csh/config_p.h in 10.2, 10.3 and 10-STABLE and it's identical in all three versions.

In all of them it seems the above patch was half-applied, as VFORK (and not vfork) is still defined, but SAVESIGVEC does not appear anywhere.

So:
_ is #207873 really fixed too?
_ should I apply the missing part of the attached patch?