Bug 45478

Summary: /bin/sh coredump
Product: Base System Reporter: Oliver Fromme <olli>
Component: binAssignee: Stefan Farfeleder <stefanf>
Status: Closed FIXED    
Severity: Affects Only Me CC: olli
Priority: Normal    
Version: 4.7-RELEASE   
Hardware: Any   
OS: Any   

Description Oliver Fromme 2002-11-19 12:50:01 UTC
$ /bin/sh
$ while for true; do false; done; do true; done
^C
$ set -E
sh in malloc(): warning: recursive call
sh in malloc(): warning: recursive call
Segmentation fault (core dumped)

Fix: 

Sorry, none known.

I suggest someone familiar with the innards of our sh
looks at it.  It's probably easy to track down, using
/etc/malloc.conf and the core dumps ...
How-To-Repeat: 
See above.  The problem is not 100% reproducible.
Sometimes it happens immediately, sometimes it takes
20+ attempts, but sooner or later it'll crash.
I guess it depends on when exctly you interrupt the
endless while loop.

I can also provide the core dump if necessary.
Comment 1 Tim Robbins freebsd_committer freebsd_triage 2002-11-30 13:17:53 UTC
Responsible Changed
From-To: freebsd-bugs->tjr

I believe this is caused by the SIGINT handler longjmp()'ing 
out when it's in the middle of a malloc() call. Calls to malloc() 
and free() should be bracketed in INTON and INTOFF. 

I haven't had much luck tracking this down in the past, but 
I'll try again to find the missing INTON/INTOFF.
Comment 2 Tim Robbins freebsd_committer freebsd_triage 2004-02-15 07:36:58 UTC
Responsible Changed
From-To: tjr->freebsd-bugs

Unassign due to lack of time and interest. Perhaps someone else 
will pick this up.
Comment 3 Giorgos Keramidas freebsd_committer freebsd_triage 2005-04-15 16:52:32 UTC
On 2002-11-19 13:43, Oliver Fromme <olli@secnetix.de> wrote:
> Responsible-Changed-By: tjr
> Responsible-Changed-Why:
> I believe this is caused by the SIGINT handler longjmp()'ing
> out when it's in the middle of a malloc() call. Calls to malloc()
> and free() should be bracketed in INTON and INTOFF.
>
> I haven't had much luck tracking this down in the past, but
> I'll try again to find the missing INTON/INTOFF.

I just happened to stumble upon this bug today.  It's still with us in
FreeBSD 6.0-CURRENT.  It seems that the inner for loop in the following:

	while for true; do false; done; do true; done

is not stopped by sh(1) when ^C is hit.  Even after the interrupt is
received, sh consumes at least 5-15% of CPU on my test here, while it
appears to be sitting at a PS1 prompt, waiting for more input.

  PID USERNAME   THR PRI NICE   SIZE    RES STATE    TIME   WCPU    CPU COMMAND
 2352 keramida     1   5    0  1668K  1192K ttyin    0:03 25.48% 10.79% sh

After a few of these commands have been run, sh may reach CPU
utilizations of even more:

  PID USERNAME   THR PRI NICE   SIZE    RES STATE    TIME   WCPU    CPU COMMAND
 2352 keramida     1 123    0  1672K  1196K RUN      1:11 63.21% 63.18% sh
Comment 4 Giorgos Keramidas freebsd_committer freebsd_triage 2005-04-15 17:13:31 UTC
On 2005-04-15 18:52, Giorgos Keramidas <keramida@freebsd.org> wrote:
> On 2002-11-19 13:43, Oliver Fromme <olli@secnetix.de> wrote:
> > Responsible-Changed-By: tjr
> > Responsible-Changed-Why:
> > I believe this is caused by the SIGINT handler longjmp()'ing
> > out when it's in the middle of a malloc() call. Calls to malloc()
> > and free() should be bracketed in INTON and INTOFF.
> >
> > I haven't had much luck tracking this down in the past, but
> > I'll try again to find the missing INTON/INTOFF.
>
> I just happened to stumble upon this bug today.

I managed to get sh to print "Out of space" after a few more
invocations, and here's the backtrace I get either with gcore or by
sending a SEGV to the process (there's no other way to stop it from
printing infinite numbers of "Out of space" error messages):

: (gdb) bt
: #0  0x2811f2e3 in write () at write.S:2
: #1  0x0805733d in xwrite (fd=2, buf=0x806a000 "Out of space\namida/Mailbox", nbytes=13)
:     at output.c:318
: #2  0x080573b4 in flushout (dest=0x806132c) at output.c:206
: #3  0x08057418 in flushall () at output.c:196
: #4  0x0804c733 in exverror (cond=1, msg=0x805eb68 "Out of space", 
:     ap=0xbfbfe7f4 "4è¿¿\031\222\006(Øó\a(Üò\021(ô\001") at error.c:156
: #5  0x0804c787 in error (msg=0x806a000 "Out of space\namida/Mailbox") at error.c:166
: #6  0x0805555c in ckmalloc (nbytes=500) at memalloc.c:61
: #7  0x0805560d in stalloc (nbytes=496) at memalloc.c:132
: #8  0x080557ad in growstackblock () at memalloc.c:247
: #9  0x0804e1f1 in padvance (path=0xbfbfe8ac, name=0x806320c "") at exec.c:192
: #10 0x08054d38 in chkmail (silent=0) at mail.c:88
: #11 0x08054f92 in cmdloop (top=1) at main.c:213
: #12 0x08055138 in main (argc=1, argv=0xbfbfea40) at main.c:183

I don't know if this helps track down the problem though.  If anyone
with more sh-clue wants me to send the core file or post more data out
of it, please ask.
Comment 5 Nate Eldredge 2005-10-13 22:24:07 UTC
I have reproduced this and have a fix.  It does seem to have been 
INTON/INTOFF.  To find where the signal was being handled inside malloc, 
in gdb you can do

handle SIGINT nostop pass
(answer yes)
break exraise
command
 	where
 	continue
 	end
run

which will print a stack trace every time it handles the signal.  Then you 
look for malloc inside the trace.

I found the following offenders:

#11 0x0805bba7 in redirect (redir=0x0, flags=1) at redir.c:111
#11 0x0805e1a3 in setvar (name=0x80dd76e "_", val=0x810d0ac "true", 
flags=0) at var.c:246

However, a casual glance revealed several other un-wrapped calls.  90% of 
these are made via ckmalloc/ckrealloc/ckfree.  Thus I think the easiest 
and safest thing to do is to call INTON/INTOFF inside these functions. 
They are safe to use recursively and have trivial overhead (incrementing a 
counter).  Here is a patch which does this.  It is against CURRENT but I 
imagine it will apply to other versions as well.  After this patch I 
cannot make it crash anymore.

I found only one unsafe use of malloc/free directly: in the umask builtin. 
umaskcmd() calls setmode() which calls malloc, and then calls free.  My 
patch puts INTON/INTOFF around this as well.

By the way, I think the CPU usage thing is a red herring.  I assume the 
numbers are from top, and its CPU field is an average.  It definitely 
decays slowly to 0 after a process has been running, so it can report 
nonzero even for a totally idle process.  sh gets commands from the user 
with a blocking read, so it must be idle while waiting for input.

diff -ur /usr/src/bin/sh/memalloc.c ./src/bin/sh/memalloc.c
--- /usr/src/bin/sh/memalloc.c	Tue Apr  6 13:06:51 2004
+++ ./src/bin/sh/memalloc.c	Thu Oct 13 14:02:30 2005
@@ -57,8 +57,10 @@
  {
  	pointer p;

+	INTOFF;
  	if ((p = malloc(nbytes)) == NULL)
  		error("Out of space");
+	INTON;
  	return p;
  }

@@ -70,11 +72,20 @@
  pointer
  ckrealloc(pointer p, int nbytes)
  {
+	INTOFF;
  	if ((p = realloc(p, nbytes)) == NULL)
  		error("Out of space");
+	INTON;
  	return p;
  }

+void
+ckfree(pointer p)
+{
+	INTOFF;
+	free(p);
+	INTON;
+}

  /*
   * Make a copy of a string in safe storage.
diff -ur /usr/src/bin/sh/memalloc.h ./src/bin/sh/memalloc.h
--- /usr/src/bin/sh/memalloc.h	Tue Apr  6 13:06:51 2004
+++ ./src/bin/sh/memalloc.h	Thu Oct 13 14:01:27 2005
@@ -48,6 +48,7 @@

  pointer ckmalloc(int);
  pointer ckrealloc(pointer, int);
+void ckfree(pointer);
  char *savestr(char *);
  pointer stalloc(int);
  void stunalloc(pointer);
@@ -72,5 +73,3 @@
  #define STTOPC(p)	p[-1]
  #define STADJUST(amount, p)	(p += (amount), sstrnleft -= (amount))
  #define grabstackstr(p)	stalloc(stackblocksize() - sstrnleft)
-
-#define ckfree(p)	free((pointer)(p))
diff -ur /usr/src/bin/sh/miscbltin.c ./src/bin/sh/miscbltin.c
--- /usr/src/bin/sh/miscbltin.c	Fri Sep  9 12:59:41 2005
+++ ./src/bin/sh/miscbltin.c	Thu Oct 13 14:00:13 2005
@@ -274,12 +274,14 @@
  			umask(mask);
  		} else {
  			void *set;
+			INTOFF;
  			if ((set = setmode (ap)) == 0)
  				error("Illegal number: %s", ap);

  			mask = getmode (set, ~mask & 0777);
  			umask(~mask & 0777);
  			free(set);
+			INTON;
  		}
  	}
  	return 0;

-- 
Nate Eldredge
nge@cs.hmc.edu
Comment 6 Stefan Farfeleder freebsd_committer freebsd_triage 2005-10-14 09:51:22 UTC
Responsible Changed
From-To: freebsd-bugs->stefanf

I'll try to fix that one.
Comment 7 olli 2005-10-14 13:42:04 UTC
I have applied Nate Eldredge's patch.
With that patch, I couldn't reproduce the problem anymore.

Thank you very much, Nate!

I would vote for committing this.

--
Comment 8 Stefan Farfeleder freebsd_committer freebsd_triage 2005-10-28 12:02:38 UTC
State Changed
From-To: open->patched

I committed a modified version of this patch to HEAD.  Thanks.
Comment 9 Stefan Farfeleder freebsd_committer freebsd_triage 2005-12-26 18:19:02 UTC
State Changed
From-To: patched->closed

Also fixed in RELENG_{5,6}.