Bug 28746

Summary: Race condition in run-time linker
Product: Base System
Reporter: Nathan Mower <nmower>
Component: i386
Assignee: John Polstra <jdp>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: Unspecified   
Hardware: Any   
OS: Any   

Description Nathan Mower 2001-07-06 01:00:03 UTC
There seems to be a race condition in the run-time linker (ELF).  As
near as I can tell, the situation is this: _rtld_bind calls
rlock_acquire(), but before it gets to rlock_release(), a signal is
caught.  The signal handler calls exit(), so the __atexit list is
traversed, calling rtld_exit(), which calls wlock_acquire().  This
spins on the lock, which it never gets.  The process is hung.
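
The sequence described above can be modelled in a few lines. This is only a toy sketch, not rtld's code: `toy_lock`, `toy_try_acquire()`, `toy_release()` and `toy_race_demo()` are hypothetical names, the linker's reader/writer lock is collapsed into a single flag, and the handler tries the lock once instead of spinning so that the demo terminates.

```c
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the linker's lock (not rtld's real structure). */
static atomic_flag toy_lock = ATOMIC_FLAG_INIT;
static volatile sig_atomic_t toy_handler_saw_lock_held = 0;

/* Single attempt; a real spin lock would loop here until it succeeded. */
static int toy_try_acquire(void)
{
    return !atomic_flag_test_and_set(&toy_lock);
}

static void toy_release(void)
{
    atomic_flag_clear(&toy_lock);
}

/* Plays the role of exit() -> rtld_exit() -> wlock_acquire(). */
static void toy_on_signal(int sig)
{
    (void)sig;
    if (toy_try_acquire())
        toy_release();
    else
        toy_handler_saw_lock_held = 1;  /* a spinning acquire would hang here */
}

/* Returns 1 if the handler found the lock already held by the code it
 * interrupted, which is the situation reported in this PR. */
int toy_race_demo(void)
{
    toy_handler_saw_lock_held = 0;
    signal(SIGUSR1, toy_on_signal);

    int got = toy_try_acquire();        /* plays the role of rlock_acquire() */
    assert(got);
    raise(SIGUSR1);                     /* signal arrives before the release */
    toy_release();                      /* plays the role of rlock_release() */

    return toy_handler_saw_lock_held;
}
```

The interrupted code cannot release the lock until the handler returns, and a handler that spins never returns, so neither side makes progress.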

Fix: 

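Known workaround: run Apache with LD_BIND_NOW set in its environment, so
that symbols are bound at load time rather than lazily through
_rtld_bind().  I'm not sure what the right fix is; it might be necessary
to block signals between rlock_acquire() and rlock_release() in
_rtld_bind().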
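
The signal-blocking idea can be sketched with sigprocmask(). This is an illustration under assumptions, not a patch to rtld: `blocked_section_demo()` and the flag variables are hypothetical, and the comments only mark where rlock_acquire() and rlock_release() would sit. Note that it costs two system calls per critical section.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t deferred_handler_ran = 0;
static volatile sig_atomic_t deferred_ran_inside = 0;
static volatile sig_atomic_t deferred_in_section = 0;

static void deferred_on_signal(int sig)
{
    (void)sig;
    deferred_handler_ran = 1;
    if (deferred_in_section)
        deferred_ran_inside = 1;    /* would mean blocking did not work */
}

/* Returns 1 if the handler ran, and ran only after the critical section. */
int blocked_section_demo(void)
{
    struct sigaction sa;
    sigset_t all, saved;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = deferred_on_signal;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);

    sigfillset(&all);
    rc = sigprocmask(SIG_BLOCK, &all, &saved);   /* before rlock_acquire() */
    assert(rc == 0);

    deferred_in_section = 1;
    raise(SIGUSR1);                 /* stays pending while blocked */
    deferred_in_section = 0;

    rc = sigprocmask(SIG_SETMASK, &saved, NULL); /* after rlock_release() */
    assert(rc == 0);                /* the pending signal is delivered here */

    return deferred_handler_ran && !deferred_ran_inside;
}
```

A multithreaded program would use pthread_sigmask() rather than sigprocmask().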

How-To-Repeat: Put heavy traffic on an Apache web server (I use
torture.pl) while frequently sending SIGUSR1 to the child Apache
processes.  The bug is very intermittent, as you can well imagine.
Comment 1 dd freebsd_committer freebsd_triage 2001-07-06 14:05:04 UTC
Responsible Changed
From-To: freebsd-bugs->jdp

jdp seems to make most of the changes to rtld. 
jdp: this isn't one of the best bug reports in the world, but perhaps 
it'll alert you to a possible problem.
Comment 2 John Polstra 2001-07-07 01:51:28 UTC
Actually, I think this is a good bug report.  It's very concise, but
the submitter's analysis of the problem is stated clearly, and I
believe it's 100% correct.  This kind of stuff is not easy to debug,
so he must have done quite a bit of work to diagnose the problem.
(Thank you Nathan!)

I'll have to think about the best way to fix it.  I want to avoid
blocking/unblocking signals in rlock_acquire/rlock_release if
possible, because of the cost of the system calls.  I have a couple
other ideas, but they're not fleshed out yet.  Stay tuned.
Comment 3 John Polstra freebsd_committer freebsd_triage 2001-07-08 23:56:45 UTC
State Changed
From-To: open->feedback

I think the submitter's analysis of this problem is exactly right. 
However, after looking into it some more I am inclined to close 
this PR on the grounds that the bug is in apache rather than in 
FreeBSD.  According to the POSIX standard, a signal handler is 
allowed to call _exit() but not exit().  If apache's signal handler 
called _exit() as it ought to do, the atexit() processing would be 
bypassed, the dynamic linker's termination function would not be 
called, and this problem would not appear. 
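
The difference between the two calls can be seen in a small sketch. It is not Apache's code: `marker_atexit()`, `on_sigusr1_terminate()` and `run_child_and_get_status()` are hypothetical names, and the atexit() handler merely stands in for the dynamic linker's termination function. A child process registers the handler, then terminates from a signal handler with _exit(); the parent reads the exit status to see whether the atexit() handler ran.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for rtld_exit(): if it ever runs, it overrides the exit status
 * with 99 so the parent can tell. */
static void marker_atexit(void)
{
    _exit(99);
}

/* Terminates without running atexit() handlers; _exit() is on the POSIX
 * list of async-signal-safe functions, exit() is not. */
static void on_sigusr1_terminate(int sig)
{
    (void)sig;
    _exit(7);
}

/* Returns the child's exit status: 7 if the atexit() handler was bypassed,
 * or -1 if fork() or waitpid() failed. */
int run_child_and_get_status(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        atexit(marker_atexit);
        signal(SIGUSR1, on_sigusr1_terminate);
        raise(SIGUSR1);
        _exit(1);                   /* not reached: the handler terminates */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    assert(WIFEXITED(status));
    return WEXITSTATUS(status);
}
```

If the handler called exit(7) instead, the atexit() handler would run and the status would come back as 99, which corresponds to the path that reaches rtld_exit() in this PR.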

If I could see a reasonable way to fix this in the dynamic linker 
without killing performance, I'd gladly fix it.  But barring that, 
I think I'm going to have to point to POSIX and say it's not our 
bug. 

I'm putting the PR into the feedback state first, to give the 
submitter an opportunity to disagree.
Comment 4 Nathan Mower 2001-07-09 18:24:27 UTC
No disagreement here, John.  I'll submit a bug report to Apache.org.
Thanks for taking a look at it.
Comment 5 John Polstra freebsd_committer freebsd_triage 2001-07-09 18:32:58 UTC
State Changed
From-To: feedback->closed

Submitter says he doesn't object to closing this PR, since the 
actual bug is in apache.  He will send a bug report to the apache 
team.