Bug 20393

Summary: processes get stuck in vmwait instead of nanslp with large MAXUSERS
Product: Base System
Reporter: bsdx <bsdx>
Component: kern
Assignee: silby
Status: Closed FIXED
Severity: Affects Only Me
Priority: Normal
Version: 4.1-STABLE
Hardware: Any
OS: Any

Description bsdx 2000-08-04 07:30:01 UTC
rwatson asked me to send a pr for this, so I am trying to help out.

Processes hang the system by getting stuck in vmwait instead of nanslp'ing like they should.  If you have MAXUSERS set above roughly 128 and try to run at least 2000 processes, some will get stuck in vmwait, making the system almost unresponsive until they decide to exit.  If the number of processes wanting to run is much larger than 2000, the cycle continues.  Console switching works, but typing at a login prompt yields no characters on the screen.  Any processes running on other terminals (including an rtprio 0 top) freeze.  ps in DDB shows a small percentage of the children in vmwait instead of whatever they should be doing.

Fix: 

Let me know if I can help out further with the issue.  I have time to test things but I do not know C let alone the kernel.  I can provide a scratchbox to demonstrate the problem and any level of access to it if it would be helpful in resolving the issue.
How-To-Repeat: compile kernel on 4.1 with maxusers=200
run a program (http://www.looksharp.net/~user1/test.c) which tries to fork x children, each of which does the following:
  print starting time, sleep 20 seconds, print ending time, exit.
Note that if x > some number z around 1500, processes stop forking way before maxproc in shell or kernel is reached, and processes hang as described.  After a period of time several times longer than 20 seconds, some processes exit and more start.  If x > ~2000, the cycle repeats several times.   
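The linked test.c may no longer be reachable, so here is a minimal sketch of the worker logic the description gives (fork x children; each prints a start time, sleeps, prints an end time, and exits). The function name `run_children` and its parameters are illustrative, not taken from the original program:

```c
/*
 * Hypothetical reconstruction of the reproducer described above.
 * Each child prints its start time, sleeps for sleep_secs seconds,
 * prints its end time, and exits.  The parent reaps all children.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Fork nchildren workers; return how many were successfully reaped. */
static int
run_children(int nchildren, unsigned int sleep_secs)
{
	int i, reaped = 0;

	for (i = 0; i < nchildren; i++) {
		pid_t pid = fork();

		if (pid == -1) {
			perror("fork");
			break;
		}
		if (pid == 0) {
			/* Child: start time, sleep, end time, exit. */
			printf("child %d start %ld\n", i, (long)time(NULL));
			sleep(sleep_secs);
			printf("child %d end   %ld\n", i, (long)time(NULL));
			_exit(0);
		}
	}
	/* Parent: reap every child that was created. */
	while (wait(NULL) != -1)
		reaped++;
	return (reaped);
}
```

Calling something like `run_children(2000, 20)` from a small main would generate the load described in this PR; the numbers 2000 and 20 match the report, but the surrounding code is only a sketch.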
Problem 2: repeat the above with maxusers = 600 (> ~512).
The kernel panics from running out of kernel memory instead of showing the freezing behavior.  (One person I showed this part of the issue to, after inspecting the source code, was unsure whether it should panic or wait for free memory in fork1 at this call.)

  MALLOC(p2->p_cred, struct pcred *, sizeof(struct pcred),
      M_SUBPROC, M_WAITOK);
Comment 1 Sheldon Hearn freebsd_committer freebsd_triage 2000-08-04 10:04:23 UTC
Responsible Changed
From-To: freebsd-bugs->dillon

This looks like Matt's area.
Comment 2 silby freebsd_committer freebsd_triage 2002-02-12 08:08:38 UTC
State Changed
From-To: open->analyzed

Well, the reason for situation #2 is clear now that I've worked on 
PR 23740:  Setting a large maxusers simply sets too large a maxproc. 
As a result, process-related structures will consume all memory, 
thereby causing the system to deadlock.  I should have a patch 
to limit this committed soon.  In the meantime, lowering maxproc 
is a viable solution. 

As for the temporary hang and recovery seen when many processes 
are run, but memory isn't exhausted, I'm not sure if an easy 
solution is at hand.  It simply looks as if the scheduler isn't 
handling thousands of runnable processes well.  Related to this, 
some processes are tsleeping on "ttywri" for long periods of time; 
I suspect that this may be because sshd isn't being given a chance 
to run and thereby flush the tty buffers of that process. 

So, I can fix the panics and complete hangs of the system, but 
making this modified forkbomb run efficiently may be a lot of work. 


Comment 3 silby freebsd_committer freebsd_triage 2002-02-12 08:08:38 UTC
Responsible Changed
From-To: dillon->silby

This is related to PR kern/23740, so I'm grabbing it.
Comment 4 silby freebsd_committer freebsd_triage 2002-03-07 05:01:40 UTC
State Changed
From-To: analyzed->closed

This problem is fixed in 4.5-stable as of today. 

There were two problems. 

1.  The panics were related to maxproc being set too high. 
Lower your maxusers so that it equals the megs of ram 
in your system.  (A subsequent patch will enforce 
maxproc to a sane value.) 
2.  vm_daemon scaled badly with thousands of procs, and 
was eating LOTS of processor time.  This was the freezing 
you described.  It has now been fixed. 
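silby's sizing advice in point 1, expressed as a kernel configuration line (the value 128 is illustrative, for a machine with 128 MB of RAM; pick the number that matches your own RAM):

```
maxusers	128
```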

Thank you for including the program which exhibited the problem 
behavior; it was a big help in tracking down the problem.