| Summary: | contigmalloc1() oddity for large alignments (race condition) | ||
|---|---|---|---|
| Product: | Base System | Reporter: | Anton Berezin <tobez> |
| Component: | kern | Assignee: | Andre Oppermann <andre> |
| Status: | Closed FIXED | ||
| Severity: | Affects Only Me | ||
| Priority: | Normal | ||
| Version: | 5.0-CURRENT | ||
| Hardware: | Any | ||
| OS: | Any | ||
Responsible Changed From-To: freebsd-bugs->dillon
The VM system is Matt's area.

Responsible Changed From-To: dillon->freebsd-bugs
Back to the free pool.

Anton, do you still have the problem with contigmalloc() described in the problem report?
-- Andre

State Changed From-To: open->feedback
Check with Originator if problem persists.

Responsible Changed From-To: freebsd-bugs->andre
Check with Originator if problem persists.

Andre,
On Sat, Dec 27, 2003 at 03:53:11PM +0100, Andre Oppermann wrote:
> do you still have the problem with contigmalloc() described in
> the problem report?
No idea, I have not been using bktr/meteor drivers for ages now. If the
code did not change substantially, I would expect the problem to still
be there, though.
\Anton.
--
Civilization is a fractal patchwork of old and new and dangerously new.
-- Vernor Vinge
Anton Berezin wrote:
>
> Andre,
>
> On Sat, Dec 27, 2003 at 03:53:11PM +0100, Andre Oppermann wrote:
>
> > do you still have the problem with contigmalloc() described in
> > the problem report?
>
> No idea, I have not been using bktr/meteor drivers for ages now. If the
> code did not change substantially, I would expect the problem to still
> be there, though.
Ok, the code has been redone and reorganized. The redo was by phk
in sys/vm/vm_page.c rev 1.154 and the reorg by dillon in rev 1.167.
With that all contigmalloc() stuff has been moved to sys/vm/vm_contig.c
which has some more redones in it.
I'd say it's safe to close this PR as it is no longer relevant for today's
codebase.
Thanks for your feedback.
--
Andre
State Changed From-To: feedback->closed
See description in last message.
If an object is requested with a large alignment, say 1<<24, so that contigmalloc1() cannot find even a single PQ_FREE or PQ_CACHE page with that alignment, it proceeds to free inactive pages, one by one, and then immediately active pages as well, also one by one. The problem is that after freeing a page (in most cases the routine pages it out; I inserted some sysctl counters to debug this), it starts again by rescanning the same queue (either PQ_INACTIVE or PQ_ACTIVE) from its head. To me, this looks bad enough even for inactive pages, but for the active queue it is a disaster unless the box is idle.

In a nutshell, the following sequence is executed when contigmalloc1() tries to free a page: vm_pageout_flush(page), which calls vm_pager_put_pages(page), which calls swap_pager_putpages(page), which sleeps (swwrt). When the box is not idle, other processes that run while this process is blocked in the swwrt state will add more pages to the inactive queue (some chance) or to the active queue (near certainty), and then contigmalloc1() starts scanning the queue again!

Fix:

A first obvious thing to do is to remove the 1<<24 alignment allocation from the bktr (and meteor) code. This helps in my particular case. However, I think that the internal workings of contigmalloc1() are seriously broken for large alignments. My understanding is that the page freeing code is something of a last resort for the routine, and it probably should not be used in this case. The assumption contigmalloc1() makes is that if the very first loop was not able to find even the starting page, then there is a severe memory shortage or something similar. Not necessarily so. To me, the code simply `does not look right'. And I have no idea what the proper fix might look like.

Cheers, Anton.

How-To-Repeat:

A program that issues the METEORSETGEO ioctl to the bktr driver with a relatively large number of frames (in my tests I used 14 frames == 14*768*576*4/4096 == 6048 pages).
The bktr driver did not have sufficient space preallocated. For some reason, the bktr driver in its get_bktr_mem() function (dev/bktr/bktr_os.c) first tries to do vm_page_alloc_contig() with an alignment of 1<<24 and then, if this fails, proceeds with PAGE_SIZE. [As a side note, I have no idea what the reason is for using such a large alignment in the bktr driver. Apparently, this piece of code was copied as is from the meteor driver.]

On a practically idle box the allocation fails after 4 to 8 seconds. The number of jumps from the vm_pageout_flush() callpoint in the inactive scan code to the PQ_INACTIVE rescan is about 110. The number of jumps from the vm_pageout_flush() callpoint in the active scan code to PQ_INACTIVE rescan is about 4400.

On a busy box (nice -20 perl -e 'for(;;){}') this takes forever - or at least I was not patient enough to wait for completion. The number of jumps increases at a steady rate, and most of them are from the `active' piece. In top(1), I observed things like this (please pay attention to Ks and Ms here):

Mem: 348K Active, 180K Inact, 21M Wired, 38M Cache, 9899K Buf, 64M Free
Swap: 525M Total, 21M Used, 504M Free, 3% Inuse, 1552K Out
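
The "flush one page, sleep, then rescan the queue from its head" behaviour described in this report can be illustrated with a toy model. What follows is only a minimal sketch in user-space C, not kernel code and not the actual contigmalloc1() logic: the function name simulate_rescan(), its parameters, and the restart cap are hypothetical, and the page queue is reduced to a plain count of dirty pages. It assumes that every flush sleeps once and that a fixed number of pages is dirtied by other processes during each sleep.

```c
#include <assert.h>

/*
 * Toy model of the "flush one page, then rescan from the head" loop.
 * Hypothetical helper, not taken from the FreeBSD sources.
 *
 * initial_dirty   - dirty pages on the queue when the scan starts
 * added_per_sleep - pages other processes dirty while we sleep in the flush
 * restart_cap     - give up after this many restarts
 *
 * Returns the number of restarts needed to drain the queue, or -1 if the
 * cap was reached before the queue was drained.
 */
long
simulate_rescan(long initial_dirty, long added_per_sleep, long restart_cap)
{
	long dirty = initial_dirty;
	long restarts = 0;

	assert(initial_dirty >= 0 && added_per_sleep >= 0 && restart_cap >= 0);

	while (dirty > 0) {
		if (restarts == restart_cap)
			return (-1);		/* queue not drained within the cap */
		dirty--;			/* flush one dirty page (this sleeps) */
		dirty += added_per_sleep;	/* others dirty pages meanwhile */
		restarts++;			/* "goto again": rescan from the head */
	}
	return (restarts);
}
```

Under these assumptions, simulate_rescan(100, 0, 1000000) finishes after 100 restarts, which corresponds to the idle box, while simulate_rescan(100, 1, 1000000) never drains the queue and stops only because of the cap. That is the same shape as the "takes forever" behaviour on a busy box, although the real queues and page states are of course more involved than a single counter.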