Bug 70587

Summary: [vm] [patch] NULL pointer dereference in vm_pageout_scan()
Product: Base System Reporter: gemini
Component: kern Assignee: Alan Cox <alc>
Status: Closed FIXED    
Severity: Affects Only Me CC: gemini
Priority: Normal    
Version: 4.5-RELEASE   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
vm_pageout.c.diff none

Description gemini 2004-08-17 20:40:21 UTC
A couple of days ago one of our normally extremely stable server
machines panicked due to a NULL pointer dereference.  While we
didn't get a kernel dump we at least had the instruction pointer
and the offending data address.

After disassembling the relevant part of the kernel it became
clear that the pointer in the 'object' field of the relevant
'vm_page_t' structure was NULL at the time and was being used
without checking it for NULL first.  Here's the section of code
where it happened (in vm_pageout.c:vm_pageout_scan()):

        /*
         * If the object is not being used, we ignore previous 
         * references.
         */
        if (m->object->ref_count == 0) {
                vm_page_flag_clear(m, PG_REFERENCED);
                pmap_clear_reference(m);

Now, the original assumption when this code was written
may well have been that a page on the inactive queue is
always associated with an object.  The crash we experienced
unfortunately proves otherwise.  We also found that other
parts of the kernel do not trust the 'object' field blindly.

Fix: Please consider adopting the patch below.  We take the pragmatic
approach and skip the page if it isn't associated with an object,
on the assumption that this state will be short-lived, and also
because in this context we wouldn't know what to do with a page
like this, anyway.  The patch deals with the scanning loops for
both the inactive and active queue.
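
To illustrate the idea only (this is not the attached vm_pageout.c.diff,
which is not reproduced in this report), here is a minimal user-space
sketch of the "skip the page if it has no object" guard.  The struct
layouts, the 'referenced' and 'next' fields, and the function name
scan_queue() are hypothetical stand-ins, not the real kernel definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; NOT the real FreeBSD layouts. */
struct vm_object {
        int ref_count;
};

struct vm_page {
        struct vm_object *object;  /* NULL in the failure case described above */
        int referenced;            /* hypothetical stand-in for PG_REFERENCED */
        struct vm_page *next;      /* hypothetical singly linked page queue */
};

/*
 * Walk a page queue.  Pages without an object are skipped instead of
 * being dereferenced; for the remaining pages, the reference flag is
 * cleared when the owning object is unused.  Returns the number of
 * pages that were skipped.
 */
int
scan_queue(struct vm_page *head)
{
        struct vm_page *m;
        int skipped = 0;

        for (m = head; m != NULL; m = m->next) {
                /* The guard: never touch m->object->... if object is NULL. */
                if (m->object == NULL) {
                        skipped++;
                        continue;
                }
                /* If the object is not being used, ignore previous references. */
                if (m->object->ref_count == 0)
                        m->referenced = 0;
        }
        return (skipped);
}
```

A quick check of the sketch is to build a three-page queue whose middle
page has no object and assert that exactly one page is skipped while the
other two have their reference flag cleared.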
How-To-Repeat: I have no idea how to repeat that condition.  We have been
running several servers in production for over two years now, and
this was the first time it happened to us.  I speculate that the
'object' field being NULL is just a transitory state that
became apparent due to a race condition.  Otherwise it should
have hit us more frequently in the past.
Comment 1 K. Macy freebsd_committer freebsd_triage 2007-11-16 17:27:58 UTC
State Changed
From-To: open->feedback


This sounds more like a transitory bad memory issue. Have you seen this in recent releases?
Comment 2 K. Macy freebsd_committer freebsd_triage 2007-11-16 20:35:57 UTC
State Changed
From-To: feedback->open


Toss this over to alc to see if it is worth applying or should just be closed. 


Comment 3 K. Macy freebsd_committer freebsd_triage 2007-11-16 20:35:57 UTC
Responsible Changed
From-To: freebsd-bugs->alc


Toss this over to alc to see if it is worth applying or should just be closed.
Comment 4 Alan Cox freebsd_committer freebsd_triage 2007-11-22 21:18:00 UTC
State Changed
From-To: open->closed

Indeed, it is an error for a page to appear in either the active 
or inactive queues without belonging to an object.  As suggested 
in the comments this must have been either a synchronization 
error in some part of the kernel not modified by the enclosed 
patch or a transient hardware error.  Since (1) the patch does 
not identify the source of the error but only masks the error, 
(2) the synchronization of access to vm objects and vm page 
queues has completely changed in RELENG_5 and beyond, and (3) 
there have been no reports of this bug since then, I am going 
to close this PR without applying the provided patch.  That 
said, I still want to thank the submitter for his efforts.