| Summary: | [vm] [patch] NULL pointer dereference in vm_pageout_scan() | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Base System | Reporter: | gemini | ||||
| Component: | kern | Assignee: | Alan Cox <alc> | ||||
| Status: | Closed FIXED | ||||||
| Severity: | Affects Only Me | CC: | gemini | ||||
| Priority: | Normal | ||||||
| Version: | 4.5-RELEASE | ||||||
| Hardware: | Any | ||||||
| OS: | Any | ||||||
| Attachments: |
State Changed From-To: open->feedback
This sounds more like a transitory bad memory issue. Have you seen this in recent releases?

State Changed From-To: feedback->open
Toss this over to alc to see if it is worth applying or should just be closed.

Responsible Changed From-To: freebsd-bugs->alc
Toss this over to alc to see if it is worth applying or should just be closed.

State Changed From-To: open->closed
Indeed, it is an error for a page to appear in either the active or inactive queues without belonging to an object. As suggested in the comments this must have been either a synchronization error in some part of the kernel not modified by the enclosed patch or a transient hardware error. Since (1) the patch does not identify the source of the error but only masks the error, (2) the synchronization of access to vm objects and vm page queues has completely changed in RELENG_5 and beyond, and (3) there have been no reports of this bug since then, I am going to close this PR without applying the provided patch. That said, I still want to thank the submitter for his efforts.
A couple of days ago, one of our normally extremely stable server machines panicked due to a NULL pointer dereference. We didn't get a kernel dump, but we at least had the instruction pointer and the offending data address. After disassembling the respective part of the kernel, it became clear that the pointer in the 'object' field of the relevant 'vm_page_t' structure was NULL at the time and was being used without first being checked for NULL.

Here's the section of code where it happened (in vm_pageout.c:vm_pageout_scan()):

	/*
	 * If the object is not being used, we ignore previous
	 * references.
	 */
	if (m->object->ref_count == 0) {
		vm_page_flag_clear(m, PG_REFERENCED);
		pmap_clear_reference(m);

The original assumption when this code was written may well have been that a page on the inactive queue is always associated with an object. The crash we experienced unfortunately shows otherwise. We also found that other parts of the kernel certainly don't trust the 'object' field blindly.

Fix:

Please consider adopting the patch below. We take the pragmatic approach and skip the page if it isn't associated with an object, on the assumption that this state will be short-lived, and also because in this context we wouldn't know what to do with such a page anyway. The patch deals with the scanning loops for both the inactive and the active queue.

How-To-Repeat:

I have no idea how to repeat that condition. We have been running several servers in production for over two years now, and this was the first time it happened to us. I speculate that the 'object' field being NULL is just a transitory state that became apparent due to a race condition. Otherwise it should have hit us more frequently in the past.
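
The pragmatic check described under Fix can be illustrated with a small user-space sketch. The patch itself is not reproduced on this page, so this is not the submitted change. The structures `struct vm_object` and `struct vm_page` below are simplified, hypothetical stand-ins for the kernel's definitions, the `referenced` field stands in for PG_REFERENCED, and `scan_queue()` is an invented name. The sketch only shows the idea of skipping a page whose 'object' pointer is NULL before dereferencing it.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's vm_object. */
struct vm_object {
	int ref_count;
};

/* Hypothetical, simplified stand-in for the kernel's vm_page. */
struct vm_page {
	struct vm_object *object;	/* may unexpectedly be NULL */
	int referenced;			/* stand-in for PG_REFERENCED */
	struct vm_page *next;		/* stand-in for the queue linkage */
};

/*
 * Walk a page queue.  Pages without an object are skipped instead of
 * being dereferenced; pages whose object is unused have their
 * reference state cleared.  Returns the number of skipped pages.
 */
int
scan_queue(struct vm_page *head)
{
	struct vm_page *m;
	int skipped = 0;

	for (m = head; m != NULL; m = m->next) {
		/* The guard: never touch m->object->... if object is NULL. */
		if (m->object == NULL) {
			skipped++;
			continue;
		}

		/* If the object is not being used, ignore previous references. */
		if (m->object->ref_count == 0)
			m->referenced = 0;
	}
	return (skipped);
}
```

Calling `scan_queue()` on a queue that contains one page with a NULL 'object' returns 1 and leaves that page untouched. As the audit trail points out, a guard like this hides the symptom and does not explain how the page lost its object in the first place.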