I've been hacking on the FreeBSD kernel and I believe I've discovered a rare bug in the mmap syscall. I've spent quite some time trying to understand the cause, and at this point I think I have a pretty good idea, logically, of what's wrong. I'm not experienced enough to find the problem in the code, so I'll give my best description of it.

I'm using ext2 with 1 KiB blocks to store some files. This is important because a block size smaller than the system's page size is necessary to trigger the bug. I do not know whether the bug is exclusive to ext2, however. I have a file that is 13773 bytes long stored on this filesystem, so four 4 KiB pages are needed to map its 14 blocks of data.

What I've noticed is that if I open the file, use the file descriptor to mmap the entire contents, and then read the entire mapping, I trigger four separate page faults to load in that data. Demand paging; makes sense so far.

I've also noticed that if I use the read syscall to read data from the file before calling mmap on the same file descriptor, I do not trigger the same number of page faults after mapping the file in. Reading four bytes from the start (a typical magic-number check) seems to prefetch two full pages worth of data from the file, and somehow stores that data in a way that makes the first two page faults unnecessary. A cool bit of optimization that still makes perfect sense.

If, before calling mmap, I seek to 128 bytes from the end of the file and read from there, something unexpected happens: as I read the mapped memory I do not trigger the final page fault, but the page is not completely valid! I've played around with different file sizes, seek locations, and read buffer sizes. Here's what I believe is happening. The file is chunked up into the number of pages it needs when it is opened. When the file is mapped into memory, a page fault on one of its pages causes the entire page (4 blocks) to be read from disk.
Even when they are not mapped into userspace memory, these pages still exist attached to the file descriptor. The read syscall is capable of making these pages valid, but it does NOT take measures to ensure that it is reading in the complete page. So, when I ask it to read the last 128 bytes of my 13773-byte file, taking data from offsets 13645-13772, the kernel and/or ext2 driver correctly determines that I'm after block 13, which it loads into page 3 at an offset of 1024 within the page. This means that page 3 has no data loaded at offsets 0 through 1023, yet it will not trigger the page fault the kernel needs to load the missing data once it has been mapped.

This problem caused a real-life bug for me as I attempted to run an old Java runtime on the Linux compat layer. For whatever reason, Java does exactly this while setting up its virtual machine: reading the first four bytes from a jar file, then the last 128, then using mmap to link up some functions defined in the jar.

I was able to solve the problem by adding the following code to the ext2 driver, but I feel like this cannot be the optimal place for a fix. Branched from a commit (d1bd24e3) to the "stable/12" branch of the GitHub repo three weeks ago. In "sys/fs/ext2fs/ext2_vnops.c", beginning at line 2085 (within the "ext2_read" function).
/*
 * Before: if the read starts partway through a page, load the blocks
 * from the start of that page up to the block containing the offset.
 */
if (((int)uio->uio_offset / 1024) % 4) {
	int before_offset = ((int)uio->uio_offset / 4096) * 4096;

	do {
		lbn = lblkno(fs, before_offset);
		error = bread(vp, lbn, 1024, NOCRED, &bp);
		brelse(bp);
		bp = NULL;
		if (error)
			break;
		before_offset += 1024;
	} while (before_offset < ((int)uio->uio_offset / 1024) * 1024);
}

/*
 * After: load the remaining blocks of the page containing the last
 * byte of the read.
 */
int last_byte = (int)uio->uio_offset + (int)uio->uio_resid - 1;
if (last_byte > 0) {
	int last_page = last_byte / 4096;
	int last_block_in_page = (last_byte % 4096) / 1024;

	for (int i = last_block_in_page + 1; i < 4; i++) {
		int after_offset = last_page * 4096 + i * 1024;

		lbn = lblkno(fs, after_offset);
		error = bread(vp, lbn, 1024, NOCRED, &bp);
		brelse(bp);
		bp = NULL;
		if (error)
			break;
	}
}

The purpose of this code is to force the ext2 driver to load in the surrounding blocks. If my description is insufficient, I should be able to provide a virtual machine image reproducing the issue.
I would guess this is a bug in the ext2fs filesystem, although I'm not seeing the code you're referring to as "before" in either STABLE/12 or CURRENT.
Why do you use ext fs at all? Can you reproduce this with UFS? (I think not, because the minimal fragment size for UFS is 512 and the block size is 4096.)
(In reply to Conrad Meyer from comment #1) The entirety of the code I posted is new. Sorry, I was unclear. The "before" and "after" sections in my code snippet do not refer to "before my change" and "after my change". Rather, the "before" section loads blocks before the read region, and the "after" section loads blocks after it. The purpose of these sections is to ensure that when a block lands partway through a page, the extra blocks needed to complete that page are loaded into the cache before and after it as necessary. (In reply to Konstantin Belousov from comment #2) I cannot reproduce the issue in UFS. I can't reproduce it in ext2 with 4 KiB blocks either, so I think you're right about why.
Hi, Alan. I tried to reproduce the issue from your description, but did not manage to. It seems Java produces a more complex interaction with the filesystem than I expected. Could you please share a script or C/C++ source which reproduces this issue, if you have one? If not, I will have to ask you to share a VM image.
Is it possible that this is related to bug #246182?