Bug 238700 - mmap bug when using file-systems with small blocks and seeking randomly to read from the file descriptor before mapping
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-06-19 04:38 UTC by Alan
Modified: 2020-05-04 17:46 UTC (History)
3 users

See Also:


Description Alan 2019-06-19 04:38:24 UTC
I've been hacking on the FreeBSD kernel and I believe I've discovered a rare bug in the mmap syscall. I've spent quite some time trying to understand exactly what the cause is, and at this point I think I have a pretty good idea, logically, of what's wrong. I'm not experienced enough to pin the problem down in the code, so I'll give my best description of the behavior instead.

I'm using ext2 with 1 KiB blocks to store some files. This is important because a block size smaller than the system's page size is necessary to trigger the bug. I do not know whether the bug is exclusive to ext2, however.

I have a file that is 13773 bytes long stored on this file-system. That means four 4KiB pages would be needed to map its 14 blocks of data. What I've noticed is that if I open the file and use the file descriptor to mmap the entire contents, then read the entire contents, I trigger four separate page faults to load in that data. Demand paging. Makes sense so far. 

I've also noticed that if I use the read syscall to read data from the file before using the mmap syscall on the same file descriptor, I will not trigger the same number of page faults after mapping the file in. Reading four bytes from the start (a typical magic number check) seems to pre-fetch two full pages worth of data from the file, and somehow stores that data in a way that makes the first two page faults unnecessary. A cool bit of optimization that still makes perfect sense. 

If, before calling mmap, I seek to 128 bytes from the end of the file and read from there, something unexpected happens. As I read the mapped memory I do not trigger the final page fault, yet the page is not completely valid!

I've played around with different file sizes, seek locations, and read buffer sizes. Here's what I believe is happening. The file is chunked up into the number of pages it needs when it is opened. When the file is mapped into memory, a page fault on one of its pages causes the entire page (4 blocks) to be read from disk. Even when the file is not mapped into userspace memory, these pages still exist attached to the file descriptor. The read syscall is capable of making these pages valid, but it DOES NOT take measures to ensure that it reads in the complete page.

So, when I ask it to read the last 128 bytes of my 13773-byte file, covering offsets 13645 through 13772, the kernel and/or ext2 driver correctly determines that I'm after block 13, which it loads into page 3 at an offset of 1024 within the page. This means page 3 has no data loaded for offsets 0 through 1023, but because the page is now considered valid, it will not trigger the page fault the kernel needs to load the missing data once it has been mapped.

This problem caused a real-life bug for me when I attempted to run an old Java runtime on the Linux compat layer. For whatever reason, Java does exactly this while setting up its virtual machine: reading the first four bytes from a jar file, then the last 128, then using mmap to link up some functions defined in the jar. I was able to work around the problem by adding the following code to the ext2 driver, but I feel like this cannot be the optimal place for a fix.

I branched from commit d1bd24e3 on the "stable/12" branch of the GitHub repo three weeks ago.
The change is in "sys/fs/ext2fs/ext2_vnops.c", beginning at line 2085 (within the "ext2_read" function).

	/* "before": read in any 1 KiB blocks that fall earlier in the
	 * same 4 KiB page as the start of the requested read. */
	if (((int)uio->uio_offset / 1024) % 4) {
		int before_offset = ((int)uio->uio_offset / 4096) * 4096;
		do {
			lbn = lblkno(fs, before_offset);
			error = bread(vp, lbn, 1024, NOCRED, &bp);
			if (error) {
				/* bread() releases the buffer itself on failure,
				 * so do not brelse() again here. */
				break;
			}
			brelse(bp);
			bp = NULL;
			before_offset += 1024;
		} while (before_offset < ((int)uio->uio_offset / 1024) * 1024);
	}
	/* "after": read in any 1 KiB blocks that fall later in the
	 * 4 KiB page holding the last byte of the requested read. */
	int last_byte = (int)uio->uio_offset + (int)uio->uio_resid - 1;
	if (last_byte > 0) {
		int last_page = last_byte / 4096;
		int last_block_in_page = (last_byte % 4096) / 1024;
		for (int i = last_block_in_page + 1; i < 4; i++) {
			int after_offset = last_page * 4096 + i * 1024;
			lbn = lblkno(fs, after_offset);
			error = bread(vp, lbn, 1024, NOCRED, &bp);
			if (error) {
				break;
			}
			brelse(bp);
			bp = NULL;
		}
	}

The purpose of this code is to force the ext driver to load in some surrounding blocks.

If my description is insufficient I should be able to provide a virtual machine image reproducing the issue.
Comment 1 Conrad Meyer freebsd_committer 2019-06-19 05:17:40 UTC
I would guess this is a bug in the ext2fs filesystem, although I'm not seeing the code you're referring to as "before" in either STABLE/12 or CURRENT.
Comment 2 Konstantin Belousov freebsd_committer 2019-06-22 17:17:35 UTC
Why do you use ext fs at all?

Can you reproduce this with UFS? (I think not, because the minimal fragment size for UFS is 512 and the block size is 4096.)
Comment 3 Alan 2019-06-23 20:54:36 UTC
(In reply to Conrad Meyer from comment #1)

The entirety of the code I posted is new. 

Sorry, I was unclear. The "before" and "after" sections in my code snippet do not refer to "before my change" and "after my change". Rather, the before section is for loading in blocks before the read region and the after section is for loading in blocks afterwards. The purpose of these sections is to ensure that when loading a block that would be partway through a page, extra blocks are loaded into the cache before and after as necessary.

(In reply to Konstantin Belousov from comment #2)

I cannot reproduce the issue in UFS. I can't reproduce it in ext2 with 4KiB blocks either, so I think you're right about why.
Comment 4 Fedor Uporov freebsd_committer 2019-06-24 09:16:08 UTC
Hi, Alan.

I tried to reproduce the issue from your description, but did not manage to. It seems Java's interaction with the filesystem is more complex than I expected.
Could you please share a script or some C/C++ source that reproduces this issue, if you have one?
If not, I will have to ask you to share a VM image.
Comment 5 sega01 2020-05-04 17:46:14 UTC
Is it possible that this is related to bug #246182?