I can reliably generate corruption on 15.0-CURRENT's NFS client using fsx, with copy_file_range enabled. I can't reproduce it on ZFS, UFS, tmpfs, or msdosfs. I can't reproduce it on Linux's NFS client either (though I'm not positive that it's using server-side copying).

Environment
===========
NFS Server: FreeBSD 14.0-RELEASE-p3, amd64
NFS Clients: both FreeBSD 14.0-RELEASE-p3 and 15.0-CURRENT @ 20-Dec.

/etc/fstab is configured like this:
192.168.1.2:/home /usr/home nfs rw,nfsv4,minorversion=2 0 0

Steps To Reproduce
==================
1) Check out the fsx-rs git repository (I'm working on getting the latest version into ports):
git clone git@github.com:asomers/fsx-rs.git

2) Install a Rust toolchain, if you don't have one already:
pkg install rust

3) cd fsx-rs

4) cat > copy_file_range.toml <<HERE
nomsyncafterwrite = true

[weights]
truncate = 1
fsync = 1
fdatasync = 1
punch_hole = 0
sendfile = 1
write = 10
read = 10
copy_file_range = 10
HERE

5) Run the tool. Assuming /home is NFS-mounted, run like this:
cargo run --release -- -N 1024 -S 3381155135556591634 -f copy_file_range.toml -m 149264:149265 -v ~/tmp/foo.bin

Expected Output
===============
...
All operations completed A-OK!

Actual Output
=============
[INFO fsx] 3 mapwrite 0x3d369 .. 0x3ffff ( 0x2c97 bytes)
[INFO fsx] 4 mapread 0x3b828 .. 0x3ffff ( 0x47d8 bytes)
[INFO fsx] 5 write 0x18a59 .. 0x21671 ( 0x8c19 bytes)
[INFO fsx] 6 write 0x3407e .. 0x3d057 ( 0x8fda bytes)
[INFO fsx] 7 sendfile 0x141b5 .. 0x1da23 ( 0x986f bytes)
[INFO fsx] 8 copy_file_range [0x326b6:0x357d2] => [0x1c39b:0x1f4b7] ( 0x311d bytes)
[INFO fsx] 9 copy_file_range [0x36e5e:0x3feb4] => [ 0xa813:0x13869] ( 0x9057 bytes)
[INFO fsx] 10 read 0x18e3a .. 0x20a9c ( 0x7c63 bytes)
[WARN fsx] 11 write 0x20935 .. 0x2a296 ( 0x9962 bytes)
[INFO fsx] 12 mapwrite 0x3eadf .. 0x3ffff ( 0x1521 bytes)
[INFO fsx] 13 copy_file_range [0x3b9d9:0x3e7bf] => [0x31338:0x3411e] ( 0x2de7 bytes)
[WARN fsx] 14 mapwrite 0x22763 .. 0x2b5af ( 0x8e4d bytes)
[INFO fsx] 15 write 0x12126 .. 0x1666c ( 0x4547 bytes)
[WARN fsx] 16 mapread 0x1bc78 .. 0x2ad3d ( 0xf0c6 bytes)
[INFO fsx] 17 write 0x29386 .. 0x36d8b ( 0xda06 bytes)
[INFO fsx] 18 write 0x3cff4 .. 0x3ffff ( 0x300c bytes)
[INFO fsx] 19 copy_file_range [0x3c882:0x3dc94] => [ 0x6fbe: 0x83d0] ( 0x1413 bytes)
[WARN fsx] 20 truncate 0x40000 => 0x14e87
[INFO fsx] 21 mapwrite 0x194a2 .. 0x19bbb ( 0x71a bytes)
[INFO fsx] 22 read 0x10595 .. 0x19bbb ( 0x9627 bytes)
[INFO fsx] 23 mapread 0x11ebf .. 0x14df9 ( 0x2f3b bytes)
[INFO fsx] 24 mapread 0x1887 .. 0x11174 ( 0xf8ee bytes)
[INFO fsx] 25 mapread 0x13b75 .. 0x19bbb ( 0x6047 bytes)
[INFO fsx] 26 write 0x1867c .. 0x19f7a ( 0x18ff bytes)
[INFO fsx] 27 mapread 0x119df .. 0x13ecb ( 0x24ed bytes)
[INFO fsx] 28 mapwrite 0x19c2d .. 0x1ec85 ( 0x5059 bytes)
[INFO fsx] 29 mapwrite 0x580f .. 0x14e19 ( 0xf60b bytes)
[INFO fsx] 30 mapwrite 0x2e746 .. 0x3b542 ( 0xcdfd bytes)
[INFO fsx] 31 copy_file_range [0x1ae6f:0x1e92f] => [0x173ae:0x1ae6e] ( 0x3ac1 bytes)
[INFO fsx] 32 read 0x2e63a .. 0x3af8d ( 0xc954 bytes)
[INFO fsx] 33 write 0x3313d .. 0x38155 ( 0x5019 bytes)
[INFO fsx] 34 mapread 0x332a5 .. 0x35e49 ( 0x2ba5 bytes)
[INFO fsx] 35 copy_file_range [0x17dc9:0x2206d] => [0x2e626:0x388ca] ( 0xa2a5 bytes)
[INFO fsx] 36 fdatasync
[INFO fsx] 37 read 0x2dad6 .. 0x3b3c8 ( 0xd8f3 bytes)
[INFO fsx] 38 sendfile 0x27769 .. 0x346a8 ( 0xcf40 bytes)
[INFO fsx] 39 mapwrite 0x38473 .. 0x3ffff ( 0x7b8d bytes)
[INFO fsx] 40 read 0x14cb6 .. 0x157ce ( 0xb19 bytes)
[INFO fsx] 41 sendfile 0x10a33 .. 0x1f57b ( 0xeb49 bytes)
[INFO fsx] 42 fsync
[INFO fsx] 43 copy_file_range [ 0xa490: 0xb45f] => [ 0xfb8: 0x1f87] ( 0xfd0 bytes)
[INFO fsx] 44 mapread 0x31956 .. 0x3ffff ( 0xe6aa bytes)
[INFO fsx] 45 copy_file_range [ 0xc1ad: 0xc474] => [ 0x7e98: 0x815f] ( 0x2c8 bytes)
[INFO fsx] 46 mapread 0x8bf .. 0xc9a9 ( 0xc0eb bytes)
[INFO fsx] 47 write 0x28d35 .. 0x2d27e ( 0x454a bytes)
[INFO fsx] 48 mapwrite 0x3f5d8 .. 0x3ffff ( 0xa28 bytes)
[INFO fsx] 49 fsync
[INFO fsx] 50 copy_file_range [0x2b709:0x2e6dc] => [0x32312:0x352e5] ( 0x2fd4 bytes)
[INFO fsx] 51 write 0x183d9 .. 0x1f309 ( 0x6f31 bytes)
[INFO fsx] 52 copy_file_range [0x2c928:0x31221] => [0x31222:0x35b1b] ( 0x48fa bytes)
[INFO fsx] 53 fsync
[INFO fsx] 54 write 0x39603 .. 0x3ffff ( 0x69fd bytes)
[INFO fsx] 55 write 0x30895 .. 0x39042 ( 0x87ae bytes)
[INFO fsx] 56 write 0x3c384 .. 0x3ffff ( 0x3c7c bytes)
[INFO fsx] 57 mapwrite 0x33ab6 .. 0x39fa0 ( 0x64eb bytes)
[WARN fsx] 58 write 0x23f7e .. 0x2b871 ( 0x78f4 bytes)
[INFO fsx] 59 mapwrite 0x1a29b .. 0x1c938 ( 0x269e bytes)
[INFO fsx] 60 write 0x29da5 .. 0x2effc ( 0x5258 bytes)
[INFO fsx] 61 read 0x29919 .. 0x3099d ( 0x7085 bytes)
[WARN fsx] 62 mapread 0x1f81d .. 0x2d882 ( 0xe066 bytes)
[INFO fsx] 63 write 0x32fd5 .. 0x3479f ( 0x17cb bytes)
[INFO fsx] 64 copy_file_range [ 0xf5b5:0x13756] => [0x3be5e:0x3ffff] ( 0x41a2 bytes)
[WARN fsx] 65 sendfile 0x1fd25 .. 0x2cffb ( 0xd2d7 bytes)
[WARN fsx] 66 copy_file_range [0x30269:0x380a6] => [0x201d9:0x28016] ( 0x7e3e bytes)
[INFO fsx] 67 read 0x3f602 .. 0x3ffff ( 0x9fe bytes)
[WARN fsx] 68 read 0x23bd4 .. 0x25fc1 ( 0x23ee bytes)
[ERROR fsx] miscompare: offset= 0x23bd4, size = 0x23ee
[ERROR fsx] OFFSET   GOOD  BAD  RANGE
[ERROR fsx] 0x24710  0x39  0x37 0x18b2
[ERROR fsx] Step# (mod 256) for a misdirected write may be 55
[ERROR fsx] Using seed 3381155135556591634
[ERROR fsx] LOG DUMP
[ERROR fsx] 1 SKIPPED (read)
[ERROR fsx] 2 SKIPPED (mapread)
[ERROR fsx] 3 MAPWRITE 0x3d369 => 0x40000 ( 0x2c97 bytes) HOLE
[ERROR fsx] 4 MAPREAD 0x3b828 => 0x40000 ( 0x47d8 bytes)
[ERROR fsx] 5 WRITE 0x18a59 => 0x21672 ( 0x8c19 bytes)
[ERROR fsx] 6 WRITE 0x3407e => 0x3d058 ( 0x8fda bytes)
[ERROR fsx] 7 SENDFILE 0x141b5 => 0x1da24 ( 0x986f bytes)
[ERROR fsx] 8 COPY_FILE_RANGE [0x326b6,0x357d3] => [0x1c39b,0x1f4b8] ( 0x311d bytes)
[ERROR fsx] 9 COPY_FILE_RANGE [0x36e5e,0x3feb5] => [ 0xa813,0x1386a] ( 0x9057 bytes)
[ERROR fsx] 10 READ 0x18e3a => 0x20a9d ( 0x7c63 bytes)
[ERROR fsx] 11 WRITE 0x20935 => 0x2a297 ( 0x9962 bytes)
[ERROR fsx] 12 MAPWRITE 0x3eadf => 0x40000 ( 0x1521 bytes)
[ERROR fsx] 13 COPY_FILE_RANGE [0x3b9d9,0x3e7c0] => [0x31338,0x3411f] ( 0x2de7 bytes)
[ERROR fsx] 14 MAPWRITE 0x22763 => 0x2b5b0 ( 0x8e4d bytes)
[ERROR fsx] 15 WRITE 0x12126 => 0x1666d ( 0x4547 bytes)
[ERROR fsx] 16 MAPREAD 0x1bc78 => 0x2ad3e ( 0xf0c6 bytes)
[ERROR fsx] 17 WRITE 0x29386 => 0x36d8c ( 0xda06 bytes)
[ERROR fsx] 18 WRITE 0x3cff4 => 0x40000 ( 0x300c bytes)
[ERROR fsx] 19 COPY_FILE_RANGE [0x3c882,0x3dc95] => [ 0x6fbe, 0x83d1] ( 0x1413 bytes)
[ERROR fsx] 20 TRUNCATE DOWN from 0x40000 to 0x14e87
[ERROR fsx] 21 MAPWRITE 0x194a2 => 0x19bbc ( 0x71a bytes) HOLE
[ERROR fsx] 22 READ 0x10595 => 0x19bbc ( 0x9627 bytes)
[ERROR fsx] 23 MAPREAD 0x11ebf => 0x14dfa ( 0x2f3b bytes)
[ERROR fsx] 24 MAPREAD 0x1887 => 0x11175 ( 0xf8ee bytes)
[ERROR fsx] 25 MAPREAD 0x13b75 => 0x19bbc ( 0x6047 bytes)
[ERROR fsx] 26 WRITE 0x1867c => 0x19f7b ( 0x18ff bytes) EXTEND
[ERROR fsx] 27 MAPREAD 0x119df => 0x13ecc ( 0x24ed bytes)
[ERROR fsx] 28 MAPWRITE 0x19c2d => 0x1ec86 ( 0x5059 bytes) EXTEND
[ERROR fsx] 29 MAPWRITE 0x580f => 0x14e1a ( 0xf60b bytes)
[ERROR fsx] 30 MAPWRITE 0x2e746 => 0x3b543 ( 0xcdfd bytes) HOLE
[ERROR fsx] 31 COPY_FILE_RANGE [0x1ae6f,0x1e930] => [0x173ae,0x1ae6f] ( 0x3ac1 bytes)
[ERROR fsx] 32 READ 0x2e63a => 0x3af8e ( 0xc954 bytes)
[ERROR fsx] 33 WRITE 0x3313d => 0x38156 ( 0x5019 bytes)
[ERROR fsx] 34 MAPREAD 0x332a5 => 0x35e4a ( 0x2ba5 bytes)
[ERROR fsx] 35 COPY_FILE_RANGE [0x17dc9,0x2206e] => [0x2e626,0x388cb] ( 0xa2a5 bytes)
[ERROR fsx] 36 FDATASYNC
[ERROR fsx] 37 READ 0x2dad6 => 0x3b3c9 ( 0xd8f3 bytes)
[ERROR fsx] 38 SENDFILE 0x27769 => 0x346a9 ( 0xcf40 bytes)
[ERROR fsx] 39 MAPWRITE 0x38473 => 0x40000 ( 0x7b8d bytes) EXTEND
[ERROR fsx] 40 READ 0x14cb6 => 0x157cf ( 0xb19 bytes)
[ERROR fsx] 41 SENDFILE 0x10a33 => 0x1f57c ( 0xeb49 bytes)
[ERROR fsx] 42 FSYNC
[ERROR fsx] 43 COPY_FILE_RANGE [ 0xa490, 0xb460] => [ 0xfb8, 0x1f88] ( 0xfd0 bytes)
[ERROR fsx] 44 MAPREAD 0x31956 => 0x40000 ( 0xe6aa bytes)
[ERROR fsx] 45 COPY_FILE_RANGE [ 0xc1ad, 0xc475] => [ 0x7e98, 0x8160] ( 0x2c8 bytes)
[ERROR fsx] 46 MAPREAD 0x8bf => 0xc9aa ( 0xc0eb bytes)
[ERROR fsx] 47 WRITE 0x28d35 => 0x2d27f ( 0x454a bytes)
[ERROR fsx] 48 MAPWRITE 0x3f5d8 => 0x40000 ( 0xa28 bytes)
[ERROR fsx] 49 FSYNC
[ERROR fsx] 50 COPY_FILE_RANGE [0x2b709,0x2e6dd] => [0x32312,0x352e6] ( 0x2fd4 bytes)
[ERROR fsx] 51 WRITE 0x183d9 => 0x1f30a ( 0x6f31 bytes)
[ERROR fsx] 52 COPY_FILE_RANGE [0x2c928,0x31222] => [0x31222,0x35b1c] ( 0x48fa bytes)
[ERROR fsx] 53 FSYNC
[ERROR fsx] 54 WRITE 0x39603 => 0x40000 ( 0x69fd bytes)
[ERROR fsx] 55 WRITE 0x30895 => 0x39043 ( 0x87ae bytes)
[ERROR fsx] 56 WRITE 0x3c384 => 0x40000 ( 0x3c7c bytes)
[ERROR fsx] 57 MAPWRITE 0x33ab6 => 0x39fa1 ( 0x64eb bytes)
[ERROR fsx] 58 WRITE 0x23f7e => 0x2b872 ( 0x78f4 bytes)
[ERROR fsx] 59 MAPWRITE 0x1a29b => 0x1c939 ( 0x269e bytes)
[ERROR fsx] 60 WRITE 0x29da5 => 0x2effd ( 0x5258 bytes)
[ERROR fsx] 61 READ 0x29919 => 0x3099e ( 0x7085 bytes)
[ERROR fsx] 62 MAPREAD 0x1f81d => 0x2d883 ( 0xe066 bytes)
[ERROR fsx] 63 WRITE 0x32fd5 => 0x347a0 ( 0x17cb bytes)
[ERROR fsx] 64 COPY_FILE_RANGE [ 0xf5b5,0x13757] => [0x3be5e,0x40000] ( 0x41a2 bytes)
[ERROR fsx] 65 SENDFILE 0x1fd25 => 0x2cffc ( 0xd2d7 bytes)
[ERROR fsx] 66 COPY_FILE_RANGE [0x30269,0x380a7] => [0x201d9,0x28017] ( 0x7e3e bytes)
[ERROR fsx] 67 READ 0x3f602 => 0x40000 ( 0x9fe bytes)
[ERROR fsx] 68 READ 0x23bd4 => 0x25fc2 ( 0x23ee bytes)
Could you please try something along these lines:

diff --git a/sys/fs/nfsclient/nfs_clvnops.c b/sys/fs/nfsclient/nfs_clvnops.c
index a690e988b4b3..f86ecd53ad4e 100644
--- a/sys/fs/nfsclient/nfs_clvnops.c
+++ b/sys/fs/nfsclient/nfs_clvnops.c
@@ -3956,7 +3956,7 @@ nfs_copy_file_range(struct vop_copy_file_range_args *ap)
 	if (error == 0)
 		error = ncl_flush(invp, MNT_WAIT, curthread, 1, 0);
 	if (error == 0)
-		error = ncl_flush(outvp, MNT_WAIT, curthread, 1, 0);
+		error = ncl_vinvalbuf(outvp, V_SAVE, curthread, 0);
 
 	/* Do the actual NFSv4.2 RPC. */
 	ret = ret2 = 0;
Do we also need to add something like...

	if ((obj = invp->v_object) != NULL &&
	    vm_object_mightbedirty(obj)) {
		VM_OBJECT_WLOCK(obj);
		vm_object_page_clean(obj, 0, 0, OBJPC_SYNC);
		VM_OBJECT_WUNLOCK(obj);
	}

in front of the

	error = ncl_flush(invp, MNT_WAIT, curthread, 1, 0);

If this is needed, the vnode locking code will need to be changed to LK_EXCLUSIVE lock the invp.

Man, I hate mmap'd I/O. ;-)
(In reply to Rick Macklem from comment #2) Yes, it seems so. OTOH, we can lock invp shared and switch to exclusive lock later if the page queue flush is indeed needed.
I'll try that patch shortly. In the meantime, I've discovered a much quicker way to reproduce the problem by increasing the operation sizes. Change the copy_file_range.toml file to contain this:

flen = 524288
nomsyncafterwrite = true

[opsize]
max = 131072
min = 65536

[weights]
truncate = 1
fsync = 1
fdatasync = 1
punch_hole = 0
sendfile = 1
write = 10
read = 10
copy_file_range = 10
Created attachment 247343 [details] Flush invp cache; invalidate outvp cache.
Yes. The patch looks good to me and, yes, since this would be a rare occurrence, starting with LK_SHARED on the invp seems like a good idea. It also makes sense to invalidate the outvp's buffers. The change in the NFSv4.2 Change attribute would have triggered that before a read from the now bogus buffers occurred, but why wait for that when you know the buffers are now invalid?
I think there's one little oopsie in the patch. I think it should be:

	invp_lock = LK_SHARED;
relock:

(i.e., "relock:" after the assignment)
But are outvp buffers invalidated without the patch? If pages are already mapped, they would continue to be used unless the attribute change is acted upon.
Created attachment 247346 [details] Flush invp cache; invalidate outvp cache v2 (relock fixed)
Yes, I think you are correct. I was thinking of the case where there is a close/open done. Without a close/open, the bogus buffers could get used.
I'll need to take a look at some of the other NFSv4.2 operations, since they bypass the buffer cache (and any mmap'd I/O) to see if they need a similar treatment. I think Allocate/Deallocate will need this and Seek will need to do a vm_object_page_clean() as well as the ncl_flush(). I'll look at these once this patch is tested by Alan. Btw, the workaround for any of these problems is "minorversion=1" on the mount, for anyone who runs into these issues in unpatched systems.
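To spell out the workaround, an /etc/fstab line for it would look like the one from the original report, but with the minor version reduced (the host and paths here are from the reporter's setup; adjust for your own):

```
192.168.1.2:/home /usr/home nfs rw,nfsv4,minorversion=1 0 0
```

NFSv4.1 avoids the problem because the Copy operation (and the other operations that bypass the buffer cache) are NFSv4.2-only.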
kib's patch does not fix the bug for me.
Can you try a test run with "minorversion=1"? This will check whether it is related to Copy or something else.
With minorversion=1, I cannot produce the same miscompare. However, I _can_ produce others, even when I disable copy_file_range in the config file. So there is probably more than one bug.
What about getting rid of the "nomsyncafterwrite = true", so that it does always do an msync()?
Using kib's patch and removing nomsyncafterwrite fixes the miscompare.
(In reply to Alan Somers from comment #16)
Did you try with copy_file_range disabled? Also, does 'msync after write' mean msync after a mapped write, or after write(2)? I suspect that it should be _before_ whichever write it is.
(In reply to Konstantin Belousov from comment #17) I've tried it both ways. Whether or not copy_file_range is enabled, if nomsyncafterwrite is not set, then there is no crash. And nomsyncafterwrite refers to msync immediately after mapped write.
(In reply to Alan Somers from comment #18)
> Whether or not copy_file_range is enabled, if nomsyncafterwrite is not set, then
> there is no crash.
But if nomsyncafterwrite is set, is there an inconsistency even without copy_file_range?

For the second part, I mean that most likely this knob works because msyncs are interleaved with writes, and really msync affects the next write's result, not the previous one's.
(In reply to Konstantin Belousov from comment #19)
> But if nomsyncafterwrite is set, is there an inconsistency even without copy_file_range?
Yes.

> For the second part, I mean that most likely this knob works because msyncs
> are interleaved with writes, and really msync affects the next write's result,
> not the previous one's.
Created attachment 247360 [details] Don't clear the dirty bits on pages being written through the buffer cache You could try this completely untested patch. The last time we played this game, the problem was that a call to vfs_busy_pages() cleared the dirty bits when it should not have done so. This is the only other place where the NFS client code does the same thing. Note that I think kib@'s copy_file_range patch is needed. I plan to do some testing of it, to see if I can break copy_file_range() without the patch and without using mmap'd I/O.
(In reply to Rick Macklem from comment #21)
It would be very strange if this patch helped. More, I think it is somewhat dangerous because it can cause data corruption on its own. The issue is, not clearing the pages' dirty bits also does not remove write permissions from the existing mappings, allowing writers to continue writing to the file mappings. I think that this allows more mapped writes to be lost.
BTW, why do we need ncl_writebp() at all? It seems that it mostly repeats the regular bufwrite() with some omissions, and added reassignbuf() call on return path.
FWIW I can also reproduce the bug using exactly the same setup, config file, and seed on bfffs, a FUSE file system. I cannot reproduce it with fusefs-ext2, fusefs-sshfs, or fusefs-exfat, however. I will now work on adding a unit test to the fusefs test suite.
I did not suggest the second patch as a real fix. I just wanted to see what (if any) effect it has on Alan's tests. (In the past, I've found that almost random changes can help identify a problem, like my suggestion to get rid of "nomsyncafterwrite".) If the patch happens to help with Alan's test case, then we need to figure out what needs to be done to fix it. W.r.t. ncl_writebp(), I'll take a look. This is all very old code cribbed from the old NFS client and not written by me.
I just tested rmacklem's latest patch along with kib's. It does not fix the bug.
So we have an nfscl bug reproducible without copy_file_range, right? Could you try to determine a minimal set of syscall types needed to reproduce it? E.g. is it just read/write/mapped write, or is something more required?
Could you please try with my latest patch from D43250?
With kib's latest patch, it's more difficult to produce a miscompare. But still possible. For example, using the toml file from Comment 4 with seed 9675272854109691020 will crash in 95 steps. Or, without copy_file_range at all, it's possible to crash in 126 steps using seed 8637402481540764494 and this toml file:

flen = 524288
nomsyncafterwrite = true

[opsize]
max = 131072
min = 65536

[weights]
truncate = 1
fsync = 1
fdatasync = 1
punch_hole = 0
write = 10
read = 10
Alan, do you know if the test does any other syscalls like rename or link even though they are not listed under [weights]? I am wondering because both nfs_rename() and nfs_link() do VOP_FSYNC(), but they do not call vm_object_page_clean(). Also, could you try a test run with truncate = 0? (The nfs_setattr() code does seem to do vnode_pager_setsize() as it should, but maybe more needs to be done.) I notice ncl_vinvalbuf() only gets called when there are dirty buffers.
(In reply to Rick Macklem from comment #30) It doesn't do rename or link. It does do some ftruncate calls that aren't shown, but I don't think there's anything else. And I'm out of time to help with testing until at least the 3rd, and probably the 6th.
Righto. To be honest, this may not be fixable, given the way NFS does I/O below the buffer cache (no vnode lock, etc). The fix may be to document the need to use msync(2) when mmap'd I/O is mixed with syscall I/O for NFS mounts. --> An update to the msync man page maybe?
is this related to bug 270810 which reported data corruption using mmap()+msync() on NFS? (i have encountered issues with this in the past, but i never got around to making a reproducible test case to file a PR.)
Ok, here is my understanding of what currently can happen. Hopefully Kostik will correct me if I have this wrong.

#1 - File is open(2)'d.
#2 - A byte range (let's say the 1st 100Mbytes) is mmap(2)'d into the address space.
#3 - Some addresses within this address space are modified by the process, dirtying the corresponding pages.
#4 - File is read(2) sequentially.

Now, when #4 happens, there will be read-aheads done by the nfsiod threads. These simply do Read RPCs against the NFS server to read the byte ranges of the file into the buffer cache blocks. They are done asynchronously and without any vnode lock.
--> At this time, I do not see anything that stops these read-aheads from filling the buffer cache blocks/pages with the NFS server's now-stale data.

Now, I thought adding an msync(2) with MS_SYNC between #3 and #4 would be sufficient to cause the pages dirtied by #3 to be written to the NFS server (via VOP_PUTPAGES(), which is ncl_putpages()). I believe that an fsync(2) between #3 and #4 will also write the dirtied pages to the NFS server.

Without either an msync(2) or fsync(2) between #3 and #4, what could be done to make this work?
- Don't do read-ahead. This would be a major performance hit and is imho a non-starter.
- Don't do read-ahead when a file is mmap(2)'d. This sounds better, since it will be a rare case that a file will be both mmap(2)'d and read via read(2) syscalls.
  --> To do this, the NFS client needs to know if the file has been mmap(2)'d. A flag could be set on the vnode when the file is mmap(2)'d and that flag can be checked by the NFS client.
  --> The problem is: when can the flag be cleared? My recollection from a previous round of discussing this is... not until all the process(es) that mmap(2)'d the file exit. (I cannot recall if the vnode's v_usecount going to 0 is sufficient.)
- Having some way that the nfsiod threads can check to see if there are dirty pages related to the buffer cache block and write those back to the NFS server before doing the read. (Recall that the buffer cache block will be quite a few pages, typically 128K to 1Mbyte in size.)
  --> This could be done by having the nfsiod thread LK_EXCLUSIVE lock the vnode, but that would be a major performance hit as well.

That's as far as I've gotten in previous discussions about this.

Note that this PR started with a specific problem related to copy_file_range(2) and that has been fixed (or kib@'s patch will fix it when committed). The more general case as above, well??
I added Geoffrey as a cc, since he reported PR#270810. I think these two PRs are converging on the same problem. (He can remove himself from the cc list if he chooses to.) I do disagree that this affects many people, since I have only seen 1 or 2 PRs on this, but it does affect some people.
fwiw, my own issue with NFS+mmap() was using net-p2p/rtorrent, which is a fairly widely used application and which (famously) uses mmap() for nearly all filesystem operations. when running it on NFS, i found it frequently reported data corruption issues with the files it wrote to (i.e., it would write to files using mmap(), but on reading those files later, the data that was written was not as expected). that said, i am the only person (other than Geoffrey) who i've seen report issues related to this. i couldn't say whether this is because the number of people who use FreeBSD NFS and mmap() is relatively small, or because the problem was caused by something else. i only mention this at all to add some context to the discussion :-)
Going back to my example case (which sounds like what net-p2p/rtorrent might be doing), it is possible for the NFS client VOP_READ() to do a vm_object_page_clean() which should ensure that mmap(2)'d writing (as in modifying bytes within mmapped address space) is written to the server. This would not handle the case where mmap(2)'d writing is intermixed with read(2)s. Does that sound worth doing? And, is it likely that vm_object_mightbedirty() will return true when this is not needed? (If this is the case, a flag on the vnode that indicates it has been mmap(2)'d would be helpful.)
(In reply to Rick Macklem from comment #34)
This sounds like an interesting theory, but please note that the read-ahead initiators in nfs_clbio.c check that the B_CACHE buffer flag is not set. This should prevent a situation where we have a constructed buffer with valid (possibly dirty) pages but no valid content recorded at the buf cache layer.

But let's recheck the theory anyway; the patch below should prevent RA when there are writeable mappings:

commit 2234d9d4f7595a78bf10c08b1e6b12d2115799cd
Author: Konstantin Belousov <kib@FreeBSD.org>
Date:   Tue Jan 2 00:22:44 2024 +0200

    nfsclient: do not do (unlocked) read-ahead by nfsiod if there are
    writeable mappings

diff --git a/sys/fs/nfsclient/nfs_clbio.c b/sys/fs/nfsclient/nfs_clbio.c
index e6486af55daf..1f92fe0a4cf3 100644
--- a/sys/fs/nfsclient/nfs_clbio.c
+++ b/sys/fs/nfsclient/nfs_clbio.c
@@ -481,9 +481,13 @@ ncl_bioread(struct vnode *vp, struct uio *uio, int ioflag, struct ucred *cred)
 	on = uio->uio_offset - (lbn * biosize);
 
 	/*
-	 * Start the read ahead(s), as required.
+	 * Start the read ahead(s), as required.  Do not do
+	 * read-ahead if there are writeable mappings, since
+	 * unlocked read by nfsiod could obliterate changes
+	 * done by userspace.
 	 */
-	if (nmp->nm_readahead > 0) {
+	if (nmp->nm_readahead > 0 &&
+	    vp->v_object->un_pager.vnp.writemappings == 0) {
 		for (nra = 0; nra < nmp->nm_readahead && nra < seqcount &&
 		    (off_t)(lbn + 1 + nra) * biosize < nsize; nra++) {
 			rabn = lbn + 1 + nra;
@@ -671,6 +675,7 @@ ncl_bioread(struct vnode *vp, struct uio *uio, int ioflag, struct ucred *cred)
 	 */
 	NFSLOCKNODE(np);
 	if (nmp->nm_readahead > 0 &&
+	    vp->v_object->un_pager.vnp.writemappings == 0 &&
 	    (bp->b_flags & B_INVAL) == 0 &&
 	    (np->n_direofoffset == 0 ||
 	    (lbn + 1) * NFS_DIRBLKSIZ < np->n_direofoffset) &&
(In reply to Rick Macklem from comment #37)
I thought about doing vm_object_page_clean() before starting RA, but the problem is that the current lock is shared, while we need an exclusively locked vnode for the write. Let's try my patch first. I added Peter Holm in the hope that he has some spare cycles to reproduce the issue and to check my patchset.
Sounds good. I was about to try a patch which did the lock upgrade and vm_object_page_clean() when vm_object_mightbedirty() returns true, but I'll leave it until yours gets tested.
(In reply to Konstantin Belousov from comment #38)
I now believe that for this patch to work, we need to check both for writeable mappings (writeable > 0) and for the previous existence of mappings that dirtied the vnode (vm_object_mightbedirty()). I updated the patch on my WIP branch for testing.
If I am reading the code correctly (I may not be), it seems that vfs_buf_test_cache() will clear B_CACHE if any of the pages allocbuf() has added to a new buffer are not valid. So, if a buffer cache block is for the first 1Mbyte of the file, but the mmap(2)'d process has only touched one page within that first 1Mbyte, does B_CACHE get cleared? (And then the read-ahead happens, overwriting the dirty page.)
Comment on attachment 247360 [details] Don't clear the dirty bits on pages being written through the buffer cache This patch is crap. Please ignore it.
(In reply to Rick Macklem from comment #42) If the buffer size is 1M (shrug), then to map any page from this buffer, kernel would validate the whole buffer. At least this is the default setup where vfs.nfs.use_buf_pager is true. In other words, it should be not possible to have a buffer consisting of a single dirty page and rest of invalid pages.
Yep. I just saw that (fills the whole buffer when a page is touched) using a trivial test program. This trivial program did not expose any problem.
If this is related to the bug I reported, PR#270810, the problem isn't the page state, but the fact that when the writing process exits, the pages are immediately discarded before they are sent to the NFS server. As long as the process that has dirtied the pages is kept alive, there is no issue. This affects BSD NFS clients; Linux NFS clients are unaffected.
(In reply to Konstantin Belousov from comment #39) Sure, I'll take a look.
(In reply to Konstantin Belousov from comment #27)
As I have said before, reading is irrelevant. It occurs when writing to a congested server, when not all writes have finished by the time the writing process exits. It also does not occur using nfsv2 - the process stays in state nfsio until writes are finished, but nfsv2 is dog slow and has other problems.
Created attachment 247424 [details] Make Setattr invalidate for setting size if mmap'd writing might have occurred This is another patch that Alan can hopefully test, since his test does include truncation. The patch adds a check for vm_object_mightbedirty() as well as NMODIFIED (which is set for writes into the buffer cache, but not writes to mmap'd pages) to decide that flushing/invalidation should occur when a Setattr of size is done.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=47ec00d9d6071bbb0ee5ed0bdca3c4a468334d9d

commit 47ec00d9d6071bbb0ee5ed0bdca3c4a468334d9d
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-12-30 00:15:50 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-01-05 04:56:30 +0000

    nfsclient: flush dirty pages of the vnode

    before ncl_flush() when done to ensure that the server sees our cached
    data, because it potentially changes the server response. This is
    relevant for copy_file_range(), seek(), and allocate().

    Convert LK_SHARED invp lock into LK_EXCLUSIVE if needed to properly
    call vm_object_page_clean().

    Reported by:	asomers
    PR:	276002
    Noted and reviewed by:	rmacklem
    Tested by:	pho
    Sponsored by:	The FreeBSD Foundation
    MFC after:	1 week
    Differential revision:	https://reviews.freebsd.org/D43250

 sys/fs/nfsclient/nfs_clvnops.c | 45 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=7dae1467d72ae1f5c8f7be0f7444da23a457d98b

commit 7dae1467d72ae1f5c8f7be0f7444da23a457d98b
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-12-29 23:22:40 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-01-05 04:56:17 +0000

    nfsclient copy_file_range(): flush dst vnode data

    Otherwise server-side copy makes the client cache inconsistent with
    the server data.

    Reported by:	asomers
    PR:	276002
    Reviewed by:	rmacklem
    Tested by:	pho
    Sponsored by:	The FreeBSD Foundation
    MFC after:	1 week
    Differential revision:	https://reviews.freebsd.org/D43250

 sys/fs/nfsclient/nfs_clvnops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
With the latest code in main, including kib's two commits, I can no longer reproduce the bug using the second copy_file_range.toml file I posted. But I can still reproduce it using the original file. In fact, I can reproduce it without using copy_file_range or truncate at all, using this config file. This may be a different bug:

nomsyncafterwrite = true

[weights]
truncate = 0
fsync = 1
fdatasync = 1
punch_hole = 0
sendfile = 0
write = 10
read = 10
copy_file_range = 0
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=384c72acb50d3fd5a3bd07f41d3755867865c0b6

commit 384c72acb50d3fd5a3bd07f41d3755867865c0b6
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-12-29 23:22:40 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-01-11 16:46:52 +0000

    nfsclient copy_file_range(): flush dst vnode data

    PR:	276002

    (cherry picked from commit 7dae1467d72ae1f5c8f7be0f7444da23a457d98b)

 sys/fs/nfsclient/nfs_clvnops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=09c69decd208d3ccd4dd54ea67c14b48c012c58c

commit 09c69decd208d3ccd4dd54ea67c14b48c012c58c
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-12-30 00:15:50 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-01-11 16:46:52 +0000

    nfsclient: flush dirty pages of the vnode

    PR:	276002

    (cherry picked from commit 47ec00d9d6071bbb0ee5ed0bdca3c4a468334d9d)

 sys/fs/nfsclient/nfs_clvnops.c | 45 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d139cab8d13b83b90872f6ca5dc53890c85826a9

commit d139cab8d13b83b90872f6ca5dc53890c85826a9
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-12-29 23:22:40 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-01-11 16:47:51 +0000

    nfsclient copy_file_range(): flush dst vnode data

    PR:	276002

    (cherry picked from commit 7dae1467d72ae1f5c8f7be0f7444da23a457d98b)

 sys/fs/nfsclient/nfs_clvnops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=cee4d7c22585a736a5d231365ceb62e96d80488a

commit cee4d7c22585a736a5d231365ceb62e96d80488a
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-12-30 00:15:50 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-01-11 16:47:52 +0000

    nfsclient: flush dirty pages of the vnode

    PR:	276002

    (cherry picked from commit 47ec00d9d6071bbb0ee5ed0bdca3c4a468334d9d)

 sys/fs/nfsclient/nfs_clvnops.c | 45 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)
Does this affect NFSv3 mounts on releng/13.2, too?
(In reply to Robert Schulze from comment #57)
The NFS Copy operation is only NFSv4.2. However, my recollection is that asomers@ could still generate data corruption when Copy was not done, unless msync was done. If this recollection is correct, the problem is still an issue (and may be for a long time), and I do not think asomers@ tested NFSv3.
(In reply to Rick Macklem from comment #58) No. I tested NFS 4.2 and 4.1, but not NFS 3.
(In reply to Rick Macklem from comment #58)
I am not sure that I parsed the question right, but: I am aware that there is still a corruption, and AFAIR even without copy_file_range(). It is enough to get some mix of regular writes and mmapped accesses (writes) to trigger it.
(In reply to Konstantin Belousov from comment #60)
Yes, that is my understanding as well, with the additional fact that msync(2) calls fix the problem.

My hunch is that this bug (not the copy_file_range(2) issue) has existed *forever*. I have not been able to come up with a plausible explanation, but...
- It might be that there is some situation where B_CACHE does not get set when it should be set, causing ncl_write() (the VOP_WRITE() for NFS) to read stale data from the server, overwriting the mmap'd change(s).
- There might be some race between VOP_PUTPAGES()/VOP_GETPAGES() (which must hold the vnode locked) and the NFS I/O RPCs being done on the buffer cache blocks when no vnode lock is held for them. I have not been able to think of a case where this would cause the corruption, but I'll admit I haven't thought about it lately.

I have thought of doing the equivalent of msync() in ncl_write(), at least to see if that avoids the corruption. Maybe I'll come up with a test patch for Alan to try.
(In reply to Rick Macklem from comment #61)
I remember it was Peter Holm who was able to reproduce the last issue. I have already committed the change to disable read-ahead if there are pages on the object queue. The reverse side, waiting for all async RPCs to finish before starting the first VOP_GETPAGES(), can be done, although it is more complicated.
(In reply to Konstantin Belousov from comment #62)
I'll admit that I do not think readahead is the problem now.

First off, both VOP_GETPAGES() and VOP_READ() are called with the vnode LK_SHARED, so they are not serialized by the vnode lock. When I look at the code, both ncl_getpages() (VOP_GETPAGES() for NFS) and ncl_bioread() (VOP_READ() for NFS) end up calling getblkx(), which should serialize access to the buffer, and the code does appear to safely handle B_CACHE.

There is one thing that does look weird to me, although I'll admit I do not understand the VM side of things.
- When VOP_GETPAGES() calls bread_gb() (actually breadn_flags() with various arguments) the pages are sbusied. Then nfs_strategy() calls ncl_doio(), which does the RPCs and then calls bufdone(). bufdone() calls vfs_vmio_iodone(), which sunbusies the pages.
- After the bread_gb() call, vfs_bio_getpages() does a bunch of stuff that might not be correct for an unbusied page? I won't claim to understand the code, but it "upgrades" the pages to xbusied (assuming they are sbusied, I think?) and also does some stuff I do not understand related to checking that the pages are valid, which the comment claims is racy.

Anyhow Kostik, hopefully you can understand the code and determine if the code in vfs_bio_getpages() is safe even if the pages have been sunbusied by the bread_gb() call?
Created attachment 248653 [details]
Problem reproducer

I'm not sure if this is helpful, but this small test program reproduces the issue on NFS only:

09:37 /usr/src/tools/test/stress2/misc $ ./mmap44.sh
/dev/md10 on /mnt (ufs, NFS exported, local, soft-updates)
127.0.0.1:/mnt on /mnt2 (nfs)
9781c9781
< 0461500 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f
---
> 0461500 40 41 42 43 44 ee 46 47 48 49 4a 4b 4c 4d 4e 4f
256 -rw-------  1 root wheel 262144 Feb 21 09:38 file
256 -rw-------  1 root wheel 262144 Feb 21 09:37 file.orig
09:38 /usr/src/tools/test/stress2/misc $ grep -B2 -A2 0xee mmap44.sh
			pthread_mutex_lock(&write_mutex);
			c = cp[i];
			cp[i] = 0xee;	/* This value seems to linger with NFS */
			cp[i] = c;
			pthread_mutex_unlock(&write_mutex);
09:39 /usr/src/tools/test/stress2/misc $
(In reply to Rick Macklem from comment #63)
vfs_bio_getpages() is called with the pages exclusively busy. As you noted, bread_gb() wants to additionally shared-busy the same pages, which is impossible. So vfs_bio_getpages() changes its busy mode from exclusive to shared, just as a normal lock can be downgraded. When bread_gb() acquires and releases the shared busy, it does +1 and -1 on the shared busy counter, so the page stays shared busy after bread_gb() finishes. Then vfs_bio_getpages() upgrades from shared back to exclusive, and since the shared busy state could have been dropped, we recheck the validity of the pages. If some page was invalidated, we need to re-validate the buffer.

And this indeed might not be correct for NFS, I now think. Another thread could do a short read, making only part of the page valid. Then the bogus-page replacement does not help to preserve the valid content.

Peter, could you please retest the NFS client with the sysctl vfs.nfs.use_buf_pager set to 0?
Both test programs still fail. Here's the fsx output:

09:34 /home/pho/fsx-rs $ sysctl vfs.nfs.use_buf_pager
vfs.nfs.use_buf_pager: 0
     Running `target/release/fsx -N 1024 -S 3381155135556591634 -f copy_file_range.toml -m '149264:149265' -v -v -v /mnt11/foo.bin`
[ERROR fsx] miscompare: offset= 0x1eedd, size = 0x19f6
[ERROR fsx] OFFSET  GOOD  BAD  RANGE
[ERROR fsx] 0x1eedd 0x38  0x35 0x6c6
[ERROR fsx] Step# (mod 256) for a misdirected write may be 145
[ERROR fsx] Using seed 3381155135556591634
[ERROR fsx] LOG DUMP
[ERROR fsx]  1 SKIPPED (read)
[ERROR fsx]  2 SKIPPED (mapread)
[ERROR fsx]  3 MAPWRITE 0x3d369 => 0x40000 ( 0x2c97 bytes) HOLE
[ERROR fsx]  4 WRITE    0x3b828 => 0x40000 ( 0x47d8 bytes)
(In reply to Peter Holm from comment #66)
There seems to be something about holes in the file connected with the corruption. Could you please try to run the tests over a fully populated file, using only writes and mapped writes, with no fallocate/hole making.
(In reply to Konstantin Belousov from comment #67)
I commented out the "hole" part in mmap44.sh and I still get the corruption.

#if 0	// Test without holes
	if (lseek(fd, arc4random() % siz, SEEK_END) == -1)
		err(1, "lseek() END");
	s = sizeof(buf);
	for (i = 0; i < 50; i++) {
		if (write(fd, buf, s) != s)
			warn("write()");
	}
#endif

I added:

punch_hole = 0
posix_fallocate = 0

to the fsx config file, but I still see references to HOLE in the output? fsx still returns lots of ERRORs.
Removing the writing past EOF (to create holes) and the following ftruncate() makes the corruption go away (at least in a short 40-minute test).
(In reply to Peter Holm from comment #69)
If you do not already have it, maybe you could try adding the patch in attachment #247424 [details] (the second one, with title "Make Setattr..."). It would be related to ftruncate(), since that would do a Setattr of size.

Also, does your test fail when there is no write beyond EOF/truncation, but vfs.nfs.use_buf_pager=1?

Thanks for your help with this, rick
(In reply to Rick Macklem from comment #70)
Setting vfs.nfs.use_buf_pager to either "1" or "0" does not make any difference to the mmap44 test. It fails with the "write beyond EOF / truncate()" enabled. With the nfs_setattr() patch added, both (fsx and mmap44) test scenarios still fail.

Here's the result from mmap44 with truncate() enabled:

05:24 /usr/src/tools/test/stress2/misc $ uname -a
FreeBSD mercat1.netperf.freebsd.org 15.0-CURRENT FreeBSD 15.0-CURRENT #0 main-n268383-a044cf60bd37-dirty: Fri Feb 23 05:06:46 CET 2024 pho@mercat1.netperf.freebsd.org:/usr/src/sys/amd64/compile/PHO amd64
05:24 /usr/src/tools/test/stress2/misc $ sysctl vfs.nfs.use_buf_pager
vfs.nfs.use_buf_pager: 1
05:26 /usr/src/tools/test/stress2/misc $ ./mmap44.sh
/dev/md10 on /mnt (ufs, NFS exported, local, soft-updates)
127.0.0.1:/mnt on /mnt2 (nfs)
3109c3109
< 0141100 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f
---
> 0141100 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e ee
7640c7640
< 0356560 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f
---
> 0356560 70 71 72 73 74 75 76 77 78 79 7a ee 7c 7d 7e 7f
256 -rw-------  1 root wheel 262144 Feb 23 05:27 file
256 -rw-------  1 root wheel 262144 Feb 23 05:26 file.orig
05:27 /usr/src/tools/test/stress2/misc $
(In reply to Peter Holm from comment #64) Hi Peter, I've tried this reproducer and have not reproduced a problem. Could you post mmap44.sh or explain how you run it?
(In reply to Rick Macklem from comment #72)
Right, the reproducer needs a bit of VM pressure to show the issue. I have committed mmap44.sh. Here's how to run it:

# cd /usr/src/tools/test/stress2
# make
===> lib (all)
===> testcases (all)
===> testcases/badcode (all)
:
# cd misc
# ./all.sh -o mmap44.sh
20240224 06:54:54 all: mmap44.sh
/dev/md10 on /mnt (ufs, NFS exported, local, soft-updates)
127.0.0.1:/mnt on /mnt2 (nfs)
3900c3900
< 0171660 b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf
---
> 0171660 b0 b1 ee b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf
7139c7139
< 0337040 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f
---
> 0337040 20 21 22 23 24 25 26 27 28 29 2a 2b ee 2d 2e 2f
14249c14249
< 0675200 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f
---
> 0675200 80 81 82 83 84 85 86 ee 88 89 8a 8b 8c 8d 8e 8f
256 -rw-------  1 root wheel 262144 Feb 24 06:55 file
256 -rw-------  1 root wheel 262144 Feb 24 06:54 file.orig
FAIL mmap44.sh exit code 2
20240224 06:56:01 all.sh done, elapsed 0 day(s), 00:01.07
#

You may need to run it a few times to get the error.
Created attachment 248770 [details]
Force VOP_WRITE() in VOP_PUTPAGES() to be done synchronously

Thanks to Peter, I have been able to reproduce the problem demonstrated by his mmap44.sh test fairly easily (within a few test cycles). With this patch, I have not been able to reproduce the problem in about 25 test cycles. I am hoping that Peter and Alan can test this patch?

I am not sure if it is correct (or the best fix), but my hunch is that the dirty bit for the page is sometimes getting cleared at the wrong time, resulting in the page modification being lost. Without this patch, VOP_WRITE() { ncl_write() } can do the page write synchronously, asynchronously, or delayed, which means the dirty bit gets cleared at different times.
(In reply to Rick Macklem from comment #74)
It seems as though your patch makes it a bit harder to trigger the issue with mmap44.sh. The fsx test still fails as well.

< 0620000 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
---
> 0620000 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d ee 0f
256 -rw-------  1 root wheel 262144 Feb 27 07:32 file
256 -rw-------  1 root wheel 262144 Feb 27 07:31 file.orig
07:32:18, elapsed 0 day(s), 00:14.07
(In reply to Peter Holm from comment #75)
Could you please try to add the following to Rick's last patch?

commit 0f36903e7d1f49632515481ea4b99859d23bd21a
Author: Konstantin Belousov <kib@FreeBSD.org>
Date:   Tue Feb 27 02:09:09 2024 +0200

    ncl_write(): honor ioflags by converting them into buffer flags using
    vfs_bio_set_flags().

diff --git a/sys/fs/nfsclient/nfs_clbio.c b/sys/fs/nfsclient/nfs_clbio.c
index 1cf45bb0c924..d8058e9a125a 100644
--- a/sys/fs/nfsclient/nfs_clbio.c
+++ b/sys/fs/nfsclient/nfs_clbio.c
@@ -1310,6 +1310,8 @@ ncl_write(struct vop_write_args *ap)
 			vfs_bio_set_valid(bp, on, n);
 		}
 
+		vfs_bio_set_flags(bp, ioflag);
+
 		/*
 		 * If IO_SYNC do bwrite().
 		 *
Created attachment 248807 [details]
Force VOP_WRITE() to be synchronous and do not clear dirty bits afterwards

Here's another one that hopefully Peter can test. It doesn't fail mmap44.sh for me in 128 cycles, but the previous one didn't either, so this doesn't mean much.

It is similar to the last one, except it does not clear the dirty bits after VOP_WRITE() when VOP_WRITE() returns no error. I'll admit I do not understand why the code clears the dirty bits after VOP_WRITE() when it does not return an error, since VOP_WRITE() will clear the dirty bits.

(Kostik's addition should probably be applied to this one as well.)

Thanks Peter for doing this testing.
(In reply to Rick Macklem from comment #77)
As I understand it, the clearing of the dirty bits occurs in vfs_busy_pages(). But it only happens if the write indeed goes through the buffer cache, which is not true for appends. So in the case of append writes, nfs_putpages() would not launder the pages? So maybe we should disable this direct write optimization? It now looks very suspicious.
Note that pages cannot become dirty until VOP_PUTPAGES() finishes, because all writeable mappings are invalidated and the pages are shared-busied. The VM does not allow creating a writeable mapping of a busy page.
(In reply to Konstantin Belousov from comment #76)
I ran tests with these combined changes:

--- a/sys/fs/nfsclient/nfs_clbio.c
+++ b/sys/fs/nfsclient/nfs_clbio.c
@@ -330,8 +330,9 @@ ncl_putpages(struct vop_putpages_args *ap)
 	uio.uio_rw = UIO_WRITE;
 	uio.uio_td = td;
 
-	error = VOP_WRITE(vp, &uio, vnode_pager_putpages_ioflags(ap->a_sync),
+	error = VOP_WRITE(vp, &uio, vnode_pager_putpages_ioflags(ap->a_sync | VM_PAGER_PUT_SYNC),
 	    cred);
+if (error != 0) printf("aft putpages=%d\n", error);
 	crfree(cred);
 
 	if (error == 0 || !nfs_keep_dirty_on_error) {
@@ -1310,6 +1311,8 @@ ncl_write(struct vop_write_args *ap)
 			vfs_bio_set_valid(bp, on, n);
 		}
 
+		vfs_bio_set_flags(bp, ioflag);
+
 		/*
 		 * If IO_SYNC do bwrite().
 		 *

but the outcome was the same: instant fsx failure, and mmap44.sh failed after 1h 40m.
(In reply to Rick Macklem from comment #77) Happy to help. I'll run a test with "attachment 248807 [details]" in a few hours from now.
(In reply to Peter Holm from comment #81) I see no different outcome with "attachment 248807 [details]" added.
I tried to shrink the test scenario even further. It now only uses: Mapped read/write, read(2)/write(2) and ftruncate(2). https://people.freebsd.org/~pho/mmap47.sh
(In reply to Peter Holm from comment #83)
Could you try adding a msync(2) call here and see if that stops the failure? (It did so for my testing of the previous version, but that doesn't indicate much, as I've seen for other changes.)

	while (go == 1) {
		i = arc4random() % siz;
		pthread_mutex_lock(&write_mutex);
		c = cp[i];
		cp[i] = 0xee;	/* This value seems to linger with NFS */
		cp[i] = c;
-->		msync(cp, siz, MS_SYNC);
		pthread_mutex_unlock(&write_mutex);
		usleep(arc4random() % 200);
	}

I might not have the call quite right, since I no longer have the patched code for this lying about.

Thanks, rick
(In reply to Konstantin Belousov from comment #78)
> As I understand, the clearing of the dirty bits occur in vfs_busy_pages().
> But it only happens if the write indeed occurs though the buffer cache,
> which is not true for appends. So in case of append writes, nfs_putpages()
> would not laundry the pages?
Since O_APPEND writing is done for things like log files, I'll admit I cannot imagine why such writing would be mixed with mmap()'d writing? At the least, this does not appear to be a part of this test.

> So might be disable this direct write optimization? It now looks very
> suspicious.
I will take a look at how O_APPEND writing is currently being done and see if it looks like it needs to be changed. If you are referring to the stuff that can be enabled for direct writing via a sysctl that is not set by default, then I agree it is sketchy. I'll take a look.
(In reply to Rick Macklem from comment #84) Peter, Forget about trying the "add the msync()" case. I just redid the test and it failed. The fact it fails might be useful information?
	while (go == 1) {
		i = arc4random() % siz;
		pthread_mutex_lock(&write_mutex);
		c = cp[i];
		cp[i] = 0xee;	/* This value seems to linger with NFS */
-->		msync(cp, 0, MS_SYNC);
		cp[i] = c;
		pthread_mutex_unlock(&write_mutex);
		usleep(arc4random() % 200);
	}

I tried the above variant and it not only failed, it failed much more quickly on my test setup. Somehow, the write(2) and ftruncate(2) calls that grow and shrink the file back down trash the "cp[i] = c;" mmap()'d write. I'm still looking...
(In reply to Rick Macklem from comment #87)
I do not know if you noticed this, but the ftruncate(2) call in the mmap47.sh test scenario doesn't change the size of the file. Yet this call seems essential to triggering the corruption.
(In reply to Peter Holm from comment #88)
Please try the following:

commit fc160bc8d3296b16e1882eed7355309a8eefcdcb
Author: Konstantin Belousov <kib@FreeBSD.org>
Date:   Mon Mar 4 09:26:55 2024 +0200

    allocbuf()/vfs_vmio_extend(): assume zero-sized buffers are B_CACHE

    vfs_vmio_extend() tests each added page for valid, and then extends the
    validity of B_CACHE flag if all added pages are valid. But, if we have a
    blank buffer without any content yet, the B_CACHE flag is not set yet,
    and validity test is not executed due to this. This makes the buffer
    incoherent with the page content, loosing data.

diff --git a/sys/kern/vfs_bio.c b/sys/kern/vfs_bio.c
index b5466fb2cd53..3ec886a32ded 100644
--- a/sys/kern/vfs_bio.c
+++ b/sys/kern/vfs_bio.c
@@ -3175,6 +3175,8 @@ vfs_vmio_extend(struct buf *bp, int desiredpages, int size)
 	 * B_CACHE may remain set! XXX
 	 */
 	toff = bp->b_bcount;
+	if (toff == 0)
+		bp->b_flags |= B_CACHE;
 	tinc = PAGE_SIZE - ((bp->b_offset + toff) & PAGE_MASK);
 	while ((bp->b_flags & B_CACHE) && toff < size) {
 		vm_pindex_t pi;

(both my patches are available on the ufs1 branch of deviant3)
(In reply to Konstantin Belousov from comment #89)
Unfortunately, this did not seem to have any effect on the file corruption.

09:38:03 Start test of mmap47.sh
09:39:13, elapsed 0 days, 00:01.10
09:40:19, elapsed 0 days, 00:02.16
09:41:23, elapsed 0 days, 00:03.20
09:42:28, elapsed 0 days, 00:04.25
09:43:31, elapsed 0 days, 00:05.28
1050c1050
< 0040620 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f
---
> 0040620 90 91 92 93 94 95 96 97 ee 99 9a 9b 9c 9d 9e 9f
256 -rw-------  1 root wheel 262144 Mar 4 09:44 file
256 -rw-------  1 root wheel 262144 Mar 4 09:43 file.orig
09:44:35, elapsed 0 days, 00:06.32
09:44 /usr/src/tools/test/stress2/misc $ uname -a
FreeBSD mercat1.netperf.freebsd.org 15.0-CURRENT FreeBSD 15.0-CURRENT #0 ufs1-n268657-fc160bc8d32: Mon Mar 4 09:21:15 CET 2024 pho@mercat1.netperf.freebsd.org:/var/tmp/deviant3/sys/amd64/compile/PHO amd64
09:44 /usr/src/tools/test/stress2/misc $
(In reply to Peter Holm from comment #88) I tried commenting out the ncl_vinvalbuf() call in nfs_setattr() when size is changed. The failure still occurred. To be honest, I can't remember why/if this call is needed? It seems that ncl_meta_setsize()->vtruncbuf() would do all that is necessary. Anyhow, with ncl_vinvalbuf() commented out, the vtruncbuf() called from ncl_meta_setsize() is the only thing that is done for ftruncate() that seems it might be related to this and commenting it out breaks a bunch of stuff. I am not suggesting that vtruncbuf() is broken, but it seems that it helps cause the failure.
This is just an observation. By replacing the ftruncate() call in mmap47.sh with a fdatasync() I also see the data corruption:

20:53:22 Start test of mmap47a.sh
20:54:27, elapsed 0 days, 00:01.05
20:55:31, elapsed 0 days, 00:02.09
20:56:35, elapsed 0 days, 00:03.13
20:57:41, elapsed 0 days, 00:04.19
20:58:45, elapsed 0 days, 00:05.23
20:59:49, elapsed 0 days, 00:06.27
12661c12661
< 0613500 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f
---
> 0613500 40 41 42 43 44 45 46 ee 48 49 4a 4b 4c 4d 4e 4f
256 -rw-------  1 root wheel 262144 Mar  5 21:00 file
256 -rw-------  1 root wheel 262144 Mar  5 20:59 file.orig
21:01:08, elapsed 0 days, 00:07.46
Failed with exit code 2 after 7 loops of mmap47a.sh.

21:01 /usr/src/tools/test/stress2/misc $ diff -u mmap47.sh mmap47a.sh
--- mmap47.sh	2024-03-01 20:00:28.468087000 +0100
+++ mmap47a.sh	2024-03-05 20:53:07.446204000 +0100
@@ -126,8 +126,13 @@
 	while (go == -1)
 		usleep(50);
 	while (go == 1) {
+#if 0
 		if (ftruncate(fd, siz) == -1) /* No size change */
 			err(1, "truncate)");
+#else
+		if (fdatasync(fd) == -1)
+			err(1, "fdatasync()");
+#endif
 		usleep(arc4random() % 1000);
 	}
 	return (0);
21:20 /usr/src/tools/test/stress2/misc $
Is there a way to disable nfsiods at all? Maybe setting vfs.nfs.iodmin = vfs.nfs.iodmax = 0 would do it; if not, I will try to provide a patch to disable them. It would be interesting to know whether the corruption is still reproducible with nfsiods disabled.
(In reply to Konstantin Belousov from comment #93)
Yes, if you set both vfs.nfs.iodmin and vfs.nfs.iodmax to 0 before doing the first NFS mount after booting, there should be no nfsiod threads. (If you set them to 0 after a mount, I'm not sure how long it takes for them to go away?)

You can check via:
# ps axH | fgrep iod

I ran 64 cycles of mmap44.sh without a failure, but that doesn't tell us much.

Hopefully Peter can try this, rick
I ran the test scenario variations I have for 3 hours without seeing any problems. This with:

16:49 /usr/src/tools/test/stress2/misc $ sysctl vfs.nfs.iodmin vfs.nfs.iodmax
vfs.nfs.iodmin: 0
vfs.nfs.iodmax: 0
16:50 /usr/src/tools/test/stress2/misc $

Unfortunately, the original test scenario still fails:

mount -t nfs -o tcp -o retrycnt=3 -o rw \
    127.0.0.1:$mp1 $mp2; s=$?
(
cargo run --release -- -N 1024 -S 3381155135556591634 -f copy_file_range.toml -m 149264:149265 -v -v -v $mp2/foo.bin # FAIL
s=$?
echo "Exit status is $s"
) 2>&1 | grep -Ev 'DEBUG|INFO|WARN'
    Finished release [optimized] target(s) in 2.97s
     Running `target/release/fsx -N 1024 -S 3381155135556591634 -f copy_file_range.toml -m '149264:149265' -v -v -v /mnt11/foo.bin`
[ERROR fsx] miscompare: offset= 0x1eedd, size = 0x19f6
[ERROR fsx] OFFSET  GOOD  BAD  RANGE
[ERROR fsx] 0x1eedd 0x38  0x35 0x6c6
[ERROR fsx] Step# (mod 256) for a misdirected write may be 145
[ERROR fsx] Using seed 3381155135556591634
[ERROR fsx] LOG DUMP
[ERROR fsx]  1 SKIPPED (read)
:
[ERROR fsx] 699 MAPWRITE 0x563e  => 0x1445e ( 0xee20 bytes)
[ERROR fsx] 700 READ     0x1eedd => 0x208d3 ( 0x19f6 bytes)
Exit status is 1
umount $mp2
umount $mp1
mdconfig -d -u $mdstart
exit 0
16:52 /home/pho/fsx-rs $
(In reply to Peter Holm from comment #95)
This .toml config enables copy_file_range() usage, am I right? Could you please try, in addition to disabling nfsiods on the client, disabling copy_range on the server by setting vfs.nfsd.maxcopyrange to zero?
This is the config file I used:

cat > copy_file_range.toml <<HERE
nomsyncafterwrite = true
[weights]
truncate = 1
fdatasync = 1
fsync = 1
write = 10
read = 10
punch_hole = 0
posix_fallocate = 0
sendfile = 0
HERE

I added a "copy_file_range = 0", but fsx still fails. I ran this right after a boot:

sysctl vfs.nfs.iodmin=0
vfs.nfs.iodmin: 0 -> 0
sysctl vfs.nfs.iodmax=0
vfs.nfs.iodmax: 20 -> 0
sysctl vfs.nfsd.maxcopyrange=0
vfs.nfsd.maxcopyrange: 9223372036854775807 -> 0
root@mercat1:~ # ps auxwwl | grep -v grep | grep iod
root@mercat1:~ #
Please try this instead of disabling copy_range on the server:

diff --git a/sys/fs/nfsclient/nfs_clvnops.c b/sys/fs/nfsclient/nfs_clvnops.c
index 0b8c587a542c..8a98b047c0d5 100644
--- a/sys/fs/nfsclient/nfs_clvnops.c
+++ b/sys/fs/nfsclient/nfs_clvnops.c
@@ -3904,7 +3904,7 @@ nfs_copy_file_range(struct vop_copy_file_range_args *ap)
 			error = ncl_flush(invp, MNT_WAIT, curthread, 1, 0);
 	}
 	if (error == 0)
-		error = ncl_vinvalbuf(outvp, V_SAVE, curthread, 0);
+		error = ncl_flush(outvp, MNT_WAIT, curthread, 1, 0);
 
 	/* Do the actual NFSv4.2 RPC. */
 	ret = ret2 = 0;

(nfsiods must be disabled)
Sure. Here's another observation: a ktrace(1) of the syscalls involved in a failing fsx run:

09:19 /home/pho/fsx-rs $ kdump | grep -E "fsx.*CALL" | sed 's/ *4850 //;s/fsx *//;s/(.*)/()/' | sort | uniq -c | sort -rn
1408 CALL write()
 700 CALL compat11.fstat()
 699 CALL lseek()
 373 CALL mmap()
 182 CALL munmap()
 161 CALL pread()
 158 CALL pwrite()
  52 CALL ftruncate()
  23 CALL open()
  14 CALL fdatasync()
  13 CALL sigprocmask()
  12 CALL fsync()
  11 CALL mprotect()
   9 CALL close()
   7 CALL fstat()
   6 CALL sigaction()
   5 CALL fstatfs()
   4 CALL read()
   3 CALL sigfastblock()
   3 CALL sigaltstack()
   3 CALL openat()
   2 CALL issetugid
   2 CALL getpid
   1 CALL thr_wake()
   1 CALL thr_self()
   1 CALL sysarch()
   1 CALL rtprio_thread()
   1 CALL readlink()
   1 CALL poll()
   1 CALL ioctl()
   1 CALL getrandom()
   1 CALL getcontext()
   1 CALL exit()
   1 CALL cpuset_getaffinity()
   1 CALL _umtx_op()
   1 CALL __sysctlbyname()
09:19 /home/pho/fsx-rs $
fsx still fails with this change to nfs_clvnops.c:

10:21 /home/pho/fsx-rs $ (cd /usr/src/sys/fs/nfsclient; git diff nfs_clvnops.c)
diff --git a/sys/fs/nfsclient/nfs_clvnops.c b/sys/fs/nfsclient/nfs_clvnops.c
index 0b8c587a542c..976bedee41c4 100644
--- a/sys/fs/nfsclient/nfs_clvnops.c
+++ b/sys/fs/nfsclient/nfs_clvnops.c
@@ -3904,7 +3904,8 @@ nfs_copy_file_range(struct vop_copy_file_range_args *ap)
 			error = ncl_flush(invp, MNT_WAIT, curthread, 1, 0);
 	}
 	if (error == 0)
-		error = ncl_vinvalbuf(outvp, V_SAVE, curthread, 0);
+//		error = ncl_vinvalbuf(outvp, V_SAVE, curthread, 0);
+		error = ncl_flush(outvp, MNT_WAIT, curthread, 1, 0);
 
 	/* Do the actual NFSv4.2 RPC. */
 	ret = ret2 = 0;
10:21 /home/pho/fsx-rs $ uname -a
FreeBSD mercat1.netperf.freebsd.org 15.0-CURRENT FreeBSD 15.0-CURRENT #1 main-n268827-75464941dc17-dirty: Tue Mar 19 09:58:11 CET 2024 pho@mercat1.netperf.freebsd.org:/usr/src/sys/amd64/compile/PHO amd64
10:21 /home/pho/fsx-rs $ ps auxwwl | grep -v grep | grep iod
10:21 /home/pho/fsx-rs $
(In reply to Peter Holm from comment #99)
Are you saying that the fsx run with copy_file_range enabled does not use the syscall but still fails? And what happens if copy_file_range is disabled in the .toml?
Created attachment 249298 [details] ktrace run of fsx
What I see is that an fsx run fails both with and without "copy_file_range = 0" added to the config file, and that I see no reference to a copy_file_range() call in the ktrace. I have uploaded the full kdump output to https://people.freebsd.org/~pho/ktrace.log.xz
Created attachment 250535 [details]
Call vnode_pager_clean_sync() before ncl_meta_setsize() in nfs_setattr

Hi Peter,

Maybe you can test this patch. I am thinking that, maybe, ncl_meta_setsize() is marking B_RELBUF and doing brelse(), throwing away the dirty page.

Thanks, rick
(In reply to Rick Macklem from comment #104) Hi Rick, I see the same corruption with attachment_250535.diff added. Both the original fsx program and mmap44.sh fails.
(In reply to Peter Holm from comment #105) Oh well. Thanks for testing it, Peter.