Created attachment 239910 [details] Test program that triggers the error on the NFS client Similar to 269261, it is possible to trigger data corruption on an NFS client (In my setup, NFS 4.2 served by FreeBSD 13.1 backed by ZFS) by using fspacectl on an mmap()ed file. Steps to reproduce: 0) Install a Rust toolchain 1) checkout https://github.com/asomers/fsx-rs.git at rev c3e726d 2) cd fsx-rs 3) cargo build 4) Somehow mount an nfs file system 5) cat <<HERE > fsx.toml flen = 1048576 nomsyncafterwrite = true [opsize] max = 393216 [weights] read = 0 write = 0 mapread = 10 mapwrite = 10 truncate = 0 punch_hole = 10 HERE 6) cargo run -- -f fsx.toml -N 1000 -P /tmp -S 9361357600271583733 ~/tmp/fsx.bin Alternate steps to reproduce: 1) Compile the attached program 2) ./fspace-and-mmap /path/to/nfs_filesystem/x.bin"
I wonder if 3b6056204dd8 that was just committed by kib@ might fix this? For a 13.1 server (which does not support the Deallocate operation), the client uses vop_stddeallocate() and it uses vn_bmap_seekhole_locked(). If it doesn't fix the problem, then I'd guess nfs_deallocate() needs a fix for the generic case where it calls vop_stddeallocate(). nfs_deallocate() does call vnode_pager_purge_range() before doing the RPC, but that won't happen against a 13.1 server. It's not clear to me how vnode_pager_purge_range() compares to doing a vm_object_page_clean()? I'll try to test against a 14 server that supports Deallocate. If you can test with a client with 3b6056204dd8 and let us know if it helps, that would be nice.
No, 3b6056204dd8 doesn't fix it. I've already tested both with and without that change, but always with a 13.1 server.
It would be somewhat interesting to see where the corruption occurs, on client or on server. My intuition is that the problem happens on client, esp. because server does not have fspacectl() at all. From the code reading, nfs client does enough page flushing both in nfs_deallocate() and in ncl_vinvalbuf(). At least, it is not obvious what is missed and why msync(2) before deallocation helps. The sequence seems to be, from nfs_deallocate() for NFSv4.2: ncl_vinvalbuf(V_SAVE) vm_object_page_clean(OBJPC_SYNC) <- sync pages synchronously vinvalbuf(V_SAVE) vm_object_page_remove(OBJPR_CLEANONLY) <- only remove clean pages vnode_pager_purge_range() vm_object_page_remove(0) nfsrpc_deallocate() It might be that a clean page somehow avoided being removed from the queue, but as I said, I do not see how this could occur. Or it might be the handling of the partially deallocated pages is somewhat wrong.
BTW, is the statement that msync(2)ing before deallocating avoids corruption, true for NFS?
(In reply to Konstantin Belousov from comment #4) Yes, enabling msync after every mmap write resolves the problem.
(In reply to Alan Somers from comment #5) Is there a difference between msyncing after each write vs. msyncing before each deallocate?
(In reply to Konstantin Belousov from comment #6) It seems that doing msync just before fspacectl doesn't fix the problem at all. Possibly that's because the test program immediately unmaps the file after the mmap write, and then later has to mmap it again just for the msync?
Created attachment 239937 [details] do a VOP_FSYNC() after vop_stddeallocate() to push writes of 0s This simple patch fixes the case that the test program finds. I am not 100% sure why it does fix this, but the code should always have done this since having the changes committed to the server before VOP_DEALLOCATE() returns is required. Btw, for a newer NFSv4.2 server that supports a Deallocate operation, your test program works. (I hacked my server so that Deallocate was disabled and was able to reproduce the failure with your test program.) Maybe asomers@ can test the patch and see if it fixes all the NFS cases he observes?
(In reply to Rick Macklem from comment #8) I wonder if it is possible to check the corruption of the file on server, instead of the client. Also, with regard to msync-ing the file before deallocation, you could have a permanent mapping for the file up to its max size, and msync(2) the mapping.
Yes, the file ends up corrupted on the server unless the client does VOP_FSYNC() after the vop_stddeallocet() call. I replaced "random()" with 0xcc in the test program, so that the data would be consistent between runs. For the VOP_FSYNC() case, the file is consistently correct on the server (I looked at it with "od" and compared that with what it should have been, given the do_mapwrite() calls) and it was correct. Without VOP_FSYNC(), the file was not the same on the server. One of the non-zero ranges was all 0'd out. I think that one of the writes of zeros was delayed until after the non-zero write for the second range. (I will try and wade through the packet traces, but since the VOP_WRITE()s done by vop_stddeallocate() were not aligned, there are a lot of reads for the blocks as well as writes.) Since the NFS server could crash just after fspacectl() returns success, I think the VOP_FSYNC() should be done. However, "man fspacectl" does not clarify if the hole should exist if the machine crashes immediately after fspacectl() returns success for a local fs. Since there is no standard, this should probably be decided by the "freebsd collective". If the answer is "yes, the hole should be there after the machine reboots" then VOP_FSYNC() will need to be added to vop_stddeallocate(), I think? Hopefully asomers@ can test with the VOP_FSYNC() patch, to see if all his test cases work for NFS?
With rmacklem's fsync patch, I can no longer produce any failures.
(In reply to Rick Macklem from comment #10) But where does the delayed write came from? From what I see in the code (see the comment #3), all caches. i.e. both page cache and buffers, are reclaimed. Even clean pages/buffers are reclaimed, synchronously. The presence of the delayed write means that either some of them are still there. Or might be there is an in-flight RPC still not finished? But again, reclamation is synchronous.
For this case, the code you commented on is not being executed (at leaast not after the first failed attempt). The 13.1 server just replies NFSERR_NOTSUPP to the Deallocate RPC. The code that is executed is the call to vop_stddeallocate() further down in nfs_dealloacate(), just like NFSv3/4.0/4.1.
W.r.t. "where do the delayed writes come from". That's simply nfs_deallocate()->vop_stddeallocate()->vp_zerofill()->VOP_WRITE(). Remember that, for asomer's case (a 13.1 server that does not have support for Deallocate on NFSv4.2), the code that is executed is not the stuff you were commenting on before. nfs_deallocate() is just calling vop_stddeallocate(). I looked at the packet trace for the failure case (no VOP_FSYNC()) and here's what I see (leaving the reads and other stuff out): - write at 745472 of data (for do_mapwrite( ,0xb65af)) - 4 writes of 0s for fspacectl() - write at 262144 of data (starts with 0s) (for do_mapwrite( ,0x40d57)) - 2 writes starting at 864256 of data (for do_mapwrite( ,0d3b4b)) The 2 writes for do_mapwrite( ,0x775cb) are missing. This is the one that overlaps with fspacectl(). So, it seems that the write of 0s done through the buffer cache does something like cleans the page(s) of the write when it completes and this undirty's the page(s) written by the do_mapwrite() { or something like that? The buffer cache vs paging is something I do not understand }. Adding the VOP_FSYNC() ensures that the writes of 0s done by vop_stddeallocate() have completed before fspacectl() returns, so the do_mapwrite() doesn't get stomped on. Btw, I replaced fspacectl() with do_write() and it breaks the same way as fspacectl() without the VOP_FSYNC(). - If this case needs to work without msync(), all I can think of is doing VOP_WRITE()s synchronously whenever the file is mmap'd. (I think setting IO_SYNC is sufficient for this, but I'd need to look at the code and try it.)
Doing VOP_WRITEs synchronously whenever the file is mmap()ed makes sense to me. mmap sucks, and I'd rather not optimize for it. Is there an easy way to tell if a file is mmap()ed?
well, I noticed that your test program mmap()s and munmap()s for every do_mapwrite(), so doing VOP_WRITE()s synchronously when mmap'd wouldn't help here. Maybe Kostik understands how bdone() ends up clearing the pages so that the write for a do_mapwrite() never happens? I am comfortable with adding VOP_FSYNC() to nfs_deallocate() for the vop_stddeallocate() case, since it is done synchronously when the 4.2 server supports Deallocate. The man page for msync(2) is pretty vague w.r.t. when it is needed. It simply states "usually not needed".
What about implementing a NFS VOP_MMAPPED() and have it do the equivalent of VOP_FSYNC()? (I think I'll code this and try it.) For your test program, that would result in a VOP_FSYNC() every time mmap(2) is done. Al, have you tried a similar test program where the mmap(2) is done when the file is open()'d instead of for every do_mapwrite()?
(In reply to Rick Macklem from comment #17) Does VOP_MMAPPED() get called as soon as the file is mapped? I think that's too soon. At that point, the problematic writes haven't happened yet. And no, I haven't tried persistently mapping the file.
This is all quite strange. Our buffer cache should be coherent with the page cache. The fact that simple write(2) can become inconsistent with the mmaped write means we have the big issue for NFS. Buffer cache consists of just buffer headers that collect a run of pages that represent the same frames as the buffer, in the file range. Dirty pages are cleaned by the vfs_busy_pages(bp, TRUE) call when bwrite() is executing. Since then, the dirtyness is carried by the buffer and not by pages. This is why we have to flush both pages and buffers when we know that the server supports deallocate. In NFS case, since it has its own bufwrite() implementation, vfs_busy_pages(TRUE) is called by NFS code in ncl_writebp() (and in ncl_flush()). From your explanation, it is still not clear to me how did we lost a write. It must be some case of doing direct write instead going through the buffer cache, I guess? Was the file opened with O_DIRECT?
Created attachment 239981 [details] change ncl_flush() so that it does not clear the page dirty bit for commit I think this large complex (1 line) patch fixes the problem. It works for the variants of the test program I have. (The original plus one that does do_write() instead of fspacectl() and one that does the mmap() in main instead of per-do_mapread() and do_mapwrite().) Thanks go to Kostik's explanation which allowed me find it. Basically, when ncl_flush() was called it would do vfs_busy_pages(vp, 1) before doing a commit RPC. However, the commit RPC does not write data to the server, it tells the server to commit data it already has to stable storage. --> As such, it should not clear the dirty bits on the pages. I will look through the commit logs to see how long this has been broken. I suspect a long time.
(In reply to Rick Macklem from comment #20) Now I do not understand why vfs_busy_pages() is done in that part of the ncl_flush() loop at all. There is bwrite() call later, which should handle undirty. But why do you need busy pages while doing commit RPCs? Is the intent to avoid userspace writes through mmap on the committed range?
Well, after doing the vfs_busy_pages() and the commit RPC, there is a call to bufdone() at around line#3100. bufdone calls vfs_vmio_iodone etc, so I think the pages need to be busied for that. There may be some other reasons that I can't spot at a quick glance.
(In reply to Rick Macklem from comment #22) Looks fine then, please commit your patch. If something is left, we can continue the investigation (for nfs).
Alan, can you please test the new patch to see if it fixes your tests for NFS? Thanks, rick
The latest patch works for me.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=847967bc4e198a258b030a5864e64e029e7452e5 commit 847967bc4e198a258b030a5864e64e029e7452e5 Author: Rick Macklem <rmacklem@FreeBSD.org> AuthorDate: 2023-02-08 22:25:01 +0000 Commit: Rick Macklem <rmacklem@FreeBSD.org> CommitDate: 2023-02-08 22:25:01 +0000 nfscl: Fix interaction between mmap'd and VOP_WRITE file updates asomers@ found a problem with the NFS client, where a write to an NFS mounted file done via mmap(2) was lost when fspacectl(2) was done before it. This turned out to be caused by clearing the dirty bit on pages when the client was doing commit RPCs, due to the second argument to vfs_busy_pages() being set to 1. Commit RPCs tell the server to commit previously written data to stable storage. However, Commit RPCs do not write data from the client to the server. As such, if the dirty bit on the page has been set by a mmap'd write to an address in the page, it should not be cleared. Clearing it causes the mmap'd write to by lost. This patch fixes the problem by changing the 2nd argument to vfs_busy_pages() to 0 for this case. I doubt this bug has affected many, since it was inherited from the old NFS client and was in 4.3 FreeBSD twenty years ago. Although fspacectl(2) is FreeBSD 14 specific, a write(2) would cause the same failure. Reviewed by: kib Tested by: asomers PR: 269328 MFC after: 1 week sys/fs/nfsclient/nfs_clvnops.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
Patch has been committed to main and will be MFC'd.
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=335517b9b02bc37f4d532998dc8e3ce5757cdef6 commit 335517b9b02bc37f4d532998dc8e3ce5757cdef6 Author: Rick Macklem <rmacklem@FreeBSD.org> AuthorDate: 2023-02-08 22:25:01 +0000 Commit: Rick Macklem <rmacklem@FreeBSD.org> CommitDate: 2023-02-21 23:32:40 +0000 nfscl: Fix interaction between mmap'd and VOP_WRITE file updates asomers@ found a problem with the NFS client, where a write to an NFS mounted file done via mmap(2) was lost when fspacectl(2) was done before it. This turned out to be caused by clearing the dirty bit on pages when the client was doing commit RPCs, due to the second argument to vfs_busy_pages() being set to 1. Commit RPCs tell the server to commit previously written data to stable storage. However, Commit RPCs do not write data from the client to the server. As such, if the dirty bit on the page has been set by a mmap'd write to an address in the page, it should not be cleared. Clearing it causes the mmap'd write to by lost. This patch fixes the problem by changing the 2nd argument to vfs_busy_pages() to 0 for this case. I doubt this bug has affected many, since it was inherited from the old NFS client and was in 4.3 FreeBSD twenty years ago. Although fspacectl(2) is FreeBSD 14 specific, a write(2) would cause the same failure. PR: 269328 (cherry picked from commit 847967bc4e198a258b030a5864e64e029e7452e5) sys/fs/nfsclient/nfs_clvnops.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
Can we close this bug now?