Bug 269328 - nfs client: data corruption using fspacectl and mmap
Summary: nfs client: data corruption using fspacectl and mmap
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Many People
Assignee: Rick Macklem
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-02-04 22:02 UTC by Alan Somers
Modified: 2023-09-06 19:45 UTC (History)
4 users (show)

See Also:
rmacklem: mfc-stable13?
rmacklem: mfc-stable12?


Attachments
Test program that triggers the error on the NFS client (3.72 KB, text/plain)
2023-02-04 22:02 UTC, Alan Somers
no flags Details
do a VOP_FSYNC() after vop_stddeallocate() to push writes of 0s (472 bytes, patch)
2023-02-06 01:43 UTC, Rick Macklem
no flags Details | Diff
change ncl_flush() so that it does not clear the page dirty bit for commit (341 bytes, patch)
2023-02-08 04:45 UTC, Rick Macklem
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Alan Somers freebsd_committer freebsd_triage 2023-02-04 22:02:18 UTC
Created attachment 239910 [details]
Test program that triggers the error on the NFS client

Similar to 269261, it is possible to trigger data corruption on an NFS client (In my setup, NFS 4.2 served by FreeBSD 13.1 backed by ZFS) by using fspacectl on an mmap()ed file.

Steps to reproduce:
0) Install a Rust toolchain
1) checkout https://github.com/asomers/fsx-rs.git at rev c3e726d
2) cd fsx-rs
3) cargo build
4) Somehow mount an nfs file system
5) cat <<HERE > fsx.toml
  flen = 1048576
  nomsyncafterwrite = true
  [opsize]
  max = 393216
  [weights]
  read = 0
  write = 0
  mapread = 10
  mapwrite = 10
  truncate = 0
  punch_hole = 10
HERE
6) cargo run -- -f fsx.toml -N 1000 -P /tmp -S 9361357600271583733 ~/tmp/fsx.bin

Alternate steps to reproduce:
1) Compile the attached program
2) ./fspace-and-mmap /path/to/nfs_filesystem/x.bin"
Comment 1 Rick Macklem freebsd_committer freebsd_triage 2023-02-05 01:07:28 UTC
I wonder if 3b6056204dd8 that was just committed
by kib@ might fix this?

For a 13.1 server (which does not support the
Deallocate operation), the client uses
vop_stddeallocate() and it uses
vn_bmap_seekhole_locked().

If it doesn't fix the problem, then I'd guess
nfs_deallocate() needs a fix for the generic
case where it calls vop_stddeallocate().
nfs_deallocate() does call
vnode_pager_purge_range() before doing the RPC,
but that won't happen against a 13.1 server.
It's not clear to me how vnode_pager_purge_range()
compares to doing a vm_object_page_clean()?

I'll try to test against a 14 server that supports
Deallocate.

If you can test with a client with 3b6056204dd8
and let us know if it helps, that would be nice.
Comment 2 Alan Somers freebsd_committer freebsd_triage 2023-02-05 01:39:13 UTC
No, 3b6056204dd8 doesn't fix it.  I've already tested both with and without that change, but always with a 13.1 server.
Comment 3 Konstantin Belousov freebsd_committer freebsd_triage 2023-02-05 22:47:06 UTC
It would be somewhat interesting to see where the corruption occurs, on client
or on server.  My intuition is that the problem happens on client, esp. because
server does not have fspacectl() at all.

From the code reading, nfs client does enough page flushing both in
nfs_deallocate() and in ncl_vinvalbuf().  At least, it is not obvious what
is missed and why msync(2) before deallocation helps. The sequence seems to
be, from nfs_deallocate() for NFSv4.2:
  ncl_vinvalbuf(V_SAVE)
    vm_object_page_clean(OBJPC_SYNC) <- sync pages synchronously
    vinvalbuf(V_SAVE)
      vm_object_page_remove(OBJPR_CLEANONLY) <- only remove clean pages
    vnode_pager_purge_range()
      vm_object_page_remove(0)
    nfsrpc_deallocate()

It might be that a clean page somehow avoided being removed from the queue,
but as I said, I do not see how this could occur.  Or it might be the handling
of the partially deallocated pages is somewhat wrong.
Comment 4 Konstantin Belousov freebsd_committer freebsd_triage 2023-02-05 22:51:17 UTC
BTW, is the statement that msync(2)ing before deallocating avoids corruption,
true for NFS?
Comment 5 Alan Somers freebsd_committer freebsd_triage 2023-02-05 23:29:49 UTC
(In reply to Konstantin Belousov from comment #4)
Yes, enabling msync after every mmap write resolves the problem.
Comment 6 Konstantin Belousov freebsd_committer freebsd_triage 2023-02-05 23:33:20 UTC
(In reply to Alan Somers from comment #5)
Is there a difference between msyncing after each write vs. msyncing before
each deallocate?
Comment 7 Alan Somers freebsd_committer freebsd_triage 2023-02-05 23:43:19 UTC
(In reply to Konstantin Belousov from comment #6)
It seems that doing msync just before fspacectl doesn't fix the problem at all.  Possibly that's because the test program immediately unmaps the file after the mmap write, and then later has to mmap it again just for the msync?
Comment 8 Rick Macklem freebsd_committer freebsd_triage 2023-02-06 01:43:26 UTC
Created attachment 239937 [details]
do a VOP_FSYNC() after vop_stddeallocate() to push writes of 0s

This simple patch fixes the case that the
test program finds.  I am not 100% sure why
it does fix this, but the code should always
have done this since having the changes
committed to the server before VOP_DEALLOCATE()
returns is required.

Btw, for a newer NFSv4.2 server that supports a
Deallocate operation, your test program works.
(I hacked my server so that Deallocate was
disabled and was able to reproduce the failure
with your test program.)

Maybe asomers@ can test the patch and see if
it fixes all the NFS cases he observes?
Comment 9 Konstantin Belousov freebsd_committer freebsd_triage 2023-02-06 21:40:36 UTC
(In reply to Rick Macklem from comment #8)
I wonder if it is possible to check the corruption of the file on server,
instead of the client.

Also, with regard to msync-ing the file before deallocation, you could
have a permanent mapping for the file up to its max size, and msync(2) the
mapping.
Comment 10 Rick Macklem freebsd_committer freebsd_triage 2023-02-07 01:08:42 UTC
Yes, the file ends up corrupted on the server
unless the client does VOP_FSYNC() after
the vop_stddeallocet() call.

I replaced "random()" with 0xcc in the test
program, so that the data would be consistent
between runs.
For the VOP_FSYNC() case, the file is consistently
correct on the server (I looked at it with "od" and
compared that with what it should have been, given
the do_mapwrite() calls) and it was correct.

Without VOP_FSYNC(), the file was not the same on the
server. One of the non-zero ranges was all 0'd out.
I think that one of the writes of zeros was delayed
until after the non-zero write for the second range.
(I will try and wade through the packet traces, but
since the VOP_WRITE()s done by vop_stddeallocate()
were not aligned, there are a lot of reads for the
blocks as well as writes.)

Since the NFS server could crash just after fspacectl()
returns success, I think the VOP_FSYNC() should be
done.
However, "man fspacectl" does not clarify if the hole
should exist if the machine crashes immediately after
fspacectl() returns success for a local fs.
Since there is no standard, this should probably be
decided by the "freebsd collective". If the answer is
"yes, the hole should be there after the machine reboots"
then VOP_FSYNC() will need to be added to vop_stddeallocate(),
I think?

Hopefully asomers@ can test with the VOP_FSYNC() patch, to
see if all his test cases work for NFS?
Comment 11 Alan Somers freebsd_committer freebsd_triage 2023-02-07 02:35:19 UTC
With rmacklem's fsync patch, I can no longer produce any failures.
Comment 12 Konstantin Belousov freebsd_committer freebsd_triage 2023-02-07 03:44:51 UTC
(In reply to Rick Macklem from comment #10)
But where does the delayed write came from?

From what I see in the code (see the comment #3), all caches. i.e. both
page cache and buffers, are reclaimed.  Even clean pages/buffers are
reclaimed, synchronously.  The presence of the delayed write means that
either some of them are still there.  Or might be there is an in-flight
RPC still not finished?  But again, reclamation is synchronous.
Comment 13 Rick Macklem freebsd_committer freebsd_triage 2023-02-07 05:40:42 UTC
For this case, the code you commented on
is not being executed (at leaast not after
the first failed attempt).  The 13.1 server
just replies NFSERR_NOTSUPP to the Deallocate
RPC.  The code that is executed is the call
to vop_stddeallocate() further down in
nfs_dealloacate(), just like NFSv3/4.0/4.1.
Comment 14 Rick Macklem freebsd_committer freebsd_triage 2023-02-07 22:44:34 UTC
W.r.t. "where do the delayed writes come
from". That's simply
nfs_deallocate()->vop_stddeallocate()->vp_zerofill()->VOP_WRITE().
Remember that, for asomer's case (a 13.1 server
that does not have support for Deallocate on NFSv4.2),
the code that is executed is not the stuff you were
commenting on before. nfs_deallocate() is just calling
vop_stddeallocate().

I looked at the packet trace for the failure case
(no VOP_FSYNC()) and here's what I see (leaving the
reads and other stuff out):

- write at 745472 of data (for do_mapwrite( ,0xb65af))
- 4 writes of 0s for fspacectl()
- write at 262144 of data (starts with 0s) (for do_mapwrite( ,0x40d57))
- 2 writes starting at 864256 of data  (for do_mapwrite( ,0d3b4b))

The 2 writes for do_mapwrite( ,0x775cb) are missing.
This is the one that overlaps with fspacectl().

So, it seems that the write of 0s done through the buffer
cache does something like cleans the page(s) of the write
when it completes and this undirty's the page(s) written
by the do_mapwrite() { or something like that? The buffer
cache vs paging is something I do not understand }.

Adding the VOP_FSYNC() ensures that the writes of 0s done
by vop_stddeallocate() have completed before fspacectl()
returns, so the do_mapwrite() doesn't get stomped on.

Btw, I replaced fspacectl() with do_write() and it breaks the
same way as fspacectl() without the VOP_FSYNC().
- If this case needs to work without msync(), all I can think
  of is doing VOP_WRITE()s synchronously whenever the file is
  mmap'd. (I think setting IO_SYNC is sufficient for this, but
  I'd need to look at the code and try it.)
Comment 15 Alan Somers freebsd_committer freebsd_triage 2023-02-07 23:34:14 UTC
Doing VOP_WRITEs synchronously whenever the file is mmap()ed makes sense to me.  mmap sucks, and I'd rather not optimize for it.  Is there an easy way to tell if a file is mmap()ed?
Comment 16 Rick Macklem freebsd_committer freebsd_triage 2023-02-08 00:13:28 UTC
well, I noticed that your test program mmap()s
and munmap()s for every do_mapwrite(), so
doing VOP_WRITE()s synchronously when mmap'd
wouldn't help here.

Maybe Kostik understands how bdone() ends up
clearing the pages so that the write for a
do_mapwrite() never happens?

I am comfortable with adding VOP_FSYNC() to
nfs_deallocate() for the vop_stddeallocate()
case, since it is done synchronously when the
4.2 server supports Deallocate.

The man page for msync(2) is pretty vague w.r.t.
when it is needed.  It simply states "usually not needed".
Comment 17 Rick Macklem freebsd_committer freebsd_triage 2023-02-08 00:36:49 UTC
What about implementing a NFS VOP_MMAPPED()
and have it do the equivalent of VOP_FSYNC()?
(I think I'll code this and try it.)

For your test program, that would result in a
VOP_FSYNC() every time mmap(2) is done.

Al, have you tried a similar test program where
the mmap(2) is done when the file is open()'d
instead of for every do_mapwrite()?
Comment 18 Alan Somers freebsd_committer freebsd_triage 2023-02-08 00:44:32 UTC
(In reply to Rick Macklem from comment #17)
Does VOP_MMAPPED() get called  as soon as the file is mapped?  I think that's too soon.  At that point, the problematic writes haven't happened yet.  And no, I haven't tried persistently mapping the file.
Comment 19 Konstantin Belousov freebsd_committer freebsd_triage 2023-02-08 01:40:27 UTC
This is all quite strange.  Our buffer cache should be coherent with the page
cache.  The fact that simple write(2) can become inconsistent with the mmaped
write means we have the big issue for NFS.

Buffer cache consists of just buffer headers that collect a run of pages that
represent the same frames as the buffer, in the file range.  Dirty pages are
cleaned by the vfs_busy_pages(bp, TRUE) call when bwrite() is executing. Since
then, the dirtyness is carried by the buffer and not by pages. This is why
we have to flush both pages and buffers when we know that the server supports
deallocate.

In NFS case, since it has its own bufwrite() implementation, vfs_busy_pages(TRUE)
is called by NFS code in ncl_writebp() (and in ncl_flush()).

From your explanation, it is still not clear to me how did we lost a write.
It must be some case of doing direct write instead going through the buffer
cache, I guess? Was the file opened with O_DIRECT?
Comment 20 Rick Macklem freebsd_committer freebsd_triage 2023-02-08 04:45:16 UTC
Created attachment 239981 [details]
change ncl_flush() so that it does not clear the page dirty bit for commit

I think this large complex (1 line) patch fixes
the problem. It works for the variants of the
test program I have. (The original plus one that
does do_write() instead of fspacectl() and one that
does the mmap() in main instead of per-do_mapread()
and do_mapwrite().)

Thanks go to Kostik's explanation which allowed me
find it.  Basically, when ncl_flush() was called
it would do vfs_busy_pages(vp, 1) before doing a
commit RPC. However, the commit RPC does not write
data to the server, it tells the server to commit
data it already has to stable storage.
--> As such, it should not clear the dirty bits on
    the pages.

I will look through the commit logs to see how long
this has been broken.  I suspect a long time.
Comment 21 Konstantin Belousov freebsd_committer freebsd_triage 2023-02-08 05:20:42 UTC
(In reply to Rick Macklem from comment #20)
Now I do not understand why vfs_busy_pages() is done in that part of the
ncl_flush() loop at all.  There is bwrite() call later, which should handle
undirty.  But why do you need busy pages while doing commit RPCs? Is the
intent to avoid userspace writes through mmap on the committed range?
Comment 22 Rick Macklem freebsd_committer freebsd_triage 2023-02-08 14:06:12 UTC
Well, after doing the vfs_busy_pages() and
the commit RPC, there is a call to bufdone()
at around line#3100.
bufdone calls vfs_vmio_iodone etc, so I think
the pages need to be busied for that.

There may be some other reasons that I can't
spot at a quick glance.
Comment 23 Konstantin Belousov freebsd_committer freebsd_triage 2023-02-08 14:20:04 UTC
(In reply to Rick Macklem from comment #22)
Looks fine then, please commit your patch.  If something is left, we can
continue the investigation (for nfs).
Comment 24 Rick Macklem freebsd_committer freebsd_triage 2023-02-08 20:24:03 UTC
Alan, can you please test the new patch
to see if it fixes your tests for NFS?

Thanks, rick
Comment 25 Alan Somers freebsd_committer freebsd_triage 2023-02-08 21:10:26 UTC
The latest patch works for me.
Comment 26 commit-hook freebsd_committer freebsd_triage 2023-02-08 22:27:30 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=847967bc4e198a258b030a5864e64e029e7452e5

commit 847967bc4e198a258b030a5864e64e029e7452e5
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-02-08 22:25:01 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2023-02-08 22:25:01 +0000

    nfscl: Fix interaction between mmap'd and VOP_WRITE file updates

    asomers@ found a problem with the NFS client, where a write to
    an NFS mounted file done via mmap(2) was lost when fspacectl(2)
    was done before it.  This turned out to be caused by clearing the
    dirty bit on pages when the client was doing commit RPCs,
    due to the second argument to vfs_busy_pages() being set to 1.
    Commit RPCs tell the server to commit previously written data to
    stable storage.  However, Commit RPCs do not write data from the
    client to the server.  As such, if the dirty bit on the page has
    been set by a mmap'd write to an address in the page, it should
    not be cleared.  Clearing it causes the mmap'd write to by lost.

    This patch fixes the problem by changing the 2nd argument to
    vfs_busy_pages() to 0 for this case.

    I doubt this bug has affected many, since it was inherited from
    the old NFS client and was in 4.3 FreeBSD twenty years ago.
    Although fspacectl(2) is FreeBSD 14 specific, a write(2) would
    cause the same failure.

    Reviewed by:    kib
    Tested by:      asomers
    PR:     269328
    MFC after:      1 week

 sys/fs/nfsclient/nfs_clvnops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 27 Rick Macklem freebsd_committer freebsd_triage 2023-02-08 23:54:00 UTC
Patch has been committed to main and will be MFC'd.
Comment 28 commit-hook freebsd_committer freebsd_triage 2023-02-21 23:34:13 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=335517b9b02bc37f4d532998dc8e3ce5757cdef6

commit 335517b9b02bc37f4d532998dc8e3ce5757cdef6
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-02-08 22:25:01 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2023-02-21 23:32:40 +0000

    nfscl: Fix interaction between mmap'd and VOP_WRITE file updates

    asomers@ found a problem with the NFS client, where a write to
    an NFS mounted file done via mmap(2) was lost when fspacectl(2)
    was done before it.  This turned out to be caused by clearing the
    dirty bit on pages when the client was doing commit RPCs,
    due to the second argument to vfs_busy_pages() being set to 1.
    Commit RPCs tell the server to commit previously written data to
    stable storage.  However, Commit RPCs do not write data from the
    client to the server.  As such, if the dirty bit on the page has
    been set by a mmap'd write to an address in the page, it should
    not be cleared.  Clearing it causes the mmap'd write to by lost.

    This patch fixes the problem by changing the 2nd argument to
    vfs_busy_pages() to 0 for this case.

    I doubt this bug has affected many, since it was inherited from
    the old NFS client and was in 4.3 FreeBSD twenty years ago.
    Although fspacectl(2) is FreeBSD 14 specific, a write(2) would
    cause the same failure.

    PR:     269328

    (cherry picked from commit 847967bc4e198a258b030a5864e64e029e7452e5)

 sys/fs/nfsclient/nfs_clvnops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 29 Alan Somers freebsd_committer freebsd_triage 2023-09-06 16:50:39 UTC
Can we close this bug now?