I can reliably generate corruption on 15.0-CURRENT's NFS client using fsx, with copy_file_range enabled. I can't reproduce it on ZFS, UFS, tmpfs, or msdosfs. I can't reproduce it on Linux's NFS client either (though I'm not positive that it's using server-side copying).

Environment
===========
NFS Server: FreeBSD 14.0-RELEASE-p3, amd64
NFS Clients: both FreeBSD 14.0-RELEASE-p3 and 15.0-CURRENT @ 20-Dec.

/etc/fstab is configured like this:
192.168.1.2:/home /usr/home nfs rw,nfsv4,minorversion=2 0 0

Steps To Reproduce
==================
1) Check out the fsx-rs git repository (I'm working on getting the latest version into ports):
git clone git@github.com:asomers/fsx-rs.git

2) Install a Rust toolchain, if you don't have one already:
pkg install rust

3) cd fsx-rs

4) cat > copy_file_range.toml <<HERE
nomsyncafterwrite = true

[weights]
truncate = 1
fsync = 1
fdatasync = 1
punch_hole = 0
sendfile = 1
write = 10
read = 10
copy_file_range = 10
HERE

5) Run the tool. Assuming /home is NFS-mounted, run like this:
cargo run --release -- -N 1024 -S 3381155135556591634 -f copy_file_range.toml -m 149264:149265 -v ~/tmp/foo.bin

Expected Output
===============
...
All operations completed A-OK!

Actual Output
=============
[INFO fsx] 3 mapwrite 0x3d369 .. 0x3ffff ( 0x2c97 bytes)
[INFO fsx] 4 mapread 0x3b828 .. 0x3ffff ( 0x47d8 bytes)
[INFO fsx] 5 write 0x18a59 .. 0x21671 ( 0x8c19 bytes)
[INFO fsx] 6 write 0x3407e .. 0x3d057 ( 0x8fda bytes)
[INFO fsx] 7 sendfile 0x141b5 .. 0x1da23 ( 0x986f bytes)
[INFO fsx] 8 copy_file_range [0x326b6:0x357d2] => [0x1c39b:0x1f4b7] ( 0x311d bytes)
[INFO fsx] 9 copy_file_range [0x36e5e:0x3feb4] => [ 0xa813:0x13869] ( 0x9057 bytes)
[INFO fsx] 10 read 0x18e3a .. 0x20a9c ( 0x7c63 bytes)
[WARN fsx] 11 write 0x20935 .. 0x2a296 ( 0x9962 bytes)
[INFO fsx] 12 mapwrite 0x3eadf .. 0x3ffff ( 0x1521 bytes)
[INFO fsx] 13 copy_file_range [0x3b9d9:0x3e7bf] => [0x31338:0x3411e] ( 0x2de7 bytes)
[WARN fsx] 14 mapwrite 0x22763 .. 0x2b5af ( 0x8e4d bytes)
[INFO fsx] 15 write 0x12126 .. 0x1666c ( 0x4547 bytes)
[WARN fsx] 16 mapread 0x1bc78 .. 0x2ad3d ( 0xf0c6 bytes)
[INFO fsx] 17 write 0x29386 .. 0x36d8b ( 0xda06 bytes)
[INFO fsx] 18 write 0x3cff4 .. 0x3ffff ( 0x300c bytes)
[INFO fsx] 19 copy_file_range [0x3c882:0x3dc94] => [ 0x6fbe: 0x83d0] ( 0x1413 bytes)
[WARN fsx] 20 truncate 0x40000 => 0x14e87
[INFO fsx] 21 mapwrite 0x194a2 .. 0x19bbb ( 0x71a bytes)
[INFO fsx] 22 read 0x10595 .. 0x19bbb ( 0x9627 bytes)
[INFO fsx] 23 mapread 0x11ebf .. 0x14df9 ( 0x2f3b bytes)
[INFO fsx] 24 mapread 0x1887 .. 0x11174 ( 0xf8ee bytes)
[INFO fsx] 25 mapread 0x13b75 .. 0x19bbb ( 0x6047 bytes)
[INFO fsx] 26 write 0x1867c .. 0x19f7a ( 0x18ff bytes)
[INFO fsx] 27 mapread 0x119df .. 0x13ecb ( 0x24ed bytes)
[INFO fsx] 28 mapwrite 0x19c2d .. 0x1ec85 ( 0x5059 bytes)
[INFO fsx] 29 mapwrite 0x580f .. 0x14e19 ( 0xf60b bytes)
[INFO fsx] 30 mapwrite 0x2e746 .. 0x3b542 ( 0xcdfd bytes)
[INFO fsx] 31 copy_file_range [0x1ae6f:0x1e92f] => [0x173ae:0x1ae6e] ( 0x3ac1 bytes)
[INFO fsx] 32 read 0x2e63a .. 0x3af8d ( 0xc954 bytes)
[INFO fsx] 33 write 0x3313d .. 0x38155 ( 0x5019 bytes)
[INFO fsx] 34 mapread 0x332a5 .. 0x35e49 ( 0x2ba5 bytes)
[INFO fsx] 35 copy_file_range [0x17dc9:0x2206d] => [0x2e626:0x388ca] ( 0xa2a5 bytes)
[INFO fsx] 36 fdatasync
[INFO fsx] 37 read 0x2dad6 .. 0x3b3c8 ( 0xd8f3 bytes)
[INFO fsx] 38 sendfile 0x27769 .. 0x346a8 ( 0xcf40 bytes)
[INFO fsx] 39 mapwrite 0x38473 .. 0x3ffff ( 0x7b8d bytes)
[INFO fsx] 40 read 0x14cb6 .. 0x157ce ( 0xb19 bytes)
[INFO fsx] 41 sendfile 0x10a33 .. 0x1f57b ( 0xeb49 bytes)
[INFO fsx] 42 fsync
[INFO fsx] 43 copy_file_range [ 0xa490: 0xb45f] => [ 0xfb8: 0x1f87] ( 0xfd0 bytes)
[INFO fsx] 44 mapread 0x31956 .. 0x3ffff ( 0xe6aa bytes)
[INFO fsx] 45 copy_file_range [ 0xc1ad: 0xc474] => [ 0x7e98: 0x815f] ( 0x2c8 bytes)
[INFO fsx] 46 mapread 0x8bf .. 0xc9a9 ( 0xc0eb bytes)
[INFO fsx] 47 write 0x28d35 .. 0x2d27e ( 0x454a bytes)
[INFO fsx] 48 mapwrite 0x3f5d8 .. 0x3ffff ( 0xa28 bytes)
[INFO fsx] 49 fsync
[INFO fsx] 50 copy_file_range [0x2b709:0x2e6dc] => [0x32312:0x352e5] ( 0x2fd4 bytes)
[INFO fsx] 51 write 0x183d9 .. 0x1f309 ( 0x6f31 bytes)
[INFO fsx] 52 copy_file_range [0x2c928:0x31221] => [0x31222:0x35b1b] ( 0x48fa bytes)
[INFO fsx] 53 fsync
[INFO fsx] 54 write 0x39603 .. 0x3ffff ( 0x69fd bytes)
[INFO fsx] 55 write 0x30895 .. 0x39042 ( 0x87ae bytes)
[INFO fsx] 56 write 0x3c384 .. 0x3ffff ( 0x3c7c bytes)
[INFO fsx] 57 mapwrite 0x33ab6 .. 0x39fa0 ( 0x64eb bytes)
[WARN fsx] 58 write 0x23f7e .. 0x2b871 ( 0x78f4 bytes)
[INFO fsx] 59 mapwrite 0x1a29b .. 0x1c938 ( 0x269e bytes)
[INFO fsx] 60 write 0x29da5 .. 0x2effc ( 0x5258 bytes)
[INFO fsx] 61 read 0x29919 .. 0x3099d ( 0x7085 bytes)
[WARN fsx] 62 mapread 0x1f81d .. 0x2d882 ( 0xe066 bytes)
[INFO fsx] 63 write 0x32fd5 .. 0x3479f ( 0x17cb bytes)
[INFO fsx] 64 copy_file_range [ 0xf5b5:0x13756] => [0x3be5e:0x3ffff] ( 0x41a2 bytes)
[WARN fsx] 65 sendfile 0x1fd25 .. 0x2cffb ( 0xd2d7 bytes)
[WARN fsx] 66 copy_file_range [0x30269:0x380a6] => [0x201d9:0x28016] ( 0x7e3e bytes)
[INFO fsx] 67 read 0x3f602 .. 0x3ffff ( 0x9fe bytes)
[WARN fsx] 68 read 0x23bd4 .. 0x25fc1 ( 0x23ee bytes)
[ERROR fsx] miscompare: offset= 0x23bd4, size = 0x23ee
[ERROR fsx] OFFSET   GOOD  BAD  RANGE
[ERROR fsx] 0x24710  0x39  0x37 0x18b2
[ERROR fsx] Step# (mod 256) for a misdirected write may be 55
[ERROR fsx] Using seed 3381155135556591634
[ERROR fsx] LOG DUMP
[ERROR fsx] 1 SKIPPED (read)
[ERROR fsx] 2 SKIPPED (mapread)
[ERROR fsx] 3 MAPWRITE 0x3d369 => 0x40000 ( 0x2c97 bytes) HOLE
[ERROR fsx] 4 MAPREAD 0x3b828 => 0x40000 ( 0x47d8 bytes)
[ERROR fsx] 5 WRITE 0x18a59 => 0x21672 ( 0x8c19 bytes)
[ERROR fsx] 6 WRITE 0x3407e => 0x3d058 ( 0x8fda bytes)
[ERROR fsx] 7 SENDFILE 0x141b5 => 0x1da24 ( 0x986f bytes)
[ERROR fsx] 8 COPY_FILE_RANGE [0x326b6,0x357d3] => [0x1c39b,0x1f4b8] ( 0x311d bytes)
[ERROR fsx] 9 COPY_FILE_RANGE [0x36e5e,0x3feb5] => [ 0xa813,0x1386a] ( 0x9057 bytes)
[ERROR fsx] 10 READ 0x18e3a => 0x20a9d ( 0x7c63 bytes)
[ERROR fsx] 11 WRITE 0x20935 => 0x2a297 ( 0x9962 bytes)
[ERROR fsx] 12 MAPWRITE 0x3eadf => 0x40000 ( 0x1521 bytes)
[ERROR fsx] 13 COPY_FILE_RANGE [0x3b9d9,0x3e7c0] => [0x31338,0x3411f] ( 0x2de7 bytes)
[ERROR fsx] 14 MAPWRITE 0x22763 => 0x2b5b0 ( 0x8e4d bytes)
[ERROR fsx] 15 WRITE 0x12126 => 0x1666d ( 0x4547 bytes)
[ERROR fsx] 16 MAPREAD 0x1bc78 => 0x2ad3e ( 0xf0c6 bytes)
[ERROR fsx] 17 WRITE 0x29386 => 0x36d8c ( 0xda06 bytes)
[ERROR fsx] 18 WRITE 0x3cff4 => 0x40000 ( 0x300c bytes)
[ERROR fsx] 19 COPY_FILE_RANGE [0x3c882,0x3dc95] => [ 0x6fbe, 0x83d1] ( 0x1413 bytes)
[ERROR fsx] 20 TRUNCATE DOWN from 0x40000 to 0x14e87
[ERROR fsx] 21 MAPWRITE 0x194a2 => 0x19bbc ( 0x71a bytes) HOLE
[ERROR fsx] 22 READ 0x10595 => 0x19bbc ( 0x9627 bytes)
[ERROR fsx] 23 MAPREAD 0x11ebf => 0x14dfa ( 0x2f3b bytes)
[ERROR fsx] 24 MAPREAD 0x1887 => 0x11175 ( 0xf8ee bytes)
[ERROR fsx] 25 MAPREAD 0x13b75 => 0x19bbc ( 0x6047 bytes)
[ERROR fsx] 26 WRITE 0x1867c => 0x19f7b ( 0x18ff bytes) EXTEND
[ERROR fsx] 27 MAPREAD 0x119df => 0x13ecc ( 0x24ed bytes)
[ERROR fsx] 28 MAPWRITE 0x19c2d => 0x1ec86 ( 0x5059 bytes) EXTEND
[ERROR fsx] 29 MAPWRITE 0x580f => 0x14e1a ( 0xf60b bytes)
[ERROR fsx] 30 MAPWRITE 0x2e746 => 0x3b543 ( 0xcdfd bytes) HOLE
[ERROR fsx] 31 COPY_FILE_RANGE [0x1ae6f,0x1e930] => [0x173ae,0x1ae6f] ( 0x3ac1 bytes)
[ERROR fsx] 32 READ 0x2e63a => 0x3af8e ( 0xc954 bytes)
[ERROR fsx] 33 WRITE 0x3313d => 0x38156 ( 0x5019 bytes)
[ERROR fsx] 34 MAPREAD 0x332a5 => 0x35e4a ( 0x2ba5 bytes)
[ERROR fsx] 35 COPY_FILE_RANGE [0x17dc9,0x2206e] => [0x2e626,0x388cb] ( 0xa2a5 bytes)
[ERROR fsx] 36 FDATASYNC
[ERROR fsx] 37 READ 0x2dad6 => 0x3b3c9 ( 0xd8f3 bytes)
[ERROR fsx] 38 SENDFILE 0x27769 => 0x346a9 ( 0xcf40 bytes)
[ERROR fsx] 39 MAPWRITE 0x38473 => 0x40000 ( 0x7b8d bytes) EXTEND
[ERROR fsx] 40 READ 0x14cb6 => 0x157cf ( 0xb19 bytes)
[ERROR fsx] 41 SENDFILE 0x10a33 => 0x1f57c ( 0xeb49 bytes)
[ERROR fsx] 42 FSYNC
[ERROR fsx] 43 COPY_FILE_RANGE [ 0xa490, 0xb460] => [ 0xfb8, 0x1f88] ( 0xfd0 bytes)
[ERROR fsx] 44 MAPREAD 0x31956 => 0x40000 ( 0xe6aa bytes)
[ERROR fsx] 45 COPY_FILE_RANGE [ 0xc1ad, 0xc475] => [ 0x7e98, 0x8160] ( 0x2c8 bytes)
[ERROR fsx] 46 MAPREAD 0x8bf => 0xc9aa ( 0xc0eb bytes)
[ERROR fsx] 47 WRITE 0x28d35 => 0x2d27f ( 0x454a bytes)
[ERROR fsx] 48 MAPWRITE 0x3f5d8 => 0x40000 ( 0xa28 bytes)
[ERROR fsx] 49 FSYNC
[ERROR fsx] 50 COPY_FILE_RANGE [0x2b709,0x2e6dd] => [0x32312,0x352e6] ( 0x2fd4 bytes)
[ERROR fsx] 51 WRITE 0x183d9 => 0x1f30a ( 0x6f31 bytes)
[ERROR fsx] 52 COPY_FILE_RANGE [0x2c928,0x31222] => [0x31222,0x35b1c] ( 0x48fa bytes)
[ERROR fsx] 53 FSYNC
[ERROR fsx] 54 WRITE 0x39603 => 0x40000 ( 0x69fd bytes)
[ERROR fsx] 55 WRITE 0x30895 => 0x39043 ( 0x87ae bytes)
[ERROR fsx] 56 WRITE 0x3c384 => 0x40000 ( 0x3c7c bytes)
[ERROR fsx] 57 MAPWRITE 0x33ab6 => 0x39fa1 ( 0x64eb bytes)
[ERROR fsx] 58 WRITE 0x23f7e => 0x2b872 ( 0x78f4 bytes)
[ERROR fsx] 59 MAPWRITE 0x1a29b => 0x1c939 ( 0x269e bytes)
[ERROR fsx] 60 WRITE 0x29da5 => 0x2effd ( 0x5258 bytes)
[ERROR fsx] 61 READ 0x29919 => 0x3099e ( 0x7085 bytes)
[ERROR fsx] 62 MAPREAD 0x1f81d => 0x2d883 ( 0xe066 bytes)
[ERROR fsx] 63 WRITE 0x32fd5 => 0x347a0 ( 0x17cb bytes)
[ERROR fsx] 64 COPY_FILE_RANGE [ 0xf5b5,0x13757] => [0x3be5e,0x40000] ( 0x41a2 bytes)
[ERROR fsx] 65 SENDFILE 0x1fd25 => 0x2cffc ( 0xd2d7 bytes)
[ERROR fsx] 66 COPY_FILE_RANGE [0x30269,0x380a7] => [0x201d9,0x28017] ( 0x7e3e bytes)
[ERROR fsx] 67 READ 0x3f602 => 0x40000 ( 0x9fe bytes)
[ERROR fsx] 68 READ 0x23bd4 => 0x25fc2 ( 0x23ee bytes)
Could you please try something along these lines:

diff --git a/sys/fs/nfsclient/nfs_clvnops.c b/sys/fs/nfsclient/nfs_clvnops.c
index a690e988b4b3..f86ecd53ad4e 100644
--- a/sys/fs/nfsclient/nfs_clvnops.c
+++ b/sys/fs/nfsclient/nfs_clvnops.c
@@ -3956,7 +3956,7 @@ nfs_copy_file_range(struct vop_copy_file_range_args *ap)
 	if (error == 0)
 		error = ncl_flush(invp, MNT_WAIT, curthread, 1, 0);
 	if (error == 0)
-		error = ncl_flush(outvp, MNT_WAIT, curthread, 1, 0);
+		error = ncl_vinvalbuf(outvp, V_SAVE, curthread, 0);
 
 	/* Do the actual NFSv4.2 RPC. */
 	ret = ret2 = 0;
Do we also need to add something like...

	if ((obj = invp->v_object) != NULL &&
	    vm_object_mightbedirty(obj)) {
		VM_OBJECT_WLOCK(obj);
		vm_object_page_clean(obj, 0, 0, OBJPC_SYNC);
		VM_OBJECT_WUNLOCK(obj);
	}

in front of the

	error = ncl_flush(invp, MNT_WAIT, curthread, 1, 0);

If this is needed, the vnode locking code will need to be changed to LK_EXCLUSIVE lock the invp.

Man, I hate mmap'd I/O. ;-)
(In reply to Rick Macklem from comment #2) Yes, it seems so. OTOH, we can lock invp shared and switch to exclusive lock later if the page queue flush is indeed needed.
I'll try that patch shortly. In the meantime, I've discovered a much quicker way to reproduce the problem by increasing the operation sizes. Change the copy_file_range.toml file to contain this:

flen = 524288
nomsyncafterwrite = true

[opsize]
max = 131072
min = 65536

[weights]
truncate = 1
fsync = 1
fdatasync = 1
punch_hole = 0
sendfile = 1
write = 10
read = 10
copy_file_range = 10
Created attachment 247343 [details] Flush invp cache; invalidate outvp cache.
Yes. The patch looks good to me and, yes, since this would be a rare occurrence, starting with LK_SHARED on the invp seems like a good idea. It also makes sense to invalidate the outvp's buffers. The change in the NFSv4.2 Change attribute would have triggered that before a read from the now bogus buffers occurred, but why wait for that when you know the buffers are now invalid?
I think there's one little oopsie in the patch. I think it should be:

	invp_lock = LK_SHARED;
relock:

(i.e., "relock:" after the assignment)
But are outvp buffers invalidated without the patch? If pages are already mapped, they would continue to be used unless the attribute change is acted upon.
Created attachment 247346 [details] Flush invp cache; invalidate outvp cache v2 (relock fixed)
Yes, I think you are correct. I was thinking of the case where there is a close/open done. Without a close/open, the bogus buffers could get used.
I'll need to take a look at some of the other NFSv4.2 operations, since they bypass the buffer cache (and any mmap'd I/O) to see if they need a similar treatment. I think Allocate/Deallocate will need this and Seek will need to do a vm_object_page_clean() as well as the ncl_flush(). I'll look at these once this patch is tested by Alan. Btw, the workaround for any of these problems is "minorversion=1" on the mount, for anyone who runs into these issues in unpatched systems.
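To spell out the workaround, an /etc/fstab line for it would look like the one from the original report, but with the minor version reduced (the host and paths here are from the reporter's setup; adjust for your own):

```
192.168.1.2:/home /usr/home nfs rw,nfsv4,minorversion=1 0 0
```

NFSv4.1 avoids the problem because the Copy operation (and the other operations that bypass the buffer cache) are NFSv4.2-only.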
kib's patch does not fix the bug for me.
Can you try a test run with "minorversion=1"? This will check whether it is related to Copy or something else.
With minorversion=1, I cannot produce the same miscompare. However, I _can_ produce others, even when I disable copy_file_range in the config file. So there is probably more than one bug.
What about getting rid of the "nomsyncafterwrite = true", so that it does always do an msync()?
Using kib's patch and removing nomsyncafterwrite fixes the miscompare.
(In reply to Alan Somers from comment #16)
Did you try with copy_file_range disabled? Also, does 'msync after write' mean msync after a mapped write, or after write(2)? I suspect that it should be _before_ whichever write it is.
(In reply to Konstantin Belousov from comment #17) I've tried it both ways. Whether or not copy_file_range is enabled, if nomsyncafterwrite is not set, then there is no crash. And nomsyncafterwrite refers to msync immediately after mapped write.
(In reply to Alan Somers from comment #18)
> Whether or not copy_file_range is enabled, if nomsyncafterwrite is not set, then
> there is no crash.
But if nomsyncafterwrite is set, is there an inconsistency even without copy_file_range?

For the second part, I mean that most likely this knob works because msyncs are interleaved with writes, and really msync affects the next write's result, not the previous one's.
(In reply to Konstantin Belousov from comment #19)
> But if nomsyncafterwrite is set, is there an inconsistency even without copy_file_range?
Yes.

> For the second part, I mean that most likely this knob works because msyncs
> are interleaved with writes, and really msync affects the next write's result,
> not the previous one's.
Created attachment 247360 [details] Don't clear the dirty bits on pages being written through the buffer cache You could try this completely untested patch. The last time we played this game, the problem was that a call to vfs_busy_pages() cleared the dirty bits when it should not have done so. This is the only other place where the NFS client code does the same thing. Note that I think kib@'s copy_file_range patch is needed. I plan to do some testing of it, to see if I can break copy_file_range() without the patch and without using mmap'd I/O.
(In reply to Rick Macklem from comment #21)
It would be very strange if this patch helped. More, I think it is somewhat dangerous because it can cause data corruption on its own. The issue is, not clearing the pages' dirty bits also does not remove write permissions from the existing mappings, allowing writers to continue writing to the file mappings. I think that this allows more mapped writes to be lost.
BTW, why do we need ncl_writebp() at all? It seems that it mostly repeats the regular bufwrite() with some omissions, and added reassignbuf() call on return path.
FWIW I can also reproduce the bug using exactly the same setup, config file, and seed on bfffs, a FUSE file system. I cannot reproduce it with fusefs-ext2, fusefs-sshfs, or fusefs-exfat, however. I will now work on adding a unit test to the fusefs test suite.
I did not suggest the second patch as a real fix. I just wanted to see what (if any) effect it has on Alan's tests. (In the past, I've found that almost random changes can help identify a problem, like my suggestion to get rid of "nomsyncafterwrite".) If the patch happens to help with Alan's test case, then we need to figure out what needs to be done to fix it. W.r.t. ncl_writebp(), I'll take a look. This is all very old code cribbed from the old NFS client and not written by me.
I just tested rmacklem's latest patch along with kib's. It does not fix the bug.
So we have an nfscl bug reproducible without copy_file_range, right? Could you try to determine a minimal set of syscall types needed to reproduce it? E.g. is it just read/write/mapped write, or is something more required?
Could you please try with my latest patch from D43250?
With kib's latest patch, it's more difficult to produce a miscompare. But still possible. For example, using the toml file from Comment 4 with seed 9675272854109691020 will crash in 95 steps. Or, without copy_file_range at all, it's possible to crash in 126 steps using seed 8637402481540764494 and this toml file:

flen = 524288
nomsyncafterwrite = true

[opsize]
max = 131072
min = 65536

[weights]
truncate = 1
fsync = 1
fdatasync = 1
punch_hole = 0
write = 10
read = 10
Alan, do you know if the test does any other syscalls like rename or link even though they are not listed under [weights]? I am wondering because both nfs_rename() and nfs_link() do VOP_FSYNC(), but they do not call vm_object_page_clean(). Also, could you try a test run with truncate = 0? (The nfs_setattr() code does seem to do vnode_pager_setsize() as it should, but maybe more needs to be done.) I notice ncl_vinvalbuf() only gets called when there are dirty buffers.
(In reply to Rick Macklem from comment #30) It doesn't do rename or link. It does do some ftruncate calls that aren't shown, but I don't think there's anything else. And I'm out of time to help with testing until at least the 3rd, and probably the 6th.
Righto. To be honest, this may not be fixable, given the way NFS does I/O below the buffer cache (no vnode lock, etc). The fix may be to document the need to use msync(2) when mmap'd I/O is mixed with syscall I/O for NFS mounts. --> An update to the msync man page maybe?
is this related to bug 270810 which reported data corruption using mmap()+msync() on NFS? (i have encountered issues with this in the past, but i never got around to making a reproducible test case to file a PR.)
Ok, here is my understanding of what currently can happen. Hopefully Kostik will correct me if I have this wrong.

#1 - File is open(2)'d.
#2 - A byte range (let's say the 1st 100Mbytes) is mmap(2)'d into the address space.
#3 - Some addresses within this address space are modified by the process, dirtying the corresponding pages.
#4 - File is read(2) sequentially.

Now, when #4 happens, there will be read-aheads done by the nfsiod threads. These simply do Read RPCs against the NFS server to read the byte ranges of the file into the buffer cache blocks. They are done asynchronously and without any vnode lock.
--> At this time, I do not see anything that stops these read-aheads from filling the buffer cache blocks/pages with the NFS server's now-stale data.

Now, I thought adding an msync(2) with MS_SYNC between #3 and #4 would be sufficient to cause the pages dirtied by #3 to be written to the NFS server (via VOP_PUTPAGES(), which is ncl_putpages()). I believe that an fsync(2) between #3 and #4 will also write the dirtied pages to the NFS server.

Without either an msync(2) or fsync(2) between #3 and #4, what could be done to make this work?
- Don't do read-ahead. This would be a major performance hit and is imho a non-starter.
- Don't do read-ahead when a file is mmap(2)'d. This sounds better, since it will be a rare case that a file will be both mmap(2)'d and read via read(2) syscalls.
  --> To do this, the NFS client needs to know if the file has been mmap(2)'d. A flag could be set on the vnode when the file is mmap(2)'d and that flag can be checked by the NFS client.
  --> The problem is: when can the flag be cleared? My recollection from a previous round of discussing this is... not until all the process(es) that mmap(2)'d the file exit. (I cannot recall if the vnode's v_usecount going to 0 is sufficient.)
- Having some way that the nfsiod threads can check to see if there are dirty pages related to the buffer cache block and write those back to the NFS server before doing the read. (Recall that the buffer cache block will be quite a few pages, typically 128K to 1Mbyte in size.)
  --> This could be done by having the nfsiod thread LK_EXCLUSIVE lock the vnode, but that would be a major performance hit as well.

That's as far as I've gotten in previous discussions about this.

Note that this PR started with a specific problem related to copy_file_range(2) and that has been fixed (or kib@'s patch will fix it when committed). The more general case as above, well??
I added Geoffrey as a cc, since he reported PR#270810. I think these two PRs are converging on the same problem. (He can remove himself from the cc list if he chooses to.) I do disagree that this affects many people, since I have only seen 1 or 2 PRs on this, but it does affect some people.
fwiw, my own issue with NFS+mmap() was using net-p2p/rtorrent, which is a fairly widely used application and which (famously) uses mmap() for nearly all filesystem operations. when running it on NFS, i found it frequently reported data corruption issues with the files it wrote to (i.e., it would write to files using mmap(), but on reading those files later, the data that was written was not as expected). that said, i am the only person (other than Geoffrey) who i've seen report issues related to this. i couldn't say whether this is because the number of people who use FreeBSD NFS and mmap() is relatively small, or because the problem was caused by something else. i only mention this at all to add some context to the discussion :-)
Going back to my example case (which sounds like what net-p2p/rtorrent might be doing), it is possible for the NFS client VOP_READ() to do a vm_object_page_clean() which should ensure that mmap(2)'d writing (as in modifying bytes within mmapped address space) is written to the server. This would not handle the case where mmap(2)'d writing is intermixed with read(2)s. Does that sound worth doing? And, is it likely that vm_object_mightbedirty() will return true when this is not needed? (If this is the case, a flag on the vnode that indicates it has been mmap(2)'d would be helpful.)
(In reply to Rick Macklem from comment #34)
This sounds like an interesting theory, but please note that the read-ahead initiators in nfs_clbio.c check that the B_CACHE buffer flag is not set. This should prevent a situation where we have a constructed buffer with valid (possibly dirty) pages but no valid content recorded at the buf cache layer.

But let's recheck the theory anyway; the patch below should prevent RA when there are writeable mappings:

commit 2234d9d4f7595a78bf10c08b1e6b12d2115799cd
Author: Konstantin Belousov <kib@FreeBSD.org>
Date:   Tue Jan 2 00:22:44 2024 +0200

    nfsclient: do not do (unlocked) read-ahead by nfsiod if there are
    writeable mappings

diff --git a/sys/fs/nfsclient/nfs_clbio.c b/sys/fs/nfsclient/nfs_clbio.c
index e6486af55daf..1f92fe0a4cf3 100644
--- a/sys/fs/nfsclient/nfs_clbio.c
+++ b/sys/fs/nfsclient/nfs_clbio.c
@@ -481,9 +481,13 @@ ncl_bioread(struct vnode *vp, struct uio *uio, int ioflag, struct ucred *cred)
 	on = uio->uio_offset - (lbn * biosize);
 
 	/*
-	 * Start the read ahead(s), as required.
+	 * Start the read ahead(s), as required.  Do not do
+	 * read-ahead if there are writeable mappings, since
+	 * unlocked read by nfsiod could obliterate changes
+	 * done by userspace.
 	 */
-	if (nmp->nm_readahead > 0) {
+	if (nmp->nm_readahead > 0 &&
+	    vp->v_object->un_pager.vnp.writemappings == 0) {
 		for (nra = 0; nra < nmp->nm_readahead && nra < seqcount &&
 		    (off_t)(lbn + 1 + nra) * biosize < nsize; nra++) {
 			rabn = lbn + 1 + nra;
@@ -671,6 +675,7 @@ ncl_bioread(struct vnode *vp, struct uio *uio, int ioflag, struct ucred *cred)
 	 */
 	NFSLOCKNODE(np);
 	if (nmp->nm_readahead > 0 &&
+	    vp->v_object->un_pager.vnp.writemappings == 0 &&
 	    (bp->b_flags & B_INVAL) == 0 &&
 	    (np->n_direofoffset == 0 ||
 	    (lbn + 1) * NFS_DIRBLKSIZ < np->n_direofoffset) &&
(In reply to Rick Macklem from comment #37)
I thought about doing vm_object_page_clean() before starting RA, but the problem is that the current lock is shared, while we need an exclusively locked vnode for the write. Let's try my patch first. I added Peter Holm in the hope that he has some spare cycles to reproduce the issue and to check my patchset.
Sounds good. I was about to try a patch which did the lock upgrade and vm_object_page_clean() when vm_object_mightbedirty() returns true, but I'll leave it until yours gets tested.
(In reply to Konstantin Belousov from comment #38)
I now believe that for this patch to work, we need to check both for writeable mappings (writeable > 0) and for the previous existence of mappings that dirtied the vnode (vm_object_mightbedirty()). I updated the patch on my WIP branch for testing.
If I am reading the code correctly (I may not be), it seems that vfs_buf_test_cache() will clear B_CACHE if any of the pages allocbuf() has added to a new buffer are not valid. So, if a buffer cache block is for the first 1Mbyte of the file, but the mmap(2)'d process has only touched one page within that first 1Mbyte, does B_CACHE get cleared? (And then the read-ahead happens, overwriting the dirty page.)
Comment on attachment 247360 [details] Don't clear the dirty bits on pages being written through the buffer cache This patch is crap. Please ignore it.
(In reply to Rick Macklem from comment #42) If the buffer size is 1M (shrug), then to map any page from this buffer, kernel would validate the whole buffer. At least this is the default setup where vfs.nfs.use_buf_pager is true. In other words, it should be not possible to have a buffer consisting of a single dirty page and rest of invalid pages.
Yep. I just saw that (fills the whole buffer when a page is touched) using a trivial test program. This trivial program did not expose any problem.
If this is related to the bug I reported, PR#270810, the problem isn't the page state, but the fact that when the writing process exits, the pages are immediately discarded before they are sent to the NFS server. As long as the process that has dirtied the pages is kept alive, there is no issue. This affects BSD NFS clients; Linux NFS clients are unaffected.
(In reply to Konstantin Belousov from comment #39) Sure, I'll take a look.
(In reply to Konstantin Belousov from comment #27)
As I have said before, reading is irrelevant. It occurs when writing to a congested server, when not all writes have finished by the time the writing process exits. It also does not occur using nfsv2 - the process stays in state nfsio until writes are finished, but nfsv2 is dog slow and has other problems.
Created attachment 247424 [details] Make Setattr invalidate for setting size if mmap'd writing might have occurred This is another patch that Alan can hopefully test, since his test does include truncation. The patch adds a check for vm_object_mightbedirty() as well as NMODIFIED (which is set for writes into the buffer cache, but not writes to mmap'd pages) to decide that flushing/invalidation should occur when a Setattr of size is done.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=47ec00d9d6071bbb0ee5ed0bdca3c4a468334d9d

commit 47ec00d9d6071bbb0ee5ed0bdca3c4a468334d9d
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-12-30 00:15:50 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-01-05 04:56:30 +0000

    nfsclient: flush dirty pages of the vnode

    before ncl_flush() when done to ensure that the server sees our cached
    data, because it potentially changes the server response. This is
    relevant for copy_file_range(), seek(), and allocate().

    Convert LK_SHARED invp lock into LK_EXCLUSIVE if needed to properly
    call vm_object_page_clean().

    Reported by:	asomers
    PR:	276002
    Noted and reviewed by:	rmacklem
    Tested by:	pho
    Sponsored by:	The FreeBSD Foundation
    MFC after:	1 week
    Differential revision:	https://reviews.freebsd.org/D43250

 sys/fs/nfsclient/nfs_clvnops.c | 45 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=7dae1467d72ae1f5c8f7be0f7444da23a457d98b

commit 7dae1467d72ae1f5c8f7be0f7444da23a457d98b
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-12-29 23:22:40 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-01-05 04:56:17 +0000

    nfsclient copy_file_range(): flush dst vnode data

    Otherwise server-side copy makes the client cache inconsistent with
    the server data.

    Reported by:	asomers
    PR:	276002
    Reviewed by:	rmacklem
    Tested by:	pho
    Sponsored by:	The FreeBSD Foundation
    MFC after:	1 week
    Differential revision:	https://reviews.freebsd.org/D43250

 sys/fs/nfsclient/nfs_clvnops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
With the latest code in main, including kib's two commits, I can no longer reproduce the bug using the second copy_file_range.toml file I posted. But I can still reproduce it using the original file. In fact, I can reproduce it without using copy_file_range or truncate at all, using this config file. This may be a different bug:

nomsyncafterwrite = true

[weights]
truncate = 0
fsync = 1
fdatasync = 1
punch_hole = 0
sendfile = 0
write = 10
read = 10
copy_file_range = 0
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=384c72acb50d3fd5a3bd07f41d3755867865c0b6

commit 384c72acb50d3fd5a3bd07f41d3755867865c0b6
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-12-29 23:22:40 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-01-11 16:46:52 +0000

    nfsclient copy_file_range(): flush dst vnode data

    PR:	276002

    (cherry picked from commit 7dae1467d72ae1f5c8f7be0f7444da23a457d98b)

 sys/fs/nfsclient/nfs_clvnops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=09c69decd208d3ccd4dd54ea67c14b48c012c58c

commit 09c69decd208d3ccd4dd54ea67c14b48c012c58c
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-12-30 00:15:50 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-01-11 16:46:52 +0000

    nfsclient: flush dirty pages of the vnode

    PR:	276002

    (cherry picked from commit 47ec00d9d6071bbb0ee5ed0bdca3c4a468334d9d)

 sys/fs/nfsclient/nfs_clvnops.c | 45 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d139cab8d13b83b90872f6ca5dc53890c85826a9

commit d139cab8d13b83b90872f6ca5dc53890c85826a9
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-12-29 23:22:40 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-01-11 16:47:51 +0000

    nfsclient copy_file_range(): flush dst vnode data

    PR:	276002

    (cherry picked from commit 7dae1467d72ae1f5c8f7be0f7444da23a457d98b)

 sys/fs/nfsclient/nfs_clvnops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=cee4d7c22585a736a5d231365ceb62e96d80488a

commit cee4d7c22585a736a5d231365ceb62e96d80488a
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-12-30 00:15:50 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-01-11 16:47:52 +0000

    nfsclient: flush dirty pages of the vnode

    PR:	276002

    (cherry picked from commit 47ec00d9d6071bbb0ee5ed0bdca3c4a468334d9d)

 sys/fs/nfsclient/nfs_clvnops.c | 45 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 6 deletions(-)
Does this affect NFSv3 mounts on releng/13.2, too?
(In reply to Robert Schulze from comment #57)
The NFS Copy operation is only NFSv4.2. However, my recollection is that asomers@ could still generate data corruption when Copy was not done, unless msync was done. If this recollection is correct, the problem is still an issue (and may be for a long time), and I do not think asomers@ tested NFSv3.
(In reply to Rick Macklem from comment #58) No. I tested NFS 4.2 and 4.1, but not NFS 3.
(In reply to Rick Macklem from comment #58)
I am not sure that I parsed the question right, but: I am aware that there is still a corruption, and AFAIR even without copy_file_range(). It is enough to get some mix of regular writes and mmapped accesses (writes) to trigger it.
(In reply to Konstantin Belousov from comment #60)
Yes, that is my understanding as well, with the additional fact that msync(2) calls fix the problem.

My hunch is that this bug (not the copy_file_range(2) issue) has existed *forever*. I have not been able to come up with a plausible explanation, but...
- It might be that there is some situation where B_CACHE does not get set when it should be set, causing ncl_write() (the VOP_WRITE() for NFS) to read stale data from the server, overwriting the mmap'd change(s).
- There might be some race between VOP_PUTPAGES()/VOP_GETPAGES() (which must hold the vnode locked) and the NFS I/O RPCs being done on the buffer cache blocks when no vnode lock is held for them. I have not been able to think of a case where this would cause the corruption, but I'll admit I haven't thought about it lately.

I have thought of doing the equivalent of msync() in ncl_write(), at least to see if that avoids the corruption. Maybe I'll come up with a test patch for Alan to try.
(In reply to Rick Macklem from comment #61)
I remember it was Peter Holm who was able to reproduce the last issue. I have already committed the change to disable read-ahead if there are pages on the object queue. The reverse side, waiting for all async RPCs to finish before starting the first VOP_GETPAGES(), can be done, although it is more complicated.
(In reply to Konstantin Belousov from comment #62)
I'll admit that I do not think readahead is the problem now.

First off, both VOP_GETPAGES() and VOP_READ() are called with the vnode LK_SHARED, so they are not serialized by the vnode lock. When I look at the code, both ncl_getpages() (VOP_GETPAGES() for NFS) and ncl_bioread() (VOP_READ() for NFS) end up calling getblkx(), which should serialize access to the buffer, and the code does appear to safely handle B_CACHE.

There is one thing that does look weird to me, although I'll admit I do not understand the VM side of things.
- When VOP_GETPAGES() calls bread_gb() (actually breadn_flags() with various arguments) the pages are sbusied. Then nfs_strategy() calls ncl_doio(), which does the RPCs and then calls bufdone(). bufdone() calls vfs_vmio_iodone(), which sunbusies the pages.
- After the bread_gb() call, vfs_bio_getpages() does a bunch of stuff that might not be correct for an unbusied page? I won't claim to understand the code, but it "upgrades" the pages to xbusied (assuming they are sbusied, I think?) and also does some stuff I do not understand related to checking that the pages are valid, which the comment claims is racy.

Anyhow Kostik, hopefully you can understand the code and determine if the code in vfs_bio_getpages() is safe even if the pages have been sunbusied by the bread_gb() call?
Created attachment 248653 [details]
Problem reproducer

I'm not sure if this is helpful, but this small test program reproduces the issue on NFS only:

09:37 /usr/src/tools/test/stress2/misc $ ./mmap44.sh
/dev/md10 on /mnt (ufs, NFS exported, local, soft-updates)
127.0.0.1:/mnt on /mnt2 (nfs)
9781c9781
< 0461500 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f
---
> 0461500 40 41 42 43 44 ee 46 47 48 49 4a 4b 4c 4d 4e 4f
256 -rw-------  1 root wheel 262144 Feb 21 09:38 file
256 -rw-------  1 root wheel 262144 Feb 21 09:37 file.orig
09:38 /usr/src/tools/test/stress2/misc $ grep -B2 -A2 0xee mmap44.sh
			pthread_mutex_lock(&write_mutex);
			c = cp[i];
			cp[i] = 0xee;	/* This value seems to linger with NFS */
			cp[i] = c;
			pthread_mutex_unlock(&write_mutex);
09:39 /usr/src/tools/test/stress2/misc $
(In reply to Rick Macklem from comment #63)
vfs_bio_getpages() is called with the pages exclusively busy. As you noted, bread_gb() wants to additionally shared-busy the same pages, which is impossible. So vfs_bio_getpages() changes its busy mode from exclusive to shared, just as a normal lock can be downgraded. When bread_gb() acquires and releases the shared busy, it does +1 and -1 on the shared busy counter, so the page stays shared busy after bread_gb() finishes. Then vfs_bio_getpages() upgrades from shared back to exclusive, and since the shared busy state could have been dropped, we recheck the validity of the pages. If some page was invalidated, we need to re-validate the buffer.

And this indeed might not be correct for NFS, I now think. Another thread could do a short read, making only part of the page valid. Then the bogus-page replacement does not help to preserve the valid content.

Peter, could you please retest the NFS client with the sysctl vfs.nfs.use_buf_pager set to 0?
Both test programs still fail. Here's the fsx output:

09:34 /home/pho/fsx-rs $ sysctl vfs.nfs.use_buf_pager
vfs.nfs.use_buf_pager: 0
     Running `target/release/fsx -N 1024 -S 3381155135556591634 -f copy_file_range.toml -m '149264:149265' -v -v -v /mnt11/foo.bin`
[ERROR fsx] miscompare: offset= 0x1eedd, size = 0x19f6
[ERROR fsx] OFFSET  GOOD  BAD  RANGE
[ERROR fsx] 0x1eedd 0x38  0x35 0x6c6
[ERROR fsx] Step# (mod 256) for a misdirected write may be 145
[ERROR fsx] Using seed 3381155135556591634
[ERROR fsx] LOG DUMP
[ERROR fsx]  1 SKIPPED (read)
[ERROR fsx]  2 SKIPPED (mapread)
[ERROR fsx]  3 MAPWRITE 0x3d369 => 0x40000 ( 0x2c97 bytes) HOLE
[ERROR fsx]  4 WRITE    0x3b828 => 0x40000 ( 0x47d8 bytes)
(In reply to Peter Holm from comment #66)
There seems to be something about holes in the file connected with the corruption. Could you please try to run the tests over a fully populated file, using only writes and mapped writes, with no fallocate/hole making.
(In reply to Konstantin Belousov from comment #67)
I commented out the "hole" part in mmap44.sh and I still get the corruption.

#if 0	// Test without holes
	if (lseek(fd, arc4random() % siz, SEEK_END) == -1)
		err(1, "lseek() END");
	s = sizeof(buf);
	for (i = 0; i < 50; i++) {
		if (write(fd, buf, s) != s)
			warn("write()");
	}
#endif

I added:

punch_hole = 0
posix_fallocate = 0

to the fsx config file, but I still see references to HOLE in the output? fsx still returns lots of ERRORs.
Removing the writing past EOF (to create holes) and the following ftruncate() makes the corruption go away (at least in a short 40-minute test).
(In reply to Peter Holm from comment #69)
If you do not already have it, maybe you could try adding the patch in attachment #247424 [details] (the second one, with title "Make Setattr..."). It would be related to ftruncate(), since that would do a Setattr of size.

Also, does your test fail when there is no write beyond EOF/truncation, but vfs.nfs.use_buf_pager=1?

Thanks for your help with this, rick
(In reply to Rick Macklem from comment #70)
Setting vfs.nfs.use_buf_pager to either "1" or "0" does not make any difference to the mmap44 test. It fails with the "write beyond EOF / truncate()" enabled. With the nfs_setattr() patch added, both (fsx and mmap44) test scenarios still fail.

Here's the result from mmap44 with truncate() enabled:

05:24 /usr/src/tools/test/stress2/misc $ uname -a
FreeBSD mercat1.netperf.freebsd.org 15.0-CURRENT FreeBSD 15.0-CURRENT #0 main-n268383-a044cf60bd37-dirty: Fri Feb 23 05:06:46 CET 2024 pho@mercat1.netperf.freebsd.org:/usr/src/sys/amd64/compile/PHO amd64
05:24 /usr/src/tools/test/stress2/misc $ sysctl vfs.nfs.use_buf_pager
vfs.nfs.use_buf_pager: 1
05:26 /usr/src/tools/test/stress2/misc $ ./mmap44.sh
/dev/md10 on /mnt (ufs, NFS exported, local, soft-updates)
127.0.0.1:/mnt on /mnt2 (nfs)
3109c3109
< 0141100 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f
---
> 0141100 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e ee
7640c7640
< 0356560 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f
---
> 0356560 70 71 72 73 74 75 76 77 78 79 7a ee 7c 7d 7e 7f
256 -rw-------  1 root wheel 262144 Feb 23 05:27 file
256 -rw-------  1 root wheel 262144 Feb 23 05:26 file.orig
05:27 /usr/src/tools/test/stress2/misc $
(In reply to Peter Holm from comment #64) Hi Peter, I've tried this reproducer and have not reproduced a problem. Could you post mmap44.sh or explain how you run it?
(In reply to Rick Macklem from comment #72)
Right, the reproducer needs a bit of VM pressure to show the issue. I have committed mmap44.sh. Here's how to run it:

# cd /usr/src/tools/test/stress2
# make
===> lib (all)
===> testcases (all)
===> testcases/badcode (all)
:
# cd misc
# ./all.sh -o mmap44.sh
20240224 06:54:54 all: mmap44.sh
/dev/md10 on /mnt (ufs, NFS exported, local, soft-updates)
127.0.0.1:/mnt on /mnt2 (nfs)
3900c3900
< 0171660 b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf
---
> 0171660 b0 b1 ee b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf
7139c7139
< 0337040 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f
---
> 0337040 20 21 22 23 24 25 26 27 28 29 2a 2b ee 2d 2e 2f
14249c14249
< 0675200 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f
---
> 0675200 80 81 82 83 84 85 86 ee 88 89 8a 8b 8c 8d 8e 8f
256 -rw-------  1 root wheel 262144 Feb 24 06:55 file
256 -rw-------  1 root wheel 262144 Feb 24 06:54 file.orig
FAIL mmap44.sh exit code 2
20240224 06:56:01 all.sh done, elapsed 0 day(s), 00:01.07
#

You may need to run it a few times to get the error.
Created attachment 248770 [details]
Force VOP_WRITE() in VOP_PUTPAGES() to be done synchronously

Thanks to Peter, I have been able to reproduce the problem demonstrated by his mmap44.sh test fairly easily (within a few test cycles). With this patch, I have not been able to reproduce the problem in about 25 test cycles. I am hoping that Peter and Alan can test this patch?

I am not sure if it is correct (or the best fix), but my hunch is that the dirty bit for the page is sometimes getting cleared at the wrong time, resulting in the page modification being lost. Without this patch, VOP_WRITE() { ncl_write() } can do the page write synchronously, asynchronously, or delayed, which means the dirty bit gets cleared at different times.
(In reply to Rick Macklem from comment #74)
It seems as though your patch makes it a bit harder to trigger the issue with mmap44.sh. The fsx test still fails as well.

< 0620000 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
---
> 0620000 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d ee 0f
256 -rw-------  1 root wheel 262144 Feb 27 07:32 file
256 -rw-------  1 root wheel 262144 Feb 27 07:31 file.orig
07:32:18, elapsed 0 day(s), 00:14.07
(In reply to Peter Holm from comment #75)
Could you please try to add the following to Rick's last patch?

commit 0f36903e7d1f49632515481ea4b99859d23bd21a
Author: Konstantin Belousov <kib@FreeBSD.org>
Date:   Tue Feb 27 02:09:09 2024 +0200

    ncl_write(): honor ioflags by converting them into buffer flags using
    vfs_bio_set_flags().

diff --git a/sys/fs/nfsclient/nfs_clbio.c b/sys/fs/nfsclient/nfs_clbio.c
index 1cf45bb0c924..d8058e9a125a 100644
--- a/sys/fs/nfsclient/nfs_clbio.c
+++ b/sys/fs/nfsclient/nfs_clbio.c
@@ -1310,6 +1310,8 @@ ncl_write(struct vop_write_args *ap)
 			vfs_bio_set_valid(bp, on, n);
 		}
 
+		vfs_bio_set_flags(bp, ioflag);
+
 		/*
 		 * If IO_SYNC do bwrite().
 		 *
Created attachment 248807 [details]
Force VOP_WRITE() to be synchronous and do not clear dirty bits afterwards

Here's another one that hopefully Peter can test. It doesn't fail mmap44.sh for me in 128 cycles, but the previous one didn't either, so this doesn't mean much.

It is similar to the last one, except it does not clear the dirty bits after VOP_WRITE() when VOP_WRITE() returns no error. I'll admit I do not understand why the code clears the dirty bits after VOP_WRITE() when it does not return an error, since VOP_WRITE() will clear the dirty bits.

(Kostik's addition should probably be applied to this one as well.)

Thanks Peter for doing this testing.
(In reply to Rick Macklem from comment #77)
As I understand it, the clearing of the dirty bits occurs in vfs_busy_pages(). But it only happens if the write indeed goes through the buffer cache, which is not true for appends. So in the case of append writes, nfs_putpages() would not launder the pages? So maybe we should disable this direct write optimization? It now looks very suspicious.
Note that pages cannot become dirty until VOP_PUTPAGES() finishes, because all writeable mappings are invalidated and the pages are shared-busied. The VM does not allow creating a writeable mapping of a busy page.
(In reply to Konstantin Belousov from comment #76)
I ran tests with these combined changes:

--- a/sys/fs/nfsclient/nfs_clbio.c
+++ b/sys/fs/nfsclient/nfs_clbio.c
@@ -330,8 +330,9 @@ ncl_putpages(struct vop_putpages_args *ap)
 	uio.uio_rw = UIO_WRITE;
 	uio.uio_td = td;
 
-	error = VOP_WRITE(vp, &uio, vnode_pager_putpages_ioflags(ap->a_sync),
+	error = VOP_WRITE(vp, &uio, vnode_pager_putpages_ioflags(ap->a_sync | VM_PAGER_PUT_SYNC),
 	    cred);
+if (error != 0) printf("aft putpages=%d\n", error);
 	crfree(cred);
 
 	if (error == 0 || !nfs_keep_dirty_on_error) {
@@ -1310,6 +1311,8 @@ ncl_write(struct vop_write_args *ap)
 			vfs_bio_set_valid(bp, on, n);
 		}
 
+		vfs_bio_set_flags(bp, ioflag);
+
 		/*
 		 * If IO_SYNC do bwrite().
 		 *

but the outcome was the same: instant fsx failure, and mmap44.sh failed after 1h 40m.
(In reply to Rick Macklem from comment #77) Happy to help. I'll run a test with "attachment 248807 [details]" in a few hours from now.
(In reply to Peter Holm from comment #81) I see no different outcome with "attachment 248807 [details]" added.
I tried to shrink the test scenario even further. It now only uses: Mapped read/write, read(2)/write(2) and ftruncate(2). https://people.freebsd.org/~pho/mmap47.sh
(In reply to Peter Holm from comment #83)
Could you try adding a msync(2) call here and see if that stops the failure? (It did so for my testing of the previous version, but that doesn't indicate much, as I've seen for other changes.)

	while (go == 1) {
		i = arc4random() % siz;
		pthread_mutex_lock(&write_mutex);
		c = cp[i];
		cp[i] = 0xee;	/* This value seems to linger with NFS */
		cp[i] = c;
-->		msync(cp, siz, MS_SYNC);
		pthread_mutex_unlock(&write_mutex);
		usleep(arc4random() % 200);
	}

I might not have the call quite right, since I no longer have the patched code for this lying about.

Thanks, rick
(In reply to Konstantin Belousov from comment #78)
> As I understand, the clearing of the dirty bits occur in vfs_busy_pages().
> But it only happens if the write indeed occurs though the buffer cache,
> which is not true for appends. So in case of append writes, nfs_putpages()
> would not laundry the pages?
Since O_APPEND writing is done for things like log files, I'll admit I cannot imagine why such writing would be mixed with mmap()'d writing? At the least, this does not appear to be a part of this test.

> So might be disable this direct write optimization? It now looks very
> suspicious.
I will take a look at how O_APPEND writing is currently being done and see if it looks like it needs to be changed. If you are referring to the stuff that can be enabled for direct writing via a sysctl that is not set by default, then I agree it is sketchy. I'll take a look.
(In reply to Rick Macklem from comment #84) Peter, Forget about trying the "add the msync()" case. I just redid the test and it failed. The fact it fails might be useful information?
	while (go == 1) {
		i = arc4random() % siz;
		pthread_mutex_lock(&write_mutex);
		c = cp[i];
		cp[i] = 0xee;	/* This value seems to linger with NFS */
-->		msync(cp, 0, MS_SYNC);
		cp[i] = c;
		pthread_mutex_unlock(&write_mutex);
		usleep(arc4random() % 200);
	}

I tried the above variant and it not only failed, it failed much more quickly on my test setup. Somehow, the write(2) and ftruncate(2) calls that grow and shrink the file back down trash the "cp[i] = c;" mmap()'d write. I'm still looking...
(In reply to Rick Macklem from comment #87)
I do not know if you noticed this, but the ftruncate(2) call in the mmap47.sh test scenario doesn't change the size of the file. Yet this call seems essential to triggering the corruption.
(In reply to Peter Holm from comment #88)
Please try the following:

commit fc160bc8d3296b16e1882eed7355309a8eefcdcb
Author: Konstantin Belousov <kib@FreeBSD.org>
Date:   Mon Mar 4 09:26:55 2024 +0200

    allocbuf()/vfs_vmio_extend(): assume zero-sized buffers are B_CACHE

    vfs_vmio_extend() tests each added page for valid, and then extends the
    validity of B_CACHE flag if all added pages are valid. But, if we have a
    blank buffer without any content yet, the B_CACHE flag is not set yet,
    and validity test is not executed due to this. This makes the buffer
    incoherent with the page content, loosing data.

diff --git a/sys/kern/vfs_bio.c b/sys/kern/vfs_bio.c
index b5466fb2cd53..3ec886a32ded 100644
--- a/sys/kern/vfs_bio.c
+++ b/sys/kern/vfs_bio.c
@@ -3175,6 +3175,8 @@ vfs_vmio_extend(struct buf *bp, int desiredpages, int size)
 	 * B_CACHE may remain set! XXX
 	 */
 	toff = bp->b_bcount;
+	if (toff == 0)
+		bp->b_flags |= B_CACHE;
 	tinc = PAGE_SIZE - ((bp->b_offset + toff) & PAGE_MASK);
 	while ((bp->b_flags & B_CACHE) && toff < size) {
 		vm_pindex_t pi;

(both my patches are available on the ufs1 branch of deviant3)
(In reply to Konstantin Belousov from comment #89)
Unfortunately, this did not seem to have any effect on the file corruption.

09:38:03 Start test of mmap47.sh
09:39:13, elapsed 0 days, 00:01.10
09:40:19, elapsed 0 days, 00:02.16
09:41:23, elapsed 0 days, 00:03.20
09:42:28, elapsed 0 days, 00:04.25
09:43:31, elapsed 0 days, 00:05.28
1050c1050
< 0040620 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f
---
> 0040620 90 91 92 93 94 95 96 97 ee 99 9a 9b 9c 9d 9e 9f
256 -rw-------  1 root wheel 262144 Mar 4 09:44 file
256 -rw-------  1 root wheel 262144 Mar 4 09:43 file.orig
09:44:35, elapsed 0 days, 00:06.32
09:44 /usr/src/tools/test/stress2/misc $ uname -a
FreeBSD mercat1.netperf.freebsd.org 15.0-CURRENT FreeBSD 15.0-CURRENT #0 ufs1-n268657-fc160bc8d32: Mon Mar 4 09:21:15 CET 2024 pho@mercat1.netperf.freebsd.org:/var/tmp/deviant3/sys/amd64/compile/PHO amd64
09:44 /usr/src/tools/test/stress2/misc $
(In reply to Peter Holm from comment #88) I tried commenting out the ncl_vinvalbuf() call in nfs_setattr() when size is changed. The failure still occurred. To be honest, I can't remember why/if this call is needed? It seems that ncl_meta_setsize()->vtruncbuf() would do all that is necessary. Anyhow, with ncl_vinvalbuf() commented out, the vtruncbuf() called from ncl_meta_setsize() is the only thing that is done for ftruncate() that seems it might be related to this and commenting it out breaks a bunch of stuff. I am not suggesting that vtruncbuf() is broken, but it seems that it helps cause the failure.
This is just an observation. By replacing the ftruncate() call in mmap47.sh with a fdatasync() I also see the data corruption:

20:53:22 Start test of mmap47a.sh
20:54:27, elapsed 0 days, 00:01.05
20:55:31, elapsed 0 days, 00:02.09
20:56:35, elapsed 0 days, 00:03.13
20:57:41, elapsed 0 days, 00:04.19
20:58:45, elapsed 0 days, 00:05.23
20:59:49, elapsed 0 days, 00:06.27
12661c12661
< 0613500 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f
---
> 0613500 40 41 42 43 44 45 46 ee 48 49 4a 4b 4c 4d 4e 4f
256 -rw-------  1 root wheel 262144 Mar  5 21:00 file
256 -rw-------  1 root wheel 262144 Mar  5 20:59 file.orig
21:01:08, elapsed 0 days, 00:07.46
Failed with exit code 2 after 7 loops of mmap47a.sh.

21:01 /usr/src/tools/test/stress2/misc $ diff -u mmap47.sh mmap47a.sh
--- mmap47.sh	2024-03-01 20:00:28.468087000 +0100
+++ mmap47a.sh	2024-03-05 20:53:07.446204000 +0100
@@ -126,8 +126,13 @@
 	while (go == -1)
 		usleep(50);
 	while (go == 1) {
+#if 0
 		if (ftruncate(fd, siz) == -1) /* No size change */
 			err(1, "truncate)");
+#else
+		if (fdatasync(fd) == -1)
+			err(1, "fdatasync()");
+#endif
 		usleep(arc4random() % 1000);
 	}
 	return (0);
21:20 /usr/src/tools/test/stress2/misc $
Is there a way to disable nfsiods at all? Maybe setting vfs.nfs.iodmin = vfs.nfs.iodmax = 0 would do it; if not, I will try to provide a patch to disable them. It would be interesting to know whether the corruption is still reproducible with nfsiods disabled.
(In reply to Konstantin Belousov from comment #93)
Yes, if you set both vfs.nfs.iodmin and vfs.nfs.iodmax to 0 before doing the first NFS mount after booting, there should be no nfsiod threads. (If you set them to 0 after a mount, I'm not sure how long it takes for them to go away?)

You can check via:
# ps axH | fgrep iod

I ran 64 cycles of mmap44.sh without a failure, but that doesn't tell us much.

Hopefully Peter can try this, rick
I ran the test scenario variations I have for 3 hours without seeing any problems. This with:

16:49 /usr/src/tools/test/stress2/misc $ sysctl vfs.nfs.iodmin vfs.nfs.iodmax
vfs.nfs.iodmin: 0
vfs.nfs.iodmax: 0
16:50 /usr/src/tools/test/stress2/misc $

Unfortunately, the original test scenario still fails:

mount -t nfs -o tcp -o retrycnt=3 -o rw \
    127.0.0.1:$mp1 $mp2; s=$?
(
cargo run --release -- -N 1024 -S 3381155135556591634 -f copy_file_range.toml -m 149264:149265 -v -v -v $mp2/foo.bin # FAIL
s=$?
echo "Exit status is $s"
) 2>&1 | grep -Ev 'DEBUG|INFO|WARN'
    Finished release [optimized] target(s) in 2.97s
     Running `target/release/fsx -N 1024 -S 3381155135556591634 -f copy_file_range.toml -m '149264:149265' -v -v -v /mnt11/foo.bin`
[ERROR fsx] miscompare: offset= 0x1eedd, size = 0x19f6
[ERROR fsx] OFFSET  GOOD  BAD  RANGE
[ERROR fsx] 0x1eedd 0x38  0x35 0x6c6
[ERROR fsx] Step# (mod 256) for a misdirected write may be 145
[ERROR fsx] Using seed 3381155135556591634
[ERROR fsx] LOG DUMP
[ERROR fsx]  1 SKIPPED (read)
:
[ERROR fsx] 699 MAPWRITE 0x563e  => 0x1445e ( 0xee20 bytes)
[ERROR fsx] 700 READ     0x1eedd => 0x208d3 ( 0x19f6 bytes)
Exit status is 1
umount $mp2
umount $mp1
mdconfig -d -u $mdstart
exit 0
16:52 /home/pho/fsx-rs $
(In reply to Peter Holm from comment #95)
This .toml config enables copy_file_range() usage, am I right? Could you please try, in addition to disabling nfsiods on the client, disabling copy_range on the server by setting vfs.nfsd.maxcopyrange to zero?
This is the config file I used:

cat > copy_file_range.toml <<HERE
nomsyncafterwrite = true
[weights]
truncate = 1
fdatasync = 1
fsync = 1
write = 10
read = 10
punch_hole = 0
posix_fallocate = 0
sendfile = 0
HERE

I added a "copy_file_range = 0", but fsx still fails. I ran this right after a boot:

sysctl vfs.nfs.iodmin=0
vfs.nfs.iodmin: 0 -> 0
sysctl vfs.nfs.iodmax=0
vfs.nfs.iodmax: 20 -> 0
sysctl vfs.nfsd.maxcopyrange=0
vfs.nfsd.maxcopyrange: 9223372036854775807 -> 0
root@mercat1:~ # ps auxwwl | grep -v grep | grep iod
root@mercat1:~ #
Please try this instead of disabling copy_range on the server:

diff --git a/sys/fs/nfsclient/nfs_clvnops.c b/sys/fs/nfsclient/nfs_clvnops.c
index 0b8c587a542c..8a98b047c0d5 100644
--- a/sys/fs/nfsclient/nfs_clvnops.c
+++ b/sys/fs/nfsclient/nfs_clvnops.c
@@ -3904,7 +3904,7 @@ nfs_copy_file_range(struct vop_copy_file_range_args *ap)
 			error = ncl_flush(invp, MNT_WAIT, curthread, 1, 0);
 	}
 	if (error == 0)
-		error = ncl_vinvalbuf(outvp, V_SAVE, curthread, 0);
+		error = ncl_flush(outvp, MNT_WAIT, curthread, 1, 0);
 
 	/* Do the actual NFSv4.2 RPC. */
 	ret = ret2 = 0;

(nfsiods must be disabled)
Sure. Here's another observation: a ktrace(1) of the syscalls involved in a failing fsx run:

09:19 /home/pho/fsx-rs $ kdump | grep -E "fsx.*CALL" | sed 's/ *4850 //;s/fsx *//;s/(.*)/()/' | sort | uniq -c | sort -rn
1408 CALL write()
 700 CALL compat11.fstat()
 699 CALL lseek()
 373 CALL mmap()
 182 CALL munmap()
 161 CALL pread()
 158 CALL pwrite()
  52 CALL ftruncate()
  23 CALL open()
  14 CALL fdatasync()
  13 CALL sigprocmask()
  12 CALL fsync()
  11 CALL mprotect()
   9 CALL close()
   7 CALL fstat()
   6 CALL sigaction()
   5 CALL fstatfs()
   4 CALL read()
   3 CALL sigfastblock()
   3 CALL sigaltstack()
   3 CALL openat()
   2 CALL issetugid
   2 CALL getpid
   1 CALL thr_wake()
   1 CALL thr_self()
   1 CALL sysarch()
   1 CALL rtprio_thread()
   1 CALL readlink()
   1 CALL poll()
   1 CALL ioctl()
   1 CALL getrandom()
   1 CALL getcontext()
   1 CALL exit()
   1 CALL cpuset_getaffinity()
   1 CALL _umtx_op()
   1 CALL __sysctlbyname()
09:19 /home/pho/fsx-rs $
fsx still fails with this change to nfs_clvnops.c:

10:21 /home/pho/fsx-rs $ (cd /usr/src/sys/fs/nfsclient; git diff nfs_clvnops.c)
diff --git a/sys/fs/nfsclient/nfs_clvnops.c b/sys/fs/nfsclient/nfs_clvnops.c
index 0b8c587a542c..976bedee41c4 100644
--- a/sys/fs/nfsclient/nfs_clvnops.c
+++ b/sys/fs/nfsclient/nfs_clvnops.c
@@ -3904,7 +3904,8 @@ nfs_copy_file_range(struct vop_copy_file_range_args *ap)
 			error = ncl_flush(invp, MNT_WAIT, curthread, 1, 0);
 	}
 	if (error == 0)
-		error = ncl_vinvalbuf(outvp, V_SAVE, curthread, 0);
+//		error = ncl_vinvalbuf(outvp, V_SAVE, curthread, 0);
+		error = ncl_flush(outvp, MNT_WAIT, curthread, 1, 0);
 
 	/* Do the actual NFSv4.2 RPC. */
 	ret = ret2 = 0;
10:21 /home/pho/fsx-rs $ uname -a
FreeBSD mercat1.netperf.freebsd.org 15.0-CURRENT FreeBSD 15.0-CURRENT #1 main-n268827-75464941dc17-dirty: Tue Mar 19 09:58:11 CET 2024 pho@mercat1.netperf.freebsd.org:/usr/src/sys/amd64/compile/PHO amd64
10:21 /home/pho/fsx-rs $ ps auxwwl | grep -v grep | grep iod
10:21 /home/pho/fsx-rs $
(In reply to Peter Holm from comment #99)
Are you saying that the fsx run with copy_file_range enabled does not use the syscall but still fails? And what happens if copy_file_range is disabled in the .toml?
Created attachment 249298 [details] ktrace run of fsx
What I see is that an fsx run fails both with and without "copy_file_range = 0" added to the config file, and that I see no reference to a copy_file_range() call in the ktrace. I have uploaded the full kdump output to https://people.freebsd.org/~pho/ktrace.log.xz
Created attachment 250535 [details]
Call vnode_pager_clean_sync() before ncl_meta_setsize() in nfs_setattr

Hi Peter,

Maybe you can test this patch. I am thinking that, maybe, ncl_meta_setsize() is marking B_RELBUF and doing brelse(), throwing away the dirty page.

Thanks, rick
(In reply to Rick Macklem from comment #104) Hi Rick, I see the same corruption with attachment_250535.diff added. Both the original fsx program and mmap44.sh fails.
(In reply to Peter Holm from comment #105) Oh well. Thanks for testing it, Peter.