Bug 273962 - copy_file_range does not work on shared memory objects
Summary: copy_file_range does not work on shared memory objects
Status: Closed FIXED
Alias: None
Product: Documentation
Classification: Unclassified
Component: Manual Pages (show other bugs)
Version: Latest
Hardware: Any Any
: --- Affects Some People
Assignee: Rick Macklem
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-09-20 11:01 UTC by David Chisnall
Modified: 2024-02-08 03:09 UTC (History)
5 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description David Chisnall freebsd_committer freebsd_triage 2023-09-20 11:01:22 UTC
Using copy_file_range to copy files from anonymous shared memory objects returns EINVAL.  This is not one of the reasons for failure documented in the man page.  Ideally there should be a fallback path that works for any kind of file descriptor, but if the system call works for a limited subset of file descriptor types then the ones for which it does work should be documented in the manual.
Comment 1 Rick Macklem freebsd_committer freebsd_triage 2023-09-20 15:22:23 UTC
Yes, I think the man page needs to be patched so
that it clearly states "file descriptors for
regular file" and add the case of no-regular files
to EINVAL.

The title uses "file" which I think of as "regular file",
but that is not clear enough.

I'll come up with a patch for the man page.
Comment 2 David Chisnall freebsd_committer freebsd_triage 2023-09-20 15:52:26 UTC
(In reply to Rick Macklem from comment #1)
> The title uses "file" which I think of as "regular file",

On *NIX, everything is a file.  I found it very surprising that it didn't work with shared memory objects.  On Linux (where this call originated), shared memory objects are files in a special memory filesystem and so the line is especially blurry.  I don't believe the Linux implementation has such a limitation: it falls back to the equivalent of a read and write loop in the kernel if the fast paths are not supported).

As a user, it's very surprising that a pair of file descriptors that work with lseek, read, and write, do not work with `copy_file_range`.  It's made worse by the fact that `EINVAL` covers a variety of errors and so writing fallback code for when the file descriptor provided to an API does not work with this function cannot unambiguously detect that the reason for the failure was a file descriptor for which this is not supported.
Comment 3 Dmitry Chagin freebsd_committer freebsd_triage 2023-09-20 18:09:17 UTC
(In reply to David Chisnall from comment #2)
+1, especially since it's not hard to implement fallback method
Comment 4 Rick Macklem freebsd_committer freebsd_triage 2023-09-20 20:41:45 UTC
Long ago when I first worked on 6th edition
(late 1970s) there was the "everything was
a file" philosophy.  However, things like
shared emeory did not exist. The fact that
the code does not use vnodes suggests they
are not files to me.

Right now the code (including
vn_generic_copy_file_range() which is
the fallback) uses vnodes.
(The error is actually returned because
 there is no vnode associated with the
 file descriptor.)

To do this would require a separate chunk
of code directly under the syscall to do
the copying. A simple loop doing shm_read()/shm_write()
should work, I'd guess (but it will need to be
done as a special case when there are no vnodes) and
needs to check for "no overlapping region", since that
is part of the syscall interface.

The original intent of copy_file_range(2)
was to enable file systems to optimize copying,
such as NFSv4 doing the copy locally in the
NFS server instead of reading data to the client
and writing it back to the server.
(The fallback was implemented to simplify the
case where the file system does not have a
specific VOP_COPY_FILE_RANGE().)
Since this case is not file system related,
the above does not really apply.
(I suppose do so saves a few syscalls.)

I will ask on freebsd-fs@ to see if others
feel this capability is needed.

Feel free to come up with a patch.
If others do not, I may do so someday.
(It would be called from kern_copy_file_range()
when f_vnode == NULL.)
Comment 5 Rick Macklem freebsd_committer freebsd_triage 2023-09-20 21:16:30 UTC
I looked and doing the fallback will
require some messing about.
shm_read() and shm_write() do
range locking however copy_file_ramge()
needs to do range locking on both
fds before copying.

I think variants of shm_read() and
shm_write() would need to be created
without the range locking so that
kern_copy_file_range() can call those.
kern_copy_file_range() will also need
to use rangelock_rlock(), rangelock_tryrlock()
and rangelock_wlock() etc for this case.
Comment 6 Rick Macklem freebsd_committer freebsd_triage 2023-09-20 21:24:04 UTC
It looks like there would need to be a
layering violation, since the code that
goes in kern_copy_file_range() will need
to be specific to shm_ops, since the
range lock code needs to use shm_mtx.
(Or a new f_op specifically for
copy_file_range() would need to be added.)

So, the fallback is not trivial, although
it is not a lot of work.
Comment 7 Rick Macklem freebsd_committer freebsd_triage 2023-09-20 21:48:22 UTC
Now that I have taken a quick look, doing
the shared memory fallback will be quite a
bit of work, given the range locking stuff.
Something like:
- Define a new f_op (f_copy_file_range()).
- Factor most of the contents of kern_copy_file_range()
  out into a vn_fileop_copy_file_range().
- Call the new f_op from kern_copy_file_range()
- Write a shm_copy_file_range() f_op that is similar
  to the vnode one, but uses the shmfd stuff for
  rangelocking and then does the copying (via memcpy()
  in some sort of loop or maybe uiomove() needs to be used?).
  - I am not sure what this last step requires.

Not that big a deal, but not that easy.
Comment 8 Rick Macklem freebsd_committer freebsd_triage 2023-09-21 01:08:33 UTC
The most recent Linux man page for copy_file_range(2)
appears to be here:
https://man7.org/linux/man-pages/man2/copy_file_range.2.html

In it:
       EINVAL The flags argument is not 0.

       EINVAL fd_in and fd_out refer to the same file and the source and
              target ranges overlap.

       EINVAL Either fd_in or fd_out is not a regular file.

So, the current implementation appears to be Linux compatible.
(The FreeBSD man page does need to be fixed for the case of
EINVAL being returned for either fd not referring to regular files.)

I have posted to freebsd-fs@ and we'll see whether most others
think that the FreeBSD syscall should remain Linux compatible.
(Which is what is noted at the end of the FreeBSD man page.)
Comment 9 David Chisnall freebsd_committer freebsd_triage 2023-09-23 08:15:34 UTC
(In reply to Rick Macklem from comment #7)

What are the range locking guarantees?  I don't see anything in the man page (ours or Linux's) that mentions anything about locking ranges or any guarantees about what happens if you concurrently write to the range being read or read from the range being written.  

In the absence of any such guarantees, I'd expect the semantics of this call to be the equivalent of a read/write loop on the two fds.  My expectation for the fallback path was that it would:

1. stat the source to see how big it was.
2. Allocate a temporary buffer to hold whichever is smaller out of a sensible copy size or the amount remaining to be copied.
3. Read from the source file into that buffer.
4. Write from the buffer to the destination.
5. Loop to copy any more if 2 allocated a buffer smaller than the copy amount.

All of these are exposed in the `file*` that `kern_copy_file_range` has access to.  In the worst case, this should still be faster than doing it in userspace because it will have fewer system call transitions.

Ideally, if the source is already in the buffer cache then we can avoid allocating the buffer and just write directly from there, but that's not necessary to make it usable.

My current problem with this call is that there is no way for userspace code to know in advance whether a file descriptor that it has been passed (in my case, via a UNIX domain socket) is one that can or can't be used with copy_file_range and, after trying it, cannot differentiate between failures as a result of unsupported file descriptor types or as a result of other errors.  Ideally, I'd be able to use this syscall with anything where the userspace fallback path worked.

Alternatively, if the error from unsupported file descriptor types, the fallback code could be in the libc wrapper.
Comment 10 Konstantin Belousov freebsd_committer freebsd_triage 2023-09-23 10:06:57 UTC
(In reply to David Chisnall from comment #9)
From IEEE Standard for Information Technology—
Portable Operating System Interface (POSIX®)
Base Specifications, Issue 7

2.9.7 Thread Interactions with Regular File Operations
All of the following functions shall be atomic with respect to each other in the effects specified in
POSIX.1-2017 when they operate on regular files or symbolic links:
chmod( )
chown( )
close( )
creat( )
dup2( )
fchmod( )
fchmodat( )
fchown( )
fchownat( )
fcntl( )
fstat( )
fstatat( )
ftruncate( )
lchown( )
link( )
linkat( )
lseek( )
lstat( )
open( )
openat( )
pread( )
read( )
readlink( )
readlinkat( )
readv( )
pwrite( )
rename( )
renameat( )
stat( )
symlink( )
symlinkat( )
truncate( )
unlink( )
unlinkat( )
utime( )
utimensat( )
utimes( )
write( )
writev( )

If two threads each call one of these functions, each call shall either see all of the specified effects
of the other call, or none of them. The requirement on the close( ) function shall also apply
whenever a file descriptor is successfully closed, however caused (for example, as a consequence
of calling close( ), calling dup2( ), or of process termination).
Comment 11 David Chisnall freebsd_committer freebsd_triage 2023-09-23 10:25:04 UTC
(In reply to Konstantin Belousov from comment #10)

> when they operate on regular files or symbolic links

So not necessary for the fallback case for file descriptors that are not regular files?
Comment 12 Konstantin Belousov freebsd_committer freebsd_triage 2023-09-23 11:14:38 UTC
(In reply to David Chisnall from comment #11)
It is not required, it is quality of implementation.
POSIX does not require that read(2) works on shmfd at all.
Comment 13 commit-hook freebsd_committer freebsd_triage 2023-12-29 23:01:02 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=84b4342c0d7ac8a3187309a978d41e6765154cc1

commit 84b4342c0d7ac8a3187309a978d41e6765154cc1
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-12-29 22:59:00 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2023-12-29 22:59:00 +0000

    copy_file_range.2: Clarify that only regular files work

    PR#273962 reported that copy_file_range(2) did not work
    on shared memory objects and returned EINVAL.
    Although the reporter felt this was incorrect, it is what
    the Linux copy_file_range(2) syscall does.

    Since there was no collective agreement that the FreeBSD
    semantics should be changed to no longer be Linux compatible,
    copy_file_range(2) still works on regular files only.

    This man page update clarifies that. If, someday, copy_file_range(2)
    is changed to support non-regular files, then the man page will
    need to be updated to reflect that.

    PR:     273962
    Reviewed by:    karels, pauamma_gundo.com (manpages)
    MFC after:      1 week
    Differential Revision:  https://reviews.freebsd.org/D43227

 lib/libc/sys/copy_file_range.2 | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)
Comment 14 commit-hook freebsd_committer freebsd_triage 2024-01-11 01:15:27 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=5c4da68ad785e955976e7b73c281213abb85c23a

commit 5c4da68ad785e955976e7b73c281213abb85c23a
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-12-29 22:59:00 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2024-01-11 01:13:23 +0000

    copy_file_range.2: Clarify that only regular files work

    PR#273962 reported that copy_file_range(2) did not work
    on shared memory objects and returned EINVAL.
    Although the reporter felt this was incorrect, it is what
    the Linux copy_file_range(2) syscall does.

    Since there was no collective agreement that the FreeBSD
    semantics should be changed to no longer be Linux compatible,
    copy_file_range(2) still works on regular files only.

    This man page update clarifies that. If, someday, copy_file_range(2)
    is changed to support non-regular files, then the man page will
    need to be updated to reflect that.

    PR:     273962

    (cherry picked from commit 84b4342c0d7ac8a3187309a978d41e6765154cc1)

 lib/libc/sys/copy_file_range.2 | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)
Comment 15 commit-hook freebsd_committer freebsd_triage 2024-01-11 01:21:29 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=6561a7141b5603235d18538f6c66322ed7fc67d3

commit 6561a7141b5603235d18538f6c66322ed7fc67d3
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2023-12-29 22:59:00 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2024-01-11 01:20:35 +0000

    copy_file_range.2: Clarify that only regular files work

    PR#273962 reported that copy_file_range(2) did not work
    on shared memory objects and returned EINVAL.
    Although the reporter felt this was incorrect, it is what
    the Linux copy_file_range(2) syscall does.

    Since there was no collective agreement that the FreeBSD
    semantics should be changed to no longer be Linux compatible,
    copy_file_range(2) still works on regular files only.

    This man page update clarifies that. If, someday, copy_file_range(2)
    is changed to support non-regular files, then the man page will
    need to be updated to reflect that.

    PR:     273962

    (cherry picked from commit 84b4342c0d7ac8a3187309a978d41e6765154cc1)

 lib/libc/sys/copy_file_range.2 | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)
Comment 16 Mark Linimon freebsd_committer freebsd_triage 2024-02-08 03:09:58 UTC
^Triage: assign to committer who also MFCed.