Using copy_file_range to copy files from anonymous shared memory objects returns EINVAL. This is not one of the reasons for failure documented in the man page. Ideally there should be a fallback path that works for any kind of file descriptor, but if the system call works for a limited subset of file descriptor types then the ones for which it does work should be documented in the manual.
Yes, I think the man page needs to be patched so that it clearly states "file descriptors for regular file" and add the case of no-regular files to EINVAL. The title uses "file" which I think of as "regular file", but that is not clear enough. I'll come up with a patch for the man page.
(In reply to Rick Macklem from comment #1) > The title uses "file" which I think of as "regular file", On *NIX, everything is a file. I found it very surprising that it didn't work with shared memory objects. On Linux (where this call originated), shared memory objects are files in a special memory filesystem and so the line is especially blurry. I don't believe the Linux implementation has such a limitation: it falls back to the equivalent of a read and write loop in the kernel if the fast paths are not supported). As a user, it's very surprising that a pair of file descriptors that work with lseek, read, and write, do not work with `copy_file_range`. It's made worse by the fact that `EINVAL` covers a variety of errors and so writing fallback code for when the file descriptor provided to an API does not work with this function cannot unambiguously detect that the reason for the failure was a file descriptor for which this is not supported.
(In reply to David Chisnall from comment #2) +1, especially since it's not hard to implement fallback method
Long ago when I first worked on 6th edition (late 1970s) there was the "everything was a file" philosophy. However, things like shared emeory did not exist. The fact that the code does not use vnodes suggests they are not files to me. Right now the code (including vn_generic_copy_file_range() which is the fallback) uses vnodes. (The error is actually returned because there is no vnode associated with the file descriptor.) To do this would require a separate chunk of code directly under the syscall to do the copying. A simple loop doing shm_read()/shm_write() should work, I'd guess (but it will need to be done as a special case when there are no vnodes) and needs to check for "no overlapping region", since that is part of the syscall interface. The original intent of copy_file_range(2) was to enable file systems to optimize copying, such as NFSv4 doing the copy locally in the NFS server instead of reading data to the client and writing it back to the server. (The fallback was implemented to simplify the case where the file system does not have a specific VOP_COPY_FILE_RANGE().) Since this case is not file system related, the above does not really apply. (I suppose do so saves a few syscalls.) I will ask on freebsd-fs@ to see if others feel this capability is needed. Feel free to come up with a patch. If others do not, I may do so someday. (It would be called from kern_copy_file_range() when f_vnode == NULL.)
I looked and doing the fallback will require some messing about. shm_read() and shm_write() do range locking however copy_file_ramge() needs to do range locking on both fds before copying. I think variants of shm_read() and shm_write() would need to be created without the range locking so that kern_copy_file_range() can call those. kern_copy_file_range() will also need to use rangelock_rlock(), rangelock_tryrlock() and rangelock_wlock() etc for this case.
It looks like there would need to be a layering violation, since the code that goes in kern_copy_file_range() will need to be specific to shm_ops, since the range lock code needs to use shm_mtx. (Or a new f_op specifically for copy_file_range() would need to be added.) So, the fallback is not trivial, although it is not a lot of work.
Now that I have taken a quick look, doing the shared memory fallback will be quite a bit of work, given the range locking stuff. Something like: - Define a new f_op (f_copy_file_range()). - Factor most of the contents of kern_copy_file_range() out into a vn_fileop_copy_file_range(). - Call the new f_op from kern_copy_file_range() - Write a shm_copy_file_range() f_op that is similar to the vnode one, but uses the shmfd stuff for rangelocking and then does the copying (via memcpy() in some sort of loop or maybe uiomove() needs to be used?). - I am not sure what this last step requires. Not that big a deal, but not that easy.
The most recent Linux man page for copy_file_range(2) appears to be here: https://man7.org/linux/man-pages/man2/copy_file_range.2.html In it: EINVAL The flags argument is not 0. EINVAL fd_in and fd_out refer to the same file and the source and target ranges overlap. EINVAL Either fd_in or fd_out is not a regular file. So, the current implementation appears to be Linux compatible. (The FreeBSD man page does need to be fixed for the case of EINVAL being returned for either fd not referring to regular files.) I have posted to freebsd-fs@ and we'll see whether most others think that the FreeBSD syscall should remain Linux compatible. (Which is what is noted at the end of the FreeBSD man page.)
(In reply to Rick Macklem from comment #7) What are the range locking guarantees? I don't see anything in the man page (ours or Linux's) that mentions anything about locking ranges or any guarantees about what happens if you concurrently write to the range being read or read from the range being written. In the absence of any such guarantees, I'd expect the semantics of this call to be the equivalent of a read/write loop on the two fds. My expectation for the fallback path was that it would: 1. stat the source to see how big it was. 2. Allocate a temporary buffer to hold whichever is smaller out of a sensible copy size or the amount remaining to be copied. 3. Read from the source file into that buffer. 4. Write from the buffer to the destination. 5. Loop to copy any more if 2 allocated a buffer smaller than the copy amount. All of these are exposed in the `file*` that `kern_copy_file_range` has access to. In the worst case, this should still be faster than doing it in userspace because it will have fewer system call transitions. Ideally, if the source is already in the buffer cache then we can avoid allocating the buffer and just write directly from there, but that's not necessary to make it usable. My current problem with this call is that there is no way for userspace code to know in advance whether a file descriptor that it has been passed (in my case, via a UNIX domain socket) is one that can or can't be used with copy_file_range and, after trying it, cannot differentiate between failures as a result of unsupported file descriptor types or as a result of other errors. Ideally, I'd be able to use this syscall with anything where the userspace fallback path worked. Alternatively, if the error from unsupported file descriptor types, the fallback code could be in the libc wrapper.
(In reply to David Chisnall from comment #9) From IEEE Standard for Information Technology— Portable Operating System Interface (POSIX®) Base Specifications, Issue 7 2.9.7 Thread Interactions with Regular File Operations All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2017 when they operate on regular files or symbolic links: chmod( ) chown( ) close( ) creat( ) dup2( ) fchmod( ) fchmodat( ) fchown( ) fchownat( ) fcntl( ) fstat( ) fstatat( ) ftruncate( ) lchown( ) link( ) linkat( ) lseek( ) lstat( ) open( ) openat( ) pread( ) read( ) readlink( ) readlinkat( ) readv( ) pwrite( ) rename( ) renameat( ) stat( ) symlink( ) symlinkat( ) truncate( ) unlink( ) unlinkat( ) utime( ) utimensat( ) utimes( ) write( ) writev( ) If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them. The requirement on the close( ) function shall also apply whenever a file descriptor is successfully closed, however caused (for example, as a consequence of calling close( ), calling dup2( ), or of process termination).
(In reply to Konstantin Belousov from comment #10) > when they operate on regular files or symbolic links So not necessary for the fallback case for file descriptors that are not regular files?
(In reply to David Chisnall from comment #11) It is not required, it is quality of implementation. POSIX does not require that read(2) works on shmfd at all.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=84b4342c0d7ac8a3187309a978d41e6765154cc1 commit 84b4342c0d7ac8a3187309a978d41e6765154cc1 Author: Rick Macklem <rmacklem@FreeBSD.org> AuthorDate: 2023-12-29 22:59:00 +0000 Commit: Rick Macklem <rmacklem@FreeBSD.org> CommitDate: 2023-12-29 22:59:00 +0000 copy_file_range.2: Clarify that only regular files work PR#273962 reported that copy_file_range(2) did not work on shared memory objects and returned EINVAL. Although the reporter felt this was incorrect, it is what the Linux copy_file_range(2) syscall does. Since there was no collective agreement that the FreeBSD semantics should be changed to no longer be Linux compatible, copy_file_range(2) still works on regular files only. This man page update clarifies that. If, someday, copy_file_range(2) is changed to support non-regular files, then the man page will need to be updated to reflect that. PR: 273962 Reviewed by: karels, pauamma_gundo.com (manpages) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D43227 lib/libc/sys/copy_file_range.2 | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-)
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=5c4da68ad785e955976e7b73c281213abb85c23a commit 5c4da68ad785e955976e7b73c281213abb85c23a Author: Rick Macklem <rmacklem@FreeBSD.org> AuthorDate: 2023-12-29 22:59:00 +0000 Commit: Rick Macklem <rmacklem@FreeBSD.org> CommitDate: 2024-01-11 01:13:23 +0000 copy_file_range.2: Clarify that only regular files work PR#273962 reported that copy_file_range(2) did not work on shared memory objects and returned EINVAL. Although the reporter felt this was incorrect, it is what the Linux copy_file_range(2) syscall does. Since there was no collective agreement that the FreeBSD semantics should be changed to no longer be Linux compatible, copy_file_range(2) still works on regular files only. This man page update clarifies that. If, someday, copy_file_range(2) is changed to support non-regular files, then the man page will need to be updated to reflect that. PR: 273962 (cherry picked from commit 84b4342c0d7ac8a3187309a978d41e6765154cc1) lib/libc/sys/copy_file_range.2 | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=6561a7141b5603235d18538f6c66322ed7fc67d3 commit 6561a7141b5603235d18538f6c66322ed7fc67d3 Author: Rick Macklem <rmacklem@FreeBSD.org> AuthorDate: 2023-12-29 22:59:00 +0000 Commit: Rick Macklem <rmacklem@FreeBSD.org> CommitDate: 2024-01-11 01:20:35 +0000 copy_file_range.2: Clarify that only regular files work PR#273962 reported that copy_file_range(2) did not work on shared memory objects and returned EINVAL. Although the reporter felt this was incorrect, it is what the Linux copy_file_range(2) syscall does. Since there was no collective agreement that the FreeBSD semantics should be changed to no longer be Linux compatible, copy_file_range(2) still works on regular files only. This man page update clarifies that. If, someday, copy_file_range(2) is changed to support non-regular files, then the man page will need to be updated to reflect that. PR: 273962 (cherry picked from commit 84b4342c0d7ac8a3187309a978d41e6765154cc1) lib/libc/sys/copy_file_range.2 | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-)
^Triage: assign to committer who also MFCed.