Bug 252358 - cp(1) of large files is causing 100% CPU utilization and poor transfer of ~168M/minute
Summary: cp(1) of large files is causing 100% CPU utilization and poor transfer of ~16...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: amd64 Any
: --- Affects Some People
Assignee: Rick Macklem
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-01-02 20:02 UTC by Matthias Apitz
Modified: 2021-01-10 02:24 UTC (History)
6 users (show)

See Also:


Attachments
fix the problem by using a large "len" argument for copy_file_range(2) (685 bytes, patch)
2021-01-02 21:49 UTC, Rick Macklem
no flags Details | Diff
the truss file of running: truss -o tr cp -p guru-20210102.tar.gz xxx (4.31 KB, text/plain)
2021-01-03 07:32 UTC, Matthias Apitz
no flags Details
fix the problem by using a large "len" argument for copy_file_range(2) (1.11 KB, patch)
2021-01-04 16:26 UTC, Rick Macklem
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Matthias Apitz 2021-01-02 20:02:18 UTC
While using cp(1) to copy large files (seen with 36GByte) on UFS file systems with something like:

# cp -p guru-20210102.tar.gz xxx

the CPU utilization is 100% while the copied amount of data is poor (some 168MByte per minute). The cp(1) process shows in truss(1) only copy_file_range(2) sys calls with low frequency:

# truss -p 37655                                                                          
copy_file_range(0x3,0x0,0x4,0x0,0x200000,0x0)    = 2097152 (0x200000)                     
copy_file_range(0x3,0x0,0x4,0x0,0x200000,0x0)    = 2097152 (0x200000)                     
copy_file_range(0x3,0x0,0x4,0x0,0x200000,0x0)    = 2097152 (0x200000)                     
copy_file_range(0x3,0x0,0x4,0x0,0x200000,0x0)    = 2097152 (0x200000)                     
copy_file_range(0x3,0x0,0x4,0x0,0x200000,0x0)    = 2097152 (0x200000)
Comment 1 Rick Macklem freebsd_committer freebsd_triage 2021-01-02 21:49:40 UTC
Created attachment 221214 [details]
fix the problem by using a large "len" argument for copy_file_range(2)

This patch modifies "cp" so that it uses a large "len"
argument to copy_file_range(2).
This results in far fewer copy_file_range(2) syscalls,
each of which will do a SEEK_DATA and SEEK_HOLE.

I suspect that these SEEK_DATA/SEEK_HOLE operations
are slow for large non-sparse files and doing a large
number of them caused the slowdown.
Comment 2 commit-hook freebsd_committer freebsd_triage 2021-01-03 01:02:20 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=c98a764c681f8b70812a9f13a6e61c96aa1a69d2

commit c98a764c681f8b70812a9f13a6e61c96aa1a69d2
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2021-01-03 00:58:43 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2021-01-03 00:58:43 +0000

    cp(1): fix performance issue for large non-sparse file copies

    PR252358 reported a serious performance problem when
    copying a large non-sparse file on a UFS file system.
    This problem seems to have been caused by a large
    number of SEEK_HOLE operations, with one done
    for each copy_file_range(2) call.

    This patch modifies cp(1) to use a large (SSIZE_MAX)
    len argument, reducing the number of system calls
    and resolving the performance issue.

    While here, convert the type of the "rcount" from "int"
    to "ssize_t" so that it is consistent with that returned
    by both read(2) and copy_file_range(2).

    PR:     252358
    Reviewed by:    asomers
    Differential Revision:  https://reviews.freebsd.org/D27937

 bin/cp/utils.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)
Comment 3 Rick Macklem freebsd_committer freebsd_triage 2021-01-03 01:29:59 UTC
I have committed a variant of the patch to head.
Since this only applies to head, I will close
this PR once I hear that Matthias has tested the
patch and it has resolved the problem for him.
Comment 4 Matthias Apitz 2021-01-03 07:29:50 UTC
I did with the file from Rick's mail:

# cd /usr/src
# patch < ~guru/r368166/patch.252358
# cd bin/cp
# make
# make install

and tested the cp(1) as:

# cd /home/backups
# ls -lh guru-20210102.tar.gz
-r--------  1 root  wheel    36G  2 ene.  16:02 guru-20210102.tar.gz

# cp -p guru-20210102.tar.gz xxx
cp: guru-20210102.tar.gz: No such file or directory

The target file 'xxx' has the same size of the source and the time of 301 secs seems to be ok; but what does the message mean?

I run the cp(1) with truss(1) and the messages comes out of the process:

# truss -o tr cp -p guru-20210102.tar.gz xxx
cp: guru-20210102.tar.gz: No error: 0

# grep 'write(2' tr
write(2,"cp: ",4)				 = 4 (0x4)
write(2,"guru-20210102.tar.gz",20)		 = 20 (0x14)
write(2,": ",2)					 = 2 (0x2)
write(2,"No error: 0\n",12)			 = 12 (0xc)

When the target file 'xxx' does not exist, the message is even like a real error message:

# rm xxx
# truss -o tr cp -p guru-20210102.tar.gz xxx
cp: guru-20210102.tar.gz: No such file or directory
# grep 'write(2' tr
write(2,"cp: ",4)				 = 4 (0x4)
write(2,"guru-20210102.tar.gz",20)		 = 20 (0x14)
write(2,": ",2)					 = 2 (0x2)
write(2,"No such file or directory\n",26)	 = 26 (0x1a)

The message has to do with the call to copy_file_range(2), there is only
one single syscall which tries to copy the full file at once:

copy_file_range(0x3,0x0,0x4,0x0,0x7fffffffffffffff,0x0) = 38603862368 (0x8fcf83160)
write(2,"cp: ",4)                                = 4 (0x4)
write(2,"guru-20210102.tar.gz",20)               = 20 (0x14)
write(2,": ",2)                                  = 2 (0x2)

It looks like the message comes out of cp.c around like 290:
...
        for (badcp = rval = 0; (curr = fts_read(ftsp)) != NULL; badcp = 0) {
                switch (curr->fts_info) {
                case FTS_NS:
                case FTS_DNR:
                case FTS_ERR:
                        warnx("%s: %s",
                            curr->fts_path, strerror(curr->fts_errno));
                        badcp = rval = 1;
                        continue;
...

I will attach the complete file 'tr';
Comment 5 Matthias Apitz 2021-01-03 07:32:38 UTC
Created attachment 221230 [details]
the truss file of running: truss -o tr cp -p guru-20210102.tar.gz xxx
Comment 6 Rick Macklem freebsd_committer freebsd_triage 2021-01-04 16:26:27 UTC
Created attachment 221271 [details]
fix the problem by using a large "len" argument for copy_file_range(2)

The old patch didn't change the type of rcount to ssize_t.
Comment 7 Matthias Apitz 2021-01-04 16:55:26 UTC
I've tested the new patch from the PR. This gives:                                                                       
                                                                                          
# cd /usr/src/bin/cp                                                                      
# patch < ~guru/cp.patch                                                                  
# make                                                                                    
# make install                                                                            
                                                                                          
# cd /home/backups                                                                        
# time cp -p guru-20210102.tar.gz xxx                                                     
                                                                                          
real    4m54,233s                                                                         
user    0m0,005s                                                                          
sys     1m8,588s                                                                          
                                                                                          
# ls -lh guru-20210102.tar.gz xxx                                                         
-r--------  1 root  wheel    36G  2 ene.  16:02 guru-20210102.tar.gz                      
-r--------  1 root  wheel    36G  2 ene.  16:02 xxx                  

Seems fine to me now.
Comment 8 Rick Macklem freebsd_committer freebsd_triage 2021-01-04 17:07:42 UTC
Thanks. This is a relief. I didn't understand how
it was broken.

Then I saw that the patch I put here was a bogus
early edit, where I had changed "rcount" to
ssize_t in the wrong function.
--> 38603862368, which was truncated to 32bits
    when assigned to the "int" and appeared negative,
    generating the error.

Thanks for reporting this and the quick testing.

I'll close the PR.
Comment 9 Gordon Bergling freebsd_committer freebsd_triage 2021-01-09 14:23:34 UTC
It would be good to have this bugfix also on stable/12.
Comment 10 Rick Macklem freebsd_committer freebsd_triage 2021-01-10 02:24:21 UTC
copy_file_range(2) is only in head/FreeBSD13.