Bug 24732

Summary: cmp can not compare files lager 2GB but smaller 4GB
Product: Base System Reporter: kazarov <kazarov>
Component: binAssignee: dwmalone
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: 4.2-STABLE   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
file.diff
none
cmp.diff none

Description kazarov 2001-01-30 10:30:01 UTC
in /usr/src/usr.bin/cmp/regular.c, line 84, length of files is compared with SIZE_T_MAX which is equal to 0xFFFFFFFF (4GB-1). 
But man mmap says:
BUGS
     len is limited to 2GB.  Mmapping slightly more than 2GB doesn't work, but
     it is possible to map a window of size (filesize % 2GB) for file sizes of
     slightly less than 2G, 4GB, 6GB and 8GB.
And tests shows that cmp cannot mmap two files of silze greate about 1213MB (on my system).
So constant should be decreased to 1GB

How-To-Repeat: user$ dd if=/dev/zero of=tmp seek=2048k bs=1k count=1
1+0 records in
1+0 records out
1024 bytes transferred in 0.000258 secs (3969471 bytes/sec)
user$ ln tmp tmp.l
user$ cmp tmp tmp.l
cmp: tmp: Invalid argument
Comment 1 Bruce Evans 2001-01-30 14:02:21 UTC
On Tue, 30 Jan 2001 kazarov@izmiran.rssi.ru wrote:

> And tests shows that cmp cannot mmap two files of silze greate about 1213MB (on my system).

This seems about right for i386's.  Two files have to be mapped below the
kernel start address of 3G, so their size is limited to 1.5GB.  Big files,
at least, have to be mapped after libraries, so there size is further
limited.

> So constant should be decreased to 1GB

This is rather machine-dependent, and probably too large to be optimal
anyway.

To fix the machine-dependencies, I think cmp should just try to mmap
both files and fall back to c_special() if this fails.  Note that
the current check against SIZE_T_MAX is just to prevent overflow,
but it is broken in several ways:
(1) It SIZE_T_MAX with MIN(len1, len2), but it needs to compare with
    MAX(len1, len2) to prevent overflow when the lengths are cast to
    size_t.
(2) It needs comapare SIZE_T_MAX with
    MAX(len1 + skip1 % pagesize, len2 + skip2 % pagesize) to prevent
    overflow when the adjusted lengths are passed to mmap().
The casts in (1) are bogus anyway.  They are unnecessary if a prototype
for mmap() is in scope and are in the wrong place otherwise (the whole
adjusted lengths should be cast).

Bruce
Comment 2 kazarov 2001-01-30 14:59:23 UTC
> > So constant should be decreased to 1GB
>
> This is rather machine-dependent, and probably too large to be optimal
> anyway.
>
> To fix the machine-dependencies, I think cmp should just try to mmap
> both files and fall back to c_special() if this fails.  Note that
> the current check against SIZE_T_MAX is just to prevent overflow,
> but it is broken in several ways:
> (1) It SIZE_T_MAX with MIN(len1, len2), but it needs to compare with
>     MAX(len1, len2) to prevent overflow when the lengths are cast to
>     size_t.
> (2) It needs comapare SIZE_T_MAX with
>     MAX(len1 + skip1 % pagesize, len2 + skip2 % pagesize) to prevent
>     overflow when the adjusted lengths are passed to mmap().
> The casts in (1) are bogus anyway.  They are unnecessary if a prototype
> for mmap() is in scope and are in the wrong place otherwise (the whole
> adjusted lengths should be cast).

Hi, Bruce,
Sorry, I do not understand why compare MAX(...) to SIZE_T_MAX.
IMHO, mmap maps the file into process memory so both files together should
fit into process accessible address space which is less then SIZE_T_MAX (my
expirements with cmp showed that is about 2430MB, on Solaris 2.4 it was 2GB
only).
So it's nessesary to compare (len1+skip1%pagesize + len2+skip2%pagesize) to
2430MB (for my system).
Cause, reverting to c_special on mmap failure could be a better way but,
IMHO, the best solution would be proposal of Poul-Henning Kamp
<phk@FreeBSD.org> ( http://www.freebsd.org/cgi/query-pr.cgi?pr=18589 ):
"Add a loop around the mmap/compare operation which operates on some
moderate amount of data each iteration."

Sincerely Yours
Dmitry
Comment 3 Bruce Evans 2001-01-30 17:24:15 UTC
On Tue, 30 Jan 2001, Dmitry Kazarov wrote:

> Sorry, I do not understand why compare MAX(...) to SIZE_T_MAX.
> IMHO, mmap maps the file into process memory so both files together should
> fit into process accessible address space which is less then SIZE_T_MAX (my
> expirements with cmp showed that is about 2430MB, on Solaris 2.4 it was 2GB
> only).

mmap()'s `len' arg has type size_t, so lengths larger than SIZE_T_MAX
can't be passed to mmap() for it to check (and fail on).

> So it's nessesary to compare (len1+skip1%pagesize + len2+skip2%pagesize) to
> 2430MB (for my system).

mmap() will check this if the full lengths can be passed to it.  Since
their are 2 mmap()s, there is a machine-dependent amount of space
between them.  PAGE_SIZE is probably sufficient, but you have to look
at the implementation to tell.  Your 2430MB must have some slop for this.

> Cause, reverting to c_special on mmap failure could be a better way but,
> IMHO, the best solution would be proposal of Poul-Henning Kamp
> <phk@FreeBSD.org> ( http://www.freebsd.org/cgi/query-pr.cgi?pr=18589 ):
> "Add a loop around the mmap/compare operation which operates on some
> moderate amount of data each iteration."

I agree.  cp uses 8MB for the moderate amount.  So does xinstall for
its file comparison function.  xinstall also falls back to read()
when mmap() fails, and has a -M flag to prevent use of mmap().

Bruce
Comment 4 dwmalone freebsd_committer freebsd_triage 2001-06-05 20:27:03 UTC
Responsible Changed
From-To: freebsd-bugs->dwmalone

I'll have a look at this one since I had a look at size limits in tail.
Comment 5 andr 2001-10-18 00:33:15 UTC
Attached  patch solves (I hope) this problem. Basic idea of the patch --
to mmap() some moderate amount (8MB) of data rather then entire file
(this idea belongs to phk@FreeBSD.org -- see bin/18589).

				Andrew.

P.S. Patch was made against RELENG_4 sources
(extern.h v 1.1.1.1.14.1 and regular.c v 1.7.2.2). Unfortunately
anoncvs.freebsd.org is down ;-(

Comment 6 dwmalone 2001-10-29 16:49:43 UTC
I've produced a version of Andrew's patch which just mmaps the
things in 8MB chunks. For reasons I don't understand, this actually
seems to make c_regular faster (uses about half as much user time).

I'll commit this later in the week, unless anyone has any objections.

	David.


Index: regular.c
===================================================================
RCS file: /cvs/FreeBSD-CVS/src/usr.bin/cmp/regular.c,v
retrieving revision 1.10
diff -u -r1.10 regular.c
--- regular.c	20 Jun 2000 20:28:40 -0000	1.10
+++ regular.c	29 Oct 2001 16:27:07 -0000
@@ -51,6 +51,9 @@
 
 #include "extern.h"
 
+static u_char *remmap __P((u_char *, int, off_t));
+#define MMAP_CHUNK (8*1024*1024)
+
 #define ROUNDPAGE(i) ((i) & ~pagemask)
 
 void
@@ -59,7 +62,7 @@
 	char *file1, *file2;
 	off_t skip1, len1, skip2, len2;
 {
-	u_char ch, *p1, *p2;
+	u_char ch, *p1, *p2, *m1, *m2, *e1, *e2;
 	off_t byte, length, line;
 	int dfound;
 	off_t pagemask, off1, off2;
@@ -81,23 +84,24 @@
 	off2 = ROUNDPAGE(skip2);
 
 	length = MIN(len1, len2);
-	if (length > SIZE_T_MAX)
-		return (c_special(fd1, file1, skip1, fd2, file2, skip2));
 
-	if ((p1 = (u_char *)mmap(NULL, (size_t)len1 + skip1 % pagesize,
-	    PROT_READ, MAP_SHARED, fd1, off1)) == (u_char *)MAP_FAILED)
-		err(ERR_EXIT, "%s", file1);
-
-	madvise(p1, len1 + skip1 % pagesize, MADV_SEQUENTIAL);
-	if ((p2 = (u_char *)mmap(NULL, (size_t)len2 + skip2 % pagesize,
-	    PROT_READ, MAP_SHARED, fd2, off2)) == (u_char *)MAP_FAILED)
-		err(ERR_EXIT, "%s", file2);
-	madvise(p2, len2 + skip2 % pagesize, MADV_SEQUENTIAL);
+	if ((m1 = remmap(NULL, fd1, off1)) == NULL)
+		c_special(fd1, file1, skip1, fd2, file2, skip2);
+		/* NOTREACHED */
+
+	if ((m2 = remmap(NULL, fd2, off2)) == NULL) {
+		munmap(m1, MMAP_CHUNK);
+		c_special(fd1, file1, skip1, fd2, file2, skip2);
+		/* NOTREACHED */
+	}
 
 	dfound = 0;
-	p1 += skip1 - off1;
-	p2 += skip2 - off2;
-	for (byte = line = 1; length--; ++p1, ++p2, ++byte) {
+	e1 = m1 + MMAP_CHUNK;
+	e2 = m2 + MMAP_CHUNK;
+	p1 = m1 + (skip1 - off1);
+	p2 = m2 + (skip2 - off2);
+
+	for (byte = line = 1; length--; ++byte) {
 		if ((ch = *p1) != *p2) {
 			if (xflag) {
 				dfound = 1;
@@ -111,10 +115,41 @@
 		}
 		if (ch == '\n')
 			++line;
+		if (++p1 == e1) {
+			off1 += MMAP_CHUNK;
+			if ((p1 = m1 = remmap(m1, fd1, off1)) == NULL) {
+				munmap(m2, MMAP_CHUNK);
+				err(ERR_EXIT, "remmap %s", file1);
+			}
+			e1 = m1 + MMAP_CHUNK;
+		}
+		if (++p2 == e2) {
+			off2 += MMAP_CHUNK;
+			if ((p2 = m2 = remmap(m2, fd2, off2)) == NULL) {
+				munmap(m1, MMAP_CHUNK);
+				err(ERR_EXIT, "remmap %s", file2);
+			}
+			e2 = m2 + MMAP_CHUNK;
+		}
 	}
 
 	if (len1 != len2)
 		eofmsg (len1 > len2 ? file2 : file1);
 	if (dfound)
 		exit(DIFF_EXIT);
+}
+
+static u_char *
+remmap(mem, fd, offset)
+	u_char  *mem;
+	int     fd;
+	off_t   offset;
+{
+	if (mem != NULL)
+		munmap(mem, MMAP_CHUNK);
+	mem = mmap(NULL, MMAP_CHUNK, PROT_READ, MAP_SHARED, fd, offset);
+	if (mem == MAP_FAILED)
+		return (NULL);
+	madvise(mem, MMAP_CHUNK, MADV_SEQUENTIAL);
+	return (mem);
 }
Comment 7 dwmalone freebsd_committer freebsd_triage 2001-11-21 10:48:08 UTC
State Changed
From-To: open->closed

Fix committed to both -current and -stable. Thanks!