Bug 242828 - diff: -rq still does full file comparison
Summary: diff: -rq still does full file comparison
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: 12.1-RELEASE
Hardware: Any Any
: --- Affects Some People
Assignee: Mark Johnston
URL: https://reviews.freebsd.org/D23152
Keywords: performance, regression
Depends on:
Reported: 2019-12-23 10:55 UTC by ml
Modified: 2021-01-09 20:11 UTC (History)
2 users (show)

See Also:
koobs: mfc-stable12+

patch for the diff(1) -q (353 bytes, patch)
2020-01-13 10:25 UTC, fehmi noyan isi
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description ml 2019-12-23 10:55:20 UTC
In a task of mine I often have to "diff -rq" two directories which have ~75000 files; those files are usually very similar, with a only a few lines differing. Some of them have up to 25M lines.

On 11.3 (which feature GNU diff), the whole process usually take a few minutes.

On 12.1 (which feature BSD diff), I was never able to complete it: most of the times I just interrupted it after some hours, but in a case it did crash the whole system due to an out of swap situation (16GB RAM + 32GB swap!!!).

I tried to find out the reason with gdb and I am convinced that:
_ 11.3, as soon as it sees a file differs, will move to the next one;
_ OTOH, 12.1 will still compute the whole diff set for that file, although it just have to print "File x and y differ".

Of course, there's a workaround as simple as installing textproc/diffutils and using "gdiff" instead of "diff".
Comment 1 fehmi noyan isi 2020-01-08 01:09:52 UTC
It appears BSD diff keeps working on the the files even after it knows whether the files are identical or not. 

In diffreg(), which is called by diff() we have [1]

	switch (files_differ(f1, f2, flags)) {
	case 0:
		goto closem;
	case 1:
		/* error */
		status |= 2;
		goto closem;

files_differ() returns 0 when the files are identical and 1 when they are different. For the cases when files_differ() returns 1, instead of just a single break statement, we can add a check for the command line arguments and if '-q' supplied, i.e. D_BRIEF, then we should return from diffred() (after doing some housekeeping if necessary - or add another 'goto' label at the end of the function).

Instead, BSD diff carries on and hashes each line in both files, sorts them and do more checks, hence taking more time.

If this sounds reasonable, I can work on a patch.

[1] http://src.illumos.org/source/xref/freebsd-head/usr.bin/diff/diffreg.c#341
Comment 2 fehmi noyan isi 2020-01-08 11:47:09 UTC
I think adding one _if_ check to the case where the files differ would be sufficient;

        switch (files_differ(f1, f2, flags)) {
        case 0:
                goto closem;
        case 1:
                /* if brief output is needed, just return to the caller */
                if (diff_format == D_BRIEF) {
                        rval = D_DIFFER;
                        goto closem;
                /* error */
                status |= 2;
                goto closem;

with this change, when -q or --brief is used, diff(1) will just print a message without doing any further work
Comment 3 Mark Johnston freebsd_committer 2020-01-09 15:37:40 UTC
(In reply to fehmi noyan isi from comment #2)
This looks reasonable to me.  Can you verify that the tests in /usr/tests/usr.bin/diff pass with your change applied?  Did you verify that diff exits with the correct status in all cases?
Comment 4 fehmi noyan isi 2020-01-13 10:25:35 UTC
Created attachment 210686 [details]
patch for the diff(1) -q

In addition to the code I gave in my previous comment, the "status" variable needs to be updated in order to set the proper exit status.

The attached patch seems to work and the patched diff(1) utility passes all the tests.

root@patch:/usr/src/usr.bin/diff # kyua test -k /usr/tests/Kyuafile usr.bin/diff 
usr.bin/diff/diff_test:Bflag  ->  passed  [0.071s]
usr.bin/diff/diff_test:b230049  ->  passed  [0.020s]
usr.bin/diff/diff_test:brief_format  ->  passed  [0.075s]
usr.bin/diff/diff_test:group_format  ->  passed  [0.020s]
usr.bin/diff/diff_test:header  ->  passed  [0.026s]
usr.bin/diff/diff_test:header_ns  ->  passed  [0.023s]
usr.bin/diff/diff_test:ifdef  ->  passed  [0.022s]
usr.bin/diff/diff_test:side_by_side  ->  expected_failure: --side-by-side not currently implemented (bug # 219933): atf-check failed; see the output of the test for details  [0.033s]
usr.bin/diff/diff_test:simple  ->  passed  [0.086s]
usr.bin/diff/diff_test:unified  ->  passed  [0.038s]
usr.bin/diff/netbsd_diff_test:mallocv  ->  passed  [0.024s]
usr.bin/diff/netbsd_diff_test:nomallocv  ->  passed  [0.023s]
usr.bin/diff/netbsd_diff_test:same  ->  passed  [0.024s]

Results file id is usr_tests.20200113-172221-636074
Results saved to /root/.kyua/store/results.usr_tests.20200113-172221-636074.db

13/13 passed (0 failed)
root@patch:/usr/src/usr.bin/diff # 

The patch file was generated with the command below

root@patch:~ # diff -u diffreg.c.orig diffreg.c > diffreg.c.patch
root@patch:~ # cat diffreg.c.patch 
--- diffreg.c.orig	2020-01-14 03:16:23.048362000 +1300
+++ diffreg.c	2020-01-14 05:49:28.212503000 +1300
@@ -342,6 +342,12 @@
 	case 0:
 		goto closem;
 	case 1:
+		/* if brief output is needed, just return to the caller */
+		if (diff_format == D_BRIEF) {
+			status |= 1;
+			rval = D_DIFFER;
+			goto closem;
+		}
 		/* error */
root@patch:~ # uname -a
FreeBSD patch 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r356528: Thu Jan  9 04:56:46 UTC 2020     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

It would be nice to do some performance benchmarks given unexpectedly long running time was the driver for the bug report.
Comment 5 Mark Johnston freebsd_committer 2020-01-13 16:04:22 UTC
(In reply to fehmi noyan isi from comment #4)
Thanks!  I will work on getting this committed.
Comment 6 commit-hook freebsd_committer 2020-01-13 18:30:00 UTC
A commit references this bug:

Author: markj
Date: Mon Jan 13 18:29:48 UTC 2020
New revision: 356695
URL: https://svnweb.freebsd.org/changeset/base/356695

  Optimize diff -q.

  Once we know whether the files differ, we don't need to do any further

  PR:		242828
  Submitted by:	fehmi noyan isi <fnoyanisi@yahoo.com> (original version)
  Reviewed by:	bapt, kevans
  MFC after:	1 week
  Differential Revision:	https://reviews.freebsd.org/D23152

Comment 7 commit-hook freebsd_committer 2020-01-20 01:38:26 UTC
A commit references this bug:

Author: markj
Date: Mon Jan 20 01:38:03 UTC 2020
New revision: 356903
URL: https://svnweb.freebsd.org/changeset/base/356903

  MFC r356695, r356731:
  Optimize diff -q.

  PR:	242828

_U  stable/12/