Bug 265320 - diff program runs long when running diff on large files
Summary: diff program runs long when running diff on large files
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Many People
Assignee: Tom Jones
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-07-19 23:53 UTC by Shrikanth R K
Modified: 2022-08-16 08:30 UTC (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Shrikanth R K 2022-07-19 23:53:28 UTC
The issue was found when running diff on a set of files that are nearly identical (1865319 lines / ~76MB in size) except having differences in the first line and line 1865315.

# uname -a
FreeBSD BSD14vm 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n255938-326a8d3e085: Fri Jun  3 08:30:41 UTC 2022     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

The diff program ran for more than 3172.21 seconds (or even more) I stopped monitoring with CTRL-T after some point but here is the capture...

root@BSD14vm:# ls -ltrhG
total 469632
-rw-r--r--  1 root  wheel    76M Jul 19 11:46 config_edb_pre
-rw-r--r--  1 root  wheel    76M Jul 19 11:46 config_edb_post

root@BSD14vm:# diff config_edb_p*
load: 0.21  cmd: diff 972 [running] 3.60r 3.44u 0.15s 29% 40480k
load: 0.21  cmd: diff 972 [running] 4.90r 4.74u 0.15s 39% 40576k
load: 1.10  cmd: diff 972 [running] 156.52r 156.31u 0.15s 100% 46564k
load: 1.19  cmd: diff 972 [running] 462.20r 461.86u 0.18s 100% 49140k
load: 1.19  cmd: diff 972 [running] 462.64r 462.30u 0.18s 100% 49140k
load: 1.19  cmd: diff 972 [running] 462.85r 462.51u 0.18s 100% 49140k
load: 1.09  cmd: diff 972 [running] 503.74r 503.39u 0.18s 100% 49140k
load: 1.09  cmd: diff 972 [running] 503.96r 503.60u 0.18s 100% 49140k
load: 1.09  cmd: diff 972 [running] 504.14r 503.78u 0.18s 100% 49140k
load: 1.09  cmd: diff 972 [running] 504.30r 503.94u 0.18s 100% 49140k
load: 1.07  cmd: diff 972 [running] 766.07r 765.61u 0.21s 98% 51196k
load: 1.17  cmd: diff 972 [running] 919.11r 918.61u 0.21s 100% 51196k
load: 1.13  cmd: diff 972 [running] 1014.49r 1013.96u 0.22s 100% 53252k
load: 1.13  cmd: diff 972 [running] 1014.75r 1014.21u 0.22s 100% 53252k
load: 1.13  cmd: diff 972 [running] 1015.05r 1014.52u 0.22s 98% 53252k
load: 1.13  cmd: diff 972 [running] 1015.31r 1014.77u 0.22s 100% 53252k
load: 1.13  cmd: diff 972 [running] 1015.49r 1014.96u 0.22s 100% 53252k
load: 1.04  cmd: diff 972 [running] 1087.73r 1087.17u 0.22s 100% 53252k
load: 1.13  cmd: diff 972 [running] 1266.02r 1265.41u 0.22s 98% 53252k
load: 1.04  cmd: diff 972 [running] 1468.09r 1467.41u 0.22s 100% 59176k
load: 1.02  cmd: diff 972 [running] 1499.33r 1498.63u 0.22s 100% 59176k
load: 1.08  cmd: diff 972 [running] 2095.98r 2095.10u 0.23s 100% 61232k
load: 1.43  cmd: diff 972 [running] 2146.53r 2145.62u 0.23s 100% 61232k
load: 1.39  cmd: diff 972 [running] 2350.22r 2349.22u 0.23s 100% 63288k
load: 1.24  cmd: diff 972 [running] 2874.17r 2873.05u 0.25s 100% 67392k
load: 1.12  cmd: diff 972 [running] 2932.42r 2931.28u 0.25s 100% 67392k
load: 1.12  cmd: diff 972 [running] 2932.67r 2931.52u 0.25s 100% 67392k
load: 1.10  cmd: diff 972 [running] 2980.07r 2978.90u 0.26s 98% 85608k
load: 1.19  cmd: diff 972 [running] 3173.49r 3172.21u 0.26s 100% 85608k
...
1c1
< <configuration changed-seconds="1658220346" changed-localtime="2022-07-19 10:45:46 CEST">
---
> <configuration changed-seconds="1658219983" changed-localtime="2022-07-19 10:39:43 CEST">
1865315c1865315
<             <description>2022-07-19 10:43:02.391784</description>
---
>             <description>2022-07-18 16:19:44.594290</description>

As you see there is difference on line 1 and line 1865315 for the entire file. If I use 'head' or 'tail' and retain only one of the difference in the file 'diff' is able to complete in less than a second. 

The below files have only the diff on line 1
 #head -n 1865310 config_edb_pre > p1_headn1865310
 #head -n 1865310 config_edb_post > p2_headn1865310

# time diff p1_headn1865310 p2_headn1865310
1c1
< <configuration changed-seconds="1658219983" changed-localtime="2022-07-19 10:39:43 CEST">
---
> <configuration changed-seconds="1658220346" changed-localtime="2022-07-19 10:45:46 CEST">
        0.83 real         0.74 user         0.09 sys

The below files have only the diff on line 1865306
 # tail -n 1865310 config_edb_pre > p1_tailn1865310
 # tail -n 1865310 config_edb_post > p2_tailn1865310

# time diff p1_tailn1865310 p2_tailn1865310 
1865306c1865306
<             <description>2022-07-18 16:19:44.594290</description>
---
>             <description>2022-07-19 10:43:02.391784</description>
        0.84 real         0.75 user         0.09 sys

# which diff
/usr/bin/diff

# diff --version
FreeBSD diff 20220309

# file config_edb_p*
config_edb_post: ASCII text
config_edb_pre:  ASCII text
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2022-07-20 14:09:46 UTC
Can you provide some test case which reproduces the problem?  If not I don't think this PR will get much attention.
Comment 2 Shrikanth R K 2022-07-20 19:45:06 UTC
The tarball for config files is 2.8M+ so could not attach it here, shoved it to my GitHub, retrieve it from here
https://github.com/shrikanth07/FreeBSD-Bug-reports/blob/57da3db49a99dd65ec065b99b8146c6bd6bc5888/BZ-265320/bz265320.tgz
Comment 3 Shrikanth R K 2022-07-20 20:36:18 UTC
For the test case, retrieve the tarball of the config files from GitHub mentioned earlier. 
You should be able to recreate the problem by just running diff on those files,

FreeBSD-Bug-reports/BZ-265320/bz265320> time diff config_edb_p*
Comment 4 Allan Jude freebsd_committer freebsd_triage 2022-08-15 20:42:21 UTC
Assigning to Tom, who is working on this.
Comment 5 Tom Jones freebsd_committer freebsd_triage 2022-08-16 08:30:55 UTC
Hi,

This issue is hitting a limitation of the stone algorithm that BSD diff uses right now. The algorithm is performing an exponentially growing search looking for differences in the middle of the two sequences (files).

I have a wip implementation of the myers diff algorithm (with optimisations). The work was triggered by a similar case with 12MB intel hex files.

Tom