The issue was found when running diff on a set of files that are nearly identical (1865319 lines / ~76MB in size) except having differences in the first line and line 1865315. # uname -a FreeBSD BSD14vm 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n255938-326a8d3e085: Fri Jun 3 08:30:41 UTC 2022 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 The diff program ran for more than 3172.21 seconds (or even more) I stopped monitoring with CTRL-T after some point but here is the capture... root@BSD14vm:# ls -ltrhG total 469632 -rw-r--r-- 1 root wheel 76M Jul 19 11:46 config_edb_pre -rw-r--r-- 1 root wheel 76M Jul 19 11:46 config_edb_post root@BSD14vm:# diff config_edb_p* load: 0.21 cmd: diff 972 [running] 3.60r 3.44u 0.15s 29% 40480k load: 0.21 cmd: diff 972 [running] 4.90r 4.74u 0.15s 39% 40576k load: 1.10 cmd: diff 972 [running] 156.52r 156.31u 0.15s 100% 46564k load: 1.19 cmd: diff 972 [running] 462.20r 461.86u 0.18s 100% 49140k load: 1.19 cmd: diff 972 [running] 462.64r 462.30u 0.18s 100% 49140k load: 1.19 cmd: diff 972 [running] 462.85r 462.51u 0.18s 100% 49140k load: 1.09 cmd: diff 972 [running] 503.74r 503.39u 0.18s 100% 49140k load: 1.09 cmd: diff 972 [running] 503.96r 503.60u 0.18s 100% 49140k load: 1.09 cmd: diff 972 [running] 504.14r 503.78u 0.18s 100% 49140k load: 1.09 cmd: diff 972 [running] 504.30r 503.94u 0.18s 100% 49140k load: 1.07 cmd: diff 972 [running] 766.07r 765.61u 0.21s 98% 51196k load: 1.17 cmd: diff 972 [running] 919.11r 918.61u 0.21s 100% 51196k load: 1.13 cmd: diff 972 [running] 1014.49r 1013.96u 0.22s 100% 53252k load: 1.13 cmd: diff 972 [running] 1014.75r 1014.21u 0.22s 100% 53252k load: 1.13 cmd: diff 972 [running] 1015.05r 1014.52u 0.22s 98% 53252k load: 1.13 cmd: diff 972 [running] 1015.31r 1014.77u 0.22s 100% 53252k load: 1.13 cmd: diff 972 [running] 1015.49r 1014.96u 0.22s 100% 53252k load: 1.04 cmd: diff 972 [running] 1087.73r 1087.17u 0.22s 100% 53252k load: 1.13 cmd: diff 972 [running] 1266.02r 1265.41u 0.22s 98% 53252k load: 1.04 cmd: diff 972 [running] 1468.09r 1467.41u 0.22s 100% 59176k load: 1.02 cmd: diff 972 [running] 1499.33r 1498.63u 0.22s 100% 59176k load: 1.08 cmd: diff 972 [running] 2095.98r 2095.10u 0.23s 100% 61232k load: 1.43 cmd: diff 972 [running] 2146.53r 2145.62u 0.23s 100% 61232k load: 1.39 cmd: diff 972 [running] 2350.22r 2349.22u 0.23s 100% 63288k load: 1.24 cmd: diff 972 [running] 2874.17r 2873.05u 0.25s 100% 67392k load: 1.12 cmd: diff 972 [running] 2932.42r 2931.28u 0.25s 100% 67392k load: 1.12 cmd: diff 972 [running] 2932.67r 2931.52u 0.25s 100% 67392k load: 1.10 cmd: diff 972 [running] 2980.07r 2978.90u 0.26s 98% 85608k load: 1.19 cmd: diff 972 [running] 3173.49r 3172.21u 0.26s 100% 85608k ... 1c1 < <configuration changed-seconds="1658220346" changed-localtime="2022-07-19 10:45:46 CEST"> --- > <configuration changed-seconds="1658219983" changed-localtime="2022-07-19 10:39:43 CEST"> 1865315c1865315 < <description>2022-07-19 10:43:02.391784</description> --- > <description>2022-07-18 16:19:44.594290</description> As you see there is difference on line 1 and line 1865315 for the entire file. If I use 'head' or 'tail' and retain only one of the difference in the file 'diff' is able to complete in less than a second. The below files have only the diff on line 1 #head -n 1865310 config_edb_pre > p1_headn1865310 #head -n 1865310 config_edb_post > p2_headn1865310 # time diff p1_headn1865310 p2_headn1865310 1c1 < <configuration changed-seconds="1658219983" changed-localtime="2022-07-19 10:39:43 CEST"> --- > <configuration changed-seconds="1658220346" changed-localtime="2022-07-19 10:45:46 CEST"> 0.83 real 0.74 user 0.09 sys The below files have only the diff on line 1865306 # tail -n 1865310 config_edb_pre > p1_tailn1865310 # tail -n 1865310 config_edb_post > p2_tailn1865310 # time diff p1_tailn1865310 p2_tailn1865310 1865306c1865306 < <description>2022-07-18 16:19:44.594290</description> --- > <description>2022-07-19 10:43:02.391784</description> 0.84 real 0.75 user 0.09 sys # which diff /usr/bin/diff # diff --version FreeBSD diff 20220309 # file config_edb_p* config_edb_post: ASCII text config_edb_pre: ASCII text
Can you provide some test case which reproduces the problem? If not I don't think this PR will get much attention.
The tarball for config files is 2.8M+ so could not attach it here, shoved it to my GitHub, retrieve it from here https://github.com/shrikanth07/FreeBSD-Bug-reports/blob/57da3db49a99dd65ec065b99b8146c6bd6bc5888/BZ-265320/bz265320.tgz
For the test case, retrieve the tarball of the config files from GitHub mentioned earlier. You should be able to recreate the problem by just running diff on those files, FreeBSD-Bug-reports/BZ-265320/bz265320> time diff config_edb_p*
Assigning to Tom, who is working on this.
Hi, This issue is hitting a limitation of the stone algorithm that BSD diff uses right now. The algorithm is performing an exponentially growing search looking for differences in the middle of the two sequences (files). I have a wip implementation of the myers diff algorithm (with optimisations). The work was triggered by a similar case with 12MB intel hex files. Tom