I tried to compare 2 large files: > $ diff broadinstitute-gatk-4.6.0.0_GH0.tar.good.hex broadinstitute-gatk-4.6.0.0_GH0.tar.bad.hex | more > diff: xreallocarray: allocating 18446744073509292634 * 12 bytes: Cannot allocate memory File sizes: > $ ls -l broadinstitute-gatk-4.6.0.0_GH0.tar.good.hex broadinstitute-gatk-4.6.0.0_GH0.tar.bad.hex > -rw-r--r-- 1 yuri users 32676484755 Jul 18 21:07 broadinstitute-gatk-4.6.0.0_GH0.tar.bad.hex > -rw-r--r-- 1 yuri users 32676397537 Jul 18 20:06 broadinstitute-gatk-4.6.0.0_GH0.tar.good.hex These files have some differences in the middle. The failure is most likely due to an overflow. diff(1) should be able to compare 32GB files without the need to allocate much more memory than file sizes.
Maybe Dag-Erling would be willing to take a look? I see several places where we pass 32-bit ints to xreallocarray(), so presumably we are truncating the allocation size somewhere.
I can fix the integer overflow, but I can't easily make diff work with files that big. Its running time is proportional to the square of the total number of lines in its inputs, and it consumes memory roughly equivalent to the sum of the sizes of its input plus a nontrivial multiple of the total number of lines in its inputs. The best I can do is error out if the files are too large.
(In reply to Dag-Erling Smørgrav from comment #2) Hi Dag-Erling, Please consider the special case when the files only have one very small difference in the middle. In this case diff's memory and CPU requirements are linear in file size, but it still exhibits the problem. There is no need for diff to error out on file size. It is sufficient to: 1. Make it work correctly when actual differences are very small. 2. Fail with a correct error message when memory can't be allocated. Thanks, Yuri
If you're not happy with my patch, you are more than welcome to submit your own. I think you'll find that it's slightly more involved than you expected, especially in FreeBSD 15.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=9317242469f1ca682626d9806f8caf65d143c09a commit 9317242469f1ca682626d9806f8caf65d143c09a Author: Dag-Erling Smørgrav <des@FreeBSD.org> AuthorDate: 2024-07-29 14:02:29 +0000 Commit: Dag-Erling Smørgrav <des@FreeBSD.org> CommitDate: 2024-07-29 14:02:29 +0000 diff: Fix integer overflow. The legacy Stone algorithm uses `int` to represent line numbers, array indices, and array lengths. If given inputs approaching `INT_MAX` lines, it would overflow and attempt to allocate ridiculously large amounts of memory. To avoid this without penalizing non-pathological inputs, switch a few variables to `size_t` and add checks while and immediately after reading both inputs. MFC after: 3 days PR: 280371 Sponsored by: Klara, Inc. Reviewed by: allanjude Differential Revision: https://reviews.freebsd.org/D46169 usr.bin/diff/diffreg.c | 45 ++++++++++++++++++++++++--------------------- 1 file changed, 24 insertions(+), 21 deletions(-)
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=ea68175b07e0cd8e51249d2858749481352aa553 commit ea68175b07e0cd8e51249d2858749481352aa553 Author: Dag-Erling Smørgrav <des@FreeBSD.org> AuthorDate: 2024-07-29 14:02:29 +0000 Commit: Dag-Erling Smørgrav <des@FreeBSD.org> CommitDate: 2024-08-01 16:15:57 +0000 diff: Fix integer overflow. The legacy Stone algorithm uses `int` to represent line numbers, array indices, and array lengths. If given inputs approaching `INT_MAX` lines, it would overflow and attempt to allocate ridiculously large amounts of memory. To avoid this without penalizing non-pathological inputs, switch a few variables to `size_t` and add checks while and immediately after reading both inputs. MFC after: 3 days PR: 280371 Sponsored by: Klara, Inc. Reviewed by: allanjude Differential Revision: https://reviews.freebsd.org/D46169 (cherry picked from commit 9317242469f1ca682626d9806f8caf65d143c09a) usr.bin/diff/diffreg.c | 45 ++++++++++++++++++++++++--------------------- 1 file changed, 24 insertions(+), 21 deletions(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=c14665b4aee7e1594467bac4a9d9cc5c66173975 commit c14665b4aee7e1594467bac4a9d9cc5c66173975 Author: Dag-Erling Smørgrav <des@FreeBSD.org> AuthorDate: 2024-07-29 14:02:29 +0000 Commit: Dag-Erling Smørgrav <des@FreeBSD.org> CommitDate: 2024-08-01 16:46:19 +0000 diff: Fix integer overflow. The legacy Stone algorithm uses `int` to represent line numbers, array indices, and array lengths. If given inputs approaching `INT_MAX` lines, it would overflow and attempt to allocate ridiculously large amounts of memory. To avoid this without penalizing non-pathological inputs, switch a few variables to `size_t` and add checks while and immediately after reading both inputs. MFC after: 3 days PR: 280371 Sponsored by: Klara, Inc. Reviewed by: allanjude Differential Revision: https://reviews.freebsd.org/D46169 (cherry picked from commit 9317242469f1ca682626d9806f8caf65d143c09a) usr.bin/diff/diffreg.c | 45 ++++++++++++++++++++++++--------------------- 1 file changed, 24 insertions(+), 21 deletions(-)