Bug 280371 - diff(1) attempts to allocate ~10^20 bytes while comparing 32GB text files
Summary: diff(1) attempts to allocate ~10^20 bytes while comparing 32GB text files
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: 14.1-STABLE
Hardware: Any Any
: --- Affects Some People
Assignee: Dag-Erling Smørgrav
URL: https://reviews.freebsd.org/D46169
Keywords:
Depends on:
Blocks:
 
Reported: 2024-07-19 15:53 UTC by Yuri Victorovich
Modified: 2024-08-01 16:49 UTC (History)
2 users (show)

See Also:
des: mfc-stable14+
des: mfc-stable13+


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Yuri Victorovich freebsd_committer freebsd_triage 2024-07-19 15:53:21 UTC
I tried to compare 2 large files:
> $ diff  broadinstitute-gatk-4.6.0.0_GH0.tar.good.hex broadinstitute-gatk-4.6.0.0_GH0.tar.bad.hex | more
> diff: xreallocarray: allocating 18446744073509292634 * 12 bytes: Cannot allocate memory

File sizes:
> $ ls -l broadinstitute-gatk-4.6.0.0_GH0.tar.good.hex broadinstitute-gatk-4.6.0.0_GH0.tar.bad.hex
> -rw-r--r--  1 yuri users 32676484755 Jul 18 21:07 broadinstitute-gatk-4.6.0.0_GH0.tar.bad.hex
> -rw-r--r--  1 yuri users 32676397537 Jul 18 20:06 broadinstitute-gatk-4.6.0.0_GH0.tar.good.hex

These files have some differences in the middle.

The failure is most likely due to an overflow.

diff(1) should be able to compare 32GB files without the need to allocate much more memory than file sizes.
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2024-07-26 17:22:18 UTC
Maybe Dag-Erling would be willing to take a look?  I see several places where we pass 32-bit ints to xreallocarray(), so presumably we are truncating the allocation size somewhere.
Comment 2 Dag-Erling Smørgrav freebsd_committer freebsd_triage 2024-07-27 17:17:32 UTC
I can fix the integer overflow, but I can't easily make diff work with files that big.  Its running time is proportional to the square of the total number of lines in its inputs, and it consumes memory roughly equivalent to the sum of the sizes of its input plus a nontrivial multiple of the total number of lines in its inputs.  The best I can do is error out if the files are too large.
Comment 3 Yuri Victorovich freebsd_committer freebsd_triage 2024-07-27 17:21:56 UTC
(In reply to Dag-Erling Smørgrav from comment #2)

Hi Dag-Erling,

Please consider the special case when the files only have one very small difference in the middle.

In this case diff's memory and CPU requirements are linear in file size, but it still exhibits the problem.

There is no need for diff to error out on file size.

It is sufficient to:
1. Make it work correctly when actual differences are very small.
2. Fail with a correct error message when memory can't be allocated.


Thanks,
Yuri
Comment 4 Dag-Erling Smørgrav freebsd_committer freebsd_triage 2024-07-27 17:53:41 UTC
If you're not happy with my patch, you are more than welcome to submit your own.  I think you'll find that it's slightly more involved than you expected, especially in FreeBSD 15.
Comment 5 commit-hook freebsd_committer freebsd_triage 2024-07-29 14:03:48 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9317242469f1ca682626d9806f8caf65d143c09a

commit 9317242469f1ca682626d9806f8caf65d143c09a
Author:     Dag-Erling Smørgrav <des@FreeBSD.org>
AuthorDate: 2024-07-29 14:02:29 +0000
Commit:     Dag-Erling Smørgrav <des@FreeBSD.org>
CommitDate: 2024-07-29 14:02:29 +0000

    diff: Fix integer overflow.

    The legacy Stone algorithm uses `int` to represent line numbers, array
    indices, and array lengths.  If given inputs approaching `INT_MAX` lines,
    it would overflow and attempt to allocate ridiculously large amounts of
    memory.  To avoid this without penalizing non-pathological inputs,
    switch a few variables to `size_t` and add checks while and immediately
    after reading both inputs.

    MFC after:      3 days
    PR:             280371
    Sponsored by:   Klara, Inc.
    Reviewed by:    allanjude
    Differential Revision:  https://reviews.freebsd.org/D46169

 usr.bin/diff/diffreg.c | 45 ++++++++++++++++++++++++---------------------
 1 file changed, 24 insertions(+), 21 deletions(-)
Comment 6 commit-hook freebsd_committer freebsd_triage 2024-08-01 16:17:08 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=ea68175b07e0cd8e51249d2858749481352aa553

commit ea68175b07e0cd8e51249d2858749481352aa553
Author:     Dag-Erling Smørgrav <des@FreeBSD.org>
AuthorDate: 2024-07-29 14:02:29 +0000
Commit:     Dag-Erling Smørgrav <des@FreeBSD.org>
CommitDate: 2024-08-01 16:15:57 +0000

    diff: Fix integer overflow.

    The legacy Stone algorithm uses `int` to represent line numbers, array
    indices, and array lengths.  If given inputs approaching `INT_MAX` lines,
    it would overflow and attempt to allocate ridiculously large amounts of
    memory.  To avoid this without penalizing non-pathological inputs,
    switch a few variables to `size_t` and add checks while and immediately
    after reading both inputs.

    MFC after:      3 days
    PR:             280371
    Sponsored by:   Klara, Inc.
    Reviewed by:    allanjude
    Differential Revision:  https://reviews.freebsd.org/D46169

    (cherry picked from commit 9317242469f1ca682626d9806f8caf65d143c09a)

 usr.bin/diff/diffreg.c | 45 ++++++++++++++++++++++++---------------------
 1 file changed, 24 insertions(+), 21 deletions(-)
Comment 7 commit-hook freebsd_committer freebsd_triage 2024-08-01 16:47:15 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=c14665b4aee7e1594467bac4a9d9cc5c66173975

commit c14665b4aee7e1594467bac4a9d9cc5c66173975
Author:     Dag-Erling Smørgrav <des@FreeBSD.org>
AuthorDate: 2024-07-29 14:02:29 +0000
Commit:     Dag-Erling Smørgrav <des@FreeBSD.org>
CommitDate: 2024-08-01 16:46:19 +0000

    diff: Fix integer overflow.

    The legacy Stone algorithm uses `int` to represent line numbers, array
    indices, and array lengths.  If given inputs approaching `INT_MAX` lines,
    it would overflow and attempt to allocate ridiculously large amounts of
    memory.  To avoid this without penalizing non-pathological inputs,
    switch a few variables to `size_t` and add checks while and immediately
    after reading both inputs.

    MFC after:      3 days
    PR:             280371
    Sponsored by:   Klara, Inc.
    Reviewed by:    allanjude
    Differential Revision:  https://reviews.freebsd.org/D46169

    (cherry picked from commit 9317242469f1ca682626d9806f8caf65d143c09a)

 usr.bin/diff/diffreg.c | 45 ++++++++++++++++++++++++---------------------
 1 file changed, 24 insertions(+), 21 deletions(-)