Bug 230792 - sort -R, --random-source issues
Summary: sort -R, --random-source issues
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-21 02:45 UTC by Conrad Meyer
Modified: 2019-04-13 04:46 UTC (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Conrad Meyer freebsd_committer freebsd_triage 2018-08-21 02:45:35 UTC
sort(1)'s --random-source has major problems.

It attempts to MD5 the entire[1] provided file to seed its RNG.  As a special case (and the default), if the path matches "/dev/random" exactly, it "only" fetches 1024 bytes to MD5 for its seed.

[1]: It just loops until read(2) returns EOF.  This may never happen for unix sockets or devices like /dev/urandom.

So the first and most obvious bug fix is that --random-source cannot be reading megabytes out of /dev/urandom and it is unlikely that reading megabytes out of random regular files is beneficial either.

My suggestions from least controversial to more controversial:

1. Check for random via st_ino/st_rdev rather than path.  This will match urandom or non-absolute path spellings of the random device.
   a.  In this case, only read 16-32 bytes — not 1024.  32 bytes from /dev/[u]random is more than the context size of MD5 (16 bytes).

2. Reject non-regular files other than /dev/random entirely.

3. Reject regular files larger than 1024 bytes.

4. Don't MD5 the output of the random device.  It doesn't give any benefit.


------------------------------------------------


-R's implementation leaves a lot to be desired too.

The implementation is:

1. An initial MD5 context, "md5_ctx", is seeded with the ASCII digest from the random source file.

2. A comparator, "randomcoll", copies that digest twice and concatenates to each the two lines to be compared.

3. The two MD5s are finalized and the digests are converted to (!)malloced ASCII strings via MD5End().

4. If MD5End() hit ENOMEM, *that is incorporated into the sort order*.

5. Otherwise, the digests are compared *with strcmp()*. and the result returned.

The goal is to provide "random" ordering, but stable results for repeated comparisons of the same pair of lines.

Major problems:

0. ENOMEM should cause immediate exit and abort — not affect sort order!

1. There is no need to malloc out an ASCII digest instead of just memcmping the binary digest.

2. Even so, this (primitive keyed hash) has got to be one of the most expensive possible ways of producing a repeatable ordering between keys.  A better method might be to pre-fill an NxN lookup table of line number pairs with e.g. "(int)arc4random_uniform(3) - 1" (or some keyed random function if --random-source and a regular file was provided) and simply reference that LUT from the comparator.  (That's a literal 3, not the manual page section.)

That's what I've got off the top of my head.  There are probably other issues.
Comment 1 Conrad Meyer freebsd_committer freebsd_triage 2018-08-21 02:47:26 UTC
Reported here initially: https://lists.freebsd.org/pipermail/freebsd-hackers/2018-August/053152.html
Comment 2 commit-hook freebsd_committer freebsd_triage 2019-04-04 20:27:23 UTC
A commit references this bug:

Author: cem
Date: Thu Apr  4 20:27:14 UTC 2019
New revision: 345891
URL: https://svnweb.freebsd.org/changeset/base/345891

Log:
  sort(1): randomcoll: Don't sort on ENOMEM

  PR:		230792 (1/many)
  Sponsored by:	Dell EMC Isilon

Changes:
  head/usr.bin/sort/coll.c
Comment 3 commit-hook freebsd_committer freebsd_triage 2019-04-04 23:33:02 UTC
A commit references this bug:

Author: cem
Date: Thu Apr  4 23:32:27 UTC 2019
New revision: 345896
URL: https://svnweb.freebsd.org/changeset/base/345896

Log:
  sort(1): randomcoll: Skip the memory allocation entirely

  There's no reason to order based on strcmp of ASCII digests instead of
  memcmp of the raw digests.

  While here, remove collision fallback.  If you collide two MD5s, they're
  probably the same string anyway.  If robustness against MD5 collisions is
  desired, maybe we shouldn't use MD5.

  None of the behavior of sort -R is specified by POSIX, so we're free to
  implement this however we like.  E.g., using a 128-bit counter and block cipher
  to generate unique indices for each line of input.

  PR:		230792 (2/many)
  Relnotes:	This will change the sort order for a given dataset with a
  		given seed.  Other similarly breaking changes are planned.
  Sponsored by:	Dell EMC Isilon

Changes:
  head/usr.bin/sort/coll.c
Comment 4 commit-hook freebsd_committer freebsd_triage 2019-04-11 05:09:31 UTC
A commit references this bug:

Author: cem
Date: Thu Apr 11 05:08:50 UTC 2019
New revision: 346116
URL: https://svnweb.freebsd.org/changeset/base/346116

Log:
  sort(1): Simplify and bound random seeding

  Bound input file processing length to avoid the issue reported in [1].  For
  simplicity, only allow regular file and character device inputs.  For
  character devices, only allow /dev/random (and /dev/urandom symblink).

  32 bytes of random is perfectly sufficient to seed MD5; we don't need any
  more.  Users that want to use large files as seeds are encouraged to truncate
  those files down to an appropriate input file via tools like sha256(1).

  (This does not change the sort algorithm of sort -R.)

  [1]: https://lists.freebsd.org/pipermail/freebsd-hackers/2018-August/053152.html

  PR:		230792
  Reported by:	Ali Abdallah <aliovx AT gmail.com>
  Relnotes:	yes

Changes:
  head/usr.bin/sort/sort.c
Comment 5 commit-hook freebsd_committer freebsd_triage 2019-04-13 04:42:30 UTC
A commit references this bug:

Author: cem
Date: Sat Apr 13 04:42:18 UTC 2019
New revision: 346175
URL: https://svnweb.freebsd.org/changeset/base/346175

Log:
  sort(1): Memoize MD5 computation to reduce repeated computation

  Experimentally, reduces sort -R time of a 148160 line corpus from about
  3.15s to about 0.93s on this particular system.

  There's probably room for improvement using some digest other than md5, but
  I don't want to look at sort(1) anymore.  Some discussion of other possible
  improvements in the Test Plan section of the Differential.

  PR:		230792
  Reviewed by:	jhb (earlier version)
  Differential Revision:	https://reviews.freebsd.org/D19885

Changes:
  head/usr.bin/sort/coll.c
  head/usr.bin/sort/coll.h
  head/usr.bin/sort/sort.c
Comment 6 Conrad Meyer freebsd_committer freebsd_triage 2019-04-13 04:46:33 UTC
Most issues addressed.  Didn't try to improve randomcoll over existing MD5 method, other than just caching partial hash calculations to reduce repeated computation of the same value.  Some suggestions in https://reviews.freebsd.org/D19885 if anyone wants to tackle that larger project.