Summary: | sort -R, --random-source issues | ||
---|---|---|---|
Product: | Base System | Reporter: | Conrad Meyer <cem> |
Component: | bin | Assignee: | freebsd-bugs (Nobody) <bugs> |
Status: | Closed FIXED | ||
Severity: | Affects Only Me | CC: | ali.abdallah, markj |
Priority: | --- | ||
Version: | CURRENT | ||
Hardware: | Any | ||
OS: | Any |
Description
Conrad Meyer
![]() ![]() Reported here initially: https://lists.freebsd.org/pipermail/freebsd-hackers/2018-August/053152.html A commit references this bug: Author: cem Date: Thu Apr 4 20:27:14 UTC 2019 New revision: 345891 URL: https://svnweb.freebsd.org/changeset/base/345891 Log: sort(1): randomcoll: Don't sort on ENOMEM PR: 230792 (1/many) Sponsored by: Dell EMC Isilon Changes: head/usr.bin/sort/coll.c A commit references this bug: Author: cem Date: Thu Apr 4 23:32:27 UTC 2019 New revision: 345896 URL: https://svnweb.freebsd.org/changeset/base/345896 Log: sort(1): randomcoll: Skip the memory allocation entirely There's no reason to order based on strcmp of ASCII digests instead of memcmp of the raw digests. While here, remove collision fallback. If you collide two MD5s, they're probably the same string anyway. If robustness against MD5 collisions is desired, maybe we shouldn't use MD5. None of the behavior of sort -R is specified by POSIX, so we're free to implement this however we like. E.g., using a 128-bit counter and block cipher to generate unique indices for each line of input. PR: 230792 (2/many) Relnotes: This will change the sort order for a given dataset with a given seed. Other similarly breaking changes are planned. Sponsored by: Dell EMC Isilon Changes: head/usr.bin/sort/coll.c A commit references this bug: Author: cem Date: Thu Apr 11 05:08:50 UTC 2019 New revision: 346116 URL: https://svnweb.freebsd.org/changeset/base/346116 Log: sort(1): Simplify and bound random seeding Bound input file processing length to avoid the issue reported in [1]. For simplicity, only allow regular file and character device inputs. For character devices, only allow /dev/random (and /dev/urandom symblink). 32 bytes of random is perfectly sufficient to seed MD5; we don't need any more. Users that want to use large files as seeds are encouraged to truncate those files down to an appropriate input file via tools like sha256(1). (This does not change the sort algorithm of sort -R.) [1]: https://lists.freebsd.org/pipermail/freebsd-hackers/2018-August/053152.html PR: 230792 Reported by: Ali Abdallah <aliovx AT gmail.com> Relnotes: yes Changes: head/usr.bin/sort/sort.c A commit references this bug: Author: cem Date: Sat Apr 13 04:42:18 UTC 2019 New revision: 346175 URL: https://svnweb.freebsd.org/changeset/base/346175 Log: sort(1): Memoize MD5 computation to reduce repeated computation Experimentally, reduces sort -R time of a 148160 line corpus from about 3.15s to about 0.93s on this particular system. There's probably room for improvement using some digest other than md5, but I don't want to look at sort(1) anymore. Some discussion of other possible improvements in the Test Plan section of the Differential. PR: 230792 Reviewed by: jhb (earlier version) Differential Revision: https://reviews.freebsd.org/D19885 Changes: head/usr.bin/sort/coll.c head/usr.bin/sort/coll.h head/usr.bin/sort/sort.c Most issues addressed. Didn't try to improve randomcoll over existing MD5 method, other than just caching partial hash calculations to reduce repeated computation of the same value. Some suggestions in https://reviews.freebsd.org/D19885 if anyone wants to tackle that larger project. |