The stdlib qsort() code in FreeBSD is largely taken from the paper "Engineering a Sort Function" by Bentley & McIlroy (1993). In that paper, the authors employ a clever scheme for selecting a method of swapping elements. Basically it goes like this:

1. If the base pointer or element size is not aligned to sizeof(long), swap byte-by-byte; else
2. if the element size is exactly sizeof(long), perform a single inline swap; else
3. perform a long-by-long swap.

The implicit assumption here is that if the base pointer or element size isn't aligned to sizeof(long), one can't do any better than a char-by-char swap. While this seems to be true on most 32-bit systems, it is not the case on at least some 64-bit systems, in particular x86-64. Consequently, sorting data that is 32-bit aligned (but not 64-bit aligned) is much slower on 64-bit systems than on 32-bit systems: in 32-bit mode the qsort() logic uses a long-by-long swap (since the data is aligned), while in 64-bit mode qsort() drops down to a char-by-char swap.

It is true that most workloads on 64-bit systems will be 64-bit aligned. However, it is fairly common for 64-bit code to need to process binary data that wasn't generated on a 64-bit system and hence may not be 64-bit aligned. Because this is fairly common, it could be a decent "win" for qsort() to support fast swapping for such 32-bit aligned workloads.

In my testing, sorting 64 MB worth of 100-byte records (each with a 12-byte key) took 1.8x as long on a 64-bit system as it did on a 32-bit system, with identical hardware. With a patched qsort the performance was identical in both the 32-bit and 64-bit versions of the code.

The patch is written such that if sizeof(long) == sizeof(int) it acts exactly like the current version. The additional swap behavior is only enabled when sizeof(long) > sizeof(int). The extra overhead from the sizeof(int) alignment check was negligible. Given the way SWAPINIT is structured, there is no additional overhead whatsoever when the data is 64-bit aligned. The only time additional overhead is incurred is when the data is NOT 64-bit aligned, in which case the extra alignment check is quite likely to provide a significant speedup.

Fix: See attached patch modifying /src/lib/libc/stdlib/qsort.c

Patch attached with submission follows:

How-To-Repeat: Sort records of size 8*n+4 bytes from within 64-bit code. Then sort the same data from within 32-bit code. The 64-bit version should take approximately 1.8x as long to execute.
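For reference, the swap machinery this report is talking about, as it appears in the pre-patch qsort.c, is approximately the following (excerpted; swaptype 0 = single inline long swap, 1 = long-by-long loop, 2 = char-by-char loop):

    #define swapcode(TYPE, parmi, parmj, n) {	\
    	long i = (n) / sizeof(TYPE);		\
    	TYPE *pi = (TYPE *)(parmi);		\
    	TYPE *pj = (TYPE *)(parmj);		\
    	do {					\
    		TYPE t = *pi;			\
    		*pi++ = *pj;			\
    		*pj++ = t;			\
    	} while (--i > 0);			\
    }

    #define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) || \
    	es % sizeof(long) ? 2 : es == sizeof(long) ? 0 : 1;

A 32-bit aligned base with a 32-bit aligned element size fails the first test whenever sizeof(long) is 8, so such data is handed to the char loop even though an int loop would be legal.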
It occurs to me that a simpler solution would be to just do int-by-int swaps in all cases when the base address and element size are int-aligned, char-by-char swaps when they're not, and an inline long swap when the base address is long-aligned and the element size is exactly sizeof(long). It's notable that BSD bcopy() only does int-by-int copying and makes no effort to do long-by-long copying when the data would permit it.

Here's a second patch that makes the above change. On systems where sizeof(int) == sizeof(long) this version becomes identical to the current version. When sizeof(long) > sizeof(int), bulk swapping happens int-by-int instead of long-by-long. Inline swaps are still used when the base is long-aligned and the element size is sizeof(long). The patch (decoded from the attached uuencoded qsort.c.patch):

--- qsort.c	2009-06-19 02:26:17.000000000 +0000
+++ qsort.c.patched	2009-06-19 03:07:09.000000000 +0000
@@ -59,8 +59,9 @@
         } while (--i > 0);				\
 }
 
-#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) || \
-	es % sizeof(long) ? 2 : es == sizeof(long)? 0 : 1;
+#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(int) || \
+	es % sizeof(int) ? 2 : ((char *)a - (char *)0) % sizeof(long) ? 1 : \
+	es != sizeof(long);
 
 static inline void
 swapfunc(a, b, n, swaptype)
@@ -68,7 +69,7 @@
 	int n, swaptype;
 {
 	if(swaptype <= 1)
-		swapcode(long, a, b, n)
+		swapcode(int, a, b, n)
 	else
 		swapcode(char, a, b, n)
 }
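Reading the new SWAPINIT back out, the swaptype values it computes work out as follows (my annotation, not text from the submission):

    /*
     * swaptype under the patched SWAPINIT:
     *   2 -- base pointer or es not aligned to sizeof(int):
     *        char-by-char swaps
     *   1 -- int-aligned, but the base is not long-aligned or
     *        es != sizeof(long): bulk swaps run int-by-int
     *   0 -- base long-aligned and es == sizeof(long):
     *        single inline long swap
     */

Since swapfunc() tests swaptype <= 1 before calling swapcode(int, ...), every bulk swap now moves sizeof(int) bytes at a time, even for fully long-aligned data; that trade-off is what the review comments below pick up on.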
Created attachment 148195
Updated diff

This is the same (latest) version, but unencoded for easier examination. Not the appropriate test, but it does cause a small but measurable improvement (user time) with: /usr/src/tools/regression/lib/libc/stdlib/test-qsort
I would like more feedback before committing this, so I opened a revision for it in our Phabricator instance: https://reviews.freebsd.org/D1752
The patch assumes that sizeof(long) % sizeof(int) == 0 everywhere. Machine-independent code should not make such an assumption. There must be two completely separate calculations, one for int and one for long; i.e., the current code for long must be duplicated for int. Then a final pass should be added to determine which version wins (if equal, long first).
And the bug: on amd64 the patch always copies by 32 bits instead of 64 bits, due to the replacement of swapcode(long, a, b, n) with swapcode(int, a, b, n). It is possible that copying by 32 bits is faster on amd64 (I didn't check), but that is very architecture-dependent and must be ifdefed. If we are going to be architecture-dependent and optimize just the i386 and amd64 cases, the code should use int32_t and int64_t with proper architecture-related ifdefs.
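If the width choice were made architecture-dependent as suggested, a hypothetical sketch (the names and structure here are mine, not from any posted patch) might look like:

    #include <stdint.h>

    #if defined(__amd64__)
    typedef int64_t	swap_word;	/* or int32_t, if 32-bit copies measure faster */
    #elif defined(__i386__)
    typedef int32_t	swap_word;
    #else
    typedef long	swap_word;	/* machine-independent fallback */
    #endif
    /* swapcode(swap_word, a, b, n) would then replace swapcode(long, a, b, n). */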
What must be done in machine-independent code is really simple:

1) #define SWAPINIT_long(a, es) swaptype_long = ...
   #define SWAPINIT_int(a, es) swaptype_int = ...

2) Call both SWAPINITs instead of one.

3) if (swaptype_long <= 1)
       swapcode(long, a, b, n)
   else if (swaptype_int <= 1)
       swapcode(int, a, b, n)
   else
       swapcode(char, a, b, n)
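One way those three steps might fit together, assuming the int calculation simply duplicates the existing long expression via token pasting (the SWAPINIT(TYPE, ...) spelling and the ANSI prototype are my own sketch; swapcode() is the existing macro from qsort.c):

    #define SWAPINIT(TYPE, a, es) swaptype_ ## TYPE =	\
    	((char *)a - (char *)0) % sizeof(TYPE) ||	\
    	es % sizeof(TYPE) ? 2 : es == sizeof(TYPE) ? 0 : 1;

    static inline void
    swapfunc(char *a, char *b, int n, int swaptype_long, int swaptype_int)
    {
    	if (swaptype_long <= 1)
    		swapcode(long, a, b, n)		/* long-aligned data */
    	else if (swaptype_int <= 1)
    		swapcode(int, a, b, n)		/* only int-aligned */
    	else
    		swapcode(char, a, b, n)		/* no usable alignment */
    }

qsort() would call SWAPINIT(long, a, es) and SWAPINIT(int, a, es) before partitioning, so the long version still wins whenever both qualify ("if equal, long first").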
4) The swap(a, b) macro needs to be fixed too:

   #define swap(a, b)				\
	if (swaptype_long == 0) {		\
		long t = *(long *)(a);		\
		*(long *)(a) = *(long *)(b);	\
		*(long *)(b) = t;		\
	} else if (swaptype_int == 0) {		\
		int t = *(int *)(a);		\
		*(int *)(a) = *(int *)(b);	\
		*(int *)(b) = t;		\
	} else					\
		swapfunc(a, b, es, swaptype_long, swaptype_int)
Created attachment 152798
Updated patch by Andrey Chernov

I updated the testing utility with 4096 datapoints of random data and did some very basic testing with an updated patch from Andrey:

$ time ./test-heapsort
1..1
ok 1 - heapsort
       27.83 real        27.61 user         0.00 sys

$ time ./test-mergesort
1..1
ok 1 - mergesort
       27.02 real        26.83 user         0.00 sys

$ time ./test-qsort (old)
1..1
ok 1 - qsort
       27.07 real        26.94 user         0.00 sys
...
$ time ./test-qsort (new)
1..1
ok 1 - qsort
       26.94 real        26.81 user         0.00 sys

IMHO, the performance improvements are measurable, and even if they may not seem huge, they are significant enough to make qsort preferable over heapsort and mergesort (at least for my particular dataset).
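The regression test above is not alignment-sensitive, so for a check closer to the originally reported workload (records of 8*n+4 bytes sorted on a 12-byte key), a small standalone benchmark along these lines would exercise the int-aligned path. The +4 base offset (which assumes malloc() returns at least 8-byte-aligned memory) and the random fill are illustrative choices of mine, not taken from the PR:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define RECSZ	100				/* 8*12 + 4 bytes per record */
    #define NREC	(64 * 1024 * 1024 / RECSZ)	/* ~64 MB, as in the report */

    static int
    cmp(const void *a, const void *b)
    {
    	return (memcmp(a, b, 12));	/* sort on a 12-byte key */
    }

    int
    main(void)
    {
    	char *buf, *base;
    	size_t i;
    	clock_t t0;

    	if ((buf = malloc((size_t)NREC * RECSZ + 8)) == NULL)
    		return (1);
    	base = buf + 4;			/* int-aligned, but not long-aligned */

    	srandom(1);
    	for (i = 0; i < (size_t)NREC * RECSZ; i++)
    		base[i] = random() & 0xff;

    	t0 = clock();
    	qsort(base, NREC, RECSZ, cmp);
    	printf("sorted %d records in %.2f s\n", NREC,
    	    (double)(clock() - t0) / CLOCKS_PER_SEC);
    	free(buf);
    	return (0);
    }

Built both 32-bit and 64-bit against an unpatched libc, the 64-bit binary should show roughly the 1.8x slowdown described in the original report.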
Any objection to committing the patch?
A commit references this bug:

Author: pfg
Date: Thu Mar 5 17:00:40 UTC 2015
New revision: 279663
URL: https://svnweb.freebsd.org/changeset/base/279663

Log:
  qsort(3): enhance to handle 32-bit aligned data on 64-bit systems

  Implement a small enhancement to the original qsort implementation:
  if the data is 32-bit aligned, we can side-step the long type version
  and use int instead. The change brings a modest but significant
  improvement in 32-bit workloads.

  Relnotes:	yes
  PR:		135718
  Taken from:	ache

Changes:
  head/lib/libc/stdlib/qsort.c
Thanks for the idea, and to Andrey for the patch. I do not plan to MFC this; let's see it mature in -current.