Bug 135718 - [patch] enhance qsort(3) to properly handle 32-bit aligned data on 64-bit systems
Summary: [patch] enhance qsort(3) to properly handle 32-bit aligned data on 64-bit sys...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: Unspecified
Hardware: Any Any
: Normal Affects Only Me
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-06-18 21:30 UTC by jhoward
Modified: 2015-03-05 17:04 UTC (History)
3 users (show)

See Also:


Attachments
file.diff (1.06 KB, patch)
2009-06-18 21:30 UTC, jhoward
no flags Details | Diff
Updated diff (751 bytes, patch)
2014-10-11 16:40 UTC, Pedro F. Giffuni
no flags Details | Diff
Updated patch by Andrey Chernov (2.22 KB, patch)
2015-02-09 14:57 UTC, Pedro F. Giffuni
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description jhoward 2009-06-18 21:30:04 UTC
The stdlib qsort() code in FreeBSD is largely taken from the paper "Engineering a Sort Function" by Bentley & McIlroy (1993).

In that paper, the authors employ a clever a scheme for selecting a method of swapping elements.  Basically it goes like this:

1. If the base pointer or element size is not aligned to sizeof(long) then swap byte-by-byte, else
2. If the element size is exactly sizeof(long) then perform a single inline swap, else
3. perform a long-by-long swap.

The implicit assumption here is that if the base pointer or element size isn't aligned to sizeof(long) then one can't do any better than a char-by-char swap.

While this seems to be true on most 32-bit systems, it is not the case on at least some 64-bit systems.  In particular, x86-64.

Consequently, sorting data that is 32-bit aligned (but not 64-bit aligned) is much slower on 64-bit systems compared to 32-bit systems.  This is because in 32-bit mode the qsort() logic uses a long-by-long swap (since the data is aligned) while in 64-bit mode qsort() drops down to a char-by-char swap.

It is true that most workloads on 64-bit systems will be 64-bit aligned.  However, it is fairly common for 64-bit code to need to process binary data that wasn't generated on a 64-bit system and hence may not be 64-bit aligned.  Because this is fairly common it could be a decent "win" for qsort() to support fast swapping for such 32-bit aligned workloads.

In my testing, sorting 64 MB worth of 100-byte records (each with a 12-byte key) took 1.8x as long on a 64-bit system as it did on a 32-bit system, with identical hardware.  With a patched qsort the performance was identical on both 32-bit and 64-bit versions of the code.

The patch is written such that if sizeof(long) == sizeof(int) then it acts exactly like the current version.  The additional swap behavior is only enabled when sizeof(long) > sizeof(int).

The extra overhead from the sizeof(int) alignment check was negligible.  Given the way SWAPINIT is structured, there is no additional overhead whatsoever when the data is 64-bit aligned.  The only time additional overhead is incurred is when data is NOT 64-bit aligned, in which case the extra alignment check quite is likely to provide a significant speedup.

Fix: See attached patch modifying /src/lib/libc/stdlib/qsort.c

Patch attached with submission follows:
How-To-Repeat: Sort records of size 8*n+4 bytes from within 64-bit code.  Then sort the same data from within 32-bit code.  The 64-bit version should take approximately 1.8x as long to execute.
Comment 1 jhoward 2009-06-19 04:33:15 UTC
It occurs to me that a simpler solution would to just do int-by-int swaps in
all cases when the base addr and element size are int-aligned, char-by-char
swaps when they're not, and an inline long-swap when the base addr is
long-aligned and element size is exactly sizeof(long).

It's notable that BSD bcopy() only does int-by-int copying and makes no
effort to do long-by-long copying when the data would permit

Here's a second patch that makes the above change.

On systems where sizeof(int) = sizeof(long) this version becomes identical
to the current version.  When sizeof(long) > sizeof(int) bulk swapping
happens int-by-int instead of long-by-long.  Inline swaps are still used
when the base is long-aligned and element size = sizeof(long).

uuencoded:

begin 600 qsort.c.patch
M+2TM('%S;W)T+F,),C`P.2TP-BTQ.2`P,CHR-CHQ-RXP,#`P,#`P,#`@*S`P
M,#`**RLK('%S;W)T+F,N<&%T8VAE9`DR,#`Y+3`V+3$Y(#`S.C`W.C`Y+C`P
M,#`P,#`P,"`K,#`P,`I`0"`M-3DL."`K-3DL.2!`0`H@("`@("`@("!]('=H
M:6QE("@M+6D@/B`P*3L)"0D)7`H@?0H@"BTC9&5F:6YE(%-705!)3DE4*&$L
M(&5S*2!S=V%P='EP92`]("@H8VAA<B`J*6$@+2`H8VAA<B`J*3`I("4@<VEZ
M96]F*&QO;F<I('Q\(%P*+0EE<R`E('-I>F5O9BAL;VYG*2`_(#(@.B!E<R`]
M/2!S:7IE;V8H;&]N9RD_(#`@.B`Q.PHK(V1E9FEN92!35T%024Y)5"AA+"!E
M<RD@<W=A<'1Y<&4@/2`H*&-H87(@*BEA("T@*&-H87(@*BDP*2`E('-I>F5O
M9BAI;G0I('Q\(%P**PEE<R`E('-I>F5O9BAI;G0I(#\@,B`Z("@H8VAA<B`J
M*6$@+2`H8VAA<B`J*3`I("4@<VEZ96]F*&QO;F<I(#\@,2`Z(%P**PEE<R`A
M/2!S:7IE;V8H;&]N9RD["B`*('-T871I8R!I;FQI;F4@=F]I9`H@<W=A<&9U
M;F,H82P@8BP@;BP@<W=A<'1Y<&4I"D!`("TV."PW("LV.2PW($!`"B`):6YT
M(&XL('-W87!T>7!E.PH@>PH@"6EF*'-W87!T>7!E(#P](#$I"BT)"7-W87!C
M;V1E*&QO;F<L(&$L(&(L(&XI"BL)"7-W87!C;V1E*&EN="P@82P@8BP@;BD*
E(`EE;'-E"B`)"7-W87!C;V1E*&-H87(L(&$L(&(L(&XI"B!]"BP@
`
end
Comment 2 Pedro F. Giffuni freebsd_committer 2014-10-11 16:40:17 UTC
Created attachment 148195 [details]
Updated diff

This is the same (latest) version but unencoded for easier examination.

Not the appropriate test but it does cause a small but measurable
improvement (user time) with:
/usr/src/tools/regression/lib/libc/stdlib/test-qsort
Comment 3 Pedro F. Giffuni freebsd_committer 2015-02-01 21:17:02 UTC
I would like more feedback before committing this so I opened it for revision in our Phabricator instance:

https://reviews.freebsd.org/D1752
Comment 4 Andrey A. Chernov freebsd_committer 2015-02-06 19:17:45 UTC
The patch assumes that sizeof(long) % sizeof(int) == 0 everywhere.
Machine-independent code should not have such assumption.
There must be two completely separated calculation, one for int and one for long. I.e. current code for long must be duplicated for int.
Then final pass should be added to determine which version wins (if equal, long first).
Comment 5 Andrey A. Chernov freebsd_committer 2015-02-06 22:32:27 UTC
And the bug: on amd64 the patch always copy by 32bits instead of 64bits 
due to replacement of swapcode(long, a, b, n) by swapcode(int, a, b, n).
It is possible that copy by 32bits is faster on amd64 (I don't check), but it is very architecture dependent and must be ifdefed.
If we going to be architecture dependent and optimize just i386 and amd64 cases, the code should use int32_t and int64_t and use proper architecture-related ifdefs.
Comment 6 Andrey A. Chernov freebsd_committer 2015-02-06 23:57:44 UTC
What is must be done in machine independent code is really simple:

1)
#define SWAPINIT_long(a, es) swaptype_long = ...
#define SWAPINIT_int(a, es) swaptype_int = ...

2) Call both SWAPINITs instead of one.

3)
if (swaptype_long <= 1)
	swapcode(long, a, b, n)
else if (swaptype_int <= 1)
	swapcode(int, a, b, n)
else
	swapcode(char, a, b, n)
Comment 7 Andrey A. Chernov freebsd_committer 2015-02-07 02:49:54 UTC
4) swap(a, b) macro needs to be fixed too:

#define swap(a, b)                                      \
        if (swaptype_long == 0) {                       \
                long t = *(long *)(a);                  \
                *(long *)(a) = *(long *)(b);            \
                *(long *)(b) = t;                       \
        } else if (swaptype_int == 0) {                 \
                long t = *(int *)(a);                   \
                *(int *)(a) = *(int *)(b);              \
                *(int *)(b) = t;                        \
        } else                                          \
                swapfunc(a, b, es, swaptype)
Comment 8 Pedro F. Giffuni freebsd_committer 2015-02-09 14:57:20 UTC
Created attachment 152798 [details]
Updated patch by Andrey Chernov

I updated the testing utility with 4096 datapoints of random data and
I did some very basic testing with an updated patch from Andrey :

$ time ./test-heapsort
1..1
ok 1 - heapsort
       27.83 real        27.61 user         0.00 sys
$ time ./test-mergesort
1..1
ok 1 - mergesort
       27.02 real        26.83 user         0.00 sys
$ time ./test-qsort (old)
1..1
ok 1 - qsort
       27.07 real        26.94 user         0.00 sys
...
$ time ./test-qsort (new)
1..1
ok 1 - qsort
       26.94 real        26.81 user         0.00 sys
___ 

IMHO. the performance improvements are measurable, and even when they may not seem huge, they are significant enough to make qsort preferable over heapsort and mergesort (at least for my particular dataset).
Comment 9 Pedro F. Giffuni freebsd_committer 2015-02-22 15:21:12 UTC
Any objection about committing the patch?
Comment 10 commit-hook freebsd_committer 2015-03-05 17:00:54 UTC
A commit references this bug:

Author: pfg
Date: Thu Mar  5 17:00:40 UTC 2015
New revision: 279663
URL: https://svnweb.freebsd.org/changeset/base/279663

Log:
  qsort(3): enhance to handle 32-bit aligned data on 64-bit systems

  Implement a small enhancement to the original qsort implementation:
  If the data is 32 bit aligned we can side-step the long type
  version and use int instead.

  The change brings a modest but significant improvement in
  32 bit workloads.

  Relnotes:	yes

  PR:		135718
  Taken from:	ache

Changes:
  head/lib/libc/stdlib/qsort.c
Comment 11 Pedro F. Giffuni freebsd_committer 2015-03-05 17:04:53 UTC
Thanks for the idea, and to Andrey for the patch.

I do not plan to MFC this, let's see it mature in -current.