Bug 106734

Summary:	[patch] [request] bzip2(1): SSE2 optimization for bzip2/libbz2
Product:	Base System	Reporter:	Mikhail T. <freebsd-2024>
Component:	bin	Assignee:	freebsd-bugs (Nobody) <bugs>
Status:	Open ---
Severity:	Affects Only Me	Keywords:	patch
Priority:	Normal
Version:	6.2-PRERELEASE
Hardware:	Any
OS:	Any

Description Mikhail T. 2006-12-14 21:50:06 UTC

	The patch below makes bzip2's blocksort routines use SSE2-registers
	to compare 16 bytes at a time.

	On both i386 and AMD chips I tested, the performance improvement
	ranges from 5% for the already compressed (.gz) files to 20% for
	the highly compressible system logs.

	The compressed files are byte-for-byte identical with those produced
	by the original bzip2.

	The changes are ifdef-ed by __SSE2__ and relies on the intrinsics
	available in GNU, Intel's, and Microsoft's compilers.

	No changes to Makefile(s) are necessary -- when targeting an
	SSE2-capable CPU (i.e. ``-march=opteron'' or ``-march=pentium4''),
	the __SSE2__ is set by the compiler.

Fix: 

The patch is available from http://aldan.algebra.com/~mi/bz/

	The patch is not FreeBSD-specific, but was developed, tested, and timed
	on FreeBSD-6.x using both i386 and amd64.

	Feedback welcome.

Comment 1 jseward 2007-01-06 08:16:28 UTC

--Boundary-00=_dr1nFvPs1zvnDlW
Content-Type: text/plain;
  charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline


According to Valgrind on amd64-linux, this patch causes bzip2 to
do comparisons based on uninitialised memory when compressing, for the
attached file (PLIST).

==16140== Conditional jump or move depends on uninitialised value(s)
==16140==    at 0x414ECF: mainGtU (blocksort.c:538)
==16140==    by 0x414B08: mainSimpleSort (blocksort.c:690)
==16140==    by 0x4150E1: mainQSort3 (blocksort.c:805)
==16140==    by 0x415FAD: mainSort (blocksort.c:1063)
==16140==    by 0x4163F8: BZ2_blockSort (blocksort.c:1244)
==16140==    by 0x40DE46: BZ2_compressBlock (compress.c:660)
==16140==    by 0x4054B6: handle_compress (bzlib.c:431)
==16140==    by 0x40571A: BZ2_bzCompress (bzlib.c:501)
==16140==    by 0x407B0D: BZ2_bzWriteClose64 (bzlib.c:1093)
==16140==    by 0x4014E1: compressStream (bzip2.c:451)
==16140==    by 0x402E20: compress (bzip2.c:1368)
==16140==    by 0x40455C: main (bzip2.c:2034)

--Boundary-00=_dr1nFvPs1zvnDlW
Content-Type: text/plain;
  charset="us-ascii";
  name="PLIST"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename="PLIST"

share/sgml/docbook/dsssl/modular/README
share/sgml/docbook/dsssl/modular/VERSION
share/sgml/docbook/dsssl/modular/BUGS
share/sgml/docbook/dsssl/modular/TODO

Comment 2 Mikhail Teterin 2007-01-07 02:37:20 UTC

> According to Valgrind on amd64-linux, this patch causes bzip2 to
> do comparisons based on uninitialised memory when compressing, for the
> attached file (PLIST).
> 
> ==16140== Conditional jump or move depends on uninitialised value(s)
> ==16140==    at 0x414ECF: mainGtU (blocksort.c:538)
> ==16140==    by 0x414B08: mainSimpleSort (blocksort.c:690)
> ==16140==    by 0x4150E1: mainQSort3 (blocksort.c:805)

Julian! You informed me of this issue in a direct e-mail from Dec 19.
I responded the next day on Dec 20:

	------------------------------------------------------------
= From my 5-minute investigation, I *think* this is because the loads

=        /* Load the bytes: */
=        n1 = (__m128i)_mm_loadu_pd((double *)(block + i1));
=        n2 = (__m128i)_mm_loadu_pd((double *)(block + i2));

I doubt it -- Valgrind's diagnostics says "Conditional jump or move
depends on uninitialised value(s)". There are no jumps in the above
code, there are two machine instructions (executed in parallel).
	------------------------------------------------------------

Have you not received the e-mail with above text? (Message-Id:
<200612200615.22819@Misha>).

Yours,

	-mi

Comment 3 jseward 2007-01-07 05:08:43 UTC

I believe this analysis is correct:

>         /* Load the bytes: */
>         n1 = (__m128i)_mm_loadu_pd((double *)(block + i1));
>         n2 = (__m128i)_mm_loadu_pd((double *)(block + i2));
> 
> read beyond the end of the defined area of block.  block is
> defined for [0 .. nblock + BZ_N_OVERSHOOT - 1], but I think
> you are doing a SSE load at &block[nblock + BZ_N_OVERSHOOT - 2],
> hence loading 15 bytes of garbage.

Valgrind doesn't complain about the out-of-range access, because you
are still accessing inside a valid malloc-allocated block.  But it
does know that the read data is uninitialised, hence it complains
when you do a comparison with that data followed by a conditional
branch (or move) based on the result of the comparison.

> This is possible... You think, the loop should exit earlier and test
> the last (up to) 15 bytes one-by-one?

Certainly the loop-end stuff needs to be fixed up somehow to reflect
the 16 byte loads, but without further investigation I'm not sure how.

Comment 4 Mikhail Teterin 2007-01-09 23:34:36 UTC

On Sunday 07 January 2007 00:08, Julian Seward wrote:
= >         /* Load the bytes: */
= >         n1 = (__m128i)_mm_loadu_pd((double *)(block + i1));
= >         n2 = (__m128i)_mm_loadu_pd((double *)(block + i2));

= > read beyond the end of the defined area of block.  block is
= > defined for [0 .. nblock + BZ_N_OVERSHOOT - 1], but I think
= > you are doing a SSE load at &block[nblock + BZ_N_OVERSHOOT - 2],
= > hence loading 15 bytes of garbage.

I don't think, that's quite right... Instead of processing 8 bytes at a time, 
as the non-SSE code is doing, I'm comparing 16 at a time. Thus it is possible 
for me to be over by exactly 8 sometimes...

Anyway, the problem was stemming from my bumping i1 and i2 by 16 instead of 8 
after the _initial check_ (which, in the quadrant-less case should not need 
to be separate at all, actually). Sometimes _that_ would bring them over... I 
think, the solution is to either bump up BZ_N_OVERSHOOT even further or check 
and adjust i1 and i2:

	if (i1 >= nblock)
		i1 -= nblock;
	if (i2 >= nblock)
		i2 -= nblock;

at the beginning, rather than the end of the loop. Having done that, I no 
longer peek beyond the end of the block (according to gdb's conditional 
breakpoints, at least).

Please, check the new http://aldan.algebra.com/~mi/bz/blocksort-SSE2-patch-2

Yours,

	-mi

P.S. The following gdb-script is what I used. Run as:

	gdb -x x.txt bzip2

x.txt:
	break blocksort.c:516
	cond 1 (i1 > nblock) || (i2 > nblock)
	run -9 < /tmp/PLIST > /dev/null

andjust the compression level, the input's location, and be sure to have 
blocksort.o compiled with debug information, of course...

Comment 5 Eitan Adler freebsd_committer

2017-12-31 08:01:44 UTC

For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped

Comment 6 Graham Perrin freebsd_committer

2022-10-17 12:35:47 UTC

Keyword: 

    patch
or  patch-ready

– in lieu of summary line prefix: 

    [patch]

* bulk change for the keyword
* summary lines may be edited manually (not in bulk). 

Keyword descriptions and search interface: 

    <https://bugs.freebsd.org/bugzilla/describekeywords.cgi>