Bug 218203 - Implement AVX2 accelerated Fletcher algorithms
Summary: Implement AVX2 accelerated Fletcher algorithms
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64
OS: Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-03-29 04:26 UTC by Adam Stylinski
Modified: 2017-04-23 17:46 UTC
CC List: 2 users

See Also:


Attachments
Quick benchmark (3.55 KB, text/x-c++src)
2017-03-30 18:56 UTC, Adam Stylinski
SSE4 implementation (3.07 KB, text/plain)
2017-03-30 19:54 UTC, Adam Stylinski
same benchmark but with 128k checksums (3.55 KB, text/plain)
2017-04-03 19:22 UTC, Adam Stylinski

Description Adam Stylinski 2017-03-29 04:26:41 UTC
Intel has published a pretty straightforward implementation of Fletcher4 leveraging AVX2 instructions:

https://software.intel.com/en-us/articles/fast-computation-of-fletcher-checksums

I was able to use this whitepaper and compiler intrinsics to build a rudimentary version that's nearly twice as fast as the existing implementation.  It would be feasible to swap out the existing scalar, portable implementation for this faster variant, similar to the way Linux offers SIMD-accelerated versions of cryptographic and hashing routines within its kernel.

As a matter of fact, zfsonlinux is already doing this:

https://github.com/zfsonlinux/zfs/tree/482cd9ee69e88710e9241fac220501ea4e101d19/module/zcommon

While I understand the desire to remain close to the reference ZFS implementation in Illumos, and maybe there don't need to be quite that many versions of fletcher4 (they also ship a superscalar version that presumably tries to take advantage of out-of-order execution, hoping the microarchitecture can schedule the instructions efficiently once it notices the lack of data dependencies), it does seem silly to ignore a working implementation that is measurably faster on CPUs that support it.  It has even been backported to SSSE3 instructions:
https://github.com/zfsonlinux/zfs/blob/482cd9ee69e88710e9241fac220501ea4e101d19/module/zcommon/zfs_fletcher_sse.c
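
For context, the scalar loop all of these variants are accelerating is basically the following (a minimal sketch of the algorithm, not the exact code in the tree; 32-bit words summed into four 64-bit accumulators):

#include <stdint.h>
#include <stddef.h>

/*
 * Minimal scalar Fletcher-4 sketch: four running 64-bit sums over 32-bit
 * words.  Each sum feeds the next, and that loop-carried dependency chain
 * is exactly what the superscalar and SIMD variants break up by striding
 * across independent streams.
 */
static void
fletcher4_scalar(const uint32_t *ip, size_t nwords, uint64_t sums[4])
{
    uint64_t a = 0, b = 0, c = 0, d = 0;

    for (size_t i = 0; i < nwords; i++) {
        a += ip[i];
        b += a;
        c += b;
        d += c;
    }
    sums[0] = a; sums[1] = b; sums[2] = c; sums[3] = d;
}

The AVX2 version runs four of these chains in parallel, one per 64-bit lane, and folds the lane sums back together with constant coefficients at the end.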
Comment 1 Adam Stylinski 2017-03-30 17:07:42 UTC
If desired, I can post my benchmark code.  It uses more instructions than the zfsonlinux variant (I used SIMD intrinsics instead of inline assembly), but the extra instructions are mostly just shuffling values between registers.  After the intermediate sum loop completes, I alias into the __m256i values instead of doing a vmovdqu store to memory for the constant multiplications.  I suspect the compiler was able to shuffle registers around enough to avoid some trips to memory.  Also, the Intel whitepaper isn't quite fair to itself: I think it compares its SIMD variant against the best possible performance without SIMD, which is not the original loop but the loop unrolled 4 times.
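
The inner loop I ended up with is roughly the following (a sketch with intrinsics rather than the attachment verbatim; the recombination constants here are just what falls out of striding the scalar recurrence by four, assuming the word count is a multiple of 4):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Sketch of a 4-way strided Fletcher-4 using AVX2 intrinsics.  Lane j of
 * each __m256i carries the running sums for words 4k+j; nwords is assumed
 * to be a multiple of 4.
 */
static void
fletcher4_avx2(const uint32_t *ip, size_t nwords, uint64_t sums[4])
{
    __m256i a = _mm256_setzero_si256();
    __m256i b = _mm256_setzero_si256();
    __m256i c = _mm256_setzero_si256();
    __m256i d = _mm256_setzero_si256();

    for (size_t i = 0; i < nwords; i += 4) {
        /* vpmovzxdq: widen four 32-bit words into four 64-bit lanes. */
        __m256i w = _mm256_cvtepu32_epi64(
            _mm_loadu_si128((const __m128i *)(ip + i)));

        a = _mm256_add_epi64(a, w);
        b = _mm256_add_epi64(b, a);
        c = _mm256_add_epi64(c, b);
        d = _mm256_add_epi64(d, c);
    }

    /*
     * Alias into the vectors instead of storing them to memory first.
     * GCC/Clang tolerate this; a union or _mm256_storeu_si256() would be
     * the portable route.
     */
    const uint64_t *A = (const uint64_t *)&a;
    const uint64_t *B = (const uint64_t *)&b;
    const uint64_t *C = (const uint64_t *)&c;
    const uint64_t *D = (const uint64_t *)&d;

    /* Fold the four interleaved streams back into the scalar sums. */
    sums[0] = A[0] + A[1] + A[2] + A[3];
    sums[1] = 4 * (B[0] + B[1] + B[2] + B[3])
        - (A[1] + 2 * A[2] + 3 * A[3]);
    sums[2] = 16 * (C[0] + C[1] + C[2] + C[3])
        - (6 * B[0] + 10 * B[1] + 14 * B[2] + 18 * B[3])
        + (A[2] + 3 * A[3]);
    sums[3] = 64 * (D[0] + D[1] + D[2] + D[3])
        - (48 * C[0] + 64 * C[1] + 80 * C[2] + 96 * C[3])
        + (4 * B[0] + 10 * B[1] + 20 * B[2] + 34 * B[3])
        - A[3];
}

The additions and the fold-up wrap modulo 2^64, which is what the scalar sums do anyway.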
Comment 2 Allan Jude freebsd_committer freebsd_triage 2017-03-30 18:36:04 UTC
Any additional information, benchmark code, etc. you can provide will be useful. Thank you.
Comment 3 Adam Stylinski 2017-03-30 18:56:57 UTC
Created attachment 181322 [details]
Quick benchmark

Please don't mind the C++; the actual function is plain C.  Be sure to compile with -mavx2.

It's worth noting that the code integrated into ZFS On Linux appears to benchmark the candidate implementations, pick the fastest one, and dynamically reassign a function pointer to it (maybe at module load time?).  Adding that extra layer of function-pointer indirection will probably be necessary for this kind of modularity, though it obviously has some security implications (corrupting the pointer value can lead to bad things).  I hardly think it would be the first dispatched function pointer in the FreeBSD kernel, though.
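
In userland the dispatch idea looks something like this (hypothetical names, and using the GCC/Clang __builtin_cpu_supports() check; in the kernel you would test the CPUID feature bits at module load instead, and the SIMD paths would also need to save and restore FPU state the way aesni(4) does with fpu_kern_enter()/fpu_kern_leave()):

#include <stdint.h>
#include <stddef.h>

typedef void (*fletcher4_fn)(const uint32_t *, size_t, uint64_t[4]);

/* Candidate implementations (names are placeholders). */
void fletcher4_scalar(const uint32_t *, size_t, uint64_t[4]);
void fletcher4_sse41(const uint32_t *, size_t, uint64_t[4]);
void fletcher4_avx2(const uint32_t *, size_t, uint64_t[4]);

/* Default to the portable implementation until selection runs. */
static fletcher4_fn fletcher4_impl = fletcher4_scalar;

/*
 * Run once at startup and repoint the dispatch pointer at the best
 * implementation the CPU supports.  ZFS On Linux goes a step further and
 * micro-benchmarks the candidates instead of trusting the feature bits.
 */
static void
fletcher4_select_impl(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        fletcher4_impl = fletcher4_avx2;
    else if (__builtin_cpu_supports("sse4.1"))
        fletcher4_impl = fletcher4_sse41;
}

void
fletcher4(const uint32_t *ip, size_t nwords, uint64_t sums[4])
{
    fletcher4_impl(ip, nwords, sums);
}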

Given that iXsystems cares about ZFS performance and ships products on hardware with some lower-powered Atom CPUs, I figured this would be of interest.
Comment 4 Adam Stylinski 2017-03-30 19:54:18 UTC
Created attachment 181326 [details]
SSE4 implementation

Not as impactful, but still definitely measurable, is an SSE4 variant.  This one will actually run on Atoms and on a much larger set of other pre-Haswell CPUs.

The ZFS On Linux version does 8 padds per loop iteration because it avoids the pmovzxdq instruction, probably so that it stays compatible with SSSE3 and doesn't require SSE4.
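
The difference in the inner loop is mostly just how the 32-bit words get widened; my SSE4 variant does roughly this (a sketch, not the attachment verbatim, with the final lane fold-up omitted since it's the two-stream analogue of the AVX2 one):

#include <smmintrin.h>  /* SSE4.1 */
#include <stdint.h>
#include <stddef.h>

/*
 * Sketch of the 2-way strided SSE inner loop.  Lane j of each __m128i
 * carries the running sums for words 2k+j; nwords is assumed to be a
 * multiple of 4.  SSE4.1 gets the 32->64-bit widening from pmovzxdq
 * (_mm_cvtepu32_epi64); the SSSE3/SSE2-compatible route instead unpacks
 * against a zero register with _mm_unpacklo_epi32/_mm_unpackhi_epi32.
 */
static void
fletcher4_sse41_loop(const uint32_t *ip, size_t nwords, __m128i acc[4])
{
    __m128i a = acc[0], b = acc[1], c = acc[2], d = acc[3];

    for (size_t i = 0; i < nwords; i += 4) {
        __m128i in = _mm_loadu_si128((const __m128i *)(ip + i));
        __m128i lo = _mm_cvtepu32_epi64(in);                    /* words 0,1 */
        __m128i hi = _mm_cvtepu32_epi64(_mm_srli_si128(in, 8)); /* words 2,3 */

        a = _mm_add_epi64(a, lo);
        b = _mm_add_epi64(b, a);
        c = _mm_add_epi64(c, b);
        d = _mm_add_epi64(d, c);

        a = _mm_add_epi64(a, hi);
        b = _mm_add_epi64(b, a);
        c = _mm_add_epi64(c, b);
        d = _mm_add_epi64(d, c);
    }

    acc[0] = a; acc[1] = b; acc[2] = c; acc[3] = d;
}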
Comment 5 Adam Stylinski 2017-04-03 19:22:34 UTC
Created attachment 181440 [details]
same benchmark but with 128k checksums

128k block sizes are more realistic for ZFS, though 8k will probably fit in cache a little better if you're trying to mitigate memory access effects.