Intel has published a pretty straight forward implementation of Fletcher4 leveraging AVX2 instructions: https://software.intel.com/en-us/articles/fast-computation-of-fletcher-checksums I was able to use this white paper and compiler intrinsics to build a rudimentary version that's nearly twice as fast. It is feasible to swap out the existing scalar and portable implementation for this faster variant similar to the way Linux offers SIMD accelerated versions of cryptographic and hashing routines within their kernel. As a matter of fact, zfsonlinux is already doing this: https://github.com/zfsonlinux/zfs/tree/482cd9ee69e88710e9241fac220501ea4e101d19/module/zcommon While I understand the desire to remain close to the reference ZFS implementation with Illumos and maybe there doesn't need to be quite that many versions of fletcher4 (they do a superscalar version that presumably tries to take advantage of Out-of-Order execution - hoping the microarchitecture can schedule the instructions efficiently by noticing the lack of data dependencies), it does seem silly to ignore a working implementation that is measurably faster for CPUs that support it. It has even been backported to SSSE3 instructions: https://github.com/zfsonlinux/zfs/blob/482cd9ee69e88710e9241fac220501ea4e101d19/module/zcommon/zfs_fletcher_sse.c
If desired, I can post my benchmark code. It is using more instructions than the zfsonlinux variant (I used SIMD intrinsics instead of inline assembly). The extra instructions are mostly just shuffling values between registers. After the intermediate sum loop is completed I aliased into the __m256i's instead of doing vmovqdu into memory for the constant multiplications. I suspect the compiler was able to shuffle registers around enough to avoid some trips to memory, but the Intel whitepaper isn't quite fair to itself, as I think they are comparing the best possible performance without SIMD (which is not the original loop, but the loop unrolled 4 times) with their SIMD variant.
any additional information, benchmark code, etc you can provide will be useful. Thank you.
Created attachment 181322 [details] Quick benchmark Please don't mind the C++, the actual function is clean C. Be sure to compile with -mavx2. It's worth noting that the code that's integrated into ZFS On Linux appears to do benchmark code to select the fastest function and dynamically reassigns a function pointer to that function (at maybe module load time?). Adding the extra layer of function pointers will probably be somewhat necessary for this kind of modularity, though this obviously has some security implications (in that corrupting this pointer value can lead to bad things). I hardly think it'd be the first function pointer dispatched in the FreeBSD kernel, though. Given that IXSystems cares about ZFS performance and distributes on hardware with some lower powered Atom CPUs, I figured this would be of interest.
Created attachment 181326 [details] SSE4 implementation Also not as impactful, but definitely measurable is an SSE4 variant. This will actually run on Atoms and a much larger number of other pre-Haswell CPUs. The ZFS On Linux version does 8 padds per loop iteration because they are avoiding the pmovzxdq instruction. This is probably to be compatible with SSSE3 and not require SSE4.
Created attachment 181440 [details] same benchmark but with 128k checksums 128k widths are more realistic for ZFS. Though, 8k will probably fit into cache lines a little better if you're trying to mitigate memory access effects.