Hello, Could it be possible to implement an optimized version of calculate_crc32c() with hardware CRC32C support please ? SSE 4.2 has this feature, which is really interesting. Here are some benchmarks with this : https://github.com/laanwj/crcbench On an Intel E5-2620v3 : # ./crcbench CRC32C benchmarks [sw] 65543000 bytes in 69730us monotonic 69726us CPU (940.0MB/s) [hw-sse42] 65543000 bytes in 12619us monotonic 12618us CPU (5194.4MB/s) On more data (10GB) : # ./crcbench CRC32C benchmarks [sw] 10485767000 bytes in 11179263us monotonic 11178667us CPU (938.0MB/s) [hw-sse42] 10485767000 bytes in 2028412us monotonic 2028305us CPU (5169.7MB/s) Sounds really interesting and promising. Main goal behind this request is to improve iSCSI throughput when HeaderDigest and DataDigest are enabled. Enabling these options made my iSCSI throughput drop from 300MB/s to 200MB/s. Hardware CRC32C should then help minimising this difference. Thank you very much ! Ben
*** Bug 212454 has been marked as a duplicate of this bug. ***
We have a patch for this at Isilon. Unfortunately, our assembly CRC32C implementations are privately licensed Intel code we can't share. Do you have a permissively licensed replacement we could use in FreeBSD? There are two algorithms. One algorithm just uses SSE4.2 instructions and no FPU; that's generally usable in the kernel. There is another algorithm that also uses PCLMULQDQ (FPU) which is less generally suitable for the kernel, but an option if the thread is already preserving FPU state.
Looks like Linux has a BSDL copy of the PCLMULQDQ Intel code, so we can just borrow that: http://lxr.free-electrons.com/source/arch/x86/crypto/crc32c-pcl-intel-asm_64.S Unfortunately, their non-PCLMULQDQ version is not available BSDL.
Thank you very much for your feedback ! Can't we simply look at this (which runs fine on FreeBSD) ? https://github.com/laanwj/crcbench/blob/master/crc32c_sse42.h Would you prefer a SSE4.2 only algorithm, or with FPU ?
In addition, I'm not sure what you really mean by a "permissively licensed replacement".
(In reply to Ben RUBSON from comment #4) That file isn't licensed clearly, although the COPYING file for the project appears to be MIT (which would be fine). I think no-FPU would be preferred, as long as it's not too much slower than the FPU variant.
(Oh, and the github project is in C++ and uses some compiler intrinsics that may or may not be available -- it would require some massaging to get into FreeBSD.)
Yep, -msse4.2, this is why I think this code looks so simple (mainly _mm_crc32_u64 calls).
Here is a proposed patch: https://reviews.freebsd.org/D9342
Wow, many thanks ! I'll then give it a try very shortly ! So you are using the _mm_crc32_u64 functions :) As a test I could make a ZFS scrub on a healthy pool made of iSCSI disks, to see if ZFS complains (bad) or not (good). Not the perfect test however.
(In reply to Ben RUBSON from comment #10) Yeah, I intend to do some userspace tests to confirm correctness before committing.
Patch & compilation went fine with 11.0-RELEASE-p7. But does not seem to work, at least with the iSCSI test I did : kernel: WARNING: 192.168.2.1 (iqn.2012-06.srv1:rT1): connection error; reconnecting kernel: (probe0:iscsi19:0:0:0): INQUIRY. CDB: 12 00 00 00 24 00 kernel: (probe0:iscsi19:0:0:0): CAM status: CCB request aborted by the host kernel: (probe0:iscsi19:0:0:0): Retrying command If I remove HeaderDigest and DataDigest from iscsi.conf, then iSCSI connection is OK.
(In reply to Ben RUBSON from comment #12) The sense of the input/output bits was inverted. Please try the latest version of the patch (v4 in phabricator), which removes the inversion. I've verified that it returns identical results to our software implementation for a few thousand inputs. I expect it will fix the iSCSI attach problem. Thanks!
> Enabling these options made my iSCSI throughput drop from 300MB/s to 200MB/s. By the way, this patch may not restore all lost performance. Some of the difference may come from the target having to compute checksums as well, which we can't do anything about :-).
Both my initiators and targets are FreeBSD servers :) So this nice kernel improvement should really be helpful ! I'm going to give your last (not inverted ;) patch a try. Thanks too !
Just tested but does not fully work : iSCSI connects successfully, but as soon as there is some trafic, disks disconnect and hang, system is not able to recover them.
(In reply to Ben RUBSON from comment #16) That doesn't sound like a CRC problem.
Yep I agree but strangely when I disable HeaderDigest and DataDigest, it works fine. And without the patch, HeaderDigest and DataDigest work correctly. I've performed the test twice.
(In reply to Ben RUBSON from comment #18) I noticed a bug in handling of unaligned inputs. I expect that is the problem you're seeing.
There is a new patch in phabricator that should resolve the previous issue. It should also be a little faster for inputs larger than 768 bytes (e.g. 4k and 8k IOs).
Such as iSCSI frames :) Very nice ! Will build and test it in a few minutes, and start a scrub if runs fine.
Scrub running ! Result in a few hours, but sounds good actually :)
So scrub finished correctly, good news. And some benchmark, iSCSI troughput : - without CRC32C : 300 MB/s - with sw CRC32C : 200 MB/s - with hw CRC32C : 285 MB/s Really nice improvement ! Many thanks Conrad, really glad :)
(In reply to Ben RUBSON from comment #23) Hey, that's great to hear. Thanks for trying it out several times, even a few broken patches :-).
There is no way for iSCSI to pass the fact the digest failed to the upper layers (like ZFS); the way it's handled is by dropping the connection - afterwards you get a reconnect and a retry of the disk operation. So, what I'd expect is that you would see a message saying "data digest check failed" in your syslog, then a connection drop, and a reconnect. Do you see that?
>what I'd expect is that you would see a message saying "data digest check failed" No, I did not see such a message. When I saw that the patch was not working correctly, I tested with only one disk. It successfully connected, iscsictl returned it as connected, perfect. But as soon as I dd if=/dev/mydisk, I got the following : kernel: WARNING: 192.168.2.1 (iqn.2012-06.srv1:rT1): no ping reply (NOP-In) after 5 seconds; reconnecting srv1 last message repeated 31 times srv1 last message repeated 42 times But no digest failed message.
A commit references this bug: Author: cem Date: Tue Jan 31 03:26:33 UTC 2017 New revision: 313006 URL: https://svnweb.freebsd.org/changeset/base/313006 Log: calculate_crc32c: Add SSE4.2 implementation on x86 Derived from an implementation by Mark Adler. The fast loop performs three simultaneous CRCs over subsets of the data before composing them. This takes advantage of certain properties of the CRC32 implementation in Intel hardware. (The CRC instruction takes 1 cycle but has 2-3 cycles of latency.) The CRC32 instruction does not manipulate FPU state. i386 does not have the crc32q instruction, so avoid it there. Otherwise the implementation is identical to amd64. Add basic userland tests to verify correctness on a variety of inputs. PR: 216467 Reported by: Ben RUBSON <ben.rubson at gmail.com> Reviewed by: kib@, markj@ (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D9342 Changes: head/sys/conf/files.amd64 head/sys/conf/files.i386 head/sys/libkern/crc32.c head/sys/libkern/x86/ head/sys/libkern/x86/crc32_sse42.c head/sys/sys/libkern.h head/tests/sys/kern/Makefile head/tests/sys/kern/libkern_crc32.c
10 days running on 2 servers without issue :) Do you plan to MFC to stable/11 ? Thank you !
(In reply to Ben RUBSON from comment #28) I don't touch stable branches, but you may ask someone else to MFC it.
Merged to 11 in https://svnweb.freebsd.org/changeset/base/317149