Bug 242747

Summary: geli: AMD Epyc+GELI not using Hardware AES
Product: Base System
Reporter: Nick Evans <nevans>
Component: kern
Assignee: freebsd-geom (Nobody) <geom>
Status: Open
Severity: Affects Some People
CC: cem, dewayne, jhb, ltning-freebsd, yoitsmeremember
Priority: ---
Keywords: needs-qa, performance
Version: 12.1-RELEASE
Flags: koobs: mfc-stable12?, koobs: mfc-stable11?
Hardware: amd64
OS: Any
Attachment: Xeon/Epyc Server Information

Description Nick Evans 2019-12-20 22:28:43 UTC
Created attachment 210085 [details]
Xeon/Epyc Server Information

I have two similar systems I'm testing as database hardware. Both consist of 8x Samsung 860 Pro SSDs attached to an LSI 9003-8i, 256 GB RAM, and an equal chassis/backplane. The only difference is the CPU: one server has an Epyc 7371, the other a Xeon Gold 6130. I've attached some command snippets showing the configuration info.

Each server has 8 SSDs configured identically to the one shown in the attachment, in a ZFS RAID-10 layout with 4 mirrored vdevs. I'm aware that GELI is set up with 4 KiB aligned sectors versus the drives' reported 512 bytes; that should be a non-factor for the behavior I'm seeing. The short version: although the Epyc box reports AES support just like the Xeon, and GELI shows "Crypto: hardware", extracting a tar of my 100 GB test database produces significantly higher load on the Epyc box, with top showing large amounts of CPU time in g_eli threads that the Xeon does not show in hardware mode. If I disable the aesni and cryptodev modules on the Xeon, I see the same high g_eli thread load, which leads me to believe that despite Epyc/GELI reporting hardware AES, it is actually still running in software mode. I've tried both AES-XTS and AES-CBC modes for GELI with the same behavior. Both servers are purely for research at the moment, so I can run any tests or fixes needed to drill down on this.
Comment 1 Nick Evans 2020-01-28 17:32:12 UTC
Quick update:

Updated to -CURRENT r357204 and am seeing the same behavior, which isn't surprising since I haven't seen any commits that seemed relevant. I'm also about to try this on an Epyc 7351P to see if it's processor specific.
Comment 2 Conrad Meyer freebsd_committer 2020-01-28 17:36:53 UTC
Can you run dd tests to/from the raw GELI volume to get numbers and confirm this isn't some behavior leaking in from the test environment?  There's a lot of surface area in remote file -> tar -> zfs, and what we really want to know is whether the behavior of the geli layer itself differs.

If possible, please get numbers with aesni enabled and disabled on both processors.

When aesni attaches on both systems, what does it report?  E.g., this line:

> aesni0: <AES-CBC,AES-XTS,AES-GCM,AES-ICM,SHA1,SHA256> on motherboard
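A minimal sketch of the raw-provider test being requested, with the device names as assumptions (substitute your own .eli providers). Reading each provider directly with dd takes tar and ZFS out of the picture and isolates the geli layer:

```shell
#!/bin/sh
# Sequential-read test against raw GELI providers. Device names below are
# examples only; bs=1m matches the dd invocations used later in this report.
for dev in /dev/da0.eli /dev/da1.eli; do
    if [ -c "$dev" ]; then
        # 10 GiB read per provider; dd prints bytes/sec on completion
        dd if="$dev" of=/dev/null bs=1m count=10240
    else
        echo "skipping $dev: not present"
    fi
done
```

Run once with aesni loaded and once with it disabled, and compare both the reported throughput and the idle percentage in top.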

Comment 3 Conrad Meyer freebsd_committer 2020-01-28 17:38:54 UTC
One vague idea is that the additional modes supported on Epyc (SHA1/SHA256) confuse the geom_eli / OpenCryptoFramework interaction and result in GELI not finding aesni (waving hands here).  I think I remember discussing something like this with jhb@, but maybe not.
Comment 4 Nick Evans 2020-01-28 21:52:30 UTC
aesni0: <AES-CBC,AES-XTS,AES-GCM,AES-ICM> on motherboard

Working on getting the GELI-specific numbers. The long and short of it: if I dd from all 8 geli providers to /dev/null simultaneously, I see the same general result: 20-25% idle on the Epyc box versus 60-70% idle on the Xeon, with the Epyc pushing slightly more from each drive (~280 MB/s vs ~250 MB/s). I was seeing some odd performance results when running dd against single drives, or when testing multiple drives with ZFS in the mix. Benchmarks suggest these two CPUs should be nearly identical for most tasks. I'm going to sync the -CURRENT versions now and review both setups again to be sure nothing changed on me. I should have something in a few days.
Comment 5 Conrad Meyer freebsd_committer 2020-01-28 22:06:39 UTC
I suppose it's possible that the AES-NI (or SSE) intrinsics on Epyc are just slower.  I wouldn't expect such a substantial difference, though.

It's also interesting that the Epyc reports AES-XTS support while the Xeon does not.  That suggests the Xeon is running a different (older) version of FreeBSD than the Epyc.  Maybe that's just from installing CURRENT on the Epyc, but for an apples-to-apples comparison I'd suggest running the same version.

When you installed CURRENT, did you compile a GENERIC-NODEBUG kernel and define MALLOC_PRODUCTION for world?  Or is this a snapshot published on the FreeBSD FTP server?  If the latter, it's a debugging-enabled kernel, which will result in substantially higher sys time.
Comment 6 Nick Evans 2020-01-28 22:14:11 UTC
Both report XTS; the Epyc also has CCM, and the lists appear to be ordered differently. I am using NODEBUG on the Epyc and am building the Xeon with it now.
Comment 7 Conrad Meyer freebsd_committer 2020-01-28 22:18:19 UTC
(In reply to Nick Evans from comment #6)
Ah, you're right, I only saw CBC/ICM were in the same place and forgot that the
list order had changed at one point.
Comment 8 dewayne 2020-01-29 01:07:38 UTC
(In reply to Nick Evans from comment #4)
Nick, as a data point my Xeon, circa 2015, reveals:
CPU: Intel(R) Xeon(R) CPU E3-1230L v3 @ 1.80GHz (1795.88-MHz K8-class CPU) 
aesni0: <AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS> on motherboard
FreeBSD 12.1-STABLE #0 r356046M: Tue Dec 24 22:27:08 AEDT 2019
(I use geli on a few partitions, including swap, though this is a box with a high CPU workload.)
Comment 9 John Baldwin freebsd_committer freebsd_triage 2020-01-30 16:48:26 UTC
I would start by using dtrace and/or printfs to see if aesni_newsession is failing when the GELI disks are attached (if they are mounted as root during boot you are stuck with printfs and having to reboot each time to narrow it down).
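A hedged sketch of that dtrace approach (the probe relies on fbt auto-generating a return probe for aesni_newsession(); the function name comes from the aesni(4) driver, and the interpretation is an assumption: a nonzero return would mean session setup failed and the opencrypto layer fell back to a software driver):

```d
/* aesni_check.d -- run while attaching a .eli provider; needs aesni loaded. */
/* In fbt return probes, arg1 carries the function's return value.           */
fbt::aesni_newsession:return
{
        printf("aesni_newsession returned %d", arg1);
}
```

Usage would be something like `dtrace -s aesni_check.d` in one terminal while running `geli attach` in another (the script name is hypothetical).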
Comment 10 Nick Evans 2020-01-30 17:33:58 UTC
(In reply to dewayne from comment #8)

So far results are the same with both boxes being on -CURRENT with NODEBUG kernels so at least that's ruled out.

eli.batch=1 alone helps the CPU usage, but at the expense of throughput, at least on the Epyc box: it goes from about 280 MB/s per disk to 180 MB/s. Idle time went up to 60%, but that's probably just due to the drop in overall throughput.

Setting eli.threads=2 makes a big difference on the Epyc box. Per-disk throughput went up to 330 MB/s and overall idle time went up to 92%, running dd if=/dev/da#.eli of=/dev/null bs=1m, one per disk. batch=1, even combined with threads=2, doesn't seem to help in this case.

I guess there's some kind of thrashing going on when the default 32 threads per disk are created that affects the Epyc box more than the Xeon. I'll run some tests at different thread counts and report back. Maybe we can at least come up with more sensible defaults.
Comment 11 Conrad Meyer freebsd_committer 2020-01-30 17:49:21 UTC
Yeah, I suspect ncpu threads per disk across 8 disks is not really a great default!  geli's ncpu default is probably reasonable for a single-disk laptop, but doesn't account for larger arrays.  threads=2 x 8 disks gets you to 16, the number of real cores.  IIRC, Epyc has only 128-bit-wide vector units internally, but I don't see how that would affect aesni(4): AES itself is a 128-bit cipher, and the aesni(4) driver only uses SSE intrinsics, which act on 128-bit vectors.  The CPU may simply have fewer vector units, and attempting to use 32 threads at the same time contends for shared resources.
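The thread-count arithmetic can be made concrete. A small sketch, taking this report's configuration as assumed inputs (Epyc 7371: 16 cores / 32 hardware threads; 8 .eli providers):

```shell
#!/bin/sh
# Total g_eli workers under the default (one thread per CPU per provider)
# versus the threads=2 tunable. All numbers are this report's hardware and
# are assumptions, not values queried from a live system.
ncpu=32; disks=8; cores=16
echo "default (threads=ncpu): $((ncpu * disks)) g_eli workers on $cores cores"
echo "tuned   (threads=2):    $((2 * disks)) g_eli workers on $cores cores"
```

That is 256 workers contending for 16 physical cores by default, versus 16 workers with threads=2, which lines up with the idle-time improvement reported above.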
Comment 12 Nick Evans 2020-01-30 18:15:08 UTC
(In reply to Conrad Meyer from comment #11)

I've also noticed that at the lower thread counts the g_eli threads are sometimes pinned to SMT cores, and I can't cpuset them to the physical cores either. There would probably be some benefit to pinning them only to physical cores. AMD's SMT is better than Intel's, and I'm not sure whether there's a penalty for using SMT cores rather than physical ones on *BSD, but in my experience on Windows, at least, there is.
Comment 13 Conrad Meyer freebsd_committer 2020-01-30 18:17:39 UTC
I don't believe there's any real difference between the two SMT cores of a real core, although I may be mistaken.  Or do you mean, two threads are pinned to the corresponding SMT threads of a single core, rather than distinct physical cores?  Yeah, that could impact realized performance.
Comment 14 Conrad Meyer freebsd_committer 2020-01-30 18:18:06 UTC
Correction: Two SMT *threads* of a real core, of course.
Comment 15 Nick Evans 2020-01-30 19:31:54 UTC
(In reply to Conrad Meyer from comment #13)

Actually, I hadn't noticed it initially, but while testing threads=1 it did pin two threads to cores 0/1, 10/11, and 20/21, which should be the physical core and its SMT sibling in each pair. The penalties I've seen on Windows amounted to a few percent when binding a CPU-hungry thread to a physical vs. an SMT core (i.e. core 0 vs. core 1) with no load on the other half of the pair. I'll test that separately on FreeBSD at some point to see if the same penalty applies.

 1443 root          1  20    -     0B    16K geli:w  28   0:47   0.00% g_eli[0] da1
 1446 root          1  20    -     0B    16K geli:w   1   0:47   0.00% g_eli[0] da2
 1452 root          1  20    -     0B    16K geli:w  20   0:46   0.00% g_eli[0] da4
 1458 root          1  20    -     0B    16K geli:w  11   0:46   0.00% g_eli[0] da6
 1461 root          1  20    -     0B    16K geli:w   3   0:46   0.00% g_eli[0] da7
 1449 root          1  20    -     0B    16K geli:w   0   0:46   0.00% g_eli[0] da3
 1440 root          1  20    -     0B    16K geli:w  10   0:46   0.00% g_eli[0] da0
 1455 root          1  20    -     0B    16K geli:w  21   0:45   0.00% g_eli[0] da5
Comment 16 Nick Evans 2020-01-30 21:32:06 UTC
Can confirm the Xeon benefits from threads=1 or threads=2 as well.
Comment 17 Eirik Oeverby 2020-03-10 11:45:08 UTC
(In reply to Nick Evans from comment #16)

I recently got an EPYC Rome box to replace my old Nehalem, and while it's screaming fast at just about anything I throw at it, system load frequently hovers in the 30s (32/64 cores across 2 CPUs). There isn't any excessive I/O going on, and load on the old box was well below 12 pretty much all the time. The old box did not do GELI at all; the new one does, but again, there is no significant I/O to speak of.

I did set threads to 2, but I saw no effect - I suppose a reboot is needed (and this should go in loader.conf, correct?)
Comment 18 Nick Evans 2020-03-10 14:17:55 UTC
(In reply to Eirik Oeverby from comment #17)

Correct, kern.geom.eli.threads=2 in /boot/loader.conf and a reboot should do it.
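For reference, a minimal /boot/loader.conf fragment for the tuning discussed in this thread (the comment text is editorial; threads=2 is the value that helped here, and loader tunables are only read at boot, hence the reboot):

```
# /boot/loader.conf
kern.geom.eli.threads=2   # g_eli worker threads per provider (default: ncpu)
```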
Comment 19 Eirik Oeverby 2020-03-10 14:44:47 UTC
(In reply to Nick Evans from comment #18)

I can confirm that threads=2 brought my load down from ~30 to ~2. Funky.