Bug 242747 - geli: AMD Epyc+GELI not using Hardware AES
Summary: geli: AMD Epyc+GELI not using Hardware AES
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-geom (Nobody)
URL:
Keywords: needs-qa, performance
Depends on:
Blocks:
 
Reported: 2019-12-20 22:28 UTC by Nick Evans
Modified: 2021-01-20 00:34 UTC
CC List: 6 users

See Also:
koobs: mfc-stable12?
koobs: mfc-stable11?


Attachments
Xeon/Epyc Server Information (2.33 KB, text/plain)
2019-12-20 22:28 UTC, Nick Evans

Description Nick Evans 2019-12-20 22:28:43 UTC
Created attachment 210085
Xeon/Epyc Server Information

I have two similar systems I'm testing as database hardware. Both consist of 8x Samsung 860 Pro SSDs attached to an LSI9003-8i, 256 GB RAM, and identical chassis/backplanes. The only difference is that one server has an Epyc 7371 and the other a Xeon Gold 6130. I've attached some command snippets showing the configuration.

All 8 SSDs in each box are configured the same way as the ones listed in the attachment, in a ZFS RAID 10 layout with 4 mirrored vdevs. I'm aware that GELI is set up with 4k-aligned sectors versus the drives' reported 512 bytes; that should be a non-factor for the behavior I'm seeing. The long and short of it: despite the Epyc box reporting AES support just like the Xeon, and GELI showing "Crypto: hardware", extracting a tar of my 100GB test database on each server puts significantly higher load on the Epyc box, with top showing large amounts of CPU time in g_eli threads that the Xeon does not show in hardware mode. When I disable the aesni and cryptodev modules on the Xeon I get the same behavior with high g_eli thread load, which leads me to believe that despite Epyc/GELI reporting hardware AES it's actually still running in software mode. I've tried both AES-XTS and AES-CBC modes for GELI with the same behavior. Both servers are purely research boxes at the moment, so I can try any necessary tests/fixes to drill down on this.
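
For anyone reproducing this, the quick checks behind the "Crypto: hardware" statement amount to roughly the following (da0.eli is just an example provider name; adjust for your layout):

  # confirm the AES-NI driver is present and what it advertises
  kldstat | grep aesni
  sysctl dev.aesni.0.%desc
  # confirm GELI believes it is using it
  geli list da0.eli | grep Crypto    # expect "Crypto: hardware"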
Comment 1 Nick Evans 2020-01-28 17:32:12 UTC
Quick update:

Updated to -CURRENT r357204 and I'm seeing the same behavior, which isn't surprising since I haven't seen any commits that seemed relevant. I'm also about to try this on an Epyc 7351P to see if it's processor-specific.
Comment 2 Conrad Meyer freebsd_committer freebsd_triage 2020-01-28 17:36:53 UTC
Can you run dd tests to/from the raw GELI volume to get numbers and confirm this isn't some behavior leaking in from the test environment? There's a lot of surface area in remote file -> tar -> zfs, and what we really want to know is whether the behavior of the geli layer is different.

If possible, please get numbers with aesni both enabled and disabled on both processors.
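
For example, a minimal raw-read benchmark of that sort might be (device name and count are assumptions; stick to reads so the providers aren't overwritten):

  # read straight from one decrypted GELI provider, bypassing ZFS entirely
  dd if=/dev/da0.eli of=/dev/null bs=1m count=10240
  # repeat with aesni/cryptodev not loaded (e.g. comment out aesni_load="YES"
  # in /boot/loader.conf and reboot) to get the software-crypto baseline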

When aesni attaches on both systems, what does it report?  E.g., this line:

> aesni0: <AES-CBC,AES-XTS,AES-GCM,AES-ICM,SHA1,SHA256> on motherboard

Thanks.
Comment 3 Conrad Meyer freebsd_committer freebsd_triage 2020-01-28 17:38:54 UTC
One vague idea is that the additional modes supported on Epyc (SHA1/SHA256) confuse the geom_eli / OpenCryptoFramework interaction and result in GELI not finding aesni (waving hands here).  I think I remember discussing something like this with jhb@, but maybe not.
Comment 4 Nick Evans 2020-01-28 21:52:30 UTC
Epyc:

aesni0: <AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS,SHA1,SHA256>


Xeon:

aesni0: <AES-CBC,AES-XTS,AES-GCM,AES-ICM> on motherboard


Working on getting the GELI-specific numbers. The long and short of it is that if I dd from all 8 geli providers to /dev/null I see the same general results: 20-25% idle on the Epyc box and 60-70% idle on the Xeon, with the Epyc pushing slightly more from each drive (~280 MB/s Epyc vs ~250 MB/s Xeon). I was seeing some odd performance results when doing dd tests against single drives, or when testing multiple drives with ZFS in the mix. From benchmarks of the two CPUs these systems should be nearly identical for most tasks. I'm going to sync the -CURRENT versions now and review the setups again to be sure nothing changed on me. I should have something in a few days.
Comment 5 Conrad Meyer freebsd_committer freebsd_triage 2020-01-28 22:06:39 UTC
I suppose it's possible that the AES-NI (or SSE) intrinsics on Epyc are just slower.  I wouldn't expect such a substantial difference, though.

It's also interesting that Epyc reports AES-XTS support while the Xeon does not. That suggests the Xeon is running a different (older) version of FreeBSD than the Epyc. Maybe that's just from installing CURRENT on the Epyc, but for an apples-to-apples comparison I'd suggest running the same version.

When you installed CURRENT, did you compile a GENERIC-NODEBUG kernel / define MALLOC_PRODUCTION for world?  Or is this just a snapshot published on the freebsd ftp server?  If the latter, it's a debugging-enabled kernel, which will result in substantially higher sys time.
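
For reference, a non-debug build roughly along those lines (the src.conf knob and -j value are assumptions about the usual workflow, not a prescription):

  # /etc/src.conf
  WITH_MALLOC_PRODUCTION=yes

  cd /usr/src
  make -j16 buildworld buildkernel KERNCONF=GENERIC-NODEBUG
  make installkernel KERNCONF=GENERIC-NODEBUG && make installworld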
Comment 6 Nick Evans 2020-01-28 22:14:11 UTC
Both report XTS; the Epyc also has CCM, and the lists appear to be ordered differently. I am using NODEBUG on the Epyc and am building the Xeon with it now.
Comment 7 Conrad Meyer freebsd_committer freebsd_triage 2020-01-28 22:18:19 UTC
(In reply to Nick Evans from comment #6)
Ah, you're right, I only saw CBC/ICM were in the same place and forgot that the
list order had changed at one point.
Comment 8 dewayne 2020-01-29 01:07:38 UTC
(In reply to Nick Evans from comment #4)
Nick, as a data point my Xeon, circa 2015, reveals:
CPU: Intel(R) Xeon(R) CPU E3-1230L v3 @ 1.80GHz (1795.88-MHz K8-class CPU) 
aesni0: <AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS> on motherboard
for 
FreeBSD 12.1-STABLE #0 r356046M: Tue Dec 24 22:27:08 AEDT 2019
(I use geli on a few partitions, including swap, though for a box with a high CPU workload I set:
kern.geom.eli.batch=1
kern.geom.eli.threads=2
)
Comment 9 John Baldwin freebsd_committer freebsd_triage 2020-01-30 16:48:26 UTC
I would start by using dtrace and/or printfs to see if aesni_newsession is failing when the GELI disks are attached (if they are mounted as root during boot you are stuck with printfs and having to reboot each time to narrow it down).
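
A quick dtrace sketch for that check (assuming the function is still called aesni_newsession on your branch) would be something like:

  kldload dtraceall
  # count aesni_newsession returns, keyed by return value (0 = success)
  dtrace -n 'fbt:aesni:aesni_newsession:return { @[probefunc, arg1] = count(); }'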
Comment 10 Nick Evans 2020-01-30 17:33:58 UTC
(In reply to dewayne from comment #8)

So far results are the same with both boxes being on -CURRENT with NODEBUG kernels so at least that's ruled out.


eli.batch=1 alone helps the CPU usage, but at the expense of throughput, at least on the Epyc box: it goes from about 280 MB/s per disk to 180 MB/s. Idleness went up to 60%, but probably only because of the drop in overall throughput.

The eli.threads=2 setting makes a big difference on the Epyc box. Per-disk throughput went up to 330 MB/s and overall idleness went up to 92%, running dd if=/dev/da#.eli of=/dev/null bs=1m, one per disk. batch=1, even with threads=2, doesn't seem to help in this case.
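
For reference, that per-disk test boils down to something like (disk names are an assumption):

  # one dd per GELI provider, all running in parallel
  for d in da0 da1 da2 da3 da4 da5 da6 da7; do
      dd if=/dev/${d}.eli of=/dev/null bs=1m &
  done
  wait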

I guess there's some kind of thrashing going on when the default 32 threads per disk are created that affects the Epyc box more than the Xeon. I'll run some tests at different thread counts and report back. Maybe we can at least come up with more sensible defaults.
Comment 11 Conrad Meyer freebsd_committer freebsd_triage 2020-01-30 17:49:21 UTC
Yeah, I suspect NCPU threads per disk across 8 disks is not really a great default!  geli's ncpu default is probably reasonable for single-disk laptops, but it doesn't factor in larger arrays.  threads=2 x 8 disks gets you to 16, or the number of real cores.  IIRC Epyc has only 128-bit-wide vector units internally, but I don't see how that would affect aesni(4); AES itself is a 128-bit cipher, and the aesni(4) driver only uses SSE intrinsics, which act on 128-bit vectors.  It may simply have fewer vector units, and attempting to use 32 of them at the same time contends for shared resources.
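
To see how many worker threads the default actually creates, something like this should do (procstat output details may vary by release):

  sysctl hw.ncpu kern.geom.eli.threads
  # count the g_eli worker threads across all providers
  procstat -a -t | grep -c 'g_eli\['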
Comment 12 Nick Evans 2020-01-30 18:15:08 UTC
(In reply to Conrad Meyer from comment #11)

I've also noticed that with the lower thread counts the g_eli threads are sometimes pinned to SMT cores too. I can't cpuset them to the physical cores either. There would probably be some benefit to pinning them only to the physical cores. AMD's SMT is better than Intel's, and I'm not sure whether there's a penalty for using SMT cores rather than physical ones on *BSD, but in my experience with Windows, at least, there is.
Comment 13 Conrad Meyer freebsd_committer freebsd_triage 2020-01-30 18:17:39 UTC
I don't believe there's any real difference between the two SMT cores of a real core, although I may be mistaken.  Or do you mean two threads are pinned to the corresponding SMT threads of a single core, rather than to distinct physical cores?  Yeah, that could impact realized performance.
Comment 14 Conrad Meyer freebsd_committer freebsd_triage 2020-01-30 18:18:06 UTC
Correction: Two SMT *threads* of a real core, of course.
Comment 15 Nick Evans 2020-01-30 19:31:54 UTC
(In reply to Conrad Meyer from comment #13)

Actually, I hadn't noticed it initially, but while testing threads=1 it did pin two threads to cores 0/1, 10/11, and 20/21, which should be the physical and SMT pair for each physical core. The penalties I've seen on Windows amounted to a few percent when binding a CPU-hungry thread to a physical vs. SMT core (i.e. core 0 vs. core 1) with no load on the other core in the pair. I'll test that separately on FreeBSD at some point to see if the same penalty applies.

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 1443 root          1  20    -     0B    16K geli:w  28   0:47   0.00% g_eli[0] da1
 1446 root          1  20    -     0B    16K geli:w   1   0:47   0.00% g_eli[0] da2
 1452 root          1  20    -     0B    16K geli:w  20   0:46   0.00% g_eli[0] da4
 1458 root          1  20    -     0B    16K geli:w  11   0:46   0.00% g_eli[0] da6
 1461 root          1  20    -     0B    16K geli:w   3   0:46   0.00% g_eli[0] da7
 1449 root          1  20    -     0B    16K geli:w   0   0:46   0.00% g_eli[0] da3
 1440 root          1  20    -     0B    16K geli:w  10   0:46   0.00% g_eli[0] da0
 1455 root          1  20    -     0B    16K geli:w  21   0:45   0.00% g_eli[0] da5
Comment 16 Nick Evans 2020-01-30 21:32:06 UTC
Can confirm the Xeon benefits from threads=1 or threads=2 as well.
Comment 17 Eirik Oeverby 2020-03-10 11:45:08 UTC
(In reply to Nick Evans from comment #16)

I recently got an EPYC Rome box to replace my old Nehalem, and while it's screaming fast at just about anything I throw at it, the system load frequently hovers in the 30s (32/64 cores across 2 CPUs). There isn't any excessive I/O going on, and load on the old box was <<12 pretty much all the time. The old box did not do GELI at all; the new one does - but again, there is no significant I/O to speak of.

I did set threads to 2, but I saw no effect - I suppose a reboot is needed (and this should go in loader.conf, correct?)
Comment 18 Nick Evans 2020-03-10 14:17:55 UTC
(In reply to Eirik Oeverby from comment #17)

Correct, kern.geom.eli.threads=2 in /boot/loader.conf and a reboot should do it.
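
In other words, roughly this (2 is just the value discussed above; tune it to your provider count):

  # /boot/loader.conf
  kern.geom.eli.threads=2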
Comment 19 Eirik Oeverby 2020-03-10 14:44:47 UTC
(In reply to Nick Evans from comment #18)

I can confirm that threads=2 brought my load from ~30 to ~2. Funky.
Comment 20 Eirik Oeverby 2020-12-22 13:40:46 UTC
I believe this issue needs some more attention. I cannot tell whether the hardware path is indeed used on EPYC, and I have no suitable systems to test this on. However, it would appear to me that the high loads and relatively low throughput we're seeing should be considered a bug?
Comment 21 Conrad Meyer freebsd_committer freebsd_triage 2020-12-22 15:54:21 UTC
Which geli mode is being used?  Just the ordinary XTS encryption, or one of the authenticated modes?  I agree it sounds like hardware isn't getting used and that's a problem.  I've had other anecdotal reports that aesni on stable/12 isn't accelerating some workloads.  Have you been able to try dtracing aesni to see if the accelerated primitives get hit and/or errors are being emitted?
Comment 22 Conrad Meyer freebsd_committer freebsd_triage 2020-12-22 15:55:03 UTC
(And: thanks for the report and continuing to push attention on this.)
Comment 23 Conrad Meyer freebsd_committer freebsd_triage 2020-12-22 16:01:14 UTC
Also note: John Baldwin found and fixed several aesni bugs in CURRENT in the middle of this year.  I'm not sure whether those bugs were present in stable/12 or whether the fixes were MFC'd; even if they were, I don't think they would be in 12.1, and I don't know about 12.2.  John might know stable/* better (and maybe can comment on the bugs).
Comment 24 Eirik Oeverby 2020-12-22 16:02:20 UTC
(In reply to Conrad Meyer from comment #21)

Crypto mode is plain AES-XTS, same as in the original report from Nick.

I have neither a suitable environment on hand nor the necessary knowledge of how to do such an analysis. If given some pointers (preferably pre-digested for this purpose) to help overcome the second part, I'll probably be motivated to try to find a solution to the first, too - if nobody beats me to it.
Comment 25 Conrad Meyer freebsd_committer freebsd_triage 2020-12-22 16:22:27 UTC
Thanks, Eirik.  Let me see if I can come up with some copy-paste incantations to try.  I'm not a dtrace expert myself. :-)

Load dtrace support: sudo kldload dtraceall

First, to see if aesni functions are being hit at all:

  sudo dtrace -n 'fbt:aesni::entry { @[probefunc] = count() }'

Let it run for maybe a few seconds while there is any mild load to the GELI disks, then hit Ctrl-C.  After Ctrl-C, wait a few seconds for dtrace to print output and exit.  If the output is empty, nothing is calling into aesni.  Otherwise, you should see a listing of (aesni) function name, invocation count pairs.

If you get anything from the earlier trace, the following one should print out (approximately) any errors returned by aesni functions (columns are: function, error number, count):

  sudo dtrace -n 'fbt:aesni::return /arg1 > 0 && arg1 < 1000/ { @[probefunc,arg1] = count() }'
Comment 26 Eirik Oeverby 2020-12-23 17:31:40 UTC
(In reply to Conrad Meyer from comment #25)

Let's see... About 10 seconds of sampling while deliberately generating some read load on a geli-encrypted ZFS volume (find . -type f -exec cat {} \; > /dev/null) - there's plenty of writing to that volume at any given time anyway:

  aesni_encrypt_xts                                             38168
  aesni_decrypt_xts                                            380112
  aesni_cipher_crypt                                           418280
  aesni_cipher_setup_common                                    418280
  aesni_crypt_xts                                              418280
  aesni_process                                                418280

The second probe gives no output at all.

System:
hw.model: AMD EPYC 7302 16-Core Processor (2x)
dev.aesni.0.%desc: AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS,SHA1,SHA256
kern.geom.eli.threads: 2

Without kern.geom.eli.threads=2 the system quickly reaches a load of 100 or more.

/Eirik
Comment 27 John Baldwin freebsd_committer freebsd_triage 2020-12-24 19:22:55 UTC
(In reply to Eirik Oeverby from comment #26)
This means that the Epyc box is in fact using the "hardware" (well, accelerated software) instructions like the Xeon, so your issue is not that it is doing "plain" software encryption.  I would suggest perhaps using hwpmc to investigate where the CPUs are spending time, but the title of this bug is rather stale at this point.  Investigating the poorer performance on Epyc might be worth doing in a new PR.  Alan Somers has done some performance work with geli in current and might have some ideas on areas to investigate.
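
A hwpmc starting point might look roughly like the following (whether the generic event aliases work depends on hwpmc's support for your particular CPU; treat it as a sketch, not a recipe):

  kldload hwpmc
  # live, top-like view of where CPU time is going, system-wide
  pmcstat -T -S instructions -w 1
  # or log samples and post-process them into a callgraph
  pmcstat -S instructions -O samples.pmc
  pmcstat -R samples.pmc -G callgraph.txt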
Comment 28 Eirik Oeverby 2021-01-19 23:46:44 UTC
(In reply to John Baldwin from comment #27)
Is that something that is likely to happen any time soon? :)

It turns out that with 12.2 (and probably earlier - I don't know, since we didn't do this before) the impact of enabling geli for a large number of disks (SSDs in our case, 8-24 devices depending on the system) is that system load averages are much, much higher than normal. Reducing the number of threads immediately brings the load back down.

Note: This is not limited to AMD/EPYC - most recently we saw it have a pretty dramatic effect on Intel servers, too.
Comment 29 Alan Somers freebsd_committer freebsd_triage 2021-01-20 00:14:04 UTC
Yes, the default value of kern.geom.eli.threads only really makes sense if you have 1-2 geli providers.  If you have lots (I have hundreds), you should set kern.geom.eli.threads=1.

I do have a WIP patch that would switch geli from using a per-provider thread pool to a single global thread pool.  That would eliminate the need to tune that sysctl and improve overall performance to boot.  However, the patch is held up by an incompatibility with ccp(4).  ccp is a hardware crypto device found on some AMD systems.  I notice that you're using AMD.  Does your system have ccp? (I think it will show up as "AMD CCP" in `pciconf -lv`.)  And if so, would you be willing to help test changes?
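
One way to check for it (the grep pattern is only a guess at the device string):

  pciconf -lv | grep -B2 -A1 -i 'crypto'
  # or see whether the ccp(4) driver attaches to anything
  kldload ccp && dmesg | tail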

https://reviews.freebsd.org/D25747
Comment 30 Eirik Oeverby 2021-01-20 00:28:23 UTC
(In reply to Alan Somers from comment #29)
I see.

I only have this in pciconf -lv (this is from 12.2):

none19@pci0:34:0:1:     class=0x108000 card=0x14861022 chip=0x14861022 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = 'Starship/Matisse Cryptographic Coprocessor PSPCPP'
    class      = encrypt/decrypt

none40@pci0:162:0:1:    class=0x108000 card=0x14861022 chip=0x14861022 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = 'Starship/Matisse Cryptographic Coprocessor PSPCPP'
    class      = encrypt/decrypt

There seems to be one per CPU socket (the device shows up twice).

I don't know whether this is indeed covered by the ccp driver, but I can't find any documentation on it, and the discussion at https://reviews.freebsd.org/D12723 seems to indicate it's not really useful...?

Anyway, I'd be happy to test if there's anything meaningful I can do on 12.2. I don't have a 13 system on EPYC (yet), but the promise of serious crypto performance might motivate me to change that. That said, our pain is mostly in handshakes (RSA), not stream crypto (AES)...
Comment 31 Conrad Meyer freebsd_committer freebsd_triage 2021-01-20 00:34:29 UTC
ccp(4) shouldn't gate anything else and can be deleted if it's in the way.  It's not a useful driver.