OS: FreeBSD 12.0-STABLE #23 r346730M: Fri Apr 26 08:38:36 CEST 2019

HBA Controllers:
===========================================================================
Device Name   Chip Name    Board Name    Firmware
/dev/mps0     LSISAS2308   SAS9207-8i    14000700
/dev/mps1     LSISAS2308   SAS9207-8i    14000700
===========================================================================

ZPOOL:
===========================================================================
  pool: morsleben-grube2
 state: ONLINE
  scan: scrub repaired 0 in 0 days 07:21:42 with 0 errors on Sat May 4 10:23:33 2019
config:

        NAME                STATE     READ WRITE CKSUM
        morsleben-grube2    ONLINE       0     0     0
          da0               ONLINE       0     0     0
          da1               ONLINE       0     0     0
          da2               ONLINE       0     0     0
          da3               ONLINE       0     0     0
          da4               ONLINE       0     0     0
          da5               ONLINE       0     0     0
          da6               ONLINE       0     0     0
          da7               ONLINE       0     0     0
          da8               ONLINE       0     0     0
          da9               ONLINE       0     0     0
          da10              ONLINE       0     0     0
          da11              ONLINE       0     0     0

errors: No known data errors
===========================================================================

Writing to that ZVOL is reasonably fast, but reading from that ZVOL is
abysmal at around 20MB/s. Here's an output of "gstat -I 1000000":
===========================================================================
dT: 1.018s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| ada0
    0     79     79   3791    0.3      0      0    0.0    2.2| da0
    0     77     77   3823    0.4      0      0    0.0    2.7| da1
    0     18     18   1131    0.6      0      0    0.0    0.9| da2
    0     20     20   1257   10.1      0      0    0.0   17.5| da3
    0     18     18   1131    1.5      0      0    0.0    2.2| da4
    0     20     20   1257   11.2      0      0    0.0   19.1| da5
    0     17     17   1069    0.9      0      0    0.0    1.0| da6
    0     16     16   1006    0.3      0      0    0.0    0.3| da7
    0     23     23   1446   10.3      0      0    0.0   16.9| da8
    2     22     22   1383   13.0      0      0    0.0   19.6| da9
    0     79     79   1862    0.3      0      0    0.0    1.9| da10
    0     77     77   2043    2.3      0      0    0.0   12.3| da11
    1    115    115  14709    8.5      0      0    0.0   98.0| zvol/morsleben-grube2/test
===========================================================================

How can a ZVOL be 100% busy although its underlying drives still have plenty of headroom?

TIA and BR,
Nils
Disabling/enabling ZFS prefetch, or disabling "primarycache" or "secondarycache", makes no difference...
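For reference, the knobs in question are roughly these (a sketch; prefetch is the global vfs.zfs.prefetch_disable sysctl of stock FreeBSD 12 ZFS, caching the per-dataset properties on the test zvol from above):
--------------------------------------------------------------------------------
# disable ZFS file-level prefetch globally (1 = disabled, 0 = enabled)
sysctl vfs.zfs.prefetch_disable=1

# disable ARC/L2ARC data caching for the test zvol
zfs set primarycache=none   morsleben-grube2/test
zfs set secondarycache=none morsleben-grube2/test
--------------------------------------------------------------------------------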
These are random reads, right? How are you benchmarking it? Is it over iSCSI?

If the benchmark program sends its read requests one after the other (i.e., it waits for a read to return before sending the next), then it's effectively waiting on one random read being dispatched to one disk every time. So the latency of the disks is the limiting factor, and there isn't much that can be done to reduce that latency (apart from caching).

And I think the way gstat measures %busy is that it gives you the fraction of time during which there was at least one operation in flight. Because the zvol is almost always waiting on at least one disk, it shows up as nearly 100% busy when measured this way. I'm guessing %busy just wasn't designed to measure devices that can serve a lot of requests concurrently. I'm not certain, though.

The way to speed this up is for the zvol to receive concurrent read requests, so that they can be dispatched to multiple disks concurrently. If you can do that, running multiple benchmarks at the same time on the zvol should show a much higher total throughput. Or maybe you can tell the benchmark program to use multiple threads/processes, or to use AIO.

Maybe it is already trying to send concurrent reads, but the requests get serialized somewhere along the way. Say you use iSCSI or a VM: the benchmark program must have a way to tell the client kernel to issue concurrent requests (if that I/O goes through a filesystem, that filesystem must have good support for issuing I/O concurrently); that must be translated into concurrent requests over iSCSI (through TCQ) or whatever VM<->host protocol is used; and that in turn must be translated into concurrent requests against the zvol (which might not happen in all cases depending on how the zvol is being interfaced with, I'm not sure). Even if all of that works well, you probably won't get the full throughput you would get locally, but it should still be much better than the performance of a single disk.

In any case, your pool should be able to serve way more requests than that in total, but when the load is directed at a single zvol there has to be a way to submit those requests concurrently.
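For example, a crude way to generate concurrent reads against the zvol locally would be several dd processes starting at different offsets (just a sketch; the offsets and counts are arbitrary and would need to be adapted to the zvol size):
--------------------------------------------------------------------------------
# four readers, each starting 2GB apart and reading 2GB
for OFF in 0 2 4 6; do
        dd if=/dev/zvol/morsleben-grube2/test of=/dev/null bs=1M \
                skip=$((OFF * 2048)) count=2k &
done
wait
--------------------------------------------------------------------------------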
(In reply to sigsys from comment #2)

the benchmark tool is completely sequential, as it is a simple:

dd if=/dev/zvol/morsleben-grube2/test of=/dev/null bs=1M
^C
375390208 bytes transferred in 15.330245 secs (24486902 bytes/sec)

whereas the ZVOL has been filled beforehand with random data using:

dd if=/dev/urandom of=/dev/zvol/morsleben-grube2/test bs=1M
^C
8544845824 bytes transferred in 77.684268 secs (109994546 bytes/sec)

"primarycache" and "secondarycache" are disabled for that ZVOL in order to measure real read speeds from disk...

Now, I've followed your advice and tried two identical instances of "dd" reading from the ZVOL. And you are right, I get twice the throughput: 40MB/s now. It is still abysmal if you ask me, but okay...

The slow read throughput also exists over the iSCSI connection from a Windows Server (the ZPOOL "morsleben-grube2" is a backup target). So I have to convince the Microsoft iSCSI initiator somehow to issue more concurrent reads.

Thank you for the pointers...
Can you retry with primarycache=metadata?
(In reply to crest from comment #4)

with "primarycache=metadata" and "secondarycache=none" it is much, much better... A factor of ten: I get around 200MB/s now (no joke) with one single "dd". With two "dd" instances, I get around 400MB/s.

What is the logic behind "primarycache=metadata" actually being more performant than "primarycache=all"?
There are three caching levels (all, metadata, none). At the default level of "all", both metadata and data get cached. This offers the best performance in most cases, but it can become a problem if it leads to double buffering (e.g. on a VM host with data being cached once by the host and again by the guest OS), because both caches then fight for resources. Disabling all caching requires ZFS to walk the on-disk block tree for every access, which is very slow.

Limiting caching to metadata is a compromise between the two extremes: ZFS caches everything except the actual data. This avoids the need to (re-)read metadata from disk without double-caching the data. The rule of thumb is to cache data just once per physical system, as close to the user as possible (it's better to cache inside a VM than outside of it).

Also keep in mind that the L2ARC is a pure victim cache. If the primary cache is limited to metadata, it will never contain anything but metadata that could be evicted to the secondary cache.

Can you tell us more about your setup? What are the ZVOLs used for? What else is running on the system?
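For reference, switching between those levels is just a per-dataset property change, e.g. on the test zvol from above (a sketch; adapt the dataset name to your layout):
--------------------------------------------------------------------------------
# cache only metadata in the ARC, nothing in the L2ARC
zfs set primarycache=metadata morsleben-grube2/test
zfs set secondarycache=none   morsleben-grube2/test

# verify the effective values and where they come from
zfs get primarycache,secondarycache morsleben-grube2/test
--------------------------------------------------------------------------------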
(In reply to crest from comment #6)

the ZVOLs are used for ReFS-formatted (64kB cluster size) Veeam backup repositories connected via iSCSI and 10GBit:
==============================================================================
morsleben-grube2/dshyp02-veeam  type                  volume                 -
morsleben-grube2/dshyp02-veeam  creation              Fri Mar 29 11:01 2019  -
morsleben-grube2/dshyp02-veeam  used                  40.2T                  -
morsleben-grube2/dshyp02-veeam  available             12.8T                  -
morsleben-grube2/dshyp02-veeam  referenced            30.4T                  -
morsleben-grube2/dshyp02-veeam  compressratio         1.00x                  -
morsleben-grube2/dshyp02-veeam  reservation           none                   default
morsleben-grube2/dshyp02-veeam  volsize               40T                    local
morsleben-grube2/dshyp02-veeam  volblocksize          64K                    -
morsleben-grube2/dshyp02-veeam  checksum              on                     default
morsleben-grube2/dshyp02-veeam  compression           lz4                    inherited from morsleben-grube2
morsleben-grube2/dshyp02-veeam  readonly              off                    default
morsleben-grube2/dshyp02-veeam  createtxg             76                     -
morsleben-grube2/dshyp02-veeam  copies                1                      default
morsleben-grube2/dshyp02-veeam  refreservation        40.2T                  local
morsleben-grube2/dshyp02-veeam  guid                  10198386066639651165   -
morsleben-grube2/dshyp02-veeam  primarycache          metadata               local
morsleben-grube2/dshyp02-veeam  secondarycache        none                   local
morsleben-grube2/dshyp02-veeam  usedbysnapshots       0                      -
morsleben-grube2/dshyp02-veeam  usedbydataset         30.4T                  -
morsleben-grube2/dshyp02-veeam  usedbychildren        0                      -
morsleben-grube2/dshyp02-veeam  usedbyrefreservation  9.78T                  -
morsleben-grube2/dshyp02-veeam  logbias               latency                default
morsleben-grube2/dshyp02-veeam  dedup                 off                    default
morsleben-grube2/dshyp02-veeam  mlslabel              -
morsleben-grube2/dshyp02-veeam  sync                  standard               default
morsleben-grube2/dshyp02-veeam  refcompressratio      1.00x                  -
morsleben-grube2/dshyp02-veeam  written               30.4T                  -
morsleben-grube2/dshyp02-veeam  logicalused           30.5T                  -
morsleben-grube2/dshyp02-veeam  logicalreferenced     30.5T                  -
morsleben-grube2/dshyp02-veeam  volmode               default                default
morsleben-grube2/dshyp02-veeam  snapshot_limit        none                   default
morsleben-grube2/dshyp02-veeam  snapshot_count        none                   default
morsleben-grube2/dshyp02-veeam  redundant_metadata    all                    default
==============================================================================

Using CTL as the iSCSI target (4k blocksize) for the Windows Server 2016 hosts:
==============================================================================
lun 0 {
        path /dev/zvol/morsleben-grube2/dshyp02-veeam
        blocksize 4k
}
==============================================================================

Nothing else besides "ctld" is running on the storage system...
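If it helps, the LUN configuration can be cross-checked on the target side with ctladm (just the listing commands, output omitted here):
==============================================================================
ctladm lunlist
ctladm devlist -v
==============================================================================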
Okay, played around a little bit, but regardless of what I try, I only get around 250MB/s read speed from the ZVOL via iSCSI. It's okay-ish but a little disappointing to have that many platters and still get such low read speeds...
(In reply to Nils Beyer from comment #8)

and it's not a network problem, as I get around 700MB/s write speed to the same ZVOL via iSCSI...
maybe I'm too stupid, I don't know. I can't get the pool to read fast...

Created the pool from scratch. Updated to the latest 12-STABLE. But reads from that pool are still abysmal.

Current pool layout:
--------------------------------------------------------------------------------
        NAME             STATE     READ WRITE CKSUM
        veeam-backups    ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            da0          ONLINE       0     0     0
            da1          ONLINE       0     0     0
            da2          ONLINE       0     0     0
          raidz1-1       ONLINE       0     0     0
            da4          ONLINE       0     0     0
            da5          ONLINE       0     0     0
            da7          ONLINE       0     0     0
          raidz1-2       ONLINE       0     0     0
            da9          ONLINE       0     0     0
            da14         ONLINE       0     0     0
            da17         ONLINE       0     0     0
          raidz1-3       ONLINE       0     0     0
            da18         ONLINE       0     0     0
            da21         ONLINE       0     0     0
            da22         ONLINE       0     0     0
          raidz1-4       ONLINE       0     0     0
            da6          ONLINE       0     0     0
            da15         ONLINE       0     0     0
            da16         ONLINE       0     0     0
          raidz1-5       ONLINE       0     0     0
            da11         ONLINE       0     0     0
            da8          ONLINE       0     0     0
            da3          ONLINE       0     0     0
          raidz1-6       ONLINE       0     0     0
            da23         ONLINE       0     0     0
            da20         ONLINE       0     0     0
            da19         ONLINE       0     0     0
          raidz1-7       ONLINE       0     0     0
            da10         ONLINE       0     0     0
            da12         ONLINE       0     0     0
            da13         ONLINE       0     0     0

errors: No known data errors
--------------------------------------------------------------------------------

used bonnie++:
--------------------------------------------------------------------------------
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
veeambackups.local  64G  141  99 471829  59 122365  23     5   8  40084   8  1016  19
Latency             61947us     348ms     618ms    1634ms     105ms     190ms
Version  1.97       ------Sequential Create------ --------Random Create--------
veeambackups.local  -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ 17378  58 +++++ +++ +++++ +++ 32079  99
Latency              2424us      44us     388ms    2295us      36us      91us
1.97,1.97,veeambackups.local,1,1565375578,64G,,141,99,471829,59,122365,23,5,8,40084,8,1016,19,16,,,,,+++++,+++,+++++,+++,17378,58,+++++,+++,+++++,+++,32079,99,61947us,348ms,618ms,1634ms,105ms,190ms,2424us,44us,388ms,2295us,36us,91us
--------------------------------------------------------------------------------

Tested locally. No iSCSI, no NFS. "gstat" tells me that the hard disks are only 15% busy. CPU load averages: 0.51, 0.47, 0.39. ZFS recordsize is the default 128k.

Maybe too many top-level VDEVs? Maybe the HBA sucks for ZFS?

A simple parallel dd using:
--------------------------------------------------------------------------------
for NR in `jot 24 0`; do
        dd if=/dev/da${NR} of=/dev/null bs=1M count=1k &
done
--------------------------------------------------------------------------------
delivers 90MB/s for each of the 24 drives during the run, which results in 90*24 = 2160MB/s total. That should be plenty for the pool.

I'm really out of ideas apart from trying 13-CURRENT or FreeNAS or Linux or or or - which I'd like to avoid...

Needless to say, read performance via NFS or iSCSI is still pathetic, which makes the current setup unusable as an ESXi datastore and makes me afraid of future restore jobs in the TB size range...
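Next thing I might try is driving concurrent sequential reads with fio from ports (sysutils/fio), roughly like this (just a sketch; the mount point /veeam-backups is assumed to be the pool's default, and the job sizes are arbitrary):
--------------------------------------------------------------------------------
fio --name=seqread --directory=/veeam-backups --rw=read --bs=1M \
    --size=4g --numjobs=8 --iodepth=4 --ioengine=posixaio --group_reporting
--------------------------------------------------------------------------------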
increasing concurrency doesn't help:
--------------------------------------------------------------------------------
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency  16     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
veeambackups        64G  156  99 496023  58 117847  21     5   8  40116   8 942.0  17
Latency             54528us     315ms     293ms    1579ms   81741us   74544us
Version  1.97       ------Sequential Create------ --------Random Create--------
veeambackups        -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ 16878  56 +++++ +++ +++++ +++ 31661  99
Latency              2501us      43us     423ms    2225us      38us      91us
1.97,1.97,veeambackups,16,1565363711,64G,,156,99,496023,58,117847,21,5,8,40116,8,942.0,17,16,,,,,+++++,+++,+++++,+++,16878,56,+++++,+++,+++++,+++,31661,99,54528us,315ms,293ms,1579ms,81741us,74544us,2501us,43us,423ms,2225us,38us,91us
--------------------------------------------------------------------------------
^Triage: I'm sorry that this PR did not get addressed in a timely fashion. By now, the version that it was created against is out of support. As well, many newer versions of ZFS have been imported. Please re-open if it is still a problem on a supported version.