Bug 237807 - ZFS: ZVOL writes fast, ZVOL reads abysmal...
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-STABLE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-fs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-05-09 10:41 UTC by Nils Beyer
Modified: 2019-08-09 13:23 UTC

Description Nils Beyer 2019-05-09 10:41:48 UTC
OS: FreeBSD 12.0-STABLE #23 r346730M: Fri Apr 26 08:38:36 CEST 2019


HBA Controllers:
===========================================================================
Device Name           Chip Name        Board Name        Firmware
/dev/mps0             LSISAS2308       SAS9207-8i        14000700
/dev/mps1             LSISAS2308       SAS9207-8i        14000700
===========================================================================


ZPOOL:
===========================================================================
  pool: morsleben-grube2
 state: ONLINE
  scan: scrub repaired 0 in 0 days 07:21:42 with 0 errors on Sat May  4 10:23:33 2019
config:

        NAME        STATE     READ WRITE CKSUM
        morsleben-grube2  ONLINE       0     0     0
          da0       ONLINE       0     0     0
          da1       ONLINE       0     0     0
          da2       ONLINE       0     0     0
          da3       ONLINE       0     0     0
          da4       ONLINE       0     0     0
          da5       ONLINE       0     0     0
          da6       ONLINE       0     0     0
          da7       ONLINE       0     0     0
          da8       ONLINE       0     0     0
          da9       ONLINE       0     0     0
          da10      ONLINE       0     0     0
          da11      ONLINE       0     0     0

errors: No known data errors
===========================================================================



Writing to that ZVOL is reasonably fast, but reading from it is abysmal at around 20 MB/s.



Here's the output of "gstat -I 1000000" taken while reading from the ZVOL:
===========================================================================
dT: 1.018s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| ada0
    0     79     79   3791    0.3      0      0    0.0    2.2| da0
    0     77     77   3823    0.4      0      0    0.0    2.7| da1
    0     18     18   1131    0.6      0      0    0.0    0.9| da2
    0     20     20   1257   10.1      0      0    0.0   17.5| da3
    0     18     18   1131    1.5      0      0    0.0    2.2| da4
    0     20     20   1257   11.2      0      0    0.0   19.1| da5
    0     17     17   1069    0.9      0      0    0.0    1.0| da6
    0     16     16   1006    0.3      0      0    0.0    0.3| da7
    0     23     23   1446   10.3      0      0    0.0   16.9| da8
    2     22     22   1383   13.0      0      0    0.0   19.6| da9
    0     79     79   1862    0.3      0      0    0.0    1.9| da10
    0     77     77   2043    2.3      0      0    0.0   12.3| da11
    1    115    115  14709    8.5      0      0    0.0   98.0| zvol/morsleben-grube2/test
===========================================================================



How can a ZVOL be 100% busy although its underlying drives still have plenty of headroom?



TIA and BR,
Nils
Comment 1 Nils Beyer 2019-05-09 10:44:44 UTC
Disabling/enabling ZFS prefetch, or disabling "primarycache" or "secondarycache", makes no difference...
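
For reference, roughly the knobs I toggled (dataset name taken from the gstat output above; on this 12-STABLE box the prefetch switch should be the vfs.zfs.prefetch_disable sysctl, if I remember correctly):

    sysctl vfs.zfs.prefetch_disable=1                    # 1 = prefetch off, 0 = back on
    zfs set primarycache=none   morsleben-grube2/test    # no ARC caching for the test zvol
    zfs set secondarycache=none morsleben-grube2/test    # no L2ARC caching for the test zvol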
Comment 2 sigsys 2019-05-11 02:59:08 UTC
These are random reads right?

How are you benchmarking it?  Is it over iSCSI?

If the benchmark program sends its read requests one after the other (i.e., it waits until a read returns before sending the next), then it's effectively waiting on one random read being dispatched to one disk every time. So the latency of the disks is the limiting factor, and there isn't much that can be done to reduce that latency (apart from caching).

And I think that the way gstat measures %busy is that it gives you the fraction of time for which there was at least one operation in flight. So because the zvol is almost always waiting on at least one disk, it is nearly 100% busy when measured this way. I'm guessing it just wasn't designed to measure the performance of devices that can potentially serve a lot of requests concurrently. I'm not certain though.

The way to speed this up is for the zvol to receive concurrent read requests.  This way they can be dispatched to multiple disks concurrently.

If you can do that, running multiple benchmarks at the same time on the zvol should show a much higher total throughput.  Or maybe you can tell the benchmark program to use multiple threads/processes or use AIO.
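
Just as a sketch (zvol path taken from your gstat output; sizes and offsets are only examples), something like this would put several reads in flight at once:

    for N in 0 1 2 3; do
        dd if=/dev/zvol/morsleben-grube2/test of=/dev/null bs=1M \
            skip=$((N * 1024)) count=1024 &      # each dd reads its own 1 GiB region
    done
    wait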

Maybe it's trying to send concurrent reads, but the problem is that somewhere along the way the requests get serialized. Say you use iSCSI or a VM: the benchmark program must have a way to tell the client kernel to issue concurrent requests (if that IO goes through a FS, then that FS must have good support for issuing IO concurrently), then this must be translated to concurrent requests over iSCSI (through TCQ) or whatever VM<->host protocol is in use. And then this must be translated to concurrent requests against the zvol (which might not happen in all cases depending on how that zvol is being interfaced with, I'm not sure). Even if all that works well you probably won't get the full throughput that you would get locally, but it should still be much better than the performance of a single disk.

In any case, your pool should be able to serve way more requests than that in total, but when it's directed to a single zvol there has to be a way to submit them concurrently.
Comment 3 Nils Beyer 2019-05-13 11:07:32 UTC
(In reply to sigsys from comment #2)

the benchmark tool is completely sequential, as it is a simple:

    dd if=/dev/zvol/morsleben-grube2/test of=/dev/null bs=1M
    ^C
    375390208 bytes transferred in 15.330245 secs (24486902 bytes/sec)


whereas the ZVOL had been filled beforehand with random data using:

    dd if=/dev/urandom of=/dev/zvol/morsleben-grube2/test bs=1M
    ^C
    8544845824 bytes transferred in 77.684268 secs (109994546 bytes/sec)


"primarycache" and "secondarycache" are disabled for that ZVOL to get real read speeds...


Now, I've followed your advice and tried two identical instances of "dd" reading from the ZVOL. And you are right, I get twice the throughput: 40 MB/s now. It's still abysmal if you ask me, but okay...


The slow read throughput also shows up over the iSCSI connection from a Windows Server (the ZPOOL "morsleben-grube2" is a backup target). So I have to convince the Microsoft iSCSI initiator somehow to issue more concurrent reads.


Thank you for the pointers...
Comment 4 crest 2019-05-13 11:11:44 UTC
Can you retry with primarycache=metadata?
Comment 5 Nils Beyer 2019-05-13 11:51:39 UTC
(In reply to crest from comment #4)

with "primarycache=metadata" and "secondarycache=none" much much better... Factor ten; get around 200MB/s now (no joke) with one single "dd".

With two "dd" instances, I get around 400MB/s.


What is the logic behind "primarycache=metadata" actually being more performant than "primarycache=all"?
Comment 6 crest 2019-05-13 13:24:54 UTC
There are three caching levels (all, metadata, none). At the default level of "all", both metadata and data get cached. This offers the best performance in most cases, but it can become a problem if it leads to double buffering (e.g. on a VM host with data being cached once by the host and again by the guest OS), because both caches then fight for the same memory. Disabling all caching forces ZFS to walk the on-disk block tree for every access, which is very slow. Limiting caching to metadata is a compromise between the two extremes: ZFS caches everything but the actual data, which avoids having to (re-)read metadata from disk while still avoiding double-caching of the data.

The rule of thumb is to cache data just once per physical system as close to the user as possible (it's better to cache inside a VM than outside of it). Also keep in mind that the L2ARC is a pure victim cache. If the primary cache is limited to metadata it will never contain anything but metadata to be evicted to the secondary cache.
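
For completeness, the settings in question look like this (dataset name taken from your earlier comments; adjust to the zvol you are testing):

    zfs set primarycache=metadata morsleben-grube2/test
    zfs set secondarycache=none   morsleben-grube2/test
    zfs get primarycache,secondarycache morsleben-grube2/test   # verify what is set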

Can you tell us more about your setup? What are the ZVOLs used for? What else is running on the system?
Comment 7 Nils Beyer 2019-05-13 13:58:32 UTC
(In reply to crest from comment #6)

the ZVOLs are used for ReFS-formatted (64 kB cluster size) Veeam backup repositories, connected via iSCSI over 10 GBit:
==============================================================================
morsleben-grube2/dshyp02-veeam  type                  volume                 -
morsleben-grube2/dshyp02-veeam  creation              Fri Mar 29 11:01 2019  -
morsleben-grube2/dshyp02-veeam  used                  40.2T                  -
morsleben-grube2/dshyp02-veeam  available             12.8T                  -
morsleben-grube2/dshyp02-veeam  referenced            30.4T                  -
morsleben-grube2/dshyp02-veeam  compressratio         1.00x                  -
morsleben-grube2/dshyp02-veeam  reservation           none                   default
morsleben-grube2/dshyp02-veeam  volsize               40T                    local
morsleben-grube2/dshyp02-veeam  volblocksize          64K                    -
morsleben-grube2/dshyp02-veeam  checksum              on                     default
morsleben-grube2/dshyp02-veeam  compression           lz4                    inherited from morsleben-grube2
morsleben-grube2/dshyp02-veeam  readonly              off                    default
morsleben-grube2/dshyp02-veeam  createtxg             76                     -
morsleben-grube2/dshyp02-veeam  copies                1                      default
morsleben-grube2/dshyp02-veeam  refreservation        40.2T                  local
morsleben-grube2/dshyp02-veeam  guid                  10198386066639651165   -
morsleben-grube2/dshyp02-veeam  primarycache          metadata               local
morsleben-grube2/dshyp02-veeam  secondarycache        none                   local
morsleben-grube2/dshyp02-veeam  usedbysnapshots       0                      -
morsleben-grube2/dshyp02-veeam  usedbydataset         30.4T                  -
morsleben-grube2/dshyp02-veeam  usedbychildren        0                      -
morsleben-grube2/dshyp02-veeam  usedbyrefreservation  9.78T                  -
morsleben-grube2/dshyp02-veeam  logbias               latency                default
morsleben-grube2/dshyp02-veeam  dedup                 off                    default
morsleben-grube2/dshyp02-veeam  mlslabel                                     -
morsleben-grube2/dshyp02-veeam  sync                  standard               default
morsleben-grube2/dshyp02-veeam  refcompressratio      1.00x                  -
morsleben-grube2/dshyp02-veeam  written               30.4T                  -
morsleben-grube2/dshyp02-veeam  logicalused           30.5T                  -
morsleben-grube2/dshyp02-veeam  logicalreferenced     30.5T                  -
morsleben-grube2/dshyp02-veeam  volmode               default                default
morsleben-grube2/dshyp02-veeam  snapshot_limit        none                   default
morsleben-grube2/dshyp02-veeam  snapshot_count        none                   default
morsleben-grube2/dshyp02-veeam  redundant_metadata    all                    default
==============================================================================



Using CTL as the iSCSI target (4k blocksize) for the Windows Server 2016 hosts:
==============================================================================
        lun 0 {
                path /dev/zvol/morsleben-grube2/dshyp02-veeam
                blocksize 4k
        }
==============================================================================



Nothing else besides "ctld" is running on the storage system...
Comment 8 Nils Beyer 2019-05-16 08:24:16 UTC
Okay, I've played around a little bit, but regardless of what I try I only get around 250 MB/s read speed from the ZVOL via iSCSI.

It's okayish, but a little disappointing to have this many platters and only get such low read speeds...
Comment 9 Nils Beyer 2019-05-16 08:27:49 UTC
(In reply to Nils Beyer from comment #8)

and it's not a network problem, as I get around 700 MB/s write speeds to the same ZVOL via iSCSI...
Comment 10 Nils Beyer 2019-08-09 12:44:57 UTC
maybe I'm too stupid, I don't know. I can't get the pool to read fast...

Created the pool from scratch. Updated to latest 12-STABLE. But reads from that pool are still abysmal.

Current pool layout:
--------------------------------------------------------------------------------
        NAME        STATE     READ WRITE CKSUM
        veeam-backups  ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
          raidz1-1  ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da7     ONLINE       0     0     0
          raidz1-2  ONLINE       0     0     0
            da9     ONLINE       0     0     0
            da14    ONLINE       0     0     0
            da17    ONLINE       0     0     0
          raidz1-3  ONLINE       0     0     0
            da18    ONLINE       0     0     0
            da21    ONLINE       0     0     0
            da22    ONLINE       0     0     0
          raidz1-4  ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da15    ONLINE       0     0     0
            da16    ONLINE       0     0     0
          raidz1-5  ONLINE       0     0     0
            da11    ONLINE       0     0     0
            da8     ONLINE       0     0     0
            da3     ONLINE       0     0     0
          raidz1-6  ONLINE       0     0     0
            da23    ONLINE       0     0     0
            da20    ONLINE       0     0     0
            da19    ONLINE       0     0     0
          raidz1-7  ONLINE       0     0     0
            da10    ONLINE       0     0     0
            da12    ONLINE       0     0     0
            da13    ONLINE       0     0     0

errors: No known data errors
--------------------------------------------------------------------------------


used bonnie++:
--------------------------------------------------------------------------------
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
veeambackups.local 64G   141  99 471829  59 122365  23     5   8 40084   8  1016  19
Latency             61947us     348ms     618ms    1634ms     105ms     190ms
Version  1.97       ------Sequential Create------ --------Random Create--------
veeambackups.local -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ 17378  58 +++++ +++ +++++ +++ 32079  99
Latency              2424us      44us     388ms    2295us      36us      91us
1.97,1.97,veeambackups.local,1,1565375578,64G,,141,99,471829,59,122365,23,5,8,40084,8,1016,19,16,,,,,+++++,+++,+++++,+++,17378,58,+++++,+++,+++++,+++,32079,99,61947us,348ms,618ms,1634ms,105ms,190ms,2424us,44us,388ms,2295us,36us,91us
--------------------------------------------------------------------------------

tested locally. No iSCSI, no NFS.

"gstat" tells me that the harddisks are only 15% busy.
CPU load averages:  0.51,  0.47,  0.39

ZFS recordsize is default 128k.

Maybe too many top-level VDEVs?
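
For what it's worth, per-vdev load during a run can be watched with (pool name from above; the 1-second interval is just an example):
--------------------------------------------------------------------------------
zpool iostat -v veeam-backups 1
--------------------------------------------------------------------------------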


Maybe the HBA sucks for ZFS? A simple parallel DD using:
--------------------------------------------------------------------------------
for NR in `jot 24 0`; do
        dd if=/dev/da${NR} of=/dev/null bs=1M count=1k &
done
--------------------------------------------------------------------------------
delivers 90 MB/s for each of the 24 drives during the run, which results in 90*24 = 2160 MB/s in total. Should be plenty for the pool.


I'm really out of ideas apart from trying 13-CURRENT or FreeNAS or Linux or or or - which I'd like to avoid...

Needless to say, read performance via NFS or iSCSI is still pathetic, which makes the current setup unusable as an ESXi datastore and makes me afraid of future restore jobs in the TB size range...
Comment 11 Nils Beyer 2019-08-09 13:23:02 UTC
increasing concurrency doesn't help:
--------------------------------------------------------------------------------
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency  16     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
veeambackups 64G   156  99 496023  58 117847  21     5   8 40116   8 942.0  17
Latency             54528us     315ms     293ms    1579ms   81741us   74544us
Version  1.97       ------Sequential Create------ --------Random Create--------
veeambackups -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ 16878  56 +++++ +++ +++++ +++ 31661  99
Latency              2501us      43us     423ms    2225us      38us      91us
1.97,1.97,veeambackups,16,1565363711,64G,,156,99,496023,58,117847,21,5,8,40116,8,942.0,17,16,,,,,+++++,+++,+++++,+++,16878,56,+++++,+++,+++++,+++,31661,99,54528us,315ms,293ms,1579ms,81741us,74544us,2501us,43us,423ms,2225us,38us,91us
--------------------------------------------------------------------------------