Bug 244665 - zfs: Very slow NFS I/O during resilver - _sleep() sleeps really long...
Summary: zfs: Very slow NFS I/O during resilver - _sleep() sleeps really long...
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.3-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-fs (Nobody)
URL:
Keywords: needs-qa, performance
Depends on:
Blocks:
 
Reported: 2020-03-08 00:45 UTC by Peter Eriksson
Modified: 2020-05-07 14:23 UTC
CC List: 1 user

See Also:
koobs: mfc-stable12?
koobs: mfc-stable11?


Attachments
dmesg.boot (39.08 KB, text/plain) - 2020-03-08 07:37 UTC, Peter Eriksson
zpool status (2.32 KB, text/plain) - 2020-03-08 07:37 UTC, Peter Eriksson
zfs-stats -a (12.30 KB, text/plain) - 2020-03-08 07:37 UTC, Peter Eriksson
loader.conf (750 bytes, text/plain) - 2020-03-08 07:38 UTC, Peter Eriksson
sysctl.conf (618 bytes, text/plain) - 2020-03-08 07:38 UTC, Peter Eriksson
"sysctl -a" output (362.38 KB, text/plain) - 2020-03-08 07:38 UTC, Peter Eriksson
Test program for measuring latency of FS operations (16.12 KB, text/plain) - 2020-03-08 14:30 UTC, Peter Eriksson
Dtrace script to trace kernel NFS mkdir/rmdir (9.13 KB, text/plain) - 2020-03-08 14:33 UTC, Peter Eriksson
Another Dtrace script to trace *all* NFS activity in the kernel (2.83 KB, text/plain) - 2020-03-10 13:08 UTC, Peter Eriksson

Description Peter Eriksson 2020-03-08 00:45:00 UTC
I've been investigating an issue we're having that gives our users a really lousy NFS experience during ZFS resilver on one of our servers. 

In order to try to pinpoint exactly what's going on I've tried various things. Lately I've been using dtrace on the call path for "mkdir()", and it seems that it is the _sleep() kernel call that sometimes sleeps for _really_ long times, like 1-20 seconds...

Sure, the kernel is busy doing the ZFS resilver and other stuff, but this is silly... Any ideas? Is the priority of the _sleep() (prio=99?) too low compared to other kernel threads, so it never gets scheduled?


One example:

 21  54842               nfsrvd_dorpc:entry Start
 21  47273                 nfsv4_lock:entry Start(lp->nfslock_lock=6, iwantlock=0)
 21  37590                  nfsmsleep:entry Start(ffffffff81e9982c, ffffffff81e998a0, 99)
 21  54396                     _sleep:entry Start(prio=99, timeo=0)
 16  54397                    _sleep:return   10942618 µs
 16  37591                 nfsmsleep:return   10942626 µs
 16  37590                  nfsmsleep:entry Start(ffffffff81e9982c, ffffffff81e998a0, 99)
 16  54396                     _sleep:entry Start(prio=99, timeo=0)
  7  54397                    _sleep:return        344 µs
  7  37591                 nfsmsleep:return        352 µs
  7  47274                nfsv4_lock:return   10942987 µs
  7  13973           nfsrvd_statstart:entry Start
  7  13974          nfsrvd_statstart:return          2 µs
  7  44369     nfsrv_mallocmget_limit:entry Start
  7  44370    nfsrv_mallocmget_limit:return          2 µs
  7  13971             nfsrvd_statend:entry Start
  7  13972            nfsrvd_statend:return          2 µs
  7  13973           nfsrvd_statstart:entry Start
  7  13974          nfsrvd_statstart:return          1 µs
  7  51444                nfsrv_mtofh:entry Start
  7  51445               nfsrv_mtofh:return          2 µs
  7  47329                nfsd_fhtovp:entry Start
  7  48059           __mtx_lock_flags:entry Start
  7  48060          __mtx_lock_flags:return          2 µs
  7  47330               nfsd_fhtovp:return         14 µs
  7  13971             nfsrvd_statend:entry Start
  7  13972            nfsrvd_statend:return          1 µs
  7  13973           nfsrvd_statstart:entry Start
  7  13974          nfsrvd_statstart:return          1 µs
  7  43887             nfsrvd_getattr:entry Start
  7  38608      nfsv4root_getreferral:entry Start
  7  38609     nfsv4root_getreferral:return          2 µs
  7  43888            nfsrvd_getattr:return         22 µs
  7  13971             nfsrvd_statend:entry Start
  7  13972            nfsrvd_statend:return          1 µs
  7  54843              nfsrvd_dorpc:return   10943076 µs
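
To make the measurement easier to reproduce, here is a minimal DTrace sketch (not one of the attached scripts, just the core idea) that histograms how long _sleep() blocks while a thread is inside nfsv4_lock(); the probe names are the ones visible in the trace above:

#!/usr/sbin/dtrace -s
/* Sketch: histogram of _sleep() durations taken under nfsv4_lock(). */

fbt::nfsv4_lock:entry
{
	self->in_lock = 1;
}

fbt::_sleep:entry
/self->in_lock/
{
	self->slept = timestamp;
}

fbt::_sleep:return
/self->slept/
{
	@us["_sleep() under nfsv4_lock (microseconds)"] =
	    quantize((timestamp - self->slept) / 1000);
	self->slept = 0;
}

fbt::nfsv4_lock:return
{
	self->in_lock = 0;
}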
Comment 1 Peter Eriksson 2020-03-08 00:51:44 UTC
Btw, if there is something else I should look at (while the resilver is running), I can probably do it - it still has a couple of days of resilvering left to go....

> action: Wait for the resilver to complete.
>   scan: resilver in progress since Fri Mar  6 15:45:39 2020
>	35.7T scanned at 306M/s, 24.0T issued at 205M/s, 53.9T total
>	1012G resilvered, 44.46% done, 1 days 18:31:18 to go
Comment 2 Kubilay Kocak freebsd_committer freebsd_triage 2020-03-08 01:37:30 UTC
@Peter Could you provide additional information including:

- Exact FreeBSD version (uname -a)
- /var/run/dmesg.boot (as an attachment)
- `zpool status` and `zfs list -v` output (as an attachment)
- zfs-stats -a output (as an attachment). [1] 
- /boot/loader.conf and /etc/sysctl.conf (as attachments)

[1] https://www.freshports.org/sysutils/zfs-stats/

Ideally, if possible, it would be great to upgrade the system to the latest 12.x release and try to reproduce the issue there, as the issue may have already been fixed. If so, we may be able to identify the relevant commits and merge them to stable/11, if they can be merged.
Comment 3 Peter Eriksson 2020-03-08 07:36:15 UTC
# uname -a
FreeBSD filur02 11.3-RELEASE-p6 FreeBSD 11.3-RELEASE-p6 #0: Tue Jan 28 09:55:39 UTC 2020     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

Adding outputs as attachment(s).

"zfs list -v"? "zfs list" is long - there is 25000 filesystems on this machine.
Comment 4 Peter Eriksson 2020-03-08 07:37:02 UTC
Created attachment 212242 [details]
dmesg.boot
Comment 5 Peter Eriksson 2020-03-08 07:37:42 UTC
Created attachment 212243 [details]
zpool status
Comment 6 Peter Eriksson 2020-03-08 07:37:57 UTC
Created attachment 212244 [details]
zfs-stats -a
Comment 7 Peter Eriksson 2020-03-08 07:38:14 UTC
Created attachment 212245 [details]
loader.conf
Comment 8 Peter Eriksson 2020-03-08 07:38:29 UTC
Created attachment 212246 [details]
sysctl.conf
Comment 9 Peter Eriksson 2020-03-08 07:38:57 UTC
Created attachment 212247 [details]
"sysctl -a" output
Comment 10 Peter Eriksson 2020-03-08 07:44:57 UTC
Currently (at the time the "sysctl -a" output was captured) it doesn't exhibit
the really slow behaviour; it comes and goes. We had really slow behaviour last night (after midnight, between Saturday and Sunday), but not right now (early Sunday morning).

We typically have around 50-60 NFS users and 500+ SMB (Samba) users during workweek daytime.

Dual-CPU systems (Intel Xeon E5-2620v4 @ 2.1 GHz), 8 cores with 2 hardware threads per CPU.

256GB RAM.

We have capped the ZFS ARC at 64GB (it used to be 128GB, but we had to lower it since it started fighting with the 500+ smbd processes).
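
For reference, that kind of cap is normally set with the vfs.zfs.arc_max loader tunable; a minimal example (not our exact loader.conf, which is attached above) looks like:

# /boot/loader.conf - cap the ZFS ARC at 64 GB
vfs.zfs.arc_max="64G"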
Comment 11 Peter Eriksson 2020-03-08 14:30:15 UTC
Created attachment 212260 [details]
Test program for measuring latency of FS operations

The source code for a simple test program that measures the time certain common filesystem operations take (mkdir/rmdir and optionally: chdir, open(O_CREAT|O_WRONLY), write(64KB), fsync, close, rename, open(O_RDONLY), read(64KB), close, cd(..), rmdir). Useful for spotting "hiccups".

Example usage:

gcc -o pfst pfst.c
./pfst -m10m /mnt/server/dir

(Will test mkdir/rmdir in a loop, with 1s between attempts, and alert if either one takes more than 10ms. Use "-v" to print timing for all operations, "-c" for more complex tests, or "-h" for help.)

When things are acting up:

# ./pfst -m100m /mnt/filur02
2020-03-08 01:43:19 [1066 ms]: /mnt/filur02: rmdir("t-electra-1988-449") [Time limit exceeded]
2020-03-08 01:43:22 [1103 ms]: /mnt/filur02: mkdir("t-electra-1988-450") [Time limit exceeded]
2020-03-08 01:43:24 [2441 ms]: /mnt/filur02: rmdir("t-electra-1988-450") [Time limit exceeded]
2020-03-08 01:43:39 [9586 ms]: /mnt/filur02: mkdir("t-electra-1988-455") [Time limit exceeded]


When things are looking OK:

# ./pfst -t10m  -v /mnt/filur02
[pfst, version 1.5 - Peter Eriksson <pen@lysator.liu.se>]

Loop #1:
2020-03-08 15:28:27 [ 561 µs]: /mnt/filur02: mkdir("t-electra-2170-1")
2020-03-08 15:28:27 [ 747 µs]: /mnt/filur02: rmdir("t-electra-2170-1")
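
For anyone who wants to cross-check this from the client side without pfst, roughly the same measurement can be done with a small DTrace sketch (assuming a DTrace-capable client; this is not the attached program, just the idea), logging mkdir()/rmdir() calls that take longer than 10 ms:

#!/usr/sbin/dtrace -s
/* Sketch: report mkdir()/rmdir() syscalls slower than 10 ms on the client. */

syscall::mkdir:entry,
syscall::rmdir:entry
{
	self->ts = timestamp;
}

syscall::mkdir:return,
syscall::rmdir:return
/self->ts && (timestamp - self->ts) > 10000000/
{
	printf("%Y %s by %s took %d ms\n", walltimestamp, probefunc,
	    execname, (timestamp - self->ts) / 1000000);
}

syscall::mkdir:return,
syscall::rmdir:return
{
	self->ts = 0;
}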
Comment 12 Peter Eriksson 2020-03-08 14:33:30 UTC
Created attachment 212261 [details]
Dtrace script to trace kernel NFS mkdir/rmdir

Search and replace the UID number (1003258) with one that you use for testing (and that normally isn't used for other NFS operations at the same time).

Make it executable and run with ./nfs-time.d (it generates lots of output, so probably send it to a log file), then look for filenames that correlate with the long delays reported by the pfst test program (t-clienthostname-pid-loopcount).
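
The attached script is the real thing; as a rough illustration of the server-side idea, one can time the NFS mkdir handler directly, along the lines of this sketch (nfsrvd_mkdir is my assumption for the handler name - verify with "dtrace -l | grep nfsrvd" on your kernel):

#!/usr/sbin/dtrace -s
/* Sketch: report server-side NFS mkdir handling slower than 100 ms.
 * nfsrvd_mkdir is assumed to be the handler name - check with dtrace -l. */

fbt::nfsrvd_mkdir:entry
{
	self->ts = timestamp;
}

fbt::nfsrvd_mkdir:return
/self->ts && (timestamp - self->ts) > 100000000/
{
	printf("%Y nfsrvd_mkdir took %d ms\n", walltimestamp,
	    (timestamp - self->ts) / 1000000);
}

fbt::nfsrvd_mkdir:return
{
	self->ts = 0;
}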
Comment 13 Peter Eriksson 2020-03-08 15:25:29 UTC
Btw, I've tried fiddling with some sysctl parameters - but most don't seem to
have any real effect on the delays:

vfs.zfs.resilver_delay from 2 to 500. (Idea: lower the IOPS.) No difference.

vfs.zfs.scan_idle to 500000. (Idea: "pause" the resilver.) No difference.

vfs.zfs.top_maxinflight from 32 to 4. (Idea: lower the I/O pressure.) No difference.

vfs.zfs.resilver_min_time_ms from 3000 to 1000, then to 100. Makes a little difference - the
"average" delays seem to be around 3x the value. So if I set it to 1000, then I'm seeing delays of around 2-4s instead of 6-9s. At 100 the delays are around 1-3s.

Setting vfs.nfsd.maxthreads to 64 or 512 instead of 256. No difference.
(We typically have around 50-60 NFS clients connected.)


This is the "top -HSIz" output during a 5s "hickup":

last pid: 46798;  load averages:  0.89,  1.00,  1.06  up 2+00:33:08    16:16:16
1532 processes:43 running, 1357 sleeping, 1 zombie, 131 waiting

Mem: 224M Active, 1665M Inact, 124G Wired, 124G Free
ARC: 64G Total, 43G MFU, 9375M MRU, 917M Anon, 1351M Header, 9287M Other
Swap: 


  PID USERNAME      PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
30489 root           52    0   164M   116M select 27  13:27  13.67% smbd
   12 root          -92    -     0K  2112K WAIT    5   6:18   1.37% intr{irq344: ixl2:q5}
30496 root           20    0   111M 98844K select 14   2:09   1.37% winbindd
20922 root           20    0   211M   161M select 18   0:19   0.78% smbd{smbd}
98462 root           20    0   211M   151M select 21   3:29   0.59% smbd{smbd}
   12 root          -88    -     0K  2112K WAIT   22  18:39   0.29% intr{irq264: mpr0}
    0 root           -8    -     0K 15056K q->q_z 10   9:59   0.20% kernel{dsl_scan_tq_DATA_3}
    0 root          -16    -     0K 15056K -      12   8:57   0.20% kernel{zio_write_intr_5}
    0 root          -16    -     0K 15056K -      19   8:56   0.20% kernel{zio_write_intr_4}
    0 root          -16    -     0K 15056K -      25   8:55   0.20% kernel{zio_write_intr_0}
    0 root          -12    -     0K 15056K -      27   2:34   0.20% kernel{zio_write_issue_5}
 1472 root           20    0  6304K  4024K tx->tx  5   0:01   0.20% nfsd{nfsd: service}
20922 root           20    0   211M   161M uwait  12   0:00   0.20% smbd{smbd}
   28 root           16    -     0K    16K syncer  9  19:38   0.10% syncer
84576 root           20    0   218M   177M select 12  10:16   0.10% smbd{smbd}
    0 root          -16    -     0K 15056K -      14   8:56   0.10% kernel{zio_write_intr_7}
    0 root          -16    -     0K 15056K -       4   8:56   0.10% kernel{zio_write_intr_3}
    0 root          -16    -     0K 15056K -       2   8:55   0.10% kernel{zio_write_intr_2}
Comment 14 Peter Eriksson 2020-03-10 13:08:43 UTC
Created attachment 212302 [details]
Another Dtrace script to trace *all* NFS activity in the kernel

Another Dtrace script I've used to "see" what's happening in the kernel with regard to NFS (globally). Figured it might be useful for someone else too, so here it is :-)
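
The attached script is the full version; as a rough sketch of the same "global" idea, one can simply time the RPC dispatch routine that shows up at the top of the trace in the description:

#!/usr/sbin/dtrace -s
/* Sketch: flag any NFS RPC whose server-side handling exceeds 1 second,
 * by timing nfsrvd_dorpc() (the dispatcher seen in the earlier trace). */

fbt::nfsrvd_dorpc:entry
{
	self->ts = timestamp;
}

fbt::nfsrvd_dorpc:return
/self->ts && (timestamp - self->ts) > 1000000000/
{
	printf("%Y slow NFS RPC: %d ms\n", walltimestamp,
	    (timestamp - self->ts) / 1000000);
}

fbt::nfsrvd_dorpc:return
{
	self->ts = 0;
}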
Comment 15 Peter Eriksson 2020-05-07 14:23:57 UTC
I'm closing this. I still think it would be useful to try to find the bottlenecks and ways to make the system behave more nicely, but it seems the problem was/is that when there is a lot of I/O being done, the ARC metadata gets discarded - NFS I/O then requires the same metadata to be read from the disks again, and since the disks are busy with the resilver, things slow down...

Some way to "protect" ZFS ARC metadata from going too low probably would help here. I think ZoL has such a "knob" (the current ZoF code only has "max" limits).

Some way to cache NFS mountpoint information in the kernel could also improve things, so it doesn't have to go to disk (via the ZFS ARC) every time a mountpoint is traversed. Ah well, future stuff.

- Peter