I've been investigating an issue we're having that gives our users a really lousy NFS experience during a ZFS resilver on one of our servers. To try to pinpoint exactly what's going on I've tried various things. Lately I've been using DTrace on the call path for mkdir(), and it seems that it is the _sleep() kernel call that sometimes sleeps for _really_ long times, like 1-20 seconds... Sure, the kernel is busy doing the ZFS resilver and other stuff, but this is silly... Any ideas? Is the priority of the _sleep() (prio=99?) too low compared to other kernel threads, so it never gets scheduled?

One example:

  21  54842            nfsrvd_dorpc:entry  Start
  21  47273              nfsv4_lock:entry  Start(lp->nfslock_lock=6, iwantlock=0)
  21  37590               nfsmsleep:entry  Start(ffffffff81e9982c, ffffffff81e998a0, 99)
  21  54396                  _sleep:entry  Start(prio=99, timeo=0)
  16  54397                 _sleep:return  10942618 µs
  16  37591              nfsmsleep:return  10942626 µs
  16  37590               nfsmsleep:entry  Start(ffffffff81e9982c, ffffffff81e998a0, 99)
  16  54396                  _sleep:entry  Start(prio=99, timeo=0)
   7  54397                 _sleep:return  344 µs
   7  37591              nfsmsleep:return  352 µs
   7  47274             nfsv4_lock:return  10942987 µs
   7  13973        nfsrvd_statstart:entry  Start
   7  13974       nfsrvd_statstart:return  2 µs
   7  44369  nfsrv_mallocmget_limit:entry  Start
   7  44370 nfsrv_mallocmget_limit:return  2 µs
   7  13971          nfsrvd_statend:entry  Start
   7  13972         nfsrvd_statend:return  2 µs
   7  13973        nfsrvd_statstart:entry  Start
   7  13974       nfsrvd_statstart:return  1 µs
   7  51444             nfsrv_mtofh:entry  Start
   7  51445            nfsrv_mtofh:return  2 µs
   7  47329             nfsd_fhtovp:entry  Start
   7  48059        __mtx_lock_flags:entry  Start
   7  48060       __mtx_lock_flags:return  2 µs
   7  47330            nfsd_fhtovp:return  14 µs
   7  13971          nfsrvd_statend:entry  Start
   7  13972         nfsrvd_statend:return  1 µs
   7  13973        nfsrvd_statstart:entry  Start
   7  13974       nfsrvd_statstart:return  1 µs
   7  43887          nfsrvd_getattr:entry  Start
   7  38608   nfsv4root_getreferral:entry  Start
   7  38609  nfsv4root_getreferral:return  2 µs
   7  43888         nfsrvd_getattr:return  22 µs
   7  13971          nfsrvd_statend:entry  Start
   7  13972         nfsrvd_statend:return  1 µs
   7  54843           nfsrvd_dorpc:return  10943076 µs
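For reference, this is a stripped-down sketch of the kind of DTrace probe I've been using to quantify those sleeps. It is not the attached script - just the core idea, with probe names taken from the trace above; adjust for your kernel:

```d
#!/usr/sbin/dtrace -s
/* Sketch: histogram of time spent in _sleep() when entered while the
   thread is inside nfsv4_lock(). Probe names as seen in the trace above. */

fbt::nfsv4_lock:entry
{
        self->inlock = 1;
}

fbt::_sleep:entry
/self->inlock/
{
        self->ts = timestamp;
}

fbt::_sleep:return
/self->ts/
{
        /* nanosecond histogram of sleep durations under the NFSv4 lock */
        @slept["_sleep() under nfsv4_lock (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}

fbt::nfsv4_lock:return
{
        self->inlock = 0;
}
```

Run it as root for a while and Ctrl-C; the quantize() output makes the 1-20 s outliers stand out immediately.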
Btw, if there is something else I should look at (while the resilver is running), I can probably do it - it still has a couple of days left of resilvering to do....

>   action: Wait for the resilver to complete.
>     scan: resilver in progress since Fri Mar  6 15:45:39 2020
>           35.7T scanned at 306M/s, 24.0T issued at 205M/s, 53.9T total
>           1012G resilvered, 44.46% done, 1 days 18:31:18 to go
@Peter Could you provide additional information, including:

- Exact FreeBSD version (uname -a)
- /var/run/dmesg.boot (as an attachment)
- `zpool status` and `zfs list -v` output (as an attachment)
- zfs-stats -a output (as an attachment) [1]
- /boot/loader.conf and /etc/sysctl.conf (as an attachment)

[1] https://www.freshports.org/sysutils/zfs-stats/

If possible and ideally, it would be great to be able to upgrade the system to the latest 12.x release and try to reproduce the issue there, as the issue may have already been fixed. If so, we may be able to identify the relevant commits to merge them to stable/11 if they can be
# uname -a
FreeBSD filur02 11.3-RELEASE-p6 FreeBSD 11.3-RELEASE-p6 #0: Tue Jan 28 09:55:39 UTC 2020     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

Adding outputs as attachment(s). "zfs list -v"? Plain "zfs list" is already long - there are 25,000 filesystems on this machine.
Created attachment 212242 [details] dmesg.boot
Created attachment 212243 [details] zpool status
Created attachment 212244 [details] zfs-stats -a
Created attachment 212245 [details] loader.conf
Created attachment 212246 [details] sysctl.conf
Created attachment 212247 [details] "sysctl -a" output
Currently (at the time of capturing the "sysctl -a" output) it doesn't exhibit the really slow behaviour; it comes and goes. We had really slow behaviour last night (after midnight, between Saturday and Sunday), but not right now (early Sunday morning). We typically have around 50-60 NFS users and 500+ SMB (Samba) users during workweek daytime.

Dual-CPU system (Intel Xeon E5-2620 v4 @ 2.1 GHz, 8 cores with 2 hardware threads per CPU), 256 GB RAM. We have capped the ZFS ARC at 64 GB (it used to be 128 GB, but we had to lower it since it started fighting with the 500+ smbd processes).
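For completeness, the ARC cap is just the standard loader tunable. This is the shape of it (the value here is illustrative - 64 GB in bytes; our real settings are in the attached loader.conf):

```
# /boot/loader.conf fragment - cap the ZFS ARC at 64 GB
# (64 * 1024^3 = 68719476736 bytes; see attached loader.conf for actual values)
vfs.zfs.arc_max="68719476736"
```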
Created attachment 212260 [details]
Test program for measuring latency of FS operations

The source code for a simple test program that measures the time certain common filesystem operations take (mkdir/rmdir and optionally: chdir, open(O_CREAT|O_WRONLY), write(64KB), fsync, close, rename, open(O_RDONLY), read(64KB), close, cd(..), rmdir). Useful for spotting "hiccups".

Example usage:

  gcc -o pfst pfst.c
  ./pfst -m10m /mnt/server/dir

(Tests mkdir/rmdir in a loop, with 1 s between attempts, and alerts if either one takes more than 10 ms. Use "-v" to print timing for all operations, "-c" for more complex tests, or "-h" for help.)

When things are acting up:

  # ./pfst -m100m /mnt/filur02
  2020-03-08 01:43:19 [1066 ms]: /mnt/filur02: rmdir("t-electra-1988-449") [Time limit exceeded]
  2020-03-08 01:43:22 [1103 ms]: /mnt/filur02: mkdir("t-electra-1988-450") [Time limit exceeded]
  2020-03-08 01:43:24 [2441 ms]: /mnt/filur02: rmdir("t-electra-1988-450") [Time limit exceeded]
  2020-03-08 01:43:39 [9586 ms]: /mnt/filur02: mkdir("t-electra-1988-455") [Time limit exceeded]

When things are looking OK:

  # ./pfst -t10m -v /mnt/filur02
  [pfst, version 1.5 - Peter Eriksson <pen@lysator.liu.se>]
  Loop #1:
  2020-03-08 15:28:27 [ 561 µs]: /mnt/filur02: mkdir("t-electra-2170-1")
  2020-03-08 15:28:27 [ 747 µs]: /mnt/filur02: rmdir("t-electra-2170-1")
Created attachment 212261 [details]
DTrace script to trace kernel NFS mkdir/rmdir

Search and replace the UID (1003258) with one that you use for testing (and that normally isn't used for other NFS operations at the same time). Make it executable and run it with ./nfs-time.d (it generates lots of output, so you probably want to send it to a log file), then look for the filenames with long timeouts that correlate with the pfst test script's output (t-clienthostname-pid-loopcount).
Btw, I've tried fiddling with some sysctl parameters, but most don't seem to have any real effect on the delays:

- vfs.zfs.resilver_delay from 2 to 500 (idea: lower the IOPS). No difference.
- vfs.zfs.scan_idle to 500000 (idea: "pause" the resilver). No difference.
- vfs.zfs.top_maxinflight from 32 to 4 (idea: lower the I/O pressure). No difference.
- vfs.zfs.resilver_min_time_ms from 3000 to 1000, then to 100. Makes a little difference - the "average" delays seem to be around 3x the value. So if I set it to 1000, I'm seeing delays of around 2-4 s instead of 6-9 s. At 100 the delays are around 1-3 s.
- vfs.nfsd.maxthreads to 64 or 512 instead of 256. No difference. (We typically have around 50-60 NFS clients connected.)

This is the "top -HSIz" output during a 5 s "hiccup":

last pid: 46798;  load averages: 0.89, 1.00, 1.06    up 2+00:33:08  16:16:16
1532 processes: 43 running, 1357 sleeping, 1 zombie, 131 waiting
Mem: 224M Active, 1665M Inact, 124G Wired, 124G Free
ARC: 64G Total, 43G MFU, 9375M MRU, 917M Anon, 1351M Header, 9287M Other
Swap:

  PID USERNAME   PRI NICE   SIZE    RES   STATE   C   TIME   WCPU COMMAND
30489 root        52    0   164M   116M  select  27  13:27 13.67% smbd
   12 root       -92    -     0K  2112K  WAIT     5   6:18  1.37% intr{irq344: ixl2:q5}
30496 root        20    0   111M 98844K  select  14   2:09  1.37% winbindd
20922 root        20    0   211M   161M  select  18   0:19  0.78% smbd{smbd}
98462 root        20    0   211M   151M  select  21   3:29  0.59% smbd{smbd}
   12 root       -88    -     0K  2112K  WAIT    22  18:39  0.29% intr{irq264: mpr0}
    0 root        -8    -     0K 15056K  q->q_z  10   9:59  0.20% kernel{dsl_scan_tq_DATA_3}
    0 root       -16    -     0K 15056K  -       12   8:57  0.20% kernel{zio_write_intr_5}
    0 root       -16    -     0K 15056K  -       19   8:56  0.20% kernel{zio_write_intr_4}
    0 root       -16    -     0K 15056K  -       25   8:55  0.20% kernel{zio_write_intr_0}
    0 root       -12    -     0K 15056K  -       27   2:34  0.20% kernel{zio_write_issue_5}
 1472 root        20    0  6304K  4024K  tx->tx   5   0:01  0.20% nfsd{nfsd: service}
20922 root        20    0   211M   161M  uwait   12   0:00  0.20% smbd{smbd}
   28 root        16    -     0K    16K  syncer   9  19:38  0.10% syncer
84576 root        20    0   218M   177M  select  12  10:16  0.10% smbd{smbd}
    0 root       -16    -     0K 15056K  -       14   8:56  0.10% kernel{zio_write_intr_7}
    0 root       -16    -     0K 15056K  -        4   8:56  0.10% kernel{zio_write_intr_3}
    0 root       -16    -     0K 15056K  -        2   8:55  0.10% kernel{zio_write_intr_2}
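For anyone wanting to repeat these experiments, this is the sort of fragment I've been using (sysctl names as they appear on 11.3; the values shown are examples from the list above, not recommendations):

```
# /etc/sysctl.conf fragment - resilver tuning knobs tried above
vfs.zfs.resilver_delay=500          # was 2
vfs.zfs.scan_idle=500000
vfs.zfs.top_maxinflight=4           # was 32
vfs.zfs.resilver_min_time_ms=100    # was 3000
vfs.nfsd.maxthreads=512             # default 256
```

All of these can also be set at runtime with sysctl(8) except vfs.nfsd.maxthreads, which I changed across nfsd restarts.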
Created attachment 212302 [details]
Another DTrace script to trace *all* NFS activity in the kernel

Another DTrace script I've used to "see" what's happening in the kernel with regard to NFS (globally). Figured it might be useful for someone else too, so here it is :-)
I'm closing this. I still think it might be useful to try to find bottlenecks and ways to make the system behave more nicely, but it seems the problem was/is that when there is a lot of I/O being done, ARC metadata gets evicted - and then NFS I/O requires the same metadata to be read from the disks again, and since they are busy with resilvering, things slow down...

Some way to "protect" ZFS ARC metadata from dropping too low would probably help here. I think ZoL has such a "knob" (the current ZoF code only has "max" limits). Some way to cache NFS mountpoint information in-kernel could also improve things, so it doesn't have to go to disk (via the ZFS ARC) every time a mountpoint is traversed. Ah well, future stuff.

- Peter
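For the record, the ZoL knob I had in mind is the zfs_arc_meta_min module parameter, which sets a floor under ARC metadata. On a Linux box it would look something like this (the 8 GB value is just an example; as far as I can tell there is no equivalent tunable in the FreeBSD 11.x ZFS code):

```
# /etc/modprobe.d/zfs.conf (ZFS on Linux only)
# Reserve at least 8 GB of the ARC for metadata so heavy data I/O
# (e.g. a resilver) can't evict it all. Example value, not a recommendation.
options zfs zfs_arc_meta_min=8589934592
```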