Bug 169480 - [zfs] ZFS stalls on heavy I/O
Summary: [zfs] ZFS stalls on heavy I/O
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: Unspecified
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-06-27 10:20 UTC by Levent Serinol
Modified: 2018-01-03 05:14 UTC (History)
0 users

See Also:


Attachments

Description Levent Serinol 2012-06-27 10:20:07 UTC
 Under heavy I/O load we see freeze problems on ZFS volumes on both FreeBSD 9-RELEASE and 10-CURRENT. The machines are HP servers (64-bit) with HP Smart Array 6400 RAID controllers (with battery units). Every da device is a hardware RAID5, each made up of 9x300GB 10K SCSI hard drives. Most of the I/O is local to the system, apart from some small NFS traffic from a few other servers (NFS lookup/getattr/etc.). These are mail servers (qmail) with a small-I/O pattern (64K read/write). Below you can find procstat output taken at freeze time. write_limit is set to 200MB because of the huge number of txg_wait_opens observed before. Every process stops in D state, I think because the txg queue and the other two queues are full. Is there any suggestion to fix the problem?
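
(The write_limit referred to above is presumably the
vfs.zfs.write_limit_override tunable; a 200MB cap would look something
like:

  sysctl vfs.zfs.write_limit_override=209715200

though exactly which knob was used is an assumption here.)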

btw, inject_compress is the main process injecting emails into user inboxes (databases). Also, those machines ran without problems on Linux/XFS. A while ago we started migrating from Linux to FreeBSD.

http://pastebin.com/raw.php?i=ic3YepWQ
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2012-06-27 10:42:11 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).
Comment 2 hgcoin 2013-01-26 01:26:10 UTC
I think I have an easier way to reproduce this problem on a very simple 
setup:

Boot a 'nas4free' livecd version 573 (FreeBSD 9.1.0.1).  Mount one 
zpool with no dedup, a simple 4-drive raidz setup, 4GB memory.  Don't 
enable any of the various features, just get a shell from the live cd 
prompt.  Use dd to make a 20GB file on the pool, writing from 
/dev/random.  While that's going, go to another virtual console, cd to 
the pool, and do an ls -l.  It won't return.  Of interest: long before 
wired memory explodes to the limit, with ls not returning while dd is 
running, 'top' on another shell reports this:

99.7% idle, 2.6gb free ram, dd in tx->tx, zfskern in zio->i, and intr in 
WAIT

Eventually disk activity as shown by the cheery little flickering lamp 
slows, then stops.  Characters still echo on the ls -l command that 
hasn't returned, but no others.  alt-tab strangely still works. The only 
way out is to power cycle the box.
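
Condensed, the repro above is roughly the following (mountpoint, file 
name and count are placeholders, not the exact values used):

  # console 1: sustained small-block writes from /dev/random
  cd /mnt/pool1
  dd if=/dev/random of=foo bs=512 count=40000000

  # console 2, while the dd is still running
  cd /mnt/pool1
  ls -l          # hangs in state D and never returns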

Details here:

http://forums.nas4free.org/viewtopic.php?p=12519#p12519

It feels like completed I/Os are getting lost and the system just waits 
and waits until there are no more resources left to issue new commands, 
or too many threads are locked waiting for something that will never return.

Harry
Comment 3 Jeremy Chadwick 2013-01-26 02:08:59 UTC
Harry, things that come to mind immediately:

1. http://www.quietfountain.com/fs1pool1.txt is when your pool contained
both L2ARC and ZIL devices on SSDs.  Please remove the SSDs from the
picture entirely and use raidz1 disks ada[2345] only at this point.

I do not want to discuss ada[01] at this time, because they're SSDs.
There are quite literally 4 or 5 "catches" to using these devices on
FreeBSD ZFS, but the biggest problem -- and this WILL hurt you, no
arguments about it -- is lack of TRIM support.  You will hurt your SSDs
over time doing this.  If you want TRIM support on ZFS you will need to
run -CURRENT.

We can talk more about the SSDs later.  As said, please remove them from
the picture for starters, as all they do is make troubleshooting much,
much harder.

2. I do see some raw I/O benchmarks but only for ada2.  This is
insufficient.  A single disk performing like crap in a pool can slow
down the entire response time for everything.  I can do analysis of all
of your disks if the issue is narrowed down to one of them.

"gstat -I500ms" is a good way to watch I/O speeds in real-time.  I find
this more effective than "zpool iostat -v 1" for per-device info.
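
For example, to watch just the raidz members (the filter regex is an
assumption about your device names):

  gstat -I500ms -f 'ada[2-5]'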

3. The ada[2345] disks involved are Hitachi HDS723015BLA642 (7K3000,
1.5TB), and there is sparse info on the web as to whether these are
512-byte or 4096-byte physical sector disks.  smartmontools 6.0 or
newer will tell you.

All disks regardless advertise 512 bytes as the logical size to remain
fully compatible with legacy systems, but the performance hit on I/O is
major if the device + pool ashift isn't 12.  So please check this with
smartmontools 6.0 or newer.
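
For example, assuming smartctl (smartmontools 6.0+) is available and
ada2 is one of the disks:

  smartctl -i /dev/ada2 | grep -i 'sector size'

which should report both the logical and physical sector sizes.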

If the disks use physically 4096-byte sectors, you need to use gnop(8)
to align them and create the pool off of that.  Ivan Voras wrote a
wonderful guide on how to do this, and it's very simple:

http://ivoras.net/blog/tree/2011-01-01.freebsd-on-4k-sector-drives.html
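
A sketch of that approach, assuming raw disks ada2 through ada5 and a
pool named pool1 (adapt the names; this destroys existing data on the
disks):

  gnop create -S 4096 ada2                  # 4k-sector shim on one member
  zpool create pool1 raidz ada2.nop ada3 ada4 ada5
  zpool export pool1
  gnop destroy ada2.nop
  zpool import pool1
  zdb -C pool1 | grep ashift                # should now report ashift: 12

One .nop per vdev is enough, since ashift is chosen per vdev from the
largest advertised sector size.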

It wouldn't hurt you to do this regardless, as there's no performance
hit using the gnop(8) method on 512-byte sector drives; this would
"future-proof" you if upgrading to newer disks too.  You want ashift=12.

4. Why are all of your drives partitioned?  In other words, why are you
using adaXpX rather than just adaX for your raidz1 pool?  "gpart show"
output was not provided, and I can only speculate as to what's going on
under the hood there.

Please use raw disks when recreating your pool, i.e. ada2, ada3, ada4,
etc...

I know for your cache/logs this is a different situation but again,
please remove those from the picture.

5. Please keep your Hitachi disks on the Intel ICH7 controller for the
time being.  It's SATA300 but that isn't going to hurt these disks.
Don't bring the Marvell into the picture yet.  Don't change around
cabling or anything else.

6. For any process that takes a long while, you're going to need to do
"procstat -kk" (yes -k twice) against it.

7. I do not think your issue is related to this PR.  I would suggest
discussing it on freebsd-fs first.  Of course, you're also using
something called "nas4free" which may or may not be *true, unaltered*
FreeBSD -- I have no idea.  I often find it frustrating when, say, the
FreeNAS folks or other "FreeBSD fork projects" appear on the FreeBSD
mailing lists "just because it uses FreeBSD".  You always have to go
with the vendor for support (like you did on their forum), but if you
really think this is a FreeBSD "kernel thing" freebsd-fs is fine.

Start with what I described above and go from there.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
Comment 4 hgcoin 2013-01-26 03:04:05 UTC
Jeremy, Thanks.

re: 1.  Not only have I removed the zil but I removed the cache before 
running this test as well, thinking as you did the whole l2arc thing was 
adding too much complexity.   The pool was nothing more than the single 
raidz when doing the test to cause the failure noted.

re 2.  The raw benchmarks you saw for ada2 were identically duplicated 
for 3, 4 and 5.  They are all exactly the same drive make and model 
purchased at the same time.

re 3.  These are 512 byte/sector drives.  v6.0 smartctl -a reports 512 
bytes/sector logical / physical.

re 4.  The drives are partitioned so that each of them can hold boot 
information should I decide someday to boot off the array, to make sure 
sector 0 of the data partition starts on the aforementioned 4K boundary, 
and to leave a couple of gigs at the end for swap if I decide I want that.  
Partition 1 is just big enough for the boot code, 2 is the bulk of the 
storage, and 3 is the last 2 gigs.

So, no zil or cache used in the test, all formatted zfs v28, scrub 
reported zero errors on the array two days ago (though the scrub speed 
was 14M/s or so).

It's something about max-speed writes being interrupted by another 
read that kills zfs/nas4free somehow.

re 5:  The SATA drives indeed are on the ICH7, as SATA300 isn't an 
issue for them.  The SSDs are on the higher-speed controller, and are 
unused as noted before.

re 6:  You'll note that 'ls -l' didn't return; only the 'top' program 
launched before the dd was started kept running.  Here are the procstat 
-kk outputs for the dd and the ls during their run (the ls never returning):
nas4free:~# ps axu | grep dd
root  4006  47.1  0.1  9916 1752 v0  R+    2:56AM   0:07.92 dd 
if=/dev/urandom of=foo bs=512 count=20000000
root  4009   0.0  0.1 16280 2092  0  S+    2:56AM   0:00.00 grep dd
nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 
sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_delay+0x137 
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 
dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 
vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 
sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_delay+0x137 
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 
dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 
vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 
sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_delay+0x137 
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 
dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 
vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 
sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_delay+0x137 
dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 
dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 
vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# ps axu | grep ls
root  4016   0.0  0.1 14420 2196 v2  D+    2:57AM   0:00.00 ls -l
root  4018   0.0  0.1 16280 2084  0  RL+   2:57AM   0:00.00 grep ls
nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 
sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 
sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 
sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 
sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 
sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
Xfast_syscall+0xf7
nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 
sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
Xfast_syscall+0xf7
nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 
sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
Xfast_syscall+0xf7
nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 
sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 
sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 
sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 
sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
Xfast_syscall+0xf7
nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 
sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
Xfast_syscall+0xf7
nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 
sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
Xfast_syscall+0xf7
nas4free:~# procstat -kk 4006
   PID    TID COMM             TDNAME           KSTACK
  4006 100114 dd               -                mi_switch+0x186 
sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 
zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c 
dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 
Xfast_syscall+0xf7
nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 
sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 
sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 
sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 
sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~# procstat -kk 4016
   PID    TID COMM             TDNAME           KSTACK
  4016 100085 ls               -                mi_switch+0x186 
sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 
VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a 
vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 
kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a 
amd64_syscall+0x546 Xfast_syscall+0xf7
nas4free:~#


re 7:  The 'submit a problem' form doesn't list 'fs' as an option, just 
kern.  If there's another place for fs bugs I trust the freebsd folks 
will know what to do with it.  Anyhow, I do think it's about write 
completions getting lost.  And 'nas4free' is a pretty popular project 
with the major virtue that anyone, anywhere looking to solve this 
problem can download a livecd iso and reproduce it.  This is not my 
first rodeo; there are several bug reports with various black-magic 
tunable tweaks that, I think, only avoid the bug because their usage 
patterns happen not to hit it.

This particular FreeBSD-based ISO has the virtue of making the problem 
completely reproducible.

Does any of the above help?  Also, here's an updated zpool status:

nas4free:~# zpool status pool1
   pool: pool1
  state: ONLINE
   scan: scrub canceled on Fri Jan 25 22:07:51 2013
config:

     NAME        STATE     READ WRITE CKSUM
     pool1       ONLINE       0     0     0
       raidz1-0  ONLINE       0     0     0
         ada2p2  ONLINE       0     0     0
         ada3p2  ONLINE       0     0     0
         ada4p2  ONLINE       0     0     0
         ada5p2  ONLINE       0     0     0

errors: No known data errors
nas4free:~#

nas4free:~# zpool iostat -v pool1
                capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool1       3.01T  2.43T    168      0   360K  3.42K
   raidz1    3.01T  2.43T    168      0   360K  3.42K
     ada2p2      -      -    124      0   104K  2.34K
     ada3p2      -      -    123      0   104K  2.28K
     ada4p2      -      -    124      0   103K  2.34K
     ada5p2      -      -    126      0  98.1K  2.29K
----------  -----  -----  -----  -----  -----  -----

nas4free:~#

nas4free:~# zpool get all pool1
NAME   PROPERTY       VALUE       SOURCE
pool1  size           5.44T       -
pool1  capacity       55%         -
pool1  altroot        -           default
pool1  health         ONLINE      -
pool1  guid           1701438519865110975  default
pool1  version        28          default
pool1  bootfs         -           default
pool1  delegation     on          default
pool1  autoreplace    off         default
pool1  cachefile      -           default
pool1  failmode       wait        default
pool1  listsnapshots  off         default
pool1  autoexpand     off         default
pool1  dedupditto     0           default
pool1  dedupratio     1.46x       -
pool1  free           2.43T       -
pool1  allocated      3.01T       -
pool1  readonly       off         -
pool1  comment        -           default
pool1  expandsize     0           -
nas4free:~#
nas4free:~# zfs get all pool1/videos
NAME          PROPERTY              VALUE                  SOURCE
pool1/videos  type                  filesystem             -
pool1/videos  creation              Sat Jan 19  5:12 2013  -
pool1/videos  used                  526G                   -
pool1/videos  available             1.74T                  -
pool1/videos  referenced            526G                   -
pool1/videos  compressratio         1.00x                  -
pool1/videos  mounted               yes                    -
pool1/videos  quota                 none                   local
pool1/videos  reservation           none                   local
pool1/videos  recordsize            128K                   default
pool1/videos  mountpoint            /mnt/pool1/videos      inherited 
from pool1
pool1/videos  sharenfs              off                    default
pool1/videos  checksum              on                     default
pool1/videos  compression           off                    local
pool1/videos  atime                 off                    local
pool1/videos  devices               on                     default
pool1/videos  exec                  on                     default
pool1/videos  setuid                on                     default
pool1/videos  readonly              off                    local
pool1/videos  jailed                off                    default
pool1/videos  snapdir               hidden                 local
pool1/videos  aclmode               discard                default
pool1/videos  aclinherit            restricted             default
pool1/videos  canmount              on                     local
pool1/videos  xattr                 off                    temporary
pool1/videos  copies                1                      default
pool1/videos  version               5                      -
pool1/videos  utf8only              off                    -
pool1/videos  normalization         none                   -
pool1/videos  casesensitivity       sensitive              -
pool1/videos  vscan                 off                    default
pool1/videos  nbmand                off                    default
pool1/videos  sharesmb              off                    default
pool1/videos  refquota              none                   default
pool1/videos  refreservation        none                   default
pool1/videos  primarycache          all                    default
pool1/videos  secondarycache        all                    default
pool1/videos  usedbysnapshots       0                      -
pool1/videos  usedbydataset         526G                   -
pool1/videos  usedbychildren        0                      -
pool1/videos  usedbyrefreservation  0                      -
pool1/videos  logbias               latency                default
pool1/videos  dedup                 off                    local
pool1/videos  mlslabel                                     -
pool1/videos  sync                  standard               local
pool1/videos  refcompressratio      1.00x                  -
pool1/videos  written               526G                   -
nas4free:~#
Comment 5 Jeremy Chadwick 2013-01-26 03:55:26 UTC
Recommendations:

1. Instead of /dev/random use /dev/zero.  /dev/random is not blazing
fast given it has to harvest lots of entropy from places.  If you're
doing I/O speed testing just use /dev/zero.  The speed difference is
quite big.

2. For dd, instead of bs=512 use bs=64k.  bs=512 isn't very ideal; these
are direct I/O writes of 512 bytes each, which is dog slow.  I repeat:
dog slow.  Linux does this differently.
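
In other words, a test more along the lines of (output path is just an
example):

  dd if=/dev/zero of=/mnt/pool1/testfile bs=64k count=320000    # ~20GB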

3. During the dd, in another VTY or window, use "gstat -I500ms" and
watch the I/O speeds for your ada[2345] disks during the dd.  They
should be hitting peaks between 60-150MBytes/sec under the far right
"Kbps" field (far left=read, far right=write).

The large potential speed variance has to do with how much data you
already have on the pool, i.e. MHDDs get slower as the actuator arms
move inward towards the spindle motor.  That's why you might see, for
example, 150MBytes/sec when reading/writing to low-numbered LBAs but
slower speeds when writing to high-numbered LBAs.

This speed will be "bursty" and "sporadic" due to how the ZFS ARC
works.  The interval at which "things are flushed to disk" is based on
the vfs.zfs.txg.timeout sysctl, which on FreeBSD 9.1-RELEASE should
default to 5 (5 seconds).

4. "zpool iostat -v {pool}" does not provide accurate speed indications
for the same reason a bare "iostat" doesn't show what most people would
hope for while "iostat 1" does.  You need to run it with an interval,
i.e. "zpool iostat -v {pool} 1", and let it run for a while while doing
I/O.  But I recommend using gstat like I said, simply because the
interval can be set at 500ms (0.5s) and you get a better idea of what
your peak I/O speed is.

If you find a single disk that is **always** performing badly, then that
disk is your bottleneck and I can help you with analysis of its problem.

5. Your "zpool scrub" speed being 14MBytes/second indicates you are
nowhere close to your ideal I/O speed.  It should not be that slow unless
you're doing tons of I/O at the same time as the scrub.  Also, scrubs
take longer now due to the disabling of the vdev cache (and that's not a
FreeBSD thing, it's that way in Illumos too, and it's a sensitive topic
to discuss).

6. On FreeBSD 9.1-RELEASE generally speaking you should not have to tune
any sysctls.  The situation was different in 8.x and 9.0.  Your system
only has 4GB of RAM so prefetching automatically gets disabled, by the
way, just in case you were wondering about that (there were problems
with prefetch in older releases).
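
If you just want to confirm the defaults rather than tune anything, a
read-only check would be something like:

  sysctl vfs.zfs.txg.timeout vfs.zfs.prefetch_disable

(on this 4GB box, prefetch_disable should come back as 1, per the note
above).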

7. You should probably keep "top -s 1" running, and you might even
consider using "top -S -s 1" to see system/kernel threads (they're in
brackets).  This isn't going to tell you downright what's making things
slow though.  "vmstat -i" during heavy I/O would be useful too, just in
case somehow you have a shared interrupt that's being pegged hard (for
example I've seen SATA controllers and USB controllers sharing an
interrupt, even with APICs, where the USB layer is busted churning out
1000 ints/sec and thus affecting SATA I/O speed).

8. If you want to compare systems I'm happy to do so, although I have
fewer disks than you do (3 in raidz1, WD Red 1TB drives).  However my
system does not use a Pentium D-class processor; it's a Core 2 Quad Q9500.
The D-class stuff is fairly old.

I have some other theories as well but one thing at a time.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
Comment 6 Harry Coin 2013-01-26 05:49:15 UTC
I appreciate your effort very much.

One change from last time was that I set failmode=continue instead of 
'wait'.  There were no errors in zpool or other logs.

re 1,2: switching to bs=64k and /dev/zero as the input source, the dd 
completes and the ls -l during the process causes no problems.  Transfer 
speed reported by dd was 183 MBytes/sec.

re 3:  the gstat -I500ms shows each of the raid partitions hanging in at 
about top speed of 50-ish MBps during the above dd.  All disks perform 
within a couple % of one another.

re 4: we'll stick with gstat.

re 5: I see we agree about that.  Nothing happened during the scrub 
other than the scrub.  14Mbps scrub speed is 1/6th of the minimum I'd 
expect.

re 6:  Tuning Is Evil.  On the other hand, Crashing is Eviler.  So I 
tried Evil Tuning, with poor results as noted upstream.  So, perhaps 
tuning is indeed not only evil but a black hole for time as well.

Switching back to the 512 byte writes, I notice on gstat the writes are 
zero for 4 secs or so, then a burst of activity (all partitions within a 
few % of one another), all quiet, repeat.   Then I throw in the ls -l... 
and... it all works with no problems.   Trying the dd with the big 
blocksize and /dev/urandom.... and:  gstat looks bursty as in the 512 
byte writes.   ls -l causes no changes, works.    Trying the original 
/dev/urandom and bs=512 and ... the gstat pattern does the usual bursty 
thing, ls -l works well.    Thinking it's about the gstat...

Doing top -S -s 1 .. nothing.

nas4free:~# vmstat -i
interrupt                          total       rate
irq1: atkbd0                         168          0
irq18: uhci2                        1136          0
irq20: hpet0                      620136        328
irq21: uhci0 ehci0                  7764          4
irq256: hdac0                         37          0
irq257: ahci0                       1090          0
irq258: em0                         5625          2
irq259: ahci1                     849746        450
Total                            1485702        787

So weird.  It's not crashing.  So, I tried to move a video from one 
dataset to another using mv.  Within moments all the writes gstat shows 
have stopped, while there are a few KB/s of reads.  And there it sits.  
Here's top:

last pid:  3932;  load averages:  0.02,  0.04,  
0.13                                             up 0+00:24:46  05:03:53
249 processes: 3 running, 229 sleeping, 17 waiting
CPU:  0.0% user,  0.0% nice,  1.2% system,  0.0% interrupt, 98.8% idle
Mem: 14M Active, 19M Inact, 953M Wired, 8256K Buf, 2246M Free
Swap:

   PID USERNAME     PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
    11 root         155 ki31     0K    32K CPU1    1  22:02 100.00% 
idle{idle: cpu1}
    11 root         155 ki31     0K    32K RUN     0  22:06 98.49% 
idle{idle: cpu0}
    13 root          -8    -     0K    48K -       1   0:06  0.10% 
geom{g_up}
     5 root          -8    -     0K    80K zio->i  0   0:05  0.10% 
zfskern{txg_thread_enter}
     0 root         -16    0     0K  2624K sched   1   2:30  0.00% 
kernel{swapper}
     0 root         -16    0     0K  2624K -       0   0:30  0.00% 
kernel{zio_read_intr_1}
     0 root         -16    0     0K  2624K -       0   0:30  0.00% 
kernel{zio_read_intr_0}
     0 root         -16    0     0K  2624K -       1   0:10  0.00% 
kernel{zio_write_issue_}
     0 root         -16    0     0K  2624K -       1   0:10  0.00% 
kernel{zio_write_issue_}
    12 root         -88    -     0K   272K WAIT    1   0:08  0.00% 
intr{irq259: ahci1}
     0 root         -16    0     0K  2624K -       1   0:07  0.00% 
kernel{zio_write_intr_2}
     0 root         -16    0     0K  2624K -       1   0:07  0.00% 
kernel{zio_write_intr_3}
     0 root         -16    0     0K  2624K -       1   0:07  0.00% 
kernel{zio_write_intr_6}
     0 root         -16    0     0K  2624K -       1   0:07  0.00% 
kernel{zio_write_intr_1}
     0 root         -16    0     0K  2624K -       0   0:07  0.00% 
kernel{zio_write_intr_0}
     0 root         -16    0     0K  2624K -       0   0:07  0.00% 
kernel{zio_write_intr_4}
     0 root         -16    0     0K  2624K -       1   0:07  0.00% 
kernel{zio_write_intr_7}
     0 root         -16    0     0K  2624K -       1   0:07  0.00% 
kernel{zio_write_intr_5}
    13 root          -8    -     0K    48K -       1   0:07  0.00% 
geom{g_down}
  3919 root          20    0  9916K  1636K tx->tx  1   0:07  0.00% mv
    20 root          16    -     0K    16K syncer  0   0:03  0.00% syncer
    12 root         -60    -     0K   272K WAIT    0   0:01  0.00% 
intr{swi4: clock}
    12 root         -52    -     0K   272K WAIT    1   0:01  0.00% 
intr{swi6: Giant task}
    14 root         -16    -     0K    16K -       0   0:01  0.00% yarrow
     5 root          -8    -     0K    80K arc_re  0   0:01  0.00% 
zfskern{arc_reclaim_thre}
    12 root         -84    -     0K   272K WAIT    1   0:01  0.00% 
intr{irq1: atkbd0}
     0 root         -16    0     0K  2624K -       1   0:00  0.00% 
kernel{zio_write_issue_}
     0 root          -8    0     0K  2624K -       0   0:00  0.00% 
kernel{zil_clean}
     0 root         -16    0     0K  2624K -       1   0:00  0.00% 
kernel{zio_write_issue_}


Set failmode back to wait.  Rebooted from the livecd and repeated 
the tests: they work properly.

So, with gstat running I once again try to move a video file from one 
dataset to another.  I watch a bunch of reads happen, then a burst of 
writes, and that goes on for a couple of minutes.  Then the writes stop 
and the reads continue at 3Kbps according to gstat.  The total memory 
still has 2.5G free.  There is no writing to the zpool.  I notice that 
one of the video files has moved, and the writing stopped when it was 
trying to open the next one, I'm guessing.  I can successfully ^C out of 
the mv, and notice a constant low-speed read going on, 80-100KBps.  I 
try restarting the move; no change to the read pattern, no writes.

So, no memory explosion, ^C works, but I can't write, and gstat reports 
a steady low-pace read for no obvious reason, just as it was when 
the mv process was active.

So, a huger puzzle.
Comment 8 Jeremy Chadwick 2013-01-29 18:50:28 UTC
Re 1,2: that transfer speed (183MBytes/second) sounds much better/much
more accurate for what's going on.  The speed-limiting factors were
certainly a small blocksize (512 bytes) used by dd, and using
/dev/random rather than /dev/zero.  I realise you're probably expecting
to see something like 480MBytes/second (4 drives * 120MB/sec), but
that's probably not going to happen on that model of system and with
that CPU.

For example, on my Q9550 system described earlier, I can get about this:

$ dd if=/dev/zero of=testfile bs=64k
^C27148+0 records in
27147+0 records out
1779105792 bytes transferred in 6.935566 secs (256519186 bytes/sec)

While "gstat -I500ms" shows each disk going between 60MBytes/sec and
140MBytes/sec.  "zpool iostat -v data 1" shows between 120-220MBytes/sec
at the pool level, and showing around 65-110MBytes/sec on a per-disk
level.

Anyway, point being, things are faster with a large bs and from a source
that doesn't churn interrupts.  But don't necessarily "pull a Linux" and
start doing things like bs=1m -- as I said before, Linux dd is
different, because the I/O is cached (without --direct), while on
FreeBSD dd is always direct.

Re 3: That sounds a bit on the slow side.  I would expect those disks,
at least during writes, to do more.  If **all** the drives show this
behaviour consistently in gstat, then you know the issue IS NOT with an
individual disk and instead lies elsewhere.  That rules out one piece of
the puzzle, and that's good.

Re 5: Did you mean to type 14MBytes/second, not 14mbits/second?  If so,
yes, I would agree that's slow.  Scrubbing is not necessarily a good way
to "benchmark" disks, but I understand for "benchmarking" ZFS it's the
best you've got to some degree.

Regarding dd'ing and 512 bytes -- as I described to you in my previous
mail:

> This speed will be "bursty" and "sporadic" due to how the ZFS ARC
> works.  The interval at which "things are flushed to disk" is based on
> the vfs.zfs.txg.timeout sysctl, which on FreeBSD 9.1-RELEASE should
> default to 5 (5 seconds).

This is where your "4 secs or so" magic value comes from.  Please do not
change this sysctl/value; keep it at 5.

Finally, your vmstat -i output shows something of concern, UNLESS you
did this WHILE you had the dd (doesn't matter what block size) going,
and are using /dev/random or /dev/urandom (same thing on FreeBSD):

> irq20: hpet0                      620136        328
> irq259: ahci1                     849746        450

These interrupt rates are quite high.  hpet0 refers to your event
timer/clock timer (see kern.eventtimer.choice and kern.eventtimer.timer)
being HPET, and ahci1 refers to your Intel ICH7 AHCI controller.

Basically what's happening here is that you're generating a ton of
interrupts doing dd if=/dev/urandom bs=512.  And it makes perfect sense
to me why: because /dev/urandom has to harvest entropy from interrupt
sources (please see random(4) man page), and you're generating a lot of
interrupts to your AHCI controller for each individual 512-byte write.

When you say "move a video from one dataset to another", please explain
what it is you're moving from and to.  Specifically: what filesystems,
and output from "zfs list".

If you're moving a file from a ZFS filesystem to another ZFS filesystem
on the same pool, then please state that.  That may help kernel folks
figure out where your issue lies.
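
For instance (pool name assumed), something like:

  zfs list -r pool1

plus the exact mv command line, so the source and destination datasets
are unambiguous.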

At this stage, a kernel developer is going to need to step in and try to
help you figure out where the actual bottleneck is occurring.  This is
going to be very difficult/complex/very likely not possible with you
using nas4free, because you will almost certainly be asked to rebuild
world/kernel to include some new options and possibly asked to include
DTrace/CTF support (for real-time debugging).  The situation is tricky.

It would really help if you would/could remove nas4free from the picture
and instead just run stock FreeBSD, because as I said, if there are some
kind of kernel tunings or adjustment values the nas4free folks put in
place that stock FreeBSD doesn't, those could be harming you.

I can't be of more help here, I'm sorry to say.  The good news is that
your disks sound fine.  Kernel developers will need to take this up.

P.S. -- I would strongly recommend updating your nas4free forum post
with a link to this conversation in this PR.  IMO, the nas4free people
need to step up and take responsibility (and that almost certainly means
talking/working with the FreeBSD folks).

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
Comment 9 Eitan Adler freebsd_committer freebsd_triage 2017-12-31 07:58:51 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped