Under heavy I/O load we see freeze problems on ZFS volumes on both FreeBSD 9-RELEASE and 10-CURRENT. The machines are HP servers (64-bit) with HP Smart Array 6400 RAID controllers (with battery units). Every da device is a hardware RAID5, each built from 9x300GB 10K SCSI hard drives. Most of the I/O is local, apart from some small NFS traffic from a few other servers (NFS lookup/getattr/etc.). These are mail servers (qmail) with small I/O patterns (64K read/write).

Below you can find procstat output at freeze time. write_limit is set to 200MB because of the huge number of txg_wait_opens observed before. Every process ends up in D state, I think because the txg queue and the other two queues are full. Is there any suggestion to fix the problem? By the way, inject_compress is the main process injecting emails into user inboxes (databases). Also, these machines ran without problems on Linux/XFS; a while ago we started migrating from Linux to FreeBSD.

http://pastebin.com/raw.php?i=ic3YepWQ
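For reference, the "write_limit" mentioned above was, on FreeBSD 9.x-era ZFS, exposed as a sysctl/loader tunable. A rough sketch of inspecting and setting it follows; the knob names are from that era and should be verified on your own kernel with `sysctl -a | grep zfs` (later OpenZFS releases removed them in favour of the dirty-data throttle):

```shell
# Sketch only: 9.x-era ZFS write-throttle knobs; verify names locally.
sysctl vfs.zfs.txg.timeout                     # seconds between txg syncs (default 5)
sysctl vfs.zfs.write_limit_override            # 0 = auto-size; bytes otherwise
sysctl vfs.zfs.write_limit_override=209715200  # cap a txg at ~200MB, as described above
```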
Responsible Changed From-To: freebsd-bugs->freebsd-fs
Over to maintainer(s).
I think I have an easier way to reproduce this problem on a very simple setup:

- Boot a 'nas4free' livecd, version 573 (FreeBSD 9.1.0.1).
- Mount one zpool: no dedup, a simple 4-drive raidz, 4GB memory. Don't enable any of the various features; just get a shell from the livecd prompt.
- Use dd to write a 20GB file onto the pool from /dev/random.
- While that's going, switch to another virtual console, cd to the pool, and do an ls -l. It won't return.

Of interest: long before wired memory explodes to the limit, with ls not returning while dd is running, 'top' on another shell reports 99.7% idle, 2.6GB free RAM, dd in tx->tx, zfskern in zio->i, and intr in WAIT. Eventually disk activity, as shown by the cheery little flickering lamp, slows, then stops. Characters still echo on the ls -l command that hasn't returned, but no others; alt-tab strangely still works. The only way out is to power-cycle the box.

Details here: http://forums.nas4free.org/viewtopic.php?p=12519#p12519

It feels like completed I/Os are getting lost and the system just waits and waits until there are no resources left to issue new commands, or too many threads are locked waiting for something that will never return.

Harry
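Condensed, the reproduction recipe above looks like this (the pool mountpoint /mnt/pool1 is illustrative; the dd command itself is the exact one shown in the ps output later in this thread):

```shell
# Console 1: sustained small-block writes from the entropy device.
cd /mnt/pool1
dd if=/dev/urandom of=foo bs=512 count=20000000

# Console 2: any metadata operation against the same pool.
cd /mnt/pool1 && ls -l    # on the affected setup this never returns
```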
Harry, things that come to mind immediately:

1. http://www.quietfountain.com/fs1pool1.txt is from when your pool contained both L2ARC and ZIL devices on SSDs. Please remove the SSDs from the picture entirely and use the raidz1 disks ada[2345] only at this point. I do not want to discuss ada[01] at this time, because they're SSDs. There are quite literally 4 or 5 "catches" to using these devices on FreeBSD ZFS, but the biggest problem -- and this WILL hurt you, no arguments about it -- is the lack of TRIM support. You will hurt your SSDs over time doing this. If you want TRIM support on ZFS you will need to run -CURRENT. We can talk more about the SSDs later. As said, please remove them from the picture for starters; all they do is make troubleshooting much, much harder.

2. I do see some raw I/O benchmarks, but only for ada2. This is insufficient. A single disk performing like crap in a pool can slow down the entire response time for everything. I can do analysis of all of your disks if the issue is narrowed down to one of them. "gstat -I500ms" is a good way to watch I/O speeds in real time; I find it more effective than "zpool iostat -v 1" for per-device info.

3. The ada[2345] disks involved are Hitachi HDS723015BLA642 (7K3000, 1.5TB), and there is sparse info on the web as to whether these are 512-byte or 4096-byte physical-sector disks. smartmontools 6.0 or newer will tell you. All disks, regardless, advertise 512 bytes as the logical size to remain compatible with legacy systems, but the performance hit on I/O is major if the device + pool ashift isn't 12. So please check this with smartmontools 6.0 or newer. If the disks use 4096-byte physical sectors, you need to use gnop(8) to align them and create the pool off of that.
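The sector-size check with smartmontools looks something like the following sketch (device name illustrative; the exact label in the output varies by smartmontools version):

```shell
# Report logical/physical sector sizes; needs smartmontools >= 6.0.
smartctl -i /dev/ada2 | grep -i 'sector size'
# A line like "Sector Sizes: 512 bytes logical, 4096 bytes physical"
# means the disk is 4K-native and the pool should be created with ashift=12.
```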
Ivan Voras wrote a wonderful guide on how to do this, and it's very simple: http://ivoras.net/blog/tree/2011-01-01.freebsd-on-4k-sector-drives.html It wouldn't hurt you to do this regardless, as there's no performance hit from using the gnop(8) method on 512-byte-sector drives, and it would "future-proof" you when upgrading to newer disks too. You want ashift=12.

4. Why are all of your drives partitioned? In other words, why are you using adaXpX rather than just adaX for your raidz1 pool? "gpart show" output was not provided, and I can only speculate as to what's going on under the hood there. Please use raw disks when recreating your pool, i.e. ada2, ada3, ada4, etc. I know the cache/log devices are a different situation, but again, please remove those from the picture.

5. Please keep your Hitachi disks on the Intel ICH7 controller for the time being. It's SATA300 but that isn't going to hurt these disks. Don't bring the Marvell into the picture yet. Don't change around cabling or anything else.

6. For any process that takes a long while, you're going to need to run "procstat -kk" (yes, -k twice) against it.

7. I do not think your issue is related to this PR. I would suggest discussing it on freebsd-fs first. Of course, you're also using something called "nas4free", which may or may not be *true, unaltered* FreeBSD -- I have no idea. I often find it frustrating when, say, the FreeNAS folks or other "FreeBSD fork projects" appear on the FreeBSD mailing lists "just because it uses FreeBSD". You always have to go to the vendor for support (like you did on their forum), but if you really think this is a FreeBSD "kernel thing", freebsd-fs is fine. Start with what I described above and go from there.

--
| Jeremy Chadwick                          jdc@koitsu.org |
| UNIX Systems Administrator       http://jdc.koitsu.org/ |
| Mountain View, CA, US                                   |
| Making life hard for others since 1977.    PGP 4BD6C0CB |
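The gnop(8) procedure from Ivan Voras' guide boils down to roughly the following sketch (disk and pool names match the setup discussed in this thread, but treat these as illustrative, not exact commands for your system):

```shell
# Create transparent 4K-sector providers on top of the raw disks.
gnop create -S 4096 ada2 ada3 ada4 ada5

# Build the pool from the .nop devices so ZFS records ashift=12
# in the vdev labels.
zpool create pool1 raidz ada2.nop ada3.nop ada4.nop ada5.nop

# The gnop layer is only needed at pool-creation time; after an
# export / gnop destroy / import (or a reboot), ZFS finds the real
# devices and keeps the recorded ashift.
zdb -C pool1 | grep ashift    # should report ashift: 12
```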
Jeremy, thanks.

re 1: Not only have I removed the ZIL, I removed the cache before running this test as well, thinking, as you did, that the whole L2ARC thing was adding too much complexity. The pool was nothing more than the single raidz when running the test that caused the failure noted.

re 2: The raw benchmarks you saw for ada2 were identically duplicated for 3, 4 and 5. They are all exactly the same drive make and model, purchased at the same time.

re 3: These are 512-byte/sector drives. smartctl -a from v6.0 reports 512 bytes/sector logical and physical.

re 4: The drives are partitioned so that each of them can hold boot information should I decide someday to boot off the array, to make sure sector 0 of the main partition starts on the afore-mentioned 4K boundary, and to leave a couple of gig at the end for swap if I decide I want that. Partition 1 is just big enough for the boot code, 2 is the bulk of the storage, and 3 is the last 2 gig.

So: no ZIL or cache used in the test, everything formatted ZFS v28, and a scrub reported zero errors on the array two days ago (though the scrub speed was 14M/s or so). It's something about max-speed writes getting interrupted by another read that kills zfs/nas4free somehow.

re 5: The SATA drives indeed are on the ICH7; SATA300 isn't an issue for them. The SSDs are on the higher-speed controller, and are unused as noted before.

re 6: You'll note that 'ls -l' didn't return; the only 'top' was the one launched before the dd was started.
Here are the procstat -kk's for the dd and ls during their run (the ls not returning); the samples alternated between the two processes:

nas4free:~# ps axu | grep dd
root 4006 47.1 0.1  9916 1752 v0 R+  2:56AM 0:07.92 dd if=/dev/urandom of=foo bs=512 count=20000000
root 4009  0.0 0.1 16280 2092  0 S+  2:56AM 0:00.00 grep dd

nas4free:~# procstat -kk 4006
  PID    TID COMM  TDNAME  KSTACK
 4006 100114 dd    -       mi_switch+0x186 sleepq_timedwait+0x42 _cv_timedwait+0x13c txg_delay+0x137 dsl_pool_tempreserve_space+0xd5 dsl_dir_tempreserve_space+0x154 dmu_tx_assign+0x370 zfs_freebsd_write+0x45b VOP_WRITE_APV+0xb2 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7

(Three further procstat -kk 4006 invocations returned this identical txg_delay stack.)

nas4free:~# ps axu | grep ls
root 4016  0.0 0.1 14420 2196 v2 D+  2:57AM 0:00.00 ls -l
root 4018  0.0 0.1 16280 2084  0 RL+ 2:57AM 0:00.00 grep ls

nas4free:~# procstat -kk 4016
  PID    TID COMM  TDNAME  KSTACK
 4016 100085 ls    -       mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0xe40 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x55a vfs_cache_lookup+0xc8 VOP_LOOKUP_APV+0x40 lookup+0x464 namei+0x4e9 kern_statat_vnhook+0xb3 kern_statat+0x15 sys_lstat+0x2a amd64_syscall+0x546 Xfast_syscall+0xf7

(Eleven further procstat -kk 4016 samples, interleaved with the dd samples below, all returned this identical stack: ls is blocked in the vnode lookup path waiting on a vnode lock.)

Later procstat -kk 4006 samples show dd having moved from txg_delay to txg_wait_open:

nas4free:~# procstat -kk 4006
  PID    TID COMM  TDNAME  KSTACK
 4006 100114 dd    -       mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_open+0x85 zfs_freebsd_write+0x47b VOP_WRITE_APV+0xb2 vn_write+0x38c dofilewrite+0x8b kern_writev+0x6c sys_write+0x64 amd64_syscall+0x546 Xfast_syscall+0xf7

(Seven samples in total showed this same txg_wait_open stack.)

re 7: The 'submit a problem' form doesn't list 'fs' as an option, just 'kern'; if there's another place for fs bugs, I trust the FreeBSD folks will know what to do with it. Anyhow, I do think it's about write completions getting lost. And 'nas4free' is a pretty popular project with the major virtue that anyone, anywhere looking to reproduce this problem can download a livecd ISO and do so. This is not my first rodeo; there are several bug reports with various black-magic tunable tweaks that, I suspect, avoid the bug only because the resulting usage patterns don't hit it. This particular ISO download has the virtue of being completely reproducible. Does any of the above help?
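For capturing stacks like the ones above over time, a small loop saves retyping. This is a sketch (procstat is FreeBSD-specific; the script name and interval are arbitrary):

```shell
#!/bin/sh
# Sample the kernel stack of a (possibly stuck) PID every 2 seconds.
# Usage: ./stacksample.sh <pid>   -- stops when the process exits.
pid=$1
while kill -0 "$pid" 2>/dev/null; do
    date
    procstat -kk "$pid"
    sleep 2
done
```

Redirect the output to a file and the whole history of the hang is preserved for whoever analyses it later.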
Also here's an updated zpool:

nas4free:~# zpool status pool1
  pool: pool1
 state: ONLINE
  scan: scrub canceled on Fri Jan 25 22:07:51 2013
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada3p2  ONLINE       0     0     0
            ada4p2  ONLINE       0     0     0
            ada5p2  ONLINE       0     0     0

errors: No known data errors

nas4free:~# zpool iostat -v pool1
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool1       3.01T  2.43T    168      0   360K  3.42K
  raidz1    3.01T  2.43T    168      0   360K  3.42K
    ada2p2      -      -    124      0   104K  2.34K
    ada3p2      -      -    123      0   104K  2.28K
    ada4p2      -      -    124      0   103K  2.34K
    ada5p2      -      -    126      0  98.1K  2.29K
----------  -----  -----  -----  -----  -----  -----

nas4free:~# zpool get all pool1
NAME   PROPERTY       VALUE                SOURCE
pool1  size           5.44T                -
pool1  capacity       55%                  -
pool1  altroot        -                    default
pool1  health         ONLINE               -
pool1  guid           1701438519865110975  default
pool1  version        28                   default
pool1  bootfs         -                    default
pool1  delegation     on                   default
pool1  autoreplace    off                  default
pool1  cachefile      -                    default
pool1  failmode       wait                 default
pool1  listsnapshots  off                  default
pool1  autoexpand     off                  default
pool1  dedupditto     0                    default
pool1  dedupratio     1.46x                -
pool1  free           2.43T                -
pool1  allocated      3.01T                -
pool1  readonly       off                  -
pool1  comment        -                    default
pool1  expandsize     0                    -

nas4free:~# zfs get all pool1/videos
NAME          PROPERTY              VALUE                  SOURCE
pool1/videos  type                  filesystem             -
pool1/videos  creation              Sat Jan 19  5:12 2013  -
pool1/videos  used                  526G                   -
pool1/videos  available             1.74T                  -
pool1/videos  referenced            526G                   -
pool1/videos  compressratio         1.00x                  -
pool1/videos  mounted               yes                    -
pool1/videos  quota                 none                   local
pool1/videos  reservation           none                   local
pool1/videos  recordsize            128K                   default
pool1/videos  mountpoint            /mnt/pool1/videos      inherited from pool1
pool1/videos  sharenfs              off                    default
pool1/videos  checksum              on                     default
pool1/videos  compression           off                    local
pool1/videos  atime                 off                    local
pool1/videos  devices               on                     default
pool1/videos  exec                  on                     default
pool1/videos  setuid                on                     default
pool1/videos  readonly              off                    local
pool1/videos  jailed                off                    default
pool1/videos  snapdir               hidden                 local
pool1/videos  aclmode               discard                default
pool1/videos  aclinherit            restricted             default
pool1/videos  canmount              on                     local
pool1/videos  xattr                 off                    temporary
pool1/videos  copies                1                      default
pool1/videos  version               5                      -
pool1/videos  utf8only              off                    -
pool1/videos  normalization         none                   -
pool1/videos  casesensitivity       sensitive              -
pool1/videos  vscan                 off                    default
pool1/videos  nbmand                off                    default
pool1/videos  sharesmb              off                    default
pool1/videos  refquota              none                   default
pool1/videos  refreservation        none                   default
pool1/videos  primarycache          all                    default
pool1/videos  secondarycache        all                    default
pool1/videos  usedbysnapshots       0                      -
pool1/videos  usedbydataset         526G                   -
pool1/videos  usedbychildren        0                      -
pool1/videos  usedbyrefreservation  0                      -
pool1/videos  logbias               latency                default
pool1/videos  dedup                 off                    local
pool1/videos  mlslabel              -
pool1/videos  sync                  standard               local
pool1/videos  refcompressratio      1.00x                  -
pool1/videos  written               526G                   -
Recommendations:

1. Instead of /dev/random, use /dev/zero. /dev/random is not blazingly fast, given it has to harvest entropy from lots of places. If you're doing I/O speed testing, just use /dev/zero; the speed difference is quite big.

2. For dd, instead of bs=512 use bs=64k. bs=512 isn't very ideal; these are direct I/O writes of 512 bytes each, which is dog slow. I repeat: dog slow. Linux does this differently.

3. During the dd, in another VTY or window, use "gstat -I500ms" and watch the I/O speeds for your ada[2345] disks. They should hit peaks between 60-150MBytes/sec under the far right "Kbps" field (far left=read, far right=write). The large potential speed variance has to do with how much data you already have on the pool: mechanical drives get slower as the actuator arms move inward towards the spindle motor. That's why you might see, for example, 150MBytes/sec when reading/writing low-numbered LBAs but slower speeds on high-numbered LBAs. This speed will be "bursty" and "sporadic" due to how the ZFS ARC works. The interval at which "things are flushed to disk" is based on the vfs.zfs.txg.timeout sysctl, which on FreeBSD 9.1-RELEASE should default to 5 (5 seconds).

4. "zpool iostat -v {pool}" does not provide accurate speed indications, for the same reason a bare "iostat" doesn't show what most people hope for while "iostat 1" does. You need to run it with an interval, i.e. "zpool iostat -v {pool} 1", and let it run for a while during I/O. But I recommend gstat, as I said, simply because the interval can be set to 500ms (0.5s) and you get a better idea of what your peak I/O speed is. If you find a single disk that is **always** performing badly, then that disk is your bottleneck and I can help you analyse its problem.

5. Your "zpool scrub" speed of 14MBytes/second indicates you are nowhere close to your ideal I/O speed. It should not be that slow unless you're doing tons of I/O at the same time as the scrub. Also, scrubs take longer now due to the disabling of the vdev cache (and that's not a FreeBSD thing, it's that way in Illumos too, and it's a sensitive topic to discuss).

6. On FreeBSD 9.1-RELEASE, generally speaking, you should not have to tune any sysctls. The situation was different in 8.x and 9.0. Your system only has 4GB of RAM, so prefetching automatically gets disabled, by the way, just in case you were wondering (there were problems with prefetch in older releases).

7. You should probably keep "top -s 1" running, and you might even consider "top -S -s 1" to see system/kernel threads (they're in brackets). This isn't going to tell you outright what's making things slow, though. "vmstat -i" during heavy I/O would be useful too, in case you somehow have a shared interrupt being pegged hard (for example, I've seen SATA and USB controllers sharing an interrupt, even with APICs, where the USB layer was busted, churning out 1000 ints/sec and thus affecting SATA I/O speed).

8. If you want to compare systems I'm happy to do so, although I have fewer disks than you (3 in raidz1, WD Red 1TB drives). However, my system is not a Pentium D-class processor; it's a Core 2 Quad Q9500. The D-class stuff is fairly old.

I have some other theories as well, but one thing at a time.

--
| Jeremy Chadwick                          jdc@koitsu.org |
| UNIX Systems Administrator       http://jdc.koitsu.org/ |
| Mountain View, CA, US                                   |
| Making life hard for others since 1977.    PGP 4BD6C0CB |
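Recommendations 1 and 2 above combine into a one-line test. A sketch (the target path is a placeholder: point "of=" at a file on the pool under test; /tmp is used here only for illustration):

```shell
# Sequential-write sanity check: large block size, zero source.
# ~64 MiB here; raise count for a sustained run (count=16384 is ~1 GiB).
dd if=/dev/zero of=/tmp/zfs-writetest bs=64k count=1024
rm /tmp/zfs-writetest
```

dd prints the achieved bytes/sec on completion, which is the number to compare against the per-disk rates gstat shows.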
I appreciate your effort very much. One change from last time: I set failmode=continue (from 'wait'). There were no errors in zpool or other logs.

re 1,2: Switching to bs=64k and /dev/zero as the input source, the dd completes and an ls -l during the process causes no problems. Transfer speed reported by dd: 183MBytes/sec.

re 3: gstat -I500ms shows each of the raid partitions hanging in at a top speed of about 50-ish MBps during the above dd. All disks perform within a couple of % of one another.

re 4: We'll stick with gstat.

re 5: I see we agree about that. Nothing happened during the scrub other than the scrub. 14Mbps scrub speed is 1/6th of the minimum I'd expect.

re 6: Tuning Is Evil. On the other hand, Crashing is Eviler. So I tried Evil Tuning, with poor results as noted upstream. So perhaps tuning is indeed not only evil but a black hole for time as well.

Switching back to the 512-byte writes, I notice on gstat that the writes are zero for 4 secs or so, then there's a burst of activity (all partitions within a few % of one another), then all quiet; repeat. Then I throw in the ls -l... and... it all works with no problems. Trying dd with the big blocksize and /dev/urandom: gstat looks as bursty as with the 512-byte writes; ls -l causes no changes, works. Trying the original /dev/urandom and bs=512: the gstat pattern does the usual bursty thing, and ls -l works fine. Thinking it's about the gstat... Doing top -S -s 1... nothing.

nas4free:~# vmstat -i
interrupt                          total       rate
irq1: atkbd0                         168          0
irq18: uhci2                        1136          0
irq20: hpet0                      620136        328
irq21: uhci0 ehci0                  7764          4
irq256: hdac0                         37          0
irq257: ahci0                       1090          0
irq258: em0                         5625          2
irq259: ahci1                     849746        450
Total                            1485702        787

So weird. It's not crashing.

So, I tried to move a video from one dataset to another using mv. Within moments all the writes gstat shows have stopped, while there's a few K/s of reads. And there it sits.
Here's top:

last pid:  3932;  load averages:  0.02,  0.04,  0.13   up 0+00:24:46  05:03:53
249 processes: 3 running, 229 sleeping, 17 waiting
CPU:  0.0% user,  0.0% nice,  1.2% system,  0.0% interrupt, 98.8% idle
Mem: 14M Active, 19M Inact, 953M Wired, 8256K Buf, 2246M Free
Swap:

  PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   11 root     155 ki31     0K    32K CPU1    1  22:02 100.00% idle{idle: cpu1}
   11 root     155 ki31     0K    32K RUN     0  22:06  98.49% idle{idle: cpu0}
   13 root      -8    -     0K    48K -       1   0:06   0.10% geom{g_up}
    5 root      -8    -     0K    80K zio->i  0   0:05   0.10% zfskern{txg_thread_enter}
    0 root     -16    0     0K  2624K sched   1   2:30   0.00% kernel{swapper}
    0 root     -16    0     0K  2624K -       0   0:30   0.00% kernel{zio_read_intr_1}
    0 root     -16    0     0K  2624K -       0   0:30   0.00% kernel{zio_read_intr_0}
    0 root     -16    0     0K  2624K -       1   0:10   0.00% kernel{zio_write_issue_}
    0 root     -16    0     0K  2624K -       1   0:10   0.00% kernel{zio_write_issue_}
   12 root     -88    -     0K   272K WAIT    1   0:08   0.00% intr{irq259: ahci1}
    0 root     -16    0     0K  2624K -       1   0:07   0.00% kernel{zio_write_intr_2}
    0 root     -16    0     0K  2624K -       1   0:07   0.00% kernel{zio_write_intr_3}
    0 root     -16    0     0K  2624K -       1   0:07   0.00% kernel{zio_write_intr_6}
    0 root     -16    0     0K  2624K -       1   0:07   0.00% kernel{zio_write_intr_1}
    0 root     -16    0     0K  2624K -       0   0:07   0.00% kernel{zio_write_intr_0}
    0 root     -16    0     0K  2624K -       0   0:07   0.00% kernel{zio_write_intr_4}
    0 root     -16    0     0K  2624K -       1   0:07   0.00% kernel{zio_write_intr_7}
    0 root     -16    0     0K  2624K -       1   0:07   0.00% kernel{zio_write_intr_5}
   13 root      -8    -     0K    48K -       1   0:07   0.00% geom{g_down}
 3919 root      20    0  9916K  1636K tx->tx  1   0:07   0.00% mv
   20 root      16    -     0K    16K syncer  0   0:03   0.00% syncer
   12 root     -60    -     0K   272K WAIT    0   0:01   0.00% intr{swi4: clock}
   12 root     -52    -     0K   272K WAIT    1   0:01   0.00% intr{swi6: Giant task}
   14 root     -16    -     0K    16K -       0   0:01   0.00% yarrow
    5 root      -8    -     0K    80K arc_re  0   0:01   0.00% zfskern{arc_reclaim_thre}
   12 root     -84    -     0K   272K WAIT    1   0:01   0.00% intr{irq1: atkbd0}
    0 root     -16    0     0K  2624K -       1   0:00   0.00% kernel{zio_write_issue_}
    0 root      -8    0     0K  2624K -       0   0:00   0.00% kernel{zil_clean}
    0 root     -16    0     0K  2624K -       1   0:00   0.00% kernel{zio_write_issue_}

Set failmode back to wait. Rebooted from the livecd and repeated the tests: they work properly.

So, with gstat running, I once again try to move a video file from one dataset to another. I watch a bunch of reads happen, then a burst of writes, and that goes on for a couple of minutes. Then the writes stop and the reads continue at 3KBps according to gstat. The system still has 2.5G of memory free. There is no writing to the zpool. I notice that one of the video files has moved, and the writing stopped when it was trying to open the next one, I'm guessing. I can successfully ^C out of the mv, and notice a constant low-speed read going on, 80-100KBps. I try restarting the move: no change to the read pattern, no writes.

So: no memory explosion, ^C works, but I can't write, and gstat reports a steady low-pace read for no obvious reason, just as when the mv process was active. An even bigger puzzle.
Re 1,2: That transfer speed (183MBytes/second) sounds much better / much more plausible for what's going on. The speed-limiting factors were certainly the small block size (512 bytes) used by dd, and using /dev/random rather than /dev/zero. I realise you're probably expecting to see something like 480MBytes/second (4 drives * 120MB/sec), but that's probably not going to happen on that model of system and with that CPU. For example, on my Q9550 system described earlier, I can get about this:

$ dd if=/dev/zero of=testfile bs=64k
^C27148+0 records in
27147+0 records out
1779105792 bytes transferred in 6.935566 secs (256519186 bytes/sec)

While that runs, "gstat -I500ms" shows each disk going between 60MBytes/sec and 140MBytes/sec, and "zpool iostat -v data 1" shows 120-220MBytes/sec at the pool level and around 65-110MBytes/sec per disk. Anyway, the point being: things are faster with a large bs and from a source that doesn't churn interrupts. But don't necessarily "pull a Linux" and start doing things like bs=1m -- as I said before, Linux dd is different, because the I/O is cached (without --direct), while on FreeBSD dd is always direct.

Re 3: That sounds a bit on the slow side; I would expect those disks, at least during writes, to do more. But if **all** the drives show this behaviour consistently in gstat, then you know the issue IS NOT with an individual disk, and instead lies elsewhere. That rules out one piece of the puzzle, and that's good.

Re 5: Did you mean to type 14MBytes/second, not 14mbits/second? If so, yes, I agree that's slow. Scrubbing is not necessarily a good way to "benchmark" disks, but I understand that for "benchmarking" ZFS it's the best you've got, to some degree.

Regarding dd'ing with 512 bytes -- as I described in my previous mail:

> This speed will be "bursty" and "sporadic" due to how the ZFS ARC
> works. The interval at which "things are flushed to disk" is based on
> the vfs.zfs.txg.timeout sysctl, which on FreeBSD 9.1-RELEASE should
> default to 5 (5 seconds).

This is where your "4 secs or so" magic value comes from. Please do not change this sysctl/value; keep it at 5.

Finally, your vmstat -i output shows something of concern, UNLESS you took it WHILE the dd was running (block size doesn't matter) with /dev/random or /dev/urandom as the source (they're the same thing on FreeBSD):

> irq20: hpet0                      620136        328
> irq259: ahci1                     849746        450

These interrupt rates are quite high. hpet0 refers to your event timer/clock timer (see kern.eventtimer.choice and kern.eventtimer.timer) being HPET, and ahci1 refers to your Intel ICH7 AHCI controller. Basically, you're generating a ton of interrupts doing dd if=/dev/urandom bs=512, and it makes perfect sense to me why: /dev/urandom has to harvest entropy from interrupt sources (please see the random(4) man page), and you're generating a lot of interrupts on your AHCI controller for each individual 512-byte write.

When you say "move a video from one dataset to another", please explain what you're moving from and to. Specifically: which filesystems, plus output from "zfs list". If you're moving a file from one ZFS filesystem to another ZFS filesystem on the same pool, please state that. That may help kernel folks figure out where your issue lies.

At this stage, a kernel developer is going to need to step in and help figure out where the actual bottleneck is occurring. This is going to be very difficult/complex/very likely not possible with you using nas4free, because you will almost certainly be asked to rebuild world/kernel with some new options, and possibly with DTrace/CTF support included (for real-time debugging). The situation is tricky.
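On the dataset-to-dataset question: each ZFS dataset is a separate mounted filesystem, so a mv between two datasets on the same pool cannot be a cheap rename and falls back to copy-then-unlink, generating real read and write I/O. A sketch of the requested diagnostics (pool name from this thread; the destination dataset path is hypothetical):

```shell
# Show every dataset -- each line is a separately mounted filesystem.
zfs list -r pool1

# A mv whose source and destination sit under different mountpoints
# above is a full data copy followed by an unlink, not an atomic rename.
mv /mnt/pool1/videos/some.mkv /mnt/pool1/other/
```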
It would really help if you would/could remove nas4free from the picture and instead just run stock FreeBSD, because, as I said, if there are kernel tunings or adjusted values the nas4free folks put in place that stock FreeBSD doesn't have, those could be harming you. I can't be of more help here, I'm sorry to say. The good news is that your disks sound fine. Kernel developers will need to take this up.

P.S. -- I would strongly recommend updating your nas4free forum post with a link to the conversation in this PR. IMO, the nas4free people need to step up and take responsibility (and that almost certainly means talking/working with the FreeBSD folks).

--
| Jeremy Chadwick                          jdc@koitsu.org |
| UNIX Systems Administrator       http://jdc.koitsu.org/ |
| Mountain View, CA, US                                   |
| Making life hard for others since 1977.    PGP 4BD6C0CB |
For bugs matching the following criteria:

  Status: In Progress
  Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags. Mail being skipped.