| Summary: | SWAP on ZFS can crash server | | |
|---|---|---|---|
| Product: | Base System | Reporter: | Slawomir Wojciech Wojtczak <vermaden> |
| Component: | kern | Assignee: | freebsd-fs (Nobody) <fs> |
| Status: | Closed Overcome By Events | | |
| Severity: | Affects Many People | CC: | cmangin, davor.cubranic, deadelk, john, marcus, nowak, pi, scottro11, smh, vanav, ygy |
| Priority: | --- | | |
| Version: | 10.1-STABLE | | |
| Hardware: | Any | | |
| OS: | Any | | |
Description
Slawomir Wojciech Wojtczak
2015-04-05 22:32:38 UTC
Increasing v_free_severe to give ZFS some room seems to help:

```
sysctl vm.v_free_target=60000
sysctl vm.v_free_min=55000
sysctl vm.v_free_severe=50000
```

Without some serious work, swap on ZFS will be error prone, as ZFS can require memory to complete its I/O pipeline and swap data to its vdevs, so if memory is really short it can easily cause a deadlock. So while light use of swap on ZFS can work, heavy use of it will almost certainly fail. Given this, I would never recommend ZFS-backed swap.

My thought was that if it's 'production ready' on Solaris, then it would be the same (or better) on FreeBSD ...

http://docs.oracle.com/cd/E26502_01/html/E29006/gizfl.html

Regards,
vermaden

I'm afraid not.

Using the recommended (AFAIK) settings for a ZFS swap volume (below) combined with nowak's v_free sysctl recommendation allows me to run your test script successfully (meaning the system doesn't hang and the script is killed):

```
zfs create -V <size> -o org.freebsd:swap=on -o checksum=off -o compression=off \
    -o dedup=off -o sync=disabled -o primarycache=none <pool>/<volname>
```

That said, I'm not sure what (if any) action this ticket should spur. Is there specific documentation you (OP) would like to see updated?

FWIW, on my test VM with 3GB RAM and a 2GB swap volume (with ZFS properties as described in my previous comment), the test script hangs the system with the default settings:

```
vm.v_free_min:      4833
vm.v_free_target:   16218
vm.v_free_reserved: 1038
vm.v_free_severe:   2935
```

but not with these (just bumping up v_free_min and v_free_severe):

```
vm.v_free_min:      8192
vm.v_free_target:   16218
vm.v_free_reserved: 1038
vm.v_free_severe:   7168
```

On this system the magic number for v_free_severe seems to be somewhere between 5120 (which hangs) and 7168 (which doesn't). I have added a note about vm.v_free_severe to https://wiki.freebsd.org/RootOnZFS#ZFS_Swap_Volume, which is where I got the recommended swap volume settings from in the first place.

So a rule of thumb could be something like this?

```
vm.v_free_min    = 3 * RAM (GB) * 1024
vm.v_free_severe = 3 * RAM (GB) * 1024
```

For what it's worth, I've been doing swap on a zvol for a long time now on -CURRENT without problems. Using "-o checksum=off -o compression=off -o dedup=off -o sync=disabled -o primarycache=none" seems to be key. I've hammered on it quite a bit to try and get it to crash. My memory allocator looks different from yours; this one is faster but just sets the memory to zeroes with explicit_bzero():

```c
#include <stdlib.h>
#include <stdio.h>
#include <strings.h>	/* explicit_bzero() on FreeBSD */

#define HUNK 1048576	/* allocate in 1 MB hunks */

int
main(void)
{
	unsigned long long i;

	/*
	 * Allocate and zero memory forever, printing the running total,
	 * until the kernel kills the process. No error handling: if
	 * malloc() fails the process just crashes, which ends the test
	 * anyway.
	 */
	for (i = 0; ; i++) {
		explicit_bzero(malloc(HUNK), HUNK);
		printf("\r%llu", i * HUNK);
		fflush(stdout);
	}
	return (EXIT_SUCCESS);
}
```

@marcus@blazingdot.com Have you tried the 'perl' one? (not mine unfortunately, got it from someone)

I agree on most of these options for a SWAP ZVOL, but using 'compression=off' instead of 'compression=lz4' seems like a big 'loss' ... also in terms of IOPS, because you need to allocate fewer blocks for the same data.
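(For illustration only, a sketch of what the lz4 suggestion would look like, reusing the recommended property set quoted earlier in this thread; `zroot/swap` and the 2G size are placeholders, not values from this report:)

```
zfs create -V 2G -o org.freebsd:swap=on -o checksum=off -o compression=lz4 \
    -o dedup=off -o sync=disabled -o primarycache=none zroot/swap
swapon /dev/zvol/zroot/swap
```

Note that `org.freebsd:swap=on` only marks the zvol so the rc scripts activate it as swap at boot; `swapon` enables it immediately.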
> Have you tried the 'perl' one? (not mine unfortunately, got it from someone)

I ran it a few times (it is quite fast) right in the middle of a 'make -j16 buildkernel' compile with several Chrome windows up. No problems; the process is killed when swap is exhausted. Aside from setting the properties on the zroot/swap dataset, I have done no ZFS-related tuning.
I'm running recent -CURRENT on amd64 with 8GB RAM and a 2GB swap zvol on an SSD.
Swap on zvol has been rock solid for me since I started using it again after the ARC/UMA rebalance/low memory handling fix that happened last year. I would be happy to tweak things to try to get it to crash if you have some tunables that you would like to test.
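(For anyone wanting to reproduce this kind of test, here is a rough sketch of the procedure described above. The exact Perl script is not included in this report, so any memory hog, such as the C allocator earlier in the thread, stands in for it; `./memhog` is a hypothetical binary name:)

```
# terminal 1: generate build load plus memory pressure
make -j16 buildkernel &
./memhog                          # the C allocator above, or the Perl script

# terminal 2: watch swap and free memory while it runs
swapinfo -h
sysctl vm.stats.vm.v_free_count
top
```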
@marcus@blazingdot.com Currently only 'compression=lz4' comes to my mind.

> Currently only 'compression=lz4' comes to my mind.
It would surprise nobody if turning on compression caused problems.
If it's any consolation, here are the default locally set flags on the swap zvol on a Solaris 10 machine:
```
rpool/swap  volsize         4G        local
rpool/swap  checksum        off       local
rpool/swap  compression     off       local
rpool/swap  refreservation  4G        local
rpool/swap  primarycache    metadata  local
```
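(This output has the shape of `zfs get`; on a FreeBSD system the equivalent check could be run as below, with `zroot/swap` as a placeholder volume name:)

```
zfs get -o name,property,value,source \
    volsize,checksum,compression,refreservation,primarycache zroot/swap
```

Note the Solaris default of primarycache=metadata, versus primarycache=none in the FreeBSD recommendation quoted earlier.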
I tried running Vermaden's Perl script on a freshly installed 10.1-RELEASE on a ThinkPad 430s with 8GB RAM and a 2G swap zvol on an SSD, with no tuning whatsoever. I experienced no lockups: the script was killed after about a minute and I barely even saw any slowdown in the interactive console. Just another anecdotal observation.

I've got a fresh -CURRENT machine that I set up with swap on a zvol and am able to get it to fail reliably. There's a git process that eats all RAM and then goes about 1GB into swap (with 4GB available), and then swap just appears to stop working. A bunch of processes get stuck in a pfault state and the box becomes unusable and needs to be reset. The zvol is set up with compression, dedup, primarycache, and sync all off. Something is definitely still very broken. My workaround in this case is to use a vnode-backed md device as a swap file, which is still not ideal but is stable enough until I can rebuild the machine with normal swap partitions.

As for the machine in my last comment, I've tuned the problem away and I'd like to share how:

```
vm.v_free_target=60000
vm.v_free_min=55000
vm.v_free_severe=50000
vm.v_free_reserved=32000
vfs.zfs.arc_free_target=180000
```

Now, free memory never gets down below about 50MB or so, and normally stays above 150MB even when under pressure and swapping. I'm swapping to a zvol with all of the swap tunables set. The default values for these look more like:

```
vm.v_free_target: 42948
vm.v_free_min: 12741
vm.v_free_severe: 7706
vm.v_free_reserved: 2672
vfs.zfs.arc_free_target: 14014
```

Note that I increased vm.v_free_reserved quite a bit, but I don't know exactly what that one does:

```
vm.v_free_reserved: Pages reserved for deadlock
```

The description seems to make it a prime candidate for tuning in this situation, so even though nobody has mentioned it yet, I raised it. Things work great. I will post again if I manage to break it again.

I'd like to point out this page, which has relevant information for those wishing to tune the VM system: https://wiki.freebsd.org/AvgPageoutAlgorithm

The tunable vm.pageout_wakeup_thresh hasn't been mentioned here yet and seems to be important for the way the VM subsystem works. So with that in mind, here's my current configuration for an 8GB machine with swap on zvol:

```
vm.v_free_severe=20460
vm.v_free_min=32768
vm.v_free_target=51200
vm.pageout_wakeup_thresh=36044
vfs.zfs.arc_free_target=36044
```

Note that I've picked:

```
vm.pageout_wakeup_thresh = vm.v_free_min * 11/10
vfs.zfs.arc_free_target  = vm.pageout_wakeup_thresh
```

as this is the way they are calculated by default. As a reference, the default values for my system are:

```
vm.v_free_reserved=2650 (untouched)
vm.v_free_severe=7642
vm.v_free_min=12634
vm.v_free_target=42586
vm.pageout_wakeup_thresh=13893
vfs.zfs.arc_free_target=13893
```

I subjected my system to memory pressure, notably with runs of poudriere generating up to 3GB of swap usage. I'm happy to report that my system remained rock stable and very responsive during the tests.

(In reply to Chris M from comment #17)
The information in that link is outdated. There have been some notable changes in the pageout algorithms since that article was written.

@Andriy Gapon Where to find up-to-date information then?

(In reply to vermaden from comment #19)
In the source code...

@Andriy Gapon I expect such comments from the Linux community ... Pity.

Regards,
vermaden

So, does anyone know if the tuning is still necessary? I am not a coder, and looking at the source code won't really help me. If anyone has tested recently, I'd be grateful if they would share their knowledge.
I can confirm that swapping to ZFS with the default sysctls is a bad idea in production ... just run the Perl script and watch 'top' in another vty.

(In reply to deadelk from comment #23)
On FreeBSD 11, the time taken to kill the app is about 5 minutes.

(In reply to deadelk from comment #24)
You can try to lower the ARC. I think it's too much to take all memory for the ARC by default. Why? For what? vfs.zfs.arc_max="512M" just works identically (speed-wise) on a 1GB RAM server and on an 8GB RAM server.
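(The vfs.zfs.arc_max line above is loader-tunable syntax; as a sketch, capping the ARC that way would go in /boot/loader.conf and take effect on the next boot:)

```
# /boot/loader.conf
# Cap the ZFS ARC so it cannot grow to consume nearly all RAM.
# 512M is the value suggested above, not a general recommendation.
vfs.zfs.arc_max="512M"
```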