Bug 199189

Summary: SWAP on ZFS can crash server
Product: Base System
Reporter: vermaden
Component: kern
Assignee: freebsd-fs mailing list <fs>
Status: Closed Overcome By Events
Severity: Affects Many People
CC: cmangin, davor.cubranic, deadelk, john, marcus, nowak, pi, scottro11, smh, vanav
Priority: ---
Version: 10.1-STABLE
Hardware: Any
OS: Any

Description vermaden 2015-04-05 22:32:38 UTC
Running swap on a zvol is a bad idea, because it will eventually crash the server when thrashing happens.

You can simulate the thrashing by running the following Perl script; it will eventually fill up all your available memory.

===============================================================================
#!/usr/bin/perl
my $count=0;
my @data;
my $temp_data;
for(my $i=0;$i<10000000;$i++) {
        $temp_data.="1234567890abcdefghijklmnopqrstuvwxyz";
}

while(1) {
        $data[$count++]=$temp_data;
}
===============================================================================

Tested on FreeBSD 10.1-STABLE.

With zvol swap, an 8 GB RAM FreeBSD server will stall within a minute.

Without swap, or with a dedicated swap disk/partition, the server will automatically kill the Perl process.
Comment 1 nowak 2015-04-15 23:08:16 UTC
Increasing v_free_severe to give zfs some room seems to help.

sysctl vm.v_free_target=60000
sysctl vm.v_free_min=55000
sysctl vm.v_free_severe=50000
Comment 2 Steven Hartland freebsd_committer 2015-04-15 23:12:43 UTC
Without some serious work, swap on ZFS will be error-prone: ZFS can require memory to complete its I/O pipeline and write swap data to its vdevs, so if memory is really short it can easily deadlock.

So while light use of swap on ZFS can work, heavy use of it will almost certainly fail.

Given this I would never recommend ZFS backed swap.
Comment 3 vermaden 2015-04-16 05:32:21 UTC
My thought was that if it's 'production ready' on Solaris, then it would be the same (or better) on FreeBSD ...

http://docs.oracle.com/cd/E26502_01/html/E29006/gizfl.html

Regards,
vermaden
Comment 4 Steven Hartland freebsd_committer 2015-04-16 07:30:52 UTC
I'm afraid not.
Comment 5 John Nielsen 2015-04-16 16:53:20 UTC
Using the recommended (AFAIK) settings for a ZFS swap volume (below) combined with nowak's v_free sysctl recommendation allows me to run your test script successfully (meaning the system doesn't hang and the script is killed).

zfs create -V <size> -o org.freebsd:swap=on -o checksum=off -o compression=off -o dedup=off -o sync=disabled -o primarycache=none <pool>/<volname>

That said, I'm not sure what (if any) action this ticket should spur. Is there specific documentation you (OP) would like to see updated?
Comment 6 John Nielsen 2015-04-16 17:30:13 UTC
FWIW, on my test VM with 3GB RAM and a 2GB swap volume (with ZFS properties as described in my previous comment) the test script hangs the system with the default settings:
vm.v_free_min: 4833
vm.v_free_target: 16218
vm.v_free_reserved: 1038
vm.v_free_severe: 2935

but not with these (just bumping up v_free_min and v_free_severe):
vm.v_free_min: 8192
vm.v_free_target: 16218
vm.v_free_reserved: 1038
vm.v_free_severe: 7168

On this system the magic number for v_free_severe seems to be between 5120 (which hangs) and 7168 (which doesn't).

I have added a note about vm.v_free_severe to
https://wiki.freebsd.org/RootOnZFS#ZFS_Swap_Volume, which is where I got the recommended swap volume settings from in the first place.
Comment 7 vermaden 2015-04-21 06:14:13 UTC
So a rule of thumb can be something like this?

vm.v_free_min    = 3 * RAM (GB) * 1024
vm.v_free_severe = 3 * RAM (GB) * 1024
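For an 8 GB machine, that proposed rule of thumb would work out as follows (a quick sketch; the 3 * 1024 pages-per-GB constant is the proposal above, not an established default):

```shell
# Hypothetical worked example of the proposed rule of thumb (values in pages)
ram_gb=8
echo $((3 * ram_gb * 1024))   # 24576 pages for both v_free_min and v_free_severe
```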
Comment 8 Marcus Reid 2015-04-22 21:20:34 UTC
For what it's worth, I've been doing swap on a zvol for a long time now on -CURRENT without problems.  Using "-o checksum=off -o compression=off -o dedup=off -o sync=disabled -o primarycache=none" seems to be key.

I've hammered on it quite a bit to try and get it to crash.  My memory allocator looks different from yours; this one is faster but just sets the memory to zeroes with explicit_bzero().

#include <stdlib.h>
#include <stdio.h>
#include <strings.h>

#define HUNK 1048576

int main(void) {
	unsigned long long i;
	for(i = 0 ; ; i++) {
		void *p = malloc(HUNK);
		if (p == NULL)	/* guard against a NULL dereference if malloc ever fails */
			break;
		explicit_bzero(p, HUNK);
		printf("\r%llu", i * HUNK);
		fflush(stdout);
	}
	return(EXIT_SUCCESS);
}
Comment 9 vermaden 2015-04-22 21:22:24 UTC
@marcus@blazingdot.com

Have You tried the 'perl' one? (not mine unfortunately, got it from someone)
Comment 10 vermaden 2015-04-22 21:26:12 UTC
I agree on most of these options for a swap zvol, but using 'compression=off' instead of 'compression=lz4' seems to be a big loss, also in terms of IOPS, because you need to allocate fewer blocks for the same data.
Comment 11 Marcus Reid 2015-04-22 21:41:31 UTC
> Have You tried the 'perl' one? (not mine unfortunately, got it from someone)

I ran it a few times (it is quite fast) right in the middle of a 'make -j16 buildkernel' compile with several Chrome windows up.  No problems; the process is killed when swap is exhausted.  Aside from setting the properties on the zroot/swap dataset, I have done no ZFS-related tuning.

I'm running recent -CURRENT on amd64 with 8GB RAM, 2GB swap zvol on an SSD.

Swap on zvol has been rock solid for me since I started using it again after the ARC/UMA rebalance/low memory handling fix that happened last year. I would be happy to tweak things to try to get it to crash if you have some tunables that you would like to test.
Comment 12 vermaden 2015-04-22 22:09:00 UTC
@marcus@blazingdot.com 

Currently only 'compression=lz4' comes to my mind.
Comment 13 Marcus Reid 2015-04-27 19:38:24 UTC
> Currently only 'compression=lz4' comes to my mind.

It would surprise nobody if turning on compression caused problems.

If it's any consolation, here are the default locally set flags on the swap zvol on a Solaris 10 machine:

rpool/swap  volsize               4G                     local
rpool/swap  checksum              off                    local
rpool/swap  compression           off                    local
rpool/swap  refreservation        4G                     local
rpool/swap  primarycache          metadata               local
Comment 14 Davor Cubranic 2015-05-13 21:13:39 UTC
I tried running Vermaden's Perl script on a freshly installed 10.1-RELEASE on a ThinkPad 430s with 8GB RAM and a 2G swap zvol on an SSD with no tuning whatsoever. I experienced no lockups: the script was killed after about a minute and I barely even saw any slowdown in the interactive console.
Comment 15 Marcus Reid 2015-06-15 21:30:33 UTC
Just another anecdotal observation...  I've got a fresh -CURRENT machine that I set up with swap on a zvol and am able to get it to fail reliably.  There's a git process that eats all RAM, then goes about 1GB into swap (with 4GB available), and then swap just appears to stop working. A bunch of processes get stuck in a pfault state and the box becomes unusable and needs to be reset.

The zvol is set up with compression, dedup, primarycache, sync all off. Something is definitely still very broken.

My workaround in this case is to use a vnode-backed md device as a swap file, which is still not ideal but is stable enough until I can rebuild the machine with normal swap partitions.
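For reference, a vnode-backed md swap setup looks roughly like this (a sketch of standard FreeBSD commands; the file path, size, and unit number are illustrative, not taken from this report):

```shell
# Create a swap file, attach it as a vnode-backed memory disk, enable swapping on it.
dd if=/dev/zero of=/usr/swap0 bs=1m count=4096   # illustrative 4 GB swap file
chmod 0600 /usr/swap0
mdconfig -a -t vnode -f /usr/swap0 -u 0          # attaches as /dev/md0
swapon /dev/md0
```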
Comment 16 Marcus Reid 2015-06-16 06:21:34 UTC
As for the machine in my last comment, I've tuned the problem away and I'd like to share how:

vm.v_free_target=60000
vm.v_free_min=55000
vm.v_free_severe=50000
vm.v_free_reserved=32000
vfs.zfs.arc_free_target=180000

Now, free memory never gets down below about 50MB or so, and normally stays up above 150MB even when under pressure and swapping.  I'm swapping to a zvol with all of the swap tunables set.

The default values for these look more like:

vm.v_free_target: 42948
vm.v_free_min: 12741
vm.v_free_severe: 7706
vm.v_free_reserved: 2672
vfs.zfs.arc_free_target: 14014

Note that I increased vm.v_free_reserved quite a bit, but I don't know exactly what it does. Its sysctl description reads:

vm.v_free_reserved: Pages reserved for deadlock

That description makes it a prime candidate for tuning in this situation, so even though nobody had mentioned it yet, I raised it.

Things work great. I will post again if I manage to break it again.
Comment 17 Chris 2016-03-27 23:40:19 UTC
I'd like to point out this page which has relevant information for those wishing to tune the VM system: 

https://wiki.freebsd.org/AvgPageoutAlgorithm

The tunable vm.pageout_wakeup_thresh hasn't been mentioned here yet and seems to be important for the way the VM subsystem works.

So with that in mind, here's my current configuration for an 8 GB machine with swap on a zvol.

vm.v_free_severe=20460
vm.v_free_min=32768
vm.v_free_target=51200
vm.pageout_wakeup_thresh=36044
vfs.zfs.arc_free_target=36044

Note that I've picked:

vm.pageout_wakeup_thresh = vm.v_free_min * 11/10
vfs.zfs.arc_free_target = vm.pageout_wakeup_thresh

as these match how they are calculated by default.
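Those relationships can be checked with integer arithmetic (a quick sketch using the v_free_min value from the configuration above):

```shell
v_free_min=32768
# vm.pageout_wakeup_thresh = v_free_min * 11/10, using integer math
pageout_wakeup_thresh=$((v_free_min * 11 / 10))
echo "$pageout_wakeup_thresh"   # 36044, matching the value set above
```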

As a reference, the default values for my system are:

vm.v_free_reserved=2650 (untouched)
vm.v_free_severe=7642
vm.v_free_min=12634
vm.v_free_target=42586
vm.pageout_wakeup_thresh=13893
vfs.zfs.arc_free_target=13893

I subjected my system to memory pressure, notably with runs of poudriere generating up to 3 GB of swap usage. I'm happy to report that my system remained rock stable and very responsive during the tests.
Comment 18 Andriy Gapon freebsd_committer 2016-03-28 09:00:57 UTC
(In reply to Chris M from comment #17)
The information in that link is outdated. There have been some notable changes in the page out algorithms since that article was written.
Comment 19 vermaden 2016-03-28 19:20:25 UTC
@Andriy Gapon

Where to find up to date information then?
Comment 20 Andriy Gapon freebsd_committer 2016-03-28 21:30:40 UTC
(In reply to vermaden from comment #19)
In the source code...
Comment 21 vermaden 2016-03-28 21:33:13 UTC
@Andriy Gapon

I expect such comments from the Linux community ...

Pity.

Regards,
vermaden
Comment 22 scottro11 2016-12-22 14:31:30 UTC
So, does anyone know if the tuning is still necessary?  I am not a coder and looking at the source code won't really help me.  If anyone has tested recently, I'd be grateful for them sharing their knowledge.
Comment 23 deadelk 2017-02-07 14:08:46 UTC
Confirming that swapping to ZFS with the default sysctls is a bad idea in production ...

Just run the Perl script and watch 'top' in another vty.
Comment 24 deadelk 2017-02-07 14:10:40 UTC
(In reply to deadelk from comment #23)

FreeBSD 11
Time taken to kill the app is about 5 minutes.
Comment 25 deadelk 2017-02-07 14:38:43 UTC
(In reply to deadelk from comment #24)
You can try to lower the ARC.

I think it's too much to take all memory for the ARC by default. Why? For what?

vfs.zfs.arc_max="512M"
works identically (speed-wise) on a 1GB RAM server and on an 8GB RAM server.