Bug 231457

Summary: Out of swap space on ZFS
Product: Base System
Reporter: dimka
Component: kern
Assignee: freebsd-fs (Nobody) <fs>
Status: Open
Severity: Affects Some People
CC: FreeBSD, alaa.alassafin, che, gro.dsbeerf.sgub, ish, lwhsu, mail, marklmi26-fbsd, max, mikeowens, ota, parashiva, sigsys
Priority: ---
Version: 11.2-RELEASE
Hardware: amd64
OS: Any
Attachments:
  Memory occupation test tool (flags: none)

Description dimka 2018-09-18 16:08:07 UTC
After occupying all RAM and several hundred megabytes of swap space, the kernel kills large processes with the messages:

Sep 14 03:04:30 hosting kernel: pid 2078 (mysqld), uid 88, was killed: out of swap space
Sep 14 03:06:26 hosting kernel: pid 7068 (mysqld), uid 88, was killed: out of swap space
Sep 14 03:06:32 hosting kernel: pid 2085 (clamd), uid 106, was killed: out of swap space

Tested on 3 real and 1 virtual machine with 1/2/4GB RAM and 8GB swap volume, on 11.2-RELEASE/amd64.
I did not check this on 11.0-RELEASE or 11.1-RELEASE.

This resembles another bug:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199189
However, I never hit that bug on 10.*/amd64, and the volume/sysctl tuning advice from its discussion did not help me on 11.2/amd64.

Detailed installation procedure

# Boot from install DVD, select "Shell" in "Partitioning" menu.
#
# gnop create -S 4096 /dev/ada0
# zpool create -O mountpoint=none zroot /dev/ada0.nop
#
# zfs create -V 8GB -o org.freebsd:swap=on zroot/swap
# zfs create -o quota=2GB -o mountpoint=/mnt zroot/root
# zfs create -o quota=15GB -o mountpoint=/mnt/tmp zroot/tmp
# zfs create -o quota=30GB -o mountpoint=/mnt/var zroot/var
# zfs create -o quota=30GB -o mountpoint=/mnt/usr zroot/usr
# zfs create -o quota=15GB -o mountpoint=/mnt/home zroot/home
#
# zpool export zroot
# gnop destroy /dev/ada0.nop
# dd if=/boot/zfsboot of=/dev/ada0 bs=512 count=1
# dd if=/boot/zfsboot of=/dev/ada0 bs=512 skip=1 seek=1024
# zpool import zroot
#
# exit
#
# Post-install, select "Live CD" mode.
#
# echo zfs_enable=\"YES\" >> /mnt/etc/rc.conf
# zfs umount -a
# zfs set mountpoint=legacy zroot/root
# zfs set mountpoint=/tmp zroot/tmp
# zfs set mountpoint=/var zroot/var
# zfs set mountpoint=/usr zroot/usr
# zfs set mountpoint=/home zroot/home
# zpool set bootfs=zroot/root zroot
#
# exit
Comment 1 Mike 2018-09-24 17:44:14 UTC
Experiencing same issue.

uname -a
FreeBSD sword 11.2-RELEASE-p3 FreeBSD 11.2-RELEASE-p3 #0: Thu Sep  6 07:14:16 UTC 2018     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

swapinfo -h
Device          1K-blocks     Used    Avail Capacity
/dev/da0p2       12582912     4.2M      12G     0%
/dev/da1p2       12582912     4.1M      12G     0%
Total            25165824     8.3M      24G     0%

sysctl hw | egrep 'hw.(phys|user|real)'
hw.physmem: 34297905152
hw.usermem: 1263333376
hw.realmem: 34359738368

tail /var/log/messages
Sep 24 12:30:29 sword kernel: pid 12993 (getty), uid 0, was killed: out of swap space
Sep 24 12:30:43 sword kernel: pid 12994 (getty), uid 0, was killed: out of swap space
Sep 24 12:30:58 sword kernel: pid 12995 (getty), uid 0, was killed: out of swap space
Sep 24 12:31:14 sword kernel: pid 12996 (getty), uid 0, was killed: out of swap space
Sep 24 12:31:28 sword kernel: pid 12997 (getty), uid 0, was killed: out of swap space
Sep 24 12:31:42 sword kernel: pid 12998 (getty), uid 0, was killed: out of swap space
Sep 24 12:31:57 sword kernel: pid 12999 (getty), uid 0, was killed: out of swap space
Sep 24 12:32:12 sword kernel: pid 13000 (getty), uid 0, was killed: out of swap space
Sep 24 12:32:27 sword kernel: pid 13001 (getty), uid 0, was killed: out of swap space
Sep 24 12:32:42 sword kernel: pid 13002 (getty), uid 0, was killed: out of swap space

cat /boot/loader.conf

accf_data_load="YES"
accf_http_load="YES"
autoboot_delay=3
cc_htcp_load="YES"
hw.igb.rx_abs_int_delay=1024
hw.igb.rx_int_delay=512
hw.igb.rxd=4096
hw.igb.tx_abs_int_delay=1024
hw.igb.tx_int_delay=512
hw.igb.txd=4096
hw.intr_storm_threshold=9000
if_bridge_load="YES"
if_tap_load="YES"
kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="0"
kern.ipc.nmbclusters=262144
kern.ipc.nmbjumbo16=32768
kern.ipc.nmbjumbo9=65536
kern.ipc.nmbjumbop=262144
kern.ipc.semaem=32767
kern.ipc.semmni=32767
kern.ipc.semmns=8192
kern.ipc.semmnu=4096
kern.ipc.semmsl=120
kern.ipc.semopm=200
kern.ipc.semume=80
kern.ipc.semusz=184
kern.ipc.semvmx=65534
kern.maxusers=1024
mlx4en_load="YES"
net.fibs=2
net.inet.tcp.hostcache.cachelimit="0"
net.inet.tcp.tcbhashsize=65536
net.inet.tcp.tso=0
net.isr.bindthreads=0
nmdm_load="YES"
vfs.zfs.arc_max="36G"
vfs.zfs.txg.timeout="5"
vfs.zfs.write_limit_override="536870912"
vfs.zfs.write_limit_override="536870912"
vmm_load="YES"
zfs_load="YES"


One possible culprit: vfs.zfs.arc_max is set to the size of physical memory. I am adjusting it to half of RAM and rebooting.
Comment 2 Mike 2018-09-25 16:49:14 UTC
Update: since adjusting vfs.zfs.arc_max to half of RAM (rather than all of it, which was an oversight) and rebooting, the problem has not manifested.
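
A minimal sketch of that adjustment, assuming the cap is given in bytes and derived from hw.physmem. The helper name arc_max_half is hypothetical (it is not a FreeBSD tool); it only prints the line to place in /boot/loader.conf.

```shell
# arc_max_half is a hypothetical helper, not part of FreeBSD.
# It takes the value of `sysctl -n hw.physmem` (bytes) and prints a
# loader.conf line that caps the ZFS ARC at half of that amount.
arc_max_half() {
    physmem_bytes=$1
    echo "vfs.zfs.arc_max=\"$(( physmem_bytes / 2 ))\""
}
```

With the hw.physmem value from comment 1, `arc_max_half 34297905152` prints `vfs.zfs.arc_max="17148952576"`. On a live system the argument would come from `sysctl -n hw.physmem`.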
Comment 3 dimka 2018-09-26 04:45:11 UTC
vfs.zfs.arc_max = 0.6 * RAM
by default, at least on 1/2/4G RAM.
Tuning it to 0.5 * RAM in /boot/loader.conf (and rebooting) has no effect in my case.

Remember that you need to occupy all physical memory, and almost the entire swap space, to reproduce this problem.
Comment 4 Shane 2018-09-28 06:16:31 UTC
Actually, you don't have to use all your swap to get out of swap errors. They also happen when too much RAM is wired, which prevents any RAM from being swapped in or out.

I started getting out of swap errors on 10.1 with 8G ram.

Look at "vm.stats.vm.v_wire_count * hw.pagesize" at the time you get the out of swap errors; this is the wired amount shown in top.

As I mentioned in bug #229764, max_wired is 30%, so the ARC should be less than 70% of RAM. Another thing that often wires RAM is bhyve, so any guest RAM should also be considered when setting arc_max.
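
The wired-memory check can be sketched as a small shell function. wired_report is a hypothetical name; its three arguments are assumed to come from `sysctl -n vm.stats.vm.v_wire_count`, `sysctl -n hw.pagesize` and `sysctl -n hw.physmem`.

```shell
# wired_report is a hypothetical helper, not a FreeBSD command.
# Arguments: wired page count, page size in bytes, physical RAM in bytes.
# It prints the wired amount in MiB and as a whole percentage of RAM.
wired_report() {
    wire_count=$1
    page_size=$2
    physmem=$3
    wired_bytes=$(( wire_count * page_size ))
    echo "wired: $(( wired_bytes / 1048576 )) MiB ($(( wired_bytes * 100 / physmem ))% of RAM)"
}
```

With illustrative inputs, `wired_report 24064 4096 1073741824` prints `wired: 94 MiB (9% of RAM)`. A percentage that approaches the wired limit mentioned above would point at wiring rather than at swap usage.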
Comment 5 dimka 2018-10-01 12:05:14 UTC
I tried this and saw that the system comes out of its stupor immediately after the processes are killed. Without this tuning, the system remained very slow for a few seconds.
However, even halving vfs.zfs.arc_max and vm.max_wired did not solve the problem in my case.
Also, if I suspend memory absorption when
(vm.stats.vm.v_laundry_count * vm.stats.vm.v_page_size) > 10M
the system does not kill processes, and cannot purge laundry pages to swap.
Probably there is some kind of deadlock or a similar problem.

CPU:  0.0% user,  0.0% nice, 11.3% system,  0.0% interrupt, 88.7% idle
Mem: 566M Active, 147M Inact, 130M Laundry, 94M Wired, 22M Free
ARC: 29M Total, 1048K MFU, 25M MRU, 32K Anon, 1152K Header, 1387K Other
     12M Compressed, 19M Uncompressed, 1.56:1 Ratio
Swap: 7678M Total, 30M Used, 7648M Free
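
The suspension condition above can be written as a shell test. laundry_over_limit is a hypothetical helper; its inputs are assumed to be the values of `sysctl -n vm.stats.vm.v_laundry_count` and `sysctl -n vm.stats.vm.v_page_size`, plus a limit that is assumed here to mean mebibytes.

```shell
# laundry_over_limit is a hypothetical helper, not a FreeBSD command.
# Arguments: laundry page count, page size in bytes, limit in MiB.
# It succeeds (exit status 0) only when laundry memory exceeds the limit.
laundry_over_limit() {
    laundry_pages=$1
    page_size=$2
    limit_mb=$3
    [ $(( laundry_pages * page_size )) -gt $(( limit_mb * 1048576 )) ]
}
```

For example, `laundry_over_limit 33280 4096 10 && echo "pause allocation"` prints the message, since 33280 pages of 4096 bytes correspond to the 130M of laundry in the top output above.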
Comment 6 Mark Millard 2018-10-01 19:13:17 UTC
As I understand it, any condition that leads to sustained low
free RAM, via pressure from one or more processes that keep
active RAM usage high, leads to processes being killed to free
memory. The default vm.pageout_oom_seq=12 can be increased
to lengthen how long the low free RAM condition is tolerated.
(It increases how many attempts to free RAM are made first.)
I assign vm.pageout_oom_seq in /etc/sysctl.conf .
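
A minimal sketch of making such a setting persistent. persist_sysctl is a hypothetical helper (not a FreeBSD command) that rewrites a sysctl.conf-style file so it holds exactly one name=value line for the given name; the value 120 used below is only an example.

```shell
# persist_sysctl is a hypothetical helper, not part of FreeBSD.
# Arguments: sysctl name, value, and optionally the file to edit
# (defaults to /etc/sysctl.conf).
persist_sysctl() {
    name=$1
    value=$2
    conf=${3:-/etc/sysctl.conf}
    tmp=$(mktemp) || return 1
    # Copy every line except an existing assignment of the same name.
    if [ -f "$conf" ]; then
        awk -F= -v n="$name" '$1 != n' "$conf" > "$tmp"
    fi
    echo "${name}=${value}" >> "$tmp"
    # Write back through cat so the ownership and mode of $conf are kept.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}
```

Run as root, `persist_sysctl vm.pageout_oom_seq 120` updates /etc/sysctl.conf for the next boot, and `sysctl vm.pageout_oom_seq=120` applies the same value to the running system.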

FreeBSD does not swap out processes that stay active. This
is documented in the book published by McKusick, Neville-Neil,
and Watson (2nd edition, last names listed). So if one or more
processes keep active RAM use high, free RAM tends to stay low.

There can be lots of swap available and the process
killing can still happen. The console log message
produced in this case is very misleading: it references
out of swap instead of a sustained period of
low free RAM.

Real "out of swap" conditions tend to also have messages
of the form:

Aug  5 17:54:01 sentinel kernel: swap_pager_getswapspace(32): failed
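
A rough way to apply this distinction to a log file, as a heuristic sketch only. classify_kills is a hypothetical helper, and its three output labels are made up for illustration; it assumes the log is readable and uses the message formats quoted in this report.

```shell
# classify_kills is a hypothetical helper, not a FreeBSD command.
# Argument: path to a log file (defaults to /var/log/messages).
# Prints one label:
#   no-kills        no "was killed: out of swap space" lines were found
#   swap-exhausted  kills plus swap_pager_getswapspace failures
#   low-free-ram    kills without any swap_pager_getswapspace failure
classify_kills() {
    log=${1:-/var/log/messages}
    # grep -c exits non-zero on a zero count, hence the "|| true".
    kills=$(grep -c 'was killed: out of swap space' "$log" || true)
    fails=$(grep -c 'swap_pager_getswapspace([0-9]*): failed' "$log" || true)
    if [ "${kills:-0}" -eq 0 ]; then
        echo "no-kills"
    elif [ "${fails:-0}" -gt 0 ]; then
        echo "swap-exhausted"
    else
        echo "low-free-ram"
    fi
}
```

A result of `low-free-ram` only suggests that the kills came from the sustained low free RAM path; it does not prove it, since the two kinds of messages are not correlated by time here.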

On small board computers such as ARM boards I've been
using vm.pageout_oom_seq=120 and one person with storage
devices with I/O latency problems used something like
vm.pageout_oom_seq=1024 to allow -j4 buildworld buildkernel
to work. (No attempt at approximating the smallest value
that would have worked.) There was a long 2018 Jun. through
Sep. freebsd-arm list exchange under various subjects that
eventually exposed this vm.pageout_oom_seq control and
FreeBSD's swapping criteria that I noted above.

This does not address why Free RAM is low over a sustained
period, it just makes the system more tolerant of such. It
could be that there are also other mechanisms that do not
involve vm.pageout_oom_seq .
Comment 7 Gordon Hartley 2018-12-07 10:16:52 UTC
Just wanted to report in that I also triggered this issue when doing a zfs scrub on an 11.2 system with system defaults (albeit updated from earlier releases via freebsd-update) with (intentionally) no dedicated swap filesystem. Never seem to have had problems in the past.

Stopped the scrub via "zpool scrub -s" and the problem stopped occurring.

Added just in case it helps someone diagnose the underlying cause(s).
Comment 8 Gordon Hartley 2018-12-07 12:23:08 UTC
In addition to the above - it's not just during scrubs, although that seems to exacerbate the behaviour - not sure what is going on. Going to reinstall the OS with dedicated swap as a workaround.
Comment 9 Max Kostikov 2018-12-13 21:40:30 UTC
I can confirm this bug on an 11.2-p5 system, a bare-metal Xeon server with 8 GB RAM / 4 GB swap and ZFS.
It shows up as a peak: thousands of log messages within less than one minute.

root@beta:/home/xm # zcat /var/log/messages.*.bz2 | grep "Dec 13 21:31" | grep "swap_pager" | wc -l
    6285

I have never seen such behaviour under FreeBSD in recent years.
Comment 10 dimka 2019-01-11 05:40:00 UTC
I experimented with the stable/11 branch on 2GB RAM and 8GB swap space, with an r320475 "world", looking for the revision where the kernel problems begin.
Swap on ZFS is unstable from revision r321453 (2017-07-25):
/usr/src/sys/kern/subr_blist.c
/usr/src/sys/sys/blist.h
And fully broken from r321554 (2017-07-26):
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h
This happened between the 11.1 and 11.2 release forks.
Also, swap on ZFS is fully broken in the current revision of stable/11, r342915 (kernel and world used).
Comment 11 dimka 2019-01-11 08:17:23 UTC
Created attachment 201015 [details]
Memory occupation test tool
Comment 12 dimka 2019-01-11 08:44:08 UTC
I checked and made sure that 12.0-RELEASE/amd64 is also affected by this problem:
the test tool and sendmail processes were killed by OOM after physical memory ran out
and several megabytes of swap were used.
Short hardware details:
CPU: Intel(R) Celeron(R) CPU 847 @ 1.10GHz (1097.53-MHz K8-class CPU)
real memory  = 2147483648 (2048 MB)
avail memory = 1987403776 (1895 MB)
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
da0: <SanDisk Ultra USB 3.0 1.00> Removable Direct Access SPC-4 SCSI device
da0: 14663MB (30031250 512 byte sectors)

You can run a very simple test on your hardware or VM:
1. Install the system from DVD to a clean drive in "Auto (ZFS)" mode.
2. Comment out the GPT swap entry (created by the installer) in /etc/fstab.
3. Create swap space on ZFS pool:
   # zfs create -V 8GB -o org.freebsd:swap=on zroot/swap
4. Reboot.
5. Compile and use "Memory occupation test tool" from bug report attachments:
   # ./memphage 9500
Argument is memory occupation limit (mem free + swap free - 500) in megabytes.
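
The argument can be worked out with shell arithmetic. memphage_limit is a hypothetical helper; the free-memory and free-swap figures, in megabytes, are assumed to be read by hand from top(1) and `swapinfo -m`.

```shell
# memphage_limit is a hypothetical helper, not part of the attachment.
# Arguments: free memory in MB, free swap in MB.
# Prints (mem free + swap free - 500), the limit to pass to memphage.
memphage_limit() {
    free_mb=$1
    swap_free_mb=$2
    echo $(( free_mb + swap_free_mb - 500 ))
}
```

For instance, `memphage_limit 1808 8192` prints `9500`, the value used in step 5; the 1808 MB of free memory is an illustrative figure for a 2 GB machine.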
Comment 13 Billg 2019-01-23 16:34:45 UTC
I'm seeing the same behavior with FreeBSD 11.2-p8. Although the swap partition is not on ZFS, we do have a ZFS pool on this machine, as described here:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235125

I tried memphage

total memory is 128GB 
Free memory at the time I started the memphage was 110 GB
Swap 8GB

./memphage 115000 took around 3 minutes with no issues. Swap got 3.2 GB filled.

./memphage 120000 took 5 minutes and got killed at 118672 MB.

Swap was completely filled and I got multiple:
Jan 23 14:56:27 san2 kernel: swap_pager_getswapspace(2): failed

and then memphage got killed.

Jan 23 14:56:48 san2 kernel: pid 6945 (memphage), uid 0, was killed: out of swap space
Jan 23 14:56:48 san2 kernel: Jan 23 14:56:48 san2 kernel: pid 6945 (memphage), uid 0, was killed: out of swap space

--
No other processes were killed this time. only memphage
Comment 14 dimka 2019-01-29 06:30:20 UTC
The other day I got an OOM on the current revision of 11.1; it happened at night, apparently during "periodic" execution.
I had to roll this system back to stable/11 r321452, and there have been no problems in 5 days.
Comment 15 mail 2019-02-07 09:40:09 UTC
Hi,

We are seeing similar behaviour on one of our zfs-nfs servers as well.

Jan 31 10:41:13 volume1 kernel: pid 17505 (collectd), uid 0, was killed: out of swap space
Jan 31 10:41:13 volume1 kernel: pid 51659 (ntpd), uid 0, was killed: out of swap space
Jan 31 10:42:54 volume1 kernel: pid 73673 (devd), uid 0, was killed: out of swap space
Jan 31 10:43:11 volume1 kernel: pid 31167 (mountd), uid 0, was killed: out of swap space
Jan 31 10:44:12 volume1 kernel: pid 50359 (nfsd), uid 0, was killed: out of swap space
Jan 31 10:44:36 volume1 kernel: pid 81152 (zsh), uid 0, was killed: out of swap space
Jan 31 10:44:54 volume1 kernel: pid 49005 (zsh), uid 4002, was killed: out of swap space
Jan 31 10:46:13 volume1 kernel: pid 95263 (nrpe3), uid 181, was killed: out of swap space
Jan 31 10:46:36 volume1 kernel: pid 48518 (sshd), uid 4002, was killed: out of swap space
Jan 31 10:46:55 volume1 kernel: pid 92367 (rpcbind), uid 0, was killed: out of swap space
Jan 31 10:47:11 volume1 kernel: pid 56206 (nfsd), uid 0, was killed: out of swap space
Jan 31 10:47:23 volume1 kernel: pid 68827 (dhclient), uid 65, was killed: out of swap space
Jan 31 10:47:38 volume1 kernel: pid 87548 (getty), uid 0, was killed: out of swap space
Jan 31 10:47:50 volume1 kernel: pid 24945 (getty), uid 0, was killed: out of swap space
Jan 31 10:49:14 volume1 kernel: pid 29466 (getty), uid 0, was killed: out of swap space
Jan 31 10:49:37 volume1 kernel: pid 77339 (getty), uid 0, was killed: out of swap space
Jan 31 10:49:51 volume1 kernel: pid 78317 (getty), uid 0, was killed: out of swap space
Jan 31 10:50:13 volume1 kernel: pid 81831 (getty), uid 0, was killed: out of swap space
Jan 31 10:50:37 volume1 kernel: pid 89762 (getty), uid 0, was killed: out of swap space
Jan 31 10:50:51 volume1 kernel: pid 92067 (getty), uid 0, was killed: out of swap space
Jan 31 10:51:49 volume1 kernel: pid 97499 (getty), uid 0, was killed: out of swap space
Jan 31 10:52:14 volume1 kernel: pid 96091 (getty), uid 0, was killed: out of swap space
Jan 31 10:52:37 volume1 kernel: pid 98907 (getty), uid 0, was killed: out of swap space
Jan 31 10:52:51 volume1 kernel: pid 99595 (getty), uid 0, was killed: out of swap space
Jan 31 10:55:47 volume1 kernel: pid 60068 (zsh), uid 0, was killed: out of swap space
Feb  7 09:57:40 volume1 collectd[25157]: plugin_read_thread: read-function of the `swap' plugin took 19.765 seconds, which is above its read interval (10.000 seconds). You might want to adjust the `Interval' or `ReadThreads' settings.
Feb  7 09:59:48 volume1 kernel: pid 25157 (collectd), uid 0, was killed: out of swap space
Feb  7 09:59:48 volume1 kernel: pid 94240 (atop), uid 0, was killed: out of swap space
Feb  7 09:59:48 volume1 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 327109, size: 16384
Feb  7 09:59:48 volume1 kernel: pid 51515 (ntpd), uid 0, was killed: out of swap space
Feb  7 09:59:48 volume1 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 326787, size: 4096
Feb  7 09:59:48 volume1 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 102263, size: 4096
Feb  7 09:59:48 volume1 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 327152, size: 4096
Feb  7 09:59:48 volume1 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 100915, size: 8192
Feb  7 09:59:48 volume1 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 326754, size: 8192
Feb  7 09:59:48 volume1 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 8471, size: 4096
Feb  7 09:59:48 volume1 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 106028, size: 12288
Feb  7 09:59:48 volume1 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 8229, size: 8192
Feb  7 09:59:48 volume1 kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 103890, size: 8192
Feb  7 10:03:11 volume1 kernel: swap_pager_getswapspace(32): failed
Feb  7 10:06:00 volume1 kernel: swap_pager_getswapspace(32): failed


root@volume1:~ # grep arc /boot/loader.conf 
vfs.zfs.arc_min="10024M"
vfs.zfs.arc_max="13084M"
root@volume1:~ # sysctl -a | grep phys
kern.ipc.shm_use_phys: 0
vm.phys_segs: 
vm.phys_free: 
vm.phys_pager_cluster: 1024
hw.physmem: 17139478528
root@volume1:~ # sysctl vm.pageout_oom_seq
vm.pageout_oom_seq: 120
root@volume1:~ # 
root@volume1:~ # swapinfo 
Device          1K-blocks     Used    Avail Capacity
/dev/gpt/swap     8388608    26080  8362528     0%
root@volume1:~ # freebsd-version -uk
11.2-RELEASE-p8
11.2-RELEASE-p8
root@volume1:~ # 

We actually do have reason to assume the VM's storage backend might be periodically affected by an extremely slow storage provider (it's running as a VM on OpenStack), as indicated by the "swap_pager: indefinite wait buffer: bufobj" messages. It's kind of worrisome that important processes (nfsd, for instance) are shot down by the OOM killer with the default value of vm.pageout_oom_seq (if the default setting of that sysctl turns out to cause the OOM kills).

We've just changed the vm.pageout_oom_seq from its default of 12 to 120 and are monitoring the impact of that change.

Ruben
(In reply to Billg from comment #13)
Comment 16 Parashiva 2019-09-11 16:54:04 UTC
Hello,

My server also hits the same error, "mysqld killed out of swap space", when I import a 70G+ MySQL dump (mysqldump -u root -p database -r dump.sql).

The server's hardware is 2 x 256G SSD, 32G RAM and an E3-1245 v3 CPU, with the latest FreeBSD 12, a ZFS mirror, and atime=off,primarycache=all,secondarycache=none.

I tried the solutions below:

1. A trick learned from stackoverflow.com
set global net_buffer_length=1048576;
set global max_allowed_packet=1073741824;
SET foreign_key_checks = 0;
not working

2. disable swap
not working

3. Then I thought it might be related to RAM or swap, so should I disable the ARC?
zfs set primarycache=none tank
working!!!

So I have my database working now.

Hope my experience could help someone.

Thank you,
Best Regards.
Comment 17 Parashiva 2019-09-11 16:55:59 UTC
(In reply to Parashiva from comment #16)
My ARC limit is:
vfs.zfs.arc_max="4G"
vfs.zfs.arc_min="2G"
Comment 18 Masachika ISHIZUKA 2019-11-19 08:22:13 UTC
I'm using ZFS on a VPS with very little RAM (512 MB).
It ran well on 10.3R, 11.0R, 11.1R, 11.2R and 12.0R.
Recently I upgraded to 12.1R, and many processes were killed because of 'out of swap space', but without any 'swap_pager_getswapspace failed' messages.
So I set 'sysctl vm.pageout_oom_seq=1024' in /boot/loader.conf, and this reduced the number of killed processes.
Now I'm watching the situation after increasing 1024 to 10240.
Comment 19 Mark Millard 2023-09-11 06:57:33 UTC
This submittal and its comments are so old that the kernel
messages for "was killed" have been made more specific (in
2 of the 3 types of contexts that lead to such kills). The
usually inaccurate "out of swap space" text is not the
typical text reported any more.

Changing this submittal from New to Open at this point is just
misleading/confusing vs. the modern details (13.1+, say).