Bug 235125 - Process was killed: out of swap space on gmirror + zfs
Summary: Process was killed: out of swap space on gmirror + zfs
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-22 14:44 UTC by Billg
Modified: 2023-09-11 06:24 UTC
CC: 3 users

See Also:


Description Billg 2019-01-22 14:44:12 UTC
Hello,

We have FreeBSD 11.2-p8 installed on two SSDs that are mirrored using gmirror.
We also have a zpool consisting of 3x7 raidz1 drives, 2x cache, 2x SLOG and 1x spare.
The machine has dual Xeon CPUs and 128 GB of RAM.
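
For context, the pool layout is roughly what a command along these lines would create (a sketch only; the device names below are placeholders, not the real ones):

# 3 x 7-disk raidz1 vdevs, 2 cache devices, a mirrored SLOG and 1 hot spare
zpool create san2pool \
    raidz1 da0  da1  da2  da3  da4  da5  da6  \
    raidz1 da7  da8  da9  da10 da11 da12 da13 \
    raidz1 da14 da15 da16 da17 da18 da19 da20 \
    cache  nvd0 nvd1 \
    log    mirror nvd2 nvd3 \
    spare  da21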

The first time this problem occurred was right after the upgrade from 11.2-p4 to 11.2-p8:

----
Jan 14 17:04:22 san2 zfsd: POLLHUP detected on devd socket.
Jan 14 17:04:22 san2 kernel: pid 606 (devd), uid 0, was killed: out of swap space
Jan 14 17:04:22 san2 kernel: Jan 14 17:04:22 san2 kernel: pid 606 (devd), uid 0, was killed: out of swap space
Jan 14 17:04:22 san2 zfsd: Disconnecting from devd.
Jan 14 17:04:22 san2 zfsd: ConnectToDevd: Connecting to devd.
----

We had to restart the machine. After 3 days we had the same problem, but this time multiple processes were killed:

------
Jan 19 10:49:49 san2 kernel: pid 610 (devd), uid 0, was killed: out of swap space
Jan 19 10:49:49 san2 kernel: Jan 19 10:49:49 san2 kernel: pid 610 (devd), uid 0, was killed: out of swap space
Jan 19 11:09:49 san2 kernel: pid 835 (zabbix_agentd), uid 122, was killed: out of swap space
Jan 19 11:09:49 san2 kernel: Jan 19 11:09:49 san2 kernel: pid 835 (zabbix_agentd), uid 122, was killed: out of swap space
Jan 19 11:10:48 san2 kernel: pid 847 (bareos-fd), uid 0, was killed: out of swap space
Jan 19 11:10:48 san2 kernel: Jan 19 11:10:48 san2 kernel: pid 847 (bareos-fd), uid 0, was killed: out of swap space
Jan 19 11:11:15 san2 kernel: pid 838 (ntpd), uid 233, was killed: out of swap space
Jan 19 11:11:15 san2 kernel: Jan 19 11:11:15 san2 kernel: pid 838 (ntpd), uid 233, was killed: out of swap space
Jan 19 11:11:29 san2 kernel: pid 802 (ctld), uid 0, was killed: out of swap space
Jan 19 11:11:29 san2 kernel: Jan 19 11:11:29 san2 kernel: pid 802 (ctld), uid 0, was killed: out of swap space
Jan 19 11:11:45 san2 kernel: pid 116 (adjkerntz), uid 0, was killed: out of swap space
Jan 19 11:11:45 san2 kernel: Jan 19 11:11:45 san2 kernel: pid 116 (adjkerntz), uid 0, was killed: out of swap space
Jan 19 11:12:15 san2 kernel: pid 971 (getty), uid 0, was killed: out of swap space
Jan 19 11:12:15 san2 kernel: Jan 19 11:12:15 san2 kernel: pid 971 (getty), uid 0, was killed: out of swap space
Jan 19 11:12:29 san2 kernel: pid 32950 (getty), uid 0, was killed: out of swap space
Jan 19 11:12:29 san2 kernel: Jan 19 11:12:29 san2 kernel: pid 32950 (getty), uid 0, was killed: out of swap space
Jan 19 11:12:46 san2 kernel: pid 32951 (getty), uid 0, was killed: out of swap 
-----

The messages kept repeating until we restarted the machine.
We tried disabling zfsd, but that didn't help.
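
(For the record, disabling it was done through the usual rc.conf route, roughly as below; noted here in case the method matters.)

sysrc zfsd_enable=NO    # keep it from starting at boot
service zfsd stop       # stop the running daemon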

This machine is in production, and it is really frustrating to have this behavior. I will gladly provide more info or run tests when needed.

Thank you


Here is some more info:
------

root@san2:~ # gmirror status
       Name    Status  Components
mirror/boot  COMPLETE  gpt/boot0 (ACTIVE)
                       gpt/boot1 (ACTIVE)
mirror/swap  COMPLETE  gpt/swap0 (ACTIVE)
                       gpt/swap1 (ACTIVE)
mirror/root  COMPLETE  gpt/root0 (ACTIVE)
                       gpt/root1 (ACTIVE)

zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
san2pool  37.9T  17.4T  20.4T        -         -     2%    46%  1.00x  ONLINE  -


root@san2:~ # vmstat
procs  memory       page                    disks     faults         cpu
r b w  avm   fre   flt  re  pi  po    fr   sr ad0 ad1   in    sy    cs us sy id
0 0 2 409M  3.2G  1170   1   5   4  1569 1355   0   0 3837   800  9140  0  7 93


Device          1K-blocks     Used    Avail Capacity
/dev/mirror/swap   8388604    26528  8362076     0%
Comment 1 Gordon Hartley 2019-01-25 06:43:49 UTC
I think the 'importance'/severity of this needs major escalation, because I've had it kill ssh sessions so that I couldn't complete backups via scp, and it keeps killing the ssh session I'm su'd into while trying to understand the problem.

This has the potential to be one of those 'affects some people' issues that transforms into a scenario where it takes out a massive amount of infrastructure that everyone depends on, and you can't remote in to solve the problem...

The result of this problem could be catastrophic in the wrong environment. 

My fear is that people upgrade production, everything appears to run smoothly for a while, so safety is assumed; then whatever scenario triggers this behaviour occurs and you can't remote in to save the system, because the system keeps killing the processes of the humans trying to save it from itself, since the system vetoes resource allocation even for itself.
Comment 2 Gordon Hartley 2019-01-25 06:52:02 UTC
(In reply to Gordon Hartley from comment #1)

To summarise, it's 'the data is protected at all costs', with one of those costs being access to the data.
Comment 3 Mark Johnston (FreeBSD committer, FreeBSD triage) 2019-02-11 23:04:57 UTC
Do you have some idea regarding what's eating all of your RAM?  Does top(1) show an obvious culprit?

You're using ZFS.  Do you cap the ARC's memory usage by setting vfs.zfs.arc_max?
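
If you want to try capping it, something along these lines in /boot/loader.conf would do it (the 96 GiB figure is only an example; pick whatever leaves enough headroom for your other services, then reboot):

# /boot/loader.conf -- example value only, size it to your workload
vfs.zfs.arc_max="103079215104"    # 96 GiB in bytes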
Comment 4 Billg 2019-02-12 11:26:45 UTC
(In reply to Mark Johnston from comment #3)

vfs.zfs.arc_max is at the default, in my case "132750123008" (the machine has 128 GB of RAM).
The logs never show the ARC going above 118G.
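
For reference, these are the sysctls I'm taking those numbers from (commands only, no output captured during the incident):

sysctl vfs.zfs.arc_max                 # configured ARC limit
sysctl kstat.zfs.misc.arcstats.size    # current ARC size in bytes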

"Top" shows normal memory usage ( But i guess running "Top" right after having the Problem would not help).

However, I had Zabbix 2.2 running on this machine. I disabled it three weeks ago and switched to telegraf; since then I have had no more issues so far.

Currently the machine sometimes freezes when there is heavy load on the ctld daemon, which is listening on a 10G interface, but I don't think this is related at all.

I will report back if there is any update related to the killed processes.
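
In the meantime I can leave something like the sketch below running, so we have the memory state from just before any future kills (the interval and log path are arbitrary choices):

#!/bin/sh
# Append a memory/swap snapshot every 60 seconds so the state right
# before any future "out of swap space" kills is preserved.
while :; do
    {
        date
        top -b -o res 15    # 15 largest processes by resident size, batch mode
        vmstat
        swapinfo -k
    } >> /var/log/mem-snapshots.log
    sleep 60
done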