Bug 241048 - After r349840, OOM reported when swap space is available.
Summary: After r349840, OOM reported when swap space is available.
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Mark Johnston
URL:
Keywords: needs-qa, regression
Depends on:
Blocks: 240700
Reported: 2019-10-04 02:15 UTC by rkoberman
Modified: 2019-10-14 03:48 UTC (History)
3 users

See Also:
koobs: mfc-stable12?


Attachments
proposed patch for head (1.85 KB, patch)
2019-10-04 20:16 UTC, Mark Johnston
no flags Details | Diff
proposed patch for stable/12 (625 bytes, patch)
2019-10-04 20:16 UTC, Mark Johnston
no flags Details | Diff
proposed patch for stable/12 (1.41 KB, patch)
2019-10-05 15:00 UTC, Mark Johnston
no flags Details | Diff
proposed patch (2.03 KB, patch)
2019-10-11 03:01 UTC, Mark Johnston
no flags Details | Diff
vm.stats on amd64 12-STABLE showing error (1.61 KB, text/plain)
2019-10-13 02:44 UTC, rkoberman
no flags Details
debug patch (719 bytes, patch)
2019-10-13 18:34 UTC, Mark Johnston
no flags Details | Diff

Description rkoberman 2019-10-04 02:15:12 UTC
After the MFC of base r349840 and base r349841 to 12-STABLE as base r350374, attempts to start a VirtualBox Windows7 guest would fail when there was memory pressure forcing swapping. Another user running HEAD on an RPi3 with 6 GB of RAM is reporting similar OOM issues, though his case is triggered by building www/chromium. It also involves OOM with a lot of swap available, and it appears not to occur if r349840 and r349841 are rolled back. Just reverting base r349841 does not resolve the issue.

Note that r349840 should do absolutely nothing on systems with less than 16 GB of RAM. This is per the author, markj@. So far he has no explanation for the problem.

My system is a Lenovo T520 with a Sandy Bridge processor, 8 GB of RAM and * GB of swap space evenly split over two drives. The amount of swap space in use seems irrelevant, as the failure has occurred with over 7 GB of free swap space. The failure never occurs when the free memory is adequate to start the VM without swapping.

I have not looked at the code in VirtualBox to see how it allocates memory, but I would guess that it uses slab allocation for loading the memory image (4 GB) of the virtual machine. I suspect that it uses malloc for loading the VB (VirtualBox) code. It looks like the load of the virtual machine's memory works OK: at least 98% of the load always completes without problem, regardless of the amount of swap required. At 98%, the load percentage holds for several seconds and then errors out. If there is no memory pressure, there is no pause at 98%. If the VM is restarted, it succeeds with no pause.

An occasional non-fatal variant is similar, except that the pause may happen at 99% and a non-fatal error is reported saying the VM could not be started due to lack of memory and that applications should be terminated to free memory. The VM is left in the paused (greyed out) state, but it may be immediately "un-paused" and then runs instantly with no other action.

The other user, Bob Prohaska, provided information on his case and agreed to my including a link to it in this PR. The files are at http://www.zefox.net/~fbsd/rpi3/swaptests/oberman/ I am not attaching the logs due to their size. They include logs of attempts to build chromium which succeed with r349839 and fail with r349841. Note that his system is running HEAD.
Comment 1 Mark Millard 2019-10-04 17:21:28 UTC
(In reply to rkoberman from comment #0)

An RPi3 has only 1 GiByte of RAM. Did you mean
6 GiByte of SWAP? Was the RPi3 an incorrect
identification?

Example messages that were misnomers:
(taken from another list exchange on the issue)

Sep 11 23:21:43 zipper kernel: pid 16581 (rustc), jid 43, uid 65534, was killed: out of swap space
Sep 12 02:48:23 zipper kernel: pid 1209 (rustc), jid 62, uid 65534, was killed: out of swap space
. . .

Unfortunately, the wording of this type of message is a misnomer for
what drove the associated kills: it was actually driven by being unable
to gain more free memory in order to maintain a minimum amount. FreeBSD
will not swap-out processes that stay runnable (or are running), only
ones that are waiting. Even a single process that stays runnable and
keeps lots of RAM in the active category can lead to kills when swap
is unused or little used. So the kill-behavior is very workload
dependent. (Recently additional kills on other criteria were added
on current.)

Real "out of swap" conditions tend to also have messages
similar to (or always do?):

Aug  5 17:54:01 sentinel kernel: swap_pager_getswapspace(32): failed

If you are not seeing such messages, then it is likely (certain?)
that the amount of swap space still free is not the actual thing
driving the kills.

Are you seeing "swap_pager_getswapspace(32): failed" messages?

Whatever comes of that, it would be nice if the "was killed: out of swap space"
messages were changed to not be so overly specific/incorrect about
the cause; they frequently mislead people who get them and prompt
user activity that cannot help.

From here on I may well be mixing in current (13) material,
not just 12.x material.

There are tunables for controlling (some of) the kill behavior
that have helped on small armv7 or aarch64 boards:

vm.pageout_oom_seq
vm.pfault_oom_wait
vm.pfault_oom_attempts

Prior to the last two being added in current, increasing
vm.pageout_oom_seq (to increase how many times FreeBSD would
try to gain free RAM before starting kills) used to be sufficient
to get buildworld/buildkernel and other activities to complete
on various small armv7 or aarch64 boards. Setting vm.pfault_oom_attempts
to -1 disables that pfault_oom mechanism, but if there is a worry
about actually running out of swap, that might not be appropriate
(FreeBSD could stop overall from lack of resources).

When vm.pfault_oom_attempts is not -1, the overall criterion is based
on vm.pfault_oom_wait * vm.pfault_oom_attempts before kills start.
But there may be a tradeoff between different combinations of the
two that produce (nearly) the same product.
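
As a rough sketch (the values here are purely illustrative, not
recommendations from this PR), in /boot/loader.conf:

vm.pageout_oom_seq="1024"

and at runtime:

# sysctl vm.pfault_oom_attempts=-1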

https://lists.freebsd.org/pipermail/freebsd-current/2019-September/074310.html
was the start of an exchange that involved material about the
vm.pfault_oom_<?> additions. (It mentions vm.pageout_oom_seq as well.)
Comment 2 Mark Johnston freebsd_committer 2019-10-04 20:16:21 UTC
Created attachment 208104 [details]
proposed patch for head
Comment 3 Mark Johnston freebsd_committer 2019-10-04 20:16:42 UTC
Created attachment 208105 [details]
proposed patch for stable/12
Comment 4 Mark Johnston freebsd_committer 2019-10-04 20:17:24 UTC
I uploaded a couple of patches which I believe will fix the regression.  Could you please confirm that the issue is resolved when they are applied?
Comment 5 Mark Johnston freebsd_committer 2019-10-05 15:00:11 UTC
Created attachment 208118 [details]
proposed patch for stable/12
Comment 6 rkoberman 2019-10-07 14:29:33 UTC
After applying the latest proposed patches, the problem persists. No change in the symptoms.
Comment 7 Bob Prohaska 2019-10-07 16:54:59 UTC
There's a logfile for a kill on a pi3 running r353116M at
http://www.zefox.net/~fbsd/rpi3/swaptests/oberman/r353116M/

The patch seemed to apply cleanly and there were no problems compiling.
The kill is near timestamp Oct  7 05:28:38 and swap storage activity
beforehand looks relatively light. There is a surge immediately after
the kill, which surprised me.

The patch for stable/12 was applied to a pi2 running r353115:353124M 
(the versioning is something I've not seen before). It applied cleanly
and there were no problems compiling. So far, nothing undesirable has
happened, but it's worth noting that 32-bit arm has never exhibited
the OOMA difficulties associated with 64-bit arm.
Comment 8 rkoberman 2019-10-07 19:46:20 UTC
(In reply to Mark Millard from comment #1)
I will note that I don't get any "Out of swap space" messages. You are looking at the test from Bob Prohaska who also started seeing problems with the same commit to head and had success after reverting that commit. His memory pressure was from attempting to build chromium on a 6G RPi3.

I agree that the message is really misleading, but that is not the issue here. I am running on an AMD64 Sandy Bridge system and have not seen the "Out of swap space" message, nor are processes being killed. It looks like VirtualBox loads the VM's RAM successfully, then tries to load the VM support code in the memory-starved system, and that fails; the multi-second delay at 98% loaded suggests this. I could be totally wrong about this, but the hypothesis fits the symptoms quite well. I suspect that the RAM loading code is very different from the load of the support code. I would expect the RAM to use slab allocation and the support code to use regular malloc, but, again, this is speculation.

The problem is only triggered after the commit of a patch which should never run on a system with 8G of RAM. If I revert that patch, the error never occurs. It seems pretty clear that the commit is the trigger, though the last patch markj posted really should have made the code unreachable for a system with under 16G of RAM.

BTW, at any moment I have a long list of idle processes sitting on quite a bit of memory. I'm running MATE, which is not quite as big a memory pig as Gnome, but it's close, with a lot of assorted stuff that mostly sits around waiting for something to happen that probably won't and is a good candidate for swap. I do have one HUGE memory pig in Firefox, which starts at about 2.5G and climbs to north of 3G. Right now "ps -o rss,command -Uoberman | grep firefox" is showing 3.31G over 10 processes. (Why is resident memory 'rss' in ps and 'res' in top? Minor annoyance.)

One very noticeable artifact of the problem is that VirtualBox always loads 98% of its 4G of RAM and then pauses for several seconds before dying, yet it always succeeds when I immediately start the VM again, with no processes killed and no memory freed.
Comment 9 Bob Prohaska 2019-10-09 00:55:43 UTC
It turns out that OOMA kills can be provoked using Peter Holm's stress2 test.
An example on a Pi3 running 

FreeBSD www.zefox.org 13.0-CURRENT FreeBSD 13.0-CURRENT #2 r353297: Tue Oct  8 03:11:47 PDT 2019     bob@www.zefox.org:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC  arm64

produces a log excerpt that looks like this:

procs     memory        page                    disks     faults         cpu
r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in    sy    cs us sy id
912 0 4 2371076   19368   135   6   0   0   338  618   0   0 19815  2807 10286  1  9 90
dT: 17.380s  w: 10.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
    0    105      7    215   62.9     98   1703  134.4      0      0    0.0   74.7  mmcsd0
    0     49      0      4    3.6     49    466    5.3      0      0    0.0   15.1  da0
    0    105      7    215   77.9     98   1703  140.2      0      0    0.0   85.9  mmcsd0s2
    0     57      7    212   73.5     51   1230  247.1      0      0    0.0   80.1  mmcsd0s2a
    0     48      0      4  149.5     48    473   26.1      0      0    0.0   39.7  mmcsd0s2b
    0     57      7    212   73.6     51   1230  247.2      0      0    0.0   80.2  ufs/rootfs
    0     48      0      4    3.7     47    466    5.2      0      0    0.0   15.2  da0p5
Tue Oct  8 13:37:16 PDT 2019
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252    23880  4380372     1%
/dev/da0p5        2097152    23344  2073808     1%
Total             6501404    47224  6454180     1%
Oct  8 13:36:34 www bob[14995]: Starting test all.cfg
Oct  8 13:37:13 www kernel: pid 15110 (swap), jid 0, uid 1001, was killed: out of swap space

I certainly don't know how to fully interpret this, but it seems clear the _amount_ of swap in
use isn't excessive. /boot/loader.conf contains
vm.pageout_oom_seq="4096"

Thanks for reading, I hope it's useful.
Comment 10 Mark Johnston freebsd_committer 2019-10-09 13:40:15 UTC
(In reply to Bob Prohaska from comment #9)
stress2 is designed to trigger low memory conditions, so OOM kills are to be expected.
Comment 11 Bob Prohaska 2019-10-09 15:12:58 UTC
(In reply to Mark Johnston from comment #10)
Does it appear that a low memory condition was actually created, and that the
kill was justified in the example provided? The swap devices were not very busy,
and the amount of swap in use was quite modest.

Thanks for reading!
Comment 12 Mark Millard 2019-10-09 19:37:55 UTC
(In reply to Bob Prohaska from comment #11)

Does it last longer with

vm.pfault_oom_attempts=-1

?

I'm not sure how to uniquely tell which constraint led
to a specific Out Of Memory kill, but the above setting
disables one of the constraints, so a "yes" for the
"last longer" status would indicate something about
which constraint was hit.

If that does last longer, then different combinations
of values for:

vm.pfault_oom_wait
vm.pfault_oom_attempts

even for about the same total (the product of the two),
may vary the behavior.
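
A rough sketch of trying different splits at runtime (again, the
values are only illustrative, not ones suggested in this PR), either
disabling the path entirely:

# sysctl vm.pfault_oom_attempts=-1

or keeping it enabled while changing the split of roughly the same
product:

# sysctl vm.pfault_oom_wait=20 vm.pfault_oom_attempts=3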

If it was not this category of constraint, I'm not sure
what else to explore, given your already large:

vm.pageout_oom_seq="4096"

To my knowledge there is nothing to enable reporting
information that would really help identify the constraint(s)
being hit (or approached). The old patch that we were
supplied and used no longer applies. As far as I know
it has not been updated. (And it predated at least
the pfault oom settings being available, if I
understand right.)
Comment 13 Bob Prohaska 2019-10-09 21:12:14 UTC
(In reply to Mark Millard from comment #12)

I haven't tried using 
vm.pfault_oom_attempts=-1
but can do so after the present experiment completes.

That experiment consists of overprovisioning swap to 9 GB, of which 8 GB
are usable (4GB on the microSD card plus 5 GB on a USB flash drive).
This is on the Pi3 at r353297

The compilation is not quite half done by filecount, but so far swap use
is less than half the value reached using the former swap configuration
of 4 GB on microSD plus 2 GB on usb flash resulting in 4 GB usable swap.

It's contrary to my expectation. Might it be significant if it holds true?
Comment 14 Mark Johnston freebsd_committer 2019-10-11 03:01:32 UTC
Created attachment 208237 [details]
proposed patch

Here is another patch to try.  It should apply to both head and stable/12 (with some offset).

I noticed that the heuristic we use to decide whether or not to perform per-CPU caching was wrong in that it doesn't take into account the fact that we cache two buckets of up to 256 free pages per CPU, not one.  I also adjusted the fraction of total RAM that the per-CPU caches are permitted to cache so that we are more conservative.
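
As a rough worked illustration (assuming 4 CPUs, VM_NFREEPOOL == 2, and
the roughly 2,000,000 4 KB pages of an 8 GB amd64 machine):

old check: 2000000 / 400 == 5000, and 256 * 4 * 2 == 2048; 5000 >= 2048, so the cache zones were created
new check: 2000000 / 600 == 3333, and 2 * 256 * 4 * 2 == 4096; 3333 < 4096, so they should now be skipped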
Comment 15 rkoberman 2019-10-11 03:20:29 UTC
(In reply to Mark Johnston from comment #14)
For 12-STABLE r353118M, do I just need this patch, with the prior patches reverted?
Comment 16 Mark Johnston freebsd_committer 2019-10-11 14:50:31 UTC
(In reply to rkoberman from comment #15)
Yes, please.
Comment 17 Bob Prohaska 2019-10-11 16:41:23 UTC
(In reply to Mark Johnston from comment #14)

Can the new patch be tested on 13.0-CURRENT FreeBSD 13.0-CURRENT #2 r353297 ?

The compilation of www/chromium on that system is still running and probably
won't finish for another couple of days. Max swap use so far is ~3.5 GB, about
the same as ~6 months ago compared to 4.2 GB two weeks ago. 

Presently the machine looks badly swap-bound with swap devices more than 50%
busy and % idle over 50 %. Oddly enough, the

swap_pager: indefinite wait buffer: bufobj:....." 

messages are very sparse, none for the last half hour.

Please indicate if it's more useful to let the compile finish or 
interrupt it to apply the patch. At the moment,  vm.pageout_oom_seq="4096"
and there are 8 GB of usable swap. Might it help to make any "on the fly"
changes to
vm.pageout_oom_seq
vm.pfault_oom_wait
vm.pfault_oom_attempts ?
Comment 18 rkoberman 2019-10-12 21:29:10 UTC
(In reply to Mark Johnston from comment #16)
Again, it is not fixed. I create memory pressure, start the VirtualBox VM, and get an OOM kill at 98% loaded. As usual, I start the VM again and it starts normally. Again, this is on 12-STABLE r353118M on an AMD64.

For some reason, perhaps a fix in Firefox, I could not generate memory pressure. For the first time I saw resident memory use by firefox start at 3.2G, but I could not get it to the 3.6G needed to create pressure. I added a real memory hog, a build of lang/rust, and that got available memory down to about 3G, after which the failure to start the VM re-appeared.
Comment 19 Mark Johnston freebsd_committer 2019-10-12 21:41:35 UTC
(In reply to rkoberman from comment #18)
Am I right that hw.ncpu is 4 on your system?

With the patch applied, can you run

# dtrace -n 'BEGIN{print(`vm_dom[0].vmd_pgcache[0]);}'

and paste the output?  Could you also attach the output of "sysctl vm.stats" with the patch applied?
Comment 20 rkoberman 2019-10-13 02:44:57 UTC
Created attachment 208276 [details]
vm.stats on amd64 12-STABLE showing error

Yes, hw.ncpu is 4.
CPU     ID                    FUNCTION:NAME
  0      1                           :BEGIN struct vm_pgcache {
    int domain = 0
    int pool = 0
    uma_zone_t zone = 0xfffff80003308780
}
 Attached vm.stats
Comment 21 rkoberman 2019-10-13 02:49:59 UTC
(In reply to rkoberman from comment #20)
Forgot to mention that the vm.stats were collected on a freshly re-booted system, before any memory pressure or errors.
Comment 22 Mark Johnston freebsd_committer 2019-10-13 16:10:48 UTC
(In reply to rkoberman from comment #21)
That's fine, thanks.

Something doesn't add up.  In my patch we have:

if (vmd_page_count / 600 < 2 * 256 * mp_ncpus * VM_NFREEPOOL)
    continue;

From your sysctls I get 2001978, you have mp_ncpus == 4, and VM_NFREEPOOL == 2 on amd64.  2001978 / 600 == 3336 and 2 * 256 * 4 * 2 == 4096, so we should not be creating the cache zones on your system, but the dtrace output shows that we did.

I tried booting a bhyve VM with 8GB of RAM and 4 CPUs, and I have:

# sysctl vm.stats.vm.v_page_count
vm.stats.vm.v_page_count: 2031299
# sysctl hw.ncpu
hw.ncpu: 4

and

# dtrace -n 'BEGIN{print(`vm_dom[0].vmd_pgcache[0]);}'
dtrace: description 'BEGIN' matched 1 probe
CPU     ID                    FUNCTION:NAME
  3      1                           :BEGIN struct vm_pgcache {
    int domain = 0
    int pool = 0
    uma_zone_t zone = 0
}

Are you certain you tested the correct patch and kernel?
Comment 23 rkoberman 2019-10-13 18:00:43 UTC
(In reply to Mark Johnston from comment #22)
Here is confirmation of the patch. I did the kernel build with -DNO_CLEAN. 
# svn st /usr/src/sys
?       /usr/src/sys/amd64/conf/GENERIC.4BSD
M       /usr/src/sys/dev/acpi_support/acpi_ibm.c
?       /usr/src/sys/dev/acpi_support/acpi_ibm.c.orig
M       /usr/src/sys/vm/vm_page.c
# svn diff /usr/src/sys/vm/vm_page.c
Index: /usr/src/sys/vm/vm_page.c
===================================================================
--- /usr/src/sys/vm/vm_page.c	(revision 353118)
+++ /usr/src/sys/vm/vm_page.c	(working copy)
@@ -199,10 +199,13 @@
 		vmd = VM_DOMAIN(domain);
 
 		/*
-		 * Don't allow the page caches to take up more than .25% of
-		 * memory.
+		 * Don't allow the page caches to take up more than .1875% of
+		 * memory.  A UMA bucket contains at most 256 free pages, and w
+		 * have two buckets per CPU per free pool.
 		 */
 		if (vmd->vmd_page_count / 400 < 256 * mp_ncpus * VM_NFREEPOOL)
+		if (vmd->vmd_page_count / 600 < 2 * 256 * mp_ncpus *
+		    VM_NFREEPOOL)
 			continue;
 		for (pool = 0; pool < VM_NFREEPOOL; pool++) {
 			pgcache = &vmd->vmd_pgcache[pool];
Comment 24 Mark Johnston freebsd_committer 2019-10-13 18:34:22 UTC
Created attachment 208284 [details]
debug patch

(In reply to rkoberman from comment #23)
Are you using a non-default kernel configuration?  Could you share it?

Would you please also try applying the attached patch on top of the existing patch, and post the dmesg from a fresh boot?  Thanks for your patience.
Comment 25 rkoberman 2019-10-13 19:02:29 UTC
I just noticed an error in my application of the patch. The diff I posted is missing a rather critical '-'.

I just rebuilt the kernel and will test shortly.

My sincere apologies for the waste of your time. Should the problem NOT be fixed, I will add the debug code, but I don't think it will be needed.
Comment 26 rkoberman 2019-10-13 19:46:42 UTC
Looks like the patch did the trick. I just started my VM with only 2.5G of available RAM with no problems.

Thanks so much for tracking this down, and I apologize again for the botched patch. With the >30 line offset, I just applied the patch manually, and how I missed the deletion of the old line I can't imagine. Rushing, I guess. The date on my kernel was only 19 minutes after you posted the patch, and haste made waste, at least of your time.
Comment 27 Bob Prohaska 2019-10-13 22:16:05 UTC
(In reply to Mark Johnston from comment #14)
I'm doing something wrong; when I copy the unified diff text from the
patch into a file and feed it to patch, patch asks what file to patch.

Is there a downloadable version that's idiot-proof?

Apologies for the dumb questions...
Comment 28 rkoberman 2019-10-13 22:34:28 UTC
(In reply to Bob Prohaska from comment #27)
# cd /usr/src
# patch < patch-file
#

There are other ways, but this should work.

It should work for HEAD as well.
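
If the diff headers carry absolute paths (e.g. /usr/src/sys/vm/vm_page.c),
one possible workaround, untested here, is to force a zero strip count so
the full path is used as-is:

# patch -p0 < patch-file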
Comment 29 Bob Prohaska 2019-10-14 03:48:07 UTC
 (In reply to Bob Prohaska from comment #27)
It looks as if removing the absolute pathnames from the patch
fixed my mistake. The kernel compiled and booted; a new build
of www/chromium is running now.

Apologies for the noise.
apologies for the noise.