256507 – Apparent kernel memory leak in 12-STABLE/13.1-Release

Bug 256507 - Apparent kernel memory leak in 12-STABLE/13.1-Release

Summary: Apparent kernel memory leak in 12-STABLE/13.1-Release

Status:	In Progress

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern (show other bugs)
Version:	12.2-STABLE
Hardware:	amd64 Any

Importance:	--- Affects Some People
Assignee:	Mark Johnston

URL:	https://lists.freebsd.org/archives/fr...
Keywords:

Duplicates (2):	266013 266272 (view as bug list)
Depends on:
Blocks:

Reported:	2021-06-09 18:31 UTC by Dave Hayes
Modified:	2024-02-06 11:52 UTC (History)
CC List:	15 users (show)

See Also:	260245 266272

Attachments
Memory graph (44.51 KB, image/png) 2021-06-09 18:31 UTC, Dave Hayes	no flags	Details
Past 3 months of activity (64.59 KB, image/png) 2022-06-26 23:51 UTC, Dave Hayes	no flags	Details
monthly graph (74.59 KB, image/png) 2022-06-27 04:06 UTC, Bane Ivosev	no flags	Details
daily graph (91.88 KB, image/png) 2022-06-27 04:07 UTC, Bane Ivosev	no flags	Details
Output of sysctl vm hw (22.39 KB, text/plain) 2022-08-27 00:53 UTC, Dave Hayes	no flags	Details
possible bug fix (593 bytes, patch) 2022-09-27 18:42 UTC, Mark Johnston	no flags	Details \| Diff
possible bug fix (1005 bytes, patch) 2022-09-28 16:51 UTC, Mark Johnston	no flags	Details \| Diff
vmstat -z from 19.2% lost memory (19.54 KB, text/plain) 2022-10-26 07:05 UTC, Dave Hayes	no flags	Details
possible bug fix for stable/12 (724 bytes, patch) 2022-10-26 17:58 UTC, Mark Johnston	no flags	Details \| Diff
Finer grained picture of memory leak (11.45 KB, image/png) 2022-11-02 20:41 UTC, Dave Hayes	no flags	Details
Show Obsolete (1) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Dave Hayes 2021-06-09 18:31:28 UTC

Created attachment 225666 [details]
Memory graph

Consider the following output:

 # sysctl vm.stats.vm | grep count
 vm.stats.vm.v_cache_count: 0
 vm.stats.vm.v_user_wire_count: 0
 vm.stats.vm.v_laundry_count: 0
 vm.stats.vm.v_inactive_count: 121191
 vm.stats.vm.v_active_count: 20836
 vm.stats.vm.v_wire_count: 754310
 vm.stats.vm.v_free_count: 254711
 vm.stats.vm.v_page_count: 3993253

It should be pretty clear that these numbers do not add up. There are missing memory pages. 

I have some detailed statistics of this machine in prometheus. A graph of the issue is attached. I calculate "lost memory" by simply adding up all the _count variables except v_page_count, and then subtracting that sum from v_page_count.

You will note that over time, the system gradually loses free memory. Eventually this machine will start swapping and then exhaust swap space and hang. This is one example from a machine that is running relatively few services. It is not running ZFS. However, I observe the same behavior on a few other machines with disparate services and some of those are running ZFS.

I have spent some time asking on lists and looking at various sysctl values to try to determine whether I am missing something or not. I was unable to find anything relevant, and having come to the freebsd-stable list to find two others experiencing this issue, I'm filing this bug. 

Any data anyone needs, just ask me. I actually use prometheus_sysctl_exporter (thanks for that btw!). Thanks in advance. :)

Comment 1 Andriy Gapon freebsd_committer

2021-06-10 06:15:07 UTC

Just want to note that I noticed a very similar problem with stable/13.
So far I haven't been able to find any clues.
In the original report the number of unaccounted pages seem to grow smoothly and linearly.  In my case I see it growing in steps.  That is, the number would stay pretty constant (with some jitter) and then would jump over a short period of time.

I see some correlation between the jumps and certain activity, but I cannot pinpoint what exactly causes it.

Some possibilities:
- the activity involves some db style updates via mmap
- the activity involves "spawning" of processes
- the activity involves a daemon built on Mono / .NET

Comment 2 Bane Ivosev 2022-06-26 04:52:52 UTC

We have exactly the same problem as Dave and Andriy describe immediatelly after we upgrade to 13.1-RELASE. Server is Supermicro SYS-6019P-MTR, with 128 GB RAM, ZFS ... Everything was working normal for years with 12.x branch.

Comment 3 Dave Hayes 2022-06-26 23:51:39 UTC

Created attachment 234961 [details]
Past 3 months of activity

I've provided some more graphical data, graphing the actual sysctl vm.stats.vm.*_count stats vs the lost memory graph. I'm hoping this will shed enough light that someone who knows more may help fix it. I'm willing to provide almost any data needed.

Comment 4 Bane Ivosev 2022-06-27 04:06:37 UTC

Created attachment 234965 [details]
monthly graph

Comment 5 Bane Ivosev 2022-06-27 04:07:11 UTC

Created attachment 234966 [details]
daily graph

Comment 6 Bane Ivosev 2022-06-27 04:09:39 UTC

Comment on attachment 234965 [details]
monthly graph

In mid October we added RAM, then period with 12.3-RELEASE and you can see exactly when we upgrade to 13.1.

Comment 7 Andriy Gapon freebsd_committer

2022-07-05 08:21:04 UTC

Just want to update that I have not be able to root cause this problem or get rid of it.  I thought that I saw some correlation between some activity on the system and increases in missing pages.  But I was never able to reproduce the leak at will, so not sure if my observation was actually valid.

Just in case, it appeared to me that the leak was correlated with an application written in Go.  I suspected that it used some compatibility system calls (especially related to mmap) and there was a bug somewhere in the compat code.

Comment 8 Dave Hayes 2022-07-05 16:46:10 UTC

I will confirm the Go idea. The prometheus node_exporter is the only common application between two of my machines that have this bug. Good catch there, I think?

BTW, the bug is still present in stable/12-n1-1115623ac

Comment 9 Bane Ivosev 2022-07-05 21:13:00 UTC

(In reply to dave from comment #8)
Just for the record, we don't have anything in/with Go on our server and after returning back to 12.3 everything works normal.

Comment 10 Andriy Gapon freebsd_committer

2022-07-14 19:27:10 UTC

I have to correct my comment #7, the suspect application is not a Go program, it's actually a Mono program.  Somehow I confused those two things.

Anyway, the problem is still present 13.1-STABLE 689d65d736bbed.
It still correlates with activity of that application.
file identifies the "executable" as ELF 64-bit LSB shared object, x86-64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 11.3, FreeBSD-style

Comment 11 Tom Jones freebsd_committer

2022-08-25 15:48:16 UTC

The symptoms of this issue appear to be the same as 266013.

Do the 'missing' pages return to the count if you stop all running services?

Comment 12 Dave Hayes 2022-08-25 16:16:14 UTC

So...stopping -all- running services to me is an effective reboot. :)

Nevertheless, on my machine with the most minimal service deployment that has the memory issue, I stopped the biggest memory consumers:

  unbound
  node_exporter
  blackbox_exporter

Stopping them did not return the memory, as measured by this script:

--Begin script
#!/usr/local/bin/perl
use strict;
use warnings;

my $pagesize = `sysctl -n vm.stats.vm.v_page_size`; chomp($pagesize);

my %db = ();
open(STATS,"sysctl vm.stats.vm |") || die "Can't open sysctl: $!\n";
while(<STATS>) {
  if (/v_(\S+)_count:\s(\d+)/) {
     $db{$1} = $2;
  }
}
close(STATS);

my $total = $db{'page'};
foreach my $k (keys %db) {
   next if ($k eq 'page');
   $total -= $db{$k};
}

my $totalmemMB = ($pagesize * $total) / (1024 * 1024);
printf("Lost memory: %d %d-byte pages (%.f MB)\n", $total, $pagesize, $totalmemMB);
--- End script

This printed out roughly the same numbers as reported by prometheus after services were stopped.

Of course, I have superpages enabled and yet vm.stats.vm.v_page_size reports 4096 still. I've no idea if this is the correct way to calculate actual memory lost, but it looks correct.

Comment 13 Helge Oldach 2022-08-25 18:46:02 UTC

This script might yield negative results because of lazy dequeuing of wired pages which may result into double counting pages occasionally. See bug #234559.

Perhaps you are just observing similar artifacts (in the other direction)?

Comment 14 Dave Hayes 2022-08-25 19:47:54 UTC

So I only wrote the script for the purposes of addressing comment 11. My main source of data is said sysctls exported at intervals to prometheus, which is where I get the graph you can look at in the attachments here. 

That being said, if stopping and restarting services released the lost memory due to lazy reporting, it should have shown up in prometheus eventually. It has not for the past few hours. 

Additionally, bug #234559 seems to be a reporting issue. If that were only the case, I would not have opened this bug. :) If you read the original comment, a machine with this bug left to itself for long enough will start swapping, then thrashing, and finally panic when the swap space is exhausted.

Comment 15 Dave Hayes 2022-08-25 19:51:42 UTC

I should also mention this wonderful tool prometheus_sysctl_exporter(8). I have this data ingested into prometheus at a 5 second interval for both machines here that suffer from this bug. If anyone is after specific data in the sysctl space, I probably have it available and can likely render a grafana graph of whatever query you want. I am highly interested in getting this bug fixed.

Comment 16 Tom Jones freebsd_committer

2022-08-26 15:35:55 UTC

Can you tell me all the major services that are running on the host? This would help to try and get a simple reproduction environment. 

The page count from your first comment suggests this is a machine with 16GB of memory, is that correct? The only reports I have for this issue (or a very similar one) are occuring on machines with multiple TB of RAM, which is a harder to reproduce.

Comment 17 Tom Jones freebsd_committer

2022-08-26 17:42:35 UTC

Can you provide the output of: 
     sysctl vm hw

Comment 18 Dave Hayes 2022-08-27 00:53:53 UTC

Created attachment 236137 [details]
Output of sysctl vm hw

> Can you tell me all the major services that are running on the host? 

Sure:
 - openvpn
 - unbound
 - openntpd
 - openssh_portable
 - dhcpd
 - node_exporter 
 - blackbox_exporter
 - a couple of minor perl daemons 

The -only- commonality with the other machine that has this issue (which happens to be my package builder) is:  
  - openntpd
  - openssh_portable
  - node_exporter

> The page count from your first comment suggests this is a machine with 
> 16GB of memory, is that correct? 

Yes, however the other machine with this issue has 128GB of ram. 

> Can you provide the output of: sysctl vm hw

See attachment.

Comment 19 Slava 2022-08-30 23:19:28 UTC

Hello.
Just noted similar after upgrading web servers to 13.0-RELEASE, and same still in place for 13.1-RLEASE.
After upgrade servers worked no more than several days until memory exhausted, then reboot required.
Switching 'sendfile on' to 'sendfile off' in nginx config helped and servers now works stable for months, however monitoring (poor munin, I can provide graphs if required) still show strange vm behaviour which I didn't observe in 11.1-RELEASE.
No ZFS used, only UFS.
Probably this will give clue and help in investigation. Thanks.

Comment 20 Mark Johnston freebsd_committer

2022-09-27 18:42:20 UTC

Created attachment 236885 [details]
possible bug fix

If anyone able to reproduce can test a patch, please try this one.  It applies only to 13 and later - if you are seeing these problems on 12, there is something unrelated happening.

Comment 21 Mark Johnston freebsd_committer

2022-09-28 16:51:52 UTC

Created attachment 236921 [details]
possible bug fix

The last patch missed a case, this one addresses that.  If you were testing with the previous patch, please try this one instead.  Sorry for the inconvenience.

Comment 22 Bane Ivosev 2022-10-02 18:34:51 UTC

Thanks Mark,
i just applied your patch, so we'll just have to wait and see. For our web server the problem is manifested in range of one day.

Comment 23 Bane Ivosev 2022-10-05 09:28:46 UTC

(In reply to Mark Johnston from comment #21)
Hi Mark, the patch seems to do the job in our case. We tested it on 13-stable. Still, it is a strange problem, none of our other servers with 13.1 experienced such behavior.

Comment 24 Mark Johnston freebsd_committer

2022-10-05 19:14:49 UTC

(In reply to Bane Ivosev from comment #23)
Thanks for testing.  I'm about to commit the patch and will merge to 13 soon.

Comment 25 commit-hook freebsd_committer

2022-10-05 19:19:12 UTC

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=2c9dc2384f85a4ccc44a79b349f4fb0253a2f254

commit 2c9dc2384f85a4ccc44a79b349f4fb0253a2f254
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-10-05 19:12:46 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-10-05 19:12:46 +0000

    vm_page: Fix a logic error in the handling of PQ_ACTIVE operations

    As an optimization, vm_page_activate() avoids requeuing a page that's
    already in the active queue.  A page's location in the active queue is
    mostly unimportant.

    When a page is unwired and placed back in the page queues,
    vm_page_unwire() avoids moving pages out of PQ_ACTIVE to honour the
    request, the idea being that they're likely mapped and so will simply
    get bounced back in to PQ_ACTIVE during a queue scan.

    In both cases, if the page was logically in PQ_ACTIVE but had not yet
    been physically enqueued (i.e., the page is in a per-CPU batch), we
    would end up clearing PGA_REQUEUE from the page.  Then, batch processing
    would ignore the page, so it would end up unwired and not in any queues.
    This can arise, for example, when a page is allocated and then
    vm_page_activate() is called multiple times in quick succession.  The
    result is that the page is hidden from the page daemon, so while it will
    be freed when its VM object is destroyed, it cannot be reclaimed under
    memory pressure.

    Fix the bug: when checking if a page is in PQ_ACTIVE, only perform the
    optimization if the page is physically enqueued.

    PR:             256507
    Fixes:          f3f38e2580f1 ("Start implementing queue state updates using fcmpset loops.")
    Reviewed by:    alc, kib
    MFC after:      1 week
    Sponsored by:   E-CARD Ltd.
    Sponsored by:   Klara, Inc.
    Differential Revision:  https://reviews.freebsd.org/D36839

 sys/vm/vm_page.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comment 26 commit-hook freebsd_committer

2022-10-12 13:50:03 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=6094749a1a5dafb8daf98deab23fc968070bc695

commit 6094749a1a5dafb8daf98deab23fc968070bc695
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-10-05 19:12:46 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-10-12 13:49:25 +0000

    vm_page: Fix a logic error in the handling of PQ_ACTIVE operations

    As an optimization, vm_page_activate() avoids requeuing a page that's
    already in the active queue.  A page's location in the active queue is
    mostly unimportant.

    When a page is unwired and placed back in the page queues,
    vm_page_unwire() avoids moving pages out of PQ_ACTIVE to honour the
    request, the idea being that they're likely mapped and so will simply
    get bounced back in to PQ_ACTIVE during a queue scan.

    In both cases, if the page was logically in PQ_ACTIVE but had not yet
    been physically enqueued (i.e., the page is in a per-CPU batch), we
    would end up clearing PGA_REQUEUE from the page.  Then, batch processing
    would ignore the page, so it would end up unwired and not in any queues.
    This can arise, for example, when a page is allocated and then
    vm_page_activate() is called multiple times in quick succession.  The
    result is that the page is hidden from the page daemon, so while it will
    be freed when its VM object is destroyed, it cannot be reclaimed under
    memory pressure.

    Fix the bug: when checking if a page is in PQ_ACTIVE, only perform the
    optimization if the page is physically enqueued.

    PR:             256507
    Fixes:          f3f38e2580f1 ("Start implementing queue state updates using fcmpset loops.")
    Reviewed by:    alc, kib
    Sponsored by:   E-CARD Ltd.
    Sponsored by:   Klara, Inc.

    (cherry picked from commit 2c9dc2384f85a4ccc44a79b349f4fb0253a2f254)

 sys/vm/vm_page.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comment 27 Mark Johnston freebsd_committer

2022-10-12 16:17:32 UTC

*** Bug 266013 has been marked as a duplicate of this bug. ***

Comment 28 Dave Hayes 2022-10-12 18:50:46 UTC

(In reply to Mark Johnston from comment #20)

Is there any chance something like this might be happening on 12? Is there any data you need to address the issue on 12?

Comment 29 Mark Johnston freebsd_committer

2022-10-25 19:23:48 UTC

(In reply to dave from comment #28)
I'm sorry, I missed this followup.

I'd like to know any details about the workloads you're using on stable/12 which exhibit the problem.  So far it looks like,
- it happens with or without ZFS in use
- the system is leaking pages at a constant rate
- it only happens on systems running certain go applications(?)
- stopping services does not cause the lost memory to reappear
- it happens on the latest stable/12

Given that the rate of the leak appears to be nearly constant, is it possible to figure out which service is triggering it?

Comment 30 Dave Hayes 2022-10-26 00:25:16 UTC

(In reply to Mark Johnston from comment #29)

Thank you for replying. So to confirm:

- Yes it happens with or without ZFS in use
- The system is leaking pages at a constant rate (and this rate is different for each machine)
- Both systems are running prometheus exporters (the go applications you refer to)
- Stopping services does not cause the lost memory to return

However, "latest stable" is not what I am running. stable/12-n1-1115623ac
is what I am running, which is effectively 12.3-STABLE from some months ago. I had considered upgrading to the latest 12/stable, but the report of the bug in 13 stopped me from doing this. 

I personally do not believe a service is triggering it. From my extensive stats, I have almost exactly graphically linked vm.stats.vm.v_free_count to the lost memory measurement. All the other vm.stats.vm constants have no real graphical correlation to the lost memory measurement. You can see some of this in one of my attachements.

Based on this observation alone, what you described as the cause for this kind of bug in 13 appears to me to be the most likely cause in 12 as well. Do note that I am not a kernel dev. :)

Let me know if you need any more data.

Comment 31 Mark Johnston freebsd_committer

2022-10-26 00:35:16 UTC

(In reply to dave from comment #30)
Is it possible to see whether stopping the prometheus exporters also stops the page leak?

Can you please share output of "vmstat -z" from a system that's leaked a "substantial" number of pages?  Say, more than 10% of memory.

I am sure that a kernel bug is responsible for what you're seeing, but I'm quite sure it's not the same bug as the one I already fixed.  The affected code was heavily rewritten between 12 and 13, which is where the problem was introduced; many of the folks who saw a problem on 13 reported seeing it after an upgrade from 12.  The bug in 12 might be similar, but I haven't been able to spot it by code inspection (yet), so right now I'm just trying to narrow down the areas where this bug could plausibly be lurking.

Comment 32 Dave Hayes 2022-10-26 07:05:27 UTC

Created attachment 237639 [details]
vmstat -z from 19.2% lost memory

Here you go. Hope this helps.

Comment 33 Dave Hayes 2022-10-26 07:16:59 UTC

I have picked the machine with the faster leak (the one I sent the vmstat -z for) and have stopped node_exporter and prometheus_sysctl_exporter. I will leave it in this state for 12-16 hours, after which time I should be able to see the leak stop iff the exporters are the stimulus of this bug.

Comment 34 Mark Johnston freebsd_committer

2022-10-26 17:58:28 UTC

Created attachment 237658 [details]
possible bug fix for stable/12

Here's a patch for a bug that could cause the symptoms you're seeing.  If you're able to test it soon, it would be greatly appreciated; if it fixes the problem I can get included it with a batch of errata patches next week.

If you're not able to patch the kernel, another test to try is to set the vm.pageout_update_period sysctl to 0 on a running system.  If the leak still occurs with that setting, then my patch won't help.

Comment 35 Dave Hayes 2022-10-26 18:45:08 UTC

While I can patch the kernel, it seems sensible to try your sysctl setting first. Are there any potential side effects other than fixing the issue? :)

Comment 36 Mark Johnston freebsd_committer

2022-10-26 18:55:13 UTC

(In reply to dave from comment #35)
I don't think there will be any downsides.  The scanning mostly is mostly useful when you have a large active queue, which doesn't appear to be the case in your workload.

Though, on second thought it is possible for the patch to help even if setting pageout_update_period=0 doesn't fix the problem.  I had said it wouldn't because I see v_dfree = 0 in the vm sysctl dump you attached, but that might not reflect reality on all of your systems.

So, please try the sysctl test, but it's worth also testing the patch no matter what the result.

Comment 37 Dave Hayes 2022-10-26 21:11:42 UTC

So the sysctl test after a couple hours shows zero sign of fixing the memory leak.

Which OS branch of 12/stable is your patch relative to? A build process shouldn't take too long here but I want to make sure we are both referencing the same source code; my source code is probably ancient to you.

Comment 38 Dave Hayes 2022-10-26 21:21:54 UTC

I wish I could edit comments. I'm looking for which 12/stable revision I need to grab. I can just grab the latest one if that will work.

Comment 39 Mark Johnston freebsd_committer

2022-10-26 21:26:08 UTC

(In reply to dave from comment #37)
What's the value of the vm.stats.vm.v_pfree sysctl on that system?

Comment 40 Mark Johnston freebsd_committer

2022-10-26 21:27:33 UTC

(In reply to dave from comment #38)
Any recent revision will do.  That code has not changed much in stable/12 for the past year or so.  You mentioned stable/12-n1-1115623ac earlier, which should be fine.

Comment 41 Dave Hayes 2022-10-26 21:36:44 UTC

# sysctl -a vm.stats.vm.v_pfree
vm.stats.vm.v_pfree: 8102884297

Want a graph of that over time?

Comment 42 Mark Johnston freebsd_committer

2022-10-26 21:41:03 UTC

(In reply to dave from comment #41)
Thanks.  No, a large value just suggests that the patch has a chance of fixing the bug you're hitting.

Comment 43 Dave Hayes 2022-10-27 07:32:41 UTC

The patch is not working to resolve the memory issue. :/

Comment 44 commit-hook freebsd_committer

2022-11-01 20:34:54 UTC

A commit in branch releng/13.1 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4867d7d34dfd54986d5798eddc3ce92a70cc9841

commit 4867d7d34dfd54986d5798eddc3ce92a70cc9841
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-10-05 19:12:46 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-11-01 13:28:11 +0000

    vm_page: Fix a logic error in the handling of PQ_ACTIVE operations

    As an optimization, vm_page_activate() avoids requeuing a page that's
    already in the active queue.  A page's location in the active queue is
    mostly unimportant.

    When a page is unwired and placed back in the page queues,
    vm_page_unwire() avoids moving pages out of PQ_ACTIVE to honour the
    request, the idea being that they're likely mapped and so will simply
    get bounced back in to PQ_ACTIVE during a queue scan.

    In both cases, if the page was logically in PQ_ACTIVE but had not yet
    been physically enqueued (i.e., the page is in a per-CPU batch), we
    would end up clearing PGA_REQUEUE from the page.  Then, batch processing
    would ignore the page, so it would end up unwired and not in any queues.
    This can arise, for example, when a page is allocated and then
    vm_page_activate() is called multiple times in quick succession.  The
    result is that the page is hidden from the page daemon, so while it will
    be freed when its VM object is destroyed, it cannot be reclaimed under
    memory pressure.

    Fix the bug: when checking if a page is in PQ_ACTIVE, only perform the
    optimization if the page is physically enqueued.

    Approved by:    so
    Security:       FreeBSD-EN-22:23.vm
    PR:             256507
    Fixes:          f3f38e2580f1 ("Start implementing queue state updates using fcmpset loops.")
    Reviewed by:    alc, kib
    Sponsored by:   E-CARD Ltd.
    Sponsored by:   Klara, Inc.

    (cherry picked from commit 2c9dc2384f85a4ccc44a79b349f4fb0253a2f254)
    (cherry picked from commit 6094749a1a5dafb8daf98deab23fc968070bc695)

 sys/vm/vm_page.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comment 45 Graham Perrin freebsd_committer

2022-11-02 07:26:35 UTC

(In reply to commit-hook from comment #44)

> FreeBSD-EN-22:23.vm

<https://lists.freebsd.org/archives/freebsd-announce/2022-November/000050.html>

> Affects:        FreeBSD 13.1

Comment 46 Dave Hayes 2022-11-02 20:41:10 UTC

Created attachment 237828 [details]
Finer grained picture of memory leak

I'm wondering if this picture helps at all to get a working patch for FreeBSD 12? It's the finest grained detail I can display about the memory bug. 

Are there downsides to attempting to use the patch from 13 as a test?

Comment 47 Dave Hayes 2023-05-02 06:23:33 UTC

So a brief update. I've upgraded one of the affected machines to 13.2-STABLE 8c09bde96. Apparently, the memory leak has moved from "lost" to wired memory. Machine has already crashed running from out of memory, to swapping, to starting to thrash.

I would say the actual bug isn't found yet, but this entire effort has moved the visibility of the bug from "having to be clever about calculations" to "look the wired memory is leaking".

How would I go about finding this leak? What can I monitor from things like vmstat -z?

Comment 48 yuan.mei 2023-10-05 06:39:42 UTC

I started to notice a very similar behavior after upgrading to 13.2-STABLE stable/13-n256460-03b6464b29fe.  I use ZFS and my major workload is netatalk.  The machine's wired memory would go up over the course of a day or so until many processes get killed.  Even getty cannot get pages, which prevents me from login.  

The strange thing is that I only see this behavior now (after an git pull and update at the beginning of Oct.).  My last such update of the system was at the end of July and after that the system ran continuously without a problem for two months.

How can I help to provide more data to help debug this?

$ sysctl vm.stats.vm | grep -i count
vm.stats.vm.v_cache_count: 0
vm.stats.vm.v_user_wire_count: 0
vm.stats.vm.v_laundry_count: 0
vm.stats.vm.v_inactive_count: 946022
vm.stats.vm.v_active_count: 75985
vm.stats.vm.v_wire_count: 1401974
vm.stats.vm.v_free_count: 1633981
vm.stats.vm.v_page_count: 4057118

Comment 49 Dave Hayes 2023-10-05 23:37:08 UTC

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=271246

From my perception, it seems to be that people do not want to fix this issue. :)

Comment 50 kadir kose 2023-12-08 13:31:30 UTC

Hi, I encountered the same issue. Based on the sysctl outputs below, I think the lost memory is due to unswappable pages because the unswappable page count is increasing and RAM is decreasing in the outputs of the top command. However, I don't know why.


vm.stats.vm.v_laundry_count: 56910
vm.stats.vm.v_inactive_count: 329390
vm.stats.vm.v_active_count: 313255
vm.stats.vm.v_wire_count: 132393
vm.stats.vm.v_free_count: 44036
vm.stats.vm.v_page_count: 1000174
vm.stats.vm.v_page_size: 4096

vm.domain.0.stats.unswappable: 98333
vm.domain.0.stats.laundry: 56910
vm.domain.0.stats.inactive: 329390
vm.domain.0.stats.active: 313255
vm.domain.0.stats.free_count: 44036

Comment 51 Eugene Grosbein freebsd_committer

2024-01-23 06:34:47 UTC

*** Bug 266272 has been marked as a duplicate of this bug. ***

Comment 52 Borja Marcos 2024-02-06 11:20:10 UTC

I have a data point that *might* be related to this. A bit fuzzy, sorry, but maybe it can ring a bell for someone.

Today I compiled a heavyweight port (Mongodb) inside a jail on a server running TrueNAS 13. Inactive memory went to the roof (8 GB on a 16 GB machine) and of course it literally squeezed the ZFS ARC.

I stopped everything I could (there are several jails running stuff) but inactive memory didn't decrease. 

I had even stopped the jails and restarted them hoping that something was holding that memory (although it doesn't make much sense!).

The big surprise: I didn't reboot the system but going inside the jail where I compiled Mongodb I did a make clean on the port directory. And suddenly Inactive memory went from 8 GB to 1.5 GB!

I am wondering (sorry about this extremely fuzzy data point), is there any sort of directory/cache leak related to jails? Destroying all of those files really solved the situation.

So, key points:

- Using jails
- Using ZFS (of course)
- Compiling a heavy port and dependencies inside the jail

Theory: Doing that inside a jail somehow made pages in Inactive memory "stick".

Comment 53 Andriy Gapon freebsd_committer

2024-02-06 11:49:13 UTC

(In reply to Borja Marcos from comment #52)
First of all, the symptoms you describe seem to have nothing to do with this bug. This bug is about total memory shrinking (as if physical memory were removed in bits).

Second, I suspect that what you observed is related to tmpfs.

Anyway, please don't hijack bug reports.

Comment 54 Borja Marcos 2024-02-06 11:52:47 UTC

Sorry, not intending to hijack it at all.

I arrived to this particular bug following some of the duplicates (probably fat fingering my searches, sorry!) and I thought it might be related.

Going silent on this one and I will search for tmpfs related bugs. 

Thank you and please accept my apologies!