Bug 260245 - swap/vm: Apparent memory leak: 100% swap usage
Summary: swap/vm: Apparent memory leak: 100% swap usage
Status: Closed Feedback Timeout
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.2-RELEASE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Bugmeister
URL:
Keywords: needs-qa
Depends on:
Blocks:
 
Reported: 2021-12-05 23:58 UTC by Kubilay Kocak
Modified: 2025-01-24 15:09 UTC
CC List: 4 users

See Also:


Attachments
top output STARTING (first run) (48.52 KB, image/png)
2021-12-06 00:13 UTC, Kubilay Kocak
vmstat / sysctl output BEFORE (14.30 KB, text/plain)
2021-12-06 01:38 UTC, Kubilay Kocak
vmstat / sysctl output DURING (first run) (14.38 KB, text/plain)
2021-12-06 01:40 UTC, Kubilay Kocak
top output DURING (first run) (83.26 KB, image/png)
2021-12-06 01:41 UTC, Kubilay Kocak
vmstat / sysctl /top output REPROD (18.70 KB, text/plain)
2021-12-07 00:47 UTC, Kubilay Kocak

Description Kubilay Kocak freebsd_committer freebsd_triage 2021-12-05 23:58:01 UTC
A system running buildbot workers for the upstream (C)Python project's CI, on 12.2-RELEASE-p7 GENERIC amd64, sees swap usage increase over time (across multiple runs) until 100% is consumed, at which point the following errors are logged:

Nov 23 10:10:57 122-RELEASE-p10-amd64-9e36 kernel: swap_pager_getswapspace(18): failed
Nov 23 10:10:57 122-RELEASE-p10-amd64-9e36 kernel: swap_pager_getswapspace(9): failed
Nov 23 10:10:57 122-RELEASE-p10-amd64-9e36 kernel: swap_pager_getswapspace(3): failed
Nov 23 10:10:58 122-RELEASE-p10-amd64-9e36 kernel: pid 24131 (python), jid 0, uid 1002, was killed: out of swap space
Nov 23 10:11:01 122-RELEASE-p10-amd64-9e36 kernel: pid 78211 (python3.9), jid 0, uid 1002, was killed: out of swap space
Nov 23 10:11:01 122-RELEASE-p10-amd64-9e36 kernel: swap_pager_getswapspace(32): failed
Nov 23 10:11:01 122-RELEASE-p10-amd64-9e36 syslogd: last message repeated 26 times
Nov 23 10:11:01 122-RELEASE-p10-amd64-9e36 kernel: swap_pager_getswapspace(24): failed
Nov 23 10:11:01 122-RELEASE-p10-amd64-9e36 kernel: swap_pager_getswapspace(32): failed
Nov 23 10:11:01 122-RELEASE-p10-amd64-9e36 kernel: swap_pager_getswapspace(24): failed
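The messages above follow two fixed formats, so one way to gauge how often swap allocation fails, and which processes the kernel kills for lack of swap, is to scan the system log for them. This is a minimal sketch, assuming the message formats are exactly as quoted above and that a syslogd "last message repeated N times" line only needs counting when it follows a swap_pager_getswapspace failure; the function name `summarize` is hypothetical:

```python
import re
from typing import Iterable, List, Tuple

# Message formats copied from the kernel log excerpt above (assumed stable).
SWAP_FAIL = re.compile(r"kernel: swap_pager_getswapspace\(\d+\): failed")
SWAP_KILL = re.compile(
    r"kernel: pid (\d+) \(([^)]+)\), jid \d+, uid (\d+), "
    r"was killed: out of swap space"
)
REPEATED = re.compile(r"syslogd: last message repeated (\d+) times")


def summarize(lines: Iterable[str]) -> Tuple[int, List[Tuple[int, str, int]]]:
    """Return (swap allocation failures, [(pid, command, uid) of killed processes])."""
    failures = 0
    killed: List[Tuple[int, str, int]] = []
    previous_was_failure = False
    for line in lines:
        if SWAP_FAIL.search(line):
            failures += 1
            previous_was_failure = True
            continue
        kill = SWAP_KILL.search(line)
        if kill:
            pid, command, uid = kill.groups()
            killed.append((int(pid), command, int(uid)))
            previous_was_failure = False
            continue
        repeat = REPEATED.search(line)
        if repeat and previous_was_failure:
            # syslogd collapsed identical failure messages; count the repeats too.
            failures += int(repeat.group(1))
            continue
        previous_was_failure = False
    return failures, killed
```

Passing an open handle on /var/log/messages to `summarize` would, for the excerpt quoted above, report 33 allocation failures and the two killed Python processes.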

Usually, a single test run consumes 30-40% of the 8 GB of swap, and a second or third run consumes all of it.
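To track the per-run growth described above, swap usage can be recorded after each test run. This is a minimal sketch, assuming the text layout of `swapinfo -k` (a header row, one row per swap device with the 1K-blocks and Used columns second and third, and a Total row when several devices are configured); `swap_usage_percent` is a hypothetical helper name:

```python
def swap_usage_percent(swapinfo_output: str) -> float:
    """Overall swap usage in percent, computed from `swapinfo -k` output."""
    total_kb = used_kb = 0
    rows = swapinfo_output.strip().splitlines()
    for row in rows[1:]:  # the first row is the column header
        fields = row.split()
        if not fields or fields[0] == "Total":
            continue  # skip blank lines and the summary row to avoid double counting
        total_kb += int(fields[1])
        used_kb += int(fields[2])
    return 100.0 * used_kb / total_kb if total_kb else 0.0
```

The input can be captured with `subprocess.run(["swapinfo", "-k"], capture_output=True, text=True).stdout` after each run. With 8 GB of swap, 30-40% per run corresponds to roughly 2.4-3.2 GB.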

The swap device cannot be removed with swapoff, and memory/VM resource utilisation does not change after killing the processes.

The issue does not appear to be reproducible after updating to a stable/12 kernel installed in place.

Steps to reproduce:

- Install 12.2-RELEASE and update to latest patch level (p7 at time of writing)
- Checkout CPython (main) sources and run test suite [1]

Additional References:

Buildbot worker build run history: https://buildbot.python.org/all/#/builders/172 (test failure output may provide additional info)

[1] test command: make buildbottest TESTOPTS=-j4 TESTTIMEOUT=2100

I will follow up this report with `sysctl -a | grep vm` and `vmstat -z` output from before and after reproducing the issue.
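Comparing two `vmstat -z` snapshots by hand is tedious; a small script can rank UMA zones by how much their in-use memory grew between them. This is a minimal sketch, assuming the FreeBSD 12 `vmstat -z` layout (`NAME: SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP`). Growth is estimated as the change in USED multiplied by the item SIZE, and the helper names `parse_vmstat_z` and `zone_growth` are hypothetical:

```python
from typing import Dict, List, Tuple


def parse_vmstat_z(text: str) -> Dict[str, Tuple[int, int]]:
    """Map zone name -> (item size in bytes, items in use) from `vmstat -z` output."""
    zones: Dict[str, Tuple[int, int]] = {}
    for line in text.splitlines():
        name, sep, rest = line.rpartition(":")
        if not sep or not name.strip():
            continue  # header row or blank line
        fields = [f.strip() for f in rest.split(",")]
        if len(fields) < 3:
            continue
        try:
            size, used = int(fields[0]), int(fields[2])
        except ValueError:
            continue  # not a zone statistics row
        zones[name.strip()] = (size, used)
    return zones


def zone_growth(before: str, after: str, top: int = 10) -> List[Tuple[str, int]]:
    """Zones whose in-use bytes grew between two snapshots, largest growth first."""
    old, new = parse_vmstat_z(before), parse_vmstat_z(after)
    growth = []
    for name, (size, used_after) in new.items():
        used_before = old.get(name, (size, 0))[1]  # zones absent before start at 0
        delta_bytes = (used_after - used_before) * size
        if delta_bytes > 0:
            growth.append((delta_bytes, name))
    growth.sort(reverse=True)
    return [(name, delta) for delta, name in growth[:top]]
```

Note that anonymous process memory paged out to swap is not itself a UMA zone, so this only highlights kernel allocator growth that accompanies the swap usage.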
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2021-12-06 00:13:08 UTC
Created attachment 229931
top output STARTING (first run)
Comment 2 Kubilay Kocak freebsd_committer freebsd_triage 2021-12-06 01:38:15 UTC
Created attachment 229932
vmstat / sysctl output BEFORE
Comment 3 Kubilay Kocak freebsd_committer freebsd_triage 2021-12-06 01:40:32 UTC
Created attachment 229933
vmstat / sysctl output DURING (first run)
Comment 4 Kubilay Kocak freebsd_committer freebsd_triage 2021-12-06 01:41:44 UTC
Created attachment 229934
top output DURING (first run)
Comment 5 Mark Johnston freebsd_committer freebsd_triage 2021-12-06 16:59:59 UTC
Do you get any OOM kills while the tests are running?
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2021-12-06 18:53:16 UTC
I tried building CPython and running the test suite in a 12.2 bhyve VM with 2 GB of RAM, and it does not swap at all. I suspect that some tests are being skipped due to missing dependencies, that I did not configure or build Python in some specific way (I just ran ./configure && make), or that the test invocation I used is somehow not right.

(In reply to Mark Johnston from comment #5)
Never mind, I missed it in the first comment.
Comment 7 Kubilay Kocak freebsd_committer freebsd_triage 2021-12-07 00:42:47 UTC
(In reply to Mark Johnston from comment #5)

I'll see if I can reproduce this outside the buildbot environment and watch the tests run. What should I look for that would indicate OOM kills specifically?
Comment 8 Kubilay Kocak freebsd_committer freebsd_triage 2021-12-07 00:47:36 UTC
Created attachment 229947
vmstat / sysctl /top output REPROD

Woke up this morning with all swap used (see the attached file with sysctl vm, vmstat -z and top output).

Perhaps related: buildbot leaves 'leftover' processes at the end of the test run.

Note: I've noticed these for quite a while, so they may not be related, but they are worth pointing out just in case.
Comment 9 Kubilay Kocak freebsd_committer freebsd_triage 2021-12-07 00:57:22 UTC
Cloning CPython main onto the same system to attempt reproduction outside of buildbot.

The only additional dependencies I installed for the worker are gdb and sqlite, for their Python modules and tests, which could well be relevant.

Here is the full list of possibly relevant packages installed on the system, with known Python dependencies (bundled, tests, or otherwise) marked with an asterisk (*):

autoconf-2.69_3            =   up-to-date with index
autoconf-wrapper-20131203  =   up-to-date with index
ca_root_nss-3.71           =   up-to-date with index
curl-7.80.0                =   up-to-date with index
expat-2.4.1                =   up-to-date with index *
gdb-11.1_1                 =   up-to-date with index *
gettext-runtime-0.21       =   up-to-date with index
git-2.34.1                 =   up-to-date with index * (buildbot CI steps use this)
gmake-4.3_2                =   up-to-date with index
gmp-6.2.1                  =   up-to-date with index
indexinfo-0.3.1            =   up-to-date with index
libevent-2.1.12            =   up-to-date with index
libffi-3.3_1               =   up-to-date with index *
libiconv-1.16              =   up-to-date with index * 
libtextstyle-0.21          =   up-to-date with index
m4-1.4.19,1                =   up-to-date with index
mpdecimal-2.5.1            =   up-to-date with index
mpfr-4.1.0_1               =   up-to-date with index
pcre-8.45                  =   up-to-date with index
perl5-5.32.1_1             =   up-to-date with index
pkgconf-1.8.0,1            =   up-to-date with index
popt-1.18_1                =   up-to-date with index
python39-3.9.8             =   up-to-date with index
readline-8.1.1             =   up-to-date with index * 
sqlite3-3.35.5_4,1         =   up-to-date with index *
Comment 10 Michał Skalski 2022-03-23 17:24:43 UTC
I also tested with the jemalloc dirty_decay_ms=0 setting, but this changed nothing.

Attaching log.
Comment 11 Michał Skalski 2022-03-23 17:26:22 UTC
Sorry for the noise - made comment in wrong browser window, please disregard comment #10.
Comment 12 Mark Linimon freebsd_committer freebsd_triage 2024-10-04 11:15:59 UTC
^Triage: clear stale flags.

To submitter: is this PR still relevant?