Bug 262743 - Memory leak in security/strongswan's charon daemon when communicating over vici socket.
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: misc
Version: 13.1-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: needs-qa
Depends on:
Blocks:
 
Reported: 2022-03-23 17:20 UTC by Michał Skalski
Modified: 2023-01-14 17:19 UTC
CC List: 4 users

See Also:
koobs: maintainer-feedback? (strongswan)


Attachments
Dump of statistics of jemalloc library at charon daemon exit (21.78 KB, text/plain)
2022-03-23 17:20 UTC, Michał Skalski
jemalloc stats with dirty_decay_ms=0 setting (21.44 KB, text/plain)
2022-03-23 17:26 UTC, Michał Skalski
Script for tracing memory footprint of process given by pid or its name (4.23 KB, text/plain)
2022-08-08 15:37 UTC, Michał Skalski
gzipped charon's memory footprint log (768.76 KB, application/gzip)
2022-08-08 15:39 UTC, Michał Skalski
Charon's memory footprint chart (30.77 KB, image/png)
2022-08-08 15:40 UTC, Michał Skalski
security/strongswan: Avoid a memory leak (562 bytes, patch)
2023-01-13 06:42 UTC, Jose Luis Duran

Description Michał Skalski 2022-03-23 17:20:28 UTC
Created attachment 232660 [details]
Dump of statistics of jemalloc library at charon daemon exit

On FreeBSD (amd64, arm64), communicating with charon over the vici socket makes the virtual and resident memory (VMS and RSS) of the process grow continuously until all system memory is exhausted. At that point the kernel kills charon with the message: kernel: pid 903 (charon), jid 0, uid 0, was killed: failed to reclaim memory.

None of the tools tried for memory leak detection (valgrind, ktrace) reports a leak; the growing RSS is the only symptom.

The same behaviour was observed on FreeBSD 12.1, 12.2 and 9.3 (the latter is the last release before incorporating jemalloc library to FreeBSD's libc).

When running the charon daemon on Linux (tested on Ubuntu 20.04 and Debian 10 bookworm/sid), the problem does not occur.


I think this behaviour is caused by the frequent memory allocation and deallocation (malloc/free) done in the vici plugin. I also observed that SA renegotiations can cause the same increase, but that is harder to isolate.

There is no special malloc configuration for the charon daemon, and other applications on the same FreeBSD box are not affected, e.g. some running Python daemons (which I believe do massive allocations and use multiple threads). I wonder what is specific about the way strongswan allocates memory that makes the process RSS grow so much?

To reproduce:
=============
1. Download any VM image with FreeBSD 12.0+ (also tested on the latest amd64 13.1-BETA2 to confirm).
Configure the virtual machine: give it more memory for compiling strongswan, but 256 MB is enough for the test.

2. Run the VM and disable swap (to speed up the failure)

# swapoff /dev/gpt/swapfs

3. Install the packages required to compile strongswan:

# pkg install git autoconf gperf autoconf-archive libtool m4 automake flex bison pkgconf gettext

4. Get strongswan: git clone https://github.com/strongswan/strongswan

5. Compile strongswan:

cd strongswan
./configure --disable-kernel-netlink --enable-kernel-pfroute --enable-kernel-pfkey --disable-gmp --enable-openssl --enable-mediation --disable-scripts --with-group=wheel --enable-gcm --enable-ccm --enable-pkcs11
make -j4
make install

6. Start strongswan: ipsec start

7. Run in a loop any command that communicates over the vici interface; swanctl --stats is enough to reproduce the error:

sh -c 'while swanctl --stats >/dev/null; do true; done'

8. Observe the increase in virtual and resident memory (VMS and RSS) of the charon process, using e.g. top

9. After a few hours charon should be killed by the kernel due to insufficient memory/swap space.
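
The growth in step 8 can also be logged instead of watched in top. A minimal sketch, assuming `ps` accepts `-o rss= -o vsz=` and reports both values in kilobytes; `sample_mem` and the log file name are hypothetical and are not part of strongswan:

```shell
#!/bin/sh
# sample_mem PID: print "epoch rss_kb vsz_kb" for one process.
# Hypothetical helper, for illustration only.
sample_mem() {
    pid=$1
    # The trailing "=" after each keyword suppresses the header line.
    out=$(ps -o rss= -o vsz= -p "$pid") || return 1
    # Split the ps output into positional parameters: $1=rss, $2=vsz.
    set -- $out
    [ $# -eq 2 ] || return 1
    printf '%s %s %s\n' "$(date +%s)" "$1" "$2"
}

# Example (not run here): sample charon every 10 seconds until it exits.
# pid=$(pgrep -x charon)
# while sample_mem "$pid"; do sleep 10; done >> charon_samples.log
```

The loop ends by itself once the process is gone, because `ps` then fails and `sample_mem` returns non-zero.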


Additional info
===============
The problem first occurred while monitoring the state of the charon daemon (tunnel definitions, SAs, etc.) via the vici socket, but it was also reproduced with a simple swanctl --stats command repeated in a loop.

No change in this behaviour is observed with different values of configure's --with-printf-hooks= option. According to a pfSense issue (https://redmine.pfsense.org/issues/5149) this could be the cause, but tests with --with-printf-hooks=builtin, --with-printf-hooks=glibc and --with-printf-hooks=vstr did not fix the error.


I ran some tests with various `jemalloc` settings and am attaching the results, but I don't know how to interpret them. They were gathered using the following command:

sh -c "MALLOC_CONF='stats_print:true,narenas:1' /usr/local/libexec/ipsec/charon 2>/var/log/charon-memdump-0.log"
Comment 1 Michał Skalski 2022-03-23 17:26:46 UTC
Created attachment 232661 [details]
jemalloc stats with dirty_decay_ms=0 setting

I also ran a test with jemalloc's dirty_decay_ms=0 setting, but this changed nothing.

Attaching log.
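
jemalloc reads `MALLOC_CONF` as a comma-separated list of `name:value` pairs, so this setting is spelled `dirty_decay_ms:0` there. A sketch of composing such a string, assuming the option was simply appended to the ones from the description (the exact command line used for this run is not recorded in this report); `join_malloc_conf` is a hypothetical helper:

```shell
#!/bin/sh
# join_malloc_conf OPT...: join name:value options with commas.
# Hypothetical helper, for illustration only.
join_malloc_conf() {
    # "$*" joins the arguments with the first character of IFS;
    # the subshell keeps the IFS change local.
    ( IFS=,; printf '%s\n' "$*" )
}

conf=$(join_malloc_conf stats_print:true narenas:1 dirty_decay_ms:0)
echo "MALLOC_CONF='$conf'"
```

The printed assignment can then be placed in front of the charon command in the same way as in the description.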
Comment 2 Michał Skalski 2022-03-23 17:48:49 UTC
For reference, I also filed a bug on strongswan's GitHub issue tracker:

https://github.com/strongswan/strongswan/issues/966
Comment 3 Kubilay Kocak freebsd_committer freebsd_triage 2022-07-14 01:17:42 UTC
@Reporter Could you please:

- Provide full `uname -a` output for the latest version the issue is reproducible on
- Test whether the issue is reproducible using the strongswan port/pkg
Comment 4 Mark Millard 2022-07-17 04:14:29 UTC
Do you have something like the following in, say, /boot/loader.conf :

#
# Delay when persistent low free RAM leads to
# Out Of Memory killing of processes:
vm.pageout_oom_seq=120

vs. are you using the default value, 12?
Comment 5 Michał Skalski 2022-07-27 17:14:38 UTC
Thank you very much for the clues


(In reply to Mark Millard from comment #4)

Yes, using default value 12 (I used stock qemu VM image):

    # sysctl vm.pageout_oom_seq
    vm.pageout_oom_seq: 12




(In reply to Kubilay Kocak from comment #3)


Result of `uname -a` (vm image FreeBSD-13.1-RELEASE):

    FreeBSD freebsd 13.1-RELEASE FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212 GENERIC  amd64


and the same applies to:

    FreeBSD freebsd 13.1-RELEASE FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212 GENERIC  arm64


Indeed, the packaged version (strongswan-5.9.6_2, either installed from the binary package or built from ports) behaves differently. Memory usage still increases, but only RSS (VMS does not change over the observed period) and on a much smaller scale; the rise is logarithmic rather than linear.

I then compared the configure options of the strongswan port and, among other differences, `--with-printf-hooks=builtin` is specified by default for the port version.


So I again tested the strongswan 5.9.6 sources (the same version as used by the port), and when strongswan is configured with:

    ./configure --disable-kernel-netlink --enable-kernel-pfroute --enable-kernel-pfkey --disable-gmp --enable-openssl \
                --enable-mediation --disable-scripts --with-group=wheel --enable-gcm --enable-ccm --enable-pkcs11 \
                --with-printf-hooks=builtin


(only `--with-printf-hooks=builtin` is added) the memory usage is similar to the packaged version, even with the unpatched original sources. Without this option (or when the port version is compiled with the `libc` printf hook, which I believe is the default one) memory usage rises quickly, as described in this bug report.
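
Since the flag is easy to leave out, a build script can inspect the configure arguments before running them. A sketch; `printf_hooks_of` is a hypothetical helper, and the "(not set)" result only means the flag was absent, so configure would pick its own default:

```shell
#!/bin/sh
# printf_hooks_of ARG...: print the value given to --with-printf-hooks=,
# or "(not set)" when the flag is absent. If the flag is repeated, the
# last occurrence wins. Hypothetical helper, for illustration only.
printf_hooks_of() {
    hooks="(not set)"
    for arg in "$@"; do
        case $arg in
            --with-printf-hooks=*) hooks=${arg#--with-printf-hooks=} ;;
        esac
    done
    printf '%s\n' "$hooks"
}

# Example (not run here): refuse to build without the builtin hooks.
# [ "$(printf_hooks_of "$@")" = builtin ] || {
#     echo "missing --with-printf-hooks=builtin" >&2; exit 1; }
```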


I must have missed that in the previous tests: I didn't notice that VMS does not rise and that RSS rises on a much smaller scale.
I also checked `printf-hooks=vstr`: its memory usage increase is slightly bigger than for `builtin`, but VMS is still constant.

I'm still not sure whether memory rises too much with `printf-hooks=builtin`; I will check it.
Comment 6 Mark Millard 2022-07-27 19:31:37 UTC
(In reply to Michał Skalski from comment #5)

Use of the likes of vm.pageout_oom_seq=120 should delay any kills for
failures to reclaim enough memory to reach FreeBSD's target figure
for free RAM.

This can get extra time to inspect/investigate evidence about the
on-going memory/RAM usage.
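
A sketch of making such a setting persistent without ending up with duplicate lines, assuming the `name=value` line format shown in comment 4; `set_tunable` is a hypothetical helper, not a FreeBSD tool:

```shell
#!/bin/sh
# set_tunable FILE NAME VALUE: make FILE contain exactly one NAME=VALUE line.
# Hypothetical helper, for illustration only.
set_tunable() {
    file=$1 name=$2 value=$3
    tmp=$(mktemp) || return 1
    # Copy every line that does not start with the literal text "NAME=".
    if [ -f "$file" ]; then
        awk -v n="$name=" 'index($0, n) != 1' "$file" > "$tmp"
    fi
    printf '%s=%s\n' "$name" "$value" >> "$tmp"
    # Overwrite in place so the file keeps its owner and mode.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example (not run here, needs root):
# set_tunable /boot/loader.conf vm.pageout_oom_seq 120
```

Commented-out or indented assignments are left untouched, because only lines beginning with the exact name are replaced.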


Note: Using an increased vm.pageout_oom_seq is useful for avoiding
failed-to-reclaim kills only for bounded-duration "stays running"
activities. This can allow buildworld buildkernel -j4 on Small
Board Computers with 4 cores and only 2 GiBytes of RAM, for
example, when using the default tends to suffer failed-to-reclaim
kills.


Note: In sufficiently modern variants of FreeBSD the messages
about kills were improved and no longer always report being
out of available swap space as the reason for the kill. The
messaging about reclaim failures is an example of the
improved messaging. Reclaim failures can happen even with
a swap space being configured but little/none of the swap space
being put to use. All it takes is one process (or more) that
stays runnable while keeping nearly all the RAM pages in the
active state (so: unable to be reclaimed). Even now, if a
FreeBSD is modern enough to have the failed-to-reclaim
message, if the message reports "out of swap" as the reason
for a kill, the message is somewhat of a misnomer, in that
kernel data structures for managing the swap areas ran out of
space (internal fragmentation?), not the swap media.

Note: My references to "stays running" presume leaving the kernel
configured to allow process kernel stacks to be swapped out when
a process has not stayed runnable. FreeBSD does not do such
swap outs for processes that are runnable at the time.
Comment 7 Michał Skalski 2022-08-08 15:35:30 UTC
OK, thank you very much for all the help.

The problem was the missing `--with-printf-hooks=builtin` option for the configure script.

To be sure, I ran longer tests (lasting 1.5 weeks) that put more stress on strongswan's `charon` daemon.

I configured a few tunnels with short lifetimes and ran stress tests like the one below:

     sh -c 'while :; do swanctl -l >/dev/null && swanctl -L >/dev/null && swanctl -x >/dev/null || sleep 10; done'

which lists all SAs, all configured tunnels and all certificates in an endless loop.

The results show that the memory (RSS) footprint increases over time, but when memory is needed, strongswan (or the system) drops the unused memory.
And indeed, during the first few minutes the memory increase is quite large, which made me think the `printf-hook=builtin` option did not work at all.
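
One way to check for this pattern in a log is to compare the peak RSS with the last sample: a last value well below the peak means memory was given back. A sketch, assuming a hypothetical log with one `epoch rss_kb` pair per line (extra columns are ignored); this is not necessarily the format of the attached log, and `rss_summary` is a made-up name:

```shell
#!/bin/sh
# rss_summary [FILE...]: summarise "epoch rss_kb" samples read from the
# given files or from stdin. Hypothetical helper, for illustration only.
rss_summary() {
    awk 'NF >= 2 {
             v = $2 + 0                     # force numeric comparison
             if (n == 0) { first = v; peak = v }
             if (v > peak) peak = v
             last = v
             n++
         }
         END {
             if (n == 0) exit 1             # no usable samples
             printf "samples=%d first=%d peak=%d last=%d\n", n, first, peak, last
         }' "$@"
}

# Example (not run here): rss_summary my_samples.log
```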

I'm attaching the new shell script used for memory tracking, which uses only the `ps` command (plus `printf` and `date`) and doesn't need Python, along with the logs from this test and a graph generated by gnuplot (gnuplot script below).


    #!/usr/bin/env gnuplot

    set style line 1 linecolor rgb '#0060ad' linetype 1 linewidth 2 pointtype 7 pointsize 0.5

    set xdata time
    set xlabel '[Time]'
    set ylabel '[MB]'

    plot 'charon_mem.log' using ($1):($5/1024) with linespoints linestyle 1 title columnhead(4)



So the issue may be closed.
Comment 8 Michał Skalski 2022-08-08 15:37:21 UTC
Created attachment 235776 [details]
Script for tracing memory footprint of process given by pid or its name
Comment 9 Michał Skalski 2022-08-08 15:39:21 UTC
Created attachment 235777 [details]
gzipped charon's memory footprint log
Comment 10 Michał Skalski 2022-08-08 15:40:12 UTC
Created attachment 235778 [details]
Charon's memory footprint chart
Comment 11 Jose Luis Duran 2023-01-13 06:42:00 UTC
Created attachment 239436 [details]
security/strongswan: Avoid a memory leak

As stated in https://docs.strongswan.org/docs/5.9/os/freebsd.html:

> While FreeBSD's C library implements the GNU extensions for custom
> printf() conversion specifiers, the implementation seems to leak memory,
> so using --with-printf-hooks=builtin is recommended.

PR:     262743
Comment 12 Jose Luis Duran 2023-01-13 06:42:51 UTC
In the meantime, add the suggested workaround to the port?
Comment 13 Michał Skalski 2023-01-13 10:07:23 UTC
(In reply to Jose Luis Duran from comment #12)

Well, the strongswan port already has (its own) option to specify which printf-hooks to use, and the default is `builtin`, i.e. the correct one.

Maybe a note should be added to this option's help or to the port README describing the possibility of memory leaks when the `glibc` hook is chosen.
Comment 15 Jose Luis Duran 2023-01-13 11:07:50 UTC
(In reply to Michał Skalski from comment #14)
I see, sorry about the noise. I'll update bug #268918.

Thank you!
Comment 16 Jose Luis Duran 2023-01-13 11:12:18 UTC
Comment on attachment 239436 [details]
security/strongswan: Avoid a memory leak

Please ignore this file.
Comment 17 commit-hook freebsd_committer freebsd_triage 2023-01-14 17:19:02 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=942865477682b3d274c73d78e6a5e9b5591268df

commit 942865477682b3d274c73d78e6a5e9b5591268df
Author:     Jose Luis Duran <jlduran@gmail.com>
AuthorDate: 2023-01-13 09:31:24 +0000
Commit:     Fernando Apesteguía <fernape@FreeBSD.org>
CommitDate: 2023-01-14 17:13:48 +0000

    security/strongswan: Update to 5.9.9

    ChangeLog: https://github.com/strongswan/strongswan/releases/tag/5.9.9

    PR:             268918 262743
    Reported by:    jlduran@gmail.com
    Approved by:    strongswan@Nanoteq.com (maintainer)

 security/strongswan/Makefile | 6 +++---
 security/strongswan/distinfo | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)