Created attachment 232660 [details]
Dump of statistics of jemalloc library at charon daemon exit

On FreeBSD (amd64, arm64), communicating with the charon daemon over the vici socket leaks memory: the virtual and resident memory (VMS and RSS) of the process grow constantly until all system memory is exhausted, and the kernel then kills charon with the message:

    kernel: pid 903 (charon), jid 0, uid 0, was killed: failed to reclaim memory

Memory leak detection tools (valgrind, ktrace) do not report any leaks; the increasing RSS is the only symptom. The same behaviour was observed on FreeBSD 12.1, 12.2 and 9.3 (the latter is the last release before the jemalloc library was incorporated into FreeBSD's libc). When running the charon daemon on Linux (tested on Ubuntu 20.04 and Debian 10 bookworm/sid) the problem does not occur.

I think this behaviour is caused by the frequent memory allocation and deallocation (malloc/free) in the vici plugin. I observed that the increase can also be caused by SA renegotiations, but that is harder to isolate. There is no special malloc configuration for the charon daemon, and other applications on the FreeBSD box are not affected, e.g. some running Python daemons (which I believe allocate heavily and use multiple threads). I wonder what is specific about the way strongswan allocates memory that makes the process RSS increase so much?

To reproduce:
=============
1. Download any VM image with FreeBSD 12.0+ (also tested on the latest amd64 13.1-BETA2 to confirm). Configure the virtual machine; give it more memory for compiling strongswan, but 256 MB is enough for the test.
2. Run the VM and disable swap (to speed up the failure):
   # swapoff /dev/gpt/swapfs
3. Install the packages required for strongswan compilation:
   # pkg install git autoconf gperf autoconf-archive libtool m4 automake flex bison pkgconf gettext
4. Get strongswan:
   git clone https://github.com/strongswan/strongswan
5. Compile strongswan (a git checkout needs ./autogen.sh to generate the configure script first):
   cd strongswan
   ./autogen.sh
   ./configure --disable-kernel-netlink --enable-kernel-pfroute --enable-kernel-pfkey --disable-gmp --enable-openssl --enable-mediation --disable-scripts --with-group=wheel --enable-gcm --enable-ccm --enable-pkcs11
   make -j4
   make install
6. Start strongswan:
   ipsec start
7. Run in a loop any command that communicates over the vici interface; swanctl --stats is enough to reproduce the error:
   sh -c 'while swanctl --stats >/dev/null; do true; done'
8. Observe the increase of the virtual and resident memory (VMS and RSS) of the charon process, using e.g. top.
9. After a few hours charon should be killed by the kernel for lack of memory/swap space.

Additional info
===============
The problem first occurred while monitoring the state of the charon daemon (tunnel definitions, SAs, etc.) via the vici socket, but it was also reproduced with a simple swanctl --stats command repeated in a loop.

No change in this behaviour is observed with different values of configure's --with-printf-hooks= option. According to this pfsense issue, https://redmine.pfsense.org/issues/5149, printf hooks could be the cause, but tests with --with-printf-hooks=builtin, --with-printf-hooks=glibc and --with-printf-hooks=vstr did not fix the error.

I ran some tests with various `jemalloc` settings and am attaching the results, but I don't know how to interpret them. They were gathered using the following command:

    sh -c "MALLOC_CONF='stats_print:true,narenas:1' /usr/local/libexec/ipsec/charon 2>/var/log/charon-memdump-0.log"
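For step 8 of the reproduction, a log of samples is easier to compare between runs than watching top. The following is only a minimal sketch, assuming POSIX sh and ps(1); the function names `format_sample` and `sample_mem`, the three-column output layout, and the file name `charon_rss.log` in the usage comment are illustrative and are not part of strongswan or of any attachment in this report.

```sh
#!/bin/sh
# format_sample TIMESTAMP PS_OUTPUT
# Turn the padded "rss vsz" columns printed by ps into one log line:
#   <timestamp> <RSS in KB> <VSZ in KB>
format_sample() {
    ts=$1
    # Unquoted on purpose: word splitting strips the column padding.
    set -- $2
    [ $# -eq 2 ] || return 1
    printf '%s %s %s\n' "$ts" "$1" "$2"
}

# sample_mem PID
# Print one log line for PID; fail once the process is gone.
sample_mem() {
    # "rss=" and "vsz=" print the values (in KB) without a header line.
    out=$(ps -o rss= -o vsz= -p "$1") || return 1
    format_sample "$(date +%s)" "$out"
}

# Usage (hypothetical file name; the loop stops when charon exits or is killed):
#   pid=$(pgrep -x charon)
#   while sample_mem "$pid" >> charon_rss.log; do sleep 60; done
```

Keeping the parsing in `format_sample` separate from the `ps` call makes it possible to check the formatting without a running charon process.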
Created attachment 232661 [details]
jemalloc stats with dirty_decay_ms=0 setting

I also ran a test with the jemalloc dirty_decay_ms=0 setting, but this changed nothing. Attaching the log.
For reference: I also filed this bug on strongswan's GitHub issue tracker: https://github.com/strongswan/strongswan/issues/966
@Reporter Could you please:
- Provide the full `uname -a` output for the latest version the issue is reproducible on
- Test whether the issue is reproducible using the strongswan port/pkg
Do you have something like the following in, say, /boot/loader.conf:

#
# Delay when persistent low free RAM leads to
# Out Of Memory killing of processes:
vm.pageout_oom_seq=120

Or are you using the default value, 12?
Thank you very much for the clues.

(In reply to Mark Millard from comment #4)
Yes, I am using the default value of 12 (I used a stock qemu VM image):

# sysctl vm.pageout_oom_seq
vm.pageout_oom_seq: 12

(In reply to Kubilay Kocak from comment #3)
Result of `uname -a` (VM image FreeBSD-13.1-RELEASE):

FreeBSD freebsd 13.1-RELEASE FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212 GENERIC amd64

and the same for:

FreeBSD freebsd 13.1-RELEASE FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212 GENERIC arm64

Indeed, the packaged version (strongswan-5.9.6_2, whether installed from the binary package or built from ports) behaves differently. Memory usage still increases, but only RSS (VMS does not change over the observed period) and on a much smaller scale; the rise looks logarithmic rather than linear.

I then compared the configure options of the strongswan port and, among other differences, the port specifies `--with-printf-hooks=builtin` by default. So I tested the strongswan 5.9.6 sources again (the same version as used for the port), and when strongswan is configured with:

./configure --disable-kernel-netlink --enable-kernel-pfroute --enable-kernel-pfkey --disable-gmp --enable-openssl \
    --enable-mediation --disable-scripts --with-group=wheel --enable-gcm --enable-ccm --enable-pkcs11 \
    --with-printf-hooks=builtin

(only `--with-printf-hooks=builtin` is added) the memory usage is similar to the packaged version, even for the unpatched original sources. Without this option (or when the port is compiled with the `libc` printf hook, which I believe is the default one) memory usage rises quickly, as described in this bug report.

I must have missed that in my previous tests: I didn't notice that VMS does not rise and that RSS rises on a much smaller scale. I also checked `printf-hooks=vstr`: the memory usage increase is slightly bigger than for `builtin`, but VMS is still constant. I'm still not sure whether memory rises too much with `printf-hooks=builtin`; I will check it.
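Since the whole difference came down to one configure option, it can help to confirm what a given build tree was actually configured with before starting a long test. This is a minimal sketch, assuming the tree was configured with a reasonably recent Autoconf whose generated `config.status` supports `--config`; the helper name `uses_builtin_hooks` is hypothetical.

```sh
#!/bin/sh
# uses_builtin_hooks ARGS
# Succeed if the configure argument string ARGS selects strongswan's
# built-in printf hook implementation.
uses_builtin_hooks() {
    case "$1" in
        *--with-printf-hooks=builtin*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage, from the top of an already configured source tree
# (config.status --config prints the original configure arguments):
#   if uses_builtin_hooks "$(./config.status --config)"; then
#       echo "printf hooks: builtin"
#   else
#       echo "printf hooks: not builtin"
#   fi
```

The check only looks at the recorded configure arguments; it does not inspect an installed binary.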
(In reply to Michał Skalski from comment #5)

Using something like vm.pageout_oom_seq=120 should delay any kills for failing to reclaim enough memory to reach FreeBSD's target figure for free RAM. This can give extra time to inspect and investigate evidence about the ongoing memory/RAM usage.

Note: An increased vm.pageout_oom_seq is useful for avoiding failed-to-reclaim kills only for bounded-duration "stays running" activities. (It can allow buildworld buildkernel -j4 on Small Board Computers with 4 cores and only 2 GiBytes of RAM, for example, when using the default tends to suffer failed-to-reclaim kills.)

Note: In sufficiently modern variants of FreeBSD the messages about kills were improved and no longer always report being out of available swap space as the reason for the kill. The messaging about reclaim failures is an example of the improved messaging. Reclaim failures can happen even with swap space configured but little or none of it put to use. All it takes is one process (or more) that stays runnable while keeping nearly all the RAM pages in the active state (so: unable to be reclaimed).

Even now, on a FreeBSD modern enough to have the failed-to-reclaim message, a message that reports "out of swap" as the reason for a kill is somewhat of a misnomer: the kernel data structures for managing the swap areas ran out of space (internal fragmentation?), not the swap media.

Note: My references to "stays running" presume leaving the kernel configured to allow process kernel stacks to be swapped out when a process has not stayed runnable. FreeBSD does not do such swap outs for processes that are runnable at the time.
OK, thank you very much for all the help. The problem was the missing `--with-printf-hooks=builtin` option for the configure script.

To be sure, I ran longer tests (lasting 1.5 weeks) that stressed strongswan's `charon` daemon more heavily. I configured a few tunnels with short lifetimes and ran stress tests like the one below:

sh -c 'while :; do swanctl -l >/dev/null && swanctl -L >/dev/null && swanctl -x >/dev/null || sleep 10; done'

which lists all SAs, all configured tunnels and all certificates in an endless loop.

The results show that the memory (RSS) footprint increases somewhat over time, but when memory is needed strongswan (or the system) drops unused memory. Indeed, in the first few minutes the memory increase is quite large, which made me think the `printf-hook=builtin` option did not work at all.

I am attaching a new shell script for memory tracking that uses only the `ps` command (plus `printf` and `date`) and doesn't need python, along with the logs from this test and a graph generated by gnuplot (the gnuplot script is below).

#!/usr/bin/env gnuplot
set style line 1 linecolor rgb '#0060ad' linetype 1 linewidth 2 pointtype 7 pointsize 0.5
set xdata time
set xlabel '[Time]'
set ylabel '[MB]'
plot 'charon_mem.log' using ($1):($5/1024) with linespoints linestyle 1 title columnhead(4)

So the issue may be closed.
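For a quick numeric check without plotting, the same log can be summarized on the command line. The gnuplot script above reads a header row, takes column 1 as time, and divides column 5 by 1024 for the MB axis; the sketch below assumes that same layout (column 5 in KB, one header line), which is an inference from the plot command and not a documented format. The function name `summarize_mem` is hypothetical.

```sh
#!/bin/sh
# summarize_mem FILE [COLUMN]
# Print the first, last and peak value (in MB) of a KB column,
# skipping the header line. COLUMN defaults to 5, as in the plot command.
summarize_mem() {
    LC_ALL=C awk -v col="${2:-5}" '
        NR == 1      { next }          # header row
        NF < col + 0 { next }          # short or empty line
        {
            v = $col / 1024            # KB -> MB
            if (n == 0) first = v
            if (n == 0 || v > peak) peak = v
            last = v
            n++
        }
        END {
            if (n == 0) exit 1         # no samples found
            printf "first=%.1f last=%.1f peak=%.1f MB (%d samples)\n", first, last, peak, n
        }' "$1"
}

# Usage:
#   summarize_mem charon_mem.log
```

A last value well below the peak is consistent with the observation that memory is released again when it is needed; a last value equal to the peak after a long run would point back towards unbounded growth.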
Created attachment 235776 [details]
Script for tracing the memory footprint of a process given by pid or name
Created attachment 235777 [details] gzipped charon's memory footprint log
Created attachment 235778 [details] Charon's memory footprint chart
Created attachment 239436 [details]
security/strongswan: Avoid a memory leak

As stated in https://docs.strongswan.org/docs/5.9/os/freebsd.html:

> While FreeBSD's C library implements the GNU extensions for custom
> printf() conversion specifiers, the implementation seems to leak memory,
> so using --with-printf-hooks=builtin is recommended.

PR: 262743
In the meantime, add the suggested workaround to the port?
(In reply to Jose Luis Duran from comment #12)
Well, the strongswan port already has its own option to specify which printf hooks to use, and the default is `builtin`, so the correct one. Maybe a note should be added to this option's help or to the port README describing the possibility of memory leaks when the `glibc` hook is chosen.
(In reply to Michał Skalski from comment #13) https://cgit.freebsd.org/ports/tree/security/strongswan/Makefile?id=c1b081145ff7f719c3867702e9d83718b674505d#n49
(In reply to Michał Skalski from comment #14) I see, sorry about the noise. I'll update bug #268918. Thank you!
Comment on attachment 239436 [details] security/strongswan: Avoid a memory leak Please ignore this file.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=942865477682b3d274c73d78e6a5e9b5591268df

commit 942865477682b3d274c73d78e6a5e9b5591268df
Author:     Jose Luis Duran <jlduran@gmail.com>
AuthorDate: 2023-01-13 09:31:24 +0000
Commit:     Fernando Apesteguía <fernape@FreeBSD.org>
CommitDate: 2023-01-14 17:13:48 +0000

    security/strongswan: Update to 5.9.9

    ChangeLog:      https://github.com/strongswan/strongswan/releases/tag/5.9.9
    PR:             268918 262743
    Reported by:    jlduran@gmail.com
    Approved by:    strongswan@Nanoteq.com (maintainer)

 security/strongswan/Makefile | 6 +++---
 security/strongswan/distinfo | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)