Summary: lang/rust: Reduce memory usage/pressure
Product: Ports & Packages
Component: Individual Port(s)
Severity: Affects Only Me
Reporter: Daniel Engberg <diizzy>
Assignee: FreeBSD Rust Team <rust>
CC: diizzy, marklmi26-fbsd, mikael, ygy
Description Daniel Engberg 2021-05-23 11:54:28 UTC
During some build stages rust/cargo eats a lot of memory using its default settings, making it more or less not viable for low/mid-range systems. Can we consider setting codegen-units to 1, or adding a toggle for it and also perhaps for parallel compiling? Ref: https://reviews.freebsd.org/D30099#677659
Comment 1 Tobias Kortkamp 2021-06-04 12:04:08 UTC
Daniel, I'm sorry, not trying to ignore you but unsure of expectations. Seems worthwhile but someone needs to run the builds, get some before and after numbers, research downsides if any, and then I guess we can set codegen-units=1 if it looks ok, sure.

As for some of the comments from the review:

> [...] uses LLVM from their package tree instead of bundled, perhaps that's worth looking into?

lang/rust had an LLVM_PORT option once, but that only works if somebody feels responsible for supporting it and fixing any regressions that might happen. But nobody really did so we removed it.

> Unbundle libssh2?

It's bundled now I guess because we bundle libgit2 now too (because devel/libgit2 was not updated fast enough again). Since we update the toolchain every 6 weeks it is all probably not worth the hassle.
Comment 2 Daniel Engberg 2021-06-06 21:52:18 UTC
Hi, I mainly started to look into this because building Rust on my low-end server (specs below) failed despite having quite a bit of RAM and swap at its disposal. While this isn't an ideal way of logging, here's a graph of memory usage at the end of a Rust build (2 jobs) that succeeds. https://projects.pyret.net/files/public/freebsd/mem-usage-rustc.png

This box is an old Dell T20 with a dual core Intel Pentium G3220 CPU and 12 GB of RAM, running ZFS for Poudriere but not on rootfs. It runs 12.2-RELEASE-p6 and builds Rust in a 12.2 jail. It's lightly loaded and uses about 5-6 GB of RAM (incl. ZFS) without any Poudriere job running. I have no specific ZFS tuning set; however, from what I can tell the ZFS cache seems to grow quite a bit while compiling Rust. Setting codegen-units reduces memory usage by about 1-1.5 GB from what I can tell, but memory usage is still quite high.

I also gave this a go on my RockPro64 (arm64) (4 GB of RAM) running 13-STABLE (stable/13-n245283-70a2e9a3d44), UFS only, and while it took 14+ hours (-j1) it did finish. During compiling it used about 2 GB (the job, not the complete system) tops, which is a lot less than what I'm seeing on my server. I'll give this a go on another box running 13-STABLE (amd64) and see if that also consumes a lot of memory.

Thanks for replying about LLVM and libssh2, if it's too much of a hassle I understand the decision :-)
Comment 3 Daniel Engberg 2021-06-07 01:08:15 UTC
Hmm... compiling & optimizing seems to use a bit more memory, I did see a few processes use more than 2.5Gb of memory. Wired memory is a lot more though, ~4.8G and peaked at 6.7G so I guess that's due to ZFS?
Comment 4 commit-hook 2021-06-14 20:52:02 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=294f0c5c206d70e24b6bbc28766d962dc82f8b61

commit 294f0c5c206d70e24b6bbc28766d962dc82f8b61
Author: Tobias Kortkamp <tobik@FreeBSD.org>
AuthorDate: 2021-06-14 18:50:33 +0000
Commit: Tobias Kortkamp <tobik@FreeBSD.org>
CommitDate: 2021-06-14 20:51:11 +0000

lang/rust-nightly: Try to reduce memory usage/pressure

Try to reduce memory usage/pressure by only using one code generation unit.

"This flag [codegen-units] controls how many code generation units the crate is split into. It takes an integer greater than 0. When a crate is split into multiple codegen units, LLVM is able to process them in parallel. Increasing parallelism may speed up compile times, but may also produce slower code. Setting this to 1 may improve the performance of generated code, but may be slower to compile."

https://doc.rust-lang.org/rustc/codegen-options/index.html#codegen-units

PR: 256099
Suggested by: Daniel Engberg

lang/rust/Makefile | 3 +++
1 file changed, 3 insertions(+)
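For readers unfamiliar with the flag quoted in the commit message, here is a minimal sketch of how codegen-units is commonly set when building an ordinary Cargo project. This is an assumption-laden illustration only: it is not the change made to lang/rust/Makefile (that diff is not shown in this thread), and `main.rs` is a placeholder file name.

```sh
# Sketch for an ordinary Cargo project, NOT for the lang/rust port itself.
# Cargo reads profile settings from CARGO_PROFILE_<NAME>_<KEY> environment
# variables; this one sets codegen-units for the "release" profile.
export CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1

# With the variable exported, a normal release build picks it up:
#   cargo build --release

# Alternative when invoking rustc directly (main.rs is a placeholder):
#   rustc -C codegen-units=1 main.rs
```

As the quoted documentation says, a value of 1 trades compile-time parallelism for potentially better generated code, so build times are expected to go up.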
Comment 5 Guangyuan Yang 2021-06-27 09:27:01 UTC
Just wanted to report here that building Rust always gets OOM'ed after ~6 hours on my low-end build box, 10 times in a row. Spec:

- Intel i5-6500T (4) @ 2.496GHz
- 16GB RAM, 2GB swap
- FreeBSD 13.0-RELEASE amd64
- ZFS on root

It does finish on my VPS, which has a similar spec but more swap; however, the memory/swap usage is very high. Spec:

- Intel Xeon Platinum 8171M (4) @ 2.095GHz
- 16GB RAM, 32GB swap
- FreeBSD 13.0-RELEASE amd64
- UFS on root, with ZFS enabled on data disks

I haven't looked at it closely, so I will report back if I notice anything. Thanks!
Comment 6 Guangyuan Yang 2021-06-30 01:33:57 UTC
Cannot get lang/rust-nightly to build on my build box with 16G memory with the code generation unit change. The CPU usage is down to one core, but the RAM pressure is still very high, and the whole process ended up getting OOM'ed.
Comment 7 commit-hook 2021-09-19 09:16:18 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=6f1fefb50e755d727f471aeb75ebe4e28f876b4b

commit 6f1fefb50e755d727f471aeb75ebe4e28f876b4b
Author: Tobias Kortkamp <tobik@FreeBSD.org>
AuthorDate: 2021-09-07 08:14:14 +0000
Commit: Tobias Kortkamp <tobik@FreeBSD.org>
CommitDate: 2021-09-19 09:03:21 +0000

lang/rust: Update to 1.55.0

- Set codegen-units=1
- Add hack to skip cargo update on git sources as a step towards solving
- Fix 'capacity overflow' panics on armv*

Changes: https://blog.rust-lang.org/2021-09-09/Rust-1.55.0.html
PR: 258337
PR: 256099
PR: 256581
PR: 257419
Reviewed by: mikael, pkubaj
Exp-run by: antoine
Differential Revision: https://reviews.freebsd.org/D31872
With hat: rust

Mk/Uses/cargo.mk | 2 +-
Mk/bsd.gecko.mk | 2 +-
lang/rust-bootstrap/Makefile | 8 +-
lang/rust-bootstrap/distinfo | 6 +-
lang/rust/Makefile | 12 +--
lang/rust/distinfo | 114 ++++++++++-----------
...m-project_compiler-rt_lib_builtins_cpu__model.c | 21 ++--
...ols_cargo_src_cargo_sources_git_source.rs (new) | 45 ++++++++
...rc_tools_cargo_src_cargo_util_toml_mod.rs (new) | 22 ++++
.../patch-vendor_openssl-sys_build_main.rs (gone) | 19 ----
..._src_unix_bsd_freebsdlike_freebsd_mod.rs (gone) | 12 ---
..._unix_bsd_freebsdlike_freebsd_powerpc.rs (gone) | 50 ---------
.../powerpc64-elfv1/patch-src_bootstrap_native.rs | 10 +-
...h-compiler_rustc__target_src_spec_mod.rs (gone) | 10 --
...rc_spec_powerpc64le__unknown__freebsd.rs (gone) | 19 ----
15 files changed, 154 insertions(+), 198 deletions(-)
Comment 8 Mark Millard 2021-09-23 19:24:07 UTC
(In reply to commit-hook from comment #7)

An FYI for systems with more resources . . .

Prior to this change, during a from-scratch bulk -a using ALLOW_PARALLEL_JOBS= :

[05:52:06]  [04:29:11] Finished lang/rust | rust-1.54.0_2: Success

On the same machine after the change (again from-scratch using ALLOW_PARALLEL_JOBS= ):

[12:39:47]  [11:20:24] Finished lang/rust | rust-1.55.0: Success

So about 2.5 times longer (about 4.5 hrs -> 11.3 hrs).

For reference: HoneyComb (16 Cortex-A72's) with 64 GiBytes of RAM, root on ZFS, Optane 480 media. Large swap on USB3 SSD media, but top indicated it was unused during both bulk -a builds.

This test does not control just what the other 15 builders were doing in the overlapping time frames in each bulk -a, but all the other builders were busy with a sequence of builds over that time. The load averages were well over 16, but I do not have a record of such over time for either bulk -a .

I've another bulk -a going on that machine and it may be about a week before it finishes. (The 11:20:24 figure is from that ongoing bulk -a .)
Comment 9 Mark Millard 2021-09-23 19:36:00 UTC
(In reply to Mark Millard from comment #8)

I forgot to list that:

USE_TMPFS="data"

was in use.

I've built rust by itself with USE_TMPFS=yes (so "wrkdir data") in the past and the tmpfs use grew to around 17 GiBytes. Luckily I had swap configured that was sufficient for the machine that was done on at the time.

Having USE_TMPFS allow significant tmpfs sizes for port builds that use huge amounts of disk space basically requires an environment with sufficient resources arranged up front.

The use of PCIe Optane media helps avoid I/O being as much of an issue as it could be with, say, spinning media.
Comment 10 Mark Millard 2021-09-23 19:55:59 UTC
(In reply to Mark Millard from comment #9)

I found my old note about the tmpfs use for USE_TMPFS=yes for lang/rust :

# df -m | grep tmpfs
Filesystem 1M-blocks Used Avail Capacity Mounted on
. . .
tmpfs 301422 17859 283563 6% /usr/local/poudriere/data/.m/FBSDFSSDjail-default/01/wrkdirs
. . .

So the 17 GiBytes was only the "wrkdirs" contribution.
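As a small illustration of how the 17 GiByte figure follows from that df -m line, here is a sketch that converts the "Used" column (in 1M-blocks) of tmpfs entries to GiBytes. The helper name `tmpfs_used_gib` is made up for this example; the input line is the one quoted in the comment.

```sh
# Print the "Used" column (3rd field of `df -m` output) of every tmpfs
# line, converted from MiBytes to GiBytes with one decimal place.
# tmpfs_used_gib is a hypothetical helper name for this sketch.
tmpfs_used_gib() {
    awk '$1 == "tmpfs" { printf "%.1f\n", $3 / 1024 }'
}

# The line recorded in the comment above:
echo 'tmpfs 301422 17859 283563 6% /usr/local/poudriere/data/.m/FBSDFSSDjail-default/01/wrkdirs' \
    | tmpfs_used_gib        # prints 17.4
```

On a live builder the same helper could be fed with `df -m | tmpfs_used_gib` while a build is running.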
Comment 11 Mikael Urankar 2021-09-24 09:22:20 UTC
(In reply to Mark Millard from comment #8)

I have a similar result on my amd64 box:

rust 1.55.0 with codegen-units=1 build time: 00:39:59
rust 1.55.0 without codegen-units=1 build time: 00:23:15
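To put the two sets of timings side by side, here is a rough sketch that converts the HH:MM:SS durations reported in comment 8 and comment 11 to seconds and prints the slowdown factor. The function names are invented for this example; only the durations come from the thread.

```sh
# Convert an HH:MM:SS duration (as printed by poudriere) to seconds.
# awk treats the fields as decimal numbers, so "08" or "09" are safe.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print how many times longer the second build took than the first.
slowdown() {
    before=$(to_seconds "$1")
    after=$(to_seconds "$2")
    awk -v b="$before" -v a="$after" 'BEGIN { printf "%.2f\n", a / b }'
}

slowdown 04:29:11 11:20:24   # comment 8, aarch64 HoneyComb: prints 2.53
slowdown 00:23:15 00:39:59   # comment 11, amd64:            prints 1.72
```

Note that the comment 8 pair also spans a version bump (1.54.0_2 vs. 1.55.0) and uncontrolled load from other builders, so the two factors are not strictly comparable.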
Comment 12 Mark Millard 2021-09-24 18:17:40 UTC
(In reply to Daniel Engberg from comment #0) What USE_TMPFS (or analogous) was in use?
Comment 13 Mark Millard 2021-09-24 18:18:38 UTC
(In reply to Guangyuan Yang from comment #5) What USE_TMPFS (or analogous) setting was in use?
Comment 14 Mark Millard 2021-09-24 19:08:43 UTC
(In reply to Guangyuan Yang from comment #5)

Unfortunately, messages such as:

pid . . . (. . .), jid . . ., uid . . ., was killed: out of swap space

can be a misnomer for the "out of swap space" part: it can be reported even when none of the swap space had been in use. There are other possible reasons why kills happen. One point is that FreeBSD will not swap out a process that stays runnable, even if its active memory use keeps the free RAM minimal; it just continues to page in and out.

If it really was out of swap space there would also be messages like:

swap_pager_getswapspace(. . .): failed

or:

swap_pager: out of swap space

Other causes for the kills include:

Sustained low free RAM (via stays-runnable processes).
A sufficiently delayed pageout.
The swap blk uma zone was exhausted.
The swap pctrie uma zone was exhausted.

The first two of those have some tunables that you might want to try:

# Delay when persistent low free RAM leads to
# Out Of Memory killing of processes:
vm.pageout_oom_seq=120

# For plenty of swap/paging space (will not
# run out), avoid pageout delays leading to
# Out Of Memory killing of processes:
vm.pfault_oom_attempts=-1

# For possibly insufficient swap/paging space
# (might run out), increase the pageout delay
# that leads to Out Of Memory killing of
# processes (showing defaults at the time):
#vm.pfault_oom_attempts= 3
#vm.pfault_oom_wait= 10
# (The multiplication is the total but there
# are other potential tradeoffs in the factors
# multiplied, even for nearly the same total.)

I'll note that vm.pageout_oom_seq has a default of 12 but can be much larger than 120, such as 1024 or 10240 or even more. Larger figures increase the time before kills start happening because of sustained low free RAM. But no setting is designed to disable the kills from eventually happening on some scale.
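The "multiplication is the total" remark can be made concrete with a short sketch. It assumes, as the comment implies, that vm.pfault_oom_wait is a per-attempt wait in seconds and that a negative vm.pfault_oom_attempts means the page-fault path never gives up; `pfault_total_wait` is a name invented for this example.

```sh
# Total time the page-fault path waits before OOM handling kicks in:
# attempts (vm.pfault_oom_attempts) times wait (vm.pfault_oom_wait, in
# seconds). A negative attempts value is treated here as "no limit",
# which is an assumption based on the -1 setting described above.
pfault_total_wait() {
    attempts=$1
    wait_s=$2
    if [ "$attempts" -lt 0 ]; then
        echo "no limit"
    else
        echo $((attempts * wait_s))
    fi
}

pfault_total_wait 3 10     # defaults quoted above: prints 30
pfault_total_wait -1 10    # prints "no limit"

# Trying the suggested values on a running system (as root):
#   sysctl vm.pageout_oom_seq=120
#   sysctl vm.pfault_oom_attempts=-1
# To keep them across reboots, the same name=value lines can go in
# /boot/loader.conf, which is where comment 15 reports setting them.
```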
Comment 15 Mark Millard 2021-09-25 01:27:53 UTC
(In reply to Guangyuan Yang from comment #5)

The following is based on (in part):

USE_TMPFS="data"
ALLOW_PARALLEL_JOBS=

for building rust-1.54.0_2 (so: before the codegen-units change). It is a root-on-ZFS context. Also in use was /boot/loader.conf having:

vm.pageout_oom_seq=120
vm.pfault_oom_attempts=-1

I'll report figures based on my local top patches that record and report various "Maximum Observed" figures (MaxObs???? naming).

poudriere output:

. . .
[00:00:23] Building 1 packages using 1 builders
[00:00:23] Starting/Cloning builders
[00:00:27] Hit CTRL+t at any time to see build progress and stats
[00:00:27]  [00:00:00] Building lang/rust | rust-1.54.0_2
[05:10:56]  [05:10:29] Finished lang/rust | rust-1.54.0_2: Success
[05:11:35] Stopping 1 builders
. . .

Where the top output reported:

. . .; load averages: . . . MaxObs: 5.83, 5.09, 4.93 . . .
. . . threads: . . . 21 MaxObsRunning
. . . Mem: . . . 2285Mi MaxObsActive . . .
. . . Swap: 14336Mi Total, 14336Mi Free

(The "Swap:" line did not report any positive amount used.)

No console messages at all.

In other words: it never got near starting to use the swap partition that was active.

For reference . . .

System: MACCHIATObin Double Shot (4 Cortex-A72's) with 16 GiBytes RAM. (So an aarch64 context.) Root-on-ZFS with no special tuning. main [So: 14]. 14336 MiByte swap partition active. The boot media is a portable USB3 SSD.

# uname -apKU
FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #12 main-n249019-0637070b5bca-dirty: Tue Aug 31 02:24:20 PDT 2021 root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400032 1400032

But:

# poudriere jail -j13_0R-CA72 -i
Jail name: 13_0R-CA72
Jail version: 13.0-RELEASE-p4
Jail arch: arm64.aarch64
Jail method: null
Jail mount: /usr/obj/DESTDIRs/13_0R-CA72-poud
Jail fs:
Jail updated: 2021-09-06 19:07:54
Jail pkgbase: disabled

And:

# cd /usr/ports
# ~/fbsd-based-on-what-commit.sh
branch: main
merge-base: b0c4eaac2a3aa9bc422c21b9d398e4dbfea18736
merge-base: CommitDate: 2021-09-07 21:55:24 +0000
b0c4eaac2a3a (HEAD -> main, freebsd/main, freebsd/HEAD) security/suricata: Add patch for upstream locking fix
n557269 (--first-parent --count for merge-base)
Comment 16 Mark Millard 2021-09-25 01:47:58 UTC
(In reply to Tobias Kortkamp from comment #1)

Based on comment #15 I expect that codegen-units was misidentified as the cause of the memory usage/pressure. I expect that the driving issue was instead USE_TMPFS including wrkdir, which for lang/rust can take 17 GiByte+. USE_TMPFS="data" (avoiding wrkdir) is the primary thing that deals with the memory use/pressure from what I can tell. (USE_TMPFS=yes is equivalent to "wrkdir data".)

Based on comment #8 and comment #11 I believe the change has negative consequences for various contexts, in part because the OPTIONS give no control over it. (The default should track what FreeBSD wants for the official package builders for the tradeoff of more time vs. better code generation. It is possible that would be the new setting. Such is not for me to say. But . . .)

Given that USE_TMPFS="data" is what makes the big difference for memory use/pressure, I'd suggest reverting the change made for this bugzilla submittal until an OPTION controls the codegen-units setting (the rust default vs. 1) and the default for that OPTION is set to what the long-term official package builds should be based on.
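For anyone wanting to try the suggestion, a minimal sketch of the relevant poudriere.conf fragment follows. The path assumes the default location mentioned in comment 17; the statements about what each value covers are taken from this thread, not checked against the poudriere documentation here.

```sh
# /usr/local/etc/poudriere.conf (fragment; the file uses sh variable syntax)

# Keep only poudriere's "data" in tmpfs. Port work directories (wrkdir)
# stay on disk, so the 17 GiByte+ lang/rust work tree reported in this
# thread does not compete for RAM and swap.
USE_TMPFS="data"

# For comparison, per the thread these two are equivalent and also put
# the work directories in tmpfs:
#USE_TMPFS=yes
#USE_TMPFS="wrkdir data"
```

Comment 25 and comment 26 additionally use USE_TMPFS=no on a 2 GiByte machine, which keeps everything on disk.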
Comment 17 Mark Millard 2021-09-25 02:19:32 UTC
(In reply to Mark Millard from comment #15)

I forgot to mention that I have set larger timeout values in /usr/local/etc/poudriere.conf than the defaults. So my experiment would not show reaching a default timeout, not that I expect such would have occurred in that experiment.
Comment 18 Daniel Engberg 2021-09-25 05:08:33 UTC
When I did some testing it did help because files were better optimized; however, it uses a single thread, just like when you use lto vs thinlto. The behaviour is also documented in Rust's documentation regarding this option.
Comment 19 Mark Millard 2021-09-25 06:47:18 UTC
(In reply to Daniel Engberg from comment #18)

If that help was with memory use/memory pressure, I'd not expect it to be as big of a difference as "wrkdir data" vs. just "data" for USE_TMPFS: "data" uses vastly less memory than the 17 GiByte+ figure. How much of a difference did codegen-units=1 make in your context?

See comment 6 for someone reporting codegen-units=1 being insufficient in their context. (Many of my notes are tied to trying to help that person since they gave enough detail for me to have specific suggestions and experiments to try and my own experiments to report on.)

My hope is that the build-time/code-optimization tradeoff ends up under explicit control at some point. I do not expect general agreement about shorter lang/rust build times (default codegen-units) vs. accepting longer build times for more optimized code (codegen-units=1). I'd expect the default to be the choice made for the official package builders.
Comment 20 Mark Millard 2021-09-25 19:34:01 UTC
(In reply to Mark Millard from comment #15)

I've started a bulk lang/rust on a Rock64 (4 Cortex-A53's) with 4 GiByte of RAM and 14 GiByte of swap and root on UFS (no ZFS use). (I normally avoid ZFS on systems with less than 8 GiBytes of RAM.)

Again, it is based on (in part):

USE_TMPFS="data"
ALLOW_PARALLEL_JOBS=

for building rust-1.54.0_2 (so: before the codegen-units change). Unlike the comment #15 test, this is a root-on-UFS context. Also in use was /boot/loader.conf having:

vm.pageout_oom_seq=120
vm.pfault_oom_attempts=-1

Again I have larger than default poudriere timeout settings.

I'll report figures based on my local top patches that record and report various "Maximum Observed" figures (MaxObs???? naming).

I expect that it will complete without using any swap space. (But the Cortex-A53's will take a long time compared to the prior MACCHIATObin Double Shot experiment.) It is possible that I'll have to adjust some timeout(s) and retry: lang/rust will be the largest thing that I've built in such a context.

I will note that, with 4 GiByte of RAM, the system would complain about being mistuned for swap with even near 16 GiBytes of swap.
Comment 21 Mark Millard 2021-09-26 02:01:07 UTC
(In reply to Mark Millard from comment #20)

I've also started a lang/rust build on an Orange Pi+ 2E (4 Cortex-A7's, so armv7) with 2 GiBytes of RAM and 3 GiByte of swap. USB2 port, so slower I/O.

USE_TMPFS="data"
ALLOW_PARALLEL_JOBS=

and:

vm.pageout_oom_seq=120
vm.pfault_oom_attempts=-1

in use again, with larger than default poudriere timeouts.

This will likely use a non-trivial amount of swap, unlike the Rock64. (The Rock64 has used somewhat under 6 MiByte of swap early on. I've seen FreeBSD do such small usage when the need is non-obvious various times before.)

This will also likely take a very long time to complete and may well need bigger timeouts. (Bigger vm.pageout_oom_seq too?) But I expect that, with appropriate values set, the rust build will complete in this context. (I'm planning on adjusting timeouts to allow rust builds on these systems. So I've other reasons for the experiments but might as well report the results.)

Again rust-1.54.0_2 (before the codegen-units=1 change). 1.54 had some problems on armv7 but, as I remember, not in building: in later use. My prior armv7 build was on a Cortex-A72 (aarch64) targeting Cortex-A7 (armv7) via a jail that used -a arm.armv7 . (The Cortex-A72 can execute Cortex-A7 code.) But there was lots of RAM and cores for that, unlike this experiment.
Comment 22 Mark Millard 2021-09-26 06:29:49 UTC
(In reply to Mark Millard from comment #21)

The armv7 (Cortex-A7) test is stopped for now because poudriere's time reporting is messed up, such as:

[00:00:00] Creating the reference jail... done
. . .
[00:00:00] Balancing pool
[main-CA7-default] [2021-09-25_23h11m13s] [balancing_pool:] Queued: 70 Built: 0 Failed: 0 Skipped: 0 Ignored: 0 Fetched: 0 Tobuild: 70 Time: -258342:-3:-36
[00:00:00] Recording filesystem state for prepkg... done
. . .
Comment 23 Mark Millard 2021-09-26 20:06:04 UTC
(In reply to Mark Millard from comment #20)

For the Rock64 rust-1.54.0_2 build test with 4 GiBytes of RAM using USE_TMPFS="data" and ALLOW_PARALLEL_JOBS= and vm.pageout_oom_seq=120 and vm.pfault_oom_attempts=-1 but not using codegen-units=1 :

. . .
[00:01:22] Building 1 packages using 1 builders
[00:01:22] Starting/Cloning builders
[00:01:34] Hit CTRL+t at any time to see build progress and stats
[00:01:34]  [00:00:00] Building lang/rust | rust-1.54.0_2
[16:11:35]  [16:10:01] Finished lang/rust | rust-1.54.0_2: Success
[16:12:12] Stopping 1 builders

where:

last pid: . . . load averages: . . . MaxObs: 5.60, 5.01, 4.85 . . .
. . . threads: . . . 11 MaxObsRunning
. . .
Mem: . . . 2407Mi MaxObsActive, 995248Ki MaxObsWired, 3161Mi MaxObs(Act+Wir+Lndry)
Swap: 14336Mi Total, . . . 10712Ki MaxObsUsed, 2457Mi MaxObs(Act+Lndry+SwapUsed), 3171Mi MaxObs(Act+Wir+Lndry+SwapUsed)

So, somewhat under 10.5 MiBytes of swap used at some point (maximum observed by top). If no swap had been made active, it likely still would have finished just fine: no swap space (partition) required.

Reminder: This was a UFS context with USB3 SSD media, no ZFS use.
Comment 24 Mark Millard 2021-09-27 06:15:50 UTC
(In reply to Mark Millard from comment #22) I've started the 2 GiByte RAM armv7 test again, after patching poudriere-devel for the time reporting issue.
Comment 25 Mark Millard 2021-09-27 23:50:12 UTC
(In reply to Mark Millard from comment #21) For this armv7 test I should have listed that I was going to use: USE_TMPFS=no (instead of "data"). The test is still running.
Comment 26 Mark Millard 2021-09-28 07:49:43 UTC
(In reply to Mark Millard from comment #25)

For the Orange Pi+ 2E (armv7) rust-1.54.0_2 build test with 2 GiBytes of RAM using USE_TMPFS=no and ALLOW_PARALLEL_JOBS= and vm.pageout_oom_seq=120 and vm.pfault_oom_attempts=-1 but not using codegen-units=1 :

. . .
[00:02:32] Building 1 packages using 1 builders
[00:02:32] Starting/Cloning builders
[00:03:21] Hit CTRL+t at any time to see build progress and stats
[00:03:21]  [00:00:00] Building lang/rust | rust-1.54.0_2
[25:09:49]  [25:06:28] Finished lang/rust | rust-1.54.0_2: Success
[25:10:27] Stopping 1 builders
. . .

. . . load averages: . . . MaxObs: 5.50, 5.13, 4.88 . . .
. . . threads: . . . 11 MaxObsRunning
. . .
Mem: . . . 1559Mi MaxObsActive, 257660Ki MaxObsWired, 1837Mi MaxObs(Act+Wir+Lndry)
Swap: 3072Mi Total, . . . 320604Ki MaxObsUsed, 1898Mi MaxObs(Act+Lndry+SwapUsed), 2113Mi MaxObs(Act+Wir+Lndry+SwapUsed)

So: Well under 350 MiBytes of swap used for USE_TMPFS=no with 2 GiBytes of RAM. Swap space likely required, given its size vs. the 2 GiBytes. (USE_TMPFS="data" would have used more swap space.)

Reminder: This was a UFS context with USB3 SSD media, no ZFS use.
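The swap figures in comment 23 and comment 26 are reported by top in KiBytes while the summaries use MiBytes. A quick sketch of the conversion, using an invented helper name `kib_to_mib` and the two MaxObsUsed values from those comments:

```sh
# Convert a KiByte count to MiBytes with two decimals.
# kib_to_mib is a hypothetical helper name for this sketch.
kib_to_mib() {
    awk -v k="$1" 'BEGIN { printf "%.2f\n", k / 1024 }'
}

kib_to_mib 10712     # Rock64, comment 23:        prints 10.46
kib_to_mib 320604    # Orange Pi+ 2E, comment 26: prints 313.09
```

Both values agree with the prose summaries: somewhat under 10.5 MiBytes on the 4 GiByte Rock64 and well under 350 MiBytes on the 2 GiByte Orange Pi+ 2E.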