Hello, context is:

stable/13-n247002-bd0ad8209d5
poudriere-devel-3.3.99.20210907_1
amd64

If one runs a build for a single port and later runs a bulk -f build, the ports built as part of the single-port build will be deleted first, like this:

[00:00:12] Deleting ImageMagick7-7.0.11.12_3.pkg: no longer needed
[00:00:12] Deleting amavisd-new-2.12.0_1,1.pkg: no longer needed
[00:00:12] Deleting arc-5.21p.pkg: no longer needed
[00:00:12] Deleting arj-3.10.22_8.pkg: no longer needed
[00:00:12] Deleting atk-2.36.0.pkg: no longer needed

This is unexpected behaviour. Now I can't run a single build of a big package and then a smaller build of multiple packages with the same ports tree and expect all the packages to be there. I'm not sure if the issue applies to ports-mgmt/poudriere as well.

Some posts describing the problem:

https://lists.freebsd.org/archives/freebsd-ports/2021-October/000808.html
https://lists.freebsd.org/archives/freebsd-ports/2021-October/000811.html
https://lists.freebsd.org/archives/freebsd-ports/2021-October/000784.html
Why not specify a consistent list with -f? I suspect it is to work around aggressive rebuilds, which are becoming saner very soon by not doing needless rebuilds.

It's doing what you told it to. 'bulk' is intended to build a repository, not a package. If you only tell it about 1 package you are interested in, then it may *update a dependency, but not rebuild something you didn't mention that depends on that thing*, or create a totally conflicted repository that can't function at the client machine. The same issue arises when building 1 package and then giving a larger list: it may leave that 1 existing package in a broken, unusable state if a major dependency of it is updated and it does not get rebuilt. A basic requirement is that bulk should produce a working repository with working packages, in the sense that Poudriere does not itself create a bug that surfaces with something like `pkg install` or `pkg upgrade` or `pkg add`. Changing the -f list between runs removes that promise.

So it has 3 options:

1. Ignore the problem (old behavior).
2. Delete anything not mentioned, because why else does it exist if the user does not expect it in their repository? If -t is given then it does not do this, as it assumes a pre-commit testing repo rather than a published repo.
3. Rebuild packages *not listed* because a dependency is updated. This also can be unexpected.

This change was because people kept asking for an automatic pkgclean. I admit that I did not expect that people would be using bulk -f with a changing list, because a working repo cannot be guaranteed that way.
A workaround would be to set DELETE_UNQUEUED_PACKAGES=no but I do recommend using a consistent -f. I am open to changes here with more feedback. I don't see how https://lists.freebsd.org/archives/freebsd-ports/2021-October/000784.html relates.
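A minimal sketch of that workaround, assuming the usual config location (the jail name and list path below are illustrative, not from this report):

```
# /usr/local/etc/poudriere.conf:
# keep packages that are not in the current bulk queue
DELETE_UNQUEUED_PACKAGES=no
```

With that set, a later `poudriere bulk -j desktop -f /root/p.d/ports.lst` run should no longer delete packages produced by earlier single-port runs, though a single consistent -f list remains the safer approach for a published repository.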
I will just change the default to not do this.
Here's a better example:

[00:00:19] Deleting caja-1.24.1_1.pkg: new version: 1.26.0
[00:00:19] Deleting calibre-5.28.0.pkg: new version: 5.29.0
[00:00:19] Deleting calibre-5.28.0.txz: dead symlink
[00:00:19] Deleting calligra-3.2.1_28.pkg: no longer needed
[00:00:19] Deleting celt-0.11.3_3.pkg: no longer needed
[00:00:19] Deleting chromium-92.0.4515.159_2.pkg: no longer needed
[00:00:19] Deleting chromium-92.0.4515.159_2.txz: dead symlink
[00:00:19] Deleting electron12-12.0.9_2.pkg: no longer needed
[00:00:19] Deleting electron12-12.0.9_2.txz: dead symlink
[00:00:19] Deleting firefox-93.0_1,2.pkg: no longer needed
[00:00:19] Deleting flock-2.30.1.pkg: no longer needed
[00:00:20] Deleting go-1.17.1,1.pkg: new version: 1.17.2,1
[00:00:20] Deleting go-1.17.1,1.txz: dead symlink
[00:00:21] Deleting gstreamer1-plugins-vpx-1.16.2.pkg: new version: 1.16.2_1
[00:00:21] Deleting handbrake-1.4.2.pkg: new version: 1.4.2_1
[00:00:21] Deleting jackit-1.9.16.pkg: new version: 1.9.16_1
[00:00:23] Deleting kproperty-3.2.0_1.pkg: no longer needed
[00:00:23] Deleting kproperty-3.2.0_1.txz: dead symlink
[00:00:23] Deleting kreport-3.2.0_3.pkg: no longer needed
[00:00:23] Deleting libEMF-1.0.7_1.pkg: no longer needed

The "no longer needed" entries are the unexpected behaviour. I built firefox and chromium in the previous two single builds.
You were quick :D

If I can do the entire list with bulk -f, I'll do that. The reason I had to go to individual builds for big ports was that some of them didn't play nice in the list and failed to build; those same ports built fine when -f wasn't used.

Here's the poudriere part of /etc/weekly.local:

poudriere bulk -j desktop devel/llvm12
poudriere bulk -j desktop lang/rust
poudriere bulk -j desktop lang/gcc10
poudriere bulk -j desktop editors/libreoffice
poudriere bulk -j desktop www/qt5-webengine
poudriere bulk -j desktop www/firefox
poudriere bulk -j desktop devel/electron12
poudriere bulk -j desktop www/chromium
poudriere bulk -j desktop editors/calligra
poudriere bulk -j desktop -f /root/p.d/202104-21.ports.lst

I'd *prefer* to run poudriere with bulk -f.
On a related note, I just hit a case where I wished I could exclude a specific list of ports from being touched (because of new problems that new ports introduce). Anyway, I did flip the default here in Poudriere (local) git. I will probably push and update -devel tonight.
(In reply to tech-lists from comment #5)

On the rare occasions when I've done something analogous to your:

poudriere bulk -j desktop devel/llvm12
poudriere bulk -j desktop lang/rust
poudriere bulk -j desktop lang/gcc10
poudriere bulk -j desktop editors/libreoffice
poudriere bulk -j desktop www/qt5-webengine
poudriere bulk -j desktop www/firefox
poudriere bulk -j desktop devel/electron12
poudriere bulk -j desktop www/chromium
poudriere bulk -j desktop editors/calligra
poudriere bulk -j desktop -f /root/p.d/202104-21.ports.lst

I also listed all my "prebuild" ports in my equivalent of your /root/p.d/202104-21.ports.lst . This makes them and their prerequisites "needed" for the -f build too, but they are discovered to already be in place and not needing an update during the -f step. Basically it was a way to have a single builder at a time for the "prebuilds" in a resource-limited context, without being that limited on the builder count for the rest.
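A sketch of that approach with hypothetical file names: keep the individually-built "prebuild" ports in their own list, then merge it into the main list so one consistent file is handed to bulk -f.

```shell
# Hypothetical list files; real ones would live somewhere like /root/p.d/
cat > prebuilds.lst <<'EOF'
devel/llvm12
lang/rust
www/firefox
EOF
cat > main.lst <<'EOF'
www/firefox
editors/calligra
EOF
# Merge into the single list handed to bulk -f, dropping duplicate origins:
sort -u prebuilds.lst main.lst > combined.lst
cat combined.lst
# devel/llvm12, editors/calligra, lang/rust, www/firefox (each once)
```

The individual `poudriere bulk -j desktop <origin>` runs can still happen first; the later `-f combined.lst` run then treats those packages as needed rather than deleting them.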
(In reply to Bryan Drewery from comment #6)

Hi, did you update? In the meantime I've removed the individual build lines, put them all in the ports build list, and started a bulk -f. But the system eventually crashed and rebooted itself with the following:

Oct 14 10:40:53 desktop kernel: pid 83419 (python2.7), jid 208, uid 0, was killed: out of swap space
Oct 14 11:36:49 desktop syslogd: kernel boot file is /boot/kernel/kernel
Oct 14 11:36:49 desktop kernel: MCA: Bank 1, Status 0xff80000000000124
Oct 14 11:36:49 desktop kernel: MCA: Global Cap 0x0000000000000c09, Status 0x0000000000000005
Oct 14 11:36:49 desktop kernel: MCA: Vendor "GenuineIntel", ID 0x306c3, APIC ID 2
Oct 14 11:36:49 desktop kernel: MCA: CPU 2 UNCOR PCC OVER DCACHE L0 WR error
Oct 14 11:36:49 desktop kernel: MCA: Address 0x7ff9ab740
Oct 14 11:36:49 desktop kernel: MCA: Misc 0x86
Oct 14 11:36:49 desktop kernel: panic: Unrecoverable machine check exception
Oct 14 11:36:49 desktop kernel: cpuid = 2
Oct 14 11:36:49 desktop kernel: time = 1634207556
Oct 14 11:36:49 desktop kernel: Uptime: 3d21h48m43s
Oct 14 11:36:49 desktop kernel: Dumping 4603 out of 32638 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%---<<BOOT>>---

(There's 24GB of swap on SSD.) If I do these big ports individually, this doesn't happen.
(In reply to tech-lists from comment #8)

Unfortunately messages such as:

pid . . . (. . .), jid . . ., uid . . ., was killed: out of swap space

can be a misnomer for the "out of swap space" part: it can be reported even when none of the swap space had been in use. There are other possible reasons why kills happen. One point is that FreeBSD will not swap out a process that stays runnable; even if its active memory use keeps the free RAM minimal, it just continues to page in and out. If it really was out of swap space there would also be messages like:

swap_pager_getswapspace(. . .): failed

or:

swap_pager: out of swap space

Other causes for the kills include:

- Sustained low free RAM (via stays-runnable processes).
- A sufficiently delayed pageout.
- The swap blk uma zone was exhausted.
- The swap pctrie uma zone was exhausted.

The first two of those have some tunables that you might want to try:

# Delay when persistent low free RAM leads to
# Out Of Memory killing of processes:
vm.pageout_oom_seq=120

# For plenty of swap/paging space (will not
# run out), avoid pageout delays leading to
# Out Of Memory killing of processes:
vm.pfault_oom_attempts=-1

# For possibly insufficient swap/paging space
# (might run out), increase the pageout delay
# that leads to Out Of Memory killing of
# processes (showing defaults at the time):
#vm.pfault_oom_attempts=3
#vm.pfault_oom_wait=10
# (The multiplication of the two is the total, but there
# are other potential tradeoffs in the factors
# multiplied, even for nearly the same total.)

I'll note that vm.pageout_oom_seq has a default of 12 but can be set much larger than 120, such as 1024 or 10240 or even more. Larger figures increase the time before kills start happening because of sustained low free RAM. But no setting is designed to disable the kills from eventually happening on some scale.
(In reply to tech-lists from comment #8)

Another issue for memory requirements/swap use is if USE_TMPFS=. . . includes wrkdir . For example, wrkdir grows to 20+ GiBytes for lang/rust . Same for devel/electron12 . USE_TMPFS=yes includes wrkdir (and data). USE_TMPFS=all does as well. I've not made observations of tmpfs sizes for all the big ports that you listed. Still, if more than one of the huge-wrkdir ports builds in an overlapping time frame, the tmpfs memory use can be even bigger.

To give a hint of how much difference USE_TMPFS=no can make: I run with a patched top that records and reports some "Maximum Observed" figures ("MaxObs..."). For the Orange Pi+ 2E (armv7, 4 cores, 2 GiBytes RAM), a rust-1.54.0_2 build test using USE_TMPFS=no and ALLOW_PARALLEL_JOBS= and vm.pageout_oom_seq=120 and vm.pfault_oom_attempts=-1, but not using codegen-units=1 :

. . .
[00:02:32] Building 1 packages using 1 builders
[00:02:32] Starting/Cloning builders
[00:03:21] Hit CTRL+t at any time to see build progress and stats
[00:03:21] [01] [00:00:00] Building lang/rust | rust-1.54.0_2
[25:09:49] [01] [25:06:28] Finished lang/rust | rust-1.54.0_2: Success
[25:10:27] Stopping 1 builders
. . .

. . . load averages: . . . MaxObs: 5.50, 5.13, 4.88
. . . threads: . . . 11 MaxObsRunning
. . . Mem: . . . 1559Mi MaxObsActive, 257660Ki MaxObsWired, 1837Mi MaxObs(Act+Wir+Lndry)
Swap: 3072Mi Total, . . . 320604Ki MaxObsUsed, 1898Mi MaxObs(Act+Lndry+SwapUsed), 2113Mi MaxObs(Act+Wir+Lndry+SwapUsed)

So: well under 350 MiBytes of swap used for USE_TMPFS=no with 2 GiBytes of RAM (on 4 cores). Swap space was likely required, given its size vs. the 2 GiBytes. (USE_TMPFS="data" would have used more swap space.)

Note: This was a UFS context with USB3 SSD media, no ZFS use.
(In reply to Mark Millard from comment #10)

Hi Mark, thanks for the suggestions. The system is an i7-4770K @ 4.2GHz, amd64, with 32GB RAM, running stable/13-n247002-bd0ad8209d5. I've set vm.pageout_oom_seq=120 and vm.pfault_oom_attempts=-1 now.
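For reference, a sketch of making those two settings persist across reboots; the file placement follows the usual FreeBSD convention (Mark later mentions using /boot/loader.conf for the same values, which also works for these):

```
# /etc/sysctl.conf
vm.pageout_oom_seq=120
vm.pfault_oom_attempts=-1
```

They can also be applied immediately with `sysctl vm.pageout_oom_seq=120` etc., without waiting for a reboot.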
(In reply to tech-lists from comment #11)

How many FreeBSD cpus is that? How many parallel builders are configured in/for poudriere? You did not answer what USE_TMPFS value is in use.

RAM+SWAP matters, not just RAM. With USE_TMPFS including wrkdir, 32 GiBytes (no swap) need not be enough: just two builders that happen to be of the large variety could total over 32 GiBytes of tmpfs space, leaving no room for other memory use or the rest of the tmpfs space needed.

Examples for reference: HoneyComb (aarch64, 16 Cortex-A72 cores (16 FreeBSD cpus), 64 GiBytes of RAM) with a ZFS file system, using USE_TMPFS=all and ALLOW_PARALLEL_JOBS= and vm.pageout_oom_seq=120 and vm.pfault_oom_attempts=-1, for a bulk -a :

. . . load averages: . . . MaxObs: 140.82, 125.33, 105.82
. . . threads: . . . 201 MaxObsRunning
. . . Mem: . . . 48574Mi MaxObsActive, 59525Mi MaxObsWired, 61575Mi MaxObs(Act+Wir+Lndry)
. . . Swap: 251904Mi Total . . . 25165Mi MaxObsUsed, 76364Mi MaxObs(Act+Lndry+SwapUsed), 84323Mi MaxObs(Act+Wir+Lndry+SwapUsed)

So RAM+SwapUsed was somewhat over 64 GiBytes + 24 GiBytes, i.e. somewhat over 88 GiBytes.

Instead re-running a subset that caused more large builds in an overlapping time frame (still 16 builders):

. . . load averages: . . . MaxObs: 202.80, 192.03, 161.87
. . . threads: . . . 216 MaxObsRunning
. . . Mem: . . . 55710Mi MaxObsActive, 60421Mi MaxObsWired, 62697Mi MaxObs(Act+Wir+Lndry)
. . . Swap: 251904Mi Total . . . 84139Mi MaxObsUsed, 136765Mi MaxObs(Act+Lndry+SwapUsed), 142145Mi MaxObs(Act+Wir+Lndry+SwapUsed)

So RAM+SwapUsed was somewhat over (64+82) GiBytes, i.e. somewhat over 146 GiBytes for that example.
(In reply to Mark Millard from comment #10)

A similar experiment for devel/electron12 on aarch64: on a Rock64 (aarch64, 4 Cortex-A53 cores, 4 GiBytes of RAM), with a root-on-UFS file system and a 14336Mi swap partition active (but little used compared to its size, as it turned out), I built devel/electron12 :

. . . (52 other ports built first) . . .
[10:35:31] [01] [00:00:00] Building devel/electron12 | electron12-12.0.9_2
[101:27:33] [01] [90:52:02] Finished devel/electron12 | electron12-12.0.9_2: Success
[101:29:05] Stopping 4 builders
. . .

(Only one builder was active during the devel/electron12 build part.)

It was based on (in part):

USE_TMPFS=no
ALLOW_PARALLEL_JOBS=

Also in use was /boot/loader.conf having:

vm.pageout_oom_seq=120
vm.pfault_oom_attempts=-1

to avoid processes being likely to be killed for sustained durations of low free RAM or for slow paging I/O. I have larger-than-default poudriere timeout settings; I'll not detail them here.

My local top has patches that record and report various "Maximum Observed" figures (MaxObs??? naming). Starting top shortly after the "Building devel/electron12" notice was reported:

. . . load averages: . . . MaxObs: 9.11, 8.34, 7.94
. . . threads: . . . 18 MaxObsRunning
. . . Mem: . . . 3120Mi MaxObsActive, 792064Ki MaxObsWired, 3921Mi MaxObs(Act+Wir+Lndry)
Swap: 14336Mi Total . . . 2383Mi MaxObsUsed, 5381Mi MaxObs(Act+Lndry+SwapUsed), 6142Mi MaxObs(Act+Wir+Lndry+SwapUsed)

So RAM+SwapUsed was somewhat over 6.3 GiBytes, vastly less than just the 20+ GiBytes of tmpfs that would occur with a USE_TMPFS that includes wrkdir .
(In reply to Mark Millard from comment #12)

Hi, sorry about leaving that data out. ZFS is in use by poudriere. All of poudriere and swap is on SSD. 8GB is on a swap partition on one SSD; 16GB is on another as /dev/md99 (a swapfile).

Relevant poudriere.conf entries:

USE_TMPFS=yes
MAX_FILES=4096
PARALLEL_JOBS=4
ALLOW_MAKE_JOBS=yes

% sysctl hw.ncpu
hw.ncpu: 8
% sysctl vm.pageout_oom_seq vm.pfault_oom_attempts
vm.pageout_oom_seq: 120
vm.pfault_oom_attempts: -1

It's running a bulk -f right now with those settings, and so far no crashes or OOM signals in /var/log/messages. Is the patch for top that you use available to the public?

thanks,
USE_TMPFS hasn't been changed because the build was started before I saw your message. I'll let it run to see if it completes. It's over 5 hrs in right now (245 ports, including things like electron12, firefox, libreoffice).
(In reply to tech-lists from comment #14)

See comment #7 of:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206048

for why "16GB is on another as /dev/md99 (a swapfile)" is likely to be a problem in itself.
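If the md-backed swap file does turn out to be the issue, the conventional alternative is a plain swap partition. A hypothetical /etc/fstab entry (the device name is an assumption; it depends on your disk layout):

```
# /etc/fstab
/dev/ada1p2  none  swap  sw  0  0
```

This avoids routing pageout I/O through the md(4) and file-system layers, which is the general concern with file-backed swap under memory pressure.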
(In reply to tech-lists from comment #14)

The patch is a personal hack and depends on my using shell windows that have lots of columns: I just added text to existing lines (and made it display more significant digits in various places as well). It also has material from a patch someone submitted that was never adopted: a "pbt" ordering for "process birth time". Most likely I'll just email you a copy/paste of the git diff from my /usr/13S-src/ , though the FreeBSD vintage may not be a detailed match. (This might not preserve all whitespace correctly.)

I use it with the likes of, say:

# top -HICazores

This makes it track thread counts, not process counts, and puts more RAM usage towards the top of the thread list (same figure for each thread of a process).
(In reply to tech-lists from comment #14)

I'll note that the use of:

USE_TMPFS=yes

is the same as using:

USE_TMPFS="wrkdir data"

and so involves a huge tmpfs for any builder that involves a huge wrkdir . With 4 builders at a time ( PARALLEL_JOBS=4 ), having RAM+SWAP be 32+(8+16) GiBytes, i.e. 56 GiBytes, could prove insufficient for just the tmpfs memory, much less with multiple huge processes on top of it. (Not necessarily likely, depending on what else is to be built.)

PARALLEL_JOBS=4
ALLOW_MAKE_JOBS=yes
hw.ncpu: 8

can lead to long periods of around a 4*8 load average. (These are not necessarily the shorter-term maximums for all stages of all combinations of 4 ports that can occur.)

While USE_TMPFS=yes or the like may be good for the FreeBSD build servers, it would not seem appropriate as a default for general folks who do their own port builds: it presumes too much about the amount of RAM+SWAP, given how many popular ports involve huge wrkdir's and how few personal computer environments are deliberately configured to cover such requirements.
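Putting those numbers together, a hedged sketch of a poudriere.conf for this particular 32 GiB machine (values are suggestions drawn from the discussion above, not a recommendation from the poudriere project):

```
# poudriere.conf sketch for 32 GiB RAM + 24 GiB swap, hw.ncpu=8:
USE_TMPFS=no        # rust/electron/chromium wrkdirs alone can exceed 20 GiB each
PARALLEL_JOBS=4     # 4 builders at a time
ALLOW_MAKE_JOBS=yes # load can approach PARALLEL_JOBS * hw.ncpu = 4 * 8 = 32
```

USE_TMPFS="data" would be a middle ground that keeps small metadata in tmpfs while leaving the huge wrkdir's on disk.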
(In reply to tech-lists from comment #14)

I'll note that I also run with patches that report which of:

- Sustained low free RAM (via stays-runnable processes).
- A sufficiently delayed pageout.
- The swap blk uma zone was exhausted.
- The swap pctrie uma zone was exhausted.

led to each OOM kill. (Those are not the output messages, just indications of the type of context.) There is also one message from a patch that was provided to me by markj (as I remember), back when he was helping investigate why some OOM kills were happening on small arm boards and how those having the problem might avoid such kills. These involve:

modified: sys/vm/swap_pager.c
modified: sys/vm/vm_fault.c
modified: sys/vm/vm_page.c
modified: sys/vm/vm_pageout.c
(In reply to Bryan Drewery from comment #2)

Thanks for looking into this! For context, we cannot use a consistent -f because building the complete list of ports adds 8-30 hours to poudriere's run time. That gets problematic when a core package like curl or OpenSSL has a security fix we need to get out there ASAP.

So we have several files. One is for "core" ports that go on all machines (like OpenSSL and curl). Then per-role files: production web server, production mail server, production database server, development server, etc. Neither building dozens of core packages over and over in separate package trees nor gating production security fixes behind incredibly time-intensive builds of development tools is a good situation for us.
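One possible middle ground, sketched with hypothetical file names: merge the core list into each role's list at run time, so each role's repository always sees a consistent superset while production roles still skip the slow development-tool builds. (Whether `bulk` accepts `-f` more than once depends on the version; check poudriere-bulk(8) for yours. The merge below avoids relying on it.)

```
# Merge core + role lists into one consistent queue for this role's run:
sort -u /root/lists/core.lst /root/lists/web.lst > /tmp/web-run.lst
poudriere bulk -j prod-web -f /tmp/web-run.lst
```

An urgent OpenSSL or curl fix then only rebuilds what changed in the merged list, since ports already up to date in the repository are skipped.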
(In reply to tech-lists from comment #15) How have the builds with various adjusted settings been going for you? (Not sure what all you might have experimented with by now.)
(In reply to Mark Millard from comment #21)

What's happening now is the situation poudriere was in before this fault was reported. In other words, the bulk -f runs fine with the exception of a few ports that succeed when built singly but fail in the bulk -f. What is different this time is that the tendency to fail does not seem to be related to how big the port is, and it's hardly using swap space.

It has these sysctls set:

vm.pageout_oom_seq: 120
vm.pfault_oom_attempts: -1

and, in poudriere.conf:

USE_TMPFS=NO
MAX_FILES=4096
PARALLEL_JOBS=4
ALLOW_MAKE_JOBS=yes

With the last bulk -f, devel/llvm90 failed with "clang-bug" but built fine on its own. There were other failures, but those were in the fetch phase so aren't part of the problem. Other, larger ports like llvm12, libreoffice, and firefox built fine.

The poudriere used is now at poudriere-devel-3.3.99.20211017_2.
(In reply to tech-lists from comment #22)

What sort of console messages are being reported? (Some are likely expected from config-testing deliberately causing signals and such.) (There is not much to go on from your description.)
(In reply to Mark Millard from comment #23)

At the console running poudriere? Not much. In the last bulk -f build, it completes stating llvm90 failed at the build stage. In /var/log/console there's no entry corresponding with the time the build ran. In /var/log/messages there's lots of stuff like this:

[...]
Oct 23 06:00:11 desktop kernel: pid 14000 (conftest), jid 160, uid 0: exited on signal 11 (core dumped)
Oct 23 06:37:59 desktop kernel: pid 41978 (conftest), jid 159, uid 0: exited on signal 11 (core dumped)
Oct 23 07:12:08 desktop kernel: pid 68695 (conftest), jid 159, uid 0: exited on signal 11 (core dumped)
Oct 23 07:44:55 desktop kernel: pid 65672 (conftest), jid 159, uid 0: exited on signal 11 (core dumped)
Oct 23 07:45:13 desktop kernel: pid 69182 (conftest), jid 159, uid 0: exited on signal 11 (core dumped)
Oct 23 09:29:27 desktop kernel: pid 64783 (conftest), jid 161, uid 0: exited on signal 11 (core dumped)
Oct 23 09:46:32 desktop kernel: pid 81782 (conftest), jid 161, uid 0: exited on signal 11 (core dumped)
Oct 23 10:51:52 desktop kernel: pid 19386 (conftest), jid 157, uid 0: exited on signal 11 (core dumped)
Oct 23 11:01:47 desktop kernel: pid 11869 (conftest), jid 160, uid 0: exited on signal 11 (core dumped)
Oct 23 11:28:20 desktop kernel: pid 64388 (conftest), jid 160, uid 0: exited on signal 11 (core dumped)
[...]

but I was informed some time ago that these were harmless.
(In reply to tech-lists from comment #24)

Yea, the conftest ones are normal. From the little bit of description, I'm not able to identify anything analogous that I've dealt with, or even seen. My environment no longer builds devel/llvm90 for my -f usage.

From what you say, the log file from the devel/llvm90 build failure might have some (earlier?) message(s) before the clang-bug . There does not seem to be any place else to look for evidence. Looking at the FreeBSD server based builds, I'm not seeing a mix of failures and successes for devel/llvm90 . And I never had such problems in my builds (back when such was being built). (Nor have I historically seen anything else with such a mix of fail/success results.)

So far I'm not finding anything more I can do to help isolate that issue. And your reports now indicate that the kills that contributed to having a notable list of "prebuilds" have been controlled/avoided via other resource management configuration changes. Thus, continuing here on this subject area could now be considered spam for the original submittal and how it has been handled by Bryan D. Best to let this submittal be closed.