Created attachment 190630 [details]
upstream patch to fix gmake jobserver hang when using pselect()

I have ALLOW_MAKE_JOBS=yes in poudriere.conf on my package build machine. I see infrequent, random build runaway failures for different ports, where the thing that they all have in common is that the leftover leaf processes are always gmake in some sort of wait state, and each gmake process has one or more zombie child processes.

December 2016 - 10-STABLE jail:

USER     PID %CPU %MEM   VSZ   RSS TT STAT STARTED    TIME COMMAND
root   76087  0.0  0.0  9132  1336  0 I+J   8:51AM 0:00.04 /usr/bin/make -C /usr/ports/multimedia/gstreamer-plugins-good build
root   76113  0.0  0.0 14680  1896  0 I+J   8:51AM 0:00.01 - gmake -f Makefile -j8 DESTDIR=/wrkdirs/usr/ports/multimedia/gstreamer-plugins-good/work/stage all
root   76129  0.0  0.0 14680  1916  0 I+J   8:51AM 0:00.01 |-- gmake -C videofilter
root   76148  0.0  0.0     0     0  0 Z+J   8:51AM 0:00.59 | `-- <defunct>
root   76156  0.0  0.0 14680  1968  0 I+J   8:51AM 0:00.01 |-- gmake -C audiofx
root   76625  0.0  0.0     0     0  0 Z+J   8:51AM 0:00.34 | `-- <defunct>
root   76164  0.0  0.0 14680  1912  0 I+J   8:51AM 0:00.01 |-- gmake -C audioparsers
root   78664  0.0  0.0     0     0  0 Z+J   8:51AM 0:00.43 | `-- <defunct>
root   77796  0.0  0.0 14680  1948  0 I+J   8:51AM 0:00.01 |-- gmake -C debugutils
root   78663  0.0  0.0     0     0  0 Z+J   8:51AM 0:00.00 | `-- <defunct>
root   78193  0.0  0.0 14680  1920  0 I+J   8:51AM 0:00.01 |-- gmake -C deinterlace
root   78232  0.0  0.0 14680  1952  0 I+J   8:51AM 0:00.01 | `-- gmake all-am
root   78566  0.0  0.0     0     0  0 Z+J   8:51AM 0:01.45 |   `-- <defunct>
root   78681  0.0  0.0 14680  1940  0 I+J   8:51AM 0:00.01 |-- gmake -C effectv
root   78819  0.0  0.0     0     0  0 Z+J   8:51AM 0:00.28 | `-- <defunct>
root   78830  0.0  0.0 14680  1916  0 I+J   8:51AM 0:00.01 |-- gmake -C equalizer
root   78834  0.0  0.0     0     0  0 Z+J   8:51AM 0:00.43 | `-- <defunct>
root   78952  0.0  0.0 14680  1912  0 I+J   8:51AM 0:00.01 `-- gmake -C flv
root   79075  0.0  0.0     0     0  0 Z+J   8:51AM 0:00.00   `-- <defunct>

September 2017 - 12-CURRENT jail:

USER     PID %CPU %MEM   VSZ   RSS TT STAT STARTED    TIME COMMAND
root    8279  0.0  0.0 10832     0  1 IW+J       - 0:00.00 /usr/bin/make -C /usr/ports/editors/openoffice-devel build
root    8550  0.0  0.0 24852     0  1 IW+J       - 0:00.00 - perl -S /wrkdirs/usr/ports/editors/openoffice-devel/work/aoo-4.2.0/main/solenv/bin/build.pl --all -P2 -- -P4
root   64577  0.0  0.0 24852     0  1 IW+J       - 0:00.00 `-- perl -S /wrkdirs/usr/ports/editors/openoffice-devel/work/aoo-4.2.0/main/solenv/bin/build.pl --all -P2 -- -P4
root   64578  0.0  0.0 11700     0  1 IW+J       - 0:00.00   `-- dmake -P4
root   64583  0.0  0.0 12688     0  1 IW+J       - 0:00.00     `-- /usr/local/bin/bash -c cd .. && gmake -r -j4 && gmake -r deliverlog
root   64584  0.0  0.1 53556 44532  1 I+J     20:32 0:29.23       `-- gmake -r -j4
root   30217  0.0  0.0     0     0  1 Z+J     20:49 0:00.01         |-- <defunct>
root   30220  0.0  0.0     0     0  1 Z+J     20:49 0:00.01         |-- <defunct>
root   30234  0.0  0.0     0     0  1 Z+J     20:49 0:00.01         |-- <defunct>
root   30236  0.0  0.0     0     0  1 Z+J     20:49 0:00.01         `-- <defunct>

USER     PID %CPU %MEM   VSZ   RSS TT STAT STARTED    TIME COMMAND
nobody 43355  0.0  0.0 10500  1596  1 I+J     11:09 0:00.05 /usr/bin/make -C /usr/ports/lang/rust build
nobody 43444  0.0  0.1 46368 36944  1 I+J     11:09 0:07.20 - /usr/local/bin/python2.7 /wrkdirs/usr/ports/lang/rust/work/rustc-1.23.0-src/x.py build --verbose --config ./config.toml --jobs 7
nobody 71768  0.0  0.0 25436 11536  1 I+J     11:10 0:00.21 `-- /wrkdirs/usr/ports/lang/rust/work/rustc-1.23.0-src/build/bootstrap/debug/bootstrap build --verbose --config ./config.toml --jobs 7
nobody  9694  0.0  0.0 23260 13172  1 I+J     11:14 0:00.01   `-- cmake --build . --target install --config Release -- -j 7
nobody  9746  0.0  0.0 11976  2412  1 I+J     11:14 0:00.01     `-- /usr/local/bin/gmake -j 7 install
nobody  9784  0.0  0.0 12624  3292  1 I+J     11:14 0:00.08       `-- /usr/local/bin/gmake -f CMakeFiles/Makefile2 all
nobody  9539  0.0  0.0 11976  2404  1 I+J     11:16 0:00.02         |-- /usr/local/bin/gmake -f lib/DebugInfo/CodeView/CMakeFiles/LLVMDebugInfoCodeView.dir/build.make lib/DebugInfo/CodeView/CMakeFiles/LLVMDebugInfoCodeView.dir/build
nobody 10364  0.0  0.0     0     0  1 Z+J     11:17 0:02.12         | `-- <defunct>
nobody 10119  0.0  0.0 11908  2264  1 I+J     11:14 0:00.01         |-- /usr/local/bin/gmake -f lib/TableGen/CMakeFiles/LLVMTableGen.dir/build.make lib/TableGen/CMakeFiles/LLVMTableGen.dir/build
nobody 10231  0.0  0.0     0     0  1 Z+J     11:14 0:00.95         | `-- <defunct>
nobody 10420  0.0  0.0 12020  2524  1 I+J     11:14 0:00.02         |-- /usr/local/bin/gmake -f lib/MC/MCParser/CMakeFiles/LLVMMCParser.dir/build.make lib/MC/MCParser/CMakeFiles/LLVMMCParser.dir/build
nobody  9353  0.0  0.0     0     0  1 Z+J     11:16 0:00.86         | `-- <defunct>
nobody 10667  0.0  0.0 12476  2800  1 I+J     11:14 0:00.03         |-- /usr/local/bin/gmake -f lib/Support/CMakeFiles/LLVMSupport.dir/build.make lib/Support/CMakeFiles/LLVMSupport.dir/build
nobody 10766  0.0  0.0     0     0  1 Z+J     11:14 0:04.21         | `-- <defunct>
nobody 10726  0.0  0.0 12120  2576  1 I+J     11:14 0:00.03         |-- /usr/local/bin/gmake -f utils/TableGen/CMakeFiles/obj.llvm-tblgen.dir/build.make utils/TableGen/CMakeFiles/obj.llvm-tblgen.dir/build
nobody  9446  0.0  0.0     0     0  1 Z+J     11:16 0:03.01         | `-- <defunct>
nobody 10743  0.0  0.0 12140  2652  1 I+J     11:14 0:00.04         |-- /usr/local/bin/gmake -f lib/MC/CMakeFiles/LLVMMC.dir/build.make lib/MC/CMakeFiles/LLVMMC.dir/build
nobody 10795  0.0  0.0     0     0  1 Z+J     11:14 0:02.35         | `-- <defunct>
nobody 10790  0.0  0.0 12412  2828  1 I+J     11:17 0:00.05         `-- /usr/local/bin/gmake -f lib/DebugInfo/PDB/CMakeFiles/LLVMDebugInfoPDB.dir/build.make lib/DebugInfo/PDB/CMakeFiles/LLVMDebugInfoPDB.dir/build
nobody 17155  0.0  0.0     0     0  1 Z+J     11:17 0:00.01           `-- <defunct>

This turns out to be a known bug in the jobserver code when it is configured to use pselect(). It has been patched upstream, but a version of gmake with this fix has not yet been released. See:
https://savannah.gnu.org/bugs/?49014
https://savannah.gnu.org/bugs/?51159
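For readers unfamiliar with the failure mode, here is a rough Python sketch of the kind of race the upstream reports appear to describe. It is my reading of those reports, not code from gmake: a pipe stands in for the jobserver, one byte stands for one job token, and a duplicated descriptor stands in for a sibling gmake. The names `rival_fd`, `token_was_announced`, and `waiter_got_token` are made up for the illustration.

```python
import os
import select

# A pipe stands in for the jobserver; each byte in it is one job token.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"+")  # exactly one token is available

# Waiter: asks whether a token can be read (gmake would use pselect() here).
readable, _, _ = select.select([read_fd], [], [], 0)
token_was_announced = read_fd in readable

# Rival: a second descriptor for the same pipe, standing in for a sibling
# gmake that takes the token before the waiter gets around to read().
rival_fd = os.dup(read_fd)
stolen = os.read(rival_fd, 1)

# A blocking read() by the waiter would now sleep until some token comes
# back. With a non-blocking descriptor the read fails immediately instead,
# so the waiter can go back to waiting and reaping children.
os.set_blocking(read_fd, False)
try:
    os.read(read_fd, 1)
    waiter_got_token = True
except BlockingIOError:
    waiter_got_token = False

print(token_was_announced, stolen, waiter_got_token)
```

The point of the sketch is only that "readable" from select is a hint, not a reservation: by the time the waiter reads, the token may be gone.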
Created attachment 190631 [details]
example to reproduce gmake zombie process accumulation

This is an example build tree that can be used to reproduce the gmake zombie accumulation problem caused by the gmake jobserver bug when it is configured to use pselect().

On my 8-core 16-thread Ryzen machine, "gmake -j 40 all" quickly ramps up to 40 spinner processes, but in a short time 38 zombie processes accumulate and the machine is mostly idle. For much of the rest of the run the number of zombies stays mostly in the high 30s, with some fluctuations down to about 30. I've seen it hit 39 zombies at one point, but I haven't reproduced a total deadlock.

With the patch applied, top shows 40 (+1 for top) running processes pretty much all of the time and I haven't observed the zombie process count being non-zero.
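If someone wants to track the zombie count over a run rather than eyeballing top, a small helper along these lines might do. It is only a sketch: `count_zombies` is a hypothetical name, and it assumes ps-style text with a header row containing a STAT column, like the listings earlier in this PR. The sample rows are copied from the gstreamer listing.

```python
def count_zombies(ps_output: str) -> int:
    """Count rows whose STAT column starts with 'Z' (zombie)."""
    lines = ps_output.strip().splitlines()
    # Locate the STAT column by name so extra or missing columns do not matter,
    # as long as no column before STAT contains embedded whitespace.
    stat_col = lines[0].split().index("STAT")
    count = 0
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > stat_col and fields[stat_col].startswith("Z"):
            count += 1
    return count


sample = """\
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 76129 0.0 0.0 14680 1916 0 I+J 8:51AM 0:00.01 |-- gmake -C videofilter
root 76148 0.0 0.0 0 0 0 Z+J 8:51AM 0:00.59 | `-- <defunct>
root 76156 0.0 0.0 14680 1968 0 I+J 8:51AM 0:00.01 |-- gmake -C audiofx
root 76625 0.0 0.0 0 0 0 Z+J 8:51AM 0:00.34 | `-- <defunct>
"""
print(count_zombies(sample))
```

Feeding it periodic ps snapshots during the reproducer run should make the difference between the unpatched and patched gmake easy to see.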
Reassign to portmgr for exp-run.
Created attachment 190632 [details]
patch2

Include patches from other PRs.
The jobserver patch seems to cause some problems: http://package23.nyi.freebsd.org/data/103i386-default-PR225899/2018-02-15_09h29m32s/logs/errors/rust-1.23.0.log
I see the same thing here.
Created attachment 190687 [details]
patch3

Let's try with the actual upstream commit.
That still fails for me. It turns out that the error is coming from rust and not gmake. Rust has its own jobserver implementation and it tries to interoperate with gmake. With this patch, it falls over because it does not expect the descriptor to be non-blocking.
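To illustrate why a change inside gmake can trip up a different program: O_NONBLOCK is a flag on the open file description, so every process that inherited the descriptor sees it. The Python sketch below shows that with a duplicated descriptor, plus one way a client could tolerate it. `acquire_token` is a hypothetical helper written for this comment; it is not how rust's jobserver code or gmake is actually implemented.

```python
import os
import select
from typing import Optional


def acquire_token(fd: int, timeout: float = 1.0) -> Optional[bytes]:
    """Read one token, tolerating a descriptor that is non-blocking."""
    while True:
        try:
            data = os.read(fd, 1)
            return data or None  # b"" means the write side was closed
        except BlockingIOError:
            # Nothing to read right now: wait for readiness, then retry.
            ready, _, _ = select.select([fd], [], [], timeout)
            if not ready:
                return None  # gave up waiting


read_fd, write_fd = os.pipe()
inherited_fd = os.dup(read_fd)       # stands in for a child's inherited copy
os.set_blocking(read_fd, False)      # what the patched parent does
flag_is_shared = not os.get_blocking(inherited_fd)

empty_result = acquire_token(inherited_fd, timeout=0.1)  # no token yet
os.write(write_fd, b"+")
token = acquire_token(inherited_fd, timeout=0.1)
print(flag_is_shared, empty_result, token)
```

A client that instead treats every read error as fatal would fall over on the first empty read, which matches the symptom described above.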
It looks like rust 1.23 should interoperate with gmake's use of a nonblocking jobserver descriptor. Unfortunately, the rust bootstrap is version 1.22. I see that rust 1.24 was released a couple of days ago. We are probably blocked until rust is updated to 1.24 and the bootstrap is updated to 1.23, unless we set MAKE_JOBS_UNSAFE=yes.
MAKE_JOBS_UNSAFE=yes will not help; the problem happens with 1 job too.
Probably because the Makefile has this in do-build:

	${X_PY_CMD} build \
		--verbose \
		--config ./config.toml \
		--jobs ${MAKE_JOBS_NUMBER}

That's probably enough to bring the jobserver into play.
I was wondering why the exp run picked this up ... Commenting out the --jobs flag is not sufficient to unbreak the rust build. Even without that flag, it does a parallel build anyway.
A commit references this bug:

Author: tijl
Date: Tue Feb 20 09:39:15 UTC 2018
New revision: 462385
URL: https://svnweb.freebsd.org/changeset/ports/462385

Log:
  - Remove fmake hack because FreeBSD 9 is EoL.
  - Fix compiler warning about realloc. [1]
  - Remove dependency on gettext-tools and iconv. [2]
  - Add upstream patch to fix hangs when doing parallel builds. [3]

  PR: 210938 [1], 212157 [2], 225899 [3]
  Obtained from: https://git.savannah.gnu.org/cgit/make.git/commit/?id=b552b05251980f693c729e251f93f5225b400714 [3]
  Exp-run by: antoine, truckman

Changes:
  head/devel/gmake/Makefile
  head/devel/gmake/files/patch-glob-glob.c
  head/devel/gmake/files/patch-posixos.c