Bug 225899 - [exp-run] devel/gmake: parallel builds sometimes deadlock
Summary: [exp-run] devel/gmake: parallel builds sometimes deadlock
Status: Closed FIXED
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Tijl Coosemans
URL:
Keywords: patch
Depends on: 225994
Blocks:
Reported: 2018-02-14 18:43 UTC by Don Lewis
Modified: 2018-02-20 11:07 UTC (History)

See Also:
tijl: maintainer-feedback+


Attachments
upstream patch to fix gmake jobserver hang when using pselect() (1.87 KB, patch)
2018-02-14 18:43 UTC, Don Lewis
example to reproduce gmake zombie process accumulation (1.48 KB, text/plain)
2018-02-14 18:55 UTC, Don Lewis
patch2 (3.21 KB, patch)
2018-02-14 20:00 UTC, Tijl Coosemans
patch3 (6.83 KB, patch)
2018-02-16 11:22 UTC, Tijl Coosemans

Description Don Lewis freebsd_committer freebsd_triage 2018-02-14 18:43:23 UTC
Created attachment 190630
upstream patch to fix gmake jobserver hang when using pselect()

I have ALLOW_MAKE_JOBS=yes in poudriere.conf on my package build machine.  I see infrequent, random runaway-process build failures across different ports.  What they all have in common is that the leftover leaf processes are always gmake in some sort of wait state, and each gmake process has one or more zombie child processes.

December 2016 - 10-STABLE jail:
USER   PID %CPU %MEM   VSZ  RSS TT  STAT STARTED    TIME COMMAND
root 76087  0.0  0.0  9132 1336  0  I+J   8:51AM 0:00.04 /usr/bin/make -C /usr/ports/multimedia/gstreamer-plugins-good build
root 76113  0.0  0.0 14680 1896  0  I+J   8:51AM 0:00.01 - gmake -f Makefile -j8 DESTDIR=/wrkdirs/usr/ports/multimedia/gstreamer-plugins-good/work/stage all
root 76129  0.0  0.0 14680 1916  0  I+J   8:51AM 0:00.01 |-- gmake -C videofilter
root 76148  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.59 | `-- <defunct>
root 76156  0.0  0.0 14680 1968  0  I+J   8:51AM 0:00.01 |-- gmake -C audiofx
root 76625  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.34 | `-- <defunct>
root 76164  0.0  0.0 14680 1912  0  I+J   8:51AM 0:00.01 |-- gmake -C audioparsers
root 78664  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.43 | `-- <defunct>
root 77796  0.0  0.0 14680 1948  0  I+J   8:51AM 0:00.01 |-- gmake -C debugutils
root 78663  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.00 | `-- <defunct>
root 78193  0.0  0.0 14680 1920  0  I+J   8:51AM 0:00.01 |-- gmake -C deinterlace
root 78232  0.0  0.0 14680 1952  0  I+J   8:51AM 0:00.01 | `-- gmake all-am
root 78566  0.0  0.0     0    0  0  Z+J   8:51AM 0:01.45 |   `-- <defunct>
root 78681  0.0  0.0 14680 1940  0  I+J   8:51AM 0:00.01 |-- gmake -C effectv
root 78819  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.28 | `-- <defunct>
root 78830  0.0  0.0 14680 1916  0  I+J   8:51AM 0:00.01 |-- gmake -C equalizer
root 78834  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.43 | `-- <defunct>
root 78952  0.0  0.0 14680 1912  0  I+J   8:51AM 0:00.01 `-- gmake -C flv
root 79075  0.0  0.0     0    0  0  Z+J   8:51AM 0:00.00   `-- <defunct>

September 2017 - 12-CURRENT jail:
USER   PID %CPU %MEM   VSZ   RSS TT  STAT STARTED    TIME COMMAND
root  8279  0.0  0.0 10832     0  1  IW+J -       0:00.00 /usr/bin/make -C /usr/ports/editors/openoffice-devel build
root  8550  0.0  0.0 24852     0  1  IW+J -       0:00.00 - perl -S /wrkdirs/usr/ports/editors/openoffice-devel/work/aoo-4.2.0/main/solenv/bin/build.pl --all -P2 -- -P4
root 64577  0.0  0.0 24852     0  1  IW+J -       0:00.00 `-- perl -S /wrkdirs/usr/ports/editors/openoffice-devel/work/aoo-4.2.0/main/solenv/bin/build.pl --all -P2 -- -P4
root 64578  0.0  0.0 11700     0  1  IW+J -       0:00.00   `-- dmake -P4
root 64583  0.0  0.0 12688     0  1  IW+J -       0:00.00     `-- /usr/local/bin/bash -c cd .. && gmake  -r -j4   && gmake  -r deliverlog
root 64584  0.0  0.1 53556 44532  1  I+J  20:32   0:29.23       `-- gmake -r -j4
root 30217  0.0  0.0     0     0  1  Z+J  20:49   0:00.01         |-- <defunct>
root 30220  0.0  0.0     0     0  1  Z+J  20:49   0:00.01         |-- <defunct>
root 30234  0.0  0.0     0     0  1  Z+J  20:49   0:00.01         |-- <defunct>
root 30236  0.0  0.0     0     0  1  Z+J  20:49   0:00.01         `-- <defunct>


USER     PID %CPU %MEM   VSZ   RSS TT  STAT STARTED    TIME COMMAND
nobody 43355  0.0  0.0 10500  1596  1  I+J  11:09   0:00.05 /usr/bin/make -C /usr/ports/lang/rust build
nobody 43444  0.0  0.1 46368 36944  1  I+J  11:09   0:07.20 - /usr/local/bin/python2.7 /wrkdirs/usr/ports/lang/rust/work/rustc-1.23.0-src/x.py build --verbose --config ./config.toml --jobs 7
nobody 71768  0.0  0.0 25436 11536  1  I+J  11:10   0:00.21 `-- /wrkdirs/usr/ports/lang/rust/work/rustc-1.23.0-src/build/bootstrap/debug/bootstrap build --verbose --config ./config.toml --jobs 7
nobody  9694  0.0  0.0 23260 13172  1  I+J  11:14   0:00.01   `-- cmake --build . --target install --config Release -- -j 7
nobody  9746  0.0  0.0 11976  2412  1  I+J  11:14   0:00.01     `-- /usr/local/bin/gmake -j 7 install
nobody  9784  0.0  0.0 12624  3292  1  I+J  11:14   0:00.08       `-- /usr/local/bin/gmake -f CMakeFiles/Makefile2 all
nobody  9539  0.0  0.0 11976  2404  1  I+J  11:16   0:00.02         |-- /usr/local/bin/gmake -f lib/DebugInfo/CodeView/CMakeFiles/LLVMDebugInfoCodeView.dir/build.make lib/DebugInfo/CodeView/CMakeFiles/LLVMDebugInfoCodeView.dir/build
nobody 10364  0.0  0.0     0     0  1  Z+J  11:17   0:02.12         | `-- <defunct>
nobody 10119  0.0  0.0 11908  2264  1  I+J  11:14   0:00.01         |-- /usr/local/bin/gmake -f lib/TableGen/CMakeFiles/LLVMTableGen.dir/build.make lib/TableGen/CMakeFiles/LLVMTableGen.dir/build
nobody 10231  0.0  0.0     0     0  1  Z+J  11:14   0:00.95         | `-- <defunct>
nobody 10420  0.0  0.0 12020  2524  1  I+J  11:14   0:00.02         |-- /usr/local/bin/gmake -f lib/MC/MCParser/CMakeFiles/LLVMMCParser.dir/build.make lib/MC/MCParser/CMakeFiles/LLVMMCParser.dir/build
nobody  9353  0.0  0.0     0     0  1  Z+J  11:16   0:00.86         | `-- <defunct>
nobody 10667  0.0  0.0 12476  2800  1  I+J  11:14   0:00.03         |-- /usr/local/bin/gmake -f lib/Support/CMakeFiles/LLVMSupport.dir/build.make lib/Support/CMakeFiles/LLVMSupport.dir/build
nobody 10766  0.0  0.0     0     0  1  Z+J  11:14   0:04.21         | `-- <defunct>
nobody 10726  0.0  0.0 12120  2576  1  I+J  11:14   0:00.03         |-- /usr/local/bin/gmake -f utils/TableGen/CMakeFiles/obj.llvm-tblgen.dir/build.make utils/TableGen/CMakeFiles/obj.llvm-tblgen.dir/build
nobody  9446  0.0  0.0     0     0  1  Z+J  11:16   0:03.01         | `-- <defunct>
nobody 10743  0.0  0.0 12140  2652  1  I+J  11:14   0:00.04         |-- /usr/local/bin/gmake -f lib/MC/CMakeFiles/LLVMMC.dir/build.make lib/MC/CMakeFiles/LLVMMC.dir/build
nobody 10795  0.0  0.0     0     0  1  Z+J  11:14   0:02.35         | `-- <defunct>
nobody 10790  0.0  0.0 12412  2828  1  I+J  11:17   0:00.05         `-- /usr/local/bin/gmake -f lib/DebugInfo/PDB/CMakeFiles/LLVMDebugInfoPDB.dir/build.make lib/DebugInfo/PDB/CMakeFiles/LLVMDebugInfoPDB.dir/build
nobody 17155  0.0  0.0     0     0  1  Z+J  11:17   0:00.01           `-- <defunct>

This turns out to be a known bug in the jobserver code when it is configured to use pselect().  It has been patched upstream, but a version of gmake with this fix has not yet been released.  See:
  https://savannah.gnu.org/bugs/?49014
  https://savannah.gnu.org/bugs/?51159
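
For illustration only (this sketch is not part of the original report and is not gmake's code): the jobserver hands out job slots as one-byte tokens through a pipe, and a wait-then-read sequence on such a pipe can race with other readers. The Python below assumes a hypothetical one-token pool; `demo_race` and its variable names are made up for the example. A waiter sees the pipe as readable, a competitor takes the token first, and the waiter's read then either sleeps indefinitely (blocking descriptor) or fails fast with EAGAIN (non-blocking descriptor).

```python
import os
import select


def demo_race():
    """Show a readable-then-empty pipe race using a one-token pool.

    Returns (token_was_visible, stolen, got_token).
    """
    # Hypothetical token pool: a pipe holding a single one-byte token.
    read_fd, write_fd = os.pipe()
    os.write(write_fd, b"+")

    # Step 1: the waiter checks readiness and is told a token is available.
    ready, _, _ = select.select([read_fd], [], [], 1.0)
    token_was_visible = read_fd in ready

    # Step 2: a competitor takes the token first (simulated in-process).
    stolen = os.read(read_fd, 1)

    # Step 3: the waiter now reads. On a blocking descriptor this call would
    # sleep until somebody returns a token. With O_NONBLOCK set it raises
    # BlockingIOError (EAGAIN) immediately, so the caller stays responsive.
    os.set_blocking(read_fd, False)
    try:
        os.read(read_fd, 1)
        got_token = True
    except BlockingIOError:
        got_token = False

    os.close(read_fd)
    os.close(write_fd)
    return token_was_visible, stolen, got_token
```

Whether this is exactly the sequence behind the hangs above is an assumption; the upstream reports linked above are the authoritative description.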
Comment 1 Don Lewis freebsd_committer freebsd_triage 2018-02-14 18:55:17 UTC
Created attachment 190631
example to reproduce gmake zombie process accumulation

This is an example build tree that can be used to reproduce the gmake zombie accumulation problem caused by the gmake jobserver bug when it is configured to use pselect().

On my 8-core 16-thread Ryzen machine, "gmake -j 40 all" quickly ramps up to 40 spinner processes, but in a short time 38 zombie processes accumulate and the machine is mostly idle.  For much of the rest of the run the number of zombies stays mostly in the high 30s, with some fluctuations down to about 30.  I've seen it hit 39 zombies at one point, but I haven't reproduced a total deadlock.

With the patch applied, top shows 40 (+1 for top) running processes pretty much all of the time and I haven't observed the zombie process count being non-zero.
Comment 2 Tijl Coosemans freebsd_committer freebsd_triage 2018-02-14 19:46:22 UTC
Reassign to portmgr for exp-run.
Comment 3 Tijl Coosemans freebsd_committer freebsd_triage 2018-02-14 20:00:29 UTC
Created attachment 190632
patch2

Include patches from other PRs.
Comment 4 Antoine Brodin freebsd_committer freebsd_triage 2018-02-15 21:08:09 UTC
The jobserver patch seems to cause some problems:

http://package23.nyi.freebsd.org/data/103i386-default-PR225899/2018-02-15_09h29m32s/logs/errors/rust-1.23.0.log
Comment 5 Don Lewis freebsd_committer freebsd_triage 2018-02-15 23:31:01 UTC
I see the same thing here.
Comment 6 Tijl Coosemans freebsd_committer freebsd_triage 2018-02-16 11:22:46 UTC
Created attachment 190687
patch3

Let's try with the actual upstream commit.
Comment 7 Don Lewis freebsd_committer freebsd_triage 2018-02-17 19:15:19 UTC
That still fails for me.  It turns out that the error is coming from rust and not gmake.  Rust has its own jobserver implementation and it tries to interoperate with gmake.  With this patch, it falls over because it does not expect the descriptor to be non-blocking.
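
To illustrate the point about non-blocking descriptors, here is a minimal Python sketch of a jobserver client that tolerates one. It is not rust's or gmake's actual code; `acquire_token` and its `timeout` parameter are hypothetical names. A client that treats EAGAIN from read() as a fatal error breaks as soon as the descriptor it inherits is non-blocking, whereas one that waits for readiness and retries keeps working with either kind.

```python
import os
import select


def acquire_token(read_fd, timeout=1.0):
    """Try to take one token byte from a jobserver-style pipe.

    Works whether read_fd is blocking or non-blocking. Returns the token
    byte, or None if the pipe stays empty for `timeout` seconds after an
    EAGAIN, or if the write side has been closed.
    """
    while True:
        try:
            data = os.read(read_fd, 1)
        except BlockingIOError:
            # Non-blocking descriptor with no token available: wait until
            # the pipe becomes readable, then retry the read.
            ready, _, _ = select.select([read_fd], [], [], timeout)
            if not ready:
                return None
            continue
        # An empty read means every writer closed the pipe.
        return data or None
```

With a non-blocking pipe `r`, `acquire_token(r, timeout=0.1)` returns None while the pipe is empty and returns the byte once another process writes a token back.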
Comment 8 Don Lewis freebsd_committer freebsd_triage 2018-02-17 21:19:00 UTC
It looks like rust 1.23 should interoperate with gmake's use of a nonblocking jobserver descriptor.  Unfortunately the rust bootstrap is version 1.22.

I see that rust 1.24 was released a couple of days ago.  We are probably blocked until rust is updated to 1.24 and the bootstrap is updated to 1.23 unless we set MAKE_JOBS_UNSAFE=yes.
Comment 9 Antoine Brodin freebsd_committer freebsd_triage 2018-02-17 21:21:29 UTC
MAKE_JOBS_UNSAFE=yes will not help; the problem happens with 1 job too.
Comment 10 Don Lewis freebsd_committer freebsd_triage 2018-02-17 23:12:03 UTC
Probably because the Makefile has this in do-build:

        ${X_PY_CMD} build \
                --verbose \
                --config ./config.toml \
                --jobs ${MAKE_JOBS_NUMBER}


That's probably enough to bring the jobserver into play.
Comment 11 Don Lewis freebsd_committer freebsd_triage 2018-02-18 01:33:31 UTC
I was wondering why the exp run picked this up ...

Commenting out the --jobs flag is not sufficient to unbreak the rust build.  Even without that flag, it does a parallel build anyway.
Comment 12 commit-hook freebsd_committer freebsd_triage 2018-02-20 09:39:39 UTC
A commit references this bug:

Author: tijl
Date: Tue Feb 20 09:39:15 UTC 2018
New revision: 462385
URL: https://svnweb.freebsd.org/changeset/ports/462385

Log:
  - Remove fmake hack because FreeBSD 9 is EoL.
  - Fix compiler warning about realloc. [1]
  - Remove dependency on gettext-tools and iconv. [2]
  - Add upstream patch to fix hangs when doing parallel builds. [3]

  PR:		210938 [1], 212157 [2], 225899 [3]
  Obtained from:	https://git.savannah.gnu.org/cgit/make.git/commit/?id=b552b05251980f693c729e251f93f5225b400714 [3]
  Exp-run by:	antoine, truckman

Changes:
  head/devel/gmake/Makefile
  head/devel/gmake/files/patch-glob-glob.c
  head/devel/gmake/files/patch-posixos.c