Bug 253718 - Major issue with zombie processes from standard base system utils
Summary: Major issue with zombie processes from standard base system utils
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Graham Perrin
URL: https://github.com/freebsd/freebsd-sr...
Keywords:
Depends on:
Blocks:
 
Reported: 2021-02-20 11:20 UTC by dmilith
Modified: 2022-12-29 16:35 UTC
CC List: 2 users

See Also:


Attachments

Description dmilith 2021-02-20 11:20:28 UTC
Since FreeBSD 13-ALPHA I have been able to reproduce a very nasty bug. I thought it was HardenedBSD-related, but today I also reproduced it on a vanilla version of FreeBSD.

The issue happens when I build software using my "Sofin" script (a plain sh script that uses sed, awk, and other base system utilities to automate the software build process). From htop, the issue looks like this:

http://s.verknowsys.com/122c70f4414e3ccd903d34746d8284e8.png

In this case, two processes turn zombie and the build is stuck (here the perl build process hung while invoking sed and zsh).
This is how the build looks from my script's side: http://s.verknowsys.com/9fe4981b309b381785e0b4c4ab7d0aea.png (this is how I know it's perl in this case, but I had the same issues building Qemu and other software bundles before).

It's a very severe bug (I would call it a critical blocker), and it is not that hard to reproduce.
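
For anyone trying to confirm the state without htop, the zombies and their parents can be listed from a shell (a minimal sketch using only base system ps(1) and awk(1); nothing Sofin-specific):

# List zombie (Z state) processes together with the parents
# that are failing to reap them:
ps -axo pid,ppid,state,comm | awk '$3 ~ /^Z/'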
Comment 1 Konstantin Belousov freebsd_committer freebsd_triage 2021-02-20 11:23:05 UTC
If it is so easy to reproduce, please provide an absolutely minimal reproduction scenario.
Comment 2 dmilith 2021-02-20 23:30:51 UTC
I provided a whole VM image, because my system (svdOS) is a whole stack of things built on top of FreeBSD, including:

1. Sofin - which requires things like /Software/Git and /Software/Zsh on ZFS datasets to run properly.
2. There's a whole ZFS dataset infrastructure provided for the Sofin utility… which would take a while to set up manually.
3. There are prepared build utilities used by the build host (under /Services/Sofin/), which is an even trickier part.
4. Every software bundle is read-only; the system is divided into read-only and writable parts. / is made read-only…

Well, it's quite some custom stuff, but all open source and without any magic.
If you'd like to see things from the source code point of view, they're also open and available here: https://github.com/VerKnowSys/sofin and the system build scripts are here: https://github.com/VerKnowSys/svdOS - basically shell script and a few lines of C code.

So all of that requires a special script to be invoked on a vanilla OS… which would take far too long to explain…


So here is the easiest reproduction path I can provide:

Go to http://software.verknowsys.com/build-host-images/

There you'll find my VMware-exported build-host VM in the standard "ova" format (it's a tarball). It should be possible to convert the vmdk to qcow2 if you need to; it doesn't matter.
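
If you don't use VMware, the conversion is roughly this (a sketch assuming the qemu-img utility from the qemu port; the file names inside the archive are illustrative):

# An .ova is a tar archive; extract it, then convert the disk image:
tar -xf build-host.ova
qemu-img convert -f vmdk -O qcow2 build-host-disk1.vmdk build-host.qcow2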



Here's the reproduction path:

0. Start the VM.
1. Log in as root to the VM over SSH. Use "/'\;[]p-=" as the root password.
2. Invoke these commands:

# Tell Sofin to remove both previously built software bundles:
s rm Imagemagick Qemu

# let's now build both bundles again:
s b Imagemagick Qemu


It will start building Imagemagick and Qemu with all their software dependencies, step by step… which will take a while.

In my case, the last time it hung was on the "perl" requirement of the Qemu bundle, but one build earlier it was the "qemu" requirement of the Qemu bundle…

When you notice no progress for a longer period of time, log in over SSH again; htop/top should show you the zombie processes.
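
To gather more detail on the hang, the parent that fails to reap the zombies can be inspected too (a sketch; substitute the PPID that htop/top reports for the zombies):

# Kernel stacks of the hung parent, plus its wait channel:
procstat -kk <ppid>
ps -o pid,wchan,state,comm -p <ppid>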


Hope you'll be able to reproduce it as well :)
Comment 3 dmilith 2021-03-06 16:42:02 UTC
The issue was later reproduced in similar circumstances on:

- x86_64 machine with pkg software build
- ThunderX2 machine with pkg software build

Both running poudriere, not using my script…
Comment 4 dmilith 2021-03-06 22:44:38 UTC
I did a build with 13.0-RC1 today, and I'm trying to reproduce the issue again.

What I did with BETA2-4 was a hack - I created a symlink from /usr/lib/libncursesw.so to /usr/lib/libncursesw.so.8 to be able to run old software from FreeBSD 12.2. Is it possible that, after the ncursesw upgrade, old code triggers some issue causing zombies? (my zsh 5.8 is linked against ncursesw)

It's unlikely, but I think it's worth mentioning :)
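
For reference, the hack was roughly the following (a sketch reconstructed from the description above; the exact paths and library version numbers depend on what 12.2/13.0 actually ship):

# Let 12.2-era binaries that expect libncursesw.so.8 resolve
# against the library shipped with 13.x:
ln -s /usr/lib/libncursesw.so /usr/lib/libncursesw.so.8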
Comment 5 dmilith 2021-03-06 23:28:20 UTC
I was completely wrong - the same issue appeared while building perl 5.32 as a Qemu dependency on 13.0-RC1:

http://s.verknowsys.com/a3b2a77a65a230a4b69f85d9f9862f2c.png

Now even doubled. No hacks with ncursesw this time.
Comment 6 dmilith 2021-03-27 12:33:35 UTC
The same issue is confirmed on RC2 and RC3. On RC3 it feels even easier to reproduce.
Comment 7 Mark Millard 2021-03-28 00:26:31 UTC
(In reply to dmilith from comment #3)

Could you give step-by-step instructions to set up
the context for the failure, and then to demonstrate
the failure, in this poudriere context (13.0-RC3+ or
main [14])? How many complications (such as virtual
machine involvement) can be avoided while still getting
the problem?

At this point, not having any specific context in
common with you, I have no good idea how to test
whether I can reproduce the problem, including how to
avoid involving non-essential things relative to
the problem, things that I'm not familiar with.

I do port builds via poudriere-devel, and in
the past I have done cross builds of ports via poudriere
(amd64->aarch64 and amd64->armv7). I've not done
cross builds of ports in a long time. Cross builds
involved qemu executing things, and some of those
were getting hung up a lot back then. I gave up on
it. Looking around, it seems that some improvements
in the area have been made, at least for
amd64->aarch64. But I've not tested with the qemu
updates.
Comment 8 Mark Millard 2021-03-28 00:42:34 UTC
(In reply to Mark Millard from comment #7)

By the way: I have access to systems that
are based on . . .

amd64         (16 cores, so 32 threads)
aarch64       (only 4 cores)
armv7         (only 4 cores)
old powerpc64 (2 sockets/1 core each)
old powerpc   (2 sockets/1 core each or 1 socket with just 1 core)

As it stands, the systems are based on non-debug builds
of main [14] as far as FreeBSD use goes. In more
detail, based on:

merge-base: 7381bbee29df959e88ec59866cf2878263e7f3b2
merge-base: CommitDate: 2021-03-12 20:29:42 +0000
7381bbee29df (freebsd/main, freebsd/HEAD, pure-src, main) cam: Run all XPT_ASYNC ccbs in a dedicated thread
Comment 9 Graham Perrin freebsd_committer freebsd_triage 2021-03-28 05:02:54 UTC
From <https://lists.freebsd.org/pipermail/freebsd-current/2021-March/079270.html>: 

> … doesn't happen on 12.2, 12.1, 12.0 or older releases. …

If not reproducible with 12.2-RELEASE-p3: 

* was it reproducible with 12.2-RELEASE-p4 kernel 
  with 12.2-RELEASE-p4 userland; and 

* is it reproducible with 12.2-RELEASE-p4 kernel 
  with 12.2-RELEASE-p5 userland?
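
For checking which kernel and userland patch levels are actually in play, freebsd-version(1) reports both:

# Print the installed kernel and userland versions:
freebsd-version -ku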

(I'd like to read your test results alongside my own results, some of which are not yet coherent, for a different bug …)
Comment 10 dmilith 2021-03-28 09:56:40 UTC
(In reply to Graham Perrin from comment #9)

> * was it reproducible with 12.2-RELEASE-p4 kernel 
>   with 12.2-RELEASE-p4 userland; and 
> 
> * is it reproducible with 12.2-RELEASE-p4 kernel 
>   with 12.2-RELEASE-p5 userland?

As I mentioned, I use 12.2 extensively on a few production systems. No such issue ever happened on 12.x. 

I'd be super happy to test it for you if you have an idea of what is wrong :)
Comment 11 dmilith 2021-03-28 10:28:29 UTC
(In reply to Mark Millard from comment #8)

I have:
- one amd64 dedicated machine, 8 cores, 8th-gen i7, 64 GB RAM - my native build host
- my workstation - also amd64, 8 cores, 3rd-gen i7-3770, 32 GB RAM (where I do builds and tests under VMware Fusion before I try the native build host).
- 10+ other workstations / servers / DO instances using the system and software I prepare.


A few more notes:

- Because my workstation uses an Ivy Bridge CPU, I override CFLAGS to make sure that both my systems and my software are compatible with Intel CPUs from the 3rd generation on.

- My software build script uses ZFS a _lot_ under the hood. Every software PREFIX is a ZFS dataset that is mounted, filled, unmounted, and sent on every build. Every "software build" is an lz4-compressed ZFS stream written to a file (see the sketch after these notes). ZFS got a major upgrade in 13 - maybe the issue is related?

- I use the tmpfs feature a lot. By default, (almost) every build is both ccached and done in a RAM disk (I'm quite sure the zombie-process issue happens when the build is invoked from a ramdisk that's temporarily mounted under a ZFS dataset mountpoint). I also use LTO caching (if enabled for the given software definition). I've been trying to use every system feature I can (just because I can, and I get nice results from it).

- My system build configs are public:
https://github.com/VerKnowSys/svdOS/blob/master/etc/src.conf
https://github.com/VerKnowSys/svdOS/blob/master/etc/make.conf

- I have used almost the same scripts (with minor changes) for builds since FreeBSD 9.0 (when I started having fun with this stuff) - mostly to get reproducible builds across OS versions.
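
To make the ZFS/ramdisk part concrete, each bundle build roughly follows this lifecycle (a minimal sketch with hypothetical pool, dataset, and bundle names; the real logic is in the Sofin sources linked above, and lz4 here is the port's lz4(1) utility):

# Create and mount a dataset for the software PREFIX:
zfs create -o mountpoint=/Software/Perl zroot/Software/Perl

# Mount a tmpfs ramdisk under it for the build working directory:
mount -t tmpfs tmpfs /Software/Perl/.build

# ... configure / make / install into /Software/Perl ...

# Tear down the ramdisk, then ship the dataset as a compressed stream:
umount /Software/Perl/.build
zfs snapshot zroot/Software/Perl@build
zfs unmount zroot/Software/Perl
zfs send zroot/Software/Perl@build | lz4 > Perl.zfs.lz4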
Comment 12 Mark Millard 2021-03-28 17:18:43 UTC
(In reply to dmilith from comment #11)

But comment #3 indicates that you have replicated the
problem without involving your scripts (unless I
misunderstand the wording). So it would seem that no
details specific to the scripts' operation need to
be involved, but things in common with your poudriere
example from comment #3 likely would be.

My context does not involve any ZFS file systems
(just UFS+SU), and my normal configuration avoids
tmpfs use, including in poudriere use. So if I could
repeat the steps for comment #3, it would likely
provide a cross-check on what is involved in the
failure vs. not involved in the failure, depending
on whether it got stuck zombie processes or not.
Comment 13 dmilith 2021-04-02 13:09:38 UTC
FYI: I'm testing with a fresh build of 13.0-RC5, but after a few hours I'm unable to reproduce it anymore.

I guess that https://github.com/freebsd/freebsd-src/commit/40d278253d205f35ec368c93b7400baddda1b81e fixed the problem?
Comment 14 dmilith 2021-04-02 13:49:25 UTC
(In reply to dmilith from comment #13)

See also: https://lists.freebsd.org/pipermail/freebsd-stable/2021-March/093371.html
Comment 15 Mark Millard 2021-04-02 14:12:14 UTC
(In reply to dmilith from comment #13)

Where did you get a copy of 13.0-RC5 hours ago?
I still do not see it.
Comment 16 Mark Millard 2021-04-02 14:13:07 UTC
(In reply to Mark Millard from comment #15)

Never mind: it looks like you built your own.
Comment 17 Graham Perrin freebsd_committer freebsd_triage 2022-12-29 16:34:53 UTC
(In reply to dmilith from comment #13)

> FYI: I'm testing with fresh build of 13.0-RC5, but after a few hours, 
> I'm unable to reproduce it anymore.
> 
> I guess that … fixed the problem?

Let's assume so. 

Thank you. 

<https://cgit.freebsd.org/src/commit/?h=releng/13.0&id=40d278253d205f35ec368c93b7400baddda1b81e>