Since FreeBSD 13-ALPHA I have been able to reproduce a very nasty bug. I thought it was HardenedBSD-related, but today I also reproduced it on vanilla FreeBSD. The issue happens when I build software using my "Sofin" script (a plain sh script that uses sed, awk and other base system utilities to automate the software build process). From htop, the issue looks like this: http://s.verknowsys.com/122c70f4414e3ccd903d34746d8284e8.png In this case, two processes turn zombie and the build is stuck (here the perl build process hung while invoking sed and zsh). This is how the build looks from my script's side: http://s.verknowsys.com/9fe4981b309b381785e0b4c4ab7d0aea.png (this is how I know it's perl in this case, but I had the same issue building Qemu and other software bundles before). It's a very severe bug (I would say a critical blocker), and it is not that hard to reproduce.
If it is so easy to reproduce, please provide an absolutely minimal reproduction scenario.
I provided a whole VM image because my system (svdOS) is a whole stack of things built on top of FreeBSD, including:
1. Sofin, which requires things like /Software/Git and /Software/Zsh on ZFS datasets to run properly.
2. A whole ZFS dataset infrastructure provided for the Sofin utility… which would take a while to set up manually.
4. Prepared build utilities used by the build host (under /Services/Sofin/), which is an even trickier part.
5. Every software bundle is read-only; the system is divided into read-only and writable parts. / is made read-only…
Well, it's quite some custom stuff, but all open source and without any magic. If you'd like to look at it from the source code point of view, the sources are available here: https://github.com/VerKnowSys/sofin and the system build stuff is here: https://github.com/VerKnowSys/svdOS - basically shell scripts and a few lines of C code.
So all of that requires a special script to be invoked on a vanilla OS… which would take far too long to explain… So here is the easiest reproduction path I can provide:
Go to http://software.verknowsys.com/build-host-images/ There you'll find my Vmware exported build-host-vm in standard "ova" format (it's a tarball). It should be possible to convert the vmdk to qcow2 if you need to; it doesn't matter.
Here's the reproduction path:
0. Start the VM.
1. Log in as root to the VM over SSH. Use "/'\;[]p-=" as the root password.
2. Invoke these commands:
# Tell Sofin to remove both previously built software bundles:
s rm Imagemagick Qemu
# let's now build both bundles again:
s b Imagemagick Qemu
It will start building Imagemagick and Qemu with all software dependencies step by step… which will take a while. In my case, the last time it hung was on the "perl" requirement of the Qemu bundle, but one build before that it was the "qemu" requirement of the Qemu bundle… When you notice no progress for a longer period of time, log in over SSH again and htop/top should show you zombie processes.
Hope you'll be able to reproduce it as well :)
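For anyone following the reproduction path above without htop installed, the zombie check can be sketched in plain sh. This is only an illustration, not part of Sofin: the function names `filter_zombies` and `list_zombies` are made up here, and the sketch assumes that `ps -axo pid,ppid,stat,comm` prints one header line followed by one row per process.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Sofin) for spotting zombie processes.

# Keep only rows whose state column starts with "Z" (zombie).
# Expects "pid ppid stat comm" rows on stdin, preceded by one header line.
filter_zombies() {
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $3, $4 }'
}

# List zombies on the running system as "pid ppid stat comm".
list_zombies() {
    ps -axo pid,ppid,stat,comm | filter_zombies
}
```

Calling `list_zombies` in the second SSH session prints one line per zombie; the second column is the parent PID, which is usually the process worth inspecting when a build stops making progress.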
The issue was later reproduced in similar circumstances on:
- x86_64 machine with pkg software build
- ThunderX2 machine with pkg software build
Both were running poudriere, not using my script…
I built 13.0-RC1 today and am trying to reproduce the issue again. What I did with BETA2-4 was a hack: I created a symlink from /usr/lib/libncursesw.so to /usr/lib/libncursesw.so.8 to be able to run old software from FreeBSD 12.2. Is it possible that, after upgrading ncursesw, old code triggers some issue causing zombies? (My zsh 5.8 is linked with ncursesw.) It's unlikely, but I think it's worth mentioning :)
I was completely wrong: the same issue occurred while building perl 5.32 as a Qemu dependency on 13.0-RC1: http://s.verknowsys.com/a3b2a77a65a230a4b69f85d9f9862f2c.png Now even doubled. No hacks with ncursesw this time.
The same issue is confirmed on RC2 and RC3. On RC3 it feels even easier to reproduce.
(In reply to dmilith from comment #3) Could you provide step-by-step instructions to set up the context for the failure and then to do what demonstrates the failure in this poudriere context (13.0-RC3+ or main [14])? How many complications (such as virtual machine involvement) can be avoided and still get the problem? At this point, not having any specific context in common with you, I've no good clue on how to test if I can reproduce the problem, including how to avoid involving non-essential things relative to the problem, things that I'm not familiar with. I do port builds via poudriere-devel and I have in the past done cross builds of ports via poudriere (amd64->aarch64 and amd64->armv7). I've not done cross builds of ports in a long time. Cross builds involved qemu executing things and some such were getting hung up a lot back then. I gave up on it. Looking around it seems that some improvements in the area have been made, at least for amd64->aarch64. But I've not tested with the qemu updates.
(In reply to Mark Millard from comment #7) By the way: I have access to systems that are based on . . .
amd64 (16 cores, so 32 threads)
aarch64 (only 4 cores)
armv7 (only 4 cores)
old powerpc64 (2 sockets/1 core each)
old powerpc (2 sockets/1 core each or 1 socket with just 1 core)
As it stands, the systems are based on non-debug builds of main [14] as far as FreeBSD use goes. In more detail, based on:
merge-base: 7381bbee29df959e88ec59866cf2878263e7f3b2
merge-base: CommitDate: 2021-03-12 20:29:42 +0000
7381bbee29df (freebsd/main, freebsd/HEAD, pure-src, main) cam: Run all XPT_ASYNC ccbs in a dedicated thread
From <https://lists.freebsd.org/pipermail/freebsd-current/2021-March/079270.html>:
> … doesn't happen on 12.2, 12.1, 12.0 or older releases. …
If not reproducible with 12.2-RELEASE-p3:
* was it reproducible with 12.2-RELEASE-p4 kernel with 12.2-RELEASE-p4 userland; and
* is it reproducible with 12.2-RELEASE-p4 kernel with 12.2-RELEASE-p5 userland?
(I'd like to read your test results alongside my own results, some of which are not yet coherent, for a different bug …)
(In reply to Graham Perrin from comment #9)
> * was it reproducible with 12.2-RELEASE-p4 kernel with 12.2-RELEASE-p4 userland; and
> * is it reproducible with 12.2-RELEASE-p4 kernel with 12.2-RELEASE-p5 userland?
As I mentioned, I use 12.2 extensively on a few production systems. No such issue ever happened on 12.x. I'd be super happy to test it for you if you have an idea of what is wrong :)
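The kernel/userland combinations asked about above can be recorded alongside each test result. A minimal sketch, assuming a stock FreeBSD install where `freebsd-version -k` reports the installed kernel and `freebsd-version -u` the userland; the helper name `compare_versions` is hypothetical and it only compares two strings:

```shell
#!/bin/sh
# Hypothetical helper: report whether kernel and userland patch levels match.
# Usage: compare_versions KERNEL_VERSION USERLAND_VERSION
compare_versions() {
    kernel=$1
    userland=$2
    if [ "$kernel" = "$userland" ]; then
        echo "kernel and userland match: $kernel"
    else
        echo "kernel=$kernel userland=$userland"
    fi
}
```

On a FreeBSD host it could be called as `compare_versions "$(freebsd-version -k)" "$(freebsd-version -u)"`, so that each test report states exactly which combination was running.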
(In reply to Mark Millard from comment #8) I have:
- one amd64, 8 cores, 8th gen i7, 64G RAM dedicated machine - my native build host
- my workstation - also amd64, 8 cores, 3rd gen i7 3370, 32G RAM (where I do builds and tests under Vmware Fusion before I try with the native build host).
- other 10+ workstations/ servers/ DO instances using the system and software I prepare.
A few more notes:
- Because my workstation uses an Ivy Bridge CPU, I override CFLAGS to be sure that both my systems and software are compatible with Intel CPUs since 3rd gen.
- My software build script uses ZFS a _lot_ under the hood. Every software PREFIX is a ZFS dataset that is mounted, filled, unmounted and sent on every build. Every "software build" is an lz4 compressed ZFS dataset sent to a file. ZFS got a major upgrade in 13, maybe the issue is related?
- I use the tmpfs feature a lot. By default, (almost) every build is both ccached and works in a RAM disk (I'm quite sure the zombie processes issue happens when invoked from a RAM disk that's temporarily mounted under a ZFS dataset mount). I also use LTO caching (if enabled for a given software definition). I've been trying to use all the system features I can (just because I can and I get nice results from it).
- My system build configs are public: https://github.com/VerKnowSys/svdOS/blob/master/etc/src.conf https://github.com/VerKnowSys/svdOS/blob/master/etc/make.conf
- I have used almost the same scripts (with minor changes) for builds since FreeBSD 9.0 (when I started having fun with this stuff) - mostly to get reproducible builds among OS versions.
(In reply to dmilith from comment #11) But comment #3 indicates that you have replicated the problem without involving your scripts (unless I misunderstand the wording). So it would seem that no details specific to the scripts' operation need to be involved but things common with your poudriere example from comment #3 likely would be involved. My context does not involve any zfs file systems (just UFS+SU) and my normal configuration avoids tmpfs use, including in poudriere use. So if I could repeat the steps for comment #3 it would likely provide a cross check on what is involved in the failure vs. not involved in the failure, depending on if it got stuck zombie processes or not.
FYI: I'm testing with a fresh build of 13.0-RC5, but after a few hours, I'm unable to reproduce it anymore. I guess that commit https://github.com/freebsd/freebsd-src/commit/40d278253d205f35ec368c93b7400baddda1b81e fixed the problem?
(In reply to dmilith from comment #13) also with https://lists.freebsd.org/pipermail/freebsd-stable/2021-March/093371.html
(In reply to dmilith from comment #13) Where did you get a copy of 13.0-RC5 hours ago? I still do not see it.
(In reply to Mark Millard from comment #15) Never mind: Looks like you built your own.
(In reply to dmilith from comment #13)
> FYI: I'm testing with fresh build of 13.0-RC5, but after a few hours,
> I'm unable to reproduce it anymore.
>
> I guess that … fixed the problem?
Let's assume so. Thank you.
<https://cgit.freebsd.org/src/commit/?h=releng/13.0&id=40d278253d205f35ec368c93b7400baddda1b81e>
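To confirm that a locally built tree actually contains the commit linked above, an ancestry check with git can be sketched as follows. The helper name `has_commit` is hypothetical, and the sketch assumes the source tree is a git checkout. It only detects that exact hash: a cherry-pick of the same change onto another branch has a different hash and would not be found this way.

```shell
#!/bin/sh
# Hypothetical helper: succeed if COMMIT is an ancestor of REF (default HEAD)
# in the git checkout at REPO_DIR.
# Usage: has_commit REPO_DIR COMMIT [REF]
has_commit() {
    repo=$1
    commit=$2
    ref=${3:-HEAD}
    git -C "$repo" merge-base --is-ancestor "$commit" "$ref"
}
```

For example, assuming the sources are checked out in /usr/src: `has_commit /usr/src 40d278253d205f35ec368c93b7400baddda1b81e && echo "fix present"`.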