We have an in-house PHP script that processes images. It has been running for years, since FreeBSD 9.x or 10.x at least. It forks itself a number of times to increase the number of images we can process per second. Sounds normal, right? On FreeBSD 13.2 it has been running great, with each fork using a different CPU core. After upgrading to 13.3 or 14.0, the forked PHP processes all end up on the SAME CPU core, which slows us waaay down because we can't use the other cores of the server. I've tried a 13.2 userland with a temporarily booted 13.3 kernel, and the issue does NOT occur, which suggests the scheduler is fine. However, with a 13.3 userland AND the corresponding 13.3 kernel, the issue DOES occur and we're only able to use a single core. Same with a fresh install of 14.0 userland/kernel on a totally separate box. We're seeing this on multiple servers with 13.3 and 14.0, using php82 and php81. Whether we use the pkg php82 binaries from 13.2 or the 13.3 pkg php82 on the 13.3 userland/kernel does not matter; both show the same issue. Something in the 13.3 userland/libraries apparently restricts the forked PHP processes from using any additional CPU cores. Also note, I read that LLVM was upgraded from v14.x to v17.x in 13.3-RELEASE. I installed llvm14 from pkg and recompiled php82 and php-extensions from source using llvm14, and still no joy. Any ideas where/what/how/why?
Have you tried fiddling around with cpuset to see if you can restore the old behavior? We've got some similar scripts on some old servers that I'm concerned about when we upgrade. I'm assuming you're using pcntl_fork()?
Is the behavior the same on single-socket and multi-socket hosts (if it's a VM, check the CPU configuration)? Can it be related to NUMA?
(In reply to Vladimir Druzenko from comment #2) A single socket machine also has the issue on 13.3. Forked children are locked to the same core.
(In reply to amistry from comment #1) We haven't done anything with cpuset yet. pcntl_fork is what we're using. Sample code found here: https://github.com/php/php-src/issues/14117
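For anyone wanting to check whether forked children are being restricted, a minimal sketch follows. It is written in Python rather than PHP purely for illustration, and it assumes the platform exposes os.sched_getaffinity (true on Linux; availability on a given FreeBSD/Python build is an assumption). The function name child_affinities is hypothetical. A plain script like this does not load libomp, so it is not expected to reproduce the problem by itself; it only shows how to compare each child's allowed CPU set against the parent's.

```python
import os


def child_affinities(num_children=4):
    """Fork num_children processes and return the CPU set each one reports.

    Each child sends its own affinity mask back to the parent over a pipe.
    Assumes os.sched_getaffinity is available on this platform.
    """
    results = []
    for _ in range(num_children):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: report the CPUs this process is allowed to run on.
            os.close(read_fd)
            cpus = sorted(os.sched_getaffinity(0))
            os.write(write_fd, ",".join(str(c) for c in cpus).encode())
            os.close(write_fd)
            os._exit(0)
        # Parent: read the child's report, then reap the child.
        os.close(write_fd)
        with os.fdopen(read_fd) as reader:
            data = reader.read()
        os.waitpid(pid, 0)
        results.append({int(c) for c in data.split(",") if c})
    return results


def restricted_children(num_children=4):
    """Return the reported CPU sets that differ from the parent's own mask."""
    parent = set(os.sched_getaffinity(0))
    return [cpus for cpus in child_affinities(num_children) if cpus != parent]
```

If restricted_children() returns an empty list, every child inherited the parent's full CPU set. A non-empty list containing single-CPU sets would match the symptom described in this report.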
More troubleshooting. The issue is in /usr/lib/libomp.so from llvm17, which is the default in FreeBSD 13.3. I installed a few other LLVM versions via pkg (and even compiled llvm-devel 19.x) to test things. When I copy the libomp.so from llvm14 or llvm15 into /usr/lib/, the application starts using all the CPU cores as expected. So something changed in llvm16 and later that causes our linked application (ImageMagick) to limit itself to a single CPU core. Since FreeBSD 13.2 was still using llvm14 by default, the problem did not occur there.
llvm14-14.0.6_5 - WORKS
llvm15-15.0.7_10 - WORKS
llvm16-16.0.6_10 - BROKE
llvm17-17.0.6 (13.3 default) - BROKE
llvm19-19.0.d20240426 - BROKE
I'd welcome input on what to try next, or what to report to the LLVM group to fix the issue.
Looks like I have some traction on this LLVM bug, and it should have a PR soon. https://github.com/llvm/llvm-project/issues/91098 It's a bug in the atfork() handler on Unix systems + logic in reinitializing the child process. The current library incorrectly sets the child process' affinity to compact, which roughly translates to "pin consecutive threads to consecutive cores", even when the user hasn't set KMP_AFFINITY to anything. So every child process was pinned to the first core instead of the entire system. Curious how hard it'd be to get the PR fix into 13.x and/or 14.x. Thanks.
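To illustrate why a "compact" policy collapses many forked children onto one core, here is a toy model in Python. It is not libomp's code: the function names are hypothetical, and real compact placement is topology-aware rather than a simple modulo. The assumption is only the rough description above, that consecutive threads go to consecutive cores, starting from the first core in every process.

```python
def compact_pinning(num_threads, num_cores):
    """Toy model of 'compact': thread i is pinned to core i % num_cores."""
    return [i % num_cores for i in range(num_threads)]


def cores_used_by_children(num_children, threads_per_child, num_cores):
    """Cores occupied when every child applies compact pinning independently.

    Each child numbers its own threads from 0, so each child starts
    again at core 0 regardless of how many siblings exist.
    """
    used = set()
    for _ in range(num_children):
        used.update(compact_pinning(threads_per_child, num_cores))
    return used
```

In this model, eight single-threaded children on a 16-core machine give cores_used_by_children(8, 1, 16) == {0}: every child pins its only thread to core 0, which matches the behavior reported here.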
(In reply to cbl from comment #6) Good find. This sounds pretty serious. Does it affect stable/14? If so, we should definitely fix it before 14.1. There's still time to do that.
(In reply to Alan Somers from comment #7) I tested the 14.0 and 13.3 base LLVM versions and both are impacted. I have not tested stable yet. I also compiled llvm-devel, which is v19.x, and it is also impacted. Based on my testing, it is safe to assume that v16.x and every later version are impacted. I'd love to see the PR make it into 14.1, and pushed to the appropriate ports as patches for llvm16/llvm17/llvm18, etc. until it makes it into a newer LLVM release.
PR from llvm is out that fixes this issue: https://github.com/llvm/llvm-project/pull/91391
(In reply to cbl from comment #9) It looks like LLVM 19 brought this fix in. I suspect we should cherry-pick that patch into stable/14 and stable/13 shortly, so that it lands in the upcoming 14.2 and 13.5 releases.
(In reply to Mark Johnston from comment #10) As far as I can see, this is already in stable/14 and stable/13. I MFC'd the fixes in 91df7d335dd44fa3cf506b35987d791502613ed4 and e2de08bf70f4343ebcb455dedf1b77ac0d67f5ca.
*** This bug has been marked as a duplicate of bug 278845 ***