Bug 278703 - Forking on 13.3 restricting to same CPU core
Summary: Forking on 13.3 restricting to same CPU core
Status: Closed DUPLICATE of bug 278845
Alias: None
Product: Base System
Classification: Unclassified
Component: threads
Version: 13.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-threads (Nobody)
Reported: 2024-05-02 22:02 UTC by cbl
Modified: 2024-10-25 13:08 UTC (History)

Description cbl 2024-05-02 22:02:04 UTC
We have an in-house PHP script that processes images. We've been running it for years, since FreeBSD 9.x or 10.x at least. It forks itself a number of times to increase the number of images we can process per second. Sounds normal, right?

On FreeBSD 13.2 we've been running great, different CPU cores leveraged per fork. Now when upgrading to 13.3 or 14.0, the php forked processes all end up leveraging the SAME CPU core, and that slows us waaay down as we're unable to leverage all the other cores of the server. 

I've tried a 13.2 userland with a temporarily booted 13.3 kernel, and the issue does NOT occur. This suggests the scheduler is fine. However, when using a 13.3 userland AND the corresponding 13.3 kernel, the issue DOES occur and we're only able to use a single core.

Same with a fresh install of 14.0 userland/kernel on a totally separate box.

We're seeing this issue on multiple servers with 13.3 and 14.0 using php82 and php81.

It does not matter whether we run the pkg php82 binaries from 13.2 on the 13.3 userland/kernel or the 13.3 pkg php82; both show the same issue.

Something in the userland/libraries in 13.3 apparently restricts the forked php processes from leveraging any additional CPU cores.

Also note, I read that LLVM was upgraded from v14.x to v17.x in 13.3-RELEASE. I installed llvm14 from pkg and recompiled php82 and php-extensions from source using llvm14, and still no joy.

Any ideas where/what/how/why?
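
A minimal sketch for observing the symptom described above: fork a few workers and have each one report the CPUs it is allowed to run on. This is an illustration in Python, not the original PHP script. The function names are hypothetical, and it assumes a Python build that exposes os.sched_getaffinity (the helper returns None where it is missing). Plain Python does not load libomp, so this only shows how to inspect affinity in forked children; it is not a reproduction of the bug.

```python
import os


def current_affinity():
    """Return the set of CPUs this process may run on, or None if unsupported."""
    getter = getattr(os, "sched_getaffinity", None)  # not available everywhere
    return set(getter(0)) if getter else None


def affinity_of_children(n_children):
    """Fork n_children workers; each reports its own CPU affinity over a pipe."""
    results = {}
    for index in range(n_children):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:  # child: report and exit without running parent cleanup
            os.close(read_fd)
            cpus = current_affinity()
            if cpus is None:
                text = "unsupported"
            else:
                text = ",".join(str(cpu) for cpu in sorted(cpus))
            os.write(write_fd, text.encode())
            os.close(write_fd)
            os._exit(0)
        # parent: read the child's report until EOF, then reap the child
        os.close(write_fd)
        with os.fdopen(read_fd) as reader:
            results[index] = reader.read()
        os.waitpid(pid, 0)
    return results


if __name__ == "__main__":
    for index, cpus in affinity_of_children(4).items():
        print(f"child {index}: cpus {cpus}")
```

On a healthy system each child would be expected to list the same CPUs as the parent; a child that lists a single CPU would match the behavior reported here.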
Comment 1 amistry 2024-05-03 13:21:54 UTC
Have you tried fiddling around with cpuset to see if you can restore the old behavior?  We've got some similar scripts on some old servers that I'm concerned about when we upgrade.  I'm assuming you're using pcntl_fork()?
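
A rough sketch of how the cpuset suggestion could be checked from a script: run `cpuset -g -p <pid>` for each forked child and parse the CPU list. The helper names are hypothetical, and the parser assumes output lines of the form `pid 1234 mask: 0, 1, 2, 3`, skipping any "domain policy" line; treat that format as an assumption and verify it against cpuset(1) on the target release.

```python
import subprocess


def parse_cpuset_mask(output):
    """Extract the CPU list from `cpuset -g -p PID` output (assumed format)."""
    for line in output.splitlines():
        # Skip NUMA "domain policy" lines and anything without a mask.
        if "domain policy" in line or "mask:" not in line:
            continue
        cpu_text = line.split("mask:", 1)[1]
        return sorted(int(tok) for tok in cpu_text.split(",") if tok.strip())
    return []


def cpus_for_pid(pid):
    """Ask cpuset(1) which CPUs the given process may run on."""
    done = subprocess.run(
        ["cpuset", "-g", "-p", str(pid)],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_cpuset_mask(done.stdout)
```

cpuset(1) also accepts `-l <cpu-list> -p <pid>` to change a process's CPU list. Whether that is enough to undo the pinning described in this report is not established here.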
Comment 2 Vladimir Druzenko freebsd_committer freebsd_triage 2024-05-03 13:31:26 UTC
Is the behavior the same on single-socket and multi-socket hosts (if it's a VM, check the CPU configuration)? Could it be related to NUMA?
Comment 3 cbl 2024-05-03 15:55:34 UTC
(In reply to Vladimir Druzenko from comment #2)

A single socket machine also has the issue on 13.3. Forked children are locked to the same core.
Comment 4 cbl 2024-05-03 15:57:16 UTC
(In reply to amistry from comment #1)

We haven't done anything with cpuset yet. pcntl_fork is what we're using. Sample code found here: https://github.com/php/php-src/issues/14117
Comment 5 cbl 2024-05-05 00:18:16 UTC
More troubleshooting.. 

The issue is in /usr/lib/libomp.so from llvm17, which is the default in FreeBSD 13.3.

I installed a few other LLVM versions via pkg (and even compiled llvm-devel 19.x) to test things. When I copy the libomp.so from llvm14 or llvm15 into /usr/lib/, the application starts using all the CPU cores as expected. So something changed in llvm16 and later that is causing our linked application (ImageMagick) to limit itself to a single CPU core. Since FreeBSD 13.2 was still using llvm14 by default, the problem was not occurring then.

llvm14-14.0.6_5 - WORKS
llvm15-15.0.7_10 - WORKS
llvm16-16.0.6_10 - BROKE
llvm17-17.0.6 (13.3 default) - BROKE
llvm19-19.0.d20240426 - BROKE

I'd welcome input on what to try next, or what to report to the LLVM group to fix the issue.
Comment 6 cbl 2024-05-07 19:29:26 UTC
Looks like I have some traction on this LLVM bug, and it should have a PR soon.

https://github.com/llvm/llvm-project/issues/91098

It's a bug in the atfork() handler on Unix systems, plus the logic that reinitializes the child process. The current library incorrectly sets the child process's affinity to compact, which roughly translates to "pin consecutive threads to consecutive cores", even when the user hasn't set KMP_AFFINITY to anything. So every child process was pinned to the first core instead of being allowed to use the entire system.
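
The effect described above can be sketched as a toy model. This is not libomp code: the function names are hypothetical, only the "none" and "compact" modes are modelled, and "compact" is simplified to "the child's main thread lands on the first CPU of the parent's mask".

```python
def reinit_mode_after_fork(kmp_affinity, buggy):
    """Pick the affinity mode used when the child re-initializes.

    kmp_affinity is the user's KMP_AFFINITY value, or None when unset.
    """
    if kmp_affinity is not None:
        return kmp_affinity  # an explicit user setting is honored
    # Unset: the buggy path falls back to "compact", the fixed path to "none".
    return "compact" if buggy else "none"


def main_thread_cpus(mode, parent_mask):
    """Model which CPUs the child's main thread may run on under a mode."""
    if mode == "compact":
        return {min(parent_mask)}  # first thread pinned to the first core
    if mode == "none":
        return set(parent_mask)  # no pinning: the whole inherited mask
    raise ValueError(f"mode not modelled in this sketch: {mode}")


if __name__ == "__main__":
    mask = {0, 1, 2, 3}
    for buggy in (True, False):
        mode = reinit_mode_after_fork(None, buggy)
        print(f"buggy={buggy}: mode={mode}, cpus={sorted(main_thread_cpus(mode, mask))}")
```

In this model every child forked by the buggy path ends up on the same single CPU, which is consistent with the slowdown reported in the description.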


Curious how hard it'd be to get the PR fix into 13.x and/or 14.x.

Thanks.
Comment 7 Alan Somers freebsd_committer freebsd_triage 2024-05-07 19:32:17 UTC
(In reply to cbl from comment #6)
Good find.  This sounds pretty serious.  Does it affect stable/14?  If so, we should definitely fix it before 14.1.  There's still time to do that.
Comment 8 cbl 2024-05-07 19:40:49 UTC
(In reply to Alan Somers from comment #7)
I tested the 14.0 and 13.3 base LLVM versions and both are impacted. I have not tested stable yet. I also compiled llvm-devel, which is v19.x, and it is also impacted. Based on my testing, it's safe to assume all versions from v16.x onward are impacted. I'd love to see the PR make it into 14.1, and pushed to the appropriate ports as patches for llvm16/llvm17/llvm18, etc., until it makes it into a newer llvm release.
Comment 9 cbl 2024-05-07 22:50:51 UTC
PR from llvm is out that fixes this issue:

https://github.com/llvm/llvm-project/pull/91391
Comment 10 Mark Johnston freebsd_committer freebsd_triage 2024-10-24 18:46:36 UTC
(In reply to cbl from comment #9)
It looks like LLVM 19 brought this fix in.

I suspect we should cherry-pick that patch into stable/14 and stable/13 shortly, so that it lands in the upcoming 14.2 and 13.5 releases.
Comment 11 Dimitry Andric freebsd_committer freebsd_triage 2024-10-24 22:09:02 UTC
(In reply to Mark Johnston from comment #10)
As far as I can see, this is already in stable/14 and stable/13. I MFC'd the fixes in 91df7d335dd44fa3cf506b35987d791502613ed4 and e2de08bf70f4343ebcb455dedf1b77ac0d67f5ca.
Comment 12 Mark Johnston freebsd_committer freebsd_triage 2024-10-25 13:08:55 UTC

*** This bug has been marked as a duplicate of bug 278845 ***