| Summary: | Forking on 13.3 restricting to same CPU core | ||
|---|---|---|---|
| Product: | Base System | Reporter: | cbl |
| Component: | threads | Assignee: | freebsd-threads (Nobody) <threads> |
| Status: | New --- | ||
| Severity: | Affects Only Me | CC: | amistry, asomers, vvd |
| Priority: | --- | ||
| Version: | 13.3-RELEASE | ||
| Hardware: | amd64 | ||
| OS: | Any | ||

Description
cbl
2024-05-02 22:02:04 UTC

Have you tried fiddling around with cpuset to see if you can restore the old behavior? We've got some similar scripts on some old servers that I'm concerned about when we upgrade. I'm assuming you're using pcntl_fork()?

Is the behavior the same on single-socket and multi-socket hosts (if it's a VM, check the CPU configuration)? Can it be related to NUMA?

(In reply to Vladimir Druzenko from comment #2)
A single-socket machine also has the issue on 13.3. Forked children are locked to the same core.

(In reply to amistry from comment #1)
We haven't done anything with cpuset yet. pcntl_fork is what we're using. Sample code can be found here: https://github.com/php/php-src/issues/14117

More troubleshooting. This issue is occurring in /usr/lib/libomp.so with llvm17, which is the default with FreeBSD 13.3. I installed a few other LLVM versions via pkg (and even compiled llvm-devel 19.x) to test things. When I copy the libomp.so from llvm14 or llvm15 into /usr/lib/, the application starts using all the CPU cores as expected. So something changed in llvm16 and later that is causing our linked application (ImageMagick) to limit itself to a single CPU core. Since FreeBSD 13.2 was still using llvm14 by default, the problem was not occurring then.

llvm14-14.0.6_5 - WORKS
llvm15-15.0.7_10 - WORKS
llvm16-16.0.6_10 - BROKE
llvm17-17.0.6 (13.3 default) - BROKE
llvm19-19.0.d20240426 - BROKE

I'd welcome input on what to try next, or what to report to the LLVM group to fix the issue.

Looks like I have some traction on this LLVM bug, and it should have a PR soon. https://github.com/llvm/llvm-project/issues/91098

It's a bug in the atfork() handler on Unix systems, plus the logic that reinitializes the child process. The current library incorrectly sets the child process's affinity to compact, which roughly translates to "pin consecutive threads to consecutive cores", even when the user hasn't set KMP_AFFINITY to anything. So every child process was pinned to the first core instead of being allowed to run on the entire system.

Curious how hard it'd be to get the PR fix into 13.x and/or 14.x. Thanks.

(In reply to cbl from comment #6)
Good find. This sounds pretty serious. Does it affect stable/14? If so, we should definitely fix it before 14.1. There's still time to do that.

(In reply to Alan Somers from comment #7)
I tested the 14.0 and 13.3 base LLVM versions and both are impacted. I have not tested stable yet. I also compiled llvm-devel, which is v19.x, and it is also impacted. Based on my testing, it is safe to assume that all versions from v16.x onward are impacted. I'd love to see the PR make it into 14.1, and pushed to the appropriate ports as patches for llvm16/llvm17/llvm18, etc. until it makes it into a newer LLVM release.

The PR from LLVM that fixes this issue is out: https://github.com/llvm/llvm-project/pull/91391
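For anyone who wants to check whether forked children end up pinned as described in comment #6, here is a minimal sketch in Python that forks once and compares the CPU set the parent may run on with the one the child reports. The helper name `affinity_of_forked_child` is made up for illustration. The script does not load libomp, so on its own it should report identical sets. To exercise the bug, the child branch would have to call into the OpenMP-linked code (ImageMagick in this report) before reading its affinity. It also assumes `os.sched_getaffinity` exists on the platform, which is not the case everywhere. On FreeBSD, `cpuset -g -p <pid>` should show the same information for a running process.

```python
import os


def affinity_of_forked_child():
    """Fork once and return (parent_cpus, child_cpus) as sorted lists."""
    if not hasattr(os, "sched_getaffinity"):
        # Assumption: the platform exposes sched_getaffinity to Python.
        raise RuntimeError("os.sched_getaffinity is not available here")

    parent_cpus = sorted(os.sched_getaffinity(0))
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: report the CPUs this process may run on, then exit
        # immediately without running the parent's cleanup handlers.
        os.close(read_fd)
        try:
            cpus = sorted(os.sched_getaffinity(0))
            os.write(write_fd, ",".join(map(str, cpus)).encode())
        finally:
            os.close(write_fd)
            os._exit(0)

    # Parent: read the child's report until EOF, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read().decode()
    os.waitpid(pid, 0)

    child_cpus = [int(c) for c in data.split(",") if c]
    return parent_cpus, child_cpus


if __name__ == "__main__":
    parent, child = affinity_of_forked_child()
    print("parent CPUs:", parent)
    print("child CPUs: ", child)
    if len(child) < len(parent):
        print("child is restricted to fewer CPUs than the parent")
```

A child set that is smaller than the parent's (for example, a single CPU) would match the symptom reported here.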