Bug 274135 - lang/php81: php-fpm processes fail to fork cleanly and procfs needed
Status: Open
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Muhammad Moinur Rahman
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-09-27 22:17 UTC by Henrik Hudson
Modified: 2024-03-26 23:30 UTC
CC List: 3 users

See Also:
bugzilla: maintainer-feedback? (bofh)


Attachments
truss output for PID 16396 (28.01 KB, text/plain), 2023-10-04 21:15 UTC, Henrik Hudson
truss output for PID 66774 (3.99 KB, text/plain), 2023-10-04 21:16 UTC, Henrik Hudson

Description Henrik Hudson 2023-09-27 22:17:50 UTC
I opened the below a few weeks back:
https://github.com/php/php-src/issues/12157

However, I wanted to open a bug here as well. I maintain a custom poudriere package repo and sometime after April 2023 the PHP port seems to have gone a little wonky. I don't know if this is the fault of poudriere, 13.2, the PHP port, or PHP proper but I'm out of ideas. I've been running PHP5.4x+ and FreeBSD11+ with more or less the same setup for a loooong time. I update the lang/phpXY options and lang/phpXY-extensions options files (trying to keep them more or less the same) as we roll forward builds and we're now on php81.

See the github post for my PHP configs, but we basically run PHP-FPM via local sockets and each site gets its own "pool" user and talk over sockets with Apache. Older servers will have 1 master "root" process and 100s of PHP-FPM processes spread across 20-50 UIDs depending on server load.

1) PHP-FPM processes started looking for /procfs for some reason. I ended up having to add "USE_PROCFS=no" to my poudriere.conf build to stop that. This is an undocumented (it seems) option that I just happened to run across via GoogleFu. It does fix that problem, but what changed in FreeBSD 13.2 or the port so that it even started to build like that? I made zero changes outside of updating FreeBSD, poudriere and my ports tree since I built 8.1.x sometime in April, and this has never shown up in the 5+ years I've been doing PHP builds.

2) the bigger issue is that the PHP-FPM master seems to partially fork a process and it gets stuck as a new "master" PHP process and starts consuming 100% CPU. Once I get 5+ or so of these over a few days it starts to impact services and I either re-start PHP-FPM or kill the non-actual "master" processes. The odd thing is the PHP-FPM debug logs show it forking a process for the right user, but it just doesn't "switch" to that user. If I "kill -9 PID" the wonky process then PHP-FPM will immediately start a new process with the right UID/GID no matter the server's actual traffic load or requirements which leads me to believe it's either a bug caused by malicious requests (I can't find anything) or a "race condition" in the fork code.

Thoughts? Ideas?
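
For anyone trying to spot the stuck processes from point 2 without watching top, here is a minimal sketch. It assumes the real master has been reparented to PID 1 (which may not hold inside a jail) and that a stuck fork still runs as root; the function name find_stuck_fpm is made up for illustration.

```shell
# Print PIDs of php-fpm processes that still run as root but are not
# the top-level master. Assumption: the real master's parent is PID 1.
# Input columns must be: pid ppid user %cpu command...
find_stuck_fpm() {
    awk '$3 == "root" && $2 != 1 && $5 ~ /^php-fpm/ { print $1 }'
}

# Usage on a live system:
#   ps -axo pid,ppid,user,pcpu,command | find_stuck_fpm
```

Matching on the first word of the command column, rather than on the whole line, keeps the awk process itself out of the results.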
Comment 1 Muhammad Moinur Rahman freebsd_committer freebsd_triage 2023-09-28 20:06:19 UTC
Unfortunately, I do not have a php setup to test this at the moment. I used to work for a company where we used php/nginx massively and I never saw this happen, although we were running php82.

I will follow the chain at the bug report, but unless there are any strong solutions I do not have anything to bring to the table.
Comment 2 Henrik Hudson 2023-10-04 21:15:43 UTC
Created attachment 245440
truss output for PID 16396
Comment 3 Henrik Hudson 2023-10-04 21:16:14 UTC
Created attachment 245441
truss output for PID 66774
Comment 4 Henrik Hudson 2023-10-04 21:25:12 UTC
So, I've attached 2 files. The PID 66774 one is a "bad" fork and the PID 16396 is a "good" fork of a PHP FPM child process off the master.

The "bad" fork gets stuck in a loop, never picks up the UID it needs and just loops with:
66774: sched_yield()                             = 0 (0x0)

forever until killed.

I re-installed a package I had built in April and it started exhibiting the same issues. So, my guess is I just didn't notice the issue until the server started getting some traffic and there is something in 13.2-RELEASE which changed.

I still find it odd that my poudriere builds started loading procfs starting with 13.2, but the bigger deal is why these don't finish their "fork" correctly. Is this an "OS" issue?

I don't know what any of the truss output really means.
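
One rough way to tell a spinning process from a healthy one in a saved truss log is to count how many of its lines are sched_yield() calls. The sketch below assumes the log was saved to a file (the file name in the usage line is hypothetical) and the helper name yield_ratio is made up for illustration.

```shell
# Print "<sched_yield lines>/<total lines>" for a saved truss log.
yield_ratio() {
    awk '/sched_yield\(\)/ { y++ } END { printf "%d/%d\n", y, NR }' "$1"
}

# Usage (file name is a placeholder):
#   yield_ratio truss-66774.txt
```

A ratio close to 1 would suggest the process is busy-waiting on something rather than doing real work, which matches the "bad" fork described above.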
Comment 5 Muhammad Moinur Rahman freebsd_committer freebsd_triage 2023-10-04 21:42:32 UTC
I would just like to add a pointer: from 13.2 onwards, ASLR is enabled by default. Could that be the issue?
Comment 6 Henrik Hudson 2023-10-06 19:41:55 UTC
Good catch / suggestion. I have disabled ASLR via /etc/sysctl.conf:
#disable ASLR
kern.elf64.aslr.enable=0
kern.elf64.aslr.pie_enable=0

and rebooted both systems. So far, in 2+ days, no issues. If I don't see any issues by Monday, where do you suggest we go? Kernel bug report? or ????
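
To confirm that the two settings above actually took effect after the reboot, the sysctl output can be checked with a small helper. This is only a sketch; the function name aslr_off is made up for illustration, and it assumes the usual "name: value" output format of sysctl.

```shell
# Succeed only if both ASLR knobs read from stdin are present and 0.
# Expects lines such as "kern.elf64.aslr.enable: 0".
aslr_off() {
    awk -F': *' '
        /^kern\.elf64\.aslr\.(pie_)?enable/ { seen++; if ($2 != 0) bad++ }
        END { exit ((seen > 0 && bad == 0) ? 0 : 1) }'
}

# Usage on a live system:
#   sysctl kern.elf64.aslr.enable kern.elf64.aslr.pie_enable | aslr_off \
#       && echo "ASLR is off system-wide"
#
# A narrower alternative, if proccontrol(1) on your release supports the
# aslr mode, is to disable it for a single command instead of globally:
#   proccontrol -m aslr -s disable <command>
```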
Comment 7 Muhammad Moinur Rahman freebsd_committer freebsd_triage 2023-10-06 19:53:50 UTC
(In reply to Henrik Hudson from comment #6)
You don't go anywhere. :D

1. Inform upstream that with ASLR enabled these things happen.
2. Let me know here. I will share a patch with you to specifically disable ASLR on the php binaries. But this will be a non-default OPTION, so you will have to customize package builds or use poudriere at your end. Sorry for that. Moreover, this will not be merged into quarterly. On top of that, this is actually a bad time for php in the FreeBSD world, as I will be switching the PHP default to 8.2 on 25th November. So I am a bit skeptical about adding anything new to the 8.2 line before the end of the year. :'(
Comment 8 Henrik Hudson 2023-10-06 22:23:13 UTC
Thanks. I did update the github issue with the ASLR info and workaround.

For us, personally, we only have 2 13.2 servers with 8.1 and will be switching to 8.2 once it's released to ports. I'm fine just disabling ASLR for now, but if you think 8.2 will have the same issue then yes, I'd like a patch so I can build for 13.2 with ASLR on going forward. 

I find it odd this isn't a bigger issue though? or is it just related to socket code? It seems this "should" be impacting more than just me? :)

I already apply "out-of-tree" patches to specific ports while keeping the rest updated, so no worries there. I can merge a custom patch.
Comment 9 Muhammad Moinur Rahman freebsd_committer freebsd_triage 2023-10-06 22:40:55 UTC
(In reply to Henrik Hudson from comment #8)
8.2 is already in the tree. We have 4 different versions of php in the tree: 8.0 (will be removed at the end of this year), 8.1 (current DEFAULT), 8.2 (upcoming DEFAULT) and 8.3 (still in BETA but usable). You can switch to any of these versions at any time by adding DEFAULT_VERSIONS= php=8.* to your poudriere <JAIL>-<PORTS>-<SET>-make.conf. :D

I can't say for certain. Just search for php and ASLR and the results will be less than happy for you. :D

There is another bug here with apache and mod_php for the same reason. So the number of affected users is larger than expected, but your specific case only shows up under heavier load. And nowadays there are actually very few people who really run high-load php applications.

The patch will be in the ports tree, but the option will not be enabled by default. So you will have to add something like the following:
lang_php81_SET+=NOASLR

in the same make.conf as above. It's already too late for tonight's build, so I will create the patch, build-test it in tomorrow's build and submit it on Sunday. Have a nice weekend.
Comment 10 commit-hook freebsd_committer freebsd_triage 2023-10-07 11:02:09 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=28374e2250ce52ea7f88bef2e62c6e56181f573b

commit 28374e2250ce52ea7f88bef2e62c6e56181f573b
Author:     Muhammad Moinur Rahman <bofh@FreeBSD.org>
AuthorDate: 2023-10-07 10:56:44 +0000
Commit:     Muhammad Moinur Rahman <bofh@FreeBSD.org>
CommitDate: 2023-10-07 11:00:34 +0000

    lang/php83: Add non-default NOASLR option

    In certain corner cases php-fpm is creating zombie processes specially
    from 13.2-RELEASE onwards. The exact reason has not been pin-pointed but
    it looks like disabling ASLR solves the problem. So add a non-default
    option NOASLR to allow the user build without ASLR.

    This patch will not go into lang/php80 but will go in lang/php81 and
    lang/php82 after some more extensive tests.

    PR:             274135
    Reported by:    ports@thelanman.net

 lang/php83/Makefile | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)
Comment 11 Muhammad Moinur Rahman freebsd_committer freebsd_triage 2023-10-07 11:05:13 UTC
I have committed and added an option NOASLR for lang/php83. The patch should be more or less the same for all versions of php. Let me know if you can apply the same patch to php81 and test it further. If not, ask me to create one here. For this to work you have to add the following line in /usr/local/etc/poudriere.d/<JAIL>-<PORTSTREE>-<SET>-make.conf:
lang_php83_SET+=NOASLR
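
A small sketch of adding that line without duplicating it on repeated runs. The helper name enable_noaslr is made up for illustration, and the second argument only makes sense for a port that actually carries the NOASLR option (at the time of this comment, only lang/php83).

```shell
# Append the NOASLR option line to a poudriere make.conf exactly once.
# $1: path to the make.conf, $2: port suffix (defaults to php83).
enable_noaslr() {
    conf=$1
    line="lang_${2:-php83}_SET+=NOASLR"
    grep -qxF "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
}

# Usage (the <...> parts are placeholders for your own names):
#   enable_noaslr /usr/local/etc/poudriere.d/<JAIL>-<PORTSTREE>-<SET>-make.conf
#   poudriere bulk -j <JAIL> -p <PORTSTREE> -z <SET> lang/php83
```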
Comment 12 Henrik Hudson 2023-10-07 22:42:38 UTC
I may have spoken too soon. Disabling ASLR certainly seemed to help, but after 2 days I did get another "stuck" fork that never completes switching to its new UID.
Comment 13 Henrik Hudson 2023-11-01 21:27:34 UTC
Additional Googlefu has yielded this:
https://lists.freebsd.org/archives/freebsd-apache/2022-November/000289.html

So, I'm in the process of re-compiling GraphicsMagick without OpenMP support (we don't use ImageMagick) and giving that a shot.
Comment 14 Henrik Hudson 2023-11-15 01:29:14 UTC
The issue has not re-occurred in the last 10 days since rebuilding GraphicsMagick without OpenMP.

Since this issue didn't occur prior to FreeBSD 13.2 with the same ports options, would this be considered a FreeBSD "core" bug? Should it be pushed elsewhere? or ???
Comment 15 Adam Ellis 2024-03-22 05:24:36 UTC
We are also seeing this same issue, and it began happening when we updated to FreeBSD 13.2 as well. We have tried PHP 8.2 and 8.3, and the problem still occurs. This appears to happen, on random occasions, at the moment that the Imagick and OpenMP modules initially load. Backtrace details for an example process that was locked in a sched_yield() infinite loop can be seen below:

********************
(gdb) thread apply all bt

Thread 1 (LWP 100499 of process 33248):
#0  sched_yield () at sched_yield.S:4
#1  0x00000008495aa491 in __kmp_wait_4_ptr (spinner=0x84960b51c <__kmp_initz_lock+28>, checker=1, pred=0x8495c4b50 <__kmp_bakery_check(void*, unsigned int)>, obj=0x84960b500 <__kmp_initz_lock>) at /usr/src/contrib/llvm-project/openmp/runtime/src/kmp_dispatch.cpp:2695
#2  0x00000008495c37c8 in __kmp_acquire_ticket_lock_timed_template (lck=0x0, gtid=<optimized out>) at /usr/src/contrib/llvm-project/openmp/runtime/src/kmp_lock.cpp:640
#3  __kmp_acquire_ticket_lock (lck=0x0, gtid=<optimized out>) at /usr/src/contrib/llvm-project/openmp/runtime/src/kmp_lock.cpp:645
#4  0x00000008495c9564 in __kmp_acquire_bootstrap_lock (lck=0x84960b51c <__kmp_initz_lock+28>) at /usr/src/contrib/llvm-project/openmp/runtime/src/kmp_lock.h:524
#5  __kmp_get_global_thread_id_reg () at /usr/src/contrib/llvm-project/openmp/runtime/src/kmp_runtime.cpp:253
#6  0x000000084958991e in __kmp_entry_thread () at /usr/src/contrib/llvm-project/openmp/runtime/src/kmp.h:3401
#7  kmpc_malloc (size=35590812956) at /usr/src/contrib/llvm-project/openmp/runtime/src/kmp_alloc.cpp:1098
#8  0x00000008496041b7 in __kmp_is_address_mapped (addr=0x847bad5c8) at /usr/src/contrib/llvm-project/openmp/runtime/src/z_Linux_util.cpp:2067
#9  0x00000008495d3925 in __kmp_register_library_startup () at /usr/src/contrib/llvm-project/openmp/runtime/src/kmp_runtime.cpp:6774
#10 0x00000008495c95c4 in __kmp_do_serial_initialize () at /usr/src/contrib/llvm-project/openmp/runtime/src/kmp_runtime.cpp:6966
#11 0x00000008495c9578 in __kmp_get_global_thread_id_reg () at /usr/src/contrib/llvm-project/openmp/runtime/src/kmp_runtime.cpp:255
#12 0x00000008495b40b2 in __kmp_api_omp_init_lock (user_lock=0x84960b51c <__kmp_initz_lock+28>) at /usr/src/contrib/llvm-project/openmp/runtime/src/kmp_ftn_entry.h:1083
#13 0x000000084b24773c in MagickCoreGenesis () from /usr/local/lib/libMagickCore-7.so.10
#14 0x000000084a26bd8e in zm_startup_imagick () from /usr/local/lib/php/20230831-debug/imagick.so
#15 0x000000000083b010 in zend_startup_module_ex ()
#16 0x000000000083bb10 in ?? ()
#17 0x0000000000853131 in zend_hash_apply ()
#18 0x000000000083b90c in zend_startup_modules ()
#19 0x0000000000766b0f in php_module_startup ()
#20 0x00000000009f51e9 in ?? ()
#21 0x00000000009f37a4 in ?? ()
#22 0x0000000000423720 in _start ()
(gdb)
Comment 16 Muhammad Moinur Rahman freebsd_committer freebsd_triage 2024-03-22 07:37:22 UTC
(In reply to Adam Ellis from comment #15)
Have you tried the suggestion from comment #14?
Comment 17 Adam Ellis 2024-03-22 14:02:52 UTC
(In reply to Muhammad Moinur Rahman from comment #16)
Unfortunately, we cannot simply disable OpenMP support since doing so results in a significant slowdown of the convert processes. Since we process a significant number of images every day, we need the process to run as quickly and as efficiently as possible.
Comment 18 Muhammad Moinur Rahman freebsd_committer freebsd_triage 2024-03-22 14:09:15 UTC
(In reply to Adam Ellis from comment #17)
Can you try disabling ASLR?
Comment 19 Adam Ellis 2024-03-22 23:34:16 UTC
(In reply to Muhammad Moinur Rahman from comment #18)
We are going to update to FreeBSD 13.3, which has a newer version of LLVM, and then recompile ImageMagick to see what happens with that. I'll report back and update this thread after we've tested that out a bit. Thank you for your suggestions!

The current version, on FreeBSD 13.2, is: FreeBSD clang version 14.0.5 (https://github.com/llvm/llvm-project.git llvmorg-14.0.5-0-gc12386ae247c)

And the newer version, with FreeBSD 13.3 is: FreeBSD clang version 17.0.6 (https://github.com/llvm/llvm-project.git llvmorg-17.0.6-0-g6009708b4367)
Comment 20 Adam Ellis 2024-03-26 23:30:08 UTC
Unfortunately, updating to FreeBSD 13.3 has caused other errors to occur, including random server lock-ups in the middle of the night that require manual reboots. So, our testing on that end has been fairly unsuccessful. We'll continue with other tweaks and updates to try to get it resolved, but, as of now, we don't have a solution to the sched_yield() issue that we're seeing with ImageMagick and OpenMP on FreeBSD 13.x.

Someone else has opened a bug report for it on the LLVM github project, so maybe that will result in something happening with it. https://github.com/llvm/llvm-project/issues/86684