| Summary: | Process enters in STOP state and doesn't respond to any signal. | | |
|---|---|---|---|
| Product: | Base System | Reporter: | Rupesh Pilania <rupeshpilania> |
| Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> |
| Status: | Open --- | | |
| Severity: | Affects Only Me | CC: | chris, emaste, markj |
| Priority: | --- | | |
| Version: | Unspecified | | |
| Hardware: | amd64 | | |
| OS: | Any | | |
| See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992 | | |
Description
Rupesh Pilania
2024-11-12 18:21:40 UTC
A similar issue was reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992

In order to debug this issue you will need to provide more information. What FreeBSD version are you using? What is the process? Is it easily reproducible, or does it happen occasionally? For kernel details, start with "procstat -kk <pid>" as in PR242724.

We are using 13.0-RELEASE-p13 amd64. It will take time for us to migrate to the latest supported versions due to application-level dependencies. We are struggling to reproduce it. It happens randomly on any of the VMs running on an ESXi host; the issue sometimes happens within 15 days, sometimes after 6 months. I will add the required details.

There are around 10 processes which wait in umtxqb. Adding reference for two.

./procstat -kk 3821
 PID    TID COMM     TDNAME KSTACK
3821 100260 process1 -      mi_switch+0xc1 thread_suspend_switch+0xc0 thread_single+0x69c exit1+0xc1 sigexit+0x133 postsig+0x251 kern_sigsuspend+0x2b7 sys_sigsuspend+0x31 amd64_syscall+0x741 fast_syscall_common+0xf8
3821 100617 process1 -      mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
3821 100633 process1 -      mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8

./procstat -kk 4065
 PID    TID COMM     TDNAME KSTACK
4065 100388 process2 -      mi_switch+0xc1 thread_suspend_switch+0xc0 thread_single+0x69c exit1+0xc1 sigexit+0x133 postsig+0x251 kern_sigsuspend+0x2b7 sys_sigsuspend+0x31 amd64_syscall+0x741 fast_syscall_common+0xf8
4065 100872 process2 -      mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
4065 100876 process2 -      mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
4065 647181 process2 -      mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
4065 648097 process2 -      mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
4065 652349 process2 -      mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8

./procstat -t 3821
 PID    TID COMM     TDNAME CPU PRI STATE WCHAN
3821 100260 process1 -       -1 120 stop  -
3821 100617 process1 -       -1 120 stop  umtxqb
3821 100633 process1 -       -1 120 stop  umtxqb

./procstat -t 4065
 PID    TID COMM     TDNAME CPU PRI STATE WCHAN
4065 100388 process2 -       -1 120 stop  -
4065 100872 process2 -       -1 122 stop  umtxqb
4065 100876 process2 -       -1 121 stop  umtxqb
4065 647181 process2 -       -1 121 stop  umtxqb
4065 648097 process2 -       -1 122 stop  umtxqb
4065 652349 process2 -       -1 120 stop  umtxqb

All are multithreaded custom applications which use the standard pthread library.

As the issue happens randomly on any device on VMs, can it be something related to virtualization, as mentioned in the last comment here?
https://forums.freebsd.org/threads/process-hangs-in-run-state-and-can-not-be-killed-or-debugged.87035/

A few processes are running in a compat/10 environment.

This looks maybe related to bug 265997, but the details don't really match. Here it looks like the busy state is leaked. Is your application using any process-shared locking primitives?

(In reply to Mark Johnston from comment #6)
We are also suspecting a compat/10 contribution here. It looks like some process-shared primitives should be there. Will get back to you with details.

(In reply to Rupesh Pilania from comment #7)
It looks like no process-shared locking mechanism was used.

I suspect the kern.ipc.umtx_max_robust count can somehow impact something. As per libthr, its default value is 1000. So I created one process with 1000 robust mutexes and ran 2 instances of it. It looks like it somehow slowed down other processes. Now those processes are unkillable. But I am not seeing processes waiting in umtxqb.
Using 8 CPUs and 16 GB RAM. Somehow the process with 1000 threads with robust mutexes impacted other processes. I also saw this in one of the other processes which went into stop state.

./procstat -k 917
PID    TID COMM     TDNAME KSTACK
917 100288 process1 -      mi_switch thread_suspend_switch ptracestop cursig ast doreti_ast

Similar to this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992

Running two instances is enough to make the system less responsive. Running 5 instances will cause a system freeze. Compiled using cc -lpthread.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>   // strerror()
#include <pthread.h>
#include <unistd.h>

#define NUM_THREADS 1000      // Number of threads; adjust based on system capability
#define NUM_ITERATIONS 100000 // Number of iterations per thread

pthread_mutex_t lock;
pthread_mutexattr_t attr;

void handle_error(int err, const char *msg)
{
    fprintf(stderr, "%s: %s\n", msg, strerror(err));
    exit(EXIT_FAILURE);
}

void *thread_func(void *arg)
{
    int thread_num = *((int *)arg);
    free(arg); // Free allocated memory for thread argument

    for (int i = 0; i < NUM_ITERATIONS; i++) {
        // Lock and unlock the mutex to simulate contention
        if (pthread_mutex_lock(&lock) != 0) {
            perror("Failed to lock mutex");
        }
        // Simulate some work inside the critical section
        // (we keep it minimal to maximize lock contention)
        if (pthread_mutex_unlock(&lock) != 0) {
            perror("Failed to unlock mutex");
        }
        // Small sleep to prevent the system from just cycling too quickly
        usleep(1);
    }
    printf("Thread %d completed.\n", thread_num);
    return NULL;
}

int keep_create_threads()
{
    pthread_t threads[NUM_THREADS];
    //pthread_mutex_init(&lock, NULL);
    int err;

    // Initialize mutex attributes
    err = pthread_mutexattr_init(&attr);
    if (err != 0)
        handle_error(err, "pthread_mutexattr_init");

    // Set the mutex as robust
    err = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (err != 0)
        handle_error(err, "pthread_mutexattr_setrobust");

    // Initialize the mutex with the robust attribute
    err = pthread_mutex_init(&lock, &attr);
    if (err != 0)
        handle_error(err, "pthread_mutex_init");

    // Create threads
    for (int i = 0; i < NUM_THREADS; i++) {
        int *thread_num = malloc(sizeof(int)); // Allocate memory for each thread's number
        if (!thread_num) {
            perror("Failed to allocate memory for thread number");
            exit(EXIT_FAILURE);
        }
        *thread_num = i;
        if (pthread_create(&threads[i], NULL, thread_func, thread_num) != 0) {
            perror("Failed to create thread");
            exit(EXIT_FAILURE);
        }
    }

    // Wait for all threads to finish
    for (int i = 0; i < NUM_THREADS; i++) {
        if (pthread_join(threads[i], NULL) != 0) {
            perror("Failed to join thread");
        }
    }

    pthread_mutex_destroy(&lock);
    printf("All threads completed.\n");
    return 0;
}

int main()
{
    while (1) {
        keep_create_threads();
    }
    return 0;
}

(In reply to Rupesh Pilania from comment #9)
1. Run the stress mutex program as mentioned.
2. We keep monitoring other processes; as other processes become unresponsive, we send SIGTERM to those processes. After the SIGTERM is sent, these processes get stuck in STOP state.
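
Step 2 above (send SIGTERM, then see whether the process actually goes away) could be scripted along the following lines. This is only a sketch and not something posted in this report: the helper names term_and_wait() and process_gone(), the 50 ms polling interval, and the timeout are all hypothetical choices. It cannot tell why a process is still present; for that, "procstat -kk <pid>" as requested earlier is still needed.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/*
 * Hypothetical helper: returns 1 if pid no longer exists, 0 if it does.
 * For our own children it reaps with waitpid(); for unrelated processes
 * it probes with the null signal. Note that an unreaped zombie owned by
 * another parent still counts as "existing" here.
 */
static int process_gone(pid_t pid)
{
    pid_t r = waitpid(pid, NULL, WNOHANG);
    if (r == pid)
        return 1;   /* our child, and it has now been reaped */
    if (r == 0)
        return 0;   /* our child, still present */
    /* Not our child (or already reaped): probe with signal 0. */
    return (kill(pid, 0) == -1 && errno == ESRCH);
}

/*
 * Hypothetical helper: send SIGTERM and poll for up to timeout_ms.
 * Returns 1 if the process exited, 0 if it is still there when the
 * timeout expires, -1 if the signal could not be sent.
 */
int term_and_wait(pid_t pid, int timeout_ms)
{
    const struct timespec step = { 0, 50 * 1000 * 1000 };  /* 50 ms */

    if (kill(pid, SIGTERM) == -1)
        return (errno == ESRCH) ? 1 : -1;

    for (int waited = 0; waited <= timeout_ms; waited += 50) {
        if (process_gone(pid))
            return 1;
        nanosleep(&step, NULL);
    }
    return 0;
}
```

A return value of 0 from term_and_wait(pid, 5000) would flag a candidate for the stuck state described here; collecting "procstat -kk" and "procstat -t" for that PID right away would then capture the kernel stacks while the problem is present.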