The process enters the STOP state and does not respond to any signal. We are currently using FreeBSD 13.0 and are hitting exactly the same issue for user-space processes. The symptoms are exactly the same as in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=242724 except that the threads are waiting on umtxqb. Another similar bug was reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=208138
A similar issue was reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992
In order to debug this issue you will need to provide more information. What FreeBSD version are you using? What is the process? Is it easily reproducible, or does it happen occasionally? For kernel details, start with "procstat -kk <pid>" as in PR242724.
We are using 13.0-RELEASE-p13 amd64. It will take time for us to migrate to the latest supported version due to application-level dependencies. We are struggling to reproduce the issue. It happens randomly on any of the VMs running on an ESXi host; sometimes it shows up after 15 days, sometimes after 6 months. I will add the required details.
There are around 10 processes waiting in umtxqb. Adding the output for two of them as reference.

./procstat -kk 3821
PID TID COMM TDNAME KSTACK
3821 100260 process1 - mi_switch+0xc1 thread_suspend_switch+0xc0 thread_single+0x69c exit1+0xc1 sigexit+0x133 postsig+0x251 kern_sigsuspend+0x2b7 sys_sigsuspend+0x31 amd64_syscall+0x741 fast_syscall_common+0xf8
3821 100617 process1 - mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
3821 100633 process1 - mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8

./procstat -kk 4065
PID TID COMM TDNAME KSTACK
4065 100388 process2 - mi_switch+0xc1 thread_suspend_switch+0xc0 thread_single+0x69c exit1+0xc1 sigexit+0x133 postsig+0x251 kern_sigsuspend+0x2b7 sys_sigsuspend+0x31 amd64_syscall+0x741 fast_syscall_common+0xf8
4065 100872 process2 - mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
4065 100876 process2 - mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
4065 647181 process2 - mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
4065 648097 process2 - mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
4065 652349 process2 - mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8

./procstat -t 3821
PID TID COMM TDNAME CPU PRI STATE WCHAN
3821 100260 process1 - -1 120 stop -
3821 100617 process1 - -1 120 stop umtxqb
3821 100633 process1 - -1 120 stop umtxqb

./procstat -t 4065
PID TID COMM TDNAME CPU PRI STATE WCHAN
4065 100388 process2 - -1 120 stop -
4065 100872 process2 - -1 122 stop umtxqb
4065 100876 process2 - -1 121 stop umtxqb
4065 647181 process2 - -1 121 stop umtxqb
4065 648097 process2 - -1 122 stop umtxqb
4065 652349 process2 - -1 120 stop umtxqb
All of them are multithreaded custom applications that use the standard pthread library. Since the issue happens randomly on any device running as a VM, could it be related to virtualization, as mentioned in the last comment of this thread? https://forums.freebsd.org/threads/process-hangs-in-run-state-and-can-not-be-killed-or-debugged.87035/ A few of the processes run in a compat/10 environment.
This looks maybe related to bug 265997, but the details don't really match. Here it looks like the busy state is leaked. Is your application using any process-shared locking primitives?
(In reply to Mark Johnston from comment #6) We also suspect that compat/10 contributes here. It looks like some process-shared primitives may be in use. Will get back to you with details.
(In reply to Rupesh Pilania from comment #7) It looks like no process-shared locking mechanism is used. I suspect that the kern.ipc.umtx_max_robust count can somehow have an impact. As per libthr, its default value is 1000. So I created a process with 1000 robust mutexes and ran 2 instances of it. This somehow slowed down other processes, and those processes are now unkillable. However, I am not seeing processes waiting in umtxqb. The machine has 8 CPUs and 16 GB RAM. Somehow the process with 1000 threads using robust mutexes impacted other processes. I also saw this in one of the other processes which went into the stop state:

./procstat -k 917
PID TID COMM TDNAME KSTACK
917 100288 process1 - mi_switch thread_suspend_switch ptracestop cursig ast doreti_ast

This is similar to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992
Running two instances is enough to make the system less responsive. Running 5 instances causes a system freeze. Compiled using cc -lpthread.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* strerror() */
#include <pthread.h>
#include <unistd.h>

#define NUM_THREADS 1000       // Number of threads; adjust based on system capability
#define NUM_ITERATIONS 100000  // Number of iterations per thread

pthread_mutex_t lock;
pthread_mutexattr_t attr;

void handle_error(int err, const char *msg) {
    fprintf(stderr, "%s: %s\n", msg, strerror(err));
    exit(EXIT_FAILURE);
}

void *thread_func(void *arg) {
    int thread_num = *((int *)arg);
    int err;
    free(arg); // Free allocated memory for thread argument

    for (int i = 0; i < NUM_ITERATIONS; i++) {
        // Lock and unlock the mutex to simulate contention.
        // pthread functions return the error number and do not set errno,
        // so report it with strerror() rather than perror().
        err = pthread_mutex_lock(&lock);
        if (err != 0) {
            fprintf(stderr, "Failed to lock mutex: %s\n", strerror(err));
        }
        // Simulate some work inside the critical section
        // (we keep it minimal to maximize lock contention)
        err = pthread_mutex_unlock(&lock);
        if (err != 0) {
            fprintf(stderr, "Failed to unlock mutex: %s\n", strerror(err));
        }
        // Small sleep to prevent the system from just cycling too quickly
        usleep(1);
    }
    printf("Thread %d completed.\n", thread_num);
    return NULL;
}

int keep_create_threads(void) {
    pthread_t threads[NUM_THREADS];
    int err;

    // Initialize mutex attributes
    err = pthread_mutexattr_init(&attr);
    if (err != 0) handle_error(err, "pthread_mutexattr_init");

    // Set the mutex as robust
    err = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (err != 0) handle_error(err, "pthread_mutexattr_setrobust");

    // Initialize the mutex with the robust attribute
    err = pthread_mutex_init(&lock, &attr);
    if (err != 0) handle_error(err, "pthread_mutex_init");

    // Create threads
    for (int i = 0; i < NUM_THREADS; i++) {
        int *thread_num = malloc(sizeof(int)); // Allocate memory for each thread's number
        if (!thread_num) {
            perror("Failed to allocate memory for thread number");
            exit(EXIT_FAILURE);
        }
        *thread_num = i;
        err = pthread_create(&threads[i], NULL, thread_func, thread_num);
        if (err != 0) handle_error(err, "Failed to create thread");
    }

    // Wait for all threads to finish
    for (int i = 0; i < NUM_THREADS; i++) {
        err = pthread_join(threads[i], NULL);
        if (err != 0) {
            fprintf(stderr, "Failed to join thread: %s\n", strerror(err));
        }
    }

    pthread_mutex_destroy(&lock);
    printf("All threads completed.\n");
    return 0;
}

int main(void) {
    while (1) {
        keep_create_threads();
    }
    return 0;
}
(In reply to Rupesh Pilania from comment #9)
1. Run the mutex stress program as mentioned.
2. Keep monitoring the other processes; once they become unresponsive, send SIGTERM to them.
After receiving SIGTERM, these processes get stuck in the STOP state.
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
1214 root 1 102 0 146M 109M CPU6 6 936:26 99.78% parent-process

./procstat -t 1214
PID TID COMM TDNAME CPU PRI STATE WCHAN
1214 100136 heimdall - 6 202 run -

./procstat -k 1214
PID TID COMM TDNAME KSTACK
1214 100136 heimdall - <running>

./procstat -kk 1214
PID TID COMM TDNAME KSTACK
1214 100136 heimdall - <running>

All child processes get stuck in umtxqb when we try to kill them.

dut-c600v-02:rtestuser 20] ./procstat -t 2803
PID TID COMM TDNAME CPU PRI STATE WCHAN
2803 100353 child-1 - -1 120 stop -
2803 100515 child-1 - -1 120 stop -
2803 100516 child-1 - -1 152 stop sbwait
2803 100519 child-1 - -1 120 stop usem
2803 100520 child-1 - -1 120 stop usem
2803 100524 child-1 - -1 120 stop usem
2803 100526 child-1 - -1 120 stop usem
2803 100532 child-1 - -1 120 stop usem
2803 100533 child-1 - -1 120 stop usem
2803 100534 child-1 - -1 120 stop usem
2803 100535 child-1 - -1 120 stop usem
2803 100536 child-1 - -1 120 stop umtxqb
2803 100537 child-1 - -1 120 stop usem
2803 100540 child-1 - -1 120 stop usem
2803 100541 child-1 - -1 120 stop usem
2803 100543 child-1 - -1 120 stop usem
2803 100544 child-1 - -1 120 stop usem
2803 100552 child-1 - -1 120 stop usem
2803 100553 child-1 - -1 120 stop usem
2803 100555 child-1 - -1 120 stop usem
2803 100561 child-1 - -1 120 stop usem
2803 100562 child-1 - -1 120 stop usem
2803 100563 child-1 - -1 120 stop usem
2803 100564 child-1 - -1 120 stop usem
2803 100565 child-1 - -1 120 stop usem
2803 100568 child-1 - -1 120 stop usem
2803 100570 child-1 - -1 120 stop usem
2803 100572 child-1 - -1 120 stop usem
2803 100581 child-1 - -1 120 stop usem
2803 100582 child-1 - -1 120 stop usem
2803 100583 child-1 - -1 120 stop usem
2803 100610 child-1 - -1 120 stop usem
2803 100615 child-1 - -1 120 stop usem
2803 100616 child-1 - -1 120 stop -
2803 100756 child-1 - -1 120 stop -
2803 100921 child-1 - -1 120 stop sbwait
2803 100922 child-1 - -1 120 stop uwait
2803 100923 child-1 - -1 120 stop uwait
2803 100924 child-1 - -1 120 stop -
2803 100925 child-1 - -1 120 stop -
2803 100930 child-1 - -1 120 stop uwait
2803 101143 child-1 - -1 120 stop sbwait
2803 102579 child-1 - -1 120 stop uwait
2803 394294 child-1 - -1 120 stop uwait
2803 577659 child-1 - -1 120 stop usem
dut-c600v-02:rtestuser 21]
dut-c600v-02:rtestuser 21] ./procstat -kk 2803
PID TID COMM TDNAME KSTACK
2803 100353 child-1 - mi_switch+0xc1 thread_suspend_check+0x260 exit1+0xaa sigexit+0x133 postsig+0x251 kern_sigsuspend+0x2b7 sys_sigsuspend+0x31 amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100515 child-1 - mi_switch+0xc1 thread_suspend_switch+0xc0 cursig+0x651 ast+0x2df doreti_ast+0x1f
2803 100516 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be sbwait+0x4c soreceive_generic+0x307 soreceive+0x44 kern_recvit+0x1ba sys_recvfrom+0x86 amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100519 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100520 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100524 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100526 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100532 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100533 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100534 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100535 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100536 child-1 - mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100537 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100540 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100541 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100543 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100544 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100552 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100553 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100555 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100561 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100562 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100563 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100564 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100565 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100568 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100570 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100572 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100581 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100582 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100583 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100610 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100615 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100616 child-1 - mi_switch+0xc1 thread_suspend_check+0x260 ast+0x376 doreti_ast+0x1f
2803 100756 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be sbwait+0x4c soreceive_generic+0x307 soreceive+0x44 kern_recvit+0x1ba sys_recvfrom+0x86 amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100921 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be sbwait+0x4c soreceive_generic+0x307 soreceive+0x44 kern_recvit+0x1ba sys_recvfrom+0x86 amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100922 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 do_wait+0x41c __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100923 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 do_wait+0x41c __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 100924 child-1 - mi_switch+0xc1 thread_suspend_check+0x260 ast+0x376 doreti_ast+0x1f
2803 100925 child-1 - mi_switch+0xc1 thread_suspend_check+0x260 ast+0x376 doreti_ast+0x1f
2803 100930 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 do_wait+0x41c __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 101143 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be sbwait+0x4c soreceive_generic+0x307 soreceive+0x44 kern_recvit+0x1ba sys_recvfrom+0x86 amd64_syscall+0x741 fast_syscall_common+0xf8
2803 102579 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 do_wait+0x41c __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 394294 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 do_wait+0x41c __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
2803 577659 child-1 - mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8

We did not try killing the parent. See comment 10 of https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992. It looks like we are hitting a hypervisor issue.
Created attachment 255522
application-running-with-truss-stuck-in-stop-state-untxqb

This is the application we attached truss to. It is restarted by its parent every 5 minutes.

Created attachment 255523
test-application-procstat-stopped-state

Attaching procstat logs for the test application we attached truss to.

Created attachment 255524
other impacted processes

Attaching logs for other processes waiting in umtxqb.
Slowly, other applications get stuck on the same lock.

db> ps
107399 s application
107400 D umtxqb 0xffffffff80fdb060 application
107401 s application
107402 s application
107403 s application
107417 s application
107420 s application
107432 s application
107449 s application
107452 s application
107454 s application
107460 Ss usem 0xfffff803b208b100 application

db> show lock 0xffffffff80fdb060
class: sleep mutex
name: umtxql
flags: {DEF, DUPOK}
state: {UNOWNED}

witness gives:
umtxql (type: sleep mutex, depth: 0, active refs: 1024) -- last acquired @ kern/kern_umtx.c:511

/*
 * Lock a chain.
 */
static inline void
umtxq_lock(struct umtx_key *key)
{
    struct umtxq_chain *uc;

    uc = umtxq_getchain(key);
    mtx_lock(&uc->uc_lock);
}
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=df436036e82b895a6233d803bc8bf14d2cfe90d7

commit df436036e82b895a6233d803bc8bf14d2cfe90d7
Author:     Bojan Novković <bnovkov@FreeBSD.org>
AuthorDate: 2025-01-30 15:10:04 +0000
Commit:     Bojan Novković <bnovkov@FreeBSD.org>
CommitDate: 2025-02-17 16:40:34 +0000

    umtx: Don't sleep after casueword32 failure in do_sem2_wake

    When a casueword32 operation fails, 'do_sem2_wake' will call
    'thread_check_susp' to avoid a potential livelock. However, it
    instructs 'thread_check_susp' to sleep while holding a previously
    busied umtxq key. This is explicitly discouraged by the comments in
    'thread_check_susp' which state that a thread shouldn't sleep if it
    owns a kernel resource.

    Fix this by passing 'false' to 'thread_check_susp'.

    Reviewed by:    kib
    PR:     282713
    Differential Revision:  https://reviews.freebsd.org/D48728
    Sponsored by:   Klara Inc.

 sys/kern/kern_umtx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
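To illustrate the idea behind the fix for readers not familiar with the kernel code, here is a rough userspace analogy using C11 atomics. It is not the kernel implementation: HAS_WAITERS, suspend_requested, check_susp_nosleep and clear_has_waiters are hypothetical names, and EINTR is used only as a placeholder error. The point it tries to show is that when the compare-and-swap fails while a resource is still held, the code only polls for a pending suspension and returns an error to the caller; it never blocks at that point.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit, standing in for a "has waiters" marker. */
#define HAS_WAITERS 0x80000000u

/* Hypothetical stand-in for a pending suspension request. */
static atomic_bool suspend_requested;

/* Non-blocking check: report the request, never sleep here. */
static int
check_susp_nosleep(void)
{
    return (atomic_load(&suspend_requested) ? EINTR : 0);
}

/*
 * Clear HAS_WAITERS in *word, starting from a possibly stale snapshot.
 * The caller is assumed to hold some resource (the analogue of a busied
 * key), so on CAS failure we poll for suspension and bail out with an
 * error rather than sleeping while that resource is held.
 */
int
clear_has_waiters(_Atomic uint32_t *word, uint32_t snapshot)
{
    int error;

    assert(word != NULL);
    while ((snapshot & HAS_WAITERS) != 0) {
        /* On failure, snapshot is refreshed with the current value. */
        if (atomic_compare_exchange_strong(word, &snapshot,
            snapshot & ~HAS_WAITERS))
            return (0);
        error = check_susp_nosleep();
        if (error != 0)
            return (error);
    }
    return (0);
}
```

In this model the caller is expected to release whatever it holds once it sees the error, and only then act on the suspension request.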