Bug 282713 - Process enters in STOP state and doesn't respond to any signal.
Summary: Process enters in STOP state and doesn't respond to any signal.
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: Unspecified
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Bojan Novković
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-11-12 18:21 UTC by Rupesh Pilania
Modified: 2025-02-17 16:41 UTC

See Also:


Attachments
application-running-with-truss-stuck-in-stop-state-untxqb (293.73 KB, text/plain)
2024-11-29 08:40 UTC, Rupesh Pilania
test-application-procstat-stopped-state (45.52 KB, text/plain)
2024-11-29 08:43 UTC, Rupesh Pilania
other impacted processes (131.23 KB, text/plain)
2024-11-29 08:45 UTC, Rupesh Pilania

Description Rupesh Pilania 2024-11-12 18:21:40 UTC
The process enters the STOP state and doesn't respond to any signal.
We are currently using FreeBSD 13.0.
We are hitting exactly the same issue as the bug linked below, but for user-space processes.
The symptoms are exactly the same, except that the threads are waiting on umtxqb.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=242724

Another similar bug was reported here.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=208138
Comment 1 Rupesh Pilania 2024-11-13 13:20:23 UTC
A similar issue was reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992
Comment 2 Ed Maste freebsd_committer freebsd_triage 2024-11-13 14:51:08 UTC
In order to debug this issue you will need to provide more information. What FreeBSD version are you using? What is the process? Is it easily reproducible, or does it happen occasionally?

For kernel details, start with "procstat -kk <pid>" as in PR242724.
Comment 3 Rupesh Pilania 2024-11-14 13:53:59 UTC
We are using 13.0-RELEASE-p13 amd64. It will take time for us to migrate to the latest supported versions due to application-level dependencies.
We are struggling to reproduce it.
It happens randomly on any of the VMs running on an ESXi host; sometimes the issue appears after 15 days, sometimes after 6 months.

I will add the required details.
Comment 4 Rupesh Pilania 2024-11-14 14:35:40 UTC
Around 10 processes are waiting in umtxqb. Adding output for two of them as reference.

./procstat -kk 3821
  PID    TID COMM                TDNAME              KSTACK
 3821 100260 process1      -                   mi_switch+0xc1 thread_suspend_switch+0xc0 thread_single+0x69c exit1+0xc1 sigexit+0x133 postsig+0x251 kern_sigsuspend+0x2b7 sys_sigsuspend+0x31 amd64_syscall+0x741 fast_syscall_common+0xf8
 3821 100617 process1      -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 3821 100633 process1      -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8

./procstat -kk 4065
  PID    TID COMM                TDNAME              KSTACK
 4065 100388 process2                 -                   mi_switch+0xc1 thread_suspend_switch+0xc0 thread_single+0x69c exit1+0xc1 sigexit+0x133 postsig+0x251 kern_sigsuspend+0x2b7 sys_sigsuspend+0x31 amd64_syscall+0x741 fast_syscall_common+0xf8
 4065 100872 process2                 -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 4065 100876 process2                 -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 4065 647181 process2                 -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 4065 648097 process2                 -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 4065 652349 process2                 -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8



./procstat -t 3821
  PID    TID COMM                TDNAME              CPU  PRI STATE   WCHAN
 3821 100260 process1      -                    -1  120 stop    -
 3821 100617 process1      -                    -1  120 stop    umtxqb
 3821 100633 process1      -                    -1  120 stop    umtxqb

./procstat -t 4065
  PID    TID COMM                TDNAME              CPU  PRI STATE   WCHAN
 4065 100388 process2                 -                    -1  120 stop    -
 4065 100872 process2                 -                    -1  122 stop    umtxqb
 4065 100876 process2                 -                    -1  121 stop    umtxqb
 4065 647181 process2                 -                    -1  121 stop    umtxqb
 4065 648097 process2                 -                    -1  122 stop    umtxqb
 4065 652349 process2                 -                    -1  120 stop    umtxqb
Comment 5 Rupesh Pilania 2024-11-14 17:54:38 UTC
All are multithreaded custom applications that use the standard pthread library.
Since the issue happens randomly on any device running on VMs, could it be related to virtualization, as mentioned in the last comment here? https://forums.freebsd.org/threads/process-hangs-in-run-state-and-can-not-be-killed-or-debugged.87035/ 

A few of the processes run in a compat/10 environment.
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2024-11-16 16:38:56 UTC
This looks maybe related to bug 265997, but the details don't really match.  Here it looks like the busy state is leaked.

Is your application using any process-shared locking primitives?
Comment 7 Rupesh Pilania 2024-11-17 10:01:39 UTC
(In reply to Mark Johnston from comment #6)
We also suspect that compat/10 contributes here. It looks like some process-shared primitives may be in use. Will get back to you with details.
Comment 8 Rupesh Pilania 2024-11-18 16:03:58 UTC
(In reply to Rupesh Pilania from comment #7)
It looks like no process-shared locking mechanism is used. I suspect that the kern.ipc.umtx_max_robust count has some impact. As per libthr, its default value is 1000. So I created one process with 1000 robust mutexes and ran two instances of it. It somehow slowed down other processes, and those processes are now unkillable. However, I am not seeing processes waiting in umtxqb.

The system has 8 CPUs and 16 GB of RAM.

Somehow the process with 1000 threads using robust mutexes impacted other processes.

I also saw this in one of the other processes that went into the stop state:
./procstat -k 917
  PID    TID COMM                TDNAME              KSTACK
  917 100288 process1            -                   mi_switch thread_suspend_switch ptracestop cursig ast doreti_ast

This is similar to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992
Comment 9 Rupesh Pilania 2024-11-18 16:13:24 UTC
Running two instances is enough to make the system less responsive.
Running 5 instances causes the system to freeze.
Compiled using cc -lpthread.




#include <stdio.h>
#include <stdlib.h>
#include <string.h>   // strerror()
#include <pthread.h>
#include <unistd.h>

#define NUM_THREADS 1000   // Number of threads; adjust based on system capability
#define NUM_ITERATIONS 100000   // Number of iterations per thread

pthread_mutex_t lock;
pthread_mutexattr_t attr;

void handle_error(int err, const char *msg) {
    fprintf(stderr, "%s: %s\n", msg, strerror(err));
    exit(EXIT_FAILURE);
}

void *thread_func(void *arg) {
    int thread_num = *((int *)arg);
    free(arg); // Free allocated memory for thread argument

    for (int i = 0; i < NUM_ITERATIONS; i++) {
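        // Lock and unlock the mutex to simulate contention.
        // pthread functions return the error code and do not set errno,
        // so report it with strerror() rather than perror().
        int err = pthread_mutex_lock(&lock);
        if (err != 0) {
            fprintf(stderr, "Failed to lock mutex: %s\n", strerror(err));
        }

        // The critical section is intentionally empty to maximize lock contention
        err = pthread_mutex_unlock(&lock);
        if (err != 0) {
            fprintf(stderr, "Failed to unlock mutex: %s\n", strerror(err));
        }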

        // Small sleep to prevent the system from just cycling too quickly
        usleep(1);
    }

    printf("Thread %d completed.\n", thread_num);
    return NULL;
}

int keep_create_threads(void) {
    pthread_t threads[NUM_THREADS];
    int err;

    // Initialize mutex attributes
    err = pthread_mutexattr_init(&attr);
    if (err != 0)
        handle_error(err, "pthread_mutexattr_init");

    // Set the mutex as robust
    err = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (err != 0)
        handle_error(err, "pthread_mutexattr_setrobust");

    // Initialize the mutex with the robust attribute
    err = pthread_mutex_init(&lock, &attr);
    if (err != 0)
        handle_error(err, "pthread_mutex_init");

    // Create threads
    for (int i = 0; i < NUM_THREADS; i++) {
        int *thread_num = malloc(sizeof(int));  // Allocate memory for each thread's number
        if (!thread_num) {
            perror("Failed to allocate memory for thread number");
            exit(EXIT_FAILURE);
        }
        *thread_num = i;
        
        // pthread_create() returns the error code instead of setting errno
        err = pthread_create(&threads[i], NULL, thread_func, thread_num);
        if (err != 0)
            handle_error(err, "pthread_create");
    }

    // Wait for all threads to finish
    for (int i = 0; i < NUM_THREADS; i++) {
        err = pthread_join(threads[i], NULL);
        if (err != 0) {
            fprintf(stderr, "Failed to join thread: %s\n", strerror(err));
        }
    }

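    pthread_mutex_destroy(&lock);
    // Release the attribute object so the next call can re-initialize it
    pthread_mutexattr_destroy(&attr);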
    printf("All threads completed.\n");
    return 0;
}

int main(void)
{
    // Repeat the create/join cycle forever to keep the system under load
    while (1) {
        keep_create_threads();
    }
    return 0;
}
Comment 10 Rupesh Pilania 2024-11-18 16:28:26 UTC
(In reply to Rupesh Pilania from comment #9)
1. Run the mutex stress program from comment #9.
2. Keep monitoring the other processes. When they become unresponsive, send SIGTERM to them. After SIGTERM is sent, these processes get stuck in the STOP state.
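
The check in step 2 could be automated along these lines. This is a minimal sketch that assumes the monitored process is a child of the monitor, so that waitpid() can observe it; term_and_wait() and spawn_idle_child() are hypothetical helpers, and the 10 ms polling interval is arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: send SIGTERM to a child process and poll for up to
 * timeout_ms for it to exit. Returns true if the child was reaped, false
 * if it is still present after the timeout or if an error occurred.
 */
static bool
term_and_wait(pid_t pid, int timeout_ms)
{
    const struct timespec step = { 0, 10 * 1000 * 1000 }; /* 10 ms */
    int waited = 0;

    assert(timeout_ms >= 0);
    if (kill(pid, SIGTERM) != 0)
        return (false);
    while (waited <= timeout_ms) {
        pid_t r = waitpid(pid, NULL, WNOHANG);

        if (r == pid)
            return (true);      /* child exited and was reaped */
        if (r == -1 && errno != EINTR)
            return (false);     /* not our child, or other error */
        nanosleep(&step, NULL);
        waited += 10;
    }
    return (false);             /* still there: did not react to SIGTERM */
}

/* Hypothetical helper for trying this out: fork a child that only waits for signals. */
static pid_t
spawn_idle_child(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        for (;;)
            pause();
    }
    return (pid);
}
```

A false return after a generous timeout would match the symptom in this report: the process received SIGTERM but never exited.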
Comment 11 Rupesh Pilania 2024-11-22 10:29:09 UTC
  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 1214 root          1 102    0   146M   109M CPU6     6 936:26  99.78% parent-process

./procstat -t 1214
  PID    TID COMM                TDNAME              CPU  PRI STATE   WCHAN
 1214 100136 heimdall            -                     6  202 run     -
 ./procstat -k 1214
  PID    TID COMM                TDNAME              KSTACK
 1214 100136 heimdall            -                   <running>
 ./procstat -kk 1214
  PID    TID COMM                TDNAME              KSTACK
 1214 100136 heimdall            -                   <running>

All child processes get stuck at umtxqb when we try to kill them.


 
 dut-c600v-02:rtestuser 20] ./procstat -t 2803
  PID    TID COMM                TDNAME              CPU  PRI STATE   WCHAN
 2803 100353 child-1      -                    -1  120 stop    -
 2803 100515 child-1      -                    -1  120 stop    -
 2803 100516 child-1      -                    -1  152 stop    sbwait
 2803 100519 child-1      -                    -1  120 stop    usem
 2803 100520 child-1      -                    -1  120 stop    usem
 2803 100524 child-1      -                    -1  120 stop    usem
 2803 100526 child-1      -                    -1  120 stop    usem
 2803 100532 child-1      -                    -1  120 stop    usem
 2803 100533 child-1      -                    -1  120 stop    usem
 2803 100534 child-1      -                    -1  120 stop    usem
 2803 100535 child-1      -                    -1  120 stop    usem
 2803 100536 child-1      -                    -1  120 stop    umtxqb
 2803 100537 child-1      -                    -1  120 stop    usem
 2803 100540 child-1      -                    -1  120 stop    usem
 2803 100541 child-1      -                    -1  120 stop    usem
 2803 100543 child-1      -                    -1  120 stop    usem
 2803 100544 child-1      -                    -1  120 stop    usem
 2803 100552 child-1      -                    -1  120 stop    usem
 2803 100553 child-1      -                    -1  120 stop    usem
 2803 100555 child-1      -                    -1  120 stop    usem
 2803 100561 child-1      -                    -1  120 stop    usem
 2803 100562 child-1      -                    -1  120 stop    usem
 2803 100563 child-1      -                    -1  120 stop    usem
 2803 100564 child-1      -                    -1  120 stop    usem
 2803 100565 child-1      -                    -1  120 stop    usem
 2803 100568 child-1      -                    -1  120 stop    usem
 2803 100570 child-1      -                    -1  120 stop    usem
 2803 100572 child-1      -                    -1  120 stop    usem
 2803 100581 child-1      -                    -1  120 stop    usem
 2803 100582 child-1      -                    -1  120 stop    usem
 2803 100583 child-1      -                    -1  120 stop    usem
 2803 100610 child-1      -                    -1  120 stop    usem
 2803 100615 child-1      -                    -1  120 stop    usem
 2803 100616 child-1      -                    -1  120 stop    -
 2803 100756 child-1      -                    -1  120 stop    -
 2803 100921 child-1      -                    -1  120 stop    sbwait
 2803 100922 child-1      -                    -1  120 stop    uwait
 2803 100923 child-1      -                    -1  120 stop    uwait
 2803 100924 child-1      -                    -1  120 stop    -
 2803 100925 child-1      -                    -1  120 stop    -
 2803 100930 child-1      -                    -1  120 stop    uwait
 2803 101143 child-1      -                    -1  120 stop    sbwait
 2803 102579 child-1      -                    -1  120 stop    uwait
 2803 394294 child-1      -                    -1  120 stop    uwait
 2803 577659 child-1      -                    -1  120 stop    usem
dut-c600v-02:rtestuser 21]
dut-c600v-02:rtestuser 21] ./procstat -kk 2803
  PID    TID COMM                TDNAME              KSTACK
 2803 100353 child-1      -                   mi_switch+0xc1 thread_suspend_check+0x260 exit1+0xaa sigexit+0x133 postsig+0x251 kern_sigsuspend+0x2b7 sys_sigsuspend+0x31 amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100515 child-1      -                   mi_switch+0xc1 thread_suspend_switch+0xc0 cursig+0x651 ast+0x2df doreti_ast+0x1f
 2803 100516 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be sbwait+0x4c soreceive_generic+0x307 soreceive+0x44 kern_recvit+0x1ba sys_recvfrom+0x86 amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100519 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100520 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100524 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100526 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100532 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100533 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100534 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100535 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100536 child-1      -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100537 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100540 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100541 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100543 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100544 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100552 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100553 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100555 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100561 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100562 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100563 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100564 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100565 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100568 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100570 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100572 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100581 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100582 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100583 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100610 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100615 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100616 child-1      -                   mi_switch+0xc1 thread_suspend_check+0x260 ast+0x376 doreti_ast+0x1f
 2803 100756 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be sbwait+0x4c soreceive_generic+0x307 soreceive+0x44 kern_recvit+0x1ba sys_recvfrom+0x86 amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100921 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be sbwait+0x4c soreceive_generic+0x307 soreceive+0x44 kern_recvit+0x1ba sys_recvfrom+0x86 amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100922 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 do_wait+0x41c __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100923 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 do_wait+0x41c __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 100924 child-1      -                   mi_switch+0xc1 thread_suspend_check+0x260 ast+0x376 doreti_ast+0x1f
 2803 100925 child-1      -                   mi_switch+0xc1 thread_suspend_check+0x260 ast+0x376 doreti_ast+0x1f
 2803 100930 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 do_wait+0x41c __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 101143 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be sbwait+0x4c soreceive_generic+0x307 soreceive+0x44 kern_recvit+0x1ba sys_recvfrom+0x86 amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 102579 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 do_wait+0x41c __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 394294 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 do_wait+0x41c __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 2803 577659 child-1      -                   mi_switch+0xc1 sleepq_catch_signals+0x2e6 sleepq_wait_sig+0x9 _sleep+0x1be umtxq_sleep+0x132 __umtx_op_sem2_wait+0x6f8 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8

We didn't try killing the parent. See comment 10 of https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992. It looks like we are hitting a hypervisor issue.
Comment 12 Rupesh Pilania 2024-11-29 08:40:07 UTC
Created attachment 255522 [details]
application-running-with-truss-stuck-in-stop-state-untxqb

We attached truss to this application. It is restarted by its parent every 5 minutes.
Comment 13 Rupesh Pilania 2024-11-29 08:43:04 UTC
Created attachment 255523 [details]
test-application-procstat-stopped-state

Attaching procstat logs for the test application that truss is attached to.
Comment 14 Rupesh Pilania 2024-11-29 08:45:00 UTC
Created attachment 255524 [details]
other impacted processes

Attaching logs for other processes waiting in umtxqb.
Comment 15 Rupesh Pilania 2024-12-04 15:08:22 UTC
Over time, other applications get stuck on the same lock.

db> ps
107399                   s                                   application
107400                   D       umtxqb  0xffffffff80fdb060  application
107401                   s                                   application
107402                   s                                   application
107403                   s                                   application
107417                   s                                   application
107420                   s                                   application
107432                   s                                   application
107449                   s                                   application
107452                   s                                   application
107454                   s                                   application
107460                   Ss      usem    0xfffff803b208b100  application

db> show lock 0xffffffff80fdb060
 class: sleep mutex
 name: umtxql
 flags: {DEF, DUPOK}
 state: {UNOWNED}

witness reports:
umtxql (type: sleep mutex, depth: 0, active refs: 1024) -- last acquired @ kern/kern_umtx.c:511

/*
 * Lock a chain.
 */
static inline void
umtxq_lock(struct umtx_key *key)
{
        struct umtxq_chain *uc;

        uc = umtxq_getchain(key);
        mtx_lock(&uc->uc_lock);
}
Comment 16 commit-hook freebsd_committer freebsd_triage 2025-02-17 16:41:48 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=df436036e82b895a6233d803bc8bf14d2cfe90d7

commit df436036e82b895a6233d803bc8bf14d2cfe90d7
Author:     Bojan Novković <bnovkov@FreeBSD.org>
AuthorDate: 2025-01-30 15:10:04 +0000
Commit:     Bojan Novković <bnovkov@FreeBSD.org>
CommitDate: 2025-02-17 16:40:34 +0000

    umtx: Don't sleep after casueword32 failure in do_sem2_wake

    When a casueword32 operation fails, 'do_sem2_wake' will call
    'thread_check_susp' to avoid a potential livelock. However, it
    instructs 'thread_check_susp' to sleep while holding a previously busied
    umtxq key. This is explicitly discouraged by the comments in
    'thread_check_susp' which state that a thread shouldn't sleep if it
    owns a kernel resource. Fix this by passing 'false'
    to 'thread_check_susp'.

    Reviewed by:    kib
    PR:             282713
    Differential Revision:  https://reviews.freebsd.org/D48728
    Sponsored by:   Klara Inc.

 sys/kern/kern_umtx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)