282713 – Process enters in STOP state and doesn't respond to any signal.

Status:	Open

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	Unspecified
Hardware:	amd64 Any

Importance:	--- Affects Only Me
Assignee:	freebsd-bugs (Nobody)

URL:
Keywords:

Depends on:
Blocks:

Reported:	2024-11-12 18:21 UTC by Rupesh Pilania
Modified:	2024-11-18 16:28 UTC
CC List:	3 users

See Also:	200992

Description Rupesh Pilania 2024-11-12 18:21:40 UTC

A process enters the STOP state and doesn't respond to any signal.
We are currently using FreeBSD 13.0.
We are hitting exactly the same issue as in the bug linked below, but for user-space processes.
The symptoms are exactly the same, except that the threads are waiting on umtxqb.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=242724

Another similar bug was reported here:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=208138

Comment 1 Rupesh Pilania 2024-11-13 13:20:23 UTC

A similar issue was reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992

Comment 2 Ed Maste freebsd_committer 2024-11-13 14:51:08 UTC

In order to debug this issue you will need to provide more information. What FreeBSD version are you using? What is the process? Is it easily reproducible, or does it happen occasionally?

For kernel details, start with "procstat -kk <pid>" as in PR242724.

Comment 3 Rupesh Pilania 2024-11-14 13:53:59 UTC

We are using 13.0-RELEASE-p13 amd64. It will take time for us to migrate to the latest supported versions due to application-level dependencies.
We are struggling to reproduce it.
It happens randomly on any of the VMs running on an ESXi host; sometimes it happens after 15 days, sometimes after 6 months.

I will add the required details.

Comment 4 Rupesh Pilania 2024-11-14 14:35:40 UTC

There are around 10 processes with threads waiting in umtxqb. Adding the output for two of them as reference.

./procstat -kk 3821
  PID    TID COMM                TDNAME              KSTACK
 3821 100260 process1      -                   mi_switch+0xc1 thread_suspend_switch+0xc0 thread_single+0x69c exit1+0xc1 sigexit+0x133 postsig+0x251 kern_sigsuspend+0x2b7 sys_sigsuspend+0x31 amd64_syscall+0x741 fast_syscall_common+0xf8
 3821 100617 process1      -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 3821 100633 process1      -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8

./procstat -kk 4065
  PID    TID COMM                TDNAME              KSTACK
 4065 100388 process2                 -                   mi_switch+0xc1 thread_suspend_switch+0xc0 thread_single+0x69c exit1+0xc1 sigexit+0x133 postsig+0x251 kern_sigsuspend+0x2b7 sys_sigsuspend+0x31 amd64_syscall+0x741 fast_syscall_common+0xf8
 4065 100872 process2                 -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 4065 100876 process2                 -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 4065 647181 process2                 -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 4065 648097 process2                 -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8
 4065 652349 process2                 -                   mi_switch+0xc1 _sleep+0x1cb __umtx_op_sem2_wait+0x287 sys__umtx_op+0x7a amd64_syscall+0x741 fast_syscall_common+0xf8



./procstat -t 3821
  PID    TID COMM                TDNAME              CPU  PRI STATE   WCHAN
 3821 100260 process1      -                    -1  120 stop    -
 3821 100617 process1      -                    -1  120 stop    umtxqb
 3821 100633 process1      -                    -1  120 stop    umtxqb

./procstat -t 4065
  PID    TID COMM                TDNAME              CPU  PRI STATE   WCHAN
 4065 100388 process2                 -                    -1  120 stop    -
 4065 100872 process2                 -                    -1  122 stop    umtxqb
 4065 100876 process2                 -                    -1  121 stop    umtxqb
 4065 647181 process2                 -                    -1  121 stop    umtxqb
 4065 648097 process2                 -                    -1  122 stop    umtxqb
 4065 652349 process2                 -                    -1  120 stop    umtxqb

Comment 5 Rupesh Pilania 2024-11-14 17:54:38 UTC

All are multithreaded custom applications that use the standard pthread library.
As the issue happens randomly on any of the VMs, could it be related to virtualization, as mentioned in the last comment of this thread? https://forums.freebsd.org/threads/process-hangs-in-run-state-and-can-not-be-killed-or-debugged.87035/

A few processes are running in a compat/10 environment.

Comment 6 Mark Johnston freebsd_committer 2024-11-16 16:38:56 UTC

This looks maybe related to bug 265997, but the details don't really match.  Here it looks like the busy state is leaked.

Is your application using any process-shared locking primitives?

Comment 7 Rupesh Pilania 2024-11-17 10:01:39 UTC

(In reply to Mark Johnston from comment #6)
We also suspect that compat/10 contributes here. It looks like some process-shared primitives may be in use. I will get back to you with details.

Comment 8 Rupesh Pilania 2024-11-18 16:03:58 UTC

(In reply to Rupesh Pilania from comment #7)
It looks like no process-shared locking mechanism is used. I suspect that the kern.ipc.umtx_max_robust count may somehow have an impact. As per libthr, its default value is 1000. So I created a process with 1000 threads contending on a robust mutex and ran 2 instances of it. This seems to have slowed down other processes. Now those processes are unkillable. But I am not seeing processes waiting in umtxqb.

The system has 8 CPUs and 16 GB RAM.

Somehow the process with 1000 threads using a robust mutex impacted other processes.

I also saw this in another process that went into the stop state:
./procstat -k 917
  PID    TID COMM                TDNAME              KSTACK
  917 100288 process1            -                   mi_switch thread_suspend_switch ptracestop cursig ast doreti_ast

This is similar to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992

Comment 9 Rupesh Pilania 2024-11-18 16:13:24 UTC

Running two instances is enough to make the system less responsive.
Running 5 instances causes the system to freeze.
Compiled using cc -lpthread.




#include <stdio.h>
#include <stdlib.h>
#include <string.h>   // strerror()
#include <pthread.h>
#include <unistd.h>

#define NUM_THREADS 1000        // Number of threads; adjust based on system capability
#define NUM_ITERATIONS 100000   // Number of iterations per thread

pthread_mutex_t lock;
pthread_mutexattr_t attr;

// pthread functions return the error code instead of setting errno,
// so report it with strerror() rather than perror().
void handle_error(int err, const char *msg) {
    fprintf(stderr, "%s: %s\n", msg, strerror(err));
    exit(EXIT_FAILURE);
}

void *thread_func(void *arg) {
    int thread_num = *((int *)arg);
    int err;
    free(arg); // Free allocated memory for thread argument

    for (int i = 0; i < NUM_ITERATIONS; i++) {
        // Lock and unlock the mutex to simulate contention
        err = pthread_mutex_lock(&lock);
        if (err != 0) {
            fprintf(stderr, "Failed to lock mutex: %s\n", strerror(err));
        }

        // Simulate some work inside the critical section
        // (we keep it minimal to maximize lock contention)
        err = pthread_mutex_unlock(&lock);
        if (err != 0) {
            fprintf(stderr, "Failed to unlock mutex: %s\n", strerror(err));
        }

        // Small sleep to prevent the system from just cycling too quickly
        usleep(1);
    }

    printf("Thread %d completed.\n", thread_num);
    return NULL;
}

int keep_create_threads(void) {
    pthread_t threads[NUM_THREADS];
    int err;

    // Initialize mutex attributes
    err = pthread_mutexattr_init(&attr);
    if (err != 0)
        handle_error(err, "pthread_mutexattr_init");

    // Set the mutex as robust
    err = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (err != 0)
        handle_error(err, "pthread_mutexattr_setrobust");

    // Initialize the mutex with the robust attribute
    err = pthread_mutex_init(&lock, &attr);
    if (err != 0)
        handle_error(err, "pthread_mutex_init");

    // Create threads
    for (int i = 0; i < NUM_THREADS; i++) {
        int *thread_num = malloc(sizeof(int));  // Allocate memory for each thread's number
        if (!thread_num) {
            perror("Failed to allocate memory for thread number");
            exit(EXIT_FAILURE);
        }
        *thread_num = i;

        err = pthread_create(&threads[i], NULL, thread_func, thread_num);
        if (err != 0)
            handle_error(err, "Failed to create thread");
    }

    // Wait for all threads to finish
    for (int i = 0; i < NUM_THREADS; i++) {
        err = pthread_join(threads[i], NULL);
        if (err != 0) {
            fprintf(stderr, "Failed to join thread: %s\n", strerror(err));
        }
    }

    pthread_mutex_destroy(&lock);
    printf("All threads completed.\n");
    return 0;
}

int main(void)
{
    // Repeat the create/contend/join cycle forever
    while (1) {
        keep_create_threads();
    }
    return 0;
}

Comment 10 Rupesh Pilania 2024-11-18 16:28:26 UTC

(In reply to Rupesh Pilania from comment #9)
1. Run the mutex stress program as mentioned.
2. Keep monitoring the other processes; once they become unresponsive, send SIGTERM to them. After receiving SIGTERM, these processes get stuck in the STOP state.