Prints:

thr1: pthread_join(195): No such file or directory

Happens on the 48-core Cavium ThunderX in the cluster.
I'm not sure if the message is related to the hang; I've seen each independently of the other. It seems the process is stuck in the kernel waiting on a mutex:

# procstat -t 19405
  PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
19405 100607 thr1             -                 -1  120 sleep   umtxn
19405 101334 thr1             -                 -1  152 sleep   umtxn
# procstat -k 19405
  PID    TID COMM             TDNAME           KSTACK
19405 100607 thr1             -                mi_switch sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_lock_umutex __umtx_op_wait_umutex do_el0_sync
19405 101334 thr1             -                mi_switch sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_lock_umutex __umtx_op_wait_umutex do_el0_sync
Could you try to run this test, in order to narrow the test scenario a bit? I have tried this on amd64/i386 without finding any issues. Place this in stress2/misc as thr1.sh and run it:

#!/bin/sh
. ../default.cfg

export runRUNTIME=1h
export thr1LOAD=100
export TESTPROGS="
testcases/swap/swap
testcases/thr1/thr1
"

(cd ..; ./testcases/run/run $TESTPROGS)

Thank you.
With this script I can reproduce the issue. It can take a few hours to show up, so I increased the runtime to 24 hours.
Great! So the scenario is creating many threads that return almost immediately, all while under VM pressure.
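For reference, a minimal sketch of the kind of thread-churn loop this scenario describes (an approximation for illustration, not the actual testcases/thr1/thr1.c):

#include <err.h>
#include <pthread.h>

/* Thread body that returns immediately, so the main thread is
   constantly creating and joining short-lived threads. */
static void *
tfunc(void *arg)
{
        return (NULL);
}

int
main(void)
{
        pthread_t tid;
        int r;

        for (;;) {
                if ((r = pthread_create(&tid, NULL, tfunc, NULL)) != 0)
                        errc(1, r, "pthread_create");
                if ((r = pthread_join(tid, NULL)) != 0)
                        errc(1, r, "pthread_join");
        }
}

Running this under memory pressure (the swap test case) is what the thr1.sh script above arranges.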
I have some code that inspects the state when the issue shows up. Below is a dump of the registers of the only thread in the process.

 x0 = 000000004048fd50    x1 = 0000000000000011    x2 = 0000000000000000
 x3 = 0000000000000000    x4 = 0000000000000000    x5 = 0000000000000001
 x6 = 0000000000000001    x7 = 000000000000007f    x8 = 00000000000001c6
 x9 = 0000000080000000   x10 = 00000000000187dd   x11 = 00000000000187dd
x12 = 0000000000000001   x13 = 000000004048fcd8   x14 = 00000000000187dd
x15 = 0000000000000000   x16 = 0000000040485df8   x17 = 00000000404fe8dc
x18 = 0000000040801530   x19 = 000000004048fd50   x20 = 00000000000187dd
x21 = 0000000040490000   x22 = 000000004048fd50   x23 = 0000000000412000
x24 = 0000000000000000   x25 = 00000000004014f0   x26 = 0000000000000000
x27 = 0000000000000000   x28 = 0000000000000000   x29 = 0000007fffffee50
 lr = 0000000040466eb0    sp = 0000007fffffee40
elr = 00000000404fe8e0  spsr = 90000000

I looked at the data passed to the kernel in x0 and found the owner of the lock to be the current thread. I also looked at a stack trace and found we entered the lock through the following calls:

_pthread_create -> _thr_alloc -> __thr_umutex_lock -> _umtx_op

As far as I can tell, this is the only place within _thr_alloc where we acquire this lock, and it is protecting _tcb_ctor.
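As an aside, a minimal sketch of how the owner word pointed to by x0 can be decoded, assuming the struct umutex layout from sys/umtx.h where the first 32-bit word is m_owner; the decode_owner helper is hypothetical:

#include <stdint.h>
#include <stdio.h>

/* From sys/umtx.h: the owner word of a struct umutex holds the
   owning thread's TID, with the top bit marking contention. */
#define UMUTEX_CONTESTED        0x80000000U

/* Hypothetical helper: decode an m_owner word read (e.g. from a
   debugger) at the address passed in x0. */
static void
decode_owner(uint32_t m_owner)
{
        printf("owner tid: %u%s\n", m_owner & ~UMUTEX_CONTESTED,
            (m_owner & UMUTEX_CONTESTED) ? " (contested)" : "");
}

int
main(void)
{
        /* In the dump above, x10/x11/x14/x20 all hold 0x187dd, which is
           likely the current thread's TID, and x9 holds the contested bit. */
        decode_owner(0x187ddU | UMUTEX_CONTESTED);
        return (0);
}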
(In reply to Andrew Turner from comment #5) Could you instrument the tcb_lock to add atomic counters for acquires and releases? Then we would see the generation counts for acq/rel on tcb_lock, in particular whether something was missed at unlock, or e.g. a thread was terminated without unlocking (weird).
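For illustration, a minimal sketch of such instrumentation using C11 atomics; the wrapper names and placement are hypothetical, and in libthr the counters would bracket the existing lock/unlock calls on tcb_lock:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Generation counters bumped on every acquire and release of tcb_lock.
   After a hang, comparing the two counts shows whether a release was
   missed (acq should exceed rel by at most the locks currently held). */
static atomic_uint_fast64_t tcb_lock_acq_gen;
static atomic_uint_fast64_t tcb_lock_rel_gen;

static void
tcb_lock_acquire(void)
{
        /* ... the real lock call on tcb_lock would go here ... */
        atomic_fetch_add_explicit(&tcb_lock_acq_gen, 1, memory_order_relaxed);
}

static void
tcb_lock_release(void)
{
        atomic_fetch_add_explicit(&tcb_lock_rel_gen, 1, memory_order_relaxed);
        /* ... the real unlock call on tcb_lock would go here ... */
}

int
main(void)
{
        tcb_lock_acquire();
        tcb_lock_release();
        printf("tcb_lock: acq=%ju rel=%ju\n",
            (uintmax_t)atomic_load(&tcb_lock_acq_gen),
            (uintmax_t)atomic_load(&tcb_lock_rel_gen));
        return (0);
}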
It doesn't seem to be specific to any one lock. I've seen similar hangs with just thr1 and no swap running, and have seen the same symptom using rwlocks.
(In reply to Andrew Turner from comment #7) Then it sounds as if the issue is in suword() or casueword(). I recently re-read the arm64 implementations, but did not note anything obviously wrong. It could be a hw errata, after all. E.g., might stxr return 0 (success) but still fail the store?
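To illustrate where such an errata would bite: compare-and-swap on arm64 is built from an LL/SC pair, roughly like the simplified, arm64-only sketch below (not the actual casueword() implementation). The status register written by stxr/stlxr is the only thing telling the caller whether the store took effect.

#include <stdint.h>
#include <stdio.h>

/*
 * Simplified LL/SC compare-and-swap.  The caller compares *oldp
 * against 'old' to decide whether the swap happened.
 */
static int
cas32(volatile uint32_t *p, uint32_t old, uint32_t newv, uint32_t *oldp)
{
        uint32_t cur, res;

        do {
                __asm__ __volatile__(
                    "ldaxr   %w0, [%2]\n"       /* load-exclusive */
                    "cmp     %w0, %w3\n"
                    "b.ne    1f\n"              /* value differs, bail out */
                    "stlxr   %w1, %w4, [%2]\n"  /* store-exclusive, 0 = ok */
                    "b       2f\n"
                    "1:\n"
                    "clrex\n"
                    "mov     %w1, #0\n"         /* no retry needed */
                    "2:\n"
                    : "=&r" (cur), "=&r" (res)
                    : "r" (p), "r" (old), "r" (newv)
                    : "cc", "memory");
        } while (res != 0);     /* retry if the exclusive monitor was lost */

        *oldp = cur;
        return (0);
}

int
main(void)
{
        volatile uint32_t word = 0x187dd;
        uint32_t prev;

        cas32(&word, 0x187dd, 0, &prev);
        printf("prev=%#x new=%#x\n", prev, word);
        return (0);
}

Both the userland lock fast path and the kernel's umtx code depend on that status bit being truthful, so a store-exclusive that reports success without the store becoming visible would show up as exactly this kind of stuck-owner hang.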
(In reply to Andrew Turner from comment #5) Aren't the locks in _thr_alloc only used by threads that call pthread_create? For the code at https://svnweb.freebsd.org/base/user/pho/stress2/testcases/thr1/thr1.c, this only happens sequentially from the main thread, so that suggests something went wrong releasing the lock on the userland side (it shouldn't need to trap into the kernel).
(In reply to martin from comment #9) Yes, the garbage-collection code that reclaims freed thread structures is only called from _thr_alloc(), which in the context of thr1.c means that only the main thread acquires the tcb_lock. This, together with the note that other locks are similarly affected, mostly reinforces my suspicion of the LL/SC hardware implementation.
Andy, has this been observed on Pass 1.1 only (and not tested on Pass 2.0)?
I haven't tried on Pass 2.0.
This may be related to Cavium erratum 26026. If so, it only affects ThunderX pass 1.x CPUs. https://lists.cs.columbia.edu/pipermail/kvmarm/2016-August/021343.html
I'm unable to reproduce this on later hardware without the listed erratum; as such, I'm closing this under the assumption that it's a known hardware bug in pre-production hardware.