Downstream bug: https://github.com/openzfs/zfs/issues/16731 On FreeBSD, zfs send sometimes hangs in a deadlock after pthread_cancel(tid); pthread_join(tid, NULL); while the thread is in pause(). The program looks something like this: #include <pthread.h> #include <signal.h> #include <stdbool.h> #include <stdio.h> #include <unistd.h> static void send_progress_thread_act(int sig, siginfo_t *info, void *ucontext) {} static void timer_delete_cleanup(void *timer) {} void *send_progress_thread(void *arg1) { const struct sigaction signal_action = {.sa_sigaction = send_progress_thread_act, .sa_flags = SA_SIGINFO}; sigaction(SIGUSR1, &signal_action, NULL); pthread_cleanup_push(timer_delete_cleanup, NULL); for(;;) { pause(); // <- thread blocked here // zfs_send_progress(), time(), localtime_r(), setproctitle(), fprintf() fprintf(stderr, "unpaused\n"); } pthread_cleanup_pop(true); return NULL; } int main() { pthread_t tid; pthread_create(&tid, NULL, send_progress_thread, NULL); sigset_t new; sigemptyset(&new); sigaddset(&new, SIGUSR1); pthread_sigmask(SIG_BLOCK, &new, NULL); sleep(1); // some sort of work pthread_cancel(tid); pthread_join(tid, NULL); // <- thread blocked here } (In reality there's also a timer that may send SIGUSR1s and SIGINFO is also consumed but this is largely irrelevant.) The backtrace looks like this: Thread 2 (LWP 259843 of process 95307): #0 0x000000082872c6aa in _sigsuspend () from /lib/libc.so.7 #1 0x0000000825f47fb6 in ?? () from /lib/libthr.so.3 #2 0x00000008286a2a65 in pause () from /lib/libc.so.7 #3 0x0000000820e3d7d5 in send_progress_thread (arg=0x8208cf2c0) at lib/libzfs/libzfs_sendrecv.c:1018 #4 0x0000000825f3eb05 in ?? () from /lib/libthr.so.3 #5 0x0000000000000000 in ?? () Backtrace stopped: Cannot access memory at address 0x82ce7a000 Thread 1 (LWP 100403 of process 95307): #0 0x0000000825f3be2c in ?? () from /lib/libthr.so.3 #1 0x0000000825f40a2e in ?? () from /lib/libthr.so.3 #2 0x0000000820e3da4f in send_progress_thread_exit (hdl=0x349d6143c000, ptid=0x349d61412700, oldmask=oldmask@entry=0x8208cf1b0) at lib/libzfs/libzfs_sendrecv.c:1074 #3 0x0000000820e40367 in dump_snapshot (zhp=0x349d6145a500, arg=arg@entry=0x8208d0ec8) at lib/libzfs/libzfs_sendrecv.c:1291 #4 0x0000000820e3014a in zfs_iter_snapshots_sorted_v2 (zhp=zhp@entry=0x349d6145a000, flags=flags@entry=0, callback=0x820e3ff30 <dump_snapshot>, data=data@entry=0x8208d0ec8, min_txg=8889024, max_txg=8890857) at lib/libzfs/libzfs_iter.c:352 #5 0x0000000820e3fd34 in dump_filesystem (zhp=0x349d6145a000, sdd=0x8208d0ec8) at lib/libzfs/libzfs_sendrecv.c:1364 #6 0x0000000820e3ec99 in dump_filesystems (rzhp=0x349d61412700, rzhp@entry=0x349d6145a000, sdd=0x2, sdd@entry=0x8208d0ec8) at lib/libzfs/libzfs_sendrecv.c:1432 #7 0x0000000820e3e413 in zfs_send_cb_impl (zhp=0x349d6145a000, fromsnap=<optimized out>, tosnap=0x349d61419106 "zpair-02749186397145403432-cartman-674eaa54-537c2abe", flags=0x8208d16f0, outfd=1, filter_func=0x0, cb_arg=0x8208d16d0, debugnvp=0x0) at lib/libzfs/libzfs_sendrecv.c:2506 #8 0x0000000820e3c26e in zfs_send_cb (outfd=0, arg=<optimized out>) at lib/libzfs/libzfs_sendrecv.c:2559 #9 0x0000000820e3c237 in zfs_send (zhp=zhp@entry=0x349d6145a000, fromsnap=fromsnap@entry=0x349d614190c0 "zpair-02749186397145403432-cartman-674e8e34-16208f9f", tosnap=tosnap@entry=0x349d61419106 "zpair-02749186397145403432-cartman-674eaa54-537c2abe", flags=0x0, flags@entry=0x8208d16f0, outfd=0, outfd@entry=1, filter_func=0x0, cb_arg=0x8208d16d0, debugnvp=0x0) at lib/libzfs/libzfs_sendrecv.c:2577 #10 0x0000000000215754 in zfs_do_send (argc=<optimized out>, argv=<optimized out>) at cmd/zfs/zfs_main.c:5099 #11 0x000000000020fd18 in main (argc=6, argv=0x8208d1850) at cmd/zfs/zfs_main.c:9249 This is obviously wrong, since the thread is /sleeping in a cancellation point/. Adding pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL); to the top of the thread doesn't change anything. In this case, it's obviously wrong because pthread_cancel()... didn't cancel the thread.
Sending the process SIGUSR1 does also seem to make it unstuck.
Sending the process SIGUSR1 does also seem to make it unstuck, so.
There must be something else. Or the bug is fixed in 13.4+. I tried the proposed test case on stable/13 (several weeks old), stable/14 (two weeks old), and HEAD (fresh). In all cases cancellation worked.
I don't think I made a solid enough point of this in the OP, but it does work most of the time. The described deadlock only happens like once a week or two on average, accd'g to the original reporter.
(In reply to наб from comment #4) Are you stating that you observed the proposed test program hanging approx. in a week if running in loop? Or that you think that the proposed program is similar to the code that hangs, but you did not checked the standalone test?
No, the original reporter observed his jobs using zfs send hanging around 1/wk. I haven't tested the minimised version under that angle. I believe the minimised version is equivalent (to the precision of the sleep(1); and fprintf(stderr, "unpaused\n"); calls to what zfs send does.
Hi, I'm one of the reporters. This is occurring on 14.1-RELEASE, on a QEMU-powered vps, roughly once a week on a sequential `zfs send` x 30 datasets, scheduled to run once per hour. Never had issues with a similar replication schedule on other servers running 14.1-RELEASE (including two other QEMU-powered & 8 bare-metal), nor previous releases.