Bug 271945 - panic: deadlres_td_sleep_q: possible deadlock detected for … (stat) | (postgres) | (LTO toml.54d28171-c) , blocked for … ticks – during poudriere-bulk(8)
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any
OS: Any
Importance: --- Affects Some People
Assignee: freebsd-fs (Nobody)
URL:
Keywords: crash, needs-patch
Depends on:
Blocks: 14.0r
Reported: 2023-06-11 09:49 UTC by Graham Perrin
Modified: 2023-10-13 17:33 UTC
CC: 11 users

See Also:


Attachments
core.txt (789.66 KB, text/plain), 2023-06-11 09:49 UTC, Graham Perrin
post-panic review of poudriere bulk results (200.23 KB, image/png), 2023-07-18 23:41 UTC, Graham Perrin
core.txt.9 for the most recent panic (787.44 KB, text/plain), 2023-07-18 23:53 UTC, Graham Perrin
Poudriere bulk results, reviewed after the incident (218.21 KB, image/png), 2023-07-23 19:30 UTC, Graham Perrin
Retrospective screenshot of poudriere (231.97 KB, image/png), 2023-08-10 14:32 UTC, Graham Perrin

Description Graham Perrin 2023-06-11 09:49:14 UTC
Created attachment 242729
core.txt

Dump header from device: /dev/ada0p2
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 7457673216
  Blocksize: 512
  Compression: none
  Dumptime: 2023-06-11 00:03:27 +0100
  Hostname: mowa219-gjp4-8570p-freebsd
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 14.0-CURRENT #5 main-n263447-267411d164d4-dirty: Thu Jun  8 14:26:44 BST 2023
    grahamperrin@mowa219-gjp4-8570p-freebsd:/usr/obj/usr/src/amd64.amd64/sys/GENERIC
  Panic String: deadlres_td_sleep_q: possible deadlock detected for 0xfffffe0133324ac0 (stat), blocked for 1801328 ticks

  Dump Parity: 1748434536
  Bounds: 7
  Dump Status: good


----

The kernel is -dirty in that it was built with the (now obsolete) patch from bug 271772 comment 7.


From the attached core.txt.7: 

…

Unread portion of the kernel message buffer:
<7>sonewconn: pcb 0xfffff80026bc6800 (local:/var/run/devd.seqpacket.pipe): Listen queue overflow: 1 already in queue awaiting acceptance (60 occurrences), euid 0, rgid 0, jail 0
[the same sonewconn message repeated a further 33 times]
panic: deadlres_td_sleep_q: possible deadlock detected for 0xfffffe0133324ac0 (stat), blocked for 1801328 ticks

cpuid = 0
time = 1686438207
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c7277e00
vpanic() at vpanic+0x150/frame 0xfffffe00c7277e50
panic() at panic+0x43/frame 0xfffffe00c7277eb0
deadlkres() at deadlkres+0x350/frame 0xfffffe00c7277ef0
fork_exit() at fork_exit+0x80/frame 0xfffffe00c7277f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00c7277f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
Uptime: 1d1h50m14s
Dumping 7112 out of 16260 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:59
warning: Source file is more recent than executable.
59		__asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:59
#1  doadump (textdump=textdump@entry=1)
    at /usr/src/sys/kern/kern_shutdown.c:407
#2  0xffffffff80b51ffc in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:528
#3  0xffffffff80b5251d in vpanic (fmt=<optimized out>, 
    ap=ap@entry=0xfffffe00c7277e90) at /usr/src/sys/kern/kern_shutdown.c:972
#4  0xffffffff80b522a3 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:896
#5  0xffffffff80adbe50 in deadlres_td_sleep_q (p=0xfffffe013f354000, 
    td=0xfffffe0133324ac0, slpticks=1800000)
    at /usr/src/sys/kern/kern_clock.c:235
#6  deadlkres () at /usr/src/sys/kern/kern_clock.c:280
#7  0xffffffff80b08730 in fork_exit (callout=0xffffffff80adbb00 <deadlkres>, 
    arg=0x0, frame=0xfffffe00c7277f40) at /usr/src/sys/kern/kern_fork.c:1102
#8  <signal handler called>
(kgdb) 

------------------------------------------------------------------------
ps -axlww

…


I was asleep at the time of the panic. As far as I can tell, after the event, the computer was busy mostly with these two things: 

a) buildworld with multiple jobs

b) poudriere configuring cmake-core-3.26.1_1
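
For context: this panic comes from the kernel's deadlock resolver (deadlkres), which panics when a thread has been asleep on a lock for longer than a threshold; frame #5 above shows slpticks=1800000, i.e. 30 minutes at hz=1000. A hedged sketch for inspecting or raising the thresholds, assuming a kernel built with options DEADLKRES (as GENERIC on CURRENT is):

    # list the deadlock-resolver tunables (thresholds are in seconds)
    sysctl debug.deadlkres
    # e.g. double the sleep threshold while debugging, to defer the panic
    sysctl debug.deadlkres.slptime_threshold=3600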
Comment 1 Graham Perrin 2023-06-11 12:05:11 UTC
Also <https://lists.freebsd.org/archives/freebsd-fs/2023-June/002332.html>

> Following a panic (271945): zpool status reports 1 data error but 
> identifies no file
Comment 2 Cy Schubert 2023-07-12 03:05:06 UTC
I think the common thread here is poudriere-bulk is running. When running with -j <number of jobs>, the likelihood of this occurring is > 50%, whereas with -j 1 it will never happen.

I have no zpools with errors.

I have another machine, not quite configured the same (fewer disks), also running poudriere, but there the poudriere jails are all i386, whereas on the machine that panics the poudriere jails are all amd64. I don't know whether this has a bearing on the problem or not.

My panic string is usually like this, though at times I will see a c++ process in a detected deadlock.

Unread portion of the kernel message buffer:
panic: deadlres_td_sleep_q: possible deadlock detected for 0xfffffe00c0d6d740 (postgres), blocked for 180239 ticks

cpuid = 0
time = 1689121195
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00851b7e00
vpanic() at vpanic+0x149/frame 0xfffffe00851b7e50
panic() at panic+0x43/frame 0xfffffe00851b7eb0
deadlkres() at deadlkres+0x32a/frame 0xfffffe00851b7ef0
fork_exit() at fork_exit+0x82/frame 0xfffffe00851b7f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00851b7f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 8h51m9s

It's usually postgresql, which is the DBMS for gitea. But, as intimated above, sometimes it's a c++ process that's listed as the "cause."
Comment 3 Graham Perrin 2023-07-18 23:41:04 UTC
Created attachment 243473
post-panic review of poudriere bulk results

(In reply to Cy Schubert from comment #2)

> I think the common thread here is poudriere-bulk is running. …

Thanks. 

IIRC I chose 8 (eight) jobs for the run of poudriere during which this crash occurred: 

----

Dump header from device: /dev/ada0p2
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 3278311424
  Blocksize: 512
  Compression: none
  Dumptime: 2023-07-18 21:11:54 +0100
  Hostname: mowa219-gjp4-8570p-freebsd
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 14.0-CURRENT #0 main-n264150-c65845d0bb2d-dirty: Fri Jul 14 07:11:27 BST 2023
    grahamperrin@mowa219-gjp4-8570p-freebsd:/usr/obj/usr/src/amd64.amd64/sys/GENERIC
  Panic String: deadlres_td_sleep_q: possible deadlock detected for 0xfffffe0120cb6c80 (LTO toml.54d28171-c), blocked for 1800979 ticks

  Dump Parity: 872214848
  Bounds: 9
  Dump Status: good

----

The bulk run began at 20:21 yesterday, with the dump timed at 21:11.

A post-panic review of the run (screenshot attached) suggests around twenty minutes elapsed.
Comment 4 Graham Perrin 2023-07-18 23:53:53 UTC
Created attachment 243474
core.txt.9 for the most recent panic

% pkg iinfo poudriere
poudriere-devel-3.3.99.20220831
%
Comment 5 Cy Schubert 2023-07-20 21:27:48 UTC
Just a quick note. Running poudriere-bulk with -j 1 does circumvent the issue. This is not a big problem when building large packages such as llvm-devel or chromium since both use all the CPUs available anyway.
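
For concreteness, a hedged example of such a single-builder invocation (the jail and list names here are hypothetical; see comment 17 below on -J versus -j):

    # run the bulk with a single builder, so no two jails start at once
    poudriere bulk -j 14amd64 -J 1 -f /usr/local/etc/poudriere.d/pkglist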
Comment 6 Graham Perrin 2023-07-21 18:45:32 UTC
(In reply to Cy Schubert from comment #2)

In your experience (with CURRENT), does this type of panic occur only with the GENERIC kernel? 

I guess, symptoms might be quite different with GENERIC-NODEBUG.
Comment 7 Cy Schubert 2023-07-21 19:06:50 UTC
(In reply to Graham Perrin ◐ from comment #6)

I'm using a custom kernel: I removed all device drivers except those needed to boot, and NIC drivers are loaded via loader.conf instead. Each machine is different: though they have the same motherboard, they include different NICs. The laptop and firewall have WITNESS and INVARIANTS disabled, while the two server machines leave them enabled. Additionally, my kernels include BREAK_TO_DEBUGGER to allow me to enter DDB through the serial console.

The problem occurs with and without WITNESS and INVARIANTS enabled in the kernel. It only occurs on one machine, the one that builds my amd64 packages; the other machine, which uses poudriere to build i386 packages, is unaffected.

Both machines boot off UFS and each has a single zpool of mirrored vdevs. The pool on the machine with the problem has four vdevs; the pool on the machine that builds i386 packages has only two.

The hypothesis is that this is probably related to ZFS in some way. The victim process is either the postgres instance I use as a backend for gitea, or a c++ process. My theory is that poudriere is cloning jails from its snapshot when creating new jails, which results in the deadlock while other processes are trying to write to the same pool.
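
A minimal sketch of the kind of custom configuration described above (the ident is hypothetical; the options themselves are standard):

    include GENERIC
    ident   BUILDER
    # allow a serial-console BREAK to drop into DDB
    options BREAK_TO_DEBUGGER
    # on the laptop and firewall only, the debugging facilities are stripped:
    nooptions WITNESS
    nooptions INVARIANTS
    nooptions INVARIANT_SUPPORT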
Comment 8 Cy Schubert 2023-07-21 19:08:44 UTC
(In reply to Cy Schubert from comment #7)

I should also mention that it never happens when poudriere-bulk -j1 is run, only when multiple jails are used. It's likely that two jails are being created using zfs clone at the same time, resulting in the deadlock. That's the going theory ATM.
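
To illustrate that theory, a minimal sketch of the suspected trigger (dataset names are hypothetical): two clones of the same snapshot created concurrently, as two starting jails would be:

    # a template dataset with a clean snapshot, cloned twice in parallel
    zfs create zroot/template
    zfs snapshot zroot/template@clean
    zfs clone zroot/template@clean zroot/jail01 &
    zfs clone zroot/template@clean zroot/jail02 &
    wait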
Comment 9 Cy Schubert 2023-07-21 19:10:46 UTC
(In reply to Cy Schubert from comment #8)

This is certainly a problem with the ZFS kernel module, but a workaround in poudriere would (a) provide temporary relief while (b) illustrating the problem.
Comment 10 Graham Perrin 2023-07-23 19:30:17 UTC
Created attachment 243575 [details]
Poudriere bulk results, reviewed after the incident

(In reply to Graham Perrin ◐ from comment #6)

> … I guess, symptoms might be quite different with GENERIC-NODEBUG.

Before the 06:37 BOOT below I found a blackout, no response to HID input. 

Presses on the power button produced slight reactions (a brief flickering LED at the front of the HP EliteBook 8570p) but not the expected shutdown. A forced stop (press and hold) was required :-(

From the attached screenshot, I assume that system failure occurred some time after 02:29. 

I uploaded the result of a probe not long after the boot; I expect this to appear automatically at <https://bsd-hardware.info/?computer=6fbb1f806232> in due course.


% zcat /var/log/messages.1.bz2
…
Jul 21 02:28:40 mowa219-gjp4-8570p-freebsd kernel: sonewconn: pcb 0xfffff8004bd20300 (local:/var/run/devd.seqpacket.pipe): Listen queue overflow: 1 already in queue awaiting acceptance (60 occurrences), euid 0, rgid 0, jail 0
Jul 21 02:28:59 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 18033 command line (command 'sed')
Jul 21 02:29:07 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 20664 command line (command 'sh')
Jul 21 02:29:07 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 20667 command line (command 'sed')
Jul 21 02:29:07 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 20669 command line (command 'sh')
Jul 21 02:29:13 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 21172 command line (command 'sed')
Jul 21 02:29:19 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 21745 command line (command 'sh')
Jul 21 02:29:19 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 21746 command line (command 'bsdtar')
Jul 21 02:29:21 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 22151 command line (command 'sh')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23004 command line (command 'cc')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23005 command line (command 'ld.lld')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23006 command line (command 'sh')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23007 command line (command 'sh')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23013 command line (command 'sh')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23014 command line (command 'sh')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 22825 command line (command 'sh')
Jul 21 02:29:27 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23572 command line (command 'sh')
Jul 21 06:37:37 mowa219-gjp4-8570p-freebsd syslogd: kernel boot file is /boot/kernel.GENERIC-NODEBUG/kernel
Jul 21 06:37:37 mowa219-gjp4-8570p-freebsd kernel: ---<<BOOT>>---
…
Comment 11 Graham Perrin 2023-07-23 19:36:36 UTC
(In reply to Cy Schubert from comment #5)

> … poudriere-bulk with -j 1 does circumvent the issue. …

Operating system expresses elation at the prospect of: 

- operating. 

Operator expresses amusement at the prospect of: 

- a single 7,200 RPM disk spinning for days in a decade-old notebook 
  to complete a one-job bulk run :-)
Comment 12 Graham Perrin 2023-07-28 18:20:55 UTC
(In reply to Graham Perrin ◐ from comment #10)

> … the result of a probe not long after the boot, …

<https://bsd-hardware.info/?probe=9f4f71236e> (2023-07-21 05:41 UTC)

- the DE line is wrong; it should say Plasma (X11) - SDDM
  (<https://github.com/linuxhw/hw-probe/issues/158>).
Comment 13 Cy Schubert 2023-07-31 10:36:35 UTC
I don't think this is related to any piece of modern hardware. My panics occur on an older machine building ports in amd64 jails. I have another, almost identical, amd64 machine building ports in poudriere i386 jails. It has no problems.

The only other difference between the two machines is that the failing machine uses a mirrored zpool with four vdevs while the machine building the i386 ports without panic uses a mirrored zpool with only two vdevs.
Comment 14 Graham Perrin 2023-08-10 04:50:51 UTC
Block bug 271607?

Tentatively, via <https://lists.freebsd.org/archives/freebsd-current/2023-August/004181.html>: 

- <https://github.com/openzfs/zfs/commit/2cb992a99ccadb78d97049b40bd442eb4fdc549d>

> ZIL: Fix config lock deadlock. · openzfs/zfs@2cb992a
Comment 15 Graham Perrin 2023-08-10 05:49:42 UTC
(In reply to Graham Perrin ◐ from comment #14)

Sorry, ignore that. I have no log device.
Comment 16 Graham Perrin 2023-08-10 14:32:57 UTC
Created attachment 244007
Retrospective screenshot of poudriere

(In reply to Cy Schubert from comment #2)

> … with -j 1 it will never happen. …

Because I like to live dangerously, here's a retrospective view of a two-job run during which a hard freeze occurred. Apparently building at the time: 

ghostscript9-agpl-base-9.56.1_12
pango-1.50.14


The boot environment at the time was a clean (non-dirty) GENERIC-NODEBUG build of the OS:


root@mowa219-gjp4-8570p-freebsd:~ # strings /tmp/clean/boot/kernel/kernel | grep 09c20a293280
@(#)FreeBSD 14.0-CURRENT amd64 1400094 #3 main-n264621-09c20a293280: Thu Aug 10 02:08:53 BST 2023
FreeBSD 14.0-CURRENT amd64 1400094 #3 main-n264621-09c20a293280: Thu Aug 10 02:08:53 BST 2023
root@mowa219-gjp4-8570p-freebsd:~ # bectl umount n264621-09c20a293280-c 
root@mowa219-gjp4-8570p-freebsd:~ # 


----

It's not all bad. Re: bug 272834 comment 5, before I realised that the freeze coincided with a run of poudriere, I took photographs of screens. So, you lucky, lucky people get to see <https://photos.app.goo.gl/Ad9apU9Ny9uTdaXH6> in the background, a 1960s framed photograph of a lovely lady famed for long, long live love, mmm-mm.
Comment 17 Graham Perrin 2023-08-10 18:06:30 UTC
(In reply to Cy Schubert from comment #2)

A nit, sorry: 

-J    (uppercase) specifies a number of jobs …

-j    (lowercase) names a jail …

<https://github.com/freebsd/poudriere/wiki/poudriere-bulk.8-devel>
Comment 18 Jason Unovitch 2023-08-14 22:26:17 UTC
I think I'm seeing this, but I have not yet seen a panic. If I have a shell open, I can Ctrl-T a poudriere bulk and catch something like the output below. The hang stops any other commands, new logins, and remote connections to the machine, but doesn't seem to prevent watchdogd running in the background and seeing the machine as responsive. In the last case below, it was stuck for 6 hours before I did a hard reset over IPMI.


[00:29:32] [09] [00:00:53] Finished devel/py-smmap@py39 | py39-smmap-5.0.0: Success
[00:29:32] [09] [00:00:00] Building devel/py-gitdb@py39 | py39-gitdb-4.0.10
[00:29:41] [04] [00:00:52] Finished devel/py-pytest-runner@py39 | py39-pytest-runner-6.0.0_1: Success
[00:29:41] [04] [00:00:00] Building textproc/py-regex@py39 | py39-regex-2022.9.13
load: 0.08  cmd: sh 98334 [zfs teardown inactive] 5393.38r 6.27u 5.66s 0% 4748k
load: 0.07  cmd: sh 98334 [zfs teardown inactive] 5397.98r 6.27u 5.66s 0% 4748k


12 Aug 23:

[01:16:39] [02] [00:00:00] Building security/nss | nss-3.92
load: 0.13  cmd: sh 87026 [zfs teardown inactive] 13849.68r 11.38u 11.61s 0% 4676k


14 Aug 23, post deadlock commit update:

FreeBSD xts-bsd.u10ghl.net 14.0-ALPHA1 FreeBSD 14.0-ALPHA1 amd64 1400094 #0 main-n264709-543cf924bcdc: Sat Aug 12 17:59:11 UTC 2023     jason@xyz:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-NODEBUG amd64


[00:57:55] [01] [00:00:00] Building www/grafana8 | grafana8-8.5.27_1
[00:59:46] [07] [00:02:53] Finished net-mgmt/semaphore | semaphore-2.8.77_8: Success
[00:59:47] [07] [00:00:00] Building www/gitea | gitea-1.20.2_1
load: 0.12  cmd: sh 47018 [zfs teardown inactive] 4168.35r 11.38u 12.80s 0% 4676k

load: 0.09  cmd: sh 47018 [zfs teardown inactive] 4223.36r 11.38u 12.80s 0% 4676k

load: 0.08  cmd: sh 47018 [zfs teardown inactive] 4224.92r 11.38u 12.80s 0% 4676k
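
For what it's worth, the wait channel shown by Ctrl-T (SIGINFO) can also be confirmed from another session, if one is still responsive, while the hang is in progress; a hedged example, reusing the PID from the siginfo lines above:

    # per-thread state and wait channels of the stuck sh
    procstat -t 47018
    # its kernel stacks; frames sleeping inside ZFS should be visible
    procstat -kk 47018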
Comment 19 Cy Schubert 2023-08-14 23:13:31 UTC
(In reply to Jason Unovitch from comment #18)

ZFS meta-operations in parallel trigger this. I suspect restoring multiple jails from the same snapshot may be the cause. I haven't tested this hypothesis yet.

There's a good discussion about this on current@.
Comment 20 Graham Perrin 2023-08-15 19:56:15 UTC
Re: the script at <https://lists.freebsd.org/archives/freebsd-current/2023-August/004299.html>

How many cycles might complete before deadlock (or panic)?

Ballpark …

----

% uname -aKU
FreeBSD mowa219-gjp4-8570p-freebsd 14.0-CURRENT FreeBSD 14.0-CURRENT amd64 1400094 #2 main-n264621-09c20a293280-dirty: Wed Aug  9 01:31:14 BST 2023     grahamperrin@mowa219-gjp4-8570p-freebsd:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 1400094 1400094
%
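
The linked script is authoritative; purely as a hedged sketch of the pattern, with dataset names matching the output quoted in comment 21 below and the tarball path an assumption:

    #!/bin/sh
    # parallel fill/rollback workers hammering one pool's metadata paths:
    # each worker repeatedly extracts a tarball into its own dataset, then
    # rolls the dataset back to an empty snapshot.
    for i in $(seq -w 1 16); do
        (
            zfs create -p "zroot/zfsdl/$i"
            zfs snapshot "zroot/zfsdl/$i@empty"
            while :; do
                echo "fill zroot/zfsdl/$i..."
                bsdtar -xf /usr/freebsd-dist/base.txz -C "/zroot/zfsdl/$i"
                echo "rollback zroot/zfsdl/$i..."
                zfs rollback "zroot/zfsdl/$i@empty"
            done
        ) &
    done
    wait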
Comment 21 Jason Unovitch 2023-08-16 00:37:34 UTC
(In reply to Graham Perrin ◐ from comment #20)

Ballpark cycles in the 10-50 range. I wasn't able to scroll back in tmux after the deadlock. I did two Ctrl-T's to see the same siginfo on zfs teardown inactive before power cycling over IPMI. Still on 14.0-ALPHA1 FreeBSD 14.0-ALPHA1 amd64 1400094 #0 main-n264709-543cf924bcdc. The clock time to deadlock was maybe 20 seconds on this 8 core/16 thread machine // nproc of 16.

Great repro case from des@.

If we get to a patch-ready state I should be able to test it within 24 hours.

...

fill zroot/zfsdl/11...
rollback zroot/zfsdl/07...
fill zroot/zfsdl/02...
fill zroot/zfsdl/01...
fill zroot/zfsdl/12...
fill zroot/zfsdl/13...
fill zroot/zfsdl/14...
rollback zroot/zfsdl/08...
fill zroot/zfsdl/03...
fill zroot/zfsdl/04...
fill zroot/zfsdl/15...
rollback zroot/zfsdl/09...
fill zroot/zfsdl/05...
rollback zroot/zfsdl/10...
fill zroot/zfsdl/16...
fill zroot/zfsdl/06...
fill zroot/zfsdl/07...
fill zroot/zfsdl/08...
fill zroot/zfsdl/09...
fill zroot/zfsdl/10...
rollback zroot/zfsdl/11...
rollback zroot/zfsdl/01...
rollback zroot/zfsdl/02...
fill zroot/zfsdl/11...

load: 2.08  cmd: bsdtar 51349 [zfs teardown inactive] 27.83r 0.00u 0.00s 0% 8396k

load: 0.59  cmd: bsdtar 51349 [zfs teardown inactive] 130.93r 0.00u 0.00s 0% 8396k
Comment 22 Graham Perrin 2023-08-16 18:31:23 UTC
(In reply to Jason Unovitch from comment #21)

> … The clock time to deadlock was maybe 20 seconds on this 
> 8 core/16 thread machine // nproc of 16. …

Thanks. 

I ran the script for a few hours overnight after creating the zroot pool on a small USB flash drive. I guess, not a suitable environment for the test.

Side notes: 

* after Control-C it was necessary to force an export of the pool

* then, the OS seemed to stop working with two of the three online 
  cache devices; worked around by restarting the OS.

I'll update the OS then maybe retry the script.
Comment 23 Dag-Erling Smørgrav 2023-08-18 01:13:38 UTC
These two patches appear to resolve the issue as far as my poudriere builder is concerned:

https://github.com/openzfs/zfs/pull/15107
https://github.com/openzfs/zfs/pull/15122

The following one-liner can be used to apply them to a clean source tree:

    fetch -o- https://github.com/openzfs/zfs/pull/{15107,15122}.patch | git am --dir=sys/contrib/openzfs
Comment 24 Jason Unovitch 2023-08-18 02:52:06 UTC
(In reply to Dag-Erling Smørgrav from comment #23)

Positive experience here with those two openzfs PRs applied on top of the earlier main-n264709-543cf924bcdc build, followed by make buildkernel and make installkernel. Post-patch, zfs-deadlock.sh ran for 10 minutes just fine; a run prior to the rebuild, as a re-check, deadlocked in about a minute. I'll kick off a full set of poudriere builds now.

(In reply to Graham Perrin ◐ from comment #22)

This machine's zroot has a single 4-drive raidz1 vdev; it previously had a cache device, but that has been removed. The second pool is a single NVMe drive, but it's not used for poudriere or the zfs-deadlock test script, so it shouldn't be relevant.
Comment 25 Graham Perrin 2023-08-20 14:43:58 UTC
From the conclusion of <https://lists.freebsd.org/archives/freebsd-current/2023-August/004354.html>:

> …
> 
> Unfortunately I think the current code in main should still suffer from 
> this specific deadlock.  cd25b0f740 fixes some deadlocks in this area, 
> may be that is why you are getting issues less often, but I don't 
> believe it fixes this specific one, may be you was just lucky.  Only 
> https://github.com/openzfs/zfs/pull/15122 I believe should fix them.
Comment 26 Jason Unovitch 2023-08-20 23:32:28 UTC
(In reply to Graham Perrin from comment #25)

Several poudriere runs built successfully on the earlier 14.0-ALPHA1 main-n264709-543cf924bcdc build.

Rolling along with the release-cycle testing, I updated to the ALPHA2 commit at 77013f29d04806b72668615e701978b19cb96097 with both https://github.com/openzfs/zfs/pull/15107 and https://github.com/openzfs/zfs/pull/15122 applied, then performed additional builds successfully.
Comment 27 Graham Perrin 2023-08-25 05:46:02 UTC
From <https://lists.freebsd.org/archives/freebsd-current/2023-August/004426.html>: 

> … The PR was just merged to upstream master. 
> Merge to zfs-2.2-release should follow shortly: 
> <https://github.com/openzfs/zfs/pull/15204>, same as some other 2.2 fixes: 
> <https://github.com/openzfs/zfs/pull/15205>.
Comment 28 Cy Schubert 2023-09-28 16:38:34 UTC
This has probably been resolved now. One of the last few ZFS imports seems to have fixed this bug, for me at least.
Comment 29 Graham Perrin 2023-09-28 19:33:12 UTC
(In reply to Cy Schubert from comment #28)

I haven't encountered symptoms recently. 

Happy for this to be closed (overcome by events). 

Given the 14.0r block context, maybe better for someone other than me to close; I might be missing something. 

Thanks
Comment 30 Jason Unovitch 2023-10-03 01:29:59 UTC
The pull requests cited earlier, 15107 and 15122, landed in stable/14 before the releng/14.0 branch (https://cgit.freebsd.org/src/commit/?h=stable/14&id=f789381671a3f97496d496d8f996a67eaa8b1a33). I've seen everything perform well since, so hopefully anyone else out there is seeing the same.
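
A hedged way to confirm a local tree includes that commit (the /usr/src path is an assumption; the hash is from the cgit link above):

    # exits 0 and prints "fix present" if the checkout contains the commit
    git -C /usr/src merge-base --is-ancestor \
        f789381671a3f97496d496d8f996a67eaa8b1a33 HEAD && echo "fix present"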
Comment 31 Ed Maste 2023-10-13 17:33:21 UTC
Presumed fixed by OpenZFS updates as suggested in earlier comments. Please reopen or submit a new PR if the issue is observed again on 14.