Summary:      panic: deadlres_td_sleep_q: possible deadlock detected for … (stat) | (postgres) | (LTO toml.54d28171-c), blocked for … ticks – during poudriere-bulk(8)
Product:      Base System
Component:    kern
Version:      CURRENT
Hardware:     Any
OS:           Any
Status:       Closed Overcome By Events
Severity:     Affects Some People
Priority:     ---
Keywords:     crash, needs-patch
Reporter:     Graham Perrin <grahamperrin>
Assignee:     freebsd-fs (Nobody) <fs>
CC:           bdrewery, crest, cy, des, emaste, fs, junovitch, mav, michael.osipov, pi, swills
See Also:     https://github.com/openzfs/zfs/pull/15122
              https://github.com/openzfs/zfs/pull/15107
              https://github.com/openzfs/zfs/pull/15204
              https://github.com/openzfs/zfs/pull/15205
Bug Blocks:   271607
Attachments:  243473, 243474, 243575, 244007 (described in the comments below)
Description
Graham Perrin
2023-06-11 09:49:14 UTC
Also <https://lists.freebsd.org/archives/freebsd-fs/2023-June/002332.html>

> Following a panic (271945): zpool status reports 1 data error but
> identifies no file

I think the common thread here is poudriere-bulk is running. When running with -j number of jobs, the likelihood of this occurring is > 50%, whereas with -j 1 it will never happen.

I have no zpools with errors.

I have another machine, not quite configured the same (fewer disks), also running poudriere, but in that case the poudriere jails are all i386 jails, whereas on the machine that panics the poudriere jails are all amd64. I don't know if this has a bearing on this problem or not.

My panic string is usually like this, though at times I will see a c++ process in a detected deadlock.

Unread portion of the kernel message buffer:
panic: deadlres_td_sleep_q: possible deadlock detected for 0xfffffe00c0d6d740 (postgres), blocked for 180239 ticks
cpuid = 0
time = 1689121195
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00851b7e00
vpanic() at vpanic+0x149/frame 0xfffffe00851b7e50
panic() at panic+0x43/frame 0xfffffe00851b7eb0
deadlkres() at deadlkres+0x32a/frame 0xfffffe00851b7ef0
fork_exit() at fork_exit+0x82/frame 0xfffffe00851b7f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00851b7f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 8h51m9s

It's usually postgresql, which is the DBMS for gitea. But, as intimated above, sometimes it's a c++ that's listed as the "cause."

Created attachment 243473 [details]
post-panic review of poudriere bulk results

(In reply to Cy Schubert from comment #2)

> I think the common thread here is poudriere-bulk is running. …

Thanks. IIRC I chose 8 (eight) jobs for the run of poudriere during which this crash occurred:

----
Dump header from device: /dev/ada0p2
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 3278311424
  Blocksize: 512
  Compression: none
  Dumptime: 2023-07-18 21:11:54 +0100
  Hostname: mowa219-gjp4-8570p-freebsd
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 14.0-CURRENT #0 main-n264150-c65845d0bb2d-dirty: Fri Jul 14 07:11:27 BST 2023
    grahamperrin@mowa219-gjp4-8570p-freebsd:/usr/obj/usr/src/amd64.amd64/sys/GENERIC
  Panic String: deadlres_td_sleep_q: possible deadlock detected for 0xfffffe0120cb6c80 (LTO toml.54d28171-c), blocked for 1800979 ticks
  Dump Parity: 872214848
  Bounds: 9
  Dump Status: good
----

The bulk run began at 20:21 yesterday, with the dump timed at 21:11. A post-panic review of the run (screenshot attached) suggests around twenty minutes elapsed.

Created attachment 243474 [details]
core.txt.9 for the most recent panic
% pkg iinfo poudriere
poudriere-devel-3.3.99.20220831
%
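The attached core.txt.9 is the summary that savecore(8) and crashinfo(8) produce from the saved dump. If more detail is wanted than core.txt provides, the vmcore can also be opened directly; a minimal sketch, assuming kernel debug symbols are installed, kgdb from the devel/gdb port, and the dump index 9 from the header above:

% sudo kgdb /usr/lib/debug/boot/kernel/kernel.debug /var/crash/vmcore.9
(kgdb) bt              # backtrace of the panicking (deadlkres) thread
(kgdb) info threads    # scan for threads stuck in ZFS locks

The thresholds behind deadlres_td_sleep_q live under the debug.deadlkres.* sysctls, but raising them only delays the panic rather than addressing the underlying ZFS deadlock.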
Just a quick note. Running poudriere-bulk with -j 1 does circumvent the issue. This is not a big problem when building large packages such as llvm-devel or chromium, since both use all the CPUs available anyway.

(In reply to Cy Schubert from comment #2)

In your experience (with CURRENT), does this type of panic occur only with the GENERIC kernel? I guess symptoms might be quite different with GENERIC-NODEBUG.

(In reply to Graham Perrin ◐ from comment #6)

I'm using a custom kernel. I removed all device drivers except those needed to boot; NICs are then included via loader.conf. Each machine is different: though they have the same motherboard, they include different NICs. The laptop and firewall have WITNESS and INVARIANTS disabled, while the two server machines leave them enabled. Additionally, my kernels include BREAK_TO_DEBUGGER to allow me to enter DDB through the serial console.

The problem occurs with and without WITNESS and INVARIANTS enabled in the kernel. It only occurs on one machine, the machine that builds my amd64 packages. The other machine uses poudriere to build i386 packages; it is unaffected. Both machines boot off UFS and have a single zpool. The zpools are mirrored. The zpool on the machine that has the problem has four vdevs; the machine that builds i386 packages has only two vdevs in its pool.

The hypothesis is that this is probably related to ZFS in some way. The victim process is either the postgres I use as a backend for gitea, or a c++. My theory is that poudriere is likely cloning jails from its snapshot when creating new jails, which results in the deadlock while other processes are trying to write to the same pool.

(In reply to Cy Schubert from comment #7)

I should also mention that it never happens when poudriere-bulk -j1 is run, only when multiple jails are used. It's likely that two jails are being created using zfs clone at the same time, resulting in the deadlock. That's the going theory ATM.

(In reply to Cy Schubert from comment #8)

This is certainly a problem with the ZFS kernel module. But a workaround in poudriere would a) provide temporary relief while b) illustrating the problem.

Created attachment 243575 [details]
Poudriere bulk results, reviewed after the incident

(In reply to Graham Perrin ◐ from comment #6)

> … I guess, symptoms might be quite different with GENERIC-NODEBUG.

Before the 06:37 BOOT below I found a blackout, with no response to HID input. Presses on the power button produced slight reactions (a flickering LED, not for long, at the front of the HP EliteBook 8570p) but not the expected shutdown. A forced stop (press and hold) was required :-(

From the attached screenshot, I assume that system failure occurred some time after 02:29.

I uploaded the result of a probe not long after the boot; I expect it to appear automatically at <https://bsd-hardware.info/?computer=6fbb1f806232> in due course.
% zcat /var/log/messages.1.bz2
…
Jul 21 02:28:40 mowa219-gjp4-8570p-freebsd kernel: sonewconn: pcb 0xfffff8004bd20300 (local:/var/run/devd.seqpacket.pipe): Listen queue overflow: 1 already in queue awaiting acceptance (60 occurrences), euid 0, rgid 0, jail 0
Jul 21 02:28:59 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 18033 command line (command 'sed')
Jul 21 02:29:07 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 20664 command line (command 'sh')
Jul 21 02:29:07 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 20667 command line (command 'sed')
Jul 21 02:29:07 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 20669 command line (command 'sh')
Jul 21 02:29:13 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 21172 command line (command 'sed')
Jul 21 02:29:19 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 21745 command line (command 'sh')
Jul 21 02:29:19 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 21746 command line (command 'bsdtar')
Jul 21 02:29:21 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 22151 command line (command 'sh')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23004 command line (command 'cc')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23005 command line (command 'ld.lld')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23006 command line (command 'sh')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23007 command line (command 'sh')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23013 command line (command 'sh')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23014 command line (command 'sh')
Jul 21 02:29:26 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 22825 command line (command 'sh')
Jul 21 02:29:27 mowa219-gjp4-8570p-freebsd apps.plugin[9192]: Cannot fetch process 23572 command line (command 'sh')
Jul 21 06:37:37 mowa219-gjp4-8570p-freebsd syslogd: kernel boot file is /boot/kernel.GENERIC-NODEBUG/kernel
Jul 21 06:37:37 mowa219-gjp4-8570p-freebsd kernel: ---<<BOOT>>---
…

(In reply to Cy Schubert from comment #5)

> … poudriere-bulk with -j 1 does circumvent the issue. …

Operating system expresses elation at the prospect of:
- operating.

Operator expresses amusement at the prospect of:
- a single 7,200 RPM disk spinning for days in a decade-old notebook to complete a one-job bulk run :-)

(In reply to Graham Perrin ◐ from comment #10)

> … the result of a probe not long after the boot, …

<https://bsd-hardware.info/?probe=9f4f71236e> (2023-07-21 05:41 UTC) - the DE line is wrong, it should say Plasma (X11) - SDDM (<https://github.com/linuxhw/hw-probe/issues/158>).

I don't think this is related to any piece of modern hardware. My panics occur on an older machine building ports in amd64 jails. I have another almost identical amd64 machine building ports in poudriere i386 jails; it has no problems. The only other difference between the two machines is that the failing machine uses a mirrored zpool with four vdevs, while the machine building the i386 ports without panic uses a mirrored zpool with only two vdevs.

Block bug 271607?
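The zfs clone theory from comments #7 and #8 can be exercised outside of poudriere. The sketch below is hypothetical and untested here; the pool name zroot and the jailtest dataset names are placeholders, and whether plain parallel clones are enough to trip the deadlock (as opposed to the rollback cycle exercised later in this bug) is exactly the open question:

#!/bin/sh
# Hypothetical stress loop: clone one snapshot into several datasets at
# once, then destroy the clones, mirroring the suspected poudriere
# behaviour of creating jail clones in parallel. Do not run this on a
# pool you cannot afford to wedge; the deadlock discussed in this bug
# hangs the whole machine.
POOL=zroot                          # placeholder pool
TMPL=${POOL}/jailtest/template      # placeholder template dataset

zfs create -p ${TMPL}
zfs snapshot ${TMPL}@clean 2>/dev/null

while :; do
    for i in 1 2 3 4 5 6 7 8; do
        (
            zfs clone ${TMPL}@clean ${POOL}/jailtest/clone${i} &&
            zfs destroy ${POOL}/jailtest/clone${i}
        ) &
    done
    wait
done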
Tentatively, via <https://lists.freebsd.org/archives/freebsd-current/2023-August/004181.html>:

- <https://github.com/openzfs/zfs/commit/2cb992a99ccadb78d97049b40bd442eb4fdc549d>
  > ZIL: Fix config lock deadlock. · openzfs/zfs@2cb992a

(In reply to Graham Perrin ◐ from comment #14)

Sorry, ignore that. I have no log device.

Created attachment 244007 [details]
Retrospective screenshot of poudriere

(In reply to Cy Schubert from comment #2)

> … with -j 1 it will never happen. …

Because I like to live dangerously, here's a retrospective view of a two-job run during which a hard freeze occurred. Apparently building at the time:

ghostscript9-agpl-base-9.56.1_12
pango-1.50.14

The then boot environment was a clean (non -dirty) GENERIC-NODEBUG build of the OS:

----
root@mowa219-gjp4-8570p-freebsd:~ # strings /tmp/clean/boot/kernel/kernel | grep 09c20a293280
@(#)FreeBSD 14.0-CURRENT amd64 1400094 #3 main-n264621-09c20a293280: Thu Aug 10 02:08:53 BST 2023
FreeBSD 14.0-CURRENT amd64 1400094 #3 main-n264621-09c20a293280: Thu Aug 10 02:08:53 BST 2023
root@mowa219-gjp4-8570p-freebsd:~ # bectl umount n264621-09c20a293280-c
root@mowa219-gjp4-8570p-freebsd:~ #
----

It's not all bad. Re: bug 272834 comment 5, before I realised that the freeze coincided with a run of poudriere, I took photographs of screens. So, you lucky, lucky people get to see <https://photos.app.goo.gl/Ad9apU9Ny9uTdaXH6> in the background, a 1960s framed photograph of a lovely lady famed for long, long live love, mmm-mm.

(In reply to Cy Schubert from comment #2)

A nit, sorry:

-J (uppercase) specifies a number of jobs …
-j (lowercase) names a jail …

<https://github.com/freebsd/poudriere/wiki/poudriere-bulk.8-devel>

I think I'm seeing this, but I have not yet seen a panic. If I have a shell open I can Ctrl-T a poudriere bulk and catch something like the output below. When it hangs it does stop any other commands, new logins, and remote connections to the machine, but it doesn't seem to prevent watchdogd running in the background and seeing the machine as responsive. I've seen it stuck for 6 hours in the last case below before doing a hard reset over IPMI.

[00:29:32] [09] [00:00:53] Finished devel/py-smmap@py39 | py39-smmap-5.0.0: Success
[00:29:32] [09] [00:00:00] Building devel/py-gitdb@py39 | py39-gitdb-4.0.10
[00:29:41] [04] [00:00:52] Finished devel/py-pytest-runner@py39 | py39-pytest-runner-6.0.0_1: Success
[00:29:41] [04] [00:00:00] Building textproc/py-regex@py39 | py39-regex-2022.9.13
load: 0.08 cmd: sh 98334 [zfs teardown inactive] 5393.38r 6.27u 5.66s 0% 4748k
load: 0.07 cmd: sh 98334 [zfs teardown inactive] 5397.98r 6.27u 5.66s 0% 4748k

12 Aug 23:

[01:16:39] [02] [00:00:00] Building security/nss | nss-3.92
load: 0.13 cmd: sh 87026 [zfs teardown inactive] 13849.68r 11.38u 11.61s 0% 4676k

14 Aug, post deadlock commit update:

FreeBSD xts-bsd.u10ghl.net 14.0-ALPHA1 FreeBSD 14.0-ALPHA1 amd64 1400094 #0 main-n264709-543cf924bcdc: Sat Aug 12 17:59:11 UTC 2023 jason@xyz:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-NODEBUG amd64

[00:57:55] [01] [00:00:00] Building www/grafana8 | grafana8-8.5.27_1
[00:59:46] [07] [00:02:53] Finished net-mgmt/semaphore | semaphore-2.8.77_8: Success
[00:59:47] [07] [00:00:00] Building www/gitea | gitea-1.20.2_1
load: 0.12 cmd: sh 47018 [zfs teardown inactive] 4168.35r 11.38u 12.80s 0% 4676k
load: 0.09 cmd: sh 47018 [zfs teardown inactive] 4223.36r 11.38u 12.80s 0% 4676k
load: 0.08 cmd: sh 47018 [zfs teardown inactive] 4224.92r 11.38u 12.80s 0% 4676k

(In reply to Jason Unovitch from comment #18)

ZFS meta operations in parallel trigger this.
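Ctrl-T only reports the wait channel of the foreground process. If an already-open root shell still responds during a hang like the one above, kernel stacks for the stuck processes are more useful in a report; a possible way to gather them (PID 98334 taken from the siginfo line above):

# procstat -kk 98334                   # kernel stack of the stuck sh
# procstat -kk -a > /tmp/stacks.txt    # or all threads system-wide, to attach to the bug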
I suspect restoring multiple jails from the same snapshot may be the cause. I haven't tested this hypothesis yet. There's a good discussion about this on current@.

Re: the script at <https://lists.freebsd.org/archives/freebsd-current/2023-August/004299.html>

How many cycles might complete before deadlock (or panic)? Ballpark …

----
% uname -aKU
FreeBSD mowa219-gjp4-8570p-freebsd 14.0-CURRENT FreeBSD 14.0-CURRENT amd64 1400094 #2 main-n264621-09c20a293280-dirty: Wed Aug 9 01:31:14 BST 2023 grahamperrin@mowa219-gjp4-8570p-freebsd:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 1400094 1400094
%
----

(In reply to Graham Perrin ◐ from comment #20)

Ballpark cycles in the 10-50 range. I wasn't able to scroll back in tmux after the deadlock. I did two Ctrl-T's and saw the same siginfo on "zfs teardown inactive" before power cycling over IPMI. Still on 14.0-ALPHA1 FreeBSD 14.0-ALPHA1 amd64 1400094 #0 main-n264709-543cf924bcdc. The clock time to deadlock was maybe 20 seconds on this 8-core/16-thread machine (nproc of 16). Great repro case; kudos to des@. If we get to a patch-ready state I should be able to test it within 24 hours.

...
fill zroot/zfsdl/11...
rollback zroot/zfsdl/07...
fill zroot/zfsdl/02...
fill zroot/zfsdl/01...
fill zroot/zfsdl/12...
fill zroot/zfsdl/13...
fill zroot/zfsdl/14...
rollback zroot/zfsdl/08...
fill zroot/zfsdl/03...
fill zroot/zfsdl/04...
fill zroot/zfsdl/15...
rollback zroot/zfsdl/09...
fill zroot/zfsdl/05...
rollback zroot/zfsdl/10...
fill zroot/zfsdl/16...
fill zroot/zfsdl/06...
fill zroot/zfsdl/07...
fill zroot/zfsdl/08...
fill zroot/zfsdl/09...
fill zroot/zfsdl/10...
rollback zroot/zfsdl/11...
rollback zroot/zfsdl/01...
rollback zroot/zfsdl/02...
fill zroot/zfsdl/11...
load: 2.08 cmd: bsdtar 51349 [zfs teardown inactive] 27.83r 0.00u 0.00s 0% 8396k
load: 0.59 cmd: bsdtar 51349 [zfs teardown inactive] 130.93r 0.00u 0.00s 0% 8396k

(In reply to Jason Unovitch from comment #21)

> … The clock time to deadlock was maybe 20 seconds on this
> 8 core/16 thread machine // nproc of 16. …

Thanks. I ran the script for a few hours overnight after creating the zroot pool on a small USB flash drive. I guess that's not a suitable environment for the test.

Side notes:
* after Control-C it was necessary to force an export of the pool
* then, the OS seemed to stop working with two of three online cache devices, worked around by restarting the OS.

I'll update the OS then maybe retry the script.

These two patches appear to resolve the issue as far as my poudriere builder is concerned:

https://github.com/openzfs/zfs/pull/15107
https://github.com/openzfs/zfs/pull/15122

The following one-liner can be used to apply them to a clean source tree:

fetch -o- https://github.com/openzfs/zfs/pull/{15107,15122}.patch | git am --dir=sys/contrib/openzfs

(In reply to Dag-Erling Smørgrav from comment #23)

Positive experience here with those two OpenZFS PRs applied on top of the earlier main-n264709-543cf924bcdc build, followed by a make buildkernel; make installkernel. Ran the zfs-deadlock.sh script for 10 minutes just fine post-patch; a run prior to the rebuild, as a re-check, took about a minute to deadlock. I'll kick off a full set of poudriere builds now.

(In reply to Graham Perrin ◐ from comment #22)

This machine is zroot with one 4-drive raidz1 vdev; it previously had a cache device, but that has been removed. The second pool is a single NVMe drive, but it's not used for poudriere or the zfs-deadlock test script, so it shouldn't be relevant.
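For readers without the current@ message to hand, the fill/rollback output quoted in comment #21 comes from a stress script along these lines. The sketch below is not the original script from the list, only an approximation of the idea; the pool name zroot, the zfsdl prefix, the worker count, and the use of /usr/include as filler data are all placeholders, and it can wedge a vulnerable system:

#!/bin/sh
# Approximate reproduction sketch (not the script from current@):
# many datasets repeatedly filled with data and rolled back in
# parallel, hammering the path behind the "zfs teardown inactive"
# wait channels seen above.
POOL=zroot            # placeholder pool with some free space
PREFIX=${POOL}/zfsdl
N=16                  # number of parallel workers

i=1
while [ $i -le $N ]; do
    ds=${PREFIX}/$(printf %02d $i)
    zfs create -p ${ds}
    zfs snapshot ${ds}@empty 2>/dev/null
    i=$((i + 1))
done

cycle() {
    ds=$1
    mnt=$(zfs get -H -o value mountpoint ${ds})
    while :; do
        echo "fill ${ds}..."
        bsdtar -cf - -C /usr include | bsdtar -xf - -C ${mnt}
        echo "rollback ${ds}..."
        zfs rollback -r ${ds}@empty
    done
}

i=1
while [ $i -le $N ]; do
    cycle ${PREFIX}/$(printf %02d $i) &
    i=$((i + 1))
done
wait

Whether this particular variant deadlocks as quickly as the original script was not verified here.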
From <https://lists.freebsd.org/archives/freebsd-current/2023-August/004354.html>, it was concluded:

> …
>
> Unfortunately I think the current code in main should still suffer from
> this specific deadlock. cd25b0f740 fixes some deadlocks in this area,
> may be that is why you are getting issues less often, but I don't
> believe it fixes this specific one, may be you was just lucky. Only
> https://github.com/openzfs/zfs/pull/15122 I believe should fix them.

(In reply to Graham Perrin from comment #25)

Several poudriere builds completed successfully on the earlier 14.0-ALPHA1 main-n264709-543cf924bcdc build. Rolling along with the release cycle testing, I updated to the ALPHA2 commit 77013f29d04806b72668615e701978b19cb96097, again with both https://github.com/openzfs/zfs/pull/15107 and https://github.com/openzfs/zfs/pull/15122 applied, and performed additional builds successfully.

From <https://lists.freebsd.org/archives/freebsd-current/2023-August/004426.html>:

> … The PR was just merged to upstream master.
> Merge to zfs-2.2-release should follow shortly:
> <https://github.com/openzfs/zfs/pull/15204>, same as some other 2.2 fixes:
> <https://github.com/openzfs/zfs/pull/15205>.

This has probably been resolved now. One of the last few ZFS imports seems to have resolved this bug, for me at least.

(In reply to Cy Schubert from comment #28)

I haven't encountered symptoms recently. Happy for this to be closed (overcome by events). Given the 14.0-RELEASE blocking context, maybe better for someone other than me to close; I might be missing something. Thanks.

The pull requests cited earlier, 15107 and 15122, landed in stable/14 before the releng/14.0 branch (https://cgit.freebsd.org/src/commit/?h=stable/14&id=f789381671a3f97496d496d8f996a67eaa8b1a33). I've seen everything perform well since, so hopefully anyone else out there is seeing the same.

Presumed fixed by OpenZFS updates, as suggested in earlier comments. Please reopen or submit a new PR if the issue is observed again on 14.
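For anyone wanting to confirm that their tree includes the stable/14 merge cited above, a quick check along these lines should work; the /usr/src path is an assumption about where the source tree is checked out:

% git -C /usr/src merge-base --is-ancestor \
      f789381671a3f97496d496d8f996a67eaa8b1a33 HEAD && echo "fix present"
% zfs version    # cross-check the userland and kernel module versions on a running system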