While an overnight poudriere run was cleaning up, my system crashed with a ZFS panic. On the console I saw (retyped):

panic: VERIFY3(l->blk_birth == r->blk_birth) failed (9269896 == 9269889)
cpuid = 15
time = 1643398268
#3 livelist_compare
#4 avl_find
#5 dsl_livelist_iterate
#6 bpobj_iterate_blkptrs+0x104
#7 bpobj_iterate_impl+0x154
#8 dsl_process_sub_livelist+0x60
#9 spa_livelist_delete_cb+0xf8
#10 zthr_procedure+0xc0
#11 fork_exit+0x74
#12 fork_trampoline+0x10
Uptime: 1d3h46m6s

In the terminal window with my ssh connection:

[18:50:24] Logs: /data/logs/bulk/builder-default/2022-01-27_19h05m48s
[19:25:12] [24] [19:24:56] Finished devel/llvm13 | llvm13-13.0.0_3: Success
[19:25:15] Stopping 32 builders
builder-default-job-01: removed
builder-default-job-02: removed
builder-default-job-01-n: removed
builder-default-job-02-n: removed
builder-default-job-03: removed
builder-default-job-03-n: removed
builder-default-job-04: removed
builder-default-job-04-n: removed
builder-default-job-05: removed
builder-default-job-05-n: removed
builder-default-job-06: removed
builder-default-job-06-n: removed
builder-default-job-07: removed
builder-default-job-07-n: removed
client_loop: send disconnect: Broken pipe

My server is an Ampere eMAG (32 cores) with root on ZFS on an NVMe drive. It is running a very recent version of stable/13. The dump will take hours to complete, and I don't know whether I will have a usable crash dump at the end.
After waiting about 100 minutes for the dump to complete, I logged in again and started poudriere; the system crashed immediately with a similar message differing only in the numbers. So I must have a corrupt pool.

panic: VERIFY3(l->blk_birth == r->blk_birth) failed (9269909 == 9269883)

There were a large number of build datasets not cleaned up from the last run:

zroot/poudriere/jails/builder                 1.13G  322G  1.13G  /usr/local/poudriere/jails/builder
zroot/poudriere/jails/builder-default-ref     4.98G  322G  1.13G  /data/.m/builder-default/ref
zroot/poudriere/jails/builder-default-ref/06   215M  322G  1.34G  /data/.m/builder-default/06
zroot/poudriere/jails/builder-default-ref/07   219M  322G  1.34G  /data/.m/builder-default/07
...and so on up to /32

Here is my kernel ident:

FreeBSD 13.0-STABLE #20 stable/13-n249167-b1ced97e75a: Tue Jan 25 17:07:54 EST 2022
    root@marmota:/usr/obj/usr/src/arm64.aarch64/sys/MARMOTA arm64
FreeBSD clang version 13.0.0 (git@github.com:llvm/llvm-project.git llvmorg-13.0.0-0-gd7b669b3a303)

and root pool device:

nda0 at nvme0 bus 0 scbus4 target 0 lun 1
nda0: <XP1920LE30002 JA00ST05 ZEQ0065R>
nda0: nvme version 1.2 x1 (max x4) lanes PCIe Gen3 (max Gen3) link
nda0: 1831420MB (3750748848 512 byte sectors)
Created attachment 232017 [details]
disable blk_birth assertion

I most likely have the problem from https://github.com/openzfs/zfs/issues/11480: the combination of deduplication and cloning can generate blocks with identical DVA_GET_OFFSET but different blk_birth. The assertion that failed is not valid on pools with deduplication, such as mine. Disabling the failing assertion allows deleting the datasets that could not be deleted before.
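For context, here is a simplified sketch of the comparator that fires this assertion, paraphrased from OpenZFS's dsl_deadlist.c of that era. The exact code differs by version and error paths are trimmed; treat this as an illustration of where the check sits, not the verbatim function:

/*
 * Sketch of the livelist AVL comparator (paraphrased from OpenZFS
 * dsl_deadlist.c; details vary by version).  Entries are ordered by
 * the vdev and offset of dva[0].  The failing check asserts that two
 * entries at the same offset must share a birth txg -- which does not
 * hold once dedup or block cloning can produce multiple block pointers
 * referencing the same physical location with different birth txgs.
 */
static int
livelist_compare(const void *larg, const void *rarg)
{
	const blkptr_t *l = larg;
	const blkptr_t *r = rarg;

	uint64_t l_vdev = DVA_GET_VDEV(&l->blk_dva[0]);
	uint64_t r_vdev = DVA_GET_VDEV(&r->blk_dva[0]);
	if (l_vdev < r_vdev)
		return (-1);
	else if (l_vdev > r_vdev)
		return (+1);

	uint64_t l_offset = DVA_GET_OFFSET(&l->blk_dva[0]);
	uint64_t r_offset = DVA_GET_OFFSET(&r->blk_dva[0]);
	if (l_offset == r_offset) {
		/* The check that panics on dedup/cloned pools. */
		ASSERT3U(l->blk_birth, ==, r->blk_birth);
	}
	if (l_offset < r_offset)
		return (-1);
	else if (l_offset > r_offset)
		return (+1);
	return (0);
}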
Also got this issue with Poudriere, though I don't think dedup is enabled on any of my datasets.
After updating my ports builder to HEAD around the 23rd of December 2023, I ran into the same problem.
(In reply to Kurt Jaeger from comment #4)

It looks like I have some zfs clones left over from the first crash:

zroot/pou/jails/150-default-ref/01   569M  252G  2.80G  /pou/data/.m/150-default/01
zroot/pou/jails/150-default-ref/02   551M  252G  2.78G  /pou/data/.m/150-default/02
zroot/pou/jails/150-default-ref/03   455M  252G  2.69G  /pou/data/.m/150-default/03
zroot/pou/jails/150-default-ref/04  54.1M  252G  2.30G  /pou/data/.m/150-default/04
zroot/pou/jails/150-default-ref/05  57.5M  252G  2.30G  /pou/data/.m/150-default/05
zroot/pou/jails/150-default-ref/07  61.0M  252G  2.31G  /pou/data/.m/150-default/07
zroot/pou/jails/150-default-ref/08  84.8M  252G  2.33G  /pou/data/.m/150-default/08
zroot/pou/jails/150-default-ref/09   170M  252G  2.41G  /pou/data/.m/150-default/09
zroot/pou/jails/150-default-ref/10   242M  252G  2.48G  /pou/data/.m/150-default/10
zroot/pou/jails/150-default-ref/11   240M  252G  2.48G  /pou/data/.m/150-default/11
zroot/pou/jails/150-default-ref/12  47.9M  252G  2.29G  /pou/data/.m/150-default/12
zroot/pou/jails/150-default-ref/13   155M  252G  2.40G  /pou/data/.m/150-default/13
zroot/pou/jails/150-default-ref/14   229M  252G  2.47G  /pou/data/.m/150-default/14
zroot/pou/jails/150-default-ref/15  43.6M  252G  2.29G  /pou/data/.m/150-default/15
zroot/pou/jails/150-default-ref/16  43.5M  252G  2.29G  /pou/data/.m/150-default/16
zroot/pou/jails/150-default-ref/27  41.6M  252G  2.29G  /pou/data/.m/150-default/27
zroot/pou/jails/150-default-ref/28   242M  252G  2.48G  /pou/data/.m/150-default/28
(In reply to Kurt Jaeger from comment #5)

zfs destroy on one of those will cause the crash. Rebooting with the patch (the commented-out assert) crashes with:

panic: VERIFY(BP_GET_DEDUP(bp)) failed

There are two asserts of that kind in dsl_livelist_iterate(), which is called by dsl_process_sub_livelist(). So the patch does not fix the poudriere cloned-filesystem issue we have.
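For reference, a condensed sketch of where those two assertions sit, paraphrased from OpenZFS dsl_deadlist.c (locking, cancellation, and error handling trimmed; names are from that source but the body is simplified for illustration):

/*
 * Condensed sketch of dsl_livelist_iterate() (paraphrased from OpenZFS
 * dsl_deadlist.c).  A repeated FREE, or an ALLOC matching a tracked
 * FREE, was assumed to come only from dedup -- the two
 * ASSERT(BP_GET_DEDUP(bp)) checks below -- an assumption block cloning
 * violates.
 */
static int
dsl_livelist_iterate(void *arg, const blkptr_t *bp, boolean_t bp_freed,
    dmu_tx_t *tx)
{
	struct livelist_iter_arg *lia = arg;
	livelist_entry_t node, *found;

	node.le_bp = *bp;
	found = avl_find(lia->avl, &node, NULL);
	if (bp_freed) {
		if (found == NULL) {
			/* First FREE for this block pointer: track it. */
			livelist_entry_t *e =
			    kmem_alloc(sizeof (*e), KM_SLEEP);
			e->le_bp = *bp;
			e->le_refcnt = 1;
			avl_add(lia->avl, e);
		} else {
			/* Repeated FREE: only dedup was expected here. */
			ASSERT(BP_GET_DEDUP(bp));
			found->le_refcnt++;
		}
	} else {
		if (found == NULL) {
			/* ALLOC with no pending FREE: block is live. */
			bplist_append(lia->to_free, bp);
		} else if (--found->le_refcnt == 0) {
			/* All tracked FREEs matched by ALLOCs. */
			avl_remove(lia->avl, found);
			kmem_free(found, sizeof (*found));
		} else {
			/* ALLOC matching a tracked FREE: dedup assumed. */
			ASSERT(BP_GET_DEDUP(bp));
		}
	}
	return (0);
}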
(In reply to Robert Clausecker from comment #3) (Kurt too?)

FreeBSD version? Prior version? There have recently been (this month):

Fri, 08 Dec 2023
. . . git: 3494f7c019fc - main - Notable upstream pull request merges: . . .
Martin Matuska
(It had a very long summary line.)

Fri, 15 Dec 2023
git: 5fb307d29b36 - main - zfs: merge openzfs/zfs@86e115e21
Martin Matuska

Tue, 19 Dec 2023
. . . git: 188408da9f7c - main - zfs: merge openzfs/zfs@dbda45160
Martin Matuska

With no other reports added until yours and Kurt's, I wonder if more recent changes are involved than the ones John ran into. I say this in part because of your report of a likely lack of dedup as context for your failure.
Block cloning acts a lot like deduplication and may trigger the assertion failure without deduplication being explicitly enabled.
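To illustrate that point, a hedged console sketch (the paths are made up): with OpenZFS 2.2 block cloning allowed, an ordinary cp(1) on FreeBSD can go through copy_file_range(2) and clone blocks rather than copy the data, with no dedup property set anywhere. The bcloneused/bclonesaved/bcloneratio zpool properties report the resulting sharing:

# check whether block cloning is allowed (OpenZFS 2.2+)
sysctl vfs.zfs.bclone_enabled
# an ordinary copy may clone blocks instead of copying the data
cp /pool/data/bigfile /pool/data/bigfile.copy
# pool-wide block-cloning statistics
zpool get bcloneused,bclonesaved,bcloneratio pool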
(In reply to John F. Carr from comment #8)
> Block cloning acts a lot like deduplication and may
> trigger the assertion failure without deduplication
> being explicitly enabled.

Okay. It still leaves the early-2022 to late-2023 time frame difference, plus two reports in short order in late 2023, as suggestive of recent changes being involved in the recent reports.

It could be somewhat useful if Robert and Kurt gave more details about the versions they used over that 2022-to-late-2023 time frame, assuming they ran the same kind of activity fairly frequently on various versions without problems until recently. No claim that such information would be definitive.
(In reply to Mark Millard from comment #7)

I upgraded to 15 around the 8th of September 2023 and had some crashes due to out-of-memory during poudriere runs. I upgraded the system to the latest 15 on the 23rd of December, and after that it had a series of crashes with the panic given in the PR (and some others), some during poudriere runs. But tonight, probably during the daily jobs, it crashed with a different panic. See https://people.freebsd.org/~pi/crash/ for all those where I still have the textdumps.
(In reply to Kurt Jaeger from comment #10)

One more thing: when I updated in September, I did not upgrade my zpool, so its feature set is still at the state of 14 or so: https://people.freebsd.org/~pi/crash/features.txt
(In reply to Kurt Jaeger from comment #11)

Oops, and it had block_cloning active/enabled 8-(
(In reply to Kurt Jaeger from comment #12)

Another crash, on a newly created zpool, with the error:

VERIFY3(l->blk_birth == r->blk_birth) failed (9909 == 9902)

https://people.freebsd.org/~pi/20240101-pr261538/

I was able to recover the textdump this time.
(In reply to Kurt Jaeger from comment #13)

Pointer from netchild@: use sysctl vfs.zfs.bclone_enabled=0 to avoid block cloning.

Testcase: build shells/bash using poudriere; it led to a certain crash before, and works now.
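For anyone landing here, the workaround boils down to the following (standard FreeBSD sysctl usage; persisting it via /etc/sysctl.conf is the usual convention and assumes the knob is a runtime sysctl, as it is on recent builds):

# disable block cloning immediately
sysctl vfs.zfs.bclone_enabled=0
# keep it disabled across reboots
echo 'vfs.zfs.bclone_enabled=0' >> /etc/sysctl.conf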
I was able to reproduce two different panics there: one specific to block cloning, another also happening for dedup. They seem to be incorrect assertions. This should fix them: https://github.com/openzfs/zfs/pull/15732 .
*** Bug 262760 has been marked as a duplicate of this bug. ***