Bug 261538 - zfs panic: VERIFY3(l->blk_birth == r->blk_birth) failed
Summary: zfs panic: VERIFY3(l->blk_birth == r->blk_birth) failed
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 15.0-CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-fs (Nobody)
URL:
Keywords: crash, needs-qa, patch
Duplicates: 262760
Depends on:
Blocks:
 
Reported: 2022-01-28 20:16 UTC by John F. Carr
Modified: 2024-05-16 16:33 UTC (History)
8 users

See Also:


Attachments
disable blk_birth assertion (709 bytes, patch)
2022-02-22 19:09 UTC, John F. Carr

Description John F. Carr 2022-01-28 20:16:31 UTC
While an overnight poudriere run was cleaning up my system crashed with a zfs panic.

On the console I saw (retyped):

panic: VERIFY3(l->blk_birth == r->blk_birth) failed (9269896 == 9269889)

cpuid = 15
time = 1643398268
#3 livelist_compare
#4 avl_find
#5 dsl_livelist_iterate
#6 bpobj_iterate_blkptrs+0x104
#7 bpobj_iterate_impl+0x154
#8 dsl_process_sub_livelist+0x60
#9 spa_livelist_delete_cb+0xf8
#10 zthr_procedure+0xc0
#11 fork_exit+0x74
#12 fork_trampoline+0x10
Uptime: 1d3h46m6s

In the terminal window with my ssh connection:
[18:50:24] Logs: /data/logs/bulk/builder-default/2022-01-27_19h05m48s
[19:25:12] [24] [19:24:56] Finished devel/llvm13 | llvm13-13.0.0_3: Success
[19:25:15] Stopping 32 builders
builder-default-job-01: removed
builder-default-job-02: removed
builder-default-job-01-n: removed
builder-default-job-02-n: removed
builder-default-job-03: removed
builder-default-job-03-n: removed
builder-default-job-04: removed
builder-default-job-04-n: removed
builder-default-job-05: removed
builder-default-job-05-n: removed
builder-default-job-06: removed
builder-default-job-06-n: removed
builder-default-job-07: removed
builder-default-job-07-n: removed
client_loop: send disconnect: Broken pipe

My server is an Ampere eMAG (32 cores) with root on ZFS on an NVMe drive.  It is running a very recent version of stable/13.

The dump will take hours to complete and I don't know if I will have a usable crash dump at the end.
Comment 1 John F. Carr 2022-01-28 21:23:02 UTC
After waiting about 100 minutes for the dump to complete, I logged in again and started poudriere, and the system crashed immediately with a similar message, differing only in the numbers. So I must have a corrupt pool.

panic: VERIFY3(l->blk_birth == r->blk_birth) failed (9269909 == 9269883)

There were a large number of build pools not cleaned up from the last run:

zroot/poudriere/jails/builder                 1.13G   322G     1.13G  /usr/local/poudriere/jails/builder
zroot/poudriere/jails/builder-default-ref     4.98G   322G     1.13G  /data/.m/builder-default/ref
zroot/poudriere/jails/builder-default-ref/06   215M   322G     1.34G  /data/.m/builder-default/06
zroot/poudriere/jails/builder-default-ref/07   219M   322G     1.34G  /data/.m/builder-default/07
...and so on up to /32

Here is my kernel ident:

FreeBSD 13.0-STABLE #20 stable/13-n249167-b1ced97e75a: Tue Jan 25 17:07:54 EST 2022
    root@marmota:/usr/obj/usr/src/arm64.aarch64/sys/MARMOTA arm64
FreeBSD clang version 13.0.0 (git@github.com:llvm/llvm-project.git llvmorg-13.0.0-0-gd7b669b3a303)

and root pool device:

nda0 at nvme0 bus 0 scbus4 target 0 lun 1
nda0: <XP1920LE30002 JA00ST05 ZEQ0065R>
nda0: nvme version 1.2 x1 (max x4) lanes PCIe Gen3 (max Gen3) link
nda0: 1831420MB (3750748848 512 byte sectors)
Comment 2 John F. Carr 2022-02-22 19:09:55 UTC
Created attachment 232017 [details]
disable blk_birth assertion

I most likely have the problem from https://github.com/openzfs/zfs/issues/11480: the combination of deduplication and cloning can generate blocks with identical DVA_GET_OFFSET but different blk_birth. The assertion that failed is not valid on pools with deduplication, like my system. Disabling the failing assertion allows deleting the pools that could not be deleted before.
Comment 3 Robert Clausecker freebsd_committer freebsd_triage 2023-12-20 00:32:23 UTC
Also got this issue with Poudriere, though I don't think dedup is enabled on any of my datasets.
Comment 4 Kurt Jaeger freebsd_committer freebsd_triage 2023-12-26 10:42:07 UTC
After updating my ports builder to HEAD around the 23rd of December 2023, I ran into the same problem.
Comment 5 Kurt Jaeger freebsd_committer freebsd_triage 2023-12-26 11:04:25 UTC
(In reply to Kurt Jaeger from comment #4)
It looks like I have some zfs clones left over from the first crash:
zroot/pou/jails/150-default-ref/01   569M   252G  2.80G  /pou/data/.m/150-default/01
zroot/pou/jails/150-default-ref/02   551M   252G  2.78G  /pou/data/.m/150-default/02
zroot/pou/jails/150-default-ref/03   455M   252G  2.69G  /pou/data/.m/150-default/03
zroot/pou/jails/150-default-ref/04  54.1M   252G  2.30G  /pou/data/.m/150-default/04
zroot/pou/jails/150-default-ref/05  57.5M   252G  2.30G  /pou/data/.m/150-default/05
zroot/pou/jails/150-default-ref/07  61.0M   252G  2.31G  /pou/data/.m/150-default/07
zroot/pou/jails/150-default-ref/08  84.8M   252G  2.33G  /pou/data/.m/150-default/08
zroot/pou/jails/150-default-ref/09   170M   252G  2.41G  /pou/data/.m/150-default/09
zroot/pou/jails/150-default-ref/10   242M   252G  2.48G  /pou/data/.m/150-default/10
zroot/pou/jails/150-default-ref/11   240M   252G  2.48G  /pou/data/.m/150-default/11
zroot/pou/jails/150-default-ref/12  47.9M   252G  2.29G  /pou/data/.m/150-default/12
zroot/pou/jails/150-default-ref/13   155M   252G  2.40G  /pou/data/.m/150-default/13
zroot/pou/jails/150-default-ref/14   229M   252G  2.47G  /pou/data/.m/150-default/14
zroot/pou/jails/150-default-ref/15  43.6M   252G  2.29G  /pou/data/.m/150-default/15
zroot/pou/jails/150-default-ref/16  43.5M   252G  2.29G  /pou/data/.m/150-default/16
zroot/pou/jails/150-default-ref/27  41.6M   252G  2.29G  /pou/data/.m/150-default/27
zroot/pou/jails/150-default-ref/28   242M   252G  2.48G  /pou/data/.m/150-default/28
Comment 6 Kurt Jaeger freebsd_committer freebsd_triage 2023-12-26 11:25:01 UTC
(In reply to Kurt Jaeger from comment #5)
zfs destroy on one of those will cause the crash.
Rebooting with the patch applied (the assert commented out) crashes with:

panic: VERIFY(BP_GET_DEDUP(bp)) failed

There are two asserts of that kind in dsl_livelist_iterate(), which is called by
dsl_process_sub_livelist().

So the patch does not fix the poudriere cloned filesystem issue we have.
Comment 7 Mark Millard 2023-12-26 21:15:02 UTC
(In reply to Robert Clausecker from comment #3)
(Kurt too?)

FreeBSD version? Prior version? There has recently been (this month):

Fri, 08 Dec 2023
. . .
git: 3494f7c019fc - main - Notable upstream pull request merges: . . . Martin Matuska

(It had a very long summary line.)

Fri, 15 Dec 2023
git: 5fb307d29b36 - main - zfs: merge openzfs/zfs@86e115e21 Martin Matuska

Tue, 19 Dec 2023
. . .
git: 188408da9f7c - main - zfs: merge openzfs/zfs@dbda45160 Martin Matuska

With no other reports added until yours and Kurt's, I wonder whether more recent changes are involved than what John ran into. This is suggested in part by your report that dedup is likely not enabled in the context of your failure.
Comment 8 John F. Carr 2023-12-26 21:18:42 UTC
Block cloning acts a lot like deduplication and may trigger the assertion failure without deduplication being explicitly enabled.
Comment 9 Mark Millard 2023-12-26 22:35:37 UTC
(In reply to John F. Carr from comment #8)

> Block cloning acts a lot like deduplication and may
> trigger the assertion failure without deduplication
> being explicitly enabled.

Okay.

It still leaves the gap between early 2022 and late 2023, plus two reports in short order in late 2023, as suggestive of recent changes being involved in the recent reports.

It could be somewhat useful if Robert and Kurt gave more details about the versions they used over that 2022 to late 2023 time frame for failures in similar contexts, assuming they were using various versions over that period for the same kind of activity fairly frequently but without problems until recently.

No claim such information would be definitive.
Comment 10 Kurt Jaeger freebsd_committer freebsd_triage 2023-12-27 08:05:45 UTC
(In reply to Mark Millard from comment #7)
I upgraded to 15 around the 8th of September 2023 and had some crashes due to out-of-memory during poudriere runs.

I upgraded the system to the latest 15 on the 23rd of December, and after that it had a series of crashes with the panic given in the PR (and some others), some during poudriere. But tonight, probably during the daily jobs, it crashed with a different panic. See

https://people.freebsd.org/~pi/crash/

for all those where I still have the textdumps.
Comment 11 Kurt Jaeger freebsd_committer freebsd_triage 2023-12-27 08:53:17 UTC
(In reply to Kurt Jaeger from comment #10)
One more thing: When I updated in September, I did not upgrade my
zpool, so it's still at the state of 14 or so:

https://people.freebsd.org/~pi/crash/features.txt
Comment 12 Kurt Jaeger freebsd_committer freebsd_triage 2023-12-27 09:01:54 UTC
(In reply to Kurt Jaeger from comment #11)
Oops, and it had block_cloning active/enabled 8-(
Comment 13 Kurt Jaeger freebsd_committer freebsd_triage 2024-01-01 14:18:38 UTC
(In reply to Kurt Jaeger from comment #12)
Another crash with the error: VERIFY3(l->blk_birth == r->blk_birth) failed (9909 == 9902)

https://people.freebsd.org/~pi/20240101-pr261538/

on a newly created zpool. I was able to recover the textdump this time.
Comment 14 Kurt Jaeger freebsd_committer freebsd_triage 2024-01-02 08:09:45 UTC
(In reply to Kurt Jaeger from comment #13)
Pointer from netchild@:

use

sysctl vfs.zfs.bclone_enabled=0

to avoid block-cloning.

Testcase: building shells/bash using poudriere led to a certain crash before; it works now.
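For anyone wanting to apply the workaround persistently, this is standard FreeBSD sysctl handling (nothing specific to this PR):

```shell
# Disable block cloning at runtime (takes effect immediately):
sysctl vfs.zfs.bclone_enabled=0

# Persist across reboots via /etc/sysctl.conf:
echo 'vfs.zfs.bclone_enabled=0' >> /etc/sysctl.conf

# Verify the current value:
sysctl vfs.zfs.bclone_enabled
```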
Comment 15 Alexander Motin freebsd_committer freebsd_triage 2024-01-02 22:44:36 UTC
I was able to reproduce two different panics there, one specific to block cloning, another also happening for dedup. They seem to be incorrect assertions. This should fix them: https://github.com/openzfs/zfs/pull/15732 .
Comment 16 Alexander Motin freebsd_committer freebsd_triage 2024-01-02 22:57:11 UTC
*** Bug 262760 has been marked as a duplicate of this bug. ***