Summary:    Reproducible zpool(8) panic with 14.0-RELEASE amd64-zfs.raw VM-IMAGES
Product:    Base System
Component:  kern
Reporter:   Michael Dexter <editor>
Assignee:   Alexander Motin <mav>
Status:     Closed FIXED
Severity:   Affects Only Me
Priority:   ---
Keywords:   crash
CC:         allanjude, dch, emaste, markj, mav, rob2g2-freebsd, ronald
Version:    14.0-RELEASE
Hardware:   amd64
OS:         Any
URL:        https://github.com/openzfs/zfs/pull/16162
Description
Michael Dexter
2024-04-17 17:24:57 UTC
Created attachment 250032 [details]
Text dump of the panic
Created attachment 250033 [details]
Script to reproduce the panic (re-upload)
I can add syntax to import the zpool with a different name if 'zroot' is a problem for you.

Update: I can reproduce the panic on an AMD Ryzen 7 5800H system.

Observation: The stock VM-IMAGE only has one label:

------------------------------------
LABEL 0
------------------------------------
    txg: 4
    version: 5000
    state: 1
    name: 'zroot'
    pool_guid: 4016146626377348012
    top_guid: 100716240520803340
    guid: 100716240520803340
    vdev_children: 1
    features_for_read:
    vdev_tree:
        type: 'disk'
        ashift: 12
        asize: 5363990528
        guid: 100716240520803340
        id: 0
        path: '/dev/null'
        whole_disk: 1
        create_txg: 4
        metaslab_array: 2
        metaslab_shift: 29
    labels = 0 1 2 3

Update: I extracted the rootfs partition from the VM-IMAGE to a separate file, zroot.raw, to simplify things, and found that:

1. While 'zdb -l zroot.raw' shows the label output (with a single label), I cannot import the pool via the file.
2. Attaching it with mdconfig works fine, but it could produce the panic on resilver within two runs.
3. dmesg does not report anything.
4. zpool status -v did not show any checksum errors.

The issue appears to be associated with attach and resilver (see the command sketch a few comments below). I have no idea why it happens after two to eight or so repetitions using the exact same source image.

Next: Testing on 14-stable and 15-current.

(In reply to Michael Dexter from comment #5)
> Observation: The stock VM-IMAGE only has one label

See how it says "labels = 0 1 2 3"; that means it has all 4 labels, but they are identical. If you get output that prints multiple labels, they are in some way different, and that should be investigated more closely.

(In reply to Allan Jude from comment #7)
Thank you, Allan.

All: For context, this feature is marked as experimental and has revealed issues already. Let's get it stable!

Created attachment 250042 [details]
Core dump from last week's 15-CURRENT CLOUDINIT VM-IMAGE
Same host, 15-CURRENT VM-IMAGE.
Where did the root-on-ZFS non-CLOUDINIT raw images go?
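[For reference, here is a minimal, approximate sketch of the attach-and-resilver loop described in the comments above. It is not the attached reproduction script (attachment 250033): the file names zroot.raw and target.raw, the import/export steps, and the use of a blank attach target are illustrative assumptions; only the mdconfig/attach/resilver pattern follows the report.]

#!/bin/sh
# Approximate sketch only -- the real reproducer is attachment 250033.
# Assumes zroot.raw is the extracted freebsd-zfs partition of the VM-IMAGE
# and that the pool inside it is named 'zroot'.

truncate -s "$(stat -f %z zroot.raw)" target.raw   # blank file of equal size

md0=$(mdconfig -a -t vnode -f zroot.raw)           # original pool vdev
md1=$(mdconfig -a -t vnode -f target.raw)          # device to attach

zpool import -f -N -R /mnt zroot                   # import from the md-backed vdev
zpool attach zroot "$md0" "$md1"                   # mirror attach -> resilver (scan)
zpool wait -t resilver zroot                       # wait for the resilver to finish
zpool status -v zroot                              # reported panics hit within ~2-8 runs

zpool export zroot
mdconfig -d -u "$md0"
mdconfig -d -u "$md1"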
At the risk of having the wrong crash dump... do note the attached 15-CURRENT VM-IMAGE text dump:

KDB: stack backtrace:
#0 0xffffffff80b9009d at kdb_backtrace+0x5d
#1 0xffffffff80b431a2 at vpanic+0x132
#2 0xffffffff80b43063 at panic+0x43
#3 0xffffffff8100c85c at trap_fatal+0x40c
#4 0xffffffff8100c8af at trap_pfault+0x4f
#5 0xffffffff80fe3ad8 at calltrap+0x8
#6 0xffffffff81f3e9c3 at avl_remove+0x1a3
#7 0xffffffff820285c8 at dsl_scan_visit+0x2c8
#8 0xffffffff820275ad at dsl_scan_sync+0xc6d
#9 0xffffffff820541e6 at spa_sync+0xb36
#10 0xffffffff8206b3ab at txg_sync_thread+0x26b
#11 0xffffffff80afdb7f at fork_exit+0x7f
#12 0xffffffff80fe4b3e at fork_trampoline+0xe

Most follow this pattern.

Interesting: A clean 14.0 system does not exhibit the issue; only 14.0p2 and 14.0p5 do. I cannot yet say if the patch level plays a part in this, but I see 14.0p2 had some VFS changes: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275200

In your script you try to set different labels, but some of them are still the same. See bootfs1 and swapfs2:

gpart modify -i 1 -l bootfs1 md10
gpart modify -i 1 -l bootfs1 md10
gpart modify -i 2 -l efiesp1 md10
gpart modify -i 2 -l efiesp2 md11
gpart modify -i 3 -l swapfs2 md11
gpart modify -i 3 -l swapfs2 md11
gpart modify -i 4 -l rootfs1 md10
gpart modify -i 4 -l rootfs2 md11

Is this intentional?

(In reply to Ronald Klop from comment #11)
Good eye! Unfortunately, the issue still exists even when I work only with the freebsd-zfs partition. I have fixed the partition numbers and have it running on three systems running 14.0p6 (two Intel, one AMD), and the issue persists with:

panic: VERIFY(sds != NULL) failed

KDB: stack backtrace:
#0 0xffffffff80b9009d at kdb_backtrace+0x5d
#1 0xffffffff80b431a2 at vpanic+0x132
#2 0xffffffff81f7d07a at spl_panic+0x3a
#3 0xffffffff8202f6d1 at dsl_scan_visit+0x3d1
#4 0xffffffff8202e5ad at dsl_scan_sync+0xc6d
#5 0xffffffff8205b1e6 at spa_sync+0xb36
#6 0xffffffff820723ab at txg_sync_thread+0x26b
#7 0xffffffff80afdb7f at fork_exit+0x7f
#8 0xffffffff80fe4b2e at fork_trampoline+0xe

Perhaps you would like to experiment with a makefs -t zfs image. This is the syntax used by /usr/src/release/tools/vmimage.subr, with a 128m image:

mkdir -p /tmp/rootfs/ROOT/default
mkdir -p /tmp/rootfs/usr/ports
mkdir -p /tmp/rootfs/var/audit

makefs -t zfs -s 128m -B little \
    -o 'poolname=zroot' \
    -o 'bootfs=zroot/ROOT/default' \
    -o 'rootpath=/' \
    -o 'fs=zroot;mountpoint=none' \
    -o 'fs=zroot/ROOT;mountpoint=none' \
    -o 'fs=zroot/ROOT/default;mountpoint=/' \
    -o 'fs=zroot/home;mountpoint=/home' \
    -o 'fs=zroot/tmp;mountpoint=/tmp;exec=on;setuid=off' \
    -o 'fs=zroot/usr;mountpoint=/usr;canmount=off' \
    -o 'fs=zroot/usr/ports;setuid=off' \
    -o 'fs=zroot/usr/src' \
    -o 'fs=zroot/usr/obj' \
    -o 'fs=zroot/var;mountpoint=/var;canmount=off' \
    -o 'fs=zroot/var/audit;setuid=off;exec=off' \
    -o 'fs=zroot/var/log;setuid=off;exec=off' \
    -o 'fs=zroot/var/mail;atime=on' \
    -o 'fs=zroot/var/tmp;setuid=off' \
    /tmp/raw.zfs.img /tmp/rootfs

Note:

zdb -l /tmp/raw.zfs.img
zpool import -d /tmp/raw.zfs.img

truncate -s 128m /tmp/img.raw
zpool create foo /tmp/img.raw
zpool export foo
zpool import -d /tmp/img.raw

The img.raw created with truncate and zpool create can be imported, while the makefs one reports:

   pool: zroot
     id: 17927745092259738836
  state: UNAVAIL
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
         see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-5E
 config:

        zroot           UNAVAIL  insufficient replicas
          /tmp/img.raw  UNAVAIL  invalid label

The makefs-generated image WILL import if md attached, but not as a file with -d.
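[For clarity, a minimal sketch of that md-attach path, assuming the makefs image produced by the commands above is at /tmp/raw.zfs.img; the -N and -R flags, altroot, and cleanup steps are illustrative, not taken from the report.]

# Attach the makefs-generated image as a memory disk and import through
# the device node instead of pointing -d at the raw file.
md=$(mdconfig -a -t vnode -f /tmp/raw.zfs.img)

zpool import -d /dev/"$md" -N -R /mnt zroot   # succeeds via the md device
zpool export zroot
mdconfig -d -u "$md"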
It seems that makefs-generated ZFS images, by skipping some optional dataset structures, activate code paths that have not been used since before the ancient zpool version 11 and that are missing some locks. I think https://github.com/openzfs/zfs/pull/16162 should fix the problem. Though I wonder whether/when ZFS will regenerate those structures without an explicit upgrade from the pre-11 pool version, or whether it will forever use the old code.

(In reply to Alexander Motin from comment #15)
Is the missing structure the "ds_next_clones_obj"? It looks like ZFS should add this one automatically.

(In reply to Mark Johnston from comment #16)
I haven't looked deeply into what this code does, but as I see it, it is activated by the absence of dp_origin_snap, added in SPA_VERSION_DSL_SCRUB, and of ds_next_clones_obj, added in SPA_VERSION_NEXT_CLONES.

(In reply to Alexander Motin from comment #17)
I don't think ZFS will automatically regenerate these structures; makefs needs to handle it.

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=49086aa35d987b78dbc3c9ec94814fe338e07164

commit 49086aa35d987b78dbc3c9ec94814fe338e07164
Author:     Alexander Motin <mav@FreeBSD.org>
AuthorDate: 2024-05-23 16:20:37 +0000
Commit:     Alexander Motin <mav@FreeBSD.org>
CommitDate: 2024-05-23 16:20:37 +0000

    Fix scn_queue races on very old pools

    Code for pools before version 11 uses dmu_objset_find_dp() to scan
    for children datasets/clones. It calls enqueue_clones_cb() and
    enqueue_cb() callbacks in parallel from multiple taskq threads. It
    ends up bad for scan_ds_queue_insert(), corrupting scn_queue
    AVL-tree. Fix it by introducing a mutex to protect those two
    scan_ds_queue_insert() calls. All other calls are done from the
    sync thread and so serialized.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Signed-off-by: Alexander Motin <mav@FreeBSD.org>
    Sponsored by: iXsystems, Inc.
    Closes #16162

    PR: 278414

 sys/contrib/openzfs/include/sys/dsl_scan.h | 1 +
 sys/contrib/openzfs/module/zfs/dsl_scan.c  | 6 ++++++
 2 files changed, 7 insertions(+)
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=455ce1729353f2ffce9713ccc3574e73186a22f0

commit 455ce1729353f2ffce9713ccc3574e73186a22f0
Author:     Alexander Motin <mav@FreeBSD.org>
AuthorDate: 2024-05-23 16:20:37 +0000
Commit:     Alexander Motin <mav@FreeBSD.org>
CommitDate: 2024-05-23 16:24:55 +0000

    Fix scn_queue races on very old pools

    Code for pools before version 11 uses dmu_objset_find_dp() to scan
    for children datasets/clones. It calls enqueue_clones_cb() and
    enqueue_cb() callbacks in parallel from multiple taskq threads. It
    ends up bad for scan_ds_queue_insert(), corrupting scn_queue
    AVL-tree. Fix it by introducing a mutex to protect those two
    scan_ds_queue_insert() calls. All other calls are done from the
    sync thread and so serialized.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Signed-off-by: Alexander Motin <mav@FreeBSD.org>
    Sponsored by: iXsystems, Inc.
    Closes #16162

    PR: 278414
    (cherry picked from commit 49086aa35d987b78dbc3c9ec94814fe338e07164)

 sys/contrib/openzfs/include/sys/dsl_scan.h | 1 +
 sys/contrib/openzfs/module/zfs/dsl_scan.c  | 6 ++++++
 2 files changed, 7 insertions(+)

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9898f936aa69d1b67bcd83d189acb6013f76bd43

commit 9898f936aa69d1b67bcd83d189acb6013f76bd43
Author:     Alexander Motin <mav@FreeBSD.org>
AuthorDate: 2024-05-23 16:20:37 +0000
Commit:     Alexander Motin <mav@FreeBSD.org>
CommitDate: 2024-05-23 17:43:02 +0000

    Fix scn_queue races on very old pools

    Code for pools before version 11 uses dmu_objset_find_dp() to scan
    for children datasets/clones. It calls enqueue_clones_cb() and
    enqueue_cb() callbacks in parallel from multiple taskq threads. It
    ends up bad for scan_ds_queue_insert(), corrupting scn_queue
    AVL-tree. Fix it by introducing a mutex to protect those two
    scan_ds_queue_insert() calls. All other calls are done from the
    sync thread and so serialized.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Signed-off-by: Alexander Motin <mav@FreeBSD.org>
    Sponsored by: iXsystems, Inc.
    Closes #16162

    PR: 278414
    (cherry picked from commit 49086aa35d987b78dbc3c9ec94814fe338e07164)

 sys/contrib/openzfs/include/sys/dsl_scan.h | 1 +
 sys/contrib/openzfs/module/zfs/dsl_scan.c  | 6 ++++++
 2 files changed, 7 insertions(+)

A commit in branch releng/14.1 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=856d35337225d77948b43ee5d479baa2588963ec

commit 856d35337225d77948b43ee5d479baa2588963ec
Author:     Alexander Motin <mav@FreeBSD.org>
AuthorDate: 2024-05-23 16:20:37 +0000
Commit:     Alexander Motin <mav@FreeBSD.org>
CommitDate: 2024-05-23 18:11:36 +0000

    Fix scn_queue races on very old pools

    Code for pools before version 11 uses dmu_objset_find_dp() to scan
    for children datasets/clones. It calls enqueue_clones_cb() and
    enqueue_cb() callbacks in parallel from multiple taskq threads. It
    ends up bad for scan_ds_queue_insert(), corrupting scn_queue
    AVL-tree. Fix it by introducing a mutex to protect those two
    scan_ds_queue_insert() calls. All other calls are done from the
    sync thread and so serialized.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Signed-off-by: Alexander Motin <mav@FreeBSD.org>
    Sponsored by: iXsystems, Inc.
    Closes #16162

    PR: 278414
    Approved by: re (cperciva)
    (cherry picked from commit 49086aa35d987b78dbc3c9ec94814fe338e07164)
    (cherry picked from commit 455ce1729353f2ffce9713ccc3574e73186a22f0)

 sys/contrib/openzfs/include/sys/dsl_scan.h | 1 +
 sys/contrib/openzfs/module/zfs/dsl_scan.c  | 6 ++++++
 2 files changed, 7 insertions(+)

The fix for the ZFS panic is merged into releng/14.1 and the stable branches. I hope makefs will also be updated some day.

I have built this on 15-CURRENT shortly after the commit and SO FAR SO GOOD! Thank you everyone!

On 15.0-CURRENT #0 main-n270474-d2f1f71ec8c6, one can image the weekly VM-IMAGE to a hardware device, boot it on hardware, back up its partitions to a second device, dd over the first two partitions, 'zpool attach' the second data partition, wait for the resilver, pull the original drive during a reboot, boot, and online it for full restoration of the pool.

This is resolved until further notice. Thank you to everyone who made this happen!
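[For reference, a rough command-level sketch of the restore-by-mirroring workflow described in the closing comment, assuming the VM-IMAGE partition layout shown earlier (p1 freebsd-boot, p2 EFI, p3 swap, p4 freebsd-zfs) and the illustrative device names ada0 (drive imaged from the VM-IMAGE) and ada1 (second drive); the gpart step, partition indices, and device names are assumptions, not taken from the report.]

# On the system booted from ada0, with ada1 as the second drive:

gpart backup ada0 | gpart restore -F ada1   # replicate the partition table

dd if=/dev/ada0p1 of=/dev/ada1p1 bs=1m      # copy the freebsd-boot partition
dd if=/dev/ada0p2 of=/dev/ada1p2 bs=1m      # copy the EFI system partition

zpool attach zroot ada0p4 ada1p4            # mirror the freebsd-zfs partition
zpool wait -t resilver zroot                # wait for the resilver to complete

# Reboot with the original drive pulled, confirm the pool imports from ada1p4,
# then reinsert the original drive and bring its vdev back online:
zpool online zroot ada0p4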