281520 – zfs panic: VERIFY(!txg_list_member(&vd->vdev_ms_list, msp, t)) failed after install

Status:	Open

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	15.0-CURRENT
Hardware:	arm64 Any

Importance:	--- Affects Only Me
Assignee:	freebsd-fs (Nobody)

URL:	https://skunkwerks.at/~dch/OpenZFS/bo...
Keywords:	crash

Depends on:
Blocks:

Reported:	2024-09-15 16:02 UTC by Dave Cottlehuber
Modified:	2024-09-30 05:42 UTC
CC List:	0 users

See Also:

Attachments
dmesg + early boot console (4.55 KB, text/plain) 2024-09-16 20:58 UTC, Dave Cottlehuber	no flags	Details

Description Dave Cottlehuber freebsd_committer

2024-09-15 16:02:14 UTC

Repeatable panic on the first reboot after install:

cpuid = 1
time = 3
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x38
vpanic() at vpanic+0x1ac
spl_panic() at spl_panic+0x44
metaslab_fini() at metaslab_fini+0x474
vdev_metaslab_init() at vdev_metaslab_init+0x208
vdev_load() at vdev_load+0x78c
vdev_load_child() at vdev_load_child+0x14
taskq_run() at taskq_run+0x24
taskqueue_run_locked() at taskqueue_run_locked+0x17c
taskqueue_thread_loop() at taskqueue_thread_loop+0xc0
fork_exit() at fork_exit+0x78
fork_trampoline() at fork_trampoline+0x18
KDB: enter: panic
[ thread pid 5 tid 100150 ]
Stopped at      kdb_enter+0x48: str     xzr, [x19, #2048]
db>

The image is an arm64 image built from source (make release ...) using makefs:
- FreeBSD-15.0-CURRENT-arm64-aarch64-20240910-0871d4d-zfs

I will try to reproduce on "vanilla" CURRENT shortly.

Comment 1 Dave Cottlehuber freebsd_committer

2024-09-16 20:40:04 UTC

I cannot reproduce this in qemu arm64, but I made some progress on the OCI Ampere Altra VMs:

- this issue has been around for a while; the last 2 zfs merges
  do not appear to prevent the post-reboot panic

## working around the panic

- it is not sufficient just to wait 500 seconds

- nor is it enough to do some zfs & zpool transactions like bectl create/activate/destroy

- but unpacking e.g. base.txz into a temporary dataset *is* enough

I will pull down a borked zpool for reference
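
A minimal sketch of the last workaround, assuming a pool named zroot and a base.txz already present on the VM. The dataset name, mountpoint, and archive path are placeholders, not taken from this report. It defaults to a dry run that only prints the commands:

```shell
#!/bin/sh
# Sketch: unpack base.txz into a throwaway dataset before the first reboot.
# Every name below is an assumption; adjust to the actual system.
POOL="${POOL:-zroot}"                  # assumed pool name
SCRATCH="${SCRATCH:-$POOL/scratch}"    # hypothetical temporary dataset
MNT="${MNT:-/tmp/scratch}"             # hypothetical mountpoint
BASE_TXZ="${BASE_TXZ:-/tmp/base.txz}"  # assumed location of a base.txz
RUN="${RUN-echo}"                      # dry run by default; RUN= executes

churn_pool() {
    # Create the temporary dataset with an explicit mountpoint.
    $RUN zfs create -o mountpoint="$MNT" "$SCRATCH"
    # Unpack the archive into it to generate real write activity.
    $RUN tar -xf "$BASE_TXZ" -C "$MNT"
    # Cleanup (zfs destroy "$SCRATCH") is left out on purpose: the report
    # does not say whether the dataset was destroyed before rebooting.
}

churn_pool
```

Run it as root with RUN= (empty) to execute the commands instead of printing them.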

Comment 2 Dave Cottlehuber freebsd_committer

2024-09-16 20:58:44 UTC

Created attachment 253610 [details]
dmesg + early boot console

Comment 3 Dave Cottlehuber freebsd_committer

2024-09-17 12:33:08 UTC

I added the post-corruption zpool here; as a reminder, I see this assert on an
arm64 boot image.

https://skunkwerks.at/~dch/OpenZFS/borked-PR281520.zpool.qcow2.xz

Comments from #openzfs on IRC:

what i would suggest doing is setting compatibility= and doing zpool upgrade with
different featuresets to narrow down what state might be going wrong, or just backing
up the cloud disk images before first boot so you can compare on-disk state when
it worked and when it didn't

it doesn't seem to easily reproduce, but something to keep in mind, the VERIFY
that's tripping is an ASSERT, so it won't trigger on non-debug builds

but i can't immediately obviously reproduce it on my pi 4

something that would be useful, is if you can try only enabling spacemap_v2 or
log_spacemap, rather than just zpool upgrade -a and then seeing if it still
breaks.
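
The last suggestion could look roughly like the sketch below, which enables a single feature instead of running zpool upgrade -a. The pool name zroot and the FEATURE and RUN variables are assumptions for illustration. Per zpool-features(7), log_spacemap depends on spacemap_v2, so testing spacemap_v2 first on a fresh copy of the image keeps the two cases separate. It defaults to a dry run that only prints the commands:

```shell
#!/bin/sh
# Sketch: enable one pool feature at a time to narrow down the trigger.
POOL="${POOL:-zroot}"                 # assumed pool name
FEATURE="${FEATURE:-spacemap_v2}"     # or log_spacemap
RUN="${RUN-echo}"                     # dry run by default; RUN= executes

enable_feature() {
    # Enable exactly one feature on the pool.
    $RUN zpool set "feature@$1=enabled" "$POOL"
}

show_feature() {
    # Print the feature state (disabled, enabled, or active).
    $RUN zpool get -H -o value "feature@$1" "$POOL"
}

enable_feature "$FEATURE"
show_feature "$FEATURE"
```

Start each test from a pristine copy of the image, enable one feature, reboot, and note whether the panic still occurs.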

Comment 4 Dave Cottlehuber freebsd_committer

2024-09-17 12:41:17 UTC

How is this built?

- from an existing 15.0-CURRENT arm64 box (it should work from 14.1-RELEASE too, but I am on CURRENT here)
- the steps below should also be usable on amd64 FreeBSD if that helps; it will
  still produce the correct arm64 image
- clone the main branch of https://git.sr.ht/~dch/src into /usr/src
- switch to commit f7639cff05f63cfe38532bd70e33a890e1fe6b53
- run as root:

# export SRCCONF=/dev/null
# export SRC_ENV_CONF=/dev/null
# make -j2C buildworld  TARGET_ARCH=aarch64 TARGET=arm64 -s
# make -j2C buildkernel TARGET_ARCH=aarch64 TARGET=arm64 KERNCONF=GENERIC -s
# cd ./release
# make -j2C clean
# make -DNOPORTS -DNOSRC \
  WITHOUT_DEBUG_FILES=YES WITHOUT_KERNEL_SYMBOLS=YES \
  WITHOUT_LIB32=YES WITHOUT_TESTS=YES \
  KERNCONF=GENERIC \
  TARGET_ARCH=aarch64 TARGET=arm64 \
  WITH_CLOUDWARE=yes \
  CLOUDWARE=OCI -s cloudware-release

There is now a /usr/obj/projects/oci/14.1-RELEASE/arm64.aarch64/release/oci.zfs.raw

This file is converted to qcow2 with qemu-img for compression, before cloud upload:

qemu-img convert -S 512b -p -O qcow2 -c -o compression_type=zstd \
  /usr/obj/projects/oci/14.1-RELEASE/arm64.aarch64/release/oci.zfs.raw \
  /tmp/oci.zfs.qcow2
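
To rule out the conversion step as a source of on-disk differences, the two images can be compared with qemu-img compare. This check is not part of the original procedure; it is a sketch that reuses the paths above and defaults to a dry run that only prints the command:

```shell
#!/bin/sh
# Sketch: check that the compressed qcow2 holds the same data as the raw image.
RAW="${RAW:-/usr/obj/projects/oci/14.1-RELEASE/arm64.aarch64/release/oci.zfs.raw}"
QCOW2="${QCOW2:-/tmp/oci.zfs.qcow2}"
RUN="${RUN-echo}"   # dry run by default; RUN= executes

verify_conversion() {
    # -f and -F give the formats of the first and second image respectively.
    $RUN qemu-img compare -f raw -F qcow2 "$RAW" "$QCOW2"
}

verify_conversion
```

When executed for real, qemu-img compare exits 0 if the images are identical and 1 if they differ.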

I tested this (without seeing the same problem) via qemu on a fast amd64:

$ qemu-system-aarch64 \
 -m 4096M -cpu cortex-a57 -smp cores=4 -M virt -nodefaults \
 -bios edk2-aarch64-code.fd \
 -serial telnet::4444,server \
 -nographic -monitor none -vga none \
 -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0 \
 -rtc base=utc \
 -drive if=none,file=/tmp/FreeBSD-15.0-CURRENT-arm64-aarch64-20240916-f7639cf-zfs.qcow2,id=hd0 \
 -device virtio-blk-device,drive=hd0 \
 -snapshot

Comment 5 Dave Cottlehuber freebsd_committer

2024-09-17 20:12:23 UTC

NB the original qcow2 image (before corruption occurs) is here:

https://skunkwerks.at/~dch/OpenZFS/FreeBSD-15.0-CURRENT-arm64-aarch64-20240916-f7639cf-zfs.qcow2

Comment 6 Mark Linimon freebsd_committer

2024-09-30 05:42:48 UTC

^Triage: I'm not seeing a proposed fix yet, so "In Progress" may be premature.