Bug 252396

Summary: [zfs] [panic] Panic on 'bectl list' after slog removal
Product:    Base System
Reporter:   Jason Unovitch <junovitch>
Component:  kern
Assignee:   freebsd-fs (Nobody) <fs>
Status:     Closed FIXED
Severity:   Affects Some People
CC:         rew
Priority:   ---
Keywords:   crash
Version:    CURRENT
Hardware:   amd64
OS:         Any
Attachments:
fix panix after slog removal

Description Jason Unovitch freebsd_committer freebsd_triage 2021-01-04 00:14:39 UTC
Overview:

Following removal of a slog device, the pool is now in a state where 'bectl list' panics the system. This was discovered after removing the slog device from a machine because the SSD it was on had been acting up.

Steps to Reproduce:

This is on a fresh VM with a minimal reproduction.

1. Create a VM with two disks or suitable partitions for testing.

Assuming ada0 was used as the ZFS installation disk during bsdinstall, `gpart create -s gpt ada1; gpart add -t freebsd-zfs -a 4096 ada1` is sufficient to prepare the second disk.

2. Baseline -- will run without issues

bectl list

3. Add the slog

zpool add zroot log ada1p1

4. Baseline 2 -- will run without issues

bectl list

5. Remove the slog

zpool remove zroot ada1p1

6. Create a panic

bectl list

Result:

vpanic() at vpanic+0x181/frame 0xfffffe00a80af3b0
panic() at panic+0x43/frame 0xfffffe00a80af410
trap_fatal() at trap_fatal+0x387/frame 0xfffffe00a80af470
trap_pfault() at trap_pfault+0x97/frame 0xfffffe00a80af4d0
trap() at trap+0x2ab/frame 0xfffffe00a80af5e0
--- trap 0xc, rip = 0, rsp = 0xfffffe00a80af6b8, rbp = 0xfffffe00a80af6d0 ---
??() at 0/frame 0xfffffe00a80af6d0
spa_vdev_state_enter() at spa_vdev_state_enter+0x43/frame 0xfffffe00a80af700
zfs_ioc_get_bootenv() at zfs_ioc_get_bootenv+0x32/frame 0xfffffe00a80af730
zfsdev_ioctl_common() at zfsdev_ioctl_common+0x4e0/frame 0xfffffe00a80af7d0
zfsdev_ioctl() at zfsdev_ioctl+0x146/frame 0xfffffe00a80af800
devfs_ioctl() at devfs_ioctl+0xcc/frame 0xfffffe00a80af850
vn_ioctl() at vn_ioctl+0x131/frame 0xfffffe00a80af960
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe00a80af980
kern_ioctl() at kern_ioctl+0x289/frame 0xfffffe00a80afbf0
sys_ioctl() at sys_ioctl+0x12a/frame 0xfffffe00a80afbf0
amd64_syscall() at amd64_syscall+0x12a/frame 0xfffffe00a80afbf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00a80afbf0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x8003db45a, rsp = 0x7fffffffc668, rbp = 0x7fffffffc6d0 ---
KDB: enter: panic
[ thread pid 796 tid 100470 ]
Stopped at     kdb_enter+0x37: movq    $0,0x1099396(%rip)

Build Date & Hardware:

This example uses the 31 December snapshot ISOs:

FreeBSD test 13.0-CURRENT FreeBSD 13.0-CURRENT #0 main-c255460-g282381aa53a: Thu Dec 31 06:08:47 UTC 2020     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

Also observed on physical hardware with a GENERIC-NODEBUG amd64 kernel running:

FreeBSD bsd 13.0-CURRENT FreeBSD 13.0-CURRENT #0 main-c255337-g8d405efd73d: Sat Dec 26 17:07:08 UTC 2020

Additional Information:

Re-adding the slog device still results in a panic on 'bectl list'. The system also now panics during boot and will no longer fully boot to the OS.

As a workaround, the bectl create, mount, unmount, and activate commands all seem to work from minimal checks. It looks feasible to use zfs list to verify boot environment names and to use the other bectl commands to do minimal work on boot environments while 'bectl list' is causing a panic.
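
For example, a minimal sketch of that workaround (the boot environment name 'test' here is hypothetical):

zfs list -r zroot/ROOT    # verify boot environment dataset names without running 'bectl list'
bectl create test
bectl activate test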
Comment 1 Jason Unovitch freebsd_committer freebsd_triage 2021-01-04 00:50:41 UTC
Expanding on the comment that "Re-adding the slog device still results in a panic on 'bectl list'" and that the system now panics on boot: the report in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=245649 looks similar to what I observe after re-adding the slog device.
Comment 2 Jason Unovitch freebsd_committer freebsd_triage 2021-01-09 18:34:45 UTC
Additional ZFS-related commands that panic after step 5 (removal of the slog device):

zpool scrub zroot

zfsbootcfg

zfsbootcfg -p

bectl create test; zfsbootcfg "zfs:zroot/ROOT/test"


NB: manually setting the bootfs property does not panic.  This appears consistent with the observation that `bectl create test`/`bectl activate test` still allow changing the bootfs.

bectl create test; zpool set bootfs=zroot/ROOT/test zroot

Similarly, the current version of sysutils/beadm from ports does NOT panic like bectl(8) does:

beadm list
Comment 3 Robert Wing freebsd_committer freebsd_triage 2021-02-20 05:54:00 UTC
Created attachment 222658 [details]
fix panix after slog removal
Comment 4 Robert Wing freebsd_committer freebsd_triage 2021-02-20 05:57:14 UTC
Looks like this was patched by Patrick Mooney in https://www.illumos.org/issues/12981

I've submitted a pull request upstream to openzfs, https://github.com/openzfs/zfs/pull/11623

In the meantime, I've included the patch here.
Comment 5 Jason Unovitch freebsd_committer freebsd_triage 2021-02-21 18:05:20 UTC
"Tested by:" looks good... I can confirm on a VM and physical hardware running Glen's 13.0-BETA3 commit at 150b4388d3b5 plus the addition of this patch that there are no longer issues running the previous panic inducing commands.

I've checked the specific commands below. They all run fine in the VM test, and aside from not checking the very last one on active physical hardware, the others no longer induce a panic.

bectl list
zpool scrub zroot
zfsbootcfg
zfsbootcfg -p
bectl create test; zfsbootcfg "zfs:zroot/ROOT/test:"

Robert, thank you for the effort tracking this down.  Will this make it into an upcoming 13.0-RC?
Comment 6 commit-hook freebsd_committer freebsd_triage 2021-02-22 17:28:12 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=64649f0285424435634c2dfd39f49536fc2b50dd

commit 64649f0285424435634c2dfd39f49536fc2b50dd
Author:     Martin Matuska <mm@FreeBSD.org>
AuthorDate: 2021-02-22 17:05:07 +0000
Commit:     Martin Matuska <mm@FreeBSD.org>
CommitDate: 2021-02-22 17:26:12 +0000

    zfs: fix panic if scrubbing after removing a slog device

    From openzfs-master 11f2e9a4 commit message:
      vdev_ops: don't try to call vdev_op_hold or vdev_op_rele when NULL

      This prevents a panic after a SLOG add/removal on the root pool followed
      by a zpool scrub.

      When a SLOG is removed, a hole takes its place - the vdev_ops for a hole
      is vdev_hole_ops, which defines the handler functions of vdev_op_hold
      and vdev_op_rele as NULL.

    Patch Author: Patrick Mooney <pmooney@pfmooney.com>

    Obtained from:  openzfs/zfs@11f2e9a491baa2ae3fc00f6b8b892fa91a852ca1
    PR:             252396
    MFS after:      3 days

    (direct commit)

 sys/contrib/openzfs/module/zfs/vdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
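
To make the mechanism in the commit message above concrete, below is a small self-contained C model of the crash and of the guard being added. The structures and function names here are simplified stand-ins, not the actual OpenZFS vdev.c code (the real ops tables and tree traversal live in sys/contrib/openzfs/module/zfs/vdev.c):

/*
 * Simplified, self-contained model of the crash and the fix described in
 * the commit message above -- NOT the actual OpenZFS code.  A hole vdev
 * (what slog removal leaves behind) has an ops table whose hold/rele
 * callbacks are NULL; calling one unconditionally jumps to address 0,
 * matching the "rip = 0" trap in the backtrace above.
 */
#include <stdio.h>
#include <stddef.h>

struct vdev;

typedef struct vdev_ops {
    void    (*vdev_op_hold)(struct vdev *);
    void    (*vdev_op_rele)(struct vdev *);
    int     vdev_op_leaf;
} vdev_ops_t;

typedef struct vdev {
    const vdev_ops_t    *vdev_ops;
    const char          *vdev_path;
} vdev_t;

static void
disk_hold(struct vdev *vd)
{
    printf("hold %s\n", vd->vdev_path);
}

static void
disk_rele(struct vdev *vd)
{
    printf("rele %s\n", vd->vdev_path);
}

/* A real leaf vdev provides hold/rele; the hole vdev leaves them NULL. */
static const vdev_ops_t model_disk_ops = { disk_hold, disk_rele, 1 };
static const vdev_ops_t model_hole_ops = { NULL, NULL, 1 };

static void
model_vdev_hold(vdev_t *vd)
{
    /* The fix: skip the callback when the ops table does not provide one. */
    if (vd->vdev_ops->vdev_op_leaf && vd->vdev_ops->vdev_op_hold != NULL)
        vd->vdev_ops->vdev_op_hold(vd);
}

int
main(void)
{
    vdev_t disk = { &model_disk_ops, "ada1p1" };
    vdev_t hole = { &model_hole_ops, "hole" };

    model_vdev_hold(&disk);   /* prints "hold ada1p1" */
    model_vdev_hold(&hole);   /* without the NULL check, this would call address 0 */
    return (0);
}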
Comment 7 commit-hook freebsd_committer freebsd_triage 2021-02-25 16:21:12 UTC
A commit in branch releng/13.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=ee0b7e05e72c8820441faff29a9df99b47aed6a0

commit ee0b7e05e72c8820441faff29a9df99b47aed6a0
Author:     Martin Matuska <mm@FreeBSD.org>
AuthorDate: 2021-02-22 17:05:07 +0000
Commit:     Martin Matuska <mm@FreeBSD.org>
CommitDate: 2021-02-25 16:19:49 +0000

    zfs: fix panic if scrubbing after removing a slog device

    From openzfs-master 11f2e9a4 commit message:
      vdev_ops: don't try to call vdev_op_hold or vdev_op_rele when NULL

      This prevents a panic after a SLOG add/removal on the root pool followed
      by a zpool scrub.

      When a SLOG is removed, a hole takes its place - the vdev_ops for a hole
      is vdev_hole_ops, which defines the handler functions of vdev_op_hold
      and vdev_op_rele as NULL.

    Patch Author: Patrick Mooney <pmooney@pfmooney.com>

    Obtained from:  openzfs/zfs@11f2e9a491baa2ae3fc00f6b8b892fa91a852ca1
    PR:             252396
    Approved by:    re (gjb)

    (cherry picked from commit 64649f0285424435634c2dfd39f49536fc2b50dd)

 sys/contrib/openzfs/module/zfs/vdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
Comment 8 Robert Wing freebsd_committer freebsd_triage 2021-02-25 17:34:25 UTC
The pull request was accepted and Martin got it merged in - I'm closing this.

Thanks for the report Jason. And thanks to Martin for getting the patch in.