Bug 220604

Summary: Kernel panic when deleting ZFS snapshot
Product: Base System
Reporter: heppner.mark
Component: kern
Assignee: freebsd-fs (Nobody) <fs>
Status: Closed FIXED
Severity: Affects Only Me
CC: avg
Priority: ---
Version: 11.0-RELEASE
Hardware: amd64
OS: Any

Description heppner.mark 2017-07-10 14:38:23 UTC
Original forum post: https://forums.freebsd.org/threads/61281/

Possibly related: bug #207464

Similar issue with ZFS on Linux: https://github.com/zfsonlinux/zfs/issues/2749

About a month ago, I had a memory stick go bad. Now there is at least one ZFS snapshot from that day that causes a kernel panic when I try to destroy it. More recent snapshots can be deleted without any issues. The pool has been scrubbed several times since then and has never reported any errors. smartmontools reports that all drives are healthy. I have no idea how to reproduce this error.

---------------------------------------------------------------------

FreeBSD oscar 11.0-RELEASE-p10 FreeBSD 11.0-RELEASE-p10 #0 r318606: Mon May 22 00:36:40 EDT 2017     root@oscar:/usr/obj/usr/src/sys/GENERIC  amd64

panic: page fault

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x48
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff822048e3
stack pointer           = 0x28:0xfffffe045991b640
frame pointer           = 0x28:0xfffffe045991b6f0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 5 (txg_thread_enter)
trap number             = 12
panic: page fault
cpuid = 3
KDB: stack backtrace:
#0 0xffffffff80b24477 at kdb_backtrace+0x67
#1 0xffffffff80ad97e2 at vpanic+0x182
#2 0xffffffff80ad9653 at panic+0x43
#3 0xffffffff80fa1d51 at trap_fatal+0x351
#4 0xffffffff80fa1f43 at trap_pfault+0x1e3
#5 0xffffffff80fa14ec at trap+0x26c
#6 0xffffffff80f845a1 at calltrap+0x8
#7 0xffffffff82205e84 at dsl_destroy_snapshot_sync_impl+0x894
#8 0xffffffff82206577 at dsl_destroy_snapshot_sync+0x97
#9 0xffffffff822097e4 at dsl_sync_task_sync+0xc4
#10 0xffffffff8220851b at dsl_pool_sync+0x3cb
#11 0xffffffff82227fae at spa_sync+0x7ce
#12 0xffffffff82231549 at txg_sync_thread+0x389
#13 0xffffffff80a90455 at fork_exit+0x85
#14 0xffffffff80f84ade at fork_trampoline+0xe
Uptime: 9h52m57s
Dumping 3406 out of 16089 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/kernel/linux.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux.ko.debug...done.
done.
Loaded symbols for /boot/kernel/linux.ko
Reading symbols from /boot/kernel/linux_common.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux_common.ko.debug...done.
done.
Loaded symbols for /boot/kernel/linux_common.ko
Reading symbols from /boot/kernel/pf.ko...Reading symbols from /usr/lib/debug//boot/kernel/pf.ko.debug...done.
done.
Loaded symbols for /boot/kernel/pf.ko
Reading symbols from /boot/kernel/uplcom.ko...Reading symbols from /usr/lib/debug//boot/kernel/uplcom.ko.debug...done.
done.
Loaded symbols for /boot/kernel/uplcom.ko
Reading symbols from /boot/kernel/ucom.ko...Reading symbols from /usr/lib/debug//boot/kernel/ucom.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ucom.ko
Reading symbols from /boot/kernel/accf_data.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_data.ko.debug...done.
done.
Loaded symbols for /boot/kernel/accf_data.ko
Reading symbols from /boot/kernel/accf_http.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_http.ko.debug...done.
done.
Loaded symbols for /boot/kernel/accf_http.ko
Reading symbols from /boot/kernel/pflog.ko...Reading symbols from /usr/lib/debug//boot/kernel/pflog.ko.debug...done.
done.
Loaded symbols for /boot/kernel/pflog.ko
Reading symbols from /boot/kernel/fdescfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/fdescfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/fdescfs.ko
Reading symbols from /boot/kernel/uhid.ko...Reading symbols from /usr/lib/debug//boot/kernel/uhid.ko.debug...done.
done.
Loaded symbols for /boot/kernel/uhid.ko
Reading symbols from /boot/kernel/linprocfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/linprocfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/linprocfs.ko
Reading symbols from /boot/kernel/nullfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/nullfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/nullfs.ko
Reading symbols from /boot/kernel/geom_eli.ko...Reading symbols from /usr/lib/debug//boot/kernel/geom_eli.ko.debug...done.
done.
Loaded symbols for /boot/kernel/geom_eli.ko
#0  doadump (textdump=<value optimized out>) at pcpu.h:221
221     pcpu.h: No such file or directory.
        in pcpu.h

(kgdb) #0  doadump (textdump=<value optimized out>) at pcpu.h:221
#1  0xffffffff80ad9269 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80ad981b in vpanic (fmt=<value optimized out>,
    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3  0xffffffff80ad9653 in panic (fmt=0x0)
    at /usr/src/sys/kern/kern_shutdown.c:690
#4  0xffffffff80fa1d51 in trap_fatal (frame=0xfffffe045991b590, eva=72)
    at /usr/src/sys/amd64/amd64/trap.c:841
#5  0xffffffff80fa1f43 in trap_pfault (frame=0xfffffe045991b590, usermode=0)
    at /usr/src/sys/amd64/amd64/trap.c:691
#6  0xffffffff80fa14ec in trap (frame=0xfffffe045991b590)
    at /usr/src/sys/amd64/amd64/trap.c:442
#7  0xffffffff80f845a1 in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:236
#8  0xffffffff822048e3 in dsl_deadlist_remove_key (dl=0xfffff8002526c870,
    mintxg=21009229, tx=0xfffff8032855c900)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deadlist.c:287
#9  0xffffffff82205e84 in dsl_destroy_snapshot_sync_impl (
    ds=0xfffff8035e26a000, defer=<value optimized out>,
    tx=<value optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_destroy.c:387
#10 0xffffffff82206577 in dsl_destroy_snapshot_sync (
    arg=<value optimized out>, tx=<value optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_destroy.c:488
#11 0xffffffff822097e4 in dsl_sync_task_sync (dst=0xfffffe0459f03730,
    tx=0xfffff8032855c900)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_synctask.c:182
#12 0xffffffff8220851b in dsl_pool_sync (dp=<value optimized out>,
    txg=<value optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c:681
#13 0xffffffff82227fae in spa_sync (spa=<value optimized out>,
    txg=<value optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:6886
#14 0xffffffff82231549 in txg_sync_thread (arg=<value optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:517
#15 0xffffffff80a90455 in fork_exit (
    callout=0xffffffff822311c0 <txg_sync_thread>, arg=0xfffff8001f705800,
    frame=0xfffffe045991bc00) at /usr/src/sys/kern/kern_fork.c:1038
#16 0xffffffff80f84ade in fork_trampoline ()
    at /usr/src/sys/amd64/amd64/exception.S:611
#17 0x0000000000000000 in ?? ()
Current language:  auto; currently minimal
(kgdb)

Comment 1 Andriy Gapon (freebsd_committer, freebsd_triage) 2018-12-06 13:48:30 UTC
Have you seen this problem again?

Comment 2 heppner.mark 2018-12-19 15:07:52 UTC
(In reply to Andriy Gapon from comment #1)

The issue was consistent: I would always get a panic when trying to delete those particular snapshots.

The box needed some hardware upgrades, so I ended up building a whole new machine anyway, since there was nothing I could do with the faulty ZFS. The old box is still around if I can help pull any useful info from it.

Comment 3 Andriy Gapon (freebsd_committer, freebsd_triage) 2018-12-19 15:12:50 UTC
(In reply to heppner.mark from comment #2)
I suspect that what you have is on-disk corruption that affects the specific snapshot. I guess that it originated in memory, before the checksum was computed, so the checksum is correct for the corrupted data and the corruption is invisible to scrub.
If you are interested, we can try to look at the details using kgdb and zdb.