| Summary: | panic: corrupted zfs dataset (zfs issue) | | |
|---|---|---|---|
| Product: | Base System | Reporter: | Duncan <dpy> |
| Component: | kern | Assignee: | freebsd-fs (Nobody) <fs> |
| Status: | Open | | |
| Severity: | Affects Only Me | CC: | chris, dpy, grahamperrin, jfc |
| Priority: | --- | Keywords: | crash, needs-qa |
| Version: | 13.1-RELEASE | | |
| Hardware: | amd64 | | |
| OS: | Any | | |
| Attachments: | Photo of stack trace (236087); Full core.txt.0 from dump (236355) | | |
Description
Duncan
2022-08-24 08:42:54 UTC
Created attachment 236087 [details]
Photo of stack trace
The image was too large to be accepted in the initial creation of the bug report.
The number 5 comes from sys/contrib/openzfs/module/zfs/sa.c, function sa_build_index(), which returns EIO (error 5) if an sa_hdr_phys object has a bad magic number.
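For readers less familiar with this code path, here is a minimal userland sketch of the failure mode described above. It is not the OpenZFS code itself: the real magic-number check lives in sa_build_index() in sa.c, and the assertion is the VERIFY3(0 == sa_handle_get_from_db(...)) in zfs_znode_sa_init() shown in the backtrace below; the struct layout, helper names, and SA_MAGIC definition here are simplified stand-ins.

```c
/*
 * Sketch only: mimics how a bad system-attribute header magic turns into
 * EIO (5), and how an assertion on the caller's side turns that EIO into
 * the panic text "failed (0 == 5)".  Not the actual OpenZFS implementation.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SA_MAGIC 0x2F505AU	/* expected on-disk SA header magic */

typedef struct sa_hdr_phys {
	uint32_t sa_magic;	/* first field of the on-disk SA header */
} sa_hdr_phys_t;

/* Stand-in for the check in sa_build_index(): bad magic => EIO (5). */
static int
sa_build_index_sketch(const sa_hdr_phys_t *hdr)
{
	if (hdr->sa_magic != SA_MAGIC)
		return (EIO);
	return (0);
}

/* Stand-in for the VERIFY3(0 == ...) assertion in zfs_znode_sa_init(). */
static void
verify_zero(int err)
{
	if (err != 0) {
		fprintf(stderr,
		    "panic: VERIFY3(0 == sa_handle_get_from_db(...)) failed (0 == %d)\n",
		    err);
		abort();
	}
}

int
main(void)
{
	sa_hdr_phys_t good = { .sa_magic = SA_MAGIC };
	sa_hdr_phys_t bad = { .sa_magic = 0xDEADBEEFU };	/* corrupted header */

	verify_zero(sa_build_index_sketch(&good));	/* passes silently */
	verify_zero(sa_build_index_sketch(&bad));	/* prints the panic-style message and aborts */
	return (0);
}
```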
Created attachment 236355 [details]
Full core.txt.0 from dump
I have compiled my system from source and caused a panic (by running periodic daily; daily_clean_disks_enabled will cause it). I turned off all the VMs and jails before causing the crash, to remove extra complexity and reduce the size of the core.txt file. I also added a larger dump disk and now have a complete, viable dump. I have attached the complete core.txt.0 file; the following is the starting portion:

```
triple0.internal dumped core - see /var/crash/vmcore.0

Sun Sep 4 20:09:55 AEST 2022

FreeBSD triple0.internal 13.1-RELEASE-p2 FreeBSD 13.1-RELEASE-p2 releng/13.1-n250158-752f813d6cc GENERIC amd64

panic: VERIFY3(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed (0 == 5)

GNU gdb (GDB) 12.1 [GDB v12.1 for FreeBSD]
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd13.1".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:
panic: VERIFY3(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed (0 == 5)
cpuid = 8
time = 1662284816
KDB: stack backtrace:
#0 0xffffffff80c694a5 at kdb_backtrace+0x65
#1 0xffffffff80c1bb5f at vpanic+0x17f
#2 0xffffffff8500bf4a at spl_panic+0x3a
#3 0xffffffff85032c12 at zfs_znode_alloc+0x522
#4 0xffffffff85033c35 at zfs_zget+0x3b5
#5 0xffffffff8501e84b at zfs_dirent_lookup+0x16b
#6 0xffffffff8501e91a at zfs_dirlook+0x7a
#7 0xffffffff85030a10 at zfs_lookup+0x3d0
#8 0xffffffff8502c11b at zfs_freebsd_cachedlookup+0x6b
#9 0xffffffff80cdc4ed at vfs_cache_lookup+0xad
#10 0xffffffff80ce0cf0 at VOP_LOOKUP+0x30
#11 0xffffffff80ce0723 at cache_fplookup_noentry+0x1a3
#12 0xffffffff80cddfa6 at cache_fplookup+0x366
#13 0xffffffff80ce8cfa at namei+0x12a
#14 0xffffffff80d06993 at kern_statat+0xf3
#15 0xffffffff80d0708f at sys_fstatat+0x2f
#16 0xffffffff810b06ec at amd64_syscall+0x10c
#17 0xffffffff81087e8b at fast_syscall_common+0xf8
Uptime: 56m13s
(ada0:ahcich1:0:0:0): spin-down
(ada1:ahcich2:0:0:0): spin-down
(ada2:ahcich3:0:0:0): spin-down
(ada3:ahcich4:0:0:0): spin-down
Dumping 31966 out of 130858 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c1b75c in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:487
#3 0xffffffff80c1bbce in vpanic (fmt=0xffffffff8527c012 "VERIFY3(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed (0 == %lld)\n", ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4 0xffffffff8500bf4a in spl_panic (file=<optimized out>, func=<optimized out>, line=<unavailable>, fmt=<unavailable>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/spl/spl_misc.c:107
#5 0xffffffff85032c12 in zfs_znode_sa_init (zfsvfs=0xfffff8027a9c5000, zp=0xfffff807415f9ce8, db=0xfffff80f7c49acb8, obj_type=DMU_OT_SA, sa_hdl=<optimized out>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.c:364
#6 zfs_znode_alloc (zfsvfs=zfsvfs@entry=0xfffff8027a9c5000, db=0xfffff80f7c49acb8, blksz=3584, obj_type=DMU_OT_SA, hdl=hdl@entry=0x0) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.c:466
#7 0xffffffff85033c35 in zfs_zget (zfsvfs=<optimized out>, zfsvfs@entry=0xfffff8027a9c5000, obj_num=<optimized out>, zpp=zpp@entry=0xfffffe03109916f0) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.c:1042
#8 0xffffffff8501e84b in zfs_dirent_lookup (dzp=dzp@entry=0xfffff80741474588, name=0xfffffe0310991860 "tb_pkmeth.c", zpp=zpp@entry=0xfffffe0310991740, flag=flag@entry=2) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_dir.c:191
#9 0xffffffff8501e91a in zfs_dirlook (dzp=dzp@entry=0xfffff80741474588, name=<unavailable>, name@entry=0xfffffe0310991860 "tb_pkmeth.c", zpp=zpp@entry=0xfffffe03109917e0) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_dir.c:247
#10 0xffffffff85030a10 in zfs_lookup (dvp=<optimized out>, nm=nm@entry=0xfffffe0310991860 "tb_pkmeth.c", vpp=<optimized out>, cnp=cnp@entry=0xfffffe0310991c58, nameiop=0, cr=<optimized out>, flags=0, cached=1) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c:929
#11 0xffffffff8502c11b in zfs_freebsd_lookup (ap=0xfffffe0310991990, ap@entry=<error reading variable: value is not available>, cached=1) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c:4593
#12 zfs_freebsd_cachedlookup (ap=0xfffffe0310991990, ap@entry=<error reading variable: value is not available>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c:4601
#13 0xffffffff80cdc4ed in VOP_CACHEDLOOKUP (dvp=0xfffff80730da45b8, vpp=0xfffffe0310991a10, cnp=0xfffffe0310991c58) at ./vnode_if.h:99
#14 vfs_cache_lookup (ap=<unavailable>, ap@entry=<error reading variable: value is not available>) at /usr/src/sys/kern/vfs_cache.c:3069
#15 0xffffffff80ce0cf0 in VOP_LOOKUP (dvp=dvp@entry=0xfffff80730da45b8, vpp=<unavailable>, vpp@entry=0xfffffe0310991a10, cnp=<unavailable>, cnp@entry=0xfffffe0310991c58) at ./vnode_if.h:65
#16 0xffffffff80ce0723 in cache_fplookup_noentry (fpl=fpl@entry=0xfffffe0310991a88) at /usr/src/sys/kern/vfs_cache.c:4928
#17 0xffffffff80cddfa6 in cache_fplookup_next (fpl=0xfffffe0310991a88) at /usr/src/sys/kern/vfs_cache.c:5284
#18 cache_fplookup_impl (dvp=<optimized out>, fpl=0xfffffe0310991a88) at /usr/src/sys/kern/vfs_cache.c:5932
#19 cache_fplookup (ndp=ndp@entry=0xfffffe0310991bd8, status=status@entry=0xfffffe0310991b84, pwdp=pwdp@entry=0xfffffe0310991b88) at /usr/src/sys/kern/vfs_cache.c:6104
#20 0xffffffff80ce8cfa in namei (ndp=ndp@entry=0xfffffe0310991bd8) at /usr/src/sys/kern/vfs_lookup.c:570
#21 0xffffffff80d06993 in kern_statat (td=0xfffffe030c6b5000, flag=<optimized out>, fd=-100, path=<unavailable>, pathseg=<unavailable>, pathseg@entry=UIO_USERSPACE, sbp=sbp@entry=0xfffffe0310991d18, hook=0x0) at /usr/src/sys/kern/vfs_syscalls.c:2441
#22 0xffffffff80d0708f in sys_fstatat (td=<unavailable>, uap=0xfffffe030c6b53e8) at /usr/src/sys/kern/vfs_syscalls.c:2418
#23 0xffffffff810b06ec in syscallenter (td=0xfffffe030c6b5000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#24 amd64_syscall (td=0xfffffe030c6b5000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1185
#25 <signal handler called>
#26 0x00000008011ad39a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffd588
(kgdb)
```

------------------------------------------------------------------------

I got back to trying to move forward with this issue (re-enabling full EOD runs) and found out where the problem was. In my nextcloud jail, part of the /usr/src file system would cause a panic if accessed (i.e. running a find over it). I haven't gotten around to locating the exact directory/file.

Now, the interesting thing is that this dataset is encrypted and would mount when decrypted (using a key from higher up the filesystem hierarchy, a password typed in as part of startup). The panic would only occur on access to parts of the dataset.

I tried replicating the dataset (to keep for later diagnosis), but upon mounting, the machine would panic, requiring a boot into single-user mode and deletion of the copied dataset (I probably should have just modified "canmount") before booting would complete without a panic.

My backups(?) consisted of dataset replication onto other pools, in the same machine and to another (soon to be offsite) machine running TrueNAS. When I entered the key and mounting occurred, both other systems would panic.

The only solution I could think of was to create a new dataset and copy over (using rsync in this case) all the folders except /usr/src. I copied /usr/src from another jail. I have renamed and kept the original dataset for potential debugging in the future.

Moral of the story: proof that ZFS replication is actually NOT the same as a backup. The corruption was propagated, in a more virulent form (mount == panic), to the replicated dataset.

At some point I would appreciate being able to help someone figure out what has happened to the dataset, and how to prevent something similar in the future. It has shaken my faith in ZFS a little.

(In reply to Duncan from comment #5)

> … I tried replicating the dataset (to keep for later diagnosis), but upon
> mounting, machine would panic, …

Was the try entirely successful? Is it the same type of panic?

From <https://old.reddit.com/r/zfs/comments/kx6865/-/gjjmt8n/?context=4> (I cannot verify this):

>> … Replicating broken snapshots/datasets isn't possible. Even if there's an
>> unrepairable CKSUM error in a data block inside a file, and all the actual
>> metadata is intact, zfs will refuse to send that snapshot. …

(In reply to Graham Perrin from comment #6)

The replication, I believe, works fine as long as one doesn't then try to mount the dataset. I will check this properly and perhaps try setting up another machine to run the panics on; it is a bit of a pain to keep knocking over my main server. I should get back to this within the week.

I did get a different type of crash dump, I believe from the mount, and it is different, i.e.:
```
Unread portion of the kernel message buffer:
panic: VERIFY3(sa.sa_magic == SA_MAGIC) failed (1122422741 == 3100762)
cpuid = 5
time = 1666406111
KDB: stack backtrace:
#0 0xffffffff80c694a5 at kdb_backtrace+0x65
#1 0xffffffff80c1bb5f at vpanic+0x17f
#2 0xffffffff84ff4f4a at spl_panic+0x3a
#3 0xffffffff851948f8 at zpl_get_file_info+0x1d8
#4 0xffffffff85060388 at dmu_objset_userquota_get_ids+0x298
#5 0xffffffff85073f24 at dnode_setdirty+0x34
#6 0xffffffff8504bd49 at dbuf_dirty+0x9d9
#7 0xffffffff85061fc0 at dmu_objset_space_upgrade+0x40
#8 0xffffffff85060a5f at dmu_objset_id_quota_upgrade_cb+0x14f
#9 0xffffffff85061eaf at dmu_objset_upgrade_task_cb+0x7f
#10 0xffffffff84ff6a0f at taskq_run+0x1f
#11 0xffffffff80c7da81 at taskqueue_run_locked+0x181
#12 0xffffffff80c7ed92 at taskqueue_thread_loop+0xc2
#13 0xffffffff80bd8a9e at fork_exit+0x7e
#14 0xffffffff810885ee at fork_trampoline+0xe
Uptime: 13m13s
(ada0:ahcich1:0:0:0): spin-down
(ada1:ahcich2:0:0:0): spin-down
(ada2:ahcich3:0:0:0): spin-down
(ada3:ahcich4:0:0:0): spin-down
Dumping 13911 out of 130858 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c1b75c in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:487
#3 0xffffffff80c1bbce in vpanic (fmt=0xffffffff85250fe8 "VERIFY3(sa.sa_magic == SA_MAGIC) failed (%llu == %llu)\n", ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4 0xffffffff84ff4f4a in spl_panic (file=<optimized out>, func=<optimized out>, line=<unavailable>, fmt=<unavailable>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/spl/spl_misc.c:107
#5 0xffffffff851948f8 in zpl_get_file_info (bonustype=<optimized out>, data=0xfffffe035db250c0, zoi=0xfffffe027e72bc50) at /usr/src/sys/contrib/openzfs/module/zfs/zfs_quota.c:89
#6 0xffffffff85060388 in dmu_objset_userquota_get_ids (dn=0xfffff8160ebcf660, before=before@entry=1, tx=<optimized out>, tx@entry=0xfffff80ec760a100) at /usr/src/sys/contrib/openzfs/module/zfs/dmu_objset.c:2215
#7 0xffffffff85073f24 in dnode_setdirty (dn=0xfffff8160ebcf660, tx=0xfffff80ec760a100) at /usr/src/sys/contrib/openzfs/module/zfs/dnode.c:1691
#8 0xffffffff8504bd49 in dbuf_dirty (db=0xfffff8160ebd3b90, db@entry=0x0, tx=tx@entry=0xfffff8160ebd3b90) at /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:2367
#9 0xffffffff8504c074 in dmu_buf_will_dirty_impl (db_fake=<optimized out>, flags=<optimized out>, flags@entry=9, tx=0xfffff8160ebd3b90, tx@entry=0xfffff80ec760a100) at /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:2517
#10 0xffffffff8504aea2 in dmu_buf_will_dirty (db_fake=<unavailable>, tx=<unavailable>, tx@entry=0xfffff80ec760a100) at /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:2523
#11 0xffffffff85061fc0 in dmu_objset_space_upgrade (os=os@entry=0xfffff80408629800) at /usr/src/sys/contrib/openzfs/module/zfs/dmu_objset.c:2328
#12 0xffffffff85060a5f in dmu_objset_id_quota_upgrade_cb (os=0xfffff80408629800) at /usr/src/sys/contrib/openzfs/module/zfs/dmu_objset.c:2385
#13 0xffffffff85061eaf in dmu_objset_upgrade_task_cb (data=0xfffff80408629800) at /usr/src/sys/contrib/openzfs/module/zfs/dmu_objset.c:1447
#14 0xffffffff84ff6a0f in taskq_run (arg=0xfffff801e5ab5300, pending=<unavailable>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/spl/spl_taskq.c:315
#15 0xffffffff80c7da81 in taskqueue_run_locked (queue=queue@entry=0xfffff80116004300) at /usr/src/sys/kern/subr_taskqueue.c:477
#16 0xffffffff80c7ed92 in taskqueue_thread_loop (arg=<optimized out>, arg@entry=0xfffff801dfb570d0) at /usr/src/sys/kern/subr_taskqueue.c:794
#17 0xffffffff80bd8a9e in fork_exit (callout=0xffffffff80c7ecd0 <taskqueue_thread_loop>, arg=0xfffff801dfb570d0, frame=0xfffffe027e72bf40) at /usr/src/sys/kern/kern_fork.c:1093
#18 <signal handler called>
#19 mi_startup () at /usr/src/sys/kern/init_main.c:322
#20 0xffffffff80f791d9 in swapper () at /usr/src/sys/vm/vm_swapout.c:755
#21 0xffffffff80385022 in btext () at /usr/src/sys/amd64/amd64/locore.S:80
```
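A note on reading that panic line: the right-hand value appears to be the expected constant and the left-hand value whatever was actually read from the on-disk system-attribute header. A trivial sketch of the decoding (my own, not taken from the dump), assuming SA_MAGIC is 0x2F505A as in the OpenZFS headers:

```c
/* Sketch: decode the two numbers in "failed (1122422741 == 3100762)". */
#include <stdio.h>

int
main(void)
{
	/* Assumption: SA_MAGIC is defined as 0x2F505A in the OpenZFS sources. */
	const unsigned long expected = 0x2F505AUL;	/* prints 3100762 */
	const unsigned long found = 1122422741UL;	/* value read from the corrupted header */

	printf("expected SA_MAGIC = %lu\n", expected);
	printf("found on disk     = %lu\n", found);
	return (0);
}
```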
----------------------

I would say this is a similar but different problem. I had months of replicated copies on two different pools. Because I copied them (send/receive) encrypted and unmounted on the destination, nothing showed up. As soon as I tried to mount them: panic. Currently I have renamed the original dataset (it remains unmounted), but I deleted the backups (they wouldn't mount, but I'm sure I can re-create them). I will do more experimentation when I have a couple of hours spare (within the week).

(In reply to Duncan from comment #7)

Thanks. Given the differences, I suggest a separate bug report with 266014 as 'see also'.

I have bookmarked a pull request and some (GitHub) issues as food for thought; I'm not sharing them here, because I don't imagine this particular food relating to comment #0.