| Summary: | panic: corrupted zfs dataset (zfs issue) | | |
|---|---|---|---|
| Product: | Base System | Reporter: | Duncan <dpy> |
| Component: | kern | Assignee: | freebsd-fs (Nobody) <fs> |
| Status: | Open | | |
| Severity: | Affects Only Me | CC: | chris, dpy, grahamperrin, jfc |
| Priority: | --- | Keywords: | crash, needs-qa |
| Version: | 13.1-RELEASE | | |
| Hardware: | amd64 | | |
| OS: | Any | | |
| Attachments: | Photo of stack trace (236087); Full core.txt.0 from dump (236355) | | |
Description
Duncan
2022-08-24 08:42:54 UTC
Created attachment 236087 [details]
Photo of stack trace
The image was too large to be accepted in the initial creation of the bug report.
The number 5 comes from sys/contrib/openzfs/module/zfs/sa.c, function sa_build_index(), which returns EIO (error 5) if an sa_hdr_phys object has a bad magic number.
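For readers less familiar with this code path, here is a minimal userland sketch of the failure mode described above. It is not the OpenZFS code itself: the real magic-number check lives in sa_build_index() in sa.c, and the assertion is the VERIFY3(0 == sa_handle_get_from_db(...)) in zfs_znode_sa_init() shown in the backtrace below; the struct layout, helper names, and SA_MAGIC definition here are simplified stand-ins.

```c
/*
 * Sketch only: mimics how a bad system-attribute header magic turns into
 * EIO (5), and how an assertion on the caller's side turns that EIO into
 * the panic text "failed (0 == 5)".  Not the actual OpenZFS implementation.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SA_MAGIC 0x2F505AU	/* expected on-disk SA header magic */

typedef struct sa_hdr_phys {
	uint32_t sa_magic;	/* first field of the on-disk SA header */
} sa_hdr_phys_t;

/* Stand-in for the check in sa_build_index(): bad magic => EIO (5). */
static int
sa_build_index_sketch(const sa_hdr_phys_t *hdr)
{
	if (hdr->sa_magic != SA_MAGIC)
		return (EIO);
	return (0);
}

/* Stand-in for the VERIFY3(0 == ...) assertion in zfs_znode_sa_init(). */
static void
verify_zero(int err)
{
	if (err != 0) {
		fprintf(stderr,
		    "panic: VERIFY3(0 == sa_handle_get_from_db(...)) failed (0 == %d)\n",
		    err);
		abort();
	}
}

int
main(void)
{
	sa_hdr_phys_t good = { .sa_magic = SA_MAGIC };
	sa_hdr_phys_t bad = { .sa_magic = 0xDEADBEEFU };	/* corrupted header */

	verify_zero(sa_build_index_sketch(&good));	/* passes silently */
	verify_zero(sa_build_index_sketch(&bad));	/* prints the panic-style message and aborts */
	return (0);
}
```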
Created attachment 236355 [details]
Full core.txt.0 from dump
I have compiled my system from source and caused a panic (by running periodic daily; daily_clean_disks_enabled will cause it). I turned off all the VMs and jails before causing the crash, to remove extra complexity and reduce the size of the core.txt file. I also added a larger dump disk and now have a complete, viable dump. I have attached the complete core.txt.0 file; the following is the starting portion:

```
triple0.internal dumped core - see /var/crash/vmcore.0

Sun Sep 4 20:09:55 AEST 2022

FreeBSD triple0.internal 13.1-RELEASE-p2 FreeBSD 13.1-RELEASE-p2 releng/13.1-n250158-752f813d6cc GENERIC amd64

panic: VERIFY3(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed (0 == 5)

GNU gdb (GDB) 12.1 [GDB v12.1 for FreeBSD]
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd13.1".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:
panic: VERIFY3(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed (0 == 5)
cpuid = 8
time = 1662284816
KDB: stack backtrace:
#0 0xffffffff80c694a5 at kdb_backtrace+0x65
#1 0xffffffff80c1bb5f at vpanic+0x17f
#2 0xffffffff8500bf4a at spl_panic+0x3a
#3 0xffffffff85032c12 at zfs_znode_alloc+0x522
#4 0xffffffff85033c35 at zfs_zget+0x3b5
#5 0xffffffff8501e84b at zfs_dirent_lookup+0x16b
#6 0xffffffff8501e91a at zfs_dirlook+0x7a
#7 0xffffffff85030a10 at zfs_lookup+0x3d0
#8 0xffffffff8502c11b at zfs_freebsd_cachedlookup+0x6b
#9 0xffffffff80cdc4ed at vfs_cache_lookup+0xad
#10 0xffffffff80ce0cf0 at VOP_LOOKUP+0x30
#11 0xffffffff80ce0723 at cache_fplookup_noentry+0x1a3
#12 0xffffffff80cddfa6 at cache_fplookup+0x366
#13 0xffffffff80ce8cfa at namei+0x12a
#14 0xffffffff80d06993 at kern_statat+0xf3
#15 0xffffffff80d0708f at sys_fstatat+0x2f
#16 0xffffffff810b06ec at amd64_syscall+0x10c
#17 0xffffffff81087e8b at fast_syscall_common+0xf8
Uptime: 56m13s
(ada0:ahcich1:0:0:0): spin-down
(ada1:ahcich2:0:0:0): spin-down
(ada2:ahcich3:0:0:0): spin-down
(ada3:ahcich4:0:0:0): spin-down
Dumping 31966 out of 130858 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c1b75c in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:487
#3 0xffffffff80c1bbce in vpanic (fmt=0xffffffff8527c012 "VERIFY3(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed (0 == %lld)\n", ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4 0xffffffff8500bf4a in spl_panic (file=<optimized out>, func=<optimized out>, line=<unavailable>, fmt=<unavailable>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/spl/spl_misc.c:107
#5 0xffffffff85032c12 in zfs_znode_sa_init (zfsvfs=0xfffff8027a9c5000, zp=0xfffff807415f9ce8, db=0xfffff80f7c49acb8, obj_type=DMU_OT_SA, sa_hdl=<optimized out>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.c:364
#6 zfs_znode_alloc (zfsvfs=zfsvfs@entry=0xfffff8027a9c5000, db=0xfffff80f7c49acb8, blksz=3584, obj_type=DMU_OT_SA, hdl=hdl@entry=0x0) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.c:466
#7 0xffffffff85033c35 in zfs_zget (zfsvfs=<optimized out>, zfsvfs@entry=0xfffff8027a9c5000, obj_num=<optimized out>, zpp=zpp@entry=0xfffffe03109916f0) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode.c:1042
#8 0xffffffff8501e84b in zfs_dirent_lookup (dzp=dzp@entry=0xfffff80741474588, name=0xfffffe0310991860 "tb_pkmeth.c", zpp=zpp@entry=0xfffffe0310991740, flag=flag@entry=2) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_dir.c:191
#9 0xffffffff8501e91a in zfs_dirlook (dzp=dzp@entry=0xfffff80741474588, name=<unavailable>, name@entry=0xfffffe0310991860 "tb_pkmeth.c", zpp=zpp@entry=0xfffffe03109917e0) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_dir.c:247
#10 0xffffffff85030a10 in zfs_lookup (dvp=<optimized out>, nm=nm@entry=0xfffffe0310991860 "tb_pkmeth.c", vpp=<optimized out>, cnp=cnp@entry=0xfffffe0310991c58, nameiop=0, cr=<optimized out>, flags=0, cached=1) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c:929
#11 0xffffffff8502c11b in zfs_freebsd_lookup (ap=0xfffffe0310991990, ap@entry=<error reading variable: value is not available>, cached=1) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c:4593
#12 zfs_freebsd_cachedlookup (ap=0xfffffe0310991990, ap@entry=<error reading variable: value is not available>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c:4601
#13 0xffffffff80cdc4ed in VOP_CACHEDLOOKUP (dvp=0xfffff80730da45b8, vpp=0xfffffe0310991a10, cnp=0xfffffe0310991c58) at ./vnode_if.h:99
#14 vfs_cache_lookup (ap=<unavailable>, ap@entry=<error reading variable: value is not available>) at /usr/src/sys/kern/vfs_cache.c:3069
#15 0xffffffff80ce0cf0 in VOP_LOOKUP (dvp=dvp@entry=0xfffff80730da45b8, vpp=<unavailable>, vpp@entry=0xfffffe0310991a10, cnp=<unavailable>, cnp@entry=0xfffffe0310991c58) at ./vnode_if.h:65
#16 0xffffffff80ce0723 in cache_fplookup_noentry (fpl=fpl@entry=0xfffffe0310991a88) at /usr/src/sys/kern/vfs_cache.c:4928
#17 0xffffffff80cddfa6 in cache_fplookup_next (fpl=0xfffffe0310991a88) at /usr/src/sys/kern/vfs_cache.c:5284
#18 cache_fplookup_impl (dvp=<optimized out>, fpl=0xfffffe0310991a88) at /usr/src/sys/kern/vfs_cache.c:5932
#19 cache_fplookup (ndp=ndp@entry=0xfffffe0310991bd8, status=status@entry=0xfffffe0310991b84, pwdp=pwdp@entry=0xfffffe0310991b88) at /usr/src/sys/kern/vfs_cache.c:6104
#20 0xffffffff80ce8cfa in namei (ndp=ndp@entry=0xfffffe0310991bd8) at /usr/src/sys/kern/vfs_lookup.c:570
#21 0xffffffff80d06993 in kern_statat (td=0xfffffe030c6b5000, flag=<optimized out>, fd=-100, path=<unavailable>, pathseg=<unavailable>, pathseg@entry=UIO_USERSPACE, sbp=sbp@entry=0xfffffe0310991d18, hook=0x0) at /usr/src/sys/kern/vfs_syscalls.c:2441
#22 0xffffffff80d0708f in sys_fstatat (td=<unavailable>, uap=0xfffffe030c6b53e8) at /usr/src/sys/kern/vfs_syscalls.c:2418
#23 0xffffffff810b06ec in syscallenter (td=0xfffffe030c6b5000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#24 amd64_syscall (td=0xfffffe030c6b5000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1185
#25 <signal handler called>
#26 0x00000008011ad39a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffd588
(kgdb)
```

------------------------------------------------------------------------

I got back to trying to move forward with this issue (re-enabling full EOD runs) and found out where the problem was. In my nextcloud jail, part of the /usr/src file system would cause a panic if accessed (i.e. running a find over it). I haven't gotten around to locating the exact directory/file.

Now, the interesting thing is that this dataset is encrypted and would mount when decrypted (using a key from higher up the filesystem hierarchy, a password typed in as part of startup). The panic would only occur on access to parts of the dataset.

I tried replicating the dataset (to keep for later diagnosis), but upon mounting, the machine would panic, requiring a boot into single-user mode and deletion of the copied dataset (I probably should have just modified "canmount") before booting would complete without a panic.

My backups(?) consisted of dataset replication onto other pools, in the same machine and to another (soon to be offsite) machine running TrueNAS. When I entered the key and mounting occurred, both other systems would panic.

The only solution I could think of was to create a new dataset and copy over (using rsync in this case) all the folders except /usr/src. I copied /usr/src from another jail. I have renamed and kept the original dataset for potential debugging in the future.

Moral of the story: proof that ZFS replication is actually NOT the same as a backup. The corruption was propagated, in a more virulent form (mount == panic), to the replicated dataset.

At some point I would appreciate being able to help someone figure out what has happened to the dataset, and how to prevent something similar in the future. It has shaken my faith in ZFS a little.

(In reply to Duncan from comment #5)

> … I tried replicating the dataset (to keep for later diagnosis), but upon
> mounting, machine would panic, …

Was the try entirely successful? Is it the same type of panic?

From <https://old.reddit.com/r/zfs/comments/kx6865/-/gjjmt8n/?context=4> (I cannot verify this):

>> … Replicating broken snapshots/datasets isn't possible. Even if there's an
>> unrepairable CKSUM error in a data block inside a file, and all the actual
>> metadata is intact, zfs will refuse to send that snapshot. …

(In reply to Graham Perrin from comment #6)

The replication, I believe, works fine as long as one doesn't then try to mount the dataset. I will check this properly and perhaps try setting up another machine to run the panics on; it is a bit of a pain to keep knocking over my main server. I should get back to this within the week.

I did get a different type of crash dump, I believe from the mount, and it is different, i.e.:
```
Unread portion of the kernel message buffer:
panic: VERIFY3(sa.sa_magic == SA_MAGIC) failed (1122422741 == 3100762)
cpuid = 5
time = 1666406111
KDB: stack backtrace:
#0 0xffffffff80c694a5 at kdb_backtrace+0x65
#1 0xffffffff80c1bb5f at vpanic+0x17f
#2 0xffffffff84ff4f4a at spl_panic+0x3a
#3 0xffffffff851948f8 at zpl_get_file_info+0x1d8
#4 0xffffffff85060388 at dmu_objset_userquota_get_ids+0x298
#5 0xffffffff85073f24 at dnode_setdirty+0x34
#6 0xffffffff8504bd49 at dbuf_dirty+0x9d9
#7 0xffffffff85061fc0 at dmu_objset_space_upgrade+0x40
#8 0xffffffff85060a5f at dmu_objset_id_quota_upgrade_cb+0x14f
#9 0xffffffff85061eaf at dmu_objset_upgrade_task_cb+0x7f
#10 0xffffffff84ff6a0f at taskq_run+0x1f
#11 0xffffffff80c7da81 at taskqueue_run_locked+0x181
#12 0xffffffff80c7ed92 at taskqueue_thread_loop+0xc2
#13 0xffffffff80bd8a9e at fork_exit+0x7e
#14 0xffffffff810885ee at fork_trampoline+0xe
Uptime: 13m13s
(ada0:ahcich1:0:0:0): spin-down
(ada1:ahcich2:0:0:0): spin-down
(ada2:ahcich3:0:0:0): spin-down
(ada3:ahcich4:0:0:0): spin-down
Dumping 13911 out of 130858 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c1b75c in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:487
#3 0xffffffff80c1bbce in vpanic (fmt=0xffffffff85250fe8 "VERIFY3(sa.sa_magic == SA_MAGIC) failed (%llu == %llu)\n", ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4 0xffffffff84ff4f4a in spl_panic (file=<optimized out>, func=<optimized out>, line=<unavailable>, fmt=<unavailable>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/spl/spl_misc.c:107
#5 0xffffffff851948f8 in zpl_get_file_info (bonustype=<optimized out>, data=0xfffffe035db250c0, zoi=0xfffffe027e72bc50) at /usr/src/sys/contrib/openzfs/module/zfs/zfs_quota.c:89
#6 0xffffffff85060388 in dmu_objset_userquota_get_ids (dn=0xfffff8160ebcf660, before=before@entry=1, tx=<optimized out>, tx@entry=0xfffff80ec760a100) at /usr/src/sys/contrib/openzfs/module/zfs/dmu_objset.c:2215
#7 0xffffffff85073f24 in dnode_setdirty (dn=0xfffff8160ebcf660, tx=0xfffff80ec760a100) at /usr/src/sys/contrib/openzfs/module/zfs/dnode.c:1691
#8 0xffffffff8504bd49 in dbuf_dirty (db=0xfffff8160ebd3b90, db@entry=0x0, tx=tx@entry=0xfffff8160ebd3b90) at /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:2367
#9 0xffffffff8504c074 in dmu_buf_will_dirty_impl (db_fake=<optimized out>, flags=<optimized out>, flags@entry=9, tx=0xfffff8160ebd3b90, tx@entry=0xfffff80ec760a100) at /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:2517
#10 0xffffffff8504aea2 in dmu_buf_will_dirty (db_fake=<unavailable>, tx=<unavailable>, tx@entry=0xfffff80ec760a100) at /usr/src/sys/contrib/openzfs/module/zfs/dbuf.c:2523
#11 0xffffffff85061fc0 in dmu_objset_space_upgrade (os=os@entry=0xfffff80408629800) at /usr/src/sys/contrib/openzfs/module/zfs/dmu_objset.c:2328
#12 0xffffffff85060a5f in dmu_objset_id_quota_upgrade_cb (os=0xfffff80408629800) at /usr/src/sys/contrib/openzfs/module/zfs/dmu_objset.c:2385
#13 0xffffffff85061eaf in dmu_objset_upgrade_task_cb (data=0xfffff80408629800) at /usr/src/sys/contrib/openzfs/module/zfs/dmu_objset.c:1447
#14 0xffffffff84ff6a0f in taskq_run (arg=0xfffff801e5ab5300, pending=<unavailable>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/spl/spl_taskq.c:315
#15 0xffffffff80c7da81 in taskqueue_run_locked (queue=queue@entry=0xfffff80116004300) at /usr/src/sys/kern/subr_taskqueue.c:477
#16 0xffffffff80c7ed92 in taskqueue_thread_loop (arg=<optimized out>, arg@entry=0xfffff801dfb570d0) at /usr/src/sys/kern/subr_taskqueue.c:794
#17 0xffffffff80bd8a9e in fork_exit (callout=0xffffffff80c7ecd0 <taskqueue_thread_loop>, arg=0xfffff801dfb570d0, frame=0xfffffe027e72bf40) at /usr/src/sys/kern/kern_fork.c:1093
#18 <signal handler called>
#19 mi_startup () at /usr/src/sys/kern/init_main.c:322
#20 0xffffffff80f791d9 in swapper () at /usr/src/sys/vm/vm_swapout.c:755
#21 0xffffffff80385022 in btext () at /usr/src/sys/amd64/amd64/locore.S:80
```
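A note on reading that panic line: the right-hand value appears to be the expected constant and the left-hand value whatever was actually read from the on-disk system-attribute header. A trivial sketch of the decoding (my own, not taken from the dump), assuming SA_MAGIC is 0x2F505A as in the OpenZFS headers:

```c
/* Sketch: decode the two numbers in "failed (1122422741 == 3100762)". */
#include <stdio.h>

int
main(void)
{
	/* Assumption: SA_MAGIC is defined as 0x2F505A in the OpenZFS sources. */
	const unsigned long expected = 0x2F505AUL;	/* prints 3100762 */
	const unsigned long found = 1122422741UL;	/* value read from the corrupted header */

	printf("expected SA_MAGIC = %lu\n", expected);
	printf("found on disk     = %lu\n", found);
	return (0);
}
```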
----------------------

I would say this is a similar but different problem. I had months of replicated copies on two different pools. Because I copied them (send/receive) encrypted and unmounted on the destination, nothing showed up. As soon as I tried to mount them: panic. Currently I have renamed the original dataset (it remains unmounted), but I deleted the backups (they wouldn't mount, but I'm sure I can re-create them). I will do more experimentation when I have a couple of hours spare (within the week).

(In reply to Duncan from comment #7)

Thanks. Given the differences, I suggest a separate bug report with 266014 as 'see also'.

I have bookmarked a pull request and some (GitHub) issues as food for thought; I'm not sharing them here, because I don't imagine this particular food relating to comment #0.