I rebooted a system running CURRENT from late December to update to today's CURRENT and it crashed while unloading ZFS. panic: VERIFY3(rc->rc_count == number) failed (262144 == 0) cpuid = 0 time = 1705275564 KDB: stack backtrace: db_trace_self() at db_trace_self db_trace_self_wrapper() at db_trace_self_wrapper+0x38 vpanic() at vpanic+0x1a4 spl_panic() at spl_panic+0x44 zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xdc arc_fini() at arc_fini+0x274 dmu_fini() at dmu_fini+0xc spa_fini() at spa_fini+0x34 zfs_kmod_fini() at zfs_kmod_fini+0x78 zfs_shutdown() at zfs_shutdown+0x40 kern_reboot() at kern_reboot+0x574 sys_reboot() at sys_reboot+0x350 do_el0_sync() at do_el0_sync+0x58c handle_el0_sync() at handle_el0_sync+0x48 --- exception, esr 0x56000000 [...] #9 0xffff00000003daf8 in spl_panic (file=<optimized out>, func=<optimized out>, line=13227925, fmt=0x12 <error: Cannot access memory at address 0x12>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/spl/spl_misc.c:100 ap = {__stack = 0xffff0000f39d5630, __gr_top = 0xffff0000f39d55e0, __vr_top = 0xffffa000010a92a0, __gr_offs = -32, __vr_offs = 0} #10 0xffff0000001633e0 in zfs_refcount_destroy_many (rc=<optimized out>, number=18446462598740100184) at /usr/src/sys/contrib/openzfs/module/zfs/refcount.c:98 _verify3_right = 18446462598740100184 _verify3_left = 115 cookie = 0x0 ref = <optimized out> #11 0xffff0000001633fc in zfs_refcount_destroy (rc=0x12, rc@entry=0xffff0000010e7a80 <ARC_anon+128>) at /usr/src/sys/contrib/openzfs/module/zfs/refcount.c:112 No locals. #12 0xffff0000000c0508 in arc_state_fini () at /usr/src/sys/contrib/openzfs/module/zfs/arc.c:7421 No locals. #13 arc_fini () at /usr/src/sys/contrib/openzfs/module/zfs/arc.c:7753 p = <optimized out> #14 0xffff0000000ee4d0 in dmu_fini () at /usr/src/sys/contrib/openzfs/module/zfs/dmu.c:2549 No locals. #15 0xffff000000189dd8 in spa_fini () at /usr/src/sys/contrib/openzfs/module/zfs/spa_misc.c:2549 No locals. #16 0xffff0000001e67ac in zfs_kmod_fini () at /usr/src/sys/contrib/openzfs/module/zfs/zfs_ioctl.c:7933 zsnext = 0x0 zs = 0x0 #17 0xffff00000006e654 in zfs__fini () at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/kmod_core.c:284 No locals. #18 zfs_shutdown (arg=<optimized out>, howto=<optimized out>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/kmod_core.c:297 No locals. #19 0xffff000000693158 in kern_reboot (howto=0) at /usr/src/sys/kern/kern_shutdown.c:517 _ep = 0xffffa00002f63c40 _t = 0xffffa00002f63c40 _el = 0xffffa00001709c80 once = 1 #20 0xffff000000692b74 in sys_reboot (td=0xffff000121c2a640, uap=0xffff000121c2aa40) at /usr/src/sys/kern/kern_shutdown.c:312 error = <optimized out> #21 0xffff000000a60060 in syscallenter (td=0xffff000121c2a640) at /usr/src/sys/arm64/arm64/../../kern/subr_syscall.c:186 se = 0xffff000000fa1378 <sysent+1760> sa = 0xffff000121c2aa30 p = <optimized out> error = <optimized out> sy_thr_static = true traced = <optimized out> _audit_entered = <optimized out> My kernel configuration file is include GENERIC ident STRIATUS nooptions WITNESS nooptions WITNESS_SKIPSPIN options ZFS The kernel that crashed had changes on main up to 5bc10feacc9d.
I just hit this panic with syzkaller as well. This seems like a somewhat recent regression. The panic arises in arc_state_fini(): zfs_refcount_destroy(&arc_anon->arcs_size[ARC_BUFC_DATA]); A previous zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]); succeeded.
Similar situation here. ... exec shutdown -r now Shutdown NOW! shutdown: [pid 3676] Jan 15 15:54:10 bsd15 shutdown[3676]: reboot by root: ... All buffers synced. Uptime: 7h35m52s GEOM_ELI: Device vtbd0p4.eli destroyed. GEOM_ELI: Detached vtbd0p4.eli on last close. panic: VERIFY3(rc->rc_count == number) failed (59392 == 0) cpuid = 0 time = 1705362861 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c737eaf0 vpanic() at vpanic+0x131/frame 0xfffffe00c737ec20 spl_panic() at spl_panic+0x3a/frame 0xfffffe00c737ec80 zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xd1/frame 0xfffffe00c737ecb0 arc_fini() at arc_fini+0x299/frame 0xfffffe00c737ece0 dmu_fini() at dmu_fini+0x9/frame 0xfffffe00c737ecf0 spa_fini() at spa_fini+0x36/frame 0xfffffe00c737ed30 zfs_kmod_fini() at zfs_kmod_fini+0x7a/frame 0xfffffe00c737ed60 zfs_shutdown() at zfs_shutdown+0x2b/frame 0xfffffe00c737ed70 kern_reboot() at kern_reboot+0x4f3/frame 0xfffffe00c737edb0 sys_reboot() at sys_reboot+0x3a9/frame 0xfffffe00c737ee00 amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe00c737ef30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c737ef30 --- syscall (55, FreeBSD ELF64, reboot), rip = 0x28cdaa, rsp = 0x8204547a8, rbp = 0x820454890 --- KDB: enter: panic [ thread pid 1 tid 100002 ] Stopped at kdb_enter+0x33: movq $0,0xe3a592(%rip) It crashed on a 15.0-53df7e58cca system, but pretty sure it happened on an earlier ARM system back on 1/13.
The amd64 one that crashed (from strings in the core file) was: FreeBSD 15.0-CURRENT amd64 1500010 FreeBSD 15.0-CURRENT #18 main-n267574-6137b5f7b8c: Mon Jan 15 07:40:34 PST 2024 FreeBSD clang version 17.0.6 (https://github.com/llvm/llvm-project.git llvmorg-17.0.6-0-g6009708b4367) The aarch64 one that crashed was: FreeBSD 15.0-CURRENT aarch64 1500008 FreeBSD 15.0-CURRENT #17 main-n267508-8a3fafc8211: Thu Jan 11 10:26:44 PST 2024 FreeBSD clang version 17.0.6 (https://github.com/llvm/llvm-project.git llvmorg-17.0.6-0-g6009708b4367)
(In reply to John Kennedy from comment #2) I am seeing the identical same behavior here : panic: VERIFY3(rc->rc_count == number) failed (232652800 == 0) cpuid = 0 time = 1705557330 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe073a4a7af0 vpanic() at vpanic+0x131/frame 0xfffffe073a4a7c20 spl_panic() at spl_panic+0x3a/frame 0xfffffe073a4a7c80 zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xd1/frame 0xfffffe073a4a7cb0 arc_fini() at arc_fini+0x281/frame 0xfffffe073a4a7ce0 dmu_fini() at dmu_fini+0x9/frame 0xfffffe073a4a7cf0 spa_fini() at spa_fini+0x36/frame 0xfffffe073a4a7d30 zfs_kmod_fini() at zfs_kmod_fini+0x7a/frame 0xfffffe073a4a7d60 zfs_shutdown() at zfs_shutdown+0x2b/frame 0xfffffe073a4a7d70 kern_reboot() at kern_reboot+0x4f3/frame 0xfffffe073a4a7db0 sys_reboot() at sys_reboot+0x3a9/frame 0xfffffe073a4a7e00 amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe073a4a7f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe073a4a7f30 --- syscall (55, FreeBSD ELF64, reboot), rip = 0x1346f48388ca, rsp = 0x1346f299e3f8, rbp = 0x1346f299e570 --- KDB: enter: panic [ thread pid 53890 tid 101129 ] Stopped at kdb_enter+0x33: movq $0,0xe3a392(%rip) db> bt Tracing pid 53890 tid 101129 td 0xfffff84085258000 kdb_enter() at kdb_enter+0x33/frame 0xfffffe073a4a7c20 spl_panic() at spl_panic+0x3a/frame 0xfffffe073a4a7c80 zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xd1/frame 0xfffffe073a4a7cb0 arc_fini() at arc_fini+0x281/frame 0xfffffe073a4a7ce0 dmu_fini() at dmu_fini+0x9/frame 0xfffffe073a4a7cf0 spa_fini() at spa_fini+0x36/frame 0xfffffe073a4a7d30 zfs_kmod_fini() at zfs_kmod_fini+0x7a/frame 0xfffffe073a4a7d60 zfs_shutdown() at zfs_shutdown+0x2b/frame 0xfffffe073a4a7d70 kern_reboot() at kern_reboot+0x4f3/frame 0xfffffe073a4a7db0 sys_reboot() at sys_reboot+0x3a9/frame 0xfffffe073a4a7e00 amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe073a4a7f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe073a4a7f30 --- syscall (55, FreeBSD ELF64, reboot), rip = 0x1346f48388ca, rsp = 0x1346f299e3f8, rbp = 0x1346f299e570 --- db>
I have to wonder if anyone or anything is happening with this situation as the hits just keep on coming here : root@titan:~ # root@titan:~ # shutdown -p now Shutdown NOW! shutdown: [pid 4Stopping sshd. Waiting for PIDS: 4140, 4140. Stopping cron. Waiting for PIDS: 4144. Stopping cirrinad. Waiting for PIDS: 4076. Stopping ntpd. Waiting for PIDS: 4065. Stopping powerd. Waiting for PIDS: 4069. Stopping devd. Waiting for PIDS: 3809. Writing entropy file: . Writing early boot entropy file: . . Jan 19 20:40:45 titan syslogd: exiting on signal 15 Waiting (max 60 seconds) for system process `vnlru' to stop... done Waiting (max 60 seconds) for system process `syncer' to stop... Syncing disks, vnodes remaining... 0 0 0 em0: Link is up 1000 Mbps Full Duplex em0: link state changed to UP 0 0 0 0 done All buffers synced. Swap device mirror/swap removed. Uptime: 5h7m4s GEOM_MIRROR: Device swap: provider destroyed. GEOM_MIRROR: Device swap destroyed. panic: VERIFY3(rc->rc_count == number) failed (872153088 == 0) cpuid = 0 time = 1705696854 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe03447faaf0 vpanic() at vpanic+0x131/frame 0xfffffe03447fac20 spl_panic() at spl_panic+0x3a/frame 0xfffffe03447fac80 zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xd1/frame 0xfffffe03447facb0 arc_fini() at arc_fini+0x281/frame 0xfffffe03447face0 dmu_fini() at dmu_fini+0x9/frame 0xfffffe03447facf0 spa_fini() at spa_fini+0x36/frame 0xfffffe03447fad30 zfs_kmod_fini() at zfs_kmod_fini+0x7a/frame 0xfffffe03447fad60 zfs_shutdown() at zfs_shutdown+0x2b/frame 0xfffffe03447fad70 kern_reboot() at kern_reboot+0x4f3/frame 0xfffffe03447fadb0 sys_reboot() at sys_reboot+0x3a9/frame 0xfffffe03447fae00 amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe03447faf30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe03447faf30 --- syscall (55, FreeBSD ELF64, reboot), rip = 0x28cdaa, rsp = 0x820f195c8, rbp = 0x820f196b0 --- KDB: enter: panic [ thread pid 1 tid 100002 ] Stopped at kdb_enter+0x33: movq $0,0xe3a392(%rip) db> bt Tracing pid 1 tid 100002 td 0xfffff8010353e000 kdb_enter() at kdb_enter+0x33/frame 0xfffffe03447fac20 spl_panic() at spl_panic+0x3a/frame 0xfffffe03447fac80 zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xd1/frame 0xfffffe03447facb0 arc_fini() at arc_fini+0x281/frame 0xfffffe03447face0 dmu_fini() at dmu_fini+0x9/frame 0xfffffe03447facf0 spa_fini() at spa_fini+0x36/frame 0xfffffe03447fad30 zfs_kmod_fini() at zfs_kmod_fini+0x7a/frame 0xfffffe03447fad60 zfs_shutdown() at zfs_shutdown+0x2b/frame 0xfffffe03447fad70 kern_reboot() at kern_reboot+0x4f3/frame 0xfffffe03447fadb0 sys_reboot() at sys_reboot+0x3a9/frame 0xfffffe03447fae00 amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe03447faf30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe03447faf30 --- syscall (55, FreeBSD ELF64, reboot), rip = 0x28cdaa, rsp = 0x820f195c8, rbp = 0x820f196b0 --- db> This makes me think that a ZFS ZPOOL scrub is needed at every boot.
(In reply to Dennis Clarke from comment #5) Does a scrub fix the problem ? Some inconsistencies can only be fixed by recreating the pool 8-(
For me, at least, it's happened ~3 times between two boxes. Certainly not every time I shut the system down (which probably happens ~3 times/day). I haven't done any zscrub's to try and solve it (usually just do those monthly for the fun of it). ~9 days total, ~4 since I've seen it crash again.
Here is the object being destroyed on my system: (kgdb) p ARC_anon.arcs_size[0] $3 = {rc_count = 262144, rc_mtx = {lock_object = {lo_name = 0xffff000000c8d8f6 "rc->rc_mtx", lo_flags = 577830912, lo_data = 0, lo_witness = 0x0}, sx_lock = 1}, rc_tree = {avl_root = 0x0, avl_compar = 0xffff0000001630f0 <zfs_refcount_compare>, avl_offset = 0, avl_numnodes = 0}, rc_removed = {list_size = 48, list_offset = 0, list_head = { list_next = 0xffff0000010e6ad8 <ARC_anon+216>, list_prev = 0xffff0000010e6ad8 <ARC_anon+216>}}, rc_removed_count = 0, rc_tracked = 0} The sx_lock field is the correct value for a mutex to be passed to sx_destroy. The avl tree is supposed to be empty because rc_tracked=0. The value comes from the global variable reference_tracking_enable. Setting that to 1 at compile time might shed some light on the problem. ARC_anon+216 is the address of rc_removed.list_head. lo_flags = 577830912 = 0x22710000 = class 0x22, LO_INITIALIZED|LO_WITNESS|LO_QUIET. So there's nothing obviously wrong here except the accumulated count added to and subtracted from the counter is not zero. This is an ARC counter. I do not expect scrubbing to have any effect. I ran zpool scrub and it found no errors.
(In reply to John F. Carr from comment #8) I see your suggestion for reference_tracking_enable=1 and that may cast some light, as you say. My primary concern on a few machines is that the ZPOOL(s) be reliable. This means I will recreate the pools and at the creation I shall toss in -o compatibility=openzfs-2.0-freebsd and hope that there is no more of this dedupe or clone shenanigans. However, that may not have any effect ,,, I shall see.
I just had this happen again, during system shutdown: 15.0-CURRENT #26 main-n268150-8ecb7494250. [via kgdb] ... Unread portion of the kernel message buffer: <118>Stopping devd. <118>Waiting for PIDS: 757. <118>Writing entropy file: . <118>Writing early boot entropy file: . <118>. <118>Terminated <118>Feb 10 04:55:50 bsd15 syslogd: exiting on signal 15 Waiting (max 60 seconds) for system process `vnlru' to stop... done Waiting (max 60 seconds) for system process `syncer' to stop... Syncing disks, vnodes remaining... 0 0 0 0 0 0 0 0 done All buffers synced. Uptime: 9h56m17s GEOM_ELI: Device vtbd0p4.eli destroyed. GEOM_ELI: Detached vtbd0p4.eli on last close. panic: VERIFY3(rc->rc_count == number) failed (29696 == 0) cpuid = 0 time = 1707569759 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c737eaf0 vpanic() at vpanic+0x135/frame 0xfffffe00c737ec20 spl_panic() at spl_panic+0x3a/frame 0xfffffe00c737ec80 zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xd1/frame 0xfffffe00c737ecb0 arc_fini() at arc_fini+0x281/frame 0xfffffe00c737ece0 dmu_fini() at dmu_fini+0x9/frame 0xfffffe00c737ecf0 spa_fini() at spa_fini+0x36/frame 0xfffffe00c737ed30 zfs_kmod_fini() at zfs_kmod_fini+0x7a/frame 0xfffffe00c737ed60 zfs_shutdown() at zfs_shutdown+0x2b/frame 0xfffffe00c737ed70 kern_reboot() at kern_reboot+0x4f3/frame 0xfffffe00c737edb0 sys_reboot() at sys_reboot+0x3a9/frame 0xfffffe00c737ee00 amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe00c737ef30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c737ef30 --- syscall (55, FreeBSD ELF64, reboot), rip = 0x28d86a, rsp = 0x820f08bf8, rbp = 0x820f08ce0 --- KDB: enter: panic Uptime: 9h56m17s Dumping 3585 out of 16340 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% ... (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:403 #2 0xffffffff80b52da0 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:521 #3 0xffffffff80b532a2 in vpanic (fmt=0xffffffff8225ea82 "VERIFY3(rc->rc_count == number) failed (%llu == %llu)\n", ap=ap@entry=0xfffffe00c737ec60) at /usr/src/sys/kern/kern_shutdown.c:973 #4 0xffffffff81fbc92a in spl_panic (file=<optimized out>, func=<optimized out>, line=<unavailable>, fmt=<unavailable>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/spl/spl_misc.c:100 #5 0xffffffff820b2681 in zfs_refcount_destroy_many (rc=<optimized out>, number=<unavailable>) at /usr/src/sys/contrib/openzfs/module/zfs/refcount.c:98 #6 0xffffffff820b269c in zfs_refcount_destroy (rc=<unavailable>) at /usr/src/sys/contrib/openzfs/module/zfs/refcount.c:112 #7 0xffffffff8200d561 in arc_state_fini () at /usr/src/sys/contrib/openzfs/module/zfs/arc.c:7421 #8 arc_fini () at /usr/src/sys/contrib/openzfs/module/zfs/arc.c:7753 #9 0xffffffff8203b449 in dmu_fini () at /usr/src/sys/contrib/openzfs/module/zfs/dmu.c:2549 #10 0xffffffff820db0b6 in spa_fini () at /usr/src/sys/contrib/openzfs/module/zfs/spa_misc.c:2549 #11 0xffffffff8219985a in zfs_kmod_fini () at /usr/src/sys/contrib/openzfs/module/zfs/zfs_ioctl.c:7933 #12 0xffffffff81fc32eb in zfs__fini () at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/kmod_core.c:284 #13 zfs_shutdown (arg=<optimized out>, howto=<optimized out>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/kmod_core.c:297 #14 0xffffffff80b52d13 in kern_reboot (howto=16392) at /usr/src/sys/kern/kern_shutdown.c:517 #15 0xffffffff80b527c9 in sys_reboot (td=0xfffff80001a39000, uap=0xfffff80001a39400) at /usr/src/sys/kern/kern_shutdown.c:312 #16 0xffffffff81059473 in syscallenter (td=0xfffff80001a39000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:186 #17 amd64_syscall (td=0xfffff80001a39000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1192 #18 <signal handler called> #19 0x000000000028d86a in ?? () Backtrace stopped: Cannot access memory at address 0x820f08bf8
I wouldn't necessarily expect using openzfs-2.0 as a compat to help here, since refcount.c is about ordinary reference counting, not the reflink/BRT/... feature, and none of the stacktraces in here seem to be pointing to that. If the crash started in rebooting from late December's CURRENT, then the question, I suppose, becomes what you were running before you booted into that, as a data point for when the change that caused this breakage went in.
My first panic was on December 12. The kernel was based off c711af7727824da79d87f375f3d6829feec3799a with some local changes that should not affect this bug. An earlier kernel based off of 21fce617d1de231a30833cdd9819ef61277b08d8 did not crash the one time it was rebooted. In between the compiler was upgraded to llvm 17 and the following changes were made to sys/contrib/openzfs: 3494f7c019fc Notable upstream pull request merges: #15539 687e4d7f9 Extend import_progress kstat with a notes field #15544 c7b611926 Allow block cloning across encrypted datasets #15553 adcea23cb ZIO: Add overflow checks for linear buffers #15593 5f2700eee zpool: flush output before sleeping #15609 3e4bef52b Only provide execvpe(3) when needed #15610 735ba3a7b Use uint64_t instead of u_int64_t #15612 bcd83ccd2 ZIL: Remove TX_CLONE_RANGE replay for ZVOLs #15617 55b764e06 ZIL: Do not clone blocks from the future #15623 727497ccd module/icp/asm-arm/sha2: enable non-SIMD asm kernels on armv5/6 #15625 9743d0963 BRT: Limit brt_vdev_dump() to only one vdev #15629 f9765b182 zdb: Dump encrypted write and clone ZIL records #15634 2aa3a482a ZIL: Remove 128K into 2x68K LWB split optimization #15639 11656234b FreeBSD: Ensure that zfs_getattr() initializes the va_rdev field #15647 4836d293c zfs_refcount_remove: explictly ignore returns #15649 f0cb6482e setproctitle: fix ununitialised variable #15650 450f2d0b0 import: ignore return on hostid lookups 525fe93dc748 zfs: merge openzfs/zfs@a03ebd9be db0a4f2d7a65 openzfs: unbreak 32-bit builds. 2276e53940c2 zfs: merge openzfs/zfs@688514e47 c5405d1c8507 vn_copy_file_range(): provide ENOSYS fallback to vn_generic_copy_file_range()
At least 21fce617d1de231a30833cdd9819ef61277b08d8 ... 3494f7c019fc6558a99f63b7f647373b89bcde92 is a small window. None of the changes in that window seem super obviously fraught or risky here, skimming them. Half wondering if 21fce617d1de231a30833cdd9819ef61277b08d8 built with LLVM 17 would break the same way and it's some weird optimization going awry, but I haven't remotely tried testing this enough to say that. Anything you can tell us about the datasets on the pool? I'm going to go try reproing this in a 15 VM, but if it doesn't trivially reproduce for me (though if syzkaller is hitting it, maybe it will), then knowing things about the setup hitting it would be useful.
Based on Mark's analysis, this error is that not all of the memory used by the ARC was freed properly during module unload. This would not seem to be putting anyones data at risk. Why the refcount (bytes of memory used) is not getting to 0 is as yet unanswered. It seems to be different amounts of memory for different people, so it is not likely some small static item like it might have originally appeared (initial report was about 256kb)
A correction. My panic on December 12, 2023 was a different ZFS panic (bug #275731). I don't know if that kernel would have crashed on normal reboot. The kernel that crashed with the rc_count message was built on December 26 and has these additional commits: commit 188408da9f7c19f476c1afe9becb0d373088da31 Merge: 45835894bd98 dbda45160ffa Author: Martin Matuska <mm@FreeBSD.org> Date: Tue Dec 19 23:17:48 2023 +0100 zfs: merge openzfs/zfs@dbda45160 Notable upstream pull request merges: #15665 9b1677fb5 dmu: Allow buffer fills to fail Obtained from: OpenZFS OpenZFS commit: dbda45160ffa43e5ecf0498a609230f1afee7b3f commit 5fb307d29b364982acbde82cbf77db3cae486f8c Merge: c2e340452c14 86e115e21e5b Author: Martin Matuska <mm@FreeBSD.org> Date: Fri Dec 15 14:17:23 2023 +0100 zfs: merge openzfs/zfs@86e115e21 Notable upstream pull request merges: #15643 a9b937e06 For db_marker inherit the db pointer for AVL comparision #15644 e53e60c0b DMU: Fix lock leak on dbuf_hold() error #15653 86063d903 dbuf: Handle arcbuf assignment after block cloning #15656 86e115e21 dbuf: Set dr_data when unoverriding after clone Obtained from: OpenZFS OpenZFS commit: 86e115e21e5bdeee73c88c5a1a73a29d9637380a
Yes, I keep encountering this panic problem too.
Created attachment 248997 [details] Screen capture of the panic First time I see this one on 15.0-CURRENT FreeBSD 15.0-CURRENT main-n268520-5e248c23d995 GENERIC amd64. I just took a picture of the screen.
On a 14.1-PRERELEASE system without assertions enabled: # kldunload zfs Freed UMA keg (arc_buf_hdr_t_full) was not empty (2 items). Lost 2 pages of memory. Freed UMA keg (arc_buf_t) was not empty (2 items). Lost 2 pages of memory. Freed UMA keg (zio_buf_comb_131072) was not empty (2 items). Lost 64 pages of memory. This may be the same problem. armv7, 2 GB RAM, ~22 GB pool on an SD card.
Has anyone hit this on main as of a few days ago? I'm testing a kernel built from 2a9aae9e5f7a1a218b2c05ed28d811e533912118 and can't trigger this panic anymore. This was on a system that had previously been affected. I wonder if this was silently fixed.
I have seen the bug recently. I don't know if it is present in sources from this momth. The bug was still present as of 6077246e88b0ef6f90cbee0e580b42730a641f0f ("Mon Apr 22 19:37:17 2024 +0100" but with git you never really know). I have since disabled the assertion so I can reboot my systems without having to walk around and press buttons. I will re-enable the assertion and test if the bug is thought to be fixed.
(In reply to John F. Carr from comment #20) Sorry, I finally did end up managing to reproduce the issue. I'll dig into it some more.
I was able to crash one system once by kldunload zfs with assertions enabled. It does not happen every time. Most of the affected systems have root on ZFS and I can not unload the module.
I see the "VERIFY3 rc_count" panic on every reboot. Reboots are typically to update into a more recent -current: # bectl list BE Active Mountpoint Space Created 2024-03-31-120210 - - 1.55G 2024-03-31 14:17 2024-04-08-112551 - - 1.49G 2024-04-08 11:27 2024-04-17-112537 - - 1.50G 2024-04-17 12:37 2024-04-24-152825 - - 1.44G 2024-04-24 19:46 2024-05-03-084438 - - 1.43G 2024-05-03 10:07 2024-05-13-092336 NR / 12.6G 2024-05-13 20:48 2 pools with ssd/nvme as cache/log. 1 pool mirror, 1 pool raidz2. Both pools use the same disks. pool: rpool config: NAME STATE READ WRITE CKSUM rpool ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 diskid/DISK-WD-WCC4N4KLEZT7p3 ONLINE 0 0 0 diskid/DISK-WD-WCC4N1DF9DA2p3 ONLINE 0 0 0 diskid/DISK-WD-WX52D625R0NTp3 ONLINE 0 0 0 diskid/DISK-WD-WCC4N1PYJ3F8p3 ONLINE 0 0 0 logs diskid/DISK-1107651133000999009Cp1 ONLINE 0 0 0 cache diskid/DISK-1107651133000999009Cp2 ONLINE 0 0 0 errors: No known data errors pool: space config: NAME STATE READ WRITE CKSUM space ONLINE 0 0 0 raidz2-0 ONLINE 0 0 0 diskid/DISK-WD-WCC4N4KLEZT7p4 ONLINE 0 0 0 diskid/DISK-WD-WCC4N1DF9DA2p4 ONLINE 0 0 0 diskid/DISK-WD-WX52D625R0NTp4 ONLINE 0 0 0 diskid/DISK-WD-WX52D625R2TPp4 ONLINE 0 0 0 diskid/DISK-WD-WCC4N1PYJ3F8p4 ONLINE 0 0 0 logs diskid/DISK-S649NL0T819360Vp2 ONLINE 0 0 0 cache diskid/DISK-S649NL0T819360Vp3 ONLINE 0 0 0 errors: No known data errors
Please try this patch: https://github.com/openzfs/zfs/pull/16216 .
(In reply to Alexander Motin from comment #24) No panic at reboot after nearly a day of normal workload. Before your patch I got this panic often already after rebooting into multi-user and immediate reboot (about 30 jails with various services running on this system).
The fix has been committed upstream and the corresponding bug closed. I think this is worth an out of band cherry-pick rather than waiting for a routine merge. It's a one-liner.
Martin just done the routine merge, so the patch is in master.
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=1c27279ed22d2b5226f31836da041bae45a9c77b commit 1c27279ed22d2b5226f31836da041bae45a9c77b Author: Alexander Motin <mav@FreeBSD.org> AuthorDate: 2024-05-25 02:11:18 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2024-06-17 14:25:59 +0000 Destroy ARC buffer in case of fill error In case of error dmu_buf_fill_done() returns the buffer back into DB_UNCACHED state. Since during transition from DB_UNCACHED into DB_FILL state dbuf_noread() allocates an ARC buffer, we must free it here, otherwise it will be leaked. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15665 Closes #15802 Closes #16216 PR: 276341 (cherry picked from commit 02c5aa9b092818785ed8db4e2246a828278138e3) sys/contrib/openzfs/module/zfs/dbuf.c | 1 + 1 file changed, 1 insertion(+)
A commit in branch releng/14.1 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=f566b7eb8d94bdde7ad9ddf9bdc9cf336678c386 commit f566b7eb8d94bdde7ad9ddf9bdc9cf336678c386 Author: Alexander Motin <mav@FreeBSD.org> AuthorDate: 2024-05-25 02:11:18 +0000 Commit: Gordon Tetlow <gordon@FreeBSD.org> CommitDate: 2024-06-18 17:35:52 +0000 Destroy ARC buffer in case of fill error In case of error dmu_buf_fill_done() returns the buffer back into DB_UNCACHED state. Since during transition from DB_UNCACHED into DB_FILL state dbuf_noread() allocates an ARC buffer, we must free it here, otherwise it will be leaked. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15665 Closes #15802 Closes #16216 PR: 276341 Approved by: so Security: FreeBSD-EN-24:10.zfs (cherry picked from commit 02c5aa9b092818785ed8db4e2246a828278138e3) (cherry picked from commit 1c27279ed22d2b5226f31836da041bae45a9c77b) sys/contrib/openzfs/module/zfs/dbuf.c | 1 + 1 file changed, 1 insertion(+)