Bug 276341 - zfs panic: VERIFY3(rc->rc_count == number) failed
Summary: zfs panic: VERIFY3(rc->rc_count == number) failed
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 15.0-CURRENT
Hardware: Any Any
: --- Affects Some People
Assignee: freebsd-fs (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2024-01-15 00:03 UTC by John F. Carr
Modified: 2024-04-23 13:34 UTC (History)
9 users (show)

See Also:


Attachments
Screen capture of the panic (595.96 KB, image/jpeg)
2024-03-07 09:45 UTC, Rodrigo Osorio
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description John F. Carr 2024-01-15 00:03:55 UTC
I rebooted a system running CURRENT from late December to update to today's CURRENT and it crashed while unloading ZFS.

panic: VERIFY3(rc->rc_count == number) failed (262144 == 0)

cpuid = 0
time = 1705275564
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x38
vpanic() at vpanic+0x1a4
spl_panic() at spl_panic+0x44
zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xdc
arc_fini() at arc_fini+0x274
dmu_fini() at dmu_fini+0xc
spa_fini() at spa_fini+0x34
zfs_kmod_fini() at zfs_kmod_fini+0x78
zfs_shutdown() at zfs_shutdown+0x40
kern_reboot() at kern_reboot+0x574
sys_reboot() at sys_reboot+0x350
do_el0_sync() at do_el0_sync+0x58c
handle_el0_sync() at handle_el0_sync+0x48
--- exception, esr 0x56000000

[...]

#9  0xffff00000003daf8 in spl_panic (file=<optimized out>, 
    func=<optimized out>, line=13227925, 
    fmt=0x12 <error: Cannot access memory at address 0x12>)
    at /usr/src/sys/contrib/openzfs/module/os/freebsd/spl/spl_misc.c:100
        ap = {__stack = 0xffff0000f39d5630, __gr_top = 0xffff0000f39d55e0, 
          __vr_top = 0xffffa000010a92a0, __gr_offs = -32, __vr_offs = 0}
#10 0xffff0000001633e0 in zfs_refcount_destroy_many (rc=<optimized out>, 
    number=18446462598740100184)
    at /usr/src/sys/contrib/openzfs/module/zfs/refcount.c:98
        _verify3_right = 18446462598740100184
        _verify3_left = 115
        cookie = 0x0
        ref = <optimized out>
#11 0xffff0000001633fc in zfs_refcount_destroy (rc=0x12, 
    rc@entry=0xffff0000010e7a80 <ARC_anon+128>)
    at /usr/src/sys/contrib/openzfs/module/zfs/refcount.c:112
No locals.
#12 0xffff0000000c0508 in arc_state_fini ()
    at /usr/src/sys/contrib/openzfs/module/zfs/arc.c:7421
No locals.
#13 arc_fini () at /usr/src/sys/contrib/openzfs/module/zfs/arc.c:7753
        p = <optimized out>
#14 0xffff0000000ee4d0 in dmu_fini ()
    at /usr/src/sys/contrib/openzfs/module/zfs/dmu.c:2549
No locals.
#15 0xffff000000189dd8 in spa_fini ()
    at /usr/src/sys/contrib/openzfs/module/zfs/spa_misc.c:2549
No locals.
#16 0xffff0000001e67ac in zfs_kmod_fini ()
    at /usr/src/sys/contrib/openzfs/module/zfs/zfs_ioctl.c:7933
        zsnext = 0x0
        zs = 0x0
#17 0xffff00000006e654 in zfs__fini ()
    at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/kmod_core.c:284
No locals.
#18 zfs_shutdown (arg=<optimized out>, howto=<optimized out>)
    at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/kmod_core.c:297
No locals.
#19 0xffff000000693158 in kern_reboot (howto=0)
    at /usr/src/sys/kern/kern_shutdown.c:517
        _ep = 0xffffa00002f63c40
        _t = 0xffffa00002f63c40
        _el = 0xffffa00001709c80
        once = 1
#20 0xffff000000692b74 in sys_reboot (td=0xffff000121c2a640, 
    uap=0xffff000121c2aa40) at /usr/src/sys/kern/kern_shutdown.c:312
        error = <optimized out>
#21 0xffff000000a60060 in syscallenter (td=0xffff000121c2a640)
    at /usr/src/sys/arm64/arm64/../../kern/subr_syscall.c:186
        se = 0xffff000000fa1378 <sysent+1760>
        sa = 0xffff000121c2aa30
        p = <optimized out>
        error = <optimized out>
        sy_thr_static = true
        traced = <optimized out>
        _audit_entered = <optimized out>


My kernel configuration file is

include GENERIC
ident   STRIATUS
nooptions       WITNESS
nooptions       WITNESS_SKIPSPIN
options ZFS

The kernel that crashed had changes on main up to 5bc10feacc9d.
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2024-01-15 00:41:26 UTC
I just hit this panic with syzkaller as well.  This seems like a somewhat recent regression.

The panic arises in arc_state_fini():

  zfs_refcount_destroy(&arc_anon->arcs_size[ARC_BUFC_DATA]);

A previous

  zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);

succeeded.
Comment 2 John Kennedy 2024-01-16 00:06:53 UTC
Similar situation here.

...
exec shutdown -r now
Shutdown NOW!
shutdown: [pid 3676]
Jan 15 15:54:10 bsd15 shutdown[3676]: reboot by root: 
...
All buffers synced.
Uptime: 7h35m52s
GEOM_ELI: Device vtbd0p4.eli destroyed.
GEOM_ELI: Detached vtbd0p4.eli on last close.
panic: VERIFY3(rc->rc_count == number) failed (59392 == 0)

cpuid = 0
time = 1705362861
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c737eaf0
vpanic() at vpanic+0x131/frame 0xfffffe00c737ec20
spl_panic() at spl_panic+0x3a/frame 0xfffffe00c737ec80
zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xd1/frame 0xfffffe00c737ecb0
arc_fini() at arc_fini+0x299/frame 0xfffffe00c737ece0
dmu_fini() at dmu_fini+0x9/frame 0xfffffe00c737ecf0
spa_fini() at spa_fini+0x36/frame 0xfffffe00c737ed30
zfs_kmod_fini() at zfs_kmod_fini+0x7a/frame 0xfffffe00c737ed60
zfs_shutdown() at zfs_shutdown+0x2b/frame 0xfffffe00c737ed70
kern_reboot() at kern_reboot+0x4f3/frame 0xfffffe00c737edb0
sys_reboot() at sys_reboot+0x3a9/frame 0xfffffe00c737ee00
amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe00c737ef30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c737ef30
--- syscall (55, FreeBSD ELF64, reboot), rip = 0x28cdaa, rsp = 0x8204547a8, rbp = 0x820454890 ---
KDB: enter: panic
[ thread pid 1 tid 100002 ]
Stopped at      kdb_enter+0x33: movq    $0,0xe3a592(%rip)

It crashed on a 15.0-53df7e58cca system, but pretty sure it happened on an earlier ARM system back on 1/13.
Comment 3 John Kennedy 2024-01-16 00:37:27 UTC
The amd64 one that crashed (from strings in the core file) was:

FreeBSD 15.0-CURRENT amd64 1500010
FreeBSD 15.0-CURRENT #18 main-n267574-6137b5f7b8c: Mon Jan 15 07:40:34 PST 2024
FreeBSD clang version 17.0.6 (https://github.com/llvm/llvm-project.git llvmorg-17.0.6-0-g6009708b4367)

The aarch64 one that crashed was:

FreeBSD 15.0-CURRENT aarch64 1500008
FreeBSD 15.0-CURRENT #17 main-n267508-8a3fafc8211: Thu Jan 11 10:26:44 PST 2024
FreeBSD clang version 17.0.6 (https://github.com/llvm/llvm-project.git llvmorg-17.0.6-0-g6009708b4367)
Comment 4 Dennis Clarke 2024-01-18 14:02:36 UTC
(In reply to John Kennedy from comment #2)

I am seeing the identical same behavior here : 

panic: VERIFY3(rc->rc_count == number) failed (232652800 == 0)

cpuid = 0
time = 1705557330
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe073a4a7af0
vpanic() at vpanic+0x131/frame 0xfffffe073a4a7c20
spl_panic() at spl_panic+0x3a/frame 0xfffffe073a4a7c80
zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xd1/frame 0xfffffe073a4a7cb0
arc_fini() at arc_fini+0x281/frame 0xfffffe073a4a7ce0
dmu_fini() at dmu_fini+0x9/frame 0xfffffe073a4a7cf0
spa_fini() at spa_fini+0x36/frame 0xfffffe073a4a7d30
zfs_kmod_fini() at zfs_kmod_fini+0x7a/frame 0xfffffe073a4a7d60
zfs_shutdown() at zfs_shutdown+0x2b/frame 0xfffffe073a4a7d70
kern_reboot() at kern_reboot+0x4f3/frame 0xfffffe073a4a7db0
sys_reboot() at sys_reboot+0x3a9/frame 0xfffffe073a4a7e00
amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe073a4a7f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe073a4a7f30
--- syscall (55, FreeBSD ELF64, reboot), rip = 0x1346f48388ca, rsp = 0x1346f299e3f8, rbp = 0x1346f299e570 ---
KDB: enter: panic
[ thread pid 53890 tid 101129 ]
Stopped at      kdb_enter+0x33: movq    $0,0xe3a392(%rip)
db> bt
Tracing pid 53890 tid 101129 td 0xfffff84085258000
kdb_enter() at kdb_enter+0x33/frame 0xfffffe073a4a7c20
spl_panic() at spl_panic+0x3a/frame 0xfffffe073a4a7c80
zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xd1/frame 0xfffffe073a4a7cb0
arc_fini() at arc_fini+0x281/frame 0xfffffe073a4a7ce0
dmu_fini() at dmu_fini+0x9/frame 0xfffffe073a4a7cf0
spa_fini() at spa_fini+0x36/frame 0xfffffe073a4a7d30
zfs_kmod_fini() at zfs_kmod_fini+0x7a/frame 0xfffffe073a4a7d60
zfs_shutdown() at zfs_shutdown+0x2b/frame 0xfffffe073a4a7d70
kern_reboot() at kern_reboot+0x4f3/frame 0xfffffe073a4a7db0
sys_reboot() at sys_reboot+0x3a9/frame 0xfffffe073a4a7e00
amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe073a4a7f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe073a4a7f30
--- syscall (55, FreeBSD ELF64, reboot), rip = 0x1346f48388ca, rsp = 0x1346f299e3f8, rbp = 0x1346f299e570 ---
db>
Comment 5 Dennis Clarke 2024-01-20 13:28:42 UTC

I have to wonder if anyone or anything is happening with this
situation as the hits just keep on coming here : 

root@titan:~ # 
root@titan:~ # shutdown -p now 
Shutdown NOW!
shutdown: [pid 4Stopping sshd.
Waiting for PIDS: 4140, 4140.
Stopping cron.
Waiting for PIDS: 4144.
Stopping cirrinad.
Waiting for PIDS: 4076.
Stopping ntpd.
Waiting for PIDS: 4065.
Stopping powerd.
Waiting for PIDS: 4069.
Stopping devd.
Waiting for PIDS: 3809.
Writing entropy file: .
Writing early boot entropy file: .
.
Jan 19 20:40:45 titan syslogd: exiting on signal 15
Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop... 
Syncing disks, vnodes remaining... 0 0 0 em0: Link is up 1000 Mbps Full Duplex
em0: link state changed to UP
0 0 0 0 done
All buffers synced.
Swap device mirror/swap removed.
Uptime: 5h7m4s
GEOM_MIRROR: Device swap: provider destroyed.
GEOM_MIRROR: Device swap destroyed.
panic: VERIFY3(rc->rc_count == number) failed (872153088 == 0)

cpuid = 0
time = 1705696854
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe03447faaf0
vpanic() at vpanic+0x131/frame 0xfffffe03447fac20
spl_panic() at spl_panic+0x3a/frame 0xfffffe03447fac80
zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xd1/frame 0xfffffe03447facb0
arc_fini() at arc_fini+0x281/frame 0xfffffe03447face0
dmu_fini() at dmu_fini+0x9/frame 0xfffffe03447facf0
spa_fini() at spa_fini+0x36/frame 0xfffffe03447fad30
zfs_kmod_fini() at zfs_kmod_fini+0x7a/frame 0xfffffe03447fad60
zfs_shutdown() at zfs_shutdown+0x2b/frame 0xfffffe03447fad70
kern_reboot() at kern_reboot+0x4f3/frame 0xfffffe03447fadb0
sys_reboot() at sys_reboot+0x3a9/frame 0xfffffe03447fae00
amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe03447faf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe03447faf30
--- syscall (55, FreeBSD ELF64, reboot), rip = 0x28cdaa, rsp = 0x820f195c8, rbp = 0x820f196b0 ---
KDB: enter: panic
[ thread pid 1 tid 100002 ]
Stopped at      kdb_enter+0x33: movq    $0,0xe3a392(%rip)
db> bt
Tracing pid 1 tid 100002 td 0xfffff8010353e000
kdb_enter() at kdb_enter+0x33/frame 0xfffffe03447fac20
spl_panic() at spl_panic+0x3a/frame 0xfffffe03447fac80
zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xd1/frame 0xfffffe03447facb0
arc_fini() at arc_fini+0x281/frame 0xfffffe03447face0
dmu_fini() at dmu_fini+0x9/frame 0xfffffe03447facf0
spa_fini() at spa_fini+0x36/frame 0xfffffe03447fad30
zfs_kmod_fini() at zfs_kmod_fini+0x7a/frame 0xfffffe03447fad60
zfs_shutdown() at zfs_shutdown+0x2b/frame 0xfffffe03447fad70
kern_reboot() at kern_reboot+0x4f3/frame 0xfffffe03447fadb0
sys_reboot() at sys_reboot+0x3a9/frame 0xfffffe03447fae00
amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe03447faf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe03447faf30
--- syscall (55, FreeBSD ELF64, reboot), rip = 0x28cdaa, rsp = 0x820f195c8, rbp = 0x820f196b0 ---
db> 

This makes me think that a ZFS ZPOOL scrub is needed at every boot.
Comment 6 Kurt Jaeger freebsd_committer freebsd_triage 2024-01-20 13:33:07 UTC
(In reply to Dennis Clarke from comment #5)
Does a scrub fix the problem ? Some inconsistencies can only be fixed by
recreating the pool 8-(
Comment 7 John Kennedy 2024-01-20 18:13:07 UTC
For me, at least, it's happened ~3 times between two boxes.  Certainly not every time I shut the system down (which probably happens ~3 times/day).  I haven't done any zscrub's to try and solve it (usually just do those monthly for the fun of it).

~9 days total, ~4 since I've seen it crash again.
Comment 8 John F. Carr 2024-01-20 19:17:18 UTC
Here is the object being destroyed on my system:

(kgdb) p ARC_anon.arcs_size[0]
$3 = {rc_count = 262144, rc_mtx = {lock_object = {lo_name = 0xffff000000c8d8f6 "rc->rc_mtx", 
      lo_flags = 577830912, lo_data = 0, lo_witness = 0x0}, sx_lock = 1}, rc_tree = {avl_root = 0x0, 
    avl_compar = 0xffff0000001630f0 <zfs_refcount_compare>, avl_offset = 0, avl_numnodes = 0}, 
  rc_removed = {list_size = 48, list_offset = 0, list_head = {
      list_next = 0xffff0000010e6ad8 <ARC_anon+216>, list_prev = 0xffff0000010e6ad8 <ARC_anon+216>}}, 
  rc_removed_count = 0, rc_tracked = 0}

The sx_lock field is the correct value for a mutex to be passed to sx_destroy.

The avl tree is supposed to be empty because rc_tracked=0.
The value comes from the global variable reference_tracking_enable.
Setting that to 1 at compile time might shed some light on the problem.

ARC_anon+216 is the address of rc_removed.list_head.

lo_flags = 577830912 = 0x22710000 = class 0x22, LO_INITIALIZED|LO_WITNESS|LO_QUIET.

So there's nothing obviously wrong here except the accumulated
count added to and subtracted from the counter is not zero.

This is an ARC counter.  I do not expect scrubbing to have any effect.
I ran zpool scrub and it found no errors.
Comment 9 Dennis Clarke 2024-01-21 16:38:39 UTC
(In reply to John F. Carr from comment #8)

I see your suggestion for reference_tracking_enable=1 and that may cast
some light, as you say. My primary concern on a few machines is that the
ZPOOL(s) be reliable. This means I will recreate the pools and at the
creation I shall toss in -o compatibility=openzfs-2.0-freebsd and hope
that there is no more of this dedupe or clone shenanigans. However, that
may not have any effect ,,, I shall see.
Comment 10 John Kennedy 2024-02-10 15:08:31 UTC
I just had this happen again, during system shutdown:  15.0-CURRENT #26 main-n268150-8ecb7494250.

[via kgdb]
...
Unread portion of the kernel message buffer:
<118>Stopping devd.
<118>Waiting for PIDS: 757.
<118>Writing entropy file: .
<118>Writing early boot entropy file: .
<118>.
<118>Terminated
<118>Feb 10 04:55:50 bsd15 syslogd: exiting on signal 15
Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop... 
Syncing disks, vnodes remaining... 0 0 0 0 0 0 0 0 done
All buffers synced.
Uptime: 9h56m17s
GEOM_ELI: Device vtbd0p4.eli destroyed.
GEOM_ELI: Detached vtbd0p4.eli on last close.
panic: VERIFY3(rc->rc_count == number) failed (29696 == 0)

cpuid = 0
time = 1707569759
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c737eaf0
vpanic() at vpanic+0x135/frame 0xfffffe00c737ec20
spl_panic() at spl_panic+0x3a/frame 0xfffffe00c737ec80
zfs_refcount_destroy_many() at zfs_refcount_destroy_many+0xd1/frame 0xfffffe00c737ecb0
arc_fini() at arc_fini+0x281/frame 0xfffffe00c737ece0
dmu_fini() at dmu_fini+0x9/frame 0xfffffe00c737ecf0
spa_fini() at spa_fini+0x36/frame 0xfffffe00c737ed30
zfs_kmod_fini() at zfs_kmod_fini+0x7a/frame 0xfffffe00c737ed60
zfs_shutdown() at zfs_shutdown+0x2b/frame 0xfffffe00c737ed70
kern_reboot() at kern_reboot+0x4f3/frame 0xfffffe00c737edb0
sys_reboot() at sys_reboot+0x3a9/frame 0xfffffe00c737ee00
amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe00c737ef30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c737ef30
--- syscall (55, FreeBSD ELF64, reboot), rip = 0x28d86a, rsp = 0x820f08bf8, rbp = 0x820f08ce0 ---
KDB: enter: panic
Uptime: 9h56m17s
Dumping 3585 out of 16340 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

...

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:403
#2  0xffffffff80b52da0 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:521
#3  0xffffffff80b532a2 in vpanic (fmt=0xffffffff8225ea82 "VERIFY3(rc->rc_count == number) failed (%llu == %llu)\n", ap=ap@entry=0xfffffe00c737ec60)
    at /usr/src/sys/kern/kern_shutdown.c:973
#4  0xffffffff81fbc92a in spl_panic (file=<optimized out>, func=<optimized out>, line=<unavailable>, fmt=<unavailable>)
    at /usr/src/sys/contrib/openzfs/module/os/freebsd/spl/spl_misc.c:100
#5  0xffffffff820b2681 in zfs_refcount_destroy_many (rc=<optimized out>, number=<unavailable>)
    at /usr/src/sys/contrib/openzfs/module/zfs/refcount.c:98
#6  0xffffffff820b269c in zfs_refcount_destroy (rc=<unavailable>) at /usr/src/sys/contrib/openzfs/module/zfs/refcount.c:112
#7  0xffffffff8200d561 in arc_state_fini () at /usr/src/sys/contrib/openzfs/module/zfs/arc.c:7421
#8  arc_fini () at /usr/src/sys/contrib/openzfs/module/zfs/arc.c:7753
#9  0xffffffff8203b449 in dmu_fini () at /usr/src/sys/contrib/openzfs/module/zfs/dmu.c:2549
#10 0xffffffff820db0b6 in spa_fini () at /usr/src/sys/contrib/openzfs/module/zfs/spa_misc.c:2549
#11 0xffffffff8219985a in zfs_kmod_fini () at /usr/src/sys/contrib/openzfs/module/zfs/zfs_ioctl.c:7933
#12 0xffffffff81fc32eb in zfs__fini () at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/kmod_core.c:284
#13 zfs_shutdown (arg=<optimized out>, howto=<optimized out>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/zfs/kmod_core.c:297
#14 0xffffffff80b52d13 in kern_reboot (howto=16392) at /usr/src/sys/kern/kern_shutdown.c:517
#15 0xffffffff80b527c9 in sys_reboot (td=0xfffff80001a39000, uap=0xfffff80001a39400) at /usr/src/sys/kern/kern_shutdown.c:312
#16 0xffffffff81059473 in syscallenter (td=0xfffff80001a39000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:186
#17 amd64_syscall (td=0xfffff80001a39000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1192
#18 <signal handler called>
#19 0x000000000028d86a in ?? ()
Backtrace stopped: Cannot access memory at address 0x820f08bf8
Comment 11 Rich Ercolani 2024-02-10 16:21:31 UTC
I wouldn't necessarily expect using openzfs-2.0 as a compat to help here, since refcount.c is about ordinary reference counting, not the reflink/BRT/... feature, and none of the stacktraces in here seem to be pointing to that.

If the crash started in rebooting from late December's CURRENT, then the question, I suppose, becomes what you were running before you booted into that, as a data point for when the change that caused this breakage went in.
Comment 12 John F. Carr 2024-02-10 16:41:19 UTC
My first panic was on December 12.  The kernel was based off c711af7727824da79d87f375f3d6829feec3799a with some local changes that should not affect this bug.  An earlier kernel based off of 21fce617d1de231a30833cdd9819ef61277b08d8 did not crash the one time it was rebooted.  In between the compiler was upgraded to llvm 17 and the following changes were made to sys/contrib/openzfs:

3494f7c019fc Notable upstream pull request merges:  #15539 687e4d7f9 Extend import_progress kstat with a notes field  #15544 c7b611926 Allow block cloning across encrypted datasets  #15553 adcea23cb ZIO: Add overflow checks for linear buffers  #15593 5f2700eee zpool: flush output before sleeping  #15609 3e4bef52b Only provide execvpe(3) when needed  #15610 735ba3a7b Use uint64_t instead of u_int64_t  #15612 bcd83ccd2 ZIL: Remove TX_CLONE_RANGE replay for ZVOLs  #15617 55b764e06 ZIL: Do not clone blocks from the future  #15623 727497ccd module/icp/asm-arm/sha2: enable non-SIMD asm kernels                   on armv5/6  #15625 9743d0963 BRT: Limit brt_vdev_dump() to only one vdev  #15629 f9765b182 zdb: Dump encrypted write and clone ZIL records  #15634 2aa3a482a ZIL: Remove 128K into 2x68K LWB split optimization  #15639 11656234b FreeBSD: Ensure that zfs_getattr() initializes the                   va_rdev field  #15647 4836d293c zfs_refcount_remove: explictly ignore returns  #15649 f0cb6482e setproctitle: fix ununitialised variable  #15650 450f2d0b0 import: ignore return on hostid lookups
525fe93dc748 zfs: merge openzfs/zfs@a03ebd9be
db0a4f2d7a65 openzfs: unbreak 32-bit builds.
2276e53940c2 zfs: merge openzfs/zfs@688514e47
c5405d1c8507 vn_copy_file_range(): provide ENOSYS fallback to vn_generic_copy_file_range()
Comment 13 Rich Ercolani 2024-02-10 16:58:58 UTC
At least 21fce617d1de231a30833cdd9819ef61277b08d8 ... 3494f7c019fc6558a99f63b7f647373b89bcde92 is a small window.

None of the changes in that window seem super obviously fraught or risky here, skimming them. Half wondering if 21fce617d1de231a30833cdd9819ef61277b08d8  built with LLVM 17 would break the same way and it's some weird optimization going awry, but I haven't remotely tried testing this enough to say that.

Anything you can tell us about the datasets on the pool? I'm going to go try reproing this in a 15 VM, but if it doesn't trivially reproduce for me (though if syzkaller is hitting it, maybe it will), then knowing things about the setup hitting it would be useful.
Comment 14 Allan Jude freebsd_committer freebsd_triage 2024-02-10 17:11:55 UTC
Based on Mark's analysis, this error is that not all of the memory used by the ARC was freed properly during module unload. This would not seem to be putting anyones data at risk.

Why the refcount (bytes of memory used) is not getting to 0 is as yet unanswered.
It seems to be different amounts of memory for different people, so it is not likely some small static item like it might have originally appeared (initial report was about 256kb)
Comment 15 John F. Carr 2024-02-10 20:14:29 UTC
A correction.  My panic on December 12, 2023 was a different ZFS panic (bug #275731).  I don't know if that kernel would have crashed on normal reboot.  The kernel that crashed with the rc_count message was built on December 26 and has these additional commits:

commit 188408da9f7c19f476c1afe9becb0d373088da31
Merge: 45835894bd98 dbda45160ffa
Author: Martin Matuska <mm@FreeBSD.org>
Date:   Tue Dec 19 23:17:48 2023 +0100

    zfs: merge openzfs/zfs@dbda45160
    
    Notable upstream pull request merges:
     #15665 9b1677fb5 dmu: Allow buffer fills to fail
    
    Obtained from:  OpenZFS
    OpenZFS commit: dbda45160ffa43e5ecf0498a609230f1afee7b3f

commit 5fb307d29b364982acbde82cbf77db3cae486f8c
Merge: c2e340452c14 86e115e21e5b
Author: Martin Matuska <mm@FreeBSD.org>
Date:   Fri Dec 15 14:17:23 2023 +0100

    zfs: merge openzfs/zfs@86e115e21
    
    Notable upstream pull request merges:
     #15643 a9b937e06 For db_marker inherit the db pointer for AVL comparision
     #15644 e53e60c0b DMU: Fix lock leak on dbuf_hold() error
     #15653 86063d903 dbuf: Handle arcbuf assignment after block cloning
     #15656 86e115e21 dbuf: Set dr_data when unoverriding after clone
    
    Obtained from:  OpenZFS
    OpenZFS commit: 86e115e21e5bdeee73c88c5a1a73a29d9637380a
Comment 16 Oleg 2024-02-19 23:00:04 UTC
Yes, I keep encountering this panic problem too.
Comment 17 Rodrigo Osorio freebsd_committer freebsd_triage 2024-03-07 09:45:09 UTC
Created attachment 248997 [details]
Screen capture of the panic

First time I see this one on 15.0-CURRENT FreeBSD 15.0-CURRENT main-n268520-5e248c23d995 GENERIC amd64. I just took a picture of the screen.
Comment 18 John F. Carr 2024-04-23 13:34:37 UTC
On a 14.1-PRERELEASE system without assertions enabled:

# kldunload zfs
Freed UMA keg (arc_buf_hdr_t_full) was not empty (2 items).  Lost 2 pages of memory.
Freed UMA keg (arc_buf_t) was not empty (2 items).  Lost 2 pages of memory.
Freed UMA keg (zio_buf_comb_131072) was not empty (2 items).  Lost 64 pages of memory.

This may be the same problem.

armv7, 2 GB RAM, ~22 GB pool on an SD card.