Bug 291002 - ZFS error created on otherwise running and sane system.
Summary: ZFS error created on otherwise running and sane system.
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 15.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-fs (Nobody)
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2025-11-13 19:16 UTC by David Gilbert
Modified: 2025-11-21 19:09 UTC (History)
1 user

See Also:


Attachments
Core dump #1 (727.36 KB, application/x-xz)
2025-11-18 06:27 UTC, David Gilbert
Core dump #2 (646.82 KB, application/x-xz)
2025-11-18 06:27 UTC, David Gilbert
Core.txt #3 (668.93 KB, application/x-xz)
2025-11-21 19:09 UTC, David Gilbert

Description David Gilbert 2025-11-13 19:16:13 UTC
So... I've been able to reproduce this at least 3 times:

The actions of baloo (the KDE Plasma file indexer) create checksum errors on ZFS.  I see:

        NAME            STATE     READ WRITE CKSUM
        zhit            ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            gpt/zhit0a  ONLINE       0     0   218
            gpt/zhit1a  ONLINE       0     0   218

errors: Permanent errors have been detected in the following files:

        zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-11h45:/share/baloo/index
        zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-13h30:/share/baloo/index
        zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-12h30:/share/baloo/index
        /home/dgilbert/.local/share/baloo/index
        zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-13h45:/share/baloo/index
        zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-12h45:/share/baloo/index
        zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-12h15:/share/baloo/index
        zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-11h30:/share/baloo/index
        zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-13h15:/share/baloo/index

I can hear your first question... is this the media?

Both drives (4 TB WD Black SN850X NVMe) report:

Media and Data Integrity Errors:    0

Both drives are at 0% used.  No other operations seem to cause this problem.  It's always baloo... although, by number of operations, baloo is also the biggest consumer.

Moving on.  The indexed space is large:

zhit/home/dgilbert                                            3.2T     13G    3.2T     0%    /home/dgilbert
zhit/home/dgilbert/.local                                     3.2T     19G    3.2T     1%    /home/dgilbert/.local
zhit/home/dgilbert/tmp                                        3.2T     22G    3.2T     1%    /home/dgilbert/tmp
zhit/home/dgilbert/.thunderbird                               3.2T     14G    3.2T     0%    /home/dgilbert/.thunderbird
zhit/home/dgilbert/.cache                                     3.2T    7.3G    3.2T     0%    /home/dgilbert/.cache
yhit/retro                                                     16T    1.7G     16T     0%    /home/dgilbert/retro
yhit/dgilbert                                                  16T    5.6G     16T     0%    /home/dgilbert/yhit
yhit/games/wine_c                                              16T    1.0G     16T     0%    /home/dgilbert/.wine/drive_c
yhit/nextcloud                                                 18T    1.3T     16T     7%    /home/dgilbert/nextcloud
vr:/home/dgilbert                                             3.9T    535G    3.3T    14%    /d/vr/dgilbert

That last NFS mount is referenced by symlink, and I'm pretty sure it's being indexed.  I've seen the index as large as 50G.

zhit is the two 4 TB NVMe drives (and where the index is stored and where the bad files show up).  yhit is 4x 10 TB spinning rust with a 2 TB NVMe index and log.

dmesg is here: https://termbin.com/6ppe

System is:

FreeBSD hit.dclg.ca 15.0-BETA5 FreeBSD 15.0-BETA5 releng/15.0-n280912-69c726c15077 GENERIC amd64
Comment 1 David Gilbert 2025-11-13 19:17:40 UTC
BTW... other than baloo creating this error, I'm using this system as my daily driver, with all sorts of activity... so having a userland process create a ZFS error seems dire.
Comment 2 David Gilbert 2025-11-13 21:07:05 UTC
Spinning my tires a bit on this.  I realized it's four reproductions, actually.  The first time this happened, ~/.local was just a directory in ~ (dgilbert).  Then I made ~/.local its own filesystem... because a) zfs, and b) the writes from baloo are insane --- generating gigabytes of filesystem churn.  For a 50G file and 8 snapshots 15 minutes apart, it was consuming over 300G.

Right now (with the error sitting there)... baloo is still churning.  I haven't killed it yet.  I gather there's some compression.

[3:41:341]root@hit:/home/dgilbert> zfs list -rt all zhit/home/dgilbert/.local
NAME                                                                USED  AVAIL  REFER  MOUNTPOINT
zhit/home/dgilbert/.local                                          19.4G  3.16T  19.4G  /home/dgilbert/.local
zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-13h30  39.5K      -  19.4G  -
zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-13h45  39.5K      -  19.4G  -
zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-14h15    72K      -  19.4G  -
zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-14h30    77K      -  19.4G  -
zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-14h45   110K      -  19.4G  -
zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-15h15  35.5K      -  19.4G  -
zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-15h30  35.5K      -  19.4G  -
zhit/home/dgilbert/.local@zfs-auto-snap_frequent-2025-11-13-15h45  35.5K      -  19.4G  -
[3:42:342]root@hit:/home/dgilbert> ll -h .local/baloo
ls: .local/baloo: No such file or directory
[3:43:343]root@hit:/home/dgilbert> ll -h .local/share/baloo/index
-rw-r--r--  1 dgilbert wheel   28G Nov 13 07:16 .local/share/baloo/index
[3:44:344]root@hit:/home/dgilbert> du -h .local/share/baloo/index
 18G    .local/share/baloo/index

... so that's not bad... but I don't know what's happening to the actual index.
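
The ls-vs-du gap above (28G vs 18G) is expected when the dataset compresses well: ls reports the file's apparent length, while du counts blocks actually allocated.  A minimal sketch of the same effect, using a sparse file as a stand-in for compression (the filename and sizes below are illustrative, not from this report):

```python
import os
import tempfile

# Create a file with a 64 MiB apparent length that allocates almost
# nothing on disk, by seeking past the end and writing one byte.
# Compression on ZFS produces the same ls-vs-du discrepancy.
path = os.path.join(tempfile.mkdtemp(), "sparse_index")
with open(path, "wb") as f:
    f.seek(64 * 1024 * 1024 - 1)
    f.write(b"\0")

st = os.stat(path)
apparent = st.st_size             # what "ls -l" shows
allocated = st.st_blocks * 512    # what "du" counts (512-byte blocks)

print(f"apparent:  {apparent}")
print(f"allocated: {allocated}")
```

On ZFS, `zfs get compression,compressratio` on the dataset would confirm how much of the gap is compression.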
Comment 3 David Gilbert 2025-11-18 06:27:08 UTC
Created attachment 265479 [details]
Core dump #1
Comment 4 David Gilbert 2025-11-18 06:27:39 UTC
Created attachment 265480 [details]
Core dump #2
Comment 5 David Gilbert 2025-11-18 06:28:25 UTC
Added a couple of kernel dumps here... #2 is particularly dire --- I had to back out the last dozen transactions to get the pool to boot again.
Comment 6 David Gilbert 2025-11-21 19:08:48 UTC
I'm continuing to see crashes.  Here's a bit of the core dump (so people don't have to open the file) from core dump #1:

#5  0xffffffff8107bb98 in trap_fatal (frame=0xfffffe015c0dd910, 
    eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:969
        type = <optimized out>
        handled = <optimized out>
#6  <signal handler called>
No locals.
#7  pctrie_node_load (p=p@entry=0x80000000000048, smr=0x0, 
    access=PCTRIE_LOCKED) at /usr/src/sys/kern/subr_pctrie.c:123
No locals.
#8  pctrie_root_load (ptree=ptree@entry=0x80000000000048, smr=0x0, 
    access=PCTRIE_LOCKED) at /usr/src/sys/kern/subr_pctrie.c:164
No locals.
#9  _pctrie_lookup_node (ptree=ptree@entry=0x80000000000048, node=0x0, 
    index=16045693110842147038, smr=0x0, access=PCTRIE_LOCKED, 
    parent_out=<optimized out>) at /usr/src/sys/kern/subr_pctrie.c:299
        parent = 0x0
        slot = <optimized out>

Now... Core dump #2 has something in it that might mean it's a different bug.  It might also just be the system trying to cope with ZFS corruption caused by #1.

However, this just occurred last night (uploaded as core dump #3):

#5  0xffffffff81079b98 in trap_fatal (frame=0xfffffe015c0dd910, 
    eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:969
        type = <optimized out>
        handled = <optimized out>
#6  <signal handler called>
No locals.
#7  pctrie_node_load (p=p@entry=0x80000000000048, smr=0x0, 
    access=PCTRIE_LOCKED) at /usr/src/sys/kern/subr_pctrie.c:123
No locals.
#8  pctrie_root_load (ptree=ptree@entry=0x80000000000048, smr=0x0, 
    access=PCTRIE_LOCKED) at /usr/src/sys/kern/subr_pctrie.c:164
No locals.
#9  _pctrie_lookup_node (ptree=ptree@entry=0x80000000000048, node=0x0, 
    index=16045693110842147038, smr=0x0, access=PCTRIE_LOCKED, 
    parent_out=<optimized out>) at /usr/src/sys/kern/subr_pctrie.c:299
        parent = 0x0
        slot = <optimized out>
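
One possible reading of the backtrace, hedged heavily (the offset split below is an assumption, not established by the dump): the faulting pointer ptree = 0x80000000000048 is exactly 0x48 past a base with only bit 55 set, which would be consistent with a struct-member load through a near-NULL pointer that suffered a single-bit corruption.  The arithmetic:

```python
# The faulting pointer from frames #7-#9 above.
ptree = 0x80000000000048

# Split it into a plausible struct-member offset and a "base" pointer.
offset = ptree & 0xFF      # 0x48
base = ptree - offset      # 0x80000000000000

# The base has exactly one bit set (bit 55): one flip away from NULL.
print(hex(base), bin(base).count("1"), base.bit_length() - 1)
```

If that reading holds, both dumps #1 and #3 show the identical corrupted value, which points at a reproducible software path rather than random memory errors.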
Comment 7 David Gilbert 2025-11-21 19:09:44 UTC
Created attachment 265561 [details]
Core.txt #3