Bug 271292

Summary:   kernel panic while dd'ing a USB disk to a ZFS directory
Product:   Base System
Component: kern
Version:   CURRENT
Hardware:  riscv
OS:        Any
Status:    Open
Severity:  Affects Only Me
Priority:  ---
Keywords:  crash, needs-qa
Reporter:  David Gilbert <dave>
Assignee:  freebsd-fs (Nobody) <fs>
CC:        marklmi26-fbsd

Attachments:
  core.txt file for the crash
  info file for coredump
Description David Gilbert 2023-05-07 04:51:29 UTC
Created attachment 242030 [details]
core.txt file for the crash

I was using my RISC-V Unleashed (running -CURRENT) to dd-copy an NVMe disk in a USB enclosure to a ZFS directory.  The crash message was unhelpful:

Unread portion of the kernel message buffer:
panic: buffer modified while frozen!
cpuid = 0
time = 1683224304
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x38
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x126
panic() at panic+0x2a
.LBB9_21() at .LBB9_21+0xc
arc_write_done() at arc_write_done+0x4e
.LBB84_358() at .LBB84_358+0xcc
.LBB41_60() at .LBB41_60+0xbc
taskqueue_run_locked() at taskqueue_run_locked+0x90
taskqueue_thread_loop() at taskqueue_thread_loop+0xd4
fork_exit() at fork_exit+0x68
fork_trampoline() at fork_trampoline+0xa
KDB: enter: panic

With slightly more information coming from the kgdb output from panic() down:

#12 0xffffffc002cc363a in arc_cksum_verify (buf=<optimized out>)
    at /usr/src/sys/contrib/openzfs/module/zfs/arc.c:1475
#13 0xffffffc002ccfc48 in arc_write_done (zio=0xffffffd0df9839c0)
    at /usr/src/sys/contrib/openzfs/module/zfs/arc.c:6725
#14 0xffffffc002dfbd16 in zio_done (zio=0xffffffd0df9839c0)
    at /usr/src/sys/contrib/openzfs/module/zfs/zio.c:4893
#15 0xffffffc002df48ae in __zio_execute (zio=0xffffffd0df9839c0)
    at /usr/src/sys/contrib/openzfs/module/zfs/zio.c:2233
#16 zio_execute (zio=<optimized out>)
    at /usr/src/sys/contrib/openzfs/module/zfs/zio.c:2144
#17 0xffffffc000350bc4 in taskqueue_run_locked (queue=0xffffffd01539c000)
    at /usr/src/sys/kern/subr_taskqueue.c:514
#18 0xffffffc000351a2c in taskqueue_thread_loop (arg=<optimized out>)
    at /usr/src/sys/kern/subr_taskqueue.c:826
#19 0xffffffc0002b8f90 in fork_exit (
    callout=0xffffffc000351954 <taskqueue_thread_loop>, 
    arg=0xffffffd00318df40, frame=0xffffffc1fa399c40)
    at /usr/src/sys/kern/kern_fork.c:1102
#20 0xffffffc0005b069e in fork_trampoline ()
    at /usr/src/sys/riscv/riscv/swtch.S:385
Backtrace stopped: frame did not save the PC

coredump, kernel and debug symbols will appear here as I upload them:

https://v4.nextcloud.towernet.ca/s/NkD6E4Bd7TAQdGH
Comment 1 David Gilbert 2023-05-07 04:52:40 UTC
Created attachment 242031 [details]
info file for coredump
Comment 2 Mark Millard 2023-05-07 14:53:11 UTC
I'll note that the FreeBSD 14.0-CURRENT #4 main-n262103-c3c5e6c3e6c4-dirty
reported in the core.txt file is from during the "import openzfs update"
disaster, with various fixes/workarounds occurring after that.

It may be too much of a mess to figure out whether this is a new problem
or not. Testing versions from after the known problems were adjusted
for may be a requirement.
Comment 3 Mark Millard 2023-05-07 15:12:42 UTC
(In reply to Mark Millard from comment #2)

"reportin the cor core.txt file" should have been:

"reported in the core.txt file"

[I seem to be good at mangling things this morning.]

Also: There might be questions about the status of
the pool for, say, block cloning possibly having
been active for a time before before its software
adjustments were made. This might be true even if
a recent version of main [so: 14] is used with the
pool now.
Comment 4 Graham Perrin freebsd_committer freebsd_triage 2023-05-07 17:13:22 UTC
(In reply to Mark Millard from comment #3)

Does zpool-get(8) not tell whether feature@block_cloning is enabled or active in this case?
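
For reference, the check is just this (a sketch; the pool name is the one reported later in this bug):

# zpool get feature@block_cloning zump
NAME  PROPERTY               VALUE   SOURCE
zump  feature@block_cloning  active  local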
Comment 5 David Gilbert 2023-05-07 17:25:39 UTC
feature@block_cloning active.

So do I need to update -CURRENT to now and reboot?
Comment 6 Mark Millard 2023-05-07 18:44:13 UTC
(In reply to Graham Perrin from comment #4)

zpool get should report disabled vs. enabled vs. active.

But I'll note that I referenced it because it is a
pool-level issue, not just a zfs one, and one that also leads
to incompatibility with older openzfs vintages. Even if
it is just enabled, there is no way back to disabled,
unless one has a zpool checkpoint that predates the
zpool upgrade and can revert to the checkpoint. (Snapshots
need not be sufficient for pool feature problems.)
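
A minimal sketch of that escape hatch (pool name illustrative; the checkpoint must have been taken before the zpool upgrade, and the rewind discards everything written after it):

# zpool checkpoint zump
... zpool upgrade and testing happen here ...
# zpool export zump
# zpool import --rewind-to-checkpoint zump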

However, this openzfs import also had other corruption-
generating problems not involving the block cloning
feature misbehavior. If one has snapshots predating the
corruptions that one can revert to, these
may allow avoiding those specific corruptions. But that
will not deal with block cloning problems if any occurred.
Comment 7 Graham Perrin freebsd_committer freebsd_triage 2023-05-07 19:13:37 UTC
(In reply to dgilbert from comment #0)

> … https://v4.nextcloud.towernet.ca/s/NkD6E4Bd7TAQdGH

In one of the files: 

FreeBSD ump.daveg.ca 14.0-CURRENT FreeBSD 14.0-CURRENT #4 main-n262103-c3c5e6c3e6c4-dirty: Thu Apr 13 13:45:26 EDT 2023     root@ump.daveg.ca:/usr/obj/usr/src/riscv.riscv64/sys/GENERIC  riscv

Your c3c5e6c3e6c4 predates <https://github.com/freebsd/freebsd-src/commit/068913e4ba3dd9b3067056e832cefc5ed264b5cc> by a few days. 

c3c5e6c3e6c4 lacks the vfs.zfs.bclone_enabled 0 (zero) by default approach. 
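
On a kernel that includes that commit, the default can be confirmed with:

# sysctl vfs.zfs.bclone_enabled
vfs.zfs.bclone_enabled: 0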

----

Mark, I misread your comment 3 as coming from the opening poster. Sorry.
Comment 8 Mark Millard 2023-05-07 19:16:21 UTC
(In reply to dgilbert from comment #5)

You may have other corruption issues not related to block cloning,
or you may have both types of corruption. Part of the issue with
some corruptions was that they predated the checksumming that
is recorded, so scrubbing and the like does not find or fix
such corruptions.

There is a crash bug that can be temporarily avoided by:

QUOTE
When in single user mode set compression property to "off" on any zfs 
active dataset that has compression other than "off" and the sync 
property to something other than "disabled".
END QUOTE

and then working on that basis until you are using an adjusted
system version.
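
Concretely, that amounts to something like the following (dataset name illustrative, taken from the bootfs property reported below; inspect first, then adjust the affected datasets):

# zfs get -r -t filesystem compression,sync zump
# zfs set compression=off zump/ROOT/default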

Do you have a pool checkpoint that predates the zpool upgrade?
If yes, would it be reasonable to revert to that? Would the
result predate the import? If yes, this could allow
progressing by jumping over the problem period completely,
but means having only older data. Similarly for creating a new
pool and restoring from backups.

Definitely get to a system version from outside the time range
that runs from the bad import until fairly recently. Then deal
with whatever corruption mess may be present, if you can.
Comment 10 David Gilbert 2023-05-07 19:19:13 UTC
Urm... the disk scrubs fine.  Does this corruption bypass scrub?
Comment 11 Mark Millard 2023-05-07 19:24:59 UTC
(In reply to dgilbert from comment #10)

See the first paragraph of comment 8: yes, it can.
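
A scrub only verifies data against the checksums recorded at write time; if the data was already wrong when its checksum was computed, the scrub reports no errors. So (pool name assumed):

# zpool scrub zump
# zpool status -v zump

can come back clean despite the corruption.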

FYI, here is what has been committed to openzfs, starting with the bad import:

Commit message	Author	Age	Files	Lines
* 	zfs: merge openzfs/zfs@d96e29576	Martin Matuska	4 days	125	-625/+1965
* 	stand: Add isspace to FreeBSD ctypes.h	Warner Losh	6 days	1	-0/+1
* 	stand: back out the most of the horrible aarch64 kludge	Warner Losh	6 days	1	-5/+8
* 	openzfs: re-enable FPU usage on aarch64	Kyle Evans	11 days	1	-6/+1
* 	Fix BLAKE3 aarch64 assembly for FreeBSD and macOS	Tino Reichardt	11 days	2	-4511/+4057
* 	zfs: Fix positive ABD size assertion in abd_verify().	Mateusz Guzik	11 days	1	-1/+2
* 	openzfs: arm64: implement kfpu_begin/kfpu_end	Kyle Evans	11 days	1	-1/+27
* 	zfs: make zfs_vfs_held() definition consistent with declaration	Dimitry Andric	12 days	1	-1/+1
* 	zfs: Revert "Fix data race between zil_commit() and zil_suspend()"	Mateusz Guzik	12 days	2	-28/+0
* 	Add support for zpool user properties	Allan Jude	12 days	12	-48/+627
* 	zfs: fix up bogus checksums with blake3 in face of cpu migration	Mateusz Guzik	12 days	1	-2/+5
* 	zfs/powerpc64: Fix big-endian powerpc64 asm	Justin Hibbits	2023-04-22	5	-1/+62
* 	zfs: fix up EINVAL from getdirentries on .zfs	Mateusz Guzik	2023-04-20	1	-0/+11
* 	zfs: add missing vn state transition for .zfs	Mateusz Guzik	2023-04-20	1	-0/+4
* 	zfs: Add vfs.zfs.bclone_enabled sysctl.	Pawel Jakub Dawidek	2023-04-17	3	-2/+11
* 	zfs: Merge https://github.com/openzfs/zfs/pull/14739	Pawel Jakub Dawidek	2023-04-17	1	-3/+1
* 	zfs: cherry-pick openzfs/zfs@c71fe7164	Pawel Jakub Dawidek	2023-04-17	1	-2/+4
* 	zfs: Revert "ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()"	Mateusz Guzik	2023-04-15	1	-16/+5
* 	zfs: don't use zfs_freebsd_copy_file_range	Mateusz Guzik	2023-04-15	1	-1/+2
* 	zfs: Appease set by unused warnings for spl_fstrans_*mark stubs.	John Baldwin	2023-04-10	1	-1/+1
* 	openzfs: adopt to the new vn_lock_pair() interface	Konstantin Belousov	2023-04-07	1	-1/+2
* 	zfs: disable kernel fpu usage on arm and aarc64	Mateusz Guzik	2023-04-07	2	-2/+2
* 	zfs: try to fallback early if can't do optimized copy	Mateusz Guzik	2023-04-07	1	-0/+8
* 	zfs: fix up EXDEV handling for clone_range	Mateusz Guzik	2023-04-07	1	-29/+27
* 	zfs: add missing vop_fplookup_vexec assignments	Mateusz Guzik	2023-04-06	1	-0/+9
* 	zfs: fix null ap->a_fsizetd NULL pointer derefernce	Martin Matuska	2023-04-05	1	-1/+1
* 	Revert "zfs: fall back if block_cloning feature is disabled"	Martin Matuska	2023-04-04	1	-10/+7
* 	zfs: fall back if block_cloning feature is disabled	Martin Matuska	2023-04-04	1	-7/+10
* 	zfs: merge openzfs/zfs@431083f75	Martin Matuska	2023-04-03	309	-15331/+45241
Comment 12 Mark Millard 2023-05-07 19:38:58 UTC
(In reply to Mark Millard from comment #11)

Hmm. That list has both ends being imports from openzfs.

The time order is most-recent to oldest, so the "starting
with" that I referenced is actually at the bottom of the
list.
Comment 13 David Gilbert 2023-05-07 19:55:42 UTC
Ok.  Previous kernel is from Feb 13th, apparently.  Is it useful to go back, or is there some other way of validating the machine ... or am I reinstalling?
Comment 14 Mark Millard 2023-05-07 21:39:21 UTC
(In reply to dgilbert from comment #13)

Of the following features, which are active?
(These are ones added after what is listed in
/usr/share/zfs/compatibility.d/openzfs-2.1-freebsd.)

      edonr
      zilsaxattr
      head_errlog
      blake3
      block_cloning
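
One way to check them all at once (pool name assumed):

# zpool get feature@edonr,feature@zilsaxattr,feature@head_errlog,feature@blake3,feature@block_cloning zump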

I do not know just when main started supporting each
of these (the list has grown over time), so one would
need to know what the Feb-13 kernel that you reference
supported. (I expect that, for FreeBSD, blake3 and
block_cloning may be the new ones vs. Feb-13, but I do
not know for sure.)

Any active "not even read-only compatible" feature
is sufficient to block access. Absent such, any active
"just read-only compatible" feature is sufficient to
block write access.
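
For a pool whose active features are all at least read-only compatible, an older zfs can still import it read-only, roughly (pool name assumed):

# zpool import -o readonly=on zump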

block_cloning is read-only compatible with an older
zfs that does not support it. The pool will go back
to just enabled status when the last cloned block is
freed. (Not that one can directly cause such freeing,
as far as I know.)

blake3 being active is not even read-only compatible with an older
zfs that does not support it. blake3 status is per dataset/filesystem.
The pool will go back to just enabled status once all filesystems
that have ever had their checksum set to blake3 are destroyed.

head_errlog being active is not even read-only compatible with an
older zfs that does not support it. Once active, it cannot be put
back to enabled for the pool.

zilsaxattr is read-only compatible with an older zfs that does
not support it. zilsaxattr status is per dataset/filesystem. The
pool will go back to just enabled status when all datasets that
use the feature have been destroyed.

edonr being active is not even read-only compatible with an older
zfs that does not support it. edonr status is per dataset/filesystem.
The pool will go back to just enabled status once all filesystems
that have ever had their checksum set to edonr are destroyed.


REMINDER (quoting):

active
This feature's on-disk format changes are in effect on the pool. Support for this feature is required to import the pool in read-write mode. If this feature is not read-only compatible, support is also required to import the pool in read-only mode (see Read-only compatibility).

enabled
An administrator has marked this feature as enabled on the pool, but the feature's on-disk format changes have not been made yet. The pool can still be imported by software that does not support this feature, but changes may be made to the on-disk format at any time which will move the feature to the active state. Some features may support returning to the enabled state after becoming active. See feature-specific documentation for details.

disabled
This feature's on-disk format changes have not been made and will not be made unless an administrator moves the feature to the enabled state. Features cannot be disabled once they have been enabled.
Comment 15 David Gilbert 2023-05-08 00:00:43 UTC
edonr "enabled"
zilsaxattr "active"
head_errlog "active"
blake3 "enabled"
block_cloning "active"
Comment 16 David Gilbert 2023-05-08 00:00:58 UTC
[2:26:325]root@ump:/var/crash> zpool get all
NAME  PROPERTY                       VALUE                          SOURCE
zump  size                           1.75T                          -
zump  capacity                       8%                             -
zump  altroot                        -                              default
zump  health                         ONLINE                         -
zump  guid                           10904999209893387658           -
zump  version                        -                              default
zump  bootfs                         zump/ROOT/default              local
zump  delegation                     on                             default
zump  autoreplace                    off                            default
zump  cachefile                      -                              default
zump  failmode                       wait                           default
zump  listsnapshots                  off                            default
zump  autoexpand                     off                            default
zump  dedupratio                     1.00x                          -
zump  free                           1.60T                          -
zump  allocated                      154G                           -
zump  readonly                       off                            -
zump  ashift                         0                              default
zump  comment                        -                              default
zump  expandsize                     -                              -
zump  freeing                        0                              -
zump  fragmentation                  31%                            -
zump  leaked                         0                              -
zump  multihost                      off                            default
zump  checkpoint                     -                              -
zump  load_guid                      9590921689347713566            -
zump  autotrim                       off                            default
zump  compatibility                  off                            default
zump  bcloneused                     828K                           -
zump  bclonesaved                    828K                           -
zump  bcloneratio                    2.00x                          -
zump  feature@async_destroy          enabled                        local
zump  feature@empty_bpobj            active                         local
zump  feature@lz4_compress           active                         local
zump  feature@multi_vdev_crash_dump  enabled                        local
zump  feature@spacemap_histogram     active                         local
zump  feature@enabled_txg            active                         local
zump  feature@hole_birth             active                         local
zump  feature@extensible_dataset     active                         local
zump  feature@embedded_data          active                         local
zump  feature@bookmarks              enabled                        local
zump  feature@filesystem_limits      enabled                        local
zump  feature@large_blocks           enabled                        local
zump  feature@large_dnode            enabled                        local
zump  feature@sha512                 enabled                        local
zump  feature@skein                  enabled                        local
zump  feature@edonr                  enabled                        local
zump  feature@userobj_accounting     active                         local
zump  feature@encryption             enabled                        local
zump  feature@project_quota          active                         local
zump  feature@device_removal         enabled                        local
zump  feature@obsolete_counts        enabled                        local
zump  feature@zpool_checkpoint       enabled                        local
zump  feature@spacemap_v2            active                         local
zump  feature@allocation_classes     enabled                        local
zump  feature@resilver_defer         enabled                        local
zump  feature@bookmark_v2            enabled                        local
zump  feature@redaction_bookmarks    enabled                        local
zump  feature@redacted_datasets      enabled                        local
zump  feature@bookmark_written       enabled                        local
zump  feature@log_spacemap           active                         local
zump  feature@livelist               enabled                        local
zump  feature@device_rebuild         enabled                        local
zump  feature@zstd_compress          active                         local
zump  feature@draid                  enabled                        local
zump  feature@zilsaxattr             active                         local
zump  feature@head_errlog            active                         local
zump  feature@blake3                 enabled                        local
zump  feature@block_cloning          active                         local
Comment 17 Mark Millard 2023-05-08 03:04:33 UTC
(In reply to dgilbert from comment #15)

Looking at the zfs source
(in openzfs/module/zcommon/zfeature_common.c),
the following appears to mean that, short of special code
changes like the temporary sysctl for block
cloning, FreeBSD gets all features as soon as they
are imported:

static boolean_t
zfs_mod_supported_feature(const char *name,
    const struct zfs_mod_supported_features *sfeatures)
{
        /*
         * The zfs module spa_feature_table[], whether in-kernel or in
         * libzpool, always supports all the features. libzfs needs to
         * query the running module, via sysfs, to determine which
         * features are supported.
         *
         * The equivalent _can_ be done on FreeBSD by way of the sysctl
         * tree, but this has not been done yet.  Therefore, we return
         * that all features are supported.
         */
 
#if defined(_KERNEL) || defined(LIB_ZPOOL_BUILD) || defined(__FreeBSD__)
        (void) name, (void) sfeatures;
        return (B_TRUE);
#else
        return (zfs_mod_supported(ZFS_SYSFS_POOL_FEATURES, name, sfeatures));
#endif
}

This leaves me confused about how/why edonr has long
gone unlisted in the likes of the openzfs-2.*-freebsd files,
but is listed in the openzfs-2.*-linux files:

# diff /usr/share/zfs/compatibility.d/openzfs-2.1-*
1c1
< # Features supported by OpenZFS 2.1 on FreeBSD
---
> # Features supported by OpenZFS 2.1 on Linux
9a10
> edonr
# diff /usr/share/zfs/compatibility.d/openzfs-2.0-*
1c1
< # Features supported by OpenZFS 2.0 on FreeBSD
---
> # Features supported by OpenZFS 2.0 on Linux
8a9
> edonr

Setting the edonr oddity in those files aside, it looks to me like:

zilsaxattr "active"
head_errlog "active"

have been handled in main's kernel well before your
Feb-13 time frame. But that still leaves:

block_cloning "active"

So, if I understand right, you would end up with just
Read-Only status for the pool for a kernel from
around Feb-13.

I've not investigated the loader for back then but
I'd expect that using a more modern loader would deal
with any issue there (if it even is a boot pool).
Comment 18 David Gilbert 2023-05-08 03:24:26 UTC
That does not really answer the question.  I suppose you're saying that the Feb 13th kernel won't boot this drive anymore.  Buildworld and buildkernel are still running (they take a day and a bit).  Do I need to reinstall, or just get a new kernel installed?
Comment 19 Mark Millard 2023-05-08 03:26:45 UTC
(In reply to Mark Millard from comment #17)

FYI: Looking at the logs listed by:

https://cgit.freebsd.org/src/log/sys/contrib/openzfs/include/zfeature_common.h

is a way to find when features were likely first
imported:

SPA_FEATURE_AVZ_V2        : 2023-May-03
SPA_FEATURE_BLOCK_CLONING : 2023-Apr-03
SPA_FEATURE_BLAKE3        : 2022-Jun-23
SPA_FEATURE_HEAD_ERRLOG   : 2022-May-12
SPA_FEATURE_ZILSAXATTR    : 2022-Mar-08

And on 2021-Feb-18:

-#if !defined(__FreeBSD__)
 	SPA_FEATURE_EDONR,
-#endif

(Past that gets into the SVN time frame.)
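
The same history can be pulled from a source checkout (a sketch; assumes /usr/src is a git clone of freebsd-src):

# git -C /usr/src log --oneline -- sys/contrib/openzfs/include/zfeature_common.h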
Comment 20 Mark Millard 2023-05-08 03:54:25 UTC
(In reply to dgilbert from comment #18)

It has been a research effort even to provide the properties
that I've reported. I'm not likely to be able to give you
simple or complete answers for this complicated context, and
I most definitely do not know your overall situation or
constraints.

FreeBSD can, as I understand it, boot read-only media to
some degree, and a read-only boot pool might fit in that
category for all I know. I've no clue how useful that
would be to you.

If you have no way to identify and fix potential corruptions,
I'd expect you would initialize a new boot pool, avoiding
sources of potentially corrupt data. But I've no clue if you
have snapshots, checkpoints, or other such for comparisons, or
if it would be reasonable for you to go back in time, if
you do have some data known to be good at the time, even if now old.
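
If restoring to a new pool from a known-good snapshot is an option, the shape of it is roughly (snapshot and pool names illustrative):

# zfs send -R zump@known-good | zfs receive -F newpool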

Good luck -- or at least better luck than landing in the
disastrous-openzfs-import time frame in the first place.