Created attachment 218405 [details] dump info file

Hi,

on my laptop I'm experiencing some kernel panics, mainly at startup or shutdown. I've got a good dump from a panic that happened during shutdown; I'm attaching the core.txt and info files. The panic happens every 3-4 shutdowns, and sometimes at startup; I'll also add a dump from a startup crash as soon as I can get one.

The machine is running r366077 with a GENERIC-NODEBUG kernel. I did test the RAM and it looks fine. This machine already had a ZFS-related issue caused by a shutdown panic some time ago: it was refusing to boot with "zfs: allocating allocated segment". I reinstalled the OS from scratch at the time.

I am unable to tell much more, apart from the clear connection to ZFS. If any further information is needed please ask. I do have the core file, so if some further investigation is needed I can try that, if told what to look at.

Thanks in advance.
Created attachment 218406 [details] core dump details
I wanted to follow up on this. While it's not definitive, I discovered that when updating the machine from the old in-kernel ZFS to the in-kernel OpenZFS I actually did not perform the "zpool upgrade". I did that at a later time (a few days ago), and I have not seen a crash since. This is not conclusive, because too few days have passed and I can't yet rule out a crash in the next few days, but a pattern is showing. The system was also crashing with the old in-kernel ZFS code; maybe the same problem was present there and is still lurking in OpenZFS, but is mitigated by the presence of a new feature flag?
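As a side note, for anyone wanting to check the same thing: the feature-flag state of a pool can be inspected before (or after) a "zpool upgrade". A minimal sketch, where the helper name and the pool name "zroot" are my own assumptions:

```shell
# Hypothetical helper: list the feature@ properties of a pool that are
# still "disabled", i.e. the ones a "zpool upgrade" would enable.
# "zpool get -H" prints tab-separated NAME / PROPERTY / VALUE / SOURCE
# columns, so the flag name is field 2 and its state is field 3.
list_disabled_features() {
    zpool get -H all "$1" | awk '$2 ~ /^feature@/ && $3 == "disabled" { print $2 }'
}

# Usage (assumes a pool named "zroot"):
# list_disabled_features zroot
```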
Created attachment 219200 [details] Crash info for crash during pkg upgrade

I spoke too soon. Just after sending the previous comment the machine crashed during "pkg upgrade", while it was extracting the contents of a package. I'm attaching info about this last crash. I have since updated to a newer head; at the time of this crash the system was still running r366077, as in the previous dump.
I have updated my laptop to main-n245104-dfff1de729b and have not seen these crashes for some time. I'll keep an eye on this, but it is possible the recent OpenZFS imports have fixed the issue. I'm leaving this one open for now and will close it if the issue does not show up again for a while.
While less frequent, I am still seeing these panics sporadically, so I'm leaving this one open.
By chance I discovered something interesting. The machine now reliably does this:

# zfs list -t snapshot
internal error: cannot iterate filesystems: Invalid argument
Abort (core dumped)

(backtrace at the end of this comment, but I don't think this one is interesting)

I tracked this down to a single snapshot that looks corrupted: if I try to inspect it, the zfs command crashes, and if I try to destroy that snapshot with:

zfs destroy zroot/var/mail@2021-03-14_18.00.00--1w

I cause a kernel panic, backtrace also at the end of this message. What I gather from that panic is that the OpenZFS code is getting EINVAL (22) back from dsl_dataset_hold_obj() while destroying the snapshot. I don't know enough about ZFS to understand more than this, unluckily.

Some more information:

> uname -a
FreeBSD ubik.madpilot.net 14.0-CURRENT FreeBSD 14.0-CURRENT main-n246069-112f007e128 MPNET amd64

The machine is an Acer laptop; the disk is an nvd(4) device and I'm running it GELI-encrypted (eli). The disk layout was created by the installer when 13 was still CURRENT.

I'm actually curious if there is a way to recover from this condition. I'll try experimenting with zdb to see if I can gather some details about why this snapshot causes a crash.

-----
zfs.core backtrace:

#0  0x00000008015dd4ba in thr_kill () from /lib/libc.so.7
#1  0x0000000801552de4 in raise () from /lib/libc.so.7
#2  0x0000000801606dc9 in abort () from /lib/libc.so.7
#3  0x000000080112e75e in zfs_standard_error_fmt () from /lib/libzfs.so.4
#4  0x000000080112e2b5 in zfs_standard_error () from /lib/libzfs.so.4
#5  0x00000008011175a3 in zfs_iter_snapshots () from /lib/libzfs.so.4
#6  0x0000000001031182 in ?? ()
#7  0x00000008011172c2 in zfs_iter_filesystems () from /lib/libzfs.so.4
#8  0x000000000103114d in ?? ()
#9  0x00000008011172c2 in zfs_iter_filesystems () from /lib/libzfs.so.4
#10 0x000000000103114d in ?? ()
#11 0x00000008011092f9 in zfs_iter_root () from /lib/libzfs.so.4
#12 0x0000000001030968 in ?? ()
#13 0x000000000103454c in ?? ()
#14 0x000000000103145e in ?? ()
#15 0x00000000010303df in ?? ()
#16 0x0000000001030300 in ?? ()
#17 0x0000000000000000 in ?? ()

-----
kernel panic backtrace:

panic: VERIFY3(0 == dsl_dataset_hold_obj(dp, dsl_dataset_phys(ds_next)->ds_next_snap_obj, FTAG, &ds_nextnext)) failed (0 == 22)
cpuid = 7
time = 1618658184
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00b55e22b0
vpanic() at vpanic+0x181/frame 0xfffffe00b55e2300
spl_panic() at spl_panic+0x3a/frame 0xfffffe00b55e2360
dsl_destroy_snapshot_sync_impl() at dsl_destroy_snapshot_sync_impl+0xbf6/frame 0xfffffe00b55e2440
dsl_destroy_snapshot_sync() at dsl_destroy_snapshot_sync+0x4e/frame 0xfffffe00b55e2480
zcp_synctask_destroy() at zcp_synctask_destroy+0xb0/frame 0xfffffe00b55e24c0
zcp_synctask_wrapper() at zcp_synctask_wrapper+0xee/frame 0xfffffe00b55e2510
luaD_precall() at luaD_precall+0x25f/frame 0xfffffe00b55e25e0
luaV_execute() at luaV_execute+0xf88/frame 0xfffffe00b55e2660
luaD_call() at luaD_call+0x1b3/frame 0xfffffe00b55e26a0
luaD_rawrunprotected() at luaD_rawrunprotected+0x53/frame 0xfffffe00b55e2740
luaD_pcall() at luaD_pcall+0x37/frame 0xfffffe00b55e2790
lua_pcallk() at lua_pcallk+0xa6/frame 0xfffffe00b55e27d0
zcp_eval_impl() at zcp_eval_impl+0xbc/frame 0xfffffe00b55e2800
dsl_sync_task_sync() at dsl_sync_task_sync+0xb4/frame 0xfffffe00b55e2830
dsl_pool_sync() at dsl_pool_sync+0x43b/frame 0xfffffe00b55e28b0
spa_sync() at spa_sync+0xafe/frame 0xfffffe00b55e2ae0
txg_sync_thread() at txg_sync_thread+0x3b3/frame 0xfffffe00b55e2bb0
fork_exit() at fork_exit+0x7d/frame 0xfffffe00b55e2bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00b55e2bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
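Since the recursive "zfs list -t snapshot" aborts as soon as it hits the bad snapshot, one way to narrow down which dataset holds it is to iterate snapshots one filesystem at a time. A minimal sketch of that idea; the helper name is my own, and it assumes a pool named "zroot":

```shell
# Hypothetical helper: walk each filesystem in the pool and report those
# where a per-dataset snapshot listing fails, to isolate the dataset
# carrying the corrupted snapshot without tripping the full walk.
find_broken_fs() {
    pool="$1"
    zfs list -H -o name -t filesystem -r "$pool" | while read -r fs; do
        # -d 1 limits the listing to this dataset's own snapshots
        if ! zfs list -H -t snapshot -d 1 -r "$fs" >/dev/null 2>&1; then
            echo "$fs"
        fi
    done
}

# Usage:
# find_broken_fs zroot
```

In this report the offending dataset would be zroot/var/mail, since that is where the snapshot that crashes zfs and panics the kernel lives.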
I've since reinstalled the machine from scratch and have not seen this bug anymore. Because of that I'm closing this bug report, since I'm unable to reproduce it. Maybe it was really caused by some data corruption, perhaps from a then-existing bug or from unlucky circumstances.