| Summary: | ZFS: FreeBSD 13 does not boot after ZIL remove: `panic: VERIFY(nvlist_lookup_uint64(configs[i], ZPOOL_CONFIG_POOL_TXG, &txg)` |
|---|---|
| Product: | Base System |
| Component: | kern |
| Version: | 13.0-STABLE |
| Hardware: | Any |
| OS: | Any |
| Status: | Closed FIXED |
| Severity: | Affects Many People |
| Priority: | --- |
| Reporter: | Rumen Palov \<rpalov\> |
| Assignee: | Ryan Moeller \<freqlabs\> |
| CC: | Trond.Endrestol, freqlabs, fs, grahamperrin, ruben, titus |
| Keywords: | crash, needs-qa |
| See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=267009, https://github.com/openzfs/zfs/pull/14291 |
**Description**

Rumen Palov, 2021-06-02 07:09:00 UTC

Created attachment 225899 [details]: Panic after removing ZIL

Observed the same panic while staging a ZIL replacement in VMware. After clearing out the log vdev (and other special-purpose vdevs such as cache and special), and even after removing zpool.cache, the panic remains. `zdb -e` / `zdb -l` show no trace of the log device.

I tried:

* zeroing the log device;
* running `bonnie++ -c 2 -n 2048:64:16:512` to make sure no uberblock (as seen with `zdb -lllu`) would still refer to the previous state, verified against the timestamps in the zdb output;
* doing this before attaching the ZIL vdev;
* doing this after attaching the ZIL vdev;
* completely resilvering the mirror the ZIL was intended for, by breaking the mirror one way, re-adding a clean device, and after resilvering doing the same the other way around.

So my assumption is that this is not caused by misinterpreting data in either the ZIL or somewhere in the uberblocks, but lies beyond them, and it survives a resilver. It does not seem to be triggered on 12.2-p4.

Hard-removing the SLOG device used for the ZIL, causing the pool to fail, and then booting an ISO/memstick to `zpool replace` the missing SLOG device is a (not for the faint of heart) workaround that does not trigger the condition in my VM lab. I would advise that workaround only if you have a backup of the pool (e.g. it is a mirror, so you can rebuild the mirror if things go wrong anyway). It is probably not applicable to single-disk/raidz pools.
The problem is described at https://forums.freebsd.org/threads/nvme-adapter-with-hp-ex900-ssd-pci-e-nvme.85816/page-2#post-589863

Patch:

```diff
--- contrib/openzfs/module/os/freebsd/zfs/spa_os.c	2022-05-17 07:18:53.560252000 +0300
+++ /tmp/spa_os.c	2022-12-02 16:33:04.665494000 +0200
@@ -95,6 +95,7 @@
 	for (i = 0; i < count; i++) {
 		uint64_t txg;
 
+		if(!configs[i]) continue;
 		txg = fnvlist_lookup_uint64(configs[i], ZPOOL_CONFIG_POOL_TXG);
 		if (txg > best_txg) {
 			best_txg = txg;
```

(In reply to titus m from comment #4)

For some reason the patch fails to apply to a stock 13.1; try this:

```diff
--- spa_os.c	2022-12-03 09:21:13.458192000 +0200
+++ 1spa_os.c	2022-12-03 09:19:32.962406000 +0200
@@ -94,7 +94,7 @@
 	best_txg = 0;
 	for (i = 0; i < count; i++) {
 		uint64_t txg;
-
+		if(!configs[i]) continue;
 		txg = fnvlist_lookup_uint64(configs[i], ZPOOL_CONFIG_POOL_TXG);
 		if (txg > best_txg) {
 			best_txg = txg;
```

(In reply to titus m from comment #5)

The patch probably fails due to different indentation styles. It is only one line, so you had better edit it in by hand.

Triage: in progress (status) involves assignment to a person. (Thanks)

I can confirm that this patch, manually applied to 13.1-p5, lets you remove and add SLOG devices again without panics.
The diff of `zdb -e` before and after shows:

```diff
--- zdb-root.txt	2022-12-16 11:59:06.691436000 +0100
+++ zdb-root-slog-readded.txt	2022-12-16 14:44:34.417938000 +0100
@@ -1,6 +1,7 @@
 Configuration for import:
-        vdev_children: 2
+        vdev_children: 3
+        hole_array[0]: 1
         version: 5000
         pool_guid: 4167832821587494122
         name: 'zroot'
@@ -24,6 +25,7 @@
                 id: 0
                 guid: 4024934502194417417
                 whole_disk: 1
+                DTL: 97
                 create_txg: 4
                 path: '/dev/ada0p3'
             children[1]:
@@ -31,19 +33,24 @@
                 id: 1
                 guid: 18122104391858199043
                 whole_disk: 1
+                DTL: 96
                 create_txg: 4
                 path: '/dev/ada1p3'
         children[1]:
-            type: 'disk'
+            type: 'hole'
             id: 1
-            guid: 4122499541960188698
+            guid: 0
+        children[2]:
+            type: 'disk'
+            id: 2
+            guid: 11454742377729792592
             whole_disk: 1
-            metaslab_array: 45900
+            metaslab_array: 1932
             metaslab_shift: 26
             ashift: 12
             asize: 1068761088
             is_log: 1
-            create_txg: 46027
+            create_txg: 138279
             path: '/dev/gpt/log'
         load-policy:
             load-request-txg: 18446744073709551615
```

The (replaced) log device is present in the configuration:

```
% zpool status
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:00:20 with 0 errors on Fri Dec 16 13:10:06 2022
config:

	NAME         STATE     READ WRITE CKSUM
	zroot        ONLINE       0     0     0
	  mirror-0   ONLINE       0     0     0
	    ada0p3   ONLINE       0     0     0
	    ada1p3   ONLINE       0     0     0
	logs
	  gpt/log    ONLINE       0     0     0
	cache
	  gpt/l2arc  ONLINE       0     0     0

errors: No known data errors
```

This was fixed upstream and merged to FreeBSD in https://cgit.freebsd.org/src/commit/?id=15f0b8c309dea1dcb14d3e374686576ff68ac43f