Bug 245649 - [zfs] [panic] panic booting after removing zil
Summary: [zfs] [panic] panic booting after removing zil
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2020-04-15 20:11 UTC by will
Modified: 2022-10-12 00:48 UTC
CC List: 1 user

See Also:


Attachments
snapshot of kernel panic (190.87 KB, image/jpeg)
2020-04-15 20:11 UTC, will

Description will 2020-04-15 20:11:51 UTC
Created attachment 213430 [details]
snapshot of kernel panic

Hey,

I removed a ZIL device from my ZFS pool. I then physically removed the device from the server as well, and regenerated /boot/zfs/zpool.cache afterwards.
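
For the record, the removal and cache regeneration were along these lines (from memory; the log device name below is illustrative, not the exact one):

# detach the log (ZIL) vdev from the pool
zpool remove tank gpt/zil0
# rewrite the boot-time cache file afterwards
zpool set cachefile=/boot/zfs/zpool.cache tank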

I now get a kernel panic when trying to boot. Unusually, the server *immediately* reboots despite kern.panic_reboot_wait_time=1 being set in loader.conf. I managed to catch a blurry shot of the panic, attached.
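
For reference, the loader.conf setting in question looks like this (as I understand it, a value of -1 should make it wait indefinitely instead of counting down):

# /boot/loader.conf -- seconds to wait before automatically rebooting after a panic
kern.panic_reboot_wait_time="1"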

To save you from deciphering that, the panic is a failing assert right here: https://svnweb.freebsd.org/base/release/12.1.0/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c?revision=354337&view=markup#l5222

I was able to boot into mfsbsd, import and mount the pool with no problems, and I have removed and regenerated zpool.cache a few times. I'm not sure what else to do, and this seems like a bug.
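
From mfsbsd, the import and cache regeneration went roughly like this (the exact invocations may not be verbatim):

# import with a temporary altroot and cache file
zpool import -f -o altroot=/mnt -o cachefile=/tmp/zpool.cache tank
# copy the freshly written cache into the pool's /boot/zfs
cp /tmp/zpool.cache /mnt/boot/zfs/zpool.cache
zpool export tank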

Thanks!
Comment 1 Andriy Gapon freebsd_committer freebsd_triage 2020-04-15 21:25:19 UTC
Has the disk been physically removed?
Can you show zpool status -v and gpart show output after importing the pool into mfsbsd?
Comment 2 will 2020-04-16 08:08:55 UTC
At first I tried booting with the disk still attached, since I was going to repurpose it as a cache device. The disk has since been removed entirely.

Note that I also have 2 USB disks (mfsbsd itself, and an 18TB external drive) attached. They're at the bottom of the gpart output.

gpart show:
=>        34  7814037101  diskid/DISK-WD-WCC4EKKD0A8P  GPT  (3.6T)
          34           6                               - free -  (3.0K)
          40        1024                            1  freebsd-boot  (512K)
        1064     4194304                            2  freebsd-swap  (2.0G)
     4195368  7809841760                            3  freebsd-zfs  (3.6T)
  7814037128           7                               - free -  (3.5K)

=>         40  15628053088  diskid/DISK-VAHDA7WL  GPT  (7.3T)
           40         1024                     1  freebsd-boot  (512K)
         1064      4194304                     2  freebsd-swap  (2.0G)
      4195368  15623857760                     3  freebsd-zfs  (7.3T)

=>        34  7814037101  diskid/DISK-WD-WCC4E0478835  GPT  (3.6T)
          34           6                               - free -  (3.0K)
          40        1024                            1  freebsd-boot  (512K)
        1064     4194304                            2  freebsd-swap  (2.0G)
     4195368  7809841760                            3  freebsd-zfs  (3.6T)
  7814037128           7                               - free -  (3.5K)

=>        34  7814037101  diskid/DISK-WD-WCC4E1262418  GPT  (3.6T)
          34           6                               - free -  (3.0K)
          40        1024                            1  freebsd-boot  (512K)
        1064     4194304                            2  freebsd-swap  (2.0G)
     4195368  7809841760                            3  freebsd-zfs  (3.6T)
  7814037128           7                               - free -  (3.5K)

=>        34  7814037101  diskid/DISK-WD-WCC4E2VZV3E1  GPT  (3.6T)
          34           6                               - free -  (3.0K)
          40        1024                            1  freebsd-boot  (512K)
        1064     4194304                            2  freebsd-swap  (2.0G)
     4195368  7809841760                            3  freebsd-zfs  (3.6T)
  7814037128           7                               - free -  (3.5K)

=>        34  7814037101  ada5  GPT  (3.6T)
          34           6        - free -  (3.0K)
          40        1024     1  freebsd-boot  (512K)
        1064     4194304     2  freebsd-swap  (2.0G)
     4195368  7809841760     3  freebsd-zfs  (3.6T)
  7814037128           7        - free -  (3.5K)

=>        34  7814037101  diskid/DISK-WD-WCC4E1965981  GPT  (3.6T)
          34           6                               - free -  (3.0K)
          40        1024                            1  freebsd-boot  (512K)
        1064     4194304                            2  freebsd-swap  (2.0G)
     4195368  7809841760                            3  freebsd-zfs  (3.6T)
  7814037128           7                               - free -  (3.5K)

=>        34  7814037101  diskid/DISK-WD-WCC4E2050088  GPT  (3.6T)
          34           6                               - free -  (3.0K)
          40        1024                            1  freebsd-boot  (512K)
        1064     4194304                            2  freebsd-swap  (2.0G)
     4195368  7809841760                            3  freebsd-zfs  (3.6T)
  7814037128           7                               - free -  (3.5K)

=>    40  655344  da0  GPT  (7.5G) [CORRUPT]
      40     472    1  freebsd-boot  (236K)
     512  654872    2  freebsd-ufs  (320M)

=>    40  655344  diskid/DISK-07AA16081C285D19  GPT  (7.5G) [CORRUPT]
      40     472                             1  freebsd-boot  (236K)
     512  654872                             2  freebsd-ufs  (320M)

=>         40  39065624496  da1  GPT  (18T)
           40  39065624496    1  freebsd-zfs  (18T)

=>         40  39065624496  diskid/DISK-575542533239343130393639  GPT  (18T)
           40  39065624496                                     1  freebsd-zfs  (18T)



zpool status -v:
  pool: tank
 state: ONLINE
status: One or more devices are configured to use a non-native block size.
	Expect reduced performance.
action: Replace affected devices with devices that support the
	configured block size, or migrate data to a properly configured
	pool.
  scan: scrub repaired 0 in 0 days 06:42:00 with 0 errors on Sat Apr 11 08:36:28 2020
config:

	NAME                                            STATE     READ WRITE CKSUM
	tank                                            ONLINE       0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    diskid/DISK-WD-WCC4E2VZV3E1p3               ONLINE       0     0     0
	    gptid/c74650ad-c61c-11e3-8b42-d0509909d8a6  ONLINE       0     0     0
	  mirror-1                                      ONLINE       0     0     0
	    diskid/DISK-WD-WCC4E0478835p3               ONLINE       0     0     0
	    diskid/DISK-WD-WCC4E1262418p3               ONLINE       0     0     0
	  mirror-2                                      ONLINE       0     0     0
	    diskid/DISK-WD-WCC4E1965981p3               ONLINE       0     0     0  block size: 512B configured, 4096B native
	    diskid/DISK-VAHDA7WLp3                      ONLINE       0     0     0  block size: 512B configured, 4096B native
	  mirror-4                                      ONLINE       0     0     0
	    diskid/DISK-WD-WCC4E2050088p3               ONLINE       0     0     0
	    diskid/DISK-WD-WCC4EKKD0A8Pp3               ONLINE       0     0     0

errors: No known data errors
Comment 3 Andriy Gapon freebsd_committer freebsd_triage 2020-04-16 13:43:23 UTC
I am wondering if DISK-VAHDA7WL could be a problem.
It has a 7+ TB partition mirrored with a 3+ TB partition in the pool.
If there's any garbage that looks like a valid ZFS label in the unused portion of the larger partition, then that might confuse ZFS.
Comment 4 will 2020-04-16 15:34:10 UTC
Is there a way I can verify that? While that's one of the newer disks, I have rebooted with that disk installed before. I can also try breaking the mirror and rebooting, but only if that's the only way to verify.
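
If it helps, I could dump the standard label locations on that disk from mfsbsd with something like the following (this only checks the fixed label offsets, not the whole unused region):

# labels on the partition that is actually in the pool
zdb -l /dev/diskid/DISK-VAHDA7WLp3
# labels on the raw disk, in case of leftovers from an earlier layout
zdb -l /dev/diskid/DISK-VAHDA7WL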
Comment 5 will 2020-04-20 21:30:35 UTC
I have now tried removing the larger hard disk and rebooting, and I still get the same panic.
Comment 6 Jason Unovitch freebsd_committer freebsd_triage 2021-01-04 00:48:30 UTC
I came across a similar panic, VERIFY(nvlist_lookup_uint64(configs[i], ZPOOL_CONFIG_POOL_TXG, &txg) == 0) failed, on a newer OpenZFS-based system running 13.0-CURRENT from 31 Dec.

In this case, the panic did not happen when removing the ZIL device from the pool; it only happened when executing bectl list after the removal. However, if I tried to add the ZIL back into the pool, I saw the panic on that same statement. Should this be related, the test case in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252396 causes a similar fault after re-adding the ZIL to the pool.
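
For context, the sequence on my system was roughly the following (pool and device names are placeholders; the exact test case is in bug 252396):

# removing the log vdev itself did not panic
zpool remove zroot gpt/zil0
# listing boot environments afterwards hit the VERIFY above
bectl list
# trying to add the log vdev back hit the same VERIFY
zpool add zroot log gpt/zil0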