Summary:   [zfs] [panic] panic booting after removing zil
Product:   Base System
Component: kern
Version:   12.1-RELEASE
Hardware:  amd64
OS:        Any
Status:    New
Severity:  Affects Only Me
Priority:  ---
Keywords:  crash
Reporter:  will
Assignee:  freebsd-fs (Nobody) <fs>
CC:        junovitch
Description
will
2020-04-15 20:11:51 UTC
> Has the disk been physically removed? Can you show zpool status -v and
> gpart show output after importing the pool into mfsbsd?

At first, I tried booting with the disk still attached, since I was going to
repurpose it as a cache device. However, now the disk is entirely removed.
Note that I also have 2 USB disks (mfsbsd itself, and an 18TB external drive)
attached. They're at the bottom of the gpart output.

gpart show:

  =>          34  7814037101  diskid/DISK-WD-WCC4EKKD0A8P  GPT  (3.6T)
              34           6     - free -      (3.0K)
              40        1024  1  freebsd-boot  (512K)
            1064     4194304  2  freebsd-swap  (2.0G)
         4195368  7809841760  3  freebsd-zfs   (3.6T)
      7814037128           7     - free -      (3.5K)

  =>          40  15628053088  diskid/DISK-VAHDA7WL  GPT  (7.3T)
              40         1024  1  freebsd-boot  (512K)
            1064      4194304  2  freebsd-swap  (2.0G)
         4195368  15623857760  3  freebsd-zfs   (7.3T)

  =>          34  7814037101  diskid/DISK-WD-WCC4E0478835  GPT  (3.6T)
              34           6     - free -      (3.0K)
              40        1024  1  freebsd-boot  (512K)
            1064     4194304  2  freebsd-swap  (2.0G)
         4195368  7809841760  3  freebsd-zfs   (3.6T)
      7814037128           7     - free -      (3.5K)

  =>          34  7814037101  diskid/DISK-WD-WCC4E1262418  GPT  (3.6T)
              34           6     - free -      (3.0K)
              40        1024  1  freebsd-boot  (512K)
            1064     4194304  2  freebsd-swap  (2.0G)
         4195368  7809841760  3  freebsd-zfs   (3.6T)
      7814037128           7     - free -      (3.5K)

  =>          34  7814037101  diskid/DISK-WD-WCC4E2VZV3E1  GPT  (3.6T)
              34           6     - free -      (3.0K)
              40        1024  1  freebsd-boot  (512K)
            1064     4194304  2  freebsd-swap  (2.0G)
         4195368  7809841760  3  freebsd-zfs   (3.6T)
      7814037128           7     - free -      (3.5K)

  =>          34  7814037101  ada5  GPT  (3.6T)
              34           6     - free -      (3.0K)
              40        1024  1  freebsd-boot  (512K)
            1064     4194304  2  freebsd-swap  (2.0G)
         4195368  7809841760  3  freebsd-zfs   (3.6T)
      7814037128           7     - free -      (3.5K)

  =>          34  7814037101  diskid/DISK-WD-WCC4E1965981  GPT  (3.6T)
              34           6     - free -      (3.0K)
              40        1024  1  freebsd-boot  (512K)
            1064     4194304  2  freebsd-swap  (2.0G)
         4195368  7809841760  3  freebsd-zfs   (3.6T)
      7814037128           7     - free -      (3.5K)

  =>          34  7814037101  diskid/DISK-WD-WCC4E2050088  GPT  (3.6T)
              34           6     - free -      (3.0K)
              40        1024  1  freebsd-boot  (512K)
            1064     4194304  2  freebsd-swap  (2.0G)
         4195368  7809841760  3  freebsd-zfs   (3.6T)
      7814037128           7     - free -      (3.5K)

  =>      40  655344  da0  GPT  (7.5G)  [CORRUPT]
          40     472  1  freebsd-boot  (236K)
         512  654872  2  freebsd-ufs   (320M)

  =>      40  655344  diskid/DISK-07AA16081C285D19  GPT  (7.5G)  [CORRUPT]
          40     472  1  freebsd-boot  (236K)
         512  654872  2  freebsd-ufs   (320M)

  =>      40  39065624496  da1  GPT  (18T)
          40  39065624496  1  freebsd-zfs  (18T)

  =>      40  39065624496  diskid/DISK-575542533239343130393639  GPT  (18T)
          40  39065624496  1  freebsd-zfs  (18T)

zpool status -v:

    pool: tank
   state: ONLINE
  status: One or more devices are configured to use a non-native block size.
          Expect reduced performance.
  action: Replace affected devices with devices that support the configured
          block size, or migrate data to a properly configured pool.
    scan: scrub repaired 0 in 0 days 06:42:00 with 0 errors on Sat Apr 11 08:36:28 2020
  config:

          NAME                                            STATE     READ WRITE CKSUM
          tank                                            ONLINE       0     0     0
            mirror-0                                      ONLINE       0     0     0
              diskid/DISK-WD-WCC4E2VZV3E1p3               ONLINE       0     0     0
              gptid/c74650ad-c61c-11e3-8b42-d0509909d8a6  ONLINE       0     0     0
            mirror-1                                      ONLINE       0     0     0
              diskid/DISK-WD-WCC4E0478835p3               ONLINE       0     0     0
              diskid/DISK-WD-WCC4E1262418p3               ONLINE       0     0     0
            mirror-2                                      ONLINE       0     0     0
              diskid/DISK-WD-WCC4E1965981p3               ONLINE       0     0     0  block size: 512B configured, 4096B native
              diskid/DISK-VAHDA7WLp3                      ONLINE       0     0     0  block size: 512B configured, 4096B native
            mirror-4                                      ONLINE       0     0     0
              diskid/DISK-WD-WCC4E2050088p3               ONLINE       0     0     0
              diskid/DISK-WD-WCC4EKKD0A8Pp3               ONLINE       0     0     0

  errors: No known data errors

I am wondering if DISK-VAHDA7WL could be a problem. It has a 7+ TB partition
mirrored with a 3+ TB partition in the pool.
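For the stray-label concern discussed just below, a minimal sketch of how one
could inspect what ZFS actually sees on that oversized partition, assuming the
diskid device names shown in the gpart output above:

  # Print any ZFS vdev labels found at the standard label locations of the
  # 7.3T zfs partition; a stale or foreign label would show a pool name/guid
  # that does not match tank.
  zdb -l /dev/diskid/DISK-VAHDA7WLp3

  # From mfsbsd, listing importable pools would also reveal any phantom pool
  # that stray labels make visible.
  zpool import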
If there's any garbage that looks like a valid ZFS label in the unused portion
of the larger partition, that might confuse ZFS. Is there a way that I can
verify that? While that's one of the newer disks, I have rebooted with that
disk installed previously. I can also try breaking the mirror and rebooting,
but only if that's the only way to verify it.

I have now tried removing the larger hard disk and rebooting, and I still get
the same panic.

I came across a similar panic:

  VERIFY(nvlist_lookup_uint64(configs[i], ZPOOL_CONFIG_POOL_TXG, &txg) == 0) failed

on a newer OpenZFS system running 13.0-CURRENT from 31 Dec. In that case, the
panic did not occur after removing the ZIL device from the pool; there was only
a panic on executing bectl list after removing the ZIL. However, if I tried to
add the ZIL back into the pool, I saw the panic on that assertion. Should this
be related, the test case in
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252396 will cause a similar
fault after re-adding the ZIL to the pool.
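For reference, the sequence described above maps onto the standard commands
below. This is only a sketch of the steps, and LOGDEV (ada6p1 here) is a
placeholder rather than the device that was actually used as the ZIL:

  # Placeholder for the device that served as the separate log (ZIL) vdev.
  LOGDEV=ada6p1

  # Remove the log device; on the 13.0-CURRENT system this alone did not panic.
  zpool remove tank $LOGDEV

  # Listing boot environments afterwards is what triggered the panic there.
  bectl list

  # Re-adding the device as a log vdev hit the same VERIFY assertion.
  zpool add tank log $LOGDEV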