Bug 219411

Summary: [zfs] Stale zpool.cache corrupts renamed (but not exported) root zpool on boot
Product: Base System
Reporter: eborisch+FreeBSD
Component: kern
Assignee: freebsd-fs (Nobody) <fs>
Status: New
Severity: Affects Some People
CC: pi
Priority: ---
Version: 11.0-STABLE
Hardware: Any
OS: Any

Description eborisch+FreeBSD 2017-05-20 03:55:11 UTC
If the root zpool is renamed from a LiveCD but not exported, and the zpool.cache file is not removed, two versions of the pool (old and new name) are imported on the next boot, and corruption soon follows.

Discussed on mailing list: https://lists.freebsd.org/pipermail/freebsd-fs/2017-May/024717.html

Steps to reproduce (don't do it to your live system!):

1. Install FreeBSD with root on ZFS.
2. Reboot into a LiveCD.
3. Rename the installed root pool. [1] [2] [3]
4. Reboot into the installed OS.

Two pools will show up under 'zpool list' and 'zpool status': one with the original name and one with the new name. Errors are reported almost immediately, and the pool is soon corrupted to the point that it will no longer import. For testing, kick off a scrub on either pool to force the issue.
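A quick way to confirm that the two listed pools are really one pool imported twice is to compare GUIDs rather than names. The following is a minimal Python sketch, not part of the original report: it assumes you feed it the tab-separated output of 'zpool list -H -o name,guid', and that both entries carry the same GUID, as the closing paragraph of this report suggests. The function name and the sample pool names and GUID are hypothetical.

```python
from collections import defaultdict


def find_duplicate_guids(zpool_list_output: str) -> dict[str, list[str]]:
    """Map every pool GUID listed more than once to the pool names sharing it.

    Expects the tab-separated output of 'zpool list -H -o name,guid'
    (one pool per line: name, TAB, guid).
    """
    names_by_guid: dict[str, list[str]] = defaultdict(list)
    for line in zpool_list_output.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, guid = line.split("\t", 1)
        names_by_guid[guid.strip()].append(name.strip())
    # Keep only GUIDs that more than one imported pool claims.
    return {guid: names for guid, names in names_by_guid.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical output after the rename; names and GUID are made up.
    sample = "zroot\t1234567890123456789\nnewroot\t1234567890123456789\n"
    print(find_duplicate_guids(sample))
    # {'1234567890123456789': ['zroot', 'newroot']}
```

A non-empty result means the same pool is present under more than one name, which is the state that precedes the corruption described above.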

Notes:
[1] This requires a 'zpool import -f ...', because you can't export your root pool while it is, well, your root pool. There was some discussion on the mailing list about whether this is enough to just say "well, you've shot yourself in the foot, congrats, not a bug." I contend it is not: it is the stale cache file that seems to cause the problem, and the zpool itself is perfectly happy after being imported this way. A bad config (cache) file shouldn't corrupt a zpool on boot. Halt the boot if necessary, by all means, but don't trash the pool.

[2] If you mount the pool at this point and remove the old zpool.cache file, the problem does not occur on reboot and everything seems to be OK.

[3] If you export the pool before rebooting at this point, with or without removing the old zpool.cache file, the problem also doesn't occur.

It seems that some combination of the bootloader's initial opening of the "currently imported" pool and the handoff to the kernel (and the kernel's later parsing of the zpool.cache file) mishandles the case where the cache file describes the pool under a different name but with the same GUID, so that the pool is improperly imported a second time.
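The behavior argued for in note [1] (halt rather than import the same pool twice) can be sketched as a GUID check applied to each cache entry. This is a hypothetical Python model, not a description of how the FreeBSD loader or kernel actually process zpool.cache: PoolConfig, StaleCacheError and pools_to_import are illustrative names, and a pool is reduced to just a name and a GUID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolConfig:
    """Hypothetical minimal stand-in for a pool entry: just a name and a GUID."""
    name: str
    guid: int


class StaleCacheError(Exception):
    """Raised when the cache names a pool already imported under another name."""


def pools_to_import(cached: list[PoolConfig], imported: list[PoolConfig]) -> list[PoolConfig]:
    """Return the cached pools that still need importing.

    A cached entry whose GUID is already imported under the same name is
    skipped. One whose GUID is already imported under a different name is
    treated as stale and stops the process instead of importing the pool twice.
    """
    imported_by_guid = {pool.guid: pool for pool in imported}
    result = []
    for entry in cached:
        live = imported_by_guid.get(entry.guid)
        if live is None:
            result.append(entry)  # not imported yet: safe to import
        elif live.name != entry.name:
            raise StaleCacheError(
                f"cache lists pool {entry.name!r} (guid {entry.guid}), "
                f"but that GUID is already imported as {live.name!r}"
            )
        # Same GUID and same name: already imported, nothing to do.
    return result


if __name__ == "__main__":
    # Hypothetical scenario from this report: root pool opened as 'newroot',
    # stale cache still lists it as 'zroot' with the same (made-up) GUID.
    root = PoolConfig("newroot", 1234567890123456789)
    stale = PoolConfig("zroot", 1234567890123456789)
    try:
        pools_to_import([stale], [root])
    except StaleCacheError as err:
        print(f"refusing to import: {err}")
```

Raising here stands in for "halt the boot"; a real fix could equally choose to skip the stale entry with a warning, as long as the pool is never opened twice.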