Bug 240134 - zfs: Kernel panic while importing zpool (blkptr at <addr> has invalid COMPRESS 127)
Summary: zfs: Kernel panic while importing zpool (blkptr at <addr> has invalid COMPRESS 127)
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-RELEASE
Hardware: Any
OS: Any
Importance: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords: crash, needs-qa
Depends on:
Blocks:
 
Reported: 2019-08-26 20:29 UTC by Michel Depeige
Modified: 2019-08-27 14:25 UTC

See Also:


Attachments

Description Michel Depeige 2019-08-26 20:29:45 UTC
Hello,

One of my systems is stuck in a reboot loop: the kernel panics every time while importing the zpool (root-on-ZFS). The root pool is a ZFS mirror (two drives).

This happened a few days (hours?) after upgrading the root pool from FreeBSD 11 to 12. Not sure if it's related or not.

The issue is reproducible on other systems importing the same ZFS mirror. I tried a set of x86_64 and powerpc64 systems: same issue everywhere.

Here is the kernel panic:

ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Solaris: WARNING: blkptr at 0xfffffe001be3a800 has invalid COMPRESS 127
Solaris: WARNING: blkptr at 0xfffffe001be3a800 has invalid ETYPE 255


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x88
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff828f01b5
stack pointer	        = 0x28:0xfffffe00005f5710
frame pointer	        = 0x28:0xfffffe00005f5750
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 828 (zpool)
trap number		= 12
panic: page fault
cpuid = 0
time = 1566854520
KDB: stack backtrace:
#0 0xffffffff80be78d7 at kdb_backtrace+0x67
#1 0xffffffff80b9b4b3 at vpanic+0x1a3
#2 0xffffffff80b9b303 at panic+0x43
#3 0xffffffff81074bff at trap_fatal+0x35f
#4 0xffffffff81074c59 at trap_pfault+0x49
#5 0xffffffff8107427e at trap+0x29e
#6 0xffffffff8104f625 at calltrap+0x8
#7 0xffffffff8290267a at zio_checksum_verify+0x6a
#8 0xffffffff828fe2ec at zio_execute+0xbc
#9 0xffffffff82901d2c at zio_vdev_io_start+0x15c
#10 0xffffffff828fe2ec at zio_execute+0xbc
#11 0xffffffff828fdbfb at zio_nowait+0xcb
#12 0xffffffff82849c89 at arc_read+0x759
#13 0xffffffff8287353d at traverse_prefetch_metadata+0xbd
#14 0xffffffff828729ee at traverse_visitbp+0x3be
#15 0xffffffff82873623 at traverse_dnode+0xd3
#16 0xffffffff82872fa8 at traverse_visitbp+0x978
#17 0xffffffff82872a51 at traverse_visitbp+0x421
Uptime: 2m42s
(da1:umass-sim0:0:0:0): Synchronize cache failed
Dumping 161 out of 2009 MB:..10%..20%..30%..40%..50%..60%..70%..80%..90%..100%
Dump complete
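
For what it's worth, COMPRESS 127 and ETYPE 255 look like what you get when bits 32-47 of the blk_prop word are all ones. A quick sh sanity check, assuming the layout I read in sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h (7-bit COMPRESS at bit 32, embedded flag at bit 39, 8-bit ETYPE at bit 40 -- my reading of the header, not verified against the dump):

# a blk_prop word whose bits 32-47 are all set
prop=0x0000ffff00000000
echo $(( (prop >> 32) & 0x7f ))   # COMPRESS field (7 bits) -> 127
echo $(( (prop >> 39) & 0x1 ))    # embedded flag -> 1
echo $(( (prop >> 40) & 0xff ))   # ETYPE field (8 bits) -> 255

If that reading is right, the block pointer was overwritten with multi-byte garbage rather than a single flipped bit, and the embedded flag being set would also explain why zdb trips the BP_IS_EMBEDDED assertions further below.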

The server was stable before this. I checked the following:
- none of the usual zpool rescue import options work (-F, -X, etc…; see the list of invocations below)
- memory testing: no errors
- checked both drives for bad sectors: none found
- tried importing on ZoL v0.7.12: PANIC(), with a somewhat different backtrace
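
For reference, this is roughly the set of invocations I went through (pool name zroot, flags as documented in zpool(8)); none of them got the pool imported:

zpool import -F -n zroot              # dry-run: check whether a rewind would help
zpool import -F zroot                 # recovery mode, discard the last few transactions
zpool import -FX zroot                # extreme rewind (last resort)
zpool import -N -o readonly=on zroot  # read-only, don't mount any datasets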

After dd'ing a few TBs, the issue is easily reproduced inside a virtual machine. Both drives seem to have the exact same corruption, so this is not a drive issue (different vendors, one enterprise drive).
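
The VM reproduction is nothing fancy; roughly this (device and image names are examples from my setup, adjust as needed):

# image both mirror members from a rescue environment
dd if=/dev/ada0 of=disk0.img bs=1m conv=noerror,sync
dd if=/dev/ada1 of=disk1.img bs=1m conv=noerror,sync
# attach both images as virtual disks to a guest and run `zpool import` there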

Looks like we have two issues here:
- The first is whatever caused the corruption. I'm trying to reproduce that (probably non-ECC memory, though).
- The second is the kernel panic while importing the pool (this bug report).

I did more testing with zdb, within my limited knowledge. The issue is reproducible with zdb:

zdb -AAA -e -ddd zroot/usr/local
Assertion failed: (!BP_IS_EMBEDDED(bp) || BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA), file /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, line 5724.
Assertion failed: ((hdr)->b_lsize << 9) > 0 (0x0 > 0x0), file /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, line 3340.
Assertion failed: ((hdr)->b_lsize << 9) != 0 (0x0 != 0x0), file /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, line 2447.
Assertion failed: (bytes > 0), file /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, line 5032.
Assertion failed: ((hdr)->b_lsize << 9) != 0 (0x0 != 0x0), file /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, line 2447.
Assertion failed: ((hdr)->b_lsize << 9) != 0 (0x0 != 0x0), file /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, line 2447.
WARNING: blkptr at 0x80b124840 has invalid COMPRESS 127
WARNING: blkptr at 0x80b124840 has invalid ETYPE 255
Assertion failed: (!BP_IS_EMBEDDED(bp)), file /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c, line 1321.
Assertion failed: (zio->io_error != 0), file /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c, line 660.
Assertion failed: (zio->io_vd != NULL), file /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c, line 3619.

Objects 51200 through 51231 on dataset zroot/usr/local crash zdb. Everything else is fine.
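
(Found by walking the object range one object at a time; zdb accepts object numbers after the dataset:)

for obj in $(seq 51200 51231); do
    echo "=== object $obj ==="
    zdb -AAA -e -dddd zroot/usr/local $obj
done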

Bonus question: is there a way to nuke this dataset so that recent files can be recovered from the rest of the pool?
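
My current (unverified) salvage idea, in case it helps frame the question -- dataset names are examples, and I don't know yet whether a read-only import even survives long enough:

zpool import -N -o readonly=on -R /mnt zroot
zfs mount zroot/usr/home          # mount only datasets that avoid the bad objects
cp -Rp /mnt/usr/home /backup/     # copy data off, leaving zroot/usr/local untouched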

Core dumps are available if needed. I'm willing to test patches, since I can reproduce this in a lab.

Thanks for your help.