Bug 206109

Summary: zpool import of corrupt pool causes system to reboot
Product: Base System
Component: kern
Version: 10.2-RELEASE
Hardware: Any
OS: Any
Status: New
Resolution: ---
Severity: Affects Only Me
Priority: ---
Assignee: freebsd-fs (Nobody) <fs>
Reporter: emilec
CC: emilec, jha3

Description emilec 2016-01-10 18:38:10 UTC
I recently set up a new RAIDZ2 pool with 5 x 4TB Seagate NAS drives using NAS4Free 10.2.0.2 (revision 2235). After copying data from an existing NAS to the new pool, I discovered that some corruption had been detected. I attempted to run a scrub, but partway through the system crashed and went into a boot loop.

I reloaded NAS4Free and tried to import the pool, but each time it rebooted the system. I then tried FreeBSD-10.2-RELEASE-amd64-mini-memstick, and importing the pool there also rebooted the system. I could, however, import the pool read-only and access the data.

From the NAS4Free logs I was able to obtain the following when the system crashed after attempting an import:
Jan  1 16:21:28 nas4free syslogd: kernel boot file is /boot/kernel/kernel
Jan  1 16:21:28 nas4free kernel: Solaris: WARNING: blkptr at 0xfffffe0003a5fa40 DVA 1 has invalid VDEV 16384
Jan  1 16:21:28 nas4free kernel:
Jan  1 16:21:28 nas4free kernel:
Jan  1 16:21:28 nas4free kernel: Fatal trap 12: page fault while in kernel mode
Jan  1 16:21:28 nas4free kernel: cpuid = 1; apic id = 01
Jan  1 16:21:28 nas4free kernel: fault virtual address  = 0x50
Jan  1 16:21:28 nas4free kernel: fault code             = supervisor read data, page not present
Jan  1 16:21:28 nas4free kernel: instruction pointer    = 0x20:0xffffffff81e79f94
Jan  1 16:21:28 nas4free kernel: stack pointer          = 0x28:0xfffffe0169ef5740
Jan  1 16:21:28 nas4free kernel: frame pointer          = 0x28:0xfffffe0169ef5750
Jan  1 16:21:28 nas4free kernel: code segment           = base 0x0, limit 0xfffff, type 0x1b
Jan  1 16:21:28 nas4free kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Jan  1 16:21:28 nas4free kernel: processor eflags       = interrupt enabled, resume, IOPL = 0
Jan  1 16:21:28 nas4free kernel: current process                = 6 (txg_thread_enter)
Jan  1 16:21:28 nas4free kernel: trap number            = 12
Jan  1 16:21:28 nas4free kernel: panic: page fault
Jan  1 16:21:28 nas4free kernel: cpuid = 1
Jan  1 16:21:28 nas4free kernel: KDB: stack backtrace:
Jan  1 16:21:28 nas4free kernel: #0 0xffffffff80a86a70 at kdb_backtrace+0x60
Jan  1 16:21:28 nas4free kernel: #1 0xffffffff80a4a1d6 at vpanic+0x126
Jan  1 16:21:28 nas4free kernel: #2 0xffffffff80a4a0a3 at panic+0x43
Jan  1 16:21:28 nas4free kernel: #3 0xffffffff80ecaedb at trap_fatal+0x36b
Jan  1 16:21:28 nas4free kernel: #4 0xffffffff80ecb1dd at trap_pfault+0x2ed
Jan  1 16:21:28 nas4free kernel: #5 0xffffffff80eca87a at trap+0x47a
Jan  1 16:21:28 nas4free kernel: #6 0xffffffff80eb0c72 at calltrap+0x8
Jan  1 16:21:28 nas4free kernel: #7 0xffffffff81e8071f at vdev_mirror_child_select+0x6f
Jan  1 16:21:28 nas4free kernel: #8 0xffffffff81e802d0 at vdev_mirror_io_start+0x270
Jan  1 16:21:28 nas4free kernel: #9 0xffffffff81e9cd86 at zio_vdev_io_start+0x1d6
Jan  1 16:21:28 nas4free kernel: #10 0xffffffff81e998b2 at zio_execute+0x162
Jan  1 16:21:28 nas4free kernel: #11 0xffffffff81e991b9 at zio_nowait+0x49
Jan  1 16:21:28 nas4free kernel: #12 0xffffffff81e1c91e at arc_read+0x8fe
Jan  1 16:21:28 nas4free kernel: #13 0xffffffff81e577b2 at dsl_scan_prefetch+0xc2
Jan  1 16:21:28 nas4free kernel: #14 0xffffffff81e574a3 at dsl_scan_visitbp+0x583
Jan  1 16:21:28 nas4free kernel: #15 0xffffffff81e5722f at dsl_scan_visitbp+0x30f
Jan  1 16:21:28 nas4free kernel: #16 0xffffffff81e5722f at dsl_scan_visitbp+0x30f
Jan  1 16:21:28 nas4free kernel: Copyright (c) 1992-2015 The FreeBSD Project.
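
The crash pattern looks consistent with a NULL-pointer dereference: the damaged block pointer's DVA names vdev 16384, which cannot exist in a pool with a single top-level raidz2 vdev, so a top-level vdev lookup presumably returns NULL and a later field access faults at a small offset (note the fault virtual address of 0x50). The following is a minimal userland sketch of that failure mode, not actual ZFS source; the toy_* names are hypothetical stand-ins for the kernel's vdev structures:

#include <stdio.h>
#include <stdint.h>

struct toy_vdev {
	uint64_t	  guid;
	struct toy_vdev	**child;	/* array of top-level vdevs */
	uint64_t	  children;	/* number of top-level vdevs */
};

/* Shaped like the kernel's top-level vdev lookup: NULL when out of range. */
static struct toy_vdev *
toy_lookup_top(struct toy_vdev *root, uint64_t id)
{
	return (id < root->children ? root->child[id] : NULL);
}

int
main(void)
{
	struct toy_vdev raidz2 = { .guid = 1, .child = NULL, .children = 0 };
	struct toy_vdev *kids[] = { &raidz2 };
	struct toy_vdev root = { .guid = 0, .child = kids, .children = 1 };

	/* The corrupt blkptr carries DVA vdev id 16384, as in the log. */
	struct toy_vdev *vd = toy_lookup_top(&root, 16384);

	/*
	 * A caller that dereferences vd without checking for NULL (as the
	 * mirror I/O path in the backtrace appears to) would fault at a
	 * small offset into the structure, much like the reported
	 * "fault virtual address = 0x50".
	 */
	printf("lookup of vdev 16384: %s\n", vd != NULL ? "found" : "NULL");
	return (0);
}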

Status of the pool after a read-only import:
zpool import -F -f -o readonly=on -R /pool0 pool0
zpool status
  pool: pool0
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Wed Dec 30 13:34:03 2015
        1.06T scanned out of 8.53T at 1/s, (scan is slow, no estimated time)
        0 repaired, 12.45% done
config:

        NAME        STATE     READ WRITE CKSUM
        pool0       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

I eventually discovered that the corruption was caused by faulty RAM (it fails memtest), so I accept that the pool is corrupt.

Since NAS4Free is based on FreeBSD and the behaviour is the same, I thought this would be the best place to file the bug, but feel free to point me back to NAS4Free. Their forums, however, suggested that ZFS is enterprise software and that an enterprise would simply restore from backup. I believe it would be better to catch the error and report it rather than reboot the system.
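
For what it's worth, the "Solaris: WARNING: blkptr ... has invalid VDEV 16384" line in the log suggests the bad DVA is already detected and logged before the panic; the request is essentially that the same sanity check fail the read with an error instead of letting the I/O proceed. Below is a hedged sketch of that kind of guard, reusing the hypothetical toy_* model from the earlier sketch; it is not the actual FreeBSD/ZFS code:

#include <errno.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

struct toy_vdev {
	uint64_t	  guid;
	struct toy_vdev	**child;	/* array of top-level vdevs */
	uint64_t	  children;	/* number of top-level vdevs */
};

static struct toy_vdev *
toy_lookup_top(struct toy_vdev *root, uint64_t id)
{
	return (id < root->children ? root->child[id] : NULL);
}

/*
 * Hypothetical guard: validate a blkptr's DVA vdev id up front and fail
 * the read with EIO rather than dereferencing a NULL vdev later.
 */
static int
toy_blkptr_verify(struct toy_vdev *root, uint64_t dva_vdev)
{
	if (toy_lookup_top(root, dva_vdev) == NULL) {
		fprintf(stderr,
		    "WARNING: blkptr DVA has invalid VDEV %" PRIu64
		    ", failing read with EIO\n", dva_vdev);
		return (EIO);	/* surfaces as a read error, not a panic */
	}
	return (0);
}

int
main(void)
{
	struct toy_vdev raidz2 = { .guid = 1, .child = NULL, .children = 0 };
	struct toy_vdev *kids[] = { &raidz2 };
	struct toy_vdev root = { .guid = 0, .child = kids, .children = 1 };

	/* A corrupt blkptr: the import now sees EIO and can carry on. */
	int err = toy_blkptr_verify(&root, 16384);
	printf("verify returned %d (%s)\n", err, err == EIO ? "EIO" : "ok");
	return (0);
}
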
Comment 1 jha3 2016-02-04 00:39:19 UTC
I have experienced the same issue here (except for the faulty-RAM part). I have a 6 x 3 TB WD zpool running on an M1015 SAS RAID card, with 16 GB of ECC RAM. While I was watching a movie stored on the array late last week, the system crashed and entered a reboot loop.

Note that the OS is installed on a separate flash drive; fsck indicated no errors on the OS drive. Note also that I have a second zpool of 6 drives which can still be mounted/imported without issue.

I upgraded from FreeBSD 10.0 to 10.2 at the end of December 2015.

As in the OP's case, the array can still be mounted read-only.

Memtest was run for 48 hours and indicated no RAM errors.
SMART reports indicate no evidence of drive failure.

I am surprised and dismayed that an error in a zpool can cause this kind of system behavior. I would have expected the pool to be unmounted, or some similar indication, not a system crash and a reboot loop.