231516 – [ZFS] Crash upon dectection of failed drive.

Bug 231516 - [ZFS] Crash upon dectection of failed drive.

Summary: [ZFS] Crash upon dectection of failed drive.

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	bin (show other bugs)
Version:	CURRENT
Hardware:	amd64 Any

Importance:	--- Affects Some People
Assignee:	freebsd-fs (Nobody)

URL:
Keywords:

Depends on:
Blocks:	228911
	Show dependency tree / graph

Reported:	2018-09-20 14:24 UTC by Sean Bruno
Modified:	2019-04-25 15:52 UTC (History)
CC List:	3 users (show)

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Sean Bruno freebsd_committer

2018-09-20 14:24:55 UTC

It looks like the system here is crashing on startup trying to do something with a vdev that has a failed member on host startup.

(da3:mpt0:0:3:0): READ(10). CDB: 28 00 15 37 a9 29 00 00 24 00 
(da3:mpt0:0:3:0): CAM status: SCSI Status Error
(da3:mpt0:0:3:0): SCSI status: Check Condition
(da3:mpt0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da3:mpt0:0:3:0): Info: 0x1537a933
(da3:mpt0:0:3:0): Field Replaceable Unit: 129
(da3:mpt0:0:3:0): Actual Retry Count: 151
(da3:mpt0:0:3:0): Error 5, Unretryable error


Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 06
fault virtual address   = 0x11
fault code              = supervisor read instruction, page not present
instruction pointer     = 0x20:0x11
stack pointer           = 0x28:0xfffffe009d5e9118
frame pointer           = 0x28:0xfffffe009d5e9150
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 2464 (syncthing)
[ thread pid 2464 tid 100994 ]
Stopped at      0x11
db> bt
Tracing pid 2464 tid 100994 td 0xfffff800cf1d8000
??() at 0x11
avl_find() at avl_find+0x44/frame 0xfffffe009d5e9150
vdev_queue_change_io_priority() at vdev_queue_change_io_priority+0xa1/frame 0xfffffe009d5e9190
zio_change_priority() at zio_change_priority+0x38/frame 0xfffffe009d5e91e0
zio_change_priority() at zio_change_priority+0xe7/frame 0xfffffe009d5e9230
zio_change_priority() at zio_change_priority+0xe7/frame 0xfffffe009d5e9280
zio_change_priority() at zio_change_priority+0xe7/frame 0xfffffe009d5e92d0
arc_read() at arc_read+0xf3/frame 0xfffffe009d5e9360
dbuf_read() at dbuf_read+0x714/frame 0xfffffe009d5e9420
dmu_buf_hold_array_by_dnode() at dmu_buf_hold_array_by_dnode+0x163/frame 0xfffffe009d5e9490
dmu_read_uio_dnode() at dmu_read_uio_dnode+0x37/frame 0xfffffe009d5e9500
dmu_read_uio_dbuf() at dmu_read_uio_dbuf+0x3b/frame 0xfffffe009d5e9530
zfs_freebsd_read() at zfs_freebsd_read+0x2d3/frame 0xfffffe009d5e95e0
VOP_READ_APV() at VOP_READ_APV+0x7c/frame 0xfffffe009d5e9610
null_bypass() at null_bypass+0xd8/frame 0xfffffe009d5e9750
VOP_READ_APV() at VOP_READ_APV+0x81/frame 0xfffffe009d5e9780
vn_read() at vn_read+0x195/frame 0xfffffe009d5e9800
vn_io_fault_doio() at vn_io_fault_doio+0x43/frame 0xfffffe009d5e9860
vn_io_fault1() at vn_io_fault1+0x161/frame 0xfffffe009d5e99a0
vn_io_fault() at vn_io_fault+0x195/frame 0xfffffe009d5e9a10
dofileread() at dofileread+0x95/frame 0xfffffe009d5e9a50
sys_read() at sys_read+0xc3/frame 0xfffffe009d5e9ac0
amd64_syscall() at amd64_syscall+0x369/frame 0xfffffe009d5e9bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe009d5e9bf0
--- syscall (3, FreeBSD ELF64, sys_read), rip = 0x4a5d50, rsp = 0xc4219532c0, rbp = 0xc421953318 ---

Comment 1 Sean Bruno freebsd_committer

2018-09-20 15:30:06 UTC

ftp-mirror0.isc.freebsd.org 12.0-ALPHA6 FreeBSD 12.0-ALPHA6 #0 r338761M: Tue Sep 18 22:55:31 UTC 2018 

I've put the kernel/crashdump/debuginfo in freefall.freebsd.org:~sbruno/ftpmirror_crash.tgz

Comment 2 Sean Bruno freebsd_committer

2018-09-20 15:31:03 UTC

da3 is the disk that is failing.  Starting up the jails and indexing files on the zftp pool will cause the panic in this ticket.  I *can* remove the drive, but I want to wait for some diagnostics from ZFS folks first.


root@ftp-mirror0.isc:~ # zpool status
  pool: zftp
 state: ONLINE
  scan: scrub repaired 150K in 0 days 02:03:31 with 0 errors on Fri Aug 17 05:19:19 2018
config:

        NAME        STATE     READ WRITE CKSUM
        zftp        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            da0p1   ONLINE       0     0     0
            da1p1   ONLINE       0     0     0
            da2p1   ONLINE       0     0     0
            da3p1   ONLINE       0     0     0
            da4p1   ONLINE       0     0     0
            da5p1   ONLINE       0     0     0
            da6p1   ONLINE       0     0     0
            da7p1   ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 03:16:08 with 0 errors on Fri Aug 17 06:31:58 2018
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0

errors: No known data errors

Comment 3 Andy Farkas 2018-09-21 15:50:39 UTC

Recently saw a ZFS crash due to 2 dying/dead disks (of 4) in a RAIDZ1:

FreeBSD 11.1-STABLE #0 r331113: Sun Mar 18 19:57:45 AEST 2018

panic: solaris assert: remove_reference(hdr, ((void *)0), tag) == 0 (0xd == 0x0), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c, line: 3463
cpuid = 7
KDB: stack backtrace:
#0 0xffffffff805cf167 at kdb_backtrace+0x67
#1 0xffffffff8058cc46 at vpanic+0x186
#2 0xffffffff8058cab3 at panic+0x43
#3 0xffffffff815cb22c at assfail3+0x2c
#4 0xffffffff8124cb50 at arc_buf_destroy+0x1a0
#5 0xffffffff8125c56a at dbuf_read_done+0x4a
#6 0xffffffff8124f236 at arc_read_done+0x296
#7 0xffffffff812e357e at zio_done+0x80e
#8 0xffffffff812df05c at zio_execute+0xac
#9 0xffffffff805e0714 at taskqueue_run_locked+0x154
#10 0xffffffff805e1858 at taskqueue_thread_loop+0x98
#11 0xffffffff805596a2 at fork_exit+0x82
#12 0xffffffff8074a13e at fork_trampoline+0xe
Uptime: 53s

The culprits:

(da2:mps0:0:5:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da2:mps0:0:5:0): Error 5, Unretryable error
(da2:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 cf 2f 5c dc 00 00 00 55 00 00 
(da2:mps0:0:5:0): CAM status: SCSI Status Error
(da2:mps0:0:5:0): SCSI status: Check Condition
(da2:mps0:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da2:mps0:0:5:0): Error 6, Retries exhausted
(da2:mps0:0:5:0): Invalidating pack
...
Sep 21 11:50:16 <daemon.crit> drunkfish smartd[1060]: Device: /dev/da3 [SAT], 39538 Currently unreadable (pending) sectors

Comment 4 Sean Bruno freebsd_committer

2018-09-27 15:17:48 UTC

Yet another resilvering crash due to disk errors.  I've dropped the crashinfo, kernel and debug in:

freefall.freebsd.org:~sbruno/ftp-mirror_zfs.tar.gz

Comment 5 Allan Jude freebsd_committer

2018-09-29 00:20:03 UTC

(In reply to Andy Farkas from comment #3)
This crash is different, and should get its own bugzilla number.

Comment 6 Allan Jude freebsd_committer

2018-09-29 00:51:42 UTC

(In reply to Sean Bruno from comment #0)
I have submitted the fix for this to Release Engineering

Comment 7 commit-hook freebsd_committer

2018-09-29 01:26:34 UTC

A commit references this bug:

Author: allanjude
Date: Sat Sep 29 01:26:07 UTC 2018
New revision: 339009
URL: https://svnweb.freebsd.org/changeset/base/339009

Log:
  Avoid panic when adjusting priority of a read in the face of an IO error

  PR:		231516
  Reported by:	sbruno
  Approved by:	re (rgrimes)
  Obtained from:	ZFS-on-Linux
  X-MFC-with:	334844
  Sponsored by:	Klara Systems

  MFV/ZoL:	Fix zio->io_priority failed (7 < 6) assert

  commit c26cf0966d131b722c32f8ccecfe5791a789d975
  Author: Tony Hutter <hutter2@llnl.gov>
  Date:   Tue May 29 18:13:48 2018 -0700

    Fix zio->io_priority failed (7 < 6) assert

    This fixes an assert in vdev_queue_change_io_priority():

      VERIFY3(zio->io_priority < ZIO_PRIORITY_NUM_QUEUEABLE) failed (7 < 6)
      PANIC at vdev_queue.c:832:vdev_queue_change_io_priority()

    Reviewed-by: Tom Caputi <tcaputi@datto.com>
    Reviewed-by: George Melikov <mail@gmelikov.ru>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Tony Hutter <hutter2@llnl.gov>

Changes:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c

Comment 8 commit-hook freebsd_committer

2019-04-25 15:52:01 UTC

A commit references this bug:

Author: mav
Date: Thu Apr 25 15:51:01 UTC 2019
New revision: 346679
URL: https://svnweb.freebsd.org/changeset/base/346679

Log:
  MFC r339009 (by allanjude):
  Avoid panic when adjusting priority of a read in the face of an IO error

  PR:             231516
  Reported by:    sbruno
  Approved by:    re (rgrimes)
  Obtained from:  ZFS-on-Linux
  X-MFC-with:     334844
  Sponsored by:   Klara Systems

  MFV/ZoL:        Fix zio->io_priority failed (7 < 6) assert

  commit c26cf0966d131b722c32f8ccecfe5791a789d975
  Author: Tony Hutter <hutter2@llnl.gov>
  Date:   Tue May 29 18:13:48 2018 -0700

    Fix zio->io_priority failed (7 < 6) assert

    This fixes an assert in vdev_queue_change_io_priority():

      VERIFY3(zio->io_priority < ZIO_PRIORITY_NUM_QUEUEABLE) failed (7 < 6)
      PANIC at vdev_queue.c:832:vdev_queue_change_io_priority()

    Reviewed-by: Tom Caputi <tcaputi@datto.com>
    Reviewed-by: George Melikov <mail@gmelikov.ru>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Tony Hutter <hutter2@llnl.gov>

Changes:
_U  stable/11/
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c