I am filing this bug report upon request from Andriy. This is the e-mail thread: https://lists.freebsd.org/pipermail/freebsd-fs/2017-November/025534.html
In our current setup, the source Filer periodically sends snapshots to the target Filer, about 20-50 per day. The crash is quite reproducible, although it does not show up after any fixed number of snapshots. Setting 'secondarycache' to 'metadata' instead of 'all' does prevent the frequent crashes. But then the L2ARC no longer caches user data, so reads end up on disk. ZFS performance w/o L2ARC is rather abysmal.
Restricting the secondarycache=metadata setting to the ZFS dataset where the snapshots are being received doesn't help this issue. FreeBSD still crashes w/ the same traceback.

    zfs send -v -i <src>@<from_snap> <src>@<to_snap> | zfs receive -Fuvs <dst>

We're setting secondarycache on <dst>:

    zfs set secondarycache=metadata <dst>

Setting secondarycache to metadata on the entire pool makes the system prohibitively slow. This is a real showstopper for us, as we can't transfer snapshots for more than a day.
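For anyone trying to reproduce this, the transfer and the property change quoted above can be sketched as a small dry-run script. It is only a sketch: it prints the commands instead of running them, and the dataset and snapshot names (tank/data, snap1, snap2, backup/data) are hypothetical placeholders, not the reporter's actual names.

```sh
#!/bin/sh
# Dry-run sketch: prints the commands rather than executing them, so it can
# be reviewed on a machine without ZFS. All names passed in are placeholders.

# Print the incremental send/receive pipeline quoted in the comment.
replicate_cmd() {
    src=$1 from_snap=$2 to_snap=$3 dst=$4
    printf 'zfs send -v -i %s@%s %s@%s | zfs receive -Fuvs %s\n' \
        "$src" "$from_snap" "$src" "$to_snap" "$dst"
}

# Print the property change that was tried on the receiving dataset.
restrict_l2arc_cmd() {
    printf 'zfs set secondarycache=metadata %s\n' "$1"
}

replicate_cmd tank/data snap1 snap2 backup/data
restrict_l2arc_cmd backup/data
```

Piping the printed lines to sh on the actual Filer would run them; that step is deliberately left out here.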
(In reply to Shiva from comment #2) Only completely disabling L2ARC _or_ prefetch would be a sufficient work-around for the problem.
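One possible reading of those two work-arounds, as a dry-run sketch. These are assumptions, not commands given in this report: secondarycache=none is taken here as the way to stop using the L2ARC for a pool, and the vfs.zfs.prefetch_disable sysctl as the way to turn off prefetch on FreeBSD. The pool name "tank" is a hypothetical placeholder. The script only prints the commands.

```sh
#!/bin/sh
# Dry-run sketch: prints candidate work-around commands; nothing is executed.
# Both commands are assumptions about what "disabling" means in the comment.

# Option A (assumed): stop feeding the L2ARC for a pool and for every
# dataset that inherits the property from it.
disable_l2arc_cmd() {
    printf 'zfs set secondarycache=none %s\n' "$1"
}

# Option B (assumed): turn off ZFS prefetch system-wide.
disable_prefetch_cmd() {
    printf 'sysctl vfs.zfs.prefetch_disable=1\n'
}

disable_l2arc_cmd tank
disable_prefetch_cmd
```

Either option alone should correspond to the "_or_" in the comment; both are expected to cost read performance.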
Created attachment 188880
proposed patch

George Wilson has posted a prospective patch to the illumos bug report: https://www.illumos.org/issues/8857
I am attaching a version of it adapted to FreeBSD. I haven't tested it myself yet. The patch is against FreeBSD head. Please let me know if there are any issues applying it to other branches.
Created attachment 188952
patch for 10.3

A few modifications to the proposed patch were needed for it to apply to FreeBSD 10.3. I've attached the result here. If there are issues w/ this patch, I'd appreciate feedback or edits.
Has anyone tested this yet? Please remember that your feedback is very important. Without it the chances of the fix being committed are much lower.
(In reply to Andriy Gapon from comment #6) We've been testing the modified fix for 10.3 (the diffs that I posted in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223803#c5) for a few weeks now, and we haven't had a single crash since. The systems are fairly loaded, w/ quite a few 'zfs receive' being done.
(In reply to Andriy Gapon from comment #6)
How do I apply this patch on FreeBSD 11.1 amd64?

    uname -imrs
    FreeBSD 11.1-RELEASE-p6 amd64 GENERIC

Errors:

    Hunk #7 failed at 2207.
    Hunk #10 failed at 2545.

    cat sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c.rej
    @@ -2193,7 +2207,7 @@
     ASSERT(BP_IS_GANG(bp) && zio->io_gang_leader == zio);
     ASSERT(zio->io_child_type > ZIO_CHILD_GANG);
    -if (zio->io_child_error[ZIO_CHILD_GANG] == 0)
    +if (zio->io_child_error[zio_child(ZIO_CHILD_GANG)] == 0)
     zio_gang_tree_issue(zio, zio->io_gang_tree, bp, zio->io_abd, 0);
     else
    @@ -2531,7 +2545,7 @@
     if (dde->dde_repair_abd != NULL) {
     abd_copy(zio->io_abd, dde->dde_repair_abd, zio->io_size);
    -zio->io_child_error[ZIO_CHILD_DDT] = 0;
    +zio->io_child_error[zio_child(ZIO_CHILD_DDT)] = 0;
     }
     ddt_repair_done(ddt, dde);
     zio->io_vsd = NULL;
(In reply to Andriy Gapon from comment #6)
I updated the system to 11.1-STABLE, and the patch applied without errors.

    uname -imrs
    FreeBSD 11.1-STABLE amd64

    svn info /usr/src
    Path: /usr/src
    Working Copy Root Path: /usr/src
    URL: svn://svn.freebsd.org/base/stable/11
    Relative URL: ^/stable/11
    Repository Root: svn://svn.freebsd.org/base
    Repository UUID: ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
    Revision: 328212
    Node Kind: directory
    Schedule: normal
    Last Changed Author: kp
    Last Changed Rev: 328210
    Last Changed Date: 2018-01-21 02:46:03 +0300 (Sun, 21 Jan 2018)

It will take a while to get a result. On FreeBSD 10.3 amd64, the list of previous panics in my case was:

    -rw------- 1 root wheel 24761 20 May 2016 core.txt.5
    -rw------- 1 root wheel 300884 9 Feb 2017 core.txt.6
    -rw------- 1 root wheel 329938 27 Aug 03:37 core.txt.7
    -rw------- 1 root wheel 337428 15 Nov 03:23 core.txt.8

The system is now running without interruption:

    uptime
    10:56 up 7 days, 16:20, 2 users, load averages: 5,09 5,09 5,08

It is necessary to wait for two or three months of continuous operation. Thanks for your work.
A commit references this bug:

Author: avg
Date: Thu Feb 15 14:46:30 UTC 2018
New revision: 329314
URL: https://svnweb.freebsd.org/changeset/base/329314

Log:
  MFV r329313: 8857 zio_remove_child() panic due to already destroyed parent zio

  illumos/illumos-gate@d6e1c446d7897003fd9fd36ef5aa7da350b7f6af
  https://github.com/illumos/illumos-gate/commit/d6e1c446d7897003fd9fd36ef5aa7da350b7f6af

  https://www.illumos.org/issues/8857
    I had an OS panic on one of our servers:
    ffffff01809128c0 vpanic()
    ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
    ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
    ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0, ffffff3373370908)
    ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
    ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
    ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
    ffffff0180912b40 thread_start+8()
    It panicked here:
    http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/zio.c#430
    pio->io_lock is DEAD, thus a panic. Further analysis shows the "pio"
    (parent zio of "cio") has already been destroyed.

  Reviewed by: Matthew Ahrens <mahrens@delphix.com>
  Reviewed by: Andriy Gapon <avg@FreeBSD.org>
  Reviewed by: Youzhong Yang <youzhong@gmail.com>
  Approved by: Dan McDonald <danmcd@omniti.com>
  Author: George Wilson <george.wilson@delphix.com>

  PR: 223803
  Tested by: shiva.bhanujan@quorum.com
  MFC after: 2 weeks

Changes:
_U  head/sys/cddl/contrib/opensolaris/
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
A commit references this bug:

Author: avg
Date: Thu Mar 1 10:35:05 UTC 2018
New revision: 330237
URL: https://svnweb.freebsd.org/changeset/base/330237

Log:
  MFC r329314: MFV r329313: 8857 zio_remove_child() panic due to already destroyed parent zio

  PR: 223803

Changes:
_U  stable/11/
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
A commit references this bug:

Author: avg
Date: Thu Mar 1 10:57:50 UTC 2018
New revision: 330238
URL: https://svnweb.freebsd.org/changeset/base/330238

Log:
  MFC r329314: MFV r329313: 8857 zio_remove_child() panic due to already destroyed parent zio

  PR: 223803

Changes:
_U  stable/10/
  stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h
  stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
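
For anyone wondering whether an existing source tree already has the fix, a rough check is to compare the working copy's "Revision:" line from svn info against the commit revisions listed above (head r329314, stable/11 r330237, stable/10 r330238). The helper names below are hypothetical, and the sketch assumes an unmodified checkout of one of those three branches.

```sh
#!/bin/sh
# Sketch: decide from a branch name and a working-copy revision whether the
# fix for this PR is already included. Revision numbers are taken from the
# commit notifications in this report; function names are hypothetical.

# Print the revision in which the fix landed on the given branch.
fix_revision() {
    case $1 in
        head)      echo 329314 ;;
        stable/11) echo 330237 ;;
        stable/10) echo 330238 ;;
        *)         return 1 ;;
    esac
}

# has_fix BRANCH REVISION
# Returns 0 if the fix is included, 1 if not, 2 for an unknown branch.
has_fix() {
    needed=$(fix_revision "$1") || return 2
    [ "$2" -ge "$needed" ]
}

# Example: the stable/11 checkout at r328212 shown earlier in this report.
if has_fix stable/11 328212; then
    echo "fix included"
else
    echo "fix not included"
fi
```

That checkout predates r330237, so it needed the patch applied by hand, which matches what was reported.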