Bug 203473 - ZFS files server crash: solaris assert: start < end in trim_map.c
Summary: ZFS files server crash: solaris assert: start < end in trim_map.c
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 10.2-STABLE
Hardware: Any Any
: --- Affects Only Me
Assignee: Andriy Gapon
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-10-01 15:08 UTC by Willem Jan Withagen
Modified: 2015-11-17 15:12 UTC (History)
2 users (show)

See Also:


Attachments
proposed patch for testing (2.07 KB, patch)
2015-10-07 20:58 UTC, Andriy Gapon
no flags Details | Diff
Patch agains 10.2-STABLE (2.55 KB, patch)
2015-10-11 17:20 UTC, Willem Jan Withagen
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Willem Jan Withagen 2015-10-01 15:08:17 UTC
Under serious delete load, the server crashes with the above report.
Bug submitted on request of avg@FreeBSD.org

--WjW

  Architecture: amd64
  Architecture Version: 2
  Dump Length: 7031291904B (6705 MB)
  Blocksize: 512
  Dumptime: Tue Sep 29 12:55:01 2015
  Hostname: zfs.digiware.nl
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 10.2-STABLE #2 r287746M: Sun Sep 13 12:33:23 CEST 2015
    root@zfs.digiware.nl:/home/obj/amd64/usr/srcs/work10/src/sys/ZFS
  Panic String: solaris assert: start < end, file: /usr/srcs/work10/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/trim_map.c, line: 219


(kgdb) bt
#0  doadump (textdump=1) at pcpu.h:219
#1  0xffffffff804f9d97 in kern_reboot (howto=260) at /usr/srcs/work10/src/sys/kern/kern_shutdown.c:451
#2  0xffffffff804fa195 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/srcs/work10/src/sys/kern/kern_shutdown.c:758
#3  0xffffffff804fa023 in panic (fmt=0x0) at /usr/srcs/work10/src/sys/kern/kern_shutdown.c:687
#4  0xffffffff8106a1fd in assfail (a=<value optimized out>, f=<value optimized out>, l=<value optimized out>)
    at /usr/srcs/work10/src/sys/modules/opensolaris/../../cddl/compat/opensolaris/kern/opensolaris_cmn_err.c:81
#5  0xffffffff80e65161 in trim_map_segment_add (tm=0xfffff8001432ba00, start=7704576, end=7704576, txg=0)
    at /usr/srcs/work10/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/trim_map.c:219
#6  0xffffffff80e6464f in trim_map_free_locked (tm=<value optimized out>, start=<value optimized out>, end=<value optimized out>, txg=<value optimized out>)
    at /usr/srcs/work10/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/trim_map.c:319
#7  0xffffffff80e64588 in trim_map_free (vd=0xfffff8000ae3b800, offset=7704576, size=<value optimized out>, txg=0)
    at /usr/srcs/work10/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/trim_map.c:337
#8  0xffffffff80db490c in arc_hdr_destroy (hdr=0xfffff800198ac5e8) at /usr/srcs/work10/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:1825
#9  0xffffffff80db980a in l2arc_evict (dev=0xfffff8000aa2d480, distance=<value optimized out>, all=0)
    at /usr/srcs/work10/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:5004
#10 0xffffffff80dba0da in l2arc_feed_thread (dummy=<value optimized out>)
    at /usr/srcs/work10/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:5581
#11 0xffffffff804c9855 in fork_exit (callout=0xffffffff80db9b80 <l2arc_feed_thread>, arg=0x0, frame=0xfffffe023bbc4bc0)
    at /usr/srcs/work10/src/sys/kern/kern_fork.c:1018
#12 0xffffffff807773ee in fork_trampoline () at /usr/srcs/work10/src/sys/amd64/amd64/exception.S:611
#13 0x0000000000000000 in ?? ()

And the values there are:
#5  0xffffffff80e65161 in trim_map_segment_add (tm=0xfffff8001432ba00, start=7704576, end=7704576, txg=0)
    at /usr/srcs/work10/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/trim_map.c:219
219             VERIFY(start < end);
(kgdb) p start
$1 = 7704576
(kgdb) p end
$2 = 7704576
Comment 1 Andriy Gapon freebsd_committer freebsd_triage 2015-10-07 20:58:57 UTC
Created attachment 161810 [details]
proposed patch for testing

Could you please try this patch?
It would especially useful if a kernel (and zfs module) is built with INVARIANTS.

As I understand the issue is not readily reproducible?
Comment 2 Willem Jan Withagen 2015-10-07 22:28:15 UTC
(In reply to Andriy Gapon from comment #1)

You are right, the problem is hard to trigger.
This problem was only triggered once. And since the last upgrade to 10-STABLE the system is surprisingly stable...

I used to crash every few days on the "deadmansbutton" for missing a write status return. The trim-assert only happened once, while removing large ZVOLs.

I'll upgrade, and start some work on ZVOLs, to see if it happens again.

--WjW
Comment 3 Willem Jan Withagen 2015-10-08 08:48:10 UTC
(In reply to Willem Jan Withagen from comment #2)

Is the patch against 10.2 or against HEAD?
I'd expect the latter since I'm getting an anry patch...

Patching file sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c using Plan A...
Hunk #1 succeeded at 890 (offset -244 lines).
Hunk #2 failed at 2223.
Hunk #3 failed at 4602.
Hunk #4 failed at 5764.
3 out of 4 hunks failed--saving rejects to sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c.rej
Hmm...  Ignoring the trailing garbage.
done

Does it also work against 10.2 if I manually try to patch up?

--WjW
Comment 4 Andriy Gapon freebsd_committer freebsd_triage 2015-10-08 08:53:03 UTC
The patch was made against head, yes, but I think that it should also apply to the latest stable/10 (after the recent MFC wave from mav@).
You can try to make the changes by hand if the differences are not that significant.
Comment 5 Willem Jan Withagen 2015-10-11 17:20:58 UTC
Created attachment 161919 [details]
Patch agains 10.2-STABLE
Comment 6 Willem Jan Withagen 2015-10-11 17:23:37 UTC
(In reply to Willem Jan Withagen from comment #5)

Patched and rebooted the server running: 
    FreeBSD 10.2-STABLE #3 r289060M

Sofar no bad effects. Server is up for 2 days.
Comment 7 Andriy Gapon freebsd_committer freebsd_triage 2015-10-11 17:29:55 UTC
The patch looks good. Thank you for testing.
Comment 8 commit-hook freebsd_committer freebsd_triage 2015-10-30 12:00:56 UTC
A commit references this bug:

Author: avg
Date: Fri Oct 30 12:00:35 UTC 2015
New revision: 290191
URL: https://svnweb.freebsd.org/changeset/base/290191

Log:
  l2arc: do not call trim_map_free() for blocks with zero b_asize

  b_asize can be zero if the block is compressed into an empty block
  (ZIO_COMPRESS_EMPTY) and the trim code asserts that meaningless
  zero-sized trimming is not attempted.
  The logic for calling trim_map_free() is extracted into a new function
  l2arc_trim() to minimize code duplication.

  PR:		203473
  Reported by:	Willem Jan Withagen <wjw@digiware.nl>
  Tested by:	Willem Jan Withagen <wjw@digiware.nl>
  MFC after:	11 days

Changes:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
Comment 9 Willem Jan Withagen 2015-10-30 12:40:33 UTC
Patch is still holding itself.
wjw@zfs.digiware.nl> uptime
 1:39PM  up 21 days,  2:33, 1 user, load averages: 0.21, 0.13, 0.13

Thanx for the commit
Comment 10 Andriy Gapon freebsd_committer freebsd_triage 2015-10-30 12:43:09 UTC
(In reply to Willem Jan Withagen from comment #9)
Something that I should have asked earlier: what is a value of kstat.zfs.misc.arcstats.l2_compress_zeros sysctl?
Comment 11 Willem Jan Withagen 2015-10-30 17:19:38 UTC
(In reply to Andriy Gapon from comment #10)

wjw@zfs.digiware.nl> sysctl kstat.zfs.misc.arcstats.l2_compress_zeros
kstat.zfs.misc.arcstats.l2_compress_zeros: 0

There is no description with this sysctl.
So is it good or bad that it is 0.

Or in other words: What does the sysctl register?
Comment 12 Andriy Gapon freebsd_committer freebsd_triage 2015-10-30 20:47:26 UTC
(In reply to Willem Jan Withagen from comment #11)
It means that there hasn't been any buffers so far that could have provoked the bug.
Comment 13 Willem Jan Withagen 2015-10-31 10:42:04 UTC
(In reply to Andriy Gapon from comment #12)

Any suggestions as to how to get buffers into the state that they might provoke this situation?

Last time I think I had a 5Tb fully written ZVOl which I deleted.
Would that run into the buffer problem?
Comment 14 Paul Boehmer 2015-11-03 16:52:14 UTC
Running 10.2-STABLE r289928 and have been experiencing the panic detailed above.  As of this post, I show sysctl kstat.zfs.misc.arcstats.l2_compress_zeros: 228 with an uptime of 6hr 49min.  I will apply the patch and let you know if this rectifies the problem.  Should I expect the sysctl to report 0 after the patch is applied?
Comment 15 Andriy Gapon freebsd_committer freebsd_triage 2015-11-03 17:20:39 UTC
(In reply to Paul Boehmer from comment #14)
No.
Comment 16 commit-hook freebsd_committer freebsd_triage 2015-11-13 09:52:44 UTC
A commit references this bug:

Author: mav
Date: Fri Nov 13 09:52:36 UTC 2015
New revision: 290766
URL: https://svnweb.freebsd.org/changeset/base/290766

Log:
  MFC r290191 (by avg):
  l2arc: do not call trim_map_free() for blocks with zero b_asize

  b_asize can be zero if the block is compressed into an empty block
  (ZIO_COMPRESS_EMPTY) and the trim code asserts that meaningless
  zero-sized trimming is not attempted.
  The logic for calling trim_map_free() is extracted into a new function
  l2arc_trim() to minimize code duplication.

  PR:             203473
  Reported by:    Willem Jan Withagen <wjw@digiware.nl>
  Tested by:      Willem Jan Withagen <wjw@digiware.nl>

Changes:
_U  stable/10/
  stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
Comment 17 ruben 2015-11-13 17:02:17 UTC
It looks like it was easy to provoke on my system in the following scenarios:

rsync -avu from my server to a zfs fs used for backup
running a couple of linux bhyve systems backed by zvols doing heavy io (yum / apt-get upgrade)

given the patch has just been synced to RELENG_10 I'll update and try to provoke the system again to see if it is the solution for my case
Comment 18 ruben 2015-11-16 10:59:29 UTC
This has solved the issue for me too, thanks!
Comment 19 Andriy Gapon freebsd_committer freebsd_triage 2015-11-16 13:32:42 UTC
Thank you, everyone, for testing and reporting.
Comment 20 Paul Boehmer 2015-11-16 13:51:25 UTC
Uptime of 12 days on the patch (5 days max prior to patching), no issues on my end.  Excellent work, thanks.
Comment 21 commit-hook freebsd_committer freebsd_triage 2015-11-17 15:10:48 UTC
A commit references this bug:

Author: avg
Date: Tue Nov 17 15:10:25 UTC 2015
New revision: 290983
URL: https://svnweb.freebsd.org/changeset/base/290983

Log:
  MFC r290191: l2arc: do not call trim_map_free() for blocks with zero b_asize

  The code in this branch has become quite outdated, so the change had to be
  applied by hand.

  PR:		203473

Changes:
_U  stable/9/sys/
_U  stable/9/sys/cddl/contrib/opensolaris/
  stable/9/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c