Bug 195746 - zfs L2ARC wrong alloc/free size
Summary: zfs L2ARC wrong alloc/free size
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.1-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Andriy Gapon
Reported: 2014-12-06 14:45 UTC by Yuriy Tabolin
Modified: 2015-09-11 13:28 UTC

Description Yuriy Tabolin 2014-12-06 14:45:20 UTC
There is a server running FreeBSD 10.1-RELEASE (GENERIC kernel, amd64) with two zfs pools. The server contains two Intel 480 GB SSDs, used as ZIL (a 4 GB mirror per pool) and L2ARC (a 75+75 GB stripe per pool). Compression is enabled on some zfs datasets. After some days of operation I noticed wrong L2ARC alloc and free sizes for pool1 in zpool iostat -v, and later saw the same wrong sizes for pool2. It looks like this:

pool                                  alloc   free   read  write   read  write
------------------------------------  -----  -----  -----  -----  -----  -----
pool1                                 13,0T  34,3T     45  3,56K  3,93M  51,9M
  raidz3                              13,0T  34,3T     45  3,51K  3,93M  47,1M
    multipath/pd01                        -      -     31     97   311K  4,96M
    multipath/pd02                        -      -     31     97   311K  4,96M
    multipath/pd03                        -      -     31     97   311K  4,96M
    multipath/pd04                        -      -     31     97   311K  4,96M
    multipath/pd05                        -      -     31     97   311K  4,96M
    multipath/pd06                        -      -     31     97   311K  4,96M
    multipath/pd07                        -      -     31     97   311K  4,96M
    multipath/pd08                        -      -     31     97   311K  4,96M
    multipath/pd09                        -      -     31     97   311K  4,96M
    multipath/pd10                        -      -     31     97   311K  4,96M
    multipath/pd11                        -      -     31     97   311K  4,96M
    multipath/pd12                        -      -     31     97   311K  4,96M
    multipath/pd13                        -      -     31     97   311K  4,96M
logs                                      -      -      -      -      -      -
  mirror                               812K  3,97G      0     45      0  4,83M
    diskid/DISK-CVWL435200Y1480QGNp1      -      -      0     45      4  4,83M
    diskid/DISK-CVWL4353000F480QGNp1      -      -      0     45      4  4,83M
cache                                     -      -      -      -      -      -
  diskid/DISK-CVWL435200Y1480QGNp4     371G  16,0E      4     27   163K  3,16M
  diskid/DISK-CVWL4353000F480QGNp4     441G  16,0E      8     25   145K  2,94M
------------------------------------  -----  -----  -----  -----  -----  -----
pool2                                 10,2T  37,0T     81  1,36K  9,82M  80,2M
  raidz3                              10,2T  37,0T     81    870  9,82M  45,9M
    multipath/pd14                        -      -     21     82   903K  4,67M
    multipath/pd15                        -      -     21     82   903K  4,67M
    multipath/pd16                        -      -     21     82   903K  4,67M
    multipath/pd17                        -      -     21     82   903K  4,67M
    multipath/pd18                        -      -     21     82   904K  4,67M
    multipath/pd19                        -      -     21     82   903K  4,67M
    multipath/pd20                        -      -     21     82   903K  4,67M
    multipath/pd21                        -      -     21     82   903K  4,67M
    multipath/pd22                        -      -     21     82   903K  4,67M
    multipath/pd23                        -      -     21     82   903K  4,67M
    multipath/pd24                        -      -     21     82   903K  4,67M
    multipath/pd25                        -      -     21     82   903K  4,67M
    multipath/pd26                        -      -     21     82   903K  4,67M
logs                                      -      -      -      -      -      -
  mirror                               238M  3,74G      0    525      0  34,3M
    diskid/DISK-CVWL435200Y1480QGNp2      -      -      0    525      4  34,3M
    diskid/DISK-CVWL4353000F480QGNp2      -      -      0    525      4  34,3M
cache                                     -      -      -      -      -      -
  diskid/DISK-CVWL435200Y1480QGNp5     207G  16,0E      1     21  45,1K  2,56M
  diskid/DISK-CVWL4353000F480QGNp5     203G  16,0E      2     21  94,6K  2,60M
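
For reference, log and cache devices in a layout like this would be attached with commands of the following shape (a reconstruction from the device names in the output above, not the actual commands used on this server):

  zpool add pool1 log mirror diskid/DISK-CVWL435200Y1480QGNp1 diskid/DISK-CVWL4353000F480QGNp1
  zpool add pool1 cache diskid/DISK-CVWL435200Y1480QGNp4 diskid/DISK-CVWL4353000F480QGNp4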
 

The cache values “371G  16,0E” are abnormal; the real alloc size is 75G. After that I ran zfs-stats -L and saw a DEGRADED L2ARC and a much too large L2ARC size:

L2 ARC Summary: (DEGRADED)
        Passed Headroom:                        6.05m
        Tried Lock Failures:                    22.36m
        IO In Progress:                         2.75k
        Low Memory Aborts:                      2.86k
        Free on Write:                          5.48m
        Writes While Full:                      339.48k
        R/W Clashes:                            2.07k
        Bad Checksums:                          211.52k
        IO Errors:                              101.41k
        SPA Mismatch:                           3.16b

L2 ARC Size: (Adaptive)                         1.27    TiB
        Header Size:                    1.42%   18.56   GiB
 
kstat.zfs.misc.arcstats.l2_io_error: 101531
kstat.zfs.misc.arcstats.l2_cksum_bad: 211782
 
smartctl shows that both SSDs are fine, without any IO errors. After a reboot there are no problems for some time.
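
A reboot clears the condition because on these releases the L2ARC contents do not persist and are rebuilt from scratch. Removing and re-adding the cache devices should reset them the same way without a reboot; a minimal sketch for pool1, using the device names from the iostat output above (a workaround only, not a fix):

  zpool remove pool1 diskid/DISK-CVWL435200Y1480QGNp4 diskid/DISK-CVWL4353000F480QGNp4
  zpool add pool1 cache diskid/DISK-CVWL435200Y1480QGNp4 diskid/DISK-CVWL4353000F480QGNp4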

I found earlier reports of the same issue, attributed to L2ARC compression:
http://forums.freebsd.org/threads/l2arc-degraded.47540/
http://lists.freebsd.org/pipermail/freebsd-current/2013-October/045088.html
My problem looks like the same bug.
Comment 1 John K. Gates 2015-02-13 16:04:44 UTC
I have been discussing this issue here:

https://forums.freebsd.org/threads/l2arc-degraded.47540/

Checksum and IO errors appear after an L2ARC device fills completely with cache data, on any FreeBSD release that has L2ARC compression (http://wiki.illumos.org/display/illumos/L2ARC+Compression) enabled (9.3-RELEASE and later, and 10.0-RELEASE and later).

The following is from a 10.1-RELEASE-p5 system (svn rev 278678) running a GENERIC kernel.  /etc/make.conf contains only "CPUTYPE?=core2".  The ZFS pools are v28 (created under 9.2-RELEASE) and have NOT been upgraded for feature flags.

"zfs-stats -L" shows the L2ARC "DEGRADED" with numerous I/O and checksum errors:

root@cadence:/ # zfs-stats -L

------------------------------------------------------------------------
ZFS Subsystem Report                            Fri Feb 13 10:36:49 2015
------------------------------------------------------------------------

L2 ARC Summary: (DEGRADED)
        Passed Headroom:                        30.28m
        Tried Lock Failures:                    24.83m
        IO In Progress:                         247
        Low Memory Aborts:                      103
        Free on Write:                          54.97k
        Writes While Full:                      10.62k
        R/W Clashes:                            562
        Bad Checksums:                          1.29m
        IO Errors:                              128.28k
        SPA Mismatch:                           48.53b

L2 ARC Size: (Adaptive)                         33.51   GiB
        Header Size:                    2.24%   768.85  MiB

L2 ARC Evicts:
        Lock Retries:                           18
        Upon Reading:                           0

L2 ARC Breakdown:                               35.47m
        Hit Ratio:                      26.64%  9.45m
        Miss Ratio:                     73.36%  26.02m
        Feeds:                                  568.79k

L2 ARC Buffer:
        Bytes Scanned:                          530.43  TiB
        Buffer Iterations:                      568.79k
        List Iterations:                        36.17m
        NULL List Iterations:                   974.71k

L2 ARC Writes:
        Writes Sent:                    100.00% 136.10k

------------------------------------------------------------------------

Kernel variables showing that compression is active on the L2ARC, and that there are I/O and checksum errors:

root@cadence:/ # sysctl kstat.zfs.misc.arcstats.l2_compress_successes kstat.zfs.misc.arcstats.l2_compress_zeros kstat.zfs.misc.arcstats.l2_compress_failures kstat.zfs.misc.arcstats.l2_cksum_bad kstat.zfs.misc.arcstats.l2_io_error
kstat.zfs.misc.arcstats.l2_compress_successes: 1353514
kstat.zfs.misc.arcstats.l2_compress_zeros: 29
kstat.zfs.misc.arcstats.l2_compress_failures: 4985
kstat.zfs.misc.arcstats.l2_cksum_bad: 1290021
kstat.zfs.misc.arcstats.l2_io_error: 128275

Slight variations of this problem have been reported many times on FreeBSD, FreeNAS, and PC-BSD forums and mailing lists, and the reports are usually dismissed as a hardware problem:

https://bugs.freenas.org/issues/5347
https://forums.freenas.org/index.php?threads/l2-arc-summary-degraded.19256/
https://bugs.pcbsd.org/issues/3418
http://svnweb.freebsd.org/base?view=revision&sortby=file&revision=256889
http://lists.freebsd.org/pipermail/freebsd-current/2013-October/045088.html
http://lists.freebsd.org/pipermail/freebsd-bugs/2014-November/059261.html
http://lists.freebsd.org/pipermail/freebsd-bugs/2014-December/059376.html
http://lists.freebsd.org/pipermail/freebsd-fs/2014-October/020242.html

However, I've been able to easily duplicate this problem on two different sets of high-quality, reliable hardware (Dell PowerEdge, Intel SSD) that otherwise tests perfectly.

To duplicate, simply create a zfs pool with a small L2ARC device and exercise the pool with random I/O until the L2ARC fills.
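
A minimal reproduction sketch of the above (pool and device names are placeholders, and fio from ports is assumed as the random-I/O generator; any similar load generator works):

  # create a pool with a deliberately small cache device
  zpool create testpool da0
  zpool add testpool cache da1
  # read randomly over a working set larger than RAM until the cache device fills
  fio --name=randread --directory=/testpool --size=16g --rw=randread \
      --bs=4k --numjobs=4 --time_based --runtime=7200
  # once the device is full, the error counters begin to grow
  sysctl kstat.zfs.misc.arcstats.l2_cksum_bad kstat.zfs.misc.arcstats.l2_io_error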
Comment 2 gkontos 2015-04-13 14:58:19 UTC
I have exactly the same problem with two Intel DC S3500 Series 600GB SSDs.

------------------------------------------------------------------------
ZFS Subsystem Report				Mon Apr 13 10:56:52 2015
------------------------------------------------------------------------

L2 ARC Summary: (DEGRADED)
	Passed Headroom:			69.29m
	Tried Lock Failures:			316.43k
	IO In Progress:				86
	Low Memory Aborts:			648
	Free on Write:				1.01m
	Writes While Full:			202.57k
	R/W Clashes:				1.32k
	Bad Checksums:				17.34m
	IO Errors:				3.23m
	SPA Mismatch:				49.06m

L2 ARC Size: (Adaptive)				3.21	TiB
	Header Size:			0.19%	6.21	GiB

L2 ARC Evicts:
	Lock Retries:				146
	Upon Reading:				0

L2 ARC Breakdown:				103.52m
	Hit Ratio:			34.77%	35.99m
	Miss Ratio:			65.23%	67.53m
	Feeds:					1.28m

L2 ARC Buffer:
	Bytes Scanned:				116.31	TiB
	Buffer Iterations:			1.28m
	List Iterations:			78.60m
	NULL List Iterations:			2.68m

L2 ARC Writes:
	Writes Sent:			100.00%	620.76k
------------------------------------------------------------------------


	NAME                  STATE     READ WRITE CKSUM
	storage               ONLINE       0     0     0
	  raidz2-0            ONLINE       0     0     0
	    multipath/disk1   ONLINE       0     0     0
	    multipath/disk2   ONLINE       0     0     0
	    multipath/disk25  ONLINE       0     0     0
	    multipath/disk4   ONLINE       0     0     0
	    multipath/disk5   ONLINE       0     0     0
	    multipath/disk6   ONLINE       0     0     0
	  raidz2-1            ONLINE       0     0     0
	    multipath/disk7   ONLINE       0     0     0
	    multipath/disk8   ONLINE       0     0     0
	    multipath/disk9   ONLINE       0     0     0
	    multipath/disk26  ONLINE       0     0     0
	    multipath/disk11  ONLINE       0     0     0
	    multipath/disk12  ONLINE       0     0     0
	  raidz2-2            ONLINE       0     0     0
	    multipath/disk13  ONLINE       0     0     0
	    multipath/disk14  ONLINE       0     0     0
	    multipath/disk15  ONLINE       0     0     0
	    multipath/disk16  ONLINE       0     0     0
	    multipath/disk17  ONLINE       0     0     0
	    multipath/disk18  ONLINE       0     0     0
	  raidz2-3            ONLINE       0     0     0
	    multipath/disk19  ONLINE       0     0     0
	    multipath/disk20  ONLINE       0     0     0
	    multipath/disk21  ONLINE       0     0     0
	    multipath/disk22  ONLINE       0     0     0
	    multipath/disk23  ONLINE       0     0     0
	    multipath/disk24  ONLINE       0     0     0
	logs
	  mirror-4            ONLINE       0     0     0
	    gpt/zil0          ONLINE       0     0     0
	    gpt/zil1          ONLINE       0     0     0
	cache
	  gpt/cache0          ONLINE       0     0     0
	  gpt/cache1          ONLINE       0     0     0
	spares
	  multipath/disk3     AVAIL   
	  multipath/disk27    AVAIL   
	  multipath/disk28    AVAIL   
	  multipath/disk10    AVAIL
Comment 3 commit-hook 2015-08-24 08:11:32 UTC
A commit references this bug:

Author: avg
Date: Mon Aug 24 08:10:53 UTC 2015
New revision: 287099
URL: https://svnweb.freebsd.org/changeset/base/287099

Log:
  account for ashift when gathering buffers to be written to l2arc device

  The change that introduced the L2ARC compression support also introduced
  a bug where the on-disk size of the selected buffers could end up larger
  than the target size if the ashift is greater than 9.  This was because
  the buffer selection did not take into account the fact that the
  on-disk size could be larger than the in-memory buffer size due to
  the alignment requirements.

  At the moment b_asize is a misnomer as it does not always represent the
  allocated size: if a buffer is compressed, then the compressed size is
  properly rounded (on FreeBSD), but if the compression fails or it is not
  applied, then the original size is kept and it could be smaller than what
  ashift requires.

  For the same reasons arcstat_l2_asize and the reported used space
  on the cache device could be smaller than the actual allocated size
  if ashift > 9.  That problem is not fixed by this change.

  This change only ensures that l2ad_hand is not advanced by more
  than target_sz.  Otherwise we would overwrite active (unevicted)
  L2ARC buffers.  That problem is manifested as growing l2_cksum_bad
  and l2_io_error counters.

  This change also changes 'p' prefix to 'a' prefix in a few places
  where variables represent allocated rather than physical size.

  The resolved problem could also result in the reported allocated size
  being greater than the cache device's capacity, because of the
  overwritten buffers (more than one buffer claiming the same disk
  space).

  This change is already in ZFS-on-Linux:
  zfsonlinux/zfs@ef56b0780c80ebb0b1e637b8b8c79530a8ab3201

  PR:		198242
  PR:		195746 (possibly related)
  Reviewed by:	mahrens (https://reviews.csiden.org/r/229/)
  Tested by:	gkontos@aicom.gr (most recently)
  MFC after:	15 days
  X-MFC note:	patch does not apply as is at the moment
  Relnotes:	yes
  Sponsored by:	ClusterHQ
  Differential Revision:	https://reviews.freebsd.org/D2764
  Reviewed by:	noone (@FreeBSD.org)

Changes:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
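
To make the accounting error concrete (a worked example with illustrative numbers, not taken from the code): on a vdev with ashift=12, every write is rounded up to 4096-byte sectors, so a buffer that compresses to 512 bytes still occupies 4096 bytes on disk. If the feed thread fills an 8 MiB target_sz by summing unrounded 512-byte sizes:

  accounted: 16384 buffers x 512 B  = 8 MiB (target_sz reached)
  written:   16384 buffers x 4096 B = 64 MiB actually consumed on disk

Before the fix, l2ad_hand advanced by the larger amount, past the region evicted for this write pass, overwriting live (unevicted) buffers; those buffers later fail checksum verification, which is the growing l2_cksum_bad and l2_io_error seen in the reports above.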
Comment 4 Andriy Gapon 2015-09-11 13:28:34 UTC
Should be fixed now.
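
On a patched system, the fix can be sanity-checked by repeating the reproduction from comment 1, refilling the L2ARC, and confirming that the error counters stay flat:

  sysctl kstat.zfs.misc.arcstats.l2_cksum_bad kstat.zfs.misc.arcstats.l2_io_error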