Bug 211381

Summary: L2ARC degraded, repeatedly, on Samsung SSD 950 Pro nvme
Product: Base System
Reporter: braddeicide
Component: kern
Assignee: Andriy Gapon <avg>
Status: Closed FIXED
Severity: Affects Only Me
CC: braddeicide, fs
Priority: ---
Version: 10.3-RELEASE
Hardware: amd64
OS: Any

Description braddeicide 2016-07-26 12:06:33 UTC
When placing an L2ARC on a Samsung SSD 950 Pro (NVMe), it works for a while (hours to a day?), then it is marked as degraded and its capacity shows as entirely unallocated.

I'm using the same device for logs, and that appears to be functioning fine (I know this is not best practice).

Unlike the similar ticket #198242, I don't get bad checksums or I/O errors.

# zfs-stats -L
------------------------------------------------------------------------
ZFS Subsystem Report                            Tue Jul 26 20:54:51 2016
------------------------------------------------------------------------

L2 ARC Summary: (DEGRADED)
        Passed Headroom:                        1.05m
        Tried Lock Failures:                    1.68m
        IO In Progress:                         268
        Low Memory Aborts:                      114
        Free on Write:                          647.02k
        Writes While Full:                      1.42m
        R/W Clashes:                            2
        Bad Checksums:                          0
        IO Errors:                              0
        SPA Mismatch:                           1.38k

L2 ARC Size: (Adaptive)                         298.83  MiB
        Header Size:                    0.00%   0

L2 ARC Evicts:
        Lock Retries:                           37
        Upon Reading:                           0

L2 ARC Breakdown:                               44.69m
        Hit Ratio:                      5.07%   2.27m
        Miss Ratio:                     94.93%  42.43m
        Feeds:                                  1.70m

L2 ARC Buffer:
        Bytes Scanned:                          114.97  TiB
        Buffer Iterations:                      1.70m
        List Iterations:                        3.13m
        NULL List Iterations:                   58.42k

L2 ARC Writes:
        Writes Sent:                    100.00% 1.54m

# sysctl -a | grep l2
kern.cam.ctl2cam.max_sense: 252
vfs.zfs.l2c_only_size: 0
vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_noprefetch: 0
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_write_boost: 52428800
vfs.zfs.l2arc_write_max: 26214400
vfs.cache.numfullpathfail2: 0
kstat.zfs.misc.arcstats.l2_write_buffer_list_null_iter: 58416
kstat.zfs.misc.arcstats.l2_write_buffer_list_iter: 3128016
kstat.zfs.misc.arcstats.l2_write_buffer_bytes_scanned: 126433859595264
kstat.zfs.misc.arcstats.l2_write_pios: 1543758
kstat.zfs.misc.arcstats.l2_write_buffer_iter: 1705468
kstat.zfs.misc.arcstats.l2_write_full: 1424118
kstat.zfs.misc.arcstats.l2_write_not_cacheable: 1229880739
kstat.zfs.misc.arcstats.l2_write_io_in_progress: 268
kstat.zfs.misc.arcstats.l2_write_in_l2: 4397896253
kstat.zfs.misc.arcstats.l2_write_spa_mismatch: 1379
kstat.zfs.misc.arcstats.l2_write_passed_headroom: 1047813
kstat.zfs.misc.arcstats.l2_write_trylock_fail: 1676946
-> kstat.zfs.misc.arcstats.l2_compress_failures: 143662342
kstat.zfs.misc.arcstats.l2_compress_zeros: 11159
kstat.zfs.misc.arcstats.l2_compress_successes: 1038776676
kstat.zfs.misc.arcstats.l2_hdr_size: 0
kstat.zfs.misc.arcstats.l2_asize: 152834048
kstat.zfs.misc.arcstats.l2_size: 338656768
kstat.zfs.misc.arcstats.l2_io_error: 0
kstat.zfs.misc.arcstats.l2_cksum_bad: 0
kstat.zfs.misc.arcstats.l2_abort_lowmem: 114
kstat.zfs.misc.arcstats.l2_cdata_free_on_write: 4449
kstat.zfs.misc.arcstats.l2_free_on_write: 647030
kstat.zfs.misc.arcstats.l2_evict_l1cached: 3164010
kstat.zfs.misc.arcstats.l2_evict_reading: 0
kstat.zfs.misc.arcstats.l2_evict_lock_retry: 37
kstat.zfs.misc.arcstats.l2_writes_lock_retry: 2181
-> kstat.zfs.misc.arcstats.l2_writes_error: 1499766
kstat.zfs.misc.arcstats.l2_writes_done: 1543758
kstat.zfs.misc.arcstats.l2_writes_sent: 1543758
kstat.zfs.misc.arcstats.l2_write_bytes: 17569594126336
kstat.zfs.misc.arcstats.l2_read_bytes: 59325997056
kstat.zfs.misc.arcstats.l2_rw_clash: 2
kstat.zfs.misc.arcstats.l2_feeds: 1705468
kstat.zfs.misc.arcstats.l2_misses: 42457839
kstat.zfs.misc.arcstats.l2_hits: 2265463
kstat.zfs.misc.arcstats.evict_l2_skip: 712
kstat.zfs.misc.arcstats.evict_l2_ineligible: 214781087744
kstat.zfs.misc.arcstats.evict_l2_eligible: 1461307185152
kstat.zfs.misc.arcstats.evict_l2_cached: 250679670784


# zpool status
  pool: zpool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 30h28m with 0 errors on Wed Jul 20 04:06:39 2016
config:

        NAME           STATE     READ WRITE CKSUM
        zpool1         ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            da0p2.eli  ONLINE       0     0     0
            da1p2.eli  ONLINE       0     0     0
          mirror-1     ONLINE       0     0     0
            da2p2.eli  ONLINE       0     0     0
            da3p2.eli  ONLINE       0     0     0
          mirror-3     ONLINE       0     0     0
            da6p2.eli  ONLINE       0     0     0
            da7p2.eli  ONLINE       0     0     0
          mirror-4     ONLINE       0     0     0
            da4p2.eli  ONLINE       0     0     0
            da5p2.eli  ONLINE       0     0     0
        logs
          nvd0p2.eli   ONLINE       0     0     0
        cache
          nvd0p3.eli   ONLINE       0 1.21G     0

errors: No known data errors

When I remove and re-add the device, it works for a while but then gets into the state seen below, where its capacity is entirely free.

# zpool iostat -v
                  capacity     operations    bandwidth
pool           alloc   free   read  write   read  write
-------------  -----  -----  -----  -----  -----  -----
zpool1         11.6T  2.91T    625    186  11.6M  3.58M
  mirror       2.70T   945G    137     51  2.76M   727K
    da0p2.eli      -      -     39     15  2.78M   730K
    da1p2.eli      -      -     39     15  2.78M   730K
  mirror       2.86T   784G    183     34  2.86M   515K
    da2p2.eli      -      -     43     12  2.89M   517K
    da3p2.eli      -      -     43     11  2.89M   517K
  mirror       3.23T   406G    217     28  3.17M   452K
    da6p2.eli      -      -     54     10  3.27M   455K
    da7p2.eli      -      -     53     10  3.27M   455K
  mirror       2.80T   846G     87     45  2.80M   691K
    da4p2.eli      -      -     31     13  2.80M   694K
    da5p2.eli      -      -     31     13  2.80M   694K
logs               -      -      -      -      -      -
  nvd0p2.eli   14.1M  6.92G      0     34      0  1.60M
cache              -      -      -      -      -      -
  nvd0p3.eli    146M   450G    126    263  1.05M  12.7M
-------------  -----  -----  -----  -----  -----  -----
Comment 1 Andriy Gapon freebsd_committer freebsd_triage 2016-07-26 17:21:44 UTC
(In reply to braddeicide from comment #0)
I do not know why that third-party script says that the L2ARC is "degraded".
I do not see any of the OS tools reporting any degradation.
Comment 2 Andriy Gapon freebsd_committer freebsd_triage 2016-07-26 17:25:47 UTC
(In reply to Andriy Gapon from comment #1)
Actually, sorry, I now see that you have highlighted the kstat.zfs.misc.arcstats.l2_writes_error counter.
Comment 3 Andriy Gapon freebsd_committer freebsd_triage 2016-07-26 17:34:20 UTC
It would be interesting to see what errors are reported by the following dtrace one-liner:

dtrace -n 'sdt:::l2arc-iodone /args[0]->io_error != 0/ { printf("io_error = %d", args[0]->io_error); }'
Comment 4 braddeicide 2016-07-26 21:08:36 UTC
Yes, I was suspicious of the script: when I removed and re-added the device, it worked for a while, but the script kept saying degraded.  How I can tell it has gone wrong again, however, is via zpool iostat -v: all data is dropped from the cache and its capacity stays 100% free despite writes.

# dtrace -n 'sdt:::l2arc-iodone /args[0]->io_error != 0/ { printf("io_error = %d", args[0]->io_error); }'
dtrace: description 'sdt:::l2arc-iodone ' matched 1 probe
dtrace: buffer size lowered to 512k
CPU     ID                    FUNCTION:NAME
  0  57400                none:l2arc-iodone io_error = 22
  3  57400                none:l2arc-iodone io_error = 22
  1  57400                none:l2arc-iodone io_error = 22
  1  57400                none:l2arc-iodone io_error = 22
  1  57400                none:l2arc-iodone io_error = 22
  3  57400                none:l2arc-iodone io_error = 22
  3  57400                none:l2arc-iodone io_error = 22
  3  57400                none:l2arc-iodone io_error = 22
  1  57400                none:l2arc-iodone io_error = 22
  3  57400                none:l2arc-iodone io_error = 22
  0  57400                none:l2arc-iodone io_error = 22
  1  57400                none:l2arc-iodone io_error = 22
  0  57400                none:l2arc-iodone io_error = 22
  2  57400                none:l2arc-iodone io_error = 22
Comment 5 Andriy Gapon freebsd_committer freebsd_triage 2016-07-27 06:09:18 UTC
(In reply to braddeicide from comment #4)

22 is EINVAL; I wonder where it comes from.
Based on the device names, it seems that GELI is involved here.  Perhaps it could be some problem with it...

Anyway, could you please add printing of the following values to the dtrace one-liner: args[0]->io_offset and args[0]->io_size?

Also, diskinfo -v /dev/nvd0p3.eli
Thank you.
Comment 6 Andriy Gapon freebsd_committer freebsd_triage 2016-07-27 07:00:48 UTC
BTW, I don't want to jump to conclusions, but it's possible that r300039 might help.
Comment 7 braddeicide 2016-07-27 16:14:05 UTC
Yes, it's on a GELI provider.

dtrace -n 'sdt:::l2arc-iodone /args[0]->io_error != 0/ { printf("io_error = %d io_offset = %d io_size = %d", args[0]->io_error, args[0]->io_offset, args[0]->io_size); }'
dtrace: description 'sdt:::l2arc-iodone ' matched 1 probe
dtrace: buffer size lowered to 512k
CPU     ID                    FUNCTION:NAME
  2  57400                none:l2arc-iodone io_error = 22 io_offset = 0 io_size = 0
  1  57400                none:l2arc-iodone io_error = 22 io_offset = 0 io_size = 0
  0  57400                none:l2arc-iodone io_error = 22 io_offset = 0 io_size = 0
  1  57400                none:l2arc-iodone io_error = 22 io_offset = 0 io_size = 0
  0  57400                none:l2arc-iodone io_error = 22 io_offset = 0 io_size = 0
  1  57400                none:l2arc-iodone io_error = 22 io_offset = 0 io_size = 0
  0  57400                none:l2arc-iodone io_error = 22 io_offset = 0 io_size = 0
  1  57400                none:l2arc-iodone io_error = 22 io_offset = 0 io_size = 0
  2  57400                none:l2arc-iodone io_error = 22 io_offset = 0 io_size = 0
  0  57400                none:l2arc-iodone io_error = 22 io_offset = 0 io_size = 0

# diskinfo -v /dev/nvd0p3.eli
/dev/nvd0p3.eli
        4096            # sectorsize
        483117756416    # mediasize in bytes (450G)
        117948671       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        7341            # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        S2GMNCAGA01093W # Disk ident.

Oh, 4k, while this device itself is pretty dead set on being 512:

# diskinfo -v /dev/nvd0p3
/dev/nvd0p3
        512             # sectorsize
        483117760512    # mediasize in bytes (450G)
        943589376       # mediasize in sectors
        512             # stripesize
        0               # stripeoffset
        58735           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        S2GMNCAGA01093W # Disk ident.

I dropped the cache and the dtrace errors stopped, so that's the source all right.  I detached the GELI provider, re-attached it, and re-added the cache, and the errors started again.
I then dropped the cache, detached GELI, and added nvd0p3 as cache without GELI: no errors, and the cache started building.

                  capacity     operations    bandwidth
pool           alloc   free   read  write   read  write
cache              -      -      -      -      -      -
  nvd0p3        514M   449G      0      0      0      0

I recreated the GELI provider with a 512-byte sector size and re-attached it; so far no errors, but it worked for a while the first time too.  The cache is building, no dtrace errors yet; I'll keep watching.
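For reference, the workaround sequence as a dry-run shell sketch (pool and device names are the ones from this report; every command is echoed rather than executed, so nothing here is destructive as-is, and `geli init` options such as keyfiles/encryption parameters are omitted):

```shell
# Dry-run sketch: recreate the GELI cache provider with a 512-byte sector
# to match the underlying device.  Drop the echos to actually apply it.
POOL=zpool1
DEV=/dev/nvd0p3
ELI=${DEV#/dev/}.eli                 # -> nvd0p3.eli

echo zpool remove $POOL $ELI         # drop the cache vdev from the pool
echo geli detach $ELI                # tear down the old 4k provider
echo geli init -s 512 $DEV           # re-init GELI with 512-byte sectors
echo geli attach $DEV                # attach (prompts for the passphrase)
echo zpool add $POOL cache $ELI      # re-add the provider as cache
```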

# diskinfo -v /dev/nvd0p3.eli
/dev/nvd0p3.eli
        512             # sectorsize
        483117760000    # mediasize in bytes (450G)
        943589375       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        58735           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        S2GMNCAGA01093W # Disk ident.

Re: r300039, yeah, it looks useful.
------------------------------------------------------------------------
r300039 | avg | 2016-05-17 18:43:50 +1000 (Tue, 17 May 2016) | 5 lines

MFC r297848: l2arc: make sure that all writes honor ashift of a cache device

Note: no MFC stable/9 because it has become quite out of date with head,
so the merge would be quite labourious and, thus, risky.

------------------------------------------------------------------------
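The commit message above points at the likely mechanism: L2ARC was issuing writes sized to the pool's 512-byte granularity, and a provider with a 4096-byte sector (the GELI device here) rejects I/O whose offset or length is not a multiple of its sector size, which surfaces as EINVAL (22).  A minimal illustrative sketch of that alignment rule (the `check_io` helper and its messages are hypothetical, not FreeBSD code):

```shell
# Illustrative only: a GEOM provider rejects I/O whose offset or length is
# not a multiple of its sector size; GELI's rejection surfaces as EINVAL.
sectorsize=4096        # sector size of the GELI provider in this report

check_io() {
    offset=$1; length=$2
    if [ $((offset % sectorsize)) -ne 0 ] || [ $((length % sectorsize)) -ne 0 ]
    then
        echo "EINVAL: offset=$offset length=$length not ${sectorsize}-aligned"
    else
        echo "OK: offset=$offset length=$length"
    fi
}

check_io 8192 4096     # aligned write: accepted
check_io 8192 2560     # 512-byte-multiple buffer on a 4k provider: rejected
```

As the log message says, r300039 (the MFC of r297848) makes L2ARC writes honor the cache device's ashift, so the misaligned case should no longer reach the provider.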
Comment 8 Andriy Gapon freebsd_committer freebsd_triage 2016-07-28 08:24:12 UTC
(In reply to braddeicide from comment #7)
> io_error = 22 io_offset = 0 io_size = 0

This looks really strange; I am not sure why the L2ARC code would issue such zio-s.
Ah, it's actually the parent zio of all the zio-s for the individual buffers written in a pass.  If a single child zio fails, then the parent zio is marked as failed too.
Comment 9 Andriy Gapon freebsd_committer freebsd_triage 2016-07-28 08:30:43 UTC
And another one-liner:
dtrace -n 'fbt::g_io_deliver:entry /args[1] == EINVAL/ { printf("dest: %s, offset %d, length %d", stringof(args[0]->bio_to->name), args[0]->bio_offset, args[0]->bio_length); }'
Comment 10 braddeicide 2016-08-06 08:55:31 UTC
This hasn't recurred since I changed the GELI sector size to 512 to match the underlying device.

Would you like me to return to the failing configuration so we can find a better solution?
Comment 11 Andriy Gapon freebsd_committer freebsd_triage 2016-08-06 11:15:18 UTC
(In reply to braddeicide from comment #10)
If you could do that and then pick up and test r300039, that would be great!
Comment 12 braddeicide 2016-08-15 12:11:33 UTC
I've been running r300039 in the previously problematic mismatched-sector-size configuration for a week, and it looks good.

# geli is 4k, underlying device is 512
diskinfo -v /dev/nvd0p3.eli | grep sectorsize
        4096            # sectorsize
diskinfo -v /dev/nvd0p3 | grep sectorsize
        512             # sectorsize

# zfs-stats
L2 ARC Summary: (HEALTHY)

# Cache is growing
pool           alloc   free   read  write   read  write
-------------  -----  -----  -----  -----  -----  -----
cache              -      -      -      -      -      -
  nvd0p3.eli    139G   311G      3      3  40.0K   427K

# guess compress_failures weren't a valid indicator
kstat.zfs.misc.arcstats.l2_compress_failures: 11644501
kstat.zfs.misc.arcstats.l2_writes_error: 0
Comment 13 Andriy Gapon freebsd_committer freebsd_triage 2016-08-15 13:30:19 UTC
(In reply to braddeicide from comment #12)
"Compress failure" only means that the compression didn't save any space and thus a buffer eligible for the compression was placed in L2ARC without it.
Comment 14 Andriy Gapon freebsd_committer freebsd_triage 2016-08-23 07:47:00 UTC
Assuming that the problem was fixed by https://svnweb.freebsd.org/base?view=revision&revision=297848