Bug 218954

Summary: [ZFS] Add a sysctl to toggle zfs_free_leak_on_eio
Product: Base System
Reporter: Fabian Keil <fk>
Component: kern
Assignee: freebsd-fs (Nobody) <fs>
Status: New ---
Severity: Affects Only Me
CC: fs, pi
Priority: ---
Keywords: patch
Version: CURRENT
Hardware: Any
OS: Any
Attachments:
Description: sys/cddl: Add a sysctl to toggle zfs_free_leak_on_eio
Flags: none

Description Fabian Keil 2017-04-29 12:19:07 UTC
Created attachment 182174
sys/cddl: Add a sysctl to toggle zfs_free_leak_on_eio

The attached patch adds a sysctl to toggle zfs_free_leak_on_eio.

Setting the sysctl makes it possible to break a previously endless cycle
of ZFS collecting checksum errors for metadata.

Before setting vfs.zfs.free_leak_on_eio=1:

fk@t520 ~ $zpool status cloudia2
  pool: cloudia2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 308K in 53h23m with 3358 errors on Sun Apr 16 20:33:26 2017
config:

	NAME                  STATE     READ WRITE CKSUM
	cloudia2              ONLINE       0     0   129
	  label/cloudia2.eli  ONLINE       0     0   516

errors: 3362 data errors, use '-v' for a list

fk@t520 ~ $zpool status -v cloudia2
[..]
errors: Permanent errors have been detected in the following files:

        <0x186>:<0x28>
        <0x186>:<0x35>
        <0xffffffffffffffff>:<0x28>

The checksum counter increased every five seconds.

zfsdbg-msg output:
2017 Apr 21 11:12:43: bptree index 0: traversing from min_txg=1 bookmark -1/40/0/5120
2017 Apr 21 11:12:43: bptree index 1: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:43: bptree index 2: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:43: bptree index 3: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:43: bptree index 4: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:43: bptree index 5: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:43: bptree index 6: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:43: bptree index 7: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:48: bptree index 0: traversing from min_txg=1 bookmark -1/40/0/5120
2017 Apr 21 11:12:48: bptree index 1: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:48: bptree index 2: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:48: bptree index 3: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:48: bptree index 4: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:48: bptree index 5: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:48: bptree index 6: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:48: bptree index 7: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:53: bptree index 0: traversing from min_txg=1 bookmark -1/40/0/5120
2017 Apr 21 11:12:53: bptree index 1: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:53: bptree index 2: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:53: bptree index 3: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:53: bptree index 4: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:53: bptree index 5: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:53: bptree index 6: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:53: bptree index 7: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:58: bptree index 0: traversing from min_txg=1 bookmark -1/40/0/5120
2017 Apr 21 11:12:58: bptree index 1: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:58: bptree index 2: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:58: bptree index 3: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:58: bptree index 4: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:58: bptree index 5: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:58: bptree index 6: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:12:58: bptree index 7: traversing from min_txg=-1 bookmark 0/0/0/0
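
The repetition in the log above can be checked mechanically. The following is a minimal sketch in Python, assuming the zfsdbg-msg lines are available as plain text. The function name, the regular expression and the choice to ignore 0/0/0/0 bookmarks are illustrative assumptions and not part of the attached patch.

```python
import re
from collections import Counter

# Matches lines such as:
# "2017 Apr 21 11:12:43: bptree index 0: traversing from min_txg=1 bookmark -1/40/0/5120"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} +\d+ \d\d:\d\d:\d\d): bptree index (?P<idx>\d+): "
    r"traversing from min_txg=(?P<txg>-?\d+) bookmark (?P<bm>\S+)$"
)

def stuck_bookmarks(lines, min_repeats=3):
    """Return {(index, bookmark): count} for traversals that restart
    from the same bookmark at least min_repeats times.

    Assumption: 0/0/0/0 bookmarks are skipped as uninteresting."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("bm") != "0/0/0/0":
            counts[(int(m.group("idx")), m.group("bm"))] += 1
    return {key: n for key, n in counts.items() if n >= min_repeats}

# A few lines copied from the log above.
sample = [
    "2017 Apr 21 11:12:43: bptree index 0: traversing from min_txg=1 bookmark -1/40/0/5120",
    "2017 Apr 21 11:12:43: bptree index 1: traversing from min_txg=-1 bookmark 0/0/0/0",
    "2017 Apr 21 11:12:48: bptree index 0: traversing from min_txg=1 bookmark -1/40/0/5120",
    "2017 Apr 21 11:12:53: bptree index 0: traversing from min_txg=1 bookmark -1/40/0/5120",
    "2017 Apr 21 11:12:58: bptree index 0: traversing from min_txg=1 bookmark -1/40/0/5120",
]
print(stuck_bookmarks(sample))
```

For the excerpt above this reports that bptree index 0 restarted four times from bookmark -1/40/0/5120, i.e. the traversal never got past that position.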

fk@t520 ~ $zpool get all cloudia2
NAME      PROPERTY                       VALUE                          SOURCE
cloudia2  size                           2.98T                          -
cloudia2  capacity                       54%                            -
cloudia2  altroot                        -                              default
cloudia2  health                         ONLINE                         -
cloudia2  guid                           4205907112567218706            default
cloudia2  version                        -                              default
cloudia2  bootfs                         -                              default
cloudia2  delegation                     on                             default
cloudia2  autoreplace                    off                            default
cloudia2  cachefile                      -                              default
cloudia2  failmode                       wait                           default
cloudia2  listsnapshots                  off                            default
cloudia2  autoexpand                     off                            default
cloudia2  dedupditto                     0                              default
cloudia2  dedupratio                     1.00x                          -
cloudia2  free                           1.37T                          -
cloudia2  allocated                      1.62T                          -
cloudia2  readonly                       off                            -
cloudia2  comment                        -                              default
cloudia2  expandsize                     -                              -
cloudia2  freeing                        24.2G                          default
cloudia2  fragmentation                  32%                            -
cloudia2  leaked                         0                              default
[...]

After setting vfs.zfs.free_leak_on_eio=1:

zfsdbg-msg output:
2017 Apr 21 11:13:03: bptree index 0: traversing from min_txg=1 bookmark -1/40/0/5120
2017 Apr 21 11:13:06: freed 100000 blocks in 3050ms from free_bpobj/bptree txg 17892; err=-1
2017 Apr 21 11:13:07: bptree index 0: traversing from min_txg=1 bookmark -1/68/0/718
2017 Apr 21 11:13:08: bptree index 1: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:13:08: bptree index 2: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:13:08: bptree index 3: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:13:08: bptree index 4: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:13:08: bptree index 5: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:13:08: bptree index 6: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:13:08: bptree index 7: traversing from min_txg=-1 bookmark 0/0/0/0
2017 Apr 21 11:13:08: freed 96110 blocks in 1927ms from free_bpobj/bptree txg 17893; err=0
2017 Apr 21 11:15:33: command: zpool clear cloudia2

The checksum error counters stopped incrementing,
"freeing" dropped from 24.2G to 0 and "leaked" rose from 0 to 256M.

fk@t520 ~ $zpool get all cloudia2
NAME      PROPERTY                       VALUE                          SOURCE
cloudia2  size                           2.98T                          -
cloudia2  capacity                       53%                            -
cloudia2  altroot                        -                              default
cloudia2  health                         ONLINE                         -
cloudia2  guid                           4205907112567218706            default
cloudia2  version                        -                              default
cloudia2  bootfs                         -                              default
cloudia2  delegation                     on                             default
cloudia2  autoreplace                    off                            default
cloudia2  cachefile                      -                              default
cloudia2  failmode                       wait                           default
cloudia2  listsnapshots                  off                            default
cloudia2  autoexpand                     off                            default
cloudia2  dedupditto                     0                              default
cloudia2  dedupratio                     1.00x                          -
cloudia2  free                           1.39T                          -
cloudia2  allocated                      1.59T                          -
cloudia2  readonly                       off                            -
cloudia2  comment                        -                              default
cloudia2  expandsize                     -                              -
cloudia2  freeing                        0                              default
cloudia2  fragmentation                  32%                            -
cloudia2  leaked                         256M                           default
[...]
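
To make the change between the two "zpool get all" listings easier to see, the outputs can be diffed by property. This is a rough sketch in Python that assumes the four-column layout shown above; pool_props is a hypothetical helper name, and only the rows relevant here are reproduced as input.

```python
def pool_props(text):
    """Parse `zpool get all` output into {property: value}."""
    props = {}
    for line in text.splitlines():
        fields = line.split()
        # Expect NAME PROPERTY VALUE SOURCE; skip the header and "[...]".
        if len(fields) == 4 and fields[0] != "NAME":
            props[fields[1]] = fields[2]
    return props

before = pool_props("""\
NAME      PROPERTY                       VALUE                          SOURCE
cloudia2  free                           1.37T                          -
cloudia2  allocated                      1.62T                          -
cloudia2  freeing                        24.2G                          default
cloudia2  leaked                         0                              default
""")

after = pool_props("""\
NAME      PROPERTY                       VALUE                          SOURCE
cloudia2  free                           1.39T                          -
cloudia2  allocated                      1.59T                          -
cloudia2  freeing                        0                              default
cloudia2  leaked                         256M                           default
""")

# Collect every property whose value differs between the two listings.
changed = {
    name: (before[name], after[name])
    for name in before
    if name in after and before[name] != after[name]
}
for name, (old, new) in changed.items():
    print(f"{name}: {old} -> {new}")
```

With the values from this report it lists free, allocated, freeing and leaked as the properties that changed.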

The difference on the receiving side confirmed that some space had been recovered:

[fk@kendra ~]$ zfs list -r -p -t all dpool/ggated/cloudia2
NAME                                             USED          AVAIL          REFER  MOUNTPOINT
[...]
dpool/ggated/cloudia2@2017-04-21_10:37        9251840              -  1812645106176  -
dpool/ggated/cloudia2@2017-04-21_11:17        3950592              -  1800267106304  -
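
Since "zfs list -p" prints exact byte counts, the difference between the two snapshots can be worked out directly. A small Python calculation using the REFER values from the listing above (the variable names are only for illustration):

```python
# REFER values in bytes, copied from the `zfs list -p` output above.
refer_before = 1812645106176  # dpool/ggated/cloudia2@2017-04-21_10:37
refer_after = 1800267106304   # dpool/ggated/cloudia2@2017-04-21_11:17

delta = refer_before - refer_after
delta_gib = round(delta / 2**30, 2)
print(delta, delta_gib)  # 12377999872 11.53
```

The later snapshot therefore references roughly 11.5 GiB less than the earlier one.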

It's not obvious to me whether the 256M were really leaked,
but either way it looks like a clear win.

On another ZFS pool with the same issue, but backed by a USB disk, all
the space in "freeing" was supposedly "leaked", but it was a lot less
to begin with:

Before setting vfs.zfs.free_leak_on_eio=1:

fk@t520 /usr/src $zpool get all wde4
NAME  PROPERTY                       VALUE                          SOURCE
wde4  size                           1.81T                          -
wde4  capacity                       94%                            -
wde4  altroot                        -                              default
wde4  health                         ONLINE                         -
wde4  guid                           14402430966328721211           default
wde4  version                        -                              default
wde4  bootfs                         -                              default
wde4  delegation                     on                             default
wde4  autoreplace                    off                            default
wde4  cachefile                      -                              default
wde4  failmode                       wait                           default
wde4  listsnapshots                  off                            default
wde4  autoexpand                     off                            default
wde4  dedupditto                     0                              default
wde4  dedupratio                     1.00x                          -
wde4  free                           107G                           -
wde4  allocated                      1.71T                          -
wde4  readonly                       off                            -
wde4  comment                        -                              default
wde4  expandsize                     -                              -
wde4  freeing                        1.18M                          default
wde4  fragmentation                  23%                            -
wde4  leaked                         0                              default

After setting vfs.zfs.free_leak_on_eio=1:

fk@t520 /usr/src $zpool get all wde4
NAME  PROPERTY                       VALUE                          SOURCE
wde4  size                           1.81T                          -
wde4  capacity                       94%                            -
wde4  altroot                        -                              default
wde4  health                         ONLINE                         -
wde4  guid                           14402430966328721211           default
wde4  version                        -                              default
wde4  bootfs                         -                              default
wde4  delegation                     on                             default
wde4  autoreplace                    off                            default
wde4  cachefile                      -                              default
wde4  failmode                       wait                           default
wde4  listsnapshots                  off                            default
wde4  autoexpand                     off                            default
wde4  dedupditto                     0                              default
wde4  dedupratio                     1.00x                          -
wde4  free                           107G                           -
wde4  allocated                      1.71T                          -
wde4  readonly                       off                            -
wde4  comment                        -                              default
wde4  expandsize                     -                              -
wde4  freeing                        0                              default
wde4  fragmentation                  23%                            -
wde4  leaked                         1.18M                          default
[...]

The pool had been affected by the issue since 2015:
https://lists.freebsd.org/pipermail/freebsd-fs/2015-February/020845.html

Obtained from: ElectroBSD