Bug 221909

Summary: [ZFS] Add a sysctl to toggle send_corrupt_data
Product: Base System
Reporter: Fabian Keil <fk>
Component: kern
Assignee: freebsd-fs (Nobody) <fs>
Status: Closed Overcome By Events
Severity: Affects Some People
CC: emaste, pstef
Priority: ---
Keywords: patch
Version: CURRENT
Flags: fk: mfc-stable11?
Hardware: Any
OS: Any
Attachments:
Description: sys/cddl: Add a sysctl to toggle send_corrupt_data
Flags: none

Description Fabian Keil 2017-08-29 13:09:42 UTC
Created attachment 185878
sys/cddl: Add a sysctl to toggle send_corrupt_data

The attached patch adds a sysctl to toggle send_corrupt_data.

Enabling it allows datasets with corrupted blocks to be sent,
which is useful for recovering data from pools with dying disks.

Blocks filled with 0x'zfs badd bloc' are sent instead of
the corrupted data. As a result, the receiving side may
end up with more corrupt data than the sending side.

While it would be preferable to send the corrupt data as is
(assuming the block can be read but contains flipped bits),
this would probably have to happen at a different layer and
currently isn't done.

The ZFSOnLinux people already added an option for this in 2013:
https://github.com/zfsonlinux/zfs/issues/1982

Usage example:

fk@t520 ~ $sudo zpool status -v wde2
  pool: wde2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 11h40m with 10 errors on Sun Jan  1 10:25:26 2017
config:

	NAME              STATE     READ WRITE CKSUM
	wde2              ONLINE       0     0     0
	  label/wde2.eli  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        wde2/backup/t520/tank/home/fk@2011-07-28_04:54:/Mail/Tor/Read/15654
        wde2/backup/t520/tank/home/fk@2011-07-28_04:54:/Mail/Tor/Read/16411
        [...]
fk@t520 ~ $cat /wde2/backup/t520/tank/home/fk/.zfs/snapshot/2011-07-28_04\:54/Mail/Tor/Read/16411
cat: /wde2/backup/t520/tank/home/fk/.zfs/snapshot/2011-07-28_04:54/Mail/Tor/Read/16411: Input/output error
fk@t520 ~ $dd if=/wde2/backup/t520/tank/home/fk/.zfs/snapshot/2011-07-28_04\:54/Mail/Tor/Read/16411 bs=1
dd: /wde2/backup/t520/tank/home/fk/.zfs/snapshot/2011-07-28_04:54/Mail/Tor/Read/16411: Input/output error
0+0 records in
0+0 records out
0 bytes transferred in 0.026960 secs (0 bytes/sec)
fk@t520 ~ $sudo zfs send wde2/backup/t520/tank/home/fk@2011-07-28_04:54 | mbuffer | sudo zfs receive -v tank/corruption-test
receiving full stream of wde2/backup/t520/tank/home/fk@2011-07-28_04:54 into tank/corruption-test@2011-07-28_04:54
in @  0.0 KiB/s, out @  0.0 KiB/s, 1178 MiB total, buffer   0% full
warning: cannot send 'wde2/backup/t520/tank/home/fk@2011-07-28_04:54': Input/output error
summary: 1178 MiByte in  5min 44.0sec - average of 3508 KiB/s
cannot receive new filesystem stream: checksum mismatch or incomplete stream

Setting vfs.zfs.send_corrupt_data to 1 allows the whole
snapshot to be sent despite the corrupted data:

fk@t520 ~ $sudo sysctl vfs.zfs.send_corrupt_data=1
vfs.zfs.send_corrupt_data: 0 -> 1
fk@t520 ~ $sudo zfs send wde2/backup/t520/tank/home/fk@2011-07-28_04:54 | mbuffer | sudo zfs receive -v tank/corruption-test
receiving full stream of wde2/backup/t520/tank/home/fk@2011-07-28_04:54 into tank/corruption-test@2011-07-28_04:54
in @ 7193 KiB/s, out @ 7193 KiB/s, 1238 MiB total, buffer   0% full
summary: 1239 MiByte in 43.6sec - average of 28.4 MiB/s
received 1.21GB stream in 59 seconds (21.0MB/sec)
fk@t520 ~ $sudo sysctl vfs.zfs.send_corrupt_data=0
vfs.zfs.send_corrupt_data: 1 -> 0
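
The enable, send, disable sequence above could be wrapped so that the sysctl is always switched back off, even if the send fails. This is only a minimal sketch in Python, not part of the attached patch. The helper name send_corrupt_data_enabled and the injectable run parameter are hypothetical, the sysctl name is the one used in this report (it may be spelled differently on other releases), and the code resets the value to 0 rather than restoring whatever was set before.

```python
import subprocess
from contextlib import contextmanager

# Sysctl name as used in this report; other releases may use a different name.
SYSCTL_NAME = "vfs.zfs.send_corrupt_data"


@contextmanager
def send_corrupt_data_enabled(run=subprocess.run, name=SYSCTL_NAME):
    """Set the sysctl to 1 for the duration of the block, then back to 0.

    `run` is injectable so the logic can be exercised without root
    privileges; by default it is subprocess.run.
    """
    run(["sysctl", f"{name}=1"], check=True)
    try:
        yield
    finally:
        # Always switch the tunable off again, even if the send failed.
        run(["sysctl", f"{name}=0"], check=True)


# Usage (needs root; dataset names taken from the transcript above):
#
# with send_corrupt_data_enabled():
#     subprocess.run(
#         "zfs send wde2/backup/t520/tank/home/fk@2011-07-28_04:54"
#         " | zfs receive -v tank/corruption-test",
#         shell=True, check=True)
```

Leaving the tunable at 1 permanently would hide read errors from every later send, which is why the sketch resets it unconditionally.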

On the receiving side, the corrupted block now has a valid checksum.
The 0x'zfs badd bloc' pattern isn't obvious from userland, and from
ZFS's point of view the data is legitimate (which doesn't seem ideal either):

fk@t520 ~ $hd /tank/corruption-test/Mail/Tor/Read/15654
00000000  0c b1 dd ba f5 02 00 00  0c b1 dd ba f5 02 00 00  |................|
*
00001170
fk@t520 ~ $sudo zfs send tank/corruption-test@2011-07-28_04:54 | dd of=/dev/null bs=1m
0+198030 records in
1238+1 records out
1298720968 bytes transferred in 1.788083 secs (726320368 bytes/sec)
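
Since ZFS no longer flags the substituted blocks on the receiving side, they have to be found by content. The following is a rough sketch in Python, not something the patch provides. It assumes the filler is the 64-bit value 0x2f5baddb10c stored little-endian, which matches the repeated bytes "0c b1 dd ba f5 02 00 00" in the hd output above. The function names and the 0.5 threshold are hypothetical choices for illustration.

```python
import struct

# Assumption: the filler word is 0x2f5baddb10c as a little-endian 64-bit
# integer, i.e. the bytes 0c b1 dd ba f5 02 00 00 seen in the hexdump.
BAD_BLOCK_WORD = struct.pack("<Q", 0x2F5BADDB10C)


def count_filler_words(data: bytes) -> int:
    """Count aligned 8-byte words in `data` that equal the filler pattern."""
    count = 0
    for offset in range(0, len(data) - 7, 8):
        if data[offset:offset + 8] == BAD_BLOCK_WORD:
            count += 1
    return count


def looks_substituted(path: str, threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of the file's words are filler.

    Reads the whole file into memory, which is fine for small files such
    as the mail messages in this report but not for large ones.
    """
    with open(path, "rb") as handle:
        data = handle.read()
    words = len(data) // 8
    return words > 0 and count_filler_words(data) / words >= threshold
```

Running looks_substituted over every file in the received dataset would give a list of files to restore from another source. A file that legitimately contains this byte sequence would be a false positive.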

Obtained from: ElectroBSD
Comment 1 Ed Maste freebsd_committer freebsd_triage 2017-08-30 02:54:23 UTC
Would you care to submit this as a pull request at https://github.com/openzfs/openzfs so it can be discussed/debated there?
Comment 2 Fabian Keil 2017-08-31 10:38:42 UTC
The patch isn't relevant for OpenZFS upstream.

On illumos-based platforms the send_corrupt_data variable can
conveniently be set with mdb.
Comment 3 Piotr Pawel Stefaniak freebsd_committer freebsd_triage 2021-09-30 19:02:54 UTC
sysctl vfs.zfs.send.corrupt_data is changeable, but does it work?
Comment 4 Fabian Keil 2021-11-05 10:13:21 UTC
I've recently updated to STABLE/12 and confirmed that the vfs.zfs.send_corrupt_data sysctl works as advertised.