Created attachment 185878
sys/cddl: Add a sysctl to toggle send_corrupt_data

The attached patch adds a sysctl to toggle send_corrupt_data.

Enabling it allows sending datasets with corrupted blocks, which is useful for recovering data from pools with dying disks.

Blocks filled with 0x'zfs badd bloc' are sent instead of the corrupted data. As a result, the receiving side may end up with more corrupt data than the sending side.

While it would be preferable to send the corrupt data as is (assuming the block can be read but contains flipped bits), this would probably have to happen at a different layer and currently isn't done.

The ZFSOnLinux people already added an option for this in 2013:
https://github.com/zfsonlinux/zfs/issues/1982

Usage example:

fk@t520 ~ $sudo zpool status -v wde2
  pool: wde2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 11h40m with 10 errors on Sun Jan 1 10:25:26 2017
config:

        NAME              STATE     READ WRITE CKSUM
        wde2              ONLINE       0     0     0
          label/wde2.eli  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        wde2/backup/t520/tank/home/fk@2011-07-28_04:54:/Mail/Tor/Read/15654
        wde2/backup/t520/tank/home/fk@2011-07-28_04:54:/Mail/Tor/Read/16411
        [...]

fk@t520 ~ $cat /wde2/backup/t520/tank/home/fk/.zfs/snapshot/2011-07-28_04\:54/Mail/Tor/Read/16411
cat: /wde2/backup/t520/tank/home/fk/.zfs/snapshot/2011-07-28_04:54/Mail/Tor/Read/16411: Input/output error
fk@t520 ~ $dd if=/wde2/backup/t520/tank/home/fk/.zfs/snapshot/2011-07-28_04\:54/Mail/Tor/Read/16411 bs=1
dd: /wde2/backup/t520/tank/home/fk/.zfs/snapshot/2011-07-28_04:54/Mail/Tor/Read/16411: Input/output error
0+0 records in
0+0 records out
0 bytes transferred in 0.026960 secs (0 bytes/sec)

fk@t520 ~ $sudo zfs send wde2/backup/t520/tank/home/fk@2011-07-28_04:54 | mbuffer | sudo zfs receive -v tank/corruption-test
receiving full stream of wde2/backup/t520/tank/home/fk@2011-07-28_04:54 into tank/corruption-test@2011-07-28_04:54
in @ 0.0 KiB/s, out @ 0.0 KiB/s, 1178 MiB total, buffer 0% fullwarning: cannot send 'wde2/backup/t520/tank/home/fk@2011-07-28_04:54': Input/output error

summary: 1178 MiByte in 5min 44.0sec - average of 3508 KiB/s
cannot receive new filesystem stream: checksum mismatch or incomplete stream

Setting vfs.zfs.send_corrupt_data to 1 allows sending the whole snapshot, including the corrupted blocks:

fk@t520 ~ $sudo sysctl vfs.zfs.send_corrupt_data=1
vfs.zfs.send_corrupt_data: 0 -> 1
fk@t520 ~ $sudo zfs send wde2/backup/t520/tank/home/fk@2011-07-28_04:54 | mbuffer | sudo zfs receive -v tank/corruption-test
receiving full stream of wde2/backup/t520/tank/home/fk@2011-07-28_04:54 into tank/corruption-test@2011-07-28_04:54
in @ 7193 KiB/s, out @ 7193 KiB/s, 1238 MiB total, buffer 0% full
summary: 1239 MiByte in 43.6sec - average of 28.4 MiB/s
received 1.21GB stream in 59 seconds (21.0MB/sec)
fk@t520 ~ $sudo sysctl vfs.zfs.send_corrupt_data=0
vfs.zfs.send_corrupt_data: 1 -> 0

On the receiving side the corrupted block now has a valid checksum. The 0x'zfs badd bloc' pattern isn't obvious from userland, and from ZFS's point of view the data is legit (which doesn't seem ideal either):

fk@t520 ~ $hd /tank/corruption-test/Mail/Tor/Read/15654
00000000  0c b1 dd ba f5 02 00 00  0c b1 dd ba f5 02 00 00  |................|
*
00001170
fk@t520 ~ $sudo zfs send tank/corruption-test@2011-07-28_04:54 | dd of=/dev/null bs=1m
0+198030 records in
1238+1 records out
1298720968 bytes transferred in 1.788083 secs (726320368 bytes/sec)

Obtained from: ElectroBSD
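
Since the filler pattern passes checksum verification on the receiving side, one way to spot affected files after the fact is to scan them for it from userland. The following is only a rough sketch, not part of the patch: it assumes the filler is the 64-bit value 0x2f5baddb10c stored little-endian (which matches the bytes 0c b1 dd ba f5 02 00 00 in the hd output above), and the function name, the 512-byte minimum run length, and the command-line handling are hypothetical choices made for illustration.

```python
import struct
import sys

# Assumption: the filler is the 64-bit constant 0x2f5baddb10c, little-endian.
# This matches the repeated bytes "0c b1 dd ba f5 02 00 00" seen in the hd output.
FILLER = struct.pack("<Q", 0x2F5BADDB10C)


def filler_runs(data, min_bytes=512):
    """Return (offset, length) pairs for runs of consecutive 8-byte-aligned
    FILLER words that are at least min_bytes long (hypothetical threshold)."""
    runs = []
    start = None
    end = len(data) - len(data) % 8  # ignore a trailing partial word
    for off in range(0, end, 8):
        if data[off:off + 8] == FILLER:
            if start is None:
                start = off
        elif start is not None:
            if off - start >= min_bytes:
                runs.append((start, off - start))
            start = None
    # Close a run that extends to the end of the data.
    if start is not None and end - start >= min_bytes:
        runs.append((start, end - start))
    return runs


if __name__ == "__main__":
    # Reads each file fully into memory; fine for a sketch, not for huge files.
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            for offset, length in filler_runs(f.read()):
                print(f"{path}: filler at offset {offset}, {length} bytes")
```

Run against the received copy of a file, this would list the byte ranges that consist purely of the filler pattern. Legitimate data could in principle contain the same bytes, so a hit is a hint rather than proof.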
Would you care to submit this as a pull request at https://github.com/openzfs/openzfs so it can be discussed/debated there?
The patch isn't relevant for OpenZFS upstream. On illumos-based platforms the send_corrupt_data variable is conveniently set with mdb.
sysctl vfs.zfs.send.corrupt_data is changeable, but does it work?
I've recently updated to STABLE/12 and confirmed that the vfs.zfs.send_corrupt_data sysctl works as advertised.