Bug 256205 - ZFS: data corruption with SEEK_HOLE/SEEK_DATA on dirty files written through nullfs
Summary: ZFS: data corruption with SEEK_HOLE/SEEK_DATA on dirty files written through ...
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 13.0-RELEASE
Hardware: Any Any
: --- Affects Many People
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-05-27 16:52 UTC by Robert Clausecker
Modified: 2021-06-22 00:35 UTC (History)
11 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Robert Clausecker 2021-05-27 16:52:21 UTC
While experimenting with star on an arm64 FreeBSD 13.0-RELEASE system on ZFS, I noticed that star would some times spuriously consider files to be full of data holes when they do in fact not have any holes at all.  A typical truss trace looks like this:

 4677: openat(AT_FDCWD,"ved",O_RDONLY,00)	 = 3 (0x3)
 4677: lseek(3,0x0,SEEK_HOLE)			 = 0 (0x0)
 4677: write(2,"'ved' is sparse\n",16)		 = 16 (0x10)
 4677: lseek(3,0x0,SEEK_HOLE)			 = 0 (0x0)
 4677: lseek(3,0x0,SEEK_DATA)			 ERR#6 'Device not configured'
 4677: lseek(3,0x37594,SEEK_HOLE)		 ERR#6 'Device not configured'
 4677: lseek(3,0x0,SEEK_SET)			 = 0 (0x0)


lseek(3, 0, SEEK_HOLE) returns 0, indicating that the file begins with a data hole.  This is unlikely as the file ved is an ELF executable which always begins with \0177ELF.  There are also spurious ENXIO returns from further lseek calls I do not understand.  Copying the file before running star makes the problem go away:

 4721: openat(AT_FDCWD,"ved",O_RDONLY,00)	 = 3 (0x3)
 4721: lseek(3,0x0,SEEK_HOLE)			 = 226708 (0x37594)
 4721: lseek(3,0x0,SEEK_SET)			 = 0 (0x0)

It is difficult to reproduce this problem as it seems to only appear some times and on random executables.

I believe this is a severe problem as wrong output from lseek(..., SEEK_HOLE, ...) may make “smart” copying and archiving programs generate corrupt copies/archives.

This problem looks similar to #164445, but surely that bug has already been fixed long ago.
Comment 1 Jörg Schilling 2021-05-27 16:57:35 UTC
Robert forgot to mention that the problem disappears if the file was copied 
using cp(1) before. It seems to depend on the way the linker writes that file.
Comment 2 Robert Clausecker 2021-05-27 17:35:19 UTC
We found that on our arm64 box, the following steps somewhat consistently produce a file afflicted with the issue:

1. on an arm64 FreeBSD 13.0 box, create an ARMv7 jail on a ZFS dataset
2. in that jail, download and unpack https://sourceforge.net/projects/schilytools/files/schily-2021-05-19.tar.bz2
3. enter schily-2021-05-19 and type "make" to compile
4. the file schily-2021-05-19/ved/OBJ/aarch64-freebsd-clang/ved then exhibits this problem

The problem can then be reproduced by typing

star/OBJ/armv7-freebsd-clang/star -c -sparse -v -f /dev/null ved/OBJ/armv7-freebsd-clang/ved

with the observed output

'ved/OBJ/armv7-freebsd-clang/ved' is sparse
a ved/OBJ/armv7-freebsd-clang/ved 226708 bytes, 1 tape blocks
star: 1 blocks + 0 bytes (total of 10240 bytes = 10.00k).
Comment 3 Robert Clausecker 2021-05-29 19:42:32 UTC
I've tried to make a minimal example by cloning the file system on which I performed the reproduction instructions, and then deleting all irrelevant files.  However, it turns out the files are actually entirely full of zeroes in the cloned dataset whereas they are fine in the original
Comment 4 Jörg Schilling 2021-05-29 21:12:09 UTC
Does this only happen with a clone or also with a snapshot?

Since a clone is writable, it is expected to contain a copy of the
node instead of a reference to it that is used with a snapshot.

It would be interesting to know why some of the access operations (e.g
for a vanilla read) work to deliver correct content but other 
operations fail.

BTW: I just created a simple sparse file using echo and dd that does not
seem to be affected by the problem. So it seems to be a zfs bug that
is triggered by the specific way the linker did write the file that
is affected.

This is how I created the simple sparse:

      echo 123 > bla
      dd if=Makefile conv=notrunc oseek=32k of=bla
Comment 5 Alan Somers freebsd_committer 2021-05-31 13:43:43 UTC
Let's take a look at the layout of the good vs bad files.  First, do "ls -li" to get the inode number of the bad file (ved) and the good file (ved.copy).  Then do "sudo zdb -ddddd <pool>/<dataset> <inode>".  That will show exactly where ZFS thinks there are holes.
Comment 6 Robert Clausecker 2021-05-31 14:11:55 UTC
(In reply to Alan Somers from comment #5)

If I try to use zdb, it gives ENOENT for whatever dataset I try to use.  Looks like this:

# zfs list tau/usr/home
NAME           USED  AVAIL     REFER  MOUNTPOINT
tau/usr/home  2.21G   176G     1.05G  /usr/home
# zdb -ddddd tau/usr/home 152278
failed to hold dataset 'tau/usr/home': No such file or directory
zdb: can't open 'tau/usr/home': No such file or directory

Something seems to be terribly wrong here.
Comment 7 Alan Somers freebsd_committer 2021-05-31 14:35:04 UTC
I think I've seen zdb do that when there is no ZFS cache file.  Did you disable the ZFS cache file?
Comment 8 Robert Clausecker 2021-05-31 15:16:56 UTC
(In reply to Alan Somers from comment #7)

Hi Alan,

The cache file /etc/zfs/zpool.cache is there and seems fine.  Here is what zdb -C outputs:

# zdb -C
tau:
    version: 5000
    name: 'tau'
    state: 0
    txg: 439
    pool_guid: 2019163837599314243
    errata: 0
    hostname: 'eta'
    com.delphix:has_per_vdev_zaps
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 2019163837599314243
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 12664617697930720246
            path: '/dev/da0p3'
            whole_disk: 1
            metaslab_array: 131
            metaslab_shift: 34
            ashift: 12
            asize: 1991530512384
            is_log: 0
            DTL: 302
            create_txg: 4
            com.delphix:vdev_zap_leaf: 129
            com.delphix:vdev_zap_top: 130
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

Though interestingly, this does not match the pool label:

# zdb -l da0s2a
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'tau'
    state: 0
    txg: 112629
    pool_guid: 12522957766422004777
    errata: 0
    hostname: 'tau.fuz.su'
    top_guid: 16278934493555373134
    guid: 16278934493555373134
    vdev_children: 1
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 16278934493555373134
        path: '/dev/da0s2a'
        whole_disk: 1
        metaslab_array: 65
        metaslab_shift: 31
        ashift: 12
        asize: 231409188864
        is_log: 0
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 0 1 2 3 


Could the cache file perhaps be stale?  If yes, how do I fix this?
Comment 9 Alan Somers freebsd_committer 2021-05-31 15:21:45 UTC
I doubt it.  The cache file is probably fine.  zdb must be failing for some other reason.  If you're familiar with dtrace, you can might be able to find out why.
Comment 10 Robert Clausecker 2021-05-31 15:40:05 UTC
Ok, it was a stale zpool cache file.  Regenerated that.  Now zdb spits out for the defective file:

# zdb -ddddd tau/usr/home 158868 
Dataset tau/usr/home [ZPL], ID 166, cr_txg 156, 1.02G, 73710 objects, rootbp DVA[0]=<0:1480e5000:1000> DVA[1]=<0:34b8519000:1000> [L0 DMU objset] fletcher4 uncompressed unencrypted LE contiguous unique double size=1000L/1000P birth=754593L/754593P fill=73710 cksum=fdbb2e5f4:2c554aff465c:42583346179396:463202c0625bcf73

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    158868    1   128K   128K      0     512   128K    0.00  ZFS plain file
                                               168   bonus  System attributes
	dnode flags: USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
	dnode maxblkid: 0
	path	/fuz/src/schily-2021-05-19/psmake/smake
	uid     1001
	gid     0
	atime	Mon May 31 17:27:05 2021
	mtime	Mon May 31 17:27:05 2021
	ctime	Mon May 31 17:27:05 2021
	crtime	Mon May 31 17:27:05 2021
	gen	754481
	mode	100755
	size	172444
	parent	125362
	links	1
	pflags	40800000104
Indirect blocks:

And for a copy of that file:

# zdb -ddddd tau/usr/home 165839
Dataset tau/usr/home [ZPL], ID 166, cr_txg 156, 1.02G, 73711 objects, rootbp DVA[0]=<0:1481bd000:1000> DVA[1]=<0:34b85d5000:1000> [L0 DMU objset] fletcher4 uncompressed unencrypted LE contiguous unique double size=1000L/1000P birth=754604L/754604P fill=73711 cksum=f5c0e13c8:2bd5f9eb9464:4306ac91d2ee06:48548583efa66ed7

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    165839    2   128K   128K    72K     512   256K  100.00  ZFS plain file
                                               168   bonus  System attributes
	dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
	dnode maxblkid: 1
	path	/fuz/smake
	uid     1001
	gid     1001
	atime	Mon May 31 17:37:23 2021
	mtime	Mon May 31 17:37:23 2021
	ctime	Mon May 31 17:37:23 2021
	crtime	Mon May 31 17:37:23 2021
	gen	754604
	mode	100755
	size	172444
	parent	2
	links	1
	pflags	40800000104
Indirect blocks:
               0 L1  0:279c393000:1000 20000L/1000P F=2 B=754604/754604 cksum=89ddba81fd:1f641e2304c6e:3959b7adc4672c0:60f134ca72c7adee
               0  L0 0:1aec965000:b000 20000L/b000P F=1 B=754604/754604 cksum=162256381a3f:1e6cd61da6d0818:ab4c29600d5f7235:84ff2b8da3762f93
           20000  L0 0:1aec959000:5000 20000L/5000P F=1 B=754604/754604 cksum=84074ce4c6c:5f268b4e9e1cc5:83823b486e8a59f7:97b671e2edc43809

		segment [0000000000000000, 0000000000040000) size  256K

There's two more things I noticed: (a) the defective file is written through a nullfs mount and (b) the file is "cured" after a while, i.e. no longer pretends to be sparse.  This is independent of sync calls, but seems to happen after some hours of letting the FS idle.
Comment 11 Alan Somers freebsd_committer 2021-05-31 15:50:18 UTC
Now we're getting somewhere.

* zdb shows a file that is completely sparse; it contains no data
* nullfs is involved.

Maybe your newly written file is getting cached by nullfs in the VM cache and not flushed to ZFS until "some hours of letting the FS idle".  And maybe nullfs forwards the FIOSEEKHOLE ioctl to ZFS without first flushing the file.  That would explain why you're getting bad results.

To test this theory, try writing the bad file again, then do "fsync <path/to/file>" and then examine it with zdb and star.  If it's dense, then that supports my theory.
Comment 12 Robert Clausecker 2021-05-31 15:58:46 UTC
(In reply to Alan Somers from comment #11)

Yes, indeed.  An fsync on the nullfs-mounted file makes it become non-sparse in ZFS immediately.  Hope this helps!
Comment 13 Alan Somers freebsd_committer 2021-05-31 16:27:30 UTC
This might be already fixed by https://github.com/freebsd/freebsd-src/commit/42881526d401e7a9c09241e392b7ffa18cfe11d6 .  Do you have the ability to test on 14.0-CURRENT ?
Comment 14 Robert Clausecker 2021-05-31 16:30:44 UTC
(In reply to Alan Somers from comment #13)

Unfortunately I don't.
Comment 15 Alan Somers freebsd_committer 2021-05-31 16:33:30 UTC
What about amd64?  Can you test on 13.0-RELEASE with amd64?  If you can provide full steps to reproduce, including the nullfs mount options, I can test it on 14.0-CURRENT amd64.
Comment 16 Robert Clausecker 2021-05-31 16:35:11 UTC
(In reply to Alan Somers from comment #15)

I can test on amd64; one second please.
Comment 17 Robert Clausecker 2021-05-31 17:28:32 UTC
So yeah, the same approach also works on amd64.  To reproduce:

1. install archivers/star
2. make a zfs dataset and nullfs mount it somewhere.
3. in the nullfs mount, unpack the schilytools
4. build the schilytools by typing "make"
5. type

    star -c -v -f /dev/null -sparse . 2>&1 | grep sparse

   to find spurious sparse files.
Comment 18 Jörg Schilling 2021-05-31 18:57:44 UTC
Please note that the apparent sparse files are not
created on the basic system that identifies as aarch64.

If I compile schilytools in an armv7 jail that runs on
the aarch64 raspi installation, only some of the binaries
linked for armv7 are identified as apparent sparse files.

So it seems that there are some unknown constraints that
trigger the problem. My guess is that the absence of an
apparent problem on a different platform does not verify
that the problem has been fixed. In special since the 
affected files are linker output that depends on the
platform.

Since the change in nullfs is only one line, Robert can
you compile a modified kernel?
Comment 19 Alan Somers freebsd_committer 2021-05-31 19:04:25 UTC
I was able to reproduce the problem on 13.0-RELEASE, but not 14.0-CURRENT.  I'm trying stable/13 next.
Comment 20 Alan Somers freebsd_committer 2021-05-31 19:53:18 UTC
I've confirmed that 42881526d401e7a9c09241e392b7ffa18cfe11d6 fixes the problem on stable/13.  Now the question is whether secteam would accept this as an errata and publish a fix for 13.0-RELEASE.  Data corruption is a pretty bad problem, so I think they should.  I'll take it up with them.
Comment 21 Christos Chatzaras 2021-05-31 20:17:51 UTC
Does this affect only ZFS or UFS too?
Comment 22 Alan Somers freebsd_committer 2021-05-31 20:33:46 UTC
I can't reproduce it with UFS.
Comment 23 Alan Somers freebsd_committer 2021-06-02 15:26:42 UTC
kib (CCed) believes that 42881526d401e7a9c09241e392b7ffa18cfe11d6 is not a full solution, because it only flushes pages on close.  The OP's test case does close every file, but the bug may persist if the files were to remain open when somebody calls lseek(.., SEEK_HOLE).  kib thinks that a better solution would be for ZFS's SEEK_HOLE implementation to check the page queue, but we don't have a working patch yet.