While experimenting with star on an arm64 FreeBSD 13.0-RELEASE system on ZFS, I noticed that star would sometimes spuriously consider files to be full of data holes when in fact they do not contain any holes at all. A typical truss trace looks like this:

4677: openat(AT_FDCWD,"ved",O_RDONLY,00)         = 3 (0x3)
4677: lseek(3,0x0,SEEK_HOLE)                     = 0 (0x0)
4677: write(2,"'ved' is sparse\n",16)            = 16 (0x10)
4677: lseek(3,0x0,SEEK_HOLE)                     = 0 (0x0)
4677: lseek(3,0x0,SEEK_DATA)                     ERR#6 'Device not configured'
4677: lseek(3,0x37594,SEEK_HOLE)                 ERR#6 'Device not configured'
4677: lseek(3,0x0,SEEK_SET)                      = 0 (0x0)

lseek(3, 0, SEEK_HOLE) returns 0, indicating that the file begins with a data hole. This is unlikely, as the file ved is an ELF executable, which always begins with \0177ELF. There are also spurious ENXIO returns from further lseek calls that I do not understand. Copying the file before running star makes the problem go away:

4721: openat(AT_FDCWD,"ved",O_RDONLY,00)         = 3 (0x3)
4721: lseek(3,0x0,SEEK_HOLE)                     = 226708 (0x37594)
4721: lseek(3,0x0,SEEK_SET)                      = 0 (0x0)

It is difficult to reproduce this problem, as it seems to appear only sometimes and on random executables. I believe this is a severe problem, as wrong output from lseek(..., SEEK_HOLE, ...) may make "smart" copying and archiving programs generate corrupt copies or archives. This problem looks similar to #164445, but surely that bug has already been fixed long ago.
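To illustrate why wrong SEEK_HOLE output is dangerous, here is a minimal sketch (not star's actual code) of the SEEK_DATA/SEEK_HOLE loop that sparse-aware copiers typically use. With the results from the trace above (SEEK_HOLE at offset 0 returning 0 and SEEK_DATA at offset 0 failing with ENXIO), such a loop copies no data at all, and the destination silently ends up as one big hole:

/*
 * Minimal sketch (not star's actual code) of a SEEK_DATA/SEEK_HOLE copy
 * loop.  If the filesystem wrongly reports the whole file as a hole, the
 * loop writes nothing and the destination silently ends up full of zeroes.
 * Trailing-hole handling (ftruncate of the output to the source length) is
 * left to the caller for brevity.
 */
#include <sys/types.h>

#include <err.h>
#include <unistd.h>

static void
copy_sparse(int in, int out)
{
    char buf[65536];
    off_t data, hole = 0;
    ssize_t n;

    for (;;) {
        /* Find the next data region; ENXIO means only a hole remains. */
        data = lseek(in, hole, SEEK_DATA);
        if (data == -1)
            break;
        hole = lseek(in, data, SEEK_HOLE);
        if (hole == -1)
            err(1, "lseek(SEEK_HOLE)");

        /* Position both files at the data region, skipping the hole. */
        if (lseek(in, data, SEEK_SET) == -1 ||
            lseek(out, data, SEEK_SET) == -1)
            err(1, "lseek(SEEK_SET)");

        /* Copy the data region byte for byte. */
        while (data < hole) {
            size_t want = (size_t)(hole - data);

            if (want > sizeof(buf))
                want = sizeof(buf);
            if ((n = read(in, buf, want)) <= 0)
                err(1, "read");
            if (write(out, buf, n) != n)
                err(1, "write");
            data += n;
        }
    }
}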
Robert forgot to mention that the problem disappears if the file is copied using cp(1) first. It seems to depend on the way the linker writes that file.
We found that on our arm64 box, the following steps somewhat consistently produce a file afflicted with the issue:

1. on an arm64 FreeBSD 13.0 box, create an ARMv7 jail on a ZFS dataset
2. in that jail, download and unpack https://sourceforge.net/projects/schilytools/files/schily-2021-05-19.tar.bz2
3. enter schily-2021-05-19 and type "make" to compile
4. the file schily-2021-05-19/ved/OBJ/armv7-freebsd-clang/ved then exhibits this problem

The problem can then be reproduced by typing

star/OBJ/armv7-freebsd-clang/star -c -sparse -v -f /dev/null ved/OBJ/armv7-freebsd-clang/ved

with the observed output

'ved/OBJ/armv7-freebsd-clang/ved' is sparse
a ved/OBJ/armv7-freebsd-clang/ved 226708 bytes, 1 tape blocks
star: 1 blocks + 0 bytes (total of 10240 bytes = 10.00k).
I've tried to make a minimal example by cloning the file system on which I performed the reproduction instructions and then deleting all irrelevant files. However, it turns out the files are actually entirely full of zeroes in the cloned dataset, whereas they are fine in the original.
Does this only happen with a clone, or also with a snapshot? Since a clone is writable, it is expected to contain a copy of the node instead of the reference to it that is used with a snapshot. It would be interesting to know why some of the access operations (e.g. a vanilla read) deliver the correct content while other operations fail.

BTW: I just created a simple sparse file using echo and dd that does not seem to be affected by the problem. So it seems to be a ZFS bug that is triggered by the specific way the linker wrote the affected file. This is how I created the simple sparse file:

echo 123 > bla
dd if=Makefile conv=notrunc oseek=32k of=bla
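To double-check such a test file, a small C program like the following (a hypothetical helper, not something from this report) can print where lseek(2) thinks the first data and the first hole are. For the file created above, SEEK_DATA from 0 should return 0 and SEEK_HOLE from 0 should return an offset greater than 0:

/*
 * Hypothetical helper: report the first data and hole offsets of a file
 * as seen through lseek(2).  A regular file that starts with data must
 * not report a hole at offset 0.
 */
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    off_t data, hole;
    int fd;

    if (argc != 2)
        errx(1, "usage: %s file", argv[0]);
    if ((fd = open(argv[1], O_RDONLY)) == -1)
        err(1, "open");
    data = lseek(fd, 0, SEEK_DATA);  /* -1 with ENXIO: no data at all */
    hole = lseek(fd, 0, SEEK_HOLE);  /* end of file also counts as a hole */
    printf("first data at %jd, first hole at %jd\n",
        (intmax_t)data, (intmax_t)hole);
    return (0);
}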
Let's take a look at the layout of the good vs bad files. First, do "ls -li" to get the inode number of the bad file (ved) and the good file (ved.copy). Then do "sudo zdb -ddddd <pool>/<dataset> <inode>". That will show exactly where ZFS thinks there are holes.
(In reply to Alan Somers from comment #5)

If I try to use zdb, it gives ENOENT for whatever dataset I try to use. Looks like this:

# zfs list tau/usr/home
NAME           USED  AVAIL  REFER  MOUNTPOINT
tau/usr/home  2.21G   176G  1.05G  /usr/home
# zdb -ddddd tau/usr/home 152278
failed to hold dataset 'tau/usr/home': No such file or directory
zdb: can't open 'tau/usr/home': No such file or directory

Something seems to be terribly wrong here.
I think I've seen zdb do that when there is no ZFS cache file. Did you disable the ZFS cache file?
(In reply to Alan Somers from comment #7)

Hi Alan,

The cache file /etc/zfs/zpool.cache is there and seems fine. Here is what zdb -C outputs:

# zdb -C
tau:
    version: 5000
    name: 'tau'
    state: 0
    txg: 439
    pool_guid: 2019163837599314243
    errata: 0
    hostname: 'eta'
    com.delphix:has_per_vdev_zaps
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 2019163837599314243
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 12664617697930720246
            path: '/dev/da0p3'
            whole_disk: 1
            metaslab_array: 131
            metaslab_shift: 34
            ashift: 12
            asize: 1991530512384
            is_log: 0
            DTL: 302
            create_txg: 4
            com.delphix:vdev_zap_leaf: 129
            com.delphix:vdev_zap_top: 130
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

Though interestingly, this does not match the pool label:

# zdb -l da0s2a
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'tau'
    state: 0
    txg: 112629
    pool_guid: 12522957766422004777
    errata: 0
    hostname: 'tau.fuz.su'
    top_guid: 16278934493555373134
    guid: 16278934493555373134
    vdev_children: 1
    vdev_tree:
        type: 'disk'
        id: 0
        guid: 16278934493555373134
        path: '/dev/da0s2a'
        whole_disk: 1
        metaslab_array: 65
        metaslab_shift: 31
        ashift: 12
        asize: 231409188864
        is_log: 0
        create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    labels = 0 1 2 3

Could the cache file perhaps be stale? If yes, how do I fix this?
I doubt it. The cache file is probably fine. zdb must be failing for some other reason. If you're familiar with dtrace, you might be able to find out why.
Ok, it was a stale zpool cache file. Regenerated that. Now zdb spits out for the defective file:

# zdb -ddddd tau/usr/home 158868
Dataset tau/usr/home [ZPL], ID 166, cr_txg 156, 1.02G, 73710 objects, rootbp DVA[0]=<0:1480e5000:1000> DVA[1]=<0:34b8519000:1000> [L0 DMU objset] fletcher4 uncompressed unencrypted LE contiguous unique double size=1000L/1000P birth=754593L/754593P fill=73710 cksum=fdbb2e5f4:2c554aff465c:42583346179396:463202c0625bcf73

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    158868    1   128K   128K      0     512   128K    0.00  ZFS plain file
                                              168   bonus  System attributes
        dnode flags: USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /fuz/src/schily-2021-05-19/psmake/smake
        uid     1001
        gid     0
        atime   Mon May 31 17:27:05 2021
        mtime   Mon May 31 17:27:05 2021
        ctime   Mon May 31 17:27:05 2021
        crtime  Mon May 31 17:27:05 2021
        gen     754481
        mode    100755
        size    172444
        parent  125362
        links   1
        pflags  40800000104
Indirect blocks:

And for a copy of that file:

# zdb -ddddd tau/usr/home 165839
Dataset tau/usr/home [ZPL], ID 166, cr_txg 156, 1.02G, 73711 objects, rootbp DVA[0]=<0:1481bd000:1000> DVA[1]=<0:34b85d5000:1000> [L0 DMU objset] fletcher4 uncompressed unencrypted LE contiguous unique double size=1000L/1000P birth=754604L/754604P fill=73711 cksum=f5c0e13c8:2bd5f9eb9464:4306ac91d2ee06:48548583efa66ed7

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    165839    2   128K   128K    72K     512   256K  100.00  ZFS plain file
                                              168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 1
        path    /fuz/smake
        uid     1001
        gid     1001
        atime   Mon May 31 17:37:23 2021
        mtime   Mon May 31 17:37:23 2021
        ctime   Mon May 31 17:37:23 2021
        crtime  Mon May 31 17:37:23 2021
        gen     754604
        mode    100755
        size    172444
        parent  2
        links   1
        pflags  40800000104
Indirect blocks:
               0 L1  0:279c393000:1000 20000L/1000P F=2 B=754604/754604 cksum=89ddba81fd:1f641e2304c6e:3959b7adc4672c0:60f134ca72c7adee
               0  L0 0:1aec965000:b000 20000L/b000P F=1 B=754604/754604 cksum=162256381a3f:1e6cd61da6d0818:ab4c29600d5f7235:84ff2b8da3762f93
           20000  L0 0:1aec959000:5000 20000L/5000P F=1 B=754604/754604 cksum=84074ce4c6c:5f268b4e9e1cc5:83823b486e8a59f7:97b671e2edc43809

                segment [0000000000000000, 0000000000040000) size  256K

There are two more things I noticed: (a) the defective file is written through a nullfs mount and (b) the file is "cured" after a while, i.e. it no longer pretends to be sparse. This is independent of sync calls, but seems to happen after some hours of letting the FS idle.
Now we're getting somewhere.

* zdb shows a file that is completely sparse; it contains no data
* nullfs is involved

Maybe your newly written file is getting cached by nullfs in the VM cache and not flushed to ZFS until "some hours of letting the FS idle". And maybe nullfs forwards the FIOSEEKHOLE ioctl to ZFS without first flushing the file. That would explain why you're getting bad results.

To test this theory, try writing the bad file again, then do "fsync <path/to/file>" and then examine it with zdb and star. If it's dense, then that supports my theory.
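A compressed version of that test as a single C program might look like the sketch below (the path is a placeholder for a file on the nullfs mount). If SEEK_HOLE reports the file as dense once fsync(2) has run, but reports a hole at 0 when the fsync call is commented out, that supports the dirty-page theory:

/*
 * Sketch of the suggested test; /nullfs/mnt/testfile is a placeholder for
 * a path on the nullfs mount.  Write data, fsync it, then ask where the
 * first hole is.  Commenting out the fsync() call should bring back the
 * bogus "hole at offset 0" answer if unflushed dirty pages are the culprit.
 */
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char path[] = "/nullfs/mnt/testfile";  /* placeholder */
    char buf[8192];
    off_t hole;
    int fd;

    memset(buf, 'x', sizeof(buf));  /* all data, no holes */
    if ((fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644)) == -1)
        err(1, "open");
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        err(1, "write");
    if (fsync(fd) == -1)  /* the step under test */
        err(1, "fsync");
    hole = lseek(fd, 0, SEEK_HOLE);
    printf("first hole at %jd (expected %zu for a dense file)\n",
        (intmax_t)hole, sizeof(buf));
    return (0);
}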
(In reply to Alan Somers from comment #11) Yes, indeed. An fsync on the nullfs-mounted file makes it become non-sparse in ZFS immediately. Hope this helps!
This might already be fixed by https://github.com/freebsd/freebsd-src/commit/42881526d401e7a9c09241e392b7ffa18cfe11d6 . Do you have the ability to test on 14.0-CURRENT?
(In reply to Alan Somers from comment #13) Unfortunately I don't.
What about amd64? Can you test on 13.0-RELEASE with amd64? If you can provide full steps to reproduce, including the nullfs mount options, I can test it on 14.0-CURRENT amd64.
(In reply to Alan Somers from comment #15) I can test on amd64; one second please.
So yeah, the same approach also works on amd64. To reproduce:

1. install archivers/star
2. make a zfs dataset and nullfs mount it somewhere
3. in the nullfs mount, unpack the schilytools
4. build the schilytools by typing "make"
5. type

star -c -v -f /dev/null -sparse . 2>&1 | grep sparse

to find spurious sparse files.
Please note that the apparent sparse files are not created on the base system, which identifies as aarch64. If I compile schilytools in an armv7 jail that runs on the aarch64 raspi installation, only some of the binaries linked for armv7 are identified as apparently sparse files. So there seem to be some unknown constraints that trigger the problem.

My guess is that the absence of an apparent problem on a different platform does not verify that the problem has been fixed, especially since the affected files are linker output that depends on the platform.

Since the change in nullfs is only one line, Robert, can you compile a modified kernel?
I was able to reproduce the problem on 13.0-RELEASE, but not 14.0-CURRENT. I'm trying stable/13 next.
I've confirmed that 42881526d401e7a9c09241e392b7ffa18cfe11d6 fixes the problem on stable/13. Now the question is whether secteam would accept this as an errata and publish a fix for 13.0-RELEASE. Data corruption is a pretty bad problem, so I think they should. I'll take it up with them.
Does this affect only ZFS or UFS too?
I can't reproduce it with UFS.
kib (CCed) believes that 42881526d401e7a9c09241e392b7ffa18cfe11d6 is not a full solution, because it only flushes pages on close. The OP's test case does close every file, but the bug can persist if a file remains open when somebody calls lseek(..., SEEK_HOLE) on it. kib thinks that a better solution would be for ZFS's SEEK_HOLE implementation to check the page queue, but we don't have a working patch yet.
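A sketch of the scenario kib is worried about follows (the path is a placeholder, and whether the stale answer is actually observed depends on timing): the writer never closes its descriptor, so the close-time flush added by the nullfs commit never runs, and a concurrent SEEK_HOLE query may still see the not-yet-flushed range as a hole.

/*
 * Sketch of the remaining case: the writing descriptor stays open, so the
 * flush-on-close added by the nullfs fix never happens.  A second
 * descriptor (or another process) that asks for SEEK_HOLE meanwhile may
 * still get a stale answer.  Paths and sizes are placeholders.
 */
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char path[] = "/nullfs/mnt/testfile";  /* placeholder */
    char buf[8192];
    off_t hole;
    int rfd, wfd;

    memset(buf, 'x', sizeof(buf));
    if ((wfd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644)) == -1)
        err(1, "open for writing");
    if (write(wfd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        err(1, "write");
    /* wfd is deliberately left open: no close(), hence no flush. */

    if ((rfd = open(path, O_RDONLY)) == -1)
        err(1, "open for reading");
    hole = lseek(rfd, 0, SEEK_HOLE);
    printf("first hole at %jd (0 would be wrong for an 8 KiB data file)\n",
        (intmax_t)hole);
    return (0);
}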
(In reply to Alan Somers from comment #23) Any progress on this one? Or a workaround at least?
It's like I said in my last post. Revision 42881526d401e7a9c09241e392b7ffa18cfe11d6 fixes the specific problem reported here, but it's not a fully general solution.
(In reply to Alan Somers from comment #25) Thanks. Is there any ongoing work to fix the remaining cases, too?
^Triage: Bump Importance due to data loss. Request feedback from secteam@ on an EN and from kib@ for a description of what a root-cause permanent resolution might look like, so we can track that.
If someone sends secteam@ a draft erratum and a list of commits to merge from stable/* to releng/*, we can certainly include this with our next batch. https://www.freebsd.org/security/errata-template.txt
The ZFS fix for the bug has been released in OpenZFS 2.1.2 and 2.0.7 releases. OpenZFS HEAD has been merged to 13.0-STABLE. It is still not evident to me how we laymen should proceed with securing our 13.0-RELEASE installations. Should we wait for releng/13.0 to merge OpenZFS 2.0.7?
FYI the dates in https://www.freebsd.org/security/advisories/FreeBSD-EN-22:11.zfs.asc are in the future (v1.0 2022-05-15 -- Initial release etc)
What's the current state of this? Is it fixed? Merged into all current releases? Can this PR be closed?
(In reply to Robert Clausecker from comment #31) Phil Krylov, would you please link to the OpenZFS change you alluded to? Robert, kib's partial solution has been merged into all current releases.
(In reply to Alan Somers from comment #32) I was referring to https://github.com/openzfs/zfs/issues/11900 and the related PRs: #12724, #12745, #12746. Looking at this issue once again, I see it mentions nullfs, so it's not clear whether the problem has been fixed by the changes on the OpenZFS side.
Yes, #12724 looks like the correct fix, and kib reviewed it too. It's been merged into stable/13, so we can close this bug.