Bug 275308

Summary: EN tracking issue for potential ZFS data corruption
Product: Base System    Reporter: Ed Maste <emaste>
Component: kern    Assignee: Ed Maste <emaste>
Status: Closed FIXED    
Severity: Affects Only Me CC: adrik, alex-freebsd-bugs, alster, bevan, chris, crest, cryx-ports, dch, diafoirus, dpetrov67, edgeman, eduardo, forums, freebsd8593, freebsd, garga, j.kelly.hays, jamie, janm, john, kreeblah, kungfujesus06, markj, mickael.maillot, mike, ml, mm, mondo.debater_0q, nagy.attila, netchild, olce, olivierw1+bugzilla-freebsd, oxy, pete, pete, rashey, rob2g2-freebsd, robn, ruben, terry-freebsd, thierry, topical, trashcan, tsuroerusu, vvd, yoitsmeremember+fbsd, zi
Priority: ---    
Version: 14.0-RELEASE   
Hardware: Any   
OS: Any   
Bug Depends on:    
Bug Blocks: 275215    

Description Ed Maste freebsd_committer freebsd_triage 2023-11-24 14:51:47 UTC
See https://github.com/openzfs/zfs/issues/15526#issuecomment-1823737998 for details

> As discussed in this thread, setting vfs.zfs.dmu_offset_next_sync (FreeBSD) /
> zfs_dmu_offset_next_sync (Linux) to 0 does not trigger the issue here. Having
> this value set to 1 triggers the issue.

Tracking PR for potential EN to change the default.
Comment 1 Ed Maste freebsd_committer freebsd_triage 2023-11-24 15:19:50 UTC
Setting the sysctl vfs.zfs.dmu_offset_next_sync=0 may serve as a short-term workaround, and the fix appears to be in https://github.com/openzfs/zfs/pull/15571. There should be more clarity over the next couple of days.
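For anyone who wants to try the workaround in the meantime, a minimal sketch of the runtime change (the value shown for the query assumes an unpatched system still at the default of 1):

# sysctl vfs.zfs.dmu_offset_next_sync
vfs.zfs.dmu_offset_next_sync: 1
# sysctl vfs.zfs.dmu_offset_next_sync=0
vfs.zfs.dmu_offset_next_sync: 1 -> 0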
Comment 2 Mark Johnston freebsd_committer freebsd_triage 2023-11-24 15:22:54 UTC
Do we know whether changing vfs.zfs.dmu_offset_next_sync to 0 is likely to have much performance impact?  It would be useful to document if so.
Comment 3 mike 2023-11-24 15:27:47 UTC
Is this really only RELENG_14? The discussion on GitHub implies that the underlying bug has been around for a while and that block cloning only made it more likely to hit.
Comment 4 Ed Maste freebsd_committer freebsd_triage 2023-11-24 15:33:18 UTC
(In reply to mike from comment #3)
This is a fast-moving issue right now and details are becoming clearer over time. It seems likely that this will result in an EN for 13.2 and 14.0.
Comment 5 Ed Maste freebsd_committer freebsd_triage 2023-11-24 15:53:37 UTC
(In reply to Ed Maste from comment #4)
Also, I'm not sure whether this potentially affects the pre-OpenZFS ZFS in 12.4.
Comment 6 Olivier Certner freebsd_committer freebsd_triage 2023-11-24 16:15:06 UTC
From what I've read:

(In reply to Ed Maste from comment #1)
Given the patch itself, it's highly unclear how it possibly could.  But it is true that, in practice and so far, some people experiencing the issue (or one of the issues, if there are several) have reported that setting 'vfs.zfs.dmu_offset_next_sync' to 0 makes the problem impossible to reproduce for them.

(In reply to Mark Johnston from comment #2)
It seems that setting 'vfs.zfs.dmu_offset_next_sync' to 0 actually *improves* performance, at the expense of not reporting holes correctly for dirty files, so potentially causing more processing for readers (and more storage consumption in the case of, e.g., a concurrent 'cp').
Comment 7 Ed Maste freebsd_committer freebsd_triage 2023-11-25 02:35:32 UTC
Based on the latest discussion in https://github.com/openzfs/zfs/issues/15526 my current understanding is:

- Setting sysctl vfs.zfs.dmu_offset_next_sync=0 is reported to make the issue harder to reproduce, but does not reliably prevent it. A change to the default will *not* be in an EN update.
- https://github.com/openzfs/zfs/pull/15571 is the actual fix for this issue.
- Pull request 15571 needs to land in OpenZFS upstream first. Once that happens we can import it, MFC to stable/14, and issue an EN to update 14.0 and 13.2.
- dnode_is_dirty was introduced in commit de198f2d9507 in OpenZFS, and does not exist in FreeBSD 12.4. This issue may not affect 12.x.
Comment 8 Rob Norris 2023-11-25 09:37:00 UTC
Hi, I'm the author of 15571. Just a couple of notes:

- dmu_offset_next_sync=0 does appear to be a very reliable workaround, just not technically perfect. Only one person has been able to trip it, on an extremely contrived test, as opposed to the default =1, with which multiple people have been able to reproduce the issue consistently.

- The incorrect dirty check in 12.4 (as in Illumos) is in dmu_object_wait_synced(). I have not explored this at all though; OpenZFS never had this version and a lot has changed since. There may be other reasons why it can't be tripped.

Let me know if there's anything I can assist with.

Rob.
Comment 9 Alexander Leidinger freebsd_committer freebsd_triage 2023-11-25 12:21:05 UTC
Maybe we want to send a mail to announce@ suggesting setting vfs.zfs.dmu_offset_next_sync=0 for a while to prevent issues, with a note that we will issue an EN once a fix is released and has proven itself.
Comment 10 Ed Maste freebsd_committer freebsd_triage 2023-11-27 15:59:51 UTC
Mail to stable: https://lists.freebsd.org/archives/freebsd-stable/2023-November/001726.html

Review to change the zfs_dmu_offset_next_sync default for now: https://reviews.freebsd.org/D42781
Comment 11 Ed Maste freebsd_committer freebsd_triage 2023-11-27 16:50:20 UTC
See also: https://github.com/openzfs/zfs/pull/15566
Comment 12 Ed Maste freebsd_committer freebsd_triage 2023-11-28 19:12:45 UTC
The related commits have been merged into OpenZFS, and will be imported into FreeBSD as soon as possible.

https://github.com/openzfs/zfs/pull/15566
https://github.com/openzfs/zfs/pull/15571
Comment 13 Martin Matuska freebsd_committer freebsd_triage 2023-11-28 23:42:05 UTC
A bugfix from OpenZFS has been merged into:
main (2276e5394)
stable/14 (d92e0d62c)
stable/13 (5858f93a8)
Comment 14 Dave Cottlehuber freebsd_committer freebsd_triage 2023-11-29 07:20:35 UTC
For the inevitable EN (errata notice), could we also clarify some related info?

This is my understanding as of this morning; have I gotten this right?

Generally:

- testing of a new feature uncovered a rare and long-standing bug in OpenZFS
- the block cloning feature increased the likelihood of encountering this bug
- but it is still very very rare, and relies on a combination of high i/o, cpu, and unusual write/read patterns
- this error will not be picked up by scrub or other zfs consistency checks
- setting vfs.zfs.dmu_offset_next_sync=0 should mitigate the likelihood of encountering this error going forwards until a patch is made available

14.0-RELEASE new zpools:

- block cloning feature is disabled by default via vfs.zfs.bclone_enabled=0
- a newly created zpool will have the zpool feature@block_cloning enabled
- but will not actively use it because of the sysctl
- thus actual impact is unlikely

13.x existing zpools:

- you may have encountered this bug but there is currently no definitive way
  to scan a zpool to check for the occurrence

12.x existing zpools:

- should not be impacted

Open Questions:

- I don't understand whether FreeBSD uses a similar sparse-by-default functionality (like Linux coreutils 9.2/9.3) that alters the user risk here

I'm happy to help craft/review an EN if needed.
Comment 15 Alexander Leidinger freebsd_committer freebsd_triage 2023-11-29 08:23:30 UTC
(In reply to Dave Cottlehuber from comment #14)

> 12.x existing zpools:

It is not known whether there is an impact or not. There are now reports that OpenZFS 0.6.x could be affected, and people want to test the initial Solaris ZFS release for this bug.

> Open questions:

It doesn't matter whether cp uses lseek() or not. There exist applications out there which make use of it, and as such the likelihood of an issue is not 0. As it also relies on parallel read/write operations and the right timing, the likelihood may be very small (https://github.com/openzfs/zfs/issues/15526#issuecomment-1826182928 states 0.00027% real-world corruption for a particular use case, but no idea how likely this use case is for others... or what the use case was).

What we know is that poudriere runs on 15-CURRENT, where block cloning is enabled, can trigger the issue in a way that multiple people have stumbled over it. For all the previous years we are not aware of any obvious use case which may have triggered the problem in a way that someone would have attributed it to ZFS. So yes, we need to fix it. Yes, it is a good idea to use the sysctl to decrease the likelihood even further. But no, we should not have sleepless nights. If you don't use ECC RAM, I would guess the likelihood of having bad data due to a RAM issue is higher.
Comment 16 Rob Norris 2023-11-29 10:12:02 UTC
(In reply to Dave Cottlehuber from comment #14)

> Generally:

> - testing of a new feature uncovered a rare and long-standing bug in OpenZFS

Correct. A reproduction for what was first thought to be a block cloning bug also showed the same fault in 2.1, which does not have the block cloning feature.

> - the block cloning feature increased the likelihood of encountering this bug

Unclear. So far no one has been able to show that cloning definitely changes the probability of the bug showing up, and it is not at all clear why it should even be in play. Nor have I been able to show it in my own testing.

[tech detail and speculation follows]

The actual bug is that the object dnode is briefly, incorrectly, considered "clean". There's no particular reason that the change being a clone rather than a write would affect this window; at the time the dnode appears to be in the wrong state, we're in syncing context, so the actual change must be two txgs ago.

The only reason I've been able to think that it might make a difference is simply about the amount of work in the system. Reading and writing real bytes is slow; copying a block pointer that's already in memory is negligible by comparison. It could simply be that cloning enables more operations to be completed in the same time, obviously increasing the chance.

It's also worth noting that cloning was an initial guess, based on:

- a copy-related bug was presented on Linux against 2.2
- with a version of cp from GNU coreutils known to use copy_file_range(), thus permitting cloning
- an initial analysis found a locking bug in part of block cloning involving dirty buffers

However, this happened just after 2.2.1 was released, which disabled block cloning at the POSIX entry points (a direct port of the FreeBSD 14 sysctl to Linux), due to another bug that had shown up in 2.2.0 (unencrypted blocks could be cloned into encrypted datasets). 2.2.1 disabling a headline feature had generated some interest, as had FreeBSD 14 before it, and there was already some uncertainty around the block cloning feature before 2.2 was released.

Once the problem was confirmed on 2.1, we quickly understood that lseek(SEEK_DATA/SEEK_HOLE) was the real culprit, coincidentally also being available for the first time in coreutils 9.x.

All this is to say: there was already an idea in the air that block cloning was dangerous, and that's something I'm still arguing against days later. That may be part of the reason that there is so strong a feeling that block cloning raises the probability.

> - but it is still very very rare, and relies on a combination of high i/o, cpu, and unusual write/read patterns

At a high level, yes, all true.

I believe the specific sequence is:

- program [A] executes an operation that converts a hole to data, or appends data to a file (which can be thought of as creating a hole at the end of the file, then putting data underneath it). `cp` obviously fulfils this
- the change makes it to syncing context, and the dnode starts its sync process
- the dnode appears to be "clean"
- program [B] calls lseek(SEEK_DATA) at the start of the hole (usually offset 0 in a new file). The dnode appears clean, so its block pointers are examined. The data isn't written yet, but the size in the dnode is past the offset, so ENXIO ("data not found") is returned. This is an in-memory operation.
- the window passes, the dnode is "dirty"
- program [B], believing that there's no data there, does whatever it does (in the case of `cp`, writes all-zeroes or punches a hole in the destination)

(still working on it, so I'm not totally sure this is all of it. Once I'm sure, I'll be sending a patch to OpenZFS)

So as you see, the timing is enormously difficult: a particular kind of write operation has to happen roughly in parallel with a hole-detection operation, and hit at just the right moment. It's not really clear that high I/O, CPU, etc. make much difference, apart from perhaps changing the timing. The main reason we do it a lot in the reproduction is just to better the odds of seeing it.
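To make that shape concrete, here is a minimal sketch of the kind of write-then-copy loop involved. The path, sizes, and iteration count are illustrative assumptions; this is not the upstream reproducer, and it is not guaranteed to trigger the bug:

#!/bin/sh
# Assumed: /tank/test is a dataset on the pool under test.
DIR=/tank/test
mkdir -p "$DIR"
i=0
while [ "$i" -lt 1000 ]; do
    # Program [A]: create a fresh file (appending converts the "hole" at EOF to data).
    dd if=/dev/random of="$DIR/src.$i" bs=128k count=8 2>/dev/null
    # Program [B]: copy it while the source may still be dirty; cp's hole
    # detection is what can race with the sync of the source dnode.
    cp "$DIR/src.$i" "$DIR/dst.$i"
    # A hit shows up as unexpected runs of zeroes in the copy.
    cmp -s "$DIR/src.$i" "$DIR/dst.$i" || echo "mismatch: $DIR/dst.$i"
    i=$((i + 1))
done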

> - this error will not be picked up by scrub or other zfs consistency checks

Correct. It's entirely in-memory; the problem isn't that anything read or wrote bad data, but rather that it wrote _nothing_ because it thought there was nothing to write!

> - setting vfs.zfs.dmu_offset_next_sync=0 should mitigate the likelihood of encountering this error going forwards until a patch is made available

Yes (though patches are already available upstream and, I believe, at least on FreeBSD head).

The reason this mitigation is effective is that =0 causes the hole check to return "data" if the dnode is dirty, without checking for holes, while the default =1 causes it to wait for sync if the dnode is dirty, then check dirty _again_, before either doing the real hole block scan or returning "data" if it's still dirty. The retry occurs just after the txg sync has finished, perilously close to the "clean" window as the dataset starts its next sync.

But of course, there's still a minuscule chance that the initial dirty check could hit the gap, which is why it's not a 100% solution.

> 14.0-RELEASE new zpools:

> - block cloning feature is disabled by default via vfs.zfs.bclone_enabled=0

Correct. (fwiw, upstream 2.2.1 has it disabled by default on Linux as well)

> - a newly created zpool will have the zpool feature@block_cloning enabled

Correct.

> - but will not actively use that because of the sysctl

Correct. OpenZFS' copy_file_range() handler will check this flag, and immediately fall back to the "generic" handler. There are no other block cloning entry points on FreeBSD.

> - thus actual impact is unlikely

Possibly not true, for another reason. Even if we take block cloning out of the picture, FreeBSD's generic copy_file_range() looks for holes and tries to replicate them on the target. That's through VOP_IOCTL(FIOSEEKDATA/FIOSEEKHOLE). I haven't read that code closely, but I see no reason why it would be less risky than on Linux with coreutils 9.x (which isn't much, but it's there).

> 13.x existing zpools:

> - you may have encountered this bug but there is currently no definitive way
>   to scan a zpool to check for the occurrence

Correct.

Note that copy_file_range() appeared in 13.0 (bbbbeca3e9a3) and had hole checks from the beginning.

> 12.x existing zpools:

> - should not be impacted

Unclear. The bug has been affirmatively reproduced back to ZoL 0.6.5, which at that time was still pulling code from Illumos as FreeBSD was. There's some evidence from code reading that this might have been introduced in OpenSolaris in 2006 (https://github.com/illumos/illumos-gate/commit/c543ec060d, which first changed the dirty check logic to its incomplete form).

However, as best I can tell, copy_file_range() did not exist then, and `bin/cp` does not now and did not then appear to have any facility to search for holes on the source and replicate them (no lseek(SEEK_DATA/SEEK_HOLE), nor ioctl(FIOSEEKDATA/FIOSEEKHOLE)).

So, if the bug can still be hit there, it's highly likely that it never was through the inbuilt tools.

> Open Questions:

> - I don't understand whether FreeBSD uses a similar sparse-by-default functionality (like linux coreutils 9.2/9.3) that alters the user risk here

Answered above.

> I'm happy to help craft/review an EN if needed.

I hope I haven't made it harder!

The point being: it's vanishingly unlikely that anyone has hit this during normal operation, and even less likely that they've hit it _unknowingly_: how often are people writing files, checking for holes and then taking action on that within a fraction of a moment, at a large enough scale for a long enough time that they could actually hit it, and then never actually looked at the target file? The only real-world cases we've seen hit this are parallel build systems, and while it's a problem, it's not _silent_ corruption, because the output plain doesn't work! So of course it's not _never_, but it's really hard to hit by accident.

My advice to users would be (and has been):

- leave vfs.zfs.bclone_enabled=0
- if you can get the patch, take it
- if you can't, set vfs.zfs.dmu_offset_next_sync=0. You lose hole reporting for dirty files, and vastly reduce the already small odds of hitting this
- play the probabilities: if you're not sure, and you're not creating new files and copying them away in the same instant hundreds of times per second, you're fine
Comment 17 Troels Just 2023-11-29 13:18:57 UTC
(In reply to Rob Norris from comment #16)
> My advice would to users would be (and has been):
> - leave vfs.zfs.bclone_enabled=0

For paranoid people like me, at the level of users setting up new systems, would it be beneficial to do "zpool create ... -o feature@block_cloning=disabled"? Does that provide any benefit over the sysctl you mentioned?

Separately (and I apologize if this is not relevant to this bug report), any thoughts as to whether #15533 ( https://github.com/openzfs/zfs/issues/15533 ), which I got hit by on Linux when running on top of LUKS, is in any way applicable to FreeBSD when running ZFS on top of GELI? I ask since OpenZFS commit bd7a02c was, as far as I can see, not Linux-specific, and I have a few clients with GELI+ZFS setups. I did notice your vdev_disk rewrite is Linux-specific, so I was just curious as to what the situation might be for FreeBSD?
Comment 18 Ed Maste freebsd_committer freebsd_triage 2023-11-29 15:33:53 UTC
Draft EN text: https://reviews.freebsd.org/D42832
Comment 19 Anderson Guzman 2023-11-29 16:05:43 UTC
Hello everyone,

I found the following pull request related to #15526; it is already merged in OpenZFS:


[2.2] dmu_buf_will_clone: fix race in transition back to NOFILL

https://github.com/openzfs/zfs/pull/15599
Comment 20 Rob Norris 2023-11-30 00:04:53 UTC
(In reply to Troels Just from comment #17)

> would it be beneficial to do "zpool create ... -o feature@block_cloning=disabled" ? Does that provide any benefit over the sysctl you mentioned?

Disabling the feature makes the pool unable to clone blocks at all when asked to do so. Disabling the sysctl makes it impossible to ask at all. On the same system it's much of a muchness, but if you move pools to other systems, that may make a difference: the pool feature state will move with the pool, but the sysctl is bound to the host.
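As a concrete sketch of the two options (the pool name, device, and exact output are illustrative placeholders):

# zpool create -o feature@block_cloning=disabled tank /dev/ada1
# zpool get feature@block_cloning tank
NAME  PROPERTY               VALUE     SOURCE
tank  feature@block_cloning  disabled  local
# sysctl vfs.zfs.bclone_enabled=0
vfs.zfs.bclone_enabled: 0 -> 0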

> Is [#15533 about ZFS on LUKS] in any way applicable on FreeBSD when running ZFS on top of GELI?

No, totally unrelated.

(In reply to Anderson Guzman from comment #19)

> [2.2] dmu_buf_will_clone: fix race in transition back to NOFILL

It's the fix for the cloning lock bug I mentioned right at the start, but we think there are still cloning bugs around. Cloning will remain disabled in 2.2.x for now.
Comment 21 Rob Norris 2023-11-30 01:23:08 UTC
I note it's been reproduced on Illumos, so I'll be even less surprised now to find the bug on 12.4.

https://www.illumos.org/issues/16087
Comment 22 Rob Norris 2023-11-30 04:42:05 UTC
I've reproduced it on 12.4. I'm working on a patch now.

Happily, it should be even harder to hit in practice. Nothing in base appears to use SEEK_HOLE/SEEK_DATA or FIOSEEKHOLE/FIOSEEKDATA, and coreutils 9.x disables SEEK_HOLE/SEEK_DATA on FreeBSD <14 due to:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=256205
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=61386
https://github.com/coreutils/gnulib/commit/7352d9fec59398c76c3bb8a2ef86ba58818f0342

(Oh, #256205, the FreeBSD view on OpenZFS #11900. We knew. WE KNEW!!)

Patch soon!
Comment 23 Rob Norris 2023-11-30 05:24:25 UTC
Patch: https://github.com/robn/freebsd-src/commit/b51ff59cce9c288111f84c8541986ac6647703e7

I'm not set up for Phabricator, and I'm out of time today. If you want it I'm more than happy for you to take it and shepherd it through.
Comment 24 Ed Maste freebsd_committer freebsd_triage 2023-11-30 05:34:24 UTC
(In reply to Rob Norris from comment #23)
Thanks for the patch! I'll apply it to stable/12 shortly, and it will either be included with the initial EN or follow shortly after.
Comment 25 mickael.maillot 2023-11-30 10:16:51 UTC
Will those patches go into releng/14.0, or should I switch all my servers to stable/14?
Comment 26 Ed Maste freebsd_committer freebsd_triage 2023-11-30 14:25:00 UTC
(In reply to mickael.maillot from comment #25)
> Will those patches go into releng/14.0, or

Yes, an EN with updates for 14.0 and 13.2 should be available this week. I'm not sure yet if an update for 12.4 will be included or will follow days later.
Comment 27 mike 2023-11-30 17:48:27 UTC
After patching RELENG_13/14, is there any meaningful risk mitigation in leaving vfs.zfs.dmu_offset_next_sync=0, in case this issue is not fully resolved?  Or is the performance cost (if any) not worth it?
Comment 28 Ed Maste freebsd_committer freebsd_triage 2023-11-30 18:59:34 UTC
(In reply to mike from comment #27)
The workaround actually improves performance (at the cost of possibly incorrect hole reporting).

Once the update is applied, though, I believe there's no benefit (from a risk-mitigation perspective) in keeping the workaround. I've added a sentence to the end of the workaround section in the draft EN:

IV.  Workaround

Setting the vfs.zfs.dmu_offset_next_sync sysctl to 0 disables forcing
TXG sync to find holes.  This is an effective workaround that greatly
reduces the likelihood of encountering data corruption, although it does
not completely eliminate it.  Note that with the workaround holes will
not be reported in recently dirtied files.  See the zfs(4) man page for
more information on the impact of this sysctl setting.

The workaround should be removed once the system is updated to include the
fix described in this notice.
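For completeness, a sketch of backing the workaround out after updating, assuming it was added as a single line in /etc/sysctl.conf (restoring the default at runtime avoids waiting for a reboot):

# sed -i '' '/^vfs.zfs.dmu_offset_next_sync=0/d' /etc/sysctl.conf
# sysctl vfs.zfs.dmu_offset_next_sync=1
vfs.zfs.dmu_offset_next_sync: 0 -> 1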
Comment 29 mike 2023-11-30 19:16:02 UTC
(In reply to Ed Maste from comment #28)
Thanks, Ed. Just a side note: the section on dmu_offset_next_sync in the man pages is only in HEAD. I guess it needs to be MFC'd to 13 and 14.

# man 4 zfs | grep -i dmu
             in one DMU transaction.  This is the uncompressed size, even when
# 

That's with RELENG_14.
Comment 30 Ed Maste freebsd_committer freebsd_triage 2023-11-30 20:00:03 UTC
(In reply to mike from comment #29)
It should be there, e.g. from man.cgi: https://man.freebsd.org/cgi/man.cgi?query=zfs&apropos=0&sektion=4&manpath=FreeBSD+13.2-RELEASE&arch=default&format=html

Oh, the sysctl name is not greppable; from `man 4 zfs | vim -R -` on stable/13:
     z^Hzf^Hfs^Hs_^H_d^Hdm^Hmu^Hu_^H_o^Hof^Hff^Hfs^Hse^Het^Ht_^H_n^Hne^Hex^Hxt^Ht_^H_s^Hsy^Hyn^Hnc^Hc=1^H1|0 (int)
             Enable forcing TXG sync to find holes.  When enabled forces ZFS
             to sync data when S^HSE^HEE^HEK^HK_^H_H^HHO^HOL^HLE^HE or S^HSE^HEE^HEK^HK_^H_D^HDA^HAT^HTA^HA flags are used allowing
             holes in a file to be accurately reported.  When disabled holes
             will not be reported in recently dirtied files.
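For what it's worth, piping the output through col -b strips the backspace overstriking, after which the name is greppable:

# man 4 zfs | col -b | grep -A 4 dmu_offset_next_sync
     zfs_dmu_offset_next_sync=1|0 (int)
             Enable forcing TXG sync to find holes.  When enabled forces ZFS
             to sync data when SEEK_HOLE or SEEK_DATA flags are used allowing
             holes in a file to be accurately reported.  When disabled holes
             will not be reported in recently dirtied files.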
Comment 31 mike 2023-11-30 20:02:38 UTC
(In reply to Ed Maste from comment #30)

whoops, you are right. Sorry for the noise!
Comment 32 Ed Maste freebsd_committer freebsd_triage 2023-12-01 03:45:36 UTC
Released as FreeBSD-EN-23:16.openzfs for 14.0, 13.2, and 12.4.

https://www.freebsd.org/security/advisories/FreeBSD-EN-23:16.openzfs.asc