Bug 233245 - [UFS] [enhancement] Softupdates fails to track dependency between appended data and i_size
Summary: [UFS] [enhancement] Softupdates fails to track dependency between appended da...
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Some People
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-15 21:58 UTC by Conrad Meyer
Modified: 2019-08-28 22:30 UTC (History)
6 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Conrad Meyer freebsd_committer freebsd_triage 2018-11-15 21:58:22 UTC
This manifests with recently appended files containing extraneous zero data at the end of the file after power loss or crash.

Detection of append could happen in either ffs_balloc_ufs[12] (which is where most softdep dependencies seem to be added, and which is called unconditionally by ffs_write()) in the several "happy" cases where we discover we do not need to actually allocate a new block, e.g., cases like:

   753         if (lbn < UFS_NDADDR) {
...
   756                 nb = dp->di_db[lbn];
   757                 if (nb != 0 && ip->i_size >= smalllblktosize(fs, lbn + 1)) {
   758                         error = bread_gb(vp, lbn, fs->fs_bsize, NOCRED,
   759                             gbflags, &bp);
...
   764                         bp->b_blkno = fsbtodb(fs, nb);

+++                            if (DOINGSOFTDEP(vp) && startoffset + size > ip->i_size)
+++                                    softdep_setup_append(...);

   765                         *bpp = bp;
   766                         return (0);

It could also just be checked in a single location in ffs_write() itself:

   754         for (error = 0; uio->uio_resid > 0;) {
   755                 lbn = lblkno(fs, uio->uio_offset);
   756                 blkoffset = blkoff(fs, uio->uio_offset);
...
   781                 if (uio->uio_offset + xfersize > ip->i_size) {

+++                            if (DOINGSOFTDEP(vp))
+++                                    softdep_setup_append(...);

   782                         ip->i_size = uio->uio_offset + xfersize;
   783                         DIP_SET(ip, i_size, ip->i_size);
   784                 }

Append data could probably be tracked in quite similar fashion to diradd (directory entry adds).
Comment 1 Kirk McKusick freebsd_committer freebsd_triage 2018-11-16 01:24:26 UTC
As you note, soft updates does not currently consider it necessary to ensure that the new contents of the block be written before increasing the file size if it knows that the exposed contents will be zero (as opposed to the random data that was in a previously used block where it does ensure that the block is written before it can be accessed). It would be possible to add a requirement that the new data be written before the size could be updated, but it is not clear to me that adding this extra overhead is worthwhile. Also, this case includes overwriting data in the middle of an earlier block in the file for which there is nothing that we can do.

Note that you cannot put the test into ffs_write() in the way that you have done (which makes the call for any growth in size). It can only be done when the growth is such that it does not go out of an existing block allocation or is overwriting an earlier part of the file.
Comment 2 Conrad Meyer freebsd_committer freebsd_triage 2018-11-16 02:58:11 UTC
Hi Kirk,

Thanks for your prompt reply.

(In reply to Kirk McKusick from comment #1)
> As you note, soft updates does not currently consider it necessary to ensure
> that the new contents of the block be written before increasing the file size
> if it knows that the exposed contents will be zero (as opposed to the random
> data that was in a previously used block where it does ensure that the block is
> written before it can be accessed).

Yep, understood.

> It would be possible to add a requirement that the new data be written before
> the size could be updated,

That is the proposal. :-)

> but it is not clear to me that adding this extra overhead is worthwhile.

Overhead in which sense?  I can imagine a few objections (it makes the code more complicated; there is some minor additional memory burden; SU dependency graphs get a little deeper) but maybe you have others in mind I hadn't thought of.  I don't think there is any additional IO cost (perhaps longer fsync time on appended files due to the additional ordering barrier?), but I might well be missing something.

As far as whether any runtime overhead is worthwhile, I think there are two relevant dimensions.  One, what is the measurable overhead, if any?  (We can't really answer this one until we have a proof of concept implementation.)  And two, what is the user willing to accept, both in terms of overhead and append data fidelity?  I think it's likely a trade-off we might not want to make unilaterally, but instead leave to the end user.

It should be pretty easy to add it as a mount option or something like that; in the same vein as "noatime" allows users to disable a feature with too much overhead.  I don't see any significant barriers to enabling or disabling it on-the-fly for existing mounts at runtime, either.

> Also, this case includes overwriting data in the middle of an earlier block in the file for which there is nothing that we can do.

Understood — append is /the/ special case of write where this is possible.  But it is also a fairly common case, and given that we can do it safely, I think we should aim to do it safely.

(For middle-of-file partial overwrites, we could order data block flushing with inode mtime update, if we do not already.  That would be a pretty minimal protection and would not save us from torn writes.  I think we might also be able to allocate data blocks out-of-place and order full block writes with their corresponding block pointer or indirect block pointer updates, but that has significant downsides, such as file fragmentation, and isn't something I'm interested in right now.)

> Note that you cannot put the test into ffs_write() in the way that you have
> done (which makes the call for any growth in size). It can only be done when
> the growth is such that it does not go out of an existing block allocation or
> is overwriting an earlier part of the file.

The elided portion of ffs_write() between lines 756 and 781 invokes UFS_BALLOC() to allocate any blocks needed -- so I believe the suggested location in ffs_write() would be within an existing block allocation.  But maybe I am mistaken.  I am not attached to the location, it just seemed like a plausible spot to me.

(There is a smaller separate concern, which is that any full block append is already fully tracked with allocdirect or allocindirect — we only want the proposed partial append dependency for the last iteration of the loop in ffs_write.)
Comment 3 Konstantin Belousov freebsd_committer freebsd_triage 2018-11-16 08:27:35 UTC
Of course I sympathize the idea of improving the user experience, but IMO the cost of the code complexity is more important there.

SU, from my prospective, were never supposed to provide the user data consistency guarantees. The goal was/is only the metadata sanity, even not the guarantee that on-disk metadata actually matches some state during the system operation.  We never track user data block writes ordering, so whatever state user data is left after the crash, is the user issue.

From this PoV, file size increase by hole vs. increase by the actual content is not under the SU scope.  Again, to make my opinion clear: I would not object, but SU are already insanely complex and we must not increase the complexity just because we can.
Comment 4 Kirk McKusick freebsd_committer freebsd_triage 2018-11-16 09:25:11 UTC
I concur with Kostik on what we are trying to show to the user. Specifically the filesystem metadata is always correct and we never expose previous contents of blocks to a user. We have no problem presenting them with zeroed data where we do not have properly written data available. Consider that the user has a large file that has a hole at logical blocks 1, 2, and 3. Now suppose the application fills in this hole by writing logical blocks 1, 2, and 3. Suppose the kernel has managed to get blocks 1 and 3 written to disk but not block 2 when the inode gets written. Here we `roll back' the inode putting a hole in the file at block 2, then put the block pointer back before we unlock it and allow it to be viewed by a read operation. So as long as the system stays up, the applications always see all the written data. But if the system crashes, then when it comes back up block 2 will read back as zeroed (e.g., a hole in the file) rather than seeing the unwritten data. So, having some extra zeros at the end of a file really seems no worse.

I am presently working on hardening the filesystem. Putting check-hashes on the metadata, and working to handle disk failures by forcibly unmounting the filesystem rather than having the kernel panic. I think these are more useful than avoiding zeros at the end of files after a crash. That said, I would be happy to assist you if you want to develop the code to extend soft updates to add this semantic. You would need to add a new dependency type (or possibly extend allocdirect) to track when an existing block is extended with new data. When the inode is written, you need to roll back the length to the old size, then restore the length when the write completes. This of course only works at the end of a file, not when data is added in the middle of a file as in my example above.
Comment 5 Conrad Meyer freebsd_committer freebsd_triage 2018-11-16 18:04:54 UTC
Hi Kirk,

(In reply to Kirk McKusick from comment #4)
> I concur with Kostik on what we are trying to show to the user. Specifically
> the filesystem metadata is always correct and we never expose previous
> contents of blocks to a user. We have no problem presenting them with zeroed
> data where we do not have properly written data available. Consider that the
> user has a large file that has a hole at logical blocks 1, 2, and 3. Now
> suppose the application fills in this hole by writing logical blocks 1, 2,
> and 3. Suppose the kernel has managed to get blocks 1 and 3 written to disk
> but not block 2 when the inode gets written. Here we `roll back' the inode
> putting a hole in the file at block 2, then put the block pointer back before
> we unlock it and allow it to be viewed by a read operation. So as long as the
> system stays up, the applications always see all the written data. But if the
> system crashes, then when it comes back up block 2 will read back as zeroed
> (e.g., a hole in the file) rather than seeing the unwritten data. So, having
> some extra zeros at the end of a file really seems no worse.

I fully agree — the status quo treatment of partial block appends is no worse than the overwrite treatment, and that SU does a good job of ensuring the metadata is correct in the face of power failure / crash.  What I am proposing is definitely an enhancement.  I've marked the summary as such to be clear to anyone who stumbles upon this PR.  I filed this in bugzilla to track it as a potential work item for myself, but the discussion that has taken place is useful to me too.

> I am presently working on hardening the filesystem. Putting check-hashes on
> the metadata, and working to handle disk failures by forcibly unmounting the
> filesystem rather than having the kernel panic. I think these are more useful
> than avoiding zeros at the end of files after a crash.

I agree.  These are more important enhancements to UFS than my proposal.  I certainly appreciate the work you are doing and trust you to prioritize the work you think is valuable.  I did not expect anyone else to work on this enhancement, so perhaps the freebsd-fs@ "assignee" is bogus.  I would be happy to assign it to myself to take it off the fs@ list, if that seems most reasonable to everybody.

> That said, I would be happy to assist you if you want to develop the code to
> extend soft updates to add this semantic.

I appreciate the offer — thank you.

> You would need to add a new dependency type (or possibly
> extend allocdirect) to track when an existing block is extended with new
> data. When the inode is written, you need to roll back the length to the old
> size, then restore the length when the write completes. This of course only
> works at the end of a file, not when data is added in the middle of a file as
> in my example above.

Yes, it does seem quite similar to allocdirect (minus the block pointer manipulations, and block-centric i_size math).  Do you know of any additional good resources I can read for the specific dependency graph behavior / rollback semantics of SU?  I have the FreeBSD D&I 2e book of course, and the source code is the final word.  But if you have any additional references you can share, I would appreciate it.

(Tangential to this enhancement, I might take a pass cleaning up the plain English of the ffs_softdep.c comment blocks.  They are thorough, but tend to lack visual separation and thus do not scan easily.  Additionally, some references to specific objects are quoted, while others are not.  It might help me understand SU better to do a pass making an attempt to help others understand SU better. :-))
Comment 6 Kirk McKusick freebsd_committer freebsd_triage 2018-11-17 01:43:41 UTC
(In reply to Conrad Meyer from comment #5)
> I would be happy to assign it to myself to take it off the fs@ list, if that
> seems most reasonable to everybody.

I have no objection to it being on the fs@ list. That way I see any dialog that happens on it. But if you decide to take it off the fs@ list, then please add me explicitly.

> Do you know of any additional good resources I can read for the specific
> dependency graph behavior / rollback semantics of SU?  I have the FreeBSD
> D&I 2e book of course, and the source code is the final word.  But if you
> have any additional references you can share, I would appreciate it.

In about 2000 I used the `special topics' lecture of my advanced code reading class to do a walk-through of the soft updates code. It is a 2.5 hour video of me going through softdep.h and ffs_softdep.c. It predates the addition of the journaling support, so reflects a simpler time, but is probably still useful to get a general understanding. I would be willing to make the video available to you if you want to watch it.

> (Tangential to this enhancement, I might take a pass cleaning up the plain
> English of the ffs_softdep.c comment blocks.  They are thorough, but tend to
> lack visual separation and thus do not scan easily.  Additionally, some
> references to specific objects are quoted, while others are not.  It might
> help me understand SU better to do a pass making an attempt to help others
> understand SU better. :-))

I believe that commenting code is very important and have tried to do a good job of it in the soft updates code. So I strongly encourage you to work to improve it. Often the original author overlooks things that seem obvious to them but in fact are confusing to those that are not intimately familiar with the code. So filling in those gaps is a valuable addition.
Comment 7 Conrad Meyer freebsd_committer freebsd_triage 2018-11-17 05:21:14 UTC
(In reply to Kirk McKusick from comment #6)
Ok, I will leave it on the fs list then.

> In about 2000 I used the `special topics' lecture of my advanced code reading
> class to do a walk-through of the soft updates code. It is a 2.5 hour video of
> me going through softdep.h and ffs_softdep.c. It predates the addition of the
> journaling support, so reflects a simpler time, but is probably still useful to
> get a general understanding. I would be willing to make the video available to
> you if you want to watch it.

I think my employer has licensed some of your video series, but none specifically annotated as around 2000.  I see a 2006 "data structures" series where #6 covers UFS, including SU, and a 2009 "intensive code walkthrough" series where #15 is a "special topics," including soft updates.  I don't suppose you're referring to one of those?  If not, I would certainly appreciate a copy of the video.

> I believe that commenting code is very important and have tried to do a good
> job of it in the soft updates code. So I strongly encourage you to work to
> improve it. Often the original author overlooks things that seem obvious to
> them but in fact are confusing to those that are not intimately familiar with
> the code.

Absolutely!

> So filling in those gaps is a valuable addition.

Ok, great.  Sounds like I have plenty of direction in how to proceed. :-)