Bug 213507 - [msdosfs]: Prevent occasional directory corruption while extending it to another cluster
Summary: [msdosfs]: Prevent occasional directory corruption while extending it to anot...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-fs (Nobody)
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2016-10-15 17:43 UTC by Vladyslav Movchan
Modified: 2018-05-17 17:20 UTC
CC List: 5 users

See Also:


Attachments
This patch fixes the problem for me (681 bytes, patch)
2016-10-15 17:43 UTC, Vladyslav Movchan
Updated patch (861 bytes, patch)
2016-10-16 12:15 UTC, Vladyslav Movchan

Description Vladyslav Movchan 2016-10-15 17:43:56 UTC
Created attachment 175786 [details]
This patch fixes the problem for me

When the number of directory entries grows, a directory can reach a size at which it requires one more cluster.

This condition is checked at the beginning of createde() in msdosfs_lookup.c.
The actual directory extension is performed by extendfile() in msdosfs_fat.c.

I found that after extendfile() completes, the newly allocated cluster contains garbage-like data, which, once the directory has grown into that cluster, may be interpreted as phantom/random directory entries.

This bug is sometimes hard to reproduce intentionally, as the data in the newly allocated cluster often looks the same no matter how many times you retry.
If that data consists mostly of zeros, it produces no effect noticeable on the filesystem.


Here is the output from a run where I got lucky and triggered this problem (shell is bash):
# newfs_msdos -F 32 /dev/gpt/sg1-disk-f
/dev/gpt/sg1-disk-f: 315571904 sectors in 4930811 FAT32 clusters (32768 bytes/cluster)
BytesPerSec=512 SecPerClust=64 ResSectors=32 FATs=2 Media=0xf0 SecPerTrack=63 Heads=16 HiddenSecs=0 HugeSectors=315649016 FATsecs=38522 RootCluster=2 FSInfo=1 Backup=2
# mount_msdosfs -o large /dev/gpt/sg1-disk-f /mnt/f
# mkdir /mnt/f/test
# cd /mnt/f/test
# for i in `seq 1 150`; do head -c $((20000*1000+1)) /dev/random > `echo {0..9}{a..k} | tr -d ' '`$i; done
# ls -laR /mnt/f/test > /dev/null
ls: .гJ(: No such file or directory
ls: .гM(: No such file or directory
ls: .гP(: No such file or directory
ls: .гS(: No such file or directory
ls: .гV(: No such file or directory
ls: .гY(: No such file or directory
ls: .А?.: Invalid argument
ls: .АB(: No such file or directory
ls: 0?.?????: Invalid argument
ls: 0t,?????.ПМ+: Invalid argument
ls: 0w,?????.?t,: Invalid argument
ls: 0|#?????.0|#: Invalid argument
ls: 0|#?????.0|#: Invalid argument
ls: 0|#?????.0|#: Invalid argument
...

In most cases there is no way other than newfs_msdos to get rid of such "random" directory entries,
so don't run this on a filesystem you care about.

After I changed the code to intentionally bzero() the newly allocated cluster data (only for the directory case), I was no longer able to reproduce this problem. Patch attached.
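
For illustration, a rough sketch of what the attached patch amounts to (paraphrased from the description above, not the verbatim diff; the exact placement inside extendfile() may differ):

/*
 * In extendfile() (msdosfs_fat.c), in the branch that sets up the
 * buffer for a directory's newly allocated cluster, zero the whole
 * cluster so that stale page contents cannot later be read back as
 * directory entries.
 */
bzero(bp->b_data, pmp->pm_bpcluster);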


System details:
$ uname -a
FreeBSD morgenstern 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r306964: Mon Oct 10 18:17:21 EEST 2016     root@morgenstern:/usr/obj/usr/src/sys/Mephistopheles  amd64

$ diskinfo -v /dev/gpt/sg1-disk-f
/dev/gpt/sg1-disk-f
        512             # sectorsize
        161612296192    # mediasize in bytes (151G)
        315649016       # mediasize in sectors
        0               # stripesize
        539533312       # stripeoffset
        313143          # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        9WM0NGVRs0      # Disk ident.

$ gpart show -l /dev/ada3
=>        40  3907026976  ada3  GPT  (1.8T)
          40          23        - free -  (12K)
          63   105910497     1  sg1-disk-e  (51G)
   105910560    14680080        - free -  (7.0G)
   120590640  1572864000     2  sg1-tank-disk3  (750G)
  1693454640          39        - free -  (20K)
  1693454679  1897923321     4  sg1-disk-i  (905G)
  3591378000   315649016     5  sg1-disk-f  (151G)
Comment 1 Konstantin Belousov freebsd_committer freebsd_triage 2016-10-16 05:56:24 UTC
Right after both branches of the if() statement, one of which you patched, there is a vfs_bio_clrbuf() call to clear the buffer.

Due to this, I highly doubt that the patch changes anything, i.e. the bug must be elsewhere.
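
For readers without the source at hand, the code being referred to looks roughly like this (a paraphrase of the cluster set-up loop in extendfile(), not verbatim):

if (dep->de_Attributes & ATTR_DIRECTORY)
	/* directory data is accessed through the device vnode */
	bp = getblk(pmp->pm_devvp, cntobn(pmp, cn),
	    pmp->pm_bpcluster, 0, 0, 0);
else
	/* regular file data is accessed through the file's own vnode */
	bp = getblk(DETOV(dep), frcn, pmp->pm_bpcluster, 0, 0, 0);
/* right after both branches: clear the (still invalid) new pages */
vfs_bio_clrbuf(bp);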
Comment 2 Vladyslav Movchan 2016-10-16 06:09:10 UTC
(In reply to Konstantin Belousov from comment #1)
While debugging I checked the content of the newly allocated cluster right after the vfs_bio_clrbuf() call. So even after vfs_bio_clrbuf() has executed it is possible to see junk/random data in bp->b_data.

I just rechecked once again to make sure: sometimes the newly allocated cluster contains just zeroes, but sometimes it contains random data.
Comment 3 Konstantin Belousov freebsd_committer freebsd_triage 2016-10-16 06:16:14 UTC
(In reply to Vladislav Movchan from comment #2)
vfs_bio_clrbuf() zeroes invalid chunks of the buffer pages, leaving the valid chunks alone, for VMIO buffers.  If there are (partially) valid pages in the newly extended denode, this means that, most likely, some other code left the garbage on the vnode vm_object's page queue.
Comment 4 Vladyslav Movchan 2016-10-16 06:25:11 UTC
(In reply to Konstantin Belousov from comment #3)
Thanks for the explanation. I'll get back if I manage to find the true cause of this.
Comment 5 Konstantin Belousov freebsd_committer freebsd_triage 2016-10-16 06:34:53 UTC
(In reply to Vladislav Movchan from comment #4)
You could try to check my hypothesis by catching the garbage buffer _before_ the vfs_bio_clrbuf() call and seeing whether any page forming the buffer has valid bits set.  Look at the vfs_bio_clrbuf() code to see what I mean.
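
A minimal sketch of such a check, assuming it is placed immediately before the vfs_bio_clrbuf(bp) call in extendfile() (a debugging aid only, not a fix):

/* Report any page of the new cluster's buffer that is already valid. */
int i;

for (i = 0; i < bp->b_npages; i++) {
	vm_page_t m = bp->b_pages[i];

	if (m->valid != 0)
		printf("extendfile: new cluster page %d already has "
		    "valid bits 0x%x\n", i, m->valid);
}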
Comment 6 Vladyslav Movchan 2016-10-16 12:15:01 UTC
Created attachment 175818 [details]
Updated patch
Comment 7 Vladyslav Movchan 2016-10-16 12:16:08 UTC
(In reply to Konstantin Belousov from comment #5)
I confirmed your hypothesis: sometimes the buffer pages have valid bits set and sometimes they don't.
The original problem occurs only in the cases where valid bits are set, so vfs_bio_clrbuf() does not zero the random data in the buffer.

I also went through the history and found that originally the buffer data was always zeroed by a call to the clrbuf() macro, and in base r171522 this was changed to vfs_bio_clrbuf() as an optimization.

I suppose what I see is a regression from that optimization, unnoticed since then because extending a directory into another cluster is a rare operation.

I attached another patch which restores the original behaviour just for the "directory grows" case.
With this patch I'm no longer able to reproduce this problem.
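
In terms of the code, the updated patch amounts to the following change in the directory-extension path of extendfile() (a sketch of the change, not the verbatim diff):

Before (since base r171522):
	vfs_bio_clrbuf(bp);	/* zero only still-invalid pages */

After (the updated patch):
	clrbuf(bp);		/* zero the entire buffer, as before r171522 */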
Comment 8 Konstantin Belousov freebsd_committer freebsd_triage 2016-10-17 09:37:59 UTC
(In reply to Vladislav Movchan from comment #7)
I probably was not clear enough.

The problem is that a previous operation left pages on the vnode page queue beyond the directory denode's end.  The damage is not limited to the directory-extension situation; it might have other consequences of varying severity.  E.g., if the issue also happens for regular files, then extending a file might result in garbage, or the content of some previously removed and unrelated file on disk, appearing in the extended portion, readable by the user.

That said, your patch seems to only paper over the real problem.  The right thing to do is to track down the location where the damage happens.  E.g., UFS also had similar problems, where a failed file extension (e.g., for lack of free space) was followed by incomplete cleanup and dangling buffers and pages.
Comment 9 Conrad Meyer freebsd_committer freebsd_triage 2017-01-14 06:06:07 UTC
Removing [patch] as it seems like the root cause still needs identification.
Comment 10 Damjan Jovanovic 2018-05-16 18:07:41 UTC
This can still be reproduced on CURRENT as follows:

dd if=/dev/zero of=64G bs=1G count=64
newfs_msdos ./64G
mdconfig -a -f ./64G
mount -t msdosfs /dev/md0 /mnt/test
mkdir /mnt/test/ports
cp -r /usr/ports/* /mnt/test/ports

At least the following end up with corrupted contents:
ports/devel
ports/games
ports/graphics
ports/security
ports/sysutils
ports/textproc
ports/www
Comment 11 Damjan Jovanovic 2018-05-16 18:19:05 UTC
Konstantin, can you please explain how a "previous operation left the pages on the vnode page queue behind the directory denode end"?

What must code do, between for example bread() and brelse(), to clean up the "buf" properly?
Comment 12 Konstantin Belousov freebsd_committer freebsd_triage 2018-05-16 19:54:22 UTC
(In reply to Damjan Jovanovic from comment #11)
r304232 was a fix for a very similar issue in UFS.
Comment 13 commit-hook freebsd_committer freebsd_triage 2018-05-16 23:30:57 UTC
A commit references this bug:

Author: mckusick
Date: Wed May 16 23:30:04 UTC 2018
New revision: 333693
URL: https://svnweb.freebsd.org/changeset/base/333693

Log:
  Revert change made in base r171522
  (https://svnweb.freebsd.org/base?view=revision&revision=171522)
  converting clrbuf() (which clears the entire buffer) to vfs_bio_clrbuf()
  (which clears only the new pages that have been added to the buffer).

  Failure to properly remove pages from the buffer cache can make
  pages that appear not to need clearing to actually have bad random
  data in them. See for example base r304232
  (https://svnweb.freebsd.org/base?view=revision&revision=304232)
  which noted the need to set B_INVAL and B_NOCACHE as well as clear
  the B_CACHE flag before calling brelse() to release the buffer.

  Rather than trying to find all the incomplete brelse() calls, it
  is simpler, though slightly more expensive, to simply clear the
  entire buffer when it is newly allocated.

  PR: 213507
  Submitted by: Damjan Jovanovic
  Reviewed by:  kib

Changes:
  head/sys/fs/msdosfs/msdosfs_fat.c
Comment 14 Kirk McKusick freebsd_committer freebsd_triage 2018-05-16 23:48:05 UTC
(In reply to Damjan Jovanovic from comment #10)
Can you verify that my patch fixes this bug for you?
Comment 15 Kirk McKusick freebsd_committer freebsd_triage 2018-05-16 23:50:25 UTC
(In reply to Vladislav Movchan from comment #6)
Can you confirm that the fix that I put in (which should be equivalent to the patch that you submitted way too long ago) fixes your problem?
Comment 16 Damjan Jovanovic 2018-05-17 02:34:20 UTC
(In reply to Kirk McKusick from comment #15)

It wasn't me that submitted that patch, it was the original reporter, Vladislav Movchan.
Comment 17 Kirk McKusick freebsd_committer freebsd_triage 2018-05-17 04:43:26 UTC
(In reply to Damjan Jovanovic from comment #16)
Well in comment #10 you said you could reproduce it on CURRENT. So I am wondering if you can still reproduce it on CURRENT after this commit (r333693).
Comment 18 Vladyslav Movchan 2018-05-17 07:42:12 UTC
(In reply to Kirk McKusick from comment #15)

I confirm I'm no longer able to reproduce this problem after base r333693.
Thanks to everyone involved in the investigation and resolution of this problem.
Comment 19 Damjan Jovanovic 2018-05-17 08:29:54 UTC
I am still doing my test.

That isn't the only vfs_bio_clrbuf() in msdosfs. IMHO we really should "find all the incomplete brelse() calls".

How do "incomplete brelse() calls" work? What is the general rule? According to that UFS patch it seems to be: a buf's b_data can only be changed if you subsequently bwrite() it, or brelse() it with b_flags containing B_INVAL | B_RELBUF | B_NOCACHE and not containing B_ASYNC | B_CACHE. If bwrite() fails you must brelse() the buf with those flags instead. Is that right?
Comment 20 Kirk McKusick freebsd_committer freebsd_triage 2018-05-17 15:46:18 UTC
(In reply to Damjan Jovanovic from comment #19)
You are correct in what the code needs to do. The general areas that are likely to be problematic are when a write fails and when a file is deleted. For write failures the UFS strategy is to deallocate and invalidate the block. When a file is deleted, any of its in-memory data needs to be freed so that it does not "reappear" if the file once again is written.
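
To make that concrete, a hedged sketch of the release pattern from comment #19 (modelled on the description of the UFS fix in base r304232, not code taken from msdosfs): when a buffer's new contents must be discarded, e.g. after a failed extension or when a file is removed, release it so that neither the buffer cache nor the vnode keeps the data:

/* Discard the buffer contents instead of leaving them cached. */
bp->b_flags |= B_INVAL | B_RELBUF | B_NOCACHE;
bp->b_flags &= ~(B_ASYNC | B_CACHE);
brelse(bp);

A plain brelse() without these flags can leave the now-stale pages attached to the vnode, which is exactly the kind of leftover data that later shows up as "valid" garbage in a newly extended cluster.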
Comment 21 Damjan Jovanovic 2018-05-17 17:15:43 UTC
(In reply to Kirk McKusick from comment #20)

Thank you for the explanation and your patch :).

I can no longer reproduce the directory corruption in my test.

Let me look through the code and try to find the incomplete brelse() calls.
Comment 22 Kirk McKusick freebsd_committer freebsd_triage 2018-05-17 17:20:11 UTC
This bug has been fixed. A better fix may be worked out in which case this report can be reopened for further discussion. But for now I am closing it.