Created attachment 175786 [details]
This patch fixes the problem for me

As the number of directory entries grows, a directory can reach a size where it requires one more cluster. This condition is checked at the beginning of createde() in msdosfs_lookup.c; the actual directory extension is performed by extendfile() in msdosfs_fat.c.

I found that after extendfile() completes, the newly allocated cluster contains garbage-like data, which, once the "grow by one more cluster" operation finishes, may be interpreted as phantom/random directory entries.

This bug is sometimes hard to reproduce intentionally, since the data in the newly allocated cluster often looks the same no matter how many times you retry, and if that data is mostly zeros it produces no noticeable effect on the filesystem. Here is the output from a run where I got lucky and triggered the problem (shell is bash):

# newfs_msdos -F 32 /dev/gpt/sg1-disk-f
/dev/gpt/sg1-disk-f: 315571904 sectors in 4930811 FAT32 clusters (32768 bytes/cluster)
BytesPerSec=512 SecPerClust=64 ResSectors=32 FATs=2 Media=0xf0 SecPerTrack=63 Heads=16 HiddenSecs=0 HugeSectors=315649016 FATsecs=38522 RootCluster=2 FSInfo=1 Backup=2
# mount_msdosfs -o large /dev/gpt/sg1-disk-f /mnt/f
# mkdir /mnt/f/test
# cd /mnt/f/test
# for i in `seq 1 150`; do head -c $((20000*1000+1)) /dev/random > `echo {0..9}{a..k} | tr -d ' '`$i; done
# ls -laR /mnt/f/test > /dev/null
ls: .гJ(: No such file or directory
ls: .гM(: No such file or directory
ls: .гP(: No such file or directory
ls: .гS(: No such file or directory
ls: .гV(: No such file or directory
ls: .гY(: No such file or directory
ls: .А?.: Invalid argument
ls: .АB(: No such file or directory
ls: 0?.?????: Invalid argument
ls: 0t,?????.ПМ+: Invalid argument
ls: 0w,?????.?t,: Invalid argument
ls: 0|#?????.0|#: Invalid argument
ls: 0|#?????.0|#: Invalid argument
ls: 0|#?????.0|#: Invalid argument
...

In most cases there is no way other than newfs_msdos to get rid of such "random" directory entries, so don't run this on a filesystem you care about.

After I changed the code to intentionally bzero() the newly allocated cluster data (for the directory case only), I am no longer able to reproduce this problem. Patch attached.

System details:

$ uname -a
FreeBSD morgenstern 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r306964: Mon Oct 10 18:17:21 EEST 2016     root@morgenstern:/usr/obj/usr/src/sys/Mephistopheles  amd64

$ diskinfo -v /dev/gpt/sg1-disk-f
/dev/gpt/sg1-disk-f
	512         	# sectorsize
	161612296192	# mediasize in bytes (151G)
	315649016   	# mediasize in sectors
	0           	# stripesize
	539533312   	# stripeoffset
	313143      	# Cylinders according to firmware.
	16          	# Heads according to firmware.
	63          	# Sectors according to firmware.
	9WM0NGVRs0  	# Disk ident.

$ gpart show -l /dev/ada3
=>        40  3907026976  ada3  GPT  (1.8T)
          40          23        - free -  (12K)
          63   105910497     1  sg1-disk-e  (51G)
   105910560    14680080        - free -  (7.0G)
   120590640  1572864000     2  sg1-tank-disk3  (750G)
  1693454640          39        - free -  (20K)
  1693454679  1897923321     4  sg1-disk-i  (905G)
  3591378000   315649016     5  sg1-disk-f  (151G)
Right after both branches of the if() statement, one of which you patched, there is a vfs_bio_clrbuf() call to clear the buffer. Because of this, I highly doubt that the patch changes anything, i.e. the bug must be elsewhere.
(In reply to Konstantin Belousov from comment #1)
While debugging I checked the content of the newly allocated cluster right after the vfs_bio_clrbuf() call, and even after vfs_bio_clrbuf() has run it is possible to see junk/random data in bp->b_data. I just rechecked it once again to make sure: sometimes the newly allocated cluster contains only zeroes, but sometimes it contains random data.
(In reply to Vladislav Movchan from comment #2)
For VMIO buffers, vfs_bio_clrbuf() zeroes the invalid chunks of the buffer pages and leaves the valid chunks alone. If there are (partially) valid pages in the newly extended denode, this means that, most likely, some other code left the garbage on the vnode vm_object's page queue.
(In reply to Konstantin Belousov from comment #3)
Thanks for the explanation. I'll report back if I manage to find the true cause of this.
(In reply to Vladislav Movchan from comment #4)
You could try to check my hypothesis by catching the garbage buffer _before_ the vfs_bio_clrbuf() call and seeing whether any page forming the buffer has valid bits set. Look at the vfs_bio_clrbuf() code to see what I mean.
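A minimal sketch of such a check, assuming it is dropped into extendfile() in msdosfs_fat.c just before the vfs_bio_clrbuf() call (purely illustrative debug code, not part of any patch here; "bp" is the buffer returned by getblk() for the new cluster):

    int i;

    for (i = 0; i < bp->b_npages; i++) {
            vm_page_t m = bp->b_pages[i];

            /* vfs_bio_clrbuf() will skip zeroing the valid parts of m. */
            if (m->valid != 0)
                    printf("extendfile: new cluster page %d already valid (0x%jx)\n",
                        i, (uintmax_t)m->valid);
    }

If the printf fires for a freshly allocated cluster, that would confirm stale pages with valid bits are being found again, which matches the corruption shown in the original report.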
Created attachment 175818 [details]
Updated patch
(In reply to Konstantin Belousov from comment #5)
I confirmed your hypothesis: sometimes the buffer pages have valid bits set and sometimes they don't. The original problem occurs only when valid bits are set, so vfs_bio_clrbuf() does not zero the random data in the buffer.

I also went through the history and found that originally the buffer data was always zeroed by a call to the clrbuf() macro, and in base r171522 this was changed to vfs_bio_clrbuf() as an optimization. I suppose what I am seeing is a regression introduced by that optimization; it went unnoticed since then because "directory extension to another cluster" is a rare operation.

I attached another patch which restores the original behaviour just for the "directory grows" case. With this patch I am no longer able to reproduce the problem.
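The attachment is not inlined here, but from the description it presumably amounts to something like the following in extendfile() (a hedged sketch only; the condition and exact placement are assumptions, the attached patch is authoritative):

    /*
     * Keep the base r171522 optimization for regular files, but fully
     * clear the newly allocated cluster when extending a directory.
     */
    if (dep->de_Attributes & ATTR_DIRECTORY)
            clrbuf(bp);             /* zero the whole new cluster */
    else
            vfs_bio_clrbuf(bp);     /* zero only the invalid page ranges */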
(In reply to Vladislav Movchan from comment #7)
I probably was not clear enough. The problem is that a previous operation left pages on the vnode page queue beyond the directory denode's end. The damage is not limited to the situation of extending a directory; it might have other consequences of varying severity. E.g., if the issue also happens for regular files, then extending a file might cause garbage, or the content of some previously removed unrelated file on disk, to appear in the extended portion, readable by the user.

That said, your patch seems to only paper over the real problem. The right thing to do is to track down the location where the damage happens. E.g., UFS had similar problems, where a failed file extension (e.g., for lack of free space) was followed by an incomplete cleanup and dangling buffers and pages.
Removing [patch] as it seems like the root cause still needs identification.
This can still be reproduced on CURRENT as follows:

dd if=/dev/zero of=64G bs=1G count=64
newfs_msdos ./64G
mdconfig -a -f ./64G
mount -t msdosfs /dev/md0 /mnt/test
mkdir /mnt/test/ports
cp -r /usr/ports/* /mnt/test/ports

At least the following end up with corrupted contents:

ports/devel
ports/games
ports/graphics
ports/security
ports/sysutils
ports/textproc
ports/www
Konstantin, can you please explain how a "previous operation left the pages on the vnode page queue behind the directory denode end"? What must code do, for example between bread() and brelse(), to clean up the "buf" properly?
(In reply to Damjan Jovanovic from comment #11)
r304232 was a fix for a very similar issue on UFS.
A commit references this bug:

Author: mckusick
Date: Wed May 16 23:30:04 UTC 2018
New revision: 333693
URL: https://svnweb.freebsd.org/changeset/base/333693

Log:
  Revert the change made in base r171522
  (https://svnweb.freebsd.org/base?view=revision&revision=171522) converting
  clrbuf() (which clears the entire buffer) to vfs_bio_clrbuf() (which clears
  only the new pages that have been added to the buffer).

  Failure to properly remove pages from the buffer cache can make pages that
  appear not to need clearing actually contain bad random data. See for
  example base r304232
  (https://svnweb.freebsd.org/base?view=revision&revision=304232) which noted
  the need to set B_INVAL and B_NOCACHE as well as clear the B_CACHE flag
  before calling brelse() to release the buffer.

  Rather than trying to find all the incomplete brelse() calls, it is simpler,
  though slightly more expensive, to simply clear the entire buffer when it is
  newly allocated.

  PR:		213507
  Submitted by:	Damjan Jovanovic
  Reviewed by:	kib

Changes:
  head/sys/fs/msdosfs/msdosfs_fat.c
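Per the log, the committed change (unlike the directory-only patch sketched earlier) goes back to clearing the new buffer unconditionally. Roughly, the call site in extendfile() now reads (hedged sketch; the real diff is at the changeset URL above):

    /* Previously vfs_bio_clrbuf(bp), which skips already-valid page ranges. */
    clrbuf(bp);         /* zero the entire newly allocated buffer */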
(In reply to Damjan Jovanovic from comment #10)
Can you verify that my patch fixes this bug for you?
(In reply to Vladislav Movchan from comment #6)
Can you confirm that the fix that I put in (which should be equivalent to the patch that you submitted way too long ago) fixes your problem?
(In reply to Kirk McKusick from comment #15)
It wasn't me who submitted that patch; it was the original reporter, Vladislav Movchan.
(In reply to Damjan Jovanovic from comment #16)
Well, in comment #10 you said you could reproduce it on CURRENT, so I am wondering whether you can still reproduce it on CURRENT after this commit (r333693).
(In reply to Kirk McKusick from comment #15)
I confirm that I am no longer able to reproduce this problem after base r333693. Thanks to everyone involved in the investigation and resolution of this problem.
I am still running my test. That isn't the only vfs_bio_clrbuf() in msdosfs; IMHO we really should "find all the incomplete brelse() calls".

How do "incomplete brelse() calls" work? What is the general rule? According to that UFS patch it seems to be: a buf's b_data may only be changed if you subsequently bwrite() it, or brelse() it with b_flags containing B_INVAL | B_RELBUF | B_NOCACHE and not containing B_ASYNC | B_CACHE. If bwrite() fails, you must brelse() the buf with those flags instead. Is that right?
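A minimal sketch of the release pattern implied by that rule, for the case where a buffer was modified but its contents must not survive (e.g. an error path after a failed extension); this just restates the flags listed above rather than quoting any particular commit:

    /*
     * Invalidate-and-release: make sure the buffer and its pages cannot
     * be found again later with stale "valid" contents.
     */
    bp->b_flags |= B_INVAL | B_RELBUF | B_NOCACHE;
    bp->b_flags &= ~(B_ASYNC | B_CACHE);
    brelse(bp);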
(In reply to Damjan Jovanovic from comment #19)
You are correct about what the code needs to do. The areas most likely to be problematic are when a write fails and when a file is deleted. For write failures the UFS strategy is to deallocate and invalidate the block. When a file is deleted, any of its in-memory data needs to be freed so that it does not "reappear" if the file is written again.
(In reply to Kirk McKusick from comment #20)
Thank you for the explanation and your patch :). I can no longer reproduce the directory corruption in my test. Let me look through the code and try to find the incomplete brelse() calls.
This bug has been fixed. A better fix may be worked out, in which case this report can be reopened for further discussion; but for now I am closing it.