Summary: | rsync to FAT32 flash drive gets "Freeing unused sector" errors | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | wbe | ||||||
Component: | misc | Assignee: | Stefan Eßer <se> | ||||||
Status: | Closed FIXED | ||||||||
Severity: | Affects Some People | CC: | emaste, kib, se, wbe | ||||||
Priority: | --- | Keywords: | regression | ||||||
Version: | 14.0-RELEASE | ||||||||
Hardware: | amd64 | ||||||||
OS: | Any | ||||||||
Attachments: |
|
Description
wbe
2022-11-26 20:55:44 UTC
When I saw in the 13.2 Release Notes: "A workaround has been implemented for an apparent hardware page invalidation problem [...] The bug can lead to file system corruption with UFS and MSDOSFS, and probably other memory corruption." I was hopeful, even though the description wasn't a match. No such luck. This bug (268005) still exists in 13.2-RELEASE on an AMD Opteron processor. UFS flash drives still do fine. rsync to an MSDOSFS flash drive where most of the files don't need updating (and thus reads and writes are well mixed, probably mostly reads) get corrupted. (FWIW, MS Windows 10 has no trouble detecting and repairing the corruption, so, yes, it is actually corrupt, not a bug where the system mistakenly thinks it's corrupt.) -WBE The error message you report was added for 13.1, after 13.0. It was previously a (compiled-out) assertion and changed into an error, so it is possible that the same problem occurred in 13.0, but silently. https://cgit.freebsd.org/src/commit/?id=472f5760644320e59abc2b5362e84635dc650daf OK, reporting a problem is certainly better than ignoring it, and that explains why it appeared in 13.1, but that leaves the question of why it's happening at all and fixing that. As a user, what I see is that rsync to an msdosfs flash drive has appeared to work for years (FreeBSD 9 - 13.0), even repeatedly to the same drive, mixed with mounting it on a Windows machine and adding and deleting files there with no problems evident, so if there was corruption, it didn't cause any apparent problem. Yes, that's not proof. :) Now that the issue has been turned into an error, I get "remounting read-only due to corruption" when the error occurs and the rsync fails rather dramatically. If corruption really has been happening for years, maybe I've just been lucky. When this problem arose, I wrote a script that does "find -newer" and "cp -p" to the flash drive: that doesn't trigger this problem, even now. rsync to a freshly formatted drive (almost all writes) generally works, too. It's rsync when only some files out of many have changed that pretty consistently dies. I've only seen this with MSDOSFS flash drives. The same size and type of flash drive with UFS works fine. I have no data regarding non-flash MSDOSFS drives. While I've been using Win10 for repairs, I could try fsck(_msdosfs) if that would be more informative and helpful. -WBE This problem is still happening in 14.0-RELEASE-p3. Like before, when rsync is primarily writing, the error rarely happens, but when few files have changed, so rsync is mostly reading source and destination and comparing with not much writing, this error is likely. (Same ~7.2GB over ~80k inodes.) Any changes made to the msdosfs code between (GENERIC) 13.0-RELEASE-p12 and 13.1-RELEASE-p2 would be worth reviewing. I have been able to reproduce the issue with alloc/free tracing added to msdosfs: usemap_free: cluster=185185 usemap_alloc: cluster=361562 usemap_free: cluster=185186 usemap_alloc: cluster=361563 /mnt/FAT32: Freeing unused sector 185186 2 fffffff8 /dev/md0: remounting read-only due to corruption remounted /dev/md0 read-only In this test run block 185186 has been freed twice in a row. I'll ltry to identify the code path that causes usemap_free() to be called twice for a single vnode operation. The reason for the file system corruption is the way rsync copies the files to the destination directory. In order to preserve an existing file under the same name, new files are created under a temporary name, then renamed to the destination file name. Specifically, a file named "TODO" is written as ".TODO.qvleqq", then renamed to "TODO", replacing the existing file of that same name. This does not succeed on MSDOSFS - I'll try to create a simple test case to reproduce this issue without rsync. There appears to be an error in the msdosfs long file name lookup, which may lead to "A.B" sometimes matching the directory node of ".A.B". The rsync command uses temporary file names that happen to match that pattern, unless a tempdir is specified. The destination file is named like ".A.B" and when it has been written, "A.B" will be removed and ".A.B" renamed to "A.B". Due to the error, ".A.B" may be gone at that time, since it has been deleted instead if "A.B". Therefore, "rsync -t $TMPDIR" will not trigger this bug and might be a work-around until the issue has been fixed. Another bug seems to exist when a file that does not match the criteria of a 8.3 DOS file name is renamed to a 8.3 compatible name (e.g. ".TODO" to "TODO"). In that case a cluster will be freed twice, leading to the file system being remounted read-only (which was the issue that lead to this PR being created). I think you meant rsync -T dir (--temp-dir=dir). [Lower case -t is "preserve modification times".] If rsync is using invalid names for its temp files, I'm surprised FreeBSD 13.0 and earlier allowed it and it worked. What if the file doesn't get renamed later? Then again, if neither FreeBSD nor MS Windows has a problem with such file names being present, maybe it doesn't matter. (In reply to wbe from comment #8) Yes, I meant "-T dir" not "-t dir". My simplest test case so far is: $ rsync -r /usr/src/lib/libsysdecode /mnt/TESTDIR $ rsync -r /usr/src/lib/libsysdecode /mnt/TESTDIR Sometimes the failure occurs only after a 3rd rsync run: $ rsync -r /usr/src/lib/libsysdecode /mnt/TESTDIR When corruption occurs, the FAT is damaged, but apparently only in 1 of the 2 redundant instances. I did not have time to work on this issue today, but hope to have time tomorrow. You mention that it did work in FreeBSD-13.0: did the issue occur first after 2024-01-18? On that date, Konstantin Belousov (kib@FreeBSD.org) has added the error message and the remounting of the damaged file system in read-only mode in case of some "cannot happen" cases. Before that change, the problem was detected but only caused a silent abort of the operation instead of reporting it. > did the issue occur first after 2024-01-18? I first saw this (both the error message and remounting read-only) Sep. 10, 2022. Comment #2 says the error message (and (I think) the test for the error condition) was added in 13.1 and suggests the problem may have been happening before but silently ignored. Since I wasn't seeing any problems before 13.1, I never ran fsck (or equivalent) on it, so I don't know whether such errors were happening silently. I can't prove they weren't. (In reply to wbe from comment #10) The commit mentioned in comment #2 added a message, but did not remount the file system read-only. I had used "git annotate" to I cannot easily go back to 13.1 or 13.2, since my ZPOOLs have been updated with features not available back then. (I could and perhaps will install 13.2 in a bhyve jail.) My attempt to bisect the commits between 13.2 and 14-STABLE was complicated by many changes in namei.h and other common header files. I had tried to build the earlier versions of the msdosfs loadable module for the current kernel, but there have been many changes in header files like sys/namei.h, which prevented loading a module build from an old source tree into a current kernel, but for the same reason building the old module version on a current source did also fail. I have tried to patch the sources for ever older msdosfs module versions, but besides being ever mode effort for each earlier commit, I'm not testing the exact code that has been said to have worked in those earlier releases. There seem to be two ways forward: 1) testing with additional debug traces or 2) bisecting in a bhyve jail. I'll consider both approaches. My thoughts: 1) If you can still get a copy of the last 12.*, you could try that. Anything 13.0 and earlier never did the "remounting read-only" (for me). [However, the last 12.* I used was 12.2-RELEASE-p4, so it's possible the code was changed later.] 2) Since I can't prove there wasn't silent corruption with the older code, rather than trying to reconstruct the old code, if it were me, I'd first find what's causing the multiple free in the current code, since that should never happen. Created attachment 248301 [details]
Test script that creates a bogus directory entry on a FAT file system
The script contains a commented out rm command, which when enabled makes the script succeed.
(In reply to wbe from comment #12) i think that I fully understand the issue, but have not identified the exact source code line that is causing it. I can reproduce the corrupt directory entry (that later leads to a FAT block being freed twice) with a simple shell script, which creates and renames files in the same way as rsync does. If I remove the destination file before renaming, no error occurs. Therefore, I know that the cause is a missing case in the rename code, which leaves an extra directory entry behind. That directory entry is partially overwritten (the long file name fragments don't match the 8.3 DOS name), but it should just be marked as deleted. This case is only triggered if e.g. the length of the source and destination file names differ enough to require a different number of directory entries to store the long file name fragments. I'll attach my test script to this PR, a regression test could be based on it (by removing the trace and dump commands). Just an update: I have identified an issue in the denode cache, which can return a false match in specific situations. This causes an existing directory entry to be corrupted instead of a new directory entry being added. The result is a bogus directory entry with mismatched DOS 8.3 vs. Win95 long file names. I'm working on a fix ... Created attachment 248544 [details] Fix for this issue This fix has been uploaded to phabricator as review D43951. A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=8b67c670a49b4efe7e1557121b5bbae682ea3bc7 commit 8b67c670a49b4efe7e1557121b5bbae682ea3bc7 Author: Stefan Eßer <se@FreeBSD.org> AuthorDate: 2024-02-17 21:04:49 +0000 Commit: Stefan Eßer <se@FreeBSD.org> CommitDate: 2024-02-17 21:04:49 +0000 msdosfs: fix directory corruption after rename operation The is a bug in MSDOSFS that can be triggered when the target of a rename operation exists. It is caused by the lack of inodes in the FAT file system, which are substituted by the location of the DOS 8.3 directory entry in the file system. This causes the "inode" of a file to change when its directory entry is moved to a different location. The rename operation wants to re-use the existing directory entry position of an existing target file name (POS1). But the code does instead locate the first position in the directory that provides sufficient free directory slots (POS2) to hold the target file name and fills it with the directory data. The rename operation continues and at the end writes directory data to the initially retrieved location (POS1) of the old target directory. This leads to 2 directory entries for the target file, but with inconsistent data in the directory and in the cached file system state. The location that should have been re-used (POS1) is marked as deleted in the directory, and new directory data has been written to a different location (POS2). But the VFS cache has the newly written data stored under the inode number that corresponds to the initially planned position (POS1). If then a new file is written, it can allocate the deleted directory entries (POS1) and when it queries the cache, it retrieves data that is valid for the target of the prior rename operation, leading to a corrupt directory entry (at POS1) being written (DOS file name of the earlier rename target combined with the Windows long file name of the newly written file). PR: 268005 Reported by: wbe@psr.com Approved by: kib, mckusick Fixes: 2c9cbc2d45b94 MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D43951 sys/fs/msdosfs/msdosfs_vnops.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) A commit in branch releng/13.3 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=00c958f12c3da158fe79c6976337eee870714fa3 commit 00c958f12c3da158fe79c6976337eee870714fa3 Author: Stefan Eßer <se@FreeBSD.org> AuthorDate: 2024-02-17 21:04:49 +0000 Commit: Stefan Eßer <se@FreeBSD.org> CommitDate: 2024-02-21 13:31:26 +0000 msdosfs: fix directory corruption after rename operation PR: 268005 Reviewed by: mckusick Approved by: re (cperciva) (cherry picked from commit 8b67c670a49b4efe7e1557121b5bbae682ea3bc7) (cherry picked from commit 5f4753501e9cc59e0d15f849c9459478ea5c8dc7) sys/fs/msdosfs/msdosfs_vnops.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Woohoo! It works (14.1)! :-) Thanks! [Just felt like saying thanks.] |