Bug 268005

Summary: rsync to FAT32 flash drive gets "Freeing unused sector" errors
Product: Base System Reporter: wbe
Component: miscAssignee: Stefan Eßer <se>
Status: Closed FIXED    
Severity: Affects Some People CC: emaste, kib, se, wbe
Priority: --- Keywords: regression
Version: 14.0-RELEASE   
Hardware: amd64   
OS: Any   
Attachments:
Description Flags
Test script that creates a bogus directory entry on a FAT file system
none
Fix for this issue none

Description wbe 2022-11-26 20:55:44 UTC
Summary:

rsync copying to a mounted FAT32 flash drive is getting
"Freeing unused sector" errors.  First seen Sept. 10.

Keywords: msdosfs, rsync, flash drive, "Freeing unused sector"

What fails:

* rsync from HDD to a mounted FAT32 flash drive.

* The error is always kernel: "Freeing unused sector".

Things worked fine under FreeBSD 13.0-RELEASE-p12 for amd64 and (I think)
rsync 3.2.5.

* The system was upgraded to FreeBSD 13.1-RELEASE-p2 and rsync 3.2.6 on 8/31.
  The problem first appeared on 9/10.  As of -p4, it is still occurring.

* It doesn't always happen (see below), but some cases are more likely to
  trigger the problem.

Related things that haven't experienced the problem:

* rsync to flash drives with a UFS file system continues to work fine.  The
  problem has only appeared with flash drives using MSDOS/FAT32.

* cp -pR of gigabytes of data from HDD to the FAT32 flash drive did fine.
  I've only seen the problem arise when using rsync.

More:

* I cp'd the flash drive's 7.1GB of files to HDD, did a complete reformat of the
  drive (not a "quick" format), and cp'd all the saved files back.  That all
  went fine.  I then ran rsync and the error arose again relatively quickly.

* After getting the unused sector error, I ran Win10's "Scan and Repair" on the
  drive.  It says "Your drive has been successfully repaired", indicating that
  the corruption, however it happened, appears to be real.

* Reading from the drive seems to work fine.

Discussion:

At first, since it was happening with an old 8GB flash drive I'd used for years,
I figured maybe it was just wearing out.  So, I got a brand new, fresh out of
its package, 16GB flash drive.  rsync ran for good while but the same error
("Freeing unused sector") happened.  Once again, Win10 "Scan and Repair" said
successfully repaired, indicating there was at least one problem.  [It says no
errors if there were no errors.]

I *don't* believe the new drive is hardware flaky.  I especially don't believe a
brand new flash drive would be the exact same kind of flaky as the old drive.
This is what leads me to think some sort of bug has crept in to either FreeBSD's
msdosfs code or rsync (or one of its libraries).

Log file details:

The first occurrence:  {mildly edited}

  [I plug in the old USB flash drive]
Sep 10 09:31:17 kernel: usb_msc_auto_quirk: UQ_MSC_NO_GETMAXLUN set for USB mass storage device UFD USB Flash Drive (0x{#}:0x{#})
Sep 10 09:31:20 kernel: usb_msc_auto_quirk: UQ_MSC_NO_PREVENT_ALLOW set for USB mass storage device UFD USB Flash Drive (0x{#}:0x{#})
Sep 10 09:31:20 kernel: ugen2.6: <UFD USB Flash Drive> at usbus2
Sep 10 09:31:20 kernel: umass0 on uhub2
Sep 10 09:31:20 kernel: umass0: <UFD USB Flash Drive, class 0/0, rev 2.00/11.00, addr 6> on usbus2
Sep 10 09:31:20 kernel: umass0:  SCSI over Bulk-Only; quirks = 0x8100
Sep 10 09:31:20 kernel: umass0:6:0: Attached to scbus6
Sep 10 09:31:23 kernel: da0 at umass-sim0 bus 0 scbus6 target 0 lun 0
Sep 10 09:31:23 kernel: da0: <UFD USB Flash Drive 1100> Removable Direct Access SCSI device
Sep 10 09:31:23 kernel: da0: Serial Number {its serial #}
Sep 10 09:31:23 kernel: da0: 40.000MB/s transfers
Sep 10 09:31:23 kernel: da0: 7648MB (15663104 512 byte sectors)
Sep 10 09:31:23 kernel: da0: quirks=0x2<NO_6_BYTE>
  [mount_msdosfs is run to mount da0s1 on my mount point.]
  [I run rsync]
Sep 10 09:50:32 kernel: {mymountpoint}: Freeing unused sector 983552 0 0
Sep 10 09:50:32 kernel: /dev/da0s1: remounting read-only due to corruption

Here are error lines from other times (same flash drive).  The sector info changes:

Sep 20 17:32:28 kernel: {mymountpoint}: Freeing unused sector 457900 12 fe000000
  and
Nov 21 00:19:09 kernel: {mymountpoint}: Freeing unused sector 8816 16 b3c07fff
  and
Nov 26 00:56:50 kernel: {mymountpoint}: Freeing unused sector 185786 26 da000000
  This last one occurred following the complete reformat, so there shouldn't
  have been any file system problem at mount time.

Today, the same problem happened with the brand new drive:
Nov 26 08:07:11 kernel: da0: 14976MB (30670848 512 byte sectors)
Nov 26 10:10:18 kernel: {mymountpoint}: Freeing unused sector 957732 4 fffffe00
  and a bit later:
Nov 26 14:18:49 kernel: {mymountpoint}: Freeing unused sector 943666 18 fffbaf7f

What seems to make the problem more likely to occur:

On the old drive, the majority of the files hadn't changed since the previous
rsync, so rsync did lots of reading and comparing, but not that much writing.
The problem occurred.

The first time I ran rsync with the new drive, none of the old files were there,
so rsync was mostly writing some 7GB.  That went fine, just like the 7GB cp to the old drive went fine.  Then, the second time, I'd changed the rsync command
to include 2 directories I'd excluded the first time, and it was while writing
those files that it died with the log entries above.

For now, it looks like rsync to a FAT32/512 flash drive is unreliable.
I hope this can get fixed.
 -WBE
Comment 1 wbe 2023-04-17 23:20:49 UTC
When I saw in the 13.2 Release Notes:

"A workaround has been implemented for an apparent hardware page invalidation problem [...] The bug can lead to file system corruption with UFS and MSDOSFS, and probably other memory corruption."

I was hopeful, even though the description wasn't a match.  No such luck.

This bug (268005) still exists in 13.2-RELEASE on an AMD Opteron processor.  UFS flash drives still do fine.  rsync to an MSDOSFS flash drive where most of the files don't need updating (and thus reads and writes are well mixed, probably mostly reads) get corrupted.  (FWIW, MS Windows 10 has no trouble detecting and repairing the corruption, so, yes, it is actually corrupt, not a bug where the system mistakenly thinks it's corrupt.)
 -WBE
Comment 2 Ed Maste freebsd_committer freebsd_triage 2023-04-18 13:32:32 UTC
The error message you report was added for 13.1, after 13.0. It was previously a (compiled-out) assertion and changed into an error, so it is possible that the same problem occurred in 13.0, but silently.
  
https://cgit.freebsd.org/src/commit/?id=472f5760644320e59abc2b5362e84635dc650daf
Comment 3 wbe 2023-04-18 14:52:13 UTC
OK, reporting a problem is certainly better than ignoring it, and that explains why it appeared in 13.1, but that leaves the question of why it's happening at all and fixing that.

As a user, what I see is that rsync to an msdosfs flash drive has appeared to work for years (FreeBSD 9 - 13.0), even repeatedly to the same drive, mixed with mounting it on a Windows machine and adding and deleting files there with no problems evident, so if there was corruption, it didn't cause any apparent problem.  Yes, that's not proof.  :)  Now that the issue has been turned into an error, I get "remounting read-only due to corruption" when the error occurs and the rsync fails rather dramatically.

If corruption really has been happening for years, maybe I've just been lucky.

When this problem arose, I wrote a script that does "find -newer" and "cp -p" to the flash drive: that doesn't trigger this problem, even now.  rsync to a freshly formatted drive (almost all writes) generally works, too.  It's rsync when only some files out of many have changed that pretty consistently dies.

I've only seen this with MSDOSFS flash drives.  The same size and type of flash drive with UFS works fine.  I have no data regarding non-flash MSDOSFS drives.

While I've been using Win10 for repairs, I could try fsck(_msdosfs) if that would be more informative and helpful.
 -WBE
Comment 4 wbe 2023-12-18 17:48:10 UTC
This problem is still happening in 14.0-RELEASE-p3.

Like before, when rsync is primarily writing, the error rarely happens, but when few files have changed, so rsync is mostly reading source and destination and comparing with not much writing, this error is likely.  (Same ~7.2GB over ~80k inodes.)

Any changes made to the msdosfs code between (GENERIC) 13.0-RELEASE-p12 and 13.1-RELEASE-p2 would be worth reviewing.
Comment 5 Stefan Eßer freebsd_committer freebsd_triage 2024-02-05 20:49:37 UTC
I have been able to reproduce the issue with alloc/free tracing added to msdosfs:

usemap_free: cluster=185185
usemap_alloc: cluster=361562
usemap_free: cluster=185186
usemap_alloc: cluster=361563
/mnt/FAT32: Freeing unused sector 185186 2 fffffff8
/dev/md0: remounting read-only due to corruption
remounted /dev/md0 read-only

In this test run block 185186 has been freed twice in a row.

I'll ltry to identify the code path that causes usemap_free() to be called twice for a single vnode operation.
Comment 6 Stefan Eßer freebsd_committer freebsd_triage 2024-02-05 21:27:43 UTC
The reason for the file system corruption is the way rsync copies the files to the destination directory. In order to preserve an existing file under the same name, new files are created under a temporary name, then renamed to the destination file name.

Specifically, a file named "TODO" is written as ".TODO.qvleqq", then renamed to "TODO", replacing the existing file of that same name. This does not succeed on MSDOSFS - I'll try to create a simple test case to reproduce this issue without rsync.
Comment 7 Stefan Eßer freebsd_committer freebsd_triage 2024-02-06 12:23:06 UTC
There appears to be an error in the msdosfs long file name lookup, which may lead to "A.B" sometimes matching the directory node of ".A.B".

The rsync command uses temporary file names that happen to match that pattern, unless a tempdir is specified. The destination file is named like ".A.B" and when it has been written, "A.B" will be removed and ".A.B" renamed to "A.B". Due to the error, ".A.B" may be gone at that time, since it has been deleted instead if "A.B".

Therefore, "rsync -t $TMPDIR" will not trigger this bug and might be a work-around until the issue has been fixed.

Another bug seems to exist when a file that does not match the criteria of a 8.3 DOS file name is renamed to a 8.3 compatible name (e.g. ".TODO" to "TODO"). In that case a cluster will be freed twice, leading to the file system being remounted read-only (which was the issue that lead to this PR being created).
Comment 8 wbe 2024-02-06 13:21:53 UTC
I think you meant rsync -T dir (--temp-dir=dir).
[Lower case -t is "preserve modification times".]

If rsync is using invalid names for its temp files, I'm surprised FreeBSD 13.0 and earlier allowed it and it worked.  What if the file doesn't get renamed later?  Then again, if neither FreeBSD nor MS Windows has a problem with such file names being present, maybe it doesn't matter.
Comment 9 Stefan Eßer freebsd_committer freebsd_triage 2024-02-06 20:52:27 UTC
(In reply to wbe from comment #8)
Yes, I meant "-T dir" not "-t dir".

My simplest test case so far is:

$ rsync -r /usr/src/lib/libsysdecode /mnt/TESTDIR
$ rsync -r /usr/src/lib/libsysdecode /mnt/TESTDIR

Sometimes the failure occurs only after a 3rd rsync run:

$ rsync -r /usr/src/lib/libsysdecode /mnt/TESTDIR

When corruption occurs, the FAT is damaged, but apparently only in 1 of the 2 redundant instances.

I did not have time to work on this issue today, but hope to have time tomorrow.

You mention that it did work in FreeBSD-13.0: did the issue occur first after 2024-01-18?

On that date, Konstantin Belousov (kib@FreeBSD.org) has added the error message and the remounting of the damaged file system in read-only mode in case of some "cannot happen" cases. Before that change, the problem was detected but only caused a silent abort of the operation instead of reporting it.
Comment 10 wbe 2024-02-06 21:43:26 UTC
> did the issue occur first after 2024-01-18?

I first saw this (both the error message and remounting read-only) Sep. 10, 2022.  Comment #2 says the error message (and (I think) the test for the error condition) was added in 13.1 and suggests the problem may have been happening before but silently ignored.  Since I wasn't seeing any problems before 13.1, I never ran fsck (or equivalent) on it, so I don't know whether such errors were happening silently.  I can't prove they weren't.
Comment 11 Stefan Eßer freebsd_committer freebsd_triage 2024-02-07 19:11:26 UTC
(In reply to wbe from comment #10)

The commit mentioned in comment #2 added a message, but did not remount the file system read-only. I had used "git annotate" to 

I cannot easily go back to 13.1 or 13.2, since my ZPOOLs have been updated with features not available back then. (I could and perhaps will install 13.2 in a bhyve jail.)

My attempt to bisect the commits between 13.2 and 14-STABLE was complicated by many changes in namei.h and other common header files. I had tried to build the earlier versions of the msdosfs loadable module for the current kernel, but there have been many changes in header files like sys/namei.h, which prevented loading a module build from an old source tree into a current kernel, but for the same reason building the old module version on a current source did also fail. I have tried to patch the sources for ever older msdosfs module versions, but besides being ever mode effort for each earlier commit, I'm not testing the exact code that has been said to have worked in those earlier releases.

There seem to be two ways forward: 1) testing with additional debug traces or 2) bisecting in a bhyve jail. I'll consider both approaches.
Comment 12 wbe 2024-02-07 20:24:30 UTC
My thoughts:

1) If you can still get a copy of the last 12.*, you could try that.  Anything 13.0 and earlier never did the "remounting read-only" (for me).  [However, the last 12.* I used was 12.2-RELEASE-p4, so it's possible the code was changed later.]

2) Since I can't prove there wasn't silent corruption with the older code, rather than trying to reconstruct the old code, if it were me, I'd first find what's causing the multiple free in the current code, since that should never happen.
Comment 13 Stefan Eßer freebsd_committer freebsd_triage 2024-02-10 09:33:43 UTC
Created attachment 248301 [details]
Test script that creates a bogus directory entry on a FAT file system

The script contains a commented out rm command, which when enabled makes the script succeed.
Comment 14 Stefan Eßer freebsd_committer freebsd_triage 2024-02-10 09:34:01 UTC
(In reply to wbe from comment #12)
i think that I fully understand the issue, but have not identified the exact source code line that is causing it.

I can reproduce the corrupt directory entry (that later leads to a FAT block being freed twice) with a simple shell script, which creates and renames files in the same way as rsync does.

If I remove the destination file before renaming, no error occurs. Therefore, I know that the cause is a missing case in the rename code, which leaves an extra directory entry behind. That directory entry is partially overwritten (the long file name fragments don't match the 8.3 DOS name), but it should just be marked as deleted.

This case is only triggered if e.g. the length of the source and destination file names differ enough to require a different number of directory entries to store the long file name fragments.

I'll attach my test script to this PR, a regression test could be based on it (by removing the trace and dump commands).
Comment 15 Stefan Eßer freebsd_committer freebsd_triage 2024-02-12 20:25:55 UTC
Just an update:

I have identified an issue in the denode cache, which can return a false match in specific situations. This causes an existing directory entry to be corrupted instead of a new directory entry being added. The result is a bogus directory entry with mismatched DOS 8.3 vs. Win95 long file names.

I'm working on a fix ...
Comment 16 Stefan Eßer freebsd_committer freebsd_triage 2024-02-17 18:59:48 UTC
Created attachment 248544 [details]
Fix for this issue

This fix has been uploaded to phabricator as review D43951.
Comment 17 commit-hook freebsd_committer freebsd_triage 2024-02-17 21:11:46 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=8b67c670a49b4efe7e1557121b5bbae682ea3bc7

commit 8b67c670a49b4efe7e1557121b5bbae682ea3bc7
Author:     Stefan Eßer <se@FreeBSD.org>
AuthorDate: 2024-02-17 21:04:49 +0000
Commit:     Stefan Eßer <se@FreeBSD.org>
CommitDate: 2024-02-17 21:04:49 +0000

    msdosfs: fix directory corruption after rename operation

    The is a bug in MSDOSFS that can be triggered when the target of a
    rename operation exists. It is caused by the lack of inodes in the
    FAT file system, which are substituted by the location of the DOS 8.3
    directory entry in the file system. This causes the "inode" of a file
    to change when its directory entry is moved to a different location.

    The rename operation wants to re-use the existing directory entry
    position of an existing target file name (POS1). But the code does
    instead locate the first position in the directory that provides
    sufficient free directory slots (POS2) to hold the target file name
    and fills it with the directory data.

    The rename operation continues and at the end writes directory data to
    the initially retrieved location (POS1) of the old target directory.
    This leads to 2 directory entries for the target file, but with
    inconsistent data in the directory and in the cached file system
    state.

    The location that should have been re-used (POS1) is marked as deleted
    in the directory, and new directory data has been written to a
    different location (POS2). But the VFS cache has the newly written
    data stored under the inode number that corresponds to the initially
    planned position (POS1).

    If then a new file is written, it can allocate the deleted directory
    entries (POS1) and when it queries the cache, it retrieves data that
    is valid for the target of the prior rename operation, leading to a
    corrupt directory entry (at POS1) being written (DOS file name of the
    earlier rename target combined with the Windows long file name of the
    newly written file).

    PR:             268005
    Reported by:    wbe@psr.com
    Approved by:    kib, mckusick
    Fixes:          2c9cbc2d45b94
    MFC after:      3 days
    Differential Revision:  https://reviews.freebsd.org/D43951

 sys/fs/msdosfs/msdosfs_vnops.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
Comment 18 commit-hook freebsd_committer freebsd_triage 2024-02-21 13:32:13 UTC
A commit in branch releng/13.3 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=00c958f12c3da158fe79c6976337eee870714fa3

commit 00c958f12c3da158fe79c6976337eee870714fa3
Author:     Stefan Eßer <se@FreeBSD.org>
AuthorDate: 2024-02-17 21:04:49 +0000
Commit:     Stefan Eßer <se@FreeBSD.org>
CommitDate: 2024-02-21 13:31:26 +0000

    msdosfs: fix directory corruption after rename operation

    PR:             268005
    Reviewed by:    mckusick
    Approved by:    re (cperciva)

    (cherry picked from commit 8b67c670a49b4efe7e1557121b5bbae682ea3bc7)
    (cherry picked from commit 5f4753501e9cc59e0d15f849c9459478ea5c8dc7)

 sys/fs/msdosfs/msdosfs_vnops.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
Comment 19 wbe 2024-07-01 13:47:04 UTC
Woohoo! It works (14.1)!  :-)  Thanks!
[Just felt like saying thanks.]