Bug 106107 - UFS: fsck_ffs: left-over fsck_snapshot after unfinished background fsck if filesystem clean
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 6.1-RELEASE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-12-01 03:20 UTC by Andrej Binder
Modified: 2023-11-12 06:51 UTC
CC List: 4 users

See Also:
grahamperrin: maintainer-feedback? (rew)


Description Andrej Binder 2006-12-01 03:20:10 UTC
Whenever fsck runs in background mode, it creates a file called fsck_snapshot in the .snap directory of the volume being checked. fsck then runs its check on this file instead of the live filesystem. Filesystem snapshots (which fsck_snapshot essentially is) are designed to persist across mounts and reboots, so if fsck does not terminate properly for some reason (hard reboot, etc.), the file is left behind. This is partially solved by the next background fsck run (commonly just after the system reboots, if the filesystem is marked dirty), since fsck overwrites the leftover fsck_snapshot with a new one and removes it when it is done.

The problem occurs when the filesystem is marked clean before the next background fsck run (for example, by running fsck in single-user mode). In that case the fsck_snapshot file persists and can consume most of the filesystem (depending on the state of the filesystem when the snapshot was made).

Fix: 

Implement code (perhaps in the loader, after the filesystem is mounted) to check for leftover fsck snapshots and remove them if appropriate.
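A minimal sketch of the kind of check proposed here, written in Python for brevity (fsck_ffs itself is C). The function name `find_leftover_fsck_snapshot` is hypothetical. The sketch only reports the file and leaves removal to the administrator, because the mere existence of the file does not prove it is stale: a background fsck may still be using it.

```python
import os


def find_leftover_fsck_snapshot(mountpoint):
    """Return (path, allocated_bytes, mtime) for .snap/fsck_snapshot, or None.

    Assumption: the snapshot lives at <mountpoint>/.snap/fsck_snapshot, as
    described in this report. Nothing is removed here.
    """
    path = os.path.join(mountpoint, ".snap", "fsck_snapshot")
    try:
        st = os.lstat(path)
    except (FileNotFoundError, NotADirectoryError):
        return None
    # st_blocks counts 512-byte units actually allocated to the file.
    return path, st.st_blocks * 512, st.st_mtime
```

A caller would run this for each mounted UFS filesystem and, before deleting anything, confirm that no fsck_ffs process is still running against that filesystem.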
How-To-Repeat:
1) Run fsck in background mode.
2) Run halt -qn before fsck finishes (or otherwise terminate it improperly; SIGKILL does not seem to work since fsck is in biord state).
3) Boot into single-user mode.
4) Run fsck to mark the filesystem clean.
5) Reboot into normal mode and watch the file grow with every change on the live filesystem.
Comment 1 Bruce Cran freebsd_committer freebsd_triage 2010-04-03 11:55:14 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).
Comment 2 Eitan Adler freebsd_committer freebsd_triage 2017-12-31 07:58:39 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped
Comment 3 Graham Perrin freebsd_committer freebsd_triage 2021-07-21 17:17:31 UTC
Please, does this bug explain the clean-then-dirty behaviour that's observed in the following transcript? 

<https://lists.freebsd.org/archives/freebsd-current/att-0339/2021-07-16_00.53_typescript.txt>

First observed (and reproducible) whilst working with faulty hardware. 

Reproducible today with a new SSD.
Comment 4 Kirk McKusick freebsd_committer freebsd_triage 2021-07-27 05:31:41 UTC
(In reply to Graham Perrin from comment #3)
Your transcript does not use snapshots, so this bug, which is about snapshots, does not apply to it.

The update to the block counts should not have affected the file type, so it would appear that when the inode block with the updated count and size fields was written to disk, other parts of it were scrambled. This implies some kind of error in writing the inode block to the disk.
Comment 5 Graham Perrin freebsd_committer freebsd_triage 2021-07-27 19:26:01 UTC
(In reply to Kirk McKusick from comment #4)

Thank you. (I wondered whether there might be a shared underlying cause.)

I'll raise a separate bug report.
Comment 6 Graham Perrin freebsd_committer freebsd_triage 2022-12-29 13:29:03 UTC
Anyone, please: are the symptoms in opening comment #0 (2006) likely to be reproducible with any current branch of FreeBSD? 

<https://www.freebsd.org/where/>

I recall relatively recent attention to background fsck in <https://cgit.freebsd.org/src/commit/?h=releng/13.1&id=fb2feceac34cc9c3fb47ba4a7b0ca31637f8fdf0> (a cherry-pick to releng/13.1) …
Comment 7 Kirk McKusick freebsd_committer freebsd_triage 2022-12-29 21:26:47 UTC
This problem was fixed not long after this bug report was filed (though I was unaware of the bug report until now so did not close it). The fix is to open the snapshot and then unlink it immediately after it is created. Since the snapshot will have no references it will be removed when fsck closes the file and/or exits. If the system crashes prior to fsck exiting then the next run of fsck will find the unreferenced snapshot and remove it.
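The open-then-unlink idiom described above can be illustrated with a short Python sketch. This is not the fsck_ffs code (which is C); an ordinary file stands in for the snapshot, and the function name `open_then_unlink` is hypothetical.

```python
import os
import tempfile


def open_then_unlink(path):
    """Open an existing file, then remove its directory entry at once.

    The returned descriptor stays usable; the kernel frees the file's
    storage only when the last open descriptor is closed.
    """
    fd = os.open(path, os.O_RDONLY)
    os.unlink(path)  # the name is gone, the open file is not
    return fd


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for <mountpoint>/.snap/fsck_snapshot.
        path = os.path.join(tmp, "fsck_snapshot")
        with open(path, "wb") as f:
            f.write(b"snapshot data")

        fd = open_then_unlink(path)
        try:
            print(os.path.exists(path))   # is the name still visible?
            print(os.read(fd, 13))        # contents read via the descriptor
        finally:
            os.close(fd)  # last reference dropped; storage is reclaimed
```

The point of the design is that cleanup no longer depends on the program reaching its own cleanup code: once the name is unlinked, closing the descriptor (including the implicit close at process exit) releases the space.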
Comment 8 Mark Millard 2023-02-27 22:11:21 UTC
(In reply to Kirk McKusick from comment #7)

Mostly an FYI that there might still be an oddity possible.

Context:
# uname -apKU
FreeBSD amd64_UFS 14.0-CURRENT FreeBSD 14.0-CURRENT #61 main-n261026-d04c86717c8c-dirty: Sun Feb 19 15:03:52 PST 2023     root@amd64_ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 1400081 1400081

I booted this system and the "df -m" listed 57% Capacity
but my du -xsm /* listed more like a 3rd of the capacity
(totals of the lines). The du result was more like
expected. (Note: This system is normally booted and uses
a distinct ZFS media, so today's activity was the first
UFS media use in some days.)

Turned out there was a /.snap/fsck_snapshot roughly matching
the du -xsm /* total. "ls -Tld" indicated today's date on
the /.snap/fsck_snapshot .

I removed the file after a while (no evidence of it being in
use noticed) and rebooted and it did not return. The removal
resulted in "df -m" reporting 29% Capacity, both before and
after the reboot, more like expected.
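The df/du comparison above can be sketched in Python as follows. One detail worth noting: the shell glob in `du -xsm /*` does not match names beginning with a dot, so /.snap is never visited by that command. The function names and the `include_dot_entries` parameter are hypothetical, and the walk only approximates what du reports.

```python
import os


def df_used_bytes(mountpoint):
    """Bytes in use according to the filesystem's own block counters."""
    st = os.statvfs(mountpoint)
    return (st.f_blocks - st.f_bfree) * st.f_frsize


def du_used_bytes(root, include_dot_entries=True):
    """Bytes allocated to files found by walking root on a single device.

    With include_dot_entries=False, top-level names starting with "." are
    skipped, which mimics what the shell glob in `du -xsm /*` leaves out.
    """
    root_dev = os.lstat(root).st_dev
    seen = set()  # (device, inode) pairs, so hard links count once
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        at_top = dirpath == root
        dir_names = set(dirnames)
        descend = []
        for name in dirnames + filenames:
            if at_top and not include_dot_entries and name.startswith("."):
                continue
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # vanished or unreadable: skip
            if st.st_dev != root_dev:
                continue  # another filesystem: neither count nor descend
            if name in dir_names:
                descend.append(name)
            if (st.st_dev, st.st_ino) not in seen:
                seen.add((st.st_dev, st.st_ino))
                total += st.st_blocks * 512  # st_blocks is in 512-byte units
        dirnames[:] = descend  # prune the walk in place
    return total
```

A large gap between `df_used_bytes("/")` and `du_used_bytes("/", include_dot_entries=False)` that mostly disappears with `include_dot_entries=True` would point at a top-level dot entry such as /.snap. The walk needs sufficient privileges to read every directory.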
Comment 9 Kirk McKusick freebsd_committer freebsd_triage 2023-02-27 22:45:24 UTC
(In reply to Mark Millard from comment #8)
Is this filesystem mounted and having background fsck run on it? Background fsck is the only application that would create .snap/fsck_snapshot.
Comment 10 Mark Millard 2023-02-27 23:42:38 UTC
(In reply to Kirk McKusick from comment #9)

I expect that the background fsck ran on the system. But
by the time I noticed the high % Capacity and then finally
noticed the /.snap/fsck_snapshot , I did not find any
evidence of a background fsck still being active.

This means that I did not see the background fsck run,
unfortunately. But the date/time on /.snap/fsck_snapshot
made reasonable sense for time frame.

My seeing the file indicates that the unlink did not
happen by the time I noticed the file. That, of itself,
may be odd. (And is what prompted me to submit the note.)
Comment 11 Kirk McKusick freebsd_committer freebsd_triage 2023-02-28 00:27:21 UTC
(In reply to Mark Millard from comment #10)
There is a brief window when the file exists. The snapshot request is made. The resulting file is opened by fsck. The file is then unlinked. If fsck exits or the kernel dies after the snapshot is created and before the unlink is done then the file will remain. It is possible that you somehow hit that window.
Comment 12 Mark Millard 2023-02-28 01:17:18 UTC
(In reply to Kirk McKusick from comment #11)

FYI: There were no kernel crashes before, during, or after.
Other than % Capacity and /.snap/fsck_snapshot existence
issue, things seemed normal.

So it would seem that the fsck probably exited before the
unlink was done, leaving the file in place. Sounds like
such is an expected possibility.
Comment 13 Kirk McKusick freebsd_committer freebsd_triage 2023-02-28 20:44:14 UTC
(In reply to Mark Millard from comment #12)
The open of the snapshot also includes reading the superblock from the snapshot. Until recently, rather few checks were done on it, so a bad field in the superblock could create a wild pointer that would cause fsck to die with a segmentation fault. That would be my best guess as to what caused the premature exit with the snapshot still in place. I should reorganize the code to do the unlink immediately after the open to close that window.
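The effect of the ordering can be shown with a small Python sketch. It is a model only: an exception stands in for the crash (a real segmentation fault kills the process with no chance to clean up), an ordinary file stands in for the snapshot, and all names here (`read_superblock`, `check_unlink_late`, `check_unlink_early`, `leftover_after`) are hypothetical.

```python
import os
import tempfile


class BadSuperblock(Exception):
    """Stand-in for fsck dying while reading a damaged superblock."""


def read_superblock(fd):
    # Hypothetical validation step that always fails in this model.
    raise BadSuperblock("bad field in superblock")


def check_unlink_late(path):
    """Old ordering: open, read the superblock, then unlink."""
    fd = os.open(path, os.O_RDONLY)
    try:
        read_superblock(fd)  # fails here ...
        os.unlink(path)      # ... so this line is never reached
    finally:
        os.close(fd)


def check_unlink_early(path):
    """Proposed ordering: open, unlink at once, then read the superblock."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.unlink(path)      # name removed before anything can fail
        read_superblock(fd)
    finally:
        os.close(fd)


def leftover_after(check):
    """Run one variant and report whether the file name survived the failure."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fsck_snapshot")
        with open(path, "wb") as f:
            f.write(b"\0" * 512)
        try:
            check(path)
        except BadSuperblock:
            pass
        return os.path.exists(path)
```

Calling `leftover_after(check_unlink_late)` leaves the file behind, while `leftover_after(check_unlink_early)` does not, which is the window the reorganization is meant to close.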
Comment 14 Mark Millard 2023-02-28 21:07:53 UTC
(In reply to Kirk McKusick from comment #13)

FYI:

I did not find a .core file for fsck_ffs (or for any other
program) when I did a 'find / -name "*.core" -print' .

Similarly, looking in /var/log/messages did not show any
messages of the form:

pid ???? (????), jid ????, uid ????: exited on signal ????

for that day.
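A search like the one described above could be scripted roughly as follows. The regular expression is built from the message form quoted in this comment; the function name `signal_exits` is hypothetical, and the sketch assumes the pid, jid, uid, and signal fields are decimal numbers.

```python
import re

# Pattern derived from the form quoted above:
#   pid ???? (????), jid ????, uid ????: exited on signal ????
SIGNAL_EXIT = re.compile(
    r"pid (?P<pid>\d+) \((?P<comm>[^)]*)\), jid (?P<jid>\d+), "
    r"uid (?P<uid>\d+): exited on signal (?P<signal>\d+)"
)


def signal_exits(lines, command=None):
    """Yield (pid, command, signal) for each matching log line.

    If command is given, only lines for that command name are reported.
    """
    for line in lines:
        m = SIGNAL_EXIT.search(line)
        if m is None:
            continue
        if command is not None and m.group("comm") != command:
            continue
        yield int(m.group("pid")), m.group("comm"), int(m.group("signal"))
```

For example, `list(signal_exits(open("/var/log/messages"), command="fsck_ffs"))` would return an empty list on a system where fsck_ffs never exited on a signal within the retained log.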


Separately:

It does sound like moving the unlink to just after the
open would be appropriate.
Comment 15 commit-hook freebsd_committer freebsd_triage 2023-10-25 22:39:04 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d3a36e4b7459b2d62c4cd50de7a8e3195d7241c7

commit d3a36e4b7459b2d62c4cd50de7a8e3195d7241c7
Author:     Kirk McKusick <mckusick@FreeBSD.org>
AuthorDate: 2023-10-25 22:36:45 +0000
Commit:     Kirk McKusick <mckusick@FreeBSD.org>
CommitDate: 2023-10-25 22:38:11 +0000

    Delete snapshot after opening it when running fsck_ffs(9) in background.

    When fsck_ffs(8) runs in background, it creates a snapshot named
    fsck_snapshot in the filesystem's .snap directory. The fsck_snapshot
    file was removed when the background fsck finished. If the system
    crashed or the fsck exited unexpectedly, the fsck_snapshot file
    would remain. The snapshot would consume ever more space as the
    filesystem changed over time until it was removed by a system
    administrator or a future run of background fsck removed it to
    create a new snapshot file.

    This commit unlinks the .snap/fsck_snapshot file immediately after
    opening it so that it will be reclaimed when fsck closes it at the
    conclusion of its run. After a system crash, it will be removed as
    part of the filesystem cleanup because of its zero reference count.
    As only a few milliseconds pass between its creation and unlinking,
    there is far less opportunity for it to be accidentally left behind.

    PR:           106107
    MFC-after:    1 week

 sbin/fsck_ffs/fsck.h   | 1 -
 sbin/fsck_ffs/fsutil.c | 1 -
 sbin/fsck_ffs/globs.c  | 2 --
 sbin/fsck_ffs/main.c   | 8 +++++---
 sbin/fsck_ffs/setup.c  | 8 ++++----
 5 files changed, 9 insertions(+), 11 deletions(-)
Comment 16 commit-hook freebsd_committer freebsd_triage 2023-11-12 06:48:54 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=27133e6e86c16f86642a16d15ea2c910e9642616

commit 27133e6e86c16f86642a16d15ea2c910e9642616
Author:     Kirk McKusick <mckusick@FreeBSD.org>
AuthorDate: 2023-10-25 22:36:45 +0000
Commit:     Kirk McKusick <mckusick@FreeBSD.org>
CommitDate: 2023-11-12 06:48:25 +0000

    Delete snapshot after opening it when running fsck_ffs(9) in background.

    PR:           106107

    (cherry picked from commit d3a36e4b7459b2d62c4cd50de7a8e3195d7241c7)

 sbin/fsck_ffs/fsck.h   | 1 -
 sbin/fsck_ffs/fsutil.c | 1 -
 sbin/fsck_ffs/globs.c  | 2 --
 sbin/fsck_ffs/main.c   | 8 +++++---
 sbin/fsck_ffs/setup.c  | 8 ++++----
 5 files changed, 9 insertions(+), 11 deletions(-)
Comment 17 commit-hook freebsd_committer freebsd_triage 2023-11-12 06:51:56 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d3d779f6475474bd7fc80d1f9cce91c7b42fc958

commit d3d779f6475474bd7fc80d1f9cce91c7b42fc958
Author:     Kirk McKusick <mckusick@FreeBSD.org>
AuthorDate: 2023-10-25 22:36:45 +0000
Commit:     Kirk McKusick <mckusick@FreeBSD.org>
CommitDate: 2023-11-12 06:51:14 +0000

    Delete snapshot after opening it when running fsck_ffs(9) in background.

    PR:           106107

    (cherry picked from commit d3a36e4b7459b2d62c4cd50de7a8e3195d7241c7)

 sbin/fsck_ffs/fsck.h   | 1 -
 sbin/fsck_ffs/fsutil.c | 1 -
 sbin/fsck_ffs/globs.c  | 2 --
 sbin/fsck_ffs/main.c   | 8 +++++---
 sbin/fsck_ffs/setup.c  | 8 ++++----
 5 files changed, 9 insertions(+), 11 deletions(-)