Bug 84589 - [2TB] 5.4-STABLE unresponsive during background fsck 2TB partition
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 5.4-STABLE
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: Jaakko Heinonen
Reported: 2005-08-05 20:20 UTC by David Kirchner
Modified: 2010-12-12 09:24 UTC

Description David Kirchner 2005-08-05 20:20:11 UTC
After a large filesystem is marked dirty (by a panic or a ^C'd fsck) and the server is rebooted, the background fsck starts. Approximately 1-2 minutes later the server slows down. Within about 5-10 minutes all disk access stops, and the server no longer responds even to hitting return in bash.

You can still ping the server, and if you connect to SSH it will still go through all the motions, right up until it is about to spawn login. This happens even though the partition being fsck'd is not in use.

As far as I can tell it will never recover. I've given it over 12 hours. It doesn't panic, unfortunately, or give any indication on the console why it is having trouble.

fsck works fine when you run it from the command line, in the foreground.

Fix: 

Disable background fsck in /etc/rc.conf:

background_fsck="NO"
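
A minimal sketch of applying that setting from a script, for anyone who has to do it on several servers. The function name disable_background_fsck is hypothetical (it is not part of FreeBSD); it drops any existing background_fsck line from the given file and appends the one shown above. Try it on a copy of rc.conf first.

```shell
#!/bin/sh
# Sketch only: ensure an rc.conf-style file contains exactly one
# background_fsck="NO" line. The function name is made up for this example.

disable_background_fsck() {
    rc_conf="${1:-/etc/rc.conf}"
    tmp=$(mktemp /tmp/rcconf.XXXXXX) || return 1

    # Copy every line except an existing background_fsck assignment.
    # grep exits non-zero when it prints nothing, which is fine here.
    if [ -f "$rc_conf" ]; then
        grep -v '^[[:space:]]*background_fsck=' "$rc_conf" > "$tmp" || true
    fi

    # Append the setting from this PR.
    printf 'background_fsck="NO"\n' >> "$tmp"

    # Write back through the original path so its mode and owner are kept.
    cat "$tmp" > "$rc_conf"
    rm -f "$tmp"
}
```

Calling it with no argument edits /etc/rc.conf; calling it with a path (for example a copy under /tmp) edits that file instead. The setting only takes effect at the next boot.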

It may be that using UFS2 also fixes the problem, but we've had other issues with UFS2. I'll open another PR for those when I can reproduce them.
How-To-Repeat:

Install 5.4-STABLE on a multi-TB server, creating a 36GB / partition and one or more 2TB partitions (you will need to use auto-carving). Format the large partitions as UFS1 with softupdates enabled.
Leave the large target partition completely empty.
Unmount the target partition.
Start "fsck /dev/whatever", and hit ^C part way through. Verify it says "FILE SYSTEM MARKED DIRTY".
Reboot.
Log in again to monitor the server.
It will eventually stop responding to your commands.
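
The steps above can be sketched as a script. This is an illustration, not something taken from the affected servers: the device /dev/da1s1d and the mount point /data are hypothetical placeholders, and the helper names step and repro_steps are made up. By default the commands are only printed; they are executed only when RUN=1 is set, and newfs destroys whatever is on the device. The ^C during fsck still has to be done by hand.

```shell
#!/bin/sh
# Sketch of the How-To-Repeat steps. DEV and MNT are placeholders (assumptions),
# not values from this PR. Nothing is executed unless RUN=1 is set.

DEV="${DEV:-/dev/da1s1d}"
MNT="${MNT:-/data}"

# Print the command, or run it when RUN=1.
step() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

repro_steps() {
    step newfs -O 1 -U "$DEV"   # -O 1 selects UFS1, -U enables softupdates
    step mount "$DEV" "$MNT"    # leave the filesystem empty
    step umount "$MNT"
    step fsck "$DEV"            # hit ^C part way through; expect
                                # "FILE SYSTEM MARKED DIRTY"
    step reboot                 # background fsck starts after boot
}
```

Running repro_steps without RUN=1 lists the five commands in order, which is a cheap way to check the device name before doing anything destructive.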
Comment 1 David Kirchner 2005-12-30 18:07:28 UTC
This bug has been reproduced on a different server (similar hardware)
running 6.0-RELEASE and UFS2. I forgot to disable background fscks
on the server (big d'oh!), and about 12 hours after the server
rebooted, access to the disk started slowing down, eventually
becoming completely unresponsive and forcing a reboot. The
reboot took about 2 minutes to take effect, probably because the
server was "busy" with the fsck.

I was able to log in to it before it locked up, and tried ktrace'ing
the fsck_ffs process. It had no activity. I suspect it deadlocked
against something else.

Unfortunately the server was an NFS server, so the NFS client also had
to be rebooted due to a separate NFS client deadlock bug.

The how-to-repeat is the same: That ^C fsck step is just to trigger a
dirty filesystem. Really, really easy to duplicate.

The workaround is the same: Disable background_fsck for all 5.4 or 6.0
servers (or for any servers capable of performing a background fsck).

FWIW: The foreground fsck takes far less than 12 hours to complete.
Comment 2 Mark Linimon freebsd_committer freebsd_triage 2009-05-18 05:33:51 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).
Comment 3 Jaakko Heinonen freebsd_committer freebsd_triage 2010-10-29 09:15:25 UTC
State Changed
From-To: open->feedback

Is this still a problem for you? r184934 might have improved snapshot
creation on large file systems.
Comment 4 Jaakko Heinonen freebsd_committer freebsd_triage 2010-10-29 09:15:25 UTC
Responsible Changed
From-To: freebsd-fs->jh

Track.
Comment 5 Jaakko Heinonen freebsd_committer freebsd_triage 2010-12-12 09:24:55 UTC
State Changed
From-To: feedback->closed

Feedback timeout.