Bug 181226 - [ufs] Writes to almost full FS eat 100% CPU and speed drops below 1MB/sec [regression]
Summary: [ufs] Writes to almost full FS eat 100% CPU and speed drops below 1MB/sec [regression]
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 9.2-BETA2
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: Kirk McKusick
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-08-11 13:30 UTC by Dmitry Sivachenko
Modified: 2013-09-12 20:42 UTC

See Also:


Description Dmitry Sivachenko 2013-08-11 13:30:01 UTC
I have a 25TB Dell PERC 6 RAID5 array.  When it becomes almost full
(10-20GB free), processes which write data to it start eating 100% CPU and
write speed drops below 1MB/sec (normally it gives 400MB/sec).

 1889 mitya           1 100    0  2058M  1027M CPU12  12   0:47 92.77% dd

systat -vm shows disk array is not busy:

Disks mfid0
KB/t  63.71
tps      65
MB/s   0.77
%busy     3

If I delete some files to free up space during the slow write, the same
process resumes writing at normal speed.

I was running this machine with a roughly one-year-old 9-STABLE without
any problems.  That array often overflows, and I always got a
"filesystem is full" error without any write-speed reduction.
The problem appeared after I upgraded to 9.2-BETA2 a few days ago.

tunefs: POSIX.1e ACLs: (-a)                                disabled
tunefs: NFSv4 ACLs: (-N)                                   disabled
tunefs: MAC multilabel: (-l)                               disabled
tunefs: soft updates: (-n)                                 enabled
tunefs: soft update journaling: (-j)                       disabled
tunefs: gjournal: (-J)                                     disabled
tunefs: trim: (-t)                                         disabled
tunefs: maximum blocks per file in a cylinder group: (-e)  4096
tunefs: average file size: (-f)                            16384
tunefs: average number of files in a directory: (-s)       64
tunefs: minimum percentage of free space: (-m)             1%
tunefs: space to hold for metadata blocks: (-k)            0
tunefs: optimization preference: (-o)                      space
tunefs: volume label: (-L)
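
The minfree value of 1% above turns out to be the key detail: as the commit
message in comment 4 explains, r249782 reserves half of minfree in each
cylinder group for metadata, and the eventual fix gives up on saving that
reserve once free space drops below 2%. A rough model of why the default
tuning is unaffected while this array is not (plain Python, illustrative
only; the function names are made up, the percentages come from the commit
message):

```python
def metadata_reserve_pct(minfree_pct):
    """r249782: the reserved metadata area defaults to half of minfree."""
    return minfree_pct / 2

def can_hit_slow_path(minfree_pct, giveup_threshold_pct=2):
    """Writes stop (ENOSPC) once free space falls below minfree, so a
    filesystem can only reach the slow metadata-area search when minfree
    is below the 2% give-up threshold later added by r254995."""
    return minfree_pct < giveup_threshold_pct

print(metadata_reserve_pct(5), can_hit_slow_path(5))   # 2.5 False (defaults)
print(metadata_reserve_pct(1), can_hit_slow_path(1))   # 0.5 True  (this array)
```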
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2013-08-11 14:39:12 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Submitter notes this is a recent regression.
Comment 2 Dmitry Sivachenko 2013-08-25 16:45:08 UTC
I found the exact revision number which broke that:

Author: mckusick
Date: Mon Apr 22 23:59:00 2013
New Revision: 249782
URL: http://svnweb.freebsd.org/changeset/base/249782
Comment 3 Xin LI freebsd_committer freebsd_triage 2013-08-25 20:33:07 UTC
Responsible Changed
From-To: freebsd-fs->mckusick

Over to UFS maintainer.
Comment 4 dfilter service freebsd_committer freebsd_triage 2013-08-28 18:38:13 UTC
Author: mckusick
Date: Wed Aug 28 17:38:05 2013
New Revision: 254995
URL: http://svnweb.freebsd.org/changeset/base/254995

Log:
  A performance problem was reported in PR kern/181226:
  
      I have 25TB Dell PERC 6 RAID5 array. When it becomes almost
      full (10-20GB free), processes which write data to it start
      eating 100% CPU and write speed drops below 1MB/sec (normally
      it gives 400MB/sec). The revision at which it first became
      apparent was http://svnweb.freebsd.org/changeset/base/249782.
  
  The offending change reserved an area in each cylinder group to
  store metadata. The new algorithm attempts to save this area for
  metadata and allows its use for non-metadata only after all the
  data areas have been exhausted. The size of the reserved area
  defaults to half of minfree, so the filesystem reports full before
  the data area can completely fill. However, in this report, the
  filesystem has had minfree reduced to 1% thus forcing the metadata
  area to be used for data. As the filesystem approached full, it
  had only metadata areas left to allocate. The result was that
  every block allocation had to scan summary data for 30,000 cylinder
  groups before falling back to searching up to 30,000 metadata areas.
  
  The fix is to give up on saving the metadata areas once the free
  space reserve drops below 2%. The effect of this change is to use
  the old algorithm of just accepting the first available block that
  we find. Since most filesystems use the default 5% minfree, this
  will have no effect on their operation. Those who want to push to
  the limit will get their crappy block placements quickly.
  
  Submitted by:  Dmitry Sivachenko
  Fix Tested by: Dmitry Sivachenko
  PR:            kern/181226
  MFC after:     2 weeks

Modified:
  head/sys/ufs/ffs/ffs_alloc.c

Modified: head/sys/ufs/ffs/ffs_alloc.c
==============================================================================
--- head/sys/ufs/ffs/ffs_alloc.c	Wed Aug 28 16:59:55 2013	(r254994)
+++ head/sys/ufs/ffs/ffs_alloc.c	Wed Aug 28 17:38:05 2013	(r254995)
@@ -516,7 +516,13 @@ ffs_reallocblks_ufs1(ap)
 	ip = VTOI(vp);
 	fs = ip->i_fs;
 	ump = ip->i_ump;
-	if (fs->fs_contigsumsize <= 0)
+	/*
+	 * If we are not tracking block clusters or if we have less than 2%
+	 * free blocks left, then do not attempt to cluster. Running with
+	 * less than 5% free block reserve is not recommended and those that
+	 * choose to do so do not expect to have good file layout.
+	 */
+	if (fs->fs_contigsumsize <= 0 || freespace(fs, 2) < 0)
 		return (ENOSPC);
 	buflist = ap->a_buflist;
 	len = buflist->bs_nchildren;
@@ -737,7 +743,13 @@ ffs_reallocblks_ufs2(ap)
 	ip = VTOI(vp);
 	fs = ip->i_fs;
 	ump = ip->i_ump;
-	if (fs->fs_contigsumsize <= 0)
+	/*
+	 * If we are not tracking block clusters or if we have less than 2%
+	 * free blocks left, then do not attempt to cluster. Running with
+	 * less than 5% free block reserve is not recommended and those that
+	 * choose to do so do not expect to have good file layout.
+	 */
+	if (fs->fs_contigsumsize <= 0 || freespace(fs, 2) < 0)
 		return (ENOSPC);
 	buflist = ap->a_buflist;
 	len = buflist->bs_nchildren;
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
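
The new guard in the diff above, `freespace(fs, 2) < 0`, relies on the UFS
`freespace()` macro, which yields the number of free fragments remaining
above a given percentage reserve and goes negative once free space drops
below that reserve. A simplified Python model of the check (the dict fields
are illustrative stand-ins, not the real `struct fs` layout):

```python
def freespace(fs, percent_reserved):
    """Free fragments above a percentage reserve of the data area; negative
    once free space drops below the reserve.  Modeled loosely on the UFS
    freespace() macro; 'fs' here is a plain dict, not struct fs."""
    free_frags = fs["nbfree"] * fs["frag"] + fs["nffree"]
    reserve = fs["dsize"] * percent_reserved // 100
    return free_frags - reserve

# An almost-full filesystem (1% free) fails the 2% check, so the patched
# ffs_reallocblks_ufs1/2 return ENOSPC instead of hunting for clusters.
nearly_full = {"dsize": 1_000_000, "frag": 8, "nbfree": 1_000, "nffree": 2_000}
print(freespace(nearly_full, 2))       # -10000: below the 2% reserve
print(freespace(nearly_full, 2) < 0)   # True: skip clustering
```

The effect is exactly what the log describes: below 2% free, block
allocation falls back to accepting the first available block it finds.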
Comment 5 Kirk McKusick freebsd_committer freebsd_triage 2013-08-28 18:53:33 UTC
State Changed
From-To: open->patched

A working patch has been applied to head. Assuming no problems are
reported, it will be MFC'ed to 9 in two weeks and this report closed.
Comment 6 dfilter service freebsd_committer freebsd_triage 2013-09-12 20:36:11 UTC
Author: mckusick
Date: Thu Sep 12 19:36:04 2013
New Revision: 255494
URL: http://svnweb.freebsd.org/changeset/base/255494

Log:
  MFC of 254995:
  
  A performance problem was reported in PR kern/181226:
  
      I have 25TB Dell PERC 6 RAID5 array. When it becomes almost
      full (10-20GB free), processes which write data to it start
      eating 100% CPU and write speed drops below 1MB/sec (normally
      it gives 400MB/sec). The revision at which it first became
      apparent was http://svnweb.freebsd.org/changeset/base/249782.
  
  The offending change reserved an area in each cylinder group to
  store metadata. The new algorithm attempts to save this area for
  metadata and allows its use for non-metadata only after all the
  data areas have been exhausted. The size of the reserved area
  defaults to half of minfree, so the filesystem reports full before
  the data area can completely fill. However, in this report, the
  filesystem has had minfree reduced to 1% thus forcing the metadata
  area to be used for data. As the filesystem approached full, it
  had only metadata areas left to allocate. The result was that
  every block allocation had to scan summary data for 30,000 cylinder
  groups before falling back to searching up to 30,000 metadata areas.
  
  The fix is to give up on saving the metadata areas once the free
  space reserve drops below 2%. The effect of this change is to use
  the old algorithm of just accepting the first available block that
  we find. Since most filesystems use the default 5% minfree, this
  will have no effect on their operation. Those who want to push to
  the limit will get their crappy block placements quickly.
  
  Submitted by:  Dmitry Sivachenko
  Fix Tested by: Dmitry Sivachenko
  PR:            kern/181226
  
  MFC of 254996:
  
  In looking at block layouts as part of fixing filesystem block
  allocations under low free-space conditions (-r254995), I determined
  that the old block-preference search order used before -r249782 worked
  a bit better. This change reverts to that search order.

Modified:
  stable/9/sys/ufs/ffs/ffs_alloc.c
Directory Properties:
  stable/9/sys/   (props changed)

Modified: stable/9/sys/ufs/ffs/ffs_alloc.c
==============================================================================
--- stable/9/sys/ufs/ffs/ffs_alloc.c	Thu Sep 12 18:08:25 2013	(r255493)
+++ stable/9/sys/ufs/ffs/ffs_alloc.c	Thu Sep 12 19:36:04 2013	(r255494)
@@ -516,7 +516,13 @@ ffs_reallocblks_ufs1(ap)
 	ip = VTOI(vp);
 	fs = ip->i_fs;
 	ump = ip->i_ump;
-	if (fs->fs_contigsumsize <= 0)
+	/*
+	 * If we are not tracking block clusters or if we have less than 2%
+	 * free blocks left, then do not attempt to cluster. Running with
+	 * less than 5% free block reserve is not recommended and those that
+	 * choose to do so do not expect to have good file layout.
+	 */
+	if (fs->fs_contigsumsize <= 0 || freespace(fs, 2) < 0)
 		return (ENOSPC);
 	buflist = ap->a_buflist;
 	len = buflist->bs_nchildren;
@@ -736,7 +742,13 @@ ffs_reallocblks_ufs2(ap)
 	ip = VTOI(vp);
 	fs = ip->i_fs;
 	ump = ip->i_ump;
-	if (fs->fs_contigsumsize <= 0)
+	/*
+	 * If we are not tracking block clusters or if we have less than 2%
+	 * free blocks left, then do not attempt to cluster. Running with
+	 * less than 5% free block reserve is not recommended and those that
+	 * choose to do so do not expect to have good file layout.
+	 */
+	if (fs->fs_contigsumsize <= 0 || freespace(fs, 2) < 0)
 		return (ENOSPC);
 	buflist = ap->a_buflist;
 	len = buflist->bs_nchildren;
@@ -1173,7 +1185,7 @@ ffs_dirpref(pip)
 			if (fs->fs_contigdirs[cg] < maxcontigdirs)
 				return ((ino_t)(fs->fs_ipg * cg));
 		}
-	for (cg = prefcg - 1; cg >= 0; cg--)
+	for (cg = 0; cg < prefcg; cg++)
 		if (fs->fs_cs(fs, cg).cs_ndir < maxndir &&
 		    fs->fs_cs(fs, cg).cs_nifree >= minifree &&
 	    	    fs->fs_cs(fs, cg).cs_nbfree >= minbfree) {
@@ -1186,7 +1198,7 @@ ffs_dirpref(pip)
 	for (cg = prefcg; cg < fs->fs_ncg; cg++)
 		if (fs->fs_cs(fs, cg).cs_nifree >= avgifree)
 			return ((ino_t)(fs->fs_ipg * cg));
-	for (cg = prefcg - 1; cg >= 0; cg--)
+	for (cg = 0; cg < prefcg; cg++)
 		if (fs->fs_cs(fs, cg).cs_nifree >= avgifree)
 			break;
 	return ((ino_t)(fs->fs_ipg * cg));
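
The ffs_dirpref() hunks in the diff above (the MFC'ed r254996 portion)
revert the fallback scan over cylinder groups from a downward walk,
prefcg - 1 toward 0, to the pre-r249782 upward walk, 0 toward prefcg. The
two visit orders, sketched in plain Python (illustrative only):

```python
def scan_down(prefcg):
    """Order used after r249782: for (cg = prefcg - 1; cg >= 0; cg--)."""
    return list(range(prefcg - 1, -1, -1))

def scan_up(prefcg):
    """Order restored here: for (cg = 0; cg < prefcg; cg++)."""
    return list(range(prefcg))

print(scan_down(4))  # [3, 2, 1, 0]
print(scan_up(4))    # [0, 1, 2, 3]
```

Both walks visit the same cylinder groups; only the preference order
differs, and the commit log notes the older ascending order produced
slightly better block layouts.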
Comment 7 Kirk McKusick freebsd_committer freebsd_triage 2013-09-12 20:41:08 UTC
State Changed
From-To: patched->closed

The fixes have been MFC'ed to 9-stable. They are not relevant 
to earlier versions of the system.