Bug 113885 - [gmirror] [patch] improved gmirror balance algorithm
Summary: [gmirror] [patch] improved gmirror balance algorithm
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 6.2-RELEASE
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: Alexander Motin
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-06-20 18:50 UTC by zuborg
Modified: 2010-03-22 00:10 UTC (History)
0 users

See Also:


Attachments
file.diff (3.87 KB, patch)
2007-06-20 18:50 UTC, zuborg
no flags

Description zuborg 2007-06-20 18:50:01 UTC
gmirror has poor read balance algorithms, which prevents it from fully
utilizing disk performance.

The 'prefer' algorithm is suitable only for asymmetric environments, such
as one local and one network-mounted disk.

'load' is broken - the dumb 'round-robin' shows better results in all
benchmarks.

'split' is completely pointless - it uses 'round-robin' for small requests
and splits big requests across providers, but clearly one HDD takes less
time to complete one big request (equal to several small sequential
requests) than two (or more) drives take to complete those several small
requests, because even with no other activity a single disk spends no
time moving its actuator or waiting for platter rotation.

'round-robin' is the best of them, but it is dumb and makes no attempt to
select the disk that could complete the request soonest: a single
sequential read thread uses both providers at half speed (total
throughput is the same as for a single drive); two simultaneous
sequential read threads fight each other and use only 10% of the
available throughput, while each could use a separate provider at full
speed; additional threads only aggravate the situation.

Fix: I've rewritten the 'load' algorithm.

It remembers the bio_offset value of the last operation (read or write)
for each provider and assumes that the preferable disk is the one with
the smallest offset distance:
|bio_offset - disk->last_offset|

It also remembers the time of the last use, and tries to pick the disk
that has been idle the longest.

From all providers it selects the one with the smallest value of:
offset_distance / (tunable_parameter + use_delay)

This algorithm is well suited to multithreaded environments (web servers,
DB servers, ...). For example, it allows a web server to work at almost
full speed while awstats (a log analyzer) is running, whereas
'round-robin' causes a 50% slowdown (bandwidth drops from 100 Mbit/s to
50 Mbit/s and stays at that level until awstats finishes its job).

Patch attached with submission follows:
How-To-Repeat: it's reproducible on all gmirror installations
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2007-06-21 02:18:50 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-geom

Over to maintainer(s).
Comment 2 will 2008-12-02 21:50:29 UTC
I have attached what I believe is a better version of your patch.  It:

1) Fixes the type ambiguity of the new use_delay/best_use_delay and
dist/best_dist variables, to match the variables used in their calculations;
they should be uint64_t and off_t, respectively.
2) Uses bit shifts instead of multiplication/division in the use delay and
distance calculations.  The precision loss should be acceptable in this
situation.
3) Cleans up the style of the code and adds more and better comments.
4) Gets rid of the g_mirror_disk.d_delay variable since it is no longer
used, along with the function the original patch short-circuited.

In my testing, with 16 simultaneous processes performing the same test at
the same time, by throughput, random reads/writes improved by about 35% (low
variance), while sequential reads/writes improved by 100-400% (high
variance).  IOs also increased proportionally.  Testing was done using
"rawio -a -p 16 /dev/mirror/testa", where the test mirror was composed of
two 160GB Seagate SATA disks and the system was a dual Opteron 246 with
1.5GB of RAM and no other load, running 8.0-CURRENT as of 12/1/2008.  CPU usage
impact vs. old load algorithm appears negligible as well.

Regards,
Will.

--- gmirror.patch begins here ---
Index: g_mirror.c
===================================================================
--- g_mirror.c	(revision 185567)
+++ g_mirror.c	(working copy)
@@ -25,7 +25,7 @@
  */
 
 #include <sys/cdefs.h>
-__FBSDID("$FreeBSD$");
+__FBSDID("$FreeBSD: src/sys/geom/mirror/g_mirror.c,v 1.94 2007/10/20 23:23:19 julian Exp $");
 
 #include <sys/param.h>
 #include <sys/systm.h>
@@ -45,7 +45,6 @@
 #include <sys/sched.h>
 #include <geom/mirror/g_mirror.h>
 
-
 static MALLOC_DEFINE(M_MIRROR, "mirror_data", "GEOM_MIRROR Data");
 
 SYSCTL_DECL(_kern_geom);
@@ -71,7 +70,12 @@
 TUNABLE_INT("kern.geom.mirror.sync_requests", &g_mirror_syncreqs);
 SYSCTL_UINT(_kern_geom_mirror, OID_AUTO, sync_requests, CTLFLAG_RDTUN,
     &g_mirror_syncreqs, 0, "Parallel synchronization I/O requests.");
+static u_int g_mirror_plusdelay = 60000;
+TUNABLE_INT("kern.geom.mirror.plusdelay", &g_mirror_plusdelay);
+SYSCTL_UINT(_kern_geom_mirror, OID_AUTO, plusdelay, CTLFLAG_RW,
+    &g_mirror_plusdelay, 0, "Additional load delay in 1/65536ths of a second.");
 
+
 #define	MSLEEP(ident, mtx, priority, wmesg, timeout)	do {		\
 	G_MIRROR_DEBUG(4, "%s: Sleeping %p.", __func__, (ident));	\
 	msleep((ident), (mtx), (priority), (wmesg), (timeout));		\
@@ -451,8 +455,7 @@
 	disk->d_id = md->md_did;
 	disk->d_state = G_MIRROR_DISK_STATE_NONE;
 	disk->d_priority = md->md_priority;
-	disk->d_delay.sec = 0;
-	disk->d_delay.frac = 0;
+	disk->last_offset = 0;
 	binuptime(&disk->d_last_used);
 	disk->d_flags = md->md_dflags;
 	if (md->md_provider[0] != '\0')
@@ -863,16 +866,6 @@
 }
 
 static void
-g_mirror_update_delay(struct g_mirror_disk *disk, struct bio *bp)
-{
-
-	if (disk->d_softc->sc_balance != G_MIRROR_BALANCE_LOAD)
-		return;
-	binuptime(&disk->d_delay);
-	bintime_sub(&disk->d_delay, &bp->bio_t0);
-}
-
-static void
 g_mirror_done(struct bio *bp)
 {
 	struct g_mirror_softc *sc;
@@ -904,8 +897,6 @@
 		g_topology_lock();
 		g_mirror_kill_consumer(sc, bp->bio_from);
 		g_topology_unlock();
-	} else {
-		g_mirror_update_delay(disk, bp);
 	}
 
 	pbp->bio_inbed++;
@@ -1472,25 +1463,45 @@
 	struct g_consumer *cp;
 	struct bio *cbp;
 	struct bintime curtime;
+	off_t  bio_offset = bp->bio_offset;
+	off_t  best_dist = -1, dist;
+	uint64_t best_use_delay = 0, use_delay = 0;
 
-	binuptime(&curtime);
+	getbinuptime(&curtime);
 	/*
-	 * Find a disk which the smallest load.
+	 * Find the disk which has the smallest ratio of distance to use
+	 * delay, i.e. its head looks closest to bio_offset and it was used
+	 * least recently.
 	 */
 	disk = NULL;
 	LIST_FOREACH(dp, &sc->sc_disks, d_next) {
 		if (dp->d_state != G_MIRROR_DISK_STATE_ACTIVE)
 			continue;
-		/* If disk wasn't used for more than 2 sec, use it. */
-		if (curtime.sec - dp->d_last_used.sec >= 2) {
+
+		dist = dp->last_offset - bio_offset;
+		if (dist < 0)
+			dist = -dist;
+
+		/*
+		 * Calculate the use delay as follows: Add the sysctl
+		 * configured delay, then convert the bintime structure
+		 * in terms of 1/65536ths of a second before adding its
+		 * components.  So multiply seconds difference by 65536
+		 * and drop all but the 16 most significant bits in the
+		 * fraction, since they're all greater than 1/65536.
+		 */
+		use_delay = g_mirror_plusdelay;
+		use_delay += ((curtime.sec - dp->d_last_used.sec) << 16);
+		use_delay += ((curtime.frac - dp->d_last_used.frac) >> 48);
+
+		if (best_dist == -1 ||
+		    dist * best_use_delay < best_dist * use_delay) {
 			disk = dp;
-			break;
+			best_dist = dist;
+			best_use_delay = use_delay;
 		}
-		if (disk == NULL ||
-		    bintime_cmp(&dp->d_delay, &disk->d_delay) < 0) {
-			disk = dp;
-		}
 	}
+
 	KASSERT(disk != NULL, ("NULL disk for %s.", sc->sc_name));
 	cbp = g_clone_bio(bp);
 	if (cbp == NULL) {
@@ -1505,7 +1516,8 @@
 	cp = disk->d_consumer;
 	cbp->bio_done = g_mirror_done;
 	cbp->bio_to = cp->provider;
-	binuptime(&disk->d_last_used);
+	disk->d_last_used = curtime;
+	disk->last_offset = bio_offset;
 	G_MIRROR_LOGREQ(3, cbp, "Sending request.");
 	KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1,
 	    ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr,
@@ -1659,6 +1671,7 @@
 				g_io_deliver(bp, bp->bio_error);
 				return;
 			}
+			disk->last_offset = bp->bio_offset;
 			bioq_insert_tail(&queue, cbp);
 			cbp->bio_done = g_mirror_done;
 			cp = disk->d_consumer;
Index: g_mirror.h
===================================================================
--- g_mirror.h	(revision 185567)
+++ g_mirror.h	(working copy)
@@ -23,7 +23,7 @@
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
- * $FreeBSD$
+ * $FreeBSD: src/sys/geom/mirror/g_mirror.h,v 1.24 2006/11/01 22:51:49 pjd Exp $
  */
 
 #ifndef	_G_MIRROR_H_
@@ -133,7 +133,7 @@
 	struct g_mirror_softc	*d_softc; /* Back-pointer to softc. */
 	int		 d_state;	/* Disk state. */
 	u_int		 d_priority;	/* Disk priority. */
-	struct bintime	 d_delay;	/* Disk delay. */
+	off_t		 last_offset;	/* LBA of last operation. */
 	struct bintime	 d_last_used;	/* When disk was last used. */
 	uint64_t	 d_flags;	/* Additional flags. */
 	u_int		 d_genid;	/* Disk's generation ID. */

--- gmirror.patch ends here ---
Comment 3 Mark Linimon freebsd_committer freebsd_triage 2009-05-28 23:15:15 UTC
Responsible Changed
From-To: pjd->freebsd-geom

pjd is not actively working on GEOM at the moment.
Comment 4 emikulic 2009-07-16 08:56:19 UTC
Will Andrews' patch is *fantastic*

With this patch and gmirror set to "load" style balancing, I can run two
long sequential reads in parallel and get practically linear scaling on
a two-disk mirror.

Could someone please commit this?

--Emil
Comment 5 freebsdpr 2009-07-21 08:45:37 UTC
I was also surprised to discover that gmirror, regardless of the algorithm 
used, does not seem to offer either random or sequential read performance 
any better than a single drive. I have a new SATA backplane which shows 
individual drive activity indicators - with these you can easily see that 
the "load" algorithm seems to be selecting (and staying on) only a single 
drive at a time, for anywhere between 0.1 - 1 seconds. Some simple testing 
confirmed that there's no discernible read performance benefit between 1 
or >1 drives - so much for my 4 drive RAID1 idea!

In comparison, a 5 drive graid3 array offers a sequential read speed of 
nearly 4 times a single drive... with read verify ON.

----

Onto the "load" patch above - it doesn't seem to work for me. I thought it 
may have been because I had 4 drives in the array, but even after dropping 
back to 2 it still only reads from a *single* drive. Any ideas? I'm using 
7.1R-amd64.

Geom name: db0
State: COMPLETE
Components: 2
Balance: load  <--- ***
Comment 6 Ivan Voras freebsd_committer freebsd_triage 2009-08-25 10:11:12 UTC
The patch will not increase streaming read performance beyond what's 
possible with a single drive; it will improve random read performance in 
certain cases where reads are localized in such a way that serving some 
of them from one drive and the rest from the other drive helps.

The reason why there is no scalability with streaming read performance 
vs what can be achieved with RAID0/3/5 is that there is no striping 
here. For example: if you need to read 4 striped blocks from a RAID0 of 
two drives, blocks 0 and 2 will be sequentially stored on the first 
drive, blocks 1 and 3 will be sequentially stored on the second drive. 
Thus reading the 4 blocks will result in two sequential reads per drive. 
OTOH, with RAID1, blocks 0 and 2 will be stored with a "gap" between 
them, containing block 1, and cannot be read sequentially; a seek is 
needed instead. This is why e.g. the "split" method (which effectively 
does striping at the request level) doesn't help much with performance.
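
The layout argument can be made concrete with a toy model (raid0_pos()
and raid1_pos() are hypothetical helpers; the block numbering assumes a
two-drive array with unit-sized blocks):

```c
#include <assert.h>
#include <stdio.h>

/* Physical position of logical block b on the drive that serves it. */
static int
raid0_pos(int b, int ndrives)
{
	/* Striping: every ndrives-th block lands consecutively. */
	return (b / ndrives);
}

static int
raid1_pos(int b)
{
	/* Mirroring: every drive holds every block at its logical place. */
	return (b);
}

int
main(void)
{
	/* Drive 0 serves blocks 0 and 2 in both cases. */
	assert(raid0_pos(2, 2) - raid0_pos(0, 2) == 1);	/* adjacent */
	assert(raid1_pos(2) - raid1_pos(0) == 2);	/* gap: seek needed */
	printf("raid0 gap=%d raid1 gap=%d\n",
	    raid0_pos(2, 2) - raid0_pos(0, 2), raid1_pos(2) - raid1_pos(0));
	return (0);
}
```

In RAID0 the alternate blocks are physically adjacent on each drive, so
both drives stream; in RAID1 they are separated by the skipped block, so
alternating drives turns a sequential read into a seeky one.
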
Comment 7 freebsdpr 2009-09-21 16:13:05 UTC
> The patch will not increase streaming read performance beyond what's
> possible with a single drive[...]

I meant that after applying the patch it literally only reads from a 
single drive. "iostat -w 1 -x" shows it consistently using the last drive 
of the set for all reads.
Comment 8 admin 2009-10-23 06:20:00 UTC
So, where is the maintainer?
Maybe somebody could commit it?
It's a very useful feature.
Comment 9 Maxim Sobolev freebsd_committer freebsd_triage 2009-12-03 00:03:52 UTC
Here is another patch, which implements a different approach. Basically 
it looks at the recently served requests and also at the queue length to 
decide where to send the next request.

http://sobomax.sippysoft.com/~sobomax/geom_mirror.diff

Regards,
-- 
Maksym Sobolyev
Sippy Software, Inc.
Internet Telephony (VoIP) Experts
T/F: +1-646-651-1110
Web: http://www.sippysoft.com
MSN: sales@sippysoft.com
Skype: SippySoft
Comment 10 dfilter service freebsd_committer freebsd_triage 2009-12-03 21:48:01 UTC
Author: mav
Date: Thu Dec  3 21:47:51 2009
New Revision: 200086
URL: http://svn.freebsd.org/changeset/base/200086

Log:
  Change 'load' balancing mode algorithm:
  - Instead of measuring last request execution time for each drive and
  choosing one with smallest time, use averaged number of requests, running
  on each drive. This information is more accurate and timely. It allows to
  distribute load between drives in more even and predictable way.
  - For each drive track offset of the last submitted request. If new request
  offset matches previous one or close for some drive, prefer that drive.
  It allows to significantly speedup simultaneous sequential reads.
  
  PR:		kern/113885
  Reviewed by:	sobomax

Modified:
  head/sys/geom/mirror/g_mirror.c
  head/sys/geom/mirror/g_mirror.h

Modified: head/sys/geom/mirror/g_mirror.c
==============================================================================
--- head/sys/geom/mirror/g_mirror.c	Thu Dec  3 21:44:41 2009	(r200085)
+++ head/sys/geom/mirror/g_mirror.c	Thu Dec  3 21:47:51 2009	(r200086)
@@ -451,9 +451,6 @@ g_mirror_init_disk(struct g_mirror_softc
 	disk->d_id = md->md_did;
 	disk->d_state = G_MIRROR_DISK_STATE_NONE;
 	disk->d_priority = md->md_priority;
-	disk->d_delay.sec = 0;
-	disk->d_delay.frac = 0;
-	binuptime(&disk->d_last_used);
 	disk->d_flags = md->md_dflags;
 	if (md->md_provider[0] != '\0')
 		disk->d_flags |= G_MIRROR_DISK_FLAG_HARDCODED;
@@ -863,16 +860,6 @@ bintime_cmp(struct bintime *bt1, struct 
 }
 
 static void
-g_mirror_update_delay(struct g_mirror_disk *disk, struct bio *bp)
-{
-
-	if (disk->d_softc->sc_balance != G_MIRROR_BALANCE_LOAD)
-		return;
-	binuptime(&disk->d_delay);
-	bintime_sub(&disk->d_delay, &bp->bio_t0);
-}
-
-static void
 g_mirror_done(struct bio *bp)
 {
 	struct g_mirror_softc *sc;
@@ -904,8 +891,6 @@ g_mirror_regular_request(struct bio *bp)
 		g_topology_lock();
 		g_mirror_kill_consumer(sc, bp->bio_from);
 		g_topology_unlock();
-	} else {
-		g_mirror_update_delay(disk, bp);
 	}
 
 	pbp->bio_inbed++;
@@ -1465,30 +1450,35 @@ g_mirror_request_round_robin(struct g_mi
 	g_io_request(cbp, cp);
 }
 
+#define TRACK_SIZE  (1 * 1024 * 1024)
+#define LOAD_SCALE	256
+#define ABS(x)		(((x) >= 0) ? (x) : (-(x)))
+
 static void
 g_mirror_request_load(struct g_mirror_softc *sc, struct bio *bp)
 {
 	struct g_mirror_disk *disk, *dp;
 	struct g_consumer *cp;
 	struct bio *cbp;
-	struct bintime curtime;
+	int prio, best;
 
-	binuptime(&curtime);
-	/*
-	 * Find a disk which the smallest load.
-	 */
+	/* Find a disk with the smallest load. */
 	disk = NULL;
+	best = INT_MAX;
 	LIST_FOREACH(dp, &sc->sc_disks, d_next) {
 		if (dp->d_state != G_MIRROR_DISK_STATE_ACTIVE)
 			continue;
-		/* If disk wasn't used for more than 2 sec, use it. */
-		if (curtime.sec - dp->d_last_used.sec >= 2) {
-			disk = dp;
-			break;
-		}
-		if (disk == NULL ||
-		    bintime_cmp(&dp->d_delay, &disk->d_delay) < 0) {
+		prio = dp->load;
+		/* If disk head is precisely in position - highly prefer it. */
+		if (dp->d_last_offset == bp->bio_offset)
+			prio -= 2 * LOAD_SCALE;
+		else
+		/* If disk head is close to position - prefer it. */
+		if (ABS(dp->d_last_offset - bp->bio_offset) < TRACK_SIZE)
+			prio -= 1 * LOAD_SCALE;
+		if (prio <= best) {
 			disk = dp;
+			best = prio;
 		}
 	}
 	KASSERT(disk != NULL, ("NULL disk for %s.", sc->sc_name));
@@ -1505,12 +1495,18 @@ g_mirror_request_load(struct g_mirror_so
 	cp = disk->d_consumer;
 	cbp->bio_done = g_mirror_done;
 	cbp->bio_to = cp->provider;
-	binuptime(&disk->d_last_used);
 	G_MIRROR_LOGREQ(3, cbp, "Sending request.");
 	KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1,
 	    ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr,
 	    cp->acw, cp->ace));
 	cp->index++;
+	/* Remember last head position */
+	disk->d_last_offset = bp->bio_offset + bp->bio_length;
+	/* Update loads. */
+	LIST_FOREACH(dp, &sc->sc_disks, d_next) {
+		dp->load = (dp->d_consumer->index * LOAD_SCALE +
+		    dp->load * 7) / 8;
+	}
 	g_io_request(cbp, cp);
 }
 

Modified: head/sys/geom/mirror/g_mirror.h
==============================================================================
--- head/sys/geom/mirror/g_mirror.h	Thu Dec  3 21:44:41 2009	(r200085)
+++ head/sys/geom/mirror/g_mirror.h	Thu Dec  3 21:47:51 2009	(r200086)
@@ -133,8 +133,8 @@ struct g_mirror_disk {
 	struct g_mirror_softc	*d_softc; /* Back-pointer to softc. */
 	int		 d_state;	/* Disk state. */
 	u_int		 d_priority;	/* Disk priority. */
-	struct bintime	 d_delay;	/* Disk delay. */
-	struct bintime	 d_last_used;	/* When disk was last used. */
+	u_int		 load;		/* Averaged queue length */
+	off_t		 d_last_offset;	/* Last read offset */
 	uint64_t	 d_flags;	/* Additional flags. */
 	u_int		 d_genid;	/* Disk's generation ID. */
 	struct g_mirror_disk_sync d_sync;/* Sync information. */
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
Comment 11 Alexander Motin freebsd_committer freebsd_triage 2009-12-03 21:51:12 UTC
State Changed
From-To: open->patched

Patch committed to the HEAD. 


Comment 12 Alexander Motin freebsd_committer freebsd_triage 2009-12-03 21:51:12 UTC
Responsible Changed
From-To: freebsd-geom->mav

Patch committed to the HEAD.
Comment 13 dfilter service freebsd_committer freebsd_triage 2009-12-08 23:24:04 UTC
Author: mav
Date: Tue Dec  8 23:23:45 2009
New Revision: 200285
URL: http://svn.freebsd.org/changeset/base/200285

Log:
  MFC r200086:
  Change 'load' balancing mode algorithm:
  - Instead of measuring last request execution time for each drive and
  choosing one with smallest time, use averaged number of requests, running
  on each drive. This information is more accurate and timely. It allows to
  distribute load between drives in more even and predictable way.
  - For each drive track offset of the last submitted request. If new request
  offset matches previous one or close for some drive, prefer that drive.
  It allows to significantly speedup simultaneous sequential reads.
  
  PR:             kern/113885

Modified:
  stable/8/sys/geom/mirror/g_mirror.c
  stable/8/sys/geom/mirror/g_mirror.h
Directory Properties:
  stable/8/sys/   (props changed)
  stable/8/sys/amd64/include/xen/   (props changed)
  stable/8/sys/cddl/contrib/opensolaris/   (props changed)
  stable/8/sys/contrib/dev/acpica/   (props changed)
  stable/8/sys/contrib/pf/   (props changed)
  stable/8/sys/dev/xen/xenpci/   (props changed)

Modified: stable/8/sys/geom/mirror/g_mirror.c
==============================================================================
--- stable/8/sys/geom/mirror/g_mirror.c	Tue Dec  8 23:15:48 2009	(r200284)
+++ stable/8/sys/geom/mirror/g_mirror.c	Tue Dec  8 23:23:45 2009	(r200285)
@@ -451,9 +451,6 @@ g_mirror_init_disk(struct g_mirror_softc
 	disk->d_id = md->md_did;
 	disk->d_state = G_MIRROR_DISK_STATE_NONE;
 	disk->d_priority = md->md_priority;
-	disk->d_delay.sec = 0;
-	disk->d_delay.frac = 0;
-	binuptime(&disk->d_last_used);
 	disk->d_flags = md->md_dflags;
 	if (md->md_provider[0] != '\0')
 		disk->d_flags |= G_MIRROR_DISK_FLAG_HARDCODED;
@@ -863,16 +860,6 @@ bintime_cmp(struct bintime *bt1, struct 
 }
 
 static void
-g_mirror_update_delay(struct g_mirror_disk *disk, struct bio *bp)
-{
-
-	if (disk->d_softc->sc_balance != G_MIRROR_BALANCE_LOAD)
-		return;
-	binuptime(&disk->d_delay);
-	bintime_sub(&disk->d_delay, &bp->bio_t0);
-}
-
-static void
 g_mirror_done(struct bio *bp)
 {
 	struct g_mirror_softc *sc;
@@ -904,8 +891,6 @@ g_mirror_regular_request(struct bio *bp)
 		g_topology_lock();
 		g_mirror_kill_consumer(sc, bp->bio_from);
 		g_topology_unlock();
-	} else {
-		g_mirror_update_delay(disk, bp);
 	}
 
 	pbp->bio_inbed++;
@@ -1465,30 +1450,35 @@ g_mirror_request_round_robin(struct g_mi
 	g_io_request(cbp, cp);
 }
 
+#define TRACK_SIZE  (1 * 1024 * 1024)
+#define LOAD_SCALE	256
+#define ABS(x)		(((x) >= 0) ? (x) : (-(x)))
+
 static void
 g_mirror_request_load(struct g_mirror_softc *sc, struct bio *bp)
 {
 	struct g_mirror_disk *disk, *dp;
 	struct g_consumer *cp;
 	struct bio *cbp;
-	struct bintime curtime;
+	int prio, best;
 
-	binuptime(&curtime);
-	/*
-	 * Find a disk which the smallest load.
-	 */
+	/* Find a disk with the smallest load. */
 	disk = NULL;
+	best = INT_MAX;
 	LIST_FOREACH(dp, &sc->sc_disks, d_next) {
 		if (dp->d_state != G_MIRROR_DISK_STATE_ACTIVE)
 			continue;
-		/* If disk wasn't used for more than 2 sec, use it. */
-		if (curtime.sec - dp->d_last_used.sec >= 2) {
-			disk = dp;
-			break;
-		}
-		if (disk == NULL ||
-		    bintime_cmp(&dp->d_delay, &disk->d_delay) < 0) {
+		prio = dp->load;
+		/* If disk head is precisely in position - highly prefer it. */
+		if (dp->d_last_offset == bp->bio_offset)
+			prio -= 2 * LOAD_SCALE;
+		else
+		/* If disk head is close to position - prefer it. */
+		if (ABS(dp->d_last_offset - bp->bio_offset) < TRACK_SIZE)
+			prio -= 1 * LOAD_SCALE;
+		if (prio <= best) {
 			disk = dp;
+			best = prio;
 		}
 	}
 	KASSERT(disk != NULL, ("NULL disk for %s.", sc->sc_name));
@@ -1505,12 +1495,18 @@ g_mirror_request_load(struct g_mirror_so
 	cp = disk->d_consumer;
 	cbp->bio_done = g_mirror_done;
 	cbp->bio_to = cp->provider;
-	binuptime(&disk->d_last_used);
 	G_MIRROR_LOGREQ(3, cbp, "Sending request.");
 	KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1,
 	    ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr,
 	    cp->acw, cp->ace));
 	cp->index++;
+	/* Remember last head position */
+	disk->d_last_offset = bp->bio_offset + bp->bio_length;
+	/* Update loads. */
+	LIST_FOREACH(dp, &sc->sc_disks, d_next) {
+		dp->load = (dp->d_consumer->index * LOAD_SCALE +
+		    dp->load * 7) / 8;
+	}
 	g_io_request(cbp, cp);
 }
 

Modified: stable/8/sys/geom/mirror/g_mirror.h
==============================================================================
--- stable/8/sys/geom/mirror/g_mirror.h	Tue Dec  8 23:15:48 2009	(r200284)
+++ stable/8/sys/geom/mirror/g_mirror.h	Tue Dec  8 23:23:45 2009	(r200285)
@@ -133,8 +133,8 @@ struct g_mirror_disk {
 	struct g_mirror_softc	*d_softc; /* Back-pointer to softc. */
 	int		 d_state;	/* Disk state. */
 	u_int		 d_priority;	/* Disk priority. */
-	struct bintime	 d_delay;	/* Disk delay. */
-	struct bintime	 d_last_used;	/* When disk was last used. */
+	u_int		 load;		/* Averaged queue length */
+	off_t		 d_last_offset;	/* Last read offset */
 	uint64_t	 d_flags;	/* Additional flags. */
 	u_int		 d_genid;	/* Disk's generation ID. */
 	struct g_mirror_disk_sync d_sync;/* Sync information. */
Comment 14 dfilter service freebsd_committer freebsd_triage 2009-12-08 23:34:47 UTC
Author: mav
Date: Tue Dec  8 23:34:34 2009
New Revision: 200286
URL: http://svn.freebsd.org/changeset/base/200286

Log:
  MFC r200086:
  Change 'load' balancing mode algorithm:
  - Instead of measuring last request execution time for each drive and
  choosing one with smallest time, use averaged number of requests, running
  on each drive. This information is more accurate and timely. It allows to
  distribute load between drives in more even and predictable way.
  - For each drive track offset of the last submitted request. If new request
  offset matches previous one or close for some drive, prefer that drive.
  It allows to significantly speedup simultaneous sequential reads.
  
  PR:             kern/113885

Modified:
  stable/7/sys/geom/mirror/g_mirror.c
  stable/7/sys/geom/mirror/g_mirror.h
Directory Properties:
  stable/7/sys/   (props changed)
  stable/7/sys/contrib/pf/   (props changed)

Modified: stable/7/sys/geom/mirror/g_mirror.c
==============================================================================
--- stable/7/sys/geom/mirror/g_mirror.c	Tue Dec  8 23:23:45 2009	(r200285)
+++ stable/7/sys/geom/mirror/g_mirror.c	Tue Dec  8 23:34:34 2009	(r200286)
@@ -451,9 +451,6 @@ g_mirror_init_disk(struct g_mirror_softc
 	disk->d_id = md->md_did;
 	disk->d_state = G_MIRROR_DISK_STATE_NONE;
 	disk->d_priority = md->md_priority;
-	disk->d_delay.sec = 0;
-	disk->d_delay.frac = 0;
-	binuptime(&disk->d_last_used);
 	disk->d_flags = md->md_dflags;
 	if (md->md_provider[0] != '\0')
 		disk->d_flags |= G_MIRROR_DISK_FLAG_HARDCODED;
@@ -863,16 +860,6 @@ bintime_cmp(struct bintime *bt1, struct 
 }
 
 static void
-g_mirror_update_delay(struct g_mirror_disk *disk, struct bio *bp)
-{
-
-	if (disk->d_softc->sc_balance != G_MIRROR_BALANCE_LOAD)
-		return;
-	binuptime(&disk->d_delay);
-	bintime_sub(&disk->d_delay, &bp->bio_t0);
-}
-
-static void
 g_mirror_done(struct bio *bp)
 {
 	struct g_mirror_softc *sc;
@@ -904,8 +891,6 @@ g_mirror_regular_request(struct bio *bp)
 		g_topology_lock();
 		g_mirror_kill_consumer(sc, bp->bio_from);
 		g_topology_unlock();
-	} else {
-		g_mirror_update_delay(disk, bp);
 	}
 
 	pbp->bio_inbed++;
@@ -1465,30 +1450,35 @@ g_mirror_request_round_robin(struct g_mi
 	g_io_request(cbp, cp);
 }
 
+#define TRACK_SIZE  (1 * 1024 * 1024)
+#define LOAD_SCALE	256
+#define ABS(x)		(((x) >= 0) ? (x) : (-(x)))
+
 static void
 g_mirror_request_load(struct g_mirror_softc *sc, struct bio *bp)
 {
 	struct g_mirror_disk *disk, *dp;
 	struct g_consumer *cp;
 	struct bio *cbp;
-	struct bintime curtime;
+	int prio, best;
 
-	binuptime(&curtime);
-	/*
-	 * Find a disk which the smallest load.
-	 */
+	/* Find a disk with the smallest load. */
 	disk = NULL;
+	best = INT_MAX;
 	LIST_FOREACH(dp, &sc->sc_disks, d_next) {
 		if (dp->d_state != G_MIRROR_DISK_STATE_ACTIVE)
 			continue;
-		/* If disk wasn't used for more than 2 sec, use it. */
-		if (curtime.sec - dp->d_last_used.sec >= 2) {
-			disk = dp;
-			break;
-		}
-		if (disk == NULL ||
-		    bintime_cmp(&dp->d_delay, &disk->d_delay) < 0) {
+		prio = dp->load;
+		/* If disk head is precisely in position - highly prefer it. */
+		if (dp->d_last_offset == bp->bio_offset)
+			prio -= 2 * LOAD_SCALE;
+		else
+		/* If disk head is close to position - prefer it. */
+		if (ABS(dp->d_last_offset - bp->bio_offset) < TRACK_SIZE)
+			prio -= 1 * LOAD_SCALE;
+		if (prio <= best) {
 			disk = dp;
+			best = prio;
 		}
 	}
 	KASSERT(disk != NULL, ("NULL disk for %s.", sc->sc_name));
@@ -1505,12 +1495,18 @@ g_mirror_request_load(struct g_mirror_so
 	cp = disk->d_consumer;
 	cbp->bio_done = g_mirror_done;
 	cbp->bio_to = cp->provider;
-	binuptime(&disk->d_last_used);
 	G_MIRROR_LOGREQ(3, cbp, "Sending request.");
 	KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1,
 	    ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr,
 	    cp->acw, cp->ace));
 	cp->index++;
+	/* Remember last head position */
+	disk->d_last_offset = bp->bio_offset + bp->bio_length;
+	/* Update loads. */
+	LIST_FOREACH(dp, &sc->sc_disks, d_next) {
+		dp->load = (dp->d_consumer->index * LOAD_SCALE +
+		    dp->load * 7) / 8;
+	}
 	g_io_request(cbp, cp);
 }
 

Modified: stable/7/sys/geom/mirror/g_mirror.h
==============================================================================
--- stable/7/sys/geom/mirror/g_mirror.h	Tue Dec  8 23:23:45 2009	(r200285)
+++ stable/7/sys/geom/mirror/g_mirror.h	Tue Dec  8 23:34:34 2009	(r200286)
@@ -133,8 +133,8 @@ struct g_mirror_disk {
 	struct g_mirror_softc	*d_softc; /* Back-pointer to softc. */
 	int		 d_state;	/* Disk state. */
 	u_int		 d_priority;	/* Disk priority. */
-	struct bintime	 d_delay;	/* Disk delay. */
-	struct bintime	 d_last_used;	/* When disk was last used. */
+	u_int		 load;		/* Averaged queue length */
+	off_t		 d_last_offset;	/* Last read offset */
 	uint64_t	 d_flags;	/* Additional flags. */
 	u_int		 d_genid;	/* Disk's generation ID. */
 	struct g_mirror_disk_sync d_sync;/* Sync information. */
Comment 15 Alexander Motin freebsd_committer freebsd_triage 2009-12-08 23:36:07 UTC
State Changed
From-To: patched->closed

Fix merged down to 7/8-STABLE.
Comment 16 freebsdpr 2010-01-05 08:48:35 UTC
I'm still not sure this is working as expected. For a sequential read it 
gets "stuck" on one drive for several seconds before switching (I counted 
over 30 seconds in one instance), rather than swapping back and forth as 
each drive leapfrogs each other. Why is the drive at 95% utilisation 
preferred over the one at 0% util? Is there perhaps a way to tell the OS 
that we're about to do a sequential read and position all drives in the 
set at identical positions to force them to all be used?

For random reads it is switching between the drives more rapidly (eg, 
"iostat -w 1 -x" shows activity on both drives for every second), but 
their respective loads never go above 50%, which is pretty much the same 
situation before I applied the patches... aggregate performance is no 
better than a single drive.

FYI I applied patch-5.diff to a 7.0R amd64 system, but had similar results 
with earlier versions on a 7.1R amd64 system.

Even though it doesn't seem to be working here - I appreciate the people 
who are looking into this. Thanks. :)
Comment 17 Alexander Motin freebsd_committer freebsd_triage 2010-01-05 09:08:43 UTC
freebsdpr wrote:
>  I'm still not sure this is working as expected. For a sequential read it 
>  gets "stuck" on one drive for several seconds before switching (I counted 
>  over 30 seconds in one instance), rather than swapping back and forth as 
>  each drive leapfrogs each other. Why is the drive at 95% utilisation 
>  preferred over the one at 0% util? Is there perhaps a way to tell the OS 
>  that we're about to do a sequential read and position all drives in the 
>  set at identical positions to force them to all be used?

It works exactly as expected: a sequential read uses only one drive at a
time to get the best effect from the drive's read-ahead cache. Jumping
between drives won't increase performance but will reduce it in most
cases: instead of reading, the drives will have to wait for a platter
revolution over data already read by the other drives. With several
simultaneous read streams the benefits of the current scheme become much
more significant.

If you wish to make all drives busy, you may try the 'split' method. But
for regular HDDs you won't see a benefit.

>  For random reads it is switching between the drives more rapidly (eg, 
>  "iostat -w 1 -x" shows activity on both drives for every second), but 
>  their respective loads never go above 50%, which is pretty much the same 
>  situation before I applied the patches... aggregate performance is no 
>  better than a single drive.

Probably you are reading in a single stream. The more requests you run
simultaneously, the better the results will be. To utilize all drives you
should have at least as many outstanding requests as there are drives in
the array.

-- 
Alexander Motin
Comment 18 freebsdpr 2010-03-22 00:06:02 UTC
On Tue, 5 Jan 2010, freebsdpr wrote:

> I'm still not sure this is working as expected. For a sequential read it gets 
> "stuck" on one drive for several seconds before switching (I counted over 30 
...

I can confirm this *is* working and that I was using the wrong approach 
for testing. I now have a 4-drive gmirror array for MySQL indexes (the 
requests are predominantly random reads with the occasional burst of 
writes), and running 6 concurrent MySQL query processes results in an 
average utilisation of 90%+ on each of the four drives.

iostat -w 60 -x ad12 ad14 ad16 ad18

                         extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad12      81.5  31.2  3410.0  1815.8    1 134.1  94
ad14      80.8  31.2  3405.4  1815.8    1  72.7  95
ad16      83.3  31.2  3488.7  1815.8    3 105.7  94
ad18      82.8  31.2  3502.1  1815.8    2 110.4  95

(writes look busier than they are - it's more like 50 Mbytes/sec for 2 
seconds out of the minute rather than 2 Mbytes every single second)

This is a real world application, not a synthetic test, so it is working 
very nicely. Well done!
Comment 19 Pawel Jakub Dawidek freebsd_committer freebsd_triage 2014-06-01 06:36:53 UTC
Responsible Changed
From-To: freebsd-geom->pjd

I'll handle this one.