gmirror has poor read-balancing algorithms, which prevents it from fully utilizing disk performance:

- 'prefer' is suitable only for asymmetric environments, such as one local and one network-mounted disk.
- 'load' is broken - the dumb 'round-robin' shows better results in all benchmarks.
- 'split' is completely senseless - it uses 'round-robin' for small requests and splits big requests across providers. But one disk clearly takes less time to complete one big request (equal to several small sequential requests) than two or more drives take to complete those several small requests, because even when there is no other activity, a single disk spends no time moving its actuator and never waits for platter rotation.
- 'round-robin' is the best of the four, but it is dumb and makes no attempt to select the disk that could complete a request fastest:
  - a single sequential read thread uses both providers at half speed (total speed is the same as for a single drive);
  - two simultaneous sequential read threads fight each other and use about 10% of the throughput, while each could use a separate provider at full speed;
  - additional threads aggravate the situation.

Fix: I have rewritten the 'load' algorithm. For each provider it remembers the bio_offset value of the last operation (read or write) and assumes that the preferable disk is the one with the smallest offset distance:

    |bio_offset - disk->last_offset|

It also remembers the time of the last read and tries to use the disk that has been idle longest. From all providers it selects the one with the smallest value of:

    offset_distance / (tunable_parameter + use_delay)

This algorithm is well suited to multithreaded environments (web servers, DB servers, ...). For example, it allows a web server to run at almost full speed while awstats (a log analyzer) is running, whereas 'round-robin' causes a 50% slowdown (bandwidth drops from 100Mbit/s to 50Mbit/s and stays at that level until awstats finishes its job).

Patch attached with submission follows:

How-To-Repeat: it is repeatable on all gmirror installations
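Sketched in C, the selection rule described above might look like the following (a minimal illustration of the formula only, not the actual patch; dp->last_offset, dp->last_used, best_dist, best_delay and the tunable are assumed names for this sketch):

	/*
	 * Choose the provider minimizing
	 *     offset_distance / (tunable + use_delay),
	 * i.e. prefer a disk whose head is near the request and which
	 * has been idle longest.  Comparing by cross-multiplication
	 * avoids the division.
	 */
	best = NULL;
	LIST_FOREACH(dp, &sc->sc_disks, d_next) {
		off_t dist = dp->last_offset - bp->bio_offset;
		uint64_t delay = tunable + (now - dp->last_used);

		if (dist < 0)
			dist = -dist;
		/*
		 * dist/delay < best_dist/best_delay  <=>
		 * dist * best_delay < best_dist * delay
		 */
		if (best == NULL || dist * best_delay < best_dist * delay) {
			best = dp;
			best_dist = dist;
			best_delay = delay;
		}
	}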
Responsible Changed From-To: freebsd-bugs->freebsd-geom Over to maintainer(s).
I have attached what I believe is a better version of your patch. It:

1) Fixes the type ambiguity of the new use_delay/best_use_delay and
   dist/best_dist variables, to match the variables used in their
   calculations; they should be uint64_t and off_t, respectively.
2) Uses bit shifts instead of multiplication/division in the use delay
   and distance calculations. The precision loss should be acceptable
   in this situation.
3) Cleans up the style of the code; adds more and better comments.
4) Gets rid of the g_mirror_disk.d_delay variable since it is no longer
   used, along with the function the original patch short-circuited.

In my testing, with 16 simultaneous processes performing the same test at the same time, by throughput, random reads/writes improved by about 35% (low variance), while sequential reads/writes improved by 100-400% (high variance). IOs also increased proportionally. Testing was done using "rawio -a -p 16 /dev/mirror/testa", where the test mirror was composed of two 160GB Seagate SATA disks and the system is a dual Opteron 246 with 1.5GB of RAM with no other load, running 8.0-CURRENT as of 12/1/2008. CPU usage impact vs. the old load algorithm appears negligible as well.

Regards,
Will.

--- gmirror.patch begins here ---
Index: g_mirror.c
===================================================================
--- g_mirror.c        (revision 185567)
+++ g_mirror.c        (working copy)
@@ -25,7 +25,7 @@
  */
 
 #include <sys/cdefs.h>
-__FBSDID("$FreeBSD$");
+__FBSDID("$FreeBSD: src/sys/geom/mirror/g_mirror.c,v 1.94 2007/10/20 23:23:19 julian Exp $");
 
 #include <sys/param.h>
 #include <sys/systm.h>
@@ -45,7 +45,6 @@
 #include <sys/sched.h>
 #include <geom/mirror/g_mirror.h>
 
-
 static MALLOC_DEFINE(M_MIRROR, "mirror_data", "GEOM_MIRROR Data");
 
 SYSCTL_DECL(_kern_geom);
@@ -71,7 +70,12 @@
 TUNABLE_INT("kern.geom.mirror.sync_requests", &g_mirror_syncreqs);
 SYSCTL_UINT(_kern_geom_mirror, OID_AUTO, sync_requests, CTLFLAG_RDTUN,
     &g_mirror_syncreqs, 0, "Parallel synchronization I/O requests.");
+static u_int g_mirror_plusdelay = 60000;
+TUNABLE_INT("kern.geom.mirror.plusdelay", &g_mirror_plusdelay);
+SYSCTL_UINT(_kern_geom_mirror, OID_AUTO, plusdelay, CTLFLAG_RW,
+    &g_mirror_plusdelay, 0, "Additional load delay in 1/65536ths of a second.");
+
 
 #define MSLEEP(ident, mtx, priority, wmesg, timeout) do {               \
         G_MIRROR_DEBUG(4, "%s: Sleeping %p.", __func__, (ident));       \
         msleep((ident), (mtx), (priority), (wmesg), (timeout));         \
@@ -451,8 +455,7 @@
         disk->d_id = md->md_did;
         disk->d_state = G_MIRROR_DISK_STATE_NONE;
         disk->d_priority = md->md_priority;
-        disk->d_delay.sec = 0;
-        disk->d_delay.frac = 0;
+        disk->last_offset = 0;
         binuptime(&disk->d_last_used);
         disk->d_flags = md->md_dflags;
         if (md->md_provider[0] != '\0')
@@ -863,16 +866,6 @@
 }
 
 static void
-g_mirror_update_delay(struct g_mirror_disk *disk, struct bio *bp)
-{
-
-        if (disk->d_softc->sc_balance != G_MIRROR_BALANCE_LOAD)
-                return;
-        binuptime(&disk->d_delay);
-        bintime_sub(&disk->d_delay, &bp->bio_t0);
-}
-
-static void
 g_mirror_done(struct bio *bp)
 {
         struct g_mirror_softc *sc;
@@ -904,8 +897,6 @@
                 g_topology_lock();
                 g_mirror_kill_consumer(sc, bp->bio_from);
                 g_topology_unlock();
-        } else {
-                g_mirror_update_delay(disk, bp);
         }
 
         pbp->bio_inbed++;
@@ -1472,25 +1463,45 @@
         struct g_consumer *cp;
         struct bio *cbp;
         struct bintime curtime;
+        off_t bio_offset = bp->bio_offset;
+        off_t best_dist = -1, dist;
+        uint64_t best_use_delay = 0, use_delay = 0;
 
-        binuptime(&curtime);
+        getbinuptime(&curtime);
         /*
-         * Find a disk which the smallest load.
+         * Find the disk which has the smallest ratio of distance to use
+         * delay, i.e. its head looks closest to bio_offset and it was used
+         * least recently.
          */
         disk = NULL;
         LIST_FOREACH(dp, &sc->sc_disks, d_next) {
                 if (dp->d_state != G_MIRROR_DISK_STATE_ACTIVE)
                         continue;
-                /* If disk wasn't used for more than 2 sec, use it. */
-                if (curtime.sec - dp->d_last_used.sec >= 2) {
+
+                dist = dp->last_offset - bio_offset;
+                if (dist < 0)
+                        dist = -dist;
+
+                /*
+                 * Calculate the use delay as follows: Add the sysctl
+                 * configured delay, then convert the bintime structure
+                 * in terms of 1/65536ths of a second before adding its
+                 * components. So multiply seconds difference by 65536
+                 * and drop all but the 16 most significant bits in the
+                 * fraction, since they're all greater than 1/65536.
+                 */
+                use_delay = g_mirror_plusdelay;
+                use_delay += ((curtime.sec - dp->d_last_used.sec) << 16);
+                use_delay += ((curtime.frac - dp->d_last_used.frac) >> 48);
+
+                if (best_dist == -1 ||
+                    dist * best_use_delay < best_dist * use_delay) {
                         disk = dp;
-                        break;
+                        best_dist = dist;
+                        best_use_delay = use_delay;
                 }
-                if (disk == NULL ||
-                    bintime_cmp(&dp->d_delay, &disk->d_delay) < 0) {
-                        disk = dp;
-                }
         }
+
         KASSERT(disk != NULL, ("NULL disk for %s.", sc->sc_name));
 
         cbp = g_clone_bio(bp);
         if (cbp == NULL) {
@@ -1505,7 +1516,8 @@
         cp = disk->d_consumer;
         cbp->bio_done = g_mirror_done;
         cbp->bio_to = cp->provider;
-        binuptime(&disk->d_last_used);
+        disk->d_last_used = curtime;
+        disk->last_offset = bio_offset;
         G_MIRROR_LOGREQ(3, cbp, "Sending request.");
         KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1,
             ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr,
@@ -1659,6 +1671,7 @@
                         g_io_deliver(bp, bp->bio_error);
                         return;
                 }
+                disk->last_offset = bp->bio_offset;
                 bioq_insert_tail(&queue, cbp);
                 cbp->bio_done = g_mirror_done;
                 cp = disk->d_consumer;
Index: g_mirror.h
===================================================================
--- g_mirror.h        (revision 185567)
+++ g_mirror.h        (working copy)
@@ -23,7 +23,7 @@
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  *
- * $FreeBSD$
+ * $FreeBSD: src/sys/geom/mirror/g_mirror.h,v 1.24 2006/11/01 22:51:49 pjd Exp $
  */
 
 #ifndef _G_MIRROR_H_
@@ -133,7 +133,7 @@
         struct g_mirror_softc *d_softc; /* Back-pointer to softc. */
         int              d_state;       /* Disk state. */
         u_int            d_priority;    /* Disk priority. */
-        struct bintime   d_delay;       /* Disk delay. */
+        off_t            last_offset;   /* LBA of last operation. */
         struct bintime   d_last_used;   /* When disk was last used. */
         uint64_t         d_flags;       /* Additional flags. */
         u_int            d_genid;       /* Disk's generation ID. */
--- gmirror.patch ends here ---
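For reference, the fixed-point conversion in the patch works because struct bintime stores a time as whole seconds plus a 64-bit binary fraction of a second: shifting the seconds difference left by 16 expresses it in 1/65536ths, while shifting the fraction difference right by 48 keeps only its 16 most significant bits, which are exactly the 1/65536ths of the fractional part. A worked example (the numbers are illustrative, not from the patch):

	/* Disk idle for 1.5 s, plusdelay at its default of 60000: */
	use_delay  = 60000;                      /* ~0.92 s of bias */
	use_delay += (uint64_t)1 << 16;          /* 1 s   -> 65536  */
	use_delay += ((uint64_t)1 << 63) >> 48;  /* 0.5 s -> 32768  */
	/* use_delay == 158304 units of 1/65536 s, i.e. ~2.42 s */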
Responsible Changed From-To: pjd->freebsd-geom pjd is not actively working on GEOM at the moment.
Will Andrews' patch is *fantastic*.

With this patch and gmirror set to "load" style balancing, I can run two long sequential reads in parallel and get practically linear scaling on a two-disk mirror.

Could someone please commit this?

--Emil
I was also surprised to discover that gmirror, regardless of the algorithm used, does not seem to offer either random or sequential read performance any better than a single drive. I have a new SATA backplane with individual drive activity indicators - with these you can easily see that the "load" algorithm seems to select (and stay on) only a single drive at a time, for anywhere between 0.1 and 1 seconds. Some simple testing confirmed that there is no discernible read performance benefit between 1 and >1 drives - so much for my 4-drive RAID1 idea! In comparison, a 5-drive graid3 array offers a sequential read speed of nearly 4 times a single drive... with read verify ON.

----

On to the "load" patch above - it doesn't seem to work for me. I thought it might have been because I had 4 drives in the array, but even after dropping back to 2 it still only reads from a *single* drive. Any ideas? I'm using 7.1R-amd64.

Geom name: db0
State: COMPLETE
Components: 2
Balance: load    <--- ***
The patch will not increase streaming read performance beyond what's possible with a single drive; it improves random read performance in cases where reads are localized in such a way that serving some of them from one drive and the rest from the other helps.

The reason there is no scalability in streaming read performance, versus what can be achieved with RAID0/3/5, is that there is no striping here. For example: if you need to read 4 striped blocks from a RAID0 of two drives, blocks 0 and 2 will be stored sequentially on the first drive and blocks 1 and 3 sequentially on the second drive, so reading the 4 blocks results in two sequential reads per drive. With RAID1, on the other hand, blocks 0 and 2 are stored with a "gap" between them (containing block 1) and cannot be read sequentially - a seek is needed. This is why e.g. the "split" method (which effectively does striping at the request level) doesn't help much with performance.
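A concrete layout example with two disks and blocks 0-3 of a file (simplified; one "block" here is one request-sized chunk):

	RAID0:  disk0: [blk0][blk2]        disk1: [blk1][blk3]
	        each disk delivers its two blocks in one sequential pass.

	RAID1:  disk0: [blk0][blk1][blk2][blk3]   disk1: identical copy
	        sending blk0/blk2 to disk0 and blk1/blk3 to disk1 leaves
	        a one-block gap in each disk's read pattern, so each disk
	        must wait for the platter to pass over the block it skips.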
> The patch will not increase streaming read performance beyond what's
> possible with a single drive[...]

I meant that after applying the patch it literally only reads from a single drive. "iostat -w 1 -x" shows it consistently using the last drive of the set for all reads.
So, where is the maintainer? Could somebody commit it? It is a very useful feature.
Here is another patch, which implements a different approach. Basically, it looks at the recently served requests and also at the queue length to decide where to send the next request.

http://sobomax.sippysoft.com/~sobomax/geom_mirror.diff

Regards,
-- 
Maksym Sobolyev
Sippy Software, Inc.
Internet Telephony (VoIP) Experts
T/F: +1-646-651-1110
Web: http://www.sippysoft.com
MSN: sales@sippysoft.com
Skype: SippySoft
Author: mav
Date: Thu Dec  3 21:47:51 2009
New Revision: 200086

URL: http://svn.freebsd.org/changeset/base/200086

Log:
  Change 'load' balancing mode algorithm:
  - Instead of measuring last request execution time for each drive and
    choosing one with smallest time, use averaged number of requests,
    running on each drive. This information is more accurate and timely.
    It allows to distribute load between drives in more even and
    predictable way.
  - For each drive track offset of the last submitted request. If new
    request offset matches previous one or close for some drive, prefer
    that drive. It allows to significantly speedup simultaneous
    sequential reads.

  PR:           kern/113885
  Reviewed by:  sobomax

Modified:
  head/sys/geom/mirror/g_mirror.c
  head/sys/geom/mirror/g_mirror.h

Modified: head/sys/geom/mirror/g_mirror.c
==============================================================================
--- head/sys/geom/mirror/g_mirror.c     Thu Dec  3 21:44:41 2009        (r200085)
+++ head/sys/geom/mirror/g_mirror.c     Thu Dec  3 21:47:51 2009        (r200086)
@@ -451,9 +451,6 @@ g_mirror_init_disk(struct g_mirror_softc
         disk->d_id = md->md_did;
         disk->d_state = G_MIRROR_DISK_STATE_NONE;
         disk->d_priority = md->md_priority;
-        disk->d_delay.sec = 0;
-        disk->d_delay.frac = 0;
-        binuptime(&disk->d_last_used);
         disk->d_flags = md->md_dflags;
         if (md->md_provider[0] != '\0')
                 disk->d_flags |= G_MIRROR_DISK_FLAG_HARDCODED;
@@ -863,16 +860,6 @@ bintime_cmp(struct bintime *bt1, struct 
 }
 
 static void
-g_mirror_update_delay(struct g_mirror_disk *disk, struct bio *bp)
-{
-
-        if (disk->d_softc->sc_balance != G_MIRROR_BALANCE_LOAD)
-                return;
-        binuptime(&disk->d_delay);
-        bintime_sub(&disk->d_delay, &bp->bio_t0);
-}
-
-static void
 g_mirror_done(struct bio *bp)
 {
         struct g_mirror_softc *sc;
@@ -904,8 +891,6 @@ g_mirror_regular_request(struct bio *bp)
                 g_topology_lock();
                 g_mirror_kill_consumer(sc, bp->bio_from);
                 g_topology_unlock();
-        } else {
-                g_mirror_update_delay(disk, bp);
         }
 
         pbp->bio_inbed++;
@@ -1465,30 +1450,35 @@ g_mirror_request_round_robin(struct g_mi
         g_io_request(cbp, cp);
 }
 
+#define TRACK_SIZE      (1 * 1024 * 1024)
+#define LOAD_SCALE      256
+#define ABS(x)          (((x) >= 0) ? (x) : (-(x)))
+
 static void
 g_mirror_request_load(struct g_mirror_softc *sc, struct bio *bp)
 {
         struct g_mirror_disk *disk, *dp;
         struct g_consumer *cp;
         struct bio *cbp;
-        struct bintime curtime;
+        int prio, best;
 
-        binuptime(&curtime);
-        /*
-         * Find a disk which the smallest load.
-         */
+        /* Find a disk with the smallest load. */
         disk = NULL;
+        best = INT_MAX;
         LIST_FOREACH(dp, &sc->sc_disks, d_next) {
                 if (dp->d_state != G_MIRROR_DISK_STATE_ACTIVE)
                         continue;
-                /* If disk wasn't used for more than 2 sec, use it. */
-                if (curtime.sec - dp->d_last_used.sec >= 2) {
-                        disk = dp;
-                        break;
-                }
-                if (disk == NULL ||
-                    bintime_cmp(&dp->d_delay, &disk->d_delay) < 0) {
+                prio = dp->load;
+                /* If disk head is precisely in position - highly prefer it. */
+                if (dp->d_last_offset == bp->bio_offset)
+                        prio -= 2 * LOAD_SCALE;
+                else
+                /* If disk head is close to position - prefer it. */
+                if (ABS(dp->d_last_offset - bp->bio_offset) < TRACK_SIZE)
+                        prio -= 1 * LOAD_SCALE;
+                if (prio <= best) {
                         disk = dp;
+                        best = prio;
                 }
         }
         KASSERT(disk != NULL, ("NULL disk for %s.", sc->sc_name));
@@ -1505,12 +1495,18 @@ g_mirror_request_load(struct g_mirror_so
         cp = disk->d_consumer;
         cbp->bio_done = g_mirror_done;
         cbp->bio_to = cp->provider;
-        binuptime(&disk->d_last_used);
         G_MIRROR_LOGREQ(3, cbp, "Sending request.");
         KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1,
             ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr,
             cp->acw, cp->ace));
         cp->index++;
+        /* Remember last head position */
+        disk->d_last_offset = bp->bio_offset + bp->bio_length;
+        /* Update loads. */
+        LIST_FOREACH(dp, &sc->sc_disks, d_next) {
+                dp->load = (dp->d_consumer->index * LOAD_SCALE +
+                    dp->load * 7) / 8;
+        }
         g_io_request(cbp, cp);
 }

Modified: head/sys/geom/mirror/g_mirror.h
==============================================================================
--- head/sys/geom/mirror/g_mirror.h     Thu Dec  3 21:44:41 2009        (r200085)
+++ head/sys/geom/mirror/g_mirror.h     Thu Dec  3 21:47:51 2009        (r200086)
@@ -133,8 +133,8 @@ struct g_mirror_disk {
         struct g_mirror_softc *d_softc; /* Back-pointer to softc. */
         int              d_state;       /* Disk state. */
         u_int            d_priority;    /* Disk priority. */
-        struct bintime   d_delay;       /* Disk delay. */
-        struct bintime   d_last_used;   /* When disk was last used. */
+        u_int            load;          /* Averaged queue length */
+        off_t            d_last_offset; /* Last read offset */
         uint64_t         d_flags;       /* Additional flags. */
         u_int            d_genid;       /* Disk's generation ID. */
         struct g_mirror_disk_sync d_sync;/* Sync information. */
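To put numbers on the committed heuristic: with LOAD_SCALE = 256, a disk's load is roughly 256 times its averaged count of outstanding requests, updated on every dispatch as an exponential moving average (7/8 old value, 1/8 current queue length). An exact head match subtracts 2 * LOAD_SCALE = 512 from the priority, so a disk continuing a sequential stream keeps winning until it is about two requests busier than an idle disk; a near match within TRACK_SIZE (1 MB) subtracts 256, worth about one request of queue depth. A worked example of the convergence (illustrative arithmetic using the commit's formula):

	/*
	 * dp->load = (index * LOAD_SCALE + dp->load * 7) / 8.
	 * With a steady queue length (index) of 2, load converges to
	 * the fixed point of load = (2*256 + load*7)/8, i.e. load = 512.
	 */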
State Changed From-To: open->patched Patch committed to the HEAD.
Responsible Changed From-To: freebsd-geom->mav Patch committed to the HEAD.
Author: mav
Date: Tue Dec  8 23:23:45 2009
New Revision: 200285

URL: http://svn.freebsd.org/changeset/base/200285

Log:
  MFC r200086:
  Change 'load' balancing mode algorithm:
  - Instead of measuring last request execution time for each drive and
    choosing one with smallest time, use averaged number of requests,
    running on each drive. This information is more accurate and timely.
    It allows to distribute load between drives in more even and
    predictable way.
  - For each drive track offset of the last submitted request. If new
    request offset matches previous one or close for some drive, prefer
    that drive. It allows to significantly speedup simultaneous
    sequential reads.

  PR:           kern/113885

Modified:
  stable/8/sys/geom/mirror/g_mirror.c
  stable/8/sys/geom/mirror/g_mirror.h
Directory Properties:
  stable/8/sys/   (props changed)
  stable/8/sys/amd64/include/xen/   (props changed)
  stable/8/sys/cddl/contrib/opensolaris/   (props changed)
  stable/8/sys/contrib/dev/acpica/   (props changed)
  stable/8/sys/contrib/pf/   (props changed)
  stable/8/sys/dev/xen/xenpci/   (props changed)

Modified: stable/8/sys/geom/mirror/g_mirror.c
==============================================================================
--- stable/8/sys/geom/mirror/g_mirror.c Tue Dec  8 23:15:48 2009        (r200284)
+++ stable/8/sys/geom/mirror/g_mirror.c Tue Dec  8 23:23:45 2009        (r200285)
@@ -451,9 +451,6 @@ g_mirror_init_disk(struct g_mirror_softc
         disk->d_id = md->md_did;
         disk->d_state = G_MIRROR_DISK_STATE_NONE;
         disk->d_priority = md->md_priority;
-        disk->d_delay.sec = 0;
-        disk->d_delay.frac = 0;
-        binuptime(&disk->d_last_used);
         disk->d_flags = md->md_dflags;
         if (md->md_provider[0] != '\0')
                 disk->d_flags |= G_MIRROR_DISK_FLAG_HARDCODED;
@@ -863,16 +860,6 @@ bintime_cmp(struct bintime *bt1, struct 
 }
 
 static void
-g_mirror_update_delay(struct g_mirror_disk *disk, struct bio *bp)
-{
-
-        if (disk->d_softc->sc_balance != G_MIRROR_BALANCE_LOAD)
-                return;
-        binuptime(&disk->d_delay);
-        bintime_sub(&disk->d_delay, &bp->bio_t0);
-}
-
-static void
 g_mirror_done(struct bio *bp)
 {
         struct g_mirror_softc *sc;
@@ -904,8 +891,6 @@ g_mirror_regular_request(struct bio *bp)
                 g_topology_lock();
                 g_mirror_kill_consumer(sc, bp->bio_from);
                 g_topology_unlock();
-        } else {
-                g_mirror_update_delay(disk, bp);
         }
 
         pbp->bio_inbed++;
@@ -1465,30 +1450,35 @@ g_mirror_request_round_robin(struct g_mi
         g_io_request(cbp, cp);
 }
 
+#define TRACK_SIZE      (1 * 1024 * 1024)
+#define LOAD_SCALE      256
+#define ABS(x)          (((x) >= 0) ? (x) : (-(x)))
+
 static void
 g_mirror_request_load(struct g_mirror_softc *sc, struct bio *bp)
 {
         struct g_mirror_disk *disk, *dp;
         struct g_consumer *cp;
         struct bio *cbp;
-        struct bintime curtime;
+        int prio, best;
 
-        binuptime(&curtime);
-        /*
-         * Find a disk which the smallest load.
-         */
+        /* Find a disk with the smallest load. */
         disk = NULL;
+        best = INT_MAX;
         LIST_FOREACH(dp, &sc->sc_disks, d_next) {
                 if (dp->d_state != G_MIRROR_DISK_STATE_ACTIVE)
                         continue;
-                /* If disk wasn't used for more than 2 sec, use it. */
-                if (curtime.sec - dp->d_last_used.sec >= 2) {
-                        disk = dp;
-                        break;
-                }
-                if (disk == NULL ||
-                    bintime_cmp(&dp->d_delay, &disk->d_delay) < 0) {
+                prio = dp->load;
+                /* If disk head is precisely in position - highly prefer it. */
+                if (dp->d_last_offset == bp->bio_offset)
+                        prio -= 2 * LOAD_SCALE;
+                else
+                /* If disk head is close to position - prefer it. */
+                if (ABS(dp->d_last_offset - bp->bio_offset) < TRACK_SIZE)
+                        prio -= 1 * LOAD_SCALE;
+                if (prio <= best) {
                         disk = dp;
+                        best = prio;
                 }
         }
         KASSERT(disk != NULL, ("NULL disk for %s.", sc->sc_name));
@@ -1505,12 +1495,18 @@ g_mirror_request_load(struct g_mirror_so
         cp = disk->d_consumer;
         cbp->bio_done = g_mirror_done;
         cbp->bio_to = cp->provider;
-        binuptime(&disk->d_last_used);
         G_MIRROR_LOGREQ(3, cbp, "Sending request.");
         KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1,
             ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr,
             cp->acw, cp->ace));
         cp->index++;
+        /* Remember last head position */
+        disk->d_last_offset = bp->bio_offset + bp->bio_length;
+        /* Update loads. */
+        LIST_FOREACH(dp, &sc->sc_disks, d_next) {
+                dp->load = (dp->d_consumer->index * LOAD_SCALE +
+                    dp->load * 7) / 8;
+        }
         g_io_request(cbp, cp);
 }

Modified: stable/8/sys/geom/mirror/g_mirror.h
==============================================================================
--- stable/8/sys/geom/mirror/g_mirror.h Tue Dec  8 23:15:48 2009        (r200284)
+++ stable/8/sys/geom/mirror/g_mirror.h Tue Dec  8 23:23:45 2009        (r200285)
@@ -133,8 +133,8 @@ struct g_mirror_disk {
         struct g_mirror_softc *d_softc; /* Back-pointer to softc. */
         int              d_state;       /* Disk state. */
         u_int            d_priority;    /* Disk priority. */
-        struct bintime   d_delay;       /* Disk delay. */
-        struct bintime   d_last_used;   /* When disk was last used. */
+        u_int            load;          /* Averaged queue length */
+        off_t            d_last_offset; /* Last read offset */
         uint64_t         d_flags;       /* Additional flags. */
         u_int            d_genid;       /* Disk's generation ID. */
         struct g_mirror_disk_sync d_sync;/* Sync information. */
Author: mav
Date: Tue Dec  8 23:34:34 2009
New Revision: 200286

URL: http://svn.freebsd.org/changeset/base/200286

Log:
  MFC r200086:
  Change 'load' balancing mode algorithm:
  - Instead of measuring last request execution time for each drive and
    choosing one with smallest time, use averaged number of requests,
    running on each drive. This information is more accurate and timely.
    It allows to distribute load between drives in more even and
    predictable way.
  - For each drive track offset of the last submitted request. If new
    request offset matches previous one or close for some drive, prefer
    that drive. It allows to significantly speedup simultaneous
    sequential reads.

  PR:           kern/113885

Modified:
  stable/7/sys/geom/mirror/g_mirror.c
  stable/7/sys/geom/mirror/g_mirror.h
Directory Properties:
  stable/7/sys/   (props changed)
  stable/7/sys/contrib/pf/   (props changed)

Modified: stable/7/sys/geom/mirror/g_mirror.c
==============================================================================
--- stable/7/sys/geom/mirror/g_mirror.c Tue Dec  8 23:23:45 2009        (r200285)
+++ stable/7/sys/geom/mirror/g_mirror.c Tue Dec  8 23:34:34 2009        (r200286)
@@ -451,9 +451,6 @@ g_mirror_init_disk(struct g_mirror_softc
         disk->d_id = md->md_did;
         disk->d_state = G_MIRROR_DISK_STATE_NONE;
         disk->d_priority = md->md_priority;
-        disk->d_delay.sec = 0;
-        disk->d_delay.frac = 0;
-        binuptime(&disk->d_last_used);
         disk->d_flags = md->md_dflags;
         if (md->md_provider[0] != '\0')
                 disk->d_flags |= G_MIRROR_DISK_FLAG_HARDCODED;
@@ -863,16 +860,6 @@ bintime_cmp(struct bintime *bt1, struct 
 }
 
 static void
-g_mirror_update_delay(struct g_mirror_disk *disk, struct bio *bp)
-{
-
-        if (disk->d_softc->sc_balance != G_MIRROR_BALANCE_LOAD)
-                return;
-        binuptime(&disk->d_delay);
-        bintime_sub(&disk->d_delay, &bp->bio_t0);
-}
-
-static void
 g_mirror_done(struct bio *bp)
 {
         struct g_mirror_softc *sc;
@@ -904,8 +891,6 @@ g_mirror_regular_request(struct bio *bp)
                 g_topology_lock();
                 g_mirror_kill_consumer(sc, bp->bio_from);
                 g_topology_unlock();
-        } else {
-                g_mirror_update_delay(disk, bp);
         }
 
         pbp->bio_inbed++;
@@ -1465,30 +1450,35 @@ g_mirror_request_round_robin(struct g_mi
         g_io_request(cbp, cp);
 }
 
+#define TRACK_SIZE      (1 * 1024 * 1024)
+#define LOAD_SCALE      256
+#define ABS(x)          (((x) >= 0) ? (x) : (-(x)))
+
 static void
 g_mirror_request_load(struct g_mirror_softc *sc, struct bio *bp)
 {
         struct g_mirror_disk *disk, *dp;
         struct g_consumer *cp;
         struct bio *cbp;
-        struct bintime curtime;
+        int prio, best;
 
-        binuptime(&curtime);
-        /*
-         * Find a disk which the smallest load.
-         */
+        /* Find a disk with the smallest load. */
         disk = NULL;
+        best = INT_MAX;
         LIST_FOREACH(dp, &sc->sc_disks, d_next) {
                 if (dp->d_state != G_MIRROR_DISK_STATE_ACTIVE)
                         continue;
-                /* If disk wasn't used for more than 2 sec, use it. */
-                if (curtime.sec - dp->d_last_used.sec >= 2) {
-                        disk = dp;
-                        break;
-                }
-                if (disk == NULL ||
-                    bintime_cmp(&dp->d_delay, &disk->d_delay) < 0) {
+                prio = dp->load;
+                /* If disk head is precisely in position - highly prefer it. */
+                if (dp->d_last_offset == bp->bio_offset)
+                        prio -= 2 * LOAD_SCALE;
+                else
+                /* If disk head is close to position - prefer it. */
+                if (ABS(dp->d_last_offset - bp->bio_offset) < TRACK_SIZE)
+                        prio -= 1 * LOAD_SCALE;
+                if (prio <= best) {
                         disk = dp;
+                        best = prio;
                 }
         }
         KASSERT(disk != NULL, ("NULL disk for %s.", sc->sc_name));
@@ -1505,12 +1495,18 @@ g_mirror_request_load(struct g_mirror_so
         cp = disk->d_consumer;
         cbp->bio_done = g_mirror_done;
         cbp->bio_to = cp->provider;
-        binuptime(&disk->d_last_used);
         G_MIRROR_LOGREQ(3, cbp, "Sending request.");
         KASSERT(cp->acr >= 1 && cp->acw >= 1 && cp->ace >= 1,
             ("Consumer %s not opened (r%dw%de%d).", cp->provider->name, cp->acr,
             cp->acw, cp->ace));
         cp->index++;
+        /* Remember last head position */
+        disk->d_last_offset = bp->bio_offset + bp->bio_length;
+        /* Update loads. */
+        LIST_FOREACH(dp, &sc->sc_disks, d_next) {
+                dp->load = (dp->d_consumer->index * LOAD_SCALE +
+                    dp->load * 7) / 8;
+        }
         g_io_request(cbp, cp);
 }

Modified: stable/7/sys/geom/mirror/g_mirror.h
==============================================================================
--- stable/7/sys/geom/mirror/g_mirror.h Tue Dec  8 23:23:45 2009        (r200285)
+++ stable/7/sys/geom/mirror/g_mirror.h Tue Dec  8 23:34:34 2009        (r200286)
@@ -133,8 +133,8 @@ struct g_mirror_disk {
         struct g_mirror_softc *d_softc; /* Back-pointer to softc. */
         int              d_state;       /* Disk state. */
         u_int            d_priority;    /* Disk priority. */
-        struct bintime   d_delay;       /* Disk delay. */
-        struct bintime   d_last_used;   /* When disk was last used. */
+        u_int            load;          /* Averaged queue length */
+        off_t            d_last_offset; /* Last read offset */
         uint64_t         d_flags;       /* Additional flags. */
         u_int            d_genid;       /* Disk's generation ID. */
         struct g_mirror_disk_sync d_sync;/* Sync information. */
State Changed From-To: patched->closed Fix merged down to 7/8-STABLE.
I'm still not sure this is working as expected. For a sequential read it gets "stuck" on one drive for several seconds before switching (I counted over 30 seconds in one instance), rather than swapping back and forth as the drives leapfrog each other. Why is the drive at 95% utilisation preferred over the one at 0%? Is there perhaps a way to tell the OS that we're about to do a sequential read and position all drives in the set at identical positions, to force them all to be used?

For random reads it switches between the drives more rapidly (e.g. "iostat -w 1 -x" shows activity on both drives every second), but their respective loads never go above 50%, which is pretty much the same situation as before I applied the patches... aggregate performance is no better than a single drive.

FYI, I applied patch-5.diff to a 7.0R amd64 system, but had similar results with earlier versions on a 7.1R amd64 system.

Even though it doesn't seem to be working here, I appreciate the people who are looking into this. Thanks. :)
freebsdpr wrote:
> I'm still not sure this is working as expected. For a sequential read it
> gets "stuck" on one drive for several seconds before switching (I counted
> over 30 seconds in one instance), rather than swapping back and forth as
> each drive leapfrogs each other. Why is the drive at 95% utilisation
> preferred over the one at 0% util? Is there perhaps a way to tell the OS
> that we're about to do a sequential read and position all drives in the
> set at identical positions to force them to all be used?

It works exactly as expected: a sequential read uses only one drive at a time to get the best effect from the drive's read-ahead cache. Jumping between drives won't increase performance; in most cases it reduces it, because instead of reading, the drives have to wait for a platter revolution over data already read by the other drives. With several simultaneous read streams the benefits of the current scheme are much more significant. If you wish to make all drives busy, you may try the 'split' method, but for regular HDDs you won't get a benefit.

> For random reads it is switching between the drives more rapidly (eg,
> "iostat -w 1 -x" shows activity on both drives for every second), but
> their respective loads never go above 50%, which is pretty much the same
> situation before I applied the patches... aggregate performance is no
> better than a single drive.

Probably you are reading in a single stream. The more requests you run simultaneously, the better the results will be. To utilize all drives you need at least as many outstanding requests as there are drives in the array.

-- 
Alexander Motin
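One simple way to generate several independent sequential read streams for such a test (an illustrative command line, not part of the original thread; the device name is taken from the report above and the offsets are chosen so the streams don't overlap):

	# four concurrent readers, each starting 10 GB apart
	for i in 0 1 2 3; do
		dd if=/dev/mirror/db0 of=/dev/null bs=1m skip=$((i * 10240)) &
	done
	wait

Watching "iostat -x 1" in another terminal while this runs should show every member drive busy with its own stream.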
On Tue, 5 Jan 2010, freebsdpr wrote:
> I'm still not sure this is working as expected. For a sequential read it gets
> "stuck" on one drive for several seconds before switching (I counted over 30
...

I can confirm this *is* working and that I was using the wrong approach for testing. I now have a 4-drive gmirror array for MySQL indexes (the requests are predominantly random reads with the occasional burst of writes), and running 6 concurrent MySQL query processes results in an average utilisation of 90%+ on each of the four drives.

iostat -w 60 -x ad12 ad14 ad16 ad18
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad12      81.5  31.2  3410.0  1815.8    1 134.1  94
ad14      80.8  31.2  3405.4  1815.8    1  72.7  95
ad16      83.3  31.2  3488.7  1815.8    3 105.7  94
ad18      82.8  31.2  3502.1  1815.8    2 110.4  95

(Writes look busier than they are - it's more like 50 Mbytes/sec for 2 seconds out of the minute rather than 2 Mbytes every single second.)

This is a real-world application, not a synthetic test, so it is working very nicely. Well done!
Responsible Changed From-To: freebsd-geom->pjd I'll handle this one.