Bug 79035 - [vinum] gvinum unable to create a striped set of mirrored sets/plexes
Summary: [vinum] gvinum unable to create a striped set of mirrored sets/plexes
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: Unspecified
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: freebsd-geom (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-03-20 02:10 UTC by sven
Modified: 2020-10-28 06:08 UTC
CC List: 1 user

See Also:


Attachments

Description sven 2005-03-20 02:10:01 UTC
Under the current implementation of gvinum it is possible to create a mirrored set of striped plexes but not a striped set of mirrored plexes. For purposes of resiliency the latter configuration is preferred as illustrated by the following example:

Use 6 disks to create one of 2 different scenarios.

1) Using the current abilities of gvinum, create 2 striped sets using 3 disks each: A1 A2 A3 and B1 B2 B3, then create a mirror of those 2 sets such that A(123) mirrors B(123). In this situation, if any drive in Set A fails, one still has a working set with Set B. If any drive now fails in Set B, the system is shot.

2) Using the proposed added ability, create 3 mirror sets: A1 and B1, A2 and B2, A3 and B3. Now create a stripe set across all three mirrors. Now we can have a situation where one of the "A" drives fails (for example A1). Then one of the "B" drives can also fail and, as long as it is not "B1" in this case, we still have a functioning array.

Thus the striping of mirrors (rather than a mirror of striped sets) is a more resilient and fault-tolerant setup of a multi-disk array.
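
For concreteness, the scenario 1 layout (the one gvinum can already build) would look roughly like the following create-file; the device names, volume name, stripe size and subdisk lengths here are only placeholders:

# placeholder device names and sizes -- adjust to the real disks
drive a1 device /dev/da1s1h
drive a2 device /dev/da2s1h
drive a3 device /dev/da3s1h
drive b1 device /dev/da4s1h
drive b2 device /dev/da5s1h
drive b3 device /dev/da6s1h
volume raid10
  plex org striped 512k
    sd length 0 drive a1
    sd length 0 drive a2
    sd length 0 drive a3
  plex org striped 512k
    sd length 0 drive b1
    sd length 0 drive b2
    sd length 0 drive b3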
Comment 1 Greg Lehey freebsd_committer 2005-03-20 03:11:01 UTC
[Format recovered--see http://www.lemis.com/email/email-format.html]

Single line paragraphs.  Please limit your lines to < 80 characters.

On Sunday, 20 March 2005 at  2:04:34 +0000, Sven Willenberger wrote:
>
> Under the current implementation of gvinum it is possible to create
> a mirrored set of striped plexes but not a striped set of mirrored
> plexes. For purposes of resiliency the latter configuration is
> preferred as illustrated by the following example:
>
> Use 6 disks to create one of 2 different scenarios.
>
> 1) Using the current abilities of gvinum create 2 striped sets using
> 3 disks each: A1 A2 A3 and B1 B2 B3 then create a mirror of those 2
> sets such that A(123) mirrors B(123). In this situation if any drive
> in Set A fails, one still has a working set with Set B. If any drive
> now fails in Set B, the system is shot.

No, this is not correct.  The plex ("set") only fails when all drives
in it fail.

> 2) Using the proposed added ability to create 3 mirror sets A1 and
> B1, A2 and B2, A3 and B3. Now create a stripe set across all three
> mirrors. Now we can have a situation where one of the "A" drives
> fail (for example A1). Then we can also have one of the "B" drives
> fail and, as long as it is not "B1" in this case, we still have a
> functioning array.

Agreed.  So there's no difference.

> Thus the striping of mirrors (rather than a mirror of striped sets)
> is a more resilient and fault-tolerant setup of a multi-disk array.

No, you're misunderstanding the current implementation.

This is a change request, so I'm not closing (or even assigning to
myself) the PR.

Greg
--
When replying to this message, please take care not to mutilate the
original text.
For more information, see http://www.lemis.com/email.html
Finger grog@FreeBSD.org for PGP public key.
See complete headers for address and phone numbers.
Comment 2 sven 2005-03-20 04:43:00 UTC
Greg 'groggy' Lehey presumably uttered the following on 03/19/05 22:11:
> [Format recovered--see http://www.lemis.com/email/email-format.html]
> 
> Single line paragraphs.  Please limit your lines to < 80 characters.

Sorry, used the web interface ... didn't realize it would not 
automatically wrap.

> 
> On Sunday, 20 March 2005 at  2:04:34 +0000, Sven Willenberger wrote:
> 
>>Under the current implementation of gvinum it is possible to create
>>a mirrored set of striped plexes but not a striped set of mirrored
>>plexes. For purposes of resiliency the latter configuration is
>>preferred as illustrated by the following example:
>>
>>Use 6 disks to create one of 2 different scenarios.
>>
>>1) Using the current abilities of gvinum create 2 striped sets using
>>3 disks each: A1 A2 A3 and B1 B2 B3 then create a mirror of those 2
>>sets such that A(123) mirrors B(123). In this situation if any drive
>>in Set A fails, one still has a working set with Set B. If any drive
>>now fails in Set B, the system is shot.
> 
> 
> No, this is not correct.  The plex ("set") only fails when all drives
> in it fail.
> 

I hope the following diagrams better illustrate what I was trying to
point out. Data is striped across all the A's, and that stripe is
mirrored to the B stripe:

   __stripe__
__|___|____|__
| A1  A2  A3 |--|m
|____________|  |i
                |r
   __stripe__   |r
__|___|____|__  |o
| B1  B2  B3 |--|r
|____________|

If A1 fails, then the A stripe set cannot function (much like RAID 0,
where one failed disk fails the set), meaning that B is now the array:

   __stripe__
__|___|____|__
|     A2  A3 | ==> fails
|____________|      |
                     |
                   --X--
   __stripe__        |
__|___|____|__      |
| B1  B2  B3 | ==> remains
|____________|

If any B disk now fails, then the B stripe set also fails, leaving no
functioning part of the mirror:

   __stripe__
__|___|____|__
|     A2  A3 | ==> fails
|____________|      |
                     |
                   --X--
   __stripe__        |
__|___|____|__      |
| B1      B3 | ==> fails
|____________|


Unless I am misunderstanding and gvinum somehow rebuilds the A stripe 
over A2 and A3 if A1 fails.

> 
>>2) Using the proposed added ability to create 3 mirror sets A1 and
>>B1, A2 and B2, A3 and B3. Now create a stripe set across all three
>>mirrors. Now we can have a situation where one of the "A" drives
>>fail (for example A1). Then we can also have one of the "B" drives
>>fail and, as long as it is not "B1" in this case, we still have a
>>functioning array.
> 
> 
> Agreed.  So there's no difference.
>

   _____stripe_____
__|__  __|___  __|___
| A1 | | A2 |  | A3 |
| B1 | | B2 |  | B3 |
|____| |____|  |____|

Now if A1 fails, the B1 part of the mirror can still participate in
the stripe:
   _____stripe____
__|__  __|__  __|__
|    | | A2 | | A3 |
| B1 | | B2 | | B3 |
|____| |____| |____|

Likewise, if either B2 or B3 now fails, we still have a functioning stripe:
   _____stripe____
__|__  __|__  __|__
|    | | A2 | | A3 |
| B1 | |    | | B3 |
|____| |____| |____|

At this point either A3 or B3 could also fail and we would still have
a functioning stripe set.

> 
>>Thus the striping of mirrors (rather than a mirror of striped sets)
>>is a more resilient and fault-tolerant setup of a multi-disk array.
> 
> 
> No, you're misunderstanding the current implementation.

Perhaps I am ... but unless gvinum somehow reconstructs a 3 disk stripe
into a 2 disk stripe in the event one disk fails, I am not sure how. The
resiliency difference shows up with a 2-disk failure. Even in a 4 disk
scenario, the mirror of stripes survives only 2 of the 6 possible 2-disk
failures (both failures landing in the same stripe set), while the
stripe of mirrors survives 4 of 6 (any pair except the two disks of the
same mirror).

> 
> This is a change request, so I'm not closing (or even assigning to
> myself) the PR.
>

Fair enough ... I would just like to see the stripe-of-mirrors setup
common to hardware RAID solutions become a configuration option for
gvinum (or understand why my interpretation above is incorrect), so per
your original advice I submitted this PR.

Sven Willenberger
Comment 3 Greg Lehey freebsd_committer 2005-03-20 05:21:33 UTC
On Saturday, 19 March 2005 at 23:43:00 -0500, Sven Willenberger wrote:
> Greg 'groggy' Lehey presumably uttered the following on 03/19/05 22:11:
>> On Sunday, 20 March 2005 at  2:04:34 +0000, Sven Willenberger wrote:
>>
>>> Under the current implementation of gvinum it is possible to create
>>> a mirrored set of striped plexes but not a striped set of mirrored
>>> plexes. For purposes of resiliency the latter configuration is
>>> preferred as illustrated by the following example:
>>>
>>> Use 6 disks to create one of 2 different scenarios.
>>>
>>> 1) Using the current abilities of gvinum create 2 striped sets using
>>> 3 disks each: A1 A2 A3 and B1 B2 B3 then create a mirror of those 2
>>> sets such that A(123) mirrors B(123). In this situation if any drive
>>> in Set A fails, one still has a working set with Set B. If any drive
>>> now fails in Set B, the system is shot.
>>
>> No, this is not correct.  The plex ("set") only fails when all drives
>> in it fail.
>
> I hope the following diagrams better illustrate what I was trying to
> point out. Data striped across all the A's and that is mirrored to the B
> Stripes:
>
> ...
>
> If A1 fails, then the A Stripe set cannot function (much like in Raid 0,
> one disk fails the set) meaning that B now is the array:

No, this is not correct.

>>> Thus the striping of mirrors (rather than a mirror of striped sets)
>>> is a more resilient and fault-tolerant setup of a multi-disk array.
>>
>> No, you're misunderstanding the current implementation.
>
> Perhaps I am ... but unless gvinum somehow reconstructs a 3 disk stripe
> into a 2 disk stripe in the event one disk fails, I am now sure how.

Well, you have the source code.  It's not quite the way you look at
it.  It doesn't have stripes: it has plexes.  And they can be
incomplete.  If a read to a plex hits a "hole", it automatically
retries via (possibly all) the other plexes.  Only when all plexes
have a hole in the same place does the transfer fail.

You might like to (re)read http://www.vinumvm.org/vinum/intro.html.

Greg
--
See complete headers for address and phone numbers.
Comment 4 sven 2005-03-20 06:00:29 UTC
Greg 'groggy' Lehey presumably uttered the following on 03/20/05 00:21:
> On Saturday, 19 March 2005 at 23:43:00 -0500, Sven Willenberger wrote:
> 
>>Greg 'groggy' Lehey presumably uttered the following on 03/19/05 22:11:
>>
>>>On Sunday, 20 March 2005 at  2:04:34 +0000, Sven Willenberger wrote:
>>>
>>>
>>>>Under the current implementation of gvinum it is possible to create
>>>>a mirrored set of striped plexes but not a striped set of mirrored
>>>>plexes. For purposes of resiliency the latter configuration is
>>>>preferred as illustrated by the following example:
>>>>
>>>>Use 6 disks to create one of 2 different scenarios.
>>>>
>>>>1) Using the current abilities of gvinum create 2 striped sets using
>>>>3 disks each: A1 A2 A3 and B1 B2 B3 then create a mirror of those 2
>>>>sets such that A(123) mirrors B(123). In this situation if any drive
>>>>in Set A fails, one still has a working set with Set B. If any drive
>>>>now fails in Set B, the system is shot.
>>>
>>>No, this is not correct.  The plex ("set") only fails when all drives
>>>in it fail.
>>
>>I hope the following diagrams better illustrate what I was trying to
>>point out. Data striped across all the A's and that is mirrored to the B
>>Stripes:
>>
>>...
>>
>>If A1 fails, then the A Stripe set cannot function (much like in Raid 0,
>>one disk fails the set) meaning that B now is the array:
> 
> 
> No, this is not correct.
> 
> 
>>>>Thus the striping of mirrors (rather than a mirror of striped sets)
>>>>is a more resilient and fault-tolerant setup of a multi-disk array.
>>>
>>>No, you're misunderstanding the current implementation.
>>
>>Perhaps I am ... but unless gvinum somehow reconstructs a 3 disk stripe
>>into a 2 disk stripe in the event one disk fails, I am now sure how.
> 
> 
> Well, you have the source code.  It's not quite the way you look at
> it.  It doesn't have stripes: it has plexes.  And they can be
> incomplete.  If a read to a plex hits a "hole", it automatically
> retries via (possibly all) the other plexes.  Only when all plexes
> have a hole in the same place does the transfer fail.
> 
> You might like to (re)read http://www.vinumvm.org/vinum/intro.html.
> 
> Greg
> --
> See complete headers for address and phone numbers.

I guess I just needed someone to come out and say what you just said.
Rereading the manual led me to the point of confusion that brought me
to this question in the first place. Quoting (from The-Big-Picture):

"    *

       Although a plex represents the complete data of a volume, it is 
possible for parts of the representation to be physically missing, 
either by design (by not defining a subdisk for parts of the plex) or by 
accident (as a result of the failure of a drive).
     * A volume is a collection of between one and eight plexes. Each 
plex represents the data in the volume, so more than one plex provides 
mirroring. As long as at least one plex can provide the data for the 
complete address range of the volume, the volume is fully functional."

The first bullet would seem to imply partial plexes are OK. However,
the last sentence (which implies that one plex needs to provide the
data for the complete address range) suggests that the volume still
needs at least one complete plex in order to function. I could not find
any indication that it could combine "partial" plexes into a fully
functioning volume, so I am glad you pointed that out. This would
indicate that the solution I seek is already available (and now I can
test this :) )

Sven
Comment 5 Mark Linimon freebsd_committer freebsd_triage 2005-04-03 09:31:10 UTC
Responsible Changed
From-To: freebsd-bugs->le

Over to maintainer.
Comment 6 sven 2005-04-08 23:28:10 UTC
On Sun, 2005-03-20 at 15:51 +1030, Greg 'groggy' Lehey wrote:
> On Saturday, 19 March 2005 at 23:43:00 -0500, Sven Willenberger wrote:
> > Greg 'groggy' Lehey presumably uttered the following on 03/19/05 22:11:
> >> On Sunday, 20 March 2005 at  2:04:34 +0000, Sven Willenberger wrote:
> >>
> >>> Under the current implementation of gvinum it is possible to create
> >>> a mirrored set of striped plexes but not a striped set of mirrored
> >>> plexes. For purposes of resiliency the latter configuration is
> >>> preferred as illustrated by the following example:
> >>>
> >>> Use 6 disks to create one of 2 different scenarios.
> >>>
> >>> 1) Using the current abilities of gvinum create 2 striped sets using
> >>> 3 disks each: A1 A2 A3 and B1 B2 B3 then create a mirror of those 2
> >>> sets such that A(123) mirrors B(123). In this situation if any drive
> >>> in Set A fails, one still has a working set with Set B. If any drive
> >>> now fails in Set B, the system is shot.
> >>
> >> No, this is not correct.  The plex ("set") only fails when all drives
> >> in it fail.
> >
> > I hope the following diagrams better illustrate what I was trying to
> > point out. Data striped across all the A's and that is mirrored to the B
> > Stripes:
> >
> > ...
> >
> > If A1 fails, then the A Stripe set cannot function (much like in Raid 0,
> > one disk fails the set) meaning that B now is the array:
> 
> No, this is not correct.
> 
> >>> Thus the striping of mirrors (rather than a mirror of striped sets)
> >>> is a more resilient and fault-tolerant setup of a multi-disk array.
> >>
> >> No, you're misunderstanding the current implementation.
> >
> > Perhaps I am ... but unless gvinum somehow reconstructs a 3 disk stripe
> > into a 2 disk stripe in the event one disk fails, I am now sure how.
> 
> Well, you have the source code.  It's not quite the way you look at
> it.  It doesn't have stripes: it has plexes.  And they can be
> incomplete.  If a read to a plex hits a "hole", it automatically
> retries via (possibly all) the other plexes.  Only when all plexes
> have a hole in the same place does the transfer fail.
> 
> You might like to (re)read http://www.vinumvm.org/vinum/intro.html.
> 

I was really hoping that the "holes in the plex" behavior was going
to work, but my tests have shown otherwise. I created a gvinum array
consisting of (A striped B) mirrored with (C striped D), which is the
only such mirror/stripe combination gvinum allows for four drives. We
have:

_________
| A   B |__
|_______|  |
           |Mirror
_________  |
| C   D |--|
|_______|

Based on what the "plex hole" theory states, drive A and drive D could
both fail and the system would read through the holes and pick up data
from B and C (or the converse if B and C failed), functionally
equivalent to a stripe of mirrors. To fail a drive I rebooted
single-user, dd'd /dev/zero over the beginning of the disk and then
re-ran fdisk (sketched below).
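
(The failure simulation amounted to something along these lines, run
from single-user mode; da2 and the dd count are only examples:)

# overwrite the start of the "failed" disk so gvinum loses its metadata,
# then re-initialize the slice table; da2 is a placeholder device
dd if=/dev/zero of=/dev/da2 bs=512 count=1024
fdisk -I da2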

drive d device /dev/da4s1h
drive c device /dev/da3s1h
drive b device /dev/da2s1h
drive a device /dev/da1s1h
volume home
plex name home.p1 org striped 960s vol home
plex name home.p0 org striped 960s vol home
sd name home.p1.s1 drive d len 71681280s driveoffset 265s plex home.p1 plexoffset 960s
sd name home.p1.s0 drive c len 71681280s driveoffset 265s plex home.p1 plexoffset 0s
sd name home.p0.s1 drive b len 71681280s driveoffset 265s plex home.p0 plexoffset 960s
sd name home.p0.s0 drive a len 71681280s driveoffset 265s plex home.p0 plexoffset 0s
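
(For reference, a description like the above is normally loaded from a
file and the resulting state inspected roughly as follows; the file
name is arbitrary:)

gvinum create /root/home.conf   # read the description file shown above
gvinum list                     # show drives, volume, plexes, subdisks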

In my case:        Fail B     Fail B and C
A = /dev/da1s1h      up          up
B = /dev/da2s1h      down        down
C = /dev/da3s1h      up          down
D = /dev/da4s1h      up          up

1 Volume
V home2              up          down (!)

2 Plexes
P home.p0 (A and B)  down        down
P home.p1 (C and D)  up          down

4 Subdisks
S home.p0.s0 (A)     up          up
S home.p0.s1 (B)     down        down
S home.p1.s0 (C)     up          down
S home.p1.s1 (D)     up          up

Based on this, failing the one drive did in fact fail the plex
(home.p0). Although at that point I realized that failing either drive
on the other plex would also fail that plex, and with it the volume, I
went ahead and failed drive C as well. The result was a failed volume.

With the failed B drive, once I bsdlabeled the disk to include the
vinum slice, I got the message that the plex was now stale (instead of
down). A simple gvinum start home2 changed the state to degraded and
the system rebuilt the array. When both drives failed I had to work a
bit of a kludge in: I ran gvinum setstate -f up home.p1.s0, then gvinum
start home.p0. At that point the system rebuilt itself and it would
appear the data is intact ... I have not completely tested or verified
that last statement, however. The steps are summarized below.
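
(Condensed, the recovery steps described above were roughly the
following; object and device names are the ones from this report, and
the setstate/start combination is the kludge mentioned, not a
documented procedure:)

# one failed disk (B): re-create the vinum partition, then restart
bsdlabel -e da2s1               # add the "h" vinum partition back by hand
gvinum start home2              # stale plex goes degraded and resyncs

# both B and C failed: force one subdisk up, then start the other plex
gvinum setstate -f up home.p1.s0
gvinum start home.p0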

In essence, although I hoped my feature request for the ability to
create a striped set of mirrors would be supplanted by the functional
equivalent via the "plex hole" system, it did not come to fruition. So
please treat this either as a re-request for that feature or as a bug
report that the pass-through feature of gvinum plexes is broken.

Sven
Comment 7 Mark Linimon freebsd_committer freebsd_triage 2008-05-19 21:01:03 UTC
Responsible Changed
From-To: le->freebsd-geom

With bugmeister hat on, reassign as le@ has not been active in a while.
Comment 8 Eitan Adler freebsd_committer freebsd_triage 2017-12-31 08:01:32 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped