Bug 232671

Summary: [gmirror] gmirror fails to recover from degraded mirror sets in some circumstances
Product: Base System Reporter: Conrad Meyer <cem>
Component: kernAssignee: Conrad Meyer <cem>
Status: Closed FIXED    
Severity: Affects Some People CC: markj, sigsys
Priority: ---    
Version: CURRENT   
Hardware: Any   
OS: Any   
URL: https://reviews.freebsd.org/D18062
Bug Depends on:    
Bug Blocks: 232684, 232683, 232835    

Description Conrad Meyer freebsd_committer freebsd_triage 2018-10-25 01:02:51 UTC
Here is the example scenario:

1. I start with a GMIRROR with two ACTIVE disks called "root0".
2. I essentially disconnect one of the disks:

  (pass1:pmspcbsd0:0:1:0): CAM_REMOVE_DEVICE
  da2 at pmspcbsd0 bus 0 scbus0 target 1 lun 0
  da2: <ATA HUS726040ALN610 T840> s/n NHGDJ73Y             detached
  g_access(977): provider da2 has error
  GEOM_MIRROR[0]: Request failed (error=6).
  da2p5[READ(offset=1836285952, length=16384)]
  GEOM_MIRROR[x]: Disk da2p5 state changed from ACTIVE to DISCONNECTED (device root0).
  ...
  GEOM_MIRROR[x]: Device root0: provider da2p5 disconnected.
  GEOM_MIRROR[x]: Consumer da2p5 destroyed.
  ...

3. I add a new hot-spare mirror device to the mirrorset.

  GEOM_MIRROR[1]: Adding disk da16p3 to root0.
  GEOM_MIRROR[1]: Disk da16p3 state changed from NONE to NEW (device root0).

4. GMIRROR begins synchronizing from the remaining live provider to the NEW one.  As a result, the mirrorset's generation and/or sync id is bumped:

  GEOM_MIRROR[1]: Device root0: provider da16p3 detected.
  GEOM_MIRROR[1]: Disk da16p3 state changed from NEW to SYNCHRONIZING (device root0).
  GEOM_MIRROR[0]: Device root0: rebuilding provider da16p3.

5. The scsi bus is rescanned and da2 comes back:

  da2 at pmspcbsd0 bus 0 scbus0 target 1 lun 0

6. GEOM_MIRROR rejects it because it has a stale generation or sync id:

  GEOM_MIRROR[1]: Adding disk da2p5 to root0.
  GEOM_MIRROR[x]: Component da2p5 (device root0) broken, skipping.
  GEOM_MIRROR[0]: Cannot add disk da2p5 to root0 (error=22).

7. At this point, the mirrorset has two disks (da15p3, ACTIVE, and da16p3, SYNCHRONIZING).  The machine is rebooted before synchronization completes.
8. At boot, before mounting root, GEOM happens to detect the mirror disks in the following order:

  i. da2p5 (the stale mirror that was ejected in (2)
  ii. da16p3 (the mirror that is partially synchronized)
  iii. da15p3 (the only good / "ACTIVE" mirror in the set)

  GEOM_MIRROR[1]: Creating device root0 (id=1633884690).
  GEOM_MIRROR[1]: Device root0 created (2 components, id=1633884690).
  GEOM_MIRROR[1]: root_mount_hold 0xfffff8003f496160
  GEOM_MIRROR[1]: Adding disk da2p5 to root0.
  GEOM_MIRROR[1]: Disk da2p5 state changed from NONE to NEW (device root0).
  GEOM_MIRROR[1]: Device root0: provider da2p5 detected.
  GEOM_MIRROR[1]: Adding disk da16p3 to root0.
  GEOM_MIRROR[1]: Disk da16p3 state changed from NONE to NEW (device root0).
  GEOM_MIRROR[1]: Device root0: provider da16p3 detected.
  GEOM_MIRROR[0]: Component da2p5 (device root0) broken, skipping.

<< the bug is here, if not earlier >>

  GEOM_MIRROR[1]: Device root0 state changed from STARTING to RUNNING.
  GEOM_MIRROR[1]: Disk da16p3 state changed from NEW to SYNCHRONIZING (device root0).
  GEOM_MIRROR[1]: root_mount_rel[2352] 0xfffff8003f496160
  GEOM_MIRROR[1]: Adding disk da15p3 to root0.

9. Unfortunately, at the marked location above, GMIRROR sees the two broken mirrors, and decides to transition the mirror set into RUNNING.

  i. g_mirror_update_device(force=false) is called as a side effect of da16p3 transitioning to NEW.

  ii. We have the right number of mirrors, even though they are all broken:

  g_mirror_update_device(struct g_mirror_softc *sc, bool force)
  {
  ...
          switch (sc->sc_state) {
          case G_MIRROR_DEVICE_STATE_STARTING:
              {
  ...
                  /*
                   * Are we ready? We are, if all disks are connected or
                   * if we have any disks and 'force' is true.
                   */
                  ndisks = g_mirror_ndisks(sc, -1);
                  if (sc->sc_ndisks == ndisks || (force && ndisks > 0)) {
                          ;

  iii. We don't see any "dirty" mirrors because the logic ignores stale generations and disks mid-synchronization:

                  dirty = ndisks = 0;
                  pdisk = NULL;
                  LIST_FOREACH(disk, &sc->sc_disks, d_next) {
                          if (disk->d_sync.ds_syncid != syncid)
                                  continue;
                          if ((disk->d_flags &
                              G_MIRROR_DISK_FLAG_SYNCHRONIZING) != 0) {
                                  continue;

  iv. We interpret this as meaning we have a clean mirror set!

                  if (dirty == 0) {
                          /* No dirty disks at all, great. */

  v. And jump to RUNNING.

                  state = G_MIRROR_DEVICE_STATE_RUNNING;
                  G_MIRROR_DEBUG(1, "Device %s state changed from %s to %s.",
                      sc->sc_name, g_mirror_device_state2str(sc->sc_state),
                      g_mirror_device_state2str(state));
                  sc->sc_state = state;

10. Something triggers an event, which causes g_mirror_update_deveice() to be invoked again. 
 The sc is in the RUNNING state:

          case G_MIRROR_DEVICE_STATE_RUNNING:
                  if (g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) == 0 &&
                      g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_NEW) == 0) {
                          /*
                           * No usable disks, so destroy the device.
                           */
                          sc->sc_flags |= G_MIRROR_DEVICE_FLAG_DESTROY;
                          break;

11. And the gmirror destroys itself, even though we had a valid mirror we could have recovered from.
Comment 1 Conrad Meyer freebsd_committer freebsd_triage 2018-10-25 01:08:16 UTC
My proposed fix is, in g_mirror_update_device: don't transition from STARTING to RUNNING unless we've got at least one ACTIVE mirror.  What is a gmirror device going to do with zero active mirrors?  It seems useless.

Possible mitigations for similar classes of issue:

* When a stale mirror is detected at runtime *and* we've already got a complete mirrorset (#6 above), zero out the gmirror superblock or otherwise disassociate it from the logical mirrorset.  It's not going to get less stale if we ignore it.

* In g_mirror_update_device when we're in RUNNING and notice we have no mirrors, transition back to STARTING?  This sounds a little silly to me; we should probably just avoid transitioning to RUNNING unless we have a valid mirror.
Comment 2 Mark Johnston freebsd_committer freebsd_triage 2018-10-25 15:39:42 UTC
Nice.  JFYI, there's a number of test cases in tests/sys/geom/class/mirror which attempt to reproduce issues like the one you described.

Regarding the proposed fix, check out r306743.  It aimed to fix a similar-sounding problem.  I think it's not quite sufficient in that we make the decision to transition to RUNNING before checking for broken disks.  In the scenario you described, it sounds like we should transition back to STARTING and wait for a third disk to arrive.
Comment 3 Conrad Meyer freebsd_committer freebsd_triage 2018-10-25 16:02:31 UTC
(In reply to Mark Johnston from comment #2)
Yep, I did this code inspection on CURRENT from yesterday-ish, so that revision was present.

I'm not sure I want us to flip flop between STARTING and RUNNING in such a case; it seems like both (1) we are allowed to remain in STARTING indefinitely by just returning (as long as we can expect some future event to potentially transition us to RUNNING), and (2) we have enough information at STARTING time to know that RUNNING will fail.  I.e., I'd like to be slightly more conservative about when we transition to RUNNING.

As far as particular code change for the root cause, adding a check for `if (ndisks == 0) return;` right before the 'if (dirty == 0) {' check seems like it *might* be sufficient to fix the correctness issue here (although not the admin-introspection issue(s)).  After all, there is no point launching a gmirror with only broken and synchronizing disks ;-).

Additionally, for administrability I'd like to record some information on the mirror softc about *why* the state is what it is.  (Possibly at least two formatted string buffers -- why we last transitioned, and why we haven't yet transitioned to the next logical state.  If either is not relevant, "n/a" would be ok.)  That way, when we timeout or whatever, that is discoverable (and ideally printed to console).

It might also make sense to do a similar thing for g_mirror_disks.  It'd also be good to add gmirror disk id to almost all of these log messages, since daNN devices can be enumerated in a different order between boots, and that was super confusing for this sighting.

Certainly adding more test cases would be a good idea along with this revision, thanks for the pointer.

I can't promise any time to work on right now, sorry.
Comment 4 Conrad Meyer freebsd_committer freebsd_triage 2018-10-25 16:06:27 UTC
Oh, and one more thing to consider: gmirror is really aggressive about destroying itself when it gets into a bad state.  That might not be the best thing for resiliency, even if all mirrors are dead?  I.e. it could stay alive and ENXIO or EIO all IO and wait for an administrator to add or remove disks (or administratively kill it).

This is mostly orthogonal to this bug, except I'd like gmirror's self-destruct function to log much more explicitly than it does today.  The only hint we have during the boot process is: "root_mount_rel[2352] 0xppppppp", and that only because we have GEOM debugging level 1 enabled!  If a CAM device dies and removes itself, you bet the console hears about it.  Just my 2ยข.
Comment 5 Conrad Meyer freebsd_committer freebsd_triage 2018-10-25 16:06:58 UTC
I'll start cloning some bugs for the sub issues.
Comment 6 Conrad Meyer freebsd_committer freebsd_triage 2018-10-25 16:18:41 UTC
@Mark, any thoughts on this from comment #1?

* When a stale mirror is detected at runtime *and* we've already got a complete mirrorset (#6 above), zero out the gmirror superblock or otherwise disassociate it from the logical mirrorset.  It's not going to get less stale if we ignore it.
Comment 7 Conrad Meyer freebsd_committer freebsd_triage 2018-11-27 05:40:13 UTC
This can be boiled down to the following problem:

1. sc_ndisks is 2
2. we have three eligible components:
  i. da2p5 (the stale mirror that was ejected)
  ii. da16p3 (the mirror that is partially synchronized)
  iii. da15p3 (the only good / "ACTIVE" mirror in the set)
3. STARTING auto-transitions to running based on sc_ndisks == len(sc->sc_disks)
*prior* to removing stale components.

I think the working patch in https://reviews.freebsd.org/D18062 will eliminate (3), rejecting (i) when (ii) is added, preventing auto-start until (iii) is
tasted, or the timeout expires.
Comment 8 commit-hook freebsd_committer freebsd_triage 2018-12-06 23:56:07 UTC
A commit references this bug:

Author: cem
Date: Thu Dec  6 23:55:41 UTC 2018
New revision: 341665
URL: https://svnweb.freebsd.org/changeset/base/341665

Log:
  gmirror: Evaluate mirror components against newest metadata copy

  If we happen to taste a stale mirror component first, don't reject valid,
  newer components that have differing metadata from the stale component
  (during STARTING).  Instead, update our view of the most recent metadata as
  we taste components.

  Like mediasize beforehand, remove some checks from g_mirror_check_metadata
  which would evict valid components due to metadata that can change over a
  mirror's lifetime.  g_mirror_check_metadata is invoked long before we check
  genid/syncid and decide which component(s) are newest and whether or not we
  have quorum.

  Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW
  component is added, first remove any known stale or inconsistent disks from
  the mirrorset, rather than removing them *after* deciding we have quorum.
  Check if we have quorum after removing these components.

  Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to
  force gmirrors to wait out the full timeout (kern.geom.mirror.timeout)
  before transitioning from STARTING to RUNNING.  This is a kludge to help
  ensure all eligible, boot-time available mirror components are tasted before
  RUNNING a gmirror.

  When we are instructed to forget mirror components, bump the generation id
  to avoid confusion with such stale components later.

  Add a basic test case for STARTING -> RUNNING startup behavior around stale
  genids.

  PR:		232671, 232835
  Submitted by:	Cindy Yang <cyang AT isilon.com> (previous version)
  Reviewed by:	markj (kernel portions)
  Discussed with:	asomers, Cindy Yang
  Tested by:	pho
  Sponsored by:	Dell EMC Isilon
  Differential Revision:	https://reviews.freebsd.org/D18062

Changes:
  head/sys/geom/mirror/g_mirror.c
  head/sys/geom/mirror/g_mirror.h
  head/tests/sys/geom/class/mirror/Makefile
  head/tests/sys/geom/class/mirror/component_selection.sh
  head/tests/sys/geom/class/mirror/conf.sh
Comment 9 Conrad Meyer freebsd_committer freebsd_triage 2018-12-07 00:09:40 UTC
This scenario is fixed exactly as described in comment #7, and the included test case is somewhat similar to this scenario (it covers the tasting da2p5 -> da16p3 -> da2p5 is kicked out for being stale portion, which is sufficient to fix this bug).
Comment 10 commit-hook freebsd_committer freebsd_triage 2018-12-07 02:44:25 UTC
A commit references this bug:

Author: cem
Date: Fri Dec  7 02:44:05 UTC 2018
New revision: 341674
URL: https://svnweb.freebsd.org/changeset/base/341674

Log:
  gmirror: Evaluate mirror components against newest metadata copy

  Re-apply r341665 with format strings fixed.

  If we happen to taste a stale mirror component first, don't reject valid,
  newer components that have differing metadata from the stale component
  (during STARTING).  Instead, update our view of the most recent metadata as
  we taste components.

  Like mediasize beforehand, remove some checks from g_mirror_check_metadata
  which would evict valid components due to metadata that can change over a
  mirror's lifetime.  g_mirror_check_metadata is invoked long before we check
  genid/syncid and decide which component(s) are newest and whether or not we
  have quorum.

  Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW
  component is added, first remove any known stale or inconsistent disks from
  the mirrorset, rather than removing them *after* deciding we have quorum.
  Check if we have quorum after removing these components.

  Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to
  force gmirrors to wait out the full timeout (kern.geom.mirror.timeout)
  before transitioning from STARTING to RUNNING.  This is a kludge to help
  ensure all eligible, boot-time available mirror components are tasted before
  RUNNING a gmirror.

  Add a basic test case for STARTING -> RUNNING startup behavior around stale
  genids.

  PR:		232671, 232835
  Submitted by:	Cindy Yang <cyang AT isilon.com> (previous version)
  Reviewed by:	markj (kernel portions)
  Discussed with:	asomers, Cindy Yang
  Tested by:	pho
  Sponsored by:	Dell EMC Isilon
  Differential Revision:	https://reviews.freebsd.org/D18062

Changes:
  head/sys/geom/mirror/g_mirror.c
  head/sys/geom/mirror/g_mirror.h
  head/tests/sys/geom/class/mirror/Makefile
  head/tests/sys/geom/class/mirror/component_selection.sh
  head/tests/sys/geom/class/mirror/conf.sh