Here is the example scenario: 1. I start with a GMIRROR with two ACTIVE disks called "root0". 2. I essentially disconnect one of the disks: (pass1:pmspcbsd0:0:1:0): CAM_REMOVE_DEVICE da2 at pmspcbsd0 bus 0 scbus0 target 1 lun 0 da2: <ATA HUS726040ALN610 T840> s/n NHGDJ73Y detached g_access(977): provider da2 has error GEOM_MIRROR[0]: Request failed (error=6). da2p5[READ(offset=1836285952, length=16384)] GEOM_MIRROR[x]: Disk da2p5 state changed from ACTIVE to DISCONNECTED (device root0). ... GEOM_MIRROR[x]: Device root0: provider da2p5 disconnected. GEOM_MIRROR[x]: Consumer da2p5 destroyed. ... 3. I add a new hot-spare mirror device to the mirrorset. GEOM_MIRROR[1]: Adding disk da16p3 to root0. GEOM_MIRROR[1]: Disk da16p3 state changed from NONE to NEW (device root0). 4. GMIRROR begins synchronizing from the remaining live provider to the NEW one. As a result, the mirrorset's generation and/or sync id is bumped: GEOM_MIRROR[1]: Device root0: provider da16p3 detected. GEOM_MIRROR[1]: Disk da16p3 state changed from NEW to SYNCHRONIZING (device root0). GEOM_MIRROR[0]: Device root0: rebuilding provider da16p3. 5. The scsi bus is rescanned and da2 comes back: da2 at pmspcbsd0 bus 0 scbus0 target 1 lun 0 6. GEOM_MIRROR rejects it because it has a stale generation or sync id: GEOM_MIRROR[1]: Adding disk da2p5 to root0. GEOM_MIRROR[x]: Component da2p5 (device root0) broken, skipping. GEOM_MIRROR[0]: Cannot add disk da2p5 to root0 (error=22). 7. At this point, the mirrorset has two disks (da15p3, ACTIVE, and da16p3, SYNCHRONIZING). The machine is rebooted before synchronization completes. 8. At boot, before mounting root, GEOM happens to detect the mirror disks in the following order: i. da2p5 (the stale mirror that was ejected in (2) ii. da16p3 (the mirror that is partially synchronized) iii. da15p3 (the only good / "ACTIVE" mirror in the set) GEOM_MIRROR[1]: Creating device root0 (id=1633884690). GEOM_MIRROR[1]: Device root0 created (2 components, id=1633884690). GEOM_MIRROR[1]: root_mount_hold 0xfffff8003f496160 GEOM_MIRROR[1]: Adding disk da2p5 to root0. GEOM_MIRROR[1]: Disk da2p5 state changed from NONE to NEW (device root0). GEOM_MIRROR[1]: Device root0: provider da2p5 detected. GEOM_MIRROR[1]: Adding disk da16p3 to root0. GEOM_MIRROR[1]: Disk da16p3 state changed from NONE to NEW (device root0). GEOM_MIRROR[1]: Device root0: provider da16p3 detected. GEOM_MIRROR[0]: Component da2p5 (device root0) broken, skipping. << the bug is here, if not earlier >> GEOM_MIRROR[1]: Device root0 state changed from STARTING to RUNNING. GEOM_MIRROR[1]: Disk da16p3 state changed from NEW to SYNCHRONIZING (device root0). GEOM_MIRROR[1]: root_mount_rel[2352] 0xfffff8003f496160 GEOM_MIRROR[1]: Adding disk da15p3 to root0. 9. Unfortunately, at the marked location above, GMIRROR sees the two broken mirrors, and decides to transition the mirror set into RUNNING. i. g_mirror_update_device(force=false) is called as a side effect of da16p3 transitioning to NEW. ii. We have the right number of mirrors, even though they are all broken: g_mirror_update_device(struct g_mirror_softc *sc, bool force) { ... switch (sc->sc_state) { case G_MIRROR_DEVICE_STATE_STARTING: { ... /* * Are we ready? We are, if all disks are connected or * if we have any disks and 'force' is true. */ ndisks = g_mirror_ndisks(sc, -1); if (sc->sc_ndisks == ndisks || (force && ndisks > 0)) { ; iii. We don't see any "dirty" mirrors because the logic ignores stale generations and disks mid-synchronization: dirty = ndisks = 0; pdisk = NULL; LIST_FOREACH(disk, &sc->sc_disks, d_next) { if (disk->d_sync.ds_syncid != syncid) continue; if ((disk->d_flags & G_MIRROR_DISK_FLAG_SYNCHRONIZING) != 0) { continue; iv. We interpret this as meaning we have a clean mirror set! if (dirty == 0) { /* No dirty disks at all, great. */ v. And jump to RUNNING. state = G_MIRROR_DEVICE_STATE_RUNNING; G_MIRROR_DEBUG(1, "Device %s state changed from %s to %s.", sc->sc_name, g_mirror_device_state2str(sc->sc_state), g_mirror_device_state2str(state)); sc->sc_state = state; 10. Something triggers an event, which causes g_mirror_update_deveice() to be invoked again. The sc is in the RUNNING state: case G_MIRROR_DEVICE_STATE_RUNNING: if (g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) == 0 && g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_NEW) == 0) { /* * No usable disks, so destroy the device. */ sc->sc_flags |= G_MIRROR_DEVICE_FLAG_DESTROY; break; 11. And the gmirror destroys itself, even though we had a valid mirror we could have recovered from.
My proposed fix is, in g_mirror_update_device: don't transition from STARTING to RUNNING unless we've got at least one ACTIVE mirror. What is a gmirror device going to do with zero active mirrors? It seems useless. Possible mitigations for similar classes of issue: * When a stale mirror is detected at runtime *and* we've already got a complete mirrorset (#6 above), zero out the gmirror superblock or otherwise disassociate it from the logical mirrorset. It's not going to get less stale if we ignore it. * In g_mirror_update_device when we're in RUNNING and notice we have no mirrors, transition back to STARTING? This sounds a little silly to me; we should probably just avoid transitioning to RUNNING unless we have a valid mirror.
Nice. JFYI, there's a number of test cases in tests/sys/geom/class/mirror which attempt to reproduce issues like the one you described. Regarding the proposed fix, check out r306743. It aimed to fix a similar-sounding problem. I think it's not quite sufficient in that we make the decision to transition to RUNNING before checking for broken disks. In the scenario you described, it sounds like we should transition back to STARTING and wait for a third disk to arrive.
(In reply to Mark Johnston from comment #2) Yep, I did this code inspection on CURRENT from yesterday-ish, so that revision was present. I'm not sure I want us to flip flop between STARTING and RUNNING in such a case; it seems like both (1) we are allowed to remain in STARTING indefinitely by just returning (as long as we can expect some future event to potentially transition us to RUNNING), and (2) we have enough information at STARTING time to know that RUNNING will fail. I.e., I'd like to be slightly more conservative about when we transition to RUNNING. As far as particular code change for the root cause, adding a check for `if (ndisks == 0) return;` right before the 'if (dirty == 0) {' check seems like it *might* be sufficient to fix the correctness issue here (although not the admin-introspection issue(s)). After all, there is no point launching a gmirror with only broken and synchronizing disks ;-). Additionally, for administrability I'd like to record some information on the mirror softc about *why* the state is what it is. (Possibly at least two formatted string buffers -- why we last transitioned, and why we haven't yet transitioned to the next logical state. If either is not relevant, "n/a" would be ok.) That way, when we timeout or whatever, that is discoverable (and ideally printed to console). It might also make sense to do a similar thing for g_mirror_disks. It'd also be good to add gmirror disk id to almost all of these log messages, since daNN devices can be enumerated in a different order between boots, and that was super confusing for this sighting. Certainly adding more test cases would be a good idea along with this revision, thanks for the pointer. I can't promise any time to work on right now, sorry.
Oh, and one more thing to consider: gmirror is really aggressive about destroying itself when it gets into a bad state. That might not be the best thing for resiliency, even if all mirrors are dead? I.e. it could stay alive and ENXIO or EIO all IO and wait for an administrator to add or remove disks (or administratively kill it). This is mostly orthogonal to this bug, except I'd like gmirror's self-destruct function to log much more explicitly than it does today. The only hint we have during the boot process is: "root_mount_rel[2352] 0xppppppp", and that only because we have GEOM debugging level 1 enabled! If a CAM device dies and removes itself, you bet the console hears about it. Just my 2ยข.
I'll start cloning some bugs for the sub issues.
@Mark, any thoughts on this from comment #1? * When a stale mirror is detected at runtime *and* we've already got a complete mirrorset (#6 above), zero out the gmirror superblock or otherwise disassociate it from the logical mirrorset. It's not going to get less stale if we ignore it.
This can be boiled down to the following problem: 1. sc_ndisks is 2 2. we have three eligible components: i. da2p5 (the stale mirror that was ejected) ii. da16p3 (the mirror that is partially synchronized) iii. da15p3 (the only good / "ACTIVE" mirror in the set) 3. STARTING auto-transitions to running based on sc_ndisks == len(sc->sc_disks) *prior* to removing stale components. I think the working patch in https://reviews.freebsd.org/D18062 will eliminate (3), rejecting (i) when (ii) is added, preventing auto-start until (iii) is tasted, or the timeout expires.
A commit references this bug: Author: cem Date: Thu Dec 6 23:55:41 UTC 2018 New revision: 341665 URL: https://svnweb.freebsd.org/changeset/base/341665 Log: gmirror: Evaluate mirror components against newest metadata copy If we happen to taste a stale mirror component first, don't reject valid, newer components that have differing metadata from the stale component (during STARTING). Instead, update our view of the most recent metadata as we taste components. Like mediasize beforehand, remove some checks from g_mirror_check_metadata which would evict valid components due to metadata that can change over a mirror's lifetime. g_mirror_check_metadata is invoked long before we check genid/syncid and decide which component(s) are newest and whether or not we have quorum. Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW component is added, first remove any known stale or inconsistent disks from the mirrorset, rather than removing them *after* deciding we have quorum. Check if we have quorum after removing these components. Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to force gmirrors to wait out the full timeout (kern.geom.mirror.timeout) before transitioning from STARTING to RUNNING. This is a kludge to help ensure all eligible, boot-time available mirror components are tasted before RUNNING a gmirror. When we are instructed to forget mirror components, bump the generation id to avoid confusion with such stale components later. Add a basic test case for STARTING -> RUNNING startup behavior around stale genids. PR: 232671, 232835 Submitted by: Cindy Yang <cyang AT isilon.com> (previous version) Reviewed by: markj (kernel portions) Discussed with: asomers, Cindy Yang Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D18062 Changes: head/sys/geom/mirror/g_mirror.c head/sys/geom/mirror/g_mirror.h head/tests/sys/geom/class/mirror/Makefile head/tests/sys/geom/class/mirror/component_selection.sh head/tests/sys/geom/class/mirror/conf.sh
This scenario is fixed exactly as described in comment #7, and the included test case is somewhat similar to this scenario (it covers the tasting da2p5 -> da16p3 -> da2p5 is kicked out for being stale portion, which is sufficient to fix this bug).
A commit references this bug: Author: cem Date: Fri Dec 7 02:44:05 UTC 2018 New revision: 341674 URL: https://svnweb.freebsd.org/changeset/base/341674 Log: gmirror: Evaluate mirror components against newest metadata copy Re-apply r341665 with format strings fixed. If we happen to taste a stale mirror component first, don't reject valid, newer components that have differing metadata from the stale component (during STARTING). Instead, update our view of the most recent metadata as we taste components. Like mediasize beforehand, remove some checks from g_mirror_check_metadata which would evict valid components due to metadata that can change over a mirror's lifetime. g_mirror_check_metadata is invoked long before we check genid/syncid and decide which component(s) are newest and whether or not we have quorum. Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW component is added, first remove any known stale or inconsistent disks from the mirrorset, rather than removing them *after* deciding we have quorum. Check if we have quorum after removing these components. Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to force gmirrors to wait out the full timeout (kern.geom.mirror.timeout) before transitioning from STARTING to RUNNING. This is a kludge to help ensure all eligible, boot-time available mirror components are tasted before RUNNING a gmirror. Add a basic test case for STARTING -> RUNNING startup behavior around stale genids. PR: 232671, 232835 Submitted by: Cindy Yang <cyang AT isilon.com> (previous version) Reviewed by: markj (kernel portions) Discussed with: asomers, Cindy Yang Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D18062 Changes: head/sys/geom/mirror/g_mirror.c head/sys/geom/mirror/g_mirror.h head/tests/sys/geom/class/mirror/Makefile head/tests/sys/geom/class/mirror/component_selection.sh head/tests/sys/geom/class/mirror/conf.sh