Summary: | [gmirror] gmirror fails to recover from degraded mirror sets in some circumstances | ||
---|---|---|---|
Product: | Base System | Reporter: | Conrad Meyer <cem> |
Component: | kern | Assignee: | Conrad Meyer <cem> |
Status: | Closed FIXED | ||
Severity: | Affects Some People | CC: | markj, sigsys |
Priority: | --- | ||
Version: | CURRENT | ||
Hardware: | Any | ||
OS: | Any | ||
URL: | https://reviews.freebsd.org/D18062 | ||
Bug Depends on: | |||
Bug Blocks: | 232684, 232683, 232835 |
Description
Conrad Meyer
2018-10-25 01:02:51 UTC
My proposed fix is, in g_mirror_update_device: don't transition from STARTING to RUNNING unless we've got at least one ACTIVE mirror. What is a gmirror device going to do with zero active mirrors? It seems useless. Possible mitigations for similar classes of issue: * When a stale mirror is detected at runtime *and* we've already got a complete mirrorset (#6 above), zero out the gmirror superblock or otherwise disassociate it from the logical mirrorset. It's not going to get less stale if we ignore it. * In g_mirror_update_device when we're in RUNNING and notice we have no mirrors, transition back to STARTING? This sounds a little silly to me; we should probably just avoid transitioning to RUNNING unless we have a valid mirror. Nice. JFYI, there's a number of test cases in tests/sys/geom/class/mirror which attempt to reproduce issues like the one you described. Regarding the proposed fix, check out r306743. It aimed to fix a similar-sounding problem. I think it's not quite sufficient in that we make the decision to transition to RUNNING before checking for broken disks. In the scenario you described, it sounds like we should transition back to STARTING and wait for a third disk to arrive. (In reply to Mark Johnston from comment #2) Yep, I did this code inspection on CURRENT from yesterday-ish, so that revision was present. I'm not sure I want us to flip flop between STARTING and RUNNING in such a case; it seems like both (1) we are allowed to remain in STARTING indefinitely by just returning (as long as we can expect some future event to potentially transition us to RUNNING), and (2) we have enough information at STARTING time to know that RUNNING will fail. I.e., I'd like to be slightly more conservative about when we transition to RUNNING. As far as particular code change for the root cause, adding a check for `if (ndisks == 0) return;` right before the 'if (dirty == 0) {' check seems like it *might* be sufficient to fix the correctness issue here (although not the admin-introspection issue(s)). After all, there is no point launching a gmirror with only broken and synchronizing disks ;-). Additionally, for administrability I'd like to record some information on the mirror softc about *why* the state is what it is. (Possibly at least two formatted string buffers -- why we last transitioned, and why we haven't yet transitioned to the next logical state. If either is not relevant, "n/a" would be ok.) That way, when we timeout or whatever, that is discoverable (and ideally printed to console). It might also make sense to do a similar thing for g_mirror_disks. It'd also be good to add gmirror disk id to almost all of these log messages, since daNN devices can be enumerated in a different order between boots, and that was super confusing for this sighting. Certainly adding more test cases would be a good idea along with this revision, thanks for the pointer. I can't promise any time to work on right now, sorry. Oh, and one more thing to consider: gmirror is really aggressive about destroying itself when it gets into a bad state. That might not be the best thing for resiliency, even if all mirrors are dead? I.e. it could stay alive and ENXIO or EIO all IO and wait for an administrator to add or remove disks (or administratively kill it). This is mostly orthogonal to this bug, except I'd like gmirror's self-destruct function to log much more explicitly than it does today. The only hint we have during the boot process is: "root_mount_rel[2352] 0xppppppp", and that only because we have GEOM debugging level 1 enabled! If a CAM device dies and removes itself, you bet the console hears about it. Just my 2ยข. I'll start cloning some bugs for the sub issues. @Mark, any thoughts on this from comment #1? * When a stale mirror is detected at runtime *and* we've already got a complete mirrorset (#6 above), zero out the gmirror superblock or otherwise disassociate it from the logical mirrorset. It's not going to get less stale if we ignore it. This can be boiled down to the following problem: 1. sc_ndisks is 2 2. we have three eligible components: i. da2p5 (the stale mirror that was ejected) ii. da16p3 (the mirror that is partially synchronized) iii. da15p3 (the only good / "ACTIVE" mirror in the set) 3. STARTING auto-transitions to running based on sc_ndisks == len(sc->sc_disks) *prior* to removing stale components. I think the working patch in https://reviews.freebsd.org/D18062 will eliminate (3), rejecting (i) when (ii) is added, preventing auto-start until (iii) is tasted, or the timeout expires. A commit references this bug: Author: cem Date: Thu Dec 6 23:55:41 UTC 2018 New revision: 341665 URL: https://svnweb.freebsd.org/changeset/base/341665 Log: gmirror: Evaluate mirror components against newest metadata copy If we happen to taste a stale mirror component first, don't reject valid, newer components that have differing metadata from the stale component (during STARTING). Instead, update our view of the most recent metadata as we taste components. Like mediasize beforehand, remove some checks from g_mirror_check_metadata which would evict valid components due to metadata that can change over a mirror's lifetime. g_mirror_check_metadata is invoked long before we check genid/syncid and decide which component(s) are newest and whether or not we have quorum. Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW component is added, first remove any known stale or inconsistent disks from the mirrorset, rather than removing them *after* deciding we have quorum. Check if we have quorum after removing these components. Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to force gmirrors to wait out the full timeout (kern.geom.mirror.timeout) before transitioning from STARTING to RUNNING. This is a kludge to help ensure all eligible, boot-time available mirror components are tasted before RUNNING a gmirror. When we are instructed to forget mirror components, bump the generation id to avoid confusion with such stale components later. Add a basic test case for STARTING -> RUNNING startup behavior around stale genids. PR: 232671, 232835 Submitted by: Cindy Yang <cyang AT isilon.com> (previous version) Reviewed by: markj (kernel portions) Discussed with: asomers, Cindy Yang Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D18062 Changes: head/sys/geom/mirror/g_mirror.c head/sys/geom/mirror/g_mirror.h head/tests/sys/geom/class/mirror/Makefile head/tests/sys/geom/class/mirror/component_selection.sh head/tests/sys/geom/class/mirror/conf.sh This scenario is fixed exactly as described in comment #7, and the included test case is somewhat similar to this scenario (it covers the tasting da2p5 -> da16p3 -> da2p5 is kicked out for being stale portion, which is sufficient to fix this bug). A commit references this bug: Author: cem Date: Fri Dec 7 02:44:05 UTC 2018 New revision: 341674 URL: https://svnweb.freebsd.org/changeset/base/341674 Log: gmirror: Evaluate mirror components against newest metadata copy Re-apply r341665 with format strings fixed. If we happen to taste a stale mirror component first, don't reject valid, newer components that have differing metadata from the stale component (during STARTING). Instead, update our view of the most recent metadata as we taste components. Like mediasize beforehand, remove some checks from g_mirror_check_metadata which would evict valid components due to metadata that can change over a mirror's lifetime. g_mirror_check_metadata is invoked long before we check genid/syncid and decide which component(s) are newest and whether or not we have quorum. Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW component is added, first remove any known stale or inconsistent disks from the mirrorset, rather than removing them *after* deciding we have quorum. Check if we have quorum after removing these components. Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to force gmirrors to wait out the full timeout (kern.geom.mirror.timeout) before transitioning from STARTING to RUNNING. This is a kludge to help ensure all eligible, boot-time available mirror components are tasted before RUNNING a gmirror. Add a basic test case for STARTING -> RUNNING startup behavior around stale genids. PR: 232671, 232835 Submitted by: Cindy Yang <cyang AT isilon.com> (previous version) Reviewed by: markj (kernel portions) Discussed with: asomers, Cindy Yang Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D18062 Changes: head/sys/geom/mirror/g_mirror.c head/sys/geom/mirror/g_mirror.h head/tests/sys/geom/class/mirror/Makefile head/tests/sys/geom/class/mirror/component_selection.sh head/tests/sys/geom/class/mirror/conf.sh |