Summary: | [gmirror] gmirror fails to recover from degraded mirror sets in some circumstances (2/n) | | |
---|---|---|---|
Product: | Base System | Reporter: | Conrad Meyer <cem> |
Component: | kern | Assignee: | Conrad Meyer <cem> |
Status: | Closed FIXED | | |
Severity: | Affects Some People | CC: | markj |
Priority: | --- | | |
Version: | CURRENT | | |
Hardware: | Any | | |
OS: | Any | | |
Bug Depends on: | 232671 | | |
Bug Blocks: | 232684, 232683 | | |
Description
Conrad Meyer
2018-10-30 23:27:39 UTC
An Isilon engineer is working on this and in communication with me. We'll probably do a first round code-review in house and then I'll forward it upstream.

A commit references this bug:

Author: cem
Date: Thu Dec 6 23:55:41 UTC 2018
New revision: 341665
URL: https://svnweb.freebsd.org/changeset/base/341665

Log:
gmirror: Evaluate mirror components against newest metadata copy

If we happen to taste a stale mirror component first, don't reject valid, newer components that have differing metadata from the stale component (during STARTING). Instead, update our view of the most recent metadata as we taste components.

Like mediasize beforehand, remove some checks from g_mirror_check_metadata which would evict valid components due to metadata that can change over a mirror's lifetime. g_mirror_check_metadata is invoked long before we check genid/syncid and decide which component(s) are newest and whether or not we have quorum.

Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW component is added, first remove any known stale or inconsistent disks from the mirrorset, rather than removing them *after* deciding we have quorum. Check if we have quorum after removing these components.

Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to force gmirrors to wait out the full timeout (kern.geom.mirror.timeout) before transitioning from STARTING to RUNNING. This is a kludge to help ensure all eligible, boot-time available mirror components are tasted before RUNNING a gmirror.

When we are instructed to forget mirror components, bump the generation id to avoid confusion with such stale components later.

Add a basic test case for STARTING -> RUNNING startup behavior around stale genids.

PR: 232671, 232835
Submitted by: Cindy Yang <cyang AT isilon.com> (previous version)
Reviewed by: markj (kernel portions)
Discussed with: asomers, Cindy Yang
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D18062

Changes:
head/sys/geom/mirror/g_mirror.c
head/sys/geom/mirror/g_mirror.h
head/tests/sys/geom/class/mirror/Makefile
head/tests/sys/geom/class/mirror/component_selection.sh
head/tests/sys/geom/class/mirror/conf.sh

Fixed by (1) ignoring md_all in check_metadata and (2) evicting the stale mirror / refreshing metadata during STARTING. This particular scenario isn't exactly covered by the committed test, so that would be room for improvement. But it is believed to fix the issue.

A commit references this bug:

Author: cem
Date: Fri Dec 7 02:44:05 UTC 2018
New revision: 341674
URL: https://svnweb.freebsd.org/changeset/base/341674

Log:
gmirror: Evaluate mirror components against newest metadata copy

Re-apply r341665 with format strings fixed.

If we happen to taste a stale mirror component first, don't reject valid, newer components that have differing metadata from the stale component (during STARTING). Instead, update our view of the most recent metadata as we taste components.

Like mediasize beforehand, remove some checks from g_mirror_check_metadata which would evict valid components due to metadata that can change over a mirror's lifetime. g_mirror_check_metadata is invoked long before we check genid/syncid and decide which component(s) are newest and whether or not we have quorum.

Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW component is added, first remove any known stale or inconsistent disks from the mirrorset, rather than removing them *after* deciding we have quorum. Check if we have quorum after removing these components.

Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to force gmirrors to wait out the full timeout (kern.geom.mirror.timeout) before transitioning from STARTING to RUNNING. This is a kludge to help ensure all eligible, boot-time available mirror components are tasted before RUNNING a gmirror.

Add a basic test case for STARTING -> RUNNING startup behavior around stale genids.

PR: 232671, 232835
Submitted by: Cindy Yang <cyang AT isilon.com> (previous version)
Reviewed by: markj (kernel portions)
Discussed with: asomers, Cindy Yang
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D18062

Changes:
head/sys/geom/mirror/g_mirror.c
head/sys/geom/mirror/g_mirror.h
head/tests/sys/geom/class/mirror/Makefile
head/tests/sys/geom/class/mirror/component_selection.sh
head/tests/sys/geom/class/mirror/conf.sh
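
For readers following the commit log, here is a rough userland sketch of the two ideas it describes: adopting the newest metadata copy while tasting, and evicting stale components before the quorum check. This is not the code from g_mirror.c. The `struct component` and `struct mirror` types, the `mirror_*` function names, and the quorum rule (every disk counted by the newest metadata copy must be present) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one tasted mirror component. */
struct component {
	const char *name;
	uint32_t genid;		/* generation id from this copy's metadata */
	uint32_t ndisks;	/* disk count this copy recorded (cf. md_all) */
	bool evicted;
};

/* Hypothetical view of the mirror while it is in STARTING. */
struct mirror {
	uint32_t newest_genid;	/* highest genid tasted so far */
	uint32_t ndisks;	/* disk count from the newest copy */
	bool have_meta;
};

/*
 * Taste one component. A copy with a differing disk count is not
 * rejected; if it is newer than our current view, we adopt it.
 */
static void
mirror_taste(struct mirror *m, const struct component *c)
{
	if (!m->have_meta || c->genid > m->newest_genid) {
		m->newest_genid = c->genid;
		m->ndisks = c->ndisks;
		m->have_meta = true;
	}
}

/* Mark components older than the newest genid; return the survivors. */
static size_t
mirror_evict_stale(const struct mirror *m, struct component *cs, size_t n)
{
	size_t live = 0;

	assert(m->have_meta);
	for (size_t i = 0; i < n; i++) {
		cs[i].evicted = cs[i].genid < m->newest_genid;
		if (!cs[i].evicted)
			live++;
	}
	return (live);
}

/*
 * Quorum is checked only after eviction. The rule used here (all
 * disks of the newest copy present) is an assumption of this sketch.
 */
static bool
mirror_can_run(const struct mirror *m, struct component *cs, size_t n)
{
	size_t live;

	if (!m->have_meta)
		return (false);
	live = mirror_evict_stale(m, cs, n);
	return (live > 0 && live >= m->ndisks);
}
```

In this model, tasting a stale component (genid 0, recorded disk count 2) followed by a newer one (genid 1, recorded disk count 1) leaves the view at genid 1 with one disk. The stale component is then evicted and `mirror_can_run` returns true. With only the stale component tasted, it returns false, since one of two expected disks is missing.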