Summary: | geom_flashmap(4) exposes race rendering / on ZFS unbootable
---|---
Product: | Base System
Reporter: | Sebastian Schwarz <seschwar>
Component: | kern
Assignee: | freebsd-fs (Nobody) <fs>
Status: | New
Severity: | Affects Only Me
CC: | crdillon+fbsd-bugs, freebsd-bugs, marius, vmiller
Priority: | ---
Keywords: | regression
Version: | 11.1-RELEASE
Hardware: | amd64
OS: | Any
Description
Sebastian Schwarz
2017-07-28 22:23:41 UTC
Created attachment 184813 [details]
dmesg of ssd/ROOT/freebsd11.0R
Created attachment 184814 [details]
dmesg of ssd/ROOT/freebsd11.1R
Created attachment 184815 [details]
output of "zfs list"
Created attachment 184816 [details]
output of "zpool status"
Created attachment 184817 [details]
output of "gpart list"
I've attached the output of various programs. Let me know if I can provide any additional details.

Ignore the following lines in the dmesg:

GEOM: da0: the secondary GPT table is corrupt or invalid.
GEOM: da0: using the primary only -- recovery suggested.
GEOM: diskid/DISK-4C530006820119118005: the secondary GPT table is corrupt or invalid.
GEOM: diskid/DISK-4C530006820119118005: using the primary only -- recovery suggested.

That was just an attached USB stick where I hadn't wiped the GPT correctly. The device isn't involved in the boot process in any way. Booting with that stick removed yields the same behavior for both boot environments.

Created attachment 184818 [details]
/boot/loader.conf
Ok, I don't think the ZFS pool itself is the problem. If I put FreeBSD-11.1-RELEASE-amd64-mini-memstick.img on a USB stick and boot it, I can import the pool and mount both of the datasets created earlier without any problems. So I guess there must be something wrong with the device discovery of 11.1 during early boot. But if you look at the "List of GEOM managed disk devices" in "dmesg of ssd/ROOT/freebsd11.1R", you see that all devices listed in the output of "zpool status" are present. So I'm confused as to why 11.1 refuses to actually use them on boot.

To gather a few more data points I tested several other releases and snapshots. Their boot environments were created as described in comment #0. The results were as follows:

- 10.3-RELEASE: works
- 10.3-STABLE: works
- 11.0-RELEASE: works
- 11.1-RELEASE: broken as described in comment #0
- 11.1-STABLE: broken as described in comment #0
- 12.0-CURRENT: works

In order to pinpoint the breakage I checked the revisions between 11.0 and 11.1 in a way similar to git-bisect (binary search). After compiling a few kernels and rebooting several times I found base r318197 to introduce the breakage. Revisions before it are able to mount my root filesystem just fine, but revisions at or after base r318197 result in "Mounting from zfs:ssd/ROOT/(...) failed with error 6." during boot.

I don't know anything about hardware or kernel development, so I'm just speculating here: base r318197 only touches MMC and SD stuff, so I'm not sure how that could break assembling my ZFS pool, which is located on SATA disks. But the notebook does have an SD card reader... Also, base r318197 mentions that a few commits were only partially applied. Maybe something went wrong when backporting them, which would explain why 12.0-CURRENT works but 11.1 doesn't. I hope this is helpful in pinpointing the cause.

I'm trying to verify that base r318197 is the cause of my troubles by applying its diff to 11.0-RELEASE and its reverse diff to 11.1-RELEASE to see if that changes anything. Unfortunately the diffs don't apply cleanly. I'll see what I can do about that.
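For anyone repeating this kind of bisection, a minimal sketch of a single step, assuming /usr/src is a Subversion checkout of stable/11 (svnlite ships with the base system); the revision number, job count and diff handling are placeholders, not the exact commands used in this report:

# check the tree out at the revision under test, then build and boot it
svnlite checkout -r 318196 https://svn.freebsd.org/base/stable/11 /usr/src
cd /usr/src
make -j4 buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now

# to apply (or, with -R, revert) the change from a single revision elsewhere
svnlite diff -c 318197 https://svn.freebsd.org/base/stable/11 > /tmp/r318197.diff
cd /usr/src && patch -p0 < /tmp/r318197.diff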
(In reply to seschwar from comment #10)

Unfortunately, I don't see how r318197 could have an impact on ZFS either; that revision only touches bits related to mmc(4), mmcsd(4) and corresponding bridge drivers, but no KPI, no global structure or anything like that. Moreover, according to attachments 184813 and 184814, the drivers in question also don't attach on that machine (is the card reader you mention a USB one?). The partial merges only refer to e.g. ARM-specific bridge drivers that simply don't exist in stable/11, so these omissions should impact you even less.

The one thing that comes to my mind is that since r318197, mmcsd(4) depends on geom_flashmap(4) and that the latter might cause some interaction with ZFS. Given that I'm not familiar with how ZFS interfaces with GEOM, I have no idea what actually could be going wrong, though. As far as geom_flashmap(4) is concerned, I can't spot an obvious bug in that regard and its code doesn't differ between head and stable/11. One - not necessarily conclusive - way to test the theory of an interaction between geom_flashmap(4) and ZFS would be to use a kernel built from a configuration with mmcsd(4) removed, so this dependency isn't dragged in. In general, I'd suggest looking for unmerged fixes to GEOM and to the parts of ZFS that interface with GEOM.

When interfacing mmcsd(4) with geom_flashmap(4), I've had a hard time with races in GEOM: if g_retaste(9) is called (so that an additional provider - now additionally offered by a GEOM class due to a change in configuration - should get picked up) while re-tasting is already in progress, the already ongoing re-tasting won't take note of the new provider, and the new re-taste event won't actually result in an additional re-taste cycle being scheduled. If ZFS is suffering from the same or a similar bug, this doesn't necessarily mean that the problem you are hitting is present in stable/11 only; with head you just might be lucky that timing differences resulting from geom_flashmap(4) also being in the mix don't lead to a race.

So I had another go at this...

(In reply to Marius Strobl from comment #12)
> (is the card reader you mention a USB one?).

Nope:

$ pciconf -lv
(...)
none2@pci0:11:0:0:  class=0x0c0010 card=0x04a31028 chip=0x11f71217 rev=0x05 hdr=0x00
    vendor   = 'O2 Micro, Inc.'
    device   = 'OZ600 1394a-2000 Controller'
    class    = serial bus
    subclass = FireWire
none3@pci0:11:0:1:  class=0x080501 card=0x04a31028 chip=0x83201217 rev=0x05 hdr=0x00
    vendor   = 'O2 Micro, Inc.'
    device   = 'OZ600RJ1/OZ900RJ1 SD/MMC Card Reader Controller'
    class    = base peripheral
    subclass = SD host controller
none4@pci0:11:0:2:  class=0x018000 card=0x04a31028 chip=0x83301217 rev=0x05 hdr=0x00
    vendor   = 'O2 Micro, Inc.'
    device   = 'OZ600 MS/xD Controller'
    class    = mass storage

> The one thing that comes to my mind is that since r318197, mmcsd(4) depends
> on geom_flashmap(4) and that the latter might cause some interaction with ZFS.
> (...) One - not necessarily conclusive - way to
> test the theory of interaction between geom_flashmap(4) and ZFS would be to use
> a kernel built from a configuration with mmcsd(4) removed, so this dependency
> isn't dragged in.

Ok, I've built an 11.1 kernel using the following configuration:

$ cat /usr/src/sys/amd64/conf/CUSTOM
include GENERIC
ident   CUSTOM
nodevice mmc
nodevice mmcsd
nodevice sdhci

This kernel is indeed capable of mounting the root filesystem on my ZFS pool! So your guess that base r318197 is not actually the cause of this, but merely exposes the issue, is probably spot on. 12.0-CURRENT not exhibiting the issue raises the question whether the underlying problem is fixed there or just happens not to be triggered.

Is there a way for me to provide more information on this? Maybe boot with kern.geom.debugflags set to some appropriate value?
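For completeness, a minimal sketch of building and installing such a stripped-down kernel, assuming the 11.1 sources live in /usr/src and the configuration above was saved as /usr/src/sys/amd64/conf/CUSTOM; the -j4 job count is arbitrary:

# standard build/install cycle for the CUSTOM kernel configuration shown above;
# with mmc/mmcsd/sdhci removed, geom_flashmap(4) is no longer dragged in
cd /usr/src
make -j4 buildkernel KERNCONF=CUSTOM
make installkernel KERNCONF=CUSTOM
shutdown -r now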
Yeah, I missed that sdhci_pci(4) actually is attaching, probably due to a typo in the search field. Still, the only explanation I have is that the sheer presence of the geom_flashmap(4) class is triggering a GEOM-related race, especially since you apparently didn't have an SD card inserted, so neither mmc(4) nor mmcsd(4) attached and, consequently, no additional disk(9) was present. I can't find a GEOM-related change not present in stable/11 which looks like it would fix such a race. Thus, I suspect that the particular race in fact is also present in head, but due to some differences in timing you don't happen to hit it there.

Recently it has been mentioned that geom_label(4) is racy, too:
https://lists.freebsd.org/pipermail/svn-src-all/2017-August/149683.html

The fact that, as mentioned in that e-mail, bsdinstall(8) therefore doesn't use labels - but apparently you do in your ZFS setup - might also explain why more people aren't hitting the problem you see. So it could be worthwhile to try whether using ada[0,1]p1 directly for the zpools instead of DISK-<foo>p1 labels reliably gets rid of the problem.

Apart from that I don't have an idea how to further debug the actual cause. Part of the problem is that there are several known GEOM-related races, some even documented in the code of geom(4). So changing something or even fixing one race just might alter the timing enough that the real culprit is hidden. Another part is that I only know how geom(4) debugging output differs when hitting the race I mentioned earlier, but I don't know how it would differ for the other races, for example the geom_label(4)-related one.

Just wanted to add a "me too" mention for this bug. Short summary:

- Supermicro X10DRI-T motherboard, using UEFI boot, 128G RAM, 2 CPUs.
- 2 Intel DC P3700 NVMe cards, to be used as mirrored ZFS root and mirrored log devices for a data pool.
- FreeBSD 11.1-RELEASE installs fine off the memstick, and the built system boots correctly, but this setup uses the entire disks (so I need to shrink the system to make room for log partitions).
- So I then rebooted into the USB live system and did a gpart backup of the NVMe drives (see later): zfs snapshot -r, zfs send -R > backup, gpart delete the last index and recreate it with a shorter size on both drives, recreate the zroot pool with the correct ashift, restore with zfs receive from the backup, set bootfs, reboot (a sketch of this sequence follows this list).
- The rebooting system's bootloader finds the zroot pool correctly and proceeds to load the kernel. However, when it's supposed to mount the root filesystem, I get:
  Trying to mount root from zfs:zroot/ROOT/default []...
  Mounting from zfs:zroot/ROOT/default failed with error 6.
- When I list the available boot devices, all partitions of the NVMe disks are listed.
- I can import this pool without any issues from the USB live system; there are no errors on import.
- I then redid the whole exercise, but restored the previously backed up partition tables (full-size ZFS partition), did exactly the same steps as described above to restore the previous ZFS filesystem, rebooted, and the system started up normally.
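A minimal sketch of the shrink-and-restore sequence described in the list above, assuming the pool is named zroot on nvd0/nvd1 with the pool in partition index 3; the backup path, the 100G size and the ashift value are placeholders, and the exact options used in the report may have differed:

# save the partition tables and the pool contents
gpart backup nvd0 > /tmp/nvd0.gpart
gpart backup nvd1 > /tmp/nvd1.gpart
zfs snapshot -r zroot@backup
zfs send -R zroot@backup > /backup/zroot.zfs

# shrink the ZFS partition on both drives to make room for log partitions
gpart delete -i 3 nvd0 && gpart add -t freebsd-zfs -i 3 -s 100G nvd0
gpart delete -i 3 nvd1 && gpart add -t freebsd-zfs -i 3 -s 100G nvd1

# recreate the pool with the desired ashift and restore the datasets
sysctl vfs.zfs.min_auto_ashift=12
zpool create -f -R /mnt zroot mirror nvd0p3 nvd1p3
zfs receive -F zroot < /backup/zroot.zfs
zpool set bootfs=zroot/ROOT/default zroot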
I then tried to recreate the pool using nvd0p3/nvd1p3 instead of GPT labels, but this didn't help. But when I built a custom kernel as suggested here:

include GENERIC
ident   CUSTOM
nodevice mmc
nodevice mmcsd
nodevice sdhci

this kernel booted fine! So whatever went into the system for MMC support is messing with the correct mounting of ZFS root pools...

Originally I created both the GPT and the ZFS pool on Linux. I used the diskids since Linux doesn't have consistent device enumeration. To get additional data points I put that disk into a Lenovo ThinkPad Edge E540, which also has an SD card reader. It exhibits the same behavior as my Dell Precision M4600: FreeBSD can't find the disk unless the SD card modules are disabled. The previous installations of FreeBSD 11.1-RELEASE I did on the disk were performed manually into the already existing ZFS pool. When I performed a fresh guided installation (ZFS, UEFI) on the Lenovo from the USB installer, the system booted. I'll be trying out a few more things to narrow down what caused my disk to be unbootable.

I tried a few more things. I zfs send/recv-ed the affected dataset to the previously created new FreeBSD 11.1-RELEASE installation. It booted with the GENERIC kernel. Then I used the FreeBSD installer to try to get a non-booting configuration. I succeeded when using diskids for the ZFS pool. If I used /dev/ada0 the GENERIC kernel would boot fine.

Also, while creating a ZFS pool on /dev/diskid/... I fiddled a bit with the GPT. I tried both 4K and 1M alignment; it made no difference. I tried different partition indices; it made no difference.

Then I took the booting configuration (ZFS pool on /dev/ada0) and dd-ed the individual backed-up partitions from my original setup (containing a ZFS pool on /dev/diskid/...) over the working ones (the partitions had the same sizes and locations). The system still booted fine with the GENERIC kernel. So what difference did the initial ZFS pool on /dev/ada0 really make? Then I fiddled with the GPT again, trying different types, labels and numbers of entries, but it made no difference; the system remained bootable.

Then I dd-ed the original GPT back into place (the partitions had the same sizes and locations). Now the GENERIC kernel was unable to find my ZFS pool again. I tried fiddling with the GPT again, but it made no difference. Then I zeroed all space outside of the partitions (GPT & unused space due to alignment) and manually recreated the partition table. Now the GENERIC kernel was able to boot again.

I attached the output of "gpart backup" of both the good and the bad GPT. They only differ in two places: the partition type of the ZFS partition and the number of entries (is that the number in the first line?). The type doesn't matter; I can change it to either value and it makes no difference. The number in the first line is more interesting: when I first recreated the GPT it was 128, and after a reboot it was 152. The original GPT was created on Linux using fdisk.

I'm not really sure what's going on. Creating the ZFS pool with /dev/ada0 looks like a fix but doesn't hold up to verification. Changing the partition table doesn't matter unless the space was zeroed first. Maybe someone can make sense of this...

Created attachment 189725 [details]
"gpart backup" of bad GPT
Created attachment 189726 [details]
"gpt backup" of good GPT