Bug 221075 - geom_flashmap(4) exposes race rendering / on ZFS unbootable
Summary: geom_flashmap(4) exposes race rendering / on ZFS unbootable
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.1-RELEASE
Hardware: amd64 Any
: --- Affects Only Me
Assignee: freebsd-fs mailing list
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2017-07-28 22:23 UTC by Sebastian Schwarz
Modified: 2018-04-17 17:01 UTC (History)
3 users (show)

See Also:


Attachments
dmesg of ssd/ROOT/freebsd11.0R (10.68 KB, text/plain)
2017-07-28 22:24 UTC, Sebastian Schwarz
no flags Details
dmesg of ssd/ROOT/freebsd11.1R (12.66 KB, text/plain)
2017-07-28 22:24 UTC, Sebastian Schwarz
no flags Details
output of "zfs list" (2.60 KB, text/plain)
2017-07-28 22:25 UTC, Sebastian Schwarz
no flags Details
output of "zpool status" (548 bytes, text/plain)
2017-07-28 22:25 UTC, Sebastian Schwarz
no flags Details
output of "gpart list" (2.13 KB, text/plain)
2017-07-28 22:26 UTC, Sebastian Schwarz
no flags Details
/boot/loader.conf (34 bytes, text/plain)
2017-07-28 23:04 UTC, Sebastian Schwarz
no flags Details
"gpart backup" of bad GPT (166 bytes, text/plain)
2018-01-14 23:23 UTC, Sebastian Schwarz
no flags Details
"gpt backup" of good GPT (114 bytes, text/plain)
2018-01-14 23:24 UTC, Sebastian Schwarz
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Sebastian Schwarz 2017-07-28 22:23:41 UTC
I tried to upgrade my installation of 11.0p11 to 11.1 applying the outlined procedures.  However that resulted in an unbootable system.  I originally thought I had botched the upgrade (https://lists.freebsd.org/pipermail/freebsd-questions/2017-July/278163.html), but as I can't get 11.1 to boot even when doing a fresh install, I believe it to be an actual regression and not just an error on my end.

To check that hypothesis I first performed my install procedure with 11.0-RELEASE into a new dataset, which I could boot off just fine: 

# zfs create -o canmount=noauto -o mountpoint=/ ssd/ROOT/freebsd11.0R
# beadm mount freebsd11.0R
# fetch -o - https://download.freebsd.org/ftp/releases/amd64/11.0-RELEASE/base.txz | tar xf - -C /mnt/
# fetch -o - https://download.freebsd.org/ftp/releases/amd64/11.0-RELEASE/kernel.txz | tar xf - -C /mnt/
# fetch -o - https://download.freebsd.org/ftp/releases/amd64/11.0-RELEASE/lib32.txz | tar xf - -C /mnt/
# cp /boot/loader.conf /mnt/boot/loader.conf
# cp /etc/hostid /mnt/etc/hostid
# beadm umount freebsd11.0R
# beadm activate freebsd11.0R
# reboot
...
FreeBSD/amd64 () (ttyv1)

login:

I did exactly the same steps with 11.1-RELEASE instead of 11.0-RELEASE and the system was unable to mount the root partition during boot:

# zfs create -o canmount=noauto -o mountpoint=/ ssd/ROOT/freebsd11.1R
# beadm mount freebsd11.1R
# fetch -o - https://download.freebsd.org/ftp/releases/amd64/11.1-RELEASE/base.txz | tar xf - -C /mnt/
# fetch -o - https://download.freebsd.org/ftp/releases/amd64/11.1-RELEASE/kernel.txz | tar xf - -C /mnt/
# fetch -o - https://download.freebsd.org/ftp/releases/amd64/11.1-RELEASE/lib32.txz | tar xf - -C /mnt/
# cp /boot/loader.conf /mnt/boot/loader.conf
# cp /etc/hostid /mnt/etc/hostid
# beadm umount freebsd11.1R
# beadm activate freebsd11.1R
# reboot
...
Mounting from zfs:ssd/ROOT/freebsd11.1R failed with error 6.

When I press "?" at the following "mountroot>" prompt to list available devices all devices part of my ZFS pools are listed with the correct paths.  Re-entering "zfs:ssd/ROOT/freebsd11.1R" or any of my other FreeBSD boot environments results in said error 6 again.

So the only difference between the two boot environments created above it the version of FreeBSD used.  Therefore I assume 11.1-RELEASE to be at fault.

Some additional information on my setup:

- the hardware is a Dell Precision M4600 with the optical drive swapped out for an SSD
- the system is booted with EFI
- the drives are partitioned with GPT
- the two ZFS pools where originally created on Arch Linux
- from the FreeBSD installer I originally successfully installed 11.0-RELEASE into a separate dataset
- 11.0 has been working fine like this since December 2016
Comment 1 Sebastian Schwarz 2017-07-28 22:24:25 UTC
Created attachment 184813 [details]
dmesg of ssd/ROOT/freebsd11.0R
Comment 2 Sebastian Schwarz 2017-07-28 22:24:56 UTC
Created attachment 184814 [details]
dmesg of ssd/ROOT/freebsd11.1R
Comment 3 Sebastian Schwarz 2017-07-28 22:25:25 UTC
Created attachment 184815 [details]
output of "zfs list"
Comment 4 Sebastian Schwarz 2017-07-28 22:25:50 UTC
Created attachment 184816 [details]
output of "zpool status"
Comment 5 Sebastian Schwarz 2017-07-28 22:26:17 UTC
Created attachment 184817 [details]
output of "gpart list"
Comment 6 Sebastian Schwarz 2017-07-28 22:29:21 UTC
I've attached the output of various programs.  Let me know if I can provide any additional details.
Comment 7 Sebastian Schwarz 2017-07-28 22:38:30 UTC
Ignore the following lines in the dmesg:

GEOM: da0: the secondary GPT table is corrupt or invalid.
GEOM: da0: using the primary only -- recovery suggested.
GEOM: diskid/DISK-4C530006820119118005: the secondary GPT table is corrupt or invalid.
GEOM: diskid/DISK-4C530006820119118005: using the primary only -- recovery suggested.

That was just an attached USB stick where I hadn't wiped the GPT correctly.  The device isn't involved in the boot process in any way.  Booting with that stick removed yields the same behavior for both boot environments.
Comment 8 Sebastian Schwarz 2017-07-28 23:04:43 UTC
Created attachment 184818 [details]
/boot/loader.conf
Comment 9 Sebastian Schwarz 2017-07-29 09:41:23 UTC
Ok, I don't think the ZFS pool itself is the problem.  If I put FreeBSD-11.1-RELEASE-amd64-mini-memstick.img on an USB stick and boot it I can import the pool and mount both of the datasets created earlier without any problems. 

So I guess there must be something wrong with the device discovery of 11.1 during early boot.  But if you look at the "List of GEOM managed disk devices" in "dmesg of ssd/ROOT/freebsd11.1R" you see that all devices listed in the output of "zpool status" are present.  So I'm confused as to why 11.1 refuses to actually use these on boot.
Comment 10 Sebastian Schwarz 2017-07-30 10:36:27 UTC
To gather a few more data points I tested several other releases and snapshots.  Their boot environments were created as described in comment #0.

The results were as follows:

- 10.3-RELEASE: works
- 10.3-STABLE: works
- 11.0-RELEASE: works
- 11.1-RELEASE: broken as described in comment #0
- 11.1-STABLE: broken as described in comment #0
- 12.0-CURRENT: works

In order to pinpoint the breakage I checked the revisions between 11.0 and 11.1 in a way similar to git-bisect (binary search).  So after compiling a few kernels and rebooting several times I found base r318197 to introduce the breakage.  Revisions before that are able to mount my root filesystem just fine.  But revisions after (and including) base r318197 result in "Mounting from zfs:ssd/ROOT/(...) failed with error 6." during boot.

I don't know anything about hardware or kernel development, so I'm just speculating here: base r318197 only touches MMC and SD stuff, so I'm not sure how that could break assembling my ZFS pool, which is located on SATA disks.  But the notebook does have a SD card reader...  Also base r318197 mentions that a few commit were only partially applied.  Maybe something went wrong when backporting them, which would explain why 12.0-CURRENT works, but 11.1 doesn't.

I hope this is helpful in pinpointing the cause.
Comment 11 Sebastian Schwarz 2017-07-30 10:39:39 UTC
I'm trying to verify that base r318197 is the cause of my troubles by apply its diff to 11.0-RELASE and its reverse diff to 11.1-RELEASE to see if that changes anything.  Unfortunately the diffs don't apply cleanly.  I'll see what I can do about that.
Comment 12 Marius Strobl freebsd_committer 2017-07-31 21:03:35 UTC
(In reply to seschwar from comment #10)

Unfortunately, I don't see how r318197 could have an impact on ZFS either;
that revision only touches bits related to mmc(4), mmcsd(4) and corresponding
bridge drivers, but no KPI, no global structure or anything like that.
Moreover, according to attachments 184813 and 184814, the drivers in question
also don't attach on that machine (is the card reader you mention a USB one?).
The partial merges only refer to e. g. ARM-specific bridge drivers that simply
don't exist in stable/11, so these omissions should impact you even less.

The one thing that comes to my mind is that since r318197, mmcsd(4) depends
on geom_flasmap(4) and that the latter might cause some interaction with ZFS.
Given that I'm not familiar with how ZFS interfaces with GEOM, I have no idea
what actually could be going wrong, though. As far as geom_flasmap(4) is
concerned, I can't spot an obvious bug in that regard and its code doesn't
differ between head and stable/11. One - not necessarily conclusive - way to
test the theory of interaction between geom_flasmap(4) and ZFS would be to use
a kernel built from a configuration with mmcsd(4) removed, so this dependency
isn't dragged in. In general, I'd suggest to look for unmerged fixes to GEOM
and to the parts of ZFS that interface with GEOM.

When interfacing mmcsd(4) with geom_flashmap(4), I've had a hard time with
races in GEOM; if g_retaste(9) is called (so that an additional provider -
now additionally offered by a GEOM class due to a change in configuration -
should get picked up) while re-tasting is already in progress, the already
ongoing re-tasting won't take note of the new provider and the new re-taste
event won't actually result in an additional re-taste cycle being scheduled.
If ZFS is suffering from the same or a similar bug, this doesn't necessarily
mean that the problem you are hitting is present in stable/11 only; with head
you just might be lucky that timing differences resulting from geom_flashmap(4)
also being in the mix don't lead to a race.
Comment 13 Sebastian Schwarz 2017-08-12 16:32:17 UTC
So I had another go at this...

(In reply to Marius Strobl from comment #12)
> (is the card reader you mention a USB one?).

nope:

$ pciconf -lv
(...)
none2@pci0:11:0:0:  class=0x0c0010 card=0x04a31028 chip=0x11f71217 rev=0x05 hdr=0x00
    vendor     = 'O2 Micro, Inc.'
    device     = 'OZ600 1394a-2000 Controller'
    class      = serial bus
    subclass   = FireWire
none3@pci0:11:0:1:  class=0x080501 card=0x04a31028 chip=0x83201217 rev=0x05 hdr=0x00
    vendor     = 'O2 Micro, Inc.'
    device     = 'OZ600RJ1/OZ900RJ1 SD/MMC Card Reader Controller'
    class      = base peripheral
    subclass   = SD host controller
none4@pci0:11:0:2:  class=0x018000 card=0x04a31028 chip=0x83301217 rev=0x05 hdr=0x00
    vendor     = 'O2 Micro, Inc.'
    device     = 'OZ600 MS/xD Controller'
    class      = mass storage

> The one thing that comes to my mind is that since r318197, mmcsd(4) depends
> on geom_flasmap(4) and that the latter might cause some interaction with ZFS.
> (...) One - not necessarily conclusive - way to
> test the theory of interaction between geom_flasmap(4) and ZFS would be to use
> a kernel built from a configuration with mmcsd(4) removed, so this dependency
> isn't dragged in.

Ok, I've build a 11.1 kernel using the following configuration:

$ cat /usr/src/sys/amd64/conf/CUSTOM
include GENERIC
ident CUSTOM
nodevice mmc
nodevice mmcsd
nodevice sdhci

This kernel is indeed capable of mounting the root filesystem on my ZFS pool!  So your guess that base r318197 is not actually the cause of this, but merely exposes the issue is probably spot on.

12.0-CURRENT not exhibiting the issue raises the question if the underlying problem is fixed there, or just happens to not be triggered.  Is there a way for me to provide more information on this?  Maybe boot with kern.geom.debugflags set to some appropriate value?
Comment 14 Marius Strobl freebsd_committer 2017-08-13 15:41:42 UTC
Yeah, I've missed that sdhci_pci(4) actually is attaching, probably due to
a typo in the search field. Still, the only explanation I have is that the
sheer presence of the geom_flasmap(4) class is triggering a GEOM-related race.
Especially, since you apparently didn't have a SD card inserted, so neither
mmc(4) nor mmcsd(4) did attach and, consequently, no additional disk(9) was
present.

I can't find a GEOM-related change not present in stable/11 which looks like
it would fix such a race. Thus, I suspect that the particular race in fact is
also present in head, but due to some differences in timing you don't happen
to hit it there.

Recently it has been mentioned that geom_label(4) is racy, too:
https://lists.freebsd.org/pipermail/svn-src-all/2017-August/149683.html
The fact that as mentioned in that e-mail, bsdinstall(8) therefore doesn't
use labels - but apparently you do in your ZFS setup - might also explain why
not more people are hitting the problem you see. So it could be worthwhile to
try whether using ada[0,1]p1 directly for the zpools instead of DISK-<foo>p1
labels reliably gets you rid of the problem.

Apart from that I don't have an idea how to further debug the actual cause.
Part of the problem is that there are several known GEOM-related races, some
even documented in the code of geom(4). So changing something or even fixing
one race just might alter the timing enough so that the real culprit is hidden.
Another part is that I only know how geom(4) debugging output differs when
hitting the race I mentioned earlier, but I don't know how it would differ
for the other races, for example the geom_label(4)-related one.
Comment 15 Markus Wild 2017-10-23 07:50:05 UTC
Just wanted to add a "me too" mention for this bug.

Short summary:
- Supermicro X10DRI-T motherboard, using UEFI boot, 128G RAM, 2 CPUs
- 2 Intel DC P3700 NVME cards, to be used as mirrored zfs root and mirrored log devices for a data pool
- FreeBSD-11.1 release installs fine off memstick, and the built system boots correctly, but this system uses the entire
  disks (so I need to shrink the system to make room for log partitions)
- so I then rebooted into usb live system, and did a gpart backup of the nvme drives (see later). zfs snapshot -r, zfs
  send -R > backup, gpart delete last index and recreate with shorter size on both drives, recreate zroot pool with
  correct ashift, restore with zfs receive from backup, set bootfs, reboot
- the rebooting system bootloader finds the zroot pool correctly, and proceeds to load the kernel. However, when it's
  supposed to mount the root filesystem, I get:
    Trying to mount root from zfs:zroot/ROOT/default []...
    Mounting from zfs:zroot/ROOT/default failed with error 6.
- when I list the available boot devices, all partitions of the nvme disks are listed
- I can import this pool without any issues with the usb live system, there are no errors on import.
- I then redid the whole exercise, but restored the previously backed up partition tables (full size zfs
  partition), did exactly the same steps as described above to restore the previous zfs filesystem, rebooted, and the
  system started up normally.

I then tried to recreate the pool using nvd0p3/nvd1p3 instead of gpt-labels, but this didn't help. But, when I built a custom
kernel as suggested here:

include GENERIC
ident CUSTOM
nodevice mmc
nodevice mmcsd
nodevice sdhci

this kernel booted fine! So, whatever went into the system for MMC support is messing with correct mounting of zfs root pools...
Comment 16 Sebastian Schwarz 2018-01-14 00:15:42 UTC
Originally I created both the GPT and the ZFS pool on Linux.  I used the diskids since Linux doesn't have consistent device enumeration.

To get additional data points I put that disk into a Lenovo ThinkPad Edge E540 which also has an SD card reader. It exhibits the same behavior as my Dell Precision M4600.  FreeBSD can't find the disk unless the SD card modules are disabled.

The previous installations of FreeBSD 11.1-RELEASE I did on the disk were performed manually into the already existing ZFS pool.  When I performed a fresh guided installation (ZFS, UEFI) on the Lenovo from the USB installer the system booted.

I'll be trying out a few more things to narrow down what caused my disk to be unbootable.
Comment 17 Sebastian Schwarz 2018-01-14 23:22:46 UTC
I tried a few more things:

I zfs send/recv-ed the affected dataset the previously created new FreeBSD 11.1-RELEASE installation. It booted with the GENERIC kernel.

Then I used the FreeBSD installer to try to get a non-booting configuration.  I succeeded when using diskids for the ZFS pool.  If I used /dev/ada0 the GENERIC kernel would boot fine.  Also while creating a ZFS pool on /dev/diskid/... I fiddled a bit with the GPT.  I tried both 4K and 1M alignment.  It made no difference.  I tried different partition indices.  It made no difference.

Then I took the booting configuration (ZFS pool on /dev/ada0) and dd-ed the individual backed up partitions from my original setup (containing a ZFS pool on /dev/diskid/...) over the working ones (the partitions had the same size and locations).  The system still booted fine with the GENERIC kernel.  So what difference did the initial ZFS pool on /dev/ada0 really make?

Then I fiddled with the GPT again, trying different types, labels and number of entries, but it made no difference.  The system remained bootable.

Then I dd-ed the original GPT back into place (the partitions had the same size and locations).  But now the GENERIC kernel was unable to find my ZFS pool again.

I tried fiddling with the GPT again.  But it made no difference.

Then I zeroed all space outside of the partitions (GPT & unused space due to alignment) and manually recreated the partition table.  Now the GENERIC kernel was able to boot again.

I attached the output of "gpart backup" of both the good and bad GPT.  They only differ in two places.  The partition type of the ZFS partition and the amount of entries (is that the number in the fist line?).  The type doesn't matter.  I can change it to both values and it doesn't make a difference.  That number in the first line is more interesting.  When I first recreated the GPT it was 128.  After a reboot it was 152.  The original GPT was create on Linux using fdisk.

I'm not really sure what's going on.  Creating the ZFS pool with /dev/ada0 like a fix but doesn't hold up to verification.  Changing the partition table doesn't matter until it is was zeroed before.  Maybe someone can make sense of this...
Comment 18 Sebastian Schwarz 2018-01-14 23:23:40 UTC
Created attachment 189725 [details]
"gpart backup" of bad GPT
Comment 19 Sebastian Schwarz 2018-01-14 23:24:11 UTC
Created attachment 189726 [details]
"gpt backup" of good GPT