Bug 211361

Summary: suggested boot partition size is too small, bsdinstall creates unaligned partitions
Product: Base System Reporter: Dag-Erling Smørgrav <des>
Component: miscAssignee: freebsd-bugs (Nobody) <bugs>
Status: Closed FIXED    
Severity: Affects Many People CC: allanjude, nwhitehorn, op, re
Priority: --- Keywords: patch, performance
Version: 11.0-STABLEFlags: des: mfc-stable11+
des: mfc-stable10+
Hardware: Any   
OS: Any   
Description Flags
Proposed fix none

Description Dag-Erling Smørgrav freebsd_committer 2016-07-25 07:51:55 UTC
The gpart(4) man page suggests creating a 44 kB boot partition.  This was fine for 9 and 10 where gptzfsboot is 42 kB, but 11's gptzfsboot is twice that size.  Users who followed that recommendation will not be able to install a new gptzfsboot in their existing boot partition.

The man page should be updated, and the problem should probably be documented in the release notes.

Standard operating procedure (i.e. what the gpart(4) man page recommends and what bsdinstall does) is to place the swap partition immediately after the boot partition, so it is possible to fix this issue on a live system (after upgrading but before rebooting) by deleting the boot and swap partitions, creating a new, appropriately sized boot partition, then creating a new, slightly smaller swap partition.  Here is an example for a two-disk system with striped swap and either striped or mirrored ZFS:

# swapoff -a
# gpart delete -i 1 ada0
# gpart delete -i 2 ada0
# gpart add -b 40 -s $((512-40)) -t freebsd-boot -i 1 -l boot0 ada0
# gpart add -t freebsd-swap -l swap0 ada0
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
# gpart delete -i 1 ada1
# gpart delete -i 2 ada1
# gpart add -b 40 -s $((512-40)) -t freebsd-boot -i 1 -l boot1 ada1
# gpart add -t freebsd-swap -l swap1 ada1
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1
# swapon -a

The point of $((512-40)) is that we want both the boot partition and the swap partition that follows it to be aligned on (at least) 4096-byte boundaries.  The partition table is normally 34 sectors long (see `gpart list | grep first`), so we round up to 40 and place the boot partition there, and the swap partition at 512.

Note that bsdinstall creates an unaligned 512 kB boot partition, so all subsequent partitions will be unaligned, which can cause serious performance issues on "advanced format" drives.  I recommend fixing this as well.
Comment 1 commit-hook freebsd_committer 2016-07-25 11:25:52 UTC
A commit references this bug:

Author: des
Date: Mon Jul 25 11:25:34 UTC 2016
New revision: 303289
URL: https://svnweb.freebsd.org/changeset/base/303289

  Rewrite the GPT and MBR examples.  For GPT, ensure that the boot partition
  is large enough for gptzfsboot, which has doubled in size since 10.

  PR:		211361
  MFC after:	3 days

Comment 2 commit-hook freebsd_committer 2016-08-05 15:30:35 UTC
A commit references this bug:

Author: des
Date: Fri Aug  5 15:30:05 UTC 2016
New revision: 303769
URL: https://svnweb.freebsd.org/changeset/base/303769

  MFH (r303289): update example section

  PR:		211361
  Approved by:	re (gjb)

_U  stable/11/
Comment 3 Allan Jude freebsd_committer 2016-08-05 16:18:54 UTC
You can obviate the need to do the extra math by adding the parameter '-a 4k' to the gpart commands, and it will align them to 4k for you.

I can make bsdinstall do this for GPT based disks, but MBR is too sensitive to try to do anything like that.
Comment 4 Nathan Whitehorn freebsd_committer 2016-08-05 19:02:06 UTC
This is just zfsboot, correct? partedit (UFS and single-volume ZFS) explicitly aligns all partitions to the GEOM stripe size.
Comment 5 Allan Jude freebsd_committer 2016-08-06 05:04:14 UTC
zfsboot does the right thing by specifying -a 4k to gpart.

The issue (as discussed in another email thread) is that the stripesize is sometimes incorrectly reported by the kernel/device driver
Comment 6 Dag-Erling Smørgrav freebsd_committer 2016-08-06 13:02:57 UTC
(In reply to Allan Jude from comment #5)
> The issue (as discussed in another email thread) is that the stripesize is
> sometimes incorrectly reported by the kernel/device driver

It's not "sometimes", it's "almost always". The only exceptions I've encountered are SATA AF drives with valid quirk entries in the CAM ATA driver.
Comment 7 Dag-Erling Smørgrav freebsd_committer 2016-08-06 14:23:31 UTC
Created attachment 173357 [details]
Proposed fix

This patch ensures that the stripe size used in partition offset and size calculations is a common multiple of the sector size, the reported stripe size and 4096.  If the reported stripe size is 0, it is assumed to be equal to the sector size.
Comment 8 Dag-Erling Smørgrav freebsd_committer 2016-08-06 14:25:11 UTC
Note: the patch compiles, but I haven't tested it yet.
Comment 9 Nathan Whitehorn freebsd_committer 2016-08-08 15:00:51 UTC
I would like to see this done in the kernel preferred IO alignment reporting rather than the installer. This limits the amount of magic in the installer and fixes similar issues in the rest of the system.

There are three options to solve this problem, in general:

1. The kernel makes up a number. The installer, gpart, and sade align to that number. It may become stale if new, weird hardware appears.

2. The installer makes up a number and aligns to it. sade is the installer, so it follows along. gpart doesn't and creates misaligned partitions by default on systems that currently have problems (as do other disk tools, like graid). sade and gpart end up with different behaviors. The number may become stale in the same way.

3. The status quo. The kernel tries to figure out the right number, but it is too small on some hardware. The installer, gpart, and sade align to that number. It is already wrong in some cases.

#1 seems strictly better than #2 here. The numbers in both cases may become stale, but you don't have to make the same fix in a bunch of places in #1 or end up in a situation where gpart and the curses version of gpart (sade) have different behavior. It also makes ZFS and graid and whatever follow along instead of picking too-small IO chunks by default on affected systems. The ZFS issue in particular is probably at least as serious as the alignment of UFS boot partitions.
Comment 10 Dag-Erling Smørgrav freebsd_committer 2016-08-08 16:20:55 UTC
Nathan, what you're suggesting has a very high probability of breaking scenarios we haven't envisioned and software we're not aware of.

This is a matter of *policy*.  I want to adopt a *policy* of creating partitions with a specific alignment to avoid potential performance issues.  If, at some point down the road, we start seeing disks with, for instance, 8192-byte sectors, our 4096-byte alignment will no longer be optimal, but it won't be any worse than the current 512-byte alignment.  The worst-case scenario here is that we waste a small amount of disk space (up to 3584 bytes, less than the total amount of text and code in this ticket, less than 5% of the size of the boot loader, and less than 0.0000005% of the capacity of a typical desktop or laptop harddrive) with absolutely no impact on performance.

BTW, gpart and sade are two completely different things; gpart is just a command-line interface to the low-level interface which bsdinstall and sade use to create partitions.
Comment 11 Nathan Whitehorn freebsd_committer 2016-08-08 16:45:49 UTC
(In reply to Dag-Erling Smørgrav from comment #10)

Sure, but the *policy* shouldn't be in the installer, but in the userland tools that it wraps. Otherwise, the default behavior of the installer is "correct", but file systems created with ZFS will have the wrong IO size, partitions added with gpart will have the wrong alignment, etc.

The installer is supposed to be an extremely thin wrapper around the normal userland tools: it's a bare front-end for gpart, newfs, and tar. If we have bad defaults in those tools, the problem should be fixed there rather than adding magic to the installer to "fix" defaults that we control. sysinstall did this rampantly and it was terrible; it made divergences between different methods of installing the system and made new users go back to the installer to do things and hose their systems thereby.

If we don't want the kernel to guess, and don't want the base userland tools to guess, I would have no objections to some global tunable or something set by the user that tells GEOM to round up to some value for stripe size, or an additional GEOM property (recommended IO size), or some system setting that suggests a minimum IO size and alignment to all userland tools. These could be adopted universally and don't result in anything "lying". If any of those solutions are too much to get done in time for the 11 release, I also wouldn't object to a direct commit of your patch to stable/11 as a stopgap. But, beyond that, modifying the installer to work around bad defaults in the operating system generally is a bad idea and a road we should not go down.
Comment 12 Dag-Erling Smørgrav freebsd_committer 2016-08-08 16:58:01 UTC
The installer does not wrap any userland tools, at least not for partitioning.  It talks directly to GEOM.
Comment 13 Nathan Whitehorn freebsd_committer 2016-08-08 17:05:06 UTC
(In reply to Dag-Erling Smørgrav from comment #12)

It wraps zfs, zpool, and newfs. It does talk directly to GEOM (through libgeom) -- but so does the gpart command-line tool. The point is just that all partitioning tools and disk setup tools should have the same default behaviors. If we don't like those defaults, we should change them. And, since we control the whole operating system, we can! Why should we work around bad defaults in the system in one particular piece of software that people use exactly once per installed system, but leave the normal tools they use repeatedly set up in a way that can give bad performance?

If this is a real problem, we should fix it across the board. Hacking up the installer is not the right solution.
Comment 14 Dag-Erling Smørgrav freebsd_committer 2016-08-08 17:39:24 UTC
This is not a matter of a bad default.  The kernel is doing its best to tell us the truth about the disk.  Based in part on that information, we need to make a policy decision - one of *dozens* of policy decisions the installer makes, don't try to pretend otherwise - to round partition offsets up to the nearest 4 kB boundary.  You are asking me to change the underlying mechanism, possibly breaking other kernel or OS components and who knows what else, in order to avoid making a policy decision.  That's not going to fly.

And once again, gpart is a command-line interface to geom_part(4).  It does not have defaults.  It does not care about alignment or CHS geometry or any of a number of things that the installer *does* know and care about, so saying "bsdinstall should have the same behavior as gpart" is spurious.
Comment 15 Nathan Whitehorn freebsd_committer 2016-08-08 17:52:20 UTC
(In reply to Dag-Erling Smørgrav from comment #14)

gpart(8) does know about those things and will align to stripesize -- the logic in the installer is copied from the userland code in geom_part.c. And the installer doesn't know about CHS. The stripe size, in geom, is the canonical mechanism for recommending IO alignment. Why do we want to break that?

The only partitioning policy choices the installer makes are those involved in making the system bootable on the given platform (where the bootloader goes, in particular, and which partition schemes are bootable). It tries *very* hard not to do anything else and, in particular, not to second guess defaults. I would like to keep it that way.
Comment 16 Dag-Erling Smørgrav freebsd_committer 2016-08-11 09:59:27 UTC
Can we please move forward on this?  I can guarantee you that if you try to commit or submit a patch to make the kernel lie about the stripe size, the GEOM / CAM / ATA maintainers will tell you to go pound sand.  If the problem is that you don't want bsdinstall to use a different default alignment than gpart, then heck, let's apply the same patch to gpart.  But we NEED THIS.  It will go in, one way or the other.
Comment 17 Nathan Whitehorn freebsd_committer 2016-08-12 15:10:00 UTC
(In reply to Dag-Erling Smørgrav from comment #16)

Sure. I'd prefer it as a direct commit to 11, though, and then to have a real fix in HEAD later. The supposedly temporary hack to ZFS for this ended up being permanent and I would prefer to avoid a repeat of that.
Comment 18 Dag-Erling Smørgrav freebsd_committer 2016-08-12 19:42:44 UTC
What would you consider a real fix?

I doubt re@ will accept a direct commit.
Comment 19 Nathan Whitehorn freebsd_committer 2016-08-12 20:07:26 UTC
(In reply to Dag-Erling Smørgrav from comment #18)

A systemwide mechanism for reporting a recommended IO size. We already have one of these in the form of the GEOM stripe size, but you are wary of adjusting it. I'd also be perfectly happy with a sysctl that sets a minimum stripe size and defaults to zero, but is overridden to 4096 in /etc/sysctl.conf on the install ISO.
Comment 20 Nathan Whitehorn freebsd_committer 2016-08-12 20:16:56 UTC
To be more specific: if we want to have a system policy that everything is aligned by default to minimum 4K, we should make that the system policy. If we don't want that policy, then we should follow what the disk drivers report and try to make them as accurate as possible.

In neither case are hacks to the installer the right solution. Since we are so far along in the 11.0 release cycle, we can do that as a temporary solution just to get the release out, but it's not something we would want in any other circumstance. This is why I would prefer a direct commit to the release branch.

(As a note: we do get the stripe size right essentially all the time -- I have not personally encountered hardware, advanced format or otherwise, on which the stripe size is misreported in years now. I suspect the number of exceptions is very small.)
Comment 21 Dag-Erling Smørgrav freebsd_committer 2016-08-13 13:55:04 UTC
I don't want the 4k alignment to be a system policy, I want it to be an installer policy.  And as I've mentioned repeatedly, we get the stripe size wrong most of the time.  The only cases I know of in which we get it right are a) directly connected AF SATA drives for which we have a quirk entry and b) when using our own software RAID drivers.  None of the HW RAID controllers I've encountered report a stripe size, and since they hide the identities of the drives, we can't even apply quirks to individual drives.  Hypervisors (I've tried VirtualBox, KVM and VMWare) don't report a stripe size either, even for paravirtualized storage.
Comment 22 commit-hook freebsd_committer 2016-08-15 09:30:25 UTC
A commit references this bug:

Author: des
Date: Mon Aug 15 09:30:22 UTC 2016
New revision: 304142
URL: https://svnweb.freebsd.org/changeset/base/304142

  Ensure that the sector size is a multiple of 4096 to avoid creating
  unaligned partitions when the actual sector size is hidden from us.

  PR:		211361
  MFC after:	3 days

Comment 23 commit-hook freebsd_committer 2016-08-19 07:05:51 UTC
A commit references this bug:

Author: des
Date: Fri Aug 19 07:05:35 UTC 2016
New revision: 304448
URL: https://svnweb.freebsd.org/changeset/base/304448

  MFH (r304142): ensure stripe size is non-zero multiple of 4096

  PR:		211361

_U  stable/11/
Comment 24 commit-hook freebsd_committer 2016-08-19 09:12:01 UTC
A commit references this bug:

Author: des
Date: Fri Aug 19 09:11:50 UTC 2016
New revision: 304457
URL: https://svnweb.freebsd.org/changeset/base/304457

  MFH (r304142): ensure stripe size is non-zero multiple of 4096

  PR:		211361
  Approved by:	re (gjb)

_U  releng/11.0/
Comment 25 commit-hook freebsd_committer 2017-03-12 13:42:52 UTC
A commit references this bug:

Author: des
Date: Sun Mar 12 13:42:40 UTC 2017
New revision: 315154
URL: https://svnweb.freebsd.org/changeset/base/315154

  MFH (r303289): update example section

  PR:		211361

_U  stable/10/
Comment 26 Dag-Erling Smørgrav freebsd_committer 2017-03-12 13:43:29 UTC
The change to bsdinstall itself was merged to 10 by rpokala@ in r313433.