Bug 212139 - r298900 introduced a fatal failure case for >2TB disk size reporting bugs
Summary: r298900 introduced a fatal failure case for >2TB disk size reporting bugs
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.0-RC1
Hardware: Any Any
: --- Affects Some People
Assignee: Allan Jude
URL:
Keywords: patch
: 210714 (view as bug list)
Depends on:
Blocks:
 
Reported: 2016-08-25 12:34 UTC by Peter Wemm
Modified: 2016-09-01 20:02 UTC (History)
5 users (show)

See Also:


Attachments
Hack workaround (665 bytes, patch)
2016-08-25 12:34 UTC, Peter Wemm
no flags Details | Diff
Thought experiment alternative patch (986 bytes, patch)
2016-08-25 20:18 UTC, Peter Wemm
no flags Details | Diff
Prevent read-ahead from crossing 32 bit aliases of end-of-disk (1.14 KB, patch)
2016-08-25 22:18 UTC, Peter Wemm
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Peter Wemm freebsd_committer freebsd_triage 2016-08-25 12:34:22 UTC
Created attachment 174052 [details]
Hack workaround

We have machines in the freebsd.org cluster that have 3TB SATA drives.

ada0: 2861588MB (5860533168 512 byte sectors)
However, the bios reports them as:
    disk0:   BIOS drive C (1565565872 X 512):
ie: the 3TB drive is reported to the loader as 1TB.

Prior to r298900, this was harmless.  IO was issued relative to the metadata on the disk.

r298900 changed it from working to a fatal error:
+    if (dblk >= BD(dev).bd_sectors) {
+	DEBUG("IO past disk end %llu", (unsigned long long)dblk);
+	return (EIO);
+    }
and it won't even try.  This makes machines that used to work (in spite of a bios reporting bug) suddenly fail with an IO error.  While this was observed with ZFS booting, it will affect UFS the same way as they share this code if it tries to read data beyond the truncated size.

I have attached a horrible hack that works for the affected machines in the freebsd.org package build cluster.  It is not an ideal solution but people may find it useful.

The patch is a hack to restrict attempted reads beyond the end of the disk to one single sector rather than a hard fail.  This should make it behave the same way as old versions of the bcache code.  If the bios generates an error, it would do so the same as it did with the old code.  Using a single sector prevents read-ahead amplifying delays.

A better solution might be to have the file system / partition drivers instead tell bcache what size to expect so that it can avoid doing read-aheads beyond the end of a partition.  If a 3TB GPT is on a disk, that should be used for IO and readahead clipping, not the historically unreliable bios sector count. Differences could be reported to the user.

This problem is in 11.0-RC1, 11-stable and 12-current.
Comment 1 Peter Wemm freebsd_committer freebsd_triage 2016-08-25 20:18:05 UTC
I'm adding another thought experiment hack.

The problem is that we can't trust the bios media size reporting mixed with readahead causing read-beyond-end-of-disk.

Might it be sufficient to simply de-fang the readahead when crossing 2TB aliases of the end of disk?  That should restore behavior comparable to the old loader.

In other words, suppose a 3TB disk incorrectly reported as 1TB.  Instead of assuming the disk is actually 1TB, instead we assume it could be 1TB, 3TB, 5TB, ... for the purposes of preventing readahead.  This would prevent readahead of reading the last sector of a 3TB disk misreported as 1TB.

I've probably done the math wrong in the hack, but the idea should be clear enough to show the idea.  It just truncates the end-of-disk math to cause it to calculate using aliases.

I think it would be sufficient, and would be compatible with old 10.x and earlier behavior.  It should stop the 10.3 -> 11.0 -> brick upgrade path.
Comment 2 Peter Wemm freebsd_committer freebsd_triage 2016-08-25 20:18:39 UTC
Created attachment 174075 [details]
Thought experiment alternative patch
Comment 3 Toomas Soome freebsd_committer freebsd_triage 2016-08-25 20:41:42 UTC
I havent found yet good way, but there is something to think about.

Essentially there are 2 cases:

1. unpartitioned raw disk with pool on it. We can get size from pool label and provide as disk meta info down to biosdisk if we like to. 

2. partitioned disk - almost all cases, except GPT backup label read, depend on partition boundaries, so essentially the read should honor the partition boundaries.
It means the disk open, should again have means to pass the boundaries down to biosdisk.

So with exception of GPT backup label read and unpartitioned disk, all other reads should be checked not against disk size, but partition size... and this information is known.
Comment 4 Peter Wemm freebsd_committer freebsd_triage 2016-08-25 20:53:28 UTC
FWIW; the thought experiment patch is wrong - it does not implement what I thought  it did.

I still think it is probably the best option though.  The root of the problem is the read-ahead and we cannot trust size reporting.  The old code did not do read-ahead. Preventing read-ahead from crossing the end of disk, or an alias of the end of disk should be more than sufficient to restore old behavior without adding new hard failure cases.
Comment 5 Peter Wemm freebsd_committer freebsd_triage 2016-08-25 22:18:20 UTC
Created attachment 174080 [details]
Prevent read-ahead from crossing 32 bit aliases of end-of-disk

I think this patch will work.  I've been running it in a test rig to check for wrapping and aliasing scenarios.  I'll test it shortly.

Note that it uses tcp-style sequence space signed arithmetic to handle wraparounds.  The use of signed math is intentional.
Comment 6 Peter Wemm freebsd_committer freebsd_triage 2016-08-25 22:56:58 UTC
FYI; I'm running the patch from #5 in the cluster. So far, so good.
Comment 7 Mark Linimon freebsd_committer freebsd_triage 2016-08-26 00:51:45 UTC
Allan, this was your commit IIUC.
Comment 8 Toomas Soome freebsd_committer freebsd_triage 2016-08-26 05:48:52 UTC
(In reply to Peter Wemm from comment #6)

hm, this solution really does look nice and clean.
Comment 9 Antoine Brodin freebsd_committer freebsd_triage 2016-08-28 05:44:29 UTC
*** Bug 210714 has been marked as a duplicate of this bug. ***
Comment 10 Peter Wemm freebsd_committer freebsd_triage 2016-08-28 20:08:25 UTC
(In reply to Toomas Soome from comment #8)

We are still running this patch in the freebsd.org cluster.  The problematic machines are all working and nothing new exploded.
Comment 11 commit-hook freebsd_committer freebsd_triage 2016-08-28 20:39:48 UTC
A commit references this bug:

Author: peter
Date: Sun Aug 28 20:39:34 UTC 2016
New revision: 304966
URL: https://svnweb.freebsd.org/changeset/base/304966

Log:
  The read-ahead code from r298230 made it likely the boot code would read
  beyond the end of disk. r298900 added code to prevent this.  Some BIOSes
  cause significant delays if asked to read past end-of-disk.

  We never trusted the BIOS to accurately report the sectorsize of disks
  before and this set of changes.  Unfortuately they interact badly with
  the infamous >2TB wraparound bugs.  We have a number of relatively-recent
  machines in the FreeBSD.org cluster where the BIOS reports 3TB disks as 1TB.

  With pre-r298900 they work just fine.  After r298900 they stop working if
  the boot environment attempts to access anything outside the first 1TB on
  the disk.  'ZFS: I/O error, all block copies unavailable' etc.  It affects
  both UFS and ZFS if they try to boot from large volumes.

  This change replaces the blind trust of the BIOS end-of-disk reporting
  with a read-ahead clip to prevent reads crossing the of end-of-disk
  boundary.  Since 2^32 (2TB) size reporting truncation is not uncommon,
  the clipping is done on 2TB aliases of the reported end-of-disk.
  ie: a 3TB disk reported as 1TB has readahead clipped at 1TB, 3TB, 5TB, ...
  as one of them is likely to be the real end-of-disk.

  This should make the loader on these broken machines behave the same as
  traditional pre-r298900 loader behavior, without disabling read-ahead.

  PR:		212139
  Discussed with:	tsoome, allanjude

Changes:
  head/sys/boot/i386/libi386/biosdisk.c
Comment 12 commit-hook freebsd_committer freebsd_triage 2016-09-01 19:31:49 UTC
A commit references this bug:

Author: gjb
Date: Thu Sep  1 19:30:52 UTC 2016
New revision: 305232
URL: https://svnweb.freebsd.org/changeset/base/305232

Log:
  MFC r304966 (peter):
   The read-ahead code from r298230 made it likely the boot code would read
   beyond the end of disk. r298900 added code to prevent this.  Some BIOSes
   cause significant delays if asked to read past end-of-disk.

   We never trusted the BIOS to accurately report the sectorsize of disks
   before and this set of changes.  Unfortuately they interact badly with
   the infamous >2TB wraparound bugs.  We have a number of relatively-recent
   machines in the FreeBSD.org cluster where the BIOS reports 3TB disks as 1TB.

   With pre-r298900 they work just fine.  After r298900 they stop working if
   the boot environment attempts to access anything outside the first 1TB on
   the disk.  'ZFS: I/O error, all block copies unavailable' etc.  It affects
   both UFS and ZFS if they try to boot from large volumes.

   This change replaces the blind trust of the BIOS end-of-disk reporting
   with a read-ahead clip to prevent reads crossing the of end-of-disk
   boundary.  Since 2^32 (2TB) size reporting truncation is not uncommon,
   the clipping is done on 2TB aliases of the reported end-of-disk.
   ie: a 3TB disk reported as 1TB has readahead clipped at 1TB, 3TB, 5TB, ...
   as one of them is likely to be the real end-of-disk.

   This should make the loader on these broken machines behave the same as
   traditional pre-r298900 loader behavior, without disabling read-ahead.

  PR:		212139
  Sponsored by:	The FreeBSD Foundation

Changes:
_U  stable/11/
  stable/11/sys/boot/i386/libi386/biosdisk.c
Comment 13 commit-hook freebsd_committer freebsd_triage 2016-09-01 20:01:53 UTC
A commit references this bug:

Author: gjb
Date: Thu Sep  1 20:01:38 UTC 2016
New revision: 305236
URL: https://svnweb.freebsd.org/changeset/base/305236

Log:
  MFS11 305232:
  MFC r304966 (peter):
   The read-ahead code from r298230 made it likely the boot code would read
   beyond the end of disk. r298900 added code to prevent this.  Some BIOSes
   cause significant delays if asked to read past end-of-disk.

   We never trusted the BIOS to accurately report the sectorsize of disks
   before and this set of changes.  Unfortuately they interact badly with
   the infamous >2TB wraparound bugs.  We have a number of relatively-recent
   machines in the FreeBSD.org cluster where the BIOS reports 3TB disks as 1TB.

   With pre-r298900 they work just fine.  After r298900 they stop working if
   the boot environment attempts to access anything outside the first 1TB on
   the disk.  'ZFS: I/O error, all block copies unavailable' etc.  It affects
   both UFS and ZFS if they try to boot from large volumes.

   This change replaces the blind trust of the BIOS end-of-disk reporting
   with a read-ahead clip to prevent reads crossing the of end-of-disk
   boundary.  Since 2^32 (2TB) size reporting truncation is not uncommon,
   the clipping is done on 2TB aliases of the reported end-of-disk.
   ie: a 3TB disk reported as 1TB has readahead clipped at 1TB, 3TB, 5TB, ...
   as one of them is likely to be the real end-of-disk.

   This should make the loader on these broken machines behave the same as
   traditional pre-r298900 loader behavior, without disabling read-ahead.

  PR:		212139
  Approved by:	re (kib)
  Sponsored by:	The FreeBSD Foundation

Changes:
_U  releng/11.0/
  releng/11.0/sys/boot/i386/libi386/biosdisk.c