Created attachment 174052 [details] Hack workaround We have machines in the freebsd.org cluster that have 3TB SATA drives. ada0: 2861588MB (5860533168 512 byte sectors) However, the bios reports them as: disk0: BIOS drive C (1565565872 X 512): ie: the 3TB drive is reported to the loader as 1TB. Prior to r298900, this was harmless. IO was issued relative to the metadata on the disk. r298900 changed it from working to a fatal error: + if (dblk >= BD(dev).bd_sectors) { + DEBUG("IO past disk end %llu", (unsigned long long)dblk); + return (EIO); + } and it won't even try. This makes machines that used to work (in spite of a bios reporting bug) suddenly fail with an IO error. While this was observed with ZFS booting, it will affect UFS the same way as they share this code if it tries to read data beyond the truncated size. I have attached a horrible hack that works for the affected machines in the freebsd.org package build cluster. It is not an ideal solution but people may find it useful. The patch is a hack to restrict attempted reads beyond the end of the disk to one single sector rather than a hard fail. This should make it behave the same way as old versions of the bcache code. If the bios generates an error, it would do so the same as it did with the old code. Using a single sector prevents read-ahead amplifying delays. A better solution might be to have the file system / partition drivers instead tell bcache what size to expect so that it can avoid doing read-aheads beyond the end of a partition. If a 3TB GPT is on a disk, that should be used for IO and readahead clipping, not the historically unreliable bios sector count. Differences could be reported to the user. This problem is in 11.0-RC1, 11-stable and 12-current.
I'm adding another thought experiment hack. The problem is that we can't trust the bios media size reporting mixed with readahead causing read-beyond-end-of-disk. Might it be sufficient to simply de-fang the readahead when crossing 2TB aliases of the end of disk? That should restore behavior comparable to the old loader. In other words, suppose a 3TB disk incorrectly reported as 1TB. Instead of assuming the disk is actually 1TB, instead we assume it could be 1TB, 3TB, 5TB, ... for the purposes of preventing readahead. This would prevent readahead of reading the last sector of a 3TB disk misreported as 1TB. I've probably done the math wrong in the hack, but the idea should be clear enough to show the idea. It just truncates the end-of-disk math to cause it to calculate using aliases. I think it would be sufficient, and would be compatible with old 10.x and earlier behavior. It should stop the 10.3 -> 11.0 -> brick upgrade path.
Created attachment 174075 [details] Thought experiment alternative patch
I havent found yet good way, but there is something to think about. Essentially there are 2 cases: 1. unpartitioned raw disk with pool on it. We can get size from pool label and provide as disk meta info down to biosdisk if we like to. 2. partitioned disk - almost all cases, except GPT backup label read, depend on partition boundaries, so essentially the read should honor the partition boundaries. It means the disk open, should again have means to pass the boundaries down to biosdisk. So with exception of GPT backup label read and unpartitioned disk, all other reads should be checked not against disk size, but partition size... and this information is known.
FWIW; the thought experiment patch is wrong - it does not implement what I thought it did. I still think it is probably the best option though. The root of the problem is the read-ahead and we cannot trust size reporting. The old code did not do read-ahead. Preventing read-ahead from crossing the end of disk, or an alias of the end of disk should be more than sufficient to restore old behavior without adding new hard failure cases.
Created attachment 174080 [details] Prevent read-ahead from crossing 32 bit aliases of end-of-disk I think this patch will work. I've been running it in a test rig to check for wrapping and aliasing scenarios. I'll test it shortly. Note that it uses tcp-style sequence space signed arithmetic to handle wraparounds. The use of signed math is intentional.
FYI; I'm running the patch from #5 in the cluster. So far, so good.
Allan, this was your commit IIUC.
(In reply to Peter Wemm from comment #6) hm, this solution really does look nice and clean.
*** Bug 210714 has been marked as a duplicate of this bug. ***
(In reply to Toomas Soome from comment #8) We are still running this patch in the freebsd.org cluster. The problematic machines are all working and nothing new exploded.
A commit references this bug: Author: peter Date: Sun Aug 28 20:39:34 UTC 2016 New revision: 304966 URL: https://svnweb.freebsd.org/changeset/base/304966 Log: The read-ahead code from r298230 made it likely the boot code would read beyond the end of disk. r298900 added code to prevent this. Some BIOSes cause significant delays if asked to read past end-of-disk. We never trusted the BIOS to accurately report the sectorsize of disks before and this set of changes. Unfortuately they interact badly with the infamous >2TB wraparound bugs. We have a number of relatively-recent machines in the FreeBSD.org cluster where the BIOS reports 3TB disks as 1TB. With pre-r298900 they work just fine. After r298900 they stop working if the boot environment attempts to access anything outside the first 1TB on the disk. 'ZFS: I/O error, all block copies unavailable' etc. It affects both UFS and ZFS if they try to boot from large volumes. This change replaces the blind trust of the BIOS end-of-disk reporting with a read-ahead clip to prevent reads crossing the of end-of-disk boundary. Since 2^32 (2TB) size reporting truncation is not uncommon, the clipping is done on 2TB aliases of the reported end-of-disk. ie: a 3TB disk reported as 1TB has readahead clipped at 1TB, 3TB, 5TB, ... as one of them is likely to be the real end-of-disk. This should make the loader on these broken machines behave the same as traditional pre-r298900 loader behavior, without disabling read-ahead. PR: 212139 Discussed with: tsoome, allanjude Changes: head/sys/boot/i386/libi386/biosdisk.c
A commit references this bug: Author: gjb Date: Thu Sep 1 19:30:52 UTC 2016 New revision: 305232 URL: https://svnweb.freebsd.org/changeset/base/305232 Log: MFC r304966 (peter): The read-ahead code from r298230 made it likely the boot code would read beyond the end of disk. r298900 added code to prevent this. Some BIOSes cause significant delays if asked to read past end-of-disk. We never trusted the BIOS to accurately report the sectorsize of disks before and this set of changes. Unfortuately they interact badly with the infamous >2TB wraparound bugs. We have a number of relatively-recent machines in the FreeBSD.org cluster where the BIOS reports 3TB disks as 1TB. With pre-r298900 they work just fine. After r298900 they stop working if the boot environment attempts to access anything outside the first 1TB on the disk. 'ZFS: I/O error, all block copies unavailable' etc. It affects both UFS and ZFS if they try to boot from large volumes. This change replaces the blind trust of the BIOS end-of-disk reporting with a read-ahead clip to prevent reads crossing the of end-of-disk boundary. Since 2^32 (2TB) size reporting truncation is not uncommon, the clipping is done on 2TB aliases of the reported end-of-disk. ie: a 3TB disk reported as 1TB has readahead clipped at 1TB, 3TB, 5TB, ... as one of them is likely to be the real end-of-disk. This should make the loader on these broken machines behave the same as traditional pre-r298900 loader behavior, without disabling read-ahead. PR: 212139 Sponsored by: The FreeBSD Foundation Changes: _U stable/11/ stable/11/sys/boot/i386/libi386/biosdisk.c
A commit references this bug: Author: gjb Date: Thu Sep 1 20:01:38 UTC 2016 New revision: 305236 URL: https://svnweb.freebsd.org/changeset/base/305236 Log: MFS11 305232: MFC r304966 (peter): The read-ahead code from r298230 made it likely the boot code would read beyond the end of disk. r298900 added code to prevent this. Some BIOSes cause significant delays if asked to read past end-of-disk. We never trusted the BIOS to accurately report the sectorsize of disks before and this set of changes. Unfortuately they interact badly with the infamous >2TB wraparound bugs. We have a number of relatively-recent machines in the FreeBSD.org cluster where the BIOS reports 3TB disks as 1TB. With pre-r298900 they work just fine. After r298900 they stop working if the boot environment attempts to access anything outside the first 1TB on the disk. 'ZFS: I/O error, all block copies unavailable' etc. It affects both UFS and ZFS if they try to boot from large volumes. This change replaces the blind trust of the BIOS end-of-disk reporting with a read-ahead clip to prevent reads crossing the of end-of-disk boundary. Since 2^32 (2TB) size reporting truncation is not uncommon, the clipping is done on 2TB aliases of the reported end-of-disk. ie: a 3TB disk reported as 1TB has readahead clipped at 1TB, 3TB, 5TB, ... as one of them is likely to be the real end-of-disk. This should make the loader on these broken machines behave the same as traditional pre-r298900 loader behavior, without disabling read-ahead. PR: 212139 Approved by: re (kib) Sponsored by: The FreeBSD Foundation Changes: _U releng/11.0/ releng/11.0/sys/boot/i386/libi386/biosdisk.c