Created attachment 156141 [details] zfsimpl.c diff for the HOLE issue.

The FreeBSD boot loader ZFS code does not properly check for holes (BP_IS_HOLE()) while traversing dnodes, and calls zio_read() on a hole. Since a hole has no data, its DVA pointers are zero, and the result is the error: ZFS: i/o error - all block copies unavailable.

The problem is in the dnode_read() function in sys/boot/zfs/zfsimpl.c: the block pointer should be checked for a hole. A hole is a zero block and no physical read can be done, because there are no physical blocks behind it; instead, a zero-filled block should be constructed and returned to the caller.

The attached diff is a possible fix for this issue.
The patch looks good to me.
Actually, the fix itself is buggy :) The basic idea is correct, but the size of the zero-filled buffer is wrong. The correct code is:

	if (BP_IS_HOLE(&bp)) {
		memset(dnode_cache_buf, 0,
		    dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT);
		rc = 0;
	} else
		rc = zio_read(spa, &bp, dnode_cache_buf);

BP_PSIZE()/BP_LSIZE() is 512 B at that point (at least with a 512 B sector size) and describes the block pointer itself, not the size of the hole, so the original (bad) fix zeroed out only the first 512 B and the rest of the buffer was returned as garbage. This fix is verified by hashing the whole file and comparing the file on disk against the file read into memory by this reader code.
(In reply to Toomas Soome from comment #2) I think this fix is not completely correct. Now you use the data block size as the hole size. I think that the hole size should be determined based on the level of the hole. Essentially, if the hole is at level zero, then it covers dn_datablkszsec * 512 bytes of data (one data block). If the hole is one level above, then it covers as many data blocks as there are block pointers in one indirect block. And so on. Please note that (nlevels - i - 1) is the level number in the loop. E.g. search for BP_SPAN(), which is used elsewhere for the same purpose; see the sketch below.
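For illustration, this is roughly what that per-level span computation looks like. This is a sketch modeled on the BP_SPAN()/bp_span() helpers used elsewhere in the ZFS code, not an excerpt from zfsimpl.c; it assumes the usual dnode_phys_t fields and the SPA_MINBLOCKSHIFT/SPA_BLKPTRSHIFT constants from the ZFS headers:

	/*
	 * Sketch: bytes of file data covered by one block pointer at a given
	 * level.  Level 0 covers one data block; each indirect level
	 * multiplies that by the number of block pointers per indirect block,
	 * i.e. 2^(dn_indblkshift - SPA_BLKPTRSHIFT).
	 */
	static uint64_t
	bp_span(const dnode_phys_t *dnode, int level)
	{
		uint64_t span;

		span = (uint64_t)dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT;
		if (level > 0)
			span <<= level * (dnode->dn_indblkshift - SPA_BLKPTRSHIFT);
		return (span);
	}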
Actually please scratch comment #3. We read data blocks one by one, so regardless of the actual hole size we need to care only about the current data block. Still, we have to keep in mind that the hole can occur at a level above zero and in that case we do not need to descend further.
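For reference, a rough sketch of how that looks inside the per-block loop of dnode_read(). This is an illustration of the approach, not the attached or committed patch verbatim; bp, bn, nlevels, ibshift, bsize, rc, and dnode_cache_buf are assumed to be the existing locals of that function:

	for (i = 0; i < nlevels; i++) {
		/*
		 * A hole at any level means the current data block is all
		 * zeros: zero-fill the block-sized buffer and stop descending.
		 */
		if (BP_IS_HOLE(&bp)) {
			memset(dnode_cache_buf, 0, bsize);
			break;
		}
		rc = zio_read(spa, &bp, dnode_cache_buf);
		if (rc != 0)
			return (rc);
		if (i == nlevels - 1)
			break;
		/* Pick the block pointer for the next level down. */
		ibn = bn >> ((nlevels - i - 2) * ibshift);
		bp = ((const blkptr_t *)dnode_cache_buf)[ibn & ((1 << ibshift) - 1)];
	}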
Created attachment 157394 [details] another take at the fix
You are absolutely correct there; I was confused by the way the levels are counted in this code ;) Also, in addition to the code review, I confirmed that your update works OK for my test case.
A commit references this bug:

Author: avg
Date: Fri Jun 5 15:32:05 UTC 2015
New revision: 284025
URL: https://svnweb.freebsd.org/changeset/base/284025

Log:
  dnode_read: handle hole blocks in zfs boot code

  A hole block pointer can be encountered at any level and the hole can
  cover multiple data blocks, but we are reading the data blocks one by
  one, so we care only about the current one.

  PR:           199804
  Reported by:  Toomas Soome <tsoome@me.com>
  Submitted by: Toomas Soome <tsoome@me.com> (earlier version)
  Tested by:    Toomas Soome <tsoome@me.com>
  MFC after:    11 days

Changes:
  head/sys/boot/zfs/zfsimpl.c
Somehow the patch lost an '&': bp is defined as blkptr_t (which is a struct), but the BP_IS_HOLE() check expects a pointer, so it should be BP_IS_HOLE(&bp).

	In file included from /usr/src/sys/boot/zfs/zfs.c:49:
	/usr/src/sys/boot/zfs/zfsimpl.c:1258:8: error: member reference type 'blkptr_t' (aka 'struct blkptr') is not a pointer; maybe you meant to use '.'?
	        if (BP_IS_HOLE(bp)) {
A commit references this bug:

Author: avg
Date: Fri Jun 5 17:02:22 UTC 2015
New revision: 284032
URL: https://svnweb.freebsd.org/changeset/base/284032

Log:
  dnode_read: fixup r284025, BP_IS_HOLE macro expects a pointer

  PR:           199804
  Reported by:  sbruno
  Pointyhat to: avg
  MFC after:    10 days
  X-MFC with:   r284025

Changes:
  head/sys/boot/zfs/zfsimpl.c
A commit references this bug:

Author: avg
Date: Wed Jun 17 11:47:07 UTC 2015
New revision: 284509
URL: https://svnweb.freebsd.org/changeset/base/284509

Log:
  MFC r284025,284032: dnode_read: handle hole blocks in zfs boot code

  PR:  199804

Changes:
  _U  stable/10/
  stable/10/sys/boot/zfs/zfsimpl.c
A commit references this bug:

Author: avg
Date: Wed Jun 17 11:48:01 UTC 2015
New revision: 284510
URL: https://svnweb.freebsd.org/changeset/base/284510

Log:
  MFC r284025,284032: dnode_read: handle hole blocks in zfs boot code

  PR:  199804

Changes:
  _U  stable/9/sys/
  _U  stable/9/sys/boot/
  stable/9/sys/boot/zfs/zfsimpl.c
Created attachment 242543 [details] Boot screen image
(In reply to Jason W. Bacon from comment #12) Might this be the same issue? I'm seeing this on 13.1-RELEASE. This is a 3-disk RAID-Z using onboard SATA. The system runs fine for some time, but this pops up occasionally. I can usually get past it by removing and recreating /boot on the RAID using a live system on install media. This is a non-critical machine for me, so I'm happy to leave it in its current state and help diagnose the problem.
(In reply to Jason W. Bacon from comment #13) UEFI or BIOS boot? And how large are the disks?
BIOS. This is an old Gigabyte gaming motherboard for AMD Phenom; it doesn't have UEFI. All of the disk drives are the following model: https://en.wikipedia.org/wiki/ST3000DM001 Thanks.
(In reply to Jason W. Bacon from comment #15) BIOS + 3 TB disks? And you are removing and re-creating /boot to fix it? I would guess the problem appears once your /boot files end up at high LBAs and you hit the 32-bit LBA wraparound.
(In reply to Toomas Soome from comment #16) I would have thought that the BIOS limitation is in the rear view mirror at this stage, if not using MBR. But I don't know a lot about the boot process. If this is expected for such a configuration, it's no problem. I can try installing on disks of 2 TB or less and see if the problem comes back. I have experienced this problem with smaller disks, but that may have been before this fix was committed.
(In reply to Jason W. Bacon from comment #17) Yes, some systems have a 32-bit wraparound issue with LBA addresses. This usually means there is no I/O error from the physical disk, but we do get a logical error from the filesystem reads (and ZFS is rather good at detecting those ;). The fact that recreating /boot fixes the issue supports this diagnosis. Of course, that fix only works for so long: we do not really have an option to tell ZFS where to put the data blocks, so once the pool fills up, this workaround will stop working.
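For anyone following along, the arithmetic behind the 2 TB figure mentioned below, as a standalone illustration assuming 512-byte sectors (as on these drives):

	/*
	 * Illustration: why a 32-bit LBA wraparound caps the addressable
	 * space at 2 TiB, assuming 512-byte sectors.
	 */
	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		uint64_t sectors = UINT64_C(1) << 32;	/* 2^32 addressable sectors */
		uint64_t bytes = sectors * 512;		/* 512 bytes per sector */

		printf("%ju bytes = %ju GiB = 2 TiB\n",
		    (uintmax_t)bytes, (uintmax_t)(bytes >> 30));
		/* prints: 2199023255552 bytes = 2048 GiB = 2 TiB */
		return (0);
	}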
(In reply to Toomas Soome from comment #18) Yeah, that's why I said "usually". There were a couple of times when I gave up on recreating /boot and just reinstalled.

Question: is it the size of the individual disks, or the size of the RAID-Z volume, that must be within the BIOS limit (which I'm assuming is 2 TB)?

Also supporting your hypothesis: for some more critical installations where I was using ZFS on a hardware RAID, what I've done is install to a smallish UFS root partition (e.g. 250 GB) and manually create a zpool on the remaining space after install, moving potentially large mounts like /usr/home to ZFS. /boot remains on the small UFS partition. I've never had a problem with this configuration. I also have a ZFS-on-root system (mirror on 1 TB disks) that has never given me a problem.
(In reply to Jason W. Bacon from comment #19) The LBA issue is about the size of the disk. The best workaround is to use UEFI, because UEFI supports large disks, large sector sizes (4k), and newer technology (NVMe). Otherwise, yes, keeping the OS pool on a separate, smaller disk on such a system is also just fine.
(In reply to Toomas Soome from comment #20) Note that I did not use a separate disk, just a small partition on the larger disk or RAID volume. It seems to me that this should work reliably, provided the fixed partition is entirely within the 2 TB boundary.
(In reply to Jason W. Bacon from comment #21) Oh, hm. A small partition within the 2 TB limit should avoid this problem. How large is it actually, btw?
(In reply to Toomas Soome from comment #22) Also, the error you are seeing, 97, is ECKSUM (EINTEGRITY), used for checksum errors. That can come either from reading from the wrong place, or from reading from the correct place but getting corrupted data (which should be repaired if you have redundancy: a mirror, raidz, or the copies property set > 1).
(In reply to Toomas Soome from comment #22) Typically I've used about 250 GB.
(In reply to Toomas Soome from comment #23) Thanks for the added info. I've set the 3 TB drives aside and installed on three 500 GB drives in RAID-Z1. I'm torturing the volume now with some hefty scientific data and analyses. I'll keep the volume nearly full and report back if there are any issues in the future. I presume that if anything bad is going to happen, it will happen following an update that affects /boot. Hard to tell from previous experience, since I generally only reboot my FreeBSD boxes after a kernel update.
I wonder if we could add a warning to the installer, such as the following:

===
You are booting from BIOS rather than UEFI, and your root partition is larger
than 2 terabytes. This may work initially, but lead to boot problems later
when freebsd-update places blocks of files used in the boot process beyond
the 2 TB boundary.
===

I'm assuming there's a simple way to detect whether a system was booted from BIOS or UEFI.
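For what it's worth, on amd64 at least the boot method can be read from the machdep.bootmethod sysctl, which reports the string "BIOS" or "UEFI". A minimal sketch of such a check (an illustration only, not installer code):

	/*
	 * Minimal sketch: report whether the running system was booted via
	 * BIOS or UEFI by reading the machdep.bootmethod sysctl.
	 * Exit status: 0 = UEFI, 1 = BIOS (or other), 2 = sysctl not available.
	 */
	#include <sys/types.h>
	#include <sys/sysctl.h>

	#include <stdio.h>
	#include <string.h>

	int
	main(void)
	{
		char method[16];
		size_t len = sizeof(method);

		if (sysctlbyname("machdep.bootmethod", method, &len, NULL, 0) != 0) {
			perror("sysctlbyname(machdep.bootmethod)");
			return (2);
		}
		printf("Boot method: %s\n", method);
		return (strcmp(method, "UEFI") == 0 ? 0 : 1);
	}

From a shell script the equivalent would be something like `sysctl -n machdep.bootmethod`.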