Created attachment 156141 [details] zfsimpl.c diff for the HOLE issue.

The FreeBSD boot loader ZFS code does not properly check for holes (BP_IS_HOLE()) while traversing dnodes, and calls zio_read() on a hole. Since a hole has no data, its DVA pointers are zero, and the result is the error: ZFS: i/o error - all block copies unavailable.

The problem is in the dnode_read() function in sys/boot/zfs/zfsimpl.c: the block pointer should be checked for a hole. A hole is a zero block and no physical read can be done, because there are no physical blocks behind it; instead, a zero-filled block should be constructed and returned to the caller.

The attached diff is a possible fix for this issue.
The patch looks good to me.
Actually, the fix itself is buggy :) The basic idea is correct, but the size of the zero-filled buffer is wrong. The correct code is:

	if (BP_IS_HOLE(&bp)) {
		memset(dnode_cache_buf, 0,
		    dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT);
		rc = 0;
	} else
		rc = zio_read(spa, &bp, dnode_cache_buf);

BP_PSIZE()/BP_LSIZE() is 512 B at that point (at least with a 512 B sector size) and describes the block pointer itself, not the size of the hole, so the original (bad) fix zeroed out only the first 512 B and the rest of the buffer was returned as garbage. This fix is verified by hashing the whole file and comparing the file on disk against the file read into memory by this reader code.
(In reply to Toomas Soome from comment #2) I think this fix is not completely correct. Now you use the data block size as the hole size. I think that the hole size should be determined based on the level of the hole. Essentially, if the hole is at level zero, then it covers dn_datablkszsec * 512 bytes of data (one data block). If the hole is one level above, then it covers as many data blocks as there are block pointers in one indirect block. And so on. Please note that (nlevels - i - 1) is the level number in the loop. E.g. search for BP_SPAN(), which is used elsewhere for the same purpose; see the sketch below.
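For illustration, this is roughly what that per-level span computation looks like. This is a sketch modeled on the BP_SPAN()/bp_span() helpers used elsewhere in the ZFS code, not an excerpt from zfsimpl.c; it assumes the usual dnode_phys_t fields and the SPA_MINBLOCKSHIFT/SPA_BLKPTRSHIFT constants from the ZFS headers:

	/*
	 * Sketch: bytes of file data covered by one block pointer at a given
	 * level.  Level 0 covers one data block; each indirect level
	 * multiplies that by the number of block pointers per indirect block,
	 * i.e. 2^(dn_indblkshift - SPA_BLKPTRSHIFT).
	 */
	static uint64_t
	bp_span(const dnode_phys_t *dnode, int level)
	{
		uint64_t span;

		span = (uint64_t)dnode->dn_datablkszsec << SPA_MINBLOCKSHIFT;
		if (level > 0)
			span <<= level * (dnode->dn_indblkshift - SPA_BLKPTRSHIFT);
		return (span);
	}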
Actually please scratch comment #3. We read data blocks one by one, so regardless of the actual hole size we need to care only about the current data block. Still, we have to keep in mind that the hole can occur at a level above zero and in that case we do not need to descend further.
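For reference, a rough sketch of how that looks inside the per-block loop of dnode_read(). This is an illustration of the approach, not the attached or committed patch verbatim; bp, bn, nlevels, ibshift, bsize, rc, and dnode_cache_buf are assumed to be the existing locals of that function:

	for (i = 0; i < nlevels; i++) {
		/*
		 * A hole at any level means the current data block is all
		 * zeros: zero-fill the block-sized buffer and stop descending.
		 */
		if (BP_IS_HOLE(&bp)) {
			memset(dnode_cache_buf, 0, bsize);
			break;
		}
		rc = zio_read(spa, &bp, dnode_cache_buf);
		if (rc != 0)
			return (rc);
		if (i == nlevels - 1)
			break;
		/* Pick the block pointer for the next level down. */
		ibn = bn >> ((nlevels - i - 2) * ibshift);
		bp = ((const blkptr_t *)dnode_cache_buf)[ibn & ((1 << ibshift) - 1)];
	}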
Created attachment 157394 [details] another take at the fix
You are absolutely correct there; I was confused by the way the levels are counted in this code ;) Also, in addition to the code review, I confirmed that your update works OK for my test case.
A commit references this bug:

Author: avg
Date: Fri Jun 5 15:32:05 UTC 2015
New revision: 284025
URL: https://svnweb.freebsd.org/changeset/base/284025

Log:
  dnode_read: handle hole blocks in zfs boot code

  A hole block pointer can be encountered at any level and the hole can
  cover multiple data blocks, but we are reading the data blocks one by
  one, so we care only about the current one.

  PR:           199804
  Reported by:  Toomas Soome <tsoome@me.com>
  Submitted by: Toomas Soome <tsoome@me.com> (earlier version)
  Tested by:    Toomas Soome <tsoome@me.com>
  MFC after:    11 days

Changes:
  head/sys/boot/zfs/zfsimpl.c
Somehow the patch lost an '&': bp is defined as blkptr_t (which is a struct), but the BP_IS_HOLE() check expects a pointer, so it should be BP_IS_HOLE(&bp).

	In file included from /usr/src/sys/boot/zfs/zfs.c:49:
	/usr/src/sys/boot/zfs/zfsimpl.c:1258:8: error: member reference type 'blkptr_t' (aka 'struct blkptr') is not a pointer; maybe you meant to use '.'?
	        if (BP_IS_HOLE(bp)) {
A commit references this bug:

Author: avg
Date: Fri Jun 5 17:02:22 UTC 2015
New revision: 284032
URL: https://svnweb.freebsd.org/changeset/base/284032

Log:
  dnode_read: fixup r284025, BP_IS_HOLE macro expects a pointer

  PR:           199804
  Reported by:  sbruno
  Pointyhat to: avg
  MFC after:    10 days
  X-MFC with:   r284025

Changes:
  head/sys/boot/zfs/zfsimpl.c
A commit references this bug:

Author: avg
Date: Wed Jun 17 11:47:07 UTC 2015
New revision: 284509
URL: https://svnweb.freebsd.org/changeset/base/284509

Log:
  MFC r284025,284032: dnode_read: handle hole blocks in zfs boot code

  PR:  199804

Changes:
  _U  stable/10/
  stable/10/sys/boot/zfs/zfsimpl.c
A commit references this bug:

Author: avg
Date: Wed Jun 17 11:48:01 UTC 2015
New revision: 284510
URL: https://svnweb.freebsd.org/changeset/base/284510

Log:
  MFC r284025,284032: dnode_read: handle hole blocks in zfs boot code

  PR:  199804

Changes:
  _U  stable/9/sys/
  _U  stable/9/sys/boot/
  stable/9/sys/boot/zfs/zfsimpl.c
Created attachment 242543 [details] Boot screen image
(In reply to Jason W. Bacon from comment #12) Might this be the same issue? I'm seeing this on 13.1-RELEASE. This is a 3-disk RAID-Z using onboard SATA. The system runs fine for some time, but this pops up occasionally. I can usually get past it by removing and recreating /boot on the RAID using a live system on install media. This is a non-critical machine for me, so I'm happy to leave it in its current state and help diagnose the problem.
(In reply to Jason W. Bacon from comment #13) UEFI or BIOS boot? And how large are the disks?
BIOS. This is an old Gigabyte gaming motherboard for AMD Phenom; it doesn't have UEFI. All of the disk drives are the following model: https://en.wikipedia.org/wiki/ST3000DM001 Thanks.
(In reply to Jason W. Bacon from comment #15) BIOS + 3 TB disks? And you are removing and re-creating /boot to fix it? I would guess the problem appears once your /boot files end up at high LBAs and you hit the 32-bit LBA wraparound.
(In reply to Toomas Soome from comment #16) I would have thought that the BIOS limitation is in the rear view mirror at this stage, if not using MBR. But I don't know a lot about the boot process. If this is expected for such a configuration, it's no problem. I can try installing on disks of 2 TB or less and see if the problem comes back. I have experienced this problem with smaller disks, but that may have been before this fix was committed.
(In reply to Jason W. Bacon from comment #17) Yes, some systems have a 32-bit wraparound issue with LBA addresses. This usually means there is no I/O error from the physical disk, but we do get a logical error from the filesystem reads (and ZFS is rather good at detecting those ;). The fact that recreating /boot fixes the issue supports this diagnosis. Of course, that fix only works for so long: we do not really have an option to tell ZFS where to put the data blocks, so once the pool fills up, this workaround will stop working.
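For anyone following along, the arithmetic behind the 2 TB figure mentioned below, as a standalone illustration assuming 512-byte sectors (as on these drives):

	/*
	 * Illustration: why a 32-bit LBA wraparound caps the addressable
	 * space at 2 TiB, assuming 512-byte sectors.
	 */
	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		uint64_t sectors = UINT64_C(1) << 32;	/* 2^32 addressable sectors */
		uint64_t bytes = sectors * 512;		/* 512 bytes per sector */

		printf("%ju bytes = %ju GiB = 2 TiB\n",
		    (uintmax_t)bytes, (uintmax_t)(bytes >> 30));
		/* prints: 2199023255552 bytes = 2048 GiB = 2 TiB */
		return (0);
	}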
(In reply to Toomas Soome from comment #18) Yeah, that's why I said "usually". There were a couple of times when I gave up on recreating /boot and just reinstalled.

Question: is it the size of the individual disks, or the size of the RAID-Z volume, that must be within the BIOS limit (which I'm assuming is 2 TB)?

Also supporting your hypothesis: for some more critical installations where I was using ZFS on a hardware RAID, what I've done is install to a smallish UFS root partition (e.g. 250 GB) and manually create a zpool on the remaining space after install, moving potentially large mounts like /usr/home to ZFS. /boot remains on the small UFS partition. I've never had a problem with this configuration. I also have a ZFS-on-root system (mirror on 1 TB disks) that has never given me a problem.
(In reply to Jason W. Bacon from comment #19) The LBA issue is about the size of the disk. The best workaround is to use UEFI, because UEFI supports large disks, large sector sizes (4k), and newer technology (NVMe). Otherwise, yes, keeping the OS pool on a separate, smaller disk on such a system is also just fine.
(In reply to Toomas Soome from comment #20) Note that I did not use a separate disk, just a small partition on the larger disk or RAID volume. It seems to me that this should work reliably, provided the fixed partition is entirely within the 2 TB boundary.
(In reply to Jason W. Bacon from comment #21) Oh, hm. A small partition within the 2 TB limit should avoid this problem. How large is it actually, btw?
(In reply to Toomas Soome from comment #22) Also, the error you are seeing, 97, is ECKSUM (EINTEGRITY), used for checksum errors. That can come either from reading from the wrong place, or from reading from the correct place but getting corrupted data (which should be repaired if you have redundancy: a mirror, raidz, or the copies property set > 1).
(In reply to Toomas Soome from comment #22) Typically I've used about 250 GB.
(In reply to Toomas Soome from comment #23) Thanks for the added info. I've set the 3 TB drives aside and installed on three 500 GB drives in RAID-Z1. I'm torturing the volume now with some hefty scientific data and analyses. I'll keep the volume nearly full and report back if there are any issues in the future. I presume that if anything bad is going to happen, it will happen following an update that affects /boot. Hard to tell from previous experience, since I generally only reboot my FreeBSD boxes after a kernel update.
I wonder if we could add a warning to the installer, such as the following:

===
You are booting from BIOS rather than UEFI, and your root partition is larger
than 2 terabytes. This may work initially, but lead to boot problems later
when freebsd-update places blocks of files used in the boot process beyond
the 2 TB boundary.
===

I'm assuming there's a simple way to detect whether a system was booted from BIOS or UEFI.
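For what it's worth, on amd64 at least the boot method can be read from the machdep.bootmethod sysctl, which reports the string "BIOS" or "UEFI". A minimal sketch of such a check (an illustration only, not installer code):

	/*
	 * Minimal sketch: report whether the running system was booted via
	 * BIOS or UEFI by reading the machdep.bootmethod sysctl.
	 * Exit status: 0 = UEFI, 1 = BIOS (or other), 2 = sysctl not available.
	 */
	#include <sys/types.h>
	#include <sys/sysctl.h>

	#include <stdio.h>
	#include <string.h>

	int
	main(void)
	{
		char method[16];
		size_t len = sizeof(method);

		if (sysctlbyname("machdep.bootmethod", method, &len, NULL, 0) != 0) {
			perror("sysctlbyname(machdep.bootmethod)");
			return (2);
		}
		printf("Boot method: %s\n", method);
		return (strcmp(method, "UEFI") == 0 ? 0 : 1);
	}

From a shell script the equivalent would be something like `sysctl -n machdep.bootmethod`.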