The error message here is about the same as PR 221077, but I'm opening a new PR due to differences in configuration and some new info.
The exact error is as follows:
ZFS: i/o error - all block copies unavailable
ZFS: can't read object set for dataset u
ZFS: can't open root filesystem
gptzfsboot: failed to mount default pool zroot
This occurred following a memory exhaustion event that flooded my messages file with entries like the following:
Apr 6 09:56:14 compute-001 kernel: swap_pager_getswapspace(11): failed
This was caused by some computational jobs exhausting available memory.
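For what it's worth, a quick way to gauge how bad the exhaustion was is to count those log lines. This is a hypothetical helper (the function name is mine; the default path is FreeBSD's standard /var/log/messages):

```shell
#!/bin/sh
# count_swap_failures [LOGFILE]: count kernel log lines reporting swap
# exhaustion (swap_pager_getswapspace failures). Helper name and the
# default log path are assumptions, not from the original report.
count_swap_failures() {
    grep -c 'swap_pager_getswapspace' "${1:-/var/log/messages}"
}
```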
The system was still up and accepting logins, but was no longer reading or writing the ZFS pool. All attempts to reboot resulted in the message above.
The system was installed recently using the default ZFS configuration provided by bsdinstall, on 4 disks each exposed as a single-drive RAID-0 volume on a Dell PERC H700 (which does not support JBOD). I.e., ZFS runs a RAID-Z across mfid0 through mfid3.
I updated an old forum thread on the issue and included my successful fix.
Unlike that thread, this did not appear to be triggered by an upgrade.
The gist is that some of the datasets (filesystems) appear to have been corrupted, and the out-of-swap errors seem to be the likely cause.
zpool scrub did not find any errors. All drives are reported as online and optimal by the RAID controller.
My fix was as follows:
Boot from USB drive, go to live image, log in as root.
# mount -u -o rw / # Allow creating directories on USB drive
# zpool import -R /mnt -fF zroot
# cd /mnt
# zfs mount zroot/ROOT/default # Not mounted automatically; canmount defaults to noauto
# mv boot boot.orig
# cp -Rp /boot . # Note -p to make sure permissions are correct in the new /boot
# cp boot.orig/loader.conf boot/ # Restore customizations
# cp boot.orig/zfs/zpool.cache boot/zfs/
# zfs get canmount
# zfs set canmount=on zroot/var/log # and a couple of other datasets that did not match defaults
# zpool export zroot
After a successful reboot, I ran "freebsd-update fetch install" and rebooted again, so my /boot would be up to date.
Everything seems fine now.
I've made a backup of my /boot directory and plan to do so following every freebsd-update so I can hopefully correct this quickly if it happens again.
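The backup step is easy to script. This is a minimal sketch of what I mean, not a polished tool; the function name and the default source/destination paths are my own choices:

```shell
#!/bin/sh
# backup_boot [SRC] [DEST]: archive SRC into a dated tarball under DEST,
# preserving permissions (-p), and print the archive path on success.
# Defaults (/boot and /root) are assumptions, not from the original report.
backup_boot() {
    src=${1:-/boot}
    dest=${2:-/root}
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$dest/boot-$stamp.tgz"
    tar -czpf "$archive" -C "$(dirname "$src")" "$(basename "$src")" \
        && echo "$archive"
}
```

Run after every freebsd-update, this keeps a known-good copy of /boot around to restore from a live image.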
I am seeing the same error on a workstation using 4 vanilla SATA ports, but have not had physical access to it due to COVID-19. This is the first time I've seen the error without an underlying hardware RAID. I'll update this thread when I can gather more information.
Could this just be a 2TB issue?
What kind of disks do you use?
(In reply to Andriy Gapon from comment #1)
I don't think so. They are 3 TB SAS Constellation drives, correctly recognized, and they generally work fine. Others have reported successfully using much larger drives with an H700. Thanks...
Finally got a chance to examine the workstation.
I did not find any swap errors in the system log, but this system has frequently been under heavy memory load. It has only 4 GB of RAM and runs a CentOS VM alongside poudriere builds. I have vfs.zfs.arc_max set to 1G, so there should be (barely) enough RAM for typical tasks.
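For reference, an ARC cap like the one described is set via a loader tunable. This is the standard setting with the value I describe above (the surrounding file contents are of course machine-specific):

```
# /boot/loader.conf
vfs.zfs.arc_max="1G"
```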
In this case, copying /boot from the USB install image did not alleviate the issue, but later copying back /boot.orig did the trick:
From the live CD:
# zpool import -R /mnt -fF zroot
# cd /mnt
# cp -R boot boot.orig
# zpool scrub zroot # 1.5 hours, no errors found
# cp -R /boot .
Then restored loader.conf and the zfs cache file from boot.orig.
[ did not fix ]
I then booted from the USB stick again, imported the pool as before, and ran:
# rm -rf boot
# cp -R boot.orig boot
Now it works flawlessly. No errors or warnings of any kind, which is actually unusual in my experience. I usually see some error messages when booting from ZFS, but it boots fine anyway in most cases.
For some reason ZFS occasionally gets into a state where the boot loader cannot read something under /boot, even though the files are intact. It's been a matter of trial-and-error to recreate the /boot directory with the right incantations to alleviate the problem. I have not seen a specific sequence that works consistently.
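To check the "files are intact" part before recreating /boot, one can compare the on-disk tree against a backup copy. A minimal sketch (the helper name is mine; run it from the imported pool's altroot with your own paths):

```shell
#!/bin/sh
# tree_diff OLD NEW: report whether two directory trees have identical
# file contents, using a recursive diff. Prints "identical" or "different".
tree_diff() {
    if diff -r "${1:?old tree}" "${2:?new tree}" >/dev/null 2>&1; then
        echo identical
    else
        echo different
    fi
}
```

If the trees compare identical yet only the recreated copy boots, the difference is in the on-disk block layout rather than the file contents, which fits the trial-and-error behavior described above.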
This is the first time I've seen this issue in the absence of a hardware RAID controller. This system uses 4 SATA drives on onboard SATA ports (Gigabyte GA-770TA-UD3 motherboard).
FreeBSD barracuda.uits bacon ~ 1005: zpool status
  scan: scrub repaired 0 in 0 days 03:55:42 with 0 errors on Thu Apr 16 18:32:27 2020

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0

errors: No known data errors
It happened again on the same workstation following a round of freebsd-update.
Renaming and recreating /boot as outlined in the forum thread fixed it on the first try this time.
After repeated occurrences of the issue on my office workstation (4-disk RAIDZ), I have not been able to reproduce the problem since upgrading to 12.2.
I also set up a test machine (3-disk RAIDZ on an old tower system) running 13.0. I've given it a good pounding, filled the filesystem, and have not been able to reproduce the issue there.
Closing this issue since it seems to be resolved by recent updates. I'll keep maintaining the test machine for the foreseeable future and we can use it for diagnostics if the problem returns.
The problem returned after the RAIDZ filled up. This agrees with a post I saw suggesting that high disk activity may cause reorganization of the blocks containing /boot. I'll try to reproduce the problem on my test system.