Summary: | ZFS boot failure following memory exhaustion | ||
---|---|---|---|
Product: | Base System | Reporter: | Jason W. Bacon <jwb> |
Component: | kern | Assignee: | freebsd-fs (Nobody) <fs> |
Status: | Closed FIXED | ||
Severity: | Affects Some People | ||
Priority: | --- | ||
Version: | 12.1-RELEASE | ||
Hardware: | Any | ||
OS: | Any | ||
See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221077 |
Description
Jason W. Bacon
2020-04-07 19:10:07 UTC
Could this just be a 2TB issue? What kind of disks do you use?

(In reply to Andriy Gapon from comment #1)
I don't think so. They are 3TB SAS Constellation drives. Correctly recognized and generally work fine. Others have reported successfully using much larger drives with an H700.

Thanks...

Finally got a chance to examine the workstation. I did not find any swap errors in the system log, but this system has been under heavy memory load frequently. It has only 4G RAM and runs a CentOS VM alongside poudriere builds. I have arc_max set to 1G so there should be (barely) enough RAM for typical tasks.

In this case, copying /boot from the USB install image did not alleviate the issue, but later copying back /boot.orig did the trick:

From live CD:

    zpool import -R /mnt -fF zroot
    cd /mnt
    mount zroot/ROOT/default
    cp -R boot boot.orig
    zpool scrub zroot    # 1.5 hours, no errors found
    cp -R /boot .

Restore loader.conf and zfs cache from boot.orig

[ did not fix ]

Repeat boot from USB stick, import, etc then

    rm -rf boot
    cp -R boot.orig boot

Now it works flawlessly. No errors or warnings of any kind, which is actually unusual in my experience. I usually see some error messages when booting from ZFS, but it boots fine anyway in most cases.

For some reason ZFS occasionally gets into a state where the boot loader cannot read something under /boot, even though the files are intact. It's been a matter of trial-and-error to recreate the /boot directory with the right incantations to alleviate the problem. I have not seen a specific sequence that works consistently.

This is the first time I've seen this issue in the absence of a hardware RAID controller. This is using 4 SATA drives with onboard SATA ports.
(Gigabyte GA-770TA-UD3 motherboard)

FreeBSD barracuda.uits bacon ~ 1005: zpool status
      pool: zroot
     state: ONLINE
      scan: scrub repaired 0 in 0 days 03:55:42 with 0 errors on Thu Apr 16 18:32:27 2020
    config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0

    errors: No known data errors

It happened again on the same workstation following a round of freebsd-update. Renaming and recreating /boot as outlined in the forum thread fixed it on the first try this time.

https://forums.freebsd.org/threads/10-1-doesnt-boot-anymore-from-zroot-after-applying-p25.54422/

After repeated occurrences of the issue on my office workstation (4-disk RAIDZ), I have not been able to reproduce the problem since upgrading to 12.2. I also set up a test machine (3-disk RAIDZ on an old tower system) running 13.0. I've given it a good pounding, filled the filesystem, and have not been able to reproduce the issue there.

Closing this issue since it seems to be resolved by recent updates. I'll keep maintaining the test machine for the foreseeable future and we can use it for diagnostics if the problem returns.

The problem returned after the RAIDZ filled up. This agrees with a post I saw suggesting that high disk activity may cause reorganization of the blocks containing /boot. I'll try to reproduce the problem on my test system.

Note to posterity:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199804 describes a fix that was applied, and also a potential problem with disks too large for the BIOS, which the fix cannot address. While using GPT will allow one to boot from disks larger than 2 TB, there could be problems when files in /boot are updated and the only space available for the new files is beyond the 2 TB boundary. This is why recreating /boot *sometimes* works. It may or may not bring the necessary files within the BIOS limit.
Having /boot on a smaller partition solves the problem permanently. I've installed some systems on a ~250 GB UFS partition and then created a zpool on the remaining space afterward. This configuration has worked flawlessly for many years.
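
To make the 2 TB boundary concrete, here is a rough sketch of the arithmetic. It assumes the BIOS limit comes from 32-bit sector addressing with 512-byte sectors; the report does not state the exact mechanism, so treat that as an assumption. The helper names (bios_limit_sectors, bios_limit_bytes, partition_within_limit) are hypothetical and not part of FreeBSD. The start and size arguments are in sectors, such as the values printed by gpart show, and the example numbers are made up.

```shell
#!/bin/sh
# Sketch only. Assumption: the BIOS can address at most 2^32 sectors
# (32-bit LBA). All function names here are hypothetical helpers.

# Print the number of sectors addressable under the assumed limit.
bios_limit_sectors() {
    echo $((1 << 32))
}

# Print the assumed limit in bytes for a given sector size (default 512).
bios_limit_bytes() {
    sector_size=${1:-512}
    echo $(( $(bios_limit_sectors) * sector_size ))
}

# Succeed (exit status 0) if a partition that starts at sector $1 and
# spans $2 sectors ends at or below the assumed limit.
partition_within_limit() {
    start=$1
    size=$2
    [ $((start + size)) -le "$(bios_limit_sectors)" ]
}

# Made-up example: a ~250 GB boot partition starting at sector 2048.
if partition_within_limit 2048 488281250; then
    echo "250 GB partition: entirely below the assumed limit"
fi

# Made-up example: a single partition spanning a 3 TB disk.
if ! partition_within_limit 2048 5859375000; then
    echo "3 TB partition: extends past the assumed limit"
fi
```

With 512-byte sectors the assumed limit is 2^32 x 512 = 2199023255552 bytes (2 TiB). A partition that passes this check cannot place any /boot block beyond the boundary, which is consistent with the small-partition configuration above working reliably.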