Scenario 1:
- FreeBSD 11.2
- UFS root fs
- 6 x 1.5 TB SATA disks in a RAIDZ2 pool
- The pool is being scrubbed
- Issue "shutdown -p now"

Actual result 1:
- The system shuts down and reports that the UFS fs has been synced
- The system continues scrubbing (apparently right after all the "usual" kernel log messages) and does not power off

Expected result 1:
- The system stops scrubbing and powers off

Scenario 2:
- Continued from scenario 1
- Press the hard reset button
- The system starts booting into multi-user

Actual result 2:
- As soon as the pool is imported, scrubbing continues
- As a result, system startup is extremely slow; the UFS fs check does not finish in a reasonable time

Expected result 2:
- Scrubbing should not resume immediately upon pool import
- System startup should proceed at normal speed

Scenario 3:
- Continued from scenario 2
- Press the hard reset button
- Boot the system single user
- Run "fsck -p" in an effort to fix the UFS fs first; this is successful
- Run "zpool list status"

Actual result 3:
- The system starts scrubbing the pool immediately
- As a result, "zpool list status" does not finish for a long time (the actual duration can be given later)

Expected result 3:
- Scrubbing should not resume immediately upon pool import
- "zpool import" and "zpool list status" should run at normal speed

In summary, I believe the solution is that before shutdown, all scrubbing activity should be paused. Similarly, on boot any pool marked for scrubbing should be treated as if scrubbing were paused on it.
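Until something like that is implemented, a manual workaround along these lines might help on systems whose ZFS supports pausing a scrub (a sketch only; the pool name "tank" is just an example):

# pause the scrub by hand so shutdown is not blocked
# (requires "zpool scrub -p" support in the installed ZFS)
zpool scrub -p tank
shutdown -p now
# after the next boot the scrub stays paused; resume it manually:
# zpool scrub tank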
I have now analyzed the startup time for scenario 3. This scenario continued as follows:
- Type "^c" - result: the console shows "^c", otherwise no reaction, no command prompt
- Type "^zbg<RET>" - result: the console shows "^zbg", otherwise no reaction, no command prompt
- Type "exit<RET>"
- Leave the console unattended

Using last(1), the time from typing "exit" to the corresponding "boot time" entry was 20 minutes. This most likely means that the pool import took 20 minutes to complete (I was no longer at the console by then).

-- Martin
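For reference, a sketch of how a figure like this can be read out of the accounting records (output details vary by version):

# "boot time" pseudo-entries in last(1) output mark each system boot;
# comparing the logout time of the console session with the following
# boot entry gives the reboot duration, including the pool import
last | grep 'boot time' | head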
Scenarios, continued:
- Server up, scrub of the pool continuing
- gstat is running
- 2 VirtualBox VMs are running
- Another VirtualBox VM is accessing a zvol via iSCSI (ctl)

Result:
- The machine freezes: not immediately, but process after process gets stuck

Scenario continued:
- After watching this for 30 minutes, perform a hard reset
- Start single user
- "fsck -p" -> cleans the UFS root fs
- Issue "zpool export pool" in order to start multi-user without the pool being imported

Result:
- The scrubbing starts again immediately, no command prompt

Scenario continued:
- Issue "ifconfig <interface> inet <hostname>" into the console buffer
- Issue "ifconfig <interface> inet6 <hostname>" into the console buffer
- On another machine, wait until ping succeeds

Result:
- First ping succeeds after ca. 20 minutes (see previous scenarios)
- The console shows that the 20G gmirror for swap has been rebuilt
- The pool is exported

I am running with https://reviews.freebsd.org/D7538 applied because otherwise the system was slowly filling all swap space and swapping continuously, even with no activity. Something is very wrong in FreeBSD 11.2 with ZFS, swapping, and memory management. I did not have these problems with 11.1.

-- Martin
I think I am running into this on stable/13 3a0fcdb37dffcd28c21c846d6165f6c382d9aac3
This seems simpler. I might find time to commit this.

diff --git usr.sbin/periodic/etc/daily/800.scrub-zfs usr.sbin/periodic/etc/daily/800.scrub-zfs
index 8cca1ea4d949..474e070153e8 100755
--- usr.sbin/periodic/etc/daily/800.scrub-zfs
+++ usr.sbin/periodic/etc/daily/800.scrub-zfs
@@ -15,6 +15,13 @@ then
     source_periodic_confs
 fi
 
+doscrub() {
+	local pool="$1"
+
+	zfs set org.freebsd:last-scrub=$(date +%F.%T) "${pool}"
+	zpool scrub "${pool}"
+}
+
 : ${daily_scrub_zfs_default_threshold=35}
 
 case "$daily_scrub_zfs_enable" in
@@ -55,9 +62,7 @@ case "$daily_scrub_zfs_enable" in
 			_pool_threshold=${daily_scrub_zfs_default_threshold}
 		fi
 
-		_last_scrub=$(zpool history ${pool} | \
-		    egrep "^[0-9\.\:\-]{19} zpool scrub ${pool}\$" | tail -1 |\
-		    cut -d ' ' -f 1)
+		_last_scrub=$(zfs get -s local -H -o value org.freebsd:last-scrub ${pool})
 		if [ -z "${_last_scrub}" ]; then
 			# creation time of the pool if no scrub was done
 			_last_scrub=$(zpool history ${pool} | \
@@ -88,12 +93,12 @@ case "$daily_scrub_zfs_enable" in
 			;;
 		*"none requested"*)
 			echo "   starting first scrub (since reboot) of pool '${pool}':"
-			zpool scrub ${pool}
+			doscrub ${pool}
 			[ $rc -eq 0 ] && rc=1
 			;;
 		*)
 			echo "   starting scrub of pool '${pool}':"
-			zpool scrub ${pool}
+			doscrub ${pool}
 			[ $rc -eq 0 ] && rc=1
 			;;
 	esac
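For reference, the user property the patch introduces can be set and queried by hand like this (the pool name "tank" is just an example):

# record a scrub timestamp as a locally-set ZFS user property
zfs set org.freebsd:last-scrub=$(date +%F.%T) tank

# read it back; -s local restricts the output to locally-set values,
# -H -o value prints just the timestamp
zfs get -s local -H -o value org.freebsd:last-scrub tank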
Just my 2 cents: For the issue described in this PR it would be necessary to suspend an ongoing scrub before shutdown and resume it after restart - assuming the system does both (especially the shutdown) cleanly. Most likely something like a "reverse" rc.d would be needed: the shutdown procedure checks which zpools are currently being scrubbed, saves this info, and then suspends the scrubs; conversely, the startup procedure checks which scrubs were suspended and resumes them. To avoid needing a separate file for saving the info about which zpools were being scrubbed across the shutdown/reboot, it would be nice to be able to query the pool directly for this information.

-- Martin
(In reply to Martin Birgmeier from comment #5)

It should already work that way. See "man zpool-scrub":

OPTIONS
  -s  Stop scrubbing.

  -p  Pause scrubbing. Scrub pause state and progress are periodically
      synced to disk. If the system is restarted or pool is exported
      during a paused scrub, even after import, scrub will remain
      paused until it is resumed. Once resumed the scrub will pick up
      from the place where it was last checkpointed to disk. To resume
      a paused scrub issue zpool scrub again.

  -w  Wait until scrub has completed before returning.
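So the cycle sketched in comment #5 reduces to something like this (the pool name "tank" is just an example):

# pause: state and progress are checkpointed to disk and survive
# export/reboot/import
zpool scrub -p tank

# ...shutdown, reboot, pool import; the scrub remains paused...

# resume: picks up from the last checkpoint
zpool scrub tank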
This issue appears to be happening to me on 13.2-RELEASE-p1. I do suspend/resume regularly, mostly with no problems, but occasionally the suspend blocks at the point where you'd expect the power off, and it then requires a hard power off. (Or possibly it requires you to wait for the scrub to complete - obviously not practical when you're suspending a laptop.)

I am experimenting with these additions in rc.suspend and rc.resume:

rc.suspend:

# pause any zpool scrub in progress
zpool status | while read KEY VALUE; do
	case "$KEY" in
	pool:)
		POOL=$VALUE
		;;
	scan:)
		case "$VALUE" in
		"scrub in progress since "*)
			echo "$POOL" >>/var/db/zpool.scrub.resume
			zpool scrub -p $POOL
			;;
		esac
		;;
	esac
done

rc.resume:

# resume any scrub that was in progress on suspend
if [ -f /var/db/zpool.scrub.resume ]; then
	cat /var/db/zpool.scrub.resume | while read POOL; do
		zpool scrub $POOL
	done
	rm /var/db/zpool.scrub.resume
fi

Something similar may also be needed in rc.shutdown, with a suitable rc.d/zpool_scrub_resume script for the shutdown/reboot sequence.
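As a starting point, such an rc.d script might look roughly like this - a sketch only, untested; the script name, rc variable, and function names are made up, and the state file matches the rc.suspend/rc.resume snippets above:

#!/bin/sh

# Hypothetical /usr/local/etc/rc.d/zpool_scrub_resume: pause running
# scrubs at shutdown and resume them at the next boot.

# PROVIDE: zpool_scrub_resume
# REQUIRE: zfs
# KEYWORD: shutdown

. /etc/rc.subr

name="zpool_scrub_resume"
rcvar="zpool_scrub_resume_enable"	# set ..._enable="YES" in rc.conf
start_cmd="zpool_scrub_resume_start"
stop_cmd="zpool_scrub_resume_stop"

zpool_scrub_resume_start()
{
	# Resume any scrub that was paused at shutdown.
	if [ -f /var/db/zpool.scrub.resume ]; then
		while read POOL; do
			zpool scrub "$POOL"
		done < /var/db/zpool.scrub.resume
		rm /var/db/zpool.scrub.resume
	fi
}

zpool_scrub_resume_stop()
{
	# Pause any scrub in progress and record the pool for resume.
	zpool status | while read KEY VALUE; do
		case "$KEY" in
		pool:)
			POOL=$VALUE
			;;
		scan:)
			case "$VALUE" in
			"scrub in progress since "*)
				echo "$POOL" >> /var/db/zpool.scrub.resume
				zpool scrub -p "$POOL"
				;;
			esac
			;;
		esac
	done
}

load_rc_config $name
run_rc_command "$1"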