Using FreeBSD 13.0-STABLE #0 stable/13-n245116-edda63ba979a When I rm -rf a directory with a lot of files like "rm -rf /usr/ports/*" and /usr/ports is symlinked to another directory on another drive like /drive2/ports The rm -rf finishes but and returns to the prompt but gstat shows 100% writes to the disk forever and attempting to access the directory hangs. When trying to reboot the system, it says there are outstanding writes and eventually needing to power cycle the system to get it back to a usable state. This has happened when I had it on 13.0-RELEASE
UFS file systems? ZFS file systems? TRIM enabled and in use vs. not? (gstat -spod would show delete activity for TRIM activity.) There may be other configuration information that would be relevant that I've not thought of. Hangs: Do both forms of reference hang: /drive2/ports/NAME and /usr/ports/NAME ?
UFS2, spinning disks in a gmirror (but also happened on single disk). No trim enabled. I just tried to ls /usr/ports after but ls just hung, didn't try anything else. kill -9 the ls process didn't respond, just stuck at biorw or something similar to biorsomething.
Looking through bug reports, it looks similar to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224292 in the pkg upgrade case on 13.0-CURRENT Thinking about it, it does usually happen when I'm attempting to do some pkg operation at the same time I'm deleting a ton of files (usually rm -rf old world builds in /usr/obj and /usr/ports/*/*/work directories set to WRKDIRPREFIX=/usr/obj that weren't cleaned up) I'll try to reproduce the problem again on a vm.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255499 looks to be similar also. I'm also using soft update journaling: (-j) enabled
I just encountered another infinite write with gmirror setup on UFS2+SUJ FreeBSD 13.0-STABLE (PXE) #0 stable/13-n245487-d0fbb03a4dc5: Tue May 4 00:36:26 I was shutting down a virtualbox vm and then this process shows up and attempting to lsof -p 47491 just hangs without being able to ctrl+c 47491 root 1 20 0 12M 1792K biowr 2 0:11 1.11% lockf gstat -spod dT: 1.001s w: 1.000s L(q) ops/s r/s kB kBps ms/r w/s kB kBps ms/w d/s kB kBps ms/d o/s ms/o %busy Name 0 0 0 0 0 0.0 0 0 0 0.0 0 0 0 0.0 0 0.0 0.0| cd0 3 4034 0 0 0 0.0 4034 32 129087 0.6 0 0 0 0.0 0 0.0 98.5| ada0 2 4033 0 0 0 0.0 4033 32 129055 0.5 0 0 0 0.0 0 0.0 96.4| ada1 0 0 0 0 0 0.0 0 0 0 0.0 0 0 0 0.0 0 0.0 0.0| ada2 but after typing the gstat command, the system is completely frozen from i/o and can't ssh anymore. I had to get on the console and top is running but even top does not exit. Attaching screenshot in next comment.
Created attachment 224684 [details] top with hung machine doing infinite i/o top output from console since machine is locked up from infinite i/o
Created attachment 224685 [details] The lockf process
Let me know if I can run anything else while console is still active and machine frozen
Does the problem go away if you run with just soft updates (rather than journalled soft updates)? We have been investigating a similar issue where the journel gets full and we are not managing to effectively flush it.
(In reply to Kirk McKusick from comment #9) You can disable journalled soft updates on disk /dev/ada0p2 on /mnt using: # unmount /mnt # tunefs -j disable /dev/ada0p2 Clearing journal flags from inode 4 tunefs: soft updates journaling cleared but soft updates still set. tunefs: remove .sujournal to reclaim space # mount /dev/ada0p2 /mnt # rm -f /mnt/.sujournal
(In reply to Kirk McKusick from comment #10) It doesn't seem to always happen so hard to tell but it's also the / gmirror disk so would need to reboot the system to make the change.
(In reply to Jack from comment #11) It would be helpful to turn off soft updates journaling when you next schedule a reboot so that we can find out if the problem persists. Journaling slows your disk I/O by 1-2% as a tradeoff for making fsck run much faster after a crash. There is no change in risk of filesystem corruption from turning off journaling.
It's doing the infinite writes again 13.0-STABLE FreeBSD 13.0-STABLE #0 stable/13-n245485-dec9f377531d 1 13805 0 0 0.0 13805 441756 0.1 100.0| ada0 1 13805 0 0 0.0 13805 441756 0.1 100.0| ada1 1 13805 0 0 0.0 13805 441756 0.1 100.0| ada0p2 1 13805 0 0 0.0 13805 441756 0.1 100.0| ada1p2 1 13805 0 0 0.0 13805 441756 0.1 100.0| mirror/root1 The command I issued immediately before was cd /usr/ports;git pull;portupgrade -ua It's been stuck at [Reading data from pkg(8) ... - 1088 packages found - done] for the past 2 hours and everything i type results in a hang. top shows 1113 dhcpd 1 24 0 21M 2956K biowr 1 12:51 8.50% dhcpd 50445 root 1 23 0 117M 38M getblk 5 17:50 6.66% ruby27 50776 root 1 22 0 13M 2552K getblk 3 5:09 5.24% sh 50693 root 1 22 0 141M 8316K ufs 2 5:02 4.21% smbd 50783 root 1 22 0 141M 8316K ufs 7 3:27 4.17% smbd 49660 root 1 22 0 141M 8280K RUN 7 46:09 4.16% smbd 50724 root 1 22 0 141M 8316K ufs 5 4:32 3.96% smbd 50884 root 1 22 0 141M 8320K ufs 3 1:22 3.95% smbd 50817 root 1 22 0 141M 8320K ufs 2 2:40 3.94% smbd 50789 root 1 22 0 141M 8316K biowr 6 3:24 3.91% smbd 50532 root 1 21 0 141M 8316K ufs 2 9:06 3.83% smbd 50725 root 1 22 0 141M 8316K ufs 1 4:32 3.83% smbd 50888 root 1 22 0 141M 8324K ufs 5 1:19 3.82% smbd 50711 root 1 21 0 141M 8316K ufs 6 4:41 3.80% smbd 50786 root 1 21 0 141M 8316K ufs 2 3:24 3.79% smbd 50885 root 1 21 0 141M 8320K ufs 5 1:20 3.78% smbd 49659 root 1 21 0 46M 5648K RUN 6 46:17 3.76% nmbd 50887 root 1 22 0 141M 8324K biowr 4 1:19 3.73% smbd 49658 root 1 21 0 54M 7480K CPU2 2 46:17 3.73% winbindd 50889 root 1 21 0 141M 8324K ufs 3 1:17 3.71% smbd 50784 root 1 21 0 141M 8316K ufs 7 3:27 3.64% smbd 50886 root 1 21 0 141M 8320K ufs 4 1:20 3.63% smbd 50818 root 1 21 0 141M 8320K ufs 4 2:39 3.58% smbd 50533 root 1 21 0 141M 8316K ufs 7 9:07 3.57% smbd Only way to recover is to power cycle the machine as reboot also hangs.
(In reply to Jack from comment #13) When you do the reboot, please try disabling soft update journaling as described in my comment #10.
(In reply to Kirk McKusick from comment #14) But if I disable journaling and I power cycle the machine, will fsck take a very long time on a 4T gmirror with maybe over a million files?
(In reply to Jack from comment #15) Fsck will likely take up to 15 minutes. But if the net result is that the machine stops hanging, then it will be well worth it. And if it does stop hanging the machine, we will have a good idea of where to look to find and hopefully fix the problem.
(In reply to Kirk McKusick from comment #16) Ok, I'll try. This machine is a router and file server so it's hard to find a good time to not unintentionally take down the machine.
(In reply to Jack from comment #17) Is it possible for you to try applying the patch described in https://reviews.freebsd.org/D30041 It is attempting to mitigate the problem that you are experiencing when running with soft updates journaling. It would still be useful to know if disabling journaling is helpful because if it is then it validates that the fix in D30041 is a useful idea to pursue.
(In reply to Kirk McKusick from comment #18) I can apply the patch. So far the system hasn't hung yet so I haven't needed to reboot it. The random hangs makes it hard to tell when it will show itself unfortunately.
Had a power outage and still haven't been able to reproduce the issue on the system that was hanging but I now have a new computer with an NVME and SSD in a gmirror together running the latest 5/25/21 b14db362bbd2 git checkout of the source code that hung. I had done a fresh make buildworld,installworld, rebooted, then did a rm -rf /usr/ports. Edited /etc/fstab and system did infinite writes to the gmirror nvme/ssd. It was so hung that my /etc/fstab couldn't save and the only way to recover was to power cycle the machine. The gmirror had su+j+trim. Appears bug still exists in current 13-stable.
(In reply to Jack from comment #20) I think I'm seeing the same problem on 14-CURRENT occasionally and when it starts doing crazy writes just running "sync" makes it stop. When running commands is still possible anyway. It has gotten rarer for me with recent updates but running chrome and clicking links really fast would sometimes still trigger the problem (with files from chrome's cache directory being involved). In my case it's a VM with the disks on zvols and the pool has sync=disabled. Maybe it causes write I/Os to be much faster than what is usual for UFS and that's what can trigger the problem. Maybe your NVME disks are way faster than what it's used to too. Maybe regular data being written so fast causes more complex meta-data softupdates structures to be handled together. I'm really just guessing here... in case it could give someone an idea. Maybe running UFS on virtual disks with ridiculously fast writes could be a good way to try to reproduce it.
(In reply to Jack from comment #20) Let me reiterate. Please try disabling journaled soft updates as described in comment #10 to see if the problem persists. This basic test will enable us to understand how to proceed on fixing the problem.
(In reply to sigsys from comment #21) Please try running with journaled soft updates disabled (but leave soft updates enabled) as described in my comment #10. Let us know if that makes the problem go away.
(In reply to Kirk McKusick from comment #23) Welp, now I can't for the life of me reproduce it (with journaling enabled). Usually I could by running some du/cp/rm loops and chrome but now it seems intent on just working perfectly. It used to happen a lot (sometimes a few times a day, and not necessarily with a ton of FS activity like that) but it has become much more rare now. So I might not be very useful to test this anymore but it definitively looked like the same problem. I'm going to try to reproduce it with journaling enabled some more and then run this system with journaling disabled and see if the problem ever happens again.
(In reply to sigsys from comment #24) I'm in the same situation, if I try to reproduce it, it doesn't happen but when I'm doing normal things, it crops up. I'm disabling journaling on my nvme/ssd gmirror to see if it comes up again. Brand new install.
Created attachment 225293 [details] Unable to turn off soft update journaling I tried to turn off soft update journaling but when I mount the filesystem, tunefs says it's on.
(In reply to Jack from comment #26) Did you follow my instructions in comment #10? Specifically you need to have done it so that it gets disabled on both sides of the mirror. Also, did you remove .sujournal? Because if it is gone, it is not possible to mount with journaling enabled.
(In reply to Kirk McKusick from comment #27) I thought removing it from the gmirror was sufficient, did not know it had to be done on the disk itself. Will remove it from the disks and try again.
Created attachment 225298 [details] Unable to turn off soft update journaling I booted into single user mode and I tried to turn off journaling but it's throwing this error (I set kern.geom.debugflags=18)
(In reply to Jack from comment #29) What does the filesystem report when it is mounted? Does it show as journaling?
Says it is on gmirror status root Name Status Components mirror/root COMPLETE nvd0p2 (ACTIVE) ada0p2 (ACTIVE) ada1p2 (ACTIVE) /dev/mirror/root on / (ufs, local, noatime, journaled soft-updates) tunefs -p /dev/mirror/root tunefs: POSIX.1e ACLs: (-a) disabled tunefs: NFSv4 ACLs: (-N) disabled tunefs: MAC multilabel: (-l) disabled tunefs: soft updates: (-n) enabled tunefs: soft update journaling: (-j) enabled tunefs: gjournal: (-J) disabled tunefs: trim: (-t) enabled tunefs: maximum blocks per file in a cylinder group: (-e) 4096 tunefs: average file size: (-f) 16384 tunefs: average number of files in a directory: (-s) 64 tunefs: minimum percentage of free space: (-m) 0% tunefs: space to hold for metadata blocks: (-k) 6400 tunefs: optimization preference: (-o) space tunefs: volume label: (-L) tunefs -p /dev/nvd0p2 tunefs: POSIX.1e ACLs: (-a) disabled tunefs: NFSv4 ACLs: (-N) disabled tunefs: MAC multilabel: (-l) disabled tunefs: soft updates: (-n) enabled tunefs: soft update journaling: (-j) enabled tunefs: gjournal: (-J) disabled tunefs: trim: (-t) enabled tunefs: maximum blocks per file in a cylinder group: (-e) 4096 tunefs: average file size: (-f) 16384 tunefs: average number of files in a directory: (-s) 64 tunefs: minimum percentage of free space: (-m) 0% tunefs: space to hold for metadata blocks: (-k) 6400 tunefs: optimization preference: (-o) space tunefs: volume label: (-L) tunefs -p /dev/ada0p2 tunefs: POSIX.1e ACLs: (-a) disabled tunefs: NFSv4 ACLs: (-N) disabled tunefs: MAC multilabel: (-l) disabled tunefs: soft updates: (-n) enabled tunefs: soft update journaling: (-j) enabled tunefs: gjournal: (-J) disabled tunefs: trim: (-t) enabled tunefs: maximum blocks per file in a cylinder group: (-e) 4096 tunefs: average file size: (-f) 16384 tunefs: average number of files in a directory: (-s) 64 tunefs: minimum percentage of free space: (-m) 0% tunefs: space to hold for metadata blocks: (-k) 6400 tunefs: optimization preference: (-o) space tunefs: volume label: (-L) tunefs -p /dev/ada1p2 tunefs: POSIX.1e ACLs: (-a) disabled tunefs: NFSv4 ACLs: (-N) disabled tunefs: MAC multilabel: (-l) disabled tunefs: soft updates: (-n) enabled tunefs: soft update journaling: (-j) enabled tunefs: gjournal: (-J) disabled tunefs: trim: (-t) enabled tunefs: maximum blocks per file in a cylinder group: (-e) 4096 tunefs: average file size: (-f) 16384 tunefs: average number of files in a directory: (-s) 64 tunefs: minimum percentage of free space: (-m) 0% tunefs: space to hold for metadata blocks: (-k) 6400 tunefs: optimization preference: (-o) space tunefs: volume label: (-L)
(In reply to Jack from comment #31) When you booted single user on /dev/mirror/root and tried to use tunefs to disable soft updates journaling, what error did you get?
(In reply to Kirk McKusick from comment #32) I did not get any error. See https://bz-attachments.freebsd.org/attachment.cgi?id=225293 It claims it was disabled but when mounting it, tunefs showed it was enabled.
(In reply to Kirk McKusick from comment #32) If I provide you with a kernel patch that disables journaled soft updates can you install and run a patched kernel on your system? If so let me know what vintage kernel you are running (output of uname -a).
(In reply to Kirk McKusick from comment #34) Yes, I can do that. I'll be checking out the current 13-stable so the patch against whatever is in git will be fine.
Created attachment 225397 [details] patch to disable journaled soft updates please try this patch.
Created attachment 225399 [details] Unable to disable journal on softupdates after applying patch I'm still unable to turn off journaling with the patch. root@upstairs:/usr/src # git diff diff --git a/sys/ufs/ffs/ffs_softdep.c b/sys/ufs/ffs/ffs_softdep.c index 2cc16ab2a2c1..cc3f64b4e70e 100644 --- a/sys/ufs/ffs/ffs_softdep.c +++ b/sys/ufs/ffs/ffs_softdep.c @@ -2653,6 +2653,10 @@ softdep_uninitialize() callout_drain(&softdep_callout); } +static int enable_suj = 0; +SYSCTL_INT(_debug_softdep, OID_AUTO, enable_suj, CTLFLAG_RW, + &enable_suj, 0, "Enable the activation of journaled soft updates"); + /* * Called at mount time to notify the dependency code that a * filesystem wishes to use it. @@ -2716,7 +2720,7 @@ softdep_mount(devvp, mp, fs, cred) ACQUIRE_GBLLOCK(&lk); TAILQ_INSERT_TAIL(&softdepmounts, sdp, sd_next); FREE_GBLLOCK(&lk); - if ((fs->fs_flags & FS_SUJ) && + if (enable_suj && (fs->fs_flags & FS_SUJ) && (error = journal_mount(mp, fs, cred)) != 0) { printf("Failed to start journal: %d\n", error); softdep_unmount(mp); root@upstairs:/usr/src # uname -a FreeBSD upstairs.jack.com 13.0-STABLE FreeBSD 13.0-STABLE #0 stable/13-n245805-4a4a174959ce-dirty: Sun May 30 23:31:47 PDT 2021 jack@upstairs.jack.com:/usr/obj/usr/src/amd64.amd64/sys/UPSTAIRS amd64
(In reply to Jack from comment #37) The filesystem is still enabled for journaled soft updates, but with my patch the kernel does not mount it with journaled soft updates (unless you re-enable them using `sysctl debug.softdep.enable_suj=1'). You can see that journaled soft updates are not enabled on the filesystem by typing the `mount' command (with no arguments) after it is mounted and you will see that it is running with just soft updates.
Looks like that's the case. /dev/mirror/root on / (ufs, local, noatime, soft-updates) I'll try to see if I can make the system hang with the usual ways it happens.
(In reply to Jack from comment #39) We have had others report that they can reproduce the problem and that 1) disabling journaling makes the problem go away; 2) increasing the journal size moderates the problem; 3) the patch proposed in https://reviews.freebsd.org/D30041 helps to moderate the problem.
That's great that the issue is reproducible, I can't seem to reliably reproduce it. So far I've been running with the disable journal patch and have not come across any hangs yet.
*** Bug 253968 has been marked as a duplicate of this bug. ***
(In reply to sigsys from comment #21) I have a few more systems that are default installs with the latest 13-STABLE (as of today) with soft updates+journal+ufs+gmirror randomly have this infinite write problem when doing a cd /usr/ports;git pull, some are ssds and some are spinning disks. Simply typing "sync" appears to stop the infinite writes for me too if that helps any for solving the soft updates+journaling.
(In reply to Jack from comment #43) Thanks for the update. The patch that was committed to head (described in https://reviews.freebsd.org/D30041) includes doing an internal sync when this condition arises and that helps a lot. Assuming that no new issues arise in head, the MFC to 13-stable should happen later this week.
(In reply to Jack from comment #43) The fix was MFC'ed to 13-STABLE on July 7th, so a kernel built from sources on or after that date should have the fix.
(In reply to Kirk McKusick from comment #45) Thanks, I'll unpatch my src and rebuild (and upgrade the rest of my systems) to see if the issue resurfaces.