Bug 255473 - Infinite writes on UFS with SU+J filesystem
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-fs (Nobody)
Reported: 2021-04-28 22:55 UTC by Jack
Modified: 2021-05-14 00:06 UTC

Attachments
top with hung machine doing infinite i/o (286.73 KB, image/jpeg)
2021-05-05 10:24 UTC, Jack
The lockf process (94.49 KB, image/jpeg)
2021-05-05 10:32 UTC, Jack

Description Jack 2021-04-28 22:55:27 UTC
Using FreeBSD 13.0-STABLE #0 stable/13-n245116-edda63ba979a

The problem appears when I rm -rf a directory with a lot of files, for example "rm -rf /usr/ports/*", where /usr/ports is symlinked to a directory on another drive such as /drive2/ports.

The rm -rf finishes and returns to the prompt, but gstat shows 100% writes to the disk forever, and attempting to access the directory hangs.

When I try to reboot the system, it says there are outstanding writes, and I eventually need to power cycle the system to get it back to a usable state.

This also happened when the system was running 13.0-RELEASE.
Comment 1 Mark Millard 2021-04-29 01:41:09 UTC
UFS file systems? ZFS file systems? TRIM enabled
and in use vs. not? (gstat -spod would show
delete activity for TRIM activity.)

There may be other configuration information that
would be relevant that I've not thought of.

Hangs: Do both forms of reference hang:
/drive2/ports/NAME and /usr/ports/NAME ?
Comment 2 Jack 2021-04-29 01:50:55 UTC
UFS2, spinning disks in a gmirror (but also happened on single disk). No trim enabled.

I just tried to ls /usr/ports afterwards, but ls hung; I didn't try anything else. The ls process didn't respond to kill -9 and stayed stuck in biorw or a similarly named state (bior-something).
Comment 3 Jack 2021-04-29 01:58:53 UTC
Looking through bug reports, it looks similar to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224292 in the pkg upgrade case on 13.0-CURRENT

Thinking about it, it does usually happen when I'm attempting to do some pkg operation at the same time I'm deleting a ton of files (usually rm -rf old world builds in /usr/obj and /usr/ports/*/*/work directories set to WRKDIRPREFIX=/usr/obj that weren't cleaned up)

I'll try to reproduce the problem again on a vm.
Comment 4 Jack 2021-04-30 03:06:26 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255499 looks to be similar also. I'm also using 
soft update journaling: (-j)                       enabled
Comment 5 Jack 2021-05-05 10:23:20 UTC
I just encountered another infinite write with gmirror setup on UFS2+SUJ
FreeBSD 13.0-STABLE (PXE) #0 stable/13-n245487-d0fbb03a4dc5: Tue May  4 00:36:26                               

I was shutting down a VirtualBox VM when this process showed up. Running lsof -p 47491 just hangs and cannot be interrupted with Ctrl+C.

47491 root          1  20    0    12M  1792K biowr    2   0:11   1.11% lockf

gstat -spod
dT: 1.001s  w: 1.000s
 L(q)  ops/s    r/s     kB   kBps   ms/r    w/s     kB   kBps   ms/w    d/s     kB   kBps   ms/d    o/s   ms/o  %busy Name
    0      0      0      0      0    0.0      0      0      0    0.0      0      0      0    0.0      0    0.0   0.0| cd0
    3   4034      0      0      0    0.0   4034     32 129087    0.6      0      0      0    0.0      0    0.0  98.5| ada0
    2   4033      0      0      0    0.0   4033     32 129055    0.5      0      0      0    0.0      0    0.0  96.4| ada1
    0      0      0      0      0    0.0      0      0      0    0.0      0      0      0    0.0      0    0.0   0.0| ada2

After I typed the gstat command, the system froze completely on I/O and I could no longer ssh in. I had to get on the console, where top is running, but even top does not exit. Attaching a screenshot in the next comment.
Comment 6 Jack 2021-05-05 10:24:18 UTC
Created attachment 224684
top with hung machine doing infinite i/o

top output from console since machine is locked up from infinite i/o
Comment 7 Jack 2021-05-05 10:32:23 UTC
Created attachment 224685
The lockf process
Comment 8 Jack 2021-05-05 10:39:04 UTC
Let me know if I can run anything else while the console is still active and the machine is frozen.
Comment 9 Kirk McKusick freebsd_committer 2021-05-11 23:45:15 UTC
Does the problem go away if you run with just soft updates (rather than journalled soft updates)? We have been investigating a similar issue where the journal gets full and we are not managing to effectively flush it.
Comment 10 Kirk McKusick freebsd_committer 2021-05-11 23:56:37 UTC
(In reply to Kirk McKusick from comment #9)

You can disable journalled soft updates on disk /dev/ada0p2 mounted on /mnt using:

# umount /mnt
# tunefs -j disable /dev/ada0p2
Clearing journal flags from inode 4
tunefs: soft updates journaling cleared but soft updates still set.
tunefs: remove .sujournal to reclaim space
# mount /dev/ada0p2 /mnt
# rm -f /mnt/.sujournal
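
To confirm that the change took effect, the journaling flag can be read back with tunefs -p, which prints a line like the one quoted in comment 4. The following is only a minimal sketch under that assumption; the helper name suj_state is hypothetical and not part of FreeBSD.

```sh
#!/bin/sh
# suj_state (hypothetical helper): read `tunefs -p` output on stdin and
# print the last word of the "soft update journaling" line, which is
# expected to be "enabled" or "disabled".
# Returns non-zero if no such line is present in the input.
suj_state() {
    awk '/soft update journaling:/ { print $NF; found = 1 }
         END { exit(found ? 0 : 1) }'
}

# Example (device name taken from comment 10). tunefs may print its
# report on stderr, so both streams are captured:
#   tunefs -p /dev/ada0p2 2>&1 | suj_state
```

After the steps above this should print "disabled"; before them, on a SU+J filesystem, "enabled".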
Comment 11 Jack 2021-05-12 04:32:37 UTC
(In reply to Kirk McKusick from comment #10)
It doesn't always happen, so it's hard to tell. It's also the gmirror disk holding /, so I would need to reboot the system to make the change.
Comment 12 Kirk McKusick freebsd_committer 2021-05-12 20:24:04 UTC
(In reply to Jack from comment #11)
It would be helpful to turn off soft updates journaling when you next schedule a reboot so that we can find out if the problem persists. Journaling slows your disk I/O by 1-2% as a tradeoff for making fsck run much faster after a crash. There is no change in risk of filesystem corruption from turning off journaling.
Comment 13 Jack 2021-05-14 00:06:38 UTC
It's doing the infinite writes again
13.0-STABLE FreeBSD 13.0-STABLE #0 stable/13-n245485-dec9f377531d

    1  13805      0      0    0.0  13805 441756    0.1  100.0| ada0
    1  13805      0      0    0.0  13805 441756    0.1  100.0| ada1
    1  13805      0      0    0.0  13805 441756    0.1  100.0| ada0p2
    1  13805      0      0    0.0  13805 441756    0.1  100.0| ada1p2
    1  13805      0      0    0.0  13805 441756    0.1  100.0| mirror/root1

The command I issued immediately before was
cd /usr/ports; git pull; portupgrade -ua

It's been stuck at

[Reading data from pkg(8) ... - 1088 packages found - done]

for the past 2 hours, and everything I type results in a hang.

top shows

 1113 dhcpd         1  24    0    21M  2956K biowr    1  12:51   8.50% dhcpd
50445 root          1  23    0   117M    38M getblk   5  17:50   6.66% ruby27
50776 root          1  22    0    13M  2552K getblk   3   5:09   5.24% sh
50693 root          1  22    0   141M  8316K ufs      2   5:02   4.21% smbd
50783 root          1  22    0   141M  8316K ufs      7   3:27   4.17% smbd
49660 root          1  22    0   141M  8280K RUN      7  46:09   4.16% smbd
50724 root          1  22    0   141M  8316K ufs      5   4:32   3.96% smbd
50884 root          1  22    0   141M  8320K ufs      3   1:22   3.95% smbd
50817 root          1  22    0   141M  8320K ufs      2   2:40   3.94% smbd
50789 root          1  22    0   141M  8316K biowr    6   3:24   3.91% smbd
50532 root          1  21    0   141M  8316K ufs      2   9:06   3.83% smbd
50725 root          1  22    0   141M  8316K ufs      1   4:32   3.83% smbd
50888 root          1  22    0   141M  8324K ufs      5   1:19   3.82% smbd
50711 root          1  21    0   141M  8316K ufs      6   4:41   3.80% smbd
50786 root          1  21    0   141M  8316K ufs      2   3:24   3.79% smbd
50885 root          1  21    0   141M  8320K ufs      5   1:20   3.78% smbd
49659 root          1  21    0    46M  5648K RUN      6  46:17   3.76% nmbd
50887 root          1  22    0   141M  8324K biowr    4   1:19   3.73% smbd
49658 root          1  21    0    54M  7480K CPU2     2  46:17   3.73% winbindd
50889 root          1  21    0   141M  8324K ufs      3   1:17   3.71% smbd
50784 root          1  21    0   141M  8316K ufs      7   3:27   3.64% smbd
50886 root          1  21    0   141M  8320K ufs      4   1:20   3.63% smbd
50818 root          1  21    0   141M  8320K ufs      4   2:39   3.58% smbd
50533 root          1  21    0   141M  8316K ufs      7   9:07   3.57% smbd

The only way to recover is to power cycle the machine, as reboot also hangs.
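
The signature in the gstat rows of comment 13 (no reads, thousands of writes per second, close to 100% busy) can be watched for with a small filter over gstat batch output. This is a sketch only: it assumes the 10-column layout of the rows pasted in that comment (L(q), ops/s, r/s, kBps, ms/r, w/s, kBps, ms/w, %busy, Name), and write_saturated is a made-up name.

```sh
#!/bin/sh
# write_saturated (hypothetical helper): read gstat rows on stdin and print
# "name w/s %busy" for every device that is doing writes but no reads and
# is at least MIN percent busy. MIN is the first argument, default 90.
write_saturated() {
    awk -v min="${1:-90}" '
        # Data rows have 10 fields and start with the numeric queue length;
        # header lines do not match and are skipped.
        NF == 10 && $1 ~ /^[0-9]+$/ {
            busy = $9
            sub(/\|$/, "", busy)        # %busy is printed as "100.0|"
            if ($3 + 0 == 0 && $6 + 0 > 0 && busy + 0 >= min)
                print $10, $6, busy
        }'
}

# Example: take one batch sample of all providers and flag the busy ones:
#   gstat -b | write_saturated 95
```

Run against the rows in comment 13, it would list ada0, ada1, ada0p2, ada1p2 and mirror/root1. It does not apply to the wider gstat -spod layout from comment 5, which has extra columns.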