Bug 253968

Summary: 13.0-BETA4: UFS with journaled soft-updates is three times SLOWER than before
Product: Base System Reporter: rashey
Component: kernAssignee: freebsd-bugs (Nobody) <bugs>
Status: New ---    
Severity: Affects Many People CC: chris, devin, freebsd, marklmi26-fbsd, pi, rkoberman
Priority: ---    
Version: 13.0-STABLE   
Hardware: Any   
OS: Any   

Description rashey 2021-03-02 23:25:24 UTC
Hi,

Based on FreeBSD Installer, you can see performance comparison between 12.1-RELEASE and 13.0-BETA4 during extracting base, kernel, lib32 and src:
12.1-RELEASE: ~850-1500 files/sec.
13.0-BETA4: ~360-500 files/sec.
Comment 1 Christos Chatzaras 2021-03-03 04:22:25 UTC
It's even worse with doing specific operations, for example:

When I do "rm -fr /usr/ports" with 12.2 takes 5 seconds and the same command with 13.0-BETA2 takes 100 seconds.
Comment 2 Mark Millard 2021-03-03 05:29:12 UTC
(In reply to Christos Chatzaras from comment #1)

I wonder if "trim" is involved and, if so, if
enabled vs. not makes a difference. Might there
now be a lot more trim activity for some reason
(when enabled)?

(No specific evidence beyond the pure "rm -fr" being
reported as so much slower.)
Comment 3 Mark Millard 2021-03-03 05:34:24 UTC
(In reply to Mark Millard from comment #2)

In a context that sees the problem (if "trim"
is involved): using "gstat -spod" in the fast
and slow contexts and comparing might make it
obvious if there is a difference in the amount
of "d" (delete/trim) activity for similar
operations if other things like media differences
are unlikely to be contributing.
Comment 4 rkoberman 2021-03-03 06:33:21 UTC
I'm seeing something similar and have been seeing it on 13 since it was 13-current. Doing a "make extract" before and after building www/firefox takes many, many minutes. 

The system is a ThinkPad L15, but my file system is SU, not SU+J. When I monitor disc usage, it is inconsistent. It runs at "normal throughput of >40MB/s. It will abruptly drop to <2MB/s and remain there for a time that varies from a few seconds to minutes. It then jumps to normal throughput. This a spinning rust, so TRIM should not be involved. Since 12.2 does not support my graphics (CometLake), I really can't compare to 12.2. Also, I am running 13-STABLE, not BETA4. It was update on the 27th, shortly after BETA4 was announced, so I assume I have all BETA4 fixes. Disk performance monitored with gkrellm2.
Comment 5 Mark Millard 2021-03-03 07:21:12 UTC
(In reply to rkoberman from comment #4)

Doing "make fetch" and "make extract" for www/firefox is
something I can try on a 128 GiBYTe RAM ThreadRipper 1950X
that uses 960 GiByte Optane media and is running main
(14-CURRENT) from back on 2021-Feb-08. (Optane does not use
trim and is fast.) UFS2+SU (no J). The example context
predates any recent potential fixes. So . . .

Doing similar "time -l" might happen to expose something
common to the good vs. bad cases for those were the issue
stands out in their context.

# time -l make extract
===> Options unchanged
===>   firefox-85.0.2,2 depends on file: /usr/local/sbin/pkg - found
===> Fetching all distfiles required by firefox-85.0.2,2 for building
===>  Extracting for firefox-85.0.2,2
=> SHA256 Checksum OK for firefox-85.0.2.source.tar.xz.
       70.10 real        23.54 user        18.01 sys
     77540  maximum resident set size
        50  average shared memory size
        10  average unshared data size
       130  average unshared stack size
     37678  page reclaims
         0  page faults
         0  swaps
     12848  block input operations
   1175850  block output operations
        83  messages sent
        79  messages received
         0  signals received
   1188850  voluntary context switches
      1816  involuntary context switches

# time -l rm -fr /wrkdirs/usr/ports/www
       17.36 real         0.24 user         6.51 sys
      4116  maximum resident set size
         8  average shared memory size
         4  average unshared data size
       134  average unshared stack size
       587  page reclaims
         0  page faults
         0  swaps
         2  block input operations
    555695  block output operations
         0  messages sent
         0  messages received
         0  signals received
    607758  voluntary context switches
         0  involuntary context switches

(No other www/ content.) Looks like "block
output operations" and "voluntary context
switches" are what has lots of examples in
this context.


For reference: main 3acea07c1873 based:

# ~/fbsd-based-on-what-freebsd-main.sh 
merge-base: 3acea07c1873b1e4042f4a4fa8668745ee59f15b
merge-base: CommitDate: 2021-02-08 19:15:21 +0000
c1845d00f818 (HEAD -> mm-src) mm-src snapshot for mm's patched build in git context.
3acea07c1873 (pure-src) Restore the augmented strlen commentary
FreeBSD FBSDFHUGE 14.0-CURRENT FreeBSD 14.0-CURRENT mm-src-n244686-c1845d00f818 GENERIC-NODBG  amd64 amd64 1400004 1400004


I may be able to do similarly on some aarch64 SSD
based machines (that would have trim involved).
Comment 6 Mark Millard 2021-03-03 08:09:37 UTC
(In reply to Mark Millard from comment #5)

[I should have mentioned that the ThreadRipper
results are based on non-debug builds of main
(14-CURRENT). So are the below. Same FreeBSD
vintage (same source tree even). UFS2+SU (no J)
again as well.]

aarch64 examples that others might be able to
compare against . . .

From a OverDrive 1000 (4 Cortex-A57 cores
8 GiByte RAM):

# time -l make extract
===>   firefox-85.0.2,2 depends on file: /usr/local/sbin/pkg - found
===> Fetching all distfiles required by firefox-85.0.2,2 for building
===>  Extracting for firefox-85.0.2,2
=> SHA256 Checksum OK for firefox-85.0.2.source.tar.xz.
      221.36 real        57.32 user        65.94 sys
     76068  maximum resident set size
        42  average shared memory size
        12  average unshared data size
       128  average unshared stack size
     32646  page reclaims
        16  page faults
         0  swaps
     22893  block input operations
   1175983  block output operations
        47  messages sent
        45  messages received
         5  signals received
   1198799  voluntary context switches
       487  involuntary context switches

# time -l rm -fr /wrkdirs/usr/ports/www
      111.95 real         0.48 user        18.95 sys
      3756  maximum resident set size
         7  average shared memory size
         3  average unshared data size
       126  average unshared stack size
       596  page reclaims
         0  page faults
         0  swaps
         2  block input operations
    555695  block output operations
         0  messages sent
         0  messages received
         0  signals received
    613033  voluntary context switches
         1  involuntary context switches

The "rm -fr" here took a larger fraction of the
"make extract" "real" time compared to the
ThreadRipper ratio.

A MACCHIATObin Double Shot got similar results
(4 Cortex-A72 cores, 16 GiByte RAM):

# time -l make extract
===>   firefox-85.0.2,2 depends on file: /usr/local/sbin/pkg - found
===> Fetching all distfiles required by firefox-85.0.2,2 for building
===>  Extracting for firefox-85.0.2,2
=> SHA256 Checksum OK for firefox-85.0.2.source.tar.xz.
      209.41 real        49.36 user        62.94 sys
     75620  maximum resident set size
        42  average shared memory size
        12  average unshared data size
       127  average unshared stack size
     33507  page reclaims
        10  page faults
         0  swaps
     13861  block input operations
   1176990  block output operations
        47  messages sent
        45  messages received
         5  signals received
   1189697  voluntary context switches
       293  involuntary context switches

# time -l rm -fr /wrkdirs/usr/ports/www
       99.39 real         0.48 user        19.33 sys
      3752  maximum resident set size
         7  average shared memory size
         3  average unshared data size
       124  average unshared stack size
       596  page reclaims
         0  page faults
         0  swaps
         2  block input operations
    555695  block output operations
         0  messages sent
         0  messages received
         0  signals received
    639878  voluntary context switches
        29  involuntary context switches

Of course, the SSD's involved are not likely to
be as subject to fragmentation consequences as
spinning drives. So comparisons to the above
might be complicated by such.

(There may be better things to look at with
other tools.)
Comment 7 Mark Millard 2021-03-03 09:08:31 UTC
(In reply to Mark Millard from comment #6)

[I use -noatime mounts, by the way.]

Given the reports I got a surprise booting the OverDrive 1000
via 13.0-BETA4 and trying the operations on the SSD (not
the boot media this time), evidence of definite changes from
the main (14) vintage that I have things at for normal booting:

# time -l make extract
===>   firefox-85.0.2,2 depends on file: /usr/local/sbin/pkg - found
===> Fetching all distfiles required by firefox-85.0.2,2 for building
===>  Extracting for firefox-85.0.2,2
=> SHA256 Checksum OK for firefox-85.0.2.source.tar.xz.
       93.39 real        55.46 user        34.47 sys
     76576  maximum resident set size
        30  average shared memory size
         4  average unshared data size
       128  average unshared stack size
     26460  page reclaims
         0  page faults
         0  swaps
     15569  block input operations
       994  block output operations
        16  messages sent
        14  messages received
         0  signals received
     16752  voluntary context switches
      1192  involuntary context switches

Vastly fewer block output operations (and less time).

# time -l rm -fr work
       12.74 real         0.29 user         7.80 sys
      3920  maximum resident set size
         8  average shared memory size
         4  average unshared data size
       128  average unshared stack size
       588  page reclaims
         0  page faults
         0  swaps
       644  block input operations
       360  block output operations
         0  messages sent
         0  messages received
         0  signals received
      1093  voluntary context switches
       337  involuntary context switches

So even if there are issues still, apparently some have
been taken care of.

For reference:

# uname -apKU
FreeBSD generic 13.0-BETA4 FreeBSD 13.0-BETA4 #0 releng/13.0-n244592-e32bc253629: Fri Feb 26 08:02:53 UTC 2021     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC  arm64 aarch64 1300139 1300139
Comment 8 Mark Millard 2021-03-03 11:14:00 UTC
(In reply to Mark Millard from comment #7)

12.2-RELEASE on the OverDrive 1000 this time,
again operating on the SSD:

# time -l make extract
===>   firefox-85.0.2,2 depends on file: /usr/local/sbin/pkg - found
===> Fetching all distfiles required by firefox-85.0.2,2 for building
===>  Extracting for firefox-85.0.2,2
=> SHA256 Checksum OK for firefox-85.0.2.source.tar.xz.
      120.92 real        55.87 user        61.93 sys
     76520  maximum resident set size
        27  average shared memory size
        10  average unshared data size
       127  average unshared stack size
     27241  page reclaims
         6  page faults
         0  swaps
     20866  block input operations
      6362  block output operations
         0  messages sent
         0  messages received
         4  signals received
     21852  voluntary context switches
      3121  involuntary context switches

# time -l rm -fr work
       12.83 real         0.37 user         8.17 sys
      3916  maximum resident set size
         7  average shared memory size
         7  average unshared data size
       127  average unshared stack size
       934  page reclaims
         0  page faults
         0  swaps
         5  block input operations
       348  block output operations
         0  messages sent
         0  messages received
         0  signals received
       581  voluntary context switches
       736  involuntary context switches

13.0-BETA4 looks a somewhat better for this on the extract.
Of course, all my examples are SSDs or Optanes, nothing
spinning. I'm unsure how that or other context variations
would compare. But now you have some examples for
comparison.

The early Feb. main (14) had much bigger differences between
user+sys vs. real than 13.0-BETA4 or 12.2-RELEASE did. But
that may well be expected given the huge "block output
operations" count differences.

It does not appear that I've managed to run into the
13.0-BETA4 problem(s) via these specific types of
experiments for UFS2+SU (no J).
Comment 9 Mark Millard 2021-03-04 01:29:19 UTC
(In reply to Mark Millard from comment #8)

I updated the ThreadRipper to be based on c113740f266e
(2021-Mar-03):

# ~/fbsd-based-on-what-freebsd-main.sh 
merge-base: c113740f266ecfbe4a34c97607f860d63c08eb76
merge-base: CommitDate: 2021-03-03 22:01:04 +0000
0e72c127fce0 (HEAD -> mm-src) mm-src snapshot for mm's patched build in git context.
c113740f266e (pure-src) git-arc.1: Fix synopsis for the "update" verb
FreeBSD FBSDFHUGE 14.0-CURRENT FreeBSD 14.0-CURRENT mm-src-n245244-0e72c127fce0 GENERIC-NODBG  amd64 amd64 1400005 1400005

And things are much better for the UFS2+SU (not J) test:

# time -l make extract
===> Options unchanged
===>   firefox-85.0.2,2 depends on file: /usr/local/sbin/pkg - found
===> Fetching all distfiles required by firefox-85.0.2,2 for building
===>  Extracting for firefox-85.0.2,2
=> SHA256 Checksum OK for firefox-85.0.2.source.tar.xz.
       34.08 real        23.09 user        10.73 sys
     77260  maximum resident set size
        50  average shared memory size
        11  average unshared data size
       128  average unshared stack size
     37271  page reclaims
       496  page faults
         0  swaps
      9214  block input operations
       801  block output operations
        83  messages sent
        79  messages received
         0  signals received
     11598  voluntary context switches
      1656  involuntary context switches

# time -l rm -fr /wrkdirs/usr/ports/www
        3.30 real         0.09 user         3.21 sys
      4180  maximum resident set size
         8  average shared memory size
         4  average unshared data size
       128  average unshared stack size
       603  page reclaims
         0  page faults
         0  swaps
         1  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
        20  voluntary context switches
        35  involuntary context switches
Comment 10 Christos Chatzaras 2021-03-04 13:06:25 UTC
FreeBSD 12.2

/usr/bin/time -l rm -fr /usr/ports

12.2 1st run:

        8.89 real         0.36 user         1.28 sys
      5652  maximum resident set size
         7  average shared memory size
         3  average unshared data size
       126  average unshared stack size
       962  page reclaims
         0  page faults
         0  swaps
     38627  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
     38629  voluntary context switches
         0  involuntary context switches

12.2 2nd run:

        5.04 real         0.31 user         1.98 sys
      5652  maximum resident set size
         7  average shared memory size
         3  average unshared data size
       126  average unshared stack size
       952  page reclaims
         0  page faults
         0  swaps
     13492  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
     13502  voluntary context switches
        17  involuntary context switches

---------------

FreeBSD 13.0 (from 3-Mar-2021 source)

/usr/bin/time -l rm -fr /usr/ports

1st run:

       12.29 real         0.68 user         2.63 sys
      5580  maximum resident set size
         8  average shared memory size
         4  average unshared data size
       129  average unshared stack size
       956  page reclaims
         0  page faults
         0  swaps
     40039  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
     40041  voluntary context switches
         1  involuntary context switches

2rd run:

        6.66 real         0.42 user         2.14 sys
      5576  maximum resident set size
         8  average shared memory size
         4  average unshared data size
       135  average unshared stack size
       955  page reclaims
         0  page faults
         0  swaps
     15705  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
     15713  voluntary context switches
        10  involuntary context switches
Comment 11 Christos Chatzaras 2021-03-04 13:27:37 UTC
FreeBSD 12.2

/usr/bin/time -l portsnap extract

1st run:

      126.12 real        34.99 user        59.58 sys
     49040  maximum resident set size
        21  average shared memory size
         3  average unshared data size
        87  average unshared stack size
  11820155  page reclaims
         1  page faults
         0  swaps
     25079  block input operations
      2780  block output operations
         0  messages sent
         0  messages received
         0  signals received
    379570  voluntary context switches
       240  involuntary context switches

2nd run:

       99.90 real        35.29 user        59.80 sys
     49040  maximum resident set size
        21  average shared memory size
         2  average unshared data size
        86  average unshared stack size
  11820155  page reclaims
         0  page faults
         0  swaps
        27  block input operations
      2782  block output operations
         0  messages sent
         0  messages received
         0  signals received
    354645  voluntary context switches
       370  involuntary context switches

---------------

FreeBSD 13.0 (from 3-Mar-2021 source)

/usr/bin/time -l portsnap extract

1st run:

      210.68 real        73.30 user       110.37 sys
     48952  maximum resident set size
        21  average shared memory size
         3  average unshared data size
        90  average unshared stack size
  12287686  page reclaims
         1  page faults
         0  swaps
     27790  block input operations
       658  block output operations
      4412  messages sent
   2536294  messages received
         0  signals received
    382195  voluntary context switches
       292  involuntary context switches

2rd run:

      170.19 real        70.89 user       107.19 sys
     48952  maximum resident set size
        21  average shared memory size
         3  average unshared data size
        90  average unshared stack size
  12287686  page reclaims
         0  page faults
         0  swaps
      1405  block input operations
       176  block output operations
      4412  messages sent
   2536294  messages received
         0  signals received
    355852  voluntary context switches
       399  involuntary context switches
Comment 12 Christos Chatzaras 2021-03-04 13:30:57 UTC
With 13.0-BETA2 "rm -fr" was many times slower than 12.2. With 13.0-BETA4+ (with 3-Mar-2021 source) is a lot better but still slower than 12.2.
Comment 13 Mark Millard 2021-03-04 21:02:13 UTC
Being unable to replicate anything similar to comment #4 and
comment #11 being a large increase in Inter-Process-Communication
and the rest of the rather-modern-FreeBSD examples being UFS SU
(no J) that do not show significant problems, then it would
appear that the UFS+SU+J (so with journaling) status of the
original report may be an essentially separate issue from other
fairly recent UFS I/O performance problems that seem to have
been addressed.

I do not plan to report other examples that try to isolate the
context vs. recent history. It seems that only journaled
experiments would likely be relevant to the epecific issue
being reported.

Sorry it took me so long to reach that conclusion.

The comment #11's IPC operation counts in doing a portsnap
extract should probably be a separate submittal.

I'm less clear on comment #4 vs. some context local issue
(say, file system fragmentation just to illustrate). But
if it can be replicated by others, it too should probably
have a separate submittal.
Comment 14 Christos Chatzaras 2021-03-04 21:22:42 UTC
(In reply to Mark Millard from comment #13)

Yes, I have SU without journal so it may be different issue.

I made some more tests with 12.2, 13.0 and CURRENT which I will post in the "stable" mailing list.

The summary is this:

12.2p4 = 98 sec
13.0-BETA2 = 497 sec
13.0-BETA4 = 163 sec
14-CURRENT = 255 sec