Bug 207022

Summary: stat() 4 to 16 times slower on FreeBSD/UFS than on Linux/Ext4
Product: Base System
Reporter: Fabiano Sidler <freebsd-bugs>
Component: kern
Assignee: freebsd-fs (Nobody) <fs>
Status: New
Severity: Affects Many People
CC: dchagin, emaste, rob2g2-freebsd, smh
Priority: ---
Keywords: performance
Version: 10.2-RELEASE
Hardware: amd64
OS: Any

Attachments:
- graph (flags: none)
- php-cgi graph (flags: none)

Description Fabiano Sidler 2016-02-08 15:57:36 UTC
I have a TYPO3 instance that runs 4 to 16 times slower under FreeBSD/UFS
than under Linux/Ext4. Using XDebug I found that the stat() syscall
accounts for the largest share of the time.
I've already tweaked
vfs.ufs.dirhash_maxmem=104857600
vfs.ufs.dirhash_reclaimage=600
The UFS dirhash memory always has plenty of space left, so this can't be the
bottleneck. With the following DTrace script I further found that, inside the
kernel, the spinlock_exit and rdtsc functions take a huge part of the time:
=== snip ===
BEGIN
{
	self->ts = timestamp;
}

fbt:::entry
{
	printf("fbt enter %s\t%d\n", probefunc, timestamp-self->ts);
	self->ts = timestamp;
}
=== snap ===

dtrace -qs teststat.d -c ./cstat | sort -nk4 | less
=== snip ===
fbt enter rdtsc 11744696324432
fbt enter rdtsc 11744696324802
fbt enter rdtsc 11744696324810
fbt enter rdtsc 11744696325033
fbt enter rdtsc 11744696325155
fbt enter rdtsc 11744696428364
fbt enter spinlock_exit 11744696510486
fbt enter spinlock_exit 11744697160417
fbt enter spinlock_exit 11744697807952
fbt enter spinlock_exit 11744704378856
fbt enter spinlock_exit 11744704699253
fbt enter spinlock_exit 11744716098379
fbt enter spinlock_exit 11744718145218
fbt enter spinlock_exit 11744750135790
fbt enter spinlock_exit 11744751498167
fbt enter spinlock_exit 11744767477782
fbt enter spinlock_exit 11744771142875
fbt enter spinlock_exit 11744784852745
fbt enter spinlock_exit 11744797873176
fbt enter spinlock_exit 11744804495582
fbt enter spinlock_exit 11744912093185
fbt enter spinlock_exit 11744981165023
=== snap ===

./cstat is a C program that simply stat()s a few thousand existing files.
Is this behaviour intentional? What can I do to close the gap to Linux? It is
also important to say that this is not an academic exercise: a client has
announced they will leave for a Linux-based server.
The Linux tests were, of course, run on absolutely identical hardware.

Hopeful greetings
Fabiano
Comment 1 Steven Hartland freebsd_committer freebsd_triage 2016-02-08 17:37:09 UTC
do you have the source cstat?

What's the output from:
sysctl kern.timecounter

What's your hardware spec and is it real or VM?
Comment 2 Fabiano Sidler 2016-02-08 20:19:06 UTC
(In reply to Steven Hartland from comment #1)

Hi Steven! Thanks for replying so fast! This is the cstat source:
=== snip ===
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
	struct stat buf;
	stat("/home/user/file1", &buf);
	/* 4198 stat calls */
	stat("/home/user/file4200", &buf);
}
=== snap ===

[root@server ~]# sysctl kern.timecounter
kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 1
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: TSC-low(-100) ACPI-fast(900) i8254(0) HPET(950) dummy(-1000000)
kern.timecounter.hardware: HPET
kern.timecounter.alloweddeviation: 5
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.TSC-low.quality: -100
kern.timecounter.tc.TSC-low.frequency: 1198612000
kern.timecounter.tc.TSC-low.counter: 1461596423
kern.timecounter.tc.TSC-low.mask: 4294967295
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.counter: 16606739
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.counter: 4783
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.HPET.quality: 950
kern.timecounter.tc.HPET.frequency: 14318180
kern.timecounter.tc.HPET.counter: 1292503749
kern.timecounter.tc.HPET.mask: 4294967295

[root@sbach ~]# sysctl hw.model hw.ncpu hw.physmem hw.usermem
hw.model: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
hw.ncpu: 8
hw.physmem: 34334208000
hw.usermem: 33838137344

Both boxes are ESXi guests.
Comment 3 Steven Hartland freebsd_committer freebsd_triage 2016-02-08 22:28:26 UTC
It's a bit of a long shot, but can you try setting the timecounter to TSC and see if it makes any difference?

sysctl kern.timecounter.hardware=TSC-Low
Comment 4 Fabiano Sidler 2016-02-09 08:03:50 UTC
No, it doesn't make a difference, but interestingly, now the last line of the output is something like:
fbt enter spinlock_exit	70396867793752
rather than
fbt enter spinlock_exit 11744981165023
with both the HPET and the TSC-low (lowercase ell) settings of kern.timecounter.hardware, even after a reboot.
Comment 5 Andriy Gapon freebsd_committer freebsd_triage 2016-02-09 09:29:05 UTC
(In reply to Fabiano Sidler from comment #0)
The script looks very strange.  What do you expect it to do?
Comment 6 Fabiano Sidler 2016-02-09 09:46:16 UTC
(In reply to Andriy Gapon from comment #5)

Originally, the script was like so:
=== snip ===
fbt:::entry
{
    entry_times[probefunc] = timestamp;
    printf("fbt enter %s\n", probefunc);
}

fbt:::return
{
    printf("fbt return %s\t%d\n", probefunc, timestamp-entry_times[probefunc]);
}
=== snap ===

But unexpectedly, the returns weren't in reverse call order, so now I'm simply printing the time since the last function call.
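(For what it's worth, per-function timing in DTrace is usually done with a thread-local timestamp keyed by function name plus an aggregation, which sidesteps the call-order problem; a sketch:)

```d
fbt:::entry
{
	self->ts[probefunc] = timestamp;
}

fbt:::return
/self->ts[probefunc]/
{
	@total[probefunc] = sum(timestamp - self->ts[probefunc]);
	self->ts[probefunc] = 0;
}

END
{
	/* keep only the 20 most expensive functions */
	trunc(@total, 20);
	printa(@total);
}
```

Recursive calls still overwrite the saved timestamp, but for a stat()-heavy workload this gives a usable per-function total.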
Comment 7 Fabiano Sidler 2016-02-09 14:52:08 UTC
Following http://www.brendangregg.com/blog/2015-03-10/freebsd-flame-graphs.html I've done this:
dtrace -x stackframes=100 -n 'profile-197 /arg0/ { @[stack()] = count(); }' -c ./cstat -o out.stacks
./stackcollapse.pl out.stacks | ./flamegraph.pl > out.svg
uploaded to http://imgh.us/out_4.svg

=== snip out.stacks ===


              kernel`cache_lookup+0x21c
              kernel`namei+0x4d4
              kernel`kern_statat_vnhook+0xae
              kernel`sys_stat+0x2d
              kernel`amd64_syscall+0x357
              kernel`0xffffffff80d30a8b
                1

              kernel`cache_lookup+0x53f
              kernel`namei+0x4d4
              kernel`kern_statat_vnhook+0xae
              kernel`sys_stat+0x2d
              kernel`amd64_syscall+0x357
              kernel`0xffffffff80d30a8b
                1

              kernel`acpi_cpu_c1+0x6
              kernel`sched_idletd+0x1d5
              kernel`fork_exit+0x9a
              kernel`0xffffffff80d30cde
               14
=== snap out.stacks ===

I don't see anything helpful in it, but maybe someone else does?
Comment 8 Fabiano Sidler 2016-02-09 16:31:49 UTC
Created attachment 166811 [details]
graph
Comment 9 Fabiano Sidler 2016-02-09 16:33:02 UTC
Dtraced again, with:
dtrace -x ustackframes=100 -n 'profile-99 { @[ustack()] = count(); } tick-60s { exit(0); }'
Very weird graph, actually.
Comment 10 Steven Hartland freebsd_committer freebsd_triage 2016-02-09 18:38:12 UTC
Second graph is not for the process in question, so not much use.
Comment 11 Fabiano Sidler 2016-02-09 23:08:42 UTC
Created attachment 166819 [details]
php-cgi graph
Comment 12 Fabiano Sidler 2016-02-09 23:14:06 UTC
Dtraced again, with:
dtrace -x ustackframes=100 -n 'profile-99 /uid == 1042/ { @[stack()] = count(); } tick-600s { exit(0); }'
There are no other processes running with uid 1042.
There aren't that many HTTP requests anymore, it's 00:13 AM local time.
Comment 13 Steven Hartland freebsd_committer freebsd_triage 2016-02-10 01:28:37 UTC
How much time is this really using? Testing here with:
== gen ==
perl -e 'for (1..4200) {my $file = "file$_";open(my $fh, ">", $file) or die "cannot open > $file: $!";}'
/== gen ==

== bin ==
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <err.h>
#include <sysexits.h>

int main(int argc, char* argv[])
{
        struct stat buf;
        char name[1024];
        int i;

        for (i = 1; i <= 4200; i++) {
                sprintf(name, "file%d", i);
                if (stat(name, &buf) != 0) {
                        printf("Tested %d files\n", i - 1);
                        err(EX_OSERR, NULL);
                }
        }
        printf("Tested %d files\n", i - 1);
}
/== bin ==
I'm seeing:
/usr/bin/time ./test
Tested 4201 files
        0.00 real         0.00 user         0.00 sys
Comment 14 Fabiano Sidler 2016-02-10 08:32:01 UTC
Hi Steven! Thank you for your help! I'm seeing the same:
/usr/bin/time ./test
Tested 4200 files
        0.00 real         0.00 user         0.00 sys

The interesting thing is that in my cstat program the files are NOT actually all in the same directory. Could this be the issue? If so, the directory tree traversal would need to be sped up somehow.
Comment 15 Dmitry Chagin freebsd_committer freebsd_triage 2023-11-06 18:49:49 UTC
It would be nice to check this on 14.0, given mjg@'s VFS/namecache improvements.
Comment 16 Ed Maste freebsd_committer freebsd_triage 2023-11-06 19:58:32 UTC
(In reply to Dmitry Chagin from comment #15)
In particular,

> I did some benching by running 20 processes in parallel, each doing stat
> on a tree of 1 million files (one tree per proc, 1000 dirs x 1000 files,
> so 20 mln inodes in total).  Box had 24 cores and 24G RAM.
> 
> Best times:
> Linux:          7.60s user 1306.90s system 1863% cpu 1:10.55 total
> FreeBSD:        3.49s user 345.12s system 1983% cpu 17.573 total
> OpenBSD:        5.01s user 6463.66s system 2000% cpu 5:23.42 total
> DragonflyBSD:   11.73s user 1316.76s system 1023% cpu 2:09.78 total
> OmniosCE:       9.17s user 516.53s system 1550% cpu 33.905 total

From https://lore.kernel.org/linux-fsdevel/20231019155958.7ek7oyljs6y44ah7@f/
Comment 17 Dmitry Chagin freebsd_committer freebsd_triage 2023-11-07 08:20:58 UTC
(In reply to Ed Maste from comment #16)
Well, AFAIK mjg@'s FreeBSD tests were on ZFS, while here it's UFS vs Ext4; the UFS results, however, might be even better.