Bug 207022 - stat() 4 to 16 times slower on FreeBSD/UFS than on Linux/Ext4
Summary: stat() 4 to 16 times slower on FreeBSD/UFS than on Linux/Ext4
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 10.2-RELEASE
Hardware: amd64 Any
: --- Affects Many People
Assignee: freebsd-fs (Nobody)
URL:
Keywords: performance
Depends on:
Blocks:
 
Reported: 2016-02-08 15:57 UTC by Fabiano Sidler
Modified: 2020-07-11 19:15 UTC (History)
2 users (show)

See Also:


Attachments
graph (499.91 KB, image/svg+xml)
2016-02-09 16:31 UTC, Fabiano Sidler
no flags Details
php-cgi graph (255.84 KB, image/svg+xml)
2016-02-09 23:08 UTC, Fabiano Sidler
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Fabiano Sidler 2016-02-08 15:57:36 UTC
I happen a TYPO3 instance under FreeBSD/UFS running 4 to 16 times slower
than on Linux/Ext4. Using XDebug I've found out that it's the stat() syscall
spending the biggest part of the time.
I've already tweaked
vfs.ufs.dirhash_maxmem=104857600
vfs.ufs.dirhash_reclaimage=600
The UFS dirhash memory always has plenty space left so this can't be the
bottleneck. I further found out with the following Dtrace script that inside
the Kernel, the spinlock_exit and rdtsc functions take a huge part of the
time:
=== snip ===
BEGIN
{
	self->ts = timestamp;
}

fbt:::entry
{
	printf("fbt enter %s\t%d\n", probefunc, timestamp-self->ts);
	self->ts = timestamp;
}
=== snap ===

dtrace -qs teststat.d -c ./cstat | sort -nk4 | less
=== snip ===
fbt enter rdtsc 11744696324432
fbt enter rdtsc 11744696324802
fbt enter rdtsc 11744696324810
fbt enter rdtsc 11744696325033
fbt enter rdtsc 11744696325155
fbt enter rdtsc 11744696428364
fbt enter spinlock_exit 11744696510486
fbt enter spinlock_exit 11744697160417
fbt enter spinlock_exit 11744697807952
fbt enter spinlock_exit 11744704378856
fbt enter spinlock_exit 11744704699253
fbt enter spinlock_exit 11744716098379
fbt enter spinlock_exit 11744718145218
fbt enter spinlock_exit 11744750135790
fbt enter spinlock_exit 11744751498167
fbt enter spinlock_exit 11744767477782
fbt enter spinlock_exit 11744771142875
fbt enter spinlock_exit 11744784852745
fbt enter spinlock_exit 11744797873176
fbt enter spinlock_exit 11744804495582
fbt enter spinlock_exit 11744912093185
fbt enter spinlock_exit 11744981165023
=== snap ===

./cstat is a C program simply stat()ing some 1000 of existing files.
Is this on purpose? What can I do to close the gap on Linux? Also, it's
important to say that this is not an academic exercise! There is a client
announcing to leave for a Linux-based server.
The Linux tests were of course taken on absolutely identical hardware.

Hopeful greetings
Fabiano
Comment 1 Steven Hartland freebsd_committer 2016-02-08 17:37:09 UTC
do you have the source cstat?

What's the output from:
sysctl kern.timecounter

What's your hardware spec and is it real or VM?
Comment 2 Fabiano Sidler 2016-02-08 20:19:06 UTC
(In reply to Steven Hartland from comment #1)

Hi Steven! Thanks for replying so fast! This is the cstat source:
=== snip ===
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
	struct stat buf;
	stat("/home/user/file1", &buf);
	/* 4198 stat calls */
	stat("/home/user/file4200", &buf);
}
=== snap ===

[root@server ~]# sysctl kern.timecounter
kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 1
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: TSC-low(-100) ACPI-fast(900) i8254(0) HPET(950) dummy(-1000000)
kern.timecounter.hardware: HPET
kern.timecounter.alloweddeviation: 5
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.TSC-low.quality: -100
kern.timecounter.tc.TSC-low.frequency: 1198612000
kern.timecounter.tc.TSC-low.counter: 1461596423
kern.timecounter.tc.TSC-low.mask: 4294967295
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.counter: 16606739
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.counter: 4783
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.HPET.quality: 950
kern.timecounter.tc.HPET.frequency: 14318180
kern.timecounter.tc.HPET.counter: 1292503749
kern.timecounter.tc.HPET.mask: 4294967295

[root@sbach ~]# sysctl hw.model hw.ncpu hw.physmem hw.usermem
hw.model: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
hw.ncpu: 8
hw.physmem: 34334208000
hw.usermem: 33838137344

Both boxes are ESXi guests.
Comment 3 Steven Hartland freebsd_committer 2016-02-08 22:28:26 UTC
Its a bit of a long shot, but can you try setting the timecounter to TSC and see if it makes any difference?

sysctl kern.timecounter.hardware=TSC-Low
Comment 4 Fabiano Sidler 2016-02-09 08:03:50 UTC
No, it doesn't make a difference, but interestingly, now the last line of the output is something like:
fbt enter spinlock_exit	70396867793752
rather than
fbt enter spinlock_exit 11744981165023
with both the HPET and the TSC-low (small Ell) kern.timecounter.hardware, even after a reboot.
Comment 5 Andriy Gapon freebsd_committer 2016-02-09 09:29:05 UTC
(In reply to Fabiano Sidler from comment #0)
The script looks very strange.  What do you expect it to do?
Comment 6 Fabiano Sidler 2016-02-09 09:46:16 UTC
(In reply to Andriy Gapon from comment #5)

Originally, the script was like so:
=== snip ===
fbt:::entry
{
    entry_times[probefunc] = timestamp;
    printf("fbt enter %s\n", probefunc);
}

fbt:::return
{
    printf("fbt return %s\t%d\n", probefunc, timestamp-entry_times[probefunc]);
}
=== snap ===

But unexpectedly, the returns weren't in reverse call order, so I'm simply printing the time since last function call.
Comment 7 Fabiano Sidler 2016-02-09 14:52:08 UTC
Following http://www.brendangregg.com/blog/2015-03-10/freebsd-flame-graphs.html I've done this:
dtrace -x stackframes=100 -n 'profile-197 /arg0/ { @[stack()] = count(); }' -c ./cstat -o out.stacks
./stackcollapse.pl out.stacks | ./flamegraph.pl > out.svg
uploaded to http://imgh.us/out_4.svg

=== snip out.stacks ===


              kernel`cache_lookup+0x21c
              kernel`namei+0x4d4
              kernel`kern_statat_vnhook+0xae
              kernel`sys_stat+0x2d
              kernel`amd64_syscall+0x357
              kernel`0xffffffff80d30a8b
                1

              kernel`cache_lookup+0x53f
              kernel`namei+0x4d4
              kernel`kern_statat_vnhook+0xae
              kernel`sys_stat+0x2d
              kernel`amd64_syscall+0x357
              kernel`0xffffffff80d30a8b
                1

              kernel`acpi_cpu_c1+0x6
              kernel`sched_idletd+0x1d5
              kernel`fork_exit+0x9a
              kernel`0xffffffff80d30cde
               14
=== snap out.stacks ===

I don't see anything helpful in it, but maybe someone else does?
Comment 8 Fabiano Sidler 2016-02-09 16:31:49 UTC
Created attachment 166811 [details]
graph
Comment 9 Fabiano Sidler 2016-02-09 16:33:02 UTC
Dtraced again, with:
dtrace -x ustackframes=100 -n 'profile-99 { @[ustack()] = count(); } tick-60s { exit(0); }'
Very weird graph, actually.
Comment 10 Steven Hartland freebsd_committer 2016-02-09 18:38:12 UTC
Second graph is not for the process in question, so not much use.
Comment 11 Fabiano Sidler 2016-02-09 23:08:42 UTC
Created attachment 166819 [details]
php-cgi graph
Comment 12 Fabiano Sidler 2016-02-09 23:14:06 UTC
Dtraced again, with:
dtrace -x ustackframes=100 -n 'profile-99 /uid == 1042/ { @[stack()] = count(); } tick-600s { exit(0); }'
There are no other processes running with uid 1042.
There aren't that many HTTP requests anymore, it's 00:13 AM local time.
Comment 13 Steven Hartland freebsd_committer 2016-02-10 01:28:37 UTC
How much time is this really using when testing here with
== gen ==
perl -e 'for (1..4200) {my $file = "file$_";open(my $fh, ">", $file) or die "cannot open > $file: $!";}'
/== gen ==

== bin ==
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <err.h>
#include <sysexits.h>

int main(int argc, char* argv[])
{
        struct stat buf;
        char name[1024];
        int i;

        for (i = 1; i <= 4200; i++) {
                sprintf(name, "file%d", i);
                if (stat(name, &buf) != 0) {
                        printf("Tested %d files\n", i - 1);
                        err(EX_OSERR, NULL);
                }
        }
        printf("Tested %d files\n", i - 1);
}
/== bin ==
I'm seeing:
/usr/bin/time ./test
Tested 4201 files
        0.00 real         0.00 user         0.00 sys
Comment 14 Fabiano Sidler 2016-02-10 08:32:01 UTC
Hi Steven! Thank you for your help! I'm seeing the same:
/usr/bin/time ./test
Tested 4200 files
        0.00 real         0.00 user         0.00 sys

The interesting thing is that in my cstat program, the files are NOT actually all in the same directory. Could this be the issue? Then the directory tree traversal needed to be speeded up somehow.