Bug 199587

Summary: libc strncmp() performance
Product: Base System Reporter: blackngel <blackngel1>
Component: miscAssignee: freebsd-bugs (Nobody) <bugs>
Status: New ---    
Severity: Affects Many People CC: emaste
Priority: ---    
Version: 10.1-STABLE   
Hardware: Any   
OS: Any   

Description blackngel 2015-04-21 17:52:23 UTC
I've been tinkering with the code, out of curiosity, and I've reimplemented strncmp() to check the performance. Here are my results and the benchmark code:

         42359800221 cycles -- FreeBSD strncmp()
         42113090043 cycles -- FreeBSD strncmp()
         42145558182 cycles -- FreeBSD strncmp()
         39479522958 cycles -- My Implementation
         39447595998 cycles -- My Implementation
         39445708599 cycles -- My Implementation

My implementation always runs a bit faster and seems more clean.

--------------

#include <stdio.h>

unsigned long long
rdtsc(void)
{
        unsigned int low, hi;
        __asm__ volatile(
            "rdtsc\n"
            : "=a"(low), "=d"(hi)
            :
            : "memory");
        return ((unsigned long long)hi << 32) | (unsigned long long) low;
}

int
strncmpnew(const char *p, const char *q, size_t n)
{
        while (n > 0 && *p && (*p == *q))
                n--, p++, q++;

        return (n == 0) ? 0: ((const unsigned char)*p - (const unsigned char)*q);
}

int
strncmpfbsd(const char *s1, const char *s2, size_t n)
{

	if (n == 0)
		return (0);
	do {
		if (*s1 != *s2++)
			return (*(const unsigned char *)s1 -
				*(const unsigned char *)(s2 - 1));
		if (*s1++ == '\0')
			break;
	} while (--n != 0);
	return (0);
}

void
test(int (*f)(const char *, const char *, unsigned int n), unsigned int iters, char *msg)
{
        char *str1 = "abcdefghijklmnopqrstuvwxyz0123456789A";
        char *str2 = "abcdefghijklmnopqrstuvwxyz0123456789A";
        unsigned long foo = 0;

        unsigned long long int st = rdtsc();
        unsigned int i;
        for (i = 0; i < iters; i++)
                foo += f(str1, str2, 37);
        unsigned long long int tot = rdtsc() - st;
        printf("%20llu cycles -- %s\n", tot, msg);
        asm volatile( "" :: "g"(foo) : "memory");
}

int
main(void)
{
        unsigned int iters = 100000000;
        
        test(strncmpfbsd, iters, "FreeBSD strncmp()");
        test(strncmpfbsd, iters, "FreeBSD strncmp()");
        test(strncmpfbsd, iters, "FreeBSD strncmp()");
        test(strncmpnew, iters, "My Implementation");
        test(strncmpnew, iters, "My Implementation");
        test(strncmpnew, iters, "My Implementation");
        getchar();
        
        return 0;
}

--------------
Comment 1 Eitan Adler freebsd_committer freebsd_triage 2015-04-27 16:17:48 UTC
(adding to the bug)

From bde:

This is basically confusing the compiler to produce not so good code
in a different way.

Your implementation is a bit cleaner since it doesn't arrange the source
code in a way that it thinks will be good for the object code.  This
results in it being slower for old compilers, faster for some in-between
compilers, and no different for new compilers.  However, all the C versions
are now faster than the asm versions on amd64 and i386 on 2 i7 CPUs.  I
added tests for the latter, and sprinkled some volatiles to stop the
compiler optimizing away the whole loop for the asm (libc) versions.

i386, 4790K @ 4.28GHz:
    gcc-3.3.3 -O (but no -march etc. complications):
        10.0 Gcycles  --  libc strncmp() (asm source, external linkage)
        10.1 Gcycles  --  libc strncmp() (copy of the C version)
        11.3 Gcycles  --  My Implementation

    gcc-3.3.3 -O2:
        12.0 Gcycles  --  libc strncmp() (asm source, external linkage)
         9.4 Gcycles  --  libc strncmp() (copy of the C version)
        10.2 Gcycles  --  My Implementation
    libc asm strncmp() really was made 20% slower by increasing the
    optimization level from -O to -O2, although strncmp() itself didn't
    change.  This might be due to the loop being poorly aligned.
    Tuning with -march might be needed to avoid 20% differences, so the
    mere 10% differences in these tests might be noise.  (I didn't bother
    giving many data data points, since nose from rerunning the tests is
    much smaller than 10-20% differences from tuning.)

    gcc-4.2.1 -O:
        11.4 Gcycles  --  libc strncmp() (asm source, external linkage)
        13.1 Gcycles  --  libc strncmp() (copy of the C version)
        12.1 Gcycles  --  My Implementation
    gcc-4.2.1 -O is much slower than gcc-3.3.3, but not so bad for your
    implementation.

    gcc-4.2.1 -O2:
        10.1 Gcycles  --  libc strncmp() (asm source, external linkage)
         9.5 Gcycles  --  libc strncmp() (copy of the C version)
         9.3 Gcycles  --  My Implementation
    gcc-4.2.1 is OK.

amd64, Xeon 5650 @ 2.67GHz:
    clang -O:
    The calls to *strcmp() were almost all optimized away.  I fixed
    this by replacing str1 in the call to str1 + v, where v is a
    volatile int with value 0.
        13.8 Gcycles  --  libc strncmp() (C source, external linkage)
        13.8 Gcycles  --  libc strncmp() (copy of the C version)
        13.8 Gcycles  --  My Implementation
    libc asm strncmp() is of interest here although it doesn't exist --
    if it existed, then it would be more bogus that on i386, since amd64
    doesn't run on the 1990 modem CPUs where the asm version was probably
    faster.  The asm i386 version as tuned for original i386's and barely
    changed since then.  Just as well, since it would be very messy with
    tuning for 10-20 generations of CPUs with several classes of CPU per
    generation.  amd64 libc string functions used to be missing all silly
    optimizations like this, but now optimizes the almost-never-used
    function stpcpy(), and its asm versions of strcat() and strcmp()
    are probably mistakes too.

i386, Xeon 5650 @ 2.67GHz:
    clang -O [-march=native makes no difference]
        12.0 Gcycles  --  libc strncmp() (asm source, external linkage)
        15.1 Gcycles  --  libc strncmp() (copy of the C version)
        11.5 Gcycles  --  My Implementation
    clang is even more confused by the copy of libc C strncmp() than
    gcc-4.2.1.

Bruce