I've been tinkering with the code out of curiosity, and I've reimplemented strncmp() to check the performance. Here are my results and the benchmark code:

42359800221 cycles -- FreeBSD strncmp()
42113090043 cycles -- FreeBSD strncmp()
42145558182 cycles -- FreeBSD strncmp()
39479522958 cycles -- My Implementation
39447595998 cycles -- My Implementation
39445708599 cycles -- My Implementation

My implementation consistently runs a bit faster and seems cleaner.

--------------
#include <stddef.h>
#include <stdio.h>

unsigned long long rdtsc(void)
{
	unsigned int low, hi;

	__asm__ volatile("rdtsc\n" : "=a"(low), "=d"(hi) : : "memory");
	return ((unsigned long long)hi << 32) | (unsigned long long)low;
}

int strncmpnew(const char *p, const char *q, size_t n)
{
	while (n > 0 && *p && (*p == *q))
		n--, p++, q++;
	return (n == 0) ? 0 :
	    (*(const unsigned char *)p - *(const unsigned char *)q);
}

int strncmpfbsd(const char *s1, const char *s2, size_t n)
{
	if (n == 0)
		return (0);
	do {
		if (*s1 != *s2++)
			return (*(const unsigned char *)s1 -
			    *(const unsigned char *)(s2 - 1));
		if (*s1++ == '\0')
			break;
	} while (--n != 0);
	return (0);
}

void test(int (*f)(const char *, const char *, size_t), unsigned int iters,
    const char *msg)
{
	const char *str1 = "abcdefghijklmnopqrstuvwxyz0123456789A";
	const char *str2 = "abcdefghijklmnopqrstuvwxyz0123456789A";
	unsigned long foo = 0;
	unsigned long long st = rdtsc();
	unsigned int i;

	for (i = 0; i < iters; i++)
		foo += f(str1, str2, 37);
	unsigned long long tot = rdtsc() - st;
	printf("%20llu cycles -- %s\n", tot, msg);
	asm volatile("" :: "g"(foo) : "memory");
}

int main(void)
{
	unsigned int iters = 100000000;

	test(strncmpfbsd, iters, "FreeBSD strncmp()");
	test(strncmpfbsd, iters, "FreeBSD strncmp()");
	test(strncmpfbsd, iters, "FreeBSD strncmp()");
	test(strncmpnew, iters, "My Implementation");
	test(strncmpnew, iters, "My Implementation");
	test(strncmpnew, iters, "My Implementation");
	getchar();
	return 0;
}
--------------
(adding to the bug) From bde:

This is basically confusing the compiler into producing not-so-good code in a different way. Your implementation is a bit cleaner, since it doesn't arrange the source code in a way that it thinks will be good for the object code. This makes it slower for old compilers, faster for some in-between compilers, and no different for new compilers. However, all the C versions are now faster than the asm versions on amd64 and i386 on 2 i7 CPUs. I added tests for the latter, and sprinkled some volatiles to stop the compiler optimizing away the whole loop for the asm (libc) versions.

i386, 4790K @ 4.28GHz:

gcc-3.3.3 -O (but no -march etc. complications):
10.0 Gcycles -- libc strncmp() (asm source, external linkage)
10.1 Gcycles -- libc strncmp() (copy of the C version)
11.3 Gcycles -- My Implementation

gcc-3.3.3 -O2:
12.0 Gcycles -- libc strncmp() (asm source, external linkage)
 9.4 Gcycles -- libc strncmp() (copy of the C version)
10.2 Gcycles -- My Implementation

The libc asm strncmp() really was made 20% slower by increasing the optimization level from -O to -O2, although strncmp() itself didn't change. This might be due to the loop being poorly aligned. Tuning with -march might be needed to avoid 20% differences, so the mere 10% differences in these tests might be noise. (I didn't bother giving many data points, since noise from rerunning the tests is much smaller than the 10-20% differences from tuning.)

gcc-4.2.1 -O:
11.4 Gcycles -- libc strncmp() (asm source, external linkage)
13.1 Gcycles -- libc strncmp() (copy of the C version)
12.1 Gcycles -- My Implementation

gcc-4.2.1 -O is much slower than gcc-3.3.3, but not so bad for your implementation.

gcc-4.2.1 -O2:
10.1 Gcycles -- libc strncmp() (asm source, external linkage)
 9.5 Gcycles -- libc strncmp() (copy of the C version)
 9.3 Gcycles -- My Implementation

gcc-4.2.1 is OK.

amd64, Xeon 5650 @ 2.67GHz:

clang -O: The calls to *strcmp() were almost all optimized away.
I fixed this by replacing str1 in the call with str1 + v, where v is a volatile int with value 0.

13.8 Gcycles -- libc strncmp() (C source, external linkage)
13.8 Gcycles -- libc strncmp() (copy of the C version)
13.8 Gcycles -- My Implementation

A libc asm strncmp() would be of interest here, although it doesn't exist -- if it existed, it would be even more bogus than on i386, since amd64 doesn't run on the 1990s CPUs for which the asm version was probably faster. The i386 asm version was tuned for original i386's and has barely changed since then. Just as well, since tuning it for 10-20 generations of CPUs, with several classes of CPU per generation, would be very messy. The amd64 libc string functions used to be free of silly optimizations like this, but libc now optimizes the almost-never-used function stpcpy(), and its asm versions of strcat() and strcmp() are probably mistakes too.

i386, Xeon 5650 @ 2.67GHz:

clang -O [-march=native makes no difference]:
12.0 Gcycles -- libc strncmp() (asm source, external linkage)
15.1 Gcycles -- libc strncmp() (copy of the C version)
11.5 Gcycles -- My Implementation

clang is even more confused by the copy of the libc C strncmp() than gcc-4.2.1.

Bruce
strncmp() has been rewritten in assembly for better performance in FreeBSD 14.1. Do the performance issues still occur? If yes, please reopen.

Note that your benchmark favours your own implementation: you always use the same inputs to strncmp(), allowing the branch predictor to predict the control flow of your very branchy implementation perfectly. This is not reflective of real-world input, where each pair of strings passed to strncmp() is likely different. I have designed a more comprehensive benchmark here:

https://github.com/clausecker/strperf

Comparing your function with the SIMD-enhanced implementation I wrote, we get:

os: FreeBSD
arch: amd64
cpu: Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz
                       │ strncmp.simple.out │        strncmp.scalar.out           │       strncmp.baseline.out          │
                       │       sec/op       │    sec/op     vs base               │    sec/op     vs base               │
StrncmpShortAligned          144.11µ ± 0%    141.34µ ± 1%   -1.92% (p=0.000 n=20)   63.05µ ± 0%  -56.25% (p=0.000 n=20)
StrncmpMidAligned             83.93µ ± 0%     64.46µ ± 1%  -23.20% (p=0.000 n=20)   21.51µ ± 1%  -74.37% (p=0.000 n=20)
StrncmpLongAligned           59.528µ ± 1%    38.266µ ± 0%  -35.72% (p=0.000 n=20)   6.154µ ± 1%  -89.66% (p=0.000 n=20)
StrncmpShortUnaligned        143.07µ ± 0%    142.14µ ± 1%   -0.65% (p=0.003 n=20)   69.63µ ± 0%  -51.33% (p=0.000 n=20)
StrncmpMidUnaligned           83.61µ ± 1%     64.62µ ± 0%  -22.71% (p=0.000 n=20)   26.60µ ± 1%  -68.19% (p=0.000 n=20)
StrncmpLongUnaligned         59.416µ ± 0%    38.392µ ± 1%  -35.39% (p=0.000 n=20)   6.356µ ± 1%  -89.30% (p=0.000 n=20)
StrncmpShortQsort             1.000m ± 1%     1.053m ± 0%   +5.28% (p=0.000 n=20)   1.221m ± 1%  +22.11% (p=0.000 n=20)
StrncmpMidQsort               243.2µ ± 0%     243.4µ ± 0%        ~ (p=0.968 n=20)   271.7µ ± 0%  +11.71% (p=0.000 n=20)
geomean                       137.1µ          115.4µ       -15.78%                  48.88µ       -64.33%

                       │ strncmp.simple.out │        strncmp.scalar.out            │       strncmp.baseline.out            │
                       │        B/s         │     B/s       vs base                │     B/s        vs base                │
StrncmpShortAligned          867.4Mi ± 0%    884.4Mi ± 1%    +1.95% (p=0.000 n=20)   1982.6Mi ± 0%  +128.57% (p=0.000 n=20)
StrncmpMidAligned            1.454Gi ± 0%    1.894Gi ± 1%   +30.20% (p=0.000 n=20)    5.675Gi ± 1%  +290.14% (p=0.000 n=20)
StrncmpLongAligned           2.051Gi ± 1%    3.190Gi ± 0%   +55.56% (p=0.000 n=20)   19.835Gi ± 1%  +867.25% (p=0.000 n=20)
StrncmpShortUnaligned        873.7Mi ± 0%    879.4Mi ± 1%    +0.66% (p=0.003 n=20)   1795.2Mi ± 0%  +105.47% (p=0.000 n=20)
StrncmpMidUnaligned          1.460Gi ± 1%    1.889Gi ± 0%   +29.39% (p=0.000 n=20)    4.589Gi ± 1%  +214.35% (p=0.000 n=20)
StrncmpLongUnaligned         2.054Gi ± 0%    3.180Gi ± 1%   +54.76% (p=0.000 n=20)   19.207Gi ± 1%  +834.86% (p=0.000 n=20)
StrncmpShortQsort            125.0Mi ± 1%    118.7Mi ± 0%    -5.01% (p=0.000 n=20)    102.4Mi ± 1%   -18.10% (p=0.000 n=20)
StrncmpMidQsort              514.0Mi ± 0%    513.6Mi ± 0%         ~ (p=0.968 n=20)    460.1Mi ± 0%   -10.49% (p=0.000 n=20)
geomean                      912.1Mi         1.058Gi        +18.74%                    2.497Gi       +180.37%

where "simple" is your implementation (compiled with -O3), "scalar" is our non-SIMD implementation (see simd(7); somewhat similar to yours, but manually unrolled), and "baseline" is our SIMD implementation. As you can see, the simple implementation is only faster on the qsort benchmark (sorting an array of random strings), and there only slightly.