egrep -i is terribly slow if the locale is set to UTF-8. In fact, it is 77 times slower than a case-sensitive search.

How to repeat:

First, we create a 100MB text file:

for i in $(seq 1 20); do man tcsh; done > /tmp/tcsh20
for i in $(seq 1 20); do cat /tmp/tcsh20; done > /tmp/tcsh400

$ du -hs /tmp/tcsh400
99M /tmp/tcsh400

# case-sensitive search with UTF-8
LANG=en_CA.UTF-8 time egrep -c foobar /tmp/tcsh400
0
0.11 real 0.06 user 0.04 sys

# case-insensitive search with UTF-8, terribly slow
LANG=en_CA.UTF-8 time egrep -ic foobar /tmp/tcsh400
0
8.47 real 8.42 user 0.04 sys

# case-sensitive search with ASCII
LANG=C time egrep -c foobar /tmp/tcsh400
0
0.10 real 0.06 user 0.03 sys

# case-insensitive search with ASCII
LANG=C time egrep -ic foobar /tmp/tcsh400
0
0.10 real 0.07 user 0.03 sys
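The four runs above vary only the locale and the -i flag. A minimal sketch of a helper that runs the same matrix for any file, pattern, and list of locales follows. The function name grep_matrix and the TIMER variable are hypothetical, and grep -E is used here as the equivalent of egrep. Note that LC_ALL, if set in the environment, overrides LANG.

```shell
# Hypothetical helper: run the same search with -c and with -ic under each
# locale given, so the timings can be compared side by side.
# TIMER defaults to time(1); set TIMER= (empty) to run without timing.
# usage: grep_matrix FILE PATTERN LOCALE...
grep_matrix() {
    file=$1
    pattern=$2
    shift 2
    timer=${TIMER-time}
    for loc in "$@"; do
        for opt in -c -ic; do
            printf 'LANG=%s grep -E %s\n' "$loc" "$opt"
            # grep exits 1 when nothing matches; that is not a failure here
            env LANG="$loc" $timer grep -E "$opt" -e "$pattern" "$file" || :
        done
    done
}

# Example, assuming the file from the reproduction steps exists:
#   grep_matrix /tmp/tcsh400 foobar en_CA.UTF-8 C
```

The match counts go to stdout, while time(1) writes its report to stderr, so the two can be redirected separately.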
Is this gnugrep or bsdgrep? (`egrep -V`)

I'm getting something along these lines with your reproduction steps:

root@www2:/usr/bin # env LANG=en_CA.UTF-8 time egrep -c foobar /tmp/tcsh400
0
1.13 real 0.73 user 0.08 sys
root@www2:/usr/bin # env LANG=en_CA.UTF-8 time egrep -ic foobar /tmp/tcsh400
0
3.73 real 3.26 user 0.08 sys
root@www2:/usr/bin # env LANG=C time egrep -c foobar /tmp/tcsh400
0
1.08 real 0.72 user 0.06 sys
root@www2:/usr/bin # env LANG=C time egrep -ic foobar /tmp/tcsh400
0
1.11 real 0.74 user 0.07 sys

With:

root@www2:/usr/bin # grep -V
grep (BSD grep, GNU compatible) 2.6.0-FreeBSD

So egrep -i is still slower with bsdgrep, but it's only 3x slower than the equivalent case-sensitive search here.
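The factors quoted so far (77 times, about 3x) are ratios of the "real" times reported by time(1). A small sketch of that arithmetic, using a hypothetical helper named slowdown and the UTF-8 timings from the two comments above:

```shell
# slowdown SLOW FAST: print how many times longer SLOW took than FAST,
# rounded to one decimal place (hypothetical helper, not part of grep).
slowdown() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

slowdown 8.47 0.11   # GNU egrep, UTF-8, -ic vs. -c: prints 77.0
slowdown 3.73 1.13   # BSD grep, UTF-8, -ic vs. -c: prints 3.3
```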
I'm using the standard egrep from the base system:

$ /usr/bin/egrep -V
egrep (GNU grep) 2.5.1-FreeBSD

For bsdgrep I get:

$ /usr/bin/bsdgrep -V
bsdgrep (BSD grep, GNU compatible) 2.6.0-FreeBSD

LANG=en_CA.UTF-8 time bsdgrep -c foobar /tmp/tcsh400
0
0.50 real 0.47 user 0.02 sys

LANG=en_CA.UTF-8 time bsdgrep -ci foobar /tmp/tcsh400
0
2.09 real 2.04 user 0.04 sys
(In reply to Wolfram Schneider from comment #2)

Hmm. I have noticed that as well. I suspect it's fallout from bug #253209, as I noticed it became much slower after that fix.

However, I'm seeing the vast majority of the slowdown with locales other than UTF-8 as well:

root@nuc ~ # grep -V
grep (BSD grep, GNU compatible) 2.6.0-FreeBSD
root@nuc ~ # time fgrep zpipe /usr/ports/INDEX-13
0.28 real 0.15 user 0.06 sys
root@nuc ~ # time fgrep -i zpipe /usr/ports/INDEX-13
13.87 real 13.86 user 0.00 sys
root@nuc ~ # LANG=en_US.UTF-8 time fgrep -i zpipe /usr/ports/INDEX-13
17.67 real 17.65 user 0.01 sys
root@nuc ~ # LANG=C.UTF-8 time fgrep -i zpipe /usr/ports/INDEX-13
17.63 real 17.59 user 0.02 sys
root@nuc ~ # LANG=en_US.iso8859-1 time fgrep -i zpipe /usr/ports/INDEX-13
13.97 real 13.95 user 0.02 sys
root@nuc ~ # LANG=C time fgrep -i zpipe /usr/ports/INDEX-13
14.00 real 13.97 user 0.03 sys

To summarize, "-i" adds two (!) orders of magnitude, and changing to a multibyte character set adds another 33% or so on top.
Created attachment 225506
Build with WITH_INTERNAL_NOSPEC for better performance

Building with -DWITH_INTERNAL_NOSPEC makes "fgrep -i" use the internal implementation of the comparison function instead of regcomp() from the C library.

The performance impact of using -i with or without UTF-8 is only a factor of about 3 instead of 100:

$ LANG=C time fgrep zpipe /usr/ports/INDEX-14
0.03 real 0.03 user 0.00 sys
$ LANG=en_US.UTF-8 time fgrep zpipe /usr/ports/INDEX-14
0.03 real 0.03 user 0.00 sys
$ LANG=en_US.UTF-8 time fgrep -i zpipe /usr/ports/INDEX-14
0.10 real 0.09 user 0.00 sys
$ LANG=en_US.UTF-8 time fgrep-ORIG -i zpipe /usr/ports/INDEX-14
4.47 real 4.47 user 0.00 sys

(fgrep has been built with WITH_INTERNAL_NOSPEC, while fgrep-FBSD is a copy of the original FreeBSD fgrep built without that option.)

Both the original version and the modified version fail the same 4 kyua tests, BTW.
(In reply to Stefan Eßer from comment #4)
> The performance impact of using -i with or without UTF-8 is only a factor of about 3 instead of 100:

I said that "-i" vs. no "-i" makes a difference of two orders of magnitude. (Unrelated to multibyte vs. single-byte.)

What are fgrep-FBSD and fgrep-ORIG?

Where is WITH_INTERNAL_NOSPEC documented? src.conf(5) doesn't mention it.

Why isn't WITH_INTERNAL_NOSPEC enabled by default when it accounts for such a drastic improvement?
You seem to be updating the wrong issue; this one is about the now-nonexistent GNU grep/libgnuregex in base and should just be closed. For bsdgrep (or simply grep now), bug 223553 looks like a better place.
(In reply to dsdqmhsx from comment #6)

Thanks. Anyhow, WITH_INTERNAL_NOSPEC fixed it for me; now "-i" vs. no "-i" means a factor of 2 or 3 or so, which is expected.
(In reply to Helge Oldach from comment #5)

My comment #4 referred to comment #3, which used BSD fgrep (despite the title of the PR referring to GNU egrep).

I first compared fgrep with the C and UTF-8 locales and found they had about the same performance. Adding -i in the UTF-8 case increased the run time from 0.03 seconds to 4.47 seconds (or by a factor of more than 100). With LANG=C the run time is 3.36 seconds, BTW.

The patch that I have attached speeds this case up to 0.09 seconds by using an internal function instead of the regex library.

fgrep-FBSD meant fgrep-ORIG (sorry for the confusion). This is the binary as built in -CURRENT without the patch.

WITH_INTERNAL_NOSPEC is not documented, except by a comment in the sources (in util.c) which explains that this option exists for systems that lack REG_NOSPEC or REG_LITERAL and specifically mentions libgnuregex.

In fact, this function has a bit more overhead than necessary. An optimized variant of the strcasestr_l() function could be inlined in util.c, but I did not try to measure the performance difference. (The optimization would cache the locale instead of calling __getlocale() and FIX_LOCALE for each invocation of strcasestr().)
(In reply to dsdqmhsx from comment #6)

I had replied to comment #3, which in fact is about a different grep implementation and only shares the issue of slow performance when using -i.

I'll attach the patch to PR 223553 and close this one as a duplicate of that PR (which it was not, originally, but the GNU grep specific issue is no longer relevant, and starting at comment #3 this actually is a duplicate ...)

*** This bug has been marked as a duplicate of bug 223553 ***