Bug 223532 - GNU egrep -i is terribly slow if a UTF-8 locale is enabled
Status: Closed DUPLICATE of bug 223553
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
Blocks: 223553
Reported: 2017-11-08 12:59 UTC by Wolfram Schneider
Modified: 2021-06-03 05:30 UTC

Attachments
Build with WITH_INTERNAL_NOSPEC for better performance (353 bytes, patch)
2021-06-02 18:20 UTC, Stefan Eßer

Description Wolfram Schneider freebsd_committer freebsd_triage 2017-11-08 12:59:43 UTC
egrep -i is terribly slow if the locale is set to UTF-8. In fact, it is 77 times slower than a case-sensitive search.


How to repeat:

First, we create a 100MB text file:
for i in $(seq 1 20);do man tcsh;done > /tmp/tcsh20;
for i in $(seq 1 20); do cat /tmp/tcsh20;done > /tmp/tcsh400

$ du -hs /tmp/tcsh400
 99M    /tmp/tcsh400
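
The corpus above depends on the local tcsh man page, so its size will vary between systems. As a rough sketch, the same repeat-and-concatenate step can be wrapped in a small helper. The function name `repeat_file` and the paths in the usage comment are illustrative assumptions, not part of the original report.

```shell
#!/bin/sh
# repeat_file SRC N: write N concatenated copies of SRC to stdout.
# Hypothetical helper; it mirrors the two "for ... cat" loops in the report.
repeat_file() {
    src=$1
    n=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        cat "$src"
        i=$((i + 1))
    done
}

# Usage (paths are examples only):
#   man tcsh > /tmp/tcsh1
#   repeat_file /tmp/tcsh1 400 > /tmp/tcsh400
```

Any reasonably large text file should work as the seed; the report's 20 x 20 copies of the man page are simply one way to reach roughly 100MB.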


# case sensitive search with utf-8
LANG=en_CA.UTF-8 time egrep  -c foobar /tmp/tcsh400
0
        0.11 real         0.06 user         0.04 sys


# case-insensitive search with utf-8, terribly slow
LANG=en_CA.UTF-8 time egrep  -ic  foobar /tmp/tcsh400
0
        8.47 real         8.42 user         0.04 sys


# case sensitive search with ASCII
LANG=C time egrep  -c  foobar /tmp/tcsh400
0
        0.10 real         0.06 user         0.03 sys


# case-insensitive search with ASCII
LANG=C time egrep  -ic foobar /tmp/tcsh400
0
        0.10 real         0.07 user         0.03 sys
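
The "77 times" figure follows from the wall-clock ("real") times above: 8.47 s for the case-insensitive UTF-8 run against 0.11 s for the case-sensitive one. A minimal sketch of that arithmetic, assuming awk is available; the function name `slowdown` is hypothetical:

```shell
#!/bin/sh
# slowdown SLOW FAST: print how many times slower SLOW is than FAST,
# rounded to one decimal place. Hypothetical helper for the timings above.
slowdown() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

slowdown 8.47 0.11   # UTF-8 locale: -i vs. case-sensitive
slowdown 0.10 0.10   # C locale: -i vs. case-sensitive
```

With the numbers from this report the first call prints 77.0 and the second prints 1.0, which matches the observation that the penalty shows up only in the UTF-8 locale.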
Comment 1 Kyle Evans freebsd_committer freebsd_triage 2017-11-09 02:46:21 UTC
Is this gnugrep or bsdgrep? (`egrep -V`) I'm getting something along these lines with your reproduction steps:


root@www2:/usr/bin # env LANG=en_CA.UTF-8 time egrep -c foobar /tmp/tcsh400
0
        1.13 real         0.73 user         0.08 sys
root@www2:/usr/bin # env LANG=en_CA.UTF-8 time egrep -ic foobar /tmp/tcsh400
0
        3.73 real         3.26 user         0.08 sys

root@www2:/usr/bin # env LANG=C time egrep -c foobar /tmp/tcsh400
0
        1.08 real         0.72 user         0.06 sys
root@www2:/usr/bin # env LANG=C time egrep -ic foobar /tmp/tcsh400
0
        1.11 real         0.74 user         0.07 sys



With:

root@www2:/usr/bin # grep -V
grep (BSD grep, GNU compatible) 2.6.0-FreeBSD

So egrep -i is still slower with bsdgrep, but it's only 3x slower than the equivalent case-sensitive search here.
Comment 2 Wolfram Schneider freebsd_committer freebsd_triage 2017-11-09 09:02:04 UTC
I'm using the standard egrep from the base system

$ /usr/bin/egrep -V
egrep (GNU grep) 2.5.1-FreeBSD

For bsdgrep I get:
$ /usr/bin/bsdgrep -V
bsdgrep (BSD grep, GNU compatible) 2.6.0-FreeBSD

LANG=en_CA.UTF-8 time bsdgrep  -c foobar /tmp/tcsh400
0
        0.50 real         0.47 user         0.02 sys
LANG=en_CA.UTF-8 time bsdgrep  -ci foobar /tmp/tcsh400
0
        2.09 real         2.04 user         0.04 sys
Comment 3 Helge Oldach 2021-06-02 17:14:15 UTC
(In reply to Wolfram Schneider from comment #2)
Hmm. I have noticed that as well. I suspect it's fallout from bug #253209, as I noticed it became much slower after that fix.

However, I'm seeing the vast majority of the slowdown with locales other than UTF-8 as well:

root@nuc ~ # grep -V
grep (BSD grep, GNU compatible) 2.6.0-FreeBSD
root@nuc ~ # time fgrep zpipe /usr/ports/INDEX-13
        0.28 real         0.15 user         0.06 sys
root@nuc ~ # time fgrep -i zpipe /usr/ports/INDEX-13
       13.87 real        13.86 user         0.00 sys
root@nuc ~ # LANG=en_US.UTF-8 time fgrep -i zpipe /usr/ports/INDEX-13
       17.67 real        17.65 user         0.01 sys
root@nuc ~ # LANG=C.UTF-8 time fgrep -i zpipe /usr/ports/INDEX-13
       17.63 real        17.59 user         0.02 sys
root@nuc ~ # LANG=en_US.iso8859-1 time fgrep -i zpipe /usr/ports/INDEX-13
       13.97 real        13.95 user         0.02 sys
root@nuc ~ # LANG=C time fgrep -i zpipe /usr/ports/INDEX-13
       14.00 real        13.97 user         0.03 sys
root@nuc ~ #

To summarize, "-i" adds two (!) orders of magnitude, and changing to a multibyte character set adds another 33% or so on top.
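
When comparing locales like this, it can help to confirm that every locale returns the same match count before reading anything into the timings. A rough sketch, assuming `grep -F` behaves like the `fgrep` used in the transcript; the function name `count_per_locale` is hypothetical, and the default locale list simply copies the LANG values used above. Note that if LC_ALL is set in the calling environment, it overrides LANG.

```shell
#!/bin/sh
# count_per_locale PATTERN FILE [LOCALE...]
# Print "<locale> <count>" for a case-insensitive fixed-string search
# under each locale. Hypothetical helper, not part of grep itself.
count_per_locale() {
    pattern=$1
    file=$2
    shift 2
    # Default to the LANG values from the transcript if none are given.
    [ "$#" -gt 0 ] || set -- C C.UTF-8 en_US.UTF-8 en_US.iso8859-1
    for loc in "$@"; do
        # env(1) sets LANG for this one grep process only.
        # grep exits 1 when nothing matches; the count is still printed.
        count=$(env LANG="$loc" grep -F -i -c -- "$pattern" "$file") || true
        printf '%s %s\n' "$loc" "$count"
    done
}

# Usage (same file and pattern as the transcript):
#   count_per_locale zpipe /usr/ports/INDEX-13
```

Prefixing a single grep invocation with time, as in the transcript, then gives the per-locale timings.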
Comment 4 Stefan Eßer freebsd_committer freebsd_triage 2021-06-02 18:20:11 UTC
Created attachment 225506
Build with WITH_INTERNAL_NOSPEC for better performance

Building with -DWITH_INTERNAL_NOSPEC makes "fgrep -i" use the internal implementation of the comparison function instead of regcomp() from the C library.

The performance impact of using -i with or without UTF-8 is only a factor of about 3 instead of 100:

$ LANG=C time fgrep zpipe /usr/ports/INDEX-14 
        0.03 real         0.03 user         0.00 sys

$ LANG=en_US.UTF-8 time fgrep zpipe /usr/ports/INDEX-14
        0.03 real         0.03 user         0.00 sys

$ LANG=en_US.UTF-8 time fgrep -i zpipe /usr/ports/INDEX-14 
        0.10 real         0.09 user         0.00 sys

$ LANG=en_US.UTF-8 time fgrep-ORIG -i zpipe /usr/ports/INDEX-14 
        4.47 real         4.47 user         0.00 sys

(fgrep has been built with WITH_INTERNAL_NOSPEC, while fgrep-FBSD is a copy of the original FreeBSD fgrep built without that option.)

Both the original version and the modified version fail the same 4 kyua tests, BTW.
Comment 5 Helge Oldach 2021-06-02 19:26:38 UTC
(In reply to Stefan Eßer from comment #4)
> The performance impact of using -i with or without UTF-8 is only a factor of about 3 instead of 100:

I said that "-i" vs. no "-i" makes two orders of magnitude difference. (Unrelated to multibyte vs. singlebyte.)

What are fgrep-FBSD and fgrep-ORIG?

Where is WITH_INTERNAL_NOSPEC documented? src.conf(5) doesn't mention it.

Why isn't WITH_INTERNAL_NOSPEC enabled by default when it accounts for such a drastic improvement?
Comment 6 Yuri 2021-06-02 19:43:34 UTC
You seem to be updating the wrong issue; this one is about the now-nonexistent GNU grep/libgnuregex in base and should just be closed. For bsdgrep (or simply grep now), bug 223553 looks like a better place.
Comment 7 Helge Oldach 2021-06-02 19:51:52 UTC
(In reply to dsdqmhsx from comment #6)
Thanks. Anyhow, WITH_INTERNAL_NOSPEC fixed it for me; now "-i" vs. no "-i" means a factor of 2 or 3 or so, which is expected.
Comment 8 Stefan Eßer freebsd_committer freebsd_triage 2021-06-02 20:19:55 UTC
(In reply to Helge Oldach from comment #5)

My comment #4 referred to comment #3, which used BSD fgrep (despite the title of the PR referring to GNU egrep).

I first compared fgrep with the C and UTF-8 locales and found they had about the same performance.

Adding -i in the UTF-8 case increased the run time from 0.03 seconds to 4.47 seconds (or by a factor of more than 100). With LANG=C the run time is 3.36 seconds, BTW.

The patch that I have attached speeds this case up to 0.09 seconds by using an internal function instead of the regex library.

fgrep-FBSD meant fgrep-ORIG (sorry for the confusion). This is the binary as built in -CURRENT without the patch.

WITH_INTERNAL_NOSPEC is not documented, except by a comment in the sources (in util.c) which explains that this option exists for systems that lack REG_NOSPEC or REG_LITERAL and specifically mentions libgnuregex.

In fact, this function has a bit more overhead than necessary. An optimized variant of the strcasestr_l() function could be inlined in util.c, but I did not try to measure the performance difference. (The optimization would cache the locale instead of calling __getlocale() and FIX_LOCALE for each invocation of strcasestr().)
Comment 9 Stefan Eßer freebsd_committer freebsd_triage 2021-06-02 20:24:53 UTC
(In reply to dsdqmhsx from comment #6)

I had replied to comment #3, which in fact is for a different grep implementation and only shared the issue of slow performance when using -i.

I'll attach the patch to PR 223553 and close this one as a duplicate of that PR (which it was not, originally, but the GNU grep-specific issue is no longer relevant, and starting at comment #3 this actually is a duplicate ...)

*** This bug has been marked as a duplicate of bug 223553 ***