Bug 223532 - GNU egrep -i is terrible slow if utf-8 locale is enabled
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
Blocks: 223553
Reported: 2017-11-08 12:59 UTC by Wolfram Schneider
Modified: 2017-11-20 15:20 UTC (History)

Description Wolfram Schneider freebsd_committer 2017-11-08 12:59:43 UTC
egrep -i is terribly slow if the locale is set to UTF-8. In fact, it is 77 times slower than a case-sensitive search.


How to repeat:

First, we create a 100MB text file:
for i in $(seq 1 20); do man tcsh; done > /tmp/tcsh20
for i in $(seq 1 20); do cat /tmp/tcsh20; done > /tmp/tcsh400

$ du -hs /tmp/tcsh400
 99M    /tmp/tcsh400
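
The two loops above amount to 20 x 20 = 400 copies of the tcsh man page. A minimal sketch of the same idea as a reusable function follows; make_corpus and the /tmp/tcsh1 path are hypothetical names introduced here for illustration and are not part of the original report.

```shell
# Hypothetical helper (not from the report): write COPIES back-to-back
# copies of the file SRC into DST, replacing any previous DST.
make_corpus() {
    src=$1
    dst=$2
    copies=$3
    : > "$dst"                  # start from an empty output file
    i=0
    while [ "$i" -lt "$copies" ]; do
        cat "$src" >> "$dst"    # append one more copy
        i=$((i + 1))
    done
}

# Possible use, assuming the man page is rendered once into /tmp/tcsh1:
#   man tcsh > /tmp/tcsh1
#   make_corpus /tmp/tcsh1 /tmp/tcsh400 400
```

Rendering the man page once and copying it should give the same content as calling man 20 times, provided the rendered output does not change between calls.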


# case-sensitive search with UTF-8
LANG=en_CA.UTF-8 time egrep  -c foobar /tmp/tcsh400
0
        0.11 real         0.06 user         0.04 sys


# case-insensitive search with UTF-8, terribly slow
LANG=en_CA.UTF-8 time egrep  -ic  foobar /tmp/tcsh400
0
        8.47 real         8.42 user         0.04 sys


# case-sensitive search with ASCII
LANG=C time egrep  -c  foobar /tmp/tcsh400
0
        0.10 real         0.06 user         0.03 sys


# case-insensitive search with ASCII
LANG=C time egrep  -ic foobar /tmp/tcsh400
0
        0.10 real         0.07 user         0.03 sys
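
The "77 times slower" figure can be checked from the "real" times of the two UTF-8 runs above (8.47 s with -i, 0.11 s without). A minimal sketch of that arithmetic follows; the function name slowdown is an assumption made for illustration, not something from the report.

```shell
# Hypothetical helper (not from the report): print how many times
# slower the first timing is than the second, to one decimal place.
slowdown() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

# Real times of the UTF-8 runs: case-insensitive vs. case-sensitive.
slowdown 8.47 0.11    # prints 77.0
```

The same call with the two LANG=C timings (0.10 and 0.10) gives 1.0, which suggests that the slowdown depends on the UTF-8 locale and not on -i alone.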
Comment 1 Kyle Evans freebsd_committer 2017-11-09 02:46:21 UTC
Is this gnugrep or bsdgrep? (`egrep -V`) I'm getting something along these lines with your reproduction steps:


root@www2:/usr/bin # env LANG=en_CA.UTF-8 time egrep -c foobar /tmp/tcsh400
0
        1.13 real         0.73 user         0.08 sys
root@www2:/usr/bin # env LANG=en_CA.UTF-8 time egrep -ic foobar /tmp/tcsh400
0
        3.73 real         3.26 user         0.08 sys

root@www2:/usr/bin # env LANG=C time egrep -c foobar /tmp/tcsh400
0
        1.08 real         0.72 user         0.06 sys
root@www2:/usr/bin # env LANG=C time egrep -ic foobar /tmp/tcsh400
0
        1.11 real         0.74 user         0.07 sys



With:

root@www2:/usr/bin # grep -V
grep (BSD grep, GNU compatible) 2.6.0-FreeBSD

So egrep -i is still slower with bsdgrep, but it's only 3x slower than the equivalent case-sensitive search here.
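
Since the timings depend on which implementation is installed as egrep, it may help to classify the version banner before comparing numbers. A minimal sketch follows, assuming the banner formats quoted in this thread; grep_flavor is a hypothetical helper name, and other grep versions may print banners that it reports as "unknown".

```shell
# Hypothetical helper (not from the report): classify a grep version
# banner as "bsd", "gnu" or "unknown" by simple substring matching.
grep_flavor() {
    case $1 in
        *"BSD grep"*) echo bsd ;;
        *"GNU grep"*) echo gnu ;;
        *)            echo unknown ;;
    esac
}

# Possible use on a live system:
#   grep_flavor "$(egrep -V 2>&1 | head -n 1)"

# Banners quoted in this thread:
grep_flavor "grep (BSD grep, GNU compatible) 2.6.0-FreeBSD"    # prints bsd
grep_flavor "egrep (GNU grep) 2.5.1-FreeBSD"                   # prints gnu
```

The BSD pattern is tested first because the bsdgrep banner also contains the word "GNU" (in "GNU compatible").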
Comment 2 Wolfram Schneider freebsd_committer 2017-11-09 09:02:04 UTC
I'm using the standard egrep from the base system:

$ /usr/bin/egrep -V
egrep (GNU grep) 2.5.1-FreeBSD

For bsdgrep I get:
$ /usr/bin/bsdgrep -V
bsdgrep (BSD grep, GNU compatible) 2.6.0-FreeBSD

LANG=en_CA.UTF-8 time bsdgrep  -c foobar /tmp/tcsh400
0
        0.50 real         0.47 user         0.02 sys
LANG=en_CA.UTF-8 time bsdgrep  -ci foobar /tmp/tcsh400
0
        2.09 real         2.04 user         0.04 sys