Bug 271906 - Apparent performance problem with basic and extended regular expressions
Summary: Apparent performance problem with basic and extended regular expressions
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 13.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: performance
Duplicates: 271904 271905
Depends on:
Blocks:
 
Reported: 2023-06-08 13:40 UTC by marius
Modified: 2023-06-10 05:31 UTC

Description marius 2023-06-08 13:40:53 UTC
There seems to be a regression in FreeBSD 13 affecting
both basic and extended regular expression processing.

Illustration of problem (grep):

        $ uname -a 
        FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC

        $ time grep -E '(wordorphrase|differentword)' 150MB-file >/dev/null

        real    0m54.565s
        user    0m54.372s
        sys     0m0.173s

It should not take almost a minute to search a 150MB file!
Even worse is:

        $ time grep -i 'differentword' 150MB-file >/dev/null

        real    0m28.060s
        user    0m28.016s
        sys     0m0.038s

That is almost 30 seconds for a case-insensitive search of a 150MB
text file, compared to:

        $ time grep 'differentword' 150MB-file >/dev/null

        real    0m0.210s
        user    0m0.178s
        sys     0m0.032s

which runs at normal speed.

This was all fine on 12.3 and 12.4. For example:

        $ uname -a
        FreeBSD 12.3-RELEASE-p11 GENERIC

        $ time grep -E '(wordorphrase|differentword)' 150MB-file >/dev/null

        real    0m0.290s
        user    0m0.219s
        sys     0m0.071s
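
For anyone without the original 150MB file, the comparison above can be approximated with generated data. This is only a sketch: the file contents, the default line count, the `NLINES` variable, and the helper name `bench` are assumptions, not part of the report, and `date +%s` gives whole-second resolution only.

```shell
#!/bin/sh
# Reproduction sketch. The file contents, the line count, and the
# helper name "bench" are assumptions, not taken from the report.
file="${TMPDIR:-/tmp}/grep-bench.$$"
trap 'rm -f "$file"' EXIT

# Filler lines plus two lines that differ only in case.
nlines="${NLINES:-100000}"
awk -v n="$nlines" 'BEGIN {
    for (i = 0; i < n; i++)
        print "some filler text line number " i
    print "a line with differentword in it"
    print "a line with DifferentWord in it"
}' > "$file"

# bench LABEL GREP-ARGS...: count matching lines and report elapsed
# wall-clock time in whole seconds (coarse, but widely available).
bench() {
    label=$1
    shift
    start=$(date +%s)
    matches=$(grep -c "$@" "$file")
    end=$(date +%s)
    echo "$label: $matches matching lines, $((end - start))s"
}

bench plain 'differentword'
bench icase -i 'differentword'
bench ere -E '(wordorphrase|differentword)'
```

Raising `NLINES` (for example `NLINES=4000000 sh bench.sh`) makes any difference between the three invocations easier to see.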
Comment 1 Yuri Pankov freebsd_committer freebsd_triage 2023-06-08 14:14:48 UTC
It's not a regression in the regex implementation per se, but rather the result of a switch from the (very outdated) GNU grep, which used the bundled libgnuregex, to bsdgrep, which uses the in-base regex implementation.

See also bug 223553, bug 254763, bug 255525.
Comment 2 Yuri Pankov freebsd_committer freebsd_triage 2023-06-08 14:18:10 UTC
*** Bug 271904 has been marked as a duplicate of this bug. ***
Comment 3 Yuri Pankov freebsd_committer freebsd_triage 2023-06-08 14:18:29 UTC
*** Bug 271905 has been marked as a duplicate of this bug. ***
Comment 4 marius 2023-06-09 11:16:24 UTC
So the in-base regex implementation is badly broken.

Additional problems (presumably also in the regex library) are illustrated by
the following:

'þ' is the Latin-1 character THORN (0xfe)

    $ env LC_CTYPE=is_IS.ISO8859-1 grep 'þ'
    grep: trailing backslash (\)

    $ env LC_CTYPE=is_IS.ISO8859-1 sed 's/þ/th/'
    sed: 1: "s/þ/th/": RE error: trailing backslash (\)

    $ env LC_CTYPE=is_IS.ISO8859-1 expr "abcþdef" : '...þ...'
    expr: trailing backslash (\)

Are there any plans to fix this, or to revert the change in the in-base regex?
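
To rule out the terminal's own encoding when trying the commands above, the 0xfe byte can be generated explicitly instead of typed. This is a minimal sketch, assuming the is_IS.ISO8859-1 locale from the report is installed; the variable name `thorn` is made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes the is_IS.ISO8859-1 locale from the report is
# installed; if it is not, the result says nothing about this problem.
thorn=$(printf '\376')    # octal 376 = 0xfe, THORN in ISO 8859-1

# Confirm that exactly one byte, 0xfe, is being passed around.
printf '%s' "$thorn" | od -An -tx1

# Same grep invocation as in the report, with the byte made explicit.
printf 'abc%sdef\n' "$thorn" | env LC_CTYPE=is_IS.ISO8859-1 grep "$thorn"
echo "grep exit status: $?"
```

If the error shows up here as well, it does not depend on how the terminal encodes 'þ'.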
Comment 5 Daniel Tameling 2023-06-09 11:55:12 UTC
The trailing backslash error for ISO locales has been reported at bug #264275.

The performance issue comes up periodically. For example, it is discussed at length in bug #254763.
Comment 6 Yuri Pankov freebsd_committer freebsd_triage 2023-06-09 12:48:54 UTC
(In reply to marius from comment #4)
I think you misunderstood my reply.  The libc (that is, in-base) regex implementation has always been that way (and has actually received quite a few fixes in recent years); there is nothing to revert. Whatever issues there are need to be fixed.
Comment 7 Graham Perrin freebsd_committer freebsd_triage 2023-06-10 05:31:19 UTC
Is deduplication appropriate for any of the four non-tracking bug reports?