271906 – Apparent performance problem with basic and extended regular expressions

Status:	Open

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	bin
Version:	13.2-RELEASE
Hardware:	amd64 Any

Importance:	--- Affects Many People
Assignee:	freebsd-bugs (Nobody)

URL:
Keywords:	performance

Duplicates (2):	271904 271905
Depends on:
Blocks:

Reported:	2023-06-08 13:40 UTC by marius
Modified:	2023-06-10 05:31 UTC
CC List:	2 users

See Also:	230332 223553 254763 255525

Description marius 2023-06-08 13:40:53 UTC

There seems to be a regression in version 13 of FreeBSD affecting
both basic and extended regular expression processing.

Illustration of problem (grep):

        $ uname -a 
        FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC

        $ time grep -E '(wordorphrase|differentword)' 150MB-file >/dev/null

        real    0m54.565s
        user    0m54.372s
        sys     0m0.173s

It should not take almost a minute to search a 150MB file!
Even worse is:

        $ time grep -i 'differentword' 150MB-file >/dev/null

        real    0m28.060s
        user    0m28.016s
        sys     0m0.038s

That is almost 30 seconds for a case-insensitive search of a 150MB
text file, compared to:

        $ time grep 'differentword' 150MB-file >/dev/null

        real    0m0.210s
        user    0m0.178s
        sys     0m0.032s

which runs at normal speed.

This was all fine on 12.3 and 12.4. For example:

        $ uname -a
        FreeBSD 12.3-RELEASE-p11 GENERIC

        $ time grep -E '(wordorphrase|differentword)' 150MB-file >/dev/null

        real    0m0.290s
        user    0m0.219s
        sys     0m0.071s
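
A minimal reproduction sketch, assuming a synthetic corpus can stand in for the reporter's 150MB-file, which is not attached to this report. The helper names (make_corpus, count_matches, bench), the line contents, and the default size are hypothetical. Elapsed time comes from `date +%s`, so it is only accurate to whole seconds, which is coarse but enough to tell a sub-second run from the multi-second runs shown above.

```shell
#!/bin/sh
# Sketch only: the corpus contents, its size, and the helper names
# (make_corpus, count_matches, bench) are assumptions, not from the report.

# Write $2 lines to $1; every 1000th line contains "differentword".
make_corpus() {
    awk -v n="$2" 'BEGIN {
        for (i = 1; i <= n; i++) {
            if (i % 1000 == 0)
                print "line " i " contains differentword here"
            else
                print "line " i " lorem ipsum dolor sit amet"
        }
    }' > "$1"
}

# Print the number of matching lines: $1 = file, rest = grep arguments.
count_matches() {
    target=$1
    shift
    grep -c "$@" "$target"
}

# Print label, match count, and elapsed wall-clock time in whole seconds.
bench() {
    label=$1
    shift
    start=$(date +%s)
    hits=$(count_matches "$@")
    end=$(date +%s)
    echo "$label: $hits matching lines in $((end - start))s"
}

corpus="${TMPDIR:-/tmp}/grep-bench.$$"
make_corpus "$corpus" 20000

bench "plain"            "$corpus" 'differentword'
bench "case-insensitive" "$corpus" -i 'differentword'
bench "extended"         "$corpus" -E '(wordorphrase|differentword)'

rm -f "$corpus"
```

Raising the line count passed to make_corpus brings the file closer to the size in the report. Running the same script on a 12.x and a 13.x host should show whether the three variants diverge there as well.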

Comment 1 Yuri Pankov freebsd_committer 2023-06-08 14:14:48 UTC

It's not a regression in the regex implementation per se, but rather a switch from the (very outdated) GNU grep, which used the bundled libgnuregex, to bsdgrep, which uses the in-base regex implementation.

See also bug 223553, bug 254763, bug 255525.
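
To check which grep a given host is running, the version banner can be classified. This is a sketch under assumptions: the function name grep_flavor is hypothetical, and the substrings "BSD grep" and "GNU grep" are what these implementations typically print for `--version`; a build that prints something else is reported as unknown.

```shell
#!/bin/sh
# Sketch only: grep_flavor is a hypothetical helper, and the banner
# substrings matched below are assumptions about typical --version output.

# Classify a version banner passed as $1.
grep_flavor() {
    case "$1" in
        *"BSD grep"*) echo bsd ;;
        *"GNU grep"*) echo gnu ;;
        *)            echo unknown ;;
    esac
}

# Take the first line of the banner; stay quiet if --version is unsupported.
banner=$(grep --version 2>/dev/null | head -n 1)
echo "grep flavor: $(grep_flavor "$banner") ($banner)"
```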

Comment 2 Yuri Pankov freebsd_committer 2023-06-08 14:18:10 UTC

*** Bug 271904 has been marked as a duplicate of this bug. ***

Comment 3 Yuri Pankov freebsd_committer 2023-06-08 14:18:29 UTC

*** Bug 271905 has been marked as a duplicate of this bug. ***

Comment 4 marius 2023-06-09 11:16:24 UTC

So the in-base regex implementation is badly broken.

Additional problems (presumably also with the regex library) are illustrated by
the following:

'þ' is the LATIN1 character THORN (0xfe)

    $ env LC_CTYPE=is_IS.ISO8859-1 grep 'þ'
    grep: trailing backslash (\)

    $ env LC_CTYPE=is_IS.ISO8859-1 sed 's/þ/th/'
    sed: 1: "s/þ/th/": RE error: trailing backslash (\)

    $ env LC_CTYPE=is_IS.ISO8859-1 expr "abcþdef" : '...þ...'
    expr: trailing backslash (\)

Are there any plans to fix this, or to revert the change in the in-base regex?
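
The grep case above can be reproduced without depending on how the terminal encodes 'þ' by generating the byte 0xfe (octal 376) with printf. This is a sketch under assumptions: the helper name match_thorn is hypothetical, and it sets LC_ALL rather than LC_CTYPE so that a pre-existing LC_ALL in the environment cannot override the locale under test.

```shell
#!/bin/sh
# Sketch only: match_thorn is a hypothetical helper. It uses LC_ALL
# instead of the LC_CTYPE shown in the report so the locale always applies.

# The single byte 0xFE (LATIN1 THORN), built independently of the terminal.
thorn=$(printf '\376')

# Count lines of a one-line input "abc<0xFE>def" that match the byte,
# with grep running under the locale named in $1.
match_thorn() {
    printf 'abc\376def\n' | env LC_ALL="$1" grep -c "$thorn"
}

# The locale from the report; it must be installed to exercise the same path.
match_thorn is_IS.ISO8859-1 || echo "grep did not report a match (see any error above)"
```

A count of 1 means the byte was matched. On the affected system described in this comment, grep would be expected to print the "trailing backslash (\)" error instead.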

Comment 5 Daniel Tameling 2023-06-09 11:55:12 UTC

The trailing backslash error for ISO locales has been reported at bug #264275.

The performance issue comes up periodically. For example, it is discussed at length in bug #254763.

Comment 6 Yuri Pankov freebsd_committer 2023-06-09 12:48:54 UTC

(In reply to marius from comment #4)
I think you misunderstood my reply.  The libc (that is, in-base) regex implementation was always that way (and has actually received quite a few fixes in recent years), so there is nothing to revert; whatever issues exist need to be fixed.

Comment 7 Graham Perrin freebsd_committer 2023-06-10 05:31:19 UTC

Is deduplication appropriate for any of the four non-tracking bug reports?