281710 – RegEXP bug in bracket expression [^...] - sed(1), grep(1), re_format(7)

Bug 281710 - RegEXP bug in bracket expression [^...] - sed(1), grep(1), re_format(7)

Summary: RegEXP bug in bracket expression [^...] - sed(1), grep(1), re_format(7)

Status:	New

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	standards (show other bugs)
Version:	14.1-RELEASE
Hardware:	amd64 Any

Importance:	--- Affects Some People
Assignee:	Kyle Evans

URL:
Keywords:

Depends on:
Blocks:

Reported:	2024-09-25 13:30 UTC by Eric
Modified:	2024-09-27 09:26 UTC (History)
CC List:	3 users (show)

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Eric 2024-09-25 13:30:34 UTC

It looks like there's a bug in FreeBSD's sed(1), grep(1), re_format(7), regarding accented characters and their use in a bracket expression [^...] in regular expressions (modern REs as well as basic REs).


-- Short examples
Command lines 202, 203 and 207 show unexpected bahaviour.
[200] # echo '9a' | /usr/bin/sed        -En 's/([^a])(a)/-\1-\2-/p'
-9-a-
[201] # echo '9a' | /usr/bin/sed        -n 's/\([^a]\)\(a\)/-\1-\2-/p'
-9-a-
[202] # echo '9â' | /usr/bin/sed        -n 's/\([^â]\)\(â\)/-\1-\2-/p' # <--
[203] # echo '9â' | /usr/bin/sed        -En 's/([^â])(â)/-\1-\2-/p'    # <--
[204] # echo '9â' | /usr/local/bin/gsed -En 's/([^â])(â)/-\1-\2-/p'
-9-â-
[205] # echo 'ââ' | /usr/bin/sed        -En 's/([â])(â)/-\1-\2-/p'
-â-â-
[206] # echo 'ââ' | /usr/local/bin/gsed -En 's/([â])(â)/-\1-\2-/p'
-â-â-
[207] # echo '9â' | /usr/bin/grep       -E '[^â]â'                      # <--
[208] #

Same results with characters like 'ç' and 'é'. 
Reported in forum thread (see link below) Unicode characters.


-- Reference
FreeBSD forum link:
https://forums.freebsd.org/threads/bug-in-regexp-sed-1-grep-1-and-re_format-7.95088/

re_format(7):
"
DESCRIPTION
   [...]
       A bracket expression is a list of characters enclosed in	`[]'.  It nor-
       mally  matches  any single character from the list (but see below).  If
       the list	begins with `^', it matches any	single character (but see  be-
       low)  not from the rest of the list.
"
As FreeBSD intends/tries to conform to POSIX, likewise :
https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html#tag_09_03_05
"
3. A non-matching list expression begins with a <circumflex> ('^'), and the matching behavior shall be the logical inverse of the corresponding matching list expression (the same bracket expression but without the leading <circumflex>). For example, since the RE "[abc]" only matches 'a', 'b', or 'c', it follows that "[^abc]" is an RE that matches any character except 'a', 'b', or 'c'. It is unspecified whether a non-matching list expression matches a multi-character collating element that is not matched by any of the expressions. The <circumflex> shall have this special meaning only when it occurs first in the list, immediately following the <left-square-bracket>.
"


-- Context of my OS and programs:
[100] # uname -a
FreeBSD q210 14.1-RELEASE-p5 FreeBSD 14.1-RELEASE-p5 GENERIC amd64
[101] # pkg which /usr/local/bin/ggrep
/usr/local/bin/ggrep was installed by package gnugrep-3.11
[102] # pkg which /usr/local/bin/gsed
/usr/local/bin/gsed was installed by package gsed-4.9
[103] # locale
LANG=C.UTF-8
LC_CTYPE="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_TIME="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_ALL=

Comment 1 Kyle Evans freebsd_committer

2024-09-25 15:01:57 UTC

On -CURRENT:

kevans@ifrit:~$ echo '9â' | /usr/bin/sed        -n 's/\([^â]\)\(â\)/-\1-\2-/p'
-9-â-
kevans@ifrit:~$ echo '9â' | /usr/bin/sed        -En 's/([^â])(â)/-\1-\2-/p'   
-9-â-
kevans@ifrit:~$ echo '9â' | /usr/bin/grep       -E '[^â]â'
9â
kevans@ifrit:~$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

(C.UTF-8 is also fine, as you would expect)

Comment 2 Kyle Evans freebsd_committer

2024-09-25 15:04:12 UTC

https://cgit.freebsd.org/src/commit/?id=8f7ed58a15556bf567ff876e1999e4fe4d684e1d maybe?  That looks like the only difference to my eyeball

Comment 3 Olivier Certner freebsd_committer

2024-09-25 19:06:45 UTC

Tested these:
echo 'a' | grep '[abà]'
echo '9â' | grep -E '[^â]â'
echo 'â' | grep -E '[^â]â'

Same discrepancies between -CURRENT (as of mid-August) and 14-STABLE (as of mid-May).

If I apply the above patch to that 14-STABLE, I get correct behavior (and consistent with -CURRENT) for these 3 tests and all Eric's problematic tests of comment 1.

(In reply to Kyle Evans from comment #2)

So it looks like just MFCing this commit (git just does it automatically) will solve this issue.  It should have been so in the first place.  Could you please do it both to stable/13 and stable/14?  Else I can take care of that if you prefer.

Thanks and regards.

Comment 4 Kyle Evans freebsd_committer

2024-09-25 19:47:21 UTC

(In reply to Olivier Certner from comment #3)

Sure thing, thanks for confirming that.  Will hit these up tonight-ish.

Comment 5 Eric 2024-09-25 19:56:24 UTC

In reference to comment #1 & comment #2

If the referenced commit indeed solves the issue*--
I'm not experienced to judge, as it seems by various commit logs of regcomp.c UTF-8/internal C (type) representation/handling are in play--
then, IMHO, from a committers POV, it may be considered whether this can/is to be backported to stable/14 (or even stable/13) or have it wait until stable/15. 

I'm unfamiliar with the FreeBSD testing framework, but perhaps it might be useful to test against each other (singleton versus non-singleton) something like:
[1] # echo '9â' | /usr/bin/sed      -En 's/([^â])(â)/-\1-\2-/p'
[2] # echo '9â' | /usr/bin/sed      -En 's/([^ââ])(â)/-\1-\2-/p'
-9-â-

The format of [2] could be used as a temporary workaround; HT to covacat:
https://forums.freebsd.org/threads/bug-in-regexp-sed-1-grep-1-and-re_format-7.95088/#post-673304

__
* I think this can be traced back to the intro of the singleton function commit (stable/5 (~FreeBSD 5.3); commit Jul 12, 2004):
https://github.com/freebsd/freebsd-src/commit/e5996857ad1f30f74f848a2c464c75a7ae28e59a#diff-3b4acfb4853c13cf5b563a3b6813988d5c10b057520e0c35384e4dfbc90e5793

where (apart from a minor parameter change) the function hasn't changed since.

Comment 6 Eric 2024-09-25 19:57:40 UTC

I missed the additional comments ...

Comment 7 Eric 2024-09-25 20:04:38 UTC

In reference to comment #4

Thank you for the prompt uptake and planned merge!

Comment 8 commit-hook freebsd_committer

2024-09-25 20:44:11 UTC

A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4f4860c9b07cc10cb6acbe6fbd71db45e344d2e6

commit 4f4860c9b07cc10cb6acbe6fbd71db45e344d2e6
Author:     Bill Sommerfeld <sommerfeld@hamachi.org>
AuthorDate: 2023-12-21 03:46:14 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2024-09-25 20:42:25 +0000

    regex: mixed sets are misidentified as singletons

    Fix "singleton" function used by regcomp() to turn character set matches
    into exact character matches if a character set has exactly one
    element.

    The underlying cset representation is complex; most critically it
    records"small" characters (codepoint less than either 128
    or 256 depending on locale) in a bit vector, and "wide" characters in
    a secondary array.

    Unfortunately the "singleton" function uses to identify singleton sets
    treated a cset as a singleton if either the "small" or the "wide" sets
    had exactly one element (it would then ignore the other set).

    The easiest way to demonstrate this bug:

            $ export LANG=C.UTF-8
            $ echo 'a' | grep '[abà]'

    It should match (and print "a") but instead it doesn't match because the
    single accented character in the set is misinterpreted as a singleton.

    PR:             281710
    Reviewed by:    kevans, yuripv
    Obtained from:  illumos

    (cherry picked from commit 8f7ed58a15556bf567ff876e1999e4fe4d684e1d)

 lib/libc/regex/regcomp.c          | 25 ++++++++++++++++++-----
 lib/libc/tests/regex/multibyte.sh | 43 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 62 insertions(+), 6 deletions(-)

Comment 9 commit-hook freebsd_committer

2024-09-25 20:44:12 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d96ce6d000703f3f57d9214b741e16cc7741d77e

commit d96ce6d000703f3f57d9214b741e16cc7741d77e
Author:     Bill Sommerfeld <sommerfeld@hamachi.org>
AuthorDate: 2023-12-21 03:46:14 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2024-09-25 20:42:28 +0000

    regex: mixed sets are misidentified as singletons

    Fix "singleton" function used by regcomp() to turn character set matches
    into exact character matches if a character set has exactly one
    element.

    The underlying cset representation is complex; most critically it
    records"small" characters (codepoint less than either 128
    or 256 depending on locale) in a bit vector, and "wide" characters in
    a secondary array.

    Unfortunately the "singleton" function uses to identify singleton sets
    treated a cset as a singleton if either the "small" or the "wide" sets
    had exactly one element (it would then ignore the other set).

    The easiest way to demonstrate this bug:

            $ export LANG=C.UTF-8
            $ echo 'a' | grep '[abà]'

    It should match (and print "a") but instead it doesn't match because the
    single accented character in the set is misinterpreted as a singleton.

    PR:             281710
    Reviewed by:    kevans, yuripv
    Obtained from:  illumos

    (cherry picked from commit 8f7ed58a15556bf567ff876e1999e4fe4d684e1d)

 lib/libc/regex/regcomp.c          | 25 ++++++++++++++++++-----
 lib/libc/tests/regex/multibyte.sh | 43 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 62 insertions(+), 6 deletions(-)

Comment 10 Kyle Evans freebsd_committer

2024-09-25 20:47:01 UTC

(In reply to Eric from comment #0)

Are you seeing real-world fallout from this, or was this more along the lines of testing behavior on different platforms?

Comment 11 Eric 2024-09-26 18:49:53 UTC

(in reply to Kyle Evans comment #10)

Nothing that expansive either way; just one very old laptop & one forum message.

I narrowed the problem down to:
--works NOT; there seems to be something wrong with [^ç]
sed -Ee 's/([^ç])ç/\1 /g;s/ç/\n/g'
sed -Ee 's/([^Z])Z/\1 /g;s/Z/\n/g'
couldn't explain the difference, then tried gsed; all that resulted in:
https://forums.freebsd.org/threads/bug-in-regexp-sed-1-grep-1-and-re_format-7.95088/

Comment 12 Olivier Certner freebsd_committer

2024-09-26 19:02:53 UTC

Thanks! I'm pretty sure I stumbled on this problem months ago, but didn't investigate then. Glad to see that fixed.

Comment 13 Eric 2024-09-27 09:17:27 UTC

(in reply to Kyle Evans comment #10)
(in reply to  Olivier Certner comment #12)

Based on the commit comments 
https://cgit.freebsd.org/src/commit/?id=8f7ed58a15556bf567ff876e1999e4fe4d684e1d
however, I see that I may have underestimated the possible veracious impact on string processing in a pervasive UTF-8 world.
 
I haven't a test setup available at the moment to test the examples below on -CURRENT or -STABLE-13 or 14

-- Examples
[1] # cat names
cedric
étienne
égards
françois
[2] # cat names | grep '[é]'
étienne
égards
[3] # cat names | grep '[éç]'
étienne
égards
françois
[4] # cat names | grep '[éi]'      # <-- error
cedric
étienne
françois
[5] # cat names | grep -i '[éi]'   # <-- case-insensitive "avoids" singleton 
cedric
étienne
égards
françois
[6] # cat names | grep -E '[é]|[i]' # <-- splitting in two bracket expressions avoids errroneous code
cedric
étienne
égards
françois
[7] #

I think such cases likely will have been overlooked, misjudged as correctly processed or not investigated further.

Fast & correct (UTF-8) string processing is difficult and this made me have another look at singleton's char processing. 
Viewing from a distance (and assuming one test operation (the first only) in the string of "shortcut" ||-operands), the distance to the prize (i.e. line 1626) in
https://github.com/freebsd/freebsd-src/blob/main/lib/libc/regex/regcomp.c#L1626 
as compared to
https://github.com/freebsd/freebsd-src/blob/releng/14.1/lib/libc/regex/regcomp.c#L1600
has gone up considerably:
singleton-error:      2 tests
singleton-modified:   6 tests

Are the added complexity and extra processing steps of an added singleton function for a bracket expression still justified?
Case-insensitive bracket expressions don't profit, as can be painfully observed in the examples above; they just add a certain small amount of additional time.
I wonder if comparitive testing with singleton processing versus without it yields justifiable gains—yes, that is a subjective adjective.

Comment 14 Eric 2024-09-27 09:26:13 UTC

(in reply to Kyle Evans comment #10)

Hi Kyle, I have a question and related request as it seems that NetBSD has pulled in (Feb 23, 2021) FreeBSD's code of
lib/libc/regex/regcomp.c
wrt the error in function singleton():
https://github.com/NetBSD/src/commit/1ee269c3a208a14da224b6e9917e2e9798961fff

You, or you may know who else within the team of FreeBSD core/developers, likely know how to best liase with NetBSD and inform them of the problem. I don't have anything setup OS wise or account wise regarding NetBSD, so I haven't done any actual testing.


=====
-- NetBSD code situation view
For NetBSD's main "trunk", these parts seem relevant:
https://github.com/NetBSD/src/blob/trunk/lib/libc/regex/regcomp.c#L1139-L1143
https://github.com/NetBSD/src/blob/trunk/lib/libc/regex/regcomp.c#L1733-L1754

and the commit "sync with FreeBSD" as of Feb 23, 2021:
https://github.com/NetBSD/src/commit/1ee269c3a208a14da224b6e9917e2e9798961fff

For NetBSD 10.0, released as of NetBSD 10.0 (Mar 28, 2024), these parts seem relevant:
https://github.com/NetBSD/src/blob/netbsd-10/lib/libc/regex/regcomp.c#L1141-L1145
https://github.com/NetBSD/src/blob/netbsd-10/lib/libc/regex/regcomp.c#L1735-L1756
and seems to contain the buggy singleton function.

By the looks of it seems that the singleton function issue isn't in NetBSD 9 code:
https://github.com/NetBSD/src/blob/netbsd-9/lib/libc/regex/regcomp.c#L831-L835

However, I'm unfamiliar with their github branch structures of their minor releases as the latest release of  NetBSD 9 seems to be 9.4 (April 20, 2024):
https://www.netbsd.org/releases/formal-9/NetBSD-9.4.html


-- OpenBSD code situation view
For OpenBSD, based on
https://github.com/openbsd/src/blob/master/lib/libc/regex/regcomp.c#L674-L678
it looks like they haven't implemented any optimization like the singleton() function.