It looks like there's a bug in FreeBSD's sed(1), grep(1), re_format(7), regarding accented characters and their use in a bracket expression [^...] in regular expressions (modern REs as well as basic REs). -- Short examples Command lines 202, 203 and 207 show unexpected bahaviour. [200] # echo '9a' | /usr/bin/sed -En 's/([^a])(a)/-\1-\2-/p' -9-a- [201] # echo '9a' | /usr/bin/sed -n 's/\([^a]\)\(a\)/-\1-\2-/p' -9-a- [202] # echo '9â' | /usr/bin/sed -n 's/\([^â]\)\(â\)/-\1-\2-/p' # <-- [203] # echo '9â' | /usr/bin/sed -En 's/([^â])(â)/-\1-\2-/p' # <-- [204] # echo '9â' | /usr/local/bin/gsed -En 's/([^â])(â)/-\1-\2-/p' -9-â- [205] # echo 'ââ' | /usr/bin/sed -En 's/([â])(â)/-\1-\2-/p' -â-â- [206] # echo 'ââ' | /usr/local/bin/gsed -En 's/([â])(â)/-\1-\2-/p' -â-â- [207] # echo '9â' | /usr/bin/grep -E '[^â]â' # <-- [208] # Same results with characters like 'ç' and 'é'. Reported in forum thread (see link below) Unicode characters. -- Reference FreeBSD forum link: https://forums.freebsd.org/threads/bug-in-regexp-sed-1-grep-1-and-re_format-7.95088/ re_format(7): " DESCRIPTION [...] A bracket expression is a list of characters enclosed in `[]'. It nor- mally matches any single character from the list (but see below). If the list begins with `^', it matches any single character (but see be- low) not from the rest of the list. " As FreeBSD intends/tries to conform to POSIX, likewise : https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html#tag_09_03_05 " 3. A non-matching list expression begins with a <circumflex> ('^'), and the matching behavior shall be the logical inverse of the corresponding matching list expression (the same bracket expression but without the leading <circumflex>). For example, since the RE "[abc]" only matches 'a', 'b', or 'c', it follows that "[^abc]" is an RE that matches any character except 'a', 'b', or 'c'. It is unspecified whether a non-matching list expression matches a multi-character collating element that is not matched by any of the expressions. The <circumflex> shall have this special meaning only when it occurs first in the list, immediately following the <left-square-bracket>. " -- Context of my OS and programs: [100] # uname -a FreeBSD q210 14.1-RELEASE-p5 FreeBSD 14.1-RELEASE-p5 GENERIC amd64 [101] # pkg which /usr/local/bin/ggrep /usr/local/bin/ggrep was installed by package gnugrep-3.11 [102] # pkg which /usr/local/bin/gsed /usr/local/bin/gsed was installed by package gsed-4.9 [103] # locale LANG=C.UTF-8 LC_CTYPE="C.UTF-8" LC_COLLATE="C.UTF-8" LC_TIME="C.UTF-8" LC_NUMERIC="C.UTF-8" LC_MONETARY="C.UTF-8" LC_MESSAGES="C.UTF-8" LC_ALL=
On -CURRENT: kevans@ifrit:~$ echo '9â' | /usr/bin/sed -n 's/\([^â]\)\(â\)/-\1-\2-/p' -9-â- kevans@ifrit:~$ echo '9â' | /usr/bin/sed -En 's/([^â])(â)/-\1-\2-/p' -9-â- kevans@ifrit:~$ echo '9â' | /usr/bin/grep -E '[^â]â' 9â kevans@ifrit:~$ locale LANG=en_US.UTF-8 LC_CTYPE="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_ALL= (C.UTF-8 is also fine, as you would expect)
https://cgit.freebsd.org/src/commit/?id=8f7ed58a15556bf567ff876e1999e4fe4d684e1d maybe? That looks like the only difference to my eyeball
Tested these: echo 'a' | grep '[abà]' echo '9â' | grep -E '[^â]â' echo 'â' | grep -E '[^â]â' Same discrepancies between -CURRENT (as of mid-August) and 14-STABLE (as of mid-May). If I apply the above patch to that 14-STABLE, I get correct behavior (and consistent with -CURRENT) for these 3 tests and all Eric's problematic tests of comment 1. (In reply to Kyle Evans from comment #2) So it looks like just MFCing this commit (git just does it automatically) will solve this issue. It should have been so in the first place. Could you please do it both to stable/13 and stable/14? Else I can take care of that if you prefer. Thanks and regards.
(In reply to Olivier Certner from comment #3) Sure thing, thanks for confirming that. Will hit these up tonight-ish.
In reference to comment #1 & comment #2 If the referenced commit indeed solves the issue*-- I'm not experienced to judge, as it seems by various commit logs of regcomp.c UTF-8/internal C (type) representation/handling are in play-- then, IMHO, from a committers POV, it may be considered whether this can/is to be backported to stable/14 (or even stable/13) or have it wait until stable/15. I'm unfamiliar with the FreeBSD testing framework, but perhaps it might be useful to test against each other (singleton versus non-singleton) something like: [1] # echo '9â' | /usr/bin/sed -En 's/([^â])(â)/-\1-\2-/p' [2] # echo '9â' | /usr/bin/sed -En 's/([^ââ])(â)/-\1-\2-/p' -9-â- The format of [2] could be used as a temporary workaround; HT to covacat: https://forums.freebsd.org/threads/bug-in-regexp-sed-1-grep-1-and-re_format-7.95088/#post-673304 __ * I think this can be traced back to the intro of the singleton function commit (stable/5 (~FreeBSD 5.3); commit Jul 12, 2004): https://github.com/freebsd/freebsd-src/commit/e5996857ad1f30f74f848a2c464c75a7ae28e59a#diff-3b4acfb4853c13cf5b563a3b6813988d5c10b057520e0c35384e4dfbc90e5793 where (apart from a minor parameter change) the function hasn't changed since.
I missed the additional comments ...
In reference to comment #4 Thank you for the prompt uptake and planned merge!
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=4f4860c9b07cc10cb6acbe6fbd71db45e344d2e6 commit 4f4860c9b07cc10cb6acbe6fbd71db45e344d2e6 Author: Bill Sommerfeld <sommerfeld@hamachi.org> AuthorDate: 2023-12-21 03:46:14 +0000 Commit: Kyle Evans <kevans@FreeBSD.org> CommitDate: 2024-09-25 20:42:25 +0000 regex: mixed sets are misidentified as singletons Fix "singleton" function used by regcomp() to turn character set matches into exact character matches if a character set has exactly one element. The underlying cset representation is complex; most critically it records"small" characters (codepoint less than either 128 or 256 depending on locale) in a bit vector, and "wide" characters in a secondary array. Unfortunately the "singleton" function uses to identify singleton sets treated a cset as a singleton if either the "small" or the "wide" sets had exactly one element (it would then ignore the other set). The easiest way to demonstrate this bug: $ export LANG=C.UTF-8 $ echo 'a' | grep '[abà]' It should match (and print "a") but instead it doesn't match because the single accented character in the set is misinterpreted as a singleton. PR: 281710 Reviewed by: kevans, yuripv Obtained from: illumos (cherry picked from commit 8f7ed58a15556bf567ff876e1999e4fe4d684e1d) lib/libc/regex/regcomp.c | 25 ++++++++++++++++++----- lib/libc/tests/regex/multibyte.sh | 43 ++++++++++++++++++++++++++++++++++++++- 2 files changed, 62 insertions(+), 6 deletions(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=d96ce6d000703f3f57d9214b741e16cc7741d77e commit d96ce6d000703f3f57d9214b741e16cc7741d77e Author: Bill Sommerfeld <sommerfeld@hamachi.org> AuthorDate: 2023-12-21 03:46:14 +0000 Commit: Kyle Evans <kevans@FreeBSD.org> CommitDate: 2024-09-25 20:42:28 +0000 regex: mixed sets are misidentified as singletons Fix "singleton" function used by regcomp() to turn character set matches into exact character matches if a character set has exactly one element. The underlying cset representation is complex; most critically it records"small" characters (codepoint less than either 128 or 256 depending on locale) in a bit vector, and "wide" characters in a secondary array. Unfortunately the "singleton" function uses to identify singleton sets treated a cset as a singleton if either the "small" or the "wide" sets had exactly one element (it would then ignore the other set). The easiest way to demonstrate this bug: $ export LANG=C.UTF-8 $ echo 'a' | grep '[abà]' It should match (and print "a") but instead it doesn't match because the single accented character in the set is misinterpreted as a singleton. PR: 281710 Reviewed by: kevans, yuripv Obtained from: illumos (cherry picked from commit 8f7ed58a15556bf567ff876e1999e4fe4d684e1d) lib/libc/regex/regcomp.c | 25 ++++++++++++++++++----- lib/libc/tests/regex/multibyte.sh | 43 ++++++++++++++++++++++++++++++++++++++- 2 files changed, 62 insertions(+), 6 deletions(-)
(In reply to Eric from comment #0) Are you seeing real-world fallout from this, or was this more along the lines of testing behavior on different platforms?
(in reply to Kyle Evans comment #10) Nothing that expansive either way; just one very old laptop & one forum message. I narrowed the problem down to: --works NOT; there seems to be something wrong with [^ç] sed -Ee 's/([^ç])ç/\1 /g;s/ç/\n/g' sed -Ee 's/([^Z])Z/\1 /g;s/Z/\n/g' couldn't explain the difference, then tried gsed; all that resulted in: https://forums.freebsd.org/threads/bug-in-regexp-sed-1-grep-1-and-re_format-7.95088/
Thanks! I'm pretty sure I stumbled on this problem months ago, but didn't investigate then. Glad to see that fixed.
(in reply to Kyle Evans comment #10) (in reply to Olivier Certner comment #12) Based on the commit comments https://cgit.freebsd.org/src/commit/?id=8f7ed58a15556bf567ff876e1999e4fe4d684e1d however, I see that I may have underestimated the possible veracious impact on string processing in a pervasive UTF-8 world. I haven't a test setup available at the moment to test the examples below on -CURRENT or -STABLE-13 or 14 -- Examples [1] # cat names cedric étienne égards françois [2] # cat names | grep '[é]' étienne égards [3] # cat names | grep '[éç]' étienne égards françois [4] # cat names | grep '[éi]' # <-- error cedric étienne françois [5] # cat names | grep -i '[éi]' # <-- case-insensitive "avoids" singleton cedric étienne égards françois [6] # cat names | grep -E '[é]|[i]' # <-- splitting in two bracket expressions avoids errroneous code cedric étienne égards françois [7] # I think such cases likely will have been overlooked, misjudged as correctly processed or not investigated further. Fast & correct (UTF-8) string processing is difficult and this made me have another look at singleton's char processing. Viewing from a distance (and assuming one test operation (the first only) in the string of "shortcut" ||-operands), the distance to the prize (i.e. line 1626) in https://github.com/freebsd/freebsd-src/blob/main/lib/libc/regex/regcomp.c#L1626 as compared to https://github.com/freebsd/freebsd-src/blob/releng/14.1/lib/libc/regex/regcomp.c#L1600 has gone up considerably: singleton-error: 2 tests singleton-modified: 6 tests Are the added complexity and extra processing steps of an added singleton function for a bracket expression still justified? Case-insensitive bracket expressions don't profit, as can be painfully observed in the examples above; they just add a certain small amount of additional time. I wonder if comparitive testing with singleton processing versus without it yields justifiable gains—yes, that is a subjective adjective.
(in reply to Kyle Evans comment #10) Hi Kyle, I have a question and related request as it seems that NetBSD has pulled in (Feb 23, 2021) FreeBSD's code of lib/libc/regex/regcomp.c wrt the error in function singleton(): https://github.com/NetBSD/src/commit/1ee269c3a208a14da224b6e9917e2e9798961fff You, or you may know who else within the team of FreeBSD core/developers, likely know how to best liase with NetBSD and inform them of the problem. I don't have anything setup OS wise or account wise regarding NetBSD, so I haven't done any actual testing. ===== -- NetBSD code situation view For NetBSD's main "trunk", these parts seem relevant: https://github.com/NetBSD/src/blob/trunk/lib/libc/regex/regcomp.c#L1139-L1143 https://github.com/NetBSD/src/blob/trunk/lib/libc/regex/regcomp.c#L1733-L1754 and the commit "sync with FreeBSD" as of Feb 23, 2021: https://github.com/NetBSD/src/commit/1ee269c3a208a14da224b6e9917e2e9798961fff For NetBSD 10.0, released as of NetBSD 10.0 (Mar 28, 2024), these parts seem relevant: https://github.com/NetBSD/src/blob/netbsd-10/lib/libc/regex/regcomp.c#L1141-L1145 https://github.com/NetBSD/src/blob/netbsd-10/lib/libc/regex/regcomp.c#L1735-L1756 and seems to contain the buggy singleton function. By the looks of it seems that the singleton function issue isn't in NetBSD 9 code: https://github.com/NetBSD/src/blob/netbsd-9/lib/libc/regex/regcomp.c#L831-L835 However, I'm unfamiliar with their github branch structures of their minor releases as the latest release of NetBSD 9 seems to be 9.4 (April 20, 2024): https://www.netbsd.org/releases/formal-9/NetBSD-9.4.html -- OpenBSD code situation view For OpenBSD, based on https://github.com/openbsd/src/blob/master/lib/libc/regex/regcomp.c#L674-L678 it looks like they haven't implemented any optimization like the singleton() function.
As the referenced commit has landed in releng/14.2 and stable/13, I'm closing this as fixed.