Bug 264275 - sed complaining about trailing backslash when using Umlauts
Summary: sed complaining about trailing backslash when using Umlauts
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 13.1-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Kyle Evans
URL:
Keywords:
Duplicates: 274032
Depends on:
Blocks:
 
Reported: 2022-05-27 06:29 UTC by g.s
Modified: 2023-11-08 01:04 UTC

See Also:
kevans: mfc-stable14+
kevans: mfc-stable13+
kevans: mfc-stable12+


Attachments
small reproducer that calls regcomp with an umlaut. (382 bytes, text/plain)
2022-10-27 13:45 UTC, Daniel Tameling

Description g.s 2022-05-27 06:29:55 UTC
> locale
LANG=C
LC_CTYPE=de_DE.ISO8859-1
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

> sed s/ä/ae/
sed: 1: "s/ä/ae/": RE error: trailing backslash (\)

> echo -n ä | od -x
0000000      00e4                                                        
0000001

Also affects characters like ßçáàîÆ

Does not happen with LC_CTYPE=de_DE.UTF-8
Comment 1 Daniel Tameling 2022-10-27 13:43:24 UTC
The error comes from trying to compile the umlaut as a regex. I managed to create a small reproducer that just calls regcomp.

The error seems to come from this snippet in the p_simp_re function in lib/libc/regex/regcomp.c:

  if ((c & BACKSL) == 0 || may_escape(p, wc))
       ordinary(p, wc);
  else
       SETERROR(REG_EESCAPE);

Both checks in the if statement are false and thus we end up with the trailing backslash error. In may_escape this is the return statement that gets taken:

  if (isalpha(ch) || ch == '\'' || ch == '`')
      return (false);

ch is the wint_t representation of the umlaut, which is 0xe4. In de_DE.ISO8859-1, the isalpha call returns true. (If I do it with a UTF-8 ä in a UTF-8 locale, ch also becomes 0xe4, but the isalpha call returns false, so this doesn't trigger the trailing backslash error.)
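
For readers without access to the attachment, the following is a minimal sketch of the kind of check described above, not the attached reproducer itself. The helper name regcomp_in_locale is hypothetical; it only uses the standard setlocale, regcomp, and regfree calls, and compiles the pattern as a basic regular expression (cflags 0), which is what sed does by default.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: switch LC_CTYPE to `locale`, then try to compile
 * `pattern` as a basic regular expression.
 * Returns the regcomp() error code (0 on success), or -1 if the locale
 * is not available on this system.
 */
int
regcomp_in_locale(const char *locale, const char *pattern)
{
	regex_t re;
	int error;

	if (setlocale(LC_CTYPE, locale) == NULL)
		return (-1);
	error = regcomp(&re, pattern, 0);
	if (error == 0)
		regfree(&re);	/* only a successfully compiled regex needs freeing */
	return (error);
}
```

Going by the report, calling regcomp_in_locale("de_DE.ISO8859-1", "\xe4") on an affected system would be expected to return REG_EESCAPE instead of 0, while a pattern such as "a" compiles fine.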
Comment 2 Daniel Tameling 2022-10-27 13:45:09 UTC
Created attachment 237678
small reproducer that calls regcomp with an umlaut.
Comment 3 Eugene Grosbein freebsd_committer freebsd_triage 2023-09-23 12:01:05 UTC
Try the patch https://bugs.freebsd.org/bugzilla/attachment.cgi?id=245157 from PR 274032
Comment 4 commit-hook freebsd_committer freebsd_triage 2023-09-25 22:53:16 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=3fb80f1476c7776f04ba7ef6d08397cef6abcfb0

commit 3fb80f1476c7776f04ba7ef6d08397cef6abcfb0
Author:     Christos Zoulas <christos@NetBSD.org>
AuthorDate: 2023-08-30 20:37:24 +0000
Commit:     Yuri Pankov <yuripv@FreeBSD.org>
CommitDate: 2023-09-25 22:49:14 +0000

    regcomp: use unsigned char when testing for escapes

    - cast GETNEXT to unsigned where it is being promoted to int to prevent
      sign-extension (really it would have been better for PEEK*() and
      GETNEXT() to return unsigned char; this would have removed a ton of
      (uch) casts, but it is too intrusive for now).
    - fix an isalpha that should have been iswalpha

    PR:             264275, 274032
    Reviewed by:    kevans, eugen (previous version)
    Obtained from:  NetBSD
    Differential Revision:  https://reviews.freebsd.org/D41947

 lib/libc/regex/regcomp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
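
The sign-extension problem named in the commit message can be illustrated with a small sketch. This is not the actual patch. ESCAPED_FLAG is a hypothetical stand-in for a flag bit such as BACKSL, assumed here to sit just above the character bits (1 << CHAR_BIT); the bug does not quote the real definition. The helper names are likewise made up for illustration.

```c
#include <limits.h>

/* Hypothetical flag bit just above the character bits (an assumption,
 * not quoted from regcomp.c). */
#define ESCAPED_FLAG	(1 << CHAR_BIT)

/* Promotion without a cast: a byte with the high bit set becomes a
 * negative int, so all upper bits are set. */
int
promote_signed(signed char c)
{
	return (c);
}

/* Promotion through unsigned char: the value stays in 0..UCHAR_MAX. */
int
promote_unsigned(signed char c)
{
	return ((unsigned char)c);
}

/* Nonzero if the promoted value appears to carry the flag bit. */
int
looks_escaped(int c)
{
	return ((c & ESCAPED_FLAG) != 0);
}
```

With the byte 0xe4 stored in a signed char (value -28), promote_signed yields -28, which has the flag bit set even though no backslash was involved, while promote_unsigned yields 228 with the flag bit clear. Separately, the C standard only defines isalpha for arguments representable as unsigned char or equal to EOF, which is one reason a wide-character value calls for iswalpha instead.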
Comment 5 Kyle Evans freebsd_committer freebsd_triage 2023-09-26 01:01:24 UTC
*** Bug 274032 has been marked as a duplicate of this bug. ***
Comment 6 Kyle Evans freebsd_committer freebsd_triage 2023-09-26 01:03:58 UTC
Taking this as I've agreed to handle the MFC and whatnot -- I tentatively plan to take this one all the way back to 12 and, given that it's reasonably severe for non-C locales, EN it to 13.2 and 12.4. MFC will likely be in ~3-5 days.
Comment 7 commit-hook freebsd_committer freebsd_triage 2023-09-30 02:53:14 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=25307d6c927934dd44628e06cbc7047415fb6931

commit 25307d6c927934dd44628e06cbc7047415fb6931
Author:     Christos Zoulas <christos@NetBSD.org>
AuthorDate: 2023-08-30 20:37:24 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2023-09-30 01:41:57 +0000

    regcomp: use unsigned char when testing for escapes

    - cast GETNEXT to unsigned where it is being promoted to int to prevent
      sign-extension (really it would have been better for PEEK*() and
      GETNEXT() to return unsigned char; this would have removed a ton of
      (uch) casts, but it is too intrusive for now).
    - fix an isalpha that should have been iswalpha

    PR:             264275, 274032
    Reviewed by:    kevans, eugen (previous version)
    Obtained from:  NetBSD

    (cherry picked from commit 3fb80f1476c7776f04ba7ef6d08397cef6abcfb0)

 lib/libc/regex/regcomp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
Comment 8 commit-hook freebsd_committer freebsd_triage 2023-09-30 02:53:16 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=ac695744e2cfb461a64018276fb94999fb0cad9c

commit ac695744e2cfb461a64018276fb94999fb0cad9c
Author:     Christos Zoulas <christos@NetBSD.org>
AuthorDate: 2023-08-30 20:37:24 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2023-09-30 01:41:23 +0000

    regcomp: use unsigned char when testing for escapes

    - cast GETNEXT to unsigned where it is being promoted to int to prevent
      sign-extension (really it would have been better for PEEK*() and
      GETNEXT() to return unsigned char; this would have removed a ton of
      (uch) casts, but it is too intrusive for now).
    - fix an isalpha that should have been iswalpha

    PR:             264275, 274032
    Reviewed by:    kevans, eugen (previous version)
    Obtained from:  NetBSD

    (cherry picked from commit 3fb80f1476c7776f04ba7ef6d08397cef6abcfb0)

 lib/libc/regex/regcomp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
Comment 9 commit-hook freebsd_committer freebsd_triage 2023-09-30 02:53:19 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=56b09feb23d98fcd0c4aed8d4f907a5a2f6b5ea9

commit 56b09feb23d98fcd0c4aed8d4f907a5a2f6b5ea9
Author:     Christos Zoulas <christos@NetBSD.org>
AuthorDate: 2023-08-30 20:37:24 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2023-09-30 01:40:59 +0000

    regcomp: use unsigned char when testing for escapes

    - cast GETNEXT to unsigned where it is being promoted to int to prevent
      sign-extension (really it would have been better for PEEK*() and
      GETNEXT() to return unsigned char; this would have removed a ton of
      (uch) casts, but it is too intrusive for now).
    - fix an isalpha that should have been iswalpha

    PR:             264275, 274032
    Reviewed by:    kevans, eugen (previous version)
    Obtained from:  NetBSD

    (cherry picked from commit 3fb80f1476c7776f04ba7ef6d08397cef6abcfb0)

 lib/libc/regex/regcomp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
Comment 10 Kyle Evans freebsd_committer freebsd_triage 2023-09-30 05:42:53 UTC
I've submitted this for EN consideration.
Comment 11 commit-hook freebsd_committer freebsd_triage 2023-10-01 04:46:22 UTC
A commit in branch releng/14.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=408daf2caa9273b1cbdc8223a3da6e179e922fc2

commit 408daf2caa9273b1cbdc8223a3da6e179e922fc2
Author:     Christos Zoulas <christos@NetBSD.org>
AuthorDate: 2023-08-30 20:37:24 +0000
Commit:     Kyle Evans <kevans@FreeBSD.org>
CommitDate: 2023-10-01 04:46:02 +0000

    regcomp: use unsigned char when testing for escapes

    - cast GETNEXT to unsigned where it is being promoted to int to prevent
      sign-extension (really it would have been better for PEEK*() and
      GETNEXT() to return unsigned char; this would have removed a ton of
      (uch) casts, but it is too intrusive for now).
    - fix an isalpha that should have been iswalpha

    PR:             264275, 274032
    Reviewed by:    kevans, eugen (previous version)
    Approved by:    re (gjb)
    Obtained from:  NetBSD

    (cherry picked from commit 3fb80f1476c7776f04ba7ef6d08397cef6abcfb0)
    (cherry picked from commit 56b09feb23d98fcd0c4aed8d4f907a5a2f6b5ea9)

 lib/libc/regex/regcomp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
Comment 12 Kyle Evans freebsd_committer freebsd_triage 2023-10-01 04:53:13 UTC
Going to go ahead and close this; this has been MFC'd to all supported branches and will appear in the next 14.0 beta.
Comment 13 commit-hook freebsd_committer freebsd_triage 2023-11-08 01:01:10 UTC
A commit in branch releng/13.2 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=67264bfe499223cd9864b53975462e3eb57cde2c

commit 67264bfe499223cd9864b53975462e3eb57cde2c
Author:     Christos Zoulas <christos@NetBSD.org>
AuthorDate: 2023-08-30 20:37:24 +0000
Commit:     Ed Maste <emaste@FreeBSD.org>
CommitDate: 2023-11-08 00:59:51 +0000

    regcomp: use unsigned char when testing for escapes

    - cast GETNEXT to unsigned where it is being promoted to int to prevent
      sign-extension (really it would have been better for PEEK*() and
      GETNEXT() to return unsigned char; this would have removed a ton of
      (uch) casts, but it is too intrusive for now).
    - fix an isalpha that should have been iswalpha

    PR:             264275, 274032
    Reviewed by:    kevans, eugen (previous version)
    Obtained from:  NetBSD

    (cherry picked from commit 3fb80f1476c7776f04ba7ef6d08397cef6abcfb0)
    (cherry picked from commit ac695744e2cfb461a64018276fb94999fb0cad9c)

    Approved by:    so
    Security:       FreeBSD-EN-23:14

 lib/libc/regex/regcomp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
Comment 14 commit-hook freebsd_committer freebsd_triage 2023-11-08 01:04:20 UTC
A commit in branch releng/12.4 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=5e0387e2ec6ed9b14b3c6088c19079db15a52eae

commit 5e0387e2ec6ed9b14b3c6088c19079db15a52eae
Author:     Christos Zoulas <christos@NetBSD.org>
AuthorDate: 2023-08-30 20:37:24 +0000
Commit:     Ed Maste <emaste@FreeBSD.org>
CommitDate: 2023-11-08 01:02:08 +0000

    regcomp: use unsigned char when testing for escapes

    - cast GETNEXT to unsigned where it is being promoted to int to prevent
      sign-extension (really it would have been better for PEEK*() and
      GETNEXT() to return unsigned char; this would have removed a ton of
      (uch) casts, but it is too intrusive for now).
    - fix an isalpha that should have been iswalpha

    PR:             264275, 274032
    Reviewed by:    kevans, eugen (previous version)
    Obtained from:  NetBSD

    (cherry picked from commit 3fb80f1476c7776f04ba7ef6d08397cef6abcfb0)
    (cherry picked from commit 56b09feb23d98fcd0c4aed8d4f907a5a2f6b5ea9)

    Approved by:    so
    Security:       FreeBSD-EN-23:14

 lib/libc/regex/regcomp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)