Bug 153502 - [libc] regex(3) bug with UTF-8 locale
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: CURRENT
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: Yuri Pankov
Reported: 2010-12-28 18:00 UTC by sigsys
Modified: 2020-11-09 04:12 UTC

Description sigsys 2010-12-28 18:00:32 UTC
I'm seeing odd behavior from programs using regex(3) like less(1), vi(1) and sed(1) when using LANG=en_US.UTF-8 and UTF-8 inputs.

Sometimes it seems to work right:

$ echo 'é' | sed -ne '/^.$/p'
é
$ echo 'éé' | sed -ne '/^..$/p'
éé
$ echo 'aéa' | sed -ne '/a.a/p'
aéa
$ echo 'aéa' | sed -ne '/a.*a/p'
aéa
$ echo 'aaéaa' | sed -ne '/aa.aa/p'
aaéaa
$ echo 'aéaéa' | sed -ne '/a.a.a/p'
aéaéa

But not always:

$ echo 'éa' | sed -ne '/.a/p'
$ echo 'aéaa' | sed -ne '/a.aa/p'
$ echo 'éaé' | sed -ne '/.a./p'


Patterns using ".*", ".+", ".{0,}" or ".{1,}" seem to work correctly, but ".{0,1}", ".{1,1}" or a lone "." do not always match.
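
The failing cases above can be re-run in one go with a small shell loop. This is only a sketch: the `check` helper is a hypothetical name, not something from the report, and it assumes that the en_US.UTF-8 locale is installed and that sed(1) is built on the libc regex(3) code.

```shell
#!/bin/sh
# Sketch only. "check" is a hypothetical helper, not part of the report.
# Assumes the en_US.UTF-8 locale is installed; if it is not, sed may
# silently fall back to the C locale and the result says nothing about
# the bug.

check() {
    # $1 = basic regular expression, $2 = one line of input
    out=$(printf '%s\n' "$2" | LC_ALL=en_US.UTF-8 sed -n "/$1/p")
    if [ -n "$out" ]; then
        echo "match: /$1/ on '$2'"
    else
        echo "MISS:  /$1/ on '$2'"
    fi
}

# The three cases the report lists as failing, plus an ASCII control.
check '.a'   'éa'
check 'a.aa' 'aéaa'
check '.a.'  'éaé'
check '.a'   'xa'
```

On an affected libc the first three cases would be expected to print MISS, in line with the empty output shown above. The ASCII control should match either way.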
Comment 1 Eitan Adler freebsd_committer freebsd_triage 2017-12-31 08:01:32 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped
Comment 2 commit-hook freebsd_committer freebsd_triage 2018-11-23 15:49:47 UTC
A commit references this bug:

Author: yuripv
Date: Fri Nov 23 15:49:19 UTC 2018
New revision: 340835
URL: https://svnweb.freebsd.org/changeset/base/340835

Log:
  regexec: fix processing multibyte strings.

  Matcher function incorrectly assumed that moffset that we get from
  findmust is in bytes. Fix this by introducing a stepback function,
  taking short path if MB_CUR_MAX is 1, and going back byte-by-byte,
  checking if we have a legal character sequence otherwise.

  PR:		153502
  Reviewed by:	pfg, kevans
  Approved by:	kib (mentor, implicit)
  Differential revision:	https://reviews.freebsd.org/D18297

Changes:
  head/lib/libc/regex/engine.c
  head/lib/libc/tests/regex/Makefile
  head/lib/libc/tests/regex/multibyte.sh
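
The log turns on the difference between an offset counted in bytes and one that is not. As a rough illustration (this is not the code from r340835), the shell sketch below counts both bytes and characters for the "é" that "." has to consume in the 'éa' case from the description. The helper names `byte_len` and `char_len` are hypothetical, and the sketch assumes UTF-8 input.

```shell
#!/bin/sh
# Illustration only, not the code from the commit. byte_len and char_len
# are hypothetical helpers and assume their argument is valid UTF-8.

# Number of bytes in the string.
byte_len() { printf '%s' "$1" | wc -c | tr -d ' '; }

# Number of characters: UTF-8 continuation bytes are 0x80-0xBF
# (octal 200-277), so deleting them leaves one byte per character.
char_len() { printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '; }

# "é" spelled out as its UTF-8 bytes (C3 A9) so the script does not
# depend on the encoding of this file.
prefix=$(printf '\303\251')

echo "characters before the literal: $(char_len "$prefix")"
echo "bytes before the literal:      $(byte_len "$prefix")"
```

This reports 1 character but 2 bytes. Stepping back a single byte from the literal "a" would therefore land in the middle of "é", which is consistent with the log's description of going back byte-by-byte until a legal character sequence is found.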
Comment 3 commit-hook freebsd_committer freebsd_triage 2018-12-08 19:46:09 UTC
A commit references this bug:

Author: yuripv
Date: Sat Dec  8 19:45:06 UTC 2018
New revision: 341745
URL: https://svnweb.freebsd.org/changeset/base/341745

Log:
  MFC r340835:
  regexec: fix processing multibyte strings.

  Matcher function incorrectly assumed that moffset that we get from
  findmust is in bytes. Fix this by introducing a stepback function,
  taking short path if MB_CUR_MAX is 1, and going back byte-by-byte,
  checking if we have a legal character sequence otherwise.

  PR:		153502
  Reviewed by:	pfg, kevans
  Differential revision:	https://reviews.freebsd.org/D18297

Changes:
_U  stable/12/
  stable/12/lib/libc/regex/engine.c
  stable/12/lib/libc/tests/regex/Makefile
  stable/12/lib/libc/tests/regex/multibyte.sh
Comment 4 Kyle Evans freebsd_committer freebsd_triage 2020-11-09 03:19:35 UTC
This is "Fixed" now, yes? :-)
Comment 5 sigsys 2020-11-09 03:55:22 UTC
Oh yeah! I haven't noticed any more issues like that.  I redid some light testing just in case and less/sed seem to match everything perfectly (both with UTF-8 and Latin-1 locales).  Thanks to everyone who worked on this. Especially given how complicated the regex code looks...

Wow that was an old bug report.
Comment 6 Kyle Evans freebsd_committer freebsd_triage 2020-11-09 04:12:32 UTC
(In reply to sigsys from comment #5)

Excellent, thanks for confirming!