I'm seeing odd behavior from programs using regex(3), like less(1), vi(1) and sed(1), when using LANG=en_US.UTF-8 and UTF-8 inputs.

Sometimes it seems to work right:

$ echo 'é' | sed -ne '/^.$/p'
é
$ echo 'éé' | sed -ne '/^..$/p'
éé
$ echo 'aéa' | sed -ne '/a.a/p'
aéa
$ echo 'aéa' | sed -ne '/a.*a/p'
aéa
$ echo 'aaéaa' | sed -ne '/aa.aa/p'
aaéaa
$ echo 'aéaéa' | sed -ne '/a.a.a/p'
aéaéa

But not always:

$ echo 'éa' | sed -ne '/.a/p'
$ echo 'aéaa' | sed -ne '/a.aa/p'
$ echo 'éaé' | sed -ne '/.a./p'

Seems like using ".*", ".+", ".{0,}" or ".{1,}" works right, but ".{0,1}", ".{1,1}" or a lone "." doesn't always.
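The same cases can be checked against regex(3) directly, without going through sed(1). The following is only a minimal sketch: match_in_locale is a hypothetical helper name, it assumes the requested locale is installed, and it changes the process-wide locale as a side effect.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of libc): compile `pattern` as a basic
 * regular expression, as sed does by default, and run it against `text`
 * after switching to `locale`.
 * Returns 1 on match, 0 on no match, -1 if the locale or pattern is rejected.
 */
static int
match_in_locale(const char *locale, const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	/* Note: this changes the locale for the whole process. */
	if (setlocale(LC_ALL, locale) == NULL)
		return (-1);

	/* REG_NOSUB: only the match/no-match result is needed. */
	if (regcomp(&re, pattern, REG_NOSUB) != 0)
		return (-1);

	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return (rc == 0 ? 1 : 0);
}
```

For example, match_in_locale("en_US.UTF-8", ".a", "\xc3\xa9" "a") exercises the same case as the first failing sed command above (the string literal is split so that the trailing "a" is not read as part of the hex escape).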
For bugs matching the following criteria:

Status: In Progress
Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped
A commit references this bug:

Author: yuripv
Date: Fri Nov 23 15:49:19 UTC 2018
New revision: 340835
URL: https://svnweb.freebsd.org/changeset/base/340835

Log:
  regexec: fix processing multibyte strings.

  Matcher function incorrectly assumed that moffset that we get from
  findmust is in bytes. Fix this by introducing a stepback function,
  taking short path if MB_CUR_MAX is 1, and going back byte-by-byte,
  checking if we have a legal character sequence otherwise.

  PR: 153502
  Reviewed by: pfg, kevans
  Approved by: kib (mentor, implicit)
  Differential revision: https://reviews.freebsd.org/D18297

Changes:
  head/lib/libc/regex/engine.c
  head/lib/libc/tests/regex/Makefile
  head/lib/libc/tests/regex/multibyte.sh
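To illustrate the idea in the log message, here is a rough sketch of stepping back by characters rather than bytes. It is not the code committed to engine.c: the names stepback_utf8 and is_utf8_cont are hypothetical, the caller passes in the value of MB_CUR_MAX, and the multibyte path assumes UTF-8 and simply skips continuation bytes instead of validating each candidate sequence as the commit describes.

```c
#include <stddef.h>

/* Hypothetical helper: is this a UTF-8 continuation byte (10xxxxxx)? */
static int
is_utf8_cont(unsigned char c)
{
	return ((c & 0xC0) == 0x80);
}

/*
 * Step back `nchars` characters from `cur`, never moving before `start`.
 * Returns the new position, or NULL if the start of the buffer is reached
 * before `nchars` characters have been skipped.
 * Assumption: when mb_cur_max > 1 the buffer holds well-formed UTF-8.
 */
static const char *
stepback_utf8(const char *start, const char *cur, size_t nchars,
    size_t mb_cur_max)
{
	/* Short path: one byte per character, plain pointer arithmetic. */
	if (mb_cur_max == 1) {
		if ((size_t)(cur - start) < nchars)
			return (NULL);
		return (cur - nchars);
	}

	while (nchars > 0) {
		if (cur == start)
			return (NULL);
		/* Move back one byte, then past any continuation bytes. */
		cur--;
		while (cur > start && is_utf8_cont((unsigned char)*cur))
			cur--;
		nchars--;
	}
	return (cur);
}
```

With the 4-byte input "aéa", stepping back one character from the final "a" lands on the first byte of "é" (offset 1) in the multibyte case, whereas the single-byte short path lands on offset 2, in the middle of the character. That difference is the kind of byte-versus-character mismatch the log message describes.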
A commit references this bug:

Author: yuripv
Date: Sat Dec 8 19:45:06 UTC 2018
New revision: 341745
URL: https://svnweb.freebsd.org/changeset/base/341745

Log:
  MFC r340835:
  regexec: fix processing multibyte strings.

  Matcher function incorrectly assumed that moffset that we get from
  findmust is in bytes. Fix this by introducing a stepback function,
  taking short path if MB_CUR_MAX is 1, and going back byte-by-byte,
  checking if we have a legal character sequence otherwise.

  PR: 153502
  Reviewed by: pfg, kevans
  Differential revision: https://reviews.freebsd.org/D18297

Changes:
_U  stable/12/
  stable/12/lib/libc/regex/engine.c
  stable/12/lib/libc/tests/regex/Makefile
  stable/12/lib/libc/tests/regex/multibyte.sh
This is "Fixed" now, yes? :-)
Oh yeah! I haven't noticed any more issues like that. I redid some light testing just in case and less/sed seem to match everything perfectly (both with UTF-8 and Latin-1 locales). Thanks to everyone who worked on this. Especially given how complicated the regex code looks... Wow that was an old bug report.
(In reply to sigsys from comment #5) Excellent, thanks for confirming!