Summary: | [libc] regex(3) bug with UTF-8 locale | ||
---|---|---|---|
Product: | Base System | Reporter: | sigsys |
Component: | bin | Assignee: | Yuri Pankov <yuripv> |
Status: | Closed FIXED | ||
Severity: | Affects Only Me | CC: | kevans, yuripv |
Priority: | Normal | ||
Version: | CURRENT | ||
Hardware: | Any | ||
OS: | Any |
Description
sigsys
2010-12-28 18:00:32 UTC
For bugs matching the following criteria:

Status: In Progress
Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped

A commit references this bug:

Author: yuripv
Date: Fri Nov 23 15:49:19 UTC 2018
New revision: 340835
URL: https://svnweb.freebsd.org/changeset/base/340835

Log:
regexec: fix processing multibyte strings.

Matcher function incorrectly assumed that moffset that we get from findmust is in bytes. Fix this by introducing a stepback function, taking short path if MB_CUR_MAX is 1, and going back byte-by-byte, checking if we have a legal character sequence otherwise.

PR: 153502
Reviewed by: pfg, kevans
Approved by: kib (mentor, implicit)
Differential revision: https://reviews.freebsd.org/D18297

Changes:
head/lib/libc/regex/engine.c
head/lib/libc/tests/regex/Makefile
head/lib/libc/tests/regex/multibyte.sh

A commit references this bug:

Author: yuripv
Date: Sat Dec 8 19:45:06 UTC 2018
New revision: 341745
URL: https://svnweb.freebsd.org/changeset/base/341745

Log:
MFC r340835:

regexec: fix processing multibyte strings.

Matcher function incorrectly assumed that moffset that we get from findmust is in bytes. Fix this by introducing a stepback function, taking short path if MB_CUR_MAX is 1, and going back byte-by-byte, checking if we have a legal character sequence otherwise.

PR: 153502
Reviewed by: pfg, kevans
Differential revision: https://reviews.freebsd.org/D18297

Changes:
_U stable/12/
stable/12/lib/libc/regex/engine.c
stable/12/lib/libc/tests/regex/Makefile
stable/12/lib/libc/tests/regex/multibyte.sh

This is "Fixed" now, yes? :-)

Oh yeah! I haven't noticed any more issues like that. I redid some light testing just in case and less/sed seem to match everything perfectly (both with UTF-8 and Latin-1 locales).

Thanks to everyone who worked on this. Especially given how complicated the regex code looks...

Wow that was an old bug report.

(In reply to sigsys from comment #5)
Excellent, thanks for confirming!