Summary: | [libc] re-enable regex Boyer-Moore for the general multibyte case | |
---|---|---|---
Product: | Base System | Reporter: | Pedro F. Giffuni <pfg>
Component: | bin | Assignee: | freebsd-bugs (Nobody) <bugs>
Status: | Closed Unable to Reproduce | |
Severity: | Affects Only Me | CC: | ache, bapt, emaste, yuripv
Priority: | --- | Keywords: | patch
Version: | CURRENT | |
Hardware: | Any | |
OS: | Any | |
(BTW: I am only reporting this so that it doesn't get lost; I can't really test it.)

I *think* you are reading the code wrong: see how regexec.c includes engine.c several times, each time after setting different defines. With MNAMES defined, which selects the proper XMBRTOWC, we have a "#define matcher mmatcher", and regexec() in regexec.c has the following:

    if (MB_CUR_MAX > 1)
        return(mmatcher(g, string, nmatch, pmatch, eflags));

So in the multibyte case we are using mmatcher(), which gets defined along with MNAMES and with XMBRTOWC pointing to xmbrtowc(), not xmbrtowc_dummy(). The man page bit talks specifically about *word boundaries* in multibyte locales, not the general multibyte case.

(In reply to Yuri Pankov from comment #2)
To check that I'm not imagining things, I added some printfs to the regex routines and ran a very simple test case with regcomp()/regexec(). The important part here is not using REG_NOSUB:

    $ LC_ALL=en_US.UTF-8 ./testre
    I'm mmatcher
    I'm xmbrtowc
    I'm xmbrtowc
    I'm xmbrtowc
    I'm xmbrtowc
    I'm xmbrtowc
    I'm xmbrtowc
    I'm xmbrtowc
    pattern 'в' does match string 'абвгд'

    $ LC_ALL=C ./testre
    I'm smatcher
    I'm xmbrtowc_dummy
    I'm xmbrtowc_dummy
    I'm xmbrtowc_dummy
    I'm xmbrtowc_dummy
    I'm xmbrtowc_dummy
    I'm xmbrtowc_dummy
    I'm xmbrtowc_dummy
    I'm xmbrtowc_dummy
    I'm xmbrtowc_dummy
    I'm xmbrtowc_dummy
    pattern 'в' does match string 'абвгд'

Thanks for investigating! TBH, I just looked at the commit log after bapt@ reported issues with collation, and I didn't really check the code in detail. It may be that we are still lacking in this area, but I am happy to be able to close this old PR.
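The source of the testre program mentioned in the comments is not part of this PR. A minimal sketch of such a test, assuming only the standard POSIX regcomp()/regexec() interface, might look like the following. The helper name re_matches() is hypothetical, and the "I'm mmatcher" style printfs lived inside the instrumented libc, so they are not reproduced here.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>

/*
 * Hypothetical helper: compile `pattern` and match it against `string`
 * under the given locale. Passing "" as the locale takes the setting
 * from the environment (LC_ALL and friends), as in the runs quoted above.
 * Returns 1 on match, 0 on no match, -1 on any error.
 */
int
re_matches(const char *locale, const char *pattern, const char *string)
{
    regex_t re;
    regmatch_t pm[1];
    int rc;

    assert(locale != NULL && pattern != NULL && string != NULL);

    if (setlocale(LC_ALL, locale) == NULL)
        return (-1);            /* locale not available on this system */

    /* REG_NOSUB is deliberately not set, as noted in the comment above. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return (-1);

    rc = regexec(&re, string, 1, pm, 0);
    regfree(&re);

    if (rc == 0)
        return (1);
    return (rc == REG_NOMATCH ? 0 : -1);
}
```

Calling re_matches("", "в", "абвгд") from a small main() and running the binary once with LC_ALL=en_US.UTF-8 and once with LC_ALL=C would exercise the two code paths discussed above, assuming both locales are installed.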
Created attachment 167690
Enable regex multibyte support

Our regex(3) has had full multibyte support since r132019; however, it was disabled for the general case. According to the commit:

"One other point of interest is that although the Boyer-Moore algorithm had to be disabled in the general multibyte case, it is still enabled for UTF-8 because of its self-synchronizing nature. This greatly speeds up matching by reducing the number of multibyte conversions that need to be done."

Now that we have support for collation in 11-current, the multibyte support has every chance of working correctly, so we could enable it. This seems to be much easier than replacing regex with a new version, but it has yet to be tested thoroughly.
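The "self-synchronizing" property the commit message refers to can be illustrated with a short sketch. This is not code from libc; the function names utf8_is_continuation() and utf8_find() are hypothetical. The idea is that in valid UTF-8 a lead byte can never be mistaken for a continuation byte, so a plain byte-wise search (which is what Boyer-Moore performs) for a valid UTF-8 needle can only match at a character boundary, without converting the haystack to wide characters first.

```c
#include <assert.h>
#include <string.h>

/* A UTF-8 continuation byte always has the bit pattern 10xxxxxx. */
int
utf8_is_continuation(unsigned char b)
{
    return ((b & 0xC0) == 0x80);
}

/*
 * Byte-wise substring search, standing in for Boyer-Moore here.
 * Returns the byte offset of the first match, or -1 if there is none.
 */
long
utf8_find(const char *haystack, const char *needle)
{
    const char *p;

    assert(haystack != NULL && needle != NULL);

    p = strstr(haystack, needle);
    return (p == NULL ? -1 : (long)(p - haystack));
}
```

For the string 'абвгд' and the pattern 'в' from the test in this PR, each Cyrillic letter occupies two bytes in UTF-8, so the match is found at byte offset 4, and the byte at that offset is a lead byte rather than a continuation byte. Encodings without this property can produce false matches that start in the middle of a character, which is consistent with the commit's decision to keep Boyer-Moore only for UTF-8.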