|Summary:||[libc] re-enable regex Boyer-Moore for the general multibyte case|
|Product:||Base System||Reporter:||Pedro F. Giffuni <pfg>|
|Component:||bin||Assignee:||freebsd-bugs (Nobody) <bugs>|
|Status:||Closed Unable to Reproduce|
|Severity:||Affects Only Me||CC:||ache, bapt, emaste, yuripv|
Description Pedro F. Giffuni 2016-03-03 19:02:24 UTC
Created attachment 167690: Enable regex multibyte support

Our regex(3) has had full multibyte support since r132019; however, the Boyer-Moore optimization was disabled for the general multibyte case. According to the commit: "One other point of interest is that although the Boyer-Moore algorithm had to be disabled in the general multibyte case, it is still enabled for UTF-8 because of its self-synchronizing nature. This greatly speeds up matching by reducing the number of multibyte conversions that need to be done."

Now that 11-current supports collation, the multibyte support has a good chance of working correctly, so we could enable it. This seems much easier than replacing regex with a new version, but it has yet to be tested thoroughly.
Comment 1 Pedro F. Giffuni 2016-03-03 19:04:51 UTC
(BTW .. I am only reporting this so that it doesn't get lost, I can't really test it).
Comment 2 Yuri Pankov 2017-06-29 23:12:10 UTC
I *think* you are reading the code wrong: see how regexec.c includes engine.c several times, after setting different defines. With MNAMES defined, which defines the proper XMBRTOWC, we have a "#define matcher mmatcher", and regexec.c has the following in regexec():

    if (MB_CUR_MAX > 1)
        return(mmatcher(g, string, nmatch, pmatch, eflags));

So in the multibyte case we are using mmatcher(), which gets defined along with MNAMES and the proper XMBRTOWC pointing to xmbrtowc() and not xmbrtowc_dummy(). The man page bit talks specifically about *word boundaries* in multibyte locales, not the general multibyte case.
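For readers unfamiliar with this pattern, here is a minimal single-file sketch of the same idea: one routine body compiled twice with different character decoders, then selected at run time by MB_CUR_MAX. It is not the libc code. The real regex sources re-include engine.c under different defines, whereas this sketch swaps in a function-generating macro so it fits in one file. All names (decode_byte, decode_mb, DEFINE_COUNTER, scount, mcount, count_chars) are hypothetical.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Single-byte decoder: every byte is one character.
 * Hypothetical stand-in for the role xmbrtowc_dummy() plays. */
static size_t decode_byte(wchar_t *wc, const char *s, size_t n, mbstate_t *st)
{
    (void)st;
    if (n == 0)
        return 0;
    *wc = (unsigned char)*s;
    return 1;
}

/* Multibyte decoder built on mbrtowc(); an invalid or incomplete
 * sequence is consumed as a single byte so the loop always advances.
 * Hypothetical stand-in for the role xmbrtowc() plays. */
static size_t decode_mb(wchar_t *wc, const char *s, size_t n, mbstate_t *st)
{
    size_t r = mbrtowc(wc, s, n, st);

    if (r == (size_t)-1 || r == (size_t)-2) {
        memset(st, 0, sizeof(*st));
        *wc = (unsigned char)*s;
        return 1;
    }
    return r == 0 ? 1 : r;
}

/* Generate one character-counting routine per decoder, the way the
 * engine is instantiated once per set of defines. */
#define DEFINE_COUNTER(name, DECODE)                        \
static size_t name(const char *s)                           \
{                                                           \
    mbstate_t st;                                           \
    size_t len = strlen(s), count = 0;                      \
    wchar_t wc;                                             \
    memset(&st, 0, sizeof(st));                             \
    while (len > 0) {                                       \
        size_t k = DECODE(&wc, s, len, &st);                \
        s += k;                                             \
        len -= k;                                           \
        count++;                                            \
    }                                                       \
    return count;                                           \
}

DEFINE_COUNTER(scount, decode_byte)   /* single-byte variant */
DEFINE_COUNTER(mcount, decode_mb)     /* multibyte variant */

/* Run-time dispatch on MB_CUR_MAX, mirroring the check in regexec(). */
size_t count_chars(const char *s)
{
    return (MB_CUR_MAX > 1) ? mcount(s) : scount(s);
}
```

Without a prior setlocale() call the program runs in the "C" locale, so count_chars() takes the single-byte path and counts bytes.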
Comment 3 Yuri Pankov 2017-06-29 23:59:00 UTC
(In reply to Yuri Pankov from comment #2)

To check that I'm not imagining things, I've added some printfs to regex routines and did a very simple test case with regcomp()/regexec(), important part here is not using REG_NOSUB:

$ LC_ALL=en_US.UTF-8 ./testre
I'm mmatcher
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
pattern 'в' does match string 'абвгд'

$ LC_ALL=C ./testre
I'm smatcher
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
pattern 'в' does match string 'абвгд'
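The testre source was not attached to this PR. A test along those lines might look roughly like the sketch below, which compiles a pattern without REG_NOSUB and reports the byte offsets of the match. The helper name match_span and its signature are hypothetical, and the "I'm ..." trace lines above came from printfs added inside libc, which this sketch does not reproduce.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the actual testre program.
 * Compiles `pattern` without REG_NOSUB, runs it against `string`, and
 * stores the byte offsets of the whole match in *so and *eo.
 * Returns 0 on a match, REG_NOMATCH on no match, or the regcomp()
 * error code if the pattern does not compile.
 */
int match_span(const char *pattern, const char *string, long *so, long *eo)
{
    regex_t re;
    regmatch_t pm[1];
    int rc;

    rc = regcomp(&re, pattern, REG_EXTENDED);   /* note: no REG_NOSUB */
    if (rc != 0)
        return rc;

    rc = regexec(&re, string, 1, pm, 0);
    if (rc == 0) {
        *so = (long)pm[0].rm_so;
        *eo = (long)pm[0].rm_eo;
    }
    regfree(&re);
    return rc;
}
```

A caller would typically run setlocale(LC_ALL, "") first so that LC_ALL=en_US.UTF-8 or LC_ALL=C in the environment takes effect, then call match_span("в", "абвгд", &so, &eo) and print whether the pattern matched.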
Comment 4 Pedro F. Giffuni 2017-06-30 00:54:21 UTC
Thanks for investigating! TBH, I just looked at the commit log after bapt@ reported issues with collation and I didn't really check the code in detail. It may be that we are still lacking in this area but I am happy to be able to close this old PR.