Bug 207681

Summary: [libc] re-enable regex Boyer-Moore for the general multibyte case
Product: Base System Reporter: Pedro F. Giffuni <pfg>
Component: bin Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed Unable to Reproduce    
Severity: Affects Only Me CC: ache, bapt, emaste, yuripv
Priority: --- Keywords: patch
Version: CURRENT   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
Enable regex multibyte support none

Description Pedro F. Giffuni freebsd_committer freebsd_triage 2016-03-03 19:02:24 UTC
Created attachment 167690
Enable regex multibyte support

Our regex(3) has had full multibyte support since r132019; however, the Boyer-Moore optimization was disabled for the general multibyte case. According to the commit:

"One other point of interest is that although the Boyer-Moore algorithm had
to be disabled in the general multibyte case, it is still enabled for UTF-8
because of its self-synchronizing nature. This greatly speeds up matching
by reducing the number of multibyte conversions that need to be done."

Now that we have support for collation in 11-current, the multibyte support has a good chance of working correctly, so we could enable it.

This seems to be much easier than replacing regex with a new version, but it has yet to be tested thoroughly.
Comment 1 Pedro F. Giffuni freebsd_committer freebsd_triage 2016-03-03 19:04:51 UTC
(BTW, I am only reporting this so that it doesn't get lost; I can't really test it.)
Comment 2 Yuri Pankov 2017-06-29 23:12:10 UTC
I *think* you are reading the code wrong: regexec.c includes engine.c several times, each time after setting different defines. With MNAMES defined, which also defines the proper XMBRTOWC, we have a "#define matcher mmatcher", and regexec.c has the following in regexec():

        if (MB_CUR_MAX > 1)
                return(mmatcher(g, string, nmatch, pmatch, eflags));


So in the multibyte case we use mmatcher(), which gets defined along with MNAMES and the proper XMBRTOWC, pointing to xmbrtowc() and not xmbrtowc_dummy().

The man page bit talks specifically about *word boundaries* in multibyte locales, not the general multibyte case.
Comment 3 Yuri Pankov 2017-06-29 23:59:00 UTC
(In reply to Yuri Pankov from comment #2)

To check that I'm not imagining things, I added some printfs to the regex routines and ran a very simple test case with regcomp()/regexec(); the important part here is not using REG_NOSUB:

$ LC_ALL=en_US.UTF-8 ./testre
I'm mmatcher
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
pattern 'в' does match string 'абвгд'
$ LC_ALL=C ./testre
I'm smatcher
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
pattern 'в' does match string 'абвгд'
Comment 4 Pedro F. Giffuni freebsd_committer freebsd_triage 2017-06-30 00:54:21 UTC
Thanks for investigating!

TBH, I just looked at the commit log after bapt@ reported issues with collation, and I didn't really check the code in detail. It may be that we are still lacking in this area, but I am happy to be able to close this old PR.