Bug 207681 - [libc] re-enable regex Boyer-Moore for the general multibyte case
Summary: [libc] re-enable regex Boyer-Moore for the general multibyte case
Status: Closed Unable to Reproduce
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2016-03-03 19:02 UTC by Pedro F. Giffuni
Modified: 2017-06-30 00:54 UTC
CC List: 4 users

See Also:


Attachments
Enable regex multibyte support (1.58 KB, patch)
2016-03-03 19:02 UTC, Pedro F. Giffuni

Description Pedro F. Giffuni freebsd_committer 2016-03-03 19:02:24 UTC
Created attachment 167690
Enable regex multibyte support

Our regex(3) has had full multibyte support since r132019; however, it was disabled for the general case. According to the commit:

"One other point of interest is that although the Boyer-Moore algorithm had
to be disabled in the general multibyte case, it is still enabled for UTF-8
because of its self-synchronizing nature. This greatly speeds up matching
by reducing the number of multibyte conversions that need to be done."

Now that we have support for collation in 11-current, the multibyte support has every chance of working correctly, so we could enable it.

This seems to be much easier than replacing regex with a new version, but it has yet to be tested thoroughly.
Comment 1 Pedro F. Giffuni freebsd_committer 2016-03-03 19:04:51 UTC
(BTW, I am only reporting this so that it doesn't get lost; I can't really test it.)
Comment 2 Yuri Pankov 2017-06-29 23:12:10 UTC
I *think* you are reading the code wrong. Note how regexec.c includes engine.c several times, each time after setting different defines. With MNAMES defined, which selects the proper XMBRTOWC, we have a "#define matcher mmatcher", and regexec.c has the following in regexec():

        if (MB_CUR_MAX > 1)
                return(mmatcher(g, string, nmatch, pmatch, eflags));


So in the multibyte case we use mmatcher(), which gets defined along with MNAMES and the proper XMBRTOWC pointing to xmbrtowc() and not xmbrtowc_dummy().

The man page bit talks specifically about *word boundaries* in multibyte locales, not the general multibyte case.
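
For readers unfamiliar with the pattern, here is a rough single-file sketch of the idea: one body of code is stamped out twice under different names, each bound to a different character-stepping routine, and a runtime check on MB_CUR_MAX picks between them. This is not the libc source. A macro stands in for the repeated "#include engine.c", and the names step_single, step_multi, DEFINE_COUNTER, scount, mcount and count_chars are all hypothetical, chosen only for illustration.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for xmbrtowc_dummy(): every byte is one character. */
static size_t step_single(const char *s, size_t n, mbstate_t *st)
{
    (void)n;
    (void)st;
    return (*s != '\0') ? 1 : 0;
}

/* Hypothetical stand-in for xmbrtowc(): decode one multibyte character. */
static size_t step_multi(const char *s, size_t n, mbstate_t *st)
{
    wchar_t wc;
    size_t r = mbrtowc(&wc, s, n, st);

    if (r == (size_t)-1 || r == (size_t)-2) {
        /* Invalid or incomplete sequence: reset state, consume one byte. */
        memset(st, 0, sizeof(*st));
        return 1;
    }
    return r;
}

/*
 * Stand-in for including one engine file several times: the same body is
 * instantiated under a different name with a different STEP routine.
 */
#define DEFINE_COUNTER(name, STEP)                              \
static size_t name(const char *s)                               \
{                                                               \
    mbstate_t st;                                               \
    size_t left = strlen(s), count = 0, r;                      \
                                                                \
    memset(&st, 0, sizeof(st));                                 \
    while (left > 0 && (r = STEP(s, left, &st)) > 0) {          \
        s += r;                                                 \
        left -= r;                                              \
        count++;                                                \
    }                                                           \
    return count;                                               \
}

DEFINE_COUNTER(scount, step_single)   /* single-byte variant */
DEFINE_COUNTER(mcount, step_multi)    /* multibyte variant */

/* Runtime dispatch, mirroring the MB_CUR_MAX check quoted above. */
size_t count_chars(const char *s)
{
    if (MB_CUR_MAX > 1)
        return mcount(s);
    return scount(s);
}
```

In the default "C" locale MB_CUR_MAX is 1, so count_chars() takes the single-byte path and counts bytes. After a successful setlocale() to a multibyte locale, the same call goes through the mbrtowc()-based variant instead.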
Comment 3 Yuri Pankov 2017-06-29 23:59:00 UTC
(In reply to Yuri Pankov from comment #2)

To check that I'm not imagining things, I've added some printfs to the regex routines and ran a very simple test case with regcomp()/regexec(). The important part here is not using REG_NOSUB:

$ LC_ALL=en_US.UTF-8 ./testre
I'm mmatcher
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
pattern 'в' does match string 'абвгд'
$ LC_ALL=C ./testre
I'm smatcher
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
pattern 'в' does match string 'абвгд'
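
A test along these lines could look roughly like the sketch below. It is not the actual testre program, whose source is not attached here, and the "I'm ..." lines above came from printfs added inside libc itself, which this sketch does not reproduce. The helper name match_in_locale and its return convention are assumptions made for illustration; only the standard setlocale(), regcomp(), regexec() and regfree() calls are used.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: switch to the given locale, compile the pattern
 * and try it against the string.
 * Returns 1 on a match, 0 on no match, -1 on any setup or regex error.
 */
int match_in_locale(const char *locale, const char *pattern, const char *string)
{
    regex_t re;
    regmatch_t pm[1];
    int rc;

    if (setlocale(LC_ALL, locale) == NULL)
        return -1;

    /* REG_NOSUB is deliberately not passed, so match offsets are requested. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, string, 1, pm, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    if (rc == REG_NOMATCH)
        return 0;
    return -1;
}
```

Calling match_in_locale("en_US.UTF-8", "в", "абвгд") and match_in_locale("C", "в", "абвгд") would exercise the multibyte and single-byte paths respectively, assuming the UTF-8 locale is installed and the source file itself is saved as UTF-8.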
Comment 4 Pedro F. Giffuni freebsd_committer 2017-06-30 00:54:21 UTC
Thanks for investigating!

TBH, I just looked at the commit log after bapt@ reported issues with collation and I didn't really check the code in detail. It may be that we are still lacking in this area but I am happy to be able to close this old PR.