Bug 207681

Summary:

[libc] re-enable regex Boyer-Moore for the general multibyte case

Product:

Base System

Reporter:

Pedro F. Giffuni <pfg>

Component:

bin

Assignee:

freebsd-bugs (Nobody) <bugs>

Status:

Closed Unable to Reproduce

Severity:

Affects Only Me

CC:

ache, bapt, emaste, yuripv

Priority:

---

Keywords:

patch

Version:

CURRENT

Hardware:

Any

OS:

Any

Attachments:

Description	Flags
Enable regex multibyte support	none

Description Pedro F. Giffuni freebsd_committer

2016-03-03 19:02:24 UTC

Created attachment 167690 [details]
Enable regex multibyte support

Our regex(3) has had full multibyte support since r132019; however, it was disabled for the general case. According to the commit:

"One other point of interest is that although the Boyer-Moore algorithm had
to be disabled in the general multibyte case, it is still enabled for UTF-8
because of its self-synchronizing nature. This greatly speeds up matching
by reducing the number of multibyte conversions that need to be done."

Now that we have support for collation in 11-current, the multibyte support has every chance of working correctly, so we could enable it.

This seems to be much easier than replacing regex with a new version, but it has yet to be tested thoroughly.

Comment 1 Pedro F. Giffuni freebsd_committer

2016-03-03 19:04:51 UTC

(BTW, I am only reporting this so that it doesn't get lost; I can't really test it.)

Comment 2 Yuri Pankov 2017-06-29 23:12:10 UTC

I *think* you are reading the code wrong: see how regexec.c includes engine.c several times, each time after setting different defines. With MNAMES defined, which defines the proper XMBRTOWC, we have a "#define matcher mmatcher", and regexec.c has the following in regexec():

        if (MB_CUR_MAX > 1)
                return(mmatcher(g, string, nmatch, pmatch, eflags));


So in the multibyte case we are using mmatcher(), which gets defined along with MNAMES and the proper XMBRTOWC, pointing to xmbrtowc() and not xmbrtowc_dummy().
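
As a rough illustration of that arrangement (not the actual libc code), here is a single-file sketch in which one loop body is instantiated twice with different decoders and a dispatcher picks between them on MB_CUR_MAX. A generator macro stands in for re-including engine.c, and all names (decode_byte, decode_mb, scount, mcount, count_chars) are made up for this example:

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for xmbrtowc_dummy(): every byte is one character. */
static size_t
decode_byte(wchar_t *wc, const char *s, size_t n, mbstate_t *mbs)
{
    (void)n;
    (void)mbs;
    *wc = (unsigned char)*s;
    return (1);
}

/* Hypothetical stand-in for xmbrtowc(): use mbrtowc(), one byte on error. */
static size_t
decode_mb(wchar_t *wc, const char *s, size_t n, mbstate_t *mbs)
{
    size_t r = mbrtowc(wc, s, n, mbs);

    if (r == (size_t)-1 || r == (size_t)-2) {
        /* Invalid or incomplete sequence: reset state, consume one byte. */
        memset(mbs, 0, sizeof(*mbs));
        *wc = (unsigned char)*s;
        return (1);
    }
    return (r == 0 ? 1 : r);
}

/* One "engine" body, instantiated once per decoder under a different name. */
#define DEFINE_COUNTER(name, DECODE) \
static size_t \
name(const char *s) \
{ \
    mbstate_t mbs; \
    wchar_t wc; \
    size_t left = strlen(s), count = 0, used; \
    \
    memset(&mbs, 0, sizeof(mbs)); \
    while (left > 0) { \
        used = DECODE(&wc, s, left, &mbs); \
        s += used; \
        left -= used; \
        count++; \
    } \
    return (count); \
}

DEFINE_COUNTER(scount, decode_byte)
DEFINE_COUNTER(mcount, decode_mb)

/* Dispatcher: same test that regexec() uses to choose mmatcher(). */
size_t
count_chars(const char *s)
{
    if (MB_CUR_MAX > 1)
        return (mcount(s));
    return (scount(s));
}
```

The point is only that the choice between the multibyte and single-byte variants happens at run time from the current locale, not at build time.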

The man page bit talks specifically about *word boundaries* in multibyte locales, not the general multibyte case.

Comment 3 Yuri Pankov 2017-06-29 23:59:00 UTC

(In reply to Yuri Pankov from comment #2)

To check that I'm not imagining things, I've added some printfs to the regex routines and ran a very simple test case with regcomp()/regexec(). The important part here is not using REG_NOSUB:

$ LC_ALL=en_US.UTF-8 ./testre
I'm mmatcher
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
pattern 'в' does match string 'абвгд'
$ LC_ALL=C ./testre
I'm smatcher
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
pattern 'в' does match string 'абвгд'
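
The testre source is not attached to this PR. A helper along the following lines would exercise the same path; this is a sketch only, and the name try_match, its return convention and the use of REG_EXTENDED are assumptions. The "I'm ..." lines above come from printfs added inside libc, so this code does not print them.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: switch to `locale`, compile `pattern` WITHOUT
 * REG_NOSUB, and search `string`, storing the overall match in *m.
 * Returns 1 on match, 0 on no match, -1 on any error
 * (unknown locale, bad pattern, or other regexec() failure).
 */
int
try_match(const char *locale, const char *pattern, const char *string,
    regmatch_t *m)
{
    regex_t re;
    int rc;

    if (setlocale(LC_ALL, locale) == NULL)
        return (-1);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return (-1);
    rc = regexec(&re, string, 1, m, 0);
    regfree(&re);
    if (rc == 0)
        return (1);
    return (rc == REG_NOMATCH ? 0 : -1);
}
```

Calling try_match("en_US.UTF-8", "в", "абвгд", &m) and then try_match("C", "в", "абвгд", &m) from a small main() corresponds to the two runs above, provided the UTF-8 locale is installed.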

Comment 4 Pedro F. Giffuni freebsd_committer

2017-06-30 00:54:21 UTC

Thanks for investigating!

TBH, I just looked at the commit log after bapt@ reported issues with collation and I didn't really check the code in detail. It may be that we are still lacking in this area but I am happy to be able to close this old PR.