Bug 179721 - [libc] [patch] char<->wchar_t mismatch in glob(3), fnmatch(3), regexec(3)
Summary: [libc] [patch] char<->wchar_t mismatch in glob(3), fnmatch(3), regexec(3)
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: Unspecified
Hardware: Any Any
: Normal Affects Only Me
Assignee: David Chisnall
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-06-19 16:20 UTC by vinschen
Modified: 2018-05-28 19:48 UTC (History)
7 users (show)

See Also:


Attachments
file.diff (3.83 KB, patch)
2013-06-19 16:20 UTC, vinschen
no flags Details | Diff
Add __wcollate_range_cmp() and use it. (3.09 KB, patch)
2016-06-01 20:29 UTC, Pedro F. Giffuni
no flags Details | Diff
Add __wcollate_range_cmp() and use it (3.10 KB, patch)
2016-06-02 18:06 UTC, Pedro F. Giffuni
no flags Details | Diff
Alternative __collate_range_cmp (420 bytes, patch)
2016-06-02 18:17 UTC, Pedro F. Giffuni
no flags Details | Diff
Cleanup as suggested by ache@ (1.33 KB, patch)
2016-06-04 15:20 UTC, Pedro F. Giffuni
no flags Details | Diff
Add __wcollate_range_cmp() and use it (1.66 KB, patch)
2016-06-04 19:55 UTC, Pedro F. Giffuni
no flags Details | Diff
Add __wcollate_range_cmp() and use it (1.66 KB, patch)
2016-06-04 20:01 UTC, Pedro F. Giffuni
no flags Details | Diff
Add __wcollate_range_cmp() and use it (3.58 KB, patch)
2016-06-04 20:11 UTC, Pedro F. Giffuni
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description vinschen 2013-06-19 16:20:00 UTC
Hi,

It seems there's a mismatch between char and wchar_t in the glob(3)
functionality.  I stumbled over this problem, because Cygwin is using
FreeBSD's glob, fnmatch, and regcomp code.

All three functions convert input strings to wide character and do
test and comparisons on the wide char representation.  All three
functions call the __collate_range_cmp function in some scenario
(glob ,for instance, in match() when a range pattern is handled).

However, while all three functions operate on wchar_t chars, the
__collate_range_cmp function in locale/collcmp.c converts the
characters to char and calls strcoll_l on them.  This results in a
comparison which only works with ASCII chars, but not with the full
UNICODE character range.

An easy solution might be to call wcscoll_l from __collate_range_cmp,
but __collate_range_cmp is also called from other places, namely
from vfscanf, with char input.  Therefore the best way out might be
to introduce something along the lines of a __wcollate_range_cmp
function, as outlined below.

Fix: See attached patch for a suggestion

Patch attached with submission follows:
Comment 1 David Chisnall freebsd_committer 2015-10-08 16:31:20 UTC
This was delayed a bit in case it conflicted with the new collation code.
Comment 2 David Chisnall freebsd_committer 2015-10-08 16:32:13 UTC
... and Bugzilla doesn't let me update one field without submitting the in-progress comment.  Adding bapt and pfg to review the fix.  Looks fine to me though.
Comment 3 vinschen 2016-05-31 09:33:30 UTC
Hi folks,

any progress?  We're using this change in Cygwin basically since my patch
submission in 2013...


Corinna
Comment 4 David Chisnall freebsd_committer 2016-05-31 09:40:14 UTC
bapt / pfg - please can one of you take a look at this?  It seems like it should be fine to commit, but I'd like someone who has touched the collation code recently to have a look.
Comment 5 Baptiste Daroussin freebsd_committer 2016-05-31 11:33:22 UTC
I do not really have time right now, but I added jilles to the list of CC who might be also interested in that
Comment 6 Pedro F. Giffuni freebsd_committer 2016-05-31 19:08:41 UTC
Add ache@ ...

I am tempted to commit it but I am not really familiar with this code.
Comment 7 Pedro F. Giffuni freebsd_committer 2016-06-01 20:29:33 UTC
Created attachment 170924 [details]
Add __wcollate_range_cmp() and use it.

Update the patch. The original was not applying cleanly.
Comment 8 Pedro F. Giffuni freebsd_committer 2016-06-01 20:34:00 UTC
Nothing in this area has really changes, except perhaps that we now have
working collation.

David: Can we have a single __collate_range_cmp instead of adding another
one?
Comment 9 Pedro F. Giffuni freebsd_committer 2016-06-02 18:06:59 UTC
Created attachment 170955 [details]
Add __wcollate_range_cmp() and use it

Build fix: the second and third arguments of __collate_range_cmp() are now wchar_t instead of int. Be consistent in __wcollate_range_cmp().
Comment 10 Pedro F. Giffuni freebsd_committer 2016-06-02 18:17:57 UTC
Created attachment 170956 [details]
Alternative __collate_range_cmp

Alternatively, now that collate_range_cmp uses wchar_t the only difference between the proposed new function and the existing one is the removal of
two assignements. Does this make sense?
Comment 11 Pedro F. Giffuni freebsd_committer 2016-06-02 18:18:52 UTC
Add ache (I somehow missed it before), as he wrote this file.
Comment 12 Andrey A. Chernov freebsd_committer 2016-06-02 20:29:22 UTC
(In reply to Pedro F. Giffuni from comment #10)
wcscoll strings should be null-terminated.
It means that wchar_t arrays should be static, as it was in collate_range_cmp.
Comment 13 Andrey A. Chernov freebsd_committer 2016-06-02 20:37:42 UTC
__collate_range_cmp should be fixed to use 'char' in the args, i.e.
int __collate_range_cmp(struct xlocale_collate *table, char c1, char c2)
It have nothing common with wchars at all.
Comment 14 Jilles Tjoelker freebsd_committer 2016-06-02 21:03:38 UTC
(In reply to Andrey Chernov from comment #12)
I think static is wrong, as it is not thread-safe. An explicitly initialized stack-allocated array is the right way.
Comment 15 Andrey A. Chernov freebsd_committer 2016-06-02 21:46:24 UTC
(In reply to Jilles Tjoelker from comment #14)
Well, so let remove static and initialize last elements with 0 each time.
Comment 16 Andrey A. Chernov freebsd_committer 2016-06-04 00:49:22 UTC
(In reply to Andrey Chernov from comment #13)
BTW, I suggest to make expicetely
int __collate_range_cmp(struct xlocale_collate *table, char c1, char c2)
i.e. char c1, char c2 to avoid calling it with wchar_t by chance.
There was previously int c1, int c2, but compiler will pass char using int in any case, so it does not break ABI, just prevent incorrect usage at compile time.
Comment 17 Pedro F. Giffuni freebsd_committer 2016-06-04 03:52:59 UTC
(In reply to Andrey Chernov from comment #16)

OK.. the proposal sounds reasonable but will the change fix the PR issue?
Comment 18 Andrey A. Chernov freebsd_committer 2016-06-04 06:10:43 UTC
(In reply to Pedro F. Giffuni from comment #17)
There is no problems in the __wcollate_range_cmp() itself, when "static" will be removed and s1[1] = L'\0'; s2[1] = L'\0'; added in direct form as Jilles suggests (for __collate_range_cmp() s1[1] = '\0'; s2[1] = '\0'), but I don't have a time to inspect the callers calls in the right way and not just by mechanically change char->wchar_t which already happens with __collate_range_cmp().
Comment 19 Pedro F. Giffuni freebsd_committer 2016-06-04 15:20:18 UTC
Created attachment 171012 [details]
Cleanup as suggested by ache@

I am running this on a tinderbox.
Comment 20 Andrey A. Chernov freebsd_committer 2016-06-04 16:02:23 UTC
(In reply to Pedro F. Giffuni from comment #19)
Oh no, __collate_range_cmp should use char and strcoll_l(), __wcollate_range_cmp should use wchar_t and wcscoll_l():

int __collate_range_cmp(struct xlocale_collate *table, char c1, char c2)
{
        char s1[2], s2[2];

        s1[0] = c1;
        s1[1] = '\0';
        s2[0] = c2;
        s2[1] = '\0';
        struct _xlocale l = {{0}};
        l.components[XLC_COLLATE] = (struct xlocale_component *)table;
        return (strcoll_l(s1, s2, &l));
}
int __wcollate_range_cmp(struct xlocale_collate *table, wchar_t c1, wchar_t c2)
{
        wchar_t s1[2], s2[2];

        s1[0] = c1;
        s1[1] = L'\0';
        s2[0] = c2;
        s2[1] = L'\0';
        struct _xlocale l = {{0}};
        l.components[XLC_COLLATE] = (struct xlocale_component *)table;
        return (wcscoll_l(s1, s2, &l));
}
And everywhere when wchar_t used as arg, __wcollate_range_cmp() should be used.
Comment 21 Andrey A. Chernov freebsd_committer 2016-06-04 16:04:34 UTC
(In reply to Andrey Chernov from comment #20)
ABI will be broken otherwise.
Comment 22 Pedro F. Giffuni freebsd_committer 2016-06-04 18:44:37 UTC
(In reply to Andrey Chernov from comment #21)
Hmm...

I would prefer one commit first to unbreak the ABI and a second commit to add
__wcollate_range_cmp()

However:
Looking at opengrok ... it appears that the original patch replaces all the occurences of __collate_range_cmp in the tree with the new wide character version. Perhaps we should just use symbol versioning here.
Comment 23 Andrey A. Chernov freebsd_committer 2016-06-04 19:27:30 UTC
(In reply to Pedro F. Giffuni from comment #22)
The commit to unbreak ABI is just restoring old -stable version (+ fix "static" thread unsafety + describe args as chars to prevent similar damage in the future), i.e. that one:

int __collate_range_cmp(struct xlocale_collate *table, char c1, char c2)
{
        char s1[2], s2[2];

        s1[0] = c1;
        s1[1] = '\0';
        s2[0] = c2;
        s2[1] = '\0';
        struct _xlocale l = {{0}};
        l.components[XLC_COLLATE] = (struct xlocale_component *)table;
        return (strcoll_l(s1, s2, &l));
}

We don't need symbol versioning, because they are completely different functions which have nothing in common. They are not versions of each other.
Comment 24 Andrey A. Chernov freebsd_committer 2016-06-04 19:30:01 UTC
(In reply to Pedro F. Giffuni from comment #22)
Like in general no one wide chars function is version of plain chars one.
Comment 25 Andrey A. Chernov freebsd_committer 2016-06-04 19:34:07 UTC
(In reply to Pedro F. Giffuni from comment #22)
+ prototype, of course
lib/libc/locale/collate.h
int	__collate_range_cmp(struct xlocale_collate *, char, char);
Comment 26 Pedro F. Giffuni freebsd_committer 2016-06-04 19:55:47 UTC
Created attachment 171025 [details]
Add __wcollate_range_cmp() and use it

Updated patch.
Note that the old __collate_range_cmp() remains only for compatibility.
Comment 27 Pedro F. Giffuni freebsd_committer 2016-06-04 20:01:02 UTC
Created attachment 171026 [details]
Add __wcollate_range_cmp() and use it

(typo in the prototype)
Comment 28 Andrey A. Chernov freebsd_committer 2016-06-04 20:06:34 UTC
(In reply to Pedro F. Giffuni from comment #27)
Ok as the first step. Next step will be to replace all places with wchar_t args with __wcollate_range_cmp.
If compiler warnings cause stop currently, first and second step should be committed at once.
Comment 29 Pedro F. Giffuni freebsd_committer 2016-06-04 20:11:40 UTC
Created attachment 171028 [details]
Add __wcollate_range_cmp() and use it

Oops, forgot the "and use it" part.
Patch going through tinderbox.
Comment 30 Andrey A. Chernov freebsd_committer 2016-06-04 20:16:23 UTC
(In reply to Pedro F. Giffuni from comment #29)
Moreover, __wcollate_range_cmp should be added to locale/Symbol.map (I am not sure to which section), because __collate_range_cmp is already there.
Comment 31 Andrey A. Chernov freebsd_committer 2016-06-04 20:21:22 UTC
(In reply to Andrey Chernov from comment #30)
Either to FBSDprivate_1.0 or FBSDprivate_1.4 (or 5?) I am not sure.
Comment 32 Andrey A. Chernov freebsd_committer 2016-06-04 20:29:21 UTC
(In reply to Pedro F. Giffuni from comment #29)
BTW, this code is completely broken:
lib/libc/regex/regcomp.c
for (i = 0; i <= UCHAR_MAX; i++) {
It suppose to work with single chars only (256 max), the loop up to biggest wchar_t will be to big to store with CHadd(p, cs, i);.
It looks like someone blindly change all char to wchar_t and think all done.
Comment 33 Andrey A. Chernov freebsd_committer 2016-06-04 20:36:00 UTC
(In reply to Andrey Chernov from comment #32)
Our regex code in general is not prepared for wchar_t. There must be bitmask to store the big wchat_t ranges, not each wchar_t in the range literally.
Comment 34 Andrey A. Chernov freebsd_committer 2016-06-04 20:57:05 UTC
(In reply to Andrey Chernov from comment #33)
To be precise, there is small bitmask in some case (wchar_t < 256) but all other plain ranges are stored by all char in them which can be really big as wint_t is. 
The above mentioned loop 
for (i = 0; i <= UCHAR_MAX; i++) {
currently handles only first 256 wchars and simple drop all others.
Comment 35 Jilles Tjoelker freebsd_committer 2016-06-04 21:53:13 UTC
(In reply to Andrey Chernov from comment #30)
I don't think __wcollate_range_cmp should be added to locale/Symbol.map since it is not used outside libc. Rather, it is a bug that __collate_range_cmp was added to locale/Symbol.map at all.
Comment 36 Andrey A. Chernov freebsd_committer 2016-06-04 22:46:42 UTC
(In reply to Jilles Tjoelker from comment #35)
When it was added, it was not a bug since it was used in several placed outside of libc. As I check right now, all of src places are rolled their own version of it.
Yes, perhaps we should not repeat the same with __wcollate_range_cmp for portability reasons.
Comment 37 Andrey A. Chernov freebsd_committer 2016-06-04 23:22:46 UTC
(In reply to Andrey Chernov from comment #34)
As I see in both NetBSD and OpenBSD, they add regex range from start single char to end single char with binary comparison between them.

It seems that POSIX cares only about its own locale here:
"In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched."

Until whole Unicode range bitmask will be implemented (if ever), better stop pretending to honor collation order, we just can't do it with wchars and having 256 limit of them, and do what NetBSD/OpenBSD does using wchar_t instead.
Comment 38 Andrey A. Chernov freebsd_committer 2016-06-04 23:36:42 UTC
(In reply to Andrey Chernov from comment #36)
BTW, as I notice, __collate_range_cmp now xlocale-enhanced, having it as argument, so its old ABI usage is already broken (no own-rolled __collate_range_cmp have xlocale argument and do it in the old way too).
It means that its ABI was already broken when xlocale argument was added long ago.
It means we can remove __collate_range_cmp from Symbol.map and probably completely remove it, I am not sure.
Comment 39 commit-hook freebsd_committer 2016-06-05 19:13:31 UTC
A commit references this bug:

Author: pfg
Date: Sun Jun  5 19:12:53 UTC 2016
New revision: 301461
URL: https://svnweb.freebsd.org/changeset/base/301461

Log:
  libc/locale: Fix type breakage in __collate_range_cmp().

  When collation support was brought in, the second and third
  arguments in __collate_range_cmp() were changed from int to
  wchar_t, breaking the ABI. Change them to a "char" type which
  makes more sense and keeps the ABI compatible.

  Also introduce __wcollate_range_cmp() which does work with wide
  characters. This function is used only internally in libc so
  we don't export it. Use the new function in glob(3), fnmatch(3),
  and regexec(3).

  PR:		179721
  Suggested by:	ache. jilles
  MFC after:	3 weeks (perhaps partial only)

Changes:
  head/lib/libc/gen/fnmatch.c
  head/lib/libc/gen/glob.c
  head/lib/libc/locale/collate.h
  head/lib/libc/locale/collcmp.c
  head/lib/libc/regex/regcomp.c
Comment 40 Pedro F. Giffuni freebsd_committer 2016-06-05 19:31:49 UTC
(In reply to Andrey Chernov from comment #38)
I think we can remove __collate_range_cmp() but let's leave it around until FreeBSD 12 is branched. I don't want surprises in a release.
Comment 41 Eitan Adler freebsd_committer freebsd_triage 2018-05-28 19:48:58 UTC
batch change:

For bugs that match the following
-  Status Is In progress 
AND
- Untouched since 2018-01-01.
AND
- Affects Base System OR Documentation

DO:

Reset to open status.


Note:
I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.