Hi, It seems there's a mismatch between char and wchar_t in the glob(3) functionality. I stumbled over this problem, because Cygwin is using FreeBSD's glob, fnmatch, and regcomp code. All three functions convert input strings to wide character and do test and comparisons on the wide char representation. All three functions call the __collate_range_cmp function in some scenario (glob ,for instance, in match() when a range pattern is handled). However, while all three functions operate on wchar_t chars, the __collate_range_cmp function in locale/collcmp.c converts the characters to char and calls strcoll_l on them. This results in a comparison which only works with ASCII chars, but not with the full UNICODE character range. An easy solution might be to call wcscoll_l from __collate_range_cmp, but __collate_range_cmp is also called from other places, namely from vfscanf, with char input. Therefore the best way out might be to introduce something along the lines of a __wcollate_range_cmp function, as outlined below. Fix: See attached patch for a suggestion Patch attached with submission follows:
This was delayed a bit in case it conflicted with the new collation code.
... and Bugzilla doesn't let me update one field without submitting the in-progress comment. Adding bapt and pfg to review the fix. Looks fine to me though.
Hi folks, any progress? We're using this change in Cygwin basically since my patch submission in 2013... Corinna
bapt / pfg - please can one of you take a look at this? It seems like it should be fine to commit, but I'd like someone who has touched the collation code recently to have a look.
I do not really have time right now, but I added jilles to the list of CC who might be also interested in that
Add ache@ ... I am tempted to commit it but I am not really familiar with this code.
Created attachment 170924 [details] Add __wcollate_range_cmp() and use it. Update the patch. The original was not applying cleanly.
Nothing in this area has really changes, except perhaps that we now have working collation. David: Can we have a single __collate_range_cmp instead of adding another one?
Created attachment 170955 [details] Add __wcollate_range_cmp() and use it Build fix: the second and third arguments of __collate_range_cmp() are now wchar_t instead of int. Be consistent in __wcollate_range_cmp().
Created attachment 170956 [details] Alternative __collate_range_cmp Alternatively, now that collate_range_cmp uses wchar_t the only difference between the proposed new function and the existing one is the removal of two assignements. Does this make sense?
Add ache (I somehow missed it before), as he wrote this file.
(In reply to Pedro F. Giffuni from comment #10) wcscoll strings should be null-terminated. It means that wchar_t arrays should be static, as it was in collate_range_cmp.
__collate_range_cmp should be fixed to use 'char' in the args, i.e. int __collate_range_cmp(struct xlocale_collate *table, char c1, char c2) It have nothing common with wchars at all.
(In reply to Andrey Chernov from comment #12) I think static is wrong, as it is not thread-safe. An explicitly initialized stack-allocated array is the right way.
(In reply to Jilles Tjoelker from comment #14) Well, so let remove static and initialize last elements with 0 each time.
(In reply to Andrey Chernov from comment #13) BTW, I suggest to make expicetely int __collate_range_cmp(struct xlocale_collate *table, char c1, char c2) i.e. char c1, char c2 to avoid calling it with wchar_t by chance. There was previously int c1, int c2, but compiler will pass char using int in any case, so it does not break ABI, just prevent incorrect usage at compile time.
(In reply to Andrey Chernov from comment #16) OK.. the proposal sounds reasonable but will the change fix the PR issue?
(In reply to Pedro F. Giffuni from comment #17) There is no problems in the __wcollate_range_cmp() itself, when "static" will be removed and s1[1] = L'\0'; s2[1] = L'\0'; added in direct form as Jilles suggests (for __collate_range_cmp() s1[1] = '\0'; s2[1] = '\0'), but I don't have a time to inspect the callers calls in the right way and not just by mechanically change char->wchar_t which already happens with __collate_range_cmp().
Created attachment 171012 [details] Cleanup as suggested by ache@ I am running this on a tinderbox.
(In reply to Pedro F. Giffuni from comment #19) Oh no, __collate_range_cmp should use char and strcoll_l(), __wcollate_range_cmp should use wchar_t and wcscoll_l(): int __collate_range_cmp(struct xlocale_collate *table, char c1, char c2) { char s1[2], s2[2]; s1[0] = c1; s1[1] = '\0'; s2[0] = c2; s2[1] = '\0'; struct _xlocale l = {{0}}; l.components[XLC_COLLATE] = (struct xlocale_component *)table; return (strcoll_l(s1, s2, &l)); } int __wcollate_range_cmp(struct xlocale_collate *table, wchar_t c1, wchar_t c2) { wchar_t s1[2], s2[2]; s1[0] = c1; s1[1] = L'\0'; s2[0] = c2; s2[1] = L'\0'; struct _xlocale l = {{0}}; l.components[XLC_COLLATE] = (struct xlocale_component *)table; return (wcscoll_l(s1, s2, &l)); } And everywhere when wchar_t used as arg, __wcollate_range_cmp() should be used.
(In reply to Andrey Chernov from comment #20) ABI will be broken otherwise.
(In reply to Andrey Chernov from comment #21) Hmm... I would prefer one commit first to unbreak the ABI and a second commit to add __wcollate_range_cmp() However: Looking at opengrok ... it appears that the original patch replaces all the occurences of __collate_range_cmp in the tree with the new wide character version. Perhaps we should just use symbol versioning here.
(In reply to Pedro F. Giffuni from comment #22) The commit to unbreak ABI is just restoring old -stable version (+ fix "static" thread unsafety + describe args as chars to prevent similar damage in the future), i.e. that one: int __collate_range_cmp(struct xlocale_collate *table, char c1, char c2) { char s1[2], s2[2]; s1[0] = c1; s1[1] = '\0'; s2[0] = c2; s2[1] = '\0'; struct _xlocale l = {{0}}; l.components[XLC_COLLATE] = (struct xlocale_component *)table; return (strcoll_l(s1, s2, &l)); } We don't need symbol versioning, because they are completely different functions which have nothing in common. They are not versions of each other.
(In reply to Pedro F. Giffuni from comment #22) Like in general no one wide chars function is version of plain chars one.
(In reply to Pedro F. Giffuni from comment #22) + prototype, of course lib/libc/locale/collate.h int __collate_range_cmp(struct xlocale_collate *, char, char);
Created attachment 171025 [details] Add __wcollate_range_cmp() and use it Updated patch. Note that the old __collate_range_cmp() remains only for compatibility.
Created attachment 171026 [details] Add __wcollate_range_cmp() and use it (typo in the prototype)
(In reply to Pedro F. Giffuni from comment #27) Ok as the first step. Next step will be to replace all places with wchar_t args with __wcollate_range_cmp. If compiler warnings cause stop currently, first and second step should be committed at once.
Created attachment 171028 [details] Add __wcollate_range_cmp() and use it Oops, forgot the "and use it" part. Patch going through tinderbox.
(In reply to Pedro F. Giffuni from comment #29) Moreover, __wcollate_range_cmp should be added to locale/Symbol.map (I am not sure to which section), because __collate_range_cmp is already there.
(In reply to Andrey Chernov from comment #30) Either to FBSDprivate_1.0 or FBSDprivate_1.4 (or 5?) I am not sure.
(In reply to Pedro F. Giffuni from comment #29) BTW, this code is completely broken: lib/libc/regex/regcomp.c for (i = 0; i <= UCHAR_MAX; i++) { It suppose to work with single chars only (256 max), the loop up to biggest wchar_t will be to big to store with CHadd(p, cs, i);. It looks like someone blindly change all char to wchar_t and think all done.
(In reply to Andrey Chernov from comment #32) Our regex code in general is not prepared for wchar_t. There must be bitmask to store the big wchat_t ranges, not each wchar_t in the range literally.
(In reply to Andrey Chernov from comment #33) To be precise, there is small bitmask in some case (wchar_t < 256) but all other plain ranges are stored by all char in them which can be really big as wint_t is. The above mentioned loop for (i = 0; i <= UCHAR_MAX; i++) { currently handles only first 256 wchars and simple drop all others.
(In reply to Andrey Chernov from comment #30) I don't think __wcollate_range_cmp should be added to locale/Symbol.map since it is not used outside libc. Rather, it is a bug that __collate_range_cmp was added to locale/Symbol.map at all.
(In reply to Jilles Tjoelker from comment #35) When it was added, it was not a bug since it was used in several placed outside of libc. As I check right now, all of src places are rolled their own version of it. Yes, perhaps we should not repeat the same with __wcollate_range_cmp for portability reasons.
(In reply to Andrey Chernov from comment #34) As I see in both NetBSD and OpenBSD, they add regex range from start single char to end single char with binary comparison between them. It seems that POSIX cares only about its own locale here: "In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched." Until whole Unicode range bitmask will be implemented (if ever), better stop pretending to honor collation order, we just can't do it with wchars and having 256 limit of them, and do what NetBSD/OpenBSD does using wchar_t instead.
(In reply to Andrey Chernov from comment #36) BTW, as I notice, __collate_range_cmp now xlocale-enhanced, having it as argument, so its old ABI usage is already broken (no own-rolled __collate_range_cmp have xlocale argument and do it in the old way too). It means that its ABI was already broken when xlocale argument was added long ago. It means we can remove __collate_range_cmp from Symbol.map and probably completely remove it, I am not sure.
A commit references this bug: Author: pfg Date: Sun Jun 5 19:12:53 UTC 2016 New revision: 301461 URL: https://svnweb.freebsd.org/changeset/base/301461 Log: libc/locale: Fix type breakage in __collate_range_cmp(). When collation support was brought in, the second and third arguments in __collate_range_cmp() were changed from int to wchar_t, breaking the ABI. Change them to a "char" type which makes more sense and keeps the ABI compatible. Also introduce __wcollate_range_cmp() which does work with wide characters. This function is used only internally in libc so we don't export it. Use the new function in glob(3), fnmatch(3), and regexec(3). PR: 179721 Suggested by: ache. jilles MFC after: 3 weeks (perhaps partial only) Changes: head/lib/libc/gen/fnmatch.c head/lib/libc/gen/glob.c head/lib/libc/locale/collate.h head/lib/libc/locale/collcmp.c head/lib/libc/regex/regcomp.c
(In reply to Andrey Chernov from comment #38) I think we can remove __collate_range_cmp() but let's leave it around until FreeBSD 12 is branched. I don't want surprises in a release.
batch change: For bugs that match the following - Status Is In progress AND - Untouched since 2018-01-01. AND - Affects Base System OR Documentation DO: Reset to open status. Note: I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.
Keyword: patch or patch-ready – in lieu of summary line prefix: [patch] * bulk change for the keyword * summary lines may be edited manually (not in bulk). Keyword descriptions and search interface: <https://bugs.freebsd.org/bugzilla/describekeywords.cgi>
^Triage: FreeBSD 12 is now OBE.