Created attachment 190345 Simple iswprint test When I run ls -B on one of my files, the UTF-8 byte sequence 0xef 0xbc 0x88 is replaced as if it were unprintable. According to http://www.utf8-chartable.de/unicode-utf8-table.pl?start=65280&utf8=0x, this is U+FF08, a fullwidth left parenthesis. According to http://demo.icu-project.org/icu-bin/ubrowse?ch=FF08, U+FF08 should be a perfectly printable character in a UTF-8 locale. Looking at the ls.c source code eventually led me to iswprint(). I wrote the attached simple program to print the character class constants and then the iswprint() results for a series of characters in a few locales. (Attached in case of link rot; the code and Linux results can be seen at https://wandbox.org/permlink/ZDc36tQhh7BLRpBx) Linux and OSX have some odd behavior around the classes, but U+2002 and U+FF08 are both perfectly printable on both systems in the UTF-8 locales. FreeBSD, on the other hand, returns 1 only for iswprint(0x64), while it should also be showing U+2002 and U+FF08 as printable.
On my box, running FreeBSD 11.1-RELEASE-p4 GENERIC amd64, I get the following results:

[dev ~/test/iswprint]$ ./a.out
alnum:0x400100, cntrl:0x200, ideogram:0x80000, print:0x40000, space:0x4000, xdigit:0x10000, alpha:0x100, digit:0x400, lower:0x1000, punct:0x2000, special:0x100000, blank:0x20000, graph:0x800, phonogram:0x200000, rune:0xffffff00, upper:0x8000,
Default Locale is: C
Character 0x64 is in classes: alnum print xdigit alpha lower graph rune
in C locale, iswprint(0x64) = 1
in en_US.UTF-8 locale, iswprint(0x64) = 1
in ja_JP.UTF-8 locale, iswprint(0x64) = 1
Character 0x2002 is in classes: space rune
in C locale, iswprint(0x2002) = 0
in en_US.UTF-8 locale, iswprint(0x2002) = 0
in ja_JP.UTF-8 locale, iswprint(0x2002) = 0
Character 0xff08 is in classes: rune
in C locale, iswprint(0xff08) = 0
in en_US.UTF-8 locale, iswprint(0xff08) = 0
in ja_JP.UTF-8 locale, iswprint(0xff08) = 0
Character 0x2002 is in classes: space rune
in C locale, iswprint(0x2002) = 0
in en_US.UTF-8 locale, iswprint(0x2002) = 0
in ja_JP.UTF-8 locale, iswprint(0x2002) = 0
Character 0x82 is in classes: cntrl rune
in C locale, iswprint(0x82) = 0
in en_US.UTF-8 locale, iswprint(0x82) = 0
in ja_JP.UTF-8 locale, iswprint(0x82) = 0

I confirmed with a few other FreeBSD users that they get the same results.
iswprint(wc) is a thin shim around __istype(wc, _CTYPE_R). __istype(wc, type) is itself a thin shim in include/_ctype.h:

    return (!!__maskrune(wc, _CTYPE_R));

__maskrune() is defined earlier in the same file:

    return ((wc < 0 || wc >= _CACHED_RUNES) ? ___runetype(wc) : _CurrentRuneLocale->__runetype[wc]) & _CTYPE_R;

(_CACHED_RUNES is probably 1<<8.) This tells me the type information is being looked up in ___runetype() and that the _CTYPE_R bit must be unset for 0x2002/0xff08. At some level, I thought we got this metadata from the Unicode standard tables, but maybe ours are out of date or this particular data is sourced independently. ___runetype(wc) is a thin shim around ___runetype_l(wc, __get_locale()); ___runetype_l() does a binary search in the _RuneRange table for the current locale object. If nothing is found, it returns 0. This suggests that the current locale object either has no type metadata for these two characters or has incorrect metadata for them.
By the way, it might be interesting to play with nextwctype(3) to enumerate the current table.

    wint_t wc = 0;
    wctype_t wct = wctype("print");

    /* nextwctype() returns -1 once no further character is in the class. */
    while ((wc = nextwctype(wc, wct)) != -1 && wc <= 0x10FFFF)
        printf("%d: `%lc'\n", (int)wc, wc);
I find that in the default C locale, output stops after 126 ('~'). Curiously, also with LANG or LC_CTYPE set. Hmmmm.
Oh, I see, I didn't properly setlocale(). With: setlocale(LC_ALL, getenv("LANG")); and an appropriate LANG environment variable value set (e.g., en_US.UTF-8), I do see characters beyond 0x80. The table seems to skip over 0x2002: 8190 (0x1ffe): `῾' 8208 (0x2010): `‐' As well as over 0xff08: 65276 (0xfefc): `ﻼ' 65296 (0xff10): `0'
RuneLocale data comes from _Read_RuneMagi() on a path derived from the encoding and LC_CTYPE, e.g., /usr/share/locale/en_US.UTF-8/LC_CTYPE . Where does this data come from?
Seems like the data is built in share/ctypedef/Makefile, which comes from tools/tools/locale. But share/ctypedef/en_US.UTF-8.src is independent of that Makefile and comes most recently from r325361, which updates us to CLDR 32 and Unicode 10.
Notably, FULLWIDTH LEFT PARENTHESIS (\uFF08) is missing from every relevant copy of UTF-8.src I see.
(In reply to Conrad Meyer from comment #7) Is UTF-8.src pulled from CLDR in some scripted way? Or is that a custom file? If custom, I have other codepoints that should probably be added as well.
(In reply to jkerian+freebsdbugs from comment #8) I believe it is an automated tool, but bapt knows much more about this process than I do.
It also gives wrong results for GB18030 (another multibyte encoding): iswprint(41406) = 0. (You can use '\xa1\xbe'.decode('gb18030') in Python to figure it out; that is U+3010.) Likewise, ls -q output shows lots of ?? in place of such characters.
I quickly wanted to add that this problem probably affects everybody using mosh (https://mosh.org/) - or any other remote shell that relies on the server interpreting characters - and powerline-status (https://github.com/powerline/powerline) or other programs using Unicode characters. SSH notably doesn't rely on the server interpreting the characters, otherwise the outcry over this regression would probably have been much greater. Please note well that the above worked flawlessly on FreeBSD 10.x-RELEASE - which is especially frustrating. What makes this bug even worse is that there are obviously "holes" in the Unicode coverage of the UTF-8 locales, so that some Unicode characters work fine while others are impossible to use - which only makes the problem harder for users to debug. As a quick test, you can put the following lines (taken from powerline-status) into a file, copy it onto a FreeBSD 11.x-RELEASE machine and log onto it via mosh. You'll see empty strings instead of the Unicode chars, which in turn wreaks havoc on your input line if they are used in a shell prompt (PS1), as the length is calculated wrongly.

"dividers": { "left": { "hard": " ", "soft": " " }, "right": { "hard": " ", "soft": " " } },

Thanks everybody for looking into this problem and fixing the associated regression!
CLDR v33.1 was released 2018-06-20, maybe we can update to that. I see v34 is in alpha as well. And Unicode 11.0.0 2018-06-05: http://www.unicode.org/versions/Unicode11.0.0/
This is still present on CURRENT, updating "Version."
There are several issues described in this bug report. Let's take a look at U+2002 first: it is <EN_SPACE>, included in the "space" character class, which localedef(1) does NOT currently identify as belonging to the "print" character class. The POSIX locale definition says the following:

--------------------------------------------------------------------------------
print
Define characters to be classified as printable characters, including the <space>. In the POSIX locale, all characters in class graph shall be included; no characters in class cntrl shall be included. In a locale definition file, characters specified for the keywords upper, lower, alpha, digit, xdigit, punct, graph, and the <space> are automatically included in this class. No character specified for the keyword cntrl shall be specified.
--------------------------------------------------------------------------------

So it doesn't explicitly say anything about the "space" class here, but it doesn't explicitly forbid including "space" in the "print" class either, and I *think* we can extend the inclusion of the <space> character to the entire "space" character class. See review D17467.
Properly fixing U+FF08 and other full width characters is more involved. A bit of background: we have a common ctype definitions file for all UTF-8 locales, src/share/ctypedef/en_US.UTF-8.src (all other locales symlink to the resulting /usr/share/locale/en_US.UTF-8/LC_CTYPE). src/share/ctypedef/en_US.UTF-8.src is in turn assembled from src/tools/tools/locale/etc/common.UTF-8.src and src/tools/tools/locale/etc/manual-input.UTF-8. src/tools/tools/locale/etc/common.UTF-8.src is built using src/tools/tools/locale/tools/utf8-rollup.pl, which contains the character ranges belonging to different locales and (among other things) checks the LC_CTYPE section of the corresponding .UTF-8.src file for the character class to be used. If the character is not defined there, it doesn't get into common.UTF-8.src, which is exactly the case here, as it's not found in ja_JP.UTF-8.src or ko_KR.UTF-8.src. TL;DR: all such characters not defined anywhere in the *.UTF-8.src files from CLDR need to be added to manual-input.UTF-8. CLDR v34 is really close to being released, but I strongly doubt the full width characters we are missing will be defined in the new *.UTF-8.src files. To address the issue for the release, I propose looking up these characters in the UTF-8.src we had in src/share/mklocale in pre-11.x times, and adding them to manual-input.UTF-8. The longer term solution would be asking the CLDR guys about a way to build a complete ctype map while building the POSIX locale data files, and if that's not possible, going over the entire UTF-8.src contents and adding the missing bits to manual-input.UTF-8.
The ctype update is out for review as well: review D17471.
(In reply to Yuri Pankov from comment #15) Hey, that's really cool! Thanks for investigating this. I'll look at the reviews.
(In reply to Yuri Pankov from comment #15) Thanks for providing this deep insight into the root of the problem! My above addition to this bug report is covering a different part of the Unicode ctypes table though. So I would like to ask if it is possible to include that part - namely the "U+E000 - U+F8FF : Private Use Area" - as well in your patches? The respective part of the pre-11.x branch from share/mklocale/UTF-8.src is the following - sorry I don't know how to convert that into the new notation:

/*
 * U+E000 - U+F8FF : Private Use Area
 */
GRAPH 0xe000 - 0xf8ff
PRINT 0xe000 - 0xf8ff
SWIDTH1 0xe000 - 0xf8ff

Mainly I'm interested in the following code points (although including all of them probably makes most sense):

Code point  Glyph  Description
U+E0A0             Version control branch
U+E0A1             LN (line) symbol
U+E0A2             Closed padlock
U+E0B0             Rightwards black arrowhead
U+E0B1             Rightwards arrowhead
U+E0B2             Leftwards black arrowhead
U+E0B3             Leftwards arrowhead

Many thanks for your work on this!
I just realized that it might not be as easy to add the code points "U+E000 - U+F8FF : Private Use Area" as they obviously don't have any definitive names like the other symbols - and those seem to be used to include the code points. Or is this just me not understanding properly how it's done nowadays?
Those have names as well, though pretty simple ones :-) I did it like the following (where wc2mb is a simple code snippet printing out the multibyte code in a form suitable for grepping in UTF-8.cm):

$ grep $(./wc2mb 0xE000) ~/unicode/cldr/v33.0/posix/UTF-8.cm
<PRIVATE_USE_AREA-E000> \xEE\x80\x80
$ grep $(./wc2mb 0xF8FF) ~/unicode/cldr/v33.0/posix/UTF-8.cm
<PRIVATE_USE_AREA-F8FF> \xEF\xA3\xBF

And yes, I think it would make sense to add them for the time being until a better solution is found.
A commit references this bug: Author: yuripv Date: Thu Oct 11 18:30:13 UTC 2018 New revision: 339313 URL: https://svnweb.freebsd.org/changeset/base/339313 Log: Restore some of the ctype definitions reported in the PR from pre-CLDR data, namely 0xE000-0xF8FF private use area, and 0xFF00-0xFFF half- and fullwidth punctuation. While here, update tools/tools/locale/README based on my experience rebuilding the locale data. PR: 225692 Reviewed by: bapt, cem (previous version) Approved by: re (gjb), kib (mentor) Differential Revision: https://reviews.freebsd.org/D17471 Changes: head/share/ctypedef/en_US.UTF-8.src head/tools/tools/locale/README head/tools/tools/locale/etc/common.UTF-8.src head/tools/tools/locale/etc/manual-input.UTF-8
A commit references this bug: Author: yuripv Date: Sat Oct 27 23:31:43 UTC 2018 New revision: 339827 URL: https://svnweb.freebsd.org/changeset/base/339827 Log: localedef: define characters in "space" class also as "print", except for the known conflicts ("control" characters can't be "print"able). POSIX doesn't explicitly forbid this, and actually includes <space> character in "print". PR: 225692 Reviewed by: bapt, cem (previous version), pfg (previous version) Approved by: kib (mentor) Differential Revision: https://reviews.freebsd.org/D17467 Changes: head/usr.bin/localedef/ctype.c
A commit references this bug: Author: yuripv Date: Wed Dec 5 17:10:07 UTC 2018 New revision: 341596 URL: https://svnweb.freebsd.org/changeset/base/341596 Log: MFC r339827: localedef: define characters in "space" class also as "print", except for the known conflicts ("control" characters can't be "print"able). POSIX doesn't explicitly forbid this, and actually includes <space> character in "print". PR: 225692 Reviewed by: bapt, cem (previous version), pfg (previous version) Differential Revision: https://reviews.freebsd.org/D17467 Changes: stable/12/usr.bin/localedef/ctype.c
A commit references this bug: Author: yuripv Date: Thu Dec 6 11:49:53 UTC 2018 New revision: 341630 URL: https://svnweb.freebsd.org/changeset/base/341630 Log: MFC r339311, r339313: Restore some of the ctype definitions reported in the PR from pre-CLDR data, namely 0xE000-0xF8FF private use area, and 0xFF00-0xFFF half- and fullwidth punctuation. While here, update tools/tools/locale/README based on my experience rebuilding the locale data. PR: 225692 Reviewed by: bapt, cem (previous version) Differential Revision: https://reviews.freebsd.org/D17471 Changes: _U stable/11/ stable/11/share/ctypedef/en_US.UTF-8.src stable/11/tools/tools/locale/README stable/11/tools/tools/locale/etc/common.UTF-8.src stable/11/tools/tools/locale/etc/manual-input.UTF-8
A commit references this bug: Author: yuripv Date: Thu Dec 6 11:52:08 UTC 2018 New revision: 341631 URL: https://svnweb.freebsd.org/changeset/base/341631 Log: MFC r339827: localedef: define characters in "space" class also as "print", except for the known conflicts ("control" characters can't be "print"able). POSIX doesn't explicitly forbid this, and actually includes <space> character in "print". PR: 225692 Reviewed by: bapt, cem (previous version), pfg (previous version) Differential Revision: https://reviews.freebsd.org/D17467 Changes: _U stable/11/ stable/11/usr.bin/localedef/ctype.c
This should now be fixed in all supported branches.
*** Bug 240911 has been marked as a duplicate of this bug. ***
This issue has reappeared. It appears to be related to the update to CLDR 34 and Unicode 11.
(In reply to amigan from comment #28) - Which issue exactly as there were several in this PR? - Could you please provide the UTF-8 or wide character codes for the ones that are not rendered correctly? - Running CURRENT? Can you provide the working and non-working revisions?
(In reply to amigan from comment #28) FWIW, I tried the reproducer attached to this issue, and all of the characters that were originally reported as unprintable are still OK (i.e., printable), running 13.0-CURRENT r352495.
(Please provide the information in bug 240911.)