Bug 225692 - iswprint() wrong for some FULL WIDTH characters in UTF-8 locale
Summary: iswprint() wrong for some FULL WIDTH characters in UTF-8 locale
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Yuri Pankov
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-02-05 18:34 UTC by jkerian+freebsdbugs
Modified: 2019-09-29 16:10 UTC
CC List: 6 users

See Also:


Attachments
Simple iswprint test (1.59 KB, text/plain)
2018-02-05 18:34 UTC, jkerian+freebsdbugs

Description jkerian+freebsdbugs 2018-02-05 18:34:02 UTC
Created attachment 190345
Simple iswprint test

When I run ls -B on one of my files, the UTF-8 sequence 0xef 0xbc 0x88 is treated as unprintable and replaced. According to http://www.utf8-chartable.de/unicode-utf8-table.pl?start=65280&utf8=0x, this should be U+FF08, a fullwidth left parenthesis.

According to http://demo.icu-project.org/icu-bin/ubrowse?ch=FF08, U+FF08 should be a perfectly printable character in a UTF-8 locale. Looking at the ls.c source code eventually led me to iswprint().

I wrote a simple program that prints the character class masks, then prints the iswprint() results for a series of characters in a few locales. (Attached in case of link rot; code and Linux results can be seen at https://wandbox.org/permlink/ZDc36tQhh7BLRpBx)

Linux and OSX have some odd behavior around the classes, but U+2002 and U+FF08 are both perfectly printable on both systems in the UTF-8 locales.  FreeBSD, on the other hand, returns 1 only for iswprint(0x64); it should also report U+2002 and U+FF08 as printable.

On my box, running FreeBSD 11.1-RELEASE-p4 GENERIC amd64, I get the following results:

[dev ~/test/iswprint]$ ./a.out
alnum:0x400100, cntrl:0x200, ideogram:0x80000, print:0x40000, space:0x4000, xdigit:0x10000, alpha:0x100, digit:0x400, lower:0x1000, punct:0x2000, special:0x100000, blank:0x20000, graph:0x800, phonogram:0x200000, rune:0xffffff00, upper:0x8000,
Default Locale is: C
Character 0x64 is in classes: alnum print xdigit alpha lower graph rune
in C locale, iswprint(0x64) = 1
in en_US.UTF-8 locale, iswprint(0x64) = 1
in ja_JP.UTF-8 locale, iswprint(0x64) = 1

Character 0x2002 is in classes: space rune
in C locale, iswprint(0x2002) = 0
in en_US.UTF-8 locale, iswprint(0x2002) = 0
in ja_JP.UTF-8 locale, iswprint(0x2002) = 0

Character 0xff08 is in classes: rune
in C locale, iswprint(0xff08) = 0
in en_US.UTF-8 locale, iswprint(0xff08) = 0
in ja_JP.UTF-8 locale, iswprint(0xff08) = 0

Character 0x2002 is in classes: space rune
in C locale, iswprint(0x2002) = 0
in en_US.UTF-8 locale, iswprint(0x2002) = 0
in ja_JP.UTF-8 locale, iswprint(0x2002) = 0

Character 0x82 is in classes: cntrl rune
in C locale, iswprint(0x82) = 0
in en_US.UTF-8 locale, iswprint(0x82) = 0
in ja_JP.UTF-8 locale, iswprint(0x82) = 0

I confirmed with a few other FreeBSD users that they get the same results.
Comment 1 Conrad Meyer freebsd_committer 2018-02-05 20:21:08 UTC
iswprint(wc) is a thin shim around __istype(wc, _CTYPE_R);

__istype(wc, type) is a thin shim in include/_ctype.h:
  return (!!__maskrune(wc, _CTYPE_R));

__maskrune() is defined earlier in the same file:
  return ((wc < 0 || wc >= _CACHED_RUNES) ? ___runetype(wc) :
    _CurrentRuneLocale->__runetype[wc]) & _CTYPE_R;

(_CACHED_RUNES is probably 1<<8.)

This tells me the type information is being looked up in ___runetype() and that
the _CTYPE_R bit must be unset for 0x2002/0xff08.

At some level, I thought we got this metadata from the unicode standard tables, but maybe ours are out of date or this particular data is sourced independently.

___runetype(wc) is a thin shim around ___runetype_l(wc, __get_locale());

___runetype_l() does a binary search in the _RuneRange table for the current locale object.  If nothing is found, it returns 0.  This suggests the current locale object either has no type metadata for at least these two characters or has incorrect metadata for them.
Comment 2 Conrad Meyer freebsd_committer 2018-02-05 20:26:00 UTC
By the way, it might be interesting to play with nextwctype(3) to enumerate the current table.

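#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

int main(void) {
  wctype_t wct = wctype("print");
  /* Start at -1 to get the first member; -1 comes back when none are left. */
  for (wint_t wc = nextwctype(-1, wct); wc != (wint_t)-1; wc = nextwctype(wc, wct))
    printf("%d: `%lc'\n", (int)wc, wc);
  return (0);
}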
Comment 3 Conrad Meyer freebsd_committer 2018-02-05 20:28:15 UTC
I find that in the default C locale, output stops after 126 ('~').  Curiously, also with LANG or LC_CTYPE set.  Hmmmm.
Comment 4 Conrad Meyer freebsd_committer 2018-02-05 20:36:58 UTC
Oh, I see, I didn't properly setlocale().  With:

setlocale(LC_ALL, getenv("LANG"));

and an appropriate LANG environment variable value set (e.g., en_US.UTF-8), I do see characters beyond 0x80.

The table seems to skip over 0x2002:

8190 (0x1ffe): `῾'
8208 (0x2010): `‐'

As well as over 0xff08:

65276 (0xfefc): `ﻼ'
65296 (0xff10): `０'
Comment 5 Conrad Meyer freebsd_committer 2018-02-05 20:42:15 UTC
RuneLocale data comes from _Read_RuneMagi() on a path derived from the encoding and LC_CTYPE, e.g., /usr/share/locale/en_US.UTF-8/LC_CTYPE .

Where does this data come from?
Comment 6 Conrad Meyer freebsd_committer 2018-02-05 20:46:26 UTC
Seems like the data is built in share/ctypedef/Makefile, which comes from tools/tools/locale.  But share/ctypedef/en_US.UTF-8.src is independent of that Makefile and comes most recently from r325361, which updates us to CLDR 32 and Unicode 10.
Comment 7 Conrad Meyer freebsd_committer 2018-02-05 20:51:36 UTC
Notably, FULLWIDTH LEFT PARENTHESIS (\uFF08) is missing from every relevant copy of UTF-8.src I see.
Comment 8 jkerian+freebsdbugs 2018-02-08 18:34:27 UTC
(In reply to Conrad Meyer from comment #7)

Is UTF-8.src pulled from CLDR in some scripted way? Or is that a custom file?

If custom, I have other codepoints that should probably be added as well.
Comment 9 Conrad Meyer freebsd_committer 2018-02-08 18:44:36 UTC
(In reply to jkerian+freebsdbugs from comment #8)
I believe it is an automated tool, but bapt knows much more about this process than I do.
Comment 10 lichray 2018-06-16 14:02:02 UTC
It also gives wrong results for GB18030 (another multibyte encoding):

  iswprint(41406) = 0

(You can use '\xa1\xbe'.decode('gb18030') in Python to figure it out; that is U+3010.)

Likewise, lots of ?? in ls -q output.
Comment 11 Hank Hampel 2018-09-21 10:02:52 UTC
I quickly wanted to add that this problem probably affects everybody using mosh (https://mosh.org/) - or any other remote shell that relies on the server interpreting characters - and powerline-status (https://github.com/powerline/powerline) or other programs using Unicode characters.

SSH notably doesn't rely on the server interpreting the characters; otherwise the outcry over this regression would probably have been much greater. Please note that the above worked flawlessly on FreeBSD 10.x-RELEASE, which is especially frustrating.

The thing that makes this bug even worse is that there are obviously "holes" in the Unicode coverage of the UTF-8 locales so that some Unicode characters work fine while others are impossible to use - which makes this problem only harder to debug for users.

As a quick test, you can put the following lines (taken from powerline-status) into a file, copy it onto a FreeBSD 11.x-RELEASE machine, and log onto that machine via mosh. You'll see empty strings instead of the Unicode chars, which in turn wreaks havoc on your input line if they are used in a shell prompt (PS1), as the length is calculated wrongly.

        "dividers": {
                "left": {
                        "hard": " ",
                        "soft": " "
                },
                "right": {
                        "hard": " ",
                        "soft": " "
                }
        },

Thanks everybody for looking into this problem and fixing the associated regression!
Comment 12 Conrad Meyer freebsd_committer 2018-09-22 02:50:46 UTC
CLDR v33.1 was released 2018-06-20, maybe we can update to that.  I see v34 is in alpha as well.  And Unicode 11.0.0 2018-06-05: http://www.unicode.org/versions/Unicode11.0.0/
Comment 13 Conrad Meyer freebsd_committer 2018-09-22 02:51:11 UTC
This is still present on CURRENT, updating "Version."
Comment 14 Yuri Pankov 2018-10-08 11:05:06 UTC
There are several issues described in this bug report. Let's take a look at U+2002 first: it is <EN_SPACE>, which is included in the "space" character class, and localedef(1) does NOT currently treat "space" characters as members of the "print" character class.

The POSIX locale definition says the following:
--------------------------------------------------------------------------------
print
    Define characters to be classified as printable characters, including the <space>.

    In the POSIX locale, all characters in class graph shall be included; no characters in class cntrl shall be included.

    In a locale definition file, characters specified for the keywords upper, lower, alpha, digit, xdigit, punct, graph, and the <space> are automatically included in this class. No character specified for the keyword cntrl shall be specified.
--------------------------------------------------------------------------------

So it doesn't explicitly say anything about "space" here, but it also does NOT explicitly forbid including "space" in the "print" class, and I *think* we can extend the inclusion of the <space> character to the entire "space" character class.

See review D17467.
Comment 15 Yuri Pankov 2018-10-08 12:34:13 UTC
Properly fixing U+FF08 and other full width characters is more involved.

A bit of background: we have a common ctype definitions file for all UTF-8 locales, src/share/ctypedef/en_US.UTF-8.src (all other locales symlink to the resulting /usr/share/locale/en_US.UTF-8/LC_CTYPE).

src/share/ctypedef/en_US.UTF-8.src is in turn assembled from src/tools/tools/locale/etc/common.UTF-8.src and src/tools/tools/locale/etc/manual-input.UTF-8.

src/tools/tools/locale/etc/common.UTF-8.src is built using src/tools/tools/locale/tools/utf8-rollup.pl, which contains the character ranges belonging to different locales and (among other things) checks the LC_CTYPE section of the corresponding .UTF-8.src file for the character class to use.  If the character is not defined there, it doesn't get into common.UTF-8.src, which is exactly the case here: it's not found in ja_JP.UTF-8.src or ko_KR.UTF-8.src.

TL;DR: all such characters not defined anywhere in *.UTF-8.src from CLDR need to be added to the manual-input.UTF-8.

CLDR v34 is really close to being released, but I strongly doubt that the full width characters we are missing will be defined in the new *.UTF-8.src files.  To mitigate the issue for the release, I propose looking up these characters in the UTF-8.src we had in src/share/mklocale in pre-11.x times, and adding them to manual-input.UTF-8.

The longer term solution would be asking CLDR guys about a way to build a complete ctype map while building POSIX locale data files, and if that's not possible, going over the entire UTF-8.src contents, and adding missing bits to manual-input.UTF-8.
Comment 16 Yuri Pankov 2018-10-08 13:40:11 UTC
The ctype update is out for review as well: review D17471.
Comment 17 Conrad Meyer freebsd_committer 2018-10-08 18:46:04 UTC
(In reply to Yuri Pankov from comment #15)
Hey, that's really cool!  Thanks for investigating this.  I'll look at the reviews.
Comment 18 Hank Hampel 2018-10-10 10:58:30 UTC
(In reply to Yuri Pankov from comment #15)
Thanks for providing this deep insight into the root of the problem!

My above addition to this bug report is covering a different part of the Unicode ctypes table though. So I would like to ask if it is possible to include that part - namely the "U+E000 - U+F8FF : Private Use Area"  - as well in your patches?

The respective part of the pre-11.x branch from share/mklocale/UTF-8.src is the following - sorry I don't know how to convert that into the new notation:

/*
 * U+E000 - U+F8FF : Private Use Area
 */

GRAPH     0xe000 - 0xf8ff
PRINT     0xe000 - 0xf8ff
SWIDTH1   0xe000 - 0xf8ff


Mainly I'm interested in the following code points (although including all of them probably makes most sense):

Code point 	Glyph 	Description
U+E0A0 				Version control branch
U+E0A1 				LN (line) symbol
U+E0A2 				Closed padlock
U+E0B0 				Rightwards black arrowhead
U+E0B1 				Rightwards arrowhead
U+E0B2 				Leftwards black arrowhead
U+E0B3 				Leftwards arrowhead

Many thanks for your work on this!
Comment 19 Hank Hampel 2018-10-10 12:45:07 UTC
I just realized that it might not be as easy to add the code points "U+E000 - U+F8FF : Private Use Area" as they obviously don't have any definitive names like the other symbols - and those seem to be used to include the code points. Or is this just me not understanding properly how it's done nowadays?
Comment 20 Yuri Pankov 2018-10-10 13:04:10 UTC
Those have names as well, though pretty simple ones :-)

I did it like the following (where wc2mb is a simple code snippet that prints the multibyte encoding in a form that can be grepped for in UTF-8.cm):

$ grep $(./wc2mb 0xE000) ~/unicode/cldr/v33.0/posix/UTF-8.cm
<PRIVATE_USE_AREA-E000>                                                               \xEE\x80\x80
$ grep $(./wc2mb 0xF8FF) ~/unicode/cldr/v33.0/posix/UTF-8.cm
<PRIVATE_USE_AREA-F8FF>                                                               \xEF\xA3\xBF

And yes, I think it would make sense to add them for the time being, until a better solution is found.
Comment 21 commit-hook freebsd_committer 2018-10-11 18:30:51 UTC
A commit references this bug:

Author: yuripv
Date: Thu Oct 11 18:30:13 UTC 2018
New revision: 339313
URL: https://svnweb.freebsd.org/changeset/base/339313

Log:
  Restore some of the ctype definitions reported in the PR from pre-CLDR
  data, namely 0xE000-0xF8FF private use area, and 0xFF00-0xFFF half- and
  fullwidth punctuation.

  While here, update tools/tools/locale/README based on my experience
  rebuilding the locale data.

  PR:		225692
  Reviewed by:	bapt, cem (previous version)
  Approved by:	re (gjb), kib (mentor)
  Differential Revision:	https://reviews.freebsd.org/D17471

Changes:
  head/share/ctypedef/en_US.UTF-8.src
  head/tools/tools/locale/README
  head/tools/tools/locale/etc/common.UTF-8.src
  head/tools/tools/locale/etc/manual-input.UTF-8
Comment 22 commit-hook freebsd_committer 2018-10-27 23:32:08 UTC
A commit references this bug:

Author: yuripv
Date: Sat Oct 27 23:31:43 UTC 2018
New revision: 339827
URL: https://svnweb.freebsd.org/changeset/base/339827

Log:
  localedef: define characters in "space" class also as "print", except
  for the known conflicts ("control" characters can't be "print"able).
  POSIX doesn't explicitly forbid this, and actually includes <space>
  character in "print".

  PR:		225692
  Reviewed by:	bapt, cem (previous version), pfg (previous version)
  Approved by:	kib (mentor)
  Differential Revision:	https://reviews.freebsd.org/D17467

Changes:
  head/usr.bin/localedef/ctype.c
Comment 23 commit-hook freebsd_committer 2018-12-05 17:10:27 UTC
A commit references this bug:

Author: yuripv
Date: Wed Dec  5 17:10:07 UTC 2018
New revision: 341596
URL: https://svnweb.freebsd.org/changeset/base/341596

Log:
  MFC r339827:
  localedef: define characters in "space" class also as "print", except
  for the known conflicts ("control" characters can't be "print"able).
  POSIX doesn't explicitly forbid this, and actually includes <space>
  character in "print".

  PR:		225692
  Reviewed by:	bapt, cem (previous version), pfg (previous version)
  Differential Revision:	https://reviews.freebsd.org/D17467

Changes:
  stable/12/usr.bin/localedef/ctype.c
Comment 24 commit-hook freebsd_committer 2018-12-06 11:50:32 UTC
A commit references this bug:

Author: yuripv
Date: Thu Dec  6 11:49:53 UTC 2018
New revision: 341630
URL: https://svnweb.freebsd.org/changeset/base/341630

Log:
  MFC r339311, r339313:
  Restore some of the ctype definitions reported in the PR from pre-CLDR
  data, namely 0xE000-0xF8FF private use area, and 0xFF00-0xFFF half- and
  fullwidth punctuation.

  While here, update tools/tools/locale/README based on my experience
  rebuilding the locale data.

  PR:		225692
  Reviewed by:	bapt, cem (previous version)
  Differential Revision:	https://reviews.freebsd.org/D17471

Changes:
_U  stable/11/
  stable/11/share/ctypedef/en_US.UTF-8.src
  stable/11/tools/tools/locale/README
  stable/11/tools/tools/locale/etc/common.UTF-8.src
  stable/11/tools/tools/locale/etc/manual-input.UTF-8
Comment 25 commit-hook freebsd_committer 2018-12-06 11:52:38 UTC
A commit references this bug:

Author: yuripv
Date: Thu Dec  6 11:52:08 UTC 2018
New revision: 341631
URL: https://svnweb.freebsd.org/changeset/base/341631

Log:
  MFC r339827:
  localedef: define characters in "space" class also as "print", except
  for the known conflicts ("control" characters can't be "print"able).
  POSIX doesn't explicitly forbid this, and actually includes <space>
  character in "print".

  PR:		225692
  Reviewed by:	bapt, cem (previous version), pfg (previous version)
  Differential Revision:	https://reviews.freebsd.org/D17467

Changes:
_U  stable/11/
  stable/11/usr.bin/localedef/ctype.c
Comment 26 Yuri Pankov freebsd_committer 2018-12-06 11:53:24 UTC
This should now be fixed in all supported branches.
Comment 27 Daniel Ponte 2019-09-29 03:46:57 UTC
*** Bug 240911 has been marked as a duplicate of this bug. ***
Comment 28 Daniel Ponte 2019-09-29 03:53:38 UTC
This issue has reappeared. It appears to be related to the update to CLDR 34 and Unicode 11.
Comment 29 Yuri Pankov freebsd_committer 2019-09-29 10:35:48 UTC
(In reply to amigan from comment #28)
- Which issue exactly as there were several in this PR?
- Could you please provide the UTF-8 or wide character codes for the ones that are not rendered correctly?
- Running CURRENT? Can you provide the working and non-working revisions?
Comment 30 Yuri Pankov freebsd_committer 2019-09-29 14:49:59 UTC
(In reply to amigan from comment #28)
FWIW, I tried the reproducer attached to this issue, and all of the characters that were originally reported as unprintable are still OK (i.e., printable), running 13.0-CURRENT r352495.
Comment 31 Conrad Meyer freebsd_committer 2019-09-29 16:10:04 UTC
(Please provide the information in bug 240911.)