Bug 240911

Summary: iswprint() wrong for some FULL WIDTH characters in UTF-8 locale
Product: Base System
Reporter: Daniel Ponte <amigan>
Component: bin
Assignee: Yuri Pankov <yuripv>
Status: Closed FIXED
Severity: Affects Many People
CC: cem, yuripv
Priority: ---    
Version: CURRENT   
Hardware: Any   
OS: Any   
Attachments:
Description          Flags
Modified reproducer  none

Description Daniel Ponte 2019-09-29 03:26:30 UTC
The problem reported in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225692 has returned: Powerline characters are not rendering correctly in vim under mosh.
Comment 1 Daniel Ponte 2019-09-29 03:46:57 UTC

*** This bug has been marked as a duplicate of bug 225692 ***
Comment 2 Conrad Meyer freebsd_committer freebsd_triage 2019-09-29 16:09:14 UTC
Not a dupe per comment #30 in the other bug.  In the future, please investigate the new bug first and confirm that it is a duplicate before marking it as one.

> FWIW, I tried the reproducer attached to this issue, and all of the characters
> that were originally reported as unprintable are still OK (i.e., printable),
> running 13.0-CURRENT r352495.
Comment 3 Conrad Meyer freebsd_committer freebsd_triage 2019-09-29 16:10:33 UTC
Copying request for information from other bug:

- Could you please provide the UTF-8 or wide character codes for the ones that are not rendered correctly?
- Running CURRENT? Can you provide the working and non-working revisions?
Comment 4 Daniel Ponte 2019-09-29 19:36:43 UTC
I apologize for the confusion and lack of detail. This is specifically for Powerline characters. They are not rendering correctly when the mosh server is FreeBSD. The work in #225692 most certainly appeared to correct this previously. I have modified the test program slightly to demonstrate these characters, and also to emit the tested character.

% uname -v ; ./wcw
FreeBSD 13.0-CURRENT #0 r352860: Sat Sep 28 21:19:24 EDT 2019     root@argon.h.c907:/usr/obj/usr/src/amd64.amd64/sys/GENERIC
alnum:0x400100, cntrl:0x200, ideogram:0x80000, print:0x40000, space:0x4000, xdigit:0x10000, alpha:0x100, digit:0x400, lower:0x1000, punct:0x2000, special:0x100000, blank:0x20000, graph:0x800, phonogram:0x200000, rune:0xffffff00, upper:0x8000, 
Default Locale is: C
Character d 0x64 is in classes: alnum print xdigit alpha lower graph rune 
in C locale, iswprint(0x64) = 1
in en_US.UTF-8 locale, iswprint(0x64) = 1
in ja_JP.UTF-8 locale, iswprint(0x64) = 1

Character   0xe0b1 is in classes: cntrl rune 
in C locale, iswprint(0xe0b1) = 0
in en_US.UTF-8 locale, iswprint(0xe0b1) = 0
in ja_JP.UTF-8 locale, iswprint(0xe0b1) = 0

Character   0xe0b2 is in classes: cntrl rune 
in C locale, iswprint(0xe0b2) = 0
in en_US.UTF-8 locale, iswprint(0xe0b2) = 0
in ja_JP.UTF-8 locale, iswprint(0xe0b2) = 0

Character   0xe0b3 is in classes: cntrl rune 
in C locale, iswprint(0xe0b3) = 0
in en_US.UTF-8 locale, iswprint(0xe0b3) = 0
in ja_JP.UTF-8 locale, iswprint(0xe0b3) = 0


versus


% uname -a ; ./wcw
FreeBSD dtvax.dynatron.me 12.0-BETA4 FreeBSD 12.0-BETA4 r340285 GENERIC  amd64
alnum:0x400100, cntrl:0x200, ideogram:0x80000, print:0x40000, space:0x4000, xdigit:0x10000, alpha:0x100, digit:0x400, lower:0x1000, punct:0x2000, special:0x100000, blank:0x20000, graph:0x800, phonogram:0x200000, rune:0xffffff00, upper:0x8000, 
Default Locale is: C
Character d 0x64 is in classes: alnum print xdigit alpha lower graph rune 
in C locale, iswprint(0x64) = 1
in en_US.UTF-8 locale, iswprint(0x64) = 1
in ja_JP.UTF-8 locale, iswprint(0x64) = 1

Character  0xe0b1 is in classes: print graph rune 
in C locale, iswprint(0xe0b1) = 0
in en_US.UTF-8 locale, iswprint(0xe0b1) = 1
in ja_JP.UTF-8 locale, iswprint(0xe0b1) = 1

Character  0xe0b2 is in classes: print graph rune 
in C locale, iswprint(0xe0b2) = 0
in en_US.UTF-8 locale, iswprint(0xe0b2) = 1
in ja_JP.UTF-8 locale, iswprint(0xe0b2) = 1

Character  0xe0b3 is in classes: print graph rune 
in C locale, iswprint(0xe0b3) = 0
in en_US.UTF-8 locale, iswprint(0xe0b3) = 1
in ja_JP.UTF-8 locale, iswprint(0xe0b3) = 1

Character  0xe0b0 is in classes: print graph rune 
in C locale, iswprint(0xe0b0) = 0
in en_US.UTF-8 locale, iswprint(0xe0b0) = 1
in ja_JP.UTF-8 locale, iswprint(0xe0b0) = 1
Comment 5 Daniel Ponte 2019-09-29 20:09:18 UTC
Created attachment 207943
Modified reproducer
Comment 6 Yuri Pankov freebsd_committer freebsd_triage 2019-09-29 21:29:43 UTC
Thanks for the update.

It's not the CLDR34/Unicode11 update itself, but rather a followup in base r340491.  As the commit message says, there are no direct mappings between UnicodeData.txt and POSIX character classes, so I used my best judgement there :-)

The characters you are after fall in the following range (UnicodeData.txt):

E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;

"Co" there means "Other, Private Use".  I *think* we could mark all those characters as printable, it won't hurt anything.
Comment 7 Daniel Ponte 2019-09-29 21:46:01 UTC
I agree. I will admit that Unicode is one of those dungeon-dwellers-only-but-fundamental things I don't fully understand :) , but I do know that these Private Use Area characters are widely used as printables, and this does seem to be a regression.
Comment 8 Yuri Pankov freebsd_committer freebsd_triage 2019-09-29 22:32:50 UTC
Could you please try replacing src/share/ctypedef/C.UTF-8.src with https://people.freebsd.org/~yuripv/C.UTF-8.src and rebuilding?
Comment 9 Daniel Ponte 2019-09-30 15:27:50 UTC
(In reply to Yuri Pankov from comment #8)

This resolves the issue on both CURRENT and 12.1-STABLE. Thank you.
Comment 10 commit-hook freebsd_committer freebsd_triage 2019-10-05 22:18:36 UTC
A commit references this bug:

Author: yuripv
Date: Sat Oct  5 22:17:55 UTC 2019
New revision: 353130
URL: https://svnweb.freebsd.org/changeset/base/353130

Log:
  Mark "private use area" characters as printable.

  At least some of the characters in E000-F8FF range are used by Powerline
  fonts, and having no attributes for these ranges in UnicodeData.txt
  other than "Other, Private Use" it should be safe to mark all of them as
  printable.  Some actually were before r340491, so this fixes the
  regression introduced there as well.

  PR:		240911
  Reviewed by:	bapt
  Tested by:	Daniel Ponte <amigan@gmail.com>
  Differential Revision:	https://reviews.freebsd.org/D21850

Changes:
  head/share/ctypedef/C.UTF-8.src
  head/tools/tools/locale/tools/utf8-rollup.pl
Comment 11 commit-hook freebsd_committer freebsd_triage 2020-12-02 22:44:48 UTC
A commit references this bug:

Author: yuripv
Date: Wed Dec  2 22:44:41 UTC 2020
New revision: 368288
URL: https://svnweb.freebsd.org/changeset/base/368288

Log:
  MFC r353130:

  Mark "private use area" characters as printable.

  At least some of the characters in E000-F8FF range are used by Powerline
  fonts, and having no attributes for these ranges in UnicodeData.txt
  other than "Other, Private Use" it should be safe to mark all of them as
  printable.  Some actually were before r340491, so this fixes the
  regression introduced there as well.

  PR:		240911
  Reviewed by:	bapt
  Tested by:	Daniel Ponte <amigan@gmail.com>
  Differential Revision:	https://reviews.freebsd.org/D21850

Changes:
_U  stable/12/
  stable/12/share/ctypedef/C.UTF-8.src
  stable/12/tools/tools/locale/tools/utf8-rollup.pl
Comment 12 Yuri Pankov freebsd_committer freebsd_triage 2020-12-02 22:45:49 UTC
Better late than never, sorry for the delay, completely forgot about MFC.