Bug 213013 - unicode private use chars incorrectly classified
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: standards
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Many People
Assignee: Baptiste Daroussin
URL: https://github.com/fish-shell/fish-sh...
 
Reported: 2016-09-27 01:46 UTC by Kurtis Rader
Modified: 2016-12-10 16:02 UTC

Description Kurtis Rader 2016-09-27 01:46:15 UTC
The fish shell project (https://github.com/fish-shell/fish-shell) has received several reports of incorrect behavior on various BSD systems, including FreeBSD 12 and DragonFly BSD, when a UTF-8 locale is in effect. I have traced the problem to FreeBSD misclassifying the Unicode private use code point block from 0xFDD0 to 0xFDEF. The `iswalnum()` and `iswgraph()` libc functions incorrectly return one for those code points. The Unicode standard (see section 16.7) is very clear that those code points represent "non-characters" (see http://www.unicode.org/versions/Unicode6.0.0/ch16.pdf and http://www.fileformat.info/info/unicode/char/fdd0/index.htm). Thus those functions should return zero, as the GNU libc implementation does.

More background can be found in these fish issues:

https://github.com/fish-shell/fish-shell/issues/3050
https://github.com/fish-shell/fish-shell/issues/3406
https://github.com/fish-shell/fish-shell/issues/3302
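
The misclassification described above could be checked with a small C program along these lines. This is only a sketch: the function names `count_classified()` and `check_noncharacters()` are made up for illustration, and the locale name passed in is an assumption about what is installed on the system under test.

```c
#include <assert.h>
#include <locale.h>
#include <wctype.h>

/* Count the code points in [first, last] that libc reports as
 * alphanumeric or graphic. */
int count_classified(wint_t first, wint_t last)
{
    int hits = 0;

    assert(first <= last);
    for (wint_t cp = first; cp <= last; cp++) {
        if (iswalnum(cp) || iswgraph(cp))
            hits++;
    }
    return hits;
}

/* Switch LC_CTYPE to the named locale and count how many of the 32
 * code points U+FDD0..U+FDEF are classified as alphanumeric or graphic.
 * Returns -1 if the locale is not available. */
int check_noncharacters(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    return count_classified(0xFDD0, 0xFDEF);
}
```

Calling `check_noncharacters("en_US.UTF-8")` (assuming that locale exists) should give 0 on a libc that treats the range as noncharacters; a positive result would match the behavior reported here.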
Comment 1 Alan Somers freebsd_committer 2016-09-27 03:19:16 UTC
0xFDD0 and 0xFDEF aren't from the private use code block; they're from the Arabic Presentation Forms code block.  Or am I missing something?
Comment 2 Alan Somers freebsd_committer 2016-09-27 03:35:58 UTC
FYI, this bug does not affect 10.3 or older.  The locale code was replaced with DragonFlyBSD's version in time for 11.0, and that seems to have imported the bug.
Comment 3 Kurtis Rader 2016-09-27 03:36:49 UTC
Yes, you're missing something. The reason those code points are in the Arabic Presentation Forms code block is a historical accident, that is, contingent on how Unicode developed over many years. From the PDF standards document I linked to:

> In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not “Arabic noncharacters” or “right-to-left noncharacters,” and are not distinguished in any other way from the other noncharacters, except in their code point values.

They were reserved as private-use code points because they had not already been assigned as characters (glyphs) and it was decided that adding another 32 private use code points would be useful.
Comment 4 Baptiste Daroussin freebsd_committer 2016-09-27 06:23:07 UTC
It looks like the reporter is right about the bug, and I have an idea where it comes from. I will have a look and issue an erratum for 11.0-RELEASE if needed.
Comment 5 Warner Losh freebsd_committer 2016-09-27 13:46:28 UTC
The Unicode text, manually inserting <CR>s, is:

In addition, there is a contiguous range of another 32 noncharacter code points
in the BMP: U+FDD0..U+FDEF. For historical reasons, the range U+FDD0..U+FDEF is
contained within the Arabic Presentation Forms-A block, but those noncharacters
are not “Arabic noncharacters” or “right-to-left noncharacters,” and are not
distinguished in any other way from the other noncharacters, except in their
code point values.

so for what little expertise I have, it seems pretty clear.

[[ Sorry to repeat it, but Kurtis' quoting got mangled by Bugzilla ]]
Comment 6 Dave Cottlehuber freebsd_committer 2016-09-27 21:15:36 UTC
# work-around

In case others are looking for a short-term solution, install or build from
source shells/fish-2.2.0 and then `pkg lock -y shells/fish`. Thanks Kurt for the investigation.
Comment 7 Kurtis Rader 2016-09-27 21:31:06 UTC
Note that the fish project intends to have a workaround in place when the fish 2.4.0 beta begins in a few days. So unless you're really desperate and can live without all the goodness that is in the fish 2.3.x release (e.g., the "string" builtin) I'd recommend waiting for the 2.4.0 release.

Also, I mistakenly used the phrase "private use code point" (PUA) in my original comment. Note that the code points causing problems for users of the fish shell are in the "noncharacter" range U+FDD0..U+FDEF. They are explicitly not to be classified as characters or have an associated glyph (i.e., iswgraph() should return false). See the Unicode FAQ, http://www.unicode.org/faq/private_use.html.
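
To make the distinction between private-use code points and noncharacters concrete, here is a minimal sketch of the two predicates. The function names are hypothetical; the ranges are the ones the Unicode standard defines (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes for noncharacters; U+E000..U+F8FF and planes 15 and 16 for private use).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True for the 66 Unicode noncharacters: U+FDD0..U+FDEF and the last
 * two code points of every plane (U+xxFFFE and U+xxFFFF). */
bool is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;
    return (cp & 0xFFFE) == 0xFFFE;
}

/* True for private-use code points: the BMP Private Use Area and the
 * two supplementary private use planes. */
bool is_private_use(uint32_t cp)
{
    return (cp >= 0xE000 && cp <= 0xF8FF) ||
           (cp >= 0xF0000 && cp <= 0xFFFFD) ||
           (cp >= 0x100000 && cp <= 0x10FFFD);
}

/* Sanity check: the whole code space contains exactly 66 noncharacters
 * (32 in the BMP range plus 2 per plane for 17 planes). */
void self_check(void)
{
    int n = 0;

    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++) {
        if (is_noncharacter(cp))
            n++;
    }
    assert(n == 66);
}
```

With these definitions U+FDD0 is a noncharacter and not a private-use code point, which is the correction being made in this comment.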
Comment 8 commit-hook freebsd_committer 2016-10-06 19:46:49 UTC
A commit references this bug:

Author: bapt
Date: Thu Oct  6 19:46:43 UTC 2016
New revision: 306782
URL: https://svnweb.freebsd.org/changeset/base/306782

Log:
  localedef: Fix ctype dump (fixed wide spread errors)

  This commit is from John Marino in dragonfly with the following commit log:

  ====
  This was a CTYPE encoding error involving consecutive points of the same
  ctype.  It was reported by myself to Illumos over a year ago but I was
  unsure if it was only happening on BSD.  Given the cause, the bug is also
  present on Illumos.

  Basically, if consecutive points were of the exact same ctype, they would
  be defined as a range regardless.  For example, all of these would be
  considered equivalent:

    <A> ... <C>, <H>  (converts to <A> .. <H>)
    <A>, <B>, <H>     (converts to <A> .. <H>)
    <A>, <J> ... <H>  (converts to <A> .. <H>)

  So all the points that shouldn't have been defined got "bridged" by the
  extreme points.

  The effects were recently reported to FreeBSD on PR 213013.  There are
  countless places where the ctype flags are misdefined, so this is a major
  fix that has to be MFC'd.
  ====

  This reveals a bad change I did on the testsuite: while 0x07FF is a valid
  unicode it is not used yet (reserved for future use)

  PR:		213013
  Submitted by:	marino@
  Reported by:	Kurtis Rader <krader@skepticism.us>
  Obtained from:	Dragonfly
  MFC after:	1 month

Changes:
  head/contrib/netbsd-tests/lib/libc/locale/t_mbstowcs.c
  head/usr.bin/localedef/ctype.c
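
The "bridging" effect described in the commit log above can be illustrated with a short sketch. This is not the localedef code from usr.bin/localedef/ctype.c; the struct names, `build_ranges()`, and the `require_adjacent` flag are all hypothetical and exist only to show how merging entries by ctype alone swallows the undefined points in between.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct entry { uint32_t cp; unsigned ctype; };     /* one defined code point */
struct range { uint32_t lo, hi; unsigned ctype; }; /* one emitted range */

/* Collapse a sorted list of entries into ranges.  With require_adjacent
 * set to 0 an entry is merged into the previous range whenever the ctype
 * matches (the bridging behavior); with 1 the code point must also
 * directly follow the end of that range. */
size_t build_ranges(const struct entry *in, size_t n, struct range *out,
                    int require_adjacent)
{
    size_t nr = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        int same_type = nr > 0 && out[nr - 1].ctype == in[i].ctype;
        int adjacent = nr > 0 && out[nr - 1].hi + 1 == in[i].cp;

        if (same_type && (adjacent || !require_adjacent)) {
            out[nr - 1].hi = in[i].cp;       /* extend the current range */
        } else {
            out[nr].lo = out[nr].hi = in[i].cp;
            out[nr].ctype = in[i].ctype;
            nr++;                            /* start a new range */
        }
    }
    return nr;
}

/* Return 1 if cp falls inside any of the nr ranges. */
int range_covers(const struct range *r, size_t nr, uint32_t cp)
{
    for (size_t i = 0; i < nr; i++) {
        if (cp >= r[i].lo && cp <= r[i].hi)
            return 1;
    }
    return 0;
}
```

For the input <A>, <B>, <H> with identical ctype, the bridging variant produces a single range <A>..<H> that also covers <E>, while the adjacency-checking variant produces <A>..<B> and <H> and leaves <E> undefined.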
Comment 9 Baptiste Daroussin freebsd_committer 2016-10-06 19:52:25 UTC
The issue is fixed in head (12). I will merge it to 11 in a month and will propose the fix for an erratum. Thanks for reporting.
Comment 10 commit-hook freebsd_committer 2016-11-05 09:47:30 UTC
A commit references this bug:

Author: bapt
Date: Sat Nov  5 09:46:49 UTC 2016
New revision: 308330
URL: https://svnweb.freebsd.org/changeset/base/308330

Log:
  MFC r306782-r306783

  r306782:
  localedef: Fix ctype dump (fixed wide spread errors)

  This commit is from John Marino in dragonfly with the following commit log:

  ====
  This was a CTYPE encoding error involving consecutive points of the same
  ctype.  It was reported by myself to Illumos over a year ago but I was
  unsure if it was only happening on BSD.  Given the cause, the bug is also
  present on Illumos.

  Basically, if consecutive points were of the exact same ctype, they would
  be defined as a range regardless.  For example, all of these would be
  considered equivalent:

    <A> ... <C>, <H>  (converts to <A> .. <H>)
    <A>, <B>, <H>     (converts to <A> .. <H>)
    <A>, <J> ... <H>  (converts to <A> .. <H>)

  So all the points that shouldn't have been defined got "bridged" by the
  extreme points.

  The effects were recently reported to FreeBSD on PR 213013.  There are
  countless places where the ctype flags are misdefined, so this is a major
  fix that has to be MFC'd.
  ====

  This reveals a bad change I did on the testsuite: while 0x07FF is a valid
  unicode it is not used yet (reserved for future use)

  PR:		213013
  Submitted by:	marino@
  Reported by:	Kurtis Rader <krader@skepticism.us>
  Obtained from:	Dragonfly
  MFC after:	1 month

  r306783:
  localedef: Improve cc_list parsing

  original commit log:
  =====
  I had originally suspected the parsing of ctype definition files as being
  the source of the ctype flag mis-definitions, but it wasn't.  In the
  process, I simplified the cc_list parsing so I'm committing the no-impact
  improvement separately.  It removes some parsing redundancies and
  won't parse partial range definitions anymore.
  ====

  Submitted by:	marino
  Obtained from:	Dragonfly
  MFC after:	1 month

Changes:
_U  stable/11/
  stable/11/contrib/netbsd-tests/lib/libc/locale/t_mbstowcs.c
  stable/11/usr.bin/localedef/ctype.c
  stable/11/usr.bin/localedef/parser.y
Comment 11 Baptiste Daroussin freebsd_committer 2016-12-10 16:02:53 UTC
An erratum has been issued.