Summary: | unicode private use chars incorrectly classified | ||
---|---|---|---|
Product: | Base System | Reporter: | Kurtis Rader <krader> |
Component: | standards | Assignee: | Baptiste Daroussin <bapt> |
Status: | Closed FIXED | ||
Severity: | Affects Many People | CC: | asomers, david.guyot, dch, imp, me, pi |
Priority: | --- | ||
Version: | CURRENT | ||
Hardware: | Any | ||
OS: | Any | ||
URL: | https://github.com/fish-shell/fish-shell/issues/3050 |
Description
Kurtis Rader
2016-09-27 01:46:15 UTC
0xFDD0 and 0xFDEF aren't from the private use code block; they're from the Arabic Presentation Forms code block. Or am I missing something?

FYI, this bug does not affect 10.3 or older. The locale code was replaced with DragonFlyBSD's version in time for 11.0, and that seems to have imported the bug.

Yes, you're missing something. The reason those code points are in the Arabic Presentation Forms code block is a historical accident. That is, contingent on the development of Unicode over many years. From the PDF standards document I linked to:
> In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not “Arabic noncharacters” or “right-to-left noncharacters,” and are not distinguished in any other way from the other noncharacters, except in their code point values.
They were reserved as private-use code points because they had not already been assigned as characters (glyphs) and it was decided that adding another 32 private use code points would be useful.
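To make the code point ranges under discussion concrete, here is a minimal C sketch of a noncharacter test. The function names are hypothetical and are not part of libc or localedef. The ranges used are the 32 code points U+FDD0..U+FDEF quoted above, plus the last two code points of each of the 17 planes, which is my understanding of the Unicode definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a libc function): returns true if cp is one
 * of the Unicode noncharacter code points.
 */
bool
is_unicode_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;               /* outside the Unicode code space */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;                /* contiguous range of 32 in the BMP */
    /* The last two code points of every plane: U+nFFFE and U+nFFFF. */
    return (cp & 0xFFFE) == 0xFFFE;
}

/*
 * Hypothetical self-check: walks the whole code space and counts the
 * noncharacters. 32 in the BMP range plus 2 per plane over 17 planes.
 */
unsigned
count_noncharacters(void)
{
    unsigned n = 0;

    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++)
        if (is_unicode_noncharacter(cp))
            n++;
    assert(n == 32 + 2 * 17);       /* 66 noncharacters in total */
    return n;
}
```

On a system with correct ctype data, one would expect iswgraph() to return false for every code point this helper accepts; the sketch itself does not depend on the installed locale.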
This looks like the reporter is right about the bug. I have an idea where that can come from. I will have a look at issuing an errata on 11.0-RELEASE if needed.

The Unicode text, manually inserting <CR>s, is:

In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not “Arabic noncharacters” or “right-to-left noncharacters,” and are not distinguished in any other way from the other noncharacters, except in their code point values.

so for what little expertise I have, it seems pretty clear.

[[ Sorry to repeat it, but Kurtis' quoting got mangled by Bugzilla ]]

# work-around

In case others are looking for a short-term solution, install or build from source shells/fish-2.2.0 & then `pkg lock -y shells/fish`.

Thanks Kurt for the investigation.

Note that the fish project intends to have a workaround in place when the fish 2.4.0 beta begins in a few days. So unless you're really desperate and can live without all the goodness that is in the fish 2.3.x release (e.g., the "string" builtin) I'd recommend waiting for the 2.4.0 release.

Also, I mistakenly used the phrase "private use code point" (PUA) in my original comment. Note that the code points causing problems for users of the fish shell are in the "noncharacter" range U+FDD0..U+FDEF. They are explicitly not to be classified as characters or have an associated glyph (i.e., iswgraph() should return false). See the Unicode FAQ, http://www.unicode.org/faq/private_use.html.

A commit references this bug:

Author: bapt
Date: Thu Oct 6 19:46:43 UTC 2016
New revision: 306782
URL: https://svnweb.freebsd.org/changeset/base/306782

Log:
  localedef: Fix ctype dump (fixed wide spread errors)

  This commit is from John Marino in dragonfly with the following commit log:

  ====
  This was a CTYPE encoding error involving consecutive points of the same ctype. It was reported by myself to Illumos over a year ago but I was unsure if it was only happening on BSD. Given the cause, the bug is also present on Illumos.

  Basically, if consecutive points were of the exact same ctype, they would be defined as a range regardless. For example, all of these would be considered equivalent:
    <A> ... <C>, <H> (converts to <A> .. <H>)
    <A>, <B>, <H> (converts to <A> .. <H>)
    <A>, <J> ... <H> (converts to <A> .. <H>)

  So all the points that shouldn't have been defined got "bridged" by the extreme points.

  The effects were recently reported to FreeBSD on PR 213013. There are countless places were the ctype flags are misdefined, so this is a major fix that has to be MFC'd.
  ====

  This reveals a bad change I did on the testsuite: while 0x07FF is a valid unicode it is not used yet (reserved for future use)

  PR: 213013
  Submitted by: marino@
  Reported by: Kurtis Rader <krader@skepticism.us>
  Obtained from: Dragonfly
  MFC after: 1 month

Changes:
  head/contrib/netbsd-tests/lib/libc/locale/t_mbstowcs.c
  head/usr.bin/localedef/ctype.c

The issue is fixed in head (12). I will merge it into 11 in a month and will propose the fix for an errata. Thanks for reporting.

A commit references this bug:

Author: bapt
Date: Sat Nov 5 09:46:49 UTC 2016
New revision: 308330
URL: https://svnweb.freebsd.org/changeset/base/308330

Log:
  MFC r306782-r306783

  r306782:
  localedef: Fix ctype dump (fixed wide spread errors)

  This commit is from John Marino in dragonfly with the following commit log:

  ====
  This was a CTYPE encoding error involving consecutive points of the same ctype. It was reported by myself to Illumos over a year ago but I was unsure if it was only happening on BSD. Given the cause, the bug is also present on Illumos.

  Basically, if consecutive points were of the exact same ctype, they would be defined as a range regardless. For example, all of these would be considered equivalent:
    <A> ... <C>, <H> (converts to <A> .. <H>)
    <A>, <B>, <H> (converts to <A> .. <H>)
    <A>, <J> ... <H> (converts to <A> .. <H>)

  So all the points that shouldn't have been defined got "bridged" by the extreme points.

  The effects were recently reported to FreeBSD on PR 213013. There are countless places were the ctype flags are misdefined, so this is a major fix that has to be MFC'd.
  ====

  This reveals a bad change I did on the testsuite: while 0x07FF is a valid unicode it is not used yet (reserved for future use)

  PR: 213013
  Submitted by: marino@
  Reported by: Kurtis Rader <krader@skepticism.us>
  Obtained from: Dragonfly
  MFC after: 1 month

  r306783:
  localedef: Improve cc_list parsing

  original commit log:

  =====
  I had originally suspected the parsing of ctype definition files as being the source of the ctype flag mis-definitions, but it wasn't. In the process, I simplified the cc_list parsing so I'm committing the no-impact improvement separately. It removes some parsing redundancies and won't parse partial range definitions anymore.
  ====

  Submitted by: marino
  Obtained from: Dragonfly
  MFC after: 1 month

Changes:
  _U stable/11/
  stable/11/contrib/netbsd-tests/lib/libc/locale/t_mbstowcs.c
  stable/11/usr.bin/localedef/ctype.c
  stable/11/usr.bin/localedef/parser.y

An errata has been issued.
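
For anyone reading this PR later, the "bridging" defect described in the commit log can be illustrated with a small C sketch. This is not the code from usr.bin/localedef/ctype.c. The struct and both function names are hypothetical, and the buggy variant is a deliberately simplified model of the behavior the log describes, assuming the input is a strictly ascending list of code points that all share one ctype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for this sketch: an inclusive range of code points. */
struct cp_range {
    uint32_t first;
    uint32_t last;
};

/*
 * Intended behavior: a range is extended only when the next point is
 * exactly one past the current end, so gaps stay undefined.
 * out must have room for n entries. Returns the number of ranges.
 */
size_t
compact_ranges(const uint32_t *pts, size_t n, struct cp_range *out)
{
    size_t nr = 0;

    for (size_t i = 0; i < n; i++) {
        assert(i == 0 || pts[i] > pts[i - 1]);  /* strictly ascending */
        if (nr > 0 && pts[i] == out[nr - 1].last + 1) {
            out[nr - 1].last = pts[i];
        } else {
            out[nr].first = pts[i];
            out[nr].last = pts[i];
            nr++;
        }
    }
    return nr;
}

/*
 * Simplified model of the defect: sharing the same ctype is treated as
 * sufficient to extend the range, so every gap between the extreme
 * points is "bridged" and the result is always a single range.
 */
size_t
compact_ranges_bridging(const uint32_t *pts, size_t n, struct cp_range *out)
{
    if (n == 0)
        return 0;
    out[0].first = pts[0];
    out[0].last = pts[n - 1];
    return 1;
}
```

With the points <A>, <B>, <H> from the commit log, compact_ranges yields two ranges (<A>..<B> and <H>), while compact_ranges_bridging yields the single range <A>..<H>, which also covers <C> through <G> even though they were never listed.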