Bug 116363 - isspace broken for UTF-8 locales
Summary: isspace broken for UTF-8 locales
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: gnu
Version: Unspecified
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: Rong-En Fan
Reported: 2007-09-15 10:10 UTC by Petr Hroudny
Modified: 2007-12-19 00:10 UTC (History)
Description Petr Hroudny 2007-09-15 10:10:02 UTC
In UTF-8 locales, isspace(0xA0) returns 1, which is wrong.

In UTF-8, 0xA0 can only be a continuation byte (the second or a later byte) of a multibyte character, never a space on its own.

As a consequence, operations like str.upper() and/or str.split() break when they
encounter a UTF-8 character containing a 0xA0 byte.

An example of such a character is Scaron (UTF-8 encoding 0xC5 0xA0).

Fix:

For UTF-8 locales, 0xA0 should never be considered a space.
Comment 1 Андрей Чернов 2007-09-16 09:54:33 UTC
On Sat, Sep 15, 2007 at 09:08:01AM +0000, Petr Hroudny wrote:
> 
> >Number:         116363
> >Category:       gnu
> >Synopsis:       isspace broken for UTF-8 locales
> >Confidential:   no
> >Severity:       non-critical
> >Priority:       medium
> >Responsible:    freebsd-bugs
> >State:          open
> >Quarter:        
> >Keywords:       
> >Date-Required:
> >Class:          sw-bug
> >Submitter-Id:   current-users
> >Arrival-Date:   Sat Sep 15 09:10:02 GMT 2007
> >Closed-Date:
> >Last-Modified:
> >Originator:     Petr Hroudny
> >Release:        6-stable, 7-current
> >Organization:
> >Environment:
> >Description:
> In UTF-8 locales, isspace(0xA0) returns 1, which is wrong.
> 
> In UTF-8, 0xA0 can only be a continuation byte (the second or a later byte) of a multibyte character, never a space on its own.
> 
> As a consequence, operations like str.upper() and/or str.split() break when they
> encounter a UTF-8 character containing a 0xA0 byte.

It seems that our UTF-8.src is completely wrong: it describes plain Unicode
code points rather than UTF-8, whose multibyte sequences may start only with
a byte in one of these ranges:
C2-DF
E0-EF
F0-F4
(as stated, for example, in http://en.wikipedia.org/wiki/UTF-8).
Can anybody write a replacement for it?
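
The lead-byte ranges listed above can be sketched as a small classifier. This is only an illustration; `utf8_seq_len` is a hypothetical name and is not part of libc or of the eventual fix.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence introduced
 * by lead byte b, or 0 if b cannot start a sequence. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b <= 0x7F)
        return 1;               /* ASCII, single byte */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;               /* two-byte sequence */
    if (b >= 0xE0 && b <= 0xEF)
        return 3;               /* three-byte sequence */
    if (b >= 0xF0 && b <= 0xF4)
        return 4;               /* four-byte sequence */
    /* 0x80-0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5-0xFF
     * never appear in well-formed UTF-8. */
    return 0;
}
```

With this classifier, 0xC5 (the first byte of Scaron) starts a two-byte sequence, while 0xA0 yields 0 because it can never begin a character.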

-- 
http://ache.pp.ru/
Comment 2 Hye-Shik Chang freebsd_committer freebsd_triage 2007-09-16 17:22:14 UTC
On Sun, Sep 16, 2007 at 12:54:33PM +0400, Andrey Chernov wrote:
> On Sat, Sep 15, 2007 at 09:08:01AM +0000, Petr Hroudny wrote:
> > 
> > In UTF-8 locales, isspace(0xA0) returns 1, which is wrong.
> > 
> > In UTF-8, 0xA0 can only be a continuation byte (the second or a later byte) of a multibyte character, never a space on its own.
> > 
> > As a consequence, operations like str.upper() and/or str.split() break when they
> > encounter a UTF-8 character containing a 0xA0 byte.

If you are talking about Python's str.split(), the problem is due
to our libc bug (or feature), which has been described many times before,
and Python already includes a workaround for it.
http://mail.python.org/pipermail/python-checkins/2004-August/042343.html

> It seems that our UTF-8.src is completely wrong: it describes plain Unicode
> code points rather than UTF-8, whose multibyte sequences may start only with
> a byte in one of these ranges:
> C2-DF
> E0-EF
> F0-F4
> (as stated, for example, in http://en.wikipedia.org/wiki/UTF-8).
> Can anybody write a replacement for it?

In fact, UTF-8.src defines values not for UTF-8 byte sequences but for
Unicode code points.  Using the Unicode code point as wchar_t's internal
representation has many benefits.  I think it would be better to make
isspace() and the other ctype functions aware of the "encoding".  IIRC,
tjr@ provided the workaround in the URL mentioned above and said that the
problem might get fixed in 6 or 7; that was in 2004.
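
To make the distinction between bytes and code points concrete, here is a minimal decoding sketch. The name `utf8_decode` is hypothetical, and the function is deliberately simplified: it checks lead and continuation bytes but does not reject overlong three- and four-byte forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the first UTF-8 sequence in s (n bytes
 * available) and return its code point, or -1 on malformed input.
 * Simplified: overlong 3/4-byte forms and surrogates are not rejected. */
static int32_t utf8_decode(const unsigned char *s, size_t n)
{
    int32_t cp;
    size_t len, i;

    if (n == 0)
        return -1;
    if (s[0] <= 0x7F) {
        cp = s[0];
        len = 1;
    } else if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        cp = s[0] & 0x1F;
        len = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        cp = s[0] & 0x0F;
        len = 3;
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        cp = s[0] & 0x07;
        len = 4;
    } else {
        return -1;              /* continuation or invalid lead byte */
    }
    if (n < len)
        return -1;              /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;          /* not a 10xxxxxx continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    return cp;
}
```

The bytes 0xC5 0xA0 decode to U+0160 (Scaron), whereas the code point U+00A0 (NO-BREAK SPACE) is encoded as 0xC2 0xA0. A table keyed by code point that is consulted with a raw byte value would therefore presumably report the 0xA0 inside Scaron as a space, which matches the reported symptom.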

Hye-Shik
Comment 3 Андрей Чернов 2007-09-16 17:34:07 UTC
On Mon, Sep 17, 2007 at 01:22:14AM +0900, Hye-Shik Chang wrote:
> In fact, UTF-8.src defines values not for UTF-8 byte sequences but for
> Unicode code points.  Using the Unicode code point as wchar_t's internal
> representation has many benefits.  I think it would be better to make
> isspace() and the other ctype functions aware of the "encoding".  IIRC,
> tjr@ provided the workaround in the URL mentioned above and said that the
> problem might get fixed in 6 or 7; that was in 2004.

Currently wchar_t represents the given encoding everywhere, including the
wc<->mbr conversions. To make it UCS-4-only instead, we would need to rewrite
the whole locale system from scratch, and I see no benefit in going that way.
No simple workaround exists.

In any case, there is no excuse for a file that is really UCS-4.src to pose
as UTF-8.src. Providing a proper UTF-8.src is much less painful than rewriting
the whole locale system, and I am almost halfway through converting the UCS-4
source to it.

-- 
http://ache.pp.ru/
Comment 4 Petr Hroudny 2007-09-17 09:35:52 UTC
2007/9/16, Hye-Shik Chang <perky@freebsd.org>:

> If you are talking about Python's str.split(), the problem is due
> to our libc bug (or feature), which has been described many times before,
> and Python already includes a workaround for it.
> http://mail.python.org/pipermail/python-checkins/2004-August/042343.html

I ran into this problem when using mutt, which uses isspace() to
separate tokens in, e.g., the list of recipients. Then I found the
workaround for Python, saying this problem should be fixed in FreeBSD 6,
but it is still present even in 7-current.
I do believe it would be better to fix isspace() than to introduce
workarounds into every application.
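
The failure mode in a byte-wise tokenizer can be sketched as follows. This is not mutt's code; `count_tokens`, `buggy_isspace` and `ascii_only_isspace` are hypothetical names, and `buggy_isspace` merely simulates a classifier that accepts 0xA0, as reported for isspace() here.

```c
#include <stddef.h>

/* Simulates the reported behaviour: byte 0xA0 is treated as a space. */
static int buggy_isspace(int c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == 0xA0;
}

/* Separator test restricted to ASCII whitespace. */
static int ascii_only_isspace(int c)
{
    return c == ' ' || c == '\t' || c == '\n';
}

/* Count runs of non-separator bytes in s, scanning byte by byte. */
static size_t count_tokens(const char *s, int (*is_sep)(int))
{
    size_t n = 0;
    int in_token = 0;

    for (; *s != '\0'; s++) {
        if (is_sep((unsigned char)*s)) {
            in_token = 0;
        } else if (!in_token) {
            in_token = 1;
            n++;
        }
    }
    return n;
}
```

For the UTF-8 string "\xC5\xA0koda" (a single word beginning with Scaron), the simulated buggy classifier splits the word in the middle of the character and yields two tokens, while the ASCII-only test yields one.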

Regards, Petr
Comment 5 Andrey A. Chernov freebsd_committer freebsd_triage 2007-10-22 23:23:22 UTC
State Changed
From-To: open->patched

Fixed in -current
Comment 6 dfilter service freebsd_committer freebsd_triage 2007-10-24 15:29:39 UTC
rafan       2007-10-24 14:29:32 UTC

  FreeBSD src repository

  Modified files:        (Branch: RELENG_7)
    include              _ctype.h ctype.h 
    lib/libc/locale      Symbol.map big5.c euc.c gb18030.c 
                         gb2312.c gbk.c isctype.c mskanji.c none.c 
                         setrunelocale.c utf8.c 
    share/mklocale       UTF-8.src 
  Log:
  MFC ctype(3) fix for UTF-8 locale. See original commit log for details.
  
  PR:             116363
  Reported by:    Petr Hroudny <petr.hroudny at gmail.com>
  Patched by:     ache
  Reviewed by:    i18n@
  Approved by:    re (kensmith)
  OK-ed by:       portmgr
  
  Revision   Changes    Path
  1.30.10.1  +34 -1     src/include/_ctype.h
  1.28.18.1  +21 -21    src/include/ctype.h
  1.3.2.1    +5 -0      src/lib/libc/locale/Symbol.map
  1.17.10.1  +3 -0      src/lib/libc/locale/big5.c
  1.21.10.1  +3 -0      src/lib/libc/locale/euc.c
  1.7.10.1   +3 -0      src/lib/libc/locale/gb18030.c
  1.9.10.1   +3 -0      src/lib/libc/locale/gb2312.c
  1.13.2.1   +3 -0      src/lib/libc/locale/gbk.c
  1.10.2.1   +19 -19    src/lib/libc/locale/isctype.c
  1.17.10.1  +3 -0      src/lib/libc/locale/mskanji.c
  1.14.2.1   +6 -1      src/lib/libc/locale/none.c
  1.46.2.1   +5 -0      src/lib/libc/locale/setrunelocale.c
  1.14.2.1   +8 -0      src/lib/libc/locale/utf8.c
  1.2.2.1    +3 -0      src/share/mklocale/UTF-8.src
_______________________________________________
cvs-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/cvs-all
To unsubscribe, send any mail to "cvs-all-unsubscribe@freebsd.org"
Comment 7 dfilter service freebsd_committer freebsd_triage 2007-10-24 15:32:41 UTC
rafan       2007-10-24 14:32:33 UTC

  FreeBSD src repository

  Modified files:        (Branch: RELENG_6)
    include              _ctype.h ctype.h 
    lib/libc/locale      big5.c euc.c gb18030.c gb2312.c gbk.c 
                         isctype.c mskanji.c none.c 
                         setrunelocale.c utf8.c 
    share/mklocale       UTF-8.src 
  Log:
  MFC ctype(3) fix for UTF-8 locale. See original commit log for details.
  
  PR:             116363
  Reported by:    Petr Hroudny <petr.hroudny at gmail.com>
  Patched by:     ache
  Reviewed by:    i18n@
  Approved by:    re (kensmith)
  OK-ed by:       portmgr
  
  Revision  Changes    Path
  1.30.2.1  +34 -1     src/include/_ctype.h
  1.28.8.1  +21 -21    src/include/ctype.h
  1.17.2.1  +3 -0      src/lib/libc/locale/big5.c
  1.21.2.1  +3 -0      src/lib/libc/locale/euc.c
  1.7.2.1   +3 -0      src/lib/libc/locale/gb18030.c
  1.9.2.1   +3 -0      src/lib/libc/locale/gb2312.c
  1.12.2.1  +3 -0      src/lib/libc/locale/gbk.c
  1.9.14.1  +19 -19    src/lib/libc/locale/isctype.c
  1.17.2.1  +3 -0      src/lib/libc/locale/mskanji.c
  1.13.2.1  +6 -1      src/lib/libc/locale/none.c
  1.45.2.1  +5 -0      src/lib/libc/locale/setrunelocale.c
  1.13.2.2  +8 -0      src/lib/libc/locale/utf8.c
  1.1.8.2   +3 -0      src/share/mklocale/UTF-8.src
Comment 8 Rong-En Fan freebsd_committer freebsd_triage 2007-10-24 15:32:49 UTC
State Changed
From-To: patched->closed

MFC'ed in 6.x and 7.x 


Comment 9 Rong-En Fan freebsd_committer freebsd_triage 2007-10-24 15:32:49 UTC
Responsible Changed
From-To: freebsd-bugs->rafan

Track.
Comment 10 dfilter service freebsd_committer freebsd_triage 2007-11-21 01:31:56 UTC
rafan       2007-11-21 01:31:49 UTC

  FreeBSD src repository

  Modified files:        (Branch: RELENG_6)
    include              _ctype.h ctype.h 
    lib/libc/locale      big5.c euc.c gb18030.c gb2312.c gbk.c 
                         isctype.c mskanji.c none.c 
                         setrunelocale.c utf8.c 
    sys/sys              param.h 
  Log:
  - Back out previous ctype(3) fix for UTF-8 locale because forward ABI
    compatibility is still broken as we add new symbols to libc. Those
    symbols are __sbmaskrune, __sbistype, __sbtoupper and __sbtolower.
    The latter three are directly referenced by binaries that use the
    ctype(3) family of functions (see include/ctype.h for details). This
    means that if a binary is built on 6.3, it uses these symbols, which
    are not available on older systems.
  - As this has been in 6 for a month, I intentionally leave these symbols
    in libc but map them to the original versions. So binaries built after
    602113 will not be broken by this commit.
  - Bump __FreeBSD_version for this back-out
  
  PR:             116363
  Discussed with: kris, kensmith
  Approved by:    re (kensmith)
  
  Revision    Changes    Path
  1.30.2.3    +5 -29     src/include/_ctype.h
  1.28.8.2    +21 -21    src/include/ctype.h
  1.17.2.2    +0 -3      src/lib/libc/locale/big5.c
  1.21.2.2    +0 -3      src/lib/libc/locale/euc.c
  1.7.2.2     +0 -3      src/lib/libc/locale/gb18030.c
  1.9.2.2     +0 -3      src/lib/libc/locale/gb2312.c
  1.12.2.2    +0 -3      src/lib/libc/locale/gbk.c
  1.9.14.2    +19 -19    src/lib/libc/locale/isctype.c
  1.17.2.2    +0 -3      src/lib/libc/locale/mskanji.c
  1.13.2.2    +1 -6      src/lib/libc/locale/none.c
  1.45.2.2    +0 -5      src/lib/libc/locale/setrunelocale.c
  1.13.2.3    +0 -8      src/lib/libc/locale/utf8.c
  1.244.2.32  +1 -1      src/sys/sys/param.h
Comment 11 dfilter service freebsd_committer freebsd_triage 2007-12-19 00:04:59 UTC
beech       2007-12-19 00:04:50 UTC

  FreeBSD ports repository

  Modified files:
    www/horde-base       Makefile 
  Added files:
    www/horde-base/files patch-lib_Horde_NLS.php 
  Log:
  - Fix bug "isspace broken for UTF-8 locales."
    Causes Japanese characters to display improperly.
  
  PR:             ports/116363
  Submitted by:   Hiromi Kimura <hiromi@tac.tsukuba.ac.jp>
  Approved by:    linimon (mentor)
  
  Revision  Changes    Path
  1.61      +1 -0      ports/www/horde-base/Makefile
  1.1       +13 -0     ports/www/horde-base/files/patch-lib_Horde_NLS.php (new)