Bug 275444 - isprint() library function returns wrong when LC_CTYPE is ja_JP.SJIS (tcsh aborts by this)
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: misc
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Yuri Pankov
URL: https://reviews.freebsd.org/D42848
Keywords:
Depends on:
Blocks:
 
Reported: 2023-11-30 04:56 UTC by uratan
Modified: 2023-12-27 12:44 UTC
0 users

See Also:
linimon: mfc-stable14?
linimon: mfc-stable13?


Attachments
a test C code to confirm the problem (1005 bytes, text/plain)
2023-11-30 04:56 UTC, uratan
SJIS charmap from CLDR (419.45 KB, text/plain)
2023-11-30 09:24 UTC, Yuri Pankov
list of disappeared mappings (161.61 KB, text/plain)
2023-12-02 18:10 UTC, uratan

Description uratan 2023-11-30 04:56:15 UTC
Created attachment 246681 [details]
a test C code to confirm the problem

(I just found the same problem reported in bug #264299, but I am reporting it anyway.)

When LC_CTYPE is ja_JP.SJIS, the isprint() library function returns
0 (not printable) for both '\' (0x5c) and '~' (0x7e).
Probably because of this, tcsh/csh aborts with a core dump
on a simple "printenv" command.

See the result of the test below.
 (the test C code is attached to this report)
  +-----------------------------------------------------
  |% setenv LC_CTYPE ja_JP.eucJP
  |% ./z-test-isprint
  |isalnum('3') is 1
  |isalnum('B') is 1
  |isalnum('\') is 0
  |isalnum('~') is 0
  | isascii('3') is 1
  | isascii('B') is 1
  | isascii('\') is 1
  | isascii('~') is 1
  |isprint('3') is 1
  |isprint('B') is 1
  |isprint('\') is 1           <===
  |isprint('~') is 1           <===
  |% csh
  |% printenv
 -- omitted --
  |EDITOR=vim
  |LC_CTYPE=ja_JP.eucJP
  |EXINIT=source ~/.exrc
  |PAGER=jless
  |% exit
  |exit
  |
  |% setenv LC_CTYPE ja_JP.SJIS
  |% ./z-test-isprint
  |isalnum('3') is 1
  |isalnum('B') is 1
  |isalnum('\') is 0
  |isalnum('~') is 0
  | isascii('3') is 1
  | isascii('B') is 1
  | isascii('\') is 1
  | isascii('~') is 1
  |isprint('3') is 1
  |isprint('B') is 1
  |isprint('\') is 0           <===
  |isprint('~') is 0           <===
  |% csh
  |% printenv
 -- omitted --
  |EDITOR=vim
  |LC_CTYPE=ja_JP.SJIS
  |Segmentation fault (core dumped)       <===
  |% ls
  |csh.core         z-test-isprint*
  |typescript       z-test-isprint.c
  +-----------------------------------------------------
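
The attached test program is not reproduced in this report. As a rough idea of what such a check could look like, here is a minimal sketch; it is not the attachment itself, and the helper names isprint_in_locale() and report_isprint() are made up for illustration. It uses only setlocale() and isprint() from the standard C library.

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/*
 * Return 1 if isprint(c) is true under the named LC_CTYPE locale,
 * 0 if it is false, or -1 if the locale could not be selected.
 * Note: this changes the process-wide LC_CTYPE setting.
 * (Hypothetical helper, not taken from the attached test code.)
 */
static int
isprint_in_locale(const char *locale, int c)
{
    if (setlocale(LC_CTYPE, locale) == NULL)
        return -1;
    return isprint(c) != 0;
}

/* Print one line per probe character, in the style of the output above. */
static void
report_isprint(const char *locale)
{
    static const char probes[] = { '3', 'B', '\\', '~' };
    size_t i;

    for (i = 0; i < sizeof(probes); i++)
        printf("isprint('%c') is %d\n", probes[i],
            isprint_in_locale(locale, probes[i]));
}
```

Calling report_isprint("ja_JP.SJIS") from main() on an affected system would be expected, going by the output above, to print 0 for '\' and '~'. A result of -1 means the locale is not installed.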

 - * - * -

Below I describe the mechanism, referring to these files by short names:
  /usr/share/locale/ja_JP.eucJP/LC_CTYPE                --> F1.eucJP/LC_CTYPE
  /usr/share/locale/ja_JP.SJIS/LC_CTYPE                 --> F2.SJIS/LC_CTYPE
  /usr/src/share/ctypedef/ja_JP.eucJP.src               --> F3.ja_JP.eucJP.src
  /usr/src/tools/tools/locale/etc/final-maps/map.eucJP  --> F4.map.eucJP
  /usr/src/tools/tools/locale/etc/final-maps/map.SJIS   --> F5.map.SJIS
  /usr/src/tools/tools/locale/etc/final-maps/widths.txt --> F6.widths.txt

  In /usr/src/share/ctypedef/,
  F1.eucJP/LC_CTYPE is made from:
                    F6.widths.txt, F4.map.eucJP, F3.ja_JP.eucJP.src
  and F2.SJIS/LC_CTYPE is made from:
                    F6.widths.txt, F5.map.SJIS,  F3.ja_JP.eucJP.src

  F4.map.eucJP has these mappings for the characters in question:
    +-------------------------------
    |<REVERSE_SOLIDUS> \x5c
    |<TILDE> \x7e
    +-------------------------------
  and F5.map.SJIS has these:
    +-------------------------------
    |<YEN_SIGN> \x5c
    |<OVERLINE> \x7e
    +-------------------------------
   (the other characters below \x7f have the same names in both maps)

  F3.ja_JP.eucJP.src is reused to build both LC_CTYPE files.
  It lists <REVERSE_SOLIDUS> and <TILDE> in several sections (including 'print')
  but has no <YEN_SIGN> or <OVERLINE> in any section.
  See the summary of F3.ja_JP.eucJP.src below.
       +-------------------------------------------
     1 |# Warning: Do not edit. This file is automatically extracted from the
     2 |# tools in /usr/src/tools/tools/locale. The data is obtained from the
     3 |# CLDR project, obtained from http://cldr.unicode.org/
     4 |# -----------------------------------------------------------------------------
     5 |comment_char *
     6 |escape_char /
     7 |LC_CTYPE
     8 |*************
     9 |
    10 |upper   <A>;/
    11 |        <B>;/
   240 |lower   <a>;/
   478 |alpha   <CARON>;/
 12872 |space   <tab>;/
 12880 |cntrl   <NULL>;/
 12914 |graph   <EXCLAMATION_MARK>;/
 12932 |        <three>;/
 12947 |        <B>;/
 12973 |        <REVERSE_SOLIDUS>;/                <===
 13007 |        <TILDE>;/                          <===
 26017 |print   <space>;/
 26036 |        <three>;/
 26051 |        <B>;/
 26077 |        <REVERSE_SOLIDUS>;/                <===
 26111 |        <TILDE>;/                          <===
 39122 |punct   <EXCLAMATION_MARK>;/
 39140 |        <REVERSE_SOLIDUS>;/                <===
 39207 |digit   <zero>;/
 39210 |        <three>;/
 39218 |xdigit  <zero>;/
 39221 |        <three>;/
 39229 |        <B>;/
 39241 |blank   <tab>;/
 39245 |toupper (<a>,<A>);/
 39246 |        (<b>,<B>);/
 39474 |tolower (<A>,<a>);/
 39475 |        (<B>,<b>);/
 39703 |END LC_CTYPE
       +-------------------------------------------

  So '\' and '~' are not classified as printable in F2.SJIS/LC_CTYPE,
  and I think that is why isprint() returns the wrong result with
  LC_CTYPE=ja_JP.SJIS.
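
  The name-based lookup described above can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not how the real locale build tools are implemented, the tables hold just the two entries quoted in this report, and the names map_eucjp, map_sjis, print_names and in_print_class() are invented for the example. A code point counts as 'print' only if the charmap gives it a name that appears in the shared 'print' list.

```c
#include <stddef.h>
#include <string.h>

/* One charmap line: symbolic name and the byte it maps to. */
struct mapping {
    const char *name;
    unsigned char code;
};

/* Excerpts quoted in the report (the real maps are far larger). */
static const struct mapping map_eucjp[] = {
    { "REVERSE_SOLIDUS", 0x5c },
    { "TILDE", 0x7e },
};
static const struct mapping map_sjis[] = {
    { "YEN_SIGN", 0x5c },
    { "OVERLINE", 0x7e },
};

/* Names listed under 'print' in the shared ctype source (excerpt). */
static const char *const print_names[] = { "REVERSE_SOLIDUS", "TILDE" };

/*
 * Toy model: return 1 if `code` has a name in `map` that is also
 * listed in print_names, else 0.
 */
static int
in_print_class(const struct mapping *map, size_t nmap, unsigned char code)
{
    size_t i, j;

    for (i = 0; i < sizeof(print_names) / sizeof(print_names[0]); i++)
        for (j = 0; j < nmap; j++)
            if (map[j].code == code &&
                strcmp(map[j].name, print_names[i]) == 0)
                return 1;
    return 0;
}
```

  In this model 0x5c and 0x7e land in 'print' with the eucJP-style map but not with the SJIS-style map, because <YEN_SIGN> and <OVERLINE> never appear in the class list.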

 - * - * -

WORKAROUND (quick hack)

  Build a new F2.SJIS/LC_CTYPE after renaming these character names
  in F5.map.SJIS:
    <YEN_SIGN> --> <REVERSE_SOLIDUS>
    <OVERLINE> --> <TILDE>

WORKAROUND (proper)

  Obtain a proper ja_JP.SJIS.src file from somewhere and use it
  to build F2.SJIS/LC_CTYPE.

 - * - * -

Confirming from tcsh side

  I confirmed this problem from the tcsh side; see the code below.
  It is the function xputchar() in /usr/src/contrib/tcsh/sh.print.c.
  Starting at line 167, xputchar() writes an unprintable character
  in "\nnn" octal format.
     +-----------------------------------------------------
 144 |void
 145 |xputchar(int c)
 146 |{
 147 |    int     atr;
 148 |
 149 |    atr = c & ATTRIBUTES & TRIM;
 150 |    c &= CHAR | QUOTE;
 151 |    if (!output_raw && (c & QUOTE) == 0) {
 152 |        if (iscntrl(c) && (ASC(c) < 0x80 || MB_CUR_MAX == 1)) {
 153 |            if (c != '\t' && c != '\n'
 154 |#ifdef COLORCAT
 155 |                && !(adrof(STRcolorcat) && c == CTL_ESC('\033'))
 156 |#endif
 157 |                && (xlate_cr || c != '\r'))
 158 |            {
 159 |                xputchar('^' | atr);
 160 |                if (c == CTL_ESC('\177'))
 161 |                    c = '?';
 162 |                else
 163 |                    /* Note: for IS_ASCII, this compiles to: c = c | 0100 */
 164 |                    c = CTL_ESC(ASC(c)|0100);
 165 |            }
 166 |        }
 167 |        else if (!isprint(c) && (ASC(c) < 0x80 || MB_CUR_MAX == 1)) {
 168 |            xputchar('\\' | atr);
 169 |            xputchar((((c >> 6) & 7) + '0') | atr);
 170 |            xputchar((((c >> 3) & 7) + '0') | atr);
 171 |            c = (c & 7) + '0';
 172 |        }
 173 |        (void) putraw(c | atr);
 174 |    }
 175 |    else {
 176 |        c &= TRIM;
 177 |        if (haderr ? (didfds ? is2atty : isdiagatty) :
 178 |            (didfds ? is1atty : isoutatty))
 179 |            SetAttributes(c | atr);
 180 |        (void) putpure(c);
 181 |    }
 182 |    if (lbuffed && (c & CHAR) == '\n')
 183 |        flush();
 184 |}
     +-----------------------------------------------------

  The trigger was the '~' character in my environment variable EXINIT,
  which is (wrongly) detected as unprintable.
  xputchar() therefore emits the leading '\' by calling itself
  recursively (line 168). In that child call, '\' is also wrongly
  detected as unprintable, so it calls itself again, and again...

  The result is infinite recursion that continues until the stack
  overflows.
  The '\' character should never be unprintable for xputchar().
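
  The recursion can be reproduced in isolation with a small model. This is a sketch, not tcsh code: escape_depth(), sane_printable(), broken_printable() and DEPTH_LIMIT are hypothetical names, the depth limit exists only so the demonstration terminates, and only the first recursive call (the one that emits '\') is modelled.

```c
/* Predicate type standing in for isprint(). */
typedef int (*printable_fn)(int);

/* Hypothetical guard so the demonstration stops instead of crashing. */
#define DEPTH_LIMIT 64

/*
 * Model of the escape path in xputchar(): if c is not printable,
 * emit '\' first by recursing (as line 168 does).
 * Returns the recursion depth reached, or -1 if DEPTH_LIMIT was
 * exceeded, i.e. the real code would recurse until stack overflow.
 */
static int
escape_depth(int c, printable_fn printable, int depth)
{
    if (depth > DEPTH_LIMIT)
        return -1;
    if (!printable(c))
        return escape_depth('\\', printable, depth + 1);
    return depth;
}

/* Printable ASCII as one would expect it. */
static int
sane_printable(int c)
{
    return c >= 0x20 && c <= 0x7e;
}

/* Same, but '\' and '~' are reported as unprintable, as in this bug. */
static int
broken_printable(int c)
{
    return sane_printable(c) && c != '\\' && c != '~';
}
```

  With sane_printable the call for '~' returns immediately at depth 0. With broken_printable the same call never reaches a printable character and hits the guard, which corresponds to the stack overflow seen in csh.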

 - * - * -

p.s.
  I am satisfied with my quick hack for now.

  It seems that character names such as <TILDE> are used only
  within F3.ja_JP.eucJP.src, F4.map.eucJP and F5.map.SJIS,
  and are not stored in F1.eucJP/LC_CTYPE or F2.SJIS/LC_CTYPE,
  so my quick hack may be a complete solution if that is true.

  I also think it is natural for the isXXXX() functions in any
  ja_JP.* environment to return the same result for character codes
  0x00 to 0x7f regardless of the LC_CTYPE setting
    (and regardless of the appearance of the character/font).
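
  The expectation above (same isprint() answers for 0x00 to 0x7f in every locale) could be checked with something like the following. It is a minimal sketch; ascii_print_mask() and count_isprint_mismatches() are hypothetical helper names, and the check covers isprint() only.

```c
#include <ctype.h>
#include <locale.h>

/*
 * Fill mask[c] with 1/0 for isprint(c), c in 0x00..0x7f, under the
 * named LC_CTYPE locale. Returns -1 if the locale cannot be selected.
 * Note: this changes the process-wide LC_CTYPE setting.
 */
static int
ascii_print_mask(const char *locale, unsigned char mask[128])
{
    int c;

    if (setlocale(LC_CTYPE, locale) == NULL)
        return -1;
    for (c = 0; c < 128; c++)
        mask[c] = isprint(c) != 0;
    return 0;
}

/*
 * Count the codes in 0x00..0x7f for which isprint() disagrees between
 * two locales. Returns -1 if either locale is unavailable.
 */
static int
count_isprint_mismatches(const char *a, const char *b)
{
    unsigned char ma[128], mb[128];
    int c, n = 0;

    if (ascii_print_mask(a, ma) != 0 || ascii_print_mask(b, mb) != 0)
        return -1;
    for (c = 0; c < 128; c++)
        n += ma[c] != mb[c];
    return n;
}
```

  Going by the test output earlier in this report, count_isprint_mismatches("ja_JP.eucJP", "ja_JP.SJIS") would be expected to be nonzero on an affected system and 0 once the charmap is fixed.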
Comment 1 Yuri Pankov freebsd_committer freebsd_triage 2023-11-30 09:24:51 UTC
Created attachment 246683 [details]
SJIS charmap from CLDR
Comment 2 Yuri Pankov freebsd_committer freebsd_triage 2023-11-30 09:27:07 UTC
(In reply to uratan from comment #0)
Could you please try replacing tools/tools/locale/etc/final-maps/map.SJIS file with the one in attachment and doing `make all install` in share/ctypedef?  It seems to fix the \ and ~ characters for me, but checking that it doesn't break anything else would be helpful.
Comment 3 uratan 2023-12-02 18:10:08 UTC
Created attachment 246729 [details]
list of disappeared mappings

> Could you please try replacing
> tools/tools/locale/etc/final-maps/map.SJIS file with the one in attachment
> and doing `make all install` in share/ctypedef?
> It seems to fix the \ and ~ characters for me, but checking that
> it doesn't break anything else would be helpful.

I did so, and confirmed that with the new LC_CTYPE, printenv in tcsh/csh
can print an environment variable containing '~':
 +----------------------------------------
 |-r--r--r--  1 root wheel 4700 Dec  1 04:14 /usr/share/locale/ja_JP.SJIS/LC_CTYPE
 |MD5 (/usr/share/locale/ja_JP.SJIS/LC_CTYPE) = 13bb935e7ba1275a263cf0074ceaa352
 +----------------------------------------

 - * - * -

I analyzed your new map file by text processing
because I am not familiar with either KANJI-codes
or LC_CTYPE with multibyte chars.

Here are the files used:
 +----------------------------------------
 |MD5 (map.SJIS.orig) = 5a3ff0eb3cf5efe1fa9d44dc6a494e57 ...of 14.0R...[A]
 |MD5 (map.SJIS.yuri) = 7adad9b105608e4fd16a5e9f00af7a43 ... yours  ...[B]
 +----------------------------------------

[A] has 11495 mappings, including 96 aliases for chars under 0x7f
                              and | <YEN_SIGN> \x5c
                                  | <OVERLINE> \x7e
[B] has  7166 mappings, including same 96 aliases for chars under 0x7f
                              and | <REVERSE_SOLIDUS>       \x5C
                                  | <TILDE>                 \x7E
So [B] has 4329 fewer mappings than [A].

The attached file "www-disappeared.txt" (referred to as [C]) lists
the mappings that disappeared going from [A] to [B], sorted by code.
It has 4331 entries, including <YEN_SIGN> and <OVERLINE>.
(That is consistent with 2 mappings being new in [B]:
11495 - 4331 + 2 = 7166.)

I am using this web page as an (easy) reference for SJIS codes.
  http://charset.7jp.net/sjis.html
The author recommends against using the characters shown with a
light-blue background in the bottom table, because they cannot be
mapped to Unicode; these include 'machine/OS dependent characters'.

I checked the disappeared mappings in [C] against the table by eye;
they all have the light-blue background.

So your [B] should be fine, I think.
Comment 4 commit-hook freebsd_committer freebsd_triage 2023-12-20 06:28:46 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a4a9d2a64a4ded0adf27cbc1af82669112c41887

commit a4a9d2a64a4ded0adf27cbc1af82669112c41887
Author:     Yuri Pankov <yuripv@FreeBSD.org>
AuthorDate: 2023-11-30 09:04:55 +0000
Commit:     Yuri Pankov <yuripv@FreeBSD.org>
CommitDate: 2023-12-20 06:24:22 +0000

    locale: regenerate SJIS charmap from CLDR

    There are some noticable differences affecting the '\' and '~'
    characters between our current charmap and latest CLDR version.

    PR:             264299, 275444
    Reviewed by:    bapt
    Tested by:      uratan@miomio.jp
    Differential Revision:  https://reviews.freebsd.org/D42848

 tools/tools/locale/etc/final-maps/map.SJIS | 18682 ++++++++++-----------------
 1 file changed, 7185 insertions(+), 11497 deletions(-)
Comment 5 Mark Linimon freebsd_committer freebsd_triage 2023-12-27 12:31:46 UTC
^Triage: assign to committer that resolved; set possible MFC flags.
Comment 6 Yuri Pankov freebsd_committer freebsd_triage 2023-12-27 12:44:35 UTC
Sorry, the fix here was reverted as incomplete (broken build due to bad testing, unclean environment); working on a complete one.