Created attachment 246681 [details]
a test C code to confirm the problem

(I found the same problem reported in bug #264299 just now, but I will report it anyway.)

When LC_CTYPE is ja_JP.SJIS, the isprint() library function returns 0 (not printable) for both '\' (0x5c) and '~' (0x7e).

Probably because of this, tcsh/csh aborts with a core dump on a simple "printenv" command.

See the test results below (the test C code is attached to this report).
+-----------------------------------------------------
|% setenv LC_CTYPE ja_JP.eucJP
|% ./z-test-isprint
|isalnum('3') is 1
|isalnum('B') is 1
|isalnum('\') is 0
|isalnum('~') is 0
| isascii('3') is 1
| isascii('B') is 1
| isascii('\') is 1
| isascii('~') is 1
|isprint('3') is 1
|isprint('B') is 1
|isprint('\') is 1     <===
|isprint('~') is 1     <===
|% csh
|% printenv
    -- omitted --
|EDITOR=vim
|LC_CTYPE=ja_JP.eucJP
|EXINIT=source ~/.exrc
|PAGER=jless
|% exit
|exit
|
|% setenv LC_CTYPE ja_JP.SJIS
|% ./z-test-isprint
|isalnum('3') is 1
|isalnum('B') is 1
|isalnum('\') is 0
|isalnum('~') is 0
| isascii('3') is 1
| isascii('B') is 1
| isascii('\') is 1
| isascii('~') is 1
|isprint('3') is 1
|isprint('B') is 1
|isprint('\') is 0     <===
|isprint('~') is 0     <===
|% csh
|% printenv
    -- omitted --
|EDITOR=vim
|LC_CTYPE=ja_JP.SJIS
|Segmentation fault (core dumped)     <===
|% ls
|csh.core        z-test-isprint*
|typescript      z-test-isprint.c
+-----------------------------------------------------

- * - * -

From here on, I describe the mechanism, referring to these files by short names.
/usr/share/locale/ja_JP.eucJP/LC_CTYPE                  --> F1.eucJP/LC_CTYPE
/usr/share/locale/ja_JP.SJIS/LC_CTYPE                   --> F2.SJIS/LC_CTYPE
/usr/src/share/ctypedef/ja_JP.eucJP.src                 --> F3.ja_JP.eucJP.src
/usr/src/tools/tools/locale/etc/final-maps/map.eucJP    --> F4.map.eucJP
/usr/src/tools/tools/locale/etc/final-maps/map.SJIS     --> F5.map.SJIS
/usr/src/tools/tools/locale/etc/final-maps/widths.txt   --> F6.widths.txt

In /usr/src/share/ctypedef/, F1.eucJP/LC_CTYPE is made from:
    F6.widths.txt, F4.map.eucJP, F3.ja_JP.eucJP.src
and F2.SJIS/LC_CTYPE is made from:
    F6.widths.txt, F5.map.SJIS, F3.ja_JP.eucJP.src

F4.map.eucJP has these mappings for the characters in question:
+-------------------------------
|<REVERSE_SOLIDUS> \x5c
|<TILDE> \x7e
+-------------------------------
and F5.map.SJIS has these:
+-------------------------------
|<YEN_SIGN> \x5c
|<OVERLINE> \x7e
+-------------------------------
(the other chars under \x7f have the same names in both maps)

F3.ja_JP.eucJP.src is reused to build both LC_CTYPE files. It lists <REVERSE_SOLIDUS> and <TILDE> in several classes (including 'print'), but <YEN_SIGN> and <OVERLINE> appear in no class. See the summary of F3.ja_JP.eucJP.src below.
+-------------------------------------------
     1 |# Warning: Do not edit. This file is automatically extracted from the
     2 |# tools in /usr/src/tools/tools/locale. The data is obtained from the
     3 |# CLDR project, obtained from http://cldr.unicode.org/
     4 |# -----------------------------------------------------------------------------
     5 |comment_char *
     6 |escape_char /
     7 |LC_CTYPE
     8 |*************
     9 |
    10 |upper   <A>;/
    11 |        <B>;/
   240 |lower   <a>;/
   478 |alpha   <CARON>;/
 12872 |space   <tab>;/
 12880 |cntrl   <NULL>;/
 12914 |graph   <EXCLAMATION_MARK>;/
 12932 |        <three>;/
 12947 |        <B>;/
 12973 |        <REVERSE_SOLIDUS>;/     <===
 13007 |        <TILDE>;/               <===
 26017 |print   <space>;/
 26036 |        <three>;/
 26051 |        <B>;/
 26077 |        <REVERSE_SOLIDUS>;/     <===
 26111 |        <TILDE>;/               <===
 39122 |punct   <EXCLAMATION_MARK>;/
 39140 |        <REVERSE_SOLIDUS>;/     <===
 39207 |digit   <zero>;/
 39210 |        <three>;/
 39218 |xdigit  <zero>;/
 39221 |        <three>;/
 39229 |        <B>;/
 39241 |blank   <tab>;/
 39245 |toupper (<a>,<A>);/
 39246 |        (<b>,<B>);/
 39474 |tolower (<A>,<a>);/
 39475 |        (<B>,<b>);/
 39703 |END LC_CTYPE
+-------------------------------------------

Because F5.map.SJIS names 0x5c and 0x7e <YEN_SIGN> and <OVERLINE>, and F3.ja_JP.eucJP.src never mentions those names, '\' and '~' are not classified as printable in F2.SJIS/LC_CTYPE. I think this is why isprint() returns the wrong result with LC_CTYPE=ja_JP.SJIS.

- * - * -

WORK AROUND by quick hack
  Make a new F2.SJIS/LC_CTYPE after renaming the char names in F5.map.SJIS as below.
    <YEN_SIGN> --> <REVERSE_SOLIDUS>
    <OVERLINE> --> <TILDE>

WORK AROUND by proper fix
  Get a proper ja_JP.SJIS.src file from somewhere and use it to build F2.SJIS/LC_CTYPE.

- * - * -

Confirming from tcsh side

I confirmed this problem from the tcsh side; see the code below.
It is the function xputchar() in /usr/src/contrib/tcsh/sh.print.c.
From line 167, xputchar() outputs an unprintable char in "\nnn" format.
+-----------------------------------------------------
 144 |void
 145 |xputchar(int c)
 146 |{
 147 |    int atr;
 148 |
 149 |    atr = c & ATTRIBUTES & TRIM;
 150 |    c &= CHAR | QUOTE;
 151 |    if (!output_raw && (c & QUOTE) == 0) {
 152 |        if (iscntrl(c) && (ASC(c) < 0x80 || MB_CUR_MAX == 1)) {
 153 |            if (c != '\t' && c != '\n'
 154 |#ifdef COLORCAT
 155 |                && !(adrof(STRcolorcat) && c == CTL_ESC('\033'))
 156 |#endif
 157 |                && (xlate_cr || c != '\r'))
 158 |            {
 159 |                xputchar('^' | atr);
 160 |                if (c == CTL_ESC('\177'))
 161 |                    c = '?';
 162 |                else
 163 |                    /* Note: for IS_ASCII, this compiles to: c = c | 0100 */
 164 |                    c = CTL_ESC(ASC(c)|0100);
 165 |            }
 166 |        }
 167 |        else if (!isprint(c) && (ASC(c) < 0x80 || MB_CUR_MAX == 1)) {
 168 |            xputchar('\\' | atr);
 169 |            xputchar((((c >> 6) & 7) + '0') | atr);
 170 |            xputchar((((c >> 3) & 7) + '0') | atr);
 171 |            c = (c & 7) + '0';
 172 |        }
 173 |        (void) putraw(c | atr);
 174 |    }
 175 |    else {
 176 |        c &= TRIM;
 177 |        if (haderr ? (didfds ? is2atty : isdiagatty) :
 178 |            (didfds ? is1atty : isoutatty))
 179 |            SetAttributes(c | atr);
 180 |        (void) putpure(c);
 181 |    }
 182 |    if (lbuffed && (c & CHAR) == '\n')
 183 |        flush();
 184 |}
+-----------------------------------------------------

The trigger was the '~' char in my environment variable EXINIT, which is (wrongly) detected as unprintable.
So xputchar() first outputs '\' by calling itself recursively (line 168). In the child call, the '\' char is also wrongly detected as unprintable, so it calls itself once more, and again, and again...
The result is infinite recursion that continues until the stack overflows.

The '\' char should never be unprintable for xputchar().

- * - * -

p.s.
I am satisfied enough with my quick hack for now...

It seems that char names like <TILDE> are used only within F3.ja_JP.eucJP.src, F4.map.eucJP and F5.map.SJIS, and that they are not included in F1.eucJP/LC_CTYPE or F2.SJIS/LC_CTYPE, so my quick hack may be the complete solution if that is true...
Also, I think it is very natural that the isXXXX() functions in any ja_JP.* locale return the same result for char codes 0x00 to 0x7f, regardless of the LC_CTYPE setting (and regardless of the appearance of the char/font).
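The recursion described for xputchar() can be illustrated with a toy model. This is only a sketch: the names toy_xputchar, broken_isprint and sane_isprint are hypothetical, the ^X handling for control chars is left out, and a depth limit stands in for the stack overflow so that the failure can be observed safely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is a toy model, not tcsh code. */
typedef int (*printable_fn)(int);

/* Plain ASCII rule: 0x20..0x7e are printable. */
static int sane_isprint(int c)
{
    return c >= 0x20 && c <= 0x7e;
}

/* Mimics the reported ja_JP.SJIS behaviour: '\' and '~' are rejected. */
static int broken_isprint(int c)
{
    if (c == '\\' || c == '~')
        return 0;
    return sane_isprint(c);
}

/*
 * Simplified xputchar(): an unprintable char is written as "\nnn", and the
 * leading backslash is sent through the same function again.
 * Returns 0 on success, or -1 once depth exceeds max_depth (the point where
 * the real function would simply keep recursing).
 */
static int toy_xputchar(int c, printable_fn is_printable,
                        char *out, size_t outsz, size_t *len,
                        int depth, int max_depth)
{
    assert(is_printable != NULL && out != NULL && len != NULL);
    if (depth > max_depth)
        return -1;
    if (!is_printable(c)) {
        if (toy_xputchar('\\', is_printable, out, outsz, len,
                         depth + 1, max_depth) != 0)
            return -1;
        if (toy_xputchar(((c >> 6) & 7) + '0', is_printable, out, outsz, len,
                         depth + 1, max_depth) != 0)
            return -1;
        if (toy_xputchar(((c >> 3) & 7) + '0', is_printable, out, outsz, len,
                         depth + 1, max_depth) != 0)
            return -1;
        c = (c & 7) + '0';
    }
    /* Append the final char, keeping the buffer NUL-terminated. */
    if (*len + 1 < outsz) {
        out[(*len)++] = (char)c;
        out[*len] = '\0';
    }
    return 0;
}
```

With sane_isprint, a byte such as 0x01 comes out as the four chars \001. With broken_isprint, passing '~' never gets past the backslash: each call re-enters with '\', which is rejected again, until the depth limit is hit.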
Created attachment 246683 [details] SJIS charmap from CLDR
(In reply to uratan from comment #0) Could you please try replacing tools/tools/locale/etc/final-maps/map.SJIS file with the one in attachment and doing `make all install` in share/ctypedef? It seems to fix the \ and ~ characters for me, but checking that it doesn't break anything else would be helpful.
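One possible way to check a rebuilt LC_CTYPE beyond the two reported characters is to sweep the whole ASCII range. The helper below is a minimal sketch, not part of any FreeBSD tool: count_unprintable_ascii is a hypothetical name, and it assumes, as the report argues, that every byte from 0x20 to 0x7e should be printable in the locale under test.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/*
 * Count the bytes in 0x20..0x7e that isprint() rejects under the given
 * LC_CTYPE locale name. Returns -1 if the locale cannot be selected.
 * If report is non-NULL, each rejected byte is also written to it.
 * Note: this changes the process-wide LC_CTYPE setting.
 */
static int count_unprintable_ascii(const char *locale_name, FILE *report)
{
    int c, bad = 0;

    assert(locale_name != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    for (c = 0x20; c <= 0x7e; c++) {
        if (!isprint(c)) {
            bad++;
            if (report != NULL)
                fprintf(report, "isprint(0x%02x) is 0 under %s\n",
                        c, locale_name);
        }
    }
    return bad;
}
```

Calling count_unprintable_ascii("ja_JP.SJIS", stdout) and count_unprintable_ascii("ja_JP.eucJP", stdout) from a small main() before and after installing the new LC_CTYPE would show whether any byte other than 0x5c and 0x7e changes classification. The result depends on the locale files installed on the machine.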
Created attachment 246729 [details]
list of disappeared mappings

> Could you please try replacing
> tools/tools/locale/etc/final-maps/map.SJIS file with the one in attachment
> and doing `make all install` in share/ctypedef?
> It seems to fix the \ and ~ characters for me, but checking that
> it doesn't break anything else would be helpful.

I did so, and confirmed that tcsh/csh can printenv an environment variable containing the '~' char with the new LC_CTYPE:
+----------------------------------------
|-r--r--r--  1 root  wheel  4700 Dec  1 04:14 /usr/share/locale/ja_JP.SJIS/LC_CTYPE
|MD5 (/usr/share/locale/ja_JP.SJIS/LC_CTYPE) = 13bb935e7ba1275a263cf0074ceaa352
+----------------------------------------

- * - * -

I analyzed your new map file by text processing, because I am not familiar with either KANJI codes or LC_CTYPE with multibyte chars.

Here are the files used:
+----------------------------------------
|MD5 (map.SJIS.orig) = 5a3ff0eb3cf5efe1fa9d44dc6a494e57 ...of 14.0R...[A]
|MD5 (map.SJIS.yuri) = 7adad9b105608e4fd16a5e9f00af7a43 ... yours ...[B]
+----------------------------------------

[A] has 11495 mappings, including 96 aliases for chars under 0x7f and
| <YEN_SIGN> \x5c
| <OVERLINE> \x7e

[B] has 7166 mappings, including the same 96 aliases for chars under 0x7f and
| <REVERSE_SOLIDUS> \x5C
| <TILDE> \x7E

So [B] has 4329 fewer mappings than [A].

The attached file "www-disappeared.txt" (call it [C]) lists the mappings that disappeared from [A] to [B], sorted by code. It has 4331 entries, including <YEN_SIGN> and <OVERLINE>.

I am using this web page as an (easy) reference for SJIS codes:
http://charset.7jp.net/sjis.html
The author of that page does not recommend using the chars shown with a light-blue background in the bottom table, because they cannot be mapped to Unicode; they include 'machine/OS dependent characters'.

I checked, by eye, the disappeared mappings in [C] against that table; they are all on a light-blue background.
So your [B] should be fine, I think.
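The comparison between [A] and [B] could be scripted roughly as follows. This is a sketch under assumptions, not the text processing actually used above: it assumes each map line has the form "<NAME> \xHH" (optionally with more \xHH bytes), treats a mapping as disappeared when no line in the new map has the same name and the same code, and uses the hypothetical names parse_map_line and count_disappeared.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if ch is not a hex digit. */
static int hexval(int ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/*
 * Parse one line of the assumed form "<NAME> \xHH[\xHH...]".
 * On success store NAME (without the brackets) in name, pack the bytes
 * big-endian into *code, and return 1. Return 0 for any other line.
 */
static int parse_map_line(const char *line, char *name, size_t namesz,
                          unsigned long *code)
{
    const char *end, *p;
    unsigned long v = 0;
    size_t n;
    int nbytes = 0;

    assert(line != NULL && name != NULL && code != NULL);
    if (line[0] != '<' || (end = strchr(line, '>')) == NULL)
        return 0;
    n = (size_t)(end - line) - 1;
    if (n == 0 || n >= namesz)
        return 0;
    memcpy(name, line + 1, n);
    name[n] = '\0';

    for (p = end + 1; *p == ' ' || *p == '\t'; p++)
        ;
    while (p[0] == '\\' && p[1] == 'x') {
        int hi = hexval((unsigned char)p[2]);
        int lo = (hi < 0) ? -1 : hexval((unsigned char)p[3]);
        if (lo < 0)
            return 0;
        v = (v << 8) | (unsigned long)(hi * 16 + lo);
        nbytes++;
        p += 4;
    }
    if (nbytes == 0)
        return 0;
    *code = v;
    return 1;
}

/*
 * Count the parseable lines of old_lines[] that have no line in new_lines[]
 * with the same name and code. Quadratic, which is acceptable for a sketch.
 */
static size_t count_disappeared(const char *const *old_lines, size_t nold,
                                const char *const *new_lines, size_t nnew)
{
    char oname[128], nname[128];
    unsigned long ocode, ncode;
    size_t i, j, gone = 0;

    for (i = 0; i < nold; i++) {
        int found = 0;
        if (!parse_map_line(old_lines[i], oname, sizeof oname, &ocode))
            continue;
        for (j = 0; j < nnew && !found; j++) {
            if (parse_map_line(new_lines[j], nname, sizeof nname, &ncode) &&
                ocode == ncode && strcmp(oname, nname) == 0)
                found = 1;
        }
        if (!found)
            gone++;
    }
    return gone;
}
```

Matching on name and code together is a design choice made here so that the renamed entries for \x5c and \x7e are counted as disappeared, as they are in [C]. Hex digits are compared by value, so the lowercase \x5c in [A] and the uppercase \x5C in [B] denote the same code.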
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a4a9d2a64a4ded0adf27cbc1af82669112c41887

commit a4a9d2a64a4ded0adf27cbc1af82669112c41887
Author:     Yuri Pankov <yuripv@FreeBSD.org>
AuthorDate: 2023-11-30 09:04:55 +0000
Commit:     Yuri Pankov <yuripv@FreeBSD.org>
CommitDate: 2023-12-20 06:24:22 +0000

    locale: regenerate SJIS charmap from CLDR

    There are some noticable differences affecting the '\' and '~'
    characters between our current charmap and latest CLDR version.

    PR:             264299, 275444
    Reviewed by:    bapt
    Tested by:      uratan@miomio.jp
    Differential Revision: https://reviews.freebsd.org/D42848

 tools/tools/locale/etc/final-maps/map.SJIS | 18682 ++++++++++-----------------
 1 file changed, 7185 insertions(+), 11497 deletions(-)
^Triage: assign to committer that resolved; set possible MFC flags.
Sorry, the fix here was reverted as incomplete (it broke the build because of bad testing in an unclean environment). I am working on a complete one.