| Summary: | /bin/sh can not handle ja_JP.eucJP character code | ||
|---|---|---|---|
| Product: | Base System | Reporter: | Kouichi Hirabayashi <kh> |
| Component: | bin | Assignee: | freebsd-bugs (Nobody) <bugs> |
| Status: | Closed FIXED | ||
| Severity: | Affects Many People | CC: | bapt, cem, lwhsu, mizhka, naito.yuichiro, ota, yuripv |
| Priority: | --- | ||
| Version: | 11.2-RELEASE | ||
| Hardware: | amd64 | ||
| OS: | Any | ||
| URL: | https://reviews.freebsd.org/D17903 | ||
Description
Kouichi Hirabayashi
2018-10-18 07:58:17 UTC
/bin/sh fails in the ja_JP.eucJP environment.
There seem to be two causes, described below.
1) /usr/src/bin/sh
The control characters (CTLENDVAR, CTLBACKQ, CTLARI, CTLENDARI,
CTLQUOTEMARK, defined in parser.h) match the second byte of
many EUC kanji characters. Therefore, lexical analysis
fails.
2) /usr/src/lib/libedit
The following two functions do not take the ja_JP.eucJP environment into account:
chartype.c: ct_conv_cbuff_resize()
read.c: read_char()
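To make point 1 concrete, here is a minimal sketch of how a byte value reserved as a parser control character can also appear as the second byte of a two-byte EUC-JP character. It assumes only that both bytes of such a character lie in the range 0xA1 to 0xFE. The function names are hypothetical, the control byte is passed in as a parameter rather than taken from parser.h, and the single-shift sequences (0x8E, 0x8F) are ignored for brevity.

```c
#include <stdbool.h>
#include <stddef.h>

/* Both bytes of a two-byte EUC-JP character lie in 0xA1..0xFE. */
static bool
euc_high(unsigned char b)
{
    return b >= 0xA1 && b <= 0xFE;
}

/*
 * Hypothetical helper: count how often the byte value `ctl` shows up as
 * the second byte of a two-byte EUC-JP character in s[0..n-1].
 * Every such hit is a byte that a byte-oriented lexer could mistake for
 * a control character.
 */
size_t
trail_byte_hits(const unsigned char *s, size_t n, unsigned char ctl)
{
    size_t hits = 0;
    size_t i = 0;

    while (i < n) {
        if (euc_high(s[i]) && i + 1 < n && euc_high(s[i + 1])) {
            if (s[i + 1] == ctl)
                hits++;
            i += 2;     /* skip the whole two-byte character */
        } else {
            i++;        /* ASCII or a byte not handled by this sketch */
        }
    }
    return hits;
}
```

Any control byte whose value falls inside 0xA1 to 0xFE can collide this way, while values outside that range cannot occur as a second byte at all.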
Since /bin/sh is one of the basic programs of Unix, if it rejects
all character encodings except UTF-8, it should display a warning such as
"Cannot be used in an eucJP environment" at startup. At present
it exits normally when it receives EUC kanji input or key input such as
Alt-A, Alt-B, ...
Does the same problem occur in the ko_KR.eucKR or zh_CN.eucCN environments?
In my investigation, the main cause of this problem is that the read_char() function
does not retry read(2) on STDIN when mbrtowc(3) returns (size_t)-2, which means
the bytes read so far form an incomplete multibyte character.
In lib/libedit/read.c, the following code retries only when the
CHARSET_IS_UTF8 flag is set.
```
    switch (ct_mbrtowc(cp, cbuf, cbp)) {
<snip>
    case (size_t)-2:
        /*
         * We don't support other multibyte charsets.
         * The second condition shouldn't happen
         * and is here merely for additional safety.
         */
        if ((el->el_flags & CHARSET_IS_UTF8) == 0 ||
            cbp >= MB_LEN_MAX) {
            errno = EILSEQ;
            *cp = L'\0';
            return -1;
        }
        /* Incomplete sequence, read another byte. */
        goto again;
```
Of course, the CHARSET_IS_UTF8 flag is not set in an eucJP environment.
With the CHARSET_IS_UTF8 check removed, /bin/sh can read eucJP input.
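The retry-on-incomplete behaviour can be sketched outside libedit as follows. This is not the libedit code: `struct byte_src`, `next_byte()` and `read_wchar()` are hypothetical names, and an in-memory buffer stands in for read(2). The only standard-library behaviour relied on is that mbrtowc(3) returns (size_t)-2 for an incomplete sequence and (size_t)-1 for an invalid one, in whatever locale is active.

```c
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical byte source standing in for read(2) on stdin. */
struct byte_src {
    const unsigned char *p;
    size_t left;
};

static int
next_byte(struct byte_src *s)
{
    if (s->left == 0)
        return -1;              /* end of input */
    s->left--;
    return *s->p++;
}

/*
 * Read one wide character, one byte at a time.
 * Returns 1 on success, 0 on end of input, -1 on error (errno = EILSEQ).
 */
int
read_wchar(struct byte_src *src, wchar_t *out)
{
    char buf[MB_LEN_MAX];
    size_t n = 0;
    mbstate_t st;

    for (;;) {
        int c = next_byte(src);
        if (c < 0) {
            if (n == 0)
                return 0;       /* clean end of input */
            errno = EILSEQ;     /* input ended inside a character */
            return -1;
        }
        buf[n++] = (char)c;

        /* Re-parse everything collected so far from the initial state. */
        memset(&st, 0, sizeof st);
        size_t r = mbrtowc(out, buf, n, &st);
        if (r == (size_t)-1) {
            errno = EILSEQ;     /* invalid sequence in this locale */
            return -1;
        }
        if (r == (size_t)-2) {
            if (n >= MB_LEN_MAX) {
                errno = EILSEQ; /* safety limit, should not happen */
                return -1;
            }
            continue;           /* incomplete: fetch another byte */
        }
        return 1;
    }
}
```

Because the retry depends only on the return value of mbrtowc(3), the same loop should serve UTF-8, eucJP, or any other multibyte locale selected with setlocale(3).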
Removing the CHARSET_IS_UTF8 check exposed another problem:
command history miscalculates the length of eucJP characters,
because the ct_enc_width() function in chartype.c does not understand any charset
other than UTF-8.
After I rewrote ct_enc_width() to use wctomb(3), the command history problem was fixed.
With these two changes, we don't need the CHARSET_IS_UTF8 flag any more.
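As an illustration of the wctomb(3) approach, a locale-independent width function might look like the sketch below. It is not the code from the differential. The name `enc_width_sketch()` is hypothetical, and returning 0 for an unencodable character is an assumption made for this example.

```c
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper (not the libedit function itself): number of bytes
 * needed to encode wc in the current locale, or 0 if wc cannot be encoded.
 */
size_t
enc_width_sketch(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    int n;

    n = wctomb(buf, wc);
    if (n < 0) {
        /* Reset the internal conversion state after a failure. */
        (void)wctomb(NULL, L'\0');
        return 0;
    }
    return (size_t)n;
}
```

With the locale set to ja_JP.eucJP (assuming it is installed), a JIS X 0208 kanji should report a width of 2, whereas the same character in a UTF-8 locale should report 3. No charset-specific table is needed.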
The CHARSET_IS_UTF8 flag controls the NARROW_HISTORY flag, and the NARROW_HISTORY flag
is used only in the HIST_FUN definition.
```
#ifdef WIDECHAR
#define HIST_FUN(el, fn, arg) \
    (((el)->el_flags & NARROW_HISTORY) ? hist_convert(el, fn, arg) : \
    HIST_FUN_INTERNAL(el, fn, arg))
#else
#define HIST_FUN(el, fn, arg) HIST_FUN_INTERNAL(el, fn, arg)
#endif
```
In a WIDECHAR environment, hist_convert() should always be called,
because hist_convert() is a multibyte-aware function.
I have opened a new differential on Phabricator with all of my fixes:
https://reviews.freebsd.org/D17903
I believe my fix solves this problem and does not affect charsets other than eucJP.
Please review my code.
Hirabayashi-san:
Could you please try my patch from Phabricator and check if this problem is fixed?
I don't think /bin/sh is wrong.
I confirmed that the above patch can solve this problem.
Let's not close this just yet, and see if we can actually fix this in the tree :)
This problem is fixed by r340933.
https://svnweb.freebsd.org/base?view=revision&revision=340933
Could you please MFC to stable/12 and stable/11. Thanks.
MFC done, sorry about the delay