Bug 232374 - /bin/sh can not handle ja_JP.eucJP character code
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 11.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-bugs (Nobody)
URL: https://reviews.freebsd.org/D17903
 
Reported: 2018-10-18 07:58 UTC by Kouichi Hirabayashi
Modified: 2019-01-16 12:04 UTC

Description Kouichi Hirabayashi 2018-10-18 07:58:17 UTC
/bin/sh exits when it reads non-ASCII characters in a ja_JP.eucJP environment.
That is, /bin/sh cannot be used in an EUC Japanese environment.

How-To-Repeat:
With 11.2-RELEASE (amd64 or i386):

1) Set /bin/sh as the login shell and log in, or run "exec sh" from another shell.
2) Press a key such as Alt-A (or Alt-B, Alt-C, .. Alt-1, Alt-4, ..),
   or input an eucJP kanji code.
or

1) Open a window with "xterm /bin/sh" or "kterm".
2) Press a key such as Alt-A (or Alt-B, Alt-C, .. Alt-1, Alt-4, ..),
   or paste an eucJP kanji code.

The cause of this problem lies in libedit (/usr/src/lib/libedit):
the read_char() function in read.c cannot handle eucJP
character codes.
Comment 1 Kouichi Hirabayashi 2018-10-26 02:53:56 UTC
There seem to be two causes of the /bin/sh failure in a ja_JP.eucJP
environment, as described below.

1) /usr/src/bin/sh 

  The control characters (CTLENDVAR, CTLBACKQ, CTLARI, CTLENDARI,
  CTLQUOTEMARK, defined in parser.h) match the second byte of
  many EUC kanji characters. Therefore, lexical analysis
  fails.

2) /usr/src/lib/libedit

  The following two functions do not take the ja_JP.eucJP environment into account.
    chartype.c: ct_conv_cbuff_resize()
    read.c: read_char()

Since /bin/sh is one of the basic programs of Unix, if it rejects
all character codes except UTF-8, it should display a warning such as
"Can not be used in eucJP environment" at startup.  At present
it just exits normally on EUC kanji input or on key input such as
Alt-A, Alt-B, ...

Does the same problem occur in ko_KR.eucKR or zh_CN.eucCN environments?
Comment 2 Yuichiro NAITO 2018-11-08 13:08:41 UTC
In my investigation, the main cause of this problem is that the read_char() function
doesn't retry read(2) from STDIN when mbrtowc(3) returns (size_t)-2, which
indicates an incomplete multibyte sequence.
In lib/libedit/read.c, we can see the following code, which retries only when
the CHARSET_IS_UTF8 flag is set.

```
switch (ct_mbrtowc(cp, cbuf, cbp)) {
<snip>
case (size_t)-2:
        /*
         * We don't support other multibyte charsets.
         * The second condition shouldn't happen
         * and is here merely for additional safety.
         */
        if ((el->el_flags & CHARSET_IS_UTF8) == 0 ||
            cbp >= MB_LEN_MAX) {
                errno = EILSEQ;
                *cp = L'\0';
                return -1;
        }
        /* Incomplete sequence, read another byte. */
        goto again;
```

Of course, the CHARSET_IS_UTF8 flag is not set in an eucJP environment.
With the CHARSET_IS_UTF8 flag check removed, /bin/sh can read eucJP input.
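
The retry on (size_t)-2 can be sketched in standard C, independent of libedit. This is only an illustration under assumptions, not the patch itself: decode_one() is a hypothetical helper, and it carries the conversion state across calls in an mbstate_t instead of using libedit's buffer handling.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not libedit code): decode one character from
 * src[0..n), handing mbrtowc(3) a single byte per call, much like a
 * reader that pulls bytes from a terminal one at a time.
 * Returns the number of bytes consumed, or -1 with errno = EILSEQ.
 */
int
decode_one(const char *src, size_t n, wchar_t *out)
{
        mbstate_t st;
        size_t used = 0;

        assert(src != NULL && out != NULL);
        memset(&st, 0, sizeof(st));     /* start in the initial state */

        while (used < n && used < MB_LEN_MAX) {
                size_t r = mbrtowc(out, src + used, 1, &st);

                used++;
                if (r == (size_t)-1) {  /* invalid sequence */
                        errno = EILSEQ;
                        return -1;
                }
                if (r == (size_t)-2)    /* incomplete: take another byte */
                        continue;
                return (int)used;       /* complete character */
        }
        /* Ran out of input (or exceeded MB_LEN_MAX) mid-character. */
        errno = EILSEQ;
        return -1;
}
```

The result depends on the current locale, so a caller would run setlocale(LC_CTYPE, "") first. In a ja_JP.eucJP locale a two-byte kanji should then be returned as one wide character with a return value of 2, with no UTF-8 special case involved.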

Removing the CHARSET_IS_UTF8 flag check exposed another problem:
the command history miscalculates the length of eucJP characters,
because the ct_enc_width() function in chartype.c doesn't understand any charset
other than UTF-8.

I rewrote ct_enc_width() to use wctomb(3), which fixes the command history problem.
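
As a rough sketch of that idea (not the code from the review): a width helper can ask wctomb(3) how many bytes a wide character needs in the current locale, instead of hard-coding UTF-8 ranges. The name enc_width_sketch() and the choice to return 0 for unrepresentable characters are assumptions made for this example.

```c
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical stand-in for a locale-aware width helper: the number of
 * bytes wc occupies in the current locale's multibyte encoding, or 0
 * if wc cannot be represented in that encoding.
 */
size_t
enc_width_sketch(wchar_t wc)
{
        char buf[MB_LEN_MAX];           /* large enough for any character */
        int len = wctomb(buf, wc);      /* -1 if wc is not representable */

        return len < 0 ? 0 : (size_t)len;
}
```

Because wctomb(3) follows LC_CTYPE, the same function should report 2 for a kanji under ja_JP.eucJP and 3 under a UTF-8 locale, provided the caller has set the locale with setlocale(3).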

With these two changes, we don't need the CHARSET_IS_UTF8 flag any more.
The CHARSET_IS_UTF8 flag controls the NARROW_HISTORY flag, and the NARROW_HISTORY flag
is used only in the definition of HIST_FUN.

```
#ifdef WIDECHAR
#define HIST_FUN(el, fn, arg) \
    (((el)->el_flags & NARROW_HISTORY) ? hist_convert(el, fn, arg) : \
	HIST_FUN_INTERNAL(el, fn, arg))
#else
#define HIST_FUN(el, fn, arg) HIST_FUN_INTERNAL(el, fn, arg)
#endif
```

In a WIDECHAR build, hist_convert() should always be called,
because hist_convert() is a multibyte-aware function.

I have opened a new differential on Phabricator containing all of my fixes.

  https://reviews.freebsd.org/D17903

I believe my fix solves this problem and doesn't affect charsets other than eucJP.
Please review my code.

Hirabayashi-san:
 Could you please try my patch from Phabricator and check whether it fixes this problem?
 I don't think /bin/sh itself is at fault.
Comment 3 Kouichi Hirabayashi 2018-11-08 22:00:22 UTC
I confirmed that the above patch solves this problem.
Comment 4 Yuri Pankov freebsd_committer freebsd_triage 2018-11-08 23:02:05 UTC
Let's not close this just yet, and see if we can actually fix this in the tree :)
Comment 5 Yuichiro NAITO 2019-01-04 10:02:10 UTC
This problem is fixed by r340933.

https://svnweb.freebsd.org/base?view=revision&revision=340933

Could you please MFC it to stable/12 and stable/11?
Thanks.
Comment 6 Baptiste Daroussin freebsd_committer freebsd_triage 2019-01-16 12:04:51 UTC
MFC done, sorry about the delay