Bug 219900

Summary: tr class and unicode collation
Product: Base System
Component: bin
Status: New ---
Severity: Affects Only Me
Priority: ---
Version: CURRENT
Hardware: Any
OS: Any
Reporter: Emmanuel Vadot <manu>
Assignee: freebsd-bugs (Nobody) <bugs>
CC: lme, yuripv

Description Emmanuel Vadot freebsd_committer freebsd_triage 2017-06-10 03:15:34 UTC
There seems to be a bug in Unicode collation when using tr with character classes to convert characters to uppercase:

manu@arcadia> echo $LC_ALL
en_US.UTF-8
manu@arcadia> echo test | tr '[:lower:]' '[:upper:]'
TEST
manu@arcadia> echo test | tr '[:alpha:]' '[:upper:]'
ÓÄÒÓ
manu@arcadia> echo test | tr '[:alnum:]' '[:upper:]'
ÞÎÝÞ
manu@arcadia> export LC_ALL=C
manu@arcadia> echo test | tr '[:lower:]' '[:upper:]'
TEST
manu@arcadia> echo test | tr '[:alpha:]' '[:upper:]'
ZZZZ
manu@arcadia> echo test | tr '[:alnum:]' '[:upper:]'
ZZZZ
Comment 1 Lars Engels freebsd_committer freebsd_triage 2017-12-05 17:11:17 UTC
A similar issue with grep and awk:

> grep '^[A-Z]' foo
foo
Bar
BAZ


The same happens with egrep, grep -E and awk; sed(1) works as expected.

> grep -V
grep (GNU grep) 2.5.1-FreeBSD

> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=en_US.UTF-8


With LANG=C all three tools work as expected.
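
A minimal sketch of a locale-independent way to select lines that start with an uppercase letter, assuming that is the intent of '^[A-Z]'. It uses the POSIX class [[:upper:]] instead of a range, since ranges in bracket expressions may be interpreted according to the locale's collation order. The input below is hypothetical and only mirrors the three lines shown above; starts_upper is an illustrative helper name, not part of any tool.

```shell
# Hypothetical helper: keep only lines beginning with an uppercase letter.
# [[:upper:]] is a POSIX character class, so it does not depend on how the
# locale orders the characters between 'A' and 'Z'.
starts_upper() {
    grep '^[[:upper:]]'
}

# Assumed input, mirroring the lines from the report.
printf 'foo\nBar\nBAZ\n' | starts_upper
# prints:
# Bar
# BAZ
```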
Comment 2 Yuri Pankov 2017-12-06 00:18:27 UTC
It looks like at least the grep issue is not present (or is already fixed) in bsdgrep:

loki:yuri:~$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
loki:yuri:~$ grep '^[A-Z]' foo
Bar
BAZ
loki:yuri:~$ grep --version
grep (BSD grep) 2.6.0-FreeBSD
Comment 3 Yuri Pankov 2017-12-06 00:19:41 UTC
Regarding the tr issue, this is what tr on Debian GNU/Linux has to say about it:

$ echo test | tr '[:alpha:]' '[:upper:]'
tr: misaligned [:upper:] and/or [:lower:] construct
Comment 4 Yuri Pankov 2018-10-10 15:41:15 UTC
As for the original problem, I don't think that any conversion other than [:lower:] <-> [:upper:] makes sense in the context of tr(1). This is also noted in tr(1) itself:

                With the exception of case conversion, characters in the
                classes are in unspecified order.

In other words, I don't think it's a collation problem, but rather unspecified behavior, which doesn't need fixing.
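
One plausible reading of the ZZZZ output in the C locale, sketched here with explicit ranges rather than classes: if [:alpha:] expands to A-Z followed by a-z (an assumption about the expansion order, which tr(1) leaves unspecified) and [:upper:] to A-Z, then string2 is shorter than string1, and both BSD and GNU tr pad it by repeating its last character. Every lowercase letter then maps to Z. The pad_demo name is illustrative only.

```shell
# Explicit ranges stand in for the assumed C-locale expansion of
# [:alpha:] (A-Z then a-z) and [:upper:] (A-Z).
# string2 has 26 characters, string1 has 52, so tr repeats the final 'Z'
# of string2 for the remaining 26 positions (the lowercase letters).
pad_demo() {
    printf '%s\n' "$1" | LC_ALL=C tr 'A-Za-z' 'A-Z'
}

pad_demo test
# prints: ZZZZ
```

Uppercase input is unaffected under the same assumption, since A-Z maps onto itself position by position.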

If there's a real world use case that you think is related to this, please provide examples.