Bug 219900 - tr class and unicode collation
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs mailing list
Reported: 2017-06-10 03:15 UTC by Emmanuel Vadot
Modified: 2018-10-10 15:41 UTC
Description Emmanuel Vadot freebsd_committer 2017-06-10 03:15:34 UTC
There seems to be a bug in Unicode collation when using tr(1) to convert characters to uppercase:

manu@arcadia> echo $LC_ALL
en_US.UTF-8
manu@arcadia> echo test | tr '[:lower:]' '[:upper:]'
TEST
manu@arcadia> echo test | tr '[:alpha:]' '[:upper:]'
ÓÄÒÓ
manu@arcadia> echo test | tr '[:alnum:]' '[:upper:]'
ÞÎÝÞ
manu@arcadia> export LC_ALL=C
manu@arcadia> echo test | tr '[:lower:]' '[:upper:]'
TEST
manu@arcadia> echo test | tr '[:alpha:]' '[:upper:]'
ZZZZ
manu@arcadia> echo test | tr '[:alnum:]' '[:upper:]'
ZZZZ
Comment 1 Lars Engels freebsd_committer 2017-12-05 17:11:17 UTC
A similar issue with grep and awk:

> grep '^[A-Z]' foo
foo
Bar
BAZ


The same happens with egrep, grep -E, and awk; sed(1) works as expected.

> grep -V
grep (GNU grep) 2.5.1-FreeBSD

> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=en_US.UTF-8


With LANG=C all three tools work as expected.
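
For comparison, here is a minimal sketch of a way to select lines that start with an uppercase letter without relying on how the range A-Z collates. It assumes the same three-line file foo as above, recreated here in a scratch directory; the helper name upper_lines is made up for illustration.

```shell
# Hypothetical helper: print the lines of a file that begin with an
# uppercase letter.
upper_lines() {
    # [[:upper:]] is a character class, so it does not depend on the
    # collation order that a range such as [A-Z] is subject to.
    grep '^[[:upper:]]' "$1"
}

# Recreate the sample file from the report in a scratch directory.
tmpdir=$(mktemp -d)
printf 'foo\nBar\nBAZ\n' > "$tmpdir/foo"
upper_lines "$tmpdir/foo"
```

With the sample file this should print only Bar and BAZ, regardless of the locale in effect.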
Comment 2 Yuri Pankov 2017-12-06 00:18:27 UTC
It looks like at least the grep issue is not present (or is already fixed) in bsdgrep:

loki:yuri:~$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
loki:yuri:~$ grep '^[A-Z]' foo
Bar
BAZ
loki:yuri:~$ grep --version
grep (BSD grep) 2.6.0-FreeBSD
Comment 3 Yuri Pankov 2017-12-06 00:19:41 UTC
Regarding the tr issue, this is what tr on Debian GNU/Linux has to say about it:

$ echo test | tr '[:alpha:]' '[:upper:]'
tr: misaligned [:upper:] and/or [:lower:] construct
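
The pairings that both implementations accept are the case conversions themselves. A small sketch follows; the function names to_upper and to_lower are made up for illustration and are not part of tr(1).

```shell
# Hypothetical wrappers around the two class-to-class conversions
# that have a defined character order in tr(1).
to_upper() { tr '[:lower:]' '[:upper:]'; }
to_lower() { tr '[:upper:]' '[:lower:]'; }

echo test | to_upper
echo TEST | to_lower
```

The first command prints TEST, matching the original report, and the second prints test.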
Comment 4 Yuri Pankov 2018-10-10 15:41:15 UTC
As for the original problem, I don't think any conversion other than [:lower:] <-> [:upper:] makes sense in the context of tr(1). The tr(1) manual page notes this as well:

                With the exception of case conversion, characters in the
                classes are in unspecified order.

In other words, I don't think it's a collation problem; it is just unspecified behavior, which doesn't need fixing.

If there's a real world use case that you think is related to this, please provide examples.
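
The ZZZZ output in the C locale is at least consistent with string2 being padded with its last character when it is shorter than string1. The sketch below reproduces that with explicit ranges. It assumes (this is not confirmed anywhere in this report) that in the C locale [:alpha:] behaves like A-Z followed by a-z and [:upper:] like A-Z; the function name pad_demo is hypothetical.

```shell
# Assumption: in the C locale, [:alpha:] behaves like 'A-Za-z' (52 characters)
# and [:upper:] like 'A-Z' (26 characters). For illustration only.
pad_demo() {
    # string2 is shorter than string1, so tr repeats its last character ('Z')
    # for the remaining 26 characters of string1, i.e. all lowercase letters.
    LC_ALL=C tr 'A-Za-z' 'A-Z'
}

echo test | pad_demo
echo Test | pad_demo
```

Under that assumption every lowercase letter maps to Z while uppercase letters map to themselves, so the two commands print ZZZZ and TZZZ.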