Bug 219900

Summary:	tr class and unicode collation
Product:	Base System	Reporter:	Emmanuel Vadot <manu>
Component:	bin	Assignee:	freebsd-bugs (Nobody) <bugs>
Status:	New ---
Severity:	Affects Only Me	CC:	lme, yuripv
Priority:	---
Version:	CURRENT
Hardware:	Any
OS:	Any

Description Emmanuel Vadot freebsd_committer

2017-06-10 03:15:34 UTC

There seems to be a bug in Unicode collation when using tr to convert characters to uppercase:

manu@arcadia> echo $LC_ALL
en_US.UTF-8
manu@arcadia> echo test | tr '[:lower:]' '[:upper:]'
TEST
manu@arcadia> echo test | tr '[:alpha:]' '[:upper:]'
ÓÄÒÓ
manu@arcadia> echo test | tr '[:alnum:]' '[:upper:]'
ÞÎÝÞ
manu@arcadia> export LC_ALL=C
manu@arcadia> echo test | tr '[:lower:]' '[:upper:]'
TEST
manu@arcadia> echo test | tr '[:alpha:]' '[:upper:]'
ZZZZ
manu@arcadia> echo test | tr '[:alnum:]' '[:upper:]'
ZZZZ

Comment 1 Lars Engels freebsd_committer

2017-12-05 17:11:17 UTC

A similar issue with grep and awk:

> grep '^[A-Z]' foo
foo
Bar
BAZ


The same happens with egrep, grep -E, and awk; sed(1) works as expected.

> grep -V
grep (GNU grep) 2.5.1-FreeBSD

> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=en_US.UTF-8


With LANG=C all three tools work as expected.
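
A possible explanation, offered as an assumption rather than a confirmed diagnosis: outside the C/POSIX locale, the meaning of a range such as [A-Z] is unspecified by POSIX, and some implementations evaluate it by collation order, which can place lowercase letters inside the range. The sketch below recreates the sample file from the output above in a scratch directory (the directory and variable names are made up for illustration) and uses the [[:upper:]] character class, which is matched by character type rather than by collation order.

```shell
# Recreate the sample file from the report in a scratch directory.
# The three lines are taken from the grep output shown above.
workdir=$(mktemp -d)
printf 'foo\nBar\nBAZ\n' > "$workdir/foo"

# A character class is matched by character type, not by collation
# order, so it should not pick up the all-lowercase line.
upper_matches=$(grep '^[[:upper:]]' "$workdir/foo")

# Forcing the C locale makes the range [A-Z] mean the code points A..Z.
range_matches=$(LC_ALL=C grep '^[A-Z]' "$workdir/foo")

printf '%s\n' "$upper_matches"
printf '%s\n' "$range_matches"
```

Both commands should print only Bar and BAZ. The class form has the advantage of not requiring a locale override.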

Comment 2 Yuri Pankov 2017-12-06 00:18:27 UTC

It looks like at least the grep issue is not present (or has already been fixed) in bsdgrep:

loki:yuri:~$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
loki:yuri:~$ grep '^[A-Z]' foo
Bar
BAZ
loki:yuri:~$ grep --version
grep (BSD grep) 2.6.0-FreeBSD

Comment 3 Yuri Pankov 2017-12-06 00:19:41 UTC

Regarding the tr issue, this is what tr on Debian GNU/Linux has to say about it:

$ echo test | tr '[:alpha:]' '[:upper:]'
tr: misaligned [:upper:] and/or [:lower:] construct

Comment 4 Yuri Pankov 2018-10-10 15:41:15 UTC

As for the original problem, I don't think any conversion other than [:lower:] <-> [:upper:] makes sense in the context of tr(1). The tr(1) manual page notes this as well:

                With the exception of case conversion, characters in the
                classes are in unspecified order.

In other words, I don't think this is a collation problem; it is simply unspecified behavior, which doesn't need fixing.

If there's a real world use case that you think is related to this, please provide examples.
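
For reference, one reading of the ZZZZ output in the C locale from the description (an assumption, not something verified against the tr source): if [:alpha:] expands to 52 characters with the uppercase letters first and [:upper:] expands to 26, then string2 is shorter than string1, and tr extends it by repeating its last character, Z. The same effect can be reproduced with explicit ranges, as in this minimal sketch (the variable names are only for illustration):

```shell
# string1 has 52 characters, string2 only 26. tr extends string2 by
# repeating its last character (Z), so every lowercase letter maps to Z.
padded=$(printf 'test\n' | LC_ALL=C tr 'A-Za-z' 'A-Z')
printf '%s\n' "$padded"

# With strings of matching length, each lowercase letter maps to its
# uppercase counterpart.
aligned=$(printf 'test\n' | LC_ALL=C tr 'a-z' 'A-Z')
printf '%s\n' "$aligned"
```

The first command should print ZZZZ and the second TEST, which matches the C-locale results in the description.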