Bug 257972

Summary:	collating sequence not sensible in some locales
Product:	Base System	Reporter:	Helge Oldach <freebsd>
Component:	standards	Assignee:	freebsd-standards (Nobody) <standards>
Status:	New ---
Severity:	Affects Many People	CC:	bapt, freebsd, freebsd, imp, kevans, se, yuripv
Priority:	---
Version:	13.0-STABLE
Hardware:	Any
OS:	Any

Description Helge Oldach 2021-08-20 14:13:54 UTC

As discussed in https://lists.freebsd.org/archives/freebsd-stable/2021-August/000193.html:

> > # uname -a
> > FreeBSD 13STABLE 13.0-STABLE FreeBSD 13.0-STABLE #49 stable/13-n246779-64085efb677-dirty: Mon Aug 16 08:42:53 CEST 2021     root@XXX amd64
> > # export LANG=en_US.ISO8859-1
> > # (echo bla; echo Bla) | grep '[A-Z]'
> > bla
> > Bla
> 
> This one is unexpected, the upper case should be a range of its own
> and should not include any lower case letters.

> > # export LANG=en_US.UTF-8
> > # (echo bla; echo Bla) | grep '[A-Z]'
> > Bla
> 
> Here I had expected the result you got with en_US.ISO8859-1 ...

> > For comparison, a Linux RHEL box delivers the expected results:
> >
> > # uname -a
> > Linux rhel.local 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
> > # export LANG=en_US.ISO8859-1
> > # (echo bla; echo Bla) | grep '[A-Z]'
> > Bla
> > # export LANG=en_US.UTF-8
> > # (echo bla; echo Bla) | grep '[A-Z]'
> > Bla
>
> Seems that this version uses a POSIX style collating sequence for UTF-8.

> Definitely a bug in the definition of the collating sequences.
>
> And I have just verified that de_DE.ISO8859-1 wrongly considers "ö"
> to be within [a-z], while de_DE.UTF-8 does not (but should).
>
> Seems that the correct collating sequences for ISO8859-1 and UTF-8 are
> each assigned to the other one.

Can some knowledgeable person please validate?

Comment 1 Warner Losh freebsd_committer

2021-08-20 14:39:05 UTC

I don't think this is a bug. POSIX doesn't define this for anything other than the C locale.

Comment 2 Stefan Eßer freebsd_committer

2021-08-20 15:07:48 UTC

While it is true that POSIX does not define it for ISO8859-1 or UTF-8, it always used to work for ISO8859-1 (as a simple extension of ASCII).

The really surprising result is that ISO8859-1 obviously includes lower case letters in the range [A-Z] (it never did before!), while UTF-8 excludes them (and the common practice in Unicode is to have a collating sequence of "aAbBcC..." for Latin-based character sets).

There is obviously code that applies some collating sequence rules, but opposite to what I'd expect.

The Linux example shows that they decided to use the traditional collating sequence for any locale, including ISO8859-1 and UTF-8 (and, as said, POSIX does not care at all).

We could make ISO8859-1 use the traditional collating sequence and UTF-8 the Unicode convention of lower case just before upper case letter, or we could always apply the traditional collating sequence, but we should definitely not use traditional for UTF-8 and Unicode style for ISO8859-1.
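
To illustrate the two orderings discussed in this comment, here is a minimal sketch that models a collating sequence as a plain string and lists what an inclusive A to Z range would cover in each. The two strings and the helper range_members are hypothetical, written only for this illustration; they are not the real FreeBSD locale tables, nor how grep evaluates bracket expressions.

```shell
#!/bin/sh
# Toy model only: collation orders written out as plain strings.
# These are NOT the real FreeBSD locale definitions.
traditional='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
interleaved='aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ'

# range_members SEQ LO HI
# Print every element of SEQ from LO to HI, inclusive (hypothetical helper).
range_members() {
    rest=${1#*"$2"}        # everything after the first LO
    mid=${rest%%"$3"*}     # ... up to, but excluding, the first HI
    printf '%s\n' "$2$mid$3"
}

range_members "$traditional" A Z
range_members "$interleaved" A Z
```

With the traditional string the range holds only the 26 upper case letters. With the interleaved string it also picks up lower case b through z (but not a), which would be consistent with "bla" matching [A-Z] in the report above.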

Comment 3 Helge Oldach 2021-08-20 16:07:29 UTC

(In reply to Warner Losh from comment #1)
True, but this is rather surprising behaviour, in particular as Linux delivers the naïvely expected results, which differ from ours.

Comment 4 Kyle Evans freebsd_committer

2021-08-20 16:22:36 UTC

Tagging in Baptiste (locales) and Yuri (regex + locales)

Comment 5 Helge Oldach 2021-08-20 16:48:08 UTC

More fun with ISO8859 variants:

# (echo Bla; echo bla) | LANG=en_US.ISO8859-1 grep '[A-Z]'
Bla
bla
# (echo Bla; echo bla) | LANG=en_US.ISO8859-15 grep '[A-Z]'
Bla
bla
# (echo Bla; echo bla) | LANG=en_US.ISO8859-11 grep '[A-Z]'
Bla
#

Even better: it seems only 1 and 15 are, err, "special":

# seq 1 16 | while read f; do echo $f; (echo Bla; echo bla) | LANG=en_US.ISO8859-$f grep '[A-Z]'; done
1
Bla
bla
2
Bla
3
Bla
4
Bla
5
Bla
6
Bla
7
Bla
8
Bla
9
Bla
10
Bla
11
Bla
12
Bla
13
Bla
14
Bla
15
Bla
bla
16
Bla
#

Comment 6 Helge Oldach 2021-08-20 16:59:06 UTC

Ah, sorry, my fault. These locales don't exist so we are silently reverting to C.
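
Because an unknown locale name falls back to C without any warning, it may help to check that a name is installed before testing with it. A minimal sketch, assuming the system's `locale -a` lists names in the same spelling that is passed in LANG (some systems normalise the spelling); locale_exists is a hypothetical helper, not an existing utility:

```shell
#!/bin/sh
# locale_exists NAME (hypothetical helper)
# Succeed only if NAME appears verbatim, as a whole line, in `locale -a`.
locale_exists() {
    locale -a 2>/dev/null | grep -Fqx -- "$1"
}

# Check a few of the names tried in comment 5 before relying on them.
for n in 1 11 15; do
    name="en_US.ISO8859-$n"
    if locale_exists "$name"; then
        echo "$name: installed"
    else
        echo "$name: not installed, commands would silently use C"
    fi
done
```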

Comment 7 Yuri Pankov freebsd_committer

2023-03-30 13:21:13 UTC

All I can say here is to quote the following from https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html:
---
In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched.
---
So my take on this is "don't use range expressions".
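
As a sketch of what avoiding range expressions can look like for the test case in this report, here are two common alternatives: a POSIX character class, or forcing the POSIX locale for the single command so that the range is well defined. This is one possible workaround, not a fix for the collation definitions themselves.

```shell
#!/bin/sh
# Character classes follow the locale's character classification
# rather than its collation order, so no range expression is involved.
printf 'bla\nBla\n' | grep '[[:upper:]]'

# If exactly ASCII A-Z is wanted regardless of the user's settings,
# run the one command in the C (POSIX) locale, where ranges are defined.
printf 'bla\nBla\n' | LC_ALL=C grep '[A-Z]'
```

Both commands should print only "Bla". Note that [[:upper:]] may also match non-ASCII upper case letters in locales that define them, which is not the same set as A-Z in the C locale.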