Summary: | collating sequence not sensible in some locales | | |
---|---|---|---|
Product: | Base System | Reporter: | Helge Oldach <freebsd> |
Component: | standards | Assignee: | freebsd-standards (Nobody) <standards> |
Status: | New --- | | |
Severity: | Affects Many People | CC: | bapt, freebsd, freebsd, imp, kevans, se, yuripv |
Priority: | --- | | |
Version: | 13.0-STABLE | | |
Hardware: | Any | | |
OS: | Any | | |
Description
Helge Oldach
2021-08-20 14:13:54 UTC
I don't think this is a bug. POSIX doesn't define this for anything other than the C locale.

While it is true that POSIX does not define it for ISO8859-1 or UTF-8, it always used to work for ISO8859-1 (as a simple extension of ASCII). The really surprising result is that ISO8859-1 obviously includes lower case letters in the range [A-Z] (it never did before!), while UTF-8 excludes them (and the common practice in Unicode is to have a collating sequence of "aAbBcC..." for Latin-based character sets). There is obviously code that applies some collating sequence rules, but opposite to what I'd expect. The Linux example shows that they decided to use the traditional collating sequence in any locale, including ISO8859-1 and UTF-8 (and as said, POSIX does not care at all). We could make ISO8859-1 use the traditional collating sequence and UTF-8 the Unicode convention of each lower case letter just before its upper case counterpart, or we could always apply the traditional collating sequence, but we should definitely not use traditional for UTF-8 and Unicode style for ISO8859-1.

(In reply to Warner Losh from comment #1) True, but kind of surprising behaviour - in particular as Linux delivers the naïvely expected results, different from ours.

Tagging in Baptiste (locales) and Yuri (regex + locales).

More fun with ISO8859 variants:

    # (echo Bla; echo bla) | LANG=en_US.ISO8859-1 grep '[A-Z]'
    Bla
    bla
    # (echo Bla; echo bla) | LANG=en_US.ISO8859-15 grep '[A-Z]'
    Bla
    bla
    # (echo Bla; echo bla) | LANG=en_US.ISO8859-11 grep '[A-Z]'
    Bla
    #

Even better: it seems only 1 and 15 are, err, "special":

    # seq 1 16 | while read f; do echo $f; (echo Bla; echo bla) | LANG=en_US.ISO8859-$f grep '[A-Z]'; done
    1
    Bla
    bla
    2
    Bla
    3
    Bla
    4
    Bla
    5
    Bla
    6
    Bla
    7
    Bla
    8
    Bla
    9
    Bla
    10
    Bla
    11
    Bla
    12
    Bla
    13
    Bla
    14
    Bla
    15
    Bla
    bla
    16
    Bla
    #

Ah, sorry, my fault. These locales don't exist, so we are silently reverting to C.
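The silent fallback noted above can be caught before running such a test. The following is only a sketch: it assumes the `locale` utility is available and that the name passed in is spelled exactly as `locale -a` prints it, which differs between systems. The helper name `locale_installed` is made up for this illustration.

```shell
# Hypothetical helper: succeed only if the exact locale name is listed by `locale -a`.
locale_installed() {
    locale -a 2>/dev/null | grep -Fqx -- "$1"
}

# Check a few of the names used in the loop above before trusting the grep results.
for n in 1 11 15; do
    name="en_US.ISO8859-$n"
    if locale_installed "$name"; then
        echo "$name: installed"
    else
        echo "$name: not installed (a test using it would not exercise this locale)"
    fi
done
```

Running this first makes it clear which of the `LANG=en_US.ISO8859-$f` results actually reflect that locale's collation rules.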
All I can do here is quote the following from https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html:

---

In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched.

---

So my take on this is "don't use range expressions".
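Two ways to follow that advice are sketched here, assuming a POSIX-conforming grep. The first uses the character class `[[:upper:]]`, which is defined by the locale's character classification and not by its collation order. The second keeps the range but forces the C locale, where the meaning of a range is specified. The variable names are illustrative only.

```shell
# Sample input taken from the report above.
input=$(printf 'Bla\nbla\n')

# Character class: depends on LC_CTYPE, not on the collating sequence.
upper_class=$(printf '%s\n' "$input" | grep '[[:upper:]]')

# Range expression, but evaluated in the C (POSIX) locale where it is well defined.
c_range=$(printf '%s\n' "$input" | LC_ALL=C grep '[A-Z]')

# Both variants print only "Bla" for this input.
printf '%s\n' "$upper_class" "$c_range"
```

Note that `LC_ALL=C` also changes how non-ASCII input is interpreted, so the character class is usually the better choice when the data is not plain ASCII.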