Bug 256473 - FreeBSD shells are case insensitive for character ranges
Status: Closed Works As Intended
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 12.2-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
Reported: 2021-06-07 22:58 UTC by Jason W. Bacon
Modified: 2021-06-10 09:01 UTC

Description Jason W. Bacon freebsd_committer 2021-06-07 22:58:32 UTC
When running /bin/tcsh or /bin/sh, character ranges of either upper case or lower case also expand to files with the other case:

FreeBSD moray.acadix  bacon ~ 1017: ls -d [A-Z]*
Biostar/             Grad-school-biosci/  Save/
Books/               igv/                 scripts/
Coral@               Moto/                VirtualBox VMs/
Data/                ncbi/                wpa_supplicant.conf
Desktop/             Ports/
Documents/           Prog/
FreeBSD moray.acadix  bacon ~ 1018: bash

Using bash from ports, I do not see this behavior:

[bacon@moray ~]$ ls -d [A-Z]*
Biostar            Data               Grad-school-biosci Prog
Books              Desktop            Moto               Save
Coral              Documents          Ports              VirtualBox VMs

Also checked tcsh on a CentOS system and did not see this behavior.

Lastly I do NOT see this behavior from glob(3), so it seems to be limited to shells in the FreeBSD base.

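#include <stdio.h>
#include <sysexits.h>
#include <glob.h>
#include <unistd.h>

int     main(int argc, char *argv[])

{
    glob_t g;

    /*
     * Note: no setlocale() call is made, so this program runs in the
     * default "C" locale regardless of LANG/LC_* in the environment.
     */

    /* Reserve two leading slots in gl_pathv for "ls" and "-d" */
    g.gl_offs = 2;
    if ( glob("[A-Z]*", GLOB_DOOFFS, NULL, &g) != 0 )
    {
        fputs("glob() failed or matched nothing.\n", stderr);
        return EX_UNAVAILABLE;
    }
    g.gl_pathv[0] = "ls";
    g.gl_pathv[1] = "-d";
    execvp("ls", g.gl_pathv);

    /* Reached only if execvp() fails */
    perror("execvp");
    return EX_OSERR;
}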
Comment 1 Stefan Eßer freebsd_committer 2021-06-08 07:52:07 UTC
Please check your LANG (and possibly LC_CTYPE) setting.

There are many locales that do not follow the ASCII collating sequence A .. Z followed by a .. z. Very common is a sequence of: A a B b ... Z z - and that includes all lower case letters except z within the range [A-Z].

Try setting LANG to "C" and check whether this gives the expected result.

Shell scripts that depend on a specific collating sequence should always reset LANG and LC_CTYPE to default values (e.g. LANG=C and LC_CTYPE unset).

Interactive shells should be able to support LC_CTYPE=C to set the collating sequence independently of LANG, but in my tests this is implemented to different degrees in different shells.

Some shells do not even respect the collating sequence specified by LANG, and bash is the one that appears to ignore both LANG and LC_CTYPE and to always use the ASCII collating sequence (or rather pure bytewise string comparison?).

The tcsh in FreeBSD uses only LANG and ignores LC_CTYPE to select the collating sequence, btw.
Comment 2 Jason W. Bacon freebsd_committer 2021-06-08 15:23:41 UTC
I'll respond about the lang settings shortly, but first:

I created a simple test case to reduce noise in this thread, and things got weirder:

Default shell is tcsh.

FreeBSD coral.acadix  bacon ~/Test 1032: ls
aardvark  Alan      Bob       zip

FreeBSD coral.acadix  bacon ~/Test 1033: ls -d [A-Z]*
Alan  Bob   zip

FreeBSD coral.acadix  bacon ~/Test 1034: ls -d [a-z]*
aardvark  Alan      Bob       zip

FreeBSD coral.acadix  bacon ~/Test 1035: bash
[bacon@coral ~/Test]$ ls -d [A-Z]*
Alan Bob

[bacon@coral ~/Test]$ ls -d [a-z]*
aardvark zip
Comment 3 Jason W. Bacon freebsd_committer 2021-06-08 15:27:39 UTC
In tcsh, I have to set LANG=C *and* unset LC_ALL to get expected behavior.

In bash, neither setting matters.

FreeBSD coral.acadix  bacon ~/Test 1038: bash

[bacon@coral ~/Test]$ printenv | egrep 'LANG|LC'
LANGUAGE=en_US.UTF-8
LANG=en_US.UTF-8
LC_ALL=en_US.UTF-8

[bacon@coral ~/Test]$ ls -d [A-Z]*
Alan Bob

[bacon@coral ~/Test]$ 
exit

# Back to tcsh

FreeBSD coral.acadix  bacon ~/Test 1039: printenv | egrep 'LANG|LC'
LANGUAGE=en_US.UTF-8
LC_ALL=en_US.UTF-8
LANG=en_US.UTF-8

FreeBSD coral.acadix  bacon ~/Test 1040: ls -d [A-Z]*
Alan  Bob   zip

FreeBSD coral.acadix  bacon ~/Test 1041: setenv LANG C

FreeBSD coral.acadix  bacon ~/Test 1042: ls -d [A-Z]*
Alan  Bob   zip

FreeBSD coral.acadix  bacon ~/Test 1043: unsetenv LC_ALL

FreeBSD coral.acadix  bacon ~/Test 1044: ls -d [A-Z]*
Alan  Bob
Comment 4 Stefan Eßer freebsd_committer 2021-06-08 16:08:23 UTC
(In reply to Jason W. Bacon from comment #2)

> I created a simple test case to reduce noise in this thread, and things got weirder:
> 
> Default shell is tcsh.
> 
> FreeBSD coral.acadix  bacon ~/Test 1032: ls
> aardvark  Alan      Bob       zip
> 
> FreeBSD coral.acadix  bacon ~/Test 1033: ls -d [A-Z]*
> Alan  Bob   zip

This is to be expected. The sequence is: "AaBb..Zz" and you are selecting for all initial letters except for the files starting with "z" (which comes after "Z").

> FreeBSD coral.acadix  bacon ~/Test 1034: ls -d [a-z]*
> aardvark  Alan      Bob       zip

This is to be expected. The sequence is as above and you are selecting for all except those starting with "A" (which is "to the left" of "a").

> FreeBSD coral.acadix  bacon ~/Test 1035: bash
> [bacon@coral ~/Test]$ ls -d [A-Z]*
> Alan Bob
> 
> [bacon@coral ~/Test]$ ls -d [a-z]*
> aardvark zip

This is the behavior of BASH if the "globasciiranges" shell option is set to on.

Check this with the following command:

$ shopt globasciiranges
globasciiranges	on

According to the bash man page this option is off by default, but it is apparently set in your environment.

With this option set to off, bash behaves exactly like the other shells.
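A minimal sketch of the two settings, assuming bash 4.3 or later is installed (older versions lack the option). The scratch directory and the variable names are illustrative only. With the option set, the range is compared in byte order; with it unset, the result depends on the locale in effect, so no output is claimed for that case.

```shell
# Scratch directory with the file names used in this thread
dir=$(mktemp -d)
touch "$dir/aardvark" "$dir/Alan" "$dir/Bob" "$dir/Zed" "$dir/zip"

if command -v bash >/dev/null 2>&1; then
    # Option set: bracket ranges are compared as in the C locale
    ascii=$(cd "$dir" && bash -c 'shopt -s globasciiranges; echo [A-Z]*')
    # Option unset: ranges follow the collating sequence of the current
    # locale, so this result varies with LANG/LC_COLLATE/LC_ALL
    by_locale=$(cd "$dir" && bash -c 'shopt -u globasciiranges; echo [A-Z]*')
    echo "globasciiranges on:  $ascii"
    echo "globasciiranges off: $by_locale"
fi

rm -rf "$dir"
```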
Comment 5 Stefan Eßer freebsd_committer 2021-06-08 16:17:03 UTC
(In reply to Jason W. Bacon from comment #3)
As explained in the previous comment, the behavior you observe with bash is caused by "globasciiranges" set to on.

I was wrong to mention LC_CTYPE as relevant to collating sequences; it applies only to character classification.

You want to set LC_COLLATE to "C" (i.e. "export LC_COLLATE=C" or "setenv LC_COLLATE C" depending on the type of shell).

This will make all shells use the collating sequence specified for the C locale and that is what you seem to want.
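As a rough illustration of that advice, the sketch below pins collation to "C" for a child shell. It assumes a POSIX sh and a scratch directory created only for the demonstration. LC_ALL is cleared first because, per POSIX, it takes precedence over LC_COLLATE.

```shell
# Scratch directory with the file names used in this thread
dir=$(mktemp -d)
touch "$dir/aardvark" "$dir/Alan" "$dir/Bob" "$dir/Zed" "$dir/zip"

# Clear LC_ALL, set LC_COLLATE=C, then let a child shell expand the range.
# The subshell keeps these changes out of the calling environment.
upper=$(cd "$dir" && unset LC_ALL && LC_COLLATE=C && export LC_COLLATE &&
        sh -c 'echo [A-Z]*')
lower=$(cd "$dir" && unset LC_ALL && LC_COLLATE=C && export LC_COLLATE &&
        sh -c 'echo [a-z]*')

echo "$upper"    # Alan Bob Zed
echo "$lower"    # aardvark zip

rm -rf "$dir"
```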

Everything works as it should, therefore I'm going to close this PR.
Comment 6 Jason W. Bacon freebsd_committer 2021-06-08 16:57:34 UTC
(In reply to Stefan Eßer from comment #4)

>> FreeBSD coral.acadix  bacon ~/Test 1032: ls
>> aardvark  Alan      Bob       zip
>> FreeBSD coral.acadix  bacon ~/Test 1033: ls -d [A-Z]*
>> Alan  Bob   zip

> This is to be expected. The sequence is: "AaBb..Zz" and you are selecting for all initial letters except for the files starting with "z" (which comes after "Z").

I see the pattern now, but your range expansion above is incorrect and doesn't agree with the ls output I provided.

The lower case letters actually come first, which is not what I expected either.  That's why the output seemed inexplicable at first.

[A-Z] == [AbB..zZ] == all letters except 'a'
[a-z] == [aAbB..z] == all letters except 'Z'

[A-Z]* selects for all but those that start with 'a', not 'z'.  This explains why zip is listed and aardvark is not.

Adding one more file to clarify:

FreeBSD coral.acadix  bacon ~/Test 1013: ls
aardvark  Alan      Bob       Zed       zip

FreeBSD coral.acadix  bacon ~/Test 1014: ls -d [A-Z]*
Alan  Bob   Zed   zip

FreeBSD coral.acadix  bacon ~/Test 1015: ls -d [a-z]*
aardvark  Alan      Bob       zip

FreeBSD coral.acadix  bacon ~/Test 1020: ls -d [A-z]*
Alan  Bob   zip

FreeBSD coral.acadix  bacon ~/Test 1021: ls -d [a-Z]*
aardvark  Alan      Bob       Zed       zip
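The listings above are consistent with a simplified model in which each lower case letter sorts immediately before its upper case twin. The sketch below encodes only that assumption; the helper names pos and in_range are hypothetical, and real locale collation is more involved than a single-character sequence.

```shell
# Simplified single-level model of the observed order: aAbB...zZ
order="aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

# pos CH: print the zero-based position of CH in $order
pos() {
    prefix=${order%%"$1"*}
    echo "${#prefix}"
}

# in_range LO HI CH: succeed if CH lies between LO and HI (inclusive)
in_range() {
    [ "$(pos "$3")" -ge "$(pos "$1")" ] && [ "$(pos "$3")" -le "$(pos "$2")" ]
}

# Which of the test files does the model select for [A-Z]* ?
for name in aardvark Alan Bob Zed zip; do
    first=$(printf '%.1s' "$name")
    if in_range A Z "$first"; then
        echo "[A-Z] matches $name"
    fi
done
```

Under this model [A-Z] skips only "a", [a-z] skips only "Z", and [A-z] skips both, which agrees with the four ls results shown.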

globasciiranges appears to be on by default in the bash port (not from anything in my env), but disabling it with shopt -u does make bash behave like tcsh.
Comment 7 Stefan Eßer freebsd_committer 2021-06-08 18:48:23 UTC
(In reply to Jason W. Bacon from comment #6)

> I see the pattern now, but your range expansion above is incorrect and doesn't agree with the ls output I provided.
> 
> The lower case letters actually come first, which is not what I expected either.  That's why the output seemed inexplicable at first.
> 
> [A-Z] == [AbB..zZ] == all letters except 'a'
> [a-z] == [aAbB..z] == all letters except 'Z'
> 
> [A-Z]* selects for all but those that start with 'a', not 'z'.  This explains why zip is listed and aardvark is not.

It seems your collating sequence has lower case letters before upper case letters, which is in fact very common (I got that reversed).

But Unicode collation sequences are much more complex than that.

For example, many languages sort by character without regard to upper/lower case, and case comes into play only if the case-insensitive comparison does not define an ordering.

E.g., in /usr/ports:

$ /bin/ls -1d [cC]*
cad
CHANGES
chinese
comms
CONTRIBUTING.md
converters
COPYRIGHT

Case is ignored whenever the case-insensitive comparison already gives a result, which makes "cad" come before "CHANGES", followed by "chinese".

This shows that the order is not primarily determined by the case of the initial character ("c" vs. "C"), but by comparing the full name and then using upper/lower case only as a less significant criterion.

And that makes "[C]*" behave differently from looking at the sorted list and starting at the first entry that has "C" as its initial letter.
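For contrast, a small sketch of what plain byte order does with the same seven names. It assumes only a POSIX sort(1); the variable names are illustrative. With LC_ALL=C the comparison is bytewise, so every upper case name precedes every lower case one instead of being interleaved as in the listing above.

```shell
# The seven names from the /usr/ports listing in this comment
names='cad
CHANGES
chinese
comms
CONTRIBUTING.md
converters
COPYRIGHT'

# Force the C locale for sort only; the rest of the environment is untouched
byte_order=$(printf '%s\n' "$names" | LC_ALL=C sort)
printf '%s\n' "$byte_order"
```

This prints CHANGES, CONTRIBUTING.md and COPYRIGHT first, then cad, chinese, comms and converters.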

Anyway, this is all specified by the Unicode Collation Algorithm (UCA). Each locale definition specifies parameters of that algorithm, and the order you observe complies with that specification (you did not specify your locale, e.g. the LANG value that is in effect).

There is nothing wrong with the FreeBSD shells, but you may have to set some environment variable (LC_COLLATE) to the specific value that results in the correct sort order, if the default does not work for you.
Comment 8 Jason W. Bacon freebsd_committer 2021-06-09 13:30:25 UTC
(In reply to Stefan Eßer from comment #5)

More notes for posterity:

LC_COLLATE is overridden by LC_ALL.

In the absence of any LC_* settings, collation is also affected by LANG.

Unsetting LANG and LC_* brings expected behavior:

FreeBSD coral.acadix  bacon ~/Test 1007: printenv | egrep 'LANG|LC'

FreeBSD coral.acadix  bacon ~/Test 1008: ls [A-Z]*
Alan  Bob   Zed

FreeBSD coral.acadix  bacon ~/Test 1009: ls [a-z]*
aardvark  zip

Thanks for the help...
Comment 9 dsdqmhsx 2021-06-10 09:01:09 UTC
(In reply to Jason W. Bacon from comment #8)
For the record, the precedence is documented:

LC_ALL
This variable shall determine the values for all locale categories. The value of the LC_ALL environment variable has precedence over any of the other environment variables starting with LC_ (LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME) and the LANG environment variable.

(https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html, 8.2)
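The precedence can be sketched as a small shell function. The name effective_collate is hypothetical, and the final "C" fallback is an assumption: POSIX leaves the default locale implementation-defined when none of the variables is set.

```shell
# effective_collate LC_ALL LC_COLLATE LANG
# Print the locale name that governs collation, following the POSIX order
# LC_ALL > LC_COLLATE > LANG. An empty argument stands for "unset or empty".
effective_collate() {
    if [ -n "${1:-}" ]; then
        echo "$1"
    elif [ -n "${2:-}" ]; then
        echo "$2"
    elif [ -n "${3:-}" ]; then
        echo "$3"
    else
        echo "C"    # assumed default; implementation-defined per POSIX
    fi
}

# Comment 3, after "setenv LANG C" with LC_ALL still set:
effective_collate "en_US.UTF-8" "" "C"     # prints en_US.UTF-8
# Comment 3, after "unsetenv LC_ALL":
effective_collate "" "" "C"                # prints C
# The current environment:
effective_collate "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}"
```

This matches the tcsh session in comment 3, where setting LANG alone had no effect until LC_ALL was removed.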