Bug 247494 - sort(1) order affected by LC_CTYPE
Summary: sort(1) order affected by LC_CTYPE
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Many People
Assignee: Conrad Meyer
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2020-06-23 07:30 UTC by Akinori MUSHA
Modified: 2020-06-24 02:32 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Akinori MUSHA freebsd_committer 2020-06-23 07:30:21 UTC
My understanding is it is LC_COLLATE that affects how sort(1) compares characters
and the C locale has the collation for binary comparison, so I set LC_COLLATE=C
when calling sort(1) for language independent sorting, but it seems LC_CTYPE also
affects how sort(1) works.

% (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort
耳
脳
耳

% (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=C LC_COLLATE=C LANG=C sort
耳
耳
脳


For reference, GNU sort works fine with any LC_CTYPE, and according to a NetBSD user
the same goes for NetBSD sort.

% (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C gsort
耳
耳
脳


Is this a bug or by design?
Comment 1 Conrad Meyer freebsd_committer 2020-06-23 14:14:03 UTC
On CURRENT:

$ LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C locale
LANG=C
LC_CTYPE=ja_JP.UTF-8
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

sort(1) attempts to identify situations where it can run in fast, byte-compare only mode by looking only at LC_COLLATE.  The --debug option shows more information:

$ (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --debug
Memory to be used for sorting: 17100230656
Using collate rules of C locale
Byte sort is used
sort_method=radixsort
; offset=1
; k1=<耳>(1), k2=<脳>(1); offset=1; s1=<耳>, s2=<脳>; cmp1=0
; offset=1
; k1=<脳>(1), k2=<耳>(1); offset=1; s1=<脳>, s2=<耳>; cmp1=0
耳
脳
耳

Both compares seem wrong.  The UTF-8 sequences share only the first byte, 0xe8.

In LC_CTYPE=C mode:

; offset=1
; k1=<耳>(3), k2=<脳>(3); offset=1; s1=<耳>, s2=<脳>; cmp1=-4
; offset=1
; k1=<脳>(3), k2=<耳>(3); offset=1; s1=<脳>, s2=<耳>; cmp1=4
; offset=1
; k1=<耳>(3), k2=<耳>(3); offset=1; s1=<耳>, s2=<耳>; cmp1=0
耳
耳
脳

The comparisons look correct.  I will look a little more.  I think this is a bug, not design, but I am not sure yet.
Comment 2 Conrad Meyer freebsd_committer 2020-06-23 14:42:57 UTC
I think the lengths printed in the bad example are correct; that is a measure of wchar_t's, whereas in LC_CTYPE=C, the length is in bytes.  So it seems like it is a comparison problem.

I think we invoke wstrcoll() -> bwscoll() in the latter case.  bwscoll() seems to be broken for short strings:

        if (len1 <= offset)
                return ((len2 <= offset) ? 0 : -1);

E.g., $ (echo a耳 ; echo a脳 ; echo a耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --debug
...
; offset=1
; k1=<a耳>(2), k2=<a脳>(2); offset=1; s1=<a耳>, s2=<a脳>; cmp1=-256
; offset=1
; k1=<a脳>(2), k2=<a耳>(2); offset=1; s1=<a脳>, s2=<a耳>; cmp1=256
; offset=1
; k1=<a耳>(2), k2=<a耳>(2); offset=1; s1=<a耳>, s2=<a耳>; cmp1=0
a耳
a耳
a脳

The result is correct, because length (2) < offset (1).  I don't know if 'offset' here is wrong, or if bswcoll is wrong.  It seems like maybe it only invokes bswcoll() on strings it thinks are identical from a radix perspective.  So perhaps the problem is some combination of wcstr and byte_sort in radixsort.

In --mergesort mode, the result and comparisons are correct:

(echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --mergesort --debug
Memory to be used for sorting: 17100230656
Using collate rules of C locale
Byte sort is used
sort_method=mergesort
; k1=<耳>(1), k2=<脳>(1); s1=<耳>, s2=<脳>; cmp1=-256
; k1=<脳>(1), k2=<耳>(1); s1=<脳>, s2=<耳>; cmp1=256
; k1=<耳>(1), k2=<耳>(1); s1=<耳>, s2=<耳>; cmp1=0
耳
耳
脳

Something is broken in radixsort.
Comment 3 Conrad Meyer freebsd_committer 2020-06-23 14:46:03 UTC
(From mergesort, offset=1 is also wrong in LC_CTYPE=C.  It should be offset=0.)  Probably we need to substract one from the radix level.  But that is not the only problem.
Comment 4 Conrad Meyer freebsd_committer 2020-06-23 15:10:38 UTC
Ok, so radix sort only goes byte-at-a-time; we can't allocate memory for all wchar_t space (4 GB).  Here are the wchar_t representations of the two characters:

echo 耳脳 | iconv -f utf-8 -t ucs-4 | hd
00000000  00 00 80 33 00 00 81 33                           |...3...3....|
          ^ first     ^ second

It incorrectly looks at the least significant byte of the wchar_t, observes that 33 == 33 and invokes collate to attempt to differentiate the two strings.  But using radixsort's level is wrong for bwscoll, which expects an offset in wchar_t.  Since radixsort has only processed 1/4 of a wchar_t, this is a bogus offset.

I'm not sure how our radixsort is supposed to work, honestly.  It seems pretty broken, even for ASCII.  It should be able to bucket multiple keys that share a character per level, but it doesn't — it falls back on comparison in that case.
Comment 5 Conrad Meyer freebsd_committer 2020-06-23 15:40:09 UTC
Hm we actually do bucket same-byte values in add_to_sublevel.  Why do we call strcmp eventually?

ah:

  379         default:
  380                 if (TINY_NODE(sl) || (sl->level > 15)) {
  381                         listcoll_t func;
  382
  383                         func = get_list_call_func(sl->level);

( == list_coll_offset(,, sl->level))

...
  385                         sl->leaves = sl->tosort;
  386                         sl->leaves_num = sl->tosort_num;
  387                         sl->leaves_sz = sl->leaves_num;
...
  403                                 DEFAULT_SORT_FUNC_RADIXSORT(sl->leaves, sl->leaves_num,
  404                                     sizeof(struct sort_list_item *),
  405                                     (int(*)(const void *, const void *)) func);

This appears to be a fast path when sorting a relatively small number of strings (< 65), or a degenerate case for comparing string bytes beyond 15 bytes.

When I input 100 1-wchar strings to radixsort, it attempts to radix sort, but does so incorrectly.

So there are several bugs:

1. radixsort does not process wchar strings correctly.
2. radixsort's fastpath does not invoke collate correctly for wchar strings.
Comment 6 Conrad Meyer freebsd_committer 2020-06-23 16:04:07 UTC
This patch fixes (2), but not (1):

@@ -258,14 +259,28 @@ add_leaf(struct sort_level *sl, struct sort_list_item *item)
 static inline int
 get_wc_index(struct sort_list_item *sli, size_t level)
 {
+       const size_t wcfact = (MB_CUR_MAX == 1) ? 1 : sizeof(wchar_t);
        const struct key_value *kv;
        const struct bwstring *bws;

        kv = get_key_from_keys_array(&sli->ka, 0);
        bws = kv->k;

-       if ((BWSLEN(bws) > level))
-               return (unsigned char) BWS_GET(bws,level);
+       if ((BWSLEN(bws) * wcfact > level)) {
+               wchar_t res;
+
+               /*
+                * Sort wchar strings a byte at a time, rather than a single
+                * byte from each wchar.
+                */
+               res = (wchar_t)BWS_GET(bws, level / wcfact);
+               /* Sort most-significant byte first. */
+               if (level % wcfact < wcfact - 1)
+                       res = (res >> (8 * (wcfact - 1 - (level % wcfact))));
+
+               return (res & 0xff);
+       }
+
        return (-1);
 }
Comment 7 Conrad Meyer freebsd_committer 2020-06-23 16:10:20 UTC
Also, I don't think there is any reason to degrade to mergesort if radix level is > 15, but that is an unrelated change.
Comment 8 Conrad Meyer freebsd_committer 2020-06-23 16:15:05 UTC
With this second patch:

(echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --radixsort --debug
Using collate rules of C locale
Byte sort is used
sort_method=radixsort
; k1=<耳>(1), k2=<脳>(1); s1=<耳>, s2=<脳>; cmp1=-256
; k1=<脳>(1), k2=<耳>(1); s1=<脳>, s2=<耳>; cmp1=256
; k1=<耳>(1), k2=<耳>(1); s1=<耳>, s2=<耳>; cmp1=0
耳
耳
脳

Which seems correct.  In C mode:

(echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=C LC_COLLATE=C LANG=C sort --radixsort --debug
Using collate rules of C locale
Byte sort is used
sort_method=radixsort
; offset=1
; k1=<耳>(3), k2=<脳>(3); offset=1; s1=<耳>, s2=<脳>; cmp1=-4
; offset=1
; k1=<脳>(3), k2=<耳>(3); offset=1; s1=<脳>, s2=<耳>; cmp1=4
; offset=1
; k1=<耳>(3), k2=<耳>(3); offset=1; s1=<耳>, s2=<耳>; cmp1=0
耳
耳
脳

@@ -317,6 +339,7 @@ free_sort_level(struct sort_level *sl)
 static void
 run_sort_level_next(struct sort_level *sl)
 {
+       const size_t wcfact = (MB_CUR_MAX == 1) ? 1 : sizeof(wchar_t);
        struct sort_level *slc;
        size_t i, sln, tosort_num;

@@ -333,8 +360,16 @@ run_sort_level_next(struct sort_level *sl)
                sort_left_dec(1);
                goto end;
        case (2):
+               /*
+                * Radixsort only processes a single byte at a time.  In wchar
+                * mode, this can be a subset of the length of a character.
+                * list_coll_offset() offset is in units of wchar, not bytes.
+                * So to calculate the offset, we must divide by
+                * sizeof(wchar_t) and round down to the index of the first
+                * character this level references.
+                */
                if (list_coll_offset(&(sl->tosort[0]), &(sl->tosort[1]),
-                   sl->level) > 0) {
+                   sl->level / wcfact) > 0) {
                        sl->sorted[sl->start_position++] = sl->tosort[1];
                        sl->sorted[sl->start_position] = sl->tosort[0];
                } else {
@@ -348,7 +383,13 @@ run_sort_level_next(struct sort_level *sl)
                if (TINY_NODE(sl) || (sl->level > 15)) {
                        listcoll_t func;

-                       func = get_list_call_func(sl->level);
+                       /*
+                        * Collate comparison offset is in units of
+                        * character-width, so we must divide the level (bytes)
+                        * by operating character width (wchar_t or char).  See
+                        * longer comment above.
+                        */
+                       func = get_list_call_func(sl->level / wcfact);

                        sl->leaves = sl->tosort;
                        sl->leaves_num = sl->tosort_num;
Comment 9 Conrad Meyer freebsd_committer 2020-06-23 16:25:35 UTC
Hm, there is also this dubious construction in wide_str_coll():

   125                                 wchar_t c1 = s1[i];
   126                                 wchar_t c2 = s2[i];
   127                                 if (c1 == L'\0')
   128                                         return ((c2 == L'\0') ? 0 : -1);
   129                                 if (c2 == L'\0')
   130                                         return (+1);
   131                                 if (c1 == c2)
   132                                         continue;
   133                                 return ((int)(c1 - c2));

Technically, int cannot accurately contain the full range of wchar_t c1 - c2.  This works ok for valid unicode characters (limited to 0x10FFFF) and FreeBSD int (>= 32-bit on all platforms) but is not generally a good idea.
Comment 10 commit-hook freebsd_committer 2020-06-23 16:44:00 UTC
A commit references this bug:

Author: cem
Date: Tue Jun 23 16:43:48 UTC 2020
New revision: 362545
URL: https://svnweb.freebsd.org/changeset/base/362545

Log:
  sort(1): Fix two wchar-related bugs in radixsort

  Sort(1)'s radixsort implementation was broken for multibyte LC_CTYPEs in at
  least two ways:

    * In actual radix sort, it would only bucket the least significant
      byte from each wchar, ignoring the 24 most-significant bits of each
      unicode character.

    * In degenerate cases / "fast paths," it would fall back to another
      sorting algorithm (default: mergesort) with a bogus comparator
      offset.  The string comparison functions in sort(1) take an offset
      in units of the operating character size.  However, radixsort was
      passing an offset in units of bytes.  The byte offset must be
      divided by sizeof(wchar_t).

  This revision addresses both discovered issues.

  Some example testcases:

    $ (echo ? ; echo ? ; echo ?) | \
    LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --radixsort --debug

    $ (echo ? ; echo ? ; echo ?) | \
    LC_CTYPE=C LC_COLLATE=C LANG=C           sort --radixsort --debug

    $ (for i in $(jot 34); do echo ?????; echo ?????; echo ?????; done) | \
    LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --radixsort --debug

  PR:		247494
  Reported by:	knu
  MFC after:	I do not intend to, but parties interested in stable might want to

Changes:
  head/usr.bin/sort/radixsort.c
Comment 11 Conrad Meyer freebsd_committer 2020-06-23 16:44:47 UTC
Thank you for the report!
Comment 12 Akinori MUSHA freebsd_committer 2020-06-24 01:46:44 UTC
Wow, it is amazing how you tracked and fixed this so quickly!  The detailed information given here will be very useful for future reference.  Thank you so much!
Comment 13 Conrad Meyer freebsd_committer 2020-06-24 02:32:06 UTC
Happy to help!