My understanding is it is LC_COLLATE that affects how sort(1) compares characters and the C locale has the collation for binary comparison, so I set LC_COLLATE=C when calling sort(1) for language independent sorting, but it seems LC_CTYPE also affects how sort(1) works. % (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort 耳 脳 耳 % (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=C LC_COLLATE=C LANG=C sort 耳 耳 脳 For reference, GNU sort works fine with any LC_CTYPE, and according to a NetBSD user the same goes for NetBSD sort. % (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C gsort 耳 耳 脳 Is this a bug or by design?
On CURRENT: $ LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C locale LANG=C LC_CTYPE=ja_JP.UTF-8 LC_COLLATE="C" LC_TIME="C" LC_NUMERIC="C" LC_MONETARY="C" LC_MESSAGES="C" LC_ALL= sort(1) attempts to identify situations where it can run in fast, byte-compare only mode by looking only at LC_COLLATE. The --debug option shows more information: $ (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --debug Memory to be used for sorting: 17100230656 Using collate rules of C locale Byte sort is used sort_method=radixsort ; offset=1 ; k1=<耳>(1), k2=<脳>(1); offset=1; s1=<耳>, s2=<脳>; cmp1=0 ; offset=1 ; k1=<脳>(1), k2=<耳>(1); offset=1; s1=<脳>, s2=<耳>; cmp1=0 耳 脳 耳 Both compares seem wrong. The UTF-8 sequences share only the first byte, 0xe8. In LC_CTYPE=C mode: ; offset=1 ; k1=<耳>(3), k2=<脳>(3); offset=1; s1=<耳>, s2=<脳>; cmp1=-4 ; offset=1 ; k1=<脳>(3), k2=<耳>(3); offset=1; s1=<脳>, s2=<耳>; cmp1=4 ; offset=1 ; k1=<耳>(3), k2=<耳>(3); offset=1; s1=<耳>, s2=<耳>; cmp1=0 耳 耳 脳 The comparisons look correct. I will look a little more. I think this is a bug, not design, but I am not sure yet.
I think the lengths printed in the bad example are correct; that is a measure of wchar_t's, whereas in LC_CTYPE=C, the length is in bytes. So it seems like it is a comparison problem. I think we invoke wstrcoll() -> bwscoll() in the latter case. bwscoll() seems to be broken for short strings: if (len1 <= offset) return ((len2 <= offset) ? 0 : -1); E.g., $ (echo a耳 ; echo a脳 ; echo a耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --debug ... ; offset=1 ; k1=<a耳>(2), k2=<a脳>(2); offset=1; s1=<a耳>, s2=<a脳>; cmp1=-256 ; offset=1 ; k1=<a脳>(2), k2=<a耳>(2); offset=1; s1=<a脳>, s2=<a耳>; cmp1=256 ; offset=1 ; k1=<a耳>(2), k2=<a耳>(2); offset=1; s1=<a耳>, s2=<a耳>; cmp1=0 a耳 a耳 a脳 The result is correct, because length (2) < offset (1). I don't know if 'offset' here is wrong, or if bswcoll is wrong. It seems like maybe it only invokes bswcoll() on strings it thinks are identical from a radix perspective. So perhaps the problem is some combination of wcstr and byte_sort in radixsort. In --mergesort mode, the result and comparisons are correct: (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --mergesort --debug Memory to be used for sorting: 17100230656 Using collate rules of C locale Byte sort is used sort_method=mergesort ; k1=<耳>(1), k2=<脳>(1); s1=<耳>, s2=<脳>; cmp1=-256 ; k1=<脳>(1), k2=<耳>(1); s1=<脳>, s2=<耳>; cmp1=256 ; k1=<耳>(1), k2=<耳>(1); s1=<耳>, s2=<耳>; cmp1=0 耳 耳 脳 Something is broken in radixsort.
(From mergesort, offset=1 is also wrong in LC_CTYPE=C. It should be offset=0.) Probably we need to substract one from the radix level. But that is not the only problem.
Ok, so radix sort only goes byte-at-a-time; we can't allocate memory for all wchar_t space (4 GB). Here are the wchar_t representations of the two characters: echo 耳脳 | iconv -f utf-8 -t ucs-4 | hd 00000000 00 00 80 33 00 00 81 33 |...3...3....| ^ first ^ second It incorrectly looks at the least significant byte of the wchar_t, observes that 33 == 33 and invokes collate to attempt to differentiate the two strings. But using radixsort's level is wrong for bwscoll, which expects an offset in wchar_t. Since radixsort has only processed 1/4 of a wchar_t, this is a bogus offset. I'm not sure how our radixsort is supposed to work, honestly. It seems pretty broken, even for ASCII. It should be able to bucket multiple keys that share a character per level, but it doesn't — it falls back on comparison in that case.
Hm we actually do bucket same-byte values in add_to_sublevel. Why do we call strcmp eventually? ah: 379 default: 380 if (TINY_NODE(sl) || (sl->level > 15)) { 381 listcoll_t func; 382 383 func = get_list_call_func(sl->level); ( == list_coll_offset(,, sl->level)) ... 385 sl->leaves = sl->tosort; 386 sl->leaves_num = sl->tosort_num; 387 sl->leaves_sz = sl->leaves_num; ... 403 DEFAULT_SORT_FUNC_RADIXSORT(sl->leaves, sl->leaves_num, 404 sizeof(struct sort_list_item *), 405 (int(*)(const void *, const void *)) func); This appears to be a fast path when sorting a relatively small number of strings (< 65), or a degenerate case for comparing string bytes beyond 15 bytes. When I input 100 1-wchar strings to radixsort, it attempts to radix sort, but does so incorrectly. So there are several bugs: 1. radixsort does not process wchar strings correctly. 2. radixsort's fastpath does not invoke collate correctly for wchar strings.
This patch fixes (2), but not (1): @@ -258,14 +259,28 @@ add_leaf(struct sort_level *sl, struct sort_list_item *item) static inline int get_wc_index(struct sort_list_item *sli, size_t level) { + const size_t wcfact = (MB_CUR_MAX == 1) ? 1 : sizeof(wchar_t); const struct key_value *kv; const struct bwstring *bws; kv = get_key_from_keys_array(&sli->ka, 0); bws = kv->k; - if ((BWSLEN(bws) > level)) - return (unsigned char) BWS_GET(bws,level); + if ((BWSLEN(bws) * wcfact > level)) { + wchar_t res; + + /* + * Sort wchar strings a byte at a time, rather than a single + * byte from each wchar. + */ + res = (wchar_t)BWS_GET(bws, level / wcfact); + /* Sort most-significant byte first. */ + if (level % wcfact < wcfact - 1) + res = (res >> (8 * (wcfact - 1 - (level % wcfact)))); + + return (res & 0xff); + } + return (-1); }
Also, I don't think there is any reason to degrade to mergesort if radix level is > 15, but that is an unrelated change.
With this second patch: (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --radixsort --debug Using collate rules of C locale Byte sort is used sort_method=radixsort ; k1=<耳>(1), k2=<脳>(1); s1=<耳>, s2=<脳>; cmp1=-256 ; k1=<脳>(1), k2=<耳>(1); s1=<脳>, s2=<耳>; cmp1=256 ; k1=<耳>(1), k2=<耳>(1); s1=<耳>, s2=<耳>; cmp1=0 耳 耳 脳 Which seems correct. In C mode: (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=C LC_COLLATE=C LANG=C sort --radixsort --debug Using collate rules of C locale Byte sort is used sort_method=radixsort ; offset=1 ; k1=<耳>(3), k2=<脳>(3); offset=1; s1=<耳>, s2=<脳>; cmp1=-4 ; offset=1 ; k1=<脳>(3), k2=<耳>(3); offset=1; s1=<脳>, s2=<耳>; cmp1=4 ; offset=1 ; k1=<耳>(3), k2=<耳>(3); offset=1; s1=<耳>, s2=<耳>; cmp1=0 耳 耳 脳 @@ -317,6 +339,7 @@ free_sort_level(struct sort_level *sl) static void run_sort_level_next(struct sort_level *sl) { + const size_t wcfact = (MB_CUR_MAX == 1) ? 1 : sizeof(wchar_t); struct sort_level *slc; size_t i, sln, tosort_num; @@ -333,8 +360,16 @@ run_sort_level_next(struct sort_level *sl) sort_left_dec(1); goto end; case (2): + /* + * Radixsort only processes a single byte at a time. In wchar + * mode, this can be a subset of the length of a character. + * list_coll_offset() offset is in units of wchar, not bytes. + * So to calculate the offset, we must divide by + * sizeof(wchar_t) and round down to the index of the first + * character this level references. + */ if (list_coll_offset(&(sl->tosort[0]), &(sl->tosort[1]), - sl->level) > 0) { + sl->level / wcfact) > 0) { sl->sorted[sl->start_position++] = sl->tosort[1]; sl->sorted[sl->start_position] = sl->tosort[0]; } else { @@ -348,7 +383,13 @@ run_sort_level_next(struct sort_level *sl) if (TINY_NODE(sl) || (sl->level > 15)) { listcoll_t func; - func = get_list_call_func(sl->level); + /* + * Collate comparison offset is in units of + * character-width, so we must divide the level (bytes) + * by operating character width (wchar_t or char). See + * longer comment above. + */ + func = get_list_call_func(sl->level / wcfact); sl->leaves = sl->tosort; sl->leaves_num = sl->tosort_num;
Hm, there is also this dubious construction in wide_str_coll(): 125 wchar_t c1 = s1[i]; 126 wchar_t c2 = s2[i]; 127 if (c1 == L'\0') 128 return ((c2 == L'\0') ? 0 : -1); 129 if (c2 == L'\0') 130 return (+1); 131 if (c1 == c2) 132 continue; 133 return ((int)(c1 - c2)); Technically, int cannot accurately contain the full range of wchar_t c1 - c2. This works ok for valid unicode characters (limited to 0x10FFFF) and FreeBSD int (>= 32-bit on all platforms) but is not generally a good idea.
A commit references this bug: Author: cem Date: Tue Jun 23 16:43:48 UTC 2020 New revision: 362545 URL: https://svnweb.freebsd.org/changeset/base/362545 Log: sort(1): Fix two wchar-related bugs in radixsort Sort(1)'s radixsort implementation was broken for multibyte LC_CTYPEs in at least two ways: * In actual radix sort, it would only bucket the least significant byte from each wchar, ignoring the 24 most-significant bits of each unicode character. * In degenerate cases / "fast paths," it would fall back to another sorting algorithm (default: mergesort) with a bogus comparator offset. The string comparison functions in sort(1) take an offset in units of the operating character size. However, radixsort was passing an offset in units of bytes. The byte offset must be divided by sizeof(wchar_t). This revision addresses both discovered issues. Some example testcases: $ (echo ? ; echo ? ; echo ?) | \ LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --radixsort --debug $ (echo ? ; echo ? ; echo ?) | \ LC_CTYPE=C LC_COLLATE=C LANG=C sort --radixsort --debug $ (for i in $(jot 34); do echo ?????; echo ?????; echo ?????; done) | \ LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --radixsort --debug PR: 247494 Reported by: knu MFC after: I do not intend to, but parties interested in stable might want to Changes: head/usr.bin/sort/radixsort.c
Thank you for the report!
Wow, it is amazing how you tracked and fixed this so quickly! The detailed information given here will be very useful for future reference. Thank you so much!
Happy to help!