Summary: | /usr/bin/sort fails if UTF-8 input is received from stdin | | |
---|---|---|---|
Product: | Base System | Reporter: | Ronald F. Guilmette <rfg-freebsd> |
Component: | bin | Assignee: | Baptiste Daroussin <bapt> |
Status: | Closed FIXED | | |
Severity: | Affects Many People | CC: | andreas.baumgaertner, cem, emaste, markand, se, yuripv |
Priority: | --- | | |
Version: | 12.0-RELEASE | | |
Hardware: | Any | | |
OS: | Any | | |
Description
Ronald F. Guilmette
2019-11-02 21:09:18 UTC
I can't reproduce the issue on 13-CURRENT with the following environment:

LANG=en_US.UTF-8

I also cannot reproduce it on 13-CURRENT with unset LANG or LANG=C or C.US-ASCII, or LC_CTYPE set to any weird non-UTF8 charset I know of. Could not reproduce on ref12-amd64.freebsd.org either; perhaps this was fixed on stable/12 after 12.0 released.

Can you please tell us what's your shell too? And of course:

file test
hexdump -c test
hexdump -x test

Please reopen if it's still a problem and there's a way to reproduce it.

Here is how to reproduce the issue:
FreeBSD 12.3-RELEASE-p2
FreeBSD 12.3-RELEASE-p5
host: /tmp# echo $LANG
en_US.UTF-8
host: /tmp# cat a | sort
sort: Illegal byte sequence
host: /tmp# cut -d" " -f 1 a
cut: a: Illegal byte sequence
host: /tmp# sort a
gelöscht
host: /tmp# sort a > b
host: /tmp# diff a b
1c1
< gel▒scht
---
> gelöscht
host: /tmp# file a b
a: ISO-8859 text
b: UTF-8 Unicode text
host: /tmp# hexdump -c a
0000000 g e l ▒ s c h t \n
0000009
host: /tmp# hexdump -c b
0000000 g e l ▒ ▒ s c h t \n
000000a
host: /tmp# hexdump -x a
0000000 6567 f66c 6373 7468 000a
0000009
host: /tmp# hexdump -x b
0000000 6567 c36c 73b6 6863 0a74
000000a
host: /tmp# ls -l a b
-rw-r--r-- 1 root wheel 9 Jun 7 20:11 a
-rw-r--r-- 1 root wheel 10 Jun 7 20:20 b
host: /tmp#
My initial issue was with a directory name (reproduced here in reverse order):
host: /tmp# mkdir `cat a`
host: /tmp# find . -type d | sort
sort: Illegal byte sequence
host: /tmp# rmdir gel\366scht/
host: /tmp#
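
The hexdumps above come down to a single byte. `hexdump -x` prints 16-bit words in host byte order, so on a little-endian machine `f66c` in file a is the byte pair `6c f6`: the letter `l` followed by 0xF6, which is `ö` in ISO-8859-1 and the same byte as `\366` in the directory name. In file b that byte has become the two-byte UTF-8 sequence `c3 b6`. The following is a minimal sketch, not code from sort(1), of why 0xF6 can never appear in well-formed UTF-8 and how the Latin-1 byte maps to the UTF-8 pair. The helper names `is_valid_utf8` and `latin1_to_utf8` are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/*
 * Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
 * Hypothetical helper for illustration; not taken from sort(1).
 */
int
is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* continuation bytes expected after c */
        unsigned long cp;   /* code point being assembled */

        if (c < 0x80) {     /* plain ASCII */
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {
            need = 1; cp = c & 0x1F;
        } else if (c >= 0xE0 && c <= 0xEF) {
            need = 2; cp = c & 0x0F;
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3; cp = c & 0x07;
        } else {
            /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */
            return 0;
        }

        if (len - i - 1 < need)
            return 0;       /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values above U+10FFFF. */
        if ((need == 2 && cp < 0x800) || (need == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0;

        i += need + 1;
    }
    return 1;
}

/*
 * Re-encode ISO-8859-1 bytes as UTF-8. The caller must provide an
 * output buffer of at least 2 * len bytes. Returns the bytes written.
 */
size_t
latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];
        } else {
            /* U+0080..U+00FF always encode as two bytes. */
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return n;
}
```

Calling `is_valid_utf8` on the 9 bytes of file a fails at offset 3, while the 10 bytes produced by `latin1_to_utf8` pass, which matches the sizes shown by `ls -l a b`.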
I'm unable to re-open the bug from here.
Re-open after steps to reproduce the issue have been added to the PR.

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b58094c0d98e5d0ab8abdcdb655ac902ae8ad66e

commit b58094c0d98e5d0ab8abdcdb655ac902ae8ad66e
Author:     Baptiste Daroussin <bapt@FreeBSD.org>
AuthorDate: 2022-10-12 13:31:50 +0000
Commit:     Baptiste Daroussin <bapt@FreeBSD.org>
CommitDate: 2022-10-12 15:37:33 +0000

    sort: replace home made line reader by getdelim(3)

    The previous code had bug when reading lines with an unexpected
    encoding, returning without the full line being captured. This
    result in sort complaining with "sort: Illegal byte sequence"

    Using getdelim(3) instead of the home made code, fixes the situation.

    PR: 241679
    Reported by: Ronald F. Guilmette <rfg-freebsd@tristatelogic.com>
    MFC After: 1 week
    Reviewed by: markj, imp
    Differential Revision: https://reviews.freebsd.org/D36948

 usr.bin/sort/bwstring.c | 109 ------------------------------------------------
 usr.bin/sort/file.c     |  18 +++++---
 2 files changed, 12 insertions(+), 115 deletions(-)

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=489c9df19851aa052553a79de65e37ea1471014e

commit 489c9df19851aa052553a79de65e37ea1471014e
Author:     Baptiste Daroussin <bapt@FreeBSD.org>
AuthorDate: 2022-10-12 13:31:50 +0000
Commit:     Baptiste Daroussin <bapt@FreeBSD.org>
CommitDate: 2022-10-19 07:59:28 +0000

    sort: replace home made line reader by getdelim(3)

    The previous code had bug when reading lines with an unexpected
    encoding, returning without the full line being captured. This
    result in sort complaining with "sort: Illegal byte sequence"

    Using getdelim(3) instead of the home made code, fixes the situation.

    PR: 241679
    Reported by: Ronald F. Guilmette <rfg-freebsd@tristatelogic.com>
    MFC After: 1 week
    Reviewed by: markj, imp
    Differential Revision: https://reviews.freebsd.org/D36948

    (cherry picked from commit b58094c0d98e5d0ab8abdcdb655ac902ae8ad66e)

 usr.bin/sort/bwstring.c | 109 ------------------------------------------------
 usr.bin/sort/file.c     |  18 +++++---
 2 files changed, 12 insertions(+), 115 deletions(-)
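
The commit message says the fix reads lines with getdelim(3) rather than a hand-written reader. The sketch below is not the code from usr.bin/sort/file.c; it only illustrates the general idea, assuming the line is captured as raw bytes so that a byte such as 0xF6 cannot cut the record short before any character-set conversion happens. The wrapper name `read_record` and its return convention are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Read the next record from fp, up to and including delim, as raw bytes.
 * Hypothetical helper for illustration only.
 *
 * *line and *cap are managed by getdelim(3): pass *line = NULL and
 * *cap = 0 on the first call and free(*line) when done.
 *
 * Returns the record length without the trailing delimiter,
 * or -1 on end of file or read error.
 */
ssize_t
read_record(FILE *fp, char **line, size_t *cap, int delim)
{
    ssize_t len = getdelim(line, cap, delim, fp);

    if (len == -1)
        return -1;

    /* Strip the delimiter if present (the last record may lack one). */
    if (len > 0 && (unsigned char)(*line)[len - 1] == (unsigned char)delim)
        (*line)[--len] = '\0';

    return len;
}
```

Because getdelim(3) works on bytes and grows the buffer as needed, the whole record is returned regardless of the locale; deciding whether those bytes are valid in the current LC_CTYPE is left to a later step.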