Bug 241679 - /usr/bin/sort fails if UTF-8 input is received from stdin
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 12.0-RELEASE
Hardware: Any Any
Importance: --- Affects Many People
Assignee: Baptiste Daroussin
Reported: 2019-11-02 21:09 UTC by Ronald F. Guilmette
Modified: 2022-10-19 08:04 UTC

Description Ronald F. Guilmette 2019-11-02 21:09:18 UTC
If /usr/bin/sort is given the name of a file on the command line, and that file contains lines with UTF-8 encoded data, then the sort utility functions normally and as expected. If, on the other hand, the same file content is supplied to /usr/bin/sort via stdin, then sort fails completely and issues the following error message:

sort: Illegal byte sequence

This can be verified by placing the following line into a file called "test" and
then attempting to sort that file content in different ways:

zürich.email

Example #1:
sort test

The above works just fine.

Example #2:
sort < test

The above fails and issues the error:
sort: Illegal byte sequence
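
The two invocations above can be compared in a single script. This is a minimal sketch and not part of the original report: the helper name make_utf8_sample and the temporary directory are illustrative, and the locale en_US.UTF-8 is an assumption (the report does not state one). Whether the second call fails depends on the sort version installed.

```shell
#!/bin/sh
# Hypothetical helper: write the one-line sample from the report to "$1".
# \303\274 is the UTF-8 encoding of "ü" (U+00FC).
make_utf8_sample() {
    printf 'z\303\274rich.email\n' > "$1"
}

dir=$(mktemp -d "${TMPDIR:-/tmp}/sort-pr.XXXXXX")
make_utf8_sample "$dir/test"

# Example #1: file name on the command line.
env LC_ALL=en_US.UTF-8 sort "$dir/test"
echo "sort FILE   exit status: $?"

# Example #2: same content supplied on stdin.
env LC_ALL=en_US.UTF-8 sort < "$dir/test"
echo "sort < FILE exit status: $?"

rm -rf "$dir"
```

On an affected system, the second status would be expected to be non-zero, accompanied by the "Illegal byte sequence" message on stderr.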
Comment 1 Conrad Meyer freebsd_committer freebsd_triage 2019-11-03 18:05:59 UTC
I can't reproduce the issue on 13-CURRENT with the following environment:

LANG=en_US.UTF-8

I also cannot reproduce it on 13-CURRENT with unset LANG or LANG=C or C.US-ASCII, or LC_CTYPE set to any weird non-UTF8 charset I know of.
Comment 2 Ed Maste freebsd_committer freebsd_triage 2019-11-03 19:27:07 UTC
Could not reproduce on ref12-amd64.freebsd.org either; perhaps this was fixed on stable/12 after 12.0 was released.
Comment 3 David Demelier 2019-11-04 14:22:54 UTC
Can you please tell us which shell you are using, too? And of course, the output of:

file test
hexdump -c test
hexdump -x test
Comment 4 Yuri Pankov freebsd_committer freebsd_triage 2020-12-14 17:21:21 UTC
Please reopen if it's still a problem and there's a way to reproduce it.
Comment 5 Andreas Baumgärtner 2022-06-07 18:33:22 UTC
Here is how to reproduce the issue:

FreeBSD 12.3-RELEASE-p2
FreeBSD 12.3-RELEASE-p5

host: /tmp# echo $LANG
en_US.UTF-8
host: /tmp# cat a | sort
sort: Illegal byte sequence
host: /tmp# cut -d" " -f 1 a
cut: a: Illegal byte sequence
host: /tmp# sort a
gelöscht
host: /tmp# sort a > b
host: /tmp# diff a b
1c1
< gel▒scht
---
> gelöscht
host: /tmp# file a b
a: ISO-8859 text
b: UTF-8 Unicode text
host: /tmp# hexdump -c a
0000000   g   e   l   ▒   s   c   h   t  \n
0000009
host: /tmp# hexdump -c b
0000000   g   e   l   ▒   ▒   s   c   h   t  \n
000000a
host: /tmp# hexdump -x a
0000000    6567    f66c    6373    7468    000a
0000009
host: /tmp# hexdump -x b
0000000    6567    c36c    73b6    6863    0a74
000000a
host: /tmp# ls -l a b
-rw-r--r--  1 root  wheel   9 Jun  7 20:11 a
-rw-r--r--  1 root  wheel  10 Jun  7 20:20 b
host: /tmp# 

My initial issue was with a directory (recreated here in reverse to reproduce it):
host: /tmp# mkdir `cat a`
host: /tmp# find . -type d | sort
sort: Illegal byte sequence
host: /tmp# rmdir gel\366scht/
host: /tmp# 

I'm unable to re-open the bug from here.
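
The hexdump output above shows that file a contains the single byte 0xf6 (ö in ISO-8859-1), which is not a valid UTF-8 sequence, while file b contains the two-byte UTF-8 form c3 b6. As a hedged sketch, and not the fix later applied in this PR, two common ways to handle such input are to sort byte-wise in the C locale or to convert the data to UTF-8 first with iconv(1). The helper names sort_bytes and latin1_to_utf8 are hypothetical, and the conversion assumes the input really is ISO-8859-1.

```shell
#!/bin/sh
# Hypothetical helper: in the C locale, sort compares raw bytes and
# does not try to decode multibyte characters.
sort_bytes() {
    env LC_ALL=C sort "$@"
}

# Hypothetical helper: convert stdin from ISO-8859-1 to UTF-8.
# Assumes the input encoding is known; iconv cannot detect it.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

dir=$(mktemp -d "${TMPDIR:-/tmp}/sort-pr.XXXXXX")

# Recreate file "a" from the transcript: "gel", byte 0xf6 (octal 366), "scht".
printf 'gel\366scht\n' > "$dir/a"

# Show the raw bytes; od(1) output does not depend on terminal rendering.
od -An -tx1 "$dir/a"

# Option 1: byte-wise sort of the unconverted data.
sort_bytes < "$dir/a" | od -An -tx1

# Option 2: convert to UTF-8, then inspect the result (compare with file "b").
latin1_to_utf8 < "$dir/a" | od -An -tx1

rm -rf "$dir"
```

For the directory case, the same byte-wise approach would be `find . -type d | LC_ALL=C sort`, at the cost of losing locale-aware collation.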
Comment 6 Stefan Eßer freebsd_committer freebsd_triage 2022-06-07 19:26:26 UTC
Re-open after steps to reproduce the issue have been added to the PR.
Comment 7 Baptiste Daroussin freebsd_committer freebsd_triage 2022-10-12 14:09:15 UTC
https://reviews.freebsd.org/D36948
Comment 8 commit-hook freebsd_committer freebsd_triage 2022-10-12 15:38:23 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b58094c0d98e5d0ab8abdcdb655ac902ae8ad66e

commit b58094c0d98e5d0ab8abdcdb655ac902ae8ad66e
Author:     Baptiste Daroussin <bapt@FreeBSD.org>
AuthorDate: 2022-10-12 13:31:50 +0000
Commit:     Baptiste Daroussin <bapt@FreeBSD.org>
CommitDate: 2022-10-12 15:37:33 +0000

    sort: replace home made line reader by getdelim(3)

    The previous code had bug when reading lines with an unexpected
    encoding, returning without the full line being captured.
    This result in sort complaining with "sort: Illegal byte sequence"

    Using getdelim(3) instead of the home made code, fixes the situation.

    PR:             241679
    Reported by:    Ronald F. Guilmette <rfg-freebsd@tristatelogic.com>
    MFC After:      1 week
    Reviewed by:    markj, imp
    Differential Revision:  https://reviews.freebsd.org/D36948

 usr.bin/sort/bwstring.c | 109 ------------------------------------------------
 usr.bin/sort/file.c     |  18 +++++---
 2 files changed, 12 insertions(+), 115 deletions(-)
Comment 9 commit-hook freebsd_committer freebsd_triage 2022-10-19 08:01:53 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=489c9df19851aa052553a79de65e37ea1471014e

commit 489c9df19851aa052553a79de65e37ea1471014e
Author:     Baptiste Daroussin <bapt@FreeBSD.org>
AuthorDate: 2022-10-12 13:31:50 +0000
Commit:     Baptiste Daroussin <bapt@FreeBSD.org>
CommitDate: 2022-10-19 07:59:28 +0000

    sort: replace home made line reader by getdelim(3)

    The previous code had bug when reading lines with an unexpected
    encoding, returning without the full line being captured.
    This result in sort complaining with "sort: Illegal byte sequence"

    Using getdelim(3) instead of the home made code, fixes the situation.

    PR:             241679
    Reported by:    Ronald F. Guilmette <rfg-freebsd@tristatelogic.com>
    MFC After:      1 week
    Reviewed by:    markj, imp
    Differential Revision:  https://reviews.freebsd.org/D36948

    (cherry picked from commit b58094c0d98e5d0ab8abdcdb655ac902ae8ad66e)

 usr.bin/sort/bwstring.c | 109 ------------------------------------------------
 usr.bin/sort/file.c     |  18 +++++---
 2 files changed, 12 insertions(+), 115 deletions(-)