Bug 268189 - BSD tar incorrectly encodes UTF-8 sequences
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 13.1-RELEASE
Hardware: Any Any
Importance: --- Affects Many People
Assignee: freebsd-bugs (Nobody)
Reported: 2022-12-06 08:22 UTC by Aleksander Derevianko
Modified: 2022-12-09 14:56 UTC

Description Aleksander Derevianko 2022-12-06 08:22:58 UTC
BSD tar incorrectly encodes UTF-8 sequences

How to repeat:
Create two directories with (UTF-8) names:

d0 bf d0 be d0 bb d0 b5 d0 b2 d0 be d0 b8 cc 86
d0 bf d0 be d0 bb d0 b5 d0 b2 d0 be d0 b9

("полевой" and "полевой"). They look exactly the same, but they are actually different names.

The difference is that the sequence 'd0 b9' encodes the single Cyrillic character 'й' (U+0439), whereas 'd0 b8 cc 86' actually encodes two code points: Cyrillic 'и' (U+0438) followed by a combining diacritic (U+0306, combining breve), which I can't enter here.
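
To make the difference concrete, here is a minimal Python sketch that decodes the two byte sequences from this report and compares them. The variable names are illustrative only. That the two names are the NFD and NFC forms of the same text is a property of Unicode; whether bsdtar/libarchive actually applies NFC normalization is an assumption that this report does not confirm.

```python
import unicodedata

# Byte sequences copied from the report above.
decomposed = bytes.fromhex("d0bfd0bed0bbd0b5d0b2d0bed0b8cc86").decode("utf-8")
precomposed = bytes.fromhex("d0bfd0bed0bbd0b5d0b2d0bed0b9").decode("utf-8")

# The names render identically but differ in code point count.
print(len(decomposed), len(precomposed))  # 8 7
print(decomposed == precomposed)          # False

# Canonical normalization maps one form onto the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Any tool that normalizes file names to NFC would therefore map both names to the same string.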

You can create such directories or files, but when they are archived with BSD tar, the second name is replaced by the first name.

Adding the --posix option or setting LC_ALL=C doesn't help.

GNU tar handles such files correctly, as separate files/directories.
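
The reproduction steps can be sketched as a small Python script. This is only an illustration under stated assumptions: the helper names make_dirs and round_trip are hypothetical, the tar binary is assumed to be reachable under the name passed as tar_cmd, and the file system is assumed not to normalize file names itself. Only the common -c, -x, -f and -C options are used.

```python
import os
import subprocess
import tempfile

# Byte sequences copied from the report.
DECOMPOSED = bytes.fromhex("d0bfd0bed0bbd0b5d0b2d0bed0b8cc86")
PRECOMPOSED = bytes.fromhex("d0bfd0bed0bbd0b5d0b2d0bed0b9")


def make_dirs(root: bytes) -> list:
    """Create both directories under root and return the raw names found there."""
    for name in (DECOMPOSED, PRECOMPOSED):
        os.mkdir(os.path.join(root, name))
    return sorted(os.listdir(root))


def round_trip(tar_cmd: str = "tar") -> list:
    """Archive the two directories, extract them elsewhere, return extracted names.

    Names are compared as raw bytes so that locale-dependent escaping in
    listing output cannot hide or fake a difference.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        dst = os.path.join(tmp, "dst")
        os.mkdir(src)
        os.mkdir(dst)
        make_dirs(os.fsencode(src))
        archive = os.path.join(tmp, "test.tar")
        subprocess.run([tar_cmd, "-cf", archive, "-C", src, "."], check=True)
        subprocess.run([tar_cmd, "-xf", archive, "-C", dst], check=True)
        return sorted(os.listdir(os.fsencode(dst)))
```

Calling round_trip() with the name of the tar binary under test should return two distinct names if the tool preserves them; a single name would match the behavior described in this report.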

I think that at least --posix (or some other option) should make it possible to COMPLETELY disable all filename encoding/decoding operations.

The problem also occurs in 12.3-RELEASE, but seems to be absent in the 10.x releases.
Comment 1 Aleksander Derevianko 2022-12-06 08:24:28 UTC
Correction: the first name is replaced by the second, so 'и' plus the diacritic is replaced by 'й'.
Comment 2 Konstantin Belousov freebsd_committer freebsd_triage 2022-12-08 01:39:02 UTC
Did you try to contact the libarchive author? bsdtar/libarchive has been
contributed software for a long time.

https://www.libarchive.org/