Bug 238568 - tar filenames encoding problem
Summary: tar filenames encoding problem
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 11.2-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
Reported: 2019-06-14 21:49 UTC by aler
Modified: 2019-06-15 11:13 UTC (History)
Description aler 2019-06-14 21:49:58 UTC
How to reproduce:

```
#!/bin/sh

rm -Rf d e
mkdir d
touch d/`printf '\306'`
mkdir e
tar -c -f - d | tar -C e -x -f -
```

Doing this with empty $LANG leads to
```
# sh test.sh
: Can't translate pathname 'd/Ж' to UTF-8
```

However, directory `d` is properly copied into `e`.

This error message disappears with `LANG=en_US.ISO8859-1`.

I'm not sure exactly what this error message means, but it is very unclear and can be read as "the file was not archived".
I also don't know why tar attempts any charset translation at all. It should be binary-safe with respect to filenames by default.
Comment 1 aler 2019-06-14 21:55:51 UTC
This started happening in FreeBSD 10.
Before that, tar never attempted any charset translation.
Comment 2 Conrad Meyer freebsd_committer freebsd_triage 2019-06-14 22:02:30 UTC
It's a non-fatal warning that changes the exit status to non-zero but, as you note, does not prevent a correct copy.

libarchive changes the copy mode from encoding-aware (UTF-8 default, I guess) to binary mode when it prints that text.
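
One way to observe the "warning, but correct copy" behavior described here is to capture the archiving step's exit status on its own and compare the copied names byte for byte. This is only a minimal sketch, not something taken from the report: the scratch directory, the `archive.tar` file name, and the variable names are illustrative assumptions. It writes to a file rather than a pipe because plain `sh` reports only the status of the last command in a pipeline.

```sh
#!/bin/sh
# Sketch: tell "tar printed a warning" apart from "the copy is wrong".
# All names below (work, archive.tar, *_status, *_names) are illustrative.
work=$(mktemp -d "${TMPDIR:-/tmp}/tarwarn.XXXXXX") && cd "$work" || exit 1
mkdir d e
touch "d/$(printf '\306')"      # file name is the single byte 0xC6

# Archive to a file so the creating tar's exit status is visible.
LC_ALL= LANG= tar -c -f archive.tar d
create_status=$?

tar -C e -x -f archive.tar
extract_status=$?

# Dump the names as octal bytes so the comparison ignores the terminal charset.
src_names=$(ls d | od -An -b)
dst_names=$(ls e/d | od -An -b)

echo "create exit status:  $create_status"
echo "extract exit status: $extract_status"
if [ "$src_names" = "$dst_names" ]; then
    echo "file names copied byte for byte"
else
    echo "file names differ"
fi
```

A non-zero `create_status` together with matching name dumps would correspond to the situation in the report; on systems whose tar does no name translation, both statuses are expected to be zero.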
Comment 3 Conrad Meyer freebsd_committer freebsd_triage 2019-06-14 22:07:14 UTC
This is specified by POSIX' pax:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html

"If there is a hdrcharset extended header in effect for a file, the value field for any gname, linkpath, path, and uname extended header records shall be encoded using the character set specified by the hdrcharset extended header record; otherwise, the value field shall be encoded using UTF-8. The value field for all other keywords specified by POSIX.1-2017 shall be encoded using UTF-8."
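
The `hdrcharset` record mentioned in the quoted text has a counterpart in libarchive's pax writer, which accepts a `hdrcharset` option with the values `BINARY` and `UTF-8`. The sketch below assumes the installed tar is bsdtar and passes that option through `--options`; whether this suppresses the warning on 11.2-STABLE has not been verified here, and the scratch directory and variable names are illustrative.

```sh
#!/bin/sh
# Sketch: ask the pax writer to record hdrcharset=BINARY rather than
# translating path names to UTF-8.
# Assumption: tar is bsdtar (libarchive) and accepts "--options hdrcharset=BINARY".
work=$(mktemp -d "${TMPDIR:-/tmp}/hdrcharset.XXXXXX") && cd "$work" || exit 1
mkdir d e
touch "d/$(printf '\306')"

if tar --version 2>/dev/null | grep -q bsdtar; then
    LC_ALL= LANG= tar --format pax --options hdrcharset=BINARY -c -f - d |
        tar -C e -x -f -
    # In plain sh, $? here is the status of the extracting tar only.
    pipeline_status=$?
    echo "extracting tar exit status: $pipeline_status"
    ls e/d | od -An -b
else
    # The option is libarchive-specific, so skip it for other tar implementations.
    pipeline_status=skipped
    echo "tar is not bsdtar; skipping the hdrcharset test"
fi
```

If the warning still appears with this option, that would be worth noting in the bug, since it would suggest the translation attempt happens before the option takes effect.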
Comment 4 Conrad Meyer freebsd_committer freebsd_triage 2019-06-14 22:08:30 UTC
(Prior to FreeBSD 10, the default tar format was likely the older "ustar" instead of "pax".)
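
If the format change is indeed the cause, forcing the older format should be a quick way to check. The following is a sketch under that assumption: it selects ustar explicitly with `--format ustar` and reports the creating tar's exit status and the copied names. The scratch directory, `archive.tar`, and the variable names are illustrative, and the outcome on 11.2-STABLE is the thing being tested, not something confirmed in this report.

```sh
#!/bin/sh
# Sketch: repeat the copy with the ustar format selected explicitly.
# Names such as work, archive.tar, ustar_status and copied are illustrative.
work=$(mktemp -d "${TMPDIR:-/tmp}/ustar.XXXXXX") && cd "$work" || exit 1
mkdir d e
touch "d/$(printf '\306')"

# Empty locale variables, as in the original reproduction.
LC_ALL= LANG= tar --format ustar -c -f archive.tar d
ustar_status=$?
tar -C e -x -f archive.tar

echo "ustar create exit status: $ustar_status"
original=$(ls d | od -An -b)
copied=$(ls e/d | od -An -b)
echo "names in d   (octal bytes): $original"
echo "names in e/d (octal bytes): $copied"
```

A zero exit status with no "Can't translate pathname" message would support the explanation that the behavior arrived with the switch of default format.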
Comment 5 aler 2019-06-15 11:13:57 UTC
The filesystem has no internal charset, so it is odd to translate from no charset (= BINARY) to any explicit charset.

It is also not good that the resulting archive depends on the environment variable $LANG, which is intended for run-time localization and not for abstract data processing. This dependency is also undocumented in the tar manpage.