225055 – audio/mp3splt: the default encoding for ID3v2 tags should be UTF-16BE, not UTF-16LE


Status:	Closed Not Accepted

Alias:	None

Product:	Ports & Packages
Classification:	Unclassified
Component:	Individual Port(s)
Version:	Latest
Hardware:	Any Any

Importance:	--- Affects Only Me
Assignee:	freebsd-ports-bugs (Nobody)

URL:
Keywords:

Depends on:
Blocks:

Reported:	2018-01-10 16:13 UTC by Victor Sudakov
Modified:	2018-07-10 09:57 UTC
CC List:	4 users

See Also:

Flags:	bugzilla: maintainer-feedback? (citrin)


Description Victor Sudakov 2018-01-10 16:13:06 UTC

mp3splt creates ID3v2 tags in UTF-16LE while it should be UTF-16BE according to Wikipedia. ID3Fixer says the tags are incorrect.

Workaround: use "-C 8" or even "-C 1".

Sorry, I think this PR is one of the weirdest ever. I'm not even sure I should be sending it.

The original authors of mp3splt don't seem to be reachable by E-mail.

Comment 1 Anton Yuzhaninov 2018-01-24 17:57:16 UTC

Thanks for the report, I will look into it.

As far as I remember mp3splt writes id3v2.4.0 tags and 2.4.0 allows text in both UTF-16 with BOM and UTF-16BE: http://id3.org/id3v2.4.0-structure

Does it cause compatibility problems?

Comment 2 Victor Sudakov 2018-01-25 02:48:19 UTC

(In reply to Anton Yuzhaninov from comment #1)

I was basing my report on information from Wikipedia: 

Textual frames are marked with an encoding byte.[8]

$00 – ISO-8859-1 (LATIN-1, Identical to ASCII for values smaller than 0x80).
$01 – UCS-2 encoded Unicode with BOM, in ID3v2.2 and ID3v2.3.
$02 – UTF-16BE encoded Unicode without BOM, in ID3v2.4.
$03 – UTF-8 encoded Unicode, in ID3v2.4.
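The encoding bytes listed above can be expressed as a small lookup table. This is only an illustrative sketch in Python: the names `ID3V2_TEXT_ENCODINGS` and `decode_text_frame_body` are hypothetical and do not come from mp3splt or libid3tag, and the sketch assumes the frame body is exactly one encoding byte followed by the text.

```python
# ID3v2 text-encoding byte -> Python codec name, following the list above.
# $01 means UCS-2 with BOM in ID3v2.2/2.3; Python's "utf-16" codec honours
# the BOM and decodes BMP-only text the same way.
ID3V2_TEXT_ENCODINGS = {
    0x00: "iso-8859-1",
    0x01: "utf-16",     # byte order is taken from the BOM
    0x02: "utf-16-be",  # ID3v2.4 only, no BOM
    0x03: "utf-8",      # ID3v2.4 only
}


def decode_text_frame_body(body: bytes) -> str:
    """Decode a text frame body: one encoding byte, then the encoded text."""
    codec = ID3V2_TEXT_ENCODINGS[body[0]]
    text = body[1:].decode(codec)
    # Strip an optional trailing NUL terminator.
    return text.rstrip("\x00")
```

For example, a body of `$01 $FE $FF` followed by big-endian UTF-16 text decodes to the same string as the little-endian form with BOM `$FF $FE`.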

Maybe the Wikipedia information is incorrect.

And the MP3 player in my car showed "---------------" instead of English song names until I switched to "-C 1".

This player has problems with Russian tags, I admit, but there should be no problems with plain English song names at least.

Comment 3 Anton Yuzhaninov 2018-01-30 01:56:51 UTC

I've looked at my mp3 file written by mp3splt in hd(1) - unfortunately I haven't found a more usable tool which can show all relevant information.

In a hex dump I can see:
1. mp3splt writes ID3 V2.4.0 tags
2. text frames contain encoding description byte $01 (UTF-16 with BOM) followed by BOM $feff (big endian) and then followed by text encoded in UTF-16 BE.
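The two observations above can also be checked without reading a hex dump by hand. The following is a minimal Python sketch, not part of mp3splt or libid3tag; the function names are hypothetical. It assumes an ID3v2.3 or ID3v2.4 tag at the start of the data with no extended header, no unsynchronisation and no compressed or encrypted frames.

```python
def synchsafe(b: bytes) -> int:
    """Decode a 4-byte synchsafe integer (7 significant bits per byte)."""
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]


def text_frame_encodings(data: bytes):
    """Yield (frame_id, encoding_byte, bom_hex) for each text frame.

    Simplified: handles only ID3v2.3/2.4 tags without an extended header.
    """
    if data[:3] != b"ID3" or data[3] not in (3, 4):
        return
    major = data[3]
    end = 10 + synchsafe(data[6:10])  # tag size excludes the 10-byte header
    pos = 10
    while pos + 10 <= end:
        if data[pos] == 0:  # reached padding
            break
        frame_id = data[pos:pos + 4]
        raw_size = data[pos + 4:pos + 8]
        # Frame sizes are synchsafe in v2.4 and plain big-endian in v2.3.
        size = synchsafe(raw_size) if major == 4 else int.from_bytes(raw_size, "big")
        body = data[pos + 10:pos + 10 + size]
        if frame_id[:1] == b"T" and frame_id != b"TXXX" and body:
            # Only encoding $01 carries a BOM right after the encoding byte.
            bom = body[1:3] if body[0] == 0x01 else b""
            yield frame_id.decode("ascii"), body[0], bom.hex()
        pos += 10 + size
```

Passing the contents of an MP3 file (read in binary mode) to `text_frame_encodings` would list entries such as `("TIT2", 1, "feff")` for a tag like the one described above.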

So as far as I can see, this ID3 tag conforms to the ID3 specification. The only issue I can see is that mp3splt always writes TIT2, TPE1, TALB, TCON frames in UTF-16 encoding, even when the text can be represented as ISO-8859-1 or even ASCII.

Probably your hardware player doesn't support UTF-16/UCS-2 and expects all strings to be in ISO-8859-1.

For better compatibility with incomplete player implementations we can check if text can be encoded as ISO-8859-1 and use $00 text description byte (ISO-8859-1) with text in ISO-8859-1. But I'm not sure where it is better to implement this auto-detection: in mp3splt or in libid3tag (library used by mp3splt for writing tags). Unfortunately libid3tag is also not actively developed - last release was in 2004.
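The auto-detection idea described above could look roughly like the following Python sketch. It is only an illustration of the logic, not code from mp3splt or libid3tag; `choose_text_encoding` is a hypothetical name, and the fallback to UTF-16 with a big-endian BOM simply mirrors what was seen in the hex dump.

```python
from typing import Tuple


def choose_text_encoding(text: str) -> Tuple[int, bytes]:
    """Return (encoding_byte, encoded_text), preferring ISO-8859-1.

    ISO-8859-1 ($00) is used whenever the text can be encoded without
    loss; otherwise fall back to UTF-16 with BOM ($01).
    """
    try:
        return 0x00, text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Big-endian BOM followed by big-endian UTF-16 text.
        return 0x01, b"\xfe\xff" + text.encode("utf-16-be")
```

With this approach plain English titles would be written with the $00 encoding byte, while Cyrillic titles would still be written as UTF-16.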

I think -C 8 (utf-8) works for this player only by accident - ASCII text in UTF-8 and ISO-8859-1 are the same sequence of bytes. Changing the default to -C 8 is a bad idea - there are implementations (players) which support UTF-16 (present since V2.2.0), but not UTF-8 (added only in V2.4.0).
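The byte-level claim about ASCII text can be illustrated with a short Python sketch (the function name is hypothetical). It only compares the encoded bytes and says nothing about what a particular player actually supports.

```python
def same_in_utf8_and_latin1(text: str) -> bool:
    """True if the text encodes to identical bytes in UTF-8 and ISO-8859-1."""
    try:
        return text.encode("utf-8") == text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Text outside ISO-8859-1 (e.g. Cyrillic) cannot be compared at all.
        return False
```

Pure ASCII titles return True here. Titles with accented Latin letters or Cyrillic return False, so they let one distinguish a player that really decodes UTF-8 from one that merely treats the bytes as ISO-8859-1.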

Comment 4 Victor Sudakov 2018-02-05 02:48:41 UTC

(In reply to Anton Yuzhaninov from comment #3)

> I think -C 8 (utf-8) works for this player only by accident - ASCII text in UTF-8 and ISO-8859-1 are the same sequence of bytes. 

No, this is not the case. I have found out that the player shows *Cyrillic* song names correctly provided they are ID3v2 tags in UTF-8 encoding, so this:

mp3splt -a -T 2 -C 8 -o @b/@t -c concert.cue concert.mp3

works fine if "concert.cue" is in Russian and in UTF-8.

Comment 5 Victor Sudakov 2018-02-05 02:49:40 UTC

https://victor-sudakov.dreamwidth.org/434082.html

Comment 6 Victor Sudakov 2018-02-05 06:14:10 UTC

Have you checked what ID3Fixer https://play.google.com/store/apps/details?id=com.yschi.ID3Fixer thinks of the tags created by mp3splt?

Comment 7 Anton Yuzhaninov 2018-03-03 01:44:03 UTC

I don't see any error messages from ID3Fixer, though it corrupts some (why not all?) files when the output charset is UTF-16. It looks like an ID3Fixer bug.

As far as I can see from the hex dump, mp3splt writes correct ID3 tags.

Using utf-8 may be a good workaround for the given player, but it is not a sensible default: utf-8 was introduced only in id3v2.4.0 and there are many players which support only id3v2.3.0 (if I am not mistaken, even Windows 7 supports only id3v2.3.0).

Comment 8 Victor Sudakov 2018-03-03 04:53:38 UTC

(In reply to Anton Yuzhaninov from comment #7)
Wikipedia says that the default charset before the introduction of ID3v2.4 should have been UCS-2, not UTF-16. Could that be the reason for the partial corruption? UTF-16 and UCS-2 are similar but not quite the same, aren't they?

Comment 9 Anton Yuzhaninov 2018-03-03 16:24:12 UTC

(In reply to vas from comment #8)
The text encoding description byte 0x01 means UCS-2 (with BOM) in id3v2.3.0 and UTF-16 with BOM in id3v2.4.0.

For the Basic Multilingual Plane (the first 65536 code points) UTF-16 and UCS-2 have the same byte representation. Symbols beyond the BMP can't be represented in UCS-2 and hence in id3v2.3.0, but all popular languages are covered by the BMP.

If some implementation supports only id3v2.3.0/UCS-2 then symbols outside BMP (e.g. Unicode emoji) from id3v2.4.0 tags will be corrupted, but other symbols should be decoded correctly. But I have no files with symbols outside BMP to test it.
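The difference between UCS-2 and UTF-16 described above can be demonstrated with a small Python sketch. The helper names are hypothetical, and `ucs2_units` only imitates a decoder that treats every 16-bit unit as a complete character; it is not taken from any real player.

```python
from typing import List


def fits_ucs2(text: str) -> bool:
    """True if every character is inside the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def ucs2_units(data: bytes) -> List[int]:
    """Split big-endian data into 16-bit units, as a UCS-2-only reader would."""
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# A BMP character is one 16-bit unit in both UCS-2 and UTF-16.
bmp_units = ucs2_units("Я".encode("utf-16-be"))      # [0x042F]

# A character outside the BMP becomes a surrogate pair in UTF-16;
# a UCS-2-only reader sees two units that are not valid characters.
emoji_units = ucs2_units("\U0001F600".encode("utf-16-be"))  # [0xD83D, 0xDE00]
```

So Cyrillic titles are byte-for-byte identical in both encodings, and only characters such as emoji would be affected by a UCS-2-only implementation.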

Comment 10 Victor Sudakov 2018-03-04 09:41:43 UTC

(In reply to Anton Yuzhaninov from comment #9)
Anton, 

This has been a very useful and educational conversation for me. I think we can close this PR, but this was not in vain.

Comment 11 Vsevolod Stakhov 2018-07-10 09:57:53 UTC

Agreed by the reporter and the maintainer.