Bug 215393

Summary: devel/boost-libs: bad encoding conversion with base iconv()
Product: Ports & Packages Reporter: mp39590
Component: Individual Port(s) Assignee: FreeBSD Office Team <office>
Status: Closed FIXED    
Severity: Affects Only Me CC: tablooaraz, tijl
Priority: --- Flags: bugzilla: maintainer-feedback? (office)
Version: Latest   
Hardware: Any   
OS: Any   
Description Flags
test case
linux debian behaviour
openbsd behaviour
freebsd default behaviour (iconv and icu, package installation)
freebsd behaviour (icu only)
Prefer ICU (workaround)
boost.locale patch
boost.locale patch2 none

Description mp39590 2016-12-18 20:16:31 UTC
Created attachment 178070 [details]
test case

By default, boost-libs is compiled with ICONV and ICU support. In my tests, this configuration gives incorrect results when converting from UTF-8 to an arbitrary code page, compared with other OSes.

I've attached screenshots and the test program I use. The test program takes a predefined UTF-8 string and uses a Python script to convert it from UTF-8 into the Latin-5 and TIS620.2533-0 (Thai) code pages, then uses Boost to do the same. To compile it, simply type 'make' (or 'bmake' on Linux); the Boost libraries and python2 are required.
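
The Python side of that comparison can be approximated as follows. This is a minimal sketch, not the attached script: it is written for Python 3 (the attachment targets python2), the helper name `convert_lossy` is hypothetical, and dropping unencodable characters with `errors='ignore'` is an assumption. Python's `tis-620` codec stands in for TIS620.2533-0.

```python
# -*- coding: utf-8 -*-
# Sketch of a reference (Python) conversion; NOT the attached test case.
test_string = 'TestéäöòДΘĝصדķћ๛ネİ'


def convert_lossy(text, encoding):
    """Encode text, silently dropping characters the target code page lacks.

    Hypothetical helper; errors='ignore' is an assumed policy.
    """
    return text.encode(encoding, errors='ignore')


latin5 = convert_lossy(test_string, 'iso-8859-9')  # Latin-5
thai = convert_lossy(test_string, 'tis-620')       # stand-in for TIS620.2533-0

print('iso-8859-9:', latin5)
print('tis-620:', thai)
```

With this policy only the characters that exist in the target code page survive, so a correct Boost conversion would be expected to keep at least the same characters.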

I've run this program on OpenBSD current, Debian stable and FreeBSD 11; screenshots are attached.

If I compile boost-libs with ICU *only*, I get better results: they are the same as on the other two OSes, except for the TIS620.2533-0 code page, which raises an exception about an unknown encoding.

Should such an inconsistency between Boost.Locale in FreeBSD's boost-libs and other OSes be expected?
Comment 1 mp39590 2016-12-18 20:17:31 UTC
Created attachment 178071 [details]
linux debian behaviour
Comment 2 mp39590 2016-12-18 20:17:55 UTC
Created attachment 178072 [details]
openbsd behaviour
Comment 3 mp39590 2016-12-18 20:18:26 UTC
Created attachment 178073 [details]
freebsd default behaviour (iconv and icu, package installation)
Comment 4 mp39590 2016-12-18 20:19:01 UTC
Created attachment 178074 [details]
freebsd behaviour (icu only)
Comment 5 Jan Beich freebsd_committer 2016-12-18 23:21:49 UTC
Comment on attachment 178070 [details]
test case

> # -*- coding: utf-8 -*-
> test_string = u'TestéäöòДΘĝصדķћ๛ネİ'

Can you make sure the string is valid Thai in UTF-8 before testing? Try pasting anything from https://th.wikipedia.org/
Comment 6 mp39590 2016-12-18 23:32:38 UTC
I don't understand. Should I repeat the test without any English or other foreign letters, using only the Thai alphabet?

I was expecting to be able to convert from any code page into any other without checking whether the string is valid for the destination encoding (at the risk of losing characters, of course).

At least '๛' in the string is a valid Thai character.
Comment 7 Jan Beich freebsd_committer 2016-12-18 23:45:19 UTC
Nevermind, I misunderstood comment 0. FreeBSD 9.3 uses GNU libiconv and doesn't appear to be affected.
Comment 8 mp39590 2016-12-20 18:47:04 UTC
I can confirm that 9.3 isn't affected.

Can we change the default options of the boost-libs package to ICU only?
Comment 9 mp39590 2017-02-21 15:16:05 UTC
Comment 10 Jan Beich freebsd_committer 2017-09-25 07:02:11 UTC
Created attachment 186691 [details]
Prefer ICU (workaround)

ICU doesn't support TIS620.2533-0, so it'd fall back to iconv().

$ ./utf8 | vis -o
UTF-8: Test\303\251\303\244\303\266\303\262\320\224\316\230\304\235\327\223\330\265\304\267\321\233\340\271\233\357\276\210\304\260
iso-8859-9 (boost): Test\351\344\366\362\335
iso-8859-9 (python): Test\351\344\366\362\335
TIS620.2533-0 (boost): Test'e"a"o`o??^g??k?\373?I
TIS620.2533-0 (python): Test\373
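
To read the octal dumps above, the escapes can be turned back into bytes and decoded with the matching codec. A minimal sketch, assuming only `\ooo` octal escapes and `\\` for a literal backslash appear in the `vis -o` output; the helper name `unvis_octal` is hypothetical, and Python's `tis-620` codec stands in for TIS620.2533-0.

```python
import re

# Matches either an escaped backslash or a three-digit octal escape (000-377).
_ESCAPE = re.compile(r'\\(\\|[0-3][0-7]{2})')


def unvis_octal(dump):
    """Turn a `vis -o` style dump into raw bytes (hypothetical helper)."""
    def repl(match):
        token = match.group(1)
        if token == '\\':
            return '\\'
        return chr(int(token, 8))
    # Latin-1 maps code points 0-255 one-to-one onto byte values.
    return _ESCAPE.sub(repl, dump).encode('latin-1')


# The two "(python)" lines from the output above.
latin5_text = unvis_octal(r'Test\351\344\366\362\335').decode('iso-8859-9')
thai_text = unvis_octal(r'Test\373').decode('tis-620')
```

Here `latin5_text` is 'Testéäöòİ' and `thai_text` is 'Test๛', i.e. only the characters of the test string that exist in each target code page.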
Comment 11 Jan Beich freebsd_committer 2017-09-25 07:13:06 UTC
Tijl, do you think libc and/or Boost can be fixed to skip invalid sequences?

$ echo 'Test\303\251\303\244\303\266\303\262\320\224\316\230\304\235\327\223\330\265\304\267\321\233\340\271\233\357\276\210\304\260' | unvis | /usr/bin/iconv -t iso-8859-9 2>/dev/null | vis -o

$ echo 'Test\303\251\303\244\303\266\303\262\320\224\316\230\304\235\327\223\330\265\304\267\321\233\340\271\233\357\276\210\304\260' | unvis | /usr/local/bin/iconv -t iso-8859-9 2>/dev/null | vis -o
Comment 12 Tijl Coosemans freebsd_committer 2017-09-25 21:20:44 UTC
Created attachment 186721 [details]
boost.locale patch

(In reply to Jan Beich from comment #11)
If the input buffer is valid UTF-8 then POSIX says this:
> If iconv() encounters a character in the input buffer that is valid, but for
> which an identical character does not exist in the target codeset, iconv()
> shall perform an implementation-defined conversion on this character.

By default our iconv either replaces such characters with "?" or transliterates them (e.g. "ĝ" becomes "^g").  GNU iconv returns an error in this case, which I believe is not POSIX compliant.
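
The difference between the two policies can be illustrated with Python's codec error handlers. This is only an analogy, not iconv itself: `errors='strict'` plays the role of a converter that fails on a character missing from the target codeset, and `errors='replace'` that of one substituting "?". Transliteration such as "^g" has no built-in Python counterpart and is not shown. The helper name `try_encode` is hypothetical.

```python
def try_encode(text, encoding, errors):
    """Return (bytes, None) on success or (None, error) on failure.

    Hypothetical helper used only to compare error policies.
    """
    try:
        return text.encode(encoding, errors=errors), None
    except UnicodeEncodeError as exc:
        return None, exc


sample = 'Test\u011d'  # U+011D (g with circumflex) does not exist in ISO-8859-9

# Fail-on-unconvertible policy (the behaviour attributed to GNU iconv above).
strict_result, strict_error = try_encode(sample, 'iso-8859-9', 'strict')

# Substitute-"?" policy (one of the default behaviours described above).
replaced, _ = try_encode(sample, 'iso-8859-9', 'replace')
```

In the strict case no output is produced and `strict_error.start` points at the offending character (index 4); in the replace case the result is `b'Test?'`.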

The problem reported in this bug is in Boost itself.  Their use of __ICONV_F_HIDE_INVALID on FreeBSD does not give the desired behaviour.  Please try the attached patch.  Make sure your ports tree is at least r450634.
Comment 13 Tijl Coosemans freebsd_committer 2017-09-26 12:05:49 UTC
Created attachment 186738 [details]
boost.locale patch2

Cleaned up version that is probably more acceptable upstream.
Comment 14 vali gholami 2017-11-26 20:48:13 UTC
Comment 15 commit-hook freebsd_committer 2019-01-16 20:37:50 UTC
A commit references this bug:

Author: tijl
Date: Wed Jan 16 20:36:48 UTC 2019
New revision: 490518
URL: https://svnweb.freebsd.org/changeset/ports/490518

  Fix use of iconv in Boost Locale.  On FreeBSD it used __ICONV_F_HIDE_INVALID
  which hides invalid sequences, but what Boost really wants is that iconv
  returns an error on invalid sequences like GNU libiconv does by default.
  On FreeBSD ICONV_SET_ILSEQ_INVALID can be used for this.  It has to be set
  via iconvctl.

  PR:		215393