215393 – devel/boost-libs: bad encoding conversion with base iconv()

Status:	Closed FIXED

Alias:	None

Product:	Ports & Packages
Classification:	Unclassified
Component:	Individual Port(s)
Version:	Latest
Hardware:	Any Any

Importance:	--- Affects Only Me
Assignee:	FreeBSD Office Team

URL:
Keywords:

Depends on:
Blocks:

Reported:	2016-12-18 20:16 UTC by mp39590
Modified:	2019-01-16 20:39 UTC
CC List:	2 users

See Also:

Flags:	bugzilla: maintainer-feedback? (office)

Attachments
test case (1.02 KB, application/x-bzip) 2016-12-18 20:16 UTC, mp39590
linux debian behaviour (4.83 KB, image/png) 2016-12-18 20:17 UTC, mp39590
openbsd behaviour (4.95 KB, image/png) 2016-12-18 20:17 UTC, mp39590
freebsd default behaviour (iconv and icu, package installation) (5.73 KB, image/png) 2016-12-18 20:18 UTC, mp39590
freebsd behaviour (icu only) (3.51 KB, image/png) 2016-12-18 20:19 UTC, mp39590
Prefer ICU (workaround) (3.93 KB, patch) 2017-09-25 07:02 UTC, Jan Beich
boost.locale patch (5.36 KB, patch) 2017-09-25 21:20 UTC, Tijl Coosemans
boost.locale patch2 (5.39 KB, patch) 2017-09-26 12:05 UTC, Tijl Coosemans

Description mp39590 2016-12-18 20:16:31 UTC

Created attachment 178070 [details]
test case

By default, boost-libs is compiled with ICONV and ICU support. In my tests, this configuration gives bad results when converting from UTF-8 to an arbitrary codepage, compared to other OSes.

I've attached screenshots and the test program I use. The test program takes a predefined UTF-8 string and uses a Python script to convert it from UTF-8 into the Latin-5 and TIS620.2533-0 (Thai) codepages, then uses Boost to do the same. To compile it, simply type 'make' (or 'bmake' on Linux); the Boost libraries and python2 are required.

I've run this program on OpenBSD-current, Debian stable and FreeBSD 11; screenshots are attached.

If I compile boost-libs with ICU *only*, I get better results: they are the same as on the other two OSes, except for the TIS620.2533-0 codepage, which raises an exception about an unknown encoding.

Should such an inconsistency between FreeBSD's boost-libs locale behaviour and that of other OSes be expected?

Comment 1 mp39590 2016-12-18 20:17:31 UTC

Created attachment 178071 [details]
linux debian behaviour

Comment 2 mp39590 2016-12-18 20:17:55 UTC

Created attachment 178072 [details]
openbsd behaviour

Comment 3 mp39590 2016-12-18 20:18:26 UTC

Created attachment 178073 [details]
freebsd default behaviour (iconv and icu, package installation)

Comment 4 mp39590 2016-12-18 20:19:01 UTC

Created attachment 178074 [details]
freebsd behaviour (icu only)

Comment 5 Jan Beich freebsd_committer 2016-12-18 23:21:49 UTC

Comment on attachment 178070 [details]
test case

utf8/encode.py:
> # -*- coding: utf-8 -*-
[...]
> test_string = u'TestéäöòДΘĝصדķћ๛ﾈİ'

Can you make sure the string is valid Thai in UTF-8 before testing? Try pasting anything from https://th.wikipedia.org/

Comment 6 mp39590 2016-12-18 23:32:38 UTC

I don't understand: should I repeat the test without any English/foreign letters, using only the Thai alphabet?

I was expecting to be able to convert from any codepage into any other without first checking whether the string is valid for the destination encoding (at the risk of losing characters, of course).

At least '๛' inside the string is valid Thai.

Comment 7 Jan Beich freebsd_committer 2016-12-18 23:45:19 UTC

Never mind, I misunderstood comment 0. FreeBSD 9.3 uses GNU libiconv and doesn't appear to be affected.

Comment 8 mp39590 2016-12-20 18:47:04 UTC

I can confirm that 9.3 isn't affected.

Can we change the default options of the boost-libs package to be ICU only?

Comment 9 mp39590 2017-02-21 15:16:05 UTC

ping?

Comment 10 Jan Beich freebsd_committer 2017-09-25 07:02:11 UTC

Created attachment 186691 [details]
Prefer ICU (workaround)

ICU doesn't support TIS620.2533-0, so it'd fall back to iconv().

$ ./utf8 | vis -o
UTF-8: Test\303\251\303\244\303\266\303\262\320\224\316\230\304\235\327\223\330\265\304\267\321\233\340\271\233\357\276\210\304\260
iso-8859-9 (boost): Test\351\344\366\362\335
iso-8859-9 (python): Test\351\344\366\362\335
TIS620.2533-0 (boost): Test'e"a"o`o??^g??k?\373?I
TIS620.2533-0 (python): Test\373
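
The two "(python)" lines above can be reproduced with a short sketch like the one below. It is not the attached encode.py: it assumes the script drops unrepresentable characters (errors="ignore"), and it assumes Python's "tis-620" codec is an acceptable stand-in for TIS620.2533-0. The input bytes are copied from the UTF-8 line of the vis(1) output.

```python
# UTF-8 bytes of the test string, copied from the vis(1) output above.
UTF8_BYTES = (
    b"Test\303\251\303\244\303\266\303\262\320\224\316\230\304\235"
    b"\327\223\330\265\304\267\321\233\340\271\233\357\276\210\304\260"
)

text = UTF8_BYTES.decode("utf-8")

# Assumption: characters missing from the target codepage are silently dropped.
latin5 = text.encode("iso-8859-9", errors="ignore")
# Assumption: Python's "tis-620" codec stands in for TIS620.2533-0.
thai = text.encode("tis-620", errors="ignore")

print(latin5)  # b'Test\xe9\xe4\xf6\xf2\xdd'  (octal: Test\351\344\366\362\335)
print(thai)    # b'Test\xfb'                  (octal: Test\373)
```

With these assumptions the Latin-5 result matches the "iso-8859-9 (boost)" line, while the "TIS620.2533-0 (boost)" line differs because it contains transliterations and "?" replacements.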

Comment 11 Jan Beich freebsd_committer 2017-09-25 07:13:06 UTC

Tijl, do you think libc and/or boost can be fixed to skip invalid sequences?

$ echo 'Test\303\251\303\244\303\266\303\262\320\224\316\230\304\235\327\223\330\265\304\267\321\233\340\271\233\357\276\210\304\260' | unvis | /usr/bin/iconv -t iso-8859-9 2>/dev/null | vis -o
Test\351\344\366\362??^g??k???\335


$ echo 'Test\303\251\303\244\303\266\303\262\320\224\316\230\304\235\327\223\330\265\304\267\321\233\340\271\233\357\276\210\304\260' | unvis | /usr/local/bin/iconv -t iso-8859-9 2>/dev/null | vis -o
Test\351\344\366\362

Comment 12 Tijl Coosemans freebsd_committer 2017-09-25 21:20:44 UTC

Created attachment 186721 [details]
boost.locale patch

(In reply to Jan Beich from comment #11)
If the input buffer is valid UTF-8 then POSIX says this:
> If iconv() encounters a character in the input buffer that is valid, but for
> which an identical character does not exist in the target codeset, iconv()
> shall perform an implementation-defined conversion on this character.

By default our iconv either replaces such characters with "?" or transliterates them (e.g. "ĝ" becomes "^g").  GNU iconv returns an error in this case, which I believe is not POSIX compliant.
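
The three policies for a valid but unrepresentable character (return an error, substitute "?", or drop it) can be illustrated with Python's codec error handlers. This is only an analogy, not iconv itself: Python has no built-in transliteration, so nothing here produces "^g", and the short sample string is chosen for illustration.

```python
# "Testéĝİ": é and İ exist in ISO-8859-9, ĝ does not.
text = "Test\u00e9\u011d\u0130"

# Policy 1: report an error (comparable to GNU iconv returning an error).
try:
    text.encode("iso-8859-9")  # errors="strict" is the default
    strict_result = "converted"
except UnicodeEncodeError as err:
    strict_result = "error at index %d" % err.start

# Policy 2: substitute "?" (comparable to the base iconv default when it
# does not transliterate).
replaced = text.encode("iso-8859-9", errors="replace")

# Policy 3: drop the character (what the Python reference output shows).
dropped = text.encode("iso-8859-9", errors="ignore")

print(strict_result)  # error at index 5
print(replaced)       # b'Test\xe9?\xdd'
print(dropped)        # b'Test\xe9\xdd'
```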

The problem reported in this bug is in Boost itself.  Their use of __ICONV_F_HIDE_INVALID on FreeBSD does not give the desired behaviour.  Please try the attached patch.  Make sure your ports tree is at least r450634.

Comment 13 Tijl Coosemans freebsd_committer 2017-09-26 12:05:49 UTC

Created attachment 186738 [details]
boost.locale patch2

Cleaned up version that is probably more acceptable upstream.

Comment 15 commit-hook freebsd_committer 2019-01-16 20:37:50 UTC

A commit references this bug:

Author: tijl
Date: Wed Jan 16 20:36:48 UTC 2019
New revision: 490518
URL: https://svnweb.freebsd.org/changeset/ports/490518

Log:
  Fix use of iconv in Boost Locale.  On FreeBSD it used __ICONV_F_HIDE_INVALID
  which hides invalid sequences, but what Boost really wants is that iconv
  returns an error on invalid sequences like GNU libiconv does by default.
  On FreeBSD ICONV_SET_ILSEQ_INVALID can be used for this.  It has to be set
  via iconvctl.

  PR:		215393

Changes:
  head/devel/boost-libs/Makefile
  head/devel/boost-libs/files/patch-libs_locale_src_encoding_iconv_codepage.ipp
  head/devel/boost-libs/files/patch-libs_locale_src_posix_codecvt.cpp
  head/devel/boost-libs/files/patch-libs_locale_src_util_iconv.hpp
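
The design described in the commit log, where the converter reports an error on an unconvertible character and the caller decides how to continue, can be sketched in Python as follows. This is an analogy only and not the Boost.Locale or iconvctl code; encode_skipping is a hypothetical helper name, and the input bytes are the test string from comment 10.

```python
def encode_skipping(text, codec):
    """Encode text strictly; on each reported error, skip the offending
    characters and resume after them (hypothetical helper)."""
    out = bytearray()
    pos = 0
    while pos < len(text):
        try:
            # Strict conversion of the remainder: raises on the first
            # character the target codepage cannot represent.
            out += text[pos:].encode(codec)
            break
        except UnicodeEncodeError as err:
            # err.start and err.end are offsets into text[pos:].
            out += text[pos:pos + err.start].encode(codec)
            pos += err.end  # the caller chooses to skip, not to abort
    return bytes(out)


text = (
    b"Test\303\251\303\244\303\266\303\262\320\224\316\230\304\235"
    b"\327\223\330\265\304\267\321\233\340\271\233\357\276\210\304\260"
).decode("utf-8")

print(encode_skipping(text, "iso-8859-9"))  # b'Test\xe9\xe4\xf6\xf2\xdd'
```

Because the skipping happens in the caller, the same strict converter could also back a policy that stops at the first error.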