Created attachment 178070 [details]
By default boost-libs is compiled with ICONV and ICU support. In my tests this configuration gives worse results than on other OSes when converting from UTF-8 to an arbitrary codepage.
I've attached screenshots and the test program I use. The test program takes a predefined UTF-8 string and uses a Python script to convert it from UTF-8 into the Latin-5 and TIS620.2533-0 (Thai) codepages, then uses Boost to do the same. To compile it, simply type 'make' (or 'bmake' on Linux); the Boost libraries and python2 are required.
I've run this program on OpenBSD current, Debian stable and FreeBSD 11; screenshots are attached.
If I compile boost-libs with ICU *only*, I get better results: they are the same as on the other two OSes, except for the TIS620.2533-0 codepage, which raises an exception about an unknown encoding.
Should such an inconsistency between FreeBSD's boost-libs locale support and other OSes be expected?
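For readers without the attachment, the Python half of the test can be approximated as below. This is only a sketch, not the attached script: it assumes Python 3, assumes unmappable characters are silently dropped (errors="ignore"), and uses Python's "tis_620" codec as a stand-in for TIS620.2533-0. The helper name convert() is made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch of the reference conversion (assumptions: Python 3,
# errors="ignore", and "tis_620" standing in for TIS620.2533-0).
test_string = 'TestéäöòДΘĝصדķћ๛ﾈİ'


def convert(text, codec):
    """Encode text into codec, dropping characters the codec lacks."""
    return text.encode(codec, errors="ignore")


latin5 = convert(test_string, "iso8859_9")  # Latin-5
thai = convert(test_string, "tis_620")      # Thai

print("iso-8859-9:", latin5)
print("TIS-620:   ", thai)
```

Under these assumptions only the characters that exist in the target codepage survive, which is the behaviour the Boost conversion is being compared against.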
Created attachment 178071 [details]
linux debian behaviour
Created attachment 178072 [details]
Created attachment 178073 [details]
freebsd default behaviour (iconv and icu, package installation)
Created attachment 178074 [details]
freebsd behaviour (icu only)
Comment on attachment 178070 [details]
> # -*- coding: utf-8 -*-
> test_string = u'TestéäöòДΘĝصדķћ๛ﾈİ'
Can you make sure the string is valid Thai in UTF-8 before testing? Try pasting anything from https://th.wikipedia.org/
I don't understand. Do you mean I should repeat the test without any English or other foreign letters, using only the Thai alphabet?
I was expecting to be able to convert from any codepage into any other codepage without checking whether the string is valid for the destination encoding (at the risk of losing characters, of course).
At least the '๛' in the string is valid Thai.
Never mind, I misunderstood comment 0. FreeBSD 9.3 uses GNU libiconv and doesn't appear to be affected.
I can confirm that 9.3 isn't affected.
Can we change the default options of the boost-libs package to ICU only?
Created attachment 186691 [details]
Prefer ICU (workaround)
ICU doesn't support TIS620.2533-0, so it'd fall back to iconv().
$ ./utf8 | vis -o
iso-8859-9 (boost): Test\351\344\366\362\335
iso-8859-9 (python): Test\351\344\366\362\335
TIS620.2533-0 (boost): Test'e"a"o`o??^g??k?\373?I
TIS620.2533-0 (python): Test\373
Tijl, do you think libc and/or Boost can be fixed to skip invalid sequences?
$ echo 'Test\303\251\303\244\303\266\303\262\320\224\316\230\304\235\327\223\330\265\304\267\321\233\340\271\233\357\276\210\304\260' | unvis | /usr/bin/iconv -t iso-8859-9 2>/dev/null | vis -o
$ echo 'Test\303\251\303\244\303\266\303\262\320\224\316\230\304\235\327\223\330\265\304\267\321\233\340\271\233\357\276\210\304\260' | unvis | /usr/local/bin/iconv -t iso-8859-9 2>/dev/null | vis -o
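The octal escapes in the two commands above are just the UTF-8 bytes of the test string from comment 0. A small Python 3 sketch (not part of the original report) makes that explicit by decoding them and listing the code points:

```python
# The vis(1)-style octal escapes, written as a Python bytes literal.
raw = (b"Test\303\251\303\244\303\266\303\262\320\224\316\230\304\235"
       b"\327\223\330\265\304\267\321\233\340\271\233\357\276\210\304\260")

# Decode as UTF-8; this succeeds, so the input is valid UTF-8.
text = raw.decode("utf-8")

# List the code points of everything after the ASCII prefix "Test".
codepoints = ["U+%04X" % ord(c) for c in text[4:]]
print(len(text), codepoints)
```

Since the input decodes cleanly, any difference between /usr/bin/iconv and /usr/local/bin/iconv comes from how each handles characters missing from iso-8859-9, not from malformed input.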
Created attachment 186721 [details]
(In reply to Jan Beich from comment #11)
If the input buffer is valid UTF-8 then POSIX says this:
> If iconv() encounters a character in the input buffer that is valid, but for
> which an identical character does not exist in the target codeset, iconv()
> shall perform an implementation-defined conversion on this character.
By default our iconv either replaces such characters with "?" or transliterates them (e.g. "ĝ" becomes "^g"). GNU iconv returns an error in this case, which I believe is not POSIX compliant.
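The three policies mentioned here (substitute "?", drop the character, or report an error) can be illustrated by analogy with Python's codec error handlers. This is only an analogy: it shows Python 3 codecs, not iconv(), and Python has no built-in equivalent of the "^g" style transliteration.

```python
# "ĝ" (U+011D) is valid Unicode but has no ISO-8859-9 equivalent.
ch = "\u011d"

# Policy 1: substitute a placeholder, like the base iconv's "?".
replaced = ch.encode("iso8859_9", errors="replace")

# Policy 2: silently drop the character.
dropped = ch.encode("iso8859_9", errors="ignore")

# Policy 3: refuse and report an error, like GNU iconv by default.
try:
    ch.encode("iso8859_9", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(replaced, dropped, strict_failed)
```

A caller that expects the third policy but gets the first will see "?" in its output instead of an error it can act on.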
The problem reported in this bug is in Boost itself. Their use of __ICONV_F_HIDE_INVALID on FreeBSD does not give the desired behaviour. Please try the attached patch. Make sure your ports tree is at least r450634.
Created attachment 186738 [details]
Cleaned up version that is probably more acceptable upstream.
A commit references this bug:
Date: Wed Jan 16 20:36:48 UTC 2019
New revision: 490518
Fix use of iconv in Boost Locale. On FreeBSD it used __ICONV_F_HIDE_INVALID
which hides invalid sequences, but what Boost really wants is that iconv
returns an error on invalid sequences like GNU libiconv does by default.
On FreeBSD ICONV_SET_ILSEQ_INVALID can be used for this. It has to be set