Bug 216372

Summary: [patch] devel/icu: turn on the same workaround as for Linux to fix incorrect detection of the UTF-8 locale in some applications
Product: Ports & Packages
Component: Individual Port(s)
Status: Closed FIXED
Severity: Affects Many People
Priority: ---
Version: Latest
Hardware: Any
OS: Any
Reporter: Vladimir Druzenko <vvd>
Assignee: Matthew Rezny <rezny>
CC: henry.hu.sh, rezny, tcberner
Keywords: patch
Flags: tcberner: maintainer-feedback-
Attachments:
Description                                      Flags
devel/icu/files/patch-common_putil.cpp           none
convert ASCII to UTF-8 outside C/POSIX locale    none

Description Vladimir Druzenko 2017-01-22 15:35:37 UTC
Affected applications include, for example, net-im/qTox and gwenview5.

More information is here:
http://bugs.icu-project.org/trac/ticket/12886
https://bugreports.qt.io/browse/QTBUG-57522
https://github.com/qTox/qTox/issues/4012#issuecomment-273962027

This patch was tested with net-im/qTox:

--- common/putil.cpp.orig   2016-10-19 17:20:56 UTC
+++ common/putil.cpp
@@ -1813,6 +1813,31 @@
         /* Remap CP949 to a similar codepage to avoid issues with backslash and won symbol. */
         name = "EUC-KR";
     }
+    if (locale != NULL && uprv_strcmp(name, "euc") == 0) {
+        /* Linux underspecifies the "EUC" name. */
+        if (uprv_strcmp(locale, "korean") == 0) {
+            name = "EUC-KR";
+        }
+        else if (uprv_strcmp(locale, "japanese") == 0) {
+            /* See comment below about eucJP */
+            name = "eucjis";
+        }
+    }
+    else if (uprv_strcmp(name, "eucjp") == 0) {
+        /*
+        ibm-1350 is the best match, but unavailable.
+        ibm-954 is mostly a superset of ibm-1350.
+        ibm-33722 is the default for eucJP (similar to Windows).
+        */
+        name = "eucjis";
+    }
+    else if (locale != NULL && uprv_strcmp(locale, "en_US_POSIX") != 0 &&
+            (uprv_strcmp(name, "ANSI_X3.4-1968") == 0 || uprv_strcmp(name, "US-ASCII") == 0)) {
+        /*
+         * For non C/POSIX locale, default the code page to UTF-8 instead of US-ASCII.
+         */
+        name = "UTF-8";
+    }
 #elif U_PLATFORM == U_PF_HPUX
     if (locale != NULL && uprv_strcmp(locale, "zh_HK") == 0 && uprv_strcmp(name, "big5") == 0) {
         /* HP decided to extend big5 as hkbig5 even though it's not compatible :-( */
@@ -1942,7 +1967,7 @@
        nl_langinfo may use the same buffer as setlocale. */
     {
         const char *codeset = nl_langinfo(U_NL_LANGINFO_CODESET);
-#if U_PLATFORM_IS_DARWIN_BASED || U_PLATFORM_IS_LINUX_BASED
+#if U_PLATFORM_IS_DARWIN_BASED || U_PLATFORM_IS_LINUX_BASED || U_PLATFORM == U_PF_BSD
         /*
          * On Linux and MacOSX, ensure that default codepage for non C/POSIX locale is UTF-8
          * instead of ASCII.
Comment 1 Tobias C. Berner 2017-01-22 22:14:26 UTC
Could you please add the patch as a proper attachment :) 

I can confirm that this fixes the issue shown here, where gwenview refuses to open files with non-ASCII names:
https://people.freebsd.org/~tcberner/icu_problem.png
Comment 2 Vladimir Druzenko 2017-01-22 22:21:02 UTC
Created attachment 179228
devel/icu/files/patch-common_putil.cpp
Comment 3 Tobias C. Berner 2017-03-18 22:43:36 UTC
Any input from office@ on this?
Comment 4 Matthew Rezny 2017-03-24 10:19:05 UTC
Created attachment 181127
convert ASCII to UTF-8 outside C/POSIX locale

It is obvious we should be handling the ASCII case like Linux and OS X. However, I do not think it wise to copy the Linux section wholesale, as there may be unintended consequences to changing the handling of Korean and Japanese. Instead, I have taken the approach of making BSD behave the same as Darwin. The handling of CP949 was already identical, and Darwin was already handling the ASCII->UTF-8 case, so we can just tack onto the #if instead of copying code. I have verified that this change corrects the issue observed in qTox.
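
For readers who want the ASCII->UTF-8 rule in isolation, here is a minimal sketch of the condition from the patch in the Description. It is not ICU code: remap_ascii_codeset is a hypothetical name, plain strcmp stands in for uprv_strcmp, and the locale argument is assumed to be an ICU-style locale ID in which "en_US_POSIX" represents the C/POSIX locale, as the comparison in the patch suggests.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not part of ICU): returns "UTF-8" when the reported
 * codeset is one of the ASCII names and the locale is not C/POSIX,
 * otherwise returns the codeset name unchanged.
 */
const char *remap_ascii_codeset(const char *locale, const char *name)
{
    assert(name != NULL);

    if (locale != NULL &&
        strcmp(locale, "en_US_POSIX") != 0 &&
        (strcmp(name, "ANSI_X3.4-1968") == 0 ||
         strcmp(name, "US-ASCII") == 0)) {
        /* Non C/POSIX locale that reports plain ASCII: prefer UTF-8. */
        return "UTF-8";
    }

    /* Everything else, including the EUC and CP949 cases, is untouched. */
    return name;
}
```

Only the two ASCII spellings are remapped, which is why this approach leaves the Korean and Japanese handling alone.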
Comment 5 commit-hook 2017-04-07 22:06:48 UTC
A commit references this bug:

Author: rezny
Date: Fri Apr  7 22:06:08 UTC 2017
New revision: 437961
URL: https://svnweb.freebsd.org/changeset/ports/437961

Log:
  Behave the same on BSDs as on Darwin: UTF-8 shall be used instead of
  ASCII outside the POSIX 'C' locale, and UTF-8 is the default in case
  anything should call ucnv_getDefaultName() prior to calling setlocale().
  This change fixes problems that occur in multiple Qt5 applications when
  handling files with names containing non-ASCII characters.

  PR:		216372
  Reported by:	vvd@unislabs.com
  Approved by:	bapt (office@), swills (mentor)
  Differential Revision:	https://reviews.freebsd.org/D10128

Changes:
  head/devel/icu/Makefile
  head/devel/icu/files/patch-common_putil.cpp
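
The commit log's remark about calling ucnv_getDefaultName() before setlocale() can be illustrated without ICU. The sketch below uses only the POSIX calls nl_langinfo() and setlocale(); the helper names current_codeset and show_codesets are hypothetical. A C program starts in the "C" locale, so a codeset query made before setlocale(LC_ALL, "") typically reports an ASCII name regardless of what the user's environment requests. The exact spelling of that name varies by platform, so none is assumed here.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: copy the current locale's codeset name into buf. */
void current_codeset(char *buf, size_t len)
{
    const char *codeset = nl_langinfo(CODESET);

    assert(len > 0);
    /* Copy immediately: nl_langinfo may reuse its buffer on later calls. */
    snprintf(buf, len, "%s", codeset != NULL ? codeset : "");
}

/* Print the codeset before and after adopting the environment's locale. */
void show_codesets(void)
{
    char before[64];
    char after[64];

    current_codeset(before, sizeof before);   /* still in the "C" locale */
    setlocale(LC_ALL, "");                     /* honor LANG / LC_* settings */
    current_codeset(after, sizeof after);

    printf("before setlocale: %s\n", before);
    printf("after setlocale:  %s\n", after);
}
```

Calling show_codesets() from main() in a UTF-8 environment is expected to show the two values differing, which is the situation in which a library that queries the codeset too early ends up treating file names as ASCII.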