Bug 269127 - devel/icu: Multibyte character is included in DateTimePatterns for en locale in release 72
Status: Closed Overcome By Events
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: FreeBSD Office Team
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-01-24 03:45 UTC by Tatsuki Makino
Modified: 2023-06-14 07:25 UTC

See Also:
bugzilla: maintainer-feedback? (office)


Attachments
Ports only for use in overlays, etc. (3.26 KB, patch)
2023-02-21 09:06 UTC, Tatsuki Makino
Experimental patch for devel/icu (1.94 KB, patch)
2023-03-04 04:49 UTC, Tatsuki Makino
experimental patch for devel/icu (781 bytes, patch)
2023-03-04 05:42 UTC, Tatsuki Makino

Description Tatsuki Makino 2023-01-24 03:45:23 UTC
For example, the following JavaScript code may produce unintended results.

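(function () {
	var i, d=[], s=[];
	d[0] = new Date(0);	// the Unix epoch
	s[0] = d[0].toLocaleString("en-US");	// format with the en-US locale data
	d[1] = new Date(s[0]);	// parse the formatted string back
	console.log(d[0], s[0], d[1]);
	// Dump each character with its code unit in hex to expose non-ASCII characters.
	for (i = 0; i < s[0].length; ++i) {
		console.log(s[0].charAt(i), s[0].charCodeAt(i).toString(16));
	}
})();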

d[1] is expected to be the same as d[0], but is "Invalid Date" in icu-dependent web browsers (firefox-esr-102.7.0,1, chromium-109.0.5414.74 and... seamonkey-2.49.4_27 :) ).
The reason for this is that the string returned by toLocaleString contains U+202F (NARROW NO-BREAK SPACE), which the date parser does not accept.

One problem with this is that the en and en-* locales are widely assumed to contain no multibyte characters, especially in regions whose own languages use multibyte characters (e.g. Japan :) ).
That assumption is why developers there choose this method.
In fact, there are sites that display "Invalid Date" because of this.
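
A minimal sketch of what an affected site could do as a stopgap, assuming it insists on parsing the formatted string: replace U+202F with an ordinary space first. The helper name normalizeSpaces is hypothetical and the sample string is hard-coded in the shape described above, so the result does not depend on the installed ICU version. Parsing a non-ISO string with new Date() remains implementation-defined either way.

```javascript
// Hypothetical helper (not part of any library): map U+202F to an ASCII space.
function normalizeSpaces(str) {
	return str.replace(/\u202F/g, " ");
}

// Hard-coded sample in the shape produced for en-US with the new pattern.
var formatted = "1/1/1970, 12:00:00\u202FAM";
var cleaned = normalizeSpaces(formatted);
var parsed = new Date(cleaned);	// legacy, engine-specific parsing

console.log(JSON.stringify(cleaned));
console.log(isNaN(parsed.getTime()) ? "Invalid Date" : parsed.toISOString());
```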

Another problem is that the behavior differs from browsers that do not use ICU.
As far as I have tried, Windows10+ChromeEdge and Android+Edge return a locale string without multibyte characters, which works as expected.

I think the port's distribution file already ships a prebuilt database containing the relevant data, but the source of that data is here:
https://github.com/unicode-org/icu/blob/bb0e745e25c99cc57055caf45c81b95ef63b25d4/icu4c/source/data/locales/en.txt

What should it be?
Comment 1 Tatsuki Makino 2023-02-21 09:06:29 UTC
Created attachment 240299
Ports only for use in overlays, etc.

This creates an icudt*.dat in which some multibyte characters in the en locale are replaced.
Replacing ${LOCALBASE}/share/icu/72.1/icudt72l.dat with it eliminates the above problem.

For example, it is easier to see the weather forecast at different times of the day and the extent of rainfall :)
Comment 2 Charlie Li freebsd_committer freebsd_triage 2023-02-21 19:26:34 UTC
Has this finding been reported to ICU upstream?

(I actually just hit this "Invalid Date" problem myself)
Comment 3 Tatsuki Makino 2023-02-21 20:36:37 UTC
(In reply to Charlie Li from comment #2)

No, I have not yet done that at all.

As for my thoughts on this issue...
For people who actually use the en, en-US, or en-* locales this is not a problem; the output is correct for them.
It also does not seem to be a problem on the Linux side, which appears to use ICU in the same way. At least it is not a problem on Android (with the MS Edge browser).
This is more of a problem on the website authors' side. Feeding the time string produced by the newer toLocaleString feature into the old-fashioned Date.parse function is a strange approach. Therefore, shouldn't we be reporting this to the websites where the problem occurs?
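
To illustrate the point about the website side, here is a minimal sketch of a round trip that does not depend on locale data at all: keep the locale string for display only and carry the value as an ISO 8601 string (or the epoch milliseconds). The variable names are only for illustration.

```javascript
var original = new Date(0);
var display = original.toLocaleString("en-US");	// for humans only; never parse this
var iso = original.toISOString();	// machine-readable, locale-independent
var restored = new Date(iso);	// ISO 8601 parsing is specified by ECMAScript

console.log(display);
console.log(iso, restored.getTime() === original.getTime());
```

Date.prototype.toISOString and parsing of its output are defined by the ECMAScript specification, so this works the same whether or not the browser uses ICU.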

Anyway, I posted here because it didn't seem to be much of an issue on the Linux side, but has Linux outside of Android disappeared? :)
Comment 4 Charlie Li freebsd_committer freebsd_triage 2023-02-21 20:40:02 UTC
Some application consumers like Mozilla bundle libraries like ICU, which may not be the latest version. I've been hitting this with the en locales myself.
Comment 5 Tatsuki Makino 2023-02-22 03:31:06 UTC
For reference, the space character was changed to a multibyte character by this CLDR commit:
https://github.com/unicode-org/cldr/commit/a83026ab8c8fa6ed88f1047c4d0c6089f88b7e5d

This is the commit where the change was picked up by ICU:
https://github.com/unicode-org/icu/commit/64b35481263ac4df37a28a9c549553ecc9710db2
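
To check which separator a given runtime actually uses, a small sketch with Intl.DateTimeFormat.prototype.formatToParts can print the literal that precedes the AM/PM marker. The option values are my own choice for illustration, and the output depends on the ICU data the runtime is using, so no particular result is claimed here.

```javascript
var parts = new Intl.DateTimeFormat("en-US", {
	hour: "numeric", minute: "2-digit", hour12: true, timeZone: "UTC"
}).formatToParts(new Date(0));

// Find the part immediately before the dayPeriod (AM/PM) part.
var sep = null, k;
for (k = 1; k < parts.length; ++k) {
	if (parts[k].type === "dayPeriod") {
		sep = parts[k - 1];
		break;
	}
}

if (sep !== null) {
	// "20" means an ASCII space, "202f" means NARROW NO-BREAK SPACE.
	console.log(sep.type, sep.value.charCodeAt(0).toString(16));
}
```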
Comment 6 Jan Beich freebsd_committer freebsd_triage 2023-02-23 06:46:03 UTC
(In reply to Charlie Li from comment #4)
Chromium 110 and Firefox 110 bundle ICU 72.
Comment 7 Tatsuki Makino 2023-03-04 04:49:09 UTC
Created attachment 240577
Experimental patch for devel/icu

It does not use the bundled icudt72l.dat but rebuilds it instead.
This can be toggled by an option.
Note that it may not be usable as is in a big-endian environment in the first place.
Comment 8 Tatsuki Makino 2023-03-04 05:42:04 UTC
Created attachment 240578
experimental patch for devel/icu

It just builds ICU so that the environment variable ICU_DATA can be used.
Running as
env ICU_DATA=/usr/local/share/icudt seamonkey
will use different data, such as the data built by the port in attachment 240299.
This may carry a risk similar to that of LD_PRELOAD.

It seems that the browser has built in a behavior to convert whitespace characters, but would this be the case if we were to take action here?
Comment 9 Charlie Li freebsd_committer freebsd_triage 2023-03-13 18:22:58 UTC
The problem seems to have disappeared for me starting with Firefox 110; 109 exhibited the issue.
Comment 10 Tatsuki Makino 2023-03-14 23:49:16 UTC
I think chromium also stopped having the problem as of chromium-110.

I don't know what kind of fix it is, but it may be that a bug like the one below was embedded on a dare :)

                                                     Mar/14/2023 10:49
PM
Comment 11 Tatsuki Makino 2023-06-14 07:25:38 UTC
ICU 73.2 seems to have changed this behavior for compatibility reasons.

https://github.com/unicode-org/icu/releases/tag/release-73-2