Bug 272293 - The mbrtoc32 and mbrtoc16 functions don't recognize the same multibyte sequences as mbrtowc
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: misc
Version: 13.2-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-standards (Nobody)
Reported: 2023-06-30 13:58 UTC by bruno
Modified: 2023-07-05 14:36 UTC

Attachments
test case foo.c (968 bytes, text/plain)
2023-06-30 13:58 UTC, bruno
Test program using iconv directly (936 bytes, text/plain)
2023-07-02 14:18 UTC, John F. Carr

Description bruno 2023-06-30 13:58:00 UTC
Created attachment 243081: test case foo.c

It is clear from ISO C 23 (description of mbrtowc: § 7.31.6.3.2, description of mbrtoc32: § 7.30.1.5, description of mbrtoc16: § 7.30.1.3) that the notion of a valid multibyte character is independent of which of these functions a program uses. When a multibyte character is valid according to one of these functions, it should be valid according to the other two as well.

This is not the case in FreeBSD 13.2.

Test case:
=============================== foo.c ============================
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <uchar.h>
int main ()
{
  if (setlocale (LC_ALL, "zh_CN.GB18030") != NULL)
    {
      mbstate_t state;
      wchar_t wc = (wchar_t) 0xBADFACE;
      memset (&state, '\0', sizeof (mbstate_t));
      if (mbrtowc (&wc, "\224\071\375\067", 4, &state) == 4)
        {
          printf ("mbrtowc return value = 4\n");
          {
            char32_t c32 = (char32_t) 0xBADFACE;
            memset (&state, '\0', sizeof (mbstate_t));
            size_t ret = mbrtoc32 (&c32, "\224\071\375\067", 4, &state);
            printf ("mbrtoc32 return value = %d\n", (int) ret);
          }
          {
            char16_t c16 = (char16_t) 0xBADFACE;
            memset (&state, '\0', sizeof (mbstate_t));
            size_t ret = mbrtoc16 (&c16, "\224\071\375\067", 4, &state);
            printf ("mbrtoc16 return value = %d\n", (int) ret);
          }
        }
    }
}
==========================================================================
$ cc -Wall foo.c
$ ./a.out

Expected result (e.g. as seen on glibc 2.35):
mbrtowc return value = 4
mbrtoc32 return value = 4
mbrtoc16 return value = 4

Actual result:
mbrtowc return value = 4
mbrtoc32 return value = -2
mbrtoc16 return value = -2

I think I have also seen this effect with encodings other than GB18030, but the test case above uses GB18030.
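
To check other encodings the same way, the test case can be generalized into a small helper. This is only a sketch: decoders_agree is a hypothetical name, not part of the attached foo.c, and it compares only the return value of the first call to each function, not the decoded characters.

```c
#include <locale.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper, not part of the attached foo.c: decode the same
   bytes with mbrtowc, mbrtoc32 and mbrtoc16, each starting from a fresh
   conversion state.
   Returns 1 if the three return values agree, 0 if they differ,
   and -1 if the requested locale is not available.  */
static int decoders_agree (const char *locale, const char *bytes, size_t len)
{
  mbstate_t state;
  wchar_t wc;
  char32_t c32;
  char16_t c16;

  if (setlocale (LC_ALL, locale) == NULL)
    return -1;

  memset (&state, '\0', sizeof (mbstate_t));
  size_t ret_wc = mbrtowc (&wc, bytes, len, &state);

  memset (&state, '\0', sizeof (mbstate_t));
  size_t ret_32 = mbrtoc32 (&c32, bytes, len, &state);

  memset (&state, '\0', sizeof (mbstate_t));
  size_t ret_16 = mbrtoc16 (&c16, bytes, len, &state);

  return ret_wc == ret_32 && ret_32 == ret_16;
}
```

Going by the results reported above, decoders_agree ("zh_CN.GB18030", "\224\071\375\067", 4) would return 0 on FreeBSD 13.2, whereas a system that behaves like glibc 2.35 would return 1.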
Comment 1 Yuri Pankov freebsd_committer freebsd_triage 2023-07-01 07:46:05 UTC
This has something to do with the iconv-based implementation of the mbrtocXX functions: a libc built with WITHOUT_ICONV defined returns correct results.
Comment 2 bruno 2023-07-01 08:15:24 UTC
(In reply to Yuri Pankov from comment #1)
> This has something to do with iconv implementation of mbrtocXX functions

Yes, the libc on my system has these:

$ ar t /usr/lib/libc.a | grep mbrtoc
mbrtoc16_iconv.o
mbrtoc32_iconv.o
Comment 3 John F. Carr 2023-07-02 14:18:33 UTC
Created attachment 243134: Test program using iconv directly

I wrote a test program that reproduces this behavior by calling iconv directly.  It gets an error with the libc iconv in FreeBSD 13.2 but produces plausible output with the ports version and with the iconv in macOS 13.2.1.

The input encoding is unfamiliar to me.  I observe that the output value doesn't fit in 16 bits and I have seen bugs before in code that was only tested with the common characters that fit in 16 bits.
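
For context on the 16-bit remark: a code point above U+FFFF has to be represented in UTF-16 as a surrogate pair, which is the case mbrtoc16 has to handle for this input. The sketch below shows only the standard UTF-16 arithmetic; to_surrogates is a hypothetical helper name and is not taken from the attached test program or from libc.

```c
#include <stdint.h>

/* Hypothetical helper: split a code point above U+FFFF into its UTF-16
   surrogate pair.
   Returns 0 on success, -1 if cp is outside U+10000..U+10FFFF.  */
static int to_surrogates (uint32_t cp, uint16_t *high, uint16_t *low)
{
  if (cp < 0x10000 || cp > 0x10FFFF)
    return -1;

  uint32_t v = cp - 0x10000;                  /* 20-bit offset */
  *high = (uint16_t) (0xD800 + (v >> 10));    /* top 10 bits */
  *low  = (uint16_t) (0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
  return 0;
}
```

For U+1F60B, the character named in comment 4, this yields the pair 0xD83D 0xDE0B.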
Comment 4 John F. Carr 2023-07-02 16:17:10 UTC
FreeBSD claims to support GB 18030-2000.  See gb18030(5) and lib/libc/locale/gb18030.c.  I believe character U+1F60B only exists in newer versions of GB 18030.  It is one of those newfangled emojis.  Wikipedia says GB 18030-2005 and -2022 increase coverage of Unicode.

On one hand the system is working to spec.  On the other hand it would be nice to support the latest standard.
Comment 5 bruno 2023-07-05 14:35:38 UTC
(In reply to John F. Carr from comment #4)
> I believe character U+1F60B only exists in newer versions of GB 18030.

Nope. GB18030 was, from the beginning, designed as a character set that covers all valid Unicode code points.
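
To illustrate: the four-byte sequences from 0x90 0x30 0x81 0x30 through 0xE3 0x32 0x9A 0x35 map linearly onto U+10000..U+10FFFF, so no lookup table is needed for the supplementary planes. The sketch below is a stand-alone illustration of that arithmetic, not code from FreeBSD's gb18030.c; the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: map a GB18030 four-byte sequence in the
   supplementary-plane range to its Unicode code point.
   Returns the code point, or -1 if the bytes are outside that range.  */
static long gb18030_supplementary (const unsigned char b[4])
{
  /* Byte ranges: b[0] 0x90..0xE3, b[1] and b[3] 0x30..0x39,
     b[2] 0x81..0xFE.  */
  if (b[0] < 0x90 || b[0] > 0xE3
      || b[1] < 0x30 || b[1] > 0x39
      || b[2] < 0x81 || b[2] > 0xFE
      || b[3] < 0x30 || b[3] > 0x39)
    return -1;

  /* Mixed-radix number with radices 10, 126 and 10 for the last three
     bytes.  */
  long index = (((b[0] - 0x90) * 10L + (b[1] - 0x30)) * 126
                + (b[2] - 0x81)) * 10 + (b[3] - 0x30);

  if (index > 0xFFFFF)   /* past U+10FFFF */
    return -1;
  return 0x10000 + index;
}
```

For the bytes in the test case, "\224\071\375\067" (0x94 0x39 0xFD 0x37), the index is 62987 = 0xF60B, giving U+1F60B.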

> FreeBSD claims to support GB 18030-2000.

The first printing of GB18030 in 2000 was buggy. The second printing, also from 2000, was correct. The versions from 2005 and 2022 reshuffled a couple of mappings, without changing the basic structure.
Comment 6 bruno 2023-07-05 14:36:42 UTC
I have registered the GB18030 mapping problem as a separate bug: bug #272384.