Bug 114578 - [libc] wide character printing using swprintf(dst, n, "%ls", txt) fails depending on LC_CTYPE
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: Unspecified
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: freebsd-bugs (Nobody)
Reported: 2007-07-14 13:10 UTC by Christoph Mallon
Modified: 2018-05-20 23:53 UTC
Description Christoph Mallon freebsd_committer freebsd_triage 2007-07-14 13:10:02 UTC
Copying a string using swprintf() with the format specifier "%ls" (or "%S")
fails if the string to be copied contains characters that the currently
set LC_CTYPE aspect of the locale does not support.

The test program below should just copy the wide character string "Mir"
(in Cyrillic letters) to an array of wide characters using swprintf().
When the LC_CTYPE aspect of the locale is set to "C" (other encodings
such as ISO8859-15 fail, too), the call fails and returns -1. When the
LC_CTYPE aspect of the locale is set to UTF-8 (or, probably, any other
encoding that supports full Unicode), the call succeeds and returns 3 as
expected.

I wonder if this behaviour is correct, because no encoding conversions
should be involved here. I could not find anything about conversions in
the ANSI C99 standard (§7.24.2.1 clause 8 bullet "s"), either. Only
conversions if the format is "%s" are mentioned, which is logical.

Other implementations (glibc and Windows libc) copy the string correctly
when LC_CTYPE is set to "C".

I just discovered that it already fails if the format string itself
contains characters from a range that the current LC_CTYPE does not support.

Fix:

I didn't dive into the inner workings of *printf(), sorry.

How-To-Repeat:

Here is a simple test program. It should (imo) print "3" twice, once for
each call that copies three characters. Instead it prints "-1" and "3".

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

static const wchar_t txt[] = { 0x41C, 0x43D, 0x440, 0 }; // "Mir" in cyrillic

int main(void)
{
  wchar_t str[4];
  int ret;

  setlocale(LC_CTYPE, "C");
  ret = swprintf(str, sizeof(str) / sizeof(*str), L"%ls", txt);
  printf("%d\n", ret);

  setlocale(LC_CTYPE, "UTF-8");
  ret = swprintf(str, sizeof(str) / sizeof(*str), L"%ls", txt);
  printf("%d\n", ret);

  return 0;
}
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2007-07-14 18:20:41 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-standards

I'm going to guess this is a standards issue.
Comment 2 Andrey A. Chernov freebsd_committer freebsd_triage 2007-07-14 18:48:21 UTC
State Changed
From-To: open->closed

This code works as the standards intend.
You can see that the errno comes from swprintf() (more precisely, from
wcrtomb()) and that it is EILSEQ (Illegal byte sequence).
This is because the "C" locale is 8 bits wide, so it does not contain wide
chars outside the 0 .. UCHAR_MAX range (and doesn't know how to convert them
either), so any attempt to convert them fails with EILSEQ.
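
The failure described above can be observed in isolation. The following is a minimal sketch, assuming an implementation whose "C" locale is single-byte; probe_wcrtomb() is a hypothetical helper written for this illustration, not a libc function.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not part of libc): try to convert one wide
 * character to multibyte form under the given LC_CTYPE locale.
 * Returns 0 on success, the errno value left by wcrtomb() on failure,
 * or -1 if the locale itself cannot be selected.
 */
static int probe_wcrtomb(const char *locale_name, wchar_t wc)
{
  char buf[MB_LEN_MAX];
  mbstate_t state;

  assert(locale_name != NULL);
  if (setlocale(LC_CTYPE, locale_name) == NULL)
    return -1;

  memset(&state, 0, sizeof(state)); /* start in the initial shift state */
  errno = 0;
  if (wcrtomb(buf, wc, &state) == (size_t)-1)
    return errno;
  return 0;
}
```

Under that assumption, probe_wcrtomb("C", 0x41C) reports EILSEQ, while an ASCII character such as L'A' converts without error.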
Comment 3 Andrey A. Chernov freebsd_committer freebsd_triage 2007-07-15 08:26:49 UTC
State Changed
From-To: closed->open

POSIX mentions the fputwc() requirement only for fwprintf() and wprintf(),
not for swprintf(), so the fputwc()-mbsrtowcs() back-and-forth conversion
we currently have as a result of the pseudo-file stdio hook is not needed
in a direct implementation (which skips the whole multibyte part).
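
To make the back-and-forth conversion concrete, here is a rough sketch of a copy that goes through the locale's multibyte encoding. It is an illustration only and not the FreeBSD source: round_trip_copy() is a hypothetical name, and it uses wcsrtombs() for the forward step instead of a per-character fputwc().

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Illustration only, not the FreeBSD source: copy src to dst by
 * converting to the locale's multibyte encoding and back. Returns the
 * number of wide characters stored (excluding the terminator), or -1
 * on a conversion error or if the result does not fit in n elements.
 */
static int round_trip_copy(wchar_t *dst, size_t n, const wchar_t *src)
{
  mbstate_t state;
  const wchar_t *wsrc = src;
  const char *msrc;
  char *mb;
  size_t mbsize, wlen;

  assert(dst != NULL && src != NULL && n > 0);

  /* Forth: wide -> multibyte. Fails (EILSEQ) if the current locale
     cannot represent one of the characters. */
  mbsize = wcslen(src) * MB_LEN_MAX + 1;
  mb = malloc(mbsize);
  if (mb == NULL)
    return -1;
  memset(&state, 0, sizeof(state));
  if (wcsrtombs(mb, &wsrc, mbsize, &state) == (size_t)-1) {
    free(mb);
    return -1;
  }

  /* Back: multibyte -> wide. */
  msrc = mb;
  memset(&state, 0, sizeof(state));
  wlen = mbsrtowcs(dst, &msrc, n, &state);
  free(mb);
  if (wlen == (size_t)-1)
    return -1;
  if (wlen >= n) {            /* no room left for the terminator */
    dst[n - 1] = L'\0';
    return -1;
  }
  return (int)wlen;
}
```

With LC_CTYPE set to "C", an ASCII string survives the round trip, but the Cyrillic test string from the report already fails in the forward step, which is the multibyte part a direct implementation would skip.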
Comment 4 Christoph Mallon 2007-07-15 08:58:18 UTC
Here is a simplified example:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

static const wchar_t txt[] = { 0x41C, 0x43D, 0x440, 0 }; // "Mir" in cyrillic

int main(void)
{
   wchar_t str[4];
   int ret;

   setlocale(LC_CTYPE, "C");
   ret = swprintf(str, sizeof(str) / sizeof(*str), txt);
   printf("%d\n", ret);

   return 0;
}

Only a format string is used here. The call to swprintf() fails here,
too, and -1 is returned. The POSIX standard (and ANSI C99, too, though
with slightly different wording) says this: "The format is composed of
zero or more directives: ordinary wide-characters, which are simply 
copied to the output stream" (from 
http://www.opengroup.org/onlinepubs/009695399/functions/swprintf.html , 
section DESCRIPTION, second clause). So even copying the ordinary 
wide-characters from the format string fails.
Comment 5 David Schultz freebsd_committer freebsd_triage 2008-01-15 09:25:34 UTC
fputwc(3) has similar language about copying the character to the
output stream, but POSIX still says it can fail with EILSEQ if the
wide character doesn't exist in the current locale.

This isn't my area of expertise, but the present behavior seems
correct. If the current locale doesn't support a given wide
character, we should not invent a multibyte character sequence for
it, because the other end of the stream may not even be able to
interpret it.
Comment 6 Christoph Mallon 2008-09-29 09:01:20 UTC
> fputwc(3) has similar language about copying the character to the
> output stream, but POSIX still says it can fail with EILSEQ if the
> wide character doesn't exist in the current locale.

fputwc() is entirely different from swprintf(): fputwc() writes to a
stream, swprintf() writes to an array of wchar_t.

> This isn't my area of expertise, but the present behavior seems
> correct.

No, it isn't.

> If the current locale doesn't support a given wide
> character, we should not invent a multibyte character sequence for
> it, because the other end of the stream may not even be able to
> interpret it.

The format string of swprintf() is of type wchar_t and the destination
buffer of swprintf() is of type wchar_t. So there are absolutely no
locale conversions involved and no multibyte sequences have to be
invented, as you suggested. All that should happen is that the wchar_ts
are copied from the source to the destination, with no conversions
involved at all. The standard, which I quoted already, is quite clear in
this respect. The current implementation, which internally converts from
wchar_t to the current multibyte locale encoding and back to wchar_t, is
just an implementation hack, which breaks if the current locale cannot
represent full Unicode.
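
The conversion-free copy argued for here can be sketched as follows. copy_wide_direct() is a hypothetical helper, not an existing libc function and not a proposed patch; it only shows that a wchar_t-to-wchar_t copy needs no reference to the locale.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper, not an existing libc function: copy src into
 * dst (capacity n wide characters) without any locale conversion.
 * Follows swprintf()'s return convention: the number of wide
 * characters written, excluding the terminator, or -1 if the result
 * does not fit.
 */
static int copy_wide_direct(wchar_t *dst, size_t n, const wchar_t *src)
{
  size_t len;

  assert(dst != NULL && src != NULL);
  len = wcslen(src);
  if (n == 0 || len >= n)
    return -1;
  wmemcpy(dst, src, len + 1); /* len characters plus the terminating null */
  return (int)len;
}
```

Because nothing here consults LC_CTYPE, the Cyrillic test string is copied and 3 is returned regardless of the locale setting.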
Comment 7 wollman 2008-09-29 16:44:27 UTC
<<On Mon, 29 Sep 2008 08:10:03 GMT, Christoph Mallon <christoph.mallon@gmx.de> said:

>> fputwc(3) has similar language about copying the character to the
>> output stream, but POSIX still says it can fail with EILSEQ if the
>> wide character doesn't exist in the current locale.
 
>  fputwc() is entirely different from swprintf(): fputwc() writes to a
>  stream, swprintf() writes to an array of wchar_t.
 
>> This isn't my area of expertise, but the present behavior seems
>> correct.
 
>  No, it isn't.

The Standard is clear:

	In addition, all forms of fwprintf() may fail if:

	[EILSEQ]	A wide-character code that does not correspond
			to a valid character has been detected.

(IEEE Std.1003.1-2001 page 471, line 15515)

You may wish that it was implemented differently, but that doesn't
mean that the current implementation is wrong.

-GAWollman
Comment 8 David Schultz freebsd_committer freebsd_triage 2009-02-07 14:04:44 UTC
State Changed
From-To: open->suspended

suspended awaiting patches 


Comment 9 David Schultz freebsd_committer freebsd_triage 2009-02-07 14:04:44 UTC
Class Changed
From-To: sw-bug->change-request

The present implementation of swprintf() could be much better, and the 
submitter is rightly offended, but this doesn't seem to be a bug. 


Comment 10 David Schultz freebsd_committer freebsd_triage 2009-02-07 14:04:44 UTC
Responsible Changed
From-To: freebsd-standards->freebsd-bugs

Reclassify this as a general bug. Although swprintf's behavior of 
converting from wide characters to multibyte representations and back 
again appears not to be a standards violation, a better implementation 
would avoid the extra work.
Comment 11 Eitan Adler freebsd_committer freebsd_triage 2018-05-20 23:53:09 UTC
For bugs matching the following conditions:
- Status == In Progress
- Assignee == "bugs@FreeBSD.org"
- Last Modified Year <= 2017

Do
- Set Status to "Open"