Copying a string using swprintf() and the format specifier "%ls" (or "%S") fails if the string to be copied contains characters that the currently set LC_CTYPE aspect of the locale does not support.

The test program below should just copy the wide character string "Mir" (in Cyrillic letters) to an array of wide characters using swprintf(). When the LC_CTYPE aspect of the locale is set to "C" (other encodings like ISO8859-15 fail, too), this call fails and -1 is returned. When the LC_CTYPE aspect of the locale is set to UTF-8 (or probably other encodings that support full Unicode), the call succeeds and returns 3 as expected.

I wonder if this behaviour is correct, because no encoding conversions should be involved here. I could not find anything about conversions in the ANSI C99 standard (§7.24.2.1 clause 8 bullet "s"), either. Only conversions if the format is "%s" are mentioned, which is logical. Other implementations (glibc and Windows libc) copy the string correctly when LC_CTYPE is set to "C".

I just discovered that it already fails if the format string itself contains characters from a range that the current LC_CTYPE does not support.

Fix:
I didn't dive into the inner workings of *printf(), sorry.

How-To-Repeat:
Here is a simple test program. It should (imo) print "3" twice, once for each call that copies three characters. It prints "-1" and "3" though.

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    static const wchar_t txt[] = { 0x41C, 0x43D, 0x440, 0 }; // "Mir" in cyrillic

    int main(void)
    {
        wchar_t str[4];
        int ret;

        setlocale(LC_CTYPE, "C");
        ret = swprintf(str, sizeof(str) / sizeof(*str), L"%ls", txt);
        printf("%d\n", ret);

        setlocale(LC_CTYPE, "UTF-8");
        ret = swprintf(str, sizeof(str) / sizeof(*str), L"%ls", txt);
        printf("%d\n", ret);

        return 0;
    }
Responsible Changed From-To: freebsd-bugs->freebsd-standards I'm going to guess this is a standards issue.
State Changed From-To: open->closed This code works as required by the standards. You can see that swprintf() sets errno (more precisely, wcrtomb() sets it) and that the value is EILSEQ (Illegal byte sequence). This is because the "C" locale is 8 bits wide, so it does not contain wide characters outside the 0 .. UCHAR_MAX range (and does not know how to convert them either), so any attempt to convert them fails with EILSEQ.
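The wcrtomb() failure described above can be probed in isolation. The following is a minimal sketch, not code from the FreeBSD sources: wide_char_fits_locale() is a hypothetical helper name, and the exact set of rejected characters depends on the C library and on which locales are installed.

```c
#include <errno.h>
#include <limits.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not part of any libc): switch LC_CTYPE to
 * locale_name and ask wcrtomb() whether wc has a multibyte form there.
 * Returns 1 if the conversion succeeds, 0 if wcrtomb() fails with
 * EILSEQ, and -1 if the locale cannot be set or another error occurs.
 */
int wide_char_fits_locale(const char *locale_name, wchar_t wc)
{
    char buf[MB_LEN_MAX];   /* large enough for any single character */
    mbstate_t state;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    memset(&state, 0, sizeof(state));   /* initial conversion state */
    errno = 0;
    if (wcrtomb(buf, wc, &state) != (size_t)-1)
        return 1;
    return errno == EILSEQ ? 0 : -1;
}
```

Calling wide_char_fits_locale("C", 0x41C) exercises the same conversion step that the state change above blames for the -1 return from swprintf().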
State Changed From-To: closed->open POSIX mentions the fputwc() requirement only for fwprintf() and wprintf(), not for swprintf(), so the fputwc()-mbsrtowcs() back-and-forth conversion that we currently have as a result of the pseudo-file stdio hook is not needed in a direct implementation (which skips the whole multibyte part).
Here is a simplified example:

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    static const wchar_t txt[] = { 0x41C, 0x43D, 0x440, 0 }; // "Mir" in cyrillic

    int main(void)
    {
        wchar_t str[4];
        int ret;

        setlocale(LC_CTYPE, "C");
        ret = swprintf(str, sizeof(str) / sizeof(*str), txt);
        printf("%d\n", ret);

        return 0;
    }

Only a format string is used here. The call to swprintf() fails here, too, and -1 is returned. The POSIX standard (and ANSI C99, too, though with slightly different wording) says this: "The format is composed of zero or more directives: ordinary wide-characters, which are simply copied to the output stream" (from http://www.opengroup.org/onlinepubs/009695399/functions/swprintf.html , section DESCRIPTION, second clause). So even copying the ordinary wide-characters from the format string fails.
fputwc(3) has similar language about copying the character to the output stream, but POSIX still says it can fail with EILSEQ if the wide character doesn't exist in the current locale. This isn't my area of expertise, but the present behavior seems correct. If the current locale doesn't support a given wide character, we should not invent a multibyte character sequence for it, because the other end of the stream may not even be able to interpret it.
> fputwc(3) has similar language about copying the character to the
> output stream, but POSIX still says it can fail with EILSEQ if the
> wide character doesn't exist in the current locale.

fputwc() is entirely different from swprintf(): fputwc() writes to a stream, swprintf() writes to an array of wchar_t.

> This isn't my area of expertise, but the present behavior seems
> correct.

No, it isn't.

> If the current locale doesn't support a given wide
> character, we should not invent a multibyte character sequence for
> it, because the other end of the stream may not even be able to
> interpret it.

The format string of swprintf() is of type wchar_t and the destination buffer of swprintf() is of type wchar_t. So there are absolutely no locale conversions involved and no multibyte sequences have to be invented, as you suggested. All that should happen is copying the wchar_ts from the source to the destination with no conversions involved at all. The standard, which I quoted already, is quite clear in this respect. The current implementation, which internally converts from wchar_t to the current multibyte locale encoding and back to wchar_t, is just an implementation hack, which breaks if the current locale cannot represent full Unicode.
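The conversion-free copy argued for above can be sketched as follows. This is only an illustration under stated assumptions: copy_wide() is a hypothetical helper, not FreeBSD's swprintf() and not a proposed patch, and it covers only the plain-copy case (no format directives). Its return convention is modelled on swprintf(): the number of wide characters written, excluding the terminator, or -1 if the result does not fit.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: copy the wide string src into dst, which has
 * room for n wide characters including the terminating null.
 * No multibyte conversion takes place, so the current LC_CTYPE
 * setting has no influence on the result.
 */
int copy_wide(wchar_t *dst, size_t n, const wchar_t *src)
{
    size_t len;

    assert(dst != NULL && src != NULL);

    len = wcslen(src);
    if (len >= n || len > (size_t)INT_MAX)
        return -1;                  /* does not fit, terminator included */

    wmemcpy(dst, src, len + 1);     /* copy the terminating null as well */
    return (int)len;
}
```

Because the helper only uses wcslen() and wmemcpy(), the three code points from the report's test string are copied unchanged regardless of which locale is active.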
<<On Mon, 29 Sep 2008 08:10:03 GMT, Christoph Mallon <christoph.mallon@gmx.de> said:

>> fputwc(3) has similar language about copying the character to the
>> output stream, but POSIX still says it can fail with EILSEQ if the
>> wide character doesn't exist in the current locale.

> fputwc() is entirely different from swprintf(): fputwc() writes to a
> stream, swprintf() writes to an array of wchar_t.

>> This isn't my area of expertise, but the present behavior seems
>> correct.

> No, it isn't.

The Standard is clear:

    In addition, all forms of fwprintf() may fail if:

    [EILSEQ] A wide-character code that does not correspond to a
             valid character has been detected.

(IEEE Std.1003.1-2001 page 471, line 15515)

You may wish that it was implemented differently, but that doesn't mean that the current implementation is wrong.

-GAWollman
State Changed From-To: open->suspended suspended awaiting patches
Class Changed From-To: sw-bug->change-request The present implementation of swprintf() could be much better, and the submitter is rightly offended, but this doesn't seem to be a bug.
Responsible Changed From-To: freebsd-standards->freebsd-bugs Reclassify this as a general bug. Although swprintf's behavior of converting from wide characters to multibyte representations and back again appears not to be a standards violation, a better implementation would avoid the extra work.
For bugs matching the following conditions:
- Status == In Progress
- Assignee == "bugs@FreeBSD.org"
- Last Modified Year <= 2017

Do
- Set Status to "Open"