Bug 200398 - iconv(3) support of UTF-7 is broken
Summary: iconv(3) support of UTF-7 is broken
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: 10.1-RELEASE
Hardware: Any Any
: Normal Affects Some People
Assignee: Tijl Coosemans
URL:
Keywords: patch-ready
Depends on:
Blocks:
 
Reported: 2015-05-22 21:52 UTC by Xin LI
Modified: 2015-06-02 09:44 UTC (History)
0 users

See Also:
delphij: mfc-stable10+


Attachments
Patch by tijl@ (2.82 KB, patch)
2015-05-22 21:52 UTC, Xin LI
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Xin LI freebsd_committer freebsd_triage 2015-05-22 21:52:40 UTC
Created attachment 157059 [details]
Patch by tijl@

(This is mainly for tracking purposes).

I have observed this issue with dovecot which the index worker would crash for:

Panic: file charset-iconv.c: line 132 (charset_to_utf8): assertion
failed: (*src_size - pos <= CHARSET_MAX_PENDING_BUF_SIZE)

Having been annoyed by this for some time I have decided to instrument the code to figure out what have happen under the hood.  Eventually, I have discovered that if iconv(3) is asked to convert two UTF-7 strings to UTF-8:

"+ADw-SPAN+AD4-"

And

"+ADw-SPAN lang"

The second conversion would give wrong results, while GNU implementation of iconv(3) does not have the same issue.  Using gdb, the UTF-7 mode was 1 (shift) when processing the second string, while it should be 0, so using an iconv(cd, NULL, NULL, NULL) would mitigate this issue.

I have then asked Tijl Coosemans <tijl@> who have quickly found the problem and created a patch, quote:

===
_citrus_UTF7_mbtoutf16 stored the decoder state at the beginning so it
could restore this state on an incomplete character such that the next
call would restart the decoding.  The problem was that "-" at the end
of a string was also treated as an incomplete character but was also
removed from the state buffer.  So the initial state would be restored
(with base64 mode) but the next call would no longer see the "-" and
thus continued in base64 mode.

This state saving/restoring isn't needed here.  It's already handled
elsewhere (citrus_iconv_std.c:_citrus_iconv_std_iconv_convert) so the
patch removes it.

The patch also improves the decoding of 4 byte UTF16 characters.  If
only 2 bytes can be read it is treated as an incomplete character now
(returning an error) whereas before it would be treated as a shift
sequence (not an error).  A range check has been added for the low 2
bytes as well.
===
Comment 1 Xin LI freebsd_committer freebsd_triage 2015-05-22 21:53:20 UTC
Assign to our hero.
Comment 2 commit-hook freebsd_committer freebsd_triage 2015-05-24 15:28:15 UTC
A commit references this bug:

Author: tijl
Date: Sun May 24 15:27:32 UTC 2015
New revision: 283406
URL: https://svnweb.freebsd.org/changeset/base/283406

Log:
  Fix decoding of UTF-7 when a base64 encoded chunk appears at the end of
  the input buffer.

  _citrus_UTF7_mbtoutf16 stored the decoder state at the beginning so it
  could restore this state on an incomplete character such that the next
  call would restart the decoding.  The problem was that "-" (end of base64
  mode) at the end of a string was also treated as an incomplete character
  but was also removed from the state buffer.  So the initial state would be
  restored (with base64 mode) and the next call would no longer see the "-"
  so it continued in base64 mode.

  This state saving/restoring isn't needed here.  It's already handled
  elsewhere (citrus_iconv_std.c:_citrus_iconv_std_iconv_convert) so just
  remove it.

  Also initialise *nresult.

  PR:		200398
  Tested by:	delphij
  MFC after:	1 week

Changes:
  head/lib/libiconv_modules/UTF7/citrus_utf7.c
Comment 3 commit-hook freebsd_committer freebsd_triage 2015-06-02 09:42:57 UTC
A commit references this bug:

Author: tijl
Date: Tue Jun  2 09:42:00 UTC 2015
New revision: 283908
URL: https://svnweb.freebsd.org/changeset/base/283908

Log:
  MFC r283406,283418:

  Fix decoding of UTF-7 when a base64 encoded chunk appears at the end of
  the input buffer.

  _citrus_UTF7_mbtoutf16 stored the decoder state at the beginning so it
  could restore this state on an incomplete character such that the next
  call would restart the decoding.  The problem was that "-" (end of base64
  mode) at the end of a string was also treated as an incomplete character
  but was also removed from the state buffer.  So the initial state would be
  restored (with base64 mode) and the next call would no longer see the "-"
  so it continued in base64 mode.

  This state saving/restoring isn't needed here.  It's already handled
  elsewhere (citrus_iconv_std.c:_citrus_iconv_std_iconv_convert) so just
  remove it.

  Also initialise *nresult.

  When only 2 bytes can be read from a 4 byte UTF-16 character in a base64
  encoded chunk of a UTF-7 string, treat that as an incomplete character and
  return an error instead of a shift sequence and no error.

  Also check that the low 2 bytes have a valid value.

  PR:		200398

Changes:
_U  stable/10/
  stable/10/lib/libiconv_modules/UTF7/citrus_utf7.c