Bug 172195

Summary: PR database corrupts patches
Product: Services Reporter: Michael Gmelin <freebsd>
Component: Bug TrackerAssignee: Shaun Amott <shaun>
Status: Closed Overcome By Events    
Severity: Affects Only Me    
Priority: Normal    
Version: unspecified   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
file.txt
none
china.txt
none
china.txt
none
china.txt.gz
none
test.patch.txt
none
test.patch.txt
none
test.patch.txt
none
test.patch.txt
none
test.patch.txt none

Description Michael Gmelin 2012-09-30 22:40:04 UTC
Uploading patches to the FreeBSD PR system might corrupt them on download from the PR web page. This affects UTF-8 encoded patches as well as binary patches.

As an additional test I will upload a text containing a UTF-8 snippet, which should resemble two 3 byte uTF-8 sequences forming the word China in Chinese
(0xe4 0xb8 0xad 0xe5 0x9b 0xbd or &#20013;&#22269;)

Then I will send a follow up email, that contains the same man page as a mime attachment using content-type application/x-gzip.

I might to additional tests on this PR (more UTF-8 based text etc.) to see if there is a safe way to provide this kind of data to a PR.

Fix: For binary patches: Copy and paste the base64 representation shown in the PR and decode it on the command line, e.g. using "openssl enc -d -a".

For text: none known.

This might not be a bug and completely work as designed - in this case I would like to see some documentation though, since these days patches might contain UTF-8 characters on a regular basis.


Patch attached with submission follows:
How-To-Repeat: Send a bug follow-up containing a gzipped file or UTF-8 file containing special characters.
Comment 1 Michael Gmelin 2012-09-30 22:51:18 UTC
This followup contains the same patch, this time sent through a MUA.

The attachment will be base r64 encoded, the mime header should look
like this:

Content-Type: text/x-patch
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename=china.txt

-- 
Michael Gmelin
Comment 2 Michael Gmelin 2012-09-30 22:56:18 UTC
This followup contains the same patch, this time sent through a MUA.

The attachment will be base r64 encoded, the mime header was manually altered
to contain a charset in the content-type:

Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename=china.txt

-- 
Michael Gmelin
Comment 3 Michael Gmelin 2012-09-30 23:12:36 UTC
The next attempt contains a gzipped version of the patch.

Content-type therefore is application/x-gzip

Also, to test processing of file content, here is the character
sequence in question as part of the email again:

中å½

-- 
Michael Gmelin
Comment 4 Michael Gmelin 2012-10-01 00:12:00 UTC
Analysis:

1. The PR system assumes some different encoding than UTF-8 to be the
   default. This means:
   a) Patches uploaded through the web form will corrupt
   b) Patches mailed as attachments without explicit charset
      specification will corrupt
   c) standard send-pr patches break - adding a charset UTF-8 header
      manually will probably work, but is too easy to forget. Also
      won't fix the download option.

2. The PR system can handle binary attachments correctly in its base64
   view

3. Downloaded patches are corrupted in all cases!
   a) File attached via webform:
      fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=1" | hd
      c3 a4 c2 b8 c2 ad
      (should have been: e4 b8 ad e5 9b bd)
      This looks like the input has been assumed to be latin1,
      transcoded to UTF-8 and truncated.      
      
   b) File sent as follow up attachment without UTF-8 charset:
      fetch -o -
      "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=2" |
      hd c3 a4 c2 b8 c2 ad c3 a5
      (should have been: e4 b8 ad e5 9b bd)
      This looks like the input has been assumed to be latin1 and
      transcoded to UTF-8.

   c) File sent as follow up attachment WITH UTF-8 charset:
      (this one shows up correctly on the web page, the download is
       still broken though):
      fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=3" | hd
      e4 b8 ad e5
      (should have been: e4 b8 ad e5 9b bd)
      This looks like it got the encoding right, but can't handle three
      byte characters (string length calculation problem?!)

   d) Gzipped version of the patch:
      The base64 encoded version shown on the PR webpage is correct:
      md5 china.txt.gz
      MD5 (china.txt.gz) = 29009c79690c58b0762274da0e3ad80d
 
      echo "H4sICIG7aFAAA2NoaW5hLnR4dAB7smPt09l7uQC1SPS1BwAAAA==" \
      | openssl enc -d -a | md5
      29009c79690c58b0762274da0e3ad80d

      Downloading through the download link fails though:
      fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=4" | md5
      ae9f2f3531871be8c4af662863eb542e

      Taking a deeper look into the gzip file shows, that there has
      been an attempt to somehow UTF-8 encode the binary content:

      Original:
00000000  1f 8b 08 08 81 bb 68 50  00 03 63 68 69 6e 61 2e
00000010  74 78 74 00 7b b2 63 ed  d3 d9 7b b9 00 b5 48 f4
00000020  b5 07 00 00 00
00000025

      File as downloaded from the PR website:
00000000  1f c2 8b 08 08 c2 81 c2  bb 68 50 00 03 63 68 69
00000010  6e 61 2e 74 78 74 00 7b  c2 b2 63 c3 ad c3 93 c3
00000020  99 7b c2 b9 00
00000025

      As you can see, 8bit characters have been UTF-8 encoded, and the
      resulting file got truncated at the original file size.


Conclusion:

There is no simple way of submitting a patch through the PR system so
that it can be downloaded using the download link. Right now the
options are:

1. Send the file as an email attachment, making sure that the character
   encoding in the mime header is set to UTF-8 (not all email clients
   will do this automatically). This way a patch can be acquired by
   using copy and paste - the download link will not work correctly
   though and yield surprising results. A patch acquired this way might
   actually apply, but cause unintended behavior.

2. Send the file gzipped and make people use base64 decode to get the
   gzip. This way when the download link is used people will at least
   realize something went wrong.

3. Base64 encode the patch before sending it, this way everything
   stays us-ascii and cannot be messed with by the PR system. Requires
   users to base64 decode on their own and makes it hard to argue about
   the patch in a way that's transparent to users of the web page.

None of these options seem very appealing, especially since it makes it
easy for people to get it wrong and hard to get it right - also various
tools used by port maintainers (porttools, send-pr etc.) might not be
prepared to support the user to get it right. There will be more and
more UTF-8 encoded patches in the future, so I think this should be
fixed.

Suggested fixes:

- Change the default encoding (the coding assumed when no encoding is
  specified) to UTF-8. This might not be practical in all cases, but
  should be discussed.
- Make sure that the download option provides correct files (it should
  treat all files as binary and not try to alter them in any way).

I hope all of this makes sense.

-- 
Michael Gmelin
Comment 5 Shaun Amott freebsd_committer freebsd_triage 2012-10-01 13:33:08 UTC
Responsible Changed
From-To: freebsd-www->shaun

I'll look into this.
Comment 6 R.Mahmatkhanov 2012-10-16 12:51:43 UTC
Hi Shaun,

you also may use http://bugs.freebsd.org/172280 as real world example :).

-- 
Regards,
Ruslan

Tinderboxing kills... the drives.
Comment 7 Michael Gmelin 2013-06-01 18:41:18 UTC
Another test to check if there is any progress (simple plain text
attachment)


-- 
Michael Gmelin
Comment 8 Michael Gmelin 2013-06-01 18:42:31 UTC
Same as before, but encoded base64 instead of quoted-printable.


-- 
Michael Gmelin
Comment 9 Michael Gmelin 2013-06-01 18:44:19 UTC
Same as before, but encoded 8bit instead of base64.


-- 
Michael Gmelin
Comment 10 Michael Gmelin 2013-06-01 18:46:08 UTC
Same as before, but encoded base64 plus stating utf-8 encoding in
mime-type explicitly.


-- 
Michael Gmelin
Comment 11 Michael Gmelin 2013-06-01 18:46:37 UTC
Same as before, but encoded quoted-printable 64 plus stating utf-8
encoding in mime-type explicitly.


-- 
Michael Gmelin
Comment 12 Michael Gmelin 2013-06-01 18:57:27 UTC
So basically this is still broken.

When not specifying utf-8 explicitly for the attachment, the encoding
gets mangled. When stating it explicitly, characters show up correctly
when sent quoted-printable, but once clicking download the file will get
truncated (in this case by 8 characters).

-- 
Michael Gmelin
Comment 13 Michael Gmelin 2013-12-11 19:07:59 UTC
Is anybody working on this?

-- 
Michael Gmelin
Comment 14 Eitan Adler freebsd_committer freebsd_triage 2014-06-06 05:05:09 UTC
these issues have become "overcome by events"