| Summary: | PR database corrupts patches | ||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Services | Reporter: | Michael Gmelin <freebsd> | ||||||||||||||||||||
| Component: | Bug Tracker | Assignee: | Shaun Amott <shaun> | ||||||||||||||||||||
| Status: | Closed Overcome By Events | ||||||||||||||||||||||
| Severity: | Affects Only Me | ||||||||||||||||||||||
| Priority: | Normal | ||||||||||||||||||||||
| Version: | unspecified | ||||||||||||||||||||||
| Hardware: | Any | ||||||||||||||||||||||
| OS: | Any | ||||||||||||||||||||||
| Attachments: |
|
||||||||||||||||||||||
|
Description
Michael Gmelin
2012-09-30 22:40:04 UTC
This followup contains the same patch, this time sent through a MUA. The attachment will be base r64 encoded, the mime header should look like this: Content-Type: text/x-patch Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename=china.txt -- Michael Gmelin This followup contains the same patch, this time sent through a MUA. The attachment will be base r64 encoded, the mime header was manually altered to contain a charset in the content-type: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename=china.txt -- Michael Gmelin The next attempt contains a gzipped version of the patch. Content-type therefore is application/x-gzip Also, to test processing of file content, here is the character sequence in question as part of the email again: ä¸å½ -- Michael Gmelin Analysis:
1. The PR system assumes some different encoding than UTF-8 to be the
default. This means:
a) Patches uploaded through the web form will corrupt
b) Patches mailed as attachments without explicit charset
specification will corrupt
c) standard send-pr patches break - adding a charset UTF-8 header
manually will probably work, but is too easy to forget. Also
won't fix the download option.
2. The PR system can handle binary attachments correctly in its base64
view
3. Downloaded patches are corrupted in all cases!
a) File attached via webform:
fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=1" | hd
c3 a4 c2 b8 c2 ad
(should have been: e4 b8 ad e5 9b bd)
This looks like the input has been assumed to be latin1,
transcoded to UTF-8 and truncated.
b) File sent as follow up attachment without UTF-8 charset:
fetch -o -
"http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=2" |
hd c3 a4 c2 b8 c2 ad c3 a5
(should have been: e4 b8 ad e5 9b bd)
This looks like the input has been assumed to be latin1 and
transcoded to UTF-8.
c) File sent as follow up attachment WITH UTF-8 charset:
(this one shows up correctly on the web page, the download is
still broken though):
fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=3" | hd
e4 b8 ad e5
(should have been: e4 b8 ad e5 9b bd)
This looks like it got the encoding right, but can't handle three
byte characters (string length calculation problem?!)
d) Gzipped version of the patch:
The base64 encoded version shown on the PR webpage is correct:
md5 china.txt.gz
MD5 (china.txt.gz) = 29009c79690c58b0762274da0e3ad80d
echo "H4sICIG7aFAAA2NoaW5hLnR4dAB7smPt09l7uQC1SPS1BwAAAA==" \
| openssl enc -d -a | md5
29009c79690c58b0762274da0e3ad80d
Downloading through the download link fails though:
fetch -o - "http://www.freebsd.org/cgi/query-pr.cgi?pr=172195&getpatch=4" | md5
ae9f2f3531871be8c4af662863eb542e
Taking a deeper look into the gzip file shows, that there has
been an attempt to somehow UTF-8 encode the binary content:
Original:
00000000 1f 8b 08 08 81 bb 68 50 00 03 63 68 69 6e 61 2e
00000010 74 78 74 00 7b b2 63 ed d3 d9 7b b9 00 b5 48 f4
00000020 b5 07 00 00 00
00000025
File as downloaded from the PR website:
00000000 1f c2 8b 08 08 c2 81 c2 bb 68 50 00 03 63 68 69
00000010 6e 61 2e 74 78 74 00 7b c2 b2 63 c3 ad c3 93 c3
00000020 99 7b c2 b9 00
00000025
As you can see, 8bit characters have been UTF-8 encoded, and the
resulting file got truncated at the original file size.
Conclusion:
There is no simple way of submitting a patch through the PR system so
that it can be downloaded using the download link. Right now the
options are:
1. Send the file as an email attachment, making sure that the character
encoding in the mime header is set to UTF-8 (not all email clients
will do this automatically). This way a patch can be acquired by
using copy and paste - the download link will not work correctly
though and yield surprising results. A patch acquired this way might
actually apply, but cause unintended behavior.
2. Send the file gzipped and make people use base64 decode to get the
gzip. This way when the download link is used people will at least
realize something went wrong.
3. Base64 encode the patch before sending it, this way everything
stays us-ascii and cannot be messed with by the PR system. Requires
users to base64 decode on their own and makes it hard to argue about
the patch in a way that's transparent to users of the web page.
None of these options seem very appealing, especially since it makes it
easy for people to get it wrong and hard to get it right - also various
tools used by port maintainers (porttools, send-pr etc.) might not be
prepared to support the user to get it right. There will be more and
more UTF-8 encoded patches in the future, so I think this should be
fixed.
Suggested fixes:
- Change the default encoding (the coding assumed when no encoding is
specified) to UTF-8. This might not be practical in all cases, but
should be discussed.
- Make sure that the download option provides correct files (it should
treat all files as binary and not try to alter them in any way).
I hope all of this makes sense.
--
Michael Gmelin
Responsible Changed From-To: freebsd-www->shaun I'll look into this. Hi Shaun, you also may use http://bugs.freebsd.org/172280 as real world example :). -- Regards, Ruslan Tinderboxing kills... the drives. Another test to check if there is any progress (simple plain text attachment) -- Michael Gmelin Same as before, but encoded base64 instead of quoted-printable. -- Michael Gmelin Same as before, but encoded 8bit instead of base64. -- Michael Gmelin Same as before, but encoded base64 plus stating utf-8 encoding in mime-type explicitly. -- Michael Gmelin Same as before, but encoded quoted-printable 64 plus stating utf-8 encoding in mime-type explicitly. -- Michael Gmelin So basically this is still broken. When not specifying utf-8 explicitly for the attachment, the encoding gets mangled. When stating it explicitly, characters show up correctly when sent quoted-printable, but once clicking download the file will get truncated (in this case by 8 characters). -- Michael Gmelin Is anybody working on this? -- Michael Gmelin these issues have become "overcome by events" |