[ic] ic-utf8 readfile/writefile patch
David Christensen
david at endpoint.com
Sun Mar 22 02:54:58 UTC 2009
On Mar 21, 2009, at 7:11 PM, Peter wrote:
> On 03/21/2009 05:51 AM, Stefan Hornburg (Racke) wrote:
>> Stefan Hornburg (Racke) wrote:
>>> Yes, it doesn't crash anymore :-).
>>>
>>> One problem that might be related to this issue is the delivery of
>>> "binary"
>>> content stored in a UTF8 database.
>>>
>>> Currently the files produced are corrupted, and the data in the db
>>> is
>>> definitely correct. And it works with UTF8 inactive (as per bug
>>> #259).
>>>
>>> The code is as follows:
>>>
>>> my $data = $Db{transaction_documents}->field($td_code, 'content');
>>> $data = $Tag->filter({op => 'decode_base64', body => $data});
>>
>> Putting at this point:
>>
>> Encode::_utf8_on($data);
>>
>> "solves" the problem.
>>
>>> $Tag->deliver({type => 'application/pdf', body => $data});
>
>
> This may "solve the problem" but it doesn't look like the right
> solution
> to me. If it's really "binary" data then the utf8 flag should be off
> for it, not on and the data should not be treated as utf8 in any way.
> Presumably this would contain something like a picture or a sound file
> which should certainly not be utf8 encoded.
Correct.  I'll give the binary blobs some thought; I'm not sure yet
whether they need special handling.  The issue comes not from the
internal representation, but from the fact that the server's output
handle is apparently encoding to utf8 on the way out to the client,
which should only happen when the content type is text.
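As a sketch of the kind of fix I have in mind (the function and
variable names here are illustrative, not Interchange's actual deliver
code), choosing the output layer based on content type would look
something like:

```perl
use strict;
use warnings;

# Hypothetical helper: strip the encoding layer for binary content,
# apply one for text.  Names are illustrative only.
sub set_output_layer {
    my ($fh, $content_type) = @_;
    if ($content_type =~ m{^text/}) {
        binmode($fh, ':encoding(UTF-8)');   # encode characters on the way out
    }
    else {
        binmode($fh, ':raw');               # pass bytes through untouched
    }
}

# In-memory handles let us observe the effect without a real socket.
my $bytes = join '', map { chr } 0x00 .. 0xFF;   # arbitrary binary payload
open my $out, '>', \(my $buf) or die $!;
set_output_layer($out, 'application/pdf');
print {$out} $bytes;
close $out;
print length($buf), "\n";   # 256: the bytes arrive unmodified
```

With the :raw layer in place all 256 byte values round-trip intact;
with an encoding layer, the hi-bit half of them would not.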
> It almost looks to me like perl was reading in the raw binary data
> (after converting from base64) performing a utf8 conversion on it, and
> then *not* setting the utf8 flag. Alternatively, it may be converting
> the data to utf8 prior to writing it out to the db in the first
> place so
> when it reads it back it gets utf8 data which is not flagged as such.
>
> ...or am I missing something here?
The utf8 flag is kind of non-intuitive; the way perl stores character
data above 0xFF is by encoding it in utf8 internally and indicating as
much in the SV flags (specifically the utf8 flag).  So when the utf8
flag is set on the SV (the scalar value record in perl's internals),
perl considers the *characters* in the string to be Unicode code
points (i.e., potentially 0x100 and higher), not the raw bytes used to
represent those code points in the utf8 encoding.  You may have
noticed a "wide character in print" warning if you have ever printed
Unicode data in perl without setting a PerlIO encoding layer.  This is
because perl has no single-byte representation for such a character
and falls back to writing its internal utf8 bytes.
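A quick demonstration of the warning, and of what an explicit encoding
layer changes (using in-memory handles so it is self-contained):

```perl
use strict;
use warnings;

my $warned;
local $SIG{__WARN__} = sub { $warned = shift };

# Printing a character above 0xFF through a handle with no encoding
# layer: perl has no single-byte representation for it, so it falls
# back to its internal utf8 bytes and warns.
open my $plain, '>', \(my $raw) or die $!;
print {$plain} chr(0x263A);      # U+263A WHITE SMILING FACE
close $plain;
print $warned;                   # "Wide character in print at ..."

# With an explicit encoding layer, the same print is silent.
undef $warned;
open my $enc, '>:encoding(UTF-8)', \(my $encoded) or die $!;
print {$enc} chr(0x263A);
close $enc;
print defined $warned ? "warned\n" : "no warning\n";
```

Both buffers end up holding the same three utf8 octets; the only
difference is whether perl had to warn about the fallback.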
Specifically, utf8 *data* (aka octets) is what the encode_utf8
function produces; the utf8 flag on the resulting octets is turned
off, and the string is considered a sequence of 8-bit bytes.
Conceptually, this transformation is the trivial operation of turning
*off* the utf8 flag on the SV, assuming the SV was already internally
encoded as utf8 (which, unless one has been mucking about with the
utf8 flag, should be the case).
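The flag semantics are easy to see with a one-character string above
0xFF:

```perl
use strict;
use warnings;
use Encode ();

# One logical character above 0xFF ...
my $str = chr(0x100);            # U+0100 LATIN CAPITAL LETTER A WITH MACRON
print utf8::is_utf8($str) ? "flag on\n" : "flag off\n";   # flag on
print length($str), "\n";        # 1: length counts characters

# ... but encode_utf8 yields its two-octet utf8 representation, with
# the flag off: the same underlying bytes, now treated as bytes.
my $octets = Encode::encode_utf8($str);
print utf8::is_utf8($octets) ? "flag on\n" : "flag off\n";  # flag off
print length($octets), "\n";     # 2: length now counts bytes
```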
So in the current scenario, if we don't disable the utf8 PerlIO layer
on the server's output handle, the hi-bit raw bytes are going to be
transformed to their utf8 equivalents, which for bytes in the range
0x80 - 0xFF means two-byte sequences, i.e., corruption on the
receiving end.  The reason Racke's "fix" works is that the PerlIO
utf8 layer sees that the utf8 flag is set and hence considers the data
already encoded (even though the flag was set artificially, so the
data is not actually valid utf8), and outputs the bytes directly.
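A small reproduction of both the corruption and the workaround, using
an in-memory handle in place of the server's socket:

```perl
use strict;
use warnings;
use Encode ();

# A "binary" payload: raw bytes including one in the 0x80-0xFF range.
my $data = "\x89PNG\x0D\x0A";    # e.g. the start of a PNG signature

# Through a :utf8 output layer, each hi-bit byte becomes a two-byte
# utf8 sequence: corruption.
open my $fh, '>:utf8', \(my $corrupted) or die $!;
print {$fh} $data;
close $fh;
print length($corrupted), "\n";   # 7, not 6: 0x89 became two bytes

# The workaround: forcing the flag on makes the layer believe the data
# is already utf8-encoded, so the bytes pass through verbatim.
Encode::_utf8_on($data);
open $fh, '>:utf8', \(my $passed) or die $!;
print {$fh} $data;
close $fh;
print length($passed), "\n";      # 6: bytes arrive untouched
```

(Per the discussion above, the real fix is to keep the flag off and
remove the layer for binary content, not to lie to the layer.)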
In the case of base64, the encoded form is all ASCII; ASCII is a
trivial subset of utf8, and encode_utf8($ascii_string) eq
$ascii_string.  Thus the data returned from the database is ASCII
(regardless of the database's automatic utf8-ification).  The base64
decode function, however, returns data in the range 0x00 - 0xFF, which
gets encoded on output as described above.
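Illustrating the base64/ASCII point:

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);
use Encode qw(encode_utf8);

# base64 text is pure ASCII, and ASCII is a subset of utf8, so
# utf8-encoding it is a no-op on the octets.
my $b64 = encode_base64("\x00\xFF\x80ABC", '');   # '' = no line endings
print encode_utf8($b64) eq $b64 ? "identical\n" : "differs\n";  # identical

# Decoding, however, yields raw octets in 0x00-0xFF: exactly the data
# that an active utf8 output layer starts mangling.
my $decoded = decode_base64($b64);
print length($decoded), "\n";       # 6
print sprintf("%vX\n", $decoded);   # 0.FF.80.41.42.43
```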
Regards,
David
--
David Christensen
End Point Corporation
david at endpoint.com
212-929-6923
http://www.endpoint.com/