[ic] ic-utf8 readfile/writefile patch

Sun Mar 22 02:54:58 UTC 2009

On Mar 21, 2009, at 7:11 PM, Peter wrote:

> On 03/21/2009 05:51 AM, Stefan Hornburg (Racke) wrote:
>> Stefan Hornburg (Racke) wrote:
>>> Yes, it doesn't crash anymore :-).
>>>
>>> One problem that might be related to this issue is the delivery of
>>> "binary" content stored in a UTF8 database.
>>>
>>> Currently the files produced are corrupted, and the data in the db
>>> is definitely correct. And it works with UTF8 inactive (as per bug
>>> #259).
>>>
>>> The code is as follows:
>>>
>>> my $data = $Db{transaction_documents}->field($td_code, 'content');
>>> $data = $Tag->filter({op => 'decode_base64', body => $data});
>>
>> Putting at this point:
>>
>> Encode::_utf8_on($data);
>>
>> "solves" the problem.
>>
>>> $Tag->deliver({type => 'application/pdf', body => $data});
>
>
> This may "solve the problem" but it doesn't look like the right
> solution to me.  If it's really "binary" data then the utf8 flag
> should be off for it, not on, and the data should not be treated as
> utf8 in any way.  Presumably this would contain something like a
> picture or a sound file which should certainly not be utf8 encoded.

Correct.  I'll give the binary blobs some thought; I'm not sure yet
whether they need special handling.  The issue comes in not from the
internal representation, but because the server's output handle is
apparently encoding to utf8 on the way out to the client, which should
only happen when the content-type is text.

> It almost looks to me like perl was reading in the raw binary data
> (after converting from base64) performing a utf8 conversion on it, and
> then *not* setting the utf8 flag.  Alternatively, it may be converting
> the data to utf8 prior to writing it out to the db in the first
> place so when it reads it back it gets utf8 data which is not flagged
> as such.
>
> ...or am I missing something here?

The utf8 flag is kind of non-intuitive; the way that perl stores
character data > 0xFF is by encoding it in utf8 internally and
indicating such in the SV flags (specifically the utf8 flag).  So when
the utf8 flag is set on the SV (scalar value record in perl's
internals), perl considers the *characters* in the string to be the
Unicode code points themselves (which may be 0x100 and higher), not
the raw bytes used to represent those code points in the utf8
encoding.  You may have noticed a "Wide character in print" warning if
you have ever printed Unicode data in perl without setting a PerlIO
encoding layer on the handle.  This is because a code point above 0xFF
cannot be written as a single byte, so perl falls back to emitting its
internal utf8 bytes and warns about it.

Specifically, utf8 *data* (aka octets) is what is produced by the
encode_utf8 function; the utf8 flag on said octets is turned off, and
the result is considered a sequence of 8-bit bytes.  Conceptually,
this transformation is the trivial operation of turning *off* the utf8
flag on the SV, assuming that the SV was already encoded internally as
utf8 (which, unless one has been mucking about with the utf8 flag,
should be the case).
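
To make the characters-versus-octets distinction concrete, here is a
minimal sketch using only core modules.  The sample code point U+263A
is just an arbitrary character above 0xFF that I picked for
illustration; nothing here is Interchange-specific.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# A character string holding one code point above 0xFF (U+263A).
my $chars = "\x{263A}";

# encode_utf8 returns the utf8 octets for that character:
# three bytes (e2 98 ba) with the utf8 flag off.
my $octets = encode_utf8($chars);

printf "chars:  length=%d utf8_flag=%d\n",
    length($chars), utf8::is_utf8($chars) ? 1 : 0;
printf "octets: length=%d utf8_flag=%d hex=%s\n",
    length($octets), utf8::is_utf8($octets) ? 1 : 0, unpack('H*', $octets);

# Decoding the octets gives back the original character string.
my $roundtrip = decode_utf8($octets);
print "roundtrip ok\n" if $roundtrip eq $chars;
```

The same logical character is length 1 with the flag on and length 3
with the flag off, which is the whole difference between "characters"
and "octets" as far as perl is concerned.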

So with the current scenario, if we don't disable the utf8 PerlIO
layer on the server's output handle, the high-bit raw bytes are going
to be transformed to their utf8 equivalents, which for bytes in the
range 0x80 - 0xFF are two-byte sequences; that means corruption on the
receiving end.  Racke's "fix" works because the PerlIO utf8 layer sees
that the utf8 flag is set, and hence considers the data already
encoded (even though the flag has been set artificially, so the data
is not valid utf8) and outputs it directly.
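
The effect can be reproduced outside of Interchange with an in-memory
filehandle standing in for the server's output handle.  This is only a
sketch: bytes_written is a hypothetical helper of mine, the two sample
bytes are arbitrary, and the real server handle may be set up
differently.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: write $data to an in-memory handle through the
# given PerlIO layer and return the bytes that actually came out.
sub bytes_written {
    my ($layer, $data) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer)
        or die "cannot open in-memory handle: $!";
    {
        # Newer perls may warn about malformed utf8 in the forced-flag
        # case below; silence that so the output stays readable.
        no warnings 'utf8';
        print {$fh} $data;
    }
    close($fh);
    return $buffer;
}

# Stand-in for decoded binary content: two high-bit bytes, flag off.
my $binary = "\xE9\xFF";

my $raw      = bytes_written(':raw',  $binary);  # bytes pass through
my $upgraded = bytes_written(':utf8', $binary);  # each byte becomes two

# The workaround from the thread: force the flag on, bytes untouched.
my $flagged = $binary;
Encode::_utf8_on($flagged);
my $workaround = bytes_written(':utf8', $flagged);

printf "%-9s %s\n", $_->[0], unpack('H*', $_->[1])
    for ['raw', $raw], ['utf8', $upgraded], ['_utf8_on', $workaround];
```

The ':raw' and '_utf8_on' cases both emit e9ff, while the plain
':utf8' case emits c3a9c3bf, i.e. the corrupted form.  Writing through
a handle without the utf8 layer gives the right bytes without lying to
perl about the string.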

In the case of base64, the format is all ASCII; ASCII is a trivial  
subset of utf8, and encode_utf8($ascii_string) eq $ascii_string.  Thus  
in this case the data returned from the database is ASCII (regardless  
of the database's automatic utf8-ification).  The base64 decode
function returns bytes in the range 0x00 - 0xFF, which then get
encoded on output as described above.
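
A quick way to check both halves of that claim, assuming the core
MIME::Base64 module behaves like the decode_base64 filter used in
Racke's code (I have not verified that the filter is implemented this
way); the four sample bytes are made up:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use MIME::Base64 qw(encode_base64 decode_base64);

# Four bytes standing in for binary content; two are in 0x80 - 0xFF.
my $original = "\x25\x50\xE2\xFF";

# The base64 text is pure ASCII, so utf8-encoding it changes nothing.
my $b64 = encode_base64($original, '');
print "base64 text is unchanged by encode_utf8\n"
    if encode_utf8($b64) eq $b64;

# Decoding gives back the raw bytes, with the utf8 flag off.
my $decoded = decode_base64($b64);
printf "decoded: hex=%s utf8_flag=%d\n",
    unpack('H*', $decoded), utf8::is_utf8($decoded) ? 1 : 0;
```

So the data is intact right up to the point where it is handed to the
output handle.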

Regards,

David
--
David Christensen
End Point Corporation
david at endpoint.com
212-929-6923
http://www.endpoint.com/