[ic] Call for testers

Fri Mar 13 03:32:28 UTC 2009

On Mar 12, 2009, at 9:55 PM, Peter wrote:

> On 03/12/2009 07:31 PM, David Christensen wrote:
>> There may be other options that we can look
>> at, such as a directive that indicates a fallback encoding of any
>> existing catalog files, so if we fail on the initial utf-8 decode we
>> fall back to that.  That would allow us to catch and log the fact that
>> a legacy encoding was encountered, but at the same time would allow us
>> to properly decode the data in question without resorting to
>> substitution.  I suspect this failure would occur when trying to read
>> in the data with read_file rather than at regexp match time, but the
>> solution/logic still holds.
>
> That sounds like a good idea.  Basically put, we eval in a couple
> different places and if the eval fails we can assume that the data is
> Latin-1 and then convert it to UTF8 based on that.  Perl should then
> have valid UTF8 and stop complaining and the data will (we hope) look
> like it's supposed to.  It might be useful to have a directive that
> indicates what the page encoding is as well, then we can dispense with
> the eval and just assume that all pages are encoded as per the directive
> and convert to UTF8.  This would happen in read_file, then.  For best
> backwards compatibility I think it would be best to ignore this
> directive if MV_UTF8 is not set.

I've done this before, and it ends up looking something like this:

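use Encode ();

my $decoded_data;   # declared outside the loop so the result survives it
for my $enc (qw/utf-8 cp-1252 latin-1/) {
     # decode() takes the encoding first; FB_CROAK makes bad input die
     $decoded_data = eval { Encode::decode($enc, $data, Encode::FB_CROAK | Encode::LEAVE_SRC) };
     last if defined $decoded_data;
}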

We could abstract out the list of encodings so we could potentially  
return different values if needed depending on context, but I'd guess  
for the most part the pages/components/templates, etc. will be in a  
single encoding, so this would boil down to ("utf8",  
$fallback_encoding).  (Hey, I'm optimistic... :-D)
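To make the "abstract out the list of encodings" idea concrete, here is a minimal sketch. The helper name decode_first_match and the $fallback_encoding value are assumptions for illustration only and are not part of Interchange. It returns which encoding matched so that the caller can log a legacy hit.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: try each encoding in order and return the decoded
# string plus the name of the encoding that worked, or an empty list.
sub decode_first_match {
    my ($octets, @encodings) = @_;
    for my $enc (@encodings) {
        my $decoded = eval {
            # FB_CROAK dies on malformed input; LEAVE_SRC keeps $octets
            # intact so the next encoding sees the original bytes.
            Encode::decode($enc, $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        return ($decoded, $enc) if defined $decoded;
    }
    return;
}

# Assumed value of the proposed fallback directive.
my $fallback_encoding = 'iso-8859-1';

# "caf\xE9" is not valid UTF-8, so the fallback encoding is used.
my ($text, $used) = decode_first_match("caf\xE9", 'utf-8', $fallback_encoding);
warn "legacy encoding $used encountered\n"
    if defined $used && $used ne 'utf-8';
```

Since iso-8859-1 maps every byte to a character, putting it last means the loop always ends with some decoded result; a stricter fallback could still fail and return nothing.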

>> I think the abovementioned directive solves both issues; if MV_UTF8 is
>> off and/or the legacy encoding is not defined, we could fall back to
>> raw octets, like in 2.
>
> if MV_UTF8 is off this is not an issue since Perl will treat everything
> as raw octets anyways and we should not try to change anything or we
> risk breaking backwards compatibility.

Right, that's what I was thinking; I guess I wasn't explicit about the
"off" code path.  My thought here was that if MV_UTF8 is set but the
data fails to decode as utf8 and no "fallback" encoding is provided, we
turn off the utf-8 flag, i.e., treat the data as raw octets 0-255.
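The three code paths described above could be sketched roughly as follows. This is only an illustration: decode_page is a hypothetical name, $utf8_on stands in for the MV_UTF8 setting, and $fallback stands in for the proposed fallback-encoding directive, which does not exist yet.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper; $utf8_on stands in for MV_UTF8 and $fallback for
# the proposed (not yet existing) fallback-encoding directive.
sub decode_page {
    my ($octets, $utf8_on, $fallback) = @_;

    # MV_UTF8 off: leave the data untouched for backwards compatibility.
    return $octets unless $utf8_on;

    # Try utf-8 first, then the fallback encoding if one was given.
    for my $enc (grep { defined $_ } 'utf-8', $fallback) {
        my $decoded = eval {
            Encode::decode($enc, $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        return $decoded if defined $decoded;
    }

    # Nothing decoded cleanly: hand back the raw octets (0-255), unflagged.
    return $octets;
}
```

An unknown fallback name also makes decode() die, so in this sketch a misconfigured fallback quietly ends up on the raw-octets path; real code would probably want to log that case.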

> As a side note, I'm now thinking that making MV_UTF8 a variable may have
> been a mistake.  I would much rather see it as a configuration
> directive.  Same goes for MV_HTTP_CHARSET.  I wonder if it's too late to
> change that?

I agree that they'd work better as directives.  I have a patch in the  
ic-utf8 tree that handles verification/resolution of the  
MV_HTTP_CHARSET variable (based on a suggestion on the list some time  
ago).  We could presumably support both the variables and the  
directives simultaneously for a while, prioritizing the directive if  
it exists over the variable.  We could also make the database-specific  
encodings directives as well, so as to hook into the same encoding  
validation functionality.
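
As a rough sketch of "directive first, then variable" with validation: the function name resolve_charset and its two arguments are assumptions for illustration, and this is not the patch in the ic-utf8 tree. It relies only on Encode::find_encoding, which returns undef for a name it does not recognize.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical resolver: prefer the directive, fall back to the variable,
# and accept a value only if Encode recognizes it as an encoding name.
sub resolve_charset {
    my ($directive, $variable) = @_;
    for my $name ($directive, $variable) {
        next unless defined $name && length $name;
        my $enc = Encode::find_encoding($name) or next;
        return $enc->name;   # Encode's canonical name for the alias
    }
    return;
}

# The directive wins when both are set and valid.
my $charset = resolve_charset('latin1', 'utf-8');
```

Here an invalid directive falls through to the variable rather than raising an error, which is one possible design choice. Note also that the name Encode reports is its own canonical one and may not be the exact string you would want in an HTTP header.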

Regards,

David
--
David Christensen
End Point Corporation
david at endpoint.com
212-929-6923
http://www.endpoint.com/