[ic] Call for testers
David Christensen
david at endpoint.com
Fri Mar 13 03:32:28 UTC 2009
On Mar 12, 2009, at 9:55 PM, Peter wrote:
> On 03/12/2009 07:31 PM, David Christensen wrote:
>> There may be other options that we can look
>> at, such as a directive that indicates a fallback encoding of any
>> existing catalog files, so if we fail on the initial utf-8 decode we
>> fall back to that. That would allow us to catch and log the fact that
>> a legacy encoding was encountered, but at the same time would allow us
>> to properly decode the data in question without resorting to
>> substitution. I suspect this failure would occur when trying to read
>> in the data with read_file rather than at regexp match time, but the
>> solution/logic still holds.
>
> That sounds like a good idea. Basically put, we eval in a couple
> different places and if the eval fails we can assume that the data is
> Latin-1 and then convert it to UTF8 based on that. Perl should then
> have valid UTF8 and stop complaining and the data will (we hope) look
> like it's supposed to. It might be useful to have a directive that
> indicates what the page encoding is as well, then we can dispense with
> the eval and just assume that all pages are encoded as per the directive
> and convert to UTF8. This would happen in read_file, then. For best
> backwards compatibility I think it would be best to ignore this
> directive if MV_UTF8 is not set.
I've done something like this before, and it ends up looking something
like this:
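    use Encode qw(decode);

    my $decoded_data;
    for my $enc (qw/utf-8 cp-1252 latin-1/) {
        # FB_CROAK makes decode() die on bad input; LEAVE_SRC keeps $data intact
        $decoded_data = eval { decode($enc, $data, Encode::FB_CROAK() | Encode::LEAVE_SRC()) };
        last if defined $decoded_data;
    }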
We could abstract out the list of encodings so we could potentially
return different values if needed depending on context, but I'd guess
for the most part the pages/components/templates, etc. will be in a
single encoding, so this would boil down to ("utf8",
$fallback_encoding). (Hey, I'm optimistic... :-D)
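
To make the "abstract out the list of encodings" idea concrete, here is a rough sketch rather than anything from the ic-utf8 tree. The helper name decode_with_fallback and the 'cp1252' value for $fallback_encoding are made up for illustration; only Encode's decode() and its FB_CROAK/LEAVE_SRC check flags are real.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not an Interchange function): try each encoding in
# order and return the decoded text plus the encoding that worked.
sub decode_with_fallback {
    my ($octets, @encodings) = @_;
    for my $enc (@encodings) {
        # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
        # $octets unmodified so the next encoding sees the same bytes.
        my $text = eval {
            decode($enc, $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        };
        return ($text, $enc) if defined $text;
    }
    return;    # nothing decoded cleanly
}

# Assumed value; in practice this would come from the proposed directive.
my $fallback_encoding = 'cp1252';

# \x93 and \x94 are curly quotes in cp1252 but malformed as UTF-8.
my ($text, $used) =
    decode_with_fallback("\x93hi\x94", 'UTF-8', $fallback_encoding);
```

Returning the encoding that matched is what would let the caller log that a legacy encoding was encountered.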
>> I think the abovementioned directive solves both issues; if MV_UTF8 is
>> off and/or the legacy encoding is not defined, we could fall back to
>> raw octets, like in 2.
>
> if MV_UTF8 is off this is not an issue since Perl will treat everything
> as raw octets anyways and we should not try to change anything or we
> risk breaking backwards compatibility.
Right, that's what I was thinking. I guess I wasn't explicit about the
"off" code path. My thought here was that if MV_UTF8 was set but the data
failed to decode as utf-8 and no "fallback" encoding was provided, then we
turn off the utf-8 flag, i.e., treat the data as raw octets 0-255.
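
Spelled out, the three code paths might look roughly like the sketch below. This is only an illustration: decode_page is a hypothetical name, $utf8_on stands in for MV_UTF8, and $fallback stands in for the proposed fallback-encoding setting, none of which exist in Interchange under these names.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper; the arguments stand in for MV_UTF8 and the proposed
# fallback-encoding setting.
sub decode_page {
    my ($octets, $utf8_on, $fallback) = @_;

    # Path 1: MV_UTF8 off. Leave the data alone for backwards compatibility.
    return $octets unless $utf8_on;

    # Path 2: MV_UTF8 on. Try utf-8 first, then the fallback if one is set.
    for my $enc (grep { defined } 'UTF-8', $fallback) {
        my $text = eval {
            decode($enc, $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        };
        return $text if defined $text;
    }

    # Path 3: nothing decoded cleanly, so hand back the untouched octets.
    return $octets;
}
```

With the input "\x93hi\x94", which is not valid utf-8, calling decode_page with a 'cp1252' fallback yields curly quotes around "hi", while calling it with no fallback returns the same four bytes unchanged.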
> As a side note, I'm now thinking that making MV_UTF8 a variable may have
> been a mistake. I would much rather see it as a configuration
> directive. Same goes for MV_HTTP_CHARSET. I wonder if it's too late to
> change that?
I agree that they'd work better as directives. I have a patch in the
ic-utf8 tree that handles verification/resolution of the
MV_HTTP_CHARSET variable (based on a suggestion on the list some time
ago). We could presumably support both the variables and the
directives simultaneously for a while, prioritizing the directive if
it exists over the variable. We could also make the database-specific
encodings directives as well, so as to hook into the same encoding
validation functionality.
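
As a rough sketch of the directive-over-variable lookup with validation (not the actual patch in the ic-utf8 tree): the directive name HTTPCharset, the function name resolve_charset, and the two hashrefs are invented for illustration; Encode's find_encoding() and its ->name method are real.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical lookup: prefer the directive, fall back to the variable,
# and only accept names that Encode recognizes.
sub resolve_charset {
    my ($directives, $variables) = @_;

    my $name = defined $directives->{HTTPCharset}
        ? $directives->{HTTPCharset}          # assumed directive name
        : $variables->{MV_HTTP_CHARSET};      # existing variable
    return undef unless defined $name;

    # find_encoding() returns undef for unknown names.
    my $enc = find_encoding($name) or return undef;
    return $enc->name;                        # Encode's canonical name
}
```

The same check could be reused for the database-specific encodings, since it depends only on the name passed in.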
Regards,
David
--
David Christensen
End Point Corporation
david at endpoint.com
212-929-6923
http://www.endpoint.com/