[ic] Call for testers

Fri Mar 13 05:48:35 UTC 2009

On Thu, 12 Mar 2009, Peter wrote:

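>> my $decoded_data;
>> for my $enc (qw/utf-8 cp-1252 latin-1/) {
>>      # decode() takes (ENCODING, OCTETS, CHECK); FB_CROAK makes it die on bad bytes
>>      $decoded_data = eval { decode($enc, $data, Encode::FB_CROAK | Encode::LEAVE_SRC) };
>>      last if defined $decoded_data;
>> }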
>>
>> We could abstract out the list of encodings so we could potentially
>> return different values if needed depending on context, but I'd guess
>> for the most part the pages/components/templates, etc. will be in a
>> single encoding, so this would boil down to ("utf8",
>> $fallback_encoding).  (Hey, I'm optimistic... :-D)
>
> I like that code above.  We could have a directive, say PageEncoding
> where you can actually list the primary encoding and each fallback to
> try in order in your above loop.  This could default to "utf8 latin-1".
> Again, we completely ignore this directive if MV_UTF8 is not set.
>
>> My thought here was that if MV_UTF8 was set but the data failed to 
>> decode as utf8 and no "fallback" encoding was provided, then turn off 
>> the utf-8 flag, i.e., treat the data as raw octets 0-255.
>
> We can have this be the final default mechanism if we fall off the end 
> of the PageEncoding list.  To make this the only action other than utf8 
> one can simply set PageEncoding to utf8.
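
A minimal sketch of the fallback chain discussed in the quoted text, assuming
the encodings from a PageEncoding-style directive arrive as a plain list. The
sub name decode_with_fallbacks and its return convention are hypothetical, not
anything in Interchange today; only the Encode calls are real.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: try each encoding in order, as a PageEncoding
# directive might list them. Returns (text, encoding_name) on success,
# or (original octets, undef) if every encoding fails.
sub decode_with_fallbacks {
    my ($octets, @encodings) = @_;

    for my $enc (@encodings) {
        # FB_CROAK makes decode() die on invalid input; LEAVE_SRC keeps
        # $octets intact so the next encoding sees the same bytes.
        my $text = eval {
            Encode::decode($enc, $octets,
                Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        return ($text, $enc) if defined $text;
    }

    # Fell off the end of the list: treat the data as raw octets 0-255.
    return ($octets, undef);
}

# Usage: my ($text, $used) = decode_with_fallbacks($data, 'UTF-8', 'iso-8859-1');
```

Note that a single-byte encoding such as iso-8859-1 accepts any byte sequence,
so anything listed after it would never be tried.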

This is starting to sound awfully fancy.

If you can make all this autodetection work well, and without a big 
performance hit, I suppose it's nice.

But do we really want to encourage people to have various files in various 
encodings, never really sure which file is in which? Do we really want it to be 
possible for Interchange to autodetect that an HTML header file is in 
Windows-1252 and convert it to UTF-8, yet its header still says 
Windows-1252?

I doubt that's the only case where this fancy autodetection stuff could 
bite us.

I'd prefer to see nothing done to the bytes if UTF-8 support is disabled, 
and if it's enabled, see any invalid UTF-8 bytes converted to ? 
characters. That's simple, nonfatal at runtime, and yet gently encourages 
developers to get their sources in the proper UTF-8 encoding.
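
A rough sketch of that simpler behavior, assuming a boolean flag stands in for
the MV_UTF8 check. The sub name sanitize_page_data is made up for illustration.
Encode's default CHECK mode substitutes U+FFFD for malformed input, so the
sketch maps that to "?" afterwards.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper; $utf8_enabled stands in for a check of MV_UTF8.
sub sanitize_page_data {
    my ($octets, $utf8_enabled) = @_;

    # UTF-8 support disabled: hand the bytes back untouched.
    return $octets unless $utf8_enabled;

    # FB_DEFAULT never dies; each malformed sequence becomes U+FFFD.
    my $text = Encode::decode('UTF-8', $octets, Encode::FB_DEFAULT);

    # Show the substitution as "?". Caveat: this also rewrites any
    # genuine U+FFFD that was already present in the source.
    $text =~ tr/\x{FFFD}/?/;

    return $text;
}
```

Nothing here is fatal at runtime, and the stray "?" characters point the
developer at the file that needs re-encoding.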

I think that's still pretty lax, since we're not aborting on invalid 
characters as e.g. Postgres does. What do you all think?

>> I agree that they'd work better as directives.  I have a patch in the 
>> ic-utf8 tree that handles verification/resolution of the 
>> MV_HTTP_CHARSET variable (based on a suggestion on the list some time 
>> ago).  We could presumably support both the variables and the 
>> directives simultaneously for a while, prioritizing the directive if it 
>> exists over the variable.
>
> I don't think these variables have been around that long (they were 
> introduced towards the end of 5.5, iirc which would be last year).  I 
> think it may be best to just nip them in the bud now and make a note in 
> the UPGRADE file.  This is one that I would want Mike's input on, though 
> as he may well feel differently.

I think Mike (reasonably) just wants backward compatibility to stop 
breaking for non-UTF-8 stuff. I suspect Frederic, Stefan, Kevin, and the 
very few others who are actually using UTF-8 would be better suited to 
offer opinions on changes in that setup.

Jon

-- 
Jon Jensen
End Point Corporation
http://www.endpoint.com/