[ic] Call for testers

Stefan Hornburg racke at linuxia.de
Fri Mar 13 09:32:53 UTC 2009


Jon Jensen wrote:
> On Thu, 12 Mar 2009, Peter wrote:
> 
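>>> my $decoded_data;
>>> for my $enc (qw/utf-8 cp-1252 latin-1/) {
>>>      # decode() takes the encoding first; FB_CROAK makes bad input die
>>>      $decoded_data = eval { decode($enc, $data, Encode::FB_CROAK | Encode::LEAVE_SRC) };
>>>      last if defined $decoded_data;
>>> }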
>>>
>>> We could abstract out the list of encodings so we could potentially
>>> return different values if needed depending on context, but I'd guess
>>> for the most part the pages/components/templates, etc. will be in a
>>> single encoding, so this would boil down to ("utf8",
>>> $fallback_encoding).  (Hey, I'm optimistic... :-D)
>> I like that code above.  We could have a directive, say PageEncoding
>> where you can actually list the primary encoding and each fallback to
>> try in order in your above loop.  This could default to "utf8 latin-1".
>> Again, we completely ignore this directive if MV_UTF8 is not set.
>>
>>> My thought here was that if MV_UTF8 was set but the data failed to 
>>> decode as utf8 and no "fallback" encoding was provided, then turn off 
>>> the utf-8 flag, i.e., treat the data as raw octets 0-255.
>> We can have this be the final default mechanism if we fall off the end 
>> of the PageEncoding list.  To make this the only action other than utf8 
>> one can simply set PageEncoding to utf8.
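
A minimal sketch of the fallback list described above, assuming the proposed PageEncoding directive is available as a whitespace-separated string. The helper name decode_with_fallbacks and the sample data are hypothetical and not part of Interchange. Falling off the end of the list returns the octets untouched, which corresponds to the raw 0-255 default.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: try each encoding in order; return the decoded
# string and the encoding that worked, or the untouched octets and undef.
sub decode_with_fallbacks {
    my ($octets, @encodings) = @_;
    for my $enc (@encodings) {
        my $text = eval {
            Encode::decode($enc, $octets,
                Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        return ($text, $enc) if defined $text;
    }
    return ($octets, undef);    # fell off the end: treat as raw octets
}

# Stand-in for a parsed "PageEncoding utf8 latin-1" directive value.
my @page_encoding = split ' ', 'utf8 latin-1';

# "caf\xE9" is not valid UTF-8, so the second entry gets a chance.
my ($text, $used) = decode_with_fallbacks("caf\xE9", @page_encoding);
print defined $used ? "decoded as $used\n" : "kept as raw octets\n";
```

Setting the list to just utf8 gives the "UTF-8 or raw octets" behaviour, since nothing else is tried before the final return.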
> 
> This is starting to sound awfully fancy.
> 
> If you can make all this autodetection work well, and without a big 
> performance hit, I suppose it's nice.
> 
> But do we really want to encourage people to have various files in various 
> encodings, never really sure what it is? Do we really want it to be 
> possible for Interchange to autodetect that an HTML header file is in 
> Windows-1252 and convert it to UTF-8, yet its header still says 
> Windows-1252?
> 
> I doubt that's the only case where this fancy autodetection stuff could 
> bite us.
> 
> I'd prefer to see nothing done to the bytes if UTF-8 support is disabled, 
> and if it's enabled, see any invalid UTF-8 bytes converted to ? 
> characters. That's simple, nonfatal at runtime, and yet gently encourages 
> developers to get their sources in the proper UTF-8 encoding.
> 
> I think that's still pretty lax, since we're not aborting on invalid 
> characters as e.g. Postgres does. What do you all think?
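
The simpler policy Jon describes could look roughly like this. It is only a sketch: read_page_text and its $utf8_enabled flag are hypothetical stand-ins for whatever switch (such as MV_UTF8) enables UTF-8 handling. It relies on Encode's default behaviour of substituting U+FFFD, the Unicode replacement character, for malformed input rather than a literal "?".

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper illustrating the "leave alone or decode leniently"
# policy; not an Interchange function.
sub read_page_text {
    my ($octets, $utf8_enabled) = @_;
    return $octets unless $utf8_enabled;    # disabled: bytes pass through
    # With no CHECK argument, decode() does not die on malformed input;
    # it substitutes U+FFFD for each bad sequence.
    return Encode::decode('UTF-8', $octets);
}

my $lenient = read_page_text("caf\xE9", 1);
my $bad     = () = $lenient =~ /\x{FFFD}/g;    # count the substitutions
print "replaced $bad invalid sequence(s)\n";
```

Nothing here is fatal at runtime, and the replacement characters show up visibly in the rendered page, which is the gentle nudge towards fixing the source files.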
> 
>>> I agree that they'd work better as directives.  I have a patch in the 
>>> ic-utf8 tree that handles verification/resolution of the 
>>> MV_HTTP_CHARSET variable (based on a suggestion on the list some time 
>>> ago).  We could presumably support both the variables and the 
>>> directives simultaneously for a while, prioritizing the directive if it 
>>> exists over the variable.
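
For illustration, the "directive wins over variable" rule with a verification step might be sketched as below. This is not the patch from the ic-utf8 tree; resolve_charset is a hypothetical name, and the two arguments stand for the directive value and the MV_HTTP_CHARSET variable value.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: prefer the directive value, fall back to the
# variable, and return Encode's canonical name (undef if unknown).
sub resolve_charset {
    my ($directive_value, $variable_value) = @_;
    for my $candidate ($directive_value, $variable_value) {
        next unless defined $candidate && length $candidate;
        my $encoding = Encode::find_encoding($candidate);
        return $encoding ? $encoding->name : undef;
    }
    return undef;    # neither source configured
}

my $charset = resolve_charset('latin-1', 'utf8');
print "using $charset\n";
```

In this sketch an unrecognised directive value yields undef instead of silently falling through to the variable, so a typo in the configuration is noticed.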
>> I don't think these variables have been around that long (they were 
>> introduced towards the end of 5.5, IIRC, which would be last year).  I 
>> think it may be best to just nip them in the bud now and make a note in 
>> the UPGRADE file.  This is one that I would want Mike's input on, though, 
>> as he may well feel differently.
> 
> I think Mike (reasonably) just wants backward compatibility to stop 
> breaking for non-UTF-8 stuff. I suspect Frederic, Stefan, Kevin, and very 
> few others who are actually using UTF-8 would be better suited to offer 
> opinions on changes in that setup.

Yes, these should be configuration settings, not variables.

Mike, do you use UTF-8 at all?

Regards
         Racke



-- 
LinuXia Systems => http://www.linuxia.de/
Expert Interchange Consulting and System Administration
ICDEVGROUP => http://www.icdevgroup.org/
Interchange Development Team



