[ic] Call for testers
Stefan Hornburg
racke at linuxia.de
Fri Mar 13 09:32:53 UTC 2009
Jon Jensen wrote:
> On Thu, 12 Mar 2009, Peter wrote:
>
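>>> my $decoded_data;
>>> for my $enc (qw/utf-8 cp-1252 latin-1/) {
>>>     # decode() takes the encoding first; FB_CROAK makes it die on bad input
>>>     $decoded_data = eval {
>>>         decode($enc, $data, Encode::FB_CROAK | Encode::LEAVE_SRC)
>>>     };
>>>     last if defined $decoded_data;
>>> }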
>>>
>>> We could abstract out the list of encodings so we could potentially
>>> return different values if needed depending on context, but I'd guess
>>> for the most part the pages/components/templates, etc. will be in a
>>> single encoding, so this would boil down to ("utf8",
>>> $fallback_encoding). (Hey, I'm optimistic... :-D)
>> I like that code above. We could have a directive, say PageEncoding
>> where you can actually list the primary encoding and each fallback to
>> try in order in your above loop. This could default to "utf8 latin-1".
>> Again, we completely ignore this directive if MV_UTF8 is not set.
>>
>>> My thought here was that if MV_UTF8 was set but the data failed to
>>> decode as utf8 and no "fallback" encoding was provided, then turn off
>>> the utf-8 flag, i.e., treat the data as raw octets 0-255.
>> We can have this be the final default mechanism if we fall off the end
>> of the PageEncoding list. To make this the only action other than utf8
>> one can simply set PageEncoding to utf8.
>
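
A minimal sketch of the fallback behaviour discussed above, assuming the proposed PageEncoding list is available as a plain Perl list. The helper name decode_with_fallbacks and the sample strings are hypothetical; only Encode's decode() with its FB_CROAK and LEAVE_SRC flags is real API. It is not taken from the ic-utf8 tree.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: try each encoding in order and return the first
# strict decode that succeeds, together with the encoding name used.
# If every attempt fails, return the octets unchanged (no UTF-8 flag).
sub decode_with_fallbacks {
    my ($octets, @encodings) = @_;
    for my $enc (@encodings) {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps decode() from modifying $octets in place.
        my $text = eval { decode($enc, $octets, FB_CROAK | LEAVE_SRC) };
        return ($text, $enc) if defined $text;
    }
    return ($octets, undef);    # fell off the end of the list: raw octets
}

# Canonical Encode names for what the thread calls "utf8 latin-1".
my @page_encoding = qw(UTF-8 iso-8859-1);

# Valid UTF-8 input: the first entry in the list is used.
my ($text, $used) = decode_with_fallbacks("caf\xC3\xA9", @page_encoding);

# Invalid as UTF-8: the iso-8859-1 fallback is used instead.
($text, $used) = decode_with_fallbacks("caf\xE9", @page_encoding);
```

Setting the list to just UTF-8 gives the "raw octets" behaviour as the only alternative, as described above. Note that iso-8859-1 accepts every byte, so any entry listed after it would never be tried.
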
> This is starting to sound awfully fancy.
>
> If you can make all this autodetection work well, and without a big
> performance hit, I suppose it's nice.
>
> But do we really want to encourage people to have various files in various
> encodings, never really sure what it is? Do we really want it to be
> possible for Interchange to autodetect that an HTML header file is in
> Windows-1252 and convert it to UTF-8, yet its header still says
> Windows-1252?
>
> I doubt that's the only case where this fancy autodetection stuff could
> bite us.
>
> I'd prefer to see nothing done to the bytes if UTF-8 support is disabled,
> and if it's enabled, see any invalid UTF-8 bytes converted to ?
> characters. That's simple, nonfatal at runtime, and yet gently encourages
> developers to get their sources in the proper UTF-8 encoding.
>
> I think that's still pretty lax, since we're not aborting on invalid
> characters as e.g. Postgres does. What do you all think?
>
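
For comparison, a rough sketch of the simpler policy Jon describes: leave the bytes alone when UTF-8 support is off, and replace invalid sequences with "?" when it is on. The helper name read_page_data and its $utf8_enabled flag (standing in for MV_UTF8) are hypothetical. The sketch relies on decode() substituting U+FFFD when no CHECK argument is given.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper illustrating the simpler policy:
#  - UTF-8 support off: hand the bytes back untouched.
#  - UTF-8 support on: decode, turning malformed sequences into "?".
sub read_page_data {
    my ($octets, $utf8_enabled) = @_;
    return $octets unless $utf8_enabled;

    # With no CHECK argument, decode() does not die on bad input; it
    # substitutes U+FFFD (REPLACEMENT CHARACTER) for malformed data.
    my $text = decode('UTF-8', $octets);

    # Map the substitution character to a plain "?". Note that this
    # also rewrites any U+FFFD that was legitimately in the input.
    $text =~ tr/\x{FFFD}/?/;
    return $text;
}

my $raw   = read_page_data("caf\xE9!", 0);    # bytes unchanged
my $fixed = read_page_data("caf\xE9!", 1);    # invalid byte becomes "?"
```

This never aborts at runtime, but a stray "?" in the output makes a mis-encoded source file easy to spot.
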
>>> I agree that they'd work better as directives. I have a patch in the
>>> ic-utf8 tree that handles verification/resolution of the
>>> MV_HTTP_CHARSET variable (based on a suggestion on the list some time
>>> ago). We could presumably support both the variables and the
>>> directives simultaneously for a while, prioritizing the directive if it
>>> exists over the variable.
>> I don't think these variables have been around that long (they were
>> introduced towards the end of 5.5, iirc, which would be last year). I
>> think it may be best to just nip them in the bud now and make a note in
>> the UPGRADE file. This is one that I would want Mike's input on, though,
>> as he may well feel differently.
>
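
If both the directive and the variable were supported for a transition period, the precedence and verification could look roughly like this. It is only a sketch: resolve_charset and its two arguments are hypothetical stand-ins for a configuration directive and the MV_HTTP_CHARSET variable, and this is not the patch from the ic-utf8 tree. Encode::find_encoding() and the encoding object's mime_name method are real Encode API.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical resolver: prefer the directive if it is set, otherwise
# use the legacy variable, then verify the chosen name against Encode.
sub resolve_charset {
    my ($directive, $variable) = @_;

    # First non-empty value wins, so the directive has priority.
    my ($name) = grep { defined($_) && length($_) } $directive, $variable;
    return undef unless defined $name;

    # find_encoding() returns undef for names Encode does not know.
    my $enc = Encode::find_encoding($name) or return undef;

    # mime_name gives the form suitable for an HTTP charset parameter.
    return $enc->mime_name;
}

my $charset = resolve_charset('utf-8', 'iso-8859-1');    # directive wins
my $legacy  = resolve_charset(undef,   'latin1');        # variable only
```

An unknown charset name resolves to undef here rather than falling through to the variable; whether that is the right behaviour is a design choice.
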
> I think Mike (reasonably) just wants backward compatibility to stop
> breaking for non-UTF-8 stuff. I suspect Frederic, Stefan, Kevin, and very
> few others who are actually using UTF-8 would be better suited to offer
> opinions on changes in that setup.
Yes, these should be configuration settings, not variables.
Mike, do you use UTF-8 at all?
Regards
Racke
--
LinuXia Systems => http://www.linuxia.de/
Expert Interchange Consulting and System Administration
ICDEVGROUP => http://www.icdevgroup.org/
Interchange Development Team