[ic] Call for testers
Peter
peter at pajamian.dhs.org
Sat Mar 14 02:58:08 UTC 2009
On 03/13/2009 07:22 PM, David Christensen wrote:
> On Mar 13, 2009, at 5:56 PM, Peter wrote:
>
>> On 03/13/2009 06:09 AM, David Christensen wrote:
>>> On Mar 13, 2009, at 4:29 AM, Peter wrote:
>>>
>>>>> and if it's enabled, see any invalid UTF-8 bytes converted to ?
>>>>> characters. That's simple, nonfatal at runtime, and yet gently
>>>>> encourages developers to get their sources in the proper UTF-8
>>>>> encoding.
>>>> I'm fine with that, and that was the original proposal. One problem,
>>>> though, is that while I thought that the Encode module could do that,
>>>> apparently it can only barf when decoding unicode input, so we would
>>>> have to find another way to find the invalid chars and change them
>>>> over.
>>>
>>> There is a third param to Encode::decode which specifies the behavior
>>> of invalid decodes, which by default is to die, but can warn, ignore
>>> or silently substitute IIRC. So I think this could be made to
>>> substitute the invalid character marker without much problem.
>> Yes, you're referring to the CHECK parameter which, unfortunately,
>> works for every encoding type *except* unicode.
>>
>> http://search.cpan.org/~dankogai/Encode-2.32/Encode.pm#Handling_Malformed_Data
>>
>> NOTE: Not all encoding support this feature
>>
>> Some encodings ignore CHECK argument. For example, Encode::Unicode
>> ignores CHECK and it always croaks on error.
>
>
> Here's a little test script I wrote which turns the unidentified
> characters into their \x counterparts (i.e., literal ASCII
> representation of the hex value). This uses FB_PERLQQ as a check
> param. You can see that it properly encodes/decodes valid utf-8
> codepoints, but anything which it is unable to handle it'll turn into
> the corresponding hex escape. To my mind this makes it more obvious
> that something odd is going on, and it keeps things from blowing up
> while still allowing the full range of properly encoded unicode
> characters.
>
> ----
>
> #!/usr/bin/env perl
>
> use strict;
> use warnings;
>
> use Encode;
>
> print "--------\n";
> print "Encode test script\n";
> print "Encode module version: $Encode::VERSION\n";
>
> my $cp1252_octets = "I can\222t believe it\222s not utf-8!";
> my $utf8_octets = "It doesn't make \xC2\xA2\xC2\xA2!";
>
> for my $octets ($cp1252_octets, $utf8_octets) {
>     print "--------\n";
>     printf "Length of original string: %d\n", length $octets;
>
>     my $string = eval {
>         decode('utf8', $octets, Encode::FB_PERLQQ);
>     };
>
>     warn "Died in utf-8 decode: $@\n" if $@;
>
>     print "Original string: $octets\n";
>     print "Decoded string (internal perl representation): $string\n";
>     printf "Encoded utf-8 output string: %s\n", encode_utf8($string);
>     printf "Length of decoded string: %d\n", length $string;
> }
>
> ----
> The output on my machine from said script:
>
> oy:~ machack$ perl utf8_test.pl
> --------
> Encode test script
> Encode module version: 2.23
> --------
> Length of original string: 31
> Original string: I can\222t believe it\222s not utf-8!
> Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
> Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
> Length of decoded string: 37
> --------
> Length of original string: 21
> Original string: It doesn't make ¢¢!
> Decoded string (internal perl representation): It doesn't make \242\242!
> Encoded utf-8 output string: It doesn't make ¢¢!
> Length of decoded string: 19
> oy:~ machack$
Interesting, does it still work if you add "use utf8"?
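Roughly what I have in mind is the sketch below: your two test strings again, but with "use utf8" switched on. As far as I understand it, that pragma only changes how literal non-ASCII characters in the source file are read, so strings written purely with \222 and \xC2 escapes ought to decode the same as before. Treat that as an assumption to check rather than a given. The decode_lenient helper is a name made up for illustration; it is not part of Interchange or Encode.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use utf8;    # declares that this source file itself is UTF-8 encoded

use Encode;

# Hypothetical helper (illustration only): decode octets as UTF-8 and
# turn any undecodable byte into a literal \xHH escape instead of dying.
sub decode_lenient {
    my ($octets) = @_;
    return decode('utf8', $octets, Encode::FB_PERLQQ);
}

# Same test strings as in the script above, written with escapes only,
# so no literal non-ASCII characters appear in the source.
my $cp1252_octets = "I can\222t believe it\222s not utf-8!";
my $utf8_octets   = "It doesn't make \xC2\xA2\xC2\xA2!";

for my $octets ($cp1252_octets, $utf8_octets) {
    my $string = decode_lenient($octets);

    # Print lengths only, to avoid "Wide character" warnings on STDOUT.
    printf "%d octets in, %d characters out\n",
        length $octets, length $string;
}
```

If the decoded lengths come out as 37 and 19 again, as in your output, then the pragma makes no difference for escaped octets like these.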
Peter