[ic] Call for testers

Peter peter at pajamian.dhs.org
Sat Mar 14 02:58:08 UTC 2009


On 03/13/2009 07:22 PM, David Christensen wrote:
> On Mar 13, 2009, at 5:56 PM, Peter wrote:
> 
>> On 03/13/2009 06:09 AM, David Christensen wrote:
>>> On Mar 13, 2009, at 4:29 AM, Peter wrote:
>>>
>>>>> and if it's enabled, see any invalid UTF-8 bytes converted to ?
>>>>> characters. That's simple, nonfatal at runtime, and yet gently encourages
>>>>> developers to get their sources in the proper UTF-8 encoding.
>>>> I'm fine with that, and that was the original proposal.  One problem,
>>>> though, is that while I thought that the Encode module could do that,
>>>> apparently it can only barf when decoding unicode input, so we would
>>>> have to find another way to find the invalid chars and change them
>>>> over.
>>>
>>> There is a third param to Encode::decode which specifies the behavior
>>> of invalid decodes, which by default is to die, but can warn, ignore
>>> or silently substitute IIRC.  So I think this could be made to
>>> substitute the invalid character marker without much problem.
>> Yes, you're referring to the CHECK parameter which, unfortunately, works
>> for every encoding type *except* unicode.
>>
>> http://search.cpan.org/~dankogai/Encode-2.32/Encode.pm#Handling_Malformed_Data
>>
>> NOTE: Not all encoding support this feature
>>
>>    Some encodings ignore CHECK argument. For example, Encode::Unicode
>> ignores CHECK and it always croaks on error.
> 
> 
> Here's a little test script I wrote which turns the unidentified  
> characters into their \x counterparts (i.e., literal ASCII  
> representation of the hex value).  This uses FB_PERLQQ as a check  
> param.  You can see that it properly encodes/decodes valid utf-8  
> codepoints, but anything it is unable to handle it'll turn into
> the corresponding hex escape.  In my mind this is more informative,
> since it shows that something odd is going on, and it prevents things
> from blowing up while still allowing the full range of properly
> encoded unicode characters.
> 
> ----
> 
>    #!/usr/bin/env perl
> 
>    use strict;
>    use warnings;
> 
>    use Encode;
> 
>    print "--------\n";
>    print "Encode test script\n";
>    print "Encode module version: $Encode::VERSION\n";
> 
>    my $cp1252_octets = "I can\222t believe it\222s not utf-8!";
>    my $utf8_octets   = "It doesn't make \xC2\xA2\xC2\xA2!";
> 
>    for my $octets ($cp1252_octets, $utf8_octets) {
>        print "--------\n";
>        printf "Length of original string: %d\n", length $octets;
> 
>        my $string = eval {
>            decode('utf8', $octets, Encode::FB_PERLQQ);
>        };
> 
>        warn "Died in utf-8 decode: $@\n" if $@;
> 
>        print "Original string: $octets\n";
>        print "Decoded string (internal perl representation): $string\n";
>        printf "Encoded utf-8 output string: %s\n", encode_utf8($string);
>        printf "Length of decoded string: %d\n", length $string;
>    }
> 
> ----
> The output on my machine from said script:
> 
>    oy:~ machack$ perl utf8_test.pl
>    --------
>    Encode test script
>    Encode module version: 2.23
>    --------
>    Length of original string: 31
>    Original string: I can\222t believe it\222s not utf-8!
>    Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
>    Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
>    Length of decoded string: 37
>    --------
>    Length of original string: 21
>    Original string: It doesn't make ¢¢!
>    Decoded string (internal perl representation): It doesn't make \242\242!
>    Encoded utf-8 output string: It doesn't make ¢¢!
>    Length of decoded string: 19
>    oy:~ machack$
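
For comparison with the FB_PERLQQ run quoted above, here is a minimal sketch of how the other CHECK modes might treat the same kind of cp1252 octets when decoding as UTF-8. It assumes the UTF-8 decoder honours CHECK (the quoted output suggests it does; as far as I know Encode::Unicode covers the UTF-16/UTF-32 family rather than UTF-8). The variable names and the tr/// step that maps U+FFFD to "?" are my own illustration, not anything in Interchange.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use Encode;

# cp1252 right single quote (0x92) is not valid on its own in UTF-8.
my $octets = "I can\222t believe it";

# FB_DEFAULT (CHECK = 0): each malformed byte becomes U+FFFD.
my $substituted = decode('UTF-8', $octets, Encode::FB_DEFAULT());

# Assumption: to get the "?" characters of the original proposal, map
# U+FFFD to "?". This also rewrites any genuine U+FFFD in the input.
(my $question_marks = $substituted) =~ tr/\x{FFFD}/?/;

# FB_PERLQQ: malformed bytes become literal \xHH escapes.
my $escaped = decode('UTF-8', $octets, Encode::FB_PERLQQ());

# FB_CROAK: decoding dies; LEAVE_SRC keeps $octets unmodified.
my $strict = eval {
    decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $croaked = defined $strict ? 0 : 1;

print "FB_DEFAULT + tr: $question_marks\n";
print "FB_PERLQQ:       $escaped\n";
print "FB_CROAK died:   ", ($croaked ? "yes" : "no"), "\n";
```

The tr/// variant is the simplest way I can see to get plain "?" output, at the cost of losing the byte value that FB_PERLQQ preserves.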

Interesting, does it still work if you add "use utf8"?
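
My own expectation, as a rough sketch rather than a tested answer on your setup: "use utf8" only changes how literal non-ASCII characters in the source file are parsed, and the test strings are written entirely with ASCII escapes, so the lengths should come out the same. The script below is a trimmed copy of the quoted one with the pragma added; the @decoded array is just my shorthand.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;    # affects literal non-ASCII text in this file, not \222 or \xC2 escapes

use Encode;

# Same octet strings as the quoted script, built from ASCII escapes only.
my $cp1252_octets = "I can\222t believe it\222s not utf-8!";
my $utf8_octets   = "It doesn't make \xC2\xA2\xC2\xA2!";

# FB_PERLQQ includes LEAVE_SRC, so the source strings are left untouched.
my @decoded = map { decode('utf8', $_, Encode::FB_PERLQQ()) }
    ($cp1252_octets, $utf8_octets);

printf "cp1252 input: %d octets -> %d characters\n",
    length $cp1252_octets, length $decoded[0];
printf "utf-8 input:  %d octets -> %d characters\n",
    length $utf8_octets, length $decoded[1];
```

A difference would only be expected if the script contained literal non-ASCII characters typed directly into the source, since those would then be character strings rather than octets before decode ever sees them.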

Peter



