[ic] Call for testers

Sat Mar 14 04:53:10 UTC 2009

On Mar 13, 2009, at 9:58 PM, Peter wrote:

> On 03/13/2009 07:22 PM, David Christensen wrote:
>> On Mar 13, 2009, at 5:56 PM, Peter wrote:
>>
>>> On 03/13/2009 06:09 AM, David Christensen wrote:
>>>> On Mar 13, 2009, at 4:29 AM, Peter wrote:
>>>>
>>>>>> and if it's enabled, see any invalid UTF-8 bytes converted to ?
>>>>>> characters. That's simple, nonfatal at runtime, and yet gently
>>>>>> encourages developers to get their sources in the proper UTF-8
>>>>>> encoding.
>>>>> I'm fine with that, and that was the original proposal.  One
>>>>> problem, though, is that while I thought that the Encode module
>>>>> could do that, apparently it can only barf when decoding unicode
>>>>> input, so we would have to find another way to find the invalid
>>>>> chars and change them over.
>>>>
>>>> There is a third param to Encode::decode which specifies the
>>>> behavior of invalid decodes, which by default is to die, but can
>>>> warn, ignore or silently substitute IIRC.  So I think this could be
>>>> made to substitute the invalid character marker without much problem.
>>> Yes, you're referring to the CHECK parameter which, unfortunately,
>>> works for every encoding type *except* unicode.
>>>
>>> http://search.cpan.org/~dankogai/Encode-2.32/Encode.pm#Handling_Malformed_Data
>>>
>>> NOTE: Not all encoding support this feature
>>>
>>>   Some encodings ignore CHECK argument. For example, Encode::Unicode
>>> ignores CHECK and it always croaks on error.
>>
>>
>> Here's a little test script I wrote which turns undecodable bytes
>> into their \x counterparts (i.e., a literal ASCII representation of
>> the hex value).  This uses FB_PERLQQ as the check param.  You can see
>> that it properly decodes/encodes valid utf-8 sequences, but anything
>> it is unable to handle gets turned into the corresponding hex
>> escape.  To my mind this is more informative, since it shows that
>> something odd is going on, yet it prevents things from blowing up
>> while still allowing the full range of properly encoded unicode
>> characters.
>>
>> ----
>>
>>   #!/usr/bin/env perl
>>
>>   use strict;
>>   use warnings;
>>
>>   use Encode;
>>
>>   print "--------\n";
>>   print "Encode test script\n";
>>   print "Encode module version: $Encode::VERSION\n";
>>
>>   my $cp1252_octets = "I can\222t believe it\222s not utf-8!";
>>   my $utf8_octets   = "It doesn't make \xC2\xA2\xC2\xA2!";
>>
>>   for my $octets ($cp1252_octets, $utf8_octets) {
>>       print "--------\n";
>>       printf "Length of original string: %d\n", length $octets;
>>
>>       my $string = eval {
>>           decode('utf8', $octets, Encode::FB_PERLQQ);
>>       };
>>
>>       warn "Died in utf-8 decode: $@\n" if $@;
>>
>>       print "Original string: $octets\n";
>>       print "Decoded string (internal perl representation): $string\n";
>>       printf "Encoded utf-8 output string: %s\n", encode_utf8($string);
>>       printf "Length of decoded string: %d\n", length $string;
>>   }
>>
>> ----
>> The output on my machine from said script:
>>
>>   oy:~ machack$ perl utf8_test.pl
>>   --------
>>   Encode test script
>>   Encode module version: 2.23
>>   --------
>>   Length of original string: 31
>>   Original string: I can\222t believe it\222s not utf-8!
>>   Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
>>   Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
>>   Length of decoded string: 37
>>   --------
>>   Length of original string: 21
>>   Original string: It doesn't make ¢¢!
>>   Decoded string (internal perl representation): It doesn't make \242\242!
>>   Encoded utf-8 output string: It doesn't make ¢¢!
>>   Length of decoded string: 19
>>   oy:~ machack$
>
> Interesting, does it still work if you add "use utf8"?
>
> Peter

With my local perl (5.10) it runs without issue with or without "use
utf8".  I'd be interested in seeing if 5.8.8 and 5.8.9 both show the
same behavior, as those three are the versions we're targeting with
the UTF-8 code.  Assuming some of you don't have them, I'll build/stage
perls 5.8.8 and 5.8.9, threaded/unthreaded, and run these same tests
against them.  This would probably be useful for me to do anyway for
compatibility checking any other UTF-8 issues.
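
For anyone running the comparison on 5.8.8 or 5.8.9, a minimal sketch
along the following lines might make results easier to compare across
builds.  It is not the script quoted above: the helper names
decode_both and show are made up for illustration, and it assumes that
the FB_DEFAULT fallback (which substitutes U+FFFD on decode) is the
closest stock equivalent to the "?" substitution originally proposed.
It prints the perl and Encode versions first so each report identifies
its build.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Encode;

# Hypothetical helper: decode the same octets with two fallback modes.
# FB_PERLQQ turns each malformed byte into a literal \xHH escape;
# FB_DEFAULT turns it into U+FFFD.  Each decode is wrapped in eval so
# a build that croaks instead yields undef rather than aborting.
sub decode_both {
    my ($octets) = @_;    # work on a copy of the caller's string
    my $perlqq = eval { decode('utf8', $octets, Encode::FB_PERLQQ) };
    my $subst  = eval { decode('utf8', $octets, Encode::FB_DEFAULT) };
    return ($perlqq, $subst);
}

# Hypothetical helper: render a decoded string using only printable
# ASCII, so the report looks the same on any terminal.  Characters
# outside that range are shown as \x{HEX}; the braces distinguish them
# from the brace-less \xHH escapes that FB_PERLQQ inserts.
sub show {
    my ($string) = @_;
    return '(decode died)' unless defined $string;
    $string =~ s/([^\x20-\x7E])/sprintf('\\x{%X}', ord $1)/ge;
    return $string;
}

# Identify the build under test.
printf "perl %vd, Encode %s\n", $^V, $Encode::VERSION;

# Same two inputs as the quoted test script.
my @cases = (
    [ 'cp1252 octets' => "I can\222t believe it\222s not utf-8!" ],
    [ 'utf-8 octets'  => "It doesn't make \xC2\xA2\xC2\xA2!" ],
);

for my $case (@cases) {
    my ($label, $octets) = @$case;
    my ($perlqq, $subst) = decode_both($octets);
    print "--------\n";
    print "$label\n";
    printf "  FB_PERLQQ:  %s\n", show($perlqq);
    printf "  FB_DEFAULT: %s\n", show($subst);
}
```

Running it once as-is and once with "use utf8" added near the top, on
each perl build, should show whether the fallback behavior differs
between versions.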

Regards,

David
--
David Christensen
End Point Corporation
david at endpoint.com
212-929-6923
http://www.endpoint.com/