[ic] Call for testers

Gert van der Spoel gert at 3edge.com
Sat Mar 14 06:39:13 UTC 2009


> -----Original Message-----
> From: interchange-users-bounces at icdevgroup.org [mailto:interchange-
> users-bounces at icdevgroup.org] On Behalf Of Peter
> Sent: Saturday, March 14, 2009 4:58 AM
> To: interchange-users at icdevgroup.org
> Subject: Re: [ic] Call for testers
> 
> On 03/13/2009 07:22 PM, David Christensen wrote:
> > On Mar 13, 2009, at 5:56 PM, Peter wrote:
> >
> >> On 03/13/2009 06:09 AM, David Christensen wrote:
> >>> On Mar 13, 2009, at 4:29 AM, Peter wrote:
> >>>
> >>>>> and if it's enabled, see any invalid UTF-8 bytes converted to ?
> >>>>> characters. That's simple, nonfatal at runtime, and yet gently
> >>>>> encourages
> >>>>> developers to get their sources in the proper UTF-8 encoding.
> >>>> I'm fine with that, and that was the original proposal.  One
> >>>> problem,
> >>>> though, is that while I thought that the Encode module could do
> >>>> that,
> >>>> apparently it can only barf when decoding unicode input, so we
> would
> >>>> have to find another way to find the invalid chars and change them
> >>>> over.
> >>>
> >>> There is a third param to Encode::decode which specifies the
> behavior
> >>> of invalid decodes, which by default is to die, but can warn,
> ignore
> >>> or silently substitute IIRC.  So I think this could be make to
> >>> substitute the invalid character marker without much problem.
> >> Yes, you're referring to the CHECK parameter which, unfortunately,
> >> works
> >> for every encoding type *except* unicode.
> >>
> >> http://search.cpan.org/~dankogai/Encode-2.32/Encode.pm#Handling_Malformed_Data
> >>
> >> NOTE: Not all encoding support this feature
> >>
> >>    Some encodings ignore CHECK argument. For example,
> Encode::Unicode
> >> ignores CHECK and it always croaks on error.
> >
> >
> > Here's a little test script I wrote which turns the unidentified
> > characters into their \x counterparts (i.e., literal ASCII
> > representation of the hex value).  This uses FB_PERLQQ as a check
> > param.  You can see that it properly encodes/decodes valid utf-8
> > codepoints, but anything which it is unable to handle it'll turn into
> > the corresponding hex escape.  This in my mind is more informative
> > that something odd is going on, but it prevents things from blowing
> > up, while still allowing the full range of unicode characters which
> > are properly encoded.
> >
> > ----
> >
> >    #!/usr/bin/env perl
> >
> >    use strict;
> >    use warnings;
> >
> >    use Encode;
> >
> >    print "--------\n";
> >    print "Encode test script\n";
> >    print "Encode module version: $Encode::VERSION\n";
> >
> >    my $cp1252_octets = "I can\222t believe it\222s not utf-8!";
> >    my $utf8_octets   = "It doesn't make \xC2\xA2\xC2\xA2!";
> >
> >    for my $octets ($cp1252_octets, $utf8_octets) {
> >        print "--------\n";
> >        printf "Length of original string: %d\n", length $octets;
> >
> >        my $string = eval {
> >            decode('utf8', $octets, Encode::FB_PERLQQ);
> >        };
> >
> >        warn "Died in utf-8 decode: $@\n" if $@;
> >
> >        print "Original string: $octets\n";
> >        print "Decoded string (internal perl representation): $string\n";
> >        printf "Encoded utf-8 output string: %s\n", encode_utf8($string);
> >        printf "Length of decoded string: %d\n", length $string;
> >    }
> >
> > ----
> > The output on my machine from said script:
> >
> >    oy:~ machack$ perl utf8_test.pl
> >    --------
> >    Encode test script
> >    Encode module version: 2.23
> >    --------
> >    Length of original string: 31
> >    Original string: I can\222t believe it\222s not utf-8!
> >    Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
> >    Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
> >    Length of decoded string: 37
> >    --------
> >    Length of original string: 21
> >    Original string: It doesn't make ¢¢!
> >    Decoded string (internal perl representation): It doesn't make \242\242!
> >    Encoded utf-8 output string: It doesn't make ¢¢!
> >    Length of decoded string: 19
> >    oy:~ machack$
> 
> Interesting, does it still work if you add "use utf8"?

I added 'use utf8' and ran it under 4 perl versions on my machine, a mix of threaded and non-threaded builds.

The oldest one (perl 5.8.4) does not print any Original string, but who knows, perhaps the decode of Encode 1.99 took different parameters, and it is outside the range of versions where we say it should work :) ... Apart from that, the output seems to be consistent across versions, and I do not think there were warnings or other issues, with or without adding 'use utf8;' as Peter suggested.
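For what it's worth, identical results with and without 'use utf8;' are what I would expect here: the pragma only changes how perl reads literal non-ASCII characters in the source file, and the test strings are written as pure-ASCII escapes (\222, \xC2\xA2). A minimal sketch of that, where describe() is just a hypothetical helper for this example and not part of the original script:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;    # only affects literal non-ASCII characters in this source file

use Encode qw(decode);

# Same ASCII-only escape as in the test script: two octets, 0xC2 0xA2.
my $utf8_octets = "\xC2\xA2";

# Hypothetical helper for this sketch: report length and the internal UTF8 flag.
sub describe {
    my ($s) = @_;
    return sprintf "length=%d utf8_flag=%s",
        length($s), (utf8::is_utf8($s) ? 'on' : 'off');
}

# The escapes still produce two separate octets, even under 'use utf8'.
print "octets:  ", describe($utf8_octets), "\n";

# Decoding turns the two octets into a single character (U+00A2).
my $decoded = decode('utf8', $utf8_octets, Encode::FB_PERLQQ());
print "decoded: ", describe($decoded), "\n";
```

If the script contained a literal ¢ typed directly into the file instead of the escapes, I would expect 'use utf8;' to make a difference, but that is not what is being tested here.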

1) This is perl, v5.8.4 built for i386-linux-thread-multi

~# ./t.pl
--------
Encode test script
Encode module version: 1.99_01
--------
Length of original string: 31
Original string:
Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
Length of decoded string: 37
--------
Length of original string: 21
Original string:
Decoded string (internal perl representation): It doesn't make ▒▒!
Encoded utf-8 output string: It doesn't make ¢¢!
Length of decoded string: 19


2) This is perl, v5.8.7 built for i686-linux

~# ./t.pl
--------
Encode test script
Encode module version: 2.10
--------
Length of original string: 31
Original string: I can▒t believe it▒s not utf-8!
Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
Length of decoded string: 37
--------
Length of original string: 21
Original string: It doesn't make ¢¢!
Decoded string (internal perl representation): It doesn't make ▒▒!
Encoded utf-8 output string: It doesn't make ¢¢!
Length of decoded string: 19


3) This is perl, v5.8.8 built for i486-linux-gnu-thread-multi

# ./t.pl
--------
Encode test script
Encode module version: 2.12
--------
Length of original string: 31
Original string: I can▒t believe it▒s not utf-8!
Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
Length of decoded string: 37
--------
Length of original string: 21
Original string: It doesn't make ¢¢!
Decoded string (internal perl representation): It doesn't make ▒▒!
Encoded utf-8 output string: It doesn't make ¢¢!
Length of decoded string: 19


4) This is perl, v5.10.0 built for i686-linux

~# ./t.pl
--------
Encode test script
Encode module version: 2.23
--------
Length of original string: 31
Original string: I can▒t believe it▒s not utf-8!
Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
Length of decoded string: 37
--------
Length of original string: 21
Original string: It doesn't make ¢¢!
Decoded string (internal perl representation): It doesn't make ▒▒!
Encoded utf-8 output string: It doesn't make ¢¢!
Length of decoded string: 19
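About the empty "Original string:" lines in the 5.8.4 run: the Encode documentation warns that decode() may modify its input scalar in place when a CHECK value is given, unless LEAVE_SRC is set. I have not verified that this is what Encode 1.99_01 does, so treat it as a guess. A defensive sketch that works on a copy and also sets LEAVE_SRC explicitly could look like the following; decode_lenient() is a name I made up for illustration, not something in Interchange:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use Encode qw(decode encode_utf8);

# Hypothetical helper: decode octets as UTF-8, turning invalid bytes into
# \xHH escapes, without ever touching the caller's scalar.
sub decode_lenient {
    my ($octets) = @_;    # list assignment copies the caller's value
    # LEAVE_SRC asks Encode not to consume the input; combined with the
    # copy above, the caller's string stays intact either way.
    return decode('utf8', $octets,
        Encode::FB_PERLQQ() | Encode::LEAVE_SRC());
}

my $cp1252_octets = "I can\222t believe it\222s not utf-8!";
my $utf8_octets   = "It doesn't make \xC2\xA2\xC2\xA2!";

for my $octets ($cp1252_octets, $utf8_octets) {
    my $string = decode_lenient($octets);
    print "--------\n";
    printf "Length of original string: %d\n", length $octets;
    printf "Encoded utf-8 output string: %s\n", encode_utf8($string);
    printf "Length of decoded string: %d\n", length $string;
}
```

On the Encode 2.x versions above this should print the same lengths as before (31/37 and 21/19), and the original octets remain available for printing after the decode.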





