[ic] Call for testers
Gert van der Spoel
gert at 3edge.com
Sat Mar 14 06:39:13 UTC 2009
> -----Original Message-----
> From: interchange-users-bounces at icdevgroup.org [mailto:interchange-users-bounces at icdevgroup.org] On Behalf Of Peter
> Sent: Saturday, March 14, 2009 4:58 AM
> To: interchange-users at icdevgroup.org
> Subject: Re: [ic] Call for testers
>
> On 03/13/2009 07:22 PM, David Christensen wrote:
> > On Mar 13, 2009, at 5:56 PM, Peter wrote:
> >
> >> On 03/13/2009 06:09 AM, David Christensen wrote:
> >>> On Mar 13, 2009, at 4:29 AM, Peter wrote:
> >>>
> >>>>> and if it's enabled, see any invalid UTF-8 bytes converted to ?
> >>>>> characters. That's simple, nonfatal at runtime, and yet gently
> >>>>> encourages developers to get their sources in the proper UTF-8
> >>>>> encoding.
> >>>> I'm fine with that, and that was the original proposal. One problem,
> >>>> though, is that while I thought that the Encode module could do that,
> >>>> apparently it can only barf when decoding unicode input, so we would
> >>>> have to find another way to find the invalid chars and change them
> >>>> over.
> >>>
> >>> There is a third param to Encode::decode which specifies the behavior
> >>> of invalid decodes, which by default is to die, but can warn, ignore
> >>> or silently substitute IIRC. So I think this could be made to
> >>> substitute the invalid character marker without much problem.
> >> Yes, you're referring to the CHECK parameter which, unfortunately, works
> >> for every encoding type *except* unicode.
> >>
> >> http://search.cpan.org/~dankogai/Encode-2.32/Encode.pm#Handling_Malformed_Data
> >>
> >> NOTE: Not all encoding support this feature
> >>
> >> Some encodings ignore CHECK argument. For example, Encode::Unicode
> >> ignores CHECK and it always croaks on error.
> >
> >
> > Here's a little test script I wrote which turns the unidentified
> > characters into their \x counterparts (i.e., literal ASCII
> > representation of the hex value). This uses FB_PERLQQ as a check
> > param. You can see that it properly encodes/decodes valid utf-8
> > codepoints, but anything which it is unable to handle it'll turn into
> > the corresponding hex escape. This in my mind is more informative
> > that something odd is going on, but it prevents things from blowing
> > up, while still allowing the full range of unicode characters which
> > are properly encoded.
> >
> > ----
> >
> > #!/usr/bin/env perl
> >
> > use strict;
> > use warnings;
> >
> > use Encode;
> >
> > print "--------\n";
> > print "Encode test script\n";
> > print "Encode module version: $Encode::VERSION\n";
> >
> > my $cp1252_octets = "I can\222t believe it\222s not utf-8!";
> > my $utf8_octets = "It doesn't make \xC2\xA2\xC2\xA2!";
> >
> > for my $octets ($cp1252_octets, $utf8_octets) {
> > print "--------\n";
> > printf "Length of original string: %d\n", length $octets;
> >
> > my $string = eval {
> > decode('utf8', $octets, Encode::FB_PERLQQ);
> > };
> >
> > warn "Died in utf-8 decode: $@\n" if $@;
> >
> > print "Original string: $octets\n";
> > print "Decoded string (internal perl representation): $string\n";
> > printf "Encoded utf-8 output string: %s\n", encode_utf8($string);
> > printf "Length of decoded string: %d\n", length $string;
> > }
> >
> > ----
> > The output on my machine from said script:
> >
> > oy:~ machack$ perl utf8_test.pl
> > --------
> > Encode test script
> > Encode module version: 2.23
> > --------
> > Length of original string: 31
> > Original string: I can\222t believe it\222s not utf-8!
> > Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
> > Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
> > Length of decoded string: 37
> > --------
> > Length of original string: 21
> > Original string: It doesn't make ¢¢!
> > Decoded string (internal perl representation): It doesn't make \242\242!
> > Encoded utf-8 output string: It doesn't make ¢¢!
> > Length of decoded string: 19
> > oy:~ machack$
>
> Interesting, does it still work if you add "use utf8"?
I added 'use utf8' and ran it under 4 perl versions on my machine, a mix of threaded and non-threaded builds.
The oldest one (perl 5.8.4) does not print any Original string, but who knows, perhaps the decode in Encode 1.99 took different parameters, and it is outside the range of versions where we say it should work :) Apart from that, the output does seem to be consistent, and I do not think there were warnings or other issues, with or without adding 'use utf8;' as Peter suggested.
1) This is perl, v5.8.4 built for i386-linux-thread-multi
~# ./t.pl
--------
Encode test script
Encode module version: 1.99_01
--------
Length of original string: 31
Original string:
Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
Length of decoded string: 37
--------
Length of original string: 21
Original string:
Decoded string (internal perl representation): It doesn't make ▒▒!
Encoded utf-8 output string: It doesn't make ¢¢!
Length of decoded string: 19
2) This is perl, v5.8.7 built for i686-linux
~# ./t.pl
--------
Encode test script
Encode module version: 2.10
--------
Length of original string: 31
Original string: I can▒t believe it▒s not utf-8!
Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
Length of decoded string: 37
--------
Length of original string: 21
Original string: It doesn't make ¢¢!
Decoded string (internal perl representation): It doesn't make ▒▒!
Encoded utf-8 output string: It doesn't make ¢¢!
Length of decoded string: 19
3) This is perl, v5.8.8 built for i486-linux-gnu-thread-multi
# ./t.pl
--------
Encode test script
Encode module version: 2.12
--------
Length of original string: 31
Original string: I can▒t believe it▒s not utf-8!
Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
Length of decoded string: 37
--------
Length of original string: 21
Original string: It doesn't make ¢¢!
Decoded string (internal perl representation): It doesn't make ▒▒!
Encoded utf-8 output string: It doesn't make ¢¢!
Length of decoded string: 19
4) This is perl, v5.10.0 built for i686-linux
~# ./t.pl
--------
Encode test script
Encode module version: 2.23
--------
Length of original string: 31
Original string: I can▒t believe it▒s not utf-8!
Decoded string (internal perl representation): I can\x92t believe it\x92s not utf-8!
Encoded utf-8 output string: I can\x92t believe it\x92s not utf-8!
Length of decoded string: 37
--------
Length of original string: 21
Original string: It doesn't make ¢¢!
Decoded string (internal perl representation): It doesn't make ▒▒!
Encoded utf-8 output string: It doesn't make ¢¢!
Length of decoded string: 19
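
For comparison with the "converted to ? characters" idea earlier in the thread, here is a minimal sketch that decodes the same kind of malformed input with two CHECK values: FB_PERLQQ, as in David's script, and FB_DEFAULT, which substitutes U+FFFD. The variable names are made up for illustration. Each call gets its own copy of the octets because decode may modify its source in place for some CHECK values; that might be why the 5.8.4 run prints an empty Original string, but that is a guess and not something verified here.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use Encode qw(decode encode_utf8);

# \x92 is the cp1252 right single quote; on its own it is not valid UTF-8.
my $octets = "I can\x92t";

# Work on copies so the original octets stay intact whatever CHECK does.
my $copy_perlqq  = $octets;
my $copy_default = $octets;

# FB_PERLQQ: each malformed byte becomes a literal \xHH escape sequence.
my $perlqq = decode('utf8', $copy_perlqq, Encode::FB_PERLQQ);

# FB_DEFAULT: each malformed byte becomes U+FFFD (REPLACEMENT CHARACTER).
my $default = decode('utf8', $copy_default, Encode::FB_DEFAULT);

printf "FB_PERLQQ:  %s (length %d)\n", $perlqq, length $perlqq;
printf "FB_DEFAULT: %s (length %d)\n", encode_utf8($default), length $default;
```

The FB_PERLQQ result grows by three characters per bad byte, which matches the 31 -> 37 length change in the runs above. Note that 'use utf8' only changes how literals in the script source are interpreted; since the test strings are written with escapes, it is not surprising that adding it made no difference.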