[ic] Filters with UTF-8 body

Thu Mar 12 21:45:27 UTC 2009

On 03/12/2009 02:26 PM, David Christensen wrote:
> On Mar 12, 2009, at 4:10 PM, Peter wrote:
> 
>> On 03/12/2009 12:28 PM, David Christensen wrote:
>>> I have a commit queued to fix all instances of explicit ranges,
>>> however, there was something I found which I'm not sure is a wart or
>>> not.  From dist/lib/UI/Primitive.pm:
>>>
>>> 45:$DECODE_CHARS = qq{&[<"\000-\037\177-\377};
>> Provided we think it may still be needed, I think the best way to deal
>> with this one is:
>> $DECODE_CHARS = qq{&[<"[[:^print:]]};
> 
> Does [[:print:]] include only traditional ASCII, or would the unicode  
> code points fall in this range as well?  I'm under the impression that  
> extended Unicode characters would fall into the printable class, and  
> hence not be decoded, as implied by the character class, but without  
> knowing the calling context of any code which uses these arguments, I  
> don't know how to verify this.  Also, this threw me off because it was  
> a literal string and not a regex (at least directly).

You're quite correct, it does include non-ASCII characters once the
string is treated as UTF-8 characters rather than bytes:

peter@peter-desktop:~/interchange-utf8$ perl -Mutf8 -le 'print $1 if "fooäbar" =~ /([[:print:]]*)/'
fooäbar

OTOH:

peter@peter-desktop:~/interchange-utf8$ perl -le 'print $1 if "fooäbar" =~ /([[:print:]]*)/'
foo
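
The difference between those two runs comes down to whether Perl sees
decoded characters or raw UTF-8 bytes. Here is a minimal sketch that
reproduces it without depending on the terminal or source encoding. The
variable names are only illustrative, and it assumes a stock Perl with
no locale and no unicode_strings feature in effect:

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 encoded octets for "fooäbar" (ä is the two bytes C3 A4).
my $octets = "foo\xC3\xA4bar";

# The same text decoded into Perl characters (ä becomes U+00E4).
my $chars = decode('UTF-8', $octets);

# Same pattern, applied once to bytes and once to characters.
my ($from_octets) = $octets =~ /([[:print:]]*)/;
my ($from_chars)  = $chars  =~ /([[:print:]]*)/;

printf "octets: matched %d of %d\n", length $from_octets, length $octets;
printf "chars:  matched %d of %d\n", length $from_chars,  length $chars;
```

On the byte string the match should stop after "foo", because neither
byte of the encoded ä counts as printable; on the decoded string the
whole word should match.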

So this may actually be desirable, since we could assume that if utf8 is
enabled (via MV_UTF8) then it's OK to output those chars directly to
pages.  OTOH, if we want to make sure they get escaped in any case, then:

$DECODE_CHARS = qq{&[<"[^\040-\176]};

But then it may not work when interpolated into a regex character class
([...]); I don't know.  The same goes for the [[:^print:]] version, for
that matter.
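
That doubt can be checked with a small sketch. Nothing in this thread
shows how $DECODE_CHARS is consumed, so the "wrap it in [...]" consumer
here is purely hypothetical, as are the variable names:

```perl
use strict;
use warnings;

# The set from above; qq turns \040 and \176 into a literal space and "~".
my $set = qq{&[<"[^\040-\176]};

# Hypothetical consumer: wrap the set in a bracketed character class.
my $wrapped = qr/[${set}]/;

# The inner "]" closes the class early, so the pattern really means
# "one of & [ < \" ^ or space through ~, followed by a literal ]".
my $hits_plain_text = ("a]"   =~ $wrapped) ? 1 : 0;
my $hits_control    = ("\001" =~ $wrapped) ? 1 : 0;

# One way to get the intended meaning: keep the two classes separate
# and join them with alternation instead of nesting them.
my $intended = qr/[&\[<"]|[^\040-\176]/;
my $escapes_control = ("\001" =~ $intended) ? 1 : 0;
my $escapes_letter  = ("a"    =~ $intended) ? 1 : 0;

print "wrapped:  'a]' => $hits_plain_text, control => $hits_control\n";
print "intended: control => $escapes_control, 'a' => $escapes_letter\n";
```

If the string really were dropped into a bracketed class, the nested
form would match ordinary text followed by "]" and miss control
characters, which is the opposite of what we want. The original
range-only string has no inner brackets, so it does not have that
problem.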

Maybe it's a good thing that it isn't used anywhere. (You did grep the
entire source, right?)

Peter