[ic] URL encoding bug in Vend::Interpolate::esc

Rok Ruzic rok.ruzic at informa.si
Sun May 16 06:42:27 UTC 2010


I have run across a bug in our URL encoding code. I actually found it
because it escaped more than it had to, but looking at it I saw that
it's broken both ways.

It uses this match regex for substitution: \W

This was probably OK 15 years ago, but now that we are trying to
support UTF-8, \W no longer matches wide word characters (letters,
digits), so those get let through unescaped.

So I suggest we narrow this to the old ASCII [^a-zA-Z0-9_]
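
To illustrate the difference between the two patterns, here is a minimal sketch. It assumes the string has already been decoded into Perl characters; the sample string and variable names are made up for the demonstration and are not taken from Vend::Interpolate.

```perl
use strict;
use warnings;

# Character string: 'a', U+0459 (the Cyrillic letter used later in this
# post), a space, 'b'. Written with an escape so the source stays ASCII.
my $decoded = "a\x{459} b";

# On a character string \W follows Unicode rules, so U+0459 counts as a
# word character and only the space is matched.
my @by_W = $decoded =~ /(\W)/g;

# The explicit ASCII class matches both U+0459 and the space.
my @by_ascii = $decoded =~ /([^a-zA-Z0-9_])/g;

printf "\\W matched %d, ASCII class matched %d\n",
    scalar(@by_W), scalar(@by_ascii);
```

With the ASCII class the wide character at least reaches the substitution, although how it should then be turned into hex is a separate question.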

While we're at it, we might also *not* escape the characters
Berners-Lee put in his alphanum2 character set, i.e. [\-_.+] and if you
guys agree, I would also add the stuff he put in his "safe" class, i.e.
[\$\@\&]. See http://www.w3.org/Addressing/URL/url-spec.txt for details.

For the conversion to '%'-escaped two-digit hex codes it turns out
that our old substitution might work, at least according to a
few examples I have found on the web (w3schools most notably). I find
it a bit suspicious, however, because it encodes the characters I have
tested with into something weird, e.g. љ is encoded into %d1%99. That
matches one of the two w3schools examples, and the fact that the two
disagree doesn't inspire any confidence.

The character in question gets encoded into %d1%99, whereas ord()
returns 0xd1 for it, and unpack("S") will return 0x99d1.

Now throwing Perl's Encode into the mix, I get back - now it gets weird
- 1113 from ord and 0x459 from unpack :-/

hexdump also says the char is 0xd199. This smells like endianness. How
do we deal with that? UTF text should be considered binary data,
meaning we also got endianness problems :-x

Since I really don't know how to go about this, I'm suggesting that
for starters we partially correct our code and at least escape stuff
that shouldn't get into URLs, and we'll figure out later how to
*correctly* escape it.

Hence the appended patch (consider it merely a proposal).
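
For readers without the attachment, here is a rough sketch of the general idea. It is not the attached patch and not the code in Vend::Interpolate. The name esc_sketch is hypothetical, the input is assumed to be a decoded character string, and the explicit encode to UTF-8 before escaping is an extra assumption that the proposal above does not include. Only the alphanum2 extras are left unescaped, since the "safe" class is still up for discussion.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper for illustration only.
sub esc_sketch {
    my ($text) = @_;

    # Turn characters into UTF-8 octets first, so that every element we
    # escape fits into two hex digits. Assumes $text is a decoded
    # character string; raw bytes would get encoded a second time.
    my $octets = encode("UTF-8", $text);

    # Escape every octet outside ASCII alphanumerics and - _ . +
    $octets =~ s/([^a-zA-Z0-9\-_.+])/sprintf("%%%02x", ord($1))/ge;

    return $octets;
}

print esc_sketch("a b\x{459}-c"), "\n";
```

Under these assumptions the space comes out as %20, U+0459 as %d1%99, and the hyphen is left alone.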

Regards,
Rox


-- 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: utf8escape.diff
Type: text/x-patch
Size: 374 bytes
Desc: not available
URL: <http://www.icdevgroup.org/pipermail/interchange-users/attachments/20100516/f628c897/attachment.bin>