Unicode, was: [ic] Use one table for product descriptions

Tue Jul 12 02:57:38 EDT 2005

Hello everyone.

I'm new.  

On 2005-07-11, Ethan Rowe wrote:
...
> I forgot to mention:
> Encoding is not fun.
> 
> If you need really flexible multi-lingual support (as in, stuff outside 
> Interchange's most familiar world of latin-1), you'll probably want to 
> store your descriptions as unicode (in which case UTF-8 is your choice, 
> if on MySQL or PostgreSQL).  This means you have to deal with the proper 
> encoding settings for any pages affected by this.  Moving from a 
> single-byte encoding to a multi-byte encoding is not particularly 
> pleasant.  You need to know exactly how the data pulled from the 
> database is encoded before you just plop that data willy-nilly into your 
> page output.  If you have a bunch of latin1 data in the output, and then 
> you try to put some UTF-8 into it, you're gonna get a bunch of junk 
> characters.  In my case, I stored everything as UTF-8 in PostgreSQL, use 
> an AutoLoad to set the proper Content-Type header to put all page 
> encodings as UTF-8, and then made certain that any data from the 
> database would be encoded properly.  For PostgreSQL, all data pulled 
> from the database comes as UTF-8 octets, as in each character in a 
> string corresponds to a byte (which since Perl 5.6 you cannot 
> necessarily count on ordinarily).  In other words, the data returned to 
> Perl was essentially the raw UTF-8 data; this is a perfectly fine form 
> to place directly into the page (via stuff like [sql-param description], 
> etc.).  But it means that I cannot easily use regular expressions 
> against that data, because each character in the data represents a byte 
> rather than an actual character.
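
To make the octets-versus-characters point above concrete, here
is a minimal sketch.  It is in Python rather than Interchange's
Perl, purely for illustration, and the sample string is made up;
it is not taken from Interchange or from Ethan's setup.

```python
import re

# Raw UTF-8 octets, as a database driver might hand them back
# (assumed sample data, not from any real catalog).
raw = "caf\u00e9".encode("utf-8")

# As bytes, the accented letter occupies two octets, so both the
# length and regex matching operate on bytes, not characters.
byte_len = len(raw)
byte_match = re.fullmatch(rb"caf.", raw)   # "." consumes one byte only

# After decoding, each element of the string is a real character.
text = raw.decode("utf-8")
char_len = len(text)
char_match = re.fullmatch(r"caf.", text)   # "." consumes one character
```

The same pattern fails on the raw octets and succeeds on the
decoded text, which is why the data has to be decoded before
regular expressions can be trusted on it.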

I've downloaded the latest stable release of Interchange and
tried to make it Unicode-aware on the inside.  I had to add
the ":utf8" layer when opening files and add explicit UTF-8
decoding for data coming from external libraries (e.g.
databases).  I also had to change many regular expressions.
Mostly I got it working.
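
The general idea, decode once at every boundary and work with
characters inside, can be sketched like this.  Again this is
Python for illustration only, not the actual changes I made to
the Perl source; the function names and sample text are
hypothetical.

```python
import os
import tempfile

def read_template(path):
    # Decode at the file boundary; similar in spirit to opening
    # a file with Perl's ":utf8" layer.
    with open(path, encoding="utf-8") as fh:
        return fh.read()

def from_external(octets):
    # Data from an external library (e.g. a database driver) is
    # assumed to arrive as UTF-8 bytes and is decoded exactly once.
    return octets.decode("utf-8")

# Small demonstration with a temporary "page template".
sample = "Gr\u00fc\u00dfe"
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
page = read_template(tmp.name)
os.remove(tmp.name)

field = from_external(sample.encode("utf-8"))
```

Both values end up as character strings, so they can be mixed in
the same page output without producing junk characters.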

It now can display Unicode chars, coming from the database
(DBM at least, but the principles are the same for all other
databases) and coming from the page templates.  It now can
accept Unicode on form input.

I have not been able to fix a problem on one of the Admin
interface forms, the one that includes a file upload control
and therefore has an unusual ENCTYPE ("multipart/form-data").
It breaks at the stage where the IC server parses its input
(from the link program), for a reason I have not yet found.

The modified code requires Perl 5.8.

This was an experimental attempt.  To become a good patch it
needs to be finished, made more complete, and given more
comments.  I probably won't get to finish it soon.  (I
probably won't use IC for the project that I'm about to
start.)  But if anyone needs Unicode mode, I'm ready to share
this for somebody to finish.

BTW, I had to write a web application myself (from scratch,
not using any platform): http://acis.openlib.org/ .  So it
was a real pleasure to see how Interchange is made on the
inside.  Many things are inspiring and made very well.
Thank you, guys.

What is really annoying is the wildly dancing indentation of
the source code (probably because of tab characters in it).

Cheers,

Ivan 
http://ahinea.com/