From chen at lilux.co.il  Tue Nov 18 20:02:24 2003
From: chen at lilux.co.il (Chen Naor)
Date: Tue, 18 Nov 2003 22:02:24 +0200
Subject: [interchange-i18n] Encoding decoding of non latin-1 language - Hebrew - Solved
References: <00a601c32aad$8b4130c0$3700a8c0@lilux.co.il>
Message-ID: <006801c3ae0e$e443f560$0b32a8c0@lilux.co.il>

Hello,

After a lot of work I finally solved the problem of decoding and encoding
high-character languages in IC.

Actually the problem is not in Interchange itself but in the Perl HTML-Parser
distribution. To be more accurate, it is in HTML::Entities.

I had to patch the lib in order to solve the problem.

I checked it with Hebrew and it works fine. I do not know about other
languages, but it seems OK (I checked a bit of Japanese).

Note: the patch disables some of the HTML::Entities functionality.

If anyone needs the patch, please contact me here.

The patch was done on:
HTML-Parser-3.33
Perl 5.6.1
RH 7.3
IC 4.9.8 & IC 4.9.9

Best regards
Chen Naor
Lilux Systems
http://www.lilux.co.il/
mail: chen at lilux.co.il

From racke at linuxia.de  Tue Nov 18 20:21:01 2003
From: racke at linuxia.de (Stefan Hornburg)
Date: Tue, 18 Nov 2003 21:21:01 +0100
Subject: [interchange-i18n] Encoding decoding of non latin-1 language - Hebrew - Solved
In-Reply-To: <006801c3ae0e$e443f560$0b32a8c0@lilux.co.il>
References: <00a601c32aad$8b4130c0$3700a8c0@lilux.co.il> <006801c3ae0e$e443f560$0b32a8c0@lilux.co.il>
Message-ID: <20031118212101.4896f87a.racke@linuxia.de>

On Tue, 18 Nov 2003 22:02:24 +0200
"Chen Naor" wrote:

> Hello,
>
> After a lot of work I finally solved the problem of decoding and encoding
> high-character languages in IC.
>
> Actually the problem is not in Interchange itself but in the Perl HTML-Parser
> distribution. To be more accurate, it is in HTML::Entities.
>
> I had to patch the lib in order to solve the problem.
>
> I checked it with Hebrew and it works fine. I do not know about other
> languages, but it seems OK (I checked a bit of Japanese).
>
> Note: the patch disables some of the HTML::Entities functionality.
>
> If anyone needs the patch, please contact me here.

Would you mind sending me this patch?

Ciao
	Racke

-- 
LinuXia Systems => http://www.linuxia.de/
Expert Interchange Consulting and System Administration
ICDEVGROUP => http://www.icdevgroup.org/
Interchange Development Team

From murahashi at ayayu.com  Wed Nov 19 00:10:56 2003
From: murahashi at ayayu.com (murahashi at ayayu.com)
Date: Wed, 19 Nov 2003 09:10:56 +0900
Subject: [interchange-i18n] Encoding decoding of non latin-1 language -Hebrew - Solved
In-Reply-To: <20031118212101.4896f87a.racke@linuxia.de>
References: <20031118212101.4896f87a.racke@linuxia.de>
Message-ID:

I would like to test your patch with Japanese too.
Could you please send me your patch?
I'll report the results back to you.

Thanks and regards,
--------------------------------------
Shozo Murahashi
Kidanet Business ltd.
murahashi at ayayu.com
--------------------------------------

> On Tue, 18 Nov 2003 22:02:24 +0200
> "Chen Naor" wrote:
>
> > Hello,
> >
> > After a lot of work I finally solved the problem of decoding and encoding
> > high-character languages in IC.
> >
> > Actually the problem is not in Interchange itself but in the Perl HTML-Parser
> > distribution. To be more accurate, it is in HTML::Entities.
> >
> > I had to patch the lib in order to solve the problem.
> >
> > I checked it with Hebrew and it works fine. I do not know about other
> > languages, but it seems OK (I checked a bit of Japanese).
> >
> > Note: the patch disables some of the HTML::Entities functionality.
> >
> > If anyone needs the patch, please contact me here.
>
> Would you mind sending me this patch?
>
> Ciao
> 	Racke
>
> --
> LinuXia Systems => http://www.linuxia.de/
> Expert Interchange Consulting and System Administration
> ICDEVGROUP => http://www.icdevgroup.org/
> Interchange Development Team
>
> _______________________________________________
> interchange-i18n mailing list
> interchange-i18n at icdevgroup.org
> http://www.icdevgroup.org/mailman/listinfo/interchange-i18n

From chen at lilux.co.il  Wed Nov 19 01:19:24 2003
From: chen at lilux.co.il (Chen Naor)
Date: Wed, 19 Nov 2003 03:19:24 +0200
Subject: [interchange-i18n] Encoding decoding of non latin-1 language -Hebrew - Solved
Message-ID: <00b901c3ae3b$2d0e6f10$0b32a8c0@lilux.co.il>

> > Note: the patch disables some of the HTML::Entities functionality.
> >
> > If anyone needs the patch, please contact me here.
>
> Would you mind sending me this patch?
>
> Ciao
> Racke

Hi,

I will send the package to you directly, since it is a bit too big for the
mailing list.

The change was made only in:
HTML-Parser-3.33/lib/HTML/Entities.pm

The original file is:
HTML-Parser-3.33/lib/HTML/Entities.pm.org

Look at the comments (rems) on lines 423 - 438.

Here is the diff:

[chen at shop2 HTML]# diff Entities.pm Entities.pm.org
423,426c423,424
< # changed by Guy Naor to enable Hebrew characters
<
< #    if (defined $_[1] and length $_[1]) {
< #        unless (exists $subst{$_[1]}) {
---
>     if (defined $_[1] and length $_[1]) {
>         unless (exists $subst{$_[1]}) {
428,438c426,436
< #            my $code = "sub {\$_[0] =~ s/([$_[1]])/\$char2entity{\$1} || num_entity(\$1)/ge; }";
< #            $subst{$_[1]} = eval $code;
< #            die( $@ . " while trying to turn range: \"$_[1]\"\n "
< #              . "into code: $code\n "
< #            ) if $@;
< #        }
< #        &{$subst{$_[1]}}($$ref);
< #    } else {
< #        # Encode control chars, high bit chars and '<', '&', '>', '"'
<         $$ref =~ s/([^\n\r\t !\#\$%\'-;=?-~\xe0-\xfb])/$char2entity{$1} || num_entity($1)/ge;
< #    }
---
>             my $code = "sub {\$_[0] =~ s/([$_[1]])/\$char2entity{\$1} || num_entity(\$1)/ge; }";
>             $subst{$_[1]} = eval $code;
>             die( $@ . " while trying to turn range: \"$_[1]\"\n "
>               . "into code: $code\n "
>             ) if $@;
>         }
>         &{$subst{$_[1]}}($$ref);
>     } else {
>         # Encode control chars, high bit chars and '<', '&', '>', '"'
>         $$ref =~ s/([^\n\r\t !\#\$%\'-;=?-~])/$char2entity{$1} || num_entity($1)/ge;
>     }

Good luck & good night (3:00 AM)

Chen

From info at lilux.co.il  Wed Nov 19 12:07:14 2003
From: info at lilux.co.il (Info Lilux Systems)
Date: Wed, 19 Nov 2003 14:07:14 +0200
Subject: [interchange-i18n] Re: HTML-Parser for Hebrew IC
References: <00de01c3ae3c$41ae2b30$0b32a8c0@lilux.co.il>
Message-ID: <002801c3ae95$acb26b40$367b18ac@lilux.co.il>

Hi Shozo,

The \xe0-\xfb range was put in since Hebrew is not used in this range, and I
wanted to leave the lib as unchanged as possible.

Chen

> Hi
>
> I tried applying your patch on IC4.8.9. Basically it works nicely, but there
> is still a problem.
> In my case, using EUC, I replaced \xe0-\xfb with \177-\377.
> I think this comes from the difference between Hebrew and JP. Right?
>
> The remaining problem may be related to the encoding of "<" and ">"?
> I'm not sure.
>
> Did you try these on Entities.pm v2.0?
>
>
> Anyway, nice work.
> Thank you.
> Shozo Murahashi
>

From chen at lilux.co.il  Wed Nov 19 21:11:51 2003
From: chen at lilux.co.il (Chen Naor)
Date: Wed, 19 Nov 2003 23:11:51 +0200
Subject: [interchange-i18n] More encoding/decoding issue - XLS upload
Message-ID: <011501c3aee1$c0782600$367b18ac@lilux.co.il>

Hi,

After solving the high-characters problem with the HTML::Entities patch,
there is still one problem.

When uploading an XLS spreadsheet, the high chars (Hebrew in my case) get
garbled.

Using the single-table option (text file) works OK.

Any suggestions on where to look for a solution?

Which libs are doing the work in the background?

Regards,
Chen Naor
Lilux Systems
http://www.lilux.co.il/
mail: chen at lilux.co.il