[ic] Prevent search from matching on html

Daniel Davenport DDavenport at newagedigital.com
Thu Oct 26 20:01:34 EDT 2006


 

> -----Original Message-----
> From: interchange-users-bounces at icdevgroup.org 
> [mailto:interchange-users-bounces at icdevgroup.org] On Behalf 
> Of Kevin Walsh
> Sent: 2006 October 26 -- Thursday 6:30 PM
> To: interchange-users at icdevgroup.org
> Subject: Re: [ic] Prevent search from matching on html
> 
> Josh Lavin <josh at myprivacy.ca> wrote:
> > I am finding that when we use HTML in our product descriptions, the
> > search results will include products where an HTML tag matched the
> > search query.
> > 
> > Simple example: if my description contains "<h2>Features</h2>" and
> > someone searches for 'h2', then that product will be returned in the
> > results.
> > 
> > I would like to avoid this, and figured I needed a custom SearchOp,
> > but I'm having no luck with this one:
> > 
> > CodeDef not_tags SearchOp
> > CodeDef not_tags Routine <<EOR
> > sub {
> >          my ($self, $i, $pat) = @_;
> > 
> >          return sub {
> >              my $string = shift;
> >              $string =~ s:<[/\w].*?\s?/?>::gi;
> >              return $string;
> >          };
> > }
> > EOR
> > 
> > The idea is to remove any HTML tags before searching. Any ideas?
> > 
> You are always returning a true value.  A SearchOp's coderef needs
> to return true if a match is found or false if no match is found.
> 
> Try something like this instead:
> 
>     CodeDef not_tags SearchOp
>     CodeDef not_tags Routine <<EOR
>     sub {
>         my ($self, $i, $pat) = @_;
>         $pat = qr/$pat/i;
> 
>         return sub {
>             my $string = shift;
> 
>             $string =~ s:<[/\w].*?>::gi;
>             return $string =~ $pat;
>         };
>     }
>     EOR

Also make sure you don't have any <text with angles around it, like this
for instance> in the text being searched, because that regexp will strip
it and the search will ignore it. The same goes for just about any other
regexp that simply matches whatever's between < and >.
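
To make that concrete, here is a minimal standalone sketch (plain Perl, outside Interchange) using the same style of substitution as the SearchOp above. The sample strings and the strip_tags name are made up for illustration:

```perl
use strict;
use warnings;

# Same style of substitution as the SearchOp above: drop anything that
# looks like <...> before matching.
sub strip_tags {
    my $string = shift;
    $string =~ s:<[/\w].*?>::g;
    return $string;
}

# Hypothetical sample descriptions, for illustration only.
my $html  = 'Sturdy <h2>Features</h2> list';
my $plain = 'Fits <text with angles around it> fine';

for my $desc ($html, $plain) {
    my $stripped = strip_tags($desc);
    my $hit = $stripped =~ /angles/i ? 'match' : 'no match';
    print "[$stripped] search for 'angles': $hit\n";
}
```

The second string loses its whole bracketed phrase, so a search for a word inside it finds nothing.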

If the field is known to be HTML, that shouldn't be a problem, since any
literal angle brackets should already be written as &lt; and &gt;. But if
you're letting other people edit descriptions, it's something to watch
out for.
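
If you do want to guard against that, one option is to escape what editors type before it is stored. This is only a sketch in plain Perl: escape_angles and strip_tags are hypothetical helper names, not Interchange functions, and where you would hook this in depends on your setup.

```perl
use strict;
use warnings;

# Hypothetical helper: escape markup characters in text typed by an editor.
# The ampersand is handled first so the entities added below are not
# escaped a second time.
sub escape_angles {
    my $text = shift;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Same style of tag-stripping substitution as the SearchOp in this thread.
sub strip_tags {
    my $string = shift;
    $string =~ s:<[/\w].*?>::g;
    return $string;
}

my $typed      = 'Fits <text with angles around it> fine';
my $searchable = strip_tags(escape_angles($typed));

print "$searchable\n";
print "found\n" if $searchable =~ /angles/i;
```

Once escaped, the text has no literal < left for the regexp to catch, so the phrase stays searchable. The entity names themselves (lt, gt, amp) are still in the field, though, so a search for one of those could match them.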


