[ic] how do I stop Google from trying to index scan pages?

Curt Hauge ic_support at mnwebdesign.com
Fri Sep 19 00:11:52 UTC 2008


> -----Original Message-----
> From: interchange-users-bounces at icdevgroup.org
> [mailto:interchange-users-bounces at icdevgroup.org]On Behalf Of Rene
> Hertell
> Sent: Thursday, September 18, 2008 5:05 AM
> To: interchange-users at icdevgroup.org
> Subject: Re: [ic] how do I stop Google from trying to index scan pages?
>
>
> curthauge at mnwebdesign.com wrote:
> > Hi list,
> >
> > IC 5.4.2	Perl 5.8.8	Old Construct cat on Centos 4.7
> >
> > My client has over 15,000 products, but Google only ranks about
> > 400 in their index. The last 4 pages in the Google index are scans.
> > It SEEMS like after hitting 4 scan pages, Google stops and turns
> > away (probably because the page content appears to be similar). I
> > make changes to pages and robots.txt, then wait to see the new
> > Google ranking in a few days/a week. I have a lot of respect for
> > Google and always spell it with a capital 'G', but I still have
> > this problem. ;-)
> >
> > I've been in the archives, but I can't get the precise info I need
> > to change my robots.txt to stop these pages from being indexed.
> > This is an example of the pages in question:
> >
> > http://www.my-domain.com/cgi-bin/storeabc/scan/fi=products/st=db/sf=category/se=DVD%20Video/ml=16/tf=description.html
> >
> > I have RobotUA, RobotHost, and RobotIP settings in catalog.cfg. I have a
> > robots.txt file in my httpdocs directory, with entries like this (among
> > others):
> >
> > User-agent: Googlebot
> > Disallow: /*?
> >
> > User-agent: *
> > Disallow: /storeabc/scan
> > Disallow: /scan
> > Disallow: /storeabc/process
> > Disallow: /process
> > Disallow: /cgi-bin/storeabc/process
> > Disallow: /cgi-bin/storeabc/scan/
> > Disallow: /cgi-bin/storeabc/search
> > Disallow: /cgi-bin/storeabc/pages/process
> > Disallow: /cgi-bin/storeabc/pages/scan/
> > Disallow: /cgi-bin/storeabc/pages/search
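A quick way to sanity-check prefix rules like these is Python's stdlib robots.txt parser. A minimal sketch, using the example host and paths from this thread (note that robotparser does not understand Google's `*`/`?` wildcard extensions, so only the literal prefix rules are exercised here):

```python
# Verify that the prefix Disallow rules block scan URLs but not flypages.
# Host and paths are the examples from this thread, not a live site.
from urllib import robotparser

RULES = """\
User-agent: *
Disallow: /cgi-bin/storeabc/scan/
Disallow: /cgi-bin/storeabc/process
Disallow: /cgi-bin/storeabc/search
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

scan_url = ("http://www.my-domain.com/cgi-bin/storeabc/scan/"
            "fi=products/st=db/sf=category/se=DVD%20Video/ml=16/tf=description.html")
fly_url = "http://www.my-domain.com/cgi-bin/storeabc/sku12345.html"

print(rp.can_fetch("ExampleBot", scan_url))  # scan pages blocked
print(rp.can_fetch("ExampleBot", fly_url))   # flypages still crawlable
```

This confirms the `/cgi-bin/storeabc/scan/` prefix rule matches the problem URL while leaving the flypage URLs fetchable.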
> >
> > I really just want the flypages, like this, to be ranked:
> >
> > http://www.my-domain.com/cgi-bin/storeabc/sku12345.html
> >
> > Any tips, pointers, ideas, ridicule?
>
> I managed it like this:
>
> http://www.kynsitukku.hertell.com/robots.txt
> robots.txt
> User-agent: *
> Disallow: /admin/
> Disallow: /ord/
> Disallow: /query/
> Disallow: /scan/
> Disallow: /account.html
> Disallow: /process.html
> Disallow: /search.html
>
> User-agent: Googlebot-Image
> Disallow: /
>
>
> And then I created a sitemap that I submitted via the Google Webmaster
> Tools page. The beginning of the sitemap is static (normal pages like
> about, contact, etc.), and the rest I generated with this simple query:
> http://shop.kynsitukku.hertell.com/sitemap.xml
>
> [query
> 	list=1
> 	ml=9999
> 	sql="
> 	select *
> 	from products_fi_FI
> 	where inactive <> '1'
> 	"
> ]<url>
>   <loc>http://www.kynsitukku.hertell.com/[sql-code].html</loc>
>   <priority>0.5</priority>
>   <changefreq>daily</changefreq>
> </url>
> [/query]
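The [query] loop above can be sketched in plain Python to show what ends up in the sitemap. The SKUs, the `inactive` flag, and the domain are placeholder values from this thread, not real data:

```python
# Build the dynamic part of a sitemap.xml from a product list,
# mirroring the [query] loop above (one <url> entry per active product).
products = [
    {"code": "sku12345", "inactive": "0"},
    {"code": "sku67890", "inactive": "1"},  # skipped, like the SQL WHERE clause
]

entries = []
for p in products:
    if p["inactive"] == "1":  # mirrors: where inactive <> '1'
        continue
    entries.append(
        "<url>\n"
        f"  <loc>http://www.my-domain.com/cgi-bin/storeabc/{p['code']}.html</loc>\n"
        "  <priority>0.5</priority>\n"
        "  <changefreq>daily</changefreq>\n"
        "</url>"
    )
sitemap_body = "\n".join(entries)
print(sitemap_body)
```

Only the active product appears in the output, just as the `where inactive <> '1'` clause keeps inactive products out of the generated sitemap.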
>
> To get rid of inactive products completely, I added this to the top of
> my flypage:
>
> [if-item-field inactive]
> [tmp page_title][msg arg.0="[item-code]"]Sorry, the page (%s) was not
> found[/msg][/tmp]
> [tag op=header]
> Status: 404 Not Found
> Content-type: text/html
> [/tag]
> [/if-item-field]
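A sketch of the same idea outside ITL: a small Python function that picks the CGI-style response headers based on a hypothetical `inactive` flag, as the flypage snippet above does, so crawlers drop inactive products from the index.

```python
def flypage_headers(product):
    """Return CGI response headers for a product flypage.

    Mirrors the quoted ITL snippet: inactive products get a 404 so
    search engines de-index them. `product` is a dict with a
    hypothetical 'inactive' flag; this is an illustration, not
    Interchange's actual API.
    """
    if product.get("inactive") == "1":
        return ["Status: 404 Not Found", "Content-type: text/html"]
    return ["Status: 200 OK", "Content-type: text/html"]

print(flypage_headers({"inactive": "1"})[0])  # inactive -> 404
print(flypage_headers({"inactive": "0"})[0])  # active -> 200
```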
>
>
> Remember to name your source files sitemap.xml.html and robots.txt.html,
> because IC will not find a non-.html file.
>
> Hope this helped.
>
> René

Thank you all for the ideas. I'm sure I'll implement some of those juicy
tidbits. I'll report my results in the near future.

Curt




More information about the interchange-users mailing list