[ic] how do I stop Google from trying to index scan pages?

Curt Hauge ic_support at mnwebdesign.com
Fri Sep 19 00:11:52 UTC 2008


> -----Original Message-----
> From: interchange-users-bounces at icdevgroup.org
> [mailto:interchange-users-bounces at icdevgroup.org]On Behalf Of Rene
> Hertell
> Sent: Thursday, September 18, 2008 5:05 AM
> To: interchange-users at icdevgroup.org
> Subject: Re: [ic] how do I stop Google from trying to index scan pages?
>
>
> curthauge at mnwebdesign.com wrote:
> > Hi list,
> >
> > IC 5.4.2	Perl 5.8.8	Old Construct cat on Centos 4.7
> >
> > My client has over 15,000 products, but Google only ranks about
> > 400 in their index. The last 4 pages in the Google index are scans.
> > It SEEMS like after hitting 4 scan pages, Google stops and turns
> > away (probably because the page content appears to be similar). I
> > make changes to pages and robots.txt, then wait to see the new
> > Google ranking in a few days/a week. I have a lot of respect for
> > Google and always spell it with a capital 'G', but I still have
> > this problem. ;-)
> >
> > I've been in the archives, but I can't get the precise info I need
> > to change my robots.txt to stop these pages from being indexed.
> > This is an example of the pages in question:
> >
> > http://www.my-domain.com/cgi-bin/storeabc/scan/fi=products/st=db/sf=category/se=DVD%20Video/ml=16/tf=description.html
> >
> > I have RobotUA, RobotHost, and RobotIP settings in catalog.cfg. I have a
> > robots.txt file in my httpdocs directory, with entries like this (among
> > others):
> >
> > User-agent: Googlebot
> > Disallow: /*?
> >
> > User-agent: *
> > Disallow: /storeabc/scan
> > Disallow: /scan
> > Disallow: /storeabc/process
> > Disallow: /process
> > Disallow: /cgi-bin/storeabc/process
> > Disallow: /cgi-bin/storeabc/scan/
> > Disallow: /cgi-bin/storeabc/search
> > Disallow: /cgi-bin/storeabc/pages/process
> > Disallow: /cgi-bin/storeabc/pages/scan/
> > Disallow: /cgi-bin/storeabc/pages/search
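A quick way to sanity-check prefix rules like these is Python's stdlib robots.txt parser. A minimal sketch, using the example host and paths from this thread (note that robotparser does not understand Google's `*`/`?` wildcard extensions, so only the literal prefix rules are exercised here):

```python
# Verify that the prefix Disallow rules block scan URLs but not flypages.
# Host and paths are the examples from this thread, not a live site.
from urllib import robotparser

RULES = """\
User-agent: *
Disallow: /cgi-bin/storeabc/scan/
Disallow: /cgi-bin/storeabc/process
Disallow: /cgi-bin/storeabc/search
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

scan_url = ("http://www.my-domain.com/cgi-bin/storeabc/scan/"
            "fi=products/st=db/sf=category/se=DVD%20Video/ml=16/tf=description.html")
fly_url = "http://www.my-domain.com/cgi-bin/storeabc/sku12345.html"

print(rp.can_fetch("ExampleBot", scan_url))  # scan pages blocked
print(rp.can_fetch("ExampleBot", fly_url))   # flypages still crawlable
```

This confirms the `/cgi-bin/storeabc/scan/` prefix rule matches the problem URL while leaving the flypage URLs fetchable.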
> >
> > I really just want the flypages, like this, to be ranked:
> >
> > http://www.my-domain.com/cgi-bin/storeabc/sku12345.html
> >
> > Any tips, pointers, ideas, ridicule?
>
> I managed it like this:
>
> http://www.kynsitukku.hertell.com/robots.txt
> robots.txt
> User-agent: *
> Disallow: /admin/
> Disallow: /ord/
> Disallow: /query/
> Disallow: /scan/
> Disallow: /account.html
> Disallow: /process.html
> Disallow: /search.html
>
> User-agent: Googlebot-Image
> Disallow: /
>
>
> And then I created a sitemap that I submitted via the Google Webmaster
> Tools page. The beginning of the sitemap is static (normal pages like
> about, contact, etc.), and the rest I generated with this simple query:
> http://shop.kynsitukku.hertell.com/sitemap.xml
>
> [query
> 	list=1
> 	ml=9999
> 	sql="
> 	select *
> 	from products_fi_FI
> 	where inactive <> '1'
> 	"
> ]<url>
>   <loc>http://www.kynsitukku.hertell.com/[sql-code].html</loc>
>   <priority>0.5</priority>
>   <changefreq>daily</changefreq>
> </url>
> [/query]
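The [query] loop above can be sketched in plain Python to show what ends up in the sitemap. The SKUs, the `inactive` flag, and the domain are placeholder values from this thread, not real data:

```python
# Build the dynamic part of a sitemap.xml from a product list,
# mirroring the [query] loop above (one <url> entry per active product).
products = [
    {"code": "sku12345", "inactive": "0"},
    {"code": "sku67890", "inactive": "1"},  # skipped, like the SQL WHERE clause
]

entries = []
for p in products:
    if p["inactive"] == "1":  # mirrors: where inactive <> '1'
        continue
    entries.append(
        "<url>\n"
        f"  <loc>http://www.my-domain.com/cgi-bin/storeabc/{p['code']}.html</loc>\n"
        "  <priority>0.5</priority>\n"
        "  <changefreq>daily</changefreq>\n"
        "</url>"
    )
sitemap_body = "\n".join(entries)
print(sitemap_body)
```

Only the active product appears in the output, just as the `where inactive <> '1'` clause keeps inactive products out of the generated sitemap.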
>
> To get rid of inactive products completely, I added this to the top of
> my flypage:
>
> [if-item-field inactive]
> [tmp page_title][msg arg.0="[item-code]"]Sorry, the page (%s) was not
> found[/msg][/tmp]
> [tag op=header]
> Status: 404 Not Found
> Content-type: text/html
> [/tag]
> [/if-item-field]
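A sketch of the same idea outside ITL: a small Python function that picks the CGI-style response headers based on a hypothetical `inactive` flag, as the flypage snippet above does, so crawlers drop inactive products from the index.

```python
def flypage_headers(product):
    """Return CGI response headers for a product flypage.

    Mirrors the quoted ITL snippet: inactive products get a 404 so
    search engines de-index them. `product` is a dict with a
    hypothetical 'inactive' flag; this is an illustration, not
    Interchange's actual API.
    """
    if product.get("inactive") == "1":
        return ["Status: 404 Not Found", "Content-type: text/html"]
    return ["Status: 200 OK", "Content-type: text/html"]

print(flypage_headers({"inactive": "1"})[0])  # inactive -> 404
print(flypage_headers({"inactive": "0"})[0])  # active -> 200
```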
>
>
> Remember to name your source files sitemap.xml.html and robots.txt.html,
> because IC will not find a non-.html file.
>
> Hope this helped.
>
> René

Thank you all for the ideas. I'm sure I'll implement some of those juicy
tidbits. I'll report my results in the near future.

Curt




More information about the interchange-users mailing list