[ic] how do I stop Google from trying to index scan pages?

Rene Hertell interchange-users at icdevgroup.org
Thu Sep 18 10:05:15 UTC 2008


curthauge at mnwebdesign.com wrote:
> Hi list,
> 
> IC 5.4.2	Perl 5.8.8	Old Construct catalog on CentOS 4.7
> 
> My client has over 15,000 products, but Google only lists about 400 of them
> in its index. The last 4 pages in the Google index are scan pages. It SEEMS
> like after hitting 4 scan pages, Google stops and turns away (probably
> because the page content appears to be similar). I make changes to pages and
> robots.txt, then wait a few days to a week to see the new Google ranking. I
> have a lot of respect for Google and always spell it with a capital 'G', but
> I still have this problem. ;-)
> 
> I've been in the archives, but I can't get the precise info I need to change
> my robots.txt to stop these pages from being indexed. This is an example of
> the pages in question:
> 
> http://www.my-domain.com/cgi-bin/storeabc/scan/fi=products/st=db/sf=category/se=DVD%20Video/ml=16/tf=description.html
> 
> I have RobotUA, RobotHost, and RobotIP settings in catalog.cfg. I have a
> robots.txt file in my httpdocs directory, with entries like this (among
> others):
> 
> User-agent: Googlebot
> Disallow: /*?
> 
> User-agent: *
> Disallow: /storeabc/scan
> Disallow: /scan
> Disallow: /storeabc/process
> Disallow: /process
> Disallow: /cgi-bin/storeabc/process
> Disallow: /cgi-bin/storeabc/scan/
> Disallow: /cgi-bin/storeabc/search
> Disallow: /cgi-bin/storeabc/pages/process
> Disallow: /cgi-bin/storeabc/pages/scan/
> Disallow: /cgi-bin/storeabc/pages/search
> 
> I really just want the flypages, like this, to be ranked:
> 
> http://www.my-domain.com/cgi-bin/storeabc/sku12345.html
> 
> Any tips, pointers, ideas, ridicule?

I managed it like this:

My robots.txt (http://www.kynsitukku.hertell.com/robots.txt):
User-agent: *
Disallow: /admin/
Disallow: /ord/
Disallow: /query/
Disallow: /scan/
Disallow: /account.html
Disallow: /process.html
Disallow: /search.html

User-agent: Googlebot-Image
Disallow: /


And then I created a sitemap that I submitted via the Google Webmaster Tools
page. The beginning of the sitemap is static (normal pages like about,
contact, etc.), and the rest I generated with this simple query:
http://shop.kynsitukku.hertell.com/sitemap.xml

[query
	list=1
	ml=9999
	sql="
	select *
	from products_fi_FI
	where inactive <> '1'
	"
]<url>
  <loc>http://www.kynsitukku.hertell.com/[sql-code].html</loc>
  <priority>0.5</priority>
  <changefreq>daily</changefreq>
</url>
[/query]
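
Roughly, the whole sitemap page then looks something like this (the urlset
wrapper is the standard sitemaps.org format; the static entries here are just
placeholder examples, not my real pages):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.kynsitukku.hertell.com/about.html</loc>
    <priority>0.8</priority>
    <changefreq>monthly</changefreq>
  </url>
  <url>
    <loc>http://www.kynsitukku.hertell.com/contact.html</loc>
    <priority>0.8</priority>
    <changefreq>monthly</changefreq>
  </url>

  [comment] the query loop from above goes here [/comment]

</urlset>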

To get rid of inactive products completely, I added this to the top of
my flypage:

[if-item-field inactive]
[tmp page_title][msg arg.0="[item-code]"]Sorry, the page (%s) was not found[/msg][/tmp]
[tag op=header]
Status: 404 Not Found
Content-type: text/html
[/tag]
[/if-item-field]
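
If you also want to suppress the normal flypage output for inactive products
entirely, something like this should work (just a sketch: I am assuming
[else]/[/else] works under [if-item-field] the same way it does under a plain
[if], and the <p>/<h1> lines are only placeholders):

[if-item-field inactive]
[tmp page_title][msg arg.0="[item-code]"]Sorry, the page (%s) was not found[/msg][/tmp]
[tag op=header]
Status: 404 Not Found
Content-type: text/html
[/tag]
<p>[scratch page_title]</p>
[else]
[comment] the normal flypage template goes here [/comment]
<h1>[item-field description]</h1>
[/else]
[/if-item-field]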


Remember to name your files sitemap.xml.html and robots.txt.html, because IC
will not find a file without an .html extension.

Hope this helped.

René


