[ic] how do I stop Google from trying to index scan pages?
Rene Hertell
interchange-users at icdevgroup.org
Thu Sep 18 10:05:15 UTC 2008
curthauge at mnwebdesign.com wrote:
> Hi list,
>
> IC 5.4.2 Perl 5.8.8 Old Construct cat on Centos 4.7
>
> My client has over 15,000 products, but Google only ranks about 400 in their
> index. The last 4 pages in the Google index are scans. It SEEMS like after
> hitting 4 scan pages, Google stops and turns away (probably because the page
> content appears to be similar). I make changes to pages and robots.txt then
> wait to see the new Google ranking in a few days/week. I have a lot of
> respect for Google and always spell it with a capital 'G', but I still have
> this problem. ;-)
>
> I've been in the archives, but I can't get the precise info I need to change
> my robots.txt to stop these pages from being indexed. This is an example of
> the pages in question:
>
> http://www.my-domain.com/cgi-bin/storeabc/scan/fi=products/st=db/sf=category/se=DVD%20Video/ml=16/tf=description.html
>
> I have RobotUA, RobotHost, and RobotIP settings in catalog.cfg. I have a
> robots.txt file in my httpdocs directory, with entries like this (among
> others):
>
> User-agent: Googlebot
> Disallow: /*?
>
> User-agent: *
> Disallow: /storeabc/scan
> Disallow: /scan
> Disallow: /storeabc/process
> Disallow: /process
> Disallow: /cgi-bin/storeabc/process
> Disallow: /cgi-bin/storeabc/scan/
> Disallow: /cgi-bin/storeabc/search
> Disallow: /cgi-bin/storeabc/pages/process
> Disallow: /cgi-bin/storeabc/pages/scan/
> Disallow: /cgi-bin/storeabc/pages/search
>
> I really just want the flypages, like this, to be ranked:
>
> http://www.my-domain.com/cgi-bin/storeabc/sku12345.html
>
> Any tips, pointers, ideas, ridicule?
I managed it like this:
http://www.kynsitukku.hertell.com/robots.txt
robots.txt
User-agent: *
Disallow: /admin/
Disallow: /ord/
Disallow: /query/
Disallow: /scan/
Disallow: /account.html
Disallow: /process.html
Disallow: /search.html
User-agent: Googlebot-Image
Disallow: /
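Rules like these can be checked locally instead of waiting days for a recrawl. The following is a minimal sketch using Python's standard urllib.robotparser; it is not part of the setup described here. The sample paths on kynsitukku.hertell.com are hypothetical, the second rule set is an abbreviated copy of the one quoted in the question, and the parser only does plain prefix matching (it does not implement Google's `*` wildcard extension).

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /ord/
Disallow: /query/
Disallow: /scan/
Disallow: /account.html
Disallow: /process.html
Disallow: /search.html

User-agent: Googlebot-Image
Disallow: /
"""

# Abbreviated copy of the rules quoted in the question.
QUESTION_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /*?

User-agent: *
Disallow: /cgi-bin/storeabc/scan/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


BASE = "http://www.kynsitukku.hertell.com"
# Hypothetical sample paths, used only for illustration.
SCAN_URL = BASE + "/scan/fi=products/st=db/sf=category.html"
FLYPAGE_URL = BASE + "/sku12345.html"

# Scan URL taken from the question.
QUESTION_SCAN_URL = (
    "http://www.my-domain.com/cgi-bin/storeabc/scan/"
    "fi=products/st=db/sf=category/se=DVD%20Video/ml=16/tf=description.html"
)

for agent, url in [
    ("Googlebot", SCAN_URL),
    ("Googlebot", FLYPAGE_URL),
    ("Googlebot-Image", FLYPAGE_URL),
]:
    print(agent, url, is_allowed(ROBOTS_TXT, agent, url))

for agent in ("Googlebot", "OtherBot"):
    print(agent, QUESTION_SCAN_URL,
          is_allowed(QUESTION_ROBOTS_TXT, agent, QUESTION_SCAN_URL))
```

A crawler that finds a group naming it specifically follows only that group and ignores the `*` group. With the rules quoted in the question, Googlebot therefore reads only `Disallow: /*?`, and since the scan URL contains no `?`, none of the scan rules in the `*` group apply to it. That may be why the scan pages kept showing up in the index.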
I then created a sitemap and submitted it through the Google Webmaster
Tools page. The beginning of the sitemap is static (normal pages such as
about and contact), and the rest is generated with this simple query:
http://shop.kynsitukku.hertell.com/sitemap.xml
[query
list=1
ml=9999
sql="
select *
from products_fi_FI
where inactive <> '1'
"
]<url>
<loc>http://www.kynsitukku.hertell.com/[sql-code].html</loc>
<priority>0.5</priority>
<changefreq>daily</changefreq>
</url>
[/query]
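The query above emits only the per-product <url> entries. In the sitemaps.org protocol those entries sit inside a single <urlset> root element carrying the sitemap namespace. As a rough illustration of the complete structure, here is a Python sketch that builds an equivalent file outside Interchange. The function name build_sitemap and the SKU list are hypothetical; in the real setup the codes come from the products table.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, skus, priority="0.5", changefreq="daily"):
    """Return sitemap XML with one <url> entry per product code."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for sku in skus:
        url = ET.SubElement(urlset, "url")
        # Flypage URLs follow the <base>/<sku>.html pattern used above.
        ET.SubElement(url, "loc").text = f"{base_url}/{sku}.html"
        ET.SubElement(url, "priority").text = priority
        ET.SubElement(url, "changefreq").text = changefreq
    # ElementTree escapes characters such as & in the text nodes.
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical product codes, for illustration only.
    print(build_sitemap("http://www.kynsitukku.hertell.com",
                        ["sku12345", "sku67890"]))
```

Building the document with an XML library rather than by string concatenation keeps product codes containing characters such as & from producing an invalid sitemap.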
To get rid of inactive products completely, I added this to the top of
my flypage:
[if-item-field inactive]
[tmp page_title][msg arg.0="[item-code]"]Sorry, the page (%s) was not found[/msg][/tmp]
[tag op=header]
Status: 404 Not found
Content-type: text/html
[/tag]
[/if-item-field]
Remember to name your files sitemap.xml.html and robots.txt.html,
because Interchange will not find a file without the .html extension.
Hope this helped.
René