[ic] Spiders and more lists again

Fri Sep 3 10:24:06 UTC 2010

On 3.9.2010 9:35, Stefan Hornburg (Racke) wrote:
> On 09/02/2010 04:10 PM, DB wrote:
>> This has been discussed before. Spiders crawling "more list" links
>> sometimes put a significant load on my server. I'm trying to identify
>> exactly which of my pages the spiders are trying to crawl so that I
>> can modify the page(s), rewrite the URLs or otherwise fix the problem.
>>
>> An example of such a request from my httpd access log is:
>>
>> GET
>> /scan/MM=13d6003ffad2d76b12eb41868a5277a3:124360:124379:20.html?mv_more_ip=1&mv_nextpage=results&mv_arg=
>>
>> HTTP/1.0
>>
> 
> These URLs point to "more" pages and are different for each user's
> session. You had better deny access to them in your robots.txt.
> 
>> If I try this URL in a browser, I get "no search was found". Can anyone
>> provide a clue about what exactly the spider is looking for and/or come
>> up with a clever solution?
> 
> I'm using clean URLs for category searches like
> 
> http://www.f-shop.de/cgi-bin/f-shop/rollenspiele/dungeons_dragons_3_5

What I did was the following. I added this to robots.txt:

User-agent: *
Disallow: /admin/
Disallow: /ord/
Disallow: /query/
Disallow: /scan/
Disallow: /account.html
Disallow: /process.html
Disallow: /search.html

and created a sitemap that I submitted to Google. I wrote a guide
for that over here: http://wiki.icdevgroup.org/moin.cgi/sitemap.xml
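
The robots.txt rules above can be sanity-checked before deploying them.
The following is only a minimal sketch using Python's standard
urllib.robotparser; the helper name is_allowed and the agent name
"ExampleBot" are my own assumptions, not anything Interchange provides.
Note that Disallow paths are matched against the URL path from the site
root, so if your catalog lives under a prefix such as /cgi-bin/shop/,
the rules need that prefix as well.

```python
from urllib.robotparser import RobotFileParser

# The rules from the message above, kept inline so nothing is fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /ord/
Disallow: /query/
Disallow: /scan/
Disallow: /account.html
Disallow: /process.html
Disallow: /search.html
"""


def is_allowed(path, agent="ExampleBot"):
    """Return True if a well-behaved crawler may fetch `path`.

    `agent` is a hypothetical crawler name; the rules apply to every
    agent because of the `User-agent: *` line.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


# The session-specific "more list" URL from the access log in this thread.
MORE_LIST_URL = (
    "/scan/MM=13d6003ffad2d76b12eb41868a5277a3:124360:124379:20.html"
    "?mv_more_ip=1&mv_nextpage=results&mv_arg="
)

if __name__ == "__main__":
    print(is_allowed(MORE_LIST_URL))
    print(is_allowed("/rollenspiele/dungeons_dragons_3_5"))
```

This only tells you what a crawler that honours robots.txt would do;
spiders that ignore the file still have to be handled some other way.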

If you have more than 50k products, you will need to split the sitemap
into smaller pieces, because the sitemap protocol allows at most 50,000
URLs per file.
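
Splitting a large sitemap could look roughly like the sketch below. It
is not taken from the wiki guide; it only assumes you already have a
plain list of product URLs. The function names and the example.com URLs
are hypothetical. Each piece is a normal sitemap file, and a sitemap
index file lists the pieces so that only the index has to be submitted.

```python
from xml.sax.saxutils import escape

# Limit per sitemap file according to the sitemaps.org protocol.
MAX_URLS_PER_SITEMAP = 50_000
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def split_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Cut the URL list into consecutive pieces of at most `size` URLs."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap(urls):
    """Return the XML text of one sitemap file for the given URLs."""
    entries = "".join(
        f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
    )


def build_index(sitemap_locations):
    """Return the XML text of a sitemap index listing the pieces."""
    entries = "".join(
        f"  <sitemap><loc>{escape(loc)}</loc></sitemap>\n"
        for loc in sitemap_locations
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}</sitemapindex>\n'
    )


if __name__ == "__main__":
    # Hypothetical product URLs, only to show the splitting.
    urls = [f"http://www.example.com/product/{i}.html" for i in range(120_001)]
    pieces = split_urls(urls)
    # Hypothetical file names for the individual sitemap files.
    locations = [
        f"http://www.example.com/sitemap-{n}.xml"
        for n in range(1, len(pieces) + 1)
    ]
    print(len(pieces), "sitemap files")
    print(build_index(locations))
```
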

René