[ic] RobotUA

Tue Nov 26 17:37:00 2002

Grant [listbox@email.com] wrote:
> 
> I've had my RobotUA all set up for a few days, but examining my rotated
> access_log files, the robots aren't getting any further than this:
> 
> 66.196.65.16 - - [25/Nov/2002:18:30:41 -0800] "GET /robots.txt HTTP/1.0" 200
> 0 "-" "Mozilla/3.0 (Slurp/si; slurp@inktomi.com;
> http://www.inktomi.com/slurp.html)"
> 66.196.65.16 - - [25/Nov/2002:18:30:42 -0800] "GET / HTTP/1.0" 301 330 "-"
> "Mozilla/3.0 (Slurp/si; slurp@inktomi.com;
> http://www.inktomi.com/slurp.html)"
> 
> Here's my RobotUA entry:
> 
> RobotUA WebCrawler, BaiDuSpider, ZyBorg, almaden.ibm, Googlebot, Slurp,
> Girafabot, ia_archiver, LinkWalker, MSIECrawler
> 
One of four things could be happening:

  1. Your robots.txt could be limiting access.
  2. The spider may object to receiving a "301 Moved" status when asking
     for a webpage.  Perhaps it suspects 'cloaking' and just stops there.
  3. The spider may intend to return later to ask for more pages.  Some
     spiders do this to keep the load on your server to a minimum.
     Remember that some servers have lots of websites.
  4. RobotUA could be broken, although I doubt it.  You can check it
     yourself by pretending to be "Slurp/si; slurp@inktomi.com" (send
     that as your User-Agent) and then requesting '/'.  Check the
     resulting page for "unfriendly" links.
  5. I said 4, didn't I?  If you can think of another then let me know.
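
For anyone who wants to script checks 1 and 4, here is a rough Python
sketch, not a definitive test.  The function names are made up for
illustration, and it assumes that "unfriendly" links are ones still
carrying Interchange session parameters such as mv_session_id or mv_pc;
adjust the markers to whatever your catalog actually emits.  The
robots.txt check is handed the short robot name ("Slurp") rather than
the full User-Agent string, since that is what robots.txt records are
matched against.

```python
import re
import urllib.request
from urllib.robotparser import RobotFileParser

# User-Agent string copied from the access_log lines quoted above.
SLURP_UA = ("Mozilla/3.0 (Slurp/si; slurp@inktomi.com; "
            "http://www.inktomi.com/slurp.html)")

# Assumption: substrings that mark a link as "unfriendly" to a spider.
UNFRIENDLY_MARKERS = ("mv_session_id=", "mv_pc=")


def robots_allows(robots_txt, robot_name, path="/"):
    """Return True if the robots.txt text lets robot_name fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(robot_name, path)


def find_unfriendly_links(html, markers=UNFRIENDLY_MARKERS):
    """Return the href values that contain any of the marker substrings."""
    hrefs = re.findall(r'href\s*=\s*["\']([^"\']+)["\']', html, re.IGNORECASE)
    return [h for h in hrefs if any(m in h for m in markers)]


def fetch_as_robot(url, user_agent=SLURP_UA):
    """Fetch url while sending the spider's User-Agent header.

    urllib follows the 301 redirect on its own, so the first element of
    the returned tuple is the URL that was finally served.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.geturl(), body
```

To use it, call fetch_as_robot() on your own site's '/' and pass the
returned body to find_unfriendly_links().  An empty list suggests
RobotUA is doing its job for that User-Agent; repeating the request
with an ordinary browser User-Agent should show the difference.  Note
that an empty robots.txt passed to robots_allows() counts as "allow
everything".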

>
> Has anyone verified that this directive really works to clean up the URLs
> for spidering?
> 
Yes.

-- 
   _/   _/  _/_/_/_/  _/    _/  _/_/_/  _/    _/
  _/_/_/   _/_/      _/    _/    _/    _/_/  _/   K e v i n   W a l s h
 _/ _/    _/          _/ _/     _/    _/  _/_/    kevin@cursor.biz
_/   _/  _/_/_/_/      _/    _/_/_/  _/    _/