[interchange-bugs] [interchange-core] [rt.icdevgroup.org #344] 80legs webcrawler not recognized as Robot due to NotRobotUA

Stefan Hornburg via RT interchange at rt.icdevgroup.org
Wed Mar 2 14:33:14 UTC 2011


<URL: http://rt.icdevgroup.org/Ticket/Display.html?id=344 >

On 03/02/2011 03:15 PM, David Christensen wrote:
>> The 80legs webcrawler identifies itself as:
>>
>> Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620
>>
>> Because of the NotRobotUA entry 'Gecko', this crawler is not identified as such.
>>
>> Blocking via RobotIP will not work either, as the crawler operates from a distributed network of IPs, so it crawls the site creating a pile of session IDs from many different addresses.
>
> Yeah, I've been reconsidering the NotRobotUA change.  I like it in principle, but then you end up with cases like this.  Short of a JustKiddingThisIsReallyARobotUA directive, I'm not sure how to do this generally—it starts to feel like an arms race.  I think in the general case, we'd rather users always be able to have a session/checkout, so basically we'd run into cases like this as the exception to handle.
>
> Perhaps a suitable negative lookahead/behind pattern would help in this specific case.  I'm also open to ideas/other thoughts.
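
For this specific crawler a lookahead could do it; a minimal sketch in Perl, assuming the UA string above (the pattern is mine and purely illustrative, not the actual NotRobotUA directive handling):

    # Illustrative only: treat 'Gecko' as a browser marker unless the UA
    # also carries the 80legs signature. Not real interchange.cfg syntax.
    my $ua = 'Mozilla/5.0 (compatible; 008/0.83; '
           . 'http://www.80legs.com/webcrawler.html) Gecko/2008032620';
    my $not_robot = qr/^(?!.*80legs\.com).*Gecko/;

    if ($ua =~ $not_robot) { print "browser\n" }
    else                   { print "robot\n"  }   # this branch fires

But that only papers over one crawler; the next one will need its own pattern.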

More generally, what about:

* Allowing multiple RobotUA/NotRobotUA configuration directives in interchange.cfg.
* Compiling the regexes after configuration is complete.
* Adding a lookup hash to determine which regex match wins.
* Breaking the check out into a subroutine that can be overridden or supplemented (see the sketch below).
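
A rough sketch in Perl of what I have in mind; the names, priorities and data layout are made up for illustration, none of this is existing Interchange code:

    # Directive values collected during configuration (illustrative).
    my @robot_ua     = ('crawl', 'spider', '80legs');
    my @not_robot_ua = ('Gecko', 'MSIE');

    # Compile once, after configuration is finished. RobotUA entries get
    # the higher priority, so a robot match beats a NotRobotUA match.
    my @rules = (
        (map { +{ re => qr/\Q$_\E/i, robot => 1, prio => 2 } } @robot_ua),
        (map { +{ re => qr/\Q$_\E/i, robot => 0, prio => 1 } } @not_robot_ua),
    );

    # One overridable entry point for the whole decision.
    sub is_robot_ua {
        my ($ua) = @_;
        my ($best) = sort { $b->{prio} <=> $a->{prio} }
                     grep { $ua =~ $_->{re} } @rules;
        return $best ? $best->{robot} : 0;
    }

    my $ua = 'Mozilla/5.0 (compatible; 008/0.83; '
           . 'http://www.80legs.com/webcrawler.html) Gecko/2008032620';
    print is_robot_ua($ua) ? "robot\n" : "browser\n";   # prints "robot"

That way '80legs' in RobotUA would win over 'Gecko' in NotRobotUA without any lookahead tricks, and shops with unusual needs could override the subroutine.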

Regards
	Racke

-- 
LinuXia Systems => http://www.linuxia.de/
Expert Interchange Consulting and System Administration
ICDEVGROUP => http://www.icdevgroup.org/
Interchange Development Team
