[interchange-bugs] [interchange-core] [rt.icdevgroup.org #344] 80legs webcrawler not recognized as Robot due to NotRobotUA

David Christensen david at endpoint.com
Wed Mar 2 15:12:22 UTC 2011


On Mar 2, 2011, at 8:33 AM, Stefan Hornburg via RT wrote:

> 
> <URL: http://rt.icdevgroup.org/Ticket/Display.html?id=344 >
> 
> On 03/02/2011 03:15 PM, David Christensen wrote:
>>> 80 legs webcrawler identifies itself as:
>>> 
>>> Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620
>>> 
>>> Because of the NotRobotUA entry 'Gecko' this crawler is not identified as such.
>>> 
>>> Blocking via RobotIP will not work as it works via a distributed network of IP's ... So it will crawl creating a bunch of session id's with all different IP numbers.
>> 
>> Yeah, I've been reconsidering the NotRobotUA change.  I like it in principle, but then you end up with cases like this.  Short of a JustKiddingThisIsReallyARobotUA directive, I'm not sure how to do this generally—it starts to feel like an arms race.  I think in the general case, we'd rather users always be able to have a session/checkout, so basically we'd run into cases like this as the exception to handle.
>> 
>> Perhaps a suitable negative lookahead/behind pattern would help in this specific case.  I'm also open to ideas/other thoughts.
> 
> What about:
> 
> * Allowing multiple RobotUA/NotRobotUA configuration directives in interchange.cfg.
> * Compiling regexes after configuration is completed.
> * Add a lookup hash to determine which match in regex wins.
> * Break it out in a subroutine which can be overridden/supplemented.
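
As a rough illustration of the lookup-hash idea quoted above: the sketch below is Python for brevity (Interchange itself is Perl), and every name, pattern fragment, and weight in it is hypothetical rather than taken from robots.cfg. It assumes each fragment carries a verdict and a weight, that one combined regex is compiled once after configuration is complete, and that the heaviest matching fragment wins.

```python
import re

# Hypothetical entries, as if gathered from several RobotUA / NotRobotUA
# directives. Keys are lowercase fragments; values are (is_robot, weight).
RULES = {
    "gecko": (False, 10),
    "msie": (False, 10),
    "80legs": (True, 20),
    "googlebot": (True, 20),
}


def compile_rules(rules):
    """Build one alternation; meant to run once, after configuration is read."""
    fragments = sorted(rules, key=len, reverse=True)
    return re.compile("|".join(re.escape(f) for f in fragments), re.IGNORECASE)


def is_robot(user_agent, pattern, rules):
    """Return True when the heaviest matching fragment is a robot entry."""
    hits = [rules[m.group(0).lower()] for m in pattern.finditer(user_agent)]
    if not hits:
        return False  # assumption: unknown agents get a session by default
    verdict, _weight = max(hits, key=lambda hit: hit[1])
    return verdict


PATTERN = compile_rules(RULES)
```

Under these assumed weights, the 80legs user agent from the ticket matches both "80legs" and "Gecko", and the heavier robot entry decides, while an ordinary Gecko browser is still treated as a user.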


Okay, building on this:

* have the existing robots.cfg RobotUA/NotRobotUA directives, maintained as usual.  This will allow backwards-compatibility with the existing install base for people to pull out the latest versions of the robots.cfg with older IC versions.
* the single code path which determines whether to hand out the session or not (in Vend::Dispatch::dispatch, IIRC) will be refactored to call a specific sub/subref to determine Robot/NotRobot status.
* the new default implementation will more-or-less be the existing logic ripped out; however, we'll add the hooks/overrides and perhaps examples to the latest dev version.  Users can augment the subroutine definition or replace it.
* move existing robot-related logic to new class Vend::Robot(Detection|UA|) with a simple C<is_robot()> function.  Factor out any relevant params into a context hash to pass so it will work suitably to be unit tested, used in external scripts, etc.  (These values could be defaulted with $Vend::blah globals if absolutely necessary, but since there's currently a single caller, seems like we should fix the call site, not clutter up the API further.)
* since this is affecting the ability for users to place orders/get sessions, backport these Robot-related changes to 5.6 and 5.4 as well; this way users of those versions can take advantage of any additional changes.
* pie-in-the-sky: some WS API to allow IC to detect on restart (within some short minimal timeout) whether an updated version of robots.cfg is available, and optionally download it.  Perhaps even just a contributed script that could be put into cron in a weekly job or similar.
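
To make the is_robot()/context-hash shape concrete, here is a minimal sketch. It is Python for illustration only (the real code would be Perl), and all names and patterns in it are hypothetical. It assumes the existing logic amounts to "a NotRobotUA match takes precedence over RobotUA", which is how the ticket describes the behavior, and it passes everything the check needs in a context dict so it can be called from a test or an external script.

```python
import re


def default_is_robot(context):
    """Assumed stand-in for the existing logic: NotRobotUA beats RobotUA."""
    ua = context["user_agent"]
    if context["not_robot_ua"].search(ua):
        return False
    return bool(context["robot_ua"].search(ua))


def site_is_robot(context):
    """Hypothetical site override: flag one known crawler, else defer."""
    if "80legs.com" in context["user_agent"]:
        return True
    return default_is_robot(context)


def wants_session(context, detector=default_is_robot):
    """The single call site: hand out a session only to non-robots."""
    return not detector(context)


# Everything the detector needs travels in the context; no globals.
context = {
    "user_agent": (
        "Mozilla/5.0 (compatible; 008/0.83; "
        "http://www.80legs.com/webcrawler.html) Gecko/2008032620"
    ),
    "robot_ua": re.compile(r"googlebot|crawler", re.IGNORECASE),
    "not_robot_ua": re.compile(r"gecko|msie", re.IGNORECASE),
}
```

With the default detector this agent still gets a session because of the Gecko match; calling wants_session(context, site_is_robot) denies it without touching the default logic.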

Regards,

David
--
David Christensen
End Point Corporation
david at endpoint.com