[ic] Just upgraded 4.8.9->5.2 - RobotUA question

Brian Kaney brian at vermonster.com
Mon Dec 13 22:21:55 EST 2004


On Mon, 13 Dec 2004 DB <DB at M-and-D.com> wrote:

>> On Sun, 12 Dec 2004, DB wrote:
>>
>>> I just upgraded my foundation-based catalog from 4.8.9 to 5.2.0. I 
>>> followed the UPGRADE file instructions and things went pretty 
>>> smoothly. My main reason for the upgrade was to take advantage of 
>>> the RobotUA feature.
>>>
>>> After the upgrade, I added the section below to the end of my 
>>> interchange.cfg; however, I still see entries like this in my apache 
>>> access_log:
>>>
>>> "GET /unlisted.html?id=gAW3nswb HTTP/1.0" 200 17202 "-" "ia_archiver"
>>> "GET /helpfaq.html?id=SRvEvzVq HTTP/1.0" 200 32017 "-" "msnbot/0.3 
>>> (+http://search.msn.com/msnbot.htm)"
>>>
>>> Now I thought the RobotUA prevented spiders from obtaining session 
>>> ids? Am I confused, or can someone tell me why these spiders 
>>> appear to still be obtaining session ids?
>>
>> Are you sure that they're still obtaining session IDs? All those log 
>> entries tell you is that they're successfully spidering URLs that 
>> have session IDs already in them. Most likely their index of your 
>> site already includes hundreds of URLs with embedded session IDs, 
>> and they'll keep spidering those, getting results, and thinking 
>> everything's fine.
>>
>> The change you made says that they won't be issued a session ID, 
>> which is probably working. But it can't purge their old indexes. 
>> Perhaps some spiders eventually stop polling old addresses that 
>> aren't linked any longer, but I don't have any evidence of that.
>>
>> If you want to be sure, do something like:
>>
>> GET -H 'User-agent: ia_archiver' http://yoururl
>>
>> And look for session IDs in the URLs you get back on that page.
>>
>> Jon
>
> Hmm, could be - how would I use that GET statement? In a perl script?
> I'm not familiar with the syntax.
>

Drew McLellan has a nice article about testing this sort of thing with the
curl command-line tool you probably already have installed on your linux box.

http://allinthehead.com/retro/224/curl-for-http-debugging


For instance, try this from a command line:

   # curl --user-agent 'GoogleBot' http://yoursiteurl


- Brian
