[ic] RobotUA

Philip S. Hempel interchange-users@icdevgroup.org
Wed Nov 27 23:34:01 2002


Grant wrote:

>>Grant wrote:
>>
>>    
>>
>>>Thanks a lot for the info Phillip.  I'd like to clarify a couple things
>>>though...
>>>
>>>
>>>
>>>      
>>>
>>>>Usually with a 301 it takes a couple of runs from most spiders to decide
>>>>to go anywhere else into the
>>>>site.
>>>>
>>>>
>>>>        
>>>>
>>>What is the correct way to forward from your domain to your site's index
>>>page so the spiders don't get confused?
>>>
>>>
>>>      
>>>
>>Icdevgroup uses 302 and, as you may have noticed, Google does not index
>>anything but the first page. Most other search engines will tromp all
>>over the site and produce gory listings.
>>My opinion, and that of many others I have spoken with at
>>http://www.webmastersworld.com (where the GoogleGuy hangs out), is that 301
>>is the only way if you're going to redirect and expect rankings. You could
>>use a doorway page
>>that uses JavaScript and just place links into the site with some
>>keywords in it for the search engines, but I feel that is
>>very unprofessional and spammy myself.
>>
>>
>>    
>>
>
>So pretty much everyone uses a 301 or 302 to get to the index page of their
>site, and therefore has to deal with this issue?
>
>  
>
>>>      
>>>
>>>>Now, depending on how long your system has been running with a 301,
>>>>if you move now it will cause
>>>>you more problems. Realize that a 301 is just like telling the mailman
>>>>you have a new address and then
>>>>sending a change of address to all of your magazine companies.
>>>>Now how long does it take for them to get around to sending them to your
>>>>new address?
>>>>Then suddenly you decide to send them and your mailman a new change of
>>>>address again, even before
>>>>they have actually acted on your old change of address. Well, you will
>>>>wait at least 2 months before you get
>>>>any magazines, or a good part of your mail will end up in
>>>>        
>>>>
>>different places.
>>    
>>
>>>>So usually use 301, as opposed to 302, which says the move is temporary
>>>>and not to keep a record of it. This is a very bad thing
>>>>when it comes to spiders if you keep bouncing around.
>>>>
>>>>
>>>>        
>>>>
>>>Are you saying a 301 or a 302 is better for spiders?
>>>
>>>
>>>      
>>>
>>Here is a link from Google that talks about what they feel you should do
>>
>>http://www.google.com/remove.html
>>
>>And the snippet that talks about 301
>>
>>*Change the URL of your website*
>>
>>Since Google's crawler associates the content of a page with its URL,
>>there is no way to manually change the URL that is displayed for your
>>website. The URL will be updated the next time we crawl your site. The
>>crawler revisits each site according to an automatic schedule, and we
>>cannot manually accelerate the date on which your site will be recrawled.
>>
>>If the URL of your website has changed since we last crawled it, you may
>>use the URL submission form <http://www.google.com/addurl.html> and the
>>URL removal methods described below. However, the URL submission form
>>does not take effect immediately, so using the URL removal feature may
>>leave your website inaccessible from Google until we crawl your site again.
>>
>>Instead of requesting a change from Google, we recommend that you ask
>>the sites currently linked to your old site to update their links (to
>>point to your new site). Also, don't forget to change any entries you
>>may have in the Yahoo! directory and the Open Directory. Finally, if
>>your old URLs redirect to your new site using HTTP 301 (permanent)
>>redirects <http://www.ietf.org/rfc/rfc2616.txt>, our crawler will know
>>to use the new URL. Changes made in this way will take 6-8 weeks to be
>>reflected in Google.
>>
>>I feel 301 is better. Also pay close attention to the time Google
>>says it will take for the crawler to understand the
>>new address (6-8 weeks).
>>
>>    
>>
>>>>This is spoken completely from experience, since I did this myself and
>>>>have seen its effects.
>>>>
>>>>Also, all of your DMOZ entries need to point to your redirected
>>>>location to get credit for it.
>>>>
>>>>
>>>>
>>>>        
>>>>
>>>Where are these entries?
>>>
>>>
>>>      
>>>
>>This applies if your site has been submitted to DMOZ at
>>http://www.dmoz.org. Since Google and other search engines use
>>these listings to support your rankings, they should always be set to
>>match your expected site location.
>>
>>    
>>
>
>Ok, thank you.
>
>  
>
>>>      
>>>
>>>>The point is this: if you have just started doing this move, then leave it
>>>>alone. It will take at least 2 months for
>>>>Google and a few others to catch up. If you have done this for a while,
>>>>you could completely lose at least
>>>>a month's worth of crawls until they get around to seeing the new move.
>>>>
>>>>This happened to me, and I got impatient myself and moved around again.
>>>>I lost much traffic, and after talking to some people at webmasterworld,
>>>>they just told me not to mess with it and to be patient: they will crawl
>>>>your site within one to two
>>>>months. If your sids are not showing, they will jump on it soon.
>>>>
>>>>--
>>>>Philip S. Hempel
>>>>debian/rules
>>>>
>>>>
>>>>        
>>>>
>>>It seems like there must be a better way to go about all this that doesn't
>>>use 301s at all so the spiders will head straight inside.  What would that
>>>be?
>>>
>>>- Grant
>>>
>>>
>>>
>>>      
>>>
>>Most SEs do not like 302 (temp redirect) and almost all suggest the
>>usage of 301 (permanent redirect).
>>302 does not push ranking onto the main page, which is what you want.
>>
>>Since I quit using 302 and went to 301, plus a few other things, I
>>went from page 5 in
>>the rankings to number 1, 2, 3, 4, or 5 for over 15 keyword sets and
>>went from 100 users
>>to over 800 users (not search engines) a day on average.
>>
>>Google spiders over 200 pages on my site now, and as of
>>today we average
>>over 10 sales a day (up from 1 every 3 weeks). (This is good for
>>a supposed part-time business.)
>>
>>
>>Hope this helps if you need more ask.
>>
>>(and please excuse typos, wrote this in a rush)
>>--
>>Philip S. Hempel
>>debian/rules
>>    
>>
>
>Here's what Google's doing on my site:
>
>64.68.82.70 - - [26/Nov/2002:08:30:13 -0800] "GET /robots.txt HTTP/1.0" 200
>0 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
>64.68.82.70 - - [26/Nov/2002:08:30:15 -0800] "GET / HTTP/1.0" 301 330 "-"
>"Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
>64.68.82.5 - - [26/Nov/2002:08:39:22 -0800] "GET /cgi-bin/shop/ HTTP/1.0"
>200 38303 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
>64.68.82.7 - - [26/Nov/2002:08:49:59 -0800] "GET /cgi-bin/shop/policies.html
>HTTP/1.0" 200 35830 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
>64.68.82.28 - - [26/Nov/2002:08:52:20 -0800] "GET
>/cgi-bin/shop/moreinfo.html HTTP/1.0" 200 39917 "-" "Googlebot/2.1
>(+http://www.googlebot.com/bot.html)"
>
>That's it.  This shows that they are getting into the main site, past the
>301.  They're just looking at a couple of pages though.  I've verified with
>the Sam Spade browser that IC is sanitizing the URLs when the Google User
>Agent is used.  Also, it's GETing "/cgi-bin/shop/", but I have NO links to
>that particular path anywhere in the site.  The redirect redirects to
>www.mystore.com/cgi-bin/shop/index.html.  How could it be hitting
>"/cgi-bin/shop/"?  
>
Try running the site through http://validator.w3.org/ and a link checker;
there may be a bad link somewhere. (I can't think of a specific checker
at the moment.)
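
In the meantime, here is a rough sketch of how you could hunt for the link yourself. It is only an illustration, not a real link checker: the names LinkCollector and find_links_to are made up for this example, and it assumes you have saved the HTML of the page you want to inspect. It resolves every <a href> against the page's own URL and reports the ones that land on a given path. Note that a relative link such as "./" on index.html resolves to the bare directory, which is one way a spider could reach "/cgi-bin/shop/" without that path appearing literally anywhere in your pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links the way a browser or spider would.
                self.links.append(urljoin(self.base_url, value))


def find_links_to(html, base_url, target_path):
    """Return every resolved link in `html` whose path equals `target_path`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return [link for link in collector.links
            if urlsplit(link).path == target_path]
```

You would call it with the page source, the URL the page was served from, and the path you are puzzled by, for example find_links_to(page_source, "http://www.mystore.com/cgi-bin/shop/index.html", "/cgi-bin/shop/").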

>Thanks a lot for all your help Phillip.  Hopefully others
>will benefit from this discussion too.  Any idea why the Googlebot wouldn't
>be hitting up more pages?  There are a ton of links on that front index page
>to all of my product categories.
>
>  
>

    A little history on how search engines work. A search engine will
pick up your robots.txt if there is one.
Then it will pick up your index page.
    Now, most search engines rotate on a month-to-month basis: what the
spider sees on its first run is what it will
use for one month. If most of your links are on the index page, it will
run around for a while using just a few pages.
Most notably, any pages that have the highest PR (with Google) will be
hit throughout the month.

    Most search engines pick up new data around once a month. On
this monthly cycle,
the search engine will do its deepest crawl, also adding the new
changes to the index page into the database.
This is when the freshest information is used and a new ranking is
assigned to your pages.
You will again notice new hits on the higher-ranking pages throughout
the month.

    Also, one rule to remember (for Google) is that it loves freshly
edited pages. The more links and information Google sees added, the more
often those pages may get picked up by the freshbot and used in its index.
I try to add at least one new link every few days to my most-hit
pages throughout the month to increase the likelihood
that Google and others will pick them up as fresher pages and use them.

The pages on your site that are linked from other sites will also get hit
more often, even if Google did not find them from
your index page. Also remember: the deeper into the site a page
is, the less regularly Google is likely to hit it.

One of the biggest factors in what a search engine does with a
site is how long the site has been
in the search engine's index. Many search engines, since they only run
once a month, may have a limit on how many
links they will crawl during a run on a site. If your site has only been
in for a month, then you will only see maybe 4 to 10 links hit.
Then the next month it will pick up 4 to 10 more, and so on.

In conclusion (if anyone wants more, please ask):
Patience and */perseverance/* 
<http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&newwindow=1&q=perseverance&spell=1> 
are the biggest factors when doing SEO. Never make changes to your site
without
checking that the HTML still validates. Just because a page works in a
web browser does not mean that it works in
an SE. Most SEs can handle HTML 3.2 plus some 4.0 and some CSS 1.0.

Many such tools, including spider simulators, are at
http://www.searchengineworld.com/.
Also try frequenting http://www.webmasterworld.com/; there is a plethora
of information there.
One thing that IC does have is a way to do HTML checking on a page (there
is a tag for it, but offhand I don't remember its name).

This information is not really technical in its depth; I am just trying to give a light
rundown of how I know this works. Every day I hear many people asking the same questions.

One of our primary goals with IC is to sell products, and search engines are a very large
part of this. Always have patience. IC is the *GREATEST* product in both content management
and online sales.
If you think of IC as a content management system instead of a web store, your goal would be to put
in data that helps everyone who comes to your site; Google is pushing webmasters to make this happen.
If your site works well under lynx (even Google recommends using lynx),
then you should find most search engines will have no problem finding your information.
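
If you don't have lynx handy, something like the following gives a crude approximation of a text-only view. It is only a sketch under my own assumptions: the names TextOnlyView and text_only are invented here, it is not how lynx or any spider actually works, and all it does is drop everything inside <script> and <style> and keep the remaining text. The point is that anything your page only produces through JavaScript will not show up in the result.

```python
from html.parser import HTMLParser


class TextOnlyView(HTMLParser):
    """Keep visible text and ignore the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only record text that is outside script/style blocks.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def text_only(html):
    """Return the page text with script and style contents removed."""
    view = TextOnlyView()
    view.feed(html)
    view.close()
    return " ".join(view.chunks)
```

If the output of text_only() on your index page is missing your category names, a text-only client is probably not seeing them either.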


It is good that you look at your logs, since that is how you noticed Google picking up something unexpected.
I take a cursory look at my logs at least once a day. This helps me see what may be a problem.
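
For that daily look, a small script can pull the spider hits out for you. This is a minimal sketch, assuming your access log is in Apache's combined format with each entry on one line (the lines quoted earlier in this thread match that format once the mail wrapping is undone). The function name summarize_bot_hits and the default "Googlebot" match are my own choices for illustration.

```python
import re
from collections import Counter

# Apache combined log format: host, identity, user, [time], "request",
# status, size, "referer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize_bot_hits(lines, agent_substring="Googlebot"):
    """Count status codes and list (status, path) for one spider's requests."""
    by_status = Counter()
    paths = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # not a combined-format line, skip it
        if agent_substring not in match["agent"]:
            continue  # some other client
        by_status[match["status"]] += 1
        paths.append((match["status"], match["path"]))
    return by_status, paths
```

Feed it the lines of your access log (for example from open() on the log file) and look at which paths come back with 301, 302, or 404; unexpected paths like the "/cgi-bin/shop/" hit stand out quickly this way.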

BTW, one thing I do wish of IC: when a page is not found in IC, it would help if it could (by default) send the errors to the Apache error log. You may be surprised how many errors SEs will find
that most web browsers fix for you.

Good luck

-- 

Philip S. Hempel