Posted on Monday, July 19th, 2010. Filed under News, SEO.
Posted by rolfbroer
This post was originally in YOUmoz, and was promoted to the main blog because it provides great value and interest to our community. The author’s views are entirely his or her own and may not reflect the views of SEOmoz, Inc.
Google has found an intelligent way to arrange the results for a search query. But an interesting question is: where can we find that intelligence? A lot of people have researched the indexing process, and even more have tested the weight of individual ranking factors, but we wondered how smart Googlebot itself is. As a start, we took some statements and commonly used principles and tested how Googlebot handled them. Some results are questionable and should be tested on a few hundred domains to be sure, but they can give you some ideas.
Speed of The Crawler
The first statement we tested comes from Matt Cutts: “… the number of pages that we crawl is roughly proportional to your PageRank”.
This brings us to one of the challenges large content sites face: getting all of their pages indexed. If Amazon.com were a new website, it would take a while for Google to crawl all 48 million pages, and if Matt Cutts’s statement is true, it would be impossible without any incoming links.
To test it, we took a domain with no history (never registered, no backlinks) and made a page with 250 links on it. Those links point to pages that also have 250 links each (and so on…). The links and URLs were numbered from 1 to 250, in the same order as they appeared in the source code. We submitted the URL via “addurl” and waited. Because the domain has no incoming links, it has no PageRank, or at most a negligible one. If Matt Cutts’s statement is correct, Googlebot should soon stop crawling.
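A test site like this can be generated rather than written by hand. The sketch below is only an illustration of the structure described above: the URL scheme (each level appends the link number to the path) and the function names are assumptions, since the post does not say how the pages were built or addressed.

```python
from html import escape

LINKS_PER_PAGE = 250  # taken from the test description


def child_path(parent_path: str, n: int) -> str:
    """Hypothetical URL scheme: each level appends the link number."""
    return f"{parent_path.rstrip('/')}/{n}"


def render_page(path: str) -> str:
    """Return HTML for one page with links numbered 1..250 in source order."""
    links = "\n".join(
        f'<a href="{escape(child_path(path, n))}">{n}</a>'
        for n in range(1, LINKS_PER_PAGE + 1)
    )
    return f"<html><body>\n{links}\n</body></html>"


def pages_at_depth(depth: int) -> int:
    """Number of distinct pages exactly `depth` clicks from the start page."""
    return LINKS_PER_PAGE ** depth


if __name__ == "__main__":
    print(render_page("/").count("<a href="))  # links on the start page
    print(pages_at_depth(2))                   # pages two clicks away
```

With 250 links per page the tree grows very quickly, so a crawler never runs out of new URLs and any slowdown has to come from the crawler itself.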

As you can see in the graph, Googlebot started crawling the site at a rate of approximately 2500 pages per hour. After three hours, it slowed down to approximately 25 pages per hour and maintained that rate for months. To verify this result we ran the same test on two other domains. Both tests produced nearly the same results; the only difference was a lower peak at the beginning of Googlebot’s visit.
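The post does not say how the crawl rate was measured. One plausible way, sketched here as an assumption, is to take the timestamps of Googlebot requests from the server log and count them per clock hour. The function name and the sample timestamps are made up for illustration and are not data from the test.

```python
from collections import Counter
from datetime import datetime


def hits_per_hour(timestamps):
    """Count requests per clock hour, given an iterable of datetimes."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return dict(sorted(buckets.items()))


# Made-up request times, for illustration only (not data from the test).
sample = [
    datetime(2010, 1, 1, 10, 5),
    datetime(2010, 1, 1, 10, 40),
    datetime(2010, 1, 1, 11, 15),
]

for hour, count in hits_per_hour(sample).items():
    print(hour.isoformat(), count)
```

Plotting these hourly counts over time would give a curve like the one described: a short peak followed by a long, flat tail.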

Impact of Sitemaps
During the tests, the sitemap proved to be a very useful tool for influencing the crawl rate. We added a sitemap with 50,000 uncrawled pages in it (indexation level 0). Googlebot placed the pages submitted through the sitemap at the top of the crawl queue, which means those pages were crawled before the F-levelled pages. But what’s really remarkable is the extreme increase in crawl rate. At first, the number of visits had stabilized at 20-30 pages per hour. As soon as the sitemap was uploaded through Webmaster Central, the crawler accelerated to approximately 500 pages per hour. In just a few days it reached a peak of 2224 pages per hour. Where at first the crawler visited 26.59 pages per hour on average, it grew to an average of 1257.78 pages per hour, an increase of no less than 4630.27%. The increase in crawl rate isn’t limited to the pages included in the sitemap: the other F- and 0-levelled pages also benefit from it.
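The percentage follows directly from the two averages quoted above. A minimal check, using only those two figures (the function name is ours):

```python
def percent_increase(before: float, after: float) -> float:
    """Relative growth from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100


avg_before = 26.59    # pages per hour before the sitemap
avg_after = 1257.78   # pages per hour after the sitemap

print(f"{percent_increase(avg_before, avg_after):.2f}%")  # 4630.27%
```

Put differently, the crawler became roughly 47 times as fast once the sitemap was submitted.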

It’s quite remarkable that Google suddenly uses more of its crawl capacity to crawl the website. At the point where we submitted the sitemap, the crawl queue was filled with F-pages. Google probably attaches a lot of value to the submitted sitemap.

This brings us back to Matt Cutts’s statement. After only 31 days Googlebot had crawled about 375,000 pages of the website. If this is proportional to its PageRank (which is 0), this would mean that it would crawl 140,625,000,000 pages of a PageRank 1 website in just 31 days. Remember that PageRank is exponential. This would mean you never have to worry about your PageRank, even if you own the largest website on the web. In other words, don’t simply accept everything Matt says.
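The post does not show how the 140,625,000,000 figure was derived. It matches the observed page count squared, so the sketch below assumes that is the calculation behind it; treat it as a reading of the number, not as a model of how PageRank scales.

```python
def extrapolated_pages(pages_crawled: int) -> int:
    """Assumed calculation: square the page count observed at PageRank 0."""
    return pages_crawled ** 2


observed = 375_000  # pages crawled in 31 days, from the test

print(f"{extrapolated_pages(observed):,}")  # 140,625,000,000
```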
Amount of Links
Rand Fishkin says: “…you really can go above Google’s recommended of 100 links per page, with a PageRank 7.5 you can think about 250-300 links” ( http://www.seomoz.org/blog/whiteboard-friday-flat-site-architecture )