Was the announcement that Google doubled the number of pages indexed really true? And if true, is there more to it that Google doesn't want the press and public to know? Is there a sandbox, and if so, why does it exist?
Fact: Google announced they had approximately doubled the number of documents they index right before Microsoft announced their new beta MSN search engine. Many people speculated about the convenient timing of the announcement, since MSN was preparing to claim that they had the most pages indexed of the major search engines. Google's announcement squashed that.
Fact: Many sites, when checked with the site: command, show far more pages indexed than they actually contain. (Some of my own sites show approximately double the number of pages with a site: command than they actually have.) Either Google is indexing pages that should not be indexed (i.e. ignoring robots.txt or the meta robots tag), or the reported numbers are sometimes inflated or miscalculated. A rough way to check a site's true indexable page count is sketched below.
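For anyone who wants to run the same comparison, here is a minimal sketch, assuming Python, of how one might count the pages on a site that a well-behaved crawler should actually index (respecting robots.txt and the meta robots tag). The domain and URL list below are placeholders; in practice the list would come from the site's own sitemap or a crawl.

```python
# Minimal sketch: count how many of a site's own URLs are actually eligible
# for indexing, so that total can be compared with what a site: query reports.
# The domain and URL list are placeholders, not real data.
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Detects <meta name="robots" content="...noindex..."> in a page."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta" and a.get("name", "").lower() == "robots":
            if "noindex" in a.get("content", "").lower():
                self.noindex = True


def indexable_count(base, urls):
    """Return how many of `urls` a well-behaved crawler should index."""
    robots = RobotFileParser()
    robots.set_url(base + "/robots.txt")
    robots.read()

    count = 0
    for url in urls:
        if not robots.can_fetch("*", url):   # excluded by robots.txt
            continue
        page = urlopen(url).read().decode("utf-8", errors="ignore")
        parser = MetaRobotsParser()
        parser.feed(page)
        if not parser.noindex:               # not excluded by meta robots
            count += 1
    return count


if __name__ == "__main__":
    site = "https://www.example.com"              # placeholder domain
    pages = [site + "/", site + "/about.html"]    # e.g. pulled from the sitemap
    print(indexable_count(site, pages), "pages should be eligible for indexing")
```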
Theory that is starting to be accepted as fact (although not by everyone): A sandbox exists wherein new sites (created sometime in 2004, usually after March or April) do not rank for competitive terms, even when every measurable factor suggests they probably *should* rank for those terms when compared against the sites that do rank.
My current beliefs: (subject to change on a whim or with more information)
- There is something fishy about the number of indexed documents that Google now claims.
- The sandbox exists, and one of the two theories about it described below is probably the correct one.
What reasons would Google have for mis-stating the number of documents indexed?
- Their calculations are wrong and they either don't know it yet, or didn't know it when they announced it.
- It was simply a marketing ploy to distract from MSN's announcement.
- The number is technically correct, but it includes pages that should never have been indexed, or that should not be counted as part of the index.
- The number is technically correct, but not all indexed pages live in the same index, and not all of them are used for ranking every search term.
- Something I haven't considered yet.
What are the two theories that I believe are the most plausible for explaining the sandbox effect?
- To combat spam and prevent link purchases from skewing link popularity, Google is forcing new sites through a long period of judgement before granting them a stamp of approval. In the meantime, the anchor text value and PR value of backlinks to a new page are not applied immediately. Instead, the value is phased in little by little until, after a certain amount of time, the full value is applied to the site (see the first sketch after this list).
- The long-discussed possibility that Google may not have the capacity to fit all pages into its main index has caused them to create new indexes (similar to the supplemental index). These new indexes (which contain new pages and/or sites) are only factored into the rankings for obscure (i.e. non-competitive) phrases, where the main index does not provide enough relevant results. If the main index does provide enough relevant results, the secondary indexes are never consulted, thereby eliminating new sites from the rankings for those terms (see the second sketch below). There are some excellent posts over at WebmasterWorld that give good insight into this theory; specifically, see message numbers 201 and 209 of this thread.
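Neither theory is anything more than speculation, so purely to make the mechanics concrete, here are two small Python sketches. Every number, name, and data structure in them is invented for illustration; they are not claims about how Google actually works.

First, the probation theory, assuming a hypothetical linear phase-in of backlink value over an arbitrary one-year period:

```python
# Illustration of the probation theory only; the probation length and the
# linear phase-in are arbitrary assumptions, not Google's actual algorithm.
def dampened_link_value(raw_value, age_in_days, probation_days=365):
    """Scale a backlink's raw value (anchor text weight, PR) by how far
    through the hypothetical probation period the link/site is."""
    factor = min(age_in_days / probation_days, 1.0)  # ramps from 0.0 to 1.0
    return raw_value * factor


print(dampened_link_value(10, 90))   # ~2.47: most of the value is withheld
print(dampened_link_value(10, 400))  # 10.0: full value once probation ends
```

Second, the secondary-index theory: a tier of newer pages that is consulted only when the main index cannot fill the result set, which is exactly the behavior that would keep new sites out of competitive results:

```python
# Illustration of the secondary-index theory only; the data is invented.
def search(query, main_index, secondary_index, needed=10):
    """Fill results from the main index first; fall back to the secondary
    index (new pages/sites) only if the main index comes up short."""
    results = list(main_index.get(query, []))[:needed]
    if len(results) < needed:  # obscure query: the main index is not enough
        results += secondary_index.get(query, [])[:needed - len(results)]
    return results


main = {"competitive term": ["established-%d.com" % i for i in range(20)]}
secondary = {"competitive term": ["brand-new-site.com"],
             "obscure term": ["brand-new-site.com"]}

print(search("competitive term", main, secondary))  # the new site never appears
print(search("obscure term", main, secondary))      # ['brand-new-site.com']
```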
Of course, there may be other explanations that will prove more plausible in time. In addition, some people firmly believe that there is no such thing as a sandbox, and that Google's Don't Be Evil policy would rule out any of the above speculation. They may be right, but in this case I believe there is so much smoke that we have no choice but to start looking for the fire. And if the fire exists, then Google would be guilty of The Great Google Hoodwink.