10 Things Google Wished You Knew

by Ruud Hein August 23rd, 2011 

Still trying to read Google's hidden signals and figure out their true meaning?

Sometimes a cigar is just a cigar and a kiss just a kiss. Here are 10 "take them at face-value" things Google wished you knew.

1. There is no duplicate content penalty

"Let's put this to bed once and for all, folks: There's no such thing as a 'duplicate content penalty.'"
– Susan Moskwa, Webmaster Trends Analyst, in Demystifying the "duplicate content penalty"

Scraping others' content can get you in trouble, sure, but accidental, non-malicious, on-site duplicate content does not earn you minus points or a penalty.

On larger, dynamic sites, duplicate content can create a near-infinite number of pages to crawl, while Google allots each site a finite amount of crawl time depending on its importance. Wasting Googlebot's time on your site by feeding it duplicate content? Almost as good an idea as trying to get a penalty.
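
To see how quickly a dynamic site multiplies its URLs, here is a minimal Python sketch. It assumes that a few query parameters (the names in IGNORED_PARAMS are hypothetical) change the URL but not the page, and it collapses the variations into one canonical address. This is only an illustration of the problem, not how Google deduplicates.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that change the URL but not the page content.
IGNORED_PARAMS = {"sessionid", "sort", "utm_source"}


def canonical_url(url: str) -> str:
    """Drop content-neutral parameters and sort the remaining ones."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Lower-case the host and discard the fragment; neither changes the page.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


# Three crawlable addresses that all serve the same page.
urls = [
    "http://example.com/shoes?color=red&sessionid=abc",
    "http://example.com/shoes?sessionid=xyz&color=red",
    "http://EXAMPLE.com/shoes?color=red&sort=price",
]
unique_pages = {canonical_url(u) for u in urls}
```

All three addresses reduce to a single page, which is the one you would want Googlebot to spend its time on.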

2. We see your NOSCRIPT & raise you a "yeah, right"

"One of the problems with noscript is – as others have mentioned – that it's been abused quite a bit by spammers, so search engines might treat it with some suspicion. So if this is really important content, then I wouldn't rely on all search engines treating your noscript elements in the same way as normal, visible, static content on your pages."
– John Mueller, Webmaster Trends Analyst, in Best way to include static content in dynamic pages?

The only thing missing from John Mu's statement is a "wink wink, nudge nudge" after his "might" treat it with suspicion. Cold hard fact: content in <noscript> loses almost all its value.

If you want to cater to people with JavaScript turned off, the way to do it is with unobtrusive JavaScript.

3. We don't do meta keywords

"Our web search (the well-known search at Google.com that hundreds of millions of people use each day) disregards keyword metatags completely. They simply don't have any effect in our search ranking at present."
– Matt Cutts, Search Quality Team, in Google does not use the keywords meta tag in web ranking

It's not just for search ranking that Google ignores the meta keywords; it doesn't even use them for retrieval. Whatever you put in your meta keywords is 100% useless at Google.

4. Use 503 "Away" Server Code & We'll Be Back Under 24 Hours (but you can still serve content)

"The interesting thing about a 503 HTTP result code (or most others) is that you can serve normal content to your users and it will only be recognized by those that explicitly watch out for the result codes, usually only search engine crawlers. [...] return the 503 HTTP result code with your "we're currently closed" message, so that users can see the message, but search engine crawlers know to ignore the content and to come back another day (in practice, they'll probably try to come back sooner than that)."
– John Mueller, Webmaster Trends Analyst, in Can I restrict Google from crawling my site on a specific day of the week?

Any "we're not serving content at the moment" kind of situation, like upgrading your CMS or other site work, should see your server returning a 503 status code; you can still show visitors regular content if needed.
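
As a rough illustration, here is a minimal sketch using Python's standard-library http.server. It answers every GET with a 503 status while still sending a readable "we're closed" page. The page text, the make_server helper and the Retry-After value of one hour are assumptions for the example. Retry-After is a standard HTTP header, but the quote above does not mention it.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder page; visitors see this, crawlers act on the status code.
MAINTENANCE_PAGE = (
    b"<html><body><h1>We're currently closed</h1>"
    b"<p>Back soon.</p></body></html>"
)


class MaintenanceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 503 tells crawlers the outage is temporary.
        self.send_response(503)
        # Assumed value: suggest retrying in one hour.
        self.send_header("Retry-After", "3600")
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(MAINTENANCE_PAGE)))
        self.end_headers()
        self.wfile.write(MAINTENANCE_PAGE)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port: int = 0) -> HTTPServer:
    """Build the maintenance server; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), MaintenanceHandler)


# To run it locally: make_server(8000).serve_forever()
```

In a real setup you would more likely configure this in your web server or CMS than run a separate script, but the response has the same shape: status 503, normal-looking body.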

5. There is no supplemental index (anymore)

"Now we're coming to the next major milestone in the elimination of the artificial difference between indices: rather than searching some part of our index in more depth for obscure queries, we're now searching the whole index for every query."
– Yonatan Zunger, Search Quality Team, in The Ultimate Fate of Supplemental Results

The supplemental index was a necessary part of the old disk-based index. Inserting new information in the index and then re-sorting was an "expensive" operation. Part of the solution was to insert only important documents "right away" into the main index and push the less important ones into a supplemental index.

Nowadays Google's complete index of the web is stored in memory, not on disk, and parts of the index can be decompressed at will. Inserting new documents and inserting new ranking systems (read: sorting) is super easy and can be done in real-time.

6. We care about valid HTML... NOT!

"Seriously… I don't want to discourage anyone from validating their site; however, unless it's REALLY broken, we're likely going to be able to spider it pretty decently. [...]

Being more specific: I'm betting that in the vast majority of cases in which folks have indexing or ranking concerns, the core issue is NOT that their site doesn't perfectly validate"
– Adam Lasnik, speaking as webmaster liaison, in Is W3C validation really essential for Google to list my site?

To wilfully disregard good, clean, valid code is economic insanity: where valid code is inherently cross-browser, cross-device and cross-platform compatible, bad code isn't. It causes you either to miss out on opportunities that should have been yours from the start or to spend extra bucks, time and time again, simply to catch up.

While all that is true, valid code isn't one of Google's 200 ranking factors.

7. We adhere to the robots protocol, except crawl-delay

"[...] the reason that Google doesn't support crawl-delay is because way too many people accidentally mess it up. For example, they set crawl-delay to a hundred thousand, and, that means you get to crawl one page every other day or something like that.

We have even seen people who set a crawl-delay such that we'd only be allowed to crawl one page per month. What we have done instead is provide throttling ability within Webmaster Central [...]"
– Matt Cutts, Search Quality Team, in Eric Enge Interviews Google's Matt Cutts

If you really feel the need to set crawl-delay, it might be time to look into another host, server or server setup. If your setup can run into serious problems when a large number of page requests are made rapidly, you're likely unable to deal with a promoted blog post going viral or getting linked to by huge traffic drivers like Techmeme, Lifehacker, or the New York Times.
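
The arithmetic behind Matt's "hundred thousand" example is easy to check. The sketch below reads a made-up robots.txt with Python's standard-library urllib.robotparser, which does parse the Crawl-delay directive (Google, per the quote above, does not honor it), and works out how many pages a crawler obeying that delay could fetch per day.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt using the value from the quote above.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 100000
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds a compliant crawler would wait between requests.
delay_seconds = parser.crawl_delay("*")

# 86,400 seconds in a day, so this is the daily page budget.
pages_per_day = 86_400 / delay_seconds
print(f"{delay_seconds} s between requests = {pages_per_day:.3f} pages per day")
```

A delay of 100,000 seconds works out to less than one page per day, which is why a typo in this directive can quietly starve a site of crawling.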

8. The cached version of your page doesn't correspond to our last crawl

"In general, we do not always update the cached page every time that we crawl a page. Especially when the page does not significantly change, we may opt to just keeping the old date on it."
– John Mueller, Webmaster Trends Analyst, in Google cache of index page does not change

Regardless of how many times a day or year Googlebot comes by, the cached version of one or more pages on your site isn't always updated.

9. TLD trumps hosting location for geo-targeting

"if your site has a geographic TLD/ccTLD (like .co.nz) then we will not use the location of the server as well. Doing that would be a bit confusing, we can't really "average" between New Zealand and the USA… At any rate, if you are using a ccTLD like .co.nz you really don't have to worry about where you're hosting your website, the ccTLD is generally a much stronger signal than the server's location could ever be."
– John Mueller, Webmaster Trends Analyst, in hosting server IP address importance to SEO

An "oh cool" remark for a lot of people, we're sure. Got the domain name extension that goes with the country you want to talk to? No worries about where you're hosting. Of course, if you do want to target other countries, then you have some work to do.

10. Pages blocked in robots.txt can still get PR

"a page that is blocked by robots.txt can still accrue PageRank. In the old days, ebay.com blocked Google in robots.txt, but we still wanted to be able to return ebay.com for the query [ebay], so uncrawled urls can accumulate PageRank and be shown in our search results."
– Matt Cutts, Search Quality Team, in PageRank sculpting
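
Why can an uncrawled URL accumulate PageRank? Because the score comes from links pointing at a page, not from the page's own content. The toy example below uses the simplified textbook PageRank iteration (not Google's production system) on three hypothetical URLs. "shop.example/" stands in for a page blocked by robots.txt, so the crawler knows its inbound links but none of its outbound ones.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified textbook PageRank over a dict of page -> list of outlinks."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Pages with no known outlinks spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical link graph: the shop page is never crawled, so it has no
# entry of its own, yet two crawled pages link to it.
links = {
    "blog.example/post": ["shop.example/"],
    "news.example/story": ["shop.example/", "blog.example/post"],
}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
```

In this toy graph the blocked page ends up with the highest score of the three, purely on the strength of the links pointing at it.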

Ruud Hein

My paid passion at Search Engine People sees me applying my passions and knowledge to a wide array of problems, ones I usually experience as challenges. People who know me know I love coffee.


19 Responses to “10 Things Google Wished You Knew”

  1. Fabio says:

    Seriously, I found it very interesting, especially the ccTLD matter. Living in Italy, a lot of customers are worried about the possible problems that a domain such as .it may have for SEO and such. And the keyword in meta tags…well…thanks. I'll make my customers read it asap, since there's no way they will believe me that they are useless in SEO, at least if they aim to get a high position on Google.

  2. Dr. Pete says:

    You know I love you, Ruud, but I bristle every time I hear #1, because Google is only talking about a Capital-P penalty by their strict definition. I have seen on-site duplicate content ruin sites' (multiple) rankings, and removing duplicates has had near-miraculous SEO impact (as much as 3X search traffic improvement). Now, with Panda, Google's message on duplicate content is clearly mixed. Massively duplicate internal content and "thin" pages (not just cross-domain) now not only mean those pages are impacted, but your entire site. Whether it's a Capital-P penalty doesn't matter, IMO – the results can be disastrous.

    Google has a bad habit of saying "let us deal with it" when it comes to on-site duplication, and in my experience, they are terrible at dealing with it. This is one piece of their advice that I routinely ignore for the good of my clients.

    • Ruud Hein says:

      Any time Google says "let us deal with it", we say "nuh-uh!" :)

      I hear what you say. It becomes a "filter" vs "penalty" debate at one point. Fact is, more pages + thinner link spread = worse ranking. Removing underperforming pages, which duplicates tend to be too, will always help then. But that's due to the low PR spread on the site, not 100% pure dup. (Guess that means we sorta agree)

      As for Panda, we now see people being afraid to cover a subject that's already been covered elsewhere on their site while clearly what we're talking about is ehow's kind of "how to pop a pimple", "how to pop a pimple in 3 simple steps", "pimple popping how to guide" kind of repetition.

    • Dr. Pete says:

      @Ruud – Yeah, that's certainly true. People take everything to extremes. We shouldn't be afraid to write about the same topic more than once or duplicate one keyword. Every good site has a theme and focus, and it's bound to contain conceptual duplication. The real problem is when you have 1000s of pages that are essentially the same, or 100s of pages with 10X URL variations each, or an entire site that's nothing but syndicated or scraped content in your own wrapper (which you bought off a WordPress theme site).

    • Steve says:

      For what it's worth I completely agree with Dr. Pete, duplicate content in the wake of Panda can be a huge problem. My site received a -950 penalty which we are starting to shake off now that we have removed our duplicate content.

  3. Some penalties are self-imposed, though. Duplicate content might result in a Panda downgrade but we have known for a long time that PageRank can be split across multiple copies of an article, and that MIGHT produce undesired secondary effects.

    The Supplemental Index article irked me at the time Google published it (you can see my testy responses in the comments). While it's true that Google has improved its CRAWLING for the entire Web, it continues to parse and index some pages less than others.

    Whether that is because there is still a Supplemental Index or there is merely some "secondary order effect" doesn't really matter. When the search engine ignores highly relevant content in favor of well-linked content the user experience suffers.

    • Ruud Hein says:

      True, it's a tiny bit of a semantic distinction. The way the search index itself is constructed certainly doesn't have a supplemental index anymore; that's purely on the construction level. On the crawling level they use priority. But on the IR level itself there's already a pre-selection where if you're either not in the vector or maybe its neighborhood, you're not even in the first grab.

      As for relevancy — I stick by my 2010 idea that "The intention of search engines is to sort by Relevancy. They can’t. They don’t. [...] relevancy is while voting and popularity are mere calculations."

  4. Micky says:

    Thanks for this list Ruud. Being in Australia, I was particularly interested in point 9 about geo-targeting.

    From what I've always read, I'd been under the impression that having both a country specific domain AND local hosting would help make your site more relevant in your country version of Google.

    This is the first time I've read that if your site has a ccTLD such as .com.au there's no benefit at all to having your site hosted in Australia as well.

    If that's true, it has huge implications for choosing a web host. Good Australian web hosting packages are much more expensive than in the US. So as long as you have a .com.au you can apparently just have your site hosted in the US without any negative impact on your site's ranking in Australia. That's great news!

  5. Hi Ruud,
    Yes, point number 9 about Geo targeting is interesting – thanks.
    We are forever trying to work out what Google thinks and wants!
    Regards
    Catherine

  6. Sal Surra says:

    Nice list of items and quotes. It's good to revisit things like this from time to time to debunk and move on from the past. Most of this was old hat, but I'm surprised to see that Google will still rank a page even though it was blocked in the robots.txt file. I guess that makes sense since it's only blocking the bot from crawling the page. When I want to remove a page from the engine I typically return a 410 if it is gone permanently, like low quality pages, or a 301 to a good page if there is relevant content for it. However, even in those cases, we don't block the engines from crawling because then the redirect or the gone status will not work.

  7. Rana says:

    Ruud,

    Thank you for clarifying this great list of "myths". There is a lot of confusion about duplicate content out there, but I think point #1 made it really crystal clear.

  8. jayson says:

    We don't do meta keywords… This one is epic. There are still people using it and expecting that it can help them rank, stuffing the meta keywords property with all the keywords they can fit… I'd rather stick with anchor text targeting. Thanks a lot for this post Ruud :)

  9. Thos003 says:

    Have you done any testing on stuffing your keyword meta tags? They may not use it to bump you up in the ranks, but they could use it to push you down.