
The Google 302 pagejacking problem explained

Previously, I have discussed the serious problem Google has with handling 302 redirects, and how it is causing many sites to have their pages hijacked in the SERPs. (You can review these in this post on Sept. 27, 2004, this post on Nov. 4, 2004, this post on Dec. 22, 2004, and this post on Jan. 27, 2005, plus several brief mentions within other posts.) Many people have discussed the problem, but most never really understood its mechanics (myself included). Hopefully, this post will enable everyone to get a grip on it.

Claus explained in a WebmasterWorld post exactly how Google deals with 302 redirects, and how this can lead to having your page hijacked in the SERPs. It is an excellent post that helps everyone understand the problem in an easy-to-read format. In the post (Post #54), he states that no one may republish it, but he later changes his mind (Post #279, with conditions). So, to spread awareness of the problem and hopefully bring about a resolution, I am republishing his post here, and I hope that others will do the same (make sure you read the conditions in Post #279 before doing so). Like many others, I am opposed to any kind of hijacking, and I do not post this to encourage others to try it. It must be stopped. Google must take action, no matter how difficult that is, to end this practice. Now on with his post...

The full story of Google and 302s

Fine print: I may want to republish this on my own site later on (usually when i say this i don't even bother), but otherwise it's one of those "you saw it on WebmasterWorld first" posts, so it's not intended for republishing all across the web. Yes, it means: Please don't republish if you didn't write it, which you didn't.

🙂

...just clearing up a few misunderstandings first, then you'll get the full lowdown on this stuff.

You can't ban 302 referrers as such

Why? Because your server will never know that a 302 is used for reaching it. This information is never passed to your server, so you can't instruct your server to react to it.

You can't ban a "go.php?someurl" redirect script

Why? Because your server will never know that a "go.php?someurl" redirect script is used for reaching it. This information is never passed to your server, so you can't instruct your server to react to it.
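To make the two points above concrete, here is a minimal sketch using only Python's standard library. It starts two throwaway local servers: one stands in for your site and records every request it receives, the other stands in for a hypothetical "go.php?url=..." script that answers with a 302. The handler names, the `url` query parameter, and the `/page.html` path are all assumptions made up for this illustration, and the client is Python's own `urllib`, not Googlebot.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, quote, urlsplit

seen_requests = []  # (path, headers) for every request the target server receives


class TargetHandler(BaseHTTPRequestHandler):
    """Stands in for your server: records what each incoming request looks like."""

    def do_GET(self):
        seen_requests.append((self.path, dict(self.headers.items())))
        body = b"your content"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass


class RedirectHandler(BaseHTTPRequestHandler):
    """Stands in for a hypothetical go.php?url=... script on another site."""

    def do_GET(self):
        target = parse_qs(urlsplit(self.path).query)["url"][0]
        self.send_response(302)  # "Found": a temporary redirect
        self.send_header("Location", target)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass


def start(handler):
    server = HTTPServer(("127.0.0.1", 0), handler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def run_demo():
    """Fetch the target once directly and once via the 302 script."""
    seen_requests.clear()
    target = start(TargetHandler)
    redirector = start(RedirectHandler)
    target_url = f"http://127.0.0.1:{target.server_port}/page.html"
    go_url = (
        f"http://127.0.0.1:{redirector.server_port}/go.php"
        f"?url={quote(target_url, safe='')}"
    )

    # Disable proxies so both requests really reach the local servers.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    opener.open(target_url).read()  # direct visit
    opener.open(go_url).read()      # visit through the redirect script

    for server in (target, redirector):
        server.shutdown()
        server.server_close()
    return seen_requests[0], seen_requests[1]


if __name__ == "__main__":
    direct, via_redirect = run_demo()
    print("same path and headers:", direct == via_redirect)
    print("headers seen via redirect:", via_redirect[1])
```

In this sketch the target server has nothing to tell the two visits apart: the request that arrives after the 302 names only the target's own path and host, and carries no mention of the redirect script.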

Even if you could, it would have no effect with Google

Why? Because Googlebot does not carry a referrer with it when it spiders, so you don't know where it's been before it visited you. As already mentioned, Googlebot could have seen a link to your page in a lot of places, so it can't "just pick one". Visits by Googlebot have no referrers, so you can't tell Googlebot that one link that points to your site is good while another is bad.

You CAN ban clickthrough from the page holding the 302 script - but it's no good

Yes you can - but this will only hit legitimate traffic, meaning that surfers clicking from the redirect URL will not be able to view your page. It also means that you will have to maintain an ever-increasing list of individual pages linking to your site.

For Googlebot (and any other SE spider) those links will still work, as they pass on no referrer.
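A small sketch of why referrer-based blocking misses its target. The blocklist, the `should_block` helper, and the example URL are hypothetical, and headers are modelled as a plain dict rather than any particular server's API:

```python
# Hypothetical list of pages known to host the redirect links.
BLOCKED_REFERRER_PAGES = {
    "http://other-site.example/links.html",
}


def should_block(request_headers):
    """Return True if the request names a blocked page in its Referer header."""
    referer = request_headers.get("Referer")  # None when no referrer is sent
    return referer in BLOCKED_REFERRER_PAGES


# A surfer clicking the link: the browser sends the page they came from.
surfer = {
    "User-Agent": "SomeBrowser/1.0",
    "Referer": "http://other-site.example/links.html",
}

# A spider, as described above: no Referer header at all.
spider = {"User-Agent": "ExampleBot/1.0"}

if __name__ == "__main__":
    print("surfer blocked:", should_block(surfer))
    print("spider blocked:", should_block(spider))
```

The rule only ever fires for the human visitor. The spider's request carries nothing for the rule to match on, and every new page that links through the script has to be added to the list by hand.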


This is what really happens when Gbot meets 302:

Here's the full lowdown. First time i post it all. It's extremely simplified to benefit the non-tech readers among us, and hence not 100% accurate in the finer details, but even though i really have tried to keep it simple you may want to read it twice:

  1. Googlebot visits a page holding e.g. a redirect script
  2. Googlebot indexes the content and makes a note of the links
  3. Links are sent to a database for storage until another Googlebot is ready to spider them. At this point the connection breaks between your site and the site with the redirect script, so you (as webmaster) can do nothing about the following:
  4. Some other Googlebot tries one of these links
  5. It receives a "302 Found" status code and goes "yummy, here's a nice new page for me"
  6. It then receives a "Location: www.your-domain.tld" header and hurries to that address to get the content for the new page.
  7. It deliberately chooses to keep the redirect URL, as the redirect script has just told it that the new location (That is: your URL) is just a temporary location for the content. That's what 302 means: Temporary location for content.
  8. It heads straight to your page without telling your server on what page it found the link it used to get there (as, obviously, it doesn't know - another Googlebot fetched it)
  9. It has the URL (which is the link it was given, not the page that link was on), so now it indexes your content as belonging to that URL.
  10. Bingo, a brand new page is created (never mind that it does not exist IRL, to Googlebot it does)
  11. PR for the new page is assigned later in the process. My best bet: This is an initial calculation that is done something like: PR for the page holding the link less one.
  12. Some other Googlebot finds your page at your right URL and indexes it.
  13. When both pages arrive at the reception of the "index" they are spotted by the "duplicate filter" as it is discovered that they are identical.
  14. The "duplicate filter" doesn't know that one of these pages is not a page but just a link. It has two URLs and identical content, so this is a piece of cake: Let the best page win. The other disappears.
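The steps above can be sketched as a toy model. This is only an illustration of the logic as Claus describes it, not Google's actual code: the `WEB` dictionary, the URLs, the `score` field (a stand-in for PR), and the tie-breaking rule are all assumptions, and only a single redirect hop is handled.

```python
from dataclasses import dataclass

# A toy web: URL -> (status, payload). For a 200 the payload is the page
# content; for a 302 it is the Location target. All URLs are hypothetical.
WEB = {
    "http://your-domain.example/": (200, "your original content"),
    "http://other-site.example/go.php?id=1": (302, "http://your-domain.example/"),
}


@dataclass(frozen=True)
class IndexedPage:
    url: str
    content: str
    score: int  # stand-in for PR; the values used below are made up


def fetch_and_index(url, score):
    """Steps 4-10: fetch a URL and decide which URL the content is filed under."""
    status, payload = WEB[url]
    if status == 302:
        # The redirect is temporary, so keep the URL that was requested and
        # take the content from the Location target (one hop only).
        _, content = WEB[payload]
        return IndexedPage(url, content, score)
    return IndexedPage(url, payload, score)


def duplicate_filter(pages):
    """Steps 13-14: for identical content, keep only the best-scoring URL."""
    best = {}
    for page in pages:
        kept = best.get(page.content)
        if kept is None or page.score > kept.score:
            best[page.content] = page
    return list(best.values())


if __name__ == "__main__":
    via_redirect = fetch_and_index("http://other-site.example/go.php?id=1", score=5)
    original = fetch_and_index("http://your-domain.example/", score=4)
    for page in duplicate_filter([via_redirect, original]):
        print(page.url, "->", page.content)
```

With these made-up scores, the filter keeps the redirect URL and drops the real one, because it sees two URLs with identical content and has no way of knowing that one of them is only a link.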

So, essentially, by doing the right thing (interpreting a 302 as per the RFC) Google allows another webmaster to convince its bot that your website is nothing but a temporary holding place for content.

Further, this leads to creation of pages in the index that are not real pages. And, you can do nothing about it.

Thanks, Claus, for that excellent explanation.