Search Engine People Blog

5 Million Spam Pages I Found in a Couple of Hours That Google Has Missed All Week

A friend of mine, Michael VanDeMar (aka mvandemar), alerted me to a situation that highlights the sub-subdomain spam problem that Google "supposedly" fixed, but apparently has not. Here is his research and report.
***
Last Saturday, June 17th 2006, an article was posted on how to get 5 billion pages indexed in Google in less than 30 days. The report was based on a series of domains from one particular spammer.

Google responded that in actuality, the counts reported were simply the result of a combination of a bug in the site: command and what they were calling a "bad data push". Here's what Google spokesperson Adam Lasnik, assistant to Matt Cutts, had to say on Tuesday:

I've long been a lurker / occasional commenter for quite some time here, and I figured I might as well offer a few clarifications on the "5 billion" issue :-).

I work with Matt Cutts and other engineers in the Search Quality Team at Google. And yes, we noticed that lots of subdomains got indexed last week -- and sometimes listed in search results -- that shouldn't have been. Compounding the issue, our result count estimates in these contexts was MANY orders of magnitude off. For example, the one site that supposedly had 5.5 billion pages in the index actually had under 1/100,000th of that.

So how did this happen? We pushed some corrupted data with our index. Once we diagnosed the problem, we started rolling the data back and pushed something better... and we've been putting in place checks so that this kind of thing doesn't happen again.

So, according to Google, the issue with the original site in question had been corrected, the bad data rolled back, and they were well on their way to making sure it didn't happen again.

I did a little looking around and found that quite a bit of this spam actually still seemed to be there. I pointed this out to Adam on Threadwatch and offered to write a bot to help dig out some of the flotsam I was finding. The offer, much like the numerous spam reports that I and many other webmasters have sent in to Google, was of course ignored.

Last night, while doing various searches on Google, I noticed that there seemed to be far more spam of the same variety that had been floating around before than there should have been, especially considering how hard the Google team had worked to assure us that this was merely a minor flaw, easily corrected. So, out of curiosity, I went ahead and wrote the bot for my own use. Coding and running it took right around 2 hours. This was just a rough draft of the bot; it could of course be refined to be much more accurate and to dig deeper, but I just wanted to see what a quick look around would return.
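(The bot itself was never published, so the following is only a rough sketch of how such a check might be automated: for each candidate domain, scrape Google's estimated result count for a site: query and keep the domains above a page threshold. The helper names, the result-count pattern, and the 5,000-page cutoff are illustrative assumptions; Google's result-page wording changes, and rapid automated queries will get blocked.)

```python
import re
import time
import urllib.parse

import requests  # third-party: pip install requests

SEARCH_URL = "https://www.google.com/search?q={query}"
# Assumption: the results page contains text like "About 12,345 results";
# the actual markup/wording may differ and has changed over the years.
COUNT_PATTERN = re.compile(r"About ([\d,]+) results")


def estimated_index_count(domain: str) -> int:
    """Return Google's estimated result count for site:domain (0 if none found)."""
    query = urllib.parse.quote_plus(f"site:{domain}")
    html = requests.get(
        SEARCH_URL.format(query=query),
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    ).text
    match = COUNT_PATTERN.search(html)
    return int(match.group(1).replace(",", "")) if match else 0


def tally_spam(domains: list[str], min_pages: int = 5_000) -> tuple[int, list[str]]:
    """Sum the indexed-page estimates for domains at or above the threshold."""
    total, flagged = 0, []
    for domain in domains:
        count = estimated_index_count(domain)
        if count >= min_pages:
            total += count
            flagged.append(domain)
        time.sleep(2)  # be polite; rapid-fire queries get blocked quickly
    return total, flagged


if __name__ == "__main__":
    candidates = ["example-spam-domain.com"]  # placeholder list of suspect domains
    total, flagged = tally_spam(candidates)
    print(f"{len(flagged)} domains, ~{total:,} pages indexed")
```

This only covers the counting step; the real work is in discovering the suspect domains and sub-subdomains to feed into a list like the one above in the first place.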

What I found was approximately 10,902,060 pages of spam spread across 157 domains, each with a minimum of 5,000 pages indexed in Google. The vast majority of these domains are less than 2 months old, with some as recent as 4 days. This comes shortly on the heels of Matt Cutts' response to webmasters who asked why so many of their sites were being deindexed: that, since Big Daddy, what was needed to get indexed in Google was better quality links.

75 of the domains have more than 55,000 pages indexed, which is about the number of pages the original domain was said to have. 26 of them have 3 to 5 times that number of pages.

Since it looks like Google may still be relying on blogs such as this one to root out spam, instead of finding it on their own, the domains will probably be banned in a few days. For now, however, you can click the link below to view The Spam That Google Couldn't Find:
http://googlespam.giantshoutbox.com/

-Michael VanDeMar
***
Michael is the owner of Better Mortgage Refinancing, a site which offers hassle-free mortgage quotes.

Note from DazzlinDonna to Matt, Adam, and the Google gang: I'd really love some no-nonsense comments from you guys about this. And remember, when you give us a comment, don't forget that we aren't your everyday clueless users, so please give us some credit. So, really, why is it that an average guy (sorry, Michael, you are above average, not to mention a really cute, single, minor programming deity) can whip this up in 2 hours, and yet Google is unable to catch this?