Those of you who have spent years building up a legitimate website may be surprised at how harsh Panda can be. Sites with hundreds or thousands of pages are at particularly high risk of being hit, and going through thousands of pages by hand looking for the flags that trigger Panda is impractical. How can you find out which of your pages have been hit, or may be targeted, and how can you fix them?
One way to find out which parts of your site were hit is by looking at your analytics. Head to your Google Analytics dashboard and click "search engines" on the side, under traffic sources. Then click Google, open the keyword tab, filter by U.S. location, and click "non-paid." What does the graph show? Here are the Panda update dates to check:
- Feb 24, 2011
- April 11, 2011
- May 10, 2011
- June 16, 2011
- July 23, 2011
- August 12, 2011
- September 28, 2011
- October 9, 13, 20, 2011
- November 18, 2011
- January 15, 2012
- February 28, 2012
- March 23, 2012
- April 19, 24, 27, 2012
- June 7, 8, 25, 2012
- July 24, 2012
- August 20, 2012
- September 18, 2012
Check your traffic before and after each update. Did it drop more than the usual ups and downs? If so, you were probably hit. Find out which pages on your site took the largest traffic hit and analyze them to see what's wrong. Panda looks for low-quality content, and chances are these pages have something that fits that definition. Is there too little content on the pages? How about duplication -- how many other places on your site is that same information posted? All of these are warning signs.
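If you can export daily visit counts from your analytics, the before-and-after comparison can be sketched in a few lines of Python. This is only a rough illustration: the `daily_visits` mapping, the function names, the 7-day window, and the 20% drop threshold are all assumptions made for the example, not anything Google or Google Analytics defines.

```python
from datetime import date, timedelta
from statistics import mean


def traffic_change(daily_visits, update_day, window=7):
    """Fractional change in mean daily visits across update_day.

    Compares the `window` days before the update with the `window` days
    after it. The update day itself is excluded, and days missing from
    daily_visits are skipped. Returns None if there is nothing to compare.
    """
    before = [daily_visits[d] for d in
              (update_day - timedelta(days=i) for i in range(1, window + 1))
              if d in daily_visits]
    after = [daily_visits[d] for d in
             (update_day + timedelta(days=i) for i in range(1, window + 1))
             if d in daily_visits]
    if not before or not after or mean(before) == 0:
        return None
    return (mean(after) - mean(before)) / mean(before)


def likely_hits(daily_visits, update_days, threshold=-0.2):
    """Return (day, change) for updates followed by a drop of at least
    `threshold` (an arbitrary cutoff chosen for this sketch)."""
    hits = []
    for day in update_days:
        change = traffic_change(daily_visits, day)
        if change is not None and change <= threshold:
            hits.append((day, change))
    return hits
```

A 7-day window covers one full week on each side, which helps keep ordinary weekday and weekend swings from looking like a penalty. Run it once per landing page to see which pages lost the most traffic.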
Duplicate Content Search
Searching your site for duplicate content, either after you've been hit by Panda or as a preemptive measure, is a time-consuming task. You could do a Google in-site search for phrases from each of your pages, but that's not guaranteed to find everything. It would find exact matches, but Google looks for more than just matching text. Moreover, on a site with thousands or tens of thousands of pages, that alone is a lifetime's worth of work.
The best way to search your content for duplication is using a metric called the Levenshtein distance (kudos for this tip to Corey Northcutt). It measures how many single-character insertions, deletions, or substitutions it would take to transform one string of text into another. Two lines with a low Levenshtein distance may appear different to exact-match searches, but may look the same to Google. Thankfully, you can implement Levenshtein distance searches to check your site for duplications.
The one problem with using the Levenshtein algorithm to analyze your site is that it requires some tricky coding. If you're comfortable with the tech involved, it's an ideal solution. Unfortunately, it's not the best solution for everyone. More on the Levenshtein algorithm and how to implement it here.
If you can't implement the code, you can always check the Diagnostics tab in Google Webmaster Tools and look at HTML suggestions. If multiple pages appear with identical or very similar text, they might be triggering Panda.
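For the curious, here is a minimal Python sketch of the idea. The distance function is the standard dynamic-programming version. The `similarity` and `near_duplicates` helpers, the `pages` mapping of URL to page text, and the 0.9 threshold are assumptions made for illustration, not part of the algorithm itself and not a known Panda cutoff.

```python
from itertools import combinations


def levenshtein(a, b):
    """Number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to 0.0 (nothing shared) .. 1.0 (identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def near_duplicates(pages, threshold=0.9):
    """Return URL pairs whose text similarity is at least `threshold`.

    `pages` maps URL -> page text (a hypothetical structure for this sketch).
    """
    return [(url_a, url_b)
            for (url_a, text_a), (url_b, text_b)
            in combinations(pages.items(), 2)
            if similarity(text_a, text_b) >= threshold]
```

For example, `levenshtein("kitten", "sitting")` is 3. Keep in mind that this compares every page with every other page and each comparison grows with the product of the two text lengths, so on a site with thousands of pages you would likely want to compare titles, descriptions, or short excerpts rather than full page bodies.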
Penalized For Scraped Content
One dangerous Panda trigger is when your content is posted on multiple websites. Checking to see if any of your content is copied is easy -- just copy and paste a unique part of your content into Google in quotes and check the results, or even better, use Copyscape. If you've been targeted by scraper sites, you'll see several exact matches. If you're unlucky, your site may not even appear on the first page. Thankfully, you can save the URLs and report them directly to Google.
In a thousand-plus page site, template content becomes a concern as well. Any e-commerce site with a template for product descriptions can run afoul of this low-quality rule. Thankfully, as a site owner you should know if you implemented templates across your site, and can identify and fix the pages accordingly.
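If you have many pages to check by hand, a small helper can at least prepare the quoted searches for you. The sketch below is an assumption-heavy illustration: it treats the longest sentence on a page as the "unique part," which is only a heuristic, and the function names and the `min_words` cutoff are made up for the example. It only builds the query and a search link to open in your browser; it does not fetch or parse results.

```python
import re
from urllib.parse import quote_plus


def exact_match_query(text, min_words=8):
    """Pick the longest sentence with at least `min_words` words and
    wrap it in double quotes for an exact-match search.

    Returns None if no sentence is long enough to be distinctive.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    candidates = [s for s in sentences if len(s.split()) >= min_words]
    if not candidates:
        return None
    chosen = max(candidates, key=len).replace('"', "")
    return f'"{chosen}"'


def search_url(query):
    """Build a Google search link for the query, to open manually."""
    return "https://www.google.com/search?q=" + quote_plus(query)
```

Pick a sentence from the body of the page rather than from a header or footer, since boilerplate that appears on every page will match your own site over and over.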