Search Engine People Blog

Finding Pages With Duplicate Or Thin Content Using A1 Website Analyzer


The larger and older the website you're working on, the harder it becomes to find duplicate content and thin content.

Add external duplication to (too much) internal duplication and your site can be demoted or even completely filtered out in search results.

Add too much thin content, especially in combination with internal and external duplication, and you're looking at a slap on the wrist from Google's Panda algorithm.

How To Find Thin Content And Duplicate Content

While Google has said that short content doesn't equal thin content, word count is a reasonably good proxy for spotting what might be thin content.

For duplicate content you could just do a "site:" query in Google, feeding it a phrase from a page. That's good enough for a one-page spot check, but what if you want to find duplicate content across your whole site? And what if you want to catch not only exact duplicates but near duplicates as well?
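To make the difference between exact and near duplicates concrete, here is a minimal Python sketch of the exact case. It assumes you already have the extracted text of each page in a dictionary keyed by URL; the function names and the sample data are made up for illustration and are not part of A1 Website Analyzer.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(text):
    """Hash the text after collapsing whitespace and ignoring case."""
    normalized = " ".join(text.casefold().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def exact_duplicates(pages):
    """Group URLs whose normalized text is identical.

    `pages` is a hypothetical {url: page_text} mapping.
    """
    by_hash = defaultdict(list)
    for url, text in pages.items():
        by_hash[content_fingerprint(text)].append(url)
    # Keep only fingerprints shared by more than one URL.
    return [urls for urls in by_hash.values() if len(urls) > 1]
```

A hash like this only catches pages whose text is identical after normalization. Change a single word and the fingerprint changes completely, which is why near duplicates need a similarity measure instead.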

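In both cases we're using the excellent website crawler A1 Website Analyzer, which does a best-effort content word count and can give content similarity feedback. Using this crawler it's easy to find pages with thin content, pages with duplicate titles, headers and descriptions, and pages with similar content.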

Crawling The Website

When you first start A1 Website Analyzer, the first thing you need to do is crawl the website you wish to inspect:

Depending on the data you are interested in, you can configure data collection options in the "Scan website | Data collection" tab.

This will give a result that looks like this:

After the scan has finished, you can select which columns to show in the program and the Excel file it can export:

You can also set various filters to only show the pages you are interested in:

You can even select predefined columns/filters "reports":

Note: To see all options you can switch off "Simplified easy mode":

(For a full description on how to crawl a site with A1 Website Analyzer, click here)

Finding Pages With Thin Content

A good technique for finding pages with thin / shallow content is to look for pages on your website that have a significantly lower text-to-code ratio than the others. You can then rewrite them as needed.
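If you want to spot-check a single page by hand, the two measures can be approximated in a few lines of Python. This is a rough sketch using only the standard library; it is not how A1 Website Analyzer computes its numbers, and the definition of the ratio used here (visible text length divided by total HTML length) is an assumption.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_metrics(html):
    """Return (word_count, text_to_code_ratio) for one HTML document."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so indentation does not inflate the text size.
    text = " ".join(" ".join(parser.chunks).split())
    word_count = len(text.split())
    # Assumed definition: characters of visible text per character of HTML.
    ratio = len(text) / len(html) if html else 0.0
    return word_count, ratio
```

Neither number means much on its own. As with the crawler's columns, the useful signal is a page that scores far below the rest of the same site.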

Besides the text/code ratio there's also pure word count:

You can sort and filter as you like, export to Excel and filter there, or use the quick report presets:

Finding Pages With Duplicate Titles, Headers And Descriptions

By simply selecting the built-in reports, which configure which columns are visible and which filters are active, you can quickly see where you have duplicate or similar content.

Duplicate page titles:

Duplicate H1 page headers:

Duplicate page descriptions:

Again, you can also sort by these columns inside A1 Website Analyzer, or export to a spreadsheet and work from there.
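If you prefer to work on the exported data with a script, grouping pages by a shared field is straightforward. The sketch below assumes each page has been loaded as a dictionary with a "url" key plus whichever field you want to compare; the key names "title", "h1" and "description" are placeholders, since the actual column names in the export are not shown here.

```python
from collections import defaultdict


def find_duplicates(pages, field):
    """Group URLs that share the same value for `field`.

    `pages` is a list of dicts such as {"url": ..., "title": ...}.
    The key names are hypothetical; adjust them to your export.
    """
    groups = defaultdict(list)
    for page in pages:
        # Normalize case and whitespace so trivial differences don't hide duplicates.
        value = " ".join((page.get(field) or "").split()).casefold()
        if value:
            groups[value].append(page["url"])
    return {value: urls for value, urls in groups.items() if len(urls) > 1}
```

Calling `find_duplicates(pages, "title")` and then `find_duplicates(pages, "description")` gives you one group per duplicated value. Pages with an empty field are skipped here, although a missing title or description is a problem worth listing in its own right.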

Finding Pages With Similar Content

A1 Website Analyzer has a unique feature that gives visual feedback on which pages have similar content.

It is still experimental but it works quite well for many websites, and it is an additional tool in the toolbox for identifying possible problems in a website.

Before you initiate a scan, you have to enable the option "Perform keyword analysis of all pages" found in the "Scan website | Data collection" tab.

After the scan has finished, you can sort the results, and the program will try to group pages with "similar" content together. These "similar" groupings go beyond simply determining whether pages are exact duplicates.

The highlighted pages have a huge overlap at the start of their content
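How A1 Website Analyzer scores similarity internally is not described here. To illustrate the general idea of near-duplicate detection, the sketch below uses a common textbook approach instead: split each page's text into overlapping word sequences ("shingles") and compare the sets with Jaccard similarity. The shingle size, the threshold and the function names are arbitrary choices for the example.

```python
from itertools import combinations


def shingles(text, size=3):
    """Return the set of overlapping `size`-word sequences in the text."""
    words = text.casefold().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Shared shingles divided by all distinct shingles (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(texts, threshold=0.5, size=3):
    """Return (url1, url2, score) for page pairs at or above the threshold.

    `texts` is a hypothetical {url: page_text} mapping.
    """
    sets = {url: shingles(text, size) for url, text in texts.items()}
    pairs = []
    for (url1, set1), (url2, set2) in combinations(sets.items(), 2):
        score = jaccard(set1, set2)
        if score >= threshold:
            pairs.append((url1, url2, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

Comparing every pair of pages is fine for a few hundred URLs but grows quadratically, so on a large site you would restrict the comparison to a subset, for example pages that already share a title or a template.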

How To Fix Duplicate & Thin Content

Common solutions for fixing these problems include:

  • Merge content of multiple thin content pages into one solid content page. Redirect the old, thin pages to this URL.
  • Use the canonical tag to point multiple pages to one. This is particularly relevant if your pages contain small variations, e.g. different versions of the same product. (See also: What is the Difference between 301 Redirect and Canonical Attribute)
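Once canonical tags are in place, it is worth verifying that each variant page really declares the URL you intended. The following is a minimal sketch that pulls the canonical URL out of a page's HTML with Python's standard library; the function name is made up for the example, and fetching the pages is left out.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values; compare case-insensitively.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def canonical_url(html):
    """Return the declared canonical URL, or None if the page has none."""
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    return finder.canonical
```

Running this over all variants of a product and collecting the results in a set gives a quick check: if the set contains more than one URL, or None, the variants are not consistently pointing at a single page.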

Note that ensuring pages have unique titles, headings, and meta descriptions, and that these match the content, is a big part of conversion-centered SEO.