What does Google REALLY think of your site's pages? Think PageRank gives a clue? Perhaps if we were shown true PageRank, we could use it as a somewhat reliable gauge, but unfortunately, we don't have that luxury. But we don't need PageRank. We have a better way to determine how much importance Google places on each of those coveted pages of ours, but it takes a little effort on our part to uncover this goldmine of information.
What To Look For
Long ago, on January 27, 2007 in fact, Aaron Wall said in a post titled "Cache Date as the New Google PageRank":
Given Google's reliance on core domain authority and displaying outdated PageRank scores, cache date is a much better measure of the authority of a particular page or site than PageRank is.
What Google frequently visits (and spends significant resources to keep updated) is what they consider important.
Then, in a guest post I wrote on directom.com in November of 2008, I discussed a few ways to track cache date, including Michael Gray's method, some WordPress plugins and some paid tools. As I mentioned in that post, I wasn't particularly happy with any of them. Because of that, I kept searching.
The Right Tool For The Job
Earlier this year, I ran across the free CJ Googlebot Activity script created by James Crooke of cj-design.com. I started running it on one of my sites in mid-May, let it accumulate data for four and a half months, and then started analyzing that data. The results were very, very interesting.
The bottom line? I now have a very clear perspective on which pages on my site Google thinks are most worthy of its time – and which aren't! And while a few of the pages were obviously ones that Google appreciated, I was very surprised by quite a few that Google placed at the bottom of its priority list. Many of those at the bottom of the heap were ones that I personally think are some of the better pages on the site, so I obviously need to make sure Google appreciates them as well. I now have a very clear idea of where my focus needs to be, and which pages I can spend less time on for now.
Making It Work
Now, I want to give you some technical tips on getting this working for you. The script comes with easy installation instructions, so you should have no issues getting it set up. Like most scripts, you edit a config file to answer a few basic questions, and then simply place a short PHP include snippet in the files you want to track. If you have a header or footer file that gets included in all your templated pages, you can place the snippet there and know that it will automatically be included everywhere (similar to how including analytics code works).
Once it's running, you have an administration area that you can go to any time you wish to view the activity. That's useful, but it's limiting in giving you "the big picture". Luckily, it also gives you an option in the admin area to export to a csv file, and this is where the data can start to be meaningful.
We'll get to that in a minute, but first you should take a look at a screenshot of the demo account listed on James Crooke's site. You can see more of the demo here.
Tweaking and Geeking
So, although I love seeing that data, I knew I needed to delve a bit deeper. I'm not a spreadsheet geek, though I wish I were, so instead of analyzing the data there, I imported the csv data into a MySQL database table. I was only concerned with 3 pieces of data: Crawl Date, Crawl Time, and URL Crawled, so the table was a very simple one. (I know a lot of people might not be MySQL geeks, but if you ever wanted to learn the basics, this task will be a good start, since it's a relatively simple one.)
The steps basically involved creating a database, and then, using phpMyAdmin, creating a table (I named mine crawls) with 3 fields – crawldate (date field), crawltime (time field), and crawlurl (varchar field). (You simply type the names in, choose the type of field from a dropdown, and then click the Go button to create the table.)
Once the table is created, click on it, and then choose the Import tab. Upload the csv file using the browse button and select CSV Using LOAD DATA. Since my csv file was comma-delimited, I put a comma (,) in the Fields terminated box, left the Fields enclosed by and Fields escaped by boxes blank, kept the defaults for everything else, and clicked the Go button.
In a second, everything was imported into the table. The whole process takes just a few minutes.
Once the data was in the table, I ran the following SQL statement by clicking on the SQL tab in phpMyAdmin:
select crawlurl, count(crawlurl) from crawls group by crawlurl order by count(crawlurl) desc
This particular SQL statement doesn't even concern itself with dates or times. It simply counts the number of times each URL was crawled within the entire span of data (which was four and a half months for me), and then sorts each page URL by the number of times it was crawled, showing the most-crawled pages first.
Here's an example of how my data looked afterwards:
Crawled … PageName
52 … somepage.php
44 … thatpage.php
36 … anotherpage.php
8 … sampleurl.php
2 … yetanotherpage.php
With a large site, the list can be quite long, but that's ok. Because it's sorted, it's easy to scan and see where the problems are, and what pages need more tender loving care (as well as link love).
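If you'd like to try the counting step before setting up MySQL, here is a minimal sketch in Python. It is not what I ran for this article: it uses Python's built-in sqlite3 module as a stand-in for MySQL and phpMyAdmin, and the sample rows, the YYYY-MM-DD date format, and the crawl_counts function name are made up for illustration. The table layout and the SQL statement are the same ones described above.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows in the three-column layout used in this article:
# crawl date, crawl time, URL crawled. Real data would come from your csv file.
SAMPLE_CSV = """2009-05-14,08:15:02,somepage.php
2009-05-14,19:40:11,thatpage.php
2009-06-01,03:22:45,somepage.php
2009-07-09,12:00:00,somepage.php
2009-07-09,12:05:30,thatpage.php
2009-08-20,23:59:59,sampleurl.php
"""


def crawl_counts(csv_text):
    """Load crawl rows into an in-memory table and count crawls per URL."""
    # SQLite stands in for MySQL here; it has no date/time column types,
    # so all three fields are stored as text.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table crawls (crawldate text, crawltime text, crawlurl text)"
    )
    # Skip blank lines so a trailing newline in the file does not break the insert.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    conn.executemany("insert into crawls values (?, ?, ?)", rows)
    # Same statement as in the article: most-crawled URLs first.
    result = conn.execute(
        "select crawlurl, count(crawlurl) from crawls "
        "group by crawlurl order by count(crawlurl) desc"
    ).fetchall()
    conn.close()
    return result


for url, crawled in crawl_counts(SAMPLE_CSV):
    print(crawled, "...", url)
```

To run it against your own export, you could read the csv file's contents into a string and pass that to crawl_counts instead of SAMPLE_CSV.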
I'll be working more with the dates and times in the future to see if there are any more insights I can glean from the data, but for now, this very simple analysis proved to be extremely useful.
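As one possible starting point for that date-based digging, here is a rough Python sketch that counts crawls per URL per month. I haven't run this kind of analysis on my own data yet, so treat it as an illustration only: the sample rows, the crawls_per_month function name, and the assumption that dates are stored as YYYY-MM-DD are all hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows: crawl date, crawl time, URL crawled.
# Dates are assumed to be in YYYY-MM-DD form.
SAMPLE_CSV = """2009-05-14,08:15:02,somepage.php
2009-05-30,19:40:11,somepage.php
2009-06-01,03:22:45,somepage.php
2009-06-02,12:05:30,thatpage.php
"""


def crawls_per_month(csv_text):
    """Count crawls for each (URL, month) pair."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # ignore blank lines
        crawldate, _crawltime, crawlurl = row
        month = crawldate[:7]  # "2009-05-14" -> "2009-05"
        counts[(crawlurl, month)] += 1
    return counts


for (url, month), crawled in sorted(crawls_per_month(SAMPLE_CSV).items()):
    print(month, crawled, "...", url)
```

A page whose monthly count is dropping over time might be one to look at first, though that is a guess on my part rather than something the data has shown me.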
If you want to see clearly which of the pages Google appreciates the most – and more importantly, which they appreciate the least – I recommend doing a similar analysis on your site. Once you have this information, you can schedule your time so that the pages you work on are the ones that need the most help.
Information yields power. Data is good.
*hat tip to Michael VanDeMar for his help.
UPDATE: I may have forgotten one step in that process above. When I said I exported to a csv file, actually it merely exports into a regular txt file. I then just did a quick find/replace to replace the line endings with commas, and find/replace to delete the extraneous words "Crawled:" and "Bot:". THEN, I saved that as a .csv file.
UPDATE 2: Apparently I didn't make things really clear. This analyzes crawl data, not actual cache data. I make the leap in my own brain that crawls lead to caches, but it's only crawl data that is being analyzed here. Apologies for any confusion.