Search Robots: The Good, The Bad, and The Googlebot

What Are Web Bots?

A web robot is a program that automatically traverses the web, retrieving documents and indexing them. It follows links from page to page and site to site. Web robots are also known as "Spiders", "Crawlers", "Web Bots", "Search Bots", or just "Bots".

You may also see the word "bot" used for any program that automatically performs a specific task or set of tasks. Bots of this type are often found in chat rooms, on auction sites, and in games (think online solitaire). They also include chatterbots: programs designed to emulate conversation, such as IRC bots or the popular AIM bot, FriendBot.

What are Agents?

The term "agent" (or user-agent) refers to the name of the application accessing a document. Well-behaved search bots declare themselves when they access a document, much as your browser (which is also an application) declares itself when you visit a website. If you see a Google agent on your site, that is the Googlebot.
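As a rough illustration of how an application declares itself, the sketch below builds an HTTP request in Python with an explicit User-Agent header. The bot name ExampleBot and the example.com URLs are made up for this example, and the request is only constructed, never sent.

```python
from urllib.request import Request

# Hypothetical bot name and info URL, used only for illustration.
BOT_USER_AGENT = "ExampleBot/1.0 (+http://example.com/bot-info)"


def build_request(url: str) -> Request:
    """Build (but do not send) a request that declares its user-agent."""
    return Request(url, headers={"User-Agent": BOT_USER_AGENT})


request = build_request("http://example.com/page.html")
# urllib normalizes header names, so the key is stored as "User-agent".
print(request.get_header("User-agent"))
```

The server receiving such a request would see the declared name in its logs, which is how a site owner learns which agents have visited.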

What Kinds of Web Robots Are There?

There are "good" robots and "bad" robots. Good web bots have no nefarious intentions; typically they follow links to index pages, automatically if not mindlessly. An example of a "good" bot is the search engine bot. Googlebot is probably the best known (and most unknowable) "good" web bot there is. All SEOs dream of understanding Googlebot's algorithm.

Google has several versions of bots:

- THE Googlebot - crawls pages for their web and news index.
- Googlebot-Mobile - crawls pages for their mobile index.
- Googlebot-Image - crawls pages for their image index.
- Mediapartners-Google - crawls pages to determine AdSense content.
- Adsbot-Google - crawls pages to measure AdWords landing page quality.

With several thousand web bots "in the wild" online today, it would be impossible to list them all. Here are some of the best-known "good" search bots:

- MSNbot - owned by MSN.com
- Ask Jeeves/Teoma - owned by Ask.com
- Architext spider - owned by Excite.com
- FAST-WebCrawler - owned by FAST (AllTheWeb.com)
- Slurp - owned by Inktomi.com
- Yahoo Slurp - owned by Yahoo Web Search
- ia_archiver - owned by Alexa.com
- archive.org_bot - owned by Archive.org
- Scooter - owned by AltaVista.com
- Crawler - owned by Crawler.de
- InfoSeek sidewinder - owned by InfoSeek.com
- Lycos_Spider_(T-Rex) - owned by Lycos.com

Bad bots are evil minions sent by programmers who are up to no good, usually the illegal type of no good. A bot can also be considered bad if it ignores META tags and the robots.txt file, follows URLs it was told not to, and/or revisits your site so often that it slows the site down or crashes it. The more insidious types of bots are:

- Denial of Service Bots (DoS Bots) flood your site with requests and crash your server. A cybercriminal may even announce a DoS attack to the target site in advance to extort money.
- Identity Theft Bots exist solely to scour for your personal information, such as addresses, credit card numbers, and passwords.
- Spam Bots send just that: spam. These are mass emails, usually loaded with naughty intentions and/or pharmaceutical enhancers.
- Phishing Bots are similar to both the identity theft bot and the spam bot. They send luring emails to tempt you into giving up your personal information (for example, by asking you to enter your PayPal password on a non-PayPal site).
- Click Fraud Bots imitate a person clicking on an advertisement to inflate pay-per-click income.

How to Deal With Bots

1.  You can tell the "good" bots what you want them to do when they visit your site by using the META tag on your pages. Use the following examples:

<meta name="robots" content="index,follow"> index this page and read or list its links
<meta name="robots" content="noindex,follow"> do not index this page, but read or list its links
<meta name="robots" content="index,nofollow"> index this page, but do not read or list its links
<meta name="robots" content="noindex,nofollow"> do not index this page and do not read or list its links

Note, however, that not all search bots recognize the robots META tag, and not every bot that recognizes it respects it.
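To show what a cooperating bot does with this tag, the sketch below reads the robots META tag from a page using Python's standard html.parser module. The class name RobotsMetaParser and the helper robots_directives are made up for this example; a real crawler would be considerably more thorough.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
directives = robots_directives(page)
print("may index:", "noindex" not in directives)
print("may follow links:", "nofollow" not in directives)
```

Nothing forces a bot to run a check like this, which is why the tag only works on bots that choose to honor it.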

2. A better method is to use a robots.txt file, placed in the root of your site, to specify what the bots can or cannot see or follow. Here are some examples:

# Disallow Google's Image bot from accessing your website
User-agent: Googlebot-Image
Disallow: /

# Disallow Yahoo's image bot from accessing your website
User-agent: Yahoo-MMCrawler
Disallow: /

# Disallow Archive.org’s bot from accessing your website
User-agent: archive.org_bot
Disallow: /

# Disallow any other bot from the files listed
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /my_private_file.html
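If you want to check how a rule-abiding bot would read the rules above, Python's standard urllib.robotparser module can evaluate them. The following is a minimal sketch: the rules are pasted in as a string rather than fetched from a live site, example.com is a placeholder host, and load_rules is a helper name invented for this example.

```python
from urllib.robotparser import RobotFileParser

# The same rules as above, without the comment lines.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: Yahoo-MMCrawler
Disallow: /

User-agent: archive.org_bot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /my_private_file.html
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of a URL."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
# Googlebot-Image is blocked from the whole site.
print(rules.can_fetch("Googlebot-Image", "http://example.com/index.html"))
# Other bots fall under the "*" rules: pages are fine, /images/ is not.
print(rules.can_fetch("Googlebot", "http://example.com/index.html"))
print(rules.can_fetch("Googlebot", "http://example.com/images/logo.png"))
```

Keep in mind that this only shows what a cooperative bot would conclude; the robots.txt file itself does not stop a bot that chooses to ignore it.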

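3. You can also use an .htaccess (hypertext access) file on an Apache web server. Place it in any directory that you want to protect, or in the root of your site to protect all of it. The .htaccess file is a hidden file that denies agents access to whatever you specify. Because the server itself enforces these rules, a bad bot cannot simply ignore them as it can a META tag or robots.txt, although a bot that fakes its user-agent string will not match a rule written for its real name.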

For example, EmailSiphon is a known spam bot. It looks for email addresses on websites to send spam to. To keep it off your site, you would use the following in your .htaccess file:

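# Flag any request whose user-agent starts with "EmailSiphon" (case-insensitive)
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot

# Allow everyone except flagged agents (Apache 2.2-style access directives)
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>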

When the bad bot, EmailSiphon, visits your site, it is served a 403 Forbidden error and prevented from going any further. Just add more SetEnvIfNoCase lines for other known bad bots and slam that spam! Bu-bye!
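The matching logic behind the SetEnvIfNoCase line can be imitated in a few lines of Python. This is only a sketch of the idea, not how Apache implements it: the list BAD_BOT_PATTERNS and the function response_status are names invented for this example.

```python
import re

# One pattern per known bad bot, mirroring the SetEnvIfNoCase lines.
# "^" anchors the match to the start of the user-agent string.
BAD_BOT_PATTERNS = [
    re.compile(r"^EmailSiphon", re.IGNORECASE),
]


def response_status(user_agent: str) -> int:
    """Return 403 for a flagged user-agent and 200 otherwise."""
    for pattern in BAD_BOT_PATTERNS:
        if pattern.search(user_agent):
            return 403
    return 200


print(response_status("EmailSiphon"))
print(response_status("Mozilla/5.0"))
```

Because the pattern is anchored with "^", a user-agent that merely contains the name somewhere after its first character would not be flagged, so it is worth checking how a bad bot actually identifies itself before writing the rule.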

So, whether they are good, bad, or ugly, the bots (Googlebot included) are here to stay. Learning how to deal with the bad bots, prevent your images from being indexed, or stop the indexing of parts of your site is very worthwhile.

About the Author: Shannon Hutcheson

Shannon Hutcheson is a day dreamer, sarcastic cheekster, cat herder, gamer, and bookaholic. She's an experienced copy editor and a moderator at MyBlogGuest.