Toll Free: 1-877-695-7388

GTA: (647) 699-2838

Search Engine People

Search Robots: The Good, The Bad, and The Googlebot

Shannon Hutcheson | October 28th, 2008

What Are Web Bots?

A web robot is a program that runs amok on the web, gathering documents and referencing them. It follows links from page to page and site to site. Web Robots are also known as "Spiders", "Crawlers", "Web Bots", "Search Bots", or just "Bots".

You may also see the word “bot” used for any program that automatically carries out a specific task or set of tasks. These bots are common in chat rooms, on auction sites, and in games (think online solitaire). They also include chatterbots: computer programs designed to emulate conversation, like IRC bots or the popular AIM bot, FriendBot.

What are Agents?

The term "agent" (or user-agent) refers to the name of the application accessing a document. Well-behaved search bots declare themselves when they access a document, much like your browser (which is also an application) declares itself when you visit a website. If you see a Google agent on your site, that's the Googlebot.
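To make the idea of "declaring yourself" concrete, here is a minimal Python sketch of how a bot might attach its user-agent to a request. The bot name and URLs are hypothetical, made up purely for illustration, and the request is only built, never sent.

```python
from urllib.request import Request

# Hypothetical bot name and info URL, used only for illustration.
BOT_USER_AGENT = "ExampleBot/1.0 (+http://example.com/bot-info)"


def build_request(url: str) -> Request:
    """Build (but do not send) a request that declares the bot's user-agent."""
    return Request(url, headers={"User-Agent": BOT_USER_AGENT})


request = build_request("http://example.com/page.html")
# urllib stores header names in capitalized form, hence "User-agent".
print(request.get_header("User-agent"))
```

The server sees this string in its logs, which is how you can tell one visitor from another. Keep in mind that the string is whatever the client chooses to send.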

What Kinds of Web Robots Are There?

There are "good" robots and "bad" robots. The good kind of web bots have no nefarious intentions. Typically they follow links to index pages, if not mindlessly, then automatically. A "good" bot example is the search engine bot. Googlebot is probably the best-known (and most unknowable) "good" web bot there is. All SEOs dream of understanding Googlebot’s algorithm.

Google has several versions of bots:

- THE Googlebot - crawls pages for their web and news index.
- Googlebot-Mobile - crawls pages for their mobile index.
- Googlebot-Image - crawls pages for their image index.
- Mediapartners-Google - crawls pages to determine AdSense content.
- AdsBot-Google - crawls pages to measure AdWords landing page quality.
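If you want to spot these bots in your own logs, one simple approach is to look for their names inside the user-agent string. The sketch below assumes plain, case-insensitive substring matching is good enough. The function name and the purpose labels are my own, taken from the list above. Remember that any client can fake this string, so a match is a hint, not proof.

```python
# Tokens taken from the list above, most specific first so that
# "Googlebot-Image" is not mistaken for plain "Googlebot".
GOOGLE_BOT_PURPOSES = [
    ("googlebot-mobile", "mobile index"),
    ("googlebot-image", "image index"),
    ("mediapartners-google", "AdSense content"),
    ("adsbot-google", "AdWords landing page quality"),
    ("googlebot", "web and news index"),
]


def google_bot_purpose(user_agent: str):
    """Return what a Google bot is crawling for, or None if no token matches."""
    ua = user_agent.lower()
    for token, purpose in GOOGLE_BOT_PURPOSES:
        if token in ua:
            return purpose
    return None


print(google_bot_purpose(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
))
```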

With several thousand web bots "in the wild" online today, it’d be impossible to list them all. So here are some of the best-known "good" search bots:

- MSNbot - owned by MSN.com
- Ask Jeeves/Teoma - owned by Ask.com
- Architext spider - owned by Excite.com
- FAST-WebCrawler - owned by FAST (AllTheWeb.com)
- Slurp - owned by Inktomi.com
- Yahoo Slurp - owned by Yahoo Web Search
- ia_archiver - owned by Alexa.com
- archive.org_bot - owned by Archive.org
- Scooter - owned by AltaVista.com
- Crawler - owned by Crawler.de
- InfoSeek sidewinder - owned by InfoSeek.com
- Lycos_Spider_(T-Rex) - owned by Lycos.com

Some bad bots are evil minions sent by programmers who are up to no good, usually the illegal type of no good. A bot can also count as "bad" if it ignores META tags and the robots.txt file, follows URLs anyway, and/or revisits your site too often (thereby causing it to slow down or crash). The more insidious types of bots are:

- Denial of Service Bots (DoS Bots) flood your site with requests until your server crashes. A cybercriminal may even announce a DoS attack to the target site to extort money.
- Identity Theft Bots exist solely to scour the web for personal information such as addresses, credit card numbers, and passwords.
- Spam Bots send just that: spam. These are mass emails, usually loaded with naughty intentions and/or pharmaceutical enhancers.
- Phishing Bots are similar to both the identity theft bot and the spam bot. They send luring emails to tempt you into giving up your personal information (like asking you to enter your PayPal password on a non-PayPal site).
- Click Fraud Bots imitate a person clicking on an advertisement to inflate pay-per-click income.

How to Deal With Bots

1. You can tell the "good" bots what you want them to do when they visit your site by using the robots META tag on your pages. Use the following examples:

<meta name="robots" content="index,follow"> index this page and read or list links
<meta name="robots" content="noindex,follow"> do not index this page but may read or list links
<meta name="robots" content="index,nofollow"> index this page but do not read or list links
<meta name="robots" content="noindex,nofollow"> do not index this page or read or list links

But note: not all search bots recognize the META tags, and not all good bots respect them either.
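Here is a rough Python sketch of how a polite bot could read the robots META tag before deciding what to do with a page. The class and function names are hypothetical, and the sketch assumes that a missing directive means "allowed". It is not how any particular search engine does it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_policy(html: str):
    """Return (may_index, may_follow) for a page's HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


print(robots_policy('<meta name="robots" content="noindex,follow">'))
```

Nothing forces a bot to run a check like this, which is exactly why the META tag only works on bots that choose to cooperate.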

2. So the better method is to use a robots.txt file to specify what the bots can or cannot see or follow. Here are some examples:

# Disallow Google's Image bot from accessing your website
User-agent: Googlebot-Image
Disallow: /

# Disallow Yahoo's image bot from accessing your website
User-agent: Yahoo-MMCrawler
Disallow: /

# Disallow Archive.org's bot from accessing your website
User-agent: archive.org_bot
Disallow: /

# Disallow any other bot from the files listed
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /my_private_file.html
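
You can check how rules like these would be read with Python's standard urllib.robotparser module. This is only a sketch: example.com, "SomeOtherBot", and the allowed() helper are placeholders of my own, and real bots may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt example above.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: Yahoo-MMCrawler
Disallow: /

User-agent: archive.org_bot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /my_private_file.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask whether the given bot may fetch a path on the placeholder site."""
    return parser.can_fetch(agent, "http://example.com" + path)


for agent, path in [
    ("Googlebot-Image", "/index.html"),
    ("SomeOtherBot", "/index.html"),
    ("SomeOtherBot", "/cgi-bin/script.pl"),
]:
    print(agent, path, allowed(agent, path))
```

A well-behaved bot runs a check like this before every fetch. A bad bot simply skips it.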

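3. You can also use an .htaccess (hypertext access) file in any directory of your server that you want to protect, or in the root of your site to protect all of it. The .htaccess file is a hidden configuration file, read by the Apache web server, that denies agents access to whatever you say they cannot have. Because the server itself enforces these rules, bad bots cannot simply ignore the .htaccess file, although a bot that fakes its user-agent can still slip past a rule that matches on the user-agent.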

For example, EmailSiphon is a known spam bot. It looks for email addresses on websites to send spam to. To prevent it from accessing your site, you would use the following in your .htaccess file:

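# Flag any request whose User-Agent starts with "EmailSiphon" (case-insensitive)
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot

# Allow everyone except requests flagged as bad_bot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>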

What happens is that when the bad bot, EmailSiphon, visits your site, it is served a 403 Forbidden error and prevented from going any further into your site. Just add more SetEnvIfNoCase lines for other known bad bots and slam that spam! Bu-bye!
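
If it helps to see the logic outside of Apache, the same decision can be sketched in a few lines of Python. This is an illustration of the matching rule only, not a replacement for the .htaccess file. The status_for() name and the pattern list are hypothetical.

```python
import re

# Patterns for known bad bots. EmailSiphon comes from the example above;
# the "^" anchor means the user-agent must START with the name.
BAD_BOT_PATTERNS = [re.compile(r"^EmailSiphon", re.IGNORECASE)]


def status_for(user_agent: str) -> int:
    """Return 403 for a flagged user-agent, 200 for everyone else."""
    if any(pattern.search(user_agent) for pattern in BAD_BOT_PATTERNS):
        return 403
    return 200


print(status_for("EmailSiphon"))
print(status_for("Mozilla/5.0"))
```

Note that the check only looks at what the visitor claims to be, so a bot that sends a different user-agent string would get through.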

So, whether they are good, bad, or Googlebot, the bots are here to stay. Learning how to deal with the bad bots, keep image bots away from your images, or stop the indexing of parts of your site is very worthwhile.

Posted in SEO

About the Author: Shannon Hutcheson

Shannon Hutcheson is a day dreamer, sarcastic cheekster, cat herder, gamer, and bookaholic. She's an experienced copy editor and a moderator at MyBlogGuest.

Living With Fibromyalgia

9 thoughts on “Search Robots: The Good, The Bad, and The Googlebot”

  1. Roger Hamilton says:
    October 28, 2008 at 8:22 am

    Hey, thanks for the information. I’m still a little unsure of this and your post has allowed me to understand more on it! Especially those nasty bad bots. Thanks again!

  2. dymphna boholt says:
    October 28, 2008 at 7:00 pm

    Solid Piece!

    I have checked the robots.txt file of public black hat bloggers, and it kinda is an interesting learning experience. You get to know the names of all the bad bots, via the instructions they place in there. Go do it, its fascinating.

  3. Comparison Shopping says:
    October 29, 2008 at 3:17 am

    Thank you for a very interesting and instructive post. I am copy pasting this for reference and further learning. Thank you again.

  4. Craig S. Kiessling says:
    October 29, 2008 at 9:15 pm

    It’s nice to see such a well-detailed breakdown of all the bots 🙂

  5. mobile wallpapers says:
    November 2, 2008 at 4:16 am

    Shannon, thank you for so detailed information about bots.

  6. comunactivo says:
    November 2, 2008 at 6:28 am

    Very useful post, thanks for sharing! What I can’t figure out is the Yahoo Bot – anyone know if they work much differently to the Googlebots? Can’t seem to even get indexed in Yahoo! Could it be that they index pages less often?

  7. YourDownline.co.uk says:
    November 2, 2008 at 8:12 am

    Great post, I love the pics lol, they are really good.

  8. SoLinkable says:
    November 2, 2008 at 1:17 pm

    Don’t mean to nitpick or anything but I believe you meant to write: DO index this page but DO not read or list links

  9. Singapore SEO says:
    November 10, 2008 at 12:18 am

    There are certain bots that we can usually block using the robots.txt and .htaccess file. Very clear example listed.
    Rif Chia

Comments are closed.

LOCATION

1305 Pickering Parkway,
5th Floor Pickering, L1V 3P2

© Search Engine People Inc. 2023 – Canada’s Top Digital Agency
© SEP 2023 – A Search Engine People Company | Privacy Policy
