Search Engine People - Search Engine Positioning, Placement Service

How Search Really Works: Relevance (2) - Vector Space

by Ruud Hein
April 11, 2008

This post is part of an ongoing series: How Search Really Works.
Previously: Relevance (1)

Another way we can assess the relevance of a document is by term weighting.

From the keyword density myth we know that true term weighting is done collection-wide.

By looking at the number of documents in the index that a term appears in we can make a measurement of information: how good, how special… how meaningful is this word?

The word “the” would not be special at all, appearing in way too many documents. Its worth would be close to zero.

But klebenleiben (“the reluctance to stop talking about a certain subject”…) would be very special indeed! Because it appears in only 18 documents among millions, its worth, its weight, would automatically be very high.

The measure is called inverse document frequency.

This measure is our weight; it is what we use to judge the relevance of a document.
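
To make the idea concrete, here is a minimal sketch in Python. The post doesn't give a formula, so this assumes the common textbook form log(N / df), where N is the number of documents and df is the number of documents containing the term. The toy collection and the function name are made up for illustration.

```python
import math

# Hypothetical toy collection: each document is a list of lowercase words.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "klebenleiben", "of", "the", "speaker"],
]

def inverse_document_frequency(term, collection):
    """Assumed form: log(N / df). Rare terms score high, common terms low."""
    df = sum(1 for doc in collection if term in doc)  # documents containing the term
    if df == 0:
        return 0.0  # assumption: a term that appears nowhere gets no weight
    return math.log(len(collection) / df)

idf_the = inverse_document_frequency("the", docs)            # in every document
idf_cat = inverse_document_frequency("cat", docs)            # in two of three
idf_rare = inverse_document_frequency("klebenleiben", docs)  # in only one

print(idf_the, idf_cat, idf_rare)
```

A word found in every document ends up with a weight of exactly zero here; the rarer the word, the higher its weight.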

Term Frequency Times

We do so by counting the number of times a word appears in a document. We normalize that count; we adjust it so that the length of a document doesn’t matter that much anymore.

We then multiply it by our weight measurement: TF x IDF. Term Frequency times Inverse Document Frequency.

In other words, a high count of a rare word = a high score for that document, for that word. But… a high count of a common word = a not-so-high score for that document, for that word.
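
A small sketch of that multiplication, under two assumptions the post doesn't spell out: the count is normalized by dividing by the document's length, and IDF is log(N / df). The documents and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy collection, keyed by a made-up document id.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "the klebenleiben of the speaker the klebenleiben".split(),
}

def idf(term, collection):
    # Assumed form: log(N / df)
    df = sum(1 for words in collection.values() if term in words)
    return math.log(len(collection) / df) if df else 0.0

def tf(term, words):
    # Assumed normalization: raw count divided by document length
    return Counter(words)[term] / len(words)

def tf_idf(term, doc_id, collection):
    return tf(term, collection[doc_id]) * idf(term, collection)

# "the" appears three times in d3, "klebenleiben" only twice...
score_common = tf_idf("the", "d3", docs)
score_rare = tf_idf("klebenleiben", "d3", docs)

# ...yet the rare word wins, because "the" appears in every document.
print(score_common, score_rare)
```

Other normalizations exist (dividing by the count of the most frequent word, for instance); the point is only that the raw count gets scaled before it meets the weight.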

Vectors

A vector is a line of a certain length in a certain direction.

Both the length and the direction of the line represent important information.

Vectors enable us to represent, to talk about, size and direction when position is irrelevant. Wind (speed plus direction), velocity, force, acceleration: all of these are good candidates to be represented as a vector.

TFxIDF scores are perfectly suited to be represented as vectors.

Vector Space

Think of the words that make up our index as axes of a space.

[Figure: vector space, with index terms as axes]

Of course in a real index this space would consist of thousands upon thousands of axes…

Documents as Vectors

For each word in our document we can draw a line (vector) which shows its TFxIDF score for a certain term.

[Figure: documents drawn as vectors in the vector space]
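
As a rough sketch of what such a document vector could look like in code: one slot per word in the index, holding that word's TFxIDF score for the document. The collection, the length normalization and the log(N / df) form of IDF are assumptions for illustration, not something the post prescribes.

```python
import math
from collections import Counter

# Hypothetical toy collection.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}

# Every distinct word in the index is one axis of the space.
vocabulary = sorted({w for words in docs.values() for w in words})

def idf(term):
    # Assumed form: log(N / df)
    df = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / df) if df else 0.0

def document_vector(words):
    # One coordinate per axis; words the document lacks get 0.0
    counts = Counter(words)
    return [counts[term] / len(words) * idf(term) for term in vocabulary]

vectors = {doc_id: document_vector(words) for doc_id, words in docs.items()}

for doc_id, vector in vectors.items():
    print(doc_id, [round(x, 3) for x in vector])
```

Even in this tiny example the space has seven axes, and most coordinates are zero.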

Queries as Vectors

Every word in a query can also be shown as a vector.

[Figure: query drawn as a vector alongside the document vectors]

By looking at documents that are “near” our query we can rank (sort) documents in our result set.
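
The post leaves “near” undefined. A common choice, assumed here, is the cosine of the angle between the query vector and each document vector: the smaller the angle, the nearer the document. The sketch below treats the query as a very short document; the collection, the query and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy collection and query.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "the dog barked at the mailman".split(),
}
query = "dog mailman".split()

vocabulary = sorted({w for words in docs.values() for w in words})

def idf(term):
    # Assumed form: log(N / df)
    df = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / df) if df else 0.0

def to_vector(words):
    # Same weighting for documents and for the query
    counts = Counter(words)
    return [counts[term] / len(words) * idf(term) for term in vocabulary]

def cosine(a, b):
    # Cosine of the angle between two vectors; 0.0 if either has no length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query_vector = to_vector(query)
scores = {doc_id: cosine(query_vector, to_vector(words)) for doc_id, words in docs.items()}

# Highest cosine first: the document "nearest" to the query leads the results.
ranking = sorted(docs, key=scores.get, reverse=True)
print(ranking)
```

Because the cosine only looks at direction, a long document and a short one pointing the same way score the same.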

TFxIDF Vector Space Ranking

If a document is close to our query it answers our query.

But better yet: documents close to ours are similar documents. They’re talking about roughly the same thing.

This makes TFxIDF vector space ranking extremely useful to find sets of similar documents through “closeness”.
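
The same closeness measure can be turned on the documents themselves. This is a minimal sketch, again assuming cosine similarity and log(N / df) weighting; it stores each vector as a dictionary of its non-zero axes only. The four documents and the helper names are invented for the example.

```python
import math
from collections import Counter

# Hypothetical toy collection: two documents about search, two about cake.
docs = {
    "a": "search engine ranking relevance".split(),
    "b": "search engine index ranking".split(),
    "c": "chocolate cake baking recipe".split(),
    "d": "chocolate cake frosting recipe".split(),
}

def idf(term):
    # Assumed form: log(N / df); only called for terms that occur somewhere
    df = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / df)

def weights(words):
    # Sparse vector: only the axes where the document has a score
    counts = Counter(words)
    return {term: count / len(words) * idf(term) for term, count in counts.items()}

def cosine(a, b):
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

vectors = {doc_id: weights(words) for doc_id, words in docs.items()}

def most_similar(doc_id):
    # The other document whose vector points most nearly the same way
    others = (d for d in vectors if d != doc_id)
    return max(others, key=lambda d: cosine(vectors[doc_id], vectors[d]))

print(most_similar("a"), most_similar("c"))
```

No query is involved at all: the search documents find each other, and so do the cake documents.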


6 Responses to “How Search Really Works: Relevance (2) - Vector Space”

  1. Hamlet Batista (1 comments.) Says:
    April 11th, 2008 at 3:08 pm

    Hi Ruud,

    Excellent post as usual. It is important to mention that the vector space model for ranking is not currently practical for the top search engines due to the size of their index (and the corresponding size of the document vectors). While they use huge matrices for computing the importance of the links (PageRank), the process is done offline and is query-independent. Computing such vectors at query time would be prohibitively expensive in time and resources.

    Cheers

  2. Ruud Hein Says:
    April 11th, 2008 at 8:20 pm

    Good indeed to point that out. Doing any of this at run time is extremely costly. There are cost-reducing procedures: working with top N documents or leader/follower samples.

    Yet I too think that this isn’t used at run time (read: query time) because the TFxIDF vector space model is geared towards words. The IDF of a word is computed, not of phrases. All in all it doesn’t deliver enough bang for its buck.

    Worse: it’s typically a model for a clean index. Boosting TF for a high IDF word is too easy when you have search access to the whole collection.

    It’s interesting though to see how this model can find related documents.

  3. Dev Basu (7 comments.) Says:
    April 14th, 2008 at 2:22 pm

    As usual Ruud this is a great post. It’s always interesting to learn the inner workings of an SE :)

  4. Malte Landwehr (2 comments.) Says:
    April 16th, 2008 at 2:06 pm

    An excellent analysis of how to weight terms by their frequency. But I doubt that a two-dimensional space is enough to represent the complexity needed to maintain an index of millions of documents.

Trackbacks

  1. How Search Engines Do Not Work « IR Thoughts Says:
    April 17th, 2008 at 5:40 am

    […] 1. http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html […]

  2. Vector Space Models and Search Engines « IR Thoughts Says:
    April 21st, 2008 at 8:36 am

    […] That said, today’s post is in reaction to the article at http://www.searchenginepeople.com/blog/how-search-really-works-relevance-2-vector-space.html […]


Copyright © Search Engine People - All Rights Reserved.