Toll Free: 1-877-695-7388

GTA: (647) 699-2838

Search Engine People

How Search Really Works: Relevance (2) – Vector Space

Ruud Hein | April 11th, 2008

This post is part of an ongoing series: How Search Really Works.
Previously: Relevance (1)

Another way we can assess the relevance of a document is by term weighting.

From the keyword density myth we know that true term weighting is done collection-wide.

By looking at the number of documents in the index that a term appears in, we can make a measurement of information: how good, how special... how meaningful is this word?

The word "the" would not be special at all, appearing in way too many documents. Its worth would be close to zero.

But "klebenleiben" ("the reluctance to stop talking about a certain subject"...) would be very special indeed! Because it appears in only 18 documents among millions, its worth, its weight, would automatically be very high.

The measure is called inverse document frequency.

This measure is our weight; it is what we use to judge the relevance of a document.
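
To make the idea concrete, here is a minimal sketch of inverse document frequency. It assumes the common textbook form log(N / df), where N is the number of documents in the collection and df is the number of documents containing the term; the exact formula is not given in this post, and real engines use variations. The function name and the toy collection are made up for illustration.

```python
import math

def inverse_document_frequency(term, documents):
    """Return log(N / df): N documents in total, df of them containing `term`."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term seen nowhere carries no weight
    return math.log(len(documents) / containing)

# Hypothetical toy collection; each document is a set of lowercase words.
documents = [
    {"the", "cat", "sat"},
    {"the", "dog", "barked"},
    {"the", "klebenleiben", "of", "speaker"},
    {"the", "weather", "report"},
]

idf_the = inverse_document_frequency("the", documents)            # in every document
idf_rare = inverse_document_frequency("klebenleiben", documents)  # in one document

print(idf_the, idf_rare)
```

A word that appears in every document ends up with a weight of zero, while the word found in a single document gets the highest weight in this small collection.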

Term Frequency Times

We do so by counting the number of times a word appears in a document. We normalize that count; we adjust it so that the length of a document doesn't matter that much anymore.

We then multiply it by our weight measurement: TF x IDF. Term Frequency times Inverse Document Frequency.

In other words, a high count of a rare word = a high score for that document, for that word. But... a high count of a common word = a not-so-high score for that document, for that word.
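
The multiplication can be sketched in a few lines. This is an illustration under assumptions, not the formula any particular engine uses: term frequency is normalized by simply dividing by document length, and IDF is taken as log(N / df). The function name and the sample sentences are hypothetical.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, collection):
    """Length-normalized term frequency times log(N / df)."""
    if not doc_tokens:
        return 0.0
    # Normalize the raw count by the document length.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Count in how many documents of the collection the term appears.
    df = sum(1 for tokens in collection if term in tokens)
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf

# Hypothetical toy collection of tokenized documents.
collection = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
doc = collection[0]

score_common = tf_idf("the", doc, collection)  # frequent here, but in every document
score_rare = tf_idf("mat", doc, collection)    # appears once, in this document only

print(score_common, score_rare)
```

Although "the" occurs twice in the first document and "mat" only once, "mat" gets the higher score because it is rare across the collection.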

Vectors

A vector is a line of a certain length in a certain direction.

Both the length and the direction of the line represent important information.

Vectors enable us to represent, to talk about, size and direction when position is irrelevant. Wind, velocity, force, acceleration; all these are good candidates to be represented as a vector.

TFxIDF scores are perfectly suited to be represented as vectors.

Vector Space

Think of the words that make up our index as axes of a space.

[Figure: vector space]

Of course, in a real index this space would consist of thousands upon thousands of axes...

Documents as Vectors

For each word in our document we can draw a line (vector) whose length shows the document's TFxIDF score for that term.

[Figure: documents as vectors in the vector space]
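
As a rough sketch of the same idea in code: every distinct word in the collection becomes one axis, and each document becomes a list of TFxIDF scores, one per axis. The weighting (count divided by document length, times log(N / df)) and the names below are assumptions made for the example.

```python
import math
from collections import Counter

def build_vectors(collection):
    """Return (vocabulary, vectors): one axis per word, one vector per document."""
    vocabulary = sorted({t for tokens in collection for t in tokens})
    n = len(collection)
    # Document frequency: in how many documents each word occurs.
    df = Counter(t for tokens in collection for t in set(tokens))
    idf = {t: math.log(n / df[t]) for t in vocabulary}
    vectors = []
    for tokens in collection:
        counts = Counter(tokens)
        # One coordinate per axis; words absent from the document score 0.
        vectors.append([counts[t] / len(tokens) * idf[t] for t in vocabulary])
    return vocabulary, vectors

# Hypothetical toy collection of tokenized documents.
collection = [
    "coffee beans coffee".split(),
    "tea leaves".split(),
    "coffee and tea".split(),
]
vocabulary, vectors = build_vectors(collection)

for tokens, vector in zip(collection, vectors):
    print(tokens, [round(x, 3) for x in vector])
```

With only five distinct words this space has five axes; most coordinates are zero because most documents do not contain most words.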

Queries as Vectors

Every word in a query can also be shown as a vector.

[Figure: documents and queries as vectors in the vector space]

By looking at documents that are "near" our query we can rank (sort) documents in our result set.
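
This post does not say how "near" is measured. A common choice is the cosine of the angle between the query vector and each document vector, and the sketch below assumes that choice. Vectors are stored as dictionaries holding only the non-zero axes; the helper names, document labels, and weighting are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Sparse vector: only the axes (words) that occur in `tokens`."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(a, b):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy collection of tokenized documents.
docs = {
    "d1": "coffee beans roast coffee".split(),
    "d2": "tea leaves steep".split(),
    "d3": "coffee and tea shop".split(),
}
n = len(docs)
df = Counter(t for tokens in docs.values() for t in set(tokens))
idf = {t: math.log(n / count) for t, count in df.items()}

query = tfidf_vector("coffee beans".split(), idf)
ranking = sorted(
    docs,
    key=lambda name: cosine(query, tfidf_vector(docs[name], idf)),
    reverse=True,
)
print(ranking)
```

The document that shares both query words, including the rarer "beans", ranks first; the one with no query words ranks last.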

TFxIDF Vector Space Ranking

If a document is close to our query it answers our query.

But better yet: documents close to ours are similar documents. They're talking about roughly the same thing.

This makes TFxIDF vector space ranking extremely useful to find sets of similar documents through "closeness".
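
The same closeness measure can be applied between documents instead of between a query and a document. The sketch below assumes cosine similarity and a simple TFxIDF weighting, neither of which is prescribed by this post, and uses made-up document names and contents.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(name, vectors):
    """Return the name of the document closest to `name`, excluding itself."""
    others = (other for other in vectors if other != name)
    return max(others, key=lambda other: cosine(vectors[name], vectors[other]))

# Hypothetical toy collection of tokenized documents.
docs = {
    "espresso": "espresso coffee beans grind".split(),
    "latte": "latte coffee milk espresso".split(),
    "gardening": "soil seeds water sunlight".split(),
    "tomatoes": "tomato seeds soil harvest".split(),
}
n = len(docs)
df = Counter(t for tokens in docs.values() for t in set(tokens))
idf = {t: math.log(n / count) for t, count in df.items()}
vectors = {
    name: {t: (c / len(tokens)) * idf[t] for t, c in Counter(tokens).items()}
    for name, tokens in docs.items()
}

for name in docs:
    print(name, "->", most_similar(name, vectors))
```

The two coffee documents point at each other, as do the two gardening documents, without any query being involved.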

Posted in SEO. Tagged: how search really works, ruud

About the Author: Ruud Hein

My paid passion at Search Engine People sees me applying my passions and knowledge to a wide array of problems, ones I usually experience as challenges. People who know me know I love coffee.


4 thoughts on “How Search Really Works: Relevance (2) – Vector Space”

  1. Hamlet Batista says:
    April 11, 2008 at 3:08 pm

    Hi Ruud,

    Excellent post as usual. It is important to mention that the vector space model for ranking is not currently practical for the top search engines due to the size of their index (and the corresponding size of the document vectors). While they use huge matrices for computing the importance of the links (PageRank), the process is done offline and is query-independent. Computing such vectors at query time would be prohibitively expensive in time and resources.

    Cheers

  2. Ruud Hein says:
    April 11, 2008 at 8:20 pm

    Good indeed to point that out. Doing any of this at run time is extremely costly. There are cost-reducing procedures: working with top N documents or leader/follower samples.

    Yet I too think that this isn’t used at run time (read: query time) because the TFxIDF vector space model is geared towards words. The IDF of a word is computed; not of phrases. All in all it doesn’t deliver enough bang for its buck.

    Worse: it’s typically a model for a clean index. Boosting TF for a high IDF word is too easy when you have search access to the whole collection.

    It’s interesting though to see how this model can find related documents.

  3. Dev Basu says:
    April 14, 2008 at 2:22 pm

    As usual Ruud this is a great post. It’s always interesting to learn the inner workings of an SE 🙂

  4. Malte Landwehr says:
    April 16, 2008 at 2:06 pm

    An excellent analysis of how to weight terms by their frequency. But I doubt that a two-dimensional space is enough to represent the complexity needed to maintain an index of millions of documents.

Comments are closed.
