Search Engine People - Search Engine Positioning, Placement Service
Home  |  Blog  |  About Us  |  Careers  |  News  |  Contact Us

How Search Really Works: The Compressed Index

Ruud HeinWelcome! Thanks for visiting!

Subscribe to the full feed

by Ruud Hein
March 14, 2008

This post is part of an ongoing series: How Search Really Works.
Last week: Recognize this index?

Memory is much faster than looking things up.

In order for a search engine in high demand to serve its users efficiently it should keep things in memory instead of looking it up on a disk.

Traditionally large scale search engines will keep their complete dictionary in memory and the posting list on disk.

dictionary-in-memory-postings-on-disk

Inefficient Storage

Obviously the more you can keep in memory and the more information can be read back with one disk action, the better.

Unfortunately computer information gets inefficiently stored in boxes with fixed dimension: if a box is 10 characters wide a 4 character word still takes up 1 box of 10 characters.

Compression

The solution is to squeeze information together so the least amount of space contains the maximum amount of information.

Big String Compressed Dictionary

 uncompressed 

could become:

 big-string

By adding the length of each word to every entry we can make the list of words hundreds of characters shorter.

Gap Compressed Posting List

 uncompressed-postings

could become:

compressed-postings

By storing the difference between the document ID’s (the gaps) we can save hundreds of characters.

The same could be done for storing the gaps between the index numbers for the occurrence positions in each document.

This "compressed representation encodes occurrences of a term as a pointer to the next occurrence of the term to facilitate rapid enumeration of the occurrences of the term".

You could search "occurrences of the terms in the set of documents by following pointers through the compressed representation".

Partial Decompression

An index stored this way is completely lossless: it retains all information from document identifier to document positional identifier.

By starting with the least frequently used term in the search it is very easy to unravel to do a partial decompression of this index by

"identifying occurrences of the terms in the set of documents"

(the dictionary) - to then use;

"the corresponding term identifiers for the terms in the search request to look up a term offset table for a pointer to a first occurrence of the terms in the compressed representation of the set of documents"

(the very first posting document ID);

"and following a chain of pointers starting at the first occurrence to identify other occurrences of the terms in the compressed representation of the set of documents"

(the gap compressed document ID list)

Recommended reading:

  • New Google Approach to Indexing and Stopwords
I hang out at Twitter where I enjoy the company, the buzz, the nuggets of info and opinion we pass along.
Join me on Twitter!
• Get Search Engine People delivered by email

As posted in How Search Really Works.

You're welcome to join the conversation; add your response. You can track the conversation using the RSS 2.0 feed.
You can also trackback from your own site.

8 Responses to “How Search Really Works: The Compressed Index”

  1. Benny (1 comments.) Says:
    March 15th, 2008 at 6:19 pm

    Nice post you got going here. Keep up the good work, you have really nice articles.

  2. Ruud Hein Says:
    March 15th, 2008 at 8:16 pm

    Thanks for letting me know, Benny. You have a great blog going there, by the way. Funny photo too :)
    Feel free to hit me up on Twitter any time.

  3. spostareduro (26 comments.) Says:
    March 16th, 2008 at 3:08 pm

    Ruud, I kinda got lost on this one. Right about here:
    “compressed representation encodes occurrences of a term as a pointer to the next occurrence of the term to facilitate rapid enumeration of the occurrences of the term”…I hate being blond! :-)

  4. Ruud Hein Says:
    March 16th, 2008 at 8:02 pm

    Kim, it talks about storing differences between numbers (the pointers).

    Any time you work with large/long numbers, storing the difference between those numbers costs, on average, less space than storing the actual numbers.

    It also can help speed things up.

    So instead of saying “this word appears in document 1040, 1050 and 1052″ you say “this word appears in document 1040, 10 documents after that and 2 documents after that one”.

    Each number points to the next place: go 2 streets down…. 5 houses further… 10 dollar extra etc.

  5. Gab "SEO ROI" Goldenberg Says:
    March 18th, 2008 at 8:29 pm

    Nice technical post Ruud :).

  6. Ruud Hein Says:
    March 18th, 2008 at 10:33 pm

    Thanks Gabriel — appreciate the comment!

Trackbacks

  1. Weekly Links for March 21th, 2008 | .eduGuru Says:
    March 21st, 2008 at 2:56 pm

    […] How Search Really Works: The Compressed Index - Although a little advanced, still a very nice look at how a search engine works by analysing the Index […]

  2. How Search Really Works: Simple Query Optimization Says:
    March 21st, 2008 at 5:16 pm

    […] This post is part of an ongoing series: How Search Really Works. Last week: The Compressed Index. […]

  3. Leave a Reply

« Will Microsoft get Yahoo on sale?
How to Sell your Client on a Blog Strategy »

Subscribe

Full Feed
Email Updates

Recent Posts

  • Let Me Count The Ways: Enumerated Sphinn Wisdom
  • Friday Funnies: Independence Day
  • Google versus Les Pages Jaunes
  • 50 Sites et + Pour Vous Aider à Enterrer les Commentaires Négatifs sur Vous ou Votre Compagnie!
  • 50 + Sitios que Ayudarán a Ocultar Publicaciones Negativas Acerca de Usted o de su Compañía
  • Key Elements of an Online Community Strategy
  • The Art of Eluding Google: Is It Even Possible?
  • Using the Cross Pollination Concept to Aid With Social Media Success!
  • Perpetuum Mobile SEO : Reaping The Benefits
  • Website Transition Planning Critical When Making Changes

Most Popular Ever

  • 50 Sites to help your bury negative posts about you or your company
  • What is authority and how do you build it?
  • How to sell your client on a blog strategy?
  • Dude I'm phaaaaaat
  • Google vs. Yellow Pages

Most Popular this Month

  • Using Social Media to Build Authority
  • Microsoft adCenter - Where’s The Revenue?
  • Qualifying Prospective Search Marketing Vendors
  • Virtual Reality - Microsoft Office Live
  • Blogging - Step 1 of the Authority Building Process

Subjects

  • Affiliate Marketing
  • Authority Building
  • Blogging
  • Branding
  • Canada
  • Content
  • Coupons
  • eBooks
  • En Español
  • En français
  • En fran栩s
  • Events
  • Experiments
  • Francophone
  • Funnies
  • Google
  • Guest Post
  • How Search Really Works
  • Local Search
  • Mobile Search
  • MSN/Live
  • News
  • Online Marketing
  • Online Retailing
  • Online Shopping
  • Opinion
  • Pages Jaunes
  • PPC
  • Quebec
  • Reputation Management
  • SEM
  • SEO
  • Social Media
  • Spanish
  • Stats
  • Technology
  • The Algorithm is Human
  • Tips
  • Tools
  • video
  • Yahoo
  • Yellow Pages

Archive

  • July 2008
  • June 2008
  • May 2008
  • April 2008
  • March 2008
  • February 2008
  • January 2008
  • December 2007
  • November 2007
  • October 2007
  • September 2007
  • August 2007
  • July 2007
  • June 2007
  • May 2007
  • April 2007
  • March 2007
  • February 2007
  • January 2007
  • September 2006
  • July 2006
  • May 2006
  • March 2006

Search


Recent Readers

The Writers

  • Jeff Quipp
  • Jennifer Osborne
  • Ruud Hein
  • Tom Tsinas

Top Commentators

  • Lily (5)
  • Utah SEO (5)
  • Comparison Shopping (4)
  • Paul (3)
  • Wii Boy (3)
  • Phil Benwell (2)
  • Yossarian (2)
  • Marketing Man (2)
  • Flingcom (2)
  • Jacques SEOman (2)

Blogroll

  • AbleReach Blog
  • aimClear Blog
  • Bill Hartzer
  • Blah Blah Tech
  • Courtney Tuttle's Blog
  • DoshDosh
  • Geyser Marketing
  • Gray Wolf's SEO Blog
  • Justilien - Link Building
  • Learning SEO Basics
  • Matt Cutts Blog
  • New Orleans Internet Marketing
  • NorthSouthMedia
  • Nowsourcing
  • Profectio - Dave Forde
  • Quiddity - Essence SEO Blog
  • Search Engine College
  • Search Engine Jounal
  • Search Engine Land
  • Search Engine Watch
  • SEO by the SEA
  • SEO Design Solutions
  • SEOco UK Blog
  • SEOPittfall
  • SexySEO
  • Small Business SEM
  • Social Desire
  • Sphinn
  • Stepforth.com - Ross Dunn
  • Stephan Spencer's Scatterings
  • Stuntdubl
  • Techipedia
  • Tim Nash
  • Top Rank Blog
  • Trail of the Fire Horse
  • Utah SEO Blog
  • Yeepage Blogging Tips

SEO Toronto - Search Engine Optimization Specialists
Copyright © Search Engine People - All Rights Reserved.
Contact Us at 1-877-486-7875 or 905-426-9340 - contact@searchenginepeople.com