Toll Free: 1-877-695-7388

GTA: (647) 699-2838

Search Engine People
  • SEO
  • SEM
  • CRO
  • Display
  • Blog
  • Why Us
  • Contact
  • Join Our Team
  • Get A Quote

Toll Free: 1-877-695-7388

GTA: (647) 699-2838

How Search Really Works: The Compressed Index

Ruud Hein | March 14th, 2008
Tweet
Share
Share
Pin
0 Shares

This post is part of an ongoing series: How Search Really Works.
Last week: Recognize this index?

Memory is much faster than looking things up.

In order for a search engine in high demand to serve its users efficiently it should keep things in memory instead of looking it up on a disk.

Traditionally large scale search engines will keep their complete dictionary in memory and the posting list on disk.

dictionary-in-memory-postings-on-disk

Inefficient Storage

Obviously the more you can keep in memory and the more information can be read back with one disk action, the better.

Unfortunately computer information gets inefficiently stored in boxes with fixed dimension: if a box is 10 characters wide a 4 character word still takes up 1 box of 10 characters.

Compression

The solution is to squeeze information together so the least amount of space contains the maximum amount of information.

Big String Compressed Dictionary

 uncompressed 

could become:

 big-string

By adding the length of each word to every entry we can make the list of words hundreds of characters shorter.

Gap Compressed Posting List

 uncompressed-postings

could become:

compressed-postings

By storing the difference between the document ID's (the gaps) we can save hundreds of characters.

The same could be done for storing the gaps between the index numbers for the occurrence positions in each document.

This "compressed representation encodes occurrences of a term as a pointer to the next occurrence of the term to facilitate rapid enumeration of the occurrences of the term".

You could search "occurrences of the terms in the set of documents by following pointers through the compressed representation".

Partial Decompression

An index stored this way is completely lossless: it retains all information from document identifier to document positional identifier.

By starting with the least frequently used term in the search it is very easy to unravel to do a partial decompression of this index by

"identifying occurrences of the terms in the set of documents"

(the dictionary) - to then use;

"the corresponding term identifiers for the terms in the search request to look up a term offset table for a pointer to a first occurrence of the terms in the compressed representation of the set of documents"

(the very first posting document ID);

"and following a chain of pointers starting at the first occurrence to identify other occurrences of the terms in the compressed representation of the set of documents"

(the gap compressed document ID list)

Recommended reading:

  • New Google Approach to Indexing and Stopwords
Tweet
Share
Share
Pin
0 Shares
Posted in SEOTagged how search really works, ruud

About the Author: Ruud Hein

My paid passion at Search Engine People sees me applying my passions and knowledge to a wide array of problems, ones I usually experience as challenges. People who know me know I love coffee.

Ruud Hein

8 thoughts on “How Search Really Works: The Compressed Index”

  1. Benny says:
    March 15, 2008 at 6:19 pm

    Nice post you got going here. Keep up the good work, you have really nice articles.

  2. Ruud Hein says:
    March 15, 2008 at 8:16 pm

    Thanks for letting me know, Benny. You have a great blog going there, by the way. Funny photo too 🙂

    Feel free to hit me up on Twitter any time.

  3. spostareduro says:
    March 16, 2008 at 3:08 pm

    Ruud, I kinda got lost on this one. Right about here:
    “compressed representation encodes occurrences of a term as a pointer to the next occurrence of the term to facilitate rapid enumeration of the occurrences of the term”…I hate being blond! 🙂

  4. Ruud Hein says:
    March 16, 2008 at 8:02 pm

    Kim, it talks about storing differences between numbers (the pointers).

    Any time you work with large/long numbers, storing the difference between those numbers costs, on average, less space than storing the actual numbers.

    It also can help speed things up.

    So instead of saying “this word appears in document 1040, 1050 and 1052” you say “this word appears in document 1040, 10 documents after that and 2 documents after that one”.

    Each number points to the next place: go 2 streets down…. 5 houses further… 10 dollar extra etc.

  5. Gab "SEO ROI" Goldenberg says:
    March 18, 2008 at 8:29 pm

    Nice technical post Ruud :).

  6. Ruud Hein says:
    March 18, 2008 at 10:33 pm

    Thanks Gabriel — appreciate the comment!

Comments are closed.

Recent Posts

  • Maximizing Your E-Commerce Sales:
    A CRO Audit Guide
  • Movin’ On Up! Why Migrating to Google Analytics 4 (GA4) Should be a Priority
  • A Year in Review: The Digital Marketing Trends That Defined 2021
  • The Basics of Video Marketing
  • Just How Much Do Google Reviews Impact Your SEO Ranking?

Categories

  • Analytics & ROI Analysis
  • Company News
  • Content
  • Conversion Optimization
  • CRO
  • Display Advertising/RTB
  • Email Marketing
  • En Español
  • En Français
  • Inbound Marketing
  • Lead Nurture & Marketing Automation
  • Local Search
  • Marketing
  • Mobile
  • Partnership Marketing
  • PPC
  • PR
  • SEO
  • Social Media Marketing
  • Web Design

Additional Posts

Achieving Ah-Ha Moments With Your Small Business Clients

March 14th, 2008 | by Donna Fontenot

Will Microsoft get Yahoo on sale?

March 13th, 2008 | by The Doug

Qualifying Prospective Search Marketing Vendors

March 13th, 2008 | by Shane

LET'S TALK

Need more information or want to get in touch?

Get in touch!
  • SEO
  • SEM
  • Display
  • Blog
  • Why Us
  • Join Our Team
  • Contact Us
  • Local SEO
  • Small Business SEO
  • Enterprise SEO
  • International SEO

LOCATION

1305 Pickering Parkway,
5th Floor Pickering, L1V 3P2

PHONE

Toll Free: 1-877-695-7388
Greater Toronto Area: (647) 699-2838

Social

© Search Engine People Inc. 2023 – Canada’s Top Digital Agency
© SEP 2023 – A Search Engine People Company | Privacy Policy

Search Engine People