Search Engine People - Search Engine Positioning, Placement Service

How Search Really Works: Recognize This Index?

by Ruud Hein
March 7, 2008

This post is part of an ongoing series: How Search Really Works.
Last week: "The" Index (2).

Oversimplified: we have at least a few pages in our index, have extracted every single word from those pages, and have written down in an index which words occur in which pages, and where.

Want to talk numbers? We have some very precise ones for the English language.

Google says:

"We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times."

And that’s just a part of their index…

Now comes the fun…

I have to sort what?!

That list of words in the index (the dictionary, as they call it), together with the document ID numbers they have as pointers and the positional information, needs to be sorted.

Uhuh. Sorted.

Let’s say each of the above-mentioned unique words (13,588,391) is 5 characters long. At one byte per character, that’s roughly 68 megabytes right there. Say each unique word is found in one unique document and the document pointer is 5 digits wide: that’s another 68 megabytes or so to store the occurrence of each unique word in one document each. Now imagine the word “the”, which most probably appears at least once in every document as well…
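To make that back-of-the-envelope math concrete, here is a small sketch. The word count comes from the Google quote; the 5 characters per word, 1 byte per character and 5-digit pointer are the simplifying assumptions from the paragraph above, not real index figures.

```python
# Unique word count quoted from Google's announcement.
UNIQUE_WORDS = 13_588_391

# Simplifying assumptions from the text: 5 characters per word at
# 1 byte each, and one 5-digit document pointer per word.
CHARS_PER_WORD = 5
POINTER_WIDTH = 5


def megabytes(n_bytes: int) -> float:
    """Convert bytes to decimal megabytes (1 MB = 1,000,000 bytes)."""
    return n_bytes / 1_000_000


dictionary_bytes = UNIQUE_WORDS * CHARS_PER_WORD
pointer_bytes = UNIQUE_WORDS * POINTER_WIDTH

print(f"dictionary: {megabytes(dictionary_bytes):.1f} MB")
print(f"pointers:   {megabytes(pointer_bytes):.1f} MB")
print(f"total:      {megabytes(dictionary_bytes + pointer_bytes):.1f} MB")
```

Each part comes out just under 68 MB, so about 136 MB for this toy case, and that is before a single common word or positional pointer enters the picture.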

As you see, the memory requirements are huge and we haven’t even started factoring in the storage requirements for the in-document positional pointers for the positional inverted index we know search engines use.

And once we do, we still need to factor in temporary memory to actually do something with that list, like sorting it…

Bit by Bit

The only way to handle this is to work with chunks of data which you combine later on.

[Figure: block sorting]

A chunk, or block, is read into memory, sorted, and written back to disk. At some point you can start to merge the pre-sorted blocks and write them back as one sorted super-index.
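The read, sort, write back, merge cycle can be sketched in a few lines. This is a toy illustration, not how any particular search engine does it: the function names, the tab-separated block files and the tiny block size are all made up for the example, and terms are assumed to contain no tabs or newlines.

```python
import heapq
import os
import tempfile


def write_sorted_block(postings, directory):
    """Sort one block in memory and write it to its own file."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".block", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for term, doc_id in sorted(postings):
            f.write(f"{term}\t{doc_id}\n")
    return path


def read_block(path):
    """Stream (term, doc_id) pairs back from a sorted block file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, doc_id = line.rstrip("\n").split("\t")
            yield term, int(doc_id)


def build_index(postings, block_size, directory):
    """Sort postings block by block, then merge the sorted blocks."""
    paths, block = [], []
    for posting in postings:
        block.append(posting)
        if len(block) == block_size:
            paths.append(write_sorted_block(block, directory))
            block = []
    if block:
        paths.append(write_sorted_block(block, directory))
    # heapq.merge reads the pre-sorted blocks lazily, one entry at a time.
    # A real setup would stream the result to disk instead of building a list.
    return list(heapq.merge(*(read_block(p) for p in paths)))


postings = [("web", 3), ("index", 1), ("sort", 2),
            ("index", 3), ("crawl", 1), ("web", 1)]
with tempfile.TemporaryDirectory() as tmp:
    merged = build_index(postings, block_size=2, directory=tmp)
print(merged)
```

Only one block of two postings is ever sorted in memory at a time; the merge step just keeps one entry per block in view while it walks through the files.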

In a small setup this is one machine reading and writing blocks but in a large scale setup this is a whole bunch of machines working with chunks of chunks.
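One common way to spread that work, sketched here as an assumption rather than as a description of what any engine actually runs: some machines parse documents into postings, and other machines each sort one slice of the dictionary. The term boundaries, function names and sample documents below are invented for the illustration, and plain function calls stand in for the machines.

```python
import bisect
from collections import defaultdict

# Hypothetical split of the dictionary: terms below "i" go to machine 0,
# terms from "i" up to (not including) "r" to machine 1, the rest to machine 2.
BOUNDARIES = ["i", "r"]
NUM_PARTITIONS = len(BOUNDARIES) + 1


def partition_for(term):
    """Return which term range (which "inverter" machine) a term belongs to."""
    return bisect.bisect_right(BOUNDARIES, term)


def parse(doc_id, text):
    """A "parser" machine: turn one document into (term, doc_id) postings."""
    return [(word.lower(), doc_id) for word in text.split()]


def invert(postings):
    """An "inverter" machine: sort the postings of one term range."""
    return sorted(postings)


documents = {1: "search engines sort", 2: "index the web", 3: "sort the index"}

# Route every posting to the machine responsible for its term range.
buckets = defaultdict(list)
for doc_id, text in documents.items():
    for term, d in parse(doc_id, text):
        buckets[partition_for(term)].append((term, d))

# Each range is sorted on its own; because the ranges do not overlap,
# putting them back to back gives one fully sorted index.
index = [p for i in range(NUM_PARTITIONS) for p in invert(buckets[i])]
print(index)
```

Because no machine needs to see another machine's range, the sorting can happen in parallel, and the final combine is a plain concatenation.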

[Figure: distributed indexing]

Recognize This?

In such an index you can’t randomly insert new or updated documents or remove deleted ones. You would have to re-sort on every update.

So what do you do?

You sort your index and use it: this is your main index. New stuff you find on the web goes into another, more temporary index. Call it the supplemental index. In order to deliver complete and up-to-date results, you have to return results from both indexes when people search.
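Answering a query from both indexes can look as simple as this. The two dictionaries and their contents are hypothetical toy data, and a real engine would rank the combined results rather than just listing main results first.

```python
# Hypothetical toy data: each index maps a term to a sorted list of doc IDs.
main_index = {"search": [1, 4, 9], "index": [2, 4]}
supplemental_index = {"search": [12], "sorting": [11, 12]}


def lookup(term):
    """Return doc IDs for a term from both indexes, main results first."""
    results = list(main_index.get(term, []))
    seen = set(results)
    for doc_id in supplemental_index.get(term, []):
        if doc_id not in seen:  # a re-crawled page may live in both indexes
            results.append(doc_id)
            seen.add(doc_id)
    return results


print(lookup("search"))   # [1, 4, 9, 12]
print(lookup("sorting"))  # [11, 12]
print(lookup("missing"))  # []
```

The main index never has to be re-sorted for a new page to show up in results; the price is a second lookup on every query.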

Every once in a while you’ll need to merge the new stuff from the supplemental index into the main one. If you find a lot of new stuff every day, you’ll need some kind of priority setup which says these entries in the supplemental index are worth the CPU time of merging them back into the main index and these are not … yet.
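A priority setup like that could be sketched as follows. How a search engine actually scores entries is not something this post knows; the numeric priority, the budget parameter and the function name are assumptions for the example, and pending entries are assumed to be unique.

```python
import heapq


def merge_worthwhile(main_postings, pending, budget):
    """Merge the `budget` highest-priority pending entries into the main index.

    main_postings: sorted list of (term, doc_id) pairs.
    pending: list of (priority, term, doc_id); higher priority merges first.
    Returns (new_main_postings, still_pending).
    """
    chosen = heapq.nlargest(budget, pending)
    chosen_set = set(chosen)
    still_pending = [p for p in pending if p not in chosen_set]
    new_block = sorted((term, doc_id) for _, term, doc_id in chosen)
    # Both inputs are sorted, so one linear merge pass yields the new main index.
    merged = list(heapq.merge(main_postings, new_block))
    return merged, still_pending


main = [("index", 2), ("search", 1)]
pending = [(0.9, "news", 7), (0.1, "zebra", 8), (0.5, "blog", 9)]

main, pending = merge_worthwhile(main, pending, budget=2)
print(main)     # [('blog', 9), ('index', 2), ('news', 7), ('search', 1)]
print(pending)  # [(0.1, 'zebra', 8)]
```

The low-priority entry simply waits in the supplemental index until a later merge decides it is worth the CPU time.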

Of course, back in the old days you would have just gone out and re-indexed everything thoroughly…


9 Responses to “How Search Really Works: Recognize This Index?”

  1. spostareduro (26 comments.) Says:
    March 8th, 2008 at 7:27 am

    Thanks for all the help, Ruud. This is great information for us newbies and others as well.

    PS: A “few pages” 12,400,000,000 ..Funny stuff for sure *-)

  2. Nick James (44 comments.) Says:
    March 8th, 2008 at 7:49 am

    This is a great series Ruud, an essential for anybody starting out in SEO, as you explain things in such a clear and concise manner.
    Take this post for instance, how a search engine goes about the horrendous task of indexing stuff. If I’d tried to figure it out myself I’d probably have ended up with severe brainache. This article has given me a basic understanding of the ins and outs of a process that we all take for granted and rarely give a second thought to. But understanding it in whatever capacity can only go towards making us better SEOs at the end of the day.

  3. Ruud Hein Says:
    March 9th, 2008 at 8:52 pm

    Kim, glad the series remains of value for you!

    Nick; thanks for the nice comment, man! Yes, I too think that understanding this stuff can help us better understand search. There’re many levels to this and not all of them require you to whip out your calculator and get number crunchy with it.

  4. Make Money Blogging (35 comments.) Says:
    March 10th, 2008 at 4:02 am

    It must be too early in the morning for me, time for another coffee then a re-read ;o)

  5. Internet Marketing Joy (12 comments.) Says:
    March 10th, 2008 at 3:56 pm

    I never really thought about how search engines work. Thanks for the info, now I know.

  6. jamie (1 comments.) Says:
    March 11th, 2008 at 5:17 am

    Hi Ruud,
    I found this pretty interesting stuff. Can you provide any tips on backlinking strategy please.
    Thanks

  7. Ruud Hein Says:
    March 12th, 2008 at 11:48 am

    @Jamie I’ve added that to the topic list. Thanks for the suggestion!

  8. Shana Albert (1 comments.) Says:
    March 13th, 2008 at 3:03 am

    Yay for us, and thanks, Ruud.
