How Search Really Works: Recognize This Index?

by Ruud Hein March 7th, 2008 

This post is part of an ongoing series: How Search Really Works.
Last week: "The" Index (2).

Oversimplified: we have at least a few pages in our index, have extracted every single word from those pages and have written down in an index where in which pages those words occur.

Want to talk numbers? We have some very precise ones for the English language.

Google says;

"We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times."

And that's just a part of their index…

Now comes the fun…

I have to sort what?!

292020324_286705be9f_m That list of words in the index (the dictionary as they call it) together with the document ID numbers they have as a pointer and the positional information needs to be sorted.

Uhuh. Sorted.

Let's say each of the above mentioned unique words (13,588,391) is 5 characters long. That's 67 MegaByte right there. Say each unique word is found in one unique document and the document pointer is 5 numbers wide: that's another 67 MegaByte to store the occurrence of each unique word in one document each. Imagine the word the which most probably appears at least once in every document as well…

As you see, the memory requirements are huge and we haven't even started factoring in the storage requirements for the in-document positional pointers for the positional inverted index we know search engines use.

And once we do — we still need to factor in temporary memory to actually do something with that list; like sorting it…

Bit by Bit

The only way to handle this is to work with chunks of data which you combine later on.

block sorting 

A chunk, or block, is read into memory, sorted, written back. At one point you can start to merge the pre-sorted blocks and write them back into one sorted super-index.

In a small setup this is one machine reading and writing blocks but in a large scale setup this is a whole bunch of machines working with chunks of chunks.

distributed-indexing

Recognize This?

In such an index you can't randomly insert new or updated documents or remove deleted ones. You would have to re-sort on every update.

So what do you do?

You sort your index and use it: this is your main index. New stuff you find on the web goes into another, more temporary index. Call it the supplemental index. In order to deliver complete and up to date results, when people search you have to return results from both indexes.

Every once in a while you'll need to merge the new stuff from the supplemental index into the new one. If you find a lot of new stuff every day you'll need some kind of priority setup which says these entries in the supplemental index are worth the CPU time of merging them back in the main index and these are not … yet.

Of course back in the old days you would have just gone out and re-index everything thoroughly…

Ruud Hein

My paid passion at Search Engine People sees me applying my passions and knowledge to a wide array of problems, ones I usually experience as challenges. People who know me know I love coffee.

Ruud Hein

You May Also Like

9 Responses to “How Search Really Works: Recognize This Index?”

  1. [...] Ruud Hein of Search Engine People has added 2 new additions to "How Search Really Works" are "The Index Part 2" and "Recognize This index?" [...]

  2. spostareduro says:

    Thanks for all the help Ruud..This is great information for us newbies and other as well..

    PS: A "few pages" 12,400,000,000 ..Funny stuff for sure *-)

  3. Nick James says:

    This is a great series Ruud, an essential for anybody starting out in SEO, as you explain things in such a clear and concise manner.
    Take this post for instance, how a search engine goes about the horrendous task of indexing stuff. If I'd tried to figure it out myself I'd probably ended up with severe brainache. This article has given me a basic understanding of the ins and outs of a process that we all take for granted and rarely give a second thought to. But understanding it in whatever capacity can only go towards making us better SEOs at the end of the day.

  4. Ruud Hein says:

    Kim, glad the series remain of value for you!

    Nick; thanks for the nice comment, man! Yes, I too think that understanding this stuff can help us better understand search. There're many levels to this and not all of them require you to whip out your calculator and get number crunchy with it.

  5. It must be too early in the morning for me, time for another coffee then a re-read ;o)

  6. I did not really thought of how Search Engines work..^^..Thanks for the info..now I know..^^

  7. jamie says:

    Hi Ruud,
    I found this pretty interesting stuff. Can you provide any tips on backlinking strategy please.
    Thanks

  8. Ruud Hein says:

    @Jamie I've added that the topic list. Thanks for the suggestion!

  9. Shana Albert says:

    Yay for us, and thanks, Ruud.