Ruud HeinWelcome! Thanks for visiting!

Subscribe to the full feed

How Search Really Works: Recognize This Index?

by Ruud Hein.


This post is part of an ongoing series: How Search Really Works .
Last week: "The" Index (2) .

Oversimplified: we have at least a few pages in our index, have extracted every single word from those pages and have written down in an index where in which pages those words occur.

Want to talk numbers? We have some very precise ones for the English language.

Google says ;

" We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times ."

And that's just a part of their index…

Now comes the fun…

I have to sort what?!

292020324_286705be9f_m That list of words in the index (the dictionary as they call it) together with the document ID numbers they have as a pointer and the positional information needs to be sorted.

Uhuh. Sorted.

Let's say each of the above mentioned unique words (13,588,391) is 5 characters long. That's 67 MegaByte right there. Say each unique word is found in one unique document and the document pointer is 5 numbers wide: that's another 67 MegaByte to store the occurrence of each unique word in one document each. Imagine the word the which most probably appears at least once in every document as well…

As you see, the memory requirements are huge and we haven't even started factoring in the storage requirements for the in-document positional pointers for the positional inverted index we know search engines use.

And once we do — we still need to factor in temporary memory to actually do something with that list; like sorting it…

Bit by Bit

The only way to handle this is to work with chunks of data which you combine later on.

bloc de tri 

A chunk, or block, is read into memory, sorted, written back. At one point you can start to merge the pre-sorted blocks and write them back into one sorted super-index.

In a small setup this is one machine reading and writing blocks but in a large scale setup this is a whole bunch of machines working with chunks of chunks .

distribués indexation

Recognize This?

In such an index you can't randomly insert new or updated documents or remove deleted ones. You would have to re-sort on every update.

So what do you do?

You sort your index and use it: this is your main index. New stuff you find on the web goes into another, more temporary index. Call it the supplemental index . In order to deliver complete and up to date results, when people search you have to return results from both indexes .

Every once in a while you'll need to merge the new stuff from the supplemental index into the new one. If you find a lot of new stuff every day you'll need some kind of priority setup which says these entries in the supplemental index are worth the CPU time of merging them back in the main index and these are not … yet.

Of course back in the old days you would have just gone out and re-index everything thoroughly…

I hang out at Twitter where I enjoy the company, the buzz, the nuggets of info and opinion we pass along.
Join me on Twitter!

présenter après guest


As posted in How Search Really Works on March 7, 2008.

9 Responses so far: 8 comments and 1 trackbacks

  1. spostareduro says:

    Thanks for all the help Ruud..This is great information for us newbies and other as well..

    PS: A "few pages" 12,400,000,000 ..Funny stuff for sure *-)

  2. Nick James says:

    This is a great series Ruud, an essential for anybody starting out in SEO, as you explain things in such a clear and concise manner.
    Take this post for instance, how a search engine goes about the horrendous task of indexing stuff. If I'd tried to figure it out myself I'd probably ended up with severe brainache. This article has given me a basic understanding of the ins and outs of a process that we all take for granted and rarely give a second thought to. But understanding it in whatever capacity can only go towards making us better SEOs at the end of the day.

  3. Ruud Hein says:

    Kim, glad the series remain of value for you!

    Nick; thanks for the nice comment, man! Yes, I too think that understanding this stuff can help us better understand search. There're many levels to this and not all of them require you to whip out your calculator and get number crunchy with it.

  4. It must be too early in the morning for me, time for another coffee then a re-read ;o)

  5. I did not really thought of how Search Engines work..^^..Thanks for the info..now I know..^^

  6. jamie says:

    Hi Ruud,
    I found this pretty interesting stuff. Can you provide any tips on backlinking strategy please.
    Thanks

  7. Ruud Hein says:

    @Jamie I've added that the topic list. Thanks for the suggestion!

  8. Shana Albert says:

    Yay for us, and thanks, Ruud.

Trackbacks/Pingbacks

  1. [...] Ruud Hein of Search Engine People has added 2 new additions to "How Search Really Works" are "The Index Part 2" and "Recognize This index?" [...]


Friend Connect

RECENT READERS

English flagItalian flagKorean flagChinese (Simplified) flagChinese (Traditional) flagPortuguese flagGerman flagFrench flagSpanish flagJapanese flagArabic flagRussian flagGreek flagDutch flagBulgarian flagCzech flagCroat flagDanish flagFinnish flagHindi flagPolish flagRumanian flagSwedish flagNorwegian flagCatalan flagFilipino flagHebrew flagIndonesian flagLatvian flagLithuanian flagSerbian flagSlovak flagSlovenian flagUkrainian flagVietnamese flagAlbanian flagEstonian flagGalician flagMaltese flagThai flagTurkish flagHungarian flag

Comment Recherche Really Works: Reconnaître cet indice? | Search Engine People | Toronto | French

Ruud HeinWelcome! Thanks for visiting!

Subscribe to the full feed

How Search Really Works: Recognize This Index?

by Ruud Hein.


This post is part of an ongoing series: How Search Really Works .
Last week: "The" Index (2) .

Oversimplified: we have at least a few pages in our index, have extracted every single word from those pages and have written down in an index where in which pages those words occur.

Want to talk numbers? We have some very precise ones for the English language.

Google says ;

" We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times ."

And that's just a part of their index…

Now comes the fun…

I have to sort what?!

292020324_286705be9f_m That list of words in the index (the dictionary as they call it) together with the document ID numbers they have as a pointer and the positional information needs to be sorted.

Uhuh. Sorted.

Let's say each of the above mentioned unique words (13,588,391) is 5 characters long. That's 67 MegaByte right there. Say each unique word is found in one unique document and the document pointer is 5 numbers wide: that's another 67 MegaByte to store the occurrence of each unique word in one document each. Imagine the word the which most probably appears at least once in every document as well…

As you see, the memory requirements are huge and we haven't even started factoring in the storage requirements for the in-document positional pointers for the positional inverted index we know search engines use.

And once we do — we still need to factor in temporary memory to actually do something with that list; like sorting it…

Bit by Bit

The only way to handle this is to work with chunks of data which you combine later on.

bloc de tri 

A chunk, or block, is read into memory, sorted, written back. At one point you can start to merge the pre-sorted blocks and write them back into one sorted super-index.

In a small setup this is one machine reading and writing blocks but in a large scale setup this is a whole bunch of machines working with chunks of chunks .

distribués indexation

Recognize This?

In such an index you can't randomly insert new or updated documents or remove deleted ones. You would have to re-sort on every update.

So what do you do?

You sort your index and use it: this is your main index. New stuff you find on the web goes into another, more temporary index. Call it the supplemental index . In order to deliver complete and up to date results, when people search you have to return results from both indexes .

Every once in a while you'll need to merge the new stuff from the supplemental index into the new one. If you find a lot of new stuff every day you'll need some kind of priority setup which says these entries in the supplemental index are worth the CPU time of merging them back in the main index and these are not … yet.

Of course back in the old days you would have just gone out and re-index everything thoroughly…

I hang out at Twitter where I enjoy the company, the buzz, the nuggets of info and opinion we pass along.
Join me on Twitter!

présenter après guest


As posted in How Search Really Works on March 7, 2008.

9 Responses so far: 8 comments and 1 trackbacks

  1. spostareduro says:

    Thanks for all the help Ruud..This is great information for us newbies and other as well..

    PS: A "few pages" 12,400,000,000 ..Funny stuff for sure *-)

  2. Nick James says:

    This is a great series Ruud, an essential for anybody starting out in SEO, as you explain things in such a clear and concise manner.
    Take this post for instance, how a search engine goes about the horrendous task of indexing stuff. If I'd tried to figure it out myself I'd probably ended up with severe brainache. This article has given me a basic understanding of the ins and outs of a process that we all take for granted and rarely give a second thought to. But understanding it in whatever capacity can only go towards making us better SEOs at the end of the day.

  3. Ruud Hein says:

    Kim, glad the series remain of value for you!

    Nick; thanks for the nice comment, man! Yes, I too think that understanding this stuff can help us better understand search. There're many levels to this and not all of them require you to whip out your calculator and get number crunchy with it.

  4. It must be too early in the morning for me, time for another coffee then a re-read ;o)

  5. I did not really thought of how Search Engines work..^^..Thanks for the info..now I know..^^

  6. jamie says:

    Hi Ruud,
    I found this pretty interesting stuff. Can you provide any tips on backlinking strategy please.
    Thanks

  7. Ruud Hein says:

    @Jamie I've added that the topic list. Thanks for the suggestion!

  8. Shana Albert says:

    Yay for us, and thanks, Ruud.

Trackbacks/Pingbacks

  1. [...] Ruud Hein of Search Engine People has added 2 new additions to "How Search Really Works" are "The Index Part 2" and "Recognize This index?" [...]


Friend Connect

RECENT READERS

English flag Italian flag Korean flag Chinese (Simplified) flag Chinese (Traditional) flag Portuguese flag German flag French flag Spanish flag Japanese flag Arabic flag Russian flag Greek flag Dutch flag Bulgarian flag Czech flag Croat flag Danish flag Finnish flag Hindi flag Polish flag Rumanian flag Swedish flag Norwegian flag Catalan flag Filipino flag Hebrew flag Indonesian flag Latvian flag Lithuanian flag Serbian flag Slovak flag Slovenian flag Ukrainian flag Vietnamese flag Albanian flag Estonian flag Galician flag Maltese flag Thai flag Turkish flag Hungarian flag