How Search Really Works: "The" Index (1)

by Ruud Hein.


This post is part of an ongoing series: How Search Really Works.
Previous Instalment: The Keyword Density Myth.

If a search engine searched "live" through every document it knows about for the word we're looking for, it could take its time and then simply report where it found our word.

In this example our search engine has only one index: the documents themselves.

[Image: document-only index]
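To make that "live" search concrete, here is a minimal sketch. The three toy documents and the function name `live_search` are made up for illustration; the point is only that every query has to re-read every document.

```python
def live_search(documents, word):
    """Scan every document at query time and return the IDs that contain `word`."""
    matches = []
    for doc_id, text in documents.items():
        # No index: each query walks through the full text of each document.
        if word in text.lower().split():
            matches.append(doc_id)
    return matches


# Hypothetical toy collection: document ID -> text.
docs = {
    1: "search engines build an index",
    2: "compression keeps the index small",
    3: "search and compression go together",
}

print(live_search(docs, "search"))  # [1, 3]
```

The work per query grows with the total amount of text, which is exactly the time a search engine doesn't have.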

However, time is something a search engine doesn't have; the query needs to be answered now.

What we need is a real index!

Boolean Index - Talk about the Matrix

[Image: Boolean index]

The problem with a Boolean index, where we put a little flag (1) or not (0) for every word for every document, is that it quickly grows far too large.

Three documents containing just four distinct words between them take twelve 1s or 0s, on top of the bits and bytes we need to store the words themselves. Now imagine a matrix where one of the sides is 13,940,000,000 columns wide…
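A small sketch of such a matrix, assuming three made-up documents that share four words. The names `build_boolean_index` and `matrix_cells` are hypothetical helpers, not anything a real engine exposes.

```python
def build_boolean_index(documents):
    """Return (vocabulary, doc_ids, matrix) with one 0/1 flag per word per document."""
    vocabulary = sorted({word for text in documents.values() for word in text.split()})
    doc_ids = sorted(documents)
    matrix = {
        word: [1 if word in documents[doc_id].split() else 0 for doc_id in doc_ids]
        for word in vocabulary
    }
    return vocabulary, doc_ids, matrix


def matrix_cells(num_words, num_docs):
    """Number of flags the matrix needs, whether or not the word occurs."""
    return num_words * num_docs


# Hypothetical toy collection: three documents, four distinct words.
docs = {1: "search index", 2: "index compression", 3: "search compression engine"}

vocabulary, doc_ids, matrix = build_boolean_index(docs)
for word in vocabulary:
    print(f"{word:12}", matrix[word])
print(matrix_cells(len(vocabulary), len(doc_ids)))  # 12
```

Only 7 of those 12 flags are 1s; the rest is space spent recording that a word is absent. With one side 13,940,000,000 wide and, purely as an assumption, one million words on the other side, `matrix_cells` comes to 13,940,000,000,000,000 flags.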

The Inverted Index

[Image: inverted index]

In the inverted index we record only the places (documents) where a word does occur.

It's called inverted because instead of each document listing the words it contains, each word points to the documents it occurs in.
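As a rough sketch of that idea, the snippet below builds a word-to-documents mapping for three made-up documents. `build_inverted_index` is a hypothetical name, and splitting on whitespace is a simplification of real tokenizing.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of document IDs it occurs in."""
    index = defaultdict(list)
    # Visiting documents in ID order keeps every list sorted by document ID.
    for doc_id in sorted(documents):
        for word in set(documents[doc_id].split()):
            index[word].append(doc_id)
    return dict(index)


# Hypothetical toy collection: three documents, four distinct words.
docs = {1: "search index", 2: "index compression", 3: "search compression engine"}

index = build_inverted_index(docs)
print(index["search"])       # [1, 3]
print(index["compression"])  # [2, 3]
print(sum(len(postings) for postings in index.values()))  # 7
```

For this toy collection the index stores 7 document pointers, one for each place a word actually occurs, and nothing for the places it doesn't.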

Sorted by document pointer, the inverted index is extremely efficient at performing AND queries.

Let's reshuffle our example a little bit to make this visually clear:

[Image: intersecting the search]

If we search for documents that contain the words "search compression" and we walk down both rows at the same time, then as soon as one row jumps to a higher document ID we can jump forward in the other row as well: there is no use checking the intermediate IDs, as we now know those documents won't have both words.
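The walk described above can be sketched as a two-pointer merge over two sorted rows. The document IDs below are invented for the example, and `intersect` is a hypothetical name; real engines may add skip pointers or other shortcuts on top of this basic idea.

```python
def intersect(postings_a, postings_b):
    """Return document IDs present in both sorted lists (an AND query)."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        a, b = postings_a[i], postings_b[j]
        if a == b:
            # Both words occur in this document.
            result.append(a)
            i += 1
            j += 1
        elif a < b:
            # Row A is behind: no document below b can match, so move A forward.
            i += 1
        else:
            # Row B is behind: move B forward instead.
            j += 1
    return result


# Hypothetical rows, each sorted by document ID.
search_row = [1, 3, 8, 12, 20]
compression_row = [2, 3, 9, 12, 15, 20]

print(intersect(search_row, compression_row))  # [3, 12, 20]
```

Each row is read at most once from left to right, which is why keeping the rows sorted by document ID matters.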

Knowing only about yes/no occurrences, an inverted index is horrible at phrase and proximity matching:

[Image: Paris Hilton]
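One way to picture the problem, using two made-up documents: both contain "paris" and "hilton", so the index returns both, and it has no idea which one contains the phrase. The substring check at the end is a crude stand-in for going back to the documents themselves.

```python
# Hypothetical toy collection: same two words, different meaning.
docs = {
    1: "paris hilton was in the news",
    2: "we stayed at the hilton in paris",
}

# Document-level inverted index: it records yes/no occurrence only, no positions.
index = {}
for doc_id in sorted(docs):
    for word in set(docs[doc_id].split()):
        index.setdefault(word, []).append(doc_id)

# The AND query is all the index can answer.
candidates = sorted(set(index["paris"]) & set(index["hilton"]))
print(candidates)  # [1, 2]

# To find the phrase we have to re-read the candidate documents.
phrase_matches = [doc_id for doc_id in candidates if "paris hilton" in docs[doc_id]]
print(phrase_matches)  # [1]
```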

To be continued…

I hang out at Twitter where I enjoy the company, the buzz, the nuggets of info and opinion we pass along.
Join me on Twitter!



As posted in How Search Really Works on February 22, 2008.

4 Responses so far: 3 comments and 1 trackback

  1. Utah SEO Pro says:

    Excellent post on co-occurrence in search. Interested in seeing the follow ups. Information retrieval should be on the "must-know" list for all SEOs but amazing how many don't completely grasp it.

  2. I thought I understood a little about search engines but now I'm confused. Eagerly awaiting part 2.

  3. Geld Lenen says:

    I'm really looking forward to read more of this serie. I could make some people very happy if I referred them here!

Trackbacks/Pingbacks

  1. [...] In the first parts of the series we have been educated  in META keywords, keyword links, keyword stuffing, keyword density myth,  and now we have "How Search Really Works: "The" Index (1)" [...]

