This post is part of an ongoing series: How Search Really Works.
Previous Instalment: The Keyword Density Myth.

If a search engine would search "live" through the documents it knows about for the occurrence of the word we're looking for it could take its time and then simply report where it found our word.

In this example our search engine has only one index: the documents itself.

 document-only-index

However, time is something a search engine doesn't have; the query needs to be answered now.

What we need is a real index!

Boolean Index - Talk about the Matrix

boolean-index

The problem with a boolean index, where we put a little flag (1) or not (0) for every word for every document is that it quickly grows way and way too large.

Three documents with amongst them just four words take 12 1's or 0's -- apart from the bits and bytes we need to store the word. Now imagine a matrix where one of the sides is 13,940,000,000 columns wide...

The Inverted Index

 inverted-index

In the inverted index we record only the places (documents) where a word does occur.

It's called inverted because instead of the documents providing the occurrences of a word, the word points to which documents it occurs in.

Sorted by document pointer, the inverted index is extremely efficient in performing AND queries.

Let's reshuffle our example a little bit to make this visually clear: intersect-search

If we search for documents that contain the words "search compression" and we down these rows at the same time, as soon as one row makes a jump to a higher document ID, you can jump forward in the other row as well: no use checking the intermediate ones as you now know that those won't have both words.

Knowing only about yes/no occurrences, an inverted index is horrible at phrase and proximity matching:

paris-hilton

To be continued...

About the Author: Ruud Hein

I love helping to make web sites make it. From the ground up if needed. CSS challenges, server-side scripting, user and device friendly JavaScript tricks search engines have no problems with. Tracking how the sites perform and then figuring out how to make that performance and the tracking better. I'm passionate about information. No matter how often I trim my feeds in my feed readers (yes, I use more than one), I always have a couple of hundred in there covering topics ranging from design to usability, from SEO to SEM, from life hacks to productivity blogs, from.... Well, you get the idea, I guess. Knowledge and information management is close to my heart. Has to be with the amount of information I track. My "trusted system" is usually in flux but always at hand and fully searchable. My paid passion job at Search Engine People sees me applying my passions and knowledge to a wide array of problems, ones I usually experience as challenges. It's good to have you here: pleased to meet you! Read more...