How Search Really Works: "The" Index (2)

by Ruud Hein February 29th, 2008 

This post is part of an ongoing series: How Search Really Works.
Last week: "The" Index (1).

Last week we saw how an inverted index (where a list of words points to a list of documents in which they appear) is insanely useful for doing AND queries.

inverted index

But what if you're not looking for any document that has the words search AND people AND engine but you're looking for Search Engine People?

Well, if document 42 in our example reads "the engine was found after a search by some people" or "people use a search engine such as Google" than a traditional inverted index would think it's spot-on for your search. Ai….

(Extended) Biword Inverted Index

One way to go about this would be to somehow generate an inverted list of phrases.

biword phrase index

Problem: if you index phrases of 2-words length, 3+ word phrase searches become just another AND query, combining the parts of the phrase. "Long John" AND "Silver".

Indexing phrases of 3-words length simply moves the problem to 4+ word phrase searches…. etc. etc.

Problem: the inverted index becomes huge, listing every word in every document and every 2 (3? 4?) word phrase in every document….

Positional Inverted Index

The only real solution is to store not only the incidence of occurrence of a word in a document but to store the exact position(s) of the word in that document.

positional index

In this example document 42 is identified for "search engine people" because the words appear in that order: they appear in positions 1, 2 and 3.

Advantage: because the positional index is similar in construction as the traditional inverted index it inherits the same advantage. That is, when doing an AND query it can jump ahead whenever one of the words doesn't occur in the document it is looking at.

Advantage: simply by looking at words occurring in the right order, any phrase of any length can be found even though it isn't indexed as such.

Advantage: by having precise position information we can do proximity queries.

Advantage: phrase matching and query word proximity can also be used to rank search results.

The Winner

Although a positional index is at least 2-4 times (or up to 50%) larger than a traditional inverted index the payoff is so large that this is the type of index in use by commercial search engines — for phrases. … In general….

Frequently searched phrases are still better stored in a biwords index; less frequently searched phrases are better processed with a positional inverted index.

Index Type & SEO

The fun (warning: geek talking!) is of course that knowing this kind of stuff implicitly explains you things.

For example, knowing that in order for a positional inverted index to really work all words, including so-called "stop words", need to be indexed makes it less surprising that stop words are dead.

Positional indexing and retrieval also makes it not only logic but expected that shop in new york and shop new york give different results.

 

Different results = different thinking, different SEO… different opportunities.

It's all in the index :)

Ruud Hein

My paid passion at Search Engine People sees me applying my passions and knowledge to a wide array of problems, ones I usually experience as challenges. People who know me know I love coffee.

Ruud Hein

You May Also Like

5 Responses to “How Search Really Works: "The" Index (2)”

  1. Johan Krost says:

    Wow, this is advanced search engine stuff. Thanks for the info

  2. Utah SEO Pro says:

    Another great post Ruud! Stop words are definitely not dead. They are such a large part of natural language search queries.

  3. Shana Albert says:

    There's a trick, you know.

  4. [...] This post is part of an ongoing series: How Search Really Works. Last week: "The" Index (2). [...]

  5. [...] Hein's 2 newest additions to "How Search Really Works" are "The Index Part 2" and "Recognize This [...]