How Search Really Works: "The" Index (2)

This post is part of an ongoing series: How Search Really Works.
Last week: "The" Index (1).

Last week we saw how an inverted index (where a list of words points to a list of documents in which they appear) is insanely useful for doing AND queries.

inverted index

But what if you're not looking for any document that has the words search AND people AND engine but you're looking for Search Engine People?

Well, if document 42 in our example reads "the engine was found after a search by some people" or "people use a search engine such as Google" than a traditional inverted index would think it's spot-on for your search. Ai....

(Extended) Biword Inverted Index

One way to go about this would be to somehow generate an inverted list of phrases.

biword phrase index

Problem: if you index phrases of 2-words length, 3+ word phrase searches become just another AND query, combining the parts of the phrase. "Long John" AND "Silver".

Indexing phrases of 3-words length simply moves the problem to 4+ word phrase searches.... etc. etc.

Problem: the inverted index becomes huge, listing every word in every document and every 2 (3? 4?) word phrase in every document....

Positional Inverted Index

The only real solution is to store not only the incidence of occurrence of a word in a document but to store the exact position(s) of the word in that document.

positional index

In this example document 42 is identified for "search engine people" because the words appear in that order: they appear in positions 1, 2 and 3.

Advantage: because the positional index is similar in construction as the traditional inverted index it inherits the same advantage. That is, when doing an AND query it can jump ahead whenever one of the words doesn't occur in the document it is looking at.

Advantage: simply by looking at words occurring in the right order, any phrase of any length can be found even though it isn't indexed as such.

Advantage: by having precise position information we can do proximity queries.

Advantage: phrase matching and query word proximity can also be used to rank search results.

The Winner

Although a positional index is at least 2-4 times (or up to 50%) larger than a traditional inverted index the payoff is so large that this is the type of index in use by commercial search engines -- for phrases. ... In general....

Frequently searched phrases are still better stored in a biwords index; less frequently searched phrases are better processed with a positional inverted index.

Index Type & SEO

The fun (warning: geek talking!) is of course that knowing this kind of stuff implicitly explains you things.

For example, knowing that in order for a positional inverted index to really work all words, including so-called "stop words", need to be indexed makes it less surprising that stop words are dead.

Positional indexing and retrieval also makes it not only logic but expected that shop in new york and shop new york give different results.


Different results = different thinking, different SEO... different opportunities.

It's all in the index 🙂

About the Author: Ruud Hein

My paid passion at Search Engine People sees me applying my passions and knowledge to a wide array of problems, ones I usually experience as challenges. People who know me know I love coffee.

Ruud Hein

Additional Posts

10 Keyword Research Tools to Add to Your Arsenal

Catching Up With

The End Is Nigh!!!

In compliance with Ontario’s non-essential business closure our physical offices are closed until further notice. Fortunately our willingness to adopt work from home and the required technology over the past two years has allowed us to continue our operations without impact. For our valued clients and partners you can expect the same great level of service and execution you have become accustomed to.

Many clients/prospects have reached out to us in an effort to introduce new campaigns as quickly as possible. In an effort to help our clients pivot we have increased our campaign build capacity. We are now able to turn new campaigns over in 2-3 business days opposed to the typical 5-7 business day turnaround time. Please note that campaign launch approvals from the vendor side (Google, Bing, Facebook, Instagram etc.) may be delayed as those companies migrate to work from home.

For existing clients please reach out to your account manager with any questions you may have.

For non-clients looking for assistance with new campaigns please Contact Us

Read previous post:
10 Keyword Research Tools to Add to Your Arsenal

Keyword research is one of the cornerstone's of starting any good search engine optimization campaign. While I use Wordze for...Read...