This post is part of an ongoing series: How Search Really Works.
Last week: "The" Index (2).

Oversimplified: we have at least a few pages in our index, we have extracted every single word from those pages, and we have written down in an index which pages each word occurs in, and where in those pages.

Want to talk numbers? We have some very precise ones for the English language.

Google says:

"We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times."

And that's just a part of their index…

Now comes the fun…

I have to sort what?!

That list of words in the index (the dictionary, as they call it), together with the document ID numbers that serve as pointers and the positional information, needs to be sorted.

Uhuh. Sorted.

Let's say each of those 13,588,391 unique words is 5 characters long. That's 67 megabytes right there. Say each unique word is found in exactly one document and the document pointer is 5 digits wide: that's another 67 megabytes just to store a single occurrence per word. Now imagine the word the, which most probably appears at least once in every document, so its pointer list alone dwarfs that estimate…
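If you want to check that arithmetic, here is the back-of-the-envelope version in Python; the 5-character words and 5-digit pointers are just the simplifying assumptions from above, not real measurements from any search engine.

```python
# Back-of-the-envelope sizes for the dictionary, using the numbers quoted above.
# The 5-character words and 5-digit document pointers are this post's own
# simplifying assumptions, not real figures.
unique_words = 13_588_391

word_bytes = unique_words * 5      # 5 characters per unique word
pointer_bytes = unique_words * 5   # one 5-digit document pointer per word

print(f"dictionary strings: {word_bytes / 1e6:.1f} MB")    # ~67.9 MB
print(f"document pointers:  {pointer_bytes / 1e6:.1f} MB") # another ~67.9 MB
```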

As you can see, the memory requirements are huge, and we haven't even started factoring in the storage needed for the in-document positional pointers of the positional inverted index we know search engines use.

And once we do, we still need to factor in temporary memory to actually do something with that list, like sorting it…

Bit by Bit

The only way to handle this is to work with chunks of data which you combine later on.

[Image: block sorting]

A chunk, or block, is read into memory, sorted, and written back. At some point you can start merging the pre-sorted blocks and writing them back out as one sorted super-index.
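Here is a minimal sketch of that block-sort-and-merge idea, assuming the postings arrive as (term, doc_id) pairs from a stream too big to hold in memory; the block size, the temporary run files, and the tab-separated line format are all illustrative choices, not how any real engine does it.

```python
import heapq
import os
import tempfile

def spill_block(block, directory):
    """Sort one in-memory block of 'term<TAB>doc_id' lines and write it to disk."""
    block.sort()
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run")
    with os.fdopen(fd, "w") as f:
        f.writelines(line + "\n" for line in block)
    return path

def sort_in_blocks(postings, directory, block_size=1_000_000):
    """Read postings in fixed-size blocks, sort each block, spill it to disk."""
    paths, block = [], []
    for term, doc_id in postings:
        block.append(f"{term}\t{doc_id:010d}")  # zero-pad so string sort == numeric sort
        if len(block) >= block_size:
            paths.append(spill_block(block, directory))
            block = []
    if block:
        paths.append(spill_block(block, directory))
    return paths

def merge_blocks(paths, out_path):
    """Stream-merge the pre-sorted runs into one sorted 'super-index' file."""
    files = [open(p) for p in paths]
    with open(out_path, "w") as out:
        out.writelines(heapq.merge(*files))
    for f in files:
        f.close()

# Usage, roughly:
#   runs = sort_in_blocks(postings, directory="/tmp")
#   merge_blocks(runs, out_path="super-index.txt")
```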

In a small setup this is one machine reading and writing blocks, but in a large-scale setup it's a whole bunch of machines working with chunks of chunks.

[Image: distributed indexing]
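To give a feel for the "chunks of chunks" idea, here is a toy, single-process simulation of how postings might be split across machines by term; the hash partitioning and the four pretend "machines" are made-up illustrative choices, not a description of any real cluster.

```python
from collections import defaultdict

def partition_postings(postings, n_machines):
    """Route each (term, doc_id) posting to a pretend machine based on the term."""
    shards = defaultdict(list)
    for term, doc_id in postings:
        shards[hash(term) % n_machines].append((term, doc_id))
    return shards

def index_shard(shard_postings):
    """What one machine does: sort its own chunk and group doc_ids per term."""
    index = defaultdict(list)
    for term, doc_id in sorted(shard_postings):
        index[term].append(doc_id)
    return index

postings = [("the", 1), ("cat", 1), ("the", 2), ("mat", 2)]
shards = partition_postings(postings, n_machines=4)
partial_indexes = [index_shard(s) for s in shards.values()]
```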

Recognize This?

In such an index you can't randomly insert new or updated documents or remove deleted ones. You would have to re-sort on every update.

So what do you do?

You sort your index and use it: this is your main index. New stuff you find on the web goes into another, more temporary index; call it the supplemental index. To deliver complete and up-to-date results, you have to return results from both indexes when people search.

Every once in a while you'll need to merge the new stuff from the supplemental index into the main one. If you find a lot of new stuff every day, you'll need some kind of priority setup that says these entries in the supplemental index are worth the CPU time of merging them back into the main index and these are not … yet.
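A tiny sketch of that main/supplemental split, assuming both indexes are simple term-to-doc-ID maps; the priority rule (only merge terms that have gathered enough new postings) is a made-up stand-in for the kind of cutoff described above.

```python
def search(term, main_index, supplemental_index):
    """Serve results from both indexes so brand-new documents show up right away."""
    return sorted(set(main_index.get(term, []) + supplemental_index.get(term, [])))

def merge_supplemental(main_index, supplemental_index, min_new_postings=100):
    """Fold the 'worth it' entries of the supplemental index back into the main one."""
    for term, doc_ids in list(supplemental_index.items()):
        if len(doc_ids) >= min_new_postings:  # this entry earns its CPU time
            main_index[term] = sorted(set(main_index.get(term, []) + doc_ids))
            del supplemental_index[term]      # everything else waits for a later pass
```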

Of course, back in the old days you would have just gone out and re-indexed everything thoroughly…