Instead of painstakingly grabbing the absolute best matches for your query to then rank those with infinite precision, one time saving strategy has search engines go for "close enough".
Given all the time, money and resources in the world, here's what we'd normally do.
Word by word you go through a search. You look in your documents and see which has word one…. word two… word three…. You get the picture.
You give some plus points for every time a searched word appears in a document. How many points? Depends on the TFxIDF score for that specific word.
You add up the scoring into a sum, measured by relevance again. Do the same for the query itself (treating it like a very short document basically). In short: you're calculating their vector space scores.
Measure the mathematical similarity between document 1 and the query, document 2 and the query, document 3 and… Yup.
And then you can't just slap all those on the screen. You have to tailor to the searcher's need and pick and sort the top scoring documents!
Now you could sort all your scored documents at once or just go for the top number of documents needed; say the first or next 10 because the searchers has that set as maximum results per page.
Instead of doing this HUGE sorting routine you throw all values together in a big black hat (mathematicians call this a "heap"), come up with the top 10 documents or so and only then sort them.
What struck me as funny is that this high precision, high-cost way of doing things doesn't necessarily mean you get the most bang for you buck, the best quality results for your searcher's patience.
No, the mathematical similarity between our search and those documents is something we perceive as relevant.
That's a low payback to work for when the cost is so high; comparing a huge number of documents, calculating mathematical similarities…. grabbing the top of the heap…
The perception of relevance is something a search engine can use though by going for "good enough".
The Inexact Sort Of Top 10-ish Documents You Might Want
Instead of calculating the top 10 with high precision, why not grab a bundle of documents that will most probably be in that top 10?
Just grab a bunch of documents that are in the race to be the answer to the searcher's query and take the top 10 of that bunch!
Even though this top 10 is not The Top 10 we would have found using our Painstaking Precision method, it will contain many documents that would have been in that top 10 or near it.
It's like having a bowl of M&M's and wanting to eat red ones. You could sort them out painstakingly and then go for the red ones… or you could grab in that area where you see most of the red ones seem to be.