How Search Really Works: Grabbing Most Red M&M's

by Ruud Hein.

This post is part of an ongoing series: How Search Really Works.
Previously: Relevance (2)

Instead of painstakingly grabbing the absolute best matches for your query and then ranking them with infinite precision, one time-saving strategy has search engines go for "close enough".

Painstaking Precision

[Image: sorted M&M's]

Given all the time, money and resources in the world, here's what we'd normally do.

Word by word, you go through a search. You look in your documents and see which has word one… word two… word three… You get the picture.

How Search Really Works: Relevance (2) - Vector Space

by Ruud Hein.

This post is part of an ongoing series: How Search Really Works.
Previously: Relevance (1)

Another way we can assess the relevance of a document is by term weighting.

From the keyword density myth we know that true term weighting is done collection-wide.

By looking at the number of documents in the index that a term appears in, we can measure how much information it carries: how good, how special… how meaningful is this word?

The word "the" would not be special at all, as it appears in far too many documents. Its worth would be close to zero.
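To make that concrete, here is a minimal sketch of collection-wide weighting. It assumes the common inverse document frequency formulation, log(N / df); the post does not name a specific formula, and the function name and toy documents are made up for illustration.

```python
import math

def inverse_document_frequency(term, documents):
    """Return log(N / df) for `term`, or 0.0 if no document contains it.

    `documents` is a list of word sets; N is the number of documents and
    df is the number of documents that contain the term.
    """
    total = len(documents)
    df = sum(1 for words in documents if term in words)
    return math.log(total / df) if df else 0.0

# Hypothetical three-document collection.
documents = [
    {"the", "search", "engine"},
    {"the", "index", "is", "inverted"},
    {"the", "people", "search"},
]

print(inverse_document_frequency("the", documents))       # 0.0
print(inverse_document_frequency("inverted", documents))  # about 1.0986
```

In this toy collection "the" appears in every document, so its weight is exactly zero; in a real index it would be merely close to zero.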

How Search Really Works: Relevance (1)

by Ruud Hein.

This post is part of an ongoing series: How Search Really Works.
Previously: Simple Query Optimization.

Search is always boolean: yes or no. True or false.

Either the words are in the document or not.
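A tiny sketch of that yes-or-no test, assuming whitespace tokenization and a made-up three-document collection (the function name is hypothetical):

```python
def boolean_match(query_words, document_text):
    """Return True only if every query word occurs in the document."""
    words = set(document_text.lower().split())
    return all(word.lower() in words for word in query_words)

# Hypothetical documents, keyed by document id.
docs = {
    1: "search engines index documents",
    2: "people search for documents about engines",
    3: "a recipe for apple pie",
}

matches = [
    doc_id
    for doc_id, text in docs.items()
    if boolean_match(["search", "engines"], text)
]
print(matches)  # [1, 2]
```

Each document is simply in or out; nothing in the result says which match is the better one.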

[Image: boolean search]

But as you see, not all documents are "born alike". Some are about our topic, some just mention it.

What we need, what we want, is not just a big list of results; we want a relevant list of results, preferably sorted so that the best bet appears on top.

How Search Really Works: Simple Query Optimization

by Ruud Hein.

This post is part of an ongoing series: How Search Really Works.
Last week: The Compressed Index.

While human beings can scan a page and see if the whole phrase "a grandiloquent dictionary" appears on it, a search engine can't.

A search engine needs to:

  1. Look up the occurrences for each word in the phrase
  2. See if the positions of words in the document fit the phrase
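The two steps above can be sketched roughly as follows. This assumes a positional index (word, then document, then word positions) and whitespace tokenization; the function names and sample documents are invented for illustration, not taken from any particular engine.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def phrase_search(index, phrase):
    """Return the ids of documents that contain the words of `phrase` in order."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    # Step 1: look up each word and keep documents that contain all of them.
    candidates = set(index[words[0]])
    for word in words[1:]:
        candidates &= set(index[word])
    # Step 2: check that the positions line up as consecutive words.
    hits = []
    for doc_id in sorted(candidates):
        for start in index[words[0]][doc_id]:
            if all(start + offset in index[word][doc_id]
                   for offset, word in enumerate(words)):
                hits.append(doc_id)
                break
    return hits

# Hypothetical documents.
docs = {
    1: "a grandiloquent dictionary of rare words",
    2: "a dictionary for the grandiloquent",
    3: "plain words only",
}
index = build_positional_index(docs)
print(phrase_search(index, "a grandiloquent dictionary"))  # [1]
```

Document 2 contains all three words, so it survives step 1, but its positions don't fit the phrase and step 2 rejects it.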

As a search engine isn't smart, it needs to work smart.

Leverage Keyword Frequency

[Image: sort by frequency]

How Search Really Works: The Compressed Index

by Ruud Hein.

This post is part of an ongoing series: How Search Really Works.
Last week: Recognize This Index?

Reading from memory is much faster than looking things up on disk.

In order for a search engine in high demand to serve its users efficiently, it should keep things in memory instead of looking them up on a disk.

Traditionally, large-scale search engines keep their complete dictionary in memory and the posting lists on disk.
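As a rough sketch of that split: the dictionary below is an ordinary in-memory dict that only stores where each posting list lives in a file, and the posting lists themselves are read from disk on demand. The file layout (unsigned 32-bit document ids, little-endian) and every name here are assumptions for illustration, not how any specific engine stores its index.

```python
import os
import struct
import tempfile

def write_postings(path, postings):
    """Write posting lists to disk; return an in-memory dictionary.

    The dictionary maps term -> (byte offset, number of document ids).
    """
    dictionary = {}
    with open(path, "wb") as handle:
        for term in sorted(postings):
            doc_ids = postings[term]
            dictionary[term] = (handle.tell(), len(doc_ids))
            handle.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    return dictionary

def read_postings(path, dictionary, term):
    """Fetch one posting list with a single seek and a single read."""
    if term not in dictionary:
        return []
    offset, count = dictionary[term]
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read(4 * count)
    return list(struct.unpack(f"<{count}I", data))

# Hypothetical posting lists.
postings = {"search": [1, 2, 42], "engine": [2, 42], "people": [42]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "postings.bin")
    dictionary = write_postings(path, postings)
    print(dictionary["search"])                        # (12, 3)
    print(read_postings(path, dictionary, "search"))   # [1, 2, 42]
```

The lookup of the term itself never touches the disk; only the posting list costs a disk action.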

[Image: dictionary in memory, postings on disk]

Inefficient Storage

Obviously, the more you can keep in memory and the more information you can read back with one disk action, the better.

How Search Really Works: Recognize This Index?

by Ruud Hein.

This post is part of an ongoing series: How Search Really Works.
Last week: "The" Index (2).

Oversimplified: we have at least a few pages in our index, have extracted every single word from those pages, and have written down in an index which pages those words occur in and where.

Want to talk numbers? We have some very precise ones for the English language.

Google says:

"We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times."

How Search Really Works: "The" Index (2)

by Ruud Hein.

This post is part of an ongoing series: How Search Really Works.
Last week: "The" Index (1).

Last week we saw how an inverted index (where a list of words points to a list of documents in which they appear) is insanely useful for doing AND queries.
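As a quick refresher, an inverted index and an AND query over it might look like the sketch below. The documents and function names are made up, and tokenization is a plain whitespace split.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids it appears in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def and_query(index, words):
    """Return the ids of documents that contain every word, in any order."""
    doc_sets = [index.get(word.lower(), set()) for word in words]
    if not doc_sets:
        return []
    return sorted(set.intersection(*doc_sets))

# Hypothetical documents.
docs = {
    7: "search engine people",
    42: "people use a search engine such as google",
    99: "the engine of a car",
}
index = build_inverted_index(docs)
print(and_query(index, ["search", "people", "engine"]))  # [7, 42]
```

The AND query is just a set intersection, which is why it is so cheap. Note that it knows nothing about the order of the words.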

[Image: inverted index]

But what if you're not looking for any document that has the words search AND people AND engine, but for the exact phrase "Search Engine People"?

Well, if document 42 in our example reads "the engine was found after a search by some people" or "people use a search engine such as Google", then a traditional inverted index would think it's spot-on for your search. Ouch…

How Search Really Works: "The" Index (1)

by Ruud Hein.

This post is part of an ongoing series: How Search Really Works.
Previous Instalment: The Keyword Density Myth.

If a search engine searched "live" through the documents it knows about for occurrences of the word we're looking for, it could take its time and then simply report where it found our word.

In this example our search engine has only one index: the documents themselves.
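A minimal sketch of such a "live" search, assuming whitespace tokenization and two made-up documents: every query reads every word of every document.

```python
def live_search(word, docs):
    """Scan every document in full; report (doc_id, position) for each hit."""
    hits = []
    target = word.lower()
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            if token == target:
                hits.append((doc_id, position))
    return hits

# Hypothetical documents.
docs = {
    1: "what we need is a real index",
    2: "the index is the documents themselves",
}
print(live_search("index", docs))  # [(1, 6), (2, 1)]
```

The work grows with the total size of the collection, not with the number of matches, which is fine for two documents and hopeless for billions.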

[Image: documents as the only index]

However, time is something a search engine doesn't have; the query needs to be answered now.

What we need is a real index!

How Search Really Works: The Keyword Density Myth

by Ruud Hein.

This post is part of an ongoing series: How Search Really Works.
Last week: Keyword Stuffing.

What is Keyword Density?

Keyword Density is a function, a calculation, of keyword frequency.

It's calculated as the number of occurrences of the keyword divided by the total number of words in the document, and is usually expressed as a percentage.
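The calculation itself fits in a few lines. This sketch assumes a single-word keyword and whitespace tokenization; the function name and sample text are invented for illustration.

```python
def keyword_density(keyword, text):
    """Return occurrences of `keyword` divided by total words, as a percentage."""
    words = text.lower().split()
    if not words:
        return 0.0
    occurrences = words.count(keyword.lower())
    return 100.0 * occurrences / len(words)

# Hypothetical ten-word page.
text = "buy cheap shoes and cheap boots at our shoes shop"
print(keyword_density("cheap", text))  # 20.0
```

Two occurrences in ten words gives 20%. The number only describes this one page; it says nothing about how common the word is in the rest of the collection.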

[Image: keyword density example]

What is Keyword Density Used For?

Nothing much, really.

Keyword density can help in readability calculations.

Keyword density is also sometimes used as a simplified way to introduce local keyword weight, but should never be confused with it.

Why don't Search Engines use Keyword Density?

[Image: local keyword density]

How Search Really Works: Keyword Stuffing

by Ruud Hein.

This post is part of an ongoing series: How Search Really Works.
Last week: Keyword Links.

Left to their own devices, people will assign keywords (tag or link) as they please.

Together, those keywords paint a rich picture of the linked content.

[Image: natural linking]

Keyword stuffing is the unnatural repetitive use of a specific word or phrase.

In your content…

[Image: keyword stuffing]

…or your links…

[Image: keyword stuffing 2]
