5 Common Information Retrieval Myths

Marie-Claire Jenkins

15 years ago

There is often quite a lot of confusion about information retrieval, usually because concepts are used and investigated without their foundations being understood first.

It's a little tricky to get acquainted with all of the various dimensions of IR, but a few simple things need to be clarified.

I have seen the following misconceptions among SEOs and also computer science students. It's not unusual and it's easily fixed 🙂

1. Information retrieval is the same as Information Extraction

Information Extraction is not Information Retrieval: Information Extraction differs from traditional techniques in that it does not recover from a collection a subset of documents which are hopefully relevant to a query, based on key-word searching (perhaps augmented by a thesaurus).

Instead, the goal is to extract from the documents (which may be in a variety of languages) salient facts about prespecified types of events, entities or relationships. These facts are then usually entered automatically into a database, which may then be used to analyse the data for trends, to give a natural language summary, or simply to serve for on-line access. (GATE)
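To make the difference concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three toy documents, the `retrieve` and `extract` function names, and the single "acquisition" pattern are all hypothetical, and real extraction systems such as GATE do far more than one regular expression. The point is only the shape of the output: retrieval hands back documents, extraction hands back facts.

```python
import re

# Hypothetical toy collection; the documents are invented for illustration.
DOCS = {
    "d1": "Acme Corp acquired Widget Ltd in 2009.",
    "d2": "Widget prices rose sharply last year.",
    "d3": "Globex acquired Initech in 2011.",
}

def retrieve(query, docs):
    """Information retrieval: return ids of documents sharing a term with the query."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if terms & set(re.findall(r"\w+", text.lower()))]

# One prespecified event type: "<buyer> acquired <target> in <year>".
ACQUISITION = re.compile(
    r"(?P<buyer>[A-Z][\w ]*?) acquired (?P<target>[A-Z][\w ]*?) in (?P<year>\d{4})"
)

def extract(docs):
    """Information extraction: return structured facts, ready to load into a database."""
    return [m.groupdict() for text in docs.values() for m in ACQUISITION.finditer(text)]

print(retrieve("widget", DOCS))  # a list of document ids
print(extract(DOCS))             # a list of buyer/target/year records
```

The first call answers "which documents should I read?", the second answers "who bought whom, and when?" without anyone reading the documents at all.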

2. Information retrieval is a computer science discipline

No, not quite.

IR is interdisciplinary because of the many different problems which arise within it.

First off, our data is usually in text format, so we need linguistics and cognitive psychology.

Then the data is stored somehow, and is either structured or unstructured, so we need information architecture, information science and library science to help with that.

The text and the query are analysed and rendered into a numeric format that a machine can understand, so statistics comes into play as well.
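As a rough illustration of what "rendered into a numeric format" can mean, the sketch below turns text into term-frequency vectors and compares them with cosine similarity. This is one simple and well-known scheme, not a description of how any particular search engine works, and the function names (`to_vector`, `cosine`, `rank`) and sample documents are made up for the example.

```python
import math
import re
from collections import Counter

def to_vector(text):
    """Turn text into a sparse term-frequency vector (term -> count)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return ids of documents with a non-zero score, best match first."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(text)), doc_id) for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: -p[0]) if score > 0]

# Hypothetical sample documents.
docs = {
    "a": "cats chase mice",
    "b": "dogs chase cats and cats",
    "c": "stock markets fell",
}
print(rank("cats", docs))
```

Once both the query and the documents are vectors, "how relevant is this document?" becomes a question about numbers, which is where the statistics and the mathematics come in.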

We borrow ideas from Physics too and of course many mathematical concepts come into play.

Computer science as a whole is a mosaic of different disciplines.

3. Information retrieval is just for search engines

Search engines are a common example of an information retrieval system, but online library catalogs (OPACs), commercial databases like Web of Science (and many search engines), and even the entire WWW are all information retrieval systems.

4. Information retrieval's biggest challenge is ranking documents

Search is an unsolved problem. We have a good 90 to 95% of the solution, but there is a lot to go in the remaining 10%.
-- Marissa Mayer

She is quite right: we still have a deluge of work to do in this area. We have invented the wheel and we have hooked four of them onto a box. We don't have a Ferrari Enzo yet.

Some of the biggest challenges still involve relevance feedback, information extraction, multimedia retrieval, effective retrieval, routing and filtering, interfaces and browsing, "Magic", indexing and retrieval, distributed IR and integrated solutions.

The "Magic" issue (a term coined by Bruce Croft) concerns vocabulary mismatch: the words people use in a query are often not the words a relevant document uses for the same thing.
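Vocabulary mismatch is easy to demonstrate with a toy example. In the sketch below a plain keyword match misses a relevant document because the query says "car" and the document says "automobile". Expanding the query with synonyms is one common way to soften the problem, not a solution to it. The tiny `SYNONYMS` table and the `matches` function are hypothetical and hand-made for this example.

```python
import re

# Hypothetical hand-made thesaurus; real systems draw on much larger resources.
SYNONYMS = {"car": {"automobile", "vehicle"}, "film": {"movie"}}

def tokens(text):
    """Lowercase a text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def matches(query, doc, expand=False):
    """True if the query shares a term with the document, optionally after synonym expansion."""
    q = tokens(query)
    if expand:
        for term in list(q):
            q |= SYNONYMS.get(term, set())
    return bool(q & tokens(doc))

doc = "Automobile servicing guide"
print(matches("car repair", doc))               # plain keyword match
print(matches("car repair", doc, expand=True))  # match after query expansion
```

The hard part in practice is that nobody hands you the right thesaurus, and words that are synonyms in one context are not in another.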

There is a list of Grand challenges for IR which is published and presented every year. This is the latest document. (PDF)

5. Google pioneered information retrieval

Google did arguably make the most commercially successful information retrieval system, but they were not the first to launch into IR.

In fact no search engine was.

In 1945 Vannevar Bush's "As We May Think" appeared in The Atlantic Monthly, and in this article he described an information retrieval system. In the 1960s Gerard Salton created the SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System at Cornell University. One of the first papers was Melvin Earl (Bill) Maron and J. L. Kuhns' "On relevance, probabilistic indexing, and information retrieval" in the Journal of the ACM in 1960. In 1963 the Weinberg report "Science, Government and Information" gave a full explanation of the issues concerning the "crisis of scientific information": basically, we couldn't manage the huge corpus that we had gathered throughout the centuries.

Karen Spärck Jones researched computational linguistics and its application to IR at Cambridge relentlessly from the 1960s onwards. J. W. Sammon pioneered the vector model in 1968, and in the 1970s NLM's AIM-TWX and MEDLINE were the first ever online IR systems. Around the same time, Theodor Nelson started introducing hypertext.

Marie-Claire Jenkins is an information scientist. Her hands-on experience as an SEO and work on her PhD in artificial intelligence, natural language understanding & generation enable her to author Science for SEO - A Bridge Between Worlds