Introduction:
“Data, data everywhere” (The Economist, 2010)(1) predicted that by 2016, 140 terabytes of information would be collected in just five days, leaving us with more information than we can act on. As Daniel J. Boorstin put it, “The fog of information can drive out knowledge”. What has kept us from drowning in this ocean of information, and has evolved as quickly as mobile handsets have, is Information Retrieval (IR).
What is IR?
IR is the process of searching through a large amount of information to obtain a smaller subset that satisfies a search criterion. IR systems parse large sets of data, or databases, to identify the specific pieces of information that match a user-defined search query. IR has many applications, and across them mathematicians have developed a number of selection and correlation methods to determine which information is returned.
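To make the idea concrete, here is a minimal sketch in Python of matching a query against a small collection. It assumes one simple approach, an inverted index with boolean AND matching, and is not meant as the way any particular IR system works. The function names `build_index` and `search` and the three sample documents are made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the documents matching the first term, then narrow down.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample collection, keyed by document id.
docs = {
    1: "microfilm search machine",
    2: "medical literature retrieval system",
    3: "retrieval of research documents",
}

index = build_index(docs)
matches = search(index, "retrieval system")
print(matches)
```

Only document 2 contains both query terms, so it is the only match. Real systems usually go further and rank the matching documents by relevance instead of returning an unordered set.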
In this blog post, we will go over the history of IR, some of its applications, and why IR matters to you.
IR systems accept a user query and locate relevant information in a larger collection of data. We all search the internet daily, so this may seem like a simple concept, but the history of IR dates all the way back to the early stages of computing.
In 1931, Emanuel Goldberg developed a “Statistical Machine” that used pattern recognition to search rolls of microfilm documents. To read more about this invention, see (2).
Following this early invention, the US military developed an IR system in the early 1940s to index and retrieve research documents captured from the Germans. In 1950, American computer scientist Calvin Mooers is said to have coined the term “information retrieval.”(2)
In the mid-1960s, the National Library of Medicine developed the Medical Literature Analysis and Retrieval System (MEDLARS). This database, used to organize bibliographic information on the life sciences and biomedicine, was one of the first large-scale, machine-based IR systems available to the public.(3)
Why it matters to you
The seeds of Information Retrieval existed as early as 1945. Back then, IR was important in a different sense than it is today. To provide a brief comparison, the applications of IR in 1995 were outlined as follows (4):
- Relevance Feedback – Identifying the relevant documents within a list of retrieved documents.
- Information Extraction – Techniques developed in the context of the Advanced Research Projects Agency (ARPA) or the Message Understanding Conferences (MUCs), which consisted of identifying attributes and relations in text.
- Multimedia Retrieval – Techniques developed to access image, video and sound databases and to index these files for effective retrieval by categories such as colors, fabrics, languages, etc.
- Effective Retrieval – Retrieval techniques that evolved over 30 years and now address the challenges of multilingual text, tone and sarcasm.
Other important fields were:
- Routing and Filtering
- Interfaces and Browsing
- Integrated Solutions
Importance of IR in the current world
IR is the foundation of fields such as machine learning, text analytics, large-scale databases (Hadoop), distributed processing, and business analytics.
To give a sense of scale, Hadoop, which Doug Cutting created after Lucene and Nutch(6), can perform external sorts in a cloud environment, sorting 102.5 TB in 4,328 seconds(7).
In Conclusion
Today, relevant information is at the heart of analysis. Whether you are a consumer or a data scientist, knowing the basics of IR makes it much easier to experiment with data.
Watch this space for exciting updates on IR!
Literature
- (1) http://www.economist.com/node/15557443
- (2) http://ciir-publications.cs.umass.edu/getpdf.php?id=1066
- (3) http://en.wikipedia.org/wiki/MEDLARS
- (4) http://www.dlib.org/dlib/november95/11croft.html
- (5) http://www.slideshare.net/dsrc/20131113-dsrc-launchmd-r (slide 5)
- (6) http://cutting.wordpress.com
- (7) http://sortbenchmark.org/