Tags
ASIR, IBM Watson, IR Cutting Edge Technology, Knowledge graph, Semantic IR, Symbolic Natural Language processing, Wolfram Alpha
Technological evolution of IR
Before diving into the latest technology in IR, it helps to know the visible milestones in search along the way (Robert Hof, 2013):
- Spell-checking in search (2001)
- The concept of synonyms in a search (2003)
- Auto completion of queries (2005)
- Universal search on all kinds of topics in one interface (2007)
- Google Instant to save several seconds per search (2010)
- Knowledge Graph to understand concepts and not just words (2012)
- Most recently voice search and Google Now, the predictive search service (2013)
Evolution of Google search. (Andreas Blumauer, Dec 2012)
In a broader sense, Google Search has evolved from matching just words, to contextual search, and finally to knowledge graph based search, i.e. Linked Data based search. (Andreas Blumauer, Dec 2012)
This week we are exploring cutting-edge techniques in Information Retrieval (IR). To find out what the IR community is currently researching, we looked at the “Call for Papers” of several IR conferences. A relatively new conference, the four-year-old ASIR (Advances in Semantic Information Retrieval), caught our attention. Its 2014 call for papers includes topics such as (Wasielewska, 2014):
- Ontology for semantic information retrieval
- Ontology alignment, mapping and merging
- Semantic multimedia retrieval
- Natural language semantic processing
- Evaluation methodologies for semantic search and retrieval
- Domain-specific semantic applications
All these topics revolve around the main idea of Semantic IR. A whole conference devoted to Semantic IR? It must be cutting edge, right? Google’s Knowledge Graph, which is based on Semantic IR, was only introduced in 2012, two years after ASIR started. So what exactly is semantic search, and how is it implemented?
Semantic search tries to understand the searcher’s intent and the contextual meaning of the terms in the query in order to improve its search results. This is a departure from most traditional search engine methods, which rely on word statistics to score and rank search results. To achieve semantic search, the search engine has to know what the words mean and which entities they refer to.
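To make the contrast concrete, here is a minimal sketch of the kind of word-statistics scoring that traditional engines build on. It uses a basic TF-IDF sum; the function name `tf_idf_scores` and the three sample documents are our own illustrative assumptions, not any particular engine's implementation.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return scores

# Illustrative documents (made up for this sketch).
docs = [
    "rio is a city in brazil",
    "rio is an animated film released in 2011",
    "the hotel has a pool",
]
scores = tf_idf_scores("rio film", docs)
```

The scorer only counts words. It ranks the film document highest because "film" is rare, but it has no notion of which "Rio" the searcher means, and that is the gap semantic search tries to close.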
Semantic search can solve word sense disambiguation problems
Consider the search term “Rio”, which could refer to several different entities with the same name. Ideally, if the search engine knew exactly which entity named “Rio” is being searched for, the results could be much more relevant. Otherwise, the search engine has to disambiguate between results for Rio the 2011 film, Rio the city in Brazil, and Rio the hotel.
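One simple way to picture this disambiguation is to rank the candidate entities by how many of the other query words match words typically associated with each entity. The sketch below assumes a hand-written table of context words (`CANDIDATES`) and a hypothetical helper `disambiguate`; a real system would learn such associations from data.

```python
# Hypothetical context words for each entity named "Rio" (hand-picked for illustration).
CANDIDATES = {
    "Rio (2011 film)": {"film", "movie", "animated", "cast", "trailer"},
    "Rio (city in Brazil)": {"city", "brazil", "beach", "carnival", "flights"},
    "Rio (hotel)": {"hotel", "rooms", "booking"},
}

def disambiguate(query, candidates=CANDIDATES):
    """Return candidate names ranked by overlap between query terms and context words."""
    terms = set(query.lower().split())
    return sorted(
        candidates,
        key=lambda name: len(terms & candidates[name]),
        reverse=True,
    )

best = disambiguate("rio movie trailer")[0]
```

With the query "rio movie trailer", the film shares two context words with the query and the other candidates share none, so the film is ranked first. A bare query of "rio" gives every candidate a score of zero, which is exactly the ambiguous case described above.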
Google’s Knowledge Graph approach to Semantic Search
One way to understand the different entities would be to build a knowledge database that keeps track of all the different entities. This requires knowing the ontology of the entities that are being tracked. That is to say, knowing how all the entities are related and what attributes they have.
Google’s implementation is a knowledge graph that links entities to one another, so that a search for one entity can also retrieve the objects associated with it. The Knowledge Graph is described in this video: (Google, 2012)
To learn more about the Knowledge Graph, go here (Google.com)
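As a rough illustration of entities, attributes and links, the following sketch stores facts as (subject, relation, object) triples. The class `TinyKnowledgeGraph` and its relation names are assumptions made for this example; they are not Google's actual data model.

```python
from collections import defaultdict

class TinyKnowledgeGraph:
    """Toy store of (subject, relation, object) facts; not Google's actual design."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, subject, relation, obj):
        self._edges[subject].append((relation, obj))

    def attributes(self, entity):
        """All (relation, object) pairs stored for an entity."""
        return list(self._edges.get(entity, []))

    def related(self, entity):
        """Objects of this entity's facts that are themselves entities in the graph."""
        return [obj for _, obj in self._edges.get(entity, []) if obj in self._edges]

# Illustrative facts; the relation names are hypothetical.
kg = TinyKnowledgeGraph()
kg.add("Rio (2011 film)", "type", "Film")
kg.add("Rio (2011 film)", "release_year", "2011")
kg.add("Rio (2011 film)", "set_in", "Rio (city)")
kg.add("Rio (city)", "type", "City")
kg.add("Rio (city)", "country", "Brazil")

linked = kg.related("Rio (2011 film)")
```

Here `attributes` answers "what do we know about this entity?", while `related` follows links to other entities, which is the step that lets a search for one entity surface its associated objects.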
Shortfalls of a knowledge graph
A graph-based approach to understanding the ontology of entities is powerful, but one can easily imagine its shortfalls. Google’s Knowledge Graph currently presents static information about its entities (the height of a building, a person’s birth date, etc.), and this information comes from a variety of “encyclopedic” sources such as the CIA Factbook, Wikipedia and Freebase. Adding a temporal element to the Knowledge Graph is crucial to its future success; otherwise new information will take time to propagate through the graph. The intended outcome of the Knowledge Graph spans finding the right thing, getting a better summary, and broadening the results of a query to anticipate the next search interest. Read more about this technology here (Amit Singhal, Google, 2012).
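One way to think about a temporal element is to attach a validity date to every fact and to answer queries "as of" a given day. The sketch below is only our assumption of how that could look; the entity "Example Tower", its values and the helper `value_as_of` are all made up.

```python
from datetime import date

# Each fact: (entity, attribute, value, valid_from). All values are invented.
facts = [
    ("Example Tower", "height_m", 300, date(2010, 1, 1)),
    ("Example Tower", "height_m", 320, date(2013, 6, 1)),
]

def value_as_of(entity, attribute, when, store=facts):
    """Return the most recent value that was already valid on the date `when`."""
    matching = [
        fact for fact in store
        if fact[0] == entity and fact[1] == attribute and fact[3] <= when
    ]
    if not matching:
        return None
    return max(matching, key=lambda fact: fact[3])[2]

old_height = value_as_of("Example Tower", "height_m", date(2012, 1, 1))
new_height = value_as_of("Example Tower", "height_m", date(2014, 1, 1))
```

A static graph would hold only one of the two height values. With dated facts, a new value can be added as soon as it is known without discarding the history.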
Alternate Approaches to address shortfalls of a knowledge graph – Wolfram Alpha
The idea behind Wolfram Alpha is to use a human-curated knowledge base to build the knowledge graph, instead of relying on the web as a data source. This cuts down on uncertainty and improves the accuracy of what is contained in the knowledge graph. According to Wolfram Research, all the knowledge is carefully vetted by a domain expert and a scientist. We feel that this is the key differentiator between Wolfram Alpha and Google. The Wolfram Language is based on symbolic natural language processing, where each entity is treated as a symbol. You can read more about Wolfram Alpha here (Stephen Wolfram, 2009), go through a list of 32 amazing things you can do with Wolfram Alpha here (Walter Hickey, 2013), and explore the Wolfram Language here (Wolfram|Alpha.com).
The only limitation of Wolfram Alpha seems to be that the data must already be present in its database. If the data is not available before the query is made, it cannot give an answer. But given the breadth it already covers, it is a challenge in itself to find information that the technology does not cover.
Machine Learning approach to obtain semantic knowledge – IBM Watson
There is a very thin line between IR and ML. In fact, the two go hand in hand and enhance each other. One example of IR and ML in action is IBM’s Watson. This supercomputer was developed by IBM’s DeepQA department, which used IMARS, the IBM Multimedia Analysis and Retrieval System (IBM.com), a system built to classify, index and search large data sets. These data sets include digital images and videos. IMARS uses algorithms to analyze features and organizes search around those features. IR is more relevant to the IMARS extraction tool, which produces an index based on mathematical analyses of the collections of images and videos, while ML is used more in the IMARS search tool, which classifies and clusters images according to semantic categories. The only challenge Watson faces is the variability of an answer, if one exists (John Nosta, 2013). With Watson’s victory on Jeopardy against previous human winners, in which it ran offline with 4 TB of data on hard drive, and its path-breaking medical application, this is certainly a space to watch.
Conclusion
IR needs data, which is being generated at breakneck speed, and it will evolve technologically, one way or another, to find knowledge in the swamp of data. It will be interesting, though, to analyze which platform the data is accessed from.
Literature
(Andreas Blumauer, Dec 2012) – http://blog.semantic-web.at/2012/12/14/do-you-like-googles-knowledge-graph/
(Robert Hof, 2013) – http://www.forbes.com/sites/roberthof/2013/09/26/google-just-revamped-search-to-handle-your-long-questions/
(Google, 2010) – http://www.youtube.com/watch?v=mTBShTwCnD4
(Wasielewska, 2014) – http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=35073&copyownerid=58972
(Google, 2012) – http://www.youtube.com/watch?v=mmQl6VGvX-c
(Amit Singhal, Google, 2012) – http://googleblog.blogspot.co.uk/2012/05/introducing-knowledge-graph-things-not.html
(Stephen Wolfram, 2009) – http://blog.wolfram.com/2009/03/05/wolframalpha-is-coming/
(Walter Hickey, 2013) – http://www.businessinsider.com/awesome-things-you-can-do-with-wolfram-alpha-2013-7?op=1
(IBM.com) – https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=7dc62548-8bc8-42c4-b2e9-150dde7c649a
(John Nosta, 2013) – http://www.forbes.com/sites/johnnosta/2013/03/03/is-watson-in-jeopardy-this-oncologist-thinks-so/