And if you do happen to belong to the latter insignificant minority, worry not - take solace in the fact that the significant majority of the world may not understand what you are trying to do, but can definitely sympathise with you :) Welcome to the intriguing world of IR! And yes, in this world, you only have words, probably a bag of them (pun intended), to make sense of. As promised in my previous article, I will try to introduce the field of IR here with a simple, hands-on example for the uninitiated, and explore how powerful IR can be as we peek into some future directions.
Let's first define IR - simply put, it is the act of retrieving relevant information for a certain need (a query) from unstructured data, typically text. At first glance, IR may seem a nascent area, but its origins actually date back to the sixties and seventies. One of the early prominent works in this field came from Professor Gerard Salton, who came up with techniques for document indexing, retrieval and ranking (e.g. TF-IDF and the vector space model, VSM) that were considered very novel at that time. Little did he know during his lifetime that his work would one day become the core of a system that would transform human lives forever. Google would go on to apply some of these IR algorithms at a scale unimaginable before, harnessing the power of distributed computing to build a web search engine that would not only make Google synonymous with IR but also give IR its giant leap.
Let's get a little more hands-on. If I were to ask you what my article is about, you would probably say: “IR, words, Gerard Salton”. How could you make a machine understand this article and come up with such an answer? Okay, without even understanding it, can I make it come up with a set of words that best describe the article? Here’s a simple technique: look at the most frequently occurring words in the article. I think that's a good idea, except that stop words such as “is, was, that, for” would also show up at the top. However, these are words that appear in almost all articles. So divide the number of times that a word occurs in your document by the number of documents that it generally appears in. A word that is frequent in your article but rare elsewhere then scores high, while a stop word gets pushed down. This simple yet powerful technique, in its most basic form, is called TF-IDF (Term Frequency-Inverse Document Frequency). So here is what I did. I used Solr to index around 50 articles on yourstory.in and find the best keyword vector for my article using the TF-IDF approach. Here is what I got!
Words, IR, Applications, TF, IDF, objective subjective, Salton
I think you guys would agree with me that it's quite on the mark (after you finish reading the article)!
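For the curious, here is a minimal Python sketch of the divide-by-document-count idea described above. To be clear, this is not what I ran and not how Solr scores things internally (its scoring is more elaborate, and textbook TF-IDF usually multiplies the term frequency by a logarithm of the inverse document frequency rather than dividing directly). The function names and the three tiny "documents" are made up purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(target, corpus, k=2):
    """Rank the words of `target` by (count in target) / (documents containing the word).

    `corpus` is a list of document strings and is assumed to include `target`.
    """
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))

    # Term frequency: raw count of each word in the target document.
    tf = Counter(tokenize(target))

    # Simplified score from the article: occurrences divided by document count.
    # max(..., 1) guards against a target that is missing from the corpus.
    scores = {word: count / max(df[word], 1) for word, count in tf.items()}

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical toy corpus; the first document plays the role of "my article".
corpus = [
    "the retrieval system is the work of salton and the retrieval model is the idea of salton",
    "the weather is nice and the sky is blue",
    "the rules of cricket and the joy of the game",
]

print(top_keywords(corpus[0], corpus, k=2))  # ['retrieval', 'salton']
```

Notice that "the" is the most frequent word in the first document (4 occurrences), but because it shows up in all three documents its score drops to 4/3, below "retrieval" and "salton", which each occur twice and appear in only one document.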
While IR’s poster child has been web search, it has many other powerful applications, ranging from spam filtering in mail and clustering/aggregation of documents (as in Google News) to question answering. As I wrote in my earlier article, IR adds that fuzzy human intelligence into the picture, making the applications look very human and magical! Okay, so where do I see IR going in the future? I see two things happening.
Firstly, I think IR will be part and parcel of every application, or at least any application that has human interaction. As I said, applications need the IR touch both to address the information need of the user and for the human “feel” it ends up providing.
The second direction I see is the following. Most current IR applications deal with user queries - the question is objective and the response from the IR system is more subjective. For example, a search query retrieves a bunch of documents that contain the answer, and the user still has to dig it out. Going forward, I think IR applications will deal with either subjective input or even no input at all, and start giving out objective responses. What do I mean by that? The user would not explicitly give a query; instead, the IR application would infer the intent from the user's actions and provide objective responses to the user's liking. Here is an example:
On the opening day of FC Barcelona’s campaign in the fresh La Liga season, I was pleasantly surprised to see a Google notification on my Android phone that marked the first Barca match almost like a calendar event. Being a big Messi/Barca fan, I had been querying Google every morning for scores of the Barca matches that happened at midnight India time during the previous La Liga season. Google understood that I was following Barca and that I was missing these matches and decided to deliver a notification ahead of time. To me, this is the future of IR!
I will stop here for now, but hopefully I have shown you guys the tip of this IR iceberg, which I hope will inspire you all to learn more about this field. Go, dig deeper into IR and work your magic with words. For it's only words, and words are all we have, to take those hearts away and blow some minds away too!