One of the most frequent questions I get asked by customers and business associates is: how can I apply Big Data? Folks ask this question not because they do not know where it can be applied, but because they are looking for a way out of the analysis-paralysis cycle: a shoulder to lean on, a willing ear, and help charting a meaningful course through the plethora of domains where Big Data can potentially be applied successfully. Having worked with customers on Big Data initiatives for over two years now, I think I have a thing or two to say on this matter. It would be a gross understatement to say that the Big Data world in general, and the Hadoop world in particular, are going through a lot of churn, and I mean that in a good way. With rapid adoption and hockey-stick growth rates, the scramble to move onto affordable Big Data platforms like Hadoop is being felt across all verticals.
I can say with some authority that this is the case at least in the Retail, High-tech, and Banking and Financial Services verticals, and it should not be any different in the others.
Enough pontification; let us get to the heart of the matter. Why are companies gravitating towards these platforms? While you may get many different answers (all of them valid) if you posed this question to an equal number of people, I would like to answer it in two parts.
There are two drivers here; one is primary, and the second is heavily influenced by and follows from the first. The main driver for any organization, especially those that keep a hawk eye on the bottom line, is a business-oriented one: cost. This should not come as a surprise to many people. While the cost of parking terabytes of information in a platform like Teradata may be measured in millions of dollars, it drops dramatically, into the thousands or hundreds of thousands, with Hadoop and its ecosystem.
The second driver follows closely from the first: now that I know Hadoop is a very cost-effective platform, what can I do with it? The low-hanging fruit here has proven to be the re-platforming of existing applications onto Hadoop.
The typical questions that are asked are -
- Can I move some of my non-mission critical data into Hadoop?
- How much ROI and cost savings can I realize from this?
- Which non-SLA-bound applications can I move to Hadoop?
- What is the re-platforming effort in terms of time, staff and resources?
- Can I get the same user experience without having to rewire my existing front-end applications?
- I would like to run some advanced analytics on this data that I wasn't even able to execute on my old hardware platform.
As you can see from these questions, the push is increasingly towards re-platforming existing data-warehouse-oriented applications. Their characteristics are what I would call ELT (Extract, Load and Transform) oriented.
Here are some characteristics -
- Move data from your traditional platform to Hadoop; this constitutes the Extract and Load phases
- Once the data is parked in Hadoop, various consumption patterns are analyzed; these can be batch-oriented or even real-time
- Batch-oriented, high-latency consumption patterns are typically addressed using custom map/reduce programs and Hive
- Real-time needs are addressed by building a services layer on top of a column-oriented datastore like HBase
- Real-time analytical needs are addressed in different ways, such as using tools like Mahout, R and Storm, or even writing the results to a traditional relational database
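To make the batch-oriented pattern above concrete, here is a minimal sketch of a Hadoop Streaming-style map/reduce job in Python. The record layout (a tab-delimited sales log with store ID, SKU and amount) is hypothetical; the point is the shape of the mapper/reducer pair, not a specific application.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit (store_id, amount) pairs from tab-delimited records.
    Assumed (hypothetical) layout: store_id <TAB> sku <TAB> amount."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed records
        store_id, _sku, amount = fields
        yield store_id, float(amount)

def reducer(pairs):
    """Sum amounts per store. In a real job Hadoop sorts and groups
    by key between the map and reduce phases; sorted() stands in here."""
    for store_id, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield store_id, sum(amount for _, amount in group)

if __name__ == "__main__":
    # With Hadoop Streaming, the mapper and reducer would be separate
    # scripts wired together on the command line, roughly:
    #   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
    for store, total in reducer(mapper(sys.stdin)):
        print(f"{store}\t{total}")
```

The same aggregation would typically be a one-line GROUP BY in Hive once the data is exposed as a table; custom map/reduce earns its keep when the logic outgrows what SQL can express.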
Consumers can be classified into two groups: advanced users, who typically work with tools like Mahout, R, Storm and even Hive; and end users, who rely on dashboard-oriented tools like Business Objects and Cognos. The latter group would end up consuming data through Hive, probably Pig, and/or aggregated information from a relational database.
While we have spoken enough about consumption patterns, we haven't touched upon another very important use case: the Transform part of the ELT process. For MDM-oriented activities such as validation, cleansing, enrichment and transformation, the power of map/reduce can be brought to bear, and tools like Pig are increasingly popular for this.
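As a rough illustration of that Transform stage, the Python sketch below chains the validation, cleansing and enrichment steps mentioned above. The field names and the country lookup table are hypothetical; each step maps naturally onto a Pig operator (FILTER, FOREACH ... GENERATE, JOIN), which is why Pig is such a good fit for this work.

```python
# Hypothetical reference table used for enrichment (a JOIN in Pig).
COUNTRY_NAMES = {"US": "United States", "IN": "India"}

def validate(record):
    """Reject records missing required fields (Pig: FILTER)."""
    return bool(record.get("customer_id")) and bool(record.get("email"))

def cleanse(record):
    """Normalize whitespace and casing (Pig: FOREACH ... GENERATE)."""
    out = dict(record)
    out["email"] = record["email"].strip().lower()
    out["country_code"] = record.get("country_code", "").strip().upper()
    return out

def enrich(record):
    """Attach the full country name from the reference table (Pig: JOIN)."""
    out = dict(record)
    out["country_name"] = COUNTRY_NAMES.get(out["country_code"], "Unknown")
    return out

def transform(records):
    """Validate, then cleanse, then enrich; drop anything that fails validation."""
    return [enrich(cleanse(r)) for r in records if validate(r)]
```

At Hadoop scale the same per-record logic would run inside mappers, with the enrichment lookup distributed as a side file or expressed as a replicated join in Pig.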
About the Author:
Krish Khambadkone is a Sr. Big Data/Hadoop Architect with TCS America International. He has helped several large clients in the Banking, Retail and High-tech space in the adoption and implementation of Hadoop and is actively involved in promoting, evangelizing and helping clients adopt this technology. He has over 20 years of experience in the Data Management, Integration and Product Development disciplines.