Conversations about Hadoop, and where is my Data Whisperer?
Tuesday, October 09, 2012
Let us start off with some history.
Two years ago,
Me: This is the right time to jump on the Hadoop bandwagon and create some offerings.
Stakeholders: What is this Hadoop?
Me: The usual 15-minute spiel on the evolution of Hadoop.
Stakeholders: Nah. Too cutting edge. Only for the Yahoos and Googles of the world. Cannot be used by traditional IT shops and businesses.
Case closed, move on.
A year ago,
Client: We have this monstrous Teradata warehouse where we are parking all the data, including a lot of junk. Do you think Hadoop can take on some of this load?
Me: On the surface, yes, but we will need to work out a few realistic use cases.
Client: The licensing costs are also beginning to bite. We want to address that as well and have those costs taper off over the next few years.
Me: Sure, there is always some low-hanging fruit that will help us showcase something very quickly, so let's get started.
Today
Clients: We have 10 terabytes of online transactions that need some analysis and cleansing. Which option would be better: HBase, Hive, or a combination of the two? Then follow-on questions about cluster sizing, hardware configuration and so on.
Me: Now we are talking.
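A rough way to frame that HBase-versus-Hive question: Hive is the batch SQL layer for the heavy cleansing and analysis passes over the full 10 terabytes, while HBase is the low-latency store for serving individual transactions back out online. Here is a minimal sketch of the serving side using the standard HBase Java client; the table name, column family and row-key scheme are hypothetical, chosen only to illustrate a point lookup:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TransactionLookup {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster details (ZooKeeper quorum etc.) from hbase-site.xml on the classpath
        Configuration conf = HBaseConfiguration.create();

        // Hypothetical "transactions" table with a "d" column family, used only for illustration
        HTable table = new HTable(conf, "transactions");
        try {
            // Row key customerId#timestamp keeps one customer's activity clustered together
            Get get = new Get(Bytes.toBytes("cust0042#20121009T1130"));
            Result result = table.get(get);

            byte[] amount = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("amount"));
            System.out.println("amount = " + Bytes.toString(amount));
        } finally {
            table.close();
        }
    }
}
```

Millisecond lookups like this are what HBase is for; scanning and aggregating the same 10 terabytes in bulk is squarely Hive territory.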
Six months from now
Client: I think we have achieved all the goals for Phase I, which was to move 1.2 petabytes into the cluster. We are able to do some of our base analytics using Hive and serve online customers with HBase, but we are still not there yet and need to plan for Phase II.
Me: Of course. We have not given any thought to advanced analytics yet. We need to bring SAS and R capabilities in to mine this ocean of information, develop analytic use cases and start implementing them.
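For the "base analytics using Hive" piece, many shops simply push aggregations down to the cluster over JDBC from their existing reporting code. A minimal sketch, assuming a HiveServer2 endpoint is available; the host, credentials and table below are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DailySalesRollup {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver for HiveServer2; the endpoint and table names are placeholders
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-gateway:10000/default", "etl_user", "");
        try {
            Statement stmt = conn.createStatement();
            // A typical "base analytics" rollup: the heavy lifting runs on the cluster
            ResultSet rs = stmt.executeQuery(
                    "SELECT txn_date, SUM(amount) AS total_amount "
                  + "FROM online_transactions GROUP BY txn_date");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        } finally {
            conn.close();
        }
    }
}
```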
If you catch my drift, things have really come a long way. This goes to show that Hadoop is not only here to stay but is steadily becoming the bread and butter of ELT and analytics platforms.
Newer and better Hadoop tools are coming to market every day. The Hadoop framework itself has undergone a sea change with MapReduce v2. Enterprises can't seem to get enough of Hadoop and want to build larger and faster clusters. The time is not far off when CIOs will start the "my cluster is bigger than yours" race.
Hadoop is also being put to some very interesting use cases. I have seen shops that get real work done with just Flume for aggregating and processing data. So what is on the horizon for the Hadoop ecosystem? The following trends are emerging:
- A large number of mid-to-late adopters are moving from proof-of-concept to operational and even agile states. A lot more applications are moving to production. In other words, Hadoop is officially shipping.
- Home-grown applications are being built on top of the base framework.
- Venture capital is funding Hadoop-based startups like Hortonworks. Customers can now choose which horse to back when it comes to purchasing support for their Hadoop installations.
So what does the future have in store?
What we have seen so far is that corporations across the board have just started to get their feet wet in Big Data. They have set up clusters and implemented use cases mainly in the areas of ETL, processing and some basic analytics.
The future lies mainly in these key areas:
- They will start asking, or have already started asking, the question: where is my "Data Whisperer"?
- When you have data in the range of hundreds of terabytes to a few petabytes, aggregated from different sources, there is quite a lot of valuable information it can give you. The question is: how do you extract those nuggets?
- What you need is an analytics ninja, a.k.a. data whisperer, a.k.a. data scientist.
- A good data scientist will be able to come up with the right use cases. But it is not enough just to come up with the use cases; you also need a whole lot of other ducks lined up in a row: the right amount and quality of data, the right tools such as SAS, R and Mahout, and skilled people who can apply these use-case patterns to the data using those tools.
The trend we are really seeing emerge is the evolution of Big Data 2.0, where corporations become agile and nimble by using these hidden nuggets to get a leg up on their competition. Six months from now, the question will really be: how many data scientists do I have, and how can I add more to my team?
About the Author:
Krish Khambadkone is a Sr. Big Data/Hadoop Architect with TCS America International. He has helped several large clients in the Banking, Retail and High-tech space in the adoption and implementation of Hadoop and is actively involved in promoting, evangelizing and helping clients adopt this technology. He has over 20 years of experience in the Data Management, Integration and Product Development disciplines.