We live in the data age! Web has been growing rapidly in size as well as scale during the last 10 years and shows no signs of slowing down. Statistics show that every passing year more data gets generated than all the previous years combined. Moore's law not only holds true for hardware but for data being generated too. Without wasting time for coining a new phrase for such vast amounts of data, the computing industry decided to just call it, plain and simple, Big Data.
More than structured information stored neatly in rows and columns, Big Data actually comes in complex, unstructured formats, everything from web sites, social media and email, to videos, presentations, etc. This is a critical distinction, because, in order to extract valuable business intelligence from Big Data, any organization will need to rely on technologies that enable a scalable, accurate, and powerful analysis of these formats.
The next logical question arises – How do we efficiently process such large data sets? One of the pioneers in this field was Google, which designed scalable frameworks like MapReduce and Google File System. Inspired by these designs, an Apache open source initiative was started under the name Hadoop. Apache Hadoop is a framework that allows for the distributed processing of such large data sets across clusters of machines.
Apache Hadoop, at its core, consists of 2 sub-projects – Hadoop MapReduce and Hadoop Distributed File System. Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. HDFS is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. Other Hadoop-related projects at Apache include Chukwa, Hive, HBase, Mahout, Sqoop and ZooKeeper.
In this series of Big Data and Hadoop, we will introduce all the key components of the ecosystem. Stay tuned!
About the authors:
Harish Ganesan is the Chief Technology Officer (CTO) and Co-Founder of 8KMiles and is responsible for the overall technology direction of its products and services. Harish Ganesan holds a management degree from Indian Institute of Management, Bangalore and Master of Computer Applications from Bharathidasan University , India.
Vijay is the Big Data Lead at 8KMiles and has 5+ years of experience in architecting Large Scale Distributed Web Systems and engineering Information Systems - Retrieval, Extraction & Management. He holds M. Tech in Information Retrieval from IIIT-B.