In part 1 of this series, we introduced the Hadoop Ecosystem. We will cover HDFS and MapReduce in this part.
When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. HDFS is designed for storing very large files with write-once-ready-many-times patterns, running on clusters of commodity hardware. HDFS is not a good fit for low-latency data access, when there are lots of small files and for modifications at arbitrary offsets in the file.
Files in HDFS are broken into block-sized chunks, default size being 64MB, which are stored as independent units.
An HDFS cluster has two types of node operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (workers). The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. The namenode also knows the datanodes on which all the blocks for a given file are located. Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster. The framework is inspired by the map and reduce functions commonly used in functional programming.
In the "Map" step, the master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. The worker node processes the smaller problem, and passes the answer back to its master node. In the "Reduce" step, the master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
There are two types of nodes that control this job execution process: a JobTracker and a number of TaskTrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on top of HDFS and MapReduce framework and inherits Hadoop’s scalability and robustness. It has four primary components:
Flume from Cloudera is similar to Chukwa both in architecture and features. Architecturally, Chukwa is a batch system. In contrast, Flume is designed more as a continuous stream processing system.
About the authors:
Harish Ganesan is the Chief Technology Officer (CTO) and Co-Founder of 8KMiles and is responsible for the overall technology direction of its products and services. Harish Ganesan holds a management degree from Indian Institute of Management, Bangalore and Master of Computer Applications from Bharathidasan University , India.
Vijay is the Big Data Lead at 8KMiles and has 5+ years of experience in architecting Large Scale Distributed Web Systems and engineering Information Systems - Retrieval, Extraction & Management. He holds M. Tech in Information Retrieval from IIIT-B.