What is Common between Mumbai Dabbawalas and Apache Hadoop?
Wednesday, July 18, 2012
Let me start by setting the context: who is a Mumbai Dabbawala? "Dabba" literally means a box, and a Dabbawala is a person who carries the box. Every day, thousands of Mumbaikars (the local term for residents of Mumbai, the financial capital of India) rely on the Dabbawalas to deliver lunch boxes carrying homemade food to their workplaces. Given the rising cost of living in India and the reluctance to eat junk food for every meal, many households depend on the network of Dabbawalas. Mumbai is one of the largest cities in the world, and an average working professional has to leave home quite early in the day to take the local train to work. Food packed at home that early would lose its freshness by noon. The Mumbai Dabbawalas pick up the lunch box much later and manage to deliver it just in time for lunch, preserving the warmth of home-cooked food. The whole process repeats in the evening, when they collect the empty boxes and drop them back at the respective households. So, what is special about it, and how is this related to Apache Hadoop? Keep reading! Here is how the Dabbawala workflow operates:
- The first Dabbawala collects the lunchbox from the household and marks it with a unique code
- The Dabbawalas meet at a designated place, where the boxes are sorted and grouped into carriages
- The second Dabbawala marks each carriage uniquely to represent its destination and puts it on a local train. The markings include the local railway station where the boxes are to be unloaded and the building address where each box has to be finally delivered.
- The third one travels along with the dabbas in the local train to hand over the carriages at each station.
- The fourth Dabbawala picks up the dabbas from the train, decodes the final destination of each box and delivers it.
The process is just reversed in the evening to return the empty lunchboxes.
If you are familiar with MapReduce, this should already ring a bell. Almost a century before Google published the GFS and MapReduce papers, the Mumbai Dabbawalas had mastered their own efficient form of distributed processing!

For the uninitiated, Apache Hadoop is a framework for processing large amounts of data in a highly parallelized, distributed environment. It solves the problem of processing petabytes of data by slicing the dataset into chunks that can be processed independently by inexpensive machines in a cluster. Apache Hadoop has two components: 1) a file system called HDFS, designed to store the distributed data in a highly reliable way, and 2) the MapReduce engine, which processes each slice of the data by applying the developer's logic to it.

For example, the Indian Meteorological Department would have recorded the temperatures of each city on a daily basis for the last 100 years. Undoubtedly, this dataset would run into a few terabytes! Imagine the computing power required to query this dataset for the city with the highest temperature in the last 100 years. This is exactly where Hadoop can play a role. Once the terabyte-sized dataset is submitted to HDFS, it is sliced into equal chunks and each chunk is distributed to a machine running within the cluster. The developer then writes the code in two parts: 1) the code that finds the maximum temperature in each slice of the dataset on each machine (the Mapper), and 2) the code that collects and aggregates the output of the previous step to find the city with the maximum temperature (the Reducer). MapReduce is precisely the model that helps developers perform these two steps efficiently. If the developer writes the MapReduce code to find the hottest city against a tiny dataset with just a few records, that same code will seamlessly work against petabytes of data! Effectively, Apache Hadoop makes it easy to process large datasets by letting developers focus on the core logic rather than worrying about the complexity and size of the data. In between the Map and Reduce phases, there are sub-processes that shuffle and sort the data to make it easy for the reducers to aggregate the results. The sketch below illustrates the two phases with the temperature example.
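To make this concrete, here is a minimal, self-contained Python sketch of the two parts a developer writes for the temperature scenario. The record format (city, date, temperature) and the tiny sample dataset are assumptions made purely for illustration; on a real cluster the same mapper and reducer logic would be packaged as Hadoop map and reduce tasks rather than run in a single script.

```python
# A toy, in-memory version of the two parts a developer writes for this job.
# The sample records and their format are assumptions for illustration only.

def mapper(record):
    """Map phase: for each raw record, emit a (key, value) pair of (city, temperature)."""
    city, _date, temperature = record
    return (city, temperature)

def reducer(city, temperatures):
    """Reduce phase: aggregate all temperatures seen for one city into its maximum."""
    return (city, max(temperatures))

# A tiny dataset standing in for terabytes of weather history.
records = [
    ("Mumbai", "1987-05-21", 34.2),
    ("Delhi", "1987-05-21", 41.8),
    ("Mumbai", "2003-06-02", 36.9),
    ("Delhi", "2003-06-02", 44.1),
    ("Chennai", "2003-06-02", 39.5),
]

# Map: apply the mapper to every record (Hadoop would do this in parallel across the cluster).
mapped = [mapper(r) for r in records]

# Shuffle and sort: group the mapped values by key so each reducer sees one city's data.
grouped = {}
for city, temperature in mapped:
    grouped.setdefault(city, []).append(temperature)

# Reduce: one call per city, then pick the overall winner.
per_city_max = [reducer(city, temps) for city, temps in grouped.items()]
hottest = max(per_city_max, key=lambda pair: pair[1])
print(hottest)  # ('Delhi', 44.1)
```

The point of the framework is that the mapper and reducer above stay exactly the same whether they run over these five records in one process or over a hundred years of data spread across a cluster.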
Now that we have explored both the models, let's compare and contrast the Mumbai Dabbawala methodology with Apache Hadoop.
- Just as HDFS slices the data and distributes the chunks to individual nodes, each household hands over its lunchbox to a Dabbawala.
- All the lunchboxes are collected at a common place, where they are tagged and put into carriages with unique codes. This is the job of the Mapper!
- Based on the code, carriages headed for a common destination are sorted and loaded onto the respective trains. This is the Shuffle and Sort phase of MapReduce (see the short sketch after this list).
- At each railway station, a Dabbawala picks up the carriage and delivers each box in it to the respective customer. This is the Reduce phase.
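As a rough analogy in code, here is a hypothetical sketch of that middle step as the Dabbawalas perform it: boxes tagged by the "mappers" are grouped by destination station so that the Dabbawala acting as the "reducer" at each station receives only the boxes he has to deliver. The box codes and station names below are invented for illustration.

```python
# Hypothetical sketch: grouping tagged lunchboxes by destination station,
# the Dabbawala equivalent of MapReduce's shuffle-and-sort phase.
# Codes and station names are made up for illustration.

tagged_boxes = [
    {"code": "VLP-3-A12", "station": "Vile Parle", "building": "A12"},
    {"code": "CHG-7-B04", "station": "Churchgate", "building": "B04"},
    {"code": "VLP-3-C09", "station": "Vile Parle", "building": "C09"},
    {"code": "DDR-1-D22", "station": "Dadar",      "building": "D22"},
]

# Shuffle and sort: every box with the same key (station) ends up in the same carriage,
# just as every mapped record with the same key reaches the same reducer.
carriages = {}
for box in tagged_boxes:
    carriages.setdefault(box["station"], []).append(box)

# "Reduce": the Dabbawala at each station only deals with his own carriage.
for station, boxes in sorted(carriages.items()):
    print(station, [b["code"] for b in boxes])
```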
Just as each node in the cluster does its job without any knowledge of the other processes, each Dabbawala participates in the workflow by focusing only on his own task. This is evidence of how a parallelized environment can scale better.
It is fascinating to see how the century-old Dabbawala system adopted an algorithm that is now powering the Big Data revolution!
PS – I have intentionally simplified the description of Apache Hadoop and used the Indian Meteorological Department scenario only for illustration purposes. The solution to this particular problem can be achieved with just a Mapper.
- Janakiram MSV, Chief Editor, CloudStory.in