MapReduce

MapReduce is a programming model that allows applications to process large amounts of data in a parallel and efficient manner. It distributes the workload across several different computers, known as nodes.

Need for MapReduce

As the use of computers grew, the data being collected also grew from kilobytes to megabytes to gigabytes. Nowadays, many applications have data that runs into petabytes or even further. With the emergence of IoT (Internet of Things), we are talking about several billion sensors around us, each producing tremendous amounts of data.

The collected data is useful only if we can process it and extract meaningful insights to improve the application or business. With such huge amounts of data, processing on a single machine became inefficient. This forced researchers to look for a mechanism to automatically distribute the load across several computers and gather the final outcome on a single node. Google researchers developed this technology, which we know as MapReduce.

But What Is MapReduce Actually?

In simple terms, the input data is mapped in a certain fashion so that it can be distributed to different nodes using a key. This mapped data is then processed in such a way that the output is a meaningful, condensed set of data – it is said to be reduced to the required output. Hence the name of the methodology – MapReduce.

Steps:

The process is typically described in five steps:

  • Prepare the input – The system designates the processors that will act as “Mappers” and assigns an input key (K1) to each. All the data related to a Mapper's assigned key K1 is then handed over to that “Mapper”.
  • Mapping – In this step, each “Mapper” executes the user-provided Map() function. Being a user-provided function, it knows what kind of data is being processed. It transforms the input data into an intermediate output: a list of values organized by another key (K2).
  • Shuffling – In this step, the system brings all the data related to one K2 key to a single processor, referred to as a “Reducer”. As a result, the intermediate data gets transferred to the various “Reducers” based on its K2 keys.
  • Reducing – In this step, each “Reducer” processes the values for all the K2 keys assigned to it by running the user-provided Reduce() function. This function, again, knows what kind of data is being processed and how it should be condensed to create the final output. Each “Reducer” produces its own output, one result per K2 key.
  • Final output – In this final step, the system collects the reduced data from every “Reducer” and sorts it on the basis of the key K2. This is the final output.
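The steps above can be sketched in plain Python. This is a single-process simulation for illustration only (the function and variable names are ours, not any framework's API); a real MapReduce framework runs the Mappers and Reducers on separate nodes:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Single-process simulation of the MapReduce steps described above."""
    # Mapping: each input record yields zero or more (K2, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Shuffling: group all values belonging to the same K2 key together.
    groups = defaultdict(list)
    for k2, value in intermediate:
        groups[k2].append(value)

    # Reducing, then final output: condense each key's values and
    # collect the results sorted by K2.
    return sorted((k2, reduce_fn(k2, values)) for k2, values in groups.items())

# Hypothetical example: maximum temperature recorded per city.
readings = [("Pune", 31), ("Delhi", 40), ("Pune", 35), ("Delhi", 38)]
result = map_reduce(readings,
                    map_fn=lambda rec: [rec],  # each (city, temp) is already a key/value pair
                    reduce_fn=lambda city, temps: max(temps))
print(result)  # [('Delhi', 40), ('Pune', 35)]
```

The user supplies only `map_fn` and `reduce_fn`; the "system" part (shuffling and collecting) is the same for every job, which is exactly what makes the model reusable.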

The typical example for MapReduce is word count:

[Figure] MapReduce – Word Count Example (Source: https://cs.calvin.edu/courses/cs/374/exercises/12/lab/)
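A word-count job boils down to two small user-provided functions, sketched here in Python (illustrative names; the shuffle is simulated in-process):

```python
from collections import defaultdict

def word_count_map(line):
    # Map(): emit an intermediate (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    # Reduce(): condense all the counts for one word into a single total.
    return sum(counts)

# Simulate the shuffle step by grouping intermediate pairs by word (K2).
lines = ["deer bear river", "car car river", "deer car bear"]
groups = defaultdict(list)
for line in lines:
    for word, one in word_count_map(line):
        groups[word].append(one)

totals = {word: word_count_reduce(word, counts) for word, counts in groups.items()}
print(sorted(totals.items()))
# [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]
```

Note that the mapper emits a 1 for every occurrence rather than counting within a line; this keeps the mapper stateless, so any number of mappers can process any subset of the lines in parallel.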

The best-known open-source implementation of MapReduce is Apache Hadoop. There are several other implementations of this methodology, including Amazon Elastic MapReduce (EMR), CouchDB, and Azure HDInsight.

Related Words:

Apache Hadoop, HDFS, Amazon Elastic MapReduce (EMR), CouchDB, Azure HDInsight