
MapReduce Programming Model

In this article I will give you a gentle introduction to the programming model used in Hadoop for parallel processing.

Taylan Kabbani
6 min read · Feb 26, 2020


Before we dive into MapReduce, let's talk a bit about parallel processing, which is the main purpose of using MapReduce, and how this programming model eases the task of parallel processing.

Note: Throughout the article I may refer to a machine (computer) as a processor, node, or unit; they all refer to the same thing.

Parallel Processing

Dealing with massive data sets requires efficient processing methods that reduce run-time and cost, in addition to overcoming memory constraints. Parallel processing is one of the methods most used by data scientists for compute- and data-intensive tasks: a task is broken up into multiple parts with a software tool, each part is distributed to a processor, and each processor performs its assigned part. Finally, the parts are reassembled to deliver the final solution or execute the task.
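
To make the idea concrete, here is a minimal, single-machine sketch in Python (the worker count, chunk size, and the toy task of summing squares are my own choices for illustration): the task is split into chunks, each chunk is handed to a separate worker process, and the partial results are reassembled at the end.

    # Minimal parallel-processing sketch: split a task into chunks, process each
    # chunk in a separate worker process, then reassemble the partial results.
    from multiprocessing import Pool

    def process_chunk(chunk):
        # Each worker performs its assigned part of the task.
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        n_workers = 4
        chunk_size = len(data) // n_workers
        chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

        with Pool(n_workers) as pool:
            partial_results = pool.map(process_chunk, chunks)  # distribute the parts

        print(sum(partial_results))  # reassemble the parts into the final result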

Please note that parallel processing should not be confused with multiprocessing, in which multiple processors or cores work on different tasks, instead of on parts of the same task as in parallel processing.

When we perform parallel processing, there are several points we should take care of to ensure the success of the process:

1. Load balancing issue: We need to ensure each machine gets an equal share when dividing the data into chunks, so that none of the machines is overloaded or underutilized.

2. Critical path problem: Managing the time required by each machine to finish its job; if any machine fails to deliver its part on time, the whole task gets delayed.

3. Reliability: There should be a mechanism to handle the failure of one of the machines, because if one machine fails to provide its output, the whole task will fail.

4. Aggregation of the results: An aggregation approach should be designed to generate the final result after getting the inputs from the different machines.

So, performing parallel processing is not an easy task. That's where the MapReduce framework comes in handy: it allows us to write the code logic to run in parallel without worrying about the previously mentioned issues.

MapReduce Programming Model

Let's start by explaining some terms:

MapReduce: A programming model that allows us to perform parallel processing across Big Data using a large number of nodes (multiple computers).

Cluster Computing: nodes are homogeneous and located on the same local network.

Grid Computing: nodes are heterogeneous (different hardware) and located geographically far from each other.

Another big advantage offered by MapReduce is what we call Data Locality. Simply put, data locality is bringing the processing unit to the data (i.e. performing the computation on the site where the data is stored) instead of sending the data to the unit, which significantly reduces the network traffic and consequently the running time.

MapReduce Programming Steps:

  • Input Split: In this step the raw input data is divided into chunks called input splits. Each chunk will be the input of a single map task; typically an input split has a size between 16 MB and 64 MB (controllable by the user). The input data of this step is of the form (key1, value1).
  • Mapping: A node that is assigned a map function takes the input and emits a set of (key2, value2) pairs. One of the nodes in the cluster is special, the Master Node: it assigns the work to the worker nodes (slave nodes) and makes sure that the job is done by these slaves; it is also responsible for saving the location and size of each intermediate file produced by each map task (more about this later).
  • Shuffling: In this step the output of the mapping function is grouped by key and redistributed in such a way that all data with the same key are located on the same node. The output of this step will be (k2, list(v2)).
  • Reducing: Nodes now process each group of output data by aggregating the values of the shuffle phase output. The final output will be of the form list(k3, v3).
Example of a MapReduce algorithm performing a word-frequency task, where the input is a text.

We only need to care about writing the Map function and the Reduce function; the rest of the steps will be handled automatically. For the above word-frequency example, the pseudocode would be:

https://en.wikipedia.org/wiki/MapReduce
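
Since the pseudocode itself is only linked above, here is a rough single-machine Python sketch of it (the function names and the in-memory shuffle are mine, for illustration; a real framework distributes these steps across nodes):

    from collections import defaultdict

    def map_function(document_name, text):
        # Mapping: emit a (word, 1) pair for every word in the input split.
        for word in text.split():
            yield (word, 1)

    def reduce_function(word, counts):
        # Reducing: aggregate all values that share the same key.
        return (word, sum(counts))

    def run_word_count(documents):
        # Shuffling (simulated): group intermediate (k2, v2) pairs by key -> (k2, list(v2)).
        grouped = defaultdict(list)
        for name, text in documents.items():
            for word, count in map_function(name, text):
                grouped[word].append(count)
        # Reducing phase over each group of values.
        return dict(reduce_function(word, counts) for word, counts in grouped.items())

    print(run_word_count({"doc1": "Deer Bear River", "doc2": "Deer Car Bear"}))
    # {'Deer': 2, 'Bear': 2, 'River': 1, 'Car': 1}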

Combiner

In the example above, where the MapReduce framework is applied to find word frequencies in a document (usually a very large document), the Map function emits a (word, 1) pair for each word, even if the same word appears on the same node twice. So if Deer occurs twice on the same node, the map function will emit (Deer, 1) twice, instead of (Deer, 2).

By using a combiner on each mapper node, we reduce the data by locally combining the values of identical keys. The combiner makes shuffling and sorting easier.
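
As a rough illustration (this is plain Python, not Hadoop's actual Combiner API), a combiner behaves like a reduce run locally on each mapper node's output before the shuffle:

    from collections import defaultdict

    def combine(mapper_output):
        # Locally sum the 1s for identical words before they leave the mapper node,
        # so two (Deer, 1) pairs become a single (Deer, 2).
        combined = defaultdict(int)
        for word, count in mapper_output:
            combined[word] += count
        return list(combined.items())

    print(combine([("Deer", 1), ("Bear", 1), ("Deer", 1)]))  # [('Deer', 2), ('Bear', 1)]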

MapReduce Reliability

Since a MapReduce program uses hundreds or thousands of machines, it must tolerate machine failures gracefully. As mentioned earlier, there is one master node (machine) which is responsible for tracking the progress of the worker nodes. If no response is received from a worker within a certain amount of time, the master marks that worker as failed. If a machine fails during the map phase, its map tasks will be executed again, because the results of a map task are stored on the local disk of the node. However, a completed reduce task will not be re-executed, because its output is saved in a global file system.

In this article I will briefly talk about Hadoop and Spark, as they are built upon the MapReduce programming model. (Spark will be explained in detail in a future article.)

Hadoop

Hadoop is an open-source, Java-based framework that uses the MapReduce programming model for fast information storage and processing of big data; it is managed by the Apache Software Foundation.

Hadoop provides fault tolerance by storing a replicated copy of the data held on any of the nodes in the cluster, so it can re-execute the process if a node breaks down. Also, running complex queries in Hadoop is very fast, thanks to the use of the distributed file system (HDFS) and MapReduce programming for parallel processing.

As Hadoop is an open-source framework with no license to be purchased, its cost is significantly low.
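
As a rough sketch of how the word-count job above could be run on Hadoop, Hadoop Streaming lets us supply the map and reduce steps as ordinary scripts that read from stdin and write to stdout (the file names mapper.py and reducer.py are hypothetical; the framework sorts the mapper output by key before the reducer sees it):

    # mapper.py (sketch): read raw text from stdin, emit one "word<TAB>1" line per word.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py (sketch): input arrives sorted by key, so counts for the same word
    # are adjacent; sum them and emit "word<TAB>total".
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")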

Spark

Spark is also an open-source cluster computing platform like Hadoop. However, Spark extends the MapReduce model to efficiently support more types of computations (like interactive queries and stream processing); the core abstraction Spark uses for this is the Resilient Distributed Dataset (RDD).

Spark/Hadoop

There are two prominent differences between Spark and Hadoop:

  • Spark is faster than Hadoop due to its in-memory data engine (the ability to run computations in memory).
  • Spark offers simple APIs in Python, Java, Scala and SQL, and rich built-in libraries (like Spark SQL, Spark Streaming, MLlib, GraphX). Several lines of code written for Hadoop can often be condensed to a few lines in Spark's Python API (PySpark), as sketched below.
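
For example, here is a short, hedged PySpark sketch of the same word-count job ("input.txt" is a placeholder path, and a local Spark installation is assumed):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    counts = (spark.sparkContext.textFile("input.txt")  # one record per input line
              .flatMap(lambda line: line.split())       # map: one record per word
              .map(lambda word: (word, 1))              # emit (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))         # shuffle + reduce by key
    print(counts.collect())
    spark.stop()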
