Map Reduce - Overview

MapReduce is a parallel and distributed solution approach developed by Google for processing large datasets. Described in this paper -
Map transforms a set of data into key value pairs and Reduce aggregates this data into a scalar. A reducer receives all the data for an individual "key" from all the mappers.
The approach assumes that there are no dependencies between the input data. This make it easy to parallelize the problem. The number of parallel reduce task is limited by the number of distinct "key" values which are emitted by the map function.
MapReduce incorporates usually also a framework which supports MapReduce operations. A master controls the whole MapReduce process. The MapReduce framework is responsible for load balancing, re-issuing task if a worker as failed or is to slow, etc. The master divides the input data into separate units, send individual chunks of data to the mapper machines and collects the information once a mapper is finished. If the mapper are finished then the reducer machines will be assigned work. All key/value pairs with the same key will be send to the same reducer.
The classical example for using MapReduce is logfile analysis.

Big logfiles are split and a mapper search for different webpages which are accessed. Every time a webpage is found in the log a key / value pair is emitted to the reducer where the key is the webpage and the value is "1". The reducers aggregate the number of for certain webpages. As a end result you have the total number of hits for each webpage.


Popular posts from this blog

Resolution for “No buffer space available (maximum connections reached?): JVM_Bind” issue

Using NPIV port for RDM disk access in VMware vSphere

Using Quartz Scheduler in a Java EE Web Application