MapReduce is a programming model for the parallel and
distributed processing of large datasets, developed by Google.
It is described in this paper: http://research.google.com/archive/mapreduce.html
Map transforms a set of input data
into key/value pairs, and Reduce aggregates all values that share a key into a
smaller result, typically a single value per key. A reducer
receives all the data for an individual "key" from all the mappers.
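The two roles can be sketched in a few lines of Python. This is only a minimal single-process illustration, not Google's implementation: the names `map_fn`, `reduce_fn` and `run_map_reduce`, as well as the temperature readings, are made up for this example.

```python
from collections import defaultdict


def map_fn(record):
    """Map: turn one input record into (key, value) pairs."""
    city, temperature = record
    yield city, temperature


def reduce_fn(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, max(values)


def run_map_reduce(records, map_fn, reduce_fn):
    # Group every emitted value under its key, across all map calls.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    # Each key is reduced exactly once, with all of its values.
    return dict(reduce_fn(key, values) for key, values in grouped.items())


# Hypothetical input: (city, temperature) readings.
readings = [("Berlin", 21), ("Paris", 25), ("Berlin", 27), ("Paris", 19)]
result = run_map_reduce(readings, map_fn, reduce_fn)
print(result)  # {'Berlin': 27, 'Paris': 25}
```

Here the reducer for "Berlin" sees both Berlin readings, no matter which map call emitted them.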
The approach assumes that
there are no dependencies between the input records. This makes it easy to
parallelize the problem. The number of parallel reduce tasks is limited by the
number of distinct "key" values that are emitted by the map
function.
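Because no chunk of input depends on another, the map calls can run in any order or at the same time. The sketch below uses a thread pool from the Python standard library as a stand-in for separate mapper machines (threads are only an assumption to keep the example small; a real cluster would use many processes or hosts). `map_chunk` and the sample chunks are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map one chunk on its own; it needs no data from any other chunk."""
    return [(word, 1) for word in chunk]


# Hypothetical input, already divided into independent chunks.
chunks = [["a", "b", "a"], ["c", "a"], ["b", "c", "c"]]

# Sequential baseline.
sequential = [pair for chunk in chunks for pair in map_chunk(chunk)]

# The same map calls, run concurrently. Executor.map yields results
# in the order of the inputs, so the two lists are comparable.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = [pair for pairs in pool.map(map_chunk, chunks) for pair in pairs]

# Only as many reduce tasks can do useful work as there are distinct keys.
distinct_keys = {key for key, _ in parallel}
max_useful_reducers = len(distinct_keys)

print(sequential == parallel)   # True
print(max_useful_reducers)      # 3
```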
MapReduce usually also
comes with a framework that supports MapReduce operations. A master controls
the whole MapReduce process. The framework is responsible for load
balancing, re-issuing a task if a worker has failed or is too slow, etc. The master
divides the input data into separate units, sends individual chunks of data to
the mapper machines, and collects the information once a mapper is finished. Once
the mappers are finished, the reducer machines are assigned work. All
key/value pairs with the same key are sent to the same reducer.
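How a framework might route pairs to reducers can be sketched as follows. This assumes a simple hash partitioning scheme (key hash modulo the number of reducers); the functions `split_input`, `partition` and `shuffle` are hypothetical names for this illustration, and fault handling and load balancing are left out entirely.

```python
import zlib
from collections import defaultdict


def split_input(records, chunk_size):
    """Divide the input into separate units, one per map task."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def partition(key, num_reducers):
    """Pick the reducer for a key.

    crc32 gives the same value in every process, unlike Python's
    built-in hash() for strings, which is randomized per process.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapped_pairs, num_reducers):
    """Send each (key, value) pair to the reducer chosen by partition()."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        reducers[partition(key, num_reducers)][key].append(value)
    return reducers


# Hypothetical mapper output collected by the master.
pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)]
reducers = shuffle(pairs, num_reducers=2)

for index, bucket in enumerate(reducers):
    print(index, dict(bucket))
```

Because the reducer is derived from the key alone, every pair with the same key ends up in the same bucket, regardless of which mapper produced it.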
The classic example for
using MapReduce is logfile analysis.
Big logfiles are split, and each
mapper searches its split for the webpages that were accessed. Every time a webpage
is found in the log, a key/value pair is emitted to the reducer, where the key is
the webpage and the value is "1". The reducers add up the number
of hits for each webpage. As an end result you have the total number of hits for
each webpage.
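The logfile example can be written down as a small single-process sketch. The log format (client, method, page separated by spaces) and the sample lines are invented for this illustration, as are the function names; a real access log would need a proper parser.

```python
from collections import defaultdict

# Hypothetical log lines in the form "<client> <method> <page>".
LOG_LINES = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.1 GET /index.html",
    "10.0.0.3 GET /contact.html",
    "10.0.0.2 GET /index.html",
]


def map_log_line(line):
    """Emit (page, 1) for the page accessed in one log line."""
    _client, _method, page = line.split()
    yield page, 1


def reduce_hits(page, counts):
    """Add up all the 1s emitted for one page."""
    return page, sum(counts)


def count_hits(lines, num_splits=2):
    # Split the big log into pieces, one per (simulated) mapper.
    splits = [lines[i::num_splits] for i in range(num_splits)]
    grouped = defaultdict(list)
    for split in splits:
        for line in split:
            for page, one in map_log_line(line):
                grouped[page].append(one)
    # One reduce call per distinct page.
    return dict(reduce_hits(page, counts) for page, counts in grouped.items())


hits = count_hits(LOG_LINES)
for page, total in sorted(hits.items()):
    print(page, total)
```

The result does not depend on how the log is split: `count_hits(LOG_LINES, num_splits=1)` and `count_hits(LOG_LINES, num_splits=5)` give the same totals.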