International Journal of Computer Networks and Communications Security
VOL. 2, NO. 9, SEPTEMBER 2014, 308–317
Available online at: www.ijcncs.org
ISSN 2308-9830
A Comprehensive View of Hadoop MapReduce Scheduling Algorithms
Seyed Reza Pakize
Department of Computer, Islamic Azad University, Yazd Branch, Yazd, Iran
E-mail: s.rezapakize@gmail.com
ABSTRACT
Hadoop is a Java-based programming framework that supports the storage and processing of large data sets in a distributed computing environment and is well suited to high volumes of data. It uses HDFS to store data and MapReduce to process that data. MapReduce is a popular programming model for supporting data-intensive applications on shared-nothing clusters. The main objective of the MapReduce programming model is to parallelize job execution across multiple nodes. Nowadays, the attention of researchers and companies is turning toward Hadoop, and many scheduling algorithms have been proposed in the past decade. There are three important scheduling issues in MapReduce: locality, synchronization and fairness. The most common objective of scheduling algorithms is to minimize the completion time of a parallel application while also addressing these issues. In this paper, we give an overview of Hadoop MapReduce and its scheduling issues and problems, then study the most popular scheduling algorithms in this field, and finally highlight the implementation idea, advantages and disadvantages of each algorithm.
Keywords: Hadoop, MapReduce, Locality, Scheduling algorithm, Synchronization, Fairness.
1 INTRODUCTION
Hadoop is much more than a highly available, massive data storage engine. One of the main advantages of using Hadoop is that you can combine data storage and processing [1]. It can provide much-needed robustness and scalability to a distributed system, since Hadoop provides inexpensive and reliable storage. Hadoop uses HDFS to store data and MapReduce to process that data. HDFS is Hadoop's implementation of a distributed filesystem; it is designed to hold a large amount of data and provide access to this data to many clients distributed across a network [2]. MapReduce is an excellent model for distributed computing, introduced by Google in 2004. Each MapReduce job is composed of a certain number of map and reduce tasks. The MapReduce model for serving multiple jobs consists of a processor-sharing queue for the map tasks and a multi-server queue for the reduce tasks [3]. Hadoop allows the user to configure a job, submit it, control its execution, and query its state. Every job consists of independent tasks, and every task needs a system slot to run [2]. In Hadoop, all scheduling and allocation decisions are made at the task and node-slot level for both the map and reduce phases [4].

There are three important scheduling issues in MapReduce: locality, synchronization and fairness. Locality is defined as the distance between the input data node and the task-assigned node. Synchronization, the process of transferring the intermediate output of the map processes to the reduce processes as their input, is also considered a factor that affects performance [5]. Fairness has trade-offs with locality and with the dependency between the map and reduce phases. Because of these important issues and the many other problems in MapReduce scheduling, scheduling is one of the most critical aspects of MapReduce. Many algorithms have been proposed to address these issues with different techniques and approaches: some focus on improving data locality, some are designed to handle synchronization, and many aim to minimize the total completion time.
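To make the locality issue concrete, the following sketch shows, under simplified assumptions, how a scheduler might prefer node-local over rack-local and off-rack map tasks when a worker asks for work. The class and method names here are hypothetical and this is not Hadoop's actual scheduler code; it only illustrates the preference order that locality-aware policies follow.

import java.util.List;

public class LocalityAwareAssignment {

    // A pending map task, described by where its input split is stored.
    static class MapTask {
        final String inputHost;  // host holding the task's input split
        final String inputRack;  // rack of that host

        MapTask(String inputHost, String inputRack) {
            this.inputHost = inputHost;
            this.inputRack = inputRack;
        }
    }

    // When a worker on (host, rack) asks for work, prefer a task whose input
    // already lives on that node, then one on the same rack, then any task.
    static MapTask pickTask(List<MapTask> pending, String host, String rack) {
        MapTask rackLocal = null;
        MapTask offRack = null;
        for (MapTask t : pending) {
            if (t.inputHost.equals(host)) {
                return t;                          // node-local: best case
            }
            if (rackLocal == null && t.inputRack.equals(rack)) {
                rackLocal = t;                     // rack-local: second choice
            }
            if (offRack == null) {
                offRack = t;                       // off-rack: last resort
            }
        }
        return rackLocal != null ? rackLocal : offRack;
    }

    public static void main(String[] args) {
        List<MapTask> pending = List.of(
                new MapTask("node-7", "rack-2"),
                new MapTask("node-3", "rack-1"));
        // A worker on node-1/rack-1 gets the rack-local task stored on node-3.
        System.out.println(pickTask(pending, "node-1", "rack-1").inputHost);
    }
}

In practice a scheduler must balance this preference against fairness, for example by not letting a job wait indefinitely for a node-local slot.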
The rest of this article is organized as follows. Section 2 gives a brief introduction to Hadoop, presents an overview of the MapReduce programming model, and describes the main scheduling issues in MapReduce. Section 3 describes some of the most popular scheduling algorithms in Hadoop MapReduce. In Section 4 we discuss and analyze these algorithms and present their taxonomy, implementation ideas, advantages and disadvantages in the form of a table. Section 5 concludes the article.

2 BACKGROUND
2.1 Overview of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project [6], [7]. Hadoop is a Java-based programming framework that supports the processing of large data sets in a distributed computing environment and is part of the Apache project sponsored by the Apache Software Foundation. It can provide much-needed robustness and scalability to a distributed system, since Hadoop provides inexpensive and reliable storage. The Apache Hadoop software library can detect and handle failures at the application layer, and can deliver a highly available service on top of a cluster of computers, each of which may be prone to failures [1]. Hadoop enables applications to work with thousands of nodes and terabytes of data, without concerning the user with too much detail on the allocation and distribution of data and computation. Hadoop is much more than a highly available, massive data storage engine: one of its main advantages is that you can combine data storage and processing. Hadoop uses HDFS to store data and MapReduce to process that data. HDFS is Hadoop's implementation of a distributed file system. It is designed to hold a large amount of data and provide access to this data to many clients distributed across a network. HDFS consists of multiple DataNodes for storing data and a master node called the NameNode for monitoring the DataNodes and maintaining all the metadata.
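As a brief illustration of this architecture, the sketch below uses Hadoop's Java FileSystem API to query HDFS. It is an illustrative example, not code from the paper; the NameNode address (hdfs://localhost:9000) and the path are placeholders, and a real deployment would normally supply the address through core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListing {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS points the client at the NameNode; the address below
        // is a placeholder for illustration only.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);

        // The NameNode answers this metadata query; the file contents
        // themselves live in blocks spread across the DataNodes.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}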
2.2 MapReduce Overview

MapReduce [5] is a popular programming framework for supporting data-intensive applications on shared-nothing clusters. MapReduce was introduced to solve large-data computational problems, is specifically designed to run on commodity hardware, and relies on divide-and-conquer principles. In MapReduce, input data are represented as key/value pairs, and several functional programming primitives, including Map and Reduce, are introduced to process the data. In the MapReduce programming model [7], the map side involves three steps. First, the Hadoop MapReduce framework produces a map task for each input split, where the input splits are generated by the InputFormat of the job and each input split corresponds to one map task. In the second step, the map task executes and processes its input to form new key/value pairs. In the last step, the mapper's output is sorted so that it can be allocated to the reducers. The reduce side also involves three steps [7]. First, MapReduce assigns the related block of map output to each reducer (the shuffle step). Next, the input of the reducer is grouped according to the key (the sorting step). The final step is a secondary sort, which is applied if the key grouping rule in the intermediate process is different from the grouping rule used before the reduce. As shown in Figure 1, a mapper takes input in the form of key/value pairs (k1, v1) and transforms them into another key/value pair (k2, v2). The MapReduce framework sorts a mapper's output key/value pairs and combines each unique key with all its values (k2, {v2, v2, …}). These key/value combinations are delivered to reducers, which translate them into yet another key/value pair (k3, v3) [8].
Fig. 1. The functionality of mappers and reducers [8]
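To make the (k1, v1) to (k2, v2) to (k3, v3) contract above concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API. It is an illustrative example, not code from the paper: the mapper emits (word, 1) pairs, the framework shuffles and sorts them by key, and the reducer sums the counts; the input and output paths are placeholders supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: (k1, v1) = (byte offset, line of text) -> (k2, v2) = (word, 1)
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);            // emit (k2, v2)
            }
        }
    }

    // Reduce: (k2, {v2, v2, ...}) -> (k3, v3) = (word, total count).
    // The framework has already shuffled and sorted the map output, so all
    // values for a given key arrive together.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);              // emit (k3, v3)
        }
    }

    // Driver: configure the job, submit it, and wait for its final state.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The driver illustrates the job configuration and submission step mentioned in the Introduction: the user sets the mapper, reducer and output types, submits the job, and waitForCompletion blocks until Hadoop reports the job's final state.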