TELKOMNIKA, Vol.14, No.4, December 2016, pp. 1552~1558
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v14i4.3606
Received March 22, 2016; Revised July 22, 2016; Accepted August 12, 2016
An Optimized Model for MapReduce Based on Hadoop

Zhang Hong*1, Wang Xiao-Ming2, Cao Jie3, Ma Yan-Hong4, Guo Yi-Rong5, Wang Min6
1,2,5,6College of Electrical & Information Engineering, Lanzhou University of Technology, Lanzhou 730050, China
1,3College of Computer & Communication, Lanzhou University of Technology, Lanzhou 730050, China
4State Grid Gansu Electric Company, Lanzhou 730030, China
*Corresponding author, e-mail: zhanghong@lut.cn

Abstract
Aiming at the waste of computing resources resulting from the sequential control of the running mechanism of the MapReduce model on the Hadoop platform, the Fork/Join framework has been introduced into this model to make full use of the CPU resources of each node. From the perspective of fine-grained parallel data processing, and combined with the Fork/Join framework, a parallel multi-threaded model, this paper optimizes the MapReduce model and puts forward a MapReduce+Fork/Join programming model: a distributed and parallel architecture on the Hadoop platform that combines coarse-grained and fine-grained parallelism to support a two-tier parallel architecture on both shared-memory and distributed-memory machines. A test is made under the environment of a Hadoop cluster composed of four nodes. The experimental results prove that this model really can improve the performance and efficiency of the whole system, and that it is suitable not only for data-intensive tasks but also for computing-intensive tasks. It is an effective optimization and improvement of the MapReduce model for big data processing.
Keywords: Hadoop, MapReduce, Fork/Join, distributed, parallel

Copyright © 2016 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
Hadoop, developed by the Apache foundation, is one of the most popular and stable platforms for large-scale data processing in distributed environments. Its MapReduce component, a distributed and parallel programming model, greatly simplifies the design of parallel programs, and its design concept of migrating computation instead of data greatly alleviates the I/O access bottleneck, a difficult problem in big data processing. By assigning massive data to clusters for parallel processing, it enhances the performance of analyzing big data [1]. The framework also takes care of the low-level parallelization and scheduling details. It has been used successfully in several implementations for various scenarios. However, MapReduce targets distributed compute clusters, not multi-core single nodes, to realize parallelism [2]. The framework does not make full use of the resources of multi-core CPUs, and it is powerless for processing that is both I/O-intensive and CPU-intensive. Yet at present computers have multiple cores; moreover, under Moore's law, computer hardware has developed rapidly and the computing power of a cluster's single node has also improved. The resources of each node in a cluster therefore need to be further mined and used, and effectively using CPU resources to improve the whole processing performance of the Hadoop platform is a further optimization and improvement of the platform. Therefore, how to make this framework, initially designed for low-cost nodes, deliver its due performance on high-performance nodes is a hot problem worthy of study.
2. Related Work
Up to now, many scholars have done research on optimization of MapReduce/Hadoop itself. These research results can mainly be divided into two aspects: improvement of the MapReduce parallel programming model, and optimization of the MapReduce scheduling strategy [3]. The typical research results on improvement of the MapReduce programming model include barrier-less MapReduce [4], MapReduce-merge [5], Dache [6], and MARCO [7]. The typical research results on optimization of the MapReduce scheduling strategy
include coupled scheduling [8], shuffle joint scheduling [9], research on localization of data and computing [10], and resource-aware scheduling [11]. However, these research results are only aimed at the deficiencies of MapReduce/Hadoop, and they have not been widely used. Many scholars have also done research on MapReduce frameworks based on third-party technology. For example, R. M. Yoo put forward the MapReduce framework Phoenix based on multi-core CPUs [12]. W. Fang realized the MapReduce framework Mars running on GPUs based on Nvidia CUDA [13]. Y. L. Zhai studied collaborative computing between CPU and GPU on Hadoop [14]. These studies try to achieve a high-performance MapReduce framework, but they study neither problems on the Hadoop platform nor the performance of a single node. There are also many researches on hybrid processing models, which combine distribution and multi-core parallelism, such as the hybrid combination of MPI and OpenMP for parallel C programming [15] and a two-layered parallelism model for CloudHaskell [16].
The Fork/Join framework can give full play to the advantages of multi-core computer resources [17]. For the same tasks, it takes less time than plain multi-threading does [18]. In addition, the Fork/Join framework greatly simplifies the programming of parallel procedures and effectively reduces the workload of developers [19]. Most importantly, Java 7 begins to incorporate this framework, and the Hadoop platform is developed in Java. So it is feasible and valuable to combine the MapReduce architecture with the Fork/Join framework to study an optimized programming model of two-tier parallelism: a hybrid programming approach which makes full use of the advantages of both MapReduce and Fork/Join. R. Stewart compared and evaluated the advantages of the fork/join and MapReduce frameworks [20], but there are few literatures merging these two frameworks into one model.
This paper extends the MapReduce framework and studies a MapReduce+Fork/Join computing model based on the Hadoop platform. By using the Fork/Join framework inside the map and reduce functions, it realizes parallel computing over shared and distributed memory on each node of a Hadoop cloud platform, to speed up the computation and improve the performance of data processing. We analyze the process of MapReduce on the Hadoop platform and point out its shortcomings (Section 3). In Section 4.1, we analyze the process of the Fork/Join framework. In Section 4.2, we build the MapReduce+Fork/Join hybrid model. In Section 5, we conduct experiments and analyze the results. Section 6 gives the conclusions.
3. MapReduce Model of the Hadoop Platform
The process of MapReduce is transparent to users. It can be divided into the map stage, the combine stage and the reduce stage [21], which can be expressed in the following steps.
1. Map: (K1, V1) → list(K2, V2)
2. Combine: list(K2, V2) → (K2, list(V))
3. Reduce: (K2, list(V)) → list(K3, V3)
Their input and output are all key-value pairs. Generally, we need to program two functions: the map function in a mapper class and the reduce function in a reducer class. If the input of a map function is (K1, V1) and its output is list(K2, V2), then the system shuffles the output and produces (K2, list(V)), which is the input of the reduce function. If the output of the reduce function is list(K3, V3), then list(K3, V3) or a deformation of it may be the final result. The list(K3, V3) can also be the input of another map function, realizing a recursive implementation. Next, we can construct Job and Configuration objects, initialize them, and define the input and output paths. We can then call the runJob method in the JobClient class to commit the job and so complete a simple run of a distributed task.
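The three stages above can be sketched in ordinary Java as a single-process illustration of the dataflow (word counting is our own toy example here, and this sketch deliberately does not use the Hadoop API):

```java
import java.util.*;
import java.util.stream.*;

// Minimal single-process sketch of the map -> shuffle -> reduce dataflow.
class WordCountFlow {

    // Map: (K1,V1) -> list(K2,V2); here V1 is a line, K2 a word, V2 the count 1
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle: list(K2,V2) -> (K2, list(V)); group all values by key
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // Reduce: (K2, list(V)) -> list(K3,V3); here sum the per-word counts
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b a"))); // {a=3, b=2}
    }
}
```

In real Hadoop code the map and reduce bodies live in mapper and reducer subclasses, and the shuffle is performed by the framework between the nodes of the cluster.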
When a job is submitted to the system, it is performed with the MapReduce model, whose process is described in Figure 1. After the JobTracker receives the new job, it passes the job to the job scheduler, initializes it, and creates an object to encapsulate the operation, status and progress of the job. The job scheduler gets the block information of the input data, creates a map task for each block, and assigns an ID number to each map task. The JobTracker assigns tasks to each surviving TaskTracker, which uses a heartbeat to interact with the JobTracker. When a TaskTracker receives a new task, it copies the relevant JAR files from HDFS, duplicates all the files required by the application from the distributed caches to the local disk, and creates a new TaskRunner instance to run the task.
Figure 1. Data processing flow of MapReduce
It can be seen from Figure 1 that the JobTracker is mainly used for scheduling tasks, handling machine faults, managing communications with TaskTrackers, and monitoring the execution of Jobs and Tasks, while the TaskTracker is used to complete the Map and Reduce tasks and is the real executor of a task. So the obvious advantage of this model is that it divides the massive data into small blocks and calls a group of distributed computer nodes to perform a job in parallel, improving the performance of big data processing. This is a kind of coarse-grained high-performance computing. But the computing resources of each node are not fully used. The Map, Combiner and Reduce phases run in sequential order: the Combine and Reduce stages must wait for the Map stage, and only when the Map stage completes its task can they begin the work of summary. If the Map tasks are divided unevenly or an error occurs, Reduce will have been waiting for a relatively long time. This wastes the computing resources.
So making full use of the resources of each node by introducing the Fork/Join framework [22] into the model, and studying the resulting MapReduce+Fork/Join programming model, is the purport of this paper. The result is a high-performance computing model that combines coarse-grained and fine-grained parallelism.
4. MapReduce+Fork/Join Hybrid Model
The MapReduce+Fork/Join computing model based on the Hadoop platform is described in Figure 2. The whole system has a master-slave structure. The interaction between the JobTracker and the TaskTrackers, the task scheduling, the file blocks and so on are the same as described for Figure 1. The main difference is that when the slave nodes perform tasks, the Fork/Join framework has been introduced into the map or reduce function to make full use of the advantages of shared memory and parallel multi-threaded running [23].
That is to say, it is these slave nodes which use optimized multi-threading to execute tasks and achieve fine-grained parallel task execution on each node [24]. Before a job is submitted to the system, we should set such parameters as the path, class name, THRESHOLD and so on according to the real problem. We can choose distributed execution with a single thread simply by setting the value of THRESHOLD to one. When the JobTracker receives a new job, it passes the job to the job scheduler and assigns tasks to each surviving TaskTracker. The TaskTracker copies the relevant JAR files from HDFS, duplicates all the files required by the application from the distributed caches to the local disk, and creates a new TaskRunner instance to run the task. The TaskTracker is the real executor and its execution has the character of localization. The TaskTracker starts up the Fork/Join framework on each computer node and performs parallel tasks on multi-core chips under the distributed environment.
4.1. Fork/Join Framework
The Fork/Join framework provided in Java 7 targets parallelism on a multi-core single computer node. It adopts the divide-and-conquer algorithmic skeleton and design concept and can dynamically divide a big task into many small tasks, called sub tasks, for separate execution
in parallel [25], as shown in Figure 3. At first, the Fork operation divides a task into many sub tasks and creates a new branch task. These sub tasks can be divided into smaller sub tasks and performed with multiple threads in parallel. Then the Join operation blocks the current tasks and merges the results of sub task execution until it obtains the final result.
We can set a threshold THRESHOLD to vary the granularity of task execution. The setting of the threshold THRESHOLD decides the execution time of the Fork/Join framework.
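A minimal sketch of this divide-and-conquer pattern, with array summation as our own toy example (the THRESHOLD constant plays the role of the threshold described above):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// A task is forked into sub tasks until its size drops below THRESHOLD,
// then the partial results are joined back together.
class ForkJoinSum extends RecursiveTask<Long> {
    static final int THRESHOLD = 1_000;   // granularity of a sub task
    private final long[] data;
    private final int lo, hi;

    ForkJoinSum(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {           // small enough: compute directly
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ForkJoinSum left = new ForkJoinSum(data, lo, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, hi);
        left.fork();                          // Fork: run the left half asynchronously
        return right.compute() + left.join(); // Join: wait for it and merge results
    }

    static long sum(long[] data) {
        return new ForkJoinPool().invoke(new ForkJoinSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] a = new long[100_000];
        for (int i = 0; i < a.length; i++) a[i] = i + 1;
        System.out.println(sum(a)); // 5000050000
    }
}
```

Raising THRESHOLD produces fewer, larger sub tasks; setting it to the input size degenerates to sequential execution, which is why the threshold setting decides the framework's execution time.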
Figure 2. MapReduce+Fork/Join model based on Hadoop

Figure 3. Fork/Join framework
Compared with other parallel architectures, the fork/join framework allows inter-task communication for shared-memory systems with multi-core chips. This framework also uses an algorithm called work stealing, which has been shown to be an effective load-balancing strategy for parallel runtime systems [26]. Each worker thread maintains a double-ended task queue. When a worker thread exhausts its local queue or completes its own tasks, it steals a task from the tail of another thread's queue. Through this algorithm, the framework can make full use of multiple threads to perform tasks in parallel, reduce the waiting time of idle threads, and shorten program execution time.
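The effect can be observed directly on a ForkJoinPool, which exposes the number of stolen tasks; the deliberately unbalanced task tree below (naive Fibonacci, our own toy example) gives idle workers something to steal:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// An unbalanced recursive task tree run on a ForkJoinPool; the pool
// reports how many tasks idle workers stole from other workers' deques.
class StealDemo extends RecursiveTask<Long> {
    private final int n;
    StealDemo(int n) { this.n = n; }

    @Override
    protected Long compute() {
        if (n <= 1) return (long) n;
        StealDemo f1 = new StealDemo(n - 1);
        f1.fork();                          // pushed onto this worker's deque; may be stolen
        StealDemo f2 = new StealDemo(n - 2);
        return f2.compute() + f1.join();
    }

    static long fib(int n) {
        ForkJoinPool pool = new ForkJoinPool();
        long result = pool.invoke(new StealDemo(n));
        System.out.println("tasks stolen: " + pool.getStealCount()); // load balancing at work
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fib(20)); // 6765
    }
}
```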
4.2. Implementation Pattern of MapReduce+Fork/Join
The Fork/Join framework mainly uses two classes to realize parallel computing: ForkJoinTask, which is used to make the fork and join operations, and ForkJoinPool, a thread pool used to store the task queues [27]. When creating a Fork/Join framework and using multiple threads to realize a parallel operation, one needs to create a subclass of RecursiveAction, used to define tasks without return values, or of RecursiveTask, used to define tasks with return values, both of which are subclasses of ForkJoinTask. No matter which class the new class inherits from, it needs to override the compute method and set the value of THRESHOLD, which sets the threshold for a sub task.
As mentioned above, the MapReduce model based on the Hadoop platform needs to complete the map and reduce methods. So one way to implement the MapReduce+Fork/Join model is, when constructing the map or reduce function, to create a ForkJoinTask, set the parameters of the Fork/Join framework, and call the written compute method. An important problem is how to set the key-value pairs. In the following experiment, a line of the *.plt file is taken as the value and its offset as the key. Thus fine-grained, thread-level parallel task processing can be achieved on the Hadoop platform on top of the MapReduce distributed mode.
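As an illustration only, a map-side Fork/Join task for the velocity filtering used later in the experiment might be sketched as follows; the class name, the "id,speed" record layout and the tiny THRESHOLD are our own assumptions, not the paper's actual code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical map-side task: the lines handed to one map call are split
// recursively and filtered in parallel, keeping only records whose speed
// lies in (0, 80] km/h (the filtering rule of the Section 5 experiment).
class SpeedFilterTask extends RecursiveTask<List<String>> {
    static final int THRESHOLD = 2;       // tiny, for demonstration only
    private final List<String> lines;

    SpeedFilterTask(List<String> lines) { this.lines = lines; }

    @Override
    protected List<String> compute() {
        if (lines.size() <= THRESHOLD) {
            List<String> kept = new ArrayList<>();
            for (String line : lines) {
                double v = Double.parseDouble(line.split(",")[1]);
                if (v > 0 && v <= 80) kept.add(line);
            }
            return kept;
        }
        int mid = lines.size() / 2;
        SpeedFilterTask left = new SpeedFilterTask(lines.subList(0, mid));
        SpeedFilterTask right = new SpeedFilterTask(lines.subList(mid, lines.size()));
        left.fork();                                    // left half runs asynchronously
        List<String> result = new ArrayList<>(right.compute());
        result.addAll(left.join());                     // merge the two halves
        return result;
    }

    static List<String> filter(List<String> lines) {
        return new ForkJoinPool().invoke(new SpeedFilterTask(lines));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a,30", "b,95", "c,0", "d,60", "e,-5");
        System.out.println(filter(lines));
    }
}
```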
5. The Experiment
In order to prove the correctness and effectiveness of the above-described theory, a cluster of four nodes was constructed [28]; information on each node is shown in Table 1. The hardware specification of each node is an Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz, 4G memory, Windows 7 Ultimate Edition 64. The virtual infrastructure is VMWare Workstation 10. CentOS 6.0 is the virtual operating system. Hadoop is version 1.2.1. The Java VM is version 1.7.0-02-b13. We make an experiment to process a 128.7MB taxi GPS data file (.plt), filtering out the data whose velocity values are greater than 80 km/h or less than or equal to 0 km/h, and then matching the data to a real map [29]. The experimental results show that the MapReduce model takes 1.8 minutes, while the MapReduce+Fork/Join model (4 threads, which is the number of cores in each node) takes 1.1 minutes. Compared with the MapReduce model, it gets a 1.64 speedup.
Table 1. Node information of cluster

Name   | IP address     | Functions
Master | 192.168.91.128 | NameNode and JobTracker
Slave1 | 192.168.91.131 | DataNode and TaskTracker
Slave2 | 192.168.91.132 | DataNode and TaskTracker
Slave3 | 192.168.91.133 | DataNode and TaskTracker
In the above cluster environment, we make the same experiment as above with the MapReduce+Fork/Join model. As the number of threads changes, the running time is shown in Table 2, and the corresponding time variation is shown in Figure 4. When a single thread processed the file, the time consumed was 1.82 minutes, slightly more than that of the MapReduce model. The time consumed then dropped rapidly and afterwards decreased gradually. When the number of threads was four, the time consumed was 1.1 minutes and the maximum speedup was obtained. The time taken then gradually increased. When the number of threads was ten, the time consumed was 1.76 minutes and changed slowly.
Table 2. Number of threads and the corresponding running time

Number of threads | Running time of runs 1-6 (minutes)       | Average (minutes)
1                 | 1.811  1.823  1.815  1.832  1.818  1.835 | 1.82
2                 | 1.521  1.485  1.523  1.511  1.523  1.514 | 1.51
3                 | 1.311  1.332  1.324  1.341  1.324  1.317 | 1.32
4                 | 0.971  1.201  1.163  1.154  1.111  0.982 | 1.1
6                 | 1.412  1.414  1.421  1.425  1.398  1.389 | 1.41
8                 | 1.644  1.653  1.637  1.635  1.648  1.656 | 1.65
10                | 1.771  1.762  1.772  1.761  1.763  1.758 | 1.76
Figure 4. Time variation with the number of threads
Thus it can be concluded that the computing speed of the MapReduce+Fork/Join model is faster than that of the MapReduce model and that it really can improve computational performance, although the time consumed at the beginning and at the end is almost the same as that of the original model. Why? The reason may be as follows. At the beginning, creating objects and configuring and initializing the parameters take part of the time. When the number of threads is increased to 10, creating threads and communicating among threads take considerable time. How many threads, then, achieve the highest efficiency? It should be determined according to the file size, the cluster scale and the specific analysis. In the above experiment, the system obtains the highest efficiency when the number of threads is four.
6. Conclusion
On the basis of analyzing the running mechanism and principle of the MapReduce model on the Hadoop platform, and aiming at the waste of computing resources resulting from the sequential control of the MapReduce computing model, this paper optimizes the model, introduces the Fork/Join framework, and exploits multi-core parallelism in distributed MapReduce frameworks. A MapReduce+Fork/Join programming model is constructed on the Hadoop platform. This model is a distributed and parallel programming architecture combining coarse-grained and fine-grained parallelism on the Hadoop platform, which is suitable not only for data-intensive tasks but also for computing-intensive tasks, because the internal work-stealing algorithm on the nodes can greatly improve the running efficiency of threads. It realizes a hierarchical structure of distributed and parallel computing on Hadoop. Through the experiment under a cluster environment of four nodes, the results prove that it really can improve the computational efficiency of the whole system and is an effective optimization of the MapReduce model on the Hadoop platform. In fact, the Hadoop library offers a MultithreadedMapper class, extending the default Mapper class, which offers thread-level parallelism on multi-core chips. But Fork/Join is a more effective and optimized framework offering a load-balancing strategy. Because the Fork/Join framework hides the lowest details of the parallel implementation, it is easier for programmers to accept. So this architecture is suitable not only for CPU-intensive tasks but also for IO-intensive tasks. Future work is to run experiments in many more environments to test its high performance on large-scale clusters.
Acknowledgements
We acknowledge the support from various grant sources: the Natural Science Foundation of Gansu Province (Grant No. 148RJZA019), and the Scientific and Technological Support Program Foundation of Gansu Province (Grant No. 1304GKCA023).
References
[1] Li Zhang-yong. Optimization and Application Research of MapReduce Computing Model Based on Hadoop. Dissertation. Wuhan: Wuhan University of Science and Technology; 2015.
[2] Sutikno T, Stiawan D, Subroto IMI. Fortifying Big Data infrastructures to Face Security and Privacy Issues. TELKOMNIKA Telecommunication Computing Electronics and Control. 2014; 12(4): 751-752.
[3] Li Jian-Jiang, Cui Jian, Wang Dan, et al. Survey of MapReduce Parallel Programming Model. Acta Electronica Sinica. 2011; 39(11): 2635-2642.
[4] Verma A, Cho B, Zee N, et al. Breaking the MapReduce stage barrier. Cluster Computing. 2013; 16(1): 191-206.
[5] Yang HC, Dasdan A, Hsiao RL, et al. Map-reduce-merge: simplified relational data processing on large clusters. Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. 2007: 1029-1040.
[6] Zhao Ya-xiong, Wu Jie. Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework. Proceedings IEEE INFOCOM. 2013; 19(1): 35-39.
[7] Ahmad F, Lee S, Thottethodi M, et al. MapReduce with communication overlap (MARCO). Journal of Parallel & Distributed Computing. 2013; 73(5): 608-620.
[8] Tan J, Meng X, Zhang L. Performance analysis of Coupling Scheduler for MapReduce/Hadoop. Proceedings IEEE INFOCOM. 2012; 131(5): 2586-2590.
[9] Chen F, Kodialam M, Lakshman TV. Joint scheduling of processing and Shuffle phases in MapReduce systems. Proceedings IEEE INFOCOM. 2012; 131(5): 1143-1151.
[10] Tan J, Meng S, Meng X, et al. Improving ReduceTask data locality for sequential MapReduce jobs. Proceedings IEEE INFOCOM. 2012: 1627-1635.
[11] Tan J, Meng X, Zhang L. Coupling Task Progress for MapReduce Resource-Aware Scheduling. Proceedings IEEE INFOCOM. 2013; 12(11): 1618-1626.
[12] Yoo R, Romano A, Kozyrakis C. Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System. IEEE International Symposium on Workload Characterization. 2009: 198-207.
[13] Fang Wei-bin, He Bing-sheng, Luo Qiong, et al. Mars: Accelerating MapReduce with Graphics Processors. IEEE Transactions on Parallel and Distributed Systems. 2011; 22(4): 608-620.
[14] Zhai Yan-long, Luo Zhuang, Yang Kai, et al. High Performance Massive Data Computing Framework Based on Hadoop Cluster. Computer Science. 2013; 40(3): 100-103.
[15] Lusk EL, Chan A. Early Experiments with the OpenMP/MPI Hybrid Programming Model. Proceedings of the International Workshop on OpenMP. 2008: 36-47.
[16] Epstein J, Black AP, Jones SLP. Towards Haskell in the Cloud. Proceedings of the Haskell Symposium. 2011: 118-129.
[17] Lea D. A Java fork/join framework. ACM Conference on Java Grande. 2000: 36-43.
[18] Agrawal K, Leiserson CE, Sukha J. Helper locks for fork-join parallel programming. ACM SIGPLAN Notices. 2011; 45(5): 245-256.
[19] Cao Yang-jie, Qian De-pei, Wu Wei-guo, et al. Adaptive scheduling algorithm based on dynamic core-resource partitions for many-core processor systems. Journal of Software. 2012; 23(2): 240-252.
[20] Stewart R, Singer J. Comparing Fork/Join and MapReduce. Department of Computer Science, Heriot-Watt University. 2012.
[21] Sandholm T, Lai K. MapReduce optimization using regulated dynamic prioritization. ACM SIGMETRICS Performance Evaluation Review. 2015; 37(1): 299-310.
[22] Giacaman N, Sinnen O. Object-oriented parallelization of Java desktop programs. IEEE Software, Software for the Multiprocessor Desktop: Applications, Environments, Platforms. 2011; 28(1): 32-38.
[23] Senghor A, Konate K. A Java hybrid compiler for shared memory parallel programming. 13th International Conference on Parallel and Distributed Computing, Applications and Technologies. 2012: 131-136.
[24] Zhang Hong, Wang Xiao-ming, Cao Jie, et al. Parallel Computing of Shared Memory Multiprocessors Based on JOMP. International Conference on Mechatronics, Electronic, Industrial and Control Engineering. 2015; 8: 1628-1632.
[25] Zhang Dong-wen, Liu Chen-guang, Zhang Yang. Fork/Join-oriented software refactoring and performance analysis. Journal of Computer Applications. 2015; 35(11): 3172-3177.
[26] Wang Lei, Cui Hui-min, Chen Li, et al. Research on task parallel programming model. Journal of Software. 2013; 24(1): 77-90.
[27] Zhang Yang, Ji Wei-xing. Scalable method-level parallel library and its improvement. The Journal of Supercomputing. 2012; 61(3): 1154-1167.
[28] Purbasari A, Suwardi IS, Santoso OS, Andala R. Data Partition and Communication on Parallel Heuristik Model Based on Clonal Selection Algorithm. TELKOMNIKA. 2015; 13(1): 193-201.
[29] Amin MS, Reaz MBI, Nasir SS. Integrated Vehicle Accident Detection and Location System. TELKOMNIKA Telecommunication Computing Electronics and Control. 2014; 12(1): 73-78.