TELKOMNIKA Indonesian Journal of Electrical Engineering
Vol. 12, No. 11, November 2014, pp. 7927 ~ 7934
DOI: 10.11591/telkomnika.v12i11.6357
Received July 7, 2014; Revised September 19, 2014; Accepted October 6, 2014
Design and Analysis of Parallel MapReduce based KNN-join Algorithm for Big Data Classification
Xuesong Yan*, Zhe Wang, Dezhe Zeng, Chengyu Hu, Hao Yao
School of Computer Science, China University of Geosciences, Wuhan, China
*Corresponding author, e-mail: yanxs1999@126.com
Abstract
In data mining applications, multi-label classification is highly required in many modern applications. Meanwhile, a useful data mining approach is the k-nearest neighbour (KNN) join, which has high accuracy but a time-consuming process. With the recent explosion of big data, a conventional serial KNN-join based multi-label classification algorithm needs to spend a lot of time to handle a high volume of data. To address this problem, we first design a parallel MapReduce based KNN-join algorithm for big data classification. We further implement the algorithm using Hadoop in a cluster with 9 virtual machines. Experiment results show that our MapReduce based KNN join exhibits much higher performance than the serial one. Several interesting phenomena are observed from the experiment results.
Keywords: multi-label classification, KNN join, MapReduce
Copyright © 2014 Institute of Advanced Engineering and Science. All rights reserved.
1. Introduction
Multi-label classification, the extension of single-label classification, is highly required in many modern applications, such as protein function classification, music categorization and semantic scene classification. In past decades, multi-label classification has made a significant contribution to bioinformatics, especially to protein subcellular localization [1]. To put it simply, multi-label classification [2] is a mining method that assigns a set of labels to an unseen instance. Each instance in multi-label classification can be identified by a set of labels Y ⊆ L, where |L| ≥ 2. For example, a famous song by the band Scorpions can be classified into both 'rock' and 'ballad'. For semantic scene classification, a photograph can be labeled with more than one genre, such as mountains, lakes and forests, in a similar way.
In essence, there are two methods for multi-label classification: (1) problem transformation and (2) algorithm adaptation [2]. One frequently-used algorithm adaptation method is k nearest neighbours [3]-[4], which depends on similarity searches. The similarity join has become an important database primitive for supporting similarity searches and data mining [5].
As one of the three well-known similarity join operations, the k-nearest neighbour join (KNN join) retrieves the k most similar pairs and is frequently applied to numerous applications including knowledge discovery, data mining, and spatial databases [7],[8]. Since both the join and the KNN search are expensive, especially on large data sets and/or in multiple dimensions, KNN join is a costly operation.
Much research in the literature [5],[6]-[8] has been devoted to improving the performance of KNN join by proposing efficient algorithms, many of which focus on improving the algorithm in a centralized, single-threaded setting that cannot be used in a distributed system. With the rapid explosion in the volume of data in the big data era, multi-label classification using a serial KNN join can no longer satisfy our needs.
P. Malarvizhi et al. [9] propose an algorithm for assigning labels to the documents of the web. In their approach, they use binary classifiers based on the MapReduce framework to assign the set of positive labels to the
documents of the web. Their method needs numerous reduce functions when each instance in the datasets has masses of labels.
In recent years, MapReduce [10] has been widely used in industry and academia. It is regarded as a simple yet powerful parallel and distributed computing paradigm to explore cloud computing resources. Meanwhile, the MapReduce architecture has good scalability and fault tolerance mechanisms, so that it has already become one of the most widely used parallel and distributed systems. Therefore, we are motivated to incorporate the MapReduce architecture into the design of a distributed and parallel KNN algorithm for big data multi-label classification.
The main contributions of this paper are as follows:
a) We newly apply the MapReduce concept in the design of a distributed and parallel KDD algorithm for big data classification.
b) We implement our algorithm in a cluster with 9 virtual machines. Extensive performance evaluations are conducted. Specially, we also analyze the influence of the MapReduce cluster size on categorization performance and accuracy. Performance evaluation results also validate the high performance of our algorithm over the conventional serial one.
2. KNN Join
KNN join, proposed by [5], is an important similarity join operation that combines each point of one point set with its k nearest neighbours in the other set. For example, it is the join of the k nearest neighbors (NN) of every point in a dataset R from a dataset S [11]. Each record in R (or S) is represented as a d-dimensional point. For one point r in dataset R, we get knn(r, S) by calculating the similarity distances d(r, s) between r and every record s in S; in this paper, d(r, s) is the Euclidean distance.
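Concretely, for d-dimensional records r = (r_1, ..., r_d) and s = (s_1, ..., s_d), this is the standard

d(r, s) = \sqrt{\sum_{i=1}^{d} (r_i - s_i)^2}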
The KNN join is defined as follows:

knnJ(R, S) = {(r, knn(r, S)) | for all r ∈ R}   (1)
From the above, we can depict the KNN similarity join in Figure 1. When 3 points in dataset R each want to find two neighbours in dataset S, the KNN join algorithm returns 6 results.
Figure 1. KNN Join Operations
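For reference, Equation (1) admits a direct serial implementation. The following minimal Python sketch (our illustration, not the authors' code; the function names are our own) reproduces the Figure 1 scenario of 3 points in R finding k = 2 neighbours each in S, yielding 6 result pairs:

import heapq
from math import dist  # Euclidean distance, Python 3.8+

def knn(r, S, k):
    # k records of S nearest to r under Euclidean distance
    return heapq.nsmallest(k, S, key=lambda s: dist(r, s))

def knn_join(R, S, k):
    # Equation (1): pair every r in R with knn(r, S)
    return [(r, knn(r, S, k)) for r in R]

R = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
S = [(0.1, 0.0), (0.9, 1.2), (2.1, 1.9), (5.0, 5.0)]
for r, neighbours in knn_join(R, S, k=2):
    print(r, neighbours)  # 3 x 2 = 6 result pairs in total

The quadratic cost of this loop over R x S is exactly what motivates the parallel design in the following sections.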
3. MapReduce Cluster
Google's MapReduce [10],[13], implemented in open source by Hadoop, is a programming model that can be deployed on humble hardware conditions and serves for processing large datasets in a massively parallel
manner. It divides a task into jobs and automatically parallelizes and schedules the jobs in a distributed system.
Figure 2 shows the data flow diagram of MapReduce. Firstly, MapReduce splits datasets into hundreds of thousands of small datasets. Secondly, one node, which is generally a common computer, processes one small dataset and produces intermediate data. Finally, a large number of nodes merge the intermediate data and then produce the final output data.
In the process of computing, the computation inputs a set of input key/value pairs and produces a set of output key/value pairs. Then the map function processes the input data and generates a series of intermediate data. The reduce function regards the above intermediate data as input data and produces the final output data.
Figure 2. Data Flow Diagram of MapReduce
During the above process, it can be seen that two essential functions, i.e. the map function and the reduce function, are involved. Under such a framework, developers shall design their own map function and reduce function according to the task requirement.
A MapReduce cluster consists of a master machine called the master node and several slave machines called data nodes. The master node allocates map tasks and reduce tasks to data nodes and monitors their operations. A file in a MapReduce cluster is usually stored in a distributed file system (DFS) which splits the file into equal sized chunks. The splits are then distributed and replicated to all machines in the cluster. To execute a MapReduce job, users can decide the number of map tasks m and reduce tasks r. In most instances, m is the same as the number of splits for the given input file(s). After the master node has assigned map and reduce tasks to data nodes, the input and output of the map and reduce functions are as follows:
map: <k1, value1> -> <k2, value2>
reduce: <k2, list(v2)> -> list(v2)
A combine function can be invoked between the map function and the reduce function to ease the network congestion caused by repetitions of the intermediate keys k2 produced by the map functions [10]. The combine function plays a similar but optional role with respect to the reduce function. It behaves as:

combine: <k2, list(v2)> -> list<k2, v2>
4. KNN Join Based on MapReduce
4.1. Data Normalization
We change the format of the input datasets in order to get the desired training files and testing files. In each dataset, we alter each old record expressed as <att1, att2, att3, …, label1, label2, …> to a new record described as <id, att1, att2, att3, …, label1, label2, …>.
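A minimal sketch of this normalization step (our illustration; the paper gives no code, and we simply assume the id is the record's position in the file):

def normalize(in_path, out_path):
    # <att1,att2,...,label1,...> -> <id,att1,att2,...,label1,...>
    with open(in_path) as src, open(out_path, "w") as dst:
        for record_id, line in enumerate(src):
            dst.write(f"{record_id},{line.strip()}\n")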
4.2. Design of Parallel KNN Join
In KNN join, the most time-consuming step is the calculation of the distance between instances in dataset R and instances in dataset S. Each instance in the two files includes one id. The files exist in HDFS and are processed as <key, value> pairs, each of which represents a record in the files [11]. Our parallel algorithm is divided into two phases.
Phase-I: Phase-I stores R×S instances and completes all distance calculations.
Figure 3 shows the flow diagram of Phase-I. First of all, we split all files and spread them across all mappers. Each split is sent to a mapper in the form of <key, value>, where key is the offset in bytes of this record from the start point of the data file and value is the content of this record. We mark the testing file (the training file) with file id = 0 (= 1) and partition the input files into multiple groups. An input record randomly generates a partition id, named group id, as the output key of the map function, and the output value is a string with the content of each record and the relevant file id.
All intermediate <key1, value1> pairs with the same key are sent to the same reducer. Key1 is the unique group id and value1 is the value list obtained from the map function. For a value list with the same key1, values with file id = 0 (id = 1) are put into bucket R (bucket S), according to the file id. The distance between R and S is then calculated as in [11]. So for the output <key2, value2> pairs, key2 is null and value2 is a text containing a record's id from the testing file, i.e. id1 and id2, and the distance between these two records.
Figure 3. Flow Diagram of Phase-I
The pseudocode of Phase-I is summarized in Algorithm 1.
Algorithm 1.
Map input: <offset, original record>
Map output: <key1, value1>, where key1 is the group id and value1 is a combination of the original record and the file id
1. Assign each input record a group id and a file id
2. Take the group id as key1
3. Take the combination of the original record and the file id as value1
4. End
Reduce input: <key1, value1>
Reduce output: <key2, value2>, where key2 is null and value2 is a combination of id1, id2 and the distance
1. Dis = ComputeDist(training record, testing record)
2. Take null as key2
3. Take the combination of id1, id2 and the distance as value2
4. End
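A compact Python sketch of Algorithm 1 (our interpretation; NUM_GROUPS and the function names are hypothetical, and attribute-only records are assumed for brevity, i.e. the label fields of Section 4.1 are taken to be handled elsewhere):

import random
from math import dist

NUM_GROUPS = 32  # assumed partition count; the paper does not fix a value

def phase1_map(offset, record, file_id):
    # <offset, original record> -> <group id, (file id, record)>
    return random.randrange(NUM_GROUPS), (file_id, record)

def phase1_reduce(group_id, values):
    # Split the group's value list into bucket R (testing, file id 0) and
    # bucket S (training, file id 1), then emit all pairwise distances
    bucket_r = [rec for fid, rec in values if fid == 0]
    bucket_s = [rec for fid, rec in values if fid == 1]
    for r in bucket_r:
        r_id, *r_atts = r.split(",")
        for s in bucket_s:
            s_id, *s_atts = s.split(",")
            d = dist([float(a) for a in r_atts], [float(a) for a in s_atts])
            yield None, f"{r_id},{s_id},{d}"  # key2 is null; value2 = id1, id2, distance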
Phase-II: After Phase-I, we get an intermediate file. For the input <key, value> pairs of the map function, key is the offset and value is the content of the record in the intermediate file. Then we set id1 as the output's key, and the value is the combination of id2 and the distance. In the reduce function, we get the k nearest neighbours of each testing record, whose file id is 0, based on the distances.
The pseudocode of Phase-II is summarized in Algorithm 2.
Algorithm 2.
Map input: <offset, record>
Map output: <key1, value1> pairs, where key1 is id1 and value1 is a combination of id2 and the distance
1. Take id1 as key1
2. Take the combination of id2 and the distance as value1
3. End
Reduce input: <key1, value1>
Reduce output: <key2, value2> pairs, where key2 is id1 and value2 is the label set
1. For each key1, find its k nearest neighbours
2. Determine the label set according to the voting mechanism
3. Take id1 as key2
4. Take the label set as value2
5. End
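A matching sketch of Algorithm 2 (again our illustration; labels_of, a lookup from a training record's id to its label set, is hypothetical, and the majority-vote rule shown is one common choice rather than one the paper specifies):

import heapq
from collections import Counter

def phase2_map(offset, record):
    # intermediate record "id1,id2,distance" -> <id1, (id2, distance)>
    id1, id2, distance = record.split(",")
    return id1, (id2, float(distance))

def phase2_reduce(id1, values, labels_of, k=5):
    # keep the k nearest training records, then vote on each label
    nearest = heapq.nsmallest(k, values, key=lambda v: v[1])
    votes = Counter(label for id2, _ in nearest for label in labels_of[id2])
    label_set = [label for label, n in votes.items() if n > k / 2]  # majority vote
    return id1, label_set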
5. Results and Analysis
5.1. Cloud Environment and Datasets
We implement our algorithm on the Clustertech Cloud Business Platform (CCBP) with 9 virtual machines (VMs), each of which contains 4 × 2.00GHz dual cores and 4GB RAM. Each VM runs Ubuntu 12.10 (64-bit) with hadoop-0.21.0. One VM serves as the master node and the other VMs act as slave nodes. We provide each slave with 100GB hard drive space and 20GB root space and allocate 5.3GB to DFS. We utilize 4 common multi-label datasets in our experiments, including CAL500, emotions, yeast and scene, as shown in Table 1.
Table 1. 4 Multi-label Datasets

dataset    att_number   label_number
CAL500     68           174
emotions   72           6
yeast      103          14
scene      294          6
Table 2. Average Accuracy Rate of Parallel KNN Join and Serial KNN Join Separately for Four Datasets

dataset    Parallel KNN join   Serial KNN join
emotions   72.66%              72.68%
CAL500     80.33%              80.28%
scene      88.23%              87.46%
yeast      75.54%              75.53%
5.2. Performance Evaluation
We conduct experiments to evaluate the parallel KNN join in clusters of different sizes. For each dataset, 10 experiments are performed for each cluster size and the average accuracy and time spent are calculated, as shown in Table 2 and Table 3 respectively. From Table 2, we can clearly see that the average accuracy rate of our parallel algorithm is almost as high as that of the serial KNN join, which means that our algorithm is feasible and efficient.
From Table 3, we can see that the average time spent is a decreasing function of the cluster size. For example, for emotions, it takes 58.891s, 55.021s and 48.135s for 2 slaves, 4 slaves and 6 slaves, respectively. In order to facilitate analysis, we represent the results in Figure 4.
Figure 4 plots the average running times of the 4 multi-label datasets versus different clusters of 2 slaves, 4 slaves, 6 slaves and 8 slaves respectively. The unit of measure is seconds. With the growth of slaves, the average running times of yeast and emotions linearly decrease. However, for CAL500 and scene, the curve sometimes has a slightly rising trend, the reason being that the time saved by computing is less than the time added by the data communication of the cluster. So a cluster of appropriate size makes sense.
Table 3. Average Running Times that Four Datasets Spent in 4 Clusters of 2 Slaves, 4 Slaves, 6 Slaves, 8 Slaves Respectively

dataset    2 slaves    4 slaves    6 slaves    8 slaves
emotions   58.891s     55.021s     48.135s     46.346s
CAL500     56.318s     48.809s     49.616s     52.242s
scene      133.977s    122.001s    183.749s    123.17s
yeast      123.687s    118.746s    114.17s     108.115s
Figure 4. Average Running Times of 4 Multi-label Datasets versus Different Clusters of 2 Slaves, 4 Slaves, 6 Slaves, 8 Slaves Respectively
Figure 5. Running Time of Phase-I and Phase-II versus 4 Datasets in Cluster of 2 Slaves

Figure 6. Running Time of Phase-I and Phase-II versus 4 Datasets in Cluster of 4 Slaves

Figure 7. Running Time of Phase-I and Phase-II versus 4 Datasets in Cluster of 6 Slaves

Figure 8. Running Time of Phase-I and Phase-II versus 4 Datasets in Cluster of 8 Slaves
To clearly show the time spent in the two essential phases of our MapReduce based KNN algorithm, we further investigate the time spent in Phase-I and Phase-II, respectively. Figures 5, 6, 7 and 8 show the running time of Phase-I and Phase-II versus the 4 datasets in clusters of 2 slaves, 4 slaves, 6 slaves and 8 slaves, respectively. We can see that Phase-I is the most time consuming phase and that it decreases with the increase of cluster size. For example, for
dataset scene, when there are 2 slaves, 174.045s (78.4% of the total time) is spent on Phase-I, and this decreases to 96.484s when the cluster size becomes 4.
6. Conclusion
In KNN join, the most time-consuming step is the calculation of the distance between instances. With the increase of data, a serial program cannot meet our requirements on timeliness. In this paper, we design and implement a parallel KNN join using MapReduce for big data multi-label classification. From the results above, we can see that the average accuracy of our parallel algorithm is almost as high as that of the conventional serial KNN join, and that the average time spent is a decreasing function of the cluster size. Therefore, our method is proved to be an efficient and feasible solution to multi-label classification dealing with big data.
Acknowledgements
This paper is supported by the Natural Science Foundation of China (No. 61272470 and No. 61203307), the Provincial Natural Science Foundation of Hubei (No. 2012FFB04101) and the Fundamental Research Funds for National University, China University of Geosciences (Wuhan).
References
[1] Wan S, Mak MW, Kung SY. Adaptive thresholding for multi-label SVM classification with application to protein subcellular localization prediction. Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE. 2013: 3547-3551.
[2] Tsoumakas G, Katakis I. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM). 2007; 3(3): 1-13.
[3] Zhang ML, Zhou ZH. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition. 2007; 40(7): 2038-2048.
[4] Zhang ML, Zhou ZH. A k-nearest neighbor based algorithm for multi-label classification. Granular Computing, IEEE International Conference on. IEEE. 2005; 2: 718-721.
[5] Böhm C, Krebs F. The k-nearest neighbour join: Turbo charging the KDD process. Knowledge and Information Systems. 2004; 6(6): 728-749.
[6] Kavraki LE, Plaku E. Distributed Computation of the knn Graph for Large High-Dimensional Point Sets.
[7] Xia C, Lu H, Ooi BC, et al. Gorder: an efficient method for KNN join processing. Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30. VLDB Endowment. 2004: 756-767.
[8] Yao B, Li F, Kumar P. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. Data Engineering (ICDE), 2010 IEEE 26th International Conference on. IEEE. 2010: 4-15.
[9] Malarvizhi P, Pujeri RV. Multilabel Classification of Documents with Mapreduce. International Journal of Engineering & Technology (0975-4024). 2013; 5(2).
[10] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Communications of the ACM. 2008; 51(1): 107-113.
[11] Zhang C, Li F, Jestes J. Efficient parallel kNN joins for large data in MapReduce. Proceedings of the 15th International Conference on Extending Database Technology. ACM. 2012: 38-49.
[12] Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C. Evaluating MapReduce for Multi-core and Multiprocessor Systems. Proc. of 13th Int. Symposium on High-Performance Computer Architecture (HPCA). Phoenix, AZ. 2007.
[13] Lämmel R. Google's MapReduce programming model Revisited. Science of Computer Programming. 2008; 70(1): 1-30.
[14] Zhao W, Ma H, He Q. Parallel k-means clustering based on MapReduce. Cloud Computing. Springer Berlin Heidelberg. 2009: 674-679.
[15] Liang Q, Wang Z, Fan Y, et al. Multi-label Classification based on Particle Swarm Algorithm. Mobile Ad-hoc and Sensor Networks (MSN), IEEE Ninth International Conference on. IEEE. 2013: 421-424.
[16] Fayed HA, Atiya AF. A Novel Template Reduction Approach for the k-Nearest Neighbor Method. Neural Networks, IEEE Transactions on. 2009; 20(5): 890-896.