International Journal of Electrical and Computer Engineering (IJECE)
Vol. 6, No. 5, October 2016, pp. 2454~2461
ISSN: 2088-8708, DOI: 10.11591/ijece.v6i5.10853
Journal homepage: http://iaesjournal.com/online/index.php/IJECE
An Approach of Semantic
Similarity Measure between
Documents Based on Big Data
Mohammed Erritali*, Abderrahim Beni-Hssane**, Marouane Birjali**, Youness Madani*
* TIAD Laboratory, Department of Computer Sciences, University of Sultan Moulay Slimane, Faculty of Sciences and Technologies, Béni Mellal, Morocco
** LAROSERI Laboratory, Department of Computer Sciences, University of Chouaib Doukkali, Faculty of Sciences, El Jadida, Morocco
Article Info
Article history:
Received Apr 13, 2016
Revised Jun 29, 2016
Accepted Aug 11, 2016

ABSTRACT
Semantic indexing and document similarity are important information retrieval problems in Big Data, with broad applications. In this paper, we investigate the MapReduce programming model as a framework for managing distributed processing over a large number of documents. We then study the state of the art of the different approaches for computing the similarity of documents. Finally, we propose our approach to semantic similarity measures, using WordNet as an external semantic network resource. For evaluation, we compare the proposed approach with previously presented approaches by using our new MapReduce algorithm. Experimental results reveal that our proposed approach outperforms the state of the art ones on running time performance and increases the measurement of semantic similarity.
Keyword:
Big Data
Document Similarity
Hadoop cluster
MapReduce programming model
Semantic Measure
WordNet
Copyright © 2016 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Mohammed Erritali,
TIAD Laboratory, Faculty of Sciences and Technologies,
Department of Computer Sciences, University of Sultan Moulay Slimane,
Béni Mellal, Morocco
Email: m.erritali@usms.ma
1. INTRODUCTION
Since the rise of computer science, the volume of stored textual information has continued to increase with the development of information technologies. These technologies have enabled an exponential increase in the volume of data through online content such as blogs, posts, social networking and site interactions. According to one estimate, 2.5 trillion bytes of data are created every day, an amount so large that 90% of the data in the world was created in the last 2 years [1]. This rapid increase in the volume of textual information has created the problem of how to find the relevant documents that interest us in this mass of textual information. To overcome this problem, a whole discipline was born: Information Retrieval (IR) of documents in a Big Data environment.
With the incessant increase of these documents, it has become difficult to manage and exploit them. This difficulty is closely related to the semantic aspect of these documents. Manual processing is possible and gives good results, but it is not feasible with a large corpus. Many applications use similarity detection technology, such as similarity recommendation [2], copy detection [3], social network mining [4] and so on. How to quickly detect similar documents has therefore become a basic and important problem. Document similarity computation is an important research topic in information retrieval, and it is a key issue for automatic document categorization and clustering analysis. At present, research aims mainly to improve accuracy and efficiency with approaches such as the method based on the vector space model [5] and the method based on the Map-Reduce model [6]. In the context of our work, we need to design a new parallel algorithm based on the MapReduce programming model to improve both the value of the semantic similarity, using WordNet, and the running time performance.
In this paper, we are basically interested in the document indexing phase, in which each document is represented by an intermediate representation. This representation is directly used by the Information Retrieval System (IRS). It describes the contents of the document by descriptors, which are the significant units in the document. In our context, to find the relevant documents by comparison with a document query, the IRS compares the representation of this query to the representation of each document. This comparison is done by means of a correspondence function (Retrieval Status Value: RSV), and a relevance score is assigned to each document. In most indexing processes, a weight is assigned to each descriptor. This weight determines the discriminating power of the descriptor in the document where it is present.
The majority of the information retrieval approaches in the literature take only simple words and/or fragments of words into account when searching for documents, and ignore the essential idea of the semantic relations between the words. Identifying the similarity between documents, based on the result of the indexing and on the concepts of the semantic measure, is a fundamental phase in our work.
Our contribution in this research paper is to index the request ($I_Q$), index each document ($I_D$), and compare the representation of the request to the representation of each document (RSV). This is formally translated as:

$$I_Q : Q \to E, \qquad q \mapsto I_Q(q) \tag{1}$$

$$I_D : D \to E, \qquad d \mapsto I_D(d) \tag{2}$$

$$RSV : E \times E \to \mathbb{R}, \qquad (I_Q(q), I_D(d)) \mapsto RSV(I_Q(q), I_D(d)) \tag{3}$$

$Q$ is the set of queries, $D$ is the set of documents, and $E$ is the set of descriptors.
The remainder of this paper is structured as follows. Section 2 reviews related work. In section 3, we briefly present the design and implementation of the MapReduce programming model as a framework for managing distributed processing over a large number of documents. We then study in section 4 the state of the art of the similarity measures, in order to compare these measures with our approach proposed in section 5. Section 5 also presents our proposed algorithm based on the MapReduce model, using the vector representation of the documents and one of the existing approaches from section 4. Finally, we present the analysis and simulation of our approach on the Hadoop framework, before presenting the conclusion and the perspectives of this work.
2. RELATED WORK
Many studies on detecting document similarity have been presented in recent years to facilitate the search for information in complex information systems. Kumar et al. [7] and Chowdhury [8] surveyed duplicate and near-duplicate data detection algorithms. Related work on text similarity detection can be mainly classified into two categories: traditional methods and parallel methods.
For the traditional methods, Lyon et al. [9] proposed a tri-gram and set theory-based algorithm, a data fingerprint-based method, which extracts the fingerprint of sentences, maps it into a range of values using a hash or MD5 function, and then reports the similarity according to the overlap ratio of similar values or the maximum common sub-sequence. Matveeva [10] and Hatzivassiloglou et al. [11] presented a Vector Space Model (VSM) algorithm that computes the similarity using the Cosine measure of the vectors. Yih [9] explored different scoring approaches, other than the traditional TF-IDF weight, to study the term weight function. Broder [12] explored a shingles-based algorithm to define the containment of two documents and took the Jaccard coefficient [13] to represent their similarity.
For the parallel methods, most approaches focus on the MapReduce model. Zhang et al. [14] presented a sequence-based method to detect partial similarity of web pages using MapReduce, which consists of two sub-tasks: sentence-level near-duplicate detection and sequence matching. In this work, we also use the MapReduce framework, but we integrate it with some effective features to guarantee running time performance.
3. BIG DATA CHALLENGES BY HADOOP
Hadoop is an open source Apache software framework that processes gigabytes to petabytes of structured or unstructured data and makes it more manageable for applications that work with such large data [15]-[16]. The core components of Hadoop are HDFS and MapReduce. HDFS is used to store large data sets, and MapReduce is used to process them.
3.1. Hadoop Distributed File System (HDFS) architecture
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks [17]. HDFS uses a write-once, read-many model that breaks data into blocks and spreads them across many nodes for fault tolerance and high performance, as shown in Figure 1. HDFS stores file system metadata and application data separately. Metadata is stored on a dedicated server called the NameNode, and application data are stored on other servers called DataNodes.
NameNode: the node that controls the HDFS. It is responsible for serving any component that needs access to files on the HDFS. It is also responsible for ensuring fault tolerance on HDFS. Usually, fault tolerance is achieved by replicating the files over different nodes.
DataNode: this node is part of HDFS and holds the files that are put on the HDFS. Usually these nodes also work as TaskTrackers. The JobTracker tries to allocate work to nodes such that file accesses are local, as much as possible.
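To make the block-and-replica idea above concrete, the following is a minimal Python sketch. It is not HDFS code: the function names `split_into_blocks` and `place_replicas`, the round-robin placement, and the sizes used in the example are all assumptions chosen for illustration. Real HDFS uses a rack-aware placement policy that this sketch does not model.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) for each block of a file of file_size bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    This is a deliberate simplification used only to illustrate
    replication for fault tolerance.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    return {
        block: [datanodes[(block + r) % len(datanodes)]
                for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    layout = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], 3)
    for block_id, nodes in layout.items():
        print(block_id, blocks[block_id], nodes)
```

With every block held by several DataNodes, the loss of a single node leaves at least one copy of each block readable, which is the property the NameNode relies on.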
Figure 1. The interaction between HDFS and MapReduce job
At the level of the NameNode, the JobTracker is responsible for resource management, that is, the control of the DataNodes in the cluster. It manages the entire life cycle of a job. The TaskTracker has simpler responsibilities, namely to launch the tasks in the order provided by the JobTracker and to periodically report the progress of each task to the JobTracker.
3.2. MapReduce programming model
MapReduce is a programming model and an associated implementation for processing and managing large data sets with a parallel, distributed algorithm on a cluster. A MapReduce job is divided into three parts: Map, Shuffle and Sort, and Reduce. The Map part of a MapReduce job splits the input data set into independent chunks, which are processed in a completely parallel manner by Map tasks. The Reduce function then merges these values to form a possibly smaller set of values. That is, the Reduce function filters the Map output and produces the results with respect to the key of the Map phase [18].
$$\text{Map}: (doc, docText) \to (docID, term)$$

$$\text{Reduce}: (docID, [(term, weight), (term, weight), \ldots]) \to (docID, Sim)$$

$Sim$: is the semantic similarity between the document and the query.
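The Map, Shuffle and Sort, and Reduce signatures above can be illustrated with a small single-process Python sketch. It does not use Hadoop. The function names are hypothetical, and the score computed in `reduce_phase` is only a placeholder (the fraction of document terms found in the query), standing in for the semantic similarity Sim.

```python
from collections import defaultdict


def map_phase(doc_id, doc_text):
    """Map: (docID, docText) -> list of (docID, term) pairs."""
    return [(doc_id, term) for term in doc_text.lower().split()]


def shuffle(pairs):
    """Shuffle and sort: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(doc_id, terms, query_terms):
    """Reduce: (docID, [term, ...]) -> (docID, score).

    Placeholder score: share of the document's terms that occur in the
    query. A semantic measure would be plugged in here instead.
    """
    hits = sum(1 for term in terms if term in query_terms)
    return doc_id, hits / len(terms)


def run_job(corpus, query):
    """Run the three phases sequentially over a {docID: text} corpus."""
    query_terms = set(query.lower().split())
    pairs = [pair for doc_id, text in corpus.items()
             for pair in map_phase(doc_id, text)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(doc_id, terms, query_terms)
                for doc_id, terms in grouped.items())


if __name__ == "__main__":
    corpus = {"d1": "big data hadoop", "d2": "semantic similarity"}
    print(run_job(corpus, "hadoop data"))
```

On a real cluster the map calls run in parallel on different chunks, and the framework performs the shuffle; only the per-key logic of the reducer changes when a different similarity measure is used.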
Our proposed algorithm runs in two consecutive MapReduce phases: the first builds the index, and the second computes the semantic similarity measures, as shown in our MapReduce design in Figure 2.
Figure 2. The process of our MapReduce model
Document indexing: given a corpus, for each term of a document, the mapper emits the document ID as the key and the document's words as the value. The shuffle phase of MapReduce groups these words into one collection of values per document and delivers these lists to the reducers, which write them to blocks.
Semantic similarity measures: in this step, Reduce takes the output of the Map function and computes the semantic relation between the collection of values of each document and the query. This semantic relation is computed using WordNet as an external semantic network [19], together with the weights of the words and one of the existing approaches for computing the semantic similarity between two concepts.
4. SEMANTIC SIMILARITY MEASURES
In this section, we review existing semantic similarity measures, whose purpose is to evaluate the semantic proximity between documents. We present different approaches to measuring the similarity between words or documents. There are three main families of approaches in the literature on semantic measurement between documents: approaches based on the arcs, approaches based on the nodes, and hybrid approaches.
4.1. Approaches based on the Arcs
The majority of similarity measures between concepts in an ontology are based on their distances [20]. Whether a concept X is more similar to a concept Y than to a concept Z is evaluated by the distance that separates the concepts in the ontology. These measures make use of the hierarchical structure of the ontology to determine the semantic similarity between the concepts.
Rada et al. measure: This measure [21] is used in semantic networks and is based on the fact that we can compute the similarity from the hierarchical (generalization) "is-a" links. To compute the similarity of two concepts in an ontology, we must calculate the minimum number of arcs that separate them. This measure is therefore based on the computation of the distance between the nodes along the shortest path. With this measure, the similarity between the concept $c_1$ and the concept $c_2$ is given by the formula:

$$Sim(c_1, c_2) = \frac{1}{1 + dist(c_1, c_2)} \tag{4}$$
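As an illustration of formula (4), the sketch below computes the shortest "is-a" path with a breadth-first search and converts it into a similarity. The taxonomy `IS_A` is a hypothetical toy hierarchy invented for this example, not data from WordNet, and the function names are assumptions.

```python
from collections import deque

# Hypothetical toy "is-a" taxonomy (child -> parent); not taken from WordNet.
IS_A = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
}


def neighbours(concept):
    """Concepts one arc away (children and parent) in the taxonomy."""
    result = [child for child, parent in IS_A.items() if parent == concept]
    if concept in IS_A:
        result.append(IS_A[concept])
    return result


def dist(c1, c2):
    """Minimum number of arcs separating c1 and c2 (breadth-first search)."""
    seen, queue = {c1}, deque([(c1, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == c2:
            return depth
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    raise ValueError("concepts are not connected")


def sim_rada(c1, c2):
    """Formula (4): Sim = 1 / (1 + dist)."""
    return 1.0 / (1.0 + dist(c1, c2))


if __name__ == "__main__":
    print(sim_rada("dog", "wolf"), sim_rada("dog", "cat"))
```

In this toy hierarchy "dog" and "wolf" are two arcs apart while "dog" and "cat" are four arcs apart, so the first pair receives the higher similarity.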
Wu and Palmer measure: The principle of this measure is as follows. We are given an ontology formed by a set of nodes and a root node (R). X and Y represent two ontology elements for which we want to compute the similarity. The similarity measure is based on the distances (N1 and N2) that separate the nodes X and Y from the root node R, and on the distance (N) that separates the Subsuming Concept (SC) of X and Y from the root node R.
The Wu and Palmer measure is defined by this formula:

$$Sim(X, Y) = \frac{2N}{N_1 + N_2} \tag{5}$$
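The Wu and Palmer measure can be sketched in a few lines of Python, assuming the usual form Sim = 2N / (N1 + N2), where N1 and N2 are the depths of X and Y below the root and N is the depth of their subsuming concept. The `PARENT` hierarchy is again a hypothetical toy tree, and the helper names are illustrative.

```python
# Hypothetical toy taxonomy (child -> parent); "animal" is the root R.
PARENT = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
}


def path_to_root(concept):
    """List of nodes from the concept up to the root, inclusive."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def subsuming_concept(x, y):
    """Deepest concept that is an ancestor of (or equal to) both x and y."""
    ancestors_x = set(path_to_root(x))
    return next(c for c in path_to_root(y) if c in ancestors_x)


def sim_wu_palmer(x, y):
    """Sim = 2N / (N1 + N2), with depths counted in arcs from the root."""
    n1 = len(path_to_root(x)) - 1
    n2 = len(path_to_root(y)) - 1
    n = len(path_to_root(subsuming_concept(x, y))) - 1
    if n1 + n2 == 0:  # both concepts are the root itself
        return 1.0
    return 2.0 * n / (n1 + n2)


if __name__ == "__main__":
    print(sim_wu_palmer("dog", "wolf"), sim_wu_palmer("dog", "cat"))
```

Because the subsuming concept of "dog" and "wolf" lies deeper in the tree than that of "dog" and "cat", the first pair scores higher even though all three leaves have the same depth.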
4.2. Approaches Based on the Nodes
These techniques adopt a measure defined in terms of the entropy of information theory [22]:

$$E(c) = -\log P(c) \tag{6}$$

where $P(c)$ is the probability of finding an instance of the concept $c$.
Resnik measure: The notion of Informational Content (IC) was initially introduced by [23], who showed that an object (word) is defined by the number of specified classes and that the semantic similarity between two concepts is measured by the quantity of information they share. The informational content is obtained by computing the frequency of the object in the corpus. The formula of this measure is:

$$Sim(c_1, c_2) = \max\left[E(CS(c_1, c_2))\right] = \max\left[-\log P(CS(c_1, c_2))\right] \tag{7}$$

$CS(c_1, c_2)$: represents the most specific concept (the one that maximizes the similarity value) that subsumes the concepts $c_1$ and $c_2$ in the ontology.
Lin's Measure: Lin defined a similarity measure that differs from that of Resnik, given by this formula:

$$Sim(c_1, c_2) = \frac{2 \log P(CS(c_1, c_2))}{\log P(c_1) + \log P(c_2)} \tag{8}$$
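The node-based measures (6) to (8) can be sketched as follows. The probabilities in `P` and the hierarchy in `PARENT` are made-up values for illustration; in practice P(c) would be estimated from concept frequencies in a corpus. The function names are assumptions.

```python
import math

# Hypothetical toy taxonomy and concept probabilities (illustrative only).
PARENT = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
}
P = {"animal": 1.0, "mammal": 0.5, "canine": 0.2, "feline": 0.2,
     "dog": 0.1, "wolf": 0.1, "cat": 0.1}


def entropy(concept):
    """Formula (6): E(c) = -log P(c)."""
    return -math.log(P[concept])


def ancestors(concept):
    """The concept itself followed by all of its subsumers up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def cs(c1, c2):
    """Most specific common subsumer: the shared ancestor with maximal E."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(common, key=entropy)


def sim_resnik(c1, c2):
    """Formula (7): information content of the most specific subsumer."""
    return entropy(cs(c1, c2))


def sim_lin(c1, c2):
    """Formula (8): 2 log P(CS) / (log P(c1) + log P(c2))."""
    denominator = math.log(P[c1]) + math.log(P[c2])
    if denominator == 0:  # both concepts have probability 1 (the root)
        return 1.0
    return 2 * math.log(P[cs(c1, c2)]) / denominator


if __name__ == "__main__":
    print(sim_resnik("dog", "wolf"), sim_lin("dog", "wolf"))
```

Resnik's value is unbounded and depends only on the subsumer, whereas Lin's value is normalized by the information content of the two concepts and equals 1 for identical concepts.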
4.3. Hybrid Approaches
These techniques are founded on a model that combines the approaches based on the arcs with the informational content, which is regarded as a decision factor.
Jiang and Conrath Measure: To remedy the problem of the Resnik measure, Jiang and Conrath [24] proposed a new formula that combines the entropy (Informational Content) of the most specific common concept with those of the concepts whose similarity we seek. This approach is computed by the following formula:

$$Sim(X, Y) = \frac{1}{distance(X, Y)} \tag{9}$$

The distance between X and Y is computed by the following formula:

$$distance(X, Y) = E(X) + E(Y) - 2E(CS(X, Y)) \tag{10}$$
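Formulas (9) and (10) translate directly into code once the three probabilities are known. The sketch below takes the probabilities as plain arguments, so it makes no assumption about how they are estimated; the function names and the choice to return infinity for a zero distance are illustrative decisions, not part of the original measure.

```python
import math


def information_content(probability):
    """E(c) = -log P(c), as in the node-based approaches."""
    return -math.log(probability)


def jc_distance(p_x, p_y, p_cs):
    """Formula (10): E(X) + E(Y) - 2 E(CS(X, Y))."""
    return (information_content(p_x) + information_content(p_y)
            - 2 * information_content(p_cs))


def sim_jiang_conrath(p_x, p_y, p_cs):
    """Formula (9): 1 / distance; infinite when the distance is zero."""
    d = jc_distance(p_x, p_y, p_cs)
    return math.inf if d <= 1e-12 else 1.0 / d


if __name__ == "__main__":
    # Example: P(X) = P(Y) = 0.1 and P(CS) = 0.2 (made-up values).
    print(jc_distance(0.1, 0.1, 0.2), sim_jiang_conrath(0.1, 0.1, 0.2))
```

The zero-distance case arises when X and Y are the same concept, so an implementation has to decide how to represent that maximal similarity.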
Leacock and Chodorow: Another method, presented by [19], combines the arc counting method with the informational content method. The measure proposed by Leacock and Chodorow is based on the length of the shortest path between two synsets of WordNet. This technique is defined by the formula:

$$Sim(X, Y) = -\log\left(\frac{cd(X, Y)}{2M}\right) \tag{11}$$

$M$ is the length of the longest path that separates the root concept of the ontology from the deepest concept. $cd(X, Y)$ is the length of the shortest path that separates X from Y.
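A direct transcription of formula (11) is shown below as a minimal sketch. The path length `cd` and the maximum depth `max_depth` (M in the text) are passed in as numbers, so no WordNet access is assumed; the function name and the input validation are illustrative choices.

```python
import math


def sim_leacock_chodorow(cd, max_depth):
    """Formula (11): Sim = -log(cd / (2 * M)).

    cd        -- length of the shortest path between the two concepts
    max_depth -- M, the longest root-to-leaf path length in the ontology
    """
    if cd <= 0 or max_depth <= 0:
        # log(0) is undefined, so identical concepts (cd = 0) need a
        # convention of their own, for example counting nodes, not arcs.
        raise ValueError("cd and max_depth must be positive")
    return -math.log(cd / (2.0 * max_depth))


if __name__ == "__main__":
    for path_length in (1, 2, 6):
        print(path_length, sim_leacock_chodorow(path_length, max_depth=3))
```

The score decreases as the path gets longer and reaches zero when the path length equals 2M, the longest possible path through the root.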
4.4. Approaches Based on the Vector Space
These approaches use a characteristic vector, in a multi-dimensional space, to represent each object, and calculate the similarity using the Cosine measure or the Euclidean distance. The similarity between two object vectors is obtained from their internal contents. Here are some approaches mentioned in the literature:
Jaccard Measure: It is defined as the number of common objects divided by the total number of objects minus the number of common objects:

$$Sim(X, Y) = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i x_i y_i} \tag{12}$$
Cosine Measure: It uses the complete vector representation, that is, the frequency of the objects (words). Two documents are similar if their vectors coincide. If two objects are not similar, their vectors form an angle (X, Y) whose cosine represents the similarity value. The formula is defined as the ratio of the scalar product of the vectors x and y to the product of the norms of x and y:

$$Sim(X, Y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}} \tag{13}$$

The Cosine measure thus quantifies the similarity between two vectors as the cosine of the angle between them.
Euclidean Measure: The Euclidean similarity is the inverse of the Euclidean distance increased by 1. The Euclidean distance is defined by the following formula:

$$dist(X, Y) = \sqrt{\sum_i (x_i - y_i)^2} \tag{14}$$

The similarity measure is therefore defined by:

$$Sim(X, Y) = \frac{1}{1 + dist(X, Y)} \tag{15}$$
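The three vector-space measures can be compared side by side with a short Python sketch. It assumes the vectors are equal-length lists of term frequencies and uses the squared-sum form of the Jaccard coefficient; the function names and the handling of all-zero vectors are illustrative choices.

```python
import math


def jaccard(x, y):
    """Formula (12): dot / (|x|^2 + |y|^2 - dot)."""
    dot = sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denominator if denominator else 0.0


def cosine(x, y):
    """Formula (13): dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0


def euclidean_similarity(x, y):
    """Formulas (14) and (15): 1 / (1 + Euclidean distance)."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + distance)


if __name__ == "__main__":
    doc_a = [1, 1, 0]  # term frequencies over a shared 3-term vocabulary
    doc_b = [1, 0, 1]
    print(jaccard(doc_a, doc_b), cosine(doc_a, doc_b),
          euclidean_similarity(doc_a, doc_b))
```

All three measures look only at matching vector positions, so two documents that use different but synonymous words receive no credit; this is the gap that the semantic measures are meant to close.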
5. OUR PROPOSED APPROACH
In this section, we present a general and schematic view of the steps of our approach, together with their description, which make up our process for computing the semantic similarity measures between documents. Our work proposes a new approach to compute the semantic similarity between documents by applying a hybrid approach based on the approaches described in the previous section and on the vector representation of documents. The general objective is to search, in a large corpus stored in HDFS, for the documents most relevant to a user request that itself takes the form of a document.
Figure 3 shows the steps used to compute the similarity between a query and a document when applying our approach. Our approach implements a document indexing step that represents each document by the words that compose it. This step has several sub-steps, namely tokenization, stemming, lowercasing, stopword removal, and so on. This vector representation is then enriched with a semantic network based on the synonyms of the words, using WordNet: for each word of the document that has associated synonyms in a WordNet synset, we add the weight of these synonyms. The returned documents are sorted by decreasing order of semantic similarity.
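The indexing sub-steps listed above can be sketched as follows. This is a minimal illustration using only the Python standard library: the stopword list is a tiny hypothetical sample, and stemming and the WordNet synonym enrichment are deliberately left out because they need external resources.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def index_document(text):
    """Return a term-frequency vector (term -> count) for one document.

    Steps: lowercase the text, tokenize on alphabetic runs (which also
    drops punctuation), then remove stopwords. Stemming is omitted here.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in STOPWORDS)


if __name__ == "__main__":
    index = index_document(
        "The Hadoop cluster stores the data; Hadoop processes data.")
    print(index.most_common())
```

The resulting counts play the role of the word weights that the similarity computation uses in the reduce step.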
5.1. Our proposed MapReduce algorithm
Our proposed MapReduce algorithm runs in two consecutive MapReduce phases: the first builds the index, and the second computes the similarity measure.
Indexing of documents: given a corpus, for each term of a document, the mapper emits the document ID as the key and the document's words as the value.
Semantic similarity measure: in this step, Reduce takes the output of the Map function and performs the semantic indexing of all words using WordNet, in order to compute the similarity between the collection of values of each document and the query. This similarity measure is computed by our algorithm using the weights of the words and one of the existing approaches presented in section 4.
Figure 3. The process of our methodology to compute the semantic similarity measure
This work can be expressed using the following MapReduce algorithm:
Class Mapper
    Method Map(Docid, term)
        // This map method is called once per input line; map tasks are run in
        // parallel over subsets of the input files.
        RemoveStopWord & removePunctuation & ComposedWord(term)
        // The value contains an entire line from the file. We tokenize the line
        // using StringUtils.
        For each element ∈ (Docid, term)
            Write(Docid, term)
            // For each word, the map outputs the document ID as the key and
            // the word as the value.
        End for

Class Reducer
    Method Reduce(Docid, List(term))
        // The reduce method is called once per unique map output key. The
        // Iterable allows iteration over all the values emitted for that key.
        List(q) = indexing(Query)        // our semantic indexing method
        S = 0
        X ← 0
        Y ← 0
        For each n ∈ List(term)
            // Iterate over the terms collected for this document ID.
            F = calculateoccurence(n)    // number of occurrences of n in the document
            For each e ∈ List(q)
                // Iterate over the indexed terms of the query.
                R = calculateoccurence(e)    // number of occurrences of e in the query
                X ← X + F × R × Sim(n, e)
                Y ← Y + F × R
            End for
        End for
        S ← X / Y
        Write(Docid, S)
        // Our reduce outputs the document ID and the semantic similarity
        // measure of each document.
End
Our approach is a hybrid: we use the vector representation of the documents together with one of the existing approaches from section 4. Our approach is given by the following formula:

$$Sim(q, d) = \frac{\sum_i \sum_j q_i \, d_j \, Sim(i, j)}{\sum_i \sum_j q_i \, d_j} \tag{16}$$

$i$: represents the concepts of the query $q$
$j$: represents the concepts of the document $d$
$q_i$: is the frequency of the concept $i$ in the query $q$
$d_j$: is the frequency of the concept $j$ in the document $d$
$Sim(i, j)$: is the semantic similarity between the two concepts $i$ and $j$ using WordNet.
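Formula (16) is a frequency-weighted average of concept-to-concept similarities, which can be sketched as below. The concept similarity is passed in as a function, so any measure from section 4 could be plugged in; `toy_concept_sim` is a hypothetical stand-in used only to make the example runnable and is not a WordNet-based measure.

```python
def document_similarity(query_freq, doc_freq, concept_sim):
    """Formula (16): weighted average of concept_sim(i, j) over all pairs.

    query_freq  -- dict mapping each query concept i to its frequency q_i
    doc_freq    -- dict mapping each document concept j to its frequency d_j
    concept_sim -- function (i, j) -> similarity of the two concepts
    """
    numerator = 0.0    # X in the reducer pseudocode
    denominator = 0.0  # Y in the reducer pseudocode
    for i, q_i in query_freq.items():
        for j, d_j in doc_freq.items():
            numerator += q_i * d_j * concept_sim(i, j)
            denominator += q_i * d_j
    return numerator / denominator if denominator else 0.0


def toy_concept_sim(i, j):
    """Hypothetical stand-in for a WordNet-based concept similarity."""
    if i == j:
        return 1.0
    if {i, j} == {"car", "automobile"}:
        return 0.5
    return 0.0


if __name__ == "__main__":
    query = {"car": 1}
    document = {"automobile": 2, "road": 1}
    print(document_similarity(query, document, toy_concept_sim))
```

Because the denominator sums all frequency products, the result stays within the range of the underlying concept similarity, and frequent terms contribute proportionally more to the score.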
6. RESULTS AND ANALYSIS
The choice of one approach among those presented in section 4 is very important in our proposed approach, because it plays a central role in the search. To compare these approaches, we compute the semantic similarity measure on the same set of documents with each of them and choose the approach that gives the best results. Table 1 shows the similarity computed between the same documents.
Table 1. Comparison of running time and similarity measure of different approaches
Approaches              Similarity measure    Running time (msec)
Leacock and Chodorow    0.14                  1016
Wu and Palmer           0.11                  1297
Resnik                  0.07                  1360
Jiang Conrath           0.04                  1391
Lin                     0.02                  1344
The results show that the similarity value varies from one approach to another. The Leacock and Chodorow approach stands out because it gives the greatest similarity with the minimum running time. Based on this evaluation, our approach is built on the Leacock and Chodorow approach.
6.1. Evaluation of our approach
In our experiments, we used a physical machine equipped with an Intel Core i3 CPU at 2.27 GHz and 8 GB of memory. For virtualization, Xen Server 6.2 was installed as a type 1 hypervisor, based on the distribution provided by Citrix. Our machine is a virtualized Linux Ubuntu Server 12.04, with 6 GB of memory for the NameNode/DataNodes. During our work we used version 5.1.2 of Cloudera Manager, which unifies installation, configuration and management through a graphical user interface.
Figure 4 shows the semantic similarity measures obtained for the same documents using our approach and the existing approaches from the literature. Our approach provides double the semantic similarity measure of the other approaches, because it combines the Leacock and Chodorow approach with the Cosine method and the vector representation of the semantic relations between the concepts, using the weights of the words and WordNet.
Figure 5 shows the results of applying our approach to compute the semantic similarity between a document and a text corpus containing a variable number of documents. In each case we measure the time needed to compute the similarity between the query and each document of the corpus, in order to return a list of documents that are semantically similar (relevant) to the user request.
Table 2 shows the running time performance of our MapReduce algorithm. It shows how our new algorithm benefits from the parallelization of processing and managing a large number of documents with parallel, distributed tasks on Hadoop.
[Figure 4: chart of the similarity measure (axis from 0 to 0.35) for Leacock and Chodorow, Wu and Palmer, our approach, Resnik, Jiang Conrath, and Lin.]
Figure 4. Comparison between the semantic similarity measure of our approach and other approaches

[Figure 5: chart of running times (msec, axis from 0 to 2500000) against the number of documents (10, 20, 40, 80, 200, 400) for our approach, the Wu and Palmer approach, and the Resnik approach.]
Figure 5. Comparison between performance of running times in our approach and other approaches
Table 2. Evaluation of running time for our MapReduce algorithm
Number of documents    Map      Reduce
10                     1067     36341
20                     2974     40636
60                     4845     35791
100                    16381    172799
200                    32562    406338
400                    10336    1068975
7. CONCLUSION
This paper discussed our approach, based on a new MapReduce algorithm, to compute the similarity between a query and documents stored in HDFS and to find the most pertinent documents. The results indicate that our MapReduce algorithm outperforms the state of the art ones on running time performance and increases the measurement of semantic similarity. Future research involves testing multilingual WordNet resources for documents, using multi-node Hadoop to improve these results, and analyzing the use of the Graphics Processing Unit (GPU). The work is underway and results will be available soon.
REFERENCES
[1] Improving Decision Making in the World of Big Data, http://www.forbes.com/sites/christopherfrank/2012/03/25/improving-decision-making-in-the-world-of-big-data/#7a89869c4b4d
[2] Khadija A. Almohsen, Huda Al-Jobori, “Recommender Systems in Light of Big Data”, International Journal of Electrical and Computer Engineering (IJECE), Vol. 5, No. 6, December 2015, pp. 1553-1563, 2015.
[3] Hoad T. C. and Zobel J., “Methods for Identifying Versioned and Plagiarized Documents,” in JASIST, vol. 54, pp. 203-215, 2003.
[4] Spertus E., Sahami M. and Buyukkokten O., “Evaluating Similarity Measures: A Large-scale Study in the Orkut Social Network,” in proceedings of KDD, 2005.
[5] Mao, E., Wesley, P. and Chu, W. (2007) The Phrase Based Vector Space Model for Automatic Retrieval of FreeDocument Medical Documents. Data & Knowledge Engineering, 1.
[6] He, C.B., Tang, Y. and Tang, F.Y. (2011) Large-Scale Document Similarity Computation Based on Cloud Computing Platform. 2011 6th International Conference on Pervasive Computing and Applications (ICPCA).
[7] Chowdhury A., “Duplicate data detection,” retrieved from http://ir.iit.edu/~abdur/Research/Duplicate.html, 2004.
[8] Lyon C., Malcolm J. and Dickerson B., “Detecting Short Passages of Similar Text in Large Document Collections,” in proceedings of EMNLP, 2001.
[9] Matveeva I., “Document Representation and Multilevel Measures of Document Similarity,” in proceedings of ACL HLT, 2006.
[10] Hatzivassiloglou V., Klavans J.L. and Eskin E., “Detecting Text Similarity over Short Passages: Exploring Linguistic Feature Combinations via Machine Learning,” in proceedings of SIGDAT, 1999.
[11] Yih W., “Learning Term-weighting Functions for Similarity Measures,” in proceedings of EMNLP, 2009.
[12] Broder A. Z., “On the Resemblance and Containment of Documents,” in proceedings of SEQUENCES, 1997.
[13] Suphakit Niwattanakul, Jatsada Singthongchai, Ekkachai Naenudorn and Supachanun Wanapu, “Using of Jaccard Coefficient for Keywords Similarity”, Proceedings of the International MultiConference of Engineers and Computer Scientists 2013 Vol I, IMECS 2013, March 13-15, 2013.
[14] Qi Zhang, Yue Zhang, Hao. Yu and Xuan. Huang, “Efficient Partial-Duplicate Detection Based on Sequence Matching,” in proceedings of SIGIR, 2010.
[15] Hamid Bagheri and Abdusalam Abdullah Shaltooki, “Big Data: Challenges, Opportunities and Cloud Based Solutions”, International Journal of Electrical and Computer Engineering (IJECE), Vol. 5, No. 2, pp. 340-343, April 2015.
[16] Apache Hadoop, http://hadoop.apache.org/
[17] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, “The Hadoop Distributed File System”.
[18] Yenumula B Reddy, “Document Identification with MapReduce Framework,” DATA ANALYTICS: The Third International Conference on Data Analytics, 2014.
[19] François-Régis Chaumartin, “WordNet et son écosystème : un ensemble de ressources linguistiques de large couverture”, https://hal.archives-ouvertes.fr/hal-00611240.
[20] Lee J.H., M.H. Kim and Y.J. Lee, “Information Retrieval Based on Conceptual Distance in IS-A Hierarchy”, Journal of Documentation 49, pp. 188-207, 1993.
[21] Rada R., Mili H., Bicknell E., & Blettner M. (1989). Development and application of a metric on semantic nets. IEEE Transaction on Systems, Man, and Cybernetics, 19(1):17-30.
[22] D. Lin. An Information-Theoretic Definition of Similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML'98). Morgan-Kaufmann: Madison, WI, 1998.
[23] Resnik P., Using information content to evaluate semantic similarity in taxonomy. In Proceedings of 14th International Joint Conference on Artificial Intelligence, Montreal, 1995.
[24] Jiang J. and D. Conrath, Semantic similarity based on corpus statistics and lexical taxonomy, in Proceedings of International Conference on Research in Computational Linguistics, Taiwan, 1997.