TELKOMNIKA Indonesian Journal of Electrical Engineering
Vol. 12, No. 8, August 2014, pp. 6361 ~ 6368
DOI: 10.11591/telkomnika.v12i8.6034
Received February 28, 2014; Revised April 3, 2014; Accepted April 15, 2014
Valuing Semantic Similarity
Abdoulahi Boubacar*¹, Zhendong Niu²
Beijing Institute of Technology, School of Computer Science
*Corresponding author, e-mail: abd_boubacar@yahoo.com¹, zniu@bit.edu.cn²
Abstract
Similarity is a tool widely used in various domains such as DNA sequence analysis, knowledge representation, natural language processing, data mining, information retrieval, and information flow. Computing the semantic similarity between two entities is a non-trivial task, and there are many ways to define it. Some proposed measures combine statistical information with lexical similarity. A measure that performs well in one domain is difficult to apply accurately in another, and a similarity measure may perform better with one language than with another. A word is supposed to be similar not only to itself but also to some of its synonyms in a given context and to some words with common roots. Our approach is designed to perform query matching and to compute semantic relatedness using word occurrences. It performs better than classical measures such as TF-IDF and Cosine. Although it is not a metric, the proposed similarity measure can be used for a wide range of content analysis tasks based on semantic distance, and its efficacy has been demonstrated. The measure is not corpus dependent, so it can directly establish the semantic relatedness of two entities.
Keywords: semantic similarity, semantic relatedness, information retrieval
Copyright © 2014 Institute of Advanced Engineering and Science. All rights reserved.
1. Introduction
Measuring semantic similarity is the objective of many works. Many measures perform well in an evaluation framework for a specific task such as synonymy extraction [1], short text comparison [2], sentence similarity [3], and natural language processing [4]. The challenge is to find a single measure that performs accurately wherever semantic extraction is needed.
The smallest unit of human communication is the "word". Semantic matching certainly cannot work without using words. Can a word represent a unit that expresses a thought? The answer is obviously no. Only a combination of words can really express an object or an idea and, in the same way, give meaning to a word.
The association of words can be studied at a statistical level [5], using frequency estimation to define similarity measures. Corpus-based word similarity measures [6, 7] extract semantics using word frequency in the entire corpus. Words such as stop-words, which appear at the same frequency in almost all documents, make no difference and are not related to any document in particular. A concept-based representation of documents [8] offers an alternative way of indexing. Concepts carry more description than words and represent a unit of knowledge from which semantic extraction is relatively easy.
More description needs more words. For this reason it is very important to use a measure that can compare segments of text [9] instead of a simple metric for words [10]. In order to measure accurately both query matching and the semantic relatedness of documents directly, we had to implement a semantic similarity measure.
Our measure is not based on a corpus and does not estimate the distance between words and entities [11]. It can perform any task from text similarity [12, 13] to concept similarity [14, 15]. It performs better than the Jaccard similarity measure, which is unable to rank documents accurately according to a user query. It cannot be compared to corpus-based similarity measures, which are very poor at measuring semantic relatedness. It is based entirely on occurrences and therefore uses extremely simple operators. Computing query matching or semantic relatedness becomes a very easy task using this measure.
ISSN: 2302-4046
2. Related Work
Semantic similarity is an important and very popular tool for organizing and extracting information. Many approaches to estimating similarity have been developed in various domains. Information retrieval is one of the areas that use similarity measures. The most popular measures in this area are based on the vector model. The simplest vector approach to measuring semantic similarity is the Cosine distance. The Cosine distance can be defined as:
$$\cos(A, B) = \frac{A \cdot B}{|A|\,|B|} \qquad (1)$$
In order to compute the similarity between two documents it is sufficient to consider the cosine value of their term vectors. The Jaccard model is a similar measure based on common words. It can be expressed as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (2)$$
In this equality $|A \cap B|$ represents the number of words common to $A$ and $B$ while $|A \cup B|$ represents the total number of words in both $A$ and $B$. These measures are acceptable for large documents but not very appropriate when one has to deal with short segments of text.
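As an illustration, the two classical measures above can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used in the cited works: texts are tokenized by splitting on whitespace, term vectors hold raw term frequencies, and the function names `cosine_similarity` and `jaccard_similarity` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    # Assumption: whitespace tokenization and raw term frequencies.
    tf_a, tf_b = Counter(text_a.split()), Counter(text_b.split())
    # Dot product only needs the terms present in both vectors.
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Number of distinct common words divided by number of distinct words."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

Note that the Jaccard sketch ignores how often a word occurs, which is one reason it is coarse for short segments of text.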
When comparing two short segments, the Dice similarity measure seems to be more suitable. The Dice similarity measure differs from the Jaccard measure in that the common words are counted twice and the total length is the sum of the two lengths. The sum of the two lengths is indeed equal to or greater than the length of the union of the two texts because of the triangle inequality. The Dice measure can be expressed for two texts $A$ and $B$ as follows:
$$D(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} \qquad (3)$$
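The Dice measure of equation (3) can be sketched as follows. This is only an illustrative sketch: it assumes whitespace tokenization and treats each text as a set of distinct words, and the name `dice_similarity` is hypothetical.

```python
def dice_similarity(text_a: str, text_b: str) -> float:
    """Twice the number of common words over the sum of the two lengths."""
    # Assumption: each text is reduced to its set of distinct words.
    set_a, set_b = set(text_a.split()), set(text_b.split())
    total = len(set_a) + len(set_b)
    return 2 * len(set_a & set_b) / total if total else 0.0
```

For the texts "a b c" and "b c d", two words are common and each text has three words, so the sketch returns 4/6.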
The most popular measure for query matching is TF-IDF. There are many ways to define the term frequency-inverse document frequency. The simplest representation is:
$$\mathrm{tfidf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_{t} \qquad (4)$$
In this equality $\mathrm{tf}_{t,d}$ and $\mathrm{idf}_{t}$ represent respectively the frequency of term $t$ in document $d$ and a function of its frequency in the entire corpus. The idea is to distinguish common terms, which occur in almost every document, from related terms, which are particular to a given document. TF-IDF is not appropriate for computing document similarity. Another measure, combining and extending some vector approaches with Latent Semantic Indexing, has been presented by J. M. Huerta for machine translation [16]. The measure is based on the Cosine distance and uses Singular Value Decomposition to express sentence similarity. For texts shorter than sentences, such as query logs, surface matching is extremely difficult. Such a similarity is nevertheless needed, for example for query suggestion. Yih and Meek [17] have extended the TF-IDF measure by defining a weighting function according to web relevance scores. The measure is certainly applicable to query suggestion but works only if the words are frequently used.
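The basic TF-IDF weighting of equation (4) can be sketched as below. The text only states that the second factor is a function of the term's frequency in the entire corpus; the logarithmic form log(N / df) used here is one common choice and is an assumption, as are the whitespace tokenization and the name `tf_idf`.

```python
import math
from collections import Counter


def tf_idf(term: str, document: str, corpus: list[str]) -> float:
    """Term frequency in the document times an inverse document frequency."""
    tf = Counter(document.split())[term]
    # Number of corpus documents that contain the term at least once.
    df = sum(1 for doc in corpus if term in doc.split())
    if df == 0:
        return 0.0
    # Assumed idf form: log(N / df); other variants exist.
    return tf * math.log(len(corpus) / df)
```

A term that occurs in every document of the corpus receives a weight of zero, which illustrates why the score depends on the corpus and not only on the two texts being compared.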
Web relevance can be applied to define co-occurrences in page-count based similarity measures. Bollegala et al. [18] have reused popular distances such as the Jaccard coefficient to define a page-count based similarity measure. The measure is introduced to determine the semantic similarity distance between two words with a probability function based on the likelihood principle. The semantic similarity between two words is likely to be extracted more accurately by corpus-based measures.
Corpus-based measures have been tested by Mihalcea et al. [19] to determine the semantic relatedness of text segments in order to detect paraphrases. Probability is a way that has been explored by Lin [20] in order to measure commonality between two words. His work determines ways and rules for a probabilistic theoretical definition of similarity. An implementation has been made with a modified Dice measure in a taxonomy. The biomedical domain is an area that uses similarity measures frequently. Pedersen et al. [21] and Lord et al. [22] have investigated similarity measures in medical ontologies.
DNA sequence analysis [23] is an area of active research that continuously experiments with similarity measures. Gene ontology is the groundwork necessary for most DNA-based technology. Ontology-based similarity measures have been implemented for complex concept expressions over DL-Lite knowledge bases by Stuckenschmidt [24] and Hajian et al. [25].
The formalization of concept distance is usually implemented with lattice theory. Zhang et al. [26] have explored a topological approach with formal concept analysis as a basis. The defined distance is used to characterize the concepts. A neighborhood function acts to determine a separation process. Tracking information flow is similar to the paraphrase detection mentioned earlier. The most effective feature to exploit is semantic resemblance. Metzler et al. [27] have investigated similarity measures analyzing the flow of events through a text corpus.
Representing words as vertices and the relationships between them as edges, Minkov and Cohen [28] have applied graph walks to define a semantic similarity measure. An inter-word similarity measure within a corpus has been tested based on the graph walk method. Latent semantic indexing has served as a basis to implement similarity measures for compliance analysis [29].
This large spectrum shows how diverse are the domains in which semantic similarity measures play a key role.
3. Semantic Similarity Measure
Let us consider two texts $A$ and $B$ whose semantic relatedness needs to be measured. For each word common to $A$ and $B$ we count its occurrences in both $A$ and $B$. Let $A \cap B$ denote the sum of the numbers of occurrences of all the common words. Let $A \cup B$ denote the sum of the number of words in $A$ and the number of words in $B$, including any repeated occurrences. We denote by $S$ the similarity measure such that:
$$S(A, B) = \frac{A \cap B}{A \cup B} \qquad (5)$$
The occurrences are counted in both texts $A$ and $B$, and the text length is the sum of the lengths of the two texts; therefore $S$ is symmetric. All the occurrences are taken into account for both texts.
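The definition of equation (5) can be sketched in a few lines of Python. This is a minimal sketch based only on the description above, assuming that words are separated by whitespace; the function name `occurrence_similarity` is hypothetical and is not taken from the paper.

```python
from collections import Counter


def occurrence_similarity(text_a: str, text_b: str) -> float:
    """Occurrences of the common words (in both texts) over total length."""
    occ_a, occ_b = Counter(text_a.split()), Counter(text_b.split())
    common = occ_a.keys() & occ_b.keys()
    # Numerator: for each common word, its occurrences in A plus those in B.
    shared = sum(occ_a[w] + occ_b[w] for w in common)
    # Denominator: number of words in A plus number of words in B,
    # repeated occurrences included.
    total = sum(occ_a.values()) + sum(occ_b.values())
    return shared / total if total else 0.0
```

For the texts "a a b" and "a c", the only common word is "a", with two occurrences in the first text and one in the second, so the numerator is 3, the denominator is 3 + 2 = 5, and the score is 0.6. Swapping the arguments gives the same value, in line with the symmetry noted above.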
3.1. Measuring Semantic Relatedness
Let us consider the following data, which relate documents to words. The frequency of each word in each document is indicated at the corresponding intersection. For example, one document contains a given word once and another word four times.
Table 1. Eight Documents are Presented
For the documents presented in Table 1 we can calculate the semantic relatedness with the similarity measure. Because the measure is symmetric for every pair of documents, it is consequently possible to complete Table 2 by symmetry.
Table 2. The Semantic Relatedness of the Eight Documents using the Proposed Similarity Measure
The measure is not corpus dependent and can therefore measure the semantic relatedness directly. It can be applied to any language as long as words are separated by blank spaces.
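A pairwise relatedness table of the kind described in this section can be sketched as follows. The sketch computes only one triangle of the table and fills the other by symmetry. It assumes whitespace-separated words, and the documents used in the usage note are hypothetical, since they are not the data of Table 1; the names `occurrence_similarity` and `relatedness_table` are also hypothetical.

```python
from collections import Counter


def occurrence_similarity(occ_a: Counter, occ_b: Counter) -> float:
    """Measure of equation (5) computed from two word-occurrence counters."""
    common = occ_a.keys() & occ_b.keys()
    shared = sum(occ_a[w] + occ_b[w] for w in common)
    total = sum(occ_a.values()) + sum(occ_b.values())
    return shared / total if total else 0.0


def relatedness_table(documents: list[str]) -> list[list[float]]:
    """Square table of pairwise scores, completed by symmetry."""
    counts = [Counter(doc.split()) for doc in documents]
    n = len(counts)
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # upper triangle, diagonal included
            score = occurrence_similarity(counts[i], counts[j])
            table[i][j] = score
            table[j][i] = score  # mirrored entry
    return table
```

With the hypothetical documents "a a b", "a c" and "d", the entry for the first two documents is 0.6 in both positions, and every entry involving the third document and another document is 0 because no word is shared.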
3.2. Comparison between the Proposed Measure and the Jaccard Similarity Measure
The proposed measure can be used wherever surface matching is needed. If we observe the two tables, we can clearly see the lack of accuracy in Table 3: very different similarities present the same value. In Table 2 the similarities are all different and proportionality is respected.
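The effect described above can be reproduced on a small example. The sketch below is illustrative only: the three texts are hypothetical and are not the documents of Table 1, whitespace tokenization is assumed, and the function names are not from the paper. It compares a set-based Jaccard score with the occurrence-based measure of equation (5).

```python
from collections import Counter


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Distinct common words over distinct words in both texts."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def occurrence_similarity(text_a: str, text_b: str) -> float:
    """Measure of equation (5): common-word occurrences over total length."""
    occ_a, occ_b = Counter(text_a.split()), Counter(text_b.split())
    shared = sum(occ_a[w] + occ_b[w] for w in occ_a.keys() & occ_b.keys())
    total = sum(occ_a.values()) + sum(occ_b.values())
    return shared / total if total else 0.0


# Hypothetical texts: the two documents share the same vocabulary with the
# reference text but differ in how often the common word "x" occurs.
reference = "x y"
doc_1 = "x y z"
doc_2 = "x x x y z"
```

Both documents obtain the same Jaccard score of 2/3 against the reference, because only the vocabulary is compared. The occurrence-based measure gives 4/5 for the first document and 6/7 for the second, so the two cases are told apart.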
Table 3. The Semantic Relatedness of Documents using the Jaccard Similarity Measure
One pair of documents obtains the highest score with the Jaccard measure. That is not the case with the proposed measure. If we look at the word frequencies in both documents, we note that common words represent almost half of the total number of words. Our measure is therefore more accurate than the Jaccard measure. Unfortunately, we can remark from Table 2 that the highest relatedness score is recorded for a pair of documents even though one of the words concerned is present only once in one of them. As a result, there is a need to balance the relatedness between the two entities. For this reason we derive from the measure of Equation (5) another similarity measure, denoted ∆, defined for all pairs of documents by:
[Equation (6): definition of ∆; not recoverable from the source]
3.3. Comparison between ∆ and the Dice Similarity Measure
Table 4 presents the semantic relatedness for the same data with the ∆ similarity measure. We can remark that the ∆ score has been balanced compared to the corresponding score of the measure of Equation (5). Now we can compare the ∆ measure to the Dice similarity, which is designed to better estimate the relatedness of short segments of text.
Table 4. The Semantic Relatedness using the ∆ Measure
Table 5. The Semantic Relatedness using the Dice Measure
We can remark again that the Dice similarity measure is not accurate. Very different similarities have the same score, and some scores are abnormally high. The measure of Equation (5) and ∆ are more accurate than the Jaccard measure and the Dice measure, and both can be used for extracting semantic relatedness. We will see in the next section that the measure of Equation (5) is more suitable than ∆ for query processing, while the opposite is observed for semantic relatedness. The difference is fairly small in both semantic relatedness and query processing.
3.4. Query Processing using either the Proposed Measure or ∆
Neither the Jaccard nor the Dice similarity measure is able to process a user query. Our measures can be used to process a user query. For that purpose we need to consider the query as a document and measure its relatedness to the documents. Let us consider the previous data and a query made of three words with occurrence counts 1, 2 and 1, so that one word is repeated twice in the query. The relevance of the query to the documents is presented in Table 7 for the measure of Equation (5), Table 8 for the Cosine measure, and Table 9 for the ∆ measure.
Table 6. The Query and the Documents
We note here a proportionality between the Cosine similarity and the proposed measure in almost all cases. One relevance score is 0.53 while another is 0.90. With the Cosine similarity measure, the most relevant document to the query is not the same as with the proposed measure. Using the ∆ similarity measure we obtain a similar difference, because ∆ and the measure of Equation (5) are proportional.
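Query processing as described in this section, where the query is treated as a document and compared with every document, can be sketched as follows. The sketch uses the measure of equation (5) and assumes whitespace-separated words. The query and documents in the usage note are hypothetical and are not the data of Table 6, and the function names are not from the paper.

```python
from collections import Counter


def occurrence_similarity(text_a: str, text_b: str) -> float:
    """Measure of equation (5): common-word occurrences over total length."""
    occ_a, occ_b = Counter(text_a.split()), Counter(text_b.split())
    shared = sum(occ_a[w] + occ_b[w] for w in occ_a.keys() & occ_b.keys())
    total = sum(occ_a.values()) + sum(occ_b.values())
    return shared / total if total else 0.0


def rank_documents(query: str, documents: list[str]) -> list[tuple[int, float]]:
    """Return (document index, score) pairs, most relevant first."""
    scores = [(i, occurrence_similarity(query, doc))
              for i, doc in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With the hypothetical query "a b b c", in which one word is repeated twice, and the documents "a b", "c d d" and "b b c", the scores are 5/6, 2/7 and 6/7 respectively, so the third document is ranked first, followed by the first and then the second.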
4. Discussion
This study was motivated by the fact that we were looking for an appropriate measure for queries and query expansion in a concept-based information retrieval system. The semantic relatedness of concepts is the key feature for query expansion in the model we are implementing. We have represented concepts as vertices and their relatedness as edges. We have already developed a method to extract concepts from user queries and documents.
It has been easy to measure the similarity between a query and each of the concepts with Apache Lucene, which uses tf-idf as its similarity measure. For that task, it was sufficient to consider the entire collection of concepts as a corpus. Unfortunately the tf-idf measure cannot accurately measure the semantic relatedness of two texts.
In order to measure the semantic relatedness between each pair of concepts we were obliged to choose the Dice similarity measure. The Dice measure can effectively compare two texts. The choice is justified by the fact that we need to measure directly the relatedness of two concepts. A corpus-based measure is not suitable for this task.
As we have shown in our examples, even though the Dice measure can solve the problem, it remains inaccurate. We had to compute the relatedness between documents and concepts using the Cosine similarity measure for the sake of accuracy. In order to compute the relatedness of a query to the documents, we have to compute the path between them through the concept nodes which link them. It appeared very uncomfortable to sum scores expressed with three different similarity measures (tf-idf, Dice, and Cosine). Above all, we were not satisfied because the tf-idf measure cannot express directly how a query is related to a concept and the Cosine measure cannot express exactly how a document is related to a concept. In addition, the Dice measure supposes that the words appear at least once in each concept. That is true, and the Dice measure (3) is better than the Jaccard measure, but it does not indicate the exact degree of relatedness.
Our goal was to find an appropriate measure which can compute the three semantic similarities, in order to compute the sum and express the path. We have achieved that goal using new semantic similarity measures.
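The idea of expressing a path from a query to a document through concept nodes with a single measure can be sketched as follows. This is a speculative illustration and not the authors' system: the text only states that scores along the path are summed, so the path structure, the texts in the usage note and the names `occurrence_similarity` and `path_relatedness` are assumptions.

```python
from collections import Counter


def occurrence_similarity(text_a: str, text_b: str) -> float:
    """Measure of equation (5): common-word occurrences over total length."""
    occ_a, occ_b = Counter(text_a.split()), Counter(text_b.split())
    shared = sum(occ_a[w] + occ_b[w] for w in occ_a.keys() & occ_b.keys())
    total = sum(occ_a.values()) + sum(occ_b.values())
    return shared / total if total else 0.0


def path_relatedness(query: str, concepts: list[str], document: str) -> float:
    """Sum of the scores along the path query -> concepts -> document."""
    # Assumption: the path is a simple chain of nodes and every edge
    # (query-concept, concept-concept, concept-document) uses one measure.
    nodes = [query, *concepts, document]
    return sum(occurrence_similarity(a, b) for a, b in zip(nodes, nodes[1:]))
```

For the hypothetical query "a b", a single concept "a b c" and the document "c d", the two edge scores are 0.8 and 0.4, which sum to 1.2. Because every edge is scored with the same measure, the terms of the sum are directly comparable.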
5. Conclusion and Future Work
The measure of Equation (5) and ∆ are as accurate as the Cosine similarity for query matching, and they express the degree of semantic relatedness more accurately. In addition they are not corpus dependent. The weakness of corpus-dependent measures is that they cannot express an absolute value for relevance or semantic relatedness. All the results they provide are corpus dependent. We have shown that our measures are good tools for query matching as well as for semantic relatedness. The particularity of our measures is that they can be used wherever semantic similarity is needed.
By using a single measure, comparison becomes very easy. Both the measure of Equation (5) and ∆ can be used for the same tasks. Our future work is to use them to process user queries, establish the semantic relatedness of concepts, and study concept-based information retrieval.
References
[1] Olivier Ferret. Testing semantic similarity measures for extracting synonyms from a corpus. Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10). Valletta, Malta. European Language Resources Association (ELRA). 2010; 3338-3343.
[2] Mehran Sahami, Timothy D Heilman. A web-based Kernel Function for Measuring the Similarity of Short Text Snippets. WWW '06 Proceedings of the 15th international conference on World Wide Web. ACM New York, NY, USA. 2006; 377-386.
[3] Palakorn Achananuparp, Xiaohua Hu, Xiajiong Shen. The Evaluation of Sentence Similarity Measures. DaWaK '08 Proceedings of the 10th international conference on Data Warehousing and Knowledge Discovery. Springer-Verlag Berlin, Heidelberg. 2008; 305-316.
[4] Angela Schwering. Evaluation of a Semantic Similarity Measure for Natural Language Spatial Relations. Spatial Information Theory. Lecture Notes in Computer Science. 2007; 4736: 116-132.
[5] Egidio Terra, CLA Clarke. Frequency Estimates for Statistical Word Similarity Measures. NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. 2003; 1: 165-172.
[6] Aminul Islam, Diana Inkpen. Semantic Text Similarity Using Corpus-Based Word Similarity and String Similarity. ACM Transactions on Knowledge Discovery from Data (TKDD). 2008; 2(2): Article No. 10.
[7] Aminul Islam, Diana Inkpen, Iluju Kiringa. Applications of corpus-based semantic similarity and word segmentation to database schema matching. The VLDB Journal. 2008; 17(5): 1293-1320.
[8] Anna-Lan Huang, David Milne, Eibe Frank, Ian H Witten. Learning a Concept-based Document Similarity measure. Journal of the American Society for Information Science and Technology. 2012; 63(8): 1593-1608.
[9] Donald Metzler, Susan Dumais, Christopher Meek. Similarity Measures for Short Segments of Text. ECIR'07 Proceedings of the 29th European conference on IR research. Springer-Verlag Berlin, Heidelberg. 2007; 16-27.
[10] Ming Li, Xin Chen, Xin Li, Bin Ma, Paul M.B. Vitanyi. The Similarity Metric. IEEE Transactions on Information Theory. Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Baltimore, Maryland, USA. 2003.
[11] Jay J Jiang, David W Conrath. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Proceedings of the International Conference Research on Computational Linguistics (ROCLING X). Taiwan. 1997.
[12] Alberto Barron-Cedeno, Andreas Eiselt, Paolo Rosso. Monolingual Text Similarity Measures: A Comparison of Models over Wikipedia Articles Revisions. Proceedings of the 7th international Conference on Natural Language ICON. 2009.
[13] Wen-tau Yih, Kristina Toutanova, John C Platt, Christopher Meek. Learning Discriminative Projections for Text Similarity Measures. CoNLL '11 Proceedings of the Fifteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, Stroudsburg, PA, USA. 2011; 247-256.
[14] Wenjie Li, Qiuxiang Xia. A Method of Concept Similarity Computation Based on Semantic Distance. In Procedia Engineering. 2011; 15.
[15] Dolf Trieschnigg, Edgar Meij, Maarten de Rijke, Wessel Kraaij. Measuring Concept Relatedness Using Language Models. SIGIR, ACM. 2008; 823-824.
[16] Juan M Huerta. Vector based Approaches to Semantic Similarity Measures. Advances in Natural Language Processing and Applications, Citeseer. 2008; 163.
[17] Wen-tau Yih, Christopher Meek. Improving Similarity Measures for Short Segments of Text. AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence. AAAI Press. 2007; 2: 1489-1494.
[18] Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka. Measuring Semantic Similarity between Words Using Web Search Engines. Proceedings of the 16th international conference on World Wide Web. New York, NY, USA, ACM. 2007; 757-766.
[19] Rada Mihalcea, Courtney Corley, Carlo Strapparava. Corpus-based and Knowledge-based Measures of Text Semantic Similarity. AAAI'06 Proceedings of the 21st national conference on Artificial intelligence. AAAI Press. 2006; 1: 775-780.
[20] Dekang Lin. An Information Theoretic Definition of Similarity. Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA. 1998; 296-304.
[21] Ted Pedersen, Serguei Pakhomov, Siddharth Patwardhan, Christopher G Chute. Measures of Semantic Similarity and Relatedness in the Medical Domain. Journal of Biomedical Informatics. 2007; 40(3): 288-299.
[22] PW Lord, RD Stevens, CA Goble. Semantic Similarity Measures as tools for Exploring the Gene Ontology. Pacific Symposium on Biocomputing. 2003: 601-612.
[23] Konrad Rieck, Pavel Laskov, Sören Sonnenburg. Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees. Advances in Neural Information Processing Systems. 2007; 19 (NIPS).
[24] Heiner Stuckenschmidt. A Semantic Similarity Measure for Ontology Based Information. FQAS '09 Proceedings of the 8th International Conference on Flexible Query Answering Systems. Springer-Verlag Berlin, Heidelberg. 2009; 406-417.
[25] Behnam Hajian, Tony White. Measuring Semantic Similarity using a Multi-Tree Model. IJCAI 22nd International Joint Conference on Artificial Intelligence. Barcelona. 2011.
[26] Lishi Zhang, Shengzhe Gao, Liyan Qi. Topological Distance Function in Formal Concept Lattice. FSKD '08 Proceedings of the 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery. IEEE Computer Society, Washington, DC, USA. 2008; 05: 570-574.
[27] Donald Metzler, Yaniv Bernstein, W Bruce Croft, Alistair Moffat, Justin Zobel. Similarity Measures for Tracking Information Flow. CIKM '05 Proceedings of the 14th ACM international conference on Information and knowledge management. ACM New York, NY, USA. 2005; 517-524.
[28] Einat Minkov, William W Cohen. Learning graph walk based similarity measures for parsed text. EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA. 2008; 907-916.
[29] Asad Sayeed, Soumitra Sarkar, Yu Deng, Rafah Hosn, Ruchi Mahindru, Nithya Rajamani. Characteristics of document similarity measures for compliance analysis. CIKM '09 Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM New York, NY, USA. 2009; 1207-1216.