TELKOMNIKA, Vol. 11, No. 4, April 2013, pp. 1909~1915
ISSN: 2302-4046  1909

Received January 13, 2013; Revised February 14, 2013; Accepted February 26, 2013
A HowNet-based Semantic Relatedness Kernel for Text Classification
Pei-ying ZHANG
College of Computer and Communication Engineering, China University of Petroleum
QingDao, ShanDong, China
e-mail: 25640521@qq.com
Abstract

The exploitation of a semantic relatedness kernel has always been an appealing subject in the context of text retrieval and information management. Typically, in text classification the documents are represented in the vector space using the bag-of-words (BOW) approach. The BOW approach does not take semantic relatedness information into account. To further improve text classification performance, this paper presents a new semantic-based kernel for the support vector machine algorithm for text classification. The method first uses the CHI method to select document feature vectors, then calculates the feature vector weights using the TF-IDF method, and finally applies the semantic relatedness kernel, which involves semantic similarity computation and semantic relevance computation, to classify the documents using support vector machines. Experimental results show that, compared with the traditional support vector machine algorithm, the proposed algorithm achieves an improved classification F1-measure.
Keywords: semantic relatedness kernel, text classification, semantic similarity computation, semantic relevance computation, support vector machine
Copyright © 2013 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction

With the rapid growth of online information, text classification has become one of the key tools for automatically handling and organizing text information. The key steps in text classification are document representation and classifier training using a corpus of labeled documents. In the commonly used bag-of-words representation method, documents are represented by numeric vectors whose components are weights given to the different words or terms occurring in the document. Despite its ease of use, the bag-of-words representation cannot handle the word synonymy and polysemy problems and does not consider semantic relatedness between words. The lack of semantics in the bag-of-words representation limits the effectiveness of automatic text classification methods.
Semantic relatedness computation, which includes semantic similarity computation and semantic relevance computation, is widely used in many applications such as word sense disambiguation, information retrieval and text clustering. Semantic relatedness computation can be divided into two categories: statistical approaches and approaches based on a semantic ontology dictionary. Statistical approaches compute semantic relatedness using the corpus of the training set. In the absence of external semantic knowledge, corpus-based statistical methods such as Latent Semantic Analysis (LSA) [1] can be applied to alleviate the synonymy problem, but the problem of polysemy still remains, and such methods are sensitive to the training corpus.
L. Lillian [2] used joint entropy and P. Brown et al. [3] used average mutual information to compute the similarity between words. J.H. Lee et al. [4] used the distance between words in WordNet to compute the semantic similarity between English words. Resnik [5] used the largest amount of information of the ancestors of nodes to measure the semantic similarity of two English words. Agirre and Rigau [6] used information such as concept distance, depth and area density of the concept hierarchy tree to compute the semantic similarity between English words. Liu Qun et al. [7] exploited the distance of sememes in the sememe tree to compute the semantic similarity between two words for example-based machine translation. Li Sujian [8] took the semantic relevance of words as the maximum sum of the sememe relevance of the words' concepts; in that work, the relevance of a sememe was computed against the sememes in the sememe extension set of the other word.
The support vector machine (SVM) method is a new and very popular technique for text classification in the machine learning community. However, the traditional SVM algorithm takes into account only the characteristics of the words in a document, such as word frequency information, without considering the semantic information that documents contain, which limits the applications of the SVM algorithm.
In this paper, we present and evaluate a semantically-enriched BOW representation for text classification. We adopt a HowNet-based semantic relatedness measure to build a smoothing matrix and a kernel for semantically adjusting the BOW representation. Section 2 presents some related works. Section 3 gives our semantic mapping, called semantic smoothing, which is equivalent to the definition of a semantic relatedness metric. Section 4 is devoted to a brief presentation of support vector machines, with references to previous works on text classification. In Section 5, experimental results concerning support vector machines are discussed. Section 6 presents a conclusion and different openings for this work.
2. Related Works
During the last decades, a large number of text classification systems have been proposed using a variety of approaches such as support vector machines, boosting algorithms, and term frequency and inverse document frequency. Most of these systems use the bag-of-words model, or vector space representation, with individual words as the basic representative features of the document content. While bag-of-words approaches offer good performance on many machine-learning tasks due to their low computational cost and inherent parallelism, their limitations are also well acknowledged. In particular, the underlying classification scheme is restricted to detecting patterns within the used terminology only, which excludes conceptual patterns as well as any semantically related words.
It is thus possible to gain better results across multiple domains by utilizing an external semantic thesaurus like WordNet, which defines an upper level of relationships among most of the terms in the testing data.
The importance of embedding semantic relatedness between two text segments for text classification was initially highlighted in [9], where semantic similarity between words was used for the semantic smoothing of the TF-IDF vectors. Semantic-aware kernels have been proposed by Mavroeidis et al. [10], who propose a generalized vector space model with WordNet senses and their hypernyms to improve text classification performance. Typically, WordNet is a database for the English language containing a semantic lexicon that organizes words into groups of synsets. Every synset stands for a single word prototype that refers to a group of words that share the same meaning. In addition to making use of the relations in WordNet, features such as part-of-speech tags have been considered in [11]. This motivates the intensive research carried out on this issue, which has given rise to a variety of implemented systems incorporating features derived from the common semantic thesaurus WordNet and its word relations. Strictly speaking, the hierarchical organization of WordNet involves important distinctions between the various parts of speech. Indeed, while the categorization of nouns into underlying taxonomies, headed by a unique beginner such as animate or artifact, is straightforward, this does not extend to verbs, which are rather partitioned into several semantic fields with many overlappings. This discrepancy between verbs and nouns obviously influences the calculus of semantic similarity, especially when dealing with sentences, where word-by-word semantic similarity has usually proven to be non-effective, which in turn negatively influences the performance of retrieval, summarization and classification tasks. This leaves the debate of noun versus verb semantic similarity widely open. Stephan Bloehdorn and Alessandro Moschitti [12] combined syntactic and semantic kernels for text classification.
Roberto Basili, Marco Cammisa and Alessandro Moschitti [13] use a semantic kernel to classify texts with very few training examples. Jamal Abdul Nasir, Asim Karim, George Tsatsaronis and Iraklis Varlamis [14] use a knowledge-based semantic kernel for text classification. Cristianini, N. and Taylor, J.S. [15] use a latent semantic kernel for text categorization. Shoushan Li and Rui Xia [16] propose a framework of feature selection methods for text categorization. J. Blitzer, M. Dredze and F. Pereira [17] use domain adaptation for sentiment classification. J. Brank et al. [18] put forward the interaction of feature selection methods and linear classification models. Literature [19-21] introduces an optimization model and information expectation with cloud computing in China, and proposes a sparse representation method. G. Forman [22] put forward feature selection metrics for text classification. L. Marina et al. [23] use conceptual extraction from ontologies for web classification.

This paper attempts to use HowNet as the Chinese semantic knowledge source; it calculates the similarity between words and the semantic relevance between words, and combines the similarity and the relevance into a support vector machine semantic kernel for text classification.
3. Incorporation of Semantic Relatedness into the Text Metric
Term relatedness is used to design document similarities, which are the critical functions of most text classification algorithms. This section first introduces the semantic relatedness between words based on HowNet, and then gives the description of document similarity kernels which can be incorporated into the definition of a kernel in a support vector machine.
3.1. Semantic Relatedness of Words based on HowNet
Semantic relatedness of words is composed of semantic similarity and semantic relevance. Semantic similarity means that two words can be used interchangeably in different contexts without altering the syntactic structure of the text at the semantic level. In HowNet, not every concept corresponds to a node in a concept hierarchy tree; instead, each concept is described through a series of original meanings, called sememes, using a knowledge description language. These sememes are organized into a hierarchy tree through hypernym and hyponym relations. We use Liu Qun's [7] method to compute the semantic similarity between words.
Unlike a traditional semantic dictionary, HowNet thus describes concepts with a knowledge description language built on sememes, and our goal is to find a way to use this knowledge description language to calculate the similarity of two semantic expressions.
Definition 1: Let two Chinese words be W1 and W2. If W1 has n meaning items S11, S12, ..., S1n and W2 has m meaning items S21, S22, ..., S2m, then the similarity of W1 and W2 is the maximum similarity over their senses, calculated as formula (1):

Sim(W1, W2) = max_{i=1..n, j=1..m} Sim(S1i, S2j)    (1)
Thus, the similarity between words boils down to the similarity between two concepts. Since all concepts are ultimately expressed by sememes (in some places by a specific word), the similarity calculation between sememes is the basis for calculating the similarity of concepts. Since all sememes compose a hierarchy tree based on hypernym-hyponym relationships, a simple calculation uses a semantic distance approach to similarity.
Definition 2: Let the path distance between two sememes p1 and p2 in the sememe hierarchy be d. The similarity between p1 and p2 is calculated as formula (2):

Similarity(p1, p2) = α / (d + α)    (2)

where p1 and p2 represent the two sememes, d is the length of the path between p1 and p2 in the sememe hierarchy (a positive integer), and α is an adjustable parameter.
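As a rough illustration, formulas (1) and (2) can be sketched as follows. The sememe tree, the value of α, and the word-to-sense lists are hypothetical toy data, not HowNet content, and each sense is reduced to a single sememe rather than the full four-part expression introduced later in Definition 3.

```python
# Toy sketch of formulas (1) and (2); tree, alpha and senses are made up.

ALPHA = 1.6  # adjustable parameter alpha of formula (2)

# parent links of a toy sememe hierarchy (root has parent None)
PARENT = {
    "entity": None,
    "animate": "entity", "artifact": "entity",
    "human": "animate", "animal": "animate",
    "tool": "artifact",
}

def path_to_root(sememe):
    path = []
    while sememe is not None:
        path.append(sememe)
        sememe = PARENT[sememe]
    return path

def sememe_distance(p1, p2):
    """Path length d between two sememes, through a common ancestor."""
    up1, up2 = path_to_root(p1), path_to_root(p2)
    common = set(up1) & set(up2)
    return min(up1.index(c) + up2.index(c) for c in common)

def sememe_similarity(p1, p2):
    """Formula (2): Similarity(p1, p2) = alpha / (d + alpha)."""
    return ALPHA / (sememe_distance(p1, p2) + ALPHA)

# hypothetical senses: each word maps to a list of primary sememes
SENSES = {"doctor": ["human"], "hammer": ["tool"]}

def word_similarity(w1, w2):
    """Formula (1): maximum similarity over all sense pairs."""
    return max(sememe_similarity(s1, s2)
               for s1 in SENSES[w1] for s2 in SENSES[w2])
```

For example, `word_similarity("doctor", "hammer")` walks both sememes up to their common ancestor `entity` (distance 4) and returns 1.6 / (4 + 1.6) ≈ 0.29.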
For Chinese text classification, content words are the main basis for classification, so the similarity calculation for notional words is the key.
Definition 3: The semantic expression of a notional concept consists of four parts: (a) the first independent sememe description; the similarity of the two concepts on this part is recorded as Sim1(s1, s2); (b) the other independent sememe descriptions, i.e. all independent sememes (or specific words) of the expression other than the first; the similarity on this part is denoted Sim2(s1, s2); (c) the relational sememe descriptions, i.e. the semantic relationships between the expression and all sememe descriptions of that type; the similarity on this part is denoted Sim3(s1, s2); (d) the symbolic sememe descriptions; the similarity of the two concepts on this part is recorded as Sim4(s1, s2). The similarity between the semantic expressions of two concepts is then defined as formula (3):

Sim(S1, S2) = Σ_{i=1..4} β_i · Π_{j=1..i} Sim_j(S1, S2)    (3)
where the β_i (1 ≤ i ≤ 4) are adjustable parameters satisfying β1 + β2 + β3 + β4 = 1 and β1 ≥ β2 ≥ β3 ≥ β4.
Definition 4: The semantic relevance computation between words follows the method of document [8]; the formula is as follows:

def(c) = { s | REL(c, s) }
Rele(c1, c2) = Rele(def(c1), def(c2)) ≈ max_{s_i ∈ def(c1), s_j ∈ def(c2)} Rele(s_i, s_j)
Rele(s_i, s_j) = ω_s · sim(s_i, s_j) + ω_a · asso(s_i, s_j)    (4)
where Rele(c1, c2) denotes the semantic relevance of two words, def(c) denotes the interpretation sememe set of word c, and ω_s and ω_a weight the similarity and association terms. Readers can consult document [8] for more details.
Definition 5: The semantic relatedness between words can be defined using the semantic similarity and the semantic relevance; the computation is defined as formula (5):

SR(w1, w2) = (1 − γ) × Sim(w1, w2) + γ × Rele(w1, w2)    (5)
In formula (5), SR(w1, w2) is the semantic relatedness between two words w1 and w2, Sim(w1, w2) is the semantic similarity defined by formula (1), and Rele(w1, w2) is the semantic relevance between the words defined by formula (4). γ is a parameter that scales the weights of the two parts.
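A minimal sketch of formula (5), assuming the similarity and relevance scores have already been computed; the scores and the γ value below are hypothetical.

```python
# Sketch of formula (5): SR = (1 - gamma) * Sim + gamma * Rele.

def semantic_relatedness(sim, rele, gamma=0.4):
    """Combine semantic similarity and semantic relevance into SR."""
    return (1.0 - gamma) * sim + gamma * rele

# e.g. a word pair with similarity 0.8 and relevance 0.5
sr = semantic_relatedness(0.8, 0.5, gamma=0.4)  # 0.6*0.8 + 0.4*0.5 = 0.68
```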
3.2. Document Similarity Kernel
Given two documents d1 and d2 ∈ D (the document set), we define their similarity as:

K(d1, d2) = Σ_{w1 ∈ d1, w2 ∈ d2} λ1 · λ2 · SR(w1, w2)    (6)

where λ1 and λ2 are the weights of the words (features) w1 and w2 in the documents d1 and d2, respectively, and SR(w1, w2) is the term similarity function defined by formula (5).
The above document similarity can be used in kernel-based support vector machines if it is a valid kernel function, i.e. if it satisfies Mercer's condition [15]. This condition establishes that the Gram matrix G = K(di, dj) must be positive semi-definite. It has been shown in [15] that the matrix G formed by the kernel function (Equation 6) with the outer matrix product K(d1, d2) is indeed a positive semi-definite matrix.
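A sketch of formula (6) under toy assumptions: the SR lookup table and the word weights (standing in for TF-IDF values) below are hypothetical.

```python
# Sketch of formula (6): K(d1, d2) = sum of lam1 * lam2 * SR(w1, w2)
# over all word pairs; SR values and weights are made-up illustrations.

SR_TABLE = {
    ("car", "car"): 1.0, ("auto", "auto"): 1.0, ("bird", "bird"): 1.0,
    ("car", "auto"): 0.9, ("car", "bird"): 0.1, ("auto", "bird"): 0.1,
}

def sr(w1, w2):
    # symmetric lookup, 0.0 for unknown pairs
    return SR_TABLE.get((w1, w2)) or SR_TABLE.get((w2, w1)) or 0.0

def doc_kernel(d1, d2):
    """Documents are dicts mapping word -> weight (e.g. TF-IDF)."""
    return sum(lam1 * lam2 * sr(w1, w2)
               for w1, lam1 in d1.items()
               for w2, lam2 in d2.items())

d1 = {"car": 0.7, "bird": 0.3}
d2 = {"auto": 1.0}
k = doc_kernel(d1, d2)  # 0.7*0.9 + 0.3*0.1 = 0.66
```

Note that with the plain BOW inner product the two documents would score 0, since they share no terms; the nonzero SR between "car" and "auto" is what makes them similar.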
4. Support Vector Machine
Support vector machines were introduced by Boser, Guyon and Vapnik in their seminal paper; the method maps the input data into a new space using a kernel function centered on the support vectors and then makes a linear separation in the new space. Let us define {(xi, yi), i = 1...l} as the training sample for a binary classification problem. If the vector α and the scalar b are the parameters of the output hyperplane, the SVM decision function f is defined as formula (7):

f(x) = sgn( Σ_{i=1..l} y_i · α_i · K(x_i, x) + b )    (7)
The induction principle used to determine the weights of the output hyperplane (and the support data) is the maximization of the margin between the output hyperplane and the data encoded in the hidden layer of the network. To deal with non-separable data, the margin concept was softened [24] in order to accept some points that lie on the wrong side of the margin frontiers.
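The decision rule of formula (7) can be sketched as below; the one-dimensional support vectors, coefficients and linear kernel are hypothetical stand-ins for a trained model, not output of any actual training.

```python
# Sketch of formula (7): f(x) = sgn(sum_i y_i * alpha_i * K(x_i, x) + b).

def decision(x, support, alphas, labels, b, kernel):
    s = sum(a * y * kernel(xi, x)
            for xi, a, y in zip(support, alphas, labels)) + b
    return 1 if s >= 0 else -1

# toy 1-D model with a linear kernel K(u, v) = u * v
support = [1.0, -1.0]
alphas = [0.5, 0.5]
labels = [+1, -1]
linear = lambda u, v: u * v

label = decision(2.0, support, alphas, labels, b=0.0, kernel=linear)  # +1
```

Swapping `linear` for the semantic kernel of Section 3.2 is all that changes when the smoothed representation is used.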
To implement our approach, we have chosen the radial basis kernel, which usually achieves very good performance with little tuning and which is still a reproducing kernel when a metric is used as the argument of the exponential:

K(x, y) = exp(−γ ‖x − y‖²)    (8)
After semantically smoothing the vectors, we get:

K(x, y) = exp(−γ (x − y)ᵀ · SR · (x − y))    (9)

In formula (9), SR is the matrix whose elements are defined through the semantic relatedness between feature vectors. Other kernels based on the usual definition of similarity between two documents could be used as well, together with the semantic proximity matrix.
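Formulas (8) and (9) can be sketched as follows; the two-feature BOW vectors and the small SR matrix are hypothetical. The only change from the plain RBF kernel is replacing the squared Euclidean norm with the quadratic form (x − y)ᵀ SR (x − y).

```python
# Sketch of formulas (8) and (9); vectors and SR matrix are made up.
import math

GAMMA = 0.5  # RBF width parameter

def quad(diff, M):
    # diff^T M diff for a small dense matrix given as a list of rows
    n = len(diff)
    return sum(diff[i] * M[i][j] * diff[j]
               for i in range(n) for j in range(n))

def rbf_kernel(x, y):
    """Formula (8): exp(-gamma * ||x - y||^2)."""
    d = [a - b for a, b in zip(x, y)]
    return math.exp(-GAMMA * sum(v * v for v in d))

def smoothed_rbf_kernel(x, y, SR):
    """Formula (9): exp(-gamma * (x - y)^T SR (x - y))."""
    d = [a - b for a, b in zip(x, y)]
    return math.exp(-GAMMA * quad(d, SR))

# two documents using different but strongly related terms
x = [1.0, 0.0]          # uses feature "car"
y = [0.0, 1.0]          # uses feature "auto"
SR = [[1.0, 0.9],
      [0.9, 1.0]]       # high relatedness between the two terms

plain = rbf_kernel(x, y)                 # exp(-0.5 * 2.0)
smooth = smoothed_rbf_kernel(x, y, SR)   # exp(-0.5 * 0.2)
```

The smoothed kernel rates the two documents as far more similar (exp(−0.1) ≈ 0.90) than the plain RBF kernel does (exp(−1) ≈ 0.37), because the off-diagonal SR entries credit the relatedness between the two terms.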
5. Experiments and Results
In order to measure and verify the validity of the classification algorithm, we compared the common vector space model with the semantic-relatedness vector space model, computing the feature vector weights using the TF-IDF method and selecting features using the CHI method.
5.1. Performance Measures
To evaluate the performance of the text classification system, we use the standard information retrieval measures: precision, recall and F1-measure. The F1-measure is a kind of average of precision and recall. Precision is defined as the ratio of correct classifications of documents into categories to the total number of attempted classifications, as in formula (10):

precision = true positive / (true positive + false positive)    (10)
Recall is defined as the ratio of correct classifications of documents into categories to the total number of labeled data in the testing set, as in formula (11):

recall = true positive / (true positive + false negative)    (11)
The F1-measure is defined as the harmonic mean of precision and recall. Hence, a good classifier is assumed to have a high F1-measure, which indicates that the classifier performs well with respect to both precision and recall. It is defined by formula (12):

F1-measure = 2 × precision × recall / (precision + recall)    (12)
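Formulas (10)-(12) amount to the following computation from raw counts; the counts in the example are made-up illustrations.

```python
# Sketch of formulas (10)-(12) from true/false positive/negative counts.

def precision(tp, fp):
    """Formula (10): tp / (tp + fp)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Formula (11): tp / (tp + fn)."""
    return tp / (tp + fn)

def f1_measure(p, r):
    """Formula (12): harmonic mean 2*p*r / (p + r)."""
    return 2 * p * r / (p + r)

p = precision(tp=80, fp=20)  # 0.8
r = recall(tp=80, fn=40)     # ~0.667
f1 = f1_measure(p, r)        # ~0.727
```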
5.2. Experiments and Results
To test our proposed system, we compared the proposed method with the CHI*TF-IDF method and the CHI*TF-CRF method using the F1 measurement; the results are listed in Table 1.
Table 1. The Classification Results using the F1 Measure

Class category    CHI+TF-IDF+SVM    CHI+TF-IDF+SVMSR
IT                78.24             83.43
Education         86.14             90.24
Entertainment     85.86             92.41
Culture           79.45             84.12
History           81.26             91.24
Average           82.19             88.29
6. Conclusion
In this paper, we present a semantic relatedness kernel for smoothing the bag-of-words (BOW) representation. The method first uses the CHI method to select document feature vectors and then calculates the feature vector weights using the TF-IDF method; we conduct two experiments using support vector machines to compare the F1 measure of two classification systems, the common text classification method and the semantic relatedness kernel text classification method. We find that the semantic relatedness enhanced representation produces a significant improvement in the F1-measure using a support vector machine classifier.

As a next step, we will extend the BOW representation by incorporating discrimination information for text classification and compare our representation approaches on the text classification task. In the future, we will study the semantic text representation unit used to denote the document feature vector and utilize feature vector representation approaches for text classification.
Acknowledgements

This work is supported by "the Fundamental Research Funds for the Central Universities" of China University of Petroleum (East China). The authors are grateful to the anonymous reviewers who made constructive comments. The first author is the corresponding author.
References

[1] Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA. Indexing by Latent Semantic Analysis. JASIS. 1990; 41(6): 391-407.
[2] Lee LJ. Similarity-based approaches to natural language processing. Harvard University Technical Report TR-11-97. 1997.
[3] Brown P, et al. Word sense disambiguation using statistical methods. Proceedings of the 29th Meeting of the Association for Computational Linguistics (ACL'91). 1991; 201-207.
[4] Lee JH, et al. Information retrieval based on conceptual distance in IS-A hierarchies. Journal of Documentation. 1993.
[5] Resnik P. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research. 1999; (11): 95-130.
[6] Agirre E, Rigau G. A proposal for word sense disambiguation using conceptual distance. International Conference on Recent Advances in Natural Language Processing (RANLP'95). 1995.
[7] Liu Qun, Li Sujian. Word similarity computing based on HowNet. Computational Linguistics and Information Processing. 2002; 7: 59-76.
[8] Li Sujian. Research of relevancy between sentences based on semantic computation. Computer Engineering and Applications. 2002; 38(7): 75-76.
[9] Siolas G, d'Alche-Buc F. Support vector machines based on a semantic kernel for text categorization. Proceedings of IEEE IJCNN. 2000; 205-209.
[10] Mavroeidis D, Tsatsaronis G, Vazirgiannis M, Theobald M, Weikum G. Word sense disambiguation for exploiting hierarchical thesauri in text classification. LNCS (LNAI). 2005; 3721: 181-192.
[11] Padmaraju D, Varma V. Applying lexical semantics to improve text classification. Proceedings of the Second Symposium on Indian Morphology, Phonology and Language Engineering. 2005: 94-98.
[12] Bloehdorn S, Moschitti A. Combined syntactic and semantic kernels for text classification. ECIR 2007, LNCS. 2007: 307-318.
[13] Basili R, Cammisa M, Moschitti A. A semantic kernel to classify texts with very few training examples. Informatica. 2006; (30): 163-172.
[14] Nasir JA, Karim A, Tsatsaronis G, Varlamis I. A knowledge-based semantic kernel for text classification. SPIRE 2011, LNCS 7024. 2011: 261-266.
[15] Cristianini N, Taylor JS, Lodhi H. Latent semantic kernels. Proceedings of the Eighteenth International Conference on Machine Learning. 2001: 66-73.
[16] Li S, Xia R. A framework of feature selection methods for text categorization. Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP. 2009: 692-700.
[17] Blitzer J, Dredze M, Pereira F. Domain adaptation for sentiment classification. Proceedings of ACL-07, the 45th Meeting of the Association for Computational Linguistics.
[18] Brank J, Grobelnik M, Milic-Frayling N, Mladenic D. Interaction of feature selection methods and linear classification models. Workshop on Text Learning held at ICML. 2002.
[19] Seman K, Puspita FM, et al. An improved optimization model of internet charging scheme in multi service networks. Telkomnika. 2012: 592-598.
[20] Yang Z. Information expectation with cloud computing in China. Telkomnika. 2012: 876-882.
[21] Zhang X. Sparse representation for detection of microcalcification clusters. Telkomnika. 2012: 545-550.
[22] Forman G. An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research. 2003; 3(1): 1289-1305.
[23] Marina L, Mark L, Slava K. Classification of web documents using concept extraction from ontologies. Lecture Notes in Computer Science. 2007; 4476: 287-292.
[24] Vapnik V. The nature of statistical learning. Springer. 1995.