TELKOMNIKA, Vol. 12, No. 4, December 2014, pp. 1045~1052
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v12i4.811

Received August 26, 2014; Revised October 14, 2014; Accepted November 2, 2014
Process Improvement of LSA for Semantic Relatedness Computing
Wujian Yang*1, Lianyue Lin2
Department of Computer Science, Zhejiang University City College
No. 48 Huzhou Road, Hangzhou, Zhejiang, China 310000
*Corresponding author, e-mail: yangwj@zucc.edu.cn1, linlydtc@gmail.com2
Abstract
Computing the semantic correlation of Tang poetry is critical in many applications, such as searching, clustering, and the automatic generation of poetry. Aiming to increase the efficiency and accuracy of semantic relatedness computing, we improved the process of latent semantic analysis (LSA). In this paper, we adopted a "words-by-poetry categories" representation of word semantics instead of the "words-by-poems" representation, based on the finding that words with similar distributions over poetry categories are almost always semantically related. We designed an experiment that obtained segmented words from more than 40,000 poems and computed relatedness as the cosine value calculated from the co-occurrence matrix decomposed by the Singular Value Decomposition (SVD) method. The experimental results show that this method analyzes the semantic and emotional relatedness of words in Tang poetry well. We can also find associated words and the relevance of poetry categories by manipulating the decomposed matrices.
Keywords: semantic relatedness, Latent Semantic Analysis, poetry category, singular value decomposition
1. Introduction
Tang poetry, as a kind of Chinese classical literature, has a profound impact on China and even the world. Compared with modern Chinese, ancient poetry, which conveyed all sorts of emotions through refined words, rhythmic syllables, and various figures of speech, has a special syntax. The automatic analysis of ancient poetry, a part of Natural Language Processing (NLP) [1], has long been a hot issue; it involves various essential tasks, e.g., corpus construction [2], word segmentation [3], semantic analysis [4], vector space models [5], and identification of poetry style [6].
The ability to quantify the semantic relatedness of words in poems is an integral part of semantic analysis, and it underlies many fundamental tasks in NLP, including information retrieval, word sense disambiguation, and text clustering. Semantic similarity is a special case of relatedness: as Budanitsky et al. [7] argued, the notion of relatedness is more general than that of similarity, since relatedness subsumes many different kinds of specific relations, including metonymy, antonymy, functional association, and others. In this paper we deal with semantic relatedness.
Computing the semantic relatedness of natural language texts requires encoding vast amounts of world knowledge. Until recently, prior work on linguistic resources pursued two main directions. One is lexical databases such as WordNet [8] and Wikipedia [9], which encode relations between words such as synonymy and hypernymy; the other is large-scale text corpora, which provide statistical evidence for machine learning methods like Latent Semantic Analysis (LSA) [10]. In general modern-language semantic relatedness computing, the least resource-intensive approaches are knowledge-free ones that rely exclusively on the corpus data themselves. Under the corpus-based approach, word relationships are often derived from their co-occurrence distributions in a corpus [11]. With the introduction of machine-readable dictionaries, lexicons, thesauri, and taxonomies, these manually built pseudo-knowledge bases provide a natural framework for organizing words or concepts into a semantic space. Kozima and Furugori [12] measured word distance by adaptive scaling of a vector space generated from LDOCE (Longman Dictionary of Contemporary English). Morris and Hirst [13] used Roget's thesaurus to detect semantic relationships between words. With the recently developed lexical taxonomy WordNet [14], many researchers have taken advantage of this broad-coverage taxonomy to study word/concept relationships [15].
However, the Chinese language used in ancient times was quite different from modern Chinese, as is the nature of its poetry. Because of Tang poetry's special syntax and word limitations, present semantic relatedness computing for Tang poetry is based on large-scale poetry corpora and word co-occurrence. Hu, J. and Yu, S. [16] defined a statistical model to extract contextually similar words from a corpus of 614 million characters of ancient Chinese poetry. Zhou, C. L. [17] computes word semantic relatedness by combining latent semantic analysis (LSA) and mutual information (MI), which is a general method for word relatedness in Chinese poetry.
Latent Semantic Analysis (LSA), an algebraic model of information retrieval, was proposed by S.T. Dumais in 1988. It is a purely statistical technique that leverages word co-occurrence information from a large unlabeled corpus of text. Based on the assumption that each word's meaning in Tang poems can be represented by the words it regularly co-occurs with in large-scale text corpora, LSA does not rely on any human-organized knowledge; rather, it "learns" its representation by applying SVD to the words-by-poems co-occurrence matrix. One can imagine how tremendous the size of this matrix is. To reduce the matrix for efficient and rapid computing, we propose a process improvement that builds a "words-by-poetry categories" co-occurrence matrix instead. In this paper, we mainly discuss LSA and propose our process improvement based on the previous method.
The contributions of this paper are threefold. First, we propose to classify the poems by emotion and then build a "words-by-poetry categories" co-occurrence matrix; specifically, we introduce an improved LSA method for computing the semantic relatedness of words in Tang poems. Second, we construct a matrix that represents the poems in the corpus with a much smaller size than the previous method, and present specific methods for applying Singular Value Decomposition (SVD) efficiently and rapidly. Finally, we propose applications of the results computed by this improved method.
2. The Process Improvement of the Relatedness Computing Method
Studying the previous method of computing relatedness with LSA, we can find that it uses information between words and poems. Although the association of words and poems can be measured in this way, complicated statistics show up in the following aspects.
Firstly, it needs a large-scale, non-repetitive text corpus, and the workload of collection and entry is heavy. Secondly, because each row vector is a word vector, the preparatory work of segmentation and statistics is complicated: words must be split and counted for each poem in the corpus individually, and the frequency with which each word appears in each poem must be weighted as an element of the vector. Thirdly, there is a mass of words, and the number of poems is vast as well, which makes the matrix huge and strongly impacts the efficiency of operations. Finally, it cannot prevent the zero angles between vectors caused by the sparseness problem, and the number of sparse words is still large.
The first three factors above make the matrix large, which lowers computational efficiency, and the final sparseness problem influences the accuracy of calculation. In response, we come up with a process improvement method (as in Figure 1), which is composed of two novel components: a new pattern for representing the semantics of words in poems, and a new method for computing the semantic relatedness between words, and even the association between poetry categories.
The main improvement in this paper is that we represent word semantics by "words-by-poetry categories" instead of "words-by-poems". Our hypothesis is that words which behave almost similarly across poetry categories are semantically related. Because of the simple syntax of Tang poetry, the main feeling of the poet is expressed by unified emotional words in a poem. Thus, our improved method consists of three main steps. First, construct the matrix, the representation of word semantics. Second, decompose the matrix into three significant matrices by SVD. Finally, analyze the matrices and compute relatedness.
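The three steps can be sketched in a short, self-contained Python example. The word counts and category layout below are hypothetical stand-ins, not the paper's data; this is a minimal sketch of the pipeline, not the authors' implementation:

```python
import numpy as np

def build_matrix(counts):
    # Step 1: the "words-by-poetry categories" matrix A, one row per word.
    return np.asarray(counts, dtype=float)

def decompose(A):
    # Step 2: decompose A into U, the singular values, and V' with SVD.
    return np.linalg.svd(A, full_matrices=False)

def relatedness(U, s, i, j):
    # Step 3: cosine of the scaled word vectors (rows of U * diag(s)).
    u, v = U[i] * s, U[j] * s
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical counts: 4 words over 3 emotion categories.
A = build_matrix([[4, 0, 1], [3, 0, 2], [0, 5, 0], [1, 4, 0]])
U, s, Vt = decompose(A)
sim = relatedness(U, s, 0, 1)  # words 0 and 1 have similar distributions
```

Without truncation the cosine equals that of the original rows, since V only rotates them; Section 2.3 adds the truncation step that makes the decomposition useful.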
[Figure 1 compares the two pipelines. Previous method: poem corpus -> segmentation and statistics for each poem -> build "poem-word" matrix. Improved method: poem corpus -> poem classification -> segmentation and statistics on each category -> build "emotion category-word" matrix. Both pipelines then proceed to singular value decomposition -> intercept matrix -> cosine calculation.]

Figure 1. Previous method and process improvement method
2.1 Construction of the Matrix
In the improved representation, each word is mapped to the vector of its corresponding row in the matrix, and each column vector of the matrix represents a poetry category. The preparatory work falls into three steps. First, we divide the set of Tang poetry into several categories, classified by emotion. Second, we split words and count the number of times each word appears in each category individually. Finally, we build the matrix.
Specifically, the set of poetry can be divided into N categories, which yield M words from segmentation. Thereby, we can build a "words-by-poetry categories" matrix of size M × N. The frequency of word i in poetry category j can be expressed as fij, and the weight of the word as aij, so the matrix can be expressed as A = [aij].
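As a concrete illustration, the matrix A = [aij] might be built as follows. The tf-idf-style global weight used here is an assumption for the sketch (the paper cites tf-idf [18] but does not fix the exact scheme), and the frequencies are hypothetical:

```python
import math

# Hypothetical frequencies f[i][j]: word i's count in category j.
freqs = [
    [16, 0, 3],   # word 0
    [0, 10, 0],   # word 1
    [8, 2, 5],    # word 2
]

def weight_matrix(freqs):
    """Build A = [a_ij] with a tf-idf-style weight (an assumption:
    the paper leaves the exact local/global scheme unspecified)."""
    n_cats = len(freqs[0])
    A = []
    for row in freqs:
        df = sum(1 for f in row if f > 0)          # categories containing the word
        g = math.log((1 + n_cats) / (1 + df)) + 1  # smoothed global weight
        A.append([f * g for f in row])             # local weight f_ij times g_i
    return A

A = weight_matrix(freqs)
```

A word concentrated in one category (word 1) gets a larger global weight than a word spread over every category (word 2), which matches the intent of the local-times-global weighting described in Section 2.2.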
2.2 Singular Value Decomposition
In the construction above, we built the matrix to represent the semantics of words in Tang poetry, where aij = lij × gi: lij is the local weight of word i in category j, and gi is the global weight of word i in the full text (the Tang poetry set). Because a practical theorem guarantees that a singular value decomposition (SVD) [18, 19] exists for any nonzero matrix, in the realization of LSA we use a typical LSA/SVD construction method, which builds the matrix space model and then calculates the singular value decomposition.
2.2.1 Theorem
Let A ∈ R^(m×n) (C^(m×n)) with rank r. Then there exist orthogonal (unitary) matrices U ∈ R^(m×m) (C^(m×m)) and V ∈ R^(n×n) (C^(n×n)) such that

A = U Σ V^T (A = U Σ V^H)

where

Σ = [ S 0 ; 0 0 ] and S = diag(σ1, σ2, ..., σr) with σ1 ≥ σ2 ≥ ... ≥ σr > 0.
2.2.2 Main Algorithms of Singular Value Decomposition
Algorithm I
Find the eigenvalues and eigenvectors of A'A as A'A = V diag(Σr², O) V', V = (V1, V2), where V1 is a column-orthogonal matrix of size n × r. Compute U1 = A V1 Σr^(-1); U1 is a column-orthogonal matrix of size m × r. Extend U1 to an orthogonal matrix of size m × m: U(m×m) = (U1, U2). Finally, we obtain A(m×n) = U(m×m) Σ(m×n) V'(n×n).
Algorithm II
Reduce A to bidiagonal form with the Householder method: there exist an m × m orthogonal matrix P and an n × n orthogonal matrix Q such that

PAQ = [ E ; O ]

where E is a bidiagonal matrix (all elements other than those on the diagonal and the superdiagonal are zero). Then, using a variant of the QR method with iterative computation, gradually reduce the superdiagonal elements of E to zero, realizing the diagonalization of the matrix and finally obtaining the result of the SVD. The QR variant is based on QR decomposition for finding the eigenvalues and eigenvectors of a general matrix.
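Algorithm I can be sketched directly with numpy's eigendecomposition of A'A. This is illustrative only (the matrix is a hypothetical example, and numerically a dedicated SVD routine is preferable to the eigendecomposition route):

```python
import numpy as np

def svd_via_eig(A):
    """Sketch of Algorithm I: recover the SVD of A from the
    eigen-pairs of A'A (illustration, not a production routine)."""
    A = np.asarray(A, dtype=float)
    evals, V = np.linalg.eigh(A.T @ A)   # eigen-pairs of A'A, ascending
    order = np.argsort(evals)[::-1]      # reorder to descending
    evals, V = evals[order], V[:, order]
    r = int(np.sum(evals > 1e-10))       # numerical rank
    s = np.sqrt(evals[:r])               # singular values
    U = A @ V[:, :r] / s                 # U1 = A V1 Sigma^(-1)
    return U, s, V[:, :r]

A = [[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]]
U, s, V = svd_via_eig(A)
# Reconstruction check: A should equal U diag(s) V' up to rounding.
err = np.max(np.abs(np.asarray(A) - U @ np.diag(s) @ V.T))
```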
2.3 Analysis of the Intercepted Matrix
LSA is essentially a dimensionality reduction technique: keeping only the k largest singular values (k ≤ r ≤ n < m) reduces the size of U and V. The matrix Ak, the product of the matrices U(m×k), Σ(k×k), and V'(k×n) generated by the interception, approximately represents the original matrix:

A ≈ Ak = U(m×k) × Σ(k×k) × V'(k×n)

From this we get three smaller matrices with clear meanings: U(m×k) shows the relative features of the words, V'(k×n) shows the poetry classes, and the middle matrix shows the importance of the different columns of U(m×k).

Figure 2. Matrix after interception
So, we can calculate the relatedness of two words from the corresponding rows of the product U(m×k) × Σ(k×k), with the value given by the cosine of the corresponding vectors. The closer the cosine is to 1, the higher the correlation; conversely, the closer the cosine is to 0, the lower the correlation. Similarly, we can calculate the relatedness between two classes from each pair of columns of V'(k×n): the closer the cosine is to 0, the less correlated the categories are and the more reasonable the poetry division is.
According to this rule, we can make two kinds of relatedness judgments. One concerns poetry classification and gives a reference for deciding whether the category division is scientific. If the relatedness between two certain categories is high, the two categories were probably classified on different grounds, which breaks the rationality of the classification. For example, "separation" and "sadness" are different kinds classified on different grounds, because almost every separation poem is sad. The other is the relatedness of words, which, computed by the cosine of row vectors of the U(m×k) × Σ(k×k) matrix, can
determine the strength of semantic relatedness between words, provide the basis for the establishment of emotion-related word libraries, and support the unified emotion judgment of automatically generated Tang poetry.
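The interception and both cosine computations described in this section can be sketched with numpy's SVD. The 5-word × 4-category weight matrix and the choice k = 2 are assumptions for illustration only:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 5-word x 4-category weight matrix.
A = np.array([[4., 0., 1., 0.],
              [3., 0., 2., 0.],
              [0., 5., 0., 1.],
              [0., 4., 1., 1.],
              [1., 1., 1., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # keep the k largest singular values
Wk = U[:, :k] * s[:k]                   # word vectors: rows of U_k Sigma_k
word_sim = cosine(Wk[0], Wk[1])         # relatedness of words 0 and 1
cat_sim = cosine(Vt[:k, 0], Vt[:k, 1])  # relatedness of categories 0 and 1
```

Words 0 and 1 have similar category distributions, so their cosine stays high after truncation, while a word from the other cluster (row 2) scores much lower against word 0.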
3. Experimental Procedure and Analysis of Results
In this section we implement our improved method in our experiments, provide the experimental results, and compare the empirical evidence for measuring the semantic relatedness of words against the previous method. Finally, we analyse the matrices obtained from the singular value decomposition of the original matrix.
We implemented our LSA approach using more than 40,000 poems as the local poetry corpus. We then divided the poems into 11 categories based on literary knowledge: patriotic, war, farewell, seasons, weather, love, plants, nostalgia, festivals, rural reclusion, and landscape. Each poem can fall into several different categories because of its emotional diversity.
3.1 Segmentation and Statistics
After classification, we can implement segmentation, statistics, and frequency weighting for each category, and finally obtain the word weights. Table 1 shows part of the weighting results for the farewell category.
Table 1. Segmentation and Statistics Results by Category

No  Word            Frequency  Weight
1   a thousand li   16         127.04
2   white cloud     10         82.90
3   Qingshan        8          76.56
4   spring breeze   12         66.48
5   old friend      10         66.30
6   Miles           7          62.30
7   Where           11         57.64
8   Willows         8          53.92
We compared our representation and segmentation statistics with the previous approach and found that reducing the implementation time in this way is significantly superior to other approaches. Under the hypothesis that words which behave almost similarly across poetry categories are semantically related, we perform this work for each category instead of for each poem individually. So, the cost of segmentation and statistics is reduced from being proportional to the number of poems to being proportional to the number of categories. The weight of each word corresponds to the word's importance to each emotional category.
Table 2. Construction of the Frequency Matrix

Word         patriotic  war      farewells  season  weather  love     plant    nostalgia  festival  hermit  landscape
Homeless     28.9200    0.0000   14.4600    0.0000  14.4600  0.0000   0.0000   14.4600    14.4600   0.0000  0.0000
Han Dynasty  43.5000    130.5000 14.5000    14.5000 0.0000   0.0000   14.5000  14.5000    14.5000   0.0000  0.0000
Yinshan      42.7800    71.3000  0.0000     0.0000  14.2600  0.0000   0.0000   14.2600    0.0000    0.0000  0.0000
Miles        34.3700    4.9100   62.3000    24.5500 4.9100   24.5500  14.7300  0.0000     9.8200    9.8200  9.8200
Changan      34.0000    17.0000  34.0000    8.5000  25.5000  25.5000  17.0000  42.5000    8.5000    0.0000  8.5000
White jade   33.6900    0.0000   11.2300    0.0000  0.0000   0.0000   33.6900  11.2300    0.0000    0.0000  0.0000
Xianyang     32.5800    0.0000   10.8600    0.0000  10.8600  0.0000   0.0000   21.7200    0.0000    0.0000  0.0000
Loyalty      31.5200    0.0000   0.0000     0.0000  0.0000   0.0000   0.0000   0.0000     7.8800    0.0000  0.0000
Peking       29.9400    0.0000   9.9800     0.0000  9.9800   0.0000   0.0000   0.0000     19.9600   9.9800  0.0000
Inch         29.5600    0.0000   0.0000     0.0000  0.0000   7.3900   0.0000   0.0000     0.0000    0.0000  0.0000
gray hair    26.9000    0.0000   0.0000     10.7600 5.3800   16.1399  0.0000   5.3800     16.1399   0.0000  10.7600
Gold         19.2000    9.6000   24.0000    14.4000 0.0000   4.8000   9.6000   19.2000    9.6000    4.8000  0.0000
3.2 Matrix Construction Analysis
After segmentation and weighting, we merge duplicate words and build the matrix shown in Table 2. To avoid the influence of sparse words on calculation accuracy, we excluded words with both low weight and low frequency. To improve the accuracy of relatedness computing for words appearing in several categories, we also excluded words that appear in only one category. Finally, we built a 2889 × 11 matrix of high-frequency words, which is much smaller than the previous matrix.
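The two exclusion rules above can be sketched as a row filter. The thresholds and the toy weight rows are assumptions, since the paper does not state its exact cut-offs:

```python
import numpy as np

def filter_rows(A, min_weight=10.0, min_cats=2):
    """Drop words that appear in fewer than min_cats categories or
    whose total weight is below min_weight (assumed thresholds)."""
    A = np.asarray(A, dtype=float)
    cats_present = (A > 0).sum(axis=1)   # categories each word occurs in
    keep = (cats_present >= min_cats) & (A.sum(axis=1) >= min_weight)
    return A[keep], keep

# Hypothetical weight rows for three words over three categories.
A = [[28.9, 0.0, 14.4],   # kept: two categories, high weight
     [3.0, 0.0, 0.0],     # dropped: one category only
     [43.5, 130.5, 14.5]] # kept
B, keep = filter_rows(A)
```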
Comparing the result of matrix construction with the previous approach, the improved method sharply reduces the number of columns, and the size of the matrix decreases from "words × poems" to "words × categories". Moreover, whenever we add poems to the local corpus under the previous approach, the size of the matrix grows at the same time, but it does not grow under the improved method.
3.3 Analysis of the Matrices after SVD
Matrix construction is the basis for the next step. As introduced in the second section, after matrix construction we can decompose the matrix by singular value decomposition into three matrices:

X(t×d) = T(t×r) Σ(r×r) V'(r×d)
With the algorithms introduced above, we decompose the matrix into the T, Σ, and V matrices, as Table 3 shows (only a part of each matrix is listed).
Table 3. Matrices of U, Σ, V

Matrix of U:
 0.0442  -0.0451   0.1833  -0.0158  -0.0445
 0.0792  -0.0838   0.2695  -0.0278  -0.1279
 0.0428  -0.0726   0.1698  -0.0056  -0.0574
 0.0576   0.0340   0.0062  -0.0100   0.0009
 0.0842  -0.0523   0.0053  -0.0153   0.0144

Matrix of Σ:
 819.82    0.00    0.00    0.00    0.00
   0.00  434.17    0.00    0.00    0.00
   0.00    0.00  387.36    0.00    0.00
   0.00    0.00    0.00  369.27    0.00
   0.00    0.00    0.00    0.00  353.02

Matrix of V':
 0.1780  -0.1856   0.2090   0.0149  -0.0886
 0.2368  -0.2177   0.8174  -0.0919  -0.2895
 0.4637   0.1740  -0.1633  -0.8231   0.0979
 0.4114   0.3867   0.1215   0.4053   0.2429
Table 4. Cosine Calculation of U(M×K) × Σ(K×K)

Wanderer:      Farewell 0.973729; a few words 0.973729; Separate 0.973729
Han Dynasty:   run amuck 0.9673017; Ryongson 0.9657026; War 0.9635461; Huanglong 0.9635461; Border-fortress 0.9635461; grape 0.9635461
wealthy:       Nothingness 0.9622504; Competed 0.9622504; gentle and simple 0.9622504
golden armor:  Loulan 0.9847319; Qinghai 0.9689628; battlefield 0.958317; desert 0.9571064
hero:          Resurgence 0.9831921; Luxury 0.9831921; Xi Shi 0.9708392; Weeds 0.9697423
country:       how can 0.9843091; much difficulties 0.9707253; Karlaua 0.9683641; hero 0.9558059
As described in the last section, the cosine value calculated from two rows of the U(m×k) × Σ(k×k) matrix represents the correlation of the corresponding words: the greater the value, the more closely related the words are. So, we can find associated words, as shown in Table 4. The similar distribution of two words across categories reflects the similar emotion they carry.
3.4 Additional Information
Besides computing the semantic relatedness of words from the matrix decomposed by SVD, Table 5 shows the cosine calculation results for the V matrix, in which element (i, j) shows the correlation between categories i and j. As we can see in the table, the farewell category has a higher association with the others, but it is limited to the 10^-16 order of magnitude, approximately equal to 0, i.e., almost irrelevant. It can be seen from the experimental data above that, through the semantic relatedness computing of words by the process-improved method, we can find emotionally associated words. Meanwhile, we can judge whether the classification is reasonable by analyzing the V matrix.
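The category-to-category cosines behind Table 5 can be computed from the columns of the truncated V'. The small matrix and the rank k below are hypothetical, for illustration only:

```python
import numpy as np

def category_correlations(A, k):
    """Cosine between every pair of categories from the columns of the
    rank-k V' matrix; entry (i, j) is the correlation of categories i, j."""
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    C = Vt[:k]                           # k x n_categories
    C = C / np.linalg.norm(C, axis=0)    # normalize each column
    return C.T @ C                       # (i, j) = cosine of cats i and j

# Hypothetical 4-word x 3-category weight matrix.
A = [[4., 0., 1.], [3., 0., 2.], [0., 5., 0.], [0., 4., 1.]]
R = category_correlations(A, k=2)
```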
Table 5. Cosine Calculation of the V Matrix

           war   Farewells  season  weather  love  plant  nostalgia  festival  Hermit  landscape
Patriotic  0.38  0.27       0.24    0.26     0.21  0.20   0.37       0.22      0.15    0.25
War        0.00  0.27       0.27    0.30     0.15  0.13   0.28       0.18      0.08    0.20
Farewells  0.00  0.00       0.44    0.44     0.29  0.33   0.38       0.36      0.25    0.38
Season     0.00  0.00       0.00    0.54     0.35  0.43   0.31       0.34      0.26    0.41
Weather    0.00  0.00       0.00    0.00     0.29  0.33   0.30       0.32      0.21    0.34
Love       0.00  0.00       0.00    0.00     0.00  0.31   0.27       0.25      0.12    0.21
Plant      0.00  0.00       0.00    0.00     0.00  0.00   0.25       0.30      0.19    0.28
Nostalgia  0.00  0.00       0.00    0.00     0.00  0.00   0.00       0.31      0.16    0.29
Festival   0.00  0.00       0.00    0.00     0.00  0.00   0.00       0.00      0.20    0.24
Hermit     0.00  0.00       0.00    0.00     0.00  0.00   0.00       0.00      0.00    0.32
3.5 The Limitations of the Improved Method
Although we have seen many results in which the process-improved LSA method performs better than before, we also present some examples in which it performs weakly. One of the strengths of the method sometimes also serves as its weakness: although it is outstanding at finding the relatedness of words through their similar behavior across categories, a prominent problem is that it is weak at figuring out the relatedness of words that appear in merely one category with low weight. So it is a problem caused by word sparseness as well. In our opinion, such low-frequency words are hard to use in poem creation and have low semantic relatedness with most words; once we use them in the calculation, accuracy is affected. In this case, a possible solution for improving accuracy is to combine another method such as MI and take the overlapping results of the two methods as the final result. It can be seen that semantic relatedness computing for Tang poems still has a long way to go. In the future, we can do more research on finding effective solutions to word sparseness.
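The suggested combination with MI could be sketched as keeping only the word pairs that both methods rank highly. The scores below are hypothetical stand-ins, and intersecting top-k neighbor sets is only one plausible way to take "overlap data" from the two methods:

```python
def overlap(lsa_scores, mi_scores, k=2):
    """Words ranked in the top k by BOTH an LSA-style cosine score and
    an MI score (hypothetical values; one reading of 'overlap data')."""
    def top(scores):
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        return {word for word, _ in ranked[:k]}
    return top(lsa_scores) & top(mi_scores)

# Hypothetical candidate neighbors of one query word under each method.
lsa = {"white cloud": 0.97, "old friend": 0.95, "gold": 0.41}
mi = {"white cloud": 3.2, "gold": 2.9, "old friend": 2.5}
common = overlap(lsa, mi)
```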
4. Conclusion
The main innovation of this paper is a process-improved method that first classifies the Tang corpus by emotion, then represents word semantics with a "words-by-poetry categories" matrix instead of a "words-by-poems" matrix, and finally decomposes it by singular value decomposition. What makes it different from other methods is that it classifies the poetry of the corpus and implements segmentation and statistics on each classification instead of on each poem. With this process improvement, we can reduce the size of the original matrix and so improve computational efficiency.
The experimental results for the cosines of vectors in the U(m×k) × Σ(k×k) matrix show that the similar distribution of two words across categories reflects the similar emotion they carry, so we can find associated words in this way. Meanwhile, the cosine calculation of the V matrix shows the relevance of poetry categories. This method is based on the simple syntax of Tang poetry, which is composed of words with similar emotion, and is designed for the semantic relatedness computing of Tang poetry, not of modern Chinese articles. All of this provides the basis for the establishment of emotion-related word libraries and for the unified emotion judgment of automatically generated Tang poetry.
References
[1] Manning, Christopher D. Foundations of Statistical Natural Language Processing. Ed. Hinrich Schütze. MIT Press. 1999.
[2] Su JS, Zhou CL, Li YH. The establishment of the annotated corpus of Song dynasty poetry based on the statistical word extraction and rules and forms. Journal of Chinese Information Processing. 2007; 21(2): 52-57.
[3] Xue N. Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing. 2003; 8(1): 29-48.
[4] Hu JF. The lexicon meaning analysis-based computer aided research work of Chinese ancient poems. Ph.D. Thesis. Beijing: Peking University. 2001.
[5] Yang Yu-Zhen, Pei-Yu Liu, Pei-Pei Jiang. Research on Text Representation with Combination of Syntactic in Vector Space Model. Jisuanji Gongcheng/Computer Engineering. 2011; 37(3).
[6] Li Liang-Yan, Zhong-Shi He, Yong Yi. Poetry stylistic analysis technique based on term connections. Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on. IEEE. 2004; 5: 2713-2718.
[7] Budanitsky, Alexander, Graeme Hirst. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics. 2006; 32(1): 13-47.
[8] Agirre, Eneko, et al. A study on similarity and relatedness using distributional and WordNet-based approaches. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics. 2009.
[9] Gabrilovich, Evgeniy, Shaul Markovitch. Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. IJCAI. 2007; 7: 1606-1611.
[10] Deerwester, Scott C., et al. Indexing by latent semantic analysis. JASIS. 1990; 41(6): 391-407.
[11] Church, Kenneth Ward, Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics. 1990; 16(1): 22-29.
[12] Kozima, Hideki, Teiji Furugori. Similarity between words computed by spreading activation on an English dictionary. Proceedings of the sixth conference of the European chapter of the Association for Computational Linguistics. Association for Computational Linguistics. 1993.
[13] Morris, Jane, Graeme Hirst. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics. 1991; 17(1): 21-48.
[14] Miller, George A., et al. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography. 1990; 3(4): 235-244.
[15] Resnik, Philip. Using information content to evaluate semantic similarity in a taxonomy. Proceedings of the 14th International Joint Conference on Artificial Intelligence. 1995; 1: 448-453.
[16] Hu, Junfeng, Shiwen Yu. The Computer Aided Research Work of Chinese Ancient Poems. Acta Scientiarum Naturalium Universitatis Pekinensis. 2001; 5: 022.
[17] Zhou, Cheng-Le, Wei You, Xiaojun Ding. Genetic algorithm and its implementation of automatic generation of Chinese songci. Journal of Software. 2010; 21(3): 427-437.
[18] Ramos, Juan. Using tf-idf to determine word relevance in document queries. Proceedings of the First Instructional Conference on Machine Learning. 2003.
[19] Zhang X, Wang M. Sparse representation for detection of microcalcification clusters. TELKOMNIKA Indonesian Journal of Electrical Engineering. 2012; 10(3): 545-550.
[20] Tian, Dong-feng, Fei Ou, Wei Shen. On the Application of Matrix Singular Value Decomposition Theory in Chinese Text Classification. Mathematics in Practice and Theory. 2008; 24: 021.
[21] Zhang PY. A HowNet-Based Semantic Relatedness Kernel for Text Classification. TELKOMNIKA Indonesian Journal of Electrical Engineering. 2013; 11(4): 1909-1915.
[22] Radinsky K, Agichtein E, Gabrilovich E, et al. A word at a time: computing word relatedness using temporal semantic analysis. Proceedings of the 20th International Conference on World Wide Web. ACM. 2011: 337-346.