International Journal of Electrical and Computer Engineering (IJECE)
Vol. 5, No. 3, June 2015, pp. 409~420
ISSN: 2088-8708
Journal homepage: http://iaesjournal.com/online/index.php/IJECE
Text Preprocessing using Annotated Suffix Tree with Matching Keyphrase
Ionia Veritawati*, Ito Wasito**, T. Basaruddin**
* Department of Informatics, Pancasila University, Indonesia
** Department of Computer Science, University of Indonesia, Indonesia
Article Info
Article history:
Received Dec 18, 2014
Revised Apr 29, 2015
Accepted May 14, 2015

ABSTRACT
Text documents are an important source of information and knowledge. Most of the knowledge needed in various domains for different purposes is in the form of implicit content. The content of a text is represented by keyphrases, which consist of one or more meaningful words. Keyphrases can be extracted from text through several processing steps, including text preprocessing. An Annotated Suffix Tree (AST) built from the document collection itself is used to extract the keyphrases after basic text preprocessing, which includes removing stop words and stemming, has been applied. A combination of four variations of preprocessing is used. The extracted two-word (bi-word) and three-word phrases form a list of keyphrase candidates, which can help users who need keyphrase information to understand the content of documents. The keyphrase candidates can be processed further by a learning process, with manual validation, to determine keyphrases and non-keyphrases for the text domain. Experiments using a simulation corpus whose keyphrases are predetermined show that more than 90% of the two-word and three-word keyphrases can be extracted. Using a real corpus from the economy domain, about 70% of the keyphrases or meaningful phrases can be extracted. The proposed method can be an effective way to find candidate keyphrases in a collection of text documents: it reduces the non-keyphrases or non-meaningful phrases in the list of keyphrase candidates and can detect keyphrases separated by stop words.
Keyword:
2-Means Clustering
Annotated Suffix Tree
Keyphrase
Preprocessing
TF-IDF
Copyright © 2015 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Ionia Veritawati,
Department of Informatics, Pancasila University,
Srengseng Sawah Street, Jagakarsa, Jakarta, Indonesia
Email: ioniaver11@gmail.com
1. INTRODUCTION
The use of text as data has increased rapidly in many domains. This becomes a problem when a person or a department needs the content of a text or document collection as information for their purposes, because a large volume of text makes it difficult to know what the collection contains. Document collections (corpora) hold implicit information that can be extracted to give meaningful information. Obtaining accurate information requires processing the text both as unstructured data and, physically, as documents.
A keyphrase (KP) is a meaningful phrase consisting of one or more words that can be extracted from a document collection using several different methods. KEA, called Automatic Keyphrase Extraction, is a method for extracting keyphrases using Naive Bayes [1]. Other methods use semantic analysis [2], lexical chains [3], and a ranking approach based on SVM [4]. Keyphrases can also be extracted as compound terms using a thesaurus database [5], or as index terms using an entropy and transition point approach [6]. Usually, keyphrases serve as features of a text collection, and the feature values are calculated from their occurrences. Keyphrases have been used in many applications, such as for
determining the index of a digital library [7]; for supporting a text-based decision system in financial sequence prediction [8] and a question answering system [9]; for ranking topical keyphrases from content-representative document titles [10]; for clustering documents [11] and text from structured data [12]; and for query-oriented summarization [13], information extraction [14][15], text categorization [16] and information retrieval [17].
Text preprocessing as an initial process has been used in many experiments, such as exploring the impact of preprocessing on text classification [18], compressing natural language [19], and selecting features [20]. Uysal [18] compared a few steps of preprocessing. In this paper, text preprocessing is combined with an Annotated Suffix Tree (AST), which consists of a collection of word pairs and their frequencies taken from a document. The AST in this paper is applied to match and score a list of inputted keyphrases, and also to extract keyphrases automatically if there are no input keyphrases to be matched.
This paper consists of four sections. The second section describes text preprocessing and the methods. It covers the concept of the Annotated Suffix Tree (AST), the algorithm for building the AST, the algorithm for AST matching, and the algorithm for automatic keyphrase extraction that combines text preprocessing and the AST. This section also explains the methodology of the experiments. The third section presents the results of the experiments and their analysis. The final section contains conclusions and future work.
2. PREPROCESSING AND METHODS
2.1. Text Preprocessing
Text preprocessing is a systematic, basic process applied to a collection of text documents. It covers the removal of meaningless characters, the removal of unimportant words, and the elimination of suffixes or prefixes from words [18]. The result of text preprocessing is a list of meaningful words that can represent the content of a document or a collection of documents. The resulting list is used in various applications, which are described in the previous section.
Meaningless characters in a text or document are commas, periods, question marks and others. Removing these characters and changing capital letters into lowercase makes the next preprocessing step easier. This is followed by the removal of unimportant words (stop words) such as conjunctions and adverbs. The list of all words is collected first, and then the unimportant words found are removed from the document collection. The process continues with a stemmer, which eliminates the prefix or suffix of a word that may form a verb or a noun. The stemmed words can be used as features of the documents. The frequency, as the score of each feature, is placed in an element of the score matrix between documents (columns) and keyphrases (rows). The score matrix is normalized using the TF-IDF calculation. The normalized keyphrase scores in the matrix or table are important for finding the content domain of the document collection, and are also helpful when used for querying or clustering documents.
2.2. Setting of Proposed Text Preprocessing
Text preprocessing in this paper has four settings, shown in Table 1. The purpose of using different settings is to find the best result of the keyphrase matching process, which is described in section 2.4.
Table 1. Setting of Proposed Text Preprocessing
Setting   Remove Stop Words (i)   Stem (j)
(1)       (2)                     (3)
1         No (0)                  No (0)
2         No (0)                  Yes (1)
3         Yes (1)                 No (0)
4         Yes (1)                 Yes (1)

Each of the four settings (Table 1) is applied to the original corpus. The preprocessing results are four collections of documents (corpora), one for each setting. The settings can be expressed as $x_{i,j}$ with $i, j \in \{0, 1\}$, where x is a word, i is the index for the removal of stop words (Table 1, column 2), and j is the index for stemming (Table 1, column 3); the indices take different states in each setting. The index values denote “not removing stop words” (i=0) or “removing” them (i=1), and “not using the stemmer” (j=0) or “using” it (j=1). For example, $x_{0,1}$ means that the word x will not be removed if it is a stop word and that the word x will be stemmed.
To find keyphrases of two or three words in a corpus, the usual process is sequential and uses an array structure to arrange the words (2-grams or 3-grams from the N-gram method), besides the methods described in the previous section. In this paper, a different method is proposed for matching and extracting keyphrases of one to three words using an Annotated Suffix Tree (AST). Each of the four preprocessed corpora obtained from the model data with the settings of Table 1 is used to build a tree structure, giving four ASTs, and each AST is used in the processes of keyphrase matching and automatic keyphrase extraction. The specific methods are explained in the following sections.
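The four settings of Table 1 can be illustrated with a minimal Python sketch. The stop-word list, the toy stemmer and the sample sentence are assumptions introduced only for illustration (the experiments in this paper use an Indonesian stemmer, which is not reproduced here); the two-digit codes follow the "remove stop words, stem" order.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "and"}

def toy_stem(word):
    """Toy stand-in for a real stemmer: strips a trailing 's' from longer words."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def preprocess(text, i, j):
    """Apply setting (i, j): i=1 removes stop words, j=1 applies the stemmer."""
    # Lowercase the text and drop punctuation and digits.
    words = re.findall(r"[a-z]+", text.lower())
    if i == 1:
        words = [w for w in words if w not in STOP_WORDS]
    if j == 1:
        words = [toy_stem(w) for w in words]
    return words

# Codes: first digit = remove stop words (i), second digit = stem (j).
SETTINGS = {"00": (0, 0), "01": (0, 1), "10": (1, 0), "11": (1, 1)}

text = "The prices of goods and the markets"
corpora = {code: preprocess(text, i, j) for code, (i, j) in SETTINGS.items()}
for code, words in corpora.items():
    print(code, words)
```

One corpus is produced per setting, which mirrors the four preprocessed corpora that are later turned into four ASTs.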
2.3. Annotated Suffix Tree
A suffix tree in general is a tree built from the characters of the suffixes of a string (a collection of characters) [21]. The root node and the sub-trees themselves carry no value; values are placed at the vertices of the sub-trees. For example, the word “bananas” has the following suffixes: “bananas”, “ananas”, “nanas”, “anas”, “nas”, “as”, “s”, and “$” as the end character. Each suffix is arranged in the vertices of the suffix tree. A suffix tree is usually used for string matching. Ukkonen’s online algorithm [21] was proposed to build the suffix tree, and its steps use prefixes rather than suffixes. For example, the prefixes of “bananas” are: “b”, “ba”, “ban”, “bana”, “banan”, “banana”, “bananas”.
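The suffix and prefix lists of the “bananas” example can be reproduced with a short Python sketch. The function names are illustrative, and the sketch only enumerates the strings; it does not build a suffix tree or implement Ukkonen's algorithm.

```python
def suffixes(word, end="$"):
    """All suffixes of a word, longest first, followed by the end character."""
    return [word[k:] for k in range(len(word))] + [end]

def prefixes(word):
    """All prefixes of a word, shortest first."""
    return [word[:k] for k in range(1, len(word) + 1)]

print(suffixes("bananas"))
print(prefixes("bananas"))
```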
The Annotated Suffix Tree (AST) is a different concept of suffix tree, proposed by Pampapathi [22] for spam filtering. Pampapathi developed the AST as a structure for a word in which characters, rather than strings, are placed at the nodes of the tree together with their frequencies. The AST is also used to match string patterns of spam words. Each of the suffix tree models above is used for a single word.
Figure 1. The Proposed AST Illustration
In this paper, the proposed Annotated Suffix Tree (AST) algorithm adopts Ukkonen's concept of developing a suffix tree online and Pampapathi's [22] arrangement of tree nodes as the place for storing characters. Figure 1 shows an illustration of the AST for bi-words. The proposed AST is arranged to store the words of a bi-word, not characters, together with their frequencies in its nodes. The tree has a depth of two levels. The number of nodes at level 1 of the tree is the number of unique words in the document from which the AST is developed, and the frequency at each node of that level counts the occurrences of its word, so that the frequencies at that level add up to the total number of words.
Read a document D(i) that has been preprocessed
Split text of D(i) into array of words (1..number of words)
Create root of tree
For j = 1 .. (number of words - 1)
    {Insert word(j), word(j+1) into tree}
    Traverse level 1,
    If word(j) = node1
        add frequency of node1 {level 1}
        If node2 (subtree) of node1 = word(j+1)
            add frequency of node2 {level 2}
        Else Insert word(j+1) as a new node2 (subtree) of node1 at level 2
    Else
        Insert word(j) as a new node1 (subtree) of root at level 1
    Traverse level 1,
    if there is no node1 == word(j+1)
        Insert word(j+1) as a new node1 (subtree) of root at level 1
Figure 2. AST Development Algorithm
The proposed algorithm for developing the AST is described in Figure 2. The main process puts every pair of consecutive words of a text document into the AST at levels one and two. If the first word to be inserted already exists at a node of level one of the tree, that node branches into a new sub-tree to hold the second word and its frequency.
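The two-level structure built by the algorithm of Figure 2 can be sketched in Python with nested dictionaries. This is an illustrative sketch, not the authors' implementation: the dictionary layout and the sample words are assumptions, and the sketch counts every word occurrence at level 1 (including the last word of the document) and attaches the second word under newly created level-1 nodes as well, so that the level-1 frequencies add up to the total number of words.

```python
def build_ast(words):
    """Build a two-level tree: level 1 holds words, level 2 holds their successors.

    Each level-1 entry maps a word to {"freq": count, "children": {next_word: count}}.
    """
    tree = {}
    for j, first in enumerate(words):
        # Level 1: create the node if it is new, then count the occurrence.
        node = tree.setdefault(first, {"freq": 0, "children": {}})
        node["freq"] += 1
        # Level 2: count the word that follows, if there is one.
        if j + 1 < len(words):
            second = words[j + 1]
            node["children"][second] = node["children"].get(second, 0) + 1
    return tree

words = "stock market price stock market".split()
tree = build_ast(words)
for word, node in tree.items():
    print(word, node["freq"], node["children"])
```

With dictionaries, the "traverse level 1" step of Figure 2 becomes a constant-time lookup; a linked tree with explicit traversal would follow the figure more literally.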
2.4. Matching Process
The matching process algorithm (Figure 3) in this paper scores a list of inputted keyphrases against a document collection, using the AST developed from each document. Matching and extracting keyphrases is done by traversing the AST, matching the words in the nodes against the inputted keyphrases one by one, as single words or bi-words, and counting the frequency of occurrences of the respective matching words.
Input a keyphrase (KP) to be matched with AST
Score = 0
Split KP into words (1..number of words of KP)
For j = 1 .. (Nword - 1)
    {matching and scoring word(j), word(j+1) with AST}
    Traverse level 1,
    If word(j) = node1 {level 1}
        Traverse level 2
        If node2 (subtree) of node1 = word(j+1) {level 2}
            score = score + frequency of node2
score_matching = score / (number of words of KP - 1)
Insert score_matching as an element of the table of document vs KP
Figure 3. Keyphrase Matching Algorithm using AST
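The keyphrase (KP) matching score computed by the algorithm of Figure 3 is formulated as follows:

$$\mathrm{Score}(KP) = \frac{\sum_{j=1}^{n-1} w_j \, f_2(\mathrm{word}_j, \mathrm{word}_{j+1})}{n - 1} \qquad (1)$$

where n is the number of words in the keyphrase, $w_j$ is the weight of word j, and $f_2(\mathrm{word}_j, \mathrm{word}_{j+1})$ is the frequency stored at the level-2 node of $\mathrm{word}_{j+1}$ under the level-1 node of $\mathrm{word}_j$. In this experiment, the weights of all words are equal to 1.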
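Equation (1) and the matching steps of Figure 3 can be sketched in Python as follows. The nested-dictionary layout of the tree and the sample frequencies are assumptions made for illustration. Because equation (1) divides by the number of words minus one, it is undefined for a single word; scoring a single word by its level-1 frequency is an assumption of this sketch, not something stated in the paper.

```python
def match_score(tree, keyphrase, weight=1.0):
    """Score a keyphrase against a two-level tree following equation (1)."""
    words = keyphrase.split()
    if len(words) < 2:
        # Assumption: a single word is scored by its level-1 frequency.
        node = tree.get(words[0]) if words else None
        return weight * node["freq"] if node else 0.0
    score = 0.0
    for first, second in zip(words, words[1:]):
        node = tree.get(first)  # level-1 lookup
        if node is not None:
            # Level-2 lookup: frequency of `second` following `first`.
            score += weight * node["children"].get(second, 0)
    return score / (len(words) - 1)

# Hypothetical tree for the words "stock market price stock market".
tree = {
    "stock": {"freq": 2, "children": {"market": 2}},
    "market": {"freq": 2, "children": {"price": 1}},
    "price": {"freq": 1, "children": {"stock": 1}},
}

print(match_score(tree, "stock market"))        # one pair with frequency 2
print(match_score(tree, "stock market price"))  # two pairs, frequencies 2 and 1
```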
Preprocessing documents collection Di [i = 1..number of documents] (option: remove stop words + stem)
Preprocessing keyphrases list KP [1..number of KP] (option: stem)
For i = 1 .. number of documents
    Develop AST(i) of D(i)
    For j = 1 .. number of KP
        Score = Matching between AST(i), KP(j) (Figure 3)
        Insert score into table T1(i,j) {single word}
        Insert score into table T2(i,j) {bi-words / three-words}
Score table normalization using TF-IDF (single word):
    Score = tf * log(N/n)
    // tf : frequency
    // N : number of documents
    // n : number of documents in which the KP is present
Score table normalization using TF-IDF modification (bi-words / three-words):
    Score = tf * (log(N) - log(N/n))
Figure 4. Table Score Arranging Algorithm
The keyphrase matching process is applied to the four types of corpus produced by text preprocessing with the settings of Table 1, where each document under every setting is developed into its own AST, giving four different ASTs. Each inputted keyphrase is therefore matched against each of those ASTs. There are two types of inputted keyphrases: a list of stemmed keyphrases, which is matched with the stemmed corpus, and a list of non-stemmed keyphrases, which is matched with the non-stemmed corpus. The normalized score of a single word, processed using the standard TF-IDF (Figure 4), will be small for a word with a high frequency across documents, because log(N/n) shrinks as the number n of documents containing the word grows.
Meanwhile, the normalized score of bi-words and three-word phrases, processed using the modified TF-IDF (Figure 4), will be high for bi-words and three-word phrases with a high frequency.
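The two normalizations of Figure 4 can be compared with a short Python sketch. The function names are illustrative, and the natural logarithm is an assumption, since the base of the logarithm is not specified. Note that log(N) - log(N/n) simplifies to log(n), so the modified score grows with the number of documents that contain the phrase, whereas the standard score shrinks.

```python
import math

def tfidf_standard(tf, n_docs, doc_freq):
    """Standard TF-IDF for single words: tf * log(N / n)."""
    return tf * math.log(n_docs / doc_freq)

def tfidf_modified(tf, n_docs, doc_freq):
    """Modified TF-IDF for bi-words and three-word phrases: tf * (log(N) - log(N / n))."""
    return tf * (math.log(n_docs) - math.log(n_docs / doc_freq))

# A term with tf = 3 in a collection of N = 10 documents.
for doc_freq in (1, 5, 10):
    print(doc_freq, tfidf_standard(3, 10, doc_freq), tfidf_modified(3, 10, doc_freq))
```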
2.5. Automatic Keyphrase Extraction
In the previous section, the AST is applied to match and score an inputted keyphrase, and the score is then arranged into a feature table between documents and keyphrases. In this section, the AST is used to find keyphrases in a collection of documents automatically.
The proposed method consists of four steps (Figure 5). The initial step extracts all keyphrases from the AST according to a minimum threshold on the frequencies at the nodes. The extraction process uses the four types of corpus produced by preprocessing, which are coded by the combination of “remove stop word” and “stem” described in section 2.2 (Table 1). The codes are 00, 10, 01 and 11. For example, code “10” refers to keyphrases extracted from the corpus to which “remove stop word” is applied and “stem” is not. From the four types of corpus, the results are four keyphrase lists (KP00, KP10, KP01 and KP11) and four tables of documents versus keyphrases (TF-IDF00, TF-IDF10, TF-IDF01 and TF-IDF11).
Extract KP from preprocessing: 00 – 10 – 01 – 11 {List 0}
STEP I (Search KP): compare KP00 – KP10
    if KP match: match with KP01 + KP11 (without stem)
STEP II (Search suffixes word): compare KP result from step I
    match with KP01 + KP11 (with stem)
STEP III (Search “word + stopwords”): compare KP00 – KP01
    match and remove stopwords
STEP IV: eliminate overlap words (1-3 words) of KP result from step I – III
Figure 5. Automatic Keyphrase Extraction Algorithm
Step I of the algorithm (Figure 5) compares the lists of keyphrases to find the keyphrases they have in common. Steps II and III (Figure 5) search for keyphrases among words with affixes and among three-word phrases that have a stop word in the middle. The list of keyphrases resulting from steps I – III is checked in the last step (step IV), which eliminates keyphrases that overlap each other. For example, it eliminates bi-word keyphrases that are included in three-word keyphrases, single words that are included in bi-word keyphrases, and duplicates among identical keyphrases.
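The overlap elimination of step IV can be sketched in Python as follows. This is a minimal sketch under the assumption that a shorter keyphrase is "included" in a longer one when its words appear as a contiguous run in the longer phrase; the function names and the sample phrases are illustrative.

```python
def contains(longer, shorter):
    """True if the word tuple `shorter` occurs as a contiguous run inside `longer`."""
    n, m = len(longer), len(shorter)
    return any(longer[k:k + m] == shorter for k in range(n - m + 1))

def eliminate_overlaps(phrases):
    """Drop exact duplicates and phrases contained in a longer phrase."""
    unique = list(dict.fromkeys(phrases))  # removes duplicates, keeps order
    tokens = [tuple(p.split()) for p in unique]
    kept = []
    for phrase, words in zip(unique, tokens):
        covered = any(len(other) > len(words) and contains(other, words)
                      for other in tokens)
        if not covered:
            kept.append(phrase)
    return kept

candidates = ["bank", "central bank", "central bank policy",
              "interest rate", "interest rate", "inflation"]
result = eliminate_overlaps(candidates)
print(result)
```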
2.6. Methodology
The methodology of the experiments is shown in Figure 6. The experiments using the AST are divided into two parts. The first part matches inputted keyphrases with the AST of each document in a corpus to obtain the matching score, following the algorithm in Figure 4. The second part, following the algorithm in Figure 5, extracts keyphrases automatically from a corpus. Each experiment is evaluated separately.
Figure 6. Methodology of Experiment
2.7. Data for Simulation
For easier evaluation, the data used in the experiments are simulation data produced by a corpus generator (Figure 7). The number of keyphrases, the words with affixes and the list of stop words for three domains are determined first. In this model, when the stemmer is applied, words with affixes can be reduced to a keyphrase, which then obtains a higher score. All of these words, combined with the list of stop words, become a list of combined words. The indices of the combined words are randomized, and the randomized words become a simulation document. After repeating the process, the created documents are joined into a formatted corpus that is ready to be used as simulation (model) data for further processing.
The generated model data corpus and real documents from the economy domain are used for simulation and for investigating the performance of the proposed method. Figure 8 and Figure 9 show samples of keyphrases and words with affixes in the Indonesian language. The phrases consist of one to three words; the stop words have been removed but the phrases are not stemmed (code 10, referring to the settings of Table 1). The stemmer used in these experiments is a stemmer for the Indonesian language.
• Input:
  - List of stop words
  - Files: list of keyphrases + words with affixes
• Process (each document):
  - Randomize: stop words + keyphrases + words with affixes
  - Tag formatting
  - Save document in a corpus file
• Output:
  - A corpus file

G = General
Domain = A; {A | A part of G}
Document = D; {D | D part of A}
Word = w; {w | w part of D}
Keyphrase = p; Word with affix = x; {p, x | p, x part of D}
Stop word = s; {s | s part of G}
{p, x, s | p, x, s part of w}
N1 = number of documents (D)
N2 = number of a keyphrase (p) at Di; i: index of a document
N3 = number of a word with affix (x) at Di
N4 = number of a stop word at Di
nKPi = number of different keyphrases
nXi = number of different words with affixes
nSi = number of different stop words
R = (nKPi*N2) + (nXi*N3) + (nSi*N4)
r = randomize (1..R)
for r = 1 .. R, save w(r) to Di
create a formatted corpus of N1 documents
Figure 7. Automatic Corpus Generation Algorithm
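The corpus generation of Figure 7 can be sketched in Python as follows. This is a minimal sketch, not the generator used in the experiments: the function names, the per-item occurrence counts, the fixed random seed and the sample Indonesian words are assumptions for illustration, and the tag formatting and file output steps are omitted.

```python
import random

def generate_document(keyphrases, affixed_words, stop_words, rng):
    """Build one simulated document.

    Each argument maps an item (phrase or word) to its number of occurrences.
    """
    pool = []
    for items in (keyphrases, affixed_words, stop_words):
        for item, count in items.items():
            pool.extend([item] * count)
    rng.shuffle(pool)  # randomize the order of all occurrences
    return " ".join(pool)

def generate_corpus(n_docs, keyphrases, affixed_words, stop_words, seed=0):
    """Generate a corpus of n_docs simulated documents."""
    rng = random.Random(seed)
    return [generate_document(keyphrases, affixed_words, stop_words, rng)
            for _ in range(n_docs)]

# Hypothetical inputs: occurrences per document of each item.
keyphrases = {"pasar modal": 3, "harga saham": 2}
affixed_words = {"perdagangan": 2}
stop_words = {"yang": 4, "dan": 4}

corpus = generate_corpus(5, keyphrases, affixed_words, stop_words)
print(corpus[0])
```

Keyphrases are inserted as whole units before shuffling, so each multi-word keyphrase stays contiguous in the generated document.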
Figure 8. Illustration: List of Words as a Single Word in Experiment
Figure 9. Illustration: List of Words as Bi-Words and Three-Words in Experiment
3. RESULTS AND DISCUSSION
3.1. Experiment of Keyphrase Matching
Experiment I (Figure 6) is the keyphrase matching process. The list of model keyphrases and the inputted keyphrases used in the matching process are determined first from three subdomains of economy (Figures 8 and 9). A total of 100 corpora is generated and arranged into five settings of model data (Table 2). Each model data setting consists of 20 corpora generated by the Automatic Corpus Generation algorithm (Figure 7). Each corpus consists of 15 – 900 documents. Each document contains 20 model keyphrases, each occurring 2-40 times as words or phrases, and 24 words with affixes, each occurring 2-40 times. Each document also contains a list of stop words, each occurring 4-60 times. The list of inputted keyphrases has the same content as the list of model keyphrases.
Table 2. Setting of Model Data (Simulation Data)
No.   Model Data Setting
1     (number of KP) = (number of words with affixes)
2     (number of KP) > (number of words with affixes)
3     (number of KP) < (number of words with affixes)
4     (number of KP) >> (number of words with affixes)
5     (number of KP) << (number of words with affixes)
The results of experiment I, the matching process between the inputted keyphrases (KP) and the AST developed from each document of a corpus, are shown in Table 3 (column 4). They are given as averages and standard deviations of the percentage of the total matching score over the 100 corpora processed. Each score is related to one of the four types of text preprocessing (column 5, Table 3), which are explained in section 2.2 (Table 1). The number of model keyphrases (Score0) is determined first in the corpus generator (Figure 7) and is also used for the comparison in columns 1-3 of Table 3.
Table 3. Average and Standard Deviation: Results of Matching of List of Keyphrases using Model Data (Matching Score Comparison)
Columns (1)-(4): Number of Matching Keyphrase / Total Keyphrase Model (% of Average ± Standard Deviation). Column (5): Preprocessing code (StopWords + Stem).
No.   = Score0       > Score0       < Score0       Total Score of KP   Preprocessing
      (1)            (2)            (3)            (4)                 (5)
1     48.42±9.39     26.50±5.16     8.57±7.41      83.49±2.64          00
2     50.38±13.73    32.38±12.36    15.44±13.83    98.20±2.83          10
3     32.54±9.49     41.81±5.35     8.57±7.41      82.92±2.59          01
4     33.00±13.73    47.69±12.56    15.30±13.77    96.94±3.12          11
The share of keyphrases whose extracted count equals the model count (= Score0, column 1 of Table 3) is 48% or higher if the text is preprocessed without the stemmer, with or without “remove stop words” (codes 00 and 10, column 5 of Table 3). In the same experiment, the share of keyphrases whose extracted count exceeds the model count (> Score0, column 2 of Table 3) is 41% or higher if the stemmer is applied in preprocessing, with or without “remove stop words” (codes 01 and 11). Finally, the share of keyphrases whose extracted count is below the model count (< Score0, column 3 of Table 3) is 15.3% or higher if the text is preprocessed with “remove stop words”, with or without the stemmer (codes 10 and 11).
The total score of the matching process in experiment I (column 4 of Table 3) is better, about 96% or more, if the process uses a corpus preprocessed with “remove stop words”, with or without stemming (codes 10 and 11, column 5 of Table 3), compared with about 82-83% for a corpus preprocessed without “remove stop words”, with or without stemming (codes 00 and 01). From the score values it can be seen that removing stop words raises the total keyphrase matching score more than stemming does. In general, keyphrase matching gives a better score if the inputted keyphrases are matched with the AST of a corpus with stop words removed and non-stemmed words (code 10), or of a corpus with stop words removed and stemmed words (code 11). Removing stop words contributes most to finding keyphrases in text when the inputted model keyphrases serve as references for keyphrase extraction. Meanwhile, stemming can slightly reduce the number of keyphrases extracted, because the stemmer sometimes diminishes the meaning of words that represent the domain.
3.2. Experiment of Automatic Keyphrase Extraction using Model Data
Experiment II (Figure 6) is the automatic keyphrase extraction process. Each model data setting in Table 4 and Table 5 (column 1) consists of 8 corpora and uses the same model data as experiment I (referring to the model data settings in Table 2). The results of experiment II are shown in columns 2, 3, 4 and 5 of Table 4 and columns 2 and 3 of Table 5.
Table 4 compares the number of extracted keyphrases with the number of model keyphrases. The experiments use a stemmed corpus and a non-stemmed corpus for extracting phrases of 1-3 words as keyphrases. The results of automatic keyphrase extraction are more than 90% (columns 2-5, Table 4). This means that almost all model keyphrases of 1-3 words or 2-3 words, with or without stemming, are extracted automatically from the corpus under the model data settings (column 1, Table 4). Extracting keyphrases using non-stemmed words gives the same or a higher percentage of extraction.
Table 4. Average and Standard Deviation: Results of Automatic Extraction of Keyphrases using Model Data of Five Settings Compared to Number of Keyphrases Model
Columns (2)-(5): Number of Match Extracted KP / Number of KP Model (% of Average ± Standard Deviation).
No   Model Data Setting                        1-3 Words NonStem   1-3 Words Stem   2-3 Words NonStem   2-3 Words Stem
     (1)                                       (2)                 (3)              (4)                 (5)
1    (no. of KP) = (no. of words & affixes)    100.00±0.00         100.00±0.00      100.00±0.00         100.00±0.00
2    (no. of KP) > (no. of words & affixes)    93.75±11.57         93.75±11.57      92.65±13.62         92.65±13.62
3    (no. of KP) < (no. of words & affixes)    95.00±8.86          93.75±10.94      97.79±4.38          96.32±6.99
4    (no. of KP) >> (no. of words & affixes)   94.38±10.50         94.38±10.50      93.38±12.35         93.38±12.35
5    (no. of KP) << (no. of words & affixes)   97.50±3.78          93.75±7.91       99.26±2.08          94.85±6.62
Among the five model data settings (column 1, Table 4), the models in which the number of keyphrases is lower than the number of words with affixes (rows 3 and 5, Table 4) give better results than the models in which the number of keyphrases is greater than the number of words with affixes (rows 2 and 4, Table 4). This is because the keyphrases in the settings of rows 3 and 5 are well separated from the words with affixes, so the keyphrases can be matched more easily. In particular, for the model data setting of row 1, where the number of keyphrases equals the number of words with affixes, the extraction results reach 100%. This setting shows that the method can extract all keyphrases in a specific case.
Table 5. Average and Standard Deviation: Results of Automatic Extraction of Keyphrases using Model Data of Five Settings Compared to Number of All Phrases Extracted
Values in columns (2)-(3): number of matching extracted KP / number of extracted KP (%, average ± standard deviation)

| No | Model Data Setting (1) | 1-3 Stem Words (2) | 2-3 Stem Words (3) |
|---|---|---|---|
| 1 | (no. of KP) = (no. of words & affixes) | 86.50 ± 1.28 | 84.49 ± 1.43 |
| 2 | (no. of KP) > (no. of words & affixes) | 77.47 ± 10.74 | 80.24 ± 16.44 |
| 3 | (no. of KP) < (no. of words & affixes) | 83.01 ± 5.83 | 82.76 ± 5.37 |
| 4 | (no. of KP) >> (no. of words & affixes) | 93.38 ± 13.60 | 93.38 ± 16.44 |
| 5 | (no. of KP) << (no. of words & affixes) | 81.79 ± 8.57 | 80.03 ± 9.11 |
Table 5 compares the number of matching extracted keyphrases with the number of all extracted phrases. Extracted phrases are not always keyphrases. In this experiment, the phrases and keyphrases are extracted from the stemmed corpus. For 1-3 words, about 77-93% of the extracted phrases match model keyphrases (column 2, table 5); for 2-3 words, the share is about 80-93% (column 3, table 5). The remaining extracted phrases, roughly 7-23% depending on the setting, are non-keyphrases.
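The share reported in table 5, and an average ± standard deviation over several runs, can be sketched in Python as follows. This is only an illustration: the function names, the example phrases and the three run values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def keyphrase_share(extracted, model):
    """Percentage of extracted phrases that are model keyphrases."""
    extracted_set = set(extracted)
    if not extracted_set:
        raise ValueError("no phrases were extracted")
    return 100.0 * len(extracted_set & set(model)) / len(extracted_set)


def summarize(percentages):
    """Average and sample standard deviation of per-run percentages (assumed estimator)."""
    return mean(percentages), stdev(percentages)


# Made-up example: one of four extracted phrases is not a model keyphrase.
extracted = ["suku bunga", "nilai tukar", "pasar modal", "akan tetapi"]
model = ["suku bunga", "nilai tukar", "pasar modal"]
share = keyphrase_share(extracted, model)
non_keyphrase_share = 100.0 - share

# Hypothetical per-run percentages, only to show the summary step.
average, deviation = summarize([80.0, 90.0, 100.0])
```

The complement `100 - share` is the proportion of extracted phrases that are non-keyphrases.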
Across the five model data settings (column 1, table 5), the settings in which the number of keyphrases is equal to or lower than the number of words with affixes (rows 1, 3 and 5, table 5) give better results than the setting in which the number of keyphrases is greater than the number of words with affixes (row 2, table 5). The reason is the same as for table 4: the keyphrases are well separated from the words with affixes, so they can be matched more easily. In the setting of row 4 in particular, in which the number of keyphrases is much greater than the number of words with affixes, the result reaches 93% because of the large number of keyphrases; consequently, almost all keyphrases are extracted.
The model data results for automatic keyphrase extraction (table 4) show that almost all model keyphrases (more than 90%) can be extracted using the Annotated Suffix Tree (AST) combined with text preprocessing. Table 5 gives a more realistic picture: more than 77% of all phrases extracted by the proposed method are keyphrases. The proposed method is therefore good enough for extracting keyphrases, or phrases that represent the domain of a text.
Other methods such as KEA, which extracts keyphrases using Naive Bayes [1], need training data to build a keyphrase prediction model. The proposed method, by contrast, extracts keyphrases automatically without training data.
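To make the idea of training-free extraction concrete, the following Python sketch builds a simplified, word-level suffix trie whose nodes are annotated with frequencies, and reads off phrases of at least two words that reach a frequency threshold. It is an illustrative simplification under stated assumptions, not the AST construction or scoring used in this paper; the names `Node`, `build_ast` and `frequent_phrases`, the threshold and the example sentences are all hypothetical.

```python
class Node:
    """Trie node annotated with the number of times its path was seen."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_ast(sentences, max_len=3):
    """Insert every suffix of every sentence, truncated to max_len words."""
    root = Node()
    for words in sentences:
        for start in range(len(words)):
            node = root
            for word in words[start:start + max_len]:
                node = node.children.setdefault(word, Node())
                node.count += 1
    return root


def frequent_phrases(root, min_len=2, min_count=2):
    """Collect phrases of at least min_len words whose frequency reaches min_count."""
    found = {}

    def walk(node, path):
        if len(path) >= min_len and node.count >= min_count:
            found[" ".join(path)] = node.count
        for word, child in node.children.items():
            walk(child, path + [word])

    walk(root, [])
    return found


# Made-up, already tokenized sentences.
sentences = [
    ["suku", "bunga", "bank"],
    ["suku", "bunga", "naik"],
    ["harga", "naik"],
]
candidates = frequent_phrases(build_ast(sentences))
```

No labelled examples are needed: the candidates come only from frequencies counted in the collection itself.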
3.3. Experiment of Automatic Keyphrase Extraction using Modified Model Data and Real Data
This experiment is still related to experiment II. The data in table 6 are a modification of the model data of experiment I, obtained by adding a list of non-keyphrases whose frequencies are lower than the matching threshold. To generate the modified model data, the addition algorithm (figure 10) is combined with the corpus generator (figure 7). The purpose of the additional data is to bring the model closer to real text data.
The results of this experiment focus on the extraction of 2-3 words, because this is where it is decided whether a phrase has meaning or not. The automatic keyphrase extraction results in column 2 of table 6 are compared against the model keyphrases.
The modified model data (column 1 of table 6) consist of two model data corpora that use the same type of domain, keyphrases and words with affixes (some of which contain keyphrases). In model data I, the number of keyphrases is greater than the number of words with affixes. The number of each non-keyphrase is 50 and the number of documents is 120. For each document, the numbers of keyphrases, words with affixes, stop words and non-keyphrases are 5, 3, 12 and 1, respectively. In model data II, the number of keyphrases is smaller than the number of words with affixes; the other parameters are the same as in model data I. Both models use the same list of non-keyphrases.
Combine this code with the algorithm in figure 7:
    non-keyphrase = y; {y | y part of G}
    N5 = number of non-keyphrases at D_i
    nY_i = number of different non-keyphrases
    w ∈ {x, p, s, y}
    R = (nKP_i * N2) + (nX_i * N3) + (nS_i * N4) + (nY_i * N5)
    r = randomize(1..R)
    for r = 1..R, save w(r) to D_i
    create a formatted corpus of N1 documents
Figure 10. Part of Modified Corpus Generator by Adding Non Keyphrases
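The steps of figure 10 can be sketched in Python as follows. This is a rough reading of the pseudocode, not the generator used in the experiments: it assumes that N2 to N5 are the numbers of occurrences of each keyphrase, word with affixes, stop word and non-keyphrase in a document, and that the R items are written in random order. The function names, the seed and the example vocabulary are hypothetical.

```python
import random


def generate_document(keyphrases, affixed_words, stop_words, non_keyphrases,
                      n2, n3, n4, n5, rng):
    """Build one document D_i as a shuffled list of R items."""
    pool = (
        [p for p in keyphrases for _ in range(n2)]        # nKP_i * N2 items
        + [x for x in affixed_words for _ in range(n3)]   # nX_i * N3 items
        + [s for s in stop_words for _ in range(n4)]      # nS_i * N4 items
        + [y for y in non_keyphrases for _ in range(n5)]  # nY_i * N5 items
    )
    rng.shuffle(pool)  # random order of the R items
    return pool


def generate_corpus(n1, seed=0, **doc_args):
    """Create a corpus of N1 documents with a reproducible random order."""
    rng = random.Random(seed)
    return [generate_document(rng=rng, **doc_args) for _ in range(n1)]


# Made-up vocabulary and counts, only to exercise the generator.
corpus = generate_corpus(
    n1=3,
    keyphrases=["suku bunga", "nilai tukar"],
    affixed_words=["kenaikan"],
    stop_words=["di", "dan"],
    non_keyphrases=["akan tetapi"],
    n2=5, n3=3, n4=12, n5=1,
)
```

With these values each document holds R = 2*5 + 1*3 + 2*12 + 1*1 = 38 items, and the single non-keyphrase occurs once per document, well below the frequency of the keyphrases.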
Table 6. Results of Automatic Extraction of Keyphrases using Model Data by Addition of Non Keyphrases (2-3 Keyphrases)

| Modified Model Data (1) | Number of Extracted KP Model / Number of KP Model (%) (2) | Number of Meaningful Phrases / Number of Extracted Phrases (%) (3) | Separated KP Candidates of Main Domain (2-Means) (4) | TF-IDF Table from Preprocessing (5) |
|---|---|---|---|---|
| I | 94.12 | 44.44 | >90% | 00, 01, 11 |
| II | 86.00 | 36.00 | >90% | 00, 01, 10 |
The comparisons in table 6 (columns 2 and 3) cover only keyphrases of 2-3 words. Column 2 shows that more than 80% of the model keyphrases can be extracted. Manual validation of the scores in column 3 shows that about 36-44% of the extracted phrases are meaningful phrases, and that all of the meaningful phrases extracted are keyphrases. Columns 2 and 3 also show that model data I gives better extraction results than model data II. Automatic keyphrase extraction using the AST in this experiment therefore produces a better keyphrase output for data in which the keyphrases are more dominant than the other phrases (the case of model data I versus model data II).
The extracted phrases and their matching scores from automatic keyphrase extraction are arranged by the Table Score Arranging Algorithm (figure 4) into four types of TF-IDF table (TF-IDF00, TF-IDF01, TF-IDF10 and TF-IDF11) for the corpora of model data I and II. Each of the four types of preprocessing (table 1) is applied in the scoring process.
The TF-IDF table (documents versus extracted phrases) is the input to 2-means clustering, an unsupervised learning process. The objective of the clustering is to separate keyphrase candidates (main domain) from the other extracted phrases in a corpus. A manual check of the cluster that contains predominantly meaningful phrases shows that more than 90% of the keyphrase candidates or meaningful phrases can be separated (column 4 of table 6). The TF-IDF tables that give a significant separation of keyphrase candidates are listed in column 5 of table 6; the input tables that give good clustering results for both model data sets are TF-IDF00 and TF-IDF01. In general, automatic keyphrase extraction on the modified model data combined with 2-means clustering separates the cluster of extracted keyphrase candidates from a document collection with a performance score of about 90%.
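The clustering step can be illustrated with a small, self-contained Python sketch. It assumes the standard weighting tf × log(N/df), treats each extracted phrase as a vector of TF-IDF values over the documents, and runs Lloyd's algorithm with k = 2 from a deterministic start. The exact weighting, initialisation and data layout used in the experiments are not given in this section, so all of these choices, the function names and the example data are assumptions.

```python
import math


def tfidf_table(docs):
    """Return (phrases, rows); rows[i][j] is the TF-IDF of phrase i in document j."""
    phrases = sorted({p for d in docs for p in d})
    n = len(docs)
    rows = []
    for p in phrases:
        df = sum(1 for d in docs if p in d)   # documents containing the phrase
        idf = math.log(n / df)
        rows.append([d.count(p) * idf for d in docs])
    return phrases, rows


def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(rows, iterations=20):
    """Lloyd's algorithm with k = 2; returns a 0/1 label per row."""
    first = rows[0]
    # Deterministic start: the first row and the row farthest from it.
    centroids = [list(first), list(max(rows, key=lambda r: dist2(r, first)))]
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [0 if dist2(r, centroids[0]) <= dist2(r, centroids[1]) else 1
                  for r in rows]
        for k in (0, 1):
            members = [r for r, label in zip(rows, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up extracted phrases per document: two frequent phrases and three rare ones.
docs = [
    ["suku bunga"] * 4 + ["nilai tukar"] * 4 + ["dan juga"],
    ["suku bunga"] * 4 + ["nilai tukar"] * 4 + ["akan tetapi"],
    ["oleh karena"],
]
phrases, rows = tfidf_table(docs)
labels = two_means(rows)
cluster_of = dict(zip(phrases, labels))
```

In this toy example the two frequent phrases end up in one cluster and the rare phrases in the other. Deciding which cluster holds the keyphrase candidates is left to a manual check, as in the experiment.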
Table 7 presents the results of experiment II for three corpora of real document collections, mainly in the economy domain, with extraction of 2-3 words. The matching evaluation is done manually, because no model keyphrases exist for the real documents and so no comparison data are available. A manual check of all extracted phrases (column 2, table 7) shows that more than 70% of the phrases extracted from the real data are meaningful phrases. Meaningful words or phrases are not always keyphrases: for example, "kenaikan harga" (price increase) is a meaningful phrase but not a keyphrase.
Table 7. Result of Automatic Extraction of Keyphrases using Real Data

| Real Data (no. of doc) (1) | Meaningful Phrases / All Phrases (%) (2) | Separated KP Candidates of Main Domain (2-Means) (3) |
|---|---|---|
| 10 | 76.92 | >90% |
| 29 | 71.43 | >90% |
| 240 | 72.00 | >90% |
The clustering step applied to the modified model data is also applied to the real data to separate meaningful phrases, or keyphrase candidates, from all extracted phrases. A manual check of the 2-means result shows that about 90% of the phrases in the cluster dominated by meaningful phrases are meaningful (column 3 of table 7). The clustering scores for the real data are almost the same as those for the modified model data (column 4 of table 6).
In general, the AST can be used to extract keyphrases automatically and can be combined with a clustering method to obtain meaningful phrases. The result is better when the input to clustering is the TF-IDF00 or TF-IDF01 table, that is, when stop words are not removed; stemming can be applied as an option.
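The four preprocessing variants behind the TF-IDF tables can be sketched in Python as follows. The sketch assumes that the first digit of the table code switches stop-word removal and the second digit switches stemming, which is how the statement above about TF-IDF00 and TF-IDF01 reads; the definitive coding is the one given in table 1. The stop-word list and the lookup-based `toy_stem` are hypothetical stand-ins for a real stop list and stemmer.

```python
STOP_WORDS = {"di", "dan", "yang"}                            # tiny illustrative stop list
TOY_STEMS = {"kenaikan": "naik", "perdagangan": "dagang"}     # stand-in for a real stemmer


def toy_stem(word):
    """Hypothetical stemmer: looks the word up in a small table."""
    return TOY_STEMS.get(word, word)


def preprocess(tokens, variant):
    """Apply the variant '00', '01', '10' or '11' (assumed: stop-word digit, stem digit)."""
    remove_stop_words = variant[0] == "1"
    stem = variant[1] == "1"
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return tokens


tokens = ["kenaikan", "harga", "di", "pasar"]
variants = {code: preprocess(tokens, code) for code in ("00", "01", "10", "11")}
```

Under this reading, variants 00 and 01 keep the stop word "di" in place, so phrases are built from the original word order of the text.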
4. CONCLUSION
Keyphrase matching, scoring and extraction using the AST technique, combined with text preprocessing and followed by clustering, form a method for extracting keyphrases from text that generally extracts meaningful phrases. The initial process runs automatically and does not need domain experts. Manual checking is done at the end of the experiment to produce a list of extracted keyphrase candidates. This reduces the effort of experts, because they only assess the keyphrase candidates that have already been extracted at the end of the experiments. Specifically, the AST structure uses less memory, and text preprocessing to determine keyphrases is more efficient, than the conventional process using the 2-gram or 3-gram method. In general, the proposed method can be used for semi-automatic keyphrase extraction from a document collection that is not tied to a specific domain corpus. As a result, the method can be used in other domains, because it detects keyphrases without checking their meaning when it runs automatically. The comparison with other methods, such as N-gram and keyphrase extraction using Naive Bayes, is only relative, because the methods differ in data structure and rules. Even so, the experiments show that the proposed method is able to extract keyphrases.
Future work will cluster or classify the keyphrase candidates more precisely, including into keyphrase and non-keyphrase categories related to the domain investigated. The results can be used to develop a knowledge base for a domain and as query candidates in an information retrieval system.