International Journal of Electrical and Computer Engineering (IJECE)
Vol. 5, No. 4, August 2015, pp. 782~787
ISSN: 2088-8708
Journal homepage: http://iaesjournal.com/online/index.php/IJECE
Scientific Documents Clustering Based on Text Summarization
Pedram Vahdani Amoli, Omid Sojoodi Sh.
Faculty of Electrical, Computer and IT Engineering, Qazvin Islamic Azad University
Nokhbegan Blvd, Qazvin, Iran
Article Info
Article history:
Received Jan 27, 2015
Revised Apr 25, 2015
Accepted May 18, 2015

ABSTRACT
In this paper a novel method is proposed for scientific document clustering. The proposed method is a summarization-based hybrid algorithm which comprises a preprocessing phase. In the summarization phase, unimportant words which are not frequently used in the document are removed. This process reduces the amount of data for the clustering purpose. In the proposed method, after the preprocessing phase, Term Frequency/Inverse Document Frequency (TFIDF) is calculated for all words in the document, and BM25 is calculated for words in sentences and summed over the document to score each word at the document level. In the next phase, text summarization is performed based on the BM25 scores. After that, document clustering is done according to the calculated TFIDF scores. The hybrid progress of the proposed scheme, from the preprocessing phase to cluster labeling, yields a rapid and efficient clustering method, which is evaluated on 400 English texts extracted from scientific articles on 11 different topics. The proposed method is compared with the CSSA, SMTC and Max-Capture methods. The results demonstrate the proficiency of the proposed scheme in terms of computation time and comparative efficiency using the F-measure criterion.
Keyword:
Scoring
Summarization
Text Clustering
Text Mining
Copyright © 2015 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
P. V. Amoli,
Faculty of Electrical, Computer and IT Engineering, Qazvin Islamic Azad University,
Nokhbegan Blvd, Qazvin, Iran.
Email: pedramvahdani@gmail.com
1. INTRODUCTION
Information retrieval is an area concerned with searching for documents in order to analyze them. The critical part of its approaches is representing the content of a document, and one of these approaches is clustering. Normally, documents are represented as a bag of words, which means the words are assumed to occur independently. Many researchers have tried to group words into topics in order to represent the importance of relations between words within a group [1].
Document clustering, or text clustering, is one of the main topics in text mining. It refers to the process of grouping documents with similar contents or topics into clusters to improve both the availability and reliability of documents [2-3]. Text documents are grouped together into clusters on the basis of their similarities and separated into different groups on the basis of their dissimilarities; this concept forms the foundation of text document clustering [2-4]. Data clustering is a difficult task in computing because it is computationally hard [5].
Recently there has been some research on sentence-based text clustering, which highlights the importance of measuring words at the sentence level. This is because a word gains a higher measurement result at the sentence level compared to the document level.
A. Skabar and Kh. Abdalgader [6] introduce a sentence-level approach that is also fuzzy and operates on the input data by using a graph representation in an EM framework.
K. Jeyalakshmi, R. Deepa and M. Manjula [7] present a novel hierarchical fuzzy relational clustering algorithm that operates on relational input data. Their algorithm uses a graph representation of the data and operates in a Fuzzification Degree framework in which the graph centrality of an object in the graph is interpreted as a likelihood.
E.R.M. Seno and M.D.G.V. Nunes [8] propose an evaluation framework based on an incremental and unsupervised clustering method, which is combined with statistical similarity metrics to measure the semantic distance between sentences.
S. Sharma and V. Gupta [9] have presented a text clustering approach that uses Karaka for processing Punjabi text and processes only the top twenty terms of each document for clustering.
There are three kinds of problems in document clustering. The first is how to define the similarity of two documents. The second is how to decide the appropriate number of document clusters in a text collection, and the third is how to cluster documents precisely so that they correspond to the natural clusters. Researchers have therefore tried to address these issues by proposing their own algorithms.
In order to achieve a good clustering result, a variety of methods have been proposed using techniques such as dimension reduction, pattern analysis and semantic processing. Besides accuracy, another problem of text clustering is running time, which is also important for measuring efficiency.
In this paper, a new approach is proposed with the aim of achieving effective clustering. The effectiveness and accuracy of text clustering depend on the purity of the clusters. Therefore, selecting the items that are most related to a specific cluster is the most critical issue. Since documents are measured based on their words, a proper approach for validating useful words is the most important issue. For this purpose, the approach presented in this paper uses a new way to select useful words: it eliminates non-useful words and measures each word based on the sentences in which it occurs. Running time is also monitored and counted as an important issue because it affects efficiency.
2. RESEARCH METHOD
In this section the proposed hybrid method is described. The aim is to attain a fast and efficient text clustering method. The proposed algorithm consists of four main phases: a) preprocessing phase, b) word weighting and scoring phase, c) summarization phase and d) clustering phase. Each text is read from a saved txt file which holds the content of an article extracted from a website.
2.1. Preprocessing Phase
In the preprocessing phase, frequently used words that commonly occur throughout the text are removed. Hence the amount of data for clustering is reduced. After that, stemming is performed with the Porter algorithm [10]. This step helps the algorithm to treat different forms of the same stem together and so to search for frequent items more efficiently.
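The preprocessing step can be illustrated with a minimal Python sketch. The stop-word list and the helper names `naive_stem` and `preprocess` are assumptions made for illustration, and the suffix stripper is only a crude stand-in for the Porter algorithm [10], not an implementation of it.

```python
import re
from typing import List

# Tiny illustrative stop-word list (assumption; the paper does not give its list).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}

# Suffixes removed by the stand-in stemmer below (assumption, not Porter's rules).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip at most one common suffix; a rough substitute for Porter stemming."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence: str) -> List[str]:
    """Lowercase, tokenize, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

In practice a full Porter stemmer from an existing library would replace `naive_stem`; the sketch only shows where stop-word removal and stemming sit in the pipeline.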
2.2. Weighting and Scoring Phase
In this phase, Term Frequency/Inverse Document Frequency (TFIDF) is calculated for each word in the document, and Okapi BM25 is calculated at the sentence level and summed over the document. The BM25 score of each word is used as the weight factor for the next step, summarization.
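The paper does not state which TFIDF variant it uses. As a hedged illustration, the sketch below assumes the common form of raw term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word; the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: weight} map per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency inside this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights
```

With this variant a word that occurs in every document receives weight zero, so it cannot contribute to the similarity between documents.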
2.3. Summarization
The objective of summarization is to remove non-important words from further processing. With the amount of data reduced, clustering becomes faster and more efficient. The BM25 formula is used for text summarization as,
$$\mathrm{BM25}(w, s) = \log\left(\frac{N - n(w) + 0.5}{n(w) + 0.5}\right) \cdot \frac{f(w, s) \cdot (k + 1)}{f(w, s) + k \cdot \left(1 - b + b \cdot \frac{|s|}{|\bar{s}|}\right)} \qquad (1)$$

In which $\mathrm{BM25}(w, s)$ is the score value of word $w$ in the sentence $s$. Term $f(w, s)$ is the frequency of the word $w$ in the sentence $s$, while $|s|$ and $|\bar{s}|$ are the sentence length and the average sentence length over the text. Parameter $N$ is the number of sentences in the text and $n(w)$ is the number of documents comprising the word $w$. Constants $b$ and $k$ are found optimally as 0.75 and 2, as will be explained in detail in the experiments. It should be noted that if the BM25 value is less than one for a word, the word is eliminated as a non-important one. Hence, a set of important frequent words is created for the document.
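As a rough illustration of this scoring and elimination step, the Python sketch below computes a sentence-level BM25 score for each word, sums it over the text, and keeps only words whose total is at least one. It assumes that each sentence plays the role of a "document" when counting how many units contain a word; the function names `bm25_word_scores` and `summarize` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set


def bm25_word_scores(sentences: List[List[str]],
                     k: float = 2.0, b: float = 0.75) -> Dict[str, float]:
    """Sum the sentence-level BM25 score of every word over the whole text."""
    n_sent = len(sentences)
    if n_sent == 0:
        return {}
    avg_len = sum(len(s) for s in sentences) / n_sent
    # Assumption: n(w) is the number of sentences that contain the word.
    containing = Counter(w for s in sentences for w in set(s))
    totals: Dict[str, float] = defaultdict(float)
    for sent in sentences:
        for w, f in Counter(sent).items():
            idf = math.log((n_sent - containing[w] + 0.5) / (containing[w] + 0.5))
            norm = f + k * (1 - b + b * len(sent) / avg_len)
            totals[w] += idf * f * (k + 1) / norm
    return dict(totals)


def summarize(sentences: List[List[str]], threshold: float = 1.0) -> Set[str]:
    """Keep only the words whose summed BM25 score reaches the threshold."""
    scores = bm25_word_scores(sentences)
    return {w for w, score in scores.items() if score >= threshold}
```

Words that appear in most sentences get a negative logarithmic factor and are therefore dropped by the threshold, which is how the step shrinks the vocabulary passed on to clustering.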
2.4. Clustering
The clustering process is performed by considering the TFIDF values of frequent words. The algorithm can be summarized as follows:
a) If the document being analyzed is the first one, put it in a cluster of its own; otherwise go to step (b).
b) Evaluate the document against every cluster, summing the weights of the words shared between this document and the other documents in each cluster. When no cluster remains, go to step (c).
c) Find the largest weight. If it is not zero, assign the document to the corresponding cluster; otherwise go to step (d).
d) Assign a new cluster to this document.
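The four steps above can be sketched as a single pass over the documents. The sketch assumes that the "weight" of a cluster for a document is the sum of the document's own TFIDF weights over the words it shares with that cluster; the paper does not spell out this detail, and the function name `cluster_documents` is hypothetical.

```python
from typing import Dict, List, Set


def cluster_documents(doc_weights: List[Dict[str, float]]) -> List[List[int]]:
    """Single-pass clustering; returns clusters as lists of document indices."""
    clusters: List[List[int]] = []
    vocab: List[Set[str]] = []  # words seen so far in each cluster
    for idx, weights in enumerate(doc_weights):
        best, best_score = -1, 0.0
        # Step (b): score the document against every existing cluster.
        for c, words in enumerate(vocab):
            score = sum(wt for w, wt in weights.items() if w in words)
            if score > best_score:
                best, best_score = c, score
        if best >= 0:
            # Step (c): the largest weight is non-zero, so join that cluster.
            clusters[best].append(idx)
            vocab[best].update(weights)
        else:
            # Steps (a) and (d): first document, or no overlap with any cluster.
            clusters.append([idx])
            vocab.append(set(weights))
    return clusters
```

Because the number of clusters grows only when a document shares no weighted word with any existing cluster, no cluster count has to be fixed in advance.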
Finally, the clusters are labeled as follows:
a) For each cluster, find the word with the largest weight.
b) Among the weights in the cluster, find the largest one.
c) Label the cluster with the word that has the largest weight.
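A possible reading of the labeling steps is sketched below: for each cluster, scan the word weights of its member documents and use the word with the single largest weight as the label. The data layout (clusters as lists of document indices, one weight map per document) and the function name `label_clusters` are assumptions for illustration.

```python
from typing import Dict, List, Optional


def label_clusters(clusters: List[List[int]],
                   doc_weights: List[Dict[str, float]]) -> List[Optional[str]]:
    """Label each cluster with its highest-weighted word (None if it has no words)."""
    labels: List[Optional[str]] = []
    for members in clusters:
        best_word, best_weight = None, float("-inf")
        for idx in members:
            for word, weight in doc_weights[idx].items():
                if weight > best_weight:
                    best_word, best_weight = word, weight
        labels.append(best_word)
    return labels
```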
3. RESULTS AND ANALYSIS
To evaluate the efficiency of the proposed method, 400 English texts in 4 groups of experiments were used. For each group, 10, 30, 50, 70, 90 and 100 samples were chosen. All texts were extracted from articles on scientific websites covering different topics (see Figure 1).
Figure 1. Article databases used for the experiments
The experiments consist of two parts. In the first part, the two constants in the BM25 formula are investigated to find their optimum values. After the method is optimized, it is compared with the CSSA [9], SMTC [11] and Max-Capture [12] methods in the second part of the experiments.
In the BM25 formula there are two parameters, k and b. To investigate the impact of parameter k on the efficiency of the proposed method, it is set to 1, 2, 3 and 4. Tables 1 & 2 show the F-measure and computation time of the method for the different values of parameter k.
[Figure 1 chart data, in extraction order. Sources: ScienceDaily, Digital Trends, Environment News Service, Mars Daily, Universe Today, MNN, HighBeam Research, Others. Article counts: 314, 20, 18, 18, 14, 4, 2, 10]
Table 1. F-measure for 4 different values of parameter k
F-measure   k=2 b=0.75   k=1 b=0.75   k=3 b=0.75   k=4 b=0.75
DS1         0.50         0.51         0.51         0.50
DS2         0.51         0.48         0.51         0.50
DS3         0.62         0.59         0.66         0.62
DS4         0.51         0.46         0.50         0.49
Average     0.53         0.51         0.54         0.52
Table 2. Run time (second) of the proposed method for 4 different values of parameter k
Time (Sec)  k=2 b=0.75   k=1 b=0.75   k=3 b=0.75   k=4 b=0.75
DS1         34.33        35.50        35.16        35.16
DS2         27.16        27.16        27.00        27.00
DS3         33.33        34.83        34.00        34.16
DS4         26.16        28.50        27.83        27.83
Average     30.25        31.50        31.00        31.04
The results given in Table 1 and Table 2 show that the best value for k is 2.
For the assessment of parameter b, 4 different values, 0.25, 0.75, 1 and 1.25, are set. Tables 3 & 4 show the F-measure values and computation time for the different values of parameter b.
Table 3. F-measure for 4 different values of parameter b
F-measure   b=0.75 k=2   b=1 k=2   b=1.25 k=2   b=0.25 k=2
DS1         0.5          0.51      0.53         0.46
DS2         0.51         0.52      0.46         0.47
DS3         0.62         0.63      0.65         0.68
DS4         0.51         0.5       0.46         0.5
Average     0.535        0.54      0.525        0.5275
Table 4. Run time (second) of the proposed method for 4 different values of parameter b
Time (Sec)  b=0.75 k=2   b=1 k=2   b=1.25 k=2   b=0.25 k=2
DS1         34.33        35.83     36.00        35.16
DS2         27.16        27.50     27.83        27.16
DS3         33.33        35.00     35.16        34.16
DS4         26.16        28.50     28.83        27.83
Average     30.25        31.70     31.95        31.08
According to the results given in Tables 3 & 4, the optimum value for parameter b is 0.75. In Table 3, the F-measure value for 0.75 is better than the values for 1.25 and 0.25, and only slightly worse than the value for 1. However, considering the results of both tables, it can be inferred that 0.75 is comparatively the optimum value for parameter b.
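The trade-off behind this choice can be re-derived from the reported per-dataset values. The short Python check below recomputes the averages of Tables 3 and 4; the dictionary layout and variable names are illustrative only, and the numbers are copied from the tables.

```python
# F-measure and run time (seconds) for DS1..DS4, keyed by b (k fixed at 2).
f_measure = {
    0.75: [0.50, 0.51, 0.62, 0.51],
    1.0: [0.51, 0.52, 0.63, 0.50],
    1.25: [0.53, 0.46, 0.65, 0.46],
    0.25: [0.46, 0.47, 0.68, 0.50],
}
runtime = {
    0.75: [34.33, 27.16, 33.33, 26.16],
    1.0: [35.83, 27.50, 35.00, 28.50],
    1.25: [36.00, 27.83, 35.16, 28.83],
    0.25: [35.16, 27.16, 34.16, 27.83],
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


f_avg = {b: mean(v) for b, v in f_measure.items()}
t_avg = {b: mean(v) for b, v in runtime.items()}

best_f = max(f_avg, key=f_avg.get)   # value of b with the highest average F-measure
fastest = min(t_avg, key=t_avg.get)  # value of b with the lowest average run time
print(best_f, fastest)
```

On these numbers b = 1 has the highest average F-measure by a small margin, while b = 0.75 has the lowest average run time, which matches the comparative argument for 0.75.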
In the second part of the experiment, the proposed method is compared with the CSSA [9], SMTC [11] and Max-Capture [12] methods. Two criteria are used for the assessment: computation time and F-measure.
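The paper does not define how the F-measure is computed for a clustering. One common definition, assumed here purely for illustration, matches each true class to the cluster that gives it the best F-score and averages these scores weighted by class size. The function name `clustering_f_measure` is hypothetical.

```python
from collections import Counter
from typing import Hashable, List


def clustering_f_measure(true_labels: List[Hashable],
                         cluster_ids: List[Hashable]) -> float:
    """Class-size-weighted best-match F-measure of a clustering."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    overlap = Counter(zip(true_labels, cluster_ids))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij:
                precision = n_ij / n_j  # share of the cluster that is this class
                recall = n_ij / n_i     # share of the class found in this cluster
                best = max(best, 2 * precision * recall / (precision + recall))
        total += n_i / n * best
    return total
```

Under this definition a perfect clustering scores 1.0, and putting every document into one cluster is penalized through low precision.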
The computation time of the compared methods is given in Figure 2. As can be seen from the results, the proposed method outperforms the other methods in terms of computation time.
Figure 2. Processing time for CSSA [9], SMTC [11], Max-Capture [12], and proposed method
Figure 3. Comparison of the proposed method with CSSA [9], SMTC [11] and Max-Capture [12] methods
The comparison of the F-measure results for these four methods is shown in Figure 3. The results show that the CSSA method yields better results than the other methods. In the first data set the proposed method obtained almost the same result as CSSA, but in the other three data sets it could not repeat this. Overall, the proposed method attains better performance than the SMTC and Max-Capture methods in terms of F-measure.
Both the SMTC and Max-Capture methods use all the words in the document, and this feature affects the clustering performance in terms of processing time and accuracy. The difference in accuracy between the proposed method and the CSSA method is caused by the differences in the concept of the scoring methods used for clustering.
4. CONCLUSION
In this paper a new, efficient and very fast approach for document clustering was proposed. The clustering results show that using text summarization to eliminate non-useful words is effective for clustering. It should also be mentioned that this
method achieved a very low running time compared to all the other methods in the comparison. We believe this is a result of the workload reduction achieved by summarization.
ACKNOWLEDGEMENTS
I would like to thank my supervisor, Dr. Sojoodi, for his advice and comments, which helped me throughout my project.
REFERENCES
[1] Ravi kumar V., K. Raghuveer, “Legal Documents Clustering and Summarization using Hierarchical Latent Dirichlet Allocation”, IAES International Journal of Artificial Intelligence (IJ-AI), Vol. 2, No. 1, March 2013, pp. 27~35, ISSN: 2252-8938.
[2] Tian Weixin, Zhu Fuxi, “Text Document Clustering Based On The Modifying Relations”, In Proceedings of IEEE International Conference on Computer Science and Software Engineering. 1 (12-14 Dec. 2008), 256-259.
[3] S. Murali Krishna and S. Durga Bhavani, “An Efficient Approach for Text Clustering Based on Frequent Itemsets”, European Journal of Scientific Research ISSN 1450-216X, 42, 3 (2010), EuroJournals Publishing, Inc. 2010, 399-410.
[4] Le Wang, Li Tian, Yan Jia and Weihong Han, “A Hybrid Algorithm for Web Document Clustering Based on Frequent Term Sets and k-Means”, In Proceedings APWeb/WAIM 2007 International Workshops: DBMAN 2007, WebETrends 2007, PAIS 2007 and ASWAN 2007.
[5] Parthajit Roy and J. K. Mandal, “A Novel Spectral Clustering based on Local Distribution”, International Journal of Electrical and Computer Engineering (IJECE), Vol. 5, No. 2, April 2015, pp. 361–370, ISSN: 2088-8708.
[6] Andrew Skabar, Khaled Abdalgader, “Clustering Sentence-Level Text Using a Novel Fuzzy Relational Clustering Algorithm”, IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 1, January 2013.
[7] K. Jeyalakshmi, R. Deepa, M. Manjula, “An Efficient Clustering Sentence-Level Text Using A Novel Hierarchical Fuzzy Relational Clustering Algorithm”, International Journal of Advanced Research in Computer and Communication Engineering Vol. 3, Issue 2, February 2014.
[8] Ramiz M. Aliguliyev, “A new sentence similarity measure and sentence based extractive technique for automatic text summarization”, Expert Systems with Applications 36 (2009) 7764–7772.
[9] Saurabh Sharma and Vishal Gupta, “Punjabi text clustering by sentence structure analysis”, CS & IT-CSCP, pp. 237–244, 2012.
[10] M.F. Porter, “An algorithm for suffix stripping”, 14(3): 130-137, 1980.
[11] Yuan L., “An Effective Chinese Short Message Texts Clustering Algorithm Based on The Ward’s Method”, 978-1-4577-0536-6/11/2011 IEEE, 1897-1899.
[12] Zhang W., T. Yoshida, X. Tang, Q. Wang, “Text clustering using frequent itemsets”, 2010. Knowledge-Based Systems 23, pp. 379–388.
BIOGRAPHIES OF AUTHORS
P. V. Amoli obtained a Bachelor's degree in Information Technology, majoring in Information System Engineering, in 2008. His topics of interest include data mining.
Dr. Sojoodi received his B.Sc in Software Eng. from QIAU, his M.Sc degree in AI from SRBIAU and his PhD degree in AI from UPM. He is with Islamic Azad Qazvin University as a lecturer. His research is in the field of data mining. Further info on his homepage: http://qiau.ac.ir/sojodishijani.info