International Journal of Electrical and Computer Engineering (IJECE)
Vol. 10, No. 6, December 2020, pp. 6361–6369
ISSN: 2088-8708, DOI: 10.11591/ijece.v10i6.pp6361-6369
Text documents clustering using modified multi-verse optimizer
Ammar Kamal Abasi1, Ahamad Tajudin Khader2, Mohammed Azmi Al-Betar3, Syibrah Naim4, Mohammed A. Awadallah5, Osama Ahmad Alomari6

1,2School of Computer Sciences, Universiti Sains Malaysia (USM), Malaysia
3Department of Information Technology, Al-Huson University College, Jordan
4Technology Department, Endicott College of International Studies (ECIS), Woosong University, Korea
5Department of Computer Science, Al-Aqsa University, Palestine
6Department of Computer Engineering, Faculty of Engineering and Architecture, Istanbul Gelisim University, Turkey
Article Info

Article history:
Received Mar 29, 2020
Revised May 2, 2020
Accepted May 18, 2020

Keywords:
Multi-verse optimizer
Optimization
Swarm intelligence
Text document clustering
ABSTRACT

In this study, a multi-verse optimizer (MVO) is utilised for the text document clustering (TDC) problem. TDC is treated as a discrete optimization problem, and an objective function based on the Euclidean distance is applied as a similarity measure. TDC is tackled by the division of the documents into clusters; documents belonging to the same cluster are similar, whereas those belonging to different clusters are dissimilar. MVO, which is a recent metaheuristic optimization algorithm established for continuous optimization problems, can intelligently navigate different areas in the search space and search deeply in each area using a particular learning mechanism. The proposed algorithm is called MVO TDC, and it adopts the convergence behaviour of MVO operators to deal with discrete, rather than continuous, optimization problems. For evaluating MVO TDC, a comprehensive comparative study is conducted on six text document datasets with various numbers of documents and clusters. The quality of the final results is assessed using precision, recall, F-measure, entropy, accuracy, and purity measures. Experimental results reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms. Statistical analysis is also conducted and shows that MVO TDC can produce significant results in comparison with three well-established methods.
Copyright © 2020 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Ammar Kamal Abasi,
School of Computer Sciences,
Universiti Sains Malaysia (USM),
11800 Pulau Pinang, Malaysia.
Email: ammar abasi@student.usm.my
1. INTRODUCTION
In the current digital era, massive amounts of online text documents inundate the web every day. Manipulating these text documents is important for improving the query results returned by search engines, unsupervised text organisation systems, text classification, text summarization, knowledge extraction processes, information retrieval services, text mining processes and scientific document clustering [1].
Many approaches have been proposed for the unsupervised organisation of text documents. Text document clustering (TDC) is an effective and efficient technique used by researchers in this domain [2]; this field of text mining enables the organisation of large amounts of textual data. TDC can be defined as an unsupervised automatic document clustering technique that utilises the document similarities rule to divide documents into homogeneous clusters. In other words, text documents in the same cluster are similar, whereas those in different clusters are dissimilar [3].

Journal homepage: http://ijece.iaescore.com
Conventionally, clustering methods can be classified into two main groups: (i) partitional clustering and (ii) hierarchical clustering. K-means and K-medoids are simple and easy-to-use methods that can be tailored to suit large-scale text document datasets. They are iterative clustering-based techniques initiated with predefined numbers of cluster centroids. At each iteration, documents are distributed into clusters according to similarity functions, depending on the distance between each centroid and its closest document. Then, the cluster centroid is iteratively updated according to the documents belonging to the same cluster. This operation is stopped when all the documents are moved into the right cluster by means of stagnated cluster centroids. The main shortcoming of these methods is their convergence behaviour; they move in one direction in a single search space region and do not perform a wider scan of the whole search space. Therefore, they can easily become trapped in local optima due to the unknown shapes of search spaces.
Given that K-means is a local search area method and TDC is formulated as an optimization problem [4], optimization methods that can escape the local optima can be utilized for TDC [5]. The most successful algorithms recently utilized for TDC are metaheuristic-based algorithms [6].
The first type of metaheuristic algorithm is evolutionary-based algorithms (EAs), which are initiated with a group of provisional individuals called a population. Generation after generation, the population is evolved on the basis of three main operators: recombination for mixing the individual features, mutation for diversifying the search and selection for utilising the survival-of-the-fittest principle. The EA is stopped when no further evolution can be achieved. The main shortcoming of EAs is that although they can simultaneously navigate several areas in the search space, they cannot perform deep searching in each area to which they navigate. Consequently, EAs mostly suffer from premature convergence. EAs that have been successfully utilized for TDC include the genetic algorithm (GA) [7] and harmony search [8].
The second type of metaheuristic algorithm is trajectory-based algorithms (TAs); a single solution is used to launch such an algorithm. This solution is improved by repetition using neighbouring-moves operators until a local optimal solution, which is in the same search space region, is reached [9]. While TAs can extensively search the initial solution search region and achieve local optima, they cannot simultaneously navigate numerous search space regions. The main TAs utilized for TDC are K-means and K-medoids. Other TAs used for TDC are self-organizing maps (SOM) [10] and β-hill climbing.
The last type of metaheuristic algorithm is swarm intelligence (SI); an SI algorithm is also initiated with a set of random solutions called a swarm. Iteration after iteration, the solutions in the swarm are reconstructed by attracting them towards the best solutions found so far [11]. SI-based algorithms can easily converge prematurely. Several SI-based methods are utilized for TDC, such as particle swarm optimization (PSO) [12] and artificial bee colony [13].
The multi-verse optimizer (MVO) algorithm was recently proposed as a stochastic population-based algorithm [14] inspired by multi-verse theory. The big bang theory explains the origin of the universe to have been a massive explosion. According to this theory, the origin of everything in our universe requires one big bang. Multi-verse theory holds that more than one explosion (big bang) occurred, with each big bang creating a new and independent universe. This theory is modelled as an optimization algorithm with three concepts: white hole, black hole and wormhole, for performing exploration, exploitation and local search, respectively. The MVO has been utilized for a wide range of optimization problems, such as identifying the optimal parameters of PEMFC stacks [15], unmanned aerial vehicle path planning [16], clustering problems [17], feature selection [18], neural networks [19], and optimising SVM parameters [18].
This paper adapts the MVO algorithm for the TDC problem using the Euclidean distance as a similarity measure. The adaptation includes modifying the convergence behaviour of MVO operators to deal with the discrete, rather than continuous, optimization problem. The main advantage of the proposed method is that it improves the quality of final outcomes for TDC problems. A comprehensive comparative study is conducted on six text document benchmark datasets that have different numbers of clusters and documents. The quality of the final results is analysed with a discussion using accuracy, precision, recall, F-measure, purity and entropy criteria. The findings of the experimental analyses reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms.
The rest of this paper is organised as follows. Section 2 presents the structure of MVO for TDC. Section 3 discusses the experimental results of MVO. Section 4 gives the conclusion of this study and the authors' future work.
2. MULTI-VERSE OPTIMIZER FOR TEXT DOCUMENT CLUSTERING
This section describes the main components utilized for tackling TDC. The term 'components' is used to denote the adaptation elements that are conducted for solving the TDC problem using the MVO algorithm and their sequence, including (i) TDC pre-processing, (ii) solution representation and (iii) calculation of the objective function (evaluation of the solutions). Finally, the question of how MVO is adapted to TDC is addressed.
2.1. Text document clustering (TDC)

TDC aims to divide documents into clusters; each cluster has similar documents, whereas documents in different clusters are dissimilar [20, 21]. In the use of any clustering algorithm for TDC, the text needs some necessary and preliminary steps (pre-processing); this step filters unnecessary data, such as special formatting, special characters and numbers, out of the text. Thereafter, the pre-processed text document terms are converted into numerical form for further processing. The main goal of this step is to improve the quality of features and reduce the implementation complexity of the TDC algorithm [22]. The text pre-processing includes tokenization, stop word removal and stemming, which are discussed in detail in the succeeding subsections.
2.1.1. Tokenization phase

In the tokenization phase, each document is broken down into a set of tokens (words), where a token is any sequence of characters separated by spaces. Each document is then formulated as a word instance count (pieces) as a bag-of-words model [23]. Note that the word instance count is filtered through removal of empty sequences, number formatting and collapsing, among other tasks [24].
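As an illustration only (not taken from the paper), the tokenization and word-instance-count step described above can be sketched in Python; the letters-only regular expression is an assumed filtering choice that also drops numbers and special characters:

```python
import re
from collections import Counter

def tokenize(document: str) -> list:
    # Break a document into lowercase word tokens; the pattern keeps
    # alphabetic sequences only, filtering out numbers and punctuation.
    return re.findall(r"[a-z]+", document.lower())

def bag_of_words(document: str) -> Counter:
    # Word instance counts for one document (bag-of-words model).
    return Counter(tokenize(document))

counts = bag_of_words("Clustering groups 20 documents; clustering is unsupervised.")
# 'clustering' is counted twice, while the number '20' is filtered out.
```

A real system would typically rely on a dedicated tokenizer, but the counting idea is the same.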
2.1.2. Stop word removal phase

In the stop word removal phase, commonly repeated terms (e.g., 'a', 'an', 'the', 'who', 'be', 'about', 'again', 'any', 'against'), pronouns (e.g. 'she', 'he', 'it'), conjunctions (e.g. 'but', 'and', 'or') and similar words are removed because they have high frequencies and negatively affect the clustering process (i.e. they hinder the clustering algorithm). This process improves the clustering performance and reduces the number of processed words or terms [22].
2.1.3. Stemming phase

Stemming is the process of decomposing terms to their roots by removal of affixes (prefixes and suffixes) [25]. For example, the root of the word 'stemming' is 'stem'. In the English language, many terms may share the same root; for example, the words 'connects', 'connected' and 'connecting' all stem from the same root, which is 'connect' in www.text-processing.com/demo/stem/. The stemming process attempts to improve the clustering by reducing the number of different terms that have similar grammatical properties and stem from a single term [26].
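The stop word removal and stemming phases can be sketched together as a minimal Python pipeline. The suffix-stripping stemmer and the tiny stop word set below are deliberately simplified assumptions for illustration; production systems would use a full stop word list and a proper stemmer such as Porter's:

```python
def simple_stem(word: str) -> str:
    # Toy suffix-stripping stemmer (illustration only, not Porter's):
    # strip a common suffix when a root of at least 3 letters remains.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A tiny assumed stop word set; real lists contain hundreds of entries.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "she", "he", "it"}

def preprocess(tokens: list) -> list:
    # Stop word removal followed by stemming, mirroring the two phases.
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]

stems = preprocess(["she", "connects", "and", "connected", "the", "connecting"])
# All three 'connect*' variants collapse to the shared root 'connect'.
```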
2.1.4. Solution representation

Each solution is represented as a vector x = (x_1, x_2, ..., x_d), where d is the number of documents. Figure 1 shows an example of a solution representation. In this example, five clusters contain twenty documents; for example, cluster two has three documents {4, 8, 9}, and document 10 (i.e., x_10) is in cluster three.

Figure 1. Solution representation
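To make the representation concrete, a hypothetical solution vector consistent with the example above can be written directly in Python (the specific label assignments beyond those stated in the text are invented for illustration):

```python
# Position i holds the cluster label of document i+1, so a solution for
# d documents is simply a length-d vector of cluster ids in 1..K.
solution = [1, 3, 1, 2, 5, 4, 1, 2, 2, 3,
            4, 5, 1, 3, 4, 5, 1, 1, 3, 4]  # d = 20 documents, K = 5 clusters

# Recover the members of cluster two (documents are 1-based in the text).
cluster_two = [d + 1 for d, label in enumerate(solution) if label == 2]

# The cluster of document 10 (x_10).
cluster_of_doc10 = solution[10 - 1]
```

With this encoding, any metaheuristic move only needs to change integer labels, which is why the MVO operators must be restricted to discrete values.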
2.1.5. Objective function

In this study, for each solution x the objective function is calculated using the average distance of documents to the cluster centroid (ADDC) as shown in (1).

\min f(x) = \frac{1}{k} \sum_{i=1}^{k} \left( \frac{1}{n_i} \sum_{j=1}^{n_i} D(C_i, d_j) \right)    (1)
where D(C_i, d_j) is the distance between the centroid of cluster i and document j, n_i is the number of documents in cluster i, k is the number of clusters, and f(x) is the objective function (i.e. minimise the distance).
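A minimal sketch of the ADDC objective of equation (1) follows; computing each centroid as the mean of its member document vectors is our assumption, since the paper does not spell out the centroid computation here:

```python
import math

def euclidean(a, b):
    # Euclidean distance between two term-weight vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def addc(solution, docs, k):
    # Equation (1): average, over the k clusters, of the mean distance
    # from each cluster's documents to that cluster's centroid.
    total = 0.0
    for c in range(1, k + 1):
        members = [docs[i] for i, label in enumerate(solution) if label == c]
        if not members:
            continue  # empty cluster contributes nothing (an assumption)
        dim = len(members[0])
        centroid = [sum(m[j] for m in members) / len(members) for j in range(dim)]
        total += sum(euclidean(centroid, m) for m in members) / len(members)
    return total / k

docs = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
score = addc([1, 1, 2, 2], docs, 2)  # each document is 1.0 from its centroid
```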
2.2. Multi-verse in optimization context

MVO [14] is inspired by multi-verse theory. According to multi-verse theory, universes connect and might even collide with each other. MVO engages and reformulates this theory using three main concepts: wormholes, black holes, and white holes. A probability is used for determining the inflation rate (which corresponds to the objective function in the optimization context) of each universe, thereby allowing the universe to be assigned one of these holes. Given that a universe has a high inflation rate, the probability of a white hole existing increases. Meanwhile, a low inflation rate leads to an increased probability of a black hole existing [14]. Regardless of the universe's inflation rate, wormholes move objects towards the best universe randomly [15, 27]. The black and white hole concepts in MVO are formulated for exploring search spaces, and the wormhole concept is formulated for exploiting search spaces.
Like other EAs, MVO is initiated with a population of individuals (universes). Thereafter, MVO improves these solutions until a stopping criterion is met. The conceptual model of MVO in [14] shows the movements of objects between universes via white/black hole tunnels. These hole tunnels are created between two universes on the basis of the inflation rate of each universe (i.e. one universe has a higher inflation rate than the other). Objects move from universes with high inflation rates through white holes; these objects are received by universes with low inflation rates through black holes. After a population of solutions is initiated, all solutions in MVO are sorted from high inflation rates to low ones. Thereafter, MVO visits the solutions one by one to attract them towards the best one. This is done under the assumption that the solution being visited has the black hole. As for the white holes, the roulette wheel mechanism is used for selecting one solution.
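The roulette wheel mechanism for white-hole selection mentioned above can be sketched as follows; normalising by the sum of inflation rates is the standard choice, assumed here:

```python
import random

def roulette_wheel(inflation_rates, rng=random.random):
    # Select one universe index with probability proportional to its
    # inflation rate, as MVO does when choosing a white-hole universe.
    total = sum(inflation_rates)
    pick = rng() * total
    cumulative = 0.0
    for i, rate in enumerate(inflation_rates):
        cumulative += rate
        if pick <= cumulative:
            return i
    return len(inflation_rates) - 1  # guard against floating-point drift

idx = roulette_wheel([0.1, 0.6, 0.3])  # index 1 is chosen most often
```

Injecting `rng` makes the selection deterministic for testing; MVO itself simply draws a uniform random number.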
2.3. Adapting MVO for TDC

After the pre-processing step, MVO is used to split documents into their parent clusters. In this study, the solution representation and the objective function formulated above are used. The steps of classical MVO in [14] are adopted for TDC with certain modifications. These modifications are related to the nature of the problem variables. Given that the clustering problem is discrete in nature [28] and MVO was originally proposed for continuous optimization problems, MVO should deal with discrete values of the decision variables of each TDC solution. During the MVO execution, the generation function and the wormhole equation (2) are adjusted for deciding the feasible solution as follows:
x_i^j = \begin{cases} \begin{cases} x_j + TDR \times ((ub_j - lb_j) \times r_4 + lb_j) & r_3 < 0.5 \\ x_j - TDR \times ((ub_j - lb_j) \times r_4 + lb_j) & r_3 \ge 0.5 \end{cases} & r_2 < WEP \\ x_i^j & r_2 \ge WEP \end{cases}    (2)
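A hedged Python sketch of this adjusted wormhole update for discrete cluster labels is given below. The branch structure (r_2 against WEP, r_3 choosing the sign, r_4 scaling the step) follows equation (2); rounding and clamping the result to a valid cluster id is our assumption of how feasibility is enforced for the discrete variables:

```python
import random

def wormhole_update(x_best_j, x_i_j, lb, ub, wep, tdr, rng=random):
    # Wormhole move of equation (2): with probability WEP the variable
    # jumps around the best universe's value x_best_j (sign chosen by
    # r3, step scaled by TDR and r4); otherwise it is left unchanged.
    if rng.random() < wep:                       # r2 < WEP
        step = tdr * ((ub - lb) * rng.random() + lb)   # TDR*((ub-lb)*r4+lb)
        if rng.random() < 0.5:                   # r3 < 0.5
            value = x_best_j + step
        else:                                    # r3 >= 0.5
            value = x_best_j - step
    else:                                        # r2 >= WEP
        value = x_i_j
    # Feasibility repair for discrete cluster ids (our assumption).
    return min(ub, max(lb, round(value)))

random.seed(7)
label = wormhole_update(x_best_j=3, x_i_j=2, lb=1, ub=5, wep=0.6, tdr=0.4)
```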
A general overview of MVO for TDC is provided in Figure 2, which visualises the procedural steps.

Figure 2. Process of MVO TDC
3. EXPERIMENTAL RESULTS

For evaluating the performance of the proposed method, a set of designed experiments is conducted using six instances of standard datasets formulated for measuring the performance of text clustering techniques. Six evaluation measures are used, as conventionally done: precision, recall, F-measure, entropy, accuracy, and purity criteria. For comparative evaluation, results obtained in terms of the evaluation measures are compared with those obtained by three state-of-the-art algorithms (K-means clustering, GA and PSO) using the same objective function. The experiments are conducted using the MATLAB programming language. Thorough descriptions of the experimental results are given in the following subsections.
3.1. Standard datasets

Table 1 provides the characteristics of the six text document datasets used in this study: CSTR, 20Newsgroups and Classic4 from sites.labic.icmc.usp.br/text-collections, and tr12, tr41 and Wap from glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/datasets.tar.gz.
Table 1. Text document dataset characteristics

ID   Datasets       Number of documents (d)   Number of features or terms (t)   Number of clusters (K)
DS1  CSTR           299                       1725                              4
DS2  20Newsgroups   300                       2275                              3
DS3  tr12           313                       5329                              8
DS4  tr41           878                       6743                              10
DS5  Wap            1560                      7512                              20
DS6  Classic4       2000                      6500                              4
3.2. Results and discussion

The results obtained by MVO TDC are summarised in Table 2, and the parameter settings used in the experiments are given in Table 3. The results are summarised in terms of precision, recall, F-measure, entropy, accuracy, and purity for the six datasets. The findings prove the validity and effectiveness of the proposed MVO TDC in the distribution of text documents to the right clusters. The results are also presented to show the validity of the proposed method in comparison with three well-known methods: GA, K-means and PSO. Table 3 shows the parameter setting values for each compared algorithm. These parameter settings are used as suggested in [17].
A comparative analysis of K-means, GA, PSO and MVO TDC is provided in Table 2 in terms of precision, recall, F-measure, entropy, accuracy, and purity; the average values for each measure are recorded. The results obtained by the K-means clustering algorithm are worse than those obtained by the other algorithms for nearly all datasets. The possible justification is that K-means is a local search algorithm; therefore, it is highly likely to fall into local optima due to its inability to explore the problem search space effectively. Meanwhile, population-based metaheuristic algorithms, such as GA, PSO and MVO TDC, can explore different areas in the search space simultaneously and can consequently achieve better exploration properties.
Table 2 also shows that MVO TDC attains minimum entropy and maximum purity, precision, recall, F-measure and accuracy for five datasets (i.e. DS1, DS2, DS3, DS4, DS6). The ability of the proposed MVO TDC algorithm to reach, during the search, the right balance between exploitation and exploration with a powerful learning mechanism strengthens its performance in achieving impressive outcomes in comparison with the other methods.
Table 2 provides the results of the F-measure for all compared methods, including MVO TDC. Notably, MVO TDC produces the best F-measure values for five datasets. Furthermore, GA, PSO and MVO TDC outperform K-means on all the datasets.
From a different perspective, Table 2 also shows the accuracy of all compared algorithms. In general, the results obtained by MVO TDC are better than those of the other methods. In fact, the results can change slightly from one dataset to another because clustering algorithms are normally highly sensitive to the dataset search space. This can be validated by the finding that MVO TDC obtains the best accuracy on five datasets and the second-best for DS5.
The purity measure of clusters is another external evaluation. It measures the maximum class for each cluster. In general, the closer the purity value is to 1, the better the clustering solution.
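The purity measure can be sketched as follows (a common formulation, assumed here since the paper does not print its formula): for each cluster, count the size of its dominant true class, then divide the sum of these counts by the total number of documents.

```python
from collections import Counter

def purity(labels_pred, labels_true):
    # Group true class labels by predicted cluster, take the size of the
    # dominant class in each cluster, and normalise by document count.
    clusters = {}
    for pred, true in zip(labels_pred, labels_true):
        clusters.setdefault(pred, []).append(true)
    dominant = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return dominant / len(labels_pred)

# Two clusters of three documents; each has a dominant class of size 2.
p = purity([1, 1, 1, 2, 2, 2], ["a", "a", "b", "b", "b", "c"])  # -> 4/6
```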
Table 2 shows the results of the purity measure for all compared methods on all datasets. MVO TDC outperforms K-means, GA and PSO on five datasets (i.e. DS1, DS2, DS3, DS4, DS6). The proposed algorithm obtains a 21.5% improvement percentage for DS1 with respect to K-means.
For DS2, MVO TDC's purity values show improvements of 6.0%, 2.6% and 2.4% over those acquired by K-means, GA and PSO, respectively. Meanwhile, the obtained improvements are 15.4%, 9.3%, 5.7%, 19.7%, 4.7%, 2.9%, 8.0%, 4.2% and 4.9% for the text document standard datasets DS3, DS4 and DS6.
In summary, the results shown in Table 2 reveal that MVO TDC outperforms all compared algorithms in terms of cluster quality (i.e. F-measure and purity).
Entropy is another external measure used in evaluating and comparing the quality of clustering algorithms. The entropy value is zero only when all documents in a single class are placed in a single cluster. In this case, the one-cluster solution is considered the best.
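One common formulation of clustering entropy (assumed here; the paper does not give its exact formula or logarithm base) is the weighted average, over clusters, of the class entropy inside each cluster:

```python
import math
from collections import Counter

def entropy(labels_pred, labels_true):
    # Weighted average over clusters of the true-class entropy inside
    # each cluster (log base 2 assumed); 0 means every cluster is pure.
    clusters = {}
    for pred, true in zip(labels_pred, labels_true):
        clusters.setdefault(pred, []).append(true)
    total = len(labels_pred)
    value = 0.0
    for members in clusters.values():
        h = 0.0
        for count in Counter(members).values():
            p = count / len(members)
            h -= p * math.log2(p)
        value += (len(members) / total) * h
    return value

perfect = entropy([1, 1, 2, 2], ["a", "a", "b", "b"])  # pure clusters -> 0.0
mixed = entropy([1, 1, 1, 1], ["a", "a", "b", "b"])    # one mixed cluster -> 1.0
```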
Table 2 shows the entropy measure values obtained by all the compared algorithms on the different datasets. The bigger the entropy value, the worse the clustering solution. According to the results, MVO TDC provides low entropy values for most of the datasets, which means that it performs better than the other algorithms and offers the best clustering solution. Notably, K-means produces the worst entropy measure for all datasets, whereas GA and PSO are again ranked between MVO TDC and K-means.
The superior performance of MVO TDC is due to its explorative capability in the search space.
The objective function is determined by ADDC for all clustering algorithms so that the distance between the documents in each cluster is minimized. Figure 3 depicts the convergence trends of GA, PSO and MVO TDC using ADDC values. The x-axis is the stream of iteration numbers, whereas the y-axis is the stream of ADDC values. Notably, the convergence rate of MVO TDC is fairly fast for all datasets except DS5.
Table 2. Results of the Accuracy, Precision, Recall, F-measure, Purity and Entropy for the K-means, GA, PSO, and MVO TDC algorithms over 30 independent runs for DS1, DS2, DS3, DS4, DS5, and DS6

Dataset  Measure     K-means [17]  GA [29]  PSO [12]  MVO TDC
DS1      Accuracy    0.3573        0.3398   0.4355    0.4593
         Precision   0.4091        0.4416   0.5340    0.571
         Recall      0.3091        0.3417   0.4359    0.4829
         F-Measure   0.3459        0.3886   0.4819    0.5243
         Purity      0.3524        0.4050   0.4953    0.5684
         Entropy     0.8201        0.7170   0.6199    0.5206
DS2      Accuracy    0.3180        0.3675   0.3498    0.4044
         Precision   0.3121        0.4209   0.4134    0.4391
         Recall      0.3099        0.3676   0.3496    0.384
         F-Measure   0.3406        0.3935   0.3803    0.4109
         Purity      0.3741        0.4080   0.4096    0.4343
         Entropy     0.8028        0.7546   0.7722    0.7120
DS3      Accuracy    0.2971        0.3676   0.4075    0.4485
         Precision   0.3522        0.4128   0.4297    0.5075
         Recall      0.2944        0.3549   0.4263    0.4398
         F-Measure   0.3221        0.3826   0.4277    0.4705
         Purity      0.3907        0.4512   0.4877    0.5448
         Entropy     0.7137        0.6233   0.5719    0.5224
DS4      Accuracy    0.4125        0.4320   0.4870    0.4630
         Precision   0.3944        0.4140   0.4505    0.4568
         Recall      0.3812        0.4007   0.4496    0.4418
         F-Measure   0.3876        0.4071   0.4497    0.4568
         Purity      0.4107        0.5602   0.5789    0.6081
         Entropy     0.5874        0.5469   0.5391    0.5355
DS5      Accuracy    0.5011        0.5316   0.5622    0.5291
         Precision   0.4626        0.5313   0.5249    0.5213
         Recall      0.4010        0.4705   0.4810    0.4496
         F-Measure   0.4314        0.4997   0.5016    0.4831
         Purity      0.4759        0.4916   0.6124    0.6069
         Entropy     0.7043        0.6216   0.5765    0.6625
DS6      Accuracy    0.5858        0.6620   0.6363    0.7042
         Precision   0.5698        0.6725   0.6603    0.6919
         Recall      0.5259        0.6319   0.6163    0.6843
         F-Measure   0.5471        0.6518   0.6377    0.6880
         Purity      0.5938        0.6319   0.6242    0.6742
         Entropy     0.5600        0.5780   0.5306    0.5112
Table 3. Parametric values for different variants of TDC algorithms

Algorithm                    Parameters                    Value
All optimization algorithms  Population size               60
All optimization algorithms  Maximum number of iterations  1000
All optimization algorithms  Runs                          30
Proposed method (MVO TDC)    WEP Max                       1
Proposed method (MVO TDC)    WEP Min                       0.2
Proposed method (MVO TDC)    p                             6
GA                           Crossover probability         0.80
GA                           Mutation probability          0.02
PSO                          Maximum inertia weight        0.9
PSO                          Minimum inertia weight        0.2
PSO                          C1                            2
PSO                          C2                            2
Figure 3. Convergence characteristics of GA, PSO and MVO TDC on datasets DS1, DS2, DS3, DS4, DS5 and DS6 (each panel plots ADDC against the number of iterations, 0–1000; the DS6 ADDC values are scaled by 10^-3)
It is worth emphasizing that MVO TDC can be used to address specific optimization problems such as EEG signal denoising [30], the gene selection problem [31], and power scheduling problems [32]. Despite MVO TDC's superiority among the competitive algorithms, it remains sensitive to the characteristics of the datasets, making it difficult to predict its behavior on new datasets.
4. CONCLUSION AND FUTURE WORK

This paper proposes a metaheuristic optimization algorithm called the multi-verse optimizer (MVO) for solving the text document clustering (TDC) problem, i.e. MVO TDC. This method introduces a new strategy of sharing information between solutions on the basis of an objective function and learns from the best solution instead of the global best (i.e. all solutions). The convergence of the results of MVO TDC is impressive due to the method's achievement of the appropriate balance between exploitation and exploration search during each run.
The proposed MVO TDC is evaluated using six text document datasets with various sizes and complexities. The numbers of documents and clusters in each dataset are given. The quality of the obtained results is assessed using six measures: precision, recall, F-measure, entropy, accuracy, and purity. These measures are also used for a comparative evaluation in which three well-known clustering algorithms are used: K-means, the genetic algorithm (GA) and particle swarm optimisation (PSO). For all measures, the results obtained by MVO TDC are significantly better than those produced by the three compared methods. In terms of computational time, MVO TDC is slower than K-means and requires nearly the same computational time as GA and PSO. Therefore, MVO TDC can be considered an efficient clustering method for the text clustering domain.
Given the successful outcomes of MVO for the TDC problem, MVO TDC can be implemented for different types of clustering problems. MVO can also be further improved by the addition or modification of its operators so that it can address other discrete optimisation problems, such as scheduling. In addition, datasets other than those used in this work can be used in future studies. Moreover, the MVO can be hybridized with local search strategies to improve the initial solutions and the exploitation capability during the optimization process.
REFERENCES
[1] N. Saini, S. Saha, and P. Bhattacharyya, "Automatic scientific document clustering using self-organized multi-objective differential evolution," Cognitive Computation, pp. 1–23, 2018.
[2] I. Arın, M. K. Erpam, and Y. Saygın, "I-twec: Interactive clustering tool for twitter," Expert Systems with Applications, vol. 96, pp. 1–13, 2018.
[3] A. K. Abasi, A. T. Khader, M. A. Al-Betar, S. Naim, S. N. Makhadmeh, and Z. A. A. Alyasseri, "Link-based multi-verse optimizer for text documents clustering," Applied Soft Computing, vol. 87, 2020.
[4] W. Song, W. Ma, and Y. Qiao, "Particle swarm optimization algorithm with environmental factors for clustering analysis," Soft Computing, vol. 21, no. 2, pp. 283–293, 2017.
[5] Z. A. A. Alyasseri, A. T. Khader, M. A. Al-Betar, M. A. Awadallah, and X.-S. Yang, "Variants of the flower pollination algorithm: a review," Nature-Inspired Algorithms and Applied Optimization, pp. 91–118, 2018.
[6] A. K. Abasi, A. T. Khader, M. A. Al-Betar, S. Naim, S. N. Makhadmeh, and Z. A. A. Alyasseri, "A text feature selection technique based on binary multi-verse optimizer for text clustering," IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology, pp. 1–6, 2019.
[7] J.-H. Jiang, J.-H. Wang, X. Chu, and R.-Q. Yu, "Clustering data using a modified integer genetic algorithm (IGA)," Analytica Chimica Acta, vol. 354, no. 1, pp. 263–274, 1997.
[8] R. Forsati, M. Mahdavi, M. Shamsfard, and M. R. Meybodi, "Efficient stochastic algorithms for document clustering," Information Sciences, vol. 220, pp. 269–291, 2013.
[9] O. A. Alomari, A. T. Khader, M. A. Al-Betar, and M. A. Awadallah, "A novel gene selection method using modified mrmr and hybrid bat-inspired algorithm with β-hill climbing," Applied Intelligence, vol. 48, no. 11, pp. 4429–4447, 2018.
[10] N. Saini, S. Saha, A. Harsh, and P. Bhattacharyya, "Sophisticated SOM based genetic operators in multi-objective clustering framework," Applied Intelligence, pp. 1–20, 2018.
[11] M. Mavrovouniotis, C. Li, and S. Yang, "A survey of swarm intelligence for dynamic optimization: Algorithms and applications," Swarm and Evolutionary Computation, vol. 33, pp. 1–17, 2017.
[12] T. Cura, "A particle swarm optimization approach to clustering," Expert Systems with Applications, vol. 39, no. 1, pp. 1582–1588, 2012.
[13] K. K. Bharti and P. K. Singh, "Chaotic gradient artificial bee colony for text clustering," Soft Computing, vol. 20, no. 3, pp. 1113–1126, 2016.
[14] S. Mirjalili, S. M. Mirjalili, and A. Hatamlou, "Multi-verse optimizer: a nature-inspired algorithm for global optimization," Neural Computing and Applications, vol. 27, no. 2, pp. 495–513, 2016.
[15] A. Fathy and H. Rezk, "Multi-verse optimizer for identifying the optimal parameters of pemfc model," Energy, vol. 143, pp. 634–644, 2018.
[16] P. Kumar, S. Garg, A. Singh, S. Batra, N. Kumar, and I. You, "Mvo-based two-dimensional path planning scheme for providing quality of service in uav environment," IEEE Internet of Things Journal, 2018.
[17] A. K. Abasi, A. T. Khader, M. A. Al-Betar, S. Naim, S. N. Makhadmeh, and Z. A. A. Alyasseri, "A novel hybrid multi-verse optimizer with k-means for text documents clustering," Neural Computing and Applications, 2020.
[18] H. Faris, M. A. Hassonah, A.-Z. Ala'M, S. Mirjalili, and I. Aljarah, "A multi-verse optimizer approach for feature selection and optimizing svm parameters based on a robust system architecture," Neural Computing and Applications, pp. 1–15, 2017.
[19] I. Benmessahel, K. Xie, and M. Chellal, "A new evolutionary neural networks based on intrusion detection systems using multiverse optimization," Applied Intelligence, pp. 1–13, 2017.
[20] H.-S. Park and C.-H. Jun, "A simple and fast algorithm for k-medoids clustering," Expert Systems with Applications, vol. 36, no. 2, pp. 3336–3341, 2009.
[21] A. Huang, "Similarity measures for text document clustering," Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), pp. 49–56, 2008.
[22] A. I. Kadhim, Y.-N. Cheah, and N. H. Ahamed, "Text document preprocessing and dimension reduction techniques for text document clustering," 4th International Conference on Artificial Intelligence with Applications in Engineering and Technology (ICAIET), pp. 69–73, 2014.
[23] R. Zhao and K. Mao, "Fuzzy bag-of-words model for document representation," IEEE Transactions on Fuzzy Systems, vol. 26, no. 2, pp. 794–804, 2018.
[24] K. K. Bharti and P. K. Singh, "Opposition chaotic fitness mutation based adaptive inertia weight bpso for feature selection in text clustering," Applied Soft Computing, vol. 43, pp. 20–34, 2016.
[25] J. Singh and V. Gupta, "A systematic review of text stemming techniques," Artificial Intelligence Review, vol. 48, no. 2, pp. 157–217, 2017.
[26] M. N. P. Katariya, M. Chaudhari, B. Subhani, G. Laxminarayana, K. Matey, M. A. Nikose, S. A. Tinkhede, and S. Deshpande, "Text preprocessing for text mining using side information," International Journal of Computer Science and Mobile Applications, vol. 3, no. 1, pp. 01–05, 2015.
[27] D. Janiga, R. Czarnota, J. Stopa, P. Wojnarowski, and P. Kosowski, "Performance of nature inspired optimization algorithms for polymer enhanced oil recovery process," Journal of Petroleum Science and Engineering, vol. 154, pp. 354–366, 2017.
[28] W. Song, Y. Qiao, S. C. Park, and X. Qian, "A hybrid evolutionary computation approach with its application for optimizing text document clustering," Expert Systems with Applications, vol. 42, no. 5, pp. 2517–2524, 2015.
[29] D. Mustafi and G. Sahoo, "A hybrid approach using genetic algorithm and the differential evolution heuristic for enhanced initialization of the k-means algorithm with applications in text clustering," Soft Computing, pp. 1–18, 2018.
[30] Z. A. A. Alyasseri, A. T. Khader, M. A. Al-Betar, A. K. Abasi, and S. N. Makhadmeh, "EEG signals denoising using optimal wavelet transform hybridized with efficient metaheuristic methods," IEEE Access, vol. 8, pp. 10584–10605, 2019.
[31] M. A. Al-Betar, O. A. Alomari, and S. M. Abu-Romman, "A triz-inspired bat algorithm for gene selection in cancer classification," Genomics, vol. 112, no. 1, pp. 114–126, 2020.
[32] S. N. Makhadmeh, A. T. Khader, M. A. Al-Betar, S. Naim, A. K. Abasi, and Z. A. A. Alyasseri, "Optimization methods for power scheduling problems in smart home: Survey," Renewable and Sustainable Energy Reviews, vol. 115, 2019.