TELKOMNIKA Indonesian Journal of Electrical Engineering
Vol. 12, No. 4, April 2014, pp. 2924 ~ 2935
DOI: http://dx.doi.org/10.11591/telkomnika.v12i4.4805

Received September 5, 2013; Revised November 9, 2013; Accepted November 23, 2013
Hybrid K-means Algorithm and Genetic Algorithm for
Cluster Analysis
Dianhu Cheng*1, Xiangqian Ding1, Jianxin Zeng2, Ning Yang3
1Ocean University of China, Shandong Qingdao, China
2China Tobacco Yunnan Industrial Co., Ltd, Yunnan, Kunming, China
3Newstar Computer Engineering Center of Qingdao Ocean University, Shandong Qingdao, China
*Corresponding author, email: 351134479@qq.com
Abstract
Cluster analysis is a fundamental technique in various fields such as pattern recognition, machine learning and so forth. However, in the K-means algorithm the number of clusters must be predefined by the user, which is impractical in many applications. Since determining the number of clusters is an NP-complete problem, a Genetic Algorithm is employed to solve it. In addition, because the conventional method is very time-consuming, an improved fitness function is proposed. According to the simulation results, the proposed approach is feasible and effective.
Keywords: cluster analysis, K-means algorithm, genetic algorithm, cluster number, time consumption
Copyright © 2014 Institute of Advanced Engineering and Science. All rights reserved.
1. Introduction
Cluster analysis groups a set of objects in such a way that objects in the same cluster have similar characteristics to each other, while objects in different clusters have differing characteristics. In recent research it has been a useful tool for statistical data analysis and has been applied in data fusion, machine learning, information retrieval and signal processing. As a general problem, cluster analysis can be solved by various kinds of algorithms, chosen according to the notion of cluster adopted and the efficiency of the algorithms. Common notions of clusters include groups with small distances among the cluster members, dense areas of the data space, particular intervals of the distribution, and so on. For this reason, clustering can be considered a multi-objective optimization problem. The type of the individual data set and the expected target decide which clustering algorithm is employed and can be helpful for parameter tuning.
When cluster analysis is treated as a multi-objective optimization problem, there are trials and failures in the interactive process of knowledge discovery. Therefore it is necessary to tune parameters and modify the optimization models until the desired results are achieved.
The algorithms for cluster analysis can be categorized based on the cluster model. Up to now, more than a hundred algorithms have been proposed, and none of them is overwhelmingly better than the others. However, it is possible to choose a proper algorithm for a particular problem from experimental or empirical results, unless one cluster model can be proven better than the others in a mathematical way. An example is that the K-means method cannot find non-convex clusters [1].
The prominent clustering methods are connectivity-based clustering (such as hierarchical clustering), centroid-based clustering (such as k-means clustering), distribution-based clustering and density-based clustering. In this decade, a lot of attention has been paid to cluster analysis and the performance has been improved [2-4].
With the development of information science, the processing of huge data sets, as in big-data problems, has become a pressing need. The willingness to trade the semantic meaning of the generated clusters for performance has been increasing dramatically. This has resulted in the development of pre-clustering methods such as canopy clustering, which can process huge data sets efficiently; the resulting "clusters", however, are merely a rough pre-partitioning of the data set, to be analyzed afterwards with existing slower methods such as k-means clustering. Various other approaches to clustering have also been tried, such as seed-based clustering [5].
TELKOMNIKA ISSN: 2302-4046
For data sets with high dimension, many of the existing approaches fail at clustering due to the curse of dimensionality: the usual distance functions become ineffective for clustering data in high-dimensional spaces. This has led to the generation of novel clustering methods for high-dimensional data, which mainly focus on subspace clustering and on correlation clustering, the latter of which also searches for arbitrarily rotated subspace clusters that can be modeled by a correlation of their characteristics.
Famous algorithms of this kind include CLIQUE [6] and SUBCLU [7]. In addition, ideas inspired by density-based clustering methods (in particular the DBSCAN/OPTICS family of algorithms) have been adapted to subspace clustering, such as HiSC [8] and DiSH [9], and to correlation clustering, such as HiCO [10]. The notion of "correlation connectivity" is proposed and implemented in 4C [11] and ERiC [12], which explore hierarchical density-based correlation clusters.
Based on the concept of mutual information, several different clustering systems have been proposed; examples are Marina Meilă's variation of information metric [13] and hierarchical clustering [14, 15]. In addition, message-passing algorithms, a recently proposed method, have led to the creation of new types of clustering algorithms [16].
In this paper, we investigate the K-means algorithm, in which the number of clusters must be predefined by the user. In practical problems, it is impossible to decide the number of clusters in advance. Since determining the number of clusters is an NP-complete problem, a Genetic Algorithm is employed to solve it. In addition, because the conventional method is very time-consuming, an improved fitness function is proposed. According to the simulation results, the proposed approach is feasible and effective. We propose a hybrid of the K-means algorithm and the genetic algorithm for cluster analysis.
The rest of this paper is organized as follows. Section II briefly introduces the concept of cluster analysis in a mathematical way. Section III introduces the framework of the Genetic Algorithm for clustering. In Section IV, an improved version to enhance the algorithm's performance is proposed. The simulation and results analysis are conducted in Section V. This paper ends with conclusions and future work in Section VI.
2. Preliminary of Cluster Analysis
In computer science, cluster analysis is a classic problem and the relevant technologies have been a key task in the process of acquiring knowledge [17, 18]. It has been implemented in many applications such as signal processing, data mining and machine learning [19-21]. The target of clustering is to partition a given data set into several subsets termed clusters, maximizing the homogeneity of the data within one cluster and the heterogeneity between clusters. Finding an optimal clustering is a challenging task in current research.
Up to now, evaluating the similarities among data has been based on a measure of the distance between two data points, of which the Euclidean distance is the most used.
In clustering, consider a data set $A = \{a_1, a_2, \dots, a_N\}$ of $N$ data, where each $a_i \in R^n$ is an attribute vector consisting of real-valued measurements. The goal of clustering is to partition the data into several groups $C_1, C_2, \dots, C_M$, where $M$ is the number of clusters. The mathematical description can be given as follows [22, 23]:

$$\bigcup_{i=1}^{M} C_i = A \quad (1)$$
where:

$$C_i \neq \emptyset \quad \text{for } i = 1, \dots, M \quad (2)$$

and:

$$C_i \cap C_j = \emptyset \quad \text{for } i, j = 1, \dots, M, \; i \neq j \quad (3)$$
The Euclidean distance criterion can be defined through an assignment function $f$ and given as follows:
$$E_{VQ}(f) = \sum_{i=1}^{N} \left\| x_i - c_{f(x_i)} \right\|^2 \quad (4)$$

where:

$$c_m = \frac{1}{|C_m|} \sum_{x_i \in C_m} x_i, \quad m = 1, \dots, M \quad (5)$$

Hence, searching for the centers of the clusters can take the place of the function $f$, and $f$ can be rewritten as follows:

$$f(x_i) = \arg\min_{m} \left\| x_i - c_m \right\|^2 \quad (6)$$

which means the distance is measured from each data point to the center of its cluster.
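To make the objective concrete, the quantization error of Eqs. (4)-(6) can be sketched as follows. This is a minimal illustration in Python; the data points and centers are invented for the example and are not taken from the paper.

```python
# Sketch of the quantization error E_VQ of Eqs. (4)-(6): each point is
# assigned to its nearest center (Eq. 6) and the squared distances are summed.
def nearest_center(x, centers):
    # Eq. (6): f(x) = argmin_m ||x - c_m||^2
    return min(range(len(centers)),
               key=lambda m: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[m])))

def e_vq(data, centers):
    # Eq. (4): E_VQ = sum_i ||x_i - c_{f(x_i)}||^2
    total = 0.0
    for x in data:
        m = nearest_center(x, centers)
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[m]))
    return total

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
centers = [(0.0, 0.5), (10.0, 10.0)]
print(e_vq(data, centers))  # 0.25 + 0.25 + 0.0 = 0.5
```

Smaller $E_{VQ}$ means the centers summarize the data better, which is why K-means and the fitness functions below all work on this quantity.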
In most previous work, the number of clusters is decided based on the designer's requirements or experience. However, a small number of clusters cannot partition the data into suitable groups, while a large number of clusters makes the clustering meaningless. The formula that counts the number of ways to partition $N$ data into $M$ clusters is:

$$N_W(N, M) = \frac{1}{M!} \sum_{i=0}^{M} (-1)^i \binom{M}{i} (M - i)^N \quad (7)$$
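Eq. (7), as reconstructed here, is the Stirling number of the second kind: the count of ways to split $N$ items into $M$ non-empty clusters. A direct evaluation shows how quickly it explodes even for tiny inputs:

```python
# Eq. (7): N_W(N, M) = (1/M!) * sum_{i=0}^{M} (-1)^i * C(M, i) * (M - i)^N
from math import comb, factorial

def n_ways(N, M):
    # The alternating sum is always divisible by M!, so integer division is exact.
    return sum((-1) ** i * comb(M, i) * (M - i) ** N for i in range(M + 1)) // factorial(M)

print(n_ways(4, 2))   # 7 ways to split 4 points into 2 non-empty clusters
print(n_ways(10, 3))  # 9330 -- already large for tiny N and M
```

This growth is the reason exhaustive search over partitions is hopeless and a heuristic such as the Genetic Algorithm is needed.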
According to (7), it is difficult to obtain the best clustering even when the cluster number $M$ is known, let alone when it is unknown, as in practice. The conventional method is empirical: a suitable value of $M$ is chosen based on analysis of the results after conducting the simulation several times.
Due to the limitation of domain knowledge, and because the search covers only a small scale, the performance of such methods is not satisfactory. Other than a predefined criterion, a feasible method is to optimize the value of $M$ based on a numeric criterion. In this case, the number of possible clusterings grows roughly as [23]:

$$N_W(N, M) \approx \frac{M^N}{M!} \quad (8)$$
Assuming the value of $M$ is unknown, which is more realistic, the problem of finding an optimal value of $M$ to partition $N$ data can be considered NP-complete, and the complexity of the problem can be approximated as $\frac{M^N}{M!}$. Therefore, attempting to obtain an optimum solution by conventional methods is not computationally feasible [22], and it is necessary to appeal to efficient approximation algorithms.
Heuristic algorithms are well known for solving NP-complete problems, and the Genetic Algorithm is one of the famous heuristic algorithms [24, 25]. It can obtain good enough solutions in reasonable time. Thus, in this paper, the Genetic Algorithm is employed to solve the adaptive-number-of-clusters problem.
3. Framework of Genetic Algorithm for Clustering
3.1. Genetic Algorithm
As a branch of the field of artificial intelligence, Genetic Algorithms play an important role in optimization and search problems. They belong to the larger class of evolutionary algorithms (EA), which are inspired by natural evolution. Due to their efficiency, accuracy and ease of implementation, they have found applications in computer science, engineering, chemistry, manufacturing, bioinformatics and other fields.
Like other heuristic algorithms, the Genetic Algorithm is population-based: it maintains a certain number of chromosomes, each consisting of genes [26]. Each chromosome can be considered a solution that can be mutated and altered. The encoding of the Genetic Algorithm can be conducted in binary space, but other encodings are also feasible. The main operators in the Genetic Algorithm are crossover and mutation. Crossover is used to recombine the chromosomes from one generation to the next. Mutation is used to enrich the diversity of the chromosomes in each generation so that the Genetic Algorithm can obtain better solutions. It occurs during the whole evolutionary process according to a predefined mutation probability, which should be set to a small value; otherwise, if the mutation probability is too large, the evolution process degenerates into a primitive random search. The schedule of the Genetic Algorithm is shown as follows, and the flowchart is given in Figure 1.
Step 1. Initialize a population of a certain number of random chromosomes.
Step 2. Evaluate each chromosome according to the fitness function.
Step 3. Select chromosomes according to the proportional selection probability.
Step 4. Normalize the chromosomes so that they can be compared.
Step 5. Conduct the crossover and mutation operators.
Step 6. Select the top-ranking chromosomes as the population for the next generation.
Step 7. If the stopping criterion is satisfied, end the algorithm. Otherwise, go to Step 2.
Figure 1. The Flowchart of the Genetic Algorithm
3.2. Encoding
The Genetic Algorithm for clustering is based on a simple scheme. Considering a data set consisting of $N$ data, there are many possible ways to partition the set. To illustrate the strategy clearly, an example is given by assuming one chromosome in the Genetic Algorithm is encoded as follows:

Chromosome 1: 3 23221
The chromosome includes two parts: the first number, "3", is the number of clusters, and the remaining numbers give the group index of each data item. In Chromosome 1, the value at location index {1} is 3, which means there are 3 groups in the clustering. The data at location indices {2, 4, 5} belong to Group 2, and the data at location index {3} and location index {6} belong to Group 3 and Group 1 respectively. Considering Chromosome 1, there are many different codings that express the same case. For example:
Chromosome 2: 3 31332

Chromosome 2 has the same meaning as Chromosome 1. In fact, there are 6 different ways to express the same meaning. Therefore, the search space of the genetic algorithm is much larger than the effective space, which leads to poor algorithm efficiency. Besides the poor efficiency, the redundancy also affects the crossover operation in the Genetic Algorithm, since the redundant coding can produce offspring that bring no improvement at all. For example, if Chromosome 1 is chosen to cross over with Chromosome 2, there is a probability that the offspring have the same meaning as the parents.
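The redundancy can be made explicit by relabeling the groups in order of first appearance, which maps every equivalent encoding to one canonical string. A minimal sketch, using the group parts of Chromosomes 1 and 2 above:

```python
# The group labels in a chromosome are arbitrary: permuting the labels yields a
# different string for the same partition. Assigning labels in order of first
# appearance exposes this redundancy.
def canonical(genes):
    mapping, out = {}, []
    for g in genes:
        if g not in mapping:
            mapping[g] = len(mapping) + 1  # next unused canonical label
        out.append(mapping[g])
    return tuple(out)

chrom1 = [2, 3, 2, 2, 1]  # group part of Chromosome 1: "3 23221"
chrom2 = [3, 1, 3, 3, 2]  # group part of Chromosome 2: "3 31332"
print(canonical(chrom1) == canonical(chrom2))  # True -- the same partition
```

For 3 clusters there are 3! = 6 label permutations, which is exactly the factor of 6 equivalent encodings mentioned above.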
In addition, the connection among genes in one chromosome is not taken into account. To solve this problem, a hybrid of K-means and the Genetic Algorithm is proposed. Actually, the interconnections among gene values constitute the genuine optimization goal in clustering problems. Based on this analysis, the development of genetic operators specially designed for clustering problems has been investigated.
3.3. Crossover Operation
To solve the problems mentioned above, three kinds of crossover are employed in this paper: one-point crossover, two-point crossover and combining crossover [27, 28].
One-point crossover, shown in Figure 1, is similar to binary one-point crossover. A point on both parents' chromosomes is selected, and all data beyond that point in either chromosome is swapped between the two parent organisms; the resulting organisms are the offspring [33, 35, 38].
Two-point crossover, shown in Figure 2, calls for two points to be selected on the parent organism strings. Everything between the two points is swapped between the parent organisms, rendering two offspring organisms:
Figure 1. The Sketch Map of One-Point Crossover
Figure 2. The Sketch Map of Two-Point Crossover
The combining crossover combines the two solutions. It builds the new offspring center by center: for each center from the first parent chromosome, it finds the nearest center from the second parent and generates two new centers randomly on the line joining the two parent centers.
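The one-point and two-point operators can be sketched as generic sequence crossovers; for brevity this sketch acts on the group part of a chromosome only and leaves the leading cluster-count gene and the combining crossover aside.

```python
# One-point crossover: swap everything beyond `point` between the parents.
def one_point_crossover(p1, p2, point):
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

# Two-point crossover: swap everything between the two points.
def two_point_crossover(p1, p2, lo, hi):
    return (p1[:lo] + p2[lo:hi] + p1[hi:],
            p2[:lo] + p1[lo:hi] + p2[hi:])

a, b = [1, 1, 1, 1, 1], [2, 2, 2, 2, 2]
print(one_point_crossover(a, b, 2))     # ([1, 1, 2, 2, 2], [2, 2, 1, 1, 1])
print(two_point_crossover(a, b, 1, 3))  # ([1, 2, 2, 1, 1], [2, 1, 1, 2, 2])
```

In a full implementation the crossover points would be drawn at random for each mating pair; they are fixed here only to make the swap visible.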
3.4. Mutation Operation
The mutation operation is a key operator in the Genetic Algorithm which can explore the search space and break away from local minima. In cluster analysis, two genes are employed to conduct the mutation: one is adopted from the data closer to a centroid of a cluster, and the other is adopted from the data closer to the farthest data point from the centroid. The two genes are then mutated by the mutation operator [29, 36, 37].
3.5. Fitness Function
As described above, clustering can be considered an optimization problem. To evaluate each solution, the fitness function is defined as in (4). Thus, without loss of generality, we define a solution as $S = \{c_1, \dots, c_M\}$; it can then be evaluated by the fitness function:
$$f(S) = E_{VQ}(c_1, \dots, c_M) \quad (9)$$
In some other papers, different methods are proposed, such as the silhouette, which defines the average distance between $x$ and the other members of its cluster $A$:

$$a(x) = \frac{1}{|A| - 1} \sum_{y \in A, \, y \neq x} \left\| x - y \right\| \quad (10)$$
Besides, the distance between one data point and another cluster can be calculated as follows:

$$b(x) = \min_{C \neq A} d(x, C) \quad (11)$$
Hence, the silhouette of $x$ is given as follows:

$$s(x) = \frac{b(x) - a(x)}{\max\{a(x), b(x)\}} \quad (12)$$
Therefore, the fitness function is given by:

$$f(S) = \sum_{i=1}^{N} s(x_i) \quad (13)$$
3.6. Normalization
In the standard Genetic Algorithm for clustering, normalization helps enhance the convergence performance of the algorithm. It is described as follows:
1. The clusters are initialized to produce $C_1, \dots, C_M$.
2. For all $x_i \in S$, put $x_i$ into the cluster $C_l$, where:

$$l = \arg\min_{m} \left\| x_i - c_m \right\|^2 \quad (14)$$
3. Sort the centers of the clusters $C_1, \dots, C_M$ and update each center as follows:

$$c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x \quad (15)$$

where $i = 1, \dots, M$.
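The three normalization steps above can be sketched as follows. This is a minimal Python reading of Eqs. (14)-(15), with illustrative points; sorting the recomputed centers gives every chromosome describing the same partition the same canonical ordering, so chromosomes become comparable.

```python
# Normalization sketch: assign each point to its nearest center (Eq. 14),
# recompute the centers as cluster means (Eq. 15), then sort them so that
# equivalent chromosomes end up with identical center orderings.
def normalize(data, centers):
    members = [[] for _ in centers]
    for x in data:
        # Eq. (14): l = argmin_m ||x_i - c_m||^2
        l = min(range(len(centers)),
                key=lambda m: sum((a - b) ** 2 for a, b in zip(x, centers[m])))
        members[l].append(x)
    # Eq. (15): c_i = mean of C_i; sorting fixes a canonical label order
    return sorted(tuple(sum(v) / len(C) for v in zip(*C))
                  for C in members if C)

data = [(0.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
print(normalize(data, [(6.0, 6.0), (0.0, 1.0)]))  # [(0.0, 1.0), (5.0, 5.0)]
```

Note that the two input centers are listed in the opposite order, yet the output is canonically sorted, which is exactly the property Step 4 of the Genetic Algorithm schedule relies on.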
4. Design of the Improved Genetic Algorithm for Clustering
4.1. Hybrid K-means Clustering and Genetic Algorithm
The idea of K-means clustering was proposed by Steinhaus in 1957 [39] and first used by MacQueen in 1967 [40]. It has since become a popular approach in cluster analysis.
The schedule can be summarized as follows. Given an initial set of $M$ means $m_1^{(1)}, m_2^{(1)}, \dots, m_M^{(1)}$, the algorithm deals with the objects by alternating between the following two steps:
Assignment: assign each object to the cluster whose mean is closest to it:

$$C_i^{(t)} = \left\{ x_p : \left\| x_p - m_i^{(t)} \right\| \le \left\| x_p - m_j^{(t)} \right\|, \; 1 \le j \le M \right\} \quad (16)$$
where each $x_p$ is assigned to exactly one cluster $C^{(t)}$, even if it could be assigned to two or more of them.

Update: update the means to ensure they are located at the centers of the new corresponding clusters:

$$m_i^{(t+1)} = \frac{1}{\left| C_i^{(t)} \right|} \sum_{x_j \in C_i^{(t)}} x_j \quad (17)$$
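The two alternating steps of Eqs. (16) and (17) can be sketched for points in the plane; the data and initial means below are illustrative, and a fixed number of sweeps stands in for the usual convergence test.

```python
# Eq. (16): each point goes to the cluster with the closest mean.
def assign(data, means):
    clusters = [[] for _ in means]
    for x in data:
        i = min(range(len(means)),
                key=lambda m: sum((a - b) ** 2 for a, b in zip(x, means[m])))
        clusters[i].append(x)
    return clusters

# Eq. (17): each mean moves to the centroid of its cluster.
def update(clusters):
    return [tuple(sum(c) / len(C) for c in zip(*C)) for C in clusters if C]

data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
means = [(1.0, 1.0), (9.0, 9.0)]
for _ in range(10):  # fixed sweep count instead of the no-change test
    means = update(assign(data, means))
print(means)  # [(0.0, 1.0), (10.0, 11.0)]
```

After the first sweep the assignments no longer change, which is exactly the termination condition stated below.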
The algorithm ends when the assignments no longer change. Although in the standard K-means algorithm the number of clusters must be predefined, it can be combined with the Genetic Algorithm to blend their advantages and enhance the clustering performance.
In addition, according to the design of (13), it is easy to see that evaluating the solutions is very expensive in both time complexity and space complexity, since a huge distance matrix must be stored and the matrix calculation is time-consuming.
To solve this problem, it is necessary to improve the fitness function. According to (12), maximizing the value of $b(S)$ while minimizing the value of $a(S)$ maximizes the fitness function. Therefore, a novel fitness function is proposed as follows:

$$s(S) = \frac{b(S)}{a(S) + \varepsilon} \quad (18)$$
where $a(S)$ and $b(S)$ have the same meaning as in (12), and $\varepsilon$ is a small value that guarantees the denominator is not zero. We use function (18) instead, which has the same character but reduces both the time complexity and the space complexity.
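Eq. (18) can be sketched as follows. The paper defines $a(S)$ and $b(S)$ only by analogy with (12), so the aggregation below (mean point-to-own-center distance for $a(S)$, mean nearest-other-center distance for $b(S)$) is one plausible reading, chosen because it avoids the full $N \times N$ distance matrix of Eq. (13); the two example partitions are invented.

```python
import math

def improved_fitness(clusters, eps=1e-9):
    # Eq. (18): s(S) = b(S) / (a(S) + eps); eps guards the denominator.
    centers = [tuple(sum(v) / len(C) for v in zip(*C)) for C in clusters]
    n = sum(len(C) for C in clusters)
    # a(S): average distance of points to their own center (assumed reading)
    a = sum(math.dist(x, centers[i]) for i, C in enumerate(clusters) for x in C) / n
    # b(S): average distance from each center to its nearest other center
    b = sum(min(math.dist(c, d) for j, d in enumerate(centers) if j != i)
            for i, c in enumerate(centers)) / len(centers)
    return b / (a + eps)

tight = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
loose = [[(0.0, 0.0), (4.0, 4.0)], [(5.0, 5.0), (10.0, 10.0)]]
print(improved_fitness(tight) > improved_fitness(loose))  # True
```

Compact, well-separated clusters score far higher than diffuse, adjacent ones, and the evaluation touches each point only once per center rather than once per pair.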
The schedule of the hybrid algorithm is described as follows:
Step 1. Initialize chromosomes to compose a population.
Step 2. Apply the k-means algorithm to each chromosome.
Step 3. Calculate the fitness function based on (18) and evaluate the chromosomes.
Step 4. Conduct the crossover and mutation operators.
Step 5. If the stopping criterion is met, end the algorithm. Otherwise, go to Step 2.
5. Simulation Results and Analysis
In this section, several simulations are conducted to test the performance of the proposed method. We term the improved Clustering Genetic Algorithm ICGA and give the simulation results as follows. It is obvious that our proposed method achieves the fastest convergence and obtains better accuracy. The Ruspini data set, the Vowels problem and the Mushroom problem are employed.
5.1. Ruspini Dataset
The Ruspini dataset problem contains 80 objects, each of which includes two features {x, y}.
The main goal of this benchmark is to compare different genetic operators and investigate their effects on the algorithm's performance.
The simulation environment is set as follows: VC++ 6.0 is employed, and the hardware is a 2.7*2 GHz CPU with 2*1 GB RAM. In each simulation, a chromosome consists of 4 genes. To measure the distance among objects, the Euclidean norm is employed, and all the data are normalized into the interval [0, 1].
Figure 3. Ruspini Dataset
The algorithm stops once one of the following two conditions is met:
1. The maximum iteration count of 100 is reached.
2. The error is less than or equal to 0.001.
We run each algorithm 50 times to obtain an average performance.
First, the domain of the clustering number belongs to the integer set {2, ..., 40}, which means there are at least 2 clusters and at most 40 clusters in the analysis. The reason to set the maximum number to 40 is that, although there are 80 objects in this problem, it makes no sense to allow 80 clusters even though 80 is the maximum possible number. Restricting the number of clusters to the set {2, ..., N/2} is reasonable in practice. The simulation results are given in Table 1. "Time consuming" represents the average time cost over the 50 simulations, and "first reach iteration" means the iteration index after which the results are not improved during one simulation. A smaller first reach iteration means a better convergence ability of an algorithm.
Table 1. Results of Randomly Generated Initial Population

Algorithm    Time consuming    First reach iteration
K-means      0.25              62.7
CGA          0.22              50.2
ICGA         0.21              31.3
Table 1 indicates that the performance of CGA is better than K-means in both time consumption and first reach iteration, while ICGA is better than both of them. According to Table 1, it can be concluded that the first reach iteration is positively correlated with the time consumption. However, it should be noted that for different benchmarks and simulations, the same number of iterations may cause hugely different time consumption due to the probabilistic character of heuristic algorithms.
The setting $k \in \{2, \dots, 40\}$ is very helpful for the simulations with randomly generated initial populations. Next, two cases with a fixed number of clusters, 2 clusters and 40 clusters respectively, are taken into account so that the additional diversity introduced by initialization can be ignored. This setting is not practical but is useful to evaluate the proposed algorithms; the simulation results are given in Table 2.
Table 2. Results with a Fixed Number of Clusters (2 clusters and 40 clusters)

Algorithm    Time consuming    First reach iteration
K-means      0.197             87.4
CGA          0.053             61.9
ICGA         0.026             47.1
According to Table 2, the average time consumption of ICGA is less than those of both CGA and K-means. Besides, the performance of CGA is better than that of K-means. For the first reach iteration, ICGA also outperforms the others.
Figure 4-1 shows a data set which is used to test the CGA and K-means algorithms [32]. Figure 4-2 shows that ICGA recognizes all the clusters and the centers are located well, while Figure 4-3 indicates that CGA fails to find all the clusters. In Figure 4-2, each set has a central point, marked in black; hence ICGA finds the centers of the data sets with good performance. However, in Figure 4-3, some centers are far away from their clusters and some of them are overlapped. For (Row 1, Column 3), (Row 4, Column 1) and (Row 5, Column 5), there are overlapped points in the data set. For (Row 3, Column 3), (Row 4, Column 5), (Row 5, Column 2) and (Row 5, Column 3), there are no central points in the data set. In Figure 4-3, it is obvious that some center points are far away from the data sets. Hence it can be concluded that the proposed algorithm ICGA has a strong ability in cluster analysis.
Figure 4-1. Data Set
Figure 4-2. Results Obtained by ICGA
Figure 4-3. Results Obtained by CGA
Figure 5-1. Fitness Evaluated by the Silhouette Function
Figure 5-2. Numbers of Clusters in the Best Solution per Iteration
Figure 5 shows the fitness function used to determine the optimal number of clusters, which verifies the feasibility of ICGA with the silhouette fitness function. According to Figure 5, as the fitness function increases, the number of clusters stays approximately invariant, which shows that the proposed algorithm has the ability to find the best cluster number.
5.2. Comparison of Genetic Operators
In this subsection, simulations are conducted to compare different crossover and mutation operators in the Genetic Algorithm and to test their effects on the performance of the algorithms [33, 34, 35].
First, the employed datasets are illustrated as follows. SODAR Data Set 1: this data describes a 2-dimensional space with 3 clusters; the number of objects is 90.
SODAR Data Set 2: this dataset is similar to SODAR Data Set 1: it has 3 clusters in a 2-dimensional space, and the total number of observations is 90. In Tables 3 and 4 it is referred to as SODAR 2.
Wisconsin Breast Cancer Database (Original): this dataset is maintained in the UC Irvine Machine Learning repository, with data obtained from the University of Wisconsin Hospitals. It is a 9-dimensional space with 2 clusters, and the total number of objects is 699.
Second, one-point crossover is run with different kinds of mutation. The simulation results are shown in Table 3: the K-means mutation is inferior to the one-point mutation on the Cancer problem but superior to the one-point mutation on the SODAR problems.
Table 3. Comparison of Different Mutation Operators

                      Cancer    SODAR 1    SODAR 2
One-point mutation    0.22      1368.5     930.1
K-means mutation      0.27      1362.4     927.9
Table 4. Comparison of Different Crossover Operators

                       Cancer    SODAR 1    SODAR 2
One-point crossover    0.22      1367.5     927.7
Combining crossover    0.27      1519.9     927.5
Different crossover operators are compared in Table 4, which shows that the simpler operator (one-point crossover) obtains the best results on the Cancer and SODAR 1 tasks, while on SODAR 2 the combining crossover is marginally better.