TELKOMNIKA Indonesian Journal of Electrical Engineering
Vol. 12, No. 8, August 2014, pp. 6181 ~ 6189
DOI: 10.11591/telkomnika.v12i8.5458
ISSN: 2302-4046

Received December 23, 2013; Revised April 18, 2014; Accepted April 30, 2014
Which Representation to Choose for Image Cluster

Haolin Gao*, Gang Chen, Bicheng Li, Yongwei Zhao
Department of Data Processing Engineering, Zhengzhou Information Science and Technology Institute, Zhengzhou, 450002
*Corresponding author, e-mail: holygao@126.com
Abstract

To choose the best representation for image clustering, we analyze the distinguishability of three typical image representations (color histogram, Gabor texture and geometric moment) by fitting image distance curves, and show that different image representations produce different results in image retrieval. To depict the distinguishability of different image representations, we present a cluster discriminant index called Simplified Overall Cluster Quality, which combines cluster compaction and cluster separation. The experimental results show that the image representation with the best distinguishability also possesses the maximum discriminant index value. This index can be used to determine the optimal representation for clustering images.
Keywords: image clustering, image representation, simplified overall cluster quality
Copyright © 2014 Institute of Advanced Engineering and Science. All rights reserved.
1. Introduction
Clustering, also called cluster analysis or unsupervised classification, is a method of creating groups or clusters of objects, so that objects in one cluster are very similar and objects in different clusters are quite distinguishable [1]. The application areas of clustering include image segmentation, information retrieval, document classification, association rule mining, web usage tracking and so on. Generally, clustering problems can be divided into two categories: hard clustering and soft clustering. In hard clustering, a data point belongs to exactly one cluster, while in soft clustering a data point may belong to two or more clusters with some probabilities. Clustering algorithms can also be classified into two categories: hierarchical algorithms and partitional algorithms. Hierarchical algorithms include two types: divisive and agglomerative.

Though a large number of clustering methods have been developed, clustering remains a challenging task. A clustering algorithm may behave differently depending on the chosen features of the data set or the parameter values of the algorithm [2]. A typical clustering method includes the following three stages [3]: pattern representation, definition of a pattern proximity measure, and grouping of data points according to the pattern representation and the proximity measure. During the past decades, many clustering methods have been proposed, and all these methods have the same goal, i.e., the maximization of homogeneity within each cluster and the minimization of heterogeneity between different clusters.
Clustering in image processing and computer vision is a procedure for identifying groups of similar image primitives, such as image pixels, local features, segments, objects or even complete images. Image clustering is one of the important applications of data clustering. Reference [4] classified image clustering algorithms into two categories, the supervised and the unsupervised methods. The supervised algorithms incorporate a priori knowledge, such as the number of image clusters. Data representation in these clustering algorithms is still one of the most important factors that influence the performance of the clustering algorithm. If the representation is good, the clusters are likely to be compact and isolated, and even a simple clustering algorithm such as K-means will find them. Unfortunately, there is no universally good representation; the choice of representation must be guided by domain knowledge. So, different image features may result in different clustering effects. However, much of the current literature on image clustering directly employs image features for clustering without discussing the clustering effects of different features. This paper tries to analyze the different clustering effects of different image features, mainly global features such as color histogram, Gabor features and geometric moment, and proposes a discriminant index to judge which feature is better for clustering.
To find the variation among different image features, we compared the image distance curves of different features; the curves showed that the waveforms are obviously different, and the retrieval results for the three image features are also different. To depict the difference, we defined a discriminant index called SOCQ (simplified overall cluster quality). The value of this index can reflect the clustering effect of the corresponding image feature. So, we can determine which image representation to use for clustering images according to SOCQ. As for clustering algorithms, one of the most popular and simplest is K-means. Though it was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used, such as in image clustering. This speaks to the difficulty of designing a general purpose clustering algorithm and the ill-posed problem of clustering [5]. Therefore, in our experiment we use it to cluster images and test the clustering effects.
Our work focuses on how to choose the best feature for image clustering, which is different from feature selection [6],[7], where a subspace is selected from a high dimensional space to improve the efficiency of classification. We do not select a subspace vector from a high dimension vector; we just evaluate the clustering effect of the image feature vector.
The rest of this paper is organized as follows. Section 2 analyzes the distinguishability of three global image features; Section 3 presents a discriminant index for clustering effects based on OCQ [8]; Section 4 compares the distinguishability of the three image features; Section 5 concludes the main work of this paper and discusses the image features and image clustering.
2. The Distinguishability of the Image Features
For convenience of computation and result visualization, we mainly discuss global features for image clustering. The frequently used global image features include color, texture and shape, respectively represented by the HSV color histogram, Gabor texture and geometric moment. The extraction of the three features is detailed as follows.
2.1. The Feature Extraction
Different image features embody different image information, and each feature is an approximation of the image information; so, different image features may yield different similarities for the same pair of images.

The HSV color histogram requires translating an image into HSV space and quantifying the result. The quantification is shown as follows:
$$H=\begin{cases}0 & h\in[316^\circ,20^\circ]\\ 1 & h\in[21^\circ,40^\circ]\\ 2 & h\in[41^\circ,75^\circ]\\ 3 & h\in[76^\circ,155^\circ]\\ 4 & h\in[156^\circ,190^\circ]\\ 5 & h\in[191^\circ,270^\circ]\\ 6 & h\in[271^\circ,295^\circ]\\ 7 & h\in[296^\circ,315^\circ]\end{cases}\qquad S=\begin{cases}0 & s\in[0,0.2]\\ 1 & s\in(0.2,0.7]\\ 2 & s\in(0.7,1]\end{cases}\qquad V=\begin{cases}0 & v\in[0,0.2]\\ 1 & v\in(0.2,0.7]\\ 2 & v\in(0.7,1]\end{cases}\qquad(1)$$
Then the quantification results are combined into a single value:

$$x = 9H + 3S + V \qquad (2)$$
Where $x \in [0, 71]$, so an image can be represented by a histogram of 72 bins.
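As a concrete illustration, the quantization of Eq. (1) and the bin combination of Eq. (2) can be sketched in Python. This is a minimal sketch: the hue wrap-around bin and the handling of boundary values are assumptions where the text does not pin them down.

```python
import numpy as np

def quantize_hsv(h, s, v):
    """Quantize one HSV pixel into (H, S, V) bin indices per Eq. (1).
    Assumes h in degrees [0, 360) and s, v in [0, 1]."""
    # Hue: 8 bins; the wrap-around range [316, 360) U [0, 20] maps to bin 0.
    hue_bins = [(21, 40, 1), (41, 75, 2), (76, 155, 3), (156, 190, 4),
                (191, 270, 5), (271, 295, 6), (296, 315, 7)]
    H = 0
    for lo, hi, idx in hue_bins:
        if lo <= h <= hi:
            H = idx
            break
    S = 0 if s <= 0.2 else (1 if s <= 0.7 else 2)
    V = 0 if v <= 0.2 else (1 if v <= 0.7 else 2)
    return H, S, V

def hsv_histogram(pixels):
    """Normalized 72-bin histogram over (h, s, v) pixels, using x = 9H + 3S + V."""
    hist = np.zeros(72)
    for h, s, v in pixels:
        H, S, V = quantize_hsv(h, s, v)
        hist[9 * H + 3 * S + V] += 1
    return hist / max(len(pixels), 1)
```

A real image would first be converted from RGB to HSV and flattened into such a pixel list.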
Gabor texture is based on the Gabor function and wavelets. Given an image $I(x, y)$, its Gabor wavelet transform is defined to be:
$$W_{mn}(x, y) = \int I(x_1, y_1)\, g_{mn}^{*}(x - x_1,\ y - y_1)\, dx_1\, dy_1 \qquad (3)$$
Where * indicates the complex conjugate. It is assumed that the local texture regions are spatially homogeneous, and the mean $\mu_{mn}$ and the standard deviation $\sigma_{mn}$ of the magnitude of the transform coefficients are used to represent the region:
$$\mu_{mn} = \iint \left|W_{mn}(x, y)\right| dx\, dy, \qquad \sigma_{mn} = \sqrt{\iint \left(\left|W_{mn}(x, y)\right| - \mu_{mn}\right)^2 dx\, dy} \qquad (4)$$
A feature vector is now constructed using $\mu_{mn}$ and $\sigma_{mn}$ as feature components. We use scales $m = 5$ and orientations $n = 4$, resulting in the feature vector:
$$\bar{f} = (\mu_{00}, \sigma_{00}, \ldots, \mu_{34}, \sigma_{34}) \qquad (5)$$
which is a 40-dimensional vector.
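A sketch of this 40-dimensional Gabor feature follows. The exact filter-bank parameters are not given in the text, so the Gaussian-envelope kernel, its size, and the dyadic frequency spacing across scales are assumptions.

```python
import numpy as np

def gabor_kernel(freq, theta, sigma=2.0, size=11):
    """Complex Gabor kernel: isotropic Gaussian envelope times a complex sinusoid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_theta = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return envelope * np.exp(2j * np.pi * freq * x_theta)

def convolve_same(image, kernel):
    """FFT-based convolution cropped back to the image size."""
    s = (image.shape[0] + kernel.shape[0] - 1,
         image.shape[1] + kernel.shape[1] - 1)
    full = np.fft.ifft2(np.fft.fft2(image, s) * np.fft.fft2(kernel, s))
    kh, kw = kernel.shape
    return full[kh // 2:kh // 2 + image.shape[0],
                kw // 2:kw // 2 + image.shape[1]]

def gabor_features(image, scales=5, orientations=4):
    """mu_mn and sigma_mn of |W_mn| for every scale/orientation pair (Eqs. 3-5)."""
    feats = []
    for m in range(scales):
        freq = 0.4 / (2.0 ** m)          # assumed dyadic frequency spacing
        for n in range(orientations):
            theta = n * np.pi / orientations
            w = convolve_same(image, np.conj(gabor_kernel(freq, theta)))
            mag = np.abs(w)
            feats.extend([mag.mean(), mag.std()])
    return np.asarray(feats)             # 2 * 5 * 4 = 40 dimensions
```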
The shape feature is depicted by geometric moments, which are projections of the image function $f(x, y)$ onto the monomial $x^p y^q$. The geometric moments $u_{pq}$ of order $(p + q)$ of the image function $f(x, y)$ are defined as:

$$u_{pq} = \sum_{x}\sum_{y} (x - \bar{x})^p (y - \bar{y})^q f(x, y) \qquad (6)$$
Where $p, q = 0, 1, 2, \ldots$
Hu introduced seven nonlinear functions which are translation, scale, and rotation invariant. Hu's seven moment invariants have been widely used in pattern recognition, and their performance has been evaluated under various deformation situations. They are defined as:
$$\begin{aligned}
m_1 &= \eta_{20} + \eta_{02}\\
m_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2\\
m_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2\\
m_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2\\
m_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]\\
m_6 &= (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})\\
m_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\end{aligned}\qquad(7)$$
Where $\eta_{pq} = u_{pq} / u_{00}^{\gamma}$, and for $p + q = 2, 3, \ldots$, $\gamma = (p + q)/2 + 1$. The seven moments form a feature vector $\bar{f} = (m_1, m_2, \ldots, m_7)$, which can be used to represent an image. We use the 25-dimension vector in this paper:
$$\bar{f} = (u_{02}, u_{03}, u_{04}, \ldots, m_1, m_2, \ldots, m_7) \qquad (8)$$
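The central moments of Eq. (6), the normalization $\eta_{pq}$ and the first invariants of Eq. (7) can be sketched as follows. This is a minimal numpy version computing only $m_1$ and $m_2$ for brevity; both should be unchanged by rotation of the image.

```python
import numpy as np

def central_moment(f, p, q):
    """u_pq of Eq. (6): moment of f(x, y) about the image centroid."""
    y, x = np.mgrid[:f.shape[0], :f.shape[1]]
    m00 = f.sum()
    x_bar, y_bar = (x * f).sum() / m00, (y * f).sum() / m00
    return ((x - x_bar) ** p * (y - y_bar) ** q * f).sum()

def eta(f, p, q):
    """Normalized central moment: u_pq / u_00^gamma with gamma = (p+q)/2 + 1."""
    gamma = (p + q) / 2.0 + 1.0
    return central_moment(f, p, q) / central_moment(f, 0, 0) ** gamma

def hu_m1_m2(f):
    """First two Hu invariants of Eq. (7)."""
    m1 = eta(f, 2, 0) + eta(f, 0, 2)
    m2 = (eta(f, 2, 0) - eta(f, 0, 2)) ** 2 + 4.0 * eta(f, 1, 1) ** 2
    return m1, m2
```

For example, `hu_m1_m2(img)` and `hu_m1_m2(np.rot90(img))` agree up to floating-point error, illustrating the rotation invariance.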
2.2. The Distinguishability of Features
To show the different distinguishability of the three image features, we select 4 categories and 100 images from the Caltech 256 database, each category containing 25 images, called image set 1. Parts of them are shown in Figure 1.

Image distance is important for cluster algorithms, and many algorithms group images based on the inter-distance of images. To intuitively see the distances under different features, we calculate the distance between the first image and all the other images; the distances are shown in Figure 2.
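Given a feature matrix with one row per image (any of the features above, extraction assumed already done), the distance curve described here is straightforward to compute; a minimal sketch using the Euclidean distance:

```python
import numpy as np

def distance_curve(features, ref=0):
    """Euclidean distance from the reference image's feature vector to every image."""
    diffs = features - features[ref]
    return np.sqrt((diffs ** 2).sum(axis=1))
```

Plotting the returned vector (and a fitted curve through it) reproduces the kind of curves shown in Figure 2.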
Figure 1. Parts of Images in Image Set 1
Figure 2. The Distance of the First Image and other Images in Image Set 1; (a) HSV color histogram, (b) Gabor texture
We can see that the three fitting curves are different, which means that the same pictures have different similarities to the first image, and the relative similarity is also different. For example, the distance of image 99 to image 1 is bigger than the distance of image 100 to image 1 for the color feature, but smaller for texture. This can be intuitively seen in Figure 3. Figure 3(a) is the retrieval result of the color feature, with the first image as the query image. Figure 3(b) is the result of the texture feature. We can see that the retrieval results are very different. In the two figures, only the first 30 images are shown.
Figure 3. The Retrieval Result for Different Features; (a) HSV color histogram, (b) Gabor texture
Since the relative similarity may be different for different features, the clustering results may also be different. Then how do we decide which feature is better for clustering? We turn to cluster validity analysis for clustering algorithms.
3. Cluster Validation and the Overall Cluster Quality
3.1. Cluster Validity
A clustering algorithm behaves differently depending on the chosen features of the data set and the parameter values of the algorithm [10]. And from the viewpoint of data distribution, normally there is no a priori information about the structure of the data. Hence, the accuracy of clustering entirely depends on the data and on the extent to which the clustering algorithm captures the structure. To simplify the clustering algorithm, sometimes a specific structure is imposed on a dataset. But if the assumption about the data distribution is wrong, the results may be very poor.

Therefore, it is important to evaluate the clustering quality in a quantitative manner, and this problem is also called cluster validity analysis, which aims at discovering the distribution of a data set, identifying the clustering paradigm that is most suitable for a problem domain, and deciding the optimal parameters for a specific clustering method.
A number of efforts have been made in cluster validity. However, the issue of cluster validity is rather under-addressed. In general terms, there are three approaches to investigating cluster validity. The first is based on external criteria: it evaluates the results of a clustering algorithm against a pre-specified structure, which is imposed on a data set. The second approach is based on internal criteria: it evaluates the results of a clustering algorithm in terms of quantities that involve the vectors of the data set themselves. The third approach is based on relative criteria: it evaluates a clustering structure by comparing it with other clustering schemes. The first two approaches are based on statistical tests, and their major drawback is their high computational cost. Moreover, the indices related to these approaches aim at measuring the degree to which a data set confirms an a priori specified scheme.
3.2. Renyi Entropy and Overall Cluster Quality (OCQ)
A good clustering result should fit the distribution of the dataset as well as possible. That is to say, the assignment of the data samples to clusters should not violate the structure inherent in the data. This idea can be captured mathematically using concepts from information theory, where entropy is used to measure uncertainty. A well known entropy is Renyi's entropy. The definition is:
$$H_{\alpha}(x) = \frac{1}{1-\alpha}\log \sum_{k} p_k^{\alpha}, \qquad \alpha > 0,\ \alpha \neq 1 \qquad (9)$$
Where $p_k$ is the probability mass function for the discrete data. When $\alpha = 2$ we obtain Renyi's quadratic entropy.
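A minimal sketch of Eq. (9); the input is assumed to be a histogram or probability mass function (it is renormalized defensively):

```python
import numpy as np

def renyi_entropy(p, alpha=2.0):
    """H_alpha(x) = log(sum_k p_k^alpha) / (1 - alpha), alpha > 0, alpha != 1 (Eq. 9)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                      # ensure a valid pmf
    return np.log((p ** alpha).sum()) / (1.0 - alpha)
```

With `alpha=2` this is Renyi's quadratic entropy, $-\log\sum_k p_k^2$.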
Based on relative criteria, Ji He et al. proposed a new clustering evaluation by virtue of Renyi entropy [8]. In their approach, two criteria are proposed for clustering evaluation and selection of an optimal clustering scheme: compactness and separation [9]. The compactness measure is based on the generalized definition of the deviation of a data set, given by:
$$dev(\mathbf{X}) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} d^2(x_i, \bar{x})} \qquad (10)$$
Where $d(x_i, x_j)$ is a distance metric between two vectors $x_i$ and $x_j$ that reflects their dissimilarity, $N$ is the number of members in $\mathbf{X}$, and $\bar{x} = \frac{1}{N}\sum_i x_i$ is the mean of $\mathbf{X}$. A smaller deviation indicates a higher homogeneity of the vectors in the dataset. In particular, when $\mathbf{X}$ is one-dimensional and $d(\cdot)$ is the Euclidean distance, $dev(\mathbf{X})$ becomes the standard deviation of the data set. The cluster compactness for the cluster results $C_1, C_2, \ldots, C_k$ is defined as:
$$Cmp(\mathbf{X}) = \frac{1}{k}\sum_{i=1}^{k}\frac{dev(C_i)}{dev(\mathbf{X})} \qquad (11)$$
Where $k$ is the number of clusters generated on the data set $\mathbf{X}$, $dev(C_i)$ is the deviation of the cluster $C_i$, and $dev(\mathbf{X})$ is the deviation of the data set $\mathbf{X}$.
The cluster separation measure used here borrows the idea in [10] and the clustering evaluation function introduced by [11]. The cluster separation is defined as:
$$Sep(\mathbf{X}) = \frac{1}{k(k-1)}\sum_{i=1}^{k}\sum_{\substack{j=1\\ j\neq i}}^{k}\exp\!\left(-\frac{d^2(c_i, c_j)}{2\sigma^2}\right) \qquad (12)$$
Where $\sigma$ is a Gaussian constant (to simplify the computation, $2\sigma^2 = 1$), $k$ is the number of clusters, $c_i$ is the centroid of the cluster $C_i$, and $d(c_i, c_j)$ is the distance between $c_i$ and $c_j$.
Similar to [10], Ji He combined the cluster compactness and cluster separation measures into one for the ease of evaluating the overall performance. The combination, named overall cluster quality (OCQ), is defined as:
:
()
()
(
1
)
(
)
Ocq
C
mp
S
e
p
X
XX
(
1
3
)
Where $\gamma \in [0, 1]$ is the weight that balances cluster compactness and cluster separation. For example, $Ocq(0.5)$ gives equal weights to the two measures.
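A sketch of Eq. (12) and Eq. (13), assuming $2\sigma^2 = 1$ as in the text and Euclidean distances between centroids:

```python
import numpy as np

def sep_index(centroids, sigma2=0.5):
    """Eq. (12): Gaussian-weighted separation over all ordered pairs of
    cluster centroids; 2*sigma^2 = 1 by default, as in the text."""
    c = np.atleast_2d(centroids)
    k = len(c)
    total = 0.0
    for i in range(k):
        for j in range(k):
            if i != j:
                d2 = ((c[i] - c[j]) ** 2).sum()
                total += np.exp(-d2 / (2.0 * sigma2))
    return total / (k * (k - 1))

def ocq(cmp_value, sep_value, gamma=0.5):
    """Eq. (13): weighted combination; smaller is better for both terms."""
    return gamma * cmp_value + (1.0 - gamma) * sep_value
```

Far-apart centroids drive `sep_index` toward 0, so a low OCQ indicates compact, well-separated clusters.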
3.3. Simplified Overall Cluster Quality (SOCQ)
The cluster compactness is good when the value of $Cmp(\mathbf{X})$ is small, and so is the cluster separation: clusters are well separated if the value of $Sep(\mathbf{X})$ is small. Well separated clusters imply large inter-distances between cluster centers, which result in a small value of $Sep(\mathbf{X})$ due to the monotonic decrease of the Gauss function. So, on the whole, the OCQ decreases monotonically with the increase of cluster compactness and cluster separation.
But it is well known that the exponent function in $Sep(\mathbf{X})$ has a high computational complexity. Moreover, $Sep(\mathbf{X})$ is designed to reflect the effects of inter-cluster distance, with the same monotonic property as $Cmp(\mathbf{X})$. So, we can design a new OCQ index, called Simplified OCQ or SOCQ, which discards the exponent function and just keeps the $d^2(c_i, c_j)$ term; then $Sep(\mathbf{X})$ is:
$$Sep(\mathbf{X}) = \frac{1}{k(k-1)}\sum_{i=1}^{k}\sum_{\substack{j=1\\ j\neq i}}^{k} d^2(c_i, c_j) \qquad (14)$$
Then $Sep(\mathbf{X})$ increases monotonically with the increase of the square of the inter-cluster distance. So, in order not to destroy the overall monotonic property of SOCQ, which means its two parts must have the same monotonic property, we define $Socq$ as follows:
$$Socq(\gamma) = (1-\gamma)\cdot Sep(\mathbf{X}) - \gamma \cdot Cmp(\mathbf{X}) \qquad (15)$$
Where $\gamma = 0.5$. Therefore, SOCQ increases monotonically with the increase of cluster compactness and cluster separation. It is noted that SOCQ may be a negative value.
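Eq. (14) and Eq. (15) can be sketched similarly; note that the sign convention in `socq` below is our reconstruction from the text's description (SOCQ grows with separation, shrinks with the compactness value, and can be negative), not a formula confirmed verbatim by the source:

```python
import numpy as np

def sep_simplified(centroids):
    """Eq. (14): mean squared inter-centroid distance, no exponent function."""
    c = np.atleast_2d(centroids)
    k = len(c)
    total = 0.0
    for i in range(k):
        for j in range(k):
            if i != j:
                total += ((c[i] - c[j]) ** 2).sum()
    return total / (k * (k - 1))

def socq(cmp_value, sep_value, gamma=0.5):
    """Eq. (15) as reconstructed here: larger separation and smaller
    compactness raise the score; the result may be negative."""
    return (1.0 - gamma) * sep_value - gamma * cmp_value
```

Under this convention, the representation with the highest SOCQ is the one to prefer for clustering.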
4. Experiment
Figure 4. Parts of Images in Image Set 2
We choose four categories and 75 images in total, called image set 2, for the experiments; parts of the images are shown in Figure 4. The 4 categories are 'compere', 'singer', 'rice' and 'sports'. The reason for choosing these images is that they have high intra-class similarity and low inter-class similarity, which makes them very suitable for clustering and for testing the distinguishability of image features.
We perform image retrieval with the same query image, and the results are shown respectively in Figure 5.
Figure 5. The Retrieval Result of Image Set 2; (a) HSV color histogram, (b) Gabor texture
We can see explicitly from the two retrieval results that the color feature does better than texture; this illustrates that the color feature has stronger distinguishability than the texture feature for the specific query image, i.e., the first image in the two figures.
We also plot the distance between the first image and all the other images, and the fitting curves are shown in Figure 6. We can see that there are obvious steps in all three fitting curves, which indicates that the inter-class distance is high. Contrastingly, the steps in the fitting curves of image set 1 are quite unobvious in Figure 2.
Figure 6. The Distance between the 56th Image and other Images; (a) HSV color histogram, (b) Gabor texture
From these three distance figures, we can say that the step of the color feature or texture feature is higher than that of the shape feature, which means that the inter-class distance of color and texture is bigger than that of shape. To calculate the OCQ index value of the clustering results, we cluster image set 2 with the general clustering algorithm, k-means, and get the values of all three features over 20 runs. The results are shown in Figure 7. It can be seen that the value of shape is the biggest and that of color is the smallest, which means that the quality of color clustering is the best, and that of shape is the worst.
Similarly, we calculate the SOCQ index value of the clustering results: we cluster image set 2 also with k-means, and get the index values of all three features over 20 clusterings. The results are shown in Figure 8. It can be seen that the relative ordering of the three features is the opposite of that in Figure 7, but they indicate the same clustering effects: the color clustering achieves the best effect, the shape clustering the worst.
Figure 7. The OCQ Index Value of Image Set 2

Figure 8. The SOCQ Index Value of Image Set 2
5. Conclusion
This paper introduces SOCQ to tackle the problem of choosing which image representation to use for clustering. The image representation with the higher value is better for clustering, because it is more distinguishable than the others.
SOCQ differs from OCQ not only in computational complexity; it can also work before clustering, which distinguishes it from cluster validation methods. So, it is convenient for determining the best image representation for clustering. Though sometimes several features are used together for clustering, we can weight them in the order of their SOCQ values.
The limitation of SOCQ is that class labels need to be provided, that is to say, the image set has to be labeled first. If only part of an image set is labeled, SOCQ can still work: it can choose a better representation on this small subset, and then use the result to cluster the whole image set. This is because the inner structure of an image set is stable.
Acknowledgements
This work was supported by the Nature Science Foundation of China, No. 60872142.
References
[1] Guojun G, Chaoqun M, Jianhong W. Data Clustering: Theory, Algorithms, and Applications. Society for Industrial and Applied Mathematics. Philadelphia, Pennsylvania. 2007.
[2] Halkidi M, Vazirgiannis M. A data set oriented approach for clustering algorithm selection. Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery. 2001.
[3] Jain AK, Murty MN, Flynn PJ. Data clustering: A review. ACM Computing Surveys. 1999; 31(3): 264-323.
[4] Krinidis S, Krinidis M, Chatzis V. An Unsupervised Image Clustering Method Based on EEMD Image Histogram. Journal of Information Hiding and Multimedia Signal Processing. 2012; 3(2): 151-163.
[5] Jain AK. Data Clustering: 50 Years beyond K-Means. Pattern Recognition. 2009.
[6] Zhu SX, Hu B. Hybrid Feature Selection Based on Improved Genetic Algorithm. Indonesian Journal of Electrical Engineering. 2013; 11(4): 1725-1730.
[7] Li JZ, Meng XR, Wen J. An Improved Method of SVM-BPSO Feature Selection Based on Cloud Model. Indonesian Journal of Electrical Engineering. 2014; 12(5): 3979-3986.
[8] He J, Tan AH, Chew-Lim Tan. Modified ART 2A Growing Network Capable of Generating a Fixed Number of Nodes. IEEE Transactions on Neural Networks. 2004; 15(3): 728-737.
[9] Michael JA Berry, Linoff G. Data Mining Techniques For Marketing, Sales and Customer Support. John Willey & Sons, Inc. 1996.
[10] Halkidi M, Vazirgiannis M, Batistakis I. Quality scheme assessment in the clustering process. Proc. 4th Eur. Conf. Principles and Practice of Knowledge Discovery in Databases. 2000: 265-276.
[11] Gokcay E, Principe J. A new clustering evaluation function using Renyi's information potential. Proc. Int. Conf. Acoust., Speech, Signal Processing. 2000.