Indonesian Journal of Electrical Engineering and Computer Science
Vol. 5, No. 2, February 2017, pp. 410-415
DOI: 10.11591/ijeecs.v5.i2.pp410-415
An Empirical Comparison of Latest Data Clustering Algorithms with State-of-the-Art

Xianjin Shi, Wanwan Wang, and Chongsheng Zhang*
Henan University
No.1 JinMing Street, 475001 KaiFeng, China, Tel: +86 13837150021
*Corresponding author, e-mail: chongsheng.zhang@yahoo.com
Abstract
Clustering technology has been applied in numerous applications. It can enhance the performance of information retrieval systems, and it can group Internet users to help improve the click-through rate of on-line advertising, etc. Over the past few decades, a great many data clustering algorithms have been developed, including K-Means, DBSCAN, Bi-Clustering and Spectral clustering. In recent years, two new data clustering algorithms have been proposed: affinity propagation (AP, 2007) and density peak based clustering (DP, 2014). In this work, we empirically compare the performance of these two latest data clustering algorithms with the state-of-the-art, using 6 external and 2 internal clustering validation metrics. Our experimental results on 16 public datasets show that the two latest clustering algorithms, AP and DP, do not always outperform DBSCAN. Therefore, to find the best clustering algorithm for a specific dataset, all of AP, DP and DBSCAN should be considered. Moreover, we find that the comparison of different clustering algorithms is closely related to the clustering evaluation metrics adopted. For instance, when using the Silhouette clustering validation metric, the overall performance of K-Means is as good as that of AP and DP. This work has important reference value for researchers and engineers who need to select appropriate clustering algorithms for their specific applications.

Keywords: Affinity Propagation, Density peak based clustering, Clustering Evaluation

Copyright (c) 2017 Institute of Advanced Engineering and Science. All rights reserved.
1. Introduction
Clustering or cluster analysis is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters) [1]. Clustering has been widely used in many applications, such as discovering customer groups based on their purchase behaviours to design targeted advertisements, identifying co-regulated genes to provide a genetic fingerprint for various diseases, differentiating between different types of tissue and blood in medical images, etc.
Over the past few decades, considerable research effort has been put into the development of new data clustering algorithms [2,3,4,5]. Among them, K-Means, DBSCAN [6], Bi-Clustering [7] and Spectral clustering [8] are the most well-known ones. K-Means is by far the most popular clustering tool used in scientific and industrial applications. Starting with random centroids, K-Means iteratively re-assigns each data point to the nearest centroid, then computes a new centroid for each group of data points sharing the same centroid, and repeats these two steps until the assignments stabilize.
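The iterative procedure just described can be sketched in a few lines of NumPy (a minimal illustration, not the Scikit-learn implementation used later in this paper; the `init` argument for supplying fixed starting centroids is our own addition for reproducibility):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0, init=None):
    """Plain K-Means: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    if init is not None:
        centroids = np.asarray(init, dtype=float)
    else:
        # Start from k random data points as the initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 1: re-assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # assignments have stabilized
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated groups of points, two centroids placed anywhere near the groups converge to the group means within a couple of iterations.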
DBSCAN [6] is a classic density-based clustering algorithm; it can discover clusters of arbitrary shape. Bi-Clustering and Co-Clustering [7] allow overlap between clusters; this class of algorithms has been widely used in bioinformatics. Spectral clustering [8] techniques perform dimensionality reduction before clustering, by utilizing the eigenvalues of the similarity matrix of the data.
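To illustrate why the ability to find arbitrarily shaped clusters matters, the sketch below (ours, using the Scikit-learn implementations; the `eps` value is an assumption tuned for this toy data) runs DBSCAN and K-Means on two interleaving half-moons, a classic non-convex case:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: non-convex clusters of arbitrary shape.
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

# Density-based DBSCAN can follow the curved shapes ...
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
# ... while K-Means, biased towards compact spherical clusters, cuts across them.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

db_ari = adjusted_rand_score(y, db_labels)
km_ari = adjusted_rand_score(y, km_labels)
```

On this data, DBSCAN recovers the two moons almost perfectly, whereas K-Means obtains a much lower adjusted Rand score.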
In recent years, two novel and influential data clustering algorithms have been proposed. The first is affinity propagation (hereafter referred to as AP) [9], which was published in Science in 2007. Its highlight is that it does not require users to specify the number of clusters. AP alternates two message-passing steps: one measures how well a data point is suited to serve
Received November 16, 2016; Revised January 6, 2017; Accepted January 19, 2017

IJEECS, ISSN: 2502-4752
as the cluster center for another data point; the other step takes into account the other data points' preference for a data point to be a cluster center. The second (and the latest) data clustering algorithm, published in Science in 2014, is the density peak based clustering algorithm (hereafter referred to as DP) [10]. It is based on the idea that cluster centers are characterized by a higher density than their neighbours and by a relatively large distance from points with higher densities [10]. For each data point, DP computes its local density and its distance to points of higher density. Data points ranked at the top in both metrics will be selected as cluster centers.
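The two per-point quantities can be sketched directly (our simplified reading of [10]: `d_c` is the density cutoff radius, and we use the simple cutoff-kernel density, one of the density choices discussed in that paper):

```python
import numpy as np

def dp_scores(X, d_c):
    """For each point, compute DP's two quantities: rho, the local density
    (neighbour count within radius d_c), and delta, the distance to the
    nearest point of higher density.  Points ranking high in both are the
    candidate cluster centers."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = (D < d_c).sum(axis=1) - 1            # exclude the point itself
    order = np.argsort(-rho, kind="stable")    # densest points first
    delta = np.empty(len(X))
    # By convention the globally densest point gets the maximum distance.
    delta[order[0]] = D[order[0]].max()
    for pos in range(1, len(X)):
        i = order[pos]
        # Distance to the nearest point that comes earlier in density order.
        delta[i] = D[i, order[:pos]].min()
    return rho, delta
```

On two tight, well-separated groups, the product rho * delta singles out exactly one representative point per group as a center candidate.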
Overwhelmed with so many clustering algorithms, especially with the arrival of these two new ones, a question naturally arises: which is the best data clustering algorithm? Can the two latest data clustering algorithms, AP and DP, truly outperform the state-of-the-art?
Problem Statement. In this work, we will empirically compare the performance of the two latest data clustering algorithms, DP and AP, with other well-established clustering algorithms, in particular K-Means, DBSCAN, Bi-Clustering, Co-Clustering and Spectral clustering.
Contributions. This work reveals that the two latest data clustering algorithms, AP and DP, do not always outperform the classic clustering algorithms, such as DBSCAN. Hence, to find the best clustering algorithm for a specific application, all of AP, DP and DBSCAN should be tested. Furthermore, we find that the comparison of different clustering algorithms is closely related to the clustering validation metrics adopted. Thus, before selecting the best clustering algorithm for a given dataset/application, it is necessary to pick the clustering evaluation metric in advance.
The remainder of the paper is organized as follows. In Section 2, we briefly review common clustering evaluation metrics. Next, in Section 3, we describe the setup for the experiments. We analyse the experimental results in Section 4 and conclude the paper in Section 5.
2. Clustering Evaluation Metrics
There are two types of measures to evaluate the results of clustering algorithms (i.e., the quality of the clusters): internal and external validation metrics [11]. The basic idea of the internal validation measures is to check whether the intra-cluster similarities (the similarities between the data points inside the same cluster) are high, while, at the same time, the inter-cluster similarities (the similarities between data points from different clusters) are low. For instance, an intuitive internal validation measure is to simply divide the intra-cluster similarities by the inter-cluster similarities.
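As a toy version of this intuition (our own sketch, not a standard index; similarity is taken here as 1/(1+distance)):

```python
import numpy as np

def intra_inter_ratio(X, labels):
    """Mean intra-cluster similarity divided by mean inter-cluster
    similarity; higher values indicate a better clustering."""
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    S = 1.0 / (1.0 + D)                        # similarity from distance
    same = labels[:, None] == labels[None, :]  # same-cluster pair mask
    off_diag = ~np.eye(len(X), dtype=bool)     # drop self-pairs
    return S[same & off_diag].mean() / S[~same].mean()
```

A labelling that matches two well-separated groups scores well above 1, while a labelling that mixes the groups scores much lower.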
In this work, we use two very well-known internal clustering validation metrics, Dunn and Silhouette [11,12].
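Silhouette ships with Scikit-learn; Dunn does not, so we sketch it here (our simplified form of the Dunn index, using single-linkage separation and cluster diameter):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def dunn_index(X, labels):
    """Dunn index: smallest between-cluster distance divided by the
    largest within-cluster diameter (higher is better)."""
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(X), dtype=bool)
    return D[~same].min() / D[same & off_diag].max()
```

For Silhouette, `silhouette_score(X, labels)` returns the mean silhouette coefficient over all points, a value in [-1, 1].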
The external validation metrics calculate, for each cluster, the distribution of the true class labels of all the data points in that cluster. Therefore, this type of clustering evaluation metric requires each data point to have a class label. If all or the majority of the data points in a cluster share the same class label, the clustering is very successful, and the corresponding score, in terms of an external clustering validation metric, will be high. In this paper, we utilize 6 external clustering validation metrics: Purity, Homogeneity, Completeness, V-measure, Adjusted Rand, and Mutual info score (mutual information) [11,12].
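Five of these six metrics ship with Scikit-learn; Purity does not, but it follows directly from the contingency matrix (a sketch; note that `mutual_info_score` is the unnormalized mutual information, so its maximum depends on the label entropy rather than being 1):

```python
import numpy as np
from sklearn import metrics
from sklearn.metrics.cluster import contingency_matrix

def purity(y_true, y_pred):
    """Purity: each cluster votes for its majority class; purity is the
    fraction of points that match their cluster's majority class."""
    cont = contingency_matrix(y_true, y_pred)
    return cont.max(axis=0).sum() / cont.sum()

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 1, 0, 0, 0])      # a perfect (merely relabelled) clustering
scores = {
    "purity": purity(y_true, y_pred),
    "homogeneity": metrics.homogeneity_score(y_true, y_pred),
    "completeness": metrics.completeness_score(y_true, y_pred),
    "v_measure": metrics.v_measure_score(y_true, y_pred),
    "adjusted_rand": metrics.adjusted_rand_score(y_true, y_pred),
    "mutual_info": metrics.mutual_info_score(y_true, y_pred),
}
```

All metrics are invariant to permuting the cluster labels, so this relabelled-but-perfect clustering scores maximally on each of them.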
3. Experimental Setup
3.1. Clustering algorithms to be compared
In this work, we compare the performance of the two latest data clustering algorithms (i.e., AP and DP) with 5 classic clustering algorithms: K-Means, DBSCAN, Bi-Clustering, Co-Clustering and Spectral clustering. We adopt the implementations from Scikit-learn¹ for AP and these 5 classic algorithms, while the implementation of DP was obtained from its official website [13].
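Under these choices, the compared algorithms can be instantiated roughly as follows (a sketch on synthetic blobs; the parameter values are illustrative rather than the grids of Section 3.3, and the DP implementation from [13] is not shown):

```python
from sklearn.cluster import (AffinityPropagation, DBSCAN, KMeans,
                             SpectralClustering, SpectralCoclustering)
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# The flat clustering algorithms share Scikit-learn's fit_predict interface.
estimators = {
    "AP": AffinityPropagation(random_state=0),
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
    "Spectral": SpectralClustering(n_clusters=3, random_state=0),
}
labels = {name: est.fit_predict(X) for name, est in estimators.items()}

# The co-clustering model clusters rows and columns jointly and exposes
# row_labels_ after fit(); it expects non-negative data, so we shift it.
coc = SpectralCoclustering(n_clusters=2, random_state=0).fit(X - X.min() + 0.01)
row_labels = coc.row_labels_
```

Each flat algorithm returns one label per data point, which is what both the internal and external validation metrics consume.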
¹ Scikit-learn is a well-known open source machine learning library. http://scikit-learn.org
Table 1. Summary of datasets

Dataset       Number of instances   Number of attributes   With class label?   Sources
Aggregation   788                   7                      Yes                 [14]
Flame         240                   2                      Yes                 [14]
Compound      399                   6                      Yes                 [14]
Spiral        312                   3                      Yes                 [14]
Pathbased     300                   3                      Yes                 [14]
R15           600                   15                     Yes                 [14]
D31           3100                  31                     Yes                 [14]
Jain          373                   2                      Yes                 [14]
Breast        699                   2                      No                  [14]
Thyroid       215                   5                      No                  [14]
Yeast         1484                  8                      No                  [14]
Wine          178                   13                     No                  [14]
Dim4          2701                  4                      No                  [14]
Dim8          5401                  8                      No                  [14]
Dim32         1009                  32                     No                  [14]
Dim64         1024                  64                     No                  [14]
3.2. Datasets
We use 16 datasets to validate the quality of the different clustering algorithms. These datasets can be divided into two categories: 1) 8 datasets commonly used for clustering [14], including Aggregation, Flame, Compound, Spiral, Pathbased, R15, D31, and Jain. All of these datasets contain class labels (ground truth) for the data points. 2) 8 datasets that do not contain class labels, including Dim4, Dim8, Dim32, Dim64, Breast, Thyroid, Yeast, and Wine [14]. Table 1 is a summary of these datasets.
3.3. Parameter Settings
For K-Means clustering, we use the default value for the number of clusters, which is 8; we also try other cluster numbers such as 2 and 3. DP needs to be given the initial cluster numbers (or an initial cluster) to automatically search for a good radius value; we try three different values: 2, 3, and 6. DBSCAN has two parameters: eps, the maximum radius of the neighbourhood around a point, and minPts, the minimum number of data points within this distance. In our experiments, we try different eps values, such as 0.2, 0.4, 0.9, 1.0, and 3.0, and different values for minPts, such as 13. For all the clustering algorithms that need parameter tuning, we manually choose the set of parameters that achieves the best clustering quality.
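The selection procedure can be sketched as a small grid search; here we score each eps setting with the Silhouette metric on synthetic data (our illustration; in the paper the best setting is chosen manually per dataset, and a fuller version would treat DBSCAN's noise points separately rather than scoring them as a cluster):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

best_eps, best_score = None, -1.0
for eps in [0.2, 0.4, 0.9, 1.0, 3.0]:          # the eps grid from the paper
    labels = DBSCAN(eps=eps, min_samples=13).fit_predict(X)
    if len(set(labels)) < 2:                   # Silhouette needs >= 2 labels
        continue
    score = silhouette_score(X, labels)
    if score > best_score:
        best_eps, best_score = eps, score
```

The same loop, with the metric swapped, applies to the other tunable algorithms (e.g., the cluster-number grids for K-Means and DP).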
4. Experimental Results
In Table 2, we depict the experimental results of the 7 clustering algorithms on the 8 datasets that contain class label information, using the 6 external clustering validation metrics. In Table 3, we present the clustering results on the other 8 datasets, which have no class labels, evaluated in terms of the 2 internal clustering validation metrics.
4.1. The Performance of Co-Clustering, Bi-Clustering and Spectral Clustering
We observe from Table 2 and Table 3 that the Co-Clustering algorithm never shows outstanding performance, while Bi-Clustering obtains a leading clustering result on only one dataset, Thyroid. Hence, both of them can be neglected when selecting the best clustering algorithm for a specific dataset.
Spectral clustering seldom achieves excellent clustering results, except on the Aggregation dataset using the Mutual info score metric, as well as on the Jain and Pathbased datasets. However, it never shows the best performance according to the two internal metrics, as can be observed in Table 4.
Table 2. Evaluation of algorithms using external validation metrics.

Dataset       Algorithm   Purity   Homogeneity   Completeness   V-measure   Rand score   Mutual info score
Aggregation   AP          0.996    0.990         0.599          0.746       0.377        0.589
              DP          0.511    0.253         0.909          0.396       0.202        0.251
              DBSCAN      0.827    0.718         1.000          0.836       0.734        0.716
              Spectral    0.834    0.846         0.786          0.815       0.549        0.783
              Cocluster   0.346    0.010         0.011          0.010       0.001        0.001
              Bicluster   0.506    0.340         0.902          0.494       0.348        0.339
              K-Means     0.911    0.909         0.765          0.830       0.668        0.761
Compound      AP          0.910    0.855         0.563          0.679       0.389        0.548
              DP          0.627    0.416         1.000          0.588       0.437        0.414
              DBSCAN      0.935    0.725         0.920          0.811       0.784        0.721
              Spectral    0.875    0.857         0.667          0.750       0.481        0.659
              Cocluster   0.396    0.012         0.013          0.012       -0.004       -0.005
              Bicluster   0.627    0.416         1.000          0.588       0.437        0.414
              K-Means     0.875    0.769         0.596          0.671       0.453        0.586
Flame         AP          0.833    0.932         0.244          0.387       0.134        0.236
              DP          0.988    1.000         0.420          0.591       0.410        0.416
              DBSCAN      0.742    0.741         0.396          0.516       0.456        0.393
              Spectral    0.971    0.873         0.279          0.423       0.202        0.274
              Cocluster   0.654    0.024         0.010          0.015       0.004        0.005
              Bicluster   0.838    0.410         0.388          0.399       0.453        0.386
              K-Means     0.983    0.910         0.289          0.438       0.206        0.284
Jain          AP          1.000    1.000         0.238          0.385       0.120        0.233
              DP          1.000    1.000         0.812          0.897       0.954        0.812
              DBSCAN      0.997    0.247         0.375          0.298       0.337        0.243
              Spectral    1.000    1.000         0.285          0.444       0.185        0.282
              Cocluster   0.740    0.018         0.007          0.010       0.006        0.003
              Bicluster   0.786    0.407         0.337          0.369       0.324        0.336
              K-Means     0.987    0.924         0.261          0.406       0.166        0.257
Pathbased     AP          0.963    0.916         0.380          0.537       0.273        0.367
              DP          0.633    0.402         0.636          0.493       0.401        0.400
              DBSCAN      0.927    0.340         0.620          0.439       0.325        0.338
              Spectral    0.877    0.781         0.430          0.555       0.348        0.423
              Cocluster   0.387    0.006         0.004          0.005       -0.004       -0.005
              Bicluster   0.633    0.401         0.634          0.491       0.399        0.399
              K-Means     0.847    0.710         0.406          0.517       0.391        0.398
R15           AP          0.997    0.994         0.994          0.994       0.993        0.994
              DP          0.667    0.773         0.984          0.886       0.579        0.763
              DBSCAN      0.533    0.590         1.000          0.743       0.264        0.576
              Spectral    0.530    0.704         0.988          0.822       0.517        0.694
              Cocluster   0.108    0.021         0.039          0.027       0.001        0.003
              Bicluster   0.133    0.244         0.980          0.391       0.120        0.241
              K-Means     0.533    0.590         1.000          0.743       0.264        0.576
Spiral        AP          0.888    0.770         0.288          0.419       0.144        0.272
              DP          1.000    1.000         1.000          1.000       1.000        1.000
              DBSCAN      0.772    0.394         0.393          0.393       0.142        0.387
              Spectral    0.808    0.684         0.373          0.483       0.258        0.366
              Cocluster   0.397    0.013         0.009          0.010       0.001        0.001
              Bicluster   0.353    0.001         0.001          0.001       -0.003       -0.002
              K-Means     0.487    0.182         0.098          0.127       0.048        0.088
D31           AP          0.975    0.966         0.966          0.966       0.950        0.964
              DP          0.161    0.360         0.958          0.524       0.107        0.357
              DBSCAN      0.065    0.042         0.976          0.081       0.004        0.040
              Spectral    0.258    0.575         0.988          0.727       0.328        0.571
              Cocluster   0.045    0.006         0.0013         0.008       -0.000       -0.000
              Bicluster   0.065    0.188         0.933          0.313       0.060        0.187
              K-Means     0.258    0.566         0.944          0.708       0.336        0.562
4.2. The Performance of AP, DP, DBSCAN and K-Means
We can see from Table 2 and Table 4 that, on the external validation metrics, AP, DP and DBSCAN show very good clustering results. Moreover, DP and DBSCAN achieve the best clustering results on the Dunn internal metric. Surprisingly, on the Silhouette internal metric, the overall performance of K-Means is as good as that of DP and AP.
We now compare the two density-based clustering algorithms, i.e., DP vs. DBSCAN. We see that on two datasets, Aggregation and Compound, DBSCAN outperforms DP, while on the remaining 6 datasets DP outperforms DBSCAN on almost all the metrics (with few exceptions), as can be seen in Table 4.
4.3. Efficiency Comparison of Different Clustering Algorithms
We have also checked the efficiency of the different clustering algorithms. We find that AP is very time-consuming, especially when the number of data points is large, say, more than 3000. DP is faster than AP, but generally slower than Co-Clustering, Bi-Clustering, and K-Means. In Figure 1, the running time of these algorithms on the Aggregation, Yeast, and Dim8 datasets is depicted. It should be mentioned that on the Yeast and Dim8 datasets, AP was also found to be the slowest algorithm, at least 3 times slower than DP, so we removed it from those plots for clarity.
Table 3. Evaluation of algorithms using internal validation metrics.

Dataset   Metric       AP      DP      DBSCAN   Spectral   Cocluster   Bicluster   K-Means
Breast    Dunn         0.000   0.408   0.093    0.000      0.000       0.041       0.051
          Silhouette   0.182   -0.298  0.631    -0.057     0.067       0.122       0.722
Thyroid   Dunn         0.067   0.019   0.050    0.017      0.008       0.091       0.047
          Silhouette   0.230   -0.022  0.651    0.194      0.354       0.577       0.462
Yeast     Dunn         0.054   0.087   0.019    0.000      0.000       0.025       0.026
          Silhouette   0.139   -0.042  -0.282   0.204      0.323       0.191       0.356
Wine      Dunn         0.178   0.256   0.265    0.190      0.109       0.256       0.147
          Silhouette   0.115   0.279   0.310    0.118      0.304       0.280       0.473
Dim4      Dunn         0.561   0.561   4.904    0.003      0.001       0.463       0.601
          Silhouette   0.951   0.951   0.950    0.480      0.871       0.511       0.912
Dim8      Dunn         0.068   0.453   4.171    0.292      0.003       0.745       5.085
          Silhouette   0.837   0.932   0.995    0.583      0.604       0.311       0.851
Dim32     Dunn         4.035   4.035   0.016    0.003      0.000       0.645       0.771
          Silhouette   0.945   0.945   0.050    -0.212     0.389       0.191       0.523
Dim64     Dunn         5.820   5.820   0.012    0.004      0.001       0.777       0.764
          Silhouette   0.966   0.966   0.077    -0.194     0.355       0.154       0.511
Table 4. Total number of datasets where a clustering algorithm ranked first.

Metric              AP   DP   DBSCAN   K-Means   Spectral
Dunn                2    4    2        1         0
Silhouette          3    3    2        3         0
Purity              5    2    1        0         1
Homogeneity         5    3    0        0         2
Completeness        0    5    2        0         1
V-measure           2    3    2        0         1
Rand score          2    3    3        0         0
Mutual info score   2    3    1        0         1
4.4. Summary of the Results
In summary, we draw the following conclusions from the above analyses:
1. Although AP and DP are the two latest (and very popular) clustering algorithms, they do not always outperform DBSCAN.
2. AP, DP, and DBSCAN, taken together as a group, clearly outperform the other clustering algorithms. Therefore, AP, DP and DBSCAN should be the major candidates for clustering tasks in real-world applications.
3. Co-Clustering, K-Means, and Bi-Clustering are generally the most efficient clustering algorithms. AP is the slowest one, whereas DP is more than 3 times faster than AP.
4. The comparison of different clustering algorithms depends on the evaluation metrics adopted.
5. Conclusions
In this work, we empirically compared the performance of the two latest data clustering algorithms, affinity propagation (AP) and density peak based clustering (DP), with state-of-the-art clustering algorithms, using 6 external and 2 internal clustering validation metrics. Our experimental results demonstrate that the two latest clustering algorithms, AP and DP, do not always outperform DBSCAN. This means that, even with the two latest clustering algorithms, there is no single clustering algorithm that is the best for all datasets. Therefore, to select the best clustering algorithm for a specific dataset, all of AP, DP and DBSCAN should be considered and tested. In addition, we find that the comparison of different clustering algorithms is also closely related to the clustering evaluation metrics adopted.
In terms of running time efficiency, our experiments show that AP is the least efficient clustering algorithm; it consumes at least 3 times more running time than the others, while Co-Clustering, K-Means, and Bi-Clustering are usually the top-3 most efficient algorithms. This work provides a valuable empirical reference for researchers and engineers who need to select the best clustering algorithm for their specific applications.
Figure 1. Running time of the different algorithms on the Aggregation (left), Yeast (center), and Dim8 (right) datasets.
References
[1] Cluster analysis. Wikipedia. https://en.wikipedia.org/wiki/Cluster_analysis.
[2] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31(3), 1999, pp. 264-323.
[3] S. Kotsiantis, P. Pintelas. Recent Advances in Clustering: A Brief Survey. WSEAS Transactions on Information Science and Applications, Vol. 1, No. 1, 2004, pp. 73-81.
[4] Rui Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005, pp. 645-678.
[5] P. Berkhin. A survey of clustering data mining techniques. In: J. Kogan, C. Nicholas, M. Teboulle (Eds.), Grouping Multidimensional Data, Springer, Berlin, Heidelberg, 2006, pp. 25-71.
[6] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. 1996.
[7] Mirkin, Boris. Mathematical Classification and Clustering. Kluwer Academic Publishers. 1996.
[8] Sepandar Kamvar, Dan Klein, Christopher Manning. Spectral learning. In IJCAI-03, pp. 561-566, 2003.
[9] Brendan J. Frey, Delbert Dueck. Clustering by Passing Messages Between Data Points. Science. 2007.
[10] Alex Rodriguez, Alessandro Laio. Clustering by fast search and find of density peaks. 2014.
[11] Erendira Rendon, Itzel Abundez, Alejandra Arizmendi, Elvia M. Quiroz. Internal versus External cluster validation indexes. International Journal of Computers and Communications, Issue 1, Volume 5, 2011.
[12] Clustering. Scikit-learn 0.17.1 documentation. http://scikit-learn.org/stable/modules/clustering.html.
[13] http://people.sissa.it/~laio/Research/Res_clustering.php.
[14] Clustering datasets. http://cs.joensuu.fi/sipu/datasets/.
[15] http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets.