International Journal of Electrical and Computer Engineering (IJECE)
Vol. 5, No. 2, April 2015, pp. 361 – 370
ISSN: 2088-8708
A Novel Spectral Clustering based on Local Distribution

Parthajit Roy* and J. K. Mandal**
* Department of Computer Science, The University of Burdwan, Burdwan, West Bengal, India-713104
** Dept. of Comp. Sc. & Engg., The University of Kalyani, Nadia, West Bengal, India-741235
Article Info

Article history:
Received Nov 22, 2014
Revised Jan 20, 2015
Accepted Feb 5, 2015

Keywords:
Spectral Clustering
Affinity
Mahalanobis Distance
Outlier Detection
Random Walk Laplacian
Clustering Indices
ABSTRACT

This paper proposes a variation of the spectral clustering model based on a novel affinity metric that considers the distribution of the neighboring points to learn the underlying structures in the data set. The proposed affinity metric is calculated using the Mahalanobis distance, which exploits the concept of outlier detection for identifying the neighborhoods of the data points. The Random Walk Laplacian of the representative graph and its spectrum are used for clustering, and the first k eigenvectors are taken in the second phase of clustering. The model has been tested with benchmark data, and the quality of its output has been evaluated with various cluster validity indices.

Copyright © 2015 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
J. K. Mandal
Dept. of Comp. Sc. & Engg.
The University of Kalyani
Nadia, West Bengal
India-741235
e-mail: jkm.cse@gmail.com
1. INTRODUCTION

Data clustering is the process of grouping nonuniform data elements by identifying the underlying structure [1]. Data clustering is a difficult task from the computing point of view, because the problem is computationally hard. There are different approaches to data clustering, and a whole branch of computer science has been devoted to it. A general reference on data clustering is due to Everitt et al. [2]. Jain et al. surveyed clustering, the taxonomy of clustering and recent trends [3]. A more recent survey is done by Xu et al. [4]. Some recent applications of clustering have been reported by Farshchian et al. [5] and Sharma et al. [6].

There are different approaches towards clustering. Some of the popular approaches are hierarchical, density based, squared-error based, fuzzy based, and graph based clustering [4]. The purpose of the present study is to cluster spatial data sets using a graph based model. The cluster shape and its orientation are the most important parameters in the present work. The objective of the present research work is to handle clustering problems where the cluster shapes are non-convex; moreover, the present model considers the trend of the data sets and not just the proximities.

The graph based model has gained enormous popularity in recent days. A graph is a very good algebraic structure for defining proximity among the data points, and proximity among the points is the key to the success of almost every clustering algorithm. There are different areas of graph theory that can be exploited for data clustering. Some Delaunay graph based models are available due to Yang et al. [7] and Roy et al. [8]. Some minimum spanning tree based clustering techniques have been proposed by Foggia et al. [9] and Roy et al. [10]. Some early minimum as well as maximum spanning tree based approaches were proposed by Asano et al. [11]. An outstanding survey on graph clustering is done by Schaefer [1], where the recent developments have been outlined explicitly.

Spectral clustering is one of the major branches of graph based clustering, where the spectra, i.e. the eigenvalues and eigenvectors of the Laplacian of the graph, are considered for clustering. There are several papers on this topic. Some of them are due to Higham et al. [12], Toussi et al. [13] and Qiu et al. [14]. A very good survey on spectral clustering is due to Luxburg [15]. A general survey on the graph Laplacian has been done by Chung [16]. Some variations of the traditional spectral clustering are available due to Spielman et al. [17] and Fiedler [18].
Affinity or distance among the data points is the most important input to any clustering model. This paper uses a novel affinity matrix for spectral clustering based on the Mahalanobis distance proposed by Mahalanobis [19]. A good discussion on the Mahalanobis distance can be found in a book by Marsland [20]. The Mahalanobis distance has been used effectively for outlier detection by Hodge et al. [21].

The quality of the output produced by any clustering algorithm is measured by several validity indices. There are two major types of cluster validity indices: internal and external indices. Internal indices measure the quality of the clusters produced from the distribution of the data and the density/scatter of the data points. The external indices, on the other hand, consider some labeled data, and the quality of the clusters is measured on that basis. A good comparative study on the performance of various clustering indices is done by Saha et al. [22].

The rest of the article is organized as follows. Section 2. discusses the mathematical background necessary for the model. Section 3. presents and discusses the proposed model. Section 4. discusses the results and analysis. The conclusion comes in Section 5., and references are drawn at the end.
2. MATHEMATICAL BACKGROUND

This section uncovers the necessary mathematics required to develop and explain the proposed model. The proposed model is a spectral clustering model based on a novel affinity metric that calculates the proximity among the data points in a novel way. The rest of this section discusses spectral graphs, similarity measures and other necessary mathematical background in the following manner. Subsection 2.1. discusses clustering as an optimization problem, subsection 2.2. discusses spectral graph theory, and subsection 2.4. discusses the distance and similarity measures.
2.1. Data Clustering

Given a set of data points, the aim of clustering is to identify the grouping by exploring the underlying structures in the data set, if any exist. The main assumption in clustering is that the property of the underlying structure of the data set is not known. Only the number of clusters may sometimes be known. Formally, clustering can be defined as follows: given that S = {p_1, p_2, p_3, p_4, ..., p_n} is the set of n data points in the m-dimensional space R^m, clustering is the partition of S into k different groups, k being the number of clusters, C = {C_1, C_2, C_3, ..., C_k}, with respect to a distance metric d(p_i, p_j) in such a way that the inter-cluster distance becomes maximum and the intra-cluster distance becomes minimum over all the partitions [3]. That is, the objective of clustering is to minimize the ratio of intra-cluster distance and inter-cluster distance, as shown in equation 1.
$$\mathrm{Minimize}\;\; Z = \sum_{\forall C_i \in C} \frac{\mathrm{IntraClusterDistance}}{\mathrm{InterClusterDistance}} \quad \mathrm{w.r.t.}\;\; d(p_i, p_j) \qquad (1)$$
2.2. Spectral Graph Theory

A graph is an algebraic structure G = <V, E>, where V is called the vertex set and E is called the edge set. The set E can be defined as a relation ρ over the cartesian product V × V, defined as x ρ y => (x, y) ∈ E, which implies that there is an edge between x and y of the vertex set V [23]. Parallel edges imply more than one edge between two vertices, and in our case parallel edges are not allowed. A graph contains a self loop if x ρ x for some x ∈ V. A graph G = <V, E> is called an undirected graph if the relation ρ is symmetric. A graph is called a simple undirected graph, or S-graph, if it is an undirected graph without any self loops or parallel edges. A graph is weighted if there is a weight function d(v_i, v_j) associated with every edge e ∈ E. A weight function d(·, ·) is a metric if it satisfies the following properties:

Non-negativity: ∀ x, y ∈ R^m, d(x, y) ≥ 0.
Symmetricity: ∀ x, y ∈ R^m, d(x, y) = d(y, x).
Triangular Inequality: ∀ x, y, z ∈ R^m, d(x, y) + d(y, z) ≥ d(x, z).
Given a weighted graph G, the adjacency matrix representation of the graph is a square matrix of order n = |V| where the entry w_{i,j} (w_{i,j} ≥ 0) is the weight of the edge e(i, j), i.e. the adjacency matrix W can be defined as

$$W_{n \times n} = (w_{ij})_{i = 1, 2, \ldots, n;\; j = 1, 2, \ldots, n} \qquad (2)$$
The degree d_i of a vertex v_i ∈ V of a graph represented as per equation 2 can be defined as the sum of the weights of the edges incident on that vertex, as shown in equation 3.

$$d_i = d(v_i) = \sum_{j=1}^{n} w_{ij} \qquad (3)$$
The degree matrix D is a diagonal matrix with the degrees d_i on the leading diagonal of the matrix D.
The main tool for spectral clustering is the Laplacian matrix of the graph. A Laplacian of a graph is a symmetric matrix that can be formed from the adjacency matrix. The eigenvalues and eigenvectors of the Laplacian are the most important tools for analyzing the structure of a graph. Especially the cut related quantities can be analyzed algebraically by computing the eigenvalues and eigenvectors of the graph Laplacian. There exist several graph Laplacians, and every Laplacian has its own strength and weakness in a particular situation [15]. A detailed literature on spectral graph theory has been given by Chung [16].
Unnormalized Laplacian: The unnormalized Laplacian of a graph, denoted as L, can be defined as per equation 4.

$$L = D - W \qquad (4)$$

where D is the diagonal degree matrix and W is the adjacency matrix. The important thing to note is that the Laplacian defined above is a real symmetric matrix whose diagonal elements are all non-negative and all other elements are non-positive. Also, the sum of every row is zero. The spectrum is the set of eigenvalues and eigenvectors of this Laplacian, and it has several important properties. Some of the properties of this Laplacian [24][25][15] are given as follows.
1. f^T L f = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} w_{ij} (f_i − f_j)^2, ∀ f ∈ R^n.
2. L is symmetric and positive semi-definite.
3. The smallest eigenvalue of L is 0.
4. All eigenvalues are real, with 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n.
5. If the graph is not connected, then the number of 0 eigenvalues, i.e. the multiplicity of the eigenvalue 0, is the number of connected components.
Normalized Laplacian: The spectrum of the unnormalized Laplacian of a graph does not take the degree matrix D into account; the normalized Laplacians overcome this. Two of the normalized Laplacians are very popular. These are the Symmetric Laplacian and the Random Walk Laplacian. The Symmetric Laplacian L_sym is a symmetric matrix defined as per equation 5.

$$L_{sym} = D^{-1/2} L D^{-1/2} \qquad (5)$$
The Random Walk Laplacian models a random walk in a graph, i.e. starting from an arbitrary vertex of the graph, if we randomly move to any of its adjacent vertices and follow the same process repetitively, then within a densely connected subgraph of the original graph the walk will end in the same subgraph in most of the cases. This property is modeled by the Random Walk Laplacian. The random walk Laplacian L_rw can be defined as per equation 6.

$$L_{rw} = D^{-1} L = I - D^{-1} W \qquad (6)$$
All of the properties of the unnormalized Laplacian hold good for L_rw also. In the present paper, the random walk Laplacian and its spectrum will be used for clustering.
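To make equations 4 and 6 concrete, the following minimal sketch (not from the paper; the function name and the NumPy formulation are assumptions, and every vertex is assumed to have nonzero degree) builds the degree matrix, the unnormalized Laplacian and the random walk Laplacian from a weighted adjacency matrix W.

```python
import numpy as np

def graph_laplacians(W):
    """Build the degree matrix D, L = D - W and L_rw = I - D^{-1} W
    from a symmetric weighted adjacency matrix W (equations 3, 4 and 6)."""
    d = W.sum(axis=1)                             # vertex degrees, equation (3)
    D = np.diag(d)                                # diagonal degree matrix
    L = D - W                                     # unnormalized Laplacian, equation (4)
    L_rw = np.eye(len(d)) - np.diag(1.0 / d) @ W  # random walk Laplacian, equation (6)
    return D, L, L_rw
```

For a graph that is not connected, the multiplicity of the eigenvalue 0 of either Laplacian equals the number of connected components, which matches property 5 above.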
2.3. Similarity Measurement

The success of graph based clustering algorithms lies in the choice of the similarity matrix. The similarity between two data points is the only means for the creation of an edge between them in the graph. If the edges are closer to the actual relationship, the success possibility of the algorithm is also higher. For this reason, similarity measurement demands adequate research. There are several types of similarity measures available. Some of them exploit only the distance between the data points. The following part discusses two of them.
ε-neighbor: In this case, if the distance between two data points is less than a predefined benchmark value ε, then the weight is assigned; otherwise 0 is assigned.
Gaussian Similarity: In this case, the distance between the data points is computed as

$$g(\vec{X}, \vec{Y}) = e^{-\frac{\|\vec{X} - \vec{Y}\|^2}{2\sigma^2}} \qquad (7)$$

The value of σ in equation 7 is known as the strength of inclusion, i.e. the larger the value of σ, the more weight will be given to the edge between the data points; i.e. far points will also come under consideration with heavy weight.
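As an illustration of the two measures above, the sketch below (an assumed NumPy formulation, not code from the paper) builds the ε-neighbor affinity and the Gaussian affinity of equation 7 from the pairwise Euclidean distances of a data matrix X whose rows are points.

```python
import numpy as np

def epsilon_neighbor_affinity(X, eps, weight=1.0):
    """Epsilon-neighbor rule: assign `weight` to pairs closer than eps, 0 otherwise."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.where(dist < eps, weight, 0.0)
    np.fill_diagonal(W, 0.0)          # no self loops
    return W

def gaussian_affinity(X, sigma):
    """Gaussian similarity of equation (7): exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)          # no self loops
    return W
```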
2.4. Distance Metric

Given two data points, the ultimate tool for designing the Laplacian matrix is the similarity among data points, and the similarity is based on the distance among data points, which is the weight of the edge between them. There are several distance functions available, but the most popular is the Euclidean distance, defined as

$$E(i, j) = \left( \sum_{k=1}^{m} |x_i(k) - x_j(k)|^2 \right)^{1/2} \qquad (8)$$
The generalization of the Euclidean distance is called the Minkowski distance, defined as

$$M(i, j) = \left( \sum_{k=1}^{m} |x_i(k) - x_j(k)|^n \right)^{1/n} \qquad (9)$$

There are other distance functions also, but these two are the most popular.
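A one-function sketch of equations 8 and 9 (illustrative only; the function name is an assumption): the Minkowski distance of order n, which reduces to the Euclidean distance for n = 2.

```python
def minkowski(x, y, n=2):
    """Minkowski distance of equation (9); n = 2 gives the Euclidean distance of equation (8)."""
    return sum(abs(a - b) ** n for a, b in zip(x, y)) ** (1.0 / n)
```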
3. PROPOSED MODEL

This section presents the proposed model. The Euclidean (equation 8) and Minkowski (equation 9) distances of the previous section give the proximity, but their main disadvantage is that they measure the distance from a single point. Even if the distance needs to be measured from a distribution, they first convert it to a single point (like the mean, median etc.) and calculate the distance. This mean or median is called the representative of the distribution, and often they do not show good results. Consider the following situation.
Consider
the
follo
wing
situation.
Let
there
are
12
points
in
tw
o
dimensions.
Some
v
alues
are
sho
wn
in
T
able
1
and
the
distrub
ution
are
sho
wn
graphically
in
figure
1
(a).
Suppose
the
black-fill
spots
are
some
distrib
ution.
In
our
case
this
will
be
the
neighbor
set
of
some
point.
Table 1. Sample points in two dimensions

x | 3.0 | 2.9 | 2.8 | 2.7 | ... | 2.0
y | 1.0 | 1.1 | 1.0 | 1.0 | ... | 1.1
We would like to find out the distance of any other point, like P1 or P2, from this distribution. Clearly the point P2 is closer to this distribution than P1. But how to measure that mathematically? The traditional way will find the mean of the distribution and will try to find the distance of the given point from the mean. But this is not good here, because points P1 and P2 are equidistant from the mean, so either both will be rejected or both will be accepted. Figure 1 (a) shows this.
Figure 1. (a) Distribution of the point set of Table 1. (b) Points P1 and P2 are equidistant from the mean of the distribution.
The solution to this problem is the Mahalanobis distance [19]. The Mahalanobis distance is a distance measure which addresses the above problem in a better way. The following are the steps for calculating the Mahalanobis distance. Given a data set S = {p_1, p_2, ..., p_n} where each p_i ∈ R^m, the mean is defined as per equation 10.

$$\mu = \frac{1}{n} \sum_{i=1}^{n} p_i \qquad (10)$$
The variance is defined as per equation 11.

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (p_i - \mu)^2 \qquad (11)$$
The covariance matrix M_{m×m}, where m is the number of attributes of the data set, can be defined as

$$M = \left( \mathrm{cov}(i, j) \right), \quad \forall\, i = 1, \ldots, m;\; j = 1, \ldots, m \qquad (12)$$
and the Mahalanobis distance between x_1 and x_2 is defined using equation 12 as

$$d = \sqrt{(x_1 - x_2)^T M^{-1} (x_1 - x_2)} \qquad (13)$$
In figure 1 (b), the Euclidean distance between P1 and the mean is 0.6. The distance of P2 from the mean is also 0.6. So, any algorithm that considers the Euclidean distance will either include both of them or reject both of them, which is unrealistic. The Mahalanobis distance (equation 13), on the other hand, of P1 from the distribution is 34.28, whereas the distance of P2 from the distribution is 24.11. This clearly states that the point P2 is closer to the distribution than point P1.
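The following sketch (an illustrative reading of equations 10, 12 and 13, assuming a well-conditioned covariance matrix; it is not the authors' code) computes the Mahalanobis distance of a query point from the distribution of a neighborhood set.

```python
import numpy as np

def mahalanobis_from_set(point, neighborhood):
    """Mahalanobis distance of `point` from the distribution formed by the rows of `neighborhood`."""
    A = np.asarray(neighborhood, dtype=float)
    mu = A.mean(axis=0)                        # mean of the distribution, equation (10)
    M = np.cov(A, rowvar=False)                # covariance matrix, equation (12)
    diff = np.asarray(point, dtype=float) - mu
    return float(np.sqrt(diff @ np.linalg.inv(M) @ diff))   # equation (13), measured against the mean
```

Two query points that are equidistant from the mean in the Euclidean sense will, in general, receive different Mahalanobis distances, because the covariance matrix stretches the metric along the trend of the neighborhood; this is exactly the behavior described for P1 and P2 above.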
Figure 2. Point P3 falls outside the distribution and point P4 falls inside.
Another illustration is shown in figure 2. Here, points P3 and P4 are considered. Clearly, P4 falls inside the distribution and P3 falls outside. The Mahalanobis distance states this fact clearly, as the distance of P3 is 28.64 and that of P4 is 23.59. If a suitable cutoff distance can be set, then P3 can be excluded and P4 can be included, which is not possible in the case of traditional similarity measures.
The present model exploits this property of the Mahalanobis distance in the neighborhood of each point, calculating the neighbors on the basis of distribution similarity and not on the basis of distance similarity. The model assumes the number of clusters as a prerequisite; this is k in the present case. The model also assumes some knowledge, as in the case of other clustering algorithms, as parameter values for the creation of the similarity matrix.
Initially, the algorithms are presented, and then the working principle of the proposed model is discussed explicitly. The algorithm for the Mahalanobis distance based similarity matrix computation is proposed in Algorithm 1. The algorithm for the clustering model is shown in Algorithm 2.
Algorithm 1 Similarity Matrix Computation

Input: S = {P_1, P_2, ..., P_n}. m-dimensional data points to be clustered.
Input: ε. The benchmark distance value for the Euclidean distance.
Input: δ. The benchmark distance value for the Mahalanobis distance.

1:  declare N_{n×n}, M_{n×n} as matrix
2:  N ← SIMILARITYMATRIX(S, ε)            ▷ Calculates the ε-neighbor similarity matrix in the traditional way.
3:  for all P_i ∈ S do
4:    declare A as set
5:    A ← NEIGHBORHOODOF(P_i, N)          ▷ Calculates the neighborhood of P_i w.r.t. the already computed N.
6:    Σ ← COVARIANCEMATRIX(A)
7:    for all P_j ∈ S do
8:      d ← ((P_j − μ)^T Σ^{−1} (P_j − μ))^{1/2}   ▷ Mahalanobis distance of P_j from A, μ being the mean of A.
9:      if d ≤ δ then
10:       M(i)(j) ← 1/d
11:       M(j)(i) ← 1/d
12:     else
13:       M(i)(j) ← 0
14:       M(j)(i) ← 0
15:     end if
16:   end for
17: end for
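A compact NumPy sketch of the two-pass construction of Algorithm 1 follows. It is an interpretation of the pseudocode, not the authors' implementation: the parameter names eps and delta stand for the two benchmark distances, the neighborhood mean is used in the Mahalanobis step, and a pseudo-inverse is used as a guard when the covariance matrix of a small neighborhood is singular.

```python
import numpy as np

def mahalanobis_similarity_matrix(S, eps, delta):
    """Two-pass affinity matrix: epsilon-neighbor first pass, Mahalanobis-filtered second pass."""
    S = np.asarray(S, dtype=float)
    n = len(S)
    dist = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
    N = (dist < eps) & ~np.eye(n, dtype=bool)          # pass 1: epsilon-neighbor adjacency
    M = np.zeros((n, n))
    for i in range(n):
        A = S[N[i]]                                    # neighborhood of point i
        if len(A) < 2:
            continue                                   # too few points to estimate a covariance
        mu = A.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(A, rowvar=False))
        for j in range(n):
            if j == i:
                continue
            diff = S[j] - mu
            d = np.sqrt(diff @ cov_inv @ diff)         # Mahalanobis distance of point j from A
            if 0 < d <= delta:
                M[i, j] = M[j, i] = 1.0 / d            # pass 2: keep points inside the distribution
    return M
```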
Algorithm 2 Spectral Clustering using Mahalanobis Distance

Input: S = {P_1, P_2, ..., P_n}. m-dimensional data points to be clustered.
Input: K. The number of clusters.

1: declare M_{n×n} as matrix
2: M ← MAHALANOBISSIMILARITYMATRIX(S, δ)   ▷ Calculates the Mahalanobis distance based similarity matrix using Algorithm 1.
3: L_rw ← RANDOMWALKLAPLACIAN(M)
4: CALCULATEEIGENS(L_rw)                   ▷ Calculates the eigenvalues and eigenvectors.
5: SORTEIGENVALUES                         ▷ Sorts the eigenvalues in ascending order.
6: D ← SMALLESTEIGENVECTORS(k)             ▷ D is the set of k eigenvectors for the k smallest eigenvalues.
7: K-MEANS(D, k)                           ▷ Cluster the k eigenvectors using the K-means algorithm.
From Algorithm 1 and Algorithm 2 it is clear that the main tricky part is the formation of the similarity matrix, or affinity matrix. Let us discuss the working principle of this part first. What Algorithm 1 does is calculate the final similarity matrix, or adjacency matrix, in two passes. The first pass calculates the similarity matrix, or adjacency matrix, using any traditional method. In the present paper the ε-neighbor has been adopted, but other similarity matrices may also be chosen. Thereafter, every point and its neighborhood are computed for the second pass, i.e. for the final adjacency or similarity matrix. This computation is very tricky.
For a particular row, the matrix elements with nonzero values are the vertices of the neighbor set. After having the vertex set, let us say A, the present paper computes the covariance matrix of the neighborhood set A. The proposed model then considers the other points and their Mahalanobis distances. A second, new similarity matrix is created in this way. Here also any suitable similarity measure may be considered. In the present paper the δ-neighbor rule has been chosen, i.e. given the Mahalanobis distance of a point from a set of points, if the distance is less than δ, then the point is considered a member of the neighborhood of the point set; otherwise the point is considered an outlier of the point set and is not considered to be a neighbor of the point set.
From the discussion, it is clear that the Mahalanobis distance based similarity matrix is more realistic. It may include (or reject) a point in the neighborhood of a point not by measuring the distance only but by considering the distribution also. This method uses the technique of outlier exclusion at the micro level for creating the affinity or similarity matrix.
Algorithm 2 is the actual clustering algorithm. The present paper assumes that the number of clusters, i.e. k, is already known. The present paper then calculates the random walk Laplacian of the Mahalanobis distance based similarity matrix. The reason for selecting the random walk Laplacian is that it identifies dense subgraphs in a better way and therefore gives better results in practice. The proposed method finds the eigenvalues of the Laplacian of the matrix and sorts them in ascending order; the smallest k eigenvalues are identified, the corresponding eigenvectors are taken for k-means clustering, and the result of this k-means clustering is the final output.
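Putting Algorithm 2 together, this end-to-end sketch (illustrative only; scikit-learn's KMeans is used in place of the unspecified k-means routine, and it reuses the mahalanobis_similarity_matrix sketch given after Algorithm 1) clusters the rows of the k eigenvectors belonging to the smallest eigenvalues of the random walk Laplacian.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_local_distribution(S, k, eps, delta):
    """Spectral clustering on the Mahalanobis-based affinity matrix (sketch of Algorithm 2)."""
    W = mahalanobis_similarity_matrix(S, eps, delta)   # Algorithm 1 sketch above
    d = W.sum(axis=1)
    d[d == 0] = 1.0                                    # isolated vertices: avoid division by zero
    L_rw = np.eye(len(d)) - np.diag(1.0 / d) @ W       # random walk Laplacian, equation (6)
    eigvals, eigvecs = np.linalg.eig(L_rw)             # L_rw is not symmetric, so use the general solver
    order = np.argsort(eigvals.real)
    U = eigvecs[:, order[:k]].real                     # eigenvectors of the k smallest eigenvalues
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```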
4. RESULTS AND ANALYSIS

The model has been tested with two data sets. One is a toy data set proposed by the authors, and the second data set is known as the spiral data set, proposed by Chang and Yeung [26]. The performance of the proposed model has been compared with three other models, namely the K-means algorithm, hierarchical clustering and one of the standard spectral clustering methods. More about K-means and hierarchical clustering can be found in [2, 3]. The standard spectral model is due to Shi-Malik [27].
The toy example consists of 75 data points in a two dimensional plane. The data set has been taken in such a way that the trend of the points is clear. In figure 3 (a), it is clear that the small portion on the right side of the lower cluster (indicated as C in the figure) is a part of the lower cluster, and that the small middle cluster (indicated as B in the figure) is part of the upper straight line. The trend clearly suggests that. The important thing to note is that the small clusters are equidistant from the lower spiral cluster, i.e. portion B and portion C are equidistant from the lower spiral part of figure 3 (a). Portion B is equidistant from the lower spiral part and the upper straight line portion. So, the traditional distance or the k-nearest neighborhood is not at all a good measure, because none of these considers the trend of the data sets. Either they will include both of the small clusters or they will exclude both of them.
The proposed Mahalanobis distance based similarity matrix is a very good option in such situations because it will include one and exclude the other based on the point distribution. The result of the proposed model is shown in figure 3 (b). The result shows that the proposed model clearly considers the distance as well as the cluster point distribution trends.
Figure 3. (a) The sample toy data. (b) The black colors and red colors are the two clusters.
The proposed model's performance is compared with the standard models, and the results produced are tested by seven internal indices and four external indices. Table 2 and Table 3 show the results of the tests. The proposed model shows better results in four out of the seven internal indices. All four external indices say that the strength of the proposed model is better in the proposed situation.
Table 2. Internal indices for the proposed toy data (on synthetic toy data)

Internal Indices  | Rule | K-Means Model | Hierarchical Model (Average Link) | Shi-Malik [27] Model | Proposed Model | Is the proposed model best?
Silhouette        | Max  | 0.4835682     | 0.4326811                         | 0.2461643            | 0.2506375      | No
Scott Symons      | Min  | 800.4114      | 798.99                            | 861.9192             | 729.9601       | Yes
Wemmert Gancarski | Max  | 0.5940378     | 0.5102386                         | 0.1872991            | 0.2446224      | No
Tau               | Max  | -0.5383263    | -0.4735096                        | -0.2174559           | -0.1618588     | Yes
Gamma             | Max  | -0.7610649    | -0.6694097                        | -0.3078273           | -0.2304259     | Yes
G-Plus            | Min  | 0.4403892     | 0.4174937                         | 0.3262061            | 0.3034445      | Yes
Ray Turi          | Min  | 0.1788701     | 0.2293373                         | 0.7925694            | 1.289401       | No
Table 3. External indices for the proposed toy data (on synthetic toy data)

External Indices  | Rule | K-Means Model | Hierarchical Model (Average) | Shi-Malik [27] Model | Proposed Model | Which model is best?
Folkes Mallows    | Max  | 0.558085      | 0.6511637                    | 0.9060712            | 1              | Proposed
Jaccard           | Max  | 0.386073      | 0.4815563                    | 0.8275653            | 1              | Proposed
Rand              | Max  | 0.532973      | 0.6302703                    | 0.8976577            | 1              | Proposed
Czekanowski Dice  | Max  | 0.5570745     | 0.6500682                    | 0.9056478            | 1              | Proposed
The model has been tested with another standard data set, known as the spiral data, proposed by Chang et al. [26]. The data set has 312 data points, 2 dimensions and 3 clusters. The data set is not linearly separable. The proposed model correctly identifies the three clusters of the spiral data with accurate point distribution, i.e. both the path based model [26] and the present model give 100% accuracy. This shows the strength of the proposed model in standard situations.
5. CONCLUSION

The proposed Mahalanobis distance based, local distribution oriented spectral clustering model is a good model in normal situations, and it can also handle situations where the distribution of the points needs to be considered. The results show that the model can handle non-convex data sets successfully, which is a major strength of the proposed model. Nevertheless, there is scope for improvement as well. The model is time demanding because of the eigenvector computations. So, sparsification of the graph and clustering the sparse graph instead of the original one may be one direction of improvement. Secondly, if the inverse of the covariance matrix does not exist, then the Mahalanobis distance cannot be calculated. Handling this will be a major improvement. Finally, the model is sensitive to parameters like ε and δ in the similarity matrix calculation. The automatic calculation of such parameters can make the model even more robust.
REFERENCES

[1] G. Schaefer and H. Zhou, "Fuzzy clustering for colour reduction in images," Telecommunication Systems, vol. 40, no. 1-2, pp. 17–25, 2009.
[2] B. S. Everitt, D. Stahl, M. Leese, and S. Landau, Cluster Analysis, 5th ed. John Wiley & Sons, 2011.
[3] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.
[4] R. Xu and D. Wunsch-II, "Survey of clustering algorithms," IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, May 2005.
[5] F. Farshchian, E. Parcham, and S. Tofighi, "Retinal identification and connecting this information to electronic health record," International Journal of Electrical and Computer Engineering, vol. 3, no. 3, pp. 359–365, June 2013.
[6] S. Sharma and G. N. Purohit, "Analysis of spectral clustering approach for tracking community formation in social network," International Journal of Social Networking and Virtual Communities, vol. 1, no. 1, pp. 31–37, July 2012.
[7] X. Yang and W. Cui, "A novel spatial clustering algorithm based on delaunay triangulation," Journal of Software Engineering and Applications, vol. 3, pp. 141–149, 2010.
[8] P. Roy and J. K. Mandal, "A novel spatial fuzzy clustering using delaunay triangulation for large scale gis data (nsfcdt)," Procedia Technology, vol. 6, pp. 452–459, 2012.
[9] P. Foggia, G. Percannella, C. Sansone, and M. Vento, "A graph-based clustering method and its applications," Proceedings of Advances in Brain, Vision, and Artificial Intelligence, vol. 4729, pp. 277–287, 2007.
[10] P. Roy and J. K. Mandal, "A delaunay triangulation preprocessing based fuzzy-encroachment graph clustering for large scale gis data," Proceedings of the International Symposium on Electronic System Design, 2012, pp. 300–305, 2012.
[11] T. Asano, B. Bhattacharya, M. Keil, and F. Yao, "Clustering algorithms based on minimum and maximum spanning trees," Proceedings of the 4th Annual Symposium on Computational Geometry, pp. 252–257, 1988.
[12] D. J. Higham, G. Kalna, and M. Kibble, "Spectral clustering and its use in bioinformatics," Journal of Computational and Applied Mathematics, vol. 204, pp. 25–37, 2007.
[13] S. A. Toussi and H. S. Yazdi, "Feature selection in spectral clustering," International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 4, no. 3, pp. 179–194, September 2011.
[14] H. Qiu and E. R. Hancock, "Graph matching and clustering using spectral partitions," Pattern Recognition, vol. 39, pp. 22–34, 2006.
[15] U. von Luxburg, "A tutorial on spectral clustering," Max Planck Institute for Biological Cybernetics, Tech. Rep. TR-149, August 2006.
[16] F. Chung, "Spectral graph theory," American Mathematical Society, USA, Tech. Rep., 1997.
[17] D. A. Spielman and S.-H. Teng, "Spectral partitioning works: Planar graphs and finite element meshes," Linear Algebra and its Applications, vol. 421, pp. 284–305, 2007.
[18] M. Fiedler, "A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory," Czechoslovak Mathematical Journal, vol. 25, no. 4, pp. 619–633, 1975.
[19] P. C. Mahalanobis, "On the generalized distance in statistics," in Proceedings of the National Institute of Sciences of India, no. 2, 1936, pp. 49–55.
[20] S. Marsland, Machine Learning: An Algorithmic Perspective, 1st ed., ser. Machine Learning and Pattern Recognition Series. CRC Press, 2009.
[21] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, pp. 85–126, 2004.
[22] S. Saha and S. Bandyopadhyay, "Performance evaluation of some symmetry-based cluster validity indexes," IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews, vol. 39, no. 4, pp. 420–425, July 2009.
[23] C. Godsil and G. Royle, Algebraic Graph Theory, ser. Graduate Texts in Mathematics. Springer, 2001, vol. 207.
[24] B. Mohar, Graph Theory, Combinatorics, and Applications. Wiley, 1991, vol. 2, ch. Laplacian Spectrum of Graphs, pp. 871–898.
[25] ——, Graph Symmetry: Algebraic Methods and Applications, ser. C 497. Kluwer, 1997, vol. NATO ASI, ch. Some Applications of Laplace Eigenvalues of Graphs, pp. 225–275.
[26] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognition, vol. 41, no. 1, pp. 191–203, January 2008.
[27] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, August 2000.