TELKOMNIKA Indonesian Journal of Electrical Engineering
Vol. 12, No. 7, July 2014, pp. 5529 ~ 5536
DOI: 10.11591/telkomnika.v12i7.4614
Received October 3, 2013; Revised February 10, 2014; Accepted March 5, 2014
Sequence Clustering Algorithm Based on Weighted Sequential Pattern Similarity
Di Wu1,2, Jiadong Ren*1
1College of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China
2Department of Information and Electronic Engineering, Hebei University of Engineering, Handan 056038, China
*Corresponding author, e-mail: bestmoogoo@163.com
Abstract
Sequence clustering has become an active issue in the current scientific community. However, clustering quality is heavily affected by the random selection of initial clustering centers. In this paper, a new sequence similarity measurement based on weighted sequential patterns is defined, and the Sequence Clustering Algorithm Based on Weighted Sequential Pattern Similarity (SCWSPS) is proposed. The two sequences with the largest weighted similarity are chosen as the merge objects. The last K-1 synthesis results are deleted from the sequence database, and the remaining sequences are divided into K clusters. Moreover, in each cluster, the sequence which has the largest sum of similarities with the other sequences is viewed as the updated center. The experimental results and analysis show that the performance of SCWSPS is better than KSPAM and K-means in clustering quality. When the sequence scale is very large, the execution efficiency of SCWSPS is slightly worse than KSPAM and K-means.
Keywords: data mining, sequence clustering, sequential pattern, weighted similarity
Copyright © 2014 Institute of Advanced Engineering and Science. All rights reserved.
1. Introduction
With the development of biological protein sequence analysis, user access analysis in the Web field, and consumer transaction data analysis, sequence clustering has lately become one of the hot research problems in clustering analysis [1]. Sequence clustering divides the sequences of a database into several clusters on the basis of a user-defined similarity. The best clustering results are those in which the similarities between sequences within each cluster are as large as possible, and the similarities between clusters are as small as possible [2].

K-means has become one of the most popular clustering algorithms: its idea is very simple, and it can be realized easily. However, traditional K-means uses Euclidean distance to compute the similarities between data objects [3]. When the data size is very large, this results in too much time cost. In addition, the initial cluster centers are selected at random. If the selections are in the vicinity of a local minimum, the final clustering results after several iterative updates are only a local optimal solution; the expected global optimal solution cannot be obtained [4].
2. Related Work
At present, many scholars have devoted themselves to theoretical research on improved K-means algorithms that optimize the initial clustering centers. On the basis of density, Sun et al. provided a genetic K-means based on a meliorated initial center [5]: K objects which belong to high-density areas and are the farthest away from each other are selected as initial centers. An enhanced K-means for improving initial centers was discussed by Yedla [6]. In this method, the original data points are sorted into K equal sets according to the sorted distances; in each set, the middle points are taken as the initial centers. K-means based on PSO (Particle Swarm Optimization) was proposed by Fu [7]. However, classical PSO tends to get trapped in local optima, so a globally optimized clustering partition cannot be obtained. Yu et al. presented K-means based on improved PSO [8]. Based on particle hybridization, mutation operation, and dynamic change of the flight acceleration factor of each particle in the state space of each dimension, the improved PSO can overcome the problem of initial center selection sensitivity.

Although the above algorithms improve the selection of initial clustering centers, their similarity measurements are not applicable to sequences. In literature [9], a new similarity measurement based on bitmap operations is defined, by which the computational complexity can be reduced effectively. A novel sequence similarity function based on identification distance and edit distance was introduced in [10]: if the identity distance between two sequences is greater than the specified threshold, it is not necessary to repeatedly compute the edit distance, so the overall implementation efficiency of the algorithm is greatly improved. However, the similarity measurements of these algorithms do not consider the weight of sequential patterns in each sequence. KSPAM (the K-means algorithm of sequence pattern mining based on the Huffman method) [11] was presented by Yang. Highly efficient 'and' and 'or' operators are applied to calculate the similarity between sequences. Thus both the time and memory costs are great in KSPAM.
In this article, in order to consider the weight of sequential patterns in each sequence, a new weighted-sequential-pattern-based similarity for measuring a pair of sequences is presented. For the purpose of obtaining globally optimal clustering results, the initial clustering centers are gained by merging the sequences with the largest weighted similarity.

The remainder of this paper is organized as follows. Section 2 describes the related work on sequence clustering. Section 3 gives the problem definitions. Section 4 describes the SCWSPS method. Section 5 contains experimental results, and we offer our conclusions in Section 6.
3. Problem Definitions
Assume that SD = {S1, S2, ···, SN} represents the sequence database, wherein N is the number of sequences in SD. Any sequence of SD is denoted as S = a1a2···an, where an is the n-th item and an ∈ L; L indicates the set of items. Suppose that Sup(P) is the number of occurrences of sequence pattern P in SD. If Sup(P) ≥ MinSup, P is deemed to be a frequent sequence, wherein MinSup is the user-defined minimum support threshold. The set of frequent sequential patterns in SD is represented as FSP = {FSP1, FSP2, ···, FSPt}, where FSPt is the t-th frequent sequential pattern and t is the number of frequent sequential patterns in SD.

Assume that there is a one-to-one correspondence between the dimensions of the weighted sequence vector W(Si) and the frequent sequential patterns in FSP. The value of each dimension of W(Si) is the weight of the corresponding frequent sequential pattern in Si: W(Si) = {Wi1, Wi2, …, Wir, …, Wit}, wherein Wir denotes the weight of frequent sequential pattern FSPr in Si, Wir ≥ 0, 1 ≤ r ≤ t.
The way to measure the similarity between sequences is a critical issue for sequence clustering. On the basis of the weighted sequence vector of each sequence, the weighted sequential pattern similarity WSPSim(Si, Sj) is described as follows.

Definition 1. Suppose that Si and Sj are any two sequences in SD. The weighted sequential pattern similarity WSPSim(Si, Sj) between Si and Sj can be designed as below:
WSPSim(S_i, S_j) = \frac{\sum_{r=1}^{t} W_{ir} W_{jr}}{\sqrt{\sum_{r=1}^{t} W_{ir}^{2}} \sqrt{\sum_{r=1}^{t} W_{jr}^{2}}}    (1)
W_{ir} = EFS_{ir} \cdot \left( \log_{2} \frac{N}{EFD_{r}} + 1 \right)    (2)
Wherein, EFSir represents the frequency of frequent sequential pattern FSPr in Si. The higher the frequency of FSPr in a sequence, the larger the weight of FSPr. EFDr indicates the number of sequences in SD which contain FSPr. The higher the frequency of FSPr in SD, the lower the importance of FSPr. N is the number of sequences in SD.
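Formulas (1) and (2) can be sketched in Python. This is a minimal illustration, not the authors' code; the pattern frequencies EFSir and document counts EFDr are assumed to have been obtained beforehand (e.g., while mining FSP with PrefixSpan):

```python
import math

def weight(efs_ir, efd_r, n):
    # Formula (2): weight of frequent pattern r in sequence i --
    # its frequency scaled by the IDF-like factor log2(N / EFD_r) + 1.
    return efs_ir * (math.log2(n / efd_r) + 1)

def wspsim(w_i, w_j):
    # Formula (1): cosine similarity between the weighted
    # sequence vectors W(S_i) and W(S_j).
    dot = sum(a * b for a, b in zip(w_i, w_j))
    norm_i = math.sqrt(sum(a * a for a in w_i))
    norm_j = math.sqrt(sum(b * b for b in w_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return dot / (norm_i * norm_j)
```

For instance, with the case study of Section 4.3 (N = 5), pattern 'a' occurs once in S1 and appears in all five sequences, so its weight is 1 · (log2(5/5) + 1) = 1, matching the first component of W(S1).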
In addition, the sequence clustering criterion function E is described as follows, where SCj indicates the mean vector of cluster Cj:

E = \sum_{j=1}^{K} \sum_{S_i \in C_j} \left\| S_i - SC_j \right\|^{2}    (3)
4. The Sequence Clustering Algorithm SCWSPS
The Sequence Clustering Algorithm Based on Weighted Sequential Pattern Similarity, named SCWSPS, is discussed in this section. In our approach, K initial clustering centers are gained by using the WSICH (Weighted-Similarity-Based Initial Centers Handling) algorithm. Moreover, by computing the sum of similarities between any sequence and the other sequences in each cluster, the sequence with the largest sum of similarities is updated to be the clustering center. At last, by analyzing the clustering criterion function, the final K clusters are obtained. The flowchart of SCWSPS is shown in Figure 1.
Figure 1. The Flowchart of SCWSPS
[Flowchart: preprocess the sequences in SD → select initial K centers based on weighted sequential pattern similarity → assign each sequence to the most similar cluster → update K centers by similarity measurement → compute the clustering criterion function → if not converged, repeat; otherwise end]
4.1. Weighted-Similarity-Based Initial Centers Handling
A novel WSICH algorithm for selecting the K initial clustering centers is given in this section. If sequences Si and Sj have the largest similarity in the current SD, then they are chosen as the merge objects. The synthesis result of Si and Sj is denoted as Sij. Si and Sj are deleted from SD and Sij is added. Finally, the last K-1 Sij are removed, and the other sequences are divided into K subsets. The last generated Sij, or the only original sequence, can be seen as the initial clustering center in each subset. The process of WSICH is explained as below:
Algorithm WSICH(SD, K, N, Si, Sj, Sij)
Input:
SD: Sequence database;
K: The number of user-defined clusters;
N: The number of sequences in SD;
Si, Sj: Any sequences in SD;
Sij: Synthesis result of Si and Sj.
Output: K initial centers
Step 1: Mine the frequent sequential patterns of SD by the PrefixSpan algorithm to get the frequent sequential pattern set FSP. According to the support relationships between Si and each frequent sequential pattern, all the objects are pretreated to equal-length vectors.
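The pretreatment in Step 1 can be sketched as follows. This is an illustrative reading only: a subsequence test stands in for the support relationship, and the pattern set `fsp` is assumed to come from a separate PrefixSpan run:

```python
def is_subsequence(pattern, seq):
    # True if `pattern` occurs in `seq` as an ordered (possibly
    # gapped) subsequence, i.e. the sequence supports the pattern.
    it = iter(seq)
    return all(item in it for item in pattern)

def pretreat(sequences, fsp):
    # Step 1: map every sequence to an equal-length 0/1 vector,
    # one dimension per frequent sequential pattern in FSP.
    return [[1 if is_subsequence(p, s) else 0 for p in fsp] for s in sequences]
```

With the Section 4.3 data, `pretreat(["abcb", "aaab"], ["a", "b", "c", "aa", "ab"])` yields [1,1,1,0,1] and [1,1,0,1,1], the first five components of the pretreated vectors of S1 and S2.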
Step 2: Compute the weighted similarities between any two sequences.
Step 3: In the current SD, the sequences Si and Sj with the largest weighted similarity are chosen as the merge objects. The average of each dimensional value of the pretreated vectors of Si and Sj is the corresponding value of the synthesis result Sij.
Step 4: Si and Sj are deleted; meanwhile, Sij is added to SD.
Step 5: Repeat Step 3 and Step 4 until all sequences are processed.
Step 6: On the basis of the order of merge records, the last K-1 Sij are removed, and the other sequences are divided into K subsets. In each subset, if there exists a last merge object Sij, then Sij is viewed as the initial clustering center; otherwise, the only original sequence is seen as the initial clustering center.
The synthesis result Sij represents the common frequent sequential patterns between Si and Sj, so it can validly be used as the representative of Si and Sj. Meanwhile, in WSICH, the merge process is based on the simple operation of averaging, thus the computational complexity can be reduced greatly.
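The merge loop of WSICH can be sketched as below. This is an illustrative reading of Steps 2-6 under two assumptions: the pretreated weighted vectors are given as input, and `wspsim` is the cosine-style similarity of formula (1). Merging until K groups remain is equivalent to deleting the last K-1 synthesis results:

```python
import math

def wspsim(w_i, w_j):
    # formula (1): cosine similarity of two weighted vectors
    dot = sum(a * b for a, b in zip(w_i, w_j))
    ni = math.sqrt(sum(a * a for a in w_i))
    nj = math.sqrt(sum(b * b for b in w_j))
    return dot / (ni * nj) if ni and nj else 0.0

def wsich(vectors, k):
    # pool holds (ids of the original sequences merged in, current vector)
    pool = [([i], list(v)) for i, v in enumerate(vectors)]
    while len(pool) > k:
        # Step 3: pick the pair with the largest weighted similarity
        pairs = [(a, b) for a in range(len(pool)) for b in range(a + 1, len(pool))]
        a, b = max(pairs, key=lambda p: wspsim(pool[p[0]][1], pool[p[1]][1]))
        ids_a, va = pool[a]
        ids_b, vb = pool[b]
        # synthesis result S_ij: per-dimension average of the pair
        merged = [(x + y) / 2 for x, y in zip(va, vb)]
        # Step 4: delete S_i and S_j, add S_ij
        pool = [p for i, p in enumerate(pool) if i not in (a, b)]
        pool.append((ids_a + ids_b, merged))
    # Step 6: the last synthesis result (or the only original sequence)
    # in each of the K remaining subsets is an initial clustering center
    return [vec for _, vec in pool]
```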
4.2. Centers Updating and Allocating
After selecting the K initial clustering centers, the objects of SD are clustered to the most similar cluster. However, the initial clustering centers are not the final results; they will be updated further. In terms of the current cluster centers, the objects of SD are clustered to the most similar cluster, and the similarities of any two sequences in SD are calculated. In each cluster, the sequence which has the largest sum of weighted similarities with the other sequences is viewed as the new clustering center. The clustering criterion function is then computed. If it has converged, the above steps are terminated; otherwise, they are repeated.
On the basis of the WSICH algorithm and the procedure of center updating and allocation, the sequence clustering algorithm SCWSPS based on weighted sequential pattern similarity is proposed. The specific process of SCWSPS is described as follows:
Algorithm SCWSPS(SD, K, N, Si, Sj, Sij, ε)
Input:
SD: Sequence database;
K: The number of user-defined dividing clusters;
N: The number of sequences in SD;
Si, Sj: Any sequences in SD;
Sij: Synthesis result of Si and Sj;
ε: The clustering criterion function threshold.
Output: K final clusters.
Step 1: K initial centers are obtained by adopting the WSICH algorithm.
Step 2: In terms of the current cluster centers, the objects of SD are clustered to the most similar cluster.
Step 3: Calculate the similarities of any two sequences in SD. In each cluster, the sequence which has the largest sum of similarities with the other sequences is viewed as the updated clustering center.
Step 4: Compute the clustering criterion function.
Step 5: If the clustering criterion function has not converged, repeat Step 2, Step 3 and Step 4; otherwise, the algorithm is terminated and the final clustering centers are outputted.
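Steps 2-5 can be sketched as the following loop. Again this is a minimal illustration under the same assumptions (weighted vectors as input, `wspsim` as in formula (1)); convergence is taken to mean that the criterion E changes by less than ε between iterations:

```python
import math

def wspsim(w_i, w_j):
    # formula (1): cosine similarity of two weighted vectors
    dot = sum(a * b for a, b in zip(w_i, w_j))
    ni = math.sqrt(sum(a * a for a in w_i))
    nj = math.sqrt(sum(b * b for b in w_j))
    return dot / (ni * nj) if ni and nj else 0.0

def scwsps_loop(vectors, centers, eps, max_iter=100):
    prev_e = float("inf")
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Step 2: assign each sequence to the most similar center
        clusters = [[] for _ in centers]
        for v in vectors:
            best = max(range(len(centers)), key=lambda c: wspsim(v, centers[c]))
            clusters[best].append(v)
        # Step 3: the member with the largest sum of similarities
        # to the other members becomes the updated center
        centers = [
            max(cl, key=lambda v: sum(wspsim(v, u) for u in cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
        # Step 4: clustering criterion function, in the spirit of formula (3)
        e = sum(
            sum((x - y) ** 2 for x, y in zip(v, c))
            for cl, c in zip(clusters, centers)
            for v in cl
        )
        # Step 5: stop once E has converged
        if abs(prev_e - e) < eps:
            break
        prev_e = e
    return centers, clusters
```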
SCWSPS applies the weighted sequential pattern as an important index, so the accuracy of the clustering results can be greatly improved. Besides, the pretreatment process is based on the support relationships between each sequence and each frequent sequential pattern. The time cost of mining sequential patterns is very large; thus, when handling a small-scale sequence database, the performance of SCWSPS is still as good as K-means, but when the scope of the sequence database is very large, the execution time of SCWSPS is a little greater than K-means. Moreover, the selection of the K initial clustering centers is optimized: for the clustering results of SCWSPS, the possibility of being trapped in a partial minimum is decreased, and further the clustering quality can be enhanced.
4.3. Case Analysis of SCWSPS
Assume that SD = {S1, S2, S3, S4, S5}, wherein S1 = abcb, S2 = aaab, S3 = acbbc, S4 = bacca, S5 = cbcaaa. The number of user-defined dividing clusters K = 3. The minimum support Minsup = 40%. The clustering criterion function threshold ε = 0.02.
Step 1: By the PrefixSpan algorithm, the frequent sequential patterns of SD are mined: FSP = {a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc, aaa, abb, abc, acb, acc, baa, bca, cbc, cca}. In accordance with the support relationships between Si and each frequent sequential pattern in FSP, all the objects are pretreated to equal-length 21-dimensional vectors.
S1 = {1,1,1,0,1,1,0,1,1,0,1,0,0,1,1,1,0,0,0,0,0};
S2 = {1,1,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0};
S3 = {1,1,1,0,1,1,0,1,1,0,1,1,0,1,1,1,1,0,0,1,0};
S4 = {1,1,1,1,0,1,1,0,1,1,0,1,0,0,0,0,1,1,1,0,1};
S5 = {1,1,1,1,0,0,1,0,1,1,1,1,1,0,0,0,0,1,1,1,1}.
The weighted sequence vectors of each sequence are calculated:
W(S1) = {1, 2, 1.322, 0, 3.474, 2.322, 0, 2.322, 1.322, 0, 1.737, 0, 0, 2.322, 2.322, 2.322, 0, 0, 0, 0, 0};
W(S2) = {3, 1, 0, 5.211, 5.211, 0, 0, 0, 0, 0, 0, 0, 2.322, 0, 0, 0, 0, 0, 0, 0, 0};
W(S3) = {1, 2, 2.644, 0, 3.474, 4.644, 0, 2.322, 2.644, 0, 3.474, 1.737, 2.322, 4.644, 4.644, 2.322, 0, 0, 4.644, 0};
W(S4) = {2, 1, 2.644, 1.737, 0, 4.644, 4.644, 0, 2.644, 4.644, 0, 1.737, 0, 0, 0, 0, 2.322, 2.322, 4.644, 0, 2.322};
W(S5) = {3, 1, 2.644, 5.211, 0, 0, 6.966, 0, 1.322, 13.932, 1.737, 1.737, 2.322, 0, 0, 0, 0, 6.966, 6.966, 2.322, 6.966}.
According to formula (1), compute the weighted similarities between any two sequences:
WSPSim(S1, S2) = 0.389; WSPSim(S1, S3) = 0.845; WSPSim(S1, S4) = 0.228; WSPSim(S1, S5) = 0.088; WSPSim(S2, S3) = 0.227; WSPSim(S2, S4) = 0.170; WSPSim(S2, S5) = 0.240; WSPSim(S3, S4) = 0.348; WSPSim(S3, S5) = 0.137; WSPSim(S4, S5) = 0.798.
The present largest similarity is WSPSim(S1, S3) = 0.845, so S1 and S3 are chosen as the merge objects.
W(S13) = {1, 2, 1.983, 0, 3.474, 3.483, 0, 2.322, 1.983, 0, 2.606, 0.869, 0, 2.322, 3.483, 3.483, 1.161, 0, 0, 2.322, 0}. The current SD = {S13, S2, S4, S5}.
In the same way, W(S45) = {2.5, 1, 2.644, 3.474, 0, 2.322, 5.805, 0, 1.983, 9.288, 0.869, 1.737, 1.161, 0, 0, 0, 1.161, 4.644, 5.805, 1.161, 4.644}; W(S132) = {2, 1.5, 0.992, 2.606, 4.343, 1.742, 0, 1.161, 0.992, 0, 1.303, 0.435, 1.161, 1.161, 1.742, 1.742, 0.581, 0, 0, 1.161, 0}; W(S13245) = {2.25, 1.25, 1.818, 3.04, 2.172, 2.032, 2.903, 0.581, 1.488, 4.644, 1.086, 1.086, 1.161, 0.581, 0.871, 0.871, 0.871, 2.322, 2.903, 1.161, 2.322}. According to the order of merge records, S13245 and S132 are deleted, and the other sequences are divided into three clusters. The initial clustering centers are S2, S13, S45.
Step 2: In cluster 1, S1 and S3 are assigned to S13. In cluster 2, S4 and S5 are clustered to S45. There is only S2 in cluster 3.
Step 3: Calculate the similarities of any two sequences in the original SD. In each cluster, the sequence which has the largest sum of similarities with the other sequences is viewed as the updated clustering center. In cluster 1, WSPSim(S1, S3) + WSPSim(S1, S13) = 1.782; WSPSim(S3, S1) + WSPSim(S3, S13) = 1.824. Thus the updated center is S3. In cluster 2, the updated center is S5. The present three centers are S2, S3, S5.
Step 4, 5: The clustering criterion function E = (1-0.937)² + (1-0.979)² + (1-0.911)² + (1-0.976)² = 0.0129 < ε = 0.02, so it has converged. The final three clustering centers are S2, S3, S5.
5. Experimental Results and Analysis
In order to verify the performances of SCWSPS, K-means and KSPAM [11], the Wine recognition, yeast and segmentation-all (complete segmentation dataset) datasets of UCI [12] are utilized. The classic machine learning database UCI was proposed by the University of California, Irvine; there are 187 datasets in UCI at present. An artificial dataset is also used for comparing the clustering quality. The relevant parameters of the real and artificial datasets, and part of the test sequences in the artificial dataset, are described in Table 1, Table 2 and Table 3, respectively.
Table 1. The Parameters of Real Datasets
Datasets          The Number of Sequences  Attribute Dimensions  The Number of Clusters
Wine Recognition  178                      13                    3
Yeast             1484                     8                     10
Segmentation-All  2310                     19                    7
Table 2. The Parameters of the Artificial Dataset
The Number of Sequences  The Number of Clusters  The Proportion of Noises
390                      3                       5%
Table 3. Part of the Test Sequences in the Artificial Dataset
Sequence Number  Sequence
1                acdceefb
2                debacbdfdea
3                eabcfabcdcdab
4                bdcefacbdfeacdfeacbd
5                fedcbadbacbdfddabc
…                …
Our experiments are run on an Intel Core 2 Duo 2.93 GHz CPU with 2 GB main memory under Microsoft XP. All algorithms are written in MyEclipse 8.5. We compare SCWSPS with K-means and KSPAM in clustering quality and execution efficiency.
5.1. Clustering Quality Testing
In this section, the average correct rates of the clustering results of the three algorithms are tested by the following formula:
ARV(CR) = \frac{1}{q} \sum_{j=1}^{q} \frac{1}{K} \left( \frac{n_{j1}}{N_{j1}} + \frac{n_{j2}}{N_{j2}} + \cdots + \frac{n_{jK}}{N_{jK}} \right)    (4)
Where q regards the number of tests and the number of clusters is denoted as K. nj1/Nj1 + nj2/Nj2 + ··· + njK/NjK sums the per-cluster correct rates of the K clusters in the j-th test. In this test, we set q = 15. The average correct rates of the clustering results of the three algorithms on the real datasets are shown in Table 4.
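Formula (4) can be sketched directly; the pair layout of (njk, Njk) per cluster is an assumption for illustration:

```python
def arv_cr(runs):
    # Formula (4): average correct rate ARV(CR) over q tests.
    # `runs` holds q tests; each test lists K pairs (n_jk, N_jk):
    # correctly clustered vs. total sequences in cluster k.
    per_test = [sum(n / total for n, total in test) / len(test) for test in runs]
    return sum(per_test) / len(runs)
```

For example, two tests with per-cluster rates (0.9, 0.8) and (1.0, 0.6) give ARV(CR) = (0.85 + 0.8) / 2 = 0.825.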
Table 4. The Average Correct Rates of Clustering Results in Real Datasets (ARV(CR)/%)
Datasets          SCWSPS  KSPAM   K-means
Wine Recognition  88.98%  82.36%  68.22%
Yeast             91.32%  84.83%  72.94%
Segmentation-All  93.37%  87.91%  75.38%
To assess the clustering quality under different minimum supports, we set K = 3 and Minsup = 3%, 6%, 9%, 12% and 15%. The experimental results on the artificial dataset are given in Figure 2.
Figure 2. The Average Correct Rates Comparison of Clustering Results in the Artificial Dataset
Table 4 and Figure 2 indicate that on both kinds of datasets, the average correct rates of the clustering results of SCWSPS are better than those of the other two algorithms. For SCWSPS, the weights of the frequent sequential patterns are considered to measure the similarities of
sequences. The meanings of the clustering results are explained accurately, and further the clustering quality can be greatly improved.
5.2. Execution Efficiency Analysis
To analyze the execution efficiency of the three algorithms, the UCI and artificial datasets are used. The execution times on the two datasets are shown in Table 5 and Table 6.
Table 5. Execution Time of the Three Algorithms in the Artificial Dataset (T/s)
Minimum Support MinSup  SCWSPS  KSPAM  K-means
3%                      0.223   0.213  0.389
6%                      0.254   0.252  0.425
9%                      0.307   0.294  0.481
12%                     0.354   0.342  0.568
15%                     0.418   0.397  0.653
Table 6. Comparison of Execution Time in Real Datasets (T/s)
Datasets          SCWSPS  KSPAM  K-means
Wine Recognition  0.031   0.030  0.052
Yeast             0.388   0.367  0.499
Segmentation-All  1.903   1.691  1.109
As illustrated in Table 5 and Table 6, on the whole, the execution time of SCWSPS is a little inferior to KSPAM and K-means on the real datasets. Meanwhile, on the artificial dataset, it is obvious that the time performance of SCWSPS is better than K-means but a little inferior to KSPAM under each MinSup. For SCWSPS, in the process of preconditioning the sequences, the time cost of mining frequent sequential patterns is very large when dealing with a large-scale sequence database, while in KSPAM the highly efficient 'and' and 'or' operators applied to calculate the similarities between sequences reduce the time cost greatly. However, SCWSPS also has some advantages in time when the sequence scale is not large, owing to the improved selection of initial centers. Moreover, the merge process is based on the simple operation of averaging; thus the computational complexity is better than K-means on the artificial dataset.
6. Conclusion
A novel sequence clustering algorithm, SCWSPS, based on weighted sequential pattern similarity is designed in this study. First and foremost, in accordance with the support relationships between each sequence and the sequential patterns mined from SD, a new similarity based on the weight of the sequential patterns in each sequence is defined for measuring sequences, by which the clustering quality can be greatly improved. In addition, the random selection of initial clustering centers in K-means is changed: the initial clustering centers are gained by merging the sequences with the largest weighted similarity, so globally optimal clustering results can be obtained. Our experimental results and analysis show that the performance of SCWSPS is better than KSPAM and K-means in clustering quality. However, the execution efficiency of SCWSPS is a little inferior to KSPAM and K-means when dealing with a large-scale sequence database.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 61170190), the Youth Foundation of the Hebei Educational Committee (No. Q2012070), and the Science and Technology Research and Development Program of Handan (No. 1321103077-3).
References
[1] Santhisree K, Devi DN, Bhaskar V. SEQDBSCAN: A New Sequence DBSCAN Algorithm for Clustering of Web Usage Data. International Journal of Information Technology and Knowledge Management. 2010; 2(2): 481-486.
[2] Zhao YZ, Liu XY, Zhao H. The K-Medoids Clustering Algorithm with Membrane Computing. TELKOMNIKA. 2013; 11(4): 2050-2057.
[3] Xiong YS. A Clustering Algorithm Based on Rough Set and Genetic Algorithm. TELKOMNIKA Indonesian Journal of Electrical Engineering. 2013; 11(10): 5782-5788.
[4] Meng Y, Luo K, Liu JH, Jiang F. SA-Rough Sets K-means Resource Dynamic Allocation Strategy Based on Cloud Computing Environment. TELKOMNIKA Indonesian Journal of Electrical Engineering. 2012; 10(6): 1485-1489.
[5] Sun XJ, Liu XY. New Genetic K-means Clustering Algorithm Based on Meliorated Initial Center. Computer Engineering and Applications. 2008; 44(23): 166-168.
[6] Yedla M, Pathakota SR, Srinivasa TM. Enhancing K-means Clustering Algorithm with Improved Initial Center. Journal of Application Research of Computers. 2010; 1(2): 121-125.
[7] Fu T, Sun YM. PSO-based K-means Algorithm and its Application in Network Intrusion Detection System. Computer Science. 2011; 38(5): 54-58.
[8] Yu HT, Li X, Yao NM. Research on Optimization Method for K-means Clustering Algorithm. Journal of Chinese Computer Systems. 2012; 33(10): 2272-2277.
[9] Wu D, Ren JD. K-means Sequence Clustering Algorithm based on Top-K Maximal Frequent Sequence Patterns. International Journal of Advancements in Computing Technology. 2012; 4(20): 405-413.
[10] Ren JD, Cai BL, He HT, Hu CZ. A Method for Detecting Software Vulnerabilities Based on Clustering and Model Analyzing. Journal of Computational Information Systems. 2011; 7(4): 1065-1073.
[11] Yang TX, Wang ZH, Wang H, Wang LY. Research of Clustering Initial Center Selection. Journal of Nanjing Normal University (Natural Science Edition). 2010; 33(4): 161-165.
[12] Zhang J, Duan F. Improved K-means Algorithm with Meliorated Initial Centers. Journal of Computer Engineering and Design. 2013; 34(5): 1691-1694.