Internati
o
nal
Journal of Ele
c
trical
and Computer
Engineering
(IJE
CE)
Vol.
5, No. 6, Decem
ber
2015, pp. 1472~
1
479
I
S
SN
: 208
8-8
7
0
8
1
472
Jo
urn
a
l
h
o
me
pa
ge
: h
ttp
://iaesjo
u
r
na
l.com/
o
n
lin
e/ind
e
x.ph
p
/
IJECE
An Opti
mistic Approach for Clust
erin
g
M
u
lti-versi
o
n XM
L
Documents Usin
g Compressed Delta
V
i
ja
y
Sona
w
a
ne, D
.
Ra
j
e
sw
a
r
a
Rao
Department o
f
C
S
E, K.L.
University
, Green
Fields
, Guntur
, Andhr
a Pradesh
Article Info
A
B
STRAC
T
Article histo
r
y:
Received
May 20, 2015
Rev
i
sed
Au
g
12
, 20
15
Accepted Aug 28, 2015
Today
with Stan
dardization of
XML as
an infor
m
ation exch
ang
e
over web
,
huge amount o
f
information is
form
atted in
the XML docu
m
ent. XML
documents are huge in size. The amount
of information that has to be
transm
itted
,
processed, stored, a
nd queried is often larg
er than t
h
at of other
data formats. Also in real world
appl
ications XML documents are d
y
namic
in
nature
. The v
e
rs
atil
e app
lic
abil
it
y of
XML docu
m
ents in different fields of
inform
ation m
a
i
n
tenan
ce
a
nd management
is in
creasing
the d
e
mand to store
differen
t
versio
ns of XML documents
with tim
e. However, s
t
orage of al
l
versions of an XML document may
introd
uce the redun
dancy
.
Self
describing
natu
re of XML cr
eates
the problem of verbosity
, in
result
documents are in huge size. This pape
r proposes
optimistic appr
oach
to Re-
cluster
m
u
lti-ve
r
s
ion XML docu
m
ents which
ch
ange
in t
i
m
e
b
y
reassessing
distance between them
b
y
usin
g know
ledge fr
om
initial
clustering solution
and changes stor
ed in compressed delta
. Evolvin
g
size of XML
document is
reduced b
y
ap
ply
i
ng homomo
r
phic compression before clustering them
which reta
ins it
s original struct
ure.
Compressed delta stor
es the changes
responsible for document ve
rsion
s
, without deco
mpressing them. Test results
shows that our
approach perfo
rms mu
ch bette
r than using fu
ll pa
ir-wise
document comp
arison.
Keyword:
Clu
s
ter
i
ng
Com
p
ressed Delta
H
o
mo
mo
r
p
h
i
c
Mu
lti-v
e
rsi
o
n
XM
L
Copyright ©
201
5 Institut
e
o
f
Ad
vanced
Engin
eer
ing and S
c
i
e
nce.
All rights re
se
rve
d
.
Co
rresp
ond
i
ng
Autho
r
:
Vijay
S
o
nawa
n
e
,
Depa
rt
m
e
nt
of C
S
E, K.L
.
U
n
i
v
ersi
t
y
,
Gree
n Fi
el
ds, Gu
nt
u
r
,
A
n
dh
r
a
Pra
d
es
h, 5
2
2
50
2.
Em
a
il: v
i
j
a
yson
awan
e1
1@gmail.co
m
1
.
IN
TR
OD
UC
TI
ON
W
3
c estab
lished
th
e ex
ten
s
i
b
le m
a
r
k
-
u
p
lan
g
u
a
g
e
(X
ML-
1
.0)
in
1988
an
d
X
M
L h
a
s b
eco
m
e
th
e
defact
o st
an
da
rd f
o
r dat
a
re
p
r
esent
a
t
i
o
n an
d exc
h
a
nge
of
i
n
fo
rm
at
i
on on t
h
e i
n
t
e
r
n
et
[1]
.
T
oday
i
t
i
s
al
so
i
m
p
o
r
tan
t
to kn
ow
th
e bu
siness v
a
lues of
i
n
fo
r
m
atio
n
presen
t on
-li
n
e.
In
fo
rm
atio
n
availab
l
e o
n
-lin
e is n
o
t
onl
y
use
f
ul
f
o
r
i
ndi
vi
d
u
al
use
r
b
u
t
al
so t
o
b
u
si
ness
or
ga
ni
zat
i
ons m
o
st
l
y
for
deci
si
o
n
m
a
ki
ng p
u
r
p
o
s
e
. XM
L
o
f
fers m
a
n
y
featu
r
es
o
f
b
u
sin
e
ss fun
c
tion
s
su
ch
as co
n
t
en
t in
tegratio
n,
in
tellig
en
ce and
salv
ag
e.
It is v
e
ry
i
m
p
o
r
tan
t
to
m
a
in
tain
tho
s
e do
cu
m
e
n
t
s p
r
operly an
d
kn
ow
h
o
w to
u
s
e effi
cien
tly in
formatio
n
in
it.Clu
s
terin
g
i
s
dat
a
m
i
ni
ng
t
a
sk al
way
s
an
d al
way
s
p
e
r
f
o
r
m
e
d usi
ng
di
s
t
ance m
easurem
ent
on
den
s
e
regi
ons i
n
dat
a
set
[2]
.
XM
L
doc
um
ent
s
are
sem
i
-struct
u
re
d i
n
nat
u
re
an
d i
t
i
s
e
n
co
de
by
di
ff
e
r
ent
t
e
xt
ual
fo
r
m
at
[3]
.
He
nce
m
a
ny
t
r
adi
t
i
onal
a
p
p
r
oac
h
es
fai
l
e
d
t
o
ha
ndl
e cl
u
s
t
e
ri
ng
o
f
XM
L d
o
c
u
m
e
nt
s. Seve
ral
cl
ust
e
r
i
ng a
p
pr
oac
h
e
s
are
discuss
e
d
in
[4
]
-[6]
.
Al
t
e
rnat
i
v
e
op
t
i
on t
o
ha
n
d
l
e
XM
L dat
a
i
s
snaps
h
ot
XM
L doc
um
ent
but
i
n
real
w
o
rl
d co
nt
ent
o
f
XML do
cu
m
e
n
t
s is d
y
n
a
m
i
c
in
n
a
ture and ch
ang
e
s in
it are li
m
i
t
l
ess a
n
d
no
t p
r
ed
ictab
l
e [7
]. Ev
ery ti
me
change i
n
ori
g
inal docum
e
nt
is co
m
p
letely t
r
eated
as co
m
p
letely n
e
w
XML (v
ersi
o
n
)
d
o
c
u
m
en
t rather th
an
co
nsid
eri
n
g
the d
e
g
r
ee
o
f
ch
ang
e
in
it. Dyn
a
mic XML
d
o
c
u
m
en
ts create th
e d
e
m
a
n
d
for m
u
lti-v
e
rsion
su
ppo
rt as it is ap
p
licab
le in
man
y
field
s
o
f
in
fo
rm
atio
n
main
ten
a
nce and
m
a
n
a
g
e
m
e
n
t
[8
]. So
it is n
e
cessary
t
o
st
o
r
e di
f
f
ere
n
t
versi
ons
of
XM
L do
cum
e
nt
s wi
t
h
t
i
m
e.
St
ora
g
e o
f
al
l
t
h
e versi
ons
of
an
XM
L
doc
um
ent
Evaluation Warning : The document was created with Spire.PDF for Python.
I
J
ECE
I
S
SN
:
208
8-8
7
0
8
An
Op
timistic App
r
oa
ch
for
Clu
s
terin
g
Mu
l
ti-versio
n
XML Do
cumen
t
s
Usin
g
…
(Vija
y
Son
a
w
an
e)
1
473
i
n
creases t
h
e
red
u
nda
ncy
and m
a
ke sear
chi
n
g an
d q
u
e
ry
i
ng
har
d
er
on g
r
owi
ng
doc
um
ent
s
.The
sel
f
-
d
e
scri
b
i
ng
feat
u
r
e of
XML prov
id
es immen
s
e flex
ib
ility bu
t it in
trod
u
c
es th
e m
a
in
p
r
o
b
l
em
o
f
v
e
rbo
s
ity
whi
c
h resul
t
s
i
n
hu
ge d
o
cum
e
nt
si
zes, w
h
i
c
h
creates
diffic
u
lty sa
m
e
as redunda
n
cy [9].
Th
is p
a
p
e
r in
tro
d
u
ces an
op
ti
mistic ap
p
r
o
a
ch
to
fi
n
d
n
e
w
clu
s
tering
so
lutio
n
of m
u
lti-v
e
rsion
XML
doc
um
ents which are
dy
namic in nat
u
re
. In this a
p
proac
h
we
do not t
r
eat each ne
w XML
docum
e
nt as
com
p
l
e
t
e
l
y
new
doc
um
ent
(v
ersi
o
n
)
rat
h
er
we c
o
n
s
i
d
e
r
s a
m
ount
o
f
t
h
e
d
o
cum
e
nt
s af
fec
t
ed. T
o
l
i
m
i
t
t
h
e si
ze
and re
duce the stora
g
e
spac
e each
docum
e
nt
is
c
o
m
p
ressed using de
signe
d hom
o
m
o
rphic
c
o
m
p
ression
schem
e
(whi
c
h
ret
a
i
n
st
r
u
ct
ur
e l
i
k
e o
r
i
g
i
n
al
doc
um
ent
)
and com
p
resse
d delta is used to
record only affected
part
of
t
h
e
d
o
c
u
m
e
nt
s (wi
t
h
o
u
t
dec
o
m
p
ressi
ng
i
t
)
.
Fi
nal
l
y
up
-t
o
-
dat
e
cl
us
t
e
ri
ng
sol
u
t
i
o
n
i
s
o
b
t
a
i
n
ed
by
usi
n
g
p
r
ev
iou
s
kn
ow d
i
stan
ces calcu
lated
d
u
ring
i
n
itial clu
s
tering
an
d kno
wledg
e
record
ed
i
n
co
m
p
ressed d
e
lta.
2. WH
Y
X
M
L
CO
MP
RESS
ION
?
Th
e sel
f
-d
escri
b
ing
p
r
op
erty
o
f
XML (sch
ema is rep
eated fo
r each
record)
p
r
ov
id
es
fl
ex
ib
ility to
XM
L b
u
t
i
t
al
so i
n
t
r
od
uces
t
h
e
m
a
jor
pr
obl
em
of ve
rb
osi
t
y
of XM
L
doc
um
ent
s
whi
c
h re
sul
t
s
i
n
hu
g
e
doc
um
ent size. Large
doc
um
e
n
t size affect on stora
g
e s
p
ac
e, net
w
or
k ba
n
d
wi
dt
h u
s
ed f
o
r dat
a
exc
h
an
g
e
and
m
a
i
n
m
e
m
o
ry requi
red f
o
r pr
ocessi
ng
. It
m
a
kes doc
um
ent
s
searchi
n
g
,
que
ry
i
n
g
,
an
d cl
ust
e
ri
n
g
h
a
rde
r
.
Hom
o
m
o
rphi
c
com
p
ressi
o
n
s
c
hem
e
ret
a
i
n
t
h
e
ori
g
i
n
al
hi
erarchical struct
ure
of th
e
document and c
o
mpress
e
d
form
at can be accessed and
parse
d
in the s
a
m
e
way of
the origi
n
al one
[9]. If
XML docum
e
nt changes its
structure or content, the
n
com
p
ressed delta is used
t
o
rec
o
r
d
s t
h
e c
h
an
ges bet
w
een t
w
o
versi
o
n
s
w
i
t
hou
t
decom
p
ressi
ng
i
t
.
Fi
gure 1 S
h
o
w
s d
o
c
u
m
e
nt
s hom
o
m
orp
h
i
c co
m
p
resse
d
vi
ew of
ou
r com
p
resso
r, i
t
avoi
ds
ele
m
en
t/attrib
u
t
e v
a
lu
es rep
e
titio
n
b
y
rep
l
acin
g
it with
un
iqu
e
ch
aracter.
Fi
gu
re
1.
H
o
m
o
m
o
rp
hi
c com
p
ress
ed
vi
e
w
3. LITE
RAT
URE
SU
R
V
E
Y
In th
is section we d
i
scu
ss ex
istin
g wo
rk in
th
e
area
o
f
ch
ang
e
d
e
tectio
n and clu
s
terin
g
of m
u
lti-
versi
on
XM
L
doc
um
ent
s
, st
r
e
ssi
ng t
h
e fact
t
h
at
any
of
ex
istin
g
wo
rk
do
es no
t d
e
al time efficien
tly with
reassessi
n
g
clusters o
f
m
u
lti-version
XML do
cu
m
e
n
t
s
b
y
usin
g
no
tion
o
f
co
m
p
ressed
d
e
lta
[10
]
.
Doc
u
m
e
nt
ver
s
i
on
det
ect
i
o
n
and
m
a
nagem
e
nt
i
s
i
m
por
t
a
nt
i
n
vari
ous
ap
p
l
i
cat
i
ons suc
h
as so
ft
wa
re
pl
agi
a
ri
sm
det
ect
i
on,
we
b d
o
cum
e
nt
searc
h
i
n
g a
nd
ran
k
i
ng.
To
det
ect
t
h
e ve
rsi
o
ns,
sim
i
l
a
ri
ti
es bet
w
ee
n
v
a
ri
o
u
s
files
need
to b
e
an
al
ysed
,
for th
is selectio
n
of
similarity fu
n
c
tion and
thresh
o
l
d
v
a
lu
e are im
p
o
rtan
t.
The content a
nd struct
ure si
milarity as well as app
lication
requirem
ent are im
por
t
a
n
t
fact
or i
n
des
i
gni
n
g
sim
i
l
a
ri
ty
func
t
i
on.
Va
ri
o
u
s
schem
e
s have
bee
n
pr
op
osed
for d
e
tecting
ch
ang
e
s
in m
u
l
ti-v
e
rsion
XML
doc
um
ent
s
bas
e
d o
n
t
h
e
d
iff
al
gori
t
h
m
[1
2
]
-[1
5]
. D
o
c
u
m
e
nt
s t
e
xt
ual
co
nt
ent
[
16]
, St
r
u
ct
u
r
e o
f
d
o
cu
m
e
nt
[1
7]
-[
2
0
]
an
d
Doc
u
m
e
nt
C
l
assi
fi
cat
i
on
[2
1]
, [
2
2]
.
To do the c
o
mparative a
n
alysis of
cha
n
ge de
t
ect
i
on schem
e
s need t
h
e a
n
al
y
s
i
s
of va
ri
o
u
s param
e
t
e
rs
suc
h
as c
h
an
g
e
det
ect
i
on
bet
w
een t
w
o
ve
rs
i
ons
of
an
X
M
L doc
um
ent
,
use
of
rel
a
t
i
o
nal
,
del
t
a
,
or
ob
ject
-
referen
c
i
n
g
app
r
o
a
ch
fo
r ch
an
g
e
d
e
tection
,
su
ppo
rt fo
r ordered
XML
d
o
cu
m
e
n
t
s, scalab
ility, su
pp
ort
e
d
file
size,
and repre
s
entation
of unchange
d parts [8].
So
m
e
ch
ang
e
d
e
tectio
n
sch
e
mes ex
p
licitly
sh
ows t
h
e
u
s
er th
e ch
ang
e
d
part of th
e
do
cumen
t
s [23
]
-
[2
5]
, w
h
i
l
e
ot
her m
a
y
not
sho
w
[
7
]
,
[
26]
-
[
3
0
]
.
Speci
fi
ed
schem
e
s shou
l
d
be scal
abl
e
eno
u
gh t
o
det
e
ct
t
h
e
Evaluation Warning : The document was created with Spire.PDF for Python.
I
S
SN
:
2
088
-87
08
IJECE
Vol. 5, No. 6, D
ecem
ber
2015 :
1472 –
1479
1
474
ch
ang
e
s i
n
XM
L d
o
c
u
m
en
ts. Th
e storag
e of
in
term
ed
iate c
o
m
p
lete v
e
rsion
s
of XML
docu
m
en
ts i
m
p
r
o
v
e
s th
e
efficiency a
n
d
space c
o
m
p
lexity as the re
qui
red ve
rsion
ca
n
be c
r
eated
by using t
h
e a
p
propriate intermediate
co
m
p
lete v
e
rsio
n. Qu
ery
p
r
o
c
essin
g
b
e
co
m
e
s faster
wh
ile a syste
m
sto
r
es th
e in
term
ed
iat
e
co
m
p
lete v
e
rsio
ns
because the
r
e i
s
no
nee
d
to re
construct t
h
e i
n
term
ediate versions
run tim
e
.
To cl
uster XM
L doc
u
m
e
nts, researc
h
er in recent
decade
y
ears propos
ed vari
ous
sc
hemes. In [31]
aut
h
or
pr
o
pose
d
m
e
t
hods t
o
di
sco
v
er
fre
q
u
e
nt
l
y
chan
gi
n
g
st
ruct
u
r
e (FC
S
) f
r
om
a sequence
of
versi
ons
of
dy
nam
i
c XM
L doc
um
ent
,
t
h
en p
r
o
p
o
sed m
e
t
h
o
d
nam
e
d C
DX t
o
cl
u
s
t
e
r
dy
nam
i
c XM
L doc
um
ent
col
l
ect
i
on
by
usi
n
g FC
Se
s. I
n
[
32]
aut
h
or
pr
o
pose
d
Se
m
X
C
l
ust
fram
e
wo
r
k
, an
d
di
v
i
de XM
L
doc
u
m
ent
i
n
t
o
t
r
ee t
upl
e
set, th
en
pr
oposed
XK
-
m
ean
s alg
o
r
ith
m
an
d X
F
IH
C alg
o
r
i
th
m to
clu
s
ter
X
M
L do
cu
m
e
n
t
b
y
u
s
ing
W
o
rd
Net.
A novel
W
e
i
g
hted Elem
ent Tree Model (WETM) is
propos
ed
i
n
[
3
3]
f
o
r m
easuri
n
g t
h
e structural similarity
of XM
L
doc
u
m
ent
s
, and t
h
e
W
E
TM
m
ode
l
enhance
s
t
h
e exp
r
essi
on abi
l
i
t
y
of st
ruct
ura
l
i
n
form
at
i
on o
f
su
b
t
r
ees.
Aut
h
o
r
i
n
[3
4]
p
r
op
os
ed a
f
r
am
ewor
k
fo
r cl
ust
e
ri
n
g
XM
L
do
cu
m
e
nt
s by
st
r
u
ct
ure,
t
h
ey
m
odel
t
h
e
XM
L d
o
c
u
m
e
nt
s as
ro
ot
ed
or
dere
d l
a
bel
l
e
d t
r
ees
, t
h
e
n
s
t
udi
ed
t
h
e
usa
g
e
of st
ruct
ura
l
di
st
ance m
e
tri
c
s i
n
hierarc
h
ical clustering algori
thm
s
to
d
e
tect g
r
oup
s of
stru
ctur
ally si
mil
a
r
X
M
L do
cumen
t
s. A
u
t
h
or in
[
3
5
]
pr
o
pose
d
a
n
a
p
pr
oac
h
f
o
r
det
ect
i
ng st
r
u
ct
ural
si
m
i
l
a
ri
t
y
between
XM
L
d
o
c
um
ent
s
whi
c
h
si
g
n
i
f
i
cant
l
y
d
i
ffer
s
fr
om
st
andar
d
m
e
t
hods
ba
s
e
d
on
g
r
ap
h
-
m
a
t
c
hi
ng al
g
o
r
i
t
h
m
s
, and al
l
o
ws a
si
g
n
i
f
i
cant
re
d
u
ct
i
o
n
of
t
h
e
req
u
i
r
e
d
c
o
m
put
at
i
on c
o
st
s.
Ap
pr
oac
h
es
f
o
r cl
u
s
t
e
ri
n
g
X
M
L d
o
c
u
m
e
nt
s are
use
f
ul
t
o
ol
s f
o
r
XM
L
st
re
am
m
i
ni
ng
[3
6
]
, ge
ne t
eam
[3
7]
an
d we
b
service ret
r
ieval [3
8]
. A
di
ffe
rent t
echn
i
qu
e fo
r
cluster
i
n
g
ser
i
es of
heter
o
g
e
n
e
ou
s
X
M
L
doc
um
ent
s
pro
pos
ed by
[
3
9]
. Thi
s
co
nsi
d
e
r
s seri
es (st
r
eam
s) of
XM
L do
cu
m
e
nt
s and cal
cul
a
t
e
s t
h
e sim
ilari
t
y
bet
w
ee
n eac
h
i
n
com
i
ng XM
L d
o
cum
e
nt
and
t
h
e e
x
i
s
t
i
n
g cl
ust
e
rs
by
usi
n
g t
h
e
co
nc
ept
o
f
l
e
vel
st
ruct
ure
.
Si
m
ilarit
y
is d
e
termin
ed
at cluster lev
e
l
rath
er th
an
p
a
ir-wise fo
r i
n
d
i
v
i
du
al d
o
c
u
m
en
ts in
th
e clu
s
ters.
4.
PROPOSE
D
S
Y
STE
M
O
V
ERVIEW
Pro
p
o
se
d ap
pr
oach
ove
r
v
i
e
w
t
o
fi
n
d
u
p
-t
o
-
d
a
t
e
cl
ust
e
ri
ng s
o
l
u
t
i
o
n o
f
m
u
l
t
i
-
ver
s
i
o
n XM
L
doc
um
ent
s
wh
ich
ch
ang
e
s in
tim
e is sh
own in
Figu
re 2.
Fi
gu
re
2.
O
v
er
vi
ew
o
f
t
h
e
Pr
o
pos
ed
A
p
p
r
oac
h
Giv
e
n
set
of S XML do
cu
m
e
n
t
s, its cl
u
s
teri
n
g
co
m
p
o
s
ition
S can
b
e
represen
ted
as grap
h of
fu
lly
connected e
d
ges with
S
* (
S
—
1)=
2
wi
t
h
t
o
t
a
l
num
ber of
wei
ght
e
d
ed
ge
s, t
o
get
si
ngl
e
l
i
nk cl
ust
e
rs o
f
l
e
vel
L, th
e edg
e
s with
weigh
t
w
≥
L h
a
v
e
t
o
pruned
,
resu
lting
co
nn
ected
edg
e
s g
i
v
e
resu
lt clu
s
ter [5
]. If similarit
y
bet
w
ee
n t
w
o
XM
L
doc
um
ent
s
i
s
used
as
m
easure t
o
fi
nd
cl
ust
e
ri
ng
sol
u
t
i
o
n t
h
en
wei
g
ht
o
f
t
h
e
ed
ge
s
con
n
ect
i
n
g
d
o
c
u
m
e
nt
s sym
bol
i
ze t
h
e di
st
a
n
c
e
bet
w
ee
n t
h
e
m
.
Dista
n
ce
:
Di
st
ance bet
w
een
t
w
o
XM
L d
o
c
u
m
e
nt
s can be
defi
n
e
d
by
edi
t
ope
rat
i
o
ns
Op (
i
nsert
,
del
e
t
e
, u
p
dat
e)
wi
t
h
m
i
nim
u
m
cost
t
h
at
t
r
ansf
orm
o
n
e
do
cum
e
nt
i
n
t
o
ot
her
.
Evaluation Warning : The document was created with Spire.PDF for Python.
I
J
ECE
I
S
SN
:
208
8-8
7
0
8
An
Op
timistic App
r
oa
ch
for
Clu
s
terin
g
Mu
l
ti-versio
n
XML Do
cumen
t
s
Usin
g
…
(Vija
y
Son
a
w
an
e)
1
475
di
st
(D
oc
1,
Doc
2
)
= m
i
nim
u
m
(
(di
s
t
(
D
o
c1
→
Do
c2 ),
d
i
st(Do
c
2
→
D
o
c1)
))
If
distance
bet
w
een t
w
o doc
u
ments is s
m
aller, it in
dicates the hi
ghe
r sim
i
larity between
them
. Hence
wh
en
to
tal co
st o
p
e
ration
b
e
t
w
een
two
d
o
c
u
m
en
t is eq
u
a
l to
zero
th
en
th
ese do
cu
m
e
n
t
are said
to
similar.
Wh
en
an
y d
o
c
u
m
en
t in
in
itia
l clu
s
terin
g
so
l
u
tio
n
ch
an
g
e
s
th
en
its d
i
stan
ce with
rest of th
e do
cu
m
e
n
t
s
in
the
cl
ust
e
r al
s
o
c
h
ange
s.
De
pen
d
i
ng
o
n
t
h
e c
h
a
nge
(i
.e
n
u
m
b
er a
n
d
t
y
pe)
res
p
o
n
si
bl
e f
o
r
d
o
cum
e
nt
ve
rsi
o
n
wi
l
l
deci
de
resi
den
ce o
f
t
h
e
ve
rsi
o
n
(
n
e
w
doc
u
m
ent
)
i
n
cl
ust
e
ri
n
g
s
o
l
u
t
i
o
n.
Whe
n
ne
w
ver
s
i
o
n
ap
pear
s i
t
can
be
part
o
f
o
r
i
g
i
n
al
cl
ust
e
r o
r
i
t
m
i
ght
get
co
nt
ai
ned
by
an
ot
he
r
one
or i
t
fo
rm
i
t
s
own cl
ust
e
r. To
get
u
p
-t
o
-
dat
e
clu
s
tering
so
lu
t
i
o
n
it is
requ
ired
to fi
n
d
th
e
n
e
w
d
i
stan
ces
b
e
t
w
een all XML
d
o
c
u
m
en
ts p
a
irs.
To get up-t
o-date clustering
solution after
each se
t of change
s applie
d(whe
n
the doc
u
ments in the
in
itial clu
s
ter ch
ang
e
s th
eir d
i
stan
ces), co
m
p
arison
b
e
tw
een all th
e
do
cu
m
e
n
t
s
p
a
ir is no
t
co
st
effectiv
e
way
because m
o
st
of tim
e new ve
rsion m
a
y carries sm
all am
ount
of cha
n
ge or m
a
y not
m
odified at all. Option
1
shown in dia
g
ram
redunda
n
tly calculates distances
bet
w
een each pa
ir of
doc
um
e
n
ts
by m
a
king ful
l
com
p
arison
be
tween them
. This option i
n
c
u
rs m
o
re cost
as i
t
does n
o
t
con
s
i
d
er
de
gre
e
of c
h
an
ges i
n
t
h
e
doc
um
ent
s
, he
nce m
o
st
of
t
h
e
o
p
era
tions a
r
e
unnecessa
rily repeated.
Our proposed approach
is
time
efficien
t so
lu
tion
t
o
g
e
t
cu
rren
t
clu
s
terin
g
so
l
u
tio
n by reassessi
ng
pai
r
wi
se di
st
an
ces bet
w
ee
n t
h
e doc
um
ent
s
. C
o
m
p
ressed
d
e
lta reco
rd
all
th
e ch
ang
e
s resp
on
si
b
l
e fo
r
m
u
l
tip
le
versi
on
o
f
t
h
e
doc
um
ent
s
wi
t
h
o
u
t
dec
o
m
p
re
ssi
ng t
h
em
. To get
cu
rre
nt
cl
u
s
t
e
ri
ng
sol
u
t
i
o
n, i
t
m
a
kes t
h
e use o
f
kn
o
w
n
di
st
anc
e
s bet
w
ee
n d
o
cum
e
nt
s pai
r
befo
re cha
n
g
e
s appl
i
e
d a
n
d set
of cha
n
ges rec
o
r
d
e
d
i
n
t
h
e
com
p
ressed
de
l
t
a
. It
has
fol
l
owi
n
g
a
dva
nt
a
g
es;
1)
C
o
m
p
r
e
ssi
on
sc
hem
e
use
d
ret
a
i
n
s
d
o
cum
e
nt
s i
n
o
r
i
g
i
n
al
structure with reduce
d
size, Com
p
resse
d
docum
e
nts are
a
ccessed
and
pa
rsed in the s
a
me way of t
h
e
ori
g
inal
fo
rm
at
. 2) C
o
m
p
ressed
del
t
a
rec
o
r
d
s
cha
n
ges a
p
pl
i
e
d
o
n
d
o
c
u
m
e
nt
s w
i
t
hout
dec
o
m
p
ressi
n
g
i
t
,
hel
p
f
u
l
f
o
r
chan
ge
det
ect
i
o
n
.
3)
It
co
ns
i
d
ers
t
h
e
de
gr
ee o
f
c
h
a
nges
ap
pl
i
e
d
o
n
t
h
e
d
o
cum
e
nt
s be
fo
re
fi
n
d
i
n
g
n
e
w
cl
ust
e
ri
n
g
s
o
l
u
t
i
on.
4.
1. Wor
k
i
n
g
1)
Doc
u
m
e
nt
C
o
m
p
ressi
on -
G
i
ven set
o
f
X
M
L doc
um
ent
s
are com
p
ress
ed usi
ng
h
o
m
o
m
o
rphi
c
co
m
p
ression
sch
e
m
e
, it retain
s stru
ct
u
r
e of
o
r
ig
in
al do
cu
m
e
n
t
s.
2)
Fin
d
i
n
g
In
itial Clu
s
tering
So
l
u
tio
n - Th
is step
g
i
v
e
s in
itial clu
s
tering
so
lutio
n
s
b
y
u
s
ing
d
i
stan
ce
base
d cl
ust
e
ri
n
g
al
g
o
ri
t
h
m
and rec
o
r
d
s
di
st
ance m
a
t
r
i
x
b
e
tween
all th
e
pairs o
f
t
h
e XML d
o
c
u
m
en
ts with
di
rect
i
o
n
of m
i
ni
m
u
m
cost
(t
h
i
s
step is e
x
ec
uted only once
).
3)
Obt
a
i
n
i
ng
d
o
c
u
m
e
nt
s versi
o
n
- Here
d
o
cum
e
nt
ver
s
i
o
ns ar
e creat
ed (
u
si
n
g
o
u
r
ver
s
i
o
n c
r
eat
i
o
n
pr
o
g
ram
)
by
a
ppl
y
i
n
g
s
o
rt
o
f
cha
nge
s
on
c
o
m
p
ressed
i
n
p
u
t
XM
L
d
o
c
u
m
e
nt
s. C
o
m
p
r
e
ssed
del
t
a
rec
o
r
d
s al
l
th
e set
o
f
ch
ang
e
s
b
e
tween
t
h
e in
itial clu
s
terin
g
ru
n
and
cu
rren
t tim
esta
m
p
s withou
t u
s
i
n
g d
e
co
m
p
ression
.
4)
Ob
tain
i
n
g
fin
a
l Clu
s
terin
g
So
lu
tion
- Th
is step
is rep
e
titiv
ely ex
ecu
ted
wh
en
ev
er version
s
appea
r
s
or c
u
rrent clusteri
ng s
o
lution is
requi
r
ed. It
recalcul
a
tes the
new di
stances
betwee
n all pai
r
s
by using
k
nowledg
e ab
ou
t ch
ang
e
s
from
co
m
p
ressed
d
e
lta and
d
i
stan
ce m
a
trix
reco
rd
ed
d
u
ring
i
n
itial clu
s
tering
run
.
In t
h
i
s
, cri
t
i
cal
part
i
s
fi
n
d
i
n
g
new
di
st
ances
based
o
n
dat
a
i
n
com
p
ressed
del
t
a
and
pre
v
i
ous
kn
o
w
di
st
ances,
i
n
ne
xt
sect
i
o
n we p
r
esent tec
hni
que to ac
hieve
thi
s
.
5
.
D
I
STAN
CE
M
E
ASUR
EM
EN
T
Hierarc
h
ical st
ruct
ure
of XM
L easily allows perfo
rm
i
ng o
p
erat
i
o
ns
o
n
t
h
e no
des
o
f
t
h
e
doc
um
ent
s
.
Co
m
p
ressio
n
sch
e
m
e
we u
s
ed
h
e
re
retain
s
o
r
i
g
in
al do
cumen
t
stru
cture. In m
u
lti-v
e
rsio
n XML
do
cumen
t
s,
new
v
e
rsi
ons
are obt
ai
n
e
d
b
y
ap
pl
y
i
ng
i
n
s
e
rt
, up
dat
e
a
n
d del
e
t
e
o
p
e
r
a
t
i
ons
i
n
c
o
m
b
i
n
at
i
o
n
on
d
o
c
u
m
e
nt
v
e
r
s
i
o
n nod
es
an
d su
m
o
f
th
ese op
er
a
tion
s
are stored
in com
p
ressed
d
e
lta.
Compresse
d Delta(C
δ
)
- Gi
v
e
n dy
n
a
m
i
c XM
L doc
um
ent
Doc
1
wi
t
h
i
t
s
versi
on
Do
c
1
΄
, com
p
ressed
del
t
a
rec
o
r
d
s
t
h
e c
h
a
nges
f
r
o
m
one st
at
e o
f
d
o
c
u
m
e
nt
t
o
anot
her
.
It
co
n
s
i
s
t
s
of
a
set
o
p
erat
i
o
ns
Op
(in
s
ert,
del
e
t
e
, u
p
dat
e)
, ex
ecu
tion
o
f
it o
n
on
e
state of
d
o
c
u
m
en
t Doc1
will retu
rn do
cu
m
e
n
t
in
stat
e
Do
c
1
΄
.
Cost of Compr
e
ssed Delta(C
C
δ
)
- The cost
of c
o
m
p
ressed
del
t
a
i
s
sum
of ope
rat
i
ons c
o
s
t
Op(
i
nsert,
del
e
t
e
, u
p
dat
e)
listed
in
t
h
e com
p
ressed
d
e
lta.
Invert
e
d O
p
er
at
i
on (
O
invert
)
- Gi
ven dy
nam
i
c XM
L docu
m
ent
Doc1 wi
t
h
i
t
s
versi
on
Doc
1
΄
, If an
execution of operation
Op
(insert, d
e
lete, upd
a
t
e)
on d
o
c
u
m
e
nt
Doc
1
ret
u
r
n
s i
t
s
ve
rsi
o
n
Doc
1
΄
, then e
x
ecution
of
i
nve
rt
ed ope
rat
i
on o
n
ve
rsi
on
Do
c
1
΄
retu
rn
s orig
i
n
al do
cu
m
e
n
t
Do
c1. i.e
insert(
A
) is i
nve
rted
o
p
erati
on
o
f
cor
r
es
po
n
d
i
n
g del
e
t
e
(A
)
an
d up
dat
e
( A
→
B
) i
s
i
nve
rt
ed
o
p
erat
i
o
n
of c
o
r
r
esp
o
ndi
ng
u
p
d
at
e(B
→
A) where A,
B a
r
e
th
e nod
es
in
an
X
M
L
do
c
u
me
n
t
s
.
Giv
e
n
an
in
itial clu
s
terin
g
so
lu
tio
n
S co
n
t
ai
n
i
ng
Do
c1
an
d Do
c2
with
d
i
st(Do
c
1, Do
c2) is d
i
stan
ce
bet
w
ee
n t
h
em
and
( D
o
c1
→
Do
c2
) is th
e min
i
m
u
m co
st d
i
rectio
n, if set o
f
ch
an
g
e
s
sto
r
ed
in
co
mp
ressed
Evaluation Warning : The document was created with Spire.PDF for Python.
I
S
SN
:
2
088
-87
08
IJECE
Vol. 5, No. 6, D
ecem
ber
2015 :
1472 –
1479
1
476
d
e
lta ( C
δ
) t
r
a
n
sf
orm
one
do
cum
e
nt
i
n
t
o
ot
her
,
t
h
e
n
ne
w
di
st
ance
di
st
΄
b
e
t
w
een t
h
em
can be
de
fi
ne
d
by
usi
n
g
follo
win
g
fo
rm
ula:
(1
)
W
h
en
Do
c1
c
h
a
n
g
e
s
to
Do
c
1
΄
th
en
to
find
its n
e
w d
i
st
an
ce with
Doc2
, co
mm
o
n
set o
f
o
p
e
ration
s
th
at
trans
f
orm
s
Doc1 to
Do
c
1
΄
wi
t
h
m
i
nim
u
m di
st
ance nee
d
t
o
be su
btrac
t
ed while calculating the
dis
t
ances
bet
w
ee
n
Doc
1
΄
and
Doc
2
a
s
t
h
ey are
equal i
n
results, so
only d
i
stin
ct op
eratio
n
s
n
e
ed
to
co
nsid
er
ed
.
(2
)
Here
O
i
are t
h
e
p
o
p
erat
i
o
ns
f
r
om
di
st
(
D
oc
1
,
D
o
c
2
)
w
h
i
c
h
have
co
nse
q
ue
nt
i
n
vert
e
d
ope
rat
i
ons
O
i
invert
in
C
δ
(D
oc
2,
Do
c
2
΄
),
1
≤
i
≤
p.
It
m
eans t
h
e set
of ope
rat
i
ons
whi
c
h gi
ves m
i
nim
u
m
di
st
ance bet
w
een
Doc
1
an
d D
o
c2
,
an
d were subsequ
e
n
tly inv
e
rted
du
ri
n
g
Do
c2 tran
sformatio
n
i
n
to
Doc
2
΄
need to
be s
u
btracted whe
n
calculating the
distance
between Doc
1
a
nd
Doc
2
΄
, as their com
b
ined e
ffe
ct is
n
u
ll,
wh
ereas on
ly th
e d
i
stin
ct
n
on-
inv
e
r
t
ed op
er
ation
s
n
e
ed
to
b
e
coun
ted.
(
3
)
Here O
i
are q r
e
si
dual
o
p
erat
i
ons
fr
om
di
st
(Doc
1,
Doc
2
) aft
e
r rem
ovi
n
g
r
e
peat
ed o
p
e
r
at
i
ons
fr
om
C
δ
1 whi
c
h
have
c
onse
q
ue
nt
i
n
vert
e
d
o
p
e
r
at
i
ons
Oi
i
nve
r
t
i
n
C
δ
2 ( D
o
c
2
,
Do
c
2
΄
),
1
≤
i
≤
p.
W
h
en bo
th Do
c1 an
d Do
c2
h
a
ve
ch
ang
e
d
in
t
o
its resp
ectiv
e
v
e
rsi
o
ns
Do
c
1
΄
and
Doc
2
΄
, above
bot
h form
ulas are applicable to fi
nd ne
w
distance
di
st
΄
(
Doc
1
΄
,
Doc
2
΄
). First we add the cost of d a
n
d C
δ
1 a
n
d purge the operati
o
ns
that are re
pe
ated in
bot
h
d a
n
d
C
δ
1
an
d
rem
ove re
m
a
i
n
i
ng
op
erat
i
ons
t
h
ose a
r
e i
nve
rt
ed
i
n
C
δ
2.
(
4
)
Whe
n
b
o
t
h
d
o
c
um
ent
s
Doc1
and D
o
c
2
d
o
not
ch
an
ge, t
h
en the ne
w dis
t
ance dist is the sa
m
e
with the old
distance dist.
6. E
X
PE
RIM
E
NTAL E
V
A
L
UATIO
N
Pro
p
o
se
d ap
pr
oach i
s
im
pl
em
ent
e
d usi
n
g Java. T
o
eval
u
a
t
e
t
h
e perf
or
m
a
nce of p
r
o
p
o
se
d app
r
oac
h
we e
x
tracted
docum
e
nts of variable size from
XML data
rep
o
sito
ry
with an
av
erag
e
n
u
m
b
er o
f
lev
e
ls o
f
4
[40
]
. In
itial clu
s
terin
g
so
l
u
tion
is ob
tain
ed
fo
r co
m
p
ressed
set o
f
inpu
t XML d
o
c
u
m
en
ts b
y
find
ing
m
i
n
i
m
u
m
distances
between all the doc
u
m
e
nt pair
s. Th
en
d
i
stan
ces
between
all do
cu
m
e
n
t
p
a
ir in
th
e clu
s
teri
ng
so
lu
tion
along with
t
h
e set
of ope
rations
c
o
rre
s
p
ondi
ng to each m
i
nim
u
m
distance
is recorde
d
. T
o
get new
doc
um
ents
versi
on
we ap
pl
i
e
d di
f
f
ere
n
t
perce
n
t
a
ges
of
chan
ges o
n
c
o
m
p
ressed i
n
p
u
t
XM
L d
o
cu
m
e
nt
s. M
a
i
n
object
i
v
e
of t
h
e e
v
al
uat
i
on
i
s
t
o
com
p
are (a
ft
er c
h
a
n
g
e
s ap
pl
i
e
d)
t
i
m
e re
qui
re
d t
o
f
i
nd
ne
w cl
u
s
t
e
ri
n
g
(R
e
-
cl
ust
e
ri
n
g
)
sol
u
t
i
o
n o
f
t
h
e
doc
um
ent
s
wit
h
and
wi
t
h
o
u
t
com
p
ressi
on
usi
n
g f
u
l
l
doc
um
ent
co
m
p
ari
s
i
on schem
e
(FDC
)
wi
t
h
ou
r pr
o
p
o
s
ed
a
p
pr
oach
.
Tabl
e
1. E
x
per
i
m
e
nt
resul
t
s
No of
Docu
m
e
nts
Docu
m
e
nts
Original Size
(Kb)
Size af
ter
Co
m
p
ression (
K
b)
Changes Applied t
o
get Ver
s
ions (
%
)
Clustering Ti
m
e
(m
i
llisecond)
Full Docu
m
e
nt
Co
m
p
a
r
ision(FDC
)
Pr
oposed
Appr
oach
without
co
m
p
ression
with
co
m
p
ression
10
50
35
10
756
396
90
25
100
70
20
2150
1548
334
50
500
350
50
4589
3225
964
100
1000
700
80
9947
5610
1465
150
1500
1040
100
1413
5
8256
2104
Ex
peri
m
e
nt
al
resul
t
s
are s
h
ow
n i
n
Ta
bl
e 1. Th
ese
re
sults clearly
dem
onstrate that applie
d
com
p
ression schem
e
satisfactorily redu
ces the docum
e
nts size. Last colum
n
sh
o
w
s pr
opo
sed
ap
pro
ach
is ti
me
Evaluation Warning : The document was created with Spire.PDF for Python.
I
J
ECE
I
S
SN
:
208
8-8
7
0
8
An
Op
timistic App
r
oa
ch
for
Clu
s
terin
g
Mu
l
ti-versio
n
XML Do
cumen
t
s
Usin
g
…
(Vija
y
Son
a
w
an
e)
1
477
effi
ci
ent
t
o
fi
n
d
ne
w cl
ust
e
ri
ng
(by
reasses
s
i
ng
new
di
st
a
n
ces) t
h
an
usi
n
g f
u
l
l
pai
r
-wi
s
e com
p
ari
s
on
bet
w
ee
n
th
e do
cu
m
e
n
t
s.
Resu
lt ch
arts
are sh
own
in
Fig
u
re
3
.
Ch
arts also
clearly
dem
onstarte t
h
at our
porpos
ed a
p
proach
per
f
om
r wel
l
e
v
en
i
f
n
u
m
b
er of
d
o
c
u
m
e
nt
s and
ap
pl
i
e
d
pe
r
cent
a
ge
o
f
c
h
a
g
es
(t
o
get
new
ve
rsi
o
n)
i
n
cre
a
sed.
Fi
gu
re
3.
Ex
pe
ri
m
e
nt
R
e
sul
t
C
h
art
7. CO
N
C
L
U
S
I
ON
In
t
h
is p
a
p
e
r
we h
a
v
e
p
r
op
osed
an
op
tim
is
tic ap
p
r
o
a
ch
for clu
s
teri
n
g
m
u
lti-v
e
rsion
xml d
o
c
u
m
en
ts
wh
en
d
i
stan
ces b
e
tween
in
itially c
l
u
s
tered
d
o
c
u
m
en
ts h
a
ve ch
ang
e
d. After do
cu
m
e
n
t
v
e
rsion
app
ears ev
ery
ti
m
e
ru
nn
ing
o
f
fu
ll p
a
ir-wise d
o
c
u
m
en
ts
co
m
p
arisio
n
on the entire modi
fied do
cum
e
nt
set
i
s
expensi
v
e
so
lu
tion
as it in
cur red
und
an
t op
eration
s
.
Our pro
p
o
s
e
d
approach allows reas
sessing th
e p
a
ir-wise XML
d
o
c
u
m
en
t d
i
stan
ce b
y
fi
nd
ing effect of th
e t
e
m
p
o
r
al ch
anges o
n
kn
own
distan
ces b
e
tween
in
itially clu
s
tered
doc
um
ent
s
.Thi
s p
r
o
p
o
se
d a
p
pr
oac
h
i
s
bot
h
t
i
m
e
and
I/
O
effect
i
v
e
,
as
h
o
m
o
m
o
rp
hi
c c
o
m
p
ressi
o
n
sc
hem
e
i
s
appl
i
e
d
o
n
i
n
p
u
t
XM
L d
o
c
u
m
e
nt
s and t
h
e
num
bers o
f
o
p
erat
i
o
ns i
n
v
o
l
v
ed i
n
reasses
s
i
ng t
h
e
di
st
an
ce are
hi
g
h
l
y
re
duce
d
.
REFERE
NC
ES
[1]
S. Abitebou
l, P.
Buneman,
and D
.
Suciu
,
”Data o
n
the Web”, San
Fr
ancisco: Morgan Kaufmann, 2
000.
[2]
Chen M. S., H
a
n J., and Yu P., ”Data Mining:
An Overview fr
om
Databas
e
P
e
rs
pectiv
e”
,
IEEE Transactions on
Knowledge and
Data Engin
eerin
g,
Vol. 8
,
866-8
83, 1996
.
[3]
Sonawane Vijay
.
, D. R. R
a
o., ”XML Document Mining”,
Journal Research in Computer Engineering a
nd
Electronics,
Vol. 2, No. 2
,
pp
. 1-4
,
2013
.
[4]
Cos
t
a G
., M
a
n
c
o G
., O
r
ta
le R
.
, and
Tagar
e
l
li
A
., ”A
tr
ee-b
a
sed Approach
to
Clustering XM
L documents b
y
Struc
t
ure”
,
PAK
DD 2004,
pp. 13
7-148, 2004
.
[5]
Dalam
a
gas T.
, Cheng T., W
i
nk
el K. J., and Sellis
T., ”Clust
eri
ng XML documents b
y
Structur
e”,
SETN 2004,
pp.
112-121, 2004
.
[6]
Shen, Y. and
Wang, B.,
”Clustering Schemale
ss
XML documents”, Spring
er, pp.
767-784, 2003
[7]
D
y
reson C
.
, and
Grandi F., “Tem
poral XML”,
Da
tabase Systems
, pp.
3032-3035
, 2009.
[8]
Sidra F., Mansoor S., ”T
em
poral and m
u
lti-vers
ioned XML doc
um
ents: A surve
y
”,
In
formation
Processing and
Management,
V
o
l. 50
, No
. 1
,
pp
. 113-131, 2014.
[9]
Sherif Sakr.,
”X
ML compression techniqu
es: A survey
and comparison”,
Journal of Computer a
nd System Scien
c
e
,
Vol. 75
, No. 5, p
p
. 302-322
, 200
9.
[10]
Vija
y S
onawan
e
, D.
R.
Rao,
”A
Com
p
arative
S
t
u
d
y
:
Chang
e
De
t
ect
ion and
Quer
ying
D
y
n
a
m
i
c X
M
L Docum
e
nts
”
,
International Jo
urnal of
Electrical
and Computer Engin
eering
(
I
JECE)
,
Vol. 5, No. 4
,
2015
.
[11]
Samini S., Su-Cheng H., Poo Kuan H
.,
”Bridging XML and Relation
a
l Databa
s
e
s: An Effectiv
e Mapping Scheme
based on Persistent”,
Internation
a
l Journal of Electrical
and Com
puter Engineerin
g (
I
JECE)
,Vol. 2, No. 2, pp. 239
-
246, 2012
.
[12]
Al-Khalifa S., Jagadish H. V.,
Koudas N
., Patel J. M., Srivastava
D., and Wu Y., ”Structural
j
o
ins: a prim
itive for
efcient XML quer
y
p
a
ttern matching”,
Proceed
ings of the eighteenth interna
tion
a
l conferen
ce on
data engineerin
g
,
pp. 141-154
, 20
02.
[13]
Rusu L. I
., R
a
h
a
y
u
W
.,
Tan
i
ar
.
,
”Stor
a
ge
te
chn
i
ques for m
u
lti-
versioned XML
docum
ents”,
P
r
oceedings
o
f
th
e
thirteen
th
intern
ational
conferen
ce on
database systems for ad
van
ce app
lica
tions
,
pp. 538-545
, 20
08.
[14]
S
accol D d
e
B.
,
Edelwe
is
s
N.,
Galant
e R de M
.,
Zanio
l
o C.
, ”
X
M
L
vers
ion dete
ction
”
,
Proceedings of th
e ACM
symposium on document
engin
e
ering
, pp. 7988, 2
007.
Evaluation Warning : The document was created with Spire.PDF for Python.
I
S
SN
:
2
088
-87
08
IJECE
Vol. 5, No. 6, D
ecem
ber
2015 :
1472 –
1479
1
478
[15]
Wang Y.
,
DeWitt D. J.
, and Cai J.
Y.,
”X-Diff
:
an
effe
ctiv
e
ch
ange d
e
te
ct
ion a
l
gorithm
s
for X
M
L docum
ents
”
,
Proceed
ings of
t
h
e nin
e
te
enth
int
e
r
national
conf
e
r
ence on
data
en
gineering
,
pp
. 5
19-530, 2003
.
[16]
Baez
a-Yat
e
s
R.
,
and Rib
e
iro-Net
o
B.,
”M
odern
information retrieval: Th
e
concep
ts and
techno
lo
g
y
b
e
hind
sear
ch”,
ACM Press/Add
i
son-Wesley,
201
1.
[17]
F
l
es
ca S
., and P
uglies
e
A.,
”F
as
t dete
ction of XM
L s
t
ructural s
i
m
ilarit
y
”
,
IEEE
Transactions on
Knowledge and
Data Engin
eerin
g,
Vol. 17
, No 2
,
pp. 160-175, 20
05.
[18]
Gao M.,
and C
h
en F., ”Cluster
ing XML Data
Streams b
y
Stru
cture based on
Slid
ingWindows and Expon
ential
Histogra
m
s”
,
P
r
oceedings of
t
h
e int
e
rnational
conferen
ce
on
advances in
databases, kno
w
ledge
, and da
ta
applications
, pp
. 224-230, 2013.
[19]
Wan X., Yang J., ”Using proportional tr
an
sportation similarity
w
ith learn
e
d
element semantics for XML docume
n
t
cluster
i
ng”,
Pr
o
ceed
ings
of
the
ft
eenth
int
e
r
natio
nal con
f
er
enc
e
o
n
wor
l
d wid
e
w
e
b
, pp
. 961-962
,
2006.
[20]
Elham B. F., H
a
san K.,
”Improving semantic
clustering using
with
Ontolog
y
a
nd rules”
,
Intern
ational Journal of
Electrica
l
and
C
o
mputer Engin
e
ering (
I
JECE)
,
Vol. 4
,
No. 1, pp
. 7-15
, 2014
.
[21]
Pon R. K., Crd
e
nas A. F., Bu
ttler D.,
Critchlow
T., ”iSco
r
e: M
e
asuring the in
ter
e
stingness of ar
ticles in a
limited
user environment”,
Proceed
ings of the IEEE symposium on
comp
utational in
tellig
ence and data
mining
, pp. 354-
361, 2007
.
[22]
Wang Y., Hodg
es J., and Tang B.,
”Classication
of web docume
n
ts us
ing a naive bay
e
sian method”,
Proceedings o
f
the
fteenth
IEEE international co
nference
on
tools with, articial intelligen
ce
, pp
.
560-564, 2003
.
[23]
Leonard
i E., Bhowmick S. S.,
Madria
S., ”Xan
d
y
: Detecting changes on
large unordered XML documents using
rela
tiona
l da
tab
a
ses”,
Proceed
in
gs of the international
confer
en
ce on database systems for adva
n
ced app
lica
tion
s
,
pp. 711-723
, 20
05.
[24]
Rus
u
L. I., R
a
h
a
y
u
W
.
, Tan
i
ar
D
., ”M
ain
t
ai
n
i
n
g
versions of d
y
namic XML documents”,
Proceedings of the sixth
internationa
l
co
nference on
web
information, systems engineering
, pp
. 536-543
, 2
005.
[25]
Wuwongse V., Yoshikawa M.,
a
nd Amagasa T., ”Temporal v
e
r
s
ioning of XML documents”,
Pr
oceed
ings
of th
e
Seven
th Internat
ional conferen
ce
on digital libraries
: Internation
a
l collaboratio
n
and cross-fertil
ization
, pp. 419-
428, 2004
.
[26]
Cavalieri F., ”EXup: an engine
for the evolution
of XML sche
mas
and as
s
o
ciat
e
d
docum
ents
”,
P
r
oceedings
of th
e
internationa
l
co
nference on
extending database technolog
y
, pp. 1
10, 2010
.
[27]
Cavalieri F., Gu
errini G
., Mesiti
M.,
and
Oliboni
B.,
”On the redu
ction of
sequen
c
es of XML docu
m
ent and sch
e
ma
update op
eratio
ns”,
Proceeding
s of the IEEE twenty seven
th in
ternationa
l conferen
ce on
data engineering
workshops
, pp. 7
7
-86, 2011
.
[28]
Guerrini G.,
and
Mesiti M., ”X-
E
volution
:
A com
p
rehensive ap
proach for XML schem
a
evoluti
on”,
Proceeding
s
of
the
international workshop on da
t
abase and
exp
ert systems application
, pp. 251-2
55, 2008
.
[29]
Rosado L. A., Mrquez A. P., and
Gil,
J. M., ”Man
aging branch ver
s
ioning in
versioned/temporal
XML documents”,
Proceed
ings of
t
h
e in
ternati
ona
l
symposium on X
M
L, database
, p
p
. 107-121
, 200
7.
[30]
Zholudev
V., Ko
hlhase
M.,
”TN
T
Base: A versio
ned storag
e for
XML”,
Ba
lisage: The Markup
C
onference,
2009.
[31]
Wei Li, Xiong
fei Li, and R
e
gen Te.,
”Cluster D
y
n
a
mic XML Documents based
on Frequently
Chang
i
ng
S
t
ructures
”
,
Adv
ances in
informa
tion S
c
ien
ces an
d Service Sciences(
A
ISS)
,
Vol. 4, No. 6
,
pp
. 70-76
, 2012
.
[32]
A. Tagar
e
ll
i.
, an
d S
.
Greco.
,
”S
e
m
an
tic clusterin
g
of XML documents”,
ACM Transactions on Information Systems
,
Vol. 28
, No. 1, p
p
. 1-56
, 2010
.
[33]
C.
Y.
Wang,
X. J.
Yuan, and
H.
Ning,
et a
l
.
,
”Similarity
Ev
aluation of XM
L Documents B
a
sed on Weigh
t
ed
Elem
ent Tree
M
odel”
,
In Advan
ced Data Min
i
n
g
and Applica
t
ions,
Vol. 5678,
R. Huang,et al.,
Eds., ed
: Spring
er
Berlin
Heidelber
g
, pp
. 680-687
,
2009.
[34]
T
.
Da
la
ma
ga
s,
T
Che
ng, K. J. Winkel,
et
al
.,
”A methodolog
y
for clusterin
g
XML documents b
y
stru
ctur
e”,
Information Systems,
Vol. 31, No. 3
,
pp
. 187-228
, 2006.
[35]
S.
Fle
s
c
a
,
G.
Ma
nc
o,
E.
Ma
sc
iari,
et al
.,
”Fast
detection of XML
s
t
ructural s
i
m
ilarit
y
”
,
IEEE Tran
sactions on
Knowledge and
Data Engin
eerin
g,
Vol. 17
, No. 2
,
pp
. 160-175
, 2
005.
[36]
M.
X.
Gao,
W. J.
Yao,
and G.
J. Mao, ”Expo
nential hi
stogr
am of cluster feature for XML stream”,
Journal of
Beijing University
o
f
Technology,
Vol. 37
, pp
. 12
42-1248, 2011
.
[37]
B. Wang, and C
.
Lin
,
”Improved algorithms for fi
nding gene
teams and cons
tructing gen
e
te
am
trees
”
,
IE
EE/
AC
M
Transactions on
Computational
Biology
and Bio
i
nformatics,
Vol.
8, pp
. 1258-127
2, 2011
.
[38]
P.
Ple
b
a
n
i,
a
nd B.
Pe
rnic
i,
”
U
RBE
:
We
b Se
rv
ic
e Retri
e
val B
a
se
d on Sim
ilarit
y
Evalu
a
tion
”
,
IEEE Transactions on
Knowledge and
Data Engin
eerin
g,
Vol. 21
, No. 1
1
, pp
. 1629-164
2, 2009
.
[39]
N
a
yak R., X
u
S
., “
X
CLS
:
A
F
a
s
t
and Effec
t
ive Cl
uster
i
ng Algorithm for Heterogen
e
ous XML Documen
t
s”,
Pr
oceed
ings
of
the 10
th Pa
ci
fic-
A
s
i
a Confer
ence
on Ad
van
ces
in Know
led
g
e Dis
c
ov
er
y a
nd Data Minin
g
,
Singapore, pp. 3
918, 2006
.
[40]
XML data repos
itory
, online
at
www.
cs.
w
ashi
ngton.
edu/research/pro
jects/xmltk/xmldata.
Evaluation Warning : The document was created with Spire.PDF for Python.
I
J
ECE
I
S
SN
:
208
8-8
7
0
8
An
Op
timistic App
r
oa
ch
for
Clu
s
terin
g
Mu
l
ti-versio
n
XML Do
cumen
t
s
Usin
g
…
(Vija
y
Son
a
w
an
e)
1
479
BIOGRAP
HI
ES OF
AUTH
ORS
Vijay
Sonawan
e
obtained
Bach
elor Degree in
Inf
o
rmation Techn
o
log
y
from North Maharashtr
a
University
(MS)
, India. Th
en h
e
earned
Master
in
Technolog
y
(Co
m
puter scien
ce)
from Shivaji
Univers
i
t
y
,
Kolh
apur (M
S
)
, Ind
i
a
.
He
is
cu
rrent
l
y
P
h
D Res
earch
s
c
holar
in
com
pute
r
s
c
ien
c
e
and
engineering in K.L.University
, Vijay
a
wad
a
.
H
e
is working a
s
Assista
n
t Profe
ssor in Sa
ndip
Institute of Te
ch
nolog
y
and Rese
arch Centr
e
, Nash
ik. His m
a
jor area of inter
e
st is Data m
i
ning,
Information retr
ieval, web
info
r
m
ation management. He published
various p
a
p
e
rs in Journals
and Confer
ences.
D.Rajes
w
ar
a Ra
o, he is
a P
r
ofe
ssor in departm
e
nt of CSE in K L
Universit
y
, Vija
yaw
a
da. He is
acting as head
s of knowledge engin
e
ring r
e
s
earch group
and Research
p
o
licy
advisor
y
com
m
ittee
in K.
L.Universi
t
y
.
He is He
had
pub
lis
hed v
a
rious r
e
search
pap
e
rs at National and
Interna
tiona
l Jou
r
nals/Confer
enc
e
s in
the
ar
ea o
f
data mining,
soft computing.
Evaluation Warning : The document was created with Spire.PDF for Python.