International Journal of Electrical and Computer Engineering (IJECE)
Vol. 6, No. 3, June 2016, pp. 995~1001
ISSN: 2088-8708, DOI: 10.11591/ijece.v6i3.9878
Journal homepage: http://iaesjournal.com/online/index.php/IJECE

Automatic Detection of Illegitimate Websites with Mutual Clustering

K. Kanaka Durga, V. Rama Krishna
Dept of CSE, K L University, Guntur, AP

Article Info

Article history:
Received Jan 6, 2016
Revised Mar 10, 2016
Accepted Mar 25, 2016

ABSTRACT
Web pages frequently contain identical or nearly identical content. Crawling and indexing such similar pages places an overhead on the search engine that severely affects its performance and quality, so the web crawling research community has developed techniques to detect identical or similar web documents. Providing relevant data to users on the first results page is a major factor for search engines. To address these issues we propose a methodology called Automatic Detection of Illegitimate Websites with Mutual Clustering (ADIWMC), a novel and efficient approach for detecting similarities among web pages in web clustering. Identical and similar web pages are detected when the crawled pages are stored in the repository. First, the keywords (adwords) are extracted from the crawled pages, and the similarity between two pages is then checked on the basis of their keyword usage. A threshold value is set for this check: if the similarity percentage is greater than the threshold, the similar content is removed, which reduces the size of the repository and improves search engine quality. The sections on the existing and the proposed methods explain how this works.

Keyword:
Illegitimate
Mutual Clustering
Phishing
Web Crawled

Copyright © 2016 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
K. Kanaka Durga,
M. Tech Student, Dept of CSE,
K L University,
Vaddeswaram 522502, Guntur District, Andhra Pradesh, India.

1. INTRODUCTION
Cybercriminals have adopted two familiar strategies for cheating consumers online: large-scale attacks and targeted attacks. Many scams are designed for large-scale success [1]. Phishing scams pose as banks and online service providers by the thousands, sending out millions of spam messages to lure a small fraction of users to a fake website under criminal control [2]. In fact, many criminals work somewhere in between, faithfully reproducing the logic of a fraud from previous versions of the attack without fully automating it. For example, criminals engaged in advance-fee fraud create websites for banks that do not exist, complete with online banking in which the victim can log in to inspect their 'deposits'. When a fake bank is shut down, the criminals create a new one adapted from the old site. Criminals also set up fake escrow services as part of a larger advance-fee fraud. On the surface the escrow sites seem different, but they often share similarities in their text or HTML structure. Yet another example is online Ponzi schemes known as high-yield investment programs (HYIPs). These programs offer investors extravagant interest, which means that they inevitably collapse when new deposits dry up. The perpetrators behind the scenes then create new programs that often share similarities with previous versions [3]. The designers of these scams have a strong incentive to keep their new copies distinct from the old ones, because potential victims may be scared away when they realize that an earlier version of the site has been reported as fraudulent. Criminals therefore make a concerted effort to distinguish new copies from old ones.
The main goal of this research is to develop a new and effective method for detecting near-duplicate web documents. First, the crawled web pages are preprocessed by document parsing, which removes the HTML tags and JavaScript from the documents [4]-[8]. This is followed by the removal of common words (stop words) from the crawled pages. A stemming algorithm is then used to filter the affixes (prefixes and suffixes) of the crawled data to obtain the keywords. Finally, the similarity score between two documents is calculated from the extracted keywords. Documents whose similarity score is greater than a predetermined threshold value are considered near duplicates. Experiments using real data sets indicate that the approach surpasses previously proposed algorithms.
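
The preprocessing steps described above (removing HTML tags and JavaScript, then dropping stop words) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation; the stop-word list and the function name `extract_words` are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are", "for", "on", "with"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_words(html):
    """Return the lowercase words of a page with tags, scripts and stop words removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]
```

Stemming and keyword counting would then be applied to the returned word list.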

2. LITERATURE SURVEY
A literature survey is the most important step in the software development process. Before a new tool is developed, the economic feasibility and the strength of the company must be established. Once these are settled, the next step is to determine which operating system and language can be used to develop the tool [9]. Once work on the software begins, the developers need considerable outside support, which can be obtained from senior developers, from books, or from websites. These considerations were taken into account in developing the proposed system.

2.1. Existing Methodology
The two main types of crawling are generic and focused. Generic crawlers crawl documents and links on diverse topics, whereas focused crawlers limit the number of pages they visit with the help of previously obtained specialized knowledge. Web crawlers build repositories of web pages that serve as input to systems that index, mine, and otherwise analyze pages (for example, search engines) [10]. The existence of near-duplicate data is a serious problem that accompanies the growth of the Internet and the need to integrate heterogeneous data. Near-duplicate data are not bitwise identical, but they bear a striking similarity to one another. Web search engines face serious problems because of duplicate and near-duplicate web pages: such pages increase the index storage space, or slow down the service or increase its cost, and thereby irritate users. Algorithms that recognize such pages are therefore indispensable [11].

Web crawling issues such as freshness and efficient use of resources have been addressed in the past. Recently, the elimination of duplicate and near-duplicate documents on the Internet has become a major concern and has attracted significant research. Beyond its research significance, such a method could prove valuable to law enforcement in handling cybercrime, since each individual offense is very small. For example, the method identifies two distinct clusters, each containing more than 100 fake escrow sites. In addition, it can reduce the workload of investigators by prioritizing which criminals to investigate. There are several promising directions for continuing this kind of work [12]. First, the accuracy of the HYIP clustering can be improved. Second, it would be interesting to compare the combined clustering with other areas where clustering has already been tried, such as phishing sites and spam-advertised storefronts. Finally, additional input features such as WHOIS registration and image details could be considered.

3. PROPOSED SYSTEM
The proposed approach detects near-duplicate web documents from their keywords. First, the crawled web documents are parsed to extract the keywords. Parsing removes HTML tags, JavaScript, and common words (stop words), and stems the remaining terms. The extracted keywords and their counts are stored in a table to ease the detection process. The keywords are stored in the table in such a way that the search space for near-duplicate detection is reduced. The similarity score of the current document against each document in the repository is computed from the lists of keywords of the pages. Documents whose similarity score is greater than the threshold are considered near duplicates.

In this paper, we present a novel and efficient approach for identifying near-duplicate pages in web crawling. Near-duplicate web pages are detected before the crawled pages are stored in the repositories [13]. First, keywords are extracted from the crawled pages, and the similarity score between two pages is calculated from the extracted keywords. Documents whose similarity score exceeds the threshold are considered near duplicates.

3.1. Study of System & Appropriate Implementation
In this phase the feasibility (appropriate implementation) of the project is examined, and a business proposal is put forth with a general plan for the project and some cost estimates. A feasibility study of the proposed system is carried out during system analysis. This is to
ensure that the proposed system will not be a burden for the company. Feasibility analysis requires some understanding of the major requirements of the system. The three most important considerations in the feasibility analysis are as follows:

a. Economic feasibility
This study is carried out to check the economic impact that the system will have on the company. The amount of funding that the company can pour into research and development of the system is limited, so the costs have to be justified. The developed system stays within the budget because most of the technologies used are freely available; only customized products had to be purchased.

b. Technical feasibility
This study checks the technical feasibility, that is, the technical requirements of the system. The system should be developed without placing a high demand on the available technical resources, since that would in turn place high demands on the client. The developed system must have modest requirements, so that only minimal or no changes are needed to deploy it.

c. Social feasibility
This aspect of the study checks the level of user acceptance of the system, which includes training users to use the system effectively. Users should not feel threatened by the system but should accept it as a necessity. The level of acceptance depends on the methods used to educate users about the system and to make them familiar with it. Their confidence must be raised so that they can offer constructive criticism, which is welcome because they are the end users of the system.

4. SYSTEM ARCHITECTURE
4.1. Installation & Implementation
Implementation is the stage of the project in which the theoretical design is turned into a working system. It is therefore considered the most important stage in achieving an effective new system and in giving users confidence that the new system will work and be effective. The implementation stage involves careful planning, analysis of the existing system and its constraints on implementation, the design of methods for the changeover, and the evaluation of the changeover (Figure 1).

Figure 1. System Architecture

a. Web document parsing
Information extracted from the crawled documents helps determine the future path of the crawler. Parsing can be as simple as hyperlink/URL extraction or as complex as cleaning the HTML content and analyzing its HTML tags. A parser that traverses the whole Web will inevitably be faced with many
errors. The parser obtains information from a web page while disregarding common words, HTML tags, JavaScript, and other bad characters (Figure 2).

Figure 2. Entering the website in search engine

b. Stemming algorithm
In information retrieval, variant word forms are reduced to a common root by stemming. The underlying premise is that two words that share the same root express the same concept. Thus, terms that have the same meaning but appear morphologically different are identified in an IR system by matching query and document terms with the aid of stemming [14]. Stemming facilitates the reduction of all words that have the same root to a single form. This is accomplished through the removal of the derivational and inflectional suffixes of each word [15]. For example, "connect", "connected", and "connection" are all condensed to "connect".
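
As a rough illustration of suffix stripping, the sketch below removes one suffix from a short, hand-picked list. It is not the Porter stemmer and not the authors' rule set; the suffix list, the `min_stem` parameter, and the function name `stem` are assumptions chosen so that the example words in the text reduce to "connect".

```python
# Illustrative suffix list, longest first; not the Porter stemmer or the authors' rules.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

A production system would more likely rely on an established stemmer, since a naive list like this one over-strips some words and misses many suffixes.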

c. Keyword representation
As a result of stemming, we have the distinct keywords of each crawled web page together with their counts. These keywords are represented in a form that eases the detection process and reduces the search space for detecting near duplicates. First, the keywords and the number of times they appear on the web page are sorted in descending order of count. After that, the N keywords with the highest counts are stored in one table, and the remaining keywords are indexed and stored in another table. In this approach, the value of N is set to 4. The similarity score between two documents is calculated only if the prime (highest-count) keywords of the two documents are the same (Figure 3). The search space for near-duplicate detection is thereby reduced.
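
The two-table representation can be sketched as follows, assuming the keywords of a page are available as a list of stemmed words. The function name `keyword_tables` and the alphabetical tie-breaking are assumptions for the example; only N = 4 comes from the text.

```python
from collections import Counter

N_PRIME = 4  # the text sets N to 4


def keyword_tables(keywords, n=N_PRIME):
    """Split keyword counts into a prime table (top n) and a remainder table."""
    counts = Counter(keywords)
    # Sort by descending count; ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    prime = dict(ranked[:n])
    rest = dict(ranked[n:])
    return prime, rest
```

Two pages are then compared in full only when their prime tables contain the same keywords, which is what keeps the search space small.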

Figure 3. Checking with other keywords

d. Similarity Score Calculation
If the prime keywords of the new web page do not match the prime keywords of any page in the table, the new web page is added to the repository. If all the keywords of the two pages are the same, the new page is considered a duplicate and is therefore not included in the repository. If the prime keywords of the new page are the same as those of a page in the repository, the similarity score between the two documents is calculated (Table 1).
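
The three cases above can be sketched as a small decision function. This is a hedged illustration rather than the authors' procedure: the function name `classify_page`, the data layout, the return labels, and the `score` callable are all hypothetical, and "all the keywords are the same" is read here as the two pages having identical keyword sets.

```python
def classify_page(new_prime, new_all, repository, score, threshold):
    """Decide what to do with a newly crawled page.

    new_prime:  set of the page's prime keywords
    new_all:    dict mapping every keyword of the page to its count
    repository: list of (prime_set, all_counts) pairs for stored pages
    score:      callable(all_counts_a, all_counts_b) -> float (assumed, supplied by caller)
    Returns "add", "duplicate" or "near-duplicate".
    """
    for stored_prime, stored_all in repository:
        if stored_prime != new_prime:
            continue  # prime keywords differ: no score is computed for this pair
        if stored_all.keys() == new_all.keys():
            return "duplicate"  # same keyword set: not stored again
        if score(new_all, stored_all) > threshold:
            return "near-duplicate"
    return "add"
```

Only pages classified as "add" would be written to the repository under this reading.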

Table 1. Calculating the similarity of two web documents
T1   K1   K2   K4   K5   ......   Kn
     C1   C2   C4   C5   ......   Cn
T2   K1   K3   K2   K4   ......   Kn
     C1   C3   C2   C4   ......   Cn

Let T1 and T2 be the tables containing the extracted keywords (K) and their corresponding counts (C). The keywords in the tables are considered individually in the calculation of the similarity score. If a keyword is present in both tables, its similarity score is calculated as shown in Figure 4 and Figure 5.

Figure 4. (A) Exact result analysis of content similarities in websites
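
The text refers to figures for the score formula, so the formula itself is not reproduced here. As a stand-in only, the sketch below computes a generalized Jaccard (Ruzicka) similarity between two keyword-count tables and expresses it as a percentage. This is a swapped-in, well-known measure and not the authors' formula; the function name `similarity_percent` is hypothetical.

```python
def similarity_percent(t1, t2):
    """Generalized Jaccard (Ruzicka) similarity of two keyword-count tables, in percent.

    t1, t2: dicts mapping keyword -> count. This is a stand-in measure, not the
    formula used in the paper.
    """
    keys = set(t1) | set(t2)
    shared = sum(min(t1.get(k, 0), t2.get(k, 0)) for k in keys)
    total = sum(max(t1.get(k, 0), t2.get(k, 0)) for k in keys)
    if total == 0:
        return 0.0  # both tables empty (or all counts zero)
    return 100.0 * shared / total
```

A percentage computed this way could be compared with the threshold to decide whether two pages are near duplicates.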

Figure 5. (B) Examining in Percentage of Similarity

e. Near-duplicate web document detection
Near duplicates are not considered "exact duplicates"; they are documents with minute differences. Typographical errors; versioned, mirrored, or plagiarized documents; multiple representations of the same physical object; spam emails generated from the same template; and a number of similar cases may result in near-duplicate data. According to surveys, a considerable percentage of web pages are near duplicates. These surveys suggest that near duplicates make up roughly 1.7% to 7% of the web pages traversed by crawlers. The steps involved in the method are outlined in the following paragraphs (Figure 6).

Figure 6. Deleting the Duplicate Content

A quality output is one that meets the needs of the end user and presents information clearly. In any system, the results of processing are communicated to users and to other systems through outputs. Output design determines how the information is to be displayed for immediate need and how it is to be produced as hard-copy output. The output is the most important and most direct source of information for the user. Efficient and intelligent output design improves the system's ability to help the user in decision making.

5. CONS OF THE SYSTEM
• The design of computer output must proceed in an organized, well-thought-out manner; the right output must be developed while ensuring that each output element is designed so that people find the system convenient and efficient to use. When analyzing the design of computer output, the designers should identify the specific output that is needed to meet the requirements.
• Select methods for presenting information.
• Create documents, reports, or other formats that contain the information produced by the system.
The output form of an information system needs to accomplish one or more of the following objectives: convey information about past activities, the current state, or projections of the future; report significant events, opportunities, problems, or warnings; trigger an action; or confirm an action.

6. CONCLUSION
In this paper we observed that the web stores information in various forms, including structured and unstructured data, which exhibit its dynamic nature, and that similarity among website contents affects data retrieval. Similar web page documents pose serious difficulties for search engines in determining which websites are real and which are fake. As a result, the detection of similar and identical content using web mining techniques has gained more attention in recent years. In this paper we proposed a new methodology that combines and evaluates clustering methods to automatically link websites created by semi-automated scams. The simulation and evaluation results show a higher accuracy rate than the GPC clustering approach. The method can also be used for large scams such as phishing that rely on many copies of the same content. In particular, we applied it to two scams: HYIPs and fake escrow websites. Future work can extend the approach to reduce the memory space required for larger web repositories, which would improve search engine quality.

REFERENCES
[1] S. Acharya, et al., “Selectivity estimation in spatial databases,” in SIGMOD, pp. 13–24, 1999.
[2] S. Alsubaiee, et al., “Supporting location-based approximate-keyword queries,” in GIS, pp. 61–70, 2010.
[3] A. Arasu, et al., “Incorporating string transformations in record matching,” in SIGMOD, pp. 1231–1234, 2008.
[4] A. Arasu, et al., “Efficient exact set-similarity joins,” in VLDB, pp. 918–929, 2006.
[5] N. Beckmann, et al., “The R*-tree: an efficient and robust access method for points and rectangles,” in SIGMOD, pp. 322–331, 1990.
[6] A. Z. Broder, et al., “Min-wise independent permutations (extended abstract),” in STOC, pp. 327–336, 1998.
[7] X. Cao, et al., “Retrieving top-k prestige-based relevant spatial web objects,” Proc. VLDB Endow., vol. 3, pp. 373–384, 2010.
[8] K. Chakrabarti, et al., “An efficient filter for approximate membership checking,” in SIGMOD, pp. 805–818, 2008.
[9] S. Chaudhuri, et al., “Robust and efficient fuzzy match for online data cleaning,” in SIGMOD, pp. 313–324, 2003.
[10] S. Chaudhuri, et al., “Selectivity estimation for string predicates: Overcoming the underestimation problem,” in ICDE, pp. 227–238, 2004.
[11] S. Chaudhuri, et al., “A primitive operator for similarity joins in data cleaning,” in ICDE, pp. 5–16, 2006.
[12] E. Cohen, “Size-estimation framework with applications to transitive closure and reachability,” Journal of Computer and System Sciences, vol/issue: 55(3), pp. 441–453, 1997.
[13] G. Cong, et al., “Efficient retrieval of the top-k most relevant spatial web objects,” PVLDB, vol/issue: 2(1), pp. 337–348, 2009.
[14] G. Li, et al., “Supporting search-as-you-type using sql in databases,” TKDE, 2011.
[15] A. Mazeika, et al., “Estimating the selectivity of approximate string queries,” ACM TODS, vol/issue: 32(2), pp. 12–52, 2007.