Indonesian Journal of Electrical Engineering and Computer Science
Vol. 13, No. 2, February 2019, pp. 492~498
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v13.i2.pp492-498
Journal homepage: http://iaescore.com/journals/index.php/ijeecs

Focused crawling from the basic approach to context aware notification architecture

Venugopal Boppana, Sandhya P
School of Computing Science and Engineering, Vellore Institute of Technology, Chennai Campus, India

Article Info
Article history:
Received Jul 7, 2018
Revised Oct 4, 2018
Accepted Nov 18, 2018

ABSTRACT
The large and wide-ranging volume of information on the web has made it difficult for crawlers and search engines to extract related information. This paper discusses focused crawlers, also called topic-specific crawlers, and the variations of focused crawlers that lead to a distributed architecture, i.e., the context aware notification architecture. To get the relevant pages from the huge amount of information available on the internet, we use the focused crawler. It can bring out the relevant pages for a given topic with fewer searches and in a short time. Here the input to the focused crawler is a topic specified using exemplary documents rather than keywords. Focused crawlers avoid searching all web documents; instead, they search over the links that are relevant to the crawler boundary. The focused crawling mechanism saves CPU time to a large extent and helps keep the crawl up to date.

Keywords:
Complex event processing
Focused crawler
Topic specific crawler

Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Venugopal Boppana,
School of Computing Science and Engineering,
Vellore Institute of Technology, Chennai Campus, Chennai, India.
Email: srees.boppana@gmail.com

1. INTRODUCTION
Nowadays, most of the latest information is available to us on the internet, but the greatest challenge is to get the information that is relevant to a given topic. A search can also end up extracting irrelevant information from the web. The classical crawler performs this type of extraction, i.e., it extracts both relevant and irrelevant data, which wastes CPU time, memory and other resources to a large extent. The classical crawler follows the breadth-first mechanism, searching all the links of a single parent; those links may lead to irrelevant data along with the relevant data.

To resolve the above challenges of time, space, resources and irrelevant data, topic-specific crawlers, or focused crawlers, were designed and introduced. They are much better than the classical crawler at producing accurate data for the given topic. A topic-specific crawler avoids searching the entire web and instead searches only a specific area of the web. This crawler follows the mechanism of depth-first search. The working of the focused crawler is divided into two steps. In the first step, irrelevant data is separated from the relevant data. The second step is selecting the seed page URL, which helps in finding the next child nodes, i.e., the next links to relevant pages. The focused crawler reduces the time needed to crawl, reduces the memory needed to store the crawled or visited pages, and decreases the amount of irrelevant data. This is a great improvement over the classical crawler.

The classical focused crawler and the learning focused crawler are the two sub-types of the focused crawler. Classical focused crawlers are given a predefined set of rules to pick the relevant pages for the given topic. A learning crawler updates the crawling links by learning from a training set, and this training set is updated regularly.

2. THE CLASSIFICATION OF THE FOCUSED CRAWLER
As shown in Figure 1, the focused crawler has two subdivisions: (i) the learning focused crawler and (ii) the classical focused crawler. The learning focused crawler has two subdivisions: (i) the ANN based classifier and (ii) the feedback method. The classical focused crawler has two subdivisions: (i) the semantic crawler and (ii) the social semantic crawler. The semantic crawler has four subdivisions: (i) the ontology and focused crawler model, (ii) the context based approach for relevance, (iii) the ontology based crawler and (iv) the FCA based approach. The social semantic crawler has two subdivisions: (i) the tag based approach, which crawls profile pages, and (ii) the ontology based approach, which crawls ontology web resources.

Figure 1. Classification of the focused crawler

3. LITERATURE SURVEY
Chakrabarti et al. [1] introduced the focused crawler. The first focused crawler was designed around the hypertext structure. The two important components of the crawler are the classifier and the distiller: the classifier separates the relevant pages from the irrelevant pages, and the distiller finds the seed URL. The seed URL leads to other relevant pages, and only a good seed URL yields a good number of relevant links. This focused crawler has been shown to give better results than the classical crawler, bringing out more relevant pages. However, the focused crawler behaves like a classical crawler if the seed URL is not selected accurately or if the training set is not sufficient.

In H. Zhang et al. [2], classification was done using an Artificial Neural Network (ANN) designed using the domain ontology. The methodology in this paper consists of three stages: data preparation, training, and crawling. The ANN is used in the training stage.

S. Chakrabarti et al. [3] discussed another version of the learning focused crawler. The methodology in this paper uses two classifiers instead of a single classifier, named the critic classifier and the apprentice classifier. The first classifier is used to collect the feedback, and the second classifier is used to train the basic classifier based on that feedback. The performance of retrieving relevant pages is thus improved by adding another classifier that trains the basic classifier using the feedback obtained. However, the features used by the second classifier to train the basic classifier are fully dependent features, and this dependence greatly reduces the performance of the classifier.

The above three papers discussed the learning focused crawler. We now move to semantic focused crawlers, which compare the meaning of the visited pages with the search topic. If the meaning of the search topic matches the page, the page is considered relevant; otherwise it is considered irrelevant. To compare the crawled page with the searched topic we use an ontology, which maintains the words related to the searched topic. The ontology widens the search area, which leads to more relevant pages and fewer irrelevant pages. Using the ontology, we can easily compute the relevance of the visited page, so the ontology plays an important role in relevance calculation. To extract the pages that semantically match the search topic, we can use the ontology together with the focused crawler.
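
The relevance computation described above can be sketched as follows. This is only a minimal illustration, assuming the ontology is reduced to a dictionary that maps the topic term and its related terms to weights. The names `ONTOLOGY`, `relevance` and `is_relevant`, the weights, and the threshold of 0.1 are hypothetical and are not taken from the surveyed papers.

```python
import re

# Hypothetical toy ontology: the topic term plus related terms with weights.
ONTOLOGY = {"crawler": 1.0, "spider": 0.8, "indexing": 0.5, "search": 0.5}


def relevance(text, ontology=ONTOLOGY):
    """Average ontology weight per word of `text` (0.0 for empty text)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    # Words that are not in the ontology contribute a weight of zero.
    return sum(ontology.get(w, 0.0) for w in words) / len(words)


def is_relevant(text, threshold=0.1):
    """A page counts as relevant when its score reaches the assumed threshold."""
    return relevance(text) >= threshold
```

A real ontology would also carry relations between concepts; the flat dictionary is the simplest form that still shows how related words widen the match.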

M. Ehrig et al. [4] introduced the focused crawler with an ontology and described its working. The paper proposed a methodology containing two cycles: the ontology cycle and the crawling cycle. The user query contains a number of keywords, and the relationships between these keywords are established in the first cycle. In the second cycle, visited pages are collected based on the keywords given by the user. The authors adopted the breadth-first mechanism to crawl the relevant pages. The disadvantage of this approach is that it can decide whether a particular page is relevant or irrelevant only after the complete page has been downloaded, which wastes resources.

Next, we shift our focus to the social semantic focused crawlers. These crawlers use social websites to get more relevant pages by learning the user profile and preferences. Focused crawling using the social network has been shown to give better results. This crawling mechanism uses the knowledge of many experts located in many places. Because it brings the knowledge of many experts together in one place, it gives rise to many relevant pages for the searched topic. To reduce the effort of the user in searching for a relevant page, tagging was introduced in the social network. This reduces resource usage and brings out more relevant pages than irrelevant pages. A social semantic focused crawler combines semantic knowledge and the social network to get more relevant pages and links. The first approach combining focused crawlers with the social network and tagging was given by Z. Zhang et al. [5], where the focused crawling is based on the profile page.

Nidhi Singh et al. [6] showed that topic classification is possible using the very small amount of text available in the URL. Instead of looking at the entire web page, the page can be classified by topic using only the text in its URL. The paper introduced an online incremental learning algorithm to classify URLs.
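
To make the idea of URL-only classification with online updates concrete, here is a small sketch. It uses a simple perceptron over URL tokens as a stand-in; it is not the algorithm of [6], and the class name `UrlPerceptron` and the example URLs are hypothetical.

```python
import re
from collections import defaultdict


class UrlPerceptron:
    """Online binary classifier over URL tokens (perceptron update rule)."""

    def __init__(self):
        self.weights = defaultdict(float)

    def tokens(self, url):
        # Split the URL on anything that is not a letter or a digit.
        return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

    def predict(self, url):
        """True if the URL is predicted to be on topic."""
        return sum(self.weights[t] for t in self.tokens(url)) > 0

    def update(self, url, on_topic):
        """Learn from one labelled URL; weights change only on a mistake."""
        if self.predict(url) != on_topic:
            delta = 1.0 if on_topic else -1.0
            for t in self.tokens(url):
                self.weights[t] += delta
```

Because each labelled URL is processed once and then discarded, the model can keep learning while the crawl is running, which is the property the incremental setting relies on.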

A new traversal framework for focused crawling, which increases recall, has been proposed by Siti Maimunah, Husni S Sastramihardja, Dwi H Widyantoro and Kuspriyanto [7]. Conventional focused crawlers were only able to reach relevant web documents that are connected directly. This is not sufficient, as there may exist other web documents that are linked to the relevant web documents already identified. The proposed framework makes it possible to reach them.

Weng J, Lim E-P, Jiang J and He Q [8] proposed TwitterRank, an extension of the PageRank algorithm. This algorithm measures the topic-sensitive influence of twitterers. The proposed architecture performs topic distillation, constructs a topic-specific relationship network, and finally provides ranks based on topic sensitivity.

An Event Focused Crawling (EFC) architecture has been proposed by Farag, M. M. G. and E. A. Fox [9]. This crawler is used to retrieve highly relevant web pages that are similar to the seed URLs selected by the curator. The paper explains how a focused crawler can be used to build an event model.

Akyol, Mehmet Ali, et al. [10] discussed a distributed architecture in which a distributed focused crawler and distributed complex event processing are combined to identify the context of the users and notify them accordingly. The distributed focused crawler can be used to crawl the web sites obtained from various data sources. Here the distributed crawler is used to serve many users, and its results are delivered to the users based on their context.

4. FOCUSED CRAWLERS
Web crawlers are also known as bots, spiders, etc. A web crawler forms a structure of web pages and URLs based on the user query: based on the keywords in the query, it searches for URLs and produces the relevant pages. The focused crawler is the advanced and improved version of the web crawler. Based on the search topic given by the user, a focused crawler finds the seed URL and then, starting from the seed URL, searches for the relevant pages. The main aim of the focused crawler is to reduce the percentage of irrelevant pages among the total number of searched pages and to increase the percentage of relevant pages among the total number of fetched pages. Focused crawling has two main divisions, (i) the classic focused crawler and (ii) the learning focused crawler, as shown in Table 1.
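
The aim stated above, the share of relevant pages among the fetched pages, is commonly called the harvest rate in the focused crawling literature, although this paper does not name it. The short sketch below shows the calculation; the function name `harvest_rate` and its input format are assumptions made for illustration.

```python
def harvest_rate(fetched_relevance):
    """Fraction of fetched pages that were judged relevant.

    `fetched_relevance` holds one boolean per fetched page
    (True for relevant, False for irrelevant).
    """
    if not fetched_relevance:
        return 0.0
    # True counts as 1 and False as 0 when summed.
    return sum(fetched_relevance) / len(fetched_relevance)
```

For example, a crawl that fetched four pages of which three were relevant has a harvest rate of 0.75.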

4.1. Classic focused crawler
The classical focused crawler is again divided into two crawlers: (i) the Social Semantic Focused Crawler and (ii) the Semantic Focused Crawler. They are distinguished by two criteria: the first is the crawling area, and the second is the method followed to check the relevance of the fetched page. Computing page relevance is almost the same as computing the relevance of a hypertext document. The crawler maintains a queue to collect all the fetched pages and URLs. The pages are arranged according to their priority and ranking, so this queue is called the priority queue. The page priority criterion is used to check the relevance of a page, and it works similarly to the distiller. The distiller can extract the relevant pages by detecting good access points, and to find a good access point the crawler needs to identify the good hypertext nodes. Depending on the application, the crawler may maintain several priority queues. If the crawler finds an irrelevant page, that link is not included in the queue: the crawler stops searching from that link and searches for other links that lead to relevant pages. This is the major difference between the classical focused crawler and the generic crawler. Many search engines use generic crawlers. After the crawler reaches the required number of relevant pages, or if the time limit is exceeded, the crawler stops searching and returns the result to the user.
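
The queue-driven behaviour described above can be sketched as a best-first traversal. The sketch assumes the web is given as an in-memory dictionary of links and that relevance is supplied by a scoring function; `focused_crawl`, `links`, `score`, `threshold` and `max_relevant` are hypothetical names, and the time limit mentioned in the text is left out for brevity.

```python
import heapq


def focused_crawl(seeds, links, score, threshold=0.5, max_relevant=10):
    """Best-first crawl over an in-memory link graph.

    links: dict mapping a URL to its outgoing URLs (stands in for fetching).
    score: function from URL to relevance in [0, 1] (stands in for a classifier).
    Seed URLs are assumed to be relevant.
    """
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    while frontier and len(relevant) < max_relevant:
        _, url = heapq.heappop(frontier)
        relevant.append(url)
        for child in links.get(url, []):
            if child in seen:
                continue
            seen.add(child)
            s = score(child)
            if s >= threshold:  # irrelevant links never enter the queue
                heapq.heappush(frontier, (-s, child))
    return relevant
```

Dropping a low-scoring link at discovery time also prunes everything reachable only through it, which is where the saving over a generic crawler comes from, at the cost of missing relevant pages hidden behind an irrelevant one.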

4.2. Learning focused crawler
The second type of focused crawler is the learning crawler. These crawlers work from a training set: they use feedback obtained with the training set to update the crawling links, which leads to a larger number of relevant pages. A group of sample pages related to the searched topic is taken as the training dataset. This training set helps in distinguishing relevant from irrelevant pages. Learning crawlers follow various methods, among them the Bayesian classifier and the Hidden Markov Model. To compute the distance between a crawled page and the set of training pages, we can use context graphs.
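
As one concrete instance of the Bayesian classifier mentioned above, the following sketch trains a small multinomial naive Bayes model on sample pages and uses it to label new pages. It is only an illustration under simplifying assumptions (whitespace tokenisation, two fixed labels, add-one smoothing); the class name `NaiveBayes` and the labels are hypothetical.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over word counts."""

    def __init__(self):
        self.word_counts = {"relevant": Counter(), "irrelevant": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Add one labelled sample page to the training set."""
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def predict(self, text):
        vocab = set(self.word_counts["relevant"]) | set(self.word_counts["irrelevant"])
        total_docs = sum(self.doc_counts.values())
        best_label, best_logp = None, -math.inf
        for label, counts in self.word_counts.items():
            # Smoothed log prior plus smoothed log likelihood of each known word.
            logp = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            denom = sum(counts.values()) + len(vocab)
            for word in text.lower().split():
                if word in vocab:
                    logp += math.log((counts[word] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label
```

Calling `train` again with newly judged pages corresponds to the regular update of the training set described in the text.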

4.3. Semantic and Social Semantic Focused Crawler
This section deals with the design of the semantic and social semantic focused crawlers. They crawl different types of web areas using different approaches. The Focused Crawling based on Human Cognition (FCHC) approach extracts data related to relevant pages from the bookmarks given by the user. It maintains a number of related or similar words for each keyword. Once the user has given the search topic, the crawler can use these similar words to easily extract the web links of relevant data. The crawl follows one of two patterns: the Breadth-First Pattern (BFP) or the Depth-First Pattern (DFP). The other variation of the focused crawler is the semantic focused crawler, also called dynamic semantic relevance (DSR) crawling, which arranges the visited pages in priority order.

4.4. Focused Crawler using Human Cognition (FCHC)
For the focused crawler to produce the best results, i.e., more relevant pages with a minimum of link searches, the choice of the seed URLs is very important, because the seed URL helps to find the other relevant links. Because focused crawling works efficiently, it can be applied to social media, where a large number of bookmarked pages can be obtained based on user interest. Social media provides input from many people with varied interests located in various places around the world. For many years, researchers have been studying how to link social media data with web pages, as this can help to bring out more relevant pages with less time and fewer resources. The main components of FCHC are:
a) Selection of seed URL
For every query given by the user, the search engine produces n web URLs. From those URLs, the top-priority URLs are taken as the seed URLs, which lead to many relevant URLs. Selecting the seed URL according to the given topic can make a large difference to the result of the crawler: a good seed URL helps the crawler to produce the best result. The crawler can select more than one seed URL, so that the search area is wider rather than confined to a narrow direction.
b) Crawling area
The crawling area can be any site containing the data or any site maintaining bookmarks of the pages.
c) Page relevance criterion
While crawling, the crawler matches the given topic against each visited page to check whether the page is relevant or irrelevant.
d) Page priority criterion
A systematic search pattern motivated by human cognition is used as the priority criterion for the crawler. The two search patterns are the breadth-first pattern (BFP) and the depth-first pattern (DFP).
e) Termination criterion
The crawler stops crawling on either of two conditions. The first condition is that the number of crawled URLs exceeds the limit, and the second condition is that the priority queue is empty.

4.4.1 FCHC Searching Patterns
The pages can be searched with either the breadth-first pattern (BFP) or the depth-first pattern (DFP) to get the relevant pages. BFP works as follows. First, all the users who tagged the seed page are placed in the queue. The crawler then visits all the pages tagged by these users, in order to reach the resources of interest. When the crawler finds relevant pages, those pages are stored in a queue. Most crawlers work with BFP rather than DFP. DFP parses the pages slightly differently. Instead of visiting the pages of all users at the same time, one user is picked first and parsing proceeds from that user until the crawler reaches the resource of interest. The crawler then starts parsing again from the next user until it reaches the resource of interest. This procedure continues until all the users have been processed.
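
The difference between the two patterns can be sketched with a single traversal routine that treats the frontier either as a queue (BFP) or as a stack (DFP). The graph layout, in which the seed page points to the users who tagged it and each user points to tagged pages, is an assumption made for illustration, as is the function name `traverse`.

```python
from collections import deque


def traverse(seed, neighbors, pattern="BFP"):
    """Return the visit order under the BFP or DFP search pattern.

    neighbors: dict mapping a node (page or user) to the nodes it leads to.
    Nodes are marked as seen when first discovered, so each is visited once.
    """
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # BFP takes the oldest entry (queue); DFP takes the newest (stack).
        node = frontier.popleft() if pattern == "BFP" else frontier.pop()
        order.append(node)
        children = neighbors.get(node, [])
        if pattern == "DFP":
            # Push in reverse so that the first child is explored first.
            children = reversed(children)
        for child in children:
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order
```

With two users who tagged the seed page, BFP visits both users before any of their pages, while DFP finishes the first user's pages before moving on to the second user.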

4.5. DSR based Semantic Focused Crawler
The DSR crawler fetches the topic-relevant pages from a particular area using multithreading. To get more relevant pages on a given topic we can use a domain ontology. The domain ontology is used to expand the topic, mostly for educational purposes.

4.5.1 DSR based Semantic Focused Crawler Framework
To design an efficient DSR crawler we need the following components: a domain ontology, a local database, a priority queue, and the proposed multithreaded Semantic Focused Crawler (SFC). The SFC picks the web page that can direct the crawler to many other relevant pages. Generally, the SFC selects the top-rated URL, which is obtained from the priority queue. To parse the web, a number of parallel threads are created, so that many hyperlinks can be obtained at the same time. These hyperlinks are added to the queue and are used to parse the web further to get more relevant web pages. The order in which URLs are parsed also plays an important role, and this order is determined by the priority queue. The crawler should avoid visiting the same page repeatedly. To avoid this, another queue is maintained which stores the visited pages. The hyperlinks of the relevant pages are stored in a separate database for later use.
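
The interplay of the priority queue, the parallel threads and the record of visited pages can be sketched as below. This is a simplified illustration and not the framework's actual implementation: page fetching is replaced by a caller-supplied function, the local database is a plain list, and the names `crawl_parallel`, `fetch_links` and `score` are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def crawl_parallel(seeds, fetch_links, score, workers=4, max_pages=100):
    """Batch-parallel crawl: the main thread owns the queue and visited set."""
    frontier = [(-score(u), u) for u in seeds]
    heapq.heapify(frontier)
    visited = set(seeds)  # prevents parsing the same page twice
    store = []            # stands in for the local database
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(store) < max_pages:
            room = min(workers, len(frontier), max_pages - len(store))
            # Take the top-rated URLs from the priority queue.
            batch = [heapq.heappop(frontier)[1] for _ in range(room)]
            # Worker threads parse the batch concurrently; map keeps input order.
            for url, links in zip(batch, pool.map(fetch_links, batch)):
                store.append(url)
                for link in links:
                    if link not in visited:
                        visited.add(link)
                        heapq.heappush(frontier, (-score(link), link))
    return store
```

Keeping the queue and the visited set on the main thread avoids the need for locks; only the parsing step, which dominates the running time in a real crawl, runs in parallel.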

4.6. Comprehensive Traversal focused Crawler
Conventional focused crawlers follow the top-down approach to get the topic-specific web documents, which is useful when there is only one link that is topically specific. But if another relevant document links to the root node, the crawler cannot go back to reach it, and because of this the recall is low. The comprehensive traversal framework [7] has been proposed to address this, and it is reported to improve the recall of the crawl. To achieve this, a lexicon list is prepared so that document relevance can be assessed from the local ontology.

4.7. Event Focused Crawler
This is an architecture in which event modelling is done using the Event Focused Crawler. Events are recognised and represented based on their context and type. The context is when and where the event happened, and the type is what happened. The event model can be used to prepare a list of seed URLs based on the events, and it also allows analysis to be done on event collections.

4.8. The Context Focused Crawler
This crawler lets the user query the search engine for pages that have a link to a particular document. The mechanism that makes this possible is the Context Focused Crawler (CFC). The query helps to construct a context graph of the pages that lie within a minimum distance of the URL of the page given by the user. This minimum distance is decided based on the application. Here, the distance is the number of links needed to reach the relevant page from the page URL given by the user. The constructed structure can be used to train the classifier. The classifier then divides the pages according to the topic, and this division is based on the distance traversed by the crawler to reach the target document. There are two stages in the context focused crawler:
1) An initialization phase: a context graph and the associated classifiers are constructed for every seed document.
2) A crawling phase: the crawler uses the classifiers to traverse the web and reach the relevant documents. The context graph is updated based on the links found.
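
The initialization phase can be illustrated by building the layers of a context graph from backlink information. The sketch assumes that backlinks are already available as a dictionary, which stands in for querying a search engine for pages that link to a document; the names `build_context_graph`, `backlinks` and `max_depth` are hypothetical.

```python
def build_context_graph(target, backlinks, max_depth=2):
    """Return {url: layer}, where layer is the link distance to `target`.

    backlinks: dict mapping a URL to the URLs of pages that link to it.
    max_depth: the application-specific limit on the distance.
    """
    layers = {target: 0}
    current = [target]
    for depth in range(1, max_depth + 1):
        next_level = []
        for url in current:
            for parent in backlinks.get(url, []):
                if parent not in layers:  # keep only the minimum distance
                    layers[parent] = depth
                    next_level.append(parent)
        current = next_level
    return layers
```

The layer numbers are the labels on which a classifier per layer could then be trained, so that during the crawling phase a page can be assigned an estimated distance to the target.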

Table 1. Summary of Crawlers

Crawler Name | Description
Classic focused crawler | This crawler maintains the queue to collect all the fetched pages and URLs.
Learning focused crawler | Takes the feedback using the training set to update the crawling links, which leads to a larger number of relevant pages.
Focused Crawling based on Human Cognition (FCHC) crawling | This approach extracts the data related to relevant pages from the bookmarks given by the user.
Focused Crawler using Human Cognition (FCHC) | Extracts a large number of bookmarked pages based on the user interest.
FCHC Searching Patterns | Uses both the breadth-first pattern (BFP) and the depth-first pattern (DFP) to search the relevant pages.
DSR based Semantic Focused Crawler | Fetches the topic-relevant pages from the particular area using the multithreading concept. Uses the domain ontology.
Comprehensive Traversal focused Crawler | Uses the top-down approach in order to get the topic-specific web documents.
Event Focused Crawler | To extract the relevant pages, event modelling analysis is used.
The Context Focused Crawler | Helps to query the search engine for pages that have a link with a particular document. This query helps to construct a context graph of pages.

5. A FOCUSED CRAWLER IN CONTEXT AWARE NOTIFICATION ARCHITECTURE
In a pull-based system the user may miss some important information or may not get the updated information. This can be resolved by using the push-based notification technique, which can be achieved by introducing the focused crawler into context aware notification. Using this technique, the user receives the latest information based on the contexts that the user has specified. The biggest advantage of push-based notification is that users get the latest information without having to query continuously. The user first needs to specify the topic of interest and the context. Based on this interest, the focused crawler sends notifications of the latest information about that particular topic. The notification is also sent according to the context, i.e., location, time, etc.

The context can be divided into two categories: external and internal. Information about place, temperature, light, sound and air pressure can be obtained using sensors, and this type of information is the external context. The internal context consists of the user preferences.

To achieve better results, a distributed architecture needs to be designed to traverse the required URLs and to send notifications to the user based on both the internal and the external context. The user can receive notifications via SMS, chat-bot messages or email. The framework needs to be designed so that it allows the user to specify the context in which to receive the information: the user can specify the time, the location and the notification method. Figure 2 shows the architecture of the conceptual framework.

Figure 2. Architecture of Conceptual framework

The data sources are Facebook, Twitter and websites, and the context data considered are the time, location, keywords and preferences of the user. The context data and the data sources are maintained on a distributed messaging queue. Based on the information and context specified by the user, the distributed processing engine uses the data stored in the cloud application server to send notifications to the users by SMS, email or chat bot. This distributed processing engine consists of a distributed crawler and a distributed complex event processing (CEP) engine.
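
The matching step between crawled items and user-specified context can be sketched as follows. This is a single-process illustration only; it does not model the distributed messaging queue or the CEP engine, and the names `Subscription`, `Item` and `notifications`, as well as the fields chosen to represent context, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    user: str
    topic: str      # internal context: what the user is interested in
    location: str   # external context: where the user is
    hours: range    # hours of the day in which notifications are accepted
    channel: str    # "sms", "email" or "chatbot"


@dataclass
class Item:
    topic: str
    location: str
    title: str


def notifications(item, hour, subscriptions):
    """Return (user, channel, title) for every subscription the item matches."""
    return [
        (s.user, s.channel, item.title)
        for s in subscriptions
        if s.topic == item.topic
        and s.location == item.location
        and hour in s.hours
    ]
```

In the architecture described above, the items would arrive from the distributed crawler and the returned tuples would be handed to the SMS, email or chat-bot sender selected by each user.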

6. CONCLUSION
The focused crawler solves many problems of the generic crawler and helps to get more relevant pages with a minimum number of traversals. We have given an overview of many versions of the focused crawler, specifying their advantages and disadvantages. The focused crawler has brought a great change in searching for a given user query. To search the complete web for the relevant pages, more than one focused crawler can be used; this gives fewer irrelevant pages and more relevant pages for the user query. The focused crawlers discussed above use tags and ontologies to get relevant pages and also to expand the area of searching. By using a more efficient tagging method, the performance of the focused crawler can be improved. We can also include context aware notification in the focused crawler to bring out more relevant pages. The performance of the focused crawler can be improved further with machine learning algorithms, which help to compare the web pages with the content posted by the user. To process this content we may use text mining techniques such as feature selection.

REFERENCES
[1] S. Chakrabarti, M. Berg, and B. Dom. Focused Crawling: A New Approach to Topic-specific Web Resource Discovery. Journal of Computer Network. 1999; 31(11-16): 1623-1640.
[2] Z. H. Tao, K. B. Yeong, K. H. Gee. An ontology-based approach to learnable focused crawling. Journal of Information Science. 2008; 178(23): 4512-4522.
[3] S. Chakrabarti, K. Punera, and M. Subramanyam. Accelerated Focused Crawling through Online Relevance Feedback. In Proceedings of 11th International Conference on World Wide Web. 2002; 148-159.
[4] M. Ehrig, and A. Maedche. Ontology-focused crawling of web documents. In Proceedings of ACM Symposium on Applied Computing, pp. 1174-1178, 2003.
[5] Z. Zhang, O. Nasraoui and R. Zwol. Exploiting Tags and Social Profiles to Improve Focused Crawling. International Joint Conferences on Web Intelligence and Intelligent Agent Technology, pp. 136-139, 2009.
[6] Singh, Nidhi, et al. Large scale url-based classification using online incremental learning. 11th International Conference on Machine Learning and Applications (ICMLA), 2012. Vol. 2. IEEE, 2012.
[7] Siti Maimunah, Husni S Sastramihardja, Dwi H Widyantoro, Kuspriyanto. CT-FC: more Comprehensive Traversal Focused Crawler. TELKOMNIKA Telecommunication, Computing, Electronics and Control. Vol. 10, No. 1, March 2012: 189-198. ISSN: 1693-6930.
[8] Weng J, Lim E-P, Jiang J, He Q. TwitterRank: finding topic-sensitive influential twitterers. Proceedings of the Third ACM International Conference on Web Search and Data Mining. New York, USA. 2010; 261-270.
[9] Farag, M. M. G. and E. A. Fox. Building and archiving event web collections: A focused crawler approach. In Bulletin of IEEE Technical Committee on Digital Libraries. 2015; p. 1-2.
[10] Akyol, Mehmet Ali, et al. A Context Aware Notification Architecture Based on Distributed Focused Crawling in the Big Data Era. European, Mediterranean, and Middle Eastern Conference on Information Systems. Springer, Cham, 2017.