TELKOM
NIKA
, Vol.12, No
.4, Dece
mbe
r
2014, pp. 11
05~111
2
ISSN: 1693-6
930,
accredited
A
by DIKTI, De
cree No: 58/DIK
T
I/Kep/2013
DOI
:
10.12928/TELKOMNIKA.v12i4.981
1105
Re
cei
v
ed Se
ptem
ber 4, 2014; Re
vi
sed
No
vem
ber 1
3
,
2014; Accep
t
ed No
vem
b
e
r
29, 2014
A Model of Vertical Crawler using Hidden Markov Chain
Ye Hu, Jun T
u
*, Wang
y
u
Tong
Scho
ol of Co
mputer Scie
nc
e and T
e
chno
l
o
g
y
, Hub
e
i
Un
i
v
ersit
y
of T
e
chnol
og
y, No. 1
NanH
u Street,
Hon
g
san 430
0
68,W
uha
n,
Chi
n
a
* Corres
pon
di
n
g
author, e-ma
i
l
: tujun_
hut@
1
63.com
A
b
st
r
a
ct
T
he larg
e si
z
e
and the dy
na
mic nature of the
W
eb ma
ke it n
e
cessary to co
ntinu
a
lly
ma
int
a
in W
e
b
base
d
i
n
for
m
at
ion r
e
triev
a
l sy
stems. In
order
to get
more
o
b
jects by
visiti
ng few
irre
lev
a
nt w
eb p
ages,
the
w
eb craw
ler usually tak
e
s the he
uristic s
earch
ing st
rat
egy that rank
s urls
by thei
r imp
o
rtanc
e and
preferentially visits the
more
im
po
rtant web
pages. While s
o
me system
s
re
ly
on cr
awlers
that ex
haustiv
e
ly
craw
l the W
e
b
,
others inc
o
rp
orate “f
ocus
”
w
i
thin their cr
aw
lers to har
v
e
st app
licati
o
n
or topic-s
peci
f
i
c
collecti
ons.
Ho
w
e
ver, very l
i
m
ite
d
w
o
rk
ha
s ad
dre
sse
d th
e ur
gent
issu
e
of h
o
w
to
me
asure
the
craw
le
d
pag
e
’
s
import
ance. In
ord
e
r
to de
al w
i
th
this is
s
ue, th
is pa
per
has
prese
n
ted
a n
e
w
mo
del
taki
n
g
advantages of
the Hidden M
a
rkov Mo
del (
H
MM) to enhance the cr
awler
perfor
m
ance. I
n
this
new
m
o
del
the i
m
pr
ove
d
HMM lear
ni
ng
abil
i
ty is e
m
pl
o
y
ed to s
o
lve th
e pro
b
le
m
of the the
m
e of th
e craw
ler dr
ift an
d
thus i
m
pr
ove t
he d
a
ta
acqu
is
ition
accur
a
cy. Nu
meric
a
l
s
i
mulation tests
have been
carried out to verify the
perfor
m
a
n
ce o
f
the prop
osed
appr
oach. T
h
e an
alysis
r
e
s
u
lts hav
e sho
w
n that the ne
w
mode
l w
a
s ver
y
efficient a
nd w
a
s sup
e
rior to t
he ex
isting B
a
yesia
n
pr
o
b
a
b
il
istic mode
l, Na
ïve Bayes
mo
d
e
l, an
d K-Ne
ar
est
Neig
hb
or Appr
oach. T
hus, the prop
ose
d
improve
d
HMM b
a
sed a
ppr
oach
can be us
ed in
practice.
Ke
y
w
ords
:
hi
d
den
mark
ov mode
l, craw
ler, unifor
m
reso
urc
e
locator
1. Introduc
tion
Due
to th
e ex
plosive
g
r
owt
h
of th
e
web
page
s, b
e
si
d
e
s to
h
ope
th
e sea
r
ch e
ngi
ne that
can
provide
more
and
m
o
re
app
rop
r
i
a
te informat
ion, peo
ple
h
a
ve the
requ
ireme
n
t of ta
king
centralized q
uery on give
n topic. The
dynamism
of the Web, crawlin
g forms the backbon
e of
appli
c
ation
s
t
hat facilitate
Web i
n
form
ation retrieva
l.
While th
e typical u
s
e
of cra
w
lers
ha
s be
en
for cre
a
ting
and maintai
n
ing indexe
s
for gen
eral
-p
urpo
se
sea
r
ch engine
s, di
verse u
s
a
g
e
o
f
topical cra
w
le
rs is em
ergi
n
g
both for cli
ent and
se
rver-ba
s
ed ap
p
lication
s
. Top
i
cal cra
w
lers
are
becoming
im
portant to
ols to su
ppo
rt appli
c
ation
s
su
ch
as
d
y
namic
We
b
portal
s
, onl
ine
sea
r
ching, an
d so on.
A
larg
e numb
e
r of
alg
o
rith
ms have bee
n
p
r
op
osed
f
o
r buildin
g crawle
rs.
The
d
i
fferen
c
e
is in th
e he
uristics they u
s
e to sco
r
e th
e unv
is
ited
UR
Ls
, w
i
th
s
o
me
a
l
go
r
i
th
ms
a
d
a
p
t
in
g an
d
tuning their p
a
ram
e
ters be
fore or d
u
rin
g
the cra
w
l [1].
In [2] Chakra
barti et al. d
e
scrib
ed a fo
cu
s
ed
crawl
e
r, that sea
r
ches the
we
b
to find
relevant pages on a given topic.
The
crawler utilizes a cl
assifier
to determi
ne relevancy
of a
page a
nd a distiller to ev
aluate pa
ge links. A co
mp
arison of lea
r
ning sch
e
ma
s employe
d
by
focu
sed
cra
w
l
e
rs can b
e
fo
und in [3]. In
orde
r to
determine
whethe
r a web p
age
matche
s
with
a
pred
efined to
pic, cla
s
sifica
tion algo
rithm
s
are em
pl
oyed. Some of the cla
ssifi
cat
i
on algo
rithm
s
are Bayesi
an
proba
bilisti
c model [4], Na
ïve Bayes model [5], K-Neare
s
t Nei
g
h
bor App
r
oa
ch
[6]
etc. In [7],
Dixit described
t
he m
e
chani
sm called
mi
g
r
ating
crawle
r
to re
du
ce l
o
a
d
on
the
network
by se
nding
th
e mig
r
ant
s to
the
web
se
rver itself
fo
r t
a
kin
g
the
adv
antage
of lo
cal do
wnlo
adi
ng
and filtering b
e
fore sendi
ng
the docum
en
ts to the Search en
gine rep
o
sitory.
Ran
k
in
g sea
r
ch re
sult
s is
a fundame
n
ta
l proble
m
in informatio
n re
trieval. To pre
s
ent the
document
s in
an ord
e
re
d
manne
r, Pag
e
Ran
k
in
g
m
e
thod
s are a
pplied, which
can a
rra
nge
th
e
document
s in
orde
r of thei
r releva
nce, importa
nce a
nd content score.
Some
of the comm
on
page
ra
nki
n
g
algo
rithm
s
a
r
e Pa
ge
Ran
k
Algorithm
[8], Weig
hted P
age
Ra
nk Algorithm
[9] a
n
d
Hyperli
nked Indu
ced To
pi
c se
arch Alg
o
rithm [10
]. The Page
Ra
nk alg
o
rithm
provide
s
a gl
obal
ran
k
ing
of Web pag
es b
a
sed on thei
r importa
nce
estimated from
hyperlin
ks [8]. For insta
n
ce, a
link from p
a
g
e
“A” to pag
e
“B” is con
s
id
ered a
s
if pa
ge “A” i
s
voting for the im
portan
c
e of p
age
“B”. So, with increa
se in nu
mber of lin
ks
to page “B”, it
s impo
rtan
ce
also in
crea
se
s.
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 16
93-6
930
TELKOM
NIKA
Vol. 12, No. 4, Dece
mb
er 201
4: 110
5 – 1112
1106
The Pag
e
Ra
nk al
go
rithm
attempts t
o
provide a
global
e
s
timate of
We
b pag
e
importa
nce.
Ho
wever,
the
impo
rtan
ce
of We
b
page
s i
s
subj
ectiv
e
for different
users
and
thus
can b
e
better determin
ed if the PageRa
n
k algo
rith
m take
s into co
n
s
ide
r
ation u
s
er prefere
n
ce
s.
The impo
rtan
ce of a p
age
may depe
nd
on differe
nt intere
sts a
nd
kno
w
le
dge of
different pe
o
p
le
therefo
r
e a
g
l
obal ran
k
m
a
y not provid
e the a
c
t
ual
the impo
rtan
ce of that
pa
ge for
a give
n
individual u
s
e
r
.
With in
cre
a
si
ng pop
ula
r
ity of sea
r
ch e
ngine
s,
impli
c
it feedb
ack,
i.
e., the actions
users
take wh
en
in
teractin
g with
the sea
r
ch engin
e
,
c
an
be u
s
e
d
to i
m
prove
the
ranki
n
g
s
. Impl
icit
relevan
c
e m
easure
s
hav
e been
studi
ed by seve
ra
l re
se
arch g
r
oup
s. An overview of im
plicit
measures i
s
compil
ed in
Kelly and Te
evan [11].
Howeve
r, the finding
s of th
e re
sea
r
ch work,
were not ap
p
lied to impro
v
e the ran
k
in
g of web
se
arc
h
re
sult
s i
n
reali
s
t
i
c
se
t
t
i
ngs.
Wh
at
we
need i
s
a m
easure
of the crawl
ed p
age’
s im
po
rtance, and th
en a metho
d
to summ
a
r
ize
perfo
rman
ce
across
a set
of crawl
ed p
a
ges. A n
u
mb
er of topi
cal
crawli
ng al
go
ri
thms h
a
ve be
en
prop
osed in t
he literatu
r
e.
Often the ev
aluation
of th
ese
crawl
e
rs
is do
ne by compa
r
ing a f
e
w
cra
w
le
rs on
a
limited nu
mb
er of q
u
e
r
ie
s/tasks
wit
hout
con
s
id
eratio
ns of
statisti
cal sig
n
ifica
n
ce.
For example,
the
existin
g
ranki
ng algo
rithms
main
ly
es
tima
te
the
url’s
impo
r
t
an
ce
b
y
w
e
b p
age
s
’
relevan
c
y to t
he topi
c o
r
th
eir a
u
thoritie
s. There
are
several
com
m
on alg
o
rithm
s
for evalu
a
ting
authoritie
s, such a
s
pag
e
r
an
k, leinbe
rg, hits
and sal
s
a [12]. These algo
rith
ms ca
n exa
c
tly
evaluate the
web p
age’
s
authority,
but
they hardly
con
s
id
er topi
cal info
rmatio
n, resulting i
n
a
probl
em of topic-drift that m
ean
s althou
gh the web p
age with hig
h
authority sco
re ce
rtainly h
a
s
high unive
rsa
l
authority, it
not alway
s
ha
s
high a
u
thori
t
y on given topic too [13]-[1
7
].
In general, it is impo
rtant to compa
r
e topi
cal
cra
w
lers
over a larg
e numbe
r of topics and
tasks. Thi
s
wi
ll allow u
s
to
ascertai
n the
statistica
l si
g
n
ifican
ce of p
a
rticul
ar b
ene
fits that we m
a
y
observe a
c
ro
ss
crawl
e
rs. There are two key dime
nsions in th
e a
s
sessme
nt proce
s
s. One
key
dimen
s
ion
is the n
a
ture
of t
he
cra
w
l ta
sk.
Crawl
cha
r
a
c
teristics
such as que
rie
s
a
n
d/or keyword
s
provide
d
a
s
input criteria t
o
the cra
w
ler,
use
r
-profile
s,
and de
sired
prop
ertie
s
of
the page
s to
be
fetched
(si
m
i
l
ar pa
ge
s, p
opula
r
pa
ge
s, autho
ri
tative page
s, e
t
c.) can lea
d
to sig
n
ificant
differen
c
e
s
in
cra
w
ler d
e
si
gn and imple
m
entation.
Th
e task could
be con
s
traine
d by paramet
ers
like the m
a
ximum num
be
r of page
s to
be fetch
ed
(long
crawl
s
versu
s
sho
r
t
cra
w
l
s
)
or t
h
e
available
me
mory. Hen
c
e,
a cra
w
lin
g ta
sk can b
e
vie
w
ed
as a
co
n
s
train
ed
multiobje
c
tive se
a
r
ch
probl
em. Ho
wever, the wi
de variety of obje
c
tive f
unction
s, cou
p
led with the la
ck of app
ro
priate
kno
w
le
dge
a
bout the
se
arch
spa
c
e, m
a
ke the
pr
oble
m
a ha
rd
one
. Furthe
rmo
r
e
,
a crawl
e
r
m
a
y
have to d
eal
with o
p
timization issu
es
su
ch a
s
l
o
ca
l versus
glob
al optima [1].
The ot
her
key
dimen
s
ion i
s
to make
com
pari
s
on
s an
d determi
ne ci
rcum
stan
ce
s unde
r whi
c
h
one or the ot
he
r
cra
w
le
rs wo
rk be
st. Com
pari
s
on
s mu
st be fair an
d
be mad
e
wit
h
an eye to
ward d
r
a
w
ing
out
statistically si
gnifica
nt differen
c
e
s
. Not o
n
ly does
thi
s
requi
re a
sufficient num
be
r of cra
w
l run
s
but al
so
sou
n
d
metho
dolo
g
i
es th
at con
s
ider th
e
tem
p
oral
natu
r
e of
crawl
e
r outp
u
ts. Significa
nt
chall
enge
s i
n
evaluation
in
clud
e the
gen
eral
unavail
a
b
ility of relev
ant sets fo
r p
a
rticul
ar topi
cs o
r
queri
e
s. T
h
u
s
evalu
a
tion t
y
pically relies on d
e
fining
measures fo
r estimatin
g
p
age im
porta
n
c
e.
Ho
wever, lite
r
ature revie
w
indicate
s th
at
very limited wo
rk
ha
s been d
one t
o
add
re
ss th
is
probl
em, i.e., to measure t
he pag
e impo
rtance.
It is imperative to meas
ure the page imp
o
rta
n
ce
to improve th
e cra
w
le
r pe
rforma
nce.
In orde
r to deal with th
e above me
nti
oned i
ssu
e, this pape
r has propo
sed a new
approa
ch to
measure the
pag
e imp
o
rt
ance a
nd
h
e
n
ce
en
han
ce
the
cra
w
le
r accu
ra
cy. T
h
e
inovation of t
h
is
work i
s
t
hat the imp
r
oved
HM
M h
a
s b
een i
n
tro
duce in me
a
s
uri
ng the
pa
ge
importance t
o
enhance th
e crawler performance.
The st
rong
learning
ability of
the
HMM
has
been
fully u
s
ed to
evaluat
e the
crawl
e
rs’ p
age
impo
rtance. By d
o
ing
so, th
e
data
sea
r
chi
ng
accuracy
cou
l
d be in
crea
sed an
d thu
s
the cra
w
le
r p
e
rform
a
n
c
e
could b
e
en
ha
nce
d
. simul
a
tion
tests h
a
ve be
en imple
m
ent
ed to verify the pro
p
o
s
ed
a
ppro
a
ch. The
analysi
s
resu
lts have
sho
w
n
that the new model was v
e
ry efficient and wa
s su
p
e
rio
r
to the existing Bayesian pro
babili
stic
model, Naïve
Bayes mode
l, and K-Nea
r
est Neig
hbo
r Approa
ch. Thus, the pro
p
o
se
d improve
d
HMM ba
se
d approa
ch ha
s great impota
n
ce in b
u
si
ne
ss a
ppli
c
ation
s
.
2. Rese
arch
Metho
d
Whe
n
dealin
g
with applications involving
the
displ
a
y of advertise
me
nts in se
arch
engin
e
results, the
cl
ick-throug
h rate (CTR) fro
m
i
ndep
end
e
n
t Web
users is impo
rtant
to measure a
n
d
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
1693-6
930
A Model of Vertical
Crawl
e
r usin
g Hid
d
e
n
Marko
v
Ch
ain (Ye Hu)
1107
thus
dete
r
min
e
the i
m
pa
ct
of the a
d
verti
s
eme
n
t
wi
thin the
au
ction
syste
m
that i
s
b
e
ing
u
s
ed.
In
this ca
se, a st
och
a
sti
c
app
roach ca
n be
use
d
for mod
e
ling the u
s
er sessio
ns [13]
.
For
ea
ch
we
b
s
ite in
the
set
of the
simila
r topi
c that h
a
s
the
seed
s’
URLs in
adva
n
ce
with
the meta
sea
r
ch
engin
e
to
ol, we
will
st
art from
the root URL
of the
w
ebsite
usin
g the Vit
e
rbi
algorith
m
, foreca
st the maximum prob
a
b
ility link
to g
e
t the similar topic page
s on the basi
s
of
the root
URL
of ea
ch
si
mil
a
r
we
b
s
ite
an
d the t
r
aine
d
HMM, a
nd
g
u
ide th
e
cra
w
ler to
cra
w
l t
he
target info
rm
ation to be
collecte
d
[12].
We
ca
n
obtai
n the mo
st likely state
seq
uen
ce u
s
in
g the
curre
n
t observations an
d related pa
ram
e
ters in
HM
M. If
the number of pa
g
e
s in the si
milar
web
s
ite
s
and
the target numbe
r of pa
ges ex
ce
e
d
the ce
rtain thresh
old
s
, they need a cert
ain
degree of co
n
t
rol to improv
e the efficien
cy of colle
ctio
n by topic cra
w
lers.
2.1
.
Gener
a
ting a HMM
Model
The topi
c l
e
arnin
g
p
r
o
c
e
s
s of the
we
b cra
w
le
r ca
n be
const
r
u
c
ted
well
usi
ng
HMM
model, b
e
cau
s
e th
e p
r
o
c
e
s
s of lin
ks fro
m
pag
e to p
a
ge i
s
a
hidd
e
n
and
un
kn
o
w
n
pro
c
e
s
s,
an
d
disting
u
ish th
e p
r
op
ertie
s
of a
co
ncret
e
web
pag
e
is the
domi
nant p
r
o
c
e
ss whi
c
h
can
be
observed.
Con
s
tru
c
ting
the HM
M th
rough
sim
u
lati
ng u
s
e
r
s’
a
c
cess
seq
uen
ces
and
an
alyzing
the
use
r
s’
browsi
ng mo
de, you
can
get the
link
stru
ct
ures
betwe
en p
a
g
e
s. Acco
rdin
g
to the nu
mb
er
of the page
s
acce
ss
seq
u
ence, t
he top
i
c-relate
d pag
es a
r
e ma
rke
d
with holl
o
w circle
s and t
he
nontopi
c-relat
ed pag
es a
r
e
marked with
solid ci
rc
l
e
s.
The netwo
rk diagra
m
is constructe
d a
s
following:
We
ca
n see f
r
om th
e defin
ition of the
HMM
that the
HMM i
s
u
s
u
a
lly comp
osed
of two
parts:
state transfe
r mod
e
l
and the mod
e
l from t
he st
ate set to the obse
r
ved se
quen
ce
s. Fig
u
re
1 sho
w
s the state transfe
r diagram.
Figure 1. State transfe
r dia
g
ram
Imagining a web su
rfer who
jump
s
fro
m
web pag
e to we
b pag
e, it choo
se
s
with uniform
prob
ability which li
nk to fo
llow at e
a
ch
step. Th
e
surfer will
occa
si
onally jump t
o
a rand
om p
age
with some
small proba
bili
ty. We co
nsi
der th
e w
eb
as a
directe
d
gra
ph. Fi b
e
the set of p
age
s
whi
c
h pa
ge i links to, and
Bi be the set of pages
which lin
k to p
age i. After average
d over a
sufficient number of steps, the pr
obability of
the surfer
on page j at some poi
nt in time is given
by the formul
a:
State transfe
r set
S
T
,T
T
,…T
:
Observation set
O
O
,O
,…,
O
:
Paramete
r m
odel of HMM
with the kn
own para
m
eters of
θ
A
,
B
,
π
.
First, we
sele
ct the topic-related pag
es set
T
as the target pa
ge set from the trainin
g
s
e
t.
T
is the sh
ortest di
stan
ce that
the page i from the target pag
e
T
. As sho
w
n in
Figure 2, ‘‘4’’
and ‘‘7’’ are the target pa
g
e
s.
T
to
T
are de
fined as follo
wing:
Secon
d
, for
all pag
es we
use the
k-m
ean
s alg
o
rith
m for a
u
tom
a
tic
clu
s
terin
g
. Each
categ
o
ry i am
obse
r
ved is
corre
s
p
ondin
g
to the obse
r
ved value
O
.
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 16
93-6
930
TELKOM
NIKA
Vol. 12, No. 4, Dece
mb
er 201
4: 110
5 – 1112
1108
Third, the i
n
i
t
ial probabilit
y distribution matrix is
π
P
T
,P
T
,…
,P
T
, among
whi
c
h
P
T
represents the probability that
the distance to the target
to
pic page equals i in the
initial state, where th
e initial val
ues a
r
e g
enerally even
ly distributed,
π
P
q
i
show time 1
choose the
probability of a certain
state. The transfer
matrix
is
A
a
, among
whi
c
h
a
indicates the
probability that tr
ansferred from state Ti
to
state Tj . For exam
ple,
if
a
equal
s 0,
T
cann
ot be transfe
rred fro
m
T
by only one step, an
d
it is require
d at least three cli
c
ks to
achi
eve the tran
sform
a
tion
. But when there a
r
e cro
s
s-lin
ks in pa
ges,
a
may be
not equal to
zero. The emissi
on probability is
B
b
k
, among which
b
k
indicate the probability in state
T
when the observe
d value
O
is known. Each emi
ssi
on
probability
b
i
ndicate the probability
from state
S
enco
unte
r
e
d
the observed value
j.
b
k
P
v
|
j
,
1
,
1
,
indicate symb
ols. The
symbols u
s
e
d
in the followi
ng a
r
e as th
e stan
dard
s
in HM
M.
Figure 2. Net
w
ork g
r
ap
h to state-relation
diagra
m
2.2. Topic Model Trainin
g
The tra
d
ition
a
l HMM m
o
d
e
l reptile
s to
be furthe
r im
proved i
n
de
aling
with the
training
set. First, the
training
set is sel
e
cte
d
by t
he exp
e
rt
we
b compo
s
e
d
of relate
d topi
cs. An
d to
m
a
rk
the correlatio
n with
the th
e
m
e of
ea
ch
p
age. If
the sm
all
set,
the wo
rklo
ad
i
s
not great. Ho
wev
e
r,
if the colle
ction i
s
la
rge,
the
work that
tag is ve
ry
heavy. Sec
o
ndly, in the
c
l
ass
i
fic
a
tion of the
training
set,
equiri
ng th
e
use
r
to
dete
r
mine th
e n
u
m
ber of
cla
sses th
em
selve
s
. Thi
s
i
s
not
only
for the u
s
e
r
s i
s
very difficult and in
accu
ra
te cla
ssifi
cati
on of reptile
s
cra
w
l
sub
s
e
q
uent pa
ge
s will
have a
signif
i
cant imp
a
ct
on the p
r
o
c
e
ss.
Woul
d affect the d
e
g
r
ee of reptiles crawl th
e web
fundame
n
tall
y.
On the ba
sis of the above training
se
t m
odel, we
use the Ba
urn–Welch alg
o
rithm.
Acco
rdi
ng to
the estimat
ed value
s
of
the init
ial gi
ven paramet
ers,
we
kee
p
on
continu
ous
iteration so that the vari
ous param
eters tend to obtain
more rea
s
on
able value
s
.
A
.
Initialization
π
r
i
shows the probability value of Si
when
t equals 1. In the model, it means the
probability to reach the t
a
rget
pages
when it step 1. Iterative
cal
c
ulation
φ
i
,
j
sho
w
s t
h
e
prob
ability at time t and t+1
with the state of
S
to
S
.
r
i
∑
φ
i
,
j
show the probability at
time t with the s
t
ate of
S
.
,
,
,
|
|
(1)
This
showed
the probabilit
y of the state
of
at time
t
with the step of
i
and time
t+
1
.
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
1693-6
930
A Model of Vertical
Crawl
e
r usin
g Hid
d
e
n
Marko
v
Ch
ain (Ye Hu)
1109
B
.
Iterativ
e
rev
a
luation
∑
,
∑
(2
)
∑
,
∑
(3)
We corre
c
t contin
uou
sly
for the state tr
ansitio
n matrix and
emission
sta
t
e matrix
according
to
the value
s
.
∑
r
i
indi
cate th
e num
ber of
ti
mes tran
sferred from
state
S
.
∑
φ
i,
j
indicate the
numbe
r of times that jump
s from state
S
to s
t
ate
S
C
.
Te
rm
ination co
ndition
|
|
|
|
(4)
Her
e
,
ε
is the
sele
cted th
re
shol
d in a
d
va
nce.
log
P
O
|
θ
is the
maximum probability of
state se
que
n
c
e O after t
he re
asse
ssment of param
eters iteration. In
the HMM, it is the
maximum p
r
o
bability of a
ccess p
a
ths of
a URL.
Co
ntinuou
s optima
l
co
rrecti
o
n
of
the p
a
ra
met
e
rs
of the model
make
s the
output pa
ram
e
ters
of t
he final traini
ng
grad
ually mo
ve toward m
o
re
optimum valu
es.
2.3. Topic Learning Mod
e
l
Let's loo
k
at two impo
rtant
assumptio
n
s
HMM mod
e
l:
Suppo
se on
e
,
when the st
ate transitio
n,
the state tra
n
sition p
r
ob
a
b
ility at time
t
to
t +
1
state tra
n
sitio
n
time to the
state
whe
n
only t
he time
t state, but
not to the
st
ate befo
r
e a
n
y
moment. Ca
n
be expre
s
se
d by the formula:
|
|
(5)
Meet assumi
ng one , we
say rando
m se
quen
ce
con
s
t
i
tutes a first o
r
de
r Markov chain.
Assu
ming t
w
o, hidden
Markov mod
e
ls
a
s
sume t
hat th
e output valu
e, the output
at time
t
is wo
rth ob
serving the cu
rre
nt time
t
depen
ds o
n
ly on the pro
b
a
b
ility in which the state h
a
s
nothing to do
with the previ
ous
state.
Ca
n be expre
ssed by the formula:
|
1
,
1
(6)
In fact, the
s
e
two
a
s
sumpt
i
ons a
r
e
not
use
d
in
the
web
bet
wee
n
the ve
ry rea
s
on
able
becau
se the
prob
ability of obse
r
ving th
e vect
or outp
u
t appea
rs at
any one time depe
nd
s n
o
t
only on
the
current
state o
f
t
he
system
whi
c
h, d
epe
n
d
ing
on th
e
system a
nd th
e time i
n
whi
c
h
the previou
s
state . Both a
s
sume that separat
es the
relation
shi
p
betwee
n
page
s and pa
ge
s an
d
page
s releva
nt to the subj
ect of the pa
ge or la
ndin
g
page o
r
ienta
t
ion pro
babilit
y is large, th
en
we have reason to believe
that th
is
website contains l
i
nks to have
a greater probab
ility of target-
oriente
d
web
s
ite . To be
able to e
s
tab
lish
conta
c
t
with the hi
st
ory of the st
ate, the nee
d to
observation t
r
an
sfer th
e
HMM’
s prob
ability and
o
u
tput pro
bab
ility of state Hidd
en Ma
rkov
assumptio
n
to make a
p
p
r
o
p
riate imp
r
ov
ements.
Improved
hid
den M
a
rkov
model
assu
m
e
s th
at the
hi
dden
state
seque
nce i
s
a
se
co
nd-
orde
r M
a
rkov
ch
ain. T
h
is state tran
sition
s i
s
the s
t
ate
at time t to the s
t
ate at the time
t
+
1
l
, the
state tran
sitio
n
pro
babilitie
s de
pend
not
only on t
he
state at time
t, and also
d
epen
ds o
n
th
e
state of the time
t-1
. Partial probabilities
formula:
,
,
….
,
(7)
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 16
93-6
930
TELKOM
NIKA
Vol. 12, No. 4, Dece
mb
er 201
4: 110
5 – 1112
1110
∑
1
,
0
,
1
,
,
n
in the model state num
ber. Similarly,
the proba
bility
of the current
state of the
output
value not only depe
nds o
n
the sy
stem wh
ere the
cu
rrent st
ate,
and de
pen
ds
on the syste
m
before t
he
moment of the state.
|
|
,
1
,
,
1
(8)
In this
pap
er,
assum
p
tion
s (7
) (8) is pre
s
ent
e
d
o
n
th
e ba
si
s of im
proved
HMM
learni
ng
algorith
m
is
studie
d
, it is con
c
lud
ed that t
he improved algo
rith
m of forwa
r
d
algorithm a
n
d
backward alg
o
rithm.
Forward and
backward alg
o
rithm is
cal
c
ulated un
der t
he co
ndition
of
θ
given model to
produce the probability of
the observed sequence
,
,…,
, is
P
O
|
θ
.
By formula (7
) indi
cated th
at
θ
given model, the probability of
θ
cert
ain state
se
q
uen
ce
,
,…,
:
|
|
|
,
|
,
,
…
|
,
,
∏
(9)
Includi
ng
system at time
1
time state of
p
r
obability, the probability of
state of
said
→
, that:
|
,
|
,
|
,
,
…
|
,
,
∏
(10)
The
formula (8) available
in
the
state s
eque
nce Q (model h
a
s
gi
ven) p
r
od
uce
d
und
er
the co
ndition
of the pro
bab
ility of observ
a
tion s
eque
n
c
e O,
so to p
r
odu
ce
a giv
en seque
nce
in
the context of the given m
odel broth
e
r O
prob
ability:
P
O
|
θ
∑
P
O,
Q
|
θ
∑
π
b
o
a
o
∏
a
b
o
(11)
We can se
e that formula (11
)
to calcul
ate
P
O
|
θ
, its computation is very big, so the
Forward and
backward alg
o
rithm metho
d
is used to calcul
ate as
P
O
|
θ
P
o
,o
,…
,
o
|
θ
P
o
,o
,…,
o
,o
,…
,
o
,q
S
,q
S
θ
12
α
i,
j
β
i,
j
,2
t
T
1
3. Experiments and
the
Analy
s
is
In ord
e
r to
ev
aluate the
p
r
opo
sed
HMM
ba
sed
cra
w
ler, nu
meri
cal
simul
a
tion te
sts
have
been
carried
out in
thi
s
work.
We
si
mulated
exp
e
rime
nts fo
r
the seed
s
of the initial
p
a
g
e
perfo
rmed
by
manu
al. The
releva
nt pa
g
e
of t
he th
e
m
e pa
ge, the
se
ed
s of the
se
ed
s of e
a
c
h
theme pag
e set si
ze of 100. For ea
ch
topic, t
he sp
iders cra
w
lin
g to the page
and the re
su
lt of
the seed,
be
cau
s
e
ea
ch
page
of the
cra
w
le
r
retu
rn, their
do
cu
ment si
milarit
y
is o
b
taine
d
b
y
usin
g the m
e
thod of VSM,
if their
simil
a
rity va
lue
s
of maximum
value is big
g
e
r tha
n
a
user-
defined th
re
shold, then
the
page
will
ma
rk th
e result
s
so fa
r. Che
c
k the result of
the cra
w
ler, t
he
more the
cra
w
ler i
s
succe
ssful, the
spi
ders
cra
w
ling
to the theme and the p
r
obability of the
simila
r result
is hi
ghe
r. Th
e pe
rform
a
n
c
e of the
cra
w
ler i
s
th
e ave
r
age
of
all th
e theme
of t
he
che
c
k re
sult
s.
Figur
e 3
sho
w
s
t
h
e
sim
u
l
a
t
i
on t
e
st
r
e
sult
s,
wh
ere
the pe
rforma
nce
of th
e p
r
opo
se
d
improve
d
HM
M has be
en compa
r
ed
with
the original
HMM alg
o
rith
m.
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
1693-6
930
A Model of Vertical
Crawl
e
r usin
g Hid
d
e
n
Marko
v
Ch
ain (Ye Hu)
1111
Figure 3. The
simulation
re
sults
It can be
see
in Fig. 3 tha
t
the improve
d
HMM o
b
tai
n
s the g
r
a
b
quantity of 8997
with
1152
relevan
t
theme
s
wh
ile the
pri
m
itive HMM
o
n
l
y
get 4
00
re
levant them
e
s
.
Hen
c
e,
the
improve
d
HM
M outperfo
rm
s the primitiv
e HMM.
Table 1 li
sts the co
mpa
r
i
s
on
of the p
r
opo
se
d met
hod a
gain
s
t the existin
g
Bayesia
n
prob
abili
stic
model, Naïve
Bayes model
, and K-Ne
arest Nei
ghb
or
Appro
a
ch.
Table 1. The
comp
ari
s
o
n
result
s of the cra
w
le
r searching
Me
t
h
od G
r
a
b
q
u
a
n
t
it
y
R
e
l
e
v
a
nt
qu
a
n
t
i
t
y
Ba
y
e
sian prob
ab
ilist
i
c model
8654
981
Naïve Ba
y
e
s mo
del
8549
1005
K-Nearest N
e
igh
bor
8733
973
Improved HMM
8997
1152
One
ca
n n
o
te
in tabl
e 1
th
at the Baye
si
an p
r
ob
abili
stic m
odel
obta
i
ns th
e g
r
a
b
quantity
of 8654 with 981 rel
e
vant theme
s
; the Naïve Baye
s model obtain
s
the gra
b
quan
tity of 8549 with
1005
releva
n
t
themes; th
e K-Nearest
Neig
hbo
r
obt
ains th
e gra
b
qua
ntity of 8733
with 9
7
3
relevant
them
es.
Hen
c
e,
th
e pe
rforman
c
e of th
e p
r
op
ose
d
HMM
b
a
se
d a
pproa
ch is am
ong
the
best on
e. This re
sults in
di
cate that the
impr
oved HMM coul
d provide e
ffect measurement
of
page imp
o
rta
n
ce. As a
result, the crawle
r sear
ch
ing pe
rform
a
nce i
s
better than the other
method
s. He
nce, the com
pari
s
on
re
sul
t
s indica
te that the propo
sed
HMM b
a
sed crawl
e
r
can
improve the i
n
formatio
n re
trieval.
4. Conclusio
n
In orde
r to improve the
web
crawl
e
r
perfo
rman
ce,
this wo
rk
propo
sed a
ne
w mod
e
l
based on th
e hidden m
a
rkov (HMM
) chai
n. Thro
u
gh nume
r
ical
simulation t
e
st, the anal
ysis
results have
sho
w
n
the
propo
sed
alg
o
ri
thm was effe
ctive to
guide
the fo
cu
se
d
cra
w
lin
g b
a
sed
on the user
path in HMM
.
The
accu
ra
cy of data a
c
qui
sition
w
ill
be highe
r if the recogniti
on
model that th
e topic
relie
s
on ca
n contin
ue to iter
ate
optimizatio
n i
n
theory. The
analysi
s
re
sults
indicate that the HMM topic
crawle
r has a
g
o
o
d
pro
s
p
e
ct i
n
enha
nci
n
g
the web
crawle
r
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 16
93-6
930
TELKOM
NIKA
Vol. 12, No. 4, Dece
mb
er 201
4: 110
5 – 1112
1112
performance.
Future work will
verify the proposed crawl
e
r
se
arch system based
on
HMM
in
busi
n
e
ss a
ppl
ication
s
.
Ackn
o
w
l
e
dg
ements
This wo
rk was sup
porte
d
by
Chin
ese
Co
lle
ge Stu
dents'
Innov
ative Entrepreneu
rial
Traini
ng Pro
g
r
am (No. 201
2105
0002
4).
Referen
ces
[1]
Mobas
her B,
Dai
H, Lu
o T
,
Naka
ga
w
a
M
.
Effective
per
sona
li
z
a
ti
on b
a
sed on asso
ciatio
n
rul
e
discov
e
ry fro
m
w
eb us
ag
e
data
. In Pro
c
eed
ings
of the 3r
d Intern
ation
a
l W
o
rks
hop
on W
e
b
Information a
n
d
Data Man
a
g
e
ment, W
I
DM
200
1: 9–1
5.
[2]
Chakr
abarti S,
Berg M, D
o
m B.
F
o
cused cr
aw
ling: a
new
appr
oach
to to
pic-sp
ecific W
eb res
ourc
e
discov
e
ry.
In Proceedings of the 8th Intern
ational WWW Co
nferenc
e. 1999: 237-252.
[3]
Pant G, Sriniv
asan P. Le
arn
i
ng to cra
w
l: C
o
mpari
ng cl
as
sificatio
n
sche
m
es.
ACM T
r
ansactio
n
s on
Information System
s
. 20
05; 2
3
: 430-4
62.
[4]
Koll
eran
d D, S
aham
i M.
Hi
era
r
chical
ly cl
assif
y
ing
doc
u
m
ent
s usin
g v
e
ry fe
w
w
o
rds
. In Procee
din
g
s
of
the F
ourtee
n
th Internatio
na
l C
onfere
n
ce o
n
Machi
ne Le
arn
i
ng (ICML’
97).
199
7: 170-
178.
[5]
W
ang W
,
C
h
e
n
X, Z
o
u
Y.
A
focused
craw
le
r bas
ed
on
n
a
ïve b
a
yes c
l
ass
i
fier.
In Pr
oce
e
d
in
gs of t
h
e
T
h
ird Internationa
l S
y
m
p
. o
n
Intelli
ge
nt Informati
on T
e
chno
log
y
and
Securit
y
Infor
m
atics, Chin
a.
201
0: 517-
521.
[6]
Yang Y, L
u
i
X.
A reex
a
m
i
natio
n of text categor
i
z
a
t
io
n metho
d
s.
In Procee
din
g
s
of the 22
n
d
Internatio
na
l C
onfere
n
ce o
n
Rese
arch a
nd
Deve
l
opm
ent i
n
Information
Re
trieval (SIGIR’99). 1999:
42-4
9
.
[7] Dix
i
t
A.
D
e
sig
n
of Scal
ab
le P
a
rall
el M
i
gr
atin
g Craw
ler
Bas
ed
on A
u
g
m
en
ted Hy
pertext
Docu
ments.
Ph.D. Thesis
. MDU. 201
0.
[8]
Page
L, Brin S
,
Mot
w
a
n
i R,
W
i
nogr
ad T
.
The pa
ger
ank ci
tation ra
nkin
g: brin
gin
g
ord
e
r to the w
eb.
T
e
chnical Re
p
o
rt, Stanford InfoLab. 1
998.
[9]
X
i
ng W, Ghorbani A.
W
e
ighted pa
gera
n
k
al
gorith
m
.
In Procee
din
g
s of the Se
cond A
n
n
u
a
l
Confer
ence
on
Communic
a
tio
n
Net
w
orks a
n
d
Serv
ices R
e
s
earch (CN
S
R’
0
4
). 2004: 5
67-5
78.
[10]
Ding
C, He
X,
Hus
ban
ds P,
Z
ha
H, Sim
o
n H.
Link
a
n
a
l
ysis: Hu
bs a
n
d
a
u
torities
o
n
the w
o
rl
d.
T
e
chnical Re
p
o
rt. 2001: 44
7-
463.
[11]
Kell
y
D, T
eevan J.
Imp
licit fe
edb
ack for i
n
fe
rring
user pr
ef
erenc
e:
A bi
bli
ogra
phy
. In SI
GIR Forum.
200
3: 521-
536.
[12]
T
an Q, Mitra P. Clusteri
ng-
b
a
s
ed incr
ementa
l
w
e
b cra
w
l
i
n
g
.
ACM Trans. Inf. Syst.
2010; 2
8
: 4-18.
[13]
Sada
go
pan
N,
Li J.
Char
acteriz
i
ng typical
and atypical us
er
se
ssions in
c
l
ickstream
s.
I
n
Procee
din
g
s
of the 17th Inte
rnatio
nal C
onfe
r
ence
o
n
W
o
rld
W
i
de W
eb. 20
08: 885
–8
94.
[14]
W
ang X, Li
u
C. Sem
antic represe
n
tatio
n
of comple
x res
ource req
uests for service-ori
ente
d
architectur
e
.
T
E
LKOMNIKA Indo
nesi
an Jo
u
r
nal of Electric
al Eng
i
ne
eri
ng.
2014; 1
2
(1): 7
41–
74
6.
[15]
Herma
w
a
n H, Sarno R. Dev
e
lo
pi
n
g
distrib
u
ted s
y
stem
w
i
th servic
e res
ource or
iente
d
architecture
.
T
E
LKOMNIKA Indon
esi
an Jou
r
nal of Electric
al Eng
i
ne
eri
ng.
2012; 1
0
(2): 3
89–
39
9.
[16]
Paul
us I. Cost
and b
e
n
e
fit of informati
o
n
sea
r
ch usin
g t
w
o
different strate
gies.
TEL
K
OMNIKA
. 2010;
8(3): 195-
20
6.
[17]
Qu
X, W
ang
Y.
T
he researc
h
on s
o
ft
w
a
re
re
source
re-sh
a
ri
ng for
smal
l
an
d me
dium-s
ize
d
e
n
terpris
e
clou
d man
u
fac
t
uring s
y
stem.
T
E
LKOMNIKA Indon
esi
an
Journ
a
l of E
l
e
c
trical En
gin
e
e
rin
g
. 20
14;
12(1): 71
1-7
1
7
.
Evaluation Warning : The document was created with Spire.PDF for Python.