Indonesian Journal of Electrical Engineering and Computer Science
Vol. 23, No. 1, July 2021, pp. 519~528
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v23.i1.pp519-528
Journal homepage: http://ijeecs.iaescore.com

A deep web data extraction model for web mining: a review

Ily Amalina Ahmad Sabri, Mustafa Man
Faculty of Ocean Engineering Technology and Informatics, Universiti Malaysia Terengganu, Kuala Nerus, Terengganu, Malaysia

Article Info

Article history:
Received Sep 11, 2020
Revised Apr 22, 2021
Accepted May 1, 2021

ABSTRACT
The World Wide Web has become a large pool of information. Extracting structured data from published web pages has drawn attention in the last decade. Web data extraction (WDE) faces many challenges because of the variety of web data and the unstructured nature of hypertext markup language (HTML) files. The aim of this paper is to provide a comprehensive overview of current web data extraction techniques in terms of the quality of the extracted data. The paper focuses on data extraction using wrapper approaches and compares them to identify the best approach for extracting data from online sites. To observe the efficiency of the proposed model, we compare the performance of single web page extraction across different models: document object model (DOM), wrapper using hybrid DOM and JSON (WHDJ), wrapper extraction of image using DOM and JSON (WEIDJ), and WEIDJ (no-rules). The experiments showed that WEIDJ extracts data fastest, with lower time consumption than the other methods.

Keywords:
Data extraction techniques
Document object model
Noisy information
Web data extraction
Wrapper extraction of image using DOM and JSON
Wrapper using hybrid DOM and JSON

This is an open access article under the CC BY-SA license.

Corresponding Author:
Ily Amalina Ahmad Sabri
Faculty of Ocean Engineering Technology and Informatics
Universiti Malaysia Terengganu
Kuala Nerus, Terengganu, Malaysia
Email: ilylina@umt.edu.my

1. INTRODUCTION
The World Wide Web has become a large pool of information that contains web pages with images, audio, video clips, and product information. Web traffic is among the important issues arising from the extraction process [1]. Extracting data from web pages matters to people because the extracted data can serve other purposes and bring substantial benefit. Websites are mainly designed for humans to glance at certain information. Their structures differ from one another, and they are semi-structured. People have to manually select the images they are interested in saving, which is time consuming. One of the technologies that can be applied to web data extraction (WDE) is called a wrapper. The main goal of a wrapper is to transform semi-structured data into structured data. Many studies discuss wrappers, and most of them report that automatic data extraction includes noisy information. Post-processing may therefore be required so that only beneficial data are kept. It is important to extract data with high precision and recall, and as quickly as possible for users. In this paper, a wrapper is proposed to extract data based on different rules and models: document object model (DOM), wrapper using hybrid DOM and JSON (WHDJ), wrapper extraction of image using DOM and JSON (WEIDJ), and WEIDJ (no-rules). This research focuses not only on how to extract data but also on providing a user-friendly platform for developers to handle the extracted data, which is achieved through a user-friendly browser GUI.

Over the past few decades, numerous studies have been carried out on mining data from websites or web pages, and numerous techniques have been applied [2]. Many recent works have tried to extract structured information from web pages using a variety of techniques such as DOM, visual segmentation, or other techniques [3], [4]. Kamanwar et al. [5] agreed that WDE is a way of mining the information users require from web pages. Nowadays, extractors are used because the web is an ocean of data, which makes browsing for information a very complex task. Normally, the contents of web documents are unstructured. Web data extraction is defined as a process that uses tools and wrappers as mediums to extract information from web documents in hypertext markup language (HTML) format. Noisy information such as tags, advertisements, and banners is removed by the wrapper.

STEM was proposed by Fang et al. [6] to extract structures of identifiers from the tag paths of web pages. A suffix tree is then built on top of these sequences, and four refining filters are proposed to identify the sections that contain unnecessary information.

Pouramini et al. [7] proposed a handle-based wrapper using the DOM tree approach. This research worked on text features, which act as handles to mine data records from web pages. The handles consist of textual delimiters, keywords, constants, or text patterns. A polynomial algorithm was designed that traverses the DOM tree over the page elements in a mixed bottom-up and top-down manner. The limitation of this application is that extraction can only be performed on the visible parts; it cannot extract from whole web pages.

TANGO was proposed by Jiménez et al. [8] and was designed to learn rules for extracting information from semi-structured web documents with high precision and recall. High precision and recall are prerequisites in the context of enterprise systems integration. TANGO depends on an open catalogue of types that helps to map the contents of documents into a knowledge base. Each component of a web document, represented as a DOM node, is described by HTML, DOM, CSS, relational, and user-defined features.

Chitra and Aysha Banu [9] proposed the deep web data extraction (DWDE) framework to provide accurate results to users based on the URLs or domains they search.

Tripathy et al. [10] proposed the VEDD wrapper to extract the relevant search result records (SRRs) from search engines by filtering out noisy and redundant records. Breadth-first search (BFS) was used at the beginning because it helped to restructure the unstructured and semi-structured SRR pages, which in turn simplifies the extraction process.

Derouiche et al. [11] proposed ObjectRunner, a wrapper inference technique that automatically extracts and integrates complex structured data. The extraction was done in two stages: automatic annotation and extraction template construction.

XWRAP, a wrapper based on the DOM tree, was developed by Liu et al. [12]. It consists of four components: syntactical structure normalization; information extraction, used for deriving rules; code generation, used for generating the wrapper programs; and testing and packing, used for validation.

OLERA was developed by Chang et al. [13]. It produced extraction rules from semi-structured web pages without requiring training data and was designed with visualization support. However, the technique was limited by its sensitivity to ordering information. The extraction process was also likely to fail if the templates for each attribute were similar.

Liu et al. [14] proposed MDR, a fully automated system to identify data records in web pages. The technique required all data records to have the same parent and multiple data records to have similar structures. The drawback of this approach was its inability to extract individual fields.

VIPS was proposed by Cai, Yu et al. [15] and Cai et al. [16]. It combined two techniques: parsing HTML into a DOM tree and analyzing the web page layout using visual cues. The experiments clearly showed that a vision-based web page content structure was very helpful in detecting and filtering out noisy and irrelevant information. Although this research coped well with the multiple data regions of deep web pages, it was still restricted by its inability to remove noise completely.

Crescenzi et al. [17] developed RoadRunner. This tool enabled data extraction through automatically generated wrappers and was based on the similarities and differences between web pages. The advantages of RoadRunner are that it needs no prior knowledge of the schema of the web pages and that it can handle nested content structures. Its limitations were its inability to manage disjunction cases and errors in the input documents, which affected its effectiveness.

IEPAD is a system that automatically discovers extraction rules from web pages [18]. It identifies record boundaries through repeated pattern mining and multiple sequence alignment. The advantage of this technique is that information extraction involves no human effort and no content-dependent heuristics. The limitation of this tool was its poor ability to deal with complex and nested structured data.

Hsu et al. [19] developed SoftMealy as a web data extraction tool. This tool applied contextual rules and a finite state transducer (FST) technique comprising body transducers and tuple transducers. The body transducers extracted the parts of the web contents that contain tuples. The tuple transducers then iteratively extracted the tuples. However, this technique was not able to generalize to unseen separators. TSIMMIS was
an extractor that extracts data from WWW contents and then converts the extracted information into a structured format before storing it in a database [20]. The relevant data is retrieved in object exchange model (OEM) format.

A web data extraction system is a software application that can retrieve relevant information such as text, images, audio, and many other kinds of content from web sources [21]. Such an application usually interacts with web sources and mines the relevant information to be stored. The mined contents originate in HTML web pages and can be post-processed, transformed into the most suitable structured format, and stored for later use.

DOM can be applied directly to discover the required information in HTML documents. Abidin et al. [22] constructed a DOM tree structure as the preliminary step. Unnecessary nodes such as script and style then need to be filtered out. A classification process is vital for finding the classes of multimedia data. Media data is recognized when the parser finds the string "src=" in the data structure. Finally, the multimedia data can be extracted. However, a large amount of processing time is required to extract web pages that consist of large HTML structures. In addition, all images are extracted without checking for repeated files. The WEIDJ model is therefore proposed to overcome the limitations of the DOM model in extracting images. Table 1 summarizes web data extraction tools.
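
The DOM-style image discovery described above can be illustrated with a short Python sketch. It is not the implementation used in [22] or in this paper; the class and function names are hypothetical, and only the Python standard library is assumed. It walks the tags of an HTML string and records the `src` attribute of every `<img>` element. The standard parser hands the bodies of script and style elements to the data handler as raw text, so markup inside them is not reported as tags. Repeated files are deliberately kept, which mirrors the limitation noted in the text.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Record the src attribute of every <img> start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_sources(html):
    """Return all image sources found in an HTML string (duplicates kept)."""
    parser = ImageSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources
```

Calling `extract_image_sources(page_html)` on a page that references the same file twice returns that source twice, which is why a later filtering step is needed.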

The motivation for this research originates from previous works on techniques and methodologies for locating and extracting data from various web pages of different sites. These data can be very beneficial and useful as managerial information. The extracted information is merged into the multimedia database and can be used to fulfill new queries in the next stage of data mining. The main contribution of this research is the development of a web data extraction model that uses hybrid approaches to extract images and reveal detailed information about them. The model is expected to enable effective image extraction by disclosing only the related parts, which in turn reduces extraction time.

This paper is structured as follows. Section 2 presents the research method used to address the extraction issues. Section 3 presents the results and analysis and shows the performance of the proposed tool. Finally, Section 4 discusses the conclusion.

Table 1. Web data extraction tools
(Author, year) | Tools | Model
Fang, Xie, Zhang, Cheng and Zhang [6] | STEM | Suffix Tree Based Method
Pouramini, Khaje Hassani and Nasiri [7] | Handle-based Wrapper | DOM Tree
Jiménez and Corchuelo [8] | TANGO | DOM
Chitra and Aysha Banu [9] | DWDE | Tag based Feature
Tripathy, Joshi, Thomas, Shetty and Thomas [10] | VEDD | DOM Tree; Breadth First Search (BFS)
Derouiche, Cautis and Abdessalem [11] | ObjectRunner |
Liu, Pu and Han [12] | XWRAP | DOM Tree
Chang and Kuo [13] | OLERA |
Liu, Grossman and Zhai [14] | MDR |
Cai, Yu, Wen and Ma [15] | VIPS | DOM Tree; Visual Cues
Crescenzi, Mecca and Merialdo [17] | RoadRunner |
Chang and Lui [18] | IEPAD | Pattern Discovery
Hsu and Dung [19] | SoftMealy |
Hammer, Garcia-Molina, Cho, Aranha and Crespo [20] | TSIMMIS | Object Exchange Model (OEM)

2. RESEARCH METHOD
The basic concepts of the data extraction process are data, selection, transformation, and knowledge. In the preliminary step, users need to know the type of data they are extracting: text, images, videos, or others. This selection must be made early because each type of data has its own sources and extraction models. Once the type of data has been selected, the next steps are abstracting and transforming the selected data into a tabular format using specific approaches, which need to be fully understood before a wrapper is developed. Wrappers are tools developed using specific techniques or models, and they can be used to extract images automatically. A wrapper can be divided into two main components. The first component involves entering the web address (URL) of the web page. It comprises parsing the HTML web page and converting it into a DOM tree structure. This conversion is important for understanding the structure of HTML pages as a tree. The method is useful for handling data whether it is structured, semi-structured, or unstructured. The second part is related to the knowledge-based
construction. The extraction techniques applied in this research are DOM, a hybrid model of DOM and JSON (WHDJ), and a hybrid model of DOM, JSON, and visual segmentation (WEIDJ). Figure 1 shows the general models for the three web data extraction models: DOM [23], WHDJ [24], and WEIDJ [25].
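
As a rough illustration of the first wrapper component (parsing an HTML page into a tree), the following Python sketch builds a minimal DOM-like tree with the standard library. It is only a sketch under stated assumptions: the `Node`, `TreeBuilder`, `build_tree`, and `find_all` names are hypothetical, text nodes are ignored, and a production wrapper would use a full HTML5 parser rather than this simplified stack.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the simplified tree: tag name, attributes, child nodes."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []


class TreeBuilder(HTMLParser):
    """Build a tree of Node objects from the start and end tags of a page."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Depth-first search for every node with the given tag name."""
    found = [node] if node.tag == tag else []
    for child in node.children:
        found.extend(find_all(child, tag))
    return found
```

With the tree in hand, `find_all(build_tree(page_html), "img")` returns the image nodes in document order, and each node's `attrs["src"]` gives the image address.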

Figure 1. General model

3. RESULTS AND ANALYSIS
For the experimental work, several samples of WWF web pages were taken, and content extraction was performed on the sampled data using the HTML source files. These files contain the information about the images to be extracted. Figure 2 shows a sample extractor specification file. The list of commands, consisting of images and image URLs, appears between the brackets "{" and "}". Most of the images are in .jpg format. WEIDJ is capable of extracting images in various formats such as .jpg, .gif, .bmp, and others.

JavaScript Object Notation (JSON) is a syntax for storing and exchanging data. It is easy for users to read, and it exposes the important text when data objects are transmitted. It is a good choice for storage and also enables a speedy response to information queries. The output can range from a simple structure to a complex, highly nested one. $json_url_path is used as a constructor to tell the JSON dataset to include the nested structures of the JSON object. In the first step, the URL needs to be declared as the JSON path. Then, the "src" value needs to be specified as the path. This is very important for finding the image information within the nested image structure. Figure 3 shows the structure of the extracted information, which is organized in a structured way and displayed in table format [26]. The extraction process in this example was performed using table definitions. The initial command $json_URL gets the contents of the source file or web address whose URL is given in ["URL"]. After the file has been fetched, the contents are broken down into specific fields such as $no, $img_URL, image, $size_in_bytes, and $total_time_load_page. The extracted information is presented in table form.
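
To make the record layout concrete, the sketch below assembles one JSON record per image using the field names mentioned in the text (no, img_URL, image, size_in_bytes, total_time_load_page). It is a Python illustration rather than the extractor's own code; the `build_records` and `to_json` helpers, the assumption that "image" holds the file name, and the choice of Python are all assumptions made for this example.

```python
import json
import posixpath
from urllib.parse import urlparse


def build_records(images, total_time_load_page):
    """Turn (img_url, size_in_bytes) pairs into a list of JSON-ready records."""
    records = []
    for no, (img_url, size_in_bytes) in enumerate(images, start=1):
        records.append({
            "no": no,
            "img_URL": img_url,
            # Assumption: the "image" field is the file name part of the URL.
            "image": posixpath.basename(urlparse(img_url).path),
            "size_in_bytes": size_in_bytes,
            "total_time_load_page": total_time_load_page,
        })
    return records


def to_json(records):
    """Serialize the records as an indented JSON array."""
    return json.dumps(records, indent=2)
```

Each record is a flat object, so the resulting JSON array maps directly onto the rows of a table such as the one shown in Figure 3.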

Figure 2. A sample extractor specification file

Figure 3. The extracted information in JSON format

In addition to the basic capabilities of WEIDJ, our extractor provides several other useful and user-friendly features. One of them is support for queries on the saved images. Figure 4 shows a collection of images that have been saved in a single multimedia database. These images can be queried from the database and used for further purposes such as report generation and analysis. Images that have been selected are stored in the multimedia database, and users can choose between two options for saving them: automatic or manual. As a standard format, JSON can take common data structures and turn them into a string representation. Figure 5 shows images that were successfully extracted and represented in JSON format. The advantages of using JSON are that it is fast and very easy to use.
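
The save-and-query feature can be sketched with an in-memory SQLite database. The paper does not describe the actual database engine or schema, so the table layout, the column names, and the choice to ignore repeated URLs on insert are all assumptions made for illustration.

```python
import sqlite3


def save_images(conn, records):
    """Store (img_url, size_in_bytes) pairs, skipping URLs already saved."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "no INTEGER PRIMARY KEY, img_url TEXT UNIQUE, size_in_bytes INTEGER)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO images (img_url, size_in_bytes) VALUES (?, ?)",
        records,
    )
    conn.commit()


def query_images(conn, min_size=0):
    """Return saved images of at least min_size bytes, in insertion order."""
    rows = conn.execute(
        "SELECT img_url, size_in_bytes FROM images "
        "WHERE size_in_bytes >= ? ORDER BY no",
        (min_size,),
    )
    return rows.fetchall()
```

A caller would open a connection with `sqlite3.connect(":memory:")` (or a file path), pass the extracted records to `save_images`, and later call `query_images` to retrieve them for reports or analysis.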

Figure 4. Images retrieved from database

Figure 5. JSON format

In the deep web experiments, web data extraction is performed by considering the size of the images and different levels of extraction [27]. The experiment was conducted with reference to the earlier work in [16] in order to compare the performance of the extraction process. Image extraction was carried out in three ways:
a) Extraction of images in the general way.
b) Extraction of images by considering image size in two parts: 50*50 pixels and 128*128 pixels.
c) Extraction of images tested randomly at different levels: 5 pages, 10 pages, 15 pages, 20 pages, 25 pages, and 30 pages.
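
The size-based setting in item b) can be pictured as a simple threshold rule. The sketch below is an assumption about how such a rule might look, not the paper's implementation: it treats an image as retrieved when both its width and height reach the chosen minimum side (50 or 128 pixels) and as filtered otherwise, so that found = retrieved + filtered. The `Image` type and `split_by_size` function are hypothetical names.

```python
from typing import NamedTuple


class Image(NamedTuple):
    url: str
    width: int
    height: int


def split_by_size(images, min_side):
    """Split images into (retrieved, filtered) using a minimum side length."""
    retrieved, filtered = [], []
    for img in images:
        if img.width >= min_side and img.height >= min_side:
            retrieved.append(img)
        else:
            filtered.append(img)
    return retrieved, filtered
```

Raising `min_side` from 50 to 128 can only move images from the retrieved list to the filtered list, never the other way around.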

In this paper, we discuss the results of deep web data extraction for images, tested randomly on 30 pages as shown in Tables 2 and 3, and for two image sizes: 50*50 pixels and 128*128 pixels. Table 4 shows the extraction time as a percentage, with regard to Table 4 (a) and (b). From this table, we can see that the percentage of extraction time for WEIDJ and WEIDJ (no-rules) is lower than for image extraction using DOM and WHDJ. This result indicates that extracting semi-structured data using WEIDJ is faster than using the other models.

Table 2. Performance of image extraction by web pages (30 URL) for DOM and WHDJ
Benchmark | DOM: Image found | DOM: Image retrieved | DOM: Image filtered | DOM: Time | WHDJ: Image found | WHDJ: Image retrieved | WHDJ: Image filtered | WHDJ: Time
amnh.org | 1662 | 611 | 1051 | 3845.7278 | 1077 | 578 | 499 | 2457.5042
ocean.si.edu | 687 | 610 | 77 | 751.5967 | 62 | … | … | 715.2595
iucn.org | 289 | 251 | 38 | 683.3783 | 227 | 191 | 36 | 509.2624
endangeredspeciesinternational.org | 77 | 43 | 34 | 158.5747 | 59 | 43 | 16 | 116.4149
wwf.org.my | 492 | 375 | 117 | 503.206 | 460 | 371 | 89 | 462.0894

Table 3. Performance of image extraction by web pages (30 URL) for WEIDJ and WEIDJ (no-rules)
Benchmark | WEIDJ: Image found | WEIDJ: Image retrieved | WEIDJ: Image filtered | WEIDJ: Time | WEIDJ (no-rules): Image retrieved | WEIDJ (no-rules): Time
amnh.org | 249 | 204 | 45 | 100.272 | 5430/1691 | 510.6992
ocean.si.edu | 379 | 366 | 13 | 82.7162 | 691/676 | 254.8985
iucn.org | 118 | 101 | 17 | 108.7956 | 819/274 | 208.7372
endangeredspeciesinternational.org | 277 | 105 | 172 | 47.4335 | 427/401 | 38.9521
wwf.org.my | 371 | 276 | 94 | 94.9288 | 495/461 | 77.9276

Table 4. Performance of image extraction by percentage for 30 URL
Web address | DOM: Time | DOM: Percentage % | WHDJ: Time | WHDJ: Percentage % | WEIDJ: Time | WEIDJ: Percentage % | WEIDJ-no rules: Time | WEIDJ-no rules: Percentage %
amnh.org | 3845.7278 | 55.6 | 2457.5042 | 35.5 | 100.272 | 1.5 | 510.6992 | 7.4
ocean.si.edu | 751.5967 | 42 | 715.2595 | 40 | 82.7162 | 4 | 254.8985 | 14
iucn.org | 683.3783 | 45 | 509.2624 | 34 | 108.7956 | 7.2 | 208.7372 | 13.8
endangeredspeciesinternational.org | 158.5747 | 43.9 | 116.4149 | 32.2 | 47.4335 | 13.1 | 38.9521 | 10.8
wwf.org.my | 503.206 | 44.2 | 462.0894 | 40.6 | 94.9288 | 8.35 | 77.9276 | 6.85
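
The percentages in Table 4 appear to be each model's share of the total time summed over the four models for one website. The text does not state this formula, so it is an inference from the numbers. The short Python sketch below recomputes the shares for the wwf.org.my row under that assumption; the function name `time_shares` is hypothetical.

```python
def time_shares(times):
    """Return each model's time as a percentage of the summed time."""
    total = sum(times.values())
    return {model: 100.0 * t / total for model, t in times.items()}


# Times for wwf.org.my as listed in Table 4.
wwf = {
    "DOM": 503.206,
    "WHDJ": 462.0894,
    "WEIDJ": 94.9288,
    "WEIDJ (no-rules)": 77.9276,
}
shares = time_shares(wwf)
```

Under this reading the DOM share comes out at roughly 44% and the WHDJ share at roughly 41%, in line with the tabulated 44.2 and 40.6.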

To give users a better visualization, Figure 6 shows the time performance of each model in extracting images from the WWF website (refer to Table 4). From this figure, we can see that the Document Object Model accounts for 44% of the time, which is longer than the other models. This is because the model needs to check each node for images one by one before extracting all images from the website. The wrapper using hybrid DOM and JSON (WHDJ) was proposed to overcome this limitation of DOM. The results show that the hybrid model combining DOM and JSON (40%) succeeds in reducing the time. However, although the time is reduced, certain images cannot be extracted, which is the weakness of WHDJ. In this research we therefore proposed a new hybrid model that combines visual segmentation with the handling of noisy images, so that noisy images are detected and only beneficial images are retrieved. Noisy images are defined as images such as logos, repeated images, and many more. Although the web acts as a large repository of knowledge, it undeniably also contains noisy information, and noisy information can degrade the performance of data extraction. WEIDJ is proposed to overcome the limitation in extracting beneficial images and to remove noisy images so that images can be extracted as quickly as possible. From this figure, WEIDJ is among the fastest models in extracting images (8%). WEIDJ (no-rules) implements a technique similar to the WEIDJ model, but it retrieves all types of images, including noisy images.
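
One of the noise categories named above, repeated images, can be handled with a simple order-preserving filter. The sketch below is an illustrative assumption about such a rule (the paper does not give its rule set); it treats two images as repeats when their URLs are identical, and `drop_repeated` is a hypothetical name.

```python
def drop_repeated(urls):
    """Keep the first occurrence of each image URL and drop later repeats."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

A no-rules variant would simply skip this step and keep every URL, which is consistent with the description of WEIDJ (no-rules) retrieving all images.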

Tables 5 and 6 show image extraction for the deep web with sample image sizes of 50x50 and 128x128. The experiment used these two sizes because beneficial images are normally in the range of 128x128 pixels, whereas noisy images such as headers, logos, and so forth are around 50x50 pixels.

Figure 6. Percentage % of time performance for different models in extracting images

Table 5. Performance of image extraction for deep web (Size 50*50)
Benchmark | Link Found | Img found | Img retrieved | Img filtered | Time
DOM
amnh.org | 132 | 4881 | 2125 | 2756 | 10556.2238
ocean.si.edu | 97 | 1966 | 1610 | 356 | 2319.4244
iucn.org | 96 | 999 | 811 | 188 | 1979.6851
endangeredspeciesinternational.org | 30 | 394 | 288 | 96 | 865.8827
wwf.org.my | 142 | 1803 | 1374 | 429 | 1900.9394
WHDJ
amnh.org | 132 | 4013 | 2028 | 1985 | 8778.3747
ocean.si.edu | 97 | 1705 | 1505 | 200 | 2076.7548
iucn.org | 96 | 707 | 596 | 111 | 1625.4596
endangeredspeciesinternational.org | 30 | 300 | 269 | 31 | 534.6634
wwf.org.my | 142 | 1626 | 1370 | 256 | 1385.7157
WEIDJ
amnh.org | 132 | 1521 | 1385 | 136 | 457.7495
ocean.si.edu | 96 | 836 | 803 | 33 | 312.985
iucn.org | 96 | 340 | 310 | 30 | 308.6347
endangeredspeciesinternational.org | 30 | 262 | 102 | 160 | 26.4048
wwf.org.my | 143 | 1311 | 1059 | 251 | 318.2913
WEIDJ (no Rules)
amnh.org | | | 7339/4921 | | 928.7615
ocean.si.edu | | | 3832/1972 | | 580.42
iucn.org | | | 1952/1011 | | 660.984
endangeredspeciesinternational.org | | | 427/401 | | 36.8205
wwf.org.my | | | 3672/1907 | | 573.7713

Table 6. Performance of image extraction for deep web (Size 128*128)
Benchmark | Link Found | Img found | Img retrieved | Img filtered | Time
DOM
amnh.org | 133 | 4920 | 839 | 4081 | 13709.2253
ocean.si.edu | 97 | 2007 | 404 | 1603 | 3244.0467
iucn.org | 96 | 998 | 493 | 505 | 2980.6396
endangeredspeciesinternational.org | 30 | 394 | 78 | 316 | 808.1518
wwf.org.my | 143 | 1818 | 307 | 1515 | 1621.6796
WHDJ
amnh.org | 134 | 4124 | 822 | 3302 | 12223.65
ocean.si.edu | 98 | 1681 | 404 | 1277 | 1888.9131
iucn.org | 97 | 790 | 523 | 267 | 1772.138
endangeredspeciesinternational.org | 30 | 300 | 66 | 234 | 436.362
wwf.org.my | 143 | 1175 | 164 | 1011 | 592.1318
WEIDJ
amnh.org | … | 1593 | 1420 | 173 | 1697.7931
ocean.si.edu | 98 | 846 | 807 | 39 | 1368.3641
iucn.org | 97 | 389 | 330 | 59 | 1253.8517
endangeredspeciesinternational.org | 30 | 277 | 93 | 184 | 45.6617
wwf.org.my | 143 | 1371 | 541 | 829 | 342.2131
WEIDJ (no-rules)
amnh.org | | | 7012/4918 | | 1335.5362
ocean.si.edu | | | 3902/2005 | | 533.1249
iucn.org | | | 1002/970 | | 540.0529
endangeredspeciesinternational.org | | | 400/427 | | 31.1268
wwf.org.my | | | 2541/1346 | | 310.7469

4. CONCLUSION
In
this
pa
per,
we
ha
ve
descr
i
bed
a
m
od
el
for
web
da
ta
ext
r
act
ion
pro
gr
am
s,
wh
ic
h
pro
vid
es
a
n
offe
r
po
te
ntial
web
data
extracti
on
for
us
ers
.
Am
ong
17
we
bs
it
es
that
we
us
e
d
f
or
the
e
valu
at
ion
e
xp
e
rim
e
nt,
th
e
exp
e
rim
ental
work
discu
sses
the
extracti
on
from
fi
ve
web
s
it
es
and
the
le
ve
l
of
extracti
on
is
fo
cusi
ng
on
deep
web.
It
can
be
op
e
rated
by
th
e
act
ion
of
use
rs
in
cl
ic
king
a
nd
pointi
ng
th
e
cur
s
or
t
o
sea
rch
th
e
we
b
ad
dr
ess
after
inse
rtin
g
the
we
b
ur
l
.
T
hi
s
exp
e
rim
ent
s
hows
t
hat
our
pro
po
se
d
wr
a
pper
is
a
ble
to
r
edu
ce
us
e
rā
s
bur
de
r
n
in
w
riti
ng
any
co
nf
i
gurati
on
file
due
t
o
diff
e
ren
t
st
ru
ct
ure
of
ea
ch
we
b
pag
e
al
th
ough
it
is
in
t
he
sam
e
web
sit
e.
A
n
i
m
po
rtant
pa
rt
of
our
w
ork
was
t
he
m
odel
of
we
b
data
extracti
on,
the
e
xecu
ti
on
t
i
m
e
of
extracti
on
bec
om
e
lon
ge
r
es
pe
ci
al
ly
in
ext
r
act
ing
a
la
r
ge
nu
m
ber
s
of
im
ages
due
to
c
onta
in
t
he
no
isy
im
ages
al
so
.
Ma
j
ori
ty
of
t
he
te
ch
niqu
es
co
nv
e
rt
the
web
sit
e
int
o
D
OM
tree
s
o
th
at
they
can
be
analy
zed
to
id
entify
no
ise
s
by
rem
ov
i
ng
the
unre
la
te
d
el
em
ents.
The
extra
ct
io
n
ti
m
e
beco
m
e
s
lo
ng
e
r
s
o
a
n
al
te
rn
at
ive
have
be
e
n
cond
ucted
t
o
de
crease
t
he
e
xe
cution
ti
m
e
by
app
ly
in
g
J
S
ON
in
WHD
J.
A
n
im
pr
oved
al
gorithm
and
bette
r
so
luti
on
in
dea
li
ng
with
the
ever
ex
pa
nd
i
ng
data
siz
e,
wh
ic
h
w
ou
l
d
furthe
r
com
plica
te
the
pr
oce
ssin
g
of
the
data,
s
houl
d
be
inv
e
nted
.
After a successful extraction, the images and related information are saved in a database in a structured format. This information can then support further actions such as decision making. The relevant outcomes of this extraction process are that the execution time is reduced and the image file names are reindexed.
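Storing extracted images with reindexed file names might look like the sketch below. The table layout, the `img_0001` naming pattern, the function name `reindex_and_store`, and the example URLs are assumptions made for illustration; the paper does not specify the schema or the naming scheme.

```python
import sqlite3
from pathlib import PurePosixPath
from urllib.parse import urlparse


def reindex_and_store(conn, records, prefix="img"):
    """Store image records under sequential file names such as img_0001.jpg."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "id INTEGER PRIMARY KEY, file_name TEXT, source_url TEXT, alt TEXT)"
    )
    # Continue numbering after any rows that are already stored.
    start = conn.execute("SELECT COUNT(*) FROM images").fetchone()[0] + 1
    for index, rec in enumerate(records, start=start):
        # Take the extension from the URL path, ignoring any query string.
        suffix = PurePosixPath(urlparse(rec["src"]).path).suffix or ".bin"
        new_name = f"{prefix}_{index:04d}{suffix}"
        conn.execute(
            "INSERT INTO images (id, file_name, source_url, alt) VALUES (?, ?, ?, ?)",
            (index, new_name, rec["src"], rec.get("alt", "")),
        )
    conn.commit()
    rows = conn.execute("SELECT file_name FROM images ORDER BY id")
    return [row[0] for row in rows]


# Hypothetical records, standing in for the output of an extraction step.
records = [
    {"src": "https://example.org/media/turtle.jpg?w=300", "alt": "Sea turtle"},
    {"src": "https://example.org/media/reef.png", "alt": "Coral reef"},
]
conn = sqlite3.connect(":memory:")
names = reindex_and_store(conn, records)
print(names)
```

Keeping the original URL next to the new file name lets later analysis trace each stored image back to its source page.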
In future work, we plan to extend this research to extraction from multiple deep websites. The performance of image extraction will influence the execution time of the process. The impact of the study for the nation and the community is the extraction of semi-structured data that can be used for managing and analyzing the characteristics of elements.
ACKNOWLEDGEMENTS
I sincerely thank all those who helped me in completing this work, especially Biasiswa Universiti Malaysia Terengganu (BUMT).
BIOGRAPHIES OF AUTHORS
Ily Amalina Ahmad Sabri received her Diploma in Information Technology in 2006 from PSMZA, Terengganu, and her Bachelor of Information Technology (Software Engineering), Master's degree, and Ph.D. in Computer Science from Universiti Malaysia Terengganu in 2009, 2014, and 2019 respectively. She is a Senior Lecturer in the Faculty of Ocean Engineering Technology and Informatics, Universiti Malaysia Terengganu. Her research interests include Web Mining, Data Extraction, Information Retrieval, Artificial Intelligence, and Decision Support Systems. Her current research projects are "M-Fly Counter Development of Auto-Counting Mobile Apps for Large Populations of Housefly for Pest Control and Monitoring Activities", funded by the PPRG 2021 scheme; "Kajian dan Pembangunan Perisian untuk Studio Al-Quran, UMT", funded by UMT; iMAKERS@UMT, funded by MOSTI; "Development and Implementation of DIETCARE: An Interactive Online Nutritional Database Management System to support Intelligent Client Monitoring", funded by TAPE-RG; and "An intelligent Tissue Dispenser System", funded by PPRG.
Mustafa Man is an Associate Professor in the School of Informatics and Applied Mathematics and a Deputy Director at the Research Management Innovation Centre (RMIC), UMT. He began his PhD studies in Computer Science at UTM in July 2009 and completed them in 2012. He received his Computer Science Diploma, Computer Science Degree, and Master's Degree from UPM. In 2012, he was awarded a "MIMOS Prestigious Award" for his PhD by MIMOS Berhad. His research focuses on the development of integration models for multiple types of databases, as well as Augmented Reality (AR), Android-based applications, and IT-related work across domain platforms.