Indonesian Journal of Electrical Engineering and Computer Science
Vol. 10, No. 3, June 2018, pp. 1234~1243
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v10.i3.pp1234-1243
Journal homepage: http://iaescore.com/journals/index.php/ijeecs

A Survey on Cleaning Dirty Data Using Machine Learning Paradigm for Big Data Analytics

Jesmeen M. Z. H.¹, J. Hossen², S. Sayeed³, C. K. Ho⁴, Tawsif K.⁵, Armanur Rahman⁶, E. M. H. Arif⁷
¹,²,⁵,⁶,⁷ Faculty of Engineering and Technology, Multimedia University, Melaka, 75450, Malaysia
³ Faculty of Information Science & Technology, Multimedia University, Melaka, 75450, Malaysia
⁴ Faculty of Computing and Informatics, Multimedia University, Melaka, 75450, Malaysia

Article Info

Article history:
Received Jan 15, 2018
Revised Mar 11, 2018
Accepted Mar 24, 2018

ABSTRACT
Big Data has recently become one of the most important new factors in business, and it calls for strategies to manage large volumes of structured, unstructured and semi-structured data. Analyzing data at this scale to extract meaning and handle uncertain outcomes is challenging. Almost all big data sets are dirty, i.e. they may contain inaccuracies, missing data, miscoding and other issues that weaken big data analytics. One of the biggest challenges in big data analytics is to discover and repair dirty data; failure to do so can lead to inaccurate analytics and unpredictable conclusions. Data cleaning is therefore an essential part of managing and analyzing data. This survey first examines the data quality problems that may occur in big data processing, to make clear why an organization requires data cleaning, and then the data quality criteria (the dimensions used to indicate data quality). Next, the cleaning tools available on the market are summarized, and the challenges of cleaning big data that arise from the nature of the data are discussed. Finally, machine learning algorithms can be used to analyze data, make predictions and clean data automatically.

Keywords: Big data, Big data analytics, Data cleaning, Dirty data, Machine learning

Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Jesmeen M. Z. H. & Dr. Jakir Hossen
Faculty of Engineering and Technology, Multimedia University, Melaka, 75450, Malaysia.
Email: jesmeen.online@gmail.com, jakir.hossen@mmu.edu.my

1. INTRODUCTION

In 2016, IBM estimated that around 2.5 quintillion bytes of data were being produced each day, and that 90% of all existing data had been created in the preceding two years alone [1]. This big data is typically generated by devices such as sensors, and as new technologies evolve the rate of data growth will likely accelerate further. Cisco, meanwhile, forecast that by 2020 the volume of worldwide IP traffic crossing the Internet and IP WAN networks may reach 2.3 ZB per year [2].

The bulky and heterogeneous nature of big data requires investigation using big data analytics. Big data analytics helps to discover hidden patterns, unknown relationships, current market trends, consumer preferences and other aspects of the data that can help institutes and companies make up-to-date, faster and better business decisions.

By now, most well-known companies have realized the need to implement big data analytics in their systems to deliver better products and services. Using big data capabilities, any company can improve its products and services and grow productivity by obtaining meaningful insights that move its work forward.

Different tools are available on the market to handle big data, but these tools are concerned with only a few issues [3]. They are not usually integrated with data quality management; consequently, the market for data quality tools is estimated to reach USD 1,376.7 million by 2022, up from USD 610.2 million in 2017, at a Compound Annual Growth Rate of 17.7%. The base year considered for this report is 2016 and the forecast period is 2017–2022 [4].
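
The growth figures quoted from [4] can be sanity-checked with the standard compound annual growth rate formula. The sketch below assumes a five-year span (2017 to 2022); the function name `cagr` is illustrative and does not come from the cited report.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.177 means 17.7%)."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

# Figures quoted in the text: USD 610.2 million (2017) to USD 1,376.7 million (2022).
rate = cagr(610.2, 1376.7, 2022 - 2017)
print(f"Implied CAGR: {rate:.1%}")
```

Under that assumption the implied rate rounds to the 17.7% stated in the text, so the three quoted numbers are mutually consistent.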

Using big data capabilities is not enough: an organization also needs to collect values that are free of mistakes, gaps and errors, yet this requirement is very often neglected. Data with such defects is usually known as dirty data, and cleaning it can be challenging for companies that want better results. Cleaning data manually requires experience, and humans often tend to make mistakes. Machine learning is currently adopted in different areas to process tasks automatically, for example in [5, 6]. Since machine learning can help complete tasks automatically, it is possible to clean dirty data by training classification models.

2. BIG DATA ANALYTICS

The general procedure for obtaining insights from Big Data can be broken down into five main stages [7]–[9], as shown in Figure 1.

Figure 1. Processes for extracting insights from big data

Data Acquisition: Timeliness is one of the important requirements when loading data [10]. The fundamental characteristics of Big Data, with its exponential rate of growth, raise exceptional issues in Big Data engineering, such as data acquisition and storage [7].

Data Mining and Cleansing: The most essential stage of processing big data is to implement a method that extracts and mines the necessary data from the loaded unstructured Big Data and arranges it coherently in a standard, organized form that is easy to recognize. The data cleaning process helps to clean dirty data.

Data Aggregation and Integration: The cleaned data needs to be aggregated for processing, by gathering it and expressing it in summary form [11], [12]. This is followed by data integration, which organizes data from disparate sources by combining technical and business methods, to obtain meaningful and valuable results [12].

Data Analysis and Modelling: From the viewpoint of Big Data, the goal is to produce business value through the analysis of data, which may vary according to the technique and the form of the data. Meaningful reports are constructed and investigated to help the business make better and faster decisions.

Data Interpretation: Data is presented in a form that users can understand, i.e. the results of analysis and modelling are presented so that decisions can be made by interpreting the outcomes and extracting knowledge. Data interpretation queries are grouped together and directed to the same table, diagram, graph or other data presentation option.
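
The five stages can be pictured as a chain of small functions, each consuming the output of the previous one. The following is only a toy sketch: the record layout, the hard-coded data and all function names are assumptions made for illustration, and a real big data pipeline would run on a distributed platform rather than in-memory lists.

```python
from collections import defaultdict
from statistics import mean

def acquire():
    """Stage 1: load raw records (hard-coded here in place of a real source)."""
    return [
        {"region": "north", "sales": "120"},
        {"region": "north", "sales": ""},      # missing value
        {"region": "south", "sales": "95"},
        {"region": "south", "sales": "abc"},   # not a number
        {"region": "south", "sales": "105"},
    ]

def clean(records):
    """Stage 2: keep only records whose sales field parses as a number."""
    cleaned = []
    for rec in records:
        try:
            cleaned.append({"region": rec["region"], "sales": float(rec["sales"])})
        except ValueError:
            continue  # drop dirty records; a real system might repair them instead
    return cleaned

def aggregate(records):
    """Stage 3: group the cleaned values by region."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["region"]].append(rec["sales"])
    return dict(groups)

def analyze(groups):
    """Stage 4: compute a simple summary statistic per group."""
    return {region: mean(values) for region, values in groups.items()}

def interpret(summary):
    """Stage 5: present the results in a readable form."""
    return [f"{region}: average sales {avg:.1f}" for region, avg in sorted(summary.items())]

report = interpret(analyze(aggregate(clean(acquire()))))
for line in report:
    print(line)
```

Note that the cleaning stage sits before aggregation: the two dirty records are removed before any average is computed, so they cannot distort the reported figures.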

3. DATA QUALITY PROBLEMS

The data cleaning process gets more complex when data comes from heterogeneous sources. Here, data quality problems have to be solved by data cleaning and data transformation. Despite the various viewpoints on the effect of data quality, in the end all such problems are likely to result in economic costs for organizations. Surveys of real cases report economic costs due to dirty data: a 2014 survey found that bad data costs organizations around $13.3 million annually and costs the US economy 3 trillion dollars per year. Another organization, the U.S. Postal Service, recognizes the cost of bad data: in 2013, an estimated 6.8 billion pieces of mail could not be delivered to the stated address, which racked up $1.5 billion in handling costs [13].

By some estimates, the issue of dirty data in organizations and companies has already reached epidemic proportions. The issue is equally prevalent, and potentially even more alarming, in health care and other organizations [14].

For instance, in the telecommunication industry, dirty data has numerous costs. First and foremost, Experian estimates an average loss of 12% in business due to incorrect records, which cause reduced productivity, wasted resources and, significantly, missed opportunities for cross-channel marketing. The Experian investigation also highlights that approximately one-third of respondents think they waste 10% or more of their marketing budget because of outcomes obtained from inaccurate data. Experian further reports that 25% of survey participants said their organization does not currently measure the accuracy of its data; this figure grows to 33% in telecoms and utilities companies, and reaches 36% in organizations such as governments [15].

These measurements concern the inside of organizations; looking at external matters such as marketing, marketers struggle with dirty data as well. According to BizReport.com, “…marketers are generating a large portion of poor-quality leads, including those with improper formatting and even inaccuracies. Bad prospect information can have negative consequences, including wasted media investment, squandered resources, and poor customer experience, which marketers simply can’t afford.” [16]

In medical cases, errors can kill patients or cause long-lasting harm to a patient's health. In 1999 the Institute of Medicine reported [17] estimates that, for instance, at least 44,000 to 98,000 people lose their lives each year due to medical errors in hospitals alone, causing $17 to $29 billion annually in healthcare costs. Beyond health, dirty data can also raise privacy issues for patients.

4. DATA QUALITY CRITERIA

Data quality is generally described as the capability of data to satisfy stated and implied needs when used under specified conditions [18]. Accuracy, completeness and consistency are the most popular dimensions used to address data quality [19], [20], alongside other dimensions such as accessibility, consistent representation, timeliness, understandability and relevancy [19].

Moreover, data quality is a combination of data content and form. Data content must contain accurate information, and data form requires that data be collected and visualized in a way that makes the data usable. Content and form are both significant considerations for reducing data mistakes, as they show that the task of repairing dirty data goes beyond simply providing correct data.

Likewise, when developing a scheme to improve data quality it is essential to identify the primary causes of dirty data. The causes are categorized into systematic and random errors. The basic sources of systematic errors include programming mistakes, wrong definitions of data types, incorrectly or badly defined rules, violations of data collection rules, and poor training. The sources of random errors include keying errors, unreadable script, data transcription problems, hardware failure or corruption, and mistakes or intentionally misrepresented declarations on the part of users specifying key data.

Human involvement in data entry usually results in errors; these can be typos, missing types, literal values, heterogeneous ontologies (i.e. different natures of data), outdated values or violations of integrity constraints. As an example, Figure 2 shows a few data quality problems that can be identified in the Wireless Service Facility Permits (City of San Francisco) database.

Figure 2. Data quality problems identified in an open data set

Therefore, the most common dimensions of dirty data, including data duplication, are:

Inaccurate data refers to any field that contains wrong values. A correct data value gives an accurate, meaningful, consistent and unambiguous representation.

Incomplete data is produced by data sets that are missing values. This type of data is considered concealed when the number of values in a set is known but the values themselves are unknown, and condensed when there are values in a set that are excluded.

Inconsistent data arises from data redundancy, i.e. the same data value is stored in different files, possibly in different formats.

Duplicate data consists of entries for the same data that a system user has added multiple times.
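
The four dimensions above can each be turned into a simple check. The sketch below is a minimal illustration on an invented four-row table; the field names, the plausible age range and the set of allowed country codes are all assumptions, not rules taken from the surveyed literature.

```python
records = [
    {"id": 1, "name": "Alice", "age": 34,   "country": "MY"},
    {"id": 2, "name": "Bob",   "age": -5,   "country": "Malaysia"},  # inaccurate age, inconsistent country
    {"id": 3, "name": "Chen",  "age": None, "country": "MY"},        # incomplete
    {"id": 4, "name": "Alice", "age": 34,   "country": "MY"},        # duplicate of id 1
]

ALLOWED_COUNTRY_CODES = {"MY", "SG", "ID"}  # assumed reference format

def find_inaccurate(rows):
    """Values that are present but outside a plausible range."""
    return [r["id"] for r in rows if r["age"] is not None and not 0 <= r["age"] <= 120]

def find_incomplete(rows):
    """Rows with at least one missing (None or empty) value."""
    return [r["id"] for r in rows if any(v is None or v == "" for v in r.values())]

def find_inconsistent(rows):
    """Rows whose country is not written in the agreed code format."""
    return [r["id"] for r in rows if r["country"] not in ALLOWED_COUNTRY_CODES]

def find_duplicates(rows):
    """Rows whose content (ignoring id) has already been seen."""
    seen, dupes = set(), []
    for r in rows:
        key = (r["name"], r["age"], r["country"])
        if key in seen:
            dupes.append(r["id"])
        seen.add(key)
    return dupes

print(find_inaccurate(records), find_incomplete(records),
      find_inconsistent(records), find_duplicates(records))
```

A single record can fall under several dimensions at once (record 2 here is both inaccurate and inconsistent), which is one reason the checks are kept separate.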

5. CLEANING TOOLS

Different vendors provide data cleansing solutions, including Talend, IBM, SAS, Oracle and Lavastorm Analytics. Some free tools also work on data transformation [21], [22], such as OpenRefine, plyr and reshape2, although it is uncertain whether they can handle Big Data. Another well-known class of tools is ETL tools, which provide complex data conversion techniques by merging and repairing data [23].

A summary of some available commercialized tools for managing data quality is presented in Table 1. The "Vendor" field names the company offering the tools, and "Product" names the tools offered by the vendor for managing data quality. The "Website" column gives the website of the company. The "Like(s)" and "Dislike(s)" are taken from customer comments obtained from different websites, such as [24], [25].

Table 1. Comparison of Commercialized Data Quality Management Tools

| Vendor | Product | Website | Like(s) | Dislike(s) |
|---|---|---|---|---|
| Trifacta | Trifacta Data Wrangler | trifacta.com | Intelligently recognizes imported data file and provides prescriptive methods | Formula based |
| Informatica | IDQ; Data Quality Standard Edition; Address Validation Services and StrikeIron; Data Quality Advanced Edition; Data Quality Governance Edition | informatica.com | Easy to interact with the provided interface to identify the functions; ease of data migration; completely on cloud | It requires SQL knowledge |
| SAP | Information Steward; Data Quality Management; SAP Data Services | go.sap.com | Ability to recognise organization's needs | No option to control source code and integrate |
| Melissa Data | Data Quality Components for SSIS; Personator; Global Data Quality Suite; Global MatchUp | melissadata.com | APIs are easy and straightforward; able to use phonetics for address corrections | Lacks good documentation; performance is slow for real-time queries |
| Melissa Data | Melissa Listware | | Append contacts; standardize address; simple interface | No MacOS integration |
| BDNA | BDNA Technopedia; BDNA Normalize; Technopedia | bdna.com | Good coverage of vendors and products; proactive in keeping their packs up to date; adopt maturing technologies with manageable risk | Need to point out stale data, it will not refresh for months to years |
| SAS | Data Management; Data Quality Desktop | sas.com | The learning curve is manageable | Needs training and education to use; no command window |
| Experian | Capture, Clean and Enhance data quality tools; Experian Pandora; Experian Data Quality Platform | experian.com | Low cost and flexibility of use with various file formats | |
| Pitney Bowes | Spectrum Technology Platform; Code-1 Plus | pitneybowes.com | User interface is quite friendly and attractive; can create APIs without programming | Hard to integrate and handle large amounts of data |
| CRMfusion | DemandTools; CRMfusion PeopleImport | crmfusion.com | Manage large scale data in real time; standardize, cleanse and overall manipulate data | Unable to enter a custom SOQL (e.g. with a subquery) as the basis for the data pulled down |
| Oracle | Oracle Enterprise Data Quality | oracle.com | Profiling customers easily; great for batch-oriented processing | Emphasis on ETL instead of data context and management; not good for real-time processing |
| IBM | Infosphere QualityStage; Infosphere Information Analyzer; InfoSphere Information Server | ibm.com | The lineage integrates metadata from Cognos, Datastage, QualityStage, and Oracle Metadata | |
| Addressy | Addressy | addressy.com | Only a simple JavaScript snippet is required on the page; the rest of the configuration can be done via the control panel | |

6. BIG DATA ANALYTICS DATA CLEANING CHALLENGES

Generally, the data gathered will not be in a form ready for analysis. For instance, consider data obtained from a telecommunication storage system, consisting of feedback obtained from different agents and structured data from routers. It is challenging to analyze such unstructured data. An extraction procedure that retrieves the necessary data from the various sources and presents it in a structured arrangement suitable for analysis is required.

Data cleaning is an essential part of data analysis, and a challenging one [26]. Researchers from the database research community have described several challenges in obtaining useful data from big data [27], [28]. Cleaning is challenging in every data analysis, but with the variety and volume of big data it becomes even more pronounced. Data quality needs to be assured for accurate and correct data visualization. To deal with this issue, organizations need to overcome some common challenges:

6.1. Scalability

Cleaning techniques need to scale with the quickly increasing size of Big Data, which is quite challenging. Existing approaches include blocking for duplicate detection [29], [30], identification and linkage for data cleaning [30], cleaning data using sampling [31], and distributed data cleaning [32].
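
One common way to make duplicate detection scale is to group records under a cheap key and compare records only within each group (often called blocking), instead of comparing every pair. The sketch below is a hedged illustration of that idea, not the method of [29] or [30]; the blocking key (surname initial plus postcode), the similarity threshold and the sample records are assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def blocking_key(record):
    """Cheap key: first letter of the surname plus the postcode (an assumed choice)."""
    return (record["surname"][:1].lower(), record["postcode"])

def find_candidate_duplicates(records, threshold=0.85):
    """Compare records only within the same block instead of across all pairs."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            similarity = SequenceMatcher(None, a["surname"].lower(), b["surname"].lower()).ratio()
            if similarity >= threshold:
                matches.append((a["id"], b["id"]))
    return matches

customers = [
    {"id": 1, "surname": "Rahman",  "postcode": "75450"},
    {"id": 2, "surname": "Rahmann", "postcode": "75450"},
    {"id": 3, "surname": "Hossen",  "postcode": "75450"},
    {"id": 4, "surname": "Rahman",  "postcode": "50000"},
]
print(find_candidate_duplicates(customers))
```

The trade-off is visible in the sample: records 1 and 4 share a surname but land in different blocks, so they are never compared. A poorly chosen key can therefore hide true duplicates in exchange for fewer comparisons.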

6.2. Semi Structured and Unstructured Data

Big data is usually a collection of a variety of data, which may include semi-structured data, e.g. in XML/JSON, and unstructured data, e.g. in word-processing files, e-mail and text fields in databases. Data quality problems in semi-structured and unstructured data remain largely unexplored [28, 33].
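
For semi-structured input such as JSON, even basic quality checks differ from the relational case, because a record may fail to parse, omit fields or carry values of the wrong type. A minimal sketch follows, assuming a hypothetical two-field schema (`customer_id`, `email`) chosen only for illustration.

```python
import json

REQUIRED_FIELDS = {"customer_id": int, "email": str}  # assumed schema

def check_record(raw_line):
    """Return a list of quality problems found in one JSON-encoded record."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(record, dict):
        return ["not a JSON object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record or record[field] is None:
            problems.append(f"missing {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} has type {type(record[field]).__name__}")
    return problems

lines = [
    '{"customer_id": 7, "email": "a@example.com"}',
    '{"customer_id": "7", "email": null}',
    '{"customer_id": 8, "email": ',
]
for line in lines:
    print(check_record(line))
```

The first line passes, the second has a mistyped identifier and a missing e-mail, and the third is truncated and cannot be parsed at all.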

6.3. User Engagement

Much research work has involved humans in executing the deduplication process on a data set, for instance through active learning. However, including human experts in order to clean data [30], such as obtaining user responses to determine rules for data quality, is still to be explored.

6.4. Raising Privacy and Security Interests

When cleaning data, the most common task is to observe and examine the complete set of raw data values, access to which may be restricted in some domains, such as telecommunication, medicine and finance; this is a significant challenge [9]. For example, telecommunication data such as Internet connection login session logs collected over an extended period of time can reveal an individual's location and behavior, as shown in Figure 3.

Figure 3. Information gathered from running analytics on data and files to create Tawsif's profile

6.5. Computational Complication for Data Streaming

Collecting huge amounts of data from a variety of sensors and user devices is always an interesting issue. Gartner, Inc. forecast in 2017 that 8.4 billion connected things would be in use worldwide in 2017, up 31% from 2016, and that the number would reach 20.4 billion by 2020 [34]. For this reason, data cleansing operations may require huge processing power.
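
When readings arrive as a stream, cleaning has to work in one pass with bounded memory. One possible approach, shown here as an assumption-laden sketch rather than a recommended design, is to keep running statistics with Welford's online algorithm and skip readings that are missing or lie far from the running mean. The `z_limit` and `warmup` values are arbitrary illustrative choices.

```python
import math

class RunningStats:
    """Welford's online algorithm for mean and variance in O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std(self):
        return math.sqrt(self._m2 / (self.count - 1)) if self.count > 1 else 0.0

def clean_stream(readings, z_limit=3.0, warmup=5):
    """Yield readings that pass the checks; skip missing values and extreme outliers."""
    stats = RunningStats()
    for value in readings:
        if value is None:
            continue
        if stats.count >= warmup and stats.std > 0:
            if abs(value - stats.mean) / stats.std > z_limit:
                continue  # treated as a glitch and not added to the statistics
        stats.update(value)
        yield value

sensor = [20.1, 20.3, None, 19.9, 20.0, 20.2, 500.0, 20.1]
print(list(clean_stream(sensor)))
```

Because rejected readings are not fed back into the statistics, a single spike does not inflate the standard deviation and mask later spikes. The cost is that a genuine, sudden level shift would also be rejected until the thresholds are revisited.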

6.6. Machine Learning and Other Algorithms

Lastly, big data analytics is known to still be in its early stages of development as a technical discipline. Hence many machine learning algorithms are unable to scale to big data sets or to tolerate the noise and gaps produced by the real world [35]-[38]. Research is still ongoing to improve these algorithms so that they are better suited to real-world conditions, in which data cleaning may involve millions or trillions of components.

6.7. Manually

Currently, despite the benefit of histograms, conversion tables and rules with algorithms, human intervention is nevertheless required to recognize and repair the data [30], [39].
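
A conversion table plus a small set of rules can automate the routine part of this work while leaving the doubtful cases to a person. The sketch below is illustrative only: the table entries, the single age rule and the record layout are hypothetical.

```python
# Hypothetical conversion table mapping known variants to a canonical value.
COUNTRY_CONVERSIONS = {"malaysia": "MY", "my": "MY", "m'sia": "MY",
                       "singapore": "SG", "sg": "SG"}

def rule_age_in_range(record):
    """Rule: age must be present and between 0 and 120."""
    return record.get("age") is not None and 0 <= record["age"] <= 120

RULES = {"age must be between 0 and 120": rule_age_in_range}

def clean_record(record):
    """Apply the conversion table, then list rule violations for human review."""
    cleaned = dict(record)
    raw_country = str(record.get("country", "")).strip().lower()
    cleaned["country"] = COUNTRY_CONVERSIONS.get(raw_country)  # None when unknown
    violations = [name for name, rule in RULES.items() if not rule(cleaned)]
    if cleaned["country"] is None:
        violations.append(f"unrecognized country: {record.get('country')!r}")
    return cleaned, violations

print(clean_record({"age": 30, "country": " Malaysia "}))
print(clean_record({"age": 150, "country": "Mars"}))
```

The first record is repaired automatically; the second is returned with two violations, which is the point where a human reviewer would step in.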

7. MACHINE LEARNING PARADIGMS FOR BIG DATA CLEANING

Currently, different types of learning paradigms are available in machine learning, but not all types are applicable to every field. For instance, [40] presented a cleaning approach using data mining and SVM (a machine learning paradigm).

Machine learning techniques can be used to teach the system to complete the task with minimum human interaction. This may reduce the time and resources required to analyze dirty data and transform it into usable clean data. Machine learning techniques make a system intelligent by giving it the capability to learn. Data can be classified in three ways: by unsupervised, supervised and semi-supervised methods. The selection of algorithms must depend on the size, quality and nature of the data. Some common learning algorithms that can be used to clean data are shown in Figure 4.

Figure 4. Machine learning algorithms (a taxonomy grouped into supervised, unsupervised, semi-supervised, reinforcement and deep learning, with branches for artificial neural networks, Bayesian statistics, decision trees, association rule learning, hierarchical clustering, cluster analysis and anomaly detection)

7.1. Deep Learning

This technique works on representations of the data, rather than on hand-crafted data features, to perform data cleaning. Deep learning algorithms transform data into abstract representations that allow features to be learned. Hence, there is no requirement for feature extraction, as the features are learned directly from the data. Given the nature of Big Data, the ability to skip the feature extraction step is of great value.

7.2. Naïve Bayes Classifier Algorithm

This algorithm classifies instances under the assumption that, when an instance contains several attributes, the attributes are conditionally independent given the class label. It is suitable for moderate or large training data sets.
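
To show how such a classifier could support cleaning, the sketch below trains a categorical Naive Bayes model on complete records and uses it to predict a missing categorical field. It is a minimal from-scratch illustration with add-one (Laplace) smoothing; the attributes, labels and training rows are invented for the example.

```python
import math
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    """Naive Bayes over categorical attributes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[(attribute_index, label)][value] -> number of occurrences
        self.value_counts = defaultdict(Counter)
        self.attribute_values = defaultdict(set)
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(i, label)][value] += 1
                self.attribute_values[i].add(value)
        return self

    def predict(self, row):
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / self.total)  # log prior
            for i, value in enumerate(row):
                numerator = self.value_counts[(i, label)][value] + 1
                denominator = count + len(self.attribute_values[i])
                score += math.log(numerator / denominator)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data: (plan, region) -> customer segment; all values are invented.
rows = [("prepaid", "urban"), ("prepaid", "rural"), ("postpaid", "urban"),
        ("postpaid", "urban"), ("prepaid", "rural"), ("postpaid", "rural")]
labels = ["consumer", "consumer", "business", "business", "consumer", "business"]

model = CategoricalNaiveBayes().fit(rows, labels)
print(model.predict(("postpaid", "urban")))  # candidate fill-in for a missing segment
```

Summing log probabilities per attribute is exactly where the conditional independence assumption enters: each attribute contributes its own term, with no interaction between them.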

7.3. K-Means Clustering Machine Learning Algorithm

K-Means produces tighter clusters than hierarchical clustering when the clusters are globular. With a large number of variables, K-Means clustering also runs faster than hierarchical clustering.
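
For cleaning purposes, one possible use of K-Means is to cluster records and then look at how far each record lies from its cluster centre, since unusually distant records are candidates for review. The following is a plain-Python sketch of Lloyd's algorithm with a deterministic seeding (the first k points), which is an assumption made for reproducibility; practical implementations usually use randomized or k-means++ initialization.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignment

points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.8, 1.1),
          (9.1, 8.8), (8.9, 9.2), (5.0, 20.0)]
centroids, assignment = kmeans(points, k=2)

# Distance of each point to its own centroid; large values deserve a closer look.
distances = [math.dist(p, centroids[a]) for p, a in zip(points, assignment)]
suspect_index = max(range(len(points)), key=lambda i: distances[i])
print(assignment, suspect_index)
```

In this toy data the last point is assigned to the second cluster but sits much farther from the centroid than its neighbours, so it is the one flagged. Note that such a point also drags the centroid towards itself, which is a known weakness of K-Means on noisy data.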

7.4. Apriori Algorithm

The Apriori algorithm is easy to implement and can be parallelized easily. It relies on properties of large (frequent) itemsets.
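
The key itemset property is that every subset of a frequent itemset must itself be frequent, which lets the algorithm prune candidates level by level. The sketch below is a compact, unoptimized illustration; encoding each record as a set of `field=value` items, and the sample records themselves, are assumptions made to connect the idea to data cleaning.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for all itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    frequent = {}
    size = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: build (size + 1)-item candidates from the frequent itemsets.
        size += 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size - 1)-item subset of a candidate must be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, size - 1))}
    return frequent

baskets = [
    {"city=Melaka", "postcode=75450", "state=Melaka"},
    {"city=Melaka", "postcode=75450", "state=Melaka"},
    {"city=Melaka", "postcode=75450", "state=Johor"},   # likely a dirty record
    {"city=Kuala Lumpur", "postcode=50000", "state=WP"},
]
frequent = apriori(baskets, min_support=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Frequent combinations such as city, postcode and state appearing together suggest a rule that most records obey; the third record breaks that pattern, which makes it a candidate for repair.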

7.5. Random Forest Machine Learning Algorithms

Random Forest is robust to noise, which makes it efficient and versatile for classification and regression tasks. It is easy to decide which parameters to use, since it is not sensitive to the parameters required to run it. The trees can be grown in parallel, and the algorithm is efficient for large databases, with higher classification accuracy.

8. CONCLUSION

In recent years, big data processing has probably brought the greatest revolution in computing. Cleaning massive volumes of data lies at the heart of big data analytics processing in all domains, enabling better data investigation.
In
this
pa
p
er,
a
n
over
view
is
i
niti
at
ed
to
iden
ti
fy
the
po
te
nti
al
of
data
cl
ea
ning
in
big
dat
a
analy
ti
cs
in
the
proces
s
of
gathe
rin
g,
a
rr
a
ng
i
ng
a
nd
proces
sin
g
inf
orm
at
ion
.
It
is
im
po
rtant
to
un
der
sta
nd
data
qu
al
it
y
crit
eria
of
dirt
y
data
to
a
ble
to
cl
ea
n
data
set
s
wi
thout
f
ai
lure.
A
com
par
is
on
of
co
m
m
ercialize
d
too
ls
is
pr
ese
nted
by
obta
inin
g
com
m
ents
from
diff
eren
t
cu
stom
ers.
Most
of
t
he
too
ls
m
os
tl
y
con
ce
r
ns
to
or
gan
iz
e
data
set
s
a
nd
cl
ean
m
essy
data
a
nd
very
m
et
ho
ds
us
es
m
achine
le
arn
i
ng.
But
th
ey
didn’t
give
m
uch
i
m
po
rtance
to
big
data
ch
ara
ct
erist
ic
s,
wh
ic
h
m
a
y
le
ad
to
big
chall
e
ng
e
wh
il
e
cl
eanin
g
data.
There
ar
e
m
any
avail
able
data
rep
ai
ri
ng
al
gor
it
h
m
s,
sti
ll
it
r
equ
i
red
hu
m
an
exp
e
rt
to
ta
ke
intel
li
gen
t
dec
isi
on
if
the
cl
e
anin
g
process
is
c
orr
ect
or
no
t.
Machine learning algorithms will probably replace most jobs in the world. With the fast evolution of big data and the accessibility of programming tools such as Python and R, machine learning is increasingly becoming mainstream for data scientists. Machine learning applications are highly automated and self-modifying, and they continue to improve over time with minimal human intervention as they learn from more data. This survey has prompted us to conduct additional real-world evaluations and to develop a modified big data analytics framework by changing the structure of the cleaning phase to gain a clearer view of the data. It is expected to produce a new plan for the structure of data quality techniques that can be more efficient in big data analytics.
BIOGRAPHIES OF AUTHORS
Siti Nadiah Che Azmi is currently pursuing her Master of Philosophy degree in Electrical Engineering at the Faculty of Electrical Engineering, Universiti Teknologi Malaysia, Malaysia. She completed her Bachelor of Applied Science (Electronics and Instrumentation Physics) at Universiti Malaysia Terengganu, Malaysia, in 2013.
Dr. Jakir Hossen graduated in Mechanical Engineering from the Dhaka University of Engineering and Technology (1997), and received a Masters in Communication and Network Engineering (2003) and a PhD in Smart Technology and Robotic Engineering (2012) from Universiti Putra Malaysia. He is currently a Senior Lecturer at the Faculty of Engineering and Technology, Multimedia University, Malaysia. His research interests are in the area of Artificial Intelligence (Fuzzy Logic, Neural Network), Inference Systems, Pattern Classification, Mobile Robot Navigation and Intelligent Control.
Dr. Md Shohel Sayeed obtained the B.Sc. Ag. (Hons) from Bangladesh Agricultural University. He completed his M.Sc. (IT) at Universiti Kebangsaan Malaysia (UKM) and his Ph.D. in Engineering at Multimedia University, Malaysia. At present, he holds the position of Associate Professor at the Faculty of Information Science and Technology, Multimedia University, Malaysia. His main research interests are Biometrics, Pattern Recognition, Signal and Image Processing, Big Data and Data Mining.
Dr. Ho Chin Kuan obtained the B.Sc. (Hons) in Computer Science with Electronics Engineering from University College London, UK. Subsequently, he completed his M.Sc. (IT) and Ph.D. in Information Technology at Multimedia University, Malaysia. At present, he is a Professor and Dean at the Faculty of Computing and Informatics, Multimedia University, Malaysia. His main research interests are Natural Computing, Combinatorial Optimization and Data Mining.
Chy. Mohammed Tawsif K. is currently a postgraduate student in the Engineering Program at Multimedia University (MMU), researching Artificial Intelligence, Event Processing and Big Data. He received his bachelor's degree in Computer Science and Engineering from International Islamic University Chittagong, Bangladesh.
Md. Armanur Rahman received the B.Sc. degree in computer science and engineering from Asian University of Bangladesh (AUB) in 2010. He is currently working toward the MEngSc degree at Multimedia University (MMU), Malaysia. His research interests include performance optimization of big data systems, data mining, machine learning and image processing.
Md Arif Hossain is currently a postgraduate student in Engineering, specializing in Solar Energy Technology, at Multimedia University (MMU), Malaysia. He completed his bachelor's degree in Electrical & Electronic Engineering at Bangladesh University, Dhaka, Bangladesh. His research interests are Electrical Engineering, Renewable Energy, Wind Energy, Controllers, Fuzzy Logic, Neural Networks and Intelligent Systems.