TELKOMNIKA, Vol. 16, No. 2, April 2018, pp. 834-842
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/telkomnika.v16.i2.7669
Data Cleaning Service for Data Warehouse: An Experimental Comparative Study on Local Data
Arif Bramantoro
Faculty of Computing and Information Technology in Rabigh, King Abdulaziz University, Jeddah, Saudi Arabia
e-mail: asoegihad@kau.edu.sa
Abstract

A data warehouse is a collective entity of data from various data sources. Data are prone to several complications and irregularities in a data warehouse. A data cleaning service is a nontrivial activity to ensure data quality. A data cleaning service involves identifying errors, removing them, and improving the quality of the data. One of the common methods is duplicate elimination. This research focuses on the service of duplicate elimination on local data. It initially surveys data quality, focusing on quality problems, cleaning methodology, involved stages, and services within the data warehouse environment. It also provides a comparison through some experiments on local data with different cases, such as different spellings of different pronunciations, misspellings, name abbreviations, honorific prefixes, common nicknames, split names, and exact matches. All services are evaluated based on the proposed quality of service metrics, such as performance, capability to process a number of records, platform support, data heterogeneity, and price, so that in the future these services are reliable enough to handle big data in a data warehouse.
Keywords: Data Cleaning Service, Data Warehouse, Data Quality, Local Data

Copyright © 2018 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
A data warehouse is a relational database for querying and analysis by further processing. It is obtained from several transactions from other sources. An integrated data warehouse is an integration of files, sources and other records. Several services are used to ensure good data, such as data cleaning and data integration services within an enterprise. A subject-oriented data warehouse is a subject-centric model involving several subjects, such as vendor, product, sales and customer [1]. A good data warehouse must focus on proper analysis and cleaning of data rather than on daily service transactions and operations. This kind of model is required by most enterprises. The model must be simple and related to the data cleaning objective. It should also avoid data that are not required for transaction and decision-making operations. The nonvolatile nature of data is important for these operations. Data should be kept physically separate from the application. This separation helps data recovery and other time-consuming mechanisms, such as loading and retrieval of data [2].
Time variance is the period of time involved in data storage in the data warehouse; it is the element of time. The decisions taken during the process are very important; therefore, the generated trend reports are significant elements in data warehousing. A proper decision support system works successfully with such trend reports.
There are several commercial applications, such as customer relationship and business applications, that utilize data cleaning services. A proper scheme is involved during the development of a data warehouse. The queries and analysis are completed in the design stage. Meaningful access to relevant data is required together with the generated values. The extraction of the source is important; therefore, it must be very clean, without any unrelated sources. The provided service is data cleaning in any big enterprise. The input data are rechecked and tested before they are allocated to a specific data warehouse. The loaded data are separated from the technical specification and process [3].
Received October 25, 2017; Revised February 13, 2018; Accepted March 14, 2018
Several automatic executions are also performed to eliminate errors in the data. Identifying incomplete data is a hindrance to the processing of data; it makes the corrections more complicated. A service called backflushing is used to recheck the data cleaning frequently. Installation of data occurs in the first instance for the model from other sources. A monitoring service is used for the recovery of data at different levels, from huge to small quantities. The amount of load is also related to the process in the warehouse. Hence, cautious steps are noteworthy to process the data smoothly.
ETL is the process of Extract, Transform, and Load of data. It means that the extraction of relevant data is followed by its transformation and, consequently, the loading of the data into the warehouse. Extract is a method of data extraction that occurs in the data warehouse from the allocated resources. The consolidation of these resources also takes place in a separate system that is allocated for each level of processing. This step of extraction takes the data into another level, called transforming [4].

Transform is a mechanism in which the extracted data are converted from their previous form and placed in the data warehouse without any errors in the data. The source of data needs a proper manipulation of all methods. It follows a set of functions to extract data into the warehouse without any modification to the existing data. The technical and other requirements are validated to meet the requirements. There are several transformations involved in the process, such as selecting only particular information and assigning it specific functions. Coding the data with values is a concrete service of data cleaning that occurs automatically. Another form of data cleaning service is to encode the result into a new value and to combine the data values of two different methods. The form of data can be a simple or a complex method. The path of data may fail or succeed; both cases are involved in handling data in a specific program. For example, the model can be a translated code in extracted data.

Load is a process of data handling that is important together with the targeting range of information. Some data can be overwritten with other non-updated or updated data. The selection of the design is also important, together with a proper understanding of the available choices related to time and business requirements. The complex model of the system is to allow several changes to be updated and uploaded. The overall quality of data in the data warehouse environment is validated by utilizing the ETL mechanism [5].
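As an illustration only, and not the implementation evaluated in this paper, the Extract, Transform, and Load steps described above can be sketched as three small functions. The source rows, cleaning rules, and the `customer` table name are hypothetical examples.

```python
import sqlite3

def extract(rows):
    """Extract: pull raw records from a source (here, an in-memory list)."""
    return list(rows)

def transform(records):
    """Transform: trim whitespace, normalise case, drop rows missing a name."""
    cleaned = []
    for name, city in records:
        name = " ".join(name.split()).title()  # collapse spaces, Title Case
        if name:                               # reject rows with no name
            cleaned.append((name, city.strip().upper()))
    return cleaned

def load(records, conn):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS customer (name TEXT, city TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", records)
    conn.commit()

source = [("  ahmed  ali ", "jeddah"), ("SARA  omar", " riyadh "), ("", "mecca")]
conn = sqlite3.connect(":memory:")
load(transform(extract(source)), conn)
print(conn.execute("SELECT * FROM customer").fetchall())
# → [('Ahmed Ali', 'JEDDAH'), ('Sara Omar', 'RIYADH')]
```

The point of the sketch is the ordering: transformation happens between extraction and loading, so only cleaned rows ever reach the warehouse table.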
The objective of this paper is to identify the causes of data quality problems. A particular methodology and experiments are adopted to address the problem. It is expected that this contributes to better data quality in the data warehouse. Data cleaning and duplicate elimination services are the appropriate methods to improve data quality. Experiments are conducted to provide results and comparisons between de-duplication services. The research approach as presented here is novel due to the level of implementation, utilizing a service-oriented approach as evident support for the conducted survey. In this research, only the data cleansing service is considered as the main task. The rest of the data quality services, such as completeness and historical reputation services, remain as future work to compose more services in a service-oriented system.
2. Data Quality in Data Warehouse
Data warehousing is a promising industry for several government organizations and private institutions, involving confidential data storage with regard to internal security. With the enormous amount of data, the responsibility of the organization becomes critical when it comes to security concerns [6]. The assurance of data quality is the primary objective of any management level. There is an increased potential for data quality problems and irregularities. A data warehouse is adopted by an organization to improve the relationship between customers, clients and management. Thus, improving the efficiency of the entire organization is required.
Data quality is defined as the measured performance or the loss of data in an organization [2]. The purpose of data quality measurement is to identify the missing data in the system. The quality of data attained for the data warehouse model assures the inputs on the client side. However, one user is different from another user. The data must be simple, consistent and fully understandable. The abundance of data increases the burden on the system side. The quality of data is critical, together with the identification of irregularities. The key quality of data and its dimension metrics are important to understand effective quality improvement.
Data quality is important due to the use of the data warehouse system. Data quality is measured in each phase of operations. Metrics are selected to ensure the measurement and analysis of data quality. The selection of metrics is critical to the final result, which directly affects the customer relationship. Quantifying data is important to save costs and improve market standards in a competitive economy [7]. In this paper, data quality and quality of service metrics are combined to improve the confidence of the data quality process.
With increasing technology and enormous data inputs in industry, the authorities need to improve the quality of data in the enterprise. There are several problems faced by enterprises in order to maintain and sustain their quality of service in delivering a project. The types of data are classified into intrinsic, contextual, representative and accessible. Performance is the factor of quality standard in any enterprise trademark. The addition of data must be static rather than dynamic in order to efficiently avoid irregularities while monitoring the process of quality standard improvements. The consumer must be carefully considered when there are data sent by the client to the enterprise [8]. A common data quality framework includes a loop of activity in weighting its cost and benefit, as illustrated in Figure 1.

Figure 1. Data Quality Process Loop
3. Classification of Data Quality
A modern data quality improvement approach requires a real-time scenario with the preference to avoid operational and analytical models [9]. The correctness and assurance of data quality are measured during and after the improvement. The data quality issues need to be handled when designing services for a data warehouse without any quality problems. The identification of problems caused by poor data is examined to derive a proper procedure. Inaccurate information given by the customer is another cause of a decline in quality. Unlike the conventional approach, there are several other proximity and time-variant issues that must be given consideration in the modern approach. The source of data in a modern data warehouse is related to the data quality improvement in data warehouses. The fields are filled by the ones in unstructured forms. These issues are improvements and advancements for modern research in data warehouses compared to the conventional methods of Inmon [10] and Kimball [11].
According to the Data Warehouse Institute [12], data quality improvement includes the correction of defective data to ensure the achievement of a minimum level of the data quality standard. It is also mentioned that data are required to be flawless, without any irregularities, and have to meet the standard requirement of the compatible application. The quality of data required by a user is different from that required by the organization. Strict rules are used to avoid improper data processing. Validation is made at a particular level where data are equipped with PIN numbers or passwords.
Frequent data errors are considered a common phenomenon; however, the model developed for data quality in a data warehouse is regularly adaptive to all changes. Hence, data of high quality can be used in operations, decision-making processes and modeling. In addition, the quality information indicates which data model is needed by the data warehouse.
The probability of errors that can lead to a decline in data quality is required for records and protocol distribution in a network. The calculation of technical information and the requirement protocol proposed by the enterprise have to be fulfilled to achieve data quality. An assured mechanism for the development of these protocols can benefit the enterprise by providing data quality management on a large scale. The goal is to meet market standards rather than to adopt low-cost protocols that may lead to a failure of the suggested model.
Many organizations and government agencies are involved with huge database collections [13]. The importance of data quality becomes a big concern in achieving the results and experiments required by a client. If this is not taken seriously, several complications may arise due to a failure in data quality, which affects the customer relationship model at any level of processes in the enterprise. Effective risk management is needed for a system to learn from its deficiencies. The designed protocols must be made in such a way as to cope with the risk and deliver the required standard results. The policy makers must decide the risk strategies to comprehend the desired data quality standards. Further management of the risk mitigation protocols for data quality improvement and the desired policy formulation play a major role based on data quality requirements. The agent for the risk mitigation approach is assigned after several testing levels, since they are going to play a major role at the enterprise working level. The decisions are taken from the policy of risk mitigation for data quality approaches. The management of any enterprise should pay attention to the Lloyd's approach [14] to data quality models and risk mitigation standards.
4. Duplicate Elimination Test Bed
Duplicate elimination is one of the important concrete services in data cleaning service composition. The main objective of the data cleaning service is to maintain data quality. It is a service-oriented method to remove duplicated data, which may be entered by the user more than once. The general idea is a matching process that enables the identification of duplicated data. One important aspect during the search for duplications of the same records is the ambiguity of data.
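The matching process described above can be illustrated as a pairwise similarity comparison over the records. This is a minimal sketch, not how the evaluated commercial services work internally; the 0.85 threshold and the sample names are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] between two names, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicates(records, threshold=0.85):
    """Pairwise matching: report index pairs whose similarity passes the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

names = ["Mohammed Alghamdi", "Mohammad Alghamdi", "Sara Binmahfouz", "Khalid Otaibi"]
print(find_duplicates(names))
# → [(0, 1)]
```

The ambiguity problem mentioned above shows up directly in the threshold choice: too low and distinct people are merged, too high and genuine variants of the same name escape detection.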
Several experiments are conducted to validate the duplicate elimination. Several services are used during the matching process in those experiments; however, only a few of them give the desired results. The duplicate information is displayed and recorded in the form of a table, together with an indication of its percentage. The effective services are generally chosen to get successful results in the improvement of data quality standards. The aim of this research is to compare the duplicate elimination services and find out which ones perform better. The comparison is generally based on two parameters.
The first parameter is the time to detect the errors in the data that alter the system and environment. Additional time is required to improve the quality of data in the processing system of any functionality. The second parameter is the memory, which determines the effectiveness of the data quality.
The services required for the experiments are available from the following service providers: WinPure Clean and Match (referred to as WinPure), DoubleTake3 Dedupe & Merge (referred to as DoubleTake), WizSame (referred to by the same name), and Dedupe Express (referred to as DQGlobal). Before the comparison between services is conducted, the experimental test bed needs to be developed on real local data in Saudi Arabia.
During the first experiment, eight duplicates are manually selected from the data warehouse data set and further examined by the duplicate detection services, as shown in Figure 2. It is important to note that data with high privacy are preserved. Due to the page limitation, only the result of the duplicate data detected by the DoubleTake service is presented in this paper, as illustrated in Figure 3, which shows seven duplicate records. In this figure, the DoubleTake service provides some information that might be different from other services, such as the number of suppressed records and the rate of records per hour. Hence, this research standardizes the service output as the number of duplicate records to ease the comparison. The quality of service is included in the performance analysis as well.
Figure 2. First Experiment Dataset

Figure 3. Duplicate Data Detected by DoubleTake
Due to the page limitation, only one experimental test bed is presented in this paper. A summary of the total of five experiments is presented in Table 1.
Table 1. Summary of Five Experiments

              WinPure  DoubleTake  WizSame  DQGlobal
Experiment 1    50%       88%        75%      88%
Experiment 2    25%       75%        67%      33%
Experiment 3    50%       90%        90%      80%
Experiment 4    88%       50%        75%      63%
Experiment 5    17%      100%        92%      83%
5. Comparisons Between Services in Duplicate Detection
In this comparison, there is a finer granularity based on the previous experiment test bed. Each service processes the same set of records so that the detection capability of all services can be fairly judged. All the records for the comparison are grouped by duplicate type. The comparisons are made based on the predefined duplication types. There are seven duplication types, as follows:
1. Different spelling and pronunciation comparison. The duplicated records examined in this comparison and the examination results from running the four services are illustrated in Figure 4. Due to the existence of different languages in Saudi Arabia, an inconsistent name transliterated from another language is not uncommon. It is interesting to note that the service provided by WinPure is unable to detect any records with different spelling and pronunciation.

Figure 4. Different Spelling and Pronunciation Duplicated Records and Examination Results
2. Comparison based on misspellings. The duplicated records examined in this comparison and the examination results from running the four services are illustrated in Figure 5. This comparison yields a lower percentage of detected records than the previous comparison. It can be inferred that the misspelling cases have more variants in the records. The next comparisons are not shown as figures due to the page limitation.
3. Comparison based on name abbreviation. The duplicated records are examined in this comparison along with the examination results from running the four services. In this comparison, the records with name abbreviations are handled more accurately by the services, except for the WinPure service.
4. Comparison based on honorific prefixes. The duplicated records are examined in this comparison along with the examination results from running the four services. It is interesting to note that the DQGlobal service underperforms in this experiment.
5. Comparison based on common nicknames. The duplicated records are examined in this comparison along with the examination results from running the four services. In this comparison, the DoubleTake service is unable to detect any duplicates.
Figure 5. Misspellings Duplicated Records and Examination Results
6. Comparison based on split names. The duplicated records are examined in this comparison along with the examination results from running the four services. In this comparison, the WinPure service underperforms again.
7. Comparison based on exact match. The duplicated records are examined in this comparison along with the examination results from running the four services. The exact match feature is important for some cases that need specific handling, such as investigating internal mistakes in data warehousing.
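Several of the duplicate types above (honorific prefixes, common nicknames, name abbreviations, split names) can in principle be handled by normalising each record to a canonical key before matching, so that the variants collide. The sketch below is a hypothetical illustration: the lookup tables and key scheme are assumptions, not how the four commercial services work internally.

```python
import re

# Hypothetical lookup tables; a production service would use far larger lists.
HONORIFICS = {"mr", "mrs", "dr", "prof", "eng"}
NICKNAMES = {"moh": "mohammed", "abd": "abdullah"}  # illustrative examples only

def normalise(name):
    """Reduce a name to canonical lowercase tokens so duplicate variants collide."""
    tokens = re.findall(r"[a-z]+", name.lower())       # split names rejoin here
    tokens = [t for t in tokens if t not in HONORIFICS]  # drop honorific prefixes
    tokens = [NICKNAMES.get(t, t) for t in tokens]       # expand common nicknames
    return " ".join(tokens)

def duplicate_key(name):
    """Match abbreviated first names ('M. Ali') on first initial plus surname."""
    tokens = normalise(name).split()
    if not tokens:
        return ""
    return tokens[0][0] + "|" + tokens[-1]

records = ["Dr. Mohammed Ali", "moh ali", "M. Ali", "Sara Omar"]
keys = [duplicate_key(r) for r in records]
print(keys)
# → ['m|ali', 'm|ali', 'm|ali', 's|omar']
```

In this toy run the first three variants all reduce to the same key, while the unrelated record keeps a distinct key; exact-match detection then reduces to comparing keys.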
A complete comparison summary for the seven duplicate types is illustrated in Figure 6. It can be inferred that the WizSame service has the highest reliability across all comparison criteria amongst the services, although it has no peak performance in terms of the number of detected records.

Figure 6. Seven Type Duplicates Examination Results
6. Quality of Data Cleaning Services
In addition to the evaluation of data quality, there is a requirement to assess the data cleaning service based on the quality of service. Several quality of service metrics are taken into account in this paper, such as performance, capability to process a number of records, data heterogeneity, and price. The performance of the service is broken down into two metrics: processing time and memory.
Time is an important factor that is taken into account in most algorithm comparisons; here it is calculated based on the processed records. 1000 records are considered enough for this comparison. The time spent by each service on the processing of 1000 records is calculated. The results of this record manipulation depend on the system environment. Therefore, the comparison between all of these services is conducted in the same environment. The environment related to the experiments is kept consistent across all four services. Modifying the environment may affect the overall performance of these services.
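A minimal harness for measuring the two performance metrics above, processing time and peak memory, of a single de-duplication call might look as follows. The workload is an illustrative stand-in; the actual experiments ran on the four commercial services, not this function.

```python
import time
import tracemalloc

def measure(func, *args):
    """Return (result, elapsed seconds, peak allocated bytes) for one call.

    Applying the same harness to every service keeps the measurement
    environment consistent, mirroring the comparison method above."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

# A stand-in workload: exact-match de-duplication of 1000 records.
records = [f"name-{i % 700}" for i in range(1000)]
deduped, seconds, peak_bytes = measure(lambda rs: sorted(set(rs)), records)
print(f"{len(deduped)} unique records, {seconds:.6f} s, peak {peak_bytes} bytes")
```

`time.perf_counter` is monotonic and high-resolution, and `tracemalloc` tracks Python allocations only, which is sufficient for a relative comparison run on one machine.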
The result of the processing time evaluation is presented in Figure 7(a). It shows that WizSame and WinPure utilized less CPU time for processing 1000 records. Accordingly, DQGlobal took the maximum time for processing 1000 records. Figure 7(b) presents the comparison of the memory utilization of the examined services. In this evaluation, both DoubleTake and WizSame had optimal performance, while DQGlobal and WinPure had higher memory consumption for the processing of 1000 records.

Figure 7. Time Spent and Memory Utilization on the Processing of 1000 Records
The capability of each service to process records is an important metric for a data cleaning service. The comparison describes how many records each service can process for the removal of duplication. WinPure was able to process 250,000 records at maximum. DoubleTake was able to process twenty million records at maximum. WizSame was able to process one million records at maximum. DQGlobal was able to process one million records at maximum.
Different services have different capabilities to process particular data formats; this is considered the data heterogeneity metric. WinPure is able to process Text File, MS Excel, MS Access, Dbase, and MS SQL Server. DoubleTake is able to process MS Excel, MS Access, Dbase, Plain Text File, ODBC, FoxPro, MS SQL Server, DB2 and Oracle. WizSame is able to process Dbase, MS SQL, MS Access, Oracle, Plain text file, ODBC, and OLE DB. DQGlobal is able to process MS Access, Paradox, MS Excel, DBF, Lotus, FoxPro, and Plain text file. This comparison shows that DoubleTake handles more data formats than the other services. WizSame scores second for handling more data formats in removing duplication.
Price is another quality of service metric considered in this paper. A service with a high price is not feasible for particular users. The prices for purchasing the licenses of the applications span a wide range at the time this paper is written. WinPure costs $949.00, DoubleTake costs $5,900.00, WizSame costs $2,495.00 and DQGlobal costs $3,850.00. This comparison implies that DoubleTake has the highest price compared to the rest of the services. However, since we wrap all these applications as services, the cost is minimized by paying only for the executed services.
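The pay-per-execution idea above can be sketched as a thin metering wrapper around an application: cost accrues only when the service is actually called. The `MeteredService` class, its rate, and the stand-in de-duplication function are hypothetical illustrations, not the wrapping used in this research or any vendor's billing scheme.

```python
class MeteredService:
    """Wrap a de-duplication tool so cost accrues per executed call."""

    def __init__(self, name, func, price_per_call):
        self.name = name
        self.func = func
        self.price_per_call = price_per_call  # hypothetical metered rate
        self.calls = 0

    def run(self, records):
        """Execute the wrapped tool once and count the call for billing."""
        self.calls += 1
        return self.func(records)

    @property
    def cost(self):
        """Total cost so far: calls made times the per-call rate."""
        return self.calls * self.price_per_call

# Stand-in tool: exact-match de-duplication preserving first occurrences.
svc = MeteredService("exact-dedupe", lambda rs: list(dict.fromkeys(rs)),
                     price_per_call=0.05)
clean = svc.run(["a", "b", "a"])
print(clean, svc.cost)
# → ['a', 'b'] 0.05
```

Under such a wrapper, a service with a high license price only becomes expensive in proportion to how often it is actually executed.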
7. Conclusion
In data warehousing, the data cleaning service plays an important role in many domains. If the data are not clean and are full of anomalies, the resultant data have a lot of issues, such as data integration and query errors. In order to get the best form of the extracted data, it is important to clean the data as an initial step. Data redundancy should be removed to maintain data integrity. This research provides an overview of the quality of data to be used in data warehousing and analyzes, practices and experiments with the concept of data quality by utilizing real local data. Hence, this research has two contributions. First, it surveyed data quality in the data warehouse environment and the data integrity analysis. Second, it compared the services that can remove data duplication through some real experiments. The experiments were conducted based on performance measures so that it could be determined which service is more effective for the removal of data duplication. The comparison is considered an aid for users to select the best services depending on their needs, especially within the scope of Saudi Arabia.
Acknowledgement
This work was supported by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, Saudi Arabia. The author, therefore, gratefully acknowledges the DSR technical and financial support. The author also thanks Mshari AlTuraifi for conducting the experiments in Saudi Arabia.
References
[1] B. Moustaid and M. Fakir, "Implementation of business intelligence for sales management," IAES International Journal of Artificial Intelligence (IJ-AI), vol. 5, no. 1, pp. 22-34, 2016.
[2] I. Khliad, "Data warehouse design and implementation based on quality requirements," International Journal of Advances in Engineering and Technology, pp. 642-651, 2014.
[3] L. Robert, "Data quality in healthcare data warehouse environments," 34th Hawaii International Conference on System Sciences, pp. 9-1, 2001.
[4] G. Shankaranarayanan, "Towards implementing total data quality management in a data warehouse," Journal of Information Technology Management, vol. 16, no. 1, pp. 21-30, 2005.
[5] A. Amine, R. A. Daoud, and B. Bouikhalene, "Efficiency comparison and evaluation between two ETL extraction tools," Indonesian Journal of Electrical Engineering and Computer Science, vol. 3, no. 1, pp. 174-181, 2016.
[6] R. Archana, R. S. Hegadi, and T. Manjunath, "A big data security using data masking methods," Indonesian Journal of Electrical Engineering and Computer Science, vol. 7, no. 2, pp. 449-456, 2017.
[7] H. Marcus, K. Mathias, and K. Bernard, "How to measure data quality: a metric approach," Twenty-Eighth International Conference on Information Systems, Montreal, pp. 1-15, 2007.
[8] H. Frederik, Z. Dennis, and L. Anders, "The cost of poor quality," Journal of Industrial Engineering and Management, pp. 163-193, 2011.
[9] K. Rahul, "Data quality in data warehouse: problems and solutions," IOSR Journal of Computer Engineering (IOSR-JCE), ISSN 2278-0661, vol. 16, issue 1, pp. 18-24, 2014.
[10] B. Inmon, "Data warehousing 2.0: architecture for the next generation of data warehousing," Tech. Rep., 2010.
[11] R. Kimball, M. Ross, J. Mundy, and W. Thornthwaite, The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence, Remastered Collection. John Wiley & Sons, 2015.
[12] United States Department of the Interior CIO, "Data quality management guide," Tech. Rep., 2008.
[13] Q. Sun and Q. Xu, "Research on collaborative mechanism of data warehouse in sharing platform," Indonesian Journal of Electrical Engineering and Computer Science, vol. 12, no. 2, pp. 1100-1108, 2014.
[14] Lloyd's, "Solvency II, Section 4: Statistical quality standards," Tech. Rep., 2010.