TELKOM
NIKA
, Vol.14, No
.2, June 20
16
, pp. 707~7
1
4
ISSN: 1693-6
930,
accredited
A
by DIKTI, De
cree No: 58/DIK
T
I/Kep/2013
DOI
:
10.12928/TELKOMNIKA.v14i1.2395
707
Re
cei
v
ed Au
gust 7, 201
5; Re
vised Ma
rch 22, 2016; A
c
cepted Ap
ril 8, 2016
Fuzzy-based Spectral Alignment for Correcting DNA
Sequence from Next Generation Sequencer
Kana Sa
putr
a
S
1
, Agus Buono
2
, Wisn
u Anan
ta Ku
suma*
3
1,2,
3
Department
of Computer S
c
ienc
e,
Bogor Agricult
ural
U
n
i
v
ersit
y
1,3
Bioinformatic
s
Researc
h
Group, Bog
o
r Agr
i
cultur
al Un
iver
sit
y
,
Jl. Meranti, W
i
ng 20 L
e
ve
l 5, Darmag
a
, Bog
o
r 166
80
T
e
lp./F
ax.: +
62-251-
862
55
84
*Corres
p
o
ndi
n
g
author e-m
a
il
: ananta@
ap
ps
.ipb.ac.id
A
b
st
r
a
ct
Next ge
ner
atio
n seq
u
e
n
cin
g
t
e
chn
o
lo
gy is
a
b
le
to ge
ner
ate
short
rea
d
i
n
l
a
rge nu
mbers and in a
relativ
e
ly s
hort
time
in
si
ngl
e r
unn
ing
pr
ogra
m
s. T
h
e
gr
aph
base
d
D
N
A s
e
que
nce
asse
mbly, that
is us
e
d
to
han
dle
these
b
i
g d
a
ta i
n
ass
e
mb
ly st
ep, is
v
e
ry sens
itive to
DNA se
qu
enci
ng err
o
rs. T
h
is
prob
le
m c
an
b
e
solve
d
by
p
e
rforming
a
n
erro
r correcti
on ste
p
b
e
fore
the
as
sembly
proc
es
s. T
h
is res
earc
h
pr
opos
ed
fu
zz
y
inference system
based spectral alignm
ent m
e
thod wh
ich
can detect and correct
DNA
s
equencing
errors.
T
he spectra
l
a
lign
m
ent
meth
od w
a
s i
m
p
l
e
m
e
n
ted
as
a
pre-pr
ocessi
ng
step bef
ore t
he D
N
A seq
u
enc
e
asse
mb
ly proc
ess. T
he evalu
a
tion w
a
s cond
ucted usi
ng Ve
lvet asse
mbl
e
r. the total node
s yielde
d by th
e
Velvet ass
e
mbl
e
r bec
ome a
measur
e of
the s
u
ccess of error
correction.
T
h
e results sh
ow
ed that the fu
zz
y
-
base
d
spectra
l
alig
n
m
e
n
t met
hod
gen
erate
d
sma
ll total
no
des for k =
53. It
w
a
s conclu
d
ed that the fu
zz
y
-
base
d
spectra
l
alig
n
m
e
n
t met
hod is succes
s
fully
ab
le to d
e
tect and corr
e
c
t DNA seque
n
c
ing err
o
rs.
Ke
y
w
ords
: D
N
A Seq
u
e
n
cin
g
Errors, F
u
zz
y
Infer
ence
S
ystem M
ode
l
,
Next Gener
ation S
e
q
u
e
n
c
i
ng,
Spectral Al
ign
m
e
n
t Method
Copy
right
©
2016 Un
ive
r
sita
s Ah
mad
Dah
l
an
. All rig
h
t
s r
ese
rved
.
1. Introduc
tion
In the Biol
o
g
y and
Heal
th scien
c
e,
deoxyribo
n
u
c
leic
aci
d
(DNA), a
very
impo
rtant
macromol
ecu
l
e, has a fu
nction to
sto
r
e all of
info
rmation
abo
ut the geneti
c
of creatu
r
es.
Sequen
cin
g
pro
c
e
ss i
s
re
quire
d to determin
e
DNA seq
uen
ce
s. This p
r
o
c
e
s
s produ
ce
s DNA
seq
uen
ce wh
ich is re
qui
re
d to find gen, area t
hat has sp
ecifi
c
protein cod
e
, and to compa
r
e
homolo
gou
s
DNA
sequ
en
ce
s amo
ng d
i
fferent org
a
n
i
sms [1]. No
wad
a
ys, se
q
uen
cing p
r
o
c
ess
has be
en ap
p
lied into varie
t
ies sam
p
le o
f
tumor as an
effort to identify mutations asso
ciated wi
th
c
a
nc
er [2].
Sequen
cin
g
t
e
ch
nolo
g
y ha
s b
een
co
ntin
uou
sly evolve
d from t
r
aditi
onal Sa
nge
r
Shotgun
seq
uen
cing
to next ge
ne
ration
sequ
en
cing
(NGS) such
a
s
Illumi
na G
enom
e
Analyze
r
(Sol
exa),
ABI SOLiD System, 454
G
enome S
equ
encer F
L
X,
and Helicos
Heliscop
e
[3]. Eventough, NGS
as a hig
h
thro
ughp
ut and fast se
que
nce
r
is be
tte
r tha
n
traditional
Sange
r Shout
gun sequ
en
ci
ng,
NGS still p
r
o
duces
DNA
seque
nci
ng e
r
rors. The
r
e
a
r
e several types of e
rro
rs g
enerated by t
he
seq
uen
ce
r, i.e. sub
s
titutio
n
, inse
rtion
and d
e
le
tion
[4]. The re
sults of
sequ
enci
ng read
s by
Ilumina, one
of the most fa
mous
NGS te
chn
o
logie
s
a
nd co
mmonly
used to p
r
o
d
u
ce 3
5
bp –
1
25
bp reads, are still containi
ng 0.5 to 2.5% sequen
cing
errors. Almost all of them
are substitution
errors [5]. Th
e existen
c
e
o
f
sub
s
titution
errors tend
s t
o
gen
erate
m
o
re b
r
a
n
che
s
in the graph.
Con
s
e
quently
, the numbers of node
s a
r
e in
cre
a
sed.
The increa
si
ng of node
s
will make th
e
grap
h in th
e
DNA
seque
nce
asse
mbl
y
pro
c
e
s
s
be
come
mo
re
compl
e
x [6]. Therefore,
erro
r
corre
c
tion
is
requi
re
d to
i
m
prove
the
quality
of
DNA p
r
od
uced
by
NGS [7]
and
redu
ce
the
compl
e
xity of
the grap
h.
Spectral alig
nment is a
method to d
e
te
ct and
co
rre
ct DNA seque
nci
ng errors [8].
Spectral alig
nment meth
od is
devel
oped
ba
se
d
on the fre
quen
cy of tuple o
c
curre
n
ce
s
(multipli
c
ity) [9]. Multiplicity is used to de
cide wh
ich tuples bel
ong t
o
solid tuple
s
or wea
k
tupl
es.
In this m
e
tho
d
, rea
d
s cont
aining
se
que
ncin
g e
r
ro
rs
are
assu
med
co
nsi
s
ts of a
t
least
one
wea
k
tuple. The co
rrectio
n
will be
condu
cted b
y
replaci
ng the wea
k
tuple
with a most si
milar soli
d on
e.
Ho
wever, det
ermini
ng the
optimal value
of multiplic
ity is difficult. Therefore, re
sea
r
ch in DNA
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 16
93-6
930
TELKOM
NIKA
Vol. 14, No. 2, June 20
16 : 707 – 71
4
708
seq
uen
cing
errors co
rrection base
d
o
n
quality
sco
re ha
s bee
n
also develo
ped, for inst
ance
RECO
UNT,
a software fo
r dete
c
ting a
nd corre
c
ti
ng
seq
uen
cing
errors which
wa
s devel
op
ed
based on b
a
se quality score (Phre
d
sco
r
e) and u
s
in
g expectatio
n
maximizatio
n
algorithm (E
M)
and stati
s
tic
model [10].
De
cidin
g
whi
c
h tupl
es a
r
e
inclu
ded int
o
soli
d tuple
s
or wea
k
tupl
es i
s
not en
o
ugh by
usin
g only
m
u
ltiplicity be
cause
someti
mes tu
ple
s
t
hat have
low qualitie
s
can
be
cla
s
sified
as
solid tu
ples.
Thus, in thi
s
resea
r
ch, we
combi
ned th
e
para
m
eter
of multiplicity a
nd ba
se q
uali
t
y
score
of e
a
ch re
ad
s to
h
andle
the
pro
b
lem of
co
rre
c
ting
se
que
n
c
ing
erro
rs. I
n
this stu
d
y, we
prop
osed to
rep
r
e
s
ent the
multiplicity a
nd the
ba
s
e
quality sc
or
e in ter
m
of f
u
zz
y. The fuz
z
y
approa
che
s
were
a
c
tually
has been proven
succe
ssful
in solving
the pro
b
lem i
n
the re
al wo
rld,
su
ch
as inte
grating
p
r
od
u
c
tion
ca
pabili
ty and lo
ad
balan
cing
du
ring
sche
duli
ng a
c
tivity [11],
cla
ssify likeli
hood
s of pu
rch
a
si
ng he
a
l
th insu
ran
c
e
[12], predict
the ca
se
s of Failed Ba
ck
Surge
r
y Syndrom
e (FBS
S) [13], classify and
anal
yze Microa
rray Gene d
a
ta by using
dat
a
mining
and
fuzzy logi
c [1
4], analyze t
he
relation
sh
ips
between
gene
s
and
h
e
lp d
e
ci
phe
r a
geneti
c
net
work [1
5], buil
d
ing lightin
g
system
b
a
sed on fu
zzy logic
sche
me to auto
m
ate
fluore
s
cent la
mps in
ord
e
r to a
c
hieve
il
lumination
a
c
cording
to In
done
sia
n
Nat
i
onal Stan
da
rd
(SNI) [1
6], a
nd ha
ndle
a
m
biguity pe
rceived
symp
t
o
ms th
e p
a
tient and
the
certai
nty fact
or
method to ha
ndle the rel
a
tionship bet
we
en
the sympt
o
ms of the di
sea
s
e [17].
The aim of this re
sea
r
ch is to apply and obtain fuzzy inferen
c
e
syst
em model in orde
r to
detect an
d correct the DNA se
quen
ci
ng erro
rs by
usin
g
the
sp
ectral alignm
ent
method as a
prep
ro
ce
ssin
g step
before
con
d
u
c
ting
DNA a
s
sem
b
ly process. T
he eval
uatio
n
wa
s condu
ct
ed
by using the
Velvet asse
m
b
ler for
sho
w
i
ng the
decre
asin
g of the total node
s after implem
enti
ng
a corre
c
tion
step.
2. Rese
arch
Metho
d
2.1. Data
se
ts
The data
s
et
was obtai
n
ed from Ho
b
bes Ge
nom
e
Sequen
ce Mappin
g
[18]. These
dataset was repre
s
e
n
ted in
FASTQ format. In
t
he FASTQ format, we co
uld hav
e the informat
ion
of re
ad
s a
n
d
the b
a
se q
u
a
lity score
of
ea
ch
nu
cleo
tide in
re
ads [19]. Tabl
e
1 de
scrib
ed t
h
e
cha
r
a
c
teri
stics of the dataset.
Table 1. The
cha
r
a
c
teri
stics of data set
Orga
nism Number
of
reads
Mean
Caenorh
abdities Elegans (
W
orm
B
ase
W
S
201)
1,000,000
100
bp
Human Geno
m
e
(HG18)
1,000,000
100
bp
These
rea
d
s actu
ally we
re ge
nerated
from
Illumin
a
se
que
ncer.
Rea
d
s produ
ced
by
Illumina h
a
ve
seq
uen
cin
g
errors a
nd ot
her
symbol
s
besi
d
e
s
Ade
n
i
ne (A
), Cyto
sine (C),
Gua
n
i
ne
(G) a
nd Thy
m
ine (T
), na
mely N. N is a symbol
stat
ing the inca
p
ability of NGS in reading
certai
n
bases [7]. Thi
s
re
sea
r
ch was al
so focused only on su
bstitution erro
r.
2.2. Spectral
Alignment
Metho
d
This
re
sea
r
ch
focu
sed
on t
he p
r
e-proce
ssi
ng
step
of the DNA seq
uen
ce
s. Fu
zzy base
d
spe
c
tral ali
g
n
m
ent method
is a method for dete
c
ting a
nd co
rre
cting
DNA sequ
en
cing e
rro
rs. The
resea
r
ch imp
r
oved the
sp
ectral
alignm
ent method
t
hat wa
s pe
rf
orme
d in 20
0
1
[8] by inclu
d
ing
base qu
ality score. T
h
e
first step
o
f
the spe
c
tral alignm
ent
method i
s
to determi
ne
the
multiplicity. After the multiplicity is obta
i
ned, tupl
e
s
were cla
s
sifie
d
into solid t
uple
s
and
weak
tuples. Tu
ple
is a su
bseq
uen
ce of re
a
d
that
have a
certai
n lengt
h. For in
stan
ce, su
ppo
se
d
we
have
rea
d
= {ACGACGACCGAT}. T
h
us, a
set of
tuple
s
with
the len
g
th of
5
are
ACG
A
C,
CGACG, GA
CGA, CGA
C
C, GACCG,
ACCGA, a
n
d
CCGA
T
. Thi
s
re
se
arch u
s
ed 5 l
ength
tuple
[9].
A tuple is cla
ssifie
d
as
we
ak tuple if it has
multipli
city less tha
n
or
equal to 10 a
nd also
contai
ns cha
r
acter
N in the read
s; otherwise woul
d b
e
classified a
s
solid tupl
e. Next, the tuples
of
ea
ch rea
d
s
wa
s
evalu
a
ted.
Read
s
that all pf th
eir tupl
es bel
ong to
solid
tuple
would
be
cla
ssif
i
e
d
pr
o
c
e
ss wa
s st
a
r
t
ed
by
calcul
ating the di
st
ance bet
wee
n
the wea
k
tu
ple an
d all so
lid
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
1693-6
930
Fuzzy-b
ased
Spectral Alignm
ent for Correctin
g
DNA Sequen
ce fro
m
Next… (Ka
na Saputra S)
709
tuples mem
b
ers u
s
in
g L
e
venshtein
dist
ance. Th
e string
m
a
tchi
ng pro
c
e
s
s wa
s done
to repl
a
c
e
the wea
k
tup
l
e with the solid tuple whi
c
h ha
d the neare
s
t simila
rity score. Thi
s
pro
c
e
s
s was
repe
ated
for ea
ch
re
ad
s. In thi
s
re
se
arch,
we i
m
proved
the
spectral
align
m
ent meth
od
by
employing
fu
zzy
co
ncept t
o
re
pr
esent t
uple
quality. Steps of
ou
r prop
osed me
thod
u
s
in
g
fu
zzy-
based spe
c
tral alignme
n
t method could
be descri
bed
as follows.
2.3. Fuzzy
-b
ased Spec
tr
al Alignment
2.3.1. Dete
r
m
ining Multiplicit
y
and Tuple Qualit
y
In this re
se
arch,
we
u
s
ed
two
pa
ramete
rs fr
om
a
tupl
e, ie m
u
ltiplicity and tu
ple
quality.
We u
s
e
d
FA
STQ form
at data that rep
r
esent
b
a
se
quality sco
r
e
usin
g the A
m
eri
c
an Sta
n
dard
Cod
e
for Info
rmation Inte
rcha
nge
(ASCII). The qualit
y sco
re
can
be obtain
ed
by conve
r
ting
the
ASCII value i
n
the
FASTQ
file into
Phred
score.
Ba
se
quality
score
is u
s
ed
to calculate tu
pl
e
quality. Tuple quality is the sum of b
a
se qu
a
lity score divided
by the number of base
s
.
In
addition, m
u
ltiplicity is the
numb
e
r
of tuple o
c
cu
rre
nce i
n
certai
n length
in e
v
ery rea
d
. If th
e
same
tuple
a
ppea
rs twi
c
e
or m
o
re
in th
e same
rea
d
, it will b
e
cou
n
t as
one. T
h
e quality of tu
ple
is the total nu
mber of ba
se
quality score
s
that is divided by the number of ba
se
s.
The multipli
city and quality score
sh
ould
be no
rmali
z
e
d
sin
c
e the
ra
nge value
s
were to
o
big. The form
ula of normali
zation u
s
ed i
s
linea
r scalin
g normali
zati
on or it is usu
a
lly called ma
x-
min [20]. The multiplicity and tuple qual
ity would
be cal
c
ulate
d
for every read
s. Norm
alizatio
n
data cal
c
ul
ated usi
ng Equ
a
tion (1
) as fo
llow:
(
1
)
2.3.2. Dete
r
m
ining Solid
Tuple and Wea
k
Tuple
In the propo
sed method, we modified th
e spe
c
tral ali
gnment meth
od by employ
ing fuzzy
inferenc
e s
y
stem. In this
res
e
arc
h
,
we
us
ed Ma
mda
n
i inferen
c
e [
21] which all
o
ws a
sy
ste
m
to
take in a
set
of cri
s
p inp
u
t
values an
d
apply a set o
f
fuzzy rul
e
s
to those valu
es, in o
r
de
r to
derive a singl
e,
cri
s
p, outp
u
t
value.
Th
e followin
g
st
e
p
s
a
r
e
execute
d
to cl
assify tuple
s
into
soli
d
or we
ak tupl
e
.
Figure 1. Model of fuzzy in
feren
c
e p
r
o
c
e
dure
Step 1: Defining input an
d output.
The m
u
ltiplicity and qu
al
ity sco
re
be
come
the i
n
put of this
Mamda
n
i inf
e
ren
c
e.
More
over, the output of the system i
s
a dec
i
s
io
n of the cla
ssifi
catio
n
of tuples.
Step 2: Defining fuzzy sets
for system va
riable
s
.
The
system
reco
gni
ze
s th
e inp
u
t an
d o
u
tput
va
r
i
ab
les
an
d
d
e
f
in
es its
me
mb
er
sh
ip
s
.
T
he
fuzzy sets for system varia
b
les
can b
e
seen in Tabl
e 2.
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 16
93-6
930
TELKOM
NIKA
Vol. 14, No. 2, June 20
16 : 707 – 71
4
710
Table 2. The
cha
r
a
c
teri
stics of fuzzy sets for sy
stem variable
s
Variable
Fuzz
y
Set Name
Parameter
Multiplic
ity
Lo
w
[0 0 0.1 0.2]
Medium
[0.1 0.3 0.5 0.
8]
High
[0.4 0.6 1 1]
Q
uality
Lo
w
[0 0 0.1 0.2]
Medium
[0.1 0.3 0.5 0.
8]
High
[0.4 0.6 1 1]
Decision
Weak Tuple
[0 0 0.1 0.5]
Solid Tuple
[0.1 0.3 1 1]
Table 2
pre
s
ents the p
a
rameter
of ea
ch fu
zzy set name. Input
variable
s
h
a
ve equa
l
para
m
eter.
T
he p
a
ra
meter affects the
o
u
tput fuzzy.
It sh
ows th
at
a tuple
can
b
e
cl
assified
i
n
to
wea
k
or solid tuple.
Step 3: Defining fuzzy rule
s.
The next ste
p
is defini
ng
the If-Then
rules to
d
e
scri
be syste
m
b
ehavior. Th
e
rule
s are
desi
gne
d as t
o
de
scribe th
e impo
rtan
ce
of the vari
abl
es o
n
de
cisi
o
n
. At this step
, multiplicity and
quality score
from tuple
s
b
e
com
e
inp
u
ts into t
he sy
stem. Base
d o
n
the expe
rt kno
w
le
dge, t
h
is
study expre
s
se
s the pro
b
l
e
m in term
s of logica
l rul
e
s. The rule
s combin
ation
can be
see
n
in
Table 3.
Table 3. The
combi
nation
of rules
Code
Rules
[R1]
IF
Q
u
ality
is low
AND Multip
licity
is low
THEN Weak T
uples
[R2]
IF
Q
u
ality
is low
AND Multiplic
ity
is medium THEN
Weak Tuples
[R3]
IF
Q
u
ality
is low
AND Multiplicity
is high T
H
EN Solid T
uples
[R4]
IF Qu
ality is medium AND Mult
ipli
city
is low
T
H
EN
Weak T
uples
[R5]
IF
Q
u
ality
is medium AND Multip
licity
is medium T
H
EN Weak Tupl
es
[R6]
IF Qu
ality is medium AND Mult
iplicity
is high THEN
Solid Tuples
[R7]
IF
Q
u
ality
is high AND Multiplicity
is low
T
H
EN Soli
d T
uples
[R8]
IF
Q
u
ality
is high AND Multiplic
ity
is medium THEN
Solid Tuples
[R9]
IF
Q
u
ality
is high AND Multip
licity
is high T
H
EN Solid T
uples
The propo
rtio
nal rule
s for
FIS model is
the
numbe
r o
f
input to the power of the
numbe
r
of membe
r
shi
p
function
s [2
2]. Therefo
r
e,
this re
sea
r
ch
was
used ni
ne rul
e
s o
b
tai
ned from th
re
e
membe
r
ship
function
s to
the po
we
r of
two in
put variabl
es. T
h
e
s
e
rule
s
we
re the
result
of
experim
entati
on be
ca
use i
t
is still
not
yet f
ound th
e inform
ation
about
multiplicity and
tu
ple
quality.
Step 4: Defuzzificatio
n pro
c
ess.
Finally defuzi
f
ication ste
p
is nee
ded to
conve
r
t all input data into three ling
u
isti
c term
s
that can
be u
s
ed to
cla
s
sify tuples. The
defuzzifi
ca
tion
p
r
oc
ess
tr
an
s
f
o
r
ms
the fuzz
y s
e
t into
a
cri
s
p value, a
more mea
n
in
gful rep
r
e
s
ent
ation.
Multiplicity and tuple qualit
y will be processe
d using f
u
zzy inference sy
stem m
o
del. After
that process,
each tuple ha
ve one value
from 0
to 1. The value will
determi
ne wh
ether the tupl
e
is cla
s
sified i
n
to wea
k
tupl
e or soli
d tupl
e, so we h
a
ve to determin
e
limit of the tuple value.
2.2.3. Error Corre
ction
In this re
sea
r
ch, the Leve
n
s
htein/e
d
it distan
ce was
u
s
ed a
s
di
stan
ce fun
c
tion [2
3]. For
instance, supposed we have
reads R
= TTTAAT
CGAAA and spectrum of solid tuples S =
{AAACG, AACCT,
CCAGT} and weak tuple = AATCG fr
om a reads R. Next,
we calculated the
distan
ce
bet
wee
n
wea
k
t
uple AAT
CG
and
all
tupl
es i
n
th
e
sp
ectru
m
of
sol
i
t tuple S. In
this
example, the
distance between AATCG
and AAAC
G i
s
1, AATCG
and AACCT i
s
2, and AAT
CG
and
CCAG
T
i
s
5. From the
result, the cl
ose
s
t
di
stan
ce wa
s a
c
hiev
ed by the di
stance
betwee
n
AATCG and
AAACG. Thus
, the
weak
t
uple AATCG
in
the reads
R
would
be replac
ed by
a s
o
lid
tuple of AAACG. As
a res
u
lt, the reads
R =
TTTAATCGAAA would be
c
o
rrec
t
ed
as
R’ =
TTTAAACGA
AA. The
DNA sequencing
errors correction
process
was performed to all
DNA
read
s ite
r
atively. After the entire
re
a
d
s
were
cor
r
ecte
d, a
se
t of erro
r-f
re
e rea
d
s
(e
rr
or
corre
c
tion
) would b
e
pro
duced. The
perfo
rman
ce
of co
rre
ctio
n wa
s eval
u
a
ted by Vel
v
et
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
1693-6
930
Fuzzy-b
ased
Spectral Alignm
ent for Correctin
g
DNA Sequen
ce fro
m
Next… (Ka
na Saputra S)
711
assembl
e
r
re
ferrin
g
to set
of error-free
read
s (e
rr
or
correctio
n
) tog
e
ther
with set of erro
r read
s
(witho
ut error corre
c
tion).
2.3. Ev
aluation
To evaluate
the effectivene
ss
of the
error
co
rre
ction, we
co
mpare the result
s of
assembli
ng e
rro
r-f
ree
read
s (e
rror
co
rre
ction)
and error rea
d
s (wi
t
hout
er
ror correctio
n
).
T
h
e
dataset wa
s
use
d
a
s
the i
nput for Velv
et assemb
l
e
r.
Velvet asse
mbler i
s
a
so
ftware
con
s
i
s
ting
of algo
rithm
s
to mani
pulat
e De Bruijn
g
r
aph
in
doin
g
DNA sequ
e
n
ce
a
s
sembl
y
[24]. This step
wa
s re
quired
to evaluate th
e perfo
rma
n
ce of the
erro
r corre
c
tion p
r
oce
s
s. The p
a
ram
e
ter of t
he
error corre
c
tion pro
c
e
s
s in this re
sea
r
ch is the tot
a
l node
s and
execution ti
me of the graph
prod
uced. Error
re
ad
s (without e
rro
r
co
rre
ction
)
woul
d tend to
pro
duce a m
o
re
compl
e
x grap
h
than
that of prod
uced by error-free
rea
d
s (e
rror
co
rrection
). Thi
s
woul
d b
e
u
s
ed to
evaluat
e
wheth
e
r the e
rro
r co
rrectio
n
pro
c
e
ss
wa
s perfo
rme
d
prop
erly.
The total no
des
coul
d b
e
cal
c
ulate
d
from
PreG
raph file. Pre
G
rap
h
file is Velve
t
assembl
e
r ou
tput. There a
r
e two file
s
correspon
di
ng
to graph
ge
nerate
d
by V
e
lvet assem
b
ler.
The two file
s
named
“PreG
r
aph
” a
nd
“La
s
tGraph
”. Bo
th files
co
ntai
n a li
st of n
o
d
e
s
rep
r
e
s
enti
n
g
De B
r
uijn
gra
ph by Velvet.
In this re
se
a
r
ch
only th
e “PreG
r
aph
”
would
be
con
s
i
dere
d
, be
ca
u
s
e
the graph
re
pre
s
ente
d
in
“La
s
tG
raph
” is the fin
a
l
output of Vel
v
et yielded b
y
Velvet’s erro
r
removal
tech
nique
s and
grap
h simplifi
c
ation. The
grap
h rep
r
e
s
ented
i
n
“PreGra
p
h
”
h
a
s
not
pro
c
e
s
sed
by Velvet’s e
r
ro
r removal te
chniqu
es
and
grap
h
simplifi
c
ation. So, it
can
be
u
s
ed
a
measure to i
ndicate the
su
cces
s of
o
u
r DNA se
q
uen
cing erro
rs
co
rrectio
n
method. V
e
lvet
executio
n u
s
ing two
com
m
and
s, nam
ely “velveth” and
“velvet
g
”. “velveth
” is
comm
an
d to
con
s
tru
c
t the dataset and “velvetg” is co
mmand bui
l
d
s the de Bruij
n
grap
h from k-m
e
rs obtain
e
d
from “velveth
”. The
execution time i
s
ca
lculate
d
by
sum of the
executio
n time f
o
r “velveth
” a
n
d
“v
elv
e
tg”.
3. Results a
nd Analy
s
is
3.1. Implementa
tion of Fu
zzy
-based Spectr
a
l Align
m
ent Me
tho
d
Fuzzy infe
re
nce
syst
em
model
ha
s p
e
rfome
d
several
experi
m
ents. Th
e ex
perim
ents
were condu
ct
ed by cha
ngi
ng the pa
ra
meters of
ea
ch vari
able a
nd the mem
bership fun
c
ti
on.
After the
cal
c
ulation
of the
total no
de
s
only the
s
e
pa
ramete
r
are
suitable fo
r
be
ing u
s
e
d
in
t
h
is
ca
se. The pa
ramete
r ca
n cla
ssify a tup
l
e into a solid
tuple or we
a
k
tuple. For t
he memb
ership
function, it
ha
s b
een
cond
u
c
ted
expe
rim
ents
usi
ng
si
gmoid
and
trape
zoid.
The
results show t
hat
trape
zoid i
s
more
suita
b
le
. It can be
se
en from th
e result
s of the
cal
c
ulatio
n of the total nod
es
and Velvet executio
n time.
A tuple
with f
u
zzy value
of
more tha
n
0.
4 was cl
assifi
ed a
s
a
solid
tuple; oth
e
rwise
will
be cla
ssifie
d
as a wea
k
tuple. The nu
mber of
soli
d
tuples and
wea
k
tuple
s
can be
see
n
in
Figure 2.
Figure 2. The
numbe
r of so
lid tuples a
n
d
wea
k
tuple
s
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 16
93-6
930
TELKOM
NIKA
Vol. 14, No. 2, June 20
16 : 707 – 71
4
712
Figure 2 pre
s
ent
s the total tuples
cla
s
sified
into
sol
i
d tuples
and
wea
k
tuple
s
. Wea
k
tuples
are m
o
re d
o
mina
nt than solid tu
ples. It sh
ows that fuzzy i
n
feren
c
e
syst
em mod
e
l de
tect
more e
r
r
o
r tu
ples.
3.2. Ev
aluation
The effective
ness of the p
r
opo
se
d met
hod wa
s eval
uated
u
s
ing Velvet
assem
b
ler.
Fo
r
each org
ani
sm, there woul
d be a pair of file, erro
r rea
d
s (without error
corre
c
tion
) and containi
ng
corre
c
ted
rea
d
s by fu
zzy b
a
se
d spe
c
tral
alignm
e
n
t m
e
thod. The
s
e
two pai
rs
of dataset of two
orga
nisms were sto
r
ed in
four files in F
ASTA fo
rmat. The output of Velvet is a
de Bruijn gra
ph
con
s
tru
c
ted from DNA
seq
uen
ce
s rea
d
s.
In the gra
ph
con
s
tru
c
tion
pro
c
e
ss, a
s
a part of
DNA sequ
en
ce
assembly, eit
her
rea
d
s
contai
ning
seque
nci
ng e
rro
r o
r
e
r
ror-free
re
ad
s
coul
d affect
the compl
e
xity of gra
ph.
Sequen
cin
g
errors in
rea
d
s
coul
d lea
d
to pro
d
u
c
e
unne
ce
ssary
bran
ch
es i
n
t
he g
r
aph. In
this
ca
se, the co
mplexity of a grap
h co
uld
be mea
s
ur
e
d
by calculati
ng the total node
s in a g
r
aph
gene
rated i
n
the DNA se
quen
ce
asse
mbly. A
ssu
m
ed, given AA
TGC
and
G
CCAG. T
he f
i
rst
sub
s
e
que
nce
should b
e
re
ad AATGC, instea
d becau
se of seq
uen
cing e
rro
r, the sub
s
eq
uen
ce
wa
s re
ad a
s
AATAC. The
n
, the grap
h
prod
uced by
read
s
contai
n
i
ng se
que
nci
ng erro
rs
(wit
hout
error correcti
on) will be m
o
re co
mplex
compared to those of the
error-free
reads (with error
cor
r
e
c
t
i
on
).
For ea
ch
DNA seque
nce assembly p
r
o
c
e
ss
u
s
in
g Velvet, a para
m
eter ha
sh l
ength k
must b
e
d
e
te
rmine
d
. Hash
length
is the
length
of
k-m
e
rs in
clud
ed i
n
the
ha
sh ta
ble. The
k val
ue
must be
an
odd n
u
mbe
r
and mu
st b
e
small
e
r th
an the le
ngt
h of ea
ch fragment
s. In this
resea
r
ch, k
wa
s
set to 5
3
, 55 a
nd
57
. The
result
of DNA sequ
ence a
s
semb
ly pro
c
e
s
s u
s
ing
Velvet for each k value
s
ca
n be se
en in
Table 4.
Table 4. DNA
sequ
en
ce a
s
sembly result using Velvet assembl
e
r
Orga
nism Error
Cor
r
ection
Total nodes in gr
aph
k = 53
k = 55
k = 57
Caenorh
abdities Elegans
(
W
orm
B
ase W
S
201)
No
38,224
32,195
26,859
Y
e
s
38,174
32,045
26,635
Human Geno
m
e
(HG18)
No
23,573
18,227
14,151
Y
e
s
23,562
18,224
14,150
Table 4
presents the
re
sults
ge
nerate
d
by Velvet in total nod
e
s
p
r
odu
ce
d f
o
r e
a
ch
dataset with k = 53, k
= 5
5
, and k
= 5
7
. It sho
w
n t
hat e
v
ery co
rre
cte
d
rea
d
s
pro
d
uce
d
fewe
r to
tal
node
s comp
ared to tho
s
e of erro
r-f
re
e read
s (wit
hout error
correctio
n
) for every k values.
Significant tot
a
l node
s
red
u
ction fo
r dat
a set
s
was
o
c
curred
on
k
= 53. Th
e re
sults
sh
own that
the erro
r correction
usi
ng f
u
zzy-ba
se
d
spectral a
lig
n
m
ent metho
d
wa
s abl
e to d
e
tect an
d
correct
DNA se
que
n
c
ing erro
rs a
nd
al
so simpl
i
fy
the
con
s
tructed
graph resulte
d
from
DNA se
que
n
c
e
as
sembly
st
e
p
.
De Bruijn g
r
aph i
s
p
a
rti
c
ula
r
ly suita
b
le for
rep
r
ese
n
ting the
sho
r
t re
ad
overla
p
relation
shi
p
. The graph si
ze is dete
r
m
i
ned by
the genom
e si
ze
and rep
eat conte
n
t of th
e
seq
uen
ce
d sample, and i
n
prin
ciple, will not
be affected by the high red
und
an
cy of deep re
ad
coverage [25
]. However,
seq
uen
cing
errors ca
n cause de Bru
ijn grap
hs to
becom
e hig
h
ly
bran
ch
ed
an
d tangl
ed [2
6
]. The b
r
an
ch
ca
n b
e
rep
r
ese
n
ted
by the la
rge
of th
e total n
ode
s. In
this research
, the su
cce
s
s of
seq
uen
cing e
rro
rs co
rre
ction
wa
s
measured by
cal
c
ulatin
g t
he
total node
s produ
ced by Ve
lvet assem
b
le
r.
We al
so
cal
c
ulated the
executio
n time f
o
r a
ll k
valu
e
s
.
The execution
time wa
s requi
re
d
to sup
port th
e inform
ation
of the gra
p
h
compl
e
xity.
The mo
re
co
mplex the re
sulting
gra
ph,
the
more time it take
s. The ex
ecutio
n time can b
e
se
en i
n
Figure 3.
Figure
3 pre
s
ents com
pari
s
on of
exe
c
ut
ion
time
by V
e
lvet betwe
e
n
erro
r read
s
(witho
ut
error co
rrecti
on) and co
rrected re
ad
s for
ea
ch
org
anism. It
sho
w
n that th
e
corre
c
ted
re
a
d
s
prod
uced le
ss exe
c
ution t
i
me compa
r
e
d
to erro
r re
ads
(without
error
co
rrecti
on) fo
r eve
r
y k
values. Sig
n
ificant exe
c
utio
n time redu
cti
on for
data
sets o
c
curred
on k
=
53. Th
e re
sult
s sho
w
n
that the e
r
ror co
rrectio
n
u
s
ing
fuzzy-b
a
s
ed
spe
c
tral
alignme
n
t me
thod
wa
s abl
e to d
e
tect
a
n
d
corre
c
t
DNA
seq
uen
cing
e
rro
rs an
d al
so sim
p
lifies t
he
con
s
tru
c
t
ed g
r
ap
h resulted fro
m
DNA
seq
uen
ce a
s
sembly
st
e
p
.
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
1693-6
930
Fuzzy-b
ased
Spectral Alignm
ent for Correctin
g
DNA Sequen
ce fro
m
Next… (Ka
na Saputra S)
713
Figure 3. The exec
ution time for unc
o
rre
c
t
ed reads
(NO) and
c
o
rrec
t
ed reads
(YES)
4. Conclusio
n
The
paramet
er
and
trap
e
z
oid
mem
bership
fun
c
tion
for fu
zzy inf
e
ren
c
e
sy
ste
m
mod
e
l
wa
s
suit
able
f
o
r t
h
i
s
ca
se.
The f
u
zzy
i
n
f
e
ren
c
e
sy
st
e
m
mod
e
l c
an
cla
ssif
y
a t
u
pl
e int
o
solid
t
u
ple
or
we
ak tuple
.
The
re
sult
s
sho
w
e
d
that t
he
wea
k
tu
pl
es
we
re
mo
re do
minant
than
soli
d tupl
es.
The eval
uati
on results u
s
ing V
e
lvet assembl
e
r
showed the
e
ffective
ness
of the propo
se
d
method in d
e
tecting an
d
corre
c
ting seque
nci
ng
error. The fu
zzy-ba
se
d sp
ectral ali
g
nm
ent
method can redu
ce the total node
s of graph u
s
in
g
k = 53 and
therefo
r
e it also su
cce
ssf
ully
redu
ce
d the
executio
n time. Thus, it ca
n be con
c
lud
ed that ou
r propo
sed m
e
th
od can effecti
v
ely
and efficie
n
tly detect and
correct DNA seque
nci
ng errors.
Ackn
o
w
l
e
dg
ements
The autho
rs woul
d like to
give deep g
r
atit
ude to M
i
nistry of
Educatio
n and
Culture
(DIKTI), Republic of Indonesia for fu
ndi
ng this research
through the BPP-DN.
Referen
ces
[1]
Rog
e
rs K. Ne
w
T
h
inkin
g
ab
ou
t G
enetics. New
York: Brita
n
n
i
ca Educ
atio
nal
Publis
hin
g
. 20
11: 132.
[2]
Cho
ng ML, K
u
CS, W
u
M, Soong
R. Char
act
e
risin
g
Som
a
ti
c Mutations
in
Canc
er G
enom
e b
y
Me
ans
of Ne
xt-gen
era
t
ion Seq
u
e
n
cin
g
. Chich
e
ster: John W
i
l
e
y
& S
ons, Ltd. 201
2.
[3]
Z
hao
Xia
oho
n
g
, Palm
er
LE,
Bola
nos
R, Mir
c
ean
C, F
a
s
u
l
p
D, W
i
tten
ber
g G
M
. EDAR:
AN Efficie
n
t
Error Detecti
o
n an
d R
e
mo
val Al
gorithm
for Ne
xt G
ener
ation
Seq
uenc
ing
Data.
Jo
u
r
na
l
of
Co
mp
utation
a
l Biol
ogy
. 20
10; 17(1
1
): 154
9-1
560.
[4]
Chevr
e
u
x
B. M
I
RA: An Aut
o
mated G
e
nom
e
and
EST
Assembler.
G
e
r
m
an
Ca
ncer
Rese
arch C
ent
e
r
Heid
el
berg, D
e
partment of Mo
lecul
a
r Bio
phys
i
cs
. 2005: 1
8
.
[5]
Kell
e
y
D
R
, Michae
l CS, Steven LS. Q
uake:
Q
ualit
y
-
A
w
a
r
e
Detection a
n
d
Correction of
Sequ
enci
n
g
Errors.
G
eno
me Biol
ogy
. 2010.
[6]
Miller J
R
, Kore
n S, Sutton G. Assembl
y
Al
go
rithms for Ne
xt
-Generati
on S
e
que
ncin
g D
a
ta.
G
eno
mics
.
201
0; 95(6): 31
5–3
27.
[7]
Yang
X, C
hoc
kalin
gam
SP,
Aluru
S. A S
u
rve
y
of Err
o
r-Correctio
n M
e
thods
for N
e
xt-G
enerati
o
n
Sequ
enci
ng.
J
ourn
a
l of Briefi
ng in Bi
oinf
ormatics
. 2012.
[8]
Pevzner
PA,
T
ang H, W
a
te
rman MS.
A
n
Eul
e
ria
n
P
a
th
Appr
oac
h to
DNA F
r
a
g
men
t
Assembly
.
Procee
din
g
s of
the Nation
al A
c
adem
y of
Sci
ences. 20
01; 9
8
(17): 97
48-
97
53.
[9]
Caesar N, Kusuma WA, Wijay
a
SH.
DN
A
Sequ
enci
ng Er
ror Correcti
ng
usin
g Spectra
l
Alig
nment
.
ICACSIS. 2013
: 279-28
4.
[10]
W
ija
ya
E, Frith MC
, S
u
zuki Y, Horton P.
RECOUNT:
Expectatio
n
M
a
xi
ma
z
a
t
i
on
B
a
sed
Erro
r
Correctio
n T
o
o
l
for Next G
ene
ration Se
qu
enc
ing D
a
ta
. Proceed
ings T
r
im.
200
9; 17(6).
[11]
O
t
hman Z
,
Su
bari K, M
o
rad
N.
Appl
icati
on
of F
u
zz
y Infer
ence S
y
st
ems
and
G
e
n
e
tic
Algorit
hm in
Integrated Pr
o
c
ess Plan
nin
g
and Sch
e
d
u
li
n
g
.
Internatio
nal
Journa
l of T
he Co
mp
uter, T
he Internet,
adn Ma
na
ge
ment
. 200
2; 10(2
)
: 81-96.
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 16
93-6
930
TELKOM
NIKA
Vol. 14, No. 2, June 20
16 : 707 – 71
4
714
[12]
Abdu
lla
h
L, R
ahma
n
MNA. Emplo
y
e
e
Lik
e
lih
oo
d of P
u
rchasi
ng
Hea
l
th Insur
ance
u
s
ing F
u
zz
y
Inference S
y
st
em.
Internatio
n
a
l Jour
nal of C
o
mputer Sci
e
n
c
e Issues
. 201
2; 9(2): 112-1
1
6
.
[13]
Qid
w
a
y
U, S
h
a
m
im MS, Raqu
ib F
,
Enam A.
Failed Back Surgery Syndrome
(FBSS) Prediction us
ing
Fu
z
z
y Inference System
(FIS).
IEEE International Confer
ence
on Signal Processing and
Communications (ICSPC)
.
20
07: 880-
88
3.
[14]
Bhuva
nes
w
a
ri
V, Brintha SJ. Microarra
y Ge
ne
E
x
press
i
o
n
Anal
ys
is Usin
g
T
y
pe 2 F
u
zz
y
Log
ic (MGA-
FL
).
Internation
a
l Jour
nal of C
o
mputer Sci
e
n
c
e
. 2012; 2(
2): 53-6
9
.
[15]
Ressom H, R
e
yno
l
ds R, V
a
rgh
e
se RS. I
n
cr
eas
ing th
e
Efficienc
y of
F
u
zz
y
L
o
g
i
c-base
d
Gen
e
Expr
essi
on Dat
a
Anal
ys
is.
Physion Genom
ic
s
. 2003; 13: 10
7-11
7.
[16]
Panj
aitan
SD,
Harto
y
o A.
A Lig
h
ting
C
ontrol S
y
stem
in Bu
ild
ings
base
d
o
n
F
u
zz
y
Lo
gic.
TEL
K
OMNIKA
. 2011; 9(3): 4
2
3
-43
2
.
[17]
Putra IKGD, Prihatini PM. Fuzzy
Expert Sy
stem
for T
r
opic
a
l Infecti
ous
Di
sease
b
y
Cert
aint
y F
a
ctor
.
TEL
K
OMNIKA
. 2012; 1
0
(4): 8
25-8
36.
[18]
Hob
bes
Genome
Seque
n
c
e Map
p
in
g Ho
me Pag
e
. Avail
abl
e:
http://hobbes.ic
s
.uci.edu/e
x
am
ples.shtml
[19]
Cock PJA, F
i
el
ds CJ, Goto N,
Heu
e
r ML, R
i
c
e PM. T
he Sang
er F
AST
Q
F
ile F
o
rmat for
Sequ
enc
es
w
i
t
h
Qu
alit
y Sc
ores, a
nd th
e
So
lexa/Illumina FASTQ Variants.
Nucleic Ac
ids Research
. 200
9;
38(
6):
176
7-17
71.
[20]
Akso
y S, Ro
b
e
rt MH. F
eatu
r
e Norm
aliz
ati
on
a
nd
Lik
e
li
h
ood-
base
d
Si
milarit
y
Me
asu
r
es for Imag
e
Retriev
a
l.
Elsevier
. 201
1; 22: 563-
582.
[21]
Musi IID. Pen
gemb
ang
an F
u
zz
y
Infer
ens
i
Si
stem Untuk
Seleksi M
e
to
de Pe
nin
g
kata
n Perol
e
h
an
Min
y
ak T
i
ngkat
Lanj
ut.
T
hesis. In
stitut Pertanian Bog
o
r; 200
9.
[22]
T
ang K,
T
o
kinaga S. Optimiz
a
tion of F
u
zz
y
Infe
rence S
y
st
em Rul
e
s b
y
U
s
ing the Ge
net
ic Algor
ithm
and Its Applic
a
t
ion to the Bon
d
Ratin
g
.
Jour
nal of the Ope
r
ations
R
e
sear
ch Society of Japa
n
. 199
9;
42(3): 30
2-3
1
5
.
[23]
Leve
n
shte
in V
I. Binar
y Co
de
s Cap
abl
e of
Correctin
g D
e
l
e
tions, Ins
e
rtio
ns an
d R
e
ver
s
als.
Soviet
Physics Dok
l
ady
. 1966; 1
0
(8)
:
707.
[24]
Zerbino DR, Birney
E.
Velv
e
t: Alg
o
rithms
for
De
Novo
Sh
ort Re
ad
Assemb
l
y
us
ing
D
e
Br
uijn
Grap
hs
.
Geno
me R
e
se
arch
. 200
8.
[25]
Li Rui
q
ia
ng,
Z
hu Hon
g
me
i, Rua
n
Ju
e,
Qia
n
W
u
b
i
n, F
a
n
g
Xi
ao
don
g, S
h
i Z
h
on
gbi
n, L
i
Yi
ngru
i
, L
i
Shen
gtin
g, Sh
an Gao, Kristi
a
n
sen
K, Li S
o
n
gga
ng, Ya
ng H
uanm
ing, W
a
n
g
Jian, W
a
n
g
J
un. De N
o
vo
Assembl
y
of H
u
man Gen
o
me
s
w
i
th Mass
ive
l
y
Par
a
ll
el Sh
o
r
t Read Seq
u
e
n
cin
g
.
Geno
me Rese
arch
.
201
0; 20: 265-
272.
[26]
Paszkie
w
i
cz K
,
Studholm
e
DJ. De N
o
vo
Assembl
y
of Short Se
que
nce R
eads.
Breafin
gs I
n
Bioi
nfor
matics
.
2010; 5(2): 4
5
7
-47
2
.
Evaluation Warning : The document was created with Spire.PDF for Python.