TELKOMNIKA, Vol.13, No.2, June 2015, pp. 694~702
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v13i2.1415

Received January 7, 2015; Revised March 14, 2015; Accepted April 2, 2015
Comparison of Data Partitioning Schema of Parallel Pairwise Alignment on Shared Memory System

Auriza Rahmad Akbar, Heru Sukoco*, Wisnu Ananta Kusuma
Department of Computer Science, Institut Pertanian Bogor
Jl Meranti Wing 20 Level 5, Darmaga, Bogor 16680, Indonesia
*Corresponding author, e-mail: hsrkom@ipb.ac.id
Abstract
The pairwise alignment (PA) algorithm is widely used in bioinformatics to analyze biological sequences. With the advance of sequencer technology, massive amounts of DNA fragments are sequenced much more quickly and cheaply. Thus, the alignment algorithm needs to be parallelized to align them in a shorter time. Many previous studies have parallelized the PA algorithm using various data partitioning schemas, but it is unknown which one is the best. The data partitioning schema is important for parallel PA performance, because this algorithm uses a dynamic programming technique that needs intense inter-thread communication. In this paper, we compared four partitioning schemas to find the best performing one on a shared memory system. Those schemas are: blocked columnwise, rowwise, antidiagonal, and blocked columnwise with manual scheduling and loop unrolling. The testing was done on a quad-core processor using DNA sequences of 1000 to 16000 bp as the input. The blocked columnwise schema with manual scheduling and loop unrolling gave the best performance of 89% efficiency. The synchronization time was minimized to get the best performance possible. This result provides high-performance parallel PA with fine-grain parallelism that can be used further to develop parallel multiple sequence alignment (MSA).
Keywords: Data Partition, Pairwise Alignment, Parallel Processing, Shared Memory
1. Introduction
The pairwise alignment (PA) algorithm is used in bioinformatics to align a pair of DNA or protein sequences of certain organisms, in order to determine the similarity between them [1]. It uses a dynamic programming technique to get the best alignment result with a complexity of O(n²), where n is the sequences' length [2]. It is the foundation of the multiple sequence alignment (MSA) algorithm, which aligns more than two sequences altogether. Other than that, it is also used for database sequence searching to find the sequence most similar to a given one [3].
Next-generation DNA sequencer technology nowadays can produce a lot of sequence data, up to hundreds of billions of base pairs (bp) in one run [5]. This big data needs faster processing, so the algorithm needs to be parallelized to speed up the alignment process. Many studies have parallelized MSA, such as Praline [6], ClustalW-MPI [7], MT-ClustalW [8], MAFFT [9], and the star algorithm [10]. But only a few have parallelized PA, such as ParAlign [11] and CudaSW [12].
The data dependency of PA is high due to its dynamic programming nature. Because of this, the data partitioning schema of a parallel PA algorithm affects its performance greatly. Several data partitioning schemas that have been applied for PA parallelization are: columnwise [13], diagonal [11], rowwise [14], blocked columnwise [15], and blocked anti-diagonal [16]. Unfortunately, the best partitioning schema on a shared memory system is not yet known.
In this paper, we parallelized the PA algorithm on a shared memory system using four different data partitioning schemas: blocked columnwise, rowwise, antidiagonal, and revised blocked columnwise. We tested and revised each schema to obtain the highest possible performance of the parallel PA algorithm.
2. Research Method
2.1. Implementation of Sequential Pairwise Alignment Algorithm
The first step is to implement the PA algorithm in a sequential manner using the global alignment approach. The longest common subsequence (LCS) algorithm [17] is used as a basis to develop the sequential PA algorithm. The LCS itself is a global alignment algorithm better known as the Needleman–Wunsch algorithm. The alignment score is set as follows: the score is incremented by one if both DNA residues match; else the score is subtracted by one. The gap penalty is applied by initializing the scores of the zeroth row and zeroth column to the gap penalty multiplied by their distance from the starting point (upper-left corner).
An example of the alignment table calculation using a gap penalty of -3 can be seen in Figure 1. This table aligns the sequences AGTCA and ATGA, resulting in an alignment score of 1 (the value of the bottom-right corner) and the following alignment result.

A G T C A
A - T G A
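As a concrete illustration of the scoring rule above, the following Python sketch (ours, not the paper's C implementation; the function name is hypothetical) fills the alignment table with match +1, mismatch -1, and linear gap penalty -3, and returns the bottom-right score.

```python
# Illustrative sketch of the global alignment (Needleman-Wunsch) table
# calculation described above; scoring follows the paper: match +1,
# mismatch -1, linear gap penalty -3. The function name nw_score is ours.
MATCH, MISMATCH, GAP = 1, -1, -3

def nw_score(x, y):
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    # zeroth row and column: gap penalty times distance from the corner
    for i in range(m + 1):
        C[i][0] = i * GAP
    for j in range(n + 1):
        C[0][j] = j * GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sim = MATCH if x[i - 1] == y[j - 1] else MISMATCH
            C[i][j] = max(C[i - 1][j - 1] + sim,  # diagonal
                          C[i - 1][j] + GAP,      # up
                          C[i][j - 1] + GAP)      # left
    return C[m][n]  # alignment score at the bottom-right corner
```

Tracing back from the bottom-right corner would recover the alignment itself, as in Figure 1.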
This algorithm's correctness is verified by aligning two sequences from the BAliBASE DNA sequence alignment benchmark [18]. These sequences are Saccharomyces cerevisiae GlyRS1 (SYG_YEAST) and Schizosaccharomyces pombe GlyRS (SYG_SCHPO). Our alignment result is compared with the result of the ClustalW 2.1 program using the default options.
Figure 1. Global alignment for sequences AGTCA and ATGA
2.2. Implementation of Parallel Pairwise Alignment Algorithm
The second step is to develop the parallel version of this algorithm and verify its correctness. Parallelization is implemented using OpenMP, because it is much easier to implement than by using processor instructions directly [11] or by using Pthreads [8],[9]. OpenMP is a library to parallelize a sequential program into a multithreaded one on a shared memory system [19]. Four partitioning schemas were tested: blocked columnwise, rowwise, antidiagonal, and blocked columnwise with manual scheduling and loop unrolling. The correctness of parallel PA is verified by comparing its output with the sequential one to satisfy sequential consistency [20].
2.2.1. Blocked columnwise
The blocked columnwise partitioning schema divides the alignment table per block of columns [15]. The id-th thread (0 ≤ id < t) gets its part of a block from column ⌊id·n/t⌋+1 to ⌊(id+1)·n/t⌋, where n is the number of columns and t is the number of threads. For example, if the number of columns was 9 and the number of threads was 3, then thread 0 would get its part of a block of columns 1–3. This partitioning schema is illustrated in Figure 2.
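The block bounds can be sketched in Python (a helper of ours, not from the paper); with 9 columns and 3 threads it reproduces the column ranges of the example above.

```python
# Sketch of the blocked columnwise partition: thread tid (0 <= tid < t)
# takes the contiguous block of columns floor(tid*n/t)+1 .. floor((tid+1)*n/t)
# of an n-column table. The function name is ours.
def column_block(tid, n, t):
    jfirst = tid * n // t + 1
    jlast = (tid + 1) * n // t
    return jfirst, jlast
```

The blocks tile the columns exactly even when n is not divisible by t.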
2.2.2. Rowwise
The rowwise partitioning schema divides the alignment table per row [14]. The id-th thread (0 ≤ id < t) gets every t-th row starting from row id+1 (i.e. rows id+1, id+1+t, ...), where m is the number of rows and t is the number of threads. For example, if the number of rows was 6 and the number of threads was 3, then thread 0 would get its part of the 1st and 4th rows. This partitioning schema is illustrated in Figure 3.
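A sketch of this cyclic row distribution (helper name ours), reproducing the example of 6 rows over 3 threads:

```python
# Sketch of the rowwise (cyclic) partition: rows are dealt out one at a
# time, so thread tid (0 <= tid < t) gets rows tid+1, tid+1+t, tid+1+2t, ...
# of the m rows -- the behavior of OpenMP schedule(static,1).
def rows_of_thread(tid, m, t):
    return list(range(tid + 1, m + 1, t))
```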
Figure 2. Blocked columnwise
Figure 3. Rowwise
Figure 4. Antidiagonal
2.2.3. Antidiagonal
The antidiagonal partitioning schema divides the alignment table per antidiagonal, crossing from bottom-left to top-right [16]. The id-th thread (0 ≤ id < t) gets every t-th antidiagonal starting from antidiagonal id+1, where m is the number of rows, n is the number of columns (giving m+n−1 antidiagonals), and t is the number of threads. For example, if the number of rows was 6, the number of columns was 9, and the number of threads was 3, then thread 0 would get its part of the 1st, 4th, 7th, 10th, and 13th antidiagonals. This partitioning schema is illustrated in Figure 4.
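The same cyclic deal applies per antidiagonal; this sketch (helper names ours) reproduces the example of 6 rows, 9 columns, and 3 threads:

```python
# Sketch of the antidiagonal partition: an m x n table has m+n-1
# antidiagonals, dealt out cyclically to t threads (schedule(static,1)).
def antidiagonals_of_thread(tid, m, n, t):
    return list(range(tid + 1, m + n, t))

def cells_of_antidiagonal(k, m, n):
    # cells (i, j) of the k-th antidiagonal satisfy i + j = k + 1,
    # crossing from bottom-left to top-right
    return [(i, k + 1 - i) for i in range(min(k, m), 0, -1)
            if 1 <= k + 1 - i <= n]
```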
2.2.4. Blocked columnwise with manual scheduling and loop unrolling
The first blocked columnwise schema is not optimal because of the usage of the OpenMP parallel for construct, which is simple but less flexible. To overcome this shortcoming, the data
partitioning mechanism needs to be done manually. The next revision is applying the loop unrolling technique, i.e. merging several loop iterations into one. This technique will reduce processor instructions and increase cache locality [21],[22].
2.3. Performance Comparison
The final step is to test the performance of each data partitioning schema. Several DNA sequence data are generated randomly with varying lengths of 1000, 2000, 4000, 8000, and 16000 bp as the input. The alignment table calculation is timed using the omp_get_wtime function. This process is repeated ten times for each schema and sequence length, and then the average is taken as the final execution time. The final execution times are compared to get the best performing data partitioning schema.
The test is conducted using an Intel Core i5-3470 processor that has 4 cores, a Debian GNU/Linux 64-bit operating system, and the GCC 4.8.1 compiler that supports the OpenMP 3.1 specification.
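The efficiency figures reported in the results follow from speedup divided by thread count; a trivial sketch of this bookkeeping (our helper names, assuming sequential and parallel wall times measured with omp_get_wtime):

```python
# Bookkeeping used in the comparison: speedup is sequential time over
# parallel time, and efficiency is speedup per number of threads.
def speedup(t_seq, t_par):
    return t_seq / t_par

def efficiency(sp, threads):
    return sp / threads
```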
3. Results and Analysis
3.1. Implementation of Sequential Pairwise Alignment Algorithm
The pseudocode for the PA algorithm is presented as a procedure called PAIRWISE-ALIGN, which takes two sequences X and Y as the parameters. The score of each cell is calculated by choosing the maximum score between three possible alignment directions: diagonal, up, and left. The diagonal score is calculated with the help of the SIMILARITY procedure: if the DNA residues are equal, the score is added by the MATCH score, else subtracted by the MISMATCH score. The up and left scores are calculated by subtracting the linear GAP penalty score.
MATCH    = +1
MISMATCH = -1
GAP      = -3

SIMILARITY(a, b)
    if a == b
        return MATCH
    else
        return MISMATCH

PAIRWISE-ALIGN(X, Y)
    m = X.length
    n = Y.length
    let C[0..m, 0..n] be a new table
    for i = 0 to m
        C[i,0] = i · GAP
    for j = 0 to n
        C[0,j] = j · GAP
    for i = 1 to m
        for j = 1 to n
            diag = C[i-1,j-1] + SIMILARITY(X[i], Y[j])
            up   = C[i-1,j] + GAP
            left = C[i,j-1] + GAP
            C[i,j] = MAX(diag, up, left)
    return C
In general, our PA algorithm has been able to align the two sequences correctly, whereas ClustalW produces an alignment with more contiguous gaps. This is caused by the lack of a gap extension feature in our PA algorithm. The comparison of the alignment results for SYG_YEAST and SYG_SCHPO between ClustalW and our PA is as follows. Based on this result, our simple PA algorithm has been implemented correctly and will become our basis to develop the parallel version of the PA algorithm.
> ClustalW
SYG_YEAST  ATGAGTGTAGAAGATATCAAGAAGGCTAGAGCCGCTGTTCCATTTAACAGAGAACAGCTA...
SYG_SCHPO  ATGA---CAGAAGTT-TCA--AAGGC---AGCAGCT------TTTGATCGAACTCAGTTC...
           ****    ***** * ***  *****   *** ***      *** *  **   *** *

> PairwiseAlign
SYG_YEAST  ATGAGTGTAGAAGATATCAAGAAGGCTAGAGCCGCTGTTCCATTTAACAGAGAACAGCTA...
SYG_SCHPO  ATGA---CAGAAG-TTTC-A-AAGGC---AGCAGCT-TTTGATCGAACTCAGTTC-G--A...
           ****    ***** * ** * *****   *** *** **  **  ***  **  * *  *
3.2. Implementation of Parallel Pairwise Alignment Algorithm
3.2.1. Blocked columnwise
The synchronization time caused by inter-thread data dependency for the blocked columnwise schema is low. Only the first cell of each row (marked with an asterisk in Figure 2) must be checked for whether its left cell has been filled by another thread or not. If a thread runs slower, then the next thread must wait until the former thread finishes one row of its part. The synchronization could be done only m × t times, but because of an OpenMP limitation it is still done m × n times. The OpenMP parallel for directive, although simpler, is less flexible, so the check is still done at each cell. Thus, the parallel algorithm performance becomes less efficient. Below is the pseudocode to parallelize PA using this schema.
for i = 1 to m
    parallel for j = 1 to n    // schedule(static)
        while C[i,j-1] == null
            wait
        diag = C[i-1,j-1] + SIMILARITY(X[i], Y[j])
        up   = C[i-1,j] + GAP
        left = C[i,j-1] + GAP
        C[i,j] = MAX(diag, up, left)
The result yields up to 3.00 times faster execution time using the blocked columnwise schema on 4 threads. Thus, the efficiency—speedup per number of threads—of this schema is 75%.
3.2.2. Rowwise
The synchronization time for the rowwise schema is high; it is done m × n times. Every single cell must check whether its upper cell has been filled by another thread or not. Despite that, the parallel algorithm performance of this schema does not become worse. This is caused by the nature of the shared memory system, i.e. no data movement is involved between threads. Below is the pseudocode to parallelize PA using this schema.
parallel for i = 1 to m    // schedule(static,1)
    for j = 1 to n
        while C[i-1,j] == null
            wait
        diag = C[i-1,j-1] + SIMILARITY(X[i], Y[j])
        up   = C[i-1,j] + GAP
        left = C[i,j-1] + GAP
        C[i,j] = MAX(diag, up, left)
Unexpectedly, the result of the rowwise schema is better than that of the blocked columnwise schema. It yields up to 3.18 times faster execution time using this schema on 4 threads. Thus, the efficiency of this schema is 80%.
3.2.3. Antidiagonal
The synchronization time for the antidiagonal schema is very high, two times that of the rowwise schema, that is 2 × m × n times. Every single cell must check whether its left and upper cells have been filled by another thread or not. Moreover, the non-linear memory access pattern of this schema makes the parallel algorithm performance much worse. Below is the pseudocode to parallelize PA using this schema.
parallel for k = 1 to m+n    // schedule(static,1)
    if k ≤ m    // upper diagonal
        for i = k downto 1, j = 1 to n
            while C[i,j-1] == null or C[i-1,j] == null
                wait
            diag = C[i-1,j-1] + SIMILARITY(X[i], Y[j])
            up   = C[i-1,j] + GAP
            left = C[i,j-1] + GAP
            C[i,j] = MAX(diag, up, left)
    else    // lower diagonal
        for i = m downto 1, j = k-m to n
            while C[i,j-1] == null or C[i-1,j] == null
                wait
            diag = C[i-1,j-1] + SIMILARITY(X[i], Y[j])
            up   = C[i-1,j] + GAP
            left = C[i,j-1] + GAP
            C[i,j] = MAX(diag, up, left)
As expected, the result of the antidiagonal schema is the worst among the others. It yields up to 1.44 times faster execution time using this schema on 4 threads. Thus, the efficiency of this schema is only 36%.
3.2.4. Blocked columnwise with manual scheduling and loop unrolling
3.2.4.1. Manual scheduling
The loop scheduling is done manually using the block decomposition method [23] as a substitute for the OpenMP parallel for construct. Therefore, the synchronization can be done only m × t times, that is only once at the first cell of each row (marked with an asterisk in Figure 2). This manual scheduling makes the blocked columnwise schema more efficient. Below is the pseudocode of the blocked columnwise schema that has been revised using manual scheduling.
parallel
    id = Thread.id
    t  = Thread.size
    jfirst = ⌊id × n / t⌋ + 1
    jlast  = ⌊(id+1) × n / t⌋
3.2.4.2. Loop unrolling
The loop for each row will be unrolled by a factor of two, i.e. the iterations for two rows will be merged into one. This is done by making two loop copies, for C[i,j] and C[i+1,j], and using an increment of two for each iteration. A checking mechanism is added in the middle of the loop to check whether the last row has been reached or not. This is necessary to anticipate when the number of rows (m) is not divisible by the unroll factor.
This loop unrolling technique by a factor of two reduces the synchronization time to half, that is ½ × m × t. Below is the continued pseudocode of the blocked columnwise schema that has been revised using loop unrolling by a factor of two.
for i = 1 to m by 2
    while C[i,jfirst-1] == null
        wait
    for j = jfirst to jlast
        diag = C[i-1,j-1] + SIMILARITY(X[i], Y[j])
        up   = C[i-1,j] + GAP
        left = C[i,j-1] + GAP
        C[i,j] = MAX(diag, up, left)
        if i < m    // skip the second copy on the last (odd) row
            diag = C[i,j-1] + SIMILARITY(X[i+1], Y[j])
            up   = C[i,j] + GAP
            left = C[i+1,j-1] + GAP
            C[i+1,j] = MAX(diag, up, left)
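To see that the unrolled loop is equivalent, this sequential Python sketch (ours, not the paper's implementation) fills the table both ways, with the mid-loop check guarding the case of an odd number of rows:

```python
# Sequential sketch checking that the unroll-by-two row loop fills the
# same alignment table as the plain double loop (match +1, mismatch -1,
# gap -3, as in the paper). Function names are ours.
MATCH, MISMATCH, GAP = 1, -1, -3

def new_table(m, n):
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i * GAP
    for j in range(n + 1):
        C[0][j] = j * GAP
    return C

def fill_cell(C, x, y, i, j):
    sim = MATCH if x[i - 1] == y[j - 1] else MISMATCH
    C[i][j] = max(C[i - 1][j - 1] + sim, C[i - 1][j] + GAP, C[i][j - 1] + GAP)

def fill_plain(x, y):
    C = new_table(len(x), len(y))
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            fill_cell(C, x, y, i, j)
    return C

def fill_unrolled(x, y):
    m, n = len(x), len(y)
    C = new_table(m, n)
    for i in range(1, m + 1, 2):              # two rows per iteration
        for j in range(1, n + 1):
            fill_cell(C, x, y, i, j)          # row i
            if i < m:                         # guard: last odd row is alone
                fill_cell(C, x, y, i + 1, j)  # row i+1
    return C
```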
The blocked columnwise schema with manual scheduling and loop unrolling by a factor of two yields up to 3.54 times faster execution time on 4 threads. Thus, the efficiency of this schema is 89%.
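Tallying the synchronization checks counted in the sections above (per m × n table with t threads) makes the gap between the schemas plain; the summary below is ours, following the text's formulas:

```python
# Number of synchronization checks each schema performs while filling
# an m x n alignment table with t threads, as counted in the text.
def sync_checks(m, n, t):
    return {
        "blocked columnwise (OpenMP for)": m * n,
        "rowwise": m * n,
        "antidiagonal": 2 * m * n,
        "blocked columnwise (manual)": m * t,
        "manual scheduling + unroll by 2": m * t // 2,
    }
```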
3.2.5. Verification of Parallel PA Algorithm Correctness
All versions of the parallel PA above have been tested for sequential consistency by comparing their outputs with the sequential PA algorithm. All of them produced the same output; therefore, the parallel algorithm implementations are 100% consistent with the sequential algorithm. This verification is important because an incorrect implementation of a multithreaded program leads to race conditions that result in inconsistent output.
3.3. Performance Comparison
The antidiagonal schema yields the worst performance with an efficiency of 36%. This is understandable because its non-linear memory access pattern causes a lot of cache misses.
The rowwise schema yields better performance than the blocked columnwise schema, with efficiencies of 80% and 75% respectively. This result is beyond our expectation, because the rowwise schema has more inter-thread data dependency. The main reason is that the implementation is on a shared memory system, where data movement between threads is non-existent, i.e. each thread accesses the same shared memory space. The result would be different if it were implemented on a distributed memory system. Another reason is that the first blocked columnwise schema above is not optimal yet.
The second, revised blocked columnwise schema yields the best performance with an efficiency of 89% on 4 threads. The loop unrolling technique (merging two row iterations into one) on this schema has been proven to reduce synchronization time, hence increasing the computation portion of the parallel algorithm. The comparison of the tested data partitioning schemas is shown in Figure 5.
Figure 5. Performance comparison of antidiagonal (AD), blocked columnwise (BC), rowwise (R), and revised blocked columnwise (BC2) on 4 threads
4. Conclusion
The revised blocked columnwise schema using manual scheduling and loop unrolling yields the highest performance of 89% efficiency. The manual scheduling makes the blocked columnwise schema more efficient, and the loop unrolling technique halves the synchronization time. In a shared memory system, the performance is defined by the portion of synchronization time; the synchronization time must be minimized to maximize the performance. We have presented a parallel PA algorithm with high performance and fine-grain parallelism that could be used further as a component to develop a parallel MSA algorithm on a hybrid shared–distributed memory system using OpenMP and the Message Passing Interface (MPI).
Acknowledgement
This research is funded by the Cooperation Partnership of National Agricultural Research and Development (KKP3N) in 2013 from the Indonesian Agricultural Ministry.
References
[1] Cohen J. Bioinformatics: an introduction for computer scientists. ACM Computing Surveys (CSUR). 2004; 36: 122–158.
[2] Waterman MS, Smith TF, Beyer WA. Some biological sequence metrics. Advances in Mathematics. 1976; 20: 367–387.
[3] Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 2004; 32: 1792–1797.
[4] Rognes T, Seeberg E. Six-fold speed-up of Smith–Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics. 2000; 16: 699–706.
[5] Liu L, Li Y, Li S, Hu N, He Y, Pong R, Lin D, Lu L, Law M. Comparison of next-generation sequencing systems. BioMed Research International. 2012.
[6] Kleinjung J, Douglas N, Heringa J. Parallelized multiple alignment. Bioinformatics. 2002; 18: 1270–1271.
[7] Li KB. ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics. 2003; 19: 1585–1586.
[8] Chaichoompu K, Kittitornkun S, Tongsima S. MT-ClustalW: multithreading multiple sequence alignment. IEEE International Parallel and Distributed Processing Symposium (IPDPS). Rhodes Island. 2006: 8.
[9] Katoh K, Toh H. Parallelization of the MAFFT multiple sequence alignment program. Bioinformatics. 2010; 26: 1899–1900.
[10] Satra R, Kusuma WA, Sukoco H. Accelerating computation of DNA multiple sequence alignment in distributed environment. Telkomnika Indonesian Journal of Electrical Engineering. 2014; 12(12): 8278–8285.
[11] Rognes T. ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Research. 2001; 29: 1647–1652.
[12] Liu Y, Wirawan A, Schmidt A. CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions. BMC Bioinformatics. 2013; 14: 117.
[13] Hughey R. Parallel hardware for sequence comparison and alignment. Computer Applications in the Biosciences: CABIOS. 1996; 12: 473–479.
[14] Martins WS, del Cuvillo J, Cui W, Gao GR. Whole genome alignment using a multithreaded parallel implementation. Symposium on Computer Architecture and High Performance Computing. Vitória. 2001: 1–8.
[15] Liu W, Schmidt B. Parallel design pattern for computational biology and scientific computing applications. IEEE International Conference on Cluster Computing. Hongkong. 2003: 456–459.
[16] Li J, Ranka S, Sahni S. Pairwise sequence alignment for very long sequences on GPUs. IEEE International Conference on Computational Advances in Bio and Medical Sciences. Las Vegas. 2012: 1–6.
[17] Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms. Third Edition. Cambridge: MIT Press. 2009: 390–396.
[18] Carroll H, Beckstead W, O'Connor T, Ebbert M, Clement M, Snell Q, McClellan D. DNA reference alignment benchmarks based on tertiary structure of encoded proteins. Bioinformatics. 2007; 23(19): 2648–2649.
[19] Dagum L, Menon R. OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE. 1998; 5: 46–55.
[20] Lamport L. How to make a multiprocessor computer that correctly executes multiprocess programs. Computers, IEEE Transactions on. 1979; 100: 690–691.
[21] Loveman DB. Program improvement by source-to-source transformation. Journal of the ACM (JACM). 1977; 24: 121–145.
[22] Sedgewick R. Implementing quicksort programs. Communications of the ACM. 1978; 21: 847–857.
[23] Quinn MJ. Parallel Programming in C with MPI and OpenMP. New York: McGraw-Hill. 2003: 118–119.