TELKOMNIKA, Vol. 11, No. 8, August 2013, pp. 4379~4384
e-ISSN: 2087-278X
Received February 27, 2013; Revised May 12, 2013; Accepted May 22, 2013
A Novel Architecture of Multi-GPU Computing Card
Sen Guo*1,2, Sanfeng Chen1,2, YongSheng Liang1,2
1ShenZhen Institute of Information Technology, ShenZhen, China
2Shenzhen Key Laboratory of Visual Media Processing and Transmission, Shenzhen, Guangdong, 518172, China
*Corresponding author, e-mail: ybbsss1210@126.com, ChenSf@sziit.com.cn, LiangYS@sziit.com.cn
Abstract
The data transmission between GPUs in existing multi-GPU computing cards often goes through PCIE, which is relatively slow, so PCIE has become the bottleneck of overall performance. A novel architecture for a multi-GPU computing card is proposed in this paper: a multi-channel memory with multiple interfaces is added, including one common interface shared by the different GPUs, which is connected with an FPGA arbitration circuit, and several other interfaces connected independently with each dedicated GPU's frame buffer. This multi-channel memory is called "global shared memory". The result of a simulation of accelerating computed tomography algebraic reconstruction on multiple GPUs demonstrates the effectiveness of this approach.
Keywords: multi-GPU, tomography algebraic reconstruction, GPGPU, CUDA
Copyright © 2013 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
Computer graphics hardware has been widely used for general purpose computing in various applications, beyond its original target of the computer graphics and gaming industry. Using computer graphics hardware to accelerate common computation can be traced back to machines like the Ikonas [1], the Pixel Machine [2] and Pixel-Planes [3, 4]. In 1999, NVIDIA Corporation introduced the GeForce 256, the first consumer-level card on the market with hardware-accelerated T&L (Transform & Lighting). After that, the programmable graphics pipeline was introduced and shading languages became popular for GPU general purpose computing. In 2006, NVIDIA Corporation introduced the Compute Unified Device Architecture (CUDA), marking the arrival of modern GPGPU. Compared with traditional GPGPU techniques, CUDA has several advantages, such as scattered reads, shared memory, faster downloads and readbacks to or from the GPU, and full support for integer and bitwise operations. OpenCL, which is very similar to CUDA, was introduced in 2008 as the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices [5].
In recent years, a new term, "personal supercomputer", has emerged in the parallel computing industry. Personal supercomputers usually have one or more hardware accelerators (mostly GPUs). Being portable, cost-effective and energy-efficient, personal supercomputers are winning the favor of engineers, scientists and computer experts. Furthermore, more and more professional industry software companies provide GPU acceleration for their products. With the power of the GPU, such software can usually get 2x to more than 10x acceleration on personal supercomputers, compared to the corresponding CPU-only versions.
Multi-GPU graphics cards like the NVIDIA GTX 690 or AMD Radeon 7970, which carry more than one GPU chip on a single graphics board, are also getting used in personal supercomputers. But these multi-GPU graphics cards have a disadvantage: the GPUs on the same board are not intra-connected.
The intra-connection doesn't mean the electrical connection, but the data connection. Each GPU on such a board has its own memory storage, and the GPUs cannot access each other's memory units directly. This issue will be detailed in the next section.
In this paper, we propose a solution to bridge the different GPUs on the same board, and we analyze the expected performance improvement if this solution is adopted by graphics card manufacturers. Because CUDA code is clearer to understand and easier to use than OpenCL, we use CUDA for the solution description and for the experiment implementation.
2. The Disadvantage of the Current Multi-GPU Graphics Card Design
As described above, in the current design, the GPUs on one single multi-GPU board cannot communicate with each other directly. But in lots of cases, programmers want to share some data between different GPUs [6]. Since the GPUs cannot communicate directly, to copy a buffer from GPU 1 to GPU 2, the program should copy GPU 1's buffer to system memory through PCIE and then copy the buffer to GPU 2 through PCIE. Figure 1 gives a diagram of a memory copy between different GPUs.
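As an illustration, this two-hop copy can be sketched with the CUDA runtime API as follows (the device pointers are hypothetical names and error checking is omitted):

#include <cuda_runtime.h>

// Sketch of the two-hop copy through system memory described above.
void copyThroughHost(void* dstOnGpu2, const void* srcOnGpu1, size_t bytes) {
    void* staging = NULL;
    cudaMallocHost(&staging, bytes);   // pinned host buffer for the PCIE hops

    cudaSetDevice(0);                  // GPU 1
    cudaMemcpy(staging, srcOnGpu1, bytes, cudaMemcpyDeviceToHost);   // hop 1

    cudaSetDevice(1);                  // GPU 2
    cudaMemcpy(dstOnGpu2, staging, bytes, cudaMemcpyHostToDevice);   // hop 2

    cudaFreeHost(staging);
}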
The PCIE bandwidth is much lower than the bandwidth of a GPU accessing its own memory units [7]. According to the GPU bandwidth test utility provided by NVIDIA, the host-device download/upload usually has 2 GB/s to 3 GB/s bandwidth, while the GPU has more than 100 GB/s bandwidth to its on-board memory storage. So, the data transmission through PCIE usually becomes a bottleneck for lots of applications running on multi-GPU systems.
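These numbers can be reproduced with a simple timed copy; the sketch below (in the spirit of NVIDIA's bandwidthTest sample, with an arbitrary 64 MB transfer size and no error checking) measures host-to-device bandwidth with CUDA events:

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: time one pinned host-to-device copy and report bandwidth.
int main() {
    const size_t bytes = 64 << 20;             // 64 MB, an arbitrary test size
    void *host, *device;
    cudaMallocHost(&host, bytes);              // pinned memory, as bandwidthTest uses
    cudaMalloc(&device, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(device, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host-to-device bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(device);
    cudaFreeHost(host);
    return 0;
}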
Figure 1. Data Flow of Memory Copy from GPU1 to GPU2
3. An Example: Computed Tomography Algebraic Reconstruction
Medical imaging is one of the domains where GPU parallel computing is widely used. Generally the CT scanner generates large x-ray projection data, and then these projection data are reconstructed into a 3D volume, in which each voxel represents the density of the inspected object. The process of reconstructing the 3D data is called tomographic reconstruction.
There are several methods for CT reconstruction. Usually these methods can be divided into two categories: the analytic reconstruction methods (such as the commonly used Filtered Back-Projection (FBP) algorithm [8]) and the algebraic reconstruction methods [9]. The algebraic reconstruction methods usually apply iterative reconstruction techniques, such as the Simultaneous Algebraic Reconstruction Technique (SART) [10]. Compared to the more commonly used FBP method, SART usually performs better, especially when the set of available projections is sparsely or non-uniformly distributed in angles [11]. However, SART is rarely applied in most medical CT systems due to its high complexity and high computational cost. For example, SART requires a sequence of alternating volume projections and corrective back-projections until the reconstructed volume fits all projection images. This process is very time consuming and difficult to converge to a result instantaneously. In this paper, we will introduce the multi-GPU implementation of SART, analyze its performance bottleneck, and discuss how the global shared memory may improve it.
3.1. SART Algorithm
Given an N = n³ volume V, M projection images are obtained by the X-ray detector at M different angles. Let P_φ denote the projection image at angle φ, and let P denote a vector storing all pixels of P_φ, φ = 0, 1, ..., M. Then the algebraic tomography reconstruction can be described as a linear algebra problem:

$$W V = P \tag{1}$$
Although the volume has three dimensions, it is flattened into a vector. Similarly, although we obtain M projection images and each projection image P_φ has two dimensions, the M projected images are also flattened into a vector. W is a weight matrix with n³ columns, in which w_ij denotes the influence factor with which voxel v_j ∈ V contributes its value to pixel p_i ∈ P.
An iterative method is used to solve the equation. At angle φ, a projection image P'_φ is computed. Then, each voxel is corrected by an accumulated correction according to all pixels in P'_φ. In the back-projection stage, voxels v_j ∈ V are corrected by the following equation:
$$v_j^{(k+1)} = v_j^{(k)} + \lambda \, \frac{\sum_{p_i \in P_\varphi} \left( \dfrac{p_i - p'_i}{\sum_{n=1}^{N} w_{in}} \right) w_{ij}}{\sum_{p_i \in P_\varphi} w_{ij}} \tag{2}$$
where p_i is the projection pixel value, p'_i is the computed integral across the ray, λ is a relaxation factor and $\sum_{n=1}^{N} w_{in}$ is the length of the line. This equation can be divided into the following two equations (writing c_i for the normalized correction of pixel i):
$$c_i = \frac{p_i - p'_i}{\sum_{n=1}^{N} w_{in}} \tag{3}$$
And:
$$v_j^{(k+1)} = v_j^{(k)} + \lambda \, \frac{\sum_{p_i \in P_\varphi} w_{ij} \, c_i}{\sum_{p_i \in P_\varphi} w_{ij}} \tag{4}$$
From Equation 3 and Equation 4, we can find that each iteration in the SART algorithm can be divided into four main stages: projection, correction, back-projection and update.
1. Projection stage: Compute line integrals p'_i for all rays of P_φ.
2. Correction stage: Subtract the calculated line integral p'_i from the projection pixel p_i in the projection image, and normalize it, giving c_i (Equation 3).
3. Back-projection stage: Distribute corrections onto voxels.
4. Update stage: Update the volume.
Equation 3 is computed in the projection and correction steps, and Equation 4 is computed in the back-projection and update steps.
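To make the four stages concrete, a structural host-side sketch of one SART pass over all M angles follows; project(), correct(), backProject() and update() are hypothetical stage wrappers invented for this sketch, not the implementation of [11]:

// Hypothetical stage wrappers (signatures invented for this sketch):
void project(float* volume, int phi);        // stage 1: line integrals p'_i
void correct(const float* p, int phi);       // stage 2: c_i = (p_i - p'_i) / sum_n w_in
void backProject(int phi);                   // stage 3: distribute w_ij * c_i onto voxels
void update(float* volume, float lambda);    // stage 4: apply Equation 4 to each v_j

// One SART pass: the four stages are repeated for every projection angle.
void sartPass(float* volume, const float* p, int M, float lambda) {
    for (int phi = 0; phi < M; ++phi) {
        project(volume, phi);
        correct(p, phi);
        backProject(phi);
        update(volume, lambda);
    }
}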
Clearly, the most computationally intensive processes of this algorithm are projection and back-projection, which require integral computation across a ray or inside a voxel. On the other hand, the correction and update processes are just matrix add/subtract operations, which are very fast. So the GPU acceleration should focus on the projection and back-projection processes.
3.2. SART on a Single GPU
Several papers have proposed GPU-based SART parallelization. [11] gives a detailed CUDA implementation of SART on NVIDIA graphics cards. In this implementation, the authors proposed two techniques: ray-driven projection and voxel-driven back-projection. Ray-driven means each ray in the projection stage is assigned a GPU thread to do the integral across this ray; voxel-driven means each voxel in the back-projection is assigned a GPU thread to handle its correction calculation. The correction step is computed in the ray-driven threads and the update step is computed in the voxel-driven threads.
These two techniques can be mapped onto GPU programming models trivially. In the projection step, each ray is assigned a thread, and this thread just reads the data from the voxels it penetrates and calculates the integral. The computation of one ray doesn't rely on, and will not affect, other rays' computation results. The voxel-driven back-projection step has the same property: each voxel can be calculated and updated independently.
So the implementation is trivial: just copy the data from host memory to device memory, then set up and launch the kernel threads. We will not discuss performance optimization on hardware; for details, please see [11].
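As a hedged sketch of this thread-per-ray mapping (not the exact kernel of [11]; voxelIndexOnRay() is a hypothetical placeholder for the ray/volume geometry, and interpolation and the weights w_in are simplified away):

#include <cuda_runtime.h>

// Hypothetical geometry helper: maps (ray, sample) to a voxel index.
// A real implementation would walk the ray through the volume grid.
__device__ int voxelIndexOnRay(int ray, int s) { return ray + s; /* placeholder */ }

// Sketch of a ray-driven projection kernel: one thread integrates one ray.
__global__ void projectRays(const float* volume, float* integrals,
                            int numRays, int samplesPerRay) {
    int ray = blockIdx.x * blockDim.x + threadIdx.x;
    if (ray >= numRays) return;

    float sum = 0.0f;
    for (int s = 0; s < samplesPerRay; ++s) {
        sum += volume[voxelIndexOnRay(ray, s)];   // accumulate the line integral p'_i
    }
    integrals[ray] = sum;                         // independent of all other rays
}

// Launch: one thread per ray, e.g. <<<(numRays + 255) / 256, 256>>>.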
3.3. SART on a Multi-GPU Graphics Card
To utilize more than one GPU to implement SART, we should divide the entire thread grid into several parts and map them to different GPUs. Because the intersection of the rays of the corresponding thread parts with the inspected object volume is not a cuboid, and varies with the angle, we should sync up the whole volume data between the different GPUs after the projection steps. For the same reason, we should sync up the projection plane data (p_i) after the back-projection steps.
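A minimal host-side sketch of this division (assuming two GPUs, one volume replica per GPU, and the host-staging copy from Section 2 for the sync-up; launchProjectionHalf() and syncVolumesThroughHost() are hypothetical wrappers, not a real API):

#include <cuda_runtime.h>

// Hypothetical wrappers for this sketch:
void launchProjectionHalf(int gpu, float* volumeReplica, int firstRay, int rayCount);
void syncVolumesThroughHost(float* volumeOnGpu0, float* volumeOnGpu1, size_t bytes);

// Sketch: split the ray grid of one angle across the two GPUs of one board,
// then sync the volume replicas through host memory (the PCIE bottleneck).
void multiGpuProjection(float* volumeOnGpu[], int numRays, size_t volumeBytes) {
    for (int gpu = 0; gpu < 2; ++gpu) {
        cudaSetDevice(gpu);   // each GPU handles half of the rays
        launchProjectionHalf(gpu, volumeOnGpu[gpu], gpu * numRays / 2, numRays / 2);
    }
    syncVolumesThroughHost(volumeOnGpu[0], volumeOnGpu[1], volumeBytes);
}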
Figure 2 shows how to map SART onto more than one GPU. The problem comes from the data update in each iteration, especially the volume data update. Because the GPU thread kernels are very fast, the data transmission between the two GPUs contributes a big part of the overall processing time. In our experiments on a platform with an NVIDIA GTX 690 and an Intel i7 3770K, we use a 256×256×256 volume (each voxel is 16-bit) projected to a 512×512 plane for simulation. In each iteration, the projection kernel takes about 0.8 ms and the back-projection kernel takes about 18.5 ms, but the memory copies take nearly 20.0 ms. In other words, each iteration takes about 40 ms.
Figure 2. Map SART onto More than One GPU
4. The Global Shared Memory Solution
To reduce the communication cost between GPUs on a single board, we have to establish an inner data connection between the GPUs. A simple solution is to let each GPU access the other GPUs' memory, but this will raise two serious problems: the cache consistency problem and the read-after-write consistency problem. The cache consistency problem is caused by each GPU on the same board having an individual L1 cache, and the read-after-write problem is caused by the order of memory accesses not being guaranteed, especially considering the fact that the latency of accessing video memory from the GPU memory controller is very high (usually hundreds of GPU clocks).
Here we propose a design idea to address this problem. We call this an idea because we haven't really built a multi-GPU board according to this design; we can only present this idea's concept and analyze it theoretically. The design is to add a multi-channel memory to the multi-GPU board, and this memory is only for transferring data between different GPUs. This multi-channel memory should have multiple interfaces, including one common interface shared by the different GPUs, which is connected with an FPGA arbitration circuit, and several other interfaces connected independently with each dedicated GPU's frame buffer. We could call it "Shared Memory", but to distinguish it from the CUDA Shared Memory of a stream multiprocessor (SM), we call this memory Global Shared Memory. The "Global" means it is not only shared by the processors in one SM, but shared by all the GPUs. Figure 3 gives a diagram of this design:
Figure 3. Diagram of the Global Shared Memory Design
The FPGA arbitration circuit controls the access to this Global Shared Memory (GSM) and makes sure only one GPU can access the GSM at any one moment. It realizes the lock() and unlock() functionality for the GSM. The access control to the GSM should be very fast, for a low-latency GSM read/write access response. This is why we use an on-board FPGA, not the CPU, to control the GSM: the CPU's control signals would have to go through PCIE to each GPU, and the same for the GPUs' responses, which would result in high latency.
With this GSM, a memory copy from GPU 1 to GPU 2 in a dual-GPU system can be described as follows (a code sketch of this protocol is given after the list):
1. GPU1 requests to lock the global shared memory, and waits until success.
2. GPU1 reads its local device memory and writes to the global shared memory.
3. GPU1 unlocks the global shared memory if it's full or the copy is finished.
4. GPU2 requests to lock the global shared memory, and waits until success.
5. GPU2 reads the global shared memory and writes to its device memory.
6. GPU2 unlocks the global shared memory if all data in the global shared memory has been read to local device memory or the copy is finished.
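Since no such hardware exists yet, the following is only a hypothetical sketch of this protocol; gsmLock(), gsmUnlock(), gsmWrite() and gsmRead() are invented names for the primitives the FPGA arbitration would expose, and the loop chunks a copy larger than the GSM:

#include <cstddef>

// Invented primitives (no real hardware or driver exposes these):
void gsmLock(int gpu);                               // spin until the FPGA grants the lock
void gsmUnlock(int gpu);                             // release the GSM
void gsmWrite(int gpu, size_t offset, size_t bytes); // device memory -> GSM
void gsmRead(int gpu, size_t offset, size_t bytes);  // GSM -> device memory

// Hypothetical sketch of the six-step GSM copy protocol from GPU1 to GPU2.
void gsmCopy(int srcGpu, int dstGpu, size_t bytes, size_t gsmSize) {
    for (size_t off = 0; off < bytes; off += gsmSize) {
        size_t chunk = (bytes - off < gsmSize) ? (bytes - off) : gsmSize;
        gsmLock(srcGpu);                 // step 1
        gsmWrite(srcGpu, off, chunk);    // step 2: GPU1 local memory -> GSM
        gsmUnlock(srcGpu);               // step 3: GSM full or copy finished
        gsmLock(dstGpu);                 // step 4
        gsmRead(dstGpu, off, chunk);     // step 5: GSM -> GPU2 local memory
        gsmUnlock(dstGpu);               // step 6
    }
}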
5. Experiments and Conclusion
Because we don't have a real graphics card with this design, we can only evaluate the performance improvement brought by this design on a common multi-GPU graphics card. In our experiments, we used a dual-GPU NVIDIA GeForce GTX 690. We allocate a 256MB buffer in both GPU1's memory and GPU2's memory to simulate a 256MB Global Shared Memory. To simulate the real operation of copying data from GPU1 to GPU2 through the GSM, we first let GPU1 copy the data in GPU1's video memory to the 256MB buffer in GPU1; after GPU1's copy is finished, we let GPU2 copy data of the same size from the 256MB buffer in GPU2 to the destination buffer of GPU2. Note that this process doesn't perform a real copy from GPU1 to GPU2; it just simulates the performance of this copy, so the output will be incorrect. However, this will not affect our evaluation of the performance.
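A minimal sketch of this simulation with the CUDA runtime API (buffer names are hypothetical; each GPU copies within its own memory, standing in for one side of the GSM transfer):

#include <cuda_runtime.h>

// Sketch of the GSM performance simulation on a dual-GPU card.
// No data actually crosses the GPUs, so the result is incorrect by design;
// only the timing behavior of the copy is of interest.
void simulateGsmCopy(void* gsmOnGpu1, const void* srcOnGpu1,
                     void* dstOnGpu2, const void* gsmOnGpu2, size_t bytes) {
    cudaSetDevice(0);   // GPU1: source buffer -> its local 256MB "GSM" buffer
    cudaMemcpy(gsmOnGpu1, srcOnGpu1, bytes, cudaMemcpyDeviceToDevice);
    cudaDeviceSynchronize();   // GPU2 starts only after GPU1's copy finishes

    cudaSetDevice(1);   // GPU2: its local 256MB "GSM" buffer -> destination
    cudaMemcpy(dstOnGpu2, gsmOnGpu2, bytes, cudaMemcpyDeviceToDevice);
    cudaDeviceSynchronize();
}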
The real memory copy from GPU1 to GPU2 with the GSM would show some performance difference from our experiments, but the difference would not be big. Analyzed theoretically, the speed of a memory copy with the GSM is determined by the memory clock, the memory interface width and the latency of the FPGA controller. Because the FPGA controller and the two GPUs are directly inter-connected on the board, we can expect that the latency will be very low. The FPGA controller latency should contribute little to the overall memory copy time when the data to be copied is relatively large.
Again, in our experiment on a platform with an NVIDIA GTX 690 and an Intel i7 3770K, with a 256×256×256 volume (each voxel is 16-bit) projected to a 512×512 plane for simulation, in each iteration, while the kernels' execution time remains the same, the time used by data transmission is reduced from 20.0 ms to nearly 1.0 ms, and the overall time is reduced from nearly 40.0 ms to nearly 20.0 ms. Almost half the time can be saved. We can find that the memory copy between GPUs will no longer be the bottleneck for overall performance.
Acknowledgements
This research was financially supported by the ShenZhen International Cooperation Project of Science and Technology Research (NO: ZYA201007070116A), and the Science and Technology Planning Project of ShenZhen (NO. JC200903180648A).
References
[1] D Luebke, G Humphreys. How GPUs Work. Computer. 2007; 40(2): 96-100.
[2] P Hanrahan. Why are Graphics Systems so Fast?. Proceedings of the 18th International Conference on Parallel Architectures and Compilation Techniques. New York. 2009; 9: 34-36.
[3] W Mark. Future Graphics Architectures. ACM Queue. 2008; 23(3): 56-64.
[4] J Nickolls, WJ Dally. The GPU Computing Era. IEEE Micro. 2010; 30(2): 56-69.
[5] K Fatahalian, M Houston. A Closer Look at GPUs. Communications of the ACM. 2008; 51(10): 50-57.
[6] B Block, P Virnau, T Preis. Multi-GPU Accelerated Multi-spin Monte Carlo Simulations of the 2D Ising Model. Computer Physics Communications. 2010; 181(4): 1549-1556.
[7] JJ Enos, G Shi, M Showerman. GPU Clusters for High-Performance Computing. Proceedings of Cluster Computing and Workshops. Beijing. 2009; 4: 1-8.
[8] R Gordon, R Bender, GT Herman. Algebraic Reconstruction Techniques (ART) for Three-dimensional Electron Microscopy and X-ray Photography. Journal of Theoretical Biology. 1970; 29(3): 471-481.
[9] AH Andersen, AC Kak. Simultaneous Algebraic Reconstruction Technique (SART): A Superior Implementation of the ART Algorithm. Ultrasonic Imaging. 1984; 6(1): 81-94.
[10] AC Kak, M Slaney. Principles of Computerized Tomographic Imaging. Classics in Applied Mathematics. 2001; 33(1): 329-332.
[11] Y Lu. Accelerating Algebraic Reconstruction Using CUDA-Enabled GPU. Proceedings of Computer Graphics, Imaging and Visualization. Hong Kong. 2009; 11: 480-485.