International Journal of Electrical and Computer Engineering (IJECE)
Vol. 8, No. 6, December 2018, pp. 5457~5471
ISSN: 2088-8708, DOI: 10.11591/ijece.v8i6.pp5457-5471
Journal homepage: http://iaescore.com/journals/index.php/IJECE
Real-Time Implementation and Performance Optimization of Local Derivative Pattern Algorithm on GPUs

Nisha Chandran S, Durgaprasad Gangodkar, Ankush Mittal
Department of Computer Science and Engineering, Graphic Era University, India
Article Info

Article history: Received Jan 21, 2018; Revised May 28, 2018; Accepted Jun 14, 2018

ABSTRACT
Pattern based texture descriptors are widely used in Content Based Image Retrieval (CBIR) for efficient retrieval of matching images. Local Derivative Pattern (LDP), a higher order local pattern operator originally proposed for face recognition, encodes the distinctive spatial relationships contained in a local region of an image as the feature vector. LDP efficiently extracts finer details and provides efficient retrieval; however, it was proposed for images of limited resolution. Over the period of time, developments in digital image sensors have paved the way for capturing images at a very high resolution. The LDP algorithm, though very efficient in content-based image retrieval, did not scale well when capturing features from such high-resolution images, as it becomes computationally very expensive. This paper proposes how to efficiently extract parallelism from the LDP algorithm and strategies for optimally implementing it by exploiting some inherent General-Purpose Graphics Processing Unit (GPGPU) characteristics. By optimally configuring the GPGPU kernels, image retrieval was performed at a much faster rate. The LDP algorithm was ported onto a Compute Unified Device Architecture (CUDA) supported GPGPU and a maximum speedup of around 240x was achieved as compared to its sequential counterpart.
Keyword: CBIR, CUDA, GPGPU, Local Derivative Pattern, Parallelization, Texture Descriptor

Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Nisha Chandran S.,
Department of Computer Science and Engineering,
Graphic Era University, Dehradun 248002, Uttarakhand, India.
Email: nisha.unnikrishnan@gmail.com
1. INTRODUCTION
In Content Based Image Retrieval (CBIR), images are indexed automatically by their own visual content instead of text-based keywords. The basic step involved in CBIR is extracting and indexing the visual content (low level features) from the images. Feature vector extraction is performed as a pre-processing step in most of the CBIR systems. An efficient image feature descriptor must have excellent discrimination property to discriminate various image classes, and it must be computationally inexpensive. Different feature descriptors like color, shape and texture are used for efficient image retrieval in CBIR systems. Among the various visual feature descriptors, the texture descriptor plays a significant role in feature extraction; it refers to the visual patterns that have the homogeneity property. A texture descriptor represents the structural arrangement of surfaces in relation to the surrounding environment. Pattern based texture features like Local Binary Pattern (LBP) [1] represent images based on first order derivative micro patterns, which apparently fail to extract the detailed information contained in the images. Researchers therefore further investigated the possibility of using higher order operators for capturing more detailed information. The LDP operator proposed by Zhang et al. [2] is a texture descriptor which encodes higher order derivative information. LDP represents the higher order pattern information of a local region by encoding the change in the neighbourhood derivative directions and therefore contains more detailed discriminative features than the first order derivatives used in LBP. However, LDP, being a higher order operator, produces a
very large feature vector, which makes the algorithm computationally very expensive. Currently, with the advancements in digital signal technology, images are being captured at a very high resolution, and extracting LDP features using the current general-purpose hardware from large databases of such high-resolution images becomes very challenging. This limits the applicable domains of this powerful algorithm with high discriminative property. LDP, like most image processing algorithms, performs the same computation on a number of pixels, which is a typical data parallel operation. Therefore, computing the LDP features from different regions of the high-resolution image in parallel will reduce the computation time of the algorithm, thereby making it efficient for real time applications.

Currently, the Graphics Processing Unit (GPU) has entered the General-Purpose Computing Domain (GPGPU). In comparison to the single processor CPU, GPGPUs have very high computation power. They are inexpensive and more scalable than CPUs. The highly data parallel image processing algorithms exploit the Single Instruction Multiple Data (SIMD) architecture and can be efficiently parallelized on a GPGPU [3]. However, traditional GPGPU development is based on graphics function libraries like OpenGL and Direct3D, which restricts its use to professionals who are familiar with graphics APIs. The emergence of Compute Unified Device Architecture (CUDA), NVIDIA's parallel architecture implementation, removes these inconveniences as it provides APIs for programmers to develop parallel applications using the C programming language. GPUs are inexpensive, scalable and have very high computation power compared to single processor CPUs. GPUs are now widely used for accelerating many efficient CBIR algorithms [4], [5] in the field of image vision, medical imaging [6], remote sensing [7] etc.

This paper proposes a parallel implementation of the LDP algorithm on NVIDIA GPUs. The implementation exploits features like the global memory, shared memory, L1 cache, CUDA streams and optimal use of registers. We have measured the GPU implementations on Fermi as well as the latest Kepler architecture. To the best of our knowledge this is the first reported effort to parallelize the LDP algorithm using GPUs. Main contributions of the paper are summarized as follows: (1) We present an efficient strategy for parallel calculation of the LDP feature vector of an image by optimally using the GPU global memory and the faster shared memory. (2) We have further tuned the speedup of LDP computations by using the L1 data cache. (3) We have accelerated the GPU operations using the concept of CUDA streams. (4) We have exploited the asynchronous nature of CUDA kernel calls by deploying calculation on the host while the device executes the GPU kernels, thereby extracting enhanced parallelism.

The rest of the paper is organized as follows. We begin with a discussion on the related work in section 2. Section 3 reviews the mathematical foundation of the LDP algorithm and describes the main concepts of programming GPUs using NVIDIA's CUDA model. In section 4, we present our approach for computing the LDP using the GPU. Finally, section 5 presents the experiments done and a detailed analysis of the results that demonstrates the efficiency of our GPU algorithm, and section 6 draws some conclusions about the presented work and suggests some optimizations as future work.
2. RELATED WORK
The most computationally expensive part of CBIR algorithms is the feature extraction. Color, texture or shape features, or a combination of these, are generally extracted from the images for efficient image matching and retrieval. GPGPUs are now being efficiently used for implementing this computationally expensive feature extraction task as well as the matching task in CBIR algorithms. It is a well-known fact that GPUs perform much faster than CPUs and attain excellent speedup for image processing applications which are specifically tailored for parallel programming. Several notable works showing the advantage of using a GPU, and the substantial amount of speedup obtained over the sequential version, are available in the literature for such applications. Basic image processing algorithms like histogram equalization, Discrete Cosine Transform (DCT) encode and decode, cloud removal etc. are efficiently parallelized using CUDA and are reported to have attained excellent speedup [8]. In Ref. [9] a parallel version of the fractal image encoding algorithm is built on the GPU and the results indicate the encoding was achieved in milliseconds without compromising on the quality of the decoded image. Tek et al. [10] efficiently parallelized a face recognition framework using CUDA and reported a speedup of 29x. Gangodkar et al. in [11] implemented a parallel version of fast normalized cross correlation for real time template matching, which attained a speedup of 17x compared to the sequential execution. In Ref. [12] a parallel algorithm for accelerating the image inpainting process is proposed. The intense computational task of processing the fourth order PDE equations is accelerated by 36x using the NVIDIA GEFORCE GTX 670 GPU card. A parallel implementation of content-based video search and retrieval of videos is proposed in Ref. [13]. The
authors achieved around 80% accuracy in video search and retrieval and a speedup of 5x compared to the sequential version of the algorithm. In Ref. [14] a parallel implementation of the audio compression algorithm Apple Lossless Audio Codec (ALAC) is presented and a speedup of 80-90% is achieved. Heidari et al. [15] proposed a parallel implementation of color moments and texture features and achieved a speedup of 144x over the sequential implementation. Zhu et al. [16] have parallelized the SURF algorithm and have achieved a speedup of 15x over its sequential counterpart. In [17] Prisacariu and Reid have implemented Histogram of Oriented Gradients (HOG) on a GPU and have achieved a speedup of 67x. In [18] Park et al. explored the design and implementation issues of image processing algorithms on GPUs with the CUDA framework. They have selected multi-view stereo matching, linear feature extraction, JPEG 2000 image encoding and non-photorealistic rendering as example applications and efficiently parallelized them on the GPU. In [19] the authors proposed an implementation of Ising simulation work using the CUDA framework. They have attained a performance improvement of 20-40 times compared to the conventional CPU implementation. In [20] the authors have implemented a parallel algorithm on a GPU for value prediction and speculative execution. They have observed that software value prediction techniques did not attain much performance gain, whereas the software speculation techniques attained considerable performance gain.

However, optimizing a parallel GPU code is not a trivial task. To achieve good performance the programmer should fine tune the code taking into account the underlying architecture. Several tuning strategies using different aspects of the GPU architectures have been discussed in the standard CUDA books [21]-[23]. In [24] the authors give insights into the relationship between occupancy, thread block size and shape, cache hierarchy configuration and thread access pattern to global memory. By taking simple matrix multiplication as an example they relate the thread block size and shape to both cache misses and occupancy. Lobeiras et al. in [25], by taking the Fast Fourier Transform (FFT) algorithm as an example, have analyzed different aspects of the GPU architectures, like the influence of memory access patterns and register pressure on performance. They have shown that the performance of register-based solutions is much better than the shared memory due to the higher register bandwidth. In [26] Liang et al. have fine tuned their 3D localization application on a GPU by optimizing the number of threads per thread block and registers per thread. By experimenting with different values of the thread block size and the number of registers they have found an optimal setting which gives the best performance. Torres et al. in [27] have studied the relation between different thread block sizes and shapes and the global memory and cache memory access patterns. In [28] Lee et al. have discussed optimization strategies for compute bound as well as memory bound algorithms. They have optimized the compute bound algorithms by reducing the number of registers and by increasing the thread block size, whereas memory bound algorithms are optimized by storing the data in limited GPU resources like constant or shared memory in an optimized way.
3. LDP DESCRIPTOR AND USAGE FOR IMAGE RETRIEVAL
The LDP algorithm contains very detailed discriminative features of an image as it encodes the higher order derivative information. For an image I(Z), the first order derivatives along the 0°, 45°, 90° and 135° directions are denoted as I′_α(Z) where α = 0°, 45°, 90° and 135°. Let us consider a pixel Z₀ in I(Z) and its eight neighbors Zᵢ, where i = 1…8, as shown in Figure 1a. The four first order derivatives [2] for the pixel Z₀ along the four different directions are given as in (1) to (4).
I′_{0°}(Z₀) = I(Z₀) − I(Z₄)   (1)

I′_{45°}(Z₀) = I(Z₀) − I(Z₃)   (2)

I′_{90°}(Z₀) = I(Z₀) − I(Z₂)   (3)

I′_{135°}(Z₀) = I(Z₀) − I(Z₁)   (4)
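The four directional derivatives in (1)-(4) are plain pixel differences and can be evaluated for a whole image at once. Below is a minimal NumPy sketch (illustrative only, not the paper's CUDA code); the layout assumed for Z₁…Z₄ (top-left, top, top-right, right of Z₀) follows Figure 1a, and border pixels are dropped so that every remaining pixel has all its neighbors.

```python
import numpy as np

def first_order_derivatives(img):
    """Four first-order directional derivatives of eqs. (1)-(4).

    Assumed neighbor layout (per Figure 1a): Z1 top-left, Z2 top,
    Z3 top-right, Z4 right. Borders are trimmed by one pixel.
    """
    img = img.astype(np.int32)
    c = img[1:-1, 1:-1]              # Z0 for every interior pixel
    return {
        0:   c - img[1:-1, 2:],      # (1)  I'_0   = I(Z0) - I(Z4)
        45:  c - img[:-2, 2:],       # (2)  I'_45  = I(Z0) - I(Z3)
        90:  c - img[:-2, 1:-1],     # (3)  I'_90  = I(Z0) - I(Z2)
        135: c - img[:-2, :-2],      # (4)  I'_135 = I(Z0) - I(Z1)
    }

img = np.arange(16).reshape(4, 4)    # toy 4x4 ramp image
d = first_order_derivatives(img)
```

On the ramp image every 0° difference is -1, every 90° difference is 4, and the diagonal differences are 3 and 5, which makes the direction conventions easy to check by hand.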
The directional differential channels for the four directions are as shown in Figure 1b. The second order directional LDP [2] for the pixel Z₀, LDP²_α(Z₀), in the α direction is given as in (5).

LDP²_α(Z₀) = { f(I′_α(Z₀), I′_α(Z₁)), f(I′_α(Z₀), I′_α(Z₂)), …, f(I′_α(Z₀), I′_α(Z₈)) }   (5)
where f(·, ·) is a binary encoding function which determines the transitions in the local pattern and is given as in (6).

f(I′_α(Z₀), I′_α(Zᵢ)) = 0, if I′_α(Zᵢ) · I′_α(Z₀) > 0;  1, if I′_α(Zᵢ) · I′_α(Z₀) ≤ 0,   i = 1, 2, …, 8   (6)

where '·' represents the multiplication operator.
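The encoding in (5) and (6) can be sketched as follows. The clockwise Z₁…Z₈ scan starting at the top-left neighbor is an assumption based on Figure 1a, and the function names are ours, not the paper's.

```python
import numpy as np

def f(a, b):
    """Binary encoding function of eq. (6): 0 when the two derivative
    values agree in sign (product > 0), 1 otherwise."""
    return 0 if a * b > 0 else 1

def second_order_ldp(deriv, y, x):
    """Eq. (5): 8-bit pattern comparing the first-order derivative at
    Z0 = (y, x) against its eight neighbors Z1..Z8 (assumed clockwise
    from the top-left, per Figure 1a)."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]   # Z1..Z8
    return [f(deriv[y, x], deriv[y + dy, x + dx]) for dy, dx in offsets]

# Toy 3x3 patch of a first-order derivative image
deriv = np.array([[ 2, -1,  3],
                  [ 4,  5, -2],
                  [-3,  6,  1]])
pattern = second_order_ldp(deriv, 1, 1)   # [0, 1, 0, 1, 0, 0, 1, 0]
```

Each bit is 1 exactly where the neighbor's derivative flips sign (or is zero) relative to the center, which is the "transition" the text describes.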
The third order local derivative pattern is computed using the second order derivatives. The third order directional LDP [2] for the pixel Z₀, LDP³_α(Z₀), in the α direction is given as in (7).

LDP³_α(Z₀) = { f(I″_α(Z₀), I″_α(Z₁)), f(I″_α(Z₀), I″_α(Z₂)), …, f(I″_α(Z₀), I″_α(Z₈)) }   (7)
In general, the nth order LDP is computed using the (n−1)th order derivatives. The nth order directional LDP [2] for the pixel Z₀, LDPⁿ_α(Z₀), in the α direction is given as in (8).

LDPⁿ_α(Z₀) = { f(I⁽ⁿ⁻¹⁾_α(Z₀), I⁽ⁿ⁻¹⁾_α(Z₁)), f(I⁽ⁿ⁻¹⁾_α(Z₀), I⁽ⁿ⁻¹⁾_α(Z₂)), …, f(I⁽ⁿ⁻¹⁾_α(Z₀), I⁽ⁿ⁻¹⁾_α(Z₈)) }   (8)
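Since each derivative order is one further directional differencing of the previous one, the (n−1)th order input required by (8) can be produced by repeated differencing. A small illustrative sketch for the 0° direction (function name and border handling are our choices):

```python
import numpy as np

def deriv_0deg(img, order):
    """order-fold 0-degree derivative: repeatedly apply
    I(Z0) - I(right neighbor). The nth order LDP of eq. (8) is built
    from the (n-1)th order derivative, e.g. a third order LDP
    consumes deriv_0deg(img, 2). Width shrinks by one per order."""
    d = img.astype(np.int32)
    for _ in range(order):
        d = d[:, :-1] - d[:, 1:]
    return d

row = np.array([[1, 3, 6, 10]])
first = deriv_0deg(row, 1)     # [[-2, -3, -4]]
second = deriv_0deg(row, 2)    # [[1, 1]]
```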
(a) (b)
Figure 1. (a) Schematic showing the location of the pixel Z₀ and the neighboring pixels Z₁…Z₈ used in the LDP algorithm [2], (b) Schematic showing four directional differential channels
Higher order LDPs extract higher spatial frequencies that contain finer details of an image. This is very useful for image retrieval and pattern matching operations. However, if the image contains noise, the higher order LDPs also enhance the noise, especially from the fourth order and above. LDP thus represents the higher order pattern information by labeling the pixels of an image by comparing two derivative directions at neighboring pixels and concatenating the results into a 32-bit binary sequence. Figures 2b and 2c show the visualization of the second order and third order LDP in the zero direction, respectively, for an example image shown in Figure 2a. As shown in the figures, the third order LDP gives more detailed information than the second order. We have used the third order LDP in our experiments as it extracts higher spatial frequencies, thereby extracting finer details of an image. As our preliminary work we have implemented the LDP algorithm for medium sized images using only the GPU global memory [29]. The pseudo code of the LDP algorithm is given in Table 1 and its implementation using the global memory space can be found in [29].
3.1. GPU and NVIDIA CUDA
An overview of the basic GPGPU architecture is presented in this section. A GPGPU generally consists of hundreds of Streaming Processors (SPs) called cores, which are further grouped into Streaming MultiProcessors (SMPs) [21]. The cores perform data processing in the Single Instruction Multiple Thread (SIMT) model. Mapping of the algorithm to the GPU is done by first identifying the highly parallel phases of the algorithm. Those parallel phases are then offloaded to the GPU and are called the CUDA kernels. The rest of the phases are mapped onto the CPU. On completion of execution of a CUDA kernel the CPU collects the execution results and the GPU is then scheduled with a new kernel for execution. CUDA exploits data level parallelism by making every thread work on a different set of data. The lightweight CUDA threads combine to form thread blocks and the thread blocks combine to form a computational grid. The grids and the threads can be arranged in one, two or three dimensions. Each thread and thread block can be identified by unique ids, and the kernel is launched by a grid of thread blocks.
(a) (b) (c)
Figure 2. (a) Sample query image from Corel database, (b) Second order LDP of the query image in zero direction. Decimal equivalent of the binary pattern is shown, (c) Third order LDP of the query image in zero direction. Decimal equivalent of the binary pattern is shown
Fermi [30], the world's first complete GPU architecture, was a significant leap forward from the G80 chipset series. The SMPs in the Fermi architecture consist of 32 cores, and each of these can execute one floating-point or integer instruction per clock. The Kepler architecture, launched in 2010, is the latest NVIDIA generation of GPU architectures and is built on the foundation of Fermi. Each SMP in the Kepler architecture has 192 single precision CUDA cores, each one with floating-point and integer arithmetic logic units. Tesla K40c, launched in 2012, is NVIDIA's flagship compute GPU, based on the Kepler GK110 architecture [31]. With 5GB or 6GB of GDDR5 memory, they provide up to 3.95 TFLOPS single precision and 1.33 TFLOPS double precision floating point performance.
To hide the long off-chip memory access latencies, NVIDIA Fermi and Kepler GPUs include software managed caches, the shared memory, and hardware managed caches, the L1 data caches. Shared memory is on chip, and the CUDA C compiler treats variables in the shared memory differently from the typical variables. It creates a copy of the variable for each block that is launched on the GPU. Every thread in the block shares the memory and thus can communicate and collaborate on the computations done. Furthermore, shared memory and L1-D cache on a GPU utilize the same physical storage and their capacity can be configured at runtime [32]. Each SM in the Fermi and Kepler architecture has 64KB of on-chip memory, divided into shared memory and L1 cache. The programmers may choose between the two configurations: 48KB of shared memory and 16KB of L1 cache, or 16KB of shared memory and 48KB of L1 cache. However, an increase in cache memory reduces the amount of shared memory to 16KB, which can lead to a reduction in occupancy of the SMs.

In addition to shared memory and L1 cache, CUDA streams [33], [34] also play an important role in accelerating the computations done on the GPU. A CUDA stream represents a queue of GPU operations that get executed in a specific order. Both kernel launches and memory copies can be added into a stream. The order in which the operations are added to the stream specifies the order in which they will be executed. When using streams, instead of copying the input buffers entirely to the GPU, they are split into smaller chunks, and the memory copies to and from the GPU as well as the kernel launches are performed on each chunk.
Table 1. The pseudo code for the LDP calculation [29]
Algorithm: Pseudo code for the LDP calculation
Input: Query Image; Output: retrieval result
Step 1: Divide the input image of × pixels into blocks of × sub-images. Each sub-image has × pixels.
Step 2: For i = 1 to 2 do steps 3 to 5.
Step 3: For the image pixel Z₀, the location of which is as shown in Fig. 1, calculate the first order derivatives along the 0°, 45°, 90° and 135° directions as given in eqs. 1 to 4.
Step 4: Calculate the binary encoding function as given in eq. 6.
Step 5: Calculate the histograms of the binary patterns.
Step 6: Concatenate the histograms of all the sub-images to construct the feature vector of the image.
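The steps of Table 1 can be sketched end-to-end as follows. This is a simplified sequential illustration, not the paper's GPU implementation: it uses only the second order 0° pattern, an 8x8 tile size, and 256-bin histograms, all of which are our assumptions for the sketch.

```python
import numpy as np

def ldp_feature_vector(img, block=8):
    """Table 1 sketch: derivative (step 3), binary encoding packed into an
    8-bit code (step 4), per-tile histograms (steps 1, 5), concatenation
    (step 6). Second order, 0-degree direction only, for brevity."""
    img = img.astype(np.int32)
    # First-order 0-degree derivative, eq. (1): I(Z0) - I(right neighbor)
    d = img[:, :-1] - img[:, 1:]
    H, W = d.shape
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]        # assumed Z1..Z8 order
    codes = np.zeros((H - 2, W - 2), dtype=np.int32)
    ctr = d[1:H - 1, 1:W - 1]
    for bit, (dy, dx) in enumerate(offs):
        nb = d[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        codes |= ((ctr * nb <= 0).astype(np.int32)) << (7 - bit)  # eq. (6)
    feats = []
    for by in range(0, codes.shape[0] - block + 1, block):
        for bx in range(0, codes.shape[1] - block + 1, block):
            tile = codes[by:by + block, bx:bx + block]
            feats.append(np.bincount(tile.ravel(), minlength=256))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
fv = ldp_feature_vector(rng.integers(0, 256, size=(20, 20)))
```

For a 20x20 image the derivative map is 20x19 and the code map 18x17, so four 8x8 tiles fit, giving a 4 x 256 = 1024-bin feature vector whose total count equals the 256 coded pixels covered.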
4. OPTIMUM SIZING OF EES
The approach adapted to parallelize the LDP algorithm is to exploit the data level parallelism involved in the algorithm. The algorithm is divided into two parts: CPU (Host) and GPU (Device) processing, as shown in Figure 3. The image is read by the host and copied to the device memory; spatial histograms of the image are computed and copied back to the host memory. The images are divided into 16x16 sub blocks and the blocks are processed by the GPU. To show the computation of LDP a sample 7x7 block size is taken, as shown in Figure 4. A block of 9 threads is required to calculate the third order LDP of the black square region in Figure 4. The LDP patterns for the block are calculated using the binary encoding functions given
in Equation 7. The binary encoding functions for the pixel Z₁₆, the position of which is shown in Figure 4, for the four different directions 0°, 45°, 90° and 135° are given as in (9).

LDP³_α(Z₁₆) = { f(I″_α(Z₁₆), I″_α(Z₈)), f(I″_α(Z₁₆), I″_α(Z₉)), f(I″_α(Z₁₆), I″_α(Z₁₀)), f(I″_α(Z₁₆), I″_α(Z₁₇)), f(I″_α(Z₁₆), I″_α(Z₂₄)), f(I″_α(Z₁₆), I″_α(Z₂₃)), f(I″_α(Z₁₆), I″_α(Z₂₂)), f(I″_α(Z₁₆), I″_α(Z₁₅)) }, where α = 0°, 45°, 90° and 135°   (9)
4.1. GPU and NVIDIA CUDA
This section describes how our implementation parallelizes the LDP algorithm. The algorithm is divided into two main steps: the LDP histogram calculation and the histogram intersection. Of the two, the most critical step in total execution time is the LDP histogram calculation, and therefore all optimizations are done to speed up this kernel. The GPU algorithm thus consists of two kernels to execute the operations of the two basic steps of the LDP algorithm. The LDP_compute kernel computes the LDP values of all the pixels in the 16x16 block of the image along all the four directions 0°, 45°, 90° and 135°. The sum of the decimal equivalent of the patterns is then used to form the LDP histogram. Threads are allocated for each pixel of the image block and each thread calculates and stores the LDP histogram values. The histogram values are then transferred to the host for normalization.
Figure 3. Steps in the computation of LDP

Figure 4. Image block of size 7x7 to demonstrate the computation of a thread block. LDP of the region in the black rectangle is to be computed
The Hist_intersect kernel computes the histogram intersection of the normalized query image LDP histogram and the database image histogram. Histogram intersection [2] is used to measure the similarity between two histograms and is given by (10).

S(Q, D) = Σⁿᵢ₌₁ min(Qᵢ, Dᵢ)   (10)
where S(Q, D) is the histogram intersection statistic. This measure calculates the common parts of two histograms. Histogram intersection thus finds the common parts of the query image histogram and the database image histograms. Depending upon the size of the histograms, threads are allocated for comparing the elements in the query and the database image histograms.
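Histogram intersection per (10) reduces to a sum of bin-wise minima over the two normalized histograms. A minimal sketch with arbitrary example values:

```python
import numpy as np

def hist_intersect(q, d):
    """Eq. (10): similarity between two normalized histograms as the
    sum of bin-wise minima (their common parts)."""
    return np.minimum(q, d).sum()

q = np.array([0.2, 0.5, 0.3])   # normalized query histogram (example)
d = np.array([0.4, 0.4, 0.2])   # normalized database histogram (example)
sim = hist_intersect(q, d)      # min per bin: 0.2 + 0.4 + 0.2 = 0.8
```

Identical histograms give a score of 1.0 (for unit-sum histograms), and disjoint histograms give 0, so larger values mean closer matches.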
[Figure 3 flow: CPU — read query image, read database image histograms, normalize query image histogram, normalize database image histograms, display results; GPU — transfer image to GPU, compute LDP, compute histogram, transfer results to host, transfer histograms to GPU, compute histogram intersection, transfer results to host]
4.2. Design Space Exploration and Optimization
In the following sections we discuss in detail the architecture related design space exploration and optimization, focusing mainly on the CUDA thread block sizes, registers, memory spaces and CUDA streams.
4.2.1 Thread Block Size
The general methodology adopted when optimizing a GPU accelerated application is to increase the occupancy of the SMPs. It is essential to run as many threads as possible on the SMPs to hide the long memory latency. To increase occupancy either the number of registers per thread has to be decreased, or the number of threads in the thread blocks has to be adjusted according to the number of registers per thread. The occupancy of a GPU of compute capability 2.1 for 42 registers per thread is more than 68% using 128, 192 and 256 threads, whereas with 64 threads it is 33%. We explore the optimum thread block size for different image sizes that gives the maximum occupancy as well as the least execution time.
4.2.2. Registers
Reducing the number of registers used by the threads is another option to increase occupancy. The CUDA nvcc compiler performs large-scale register allocation to remove instruction dependencies and to maximize single-thread performance [27], [35]. Registers being a limited resource for a block, an incremental increase in register allocation will lead to running fewer thread blocks. Thus, there is a trade-off between the single-thread performance and the number of threads active at a time. In [35] the authors have used a metric, RegRatio, which is the ratio of the registers used per thread to the maximum allowed registers per thread. Using this metric, we have optimized the usage of the registers. We keep the RegRatio value between 0 and 1 and choose a high value of RegRatio so that there is a good trade-off between the single-thread performance and the number of simultaneously active threads. For a particular kernel, the maximum number of registers per thread can be specified to the nvcc compiler at compile time using the maxrregcount option. Theoretically, the actual register use might be less than the limit. We have also verified the performance gain achieved by RegRatio using a heuristic approach where, for each image size, thread size and block size, the number of registers is varied and the combination which gives the maximum occupancy and minimum kernel execution time is found.
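The heuristic sweep can be sketched as follows. Here `measure_time` is a hypothetical stand-in for timing one kernel build (e.g. recompiled with a given `--maxrregcount` limit), and the ceiling of 63 registers per thread used for the RegRatio denominator is an assumption, not a value from the paper.

```python
# Hypothetical sketch of the heuristic: time the kernel for each candidate
# register limit and keep the fastest, breaking ties by the higher RegRatio.
def best_register_limit(candidate_regs, measure_time, max_regs=63):
    scored = []
    for regs in candidate_regs:
        reg_ratio = regs / max_regs          # RegRatio metric from [35]
        scored.append((measure_time(regs), -reg_ratio, regs))
    return min(scored)[2]
```

In practice the sweep is repeated for each image size and thread block size, as described above.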
4.2.3. Shared Memory and L1 Cache
To further improve the performance of the LDP algorithm, in the next step we make use of the faster on-chip shared memory. The number of registers and the thread block size are kept at the optimum values obtained from the earlier study. Further, we choose between the two configurations, 48 KB of shared memory with 16 KB of L1 cache or vice versa, to study the performance improvement obtained by using L1 cache instead of shared memory.
4.2.4. CUDA Streams
Fermi and Kepler architecture-based GPUs support device overlap, which is the capacity to simultaneously execute a CUDA kernel while performing a memory copy operation between device and host memory. This is facilitated using CUDA streams. The use of streams is very profitable in applications where input data instances are independent. A stream is a sequence of commands that is executed in order. By default, a CUDA program has only one stream, called the default stream [22], [23]. When multiple streams are involved, the order in which the operations are done becomes an important issue. There are two basic approaches used to order the operations: breadth first and depth first. In the depth-first approach all operations of a given stream are grouped together, whereas in the breadth-first approach similar operations of different streams are grouped together. Correct results are obtained with both approaches, though they perform differently on specific generations of GPU [34]. In [35] the authors have shown that for a certain application which uses D input data instances and defines B blocks of threads for kernel execution, the data instances can be broken into nStream streams. Each stream then works with D/nStream data instances and B/nStream blocks, and the memory copy of one stream overlaps the kernel execution of the other stream, achieving a performance improvement. In our application the value of nStream varies depending on the size of the image.
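The two issue orders can be illustrated with a small sketch; the operation names below are placeholders for each stream's host-to-device copy, kernel launch and device-to-host copy.

```python
# Depth first: all operations of one stream are issued before the next stream.
def depth_first(n_streams, ops=('H2D', 'kernel', 'D2H')):
    return [(s, op) for s in range(n_streams) for op in ops]

# Breadth first: one kind of operation is issued across all streams before
# moving on to the next kind of operation.
def breadth_first(n_streams, ops=('H2D', 'kernel', 'D2H')):
    return [(s, op) for op in ops for s in range(n_streams)]
```

Both orders perform the same work; only the interleaving, and hence the opportunity for copy/compute overlap on a given GPU generation, differs.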
5. EXPERIMENTS AND ANALYSIS OF RESULTS
All experiments in this paper are done on two different machines. Preliminary experiments are done on an Intel Core i3-2375M with 4 GB RAM and a GeForce 720m (GPU 1) and are further repeated on an Intel Xeon CPU E5-1650 with 64 GB RAM and a Tesla K40C (GPU 2). The authors of the original paper [2] have reported the CPU execution time for the sequential LDP algorithm as 0.18 sec for an image of size 88x88 pixels on a PC with a 3 GHz CPU and 2 GB RAM. In our experiments we have used our own sequential implementation of the algorithm.
Using our implementation, the time taken for 88x88 pixels on GPU 1 is 0.078 sec and the time taken on GPU 2 is 0.036 sec. We have used high-resolution images ranging from sizes 256x256 to 4096x4096 pixels for our experiments. The experiments for medium image sizes using the global memory on GeForce 720m are presented in [29]. In this paper the LDP algorithm is implemented on large-sized images using the fast shared memory and further improved using the concept of L1 data cache and CUDA streams.
GeForce 720m is a card of compute capability 2.1; a thread block can contain a maximum of 1024 threads and every SMP can process a maximum of 1536 threads. Tesla K40C is a card of compute capability 3.5, in which a thread block can contain a maximum of 1024 threads and every SMP can process a maximum of 2048 threads. The test images are taken from the Corel database [36] and the Essex face database [37]. The Corel database consists of various categories like animals, cars, distractions and transport. For our experiments we have collected 1000 images from various categories of the Corel database to form the dataset. The Essex face database consists of 20 images each of 153 persons. We have taken 10 images of each of the 153 persons for our experiments.
The approach followed in our parallel implementation is to achieve almost the same retrieval results as those of the sequential implementation. The images are divided into non-overlapping image blocks of size 16x16. The experiment is repeated for different image sizes and the performance gain of the kernel is analyzed. CUDA Visual Profiler is used to analyze the CUDA kernel performance [38]. We have developed two kernels for image retrieval, one for LDP histogram computation and another for the similarity matching using histogram intersection. For the implementation, the query image is copied to the GPU global memory. The image is divided into 16x16 blocks and the LDP histogram values of each block are calculated. We have chosen a block size of 256 threads so that every SMP in GPU1, with compute capability 2.1, can schedule six blocks, whereas GPU2, with compute capability 3.5, can schedule eight blocks by optimum resource utilization. The kernel is launched with the necessary threads to calculate the LDP histogram values. The computed LDP histogram is then transferred back to the CPU for normalization.
The matching is then done by the second kernel using the histogram intersection method. To keep the CPU from being idle while the GPU is calculating, we divide the task of histogram intersection between the two in such a way that the calculation is completed in minimum time. This is done by launching the kernel asynchronously, i.e. whenever the kernel is executing on the GPU, the CPU is not blocked and is free to perform its own computations.
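As a reference for what the matching step computes, a minimal sequential sketch of histogram intersection is given below. Normalising the sum of bin-wise minima by the database histogram's total is one common convention and is an assumption here, not a detail taken from the paper.

```python
# Histogram intersection: sum of bin-wise minima, normalised so that
# identical histograms score 1.0.
def histogram_intersection(query_hist, db_hist):
    overlap = sum(min(q, d) for q, d in zip(query_hist, db_hist))
    return overlap / sum(db_hist)
```

The GPU kernel parallelises this bin-wise minimum and reduction across threads; the scalar version above only fixes the arithmetic being distributed.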
Our strategy for balancing the workload between the CPU and the GPU is given below. Let R be the ratio of the time taken by the CPU to perform the task to the time taken by the GPU to perform the task, i.e. R = t_CPU / t_GPU. If T is the total number of images to be processed, then the number of images allotted to the GPU is given by T*R/(1+R) and the number of images allotted to the CPU is given by T/(1+R).
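The split can be computed directly from these formulas; rounding to whole images is an implementation detail assumed here.

```python
# CPU/GPU workload split: with R = t_cpu / t_gpu, the GPU receives
# T*R/(1+R) images and the CPU the remaining T/(1+R).
def split_workload(total_images, t_cpu, t_gpu):
    R = t_cpu / t_gpu
    gpu_images = round(total_images * R / (1 + R))
    return total_images - gpu_images, gpu_images   # (to CPU, to GPU)
```

For example, a 19x GPU speedup over 1000 images yields a 50/950 split, consistent with the allocation reported for 256x256-pixel images.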
Figure 5 shows the percentage of allocation of the images to the CPU and GPU. The ratios CPU/T = 1/(1+R) and GPU/T = R/(1+R) are plotted against the logarithmic values of the speedup. From Figure 5 it can be seen that if the execution times taken by the CPU and the GPU are almost the same, then the images can be equally divided between the CPU and the GPU. In our experiments we allotted 50 images of size 256x256 pixels to the CPU and 950 images to the GPU. For images of size 4096x4096 we allotted 30 images to the CPU and 970 to the GPU. As GPU2, the Tesla machine, is more powerful and more suitable for scientific calculations, we have done all the computations on the GPU. In the following section we discuss the performance improvement achieved by our design space exploration.

Figure 5. Optimal workload balance between CPU and GPU
5.1. Optimizing the Number of Threads and Registers
We have found optimized values for registers using the method discussed in section 4.2.2. We consider threads per thread block (N) and register allocation per thread (R) as input variables. The default register use (without a register limit) of our LDP implementation is 42. The maximum value for N is 768 due to the constraint of register use. We explore various combinations of N and R values and analyze the performance improvement. Only a subset of the results is displayed here. As shown in Figures 6a and 6b, the design space exhibits high variation in terms of execution time. The overall performance critically depends on both N and R. For example, for an image of size 256x256 the optimal setting is N=128 and R=23, whereas for an image of size 1024x1024 the optimal setting is at N=256 and R=32. We have made an optimal choice by taking into consideration the maximum occupancy and minimum time. With the default number of registers, 42 in our case, the maximum number of active threads per multiprocessor is limited to 768 in a device of compute capability 2.1. By decreasing the number of registers to the optimal value, i.e. 23, with a thread block size of 128, the active number of threads per multiprocessor is increased to 1024. In a device of compute capability 3.5 the number of threads is increased from 1280 to 2048 with the optimal choice of the number of registers per thread and the number of threads per thread block.
(a) (b)

Figure 6. (a) Optimal no. of registers and thread block size for an image of size 256x256, (b) Optimal no. of registers and thread block size for an image of size 1024x1024
5.2. Implementation using Global Memory
The first version of the algorithm was implemented using the global memory space. The detailed implementation procedure of the LDP algorithm using the global memory and the time comparison between the sequential and parallel versions for medium image sizes on GPU1 was presented in [29]. Optimized numbers of registers and threads were considered for each image size. Figure 7 shows the time comparison of LDP_compute on GPU2 for image sizes 256x256, 512x512, 1024x1024, 2048x2048 and 4096x4096. As shown in Figure 7, the CUDA kernel gives more performance gain for higher-size images than for smaller images. This is because the host memory to device memory copy takes a considerable amount of time and the speedup obtained must pay off this copy overhead. In the case of smaller-sized images such as 256x256 pixels or 512x512 pixels, the speedup obtained is not enough to pay off this copy overhead. However, as the image size increases to 2048x2048 pixels and 4096x4096 pixels, this overhead is paid off and the performance gain increases.
Figure 7. Time comparison between the sequential version and the parallel version on Tesla K40c when global memory is used. Logarithmic scales are used for plotting
This shows that GPUs give excellent performance for large workloads which are highly parallel and throughput oriented. However, the large number of registers required by each thread will limit the number of active threads per multiprocessor. By reducing the number of registers per thread and by selecting an ideal thread block size, the number of active threads per multiprocessor can be increased, which in turn will increase the overall performance.
5.3. Implementation using Shared Memory and L1 Cache
The next optimization done is to use the software-managed, faster on-chip shared memory instead of the global memory. Shared memory enables inter-thread data reuse, as the threads in a thread block share the data stored in the shared memory. In our implementation we have carefully divided the images in such a way that the 48 KB per-block constraint of shared memory is never violated, and we also use the maximum shared memory available for processing. The image is divided into 16x16 tiles and seven such tiles are copied to the shared memory for LDP computation. The sums of the LDP values in the four directions are then written to the shared memory for further use. The total amount of shared memory used per block is 7 KB. We have taken the thread block size as 128 in GPU1 so that 6 blocks are active in one SMP and the total shared memory used per SMP is 42 KB. This falls within the 48 KB limit and at the same time schedules the maximum number of threads possible.
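The shared-memory budget described above can be checked with a short calculation; the 4-byte size per tile element is an assumption (the paper states only the resulting 7 KB per block).

```python
# Shared memory per block: seven 16x16 tiles of 4-byte elements = 7 KB, so
# six resident blocks per SMP use 42 KB, within the 48 KB limit.
def smem_per_block_bytes(tiles=7, tile_dim=16, bytes_per_elem=4):
    return tiles * tile_dim * tile_dim * bytes_per_elem
```

This kind of budget check is how the tile count was bounded: the number of tiles per block is chosen so that blocks_per_SMP * per-block bytes never exceeds the 48 KB configuration.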
The speedup obtained in GPU1 when shared memory is used is shown in Table 2a. In GPU2 the thread block size is taken as 256, 6 blocks are active in one SMP and the total shared memory used per SMP is 42 KB. The speedup obtained in GPU2 when shared memory is used is shown in Table 2b.
The time taken for the computation when shared memory is used is much less compared to the parallel version which uses the global memory. This is because, as the shared memory is on-chip, it is much faster than the global memory. However, to ensure correct results there must be synchronization between the threads which share the memory. CUDA's simple barrier synchronization primitive __syncthreads() is used for this. This avoids race conditions, as a thread's execution proceeds past a __syncthreads() barrier only after all the threads in the block have executed the barrier synchronization.
We have further studied the option of using a larger L1 cache in place of shared memory and its effect on improving the performance of the LDP_compute kernel. The default L1 size is 16 KB, but in cases where register spilling is problematic, increasing the size of L1 to 48 KB improves the performance. The cudaDeviceSetCacheConfig function is used to set the preference between shared memory and L1 cache. The slight improvement in speedup obtained by using L1 cache compared to shared memory for the two GPUs is shown in Tables 3a and 3b. The L1 cache version shows only a slight performance improvement when compared to the shared memory version.
5.4. Implementation using Streams
The LDP_compute kernel is further optimized using streams. To allocate operations to different streams, the input image is divided into small chunks and the required operations are performed on each chunk. The three main operations in the LDP computation are the memory copy from host to device, the execution of the kernels and the copy of the result back from device to host. For an input image of size, say, N=256x256, the staged execution with nStreams streams transfers chunks of N/nStreams size, i.e. 128x128 pixels. If B blocks are used for the computation of the whole image in the non-staged version, then in the staged execution with nStreams, B/nStreams blocks are used for computation. Also, while the first chunk is being computed by the B/nStreams blocks, the second chunk is being transferred, thereby giving an improvement in performance.
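The staged division can be sketched as below, assuming the pixel count and block count divide evenly by the number of streams (as they do for the power-of-two image sizes used here).

```python
# Each stream processes an equal chunk of the image with an equal share of
# the thread blocks, so the copy of one chunk can overlap compute on another.
def staged_chunks(n_pixels, n_blocks, n_streams):
    assert n_pixels % n_streams == 0 and n_blocks % n_streams == 0
    return [{'pixels': n_pixels // n_streams, 'blocks': n_blocks // n_streams}
            for _ in range(n_streams)]
```

For a 256x256 image and four streams this gives 128x128-pixel chunks, matching the example above.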
Figure 8a shows such a concurrency between transfer and computation with nStream set to 4. The host memories are allocated as page-locked. We have assumed that the kernel execution and memory copies take roughly the same execution time. The execution timeline is as shown in Figure 8b. The inter-engine dependencies are highlighted with arrows. On a GeForce 720m there is only a single copy engine and a single kernel engine. The depth-first approach does not give any speedup, whereas the breadth-first approach gives a speedup, as shown in Table 4a. In K40c there are two copy engines, one for host-to-device transfers and another for device-to-host transfers, and a single kernel engine. Both depth first and breadth first achieve the same speedup. The operations are queued across the streams in such a way that the operation of one stream does not block the operations of the other streams. This allows the GPU to execute copies and kernels in parallel, allowing the application to run faster, as shown in Table 4b.
The image retrieval results using the parallel implementation of LDP are given in Table 5. The performance evaluation was done by calculating two parameters, the true positive rate (TPR) and the false positive rate (FPR), given by TPR = TP/(TP + FN) and FPR = FP/(FP + TN), where TP is True Positive, FN is False Negative, FP is False Positive and TN is True Negative.
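The two rates follow directly from the four counts:

```python
# True positive rate and false positive rate as defined above.
def tpr(tp, fn):
    return tp / (tp + fn)

def fpr(fp, tn):
    return fp / (fp + tn)
```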
The four parameters TP, FN, FP and TN are evaluated as follows. Firstly, the Ground Truth (GT) for all the images in the database is recorded as TRUE or FALSE depending on whether the image belongs to the TRUE class or the FALSE class. In the second