TELKOMNIKA, Vol. 13, No. 4, December 2015, pp. 1399~1407
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v13i4.1896

Received August 26, 2015; Revised October 29, 2015; Accepted November 12, 2015
A Hybrid Sorting Algorithm on Heterogeneous
Architectures
Ming Xu*1, Xianbin Xu1, Fang Zheng2, Yuanhua Yang1, Mengjia Yin1
1Computer School of Wuhan University, Wuhan 430072, Hubei, China
2College of Informatics, Huazhong Agricultural University, Wuhan 430070, Hubei, China
*Corresponding author, email: xuming@whu.edu.cn
Abstract
Nowadays, high performance computing devices are more common than ever before. The capacity of main memories has become very large; CPUs get more cores and computing units with greater performance. More and more machines are also equipped with accelerators such as GPUs. Taking full advantage of modern machines that use heterogeneous architectures to obtain higher performance solutions is a real challenge. There is much literature on using only CPUs or only GPUs; however, research on algorithms that utilize heterogeneous architectures is comparatively scarce. In this paper, we propose a novel hybrid sorting algorithm that lets the CPU cooperate with the GPU. To fully utilize the computing capability of both CPU and GPU, we used SIMD intrinsic instructions to implement the sorting kernels that run on the CPU, and adopted radix sort kernels implemented in CUDA (Compute Unified Device Architecture) that run on the GPU. Performance evaluation is promising: our algorithm can sort one billion 32-bit float data in no more than 5 seconds.

Keywords: SIMD, CUDA, Heterogeneous Architecture, Sorting Algorithm

Copyright © 2015 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
Sorting is used by many computer applications [1]. Database applications have sorting as an internal operation used by SQL operations, and hence all applications using a database can take advantage of an efficient sorting algorithm [2]. Sorting facilitates statistics-related applications, including finding the closest pair, determining an element's uniqueness, finding the kth largest element, and identifying outliers. Other applications that use sorting include computer graphics, computational geometry, computational biology, supply chain management, image processing [3, 4] and data compression.
Recently, main memory capacity in modern machines has increased rapidly, which makes in-memory sorting feasible. Today's CPUs are typically multi-core processors with improved cache throughput and reduced power consumption [5, 6, 7]. Cores on a multi-core processor may be homogeneous or not, share a certain level of cache memory, and implement architectures such as SIMD (single instruction, multiple data) and multithreading [8]. Algorithms that utilize these advantages may achieve higher performance. Driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable Graphics Processing Unit, or GPU, has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth [9].
The GPU is specialized for compute-intensive, highly parallel computation (exactly what graphics rendering is about) and is therefore designed such that more transistors are devoted to data processing rather than data caching and flow control. NVIDIA introduced CUDA™, a general purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than on a CPU. When the compute capability of GPUs is leveraged correctly, algorithms migrated to the GPU may obtain a 10X or much greater speedup than when they run on the CPU.
Much research has been done on high performance computing that uses only CPUs or only GPUs, but little on how to coordinate the two kinds of processors. We face the challenge of taking full advantage of what a heterogeneous architecture machine can offer. As a fundamental algorithm, sorting was chosen by us to be re-implemented on a heterogeneous architecture machine. In the next section we summarize various parallel sorting algorithms as well as research on heterogeneous architectures; in section 3 we present the details of our novel hybrid sorting algorithm; performance evaluations are presented in section 4; and section 5 is the conclusion.
2. Related Research
There has been much research on parallel sorting algorithms, including CPU and GPU implementations. Peter Sanders and Sebastian Winkel [10] implemented Sample Sort on single processor machines; sample sort was formerly known as the best practical comparison-based sorting algorithm for distributed memory parallel computers. They used a random sample to find k-1 splitters that partition the input list into buckets of about equal size, which were then sorted recursively. Each element was placed into the correct bucket as determined by binary search. To avoid conditional branches, the splitters form a search tree analogous to the tree structure used for binary heaps. Elements traverse the search tree for placement independently, and the introduction of oracles from radix sort makes compiler optimization easier. They also argued that sample sort is more cache efficient than quicksort because it moves the elements only log_k n times rather than log_2 n times. The oversampling factor is used to make a flexible tradeoff between the overhead of handling the sample and the accuracy of splitting. They argued that their implementation obtained up to 2X acceleration over std::sort from the GCC STL library.
Shifu Chen and Jing Qin [11] proposed a sorting algorithm implemented in CUDA. They used an algorithm similar to that used in [10], but implemented it on the GPU using CUDA. To improve performance, they used different methods to load elements from unsorted lists and nearly sorted lists, respectively. They argued that their experiments show that the algorithm achieves a 10X to 20X acceleration over STL quicksort.
Hiroshi Inoue and Takao Moriyama [12] proposed an aligned-access sorting algorithm which takes advantage of the SIMD instructions and thread-level parallelism available on today's multi-core processors. They used an improved combsort as their in-core algorithm, which is similar to bitonic sorting networks. After all divided blocks, whose length is less than the capacity of the L2 cache of a core, are sorted, they are merged using an out-of-core algorithm that integrates an odd-even merge algorithm implemented with SIMD instructions.
Nadathur Satish and Mark Harris [14] described the design of high-performance parallel radix sort and merge sort routines for many-core GPUs, taking advantage of the full programmability offered by CUDA. They designed a more efficient radix sort by making efficient use of memory bandwidth, which is gained by (1) minimizing the number of scatters to global memory and (2) maximizing the coherence of scatters. They also designed a merge sort on the GPU, which is similar to other merge sort implementations that divide the input into equal-sized tiles, sort the tiles simultaneously, and then merge the sorted tiles. To avoid an inefficient coarse-grained merge process, they merge larger arrays by dividing them up into small tiles that can be merged independently using the block-wise process. They argued that their performance experiments show that their radix sort is the fastest GPU sort and their merge sort is the fastest comparison-based sort in the literature.
Jatin Chhugani and William Macy [2] proposed an efficient merge sort on multi-core SIMD CPU architectures. They first examined how instruction-level parallelism, data-level parallelism, thread-level parallelism, and memory-level parallelism impact the performance of a sorting algorithm, then described their algorithm, which utilizes all these kinds of parallelism. They first evenly divide the input data into blocks and sort each of them individually. Blocks are assigned to a thread and sorted by a SIMD implementation of merge sort. Then these blocks are merged in iterations. In each iteration, pairs of lists are merged to obtain sorted sequences of twice the length of the previous iteration. When blocks become too long to be merged individually, threads cooperate to merge pairs of blocks. In this case the medians of the merged list are computed, which cut pairs of blocks into mutually exclusive parts that can be merged by a single thread individually. They also use multi-way merge for large input lists to reduce memory accesses during sorting.
Nadathur Satish and Changkyu Kim [15] presented a competitive analysis of comparison and non-comparison based sorting algorithms on the latest CPU and GPU architectures. Novel CPU radix sort and GPU merge sort implementations were proposed. To alleviate the irregular memory accesses of CPU radix sort implementations, they proposed a buffer-based scheme that collects elements belonging to the same radix into buffers in local storage, and writes the buffers out from local storage to global memory only when enough elements have accumulated. They adopted the GPU radix sort implementation described in [11]. The CPU merge sort implementation described in [2] and the GPU merge sort implementation described in [14] were also adopted; besides, on the GPU, wider merge networks were used.
All the algorithms mentioned above only utilize single homogeneous processors. However, multicore machines equipped with accelerators are becoming increasingly popular. In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors or data-parallel accelerators. We face the challenge of building applications that can fully utilize the capability of the entire machine, that is, applications whose parallel tasks can be scheduled over the full set of available processing units to maximize performance. Andre R. Brodtkorb and Christopher Dyken [16] provided an overview of node-level heterogeneous computing, including hardware, software tools, and state-of-the-art algorithms and applications.
Edgar Solomonik and Laxmikant V. Kalé [17] implemented several parallel sorting algorithms, including histogram sort, sample sort, and radix sort, using Charm++. They focused on optimizing histogram sort and argue that histogram sort is the most reliable way to achieve a defined level of load balance. They illustrated their point of view with performance evaluations run on machines with modern supercomputer architectures.
Cédric Augonnet and Samuel Thibault [18] designed StarPU, a software tool aiming to allow programmers to exploit the computing power of the available CPUs and GPUs, while relieving them of the need to specially adapt their programs to the target machine and processing units. They implemented several scheduling algorithms on StarPU and compared them on similar tasks.
In this article, we use a simpler way to implement hybrid sorting on heterogeneous architectures. Using OpenMP, we use one single CPU thread to communicate with the GPU when we need to assign tasks to the GPU, while the other CPU threads execute the tasks assigned to them. In the whole sorting process, the CPU and GPU are tightly bound: they cooperate with each other to complete each step, so how tasks are assigned must be decided beforehand to guarantee that the CPU and GPU take almost the same time to complete their tasks. To harness the capability of heterogeneous architectures, we implemented sorting kernels for the CPU and adopted a GPU implementation; both of them can take full advantage of CPU and GPU resources, respectively.
3. Proposed Hybrid Sorting Algorithm
We use SSE intrinsic instructions to utilize the SIMD capabilities of CPU cores; CUDA is used to utilize GPU computing capabilities; and the OpenMP runtime is used to coordinate tasks among CPU cores as well as between the CPU and GPU. We use 32-bit float as the data type of the elements to be sorted. Hence one 128-bit SIMD vector register contains 4 keys. Note that our algorithm is not limited to this data type and degree of data parallelism, as long as the SIMD instructions support them. We suppose that the number of elements of the input data list is N, which is a power of 2 without loss of generality. We always use aligned moves when SIMD registers load or store data, so the first elements of lists are aligned to a 16-byte boundary, and we handle unaligned cases during the sorting process, as explained in section 3.3.

There are two stages in an entire hybrid sorting run. In the first stage, the CPU and GPU each have their own tasks, sorting unsorted parts of the input list. In the second stage, all sorted parts are merged. We implemented multi-way merge [19] to reduce global synchronization in the second stage.
We implemented the merge sort described in [2] on the CPU, which uses two steps to sort a whole data sequence. In the first step, the data list is divided into several data chunks of the same length, which are sorted one by one. In the second step, the sorted chunks are merged. We use SSE instructions to implement a bitonic merging network [13, 20] which is used in both steps. The SIMD implementation can sort 32 elements at a time, confined by the capacity of the SIMD registers.
We adopted the GPU radix sort kernels in the CUB library [21]. CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model, including device-wide, block-wide and warp-wide primitive functions as well as many useful utilities. We used the device-wide radix sort kernels in the CUB library to ease our GPU programming tasks.
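As a minimal sketch of how such a device-wide kernel is invoked (assuming CUB's two-phase cub::DeviceRadixSort::SortKeys API, in which the first call only reports the required temporary storage size), sorting an array of 32-bit float keys that already resides in GPU global memory looks roughly like this; error checking is omitted:

    #include <cub/cub.cuh>

    void gpu_radix_sort(const float *d_keys_in, float *d_keys_out, int num_items)
    {
        void   *d_temp_storage     = NULL;
        size_t  temp_storage_bytes = 0;

        // First call computes how much scratch memory the sort needs.
        cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                       d_keys_in, d_keys_out, num_items);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);

        // Second call performs the actual radix sort on the device.
        cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                       d_keys_in, d_keys_out, num_items);
        cudaDeviceSynchronize();
        cudaFree(d_temp_storage);
    }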
As the performance evaluation results below show, the parallel merge sort kernels run on the CPU have performance similar to the radix sort kernels run on the GPU. When the input data list is small, the CPU sort implementations even outperform the GPU sort implementations. Since main memory is often huge, the CPU kernels can sort very large data lists. On the contrary, due to the capacity limit of global memory, only small or medium sized data lists can be sorted by the GPU directly; large data lists must be divided into several parts and sorted by running the GPU kernels several times, which is very inefficient. The advantage of the GPU sort implementation is that as the length of the input data list increases, the time needed to sort increases more slowly than it does for the CPU kernel. The bigger the data list is, the more the GPU kernels outperform the CPU kernels.
3.1. GPU Kernels
We use the GPU to offload CPU sorting tasks. We adopted the device-wide radix sort functions in the CUB library [21] as our GPU kernels. In each stage, the GPU kernels may sort one or several data chunks. Because data transfer between main memory and GPU global memory is relatively slow, and the GPU kernel surpasses the CPU kernel more and more as the length of the input data list increases (as long as the data list fits in GPU global memory), we cannot assign the same proportion of tasks when we sort different data lists. In this paper, we use a segmented scheme that designates several length intervals based on the evaluations we performed. When the length of an input list belongs to a length interval, the list is divided into parts which are sorted by the CPU and GPU respectively and simultaneously, according to the proportion associated with that interval. If the input data list is small enough, it is sorted only by the CPU; as its length increases, the proportion of GPU tasks also increases. Different stages of our sorting process may use different proportions, too. This way of task scheduling fits well with our implementation, though it is not very flexible.
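The concrete intervals and ratios are tuning parameters obtained from our evaluations; the paper does not fix them, so the sketch below only illustrates the shape of such a segmented lookup, with made-up placeholder boundaries and fractions:

    /* Hypothetical segmented task split: map the list length to the
       fraction of elements handed to the GPU. All numbers here are
       placeholders, not measured values. */
    double gpu_fraction(size_t n)
    {
        if (n <= ((size_t)1 << 21)) return 0.0;   /* small: CPU only     */
        if (n <= ((size_t)1 << 24)) return 0.35;  /* medium: modest GPU  */
        if (n <= ((size_t)1 << 26)) return 0.5;   /* large: even split   */
        return 0.6;                               /* huge: GPU-leaning   */
    }
    /* size_t gpu_part = (size_t)(n * gpu_fraction(n)); */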
3.2. CPU Kernels
We use two steps to sort data lists on the CPU. In the first step, we partition the whole list into several data chunks; the length of each chunk is selected so that the L3 cache can safely hold two chunks. Each chunk is further divided into blocks, so that the L1 cache of each CPU core can safely hold two blocks. The CPU cores sort these blocks concurrently and then coordinate with each other to merge them into a sorted chunk. Next, in the second step, the chunks are merged into one sorted list. If the number of chunks is small, the kernel merges two chunks at a time: each pair of chunks is segmented by medians [22], and then each pair of segments with the same index in the two chunks is merged simultaneously. If the number of chunks is large, we use multi-way merge [19] to segment them, and several same-index parts are merged. To fully utilize the caches of the CPU cores, all the data are sorted and merged in cache. The sizes of the data chunks and data blocks are calculated based on the sizes of the L1 and L3 caches as well as the size of the input data list. Each time a data block is moved into the L1 cache, it is sorted by a sorting network implemented using SIMD intrinsic instructions.
3.2.1. Sorting Network
The core of our CPU sort kernels is the sorting networks implemented using SIMD intrinsic instructions, comprising an odd-even sorting network and a bitonic sorting network [2, 13, 20]. For unsorted blocks, we first load keys into SIMD registers, then use the odd-even sorting network to make all keys within a single SIMD register sorted. For partially sorted blocks, we use the bitonic sorting network to merge keys. In the merge processes, 4-by-4, 8-by-8 and 16-by-16 bitonic sorting networks are used, the choice being decided by the length of the key slices loaded into the SIMD registers. Since the entire sorting spends most of its time merging partially sorted data, the 16-by-16 bitonic sorting network is used most often.
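To make the building block concrete, below is a minimal sketch of the smallest case, a 4-by-4 bitonic merge of two already-sorted __m128 vectors written with SSE intrinsics. The shuffle sequence is one of several equivalent networks and is given for illustration; it is not necessarily the exact kernel in our implementation:

    #include <xmmintrin.h>

    /* Merge ascending 4-float vectors a and b into an ascending 8-float
       sequence returned as *lo (smaller half) and *hi (larger half). */
    static inline void bitonic_merge_4x4(__m128 a, __m128 b,
                                         __m128 *lo, __m128 *hi)
    {
        /* Reverse b so that (a, b) forms a bitonic sequence. */
        b = _mm_shuffle_ps(b, b, _MM_SHUFFLE(0, 1, 2, 3));

        /* Level 1: compare elements 4 apart. */
        __m128 l1 = _mm_min_ps(a, b);
        __m128 h1 = _mm_max_ps(a, b);

        /* Level 2: compare elements 2 apart. */
        __m128 x  = _mm_shuffle_ps(l1, h1, _MM_SHUFFLE(1, 0, 1, 0));
        __m128 y  = _mm_shuffle_ps(l1, h1, _MM_SHUFFLE(3, 2, 3, 2));
        __m128 l2 = _mm_min_ps(x, y);
        __m128 h2 = _mm_max_ps(x, y);

        /* Level 3: compare adjacent elements. */
        __m128 u  = _mm_shuffle_ps(l2, h2, _MM_SHUFFLE(2, 0, 2, 0));
        __m128 v  = _mm_shuffle_ps(l2, h2, _MM_SHUFFLE(3, 1, 3, 1));
        __m128 l3 = _mm_min_ps(u, v);
        __m128 h3 = _mm_max_ps(u, v);

        /* Restore ascending element order in each output vector. */
        __m128 t0 = _mm_shuffle_ps(l3, h3, _MM_SHUFFLE(2, 0, 2, 0));
        __m128 t1 = _mm_shuffle_ps(l3, h3, _MM_SHUFFLE(3, 1, 3, 1));
        *lo = _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(3, 1, 2, 0));
        *hi = _mm_shuffle_ps(t1, t1, _MM_SHUFFLE(3, 1, 2, 0));
    }

The 8-by-8 and 16-by-16 networks used in our kernels follow the same min/max-and-shuffle pattern over more registers.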
There is also a trick in how keys are loaded into the SIMD registers. Each SIMD register contains a 4-element vector. The CPU kernel we first implemented would check whether either of the two arrays being merged was exhausted every time it wanted to load keys from one of them into the SIMD registers, which is a bit tedious. To be concise, we changed the key-choosing logic so that the first keys of the two last data vectors, belonging to the two sorted arrays respectively, are compared beforehand. If a data vector's first key is smaller, then the array it belongs to will be exhausted first. Furthermore, all the key vectors in the other array that should be loaded after this last vector can be determined in advance; that is, the number of load operations before and after an array is exhausted is clear and definite.
3.2.2. Merging Two Chunks
If the number of sorted chunks is not greater than 4, every two chunks are merged into a larger data chunk. Medians [22] are first calculated, which partition each chunk into segments, such that two segments with the same index in the two different chunks belong to the same finally-sorted data block. The sum of the lengths of two same-indexed segments is also equal to the length of a block. The data chunks are sorted, so any two segments in the same chunk that belong to two adjacent sorted blocks are also adjacent, divided by a median. Then every two same-indexed segments are merged simultaneously by the 16-by-16 bitonic sorting network. Segments separated by medians are not necessarily aligned; that is, the memory address of the first key of a segment may not be divisible by 16, or the length of the segment may not be divisible by 16. We need to handle these unaligned cases before we can use the bitonic sorting network to merge them.
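The medians follow the standard binary search over two sorted arrays: pick how many of the k smallest elements come from each chunk so that every element taken is no greater than every element left behind. A minimal sketch under that assumption (the helper name median_split is ours; [22] gives the full treatment):

    #include <stddef.h>

    /* For ascending arrays a (length na) and b (length nb), find i and
       j = k - i such that a[0..i) and b[0..j) together are the k
       smallest elements of the merged sequence. */
    void median_split(const float *a, size_t na, const float *b, size_t nb,
                      size_t k, size_t *i_out, size_t *j_out)
    {
        size_t lo = (k > nb) ? k - nb : 0;   /* smallest feasible i */
        size_t hi = (k < na) ? k : na;       /* largest feasible i  */
        while (lo < hi) {
            size_t i = (lo + hi) / 2;
            size_t j = k - i;
            if (j > 0 && i < na && b[j - 1] > a[i])
                lo = i + 1;                  /* need more from a     */
            else
                hi = i;                      /* i works; try smaller */
        }
        *i_out = lo;
        *j_out = k - lo;
    }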
3.2.3. Multi-Way Merge
Suppose that there are L sorted data chunks after stage 1 finishes, and L is also a power of 2. If we use the kernels described in section 3.2.2 to merge them, we have to run these kernels repeatedly, over log2(L) rounds of pairwise merging, to get an entire sorted output. So if L > 4, this approach is no longer efficient, let alone transferring data between main memory and the GPU so many times. The machine we used for performance evaluation has a 4-core CPU with Hyper-Threading technology, which can sort at most 8 blocks simultaneously; the above kernel is inefficient for merging that many blocks, too. We implemented the multi-way merge algorithm proposed in [19] to merge more than two sorted chunks at a time. As with the two-chunk merge, we first need to find, within the data chunks, all the elements that belong to the same block of the merged list. We copy all the data slices that belong to the same sorted block into a single CPU core's L1 cache and merge them using the 16-by-16 bitonic sorting network, two data slices at a time.

By using the multi-way merge algorithm we can merge several data chunks in a single merge kernel run; however, the process of finding the quantiles in the chunks, namely the upper and lower bounds of the sorted blocks, is fairly complex and takes longer if the number of chunks merged each time is too large. So there must be a tradeoff. We choose the number of sorted chunks merged by one multi-way merge to be no more than 16. If the number of chunks is greater than 16, we recursively apply multi-way merge or the two-chunk merge to the merged, larger chunks; for example, 64 sorted chunks would first be reduced to 4 larger chunks by 16-way merges, and those 4 chunks would then be merged in a final pass.
There is another factor that affects the quantile search process. As described in [19], only the upper bound of a sorted block is searched for; this requires the lower bound and the length of the block to be known. The first elements of all chunks can be used as the lower bound for every upper-bound calculation, but the length of the blocks is then multiplied by the index of the upper bound being searched, which ends up in much redundancy and lengthy search processes. To avoid such inefficiency, the bounds of each sorted chunk within the finally-sorted list are first searched successively; each chunk can take the previous outcome (the upper bound of the last adjacent chunk) as its own lower bound. Then the CPU cores simultaneously search the bounds of the blocks within the sorted chunks, one core taking an entire chunk, so each block has its lower bound found, which is very efficient.
3.3. Unaligned Case Handling
The most important kernel in our CPU sorting is the bitonic merging network, which occupies most of the CPU running time. To get high performance, we must ensure that all data transfers to and from the SIMD registers are aligned. So unaligned cases are detected and handled before blocks of keys are merged using the merging network. Unaligned cases often occur in stage 2, when we calculate medians for two chunks or bounds for several chunks. Because we always select the length of chunks or blocks to be a power of 2, the sum of the lengths of the segments divided by medians or quantiles is always a power of 2, too. What we do is cut some elements from the head or tail of each unaligned segment so that the residual part of it, and a new array consisting of all the cut elements, are all aligned.
3.3.1. Handling the Unaligned Case When Merging Two Chunks
If the address of the head or tail element of one segment is not 16-byte aligned, then neither is that of the head or tail element of the other segment. In that case, the element nearest to the head or tail whose address is 16-byte aligned is computed. If the heads of the segments are not aligned, all elements before the nearest aligned elements from both segments are cut and copied into an aligned buffer and sorted; the length of the buffer must be 16. The keys in the buffer are then loaded as usual, as if we had "inserted" them into the segment that the last element of the buffer belongs to. Similarly, unaligned tail elements are cut and combined into a buffer of 16 elements; the buffer is sorted and merged as if it were appended to the segment that the first element of the buffer belongs to.
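The detection itself is plain pointer arithmetic; a minimal sketch (helper names are ours):

    #include <stdint.h>

    /* A float pointer is ready for aligned SIMD loads when its address
       is divisible by 16. */
    static int is_aligned16(const float *p)
    {
        return ((uintptr_t)p & 15u) == 0;
    }

    /* Round p up to the next 16-byte boundary: the nearest aligned
       element at or after p; elements before it are cut into a buffer. */
    static float *next_aligned16(float *p)
    {
        return (float *)(((uintptr_t)p + 15u) & ~(uintptr_t)15u);
    }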
3.3.2. Handling the Unaligned Case When Multi-Way Merging
When multi-way merge is used, a few more segments belonging to one block are merged. Recall that the sum of the lengths of all segments belonging to a sorted block is always a power of 2. Before they are merged, some tail elements are cut from each segment whose length is not divisible by 16; the number of cut elements is the remainder. All the cut elements are copied into a buffer, whose length must be divisible by 16. The buffer is then sorted and appended to the segment that its first element belongs to. Now the lengths of all segments are divisible by 16, so when they are copied into an L1 cache successively, every segment is aligned.
3.4. Hybrid Sorting Implementation
The hybrid sort implementation has two stages. Because the CPU and GPU simultaneously sort different parts of the input list, global synchronization is necessary between the two stages. In each stage, the OpenMP library is responsible for managing task assignments as well as synchronization between the CPU cores and the GPU. The sections directive is used to assign tasks in each stage, as described in the pseudocode below:
    void hybrid_sort(data)
    {
        #pragma omp parallel
        {
            omp_set_nested(2);
            #pragma omp sections
            {
                #pragma omp section
                {
                    gpu_kernel(data_chunk1);
                }
                #pragma omp section
                {
                    cpu_kernel(data_chunk2);
                }
            }
        }
        cudaDeviceSynchronize();
    }
Code blocks in different section directives execute simultaneously. Since the OpenMP library also coordinates the different CPU cores, enabling nested parallelism is necessary. At the end of each stage, the CUDA synchronization function guarantees that the GPU has finished its tasks.
In stage 1, the input data list is broken into two parts: one is sorted by the CPU, and the other is sorted on the GPU. The part to be sorted on the CPU is further broken into smaller chunks, each of which can be safely held by the L3 cache; they are sorted one by one. When all these chunks are sorted, they are merged into one sorted part. Similarly, the part to be sorted on the GPU is broken into chunks if its length exceeds what the GPU kernel can sort at once. In stage 2, the multi-way merge algorithm first searches for the bounds of the finally-sorted chunks, which divide the sorted parts from stage 1 into segments. Different segments are again assigned to the CPU and GPU respectively. To get the best performance, the lengths of the parts in each stage must be deliberately chosen so that the CPU kernel and the GPU kernel run for a similar time.
Based on the performance evaluation of each sort kernel, different ratios are specified for different sort stages and different lengths of the input list. When the list is small (smaller than the L3 cache can hold, so it can be sorted at once with the CPU kernel of stage 1), the CPU kernel is enough; as the length of the data list grows, the part assigned to the GPU kernel also increases to get a performance boost in stage 1, because the GPU sort kernel has better performance than the CPU kernel when the list is unsorted. The longer the data list, the more work the GPU does compared with the CPU. In both stages the GPU kernel uses radix sorting, so its performance discrepancy between the stages is very low; however, the CPU kernel in stage 2 takes much less time than in stage 1 (and less than the GPU kernel), because the multi-way merge can take full advantage of the partial results from stage 1. So, in stage 2 we assign the CPU more tasks than the GPU.
4. Performance Evaluation
We evaluated the performance of our hybrid sort on a machine that has an Intel Core i7 4700MQ CPU running at 2.40GHz and an NVIDIA GeForce GTX 765M GPU running at 800MHz. The CPU has four cores that can deliver 8 threads via Intel Hyper-Threading Technology. The GPU has 768 CUDA cores. The capacities of main memory and GPU global memory are 16GB and 2GB, respectively. We ran our implementations on OpenSuse Linux 13.1.
Table 1. Single GPU performance evaluation

    Number of Elements   One-way transfer (ms)   Sorting (ms)   Sum (ms)
    2^16                 0.139693                0.645861       0.925247
    2^17                 0.267403                1.00632        1.541126
    2^18                 0.432259                1.64044        2.504958
    2^19                 0.76247                 2.92104        4.44598
    2^20                 1.42201                 5.43771        8.28173
    2^21                 2.73591                 10.529         16.00082
    2^22                 5.36332                 20.7026        31.42924
    2^23                 10.6202                 41.1297        62.3701
    2^24                 21.1348                 81.6049        123.8745
    2^25                 42.16                   163.207        247.527
    2^26                 84.213                  327.713        496.139

(The Sum column equals the sorting time plus two one-way transfers, to and from the GPU.)
We first present the evaluation results of GPU sorting. We use the radix sort in the CUB library to sort randomly distributed 32-bit floats. Table 1 presents the results for each input list, averaged over 50 runs. The times taken by the one-way data transfer, by the sorting itself, and by the whole process are listed respectively. Because of the capacity constraints of the GPU's global memory, the length of the lists only goes up to 2^26.
Table 2. Single CPU performance evaluation

    Number of Elements   Step 1 (s)    Step 2 (s)    Sum (s)
    2^16                 0.00026897    -             0.000269016
    2^17                 0.000564282   -             0.00056433
    2^18                 0.00118396    -             0.001184011
    2^19                 0.00251854    -             0.002518589
    2^20                 0.00571485    -             0.005714911
    2^21                 0.0137335     -             0.013733559
    2^22                 0.0222362     0.00565383    0.02789003
    2^23                 0.0449629     0.0136799     0.0586428
    2^24                 0.0892215     0.0419758     0.1311973
    2^25                 0.178331      0.104341      0.282672
    2^26                 0.389743      0.193178      0.582921
    2^27                 0.798254      0.45844       1.256694
    2^28                 1.60057       1.14128       2.74185
    2^29                 3.25119       2.63423       5.88542
    2^30                 6.55064       5.1366        11.68724
Next we present our evaluation of CPU sorting. The data lists, and the number of runs on each list, are identical to those used in the evaluation of GPU sorting. Both steps of CPU sorting are recorded, as well as the total time taken. The capacity of the CPU's main memory is so large that we can even sort a list of length up to 2^30, though it takes a longer time. Table 2 presents all the results. It is worth noting that we chose the length of the data chunks merged in cache to be no more than 2^19; thus, when there are no more than 4 data chunks, namely when the number of elements of the list is less than or equal to 2^21, step 1 is enough to sort the entire list.
Being carefully scheduled, the CPU and GPU can cooperate with each other to get higher performance. Figure 1 illustrates the performance evaluation results of our hybrid sort algorithm. The time axis is logarithmic; the curve is nearly linear except at some points, which we consider to be because our task assignment algorithm is not very flexible. We illustrate the sorting rates of the three algorithms in Figure 2. Obviously, GPU sorting always has the highest rate for 32-bit float keys, but the memory of the GPU is too small to sort large data sets. Hybrid sort has a much higher rate than CPU sort when the size of the data is medium or large. We think the hybrid sort rate is lower than the GPU sort rate largely because the partial results of stage 1 are not used by the GPU in stage 2.

Figure 1. Hybrid sort performance evaluation

Figure 2. Sorting rate comparison
5. Conclusion
Heterogeneous architectures appear to prevail nowadays. Software developers must face the challenge of designing applications that can make full use of all these processors to get higher performance. We used a simple way to implement a hybrid sorting algorithm, which can sort one billion 32-bit float data in no more than 5 seconds.

Future work has three aspects. First, we shall update our applications to make more use of the processors. We can adopt a wider SIMD length by using the AVX and AVX2 instruction sets, and there is a GPU integrated into Intel Haswell CPUs, which we can use for some tasks. Second, performance evaluation shows that the radix sort kernel does not fit both stages well; we must use more efficient GPU kernels in stage 2 that fully utilize the partially sorted chunks. Finally, we must make our algorithm more flexible; we need to improve our scheduling algorithm to smoothly fit various cases and different machines.
References
[1] Martin, William A. Sorting. ACM Computing Surveys (CSUR). 1971; 3(4): 147-174.
[2] Chhugani, Jatin, et al. Efficient implementation of sorting on multi-core SIMD CPU architecture. Proceedings of the VLDB Endowment. 2008; 1(2): 1313-1324.
[3] Djajadi, Arko, et al. A model vision of sorting system application using robotic manipulator. TELKOMNIKA (Telecommunication Computing Electronics and Control). 2010; 8(2): 137-148.
[4] Hartati, Sri, Agus Harjoko, and Tri Wahyu Supardi. The digital microscope and its image processing utility. TELKOMNIKA (Telecommunication Computing Electronics and Control). 2011; 9(3): 565-574.
[5] West, Richard, et al. Online cache modeling for commodity multicore processors. ACM SIGOPS Operating Systems Review. 2010; 44(4): 19-29.
[6] Arora, Krishan, Paramveer Singh Gill, and Parul Mehra. Design of high performance and low power simultaneous multi-threaded processor. International Journal of Electrical and Computer Engineering (IJECE). 2013; 3(3): 423-428.
[7] Chowdhury, Rezaul Alam, et al. Oblivious algorithms for multicores and networks of processors. Journal of Parallel and Distributed Computing. 2013; 73(7): 911-925.
[8] Intel 64 and IA-32 Architectures Software Developer's Manual. Intel Corporation. 2010.
[9] Nvidia CUDA C Programming Guide. Nvidia Corporation. 2011.
[10] Sanders, Peter, and Sebastian Winkel. Super scalar sample sort. Algorithms - ESA 2004. 2004; 784-796.
[11] Chen, Shifu, et al. A fast and flexible sorting algorithm with CUDA. Algorithms and Architectures for Parallel Processing. 2009; 281-290.
[12] Inoue, Hiroshi, et al. AA-sort: A new parallel sorting algorithm for multi-core SIMD processors. Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques. 2007; 189-198.
[13] Parberry, Ian. The pairwise sorting network. Parallel Processing Letters. 1992; 2(2-3): 205-211.
[14] Satish, Nadathur, Mark Harris, and Michael Garland. Designing efficient sorting algorithms for manycore GPUs. Proceedings of Parallel & Distributed Processing (IPDPS 2009). 2009; 1-10.
[15] Satish, Nadathur, et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. 2010; 351-362.
[16] Brodtkorb, Andre R., et al. State-of-the-art in heterogeneous computing. Scientific Programming. 2010; 18(1): 1-33.
[17] Solomonik, Edgar, and Laxmikant V. Kale. Highly scalable parallel sorting. Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2010). 2010; 1-12.
[18] Augonnet, Cédric, et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience. 2011; 23(2): 187-198.
[19] Francis, Rhys, Ian Mathieson, and Linda Pannan. A fast, simple algorithm to balance a parallel multiway merge. Proceedings of Parallel Architectures and Languages Europe (PARLE '93). Berlin, Heidelberg. 1993; 570-581.
[20] Batcher, Kenneth E. Sorting networks and their applications. Proceedings of the Spring Joint Computer Conference. 1968; 307-314.
[21] Merrill, Duane, Michael Garland, and Andrew Grimshaw. Policy-based tuning for performance portability and library co-optimization. Innovative Parallel Computing (InPar 2012). 2012; 1-10.
[22] Francis, Rhys S., and Ian D. Mathieson. A benchmark parallel sort for shared memory multiprocessors. IEEE Transactions on Computers. 1988; 37(12): 1619-1626.