TELKOMNIKA, Vol. 14, No. 1, March 2016, pp. 309~318
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v14i1.2741
Received September 20, 2015; Revised December 26, 2015; Accepted January 17, 2016
Optimized Merge Sort on Modern Commodity Multi-core CPUs
Ming Xu*¹, Xianbin Xu¹, MengJia Yin¹, Fang Zheng²
¹Computer School of Wuhan University, Wuhan 430072, Hubei, China
²College of Informatics, Huazhong Agricultural University, Wuhan 430070, Hubei, China
*Corresponding author, email: xuming@whu.edu.cn
Abstract
Sorting is a widely used class of basic algorithms. As high performance computing devices become increasingly common, more and more modern commodity machines are capable of parallel concurrent computing. A new implementation of sorting algorithms is proposed to harness the power of newer SIMD operations and the multi-core computing provided by modern CPUs. The algorithm is a hybrid of an optimized bitonic sorting network and multi-way merge. New SIMD instructions provided by modern CPUs are used in the bitonic network implementation, which adopts a different method of arranging data so that the number of SIMD operations is reduced. Balanced binary trees are used in the multi-way merge, which also differs from former implementations. Effort is also spent on minimizing data movement in memory, since merge sort is a memory-bound application. The performance evaluation shows that the proposed algorithm is twice as fast as the sort function in the C++ standard library when only a single thread is used. It also outperforms the radix sort implemented in the Boost library.

Keywords: merge sort, bitonic network, SIMD, AVX

Copyright © 2016 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
Sorting is a class of basic algorithms that is deeply researched and widely used. Sorting algorithms are utilized by many computer applications, which typically include, but are not limited to, database systems [1] and image processing applications [2]. All platforms have their respective implementations of various sorting algorithms. As hardware evolves, new sort implementations are continually emerging. Among them is parallel sorting, thanks to the increasingly popular multi-core technology.
In the past few years, one of the major transitions in the computer chip industry has been from single cores to multiple cores. Processors based on multi-core micro-architectures become more and more popular. The two major types are today's CPUs and GPUs, among them the generations of the Intel® Core™ Processor Family and the NVIDIA® GeForce series. For example, an Intel multi-core CPU consists of multiple cores, each of which features sophisticated technologies such as out-of-order execution, hyper-threading, a cache hierarchy, and single instruction, multiple data (SIMD) instructions. A GPU processor typically contains more cores, called stream multi-processors, each of which contains many scalar processors that can execute the same instruction concurrently. This makes the GPU a good tool for missions that need to process vast amounts of data in a uniform way. SIMD instructions are somewhat similar to those on GPUs. Generally, a SIMD instruction operates on a vector of data stored in vector registers. Without exploiting this kind of parallelism, only a small fraction of the computational power of a modern multi-core CPU can be utilized.
In this paper, a new optimized parallel sorting algorithm on the CPU is proposed. It targets the multi-core CPUs of the 4th Generation Intel® Core™ Family [3-5], which provide new SIMD instructions, namely Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Vector Extensions 2 (Intel® AVX2) [7]. Performance comparisons with other state-of-the-art sorting algorithms are also provided.
2. Related Work
There is a lot of literature on sorting and parallel sorting. Various kinds of sorting algorithms have been deeply researched. They can be roughly classified into two major types: comparison and non-comparison sorts [6]. The two representative comparison sorts are merge sort and quick sort. A typical non-comparison sort is radix sort. There are also derivations of these basic algorithms, such as sample sort [16] and bucket sort [22]. Parallel sorting algorithms can also be classified as either merge-based or splitter-based [20]. Usually different algorithms are combined to sort large data lists concurrently. For example, bucket sort or sample sort can divide the items in a list into a set of contiguous chunks, and then each chunk can be sorted independently by other algorithms [14, 15]. On the other hand, multi-way merge [12] can evenly divide the task of merging multiple data chunks, which were sorted by other algorithms in advance, and assign each part of the task to different processors.
Duane Merrill [8] presented a family of very efficient parallel algorithms for radix sorting on the GPU. Allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this kind of dynamic parallelism are described. The sorting passes are constructed from a very efficient parallel prefix scan runtime. Up to now, it is the state-of-the-art implementation of radix sort on the GPU. There are also other algorithm implementations on the GPU, as described in [17-19], [21].
Nadathur Satish [11] presented a competitive analysis of comparison and non-comparison based sorting algorithms on the latest CPU and GPU architectures. Novel CPU radix sort and GPU merge sort implementations are proposed. To alleviate the irregular memory accesses of CPU radix sort implementations, they proposed a buffer-based scheme that collects elements belonging to the same radix into buffers in local storage, and writes the buffers out from local storage to global memory only when enough elements have accumulated.
By comparative analysis, merge sort, especially the bitonic sorting network [13], is more SIMD friendly. The analysis points to merge sort winning over radix sort on future architectures due to its efficient utilization of SIMD and its low bandwidth utilization. The combination of a SIMD implementation of the merging network and multi-way merge is proposed.
The main challenge of implementing a bitonic sorting network with SIMD instructions is to design efficient horizontal comparison processes, that is, comparisons between items within one vector register. Given that each vector register can hold v items, a bitonic sorting network implementation usually loads 2v items into v SIMD registers. Items in these registers are sorted and eventually form a small sorted data block end to end, which is then copied to memory by vector store instructions. To illustrate the sorting process, we can imagine that all items reside in a one-dimensional array, and the distance of two items is the difference of their indices in the array. During the sorting process, the bitonic sorting network constantly picks two items to compare, and swaps them if they are out of order. It is easily found that the distances of the two picked items range from 2v-1, namely when the first and the last item are picked, down to 1, namely when two adjacent items that reside in the same vector register are picked. However, all forms of the SIMD min/max instructions that are often used in a bitonic sorting network implementation take two SIMD registers as their input parameters. Two items to be compared must be chosen from different registers, whether they are in the same slot of the two registers or not. Due to the lack of a direct instruction to compare items in the same register, it is essential to use data interchange instructions first to move one of each pair of items to be compared to another register, before vertical comparison instructions can be used.
Chhugani [9] implemented a high performance bitonic sorting network in the 128-bit wide SIMD instructions provided by Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3). Given that the items to be sorted are 32-bit integers, each XMM register can hold 4 items. A 4x4 sorting network is used to sort these loaded integers. A lane refers to all the same slots in the SIMD registers that hold data. An odd-even sorting network is used to sort each lane, before all the items residing in a lane are transposed into a SIMD register. After that, a bitonic sorting network is used to sort the 4 registers; SIMD instructions including min/max and shuffle are used.
Xiaochen Tian [10] implemented a bitonic merge sort algorithm on Intel® Xeon Phi processors, which provide the Intel® AVX-512 instruction set. To reduce data shuffle operations, masked comparison operations (e.g. _mm512_mask_min_epi32) were used. However, the algorithm also rearranges data items in registers after each iteration.
Our previous work [23] implemented a bitonic sorting network by utilizing SSE intrinsic instructions. In this paper, a new implementation utilizing new SIMD instructions and wider vector registers is described. CPUs that belong to the 4th and later generations of the Intel® Core™ Processor Family provide Intel® AVX and Intel® AVX2, which contain instructions that can operate on 256-bit wide vector data registers, namely YMM registers. Under x86-64 mode, each core within a CPU can use up to 16 YMM registers. We implemented the bitonic sorting network on a quad-core Intel Core i7 CPU based on the Haswell microarchitecture.
The bitonic sorting network is used both to sort a block of unsorted data items and to merge two sorted lists step by step. The main contributions of this paper are: 1) we propose a bitonic sorting network implementation that utilizes wider SIMD instructions and tries to reduce data interchange between SIMD registers; 2) we propose an optimized multi-way merge algorithm; 3) we propose a merge process that reduces the number of data moves in memory. In the following section all the details are described, followed by the fourth section, which records the performance evaluation. The last section is the conclusion.
3. Optimized Parallel Merge Sort
The basic process of our proposed implementation is very straightforward. First the unsorted input data list is divided into a certain number of data chunks, each of which is sorted by a single CPU core independently and concurrently. Then the multi-way merge [12] is used to partition these data chunks according to segments in the final sorted list, and to gather the data items for each such segment. The last process is to sort the data items of all segments, which can then form the final output. The first and last processes are both done by recursively utilizing the bitonic sorting network. All processes can be executed concurrently. In the remainder of this paper, the notations below are used:

n   Number of items of the input list.
c   Number of data chunks that will be merged by multi-way merge.
q   Number of data segments produced by multi-way merge.
p   Number of CPU cores.
k   The width of a vector register.
b   The length of each item.
l   Number of items that can be sorted by running the bitonic sorting network once.
3.1. Bitonic Sorting Network
The number of input items of the bitonic sorting network is decided both by the capacity of the vector registers and by the length of each item. A YMM register has a capacity of k = 32 bytes, and in our performance evaluation the items of the input lists are 32-bit integers, namely b = 4 bytes. So a YMM register can contain 8 items, and the number of input items of the bitonic sorting network implemented by us is l = 64, since there are in total 16 YMM registers.
3.1.1. Initial Bitonic Sorting Network
If we divide the whole bitonic merge process into several steps by the unit length of the input blocks to be merged in each step, then there are in total 6 steps, according to the unit lengths 1, 2, 4, 8, 16, and 32 respectively. The first three steps can be combined into one, because only min and max operations are used. Then there are 4 steps in a full sorting network. As in [9], each lane is sorted in ascending order at first. This is what step 1 does. But then, items are interleaved instead of being fully transposed for the next comparison. During merging, items are always stored in lane order. After the second step, each even-indexed lane together with its next lane stores 16 sorted items end to end. After the last step, all sorted items are stored in lanes as shown in Figure 1 (a). We try to reduce the data interchange operations in each step to as few as possible.
Now there is a branch in the remainder of the process: if a fully sorted output block is essential, the sorted items must be transposed before they are stored to memory; if two partially sorted blocks (each containing l/2 = 32 items) are enough, then the YMM registers that contain the result will be permuted so that all their higher halves are moved together to form one output block, and likewise all the lower halves form the other. In most cases items are stored in the partially sorted style, except when multi-way merge is to be used next or this is the last step of the sorting process. Storing 32 items into memory in fully sorted order needs 23 instructions, while storing the same items in partially sorted order needs only 8 instructions. Algorithm 1 illustrates the process stated above.
Figure 1. Illustrations of data layouts in YMM registers. (a) sorted items are placed in lane order. (b) lower-half items are rearranged to the partially sorted style.
Algorithm 1 Initial Bitonic Sort
Input: integer list S, its length n. {n is a multiple of 64}
1  repeat
2    load data from S to vector registers {vmovdqa used}
3    sort all lanes {vpminsd and vpmaxsd used}
4    horizontally and/or vertically interleave items {vpshufd, vpblendd or vperm2i128 used}
5    compare items
6    ... go on with the bitonic sorting network, instructions same as lines 3 and 4
7    if n = 64 then
8      transpose data items
9    else
10     rearrange data items to partially sorted blocks
11   end if
12   store items to S
13 until all n/64 blocks in S are sorted
3.1.2. Intermediate Results
As stated above, items can be rearranged in two different ways before being stored to memory. In practice, AVX/AVX2 instructions can take three or more parameters; for example, a permutation instruction takes four parameters, namely two source registers, one destination register and one mask integer. Then one half of the YMM registers can be used to hold the sorted data, while the other half holds the rearranged results. So in any case the registers used as the source of vector store instructions can be the same, as are the items in them; the only difference is the layout of the items. As shown in Figure 1 (b), items in the lower 4 lanes of YMM0 ~ YMM7 are permuted into YMM12 ~ YMM15, which is the partially sorted layout.
When all items are initially sorted by a full bitonic sorting network, small blocks can be merged into longer lists recursively. When merging is necessary, the way to merge is the same as in [9]: items are loaded from each list respectively. Here only the last step of a bitonic sorting network is executed, because the unit length of the blocks is 32. Recall that the data being loaded has two possible layouts. If they are partially sorted, 12 instructions are needed to load and place them in lane order; and if they are fully sorted, 18 instructions are needed, including gather instructions that support loading non-contiguous items based on certain rules (base memory address, scale and mask), which are a new kind of operation in AVX2.
3.1.3. Only Copy
There is one more point that needs to be pointed out: two data blocks should be merged only when they contain items that are out of order. Otherwise the merging is redundant. Even the last step of the bitonic sorting network needs 102 instructions. In the performance evaluation, the data sets used are all uniformly distributed. It is found that whatever the length of a list, there is a 12% ~ 15% chance that two data blocks to be merged are already in order. Combining the two observations, it is worthwhile to know whether two data blocks are in order or not. The instructions used are two extract instructions, which extract the higher half of a YMM register to an XMM register and then extract the highest item from it afterwards.
During the merging process, the higher halves of the YMM registers are always filled first. Each time one block is loaded, the maximum item in it is extracted; if it is not greater than the minimum of the remaining items, then these items are stored to memory directly. Otherwise, the other block, which contains the minimum item, is loaded into the lower halves of the YMM registers. After the two blocks are merged, the lower half of the result is stored to memory. Algorithm 2 illustrates the process stated above.
Algorithm 2 Bitonic Merge
Input: sorted source lists S1, S2, destination list S3, input state T1, output state T2 {state indicates partially or fully sorted}
1  load items from a source list, rearrange them based on T1
2  while there are unmerged items do
3    S ← list containing the minimal unmerged item, or the only one containing unmerged items
4    RMax ← the maximum item in registers
5    LMin ← the minimum item in S
6    if RMax ≤ LMin then
7      rearrange items in registers based on T2, store them to S3
8      load items from S, rearrange them based on T1
9    else
10     load items from S, rearrange them based on T1
11     merge the two loaded blocks
12     rearrange the lower half of the result based on T2, store them to S3
13   end if
14 end while
3.2. Multi-Way Merge
When the number of sorted data chunks is small, say c ≤ 2q, the multi-way merge algorithm described in [12] is used to evenly assign the remaining merging tasks to CPU cores. Like the respectively sorted chunks, the final sorted output can also be divided evenly into q data segments; what multi-way merge does is find each final segment's items among the sorted chunks and aggregate them together. Then each segment can be merged independently and concurrently. To achieve this goal, all the boundaries that partition items belonging to different segments are found in all sorted data chunks.
The boundaries of a partitioning are formed by items in different data chunks. Suppose that the maximum value within the partitioning is maxL and the minimum value on the upper boundary is minR; a correct partitioning must satisfy two requirements. The ordering requirement is that all items within a partitioning must have values less than or equal to all values of the elements lying on its upper boundary, namely maxL ≤ minR; and the distance requirement is that the total number of included elements must be n/q. The lowest boundary, which is formed by the first elements of all data chunks, is already known; the upper boundary of a partitioning is guessed first, in a way that ensures the partitioning satisfies the distance requirement. Several subsequent rounds of inspection and adjustment then ensure the partitioning satisfies both. All partitionings can be found by this recursive lookup process, because the upper boundary of the current partitioning is also the lower boundary of the next partitioning.
The algorithm in [12] is a bit inefficient when it traverses items to find maxL and minR. The conditions used to check the ordering requirement are also a bit complex. Given that the number of iterations of the loop is m, the computational complexity of the loop is obviously O(mc). We propose a new concise bounds lookup process which has O((c + m)log(c)) complexity and a simpler judging condition. For each traversed item, a key-value pair is created, of which the key and value are the value and index of the item, respectively. Two binary trees are first built based on these key-value pairs, which takes O(c log(c)). The tree used to find minR is built with a less-than comparator, such that the root of the tree always holds the information of minR. Similarly, the tree used to find maxL is built with a greater-than comparator, such that the root of the tree always holds the information of maxL. Just a simple comparison of the keys of the two root nodes is enough to know whether the ordering requirement is satisfied. The complexity of getting the root nodes is only O(1). The process of adjustment is also easy. The two root nodes, and at most 2 more nodes which have the same index value as either of the two roots, are deleted. The new nodes added after an adjustment also come from the same data chunks in which the old root nodes lay. All these operations need O(log(c)) complexity. To implement the tree structure, the multimap container in the C++ STL, which is based on a red-black tree, is used. Summing up all the above, the computational complexity of the bounds lookup process can be readily known. Algorithm 3 illustrates the process stated above.
Algorithm 3 Multi-way Merge
Input: list S, temporary list T, offsets of sorted chunks O, length of a final sorted segment M
1  L ← O {current lower boundary}
2  repeat
3    left ← a red-black tree with a greater-than comparator {e.g. std::greater<int>}
4    right ← a red-black tree with a less-than comparator
5    U ← upper boundary that satisfies the distance requirement
6    traverse items on U and its left neighbors, build key-value pairs and insert them into right and left respectively
7    while key of left's root > key of right's root do
8      LIndex ← index of left's root
9      RIndex ← index of right's root
10     remove nodes that have the same index as LIndex or RIndex in both trees
11     select two items with greater/smaller value from chunks LIndex/RIndex respectively
12     insert nodes built based on the selected items into the trees
13     update U
14   end while
15   copy blocks between L and U to T in descending order of their length
16   L ← U
17 until the upper boundary of each segment is found
3.3. Data Moving
After the multi-way merge completes, blocks in the same segment are merged. The bitonic sorting network is used again. It is known that merge sort is a kind of memory-bound operation. To speed up the merging process, we use low latency instructions as much as possible. We also try to optimize the merging process by reducing the number of data copies.
3.3.1. Aligned Vector Memory Operations
Only aligned SIMD vector load and store instructions are used in our bitonic sorting network implementation. So the start memory addresses of the input data array and the temporary buffer must both be 32-byte aligned. Without loss of generality, all the data lists used for performance evaluation have lengths that are multiples of 64. However, it is almost certain that multi-way merge will produce data blocks whose start and/or end addresses are not 32-byte aligned. These data blocks need to be adjusted before aligned load instructions can be used.
The length of a final sorted segment, which was stated above as n/q, can be set to a multiple of 32. At the end of the multi-way merge, data blocks belonging to a same final segment will be copied together into a contiguous place in the temporary buffer. Before copying, two more tasks are done: one is to make all the start and end addresses of the blocks in the new array 32-byte aligned, and the other is to record the length of each data block, since the blocks are to be copied in descending order of their length, which is prepared for the next optimized merging.
To attain these goals, the memory addresses of the lower and upper boundary of each data block are checked. If the address of a lower boundary is not 32-byte aligned, then the nearest successive item which has a 32-byte aligned address becomes the new lower boundary; and if the address of an upper boundary is not 32-byte aligned, then the nearest previous item which has a 32-byte aligned address is chosen as the new upper boundary.
Items between the old and new boundaries are copied to one of two temporary vectors and then sorted. The vectors are allocated with a 32-byte aligned allocator, too. Because the length of each segment is a multiple of 32, so is the length of each data block; it is not difficult to see that the number of items copied to the vectors is also a multiple of 32. They can then be copied to the temporary buffer as extra blocks and merged with the others. So except for the boundary lookup process and the copying of items to vectors, all the other memory operations use aligned instructions.
Algorithm 4 Optimized Recursive Merge
Input: destination segment S, temporary buffer T
1  Len ← function that returns the number of blocks of an array
2  while the total number of blocks > 2 do
3    if Len(S) ≤ 1 then
4      if Len(S) = 1 and Len(T) is odd then
5        merge the block in S and the first block in T, store the result to the tail of S
6      end if
7      merge blocks in T from head to tail, store results to S from tail to head
8      if Len(T) is even, then the result of the two last blocks is stored to the head of T
9    else
10     if the head of T is not empty then
11       merge blocks in S, store results to T from tail to head
12       if Len(S) is odd, then merge the first block of T and the last block of S into the tail of S
13     else
14       if Len(S) is odd, then merge the block in T and the first block in S.
15       merge the other blocks in S, store all results of 14 and 15 to T.
16     end if
17   end if
18 end while
19 merge the last two blocks, store the result to S. {all merging above is done by Bitonic Merge}
3.3.2. Ways to Reduce Data Copying
Besides utilizing aligned vector load and store instructions, the other way to optimize memory operation is to reduce memory operations as much as possible. We have carefully designed the merging process in the last step so that the number of data moves between the input data segments and their temporary buffers is reduced. Before describing the merging process, one thing must be clarified. If two data blocks lie in a contiguous chunk of memory end to end, then the same chunk cannot be used as the destination of output, since the merged output would likely overwrite the unmerged inputs. Copying data to a temporary buffer is necessary to avoid overwriting. The thing is, copying only the data block that lies in the lower memory space is enough. Regardless of the lengths of the two data blocks, if the data block in front of the memory space is moved to the buffer, then the chunk of memory can safely store the merged results.
Based on this we reduced the data moving in the merging process. We start at the temporary buffer. All blocks are copied here after the boundary lookup completes, and recall that the blocks are stored in descending order of their length. Blocks are merged from the head of the buffer to the tail, while the merged results are placed into the input segment from tail to head.
One block stays at the tail of the temporary buffer if the number of blocks in the buffer is odd; otherwise the merged result of the last two blocks is placed at the head of the temporary buffer. The space is enough to hold the result of the two, because the blocks before them are all no shorter than them. Now the blocks in the input segment are stored in ascending order of their length.
Then the number of blocks lying in the input segment may be more than one. If the block in the temporary buffer is stored at the head, then the merged output yielded by the blocks lying in the input segment is stored from the tail of the buffer to the head. If the number of blocks in the input segment is odd, then the first block in the buffer is merged into the input segment together with the last block in the input segment. On the other hand, if the block in the temporary buffer is stored at the tail, and the number of blocks in the input segment is odd, the two first blocks are merged and stored at the tail of the buffer. The other blocks are merged successively.
Overall, our goal is to ensure that the temporary buffer is never empty, so that the amount of memory copying in the merging process is reduced.
When the total number of unmerged blocks is more than two, merging proceeds as stated above and as illustrated in Algorithm 4; once the number of blocks drops to two, they can be merged directly into the input.
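Leaving aside the head/tail placement details, the overall scheme can be approximated by a simplified sketch that merges blocks pairwise until two remain and then merges those two directly into the output (an illustration of the pass structure only, not the paper's buffer-reuse implementation):

```cpp
#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

// Pairwise multi-way merge: repeatedly merge adjacent pairs of sorted
// blocks until at most two blocks remain, then produce the final output.
// An odd block left over at the end of a round is carried over unchanged.
std::vector<int> multiway_merge(std::vector<std::vector<int>> blocks) {
    while (blocks.size() > 2) {
        std::vector<std::vector<int>> next;
        for (std::size_t i = 0; i + 1 < blocks.size(); i += 2) {
            std::vector<int> merged;
            std::merge(blocks[i].begin(), blocks[i].end(),
                       blocks[i + 1].begin(), blocks[i + 1].end(),
                       std::back_inserter(merged));
            next.push_back(std::move(merged));
        }
        if (blocks.size() % 2 == 1)          // odd block: carry it over
            next.push_back(std::move(blocks.back()));
        blocks = std::move(next);
    }
    std::vector<int> out;
    if (blocks.size() == 2)                  // final merge, straight to output
        std::merge(blocks[0].begin(), blocks[0].end(),
                   blocks[1].begin(), blocks[1].end(),
                   std::back_inserter(out));
    else if (blocks.size() == 1)
        out = std::move(blocks[0]);
    return out;
}
```

The real implementation avoids the per-round allocations by ping-ponging between the input segment and the temporary buffer, as described above.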
4. Performance Evaluation
The machine used for performance evaluation has an Intel Core i7 4700MQ CPU which works at 3100 MHz. The CPU has four cores that can deliver 8 threads via Intel Hyper-Threading Technology. The chip on the mainboard has two dual-channel DDR3 memory controllers which provide a 1600 MHz memory clock frequency. The total capacity of main memory is 16 GB.
The operating system and compiler used are Windows 10 and Visual Studio 2013, respectively.
All results are average values obtained by running the test application 30 times. In the figures below, all axes are logarithmic with base 2.
First we compared our bitonic sorting network with two other algorithms in a single thread. One is the sort function in the C++ standard library, which uses a quicksort-based introsort; the other is spreadsort, introduced in Boost library 1.58, which uses radix sort for large data sets. In Figure 2 we can see that our bitonic sorting network is strikingly faster than the sort function in the C++ standard library; its speed is never lower than twice that of the latter. Our bitonic network also outperforms spreadsort, although the gap narrows when the data sets are very large.
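The single-thread comparison can be reproduced with a small harness along these lines (our own sketch; the function names are ours, and the spreadsort include assumes Boost 1.58's `boost/sort/spreadsort/spreadsort.hpp`, left commented out so the snippet stays self-contained):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <random>
#include <vector>
// #include <boost/sort/spreadsort/spreadsort.hpp>  // for the spreadsort run

// Time one sorting function over `runs` repetitions and return the average
// wall-clock milliseconds, matching the paper's 30-run averaging methodology.
template <typename Sorter>
double avg_sort_ms(const std::vector<std::uint32_t>& data, Sorter sorter,
                   int runs) {
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        std::vector<std::uint32_t> copy = data;      // fresh unsorted input
        auto t0 = std::chrono::steady_clock::now();
        sorter(copy);
        auto t1 = std::chrono::steady_clock::now();
        total += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    return total / runs;
}

// Deterministic random input so repeated measurements are comparable.
std::vector<std::uint32_t> random_data(std::size_t n) {
    std::mt19937 gen(42);
    std::vector<std::uint32_t> v(n);
    for (auto& x : v) x = gen();
    return v;
}
```

A run would then compare, e.g., `avg_sort_ms(data, [](auto& v) { std::sort(v.begin(), v.end()); }, 30)` against the same call wrapping `boost::sort::spreadsort::spreadsort`.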
Figure 2. Performance comparison of different algorithms running in a single thread
Next we tested multi-way merge with multiple threads. From Figure 3 we can see that when we use 8 threads, the performance gets a noticeable improvement. This shows that our implementation can also benefit from hyper-threading, so we use 8 threads for the following tests. In Figure 4 we show the performance of our multi-threaded sorting implementation, which can sort one billion 32-bit integers in 17 seconds.
Figure 3. Performance comparison of bitonic sorting with 4 and 8 threads, respectively
Figure 4. Performance evaluation of optimized merge sort
5. Conclusion
We proposed a new parallel sorting implementation which utilizes new AVX instructions. Data are stored by lanes while being sorted in vector registers, which reduces the number of SIMD instructions needed. Our single-threaded bitonic network strikingly outperforms the sort function in the C++ standard library, as well as the radix sort implementation in the Boost library. Our multi-threaded algorithm can sort one billion integers in less than 17 seconds.
Acknowledgements
The work described in this paper is supported by the Natural Science Foundation of Hubei Province of China (No. 4006-36115030) and the Fundamental Research Funds for the Central Universities (No. 52902-0900206297).