International Journal of Electrical and Computer Engineering (IJECE)
Vol. 5, No. 3, June 2015, pp. 503~517
ISSN: 2088-8708
Journal homepage: http://iaesjournal.com/online/index.php/IJECE
An Efficient Cache Organization for On-Chip Multiprocessor Networks

Medhat H. Awadalla, Ahmed Sadek

Department of Electrical and Computer Engineering, SQU, Oman
Department of Communication, Electronics and Computers, University of Helwan, Egypt
Department of Computer Science, Faculty of Computers and Information, Fayoum University, Egypt
Article Info

Article history:
Received Nov 18, 2014
Revised Apr 21, 2015
Accepted May 2, 2015

ABSTRACT
To meet the growing computation-intensive applications and the needs of low-power, high-performance systems, the number of computing resources on a single chip has enormously increased. As many computing resources are added to build a System-on-Chip, the interconnection between them becomes a challenging issue. This paper focuses on the interconnection design issues of area, power and performance of chip multi-processors with shared cache memory. It shows that having a shared cache memory contributes to performance improvement; however, a typical interconnection between cores and the shared cache using a crossbar occupies most of the chip area, consumes a lot of power and does not scale efficiently with an increased number of cores. This paper proposes an architectural paradigm in an attempt to gain a smaller area occupation, allowing more space for an additional cache memory. It also reduces power consumption compared to the existing crossbar architecture. Furthermore, the paper modifies the typical MESI cache coherence algorithm to be tailored for the suggested architecture. The experimental results show that the developed architecture produces fewer broadcast operations compared to the typical algorithm.
Keywords:
Chip Multi Processors
Interconnection mechanisms
Shared cache memory
Copyright © 2015 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Medhat Awadalla
Department of Electrical and Computer Engineering, SQU, Oman
Email: medhatha@squ.edu.om
1. INTRODUCTION
The success of system-on-a-chip (SoC) design hinges upon a well-concerted integrated approach from multiple disciplines, such as device, design, and application. From the device perspective, rapidly improving VLSI technology allows the integration of billions of transistors on a single chip, thus permitting a wide range of functions to be combined on one chip. From the application perspective, numerous killer applications have been identified which can make full use of the aforementioned functionalities provided by a single chip. From the design perspective, however, with greater device integration, system designs become more complex and are increasingly challenging to design.

Multi-core processors have become the mainstream architectures as a result of their ability to provide parallelism and multi-tasking to achieve higher performance [1]. Initial multi-core designs replicated off-the-shelf cores multiple times, and then connected the cores together using an interconnection mechanism. These designs are called multi-core-oblivious designs, as they replicate cores unaware that each core is becoming part of a whole system. However, in spite of the growing trend to put multiple cores on the die, a deep understanding is lacking in the literature of the design space of the interconnection framework, and particularly of how it interacts with the rest of the multi-core architecture. For a given number of cores, the "best" interconnection architecture in a given chip multiprocessing environment depends on a myriad of
factors, including performance objectives, power/area budget, bandwidth requirements, technology, and even the system software.

While interconnects are relatively well understood for connecting chips, multi-chip modules, and board-level nodes, connecting cores on the same chip presents a distinctly different problem. This is because power, area, latency, and bandwidth are all first-order design constraints for on-chip interconnects.
Secondly, the design choices for the cores, caches, and interconnect interact to a much greater degree. For example, an aggressive interconnect design consumes power and area resources that then constrain the number, size, and design of the cores and caches. Thus, unlike a conventional multiprocessor, performance is not necessarily maximized by the highest bandwidth interconnect available. Of course, the converse is also true: the number and type of cores (as well as on-chip memory) also dictate requirements on the interconnect. In fact, increasing the number of cores places conflicting demands on the interconnect, requiring higher bandwidth while decreasing the available real estate. Therefore,
this paper addresses the interconnection design issues of area, power and performance of chip multi-processors with shared cache memory. On-chip network architectures are believed to be the ideal solution for system-on-chip interconnection problems [2]. Networks on Chip (NoC) can improve design productivity by supporting modularity and reuse of complex cores, thus enabling a higher level of abstraction in architectural modeling of future systems [3]. Current NoC architectures consider the processing core and its cache memory as a single module.
Accessing cache memory is often a principal bottleneck in such multi-core systems, as multiple cores compete for limited on-chip memory resources, so cache memory organization and management is essential for improved performance and efficiency. Having a shared cache that holds data accessible by multiple cores has proved to enhance performance [4], but typical interconnection mechanisms have a great impact on the system, especially the typical crossbar, which consumes a lot of area and correspondingly limits the available space for cache memory [5].
On-chip caching has been employed as a very effective approach to reduce memory bandwidth requirements. While caching helps to reduce off-chip memory bandwidth significantly, the effective use of caches is not always guaranteed. Useful content may be evicted due to address conflicts, which adds to the off-chip memory pressure. The problem is more significant in multi-core systems, in which multiple tasks are executing simultaneously. The contention for cache resources would potentially lead to significant inter-task cache interference and thus much higher memory bandwidth requirements. Cache partitioning techniques have been a well-studied area in recent years. These cache organizations aim at improving cache utilization to achieve better performance and power efficiency [6, 7, 8]. Most existing cache partitioning techniques focus on reducing cache miss rates in order to increase system throughput or overall performance. However, the majority of these approaches do not consider off-chip memory bandwidth as a primary optimization goal.
This paper suggests an architectural model where cache memory is organized as both shared and private cache, utilizing the new concept of NoC to implement the interconnection mechanism. The paper focuses on an 8-core processor as an example model where performance, area and power measures are studied and compared to previously suggested models.
The rest of this paper is organized as follows: Section 2 gives the related work. Section 3 shows the typical interconnection mechanisms and their characteristics. Section 4 illustrates the cache organization of shared and private data. Section 5 demonstrates the network-on-chip (NoC) architecture. Section 6 presents the suggested model for an 8-core processor. Section 7 concludes the paper.
2. RELATED WORK
As technology moves towards multi-core system-on-chips (SoCs), they become ubiquitous in all computing domains, ranging from general purpose servers to domain-specific processors, and from 3G cellular base stations to the latest game consoles [9]. These SoCs today have dozens of tiled cores on a single chip. The core count is expected to grow to hundreds or even thousands in the near future [10]. Conventional wisdom is to double the number of cores on a chip with each silicon generation [11]. For example, the latest release of the NVIDIA Tesla C1060 GPU has as many as 240 cores integrated in a chip [12]. The fact that such a high number of cores will be tightly integrated onto the same die presents a fundamental challenge for on-chip communication among cores.
The network-on-chip (NoC) is an enabling technology for the integration of large numbers of embedded cores on a single die. The existing method of implementing a NoC with planar metal interconnects is deficient due to the high latency and significant power consumption arising out of the long multi-hop links used in data exchange. Different networks-on-chip (NoCs) [13] are emerging as the scalable fabric for interconnecting the cores. They consist of routers, links, and well-defined network interfaces. One of the key issues in the design of NoCs is the reduction of both area and power dissipation. Such requirements impose important design choices like the topology, switching technique, routing algorithm and the architectural
implementation. As a result, most current NoCs implement regular network topologies that can be easily laid out on a chip surface. Some topologies are preferred, since they offer lower power consumption than other topologies for application-specific mapping of tasks [14].
There have been several proposals and implementations of high-performance chip multi-processor architectures [15]. However, the latency, power consumption and interconnect routing problems of conventional NoCs can be addressed by replacing or augmenting multi-hop wired paths with high-bandwidth single-hop long-range wireless links. This opens up new opportunities for detailed investigations into the design of wireless NoCs (WiNoCs) with on-chip antennas, suitable transceivers and routers.
Moreover, as it is an emerging technology, on-chip wireless links also need to overcome significant challenges pertaining to reliable integration. In their paper, the authors present various challenges and emerging solutions regarding the design of an efficient and reliable WiNoC architecture. The interconnection mechanisms presented in [16] are discussed together with the advantages and disadvantages of each mechanism.
Cores in Hydra [17] are connected to the level 2 (L2) cache through a crossbar. Modularity results in enhanced control over electrical parameters and hence can result in higher performance or reduced power consumption. These interconnections can be highly effective in particular environments where most communication is local, explicit core-to-core communication. Due to their scalability, these architectures are attractive for a large number of cores. The crossover point where these architectures become superior to the more conventional interconnects studied is not clear, and is left for future work [18].
There is a large body of related work evaluating tradeoffs between bus-based and scalable shared memory multiprocessors in the context of conventional (multiple-chip) multiprocessors. Some earlier implementations of interconnection networks for multiprocessors have been described in [8, 12]. However, on-chip interconnects have different sets of tradeoffs and design issues. Thus, the conclusions of those prior studies must be re-evaluated in the context of on-chip multiprocessors with on-chip interconnects.
A recent proposal for organizing cache memory based on data sharing is presented in [19]. The architecture in the Nahalal model [20] partitions the L2 cache space according to the programs' data sharing, and can thus offer vicinity of reference to both shared and private data. The organization of cache memory in [20] is based on a non-uniform cache architecture (NUCA) array with a switched network embedded in it for high performance.
3. TYPICAL INTERCONNECTION MECHANISMS
This section discusses the interconnection mechanisms which are used to connect multi-core processor elements.

3.1. Nahalal: Cache Organization
In monolithic processor architectures, cache memory is organized according to contents (e.g., instruction cache and data cache) or hierarchically (e.g., level 1 and level 2 cache). In a multi-core environment, other factors can efficiently organize the cache memory in order to gain the benefits of having multiple cores on the same chip.
For instance, data sharing is an important factor to consider for multi-core systems, where there is a discrimination between shared data that is accessible by multiple cores and private data that is accessed by a single core.
As global wire delays become a dominant factor in VLSI design [13], on-chip access time depends on the distance between the processor and the data. In many designs, the L2 cache is typically located as a bulk in one location (e.g., at the center, surrounded by the processors), hence producing non-uniform access across the chip. In commercial workloads, a significant fraction of memory accesses involve shared data that cannot be replicated without a performance penalty to maintain coherence.
The newly proposed Nahalal cache organization in this paper is inspired by the layout of a cooperative village called Nahalal, shown in Figure 1.a, which is based on urban design ideas from the 19th century [20]. In Nahalal, public buildings are located in an inner circle, enclosed by a circle of homesteads. Private areas of land are arranged in outer circles, each close to its owner's house.
Figure 1.a. Nahalal village layout
Figure 1.b. Conceptual layout
The suggested architecture is similar to the organization of the village, where a fraction of the L2 cache memory is located at the center of the chip, shared and surrounded by the processing cores. The rest of the L2 cache is placed on the outer circle surrounding the cores, as shown in Figure 1.b. Implementation of Nahalal improves L2 cache access times by up to 41% compared to traditional CMP designs.
3.2. Networks on Chip (NoC)
NoC architectures may adopt design concepts from traditional computer networks; however, on-chip implementations of networks need different considerations, where network architectures and protocols have to deal with the advantages and limitations of the silicon fabric. On-chip implementations have advantages over traditional computer networks, like local proximity and a deterministic nature where the nodes to be connected are determined before the network is designed. Moreover, dynamic changes of links (link upgrades or failures) are not expected on-chip. Also, highly reliable link operation can be assumed.
Based on these considerations, the modules are interconnected by a network of multi-port switches connected to each other by links composed of parallel point-to-point lines. The network, for example, applies a mesh topology and employs wormhole packet forwarding with hop-by-hop credit-based backpressure flow control (for lossless buffer operation and minimal buffer requirements) [6]. Packets are forwarded along a shortest path, employing the X-Y mechanism where a packet is routed first in the "X" direction, then in a perpendicular direction. This scheme leads to a simple, cost-effective router implementation.
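As an illustration of this forwarding rule, the following minimal sketch (ours, not from the paper; the coordinate scheme and port names are assumptions) computes the output port for one hop of X-Y routing on a 2D mesh:

def xy_route(cur, dst):
    """One hop of X-Y (dimension-ordered) routing on a 2D mesh.
    cur, dst: (x, y) router coordinates; returns the output port."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                       # exhaust the X offset first
        return 'E' if dx > cx else 'W'
    if cy != dy:                       # then move in the perpendicular direction
        return 'N' if dy > cy else 'S'
    return 'LOCAL'                     # arrived: eject to the attached core

# Hop-by-hop path from router (0, 0) to router (2, 1):
pos, path = (0, 0), []
while pos != (2, 1):
    port = xy_route(pos, (2, 1))
    path.append(port)
    step = {'E': (1, 0), 'W': (-1, 0), 'N': (0, 1), 'S': (0, -1)}[port]
    pos = (pos[0] + step[0], pos[1] + step[1])
print(path)                            # ['E', 'E', 'N']

Because every packet exhausts its X offset before turning into Y, routes are deterministic, which is part of what keeps the router cheap.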
The packet is divided into multiple flits [15]. The flits are classified as the following types:
• FP (Full Packet): a one-flit packet
• EP (End of Packet): the last flit in a packet
• BDY (Body): a non-last flit
So, all flits except the last one in a packet are tagged BDY. The first flit of a packet can be detected as the first flit following either an FP or EP flit, and this identification triggers the routing mechanism.
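A minimal sketch of this detection rule (illustrative only; flit types are modeled as strings rather than header bits):

def packet_heads(flits):
    """Return indices of head flits: a flit is a packet head if the
    previous flit was tagged FP or EP (or the stream just started)."""
    heads, prev_ended = [], True       # stream start behaves like "after EP"
    for i, ftype in enumerate(flits):
        if prev_ended:
            heads.append(i)            # this flit triggers the routing decision
        prev_ended = ftype in ('FP', 'EP')
    return heads

# A 1-flit packet, a 3-flit packet (BDY, BDY, EP), then another 1-flit packet:
print(packet_heads(['FP', 'BDY', 'BDY', 'EP', 'FP']))   # [0, 1, 4]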
Every arriving flit of a packet on the input port of the router is stored in an input buffer. On the arrival of the first flit, the routing algorithm determines the output port for which the packet is destined. Routing information per input port is stored in the current routing table (CRT) until the tail flit of the packet is received, processed and delivered. When a flit is forwarded from an input to an output port, one buffer becomes available and a buffer credit is sent back to the previous router. Each output port of a router is connected to an input port of a neighbor router. The output port maintains the number of available flit slots in the buffer of the neighbor input port. This number is stored in the Next Buffer State (NBS), which is decremented upon transmission of a flit and incremented upon the reception of a buffer credit from the neighbor router [16].
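The NBS bookkeeping amounts to a per-output-port credit counter. A minimal sketch (hypothetical class and method names; the real mechanism is hardware state in the router):

class OutputPort:
    """Next Buffer State (NBS): free flit slots in the neighbor's input buffer."""
    def __init__(self, neighbor_buffer_depth):
        self.nbs = neighbor_buffer_depth    # all neighbor slots start free

    def send_flit(self):
        assert self.nbs > 0, "no credit: neighbor buffer is full"
        self.nbs -= 1                       # decremented upon flit transmission

    def receive_credit(self):
        self.nbs += 1                       # incremented when a buffer credit returns

port = OutputPort(neighbor_buffer_depth=4)
port.send_flit(); port.send_flit()          # two flits in flight, nbs == 2
port.receive_credit()                       # neighbor freed one slot
print(port.nbs)                             # 3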
3.3. NoC Connecting Shared Cache Memory
In this section, we focus on an 8-core processor with 8 cache banks. A crossbar as an interconnection mechanism carries a heavy area overhead. If the total die area is around 400 mm², then the area overhead for an acceptable latency is around 46.8% for full sharing (nearly half the chip!). For calculating on-chip memory sizes, the density is taken to be 1 bit per square micron, or 0.125 MB per mm², for 65 nm technology [2].
If the cache banks were private, and an SBF were used to interconnect these banks to the 8 cores, there would be 3 MB of cache memory available for each core. The proposed architectural model in Figure 2 is inspired by the Nahalal cache architecture, where part of the L2 cache memory is shared and placed in an inner circle surrounded by the NoC routers. Each router is connected to neighbor routers, a processing core, and a cache memory bank. The remaining part of the L2 cache is kept private for each core.
Figure 2. NoC connecting shared cache memory
By calculating the area occupied by the on-chip network, we can calculate the remaining area available for cache memory. The area occupied by links can be calculated according to:

$\text{Cost}_{\text{wire-area}} = A_0 \sum_{i \in \text{links}} W(i)\,L(i)$   (1)
where $A_0$ is the wire pitch, $W(i)$ is the width of link $i$ (number of wires), and $L(i)$ is the length of link $i$ [2].
Considering 16-bit data signals from one port to another, the total area occupied by the 24 links that are implemented in the 8X metal plane with a wire pitch of 1.6 µm (for 65 nm technology) is estimated to be ~3 mm² [2].
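This estimate can be cross-checked against Eq. (1). The paper does not list the individual link lengths L(i), so the sketch below simply backs out the average link length implied by the stated totals:

# Back out the average link length implied by Eq. (1) and the stated totals.
A0_mm = 1.6e-3    # wire pitch A0: 1.6 um expressed in mm
wires = 16        # W(i): 16-bit data signals per link
links = 24        # number of links in the 8-core network
area  = 3.0       # total wire area quoted above, mm^2

avg_len_mm = area / (A0_mm * wires * links)
print(round(avg_len_mm, 1))   # ~4.9 mm average link length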
The area cost of a router is affected by several parameters: the number of ports and the size of the flit buffers. A good estimate of the area cost of the router is its flip-flop count [16]. The cost of a router estimated by flip-flop count results in ~625 flip-flops, taking an area of about 0.103 mm² (for 65 nm technology).
So, we need about 0.825 mm² for the 8 routers and hence 3.825 mm² for the entire network. This area occupied by the network would limit the available cache space to 2.94 MB per core. If we allocate 2 MB to each core as a private cache, we have a remaining 7.52 MB of cache memory to be shared by the 8 cores at the inner circle of the architecture.
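The cache-budget arithmetic behind these figures is short enough to verify directly (assuming, as the numbers above imply, that the network area is traded against the 8 × 3 MB all-private baseline at 0.125 MB/mm²):

# Reproduce the cache-budget numbers from the stated area figures.
density   = 0.125              # MB per mm^2 at 65 nm
net_area  = 3.0 + 8 * 0.103    # links + 8 routers ~ 3.825 mm^2
total_mb  = 8 * 3.0            # 24 MB: the all-private SBF baseline
remaining = total_mb - net_area * density
print(round(remaining / 8, 2))           # ~2.94 MB per core
print(round(remaining - 8 * 2.0, 2))     # ~7.52 MB shared after 2 MB private/core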
Therefore, the area required to build the NoC is far smaller than that needed to construct the crossbar. Figure 3 shows the area taken by the crossbar interconnection for 2-way, 4-way, and 8-way sharing for the 8-core processor [2].
By increasing the number of data lines, we can achieve a higher bandwidth, but with increased area overhead. By utilizing Eq. (1), the network area overhead for different numbers of data lines is demonstrated in Figure 4. Figure 5 shows the correspondingly available cache memory space versus the number of data lines. From these figures, it is observed that the area occupied by the network, even with 256 bits of data, is much smaller than that of the crossbar (nearly 15%).
Dynamic power dissipation in switching circuits is [16]:

$P_d = C_L V_{dd}^2 f$   (2)
where $C_L$ is the load capacitance, $V_{dd}$ is the supply voltage, and $f$ is the switching frequency. The load capacitance consists of the wire capacitance and the gate capacitance of the driven transistor. Assuming that the wire capacitance is dominant and estimated to be 0.2 pF/mm [13], and a link frequency of 2.5 GHz, the dissipated power per link would be about 46 mW, giving a total of 1.1 W for all links. The dynamic power per flip-flop is estimated to be 0.05 mW at 2.5 GHz for 65 nm technology, and then the total power dissipated by the routers would be about 250 mW. The power overhead due to a typical crossbar is very significant.
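The quoted totals follow from the per-element estimates; a quick check (using the paper's figures of 46 mW per link and 0.05 mW per flip-flop at 2.5 GHz):

# Check the network power totals from the per-element estimates in the text.
link_mw        = 46      # per 16-bit link (the paper's Eq. (2) estimate)
links          = 24
ff_mw          = 0.05    # per flip-flop at 2.5 GHz, 65 nm
ffs_per_router = 625
routers        = 8

print(links * link_mw / 1000)             # 1.104 ~ 1.1 W for all links
print(routers * ffs_per_router * ff_mw)   # 250.0 mW for the 8 routers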
The overhead can be more than the power taken up by three full cores for a completely shared cache (more than 25 W). Figure 6 shows the power dissipated by the crossbar for various sharing levels for the 8-core processor [2].
Increasing the number of data lines to get a higher bandwidth affects in turn the power dissipated by the network. Figure 7 shows the power dissipated in the network for different numbers of data lines. From these results, it is obvious that even with 256 data lines, the power dissipated by the network is about 60% of that dissipated by the crossbar.
Figure 3. Area overhead of the crossbar
Figure 4. Relationship between bandwidth and area
Figure 5. Available cache space for different numbers of data lines
Figure 6. Power dissipated by the crossbar with various levels of sharing
Figure 7. Power dissipated for various numbers of data lines
3.4. Effect of Adding a Customized Bus to the NoC
Interconnect architectures which rely solely on NoCs have several drawbacks, like the latency of critical signals and the complexity of operations that require global knowledge or control. For example, the distributed nature of the network is an obstacle for a broadcast operation. A broadcast operation on a network requires massive additional duplication of unicast messages. A recent study suggested the addition of a low-latency, customized shared bus to the NoC [17]. This bus is inferior to the NoC in terms of bandwidth, but it has an inherent main property: it is capable of broadcasting information. This newly suggested architecture is called Bus Enhanced Network on Chip, or "BENoC".
The addition of a shared bus will subsequently contribute to the added interconnect overhead in terms of area and power. But even with the added overhead of the shared bus, the total overhead of the BENoC is still lower than that of the crossbar interconnect.
The shared bus consists of a 7-byte-wide address bus, a 12-byte-wide snoop bus and an 8-byte-wide response bus. These widths are typical, as suggested for the 8-core case in [2]. If the shared bus is to use the low-latency 8x metal layer, we can utilize equation (1) to calculate the area consumed by the bus. Figure 8 shows the area consumed by the BENoC and confirms that the total area of the interconnect, even with 256 bits of data lines, is still not comparable with the area overhead of the crossbar implementing all-way sharing in the 4x plane.
Figure 8. Relationship between bandwidth and area of BENoC
Taking into account the added power consumption of the shared bus that spans the full width of the chip, we can reuse equation (2) to calculate the power consumed by the BENoC interconnect. Figure 9 shows that the power dissipated by the interconnect, even with 256 bits of data lines, is still less than that consumed by the crossbar implemented in the 4x plane (about 80%).
Figure 9. Power dissipated for various numbers of data lines for BENoC
4. CACHE COHERENCE WITH MESI WRITE-BACK PROTOCOL
In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion. In the following subsections, the typical MESI cache coherence algorithm is discussed.
4.1. Typical MESI Coherence Algorithm
Cache coherence is a must to avoid using inconsistent data, and therefore there must be some means to distribute information about changes between cache memory and main memory.
The idea in write-back cache coherence protocols is to hold a line in the cache, even if it is modified, as long as other nodes do not need it, and to update its contents externally only when another node needs it. The process in which one node notifies the other nodes of its actions while the other nodes listen is called snooping [19]. The MESI model specifies that each cache line has two status bits, so each line can be in one of the following possible states:
Modified: the cache line resides only in this cache and its contents are modified relative to the main memory.
Exclusive: the cache line resides only in this cache and its contents are the same as memory.
Shared: the cache line is shared with others.
Invalid: the cache line contains no valid memory copy.
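For reference, the four states and the ownership distinction used below can be sketched as follows (the two-bit encoding is illustrative; the paper only says that two status bits are used):

from enum import Enum

class MESI(Enum):
    MODIFIED  = 0b11   # only copy; differs from main memory
    EXCLUSIVE = 0b10   # only copy; same as main memory
    SHARED    = 0b01   # other caches may hold copies
    INVALID   = 0b00   # no valid memory copy

def owned(state):
    """Owned (M, E) lines may be written without notifying other nodes."""
    return state in (MESI.MODIFIED, MESI.EXCLUSIVE)

print(owned(MESI.EXCLUSIVE), owned(MESI.SHARED))   # True False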
Modified and exclusive lines are owned by the cache and can be changed without telling others. However, when attempting to access a non-owned line, the action is broadcast, and if the line is owned by another cache, the owner updates the world and changes its line state. If we apply a sample access sequence on an 8-core CMP, we can see the importance of broadcasting to maintain coherence between nodes and figure out the performance of this kind of operation.
Assume we begin from the reset state; let's evaluate the following sample sequence:
Core 0 reads a0: broadcast the address to all nodes; the line status is now "E".
Core 0 reads a0: core 0 reads a0 from cache and the status is still "E".
Core 0 writes a0: core 0 modifies a0 only in its cache and the status is now "M".
Core 1 reads a0: core 1 broadcasts the address; core 0 broadcasts a response and supplies data to both cache and main memory, and the status changes to "S".
Core 0 reads a0: core 0 reads from cache; status still "S".
Core 1 writes a0: core 1 broadcasts the address so that other nodes invalidate their copies, and the status changes to "E".
Core 1 writes a0: core 1 modifies a0 in its cache and changes the status to "M".
Core 0 reads a0: core 0 broadcasts the address; core 1 broadcasts a response and supplies data to both cache and main memory, and the status changes to "S".
The previous sequence needed to broadcast information 6 times to all 8 processing cores, consuming 24 steps if performed using the on-chip network. Besides that, the status of the cache line a0 has been changed to "S" twice, first at step 4 and second at step 8, and there are two copies of a0 in the cache memories of the two processing cores.
4.2. The Proposed Tuned-MESI Coherence Algorithm
The main concept of the Nahalal cache organization is to place shared information in a shared location without duplication at every core. Therefore, when a cache line is shared between two or more processing cores, it is migrated to the shared cache memory so that it is accessible by all cores. This cache line is considered a hot line and can be accessed by all cores.
To realize this concept, the traditional MESI cache coherence algorithm must be modified to suit the new architecture, taking into account the existence of shared and hot shared cache lines. This modified algorithm is called Tuned-MESI. The traditional MESI algorithm is still applicable for the new architecture before a line is considered hot. The flowchart and state diagram of the Tuned-MESI algorithm are depicted in Figures 10 and 11.
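The rule that distinguishes Tuned-MESI from plain MESI can be sketched as a per-line counter checked against the hot threshold (hypothetical names and bookkeeping; the complete protocol is what Figures 10 and 11 specify):

class TunedLine:
    """Tuned-MESI addition: count external accesses to a line and migrate
    it to the shared (hot) cache once the threshold is reached."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.external_accesses = 0
        self.hot = False               # plain MESI applies while not hot

    def external_access(self):
        """Called when a core other than the line's owner touches it."""
        self.external_accesses += 1
        if not self.hot and self.external_accesses >= self.threshold:
            self.hot = True            # one broadcast announces the hot line;
                                       # afterwards all cores use the shared copy
        return self.hot

line = TunedLine(threshold=2)
print(line.external_access())          # False: first external access
print(line.external_access())          # True: line migrates, becomes hot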
Let us go through the same sequence again for the suggested architecture, which has the advantage of a shared bus that all nodes snoop, so that broadcasting is performed much more efficiently, and which implements the Tuned-MESI algorithm. Assume that a shared line is considered hot after two external accesses.
Core 0 reads a0: broadcast the address to all nodes; the line status is now "E".
Core 0 reads a0: core 0 reads a0 from cache and the status is still "E".
Core 0 writes a0: core 0 modifies a0 only in its cache and the status is now "M".
Core 1 reads a0: core 1 broadcasts the address; core 0 broadcasts a response and supplies data to both cache and main memory, and the status changes to "S".
Core 0 reads a0: core 0 reads from cache; status still "S".
Core 1 writes a0: the line is migrated to the shared area and considered a hot line. Information about the hot line is broadcast to all cores and the status is "S".
Core 1 writes a0: core 1 modifies a0 in the shared cache; status still "S".
Core 0 reads a0: core 0 reads from the shared cache; status still "S".
Analysis of the previous sequence shows that the broadcasting operations have been reduced to 4 operations. Moreover, these broadcasts have been carried out using the shared bus, which requires fewer cycles to propagate a message to all processing cores compared to the NoC interconnect alone.
Figure 10. State transition diagram of Tuned-MESI
5. SIMULATED EXPERIMENTS AND DISCUSSION
The program models two processors, each of them consisting of 8 cores. The first processor has only private caches and implements the typical MESI algorithm to maintain cache coherence between all 8 cores. The second processor uses the proposed architecture that organizes cache memory as both shared (at the inner circle) and private (a private cache for each core). This processor implements the proposed Tuned-MESI algorithm to maintain cache coherence. The program performs a random sequence of cache line accesses and measures the number of broadcast operations needed to maintain cache coherence for the processor that implements the typical MESI algorithm, and for the proposed architecture that implements the Tuned-MESI cache coherence algorithm.
The user defines the following parameters:
Number of participating cores: cores that will be participating in the access sequence.
Number of modeled cache lines: cache lines that will be involved in the access sequence.
Threshold: number of accesses after which a line is considered hot.
Hot space: available space for hot shared cache lines.
Private lines: percentage of the total number of lines that will be private to cores.
The "Threshold" and "Hot space" parameters are only applicable to the proposed architecture. They have no effect on the typical algorithm.
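These parameters map naturally onto a small configuration record for the access-sequence driver. A sketch with hypothetical field names, set to the first experiment reported below (threshold 2, hot space 5, 10% shared lines):

from dataclasses import dataclass
import random

@dataclass
class SimConfig:
    cores: int          # number of participating cores
    lines: int          # number of modeled cache lines
    threshold: int      # accesses before a line is considered hot (Tuned-MESI only)
    hot_space: int      # capacity of the hot shared cache (Tuned-MESI only)
    private_pct: float  # fraction of lines that stay private to a core

def random_access(cfg):
    """One driver step: a random core issues a random read/write to a random line."""
    return (random.randrange(cfg.cores),
            random.randrange(cfg.lines),
            random.choice(('read', 'write')))

cfg = SimConfig(cores=8, lines=100, threshold=2, hot_space=5, private_pct=0.9)
print(random_access(cfg))   # e.g. (3, 42, 'write')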
The simulator is set to generate 100 cache lines. To measure the broadcasts incurred by the access of a random core to a random cache line with a random read or write access, we used a sequence length of 100,000 accesses. The user-variable parameters are manipulated to demonstrate the different factors affecting the performance of the architecture. For a configured set of parameters, the simulator is run 10 times and the results are averaged. The effect of the following parameters on the behavior of the processors is addressed: participating cores, threshold, hot space and the level of sharing. First, consider a small threshold of two, a hot space of five and a level of sharing of 10% (10% of all lines are shared).
From the achieved results shown in Figure 12, it is obvious that when the number of participating cores increases, the number of broadcasts increases. The Tuned-MESI outperformed the typical algorithm because the available shared cache was relatively enough to accommodate the small percentage of shared lines. If the percentage of shared lines is increased to 50%, the typical algorithm shows the same profile, where the number of broadcasts increases as the number of shared lines and participating cores increases, as illustrated in Figure 13. However, the number of broadcasts significantly increased as the number of shared lines increased, which means more broadcasting.
The performance of the Tuned-MESI algorithm was not reasonable, because the available hot space is relatively too small to accommodate the shared lines. An increased broadcast overhead was incurred to remove the least accessed hot lines from the shared cache and add others to it. By increasing the threshold for a cache line to be considered hot to ten, the performance of the tuned algorithm was again improved by