International Journal of Electrical and Computer Engineering (IJECE)
Vol. 5, No. 4, August 2015, pp. 669~684
ISSN: 2088-8708
Journal homepage: http://iaesjournal.com/online/index.php/IJECE
Performance Enhancement of Multicore Architecture

Medhat Awadalla, Hanan Konsowa

Electrical and Computer Engineering Department, SQU, Oman
Communications, Electronics and Computers Department, Faculty of Engineering, Helwan University, Cairo, Egypt
Article Info

Article history: Received Mar 8, 2015; Revised Apr 21, 2015; Accepted May 10, 2015

ABSTRACT
Multicore processors integrate several cores on a single chip. The fixed architecture of multicore platforms often fails to accommodate the inherently diverse requirements of different applications. The permanent need to enhance the performance of multicore architectures motivates the development of a dynamic architecture. To address this issue, this paper presents new algorithms for thread selection in the fetch stage. Moreover, this paper presents three new fetch stage policies, EACH_LOOP_FETCH, INC_FETCH, and WS_FETCH, based on the Ordinary Least Square (OLS) regression statistical method. These new fetch policies differ in thread selection time, which is represented by instruction count and window size. Furthermore, the multicore simulation tool, Multi2Sim, is adapted to cope with dynamic multicore processor design by adding a dynamic feature to the thread selection policy in the fetch stage. SPLASH2, a suite of parallel scientific workloads, has been used to validate the proposed adaptation of Multi2Sim. Intensive simulated experiments have been conducted, and the obtained results show that remarkable performance enhancements have been achieved in terms of execution time and number of instructions per second. The proposed approach also produces fewer broadcast operations compared to the typical algorithm.
Keywords:
Fetch policy
Multi2sim
Multicore
Ordinary Least Square (OLS)
Pipeline
Processor
Copyright © 2015 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Medhat Awadalla,
Department of Electrical and Computer Engineering, SQU, Oman
Email: medhatha@squ.edu.om
1. INTRODUCTION
Industry is embracing multicore by rapidly increasing the number of processing cores per chip. In 2005, both AMD and Intel offered dual-core x86 products; AMD shipped its first quad-core product in 2007. Meanwhile, Sun shipped an 8-core, 32-threaded CMP in 2005. It is conceivable that the number of cores per chip will increase exponentially, at the rate of Moore's Law, over the next decade [1]. In fact, an Intel research project explores CMPs with eighty identical processor/cache cores integrated onto a single die, and Berkeley researchers suggest future CMPs could contain thousands of cores [2].
Multicore processors have been widely used to execute different applications. A multicore processor depends on multithreading architecture in many stages. Multithreading architectures, especially the Simultaneous Multithreading (SMT) architecture, provide the microprocessor with significant parallelism, which is the potential key to better performance and higher throughput. SMT is more promising for enhancing processor utilization than superscalar or conventional multithreading designs due to its global instruction scheduling.
SMT architecture has been defined as the full sharing of system resources, i.e., computation resources and memory access resources, by several concurrently executing threads in the same core [1]. However, this resource sharing leads to uneven distribution among threads as well as performance unfairness, which might not be optimized until a well-designed scheme is carried out. A long-latency load might make some threads occupy the shared resources inefficiently, and is thus one of the major obstacles towards higher parallelism and better performance [3]. As a result, some instruction fetch policies are designed [4] to allocate priorities in the fetch stage to alleviate the impact of long-latency loads.
The rest of this paper is organized as follows. Subsections 1.1 and 1.2 explain the simulation framework and the benchmark suite. Section 2 gives the work related to the theme of the paper. Section 3 presents the developed methodology. Sections 4 and 5 present dynamic fetch policies with variable fetch queue size. Section 6 illustrates the experiments and discussion.
Our framework consists of a multicore simulation tool and a subset of benchmark programs used to evaluate the architectural enhancement, using either one benchmark program executed in a parallel way or multiple benchmark programs executed in a sequential way [1], [6].
1.1. Multicore Simulation Description
Multi2Sim [5] is a simulation framework for heterogeneous computing, including models for superscalar, multithreaded, multicore, and graphics processors. Multi2Sim uses three different models: a functional simulation engine, also called the simulator kernel; a detailed simulation; and an event-driven simulation. Hereinafter we use two terms, context and thread. Figure 1 depicts the multicore components. Context will be used to denote a software entity, defined by the status of a virtual memory image and a logical register file. Thread will refer to a processor hardware entity, which can be represented as a physical register file, a set of physical memory pages, a set of pipeline queues, etc.
Figure 1. Multicore components
Our simulation tool supports a set of parameters that specify how stages are organized in a multithreaded design. Stages can be shared among threads or private per thread, except the execution stage, which is shared by definition of multithreading. The pipeline model in the multicore simulator is divided into five stages: fetch stage, decode/rename stage, issue stage, execution stage and commit stage, as depicted in Figure 2.
The fetch stage takes instructions from the L1 cache and places them in the instruction fetch queue (IFQ). The decode/rename stage takes instructions from the IFQ, decodes them, renames their registers and assigns them a slot in the reorder buffer (ROB) and instruction queue (IQ).
Figure 2. Processor pipelines
Then, the issue stage consumes instructions from the IQ and sends them to the corresponding functional unit. During the execution stage, the functional units operate and write their results back to the register file. Finally, the commit stage retires instructions from the ROB in program order.
A multicore processor switches among threads according to a thread selection policy. Processors can be classified as coarse-grain multithreading (CGMT), fine-grain multithreading (FGMT) and simultaneous multithreading (SMT). An FGMT processor switches threads on a fixed schedule on every processor cycle; a CGMT processor is characterized by thread switching induced by a long-latency operation or thread quantum expiration (switch-on-event); and an SMT processor is able to issue instructions from different threads in a single cycle. The used tool has three fetch thread selection policy choices (three switch cases: "Shared", "TimeSlice" and "SwitchOnEvent") and supports the multithreading paradigms as shown in Table 1 [2].
Table 1. Classification of multithreading paradigms depending on Multi2Sim variables in a CPU configuration file

Variable       Coarse-Grain MT   Fine-Grain MT   Simultaneous MT
FetchKind      SwitchOnEvent     TimeSlice       TimeSlice/Shared
DispatchKind   TimeSlice         TimeSlice       TimeSlice/Shared
IssueKind      TimeSlice         TimeSlice       Shared
CommitKind     TimeSlice         TimeSlice       TimeSlice/Shared
The variables that specify how pipeline stages divide their slots among threads are FetchKind, DispatchKind, IssueKind, and CommitKind. The values that these options can take are TimeSlice and Shared. The former means that a stage is devoted to a single thread in each cycle, alternating threads in a round-robin fashion, while the latter means that multiple threads can be handled in a single cycle. The stage bandwidth always refers to the total number of slots devoted to threads.
The fetch stage can additionally be configured with long-term thread switches by assigning the value SwitchOnEvent to the FetchKind variable. In this case, instructions are fetched from one single thread either until a quantum expires or until the current thread issues a long-latency operation, such as a load instruction incurring a cache miss.
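For concreteness, these paradigms are selected through plain variables in the Multi2Sim CPU configuration file. The snippet below is a minimal sketch of an SMT-style configuration; the section names and the surrounding core/thread variables follow the Multi2Sim 4.x INI format and are shown here as assumptions to be checked against the simulator version at hand.

    ; Sketch of a Multi2Sim x86 CPU configuration (SMT column of Table 1).
    ; Section names and surrounding variables are illustrative assumptions.
    [ General ]
    Cores = 2                  ; two cores
    Threads = 2                ; two hardware threads per core

    [ Pipeline ]
    FetchKind = TimeSlice      ; or Shared / SwitchOnEvent
    DispatchKind = TimeSlice   ; or Shared
    IssueKind = Shared         ; SMT: issue from several threads per cycle
    CommitKind = TimeSlice     ; or Shared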
1.2. Benchmark Suite Description
SPLASH-2 is a workload suite of 11 parallel scientific benchmarks, classified as kernels or applications [6]. All of them provide command-line arguments or configuration files to specify the input data size. Table 2 and Table 3 outline the SPLASH-2 workloads and the command-line arguments for some of these benchmarks. SPLASH-2 benchmarks perform computations, synchronizations and communication, stressing the processor cores, the memory hierarchy, and the interconnection networks. Thus, they are used for the evaluation of the baseline and proposed multicore architectures. Additionally, SPLASH-2 benchmarks provide arguments to specify the number of contexts created at runtime, which allows the evaluation of systems with different numbers of cores. We focus on barnes, fmm and ocean. Each benchmark executes the same number of worker threads as the number of cores. Traces are collected for 500 million instructions.
Table 2. Overview of SPLASH-2 workloads

Program     Application Domain            Problem Size
Barnes      High-Performance Computing    65,536 particles
Cholesky    High-Performance Computing    tk29.O
Fft         Signal Processing             4,194,304 data points
Fmm         High-Performance Computing    65,536 particles
Lu          High-Performance Computing    1024×1024 matrix, 64×64 blocks
Ocean       High-Performance Computing    514×514 grid
Radiosity   General                       large room
Raytrace    Graphics                      car
Volrend     Graphics                      head
Water       High-Performance Computing    4096 molecules
Table 3. Command-line arguments for the SPLASH-2 benchmarks

Benchmark   Arguments                                    Description
Fmm         $NTHREADS <input                             System of N-body problem over a number of time steps
Barnes      $NTHREADS <input                             Hierarchical Barnes-Hut method for N-body problem
Ocean       -n258 -p$NTHREADS -e1e-07 -r20000 -t28800    Ocean currents simulation
LU          -p$NTHREADS -n2048 -b16                      Lower/Upper matrix factorization
Radiosity   -batch -room -p$NTHREADS                     Equilibrium distribution of light in a scene
Raytrace    $NTHREADS <input                             Ray tracer
Water-sp    $NTHREADS <input                             Water molecule simulation
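For illustration, a typical detailed-mode run of one of these benchmarks under the simulator looks like the sketch below. The --x86-sim and --x86-config options follow the Multi2Sim 4.x command line; the configuration file name, the thread count of 4 and the input file are placeholders.

    # Sketch: run fmm with 4 worker threads in detailed mode (Multi2Sim 4.x).
    # cpu-config.ini and "input" are placeholders for local files.
    m2s --x86-sim detailed --x86-config cpu-config.ini fmm 4 < input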
2. RELATED WORK
Much work to improve the performance of multicore architectures has been conducted, including that of Tullsen [12], who explored a variety of SMT fetch policies that assign fetch priority to threads according to various criteria. ICOUNT was the best performing policy; in it, priority is assigned to a thread according to the number of instructions it has in the decode, rename, and issue stages (issue queues) of the pipeline. Threads with the fewest instructions are given the highest priority for fetch, the reasons being that such threads may be making more forward progress than others, that this prevents one thread from clogging the issue queue, and that it provides a mix of threads in the issue queue to increase parallelism. Two parameters characterize the ICOUNT scheme. The first parameter dictates the maximum number of threads to fetch from in each cycle, while the second denotes the maximum number of instructions to fetch per thread. More recent policies, implemented on top of ICOUNT, focus on this problem and add more control over the issue queues as well as the physical registers.
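To make the ICOUNT selection concrete, the sketch below shows one plausible per-cycle choice of the fetch thread; the counter name and the function are illustrative assumptions, not code from [12].

    /* Sketch of ICOUNT thread selection (illustrative).
     * front_end_count[t] = instructions thread t currently holds in the
     * decode, rename and issue stages of the pipeline.               */
    int icount_pick_thread(int nthreads, const int front_end_count[])
    {
        int best = 0;
        for (int t = 1; t < nthreads; t++)
            if (front_end_count[t] < front_end_count[best])
                best = t;      /* fewest in-flight instructions wins */
        return best;           /* thread granted fetch priority this cycle */
    }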
In [13], a load hit/miss predictor is used in a superscalar processor to guide the dispatching of instructions by the scheduler. This allows the scheduler to dispatch dependent instructions at the time they require data. The authors propose several hit/miss predictors that are adaptations of well-known branch miss predictors. The authors suggest adding a load miss predictor to an SMT processor in order to detect L2 misses. This predictor would guide the instruction fetch, switching between threads when any of them is predicted to miss in L2.
Data Gating (DG) [14] and D-Cache Warn (DWarn) [13] are other fetch policies that are based on L1 Data Cache (D-Cache) misses, assuming that L1 D-Cache misses clearly indicate future L2 cache misses. Consequently, DG prevents a thread with unresolved L1 D-Cache misses from fetching instructions, such that it is less likely to introduce long-latency loads into the system. Similarly, DWarn takes action when an L1 D-Cache miss happens. It reduces the priority of threads with unresolved L1 D-Cache misses without gating them completely. By adjusting priority, it is expected to achieve a balance between minimal inefficient occupancy and optimized fetch width utilization.
Lichen Weng and Chen Liu [4], [5] advocated that L1 and L2 cache misses are closely associated during execution, but the relationship is not as simple as L1 D-Cache misses leading directly to L2 cache misses. In fact, the relationship between L1 and L2 cache misses can hardly be described perfectly by a simple model. They used a statistical model, Ordinary Least Square (OLS) regression, to represent the relationship between L1 and L2 cache misses.
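For the one-variable case used later in this paper (a window of n samples pairing L1 D-Cache misses x with L2 misses y), the OLS fit reduces to the standard line y = a + b·x with

    b = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²,    a = ȳ − b·x̄,

so the predicted L2 misses for an observed L1 miss count x are a + b·x. (The notation is ours; the cited papers state the model in prose.)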
Another vital factor that increases processor throughput in Simultaneous Multithreading (SMT) is resource management. Here, we briefly present previous work on dynamic resource allocation in multiprocessor, multithreaded and multicore platforms. Although several proposals that address the management of a single micro-architectural resource exist in the literature, proposals to manage multiple interacting resources on multicore chips at runtime are much scarcer. Static resource partitioning techniques have been suggested, but they are not as effective as dynamically controlling the resource usage of each thread, since program phases are not fixed all the time. Static resource partitioning [15], [16] evenly splits critical resources among all threads, thus preventing resource monopolization by a single thread. However, this method can cause resources to remain idle when one thread has no need for them, even if other threads could benefit from additional resources. Martínez [17] introduced a machine learning approach to multicore resource management. It produces self-optimizing on-chip hardware agents capable of learning, planning, and continuously adapting to changing workload demands. These agents are more efficient and flexible in managing the critical hardware resources at runtime.

A history-aware, resource-based dynamic (or simply HARD) scheduler for heterogeneous CMPs has been developed [18]. HARD relies on recording application resource utilization and throughput to adaptively change cores for applications during runtime. This technique is used to achieve both performance and power improvements. HARD is a dynamic scheduler for heterogeneous multicore systems. It uses past thread assignments to find the best matching core for every application, saves power by downgrading applications with low resource utilization to weaker cores, and improves performance by upgrading demanding applications to stronger cores. The Adaptive Resource Partitioning Algorithm (ARPA) [19], which dynamically assigns resources to each thread according to changes in thread behavior, has also been proposed. It analyzes the resource usage efficiency of each thread in a time period and assigns more resources to threads which can use them in a more efficient way. The purpose of ARPA is to improve the efficiency of resource utilization, thereby improving overall instruction throughput.
3. THE PROPOSED METHOD
In the following subsections, the contribution of this paper will be presented.
3.1. Updating the Multicore Simulator to Enhance Cache Prediction Accuracy
The used multicore simulation tool, Multi2Sim, is adapted to cope with a dynamic design of the multicore architecture by adding a new feature for thread selection in the fetch stage and to improve the prediction accuracy. The used tool has three thread selection policy choices (three switch cases: "Shared", "TimeSlice" and "SwitchOnEvent"). In practice, the fetch policy indirectly controls the usage of all processor resources. So, it can be leveraged to avoid the starvation of concurrent threads, e.g. the monopolization of instruction queues by a thread due to long-latency load misses in the last cache levels.
Branch mispredictions and memory-level parallelism can also be taken into consideration by the fetch policy. In this paper, the fetch stage, as one of the multicore resources, is selected to implement a dynamic design for the multicore.
Since the TimeSlice fetch kind is the default choice in Multi2Sim, it is updated to implement the OLS regression feature [11] and be able to forecast L2 data cache misses from historical L1 data cache misses and actual L2 data cache misses of previous runs. To implement the above-mentioned update of the fetch stage, a new two-dimensional array is used for each thread to store L1 data cache misses and the corresponding actual L2 data cache misses, as shown in Figure 3. This array represents the samples window size parameter. The storing operation of samples is accomplished at discrete times, measured in numbers of instructions. An instruction counter is used to determine the time for storing the samples. The function that calculates the future L2 data cache misses exists at thread level and is called the regression engine [22].
Figure 3. Fetching according to predicted L2 D-Cache misses
In this study we vary the instruction count parameter, which represents the time of storing the samples, and address its effect on the prediction accuracy of the L2 data cache. The performance metric, prediction accuracy, is used to compare the predicted L2 data cache misses against the actual L2 data cache misses.
3.2. The Proposed Dynamic Fetch Policies Based on Data Cache Misses
In this section, three different fetch policies (EACH_LOOP_Fetch, INC_Fetch and WS_Fetch) are introduced. These policies depend on the data cache miss values and the Ordinary Least Square (OLS) regression statistical method. At run time, the relation between L1 and L2 cache misses is updated dynamically by calculating the OLS regression coefficients.
The differences among these three new policies rely on the calculation time of the OLS regression coefficients. The time for calculating the OLS regression coefficients can be one of the following:
1. EACH_LOOP_Fetch: at each loop, if the system can access the level-one cache.
2. INC_Fetch: at each loop, if the system can access the cache and the instruction counter has reached a certain value, defined as the sampling period. It is measured in Misses Per Kilo Instructions (MPKI).
3. WS_Fetch: at each loop, if the system can access the cache, the instruction counter equals the sampling period, and the store location reaches the maximum window size.
The calculation of the OLS regression coefficients is developed by implementing the software function called calc_ols_reg_coeff() in the simulation tool. The developed functions for the mentioned fetch policies are shown in Figure 4, Figure 5 and Figure 6.
Method one "EACH_LOOP_Fetch"

start ws_store1() //window size store
{
    wsl[current_cache][loc] = miss[current_cache];
    calc_ols_reg_coeff();
    calc_future_L2();
    loc++;
    if (loc == ws) //window size
    {
        zero_ws();
        loc = 0;
    }
}
Figure 4. "EACH_LOOP_Fetch" function in the simulation tool
Method two "INC_Fetch"

start ws_store2() //window size store
{
    calc_future_L2();
    ins_count++;
    if (ins_count == 500) //instruction count
    {
        wsl[current_cache][loc] = miss[current_cache];
        ins_count = 0;
        loc++;
        calc_ols_reg_coeff();
        if (loc == ws) //window size
        {
            zero_ws();
            loc = 0;
        }
    }
}
Figure 5. "INC_Fetch" function in the simulation tool
Method three "WS_Fetch"

start ws_store3() //window size store
{
    calc_future_L2();
    ins_count++;
    if (ins_count == 500) //instruction count
    {
        wsl[current_cache][loc] = miss[current_cache];
        ins_count = 0;
        loc++;
        if (loc == ws) //window size
        {
            calc_ols_reg_coeff();
            zero_ws();
            loc = 0;
        }
    }
}
Figure 6. "WS_Fetch" function in the simulation tool
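The three store functions above share two helpers, calc_ols_reg_coeff() and calc_future_L2(), which the paper names but does not list. The sketch below is our reconstruction of what they compute, assuming a simple one-variable OLS fit over the (L1 miss, L2 miss) sample window; variable names follow Figures 4-6, while the bodies are illustrative.

    /* Reconstruction (not the authors' code) of the regression engine.
     * wsl[0][i] = L1 D-Cache misses of sample i, wsl[1][i] = actual L2
     * misses of sample i; WS is the window size (e.g. 7).            */
    #define WS 7
    static double wsl[2][WS];
    static double coeff_a, coeff_b;   /* OLS intercept and slope */

    void calc_ols_reg_coeff(void)     /* fit y = a + b*x over the window */
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < WS; i++) {
            sx  += wsl[0][i];
            sy  += wsl[1][i];
            sxx += wsl[0][i] * wsl[0][i];
            sxy += wsl[0][i] * wsl[1][i];
        }
        double d = WS * sxx - sx * sx;
        if (d != 0) {                 /* keep old fit if window is degenerate */
            coeff_b = (WS * sxy - sx * sy) / d;
            coeff_a = (sy - coeff_b * sx) / WS;
        }
    }

    double calc_future_L2(double l1_misses)   /* predicted L2 misses */
    {
        return coeff_a + coeff_b * l1_misses;
    }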
4. DYNAMIC FETCH POLICIES BASED ON TRANSACTION METRICS WITH VARIABLE FETCH QUEUE SIZE
The simulation tool supports a set of parameters that specify how stages are organized in a multithreaded design. Stages can be shared among threads or private per thread, except the execution stage, which is shared by definition of multithreading. In this section, the shared allocated resources are used. We focus on the fetch queue as a shared resource. Our problem is how to distribute the fetch queue of a multicore among the threads in a dynamic way. The size of the fetch queue is changed dynamically among the threads during run time. The deviation from the baseline in the allocated fetch queue resource is done according to a transaction metric. This transaction metric [18], [23] is modified here. It is defined as the ratio between the number of instructions in the commit queue and the sum of the numbers of instructions in the micro operation code (opcode) queue, the fetch queue and the reorder buffer:
Transaction metric = N_commit / (N_opcode + N_fetch + N_ROB)    (1)

where N_commit, N_opcode, N_fetch and N_ROB denote the number of instructions in the commit queue, the opcode queue, the fetch queue and the reorder buffer, respectively.
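As a hypothetical illustration of equation (1): a thread holding 8 instructions in the commit queue while 10, 16 and 14 instructions sit in its opcode queue, fetch queue and reorder buffer would score 8 / (10 + 16 + 14) = 0.2; the thread with the largest such ratio is considered the most productive and becomes the active thread.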
The metric is also used to differentiate between the threads, so it can be regarded as an evaluation metric. This fetch policy is named the Maximum Transaction policy, MTRANS; it increments the allocated resources for a specific thread, namely the thread with the maximum value of the transaction metric, which should therefore receive more resources to speed up the execution of its instructions.
The implementation of this fetch policy starts by dividing the fetch queue size equally among all threads. The transaction metric is then computed periodically for each thread, after a number of cycles called an epoch. After each epoch, additional resource is granted gradually in steps: the redistribution operation deducts fetch queue entries from each thread (processing node) and adds these entries to the active thread.
The developed pseudo code for MTRANS is shown in Figure 7. In this algorithm, the active thread is the thread that has the maximum transaction value. The number of entries migrated from one thread to another is called the step. For example, if the multicore processor has two cores, each core has two threads and the fetch queue size is 64, each thread will initially take 16 entries of the fetch queue. Assume the epoch is 1000 instructions and the step is 2. After 1000 instructions, the transaction metric is calculated based on equation (1). We take 2 entries from the fetch queue share of each of the three other threads and add them to the fetch queue share of the active thread (say thread 2). In this case, the fetch queue size for thread 1, thread 3 and thread 4 will be 14, while for thread 2 it will be 22.
MTRANS // select the thread with maximum transaction, regardless of data misses

get_thrd_max_Transaction()
{
    for (i = 0; i < 4; i++) // the used multicore model consists of 2 cores, each with 2 threads
    {
        if (thrd_max_trans == i) continue;   // skip the active thread
        q_sz[i] = q_sz[i] - step_q;          // deduct entries from a donor thread
        if (q_sz[i] <= q_sz_min)
            q_sz[i] = q_sz_min;
        q_sz[thrd_max_trans] = q_sz[thrd_max_trans] + step_q;  // grant them to the active thread
        if (q_sz[thrd_max_trans] >= q_sz_max)
            q_sz[thrd_max_trans] = q_sz_max;
    }
}
Figure 7. MTRANS function
5. DYNAMIC FETCH POLICIES BASED ON TRANSACTION METRICS AND DATA CACHE MISSES WITH VARIABLE FETCH QUEUE SIZE
Two new thread selection algorithms for the fetch policy are proposed in this section. The first algorithm is the Maximum Transaction No Max Misses algorithm (MTNMM), which is used to increment the allocated resource (i.e., increase the size of the fetch queue to accommodate more instructions) for a specific thread. This thread must have the maximum value of the transaction metric and must not have the maximum number of L1 data cache misses. In this case, the thread will gain more resources to speed up the execution of its instructions. However, if it does have the maximum L1 data cache misses, then it keeps only the default assigned fetch queue size (the baseline). This means that at each epoch, the transaction metric is recalculated to select the active thread and determine the size of the fetch queue.
The concept of the second algorithm, MICRO_COUNT, is the same as MTNMM; however, the selected thread must not have the minimum opcode count in the micro opcode queue. The increase in resource allocation is granted to the active thread only if it does not hold the minimum number of opcodes in the micro opcode queue. The pseudo code of these two algorithms is stated in Figure 8 and Figure 9, while the flowcharts are depicted in Figure 10 and Figure 11.
// MTNMM
// select the thread with maximum transaction that does not have the
// maximum L1 data cache misses.

get_thrd_max_Transaction()
{
    get_thrd_max_miss();
    for (i = 0; i < 4; i++)
    {
        if (thrd_max_trans == i) continue;   // skip the active thread
        // no redistribution if the active thread has the maximum data cache misses
        if (thrd_max_trans == thrd_max_misses) continue;
        q_sz[i] = q_sz[i] - step_q;
        if (q_sz[i] <= q_sz_min)
            q_sz[i] = q_sz_min;
        q_sz[thrd_max_trans] = q_sz[thrd_max_trans] + step_q;
        if (q_sz[thrd_max_trans] >= q_sz_max)
            q_sz[thrd_max_trans] = q_sz_max;
    }
}
Figure 8. MTNMM function
//MICRO_COUNT
//select the thread that has maximum transaction and does not have the
//minimum opcode count in the micro opcode queue.

get_thrd_min_opcode() // find the thread with the minimum opcode count
{
    uopq_count(core, thread);   // get the micro opcode count for each thread
    min_micr = 0; frst = 0;
    for (i = 0; i < 4; i++)
    {
        if (frst == 0) { min_micr = micr_count[0]; thrd_min_micr = 0; frst = 1; }
        if (micr_count[i] == 0) continue;
        if (micr_count[i] < min_micr) { min_micr = micr_count[i]; thrd_min_micr = i; }
    }
}

get_thrd_max_Transaction()
{
    for (i = 0; i < 4; i++)
    {
        if (thrd_max_trans == i) continue;   // skip the active thread
        // no redistribution if the active thread has the minimum opcode count
        if (thrd_min_micr == thrd_max_trans) continue;
        q_sz[i] = q_sz[i] - step_q;
        if (q_sz[i] <= q_sz_min)
            q_sz[i] = q_sz_min;
        q_sz[thrd_max_trans] = q_sz[thrd_max_trans] + step_q;
        if (q_sz[thrd_max_trans] >= q_sz_max)
            q_sz[thrd_max_trans] = q_sz_max;
    }
}
Figure 9. MICRO_COUNT function
Figure 10. Dynamic resource allocation flowchart for the MTRANS algorithm
Figure 11. Dynamic resource allocation flowchart for the MTNMM algorithm
6. SIMULATION RESULTS
A large number of simulated experiments have been conducted using the modified simulation tool, Multi2Sim, to check the validity of the proposed approaches.
6.1. The Modified Simulation Tool
Multi2Sim is modified to incorporate more options and to implement OLS. Figure 12 and Figure 13 illustrate the effect of changing the sampling time on the relation between the actual data cache misses at level one (L1) and the predicted value of data cache misses at data cache level two (L2). The plotted quantity is the actual reading of data cache misses in DL2 minus the predicted value of data cache misses in DL2, for each window array size. The experiments have been repeated for different values of window array size and instruction count. We tried to find the values of these parameters at which the highest prediction accuracy occurs by changing one parameter while fixing the other, and vice versa. As shown in Figure 13, the highest prediction accuracy occurred at an instruction count (sampling time) of 1000 and a window array size of 7.
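Stated as a formula (our notation; the paper gives the definition in prose), the reported quantity for a window size ws is the signed prediction error

    error(ws) = M_DL2_actual − M_DL2_predicted,

so values closer to zero indicate higher prediction accuracy.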
[Residue of the Figure 10/11 flowcharts; recoverable node labels: Start; assign resources for threads uniformly; take baseline; increment counters; thread behavior analysis; epochs = max? (No/Yes); access allocated resources; metric calculation; update allocated resource; thread selection.]