TELK
OMNIKA
,
V
ol.
15,
No
.
3,
September
2017,
pp
.
1040
1047
ISSN:
1693-6930,
accredited
A
b
y
DIKTI,
Decree
No:
58/DIKTI/K
ep/2013
DOI:
10.12928/telk
omnika.v15.i3.6513
1040
Application
Pr
ofiling
and
Mapping
on
NoC-based
MPSoC
Em
ulation
Platf
orm
on
Reconfigurab
le
Logic
Jia
W
ei
T
ang
*
,
Y
uan
W
en
Hau
**
,
Nasir
Shaikh-Husin
*
,
and
Muhammad
Nadzir
Mar
sono
*
*
F
aculty
of
Electr
ical
Engineer
ing,
Univ
ersiti
T
eknologi
Mala
ysia,
Johor
,
Mala
ysia
**
IJN-UTM
Cardio
v
ascular
Engineer
ing
Centre
,
F
aculty
of
Biosciences
and
Medical
Engineer
ing,
Un
iv
ersiti
T
eknologi
Mala
ysia,
Johor
,
Mala
ysia
*
Corresponding
author
,
e-mail:
jwtang2@liv
e
.utm.m
y
Abstract
In
netw
or
k-on-chip
(NoC)
based
m
ulti-processor
system-on-chip
(MPSoC)
de
v
elopment,
applica-
tion
profiling
is
one
of
the
most
cr
ucial
step
dur
ing
design
time
to
search
and
e
xplore
optimal
mapping.
Con
v
entional
mapping
e
xplor
ation
methodologies
analyse
application-specific
g
r
aphs
b
y
estimating
its
r
un-
time
beha
viour
using
analytical
or
sim
ulation
models
.
Ho
w
e
v
er
,
t
he
f
or
mer
does
not
replicate
the
actual
application
r
un-time
perf
or
mance
while
the
latter
requires
significant
amount
of
time
f
or
e
xplor
ation.
T
o
map
applications
on
a
specific
MPSoC
platf
or
m,
the
application
beha
viour
on
cycle-accur
ate
em
ulated
platf
or
m
should
be
considered
f
or
obtaining
better
mappin
g
quality
.
This
paper
proposes
an
application
mapping
methodology
that
utiliz
es
a
MPSoC
prototyped
in
Field-Prog
r
ammab
le
Gate
Arr
a
y
(FPGA).
Applications
are
implemented
on
homogeneous
MPSoC
cores
and
their
costs
are
analysed
and
profiled
on
the
platf
or
m
in
ter
m
of
e
x
ecution
time
,
intr
a-core
comm
unication
and
inter-core
comm
unication
dela
ys
.
These
metr
ics
are
utiliz
ed
in
analytical
e
v
aluation
of
the
application
mapping.
The
prop
osed
analytical-based
mapping
is
demonstr
ated
against
the
e
xhaustiv
e
br
ute
f
orce
method.
Results
sho
w
that
the
proposed
method
is
ab
le
to
produce
quality
mappings
compared
to
the
g
round
tr
uth
solutions
b
ut
in
shor
ter
e
v
alu
ation
time
.
K
e
yw
or
ds:
Embedded
system,
application
mapping,
m
ulti-processor
system-on-chips
(MPSoCs),
through-
put
constr
aint.
Cop
yright
c
2017
Univer
sitas
Ahmad
Dahlan.
All
rights
reser
ved.
1.
Intr
oduction
Multi-Processor
System-on-Chips
(MPSoCs)
is
commonly
used
in
moder
n
embedded
systems
such
as
smar
tphones
and
tab
lets
as
w
ell
as
in
other
div
erse
fields
such
as
m
ultimedia
and
netw
or
king
applications
to
perf
or
m
time-constr
ained
tasks
.
Mapping
of
time
cr
itical
applica-
tions
is
a
cr
ucial
step
in
design
time
to
pro
vide
the
highest
possib
le
system
perf
or
mance
.
The
prob
lem
of
mapping
applications
onto
MPSoC
is
one
of
the
biggest
challenge
embedded
system
prog
r
ammers
f
ace
no
w
ada
ys
[1–3].
An
application
is
commonly
specified
as
a
data-flo
w
g
r
aph
consisting
of
tasks
that
comm
unicate
betw
een
them,
on
the
other
hand,
an
MPSoC
platf
or
m
is
descr
ibed
b
y
the
n
umber
of
cores
,
the
type
of
cores
(in
heterogeneous
platf
or
m),
and
their
com-
m
unication
topology
or
infr
astr
ucture
.
The
design
space
e
xplor
ation
process
tak
es
the
application
g
r
aph
as
input
and
assigns
all
tasks
to
processing
cores
on
the
desired
MPSoC
platf
or
m
to
find
the
best
perf
or
mance
mapping.
Methods
f
or
e
v
aluating
each
mapping
perf
or
mance
in
design
space
can
be
categor
iz
ed
into
three
types
[4]:
estimation
from
analytical
model,
measurement
from
sim
ulation
model,
and
measurement
from
implementation
prototype
.
Analytical
estimation
is
the
f
astest
e
v
alua-
tion
method
b
ut
with
limited
accur
acy
as
it
does
not
sufficiently
replicate
the
system
beha
viour
.
Sim
ulation-based
measurement
e
xists
betw
een
tw
o
e
xtremes
,
i.e
.
slo
w
with
high
accur
acy
and
f
ast
b
ut
less
accur
ate
.
On
the
contr
ar
y
,
prototype-based
e
v
aluation
pro
vides
the
highest
accur
acy
b
ut
requires
long
de
v
elopment
time
.
Receiv
ed
F
ebr
uar
y
27,
2017;
Re
vised
J
une
28,
2017;
Accepted
J
uly
21,
2017
Evaluation Warning : The document was created with Spire.PDF for Python.
TELK
OMNIKA
ISSN:
1693-6930
1041
In
most
cases
,
a
cer
tain
application
is
desired
to
be
e
x
ecuted
on
a
specific
MPSoC
plat-
f
or
m.
As
the
task
e
x
ecution
and
comm
unication
dela
ys
v
ar
y
f
or
diff
erent
MPSoC
platf
or
ms
,
map-
ping
that
solely
depends
on
the
costs
f
or
e
x
ecuting
application
specification
beha
viour
ma
y
not
be
sufficient
to
produce
high
quality
mapping
when
consider
ing
the
MPSoC
ph
ysical
interconnect
f
abr
ic.
The
beha
viour
on
the
platf
or
m
m
ust
be
considered
and
profiled
thoroughly
to
aid
applica-
tion
mapping
e
xplor
ation
process
.
Hence
,
an
application
profiled
in
a
prototype-based
platf
or
m
can
impro
v
e
profiling
p
recision
o
v
er
the
analytical-based
mapping
while
ha
ving
f
ast
e
v
aluation
speed.
This
paper
presents
a
application
mapping
methodology
that
utiliz
es
field-prog
r
ammab
le
gate
arr
a
y
(FPGA)
em
ulation
as
the
e
v
aluation
platf
or
m
to
optimiz
e
the
application
mapping
f
or
the
highest
throughput.
Applications
are
analysed
and
profiled
on
the
em
ulation
platf
or
m
to
estimate
their
implemented
costs
in
ter
m
of
e
x
ecution,
intr
a-core
comm
unication
and
inter-core
comm
u-
nication
dela
ys
.
The
profiled
application
costs
are
used
in
impro
ving
the
accur
acy
of
analytical-
based
mapping.
The
proposed
methodology
is
demonstr
ated
b
y
mapping
se
v
er
al
test
applications
on
a
3
3
meshed
netw
or
k-on-chip
(NoC)
based
MPSoC
prototyped
in
FPGA.
The
rest
of
this
paper
is
organiz
ed
as
f
ollo
ws
.
Section
2.
presents
the
related
w
or
ks
in
application
mapping.
Section
3.
discusses
the
proposed
throughput
a
w
are
application
mapping
methodology
including
the
e
x
ecution
tr
ace
analysis
,
platf
or
m
par
ameters
e
xtr
action
and
mapping
e
xplor
ation
steps
.
Section
4.
presents
the
result
of
the
proposed
mapping
approach
on
diff
erent
test
cases
and
Section
5.
concludes
this
paper
.
2.
Related
w
orks
T
ask
mapping
technique
can
be
classified
into
design
time
map
ping,
r
un-time
mapping
and
h
ybr
id
mapping.
In
design
time
mapping
[5–9],
the
mapping
process
is
perf
or
med
off-line
to
f
acilitate
better
decision
in
using
system
resources
.
Design
time
mapping
is
suitab
le
f
or
static
w
or
kload
scenar
ios
ha
ving
a
predefined
task
g
r
aph
and
st
atic
platf
or
m
(i.e
.,
both
applications
and
platf
or
m
do
not
change
in
time).
Run-time
mapping
[10–13]
on
the
contr
ar
y
,
is
ab
le
to
handle
dynamism
of
r
un-time
w
or
kload
such
as
inser
tion
or
remo
ving
ne
w
applications
into
or
from
the
system,
as
w
ell
as
the
dynamic
platf
or
m
beha
viour
due
to
r
un-time
f
aults
.
Ho
w
e
v
er
,
r
un-time
task
mapping
is
perf
or
med
using
limited
on-line
processing
resource
while
making
a
quic
k
decision
without
sufficient
mapping
e
xplor
ation,
thus
ma
y
produce
lo
w-quality
mapping.
In
addition,
h
ybr
id
mapping
has
also
been
proposed
in
ma
n
y
diff
erent
w
or
ks
[14–19]
b
y
combining
design
time
and
r
un-time
techniques
.
Hybr
id
mapping
e
xplores
best
mapping
of
tasks
dur
ing
design
time
f
or
diff
erent
scenar
ios
(constr
aints
conditions)
while
chooses
the
most
suitab
le
mapping
at
r
un-time
.
Ho
w
e
v
er
,
these
con
v
entional
approaches
in
application
mapping
assume
a
specific
w
ell-defined
prog
r
am
beha
viour
,
often
specifie
d
in
task
g
r
aph
and
does
not
consider
platf
or
m
dependent
par
ameters
,
such
as
the
changes
in
comm
unication
dela
y
and
comm
unication
v
olume
dur
ing
dynamic
mapping
e
xplor
ation.
Most
mapping
methodologie
s
in
liter
atures
e
.g.
[5–1
3]
assumed
a
specific
prog
r
am
e
x
e-
cution
while
the
comm
unication
cost
w
as
considered
constant
across
platf
or
m.
As
pr
actical
be-
ha
viour
of
applications
ma
y
diff
er
compared
to
its
specified
task
g
r
aph,
there
are
diff
erent
methods
in
liter
atures
e
xtr
acting
application
r
un-time
metr
ics
pr
ior
to
the
mapping
e
xplor
ation.
T
r
ace
anal-
ysis
technique
w
as
used
in
[20]
to
find
the
e
x
ecution
beha
vior
f
or
each
mapping
and
e
xamine
their
diff
erences
.
Singh
et
al.
[21]
used
the
application
beha
viour
from
tr
ace
analysis
to
optimally
map
m
ultiple
ap
plications
on
the
same
MPSoC
.
T
o
our
best
kno
wledge
,
there
is
no
w
or
k
that
profiles
and
maps
application
based
on
e
x
ecution,
intr
a-core
and
inter-core
comm
unicat
ion
on
an
em
ulation
platf
or
m.
3.
Pr
oposed
Application
Mapping
Methodology
The
flo
w
of
the
proposed
application
mapping
ap
proach
is
depicted
in
Figure
1.
Unlik
e
con
v
entional
mapping
methodology
which
solely
depends
on
the
costs
specified
in
application
task
g
r
aph,
the
proposed
approach
considers
the
applica
tion
pr
actical
beha
viour
to
better
estimate
Application
Profiling
and
Mapping
on
NoC-based
MPSoC
...
(Jia
W
ei
T
ang)
Evaluation Warning : The document was created with Spire.PDF for Python.
1042
ISSN:
1693-6930
the
mapping
perf
or
mance
.
The
proposed
methodology
f
or
application
mapping
are
as
f
ollo
ws:
1.
Application
Pr
ofiling
–
The
predefined
application
are
e
x
ecuted
on
the
targeted
platf
or
m
to
measure
its
pr
actical
implementation
costs
.
Application
are
profiled
in
ter
m
of
e
x
ecution
and
comm
unication
costs
in
time
dela
y
(cycles).
2.
Application
Mapping
–
Application
mapping
design
space
e
xplor
ation
is
perf
or
med
to
op-
timiz
e
o
v
er
all
throughput.
In
order
to
e
v
aluate
each
mapping
perf
or
mance
,
e
x
ecution
tr
aces
are
analysed.
An
analytical
model
is
f
or
m
ulated
based
on
the
profiled
application
costs
are
used
to
estimate
the
mapping
perf
or
mance
dur
ing
the
design
space
e
xplor
ation.
A
p
p
lic
at
ion
P
r
o
f
ilin
g
App
lic
a
tio
n
t
0
t
1
t
2
t
3
t
4
M
P
S
o
C
in
F
P
GA
Applic
a
tion
S
pe
c
if
ic
a
tion
P
la
tf
or
m
S
p
e
c
if
ic
a
tion
M
a
p
p
in
g
D
e
s
ign
S
p
ac
e
E
xp
lor
a
t
io
n
A
pplic
a
tion
P
r
o
f
ile
s
E
xe
c
ution
T
r
a
c
e
s
A
na
lys
is
e
xe
c
ution,
x
intr
a
-
c
or
e
c
o
mm.
,
c
a
inte
r
-
c
or
e
c
omm.
,
c
e
e
mula
te
C
ur
r
e
nt
M
a
ppin
g
C
o
r
e
1
:
C
o
r
e
2
:
C
o
r
e
3
:
t
i
m
e
e
xe
c.
d
e
l
a
y
x
i
c
ij
co
m
m.
d
e
l
a
y
A
na
lytic
E
va
lua
tion
B
e
s
t
M
a
p
ping?
ye
s
Ne
xt
M
a
pp
ing
no
Op
tima
l
M
a
ppin
g
Figure
1.
Proposed
application
mapping
using
application
profiled
in
FPGA-em
ulated
MPSoC
.
3.1.
Application
and
Platf
orm
Model
T
ask
g
r
aph
of
an
application
can
be
char
acter
iz
ed
b
y
a
directed
g
r
aph
(DG),
G
=
(
T
;
E
)
,
where
T
=
f
t
1
;
t
2
;
:::;
t
n
g
is
the
set
of
tasks
in
the
application
and
E
=
f
e
1
;
e
2
;
:::;
e
n
g
is
the
set
of
directed
edges
representing
dependencies
of
tasks
as
sho
wn
in
Figure
2.
Each
task,
t
i
contains
an
x
i
denoting
the
w
orst-case
e
x
ecution
time
f
or
the
task
and
is
assumed
to
remain
fix
ed
dur
ing
the
e
x
ecution.
Each
edge
,
e
k
includes
v
ij
which
represents
the
comm
unication
v
olume
betw
een
tasks
t
i
and
t
j
.
As
designing
the
entire
NoC-based
MPSoC
platf
or
m
is
out
of
scope
of
this
w
or
k,
the
emplo
y
ed
platf
or
m
is
gener
ated
using
ProNoC
tool
[22].
The
o
v
er
vie
w
of
the
NoC-based
MPSoC
platf
or
m
used
in
this
w
or
k
is
depicted
in
Figure
3.
Figure
3a
illustr
ates
the
architecture
f
or
each
core
(tile)
in
the
MPSoC
platf
or
m.
Distr
ib
uted
memor
y
architecture
is
implemented,
where
each
core
consists
of
a
processor
,
a
local
memor
y
,
and
a
netw
or
k
interf
ace
(NI).
Mapped
tasks
in
each
core
are
stored
and
e
x
ecuted
from
the
local
me
mor
y
,
while
cores
inter-comm
unicate
with
other
cores
via
netw
or
k
interf
aces
.
All
cores
are
homogeneous
and
connected
to
each
other
through
the
NoC
interconnect
f
abr
ic.
TELK
OMNIKA
V
ol.
15,
No
.
3,
September
2017
:
1040
1047
Evaluation Warning : The document was created with Spire.PDF for Python.
TELK
OMNIKA
ISSN:
1693-6930
1043
t
0
t
1
t
2
t
3
t
4
t
5
e
1
e
0
e
2
e
3
e
4
e
5
e
6
Figure
2.
Example
of
an
application
model
containing
6
tasks
.
Pr
o
ce
s
s
o
r
M
e
m
o
r
y
N
et
w
o
r
k
I
nt
er
f
ac
e
I
nt
e
r
c
o
nnec
t
C
o
r
e
0
..
..
.
Pr
oc
es
s
o
r
M
em
o
r
y
N
e
t
w
o
r
k
I
n
t
er
f
a
ce
C
o
r
e
k
(a)
Each
core
consists
of
a
processor
,
local
memor
y
,
and
netw
or
k
interf
ace
in
MPSoC
platf
or
m
model
C
o
re
0
C
o
re
1
C
o
re
2
Cor
e
3
Cor
e
4
C
o
re
5
Cor
e
6
C
o
re
7
C
o
re
8
R
R
R
R
R
R
R
R
R
(b)
3
3
mesh
based
NoC
intercon-
nect.
Figure
3.
MPSoC
platf
or
m
model
f
or
profiling,
mapping
and
em
ulating
application
in
FPGA.
Mapping
of
an
application
is
the
process
of
assigning
its
tasks
into
the
a
v
aila
b
le
cores
on
MPSoC
platf
or
m.
Cooper
ativ
e
(non
pre-emptiv
e)
m
ulti-tasking
is
emplo
y
ed
in
each
core
,
where
one
core
can
e
x
ecute
more
than
one
task.
Comm
unicating
tasks
use
message
passing
interf
ace
(MPI)
protocol
to
send
and
receiv
e
data
and
control
pac
k
ets
.
Hence
,
comm
unication
betw
een
tasks
that
are
mapped
to
the
same
core
also
require
message
cop
ying
dela
y
that
is
considered
as
intr
a-core
comm
unication.
On
the
other
hand,
inter-core
comm
unication
is
defined
as
the
comm
unication
betw
een
tasks
that
are
mapped
in
diff
erent
cores
.
3.2.
Application
Pr
ofiling
Application
profiling
is
the
process
to
e
xtr
act
pr
actical
task
e
x
ecution
and
comm
unication
dela
ys
of
an
application
when
em
ulated
in
FPGA.
The
aim
of
perf
or
ming
application
profiling
is
to
measure
the
eff
ect
of
task
e
x
ecution
and
comm
unication
on
the
o
v
er
all
system
throughput.
Ex
ecution
time
,
x
i
of
each
task
t
i
is
defined
as
the
time
required
to
complete
a
task
after
it
receiv
es
complete
data
(or
tok
en)
dur
ing
one
e
x
ecution
per
iod.
T
ask
e
x
ecution
time
of
an
application
can
be
easily
obtained
b
y
captur
ing
the
time
diff
erence
betw
een
the
star
ting
and
ending
of
a
task’
s
e
x
ecution.
As
the
e
x
ecution
time
on
an
y
homogeneous
core
is
equal
regardless
of
the
mapping,
the
maxim
um
e
x
ecution
dela
ys
of
all
tasks
can
be
recorded
b
y
mapping
all
tasks
into
one
core
.
On
the
contr
ar
y
,
comm
unication
time
,
c
ij
is
divided
into
tw
o
types
represented
b
y
a
tw
o-
tuple
<
ca;
c
e
>
,
with
ca
and
ce
denote
the
intr
a-core
and
inter-core
dela
ys
,
respectiv
ely
.
As
application
as
specified
in
a
task
g
r
aph
contains
only
comm
unication
v
olume
in
each
connecting
task,
profiling
each
connecting
core
has
to
be
done
to
measure
the
pr
actical
comm
unication
time
.
In
order
to
profile
each
intr
a-core
comm
unication
dela
y
,
the
tw
o
connecting
tasks
are
isolated
and
e
x
ecuted
in
the
same
core
.
The
intr
a-core
comm
unication
dela
y
is
measured
as
the
throughput
diff
erence
betw
een
e
x
ecuting
both
tasks
independently
without
comm
unication
and
with
com-
m
unication.
On
the
contr
ar
y
,
both
tasks
are
mapped
in
neighbour
ing
cores
to
profile
inter-core
comm
unication
dela
y
.
After
profiling,
the
application
g
r
aph
is
re-constr
ucted
with
profiled
spec-
ification
as
illustr
ated
in
Figure
4.
Although
the
actual
dela
ys
of
the
applications
ma
y
diff
er
f
or
diff
erent
mappings
(ma
ybe
due
to
the
m
ulti-task
o
v
erhead
or
comm
unication
infr
astr
ucture
de-
Application
Profiling
and
Mapping
on
NoC-based
MPSoC
...
(Jia
W
ei
T
ang)
Evaluation Warning : The document was created with Spire.PDF for Python.
1044
ISSN:
1693-6930
la
y
such
as
contention),
the
profiled
costs
can
be
used
as
the
basis
to
estimate
the
application
r
un-time
beha
viour
.
x
0
x
1
c
0
1
=
<
c
a,
ce
>
0
1
t
0
t
1
e
01
profil
ing
Figure
4.
Application
g
r
aph
re-constr
uction
after
profiling
e
x
ecution
and
comm
unication
time
on
em
ulated
FPGA.
3.3.
Mapping
Design
Space
Exploration
After
application
profiling,
analytical-based
mapping
utiliz
es
the
profiled
costs
to
optimiz
e
the
applicat
ion
throughput.
Ex
ecution
tr
aces
of
each
mapping
is
analysed
to
find
the
estimation
of
application
r
un-time
beha
viour
.
Each
mapping
is
e
v
aluated
using
analytical
e
xpression
to
cal-
culate
the
throughput
using
the
profiled
costs
.
Mapping
e
xplor
ation
process
is
iter
ativ
e
until
the
best
mapping
is
obtained.
3.3.1.
Ex
ecution
T
races
Anal
ysis
Ex
ecution
tr
aces
of
a
mapped
applications
are
analysed
to
obtain
the
application
through-
put.
T
ask
par
allelism
and
tempor
al
par
allelism
are
considered
in
the
e
x
ecution
tr
aces
analysis
.
The
f
or
mer
is
the
ability
to
allo
w
par
allel
e
x
ecution
of
diff
erent
tasks
while
the
latter
allo
ws
par
allel
e
x
ecution
of
diff
erent
tempor
al
data
in
diff
erent
cores
at
the
same
time
.
F
or
instance
,
Figure
5
depicts
the
e
x
ecution
tr
aces
of
an
application
ha
ving
6
tasks
,
as
sho
wn
in
Figure
2,
mapped
onto
a
4-core
system.
All
4
cores
are
comm
unicating
with
each
other
while
e
x
ecuting
their
assigned
tasks
in
par
allel
as
soon
as
their
data
are
a
v
ailab
le
.
T
ask
par
allelism
occurs
f
or
task
t
1
and
t
2
,
taking
x
1
and
x
2
dela
ys
to
be
e
x
ecuted
in
core
1
and
core
2,
respectiv
ely
.
Data
par
allelism
is
demonstr
ated
through
the
e
x
ecution
of
diff
erent
tempor
al
data
(e
.g
data
1
and
data
2)
at
diff
erent
time
instances
.
As
core
4
requires
the
longest
time
to
compute
one
result,
the
per
iod
(in
v
erse
of
throughput)
is
deter
mined
b
y
the
combination
of
e
x
ecution
dela
y
and
comm
unication
dela
y
.
C
o
r
e
1
:
C
o
r
e
2
:
C
o
r
e
3
:
C
o
r
e
4
:
x
0
c
0
1
x
1
c
0
2
x
2
c
0
2
c
1
3
c
2
3
c
1
3
c
2
3
c
2
4
x
3
c
3
5
c
2
4
x
4
c
3
5
x
5
x
0
c
0
1
x
1
c
0
2
x
2
c
0
2
c
1
3
c
2
3
c
1
3
c
2
3
c
2
4
x
3
c
3
5
c
2
4
x
4
c
3
5
x
5
x
0
c
0
1
x
1
c
0
2
x
2
c
0
2
c
1
3
c
2
3
c
1
3
c
2
3
c
2
4
x
3
c
3
5
c
2
4
x
4
c
3
5
e
5
e
xe
c. d
e
la
y
fo
r
ta
s
k i
d
a
ta
1
d
a
ta
2
d
a
ta
3
x
i
co
m
m
. d
e
l
a
y fo
r
ta
sk i to
ta
sk
j
c
i
j
t
i
m
e
p
e
r
i
o
d
p
e
r
i
o
d
p
e
ri
o
d
Figure
5.
Ex
ecution
tr
aces
of
a
6-tasks
applications
r
unning
on
4
cores
system.
Application
throughput
is
reciprocal
of
one
per
iod
(i.e
,
the
a
v
er
age
time
needed
to
com-
plete
one
iter
ation
of
the
application).
Each
core
utiliz
es
the
time
per
iod
to
e
x
ecute
its
assigned
task
as
w
ell
as
to
comm
unicate
with
other
cores
.
Since
the
tasks
in
each
application
can
be
e
x-
ecuted
in
pipeline
and
task-par
allel
in
diff
erent
cores
,
the
o
v
er
all
throughput
is
deter
mined
b
y
the
slo
w
est
core
(i.e
,
the
core
that
requires
the
longest
time
to
complete
e
x
ecutions
and
comm
unica-
tions
of
all
its
mapped
tasks).
TELK
OMNIKA
V
ol.
15,
No
.
3,
September
2017
:
1040
1047
Evaluation Warning : The document was created with Spire.PDF for Python.
TELK
OMNIKA
ISSN:
1693-6930
1045
3.3.2.
Anal
ytical
Ev
aluation
Giv
en
a
g
r
aph
representation
of
an
application
with
N
t
tasks
,
mapping
e
xplor
ation
in
this
w
or
k
is
the
process
of
assigning
all
tasks
into
a
v
ailab
le
N
c
cores
to
maximiz
e
the
system
throughput.
Based
on
the
e
x
ecution
tr
ace
ana
lys
i
s
,
throughput
of
a
mapped
application
is
the
reciprocal
of
per
iod.
The
per
iod
is
defined
as
the
maxim
um
total
required
time
(among
all
cores)
f
or
e
x
ecution
and
comm
unication,
and
can
be
mathematically
f
or
m
ulated
in
Equation
1.
The
total
required
time
in
each
core
is
calculated
b
y
the
summation
of
the
profiled
e
x
ecution
dela
ys
of
all
individual
assigned
tasks
as
w
ell
as
all
the
profiled
intr
a-core
and
inter-core
comm
unication
dela
ys
associated
to
it.
1
thr
oug
hput
=
per
iod
=
max
1
k
N
C
(
X
k
+
C
k
)
=
max
1
k
N
C
(
N
T
X
i
=0
x
i
:mt
ik
+
N
T
X
i
=0
N
T
X
j
=0
ca
ij
:ma
ij
k
+
N
T
X
i
=0
N
T
X
j
=0
ce
ij
:me
ij
k
)
(1)
where
i
and
j
are
task
identifiers
while
k
is
the
core
identifier
.
X
k
denotes
the
total
e
x
ecution
dela
ys
of
mapped
tasks
in
core
k
,
which
is
the
summation
of
the
profiled
e
x
ecution
dela
y
x
i
if
task
t
i
is
mapped
to
core
k
(
mt
ik
=
1
).
C
k
denotes
the
total
comm
unication
dela
ys
associated
to
core
k
,
which
is
the
summation
of
all
profiled
intr
a-core
comm
unication
dela
y
ca
ij
if
intr
a-core
comm
u-
nication
of
task
t
i
and
t
j
e
xist
in
core
k
(
ma
ij
k
=
1
),
and
all
profiled
inter-core
comm
unication
dela
y
ce
ij
if
inter-core
comm
unication
of
task
t
i
and
t
j
is
associated
to
core
k
(
me
ij
k
=
1
).
4.
Results
and
Discussion
The
proposed
analytical-based
mapping
is
demonstr
ated
b
y
emplo
ying
the
e
xhaustiv
e
br
ute
f
orce
method
to
map
se
v
er
al
r
andom
application
g
r
aphs
and
find
the
mapping
that
produces
the
best
application
throughput.
The
best
mapping
obtained
b
y
the
proposed
analytical-based
e
xhaustiv
e
(AE)
are
compared
with
em
ulation-based
e
xhaustiv
e
(EE)
solutions
(i.e
.,
g
round
tr
uth)
which
is
obtained
b
y
e
x
ecuting
all
mappings
on
a
FPGA
em
ulation
system.
Figure
6a
sho
ws
the
throughput
f
or
all
possib
le
mapping
f
or
the
proposed
analytical-
based
applicat
ion
mapping
and
em
ulation-based
mapping.
The
proposed
analytical-based
map-
ping
produces
results
with
similar
trend
with
the
g
round
tr
uth
solution.
Figure
6b
depicts
the
compar
ison
betw
een
em
ulation-based
throughput
and
analytical-based
solution.
The
proposed
analytical-based
mapping
throughput
is
almost
similar
with
em
ulation-based
throughput
as
the
points
scatter
nearb
y
the
str
aight
line
of
g
r
adient
one
.
T
ab
le
1
illustr
ates
the
compar
ison
betw
een
proposed
analytical-based
and
em
ulation-
based
e
xhaustiv
e
(g
round
tr
uth)
f
or
application
of
diff
erent
r
andom
g
r
aphs
.
The
best
throughput
(BT)
is
obtained
b
y
em
ulating
the
best
mapping
produced
b
y
both
methods
on
FPGA.
The
e
v
al-
uation
time
(ET)
is
the
time
tak
en
to
perf
or
m
the
e
xhaustiv
e
e
v
aluation
on
all
possib
le
mappings
while
profiling
time
is
the
time
tak
en
to
profile
the
tasks
based
on
the
proposed
method.
Results
illustr
ate
that
the
e
v
aluation
time
f
or
both
e
xhaustiv
e
mapping
increases
e
xponentially
with
the
n
umber
of
possib
le
mappings
.
The
proposed
analytical-based
method
is
ab
le
to
e
v
aluates
all
mappings
in
m
uch
shor
ter
time
despite
requir
ing
e
xtr
a
profiling
time
.
Ho
w
e
v
er
,
the
profiling
time
does
not
increase
with
n
umber
of
mappings
as
it
corresponds
to
the
n
umber
of
profiles
required,
that
is
the
total
n
umber
of
tasks
and
edges
in
application
g
r
aphs
.
5.
Conc
lusion
Efficient
mapping
algor
ithm
produces
high
quality
solution
in
f
ast
e
v
aluation
time
.
This
paper
presents
a
application
mapping
methodology
that
utiliz
es
FPGA
em
ula
tion
as
an
e
v
aluation
platf
or
m
to
optimiz
e
the
application
throughput.
Applications
are
analysed
and
profiled
in
the
Application
Profiling
and
Mapping
on
NoC-based
MPSoC
...
(Jia
W
ei
T
ang)
Evaluation Warning : The document was created with Spire.PDF for Python.
1046
ISSN:
1693-6930
0
50
100
150
200
250
300
100000
200000
300000
400000
500000
mappings (sorted with EE)
1/throughput (cycles)
AE
EE
(a)
Throughput
of
all
possib
le
mapping
f
or
the
proposed
analytical-based
e
xhaustiv
e
(AE)
and
em
ulation-based
e
x-
haustiv
e
(EE)
mapping
(sor
ted
based
on
EE).
100000
200000
300000
400000
500000
150000
200000
250000
300000
350000
400000
450000
500000
EE 1/throughput (cycles)
AE 1/throughput (cycles)
(b)
Em
ulation-based
v
ersus
analytical-based
e
xhaustiv
e
throughput.
Figure
6.
Throughput
compar
ison
betw
een
em
ulation-based
e
xhaustiv
e
(EE)
and
proposed
analytical-based
e
xhaustiv
e
(AE)
mapping
in
design
space
e
xplor
ation
of
a
7-task
application.
T
ab
le
1.
Compar
ison
in
throughput
and
e
v
aluation
time
betw
een
proposed
analytical-based
(AE)
and
em
ulation-based
e
xhaustiv
e
(EE)
f
or
application
of
diff
erent
r
andom
g
r
aphs
Applications
No
.
of
EE
Proposed
AE
T
asks
BT
ET
BT
PT
ET
r
andom1(r7)
5
87
607.61
s
87
252.21
s
0.03
s
r
andom2(r8)
6
401
2912.33
s
401
296.10
s
0.15
s
r
andom4(r2)
7
106
15745.43
s
106
309.34
s
0.96
s
r
andom5(r4)
7
875
16143.52
s
875
276.82
s
0.91
s
BT
:
Best
Throughput
ET
:
Ev
aluation
Time
PT
:
Profiling
Time
em
ulated
en
vironment
to
obtain
their
pr
actical
implemented
costs
in
ter
m
of
e
x
ecution,
intr
a-core
comm
unication
and
inter-core
comm
unication
dela
ys
.
Application
mapping
utiliz
es
the
profiled
costs
find
optimal
mapping
through
e
x
ecution
tr
ace
analysis
and
analytical
e
v
aluation.
Results
sho
w
that
the
proposed
analytical-based
e
xhaustiv
e
based
on
the
pro
filing
can
produce
similar
best
throughput
compared
to
the
g
round
tr
uth
solutions
b
ut
in
shor
ter
e
v
aluation
time
.
Ac
kno
wledg
ement
This
w
or
k
is
suppor
ted
in
par
t
b
y
the
Collabor
ativ
e
Research
in
Engineer
ing,
Science
&
T
echnology
(CREST)
g
r
ant
P17C114
(UTM
v
ote
no
.
4B176)
and
Univ
ersiti
T
eknologi
Mala
ysia
Matching
g
r
ant
(UTM
v
ote
no
.
00M75).
Ref
erences
[1]
R.
Marculescu,
U
.
Y
.
Og
r
as
,
L.-S
.
P
eh,
N.
E.
Jerger
,
and
Y
.
Hosk
ote
,
“Outstanding
research
prob
lems
in
noc
design:
system,
microarchitecture
,
and
circuit
perspectiv
es
,
”
IEEE
T
r
ansac-
tions
on
Computer-Aided
Design
of
Integ
r
ated
Circuits
and
Systems
,
v
ol.
28,
no
.
1,
pp
.
3–21,
2009.
[2]
A.
K.
Singh,
M.
Shafique
,
A.
K
umar
,
and
J
.
Henk
el,
“Mapping
on
m
ulti/man
y-core
systems:
sur
v
e
y
of
current
and
emerging
trends
,
”
in
Proceedings
of
the
50th
Ann
ual
Design
A
utomation
Conf
erence
.
A
CM,
2013,
p
.
1.
[3]
P
.
K.
Sahu
and
S
.
Chattopadh
y
a
y
,
“A
sur
v
e
y
on
application
mapping
str
ategies
f
or
netw
or
k-
on-chip
design,
”
Jour
nal
of
Systems
Architecture
,
v
ol.
59,
no
.
1,
pp
.
60–76,
2013.
[4]
R.
Piscitelli
and
A.
D
.
Pimentel,
“Design
space
pr
uning
through
h
ybr
id
analysis
in
system-
le
v
el
design
space
e
xplor
ation,
”
in
Design,
A
uto
mation
&
T
est
in
Europe
Conf
erence
&
Exhi-
TELK
OMNIKA
V
ol.
15,
No
.
3,
September
2017
:
1040
1047
Evaluation Warning : The document was created with Spire.PDF for Python.
TELK
OMNIKA
ISSN:
1693-6930
1047
bition
(D
A
TE),
2012
.
IEEE,
2012,
pp
.
781–786.
[5]
J
.
Hu
and
R.
Marculescu,
“Energy-and
perf
or
mance-a
w
are
mapping
f
or
regular
noc
archi-
tectures
,
”
IEEE
T
r
ansactions
on
computer-aided
design
of
integ
r
ated
circuits
and
systems
,
v
ol.
24,
no
.
4,
pp
.
551–562,
2005.
[6]
K.
Sr
iniv
asan,
K.
S
.
Chatha,
and
G.
K
onje
v
od,
“Linear-prog
r
amming-based
techniques
f
or
synthesis
of
netw
or
k-on-chip
architectures
,
”
IEEE
T
r
ansactions
on
V
er
y
Large
Scale
Integ
r
a-
tion
(VLSI)
Systems
,
v
ol.
14,
no
.
4,
pp
.
407–420,
2006.
[7]
S
.
T
osun,
“Cluster-based
application
mapping
method
f
or
netw
or
k-on-chip
,
”
Adv
ances
in
En-
gineer
ing
Softw
are
,
v
ol.
42,
no
.
10,
pp
.
868–874,
2011.
[8]
P
.
K.
Sahu,
T
.
Shah,
K.
Manna,
and
S
.
Chattopadh
y
a
y
,
“Application
mapping
onto
mesh-
based
netw
or
k-on-chip
using
discrete
par
ticle
s
w
ar
m
optimization,
”
IEEE
T
r
ansactions
on
V
er
y
Large
Scale
Integ
r
ation
(VLSI)
Systems
,
v
ol.
22,
no
.
2,
pp
.
300–312,
2014.
[9]
Y
.
Z.
T
ei,
Y
.
W
.
Hau,
N.
Shaikh-Husin,
and
M.
N.
Marsono
,
“Netw
or
k
par
titioning
domain
kno
wledge
m
ultiobjectiv
e
application
mapping
f
or
large-scale
netw
or
k-on-chip
,
”
Applied
Com-
putational
Intelligence
and
Soft
Computing
,
v
ol.
2014,
p
.
9,
2014.
[10]
E.
L.
de
Souza
Car
v
alho
,
N.
L.
V
.
Calazans
,
and
F
.
G.
Mor
aes
,
“Dynamic
task
mapping
f
or
mpsocs
,
”
IEEE
Design
&
T
est
of
Computers
,
v
ol.
27,
no
.
5,
pp
.
26–35,
2010.
[11]
A.
K.
Singh,
T
.
Sr
ikanthan
,
A.
K
umar
,
and
W
.
Jigang,
“Comm
unication-a
w
are
heur
istics
f
or
r
un-time
task
mapping
on
noc-based
mpsoc
platf
or
ms
,
”
Jour
nal
of
Systems
Architecture
,
v
ol.
56,
no
.
7,
pp
.
242–255,
2010.
[12]
M.
F
attah,
A.-M.
Rahmani,
T
.
C
.
Xu,
A.
Kandur
i,
P
.
Liljeberg,
J
.
Plosila,
and
H.
T
enhunen,
“Mix
ed-cr
iticality
r
un-time
task
mapping
f
or
noc-based
man
y-core
systems
,
”
in
2014
22nd
Euromicro
Inter
national
Conf
erence
on
P
ar
allel,
Distr
ib
uted
and
Netw
or
k-Based
Processing
(PDP)
.
IEEE,
2014,
pp
.
458–465.
[13]
T
.
D
.
Ngo
,
K.
J
.
Mar
tin,
and
J
.-P
.
Diguet,
“Mo
v
e
based
algor
ithm
f
or
r
untime
mapping
of
dataflo
w
actors
on
heterogeneous
mpsocs
,
”
Jour
nal
of
Signal
Processing
Systems
,
pp
.
1–
18,
2015.
[14]
S
.
Stuijk,
M.
Geilen,
and
T
.
Basten,
“A
predictab
le
m
ultiprocessor
design
flo
w
f
or
streaming
applications
with
dynamic
beha
viour
,
”
in
2010
13th
Euromicro
Conf
erence
on
Digital
System
Design:
Architectures
,
Methods
and
T
ools
(DSD)
.
IEEE,
2010,
pp
.
548–555.
[15]
L.
Schor
,
I.
Baciv
aro
v
,
D
.
Rai,
H.
Y
ang,
S
.-H.
Kang,
and
L.
Thiele
,
“Scenar
io-based
design
flo
w
f
or
mapping
streaming
applications
onto
on-chip
man
y-core
systems
,
”
in
Proceedings
of
the
2012
inter
national
conf
erence
on
Compilers
,
architectures
and
synthesis
f
or
embedded
systems
.
A
CM,
2012,
pp
.
71–80.
[16]
C
.
Lee
,
S
.
Kim,
and
S
.
Ha,
“Efficient
r
un-time
resource
management
of
a
man
ycore
accel-
er
ator
f
or
stream-based
applications
,
”
in
2013
IEEE
11th
symposium
on
Embedded
systems
f
or
real-time
m
ultimedia
(ESTIMedia)
.
IEEE,
2013,
pp
.
51–60.
[17]
W
.
Quan
and
A.
D
.
Pimentel,
“A
scenar
io-based
r
un-time
task
mapping
algor
ithm
f
o
r
mpsocs
,
”
in
Proceedings
of
the
50th
Ann
ual
Design
A
utomation
Conf
erence
.
A
CM,
2013,
p
.
131.
[18]
A.
K.
Singh,
A.
K
umar
,
and
T
.
Sr
ikanthan,
“Acceler
ating
throughput-a
w
are
r
untime
mapping
f
or
heterogeneous
mpsocs
,
”
A
CM
T
r
ansactions
on
Design
A
utomation
of
Electronic
Systems
(T
OD
AES)
,
v
ol.
18,
no
.
1,
p
.
9,
2013.
[19]
W
.
Quan
and
A.
D
.
Pimentel,
“A
h
ybr
id
task
mapping
algor
ithm
f
or
heterogeneous
mpsocs
,
”
A
CM
T
r
ansactions
on
Embedded
Computing
Systems
(TECS)
,
v
ol.
14,
no
.
1,
p
.
14,
2015.
[20]
A.
Goens
and
J
.
Castr
illon,
“Analysis
of
process
tr
aces
f
or
mapping
dynamic
kpn
applications
to
mpsocs
,
”
in
IFIP
Inter
national
Embedded
Systems
Symposium
(IESS)
,
2015.
[21]
A.
K.
Singh,
M.
Shafique
,
A.
K
umar
,
and
J
.
Henk
el,
“Resource
and
throughput
a
w
are
e
x-
ecution
tr
ace
analysis
f
or
efficient
r
un-time
mapping
on
mpsocs
,
”
IEEE
T
r
ansactions
on
Computer-Aided
Design
of
Integ
r
ated
Circuits
and
Systems
,
v
ol.
35,
no
.
1,
pp
.
72–85,
2016.
[22]
ProNoC.
(2016)
Prototype
NoC
based
MPSoC
.
Opencore
.
[Online].
A
v
ailab
le:
http://opencores
.org/project,an-fpga-implementation
-of-lo
w-latency-
noc-based-mpsoc
Application
Profiling
and
Mapping
on
NoC-based
MPSoC
...
(Jia
W
ei
T
ang)
Evaluation Warning : The document was created with Spire.PDF for Python.