Indonesian Journal of Electrical Engineering and Computer Science
Vol. 10, No. 3, June 2018, pp. 1036~1044
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v10.i3.pp1036-1044
Journal homepage: http://iaescore.com/journals/index.php/ijeecs
Fine-grained Overhead Characterisation of Cross-ISA DBTO for Multicore Processor
Joo-On Ooi1, Fawnizu Azmadi Hussin2, Mohd. Nordin Zakaria3
1 Department of Computer and Communication Technology, Universiti Tunku Abdul Rahman, Malaysia
2 Department of Electrical and Electronic Engineering, Universiti Teknologi Petronas, Malaysia
3 Department of Computer Information System, Universiti Teknologi Petronas, Malaysia
Article Info

Article history:
Received Jan 10, 2018
Revised Mar 16, 2018
Accepted Apr 2, 2018

ABSTRACT
Modern portable software has begun to behave as hybrid combinations of short- and long-running applications, in which an active app may invoke others to fulfil task requirements. The implementation of Dynamic Binary Translation and Optimisation (DBTO) on a heterogeneous multicore system-on-chip (SoC) therefore requires careful re-study to ensure efficient usage of most available cores. To improve efficiency in supporting this Instruction Set Architecture (ISA) diversity of computing platforms, mixed-mode static and dynamic binary translation and optimization systems, or DBTO, need to utilize concurrent compilation techniques to better service the processing of combined applications. This research dives into finer-grained DBTO overhead analysis, to provide categorization and characterization of overhead sources in breakdown stages during concurrent instruction processing. A dual engine for translation and optimization is constructed for finer management of start-up overheads. Helper functions, i.e. LoadLink/StoreConditional (LL/SC), are derived from atomic instructions to create multiple helper threads supported by multiple host cores, so that instruction translation and optimization operations run concurrently. Our experiment platform, evaluated with the PARSEC-3.0 benchmark suite, shows performance improvement approaching 2.0x for apps-based programs and 1.25x for kernel-based programs, for x86 to x86-64 emulation. This technique possesses great potential and serves as a research platform for future binary translation technique development, including adaptive methods.
Keywords:
Binary optimisation
Binary translation
Multicore
Multi-ISA processor
Multithreaded
Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Joo-On Ooi,
Department of Computer and Communication Technology,
Universiti Tunku Abdul Rahman,
Jalan Universiti, Bandar Barat, 31900 Kampar, Malaysia.
Email: ooijo@utar.edu.my
1. INTRODUCTION
Dynamic Binary Translation (DBT) has been commonly used in cross-ISA process virtual machines [20] to enable system or application migration from one ISA to another [17]. One popular application of DBT is the Android emulator, which uses QEMU [1] to develop ARM-based code running on an x86 machine [16, 17]. This fast emulation technique dynamically translates guest executables (e.g. ARM binaries) to native instructions on the host machine (e.g. an x86 server) and stores the translated native code in cache memory to avoid re-translation [3, 8]. The translated code runs many times faster than the traditional interpretation approach, and it can be further optimized through the Dynamic Binary Optimization (DBO) process, in which redundant and superfluous instructions are eliminated to reduce code size for faster code processing [2, 10]. However, further speeding up the emulation by generating highly optimized code has become ever more challenging, since optimizations require longer translation time, which is a portion of the
system runtime, and can potentially produce unreliable code if the optimizations are not carefully tested [12]. The performance of a DBT-DBO emulator is largely determined by turn-around time, to which execution time is a main contributor. These overheads are non-negligible during the dynamic process of transforming one piece of code into another, which causes the emulated program or system to pause progress momentarily [9]. Such overheads directly impact the overall system performance. Overall performance is not the only important metric, however, because overheads from start-up or reactive code translation become significant compared to the relatively smaller overall overheads once an application has been executed.
Past experiments done by researchers have shown that a typical DBT with DBO emulation process goes through a series of common primitives regardless of the level at which it operates [5, 18, 20]. Some of the functions performed by the optimizer include the following, not necessarily in this sequence: (i) code profiling in order to detect hot regions, (ii) building regions, (iii) decoding instructions, (iv) optimizing regions, (v) code scheduling, (vi) code caching, etc. [2, 10]. In this research, the fast code translation of QEMU and the rich optimizations possessed by LLVM are combined into a single hybrid translator-optimizer system, known as Dual-Engine DBTO, which is capable of handling multi-ISA guest code for multi-ISA host processing, and produces good translated code quality with relatively low translation overhead [20]. Furthermore, this paper looks into overhead characterization at a finer-grained level, focusing on the delay time of specific parameters incurred during the activation of multiple helper threads by LL/SC instruction calls, which aim to assist the simultaneous binary translation and optimization processing of this newly constructed DBTO system.
Through reviewing the overhead process, a set of formulae for code transition overheads is derived, and these formulae are validated through simulation experiments using the constructed DBTO system. Thus the main contributions of this research work are as follows:
a. A detailed analysis of DBTO overheads, including classification, characterization and formulae derivation involving the related influencing parameters.
b. A multi-threaded retargetable DBTO on a multicore processor, capable of simultaneous binary translation and optimisation, developed for hypothesis validation.
c. A novel fine-grained overhead characterization, with formula derivation, of the overhead caused by the creation of multiple helper threads through LoadLink and StoreCond instructions.
d. An experiment framework to validate the proposed fine-grained overhead characterization of the DBTO system supported by multiple helper threads.
This research also intends to explore the possibility of other method(s) to reduce the translation and optimization overheads incurred, as well as to provide a useful platform for researchers, hopefully leading to a useful development tool or experimental prototype for embedded applications, for instance Internet of Things devices.
2. THEORETICAL BASIS
2.1 Dynamic Binary Translation and Optimisation operation
During DBT, in order to achieve guest-to-host binary translation, the DBT system uses a globally shared code cache, so that all executing threads share a single code cache and each guest block has only a single translated copy in the shared code cache [19]. All the threads maintain one directory that records the mapping from a guest code block to its corresponding translated host code region. An execution thread initially looks up the directory to locate the translated code region. If the region is not found, the thread activates the Tiny Code Generator (TCG) to translate the untranslated guest code block. As all the execution threads share the code cache and the mapping directory, QEMU uses a critical section to serialise all accesses to the shared structures, as shown in Figure 1 [15]. Typically, TCG is meant to be a lightweight optimizer, which provides an ideal platform for emulating short-running applications with few hot blocks, such as during the booting of an operating system. The problem observed with lower-quality translated code during cross-ISA binary translation has encouraged the exploration and development of an additional translation process, commonly known as Dynamic Binary Optimization, in which several "heavy" optimization passes are employed to improve the quality of the translated code [10]. The hybrid type of DBT with DBO combines the advantages of faster guest code translation with the potential for code optimization, to yield reduced-size binary code for faster host machine execution.
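The lookup-then-translate flow around the shared code cache can be sketched as follows. This is a minimal illustration under stated assumptions, not QEMU's actual data structures: the class `SharedCodeCache`, the method `lookup_or_translate` and the stand-in translator `fake_translate` are hypothetical names, and a single Python lock plays the role of the critical section.

```python
import threading


class SharedCodeCache:
    """Toy model of a globally shared translation cache (not QEMU's structures)."""

    def __init__(self, translate):
        self._translate = translate      # hypothetical stand-in for the TCG translator
        self._directory = {}             # guest block address -> translated host code
        self._lock = threading.Lock()    # critical section guarding the shared structures
        self.translations = 0            # how many times the translator was invoked

    def lookup_or_translate(self, guest_pc):
        # All accesses to the directory and cache are serialised by one lock.
        with self._lock:
            host_code = self._directory.get(guest_pc)
            if host_code is None:
                # Miss: translate the guest block once and record the mapping.
                host_code = self._translate(guest_pc)
                self._directory[guest_pc] = host_code
                self.translations += 1
            return host_code


def fake_translate(guest_pc):
    # Placeholder "translation"; a real translator would emit host instructions.
    return f"host_code_for_{guest_pc:#x}"


cache = SharedCodeCache(fake_translate)
workers = [
    threading.Thread(target=cache.lookup_or_translate, args=(0x1000,))
    for _ in range(8)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Eight threads request the same guest block, yet the block is translated only once, which mirrors the property that each guest block has a single translated copy. The price is that every lookup passes through the same lock.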
The translation and optimisation in DBTO operation aim to convert code from one format to another, which causes various stages of overheads, mainly classified into native and instrumentation overheads [11]. Native overheads are mainly due to start-up and re-active overheads. Start-up overheads are the overheads incurred until the system reaches steady state, where the vast majority of the executed code comes from the translated and optimized code cache. Re-active overheads are caused by re-translation and re-optimisation of regions of code that have been evicted from the translation cache, particularly in the case of a
multiprogrammed environment with a shared translation cache [20]. All these overheads can cause significant slowdown to DBT processing, as shown in Figure 2; thus, reducing the total DBT and DBO overheads is of utmost importance [9].
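The distinction between start-up and re-active overheads can be illustrated by replaying a trace of guest blocks through a bounded translation cache. This is a sketch under assumptions: the least-recently-used eviction policy, the function name `count_translations` and the block trace are illustrative choices, not taken from the systems cited above.

```python
from collections import OrderedDict


def count_translations(trace, capacity):
    """Replay a guest block trace through a bounded, LRU-evicting translation cache.

    Returns (cold, retranslated): first-time translations versus translations
    of blocks that were translated before but have since been evicted.
    """
    cache = OrderedDict()   # guest block id -> True, ordered by recency of use
    seen = set()            # blocks that have been translated at least once
    cold = 0
    retranslated = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)      # hit: the translated code is reused
            continue
        if block in seen:
            retranslated += 1             # re-active overhead: block was evicted
        else:
            cold += 1                     # start-up overhead: block is new
            seen.add(block)
        cache[block] = True
        if len(cache) > capacity:
            cache.popitem(last=False)     # evict the least recently used block
    return cold, retranslated


trace = ["A", "B", "C", "A", "B", "C", "A", "B", "C"]
print(count_translations(trace, capacity=3))   # cache holds the whole working set
print(count_translations(trace, capacity=2))   # cache too small: blocks keep returning
```

With room for all three blocks, only the three start-up translations occur. With room for two, every later access finds its block evicted and pays a re-translation, which is the re-active overhead that a shared cache in a multiprogrammed environment is prone to.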
Figure 1. Dynamic binary translation operation, (a) overview block, (b) DBTO overview block diagram
Figure 2. Slowdown analysis of DBT, in breakdown of fine-grained overheads [3]
Figure 3. Breakdown of binary translation total overheads [32]
Figure 4. Equation for various fine-grained overheads [9]
2.2 Overhead Characterisation of DBTO
The overhead of Dynamic Binary Translation and Optimization can be characterized by different fine-grained categories, namely initialization, cold code translation, profiling and hot trace building, and translated code execution, which are described below. Initialisation overhead is the overhead incurred during loading of the DBT system into memory [11]. The overhead is measured by executing the native code immediately after the DBT initialisation. The overhead formula as proposed by [9] is shown in Equation (1). Typical initialisation overhead is around 0.2%. Cold Code Translation is the metric associated with newly encountered or not-yet-translated code, which is to reside in the code cache. At the same time, the DBT also updates the previously translated blocks to branch into this newly translated code.
A DBT configuration which utilises persistent traces, known as DBTPTraces, removes the profiling and hot trace building overhead, but still suffers from cold code translation overhead. To eliminate this cold code translation overhead, a newer configuration with the addition of code patching, known as DBTPTrace+CP (Code Patching), is constructed. This configuration loads the persistent hot traces, and the native code is patched with instructions to jump to the loaded traces; the patched code is then executed, thus eliminating the cold code translation process. The measurement for this overhead is shown in Equation (2), with 11.91% as a typical value.

Profiling and Hot Trace Building: The application execution can be accelerated by optimizing frequently executed code, known as hot code, by means of a detection mechanism using runtime information, commonly a profiling technique to obtain hot traces, with the measurement as shown in Equation (4). This process involves two types of overheads, namely Profiling Instrumentation and Profiling Execution. Profiling Instrumentation overhead is the time required to perform the instrumentation, or tool infrastructure setup, in preparation for profiling the code translation process, whereas Profiling Execution overhead is the time spent executing the profiling instructions [9].
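Hot code detection by profiling can be sketched with per-block execution counters. This is a minimal illustration, assuming a simple counter-and-threshold scheme: the threshold value, the function name `detect_hot_blocks` and the block names are hypothetical, and real DBT systems may use different detection policies.

```python
from collections import Counter

HOT_THRESHOLD = 3  # assumed value for illustration; real systems tune this


def detect_hot_blocks(executed_blocks, threshold=HOT_THRESHOLD):
    """Return blocks in the order in which they reach the execution-count threshold."""
    counts = Counter()
    hot = []
    for block in executed_blocks:
        counts[block] += 1          # profiling execution: one counter update per run
        if counts[block] == threshold:
            hot.append(block)       # block becomes a candidate for hot trace building
    return hot


trace = ["init", "loop", "loop", "body", "loop", "body", "body", "exit"]
print(detect_hot_blocks(trace))
```

Setting up the counters corresponds to the Profiling Instrumentation overhead, while each counter update during the run corresponds to the Profiling Execution overhead.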
The persistent trace loading overhead (Equation 3) is measured by comparing the execution time of specific benchmarks using the DBT LT+Native and DBT Native configurations. This DBTPTraces approach removes the profiling and hot trace building overhead but adds the overhead of loading the persistent traces from the target file, with typical values around 22.73%, as shown in Figure 3.

Translated Code Execution: Ideally, eliminating the initialisation and translation overheads would make the translated code run at least as fast as the native code. However, the translated code is not exactly the same as the original code, and the unavailability of specialized hardware may cause extra overhead, which can be broken down into guest code emulation overhead, code cache control transfer overhead, code duplication and Return Address Stack (RAS) overhead, which are described in the following section.
Code Emulation overhead occurs due to the need to keep the original program behavior; translated code must therefore emulate part of the native instructions during execution, which involves more instruction emulations that potentially increase the overheads.

Code Cache Control Transfer employs three modes of operation to perform the control transfer between traces and basic blocks inside the code cache, namely code block chaining, code dispatching and inline dispatching. In a typical situation, chaining does not incur extra overhead during the translated code execution. Due to the inlined dispatching method, the Code Dispatcher is only called to resolve addresses from cold code. However, cold code is not often executed, so the overhead caused by calls to the Code Dispatcher is minimized. In this way, the majority of the address resolution overhead occurs with the inlined dispatching, which is used within frequently executed hot code and may represent a significant portion of the address resolution overhead; its measurement is shown in Equation (5). Experiments done by other researchers [9] show that the address resolution overhead caused by indirect jumps and return instructions accounts for approximately 31.5% of the total translation overhead, as shown in Figure 3.

The researchers' experiments also show that some processors rely on the RAS to efficiently predict the target address of return instructions, which is the norm for most modern microprocessors. However, return instructions cannot be executed inside the translated code and are normally emulated through indirect jump instructions, which greatly increases the pressure on the indirect branch predictor cache. Experiments indicated that the typical DBTO overhead due to RAS is 33.6%, which accounts for the highest weightage among all the related overheads.
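The typical values quoted in this section can be collected and ranked as follows. This is a small sketch that assumes all five percentages are shares of the same total translation overhead (the breakdown of Figure 3); the dictionary keys are shortened labels chosen here for illustration.

```python
# Typical overhead shares quoted in this section, in percent.
reported_overheads = {
    "initialisation": 0.2,
    "cold code translation": 11.91,
    "persistent trace loading": 22.73,
    "address resolution": 31.5,
    "return address stack (RAS)": 33.6,
}


def rank_overheads(shares):
    """Sort overhead categories from the largest share to the smallest."""
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


ranked = rank_overheads(reported_overheads)
total = round(sum(reported_overheads.values()), 2)

for name, share in ranked:
    print(f"{name:30s} {share:6.2f}%")
print(f"{'sum of quoted shares':30s} {total:6.2f}%")
```

Under that assumption, the RAS and address resolution categories together account for roughly two thirds of the quoted total, which is why control transfer handling is the dominant concern during translated code execution.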
3. RESEARCH METHOD
This research intends to provide concurrent binary translation utilising multithreading services on multicore, by constructing an infrastructure for atomic instructions implemented in QEMU. QEMU was chosen based on a hypothesis drawn from the evidence of researchers' work that has shown its capability to perform parallel task processing through the emulation of multiple compute units [4, 6]. By providing new TCG helpers that act as a sort of softmmu helpers, atomic behavior can be guaranteed for some memory accesses. More specifically, the new softmmu helpers behave as LoadLink and StoreConditional instructions, and are called from TCG code by means of target-specific helpers. The implementation heavily uses the software TLB together with a new bitmap that has been added to the ram_list structure, which flags, on a per-CPU basis, all the memory pages that are in the middle of a LoadLink (LL), StoreConditional (SC) operation. The LoadLink instruction reads the value from a shared memory location and stores the content into a register of the calling CPU. It also establishes a link and records the CPU with the accessed address (xaddr), to properly handle the subsequent SC operation. The StoreConditional instruction writes to the address xaddr only if it belongs to an exclusive memory region (EMR) previously created by an LL. The SC is not always successful, since another CPU can nullify the EMR by writing or reading to it. Since all these pages can be accessed directly through the fast-path and alter a vCPU's linked value, the new bitmap has been coupled with a new TLB flag for the
TLB virtual address which forces the slow-path execution for all the accesses to a page containing a linked address.
This new slow-path implementation demonstrates the following features:
a. The LL behaves as a normal load slow-path, except for clearing the dirty flag in the bitmap. The cputlb.c code, while generating a TLB entry, checks if there is at least one vCPU that has the bit cleared in the exclusive bitmap; in that case the TLB entry will have the EXCL flag set, thus forcing the slow-path. The TLB cache of all the other vCPUs is flushed to ensure that all the vCPUs will follow the slow-path for that page. The LL also sets the linked address and size of the access in a vCPU's private variable. After the corresponding SC, this address is set to a reset value.
b. The SC can fail by returning 1, or succeed by returning 0. It always has to come after an LL and has to access the same address 'linked' by the previous LL, otherwise it will fail. If, in the time window delimited by a legitimate pair of LL/SC operations, another write access happens to the linked address, the SC will fail.
In theory, the provided implementation of TCG LoadLink/StoreConditional can be used to properly handle atomic instructions on any processor architecture. During implementation work, two new instructions are added to the existing 132 TCG ops instruction set, mainly to handle load linking and conditional store operations between the related registers and host memory. These two instructions are introduced as helper instructions known as helper_ldlink_name and helper_stcond_name.
Operations of
One of the major problems when dealing with multi-threaded programs is the occurrence of race conditions. In the context of this work, a race condition can be associated with an inconsistency of the whole machine state, which is in charge of translating atomic instructions. The direct negative result of such a state is the failure of a StoreConditional (SC) operation that should have succeeded or, even worse, the success of an SC operation that had to fail. In the subsequent sections, all critical points that result in race conditions are explored, and the implemented approach is also documented. Updates of the exclusive bitmap can lead to inconsistencies due to the out-of-order execution of load/store operations as seen, for instance, on ARM architectures [16]. For this reason all accessors to such a bitmap are atomic, an outcome that is possible by means of host atomic instructions. It is important to note that this can only be achieved in the case where the bitmap accessors are QEMU functions and not implemented through TCG-generated code. In fact, other guest CPUs, different from the one issuing the LL, could have already generated TLB entries for the same page, forcing the execution to follow the fast-path (as shown in Figure 5). To avoid this dangerous behaviour, the TLB entries of these CPUs are flushed, forcing them to recreate the TLB entry that covers the page in the EMR. This flush request also prevents race conditions that are related to the delayed propagation of the new state of the exclusive bit. Our implementation also ensures that the evaluations and updates of the EMRs are safeguarded using a mutex, because updating this structure is not possible with a single atomic instruction.
Another related aspect that requires additional caution relates to the actual memory accesses made by the LL and SC instructions. More specifically, the effects on memory brought about by these instructions also have to be applied jointly with the update of the EMR values. Listings 1, 2 and 3 represent respectively the algorithms for LL, SC and normal store access. In these examples, the critical region is delimited by the two calls LOCK and UNLOCK.

LoadLink, as in Listing 1, only works as long as the normal load is done inside the critical section; otherwise the loaded value can potentially be updated by another CPU, which might or might not be inside the critical region. For the same reason, the SC operation (Listing 2) also has to rely on the same critical region to be consistent with the rest of the atomic instruction emulation. Without entering the critical region, it can potentially declare the operation successful (returning 0) but perform the store after another CPU has modified the value. Similarly, the store operation (Listing 3) enters the critical region to check for a possible conflict in the EMR, but also to perform the regular access.
Listing 1 (left): LoadLink pseudocode, load( ) denotes a plain load from memory of size z
Listing 2 (middle): StoreCond pseudocode, store( ) denotes a plain store to memory of size z
Listing 3 (right): Plain write access trapped by the slow-path
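The behaviour described for LL, SC and the trapped plain store can be sketched as follows. This is a minimal illustration under assumptions, not QEMU's implementation: the class `LLSCMemory` and its method names are hypothetical, a single Python lock stands in for the LOCK/UNLOCK pair, and a per-vCPU dictionary stands in for the exclusive bitmap and the private linked-address variable. As in the text, SC returns 0 on success and 1 on failure.

```python
import threading


class LLSCMemory:
    """Toy shared memory with LoadLink/StoreConditional emulation."""

    def __init__(self):
        self._mem = {}                    # address -> value
        self._links = {}                  # vCPU id -> address linked by its last LL
        self._lock = threading.Lock()     # plays the role of LOCK/UNLOCK

    def load_link(self, cpu, addr):
        with self._lock:                  # the load happens inside the critical region
            self._links[cpu] = addr       # record the link for this vCPU
            return self._mem.get(addr, 0)

    def store_cond(self, cpu, addr, value):
        with self._lock:
            linked = self._links.pop(cpu, None)   # the link is reset after every SC
            if linked != addr:
                return 1                  # failure: no valid link to this address
            self._mem[addr] = value
            self._break_links(addr, except_cpu=cpu)
            return 0                      # success

    def store(self, cpu, addr, value):
        with self._lock:                  # plain store trapped by the slow path
            self._break_links(addr, except_cpu=cpu)
            self._mem[addr] = value

    def _break_links(self, addr, except_cpu):
        # Any other vCPU that is linked to this address loses its link.
        for other, linked in list(self._links.items()):
            if other != except_cpu and linked == addr:
                del self._links[other]


mem = LLSCMemory()
mem.store(cpu=0, addr=0x40, value=10)

old = mem.load_link(cpu=0, addr=0x40)
ok = mem.store_cond(cpu=0, addr=0x40, value=old + 1)    # uncontended pair

mem.load_link(cpu=0, addr=0x40)
mem.store(cpu=1, addr=0x40, value=99)                   # conflicting write by vCPU 1
failed = mem.store_cond(cpu=0, addr=0x40, value=0)      # the link was broken
```

Because the load, the link bookkeeping and the store all happen under the same lock, an SC can never report success after another vCPU has modified the linked address, which is the race the critical region is meant to exclude.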
Figure 5. Code region transition between guest CPU registers and code cache
The overheads involved during this LL/SC operation include helper function calls and code transition. During the QEMU system-mode emulation process, the LoadLink instruction operation takes place while the code transition is in process. Upon activation, a helper function, which comes as a piece of C code, is called from the translated code to store the current address and load the value. Firstly, a bit in the exclusive bitmap is set to enforce the slow path, which means a helper function call, in order to achieve multiple helper threads. The DBT needs to set the link address of the guest CPU memory by first obtaining a lock for the target guest CPU memory segment to access the critical region; then the address size is determined, followed by setting the address range into the targeted vCPU thread register. Upon activation of the conditional store instruction, all vCPUs are halted, and the DBT checks whether this process has been interrupted since the last LL call, by checking the TLB table and comparing the current address and value with the saved copies in the TLB table. If they are unchanged, the store is allowed and a successful operation is reported; the CPU state of the current guest CPU is then saved into the code cache. Otherwise, the TLB entry table needs to be updated through TLB flushing for all vCPUs. This process is repeated for different basic blocks inside the code cache, thus generating multiple helper function threads to assist in the binary translation as well as the optimization process. Based on the LoadLink and StoreCond operations described previously, the overhead of code transition due to the load linking process can be modelled by constructing an overhead formula.
As described previously, the load linking process is affected by steps including locking the critical region inside memory, setting the address and its size, flushing the TLB, loading the CPU state and unlocking the critical region of memory. Thus the load link overhead can be derived as below:

$$L_{load\_link} = T_{mem\_lock} + T_{link\_addr\_set} + T_{addr\_size\_set} + T_{TLB\_flush} + T_{cpu\_state\_load} + T_{mem\_unlock}$$

The overhead of code transition due to the StoreConditional process, influenced by halting all vCPUs, comparing values and saving values, is derived:

$$L_{store\_cond} = T_{vCPU\_halt} + T_{cmp} + T_{value\_save}$$

Thus the code transition overhead due to the LoadLink and StoreConditional processes is then given:

$$L_{code\_trans} = L_{load\_link} + L_{store\_cond}$$

4. RESULTS AND ANALYSIS
Experiments are done through simulation of selected PARSEC-3.0 benchmark programs [14], compiled with gcc version 4.8.3, as depicted in Table 1. All performance evaluation is done on a system with one 1.7 GHz quad-core Intel Core-i7 processor and 4 GBytes of main memory. The operating system is 64-bit Ubuntu 14.04 LTS Linux with kernel version 3.19.0-33-generic. The selected PARSEC 3.0 [15] benchmark programs are evaluated with the simlarge input sets, which resemble real inputs using a larger problem size of data sets, for the x86-32 guest ISA on the x86-64 host platform. All the selected benchmark programs are parallelized with the Pthread model and compiled for the respective guest ISAs with the PARSEC default compiler optimization and SIMD enabled. The benchmark simulation performance is compared, using simlarge inputs, between different configurations: (i) Hybrid-QEMU with single-thread mode, denoted as Hybrid-Q-s, and (ii) Hybrid-QEMU with multi-thread mode, denoted as Hybrid-Q-m. During experiments, atomic instructions are emulated with lightweight memory transactions for all the experiment configurations, so that the benchmarks can be emulated correctly.
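The code transition overhead model of Section 3 can be evaluated numerically as sketched below: the load link cost is the sum of the lock, address set, size set, TLB flush, CPU state load and unlock times, the store conditional cost is the sum of the vCPU halt, compare and value save times, and the code transition overhead is the sum of both. The function names and dictionary keys are hypothetical, and the step times are made-up values for illustration only, not measurements from this work.

```python
def load_link_overhead(t):
    """Load link overhead: the sum of the six load link step times."""
    return (t["mem_lock"] + t["link_addr_set"] + t["addr_size_set"]
            + t["tlb_flush"] + t["cpu_state_load"] + t["mem_unlock"])


def store_cond_overhead(t):
    """Store conditional overhead: halt all vCPUs, compare, save the value."""
    return t["vcpu_halt"] + t["cmp"] + t["value_save"]


def code_transition_overhead(t):
    """Code transition overhead: load link part plus store conditional part."""
    return load_link_overhead(t) + store_cond_overhead(t)


# Made-up step times in nanoseconds (illustrative assumptions, not measured data).
timings_ns = {
    "mem_lock": 40, "link_addr_set": 10, "addr_size_set": 10,
    "tlb_flush": 300, "cpu_state_load": 60, "mem_unlock": 30,
    "vcpu_halt": 500, "cmp": 5, "value_save": 45,
}

print("load link      :", load_link_overhead(timings_ns), "ns")
print("store cond     :", store_cond_overhead(timings_ns), "ns")
print("code transition:", code_transition_overhead(timings_ns), "ns")
```

Keeping each step as a separate term makes it possible to see which step dominates once real measurements are substituted for the placeholder values.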
Table 1. Selected benchmark program features
Figure 6. PARSEC-3.0 benchmark results of x86 to x86-64
4.1. Static and Dynamic Binary Translation performance
For all apps-based benchmark programs, translation time with single-thread support is observed to be much shorter than with multithreaded support, as seen in Figure 6(a) to (d). For the kernel-based programs, i.e. streamcluster and canneal, translation time with multithreaded support is shorter than, or at most close to, that of single-thread support, as seen in Figure 6(e) and (f). The relatively poorer translation performance of multithreaded support for the benchmark apps is mostly due to the accumulated overheads of thread contention during the multi-thread initialization stage, which can cause significant performance degradation even though the concurrent execution provided by multithreading should reduce translation time.
Conversely, improved translation-with-optimization time is observed for the kernel-based programs under H-Q-m. These are either fine-granular or coarse-granular programs that are geared towards parallelism by nature, so they benefit from the multithreaded support scheme during the binary translation process. The apps-based programs are typically short-running, so they benefit from DBT, as can be observed from the shorter elapsed time taken by H-Q-s. The kernel-based programs tend to be longer-running, so they benefit from both DBT and DBO.
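The trade-off between short-running and long-running programs can be sketched with a simple break-even calculation. This is an illustrative model only, not taken from the paper's measurements: it assumes total time is a one-off start-up (translation and optimization) cost plus a constant cost per executed iteration, and all numbers and function names below are hypothetical.

```python
def total_time(startup_ms, per_iter_ms, iterations):
    """One-off start-up cost plus a constant cost per executed iteration."""
    return startup_ms + per_iter_ms * iterations


def break_even(startup_a, per_iter_a, startup_b, per_iter_b):
    """Iterations beyond which configuration B (higher start-up, faster code) wins.

    Returns None when B's generated code is not faster than A's, since B
    then never recovers its extra start-up cost.
    """
    if per_iter_a <= per_iter_b:
        return None
    return (startup_b - startup_a) / (per_iter_a - per_iter_b)


# Hypothetical costs in milliseconds (assumptions, not measured values):
LIGHT = (1000, 10)  # translation only: cheap start-up, slower code
HEAVY = (3000, 6)   # translation plus optimization: costly start-up, faster code

crossover = break_even(*LIGHT, *HEAVY)       # 500.0 iterations
short_run = (total_time(*LIGHT, 100), total_time(*HEAVY, 100))
long_run = (total_time(*LIGHT, 2000), total_time(*HEAVY, 2000))
```

Under these assumed costs, a program that runs 100 iterations finishes sooner with the light configuration, while one that runs 2000 iterations recovers the heavier start-up cost, which mirrors the short-running versus long-running distinction drawn above.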
4.2. Concurrent Dynamic Binary Translation performance
As shown in Figure 6, for almost all PARSEC-3.0 benchmark programs the increment in elapsed time shrinks as the number of worker threads grows. This translation time increment gradually reaches saturation once the number of worker threads exceeds 32. This phenomenon is attributed to our built-in concurrent features, which amortise the start-up overheads over a certain period after the start-up process, so that a steady state is eventually reached when the number of activated worker threads reaches 40.
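One way to locate such a saturation point in a measured series is to find the first thread count after which the relative change in elapsed time stays below a tolerance. The sketch below assumes this definition. The function name `saturation_point`, the 5% tolerance, and the sample series are all hypothetical and are not values read from Figure 6.

```python
def saturation_point(thread_counts, elapsed, tol=0.05):
    """Return the first thread count from which every later step changes
    the elapsed time by less than `tol` (relative), or None if none does."""
    steps = [
        abs(elapsed[i + 1] - elapsed[i]) / elapsed[i]
        for i in range(len(elapsed) - 1)
    ]
    for i in range(len(steps)):
        if all(step < tol for step in steps[i:]):
            return thread_counts[i]
    return None


# Made-up elapsed times (seconds) whose increments shrink with thread count.
threads = [1, 2, 4, 8, 16, 32, 40, 48]
elapsed = [10.0, 14.0, 17.0, 19.0, 20.5, 22.0, 22.2, 22.2]

knee = saturation_point(threads, elapsed)  # 32 for this made-up series
```

The tolerance is a judgment call: a tighter value moves the reported saturation point to a higher thread count, so it should be stated alongside any result obtained this way.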
| Program | Cat. | Application Domain | Parallelization Model | Parallelization Granularity | Working Set | Data Usage: Sharing | Data Usage: Exchange |
|---|---|---|---|---|---|---|---|
| blackscholes | apps | Financial Analysis | data-parallel | coarse | small | low | low |
| bodytrack | apps | Computer Vision | data-parallel | medium | medium | high | medium |
| canneal | kernel | Engineering | unstructured | fine | unbounded | high | high |
| ferret | apps | Similarity Search | pipeline | medium | unbounded | high | high |
| streamcluster | kernel | Data Mining | data-parallel | medium | medium | low | medium |
| vips | apps | Media Processing | data-parallel | coarse | medium | low | medium |
4.3. Program Start-up overhead analysis
During the start-up stage of guest-to-host ISA binary translation, the average elapsed time that the helper threads spend in critical sections increases significantly, because the helper threads contend for the critical section within the QEMU dispatcher, where the serialization lengthens the wait time. The delay is further worsened by the critical section access wait time and the branch target mapping directory lookup time. Furthermore, the latency added by such serialisation is greater than the reduction in execution time gained from the additional helper threads that assist the binary translation process. This latency therefore generates high overheads that dominate the total translation time, and hence the overall execution time, which eventually causes the poor performance of the parallel PARSEC-3.0 benchmarks for the selected apps.
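The argument that serialisation latency can outweigh the gain from extra helper threads can be illustrated with a simple analytic model. This is a sketch under stated assumptions, not the paper's measurement method: the translation work is assumed to split evenly across helper threads, and each thread is assumed to pass once through a serialised dispatcher critical section of fixed cost. The function names and numbers are hypothetical.

```python
def elapsed_time(work, cs_cost, n_threads):
    """Parallel share of the work plus the serialised critical-section time.

    work      : total translation work (arbitrary time units)
    cs_cost   : time one thread holds the serialised critical section
    n_threads : number of helper threads
    """
    return work / n_threads + cs_cost * n_threads


def best_thread_count(work, cs_cost, max_threads):
    """Thread count in 1..max_threads that minimises the modelled elapsed time."""
    return min(range(1, max_threads + 1),
               key=lambda n: elapsed_time(work, cs_cost, n))


# Hypothetical workload: 900 units of work, 1 unit per critical-section pass.
best = best_thread_count(900, 1, 128)      # 30, close to sqrt(work / cs_cost)
at_best = elapsed_time(900, 1, best)       # 60.0
with_more = elapsed_time(900, 1, 2 * best) # 75.0, slower despite more helpers
```

In this model the elapsed time falls until roughly the square root of `work / cs_cost` threads and then rises, because each added helper thread saves less parallel work than it adds in serialised waiting.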
5. CONCLUSION
This paper presented a detailed fine-grained analysis of the overheads incurred by concurrent dynamic translation and optimization on our newly constructed Dual-Engine DBTO architecture, which has multi-threaded, retargetable capability and runs on a multicore processor. Experiments have shown that this multi-threaded hybrid translation and optimization approach can achieve relatively lower translation overhead while still producing good translated code quality for the target binary applications, especially for kernel-based programs. In these experiments, H-Q-m, which supports binary translation with multiple threads, is more efficient for kernel-based applications, as shown by a speedup of up to 1.25x for multithread versus single thread. Apps-based programs, in contrast, benefit more from single-threaded binary translation, with a speedup of up to 1.8x versus multiple threads, as also reported in our previous paper [20]. We foresee great potential in utilising the multithread technique to assist the binary translation and optimisation process, for both short-running and long-running program analysis.
ACKNOWLEDGEMENTS
The author would like to thank Assoc. Prof. Dr. Fawnizu, Dr. Nordin and their students for their great support and valuable comments.
REFERENCES
[1] F. Bellard, "QEMU, a fast and portable dynamic translator," in USENIX Annual Technical Conference, pp. 41-46, 2005.
[2] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in Proceeding CGO, 2004.
[3] G. R. Uh, R. Cohn, B. Yadavalli, R. Peri, and R. Ayyagari, "Analyzing dynamic binary instrumentation overhead," in WBIA Workshop at ASPLOS, Oct 2006.
[4] J. H. Ding, P. C. Chang, W. C. Hsu, and Y. C. Chung, "PQEMU: A parallel system emulator based on QEMU," in Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th Int'l Conference, 7 Dec 2011, pp. 276-283.
[5] A. Jeffery, "Using the LLVM compiler infrastructure for optimised, asynchronous dynamic translation in QEMU," Master's thesis, University of Adelaide, Australia, 2009.
[6] Z. Wang, R. Liu, Y. Chen, X. Wu, H. Chen, W. Zhang, and B. Zhang, "COREMU: a scalable and portable parallel full-system emulator," in Proc. PPoPP, 2011.
[7] L. Baraz, et al., "IA-32 Execution Layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium®-based systems," Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, 2003.
[8] J. C. Dehnert, et al., "The Transmeta Code Morphing™ Software: using speculation, recovery, and adaptive retranslation to address real-life challenges," Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization, IEEE Computer Society, 2003.
[9] E. Borin and Y. Wu, "Characterization of DBT overhead," Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, IEEE, 2009.
[10] J. Lu, H. Chen, P.-C. Yew, and W.-C. Hsu, "Design and implementation of a lightweight dynamic optimization system," Journal of Instruction-Level Parallelism, 6:1–24, 2004.
[11] C. K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. Reddi, and K. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," in Proc. PLDI, 2005.
[12] K. Scott, N. Kumar, B. R. Childers, J. W. Davidson, and M. L. Soffa, "Overhead reduction techniques for software dynamic translation," in Proc. IPDPS, pages 200–207, 2004.
[13] S. Sridhar, J. S. Shapiro, E. Northup, and P. P. Bungale, "HDTrans: an open source, low-level dynamic instrumentation system," in Proc. VEE, pages 175–185, 2006.
[14] C. Bienia, et al., "The PARSEC benchmark suite: Characterization and architectural implications," Proceedings of the 17th international conference on Parallel architectures and compilation techniques, ACM, 2008.
[15] C. Cifuentes and V. Malhotra, "Binary translation: Static, dynamic, retargetable?" Software Maintenance 1996, Proceedings., International Conference on, IEEE, 1996.
[16] J.-Y. Chen, et al., "A static binary translator for efficient migration of ARM-based applications," Workshop on Optimizations for DSP and Embedded Systems, 2008.
[17] G. Kondoh and H. Komatsu, "Dynamic binary translation specialized for embedded systems," ACM Sigplan Notices, 2010 Jul 1;45(7):157-66.
[18] B.-Y. Shen, et al., "An LLVM-based hybrid binary translation system," Industrial Embedded Systems (SIES), 2012 7th IEEE International Symposium on, IEEE, 2012.
[19] C.-L. Liu, et al., "Dynamically Translating Binary Code for Multi-Threaded Programs Using Shared Code Cache," Journal of Electronic Science and Technology, no. 4 (2014): 434-438.
[20] J. O. Ooi, F. A. B. Hussin, and N. Zakaria, "Dual-Engine Cross-ISA DBTO Technique Utilising MultiThreaded Support for Multicore Processor System," in Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2016 IEEE 10th International Symposium, December 2016, pp. 257-264.
BIOGRAPHIES OF AUTHORS
Joo-On Ooi received his MSc and Bachelor degrees in Electrical and Electronics Engineering from Nanyang Technological University, Singapore, in 2004 and 1999 respectively. From 2002 until 2011 he was involved in Application Engineering in the semiconductor field, covering microprocessor, microcontroller and memory devices. Prior to this he was involved in satellite sub-system design at the Satellite Engineering Center, NTU, Singapore. Since April 2011 he has been a faculty member in the Department of Computer and Communication Technology, Universiti Tunku Abdul Rahman, Malaysia. His research interests are in the areas of multicore processors, runtime systems, real-time processing, and compiler binary translation and optimisation. Currently he is a member of the IEEE Circuit and System Society.
Fawnizu Azmadi Hussin received his PhD degree in electrical and electronic engineering under the Monbukagakusho scholarship of MEXT from Nara Institute of Science and Technology, Japan. He currently serves as Associate Professor at the Electrical and Electronics Engineering Department, and is a member of the Centre for Intelligent Signal and Imaging Research, Universiti Teknologi Petronas (UTP). He was the Deputy Head of the EE Dept. from 2012 to 2013, and also Program Manager of the MSc programme. He has been Director of the UTP Strategic Alliance Office since Oct. 2014, and was past-chair of the IEEE Circuit and System Society Malaysia chapter during 2013-2014. His research interests are in the areas of VLSI design and testing, particularly SoC Design-for-Test and scheduling, NoC interconnect, low-power VLSI, and FPGA algorithm and architecture implementation/optimization.
Mohamed Nordin Zakaria received his PhD degree in Computer Science from Universiti Sains Malaysia, Malaysia. Currently Dr. Nordin is Head of the High Performance Computing Centre (HPCC), UTP, Malaysia. His research interests are in the areas of computer graphics, evolutionary algorithms and scheduling for large-scale computing infrastructure.