International Journal of Electrical and Computer Engineering (IJECE)
Vol. 8, No. 5, October 2018, pp. 3021 – 3037
ISSN: 2088-8708, DOI: 10.11591/ijece.v8i5.pp3021-3037
Network Function Modeling and Performance Estimation

Mario Baldi and Amedeo Sapio
Department of Control and Computer Engineering, Politecnico di Torino, Italy
Article Info

Article history:
Received July 20, 2017
Revised March 13, 2018
Accepted April 7, 2018

Keyword:
Network Function Modeling
Network Functions Virtualization
Performance Estimation
ABSTRACT

This work introduces a methodology for the modeling of network functions focused on the identification of recurring execution patterns as basic building blocks and aimed at providing a platform-independent representation. By mapping each modeling building block on specific hardware, the performance of the network function can be estimated in terms of the maximum throughput that the network function can achieve on the specific execution platform. The approach is such that once the basic modeling building blocks have been mapped, the estimate can be computed automatically for any modeled network function. Experimental results on several sample network functions show that, although our approach cannot be very accurate without taking traffic characteristics into consideration, it is very valuable for those applications where even loose estimates are key. One such example is orchestration in network functions virtualization (NFV) platforms, as well as in general virtualization platforms where virtual machine placement is based also on the performance of the network services offered to them. Being able to automatically estimate the performance of a virtualized network function (VNF) on different execution hardware enables optimal placement of VNFs themselves as well as of the virtual hosts they serve, while efficiently utilizing available resources.

Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Amedeo Sapio
Department of Control and Computer Engineering
Politecnico di Torino, Italy
amedeo.sapio@polito.it
1. INTRODUCTION
For a few years now, software network appliances have been increasingly deployed. Initially, their appeal stemmed from their lower cost, shorter time-to-market, and ease of upgrade when compared to purposely designed hardware devices.
These features are particularly advantageous in the case of appliances, a.k.a. middleboxes, operating on relatively recent, higher layer protocols that are usually more complex and possibly still evolving.
More recently, with the overwhelming success and diffusion of cloud computing and virtualization, software appliances became a natural means to ensure that network functionalities have the same flexibility and mobility as the virtual machines (VMs) they offer services to. In this context, implementing even less complex, more stable network functionalities in software is valuable.
This trend led to embracing Software Defined Networking (SDN) and Network Functions Virtualization (NFV): the former as a hybrid hardware/software approach to ensure high performance for lower layer packet forwarding, while retaining a high degree of flexibility and programmability; the latter as a virtualization solution targeting the execution of software network functions in isolated VMs sharing a pool of hosts, rather than on dedicated hardware (i.e., appliances).
Such a solution enables virtual network appliances (i.e., VMs executing network functions) to be provisioned, allocated a different amount of resources, and possibly moved across data centers in little time, which is key in ensuring that the network can keep up with the flexibility in the provisioning and deployment of virtual hosts in today's virtualized data centers.
Additional flexibility is offered when coupling NFV with SDN, as network traffic can be steered through a chain of Virtualized Network Functions (VNFs) in order to provide aggregated services. With inputs from the industry, the NFV approach has been standardized by the European Telecommunications Standards Institute (ETSI) in 2013 [1].
The flexibility provided by NFV requires the ability to effectively assign compute nodes to VNFs and allocate the most appropriate amount of resources, such as CPU quota, RAM, and virtual interfaces. In the ETSI standard, the component in charge of taking such decisions is called the orchestrator, and it can also dynamically modify the amount of resources assigned to a running VNF when needed.
Figure 1. NF modeling and performance estimation approach.
The orchestrator can also request the migration of a VNF when the current compute node executing it is no longer capable of fulfilling the VNF performance requirements.
These tasks require the orchestrator to be able to estimate the performance of VNFs according to the amount of resources they can use. Such estimation must take into account the nature of the traffic manipulation performed by the VNF at hand, some specifics of its implementation, and the expected amount of traffic it operates on. A good estimation is key to ensuring higher resource usage efficiency and avoiding adjustments at runtime.
This work proposes a unified modeling approach applicable to any VNF, independently of the platform it is running on. By mapping a VNF model on specific hardware, it is possible to predict the maximum amount of traffic that the VNF can sustain with the required performance.
The proposed modeling approach relies on the identification of the most significant operations performed by the VNF on the most common packets. These operations are described in a hardware-independent notation to ensure that the model is valid for any execution platform.
The mapping of the model on a target hardware architecture (required in order to determine the actual performance) can be automated, hence allowing to easily apply the approach to each available hardware platform and choose the most suitable one for the execution.
Even if the proposed modeling approach has been defined with NFV in mind, it can be applied to non-virtualized network functions (NFs), whether implemented in software or hardware, provided that the implementation and the characteristics of the underlying hardware are known.
The availability of a unified modeling approach for VNFs and NFs is instrumental in the integration of middleboxes in an NFV infrastructure [2], which is important in a transition phase and for specific applications where a dedicated or specialized hardware platform is necessary for a specific NF to satisfy performance requirements.
The modeling approach is introduced in Section 2 and is illustrated in Section 3 by applying it to various network functions. In order to validate the proposed models, Section 4 compares the estimated performance with actual measurements of software network functions running on a general purpose hardware platform. After discussing related work in Section 5, Section 6 concludes the paper.
2. METHODOLOGY
The proposed modeling approach is based on the definition of a set of processing steps, here called Elementary Operations (EOs), that are common throughout various NF implementations. This stems from the observation that, generally, most NFs perform a rather small set of operations when processing the average packet, namely, well-defined alterations of packet header fields, coupled with data structure lookups. An EO is informally defined as the longest sequence of elementary steps (e.g., CPU instructions or ASIC transactions) that is common among the processing tasks of multiple NFs. As a consequence, an EO has variable granularity, ranging from a simple I/O or memory load operation to a whole IP checksum computation. On the other hand, EOs are defined so that each can be potentially used in multiple NF models.
An NF is modeled as a sequence of EOs that represent the actions performed for the vast majority of packets. Since we are interested in performance estimation, we ignore operations that affect only a small number of packets (i.e., less than 1%), since these tasks have a negligible impact on performance, even when they are more complex and resource intensive than the most common ones. Accordingly, exceptions, such as failures, configuration changes, etc., are not considered.
It is important to highlight that NF models produced with this approach are hardware independent, which ensures that they can be applied when NFs are deployed on different execution platforms.
Table 1. List of sample EOs

   EO                        Parameters      Description
1  I/O_mem - mem_I/O         hdr, data       Packet copy between I/O and (cache) memory
2  parse - deparse           b               Parse or encapsulate a data field
3  increase - decrease       b               Increase/decrease a field
4  sum                       b               Sum 2 operands
5  checksum - inc_checksum   b               Compute IP checksum
6  array_access              es, max         Direct access to a byte array in memory
7  ht_lookup                 N, HE, max, p   Simple hash table lookup
8  lpm_lookup                b, es           Longest prefix match lookup
9  ct_insertion              N, HE, max, p   Cache table insertion
In order to estimate the performance of an NF on a specific hardware platform, each EO must be mapped on the hardware components involved in its execution and their features. This mapping allows to take into consideration the limits of the involved hardware components and to gather a set of parameters that affect the performance (e.g., clock frequency). Moreover, the load incurred by each component when executing each EO must be estimated, whether through actual experiments or based on nominal hardware specifications. The data collected during such mapping are specific to the EOs and the hardware platform, but not to a particular NF. Hence, they can be applied to estimate the performance of any NF modeled in terms of EOs.
Specifically, the performance of each individual EO involved in the NF model is computed and composed considering the cumulative load that all EOs impose on the hardware components of the execution platform, while heeding all of the applicable constraints. Figure 1 summarizes the steps and intermediate outputs of the proposed approach.
Table 1 presents a list of sample EOs that we identified when modeling a number of NFs. Such a list is by no means meant to be exhaustive; rather, it should be incrementally extended whenever it turns out that a new NF being considered cannot be described in terms of previously identified EOs. When defining an EO, it is important to identify the parameters related to traffic characteristics that significantly affect its execution and resource consumption.
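As an illustration of how such a model is consumed, the following minimal C sketch (our illustration, not the paper's tooling; the EO names mirror Table 1, but the sample cost numbers are hypothetical) represents an NF model as a sequence of parameterized EOs whose per-packet costs are accumulated and turned into a throughput bound:

    /* Minimal sketch: an NF model as a sequence of EOs. Each EO contributes
     * a per-packet cost in CPU cycles and DRAM accesses; the NF cost is the
     * sum, and the sustainable rate is bounded by the most loaded component. */
    #include <stdio.h>

    typedef struct {
        const char *name;   /* e.g., "ht_lookup" */
        double cycles;      /* CPU clock cycles per packet */
        double dram;        /* DRAM (or L3) accesses per packet */
    } eo_cost;

    int main(void) {
        eo_cost model[] = {          /* hypothetical per-EO costs */
            { "I/O_mem", 36, 0 }, { "parse", 5, 0 }, { "ht_lookup", 120, 1 },
            { "deparse", 5, 0 },  { "mem_I/O", 40, 0 },
        };
        double cycles = 0, dram = 0;
        for (unsigned i = 0; i < sizeof model / sizeof *model; i++) {
            cycles += model[i].cycles;
            dram += model[i].dram;
        }
        double cpu_pps = 3.0e9 / cycles;       /* core clock: 3 GHz */
        double mem_pps = 1.0 / (dram * 10e-9); /* assumed 10 ns per access */
        printf("max throughput: %.2f Mpps\n",
               (cpu_pps < mem_pps ? cpu_pps : mem_pps) / 1e6);
        return 0;
    }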
2.1. Elementary Operations
A succinct description of the EOs listed in Table 1 is provided below.
1. Packet copy between I/O and memory: A packet is copied from/to an I/O buffer to/from memory or CPU cache. hdr is the number of bytes that are preferably stored in the fastest cache memory, while data bytes can be kept in lower level cache or main memory. The parameters have been chosen taking into consideration that some NPUs provide a manual cache that can be explicitly loaded with the data that needs fast access. General purpose CPUs may have assembler instructions (e.g., PREFETCHh) to explicitly influence the cache logic.
2. Parse or encapsulate a data field: A data field of b bytes stored in memory is parsed. A parsing operation is necessary before performing any computation on a field (it corresponds to loading a processor register). The dual operation, i.e., deparse, implies storing back into memory a properly constructed sequence of fields.
3. Increase/decrease a field: Increase/decrease the numerical value contained in a field of b bytes. The field to increase/decrease must have already been parsed.
4. Sum 2 operands: Two operands of b bytes are added.
Figure 2. Hardware architecture description.
5. Compute IP checksum: The standard IP checksum computation is performed on b bytes. When only some bytes change in the relevant data, the checksum can be computed incrementally from the previous correct value [3]. In this case, the previous value of the checksum must be parsed beforehand, and b is the number of changed bytes for which the checksum must be incrementally computed.
6. Direct access to a byte array in memory: This EO performs a direct access to an element of an array in memory using an index. Each array entry has size es, while the array has at most max entries.
7. Simple hash table lookup: A simple lookup in a direct hash table is performed. The hash key consists of N components and each entry has size equal to HE. The table has at most max entries and the collision probability is p.
8. Longest Prefix Match lookup: This EO selects an entry from a table based on the Longest Prefix Match (LPM). This lookup algorithm selects the most specific of the matching entries in a table (i.e., the one where the largest number of leading bits of the key match those in the table entry). The parameter b represents the average number of bytes of the matching prefix, while es is the entry size.
9. Cache table insertion: Save in a hash table an entry with the current timestamp, or update the timestamp if the entry is already present. This EO has the same parameters as the simple hash table lookup operation; the performance of both EOs depends on the hash table characteristics.
For the sake of simplicity (and without affecting the validity of the approach, as shown by the results in Section 4), in modeling NFs by means of EOs we assume that the number of processor registers is larger than the number of packet fields that must be processed simultaneously. Therefore, there is no competition for processor registers.
2.2. Mapping to Hardware
We now proceed to map the described EOs to a specific hardware platform: a server with 2 Intel Xeon E5-2690 v2 CPUs (Ivy Bridge architecture with ten physical cores at 3 GHz), 64 GB of DDR3 RAM, and one Intel 82599ES network card with 2x10 Gbps Ethernet ports. Figure 2 provides a schematic representation of the platform's main components and relative constraints using the template proposed in [4]. Using the CPU reference manual [5], it is possible to determine the operations required for the execution of each EO in Table 1 and estimate the achievable performance.
1. I/O_mem(hdr, data) - mem_I/O(hdr, data)
The considered CPU provides a DMA-like mechanism to move data from the I/O buffers to the shared L3 cache and vice versa. Intel DPDK drivers [6] with Data Direct I/O Technology (DDIO) leverage this capability to move packets to/from the L3 cache without CPU intervention, improving the packet processing speed. The portion of each packet that must be processed (hdr) is then moved from the L3 cache into the L1/L2 cache by the CPU. This operation requires 31 clock cycles to access the L3 cache, around 5 cycles to write an L1/L2 cache line and 9 cycles to write back an L3 cache line [7]. On the whole, the execution of this EO requires:

$31 + [5 \,|\, 9] \cdot \lceil \mathrm{hdr} / 64\,\mathrm{B} \rceil$ clock cycles

provided that hdr is less than the total amount of L1 and L2 cache, as is reasonable on modern systems for common packet sizes. The multiplier is 5 for I/O_mem and 9 for mem_I/O.
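This cost can be written down directly as a small function (a sketch using the constants quoted above; the function name is ours):

    #include <math.h>

    /* Cycle cost of I/O_mem(hdr, data) / mem_I/O(hdr, data): 31 cycles to
     * reach the L3 cache, plus 5 (I/O_mem) or 9 (mem_I/O) cycles per 64 B
     * cache line of the hdr portion. */
    double io_mem_cycles(double hdr_bytes, int mem_to_io)
    {
        return 31.0 + (mem_to_io ? 9.0 : 5.0) * ceil(hdr_bytes / 64.0);
    }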
2. parse(b) - deparse(b)
Loading a 64-bit register requires 5 clock cycles if the data is in L1 cache, or 12 clock cycles if it is in L2 cache; otherwise, an additional L3 cache or DRAM memory access is required to retrieve a 64-byte line and store it in L1 or L2, respectively (the reverse operation has the same cost):

$5 \cdot \lceil b / 8\,\mathrm{B} \rceil$ clock cycles $\{+ \lceil b / 64\,\mathrm{B} \rceil$ L3 or DRAM accesses$\}$

or

$12 \cdot \lceil b / 8\,\mathrm{B} \rceil$ clock cycles $\{+ \lceil b / 64\,\mathrm{B} \rceil$ L3 or DRAM accesses$\}$
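The same formulas in code form (our sketch; the flags select the L1/L2 case and whether the line is already cached):

    #include <math.h>

    /* parse(b) / deparse(b): 5 cycles per 8 B register load from L1
     * (12 from L2); a line not yet cached adds one L3 or DRAM access
     * per 64 B. */
    void parse_cost(double b, int in_l2, int line_cached,
                    double *cycles, double *mem_accesses)
    {
        *cycles = (in_l2 ? 12.0 : 5.0) * ceil(b / 8.0);
        *mem_accesses = line_cached ? 0.0 : ceil(b / 64.0);
    }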
3. increase(b) - decrease(b)
Provided that the processor includes an increase (decrease) instruction, or one for adding (subtracting) a constant value to a 64-bit register, this EO requires 1 clock cycle to complete. However, thanks to pipelining, up to 3 such independent instructions can be executed during 1 clock cycle:

$\lceil 0.33 \cdot \lceil b / 8\,\mathrm{B} \rceil \rceil$ clock cycles
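In code form (our sketch):

    #include <math.h>

    /* increase(b) / decrease(b): 1 cycle per 8 B register, with up to 3
     * independent increments retired per cycle thanks to pipelining. */
    double increase_cycles(double b)
    {
        return ceil(0.33 * ceil(b / 8.0));
    }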
4. sum(b)
On the considered architecture, the execution of this EO is equivalent to the increase(b) EO. Please note that this is not necessarily the case on every architecture.
5. checksum(b) - inc_checksum(b)
Figure 3 shows sample assembly code to compute a checksum on an Intel x86-64 processor. Assuming that the data on which the checksum is computed is not in L1/L2 cache, according to the Intel documentation [5], the execution of this code requires:

$7 \cdot \lceil b / 2 \rceil + 8$ clock cycles $+ \lceil b / 64\,\mathrm{B} \rceil$ L3 or DRAM accesses
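The corresponding cost function (our sketch, assuming as above that the data is not in L1/L2 cache):

    #include <math.h>

    /* checksum(b) / inc_checksum(b): the Figure 3 loop costs 7 cycles per
     * 16-bit word plus 8 cycles for the final carry folding; fetching the
     * data adds one L3 or DRAM access per 64 B line. */
    void checksum_cost(double b, double *cycles, double *mem_accesses)
    {
        *cycles = 7.0 * ceil(b / 2.0) + 8.0;
        *mem_accesses = ceil(b / 64.0);
    }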
6. array_access(es, max)
A direct array access needs to execute an "ADD" instruction (1 clock cycle) to compute the index, plus a "LOAD" instruction resulting in a direct memory access and as many clock cycles as the number of CPU registers required to load the selected array element:

$1 + \lceil es / 8\,\mathrm{B} \rceil$ clock cycles $+ \lceil es / 64\,\mathrm{B} \rceil$ DRAM accesses
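In code form (our sketch):

    #include <math.h>

    /* array_access(es, max): one ADD for the index, one register load per
     * 8 B of the selected entry, and one DRAM access per 64 B line. */
    void array_access_cost(double es, double *cycles, double *dram_accesses)
    {
        *cycles = 1.0 + ceil(es / 8.0);
        *dram_accesses = ceil(es / 64.0);
    }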
7. ht_lookup(N, HE, max, p)
We assume that a simple hash table lookup is implemented according to the pseudo-code described in [4] and shown in Figure 4 for ease of reference. Considering that the hash entry needs to be loaded from memory to L1 cache, a simple hash table lookup would require approximately:

$\lceil (4N + 106 + 5 \lceil HE / 8\,\mathrm{B} \rceil + 5 \lceil HE / 32\,\mathrm{B} \rceil) \cdot (1 + p) \rceil$ clock cycles
Register ECX: number of bytes b
Register EDX: pointer to the buffer
Register EBX: checksum

CHECKSUM_LOOP:
    XOR EAX, EAX            ; EAX = 0
    MOV AX, WORD PTR [EDX]  ; AX <- next word
    ADD EBX, EAX            ; add to checksum
    SUB ECX, 2              ; update number of bytes
    ADD EDX, 2              ; update buffer
    CMP ECX, 1              ; check if ended
    JG  CHECKSUM_LOOP
    MOV EAX, EBX            ; EAX = EBX = checksum
    SHR EAX, 16             ; EAX = checksum >> 16 (EAX is the carry)
    AND EBX, 0xffff         ; EBX = checksum & 0xffff
    ADD EAX, EBX            ; EAX = (checksum >> 16) + (checksum & 0xffff)
    MOV EBX, EAX            ; EBX = checksum
    SHR EBX, 16             ; EBX = checksum >> 16
    ADD EAX, EBX            ; checksum += (checksum >> 16)
    MOV checksum, EAX       ; checksum = EAX

Figure 3. Sample Intel x86 assembly code for checksum computation.
and $\lceil \lceil HE / 64\,\mathrm{B} \rceil \cdot (1 + p) \rceil$ L3 or DRAM accesses.
Otherwise, if the entry is already in the L1/L2 cache, the memory accesses and cache store operations are not required. Notice that in order for the whole table to be in cache, its size should be limited by:

$max \cdot HE \le 32\,\mathrm{KB} + 256\,\mathrm{KB} = 288\,\mathrm{KB}$
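Putting the two formulas and the cache-residency condition together (our sketch):

    #include <math.h>

    /* ht_lookup(N, HE, max, p): cycles and memory accesses per the formulas
     * above, both inflated by the collision probability p. When
     * max * HE <= 288 KB the whole table fits in L1+L2 and the L3/DRAM
     * accesses can be dropped. */
    void ht_lookup_cost(double N, double HE, double max, double p,
                        double *cycles, double *mem_accesses)
    {
        *cycles = ceil((4.0 * N + 106.0 + 5.0 * ceil(HE / 8.0)
                        + 5.0 * ceil(HE / 32.0)) * (1.0 + p));
        int fits = (max * HE) <= 288.0 * 1024.0;
        *mem_accesses = fits ? 0.0 : ceil(ceil(HE / 64.0) * (1.0 + p));
    }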
8. lpm_lookup(b, es)
There are several different algorithms for finding the longest matching rule. Here we consider the DIR-24-8 algorithm [8], which in most cases (when the entry matches up to 24 bits) is able to find the matching rule with only one memory access. This speed, however, comes at the cost of space, because of the redundant storage of rules; nevertheless, the very fast lookup this algorithm provides heavily outweighs this space constraint. With the DIR-24-8 algorithm, the longest prefix match requires the equivalent of an array_access(es, 16M) operation if $b \le 3$; otherwise, an additional memory access is required, corresponding to an array_access(es, 255).
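In code form (our sketch, reusing array_access_cost from the previous item):

    #include <math.h>

    void array_access_cost(double es, double *cycles, double *dram_accesses);

    /* lpm_lookup(b, es) with DIR-24-8: one access into the 16M-entry
     * first-stage table; prefixes longer than 24 bits (b > 3) require a
     * second access into a 255-entry second-stage table. */
    void lpm_lookup_cost(double b, double es, double *cycles, double *dram)
    {
        double c, d;
        array_access_cost(es, &c, &d);       /* array_access(es, 16M) */
        *cycles = c; *dram = d;
        if (b > 3.0) {
            array_access_cost(es, &c, &d);   /* array_access(es, 255) */
            *cycles += c; *dram += d;
        }
    }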
9. ct_insertion(N, HE, max, p)
This EO corresponds to a lookup in a hash table followed by either the insertion of a new entry or the update of the timestamp in an existing one. The two operations have approximately the same cost; the pseudo-code in Figure 5 shows the operations required to update the timestamp of the entry. As a result, the cache table insertion algorithm would require approximately:

$\lceil (4N + 129 + 7 \lceil HE / 8\,\mathrm{B} \rceil + 5 \lceil HE / 32\,\mathrm{B} \rceil) \cdot (1 + p) \rceil$ clock cycles

and

$2 \cdot \lceil \lceil HE / 64\,\mathrm{B} \rceil \cdot (1 + p) \rceil$ L3 or DRAM accesses
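And the corresponding cost function (our sketch):

    #include <math.h>

    /* ct_insertion(N, HE, max, p): a hash lookup plus the write-back of the
     * inserted or updated entry, which roughly doubles the memory traffic
     * of ht_lookup. */
    void ct_insertion_cost(double N, double HE, double p,
                           double *cycles, double *mem_accesses)
    {
        *cycles = ceil((4.0 * N + 129.0 + 7.0 * ceil(HE / 8.0)
                        + 5.0 * ceil(HE / 32.0)) * (1.0 + p));
        *mem_accesses = 2.0 * ceil(ceil(HE / 64.0) * (1.0 + p));
    }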
3. MODELING USE CASES
This section demonstrates the application of the modeling approach described in Section 2. EOs are used to describe the operation of simple network functions, such as L2 switches, and a more complex case, a Broadband Network Gateway (BNG).
Register $1-N: key components
Register $HL: hash length
Register $HP: hash array pointer
Register $HE: hash entry size
Register $Z: result

Pseudo code:
    # hash key calculation
    eor $tmp, $tmp
    for i in 1 ... N
        eor $tmp, $i
    # key is available in $tmp

    # calculate hash index from key
    udiv $tmp2, $tmp, $HL
    mls $tmp2, $tmp2, $HL, $tmp
    # index is available in $tmp2

    # index -> hash entry pointer
    mul $tmp, $tmp2, $HE
    add $tmp, $HP
    # entry pointer available in $tmp
    <prefetch entry to L1 memory>
    # pointer to L1 entry -> $tmp2

    # hash key check (entry vs. key)
    for i in 1 ... N
        ldr $Z, [$tmp2], #4
        cmp $i, $Z          # check keys
        bne collision
    # no jump means matching keys
    # pointer to data available in $Z

Figure 4. Hash table lookup pseudo-code.
The model is used to estimate the performance of each use case on the hardware platform presented in Section 2.2. The accuracy of the estimation is evaluated in Section 4 based on real measurements obtained through a range of experiments.
3.1. L2 Switch
First, we model an Ethernet switch with a static forwarding table. In this case, the output port is selected through a simple lookup in the table using the destination MAC address. Afterwards, we consider a more general case where the forwarding table is populated using the backward learning algorithm. Finally, we model an MPLS switch, which selects the output interface according to the MPLS label in the packet.
3.1.1. Basic Forwarding
For each packet, the switch selects the output interface where it must be forwarded; such interface is retrieved from a hash table using as key the destination MAC address extracted from the packet. More in detail, when a network interface receives a packet, it stores it in an I/O buffer. In order to access the Ethernet header, the CPU/NPU must first copy the packet into cache or main memory (possibly with the help of a Direct Memory Access module). Since the switch operates only on the Ethernet header, together with the identifiers of the ingress and egress ports through which the packet is received and forwarded, the corresponding 30 bytes (18 + 6 + 6 bytes)¹ are copied into the fastest cache, while the rest of the packet (up to 1500 bytes) can be kept in L3 cache or main memory.
¹In this paper we consider that interfaces are identified by their Ethernet address. Different implementations can use a different identifier, which leads to a minor variation in the model.
Register $HE: updated hash entry
Register $HT: pointer to previous L1 entry
Register $HS: hash entry size

Pseudo code:
    for i in 1 ... $HS/8
        mov [$HT], $HE
        add $HT, #8
    # update timestamp
    rdtsc
    mov [$HT], EDX
    add $HT, #2
    mov [$HT], EAX
    <store updated entry>

Figure 5. Entry update pseudo-code for cache table insertion.
(a) Basic forwarding switch model:
    I/O_mem(30,ps); parse(6); ht_lookup(1,12,2M,0); deparse(6); mem_I/O(30,ps)

(b) Learning switch model:
    I/O_mem(30,ps); parse(8); ht_lookup(1,14,2M,0); parse(12); ct_insertion(2,14,2M,0); deparse(6); mem_I/O(30,ps)

(c) MPLS switch model:
    I/O_mem(34,ps-4); parse(3); ht_lookup(1,12,1M,0); parse(1); decrease(1); deparse(10); mem_I/O(34,ps-4)

Figure 6. Models of different L2 switches.
To ensure generality, we consider that an incoming packet cannot be copied directly from one I/O buffer to another; instead, it must first be copied into (cache) memory.
memory
.
The
swi
tch
must
then
read
the
destination
MA
C
address
(6
bytes)
prior
to
using
it
to
access
the
hash
table
to
get
the
appropriate
output
interf
ace.
The
hash
table
has
one
k
e
y
(the
destination
MA
C)
and
consists
of
12
byte
entries
composed
of
the
k
e
y
and
the
output
interf
ace
MA
C
address.
A
common
number
o
f
entries
in
a
typical
switch
implementation
is
2
M
,
which
gi
v
es
an
idea,
when
mapping
the
model
to
a
specific
hardw
are,
of
whether
the
hash
table
can
be
fully
stored
in
cache
under
generic
traf
fic
conditions.
The
ne
w
output
port
must
be
stored
in
the
data
structure
in
L3
cache
or
main
memory
(which,
as
pre
viously
e
xplained,
has
the
same
cost
as
parsing
6
bytes),
before
mo
ving
the
pack
et
to
the
b
uf
fer
of
the
selected
output
I/O
de
vice.
The
resulting
model
e
xpressing
the
abo
v
e
steps
in
terms
of
EOs
is
summarized
in
Figure
6a,
where
ps
is
the
ethernet
payload
size.
Such model assumes that the collision probability of the hash is negligible (i.e., the hash table is sufficiently sparse). Applying to the Ethernet switch model the mapping of EOs presented in Section 2.2, we can estimate that forwarding a packet, regardless of the packet size (thanks to DDIO), requires:

213 clock cycles + 1 DRAM access
As a consequence, a single core of an Intel Xeon E5-2690 v2 operating at 3.6 GHz can process 17.31 Mpps, while the DDR3 memory can support 111.08 Mpps. The memory throughput is estimated considering that each packet requires a 12-byte memory access to read the hash table entry, which has a latency of:

$\dfrac{(\text{CAS latency} \cdot 2) + 3}{\text{data rate}}$

If we consider minimum size (64-byte) packets (i.e., an unrealistic, worst case scenario), a single core can process 11.36 Gbps.
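To illustrate how the per-packet cost translates into these bounds, the following sketch redoes the arithmetic; the DDR3 timing values and the wire-level framing overhead (preamble plus inter-frame gap) added to 64-byte packets are our assumptions for illustration, so small deviations from the figures above are expected:

    #include <stdio.h>

    /* Basic forwarding: 213 cycles + 1 DRAM access per packet. */
    int main(void) {
        double core_pps = 3.6e9 / 213.0;         /* one core at 3.6 GHz */
        double cas = 13.0, data_rate = 1866e6;   /* assumed DDR3 timings */
        double access_s = (cas * 2.0 + 3.0) / data_rate;
        double mem_pps = 1.0 / access_s;         /* 1 access per packet */
        double pps = core_pps < mem_pps ? core_pps : mem_pps;
        /* 64 B frame + 20 B preamble/inter-frame gap on the wire (assumed) */
        printf("core %.2f Mpps, memory %.2f Mpps, 64 B rate %.2f Gbps\n",
               core_pps / 1e6, mem_pps / 1e6, pps * 84.0 * 8.0 / 1e9);
        return 0;
    }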
3.1.2. Learning Switch
We here consider an Ethernet switch with VLAN support, in which case the key used for lookups in the forwarding table consists of the destination MAC address and the VLAN ID (2 bytes). Hence, 8 bytes must be parsed from the header (destination address and VLAN ID) of each packet in order to obtain the lookup key, and entries in the forwarding table are 14 bytes long (destination address and VLAN ID as key and output interface as value). Since the switch is applying backward learning, for each packet the source MAC address and source port are used to update the forwarding table. The switch must therefore also parse the source MAC address, read from memory the source port (added to packets stored in memory), and either add an entry in the forwarding table or just update the timestamp of an existing one. The resulting model is shown in Figure 6b.
6b.
When
mapped
to
our
hardw
are
architecture,
forw
arding
a
pack
et
requires
an
estimated:
352
clock
c
ycles
+
2
DRAM
accesses
hence
the
maximum
throughput
reachable
by
a
single
core
is
reduced
to
10.47
Mpps
,
while
the
DDR3
memory
can
support
55.54
Mpps
.
This
translates
to
a
maximum
throughput
of
6.87
Gbps
for
64
byte
pack
ets.
3.1.3. MPLS Switch
An MPLS switch is a simple, yet currently widely deployed, network function. For each packet, the switch swaps a single MPLS label and forwards the packet on an Ethernet network towards the next hop. The new label and the next hop are retrieved from a hash table whose key is the label extracted from the packet.
Since the MPLS switch modifies the label in the MPLS header, in addition to associating to it the output port, the MPLS header (4 bytes) is also preferably copied into the L1/L2 cache, while the rest of the packet can be kept in L3 cache or main memory. The switch must then extract the MPLS label (20 bits, i.e., 3 bytes) prior to using it to access the hash table to get the new label and the next hop.
The hash table has one key (the label) and consists of 12-byte entries:
- Input label (key) - 3 bytes
- Output label - 3 bytes
- Next hop Ethernet address - 6 bytes
The maximum number of entries in the hash table is, in the worst case, $1\,\mathrm{M} = 2^{20}$, and we consider the collision probability to be negligible.
In the most general case each entry, referred to in the MPLS standard documents as Next Hop Label Forwarding Entry (NHLFE), could hold more than one label in case of multiple label operations. For the sake of simplicity, we model only a single label operation: the swapping of a label, which is the most frequent operation in common MPLS switch deployment scenarios.
The switch must also decrease the Time-To-Live (TTL) contained in the MPLS header, which requires parsing the corresponding field, followed by a decrease operation on the 1-byte field. The new (outgoing) MPLS header and output port must be stored in main memory (encapsulation of 10 bytes) and moved to the buffer of the output I/O device. The resulting model is summarized in Figure 6c.
As we map this model to the considered hardware platform, we can conclude that the estimated forwarding cost for an MPLS switch is:

224 clock cycles + 1 DRAM access

corresponding to a maximum per-core throughput of 16.45 Mpps, while the memory can provide the same throughput as for the basic forwarding switch. The maximum bitrate considering 64-byte packets is 10.8 Gbps.
3.2. Broadband Network Gateway
A Broadband Network Gateway (BNG) is the first IP point in the network for DSL and cable modem subscribers, connecting them to the broadband IP network. The primary task of a BNG is to aggregate traffic from various subscriber sessions from an access network and route it to the core network of the service provider. Moreover, a BNG carries out additional vital tasks for Network Service Providers (NSPs), such as managing subscribers' sessions, performing accounting and enforcing operator policies. Hence, a BNG represents a more complex use case for the application of the proposed modeling approach.
Figure 7. Packet formats. (Access network: Ethernet QinQ frame with destination and source addresses, S-VLAN and C-VLAN tags carrying 12-bit VLAN IDs, EtherType, data and FCS. Core network: Ethernet, MPLS, outer IPv4, GRE with a 32-bit key, inner IPv4, data and FCS.)
In our modeling effort we refer to the software implementation of a BNG included in the Intel Data Plane Performance Demonstrators (DPPD) [9]. This is an open source, highly optimized software BNG specifically intended for performance analysis. In this implementation, the traffic in the access network between the Customer Premise Equipment (CPE) and the BNG is encapsulated using Ethernet QinQ frames, while the traffic between the BNG and the Carrier-grade NAT (CGNAT) in the core MPLS network is encapsulated using GRE (Generic Routing Encapsulation).
In this scenario, packets received from the access network and packets received from the core network are processed differently by the BNG; thus, 2 separate models are required for the 2 directions. The two different formats of packets forwarded in the access and in the core network are illustrated in Figure 7.
Packets from CPEs are matched against 2 different tables: (i) a hash table that, given the QinQ tag, provides the corresponding GRE key (up to 16M entries of 7 bytes), and (ii) an LPM routing table that, given the destination IP address, returns the output port, the IP address of the remote GRE tunnel endpoint, the next hop MAC address and the MPLS label (this table can contain up to 8K routes).
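For concreteness, the two upstream tables could be laid out as follows (an illustrative sketch; the field widths are inferred from the sizes quoted above and structure padding is ignored):

    #include <stdint.h>

    /* Table (i): QinQ tag -> GRE key, up to 16M entries of 7 bytes. */
    struct qinq_to_gre {
        uint8_t  qinq_tag[3];    /* key: two 12-bit VLAN IDs */
        uint32_t gre_key;        /* value: 32-bit GRE key */
    };

    /* Table (ii): LPM routing table, up to 8K routes. */
    struct route {
        uint16_t out_port;
        uint32_t tunnel_ip;      /* remote GRE tunnel endpoint */
        uint8_t  next_hop_mac[6];
        uint32_t mpls_label;     /* 20-bit label */
    };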
Packets from the core network are instead matched against a single hash table that, given the GRE key and the inner destination IP address, provides the QinQ tag, the destination MAC address and the output port. The BNG supports up to 64K CPEs, thus this table can contain up to 64K entries of 23 bytes. The QinQ tag and the GRE key are used to track the subscriber (e.g., for accounting), while the tunnel endpoint (i.e., the CGNAT) is selected according to the destination of the packet. The resulting models for both directions are summarized in Figure 8.
When processing packets from the access network, the MAC with QinQ and the IP headers are loaded preferably into L1/L2 cache, so that the QinQ header can be parsed. The extracted QinQ tag is used for the lookup in table (i), while the destination IP address is parsed and used in the LPM lookup in table (ii). These 2 lookups provide the output GRE key, destination IP and MAC addresses, MPLS tag and output port that are used in the encapsulation of the output packet. The TTL (Time To Live) of the internal IP packet is decremented, and thus the checksum must be incrementally updated starting from the current value. The new packet format also requires the computation of the GRE checksum as well as of the external IP packet Total Length field and header checksum. Moreover, backward learning is used to populate the table used to process packets from the core network. Hence, an additional ct_insertion operation is required after parsing the source port, MAC and IP addresses. The final packet is formed with the encapsulation of 70 bytes, corresponding to the new Ethernet, MPLS, external IP, GRE and inner IP headers, and is then sent to the output I/O buffer.
Packets from the core network require a parse operation for the GRE key and the inner destination IP before using them for a hash table lookup to get the QinQ tag, the destination MAC address and the output port. In this case the TTL of the inner IP packet is also decremented and the checksum incrementally updated. The new outgoing packet must then be stored in memory or cache (encapsulation of 42 bytes) and moved to the buffer of the output I/O device.
Mapping these models to the considered hardware platform, we can conclude that the estimated cost to process a 64-byte packet from the access network is:

717 clock cycles + 6 DRAM accesses

corresponding to a maximum per-core throughput of 5.14 Mpps (3.37 Gbps), while the DDR3 memory can support 12.11 Mpps (7.95 Gbps).
The estimated cost to process a 64-byte packet from the core network is:

274 clock cycles + 1 DRAM access