International Journal of Electrical and Computer Engineering (IJECE)
Vol. 10, No. 5, October 2020, pp. 5479-5486
ISSN: 2088-8708, DOI: 10.11591/ijece.v10i5.pp5479-5486
Benchmarking open source deep learning frameworks

Ghadeer Al-Bdour, Raffi Al-Qurran, Mahmoud Al-Ayyoub, Ali Shatnawi
Jordan University of Science and Technology, Jordan
Article Info

Article history:
Received Jul 23, 2019
Revised Apr 25, 2020
Accepted May 6, 2020

Keywords:
CNTK
Performance comparison
TensorFlow
Theano
ABSTRACT

Deep Learning (DL) is one of the hottest fields. To foster the growth of DL, several open source frameworks appeared providing implementations of the most common DL algorithms. These frameworks vary in the algorithms they support and in the quality of their implementations. The purpose of this work is to provide a qualitative and quantitative comparison among three such frameworks: TensorFlow, Theano and CNTK. To ensure that our study is as comprehensive as possible, we consider multiple benchmark datasets from different fields (image processing, NLP, etc.) and measure the performance of the frameworks' implementations of different DL algorithms. For most of our experiments, we find that CNTK's implementations are superior to the other ones under consideration.

Copyright © 2020 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Mahmoud Al-Ayyoub,
Jordan University of Science and Technology,
Irbid, Jordan.
Email: maalshbool@just.edu.jo
1. INTRODUCTION

Deep learning (DL) is the hottest trend in machine learning (ML). Although the theoretical concepts behind DL are not new, it has enjoyed a surge of interest over the past decade due to many factors. One example is that DL approaches have significantly outperformed state-of-the-art (SOTA) approaches in many tasks across different fields such as image processing, computer vision, speech processing, natural language processing (NLP), etc. Moreover, the scientific community (from both academia and industry) has quickly and massively adopted DL. Open source implementations of successful DL algorithms quickly appeared on code sharing websites, and were subsequently used by many researchers in different fields.

Several DL frameworks exist, such as TensorFlow, Theano, CNTK, Caffe and PyTorch, each with different features and characteristics. Furthermore, each framework utilizes different techniques to optimize its code. Although the same algorithm may be implemented in different frameworks, the performance of the different implementations can vary greatly. A researcher/practitioner looking to use such an algorithm in his/her work would face a difficult choice, since the number of different implementations is high and the effort invested by the research community in scientifically comparing these implementations is limited.
In this work, we aim at providing qualitative and quantitative comparisons between three popular open source DL frameworks: TensorFlow, Theano and CNTK. These frameworks support multi-core CPUs as well as multiple GPUs. All of them import cuDNN, which is a DL library from NVIDIA that supports highly tuned implementations of standard routines such as forward and backward convolution, normalization, pooling and activation layers. We compare these frameworks by training different neural network (NN) architectures on five different standard benchmark datasets for various tasks in image processing, computer vision and NLP.

Despite their importance, comparative studies like ours that focus on performance issues are rare. Limited efforts have been dedicated to conducting comparative studies between SOTA DL frameworks running on different hardware platforms (CPU and GPU) to highlight the advantages and limitations of each framework for different deep NN architectures. These efforts include papers [1-9] as well as online blogs (https://github.com/soumith/convnet-benchmarks). Due to space constraints, we do not discuss the details of these works here. Interested readers are referred to earlier versions of this work [10, 11] for such details.

Journal homepage: http://ijece.iaescore.com/index.php/IJECE
However, we do note that, in previous studies, the comparison focused only on processing time. None of those comparative studies dealt with CPU and GPU utilization or memory consumption. This work covers these metrics to find which of the considered frameworks achieves the best performance. Finally and most importantly, the comparisons involve more datasets from more fields compared with previous studies.

The rest of this paper is organized as follows. Section 2 discusses the frameworks, the way they were used to train the datasets and a brief comparison between them. The methodology we follow is discussed in Section 3. Experimental results and the discussion are detailed in Section 4. The work is concluded with final thoughts presented in Section 5.
2. DEEP LEARNING FRAMEWORKS

The frameworks considered in this comparative study are CNTK, TensorFlow and Theano. Moreover, we use Keras on top of these frameworks as discussed later. All of these frameworks provide flexible APIs and configuration options for performance optimization. Software versions of the frameworks are shown in Table 1 and their properties are shown in Table 2.
Table 1. Frameworks used for this comparative study

Framework    | Major Version | Github Commit ID
CNTK         | 2.0           | 7436a00
TensorFlow   | 1.2.0         | 49961e5
Theano       | 0.10.0.dev1   | 8a1af5b
Keras        | 2.0.5         | 78f26df
Table 2. Properties of the considered frameworks

Property           | CNTK | TensorFlow | Theano                       | Keras
Core               | C++  | C++        | Python                       | Python
CPU                | X    | X          | X                            | X
Multi-Threaded CPU | X    | Eigen      | Blas, conv2D, Limited OpenMP | X
GPU                | X    | X          | X                            | X
Multi-GPU          | X    | X          | X (experimental version)     | X
NVIDIA cuDNN       | X    | X          | X                            | X
2.1. Microsoft cognitive toolkit (CNTK)

CNTK is an open source DL framework developed by Microsoft Research [12] for training and testing many types of NN across multiple GPUs or servers. CNTK supports different DL architectures like Feedforward, Convolutional, Recurrent, Long Short-Term Memory (LSTM) and Sequence-to-Sequence NN. In CNTK, a Computational Network learns any function by converting it to a directed graph, where leaf nodes consist of input values or learning parameters while other nodes represent matrix operations applied to their children. In this case, CNTK has an advantage as it can automatically derive the gradients for all the computations which are required to learn the parameters.
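This graph-based gradient derivation (reverse-mode automatic differentiation) can be sketched in a few dozen lines of plain Python. The Node class below is our own simplified illustration of the idea, not CNTK's actual implementation: leaves hold values, internal nodes apply an operation to their children, and gradients flow backwards through the graph.

```python
class Node:
    """Minimal computational-graph node: holds a value, accumulates a
    gradient, and knows how to push its gradient to its children."""
    def __init__(self, value, children=()):
        self.value = value
        self.grad = 0.0
        self._children = children
        self._backward = lambda g: None  # leaves have no children to update

    def __add__(self, other):
        out = Node(self.value + other.value, (self, other))
        def backward(g):          # d(a+b)/da = 1, d(a+b)/db = 1
            self.grad += g
            other.grad += g
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Node(self.value * other.value, (self, other))
        def backward(g):          # d(a*b)/da = b, d(a*b)/db = a
            self.grad += g * other.value
            other.grad += g * self.value
        out._backward = backward
        return out

    def backprop(self):
        # Visit nodes in reverse topological order, starting from this output.
        order, seen = [], set()
        def visit(n):
            if id(n) not in seen:
                seen.add(id(n))
                for c in n._children:
                    visit(c)
                order.append(n)
        visit(self)
        self.grad = 1.0
        for n in reversed(order):
            n._backward(n.grad)

x, w = Node(3.0), Node(2.0)
y = x * w + x                       # y = x*w + x
y.backprop()
print(y.value, x.grad, w.grad)      # 9.0 3.0 3.0  (dy/dx = w+1, dy/dw = x)
```

Once every operation records how to distribute gradients to its inputs, no per-model gradient code is needed; this is what lets a framework learn the parameters of any network expressed as such a graph.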
In CNTK, users specify their networks using a configuration file that contains information about the network type, where to find the input data and the way to optimize the parameters [13]. The CNTK interface supports APIs for several languages such as Python, C++ and C# across both GPU (CUDA) and CPU platforms. According to its developers (https://docs.microsoft.com/en-us/cognitive-toolkit/cntk-evaluation-overview), CNTK was written in C++ in an efficient way: it removes duplicated computations in the forward and backward passes, uses minimal memory and reduces memory reallocations through reuse.
2.2. Theano

Theano is an open source Python library developed at the MILA lab at the University of Montreal as a compiler for mathematical expressions that lets users and developers optimize and evaluate their expressions using NumPy's syntax (a Python library that supports large, multi-dimensional arrays) [3, 14]. Theano performs computations automatically by optimizing the selection of computations, translating them into other languages such as C++ or CUDA (for GPU) and then compiling them into Python
modules that run efficiently on CPUs or GPUs. Theano's development started in 2008 and it is more popular as a research and ecosystem platform than many DL libraries. Several software packages have been developed to build on top of Theano, with a higher-level user interface which aims to make Theano easier to use for expressing and training different deep learning architectures, such as Pylearn2, Lasagne and Keras.
2.3. TensorFlow

TensorFlow is an open source framework developed by the Google Brain Team [15]. It uses a single data flow graph, expressing all numerical computations, to achieve excellent performance. TensorFlow constructs large computation graphs where each node represents a mathematical operation, while the edges represent the communication between nodes. This data flow graph executes the communication between sub-computations explicitly, which makes it possible to execute independent computations in parallel or to use multiple devices to execute partitioned computations [15]. Programmers of TensorFlow define large computation graphs from basic operators, then distribute the execution of these graphs across a heterogeneous distributed system (computation can be deployed to one or more CPUs or GPUs on different hardware platforms such as desktops, servers, or even mobile devices). The flexible architecture of TensorFlow allows developers and users to experiment with and train a wide variety of NN models; it is used for deploying ML systems into production in different fields including speech recognition, NLP, computer vision, robotics and computational drug discovery. TensorFlow offers APIs in several languages such as Python, C++ and Java for constructing and executing a graph. The Python API is the most complete and the easiest to use (https://www.tensorflow.org/).
2.4. Keras

Keras is an open source DL library developed in Python. It runs on top of the CNTK, Theano or TensorFlow frameworks. Keras was founded by Google engineer Chollet in 2015 as a part of the research project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System). Keras is designed in a way that allows fast expression of DNNs and easy and fast prototyping (modularity and extensibility) [16].
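As a concrete illustration of this layering, Keras 2.x selects its computational backend from the KERAS_BACKEND environment variable (or from the "backend" field of ~/.keras/keras.json), so the same model code can run on any of the three frameworks compared here. A minimal sketch (the import is commented out since it requires a backend to be installed):

```python
import os

# Keras reads this variable at import time; valid values include
# "tensorflow" (the default), "theano" and "cntk".
os.environ["KERAS_BACKEND"] = "cntk"

# import keras   # would now report the selected backend, e.g. "Using CNTK backend"
# ...the model definition and training code stay unchanged...
```

This backend-swapping mechanism is what allows our experiments to train identical Keras models on all three frameworks.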
3. METHODOLOGY

The goal of this experimental study is to compare the aforementioned frameworks (Theano, TensorFlow and CNTK) by using them to train Convolutional NN (CNN) and Recurrent NN (RNN) models on standard benchmark datasets of classical problems in image processing (MNIST, CIFAR-10 and Self-Driving Car) and NLP (Penn TreeBank and IMDB). Specifically, we aim at comparing the resources consumed by each framework to reach a certain accuracy level for each problem. Thus, we experiment with different epoch counts in order to make sure the accuracies of all frameworks are close to each other. Each framework's performance is evaluated using running time, memory consumption and CPU and GPU utilization.

We use a laptop that has an Intel Core i7-6700HQ CPU @ 2.60GHz (4 cores) with 16 GB RAM, a 64-bit operating system (Windows 10), and an NVIDIA GEFORCE GTX 960M graphics card with PCI Express 3.0 bus support, equipped with 4 GB GDDR5 memory and 640 CUDA cores.
3.1. Benchmark datasets

In this subsection, we discuss the datasets used in our experiments.
3.1.1. MNIST

The MNIST (Mixed National Institute of Standards and Technology) dataset for handwritten digits is widely used in ML [17]. It has 60,000 training images and 10,000 testing images. Each image is 28×28 pixels, which is flattened into a 784-value vector. The label of each image is a number between 0 and 9 representing the digit appearing in the image.
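The flattening step can be written in one line of plain Python: a 28×28 grid of pixel values becomes a single 784-value input vector in row-major order.

```python
image = [[0.0] * 28 for _ in range(28)]      # a 28x28 MNIST image (all-zero placeholder)
flat = [px for row in image for px in row]   # row-major flattening into the input vector
print(len(flat))  # 784
```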
3.1.2. CIFAR-10

The CIFAR-10 dataset is one of the datasets collected by Krizhevsky et al. [18, 19]. It consists of 60,000 32×32 color images evenly distributed over ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. There are 50,000 training images and 10,000 test images. The classes are completely mutually exclusive, i.e., there is no overlap between them. For instance, the "Automobile" class includes sedans, SUVs, etc. On the other hand, the "Truck" class includes only big trucks. To avoid overlap, neither of these two classes includes pickup trucks.
3.1.3. Penn TreeBank

In 1993, Marcus et al. [20] wrote a paper on constructing a large annotated corpus of English called the Penn TreeBank (PTB). They reviewed their experience with constructing one large annotated corpus that consists of over 4.5 million words of American English. It was annotated for part-of-speech (POS) tag information. Moreover, half of the corpus was annotated for skeletal syntactic structure. The dataset is large and diverse. It includes the Brown Corpus (retagged) and the Wall Street Journal Corpus, as well as Department of Energy abstracts, Dow Jones Newswire stories, Department of Agriculture bulletins, Library of America texts, MUC-3 messages, IBM Manual sentences, WBUR radio transcripts and ATIS sentences.
3.1.4. IMDB

The IMDB dataset [21] is another dataset to which we apply a CNN. It is drawn from an online database of information regarding films, TV programs and video games. It consists of 25,000 reviews labeled by the sentiment (positive/negative) of each review. The reviews have been preprocessed and encoded as integers in the form of sequences of word indexes. Words are indexed by overall frequency in the dataset, so that the index i encodes the i-th most frequent word in the data, in order to allow quick filtering operations.
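The frequency-rank encoding can be sketched in a few lines of Python. The helper below is ours, shown only to illustrate the scheme (the actual indexes ship with the dataset):

```python
from collections import Counter

def build_word_index(reviews):
    """Map each word to its frequency rank (1 = most frequent),
    mirroring the IMDB dataset's encoding scheme."""
    counts = Counter(w for review in reviews for w in review.split())
    return {w: rank + 1
            for rank, (w, _) in enumerate(counts.most_common())}

reviews = ["great great movie", "bad movie"]
index = build_word_index(reviews)
encoded = [index[w] for w in reviews[1].split()]   # e.g. "bad movie" as ranks
print(encoded)
```

Because low indexes mean frequent words, keeping only indexes below some cutoff filters the vocabulary to its most common words in a single comparison, which is the quick filtering the encoding is designed for.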
3.1.5. Self-Driving Car

This dataset uses Udacity's Self-Driving Car simulator as a testbed for training an autonomous car. This line of work started in the 1980s with Carnegie Mellon University's Navlab and ALV projects [22]. The training phase starts with activating the simulator, which is an executable application. A user initiates the data-collection service, and the collected data are saved locally on the computer as images so the framework can train on them. The training is done by distinguishing the image edges, which are captured by the three cameras mounted on the front of the car in the simulator. After the training phase is done, the testing phase begins by taking the model file generated whenever the performance in an epoch is better than the previous best. Finally, the last generated file is executed in order to make the car drive autonomously and to observe the testing phase results.
3.2. Networks architecture

A CNN is used for the MNIST, CIFAR-10, IMDB and Self-Driving Car datasets, with a different network architecture for each dataset. The architecture of each CNN is shown in [23]. For the MNIST and CIFAR-10 datasets, two convolutional layers with the ReLU activation function are used after the input layer. The activation function is used to reduce the training time and to prevent vanishing gradients. After each CNN layer, a max-pooling layer is added in order to down-sample the input and to reduce overfitting. In the max-pooling layer, the stride with which the filter is slid must be specified. When the stride is x, the filter (window) is moved x pixels at a time. This produces spatially smaller output volumes. After each max-pooling layer, the dropout method is used in order to reduce overfitting by forcing the model to learn many independent representations of the same data through randomly disabling neurons in the learning phase.
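The down-sampling behavior of max-pooling and stride can be illustrated with a small pure-Python sketch (a 2×2 window with stride 2 halves each spatial dimension):

```python
def max_pool_2d(image, size=2, stride=2):
    """Down-sample a 2-D grid of pixel values: slide a size x size
    window by `stride` pixels and keep the maximum of each window."""
    rows = range(0, len(image) - size + 1, stride)
    cols = range(0, len(image[0]) - size + 1, stride)
    return [[max(image[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in cols]
            for r in rows]

img = [[1, 3, 2, 4],
       [5, 6, 7, 8],
       [9, 2, 1, 0],
       [3, 4, 5, 6]]
print(max_pool_2d(img))  # [[6, 8], [9, 6]]  -- a 4x4 input becomes 2x2
```

A larger stride moves the window further per step and therefore produces a spatially smaller output volume, which is exactly the down-sampling effect described above.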
For the Self-Driving Car dataset, the CNN has the same components as the ones used with the MNIST and CIFAR-10 datasets, but with a deeper model that consists of five convolutional layers with the Exponential Linear Unit (ELU) activation function. The convolutional layers are used for feature engineering. The fully connected layer is used for predicting the steering angle (the final output). The dropout avoids overfitting and, finally, the ELU activation function is used to solve the vanishing gradient problem.
In the IMDB dataset, the movie reviews are composed of sequences of words of different lengths. These words are encoded by mapping movie reviews to sequences of word embeddings, where words are mapped to vectors of real numbers. The network architecture consists of an embedding layer followed by a 1D convolutional layer, which is suited to temporal data, followed by a global max-pooling operation. Because the sequences have different lengths, they are padded to the same size as the largest sequence.
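The padding step can be sketched as follows; this mirrors the "pre" (left) padding default of Keras's pad_sequences preprocessing utility, though the function below is our own simplified version:

```python
def pad_sequences(seqs, value=0):
    """Left-pad integer sequences with `value` so that every sequence
    has the length of the longest one."""
    maxlen = max(len(s) for s in seqs)
    return [[value] * (maxlen - len(s)) + list(s) for s in seqs]

print(pad_sequences([[1, 2, 3], [4], [5, 6]]))
# [[1, 2, 3], [0, 0, 4], [0, 5, 6]]
```

After padding, every review is a fixed-length vector of word indexes, which is what the embedding layer expects as input.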
The other NN type we consider is RNN with LSTM. One of the most popular uses of LSTM is for text analysis tasks such as the ones associated with the Penn TreeBank (PTB) dataset. We adopted word-level prediction experiments on PTB, which consists of 929k training words, 73k validation words and 82k test words, with a 10k-word vocabulary. We trained models of two sizes (small and medium) using the same architecture presented in [24]. To evaluate language models of the PTB implementation, a special metric called perplexity is used, where better prediction accuracy is achieved when the perplexity value is as low as possible. Perplexity is the inverse of the probability definition, which means that minimizing the perplexity value is the same as maximizing probability.
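Concretely, perplexity is the exponential of the average negative log-probability the model assigns to each word, so lower perplexity means the model placed higher probability on the observed text. A minimal sketch:

```python
import math

def perplexity(word_probs):
    """Perplexity of a model over a word sequence: exp of the mean
    negative log-probability assigned to each observed word."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model that assigns probability 0.25 to every observed word has
# perplexity 4 -- as if it were choosing uniformly among 4 words:
print(perplexity([0.25] * 10))  # 4.0
```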
The goal of applying the PTB dataset is to fit a probabilistic model which assigns probabilities to sentences. This is done by predicting the next words in a text given a history of previously seen words. LSTM cells represent the core of the model, processing one word at a time and computing probabilities of the possible values for the next word in the sentence. The memory state of the network is initialized with a vector of zeros and updated after reading each word.
In the small model, two hidden layers (with 200 LSTM units per layer) are used with the Tanh activation function. The weights are initialized to 0.1. We trained it for four epochs with a learning rate of one (the number of epochs trained with the initial learning rate) and then decreased the learning rate by a factor of two after each epoch (the decay of the learning rate for each epoch after the first four), for a total of 13 training epochs. The size of each batch is 20, and the network is unrolled for 20 steps. This model's architecture is shown in [23].
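The learning-rate schedule described above can be written as a small function (the name and signature below are ours, for illustration): the rate stays at its initial value for the first four epochs, then halves after every subsequent epoch.

```python
def learning_rate(epoch, base_lr=1.0, warm_epochs=4, decay=0.5):
    """PTB small-model schedule: keep `base_lr` for the first
    `warm_epochs` epochs, then multiply by `decay` after each epoch."""
    if epoch < warm_epochs:
        return base_lr
    return base_lr * decay ** (epoch - warm_epochs + 1)

print([learning_rate(e) for e in range(6)])
# [1.0, 1.0, 1.0, 1.0, 0.5, 0.25]
```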
4. RESULTS AND DISCUSSION

In this section, we discuss the results of our experiments. For each model on each dataset, Table 3 shows the CPU and GPU processing times while Tables 4-8 show the utilization levels of the CPU, GPU and their memories.

For the image classification datasets (MNIST and CIFAR-10), one can observe the superiority of CNTK over TensorFlow and Theano in terms of GPU and CPU multithreading; however, on CIFAR-10 using 8, 16 and 32 CPU threads, TensorFlow was faster than CNTK. On the other hand, Theano proved to be more time consuming than the other frameworks.
Transitioning to the sentiment analysis dataset (IMDB), CPU multithreading was not performed because CNTK is written in Python, in which multithreading is not supported. Without CPU multithreading (the CPU uses the default number of existing physical cores, i.e., one thread per core), TensorFlow is superior in both the CPU and GPU environments. The results for the text analysis dataset (Penn TreeBank) show the superiority of TensorFlow over CNTK and Theano, for the CPU with 8 threads as well as for the GPU. Moving forward to the video analysis dataset (Self-Driving Car), TensorFlow is again superior in both the CPU and GPU environments, while CNTK proved to be more time consuming than the other two frameworks.

The processing times clearly show the advantage of GPU over CPU for training CNN and RNN. The advantage of a fast GPU becomes more significant when training complex models with larger data, as in the Self-Driving Car dataset.
From the CPU results, the best performance occurred when the number of threads equals the number of hardware threads the CPU supports, with each thread possessing a single core. In our work, we use a laptop whose Intel Core i7-6700HQ has 4 physical cores supporting 8 hardware threads. Thus, for each dataset, the best performance in terms of processing time was achieved while using 8 threads.
The metrics of each framework were measured to explain the failure of any of the selected frameworks. We notice the poor performance of Theano on most datasets compared to CNTK and TensorFlow. This could be attributed to its low CPU utilization compared to the other frameworks. CNTK outperformed both TensorFlow and Theano while training on the MNIST and CIFAR-10 datasets. This achievement is highly likely due to the use of the BrainScript format (https://docs.microsoft.com/en-us/cognitive-toolkit/BrainScript-Network-Builder), a custom network description language that makes CNTK more flexible for NN customization. On the other hand, TensorFlow uses Eigen (http://eigen.tuxfamily.org/index.php?title=Main_Page), a C++ template (BLAS) library for linear algebra including matrices, vectors, numerical solvers and related algorithms. It helps TensorFlow perform better than CNTK and Theano on RNN.
In addition to processing time, we also report the utilization levels of the CPU, GPU and their memories for each model on each framework under consideration. These results are shown in Tables 4-8. The utilization levels for both CPU and GPU are high for all models. The only surprising numbers are the CPU utilization figures for Theano, which were very low. The tables also show that the utilization levels are rather small for both types of memory. This applies to all models for all frameworks. However, the tables also show that, in most cases, CNTK had the lowest memory utilization while TensorFlow had the highest. Surprisingly, the case is almost reversed for the video analysis dataset (the Self-Driving Car dataset), where CNTK had the highest utilization and Theano had the lowest. Another unexpected finding of these experiments is that the models of the IMDB dataset generally needed the largest portions of memory.
Comparing our work to previous works [1, 2], we reveal the following findings. Bahrampour et al. [1] based their comparative study on three main aspects: speed, hardware utilization and extensibility. They used three NN types (CNN, AutoEncoder (AE) and LSTM) to train the MNIST, ImageNet [25] and IMDB datasets on the Caffe, Neon, TensorFlow, Theano and Torch frameworks. They came up with the following results. Training on CPU, Torch performed the best followed by Theano, while Neon had the worst performance. Moreover, Theano and Torch were the best in terms of extensibility, TensorFlow and Theano were very flexible and Caffe was the easiest for evaluating performance. Regarding training datasets on GPU, for larger convolutional and fully connected networks (FCN), Torch was the best followed by Neon. For smaller networks, Theano was the best. For LSTM, Theano's results were the best, while TensorFlow's performance was not competitive compared with the other studied frameworks.
On the other hand, Shi et al. [2] based their comparative study on two metrics: processing time and convergence rate. The NNs used were fully connected NN, CNN and RNN, trained on the ImageNet, MNIST and CIFAR-10 datasets using the Caffe, CNTK, MXNet, TensorFlow and Torch frameworks. The results of TensorFlow were the best using CPU. Using a single GPU, on FCN, Caffe, CNTK and Torch performed better than MXNet and TensorFlow. As for small CNN, Caffe and CNTK achieved a good performance, while for RNN (LSTM), CNTK was the fastest (5-10x faster than the other frameworks). Using multi-GPU implementations, all frameworks had higher throughput and accelerated convergence speed compared with single-GPU implementations.
Table 3. Processing time for each dataset (measured in seconds unless indicated in hours); in the Environment column, CPU (x) denotes CPU with x threads

MNIST
Env       | CNTK  | TensorFlow | Theano
CPU (1)   | 847   | 5130       | 3560
CPU (2)   | 630   | 3180       | 2500
CPU (4)   | 574   | 2070       | 2260
CPU (8)   | 560   | 1740       | 2060
CPU (16)  | 567   | 1920       | 2050
CPU (32)  | 588   | 2010       | 2050
GPU       | 66.67 | 328.93     | 377.86

CIFAR-10
Env       | CNTK  | TensorFlow | Theano
CPU (1)   | 20196 | 25905      | 26700
CPU (2)   | 14520 | 16610      | 18700
CPU (4)   | 13662 | 11550      | 17250
CPU (8)   | 11484 | 9955       | 15800
CPU (16)  | 11550 | 10340      | 15850
CPU (32)  | 11649 | 10835      | 15750
GPU       | 926   | 2166.4     | 2386.1

IMDB
Env       | CNTK  | TensorFlow | Theano
CPU (1)   | -     | 1244       | 538
CPU (2)   | -     | 642        | 412
CPU (4)   | -     | 390        | 380
CPU (8)   | 486   | 290        | 368
CPU (16)  | -     | 249        | 368
CPU (32)  | -     | 302        | 384
GPU       | 73.1  | 62.4       | 220

Self-Driving Car
Env       | CNTK       | TensorFlow | Theano
CPU (1)   | -          | 33.3 hours | 50 hours
CPU (2)   | -          | 19.8 hours | 44.2 hours
CPU (4)   | -          | 15 hours   | 42.6 hours
CPU (8)   | 47.6 hours | 14.1 hours | 43.5 hours
CPU (16)  | -          | 16.4 hours | 43.5 hours
CPU (32)  | -          | 16.4 hours | 43.5 hours
GPU       | 8.7 hours  | 6 hours    | 6.8 hours

PTB
Env       | CNTK  | TensorFlow | Theano
CPU (1)   | -     | 40560      | 27066
CPU (2)   | -     | 26819      | 23244
CPU (4)   | -     | 18733      | 21541
CPU (8)   | 4290  | 16407      | 21450
CPU (16)  | -     | 16848      | 21476
CPU (32)  | -     | 18369      | 21541
GPU       | 2106  | 1342.28    | 1630
Table 4. Performance metrics of all models on the MNIST dataset

Metrics  | Environment | CNTK  | TensorFlow | Theano
Accuracy | CPU         | 99.27 | 99.14      | 99.10
Accuracy | GPU         | 99.26 | 99.11      | 99.17
CPU%     | -           | 99.6  | 92.2       | 14.7
GPU%     | -           | 92    | 77         | 95
Memory%  | CPU         | 1.7   | 2.2        | 2.1
Memory%  | GPU         | 3.6   | 5.2        | 4.9
Epochs#  | CPU         | 7     | 15         | 10
Epochs#  | GPU         | 7     | 15         | 10
Table 5. Performance metrics of all models on the CIFAR-10 dataset

Metrics  | Environment | CNTK  | TensorFlow | Theano
Accuracy | CPU         | 82.68 | 82.26      | 82.29
Accuracy | GPU         | 82.57 | 82.33      | 82.30
CPU%     | -           | 99.8  | 87.3       | 15.3
GPU%     | -           | 97    | 73         | 94.5
Memory%  | CPU         | 3.2   | 5.3        | 5.1
Memory%  | GPU         | 4.5   | 7.4        | 7.5
Epochs#  | CPU         | 33    | 55         | 50
Epochs#  | GPU         | 33    | 55         | 50
Table 6. Performance metrics of all models on the IMDB dataset

Metrics  | Environment | CNTK  | TensorFlow | Theano
Accuracy | CPU         | 88.87 | 88.68      | 88.72
Accuracy | GPU         | 88.93 | 88.83      | 88.48
CPU%     | -           | 94.8  | 92.2       | 14.6
GPU%     | -           | 76    | 76         | 88
Memory%  | CPU         | 5.7   | 6.6        | 5.1
Memory%  | GPU         | 6.6   | 9.1        | 7.6
Epochs#  | CPU         | 2     | 3          | 2
Epochs#  | GPU         | 2     | 3          | 2
Table 7. Performance metrics of all models on the Self-Driving Car dataset

Metrics  | Environment | CNTK  | TensorFlow | Theano
Accuracy | CPU         | 99.93 | 99.96      | 99.71
Accuracy | GPU         | 99.97 | 99.97      | 99.73
CPU%     | -           | 93.2  | 85         | 22
GPU%     | -           | 32.4  | 34         | 31
Memory%  | CPU         | 5.3   | 4.3        | 3.2
Memory%  | GPU         | 6.6   | 6.2        | 5.3
Epochs#  | CPU         | 10    | 10         | 10
Epochs#  | GPU         | 10    | 10         | 10
Table 8. Performance metrics of all models on the Penn TreeBank dataset

Metrics    | Environment | CNTK  | TensorFlow | Theano
Perplexity | CPU         | 113.7 | 114.79     | 114.57
Perplexity | GPU         | 113.2 | 113.21     | 113.3
CPU%       | -           | 94    | 91.8       | 18.3
GPU%       | -           | 76.6  | 77         | 81
Memory%    | CPU         | 1.3   | 2.3        | 4.1
Memory%    | GPU         | 2.2   | 4.5        | 5.4
Epochs#    | CPU         | 13    | 13         | 13
Epochs#    | GPU         | 13    | 13         | 13
5. CONCLUSIONS AND FUTURE WORK

In this paper, we have provided a qualitative and quantitative comparison between three of the most popular and most comprehensive DL frameworks (namely Microsoft's CNTK, Google's TensorFlow and the University of Montreal's Theano). The main goal of this work was to help end users make an informed decision about the best DL framework that suits their needs and resources. To ensure that our study is as comprehensive as possible, we have used multiple benchmark datasets, namely MNIST, CIFAR-10, Self-Driving Car and IMDB, which were trained via a multilayer CNN architecture, and the Penn TreeBank dataset, which was trained via an RNN architecture. We have run our experiments on a laptop with the Windows 10 operating system. We have measured the performance and utilization of CPU multithreading, GPU and memory. For most of our experiments, we find that CNTK's implementations are superior to the other ones under consideration.
REFERENCES

[1] S. Bahrampour, N. Ramakrishnan, L. Schott, and M. Shah, "Comparative study of deep learning software frameworks," arXiv preprint arXiv:1511.06435, 2015.
[2] S. Shi, et al., "Benchmarking state-of-the-art deep learning software tools," arXiv preprint arXiv:1608.07249, 2016.
[3] R. Al-Rfou et al., "Theano: A python framework for fast computation of mathematical expressions," arXiv preprint arXiv:1605.02688, 2016.
[4] P. Goldsborough, "A tour of tensorflow," arXiv preprint arXiv:1610.01178, 2016.
[5] V. Kovalev et al., "Deep learning with theano, torch, caffe, tensorflow, and deeplearning4j: Which one is the best in speed and accuracy?," Pattern Recognition and Information Processing (PRIP), 2016.
[6] F. Bastien, et al., "Theano: new features and speed improvements," arXiv preprint arXiv:1211.5590, 2012.
[7] W. Ding, R. Wang, F. Mao, and G. Taylor, "Theano-based large-scale visual recognition with multiple gpus," arXiv preprint arXiv:1412.2302, 2014.
[8] W. Dai and D. Berleant, "Benchmarking contemporary deep learning hardware and frameworks: A survey of qualitative metrics," International Conference on Cognitive Machine Intelligence, pp. 148-155, 2019.
[9] C. Coleman et al., "Dawnbench: An end-to-end deep learning benchmark and competition," 31st Conference on Neural Information Processing Systems, vol. 100, no. 101, 2017.
[10] A. Shatnawi et al., "A comparative study of open source deep learning frameworks," 9th International Conference on Information and Communication Systems, pp. 72-77, 2018.
[11] G. Al-Bdour, R. Al-Qurran, M. Al-Ayyoub, and A. Shatnawi, "A detailed comparative study of open source deep learning frameworks," arXiv preprint arXiv:1903.00102, 2019.
[12] D. Yu et al., "An introduction to computational networks and the computational network toolkit," Microsoft, Tech. Rep. MSR-TR-2014-112, 2014.
[13] D. Yu, K. Yao, and Y. Zhang, "The computational network toolkit [best of the web]," IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 123-126, 2015.
[14] J. Bergstra et al., "Theano: A cpu and gpu math compiler in python," Proc. 9th Python in Science Conference, vol. 1, pp. 3-10, 2010.
[15] M. Abadi et al., "Tensorflow: A system for large-scale machine learning," Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 265-283, 2016.
[16] F. Chollet et al., "Keras," 2015.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in neural information processing systems, pp. 1097-1105, 2012.
[19] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," University of Toronto, Tech. Rep., 2009.
[20] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a large annotated corpus of english: The penn treebank," Computational linguistics, vol. 19, no. 2, pp. 313-330, 1993.
[21] A. Maas et al., "Learning word vectors for sentiment analysis," Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.
[22] R. Wallace et al., "First results in robot road-following," IJCAI, pp. 1089-1095, 1985.
[23] G. Al-Bdour, "Comparative study between deep learning frameworks using multiple benchmark datasets," Master's thesis, Jordan University of Science and Technology, 2017.
[24] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.
[25] J. Deng, et al., "Imagenet: A large-scale hierarchical image database," IEEE conference on computer vision and pattern recognition, pp. 248-255, 2009.