International Journal of Electrical and Computer Engineering (IJECE)
Vol. 8, No. 6, December 2018, pp. 4693–4704
ISSN: 2088-8708, DOI: 10.11591/ijece.v8i6.pp4693-4704
Deep Belief Networks for Recognizing Handwriting Captured by Leap Motion Controller

Abas Setiawan¹,² and Reza Pulungan¹

¹Dept. of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Indonesia
²Dept. of Informatics Engineering, Faculty of Computer Science, Universitas Dian Nuswantoro, Semarang, Indonesia
Article Info

Article history:
Received December 29, 2017
Revised May 30, 2018
Accepted July 13, 2018

Keywords:
Handwriting
Leap Motion
Deep belief network
Accuracy
Resilient backpropagation

ABSTRACT
Leap Motion controller is an input device that can track hand and finger positions quickly and precisely. In some gaming environments, a need may arise to capture letters written in the air with Leap Motion, which cannot currently be done directly. In this paper, we propose an approach to capture and recognize which letter has been drawn by the user with Leap Motion. This approach is based on Deep Belief Networks (DBN) with Resilient Backpropagation (Rprop) fine-tuning. To assess the performance of our proposed approach, we conduct experiments involving 30,000 samples of handwritten capital letters, 8,000 of which are to be recognized. Our experiments indicate that DBN with Rprop achieves an accuracy of 99.71%, which is better than DBN with Backpropagation or Multi-Layer Perceptron (MLP), either with Backpropagation or with Rprop. Our experiments also show that Rprop makes the process of fine-tuning significantly faster and results in much more accurate recognition compared to ordinary Backpropagation. The time needed to recognize a letter is in the order of 5,000 microseconds, which is excellent even for an online gaming experience.
Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Reza Pulungan
Department of Computer Science and Electronics, Universitas Gadjah Mada, Indonesia
Gedung KPTU FMIPA UGM, Sekip Utara, Bulaksumur, Sleman, Yogyakarta, 55281, Indonesia
+6282136261216
Email: pulungan@ugm.ac.id
1. INTRODUCTION

Computer-aided handwriting recognition can be performed either on-line or off-line. Typically, on-line handwriting recognition is based on specific input devices, such as a real-time capturing digital pen, a finger-touch screen, or mouse movement. In the recent decade, such input devices have already been equipped with computer vision technology that enables recognition of natural hand or body gestures, as seen in Kinect-based handwriting recognition [1]. Huang et al. [2] introduce digit recognition on Kinect by using Multiple Segment and Scaled Coding with 94.6% accuracy. Other methods are proposed by using a Depth-Skin Background Mixture Model for hand segmentation [3] and Dynamic Time Warping combined with Support Vector Machines (SVM) for digit recognition [1]. Kinect has the ability to recognize body shape or hand movements, but it faces difficulties in obtaining precise finger positions compared to the Leap Motion controller.

Leap Motion controller (or simply Leap Motion) is a new computer vision device that can track the position of fingers more precisely. The 3D finger data can be obtained at over 100 frames per second [4], making it possible to efficiently track the position of each finger in detail. However, Leap Motion still faces some limitations: it cannot directly recognize a specific sequence of finger strokes (movements) performed when drawing a letter or a digit in the air. Several algorithms have been proposed to recognize these sequences of strokes to obtain better accuracy. Vikram et al. develop handwriting recognition based on Leap Motion by using the Dynamic Time Warping algorithm [4]. Agarwal et al. [5] present segmentation and recognition of 3D handwritten text captured by Leap Motion. A Hidden Markov Model (HMM) is used to train segmented words and achieves an accuracy of 77.6%. Another algorithm for rapid recognition of graphical hand gestures is SVM. SVM is used to recognize 100 testing samples with an average recognition rate of 82.4% [6].
Research shows that algorithms with shallow architectures (i.e., Decision Tree [7], Gaussian Mixture Models (GMM), HMM, Conditional Random Fields (CRF), Multi-Layer Perceptron (MLP) [8], and SVM [9]) are effective in solving many simple or well-constrained problems. However, these algorithms may face difficulties in complicated real-world applications involving natural signals, such as human speech, natural sound and language, natural images, and visual scenes [10], because of their modelling limitations. Instead, a deep architecture algorithm able to recognize handwritten digits captured by Leap Motion is available, namely Convolutional Neural Networks (CNN). However, it is hard to train a deep MLP (and hence also a CNN) without GPU parallelization [11]. CNN has been used to simulate a 3D digit recognition model that can be projected as 2D in a 3D environment on Leap Motion tracking data [12]. This method achieves an accuracy of 92.4%.

In this paper, we propose to use Deep Belief Networks (DBN) to solve this recognition problem. DBN is one of the deep architecture algorithms and consists of two phases: greedy layer-wise unsupervised pre-training and discriminative fine-tuning [13, 14, 15]. Pre-training, initialized by stacked Restricted Boltzmann Machines, can generate a better result compared to a purely random configuration of the MLP. After the pre-training phase in the DBN, the network can be treated simply as an MLP and then trained with Backpropagation to enhance its weights [14, 15]. Alternatively, Resilient Backpropagation can be used to train the networks instead of ordinary Backpropagation.
2. PRELIMINARIES
2.1. Leap Motion controller

Leap Motion is an input device that allows users to recognize and track hands, fingers, and finger-like tools. Leap Motion is developed by Michael Buckwald and David Holz at Leap Motion, Inc. Leap Motion is able to track 3D finger data precisely and obtains them at over 100 frames per second [4]. Besides, the Leap Motion device is also cheaper compared to other similar computer-vision devices. The dimension of the device is 1.2 × 3 × 7.6 cm³ and it uses optical sensors and infrared light, as shown in Figure 1 [16]. Leap Motion's sensors can detect hands or fingers across the upward y-axis and possess a 150° field of vision with an effective range of 25 to 600 millimeters [17].

Figure 1. Detail of the Leap Motion's sensors [16]
To develop applications that use Leap Motion as an input device, we can make use of an available API that can be accessed via various programming languages. Leap Motion API version 2.0 [17] introduces the New Skeletal Tracking Model, which provides additional information for arms and fingers, as well as overall tracking enhancement. In order to use Leap Motion, first, it must be connected to an existing USB port and positioned beside the monitor at a distance within the cable's range. The Leap Motion device will detect gestures and the position, as well as the movement, of hands or fingers as the user positions her hands in its range. To retrieve the desired hand tracking data, the Leap Motion API provides an update set or a frame of data. Every single Frame object from the API represents a frame that contains a list of tracking entities called motion tracking data. Motion tracking data, such as hands, fingers, and tools, can also be accessed through the API.
2.2. Restricted Boltzmann machines

A Boltzmann machine is a stochastic recurrent neural network with binary stochastic units. There are two types of layers in Boltzmann machines: the visible layer and the hidden layer. Visible units, or neurons on the visible layer, are connected to other units and receive information from the environment. Visible units also bring the input signal from the environment and send it to other units during training. Hidden units, on the other hand, operate freely and extract features from the input signal. Interconnections of units in Boltzmann machines can occur even within a single layer itself. Unfortunately, learning in Boltzmann machines is often impractical, and furthermore they also face scalability issues. In order to make learning in Boltzmann machines easier, further restrictions of connectivity must be enforced. This restricted model is called Restricted Boltzmann Machines (RBM) [18]. In an RBM, there is no connection between units of the same layer. An RBM is composed of a visible layer and a hidden layer of stochastic neurons that are connected by symmetrical weights (i.e., there is no weight toward itself). Hidden units in an RBM are considered as feature detectors.
2.3. Deep belief networks

A deep belief network is a generative model consisting of multiple stochastic layers. A stochastic layer has several latent variables. Latent variables typically have binary values and are called hidden units or feature detectors [13]. Each DBN layer is an RBM, and in order to create many layers, several RBMs are stacked on top of each other to construct a DBN. Hinton et al. [13] propose an idea to learn one layer at a time and restrict the connectivity of binary stochastic units in order to make learning more efficient and simpler [18]. DBN training begins with RBM training: starting from input samples and then sequentially progressing toward the next RBM training. Generated patterns formed by the top RBM can deliver a reversed propagation to the input layer by using conditional probability, as in a belief network or sigmoid belief network. This training procedure is called the DBN's greedy layer-wise training [13].

DBN can work with or without supervision. Although DBN can be used in supervised or discriminative mode, basically, greedy layer-wise training is an unsupervised training for each layer of stacked RBMs. It is called a pre-training procedure [19, 20, 21]. The purpose of the pre-training procedure on each layer's training is to put the parameters of all layers in a region of certain parameter ranges that contains a good generalization of a local optimum that can be obtained by local gradient descent [20, 21].
In an RBM, the joint distribution p(v, h; w) of visible units v and hidden units h with parameters w is defined by the energy function E(v, h; w). For Bernoulli-visible units that lead to Bernoulli-hidden units, the energy function is defined by:

E(v, h; w) = -\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij} v_i h_j - \sum_{i=1}^{I} b_i v_i - \sum_{j=1}^{J} a_j h_j,
where w_{ij} represents the symmetrical interaction between visible unit v_i and hidden unit h_j; b_i and a_j are the bias terms of v_i and h_j, respectively; and I and J are the numbers of visible and hidden units, respectively. Therefore, the conditional probabilities can be calculated by:

p(h_j = 1 \mid v; w) = \varphi\left( \sum_{i=1}^{I} v_i w_{ij} + a_j \right), and    (1)

p(v_i = 1 \mid h; w) = \varphi\left( \sum_{j=1}^{J} h_j w_{ij} + b_i \right),    (2)

where \varphi(x) = 1/(1 + \exp(-x)).
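To make Equation (1) concrete, the following is a minimal sketch in C# (the language of our implementation, cf. Section 4.1). Equation (2) is obtained symmetrically by summing over the hidden units with bias b_i. The class and method names here are illustrative only and are not part of any library.

```csharp
using System;

static class Rbm
{
    // phi(x) = 1 / (1 + exp(-x))
    public static double Sigmoid(double x) => 1.0 / (1.0 + Math.Exp(-x));

    // Equation (1): p(h_j = 1 | v; w) for every hidden unit j.
    // v: visible units (length I), w: I x J weight matrix, a: hidden biases (length J).
    public static double[] HiddenProbabilities(double[] v, double[,] w, double[] a)
    {
        int I = v.Length, J = a.Length;
        var p = new double[J];
        for (int j = 0; j < J; j++)
        {
            double net = a[j];               // bias term a_j
            for (int i = 0; i < I; i++)
                net += v[i] * w[i, j];       // sum_i v_i w_ij
            p[j] = Sigmoid(net);             // phi(...)
        }
        return p;
    }
}
```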
A greedy procedure on stacked RBMs is used to raise the maximum likelihood learning. Equation (3) is the learning rule for updating the weights at each RBM to collect the log-likelihood gradient:

\Delta w_{ij} = \mathbb{E}_{\text{data}}(v_i h_j) - \mathbb{E}_{\text{model}}(v_i h_j),    (3)

where \mathbb{E}_{\text{data}}(v_i h_j) is the expectation of observation on the training data and \mathbb{E}_{\text{model}}(v_i h_j) is the expectation under the distribution defined by the model.
Calculating \mathbb{E}_{\text{model}}(v_i h_j) is intractable, so the Contrastive Divergence (CD) algorithm is used to solve this problem [13]. The CD algorithm performs Gibbs sampling to replace \mathbb{E}_{\text{model}}(v_i h_j) in one or more steps. CD-1, namely the CD algorithm that only performs one step of Gibbs sampling on the data, is used to train the model. The following are the steps of CD-1.
First, an input sample x and updated hidden units are used to compute p(h_j = 1 \mid v; w). Second, units are activated by comparing the sample result from the previous step with random values to obtain the binary state for each unit. Third, those binary units are reconstructed by calculating p(v_i = 1 \mid h; w). Fourth, the output probability of the reconstruction is calculated, and the reconstruction result becomes an input. Fifth, after all collections of statistics from the previous steps are obtained, the weights of the RBM are updated.
After completing the pre-training procedure in DBN, the next phase is to perform discriminative fine-tuning with Backpropagation [13, 14]. Fine-tuning with Backpropagation is carried out by adding a final output layer, which is a labeled class excluded from the pre-training procedure.
3. PROPOSED APPROACH

Figure 2 depicts the architecture of our proposed approach for in-air handwriting recognition using Leap Motion. The proposed approach is developed based on the idea laid out in [22]. Data acquisition is a process of collecting every letter sample by using Leap Motion. A captured letter sample can be used for recognition, or else, it can be stored in a dataset. The letter dataset is split into two parts, namely: training and testing datasets. The training dataset will be incorporated into a learning system by using several proposed algorithms; this basically constitutes the training process. After the training process has been completed, several algorithm models will be constructed. These models are then used for testing (by using the testing dataset) in order to obtain and compare their accuracy. The best model is then selected for further recognition of letter input data captured by Leap Motion.
[Figure 2 shows the workflow: data acquisition using Leap Motion yields a single datum, which is either saved into the letter dataset or passed on for recognition; the dataset feeds the learning system, which produces a saved model used by the recognition step.]

Figure 2. Our proposed approach
3.1. Data acquisition

Data acquisition is a mechanism to obtain handwritten sample data by using Leap Motion. A data sample is a sequence of information about the movement of the hand (inputs only use one hand and one finger) in the air as captured by Leap Motion. The motion of the finger-hand draws a letter stroke, which is then displayed on the screen of the simulation application integrated with the Leap Motion API. Figure 3 depicts the flow for creating one data sample using the existing API.
[Figure 3 shows the pipeline: Start → Input: Hand Gesture → Leap Motion Initialization → 2D Mapping → Normalizing Stabilized Position → Touch Emulation → Image Matrix Creation → Image Matrix Dilation → Output: a Letter as a Binary Vector → End.]

Figure 3. Data acquisition workflow
Leap Motion initialization

Initialization is performed to connect the simulation application to the API. In the simulation application, the user is given information about what letter should be drawn with the Leap Motion. As the user starts drawing the letter in the air, we can access the captured Frame object of the Leap Motion through the API.
2D mapping

Through the Frame we can access the Finger object, and hence the tip positions of all fingers can be obtained through the API. However, we do not need the positions of all fingers, but only the dominant finger, as when pointing with the forefinger. The dominant finger can be obtained via a property of the Finger object, namely Frontmost. In addition, in order to obtain the position of the dominant finger's tip with minimal jitter, a property named StabilizedTipPosition can be used. Because of the 2D mapping, the z-position of the forefinger produced by StabilizedTipPosition is ignored, and the x- and y-positions are then denoted by x_pos and y_pos, respectively.
Normalizing stabilized position

The InteractionBox property of Frame is an object in the API to ensure that the user's finger-hand always stays within the device's visibility range. While the finger-hand is inside the InteractionBox, movements of the fingers will be detected in 3D coordinates, as shown in Figure 4 (left). When the application is mapped to 2D coordinates, the InteractionBox is as shown in Figure 4 (right).
Figure 4. Field-of-view of the (left) 3D and the (right) 2D InteractionBox [17]
InteractionBox simplifies the mapping of every point of the finger's tip position by normalizing the range of captured hands from left to right or top to bottom to values between 0 and 1, relative to the application screen. If w_app and h_app are the width and height, respectively, of the 2D field in the application screen, InteractionBox provides the normalized position of the final stroke relative to the application's coordinates (x_app, y_app), which is given by:

x_app = NP(x_pos) \cdot w_app  and  y_app = (1 - NP(y_pos)) \cdot h_app,

where NP is a normalization function in InteractionBox called normalizePoint.
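As an illustration, this 2D mapping can be written as the following C# sketch. We assume the API's normalizePoint has already produced coordinates in [0, 1]; the class and function names are ours, not part of the Leap Motion API.

```csharp
static class Mapping2D
{
    // Maps a normalized stabilized tip position to application coordinates.
    // npX, npY: NP(x_pos), NP(y_pos) in [0, 1]; wApp, hApp: size of the 2D field.
    public static (double xApp, double yApp) ToScreen(double npX, double npY,
                                                      double wApp, double hApp)
    {
        double xApp = npX * wApp;          // x_app = NP(x_pos) * w_app
        double yApp = (1.0 - npY) * hApp;  // y_app = (1 - NP(y_pos)) * h_app
        return (xApp, yApp);               // y is flipped: screen y grows downward
    }
}
```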
Touch emulation

From the positions of all strokes representing a single letter, a sequence of strokes \vec{S} = (S_1, S_2, S_3, S_4, \ldots, S_m) is formed, where each S_i = (x_app, y_app) for 1 \le i \le m, and m is the number of strokes required to draw the letter. Strokes can be in two states: hovering and touching. Not all strokes are used; those that are in the hovering state can simply be ignored, as illustrated in Figure 5 (left).
[Figure 5 shows a 336 × 336 drawing field whose stroke points are marked O for touch and X for hover (left), and the normalized touch distance running from +1 (hovering) through 0 to −1 (touching) (right).]

Figure 5. Touch emulation: the difference between hovering and touching [17]
The state is determined by a normalized distance in the range of +1 to −1. First, the finger enters the hovering state at a normalized distance of +1, and the distance decreases toward 0, where the finger is near the touching surface. When the finger penetrates the surface, the normalized distance decreases to the minimum of −1, which indicates that it is now in the touching state. Stroke points are only recorded when the finger's tip is in the touching state.
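The following C# sketch illustrates this filtering, under the sign convention just described (a positive normalized distance means hovering); the types and names are illustrative and not part of the Leap Motion API.

```csharp
using System.Collections.Generic;

enum TouchState { Hovering, Touching }

static class TouchEmulation
{
    // Positive normalized distance = hovering; zero or below = touching.
    public static TouchState StateOf(double d) =>
        d > 0 ? TouchState.Hovering : TouchState.Touching;

    // Keeps only the stroke points recorded while the finger tip is touching.
    public static List<(double x, double y)> RecordStroke(
        IEnumerable<(double x, double y, double d)> samples)
    {
        var stroke = new List<(double, double)>();
        foreach (var (x, y, d) in samples)
            if (StateOf(d) == TouchState.Touching)
                stroke.Add((x, y));          // hovering samples are ignored
        return stroke;
    }
}
```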
Image matrix creation

The tracing in the previous process is related to the tile-based editor of the simulation application. Tile-based editors are often developed in game technology, specifically to create the game environment [23, 24]. In this research, all tiles are of the same width and height, namely 12 pixels, and the tile-map is a 28 × 28 tile matrix. If the tile-map is considered as a representation of an image, each tile will have an intensity value and an index position on the tile-map. Each tile indicates a pixel of the image drawn by the user in the air. Therefore, in order to select a tile that corresponds to the user's creation, a fixed stroke point position with the hovering or touching state is used.
Figure 6 provides an illustration of how the selection of the tile-based position for each stroke captured by Leap Motion is done. In Figure 6, m_t is the map-top distance, m_l is the map-left distance, T_s is the size of a tile, and (x_t, y_t) is the selected index of a tile. The sequence of strokes \vec{S} will be converted into an indexed (vector) sequence of strokes \vec{V} = (V_1, V_2, V_3, V_4, \ldots, V_n) with V_i = (x_i, y_i) for 1 \le i \le n, where n is the vector size of the 28 × 28 image matrix, namely 784. Each V_i has an intensity value of 0 or 1 determined by:

V_i = \begin{cases} 1, & \text{if } d > 0, \\ 0, & \text{otherwise}, \end{cases}

where d is the touch distance, which defines whether the finger is in the hovering or touching state. The coordinate (x_t, y_t) of each stroke point is then defined by:

x_t = \frac{x_app - m_l}{T_s}  and  y_t = \frac{y_app - m_t}{T_s}.
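A small C# sketch of this tile selection follows; it uses floor division, which reproduces the worked example of Figure 6 (x_app = 218, y_app = 158, m_l = m_t = 50, T_s = 12 gives tile (14, 9)). The names are ours.

```csharp
using System;

static class TileMap
{
    // x_t = (x_app - m_l) / T_s and y_t = (y_app - m_t) / T_s, floored to an index.
    public static (int xt, int yt) TileIndex(double xApp, double yApp,
                                             double mLeft, double mTop, double tileSize)
    {
        int xt = (int)Math.Floor((xApp - mLeft) / tileSize);
        int yt = (int)Math.Floor((yApp - mTop) / tileSize);
        return (xt, yt);   // e.g. (218, 158) with margins 50 and T_s = 12 gives (14, 9)
    }
}
```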
[Figure 6 shows the 28 × 28 binary tile matrix of a traced letter, with m_t = 50 px, m_l = 50 px, and T_s = 12 px; the example stroke point (x_app, y_app) = (218, 158) selects tile (x_t, y_t) = (14, 9), whose intensity is set to 1.]

Figure 6. Selecting an indexed tile on the tile-map
In the end, this phase basically converts an image (tile) matrix obtained by the Leap Motion into a vector with binary elements representing the image.
Image matrix dilation

Sometimes, when the user writes a letter in the air by using Leap Motion too quickly, the handwritten letter becomes unclear. To clarify the handwritten letter, a dilation operation can be applied to the image matrix. Dilation simply takes several specific neighboring points of a point and gives them an intensity value of 1, as depicted in Figure 7.
[Figure 7 shows the 3 × 3 neighborhood (x_t ± 1, y_t ± 1) around an active tile, all of whose cells receive intensity value 1.]

Figure 7. Image matrix dilation for the letter "D"
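A minimal C# sketch of this dilation, assuming the 3 × 3 neighborhood shown in Figure 7 (every active tile turns its eight neighbors on):

```csharp
static class Dilation
{
    // Dilates a binary image matrix with a 3x3 structuring element.
    public static int[,] Dilate(int[,] img)
    {
        int h = img.GetLength(0), w = img.GetLength(1);
        var outImg = new int[h, w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
            {
                if (img[y, x] != 1) continue;        // only active tiles dilate
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                    {
                        int ny = y + dy, nx = x + dx;
                        if (ny >= 0 && ny < h && nx >= 0 && nx < w)
                            outImg[ny, nx] = 1;      // set intensity value to 1
                    }
            }
        return outImg;
    }
}
```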
3.2. Data, dataset, and data preparation

Each instance of data acquisition creates a single data sample. A single data sample is represented by a 784-size vector, which encodes the 28 × 28 matrix produced in image matrix creation. The data is used for two purposes: as an addition to the dataset collection used as a training or testing dataset with labeled data, and to be used in the recognition process, where the label will be searched for based on the previously trained learning algorithm. In this research, 22,000 training samples and 8,000 testing samples are collected from 10 respondents. The respondents are undergraduate students at Universitas Dian Nuswantoro, Semarang, Indonesia, aged 20-21 years old. Each of them produces around 3,000 handwritings with Leap Motion, captured within 1 month. The training and testing datasets are restricted to upper-case letters from A to Z.

Because the learning task is supervised, the dataset used for the DBN is in the form of [(\vec{V}_1, L_1), (\vec{V}_2, L_2), \ldots, (\vec{V}_k, L_k)]. For each letter sample, the input data is a vector representing the sequence of strokes \vec{V}_i for 1 \le i \le k whose label is L_i, and k is the number of samples in the dataset.
All elements of input vector \vec{V}_i are of binary value. The label L_i, which has a value from "A" to "Z", such that L_i \in \{0, 1, 2, \ldots, 25\}, cannot be included directly into a DBN. In order to map L_i as an element of a vector, a one-hot encoding mechanism is used. In one-hot encoding, a label L_i containing the value "A" is encoded by [1, 0, 0, \ldots, 0], "B" is encoded by [0, 1, 0, \ldots, 0], etc.
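For instance, the one-hot encoding can be sketched in C# as follows (the class and function names are ours):

```csharp
static class Labels
{
    // "A" -> [1,0,...,0], "B" -> [0,1,0,...,0], ..., "Z" -> [0,...,0,1].
    public static double[] OneHot(char letter)
    {
        int index = letter - 'A';     // L_i in {0, 1, ..., 25}
        var encoded = new double[26];
        encoded[index] = 1.0;
        return encoded;
    }
}
```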
3.3. Learning system

This research makes use of Deep Belief Networks (DBN) in a supervised fashion. DBN basically learns the model F_w : \{\vec{V}\} \to \{L\}, which maps an input vector \vec{V} into a labeled vector L with parameter weights w. Figure 8(a) shows the architecture of a DBN with two hidden layers, h_1 and h_2. Two main processes exist in learning this DBN model. The first is the process of greedy layer-by-layer training for each stacked RBM by using the Contrastive Divergence algorithm, which is called pre-training. The second is the process of fine-tuning the entire network using Resilient Backpropagation, or Rprop.
3.3.1. RBM initialization and construction

As described earlier, a DBN consists of stacked RBMs [13]; therefore, each RBM layer first needs to be initialized. Figure 8(b) shows a DBN constructed by two layers of RBMs.

[Figure 8 depicts (a) the DBN model: input data x, hidden layers h_1 and h_2 pre-trained as RBM_1 and RBM_2, and a label layer y fine-tuned with Rprop; (b) the two stacked RBMs with visible/hidden pairs (v_1, h_1) and (v_2, h_2) and weights w_1 and w_2; (c) the resulting network with weights w_ij and w_jk between the input, hidden, and output layers y_l, fine-tuned end to end.]

Figure 8. (a) A DBN model with Rprop, (b) constructing two RBMs in a DBN, (c) fine-tuning DBN with Rprop

The following is the pre-training process, which uses an unsupervised learning algorithm on the DBN to construct the stacked RBMs:
Step 0: In the first RBM, visible layer v_1 gets direct input from sample \vec{V}.
Step 1: Train the RBM's visible layer v_1 and hidden layer h_1 to adjust weights w_1.
Step 2: After the training is completed, use the activity pattern of the first RBM's hidden layer h_1 as input to visible layer v_2 to train the second RBM.
Step 3: By copying h_1 as v_2 in the second RBM, train visible layer v_2 and hidden layer h_2 to adjust weights w_2.

Note that in this process, class labels of the data are not involved.
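The following C# sketch summarizes Steps 0-3. The IRbmLayer interface is an assumption of this sketch (the paper does not prescribe an API); its Train method stands for one full CD-1 training run (Section 3.3.2).

```csharp
using System.Collections.Generic;

interface IRbmLayer
{
    void Train(double[] visible);               // CD-1 training of this RBM
    double[] HiddenActivity(double[] visible);  // hidden-layer pattern for an input
}

static class PreTraining
{
    public static void Run(IRbmLayer rbm1, IRbmLayer rbm2, IReadOnlyList<double[]> samples)
    {
        // Steps 0-1: the raw sample V feeds visible layer v1; adjust weights w1.
        foreach (var v in samples)
            rbm1.Train(v);

        // Steps 2-3: the hidden activity h1 is copied as visible layer v2; adjust w2.
        foreach (var v in samples)
            rbm2.Train(rbm1.HiddenActivity(v)); // no class labels are involved
    }
}
```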
3.3.2. Training each RBM layer

Every RBM is trained by using Contrastive Divergence with one step of the Gibbs sampling algorithm (or CD-1). CD-1 follows the gradient of a difference function called the Kullback-Leibler divergence. Here are the steps of CD-1:
Step 1: While the maximum epoch is not reached, do Steps 2 to 7 repeatedly.
Step 2: Use data sample \vec{V} as visible layer v.
Step 3: Update the hidden layer by calculating p(h_j = 1 \mid v; w) (cf. Equation (1)), then activate stochastic binary neurons h_j on the hidden layer by taking the probability p(h_j = 1 \mid v; w).
Step 4: Reconstruct visible neurons v_i^1 by calculating p(v_i = 1 \mid h; w) (cf. Equation (2)), using the values of the stochastic binary neurons calculated before as input.
Step 5: Calculate the output probability of h_j^1 from the previous reconstruction step, namely p(h_j^1 = 1 \mid v^1; w).
Step 6: Calculate the weight updates (cf. Equation (3)) by using the learning rate \varepsilon as follows:

\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_1 \right),

where \langle v_i h_j \rangle_0 = \mathbb{E}_{\text{data}}(v_i h_j) and \langle v_i h_j \rangle_1 = \mathbb{E}_{\text{model}}(v_i h_j).
Step 7: Update the new weights by adding momentum \alpha and weight decay wd as follows:

w_{grad} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_1,  and
w_{ij}(t) = w_{ij}(t-1) + \left( \alpha \, \Delta w_{ij}(t-1) + \varepsilon \, w_{grad} - wd \cdot w_{ij}(t-1) \right).
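A compact C# sketch of one CD-1 update for a single sample is given below. It reuses the HiddenProbabilities/Sigmoid helpers sketched in Section 2.3; eps, mom and wd stand for the learning rate \varepsilon, momentum \alpha and weight decay wd of Steps 6-7, and dw holds the previous weight updates \Delta w_{ij}(t-1). This is an illustrative sketch, not a transcription of our implementation.

```csharp
using System;

static class Cd1
{
    public static void Update(double[] v0, double[,] w, double[] a, double[] b,
                              double[,] dw, double eps, double mom, double wd, Random rng)
    {
        int I = v0.Length, J = a.Length;

        // Steps 2-3: p(h_j = 1 | v0; w) and stochastic binary activation.
        double[] ph0 = Rbm.HiddenProbabilities(v0, w, a);
        var h0 = new double[J];
        for (int j = 0; j < J; j++) h0[j] = rng.NextDouble() < ph0[j] ? 1.0 : 0.0;

        // Step 4: reconstruct the visible layer, p(v_i = 1 | h0; w).
        var v1 = new double[I];
        for (int i = 0; i < I; i++)
        {
            double net = b[i];
            for (int j = 0; j < J; j++) net += h0[j] * w[i, j];
            v1[i] = Rbm.Sigmoid(net);
        }

        // Step 5: hidden probabilities of the reconstruction.
        double[] ph1 = Rbm.HiddenProbabilities(v1, w, a);

        // Steps 6-7: <v_i h_j>_0 - <v_i h_j>_1 with momentum and weight decay.
        for (int i = 0; i < I; i++)
            for (int j = 0; j < J; j++)
            {
                double grad = v0[i] * ph0[j] - v1[i] * ph1[j];
                dw[i, j] = mom * dw[i, j] + eps * (grad - wd * w[i, j]);
                w[i, j] += dw[i, j];
            }
    }
}
```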
The learning rate is a parameter that influences the learning speed of the algorithm in reaching convergence. Momentum is a parameter used to prevent the system from being trapped in local minima. Weight decay works by adding an "additional term" to the normal gradient. The additional term is the derivative of a function that penalizes excessive weights [25]. The process of updating the weights of an RBM refers to the update rules of gradient descent. In the case of large-scale datasets, using batch gradient descent requires quite an expensive computation, since it needs the entire training data for one step. Therefore, we use a more efficient learning mode that learns the dataset for each predetermined batch. This mode is called mini-batch gradient descent [25].
3.3.3. Fine-tuning with Rprop

Because the DBN is to be used in a supervised manner (after the pre-training process is completed), which involves label classes, the weights have to be fine-tuned. Fine-tuning is conducted by training the network with the Backpropagation algorithm [13, 14]. In this research, besides using fine-tuning with Backpropagation, we also apply the Resilient Backpropagation (Rprop) algorithm, as shown in Figure 8(c).

Theoretically, the Rprop algorithm is more efficient in terms of learning speed compared to ordinary Backpropagation. The difference between Rprop and Backpropagation lies in the weight update, where Rprop makes use of the sign that identifies the value of the last gradient update [26]. In the Rprop algorithm, we use weights and biases obtained from the pre-training process, namely w_{ij} = w_1 (from the input to the first hidden layer) and w_{jk} = w_2 (from the first hidden layer to the second hidden layer). The weight and bias connections between the second hidden layer and the output layer use random initialization in the interval (-1, 1). We initialize the maximum epoch and set the values of \Delta_0 = 0.0125, \Delta_{max} = 50.0, \Delta_{min} = 10^{-6}, \eta^+ = 1.2, and \eta^- = 0.5.
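For illustration, one common Rprop variant (iRprop−) can be sketched in C# as follows; delta should be initialized to \Delta_0 = 0.0125, and the remaining defaults mirror the constants above. This is a sketch of the general update rule of [26], not a transcription of our implementation.

```csharp
using System;

static class Rprop
{
    // One Rprop step over a flattened weight array. Only gradient signs are used.
    public static void Step(double[] w, double[] grad, double[] prevGrad, double[] delta,
                            double etaPlus = 1.2, double etaMinus = 0.5,
                            double deltaMax = 50.0, double deltaMin = 1e-6)
    {
        for (int i = 0; i < w.Length; i++)
        {
            double s = grad[i] * prevGrad[i];
            if (s > 0)       // same sign as last update: enlarge the step size
                delta[i] = Math.Min(delta[i] * etaPlus, deltaMax);
            else if (s < 0)  // sign changed: shrink the step size, skip this update
            {
                delta[i] = Math.Max(delta[i] * etaMinus, deltaMin);
                grad[i] = 0.0;
            }
            if (grad[i] > 0) w[i] -= delta[i];       // move against the gradient
            else if (grad[i] < 0) w[i] += delta[i];
            prevGrad[i] = grad[i];
        }
    }
}
```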
4. RESULT AND ANALYSIS

A learning algorithm naturally minimizes the cost function in order to obtain an optimal result. When the algorithm is being trained using some training dataset, we are basically creating a model that closely reflects the available data. Afterwards, the obtained model can then be applied to the testing dataset to assess the performance of the algorithm.
4.1. Experimental setup

All methods and algorithms in our proposed approach are implemented in C#. All experiments are conducted using this implementation on a computer with a Core(TM) i7-2600 CPU @ 3.40 GHz (8 CPUs), 3.7 GHz. Learning is performed with parallel computation on a logical core for each CPU. For our experiments, we perform data acquisition that produces 30,000 samples of capital letters, partitioned into two groups: 22,000 samples as the training and 8,000 samples as the testing dataset. There are about 843 samples for each letter from A to Z in the training dataset, whereas in the testing dataset, each letter has about 307 samples. Every single datum in the training dataset will be used as an input for the learning algorithm.

In order to assess the performance of our proposed approach, we not only conduct experiments on DBN with Rprop fine-tuning, but also on three other approaches for the purpose of comparison. The three approaches are MLP with Backpropagation (Bp), MLP with Rprop, and DBN with Backpropagation. Table 1 shows the experimental configuration we use for each of those approaches.
Table 1. Experimental configurations

Dataset         Architecture            Pre-training      Fine-tuning
Letter          784-400-100-26 (MLP)    -                 Bp (Ep.: 300)
                784-400-100-26 (MLP)    -                 Rprop (Ep.: 300)
                784-400-100-26 (DBN)    CD-1 (Ep.: 300)   Bp (Ep.: 300)
                784-400-100-26 (DBN)    CD-1 (Ep.: 300)   Rprop (Ep.: 300)
Digit (MNIST)   400-400-100-10 (MLP)    -                 Bp (Ep.: 500)
                400-400-100-10 (MLP)    -                 Rprop (Ep.: 500)
                400-400-100-10 (DBN)    CD-1 (Ep.: 300)   Bp (Ep.: 500)
                400-400-100-10 (DBN)    CD-1 (Ep.: 300)   Rprop (Ep.: 500)
Table 1 shows that the neural network architecture we use when dealing with the letter dataset is 784-400-100-26. In this architecture, the input is a 784-sized vector, which is a flattening of the 28 × 28 tile-map matrix for each letter sample. Both MLP and DBN use 400 and 100 neurons in the first and second hidden layers, respectively. The size of the output vector is adjusted to the number of letter labels, namely 26 classes, with the one-hot encoding mechanism.

Besides using the letter dataset captured by our own data acquisition mechanism, we also make use of the digit dataset from the MNIST database [27]. The MNIST digit dataset consists of 60,000 samples in the training dataset and 10,000 samples in the testing dataset. Each sample in the MNIST digit dataset is a grayscale image. To use them, it is first necessary to perform preprocessing by transforming the grayscale images into binary images. Afterwards, width normalization is carried out by changing the original size of 28 × 28 pixels to 20 × 20 pixels. The 20 × 20 tile-map matrix is then flattened into an input vector of size 400, as shown in the last four rows of Table 1. Similar to the letter configuration, in this digit configuration, both MLP and DBN use 400 neurons and 100 neurons in the first and second hidden layers, respectively. Because the digit labels range from 0 to 9, the size of the output vector is 10, with the one-hot encoding mechanism.
Both for the letter and the MNIST digit datasets, we use the same configuration in the pre-training phase of the DBN. There are two RBM architectures in the letter configuration: the first RBM is 784-400 and the second RBM is 400-100. In the MNIST digit configuration, the first RBM architecture is 400-400 and the second is 400-100. There is no pre-training phase in the MLP. Each RBM in the letter or MNIST digit configurations is trained with the CD-1 algorithm. Based on our experiments, we select the same initialization throughout, namely learning rate = 0.5, momentum = 0.5, and weight decay = 0.001. The maximum epoch for each RBM is set to 300. The batch size parameter for each RBM is set to 100, which means that if there are 22,000 training samples in the dataset, they will be processed 100 at a time.

The Backpropagation for fine-tuning on DBN or training on MLP uses learning rate = 0.05, momentum = 0.5, and batch size = 100. Unlike Backpropagation, Rprop does not need a learning rate and momentum, but instead needs the constant values \Delta_0, \Delta_{max}, \Delta_{min}, \eta^+, and \eta^- (cf. Section 3.3.3). The initial epoch for Rprop is set to 300 for the letter and 500 for the MNIST digit configurations.
4.2. Testing and recognition

Table 2 shows the accuracy and the computational times required in recognizing each letter (from the testing dataset) by the four different methods. Recall that there are about 307 samples for each letter in the testing dataset.
Table 2. Accuracy and computational times for the letter (testing) dataset

Lab   MLP-Bp                      MLP-Rprop                   DBN-Bp                      DBN-Rprop
      Acc.   t (μs)    stdev      Acc.   t (μs)    stdev      Acc.   t (μs)    stdev      Acc.   t (μs)    stdev
A     99.35  4,572.2   1,081.2    100    5,305.1   1,718.6    100    5,117.5   1,392.5    100    3,278.2   811.7
B     92.86  5,740.0   1,534.4    99.68  4,622.1   918.5      99.03  5,369.4   1,416.5    100    5,235.6   2,074.0
C     94.81  3,727.2   912.2      100    5,337.8   1,117.6    99.35  5,126.9   1,684.3    100    3,922.0   2,140.9
D     92.88  2,972.8   628.1      100    4,991.5   1,229.1    98.71  4,124.2   996.1      100    5,668.1   1,935.6
E     92.51  3,151.1   751.4      100    4,428.3   968.1      99.35  3,718.4   941.1      100    5,212.0   1,156.5
F     97.73  3,265.4   724.0      99.68  5,219.6   1,224.0    99.03  4,776.6   1,213.6    99.35  4,602.7   1,556.9
G     77.67  4,862.7   1,100.5    100    5,068.3   1,262.4    98.38  3,558.6   958.8      99.68  5,459.3   1,452.2
H     90.29  4,893.7   1,163.0    99.68  4,205.8   935.3      99.68  3,139.1   819.7      99.68  4,549.9   1,634.3
I     90.58  5,055.4   1,231.4    97.40  3,573.4   1,095.7    95.45  3,714.9   1,205.8    98.38  5,856.4   3,309.5
J     96.74  5,439.8   1,390.0    100    4,720.8   1,233.2    100    5,064.3   1,357.8    100    4,439.5   1,304.9
K     94.48  4,619.5   1,006.0    99.03  4,766.1   1,270.2    98.70  5,416.3   1,433.2    100    5,325.4   1,096.8
L     99.68  3,705.6   913.6      100    4,446.1   1,392.5    100    5,260.5   1,268.8    100    5,545.3   1,287.1
M     91.53  3,833.7   907.5      99.67  5,013.1   1,151.6    99.02  4,462.5   1,136.8    99.67  3,833.9   896.3
N     98.38  4,287.0   1,097.0    99.35  4,819.3   1,019.2    99.35  3,699.3   914.0      99.35  4,909.1   863.1
O     95.45  3,338.1   816.7      99.03  5,237.3   1,000.0    98.38  4,846.7   1,167.2    99.35  4,849.6   1,563.8
P     95.44  5,606.4   1,390.2    99.67  5,486.6   1,502.8    99.67  3,633.0   880.7      100    5,707.3   2,559.3
Q     91.53  4,486.3   903.3      99.67  5,036.0   1,219.2    99.35  3,387.2   788.2      99.67  4,680.0   1,310.3
R     91.86  4,513.4   1,194.6    99.67  5,515.9   1,335.5    95.44  5,289.7   1,185.9    99.67  4,567.4   1,063.3
S     93.16  4,300.1   1,176.0    99.35  3,967.6   890.0      99.35  5,619.3   1,385.8    100    3,560.6   970.2
T     93.18  5,536.6   1,321.9    100    4,180.4   1,104.3    99.03  4,618.1   1,199.2    100    5,156.8   1,118.2
U     94.16  4,861.5   1,147.3    100    3,639.5   864.8      100    5,545.0   1,445.5    100    4,873.8   1,197.9
V     99.03  4,755.6   1,467.6    100    5,169.7   1,482.1    100    4,198.5   1,255.0    100    3,876.2   1,317.2
W     95.44  4,117.0   1,094.3    99.35  4,782.4   1,781.9    100    4,723.2   1,347.3    100    4,161.0   773.4
X     95.11  4,788.1   1,129.6    99.35  4,025.6   1,605.4    100    3,967.8   893.0      100    4,158.1   1,044.3
Y     98.05  3,699.5   986.6      99.35  5,022.5   1,330.0    99.67  3,239.5   777.4      99.67  5,238.1   1,067.9
Z     77.20  4,996.6   1,105.0    99.35  4,571.7   1,473.3    98.37  4,002.9   973.7      98.05  5,230.8   1,210.8
All   93.43  4,427.9   1,083.6    99.59  4,736.6   1,235.6    99.05  4,446.9   1,155.3    99.71  4,765.3   1,412.2
The first four columns of the first row of Table 2 indicate that, out of 307 samples of the letter A, 99.35% (Acc.) of them are recognized, and the computational time spent until the point when the letters are successfully or unsuccessfully recognized is on average 4,572.2 microseconds (t), with a standard deviation of 1,081.2 (stdev). The last row of the table provides the overall statistics of accuracy and computational times for the whole letter testing dataset produced by the four different methods. Table 3 shows similar results for the recognition of the MNIST digit testing dataset.
Table 3. Accuracy and computational times for the MNIST digit (testing) dataset

Lab   MLP-Bp                      MLP-Rprop                   DBN-Bp                      DBN-Rprop
      Acc.   t (μs)    stdev      Acc.   t (μs)    stdev      Acc.   t (μs)    stdev      Acc.   t (μs)    stdev
1     97.62  3,012.4   575.6      98.15  3,062.5   688.6      97.71  2,747.4   595.3      98.59  2,844.7   1,019.6
2     86.92  3,028.2   890.5      94.67  2,268.8   562.4      88.76  2,703.7   482.7      95.83  3,017.7   721.4
3     75.54  2,294.3   673.0      95.84  2,783.3   354.4      91.98  2,623.2   802.8      96.14  2,425.2   589.1
4     88.70  3,059.5   699.7      95.52  2,400.4   621.4      88.90  2,616.5   307.7      96.74  2,525.1   819.4
5     90.92  3,041.4   763.8      96.08  2,795.0   377.6      85.76  2,720.6   295.3      95.52  2,452.3   742.4
6     89.56  3,211.7   795.0      96.45  2,774.5   331.8      95.51  2,656.7   834.6      95.93  2,408.7   556.1
7     88.62  2,811.0   736.3      94.94  2,732.5   661.6      90.66  2,759.4   589.3      95.04  3,069.1   777.3
8     85.01  2,920.0   710.8      94.25  2,002.9   506.3      88.19  2,327.9   594.7      95.69  2,429.5   567.5
9     88.01  1,740.0   556.3      93.56  2,835.8   629.6      87.22  2,804.7   790.1      95.44  3,017.0   665.7
0     95.61  3,031.5   561.9      98.06  2,404.7   641.0      95.41  2,773.7   819.0      98.06  2,805.1   379.9
All   88.65  2,815.0   696.3      95.75  2,606.0   537.5      91.01  2,673.4   611.1      96.30  2,699.4   683.9
Table 4 summarizes the results for recognizing both the letter and MNIST digit testing datasets. The best accuracy (column Acc.) for both testing datasets is achieved by DBN with Rprop fine-tuning, with accuracies of 99.71% and 96.30%, respectively. This is better than the accuracy of MLP with Rprop fine-tuning, which achieves 99.59% and 95.75%, respectively. These two approaches are evidently far superior to both DBN and MLP with Backpropagation training. It seems that fine-tuning with Rprop contributes significantly to the accuracy of both MLP and DBN.
Table 4. Summary of accuracy and computational times

Algorithm (Dataset)    Acc.     Epoch   Training t (hours)   Recognition t (μs)   stdev
MLP-Bp (Letter)        93.43%   300     > 5                  4,427.9              1,083.6
MLP-Rprop (Letter)     99.59%   35      > 1                  4,736.6              1,235.6
DBN-Bp (Letter)        99.05%   300     > 10                 4,446.9              1,155.3
DBN-Rprop (Letter)     99.71%   35      > 6                  4,765.3              1,412.2
MLP-Bp (MNIST)         88.65%   500     > 12                 2,815.0              696.3
MLP-Rprop (MNIST)      95.75%   70      > 2                  2,606.0              537.5
DBN-Bp (MNIST)         91.01%   500     > 24                 2,673.4              611.1
DBN-Rprop (MNIST)      96.30%   70      > 12                 2,699.4              683.9
The last two columns of Table 4 (t and stdev) show that the recognition times of the four approaches are comparable. It seems that using Rprop results in longer recognition times both for MLP and DBN. Out of the four approaches, the recognition time of DBN fine-tuned with Rprop is the longest, namely on average 4,765.3 microseconds with a standard deviation of 1,412.2 for the letter dataset. For the MNIST digit dataset, the recognition times of the four approaches are much more similar to each other, and are a bit more than a half of those for the letter dataset. This is because the resulting neural network for the digit dataset has smaller input as well as output vectors, and is thus quicker to process. The time needed to recognize a letter or a digit is below ten milliseconds, which is excellent even for an online gaming experience [28].
4.3. Training and fine-tuning

Prior to recognition, the neural networks, both DBN and MLP, must first be trained or fine-tuned using the training dataset. The fourth column of Table 4 provides the computational times spent by the four different methods to complete training and pre-training (if any). Because MLP does not have a pre-training phase, its training time is much shorter than DBN's. The time spent by DBN to complete pre-training is around 5 hours for the letter dataset and 10 hours for the MNIST digit dataset. It is evident that this pre-training phase takes most of the training computational time of DBN, and that Rprop is really helpful in reducing the training (but outside of pre-training) time.

In Figure 9 and Figure 10, we show how the accuracy of each method changes as the number of epochs increases in the training process. In order to produce these figures, right after each epoch in the training process (with the training datasets), each method is paused, and the resulting neural network is then used to recognize letters or digits in the testing datasets. We observe these changes until 300 and 500 epochs for the letter and digit datasets, respectively.