Indonesian Journal of Electrical Engineering and Computer Science
Vol. 22, No. 2, May 2021, pp. 1096–1107
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v22i2.pp1096-1107
Journal homepage: http://ijeecs.iaescore.com
ArSL-CNN: A convolutional neural network for Arabic sign language gesture recognition

Ali A. Alani 1, Georgina Cosma 2
1 Department of Computer Science, University of Diyala, Diyala, Iraq
2 Department of Computer Science, School of Science, Loughborough University, U.K.
Article Info

Article history:
Received Jan 20, 2021
Revised March 13, 2021
Accepted March 20, 2021

Keywords:
Arabic sign language
CNNs
Convolutional neural networks
Deep learning
SMOTE

ABSTRACT
Sign language (SL) is a visual means of communication for people with deafness or hearing impairments. In Arabic-speaking countries, there are many Arabic sign languages (ArSL) and these use the same alphabets. This study proposes ArSL-CNN, a deep learning model based on a convolutional neural network (CNN) for translating Arabic SL (ArSL). Experiments were performed using a large ArSL dataset (ArSL2018) that contains 54,049 images of 32 sign language gestures, collected from forty participants. The first experiments with the ArSL-CNN model returned train and test accuracies of 98.80% and 96.59%, respectively. The results also revealed the impact of imbalanced data on model accuracy. For the second set of experiments, various re-sampling methods were applied to the dataset. Results revealed that applying the synthetic minority oversampling technique (SMOTE) improved the overall test accuracy from 96.59% to 97.29%, yielding a statistically significant improvement in test accuracy (p = 0.016, p < 0.05). The proposed ArSL-CNN model can be trained on a variety of Arabic sign languages and reduce the communication barriers encountered by deaf communities in Arabic-speaking countries.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Ali A. Alani
Department of Computer Science
University of Diyala
Diyala, Iraq
Email: alialani@uodiyala.edu.iq
1. INTRODUCTION
Sign language (SL) is a visual means of communication for people who have deafness or hearing impairments, using gestures, facial expressions, and body language [1], [2]. In 2019, the World Health Organization reported that approximately 466 million people, roughly 5% of the world's population, suffer from hearing impairment. Among these people, roughly 34 million are under the age of 18. A previous study predicted that this number would double by 2050 due to genetic factors, birth complications, infectious diseases, and chronic ear infections [3], [4]. Studies have been conducted to develop systems that can recognise the signs of various SLs [5]. Arabic SL (ArSL) recognition systems are currently in the development phase [6], and there exist limited SL recognition systems that can identify ArSL signs using deep learning methods.

Two methods can be applied to SL recognition systems, namely, sensor-based and image-based methods [7]. Sensor-based methods require the user to wear instrumented gloves with sensors to recognise hand gestures. This approach requires interfacing multiple sensors with a glove to collect the gestures as sensor data, which is then analysed for gesture recognition and translation tasks. Despite their accuracy and reliability, sensor-based methods have
several limitations, such as discomfort in using gloves overloaded with wires, sensors and other materials worn by the signer [8], [9]. By contrast, with image-based sign language gesture recognition, signers are not required to use any kind of gloves or complicated devices. This technique provides users with more versatility than sensor-based systems. However, intensive computations are necessary in the preprocessing phase to recognise the signs. Recent studies focus on the performance of image-based approaches in recognising ArSL [8], [9].
Image-based systems for recognising human signs are complex and multidisciplinary. These systems are developed using various machine learning (ML) methods, such as the artificial neural network [10], support vector machines (SVMs) [11], and elastic graph matching [12]. Deep learning (DL) algorithms have recently boosted many research fields, including image recognition and classification. DL algorithms are ML methods that have been utilised in various applications, such as medical image classification [13] and object recognition [14]. DL models comprise a neural network with more than one hidden layer that uses various levels of distribution to represent and learn the high-level abstractions of data.

The objective of this study is to advance the research of ArSL and explore the capabilities of deep learning methods, specifically the convolutional neural network (CNN) method, for classifying ArSL gestures. This paper proposes a new deep learning model, ArSL-CNN, and explores the advantage of resampling techniques to address class imbalance in the dataset. The proposed ArSL-CNN is trained with images of hand signs in different lighting conditions and orientations to automatically recognise 32 ArSL signs [15].

The remainder of this paper is organised as follows. Related work is reviewed in Section 2. The design and architecture of the proposed ArSL-CNN model are presented in Section 3. Experiments and results are discussed in Section 4. A comparison of the proposed ArSL-CNN with state-of-the-art methods is discussed in Section 5. A conclusion and future research directions are provided in Section 6.
2. RELATED WORKS
Numerous methods have been applied to SL recognition tasks. The two major approaches are handcrafted feature engineering and DL methods. The earliest known work on SL recognition focused on the extraction of hand-engineered features, which are fed to learning algorithms for classification [16]. Consequently, the efficiency of these algorithms is highly dependent on handcrafted feature engineering [17]. Therefore, the accuracy results obtained using these approaches highly depend on the extracted features.

Ibrahim et al. [18] constructed a dataset containing 30 isolated words from children with hearing disabilities. The geometric features of the hands were formulated into feature vectors that were used for classification and automatic translation of the individual Arabic signs into text words. The accuracy of their proposed system reached 97%. Alzohairi et al. [9] applied the histogram of oriented gradients (HOG) feature descriptor for extracting features from ArSL image data, and then adopted the SVM algorithm for developing an ArSL image recognition system. The accuracy of their system reached 63.5%. Abdo et al. [19] applied the hidden Markov model and hand geometry with different hand shapes and forms to the task of Arabic alphabet and numbers sign language recognition and translation into speech or text.

With deep learning, features are extracted hierarchically in an automated manner by applying a series of transformations to the input images. The extracted features are the most robust ones, which means that complex problems are effectively modelled using DL architectures. Nagi et al. [20] proposed a hand gesture recognition system using a CNN, applying morphological image processing and colour segmentation to obtain hand contour edges and eliminate noise. Their proposed model achieved an accuracy of 96% on 6,000 sign images obtained from six gestures. Using data collected by a Kinect sensor, Tang et al. [21] used a deep belief network (DBN) and a CNN for sign language recognition. The authors trained the DBN and CNN models using 36 different hand postures. The DBN model achieved 98.12% accuracy, which was higher than the accuracy obtained by the CNN model. Yang and Zhu [2] introduced a CNN system for the recognition of Chinese SL. The authors obtained video-based data using 40 regular vocabularies. In the preprocessing stage, the authors enhanced the hand segmentation process and prevented the loss of important information during feature extraction. Moreover, they compared two different optimizers, namely, Adagrad and Adadelta, and their results revealed that the CNN model reached better accuracy when the Adadelta optimizer was used. Oyedotun and Khashman [5] adopted two DL methods, namely, CNN and stacked denoising autoencoder (SDAE) networks, to recognise 24 ASL alphabets. The samples were collected from the freely accessible Thomas Moeslund's gesture recognition database. Their test results showed that the SDAE outperformed the CNN model in terms of overall average accuracy (92.83%). ElBadawy et al. [22] proposed a CNN-based framework for ArSL recognition to identify 25 signs. The accuracy values of this
model on the training and unseen data were 85% and 98%, respectively. Ghazanfar et al. [1] proposed different CNN architectures using 54,049 sign images of more than 40 participants provided by [15]. Their results revealed the significant effect of the dataset size on the accuracy of the proposed model. By increasing the size of the dataset from 8,302 samples to 27,985 samples, the proposed model's test accuracy increased from 80.3% to 93.9%. Also, increasing the size of the dataset from 33,406 samples to 50,000 samples resulted in a further increase in the proposed model's test accuracy, from 94.1% to 95.9%. Elsayed and Fathy [3] examined the capacity of ontology technologies (semantic web technologies) and DL to design a multiple sign language ontology for feature extraction using CNNs for the ArSL recognition task. Their findings revealed that the recognition rates of the ArSL training and testing sets were 98.06% and 88.87%, respectively.

Although CNNs perform well with computer vision tasks, they require massive quantities of data to train the network. This disadvantage demands an enormous amount of time and computing capability. Several researchers use transfer learning techniques to minimise the processing time and the number of dataset samples needed to train the CNN model. Saleh and Issa [23] used transfer learning on pre-trained VGG-16 and ResNet152 networks to boost performance in identifying 32 hand gestures from the ArSL dataset. To minimise the imbalance caused by the heterogeneity of the class sizes, random undersampling was applied to the dataset to reduce the number of images from 54,049 to 25,600. Their proposed method achieved testing accuracies of 99.4% and 99.6% for VGG16 and ResNet152, respectively.

Despite the latest developments in DL and the good precision of image classification and prediction achieved using CNNs, imbalanced data can affect the performance of prediction models. Imbalanced data can impact a model's ability to learn and its usage in real-time situations. The translation of sign language gestures into different formats, such as text and speech, should also be further investigated.
3. EXPERIMENT METHODOLOGY
This section describes the proposed ArSL-CNN architecture that was designed for classifying Arabic Sign Language gestures. It also describes the ArSL dataset and the pre-processing techniques that were applied to the dataset.
3.1. Proposed ArSL-CNN architecture
CNNs have achieved several breakthroughs as a basic DL technique for image classification problems, such as object detection and hand gesture recognition [6], [21]. Table 1 shows the architecture of the proposed ArSL-CNN. Three types of layers are used in the CNN algorithm, namely, convolutional, pooling and fully connected layers. The pooling layer decreases the spatial size of an input sequence. The complete CNN architecture is obtained through several stacks of the abovementioned layers. The ArSL-CNN model is composed of seven convolutional layers, four batch normalisation (BN) layers, four pooling layers, five dropout layers and one fully connected layer with rectified linear unit (ReLU) activation. The model ends with an output layer that has a softmax activation function to yield the probability distribution over the classes, as shown in Figure 1.

Figure 1. Architecture of the proposed ArSL-CNN model
Table 1 lists the detailed dimensions of each layer and operation. The first and second layers are convolutional layers that contain 32 feature maps and have a kernel size of 3×3. These layers are activated with ReLUs. The next layer is a BN layer, which aims to achieve a stable distribution of the activation values throughout training and to normalise the inputs to a layer [24]. The fourth layer is a max pooling layer with a pool size of 2×2; the objective of this layer is to decrease the number of parameters, to minimise overfitting and to decrease the computation time. The fifth layer is a dropout regularisation layer with the parameter set to 10%. The next layers are the third and fourth convolutional layers with 64 feature maps, a kernel size of 3×3 and a ReLU activation function. These layers are followed by another max pooling layer with a pool size of 2×2, a BN layer and a regularisation layer with the parameter set to 10%. The next layers are the fifth and sixth convolutional layers with 128 feature maps, a kernel size of 3×3 and a ReLU activation function, followed by another pooling layer with a pool size of 2×2, a BN layer and a regularisation layer with the parameter set to 30%. The last convolutional layer in the network is then laid after the regularisation layers; this layer, with a ReLU activation function, comprises 256 feature maps and has a kernel size of 3×3. The next layers are a max pooling layer with a size of 2×2, a BN layer and another dropout regularisation layer with the parameter set to 30%.

The ArSL-CNN network architecture ends with fully connected units that contain a flatten layer, one fully connected layer, one dropout layer and the output layer. The flatten layer converts the 2D matrix data into a vector to allow the final output to be processed by standard fully connected layers. The fully connected layer contains 512 neurons with a ReLU activation function. The dropout layer excludes 50% of the neurons. The last layer is the output layer, which contains 32 neurons and is activated with a softmax activation function. The ArSL-CNN model is trained in a fully supervised way, and its parameters are optimised by minimising the cross-entropy loss function with the Adam variant of stochastic gradient descent.
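A minimal Keras sketch of the architecture described above (Keras and TensorFlow are the tools used in Section 4). The layer sizes follow Table 1; the input shape of 64×64×1, the 'same' padding (needed so that four 2×2 poolings reduce 64 to 4, giving the 4,096 flattened features mentioned above), and the exact ordering of BN/pooling/dropout within each block are assumptions, since the paper does not specify them:

```python
# Sketch of the ArSL-CNN architecture: seven convolutional layers,
# four batch-normalisation layers, four max-pooling layers, five
# dropout layers, one 512-neuron fully connected layer and a
# 32-way softmax output, as described in the text.
from tensorflow.keras import layers, models

def build_arsl_cnn(input_shape=(64, 64, 1), n_classes=32):
    m = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.Dropout(0.1),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.Dropout(0.1),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.Dropout(0.3),
        layers.Conv2D(256, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.Dropout(0.3),
        layers.Flatten(),                       # 4*4*256 = 4096 features
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    # Cross-entropy loss optimised with Adam, as stated in the text.
    m.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
    return m

model = build_arsl_cnn()
```

With these assumptions the total parameter count matches the sum of the per-layer counts in Table 1.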
Table 1. Parameters of the ArSL-CNN architecture
Layers           Layer Configuration               # Parameters
Convolution 1    32 filters, 3x3 kernel and ReLU   320
Convolution 2    32 filters, 3x3 kernel and ReLU   9248
Batch Norm. 1    -                                 128
Max-pooling 1    2x2 kernel                        0
Dropout 1        0.1                               0
Convolution 3    64 filters, 3x3 kernel and ReLU   18496
Convolution 4    64 filters, 3x3 kernel and ReLU   36928
Batch Norm. 2    -                                 256
Max-pooling 2    2x2 kernel                        0
Dropout 2        0.1                               0
Convolution 5    128 filters, 3x3 kernel and ReLU  73856
Convolution 6    128 filters, 3x3 kernel and ReLU  147584
Batch Norm. 3    -                                 512
Max-pooling 3    2x2 kernel                        0
Dropout 3        0.3                               0
Convolution 7    256 filters, 3x3 kernel and ReLU  295168
Batch Norm. 4    -                                 1024
Max-pooling 4    2x2 kernel                        0
Dropout 4        0.3                               0
Flatten          4096 neurons                      0
Fully connected  512 neurons                       2097664
Dropout          0.5                               0
Output layer     Softmax, 32 classes               16416
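The parameter counts in Table 1 follow from the standard formulas: a convolution with a k×k kernel, c_in input channels and f filters has (k·k·c_in + 1)·f parameters (the +1 is the bias), and a fully connected layer with n_in inputs and n_out neurons has (n_in + 1)·n_out. A quick check against a few rows (note that 295,168 parameters correspond to 256 filters over 128 input channels, matching the 256 feature maps described in the text):

```python
def conv_params(k, c_in, filters):
    """Weights (k*k*c_in per filter) plus one bias per filter."""
    return (k * k * c_in + 1) * filters

def dense_params(n_in, n_out):
    """Weight matrix plus one bias per output neuron."""
    return (n_in + 1) * n_out

assert conv_params(3, 1, 32) == 320        # Convolution 1 (greyscale input)
assert conv_params(3, 32, 32) == 9248      # Convolution 2
assert conv_params(3, 64, 64) == 36928     # Convolution 4
assert conv_params(3, 128, 256) == 295168  # Convolution 7
assert dense_params(4096, 512) == 2097664  # Fully connected
assert dense_params(512, 32) == 16416      # Output layer
```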
3.2. Dataset description
The proposed ArSL-CNN architecture is trained and tested on ArSL2018 [15], an Arabic Sign Language (ArSL) dataset. The ArSL2018 dataset aims to provide an opportunity for researchers to develop automated ArSL recognition systems based on different machine learning methods. The original dataset consists of 54,049 RGB images distributed over 32 classes, with the signs collected from more than 40 participants. The RGB images have different dimensions, and many variations of the images were introduced through the use of different lighting and backgrounds. Samples from the dataset can be seen in Figure 2.
Figure 2. ArSL2018 dataset samples [15]
3.3. Image pre-processing
Data preprocessing is the application of various morphological operations to eliminate noise from the data. The ArSL2018 dataset includes sign language gesture images with different dimensions that were taken under varied illumination. Therefore, image preprocessing techniques are necessary to remove noise from the data before feeding it to the network. All sign images are first converted into greyscale images with a dimension of 64×64 to permit real-time classification. The greyscale colour space conversion allows operating on one channel only, rather than processing the three RGB channels; this conversion reduces the number of parameters of the first convolutional layer and the computational time. To increase the efficiency of the computation process and the speed of the training stage, all images are normalised to set the range of the pixel values from 0 to 1. Then, the images are standardised by subtracting their means and scaling them to unit variance.
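The preprocessing steps above can be sketched as follows; the nearest-neighbour resize and the equal-weight greyscale conversion are illustrative assumptions, since the paper does not state which resize or conversion method was used:

```python
# Sketch of the preprocessing pipeline: greyscale conversion,
# resizing to 64x64, scaling pixels to [0, 1], then standardising
# to zero mean and unit variance.
import numpy as np

def preprocess(rgb_image, size=64):
    img = np.asarray(rgb_image, dtype=np.float64)
    grey = img.mean(axis=2)                    # RGB -> single channel (assumed equal weights)
    h, w = grey.shape
    rows = np.arange(size) * h // size         # nearest-neighbour resize (illustrative)
    cols = np.arange(size) * w // size
    resized = grey[np.ix_(rows, cols)]
    scaled = resized / 255.0                   # pixel range [0, 1]
    return (scaled - scaled.mean()) / scaled.std()  # zero mean, unit variance

rng = np.random.default_rng(0)
x = preprocess(rng.integers(0, 256, size=(100, 120, 3)))  # dummy RGB image
```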
To generate the training and testing sets, images are randomly selected from the dataset. The dataset is split into testing (20%) and training (80%) sets, of which 20% of the training set is taken for validation. Figure 3 shows that the number of samples for each class in the dataset is not balanced. Therefore, various resampling techniques have been applied to solve the imbalance problem amongst the classes. The details of this process are presented in Section 4.2, experimental results and discussion.
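The 80/20 split with a further 20% of the training portion held out for validation can be sketched as follows (the random seed is arbitrary):

```python
# Random 80/20 train/test split of the 54,049 ArSL2018 images,
# with 20% of the training portion held out for validation.
import numpy as np

rng = np.random.default_rng(0)
n = 54049                                  # images in ArSL2018
idx = rng.permutation(n)                   # shuffle before splitting
n_test = int(n * 0.2)
test_idx = idx[:n_test]
rest = idx[n_test:]
n_val = int(len(rest) * 0.2)               # 20% of the training portion
val_idx, train_idx = rest[:n_val], rest[n_val:]
```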
Figure 3. Number of samples in each class
4. RESULTS AND DISCUSSION
The experiments were conducted using the Keras library and the Python programming language running on the TensorFlow backend. The ArSL-CNN model was trained on a machine with an NVIDIA K80 graphics processing unit (GPU), 64 GB of random access memory, 12 GB of GPU memory and a 100 GB solid-state drive. To introduce randomness, the training dataset was shuffled before being fed to the network to avoid bias towards certain parameters. The effectiveness of the proposed model was evaluated in two independent experiments: (1) the proposed ArSL-CNN model was trained and tested using the original ArSL2018 dataset; and (2) the model was trained and tested using different resampling techniques to address the imbalance problem amongst the classes. The accuracy metric was adopted to determine the efficiency of the proposed CNN approach. In formula (1), A denotes the accuracy, and TC and FC represent the number of correctly and incorrectly classified
instances, respectively. The calculated value is multiplied by 100 to turn it into a percentage:

A = TC / (TC + FC) × 100    (1)

For a single class c, the accuracy can be determined using (2):

Ac = TCc / (TCc + FCc) × 100    (2)

where TCc represents the number of correctly classified instances belonging to class c, and FCc represents the number of incorrectly classified instances belonging to class c. The final value is multiplied by 100 to give the percentage accuracy for each class.
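Formulas (1) and (2) share the same form and are straightforward to compute; for example, using the 'Ayn' row of Table 3 (396 of 405 test samples correct) and the 'Waw' row (244 of 259 correct):

```python
# Accuracy as defined in formulas (1) and (2): correctly classified
# over total, as a percentage.
def accuracy(tc, fc):
    return tc / (tc + fc) * 100

ayn = accuracy(396, 405 - 396)   # 'Ayn' class, Table 3
waw = accuracy(244, 259 - 244)   # 'Waw' class, Table 3
```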
4.1. Performance evaluation of the proposed ArSL-CNN model
The performance of the proposed ArSL-CNN model on the original ArSL2018 dataset is presented in Table 2. The training dataset consists of 54,049 images distributed over 32 ArSL gesture groups in a unified format. The training data were divided into batches of 128 samples each. The input and output layers have 4,096 and 32 neurons, respectively. The proposed ArSL-CNN model was trained for multiple learning epochs. The training and testing accuracy values are summarised in Table 2. ArSL-CNN achieved the highest testing accuracy (96.59%) at 500 learning epochs. Figure 4 depicts the model accuracy when the proposed ArSL-CNN model is trained for 500 epochs. The training and testing performances are close to each other across the epochs, which indicates that the model has not been overtrained.

Table 2. Classification accuracy and training time (minutes) obtained using the ArSL2018 original dataset
No. Epochs    Training Acc. (%)    Testing Acc. (%)    Training Time (mins)
500           98.80                96.59               66.1

Figure 4. Accuracy of the proposed ArSL-CNN model obtained using the original ArSL2018 dataset
Table 3 indicates the accuracy for all 32 classes. From the table it can be observed that the number of testing samples varies considerably across the classes. It can also be observed that classes with the highest numbers of samples achieved a better accuracy than the classes with fewer samples. For instance, the 'Waw' class contains 259 testing samples and its accuracy was 94.21%, whereas the 'Ayn' class contains 405 testing samples and its accuracy was 97.78%. These results revealed that an imbalanced distribution of the number of samples between classes may impact the performance of the models, and in some cases the model will learn the classes that have more samples better than those with fewer samples. Therefore, it is important to apply techniques that can handle the imbalance problem between classes and to determine whether these techniques can improve classification performance, especially for the classes that contain smaller sample sizes. For this reason, resampling (over-sampling and under-sampling) methods are applied to the dataset and their impact on the performance of the ArSL-CNN model is explored (results are discussed in Section 4.2).
Table 3. ArSL-CNN accuracy on the test data using the original ArSL2018 dataset before applying sampling techniques (no sampling)
Class No.   Class Name   #S a   #SCC b   Accuracy
0           Alif         354    343      96.89
1           Ba           314    310      98.73
2           Ta           372    364      97.85
3           Tha          364    350      96.15
4           Jim          313    302      96.49
5           Ha           299    286      95.65
6           Kha          337    320      94.96
7           Dal          295    285      96.61
8           Dhal         328    318      96.95
9           Ra           310    303      97.74
10          Zay          265    259      97.74
11          Sin          336    317      94.35
12          Shin         316    294      93.04
13          Sad          388    380      97.94
14          Dad          361    350      96.95
15          Taa          355    349      98.31
16          Za           362    356      98.34
17          Ayn          405    396      97.78
18          Ghayn        376    361      96.01
19          Fa           391    377      96.42
20          Qaf          335    330      98.51
21          Kaf          371    362      97.57
22          Lam          354    340      96.05
23          Mim          346    338      97.69
24          Nun          353    349      98.87
25          Ha           325    318      97.85
26          Waw          259    244      94.21
27          Ya           365    352      96.44
28          Taa          374    345      92.25
29          Al           261    247      94.64
30          Laa          357    345      96.64
31          Yaa          269    256      95.17
Total/Average c          10810  10446    96.59
a. Number of samples in the test data
b. Number of samples correctly classified
c. The average is calculated by formula (2)
4.2. Results when using the ArSL-CNN model with oversampling and undersampling methods
The number of images per class in the ArSL2018 dataset is shown in Figure 3. As previously mentioned, the classes contain different sample sizes, and such discrepancies may result in an imbalance amongst the classes. The imbalance issue can have a negative effect on the classification results. To overcome this issue and reduce bias, resampling methods have been applied to balance the class distribution; these methods are classified into two groups, namely oversampling and undersampling methods [25]. The oversampling technique solves the imbalance amongst the classes by generating synthetic samples from minority samples. This approach can effectively improve the classification efficiency; however, increasing the number of samples in the minority classes will increase the training time. The oversampling process has two variations. The first is random minority oversampling (RMO), which randomly duplicates the minority class samples. The second is the synthetic minority oversampling technique (SMOTE), a more sophisticated sampling technique that overcomes the issue of class imbalance by artificially generating samples through the interpolation of neighbouring data points [26]. The other method used for adjusting the balance of samples across the ArSL2018 dataset classes was random minority under-sampling (RMU). The RMU strategy involves the random deletion of samples from majority classes until the dataset is balanced. A major drawback of this strategy is the possible loss of useful information.
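The core of SMOTE, interpolating between a minority-class point and one of its nearest minority-class neighbours, can be sketched as below. This is a simplified illustration; a full implementation (for example, imbalanced-learn's SMOTE) adds per-class sampling strategies and more careful neighbour handling:

```python
# Minimal sketch of SMOTE's interpolation step: a synthetic sample
# is placed at a random position on the segment between a minority
# point and one of its k nearest minority-class neighbours.
import numpy as np

def smote_sample(minority, idx, rng, k=3):
    x = minority[idx]
    d = np.linalg.norm(minority - x, axis=1)
    neighbours = np.argsort(d)[1:k + 1]     # skip the point itself
    nn = minority[rng.choice(neighbours)]   # pick a random neighbour
    gap = rng.random()                      # position along the segment
    return x + gap * (nn - x)               # interpolated synthetic point

rng = np.random.default_rng(42)
minority = rng.normal(size=(20, 4096))      # e.g. 20 flattened 64x64 images
synthetic = smote_sample(minority, 0, rng)
```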
To correct the balance of samples amongst the classes in the ArSL2018 dataset, three resampling techniques, namely RMO, SMOTE, and RMU, were applied to the dataset, and experiments were carried out to evaluate their impact on the task. Table 4 shows the results obtained using the three resampling methods. The findings reveal that the efficiency of the proposed ArSL-CNN model increases after applying the resampling techniques. The proposed model achieves training and testing accuracies of 99.14% and 97.21%, respectively,
by using the random oversampling method. The training and testing accuracy values after applying the undersampling method are 99.27% and 97.07%, respectively. By using SMOTE, the model obtains training and testing accuracies of 98.94% and 97.29%, respectively. This result implies that SMOTE outperforms the other two resampling methods in terms of testing accuracy. The highest testing accuracy (97.29%), achieved using SMOTE, is higher than that obtained by implementing the ArSL-CNN architecture on the original dataset (96.59%). These findings highlight the importance of having a balanced number of samples in each class for achieving high classification accuracy and minimising overfitting. Classes with small numbers of samples will reduce the accuracy of the proposed model.
Table 4. ArSL-CNN accuracy on the test data after applying resampling techniques to the ArSL2018 dataset
Resampling Technique   No. Epochs   Training Acc. (%)   Testing Acc. (%)   Training Time (mins)
RMU                    500          99.27               97.07              66.1
RMO                    500          99.14               97.21              134.4
SMOTE                  500          98.94               97.29              141.9
Figure 5 shows the confusion matrix generated by training the proposed ArSL-CNN model with SMOTE for 500 epochs. The diagonal elements in the confusion matrix reflect the number of correctly labelled images, whereas the off-diagonal elements denote the mislabelled images. The greater the sum of the diagonal values of the confusion matrix, the higher the accuracy of the classification. The accuracy of the proposed ArSL-CNN model during various learning epochs after applying SMOTE is illustrated in Figure 6. The results reveal that the accuracy of the model on the training and testing sets increases across the learning epochs.
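The relation between the confusion matrix and the overall accuracy can be illustrated with a small hypothetical 3-class matrix (the real matrix in Figure 5 is 32×32):

```python
# Overall accuracy from a confusion matrix: correctly classified
# counts sit on the diagonal, so accuracy = trace / total * 100.
import numpy as np

cm = np.array([[50,  2,  1],    # rows: true class, columns: predicted class
               [ 3, 45,  0],
               [ 0,  4, 48]])
acc = np.trace(cm) / cm.sum() * 100   # (50+45+48) of 153 samples correct
```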
Figure 5. Confusion matrix of the proposed ArSL-CNN model with SMOTE

Figure 6. Accuracy of the proposed ArSL-CNN model with SMOTE

Furthermore, the accuracy per class is given in Table 5. The experimental results show that ArSL-CNN obtained better classification efficiency when the RMU, RMO and SMOTE resampling methods were applied. For instance, the 'Waw' class contained 259 testing samples before the SMOTE
resampling method was applied, and its accuracy was 94.21%. However, after applying the SMOTE resampling method, the number of samples increased from 259 to 422 testing samples, which led to an increase in accuracy from 94.21% to 98.34%. These results confirm the strong impact of applying the SMOTE resampling method in solving the imbalance problem and improving the overall accuracy of the proposed model.
4.3. Statistical analysis of the impact of the resampling methods applied to the ArSL2018 dataset on the performance of ArSL-CNN
Table 6 provides descriptive statistics of the test accuracy results when the various sampling methods are applied to the ArSL2018 dataset. In Table 6, the first column is the sampling method applied to the dataset. The second column holds the mean test accuracy values across the 32 classes. The third column holds the standard deviation values, which are a useful indicator of the stability of the model. The fourth and fifth columns show the minimum and maximum test accuracy values obtained, and the last three columns hold information about the test accuracy value percentiles.
Table 5. ArSL-CNN accuracy on the test data after applying RMU, RMO and SMOTE
(#S: number of samples in the test data; #SCC: number of samples correctly classified; averages are calculated by formula (2))

Class Name   RMU #S/#SCC/Acc.    RMO #S/#SCC/Acc.    SMOTE #S/#SCC/Acc.
Alif         351/348/99.15       423/419/99.05       423/416/98.35
Ba           366/361/98.63       362/348/96.13       362/351/96.96
Ta           365/359/98.36       404/390/96.53       404/390/96.53
Tha          342/333/97.37       438/426/97.26       438/428/97.72
Jim          298/288/96.64       412/401/97.33       412/392/95.15
Ha           266/254/95.49       428/411/96.03       428/414/96.73
Kha          337/324/96.14       451/438/97.12       451/436/96.67
Dal          324/308/95.06       408/394/96.57       408/401/98.28
Dhal         290/284/97.93       449/436/97.10       449/430/95.77
Ra           330/323/97.88       443/436/98.42       443/437/98.65
Zay          266/253/95.11       415/398/95.90       415/406/97.83
Sin          306/285/93.14       440/417/94.77       440/411/93.41
Shin         305/285/93.44       438/422/96.35       438/428/97.72
Sad          332/327/98.49       418/413/98.80       418/411/98.33
Dad          347/335/96.54       428/421/98.36       428/422/98.60
Taa          363/354/97.52       405/402/99.26       405/400/98.77
Za           333/327/98.20       441/439/99.55       441/436/98.87
Ayn          420/413/98.33       411/402/97.81       411/401/97.57
Ghayn        390/380/97.44       421/411/97.62       421/404/95.96
Fa           399/389/97.49       406/387/95.32       406/394/97.04
Qaf          339/328/96.76       389/378/97.17       389/376/96.66
Kaf          342/338/98.83       422/402/95.26       422/414/98.10
Lam          347/341/98.27       446/432/96.86       446/433/97.09
Mim          347/341/98.27       449/442/98.44       449/440/98.00
Nun          360/357/99.17       422/422/100.00      422/420/99.53
Ha           304/299/98.36       432/423/97.92       432/424/98.15
Waw          269/258/95.91       422/414/98.10       422/415/98.34
Ya           314/305/97.13       424/409/96.46       424/411/96.93
Taa          332/320/96.39       428/404/94.39       428/402/93.93
Al           262/255/97.33       394/387/98.22       394/385/97.72
Laa          363/343/94.49       413/395/95.64       413/398/96.37
Yaa          274/266/97.08       442/428/96.83       442/431/97.51
Total/Avg.   10583/10281/97.07   13524/13147/97.21   13524/13157/97.29
Table 6. Descriptive statistics of the test results when applying various sampling methods to the dataset
Sampling method   Mean%   Std. Deviation   Minimum%   Maximum%   25th    50th (Median)   75th
No sampling       96.59   1.64             92.25      98.87      95.74   96.77           97.83
SMOTE             97.29   1.37             93.41      99.53      96.66   97.65           98.32
RMU               97.07   1.57             93.14      99.17      96.20   97.41           98.32
RMO               97.21   1.40             94.39      100.00     96.19   97.15           98.33
Indonesian J Elec Eng & Comp Sci, Vol. 22, No. 2, May 2021: 1096–1107
Table 6 shows that the mean test accuracy reached its highest value, 97.29%, when the SMOTE resampling method was applied. With SMOTE, the proposed model also achieved the lowest standard deviation (1.37), which suggests that applying SMOTE to the dataset results in a more stable prediction model. With RMO, the maximum test accuracy and the 75th percentile values were slightly higher than those of SMOTE; however, the higher standard deviation of RMO suggests that using RMO results in a less stable model. The boxplots in Figure 7 illustrate the distribution of the test accuracy values under the various sampling methods. Each boxplot in Figure 7 holds 32 values, where each value corresponds to the test accuracy of the model for a particular class (note that there are 32 classes in the dataset, as shown in Figure 3). SMOTE has two outliers, and 'no sampling' and RMU have one outlier each, as shown in Figure 7. It is important to mention that the minimum value of SMOTE, i.e. 93.41%, is an outlier, as shown in Figure 7; if the outliers are removed from SMOTE and RMU, then the minimum test accuracy of SMOTE is the highest of all sampling methods, reaching 95.15%. Table 7 shows that applying sampling methods improves ArSL-CNN's performance, and the results suggest that the best performance is achieved when SMOTE is applied to the dataset.
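The paper does not give the exact SMOTE configuration used (in practice the `imbalanced-learn` library's `SMOTE` class is the standard tool). As an illustration of the technique's core idea, here is a minimal self-contained sketch (function name and parameters ours) that generates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours, as in Chawla et al.'s original SMOTE:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: create n_new synthetic minority-class samples
    by interpolating each chosen sample towards one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self as a neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]    # k nearest per sample
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                            # pick a minority sample
        j = neighbours[i, rng.integers(min(k, n - 1))] # pick one of its neighbours
        gap = rng.random()                             # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Usage: flatten minority-class images to vectors, then oversample.
rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 4))                 # 10 minority samples, 4 features
X_new = smote_oversample(X_min, n_new=25, k=3, rng=1)
print(X_new.shape)  # (25, 4)
```

Because each synthetic point is a convex combination of two real minority samples, SMOTE enlarges the minority class without exactly duplicating samples, which is why it tends to generalise better than random oversampling (RMO).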
Figure 7. Boxplot of test accuracy values when various sampling methods are applied to the dataset
To determine whether the observed improvements in ArSL-CNN's performance when SMOTE and the other sampling methods are adopted are statistically significant at α = 0.05, the non-parametric Wilcoxon signed-ranks test was applied to the test accuracy values obtained after applying the resampling methods to the dataset (see Table 7). The results revealed that when applying SMOTE, there is a statistically significant improvement in test accuracy (Z = -2.412, p = 0.016). There was also a weaker, but still significant, improvement in performance when applying the RMU and RMO sampling methods, with p = 0.042 and p = 0.036 respectively. However, SMOTE achieved the most significant statistical improvement, as indicated by the lowest p-value. In conclusion, applying SMOTE resampling to adjust the class imbalance of the dataset significantly improves the test prediction accuracy of the model.
Table 7. Results of the Wilcoxon signed ranks test applied to the test results

| Test Statistics^a | No sampling vs. SMOTE | No sampling vs. RMU | No sampling vs. RMO |
|---|---|---|---|
| Z | -2.412^b | -2.029^b | -2.094^b |
| Asymp. Sig. (2-tailed) | 0.016 | 0.042 | 0.036 |

a. Wilcoxon signed ranks test.
b. Based on positive ranks.
c. Based on negative ranks.
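The Z statistics and two-sided p-values in Table 7 are what the normal approximation of the Wilcoxon signed-ranks test produces for paired samples (in practice `scipy.stats.wilcoxon` is the standard tool). For readers who want to reproduce such a comparison from the 32 paired per-class accuracies, a self-contained sketch (function name ours; no tie or continuity correction):

```python
import math

def wilcoxon_signed_rank_z(x, y):
    """Paired Wilcoxon signed-ranks test via the normal approximation.
    Returns (Z, two-sided p). Zero differences are dropped; ties in |d|
    receive average ranks."""
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    # rank |d| ascending, averaging ranks over tied blocks
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)  # sum of positive ranks
    mu = n * (n + 1) / 4                                   # mean of W+ under H0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)      # std of W+ under H0
    z = (w_plus - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p
```

Calling it with the per-class accuracies before resampling as `x` and after SMOTE as `y` (32 values each, one per class) yields the kind of Z and p reported in Table 7.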
5. COMPARISON WITH STATE-OF-THE-ART METHODS
The performance of the proposed approach compared with existing state-of-the-art techniques on the ArSL2018 dataset, in terms of accuracy, is shown in Table 8. The findings indicate that the proposed ArSL-CNN model, when SMOTE resampling is applied to the dataset, is superior to two state-of-the-art methods in terms of overall accuracy [1], [3]. Ghazanfar et al. [1] used a CNN and achieved an accuracy of 95.9%, whereas Elsayed and Fathy [3] applied semantic DL and obtained an accuracy of 88.8%. In comparison, our proposed method achieves an overall accuracy of 97.29%. This result highlights the importance of providing a balanced number of samples per class to enhance the generalisation ability of CNNs when training DL models.