Indonesian Journal of Electrical Engineering and Computer Science
Vol. 40, No. 2, November 2025, pp. 883∼897
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v40.i2.pp883-897 ❒ 883
Deep-learning-based hand gestures recognition applications for game controls

Huu-Huy Ngo, Hung Linh Le, Man Ba Tuyen, Vu Dinh Dung, Tran Xuan Thanh
Thai Nguyen University of Information and Communication Technology, Thai Nguyen, Vietnam
Article Info

Article history:
Received Jan 12, 2025
Revised Jul 21, 2025
Accepted Oct 14, 2025

Keywords:
Action recognition
Deep learning
Game controls
Hand gestures recognition
Human–computer interaction
ABSTRACT

Hand gesture recognition is among the emerging technologies of human-computer interaction, where an intuitive and natural interface is preferable to conventional input devices; it is also widely used in multimedia applications. In this paper, a deep learning-based hand gesture recognition system for controlling games is presented, showcasing its significant contributions toward advancing the frontier of natural and intuitive human-computer interaction. It uses MediaPipe to obtain real-time skeletal information of hand landmarks and translates the gestures of the user into smooth control signals through an optimized artificial neural network (ANN) tailored for reduced computational expense and quicker inference. The proposed model, trained on a carefully selected dataset of four gesture classes under different lighting and viewing conditions, shows very good generalization performance and robustness. It achieves a recognition rate of 99.92% with far fewer parameters than deeper models such as ResNet50 and VGG16. By achieving high accuracy, computational speed, and low latency, this work addresses some of the most important challenges in gesture recognition and opens the way for new applications in gaming, virtual reality, and other interactive fields.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Hung Linh Le
Thai Nguyen University of Information and Communication Technology
Thai Nguyen, Vietnam
Email: lhlinh@ictu.edu.vn
1. INTRODUCTION

Hand gesture recognition is a fundamental element of contemporary human–computer interaction (HCI) that offers a more natural, touchless, and intuitive control paradigm than traditional input devices such as keyboards, touchscreens, or mice. Its application is extensive and covers areas such as virtual and augmented reality (VR/AR), home automation technologies, assistive technology, robots, and gaming systems [1]-[4]. The advancement of sensing technology and computer vision algorithms has largely reduced most of the technical difficulties, e.g., partial occlusion, background clutter, and changes in lighting [5]-[7].

Deep learning, especially convolutional neural networks (CNNs), has transformed the area of human gesture recognition with its proven capability of extracting spatial along with temporal features from images and video frames. VGG16, ResNet50, and DenseNet are some of the models that have been extensively utilized and modified for gesture recognition tasks with state-of-the-art accuracy on benchmark datasets [4], [8], [9]. In particular, Sharma and Singh [4] applied CNNs and preprocessing methods (PCA, ORB, and histogram gradients) to improve the accuracy of recognition, while Mohammed et al. [8] fused color and depth data from Kinect sensors with hybrid models. Devineau et al. [10] also addressed temporal dynamics with the
Journal homepage: http://ijeecs.iaescore.com
use of skeletal joint data and parallel convolutions. While precise, such CNN-based models typically carry millions of parameters, resulting in an expensive computational cost that discourages their real-time implementation on low-resource devices.

Given these limitations, researchers have explored lightweight architectures. The combination of artificial neural networks (ANNs) with effective feature extraction offers a beneficial trade-off between accuracy and operational efficiency. Zhang et al. [11] and Nasri et al. [12] presented real-time gesture recognition systems using sEMG signals paired with ANN classifiers, demonstrating excellent performance with fast inference times. Similarly, Ozdemir et al. [13] and Cruz et al. [14] used spectral and inertial inputs to classify gestures. Mujahid et al. [15] used YOLOv3 with DarkNet-53 for real-time detection of static and dynamic gestures without preprocessing, whereas Aggarwal and Arora [16] used mobile-based HGR in game scenarios.

Meanwhile, recent gesture recognition research has focused on practicality, multimodality, and flexibility. Lee and Bae [17] suggested a deep learning-based glove using soft sensors for dynamic motion. Sen et al. [18] suggested a hybrid framework that fuses CNNs, ViT, and Kalman filtering for stable real-time control. Osama et al. [19] and Guo et al. [20] were concerned with the incorporation of gesture control in presentation and educational systems. Jiang et al. [21] were concerned with novel wearable HGR systems, whereas Naseer et al. [22] developed UAV control modules using gesture detection. Wen et al. [23] proposed an innovative mixed reality system aimed at enhancing sign language education through immersive learning experiences and comprehensive, real-time feedback mechanisms.

Despite such advancements, there is still a major lack of gesture recognition systems that are both computationally lean and easily deployable in interactive systems such as gaming. Most models either focus on attaining optimal performance using heavier models or rely on hardware-specific data (such as EMG or IMU) that is not available in consumer-level configurations. Therefore, this study proposes a novel hand gesture recognition platform aimed at interactive game control. The approach takes advantage of the MediaPipe hands framework for real-time landmark detection with efficiency optimization and combines it with a lightweight ANN model, minimizing computational overhead and latency. The performance of the proposed ANN model is thoroughly tested and compared with state-of-the-art CNN architectures such as ResNet50 and VGG16. The comparative analysis determines the practical strengths and applicability of the ANN-based model in game applications.
2. METHOD
2.1. System architecture

Figure 1 illustrates an overview of the hand gesture recognition system considered in this research. The system developed for controlling games consists of different steps, each essential in its own right to the correct identification and interpretation of the movements of the user. The process starts with video input, which is the primary source of information for the system. The video input may come from a webcam or another camera device capable of acquiring real-time visual depictions of the hand motion of the user. The video is necessary as it offers a continuous flow of visual data that records the dynamics and location of the hand, which is vital in sensing gestures intended for interaction with a game.

After the video input has been acquired, the process continues with processing of the video by decomposing it into frames. The individual frames are processed using the MediaPipe framework by first detecting the palm to draw a boundary around the hand area. After localizing the hand, MediaPipe applies its specialist landmark detection model to sample 21 important hand landmarks in real time. The coordinates of the landmarks thus obtained are then converted to a systematic feature vector with maintained spatial relationships between the key points. Later, this vector serves as an input to a neural network responsible for gesture classification. Incorporating MediaPipe into the pipeline not only enhances the accuracy of feature extraction but also significantly reduces computational demands, thereby guaranteeing the viability of the system for real-time applications.

The hand skeleton input produced by MediaPipe, comprising the skeletal structure of the hand, is used as input for a CNN model. In this case, the LeNet architecture—a traditional model in the field of image classification—is used to read the image and extract high-level features capturing spatial relationships between important hand landmarks. This feature extraction is crucial for distinguishing between hand gestures and interpreting them as individual commands for controlling the game. The CNN then outputs a sequence of predictions that include the detected gesture and a confidence score measuring how certain the model is.
These predictions are passed as inputs to the application or game, thus enabling hands-free interaction without needing conventional input methods such as keyboards or controllers.
Figure 1. System overview
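The per-frame control loop described above can be sketched as follows. This is only a structural sketch: `detect_landmarks` stands in for MediaPipe hands, `classify` stands in for the trained model, and the stub data and decision rule are invented for illustration, not taken from the authors' implementation.

```python
# Structural sketch of the recognition pipeline: frame -> landmarks ->
# feature vector -> gesture label. The stubs below replace MediaPipe
# and the trained classifier, so only the data flow is meaningful.
import math

GESTURES = ["thumbs-up", "thumb-pointing-left", "thumb-pointing-right", "other"]

def detect_landmarks(frame):
    # Stand-in for MediaPipe hands: returns 21 (x, y, z) landmarks.
    return [(0.1 * i, 0.05 * i, 0.0) for i in range(21)]

def to_feature_vector(landmarks):
    # 20 wrist-to-keypoint Euclidean distances, as defined in section 2.2.
    wrist = landmarks[0]
    return [math.dist(wrist, p) for p in landmarks[1:]]

def classify(features):
    # Stand-in for the trained classifier: a placeholder decision rule.
    return GESTURES[0] if features[-1] > features[0] else GESTURES[3]

def process_frame(frame):
    landmarks = detect_landmarks(frame)
    features = to_feature_vector(landmarks)
    return classify(features)  # the label would then be sent to the game

print(process_frame(frame=None))
```

In the real system the returned label is transmitted to the game application over a socket, as described in section 3.2.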
2.2. MediaPipe hands

MediaPipe hands [24] is an advanced framework for real-time hand tracking and landmark detection, which is important for human-computer interaction applications. It is able to localize and detect 21 key points on each hand (Figure 2), including fingertips, joints, and the base of the palm, thus enabling accurate hand pose estimation. This solution has significant applications in many areas such as gesture recognition, sign language interpretation, virtual reality (VR), and augmented reality (AR). The strong architecture of MediaPipe hands allows for seamless processing of the input data without compromising on the high level of accuracy, making it suitable for integration into real-time applications on numerous platforms.
Figure 2. Hand landmarks
The MediaPipe hands solution works on a two-stage strategy that includes palm detection and hand landmark detection. In the initial stage, palm detection is applied to identify regions of hands within the provided image. The palm detection step provides the base for stable keypoint extraction by defining a clear region for additional processing. Once the palm has been detected, the system enters the second stage: hand keypoint detection, where the 21 special keypoints on the cropped hand image are detected. This is essential in order to map the hand anatomy properly and extract significant details such as the fingers' tips, intermediate phalanges, and palm center.

One of the advantages of MediaPipe hands is that it can also track multiple hands at a time, even in cases where the hands overlap or change orientation. Multi-hand tracking capability is crucial for applications that need both hands to be involved or when there are multiple users. High tracking stability is attained by the system through the utilization of context information from successive frames to
predict keypoint positions even under conditions of rapid hand movement or temporary occlusion. Predictive tracking enables the smooth and continuous tracking required by applications demanding responsiveness, such as virtual reality/augmented reality interaction and gesture games.
Based on the keypoints' coordinates identified by MediaPipe hands, it is possible to build a formal input for an artificial neural network. Namely, a one-dimensional input vector can be built where every element corresponds to the Euclidean distance from the WRIST point to the remaining 20 keypoints. This method guarantees that the input data preserves spatial relations among significant landmarks on the hand while minimizing the complexity related to direct coordinate representation. The computation of these distances produces a normalized and invariant set of input features that is less sensitive to variation in hand size or orientation, therefore improving the robustness of the neural network model during the training and inference phases. This then yields a feature vector X = (d₁, d₂, d₃, ..., d₂₀).

The Euclidean distance dᵢ between the WRIST point and each of the other 20 keypoints is calculated using (1). In this equation, i = 1, 2, ..., 20, (x₀, y₀, z₀) are the coordinates of the WRIST point, and (xᵢ, yᵢ, zᵢ) represent the coordinates of the other keypoints.

dᵢ = √((xᵢ − x₀)² + (yᵢ − y₀)² + (zᵢ − z₀)²)   (1)
By representing the input in such a manner, the output vector contains 20 elements which accurately represent the spatial relations of the hand anatomy. This vector is used as a significant feature for the neural network so that it can analyze and learn patterns of various hand gestures or motions. The Euclidean distance calculation guarantees that each vector element is scaled equally, hence contributing to stabilization of the learning process and enhancement of the model's performance. As such, this structured representation not only reduces the complexity of the input data but also retains the critical geometric characteristics required for precise hand movement recognition. This method illustrates an effective way of converting raw landmark data into a meaningful format suitable for deep learning algorithms, hence enabling innovation in gesture-based interactive systems.
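As a concrete illustration of (1), the 20-element feature vector can be computed directly from the 21 landmark coordinates. The sample coordinates below are made up for demonstration; real values would come from the MediaPipe landmark detector.

```python
import math

def wrist_distance_features(landmarks):
    """Map 21 (x, y, z) landmarks to the 20 wrist-relative Euclidean
    distances d_1..d_20 defined in (1); landmarks[0] is the WRIST."""
    x0, y0, z0 = landmarks[0]
    return [
        math.sqrt((x - x0) ** 2 + (y - y0) ** 2 + (z - z0) ** 2)
        for (x, y, z) in landmarks[1:]
    ]

# Hypothetical landmarks: wrist at the origin, the rest spaced on a line.
landmarks = [(0.0, 0.0, 0.0)] + [(float(i), 0.0, 0.0) for i in range(1, 21)]
X = wrist_distance_features(landmarks)
print(len(X), X[0], X[-1])  # 20 elements, d_1 = 1.0, d_20 = 20.0
```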
2.3. Artificial neural network model

After the extraction and vectorization of the 21 keypoints via MediaPipe hands, a structured feature vector comprising 20 distinct features is generated. Each entry in the vector is the Euclidean distance between the WRIST keypoint and one of the other keypoints. Figure 3 illustrates an ANN model that has three fully connected layers. There is a first hidden layer with 64 neurons and ReLU activation, followed by a hidden layer with 32 neurons and ReLU. The output layer has 4 neurons, one for each gesture class, and applies softmax to generate class probabilities. The ANN model is trained using the Adam optimizer with a learning rate of 0.001 for 20 epochs to achieve high recognition accuracy and low computation needs. This lightweight design and efficient training routine render the ANN model particularly amenable to real-time hand gesture recognition on the move, specifically for interactive game applications.
Figure 3. The structure of the ANN model
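The forward pass of this 20-64-32-4 network can be sketched in plain Python. The random weights below are placeholders for the trained parameters, so only the layer shapes and activations (ReLU in the hidden layers, softmax at the output) reflect the model described here.

```python
import math
import random

random.seed(0)

def dense(inputs, n_out):
    # Fully connected layer with placeholder random weights and zero biases.
    n_in = len(inputs)
    w = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return [sum(wi * x for wi, x in zip(row, inputs)) + bi
            for row, bi in zip(w, b)]

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def ann_forward(features):
    # 20 inputs -> 64 (ReLU) -> 32 (ReLU) -> 4 (softmax)
    h1 = relu(dense(features, 64))
    h2 = relu(dense(h1, 32))
    return softmax(dense(h2, 4))

probs = ann_forward([0.1] * 20)
print(len(probs))  # 4 class probabilities, one per gesture class
```

In practice such a model would be built and trained with a deep learning framework; the sketch is only meant to make the layer sizes and activations concrete.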
2.4. The ResNet-50 model

The deep CNN architecture known as ResNet-50 (Figure 4) has emerged as an essential component in the field of contemporary computer vision. He et al. [25] presented ResNet-50 with the intention of addressing the obstacles associated with training very deep networks. In particular, the degradation problem arises when increasing the depth of the network results in decreased accuracy owing to difficulty in optimizing the network. One of the most important innovations that ResNet-50 brings to the table is its utilization of residual learning through shortcut connections. This enables the network to acquire identity mappings and helps to alleviate the problem of vanishing gradients. Demonstrating its adaptability and efficiency, this architecture has been utilized extensively in a variety of applications, including semantic segmentation, object identification, and image classification.
Figure 4. The ResNet-50 model architecture
In the ResNet-50 architecture, there are a total of fifty layers, which include convolutional layers, batch normalization layers, ReLU activation functions, and fully connected layers. The usage of residual blocks, in which identity connections bypass one or more layers, is one of its distinguishing characteristics. This allows the network to learn residual functions rather than direct mappings, which is a significant advantage. Each of the sixteen residual blocks that make up ResNet-50 is composed of three convolutional layers: a 1×1 convolution for dimensionality reduction, a 3×3 convolution for spatial feature extraction, and another 1×1 convolution for restoring dimensionality. These blocks are constructed using bottleneck designs. This bottleneck approach decreases the computational load while retaining representational capacity. Additionally, the design employs strided convolutions and pooling layers to gradually lower the spatial dimensions, which guarantees the capture of hierarchical information across a variety of levels.
2.5. The VGG-16 model

Simonyan and Zisserman [26] initially presented the CNN architecture known as VGG-16. The architecture places an emphasis on having a basic and modular style. By taking this approach, the network is able to extract intricate hierarchical properties while preserving its computational efficiency. The VGG-16 architecture (Figure 5) consists of 16 weight layers, including 13 convolutional layers and 3 fully connected layers, interspersed with max-pooling and activation functions. The hallmark of VGG-16 is its use of small 3×3 convolutions with a stride of 1 and padding to maintain spatial resolution. Stacking these tiny kernels in sequence allows the network to simulate the receptive field of bigger filters, which in turn enables the network to collect more detailed spatial data. Max-pooling layers separate five convolutional blocks in a hierarchical design of the architecture. This allows for the gradual reduction of spatial dimensions while simultaneously increasing the depth of feature maps. The fully connected layers at the very end of the network are responsible for aggregating these features in order to arrive at a final categorization. The consistent architecture and depth of VGG-16 make it an excellent choice for feature extraction and transfer learning, despite the fact that it has rather high processing requirements.
[Figure: VGG-16 layer sequence — Image Input, stacked Conv-64/128/256/512 blocks separated by Maxpool layers, followed by FC-4096, FC-4096, a final FC layer, and Softmax]

Figure 5. The VGG-16 model architecture
3. RESULTS AND DISCUSSION
3.1. Game application design

Game description: bricks, balls, and boards are the three components that make up this game. The bricks are arranged in rows at the very top of the screen. The bricks vanish each time the ball makes contact with them. Any item that the ball comes into contact with will cause it to go in the opposite direction. Users can block the ball using the left or right board controls. If there are no bricks left, the player is considered to have won the game; if they are unable to stop the ball, they will lose and the game will end.
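The bounce rule above (the ball reverses direction on contact) can be sketched as a simple velocity reflection. The function and coordinate conventions below are illustrative assumptions, not taken from the actual game code.

```python
def step_ball(x, y, vx, vy, width, height):
    """Advance the ball one step, reflecting its velocity on wall contact.
    Returns the new position and velocity; reaching the bottom means a loss."""
    x, y = x + vx, y + vy
    if x <= 0 or x >= width:   # side walls: reverse horizontal direction
        vx = -vx
    if y <= 0:                 # top wall (or a brick/board hit): reverse vertical
        vy = -vy
    lost = y >= height         # ball fell past the board
    return x, y, vx, vy, lost

x, y, vx, vy, lost = step_ball(1, 1, -2, -3, width=80, height=60)
print(vx, vy, lost)  # velocity reflected on both axes, game not lost
```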
Game activity diagram: Figure 6 illustrates the game activity diagram. An initial user interface is presented to players at the beginning of the game, from which they can select various choices such as “Start” and “Exit.” After selecting the “Start” option, the software will transition to the game interface and begin the process of initializing all of the essential components. These components include the board, the ball, and a set of bricks that are organized in a pattern that has been planned out beforehand. Other variables, such as the score, the velocity of the ball, and the status of the game, are also initialized in order to guarantee a seamless gameplay experience. It is the player's responsibility to manage the board, which is moved horizontally in order to interact with the ball, which is constantly traveling across the screen.
[Figure: game activity flowchart — Main Menu, Gameplay, and Game Over states, with transitions for selecting “Play Game,” initializing components, moving the board, checking ball/brick collisions, and displaying a “You win” or “You lose” message before exiting]

Figure 6. Game activity diagram
The program is responsible for managing the motion of the ball and handling its collisions with the board, bricks, and walls while the game is being played. If the ball comes into contact with a brick, the brick will be demolished, and the score will be adjusted accordingly. In the event that the ball collides with the board or the walls, it will bounce back, thus preserving the flow of the game. The game will continue until either all of the bricks are demolished, which results in a victory, or the ball falls past the board, which results in a loss. After the finish of the game, a message that reads “You win” or “You lose” is displayed, depending on the outcome. Once the game has come to a conclusion, players have the choice to select the “Restart” option, which will either allow them to restart the game and play it again or stop the game altogether. A smooth gameplay experience is ensured by its straightforward yet captivating framework, which strikes a balance between interactive features and clear end conditions in order to keep players interested.
3.2. Control signal transmission from hand gesture recognition program to game application

One method to successfully convey control signals from recognition software to a gaming application is the utilization of sockets. Sockets, a reliable and frequently used technique, accomplish signal transmission in networking and operating systems. A program can create a socket and establish a connection with a corresponding socket in another program. After establishing a connection, the program that is delivering data can send data over the socket, while the program that is receiving the data can process the incoming data. It is possible to transmit signals using this technology across both local area networks (LANs) and the internet, which provides flexibility in deployment.
yment.
User
datagram
protocol
(UDP)
and
transmission
control
protocol
(TCP)
are
the
tw
o
principal
commu-
nication
protocols
that
sock
ets
are
able
to
implement.
The
TCP
ensures
precise
and
sequential
transmission
of
data
pack
ets.
This
feature
enables
applications
lik
e
le
transfers
and
protocols
lik
e
HTTP
and
FTP
to
utilize
it
ef
fecti
v
ely
.
In
contrast,
UDP
is
unstable
and
does
not
require
a
connection.
As
a
result,
it
pro
vides
lo
wer
latenc
y
and
higher
v
elocity
.
It
is
an
e
xcellent
choice
for
applications
such
as
online
g
aming,
streaming
multimedia,
and
DNS
queries.
The
UDP
is
often
the
protocol
of
choice
for
g
aming
applications
that
place
a
high
priority
on
lo
w-
latenc
y
signal
transfer
.
Its
capacity
to
pro
vide
signals
with
lo
w
latenc
y
counterbalances
its
lack
of
dependability
,
ensuring
a
smoother
and
more
responsi
v
e
g
aming
e
xperience.
Therefore,
the
UDP
is
utilized
in
this
study
for
the
purpose
of
controlling
the
transmission
of
signals
from
the
hand
gesture
detection
program
to
the
g
aming
application,
as
sho
wn
in
Figure
7.
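A minimal sketch of this UDP transmission with Python's standard socket module is shown below. The port number and plain-text message format are illustrative assumptions, not the ones used by the actual application.

```python
import socket

GAME_ADDR = ("127.0.0.1", 5005)  # hypothetical address of the game application

def send_gesture(label, addr=GAME_ADDR):
    """Send one recognized gesture label as a UDP datagram (fire-and-forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(label.encode("utf-8"), addr)

# On the game side, a matching receiver would look roughly like this:
def receive_gesture(sock):
    data, _ = sock.recvfrom(1024)  # blocks until a datagram arrives
    return data.decode("utf-8")
```

Because UDP is connectionless, `sendto` returns immediately whether or not the game is listening, which is exactly the low-latency, best-effort behavior that motivates its use here.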
[Figure: flowchart — Begin, receive input data, perform hand gesture recognition, create UDP packet, send UDP packet to the game application, loop while recognition continues, End]

Figure 7. Diagram of control signal transmission from hand gesture recognition program to game application
3.3. Dataset description

Training dataset: the training dataset comprised 4,000 images that were labeled and categorized into four broad categories: thumbs-up, thumb-pointing-left, thumb-pointing-right, and a catch-all category for other hand gestures. Each gesture is associated with a distinct control signal within the respective gaming app, thereby enabling the user to control it through gestures. The data has been separated into two segments: 70% for training and 30% for validation, thereby enabling the model's efficacy to be thoroughly verified. The training dataset was compiled from videos captured under varying conditions, including imaging viewpoints, lighting levels, and background environments, with the aim of increasing the model's generalization capacity. For efficient labeling and to reduce ambiguity, the videos were arranged in a way that each frame contained one clear and distinct hand gesture. The detailed breakdown of the number of images per class of hand gesture is outlined in Table 1, highlighting the balance and distribution of the dataset. Figure 8 provides representative images that show the four hand gestures: Figure 8(a) thumbs up, Figure 8(b) thumb pointing left, Figure 8(c) thumb pointing right, and Figure 8(d) other hand gestures.
Table 1. Description of the training dataset

Hand gestures          Control signal in game application   Training dataset   Testing dataset   Total
Thumbs-up              Start the game                       700                300               1,000
Thumb-pointing-left    The board moves to the left          700                300               1,000
Thumb-pointing-right   The board moves to the right         700                300               1,000
Other hand gestures    None                                 700                300               1,000
Figure 8. Snapshots of four hand gestures from the dataset: (a) thumbs up, (b) thumb pointing left, (c) thumb pointing right, and (d) other hand gestures
A few preprocessing procedures were carried out prior to inputting the data into the deep learning models to enhance data integrity and model strength. For consistency across the dataset, every raw image was resized to a uniform size of 224×224 pixels. Then, to normalize the input features, the pixel intensity values were scaled into the range [0, 1]. Throughout training, we employed a range of data augmentation techniques, such as random rotation, horizontal flip, changes in brightness, and subtle zoom adjustments. These augmentation techniques not only introduce variety into the training data but also counteract overfitting and thus enhance the model's capability to generalize to new, unseen data in real-time gesture recognition applications.
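The normalization and flip-augmentation steps can be illustrated on a toy image. Real images would be loaded and resized with an image library, so treat this as a shape-level sketch only.

```python
def normalize(image):
    """Scale 8-bit pixel intensities from [0, 255] into [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in image]

def horizontal_flip(image):
    """Mirror each row: one of the augmentations applied during training."""
    return [row[::-1] for row in image]

# Toy 2x3 grayscale "image" standing in for a 224x224 input.
image = [[0, 128, 255],
         [255, 128, 0]]
print(normalize(image)[0])        # first row scaled into [0, 1]
print(horizontal_flip(image)[0])  # first row mirrored
```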
3.4. Model training evaluation

Immediately following the collection of a significant dataset, the deep neural network models were trained by extracting features from the dataset. This activity is crucial because it has a direct impact on the overall quality of the proposed system. This section presents the results of model training for the VGG16, ResNet50, and ANN models. Figures 9, 10, and 11 illustrate the training and validation accuracy (left) and the training and validation loss (right) during the training process of the three models.

The training results of the VGG16 model are shown in Figure 9. In the accuracy graph, both training and validation accuracy grow at a quick rate in the early epochs. After fifteen epochs, the accuracy stabilizes at high values near 1.0, which indicates that learning is taking place effectively. The training accuracy varies slightly, but it converges in a consistent manner with the validation accuracy. At the end of twenty epochs, the accuracy levels for training and validation were 0.9996 and 0.9994, respectively. On the loss graph, the training loss decreases sharply within the first few epochs and then stabilizes near zero. The validation loss follows a similar pattern with minor spikes, which demonstrates that the model generalizes well on the validation set.
Figure 10 shows the outcomes of the training process for the ResNet50 model. According to the accuracy graph, the accuracy levels achieved by the model during training and validation increase with each epoch. After fifteen epochs, these accuracy levels remained continuously high and reached saturation. At the end of twenty epochs, the accuracy achieved in both training and validation is equal to 1.0. During the first few epochs, the loss graph demonstrates a significant decrease in training and validation loss, which is then followed by stability close to zero.
Figure 11 illustrates the training results of the ANN model. In the accuracy plot, both training and validation accuracy improve rapidly in the early epochs, reaching near-perfect values close to 1.0. The validation accuracy closely tracks the training accuracy, indicating effective learning. At the end of twenty epochs, the accuracy levels for training and validation were 0.9992 and 0.9991, respectively. During the first few epochs, the training loss experienced a substantial decrease, and it eventually stabilized close to zero in the loss plot. A similar pattern may be seen in the validation loss. This set of findings demonstrates that the model is capable of generalizing well and maintaining steady performance throughout the training phase.
Although training and validation accuracy is high, as desired, in all three models, the limitations and potential failure causes in real-world applications need to be stated. During testing, the ANN model occasionally misclassified gestures when hands were partially occluded or under low lighting conditions, which affected the quality of the MediaPipe landmark detection. Moreover, gestures with similar shapes, such as a loosely held fist or a half-extended thumb, were sometimes confused with the “thumbs-up” class. These challenges suggest the need for more robustness tests in different and uncontrolled environments.
Figure 9. Accuracy and loss of the VGG16 model during training
Figure 10. Accuracy and loss of the ResNet50 model during training
Figure 11. Accuracy and loss of the ANN model during training
3.5. Compare three models

Figure 12 presents an overall comparison of the three models (ANN, VGG16, and ResNet50) against key performance metrics such as recognition accuracy, number of parameters, and model complexity. The ANN model achieved 99.92% accuracy with only 3,556 parameters, whereas VGG16 and ResNet50 achieved 99.96% and 100% accuracy with 27,692,612 and 75,100,804 parameters, respectively. These results clearly show that, despite having a simpler structure, the ANN model performs as well as more complex models. To help explore these differences, we have added a bar chart comparing the accuracy and number of parameters for each model, illustrating the applicability of the ANN model where computational efficiency is most critical.

The ANN model, while simple, was quite effective: compared to more complex models it was equally accurate but required far fewer parameters and much less computational power. These qualities make it a highly viable candidate for use in edge devices or embedded systems with limited computational capability. Such a trade-off between performance and efficiency enables the proposed solution to be realistic and usable in real-time gesture control systems, especially in mobile gaming or assistive technology.
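The reported ANN parameter count can be checked directly from the layer sizes given in section 2.3 (20 inputs, hidden layers of 64 and 32 neurons, 4 outputs), counting weights plus biases for each fully connected layer:

```python
# Parameters of a dense layer = inputs*outputs (weights) + outputs (biases).
layers = [(20, 64), (64, 32), (32, 4)]  # the 20-64-32-4 ANN from section 2.3
total = sum(n_in * n_out + n_out for n_in, n_out in layers)
print(total)  # 3556, matching the 3,556 parameters reported above
```

Applying the same per-layer counting to the convolutional and fully connected layers of VGG16 or ResNet50 yields the tens of millions of parameters quoted above, which is the gap the comparison highlights.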