Indonesian Journal of Electrical Engineering and Computer Science
Vol. 40, No. 2, November 2025, pp. 883-897
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v40.i2.pp883-897

Deep-learning-based hand gesture recognition applications for game controls

Huu-Huy Ngo, Hung Linh Le, Man Ba Tuyen, Vu Dinh Dung, Tran Xuan Thanh
Thai Nguyen University of Information and Communication Technology, Thai Nguyen, Vietnam

Article Info

Article history:
Received Jan 12, 2025
Revised Jul 21, 2025
Accepted Oct 14, 2025

Keywords:
Action recognition
Deep learning
Game controls
Hand gestures recognition
Human-computer interaction

ABSTRACT

Hand gesture recognition is among the emerging technologies of human-computer interaction, where an intuitive and natural interface is preferable to conventional input devices, and it is also widely used in multimedia applications. In this paper, a deep-learning-based hand gesture recognition system for controlling games is presented, showcasing its contributions toward advancing natural and intuitive human-computer interaction. It utilizes MediaPipe to obtain real-time skeletal information of hand landmarks and translates the user's gestures into smooth control signals through an optimized artificial neural network (ANN) tailored for reduced computational expense and quicker inference. The proposed model, which was trained on a carefully selected dataset of four gesture classes under different lighting and viewing conditions, shows very good generalization performance and robustness. It achieves a recognition rate of 99.92% with far fewer parameters than deeper models such as ResNet50 and VGG16. By achieving high accuracy, computational speed, and low latency, this work addresses some of the most important challenges in gesture recognition and opens the way for new applications in gaming, virtual reality, and other interactive fields.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Hung Linh Le
Thai Nguyen University of Information and Communication Technology
Thai Nguyen, Vietnam
Email: lhlinh@ictu.edu.vn

1. INTRODUCTION
Hand gesture recognition is a fundamental element of contemporary human-computer interaction (HCI) that offers a more natural, touchless, and intuitive control paradigm than traditional input devices such as keyboards, touchscreens, or mice. Its application is extensive and covers areas such as virtual and augmented reality (VR/AR), home automation technologies, assistive technology, robots, and gaming systems [1]-[4]. The advancement of sensing technology and computer vision algorithms has largely reduced most of the technical difficulties, e.g., partial occlusion, background clutter, and changes in lighting [5]-[7].

Deep learning, especially convolutional neural networks (CNNs), has transformed the area of human gesture recognition with its proven capability of extracting spatial along with temporal features from images and video frames. VGG16, ResNet50, and DenseNet are some of the models that have been extensively utilized and modified for gesture recognition tasks with state-of-the-art accuracy on benchmark datasets [4], [8], [9]. In particular, Sharma and Singh [4] applied CNNs and preprocessing methods (PCA, ORB, and histogram gradients) to improve the accuracy of recognition, while Mohammed et al.
[8] fused color and depth data from Kinect sensors with hybrid models. Devineau et al. [10] also addressed temporal dynamics with the
use of skeletal joint data and parallel convolutions. While precise, such CNN-based models typically carry millions of parameters, resulting in a computational cost that discourages real-time implementation on low-resource devices.

Given these limitations, researchers have explored lightweight architectures. Combining artificial neural networks (ANNs) with effective feature extraction offers a beneficial trade-off between accuracy and operational efficiency. Zhang et al. [11] and Nasri et al. [12] presented real-time gesture recognition systems using sEMG signals paired with ANN classifiers, demonstrating excellent performance with fast inference times. Similarly, Ozdemir et al. [13] and Cruz et al. [14] used spectral and inertial inputs to classify gestures. Mujahid et al. [15] used YOLOv3 with DarkNet-53 for real-time detection of static and dynamic gestures without preprocessing, whereas Aggarwal and Arora [16] applied mobile-based HGR in game scenarios.

Meanwhile, recent gesture recognition research has focused on practicality, multimodality, and flexibility. Lee and Bae [17] proposed a deep-learning-based glove using soft sensors for dynamic motion. Sen et al. [18] proposed a hybrid framework that fuses CNNs, ViT, and Kalman filtering for stable real-time control. Osama et al. [19] and Guo et al. [20] focused on incorporating gesture control into presentation and educational systems. Jiang et al. [21] investigated novel wearable HGR systems, whereas Naseer et al. [22] developed UAV control modules using gesture detection. Wen et al. [23] proposed an innovative mixed reality system aimed at enhancing sign language education through immersive learning experiences and comprehensive, real-time feedback mechanisms.

Despite these advancements, there remains a major lack of gesture recognition systems that are both computationally lean and easily deployable in interactive systems such as gaming. Most models either pursue peak performance with heavier architectures or rely on hardware-specific data (such as EMG or IMU) that is not available in consumer-level configurations. Therefore, this study proposes a novel hand gesture recognition platform aimed at interactive game control. The approach takes advantage of the MediaPipe Hands framework for efficient real-time landmark detection and combines it with a lightweight ANN model that minimizes computational overhead and latency. The performance of the proposed ANN model is thoroughly tested and compared with state-of-the-art CNN architectures such as ResNet50 and VGG16. The comparative analysis establishes the practical strengths and applicability of the ANN-based model in game applications.

2. METHOD
2.1. System architecture
Figure 1 illustrates an overview of the hand gesture recognition system considered in this research. The system developed for controlling games consists of several steps, each essential in its own right to the correct identification and interpretation of the user's movements. The process starts with video input, which is the primary source of information for the system. The video input may come from a webcam or another camera device capable of acquiring real-time visual depictions of the user's hand motion.
The video is necessary as it offers a continuous flow of visual data that records the dynamics and location of the hand, which is vital for sensing gestures intended for interaction with a game.

After the video input has been acquired, the process continues by decomposing the video into frames. The individual frames are processed using the MediaPipe framework, which first detects the palm to draw a boundary around the hand area. After localizing the hand, MediaPipe applies its specialized landmark detection model to sample 21 important hand landmarks in real time. The coordinates of the landmarks thus obtained are then converted into a structured feature vector that maintains the spatial relationships between the key points. This vector then serves as input to a neural network responsible for gesture classification. Incorporating MediaPipe into the pipeline not only enhances the accuracy of feature extraction but also significantly reduces computational demands, thereby guaranteeing the viability of the system for real-time applications.

The hand skeleton produced by MediaPipe, comprising the skeletal structure of the hand, is used as input to a CNN model. In this case, LeNet, a traditional architecture in the field of image classification, is used to read the image and extract high-level features capturing spatial relationships between important hand landmarks. This feature extraction is crucial for distinguishing between hand gestures and interpreting them as individual commands for controlling the game. The CNN then outputs a sequence of predictions that include the detected gesture and a confidence score measuring how certain the model is. These predictions are passed as inputs to the application or game, thus enabling hands-free interaction without needing conventional input methods such as keyboards or controllers.
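To make the pipeline concrete, the following is a minimal sketch of the capture-and-landmark stage using OpenCV and the MediaPipe Hands solution. The callbacks `landmarks_to_features` and `classify_gesture` are placeholders for the feature-vector and classification stages described in sections 2.2 and 2.3 (a version of `landmarks_to_features` is sketched in section 2.3); the camera index and confidence threshold are illustrative assumptions, not values taken from the paper.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def run_pipeline(landmarks_to_features, classify_gesture):
    """Capture webcam frames, extract 21 hand landmarks, and classify gestures.

    `landmarks_to_features` and `classify_gesture` are hypothetical callbacks
    standing in for the feature-vector and ANN stages of the paper's pipeline.
    """
    cap = cv2.VideoCapture(0)  # default webcam (illustrative)
    with mp_hands.Hands(static_image_mode=False,
                        max_num_hands=1,
                        min_detection_confidence=0.5) as hands:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV delivers BGR frames.
            results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.multi_hand_landmarks:
                landmarks = results.multi_hand_landmarks[0].landmark  # 21 points
                features = landmarks_to_features(landmarks)
                gesture, confidence = classify_gesture(features)
                print(gesture, confidence)  # forwarded to the game in practice
            if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
                break
    cap.release()
```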
Figure 1. System overview

2.2. MediaPipe hands
MediaPipe Hands [24] is an advanced framework for real-time hand tracking and landmark detection, which is important for human-computer interaction applications. It is able to localize and detect 21 key points on each hand (Figure 2), including fingertips, joints, and the base of the palm, thus enabling accurate hand pose estimation. This solution has significant applications in many areas such as gesture recognition, sign language interpretation, virtual reality (VR), and augmented reality (AR). The robust architecture of MediaPipe Hands allows for seamless processing of the input data without compromising the high level of accuracy, making it suitable for integration into real-time applications on numerous platforms.

Figure 2. Hand landmarks

The MediaPipe Hands solution works in a two-stage strategy comprising palm detection and hand landmark detection. In the initial stage, palm detection identifies the regions of hands within the provided image. The palm detection step provides the base for stable keypoint extraction by defining a clear region for additional processing. Once the palm has been detected, the system enters the second stage, hand keypoint detection, in which the 21 special keypoints on the cropped hand image are located. This is essential in order to map the hand anatomy properly and extract significant details such as the fingertips, intermediate phalanges, and palm center.

One of the advantages of MediaPipe Hands is that it can track multiple hands at a time, even in cases where the hands overlap or change orientation. Multi-hand tracking capability is crucial for applications that involve both hands or multiple users. High tracking stability is attained by the system through the utilization of context information from successive frames to
predict keypoint positions even under conditions of rapid hand movement or temporary occlusion. Predictive tracking enables the smooth and continuous tracking required by applications demanding responsiveness, such as virtual reality/augmented reality interaction and gesture games.

Based on the keypoint coordinates identified by MediaPipe Hands, it is possible to build a formal input for an artificial neural network. Namely, a one-dimensional input vector can be built where every element corresponds to the Euclidean distance from the WRIST point to one of the remaining 20 keypoints. This method guarantees that the input data preserves spatial relations among significant landmarks on the hand while minimizing the complexity of a direct coordinate representation. The computation of these distances produces a normalized and invariant set of input features that is less sensitive to variation in hand size or orientation, therefore improving the robustness of the neural network model during the training and inference phases. This yields a feature vector X = (d_1, d_2, d_3, ..., d_{20}). The Euclidean distance d_i between the WRIST point and each of the other 20 keypoints is calculated using (1), where i = 1, 2, ..., 20, (x_0, y_0, z_0) are the coordinates of the WRIST point, and (x_i, y_i, z_i) are the coordinates of the other keypoints.

d_i = \sqrt{(x_i - x_0)^2 + (y_i - y_0)^2 + (z_i - z_0)^2}    (1)

By representing the input in this manner, the resulting vector contains 20 elements that accurately represent the spatial relations of the hand's anatomy. This vector serves as a significant feature set for the neural network so that it can analyze and learn patterns of various hand gestures or motions. The Euclidean distance calculation guarantees that each vector element is scaled equally, hence contributing to the stabilization of the learning process and the enhancement of the model's performance. As such, this structured representation not only reduces the complexity of the input data but also retains the critical geometric characteristics required for precise hand movement recognition. This method illustrates an effective way of converting raw landmark data into a meaningful format suitable for deep learning algorithms, hence enabling innovation in gesture-based interactive systems.

2.3. Artificial neural network model
After the extraction and vectorization of the 21 keypoints via MediaPipe Hands, a structured feature vector comprising 20 distinct features is generated, each entry being the Euclidean distance between the WRIST keypoint and one of the other keypoints. Figure 3 illustrates the ANN model, which has three fully connected layers. The first hidden layer has 64 neurons with ReLU activation, followed by a hidden layer with 32 neurons and ReLU. The output layer has 4 neurons, one for each gesture class, and applies softmax to generate class probabilities. The ANN model is trained using the Adam optimizer with a learning rate of 0.001 for 20 epochs to achieve high recognition accuracy with low computational needs. This lightweight design and efficient training routine render the ANN model particularly amenable to real-time hand gesture recognition, specifically for interactive game applications.

Figure 3. The structure of the ANN model
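As an illustration, the feature mapping in (1) and the ANN described above can be expressed in a few lines of TensorFlow/Keras. This is a minimal sketch, assuming one-hot encoded labels and NumPy arrays `X_train`/`y_train` prepared from the dataset; the variable names are illustrative rather than taken from the paper's codebase. Note that this 20-64-32-4 configuration yields exactly the 3,556 trainable parameters reported in section 3.5.

```python
import numpy as np
import tensorflow as tf

def landmarks_to_features(landmarks):
    """Map 21 MediaPipe landmarks to the 20 wrist-relative distances of (1)."""
    wrist = np.array([landmarks[0].x, landmarks[0].y, landmarks[0].z])
    return np.array([
        np.linalg.norm(np.array([p.x, p.y, p.z]) - wrist)
        for p in landmarks[1:]  # the 20 non-wrist keypoints
    ], dtype=np.float32)

# Three fully connected layers: 64 (ReLU) -> 32 (ReLU) -> 4 (softmax).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Training as described in section 2.3 (20 epochs; 70/30 split assumed):
# model.fit(X_train, y_train, epochs=20, validation_split=0.3)
```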
2.4. The ResNet-50 model
The deep CNN architecture known as ResNet-50 (Figure 4) has emerged as an essential component of contemporary computer vision. He et al. [25] presented ResNet-50 with the intention of addressing the obstacles associated with training very deep networks, in particular the degradation problem, which arises when increasing the depth of the network results in decreased accuracy owing to the difficulty of optimizing the network. One of the most important innovations that ResNet-50 brings is its use of residual learning through shortcut connections. This enables the network to acquire identity mappings and helps to alleviate the problem of vanishing gradients. Demonstrating its adaptability and efficiency, this architecture has been utilized extensively in a variety of applications, including semantic segmentation, object identification, and image classification.

Figure 4. The ResNet-50 model architecture

The ResNet-50 architecture contains a total of fifty layers, including convolutional layers, batch normalization layers, ReLU activation functions, and fully connected layers. The usage of residual blocks, in which identity connections bypass one or more layers, is one of its distinguishing characteristics. This allows the network to learn residual functions rather than direct mappings, which is a significant advantage. Each of the sixteen residual blocks that make up ResNet-50 is composed of three convolutional layers: a 1x1 convolution for dimensionality reduction, a 3x3 convolution for spatial feature extraction, and another 1x1 convolution for restoring dimensionality. These blocks are constructed using bottleneck designs, which decrease the computational load while retaining representational capacity. Additionally, the design employs strided convolutions and pooling layers to gradually lower the spatial dimensions, which guarantees the capture of hierarchical information across a variety of levels.

2.5. The VGG-16 model
Simonyan and Zisserman [26] initially presented the CNN architecture known as VGG-16. The architecture places an emphasis on a basic and modular style. With this approach, the network is able to extract intricate hierarchical properties while preserving computational efficiency. The VGG-16 architecture (Figure 5) consists of 16 weight layers, including 13 convolutional layers and 3 fully connected layers, interspersed with max-pooling and activation functions. The hallmark of VGG-16 is its use of small 3x3 convolutions with a stride of 1 and padding to maintain spatial resolution. Stacking these small kernels in sequence allows the network to emulate the receptive field of larger filters, which in turn enables the network to capture more detailed spatial information. Max-pooling layers separate five convolutional blocks in the hierarchical design of the architecture. This allows the gradual reduction of spatial dimensions while simultaneously increasing the depth of the feature maps. The fully connected layers at the very end of the network aggregate these features in order to arrive at a final categorization.
The consistent architecture and depth of VGG-16 make it an excellent choice for feature extraction and transfer learning, despite the fact that it has rather high processing requirements.
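For comparison experiments like those in section 3, both baselines can be instantiated from `tf.keras.applications` with a small classification head for the four gesture classes. This is a hedged sketch: the paper does not specify its exact head configuration or weight initialization, so the pooling layer, dense head, and `weights=None` below are assumptions for illustration (the paper's reported parameter counts suggest a different, larger head).

```python
import tensorflow as tf

def build_baseline(name: str, num_classes: int = 4) -> tf.keras.Model:
    """Build a ResNet50 or VGG16 classifier over 224x224 RGB gesture images."""
    backbones = {
        "resnet50": tf.keras.applications.ResNet50,
        "vgg16": tf.keras.applications.VGG16,
    }
    base = backbones[name](weights=None,      # assumption: trained from scratch
                           include_top=False,
                           input_shape=(224, 224, 3))
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```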
Figure 5. The VGG-16 model architecture

3. RESULTS AND DISCUSSION
3.1. Game application design
Game description: bricks, a ball, and a board are the three components that make up this game. The bricks are arranged in rows at the very top of the screen, and a brick vanishes each time the ball makes contact with it. Any item that the ball comes into contact with causes it to travel in the opposite direction. Users can block the ball using the left or right board controls. If no bricks remain, the player is considered to have won the game; if they are unable to stop the ball, they lose and the game ends.

Game activity diagram: Figure 6 illustrates the game activity diagram (as also sketched in code below). An initial user interface is presented to players at the beginning of the game, from which they can select choices such as "Start" and "Exit." After selecting the "Start" option, the software transitions to the game interface and begins initializing all of the essential components. These components include the board, the ball, and a set of bricks organized in a predefined pattern. Other variables, including the score, the velocity of the ball, and the status of the game, are also initialized in order to guarantee a seamless gameplay experience. It is the player's responsibility to manage the board, which is moved horizontally in order to interact with the ball, which is constantly traveling across the screen.

Figure 6. Game activity diagram
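Taken together, the activity diagram and the collision rules described in the next paragraphs reduce to a compact update step. Below is a minimal pygame-style sketch of that step; the geometry, velocities, and variable names are invented for illustration and do not come from the paper.

```python
import pygame

def update_game(ball, ball_vel, board, bricks, screen_rect):
    """One step of the brick-breaker loop: move the ball, bounce, and score."""
    ball.move_ip(ball_vel)
    # Bounce off the side and top walls.
    if ball.left <= screen_rect.left or ball.right >= screen_rect.right:
        ball_vel[0] = -ball_vel[0]
    if ball.top <= screen_rect.top:
        ball_vel[1] = -ball_vel[1]
    # Bounce off the player's board.
    if ball.colliderect(board):
        ball_vel[1] = -abs(ball_vel[1])
    # Demolish any brick the ball touches and reverse the vertical direction.
    hit = ball.collidelist(bricks)
    if hit != -1:
        del bricks[hit]
        ball_vel[1] = -ball_vel[1]
    # End conditions: no bricks left -> "You win"; ball past the board -> "You lose".
    if not bricks:
        return "win"
    if ball.bottom >= screen_rect.bottom:
        return "lose"
    return "playing"
```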
The program is responsible for managing the motion of the ball and handling its collisions with the board, bricks, and walls while the game is being played. If the ball comes into contact with a brick, the brick is demolished and the score is adjusted accordingly. In the event that the ball collides with the board or the walls, it bounces back, preserving the flow of the game. The game continues until either all of the bricks are demolished, which results in a victory, or the ball falls past the board, which results in a loss. After the game finishes, a message reading "You win" or "You lose" is displayed, depending on the outcome. Once the game has come to a conclusion, players have the choice to select the "Restart" option, which allows them to restart the game and play again, or to stop the game altogether. A smooth gameplay experience is ensured by this straightforward yet captivating framework, which strikes a balance between interactive features and clear end conditions in order to keep players interested.

3.2. Control signal transmission from hand gesture recognition program to game application
One method to successfully convey control signals from recognition software to a gaming application is the use of sockets. Sockets are a reliable and frequently used technique for signal transmission in networking and operating systems. A program can create a socket and establish a connection with a corresponding socket in another program. After establishing a connection, the sending program can transmit data over the socket, while the receiving program processes the incoming data. This technology can transmit signals across both local area networks (LANs) and the internet, which provides flexibility in deployment.

User datagram protocol (UDP) and transmission control protocol (TCP) are the two principal communication protocols that sockets implement. TCP ensures precise and sequential transmission of data packets, which enables applications such as file transfers and protocols such as HTTP and FTP to utilize it effectively. In contrast, UDP is connectionless and does not guarantee delivery; as a result, it provides lower latency and higher speed, making it an excellent choice for applications such as online gaming, multimedia streaming, and DNS queries.

UDP is often the protocol of choice for gaming applications that place a high priority on low-latency signal transfer. Its capacity to deliver signals with low latency counterbalances its lack of reliability, ensuring a smoother and more responsive gaming experience. Therefore, UDP is utilized in this study for transmitting control signals from the hand gesture detection program to the gaming application, as shown in Figure 7.

Figure 7.
Diagram of control signal transmission from hand gesture recognition program to game application
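As a concrete illustration, both ends of this scheme need only a few lines of Python's standard `socket` module. The host, port, and message encoding below are illustrative assumptions; the paper does not specify its packet format. The sender and receiver would live in the two separate programs shown in Figure 7.

```python
import socket

GAME_ADDR = ("127.0.0.1", 5005)  # assumed host/port for the game application

# Sender side (hand gesture recognition program): one datagram per gesture.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_control_signal(gesture: str) -> None:
    """Forward a recognized gesture (e.g. 'LEFT', 'RIGHT', 'START') over UDP."""
    sender.sendto(gesture.encode("utf-8"), GAME_ADDR)

# Receiver side (game application): poll for the latest control signal.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(GAME_ADDR)
receiver.setblocking(False)  # never stall the game loop waiting for packets

def poll_control_signal():
    """Return the next gesture string, or None if no packet is waiting."""
    try:
        data, _ = receiver.recvfrom(1024)
        return data.decode("utf-8")
    except BlockingIOError:
        return None
```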
3.3. Dataset description
Training dataset: the training dataset comprised 4,000 labeled images categorized into four broad categories: thumbs-up, thumb-pointing-left, thumb-pointing-right, and a catch-all category for other hand gestures. Each gesture is associated with a distinct control signal within the game application, thereby enabling the user to exert control through gestures. The data was separated into two segments, 70% for training and 30% for validation, enabling the model's efficacy to be thoroughly verified. The training dataset was compiled from videos captured under varying conditions, including imaging viewpoints, lighting levels, and background environments, with the aim of increasing the model's generalization capacity. For efficient labeling and to reduce ambiguity, the videos were arranged so that each frame contained one clear and distinct hand gesture. The detailed breakdown of the number of images per hand gesture class is outlined in Table 1, highlighting the balance and distribution of the dataset. Figure 8 provides representative images of the four hand gestures: Figure 8(a) thumbs up, Figure 8(b) thumb pointing left, Figure 8(c) thumb pointing right, and Figure 8(d) other hand gestures.

Table 1. Description of the training dataset
Hand gestures         | Control signal in game application | Training dataset | Testing dataset | Total
Thumbs-up             | Start the game                     | 700              | 300             | 1,000
Thumb-pointing-left   | The board moves to the left        | 700              | 300             | 1,000
Thumb-pointing-right  | The board moves to the right       | 700              | 300             | 1,000
Other hand gestures   | None                               | 700              | 300             | 1,000

Figure 8. Snapshots of four hand gestures from the dataset: (a) thumbs up, (b) thumb pointing left, (c) thumb pointing right, and (d) other hand gestures

A few preprocessing procedures were carried out prior to inputting the data into the deep learning models to enhance data integrity and model strength. For consistency across the dataset, every raw image was resized to a uniform size of 224x224 pixels. Then, to normalize the input features, the pixel intensity values were scaled to the range [0, 1]. Throughout training, we employed a range of data augmentation techniques, such as random rotation, horizontal flipping, brightness changes, and subtle zoom adjustments. These augmentation techniques not only introduce variety into the training data but also counteract overfitting, thus enhancing the model's capability to generalize to new, unseen data in real-time gesture recognition applications.
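The preprocessing just described maps naturally onto Keras' `ImageDataGenerator`. A minimal sketch follows; the augmentation magnitudes (rotation degrees, brightness range, zoom factor) and the dataset directory are not given in the paper, so the values below are assumptions chosen for illustration.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixels to [0, 1] and apply the augmentations listed above.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,            # pixel intensities normalized to [0, 1]
    rotation_range=15,            # random rotation (degrees; assumed magnitude)
    horizontal_flip=True,         # random horizontal flip
    brightness_range=(0.8, 1.2),  # brightness changes (assumed range)
    zoom_range=0.1,               # subtle zoom adjustments (assumed factor)
    validation_split=0.3,         # 70/30 train/validation split
)

train_iter = train_datagen.flow_from_directory(
    "dataset/",                   # hypothetical directory, one folder per class
    target_size=(224, 224),       # resize every raw image to 224x224
    class_mode="categorical",
    subset="training",
)
```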
3.4. Model training evaluation
Following the collection of a significant dataset, the deep neural network models were trained by extracting features from the dataset. This step is crucial because it has a direct impact on the overall quality of the proposed system. This section presents the training results of the VGG16, ResNet50, and ANN models.

Figures 9, 10, and 11 illustrate the training and validation accuracy (left) and the training and validation loss (right) during the training of the three models. The training results of the VGG16 model are shown in Figure 9. In the accuracy graph, both training and validation accuracy grow quickly in the early epochs. After fifteen epochs, the accuracy stabilizes at high values near 1.0, which indicates that learning is taking place effectively. The training accuracy varies slightly but converges consistently with the validation accuracy. At the end of twenty epochs, the training and validation accuracies were 0.9996 and 0.9994, respectively. On the loss graph, the training loss decreases sharply within the first few epochs and then stabilizes near zero. The validation loss follows a similar pattern with minor spikes, which demonstrates that the model generalizes well on the validation set.

Figure 10 shows the training results of the ResNet50 model. According to the accuracy graph, the training and validation accuracies increase with each epoch. After fifteen epochs, these accuracies remained continuously high and reached saturation. At the end of twenty epochs, both training and validation accuracy reached 1.0. During the first few epochs, the loss graph shows a significant decrease in training and validation loss, followed by stabilization close to zero.

Figure 11 illustrates the training results of the ANN model. In the accuracy plot, both training and validation accuracy improve rapidly in the early epochs, reaching near-perfect values close to 1.0. The validation accuracy closely tracks the training accuracy, indicating effective learning. At the end of twenty epochs, the training and validation accuracies were 0.9992 and 0.9991, respectively. During the first few epochs, the training loss decreased substantially and eventually stabilized close to zero in the loss plot; the validation loss shows a similar pattern. These findings demonstrate that the model is capable of generalizing well and maintaining steady performance throughout the training phase.

Although training and validation accuracy is high in all three models, as desired, the limitations and potential failure causes in real-world applications need to be stated. During testing, the ANN model occasionally misclassified gestures when hands were partially occluded or under low lighting conditions, which affected the quality of the MediaPipe landmark detection. Moreover, gestures with similar shapes, such as a loosely held fist or a half-extended thumb, were sometimes confused with the "thumbs-up" class. These challenges suggest the need for more robustness tests in varied and uncontrolled environments.

Figure 9.
Accuracy and loss of the VGG16 model during training
Figure 10. Accuracy and loss of the ResNet50 model during training

Figure 11. Accuracy and loss of the ANN model during training

3.5. Compare three models
Figure 12 presents an overall comparison of the three models (ANN, VGG16, and ResNet50) against key performance metrics such as recognition accuracy, number of parameters, and model complexity. The ANN model achieved 99.92% accuracy with only 3,556 parameters, whereas VGG16 and ResNet50 achieved 99.96% and 100% accuracy with 27,692,612 and 75,100,804 parameters, respectively. These results clearly show that, despite having a simpler structure, the ANN model performs as well as far more complex models. To help explore these differences, we added a bar chart comparing the accuracy and number of parameters of each model, illustrating the applicability of the ANN model where computational efficiency is most critical.

The ANN model, while simple, was highly effective: compared with the more complex models it was nearly as accurate but required far fewer parameters and much less computational power. These qualities make it a highly viable candidate for edge devices or embedded systems with limited computational capability. Such a trade-off between performance and efficiency makes the proposed solution realistic and usable in real-time gesture control systems, especially in mobile gaming or assistive technology.
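As a sanity check on this comparison, Keras can report parameter counts directly. The snippet below is illustrative: it assumes the ANN `model` from the sketch in section 2.3 and the hypothetical `build_baseline` helper from section 2.5, and the standard backbones may report totals different from the paper's, depending on the classification head used.

```python
# `model` is the 20-64-32-4 ANN; `build_baseline` wraps ResNet50/VGG16.
print("ANN parameters:     ", model.count_params())  # 3,556 for 20-64-32-4
print("VGG16 parameters:   ", build_baseline("vgg16").count_params())
print("ResNet50 parameters:", build_baseline("resnet50").count_params())
```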