Indonesian Journal of Electrical Engineering and Computer Science
Vol. 22, No. 2, May 2021, pp. 1096-1107
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v22i2.pp1096-1107

ArSL-CNN: A convolutional neural network for Arabic sign language gesture recognition

Ali A. Alani (1), Georgina Cosma (2)
(1) Department of Computer Science, University of Diyala, Diyala, Iraq
(2) Department of Computer Science, School of Science, Loughborough University, U.K.

Article history: Received Jan 20, 2021; Revised March 13, 2021; Accepted March 20, 2021

Keywords: Arabic sign language; CNNs; Convolutional neural networks; Deep learning; SMOTE

ABSTRACT
Sign language (SL) is a visual means of communication for people with deafness or hearing impairments. In Arabic-speaking countries, there are many Arabic sign languages (ArSL) and these use the same alphabets. This study proposes ArSL-CNN, a deep learning model based on a convolutional neural network (CNN) for translating Arabic SL (ArSL). Experiments were performed using a large ArSL dataset (ArSL2018) that contains 54,049 images of 32 sign language gestures, collected from forty participants. The first experiments with the ArSL-CNN model returned a train and test accuracy of 98.80% and 96.59%, respectively. The results also revealed the impact of imbalanced data on model accuracy. For the second set of experiments, various re-sampling methods were applied to the dataset. Results revealed that applying the synthetic minority oversampling technique (SMOTE) improved the overall test accuracy from 96.59% to 97.29%, yielding a statistically significant improvement in test accuracy (p = 0.016 < 0.05). The proposed ArSL-CNN model can be trained on a variety of Arabic sign languages and reduce the communication barriers encountered by deaf communities in Arabic-speaking countries.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Ali A. Alani
Department of Computer Science, University of Diyala, Diyala, Iraq
Email: alialani@uodiyala.edu.iq

Journal homepage: http://ijeecs.iaescore.com

1. INTRODUCTION
Sign language (SL) is a visual means of communication for people who have deafness or hearing impairments, using gestures, facial expressions, and body language [1], [2]. In 2019, the World Health Organization reported that approximately 466 million people, about 5% of the world's population, suffer from hearing impairment. Among these people, roughly 34 million are under the age of 18. A previous study predicted that this number would double by 2050 due to genetic factors, birth complications, infectious diseases, and chronic ear infections [3], [4]. Studies have been conducted to develop systems that can recognise the signs of various SLs [5]. Arabic SL (ArSL) recognition systems are currently in the development phase [6], and there exist limited SL recognition systems that can identify ArSL signs using deep learning methods. Two methods can be applied to SL recognition systems, namely, sensor-based and image-based methods [7]. Sensor-based methods require the user to wear instrumented gloves with sensors to recognise hand gestures. This approach requires interfacing multiple sensors with a glove to collect gesture data, which is then analysed for gesture recognition and translation. Despite their accuracy and reliability, sensor-based methods have
several limitations, such as discomfort in using gloves overloaded with wires, sensors and other materials worn by the signer [8], [9]. By contrast, with image-based sign language gesture recognition, signers are not required to use any kind of gloves or complicated devices. This technique provides users with more versatility than sensor-based systems. However, intensive computations are necessary in the preprocessing phase to recognise the signs. Recent studies focus on the performance of image-based approaches in recognising ArSL [8], [9]. Image-based systems for recognising human signs are complex and multidisciplinary. These systems are developed using various machine learning (ML) methods, such as artificial neural networks [10], support vector machines (SVMs) [11], and elastic graph matching [12]. Deep learning (DL) algorithms have recently boosted many research fields, including image recognition and classification. DL algorithms are ML methods that have been utilised in various applications, such as medical image classification [13] and object recognition [14]. DL models comprise a neural network with more than one hidden layer that uses various levels of distribution to represent and learn high-level abstractions of data. The objective of this study is to advance research on ArSL and to explore the capabilities of deep learning methods, specifically the convolutional neural network (CNN), for classifying ArSL gestures. This paper proposes a new deep learning model, ArSL-CNN, and explores the advantage of resampling techniques for addressing class imbalance in the dataset. The proposed ArSL-CNN is trained with images of hand signs in different lighting conditions and orientations to automatically recognise 32 ArSL signs [15]. The remainder of this paper is organised as follows. Related work is reviewed in Section 2.
The design and architecture of the proposed ArSL-CNN model are presented in Section 3. Experiments and results are discussed in Section 4. A comparison of the proposed ArSL-CNN with state-of-the-art methods is discussed in Section 5. Conclusions and future research directions are provided in Section 6.

2. RELATED WORKS
Numerous methods have been applied to SL recognition tasks. The two major approaches are handcrafted feature engineering and DL methods. The earliest known work on SL recognition focused on the extraction of hand-engineered features, which are fed to learning algorithms for classification [16]. Consequently, the efficiency of these algorithms is highly dependent on handcrafted feature engineering [17], and the accuracy results obtained using these approaches depend heavily on the extracted features. Ibrahim et al. [18] constructed a dataset containing 30 isolated words from children with hearing disabilities. The geometric features of the hands were formulated into feature vectors that were used for classification and automatic translation of the individual Arabic signs into text words. The accuracy of their proposed system reached 97%. Alzohairi et al. [9] applied the histogram of oriented gradients (HOG) feature descriptor for extracting features from ArSL image data, and then adopted the SVM algorithm for developing an ArSL image recognition system. The accuracy of their system reached 63.5%. Abdo et al. [19] applied the hidden Markov model and hand geometry with different hand shapes and forms to the task of Arabic alphabet and number sign language recognition and translation into speech or text. With deep learning, features are extracted hierarchically in an automated manner by applying a series of transformations to the input images. The extracted features are the most robust ones, which means that complex problems are effectively modelled using DL architectures. Nagi et al.
[20] proposed a hand gesture recognition system using a CNN, with morphological image processing and colour segmentation to obtain hand contour edges and eliminate noise. Their model achieved an accuracy of 96% on 6,000 sign images obtained from six gestures. Using data collected by a Kinect sensor, Tang et al. [21] used a deep belief network (DBN) and a CNN for sign language recognition, training both models on 36 different hand postures. The DBN model achieved 98.12% accuracy, which was higher than the accuracy obtained by the CNN model. Yang and Zhu [2] introduced a CNN system for the recognition of Chinese SL. The authors obtained video-based data using 40 regular vocabularies. In the preprocessing stage, the authors enhanced the hand segmentation process and prevented the loss of important information during feature extraction. Moreover, they compared two different optimizers, namely, Adagrad and Adadelta, and their results revealed that the CNN model reached better accuracy when the Adadelta optimizer was used. Oyedotun and Khashman [5] adopted two DL methods, namely, CNN and stacked denoising autoencoder (SDAE) networks, to recognise 24 ASL alphabets. The samples were collected from the freely accessible Thomas Moeslund's gesture recognition database. Their test results showed that the SDAE outperformed the CNN model in terms of overall average accuracy (92.83%). ElBadawy et al. [22] proposed a CNN-based framework for ArSL recognition to identify 25 signs. The accuracy values of this
model on the training and unseen data were 85% and 98%, respectively. Ghazanfar et al. [1] proposed different CNN architectures using 54,049 sign images of more than 40 participants provided by [15]. Their results revealed the significant effect of dataset size on the accuracy of the proposed model. Increasing the size of the dataset from 8,302 samples to 27,985 samples raised the proposed model's test accuracy from 80.3% to 93.9%; further increasing the dataset from 33,406 samples to 50,000 samples raised the test accuracy from 94.1% to 95.9%. Elsayed and Fathy [3] examined the capacity of ontology technologies (semantic web technologies) and DL to design a multiple sign language ontology for feature extraction using CNNs for the ArSL recognition task. Their findings revealed that the recognition rates on the ArSL training and testing sets were 98.06% and 88.87%, respectively. Although CNNs perform well in computer vision tasks, they require massive quantities of data to train the network. This disadvantage demands an enormous amount of time and computing capability. Several researchers use transfer learning techniques to minimise the processing time and the number of dataset samples needed to train a CNN model. Saleh and Issa [23] used transfer learning on pre-trained VGG-16 and ResNet152 networks to boost performance in identifying 32 hand gestures from the ArSL dataset. To minimise the imbalance caused by the heterogeneity of the class sizes, random undersampling was applied to the dataset to reduce the number of images from 54,049 to 25,600. Their proposed method achieved testing accuracies of 99.4% and 99.6% for VGG-16 and ResNet152, respectively.
Despite the latest developments in DL and the good precision of image classification and prediction achieved using CNNs, imbalanced data can affect the performance of prediction models. Imbalanced data can impact a model's ability to learn and its usage in real-time situations. The translation of sign language gestures into different formats, such as text and speech, should also be further investigated.

3. EXPERIMENT METHODOLOGY
This section describes the proposed ArSL-CNN architecture that was designed for classifying Arabic sign language gestures. This section also describes the ArSL dataset and the pre-processing techniques that were applied to the dataset.

3.1. Proposed ArSL-CNN architecture
CNNs have achieved several breakthroughs as a basic DL technique for image classification problems, such as object detection and hand gesture recognition [6], [21]. Table 1 shows the architecture of the proposed ArSL-CNN. Three types of layers are used in the CNN algorithm, namely, convolutional, pooling and fully connected layers. The pooling layer decreases the spatial size of its input. The complete CNN architecture is obtained by stacking several of the abovementioned layers. The ArSL-CNN model is composed of seven convolutional layers, four batch normalisation (BN) layers, four pooling layers, five dropout layers and one fully connected layer with rectified linear unit (ReLU) activation. The model ends with an output layer with a softmax activation function that yields the probability distribution over classes, as shown in Figure 1.

Figure 1. Architecture of the proposed ArSL-CNN model
Table 1 lists the detailed dimensions of each layer and operation. The first and second layers are convolutional layers that contain 32 feature maps with a kernel size of 3x3, activated with ReLUs. The next layer is a BN layer, which aims to achieve a stable distribution of the activation values throughout training by normalising the inputs to a layer [24]. The fourth layer is a max pooling layer with a pool size of 2x2; the objective of this layer is to decrease the number of parameters, minimising overfitting and computation time. The fifth layer is a dropout regularisation layer with the rate set to 10%. The next layers are the third and fourth convolutional layers with 64 feature maps, a kernel size of 3x3 and ReLU activation, followed by another max pooling layer with a pool size of 2x2, a BN layer and a dropout layer with the rate set to 10%. The next layers are the fifth and sixth convolutional layers with 128 feature maps, a kernel size of 3x3 and ReLU activation, followed by another pooling layer with a pool size of 2x2, a BN layer and a dropout layer with the rate set to 30%. The last convolutional layer in the network follows these regularisation layers. This layer, with a ReLU activation function, comprises 256 feature maps and has a kernel size of 3x3. The next layers are a max pooling layer with a size of 2x2, a BN layer and another dropout layer with the rate set to 30%. The ArSL-CNN network architecture ends with fully connected units that contain a flatten layer, one fully connected layer, one dropout layer and the output layer. The flatten layer converts the 2D feature maps into a vector so that the final output can be processed by standard fully connected layers.
The fully connected layer contains 512 neurons with ReLU activation. The dropout layer excludes 50% of the neurons. The last layer is the output layer, which contains 32 neurons and is activated with a softmax function. The ArSL-CNN model is trained in a fully supervised way, and its parameters are optimised by minimising the cross-entropy loss function with the Adam version of stochastic gradient descent.

Table 1. Parameters of the ArSL-CNN architecture
Layer           | Layer configuration            | # Parameters
Convolution 1   | 32 filters, 3x3 kernel, ReLU   | 320
Convolution 2   | 32 filters, 3x3 kernel, ReLU   | 9,248
Batch Norm. 1   | -                              | 128
Max-pooling 1   | 2x2 kernel                     | 0
Dropout 1       | 0.1                            | 0
Convolution 3   | 64 filters, 3x3 kernel, ReLU   | 18,496
Convolution 4   | 64 filters, 3x3 kernel, ReLU   | 36,928
Batch Norm. 2   | -                              | 256
Max-pooling 2   | 2x2 kernel                     | 0
Dropout 2       | 0.1                            | 0
Convolution 5   | 128 filters, 3x3 kernel, ReLU  | 73,856
Convolution 6   | 128 filters, 3x3 kernel, ReLU  | 147,584
Batch Norm. 3   | -                              | 512
Max-pooling 3   | 2x2 kernel                     | 0
Dropout 3       | 0.3                            | 0
Convolution 7   | 256 filters, 3x3 kernel, ReLU  | 295,168
Batch Norm. 4   | -                              | 1,024
Max-pooling 4   | 2x2 kernel                     | 0
Dropout 4       | 0.3                            | 0
Flatten         | 4,096 neurons                  | 0
Fully connected | 512 neurons                    | 2,097,664
Dropout         | 0.5                            | 0
Output layer    | Softmax, 32 classes            | 16,416

3.2. Dataset description
The proposed ArSL-CNN architecture is trained and tested on ArSL2018 [15], an Arabic sign language (ArSL) dataset. The ArSL2018 dataset aims to provide an opportunity for researchers to develop automated ArSL recognition systems based on different machine learning methods. The original dataset consists of 54,049 RGB images distributed over 32 classes, with the signs collected from more than 40 participants. The RGB images have different dimensions, and many variations of the images were introduced through the use of different lighting and backgrounds. Samples of the dataset can be seen in Figure 2.
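The parameter counts in Table 1 follow from the standard formulas: a 2D convolution has filters x (k x k x in_channels) weights plus one bias per filter, and a batch-normalisation layer has four parameters per channel (gamma, beta, moving mean, moving variance). A short Python sketch (written here for illustration; the function names are not from the paper) reproduces the table's counts, which also imply that the final convolutional layer has 256 filters, consistent with the 256 feature maps described in the text:

```python
def conv2d_params(filters, kernel, in_channels):
    """Conv2D parameters: one (kernel x kernel x in_channels) weight tensor
    per filter, plus one bias per filter."""
    return filters * (kernel * kernel * in_channels) + filters

def batchnorm_params(channels):
    """Gamma, beta, moving mean and moving variance per channel."""
    return 4 * channels

def dense_params(units, in_features):
    """Fully connected layer: weight matrix plus one bias per unit."""
    return units * in_features + units

# Greyscale 64x64 input -> one channel into Convolution 1
print(conv2d_params(32, 3, 1))      # Convolution 1: 320
print(conv2d_params(32, 3, 32))     # Convolution 2: 9248
print(batchnorm_params(32))         # Batch Norm. 1: 128
print(conv2d_params(256, 3, 128))   # Convolution 7: 295168
print(dense_params(512, 4096))      # Fully connected: 2097664
print(dense_params(32, 512))        # Output layer: 16416
```

The flatten size of 4,096 likewise follows from the four 2x2 poolings reducing 64x64 inputs to 4x4 maps with 256 channels (4 x 4 x 256 = 4,096).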
Figure 2. ArSL2018 dataset samples [15]

3.3. Image pre-processing
Data preprocessing is the application of various morphological operations to eliminate noise from the data. The ArSL2018 dataset includes sign language gesture images with different dimensions that were taken under varied illumination. Therefore, image preprocessing techniques are necessary to remove noise from the data before feeding it to the network. All sign images are first converted into greyscale images with dimensions of 64x64 to permit real-time classification. The greyscale colour-space conversion allows operation on one channel only rather than processing the three RGB channels; this reduces the number of parameters in the first convolutional layer and the computational time. To increase the efficiency of the computation process and the speed of the training stage, all images are normalised so that pixel values range from 0 to 1. The images are then standardised by subtracting their means and scaling them to unit variance. To generate the training and testing sets, images are randomly selected from the dataset. The dataset is split into training (80%) and testing (20%) sets, with 20% of the training portion held out for validation. Figure 3 shows that the number of samples per class in the dataset is not balanced. Therefore, various resampling techniques have been applied to solve the imbalance problem amongst the classes. The details of this process are presented in Section 4.2.

Figure 3. Number of samples in each class

4. RESULTS AND DISCUSSION
The experiments were conducted using the Keras library and the Python programming language, running on the TensorFlow backend. The ArSL-CNN model was trained on a machine with an NVIDIA K80 graphics processing unit (GPU) with 12 GB of memory, 64 GB of random access memory, and a 100 GB solid-state drive.
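The pre-processing pipeline above (greyscale conversion, scaling to [0, 1], then standardisation to zero mean and unit variance) can be sketched in a few lines of NumPy. This is an illustrative assumption of how such a pipeline might look, not the authors' code; in particular, the ITU-R BT.601 luminance weights and the function names are my own choices:

```python
import numpy as np

def to_greyscale(rgb):
    """Collapse an (H, W, 3) RGB image to one channel using
    ITU-R BT.601 luminance weights (an assumed choice)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def preprocess(images):
    """Scale 8-bit pixels to [0, 1], then standardise the batch
    to zero mean and unit variance."""
    x = images.astype(np.float64) / 255.0
    return (x - x.mean()) / x.std()

# A toy batch of four random 64x64 RGB "sign" images
rgb_batch = np.random.randint(0, 256, size=(4, 64, 64, 3))
grey = np.stack([to_greyscale(img) for img in rgb_batch])  # shape (4, 64, 64)
x = preprocess(grey)
print(x.shape)
```

Working on one greyscale channel instead of three RGB channels is what shrinks the first convolutional layer's input depth (and hence its parameter count), as noted in the text.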
To introduce randomness, the training dataset was shuffled before being fed to the network, to avoid bias towards certain parameters. The effectiveness of the proposed model was evaluated in two independent experiments: (1) the proposed ArSL-CNN model was trained and tested using the original ArSL2018 dataset; and (2) the model was trained and tested using different resampling techniques to address the imbalance problem amongst the classes. The accuracy metric was adopted to determine the efficiency of the proposed CNN approach. In formula (1), A denotes the accuracy, and TC and FC represent the number of correctly and incorrectly classified
instances, respectively. The calculated value is multiplied by 100 to turn it into a percentage.

A = TC / (TC + FC) x 100    (1)

For a single class, the accuracy can be determined using (2):

Ac = TCc / (TCc + FCc) x 100    (2)

where TCc represents the number of correctly classified instances from class c, and FCc represents the number of incorrectly classified instances from class c. The final value is multiplied by 100 to obtain the percentage accuracy for each class.

4.1. Performance evaluation of the proposed ArSL-CNN model
The performance of the proposed ArSL-CNN model on the original ArSL2018 dataset is presented in Table 2. The dataset consists of 54,049 images distributed over 32 ArSL gesture groups in a unified format. The training data were divided into batches of 128 samples each. The input and output layers have 4,096 and 32 neurons, respectively. The proposed ArSL-CNN model was trained for multiple learning epochs. The training and testing accuracy values are summarised in Table 2. ArSL-CNN achieved the highest testing accuracy (96.59%) at 500 learning epochs. Figure 4 depicts the model accuracy when the proposed ArSL-CNN model is trained for 500 epochs. The training and testing performances remain close to each other across epochs, which indicates that the model has not been overtrained.

Table 2. Classification accuracy and training time (minutes) obtained using the original ArSL2018 dataset
No. Epochs | Training Acc. (%) | Testing Acc. (%) | Training Time (mins)
500        | 98.80             | 96.59            | 66.1

Figure 4. Accuracy of the proposed ArSL-CNN model obtained using the original ArSL2018 dataset

Table 3 reports the accuracy of all 32 classes. From the table it can be observed that the number of testing samples varies considerably across the classes.
It can also be observed that classes with the highest numbers of samples achieved better accuracy than classes with fewer samples. For instance, the 'Waw' class contains 259 testing samples and its accuracy was 94.21%, whereas the 'Ayn' class contains 405 testing samples and its accuracy was 97.78%. These results reveal that an imbalanced distribution of samples between classes may impact the performance of the models; in some cases, the model will learn the classes that have more samples better than those with fewer. Therefore, it is important to apply techniques that can handle the imbalance problem between classes and to determine whether these techniques can improve classification performance, especially for the classes that contain smaller sample sizes. Accordingly, resampling (over-sampling and under-sampling) methods are applied to the dataset and their impact on the performance of the ArSL-CNN model is explored (results are discussed in Section 4.2.).
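Formulas (1) and (2) are simple counting metrics, and can be sketched in a few lines of Python (the function names are my own, not from the paper), checked against figures from Table 3:

```python
def overall_accuracy(tc, fc):
    """Formula (1): correctly classified over all instances, as a percentage."""
    return tc / (tc + fc) * 100

def class_accuracy(tc_c, fc_c):
    """Formula (2): per-class accuracy, as a percentage."""
    return tc_c / (tc_c + fc_c) * 100

# Figures from Table 3 (no resampling): 343 of 354 'Alif' test samples
# and 244 of 259 'Waw' test samples were classified correctly.
print(round(class_accuracy(343, 354 - 343), 2))  # 96.89
print(round(class_accuracy(244, 259 - 244), 2))  # 94.21
```

Note that, per the footnote to Table 3, the reported 96.59% average is the unweighted mean of the 32 per-class accuracies computed with (2), which differs slightly from the micro-averaged accuracy of formula (1) over all 10,810 test samples.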
Table 3. ArSL-CNN accuracy on the test data using the original ArSL2018 dataset before applying sampling techniques (no sampling)
Class No. | Class Name | #S (a) | #SCC (b) | Accuracy
0  | Alif  | 354 | 343 | 96.89
1  | Ba    | 314 | 310 | 98.73
2  | Ta    | 372 | 364 | 97.85
3  | Tha   | 364 | 350 | 96.15
4  | Jim   | 313 | 302 | 96.49
5  | Ha    | 299 | 286 | 95.65
6  | Kha   | 337 | 320 | 94.96
7  | Dal   | 295 | 285 | 96.61
8  | Dhal  | 328 | 318 | 96.95
9  | Ra    | 310 | 303 | 97.74
10 | Zay   | 265 | 259 | 97.74
11 | Sin   | 336 | 317 | 94.35
12 | Shin  | 316 | 294 | 93.04
13 | Sad   | 388 | 380 | 97.94
14 | Dad   | 361 | 350 | 96.95
15 | Taa   | 355 | 349 | 98.31
16 | Za    | 362 | 356 | 98.34
17 | Ayn   | 405 | 396 | 97.78
18 | Ghayn | 376 | 361 | 96.01
19 | Fa    | 391 | 377 | 96.42
20 | Qaf   | 335 | 330 | 98.51
21 | Kaf   | 371 | 362 | 97.57
22 | Lam   | 354 | 340 | 96.05
23 | Mim   | 346 | 338 | 97.69
24 | Nun   | 353 | 349 | 98.87
25 | Ha    | 325 | 318 | 97.85
26 | Waw   | 259 | 244 | 94.21
27 | Ya    | 365 | 352 | 96.44
28 | Taa   | 374 | 345 | 92.25
29 | Al    | 261 | 247 | 94.64
30 | Laa   | 357 | 345 | 96.64
31 | Yaa   | 269 | 256 | 95.17
Total/Average (c) | 10810 | 10446 | 96.59
(a) Number of samples in the test data. (b) Number of samples correctly classified. (c) The average is calculated by formula (2).

4.2. Results when using the ArSL-CNN model with oversampling and undersampling methods
The number of images per class in the ArSL2018 dataset is shown in Figure 3. As previously mentioned, the classes contain different sample sizes, and such discrepancies result in an imbalance amongst the classes. The imbalance issue can have a negative effect on the classification results. To overcome this issue and reduce bias, resampling methods, which fall into two groups, oversampling and undersampling [25], have been applied to balance the class distribution. The oversampling approach addresses the imbalance amongst the classes by generating synthetic samples from the minority-class samples. This approach can effectively improve classification efficiency; however, increasing the number of samples in the minority classes will increase the training time. The oversampling process has two variations.
The first is random minority oversampling (RMO), which randomly duplicates the minority-class samples. The second is the synthetic minority oversampling technique (SMOTE), a more sophisticated sampling technique that overcomes class imbalance by artificially generating samples through the interpolation of neighbouring data points [26]. The other method used for adjusting the balance of samples across the ArSL2018 dataset classes was random minority undersampling (RMU). The RMU strategy involves the random deletion of samples from majority classes until the dataset is balanced; a major drawback of this strategy is the possible loss of useful information. To correct the balance of samples amongst the classes in the ArSL2018 dataset, three resampling techniques, namely RMO, SMOTE, and RMU, were applied to the dataset, and experiments were carried out to evaluate their impact on the task. Table 4 shows the results obtained using the three resampling methods. The findings reveal that the efficiency of the proposed ArSL-CNN model increases after applying the resampling techniques. The proposed model achieves training and testing accuracies of 99.14% and 97.21%, respectively,
by using the random oversampling method. The training and testing accuracy values after applying the undersampling method are 99.27% and 97.07%, respectively. By using SMOTE, the model obtains training and testing accuracies of 98.94% and 97.29%, respectively. This result implies that SMOTE outperforms the other two resampling methods in terms of testing accuracy. The highest testing accuracy (97.29%) is achieved using SMOTE. This accuracy is higher than that obtained by applying the ArSL-CNN architecture to the original dataset (96.59%). These findings highlight the importance of having a balanced number of samples in each class for achieving high classification accuracy and minimising overfitting; classes with small numbers of samples will reduce the accuracy of the proposed model.

Table 4. ArSL-CNN accuracy on the test data using the ArSL2018 dataset after applying sampling techniques
Resampling Technique | No. Epochs | Training Acc. (%) | Testing Acc. (%) | Training Time (mins)
RMU   | 500 | 99.27 | 97.07 | 66.1
RMO   | 500 | 99.14 | 97.21 | 134.4
SMOTE | 500 | 98.94 | 97.29 | 141.9

Figure 5 shows the confusion matrix generated by training the proposed ArSL-CNN model with SMOTE for 500 epochs. The diagonal elements in the confusion matrix reflect the number of correctly labelled images, whereas the off-diagonal elements denote the mislabelled images. The greater the sum of the diagonal values of the confusion matrix, the higher the accuracy of the classification. The accuracy of the proposed ArSL-CNN model over the learning epochs after applying SMOTE is illustrated in Figure 6. The results reveal that the accuracy of the model on the training and testing sets increases over the learning epochs.

Figure 5. Confusion matrix of the proposed ArSL-CNN model with SMOTE

Figure 6.
Accuracy of the proposed ArSL-CNN model with SMOTE

Furthermore, the accuracy per class is stated in Table 5. The experimental results show that ArSL-CNN obtained better classification efficiency when the RMU, RMO and SMOTE resampling methods were applied. For instance, the 'Waw' class contained 259 testing samples before the SMOTE
resampling method was applied, and its accuracy was 94.21%. After applying the SMOTE resampling method, the number of testing samples increased from 259 to 422, which raised the accuracy from 94.21% to 98.34%. These results confirm the strong impact of applying the SMOTE resampling method in solving the imbalance problem and improving the overall accuracy of the proposed model.

4.3. Statistical analysis of the impact of the resampling methods applied to the ArSL2018 dataset on the performance of ArSL-CNN
Table 6 provides descriptive statistics of the test accuracy results when various sampling methods are applied to the ArSL2018 dataset. In Table 6, the first column is the sampling method applied to the dataset. The second column holds the mean test accuracy values across the 32 classes. The third column holds the standard deviation values, which are a useful indicator of the stability of the model. The fourth and fifth columns show the minimum and maximum test accuracy values obtained, and the last three columns hold information about the test accuracy percentiles.

Table 5.
ArSL-CNN accuracy on the test data after applying the RMU, RMO and SMOTE methods
Class Name | RMU #S (a) / #SCC (b) / Acc. | RMO #S (a) / #SCC (b) / Acc. | SMOTE #S (a) / #SCC (b) / Acc.
Alif  | 351 / 348 / 99.15 | 423 / 419 / 99.05  | 423 / 416 / 98.35
Ba    | 366 / 361 / 98.63 | 362 / 348 / 96.13  | 362 / 351 / 96.96
Ta    | 365 / 359 / 98.36 | 404 / 390 / 96.53  | 404 / 390 / 96.53
Tha   | 342 / 333 / 97.37 | 438 / 426 / 97.26  | 438 / 428 / 97.72
Jim   | 298 / 288 / 96.64 | 412 / 401 / 97.33  | 412 / 392 / 95.15
Ha    | 266 / 254 / 95.49 | 428 / 411 / 96.03  | 428 / 414 / 96.73
Kha   | 337 / 324 / 96.14 | 451 / 438 / 97.12  | 451 / 436 / 96.67
Dal   | 324 / 308 / 95.06 | 408 / 394 / 96.57  | 408 / 401 / 98.28
Dhal  | 290 / 284 / 97.93 | 449 / 436 / 97.10  | 449 / 430 / 95.77
Ra    | 330 / 323 / 97.88 | 443 / 436 / 98.42  | 443 / 437 / 98.65
Zay   | 266 / 253 / 95.11 | 415 / 398 / 95.90  | 415 / 406 / 97.83
Sin   | 306 / 285 / 93.14 | 440 / 417 / 94.77  | 440 / 411 / 93.41
Shin  | 305 / 285 / 93.44 | 438 / 422 / 96.35  | 438 / 428 / 97.72
Sad   | 332 / 327 / 98.49 | 418 / 413 / 98.80  | 418 / 411 / 98.33
Dad   | 347 / 335 / 96.54 | 428 / 421 / 98.36  | 428 / 422 / 98.60
Taa   | 363 / 354 / 97.52 | 405 / 402 / 99.26  | 405 / 400 / 98.77
Za    | 333 / 327 / 98.20 | 441 / 439 / 99.55  | 441 / 436 / 98.87
Ayn   | 420 / 413 / 98.33 | 411 / 402 / 97.81  | 411 / 401 / 97.57
Ghayn | 390 / 380 / 97.44 | 421 / 411 / 97.62  | 421 / 404 / 95.96
Fa    | 399 / 389 / 97.49 | 406 / 387 / 95.32  | 406 / 394 / 97.04
Qaf   | 339 / 328 / 96.76 | 389 / 378 / 97.17  | 389 / 376 / 96.66
Kaf   | 342 / 338 / 98.83 | 422 / 402 / 95.26  | 422 / 414 / 98.10
Lam   | 347 / 341 / 98.27 | 446 / 432 / 96.86  | 446 / 433 / 97.09
Mim   | 347 / 341 / 98.27 | 449 / 442 / 98.44  | 449 / 440 / 98.00
Nun   | 360 / 357 / 99.17 | 422 / 422 / 100.00 | 422 / 420 / 99.53
Ha    | 304 / 299 / 98.36 | 432 / 423 / 97.92  | 432 / 424 / 98.15
Waw   | 269 / 258 / 95.91 | 422 / 414 / 98.10  | 422 / 415 / 98.34
Ya    | 314 / 305 / 97.13 | 424 / 409 / 96.46  | 424 / 411 / 96.93
Taa   | 332 / 320 / 96.39 | 428 / 404 / 94.39  | 428 / 402 / 93.93
Al    | 262 / 255 / 97.33 | 394 / 387 / 98.22  | 394 / 385 / 97.72
Laa   | 363 / 343 / 94.49 | 413 / 395 / 95.64  | 413 / 398 / 96.37
Yaa   | 274 / 266 / 97.08 | 442 / 428 / 96.83  | 442 / 431 / 97.51
Total/Average (c) | 10583 / 10281 / 97.07 | 13524 / 13147 / 97.21 | 13524 / 13157 / 97.29
(a) Number of samples in the test data. (b) Number of samples correctly classified. (c) The average is calculated by (2).

Table 6. Descriptive statistics of the test results when applying various sampling methods to the dataset
De viation Minimum% Maximum% P er centiles 25th 50th (Median) 75th No sampling 96.59 1.64 92.25 98.87 95.74 96.77 97.83 SMO TE 97.29 1.37 93.41 99.53 96.66 97.65 98.32 RMU 97.07 1.57 93.14 99.17 96.20 97.41 98.32 RMO 97.21 1.40 94.39 100.00 96.19 97.15 98.33 Indonesian J Elec Eng & Comp Sci, V ol. 22, No. 2, May 2021 : 1096 1107 Evaluation Warning : The document was created with Spire.PDF for Python.
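The descriptive statistics in Table 6 and the boxplot outlier behaviour discussed in this section can be reproduced from the per-class SMOTE accuracies transcribed from Table 5. The sketch below is illustrative, not the authors' code; it assumes NumPy's default linear-interpolation percentiles, which may differ marginally from the SPSS-style percentiles reported in Table 6, and flags outliers with Tukey's 1.5 x IQR rule, the convention boxplots use.

```python
import numpy as np

# Per-class test accuracies (%) of ArSL-CNN with SMOTE, transcribed from Table 5.
smote_acc = np.array([
    98.35, 96.96, 96.53, 97.72, 95.15, 96.73, 96.67, 98.28,
    95.77, 98.65, 97.83, 93.41, 97.72, 98.33, 98.60, 98.77,
    98.87, 97.57, 95.96, 97.04, 96.66, 98.10, 97.09, 98.00,
    99.53, 98.15, 98.34, 96.93, 93.93, 97.72, 96.37, 97.51,
])

mean = smote_acc.mean()
std = smote_acc.std(ddof=1)                  # sample standard deviation, as in Table 6
q1, q3 = np.percentile(smote_acc, [25, 75])  # NumPy's default linear interpolation
iqr = q3 - q1

# Tukey's rule: points beyond 1.5 * IQR from the quartiles are outliers.
outliers = smote_acc[(smote_acc < q1 - 1.5 * iqr) | (smote_acc > q3 + 1.5 * iqr)]
print(f"mean={mean:.2f}%, std={std:.2f}, outliers={sorted(outliers.tolist())}")
# prints: mean=97.29%, std=1.37, outliers=[93.41, 93.93]
```

The two flagged values, 93.41% and 93.93%, match the two SMOTE outliers visible in Figure 7, and the mean and standard deviation agree with the SMOTE row of Table 6.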
Table 6 shows that the mean test accuracy value reached its highest, i.e. 97.29%, when the SMOTE resampling method was applied. With SMOTE, the proposed model also achieved the lowest standard deviation value, i.e. 1.37, which suggests that applying SMOTE to the dataset results in a more stable prediction model. With RMO, the maximum test accuracy and the 75th percentile values were slightly higher than those of SMOTE; however, the higher standard deviation value of RMO suggests that using RMO results in a less stable model.

The boxplots in Figure 7 illustrate the distribution of the test accuracy values with various sampling methods. Each boxplot in Figure 7 holds 32 values, where each value corresponds to a test accuracy value of the model for a particular class (note that there are 32 classes in the dataset, as shown in Figure 3). SMOTE has two outliers, and 'no sampling' and RMU have one outlier each, as shown in Figure 7. It is important to mention that the minimum value of SMOTE, i.e. 93.41%, is an outlier value as shown in Figure 7, and if the outliers are removed from SMOTE and RMU, then the minimum test accuracy value of SMOTE is the highest of all sampling methods, reaching 95.15%. Table 7 shows that applying sampling methods improves ArSL-CNN's performance, and the results suggest that the best performance is achieved when SMOTE is applied to the dataset.

[Figure 7: boxplots of the per-class accuracy values (%) for each resampling method (before resampling, SMOTE, RMU, RMO); vertical axis spans 92-100%]

Figure 7.
Boxplot of test accuracy values when various sampling methods are applied to the dataset

To determine whether the observed improvements in ArSL-CNN's performance when SMOTE and the other sampling methods are adopted are statistically significant at alpha = 0.05, the non-parametric Wilcoxon signed-ranks test is applied to the test accuracy values obtained after applying the resampling methods to the dataset (see Table 7). The results revealed that when applying SMOTE, there is a statistically significant improvement in test accuracy (Z = -2.412, p = 0.016). There was also a weaker, yet significant, improvement in performance when applying the RMU and RMO sampling methods, with p = 0.042 and p = 0.036 respectively. However, SMOTE achieved the most significant statistical improvement, as indicated by the lowest p-value. In conclusion, applying SMOTE resampling to adjust the class imbalance of the dataset significantly improves the test prediction accuracy of the model.

Table 7. Results of the Wilcoxon signed ranks test applied to the test results

Test statistics(a)       No sampling vs. SMOTE   No sampling vs. RMU   No sampling vs. RMO
Z                        -2.412(b)               -2.029(b)             -2.094(b)
Asymp. Sig. (2-tailed)    0.016                   0.042                 0.036

a. Wilcoxon signed ranks test. b. Based on positive ranks. c. Based on negative ranks.

5. COMPARISON WITH STATE-OF-THE-ART METHODS

Table 8 compares the performance of the proposed approach with existing state-of-the-art techniques on the ArSL2018 dataset in terms of accuracy. The findings indicate that the proposed ArSL-CNN model, when applying SMOTE resampling to the dataset, is superior to two state-of-the-art methods in terms of overall accuracy [1], [3]. Ghazanfar et al. [1] used a CNN and achieved an accuracy of 95.9%, whereas Elsayed and Fathy [3] applied semantic DL and obtained an accuracy of 88.8%. In comparison, our proposed method achieves an overall accuracy of 97.29%.
This result highlights the importance of providing a balanced number of samples per class to enhance the generalisation ability of the CNN when training DL models.
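The paired significance test reported in Table 7 can be sketched with SciPy's `scipy.stats.wilcoxon`. This is a minimal illustration, not the authors' code: the two 32-element accuracy vectors below are synthetic stand-ins, since the paper does not reproduce the raw per-class accuracies for the "no sampling" baseline.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)

# Synthetic stand-ins for the 32 per-class test accuracies (%).
baseline = rng.uniform(92.0, 99.0, size=32)            # "no sampling" baseline
resampled = baseline + rng.uniform(0.2, 0.8, size=32)  # every class improves

# Paired, non-parametric comparison of the two accuracy vectors (alpha = 0.05).
res = wilcoxon(resampled, baseline)
print(f"W = {res.statistic}, p = {res.pvalue:.2e}")
# p falls far below 0.05 here because all 32 paired differences are positive.
```

The test is paired because each difference compares the same class under two conditions, and non-parametric because the 32 per-class accuracies are not assumed to be normally distributed, which matches the reasoning behind Table 7.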