Indonesian Journal of Electrical Engineering and Computer Science
Vol. 39, No. 2, August 2025, pp. 1360-1372
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v39.i2.pp1360-1372

Recognizing AlMuezzin and his Maqam using deep learning approach

Nahlah Mohammad Shatnawi 1, Khalid M. O. Nahar 1, Suhad Al-Issa 2, Enas Ahmad Alikhashashneh 3
1 Department of Computer Science, Faculty of Information Technology and Computer Sciences, Yarmouk University, Irbid, Jordan
2 Department of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast BT7 1NN, UK
3 Department of Information Systems, Faculty of Information Technology and Computer Sciences, Yarmouk University, Irbid, Jordan

Article history: Received Jul 14, 2024; Revised Mar 27, 2025; Accepted Jul 2, 2025

Keywords: Aladhan; AlMuezzin; Arabic language; Deep learning; Maqam; Speech recognition; VGG-16

Abstract: Speech recognition is an important topic in deep learning, and it is particularly challenging for the Arabic language because of the nature of the language, its frequent phonetic overlap, the scarcity of available resources, and other implementation-related limitations. This paper attempts to reduce the gap between speech recognition and the Arabic language and to address it through deep learning. The focus is on the Call for Prayer (Aladhan: الأذان) as one of the most famous Arabic texts: its wording is fixed, but it differs in the notes and shape of its sound, which is known as the phonetic Maqam (المقام الصوتي). A solution is presented that identifies the voice of AlMuezzin (المؤذن) and determines the form of the Maqam through a VGG-16 model. The VGG-16 model was examined with four extracted features: the chroma feature, the LogFbank feature, the MFCC feature, and spectral centroids. The best result was obtained with chroma features, where the accuracy of Aladhan recognition reached 96%. The classification of Maqam, in turn, reached its highest accuracy of 95% using the spectral centroid feature.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Nahlah Shatnawi
Department of Computer Science, Faculty of Information Technology and Computer Sciences, Yarmouk University, 21163, Irbid, Jordan
Email: nahlah.s@yu.edu.jo

1. INTRODUCTION
Speech recognition is one of the most active research areas; it aims to identify the speaker based on the characteristics of their voice [1]. Speech recognition contributes to improving several disciplines, such as health care and security. Several state-of-the-art works have recently explored the use of feature extraction techniques to describe a massive amount of data using different feature vectors that represent different physical and acoustic meanings. Selecting a good feature helps to improve the accuracy of the recognition; thus, choosing the feature extraction technique is considered a critical step in the speaker recognition process. Currently, the most used speech characteristics are the linear prediction cepstral coefficients (LPCC) [2] and the Mel-frequency cepstral coefficients (MFCC) [3]. These features have achieved good recognition results in speech recognition [4], [5]. Traditional automatic speech recognition (ASR) systems still employ an architecture consisting of numerous components, including but not limited to lexicon building, language models, and acoustic models.
Various techniques are employed to construct and process these components, including traditional machine
learning (ML) techniques, Gaussian mixture models, hidden Markov models, deep neural networks, and hybrid HMM-DNN systems [6].

Aladhan is the call to prayer for Muslims: the AlMuezzin pronounces it every day at the beginning of the time of each of the five obligatory prayers. In the past, the AlMuezzin used to give the call to prayer from a high place, from the minaret, or from the roof of the mosque, but nowadays the AlMuezzin gives the call to prayer through amplification devices, which makes the task much easier for him. Aladhan is an announcement of the time of prayer with specific and fixed words, through which the AlMuezzin informs people of the times of prayer and invites them to pray. Aladhan is announced through specific words in the format shown in Figure 1.

Figure 1. Aladhan words format

AlMaqamat is a complex tonal system used in traditional Arabic music. Phonetic Maqams are characterized by a set of specific pitches and their own rules of performance. AlMaqamat can be used to create a wide range of musical emotions and effects. Phonetic Maqams are usually classified into two main groups: basic Maqams and sub-Maqams. The basic Maqams form the basis of the Arabic musical system; there are nine basic Maqams, including AlRust, AlNahawand, AlHijaz, AlSiyka, AlBayat, AlSaba, and AlEajam. Sub-Maqams are Maqams derived from the basic Maqams; there are many sub-Maqams, but some are more common than others. Various Maqams, each with its own distinctive melody, characterize the call to prayer in most Islamic countries, since it is a call performed in sweet Maqams appropriate to the time of the obligatory prayer and the psychological state of the people. Muezzins in mosques across the Islamic world are keen to perform this call in the best possible way, in a manner that suits Maqams adopted hundreds of years ago.

Hence the importance of this work, which links deep learning to the Arabic language and, in particular, to classical Arabic music. Aladhan was chosen as a practical application for speech recognition because it is widely known, especially among Muslims, because it contains a specific and fixed set of words whose form is stable, and because it has a specific and fixed tone within the musical Maqamat. To achieve the objective of this paper, the authors present a VGG-16 model [7] to identify AlMuezzin (المؤذن) and classify his Maqam (المقام الصوتي). This is performed by extracting a set of features from the collected dataset. Speaker-independent speech recognition is then performed with the VGG-16 model, since no two speakers have the same voice and the vocal organs differ from one speaker to another. The VGG-16 model can therefore distinguish different Adhan recordings collected from different Muezzins under several Maqamat: Al-Hejaz, Al-Sika, Al-Rust, Al-Saba, Al-Ashaq, and Nahawand.

This paper describes an interesting idea of using sound features for Muezzin identification. The proposed approach is timely and uncommon, and it could be extended to many other fields. The work in this study can be a first milestone showing the effectiveness of deep learning (DL) for this classification task and the usefulness of sound features for identifying AlMuezzin.
The rest of this paper is organized as follows: section 2 reviews the work related to the proposed approach; section 3 presents the collected dataset and the proposed method; section 4 reports the experimental results and discussion; finally, the conclusion is drawn.
2. RELATED WORK
This section presents a few works that utilize DL, ML, feature extraction, and feature reduction, and shows how they relate to this work.

The researchers of [8] collected a modern Arabic dataset to assess the performance of several DL strategies in human speech recognition (HSR). In that work, the accuracy of modular hidden Markov model-deep neural network (HMM-DNN) frameworks was compared to native-speaker performance. The comparison shows that human performance on the Arabic dialect is still significantly better than that of machines, with an absolute word error rate (WER) gap of 3.5% on average. On the other hand, the work in [9] attempts to construct a strong, robust diacritized Arabic ASR using deep learning approaches. The authors used the Standard Arabic Single Speaker Corpus (SASSC), which contains seven hours of modern standard Arabic speech, to train and test a modern CTC-based ASR, a convolutional neural network (CNN)-long short-term memory (LSTM) model, and an attention-based end-to-end approach, in order to improve diacritized Arabic ASR. From the experimental results, the researchers conclude that the CNN-LSTM with an attention framework outperforms conventional ASR and the joint CTC-attention ASR framework in the task of Arabic speech recognition.

In [10], the researchers applied a deep feed-forward neural network (DFFNN) to the Arabic natural audio dataset (ANAD), which is designed for Arabic automatic speech recognition. The ANAD dataset contains three discrete emotions: angry (A), surprised (S), and happy (H). The researchers also used eight videos of live calls between an anchor and a human outside the studio, downloaded from online Arabic talk shows, to test and evaluate the proposed approach. The target was to recognize human emotions from the sounds. They proposed an automated Arabic speech emotion recognition system that uses feature extraction to obtain the most important features from the dataset, which were then used to train the DFFNN. That investigation shows that the DFFNN achieves the highest accuracy when PCA is applied to the extracted features, with an accuracy of 98.56%. Moreover, the work in [11] proposed a speech emotion recognition system based on deep neural network-hidden Markov models (DNN-HMM) by extracting MFCC and epoch-based features. The researchers concluded that the accuracy when using MFCC features was 60.86%, whereas when using epoch-based features it was 54.52%; the recognition performance improved to 64.2% when MFCC and epoch features were combined.

Fahad et al. [12] presented a convolutional neural network for Arabic speech recognition. In that investigation, they focused on single-word Arabic automatic speech recognition (AASR). They used log Mel-frequency spectral coefficients (MFSC) and Gammatone-frequency cepstral coefficients (GFCC) with their first- and second-order derivatives. They found that the greatest accuracy, 99.77%, was obtained when using GFCC with the CNN, and the CNN achieved better overall performance in AASR. Traditional ML approaches such as the random forest (RF) were used in [13] to distinguish different speakers by extracting Mel-frequency cepstral coefficients (MFCC) and reconstructed phase space (RPS) features.
The researchers of that investigation observed that the accuracy with MFCC is higher than with RPS: the accuracy obtained using RPS features was 71%, while the accuracy obtained using MFCC features was 97%.

Another speaker-identification framework was proposed in [14] to recognize spoken sounds using particular words. The researchers extracted MFCC features and then used them as input for a recurrent neural network (RNN) and an LSTM. They found that the accuracy with multiple RNNs is 87.74%, while the accuracy with a single RNN is 80.58%. On the other hand, Utomo et al. [15] proposed automatic speaker recognition using an artificial neural network (ANN). Moreover, the work in [16] proposed text-speaker recognition to recognize what the speaker said. They used MFCC, spectrum, and log-spectrum features extracted from the speaker's sound wave; the extracted features were then used to train and evaluate LSTM and RNN models. The accuracy using MFCC was 95.33%, whereas using spectrum and log-spectrum it was 98.7%.

The authors of [17] proposed speaker identification in a noisy environment. They used a CNN to classify 60 speakers, with 4 voice samples per speaker. MFCC was used to extract features from the speech signal, and an accuracy of 87.5% was obtained. In addition, the authors of [18] proposed a speaker-identification framework using the Gaussian mixture model (GMM), with MFCC for feature extraction. The researchers extracted features and compared
them with all of the features they had previously stored. Numerous studies have been carried out comparing these two strategies (GMM-based and DNN-HMM-based) for this purpose. In Chowdary et al. [19], the authors compared extraction and normalization strategies for speakers; they used MFCC and PNCC for feature extraction. The framework was applied to six men and two women, and the speaker-identification performance was 100%; PNCC performed notably better than MFCC in recognizing women. Al-Kaltakchi et al. [20] compared extraction strategies using a feed-forward artificial neural network (FFANN) and an SVM, with several extraction methods such as MFCC, PLP, and LPC. They applied these feature extractions to the two classifiers, ANN and SVM, to discover the best feature extraction for each. The best accuracy they obtained, 100%, was achieved when using MFCC, PLP, and LPC with SVM and ANN. The work in [21] examined the efficiency of extracting and matching features using vector quantization together with MFCC in speaker-identification and speech recognition applications. They found that using vector quantization reduced the time needed to compare the input speech with the test speech.

Several papers have used speech recognition as part of smart home automation systems. For example, the work in [22] proposed a speech recognition framework that does not require the web, to assist individuals with disabilities, using Ivona; speech is converted to text and received by GSM modems. This framework allows individuals with disabilities to issue household voice commands to carry out a particular task. Another study [23] proposed a speech recognition framework for individuals with disabilities to control their wheelchairs and other devices. The researchers used MATLAB to create 8 commands (GO, HELP, START, REPEAT, ERASE, NO, ENTER, YES) by extracting MFCC features; all extracted features are stored in K-means cluster form. The average accuracy over the 8 commands is 73.54%, and the average accuracy of the listener is 82.25%. In [24], [25], the researchers constructed speaker-independent ASR systems using the DeepSpeech model and then assessed them using the WER. DeepSpeech, one of the most well-known open-source ASR models from Mozilla, was applied to Quranic recitations by male and female reciters, the target being a system that can be used efficiently by anyone, regardless of gender or age, and it obtained intriguing results. Shareef and Al-Irhayim [26] classified speech-sound errors made by speech-impaired children when Arabic sounds are pronounced incorrectly. They employed Mel-frequency spectral coefficients for feature extraction and a deep LSTM network, reaching a classification accuracy of 97.99% with a loss of 0.18%.

To the best of our knowledge, a deep learning approach has not been used for the automatic recognition of AlMuezzin. The call to prayer is a special type of speech that is announced in a mosque through a standardized set of words performed in a Maqam. Recognizing AlMuezzin together with the Maqam he follows will contribute to speaker identification, just as identifying the Maqam will contribute to the problem of speech synthesis. Based on the investigation and literature review conducted above, this paper has been prepared.
3. METHOD
The proposed method for recognizing AlMuezzin and his Maqam using a deep learning approach consists of two phases: the first phase is AlMuezzin identification, and the second phase is the acoustic-stand classification of the Adhan (Maqam classification). For each phase, a different dataset was collected and used.

In this work, the VGG16 model of [7] is used. VGG16 reaches a test accuracy of 92.7% on ImageNet, trained on almost 14 million images from 1,000 object classes, and was one of the best models in the ILSVRC-2014 competition. VGG is one of the most widely used deep learning models for image recognition. As the name implies, VGG16 is a 16-layer deep neural network; with 138 million parameters overall, it is a relatively large network, huge even by today's standards. Nevertheless, the key selling point of the VGG16 architecture is its simplicity, as it relies on the most essential convolutional neural network components.

Since the problem is to achieve both AlMuezzin identification and the acoustic-stand classification of the Adhan, the methodology is organized into two phases, as shown in Figure 2: Figure 2(a) covers AlMuezzin identification and Figure 2(b) covers Aladhan Maqam classification. After the data is collected, it is preprocessed, and then the audio spectrum is extracted using different features to train a VGG16 model. The detailed steps of the methodology are explained in the following subsections.
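As a quick, non-authoritative illustration of the backbone described above (this is not the authors' code), the pre-trained VGG16 network can be loaded from Keras and inspected to confirm its 16 weight layers and roughly 138 million parameters:

```python
# Minimal sketch: load the stock ImageNet VGG16 and inspect its size.
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=True)    # standard 224x224 ImageNet model
base.summary()                                        # lists the 16 weight layers
print(f"Total parameters: {base.count_params():,}")   # roughly 138 million
```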
Figure 2. The methodology phases: (a) AlMuezzin identification and (b) Aladhan Maqam classification

To ensure full reproducibility of the experimental setup, the methodology is structured into two clearly defined phases, AlMuezzin identification and Maqam classification, each with its own dataset and preprocessing pipeline. The experimental workflow begins with a carefully curated set of audio recordings collected from YouTube, comprising well-known Muezzins and multiple Maqam styles. These recordings were converted to WAV format (16-bit stereo, 44.1 kHz) and segmented into 20-second clips to manage memory efficiently and enhance model performance. Feature extraction was then performed using four widely validated acoustic features: MFCC, LogFBank, spectral centroid, and chroma. Each feature set was independently used to train a pre-trained VGG-16 model, allowing a comparative evaluation of their effectiveness. For AlMuezzin identification, the model was trained on 1,211 samples and tested on 295, while for Maqam classification, 287 training samples and 71 testing samples were used. The models were validated using standard performance metrics (accuracy and loss) over multiple epochs, and visualizations of training behavior (e.g., accuracy/loss curves) are included in the results section. A pictorial representation of the pipeline is shown in Figure 3, and the equations for each feature extraction method are provided in Table 1 to ensure transparency and reproducibility of the experimental design.

3.1. Dataset
The dataset was collected manually and carefully from YouTube in two stages; it is available at the link in [27]. In the first stage, audio recordings of the Aladhan by 19 famous male Muezzins were collected, for a total of 105 audio recordings. Each audio file was then converted into a WAV file (16-bit stereo, 44.1 kHz sample rate) so that the VGG-16 pipeline [7] can handle it. After that, the audio files were grouped by reciter, and a separate folder was created for each reciter; each reciter's audio files are stored in the folder carrying his name. In this way, overlap of audio files between reciters is avoided, which enables speaker-independent identification. The audio files were divided into 80% for training and 20% for testing; the larger training share ensures sufficient and good training of the system. A rough sketch of this preprocessing step is given below.
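The conversion and clipping described here can be sketched as follows (an assumed implementation, not the authors' released code; file paths and folder layout are hypothetical):

```python
# Convert each downloaded recording to 16-bit stereo 44.1 kHz WAV and split it into
# 20-second clips, stored in one folder per reciter.
from pathlib import Path
from pydub import AudioSegment

RAW_DIR = Path("raw_adhan")   # hypothetical layout: raw_adhan/<reciter>/<recording>.mp3
OUT_DIR = Path("dataset_wav")
CLIP_MS = 20 * 1000           # 20-second segments

for src in RAW_DIR.glob("*/*"):
    reciter = src.parent.name
    audio = (AudioSegment.from_file(str(src))
             .set_frame_rate(44100)   # 44.1 kHz
             .set_channels(2)         # stereo
             .set_sample_width(2))    # 16-bit samples
    out_folder = OUT_DIR / reciter
    out_folder.mkdir(parents=True, exist_ok=True)
    for i, start in enumerate(range(0, len(audio), CLIP_MS)):
        clip = audio[start:start + CLIP_MS]
        if len(clip) < 1000:          # skip trailing fragments shorter than 1 second
            continue
        clip.export(str(out_folder / f"{src.stem}_{i:03d}.wav"), format="wav")
```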
Figure 3. Graphical overview of the simulation and experimental setup

To increase the performance of the prediction, each Aladhan recording was divided into several audio tracks of 20 seconds each, because memory resources are limited and out-of-memory issues must be avoided. After deleting the empty and corrupted audio files, 1,506 audio clips remained, which were separated into 1,211 audio records for training and 295 audio records for testing. Finally, noise was removed from the data to ensure that it is clean. The target of this stage is AlMuezzin identification.

In the second stage, recordings of different calls to prayer were collected from different audio Maqams and different Muezzins, such as Hijaz, Sikka, Al-Sada', Al-Saba', Al-Ashaq, and Al-Nahawand, as in Table 2. The total number of audio recordings collected was 36. As in the first stage, each recording was divided into several audio tracks of 20 seconds each, bringing the amount of data to be trained in this phase to 358 clips. A splitting ratio of 80% for training and 20% for testing was used, so 287 clips were used for training and 71 for testing. The target of this stage is Al-Maqam identification.

3.2. AlMuezzin identification
For AlMuezzin identification, the collected data is first preprocessed as described in the dataset section; feature extraction is then performed using four distinct feature types, which are used to train a pre-trained VGG16 model. The features extracted from the speech signal for analysis are MFCC, spectral centroid, chroma, and LogFBank, as shown in Figure 4; a brief extraction sketch follows the figure.

Figure 4. Feature extraction from four different feature types
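As a rough illustration of how these four feature types can be computed (assumed code using librosa, with default or illustrative parameter values rather than the exact settings used in the experiments):

```python
# Extract MFCC, spectral centroid, chroma, and log Mel filter-bank (LogFBank-style)
# features from one 20-second clip.
import numpy as np
import librosa

y, sr = librosa.load("clip_000.wav", sr=44100, mono=True)    # hypothetical clip name

mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # MFCC
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)     # spectral centroid
chroma   = librosa.feature.chroma_stft(y=y, sr=sr)           # chroma
mel      = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
logfbank = np.log(mel + 1e-10)                               # log filter-bank energies

print(mfcc.shape, centroid.shape, chroma.shape, logfbank.shape)
```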
MFCC is so popular because its foundation is a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. The spectral centroid gives, for every frequency band, the center of gravity of its spectrum. When a piece of music has pitches that can be meaningfully categorized and is tuned close to the equal-tempered scale, it is said to have chroma. LogFBank, finally, is widely used in the robust speech recognition community. Table 1 shows how the four features are computed.

Table 1. Feature computations of MFCC, spectral centroid, chroma, and LogFBank
- MFCC: difference the signal, x'(n) = x(n) - x(n-1); window it, w(n) = x'(n) h(n); take the spectrum, X(f) = FFT(w(n)); apply a Mel filter bank to the power spectrum of the signal; take the log energies, y_k = log(sum |y_k(n)|^2); and compute the discrete cosine transform (DCT) of the log-energy outputs.
- Spectral centroid: centroid = (sum_{n=0}^{N-1} f(n) x(n)) / (sum_{n=0}^{N-1} x(n)), i.e., the weighted mean of the frequencies present in the signal.
- Chroma: computed using short-time Fourier transforms in combination with binning strategies, (f * g)[n, k] = sum_{m=0}^{M-1} f[n-m] g[m] e_k[m], where e_k[m] = e^{-j 2 pi m k / N}, M is the window length of g, and N is the number of samples in f.
- LogFBank: difference the signal, x'(n) = x(n) - x(n-1); window it, w(n) = x'(n) h(n); take the spectrum, X(f) = FFT(w(n)); apply a logarithmically spaced filter bank to the power spectrum; and take the log energies, y_k = log(sum |y_k(f)|^2).

After applying the features to the dataset, VGG16 is used to identify AlMuezzin. A VGG network is built from tiny convolution filters; VGG16 consists of thirteen convolutional layers and three fully connected layers. An overview of the VGG architecture is provided below, followed by a short adaptation sketch:
- Input: VGGNet is fed a 224 by 224 image as input.
- Convolutional layers: VGG's convolutional filters use a 3x3 receptive field, the smallest available. In addition, VGG performs a linear transformation on the input using a 1x1 convolution filter.
- ReLU activation: AlexNet's primary innovation for cutting training time is the rectified linear unit (ReLU) activation function; ReLU yields zero for negative inputs and passes positive inputs through unchanged. To maintain the spatial resolution after convolution, VGG uses a fixed convolution stride of 1 pixel (the stride value shows how many pixels the filter "moves" to cover the complete space of the picture).
- Hidden layers: unlike AlexNet, which uses local response normalization, all of the VGG network's hidden layers employ ReLU; local response normalization adds little to the overall accuracy but lengthens the training time and memory usage.
- Pooling layers: a pooling layer is placed after a series of convolutional layers to lower the dimensions and parameter count of the feature maps produced by each convolution step. Given the quick increase in the number of filters, from 64 to 128 to 256 and finally 512 in the final layers, pooling is essential.
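The following is a hedged sketch (not the authors' exact configuration) of how the pre-trained VGG-16 backbone can be adapted to this task: feature "images" (for example chroma or log filter-bank spectrograms resized to 224x224x3) are classified into the target classes; the head size, optimizer, and class count shown here are illustrative assumptions.

```python
# Re-use the ImageNet-pretrained VGG16 convolutional base and attach a new
# classification head for the Muezzin (or Maqam) classes.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

N_CLASSES = 19  # 19 Muezzins in phase 1 (6 Maqams in phase 2)

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the ImageNet convolutional features frozen

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(N_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```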
3.3. Aladhan Maqam classification
Aladhan Maqam classification is the second phase, in which the data resulting from phase 1 is used alongside the additional dataset collected as described in the dataset section. In this phase, the data is also preprocessed and filtered, and the spectral centroid is then applied for feature extraction using the equations in Table 1. In this way, the VGG-16 model can distinguish different Adhan recordings collected from different Muezzins under several Maqamat: Al-Hejaz, Al-Sika, Al-Rust, Al-Saba, Al-Ashaq, and Nahawand; details of each Maqam are given in Table 2.

Table 2. Different Muezzins under several Maqams (acoustic stand and description)
- Al-Saba: Saba is a very common maqam in Arabic music. The scale of this maqam begins on the tonic (Qarar), and the Hijaz on the third degree overlaps with it.
- Al-Hejaz: Maqam Hijaz is the main maqam in the Maqam al-Hijaz family. Its scale begins with the Hijaz genus on the tonic, followed by the Rust.
- Nahawand: Maqam Nahawand is the main maqam in the Maqam Nahawand family. Its scale begins with the Nahawand genus on the first degree (Qarar), followed by the Hijaz genus.
- Al-Rust: The Rust maqam is the main maqam in the Rust maqam family. Its scale begins with the Rust genus on the first degree (Qarar), followed by either the Nahawand genus or the higher Rust genus.
- Al-Sika: The main maqam in the Sika maqam family, but it is rarely used as an independent maqam.
- Al-Ashaq: The Egyptian Maqam Ashaq is a sub-maqam in the Maqam Nahawand family.

4. RESULTS AND DISCUSSION
At the beginning, a neural network model was built, and the identification problem was addressed using a pre-trained VGG-16 model. The model required for the identification process was generated by using 80% of the gathered data in the training phase, which was fed to VGG16. This section presents the experiments conducted and the results obtained; an illustrative harness for the per-feature experiments is sketched after Table 3. In the first phase, four different experiments were conducted to perform AlMuezzin identification. In the first experiment, the proposed model was trained using MFCC features and obtained 93% accuracy. In the second experiment, the model was trained using Logfbank features and obtained 96% accuracy. In the third experiment, the model was trained using spectral centroid features and obtained 94% accuracy. In the fourth experiment, the model was trained using chroma features and obtained 96% accuracy. All accuracy results from the conducted experiments are listed in Table 3.

Table 3. Classification accuracy by VGG16 for AlMuezzin identification using different features
Feature:  MFCC 93% | Logfbank 96% | Spectral centroid 94% | Chroma 96%
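As an illustrative harness for these four experiments (the helper functions load_feature_images and build_vgg16_classifier are hypothetical stand-ins, not the authors' code), the same VGG16-based classifier can be trained once per feature representation and the test accuracies collected:

```python
# Run one training/evaluation cycle per feature type and report test accuracy.
results = {}
for feature_name in ["mfcc", "logfbank", "spectral_centroid", "chroma"]:
    # load_feature_images() is an assumed helper returning 224x224x3 arrays and labels
    (x_train, y_train), (x_test, y_test) = load_feature_images(feature_name)
    model = build_vgg16_classifier(n_classes=19)   # as sketched in section 3.2
    model.fit(x_train, y_train, validation_split=0.1, epochs=100, batch_size=32)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    results[feature_name] = acc

for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} test accuracy = {acc:.2%}")
```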
For Adhan identification, Logfbank and chroma performed better than the other features in terms of accuracy. The accuracy and loss for each model over the 100 training epochs are shown in Figures 5-8 (each figure shows, against the number of epochs, the training and validation (a) accuracy and (b) loss).

Figure 5. The number of epochs versus (a) accuracy trends during training and validation using MFCC features and (b) loss function behavior showing the model's adaptation over epochs

Figure 6. The number of epochs versus (a) accuracy progression using chroma features and (b) loss function graph highlighting convergence and overfitting tendencies

Figure 7. The number of epochs versus (a) accuracy progression during training and validation using Logfbank features and (b) loss function behavior indicating model convergence
Figure 8. The number of epochs versus (a) validation accuracy trends across epochs using spectral centroid features and (b) loss graph demonstrating overfitting tendencies in spectral centroid-based classification

Because the validation data are a collection of fresh data points that the model is unfamiliar with, the validation accuracy is typically lower than the training accuracy, whereas the training data are data the model has already seen; this is what is observed in Figures 5-8. It therefore stands to reason that the accuracy is lower on the validation data than on the training data. In Figure 6, however, during the first epochs (approximately the first 15) the validation accuracy exceeds the training accuracy; this can be interpreted as the proposed model being a highly accurate predictor that accounts for a wide range of boundary situations. Considering that some of the data points in the validation set present a challenge to the model, the model can be considered good if its validation accuracy is approximately 80% of the training accuracy. When the accuracy on the validation data is higher than the accuracy on the training data, this can be interpreted as a good indicator that the hyperparameters were properly adjusted during training, leading to superior prediction on the validation data.

It was also found that the validation loss is significantly higher than the training loss, as shown in Figures 5-8. This is caused by overfitting of the model: while the training loss keeps decreasing, the validation loss does not continuously decrease. This indicates that the presented model is complex enough to "memorize" the patterns found in the training set. In such cases the model needs to be regularized, which is what we intend to do in upcoming work; a possible direction is sketched at the end of this section.

An experiment was also conducted to identify Al-Maqam, where a training accuracy of 95% and a validation accuracy of 74% were obtained, as shown in Figure 9 over 60 epochs.

Figure 9. The number of epochs versus (a) accuracy trends for Al-Maqam classification using spectral centroid and (b) the corresponding loss function analysis highlighting performance variation across the training phase using spectral centroid
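One hedged possibility for the regularization mentioned above (an illustrative assumption, not an experiment reported in this paper) is to add dropout and L2 weight decay to the dense head and to stop training when the validation loss stops improving:

```python
# Regularized classification head plus early stopping on the validation loss.
import tensorflow as tf
from tensorflow.keras import layers, regularizers, callbacks

head = tf.keras.Sequential([
    layers.Flatten(),
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 weight penalty
    layers.Dropout(0.5),                                     # dropout regularization
    layers.Dense(19, activation="softmax"),
])

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```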