Indonesian Journal of Electrical Engineering and Computer Science
Vol. 41, No. 3, March 2026, pp. 1049–1059
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v41.i3.pp1049-1059

A rich and balanced phonetics corpus for modern standard Arabic ASR systems

Youssef Boutazart, Naouar Laaïdi, Abderrahim Ezzine, Hassan Satori, Mohamed Taj Bennani
Department of Mathematics and Computer Science, Faculty of Sciences Dhar-Mahraz (FSDM), Sidi Mohamed Ben Abdellah University (USMBA), Fez, Morocco

Article Info
Article history: Received Sep 11, 2024; Revised Jan 12, 2026; Accepted Feb 27, 2026
Keywords: Modern standard Arabic; Phonetically balanced corpus; Phonetically rich corpus; Segmentation grapheme to phoneme; Zipf's law

Abstract
This research delves into the creation of an innovative Modern Standard Arabic corpus, aiming for a comprehensive balance and richness while adhering to Zipf's law. Building a phonetically diverse Arabic sentence collection yields significant advantages in terms of efficiency, cost-effectiveness, and storage capacity compared to conventional corpora. The corpus undergoes meticulous segmentation into graphemes, which are then manually converted into phonemes, resulting in a total of 19769 phonemic units. Among these phonemes, consonants like 'Laam - l' account for 10%, while 'Fatha - A' vowels constitute 20%. Evaluation of this corpus using an automatic speech recognition (ASR) system reveals a sentence error rate (SER) of 30% and a word error rate (WER) of 15%. Furthermore, statistical analysis unveils that diacritic marks encompass 47.59% of the corpus, with graphemes comprising the remaining 52.41%. These diacritized marks provide valuable insights into the precise phonetic transcription of the corpus. Additionally, the study provides detailed breakdowns of consonants based on their place and manner of articulation, enhancing our understanding of phonetic structures.
This is an open access article under the CC BY-SA license.
Journal home page: http://ijeecs.iaescore.com

Corresponding Author: Hassan Satori
Department of Mathematics and Computer Science, Faculty of Sciences Dhar-Mahraz (FSDM), Sidi Mohamed Ben Abdellah University (USMBA)
FSDM, USMBA, B.P. 1796, Fez, Morocco
Email: hassan.satori@usmba.ac.ma

1. INTRODUCTION
A corpus is a digital collection of natural language samples, used to discover the structure and patterns of a language. Researchers analyze it to study word order, the arrangement of sentences and clauses, and the use of grammatical structures [1], [2]. Corpus development is a pivotal element in various disciplines, including natural language processing (NLP) and automatic speech recognition (ASR) [3]-[5]. Developing any corpus demands considerable resources and effort. Consequently, there is growing interest in crafting phonetically diverse and balanced text and speech corpora [6], [7]. A rich and balanced corpus is a valuable source for developing ASR systems, as it allows the system to be trained on a variety of speech patterns and accents [8], [9]. Reflecting this interest, the authors in [10] introduced an automated approach for constructing a phonetically rich and balanced corpus sourced from the web, selecting 6082 phrases to develop a robust recognizer tool. Similarly, Wang [11] conducted a statistical analysis of various Mandarin acoustic units using an extensive Chinese text corpus gathered from daily newspapers. Following this analysis, Wang proposed an algorithm to automatically extract phonetically rich sentences from the corpus, which are then utilized
for training and evaluating a Mandarin speech recognition system. Radová and Vopálka [12] address the challenge of phonetically balanced sentence selection, presenting two iterative procedures to choose sentences that accurately reflect the occurrence of phonetic events in natural speech, resulting in a set of 40 phonetically balanced sentences. In the same way, Matoušek and Romportl [13] propose a method for preparing and recording a phonetically and prosodically rich Czech corpus for text-to-speech synthesis. They implement an algorithm that selects sentences based on both phonetic and prosodic criteria, including the random selection of paragraphs to capture supra-sentential prosody phenomena. For English, Yazawa [14] developed a set of 720 phonemically balanced phrases for English learners, selecting 50 core vocabulary words based on the Harvard New General Service List (NGSL). However, preparing and selecting suitable sentences and words poses significant challenges in ensuring comprehensive linguistic representation and maintaining the desired phonetic diversity. While abundant databases exist for major languages like English, German, French, and Mandarin [15], [16], the task is considerably more complex for underrepresented languages such as Modern Standard Arabic (MSA).

Regarding MSA, Alqudah et al. [17] recently examined Arabic automatic speech recognition (ASR) for speakers with speech disorders (SD), identifying research gaps and highlighting the need for comprehensive ASR systems that address various SD types and continuous speech in Arabic. Alghamdi et al. [18] proposed a manually written Arabic corpus based on a phonetically rich and balanced list of 663 words. This was one of the first works on the production of this type of corpus. The database consists of 367 sentences, with 2 to 9 words per sentence. Later, in 2012, Abushariah et al.
[19] described the preparation, recording, analysis, and evaluation of a new speech corpus for MSA. The sentences used contained all phonemes and preserved the phonetic distribution of the Arabic language. Yuwan and Lestari [20] explained that creating a phonetically rich and balanced corpus not only makes the system more robust and intelligent but also saves time, cost, and storage capacity. They collected verses as a speech corpus for a Quranic recognition system with special symbols, selecting 180 of the 6236 verses in the Quran.

Our primary goal is to develop a rich and balanced corpus. We prioritize readability and pronunciation by incorporating phonetically rich and balanced, structurally simple sentences. The corpus collection encompasses diverse Arabic texts from various sources. This deliberate selection adheres to the 50 most prevalent words in the Arabic language, ensuring compliance with Zipf's law, which states that the frequency of a word in a text is inversely proportional to its rank in a frequency table [21], [22]. In this study, we introduce an approach for constructing a novel rich and balanced modern standard Arabic corpus, developed at the Faculty of Sciences Dhar el Mehraz of University Sidi Mohammed Ben Abdellah (FSDM-USMBA). This approach streamlines the process, saving time, costs, and storage capacity compared to conventional corpora collection. The corpus adheres to Zipf's law by focusing on the 50 most common Arabic words, includes grapheme-to-phoneme conversion, and is accompanied by a phonetic statistical analysis, all contributing to advancements in Arabic speech recognition technologies.

The remainder of the paper is organized as follows: Section 2 explains the method, Section 3 discusses the statistical analysis, and Section 4 deals with results and discussion. We finish with a conclusion and future research directions.

2. METHOD
Rich and balanced Arabic corpora are very rare and are not accessible to Arabic linguistic researchers. We therefore propose an approach to creating a rich and balanced Modern Standard Arabic corpus at University Sidi Mohamed Ben Abdellah, called FSDM-USMBA. Arabic is a Semitic language and one of the six official UN languages, spoken by around 400 million people across 22 countries [23]-[25]. It is categorized into classical Arabic, modern standard Arabic, and dialectal Arabic. Arabic script is written from right to left and consists of two types of symbols: letters and diacritics. These symbols are typically written in a connected form. Additionally, several letters may change shape depending on their position within a word. However, it should be noted that the script alone does not encompass all sounds [26]. It nevertheless provides valuable information about consonants and vowels, which can be extracted through diverse techniques. The Arabic language consists of 36 phonemes, 28 of which represent consonantal sounds. The remaining 8 phonemes comprise three short vowels, three long vowels, and two diphthongs [27]. The Arabic language is characterized by the following diacritics: tanween (تَنْوِين: dammatan ـٌ, fathatan ـً, and kasratan ـٍ), šaddat (شَدَّة: ـّ), and sukun (سُكُون: ـْ). The diacritical marks are used to indicate vowel sounds and other phonetic
features. They appear above or below the graphemes. Here, joining all the ( ) diacritic marks is not taken into consideration. On the other hand, syllables are units of speech comprising one or more phonemes. In the Arabic language, various syllable structures are allowed. These include CV, CVV, CVC, CVVC, CVCC, and CVVCC [28]. In these structures, C represents a consonant, V represents a short vowel, and VV represents a long vowel. Following this brief overview of the Arabic language, we detail the corpus specifications. Our methodology comprises four phases: corpus initial, corpus text handling, corpus final, and corpus segmentation.

2.1. Corpus initial
We based the initial corpus on written sources. This corpus contained 30 million words and was divided into various text genres of the same size. Each of these text types contained material from all regions of the Arabic-speaking world. The primary source of the corpus was written Arabic, encompassing both its standard form and its dialects. The main objective of this corpus was to generate a frequency count of all Arabic words as they are written, including their prefixes and suffixes [29].

2.2. Corpus text handling
To build a rich and balanced modern standard Arabic corpus, we used the 50 most frequent words from the initial corpus. Based on these words, we selected and constructed sentences to adhere to the distribution outlined by Zipf's law. We used different sources, including the Holy Quran and Hadith, as well as content related to finance, business, economics, politics, culture, sports, technology, science, weather, art, and others. The goal was to produce simple, short sentences that are phonetically rich and balanced. It is important to note that some expressions may be deleted or replaced with others to better adhere to the grammar rules of the Arabic language.

2.3. Corpus final
The final corpus consists of 527 sentences and a total of 3,308 words. Table 1 displays the sentence count and word count for each genre, along with the proportion of words within each genre. To show the relationship between the frequency of the 50 listed words and their rank in the final corpus, Figure 1 presents the log-log graph of word frequency in the final corpus. The straight line in the graph represents the average slope of the descending word frequencies. Additionally, Table 2 and Figure 2 provide evidence that the frequency of the words is inversely proportional to their rank. These illustrations confirm that the corpus follows Zipf's law. We found that the eight most frequent words in the final corpus align with those found in a previous study [30].

Table 1. Statistics of the final 'FSDM-USMBA' corpus
Genre                              Number of sentences   Number of words   Percentage (%)
Holy Quran, Hadith and religion    86                    536               16.2
Health and epidemic                51                    374               11.3
Finance and Business               20                    146               4.4
Technology                         29                    202               6.1
Literature                         49                    324               9.8
Economy                            22                    153               4.6
Politics                           58                    373               11.3
Arts                               13                    75                2.3
Tourism and Culture                41                    264               7.9
Sports                             16                    118               3.6
Weather                            13                    81                2.5
Others                             129                   662               20.0

Figure 1. The log-log graph of word frequency in the final corpus
Figure 2. Distribution of words in the final corpus
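The Zipf check described above can be reproduced numerically. The following is a minimal sketch, assuming the 50 frequencies reported in Table 2 and an ordinary least-squares fit of log frequency against log rank; the function name `zipf_fit` is hypothetical and the fit is not necessarily the procedure used to draw the straight line in Figure 1. A slope close to -1 corresponds to the classical form of Zipf's law.

```python
import math

# Frequencies of the 50 most frequent words (ranks 1..50), copied from Table 2.
FREQS = [207, 119, 86, 69, 58, 50, 44, 40, 36, 33, 31, 29, 27,
         25, 24, 23, 22, 21, 20, 19, 18, 18, 17, 16, 16, 16,
         15, 15, 14, 14, 14, 13, 13, 13, 12, 12, 12, 12,
         11, 11, 11, 11, 10, 10, 10, 10, 10, 9, 9, 9]


def zipf_fit(freqs):
    """Least-squares fit of log(f) = intercept + slope * log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    slope, intercept = zipf_fit(FREQS)
    print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

A negative slope on the log-log scale, roughly constant over the whole rank range, is what the inverse frequency-rank relation predicts.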
Table 2. Empirical evaluation of Zipf's law in 'FSDM-USMBA' corpus
Rank   Freq.   Word
1      207     لْ
2      119     وَ
3      86      ي
4      69      مِنْ
5      58      لِ
6      50      ِ
7      44      عَلَى
8      40      نَّ
9      36      لَى
10     33      كَانَ
11     31      لاَ
12     29      للَّه
13     27      نْ
14     25      عَنْ
15     24      قَالَ
16     23      هَذَ
17     22      مَعَ
18     21      لَّتِي
19     20      كُلُّ
20     19      هُوَ
21     18      فَ
22     18      هٰذِهِ
23     17      وْ
24     16      لَّذِي
25     16      نَا
26     16      يَوْم
27     15      لَمْ
28     15      مَا
29     14      نَّ
30     14      بَيْنَ
31     14      هِيَ
32     13      بَعْدَ
33     13      يَا
34     13      ٰلِكَ
35     12      قَدْ
36     12      خَر
37     12      شَيْء
38     12      عِنْدَ
39     11      وَّل
40     11      غَيْر
41     11      َ
42     11      نَفْس
43     10      عَرَبِيّ
44     10      يّ
45     10      َِيس
46     10      عَمَل
47     10      عَرَفَ
48     9       بَعْض
49     9       َوْلَة
50     9       كَمَا

2.4. Corpus segmentation
To create a rich and balanced corpus of Arabic, it is essential to encompass all the phonemes of the Arabic language while preserving its phonetic distribution. To accomplish this, we adopt a two-step method. Firstly, we segment the text into graphemes (Algorithm 1).

Algorithm 1: text to grapheme
1. Determine the path of the FSDM-USMBA corpus
2. Iterate through each character of the Arabic text and print it
3. Create a text field with Arabic font and right-to-left orientation
4. Create a table to display the characters and their corresponding Unicode codes
5. Create a JFrame to display the text field and table
6. Set JFrame properties such as the size and default close operation
7. Create an instance of the class (starts the application)

Secondly, we meticulously convert the graphemes into phonemes, adhering to the phonological rules of the Arabic language. This manual conversion ensures the accurate representation of Arabic phonetics in the resulting corpus:
a. Convert šaddat (ـّ) to two consecutive identical consonants,
b. Convert ( َ ) to ( ) wherever it is found in the text,
c. Convert tanween (ـً, ـٌ, ـٍ) to (ـَنْ, ـُنْ, ـِنْ),
d. Pronounce all types of hamza (on alif, waw, or ya) as a single hamza.

3. STATISTICAL ANALYSIS
In this section, we conduct a statistical analysis of syllables, graphemes, and phonemes using the rich and balanced corpus of Modern Standard Arabic. Our primary objective is to gain valuable insights into the morphology and phonology of MSA.

3.1. Statistical analysis of syllables
After segmenting our corpus, we conducted an extraction process to identify and count the various types of syllables present in Modern Standard Arabic. This step enabled us to determine the frequencies and percentages of each syllable type. The distribution of syllabic structures in the corpus shows CV and CVC syllables as the most frequent, accounting for 55.60% (3360 occurrences) and 25.00% (1511 occurrences) respectively. CVV syllables follow with 990 occurrences (16.38%). Less frequent are CVCC (149 occurrences, 2.46%), CVVC (33 occurrences, 0.55%), and CVVCC (1 occurrence, 0.01%).
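The manual conversion rules listed in Section 2.4 can be approximated programmatically. The sketch below covers only rules (a), (c), and (d); it is not the authors' procedure, which was carried out by hand. The function name `apply_g2p_rules`, the sukun placed on the first copy of a doubled consonant, and the choice of the bare hamza (U+0621) as the unified hamza symbol are assumptions made for illustration.

```python
import re

# Unicode code points of the Arabic marks and letters used by the rules.
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
SHADDA, SUKUN = "\u0651", "\u0652"
NOON, HAMZA = "\u0646", "\u0621"
# Tanween mark -> matching short vowel (fathatan, dammatan, kasratan).
TANWEEN = {"\u064B": FATHA, "\u064C": DAMMA, "\u064D": KASRA}
# Hamza carried by alif (above/below), waw, and ya.
HAMZA_CARRIERS = "\u0623\u0625\u0624\u0626"
MARKS = FATHA + DAMMA + KASRA + "".join(TANWEEN)

# A letter, an optional vowel mark, a shadda, an optional vowel mark.
# Both mark orders occur in practice, so the vowel may sit on either side.
SHADDA_RE = re.compile(f"([\u0621-\u064A])([{MARKS}]?){SHADDA}([{MARKS}]?)")


def apply_g2p_rules(text):
    """Apply rules (d), (a), and (c) to a diacritized Arabic string."""
    # Rule (d): every hamza carrier is treated as a single hamza.
    for carrier in HAMZA_CARRIERS:
        text = text.replace(carrier, HAMZA)
    # Rule (a): shadda becomes two consecutive consonants; the first copy
    # gets a sukun (assumption), the second keeps the original vowel mark.
    text = SHADDA_RE.sub(
        lambda m: m.group(1) + SUKUN + m.group(1) + m.group(2) + m.group(3),
        text,
    )
    # Rule (c): tanween becomes short vowel + noon + sukun.
    for mark, vowel in TANWEEN.items():
        text = text.replace(mark, vowel + NOON + SUKUN)
    return text


if __name__ == "__main__":
    # "kitabun" written with dammatan on the final letter.
    print(apply_g2p_rules("\u0643\u0650\u062A\u064E\u0627\u0628\u064C"))
```

Rule (a) is applied before rule (c) so that a final consonant carrying both shadda and tanween is doubled first and only then receives its noon.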
3.2. Statistical analysis of graphemes
To investigate the frequency and distribution of individual graphemes within the rich and balanced corpus, Table 3 displays the count and percentage of occurrences for each grapheme. The grapheme ل and the second-ranked grapheme occur most frequently, accounting for 10.09% and 8.32% respectively. Conversely, the least frequent graphemes have respective occurrence rates of 0.18% and 0.47%.

Table 3. Repetitions and percentage for each grapheme
Repetitions   %       Arabic grapheme
1090          10.09   ل
899           8.32
751           6.95    م
711           6.58    ن
599           5.54    ي
588           5.44
461           4.27
454           4.20    و
417           3.86
390           3.61    ه
386           3.57
343           3.17
337           3.12
323           2.99    ف
321           2.97    ك
296           2.74
276           2.55
268           2.48    ق
254           2.35
208           1.92
186           1.72
132           1.22
125           1.16
121           1.12    ى
115           1.07
113           1.04
113           1.04
105           0.97
84            0.78
67            0.62
65            0.60
59            0.55
57            0.53
51            0.47    ء
22            0.21
19            0.18

Furthermore, our analysis revealed that diacritic marks constitute 47.59% of the corpus, while graphemes make up the remaining 52.41%, as detailed in Table 4. Consequently, the absence of diacritic information in the grapheme-based (non-diacritized) transcription means that approximately 47.59% of the details required for an accurate phonetic transcription are unavailable. In the analysis of diacritic mark frequencies, presented in Table 5, Fatha emerged as the most frequent diacritic, at 20.62%. It is worth noting that tanween (nunation) appears only on the last letter of a word.

Table 4. Frequency of graphemes and diacritics
Type               Frequency   Percentage
Graphemes          10806       52.41
Diacritic marks    9813        47.59
Total              20619       100

Table 5. Arabic diacritics and their frequency of occurrence
Type             Frequency   Percentage
Fatha            4253        20.62
Kasra            1859        9.02
Damma            1059        5.14
Shadda           611         2.96
Sukun            1265        6.14
Tanween Fatha    69          0.34
Tanween Kasra    360         1.75
Tanween Damma    337         1.63

3.3. Statistical analysis of phonemes
After applying the grapheme-to-phoneme approach with phonological rules, we conducted a statistical analysis to examine the occurrence of phonemes in the 3,308 words. This analysis covered the positions of phonemes at the beginning, middle, and end of words. Table 6 displays the statistics for each phoneme. Our results for Arabic phoneme frequencies in the final corpus are in accordance with those of the researchers in [31]. According to the findings presented in Table 6 and Figure 3(a), the deductions concerning phoneme statistics can be summarized as follows:
In the case of short vowels: Fatha (ـَ) is the most frequent, followed by kasra (ـِ) and damma (ـُ).
In the case of long vowels: the long vowels written with ا and ى after a fatha are counted as one and pronounced Aaa. They are the most frequent, followed by Aii (ـِيْ) and Oue (ـُوْ).
In the case of consonants: Noon (ن) is the most frequent phoneme, which is explained by the fact that it also arises from tanween (fathatan ـً, dammatan ـٌ, and kasratan ـٍ). It is followed by Laam (ل) and Hamza (ء); all hamza types are counted as one. The following consonants have a frequency lower than 0.5%: thaa (ث), zain (ز), dhaad (ض), and ghayn (غ). Dhaa (ظ) is the least frequent phoneme.
The difference in the percentages of the two diphthongs is exceedingly small (around 0.07%).

Table 6. Arabic phonemes statistics in the FSDM-USMBA corpus
(Repetitions are given by position in the word: Start, Inside, End)
Arpa   IPA   Description and syllables                      Start   Inside   End    Percentage %   Grapheme
Consonants
E      ʔ     Hamza (CVC-CVC), Alif (CVC-CVC),               667     176      33     4.43           ء
             Alif+Hamza below, Waw+Hamza above,
             Ya+Hamza above
B      b     Baa (CV)                                       136     221      49     2.05           ب
T      t     Taa (CV), Taa marbuta                          95      485      144    3.66           ت
TH     θ     Thaa (CV)                                      13      55       6      0.37           ث
JH     g     Jeem (CVC)                                     49      136      10     0.99           ج
HH     ħ     Haa (CV)                                       79      114      17     1.06           ح
KH     x     Khaa (CV)                                      36      73       4      0.55           خ
D      d     Daal (CVVC)                                    69      218      93     1.92           د
DH     ð     Thaal (CVVC)                                   21      99       6      0.64           ذ
R      r     Raa (CV)                                       66      452      100    3.12           ر
Z      z     Zaiy (CVC)                                     11      53       5      0.35           ز
S      s     Seen (CVC)                                     48      190      36     1.39           س
SH     ʃ     Sheen (CVC)                                    44      71       4      0.60           ش
SS     s     Saad (CVC)                                     27      99       1      0.64           ص
DD     d     Dhaad (CVC)                                    7       58       21     0.43           ض
TT     t     TTaa (CV)                                      18      88       5      0.56           ط
DH2    ð     Dhaa (CV)                                      5       14       2      0.11           ظ
AI     ʕ     Ayn (CV-CVC)                                   182     235      46     2.34           ع
GH     ɣ     Ghayn (CV-CVC)                                 27      28       4      0.30           غ
F      f     Faa (CV)                                       154     146      28     1.66           ف
Q      q     Qaaf (CVC)                                     81      187      21     1.46           ق
K      k     Kaaf (CVC)                                     124     153      63     1.72           ك
L      l     Laam (CVC)                                     147     914      140    6.08           ل
M      m     Meem (CVC)                                     321     358      91     3.89           م
N      n     Noon (CVC)                                     76      410      1063   7.84           ن
H      h     Haa (CV)                                       87      117      186    1.97           ه
W      w     Waw (CVC)                                      166     289      37     2.49           و
Y      y     Yaa (CV)                                       143     381      213    3.73           ي
Short vowels
AE     A     Fatha                                          0       2542     520    15.49          ـَ
UH     u     Damma                                          0       959      398    6.86           ـُ
IH     i     Kasra                                          0       1692     428    11.23          ـِ
Long vowels
AE:    a:    Aaa                                            0       670      350    5.16           ـَا، ـَى
UW     u:    Oue                                            0       190      23     1.08           ـُوْ
IY     i:    Aii                                            0       179      339    2.62           ـِيْ
Diphthongs
AW     aw    Aoue                                           0       105      8      0.57           ـَوْ
AY     ay    Aye                                            0       126      1      0.64           ـَيْ

The Arabic consonants are classified based on their place and manner of articulation. Tables 7 and 8 present corresponding statistics for these categories, following the classifications established in the literature [32], [33]. The place
of articulation refers to where in the vocal tract the airflow is obstructed, leading to distinct sounds. Meanwhile, the manner of articulation describes how the airflow is modified or obstructed, further distinguishing consonant sounds. This dual classification system provides linguists and phoneticians with a structured framework to methodically examine and comprehend the variety of consonant sounds present in Modern Standard Arabic.

Table 7. Percentage of consonant classes based on their place of articulation
Place of articulation   Consonants                      in %
Alveolar                T, D, R, Z, S, SS, DD, TT, L, N   25.99
Glottal                 E, H                            6.40
Bilabial                B, M                            5.94
Palatal                 Y                               3.75
Velar                   JH, K                           2.72
Uvular                  KH, GH, Q                       2.31
Post-Alveolar           JH, SH                          1.59
Labiodental             Q                               1.46
Interdental             TH, DH, DH2                     1.12

Table 8. Percentage of consonant classes based on their manner of articulation
Manner of articulation   Consonants                               in %
Stop                     E, B, T, D, Q, K                         15.24
Nasals                   M, N                                     11.73
Fricative                TH, DH, HH, KH, Z, S, SH, AI, GH, H      11.23
Glide                    W, Y                                     2.31
Lateral                  L                                        6.08
Trill                    R                                        3.12
Affricative              JH                                       0.99
Emphatic stop            DD, TT                                   0.99
Emphatic fricative       SS, DH2                                  0.75

4. RESULTS AND DISCUSSION
4.1. Speech corpus
The FSDM-USMBA database was established for this study, comprising a speech corpus and transcriptions from 130 Moroccan speakers (63 males and 67 females) aged between 17 and 50 years. During the recording sessions, speakers were asked to utter the 527 sentences with 10 repetitions of every sentence. Voice clarity is fundamental for successful recording. Factors like recording environment, equipment, and speaker-microphone distance influence sound quality. After testing, a microphone placement at 10 cm proved effective. For accurate capture, recordings should take place in quiet environments with noise levels below 30 dB and closed windows, and weather impacts such as wind and rain must be avoided.
To streamline the process, speakers recited each sentence 10 times consecutively, resulting in audio files of 25 to 50 seconds. Using WaveSurfer, each recitation was isolated in .wav format by removing the unnecessary parts of the audio signal. Audio file names encode multiple details about the speakers. For instance, "XY18ZW21_10.wav" reveals the following information: the initials X and Y represent the first and last names respectively, followed by the age 18, the city Z, the gender W, the sentence number 21, and 10 denoting the repetition number. Thus, the task of segmenting speech is easy. These recordings have a sampling rate of 16 kHz and a resolution of 16 bits. In the recording sessions, the waveform and spectrogram of each phrase were reviewed to verify that the entire sentence was included in the recording, as illustrated in Figure 3(b). Only correctly pronounced utterances were retained. Our dictionary contains symbolic representations for all the sounds used in the sentences of our corpus.

Figure 3. Arabic speech corpus illustration: (a) Arabic phonemes of the final corpus and (b) clean speech waveform of an example Arabic sentence spoken by a female speaker, referred to as MB18FF21_01 in our audio database

4.2. Speech test
To test our corpus, a set of experiments was conducted. From the final corpus, we selected a subset of 130 sentences spoken by 60 speakers (30 male and 30 female), each sentence being repeated 10 times. This resulted in a vocal corpus of 78,000 audio files. To optimize system performance, we divided the corpus into training (70%) and testing (30%) sets and adjusted the parameters of the Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs). The symbolic representation of each word is stored in the MCA-USMBA.dic file. The Baum-Welch algorithm, which is a special case of the Expectation-Maximization (EM) method, is used to estimate transition probabilities during training. The acoustic model is trained with a continuous state probability density, using between 2 and 16 Gaussian mixture distributions and 3 to 5 HMMs. Table 9 shows the achieved Sentence Error Rate (SER) and Word Error Rate (WER). Figures 4 and 5 present the decoding results and the influence of HMM and GMM parameters on SER and WER performance, respectively. The system is evaluated based on three types of errors: insertion, deletion, and substitution, which can occur at both the word and sentence levels. Figure 6 presents concrete examples of sentence recognition errors, illustrating the different types of errors: insertions, deletions, and substitutions at the sentence level. The best configuration used 3 HMMs and 8 GMMs, resulting in a SER of 30.00% and a WER of 15.00%. Our results are in accordance with the study of Abushariah and colleagues [19], who demonstrated a word error rate (WER) of 13.48% for Arabic speech recognition on different sentences spoken by different speakers.

Table 9. SER and WER in percentages for different values of the HMM and GMM
HMM    3                               5
GMM    2       4       8       16      2       4       8       16
WER    22.00   17.75   15.50   23.50   22.20   19.00   18.00   27.50
SER    40.00   37.50   30.00   40.50   50.20   40.00   30.00   40.50

Figure 4. Optimizing model training with the Baum-Welch algorithm
Figure 5. Impact of HMM and GMM values on SER and WER performance
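Section 4.2 reports WER and SER without stating how they are computed. Under the standard definitions (an assumption here, not confirmed by the text), WER is the number of word substitutions, deletions, and insertions divided by the number of reference words, and SER is the share of sentences containing at least one error. The sketch below implements those definitions with a word-level edit distance; the function names and the placeholder tokens in the usage example are hypothetical and do not come from the FSDM-USMBA corpus.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn the word list `ref` into the word list `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_and_ser(pairs):
    """Return (WER, SER) in percent for (reference, hypothesis) pairs."""
    total_words = 0
    word_errors = 0
    wrong_sentences = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        errors = edit_distance(ref_words, hypothesis.split())
        total_words += len(ref_words)
        word_errors += errors
        wrong_sentences += 1 if errors > 0 else 0
    return (100.0 * word_errors / total_words,
            100.0 * wrong_sentences / len(pairs))


if __name__ == "__main__":
    # Placeholder tokens standing in for transcribed words.
    demo = [("w1 w2 w3 w4", "w1 w2 w3 w4"),
            ("w1 w2 w3 w4", "w1 wX w3")]
    wer, ser = wer_and_ser(demo)
    print(f"WER = {wer:.2f}%, SER = {ser:.2f}%")
```

Because one wrong word is enough to mark a whole sentence as incorrect, SER is never lower than the sentence-averaged word error under these definitions, which matches the ordering of the two rows in Table 9.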
Figure 6. Possible errors in the recognition of example Arabic sentences

5. CONCLUSION
This paper introduces an innovative and efficient method for acquiring a comprehensive and balanced Modern Standard Arabic corpus. The method, meticulously outlined from initial corpus selection to sentence curation, adheres closely to Zipf's law and principles of phonetic distribution equilibrium. The resulting corpus comprises 527 meticulously selected sentences, ensuring the representation of diverse Arabic phonemes across various linguistic contexts, encompassing consonants, vowels, diphthongs, and syllables. The study evaluates an Arabic continuous speech recognition system using 25% of the final corpus. Fine-tuning hidden Markov model (HMM) and Gaussian mixture model (GMM) parameters notably enhances system performance. The findings indicate that employing 3 HMM and 8 GMM achieves optimal sentence error rate (SER) and word error rate (WER) at 30.00% and 15.00%, respectively. In future endeavors, we aim to expand recordings to diverse speaker groups independently, leveraging the entirety of the final comprehensive and balanced corpus. The results obtained in this study are very satisfactory for the development of a continuous Arabic speech recognition system, which encourages us to extend our research scope to spontaneous Arabic language. We also plan to expand the corpus, explore various ASR system architectures, and develop an automatic continuous speech recognition system for the Moroccan dialect.

REFERENCES
[1] K. Shaalan, A. E. Hassanien, and F. Tolba, Eds., Intelligent natural language processing: trends and applications, vol. 740. Springer, 2017.
[2] Kennedy, An introduction to corpus linguistics. Routledge, 2014.
[3] A. A. M. Alqudah et al., "Modern Standard Arabic speech disorders corpus for digital speech processing applications," International Journal of Speech Technology, vol. 27, no. 1, pp. 157–170, 2024, doi: 10.1007/s10772-024-10086-9.
[4] Z. Oumaima and A. Meziane, "Modern Arabic speech corpus for text to speech synthesis," in 2020 IEEE International Conference on Technology Management, Operations and Decisions (ICTMOD), IEEE, pp. 1–6, Nov. 2020, doi: 10.1109/ICTMOD49425.2020.9380606.
[5] U. Kamath, J. Liu, and J. Whitaker, Deep learning for NLP and speech recognition, vol. 84. Cham, Switzerland: Springer, 2019.
[6] J. Egbert and P. Baker, Eds., Using corpus methods to triangulate linguistic analysis. London: Routledge, 2019.
[7] M. Weisser, Practical corpus linguistics: An introduction to corpus-based language analysis, vol. 43. John Wiley and Sons, 2016.
[8] H. Satori, O. Zealouk, K. Satori, and F. ElHaoussi, "Voice comparison between smokers and non-smokers using HMM speech recognition system," International Journal of Speech Technology, vol. 20, no. 4, pp. 771–777, 2017, doi: 10.1007/s10772-017-9442-0.
[9] H. Satori and F. ElHaoussi, "Investigation Amazigh speech recognition using CMU tools," International Journal of Speech Technology, vol. 17, no. 17, pp. 235–243, 2014, doi: 10.1007/s10772-014-9223-y.
[10] L. Villaseñor-Pineda, M. Montes-y-Gómez, D. Vaufreydaz, and J. F. Serignat, "Experiments on the Construction of a Phonetically Balanced Corpus from the Web," in Conference on Intelligent Text Processing and Computational Linguistics, vol. 2945, pp. 416–419, Feb. 2004, Springer, Berlin, Heidelberg, doi: 10.1007/978-3-540-24630-5-50.
[11] H. M. Wang, "Statistical analysis of Mandarin acoustic units and automatic extraction of phonetically rich sentences based upon a very large Chinese text corpus," International Journal of Computational Linguistics and Chinese Language Processing, vol. 3, no. 2, pp. 93–114, Aug. 1998, doi: 10.30019/IJCLCLP.199808.0005.
[12] V. Radová and P. Vopálka, "Methods of Sentences Selection for Read-Speech Corpus Design," in International Workshop on Text, Speech and Dialogue, vol. 1692, pp. 165–170, Sep. 1999, Springer Berlin Heidelberg, doi: 10.1007/3-540-48239-3-30.
[13] J. Matoušek and J. Romportl, "On building phonetically and prosodically rich speech corpus for text-to-speech synthesis," in Proceedings of the second IASTED international conference on Computational intelligence, ACTA Press, pp. 442–447, 20-22 November 2006, San Francisco, USA.
[14] K. Yazawa, "Harvard-NGSL Sentences for English Learner Speech Corpora," in 2022 25th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), IEEE, pp. 1–5, 2022, doi: 10.30019/IJCLCLP.199808.0005.
[15] H. Schwenk and X. Li, "A corpus for multilingual document classification in eight languages," arXiv preprint, arXiv:1805.09821, 2018, doi: 10.48550/arXiv.1805.09821.
[16] E. Grave et al., "Learning word vectors for 157 languages," arXiv preprint, arXiv:1802.06893, 2018, doi: 10.48550/arXiv.1802.06893.
[17] A. A. M. Alqudah et al., "Arabic Automatic Speech Recognition for Speakers With Speech Disorders: A Comprehensive Review," 2023 International Conference on Information Technology (ICIT), Amman, pp. 667–673, 2023, doi: 10.1109/ICIT58056.2023.10225965.
[18] M. Alghamdi, A. H. Alhamid, and M. M. Aldasuqi, Database of Arabic sounds: sentences, in Arabic, Technical report, King Abdulaziz City of Science and Technology (KACST), Riyadh, Saudi Arabia, 2003.
[19] M. A. Abushariah et al., "Phonetically rich and balanced text and speech corpora for Arabic language," Language resources and evaluation, vol. 46, pp. 601–634, 2012, doi: 10.1007/s10579-011-9166-8.
[20] Y. Yuwan and D. P. Lestari, "Automatic extraction phonetically rich and balanced verses for speaker-dependent quranic speech recognition system," in K. Hasida and A. Purwarianti (eds), Computational Linguistics, vol. 593, pp. 65–75, 2015, doi: 10.1007/978-981-10-0515-2-5.
[21] D. Qi and H. Wang, "Zipf's Law for Speech Acts in Spoken English," Journal of Quantitative Linguistics, pp. 231–258, 2024, doi: 10.1080/09296174.2023.2202470.
[22] A. Ech-Charfi, "Frequency and text coverage in Standard Arabic based on Arabic Internet Corpus," Journal of Applied Language and Culture Studies, vol. 6, no. 3, pp. 1–19, 2023.
[23] R. Bassiouney and E. G. Katz, Eds., Arabic language and linguistics. Georgetown University Press, 2012.
[24] A. Hussein, S. Watanabe, and A. Ali, "Arabic speech recognition by end-to-end, modular systems and human," Computer Speech and Language, vol. 71, p. 101272, 2022, doi: 10.1016/j.csl.2021.101272.
[25] I. Guellil, H. Saâdane, F. Azouaou, B. Gueni, and D. Nouvel, "Arabic natural language processing: An overview," Journal of King Saud University-Computer and Information Sciences, vol. 33, no. 6, pp. 497–507, 2021, doi: 10.1016/j.jksuci.2019.02.006.
[26] Y. A. El-Imam, "Phonetization of Arabic: rules and algorithms," Computer Speech and Language, vol. 18, no. 4, pp. 339–373, 2004, doi: 10.1016/S0885-2308(03)00035-4.
[27] F. Sindran, F. Mualla, T. Haderlein, K. Daqrouq, and E. Nöth, "Automatic phonetization-based statistical linguistic study of standard Arabic," Int. J. Comput. Linguist. (IJCL), vol. 7, pp. 38–53, 2016.
[28] M. Elmahdy, R. Gruhn, and W. Minker, Novel techniques for dialectal Arabic speech recognition. Springer Science and Business Media, 2012.
[29] T. Buckwalter and D. Parkinson, A frequency dictionary of Arabic: Core vocabulary for learners. Routledge, 2014.
[30] A. Masrai and J. Milton, "How different is Arabic from other languages? The relationship between word frequency and lexical coverage," Journal of Applied Linguistics and Language Research, vol. 3, no. 1, pp. 15–35, 2016.
[31] A. Amrouche, A. Abed, K. Ferrat, K. N. Boubakeur, Y. Bentrcia, and L. Falek, "Balanced Arabic corpus design for speech synthesis," International Journal of Speech Technology, vol. 24, no. 3, pp. 747–759, 2021, doi: 10.1007/s10772-021-09846-8.
[32] J. C. Watson, The phonology and morphology of Arabic. Oxford University Press, USA, 2002.
[33] F. Sindran, Automatic Phonetic Transcription of Standard Arabic with Applications in the NLP Domain (Doctoral dissertation, Friedrich-Alexander-Universitaet Erlangen-Nuernberg (Germany)), 2021.
BIOGRAPHIES OF AUTHORS

Youssef Boutazart received the engineer degree in Automation from the Belarusian State Agrarian Technical University, Minsk, Belarus, and the bachelor's degree in electronics from Moulay Ismail University, Meknes, Morocco. Since 2009, he has been an administrator at the Presidency of Sidi Mohamed Ben Abdellah University. He is currently a Ph.D. student at the LISAC laboratory of the Dhar Mehrez Faculty of Sciences of Fez. His research interests focus on the development of rich and balanced speech corpora for high-performance speech recognition systems. He can be contacted at email: youssef.boutazart@usmba.ac.ma.

Naouar Laaïdi received her Master's degree in Electronics, Automatics and Signal Processing from the Faculty of Sciences, Chouaib Doukkali University, El-Jadida. She is currently a Ph.D. student at the LISAC Laboratory, Faculty of Sciences of Fez, Sidi Mohamed Ben Abdellah University. She specializes in several disciplines, including clustering, machine learning, classification, and automatic speech recognition. She can be contacted at email: naouarlaaidi@gmail.com.