Indonesian Journal of Electrical Engineering and Computer Science
Vol. 41, No. 3, March 2026, pp. 1049-1059
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v41.i3.pp1049-1059

A rich and balanced phonetics corpus for modern standard Arabic ASR systems

Youssef Boutazart, Naouar Laaïdi, Abderrahim Ezzine, Hassan Satori, Mohamed Taj Bennani
Department of Mathematics and Computer Science, Faculty of Sciences Dhar-Mahraz (FSDM), Sidi Mohamed Ben Abdellah University (USMBA), Fez, Morocco

Article Info
Article history:
Received Sep 11, 2024
Revised Jan 12, 2026
Accepted Feb 27, 2026
Keywords:
Modern standard Arabic
Phonetically balanced corpus
Phonetically rich corpus
Segmentation grapheme to phoneme
Zipf's law

Abstract
This research describes the creation of a new Modern Standard Arabic corpus that aims to be phonetically rich and balanced while adhering to Zipf's law. Building a phonetically diverse collection of Arabic sentences offers significant advantages in efficiency, cost, and storage capacity compared to conventional corpora. The corpus is segmented into graphemes, which are then manually converted into phonemes, resulting in a total of 19769 phonemic units. Among these phonemes, consonants like 'Laam - l' account for 10%, while 'Fatha - A' vowels constitute 20%. Evaluation of this corpus using an automatic speech recognition (ASR) system yields a sentence error rate (SER) of 30% and a word error rate (WER) of 15%. Furthermore, statistical analysis shows that diacritic marks make up 47.59% of the corpus, with graphemes comprising the remaining 52.41%. These diacritic marks provide valuable insight into the precise phonetic transcription of the corpus. Additionally, the study provides detailed breakdowns of consonants by place and manner of articulation, enhancing our understanding of phonetic structures.
This is an open access article under the CC BY-SA license.

Corresponding Author:
Hassan Satori
Department of Mathematics and Computer Science, Faculty of Sciences Dhar-Mahraz (FSDM)
Sidi Mohamed Ben Abdellah University (USMBA)
FSDM, USMBA, B.P. 1796, Fez, Morocco
Email: hassan.satori@usmba.ac.ma

1. INTRODUCTION
A corpus is a digital collection of natural language samples used to discover the structure and patterns of a language. Researchers analyze it to study word order, the arrangement of sentences and clauses, and the use of grammatical structures [1], [2]. Corpus development is a pivotal element in various disciplines, including natural language processing (NLP) and automatic speech recognition (ASR) [3]-[5]. Developing any corpus demands considerable resources and effort. Consequently, there is growing interest in crafting phonetically diverse and balanced text and speech corpora [6], [7]. A rich and balanced corpus is a valuable source for developing ASR systems, as it allows the system to be trained on a variety of speech patterns and accents [8], [9]. Reflecting this interest, the authors in [10] introduced an automated approach for constructing a phonetically rich and balanced corpus sourced from the web, selecting 6082 phrases to develop a robust recognizer tool. Similarly, Wang in [11] conducted a statistical analysis of various Mandarin acoustic units using an extensive Chinese text corpus gathered from daily newspapers.
Following this analysis, Wang proposed an algorithm to automatically extract phonetically rich sentences from the corpus, which are then used for training and evaluating a Mandarin speech recognition system.
Journal homepage: http://ijeecs.iaescore.com
Radová and Vopálka in [12] address the challenge of phonetically balanced sentence selection, presenting two iterative procedures to choose sentences that accurately reflect the occurrence of phonetic events in natural speech, resulting in a set of 40 phonetically balanced sentences. In the same way, Matoušek and Romportl in [13] propose a method for preparing and recording a phonetically and prosodically rich Czech language corpus for text-to-speech synthesis. They implement an algorithm that selects sentences based on both phonetic and prosodic criteria, including the random selection of paragraphs to capture supra-sentential prosody phenomena. For the English language, Yazawa in [14] developed a set of 720 phonemically balanced phrases for English learners, selecting 50 core vocabulary words based on the Harvard New General Service List (NGSL). However, preparing and selecting suitable sentences and words poses significant challenges in ensuring comprehensive linguistic representation and maintaining the desired phonetic diversity. While abundant databases exist for major languages like English, German, French, and Mandarin [15], [16], the task is considerably more complex for underrepresented languages such as Modern Standard Arabic (MSA).
Regarding MSA, Alqudah et al. [17] recently reviewed Arabic automatic speech recognition (ASR) for speakers with speech disorders (SD), identifying research gaps and highlighting the need for comprehensive ASR systems that address various SD types and continuous speech in Arabic. Alghamdi et al. in [18] proposed a manually written Arabic corpus based on a phonetically rich and balanced list of 663 words. Theirs was one of the first works on the production of this type of corpus. The database consists of 367 sentences of 2 to 9 words each. Later, in 2012, Abushariah et al. [19] described the preparation, recording, analysis, and evaluation of a new speech corpus for MSA. The sentences used contained all phonemes and preserved the phonetic distribution of the Arabic language. Yuwan and Lestari in [20] explained that creating a phonetically rich and balanced corpus not only makes the system more robust and intelligent but also saves time, cost, and storage capacity. They collected verses as a speech corpus for a Quranic recognition system with special symbols, selecting 180 of the 6236 verses in the Quran.
Our primary goal is to develop a rich and balanced corpus. We prioritize readability and pronunciation by incorporating phonetically rich and balanced, structurally simple sentences. The corpus collection encompasses diverse Arabic texts from various sources. This deliberate selection is built around the 50 most prevalent words in the Arabic language, ensuring compliance with Zipf's law, which states that the frequency of a word in a text is inversely proportional to its rank in a frequency table [21], [22].
In this study, we introduce an approach for constructing a novel rich and balanced Modern Standard Arabic corpus, developed at the Faculty of Sciences Dhar El Mehraz of Sidi Mohamed Ben Abdellah University (FSDM-USMBA). This approach streamlines the process, saving time, cost, and storage capacity compared to conventional corpus collection. The corpus adheres to Zipf's law by focusing on the 50 most common Arabic words, includes grapheme-to-phoneme conversion, and is accompanied by a phonetic statistical analysis, all contributing to advancements in Arabic speech recognition technologies. The remainder of the paper is organized as follows: Section 2 explains the method, Section 3 presents the statistical analysis, and Section 4 covers the results and discussion. Section 5 concludes and outlines future research directions.

2. METHOD
Rich and balanced Arabic corpora are very rare and are not readily accessible to Arabic linguistics researchers. We therefore propose an approach for creating a rich and balanced Modern Standard Arabic corpus at Sidi Mohamed Ben Abdellah University, called FSDM-USMBA.
Arabic is a Semitic language and one of the six official UN languages, spoken by around 400 million people across 22 countries [23]-[25]. It is categorized into classical Arabic, modern standard Arabic, and dialectal Arabic. Arabic script is written from right to left and consists of two types of symbols: letters and diacritics. These symbols are typically written in a connected form, and several letters change shape depending on their position within a word. However, the script alone does not encompass all sounds [26]. It provides valuable information about consonants and vowels, which can be extracted through diverse techniques. The Arabic language has 36 phonemes: 28 consonants and 8 vocalic phonemes, the latter comprising three short vowels, three long vowels, and two diphthongs [27].
Arabic uses the following diacritics: tanween (تَنْوِين), in its three forms dammatan (اٌ), fathatan (اً), and kasratan (اٍ); šaddat (شَدَّة, written اّ); and sukun (سُكُونْ, written اْ). Diacritical marks are used to indicate vowel sounds and other phonetic features. They appear above or below the graphemes. Here, the (ا) shown joined to each diacritic mark is not taken into consideration. Syllables, in turn, are units of speech comprising one or more phonemes. Arabic allows several syllable structures: CV, CVV, CVC, CVVC, CVCC, and CVVCC [28], where C represents a consonant, V a short vowel, and VV a long vowel.
Following this brief overview of the Arabic language, we detail the corpus specifications. Our methodology comprises four phases: corpus initial, corpus text handling, corpus final, and corpus segmentation.

2.1. Corpus initial
We based the initial corpus on written sources. This corpus contained 30 million words and was divided into text genres of equal size. Each genre contained material from all regions of the Arabic-speaking world. The primary source of the corpus was written Arabic, encompassing both its standard form and its dialects. The main objective of this corpus was to generate a frequency count of all Arabic words as they are written, including their prefixes and suffixes [29].

2.2. Corpus text handling
To build a rich and balanced Modern Standard Arabic corpus, we used the 50 most frequent words from the initial corpus. Based on these words, we selected and constructed sentences so that they adhere to the distribution described by Zipf's law. We used different sources, including the Holy Quran and Hadith, as well as content related to finance, business, economics, politics, culture, sports, technology, science, weather, art, and others. The goal was to produce simple, short sentences that are phonetically rich and balanced. Note that some expressions may have been deleted or replaced with others to better adhere to the grammar rules of the Arabic language.

2.3. Corpus final
The final corpus consists of 527 sentences and a total of 3,308 words. Table 1 displays the sentence count and word count for each genre, along with the proportion of words within each genre. To show the relationship between the frequency of the 50 words and their rank in the final corpus, Figure 1 presents the log-log graph of word frequency in the final corpus. The straight line in the graph represents the average slope of the descending word frequencies. Additionally, Table 2 and Figure 2 provide evidence that the frequency of the words is inversely proportional to their rank. These illustrations indicate that the corpus follows Zipf's law. We found that the eight most frequent words in the final corpus align with those found in a previous study [30].
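
The log-log relationship described above can be checked numerically. The following is a minimal sketch, not the authors' procedure: it fits a least-squares line to log(frequency) against log(rank) using the 50 frequencies listed in Table 2. The function name `zipf_slope` and the constant `TABLE2_FREQUENCIES` are illustrative names introduced here. Under an ideal Zipf distribution the slope would be -1.

```python
import math

# Frequencies of the 50 words in Table 2, ordered by rank (1 to 50).
TABLE2_FREQUENCIES = [
    207, 119, 86, 69, 58, 50, 44, 40, 36, 33, 31, 29, 27,
    25, 24, 23, 22, 21, 20, 19, 18, 18, 17, 16, 16, 16,
    15, 15, 14, 14, 14, 13, 13, 13, 12, 12, 12, 12,
    11, 11, 11, 11, 10, 10, 10, 10, 10, 9, 9, 9,
]


def zipf_slope(frequencies):
    """Return the least-squares slope of log(frequency) versus log(rank)."""
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(freq) for freq in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    print(f"fitted log-log slope: {zipf_slope(TABLE2_FREQUENCIES):.3f}")
```

A negative slope on the log-log scale is what the straight line in Figure 1 summarizes; how close it is to -1 indicates how closely the word list follows the idealized law.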

Table 1. Statistics of the final 'FSDM-USMBA' corpus
Genre | Number of sentences | Number of words | Proportion of words
Holy Quran, Hadith and religion | 86 | 536 | 0.162
Health and epidemic | 51 | 374 | 0.113
Finance and Business | 20 | 146 | 0.044
Technology | 29 | 202 | 0.061
Literature | 49 | 324 | 0.098
Economy | 22 | 153 | 0.046
Politics | 58 | 373 | 0.113
Arts | 13 | 75 | 0.023
Tourism and Culture | 41 | 264 | 0.079
Sports | 16 | 118 | 0.036
Weather | 13 | 81 | 0.025
Others | 129 | 662 | 0.200

Figure 1. The log-log graph of word frequency in the final corpus
Figure 2. Distribution of words in the final corpus

Table 2. Empirical evaluation of Zipf's law in 'FSDM-USMBA' corpus
Rank | Freq. | Word
1 | 207 | ألْ
2 | 119 | وَ
3 | 86 | ي
4 | 69 | مِنْ
5 | 58 | لِ
6 | 50 | بِ
7 | 44 | عَلَى
8 | 40 | أنَّ
9 | 36 | إلَى
10 | 33 | كَانَ
11 | 31 | لاَ
12 | 29 | اللَّه
13 | 27 | أنْ
14 | 25 | عَنْ
15 | 24 | قَالَ
16 | 23 | هَذَا
17 | 22 | مَعَ
18 | 21 | الَّتِي
19 | 20 | كُلُّ
20 | 19 | هُوَ
21 | 18 | فَ
22 | 18 | هٰذِهِ
23 | 17 | أوْ
24 | 16 | الَّذِي
25 | 16 | أنَا
26 | 16 | يَوْم
27 | 15 | لَمْ
28 | 15 | مَا
29 | 14 | إنَّ
30 | 14 | بَيْنَ
31 | 14 | هِيَ
32 | 13 | بَعْدَ
33 | 13 | يَا
34 | 13 | ذٰلِكَ
35 | 12 | قَدْ
36 | 12 | آخَر
37 | 12 | شَيْء
38 | 12 | عِنْدَ
39 | 11 | أوَّل
40 | 11 | غَيْر
41 | 11 | إذَا
42 | 11 | نَفْس
43 | 10 | عَرَبِيّ
44 | 10 | أيّ
45 | 10 | رَئِيس
46 | 10 | عَمَل
47 | 10 | عَرَفَ
48 | 9 | بَعْض
49 | 9 | دَوْلَة
50 | 9 | كَمَا

2.4. Corpus segmentation
To create a rich and balanced corpus of Arabic, it is essential to cover all the phonemes of the Arabic language while preserving its phonetic distribution. To accomplish this, we adopt a two-step method. First, we segment the text into graphemes (Algorithm 1).
Algorithm 1: text to grapheme
1. Determine the path of the FSDM-USMBA corpus
2. Iterate through each character of the Arabic text and print it
3. Create a text field with Arabic font and right-to-left orientation
4. Create a table to display the characters and their corresponding Unicode codes
5. Create a JFrame to display the text field and table
6. Set JFrame properties such as the size and default close operation
7. Create an instance of the class (starts the application)
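
Steps 2 and 4 of Algorithm 1 (iterating over the characters and listing their Unicode codes) can be sketched without the graphical part. The snippet below is an illustrative Python sketch, not the authors' JFrame-based tool. The function name `list_characters` is hypothetical, and the display steps 3 and 5 to 7 are deliberately omitted.

```python
import unicodedata


def list_characters(text):
    """Return (character, code point, Unicode name) for every character in text."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"
        name = unicodedata.name(ch, "UNKNOWN")
        rows.append((ch, code_point, name))
    return rows


if __name__ == "__main__":
    # Example input: the letter beh followed by the fatha diacritic.
    for ch, code_point, name in list_characters("\u0628\u064E"):
        print(code_point, name)
```

Because Arabic diacritics are separate code points that follow their base letter, a plain character loop already separates letters from diacritic marks.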
Second, we manually convert the graphemes into phonemes, following the phonological rules of the Arabic language. This manual conversion ensures an accurate representation of Arabic phonetics in the resulting corpus:
a. Convert šaddat (اّ) to two consecutive consonants,
b. Convert (اَ) to (أا) when it is found in the text,
c. Convert tanween (اً اٌ اٍ) to (اَنْ اُنْ اِنْ),
d. Pronounce all types of hamza (ء, إ, أ, ئ, and ؤ) as ء.
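
The conversion in the paper is done manually. Purely as an illustration, rules a, c, and d above could be approximated on diacritized Unicode text as in the sketch below. All function names are hypothetical. Rule b is left out because its exact scope is not clear from the text, and writing the doubled consonant as "letter + sukun + letter" is an assumption made here.

```python
import unicodedata

SHADDA, SUKUN, NOON, HAMZA = "\u0651", "\u0652", "\u0646", "\u0621"
# Rule c: fathatan, dammatan, kasratan map to fatha, damma, kasra (plus noon + sukun).
TANWEEN_TO_VOWEL = {"\u064B": "\u064E", "\u064C": "\u064F", "\u064D": "\u0650"}
# Rule d: hamza written on alef (above or below), waw, or yeh.
HAMZA_FORMS = "\u0623\u0625\u0624\u0626"


def expand_shadda(text):
    """Rule a: replace a shadda by doubling the letter it sits on."""
    out = []
    for ch in text:
        if ch != SHADDA:
            out.append(ch)
            continue
        marks = []
        # Skip back over any diacritics typed before the shadda.
        while out and unicodedata.combining(out[-1]):
            marks.append(out.pop())
        if out:
            out.extend([SUKUN, out[-1]])
        out.extend(reversed(marks))
    return "".join(out)


def expand_tanween(text):
    """Rule c: replace each tanween mark by short vowel + noon + sukun."""
    for mark, vowel in TANWEEN_TO_VOWEL.items():
        text = text.replace(mark, vowel + NOON + SUKUN)
    return text


def unify_hamza(text):
    """Rule d: map every hamza carrier form to the bare hamza."""
    return text.translate({ord(form): HAMZA for form in HAMZA_FORMS})


def graphemes_to_phonemes(text):
    return unify_hamza(expand_tanween(expand_shadda(text)))
```

Running the shadda pass before the tanween pass keeps the doubling attached to the original letter rather than to an inserted noon.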

3. STATISTICAL ANALYSIS
In this section, we conduct a statistical analysis of syllables, graphemes, and phonemes in the rich and balanced corpus of Modern Standard Arabic. Our primary objective is to gain insight into the morphology and phonology of MSA.

3.1. Statistical analysis of syllables
After segmenting our corpus, we extracted and counted the various types of syllables present in Modern Standard Arabic. This step enabled us to determine the frequency and percentage of each syllable type. CV and CVC syllables are the most frequent, accounting for 55.60% (3360 occurrences) and 25.00% (1511 occurrences) respectively. CVV syllables follow with 990 occurrences (16.38%). Less frequent are CVCC (149 occurrences, 2.46%), CVVC (33 occurrences, 0.55%), and CVVCC (1 occurrence, 0.01%).
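
The paper does not describe how the syllables were extracted. One possible way to obtain such counts, shown here only as a sketch, is to encode each word as a string of C and V symbols (with a long vowel written as VV) and to start a new syllable at every consonant that is directly followed by a vowel. This relies on the assumption that every syllable begins with exactly one consonant. The names `syllabify` and `syllable_shares` are hypothetical.

```python
from collections import Counter


def syllabify(pattern):
    """Split a C/V pattern such as 'CVCCVVC' into syllables like ['CVC', 'CVVC']."""
    onsets = [
        i for i in range(len(pattern) - 1)
        if pattern[i] == "C" and pattern[i + 1] == "V"
    ]
    if not onsets or onsets[0] != 0:
        raise ValueError("pattern must start with a consonant followed by a vowel")
    bounds = onsets + [len(pattern)]
    return [pattern[a:b] for a, b in zip(bounds, bounds[1:])]


def syllable_shares(patterns):
    """Return the percentage of each syllable type over a list of word patterns."""
    counts = Counter(s for p in patterns for s in syllabify(p))
    total = sum(counts.values())
    return {s: round(100 * n / total, 2) for s, n in counts.items()}
```

Applied to the C/V patterns of all corpus words, `syllable_shares` would yield a distribution of the kind reported above.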

3.2. Statistical analysis of graphemes
Table 3 displays the count and percentage of occurrences of each grapheme in the rich and balanced corpus. The graphemes "ل" and "ا" occur most frequently, accounting for 10.09% and 8.32% respectively. Conversely, the graphemes "ظ" and "ء" are among the least frequent, with occurrence rates of 0.18% and 0.47% respectively.

Table 3. Repetitions and percentage for each grapheme
Arabic grapheme | Grapheme repetitions | in %
ل | 1090 | 10.09
ا | 899 | 8.32
م | 751 | 6.95
ن | 711 | 6.58
ي | 599 | 5.54
ر | 588 | 5.44
ع | 461 | 4.27
و | 454 | 4.20
ت | 417 | 3.86
ه | 390 | 3.61
ب | 368 | 3.57
د | 343 | 3.17
أ | 337 | 3.12
ف | 323 | 2.99
ك | 321 | 2.97
ة | 296 | 2.74
آ | 276 | 2.55
ق | 268 | 2.48
س | 254 | 2.35
ح | 208 | 1.92
ج | 186 | 1.72
إ | 132 | 1.22
ذ | 125 | 1.16
ى | 121 | 1.12
ط | 115 | 1.07
خ | 113 | 1.04
ص | 113 | 1.04
ش | 105 | 0.97
ض | 84 | 0.78
ز | 67 | 0.62
ث | 65 | 0.60
غ | 59 | 0.55
ئ | 57 | 0.53
ء | 51 | 0.47
ؤ | 22 | 0.21
ظ | 19 | 0.18

Furthermore, our analysis revealed that diacritic marks constitute 47.59% of the corpus, while graphemes make up the remaining 52.41%, as detailed in Table 4. Consequently, a grapheme-based (non-diacritized) transcription lacks approximately 47.59% of the symbols required for an accurate phonetic transcription. In the analysis of diacritic frequencies, presented in Table 5, Fatha emerged as the most frequent diacritic with 20.62%. Note that tanween (nunation) appears only on the last letter of a word.
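
The split between graphemes and diacritic marks can be reproduced on any diacritized text with the Unicode character categories. The sketch below is an assumption about how such a count could be made, not the authors' tool: letters are taken to be category "Lo" and diacritics category "Mn". The helper names `count_symbols` and `share` are illustrative.

```python
import unicodedata


def count_symbols(text):
    """Return (number of letters, number of diacritic marks) in a diacritized text."""
    letters = sum(1 for ch in text if unicodedata.category(ch) == "Lo")
    marks = sum(1 for ch in text if unicodedata.category(ch) == "Mn")
    return letters, marks


def share(part, total):
    """Percentage of part in total, rounded to two decimals."""
    return round(100 * part / total, 2)


if __name__ == "__main__":
    # Totals reported in Table 4: 10806 graphemes and 9813 diacritic marks.
    print(share(10806, 20619), share(9813, 20619))
```

With the totals of Table 4, the two shares come out as 52.41 and 47.59, matching the reported percentages.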

Table 4. Frequency of graphemes and diacritics
Type | Frequency | Percentage
Graphemes | 10806 | 52.41
Diacritic marks | 9813 | 47.59
Total | 20619 | 100

Table 5. Arabic diacritics and their frequency of occurrence
Type | Frequency | Percentage
Fatha | 4253 | 20.62
Kasra | 1859 | 9.02
Damma | 1028 | 5.14
Shadda | 611 | 2.96
Sukun | 1265 | 6.14
Tanween Fatha | 69 | 0.34
Tanween Kasra | 360 | 1.75
Tanween Damma | 337 | 1.63

3.3. Statistical analysis of phonemes
After applying the grapheme-to-phoneme approach with phonological rules, we conducted a statistical analysis of the occurrence of phonemes in the 3,308 words, covering phonemes at the beginning, middle, and end of words. Table 6 displays the statistics for each phoneme. The Arabic phoneme frequencies in the final corpus are in accordance with those reported in [31].
According to the findings presented in Table 6 and Figure 3(a), the phoneme statistics can be summarized as follows:
In the case of short vowels: Fatha (اَ) is the most frequent, followed by kasra (اِ) and damma (اُ).
In the case of long vowels: the long vowels آ, ى and اَا are counted as one and pronounced Aaa (أى). They are the most frequent, followed by Aii (اِيْ) and Oue (اُوْ).
In the case of consonants: Noon (ن) is the most frequent phoneme, which is explained by the fact that it also arises from the tanween (fathatan اً, dammatan اٌ, and kasratan اٍ). It is followed by Laam (ل) and Hamza (ء); all hamza types (ء, ئ, أ, إ, and ؤ) are counted as one (ء). The consonants thaa (ث), zain (ز), dhaad (ض), and ghayn (غ) each have a frequency lower than 0.5%. Dhaa (ظ) is the least frequent phoneme.
In the case of diphthongs: the difference between the percentages of the two diphthongs is very small (around 0.07%).

Table 6. Arabic phonemes statistics in the FSDM-USMBA corpus
Grapheme | Arpa. symbol | IPA symbol | Description and syllables | Start | Inside | End | Percentage %
Consonants
ء, أ, إ, ؤ, ئ | E | P | Hamza - همزة - CVC-CVC; Alif - ألف - CVC-CVC; Alif+Hamza below; Waw+Hamza above; Ya+Hamza above | 667 | 176 | 33 | 4.43
ب | B | b | Baa - با - CV | 136 | 221 | 49 | 2.05
ت, ة | T | t | Taa - تا - CV; Taa marbuta | 95 | 485 | 144 | 3.66
ث | TH | T | Thaa - ثا - CV | 13 | 55 | 6 | 0.37
ج | JH | g | Jeem - جيم - CVC | 49 | 136 | 10 | 0.99
ح | HH | | Haa - حا - CV | 79 | 114 | 17 | 1.06
خ | KH | x | Khaa - خا - CV | 36 | 73 | 4 | 0.55
د | D | d | Daal - دال - CVVC | 69 | 218 | 93 | 1.92
ذ | DH | D | Thaal - ذال - CVVC | 21 | 99 | 6 | 0.64
ر | R | r | Raa - را - CV | 66 | 452 | 100 | 3.12
ز | Z | z | Zaiy - زي - CVC | 11 | 53 | 5 | 0.35
س | S | s | Seen - سين - CVC | 48 | 190 | 36 | 1.39
ش | SH | S | Sheen - شين - CVC | 44 | 71 | 4 | 0.60
ص | SS | s | Saad - صاد - CVC | 27 | 99 | 1 | 0.64
ض | DD | d | Dhaad - ضا - CVC | 7 | 58 | 21 | 0.43
ط | TT | t | TTaa - طا - CV | 18 | 88 | 5 | 0.56
ظ | DH2 | D | Dhaa - ظا - CV | 5 | 14 | 2 | 0.11
ع | AI | Q | Ayn - عين - CV-CVC | 182 | 235 | 46 | 2.34
غ | GH | G | Ghayn - غين - CV-CVC | 27 | 28 | 4 | 0.30
ف | F | f | Faa - فا - CV | 154 | 146 | 28 | 1.66
ق | Q | q | Qaaf - قاف - CVC | 81 | 187 | 21 | 1.46
ك | K | k | Kaaf - كا - CVC | 124 | 153 | 63 | 1.72
ل | L | l | Laam - لام - CVC | 147 | 914 | 140 | 6.08
م | M | m | Meem - ميم - CVC | 321 | 358 | 91 | 3.89
ن | N | n | Noon - نون - CVC | 76 | 410 | 1063 | 7.84
ه | H | h | Haa - ها - CV | 87 | 117 | 186 | 1.97
و | W | w | Waw - واو - CVC | 166 | 289 | 37 | 2.49
ي | Y | y | Yaa - يا - CV | 143 | 381 | 213 | 3.73
Short vowels
اَ | AE | A | Fatha - فتح | 0 | 2542 | 520 | 15.49
اُ | UH | u | Damma - ضمّ | 0 | 959 | 398 | 6.86
اِ | IH | i | Kasra - كسر | 0 | 1692 | 428 | 11.23
Long vowels
اَا, اَى | AE: | a: | Aaa - أى | 0 | 670 | 350 | 5.16
اُوْ | UW | u: | Oue - أُوْ | 0 | 190 | 23 | 1.08
اِيْ | IY | i: | Aii - إيْ | 0 | 179 | 339 | 2.62
Diphthongs
اَوْ | AW | aw | Aoue - أوْ | 0 | 105 | 8 | 0.57
اَيْ | AY | ay | Aye - أيْ | 0 | 126 | 1 | 0.64

Arabic consonants are classified by their place and manner of articulation. Tables 7 and 8 present the corresponding statistics for these categories, following the classifications established in the literature [32], [33]. The place of articulation refers to where in the vocal tract the airflow is obstructed, leading to distinct sounds. The manner of articulation describes how the airflow is modified or obstructed, further distinguishing consonant sounds. This dual classification gives linguists and phoneticians a structured framework to methodically examine the variety of consonant sounds present in Modern Standard Arabic.

Table 7. Percentage of consonant classes based on their place of articulation
Place of articulation | Consonants | in %
Alveolar | T, D, R, Z, S, SS, DD, TT, L, N | 25.99
Glottal | E, H | 6.40
Bilabial | B, M | 5.94
Palatal | Y | 3.75
Velar | JH, K | 2.72
Uvular | KH, GH, Q | 2.31
Post-Alveolar | JH, SH | 1.59
Labiodental | Q | 1.46
Interdental | TH, DH, DH2 | 1.12

Table 8. Percentage of consonant classes based on their manner of articulation
Manner of articulation | Consonants | in %
Stop | E, B, T, D, Q, K | 15.24
Nasals | M, N | 11.73
Fricative | TH, DH, HH, KH, Z, S, SH, AI, GH, H | 11.23
Glide | W, Y | 2.31
Lateral | L | 6.08
Trill | R | 3.12
Affricate | JH | 0.99
Emphatic stop | DD, TT | 0.99
Emphatic fricative | SS, DH2 | 0.75

4. RESULTS AND DISCUSSION
4.1. Speech corpus
The FSDM-USMBA database was established for this study, comprising a speech corpus and transcriptions from 130 Moroccan speakers (63 males and 67 females) aged between 17 and 50 years. During the recording sessions, speakers were asked to utter the 527 sentences, repeating every sentence 10 times. Voice clarity is fundamental for successful recording. Factors such as recording environment, equipment, and speaker-microphone distance influence sound quality. After testing, a microphone distance of 10 cm proved effective. For accurate capture, recordings should take place in quiet environments with noise levels below 30 dB and closed windows, and weather effects such as wind and rain must be avoided. To streamline the process, speakers recited each sentence 10 times consecutively, resulting in audio files of 25 to 50 seconds. Using WaveSurfer, each recitation was isolated in .wav format by removing the unnecessary parts of the audio signal. Audio file names encode several details about the recording. For instance, "XY18ZW21_10.wav" encodes the following information: the initials X and Y of the speaker's first and last names, followed by the age 18, the city Z, the gender W, the sentence number 21, and 10 denoting the repetition number. This makes the task of segmenting speech easy. The recordings have a sampling rate of 16 kHz and a resolution of 16 bits. During the recording sessions, the waveform and spectrogram of each phrase were reviewed to verify that the entire sentence was included in the recording, as illustrated in Figure 3(b). Only correctly pronounced utterances were retained. Our dictionary contains symbolic representations for all the sounds used in the sentences of our corpus.
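
The file-naming convention described above lends itself to automatic parsing. The sketch below is illustrative only: the regular expression assumes single-letter codes for initials, city, and gender and a two-digit age, which fits the examples "XY18ZW21_10.wav" and MB18FF21_01 but is not specified in full by the text. The `Recording` type and `parse_name` function are hypothetical names.

```python
import re
from typing import NamedTuple


class Recording(NamedTuple):
    first_initial: str
    last_initial: str
    age: int
    city: str
    gender: str
    sentence: int
    repetition: int


# Assumed layout: 2 initials, 2-digit age, city letter, gender letter,
# sentence number, underscore, repetition number, optional ".wav".
NAME_PATTERN = re.compile(
    r"^([A-Z])([A-Z])(\d{2})([A-Z])([A-Z])(\d+)_(\d+)(?:\.wav)?$"
)


def parse_name(file_name):
    """Decode the metadata packed into a recording file name."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    first, last, age, city, gender, sentence, repetition = match.groups()
    return Recording(first, last, int(age), city, gender, int(sentence), int(repetition))


if __name__ == "__main__":
    print(parse_name("MB18FF21_01.wav"))
```

Parsing the names this way allows recordings to be grouped by speaker, gender, or sentence without a separate metadata file.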

Figure 3. Arabic speech corpus illustration: (a) Arabic phonemes of the final corpus and (b) clean speech waveform of an example Arabic sentence spoken by a female speaker, referred to as MB18FF21_01 in our audio database

4.2. Speech test
To test our corpus, a set of experiments was conducted. From the final corpus, we selected a subset of 130 sentences spoken by 60 speakers (30 male and 30 female). With 10 repetitions per sentence, this resulted in a vocal corpus of 78,000 audio files. To optimize system performance, we divided the corpus into training (70%) and testing (30%) sets and adjusted the parameters of the Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs). The symbolic representation of each word is stored in the MCA-USMBA.dic file. The Baum-Welch algorithm, a special case of the Expectation-Maximization (EM) method, is used to estimate transition probabilities during training. The acoustic model is trained with a continuous state probability density, using between 2 and 16 Gaussian mixture distributions and 3 to 5 HMMs. Table 9 shows the achieved Sentence Error Rate (SER) and Word Error Rate (WER). Figures 4 and 5 present the decoding results and the influence of HMM and GMM parameters on SER and WER performance, respectively. The system is evaluated based on three types of errors: insertion, deletion, and substitution, which can occur at both the word and sentence levels. Figure 6 presents concrete examples of sentence recognition errors, illustrating insertions, deletions, and substitutions at the sentence level. The best configuration used 3 HMMs and 8 GMMs, resulting in a SER of 30.00% and a WER of 15.00%. Our results are in accordance with the study of Abushariah and colleagues [19], who reported a word error rate (WER) of 13.48% for Arabic speech recognition on different sentences spoken by different speakers.
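
For readers unfamiliar with the two metrics, the sketch below shows the standard definitions: WER is the minimum number of word insertions, deletions, and substitutions needed to turn the hypothesis into the reference, divided by the number of reference words, and SER is the fraction of sentences containing at least one error. This is a generic illustration, not the scoring tool used in the experiments; the function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def sentence_error_rate(pairs):
    """Fraction of (reference, hypothesis) pairs that are not identical word for word."""
    wrong = sum(1 for ref, hyp in pairs if ref.split() != hyp.split())
    return wrong / len(pairs)
```

For example, a four-word reference recognized with one substituted word and one missing word has a WER of 0.5, while the sentence counts as a single error for the SER.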

Table 9. SER and WER in percentages for different values of the HMM and GMM
HMM | GMM | WER | SER
3 | 2 | 22.00 | 40.00
3 | 4 | 17.75 | 37.50
3 | 8 | 15.50 | 30.00
3 | 16 | 23.50 | 40.50
5 | 2 | 22.20 | 50.20
5 | 4 | 19.00 | 40.00
5 | 8 | 18.00 | 30.00
5 | 16 | 27.50 | 40.50

Figure 4. Optimizing model training with the Baum-Welch algorithm
Figure 5. Impact of HMM and GMM values on SER and WER performance
Figure 6. Possible errors in the recognition of example Arabic sentences

5. CONCLUSION
This paper introduces an efficient method for building a rich and balanced Modern Standard Arabic corpus. The method, described from initial corpus selection to sentence curation, adheres closely to Zipf's law and to the principle of a balanced phonetic distribution. The resulting corpus comprises 527 carefully selected sentences that represent the diverse Arabic phonemes across various linguistic contexts, encompassing consonants, vowels, diphthongs, and syllables. The study evaluates an Arabic continuous speech recognition system using 25% of the final corpus. Fine-tuning hidden Markov model (HMM) and Gaussian mixture model (GMM) parameters notably enhances system performance. The findings indicate that employing 3 HMMs and 8 GMMs achieves the best sentence error rate (SER) and word error rate (WER), at 30.00% and 15.00% respectively. In future work, we aim to extend the recordings to diverse speaker groups and to use the entire final corpus. The results obtained in this study are satisfactory for the development of a continuous Arabic speech recognition system, which encourages us to extend our research scope to spontaneous Arabic speech. We also plan to expand the corpus, explore various ASR system architectures, and develop an automatic continuous speech recognition system for the Moroccan dialect.

REFERENCES
[1] K. Shaalan, A. E. Hassanien, and F. Tolba, Eds., Intelligent natural language processing: trends and applications, vol. 740. Springer, 2017.
[2] Kennedy, An introduction to corpus linguistics. Routledge, 2014.
[3] A. A. M. Alqudah et al., "Modern Standard Arabic speech disorders corpus for digital speech processing applications," International Journal of Speech Technology, vol. 27, no. 1, pp. 157–170, 2024, doi: 10.1007/s10772-024-10086-9.
[4] Z. Oumaima and A. Meziane, "Modern Arabic speech corpus for text to speech synthesis," in 2020 IEEE International Conference on Technology Management, Operations and Decisions (ICTMOD), IEEE, pp. 1–6, November 2020, doi: 10.1109/ICTMOD49425.2020.9380606.
[5] U. Kamath, J. Liu, and J. Whitaker, Deep learning for NLP and speech recognition, vol. 84. Cham, Switzerland: Springer, 2019.
[6] J. Egbert and P. Baker, Eds., Using corpus methods to triangulate linguistic analysis. London: Routledge, 2019.
[7] M. Weisser, Practical corpus linguistics: An introduction to corpus-based language analysis, vol. 43. John Wiley and Sons, 2016.
[8] H. Satori, O. Zealouk, K. Satori, and F. ElHaoussi, "Voice comparison between smokers and non-smokers using HMM speech recognition system," International Journal of Speech Technology, vol. 20, no. 4, pp. 771–777, 2017, doi: 10.1007/s10772-017-9442-0.
[9] H. Satori and F. ElHaoussi, "Investigation Amazigh speech recognition using CMU tools," International Journal of Speech Technology, vol. 17, no. 17, pp. 235–243, 2014, doi: 10.1007/s10772-014-9223-y.
[10] L. Villaseñor-Pineda, M. Montes-y-Gómez, D. Vaufreydaz, and J. F. Serignat, "Experiments on the Construction of a Phonetically Balanced Corpus from the Web," in Conference on Intelligent Text Processing and Computational Linguistics, vol. 2945, pp. 416–419, February 2004, Springer, Berlin, Heidelberg, doi: 10.1007/978-3-540-24630-5-50.
[11] H. M. Wang, "Statistical analysis of Mandarin acoustic units and automatic extraction of phonetically rich sentences based upon a very large Chinese text corpus," International Journal of Computational Linguistics and Chinese Language Processing, vol. 3, no. 2, pp. 93–114, August 1998, doi: 10.30019/IJCLCLP.199808.0005.
[12] V. Radová and P. Vopálka, "Methods of Sentences Selection for Read-Speech Corpus Design," in International Workshop on Text, Speech and Dialogue, vol. 1692, pp. 165–170, September 1999, Springer Berlin Heidelberg, doi: 10.1007/3-540-48239-3-30.
[13] J. Matoušek and J. Romportl, “On building phonetically and prosodically rich speech corpus for text-to-speech synthesis,” in Proceedings of the Second IASTED International Conference on Computational Intelligence, ACTA Press, pp. 442–447, Nov. 20-22, 2006, San Francisco, USA.
[14] K. Yazawa, “Harvard-NGSL Sentences for English Learner Speech Corpora,” in 2022 25th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), IEEE, pp. 1–5, 2022.
[15] H. Schwenk and X. Li, “A corpus for multilingual document classification in eight languages,” arXiv preprint, arXiv:1805.09821, 2018, doi: 10.48550/arXiv.1805.09821.
[16] E. Grave et al., “Learning word vectors for 157 languages,” arXiv preprint, arXiv:1802.06893, 2018, doi: 10.48550/arXiv.1802.06893.
[17] A. A. M. Alqudah et al., “Arabic Automatic Speech Recognition for Speakers With Speech Disorders: A Comprehensive Review,” 2023 International Conference on Information Technology (ICIT), Amman, pp. 667–673, 2023, doi: 10.1109/ICIT58056.2023.10225965.
[18] M. Alghamdi, A. H. Alhamid, and M. M. Aldasuqi, Database of Arabic sounds: sentences, in Arabic, Technical report, King Abdulaziz City for Science and Technology (KACST), Riyadh, Saudi Arabia, 2003.
[19] M. A. Abushariah et al., “Phonetically rich and balanced text and speech corpora for Arabic language,” Language Resources and Evaluation, vol. 46, pp. 601–634, 2012, doi: 10.1007/s10579-011-9166-8.
[20] Y. Yuwan and D. P. Lestari, “Automatic extraction phonetically rich and balanced verses for speaker-dependent quranic speech recognition system,” in K. Hasida and A. Purwarianti (eds), Computational Linguistics, vol. 593, pp. 65–75, 2015, doi: 10.1007/978-981-10-0515-2_5.
[21] D. Qi and H. Wang, “Zipf’s Law for Speech Acts in Spoken English,” Journal of Quantitative Linguistics, pp. 231–258, 2024, doi: 10.1080/09296174.2023.2202470.
[22] A. Ech-Charfi, “Frequency and text coverage in Standard Arabic based on Arabic Internet Corpus,” Journal of Applied Language and Culture Studies, vol. 6, no. 3, pp. 1–19, 2023.
[23] R. Bassiouney and E. G. Katz (Eds.), Arabic language and linguistics. Georgetown University Press, 2012.
[24] A. Hussein, S. Watanabe, and A. Ali, “Arabic speech recognition by end-to-end, modular systems and human,” Computer Speech and Language, vol. 71, p. 101272, 2022, doi: 10.1016/j.csl.2021.101272.
[25] I. Guellil, H. Saâdane, F. Azouaou, B. Gueni, and D. Nouvel, “Arabic natural language processing: An overview,” Journal of King Saud University-Computer and Information Sciences, vol. 33, no. 6, pp. 497–507, 2021, doi: 10.1016/j.jksuci.2019.02.006.
[26] Y. A. El-Imam, “Phonetization of Arabic: rules and algorithms,” Computer Speech and Language, vol. 18, no. 4, pp. 339–373, 2004, doi: 10.1016/S0885-2308(03)00035-4.
[27] F. Sindran, F. Mualla, T. Haderlein, K. Daqrouq, and E. Nöth, “Automatic phonetization-based statistical linguistic study of standard Arabic,” Int. J. Comput. Linguist. (IJCL), vol. 7, pp. 38–53, 2016.
[28] M. Elmahdy, R. Gruhn, and W. Minker, Novel techniques for dialectal Arabic speech recognition. Springer Science and Business Media, 2012.
[29] T. Buckwalter and D. Parkinson, A frequency dictionary of Arabic: Core vocabulary for learners. Routledge, 2014.
[30] A. Masrai and J. Milton, “How different is Arabic from other languages? The relationship between word frequency and lexical coverage,” Journal of Applied Linguistics and Language Research, vol. 3, no. 1, pp. 15–35, 2016.
[31] A. Amrouche, A. Abed, K. Ferrat, K. N. Boubakeur, Y. Bentrcia, and L. Falek, “Balanced Arabic corpus design for speech synthesis,” International Journal of Speech Technology, vol. 24, no. 3, pp. 747–759, 2021, doi: 10.1007/s10772-021-09846-8.
[32] J. C. Watson, The phonology and morphology of Arabic, Oxford University Press, USA, 2002.
[33] F. Sindran, Automatic Phonetic Transcription of Standard Arabic with Applications in the NLP Domain (Doctoral dissertation, Friedrich-Alexander-Universitaet Erlangen-Nuernberg (Germany)), 2021.
BIOGRAPHIES OF AUTHORS
Youssef Boutazart received the engineer degree in Automation from the Belarusian State Agrarian Technical University of Minsk, Belarus, and the Bachelor's degree in electronics from Moulay Ismail University of Meknes, Morocco. Since 2009, he has been an administrator at the Presidency of Sidi Mohamed Ben Abdellah University. He is currently a Ph.D. student in the LISAC Laboratory of the Dhar Mehrez Faculty of Sciences of Fez. His research interests focus on the development of rich and balanced speech corpora for high-performance speech recognition systems. He can be contacted at email: youssef.boutazart@usmba.ac.ma.
Naouar Laaïdi received her Master's degree in Electronics, Automatics and Signal Processing from the Faculty of Sciences, Chouaib Doukkali University, El-Jadida. She is currently a Ph.D. student at the LISAC Laboratory, Faculty of Sciences of Fez, Sidi Mohamed Ben Abdellah University. Her areas of specialization include clustering, machine learning, classification, and automatic speech recognition. She can be contacted at email: naouarlaaidi@gmail.com.