Indonesian Journal of Electrical Engineering and Computer Science
Vol. 39, No. 2, August 2025, pp. 1360-1372
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v39.i2.pp1360-1372
Recognizing AlMuezzin and his Maqam using deep learning approach
Nahlah Mohammad Shatnawi1, Khalid M. O. Nahar1, Suhad Al-Issa2, Enas Ahmad Alikhashashneh3
1Department of Computer Science, Faculty of Information Technology and Computer Sciences, Yarmouk University, Irbid, Jordan
2Department of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast BT7 1NN, UK
3Department of Information Systems, Faculty of Information Technology and Computer Sciences, Yarmouk University, Irbid, Jordan
Article Info
Article history:
Received Jul 14, 2024
Revised Mar 27, 2025
Accepted Jul 2, 2025
Keywords:
Aladhan
AlMuezzin
Arabic language
Deep learning
Maqam
Speech recognition
VGG-16
Abstract
Speech recognition is an important topic in deep learning, especially for the Arabic language, where attempts to recognize Arabic speech are hindered by the nature of the language, its frequent overlap, the lack of available resources, and other limitations related to programming matters. This paper attempts to reduce the gap that exists between speech recognition and the Arabic language and to address it through deep learning. The focus of this paper is on the Call for Prayer (Aladhan: الآذان) as one of the most famous Arabic utterances: its wording is stable, but it differs in the notes and the shape of its sound, which is known as the phonetic Maqam (Maqam: المقام الصوتي). In this paper, a solution is presented to identify the voice of AlMuezzin (المؤذن), recognize AlMuezzin, and determine the form of the Maqam through a VGG-16 model. The VGG-16 model was examined with four extracted features: the Chroma feature, the LogFbank feature, the MFCC feature, and spectral centroids. The best result was obtained with Chroma features, where the accuracy of Aladhan recognition reached 96%. On the other hand, the highest accuracy of Maqam classification reached 95% using the spectral centroid feature.
This is an open access article under the CC BY-SA license.
Corresponding Author:
Nahlah Shatnawi
Department of Computer Science, Faculty of Information Technology and Computer Sciences
Yarmouk University
21163, Irbid, Jordan
Email: nahlah.s@yu.edu.jo
1. INTRODUCTION
Speech recognition is one of the most active research areas; it aims to identify the speaker based on the characteristics of their voice [1]. Speech recognition contributes to improving several disciplines, such as health care and security. Several state-of-the-art works have recently explored the use of feature extraction techniques to describe a massive amount of data using different feature vectors that represent different physical and acoustic meanings. Selecting a good feature helps to improve the accuracy of the recognition. Thus, choosing the feature extraction technique is considered a critical step in the speaker recognition process. Currently, the most used speech characteristics are the linear prediction cepstral coefficients (LPCC) [2] and the Mel-frequency cepstral coefficients (MFCC) [3]. These features have achieved good recognition results in speech recognition [4], [5]. Traditional automatic speech recognition (ASR) systems still employ an architecture consisting of numerous components, including but not limited to lexicon building, language models, and acoustic models. Various techniques are employed to construct and process these components, including
traditional machine learning (ML) techniques, Gaussian mixture models, hidden Markov models, a deep neural network, and a hybrid HMM-DNN [6].
Aladhan is a call to prayer for Muslims, where the AlMuezzin pronounces the call to prayer every day at the beginning of the time of each of the five obligatory prayers. In the past, the AlMuezzin used to give the call to prayer from a high place, from the minaret, or from the roof of the mosque, but now the AlMuezzin gives the call to prayer through amplification devices, which makes the matter much easier for him. Aladhan is an announcement of the time of prayer with specific and customized words, through which the AlMuezzin informs people of the times of prayer and invites them to prayer. Aladhan is announced through specific words in the following format, as in Figure 1.
Figure 1. Aladhan words format
AlMaqamat is a complex tonal system used in traditional Arabic music. Phonetic Maqams are characterized by a set of specific pitches and their own rules of performance. AlMaqamat can be used to create a wide range of musical emotions and effects. Phonetic Maqams are usually classified into two main groups: basic Maqams and sub-Maqams. The basic Maqams are the Maqams that form the basis of the Arabic musical system. There are nine basic Maqams, among them: AlRust, AlNahawand, AlHijaz, AlSiyka, AlBayat, AlSaba, and AlEajam. Sub-Maqams are Maqams that are derived from the basic Maqams. There are many sub-Maqams, but some are more common than others. Various Maqams, each with a distinctive melody accompanying it, are the feature of the call to prayer in most Islamic countries, as it is a call to prayer with sweet Maqams appropriate to the time of the obligatory prayer and the psychological state of the people. The Muezzins in mosques in most Islamic countries are keen to perform this call in the best way, in a manner that suits the Maqams adopted hundreds of years ago.
Hence the importance of this work, as it links deep learning to the Arabic language, especially classical Arabic music. Because of the importance of Arabic music, Aladhan was chosen as an actual application for speech recognition: it is known among people, especially Muslims, it contains a specific and fixed set of words, its form is stable, and it also has a specific and fixed tone in the musical Maqamat.
To achieve the objective of this paper, the authors present a VGG-16 model [7] to identify AlMuezzin (المؤذن) and classify his Maqam (المقام الصوتي). This is performed by extracting a set of features from the collected dataset. Then, speaker-independent speech recognition is performed through the VGG-16 model, because no two speakers have the same voice and the organs of sound production differ. Thus, the VGG-16 model can distinguish different Azzan recordings collected from different Muezzins under several Maqamat: Al-Hejaz, Al-Sika, Al-Rust, Al-Saba, Al-Ashaq, and Nahawand.
This paper describes an interesting idea of using sound features for Muezzin identification. The proposed approach in this research is timely, uncommon, and could be improved and applied in many other fields. The work in this study can be the first milestone to show the effectiveness of deep learning (DL) for classification and the reliance on sound features to identify AlMuezzin.
The rest of this paper is organized as follows: section 2 presents the works related to the proposed approach. Section 3 introduces the method, including the collected dataset. Section 4 presents the experimental results. Finally, the conclusion is drawn.
2. RELATED WORK
This section presents a few works that utilize DL, ML, feature extraction, and feature reduction, and shows how they relate to this work.
The researchers of [8] collected a modern Arabic dataset to assess the performance of a few of the DL strategies in human speech recognition (HSR). In this work, the accuracy of the modular hidden Markov model-deep neural network (HMM-DNN) frameworks was compared to native speaker performance. The comparison shows that human performance on the Arabic dialect is still significantly better than that of machines, with an absolute word error rate (WER) gap of 3.5% on average.
On the other hand, in paper [9] there is an endeavor to construct a strong, robust diacritized Arabic ASR utilizing deep learning approaches. The authors utilized the standard Arabic single speaker corpus (SASSC), which contains seven hours of modern standard Arabic speech, to prepare and test a modern CTC-based ASR, a convolutional neural network (CNN)-long short-term memory (LSTM) model, and an attention-based end-to-end approach to make strides in diacritized Arabic ASR. From the exploratory results, the researchers conclude that the CNN-LSTM with an attention framework outperforms conventional ASR and the joint CTC-attention ASR framework in the task of Arabic speech recognition.
In work [10], the researchers applied a deep feed-forward neural network (DFFNN) to the Arabic natural audio dataset (ANAD), which is designed for Arabic automatic speech recognition. The ANAD dataset contains three discrete feelings: angry (A), surprised (S), and happy (H). The researchers also utilized eight videos of live calls between an anchor and a human outside the studio, downloaded from online Arabic talk shows, to test and evaluate the proposed approach. The target was to recognize human feelings from the sounds. They proposed an automated Arabic speech emotion recognition system using feature extraction to extract the most imperative features from the dataset, which were then utilized to train the DFFNN. This investigation shows that the DFFNN achieves the highest accuracy when applying PCA to the extracted features, with an accuracy of 98.56%.
Moreover, the work in [11] presented a speech emotion recognition system based on deep neural network hidden Markov models (DNN-HMM) by extracting MFCC and epoch-based features. The researchers concluded that the accuracy when utilizing MFCC features was 60.86%, whereas with epoch-based features it was 54.52%. The recognition performance improved to 64.2% when the MFCC and epoch features were combined.
Fahad et al. [12] presented a convolutional neural network for Arabic speech recognition. In this investigation, they focused on single-word Arabic automatic speech recognition (AASR). They utilized Mel-frequency spectral coefficients (MFSC) and Gammatone-frequency cepstral coefficients (GFCC) with their first- and second-order derivatives. They found that the greatest accuracy, obtained when utilizing GFCC with a CNN, is 99.77%, and the CNN achieved much better performance in AASR.
Traditional ML approaches such as the random forest (RF) were used by [13] to distinguish different speakers by extracting Mel-frequency cepstral coefficient (MFCC) and reconstructed phase space (RPS) features. The researchers of this investigation observed that the accuracy with MFCC is higher than with RPS: the accuracy obtained using RPS features was 71%, while the accuracy obtained using MFCC features was 97%.
Another speaker-identification framework was proposed by the researchers in [14] to recognize spoken sounds by utilizing particular words. The researchers extracted the MFCC features and then utilized them as input for the recurrent neural network (RNN) and LSTM. They found that the accuracy of different RNNs is 87.74%, and the accuracy of a single RNN is 80.58%.
On the other hand, Utomo et al. [15] proposed automatic speaker recognition by an artificial neural network (ANN).
Moreover, the work in [16] proposed text-speaker recognition to recognize what the speaker said. They utilized MFCC, spectrum, and log-spectrum to extract the features from the speaker's sound wave; the extracted features were then utilized to train and evaluate the LSTM and RNN models. The accuracy using MFCC was 95.33%, whereas using spectrum and log-spectrum it was 98.7%.
Analysts in [17] proposed speaker identification in a noisy environment. They utilized a CNN to classify 60 speakers, with 4 voice samples for each speaker. The researchers utilized MFCC to extract the features from the speaker signal and found an accuracy of 87.5%.
In addition, analysts in [18] propose a speaker-identification framework utilizing the Gaussian mixture model (GMM) and MFCC to extract the features. The researchers extracted the features and compared
them with all the features they had saved. Numerous inquiries have been made comparing these two strategies (GMM and DNN-HMM) for this purpose.
In Chowdary et al. [19], analysts compared extraction and normalization strategies for speakers; they utilized MFCC and PNCC for feature extraction. This framework was applied to six men and two women, and the speaker identification performance was 100%. PNCC had far better performance than MFCC in recognizing women.
Al-Kaltakchi et al. [20] compared extraction strategies by utilizing FFANN and SVM, and utilized numerous extraction methods such as MFCC, PLP, and LPC. They applied these feature extractions to two classifiers, ANN and SVM, to discover the finest feature extractions that could be applied to these two classifiers. The best accuracy they obtained, when using MFCC, PLP, and LPC with SVM and ANN, was 100%.
The work in [21] examined the efficiency of extracting and matching feature strategies utilizing vector quantization and MFCC together in speaker-identification and speech recognition applications. They found that utilizing vector quantization reduced the time of comparison between the input speech and the testing speech.
Several papers have used speech recognition systems as part of smart home automation systems. For example, the work in [22] proposed a speech recognition framework that does not require the web, to assist individuals with disabilities, by utilizing Ivona, with the input converted to text and received by GSM modems. This framework permits individuals with disabilities to deliver house voice commands to carry out a particular command.
Another study, by [23], proposed a speech recognition framework for individuals with disabilities to control their wheelchairs and other devices. The researchers used MATLAB to create 8 commands (GO, HELP, START, REPEAT, ERASE, NO, ENTER, YES) by extracting MFCC features; all extracted features are stored in K-means cluster form. The average accuracy of the 8 commands is 73.54%, and the average accuracy of the listener is 82.25%.
In the work of [24], [25], the researchers constructed speaker-independent ASR systems using the DeepSpeech model, then assessed them using the WER. The DeepSpeech model, one of the most well-known open-source ASR models from Mozilla, was applied to Quranic recitations by male and female reciters. The target is for the system to be utilized efficiently by anyone, regardless of gender or age, and it obtained intriguing results.
Shareef and Al-Irhayim [26] classify speech sound errors of children with speech impairments when Arabic words are incorrectly pronounced. They employ Mel-frequency spectral coefficients for feature extraction and a deep LSTM network. They attain a classification accuracy of 97.99% with a loss of 0.18%.
To the best of our knowledge, a deep learning approach has not been used in the automatic speech recognition of AlMuezzin. The call to prayer is a special type of speech that announces the prayer in a mosque through a standardized set of words with a Maqam. Recognizing the AlMuezzin together with the Maqam he follows will contribute to identifying the speaker, just as identifying the Maqam will contribute to the problem of speech synthesis. Based on the investigation and literature review conducted, this paper has been prepared.
3. METHOD
The proposed method for recognizing AlMuezzin and his Maqam using a deep learning approach consists of two phases. The first phase is for AlMuezzin identification, and the second phase is for the acoustic-stand classification of Azan (Maqam classification). For each phase, a different dataset was collected and used.
In this work, the VGG16 model in [7] is used. VGG16 reaches a test accuracy of 92.7% with almost 14 million training photos from 1,000 item classes in ImageNet and was one of the best models from the ILSVRC-2014 competition. VGG is one of the most widely used deep learning models for image recognition. As the name implies, VGG16 is a 16-layer deep neural network. With 138 million parameters overall, VGG16 is a relatively large network, huge even by today’s standards. Nevertheless, the key selling point of the VGGNet16 architecture is its simplicity, as it incorporates the most significant convolutional neural network features.
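As a concrete illustration of how such a pre-trained network can be reused here, the following minimal sketch (assuming TensorFlow/Keras and an ImageNet-pretrained VGG-16; this is an illustrative assumption, not the authors' exact training code) loads VGG-16 without its original classifier head and attaches a new softmax layer sized for the Muezzin classes:

```python
# Minimal transfer-learning sketch; NUM_CLASSES = 19 assumes one class per
# Muezzin in phase 1 (an assumption based on the dataset description).
import tensorflow as tf

NUM_CLASSES = 19
INPUT_SHAPE = (224, 224, 3)   # VGG-16 expects 224x224 RGB inputs

# Load VGG-16 pre-trained on ImageNet, without its original classifier head.
base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=INPUT_SHAPE)
base.trainable = False        # keep the convolutional features frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```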
Since the problem is to achieve both AlMuezzin identification and the acoustic-stand classification of Azzan, the pictorial design of the methodology is divided into two phases, as shown in Figure 2: Figure 2(a) shows AlMuezzin identification, and Figure 2(b) shows Aladhan Maqam classification.
After collecting the data, it needs to be processed; then the audio file spectrum is extracted using different features to be trained on a VGG16 model. The detailed steps of the methodology are explained in the following subsections.
Figure 2. The methodology phases: (a) AlMuezzin identification and (b) Aladhan Maqam classification
To ensure full reproducibility of the experimental setup, the methodology is structured into two clearly defined phases, AlMuezzin identification and Maqam classification, each with distinct datasets and preprocessing pipelines. The experimental workflow begins with a carefully curated dataset of audio recordings collected from YouTube, comprising well-known Muezzins and multiple Maqam styles. These recordings were converted to WAV format (16-bit stereo, 44.1 kHz) and segmented into 20-second clips to manage memory efficiency and enhance model performance. Feature extraction was then performed using four widely validated acoustic features: MFCC, LogFBank, Spectral Centroid, and Chroma. Each feature set was independently used to train a pre-trained VGG-16 model, allowing a comparative evaluation of their effectiveness. For AlMuezzin identification, the model was trained on 1,211 samples and tested on 295, while for Maqam classification, 287 training samples and 71 testing samples were used. The models were validated using standard performance metrics (accuracy and loss) over multiple epochs, and visualizations of training behavior (e.g., accuracy/loss curves) are included in the Results section. A detailed pictorial representation of the pipeline is shown in Figure 3, and structured equations for each feature extraction method are provided in Table 1 to ensure transparency and reproducibility of the experimental design.
3.1. Dataset
The dataset was collected manually and carefully from YouTube in two stages; the dataset can be found at the link in [27]. In the first stage, different audio records of the Aladhan for 19 famous male Muezzins were collected, for a total of 105 audio records. Then, each audio file in the dataset was converted into a WAV audio file with 16-bit stereo and a 44.1 kHz sample rate so that the VGG-16 model can handle it [7]. After that, the audio files were divided by reciter, and different folders were created with the names of the reciters; each reciter's audio files are stored in the folder carrying that reciter's name. Thus, the intersection of audio files between reciters is avoided, in order to perform speaker-independent identification. The audio files were divided into 80% for the training group and 20% for testing; the large training ratio is to ensure sufficient and good training of the system.
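A hedged sketch of this folder-per-reciter organization and the 80/20 split is shown below; the root directory name and the use of scikit-learn are illustrative assumptions, not the authors' actual tooling:

```python
# Build (file, reciter) pairs from one folder per reciter, then split 80/20.
import os
from sklearn.model_selection import train_test_split

dataset_root = "aladhan_dataset"          # hypothetical root folder
files, labels = [], []
for reciter in sorted(os.listdir(dataset_root)):
    folder = os.path.join(dataset_root, reciter)
    for wav in sorted(os.listdir(folder)):
        files.append(os.path.join(folder, wav))
        labels.append(reciter)            # the folder name is the class label

# 80% training / 20% testing, stratified so every reciter appears in both sets
train_files, test_files, train_labels, test_labels = train_test_split(
    files, labels, test_size=0.2, stratify=labels, random_state=42)
```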
Figure 3. Graphical overview of the simulation and experimental setup
To increase the performance of the prediction, each audio recording of Aladhan was divided into several audio tracks of 20 seconds each, because the memory resources are limited and to avoid out-of-memory issues; the total amount of data for training thus became 1,506 audio clips. After deleting the empty audio files and the corrupted audio files, the remaining audio files were separated into 1,211 audio records for training and 295 audio records for testing. Finally, noise was removed from the data to ensure the data is clean. The target of this stage is to perform AlMuezzin identification.
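A minimal sketch of this conversion and 20-second segmentation step follows; it assumes the pydub library (backed by ffmpeg), and the file paths are hypothetical rather than the authors' actual ones:

```python
# Convert a source recording to 16-bit stereo 44.1 kHz WAV and cut it into
# 20-second clips, skipping empty tails.
from pydub import AudioSegment

CLIP_MS = 20 * 1000  # 20 seconds in milliseconds

def convert_and_segment(src_path, dst_prefix):
    audio = AudioSegment.from_file(src_path)
    audio = audio.set_sample_width(2).set_channels(2).set_frame_rate(44100)
    for i, start in enumerate(range(0, len(audio), CLIP_MS)):
        clip = audio[start:start + CLIP_MS]
        if len(clip) == 0:
            continue
        clip.export(f"{dst_prefix}_{i:03d}.wav", format="wav")

convert_and_segment("aladhan_example.mp3", "aladhan_example")
```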
In the second stage, recordings were collected of different calls to prayer from different audio Maqams for different Muezzins, such as Hijaz, Sikka, Al-Sada’, Al-Saba’, Al-Ashaq, and Al-Nahawand, as in Table 2. The total number of audio recordings collected was 36. Also, as in the first stage, each recording was divided into several audio tracks of 20 seconds each, bringing the size of the data to be trained in this phase to 358. A splitting ratio of 80% for training and 20% for testing was used, so 287 audio clips were used for training and 71 for testing. The target of this stage is to perform Al-Maqam identification.
3.2. AlMuezzin identification
For AlMuezzin identification, the collected data is first preprocessed as mentioned in the dataset section; after that, feature extraction is performed using four distinct feature types, which are then used to train the pre-trained VGG16 model. The features extracted from the speech signal for analysis are: MFCC, Spectral Centroid, Chroma, and LogFBank, as shown in Figure 4.
Figure 4. Features extraction from four different feature types
MFCC is so popular because its foundation is a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. For the spectral centroid, every frequency band's spectrum has a center of gravity.
When a piece of music has pitches that can be meaningfully categorized and is tuned close to the equal-tempered scale, it is said to have chroma. LogFBank, in turn, is widely used in the robust speech recognition community. Table 1 shows the computation of the four features.
Table 1. Feature computations of MFCC, Spectral Centroid, Chroma, and LogFBank

MFCC: pre-emphasis $x'(n) = x(n) - x(n-1)$; the windowed frame is transformed with $X(f) = \mathrm{FFT}(w(n))$; a Mel filter bank is applied to the power spectrum of the signal; the log energy of each band is taken, $y_k = \log\big(\sum_j |Y_k(j)|^2\big)$; finally, the discrete cosine transform (DCT) of the log-energy outputs is computed.

Spectral Centroid: $\mathrm{centroid} = \dfrac{\sum_{n=0}^{N-1} f(n)\, x(n)}{\sum_{n=0}^{N-1} x(n)}$, calculated as the weighted mean of the frequencies present in the signal.

Chroma: computed using short-time Fourier transforms in combination with binning strategies, $X[n,k] = \sum_{m=0}^{M-1} f[n-m]\, g[m]\, e^{-2\pi i m k / N}$, where $M$ is the window length of $g$ and $N$ is the number of samples in $f$.

LogFBank: pre-emphasis $x'(n) = x(n) - x(n-1)$; windowing $w(n) = x'(n)\, h(n)$; $X(f) = \mathrm{FFT}(w(n))$; a logarithmically spaced filter bank is applied to the power spectrum; the log energy is taken, $y_k = \log\big(\sum_j |Y_k(f)|^2\big)$.
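For illustration, the four features in Table 1 can be computed with common audio libraries; the sketch below assumes librosa and python_speech_features, which is an assumption about tooling rather than the authors' implementation, and the frame parameters are library defaults:

```python
# Extract MFCC, spectral centroid, chroma, and log filter-bank features
# from one 20-second clip (hypothetical file name).
import librosa
from python_speech_features import logfbank

y, sr = librosa.load("aladhan_clip_000.wav", sr=44100)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # MFCC
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)    # spectral centroid
chroma = librosa.feature.chroma_stft(y=y, sr=sr)            # chroma
fbank = logfbank(y, samplerate=sr, nfilt=26)                # LogFBank
```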
After applying the features to the dataset, VGG16 is used to identify AlMuezzin. Tiny convolution filters make up a VGG network: thirteen convolutional layers and three fully connected layers make up VGG16. An overview of the VGG architecture is provided below:
Input: VGGNet is fed a 224 by 224 picture input.
Convolutional layers: VGG’s convolutional filters use a 3x3 receptive field, the smallest available. In addition, VGG performs a linear transformation on the input using a 1x1 convolution filter.
ReLU activation: AlexNet’s primary innovation for cutting training time is the rectified linear unit (ReLU) activation function. ReLU is a linear function that yields zero for negative inputs and a matching output for positive inputs. To maintain the spatial resolution following convolution, VGG uses a fixed convolution stride of 1 pixel (the stride value shows how many pixels the filter “moves” to cover the complete space of the picture).
Hidden layers: unlike AlexNet, which uses local response normalization, all of the VGG network’s hidden layers employ ReLU. Local response normalization adds little to the overall accuracy but lengthens the training period and increases memory usage.
Pooling layers: a pooling layer is placed after a series of convolutional layers; it serves to lower the dimensions and parameter count of the feature maps produced by each convolution step. Pooling is essential given the quick increase in the number of filters, from 64 to 128 to 256 and finally 512 in the final layers.
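Because VGG-16 expects a 224x224 three-channel image, each extracted 2-D feature matrix must be rendered as an image of that size before it is fed to the network. The exact rendering procedure is not specified in the paper, so the following is only a plausible sketch under that assumption:

```python
# Scale a 2-D feature matrix to 0-255, replicate it over three channels,
# and resize it to the 224x224 input VGG-16 expects.
import numpy as np
import tensorflow as tf

def feature_to_vgg_input(feature_matrix):
    f = feature_matrix.astype("float32")
    f = 255.0 * (f - f.min()) / (f.max() - f.min() + 1e-8)
    img = np.stack([f, f, f], axis=-1)                   # (H, W, 3)
    img = tf.image.resize(img, (224, 224)).numpy()       # VGG-16 input size
    return tf.keras.applications.vgg16.preprocess_input(img)
```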
3.3. Aladhan Maqam classification
Aladhan Maqam classification is the second phase, where the resulting data from phase 1 is used beside another dataset collected as mentioned in the dataset section. In this phase, the data is also preprocessed and filtered, and then the spectral centroid is applied for feature extraction using the equations in Table 1. The VGG-16 model can thus distinguish different Azzan recordings collected from different Muezzins under several Maqamat: Al-Hejaz, Al-Sika, Al-Rust, Al-Saba, Al-Ashaq, and Nahawand; details of each Maqam are given in Table 2.
Table 2. Different Muezzins under several Maqams
Acoustic stand - Description
Al-Saba - Saba is a very common maqam in Arabic music. The ladder of this maqam begins on the decision, and the Hijaz on the third degree overlaps with it.
Al-Hejaz - Maqam Hijaz is the main maqam in the Maqam al-Hijaz family. The scale of the Hijaz maqam begins with the genus Hijaz on the decision, followed by the Rust.
Nahawand - Maqam Nahawand is the main maqam in the Maqam Nahawand family. The ladder of this maqam begins with the genus al-Nahawand on the first degree (Qarar), followed by the genus al-Hejaz.
Al-Rust - The Rust Maqam is the main maqam in the Rust Maqam family. The scale of this maqam begins with the genus of the Rust on the first degree (Qarar), followed by any of the genus Nahawand or the genus of the higher Rust.
Al-Sika - It is the main maqam in the Sika Maqam family, but it is rarely used as an independent maqam.
Al-Ashaq - The Maqam Ashaq Egyptian is a sub-maqam in the Maqam Nahawand family.
4. RESULTS AND DISCUSSION
At the beginning, a neural network model was built, where the identification problem was resolved using a pre-trained VGG-16 model. Ultimately, the necessary model for the identification process was generated by using 80% of the gathered data in the training phase, which was then fed as samples to VGG16. In this section, the conducted experiments and the obtained results are presented.
In the first phase, four different experiments were conducted to perform AlMuezzin identification. In the first experiment, the proposed model trained using MFCC features obtained 93% accuracy. In the second experiment, the model trained using Logfbank features obtained 96% accuracy. In the third experiment, the model trained using spectral centroid features obtained 94% accuracy. In the fourth experiment, the model trained using Chroma features obtained 96% accuracy. All accuracy results from the conducted experiments are listed in Table 3.
Table 3. Classification accuracy by VGG16 for AlMuezzin identification using different features
Feature            Accuracy
MFCC               93%
Logfbank           96%
Spectral centroid  94%
Chroma             96%
For Adan identification, Logfbank and Chroma performed better than the other features in terms of accuracy. The accuracy and loss for each model in relation to the epoch number (100) are shown in Figures 5-8 (each figure shows, over the training and validation epochs, (a) accuracy and (b) loss).
Figure 5. The number of epochs for (a) accuracy trends during training and validation using MFCC features and (b) loss function behavior showing the model’s adaptation over epochs
Figure 6. The number of epochs for (a) accuracy progression using Chroma features and (b) loss function graph highlighting convergence and overfitting tendencies
Figure 7. The number of epochs for (a) accuracy progression during training and validation using Logfbank features and (b) loss function behavior indicating model convergence
Figure 8. The number of epochs for (a) validation accuracy trends across epochs using spectral centroid features and (b) loss graph demonstrating overfitting tendencies in spectral centroid-based classification
Because validation data are a collection of fresh data points that the model is unfamiliar with, the validation accuracy is typically lower than the training accuracy; training data are data that the model is already familiar with, and this is noticed in Figures 5-8. Therefore, it stands to reason that the accuracy is lower when using validation data than when using training data.
However, in Figure 6 it is noticed that in the first epochs (approximately the first 15 epochs) the validation data's accuracy exceeds that of the training data. This can be interpreted as the proposed model being a highly accurate predictor that takes into account a wide range of boundary situations. Nevertheless, considering that some of the data points in the validation data present some challenge to the model, the model can be considered good if its accuracy on the validation data is approximately 80% of that on the training data.
When the accuracy of the validation data is higher than the accuracy of the training data, this can be interpreted as a good indicator that the hyperparameters were properly adjusted during training, leading to a superior prediction in the validation.
It was found that the validation loss is significantly higher than the training loss, as shown in Figures 5-8. This is because of the overfitting of this model: the validation loss is significantly higher than the training loss, and while the training loss keeps decreasing, the validation loss does not. This indicates that the complexity of the presented model is sufficient for it to "memorize" the patterns found in the training set. In these kinds of cases, the proposed model needs to be regularized, and that is what we attempt to do in the upcoming work.
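As a hedged illustration of that planned regularization, dropout and L2 weight decay could be added to the dense head of the network; the layer sizes and hyperparameter values below are assumptions, not settings reported in the paper:

```python
# Regularized classification head: L2 weight decay plus dropout.
from tensorflow.keras import Sequential, layers, regularizers

head = Sequential([
    layers.Flatten(),
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),                  # randomly drop half the activations
    layers.Dense(19, activation="softmax"),  # 19 Muezzin classes (assumption)
])
```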
Also, an experiment was conducted to identify Al-Maqam, where a training accuracy of 95% and a validation accuracy of 74% were obtained, as shown in Figure 9 in relation to the number of epochs (60).
Figure 9. The number of epochs for (a) accuracy trends for Al-Maqam classification using spectral centroid and (b) the corresponding loss function analysis highlighting performance variation across the training phase using spectral centroid