IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 14, No. 4, August 2025, pp. 3421∼3434
ISSN: 2252-8938, DOI: 10.11591/ijai.v14.i4.pp3421-3434
Exploring bibliometric trends in speech emotion recognition (2020-2024)
Yesy Diah Rosita1,2, Muhammad Raa’u Firmansyah2, Annisaa Utami2
1Center of Excellence for Human Centric Engineering, Institute of Sustainable Society, Telkom University, Main Campus, Bandung City, Indonesia
2Informatics Engineering Study Program, Telkom University, Purwokerto Campus, Banyumas City, Indonesia
Article Info

Article history:
Received Apr 21, 2024
Revised Jun 12, 2025
Accepted Jul 10, 2025

Keywords:
Audio features
Classification model
Emotions
Preprocessing
Speech emotion recognition
ABSTRACT

Speech emotion recognition (SER) is crucial in various real-world applications, including healthcare, human-computer interaction, and affective computing. By enabling systems to detect and respond to human emotions through vocal cues, SER enhances user experience, supports mental health monitoring, and improves adaptive technologies. This research presents a bibliometric analysis of SER based on 68 articles from 2020 to early 2024. The findings show a significant increase in publications each year, reflecting the growing interest in SER research. The analysis highlights various approaches in preprocessing, data sources, feature extraction, and emotion classification. India and China emerged as the most active contributors, with external funding, particularly from the National Natural Science Foundation of China (NSFC), playing a significant role in the advancement of SER research. Support vector machine (SVM) remains the most widely used classification model, followed by K-nearest neighbors (KNN) and convolutional neural networks (CNN). However, several critical challenges persist, including inconsistent data quality, cross-linguistic variability, limited emotional diversity in datasets, and the complexity of real-time implementation. These limitations hinder the generalizability and scalability of SER systems in practical environments. Addressing these gaps is essential to enhance SER performance, especially for multimodal and multilingual applications. This study provides a detailed understanding of SER research trends, offering valuable insights for future advances in speech-based emotion recognition.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Yesy Diah Rosita
Informatics Engineering Study Program, Telkom University, Purwokerto Campus
Jln. D.I. Panjaitan No. 128, Purwokerto, Banyumas City, 53147, Indonesia
Email: yesydr@telkomuniversity.ac.id
1. INTRODUCTION
Hate speech is often driven by strong negative emotions, such as hatred or anger, which can trigger social conflicts and escalate tensions between individuals or groups [1], [2]. In this context, speech emotion recognition (SER) emerges as a technology capable of detecting and interpreting emotions from human speech. By identifying negative emotions such as anger in speech, SER can be utilized to support online content moderation, enhance hate speech detection systems, and analyze social interactions to prevent conflict escalation.
SER was first introduced by Rosalind W. Picard and her team at the MIT Media Laboratory in the early 2000s [3]. Since then, this field has experienced rapid growth, with advancements in feature extraction, classification models, and multimodal approaches. Over the past decade, the rise of deep learning and the availability of larger speech datasets have significantly improved the accuracy of SER systems. Today, this technology is widely applied in various domains, including healthcare, human-computer interaction, and digital security.
Journal homepage: http://ijai.iaescore.com
This study aims to provide a bibliometric analysis of 68 articles on SER published between 2020 and early 2024. This timeframe was selected due to the increasing number of publications in recent years, reflecting the growing interest in SER research, particularly following the COVID-19 pandemic, which accelerated the adoption of voice-based technologies in communication and emotion analysis. The analysis is conducted by collecting data from the Scopus database, covering key trends in SER, methodological developments, and research collaborations among scholars from various countries.
Bibliometrics is a statistical analysis technique used to understand the historical development of a scientific field [4]. This method helps uncover collaboration patterns in multidisciplinary research [5], identify trends in scientific publications, and analyze inter-article relationships [6]. Additionally, bibliometric analysis enables the evaluation of research impact and the mapping of scientific structures using various statistical indicators [7]. Research collaboration tends to enhance the influence of a study compared to individual research efforts [8], particularly when it involves multiple relevant disciplines.
SER has become a rapidly evolving research field, employing various approaches to recognize emotions in human speech [9]. In several studies, SER has been applied in sentiment analysis [8] and human-computer interaction [10], helping to identify dominant topics in scientific publications. Moreover, its application in speech and video data analysis demonstrates significant potential for understanding emotional dynamics across different contexts.
Although SER research has seen substantial growth in the past five years, several key challenges remain. These include difficulties in collecting and analyzing accurate speech data, the complexity of understanding and interpreting human emotions through speech, and limitations in handling linguistic and cultural variations. This study aims to identify SER's contributions and impacts on other fields while highlighting areas that require further exploration.
One of the primary challenges in this study is ensuring the accuracy and consistency of the data collected from various sources. A rigorous data-cleaning process and manual review are necessary to ensure that the analyzed articles are relevant and of high quality. Additionally, the complexity of interpreting bibliometric analysis results presents another challenge. While SER has been widely explored, bibliometric studies specifically mapping the distribution of classification models, interdisciplinary collaborations, and cross-cultural gaps in emotion recognition remain limited. This study seeks to fill these gaps by providing a comprehensive overview of trends, collaborations, and underexplored areas in SER literature between 2020 and 2024.
This research advances existing methodologies by integrating SER models with a comprehensive review of relevant literature. Data collection is based on titles, abstracts, and the full content of selected articles, followed by a manual review to ensure relevance to the research topic. The main objectives of this study are: i) to provide an overview of SER research trends using the 5W+1H approach (what, who, when, why, where, and how); and ii) to identify potential research subtopics that warrant further exploration.
The structure of this article is organized as follows. Section 2 explains the data collection methodology. Section 3 presents the research findings related to SER. Lastly, section 4 provides the study's conclusions.
2. METHOD
Scientific articles are obtained from the Scopus database, where some articles can be accessed directly, while others are closed access. The availability of access to scientific articles can influence the ease of obtaining relevant information and literature. Additionally, access-restricted articles may require extra effort to obtain in full, such as through an institutional library or database subscription service. In the context of academic research, it is important to consider the available information sources and the access methods that can be used to optimize the use of existing information resources.
Data collection was a crucial step in this research. To ensure that the study can be replicated, a well-defined query strategy was implemented, which had been tested for effectiveness in retrieving relevant articles from the Scopus database. This structured approach helped obtain consistent and high-quality data for bibliometric analysis. Moreover, the user-friendly interface of the Scopus database facilitated the data retrieval process, enabling the researchers to focus more on data interpretation and analysis.
The article selection and data analysis were primarily carried out using spreadsheet software, which allowed for efficient organization, filtering, and summarization of the dataset. This approach was chosen to maintain flexibility in the review process and to adapt to the evolving nature of the research.
The article search was conducted using an advanced search query applied to titles, abstracts, and keywords, with additional filters based on publication year, article source, publication stage, and document type. The last data collection was performed on February 2, 2024, resulting in an initial set of 80 articles.
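The filters just described can be composed into a single Scopus advanced-search string. The sketch below is a hypothetical reconstruction in Python: the query actually used in this study is the one shown in Figure 2, and the phrase and field values here are illustrative assumptions, not the authors' exact string.

```python
# Hypothetical reconstruction of a Scopus advanced-search string with the
# filter types described above (title/abstract/keyword phrase, publication
# year, source type, document type). The real query appears in Figure 2.
def build_scopus_query(phrase, year_from, year_to):
    parts = [
        'TITLE-ABS-KEY("{}")'.format(phrase),  # search title, abstract, keywords
        "PUBYEAR > {}".format(year_from - 1),  # inclusive lower bound
        "PUBYEAR < {}".format(year_to + 1),    # inclusive upper bound
        "SRCTYPE(j)",                          # journal sources only
        "DOCTYPE(ar)",                         # final journal articles
    ]
    return " AND ".join(parts)

print(build_scopus_query("speech emotion recognition", 2020, 2024))
```

Because year bounds in Scopus syntax are exclusive, the helper widens them by one on each side so the requested range is covered.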
Since these articles originated from diverse sources, a rigorous data-cleaning process was undertaken to remove duplicates and inconsistencies. The names of authors, publishers, journals, and research funders were cross-checked for duplication. The workflow for data collection and analysis is illustrated in Figure 1 (data collection flowchart) and Figure 2 (query instruction). The data collection steps were carried out as follows:
– Search by query: articles were retrieved using an advanced search query applied to title, abstract, and keywords, with filters based on publication year, article source, publication stage, and document type.
– Data cleaning: the names of authors, publishers, journals, and funding institutions were checked to eliminate duplication.
– Article numbering: each article was assigned a unique identification number to facilitate tracking during the review and analysis process.
– Manual review: a detailed manual review of each article was conducted by a team of three independent reviewers to ensure relevance to the research topic, verify the adequacy of the information presented, and assess the quality of the journal. Discrepancies in the selection of articles were resolved through discussion.
– Reviewed summary: compile a summary of each article that has been reviewed to provide reference material for writing a literature review.
– Nomenclature: compile a nomenclature, or list of terms used in the articles, to facilitate readers' understanding.
– Dataset source: compile the dataset source, or list of data sources, in the articles.
– List document: compile a list of articles to be reviewed, based on certain criteria, to serve as a basis for compiling a literature review.
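The data-cleaning step above was performed in spreadsheet software; as a rough programmatic equivalent, the sketch below deduplicates bibliographic records after normalizing the name fields. It is a minimal illustration with made-up records, assuming that lowercasing and whitespace collapsing are enough to flag the duplicates in question.

```python
import re

def normalize(name):
    """Canonicalize an author/journal/funder name for duplicate checks:
    lowercase, trim, collapse internal whitespace, drop a trailing period."""
    name = re.sub(r"\s+", " ", name.strip().lower())
    return name.rstrip(".")

def dedupe(records, keys=("title", "authors")):
    """Drop records whose normalized key fields repeat an earlier entry."""
    seen, unique = set(), []
    for rec in records:
        sig = tuple(normalize(rec[k]) for k in keys)
        if sig not in seen:
            seen.add(sig)
            unique.append(rec)
    return unique

# Made-up records for illustration only.
articles = [
    {"title": "SER with CNN", "authors": "A. Author"},
    {"title": "SER  with CNN ", "authors": "a. author"},   # near-duplicate
    {"title": "MFCC features for SER", "authors": "B. Writer"},
]
print(len(dedupe(articles)))  # 2
```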
Figure 1. The flowchart for collecting data
Figure 2. The query for collecting data
From the 80 initially retrieved articles, 68 (85%) were deemed relevant and included in the final dataset. The selection process was carried out through a structured screening procedure to ensure that only the most pertinent and impactful studies were retained. This process involved a thorough review of each article's abstract, keywords, and full text when necessary to determine its suitability for inclusion. The selection process was based on the following criteria:
– Relevance to SER: articles that explicitly discuss SER methodologies, datasets, or applications were prioritized.
– Citation impact: articles with significant citations (when available) were given preference to ensure academic influence.
– Publication year: articles published between 2020 and early 2024 were included to reflect recent developments in SER research.
The bibliometric analysis revealed a significant increase in SER-related research over the past five years, indicating growing interest and impact in this domain. However, the number of publications alone does not necessarily correlate with high citation counts. Some journals with a high number of publications had relatively low citation averages, while others with fewer articles had a substantial citation impact.
Table 1 shows the summary per provenance as a result of that query. Based on it, there are variations in the number of articles published and the average citations per year for each journal. In general, journals that publish more articles tend to have a higher average of citations per year. However, the correlation between the number of articles and the average citations per year does not always hold. For instance, the Multimedia Tools and Applications journal, despite having a high share of articles (13.24%), has a relatively low average of citations per year (8 citations per year); the International Journal of Advanced Computer Science and Applications, with only 2.94% of the 68 articles, has a very high average of citations per year (59 citations per year). This suggests that other factors, such as article quality, research novelty, and journal indexing, also contribute to citation impact.
Table 1. The summary per provenance
Journal | Number of articles | Percentage (%) | Average number of citations/year
Multimedia Tools and Applications [3]–[7], [9]–[12] | 9 | 13.24 | 8 (8 citations)
IEEE Access [13]–[19] | 7 | 10.29 | 20 (20 citations)
Applied Acoustics [20]–[24] | 5 | 7.35 | 39.8 (40 citations)
International Journal of Speech Technology [25]–[27] | 3 | 4.41 | 7.4 (7 citations)
Journal of Supercomputing [28]–[30] | 3 | 4.41 | 1.4 (1 citation)
Signal, Image and Video Processing [31], [32] | 2 | 2.94 | 0.8 (1 citation)
Electronics (Switzerland) [33], [34] | 2 | 2.94 | 2.8 (3 citations)
Sensors (Switzerland) [35], [36] | 2 | 2.94 | 1.4 (1 citation)
IEEE/ACM Transactions on Audio Speech and Language Processing [37], [38] | 2 | 2.94 | 2.2 (2 citations)
Journal of Ambient Intelligence and Humanized Computing [39], [40] | 2 | 2.94 | 4.4 (4 citations)
International Journal of Advanced Computer Science and Applications [41], [42] | 2 | 2.94 | 59.2 (59 citations)
The journals that have only 1 article [8], [43]–[70] | 29 | 42.65 | 2.83 (3 citations)
Total | 68 | 100 | -
The review technique, carried out by applying the 5W+1H concept (what, who, where, when, why, and how), proceeds as follows:
– What: data sources used, features, and types of emotions. Information on the data sources used in a study can be obtained from the data/material section, while the method for extracting voice characteristics/features and determining emotions as output is obtained from the method section.
– Where: country of origin of the main researcher and correspondent. Information on the author's country of origin can be obtained on the first page, commonly below the author's name.
– When: year of publication. Information regarding the year of publication of the article can be obtained from the first page. It is generally placed before the abstract; it states the time of submission, revision time, time of acceptance, and time of publication of the article in the journal.
– Who: research funding agent. Information about the institution that funded the research was obtained from the acknowledgment section, where researchers express their thanks. Some articles did not mention the institution that provided the research funding, which could mean that the research was funded independently.
– Why: the root of the problem. Reviewing the root of the problem is carried out by observing the background, including the problem formulation defined by the researcher.
– How: classifier model used. An overview of the classifier models used by researchers can be found in the method section. Several researchers compared various classifier methods; others applied a single method but refined the model's architectural configuration.
3. RESULTS AND DISCUSSION
After conducting a thorough review of the literature, the researchers decided to include only 68 articles in the final analysis, revealing the results of the processed data. This selection provides a comprehensive overview of various aspects of recognizing emotions in speech and offers valuable insights into the current state of research in this field.
3.1. What
In the review of the articles, four key aspects were identified as crucial components within the "What" category of SER. These aspects include the preprocessing stages used in the analysis, the data sources employed for training the models, the types of features extracted from the speech signals, and the emotions that serve as the target or class for emotion detection in speech. These four factors play an essential role in shaping the methodologies used to recognize and classify emotions in speech and have a significant impact on the accuracy and applicability of the models.
Figure 3 presents a detailed map of the data, illustrating the distribution of articles that discuss each of these four critical aspects. The map categorizes the number of articles into eight distinct ranges based on the frequency with which each aspect is covered. These ranges are as follows: fewer than 6 articles, 6-10 articles, 11-20 articles, 21-30 articles, 31-40 articles, 41-50 articles, 51-60 articles, and more than 60 articles. This classification allows for a better understanding of which aspects of emotion recognition in speech are most frequently addressed in the literature, highlighting the areas of the field that are receiving the most attention and those that may require further exploration.
Figure 3. Distribution of review data based on the concepts of ‘What’
3.1.1. Preprocessing
Preprocessing, or the preprocessing stage, is a critical step in processing speech signals for the recognition of emotions in speech. This stage aims to improve the quality of the sound signal before further analysis is carried out. In this research, preprocessing covers three categories: silence removal, noise removal, and unspecified. The results of the analysis show that only 7 articles involve both silence and noise removal processes [9], [23], [25], [36], [39], [57], [61]; studies examining silence removal only number 2 articles [21], [65]; and others focusing on noise removal only number 23 articles [3], [6], [7], [11]-[13], [15], [16], [20], [30], [32], [40], [42], [46], [48]-[55], [66], [69]. However, most of the articles (32 articles) did not specifically mention the preprocessing steps they used.
There is still variation in the preprocessing approaches used in SER research, and most researchers did not provide specific details about the preprocessing steps they undertook. The main challenge in this stage is to ensure that the resulting sound signal is free from interference and ready for further analysis. Therefore, further research needs to explore various preprocessing methods that can improve the quality of speech signals and the accuracy of emotion recognition in speech.
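Silence removal, the first of the preprocessing categories above, can be sketched without any audio toolkit as a simple energy gate over fixed-length frames. The frame length, threshold, and synthetic signal below are illustrative assumptions; published SER pipelines typically use dedicated libraries and tuned parameters.

```python
def remove_silence(samples, frame_len=4, threshold=0.01):
    """Energy-based silence removal: keep only frames whose mean squared
    amplitude exceeds the threshold."""
    voiced = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold:
            voiced.extend(frame)
    return voiced

signal = ([0.0] * 4 +               # leading silence
          [0.5, -0.4, 0.6, -0.5] +  # voiced frame
          [0.0] * 4)                # trailing silence
print(remove_silence(signal))       # [0.5, -0.4, 0.6, -0.5]
```

Noise removal works on the same frame-by-frame principle but subtracts an estimated noise profile instead of dropping frames outright.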
3.1.2. Data sources
Data sources are an important component in research into emotion recognition in speech, as the quality and representativeness of the data can have a major impact on the results of the analysis. In this research, there are variations in the data sources used by researchers. The Berlin database of emotional speech (EMO-DB) is the most commonly used data source, with 31-40 articles using data from it [4]-[7], [9]-[14], [18], [20]-[23], [26], [28], [31], [32], [34], [35], [37], [38], [42], [45], [49], [50], [52], [56]-[60], [63]-[65]. There are also other popular data sources, such as the interactive emotional dyadic motion capture (IEMOCAP) database, used by 21-30 articles [8], [11], [14], [15], [17]-[22], [28], [29], [33], [35]-[38], [43], [49], [52], [54]-[56], [58], [59], [62], [65], [68], and the Ryerson audio-visual database of emotional speech and song (RAVDESS) [5], [6], [8], [10], [11], [23], [27], [30], [31], [35]-[37], [39], [40], [42], [43], [45], [52], [53], [58], [59], [64], [65], [69], [70]. The Surrey audio-visual expressed emotion (SAVEE) database, a fairly common data source, is used in 11-20 articles [3], [10], [13], [18], [21], [23], [31], [34], [35], [39], [42], [45], [50]-[53], [58], [60], [63], [69]. Meanwhile, the Toronto emotional speech set (TESS) is a source in fewer than 11 articles [3], [6], [25], [37], [38], [40], [53], [69]. In addition to these main data sources, there are also other data sources used by fewer than 6 articles, which are included in the "Others" category.
The variations in data sources indicate that researchers have diverse choices in selecting data for their research. This also shows the importance of having good access to a variety of relevant data sources to ensure the representativeness of research results. In the context of SER research, it is important to select data sources that are appropriate to the research objectives and capable of representing a variety of different emotional states.
Parameters that can influence data quality include the distance from the recorder to the respondent, the specifications of the equipment used, the duration of the recording, and the significance of the emotions expressed by the respondents.
Despite the frequent use of well-known datasets such as EMO-DB and IEMOCAP, this analysis reveals a lack of diversity in the selection of data sources, particularly those that capture spontaneous emotional expressions or represent non-Western cultural contexts. This suggests a research gap in cross-cultural emotional representation and real-world data variability, which may limit the generalizability of current SER models. By identifying this gap through bibliometric mapping, this study encourages future research to explore and develop more inclusive, diverse, and naturalistic datasets to enhance the robustness of SER systems.
3.1.3. Features
The features used in speech analysis play an important role in the recognition of emotions in speech. In this research, the mel-frequency cepstral coefficients (MFCC) feature is the most commonly used feature, with more than 41 articles using it [3], [4], [6], [7], [9], [10], [12], [13], [15], [16], [19], [21]-[23], [25]-[30], [34], [36], [38]-[40], [42], [43], [45], [46], [48], [49], [51]-[53], [57], [60], [61], [63], [67], [68], [70]. Besides, pitch is also a popular feature, found in 12 articles [6], [7], [9], [14], [21], [25], [27], [29], [34], [46], [68], [70]. In addition, there are several other features used by 6-10 articles, including mel-spectrogram [3], [5], [10], [46], [48], [51], [54], [58], linear predictive coding (LPC) [6], [9], [13], [26], [29], [40], [61], formant [9], [14], [27], [46], [57], [59], energy [6], [9], [29], [46], [51], and chroma [25], [28], [46], [48], [51], [61]. These features reflect variations in the speech analysis approaches used to identify emotional patterns in speech. Apart from these main features, there are also other features used in fewer than six articles, which fall into the "Others" category.
The variation shows that researchers have applied varied approaches in analyzing sound signals for emotion recognition, with each feature having its advantages and disadvantages. Therefore, selecting appropriate features is a critical step in the development of an effective emotion recognition system. As in previous research, the use of the dominant weight normalization feature selection algorithm also has an influence on the level of accuracy, showing sufficient performance with a relatively small amount of data. That research shows that with 300 data points it is able to reach an accuracy rate of 86%, so this algorithm can be considered for use in developing SER research [71].
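Of the features listed above, pitch is among the simplest to illustrate. The sketch below is a minimal, framework-free autocorrelation pitch estimator run on a synthetic tone; the lag bounds, signal length, and tone frequency are illustrative assumptions, and real SER pipelines extract MFCC, pitch, and the other features with dedicated audio toolkits.

```python
import math

def estimate_pitch(samples, sample_rate, min_lag=20):
    """Estimate the fundamental frequency (pitch) of a voiced frame by
    locating the strongest autocorrelation peak at a positive lag."""
    n = len(samples)
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, n // 2):
        # Correlation of the signal with itself shifted by `lag` samples;
        # it peaks when the lag matches the pitch period.
        corr = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag

sr = 8000                                                          # sample rate (Hz)
tone = [math.sin(2 * math.pi * 200 * t / sr) for t in range(800)]  # 200 Hz tone
print(round(estimate_pitch(tone, sr)))                             # 200
```

The 200 Hz tone has a period of 40 samples at 8 kHz, so the correlation peak lands at lag 40 and the estimator returns 8000/40 = 200 Hz.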
3.1.4. Emotions
Analysis of emotions in speech involves identifying the different types of emotions that can be expressed through sound. In this study, the emotion "Happy" was found to be the most commonly investigated emotion, sometimes referred to as "Joy/Joyful", appearing in 62 articles (91.18%). In addition, the emotions "Angry" (60 articles, 88.24%), "Sad" (60 articles, 88.24%), and "Neutral" (59 articles, 86.76%) are chosen almost as often. The emotions "Fear" (47 articles, 69.12%) and "Disgust" (46 articles, 67.65%) are also quite commonly used. Apart from that, there were "Surprise" emotions in 36 articles (52.94%) and "Boredom" emotions in 24 articles (35.29%). In addition to these, there are others used by fewer than 11 articles, grouped in the category "Others".
The variation in the types of emotions shows that researchers have varied interests in understanding and identifying different types of emotions in speech. Research has classified these emotions into 3 types, namely positive, negative, and neutral [35]. It also reflects the complexity of human emotional expression and the challenges in developing systems capable of recognizing emotions with high accuracy in a variety of contexts and situations.
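The three-way grouping into positive, negative, and neutral classes reported in [35] can be sketched as a simple lookup table. The polarity assigned to each individual label below (for example, placing "surprise" in the positive class) is an illustrative assumption, not the exact mapping used in [35].

```python
# Illustrative polarity lookup for the coarse three-class scheme of [35];
# the per-label assignments here are assumptions for demonstration.
POLARITY = {
    "happy": "positive", "surprise": "positive",
    "angry": "negative", "sad": "negative", "fear": "negative",
    "disgust": "negative", "boredom": "negative",
    "neutral": "neutral",
}

def group_emotions(labels):
    """Map fine-grained emotion labels onto the three-class scheme."""
    return [POLARITY[label.lower()] for label in labels]

print(group_emotions(["Happy", "Angry", "Neutral"]))
```

Collapsing fine-grained labels this way trades emotional nuance for larger, more balanced classes, which is one reason some studies adopt it.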
3.2. Where
The discussion regarding "Where" stems from an in-depth analysis of the countries of origin and the institutions affiliated with the first author and the corresponding author, as illustrated in Figure 4. This aspect of the research is crucial for understanding the geographical and institutional spread of the studies on SER. By examining whether the first and corresponding authors come from the same or different countries, we can gain insights into the international collaboration patterns within the field of emotion detection in speech. A significant number of articles authored by researchers from multiple countries suggests a broad, global network of researchers engaged in voice emotion identification. Conversely, studies with authors from a single country or institution may indicate more localized research efforts.
Figure 4 shows the countries, with more than three contributing authors, whose researchers have published on SER either as first author or corresponding author. Others come from Taiwan, Turkey, Pakistan, Australia, Indonesia, Japan, Egypt, France, Iraq, Italy, Kazakhstan, London, Portugal, Saudi Arabia, Vietnam, Bhutan, and Malaysia.
Figure 4. Top 4 countries by number of authors
Moreover, this geographical analysis indicates the level of global interest and involvement in SER research, reflecting how research in this domain is distributed across different regions. It also allows for the identification of leading countries or institutions that are driving innovation and contributing to advancements in this field. Understanding the "Where" thus highlights not only the scope of international collaboration but also the potential for future networking opportunities and the sharing of knowledge across borders.
3.2.1. The first author
The first author of a study often reflects the institution or country where the research was conducted. In this study, the first authors came from countries around the world. The countries contributing the most first authors are India, with 27 articles [3], [5]-[8], [10], [11], [13], [20], [23], [25], [27], [30], [31], [39], [40], [43], [46], [48], [50], [52], [53], [56], [60], [65], [66], [70], and China, with 17 articles [15], [16], [19], [32], [33], [34], [38], [44], [45], [49], [54], [55], [59], [61], [62], [67], [68]. Apart from that, there are several other contributing countries with far fewer articles, such as Iran (4 articles) [9], [21], [28], [57], South Korea (2 articles) [29], [36], Pakistan (2 articles) [33], [39], Taiwan (2 articles) [14], [58], and Turkey (2 articles) [22], [24], and several other countries with one article each, including Egypt, France, Indonesia, Iraq, Italy, Japan, Kazakhstan, London, Portugal, Saudi Arabia, and Vietnam. Among the authors, Bhanusree Yalamanchili from India was the most active, publishing 3 articles over the last 5 years as first author.
3.2.2. The corresponding author
Corresponding authors often have an important role in research, especially in terms of communication with journal editors and other researchers. In this research, they come from various countries of origin, and a pattern similar to the first-author trend is seen. Again, India is the country with the most contributions by corresponding author, with 25 articles [3], [5]-[8], [10], [11], [20], [23], [25], [27], [31], [39], [40], [43], [46], [48], [50], [52], [53], [56], [60], [65], [66], [70], followed by China with 16 articles [15], [16], [19], [30], [32]-[34], [38], [44], [45], [49], [54], [59], [61], [67], [68]. Apart from that, several other countries also contribute, such as Iran (4 articles) [9], [21], [28], [57], South Korea (3 articles) [29], [35], [36], Australia (2 articles) [12], [62], Indonesia (2 articles) [26], [42], Japan (2 articles) [17], [55], Taiwan (2 articles) [14], [58], and Turkey (2 articles) [22], [24], while several other countries have only one article each, such as Bhutan, Egypt, France, Iraq, Italy, Kazakhstan, London, Malaysia, Pakistan, and Portugal.
Among the corresponding authors, 43 are also the first authors. This shows that they have a significant role in the research carried out, both as the main initiator and as the person responsible for communication and coordination with other parties, such as journal editors and other researchers. It also shows the high level of involvement and contribution of these researchers in the development and dissemination of knowledge in the field of SER.
3.3. When
The distribution of articles about SER by year shows an interesting trend over the last five years, as shown in Figure 5. In 2020, 14 articles [12]-[14], [20], [21], [26], [29], [35], [36], [41], [54], [57], [61], [62] were published, indicating a moderate level of research activity in this area. The following year, in 2021, the number of articles decreased slightly to 11 [7], [15], [22], [23], [30], [33], [34], [42], [58], [59], [64], showing a temporary dip in research output. A clear increase is seen in 2022, with 17 articles [10], [11], [17], [24], [27], [32], [40], [43], [46], [51], [53], [55], [56], [62], [63], [65], [67], showing renewed interest in SER research. This trend continues in 2023, with the highest number of articles, reaching 22 [3]-[5], [8], [9], [16], [18], [19], [25], [28], [37]-[39], [44], [45], [48], [50], [52], [60], [66], [68], [70], indicating continued growth in research activity and perhaps also the maturity of the field.
Figure 5. Number of articles published by year
As of February 2024, four articles [6], [47], [50], [69] had already appeared, showing that research in this field remains sustainable; the apparent dramatic downturn reflects only that the 2024 count covers the first two months of the year. During this period, some articles began discussing calm in SER, and it is possible that by the end of 2024 there will be a significant rise compared to 2023. The last five years thus show a clear upward trend in the number of SER publications. It indicates that emotion recognition in speech remains a relevant and interesting topic, and further research can be expected to be conducted to expand understanding of the technologies that can be used to detect and interpret human emotions.
3.4. Who
An analysis of funding sources for research into emotion recognition in speech shows the diverse origins of the funds used to support this work. This diversity is reflected in several patterns identified in the dataset:
– A single institution funds a single study: thirteen studies in total [12], [13], [17], [18], [20], [29], [32], [37], [45], [62]-[64], [68];
– A single funding institution supports many studies, such as the National Natural Science Foundation of China (NSFC), which supported seven projects [32], [38], [45], [54], [55], [59], [61];
– A study is funded by many institutions: studies were funded by five institutions [15], [69], four institutions [61], three institutions [55], and two institutions [35], [58], [68];
– The remaining studies are self-funded.
Funding plays a crucial role in research, with the NSFC reflecting China's commitment. However, many studies do not list funding sources, suggesting a combination of external, internal, or independent funding. Interestingly, the most cited studies are independently funded, particularly those in the journal Multimedia Tools and Applications.
3.5. Why
The analysis of the reasons behind the research methods in emotion recognition in speech is visualized in Figure 6. Most research in this area is driven by three main reasons: classification model selection, feature selection, and implementation in other cases. Some studies (7 articles) discuss not only classification models but also the selection of voice features [3], [4], [17], [32], [44], [52], [70].
Figure 6. Distribution of research reasons in emotion recognition
– Classification model: a total of 26 articles cite the selection of a classification model as the sole reason behind their methods [3], [4], [10], [11], [14], [15], [17], [25]-[27], [32]-[34], [41], [43]-[45], [47]-[49], [52], [54], [60], [66], [68], [70]. This reflects the importance of selecting an appropriate and effective classification model for building an emotion recognition system in speech. Classification models have become the main tool researchers use to classify emotions in speech with high accuracy;
– Feature selection: a total of 39 articles cite feature selection as the sole main reason behind their research methods [3]-[6], [8], [9], [12], [13], [17], [18], [20]-[24], [29]-[32], [35], [37], [39], [40], [44], [46], [50]-[53], [56]-[59], [61], [62], [64], [65], [67], [70]. Appropriate and representative features are essential for building a reliable emotion recognition system. Features such as MFCC, pitch, mel-spectrogram, and others have become the focus of research aimed at extracting from sound signals the important information that can be used to identify emotions;
– Implementation: a total of 7 articles use implementation in other cases as the rationale behind their research methods [16], [28], [38], [42], [55], [63], [69]. This suggests that some researchers have applied their approach in a broader application context, beyond emotion recognition in speech alone. Such an approach may involve the use of emotion recognition technology for purposes such as sentiment analysis, human-computer interaction, or psychological research.
Although most of the research is driven by these reasons, some studies cite motivations beyond these categories [59]. This shows that there is still variation in the motivations and approaches of SER research, and that there is potential for further exploration in developing more innovative methods and techniques.
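To make the feature-selection motivation concrete, the sketch below computes simplified MFCC features from a synthetic signal using only NumPy. It is not taken from any of the surveyed papers; the frame length, hop size, filter count, and the 220 Hz test tone are all illustrative choices, and production work would typically use a dedicated library such as librosa instead.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale from 0 Hz to sr/2.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bin_pts[i - 1], bin_pts[i], bin_pts[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    # Frame the signal, apply a Hamming window, take the power spectrum,
    # pass it through the mel filterbank, log-compress, then apply a DCT-II.
    n_fft = 512
    frames = np.array([signal[s:s + frame_len] * np.hamming(frame_len)
                       for s in range(0, len(signal) - frame_len + 1, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    feats = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # DCT-II basis over the filter axis; keep the first n_ceps coefficients.
    n = feats.shape[1]
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                    2 * np.arange(n) + 1) / (2 * n))
    return feats @ basis.T

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 220 * t)   # 1 s synthetic stand-in for speech
features = mfcc(tone, sr)
print(features.shape)                # (98, 13): 98 frames, 13 coefficients
```

Each row of the result is one 13-dimensional frame-level feature vector, the kind of representation the surveyed studies feed into their classifiers.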
3.6. How
This research notes variation in the use of classifier models to identify emotions in human speech. More than 20 articles use support vector machine (SVM) as the main model, showing its popularity and effectiveness in emotion classification. Meanwhile, around 14 articles employ K-nearest neighbors (KNN) and about 12 articles use 1D convolutional neural networks (CNN), while roughly 5-10 articles each apply approaches such as decision tree (DT), deep neural network (DNN), long short-term memory (LSTM), multilayer perceptron (MLP), and random forest (RF); the details are shown in Table 2. The use of various classifier models reflects an effort to explore different approaches to the challenge of emotion classification in human speech. In addition, several other approaches are used in smaller numbers, demonstrating the diversity of strategies and techniques used in SER research.
Table 2. The summary of model classifiers

Number of articles    Model classifiers
> 20                  SVM
10–15                 KNN, 1D CNN
5–10                  DT, DNN, LSTM, MLP, RF
A clear trend shows that deep learning models, particularly CNN and LSTM, are gaining popularity due to their ability to capture the complexity of speech signals and to outperform traditional models such as SVM, especially on large datasets. These models automatically learn features from raw data, offering better generalization in noisy environments. In contrast, while traditional models such as SVM work well with smaller, structured datasets, they struggle with raw audio data, where deep learning excels. Therefore, deep learning models are becoming more prevalent in SER research due to their higher accuracy and adaptability.
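As a concrete illustration of the classical pipeline these counts describe, the sketch below implements a minimal KNN classifier over feature vectors in NumPy. The data are random toy stand-ins for per-utterance MFCC-style features of two emotion classes; the cluster means, dimensions, and class names are all illustrative, and none of the surveyed papers' actual models or datasets are reproduced here.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    # Majority vote among the k training vectors nearest to the query
    # under Euclidean distance.
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(0)
# Toy stand-ins for 13-dimensional utterance features of two emotion classes,
# drawn around well-separated means so the example is unambiguous.
happy = rng.normal(loc=1.0, scale=0.3, size=(20, 13))
angry = rng.normal(loc=-1.0, scale=0.3, size=(20, 13))
X = np.vstack([happy, angry])
y = np.array(["happy"] * 20 + ["angry"] * 20)

print(knn_predict(X, y, np.full(13, 1.0)))   # "happy"
```

An SVM or CNN would slot into the same train-then-predict shape; the surveyed studies differ mainly in which model fills that slot and which features feed it.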
4. CONCLUSION
This research presents a bibliometric analysis of 68 articles on SER published between 2020 and early 2024. There have been significant developments in SER research in the last five years, with India being the top contributor. The exploration of research topics provides a comprehensive overview of developments and trends in this field. The use of preprocessing techniques, such as silence removal and noise removal, is a main focus. The most commonly used data sources are EmoDB, IEMOCAP, and RAVDESS, while features such as MFCC and pitch are the most frequently used in the analysis. More diverse data sources, including real-world noisy data, can significantly improve SER models. By integrating datasets that reflect real-world conditions, including a broader range of emotional variations and loud environments, SER models can be trained to be more resilient to the challenges faced in everyday situations. This will help address current limitations, such as inconsistent data quality and a lack of emotional diversity in datasets, thereby enhancing the accuracy and generalizability of models in practical applications. Based on these findings, further research is suggested to develop multimodal approaches that integrate acoustic features with non-auditory data, such as facial expressions, body movements, or physiological signals. The combination of multimodal features can capture a more holistic representation of emotions, overcoming the limitations of single-voice-based systems susceptible to environmental noise or ambiguous contexts. The most frequently analyzed emotions are happy, angry, sad, neutral, fear, disgust, and surprise. In terms of classification modeling, SVM is the most widely used model, followed by KNN, 1D CNN, and several other approaches. Overall, this study provides an in-depth understanding of SER research trends and the techniques most commonly used in this analysis. It is recommended to develop more sophisticated pre-processing techniques and classification models that are more