International Journal of Electrical and Computer Engineering (IJECE)
Vol. 7, No. 6, December 2017, pp. 3369 – 3384
ISSN: 2088-8708, DOI: 10.11591/ijece.v7i6.pp3369-3384
Query by Example of Speaker Audio Signals using Power Spectrum and MFCCs

Pafan Doungpaisan (1) and Anirach Mingkhwan (2)
(1) Faculty of Information Technology, King Mongkut's University of Technology North Bangkok, 1518 Pracharat 1 Road, Wongsawang, Bangsue, Bangkok 10800, Thailand
(2) Faculty of Industrial Technology and Management, King Mongkut's University of Technology North Bangkok, 1518 Pracharat 1 Road, Wongsawang, Bangsue, Bangkok 10800, Thailand
Article Info

Article history:
Received: Mar 13, 2017
Revised: Aug 9, 2017
Accepted: Aug 22, 2017

Keywords:
Speaker identification
Acoustic signal processing
Content-based audio retrieval
Speaker recognition
Database query processing
ABSTRACT

"Search engine" is the popular term for an information retrieval (IR) system. Typically, a search engine is based on full-text indexing. Moving from text data to multimedia data types makes information retrieval more complex, as in the retrieval of images or sounds from large databases. This paper introduces the use of language- and text-independent speech as the input query to a large sound database, using a speaker identification algorithm. The method consists of two main processing steps. First, vocal and non-vocal segments are separated; the vocal segments are then used for speaker identification, enabling audio query by speaker voice. For the speaker identification and audio query process, we estimate the similarity between the example signal and the samples in the queried database by calculating the Euclidean distance between their Mel-frequency cepstral coefficients (MFCC) and energy spectrum acoustic features. The simulations show good performance at a sustainable computational cost, with an average accuracy rate above 90%.

Copyright © 2017 Institute of Advanced Engineering and Science. All rights reserved.
1. INTRODUCTION

The Internet has become a major component of everyday social life and business. One of its most important uses is search engine technology. Though people rarely give it a moment's thought, the search engines that help them navigate through the mass of information, web pages, images, video files, and audio recordings found on the World Wide Web have become essential. Search engine technology was developed over 20 years ago [1][2]. It has changed how we get information at school, college, work and home. A search engine is an information retrieval system designed to search for information on the World Wide Web. The results are generally presented as search engine results pages (SERPs). A search engine results page, or SERP, is the web page that appears in a browser window when a keyword query is entered into a search field on a search engine. The information may be a mix of text, web pages, images, video, and other types of files. Some search engines also mine data available in databases, and maintain real-time information by running an algorithm on a web crawler. Information is easy to find when searching by keywords; however, using a search engine to find an image or a sound is much more difficult and complicated.
1.1. Content-based image retrieval or reverse image search engines

Content-based image retrieval engines, or reverse image search engines, are a special kind of search engine where no keyword needs to be entered to find pictures [3][4][5][6]. Instead, the user supplies a picture, and the search engine finds the images most similar to the one entered. Thus, everything one wishes to know can be discovered with the help of a single picture. Practical uses for reverse image search include:
- Searching for duplicated images or content.
- Locating the source information for an image.
- Ensuring compliance with copyright regulations.
- Finding information about unidentified products and other objects.
- Finding information about faked images.
- Finding higher-resolution versions of images.

There are three types of content-based image retrieval or image search engines: search by meta-data, search by example, and hybrid search.
1.1.1. Search by meta-data

Metadata is data that summarizes basic information about an image, which can make finding and working with particular instances of data easier. Author, file size, date created and date modified are all examples of very basic document metadata. Famous search engines such as Google present a text box into which you type your keywords, along with buttons to click such as "Google Search": manually typing in keywords and finding interrelated results. In fact, a meta-data image search engine is only marginally different from the text search engine mentioned above. A search-by-meta-data image search engine rarely examines the actual image itself; instead, it relies on textual clues, which can come from a variety of sources. The two main methods of search by meta-data are manual annotations and contextual hints.
1.1.2. Search image by example image

To search for an image by an example image, we can use Google or TinEye; alternatively, we can build our own search-by-example image search engine. These types of image search engines try to quantify the image itself and are called Content-Based Image Retrieval (CBIR) systems. An example would be to characterize the color of an image by the standard deviation, mean, and skewness of the pixel intensities in the image. Given a dataset of images, we would compute these moments over all images in our dataset and store them on disk; this step is called indexing. When we quantify an image, we describe it by extracting image features. These image features are an abstraction of the image and are used to characterize its content. As a running example, suppose we are building an image search engine for Twitter.
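As an illustration, the color moments just described can be computed in a few lines. This is a minimal sketch, not code from the paper; the function name and the 3-channel image layout are assumptions.

```python
import numpy as np

def color_moments(image):
    """image: H x W x 3 uint8 array -> 9-dimensional feature vector
    (mean, standard deviation, skewness per color channel)."""
    feats = []
    for c in range(3):
        channel = image[:, :, c].astype(np.float64).ravel()
        mean, std = channel.mean(), channel.std()
        # skewness: third standardized moment (0 when the channel is flat)
        skew = ((channel - mean) ** 3).mean() / std ** 3 if std > 0 else 0.0
        feats.extend([mean, std, skew])
    return np.array(feats)
```

Indexing then amounts to storing one such vector per image; querying compares the query image's vector against the stored ones.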
1.1.3. Hybrid approach

An interesting hybrid approach is Twitter. Twitter allows you to include text and images in your tweets, and lets you add hashtags to your own tweets. We can use the hashtags to build a search-by-meta-data image search engine, and then analyze and quantify the image itself to build a search-by-example image search engine. From this concept, we would be building a hybrid image search engine that includes both keywords and hashtags together with features extracted from the images.
1.2. Content-based audio retrieval or audio search engines

Content-based functionalities aim at finding new ways of querying and browsing audio documents, as well as automatically generating metadata, mainly via classification. Query-by-example and similarity measures that allow perceptual browsing of an audio collection are addressed in the literature and exist in commercial products; see for instance www.findsounds.com and www.soundfisher.com. There are three types of content-based audio retrieval: search from text, search from image, and search from audio.
1.2.1. Audio search from text or search by meta-data

Text entered into a search bar by the user is compared with the search engine's database. Matching results are accompanied by a description, or meta-data, of the audio file and its characteristics, such as sample frequency, bit rate, file type, length, duration, or coding type. The user is given the option of downloading the resulting files. Alternatively, keywords can be generated from the analyzed audio by using speech recognition techniques to convert audio to text; these keywords are then used to search for audio files in the database, as in Google Voice Search.
1.2.2. Audio search from image

The Query by Example (QBE) system is a search algorithm that uses content-based image retrieval (CBIR). Keywords are generated from the analyzed image and are used to search for audio files in the database. The results of the search are displayed according to the user's preferences regarding file type, such as wav, mp3, aiff, etc.
1.2.3. Audio search from audio

In audio search from audio, the user must play the audio of a song, either with a music player, by singing, or by humming into the microphone. An audio pattern is then derived from the audio waveform, and a frequency representation is derived from its Discrete Fourier Transform. This pattern is matched against the patterns corresponding to the waveforms of the sound files in the database. All audio files in the database whose patterns are similar to the search pattern are displayed as search results.

The most popular form of audio search from audio is the audio fingerprint [7][8][9]. An audio fingerprint is a content-based compact signature that summarizes an audio file. Audio fingerprinting technologies have recently attracted attention since they allow the monitoring of audio independently of its format and without the need for watermark embedding or meta-data. Audio fingerprinting, also known as audio hashing, is well established as a powerful technique for audio identification and synchronization. Figure 1 describes a model of audio fingerprinting. An audio fingerprint involves two major steps: fingerprint (or voice pattern) design, and matching search. While the first step concerns the derivation of a compact and robust audio signature, the second step usually requires knowledge about databases and quick-search algorithms [10].
Figure 1. State-of-the-art audio fingerprinting algorithms.
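To make the two steps concrete, the sketch below implements a toy fingerprint in the spirit of band-energy-difference hashing. It is an illustration only, not the algorithm of Shazam or of the cited works, and all names and parameter values are assumptions.

```python
import numpy as np

def fingerprint(signal, frame_len=1024, bands=17):
    """Step 1 (signature design): a binary fingerprint from the sign of
    band-energy differences across adjacent bands and adjacent frames."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len) * np.hamming(frame_len)
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    # coarse band energies per frame: shape (frames, bands)
    band_e = np.stack([b.sum(axis=1) for b in np.array_split(spectra, bands, axis=1)], axis=1)
    d = np.diff(band_e, axis=1)          # difference between adjacent bands
    return (d[:-1] - d[1:]) > 0          # ... and between adjacent frames

def similarity(query_fp, db_fp):
    """Step 2 (matching search): fraction of agreeing fingerprint bits."""
    m = min(len(query_fp), len(db_fp))
    return (query_fp[:m] == db_fp[:m]).mean()
```

A real system would index the database fingerprints for sub-linear lookup rather than comparing bit-by-bit against every entry.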
Examples of audio fingerprinting applications are Shazam (http://www.shazam.com/) and SoundHound (http://www.soundhound.com/) [11][12]. A human listener can only identify a piece of music if he has heard it before, unless he has access to more information than just the audio signal. Similarly, fingerprinting systems require previous knowledge of the audio signals in order to identify them, since no information other than the audio signal itself is available to the system in the identification phase. Therefore, a musical knowledge database must be built. This database contains the fingerprints of all the songs the system is supposed to identify. During detection, the fingerprint of the input signal is calculated and a matching algorithm compares it to all fingerprints in the database. The knowledge database must be updated as new songs come out. As the number of songs in the database grows, memory requirements and computational costs also grow; thus, the complexity of the detection process increases with the size of the database.

This technique is useful, but it has its limitations. An audio fingerprint cannot find a live version of a piece of music, because of the different key or tempo. It cannot find a cover version, because of the different instruments. It cannot find a hummed version of a piece of music, because that is a single melody. It is also unable to find music if a singer sings text-independently.
Unfortunately, present audio search engines cannot search for a human voice in a database of speakers by example of spoken audio signals. To address this problem, this paper proposes a method for query by example of spoken audio signals using a speaker identification algorithm.
2. LITERATURE REVIEW (SPEAKER VERIFICATION AND IDENTIFICATION)

Speaker identification is one of the main tasks in speech processing. In addition to identification accuracy, large-scale applications of speaker identification give rise to another challenge: fast search in the database of speakers. In speaker recognition research, there are two different types of speaker recognition [13][14]: speaker verification and speaker identification.
Speaker verification is the process of accepting or rejecting the claimed identity of a speaker. Speaker identification is the process of determining which registered speaker provides a given utterance. In the speaker identification task, the voice of an unknown speaker is analyzed and then compared with speech samples of known speakers. The unknown speaker is identified as the speaker whose model best matches the input.
There are two different types of speaker identification: open-set and closed-set. Open-set identification is similar to a combination of closed-set identification and speaker verification. For example, a closed-set identification may be performed first, and the resulting ID then used to run a speaker verification session. If the test speaker matches the target speaker, based on the ID returned from the closed-set identification, then the ID is accepted and passed back as the true ID of the test speaker. On the other hand, if the verification fails, the speaker may be rejected altogether, with no valid identification result. Closed-set identification is the simpler of the two problems. In closed-set identification, the audio of the test speaker is compared against all the available speaker models and the speaker ID of the model with the closest match is returned. The ID of one of the speakers in the database will always be closest to the audio of the test speaker; there is no rejection scheme.
Speaker verification is the process of verifying the claimed identity of a speaker based on the speech signal from the speaker, called a voiceprint. In speaker verification, the voiceprint of an unknown speaker who claims an identity is compared with a model for the speaker whose identity is being claimed. If the match is good enough, the identity claim is accepted. A high threshold reduces the probability of impostors being accepted by the system, but increases the risk of falsely rejecting valid users. Conversely, a low threshold enables valid users to be accepted consistently, but with the risk of accepting impostors. To set the threshold at the optimal trade-off between impostor acceptance (false acceptance) and customer rejection (false rejection), data showing the score distributions of impostors and customers are needed.
There are two types of speaker verification systems: Text-Independent Speaker Verification (TI-SV) and Text-Dependent Speaker Verification (TD-SV). TD-SV requires the speaker to say exactly the enrolled or given password. TI-SV is the process of verifying the identity without any constraint on the speech content. Compared with TD-SV, it is more convenient because the user can speak freely to the system; however, it requires longer training and testing utterances to achieve good accuracy and performance.
Identification from audio involves many factors, both within the characteristics of the human voice and within the technologies related to sound acquisition. Factors within the human characteristics include the following. The context of a public speaking experience is the environment or situation in which the speech occurs [14]. Each person has a unique communication style, covering pitch, tone or timbre, melody, volume or intensity, and rhythm [15][16][17]. Emotional speech reflects the mood of the speaker, such as angry, sad, fearful or happy. Physiological factors also matter: sometimes people have an illness or are in a state of alcohol or drug influence, which affects the voice [17]. Counterfeiting or voice disguise also occurs: a speaker sometimes changes his or her own voice away from the original, whether higher or lower, or speaks to a rhythm, which influences the characteristics of the sound [18].

Factors within the technologies related to sound acquisition include the following. The quality of the microphone, or of the equipment used to record, greatly affects the quality of the sound; microphones each have different features, frequency responses, and sensitivities to sound from various directions [18][19][20]. The recording environment also matters, including noise from the surroundings and the distance from the microphone to the recorded sound [21][22].
Also relevant are the basics of digital audio: sample rate, bit rate, and how analog signals are represented digitally. In this research, we have worked on text-independent speaker verification. Interesting research on speaker recognition includes the following.
Poignant, J. [23] used an unsupervised approach to identifying speakers in TV broadcasts without biometric models. Existing methods usually use pronounced names as a source of names for identifying the speech clusters provided by a speaker diarisation step, but this source is too imprecise to give sufficient confidence. Existing methods propose two approaches for finding speaker identity based only on names written in the image track. With the late naming approach, different propagations of written names onto clusters are proposed. The second proposition, early naming, modifies the speaker diarisation module by adding constraints that prevent two clusters with different associated written names from being merged. These methods were tested on phase 1 of the REPERE corpus, containing 3 hours of annotated videos. The late naming system reaches an F-measure of 73.1%; early naming improves on this result both in terms of identification error rate and of stability of the clustering stopping criterion. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only provided a 57.2% F-measure.
M. K. Nandwana [24] used an unsupervised approach for detecting human scream vocalizations from continuous recordings in noisy acoustic environments. The proposed detection solution is based on compound segmentation, which employs weighted mean distance, T2-statistics and the Bayesian Information Criterion for the detection of screams. The solution also employs an unsupervised threshold-optimized Combo-SAD for removal of non-vocal noisy segments in a preliminary stage. A total of five noisy environments were simulated, with noise levels ranging from -20 dB to +20 dB. Performance of the proposed system was compared using two alternative acoustic front-end features: (i) Mel-frequency cepstral coefficients (MFCC) and (ii) perceptual minimum variance distortionless response (PMVDR). Evaluation results show that the new scream detection solution works well at clean, +20 and +10 dB SNR levels, with performance declining as SNR decreases to -20 dB across a number of the noise sources considered.
Almaadeed, N. [25] investigated the problem of identifying a speaker from the voice regardless of the content. In this study, the authors designed and implemented a novel text-independent multimodal speaker identification system based on wavelet analysis and neural networks. The system was found to be competitive: it improved the identification rate by 15% compared with the classical MFCC, and reduced the identification time by 40% compared with back-propagation neural networks, Gaussian mixture models and principal component analysis. Performance tests conducted using the GRID database corpora showed that this approach has faster identification time and greater accuracy than traditional approaches, and that it is applicable to real-time, text-independent speaker identification systems.
Xiaojia Zhao [26] investigated the problem of speaker identification and verification in noisy conditions, assuming that speech signals are corrupted by environmental noise. The paper focuses on several issues relating to the implementation of the new model for real-world applications, including the generation of multicondition training data to model noisy speech, the combination of different training data to optimize recognition performance, and the reduction of the model's complexity. The new algorithm was tested using two databases with simulated and realistic noisy speech data. The first database is a redevelopment of the TIMIT database obtained by rerecording the data in the presence of various noise types, used to test the model for speaker identification with a focus on the varieties of noise. The second database is a handheld-device database collected in realistic noisy conditions, used to further validate the model for real-world speaker verification. The new model was compared to baseline systems and found to achieve lower error rates.
Pathak, M.A. and Raj, B. [27] presented frameworks for speaker verification and speaker identification systems in which the system is able to perform the necessary operations without observing the speech input provided by the user. The paper formalizes the privacy criteria for the speaker verification and speaker identification problems and constructs Gaussian mixture model-based protocols. The authors also report experiments with a prototype implementation of the protocols on a standardized dataset, measuring execution time and accuracy.
Bhardwaj, S. [28] presented three novel methods for speaker identification, two of which utilize both the continuous-density hidden Markov model (HMM) and the generalized fuzzy model (GFM), which has the advantages of both the Mamdani and Takagi-Sugeno models. In the first method, the HMM is utilized for the extraction of a shape-based batch feature vector that is fitted with the GFM to identify the speaker. The second method makes use of the Gaussian mixture model (GMM) and the GFM for the identification of speakers. Finally, the third method is inspired by the way humans cash in on mutual acquaintances
while identifying a speaker. To assess the validity of the proposed models [HMM-GFM, GMM-GFM, and HMM-GFM (fusion)] in a real-life scenario, they were tested on the VoxForge speech corpus and on a subset of the 2003 National Institute of Standards and Technology evaluation data set. These models were also evaluated on the corrupted VoxForge speech corpus, obtained by mixing in different types of noisy signals at different signal-to-noise ratios, and their performance was found to be superior to that of the well-known models.
Abrham Debasu Mengistu and Dagnachew Melesew Alemayehu [29] presented an implementation of text-independent Amharic-language speaker identification using VQ (Vector Quantization), GMM (Gaussian Mixture Models), BPNN (back-propagation neural networks), MFCC (Mel-frequency cepstrum coefficients), and GFCC (Gammatone Frequency Cepstral Coefficients). For the identification process, speech signals were collected from different speakers of both sexes; in total, speech samples of 90 speakers were collected, with 10 seconds of speech from each individual. On these speakers, accuracies of 59.2%, 70.9% and 84.7% were achieved when VQ, GMM and BPNN, respectively, were used on the combined feature vector of MFCC and GFCC.
Wajdi Ghezaiel, Amel Ben Slimane and Ezzedine Ben Braiek [30] proposed extracting minimally corrupted speech that is considered useful for various speech processing systems. Their paper is concerned with co-channel speaker identification (SID). It employs a newly proposed usable-speech extraction method based on pitch information obtained from linear multi-scale decomposition by the discrete wavelet transform. The idea is to retain the speech segments that have only one detected pitch and remove the others. The detected usable speech was used as input to a speaker identification system. The system was evaluated on co-channel speech, and the results show a significant improvement across various target-to-interferer ratios (TIR) for the speaker identification system.
Syeiva Nurul Desylvia [31] presented an implementation of text-independent speaker identification. In this research, text-independent speaker identification with Indonesian speaker data was modelled with Vector Quantization (VQ). VQ with K-Means initialization was used; K-Means clustering was used to initialize the means, and Hierarchical Agglomerative Clustering was used to identify the K value for VQ. The best VQ accuracy was 59.67%, when k was 5. According to this result, the Indonesian language can be modelled by VQ. The research could be developed further using optimization methods for the VQ parameters, such as Genetic Algorithms or Particle Swarm Optimization.
Hery Heryanto, Saiful Akbar and Benhard Sitohang [32] presented a new direct access strategy for speaker identification systems. DAMClass is a direct access method that speeds up the identification process without drastically decreasing the identification rate. The method uses a speaker classification strategy based on original characteristics of the human voice, such as pitch, flatness, brightness, and rolloff. DAMClass decomposes the available dataset into smaller sub-datasets in the form of classes or buckets based on the similarity of the speakers' original characteristics. It builds a speaker dataset index based on range-based indexing with a direct access facility, and uses nearest-neighbor search, range-based searching and multiclass-SVM mapping as its access methods. Experiments show that the direct access strategy with the multiclass-SVM algorithm outperforms the indexing accuracy of range-based indexing and nearest-neighbor search by one to nine percent. DAMClass is shown to speed up the identification process to sixteen times faster than the sequential access method, with 91.05% indexing accuracy.
3. RESEARCH METHOD

This paper presents an audio search engine that can retrieve sound files from a large file system based on similarity to a query sound. Sounds are characterized by speech templates derived from MFCCs and the power spectrum. Audio similarity can be measured by comparing templates, which works both for simple sounds and for complex audio such as music.

Development in speech technology [33][13] has been inspired by the desire to develop mechanical models that permit the emulation of human verbal communication capabilities. Speech processing allows computers to follow voice commands and handle different human languages. Relevant tasks include, for example, source identification, automatic speech recognition, automatic music transcription, labeling/classification/tagging, music/speech/environmental sound segmentation, and sentiment/emotion recognition, together with common machine learning techniques applied in related fields such as image processing and natural language processing. Figure 2 describes a model of an audio recognition system, representing the different stages of a system: pre-processing, feature extraction, classification and the language model [13].
The pre-processing stage transforms the input signal before any information can be extracted at the feature extraction stage.
Figure 2. State-of-the-art audio classification.
Feature vectors must be robust to noise for better accuracy [34][35]. Feature extraction [35][36][37] is the most important part of a recognizer. If the features are ideally good, the type of classification architecture will not have much importance. Conversely, if the features cannot discriminate between the classes concerned, no classifier will be efficient, however advanced it may be.
In practical situations, features always present some degree of overlap from one class to another; therefore, it is worth using good, well-adapted classification architectures. Features extracted for classification include linear prediction coefficients (LPC), cepstral coefficients, Mel-frequency cepstral coefficients (MFCC), cepstral mean subtraction (CMS) and the post-filtered cepstrum (PFL).
The classification stage performs recognition using the extracted features and a language model, where the language model contains the syntax of the language and helps the classifier to recognize the input.
In pattern classification problems, the goal is to discriminate between features representing different classes of interest. Based on learning behavior, classifiers can be divided into two groups: classifiers that use supervised learning (supervised classification) and those that use unsupervised learning (unsupervised classification).
In supervised classification, we provide examples of the correct classification, i.e. a feature vector along with its correct class, to teach the classifier. Based on these examples, commonly termed training samples, the classifier then learns how to assign an unseen feature vector to the correct class. Examples of supervised classification include the Hidden Markov Model (HMM), Gaussian Mixture Models (GMM), K-Nearest Neighbor (k-NN), the Support Vector Machine (SVM), Artificial Neural Networks (ANN), Bayesian Networks (BN) and Dynamic Time Warping (DTW) [38][39][40][41][42][43].
In unsupervised classification, or clustering [38], there is neither an explicit teacher nor training samples. The classification of the feature vectors must be based on the similarity between them, according to which they are divided into natural groupings. Whether any two feature vectors are similar depends on the application. Obviously, unsupervised classification is a more difficult problem than supervised classification, and supervised classification is the preferable option when it is possible. In some cases, however, it is necessary to use unsupervised learning: for example, when the feature vector describing an object can be expected to change with time. Examples of unsupervised classification include k-means clustering, Self-Organizing Maps (SOM), and Learning Vector Quantization (LVQ) [43][44][45].
Classifiers can also be grouped by reasoning process into probabilistic and deterministic classifiers. Deterministic reasoning classifiers classify sensed data into distinct states and produce a distinct output that cannot be uncertain or disputable. Probabilistic reasoning, on the other hand, considers sensed data to be uncertain input and thus outputs multiple contextual states with associated degrees of truthfulness or probabilities. The decision on the class to which a feature belongs is made based on the highest probability.
Due to the limitations of the audio fingerprint concept, audio fingerprints cannot be used to find audio files when a speaker speaks text-independently. Given these technical limitations, the lack of flexibility in searching for audio information, and the inability to apply fingerprints to other types of search such as voice search, this research is also interested in applying the speaker identification concept to a speaker voice retrieval system. The system operates as follows.
3.1. Feature extraction

Feature extraction is the process of computing a compact numerical representation that can be used to characterize a segment of audio. This research uses Mel-Frequency Cepstral Coefficients analysis, based on the Discrete Fourier Transform (DFT), together with the energy spectrum, as shown in Figure 3. The use of MFCC coefficients is common in automatic speech recognition (ASR), where 10-13 coefficients are often considered sufficient for coding speech [38].
A subjective pitch is present on the Mel-frequency scale to capture important phonetic characteristics of speech. MFCC [38][39] is based on human hearing perception, which does not resolve frequencies above 1 kHz linearly.
The
first
step
is
to
se
gmenting
the
audio
signal
into
frames
with
the
length
with
in
the
range
is
equal
Figure 3. Calculation of the energy spectrum (power spectrum) and MFCCs.
Figure 4. Creating an (N-1)-point Hamming window and displaying the result.
to a power of two, usually applying a Hamming window function, as shown in Figure 4. The next step is to take the Discrete Fourier Transform (DFT) of each frame. Then comes mel filter bank processing: the frequency range of the DFT spectrum is very wide, and the voice signal does not follow a linear scale. The final step is the Discrete Cosine Transform (DCT), the process of converting the log mel spectrum back into the time domain. The result of the conversion is called the Mel-Frequency Cepstrum Coefficients (12 cepstral features plus energy).
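A compact sketch of this pipeline (framing, Hamming window, DFT, mel filter bank, log, DCT) is given below. It is a plain NumPy/SciPy illustration of the steps just described, not the authors' implementation; the sampling rate, frame length and filter count are assumed values.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frames(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=12):
    # 1) segment into power-of-two frames and apply a Hamming window
    n = len(signal) // n_fft
    frames = signal[:n * n_fft].reshape(n, n_fft) * np.hamming(n_fft)
    # 2) DFT of each frame -> magnitude spectrum
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    # 3) mel filter bank: triangular filters spaced evenly on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[i, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    # 4) log mel energies, then DCT back to the cepstral (time) domain
    log_mel = np.log(spectrum @ fbank.T + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```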
The process of creating energy spectrum features is similar. The first step is to segment the audio signal into frames whose length is equal to a power of two, usually applying a Hamming window function. The next step is to take the Discrete Fourier Transform (DFT) of each frame. Finally, the power of each frame, denoted by P(k), is computed by equation (1):

$$P(k) = 2595 \log(DFT) \qquad (1)$$

The result, P(k), is called the energy spectrum.
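Read literally, equation (1) applies a scaled logarithm to the DFT magnitude of each frame. The sketch below implements it as printed; the base-10 logarithm and the small offset that avoids log(0) are assumptions.

```python
import numpy as np

def energy_spectrum(frames):
    """frames: (num_frames, frame_len) windowed audio frames ->
    per-frame energy spectrum P(k) = 2595 * log(|DFT|), per equation (1)."""
    dft_mag = np.abs(np.fft.rfft(frames, axis=1))
    return 2595.0 * np.log10(dft_mag + 1e-10)  # offset avoids log(0)
```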
3.2. Measure of similarity

The purpose of a measure of similarity is to compare two vectors and compute a single number which evaluates their similarity. In other words, the objective is to determine to what extent two variables co-vary, which is to say, have the same values for the same cases.
Euclidean distance is most often used to compare profiles of respondents across variables. For example, suppose our data consist of demographic information on a sample of individuals, arranged as a respondent-by-variable matrix. Each row of the matrix is a vector of m numbers, where m is the number of variables. We can evaluate the similarity, or the distance, between any pair of rows.
The basis of many measures of similarity and dissimilarity is the Euclidean distance. The distance between feature vectors $d_j$ and $d_k$ is defined as follows:

$$\|d_j - d_k\| = \sqrt{\sum_{i=1}^{n} (d_{ji} - d_{ki})^2} \qquad (2)$$
In other words, the Euclidean distance is the square root of the sum of squared differences between corresponding elements of the two vectors. Note that the formula takes the values of the two vectors at face value: no adjustment is made for differences in scale, so the Euclidean distance is only appropriate for data measured on the same scale. The correlation coefficient is related to the Euclidean distance between standardized versions of the data.
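Equation (2) translates directly into NumPy. The helper below (names are illustrative, not from the paper) also computes the distance of one query vector to every stored template row at once, which is how it is used in the retrieval steps later.

```python
import numpy as np

def euclidean(d_j, d_k):
    """Equation (2): Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((d_j - d_k) ** 2))

def distances_to_templates(query, templates):
    """query: (m,) vector; templates: (num_samples, m) matrix ->
    (num_samples,) distances, one per stored template vector."""
    return np.linalg.norm(templates - query, axis=1)
```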
3.3. Content-based retrieval of spoken audio

This section discusses the methodology used in our proposed techniques. It includes the description of the experimental setup, the comparative study method and the implementation details. The speaker voice retrieval system consists of two stages, as shown in Figure 5.
Figure 5. The process of content-based retrieval of spoken audio.
In the first stage, audio files are stored in the voice retrieval system. In this step we identify, within each audio file, the areas that are vocal or non-vocal, because only the vocal areas of each audio file are used for speaker voice retrieval. Vocal sound is sound produced by one or more speakers, with or without noise and instrumental accompaniment. We use the Euclidean distance to classify each area of an audio file as vocal or non-vocal, by means of vocal and non-vocal templates. The vocal template consists of singing and speech from both men and women, approximately 10 minutes in length. The non-vocal template consists of varied environmental background sounds, including meeting rooms of various sizes, offices, construction sites, television studios, streets, parks, the International Space Station, etc.; it is also approximately 10 minutes in length.
1. Read the audio files to extract the vocal and non-vocal areas.
2. Convert the audio data to energy spectra (energy spectrum) and Mel-Frequency Cepstral Coefficients (MFCCs) with windows of 512 samples, as described in Section 4.2 (the energy spectrum and MFCCs are concatenated to form a longer feature vector, as shown in Figure 6).
3. Calculate the distance between the query instance of the audio file and all sample vectors in the vocal and non-vocal templates (see the code sketch after this list).
4. Sort the distances and determine the nearest sample, based on the minimum distance, for each sample window.
5. Use a simple majority: the category chosen is the one at the least distance, giving the predicted value of the query instance as vocal or non-vocal.
6. Reject the non-vocal vectors, leaving only the vocal vectors of each audio file, and store the files in the voice retrieval system.
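A minimal sketch of steps 3-6 follows; function and variable names are assumptions, not the authors' code. Each window's concatenated feature vector is labeled by its nearest template vector, and non-vocal windows are rejected.

```python
import numpy as np

def keep_vocal(features, vocal_tmpl, nonvocal_tmpl):
    """features: (num_windows, m) concatenated energy-spectrum + MFCC vectors;
    vocal_tmpl / nonvocal_tmpl: (num_samples, m) template matrices."""
    # steps 3-4: minimum Euclidean distance to each template set, per window
    d_voc = np.linalg.norm(features[:, None, :] - vocal_tmpl[None], axis=2).min(axis=1)
    d_non = np.linalg.norm(features[:, None, :] - nonvocal_tmpl[None], axis=2).min(axis=1)
    is_vocal = d_voc < d_non     # step 5: the nearest category wins
    return features[is_vocal]    # step 6: reject the non-vocal vectors
```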
Figure 6. Forming the input vector from the energy spectrum (power spectrum) and MFCCs.
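The construction of Figure 6 amounts to a column-wise concatenation of the two feature sets; a one-function sketch (dimensions are assumptions) is:

```python
import numpy as np

def form_feature_matrix(energy_spec, mfccs):
    """energy_spec: (num_windows, k) power-spectrum features and
    mfccs: (num_windows, c) cepstral features -> (num_windows, k + c)."""
    return np.hstack([energy_spec, mfccs])
```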
After this procedure, the data in the audio files consist only of the vocal vectors of each audio file, prepared for use in the speaker voice retrieval system, which follows the process shown in Figure 7. The second stage is the retrieval procedure; below is a step-by-step algorithm for using the Euclidean distance to retrieve the human voice in each audio file. The goal of this step is to find target files and reject non-target files. This research uses a relative frequency distribution to decide whether or not a file belongs to the target class. A frequency distribution shows the number of elements in a data set that belong to each class. In a relative frequency distribution, the value assigned to each class is the proportion of the total data set that belongs to the class.
The formula for calculating the relative frequency of a class is:

$$\text{relative frequency of a class} = \frac{\text{class frequency}}{n} \qquad (3)$$

Class frequency refers to the number of observations in each class; n represents the total number of observations in the entire data set.
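Equation (3) in code, applied over the per-segment class decisions (names are illustrative):

```python
import numpy as np

def relative_frequency(labels, target_class):
    """labels: per-segment class decisions; returns class frequency / n."""
    labels = np.asarray(labels)
    return (labels == target_class).sum() / len(labels)
```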
Figure 7. The process of relative frequency calculation and searching for a speaker's voice.
1. Read the voice files of the target person, convert them to the energy spectrum and Mel-Frequency Cepstral Coefficients (MFCCs), and assign them to the target class template, which is approximately 5-10 minutes in length. After that, read the voice files of non-target persons, convert them to the energy spectrum and MFCCs, and assign them to the non-target class template, also approximately 5-10 minutes in length.
2. To that end, first read the vocal files in the voice retrieval system and split them into frames of predefined duration. Each frame is further split into N non-overlapping segments, where N is called the frame size. Afterwards, for the segments in each frame, measure the distance to the target class template and to the non-target class template by the Euclidean distance, and use the minimum distance over the sample windows to choose the target class or non-target class (a sketch of this decision rule follows below).
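Putting the two stages together, the sketch below scores one stored vocal file against the target and non-target templates. The acceptance threshold of 0.5 is an assumption, since the paper does not state the cut-off here.

```python
import numpy as np

def is_target_file(segments, target_tmpl, nontarget_tmpl, threshold=0.5):
    """segments: (num_segments, m) feature vectors of one stored vocal file;
    returns True when the file is retrieved as the target speaker."""
    d_t = np.linalg.norm(segments[:, None, :] - target_tmpl[None], axis=2).min(axis=1)
    d_n = np.linalg.norm(segments[:, None, :] - nontarget_tmpl[None], axis=2).min(axis=1)
    rel_freq = (d_t < d_n).mean()   # equation (3): target votes / n
    return rel_freq > threshold
```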