TELKOMNIKA, Vol. 11, No. 5, May 2013, pp. 2731~2738
ISSN: 2302-4046
Received December 18, 2012; Revised March 19, 2013; Accepted March 26, 2013
Wavelet Cepstral Coefficients for Isolated Speech Recognition
T. B. Adam*¹, M. S. Salam¹, T. S. Gunawan²
¹School of Computing, Universiti Teknologi Malaysia, 81300 Skudai, Johor, Malaysia
²Department of Electrical and Computer Engineering, International Islamic University Malaysia
*Corresponding author, e-mail: tarmizi_adam2005@yahoo.com, sah@utm.my, tsgunawan@iium.edu.my
Abstract
The study proposes an improved feature extraction method called Wavelet Cepstral Coefficients (WCC). In traditional cepstral analysis, the cepstrum is calculated using the Discrete Fourier Transform (DFT). Because the DFT calculation assumes that the signal is stationary within each frame, which in practice is not quite true, the WCC replaces the DFT block in the traditional cepstrum calculation with the Discrete Wavelet Transform (DWT), hence producing the WCC. To evaluate the proposed WCC, a speech recognition task of recognizing the 26 English alphabets was conducted. Comparisons with the traditional Mel-Frequency Cepstral Coefficients (MFCC) were made to further analyze the effectiveness of the WCCs. The WCCs showed results comparable to the MFCCs, considering the WCCs' small vector dimension compared to the MFCCs. The best recognition was obtained from WCCs at level 5 of the DWT decomposition, with small differences of 1.19 and 5.21 percentage points compared to the MFCCs for the speaker independent and speaker dependent tasks respectively.
Keywords: Speech recognition, Speech processing, Cepstral analysis, Wavelet Transform, Feature Extraction
Copyright © 2013 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
Feature extraction (FE) can be seen as one of the most significant phases in Speech Recognition (SR) systems, because it plays a major role in the accuracy of the system. In other words, to obtain reliable accuracy, the FE phase should yield speech features that are easy to discriminate and to classify into the different classes.
Traditionally, the most dominant and popular feature extraction technique is the Mel-Frequency Cepstral Coefficient (MFCC) [1, 2]. Davis and Mermelstein showed that MFCCs outperformed several other speech features, which made MFCCs the most widely used feature for SR systems [3].
However, MFCCs suffer from several problems [4]. Experiments showed that MFCCs behave poorly under noisy conditions: MFCC features extracted from noisy speech signals showed reduced accuracy [5]. Another issue concerns the fixed window or frame used when computing the MFCCs [6]. With a fixed frame size, the speech samples lying within each frame are assumed to be stationary. Unfortunately, this assumption does not hold, as speech signals tend to be non-stationary in nature. Thus, information such as plosive sounds is difficult to extract [7, 8].
With a fixed frame size, abrupt changes and localized events, such as sharp transitions in speech signals, cannot be analyzed or extracted with MFCCs. These localized events may contain significant information that could further increase speech recognition accuracy [9, 10]. For example, more information from the speech signal must be retained to distinguish acoustically confusable words.
To address these issues, wavelets have been of particular interest. The use of the Discrete Wavelet Transform (DWT) or the Wavelet Packet Transform (WPT) for feature extraction has been demonstrated in several works. In this paper, we propose a set of new features called Wavelet Cepstral Coefficients (WCC) for isolated spoken English alphabet recognition. The WCC are proposed to remedy the issues of MFCCs described above.
The rest of the paper is structured as follows. Section 2 reviews related works that use wavelets and the wavelet cepstrum. Section 3 reviews the theoretical and mathematical background used in this paper. Sections 4 and 5 explain the method and the proposed feature extraction. Section 6 explains the experimental verification, while Sections 7, 8 and 9 present the results, discussion and conclusions respectively.
2. Background
Sarikaya et al. [5] used the WPT in place of the Discrete Cosine Transform (DCT) to obtain a set of features called Wavelet Packet Parameters (WPP). Wu and Lin proposed an Irregular Wavelet Packet decomposition feature based on the energy of an uttered word to improve the performance of speaker identification systems [11]. Their method applies the WP decomposition to the frequency regions observed to have high energy values, and a recognition rate of 96.6% was obtained.
The DWT was also used by Gowdy and Tufekci to obtain a new feature vector called Mel-Frequency Discrete Wavelet Coefficients (MFDWC) [9]. The MFDWC were obtained by applying the DWT to the Mel-scaled log filterbank energies of a speech frame. Results showed that the MFDWC achieved better recognition than the other features used in the test.
Farooq and Datta used the Admissible Wavelet Packet (AWP) to recognize phonemes and obtained good results compared to MFCCs [12]. Their proposed FE also yielded better results than MFCCs under different types of noise.
The AWP was also used by Deshpande and Holambe [13], who proposed a filter structure that best represents the signal, without taking any human auditory scale into consideration, for use in speaker identification applications. It can be concluded from these studies that wavelets can be utilized in the feature extraction phase to increase speech recognition accuracy. The experiments also suggest that wavelets can be used as an alternative to MFCCs for feature extraction.
2.1. Wavelets and Cepstrum Calculation
Several works have used wavelets for computing the cepstrum, as in [1, 14, 15]. Kinney and Stevens [14] proposed decomposing the speech signal using the wavelet packet transform (WPT) and then calculating the real cepstrum of each set of coefficients, or atoms, obtained from that decomposition. Promising results were obtained for text-dependent speaker recognition, considering the small number of feature coefficients used by the method: 90% of speakers were recognized when using 9 training vectors.
In [15], a wavelet-based cepstrum calculation was proposed and used for pitch extraction in speech signals. Different wavelet families were evaluated to find the one giving the best pitch extraction accuracy.
Recently, Pavez and Silva [1] proposed the Wavelet Packet Cepstral Coefficients (WPCC) as an alternative to filter-bank energy-based feature extraction. In their work, detailed filter designs were presented to obtain the WPCC as an alternative to the widely used MFCCs. Results show that the WPCC perform better than MFCCs and retain more phone-discriminating information in the speech signal at lower frequency ranges.
3. Theoretical Background
3.1. Cepstral Analysis
Cepstral analysis is an important concept in many speech processing tasks. For example, the cepstrum can be used for pitch detection and formant estimation. The MFCC, as stated previously, is also a cepstral analysis method, in which the analysis is done on the Mel frequency scale. Computing the cepstral coefficients of each speech frame involves three steps (refer to Figure 1):
1. DFT of the speech frame.
2. Log energy spectrum calculation.
3. DCT of the log energy spectrum.
A frame of speech is first subjected to the DFT using the following equation:
$$S(k) = \sum_{n=0}^{N-1} s(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \dots, N-1 \qquad (1)$$
Next, the spectrum undergoes a log power calculation to obtain the log power spectrum:
$$m_k = \log \left| S(k) \right|^2 \qquad (2)$$
Finally, taking the DCT of the log power spectrum yields (3):
$$c_i = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} m_k \cos\left(\frac{\pi i (k + 0.5)}{N}\right) \qquad (3)$$
where $c_i$ is the $i$-th cepstrum or cepstral coefficient.
Figure 1. Block diagram of cepstrum calculation
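The three steps in equations (1) to (3) can be sketched for a single frame with NumPy. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name `frame_cepstrum` and the small constant `eps` added before the logarithm (to avoid log(0)) are hypothetical choices, and the DCT is written out directly from (3).

```python
import numpy as np

def frame_cepstrum(frame, eps=1e-12):
    """Cepstral coefficients of one frame: DFT -> log power spectrum -> DCT."""
    frame = np.asarray(frame, dtype=float)
    n_points = len(frame)
    # Step 1, eq. (1): S(k) = sum_n s(n) exp(-j 2 pi k n / N), k = 0..N-1.
    spectrum = np.fft.fft(frame)
    # Step 2, eq. (2): m_k = log|S(k)|^2 (eps is an assumption to avoid log(0)).
    log_power = np.log(np.abs(spectrum) ** 2 + eps)
    # Step 3, eq. (3): c_i = sqrt(2/N) * sum_k m_k cos(pi * i * (k + 0.5) / N).
    k = np.arange(n_points)
    i = k[:, None]
    basis = np.cos(np.pi * i * (k + 0.5) / n_points)
    return np.sqrt(2.0 / n_points) * basis @ log_power
```

For $i \ge 1$ the DCT in (3) coincides with SciPy's orthonormal DCT-II (`scipy.fft.dct(x, type=2, norm="ortho")`); only $c_0$ differs, by a factor of $\sqrt{2}$.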
3.2. Mel-Frequency Cepstral Coefficients
As mentioned in Section 1, the MFCC is the most widely used feature for speech recognition. Computing the MFCC of a given speech signal involves several steps. First, the speech is passed through a pre-emphasis filter of the form $H_{pre}(z) = 1 + a_{pre} z^{-1}$, where a typical value of $a_{pre}$ is near -1 [16].
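A minimal sketch of the pre-emphasis filter, assuming the signal is a NumPy array and using $a_{pre} = -0.95$ from Table 1 as the default. The function name is hypothetical; the first output sample is left unchanged, which amounts to assuming the sample before the signal is zero.

```python
import numpy as np

def pre_emphasis(signal, a_pre=-0.95):
    """Apply H_pre(z) = 1 + a_pre * z^-1, i.e. y[n] = x[n] + a_pre * x[n - 1]."""
    x = np.asarray(signal, dtype=float)
    y = x.copy()
    y[1:] += a_pre * x[:-1]   # the first sample has no predecessor and is kept as-is
    return y
```

For example, `pre_emphasis([1.0, 1.0, 1.0])` gives approximately `[1.0, 0.05, 0.05]`: a constant (low-frequency) signal is strongly attenuated, which is the intended relative boost of high frequencies.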
Next, the speech signal is framed and windowed. The frame length is usually set to 256 samples with 128 samples of overlap between frames, and a Hamming window, $w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{T}\right)$ for $0 \le n \le T$ (a frame of $T + 1$ samples), is applied. Then, for each windowed speech frame, the discrete Fourier transform (DFT) is computed, and the power spectrum of the DFT is binned with a set of Mel-scaled triangular filters (a filterbank).
Next, the logarithm of the Mel-spectral coefficients is taken. The final step is applying the discrete cosine transform (DCT) to the log Mel-spectral coefficients to obtain the MFCCs. A detailed computation of the MFCC is presented in [3].
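The MFCC steps above can be sketched end to end in NumPy using the settings of Table 1 (256-sample frames, 128-sample hop, $a_{pre} = -0.95$, 20 triangular filters, 13 coefficients). Several details are our assumptions rather than the paper's: the common mel mapping $2595 \log_{10}(1 + f/700)$, filters spaced uniformly on the mel scale from 0 Hz to the Nyquist frequency, the `eps` floor inside the logarithm, and all helper names. The zero-padded normalization to 800 features [18] is not reproduced, and `sample_rate` must be supplied by the caller. `np.hamming(M)` is the symmetric window $0.54 - 0.46\cos(2\pi n/(M-1))$, i.e. $T = M - 1$ in the formula above.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def frame_signal(x, frame_size=256, hop=128):
    """Split x into overlapping frames; hop=128 gives a 128-sample overlap."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_size:                      # pad very short signals to one frame
        x = np.pad(x, (0, frame_size - len(x)))
    n_frames = 1 + (len(x) - frame_size) // hop  # trailing samples that do not fill a frame are dropped
    idx = np.arange(frame_size)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced uniformly on the mel scale between 0 Hz and Nyquist."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        for k in range(left, centre):            # rising edge
            fbank[j - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):           # falling edge
            fbank[j - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def mfcc(x, sample_rate, a_pre=-0.95, frame_size=256, hop=128,
         n_filters=20, n_coeffs=13, eps=1e-12):
    """Return an (n_frames, n_coeffs) array of MFCCs."""
    x = np.asarray(x, dtype=float)
    x = np.append(x[0], x[1:] + a_pre * x[:-1])               # pre-emphasis
    frames = frame_signal(x, frame_size, hop) * np.hamming(frame_size)
    power = np.abs(np.fft.rfft(frames, n=frame_size, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_filters, frame_size, sample_rate).T
    log_mel = np.log(mel_energy + eps)                        # log Mel-spectral coefficients
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_coeffs)[:, None] * (k + 0.5) / n_filters)
    return np.sqrt(2.0 / n_filters) * log_mel @ basis.T       # DCT, first n_coeffs terms
```

The result holds one 13-element row per frame; how those rows are turned into a fixed-length utterance vector is left to the normalization step cited above.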
3.3. Discrete Wavelet Transform
A wavelet is a short oscillating signal or function that has a finite duration and an average value of zero. Given a discrete signal $s(n)$ with period $N$, the Discrete Wavelet Transform (DWT) of the signal is:
$$\mathrm{DWT}(j, m) = \sum_{n=0}^{N-1} s(n)\, \psi_{j,m}^{*}(n) \qquad (4)$$
where $j$ is the level of decomposition and
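$$\psi_{j,m}(n) = 2^{-j/2}\, \psi\left(2^{-j} n - m\right) \qquad (5)$$
is the wavelet at level $j$ and shift $m$, obtained by dilating and translating the mother wavelet $\psi$.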
The DWT decomposes an input signal into a set of approximation and detail coefficients. The approximation is recursively decomposed in a binary-tree-like structure, while the details are left without further decomposition.
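To make the recursive structure concrete, the following NumPy sketch implements a multi-level DWT with the Haar wavelet. The paper does not state which wavelet family was used, so Haar is our assumption, chosen only because it needs no external library; the function names are hypothetical. Only the approximation branch is split again at each level, and the output order `[cA_n, cD_n, ..., cD_1]` mirrors the convention of PyWavelets' `pywt.wavedec`, which could be used instead for other wavelet families.

```python
import numpy as np

def haar_dwt_step(x):
    """One analysis step: returns (approximation, detail), each half the input length."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:                      # assumption: pad odd lengths by repeating the last sample
        x = np.append(x, x[-1])
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return approx, detail

def haar_wavedec(x, level):
    """Multi-level DWT in which only the approximation is decomposed again.

    Returns [cA_level, cD_level, ..., cD_1].
    """
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(level):
        approx, detail = haar_dwt_step(approx)
        details.append(detail)          # details are kept, not decomposed further
    return [approx] + details[::-1]
```

For example, `haar_wavedec(np.arange(8.0), 3)` returns four arrays of lengths 1, 1, 2 and 4; because the Haar transform is orthonormal, their total energy equals that of the input.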
3.4. Neural Network Classifier
A multilayer perceptron (MLP) neural network (NN) trained by backpropagation with an adaptive learning rate was employed to train and classify all 26 English alphabets. Depending on the features used, the NN classifier has 800 input nodes for MFCC, and 90, 60 and 40 input nodes for WCC at levels 8, 5 and 3 respectively. The learning rate and the momentum coefficient of the NN were varied to obtain the best recognition rate. For classification, the input features were normalized between -1 and 1, and the outputs use the same range, in which 1 indicates a true classification while -1 indicates a false classification. The activation function of the NN is the hyperbolic tangent, as it guarantees that the output falls within this range. The hyperbolic tangent activation function also speeds up the learning process of the NN [17]. This activation function was used for both hidden and output nodes.
4. Method
We propose that the DFT block in Figure 1 be replaced with the DWT because of the advantages of the wavelet transform discussed in Section 1, in particular its ability to capture non-stationary, localized events. The coefficients from the DWT of the speech signal are then subjected to the log power spectrum and DCT calculations. The final output is what we call the Wavelet Cepstral Coefficients (WCC). For each set of DWT coefficients, only the first ten coefficients of the resulting WCC are concatenated (Figure 2) to form the overall WCC feature vector.
4.1. Database
The speech samples used for training and testing were taken from the standard TI46 isolated alphabets. This dataset of isolated speech utterances was developed by Texas Instruments (TI). Although the dataset contains both isolated alphabet and digit recordings, the experiments used only the alphabets. Overall, the TI46 database contains 16 speakers: eight female speakers (F1 to F8) and eight male speakers (M1 to M8). Each of the alphabets A to Z was uttered 16 times by each speaker.
Figure 2. Process of obtaining WCC from DWT coefficients
To train the NN, five female speakers (F1 to F5) and three male speakers (M1, M2 and M3) were used. Tests were then conducted in either speaker dependent (SD) or speaker independent (SI) mode. SD testing uses the same speakers that were used for training, while SI testing uses the remaining speakers (F6 to F8 and M4 to M8), who were not used for training.
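The speaker partition used for training and for the two test modes can be written down as a small configuration. This Python sketch simply encodes the speaker lists given above and in Tables 3 and 4; the variable names are hypothetical.

```python
female = [f"F{i}" for i in range(1, 9)]   # F1 ... F8
male = [f"M{i}" for i in range(1, 9)]     # M1 ... M8

train_speakers = female[:5] + male[:3]    # F1-F5 and M1-M3 train the NN
sd_test_speakers = list(train_speakers)   # speaker dependent: the same speakers
si_test_speakers = female[5:] + male[3:]  # speaker independent: unseen F6-F8, M4-M8

print("SD test:", sd_test_speakers)
print("SI test:", si_test_speakers)
```

Keeping the SI list disjoint from the training list is what makes the SI results a test of generalization to new voices.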
5. Proposed Feature Extraction
The first step in computing the proposed WCC is speech end point detection, or silence removal: the silence at the beginning and end of the speech is omitted. The speech then undergoes pre-emphasis filtering, framing and windowing as explained in Section 3.2 for the MFCC calculation. Next, instead of undergoing the DFT (as in Section 3.1), the speech frames are decomposed with the DWT (refer to Section 3.3). The coefficients produced by the DWT are then subjected to the log power spectrum and DCT calculations. Each set of wavelet coefficients produces a WCC; however, to reduce the dimension of the extracted features, only ten coefficients are retained from each set of wavelet coefficients (Figure 2). Finally, the WCC are fed into the NN classifier for either training or testing.
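Putting the pieces together, the WCC computation can be sketched as a DWT, then a log power and DCT step for each set of coefficients, keeping the first ten values of each and concatenating them as in Figure 2. This is a hedged illustration rather than the authors' code. The Haar wavelet, the `eps` floor in the logarithm, the use of the DCT form of (3), and zero-padding of subbands shorter than ten coefficients are all our assumptions, and the sketch treats its input as one segment, since the text does not specify how per-frame values are combined into a single vector per utterance. The function names are hypothetical.

```python
import numpy as np

def haar_wavedec(x, level):
    """Haar DWT returning [cA_level, cD_level, ..., cD_1] (odd lengths padded by edge repeat)."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(level):
        if len(approx) % 2:
            approx = np.append(approx, approx[-1])
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return [approx] + details[::-1]

def dct_first(v, n_out):
    """First n_out terms of the DCT in (3): c_i = sqrt(2/N) sum_k v_k cos(pi i (k + 0.5) / N)."""
    n = len(v)
    i = np.arange(min(n_out, n))[:, None]
    k = np.arange(n)
    return np.sqrt(2.0 / n) * np.cos(np.pi * i * (k + 0.5) / n) @ v

def wcc(segment, level=5, keep=10, eps=1e-12):
    """Wavelet cepstral coefficients: DWT -> log power -> DCT -> first `keep` values per subband."""
    features = []
    for coeffs in haar_wavedec(segment, level):
        cep = dct_first(np.log(coeffs ** 2 + eps), keep)
        if len(cep) < keep:                  # assumption: zero-pad very short subbands
            cep = np.pad(cep, (0, keep - len(cep)))
        features.append(cep)
    return np.concatenate(features)          # length keep * (level + 1)
```

Under these assumptions, `wcc(x, level=5)` returns a 60-element vector, matching the level-5 input size in Table 2.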
6. Experimental Verification
The experiment was conducted by varying the level of the DWT: WCC were extracted at level 8, level 5 and level 3. To benchmark the proposed WCC, we compared the results with the MFCC. The task was to recognize all 26 English alphabets, which is quite difficult because of the acoustic similarities between several letters. The test was conducted to observe the effect of the DWT decomposition level on the recognition rate. We also wanted to observe the effectiveness of the WCC in classifying acoustically confusable letters when compared to the MFCCs. Tables 1 and 2 give the settings used in the experiment.
For the MFCC features, an 800-element feature vector was obtained for each utterance; this length was fixed by means of zero-padded normalization as described in [18]. For the WCC, feature vectors of 90, 60 and 40 coefficients (levels 8, 5 and 3) were used for each speech signal. The number of features also determines the number of input nodes needed for the NN classifier.
Table 1. MFCC parameter settings
Parameter                                  Value
Frame size                                 256 samples
Frame overlap                              128 samples
Pre-emphasis coefficient (a_pre)           -0.95
Number of triangular band-pass filters     20
Number of MFCC coefficients                13 with energy
Table 2. Neural network classifier setup
Parameter                                  Value
Input layer                                800 nodes for MFCC; 90, 60 and 40 nodes for WCC
Hidden layer                               1 hidden layer with 250 nodes
Output layer                               26 nodes
Hidden layer activation function           Hyperbolic tangent
Output layer activation function           Hyperbolic tangent
The number of WCC coefficients is obtained by retaining only ten coefficients from each DWT decomposition level (Figure 2). In this study we restricted the evaluation to ten coefficients per level: ten coefficients from the approximation coefficients and ten coefficients from each level of detail coefficients are used.
Let there be $n$ levels of wavelet decomposition, in which each level contains both detail and approximation coefficients. For every level except the $n$th, ten coefficients are taken from the detail coefficients only, while at the $n$th level ten coefficients are taken from both the detail and the approximation coefficients. The relationship is presented in (6) as
$$N_c = 10n + 10 \qquad (6)$$
where $N_c$ is the number of WCC obtained from an $n$-level wavelet decomposition. Thus, for example, an 8-level wavelet decomposition yields $N_c = 10 \times 8 + 10 = 90$ WCC coefficients.
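Equation (6) can be checked for the three decomposition levels used in the experiments with a one-line function, which also shows how small each WCC vector is relative to the 800-element MFCC vector. The function name is hypothetical.

```python
def wcc_length(levels, per_subband=10):
    """N_c = 10n + 10: ten values from each of the n detail levels plus ten from the approximation."""
    return per_subband * levels + per_subband

for n in (3, 5, 8):
    n_c = wcc_length(n)
    print(f"level {n}: N_c = {n_c} ({100 * n_c / 800:.2f}% of the 800 MFCC features)")
```

This prints 40, 60 and 90 coefficients for levels 3, 5 and 8, i.e. 5.00%, 7.50% and 11.25% of the MFCC vector length, matching the input layer sizes in Table 2.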
7. Results
Tables 3 and 4 show the results of the proposed WCC on the speaker dependent and speaker independent tasks respectively. Comparisons with MFCCs were made to evaluate the effectiveness of the WCC. Figures 3 and 4 show histogram plots of the average recognition rates of both MFCC and WCCs. Average recognition rates were obtained from different values of the learning rate (LR) and momentum constant (MC) of the NN classifier; different LR and MC values were needed to obtain the best results for each of the WCCs and the MFCC.
Table 3. Comparative results between MFCC and WCC (speaker dependent)
Speaker   MFCC (%)   WCC lvl 8 (%)   WCC lvl 5 (%)   WCC lvl 3 (%)
F1        81.31      76.74           77.71           72.96
F2        78.79      66.65           73.02           69.95
F3        84.50      75.54           78.55           82.81
F4        87.56      79.81           81.13           78.25
F5        81.01      81.37           81.97           75.24
M1        74.94      66.23           72.42           61.42
M2        93.03      88.16           91.29           85.10
M3        93.03      72.36           76.38           70.07
Average   84.27      75.86           79.06           74.48
Figure 3. Results for speaker dependent tasks
Table 4. Comparative results between MFCC and WCC (speaker independent)
Speaker   MFCC (%)   WCC lvl 8 (%)   WCC lvl 5 (%)   WCC lvl 3 (%)
F6        75.60      61.84           68.75           65.87
F7        74.52      72.96           79.45           75.72
F8        68.13      59.94           64.88           63.86
M4        85.60      80.15           82.60           78.19
M5        69.55      71.32           70.10           65.20
M6        73.29      62.35           60.82           54.04
M7        71.48      67.72           70.15           64.57
M8        56.97      64.90           68.81           62.82
Average   71.89      67.65           70.70           66.29
Figure 4. Results for speaker independent tasks
8. Discussion
Several interesting points emerge from the experiments. First, the WCCs have a considerably smaller feature vector than the MFCC. This holds for the WCCs obtained at level 8, level 5 and level 3, whose feature vectors contain only 90, 60 and 40 coefficients respectively. These values represent a substantial feature reduction compared to the MFCC, which uses 800 coefficients; the level-5 vector, for instance, is 7.5% of the MFCC length.
The results (Tables 3 and 4) show that level 5 of the wavelet decomposition yielded the best results among the WCCs, for both the speaker dependent and speaker independent tasks. However, the MFCC results were still higher in most cases.
The speaker independent task gave interesting observations. In the speaker independent tests, the recognition rates (RR) of the WCCs and the MFCC were quite comparable: the average RR is 71.89% for MFCC and 70.70% for WCC at level 5. The difference between the two is only 1.19 percentage points, which is quite small. This result is significant because the WCC feature vector used only 60 coefficients while the MFCC used 800.
For the speaker dependent task, the difference between WCC at level 5 (79.06%) and MFCC (84.27%) is 5.21 percentage points, a larger gap than in the speaker independent case but still modest given the much smaller WCC feature vector.
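The averages for MFCC and level-5 WCC can be recomputed from the per-speaker rates in Tables 3 and 4 with a few lines of Python; the gap is reported in percentage points. The dictionary names are our own.

```python
import numpy as np

# Recognition rates (%) copied from Table 3 (F1-F5, M1-M3) and Table 4 (F6-F8, M4-M8).
sd = {
    "MFCC":      [81.31, 78.79, 84.50, 87.56, 81.01, 74.94, 93.03, 93.03],
    "WCC lvl 5": [77.71, 73.02, 78.55, 81.13, 81.97, 72.42, 91.29, 76.38],
}
si = {
    "MFCC":      [75.60, 74.52, 68.13, 85.60, 69.55, 73.29, 71.48, 56.97],
    "WCC lvl 5": [68.75, 79.45, 64.88, 82.60, 70.10, 60.82, 70.15, 68.81],
}

for task, table in (("speaker dependent", sd), ("speaker independent", si)):
    mfcc_avg = np.mean(table["MFCC"])
    wcc_avg = np.mean(table["WCC lvl 5"])
    print(f"{task}: MFCC {mfcc_avg:.2f}%, WCC level 5 {wcc_avg:.2f}%, "
          f"gap {mfcc_avg - wcc_avg:.4f} percentage points")
```

The unrounded means are 84.27125 and 79.05875 for the speaker dependent task and 71.8925 and 70.695 for the speaker independent task.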
9. Conclusion
The proposed WCC feature extraction produced a considerably smaller feature vector than MFCCs. Although the WCC feature vectors were small, the results suggest that, with further study and improvement, the WCC could outperform the MFCCs.
To improve the WCCs, the structure of the DWT must be explored further. In this study, the best WCC accuracy was obtained at level 5 of the DWT decomposition. The results of the speaker independent task show that the WCC could be further improved and could be well suited to text dependent speech recognition tasks.
Future work includes experimenting with different numbers of coefficients, wavelet families and wavelet structures. The WCCs should also be tested under different noise conditions to observe their robustness to noisy speech. We expect that by experimenting with these parameters the WCCs may surpass MFCCs in terms of RR.
References
[1] E Pavez and JF Silva. "Analysis and Design of Wavelet-Packet Cepstral Coefficients for Automatic Speech Recognition". Speech Communication. 2012; 54: 814-835.
[2] LD Vignolo, DH Milone and HL Rufiner. "Genetic Wavelet Packets for Speech Recognition". Expert Systems with Applications. 2012; 40: 2350-2359.
[3] S Davis and P Mermelstein. "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences". IEEE Transactions on Acoustics, Speech and Signal Processing. 1980; 28: 357-366.
[4] M Anusuya and S Katti. "Front End Analysis of Speech Recognition: A Review". International Journal of Speech Technology. 2011; 14: 99-145.
[5] R Sarikaya, BL Pellom and JHL Hansen. "Wavelet Packet Transform Features with Application to Speaker Identification". In Third IEEE Nordic Signal Processing Symposium. 1998: 81-84.
[6] P Kumar and M Chandra. "Hybrid of Wavelet and MFCC Features for Speaker Verification". In World Congress on Information and Communication Technologies (WICT). 2011: 1150-1154.
[7] SY Lung. "Improved Wavelet Feature Extraction Using Kernel Analysis for Text Independent Speaker Recognition". Digital Signal Processing. 2010; 20: 1400-1407.
[8] K Daqrouq and KY Al Azzawi. "Average Framing Linear Prediction Coding with Wavelet Transform for Text-Independent Speaker Identification System". Computers and Electrical Engineering. 2012; 38.
[9] JN Gowdy and Z Tufekci. "Mel-Scaled Discrete Wavelet Coefficients for Speech Recognition". In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. 2000: 1351-1354.
[10] E Didiot, I Illina, D Fohr and O Mella. "A Wavelet-Based Parameterization for Speech/Music Discrimination". Computer Speech and Language. 2010; 24: 341-357.
[11] JD Wu and BF Lin. "Speaker Identification using Discrete Wavelet Packet Transform Technique with Irregular Decomposition". Expert Systems with Applications. 2009; 36: 3136-3143.
[12] O Farooq and S Datta. "Wavelet Based Robust Sub-Band Features for Phoneme Recognition". IEE Proceedings - Vision, Image and Signal Processing. 2004; 151: 187-193.
[13] MS Deshpande and RS Holambe. "Speaker Identification Using Admissible Wavelet Packet Based Decomposition". Journal of Signal Processing, World Academy of Science, Engineering and Technology. 2010; 6: 20-23.
[14] A Kinney and J Stevens. "Wavelet Packet Cepstral Analysis for Speaker Recognition". In Conference Record of the Thirty-Sixth Asilomar Conference on Signals, Systems and Computers. 2002; 1: 206-209.
[15] FL Sanchez, S Barbon Júnior, LS Vieira, RC Guido, ES Fonseca, PR Scalassara, CD Maciel, JC Pereira and SH Chen. "Wavelet-Based Cepstrum Calculation". Journal of Computational and Applied Mathematics. 2009; 227: 288-293.
[16] JW Picone. "Signal Modeling Techniques in Speech Recognition". Proceedings of the IEEE. 1993; 81: 1215-1247.
[17] M Negnevitsky. Artificial Intelligence: A Guide to Intelligent Systems. 2nd ed. Harlow, UK: Addison Wesley. 2005.
[18] MSH Salam, D Mohamad and SHS Salleh. "Temporal Speech Normalization Methods Comparison in Speech Recognition Using Neural Network". In International Conference of Soft Computing and Pattern Recognition (SoCPaR), Malacca, Malaysia. 2009.