International Journal of Electrical and Computer Engineering (IJECE)
Vol. 8, No. 6, December 2018, pp. 5381~5388
ISSN: 2088-8708, DOI: 10.11591/ijece.v8i6.pp5381-5388
Journal homepage: http://iaescore.com/journals/index.php/IJECE
Convolutional Neural Network and Feature Transformation for Distant Speech Recognition

Hilman F. Pardede, Asri R. Yuliani, Rika Sustika
Research Center for Informatics, Indonesian Institute of Sciences, Indonesia
Article Info

Article history:
Received Jan 5, 2018
Revised Jul 27, 2018
Accepted Aug 7, 2018

ABSTRACT
In many applications, speech recognition must operate in conditions where there are some distances between the speakers and the microphones. This is called distant speech recognition (DSR). In this condition, speech recognition must deal with reverberation. Nowadays, deep learning technologies are becoming the main technologies for speech recognition. Deep Neural Network (DNN) in hybrid with Hidden Markov Model (HMM) is the commonly used architecture. However, this system is still not robust against reverberation. Previous studies use Convolutional Neural Networks (CNN), a variation of the neural network, to improve the robustness of speech recognition against noise. CNN has the property of pooling, which is used to find local correlations between neighboring dimensions in the features. With this property, CNN could be used for feature learning that emphasizes the information in neighboring frames. In this study we use CNN to deal with reverberation. We also propose to use feature transformation techniques, linear discriminant analysis (LDA) and maximum likelihood linear transformation (MLLT), on mel frequency cepstral coefficients (MFCC) before feeding them to CNN. We argue that transforming the features could produce more discriminative features for CNN, and hence improve the robustness of speech recognition against reverberation. Our evaluations on the Meeting Recorder Digits (MRD) subset of the Aurora-5 database confirm that the use of the LDA and MLLT transformations improves the robustness of speech recognition. It is better by 20% relative error reduction compared to a standard DNN-based speech recognition system using the same number of hidden layers.
Keyword:
CNN
Distant Speech Recognition
Feature transformation
LDA
MLLT
Reverberation
Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Hilman F. Pardede
Research Center for Informatics,
Jl. Cisitu No. 21/154D Bandung, Indonesia.
Email: hilm001@lipi.go.id
1. INTRODUCTION
Deep learning technologies have recently achieved huge success in acoustic modelling for automatic speech recognition (ASR) tasks [1]-[4]. They replace conventional Hidden Markov Models - Gaussian Mixture Models (HMM-GMM) [5], [6]. Currently, the Deep Neural Network (DNN) is the state-of-the-art architecture for speech recognition. The DNN is used to provide posterior probabilities to the HMM based on a set of learned features. A hybrid of HMM-DNN has been shown to have superior performance compared to HMM-GMM models for ASR.
Currently, more automatic speech recognition (ASR) applications are found in our daily activities. They have been implemented as virtual assistants in smartphones, home automation, meeting diarisation, and so on. For such applications, ASR must operate in conditions where there are some distances between the speakers and the microphones. This is called distant speech recognition (DSR). In such conditions, ASR systems are expected to be robust against noise and reverberation. However, the performance of DNN-HMM systems is still unsatisfactory in these conditions [7]. Noise and reverberation distort the speech signals, causing large degradation in the performance of ASR systems. This may hold back users from using ASR applications.
Many studies have proposed techniques to improve the accuracy of ASR in noisy and reverberant conditions. One approach is to enhance the noisy features by applying noise removal techniques [8]. Others designed discriminative, handcrafted features that are more robust against noise and reverberation [9]. Many works also propose adapting the acoustic models to noisy conditions [10]. In DNN frameworks, however, many methods proposed for HMM-GMM systems may not work as well [7].
For deep learning frameworks, various architectures have been investigated to find better systems, such as the recurrent neural network (RNN) [11] and the convolutional neural network (CNN) [12]. In these approaches, the number of hidden layers of the systems is increased to produce more discriminative features before fine-tuning in the last layers. However, this may significantly increase the computational time for training.
Currently, CNN is gaining interest among researchers. Originally, it was used in computer vision [13]. Some studies [14]-[17] indicate it to be better than DNN for large-scale vocabulary tasks. We argue that the properties of CNN, such as pooling, could be beneficial in reverberant conditions. These studies, however, mostly deal with noise only; their implementations for dealing with reverberation have not yet been explored.
One advantage of deep learning frameworks is the ability of the network to learn discriminative features given input data [18]. Studies show that transforming features before feeding them to the networks may benefit the performance of deep learning systems. There are numerous approaches that can be implemented in the feature domain to improve the performance of ASR systems in deep architectures for large vocabulary systems. Some of them are linear discriminant analysis (LDA) [19], heteroscedastic linear discriminant analysis (HLDA) [20], Maximum Likelihood Linear Transform (MLLT) [21], feature-based minimum phone error (fMPE) [22], or combined transformations.
In this study, we propose CNN with feature transformations for improving the robustness of ASR against reverberation. We apply LDA and MLLT on the features before feeding them to CNN. We argue that applying them may also improve the robustness of speech recognition in reverberant conditions while still using a relatively small number of hidden layers. We evaluate the use of feature transformations (i.e. LDA and MLLT) on mel-frequency cepstral coefficients (MFCC). We capture the context information of speech by splicing the features with several preceding and succeeding frames, and we then apply LDA to reduce the dimensionality. After that, we apply MLLT on the reduced features. Finally, we feed the transformed features as acoustic input to the CNN.
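The splicing step described above can be sketched in a few lines of numpy (a minimal illustration; the 13-dim features and ±4-frame context match the setup used later in the paper, but the edge-padding strategy here is an assumption, not the paper's stated choice):

```python
import numpy as np

def splice(feats, left=4, right=4):
    """Splice each frame with `left` preceding and `right` succeeding
    frames (edges padded by repeating the boundary frame), turning a
    (T, d) feature matrix into (T, d * (left + 1 + right))."""
    T, d = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)])
    return np.stack([padded[t:t + left + 1 + right].reshape(-1)
                     for t in range(T)])

mfcc = np.random.randn(100, 13)   # 100 frames of 13-dim MFCC
spliced = splice(mfcc)            # 9-frame context window
print(spliced.shape)              # (100, 117)
```

The 117-dimensional spliced vectors are what LDA then reduces to 40 dimensions.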
The rest of the paper is organized as follows. Section 2 provides the theoretical background for our system. In this section, we briefly describe the features we used, the feature transformations, and CNN. In Section 3, we explain our proposed system. In Section 4, we explain our experimental setup to evaluate our method and discuss the results. We conclude the paper in Section 5.
2. THEORETICAL BACKGROUND
2.1. Speech Features
Many features have been proposed for ASR. MFCC is arguably the most popular one. MFCC is a handcrafted feature that is extracted using a two-stage Fourier transform. The aim is to decorrelate the speech components in the time and frequency domains. By doing so, speech units, such as phonemes, could be modeled using mixtures of Gaussians with only their diagonal covariances.
In the MFCC extraction process, the speech signals are chunked into sequences of frames with fixed duration, usually around 25-50 ms each. Speech is assumed to be stationary within each frame, and the Fourier transform is applied to obtain its spectral components. Usually, the power spectra are used, obtained by taking the square of the magnitude. Then, the spectra are mapped onto mel-scaled filter-banks to emphasize the frequencies in the lower region more. After that, the log operation is applied to the output of the mel-filterbank before applying the Fourier transform again, in this case using only the real part of the Fourier transform, to decorrelate each component in the frequency domain.
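The two-stage pipeline above (framing, power spectrum, mel filterbank, log, then a real-part second transform realized as a DCT) can be sketched in numpy. This is a minimal illustration: the window length, hop, filter count, and 8 kHz sampling rate are assumptions for the example, not the paper's settings.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(signal, sr=8000, frame=200, hop=80, n_filters=23, n_ceps=13):
    # 1) chunk the signal into fixed-duration frames (25 ms at 8 kHz)
    frames = np.stack([signal[i:i + frame] * np.hamming(frame)
                       for i in range(0, len(signal) - frame, hop)])
    # 2) first Fourier stage: power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, frame)) ** 2
    # 3) mel-scaled filter-bank emphasizes lower frequencies, then log
    logmel = np.log(power @ mel_filterbank(n_filters, frame, sr).T + 1e-10)
    # 4) second stage: a DCT (real part of the Fourier transform)
    #    decorrelates the components; keep the first n_ceps
    n = logmel.shape[1]
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  2 * np.arange(n) + 1) / (2 * n))
    return logmel @ dct.T

feats = mfcc(np.random.randn(8000))   # 1 s of noise at 8 kHz
print(feats.shape)                    # (98, 13)
```

Dropping step 4 yields the "raw" FBANK features discussed below.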
While MFCC shows good results when the conditions between training and testing are the same, it suffers when there is high variability in the data. Speech is highly varied due to intra-speaker variabilities, inter-speaker variabilities, and environment variabilities, i.e. when speech is noisy or contaminated, etc.
Many studies propose different features to improve the robustness of ASR. PLP is one of the examples. The main difference between PLP and MFCC is that PLP applies a cube root function instead of the log. The objective is to reduce the sensitivity of the features in the low-energy region, which is most sensitive to noise. Another difference is the use of the bark scale instead of the mel scale used in MFCC.
MFCC shows pretty good performance in HMM-GMM systems. Since it is quite uncorrelated, it is adequate to model each state of the HMM using mixtures of Gaussians with only the diagonal covariances of the GMM. However, the two-stage Fourier transform removes the correlation between speech components in the time-frequency domains. These correlations could still be needed in the recognition process. This may be one of the reasons that ASR is not robust.
In DNN-HMM systems, some studies show that it is more beneficial when more "raw" features are used. One of them is FBANK [23]. FBANK has the same extraction process as MFCC except that it omits the second-stage Fourier transform, so some correlations in frequency still exist.
2.2. Feature Transformation
Transforming features to other domain spaces is often found effective in many classification tasks in machine learning. The use of high-dimensional features is often ineffective because it may lead to overfitting, so reducing the dimensions of the features is often applied. LDA and PCA are examples of feature reduction techniques. LDA [24] is applied in a supervised manner while PCA is an unsupervised technique. LDA is usually applied in a preprocessing stage to reduce the dimensions of the features, where the n-dimensional features are reduced into an m-dimensional space (m < n). The objective is to project the feature space into a lower-dimensional space while making the features more discriminative. The lower-dimensional feature space is chosen such that it emphasizes the distances between classes more than those within a class. Mathematically, it could be written:
$$J(\theta) = \frac{\det\!\left(\theta^{T}\Sigma_{b}\,\theta\right)}{\det\!\left(\theta^{T}\Sigma_{w}\,\theta\right)} \qquad (1)$$
where $\Sigma_b$ is the covariance between classes, $\Sigma_w$ is the covariance within classes, $\theta$ is the feature transform, and $J(\theta)$ is the cost function to be maximized.
The solution for $J(\theta)$ is obtained by taking the first m eigenvectors of the matrix $\Sigma_{w}^{-1}\Sigma_{b}$ after sorting the eigenvalues from the largest ones. For more information on applying LDA to speech features, refer to [19].
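The eigenvector solution above can be made concrete with a small numpy sketch (a minimal illustration on synthetic data; the class shift and dimensions are arbitrary assumptions for the example):

```python
import numpy as np

def lda_transform(X, y, m):
    """Project n-dim features onto the first m eigenvectors of
    Sigma_w^{-1} Sigma_b, with eigenvalues sorted from the largest."""
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)            # largest eigenvalues first
    return vecs[:, order[:m]].real            # n x m projection theta

X = np.random.randn(300, 9)
y = np.repeat([0, 1, 2], 100)
X[y == 1] += 2.0                  # make one class separable
theta = lda_transform(X, y, m=2)
Z = X @ theta                     # features reduced to 2 dimensions
print(Z.shape)                    # (300, 2)
```

In the proposed system the same operation maps 117-dimensional spliced features to 40 dimensions, with HMM states playing the role of classes.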
Meanwhile, MLLT [25] is applied in HMM-GMM systems to loosen an assumption made in those systems. In HMM-GMM systems, it is assumed that the feature components are independent of each other. Therefore, Gaussians with only diagonal co-variances are used. While this speeds up training, the assumption may not necessarily hold, because speech components may still be related to each other in the feature space.
MLLT [25], which is also known as semi-tied co-variance (STC) [26], linearly transforms the sample data into a new transformed space that is Gaussian distributed, to loosen this assumption. MLLT is applied to implicitly capture the correlation between the feature elements by using a constrained co-variance model. MLLT works as follows.
MLLT uses eigendecomposition to decompose a full co-variance matrix over the set of Gaussian components, where each component maintains its "diagonal" characteristics. A full covariance matrix could be decomposed using the following formula:
$$\hat{\Sigma}^{(m)} = H^{(r)}\,\hat{\Sigma}^{(m)}_{\mathrm{diag}}\,H^{(r)T} \qquad (2)$$
where m is the index of the Gaussian component and r is the index of the class. Each component m has three parameters: weight, mean, and the diagonal element of the semi-tied co-variance matrix. So, each co-variance matrix $\Sigma^{(m)}$ could be decomposed into two parts:
the diagonal element of the co-variance matrix of component m, $\Sigma^{(m)}_{\mathrm{diag}}$, and a shared full co-variance matrix of the Gaussian components in class r, $H^{(r)}$ (named the semi-tied transform). We denote $A^{(r)}$ as the inverse of $H^{(r)}$.
In ASR, the covariance matrices are trained in the Maximum Likelihood sense on the training data and are optimized with respect to $\hat{\mu}^{(m)}$, the mean of the Gaussians, and the diagonal covariance matrices $\hat{\Sigma}^{(m)}_{\mathrm{diag}}$. So, the cost function J could be written:
$$J = \sum_{m \in M_r} \frac{\gamma_m}{2}\left[\log\!\left(\big|\hat{A}^{(r)}\big|^{2}\right) - \log\left|\mathrm{diag}\!\left(\hat{A}^{(r)}\,W^{(m)}\,\hat{A}^{(r)T}\right)\right|\right] \qquad (3)$$
where:
$$W^{(m)} = \frac{1}{\gamma_m}\sum_{\tau}\gamma_m(\tau)\left(o(\tau)-\hat{\mu}^{(m)}\right)\left(o(\tau)-\hat{\mu}^{(m)}\right)^{T} \qquad (4)$$
$$\gamma_m = \sum_{\tau}\gamma_m(\tau), \qquad m \in M_r \qquad (5)$$
and
$$\gamma_m(\tau) = p\!\left(q_m(\tau)\,\middle|\,O_T\right) \qquad (6)$$
The notation $q_m(\tau)$ denotes the Gaussian component m at time $\tau$, $O_T$ is the training data, and $o(\tau)$ is the observed feature. Then the maximum likelihood estimate of the mean is:
$$\hat{\mu}^{(m)} = \frac{\sum_{\tau}\gamma_m(\tau)\,o(\tau)}{\gamma_m} \qquad (7)$$
and the covariance matrix estimate is:
$$\hat{\Sigma}^{(m)}_{\mathrm{diag}} = \mathrm{diag}\!\left(\hat{A}^{(r)}\,W^{(m)}\,\hat{A}^{(r)T}\right) \qquad (8)$$
Calculating $A^{(r)}$ is nontrivial. To estimate it, it is initialized with an identity matrix, then $\hat{\Sigma}^{(m)}_{\mathrm{diag}}$ is estimated using Equation (8), and then $A^{(r)}$ is updated using Eq. (2).
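Equations (2) and (8) can be made concrete with a tiny numpy check. In practice $A^{(r)}$ is shared across many components and estimated iteratively under the objective in Eq. (3); the toy below uses a single component, for which an exact diagonalizing transform exists, purely to show that the decomposition loses nothing in that case (the random seed and dimensions are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
W = M @ M.T + 5 * np.eye(5)        # full covariance stats W^(m), Eq. (4)

# Choose A^(r) as the eigenvector basis of W, so the transformed
# space is exactly decorrelated for this single component.
_, V = np.linalg.eigh(W)
A = V.T                            # A^(r); H^(r) = inverse of A^(r)

Sigma_diag = np.diag(np.diag(A @ W @ A.T))   # Eq. (8)
H = np.linalg.inv(A)
Sigma_full = H @ Sigma_diag @ H.T            # Eq. (2)

print(np.allclose(Sigma_full, W))            # True
```

With many components sharing one $A^{(r)}$, the diagonalization is only approximate, which is exactly what the iterative update described above optimizes.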
2.3. Convolutional Neural Network
A typical convolutional network structure is illustrated in Figure 1. It is different from a DNN, where all neurons in the previous layers are connected to all the neurons of the successive layers, which may not be effective when the features have large dimensions.
The Convolutional Neural Network (CNN) is a special kind of deep neural network. CNN introduces two types of special network layers, called the convolutional layer and the pooling layer. Each neuron of the convolutional layer receives inputs from a set of filters of the lower layer. The filters are obtained by multiplying a small local part of the input with the weight matrix, and these filters are then replicated throughout the whole input space. Localized filters that share the same weights appear as feature maps. After completing the convolution process, a pooling layer takes inputs from a local part of the convolutional layer and generates a lower-resolution version of the filter activations.
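The convolution and pooling operations just described can be sketched in one dimension with plain numpy (a minimal illustration of the mechanics, not the paper's network; the toy input and filter are assumptions):

```python
import numpy as np

def conv1d(x, w):
    """Valid 1-D convolution: each output unit sees only a small local
    part of the input, and all units share the same weights (one filter,
    replicated across the whole input space)."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

def max_pool(x, size=3):
    """Non-overlapping max-pooling: keep the strongest activation in
    each local region, giving a lower-resolution feature map."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

x = np.sin(np.linspace(0, 6, 30))    # toy 1-D "spectral" input
w = np.array([1.0, 0.0, -1.0])       # a shared local filter
fmap = conv1d(x, w)                  # feature map, length 28
pooled = max_pool(fmap)              # length 9 after pooling
print(fmap.shape, pooled.shape)
```

A real CNN layer applies many such filters in parallel (producing many feature maps) and operates over two dimensions (time and frequency), but the locality, weight sharing, and pooling are exactly these operations.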
In implementations for speech recognition, after a few layers of CNN structure, a fully connected layer of a generative deep neural network model (DBN-DNN) is applied to combine the extracted local patterns from all positions in the lower layer for the final recognition [27]. In this paper, we use two layers of CNN structure and then apply 4 layers of DBN, producing a total of 6 hidden layers for pretraining. Then DNN is applied on top of them for supervised learning.
Figure 1. A typical CNN architecture
3. THE PROPOSED SYSTEM
Figure 2 shows the currently most commonly used ASR system. DBN-DNN, based on Karel's implementation of DNN in KALDI [28], is used as the baseline. The DBN configuration in this experiment uses 6 hidden layers with a dimension of 2048 hidden neurons, i.e. a Gaussian-Bernoulli RBM as the first layer connecting to the Gaussian acoustic inputs, and Bernoulli-Bernoulli RBM layers afterward. The stack of pre-training layers is followed by DNN layers with 1 hidden layer (1024 neurons) and a softmax output layer. We denote this as BASELINE2 in this paper.
We also train a conventional HMM-GMM. We denote this as BASELINE1. For this, we model each digit with a 16-state, left-to-right HMM, where each state is modelled using mixtures of Gaussians with three Gaussians. For the pause model, sil, we use an HMM with 3 states with 6 Gaussian components.
Figure 3 shows the proposed system in this study. We use MFCC as the basis for the static features. We use 13 dimensions of static features, and the features are then spliced using 4 preceding and 4 succeeding frames to capture the context of the speech, producing 117 dimensions. Then, we apply LDA to reduce the dimensions into 40 dimensions for all features. Then we apply MLLT on the output of LDA before feeding them into the CNN. The system using only LDA is denoted as PROPOSED1, and the system with both LDA and MLLT is denoted as PROPOSED2.
Figure 2. The Baseline System: MFCC with delta transformation before feeding to DBN-DNN
Figure 3. The Proposed System: The Feature Transformation (LDA and MLLT) is applied on MFCC before feeding it into a 2-layer CNN. The output of the CNN is fed into 4 layers of DBN and 1 layer of DNN
For CNN pre-training, we use two layers of CNN and then 4 layers of DBN. For the CNN layers, we use 128 neurons for the first hidden layer and 256 neurons for the second hidden layer. A pool size of three and max pooling are used in this study. For DBN, we use the standard 1024 neurons for the hidden layers. These settings are the same as in [15], as they were found to be a good setting for speech recognition. The output of the DNN is used to estimate the posterior probability of the HMM states in the hybrid deep learning and HMM system.
4. EXPERIMENTS
4.1. The Setup
The experiments are evaluated on a speech corpus for an isolated digit recognition task. We use the TIDigits corpus to train acoustic models on clean conditions, which consists of 8623 utterances pronounced by 111 male and 114 female adult speakers. For test data, we use the reverberant version of TIDigits, that is the Meeting Recorder Digits (MRD) subset of the Aurora-5 corpus [29]. The corpus comprises real recordings in hands-free mode in a meeting room. The speech data is collected from 24 speakers at the International Computer Science Institute in Berkeley, resulting in 2400 utterances for each microphone. The recording is performed using four microphones (labeled as 6, 7, E, and F) which are placed at the middle of the table. The recordings thus contain reverberant acoustic conditions from the effect of hands-free recording in a meeting room. The performance is measured using word error rate (WER).
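WER is the edit distance between the recognized word sequence and the reference, normalized by the reference length. A minimal implementation (the example sentences are illustrative assumptions):

```python
def wer(ref, hyp):
    """Word error rate (%): Levenshtein distance between reference and
    hypothesis word sequences, divided by the reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))
    return 100.0 * d[len(r)][len(h)] / len(r)

# one substitution ("too") and one deletion ("four") out of 4 words
print(wer("one two three four", "one too three"))   # 50.0
```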
4.2. Results and Discussion
Table 1 shows the evaluation of the proposed methods (PROPOSED1 and PROPOSED2). For comparison, the performance of BASELINE1 and BASELINE2 is shown as well. The table clearly indicates a consistent reduction of word error rates in reverberant conditions. PROPOSED1 achieves 37.67% and 27.78% relative improvements over BASELINE1 and BASELINE2 respectively, while PROPOSED2 achieves 38.69% and 28.94% relative improvements over BASELINE1 and BASELINE2 respectively. Applying MLLT after LDA (PROPOSED2) is slightly better than applying LDA alone.
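The quoted relative improvements follow directly from the Table 1 averages; this quick check reproduces them (the two BASELINE2 figures differ from the quoted 27.78% and 28.94% only in the final rounding digit):

```python
# Average WER (%) from Table 1
baseline1, baseline2 = 49.00, 42.28
proposed1, proposed2 = 30.54, 30.04

# relative error reduction of `new` with respect to `base`
rel = lambda base, new: 100.0 * (base - new) / base

print(round(rel(baseline1, proposed1), 2))   # 37.67
print(round(rel(baseline2, proposed1), 2))   # 27.77
print(round(rel(baseline1, proposed2), 2))   # 38.69
print(round(rel(baseline2, proposed2), 2))   # 28.95
```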
This might be analysed as follows. When reverberation exists, the resulting reverberant speech is a sum of the signal and delayed versions of the same signal. Therefore reverberant speech contains information from previous frames. This will increase the correlations between neighboring frames, and hence may increase the local correlations between nearby frames. In the CNN architecture, the properties of locality, convolution, and pooling may be beneficial in such conditions.
Since the emphasis is on local neurons first, the CNN can learn from the local information and produce good features based on the clean part of speech from the early frames (since they are relatively clean compared to the late part of speech). When speech is corrupted by reverberation, it may cause some frequency shift (delay in the time-frequency domains). These delays are difficult to handle within other models such as GMMs and DNNs, where many Gaussians and hidden units need to be optimized for all possible pattern shifts [27].
With the pooling properties in CNN, the same feature value calculated from different locations is collected together and represented by a single value, which may come from the cleaner part of speech. Therefore, the differences in features extracted by pooling may minimize the effect of the delay caused by reverberation.
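This shift-absorbing effect of max-pooling can be shown in miniature (a toy illustration of the argument, with an assumed one-bin delay standing in for the reverberation-induced shift):

```python
import numpy as np

def max_pool(x, size=3):
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

x = np.zeros(12)
x[4] = 1.0                    # a spectral peak...
x_shift = np.roll(x, 1)       # ...delayed by one bin

# The raw features differ, but as long as the shift stays within a
# pooling region, max-pooling maps both inputs to the same output:
print(np.array_equal(max_pool(x), max_pool(x_shift)))   # True
```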
LDA finds the features with the largest variances and the most separated means between the classes. So, when it is applied to the features, it is very likely to choose the most distinguishing spectra (the dominant spectra), which may contain the phoneme information. When the CNN is applied, due to the max-pooling, this information is maintained up to the top layers, producing more discriminative features and hence improving the performance.
Table 1. WER (%) of the Proposed Method in Comparison with the Baselines

Model      | Clean | MRD 6 | MRD 7 | MRD E | MRD F | Average
BASELINE1  |  0.64 | 46.66 | 54.56 | 50.20 | 44.59 |  49.00
BASELINE2  |  0.80 | 39.93 | 47.99 | 42.69 | 38.51 |  42.28
PROPOSED1  |  0.90 | 28.10 | 33.90 | 33.03 | 27.14 |  30.54
PROPOSED2  |  0.82 | 27.73 | 33.20 | 32.62 | 26.61 |  30.04
5. CONCLUSION
In this study, we evaluated the use of LDA and MLLT in CNN-based speech recognition to improve the robustness of speech recognition against reverberation. Our experiments confirm that our proposed method is more robust than standard DNN-HMM and HMM-GMM systems. The properties of weight sharing, pooling, and locality of CNN could improve the recognition accuracy on all transformed features compared to the standard fully-connected DNN.

We need to state that the evaluated tasks are digit recognition tasks. Therefore, the long-term dependency that exists in speech may not be as significant as in continuous speech, and it is interesting to see how each architecture fares for continuous tasks. Since the reverberation time is also heavily influenced by the size of the room, it is also interesting to see how deep architectures perform in different room settings. This is our future plan.
REFERENCES
[1] M. L. Seltzer, D. Yu, and Y. Wang, "An investigation of deep neural networks for noise robust speech recognition," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7398–7402.
[2] T. Yoshioka and M. J. Gales, "Environmentally robust ASR front-end for deep neural network acoustic models," Computer Speech & Language, vol. 31, no. 1, pp. 65–86, 2015.
[3] R. Errattahi and A. El Hannani, "Recent advances in LVCSR: A benchmark comparison of performances," International Journal of Electrical and Computer Engineering (IJECE), vol. 7, no. 6, pp. 3358–3368, 2017.
[4] M. F. Alghifari, T. S. Gunawan, and M. Kartiwi, "Speech emotion recognition using deep feedforward neural network," Indonesian Journal of Electrical Engineering and Computer Science, vol. 10, no. 2, pp. 554–561, 2018.
[5]
K.
F.
Akingbad
e,
O.
M.
Um
an
na,
and
I.
A.
Alimi,
“
Voice
-
b
ase
d
door
acce
s
s
c
ontrol
s
y
st
e
m
using
the
m
el
fre
quency
ce
pstr
um
coe
ffic
i
ent
s
and
gaussian
m
ixt
ure
m
odel
,
”
In
te
rnational
Journal
of
Elec
tri
cal
and
Computer
Engi
ne
ering
,
vol
.
4
,
no
.
5
,
p
.
643
,
2014.
[6]
S.
N.
Enda
h
,
S.
Adh
y
,
and
S.
Sutikno,
“
Com
par
i
son
of
fea
ture
e
xtra
c
ti
on
m
el
fr
eque
nc
y
ce
pstr
al
coe
ffi
cients
and
li
ne
ar
pre
dictive
codi
ng
in
aut
o
m
at
ic
spee
ch
re
cogni
ti
on
for
in
donesia
n,
”
TEL
KOMNIKA
(Tele
communic
at
ion
Computing
E
le
c
t
ronics
and
Cont
rol)
,
vol.
15
,
no
.
1,
pp
.
292
–
298
,
2017.
[7]
M.
L.
Seltzer
,
D.
Yu,
and
Y.
W
ang,
“
An
in
ves
ti
gation
of
dee
p
n
eur
al
netw
orks
for
noise
robust
spee
ch
rec
ogni
ti
on,
”
in
2013
IEE
E
In
te
r
nati
onal
Conf
ere
nce
on
Ac
oust
i
cs,
Speech
and
Signal
Proc
essing
,
Ma
y
2013,
pp
.
7398
–
7402.
[8]
P.
Loc
kwood
an
d
J.
Boud
y
,
“
Ex
per
iments
with
a
nonli
ne
ar
spec
tr
al
subtra
ct
or
(n
s
s),
hidde
n
m
ark
o
v
m
odel
s
and
th
e
proje
c
ti
on,
for
ro
bust
spee
ch
re
co
gnit
ion in car
s,
”
Spee
ch
Comm
unic
ati
on
,
vo
l. 11,
no.
2
,
pp
.
215
–
228,
1992
.
[9]
C.
Kim
and
R.
M.
Stern,
“
Pow
er
-
norm
al
iz
ed ce
p
stral
co
eff
icient
s
(pnc
c)
for
robus
t
spee
ch
r
ec
ogni
t
ion,
”
I
EE
E
/A
CM
Tr
ans.
Audi
o,
Sp
ee
ch
and
Lang.
Proc
.
,
vol
.
24
,
n
o.
7
,
pp
.
1315
–
1
329,
Jul.
2016.
[10]
M.
J.
F.
Gale
s
and
S.
J.
You
ng,
“
Robust
cont
inuous
spee
ch
rec
ognition
using
par
al
l
el
m
odel
combination,
”
IEE
ETrans
act
io
ns on
Speech
an
d
Audi
o
Proc
essing
,
vo
l. 4, no. 5, pp. 3
52
–
359,
Se
p
1996.
[11]
C.
W
eng,
D.
Yu,
S.
W
at
ana
be,
an
d
B.
H.
F.
Juang,
“
Rec
urre
nt
dee
p
neur
al
net
works
for
robust
spee
ch
rec
ognition,”
in
2014
I
EEE
Int
ernati
onal
Confer
enc
e
on
A
coustics,
Spe
ec
h
and
S
ignal
Proce
ss
ing
,
Ma
y
2014,
pp.
5532
–
5536.
[12]
Y.
Zha
ng
,
W
.
C
han,
and
N.
Jaitl
y
,
“
Ver
y
de
ep
c
onvolut
ional
n
etw
orks
for
end
-
to
-
end
spee
ch
r
ec
o
gnit
ion,”
in
201
7
IEE
E
Inte
rnat
io
nal
Conf
ere
nce
on
Ac
oust
ic
s,
Sp
ee
ch
and
S
ignal
Proce
ss
ing
,
M
ar
ch
2017,
pp.
484
5
–
4849.
[13]
S.
La
wrenc
e,
C
.
L.
Gi
le
s,
A.
C.
Tsoi,
and
A.
D.
Bac
k
,
“
Face
r
e
cogni
t
ion:
A
convol
uti
on
al
neur
al
-
n
et
wor
k
appr
oac
h
,
”
IE
EE t
rans
act
ions o
n
neural
ne
tworks
,
vol. 8, no. 1, pp. 98
–
113,
1997.
[14]
P.
Sw
ie
tojanski,
A.
Ghos
hal
,
and
S.
Ren
al
s,
“
Convolut
ional
n
eur
a
l
ne
tworks
for
di
stant
spe
ec
h
recognit
ion,”
I
EEE
Signal
Proce
ss
in
g
Letters
,
vol
.
21
,
no
.
9
,
pp
.
1120
–
1124,
2014
.
[15]
T.
N.
Sa
ina
th
,
A.
-
r.
Moham
ed,
B.
Kingsbur
y
,
and
B.
Ramabh
adr
an,
“
Dee
p
c
onvolut
ional
ne
ura
l
ne
tworks
for
lvc
sr,” in
2013
I
EE
E
Inte
rnat
ion
al
Conf
ere
nce o
n
Ac
oust
ic
s,
Spe
ec
h
and
Signa
l
P
roce
ss
ing
,
2013
,
pp.
8614
–
8
618.
[16]
T.
N.
Sainath,
B.
Kingsbur
y
,
A.
-
r.
Moham
ed,
G.
E
.
Dahl
,
G.
Saon,
H.
Sol
tau,
T
.
Ber
an,
A.
Y.
Aravki
n,
and
B
.
Ramabhadr
an
,
“
Im
prove
m
ent
s
to
dee
p
convol
u
ti
onal
neur
al
n
e
tworks
for
lvc
sr
,
”
in
2013
I
EEE
Workshop
on
Aut
omatic Speec
h
Recogni
t
ion
a
nd
Unders
tandi
n
g
(
ASR
U)
,
2013,
pp.
315
–
320.
[17]
O.
Abdel
-
Ham
i
d,
A.
-
r
.
Moham
ed,
H.
Jiang
,
an
d
G.
Penn,
“
Appl
y
ing
convo
lut
i
onal
n
eur
al
netw
orks
conc
ept
s
t
o
h
y
brid
nn
-
hm
m
m
odel
for
spee
c
h
rec
ogni
ti
on,
”
i
n
2012
IEE
E
Int
ernati
onal
Conf
ere
nce
on
Ac
ous
ti
cs,
Sp
ee
ch
an
d
S
ignal
Proce
ss
in
g
,
2012
,
pp
.
427
7
–
4280.
[18]
Y.
Bengi
o,
A.
Courvil
le,
and
P.
Vince
nt
,
“
Repres
ent
at
ion
l
ea
rn
in
g:
A
rev
ie
w
and
new
per
spec
ti
v
es,
”
IEEE
trans
-
act
ions o
n
pattern anal
ysis
and
m
achi
ne
intelligen
ce
,
vol
.
35
,
no
.
8
,
pp
.
1798
–
1828
,
2013.
[19]
R.
Hae
b
-
Um
bach
and
H.
Ne
y
,
“
Li
nea
r
discr
iminant
an
aly
s
is
for
improved
la
rg
e
voca
bu
la
r
y
co
nti
nuous
spee
c
h
rec
ogni
ti
on,
”
in
1992
IEEE
In
te
r
nati
onal
Con
fe
r
enc
e
on
Ac
ousti
cs,
Spe
ec
h
and
Signal
Proce
ss
i
ng,
1992
,
pp
.
13
–
16.
[20]
L.
Burget,
“
Com
bina
ti
on
of
spee
ch
fe
at
ur
es
using
sm
oo
the
d
het
ero
sc
eda
sti
c
li
ne
ar
discri
m
in
ant
anal
y
s
is.”
i
n
Inte
rs
pee
ch
,
200
4.
[21] R. A. Gopinath, "Maximum likelihood modeling with Gaussian distributions for classification," in 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, 1998, pp. 661–664.
[22] D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, and G. Zweig, "fMPE: Discriminatively trained features for speech recognition," in 2005 IEEE International Conference on Acoustics, Speech and Signal Processing, 2005, pp. I–961.
[23] T. Yoshioka, A. Ragni, and M. J. Gales, "Investigation of unsupervised adaptation of DNN acoustic models with filter bank input," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 6344–6348.
[24] S. Geirhofer, "Feature reduction with linear discriminant analysis and its performance on phoneme recognition," Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 2004.
[25] M. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, vol. 12, no. 2, pp. 75–98, 1998.
[26] M. J. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, pp. 272–281, 1999.
[27]
A.
-
H.
Os
sam
a,
M
.
Abdel
-
rah
m
an,
J.
Hui
,
D.
Li
,
P.
Ger
al
d
,
a
nd
Y.
Dong,
“
Convolut
ional
n
eur
al
n
et
works
for
spee
ch
r
ec
ogn
it
i
on,
”
I
EEE
Signa
l
Proc
essing
Ma
gazine
,
vo
l. 22,
no.
10
,
2014
.
[28]
K.
Vesel
`
y
,
A.
Ghos
hal
,
L
.
Bur
get
,
and
D.
Pove
y
,
“
Sequence
-
d
iscri
m
ina
ti
v
e
training
of
d
ee
p
n
e
ura
l
n
et
works
.
”
in
INTERSP
EE
CH
,
2013,
pp.
2345
–
2349.
[29] H. Hirsch, "Aurora-5 experimental framework for the performance evaluation of speech recognition in case of a hands-free speech input in noisy environments," Niederrhein Univ. of Applied Sciences, 2007.
BIOGRAPHIES OF AUTHORS
Hilman Pardede is a researcher at the Research Center for Informatics, Indonesian Institute of Sciences. He obtained his Bachelor Degree in Electrical Engineering from the University of Indonesia in 2004 and his Master of Engineering from the University of Western Australia in 2009. He received his Doctor of Engineering from Tokyo Institute of Technology in 2013 and was a postdoctoral researcher at Fondazione Bruno Kessler in Trento, Italy, from 2013 to 2015. His research interests include speech recognition, pattern recognition, signal processing, machine learning, and artificial intelligence. He is an IEEE member and a reviewer for Speech Communications (Elsevier) and the International Journal of Machine Learning and Cybernetics (Springer). He has also served as a reviewer for several international conferences.
Asri Rizki Yuliani is a researcher at the Research Center for Informatics, Indonesian Institute of Sciences. She earned her bachelor degree in Computer Science from Universiti Teknologi Malaysia in 2009 and her master degree in Information Management from Yuan Ze University in 2013. Her research interests include speech recognition, pattern recognition, and machine learning.
Rika Sustika is a researcher at the Research Center for Informatics, Indonesian Institute of Sciences (LIPI). She earned her bachelor and master degrees in Electrical Engineering from Bandung Institute of Technology (ITB). Her research interests are in the area of signal processing. Since January 2017 she has been a member of the machine learning research group, and she is interested in applying deep learning to many applications such as speech recognition, image recognition, and natural language processing.