International Journal of Electrical and Computer Engineering (IJECE)
Vol. 7, No. 6, December 2017, pp. 3358–3368
ISSN: 2088-8708, DOI: 10.11591/ijece.v7i6.pp3358-3368
Recent advances in LVCSR: A benchmark comparison of performances

Rahhal Errattahi and Asmaa El Hannani
Laboratory of Information Technology, National School of Applied Sciences, University of Chouaib Doukkali, EL Jadida, Morocco
Article Info

Article history:
Received: Nov 26, 2016
Revised: Jun 28, 2017
Accepted: Jul 10, 2017

Keywords:
Large Vocabulary Continuous Speech Recognition
Automatic Speech Recognition
Word Error Rates
Deep Neural Networks
Hidden Markov Models
Gaussian Mixture Models
ABSTRACT

Large Vocabulary Continuous Speech Recognition (LVCSR), which is characterized by a high variability of the speech, is the most challenging task in automatic speech recognition (ASR). Believing that the evaluation of ASR systems on relevant and common speech corpora is one of the key factors that help accelerate research, we present in this paper a benchmark comparison of the performances of current state-of-the-art LVCSR systems over different speech recognition tasks. Furthermore, we objectively put into evidence the best performing technologies and the best accuracy achieved so far in each task. The benchmarks have shown that Deep Neural Networks and Convolutional Neural Networks have proven their efficiency on several LVCSR tasks by outperforming the traditional Hidden Markov Models and Gaussian Mixture Models. They have also shown that, despite the satisfying performances in some LVCSR tasks, the problem of large-vocabulary speech recognition is far from being solved in some others, where more research efforts are still needed.

Copyright © 2017 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Rahhal Errattahi
Laboratory of Information Technology, National School of Applied Sciences, University of Chouaib Doukkali, EL Jadida, Morocco
errattahi.r@ucd.ac.ma
1. INTRODUCTION

Speech is a natural and fundamental communication vehicle which can be considered one of the most appropriate media for human-machine interaction. The aim of Automatic Speech Recognition (ASR) systems is to convert a speech signal into a sequence of words, either for text-based communication purposes or for device control. ASR is typically used when the keyboard becomes inconvenient, for example when our hands are busy or have limited mobility, when we are using the phone, when we are in the dark, or when we are moving around. ASR finds application in many different areas: dictation, meeting and lecture transcription, speech translation, voice search, phone-based services and others. Those systems are, in general, extremely dependent on the data used for training the models, the configuration of the front-ends, etc. Hence a large part of system development usually involves investigating appropriate configurations for a new domain, new training data, and a new language.

There are several tasks of speech recognition, and the difference between these tasks rests mainly on: (i) the speech type (isolated or continuous speech), (ii) the speaker mode (speaker dependent or independent), (iii) the vocabulary size (small, medium or large) and (iv) the speaking style (read or spontaneous speech).
Even though ASR has matured to the point of commercial applications, Speaker Independent Large Vocabulary Continuous Speech Recognition tasks (commonly designated as LVCSR) pose a particular challenge to ASR technology developers. Three of the major problems that arise when LVCSR systems are being developed are the following. First, speaker independent systems require a large amount of training data in order to cover speaker variability. Second, continuous speech recognition is very complex because of the difficulty of locating word boundaries and the high degree of pronunciation variation due to dialects, coarticulation and noise, unlike isolated word speech recognition where the system operates on single words at a time [1, 2, 3]. Finally, with large vocabularies, it becomes increasingly harder to find sufficient data to train the acoustic models and even the language models. Thus, subword models are usually used instead of word models, which negatively affects the performance of the recognition. Moreover, LVCSR tasks themselves vary in difficulty; for example, the read speech task (human-to-machine speech, e.g. dictation) is much easier than the spontaneous speech task (human-to-human speech, e.g. telephone conversation).
To deal with all these problems, a plethora of algorithms and technologies has been proposed by the scientific community over the last decade for all steps of LVCSR: pre-processing, feature extraction, acoustic modeling, language modeling, decoding and result post-processing. Many papers were dedicated to presenting an overview of the advances in LVCSR [4, 5, 6, 7, 8]. However, the scope of these works focuses primarily on the systems' architecture, the techniques used and the key issues. There is to date no work which has attempted to report and analyze the current advances in terms of performances across all the major tasks of LVCSR. Our ambition has been to fill this gap. We did a benchmark comparison of the performances of the current state-of-the-art LVCSR systems and have covered different tasks: read continuous speech recognition, mobile voice search, conversational telephone speech recognition, broadcast news speech recognition, video speech and distant conversational speech recognition. We tried to objectively put into evidence the best performing technologies and the best accuracy achieved so far in each task. Note that in this paper we only address the English language and that we have constrained the review to systems that have been evaluated on widely used speech corpora. This choice was forced by the lack of publications on some other corpora, and also because evaluating systems on relevant and common speech corpora is a key factor for measuring progress and discovering the remaining difficulties, especially when comparing systems produced by different labs.
Table 1. Overview of the selected LVCSR corpora

Corpus | Year | Type of data | Size | Audio source
Wall Street Journal I, [9] | 1991 | Read excerpts from the Wall Street Journal | 80 hours, 123 speakers | Close-talking microphone
Switchboard, [10] | 1993 | Phone conversations between strangers on an assigned topic | 250 hours, 543 speakers, 2400 conversations | Variable telephone handsets
CallHome, [11] | 1997 | Phone conversations between family members or close friends | 120 conversations, up to 30 min each | Variable telephone handsets
Broadcast News, [12, 13] | 1996/1997 | Television and radio broadcast | LDC97S44: 104 hours; LDC98S71: 97 hours | Head-mounted microphone
AMI, [14] | 2007 | Scenario and non-scenario meetings from various groups | 100 hours | Close-talking and far-field microphones
Bing Mobile Data, [15] | 2010 | Mobile voice queries | 21400 hours | Variable mobile phones
Google Voice Search Data | - | Mobile Voice Search and Android Voice Input | 5780 hours | -
Youtube Video | - | Video from Youtube | 1400 hours | -
2. COMPARING STATE-OF-THE-ART LVCSR SYSTEMS PERFORMANCES

The industry and research community can benefit greatly when different systems are evaluated on a common ground, and particularly on the same speech corpora. In this perspective, we report the recent progress in the area of LVCSR on a selection of the most popular English speech corpora, with vocabularies ranging from 5K to more than 65K words and content ranging from read speech to spontaneous conversations. Only corpora with recent publications were considered. An overview of the properties of the chosen sets is given in Table 1. In the following subsections, we will shortly introduce each task and the datasets used, and report the performances of systems produced by different labs.
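Throughout this comparison, performance is reported as the Word Error Rate (WER): the word-level edit distance (substitutions, deletions and insertions) between the system hypothesis and the reference transcript, divided by the number of reference words. As a purely illustrative aid that is not part of the reviewed work, the following minimal Python sketch computes WER with the standard dynamic-programming recurrence; the function name and interface are our own.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution and one deletion over four reference words -> WER = 0.5
print(wer("the cat sat down", "the hat sat"))
```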
2.1. Read continuous speech recognition task

Early speech recognition systems were often designed for read speech transcription tasks (dictation). As implied by the name, the data used in this domain consists of read sentences, generally in a speaker-independent mode. Its popularity arose because the lexical and syntactic content of the data can be controlled and it is significantly less expensive to collect than spontaneous speech. The primary applications in this domain include the dictation of notes and the transcription of important information by some professionals (e.g. medical, military and law) and by persons with learning disabilities (e.g. dyslexia and dysgraphia), limited motor skills or vision impairment.

The Wall Street Journal corpus I (also known as CSR-I or WSJ0) [9] is known as a reference corpus in the field. It is an American English read speech corpus whose texts are taken from a machine-readable version of the Wall Street Journal news, and the speech was recorded under clean conditions. The systems presented here were evaluated on the November 1992 ARPA CSR (Nov-92) benchmark test set, a 5K-word closed-vocabulary subset derived from the WSJ0 corpus which consists of 330 utterances from 8 speakers.
Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) have been used extensively since the beginning of research in the area of speech recognition. More than 40 years later, they still predominate, and they are usually used as a baseline when it comes to comparing systems with different acoustic models. Beside this, several techniques were developed around the HMM/GMM framework in order to improve the performance of ASR systems.
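In such systems, each HMM state models the emission probability of an acoustic feature vector with a Gaussian mixture. As a purely illustrative sketch that is not taken from any of the reviewed systems, the following Python/NumPy snippet evaluates the log-likelihood of one feature frame under a single state's diagonal-covariance mixture; the dimensions and variable names are assumptions.

```python
import numpy as np

def gmm_state_loglik(x, weights, means, variances):
    """Log-likelihood of one feature frame x under a diagonal-covariance GMM.

    x:         (D,)   acoustic feature vector (e.g. MFCCs plus deltas)
    weights:   (M,)   mixture weights summing to 1
    means:     (M, D) component means
    variances: (M, D) component diagonal variances
    """
    D = x.shape[0]
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp
    m = np.max(log_comp)                 # log-sum-exp over the M components
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Toy usage: a 2-component mixture over 3-dimensional features
x = np.array([0.1, -0.3, 0.2])
print(gmm_state_loglik(x, np.array([0.6, 0.4]), np.zeros((2, 3)), np.ones((2, 3))))
```

Decoding then searches for the word sequence whose HMM states give the highest combined acoustic and language model score.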
Table 2. Word Error Rates (WER) in % on the Nov-92 subset of the WSJ0 corpus using bigram and trigram language models

Acoustic model / Features | Bigram | Trigram
MLP-HMM, [16] | 8.5 | 6.5
RC-HMM, [17] | 6.2 | 3.9
GMM-HMM (ML), [17] | 6.0 | 3.8
GMM-HMM (MMI+VTLN), [16] | - | 3.0
DNN-HMM (STC features), [18] | 5.2 | -
Table 2 shows a recapitulation of the key performances of some state-of-the-art systems in the field. Two of the systems in the list are based on GMM-HMM acoustic models: the first was trained using the maximum-likelihood (ML) criterion [17], while the second used the maximum mutual information (MMI) criterion with vocal tract length normalization (VTLN) [16]. Triefenbach et al. [17] also proposed a Reservoir Computing (RC) HMM hybrid system for phoneme recognition using a bigram phonotactic utterance model. The RC-HMM performs significantly better than the MLP-HMM hybrids proposed by Gemello et al. [19]. However, it is still outperformed by the GMM system with VTLN. Another study [18] demonstrated the effectiveness of Deep Neural Networks (DNNs) in speech recognition. The best result of this study belongs to a DNN system with 5 hidden layers, where each hidden layer has 2048 nodes. In terms of complexity, both the DNN-HMM and the RC-HMM incorporate a massive number of parameters in the training stage. On the other side, the GMM-HMM is much more efficient, as it reaches good performances with a small number of parameters. The results obtained with the bigram language model show that the DNN-HMM acoustic model presents the best performance on the WSJ0 task; this performance could be enhanced even further using a trigram language model. Generally, using a trigram language model was crucial and clearly superior to using bigram models over the Nov-92 test set in various studies.
2.2. Voice search speech recognition task

Voice search is the technology allowing users to use their voice to access information. The advent of smartphones and other small, Web-enabled mobile devices in recent years has spurred more interest in voice search, especially in usage scenarios where our hands are busy or have limited mobility, when we are using the phone, when we are in the dark, or when we are moving around. There is a plethora of mobile applications which allow users to give speech commands to a mobile device, either for a search purpose (e.g. web search, maps, directions, travel resources such as airlines, hotels, etc.) or for question-answering assistance. Mobile voice search speech recognition is considered one of the most challenging tasks in the field of speech recognition due to many factors: the utterances tend to be very short, yet unconstrained and open-domain. Hence, the vocabularies are unlimited with unpredictable input, and there is a high degree of acoustic variability caused by noise, side-speech, accents, sloppy pronunciation, hesitation, repetition, interruptions, and mobile phone differences. In this section we report results on two voice search applications that have been built in recent years: Google Voice Input and Bing mobile voice search.
2.2.1. Google Voice Search

Google Voice Search transcribes speech input used for user interaction on mobile devices, such as voice search queries, short messages and emails. The Google Voice Search system was trained using approximately 5780 hours of data from mobile Voice Search and Android Voice Input.
Table 3. WER in % on a test set from the Google live Voice Input dataset

Acoustic model | WER
GMM-HMM, [20] | 16.0
DNN(DBN)-HMM, [20] | 12.3
+ MMI discriminative training, [20] | 12.2
DNN + GMM (combination), [20] | 11.8
The baseline GMM-HMM system [20] created by Google's group consisted of triphone HMMs with decision-tree clustered states, and used PLP features that were transformed by Linear Discriminant Analysis (LDA). The GMM-HMM model was trained discriminatively using the Boosted-MMI criterion. All these parameters generate a context-dependent model with a total of 7969 states. The same data were used to train a deep belief network (DBN) based DNN acoustic model to predict the 7969 HMM states. The DBN-DNN used in [20] was composed of four hidden layers with 2560 nodes per hidden layer, a final layer with 7969 states, and an input of 11 contiguous frames of 40 log filter-bank features modelled with a DBN.
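To make the hybrid setup concrete, the sketch below (our own illustration, not Google's code) builds a feed-forward network with the shapes described above: 11 stacked frames of 40 log filter-bank coefficients at the input, four hidden layers of 2560 units, and a softmax over 7969 tied HMM states. In hybrid decoding, such state posteriors are typically divided by the state priors to obtain scaled likelihoods for the HMM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes follow the description above; the weights here are random placeholders.
sizes = [11 * 40, 2560, 2560, 2560, 2560, 7969]
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def state_posteriors(stacked_frames):
    """Forward pass: 11 stacked log filter-bank frames -> posteriors over 7969 HMM states."""
    h = stacked_frames
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)   # hidden layers (ReLU here for simplicity)
    logits = h @ weights[-1] + biases[-1]
    logits -= logits.max()               # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Hybrid decoding uses scaled likelihoods: p(state | x) / p(state)
x = rng.normal(size=11 * 40)             # one stacked input (placeholder features)
priors = np.full(7969, 1.0 / 7969)       # in practice, state priors come from the alignments
scaled_loglik = np.log(state_posteriors(x) + 1e-12) - np.log(priors)
print(scaled_loglik.shape)               # (7969,)
```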
As shown in Table 3, the DNN(DBN)-HMM system achieved a 23% relative reduction over the baseline. A further improvement resulted from combining both systems using the segmental conditional random field (SCARF) framework; this combination gave a word error rate of 11.8%.
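For reference, the relative reduction quoted above is the WER difference normalized by the baseline WER; with the Table 3 figures,

$$\frac{\mathrm{WER}_{\text{GMM-HMM}} - \mathrm{WER}_{\text{DNN-HMM}}}{\mathrm{WER}_{\text{GMM-HMM}}} = \frac{16.0 - 12.3}{16.0} \approx 23\%.$$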
2.2.2. Bing mobile voice search

Bing mobile voice search, known as Live Search for Mobile (LS4M) [15], is a mobile application that allows users to do web-based searches (e.g. maps, directions, traffic, and movies) from their mobile phones. LS4M was developed by Microsoft and was trained on a data set of around 24 hours with a high degree of acoustic variability.
Table 4. WER in % on the Bing Voice Search set

Acoustic model | WER
DNN-HMM (No PT), [21] | 37.1
DNN-HMM (with PT), [21] | 35.4
CNN-HMM (No PT), [21] | 34.2
CNN-HMM (with PT), [21] | 33.4
RNNLM, [22] | 23.2
Abdel-Hamid et al. [21] at Microsoft Research investigated the performance of both DNNs and CNNs on an LVCSR task, as well as the effect of RBM-based pretraining on their recognition performance. Both models were trained using a subset of 18 hours from the Bing Voice Search task in order to predict the triphone HMM state labels. The DNN architecture consisted of three hidden layers, while the CNN had one pair of convolution and pooling plies in addition to two fully connected hidden layers. The results in Table 4 show that the CNN outperforms the DNN on the Bing Voice Search task, providing about 8% relative error reduction without pretraining and a 6% relative word error rate reduction when pretraining is used. According to Abdel-Hamid et al. [21], pretraining is more effective for the CNN than for the DNN.
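The convolution-plus-pooling ply in such CNN acoustic models is typically applied along the frequency axis of log filter-bank features, so that small spectral shifts (for example from speaker or channel differences) are absorbed by max-pooling. The snippet below is our own minimal NumPy illustration of that idea; the filter sizes and dimensions are arbitrary assumptions, not the configuration used in [21].

```python
import numpy as np

rng = np.random.default_rng(1)

n_bands, n_filters, filt_width, pool_size = 40, 8, 8, 3
filters = rng.normal(0, 0.1, (n_filters, filt_width))  # 1-D filters along frequency
frame = rng.normal(size=n_bands)                        # one frame of 40 log filter-bank values

def conv_pool(frame):
    """Convolve along the frequency axis, then max-pool, as in a single CNN ply."""
    n_pos = n_bands - filt_width + 1
    conv = np.empty((n_filters, n_pos))
    for f in range(n_filters):
        for i in range(n_pos):
            conv[f, i] = max(0.0, filters[f] @ frame[i:i + filt_width])  # ReLU
    # Non-overlapping max-pooling over frequency positions
    n_pool = n_pos // pool_size
    pooled = conv[:, :n_pool * pool_size].reshape(n_filters, n_pool, pool_size).max(axis=2)
    return pooled.flatten()              # fed to the fully connected layers on top

print(conv_pool(frame).shape)            # (n_filters * n_pool,) = (88,)
```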
Another study [22] suggests applying Recurrent Neural Network Language Models (RNNLMs) directly in the first pass of speech recognition decoding, which outperforms the DNN and CNN systems (both relying on simple n-gram language models) on the Bing Voice Search task, with a word error rate of 23.2%. However, the computational expense of RNNLMs is very high; to reduce the cost of using RNNLMs, the authors propose cache-based RNN inference, which drops the runtime from 100xRT (no caching) to just under 1.2xRT, where the real-time factor (RT) is the processing time divided by the audio duration. Though the experimental setup was not described in sufficient detail in either paper, we can only assume that the 10% absolute WER improvement of the RNNLM over the CNN systems could be due to differences in the amount of training data or to differences in the Bing Voice Search subset used to evaluate the systems.
2.3. Conversational telephone speech recognition task

Owing to the revolution in the telecommunication domain, people all over the world spend millions of hours communicating via their phones. For many reasons, such as security, the transcription of spontaneous casual speech, and particularly of conversational telephone speech, becomes indispensable. Transcribing this type of speech is, however, very challenging due to many factors, including poor articulation, increased coarticulation, highly variable speaking rate, and various types of disfluency such as hesitations, false starts, and corrections.

We report pertinent results on a highly challenging test set, the NIST 2000 Hub5 (Hub5'00). The Hub5 is composed of two subsets: an "easy" split which contains 20 conversations from the Switchboard corpus [10, 23], and a "hard" split containing 20 conversations from the CallHome corpus [11]; results are often reported on the easier portion alone. Switchboard is a corpus of American English spontaneous conversational telephone speech, composed of about 2,400 two-sided telephone conversations between 543 speakers (302 male, 241 female) from all areas of the United States. The CallHome corpus consists of 120 unscripted telephone conversations between native speakers of English, mostly between family members or close friends overseas. The CallHome data is harder to recognize compared to Switchboard, partly due to a greater presence of foreign-accented speech.

In the last two years, several labs have conducted benchmarking experiments using the Switchboard corpus [24, 25, 26, 27, 28, 29]. In Table 5, we summarize the best performing systems on the Hub5'00 dataset splits. All systems have been trained on the 300-hour Switchboard dataset, except the Deep Speech system from [26] which has been trained on both the Switchboard and Fisher datasets. The Fisher corpus [30] offers 2000 hours of conversational telephone speech collected in a similar manner to Switchboard.
Table 5. WER in % on the Switchboard subset "SWB" of the Hub5'00 dataset

Acoustic Models | SWB
GMM/HMM fBMMI, [31] | 14.5
DNN-HMM-sMBR fMLLR, [24] | 12.6
RNN (Deep Speech), [26] | 12.6
DNN fMLLR, [31] | 12.2
CNN log-mel, [31] | 11.8
CNN+DNN log-mel+fMLLR+I-vector, [31] | 10.7
MLP/CNN+I-Vector, [28] | 10.4
The GMM system in [31] was trained using speaker adaptation with VTLN and feature-space Maximum Likelihood Linear Regression (fMLLR), followed by feature- and model-space discriminative training with the Boosted Maximum Mutual Information (BMMI) criterion. The DNN-HMM sMBR system from [24] was trained on LDA+STC+fMLLR features on the full 300-hour training set, and was composed of 7 layers, where each hidden layer has 2048 neurons, with an output layer of 8859 units. Hannun et al. [26] propose an RNN-based system called Deep Speech that uses deep learning to learn from large datasets (more than 7380 hours). In [26], the authors used multi-GPU computation for training the RNN model, and a combination of collected and synthesized data, which makes the system able to learn robustness to realistic noise and speaker variation. The DNN system in [31] was trained using fMLLR features and was composed of 6 hidden layers each containing 2048 sigmoidal neurons, and a softmax layer with 8192 output units. The CNN in [31] was trained using log-mel features, with an architecture consisting of two convolutional layers each containing 512 hidden units, five fully connected layers each containing 2048 hidden units, and a softmax layer with 8192 output units.
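The log-mel features used by these CNN systems are obtained by passing a short-time power spectrum through triangular filters spaced on the mel scale and taking the logarithm. The following NumPy sketch is a generic, simplified illustration of that front-end for a single pre-windowed frame; it is not the exact feature pipeline of [31], and the frame length, FFT size and filter count are assumptions.

```python
import numpy as np

def log_mel_frame(frame, sample_rate=16000, n_fft=512, n_mels=40):
    """Log-mel filter-bank energies for one pre-windowed audio frame."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2       # power spectrum, n_fft//2 + 1 bins
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)   # Hz -> mel
    hz = lambda m: 700.0 * (10 ** (m / 2595.0) - 1.0)    # mel -> Hz
    # Filter edges equally spaced on the mel scale, mapped to FFT bin indices
    edges = hz(np.linspace(mel(0.0), mel(sample_rate / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)    # rising slope
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)  # falling slope
    return np.log(fbank @ power + 1e-10)                 # 40 log-mel energies

frame = np.hamming(400) * np.random.default_rng(2).normal(size=400)  # 25 ms at 16 kHz
print(log_mel_frame(frame).shape)                        # (40,)
```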
The results summarised in Table 5 show that the CNNs clearly outperform the other systems, giving a 20% relative word error rate improvement over the GMM/HMM system and a 3% relative improvement over the hybrid DNN. Another form of system combination has been proposed in [28]: a jointly trained MLP/CNN model with I-Vectors, where the outputs of the first MLP hidden layer are combined with the outputs of the second CNN layer. This system has given the best result on this task so far (10.4% WER on the SWB split of Hub5'00).
2.4. Broadcast news speech recognition task

Broadcast News Automatic Speech Recognition (ASR BN) consists of recognising speech from news-oriented content from either television or radio, including news, multi-speaker roundtable discussions, debates, and even open-air interviews outside of the studio. The English Broadcast News Speech Corpus [12, 13] is one of the most common datasets in this domain. It is a collection of radio and television news broadcasts (from the ABC, CNN and CSPAN television networks and the NPR and PRI radio networks) with corresponding transcripts. The acoustic models of the systems reviewed in this section were trained on 50 hours of data from the 1996 (LDC97S44, 104 hours) and 1997 (LDC98S71, 97 hours) English Broadcast News Speech Corpora. State-of-the-art system performances are reported on both the EARS Dev-04f (3 hours from 6 shows) and RT-04 (6 hours from 12 shows) sets.
Table 6. WER in % on the Dev-04f and RT-04 English Broadcast News sets

Acoustic model | Dev-04f | RT-04
Baseline GMM/HMM, [27] | 18.8 | 18.1
SAT MLP fBMMI, [32] | 21.9 | 23.6
SAT DBN fBMMI, [32] | 17.0 | 17.7
SAT GMM fMPE+MPE, [33] | 16.5 | 14.8
SAT DNN cross-entropy, [33] | 16.7 | 14.6
SAT DNN HF sMBR, [33] | 15.1 | 13.4
Hybrid DNN, [27] | 16.3 | 15.8
DNN-based Features, [27] | 16.7 | 16.0
Hybrid CNN, [27] | 15.8 | 15.0
CNN-based Features, [27] | 15.2 | 15.0
In [27], the baseline GMM-HMM system was trained using speaker-based mean normalization with VTLN and an LDA transform to project the 13-dimensional MFCCs to 40 dimensions. Next, fMLLR followed by a BMMI transform was applied to obtain a GMM system with 2220 quinphone states and 30k diagonal-covariance Gaussians.
Sainath et al. [32] compared the performance of Deep Belief Networks (DBNs) to simple Multi-Layer Perceptrons (MLPs), where both the DBN and the MLP were trained with the same architecture (6 layers x 1,024 units) using speaker-adapted, fBMMI features with 2220 output targets. The DBN contains two types of Restricted Boltzmann Machines (RBMs), which were used to pre-train the weights of the Artificial Neural Networks (ANNs). For the first layer of the DBN, the authors used a Gaussian-Bernoulli RBM trained for 50 epochs; this first layer takes 9 frames as input and produces 1,024 output features. All subsequent layers are Bernoulli-Bernoulli RBMs trained for 25 epochs, each containing 1,024 hidden units.
A further study by Sainath et al. [33] suggests different refinements to improve DNN training speed. The DNN systems use fMLLR features with 5,999 quinphone states, and are composed of six hidden layers each containing 1,024 sigmoidal units. In this study, the authors succeeded in reducing the number of parameters from 10.7M to 5.5M (a 49% reduction) thanks to low-rank factorization.
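Low-rank factorization replaces the large final weight matrix (hidden size times number of output states) with the product of two thin matrices, which is where most of the savings come from since the output layer accounts for a large share of the parameters. The short Python sketch below only counts parameters to illustrate the effect; the rank value of 128 is our own example, not the setting used in [33].

```python
hidden, states, rank = 1024, 5999, 128

full_softmax = hidden * states                       # one dense hidden-to-output matrix
low_rank_softmax = hidden * rank + rank * states     # two thin matrices: (hidden x r)(r x states)

print(f"full softmax layer:      {full_softmax:,} parameters")                 # 6,142,976
print(f"low-rank softmax (r={rank}): {low_rank_softmax:,} parameters")         # 898,944
print(f"reduction in this layer: {1 - low_rank_softmax / full_softmax:.0%}")   # 85%
```

The overall 49% figure quoted above is smaller because the savings apply mainly to the output layer, while the hidden layers keep their full weight matrices.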
A more recent study by Sainath et al. [27] explored the performance of CNNs compared to DNNs. The Hybrid DNN has an architecture of 5 layers with 1024 hidden units per layer and a softmax output layer with 2220 target units. The DNN-based feature system has the same architecture, but with only 512 output targets. Both the Hybrid CNN and the CNN-based feature systems are trained using VTLN-warped mel filter-bank, delta and double-delta features.
Table 6 shows that RBM pre-training of the DBN improves the WER over the MLP for all feature sets. Following sMBR training, the DNN is the best model: it is 20% better than the baseline GMM on Dev-04f and 36% better on RT-04. Furthermore, the CNN-based features present competitive performance, with a 19% relative improvement over the baseline GMM-HMM. The performance of the CNN-based features could reach 13.1% WER on Dev-04f and 12.0% WER on RT-04 [27] when a larger-scale task is used for training (400 hours of English Broadcast News).
2.5. Video speech recognition task

The advent of the web, low-cost digital cameras, and smartphones has significantly broadened the quantity as well as the reach of videos. The key challenge for many web video producers is making it easy for others, including hearing-impaired and non-native speakers, to find and enjoy their content. One way to do that is to use hand transcription, yet this solution can be time-consuming and expensive, and cannot cope with the huge amount of content being uploaded to the internet every minute. On the other hand, automatic video transcription represents an alternative solution to improve the accessibility of video content. In this section we chose to present the results of studies done by Google researchers on Youtube video data. The goal of this task is to transcribe Youtube data; unlike the previous tasks, YouTube data is extremely challenging for current ASR technology [34].
Jaitly et al. [20] used 1400 hours of YouTube data to train a Context-Dependent ANN/HMM with speaker-adapted features and 17552 triphone target states. The baseline system used 9 frames of MFCCs as inputs, transformed using LDA. The acoustic models were further improved with BMMI, and during decoding fMLLR and MLLR transforms were applied. For the DBN-HMMs, the acoustic data used in the training stage were the fMLLR-transformed features. For complexity reasons and to make the training faster, the ANN/HMM has an architecture of only 4 hidden layers, with 2000 units in the output layer and 1000 units in the layers above.
In order to generate additional semi-supervised training data, Liao et al. [34] proposed to use the owner-uploaded video transcripts together with DNN acoustic models. The proposed DNNs are fully-connected, feed-forward neural networks with sigmoid non-linearities and a softmax output layer, and were trained using minibatch stochastic gradient descent and back-propagation.
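The minibatch SGD and back-propagation recipe mentioned above amounts to repeatedly taking a gradient step on the cross-entropy between the network's state posteriors and the aligned state labels. The sketch below shows that update for the softmax output layer only (back-propagation through the hidden layers is omitted); the dimensions and learning rate are arbitrary, and this is our own illustration rather than code from [34].

```python
import numpy as np

rng = np.random.default_rng(3)

def sgd_step(W, b, X, y, lr=0.01):
    """One minibatch SGD step for a softmax output layer trained with cross-entropy.

    W: (D, K) weights, b: (K,) biases
    X: (N, D) minibatch of input vectors, y: (N,) integer state labels from forced alignment
    """
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)            # state posteriors, shape (N, K)
    p[np.arange(len(y)), y] -= 1.0               # gradient of cross-entropy w.r.t. logits
    grad_W, grad_b = X.T @ p / len(y), p.mean(axis=0)
    return W - lr * grad_W, b - lr * grad_b      # gradient descent update

# Toy usage: a 32-frame minibatch, 440-dimensional inputs, 500 output states
D, K, N = 440, 500, 32
W, b = rng.normal(0, 0.01, (D, K)), np.zeros(K)
X, y = rng.normal(size=(N, D)), rng.integers(0, K, size=N)
W, b = sgd_step(W, b, X, y)
```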
Reported results are summarized in Table 7. It should be noted that the training sets used in the two experiments are not exactly the same, but both experiments were conducted on a comparable amount of data, namely 1400 hours in [20] vs 1781 hours in [34]; however, all results are reported on the same test set, YtiDev11 (6.6 hours, 2.4M frames).
The baseline 7x1024 system with 7k output states reported in [34] outperforms all of the results previously reported in [20], while merging the wide hidden-layer architecture of 2048 nodes with a low-rank approximation and a high number of CD states in the output layer yielded the best result on the YtiDev test set, 40.9% WER.
Table 7. WER in % on the YtiDev11 YouTube set

Acoustic model | WER
MFCC GMM, 18k states, 450k components, [20] | 52.3
DBN-HMM pretrained with sparsity, [20] | 47.6
+ MMI, [20] | 47.1
+ system combination with SCARF, [20] | 46.2
Fbank DNN 7x1024, 7k states, [34] | 44.0
Fbank DNN 6x2048, 7k states, [34] | 42.7
Fbank DNN 7x1024, low-rank 256, 45k states, [34] | 42.5
Fbank DNN 7x2048, low-rank 256, 45k states, [34] | 40.9
2.6. Distant conversational speech recognition task

Distant conversational speech (DCS) is captured using multiple distant microphones, typically configured in a calibrated array. Its recognition is very challenging since the speech signals to be recognized are degraded by the presence of overlapping talkers, background noise, and reverberation. Classroom lectures, parliamentary meetings, and scientific meetings are the main applications of distant conversational speech recognition. In this section we chose to report results on the AMI Meeting corpus, prompted by the large use of this corpus in several recent studies.

The AMI corpus [14] contains around 100 hours of meeting recordings from three European sites (UK, Netherlands, Switzerland). Each meeting usually has four participants and the meetings are in English, although many of the participants are non-native English speakers. The AMI corpus was divided into training, development, and test sets: about 78 hours of recorded meeting speech were used as the training set, and about 9 hours each were used as the development and test sets. Evaluations in the meeting domain are usually conducted in three conditions: Single Distant Microphone (SDM), Multiple Distant Microphones (MDM) and Individual Headset Microphones (IHM).
Table 8. WER in % on the AMI set for various microphone configurations; SDM, MDM and IHM are respectively Single Distant Microphone, Multiple Distant Microphones and Individual Headset Microphones

Acoustic model | Dev SDM | Dev MDM | Dev IHM | Test SDM | Test MDM | Test IHM
GMM LDA+STC, [35] | 63.2 | 54.8 | 29.4 | 67.66 | 59.4 | 31.6
DNN LDA+STC, [35] | 55.4 | 51.4 | 26.7 | 59.8 | 56.0 | 28.4
DNN Fbank, [35] | 55.8 | 51.1 | 28.3 | 60.8 | 55.6 | 31.5
CNN, [36] | 52.5 | 46.3 | 25.6 | - | - | -
Swietojanski et al. [35] applied DNN-HMMs to the meeting speech recognition task and compared their results to a conventional system based on GMMs. The baseline GMM-HMM systems have a total of 80000 Gaussians, and were discriminatively trained using BMMI with Linear Discriminant Analysis (LDA) and decorrelated using a semi-tied covariance (STC) transform. The DNNs were configured to have 6 hidden layers with 2048 units in each hidden layer, and were trained using RBMs with either LDA+SAT features or FBANK features. A further study by Swietojanski et al. [36] suggests using CNNs for large vocabulary distant speech recognition, trained on the three types of microphone input: SDM, MDM or simply IHM. The CNNs were trained using Fbank features and composed of a single CNN layer followed by five fully-connected layers.

Table 8 shows that replacing the GMM with a DNN improves recognition accuracy for speech recorded with distant microphones. In [36], the authors found that CNNs improve the WER by 6.5% relative compared to conventional deep neural network (DNN) models and by 15.7% over a discriminatively trained GMM baseline.
3. DISCUSSION

From a first look at the results reported over the different tasks using various acoustic models, we can see that the traditional GMM-based HMM models have been outperformed by several other models. Despite their ability to model the probability distributions over the vectors of input features associated with each state of an HMM, GMMs have a serious shortcoming. As Hinton et al. [37] stated, "Despite all their advantages, GMMs have a serious shortcoming; they are statistically inefficient for modeling data that lie on or near a non-linear manifold in the data space". Therefore, other classifiers that can better capture the properties of acoustic features could offer better accuracy than GMMs for acoustic modeling of speech. Machine learning algorithms are indeed more efficient than the traditional GMM for acoustic modeling of speech. In particular, neural network (NN) based technology presents challenging performance over a variety of speech recognition benchmarks. Its success in acoustic modeling of speech comes from the capability to classify data even with a small number of parameters and the potential to learn much better models of data that lie on or near a nonlinear manifold. DNNs, which are feed-forward artificial neural networks with more than one layer of hidden units between their inputs and their outputs, are the new generation of NNs; they address the problems of training time and overfitting by adding an initial stage of generative pre-training using RBMs.
The performances of LVCSR systems vary from one domain to another, as summarized in Figure 1. In some domains, like read continuous speech where the speech was generally recorded under clean conditions, the results are satisfying, with an error rate under 5%. In other domains that contain more speech variation, such as video speech or distant conversational speech (meetings), the results are not acceptable, with an error rate near 50%. This huge difference in performance is caused by the nature of the speech: the more natural and spontaneous the speech is, the more the error rate increases. The difficulties encountered in modeling spontaneous speech stem from many factors: foreign accents, extraneous words, out-of-vocabulary words, ungrammatical sentences, disfluency, partial words, repairs, hesitations, repetitions and style shifting.
[Figure 1: bar chart of the best WER in % (y-axis) per task (x-axis)]
Figure 1. Best state-of-the-art performances over the six tasks. RCS, VSS, CTS, BNS, VS, DCS are respectively read continuous speech, voice search speech, conversational telephone speech, broadcast news speech, video speech and distant conversational speech.

It must be said that in this paper we have constrained the benchmarks to performances in terms of word error rates, because the majority of researchers use it as the common measure to report the performance of their systems. However, other important aspects of ASR systems should also be taken into account in the future, such as efficiency and usability. Most of the systems presented in the literature require either lots of training data (thousands of hours of speech and billions of words of text) or a large computational expense, which is ineffective. Therefore, we believe there is a need for corpora and evaluations that include more objective, usability-oriented criteria, in order to develop more user-centered ASR applications. It should also be noted that an ASR system must ensure reactiveness, looking at the real-time factor of the algorithms used, and that robustness to accents and impaired speech should also be considered.
4. CONCLUSION

In this paper we have summarized the recent developments of LVCSR research and presented a benchmark comparison of the performances of ASR systems on different LVCSR tasks: read continuous speech recognition, mobile voice search, conversational telephone speech recognition, broadcast news speech recognition, video speech and distant conversational speech recognition. Most of the presented results show that replacing GMMs with other machine learning algorithms gives competitive results. In particular, the DNN gives fascinating performances over a variety of speech recognition benchmarks. The biggest disadvantage of DNNs is their complexity; it is hard to train a large model on massive data sets. Still, we suggest that any improvement on a clean speech corpus such as WSJ is promising. On the other hand, more research effort is needed in several domains that are characterized by noisy and spontaneous speech, such as video and distant conversational speech.
REFERENCES

[1] T. Adam, et al., "Wavelet cepstral coefficients for isolated speech recognition," Indonesian Journal of Electrical Engineering and Computer Science, vol. 11, no. 5, pp. 2731–2738, 2013.
[2] N. R. Emillia, et al., "Isolated word recognition using ergodic hidden markov models and genetic algorithm," TELKOMNIKA (Telecommunication Computing Electronics and Control), vol. 10, no. 1, pp. 129–136, 2012.
[3] F. Jalili and M. J. Barani, "Speech recognition using combined fuzzy and ant colony algorithm," International Journal of Electrical and Computer Engineering (IJECE), vol. 6, no. 5, pp. 2205–2210, 2016.
[4] S. Young, "A review of large-vocabulary continuous-speech," IEEE Signal Processing Magazine, vol. 13, no. 5, pp. 45–57, Sept 1996.
[5] G. Zweig and M. Picheny, "Advances in large vocabulary continuous speech recognition," Advances in Computers, vol. 60, pp. 249–291, 2004.
[6] G. Saon and J.-T. Chien, "Large-vocabulary continuous speech recognition systems: A look at some recent advances," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 18–33, 2012.
[7] J. Baker, et al., "Developments and directions in speech recognition and understanding, part 1," IEEE Signal Processing Magazine, vol. 26, no. 3, pp. 75–80, 2009.
[8] J. Baker, et al., "Updated MINDS report on speech recognition and understanding, part 2," IEEE Signal Processing Magazine, vol. 26, no. 4, pp. 78–85, 2009.
[9] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in DARPA Speech and Language Workshop. Morgan Kaufmann Publishers, 1992.
[10] J. Godfrey and E. Holliman, "Switchboard-1 release 2 LDC97S62," 1993.
[11] A. Canavan, et al., "CALLHOME American English speech LDC97S42," Linguistic Data Consortium, Philadelphia, 1997.
[12] J. Fiscus, et al., "1997 English broadcast news speech (HUB4) LDC98S71," Linguistic Data Consortium, Philadelphia, 1997.
[13] D. Graff, et al., "1996 English broadcast news speech (HUB4) LDC97S44," Linguistic Data Consortium, Philadelphia, 1996.
[14] J. Carletta, "Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus," Language Resources and Evaluation, vol. 41, no. 2, pp. 181–190, 2007.
[15] A. Acero, et al., "Live search for mobile: web services by voice on the cellphone," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2008, pp. 5256–5259.
[16] G. Heigold, et al., "Discriminative HMMs, log-linear models, and CRFs: what is the difference?" in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2010, pp. 5546–5549.
[17] F. Triefenbach, K. Demuynck, and J.-P. Martens, "Large vocabulary continuous speech recognition with reservoir-based acoustic models," IEEE Signal Processing Letters, vol. 21, no. 3, pp. 311–315, March 2014.
[18] S. Siniscalchi, T. Svendsen, and C.-H. Lee, "A bottom-up modular search approach to large vocabulary continuous speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 4, pp. 786–797, April 2013.
[19] R. Gemello, et al., "Linear hidden transformations for adaptation of hybrid ANN/HMM models," Speech Communication, vol. 49, no. 10, pp. 827–835, 2007.
[20] N. Jaitly, et al., "Application of pretrained deep neural networks to large vocabulary speech recognition," in Proceedings of Interspeech, 2012.
[21] O. Abdel-Hamid, et al., "Convolutional neural networks for speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
[22] Z. Huang, et al., "Cache based recurrent neural network language model inference for first pass speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2014, pp. 6354–6358.
[23] J. J. Godfrey, et al., "Switchboard: Telephone speech corpus for research and development," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1. IEEE, 1992, pp. 517–520.
[24] K. Vesely, et al., "Sequence-discriminative training of deep neural networks," in Interspeech, 2013.
[25] F. Seide, et al., "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, December 2011.
[26] A. Y. Hannun, et al., "Deep speech: Scaling up end-to-end speech recognition," CoRR, vol. abs/1412.5567, 2014.
[27] T. N. Sainath, et al., "Deep convolutional neural networks for LVCSR," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2013, pp. 8614–8618.
[28] H. Soltau, et al., "Joint training of convolutional and non-convolutional neural networks," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014.
[29] A. L. Maas, et al., "Increasing deep neural network acoustic model size for large vocabulary continuous speech recognition," arXiv preprint arXiv:1406.7806, 2014.
[30] C. Cieri, et al., "The Fisher corpus: a resource for the next generations of speech-to-text," in LREC, vol. 4, 2004, pp. 69–71.
[31] T. N. Sainath, et al., "Deep convolutional neural networks for large-scale speech tasks," Elsevier, Special Issue in Deep Learning, 2014.
[32] T. Sainath, et al., "Making deep belief networks effective for large vocabulary continuous speech recognition," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec 2011, pp. 30–35.