International Journal of Electrical and Computer Engineering (IJECE)
Vol. 9, No. 4, August 2019, pp. 3194-3202
ISSN: 2088-8708, DOI: 10.11591/ijece.v9i4.pp3194-3202
UCSY-SC1: A Myanmar Speech Corpus for Automatic Speech Recognition

Aye Nyein Mon (1), Win Pa Pa (2), Ye Kyaw Thu (3)
(1, 2) Natural Language Processing Lab, University of Computer Studies, Yangon
(3) Language and Speech Science Research Lab., Waseda University, Japan
Article Info

Article history:
Received Dec 18, 2018
Revised Feb 15, 2019
Accepted Mar 21, 2019

Keywords:
Automatic Speech Recognition
Myanmar Language
Speech Corpus
Convolutional Neural Network (CNN)
ABSTRACT

This paper introduces a speech corpus developed for Myanmar Automatic Speech Recognition (ASR) research. ASR research has been conducted by researchers around the world to improve their language technologies. Speech corpora are important in developing ASR, and their creation is necessary, especially for low-resourced languages. Myanmar can be regarded as a low-resourced language because of the lack of pre-created resources for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus 1) is created for Myanmar ASR research. The corpus consists of two types of domain: news and daily conversations. The total size of the speech corpus is over 42 hrs, comprising 25 hrs of web news and 17 hrs of recorded conversational data. The corpus was collected from 177 females and 84 males for the news data, and from 42 females and 4 males for the conversational domain. This corpus was used as training data for developing Myanmar ASR. Three different types of acoustic models, Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN), were built and their results compared. Experiments were conducted on different data sizes, and evaluation was done on two test sets: TestSet1 (web news) and TestSet2 (recorded conversational data). The performance of Myanmar ASRs using this corpus gave satisfactory results on both test sets, leading to word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.
Copyright © 201x Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Aye Nyein Mon
Natural Language Processing Lab, University of Computer Studies, Yangon, Myanmar.
Email: ayenyeinmon@ucsy.edu.mm
1. INTRODUCTION

Speech is the most natural form of communication among humans, and numerous spoken languages are employed throughout the world. As communication among human beings is mostly done vocally, it is natural for people to expect speech interfaces with the computer. Automatic speech recognition (ASR) is the conversion of spoken words into computer text. A lot of ASR research is currently being conducted by researchers around the world for their languages [1], [2]. Current ASR systems use statistical models constructed on speech data. A speech corpus is therefore important for statistical-model-based automatic speech recognition, and it affects the performance of a speech recognizer. For well-resourced languages, speech researchers have used publicly available online resources. However, for low-resourced languages, they have to build the corpora by themselves for developing ASR systems [3], [4]. They use online resources such as broadcast news and daily conversational data, or record the data themselves.

Myanmar can be considered a low-resourced language regarding the linguistic resources available for NLP. Lack of proper data is the main problem when it comes to speech recognition research in the Myanmar language. A speech corpus therefore needs to be built for developing Myanmar ASR, and this gives the impetus to build a speech corpus for the Myanmar language.

Our earlier work developed a speech corpus for the Myanmar language [5]. In that work, speech data was collected only from a specific domain, web news. The total length of that corpus is 20 hrs. It involves 126 females and 52 males, totalling 178 speakers, and contains 7,332 utterances collected from local and foreign news. The corpus was used in developing Myanmar continuous speech recognition and was evaluated on two test sets: web data and news recorded by 10 natives. It yielded 24.73% WER on TestSet1 and 22.59% WER on TestSet2.

In this work, the domain of the speech corpus is extended. The corpus is named "UCSY-SC1" and is constructed from two types of domain: web news and daily conversations. The web news data size is increased to 25 hrs. The web news is already-recorded data collected from the web, and its sentences are longer than the conversation sentences. The corpus also contains daily conversational data: very short sentences obtained from the ASEAN language speech translation through U-Star (http://www.ustar-consortium.com/qws/slot/u50227/index.html) and from the web, recorded by ourselves using a recording device. There are 17 hrs of conversational data. Thus, the total speech corpus size is over 42 hrs. This corpus is used as training data, and the experimental results of GMM, DNN, and CNN with sequence discriminative training approaches are presented. This is a milestone for Myanmar ASR development.

This paper is organized as follows. The introduction to the Myanmar language is presented in Section 2. The speech corpus developed for the Myanmar language is explained in Section 3. Evaluation on the corpus is done in Section 4. Conclusion and future work are summarized in Section 5.
2. NATURE OF MYANMAR LANGUAGE

The Myanmar language (formerly known as Burmese) is the official language of Myanmar. The Myanmar script derives from the Brahmi script of South India. Myanmar text is a string of characters without any word boundary markup; there are no spaces between words in the Myanmar language. The Myanmar language has 33 basic consonants, 12 vowels, and 4 medials.

Phonology is a system of the combination of vowels and consonants. Myanmar phonology is structured by just one vowel, or by one vowel combined with consonant symbols. The vowels have their own sounds in the Myanmar language; therefore, a single vowel alone can produce a clear sound, such as အ /a̰/, အာ /à/, အား /á/. Myanmar consonants have no clear sound of their own, but when combined with a vowel they produce a clear sound, for example က /k/ + အား /á/ = ကား /ká/. The 12 basic vowels are အ /a̰/, အာ /à/, အိ /ḭ/, အီ /ì/, အု /ṵ/, အူ /ù/, ေအ /èi/, အဲ /ɛ́/, ေအာ /ɔ́/, ေအာ် /ɔ̀/, အံ /àɴ/, and အို /ò/. Myanmar syllables are basically formed by consonant and vowel combination [6]. As an example, the combination of the vowel အူ /ù/ and the consonant က /k/ makes the syllable ကူ /kù/: က /k/ + အူ /ù/ = ကူ /kù/.
3. UCSY-SC1 SPEECH CORPUS BUILDING

Building a speech corpus is the first step in developing any automatic speech recognition (ASR) system, especially for low-resourced languages, and it is crucial for a statistical ASR system. Moreover, the accuracy of a speech recognizer depends on the speech corpora. Speech corpora for well-resourced languages such as English are publicly available for ASR research. However, being a low-resourced language, Myanmar has no existing speech corpora.

A speech corpus can be built mainly by two methods. The first method is to gather speech that has already been recorded and manually transcribe it into text. The second method is to create the text corpus first and record the speech by reading the collected text [7].
3.1. Collecting the Data from the Web News

The first approach is used to collect the web news data. Today, the internet has various resource types, for example, social media, blogs, Twitter, and news portals, which offer a lot of speech data that can be freely downloaded. Moreover, it has been proved that corpora created from internet resources yield promising results [8], [9]. Therefore, speech data was collected first from the web news. The web data collecting process lasted one year and involved two persons, including the author.

3.1.1. Speech Corpus Preparation

The web news is downloaded from the sites of Myanmar Radio and Television (MRTV), Voice of America (VOA), the Facebook pages of Eleven Broadcasting, 7days TV, ForInfo news, GoodMorningMyanmar, British Broadcasting Corporation (BBC) Burmese news, and breakfast news. Both local and foreign news are contained in the corpus. The web news videos are converted to wave file format. After that, the audio files are segmented with Praat (http://www.fon.hum.uva.nl/praat/). All audio files are formatted at a sample frequency of 16,000 Hz with a mono channel. The length of each audio file is between 2 seconds and 30 seconds.
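The segment format above (16,000 Hz, mono, clips of 2-30 seconds) can be checked mechanically. The following is a minimal sketch, not part of the original pipeline, using only Python's standard-library wave module; the helper names are illustrative.

```python
import io
import wave

# Corpus format constraints described above (illustrative helper, not from the paper).
SAMPLE_RATE = 16_000   # Hz
CHANNELS = 1           # mono
MIN_SECS, MAX_SECS = 2.0, 30.0

def is_valid_segment(wav_bytes: bytes) -> bool:
    """Return True if a WAV blob matches the corpus format: 16 kHz mono, 2-30 s long."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        duration = w.getnframes() / w.getframerate()
        return (w.getframerate() == SAMPLE_RATE
                and w.getnchannels() == CHANNELS
                and MIN_SECS <= duration <= MAX_SECS)

def make_wav(seconds: float, rate: int = SAMPLE_RATE, channels: int = CHANNELS) -> bytes:
    """Build a silent 16-bit PCM WAV in memory (for demonstration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)  # 16-bit encoding, as used for the corpus recordings
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(rate * seconds) * channels)
    return buf.getvalue()

print(is_valid_segment(make_wav(5.0)))    # clip within the 2-30 s range
print(is_valid_segment(make_wav(45.0)))   # clip that is too long
```

A real pipeline would run such a check over every segmented file before adding it to the corpus.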
3.1.2. Speaker Information

The news presenters are professional, well-experienced, and well-trained; therefore, they have clear voices in news broadcasting. Female news presenters are dominant in the web news, so fewer male speakers than female speakers are involved in this corpus. The ages of the speakers are under 35.
3.1.3. Text Corpus Preparation

Most of the broadcast news items from the web have transcriptions. Where transcriptions are not available, they are done manually, using Myanmar3 Unicode for that purpose. Word segmentation is done by hand, as the Myanmar language has no word boundaries. This is performed based on the Myanmar-English dictionary [10], which is also applied to check the spelling of the words. The average length of the utterances in this corpus is 33 words and 54 syllables. The web news data has 8,973 unique sentences and 11,040 unique words. Example news sentences from the corpus are shown in Figure 1. The format of each sentence is the utterance-id followed by the transcription of the sentence.

Figure 1. Example sentences of the corpus on news
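The utterance-id-plus-transcription line format is straightforward to consume programmatically. As an illustration (the ids and sentences below are invented placeholders, not actual corpus entries), a sketch in Python:

```python
def parse_transcripts(lines):
    """Map utterance-id -> transcription for lines of the form '<utt-id> <text>'."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        utt_id, _, text = line.partition(" ")
        table[utt_id] = text
    return table

sample = [
    "news_spk001_0001 this is a segmented news sentence",
    "conv_spk042_0007 a short conversational sentence",
]
transcripts = parse_transcripts(sample)
print(transcripts["news_spk001_0001"])
```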
3.2. Recording Daily Conversations

The second approach (designing the text corpus first and recording the speech by reading the collected text) was used for collecting the conversational data. Data recording took 3 months, and 11 people were involved in the speech and text segmentation.
3.2.1. Text Corpus Preparation

The daily English conversations from the ASEAN language speech translation through U-Star are translated into Myanmar for text corpus building. The conversational data contains 2,156 unique sentences and 1,740 unique words. There are 2,000 sentences from the ASEAN language speech translation; they are conversations in hotels, restaurants, streets, on the telephone, etc. The remaining 156 daily conversational sentences are collected from the web. The spelling of the text is manually checked, and the words are segmented in the same way as for the news data. The sentences contained in this part of the corpus are shorter than those of the news domain: the average length of the conversational sentences is 11 words and 15 syllables. Example sentences for the daily conversational domain are shown in Figure 2.
The format of each sentence is similar to that of the news domain (utterance-id followed by each utterance).

Figure 2. Example sentences of the conversational data
3.2.2. Speaker Information

The sentences are recorded by 4 male speakers and 42 female speakers, who are faculty members and students of the University of Computer Studies, Yangon, Myanmar. Since the number of females exceeds that of males at our university, many female speakers are represented in the corpus. The ages of the speakers are between 19 and 40.
3.2.3. Speech Recording and Segmentation

The recording work was done in a laboratory of our university. It is a very quiet place with no external effects from the room such as echo and background noise. It is also a healthy place to work in, because people can breathe well and feel relaxed. A Tascam DR-100MKIII (https://tascam.com/us/product/dr-100mkiii/top) was used for speech recording. It is intended for audio designers and engineers, and it has an easy-to-use interface with robust reliability. The audio files are formatted at a sample frequency of 16,000 Hz, mono channel, with 16-bit encoding. The recorded files are segmented with the Audacity tool (https://www.audacityteam.org/), and the silent portion of each utterance is discarded.

In a speech corpus, audio and text data should be aligned. So each recorded sentence is listened to, checked against its corresponding text transcription, and corrected where necessary. If the speakers do not have clear voices, the recordings are repeated until they are satisfactory and smooth. All speakers read at a normal pace.
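The step of discarding the silent portion of each utterance can be illustrated with a simple amplitude gate. This is a sketch only: the actual segmentation was done manually with Audacity, and the threshold value here is arbitrary.

```python
def trim_silence(samples, threshold=100):
    """Drop leading and trailing samples whose absolute amplitude is below threshold.

    `samples` is a list of 16-bit PCM amplitudes; the threshold is illustrative.
    """
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1   # skip leading silence
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1     # skip trailing silence
    return samples[start:end]

clip = [0, 3, -2, 450, -1200, 800, 5, -4, 0]
print(trim_silence(clip))  # -> [450, -1200, 800]
```

In practice an energy gate over short windows is more robust than per-sample gating, but the idea is the same.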
3.2.4. Normalization of the Transcription

Some of the transcriptions of broadcast news and daily conversations obtained from online sources consist of non-standard words: numbers, dates, abbreviations, acronyms, symbols, and English names such as names of organizations, things, persons, animals, social media, etc. The pronunciations of these words cannot be found in the dictionary. Therefore, it is necessary to do text normalization and transliteration into the Myanmar language. In this work, those words are manually transcribed into Myanmar words as the transcribers listen to the corresponding audio. Table 1 shows example words that need to be normalized.

Table 1. Example of text normalization

Description | Example | Normalization
Date | ၂၀၁၆-၂၀၁၇ | ှစ်ေထာင့်ဆယ့်ေြခာက် ှစ်ေထာင့်ဆယ့်ခနစ် (2016-2017)
Time | ၃ နာရီ ၅၅ မိနစ် | သံး နာရီ ငါး ဆယ့် ငါး မိနစ် (3 Hours 55 Minutes)
Number | ၁၁၄ ဦး | တစ် ရာ တစ် ဆယ့် ေလး ဦး (114 persons)
Digit | 09-448045577 | သည ကိး ေလး ေလး ှစ် သည ေလး ငါး ငါး ခနစ် ခနစ်
Acronyms | FDA | အက်ဖ်ဒီေအ
Person Name | Mr. Filippo Grandi | မစတာ ဖီလစ်ိဂရမ်းဒီ
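The digit-by-digit reading shown in the "Digit" row of Table 1 amounts to a per-character lookup. A sketch follows, with romanized placeholder strings standing in for the Myanmar digit words (the actual transcriptions use Myanmar script):

```python
# Digit-by-digit verbalization, as applied to phone numbers like 09-448045577.
# The romanized digit names below are illustrative placeholders only.
DIGIT_WORDS = {
    "0": "thone-nya", "1": "tit", "2": "hnit", "3": "thone", "4": "lay",
    "5": "ngar", "6": "chauk", "7": "khun-nit", "8": "shit", "9": "koe",
}

def verbalize_digits(token: str) -> str:
    """Read a digit string aloud one digit at a time, skipping punctuation."""
    return " ".join(DIGIT_WORDS[ch] for ch in token if ch.isdigit())

print(verbalize_digits("09-448"))
```

Dates, times, and counting numbers need fuller grammar-aware expansion, which is why the normalization in this work was done manually.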
3.3. Phone Coverage in the Speech Corpus

Phone coverage is vital for improving ASR accuracy. The Myanmar-English dictionary developed by the Myanmar Language Commission (MLC) [10] is used as the baseline, and this dictionary is extended with the vocabulary of the speech corpus. There are 38,376 words in the lexicon. The training set contains 67 phonemes, covering 94.37% of the phonemes. Table 2 describes an example of the Myanmar lexicon.

Table 2. Example of Myanmar lexicon

Myanmar Word | Phoneme
အ (Dump) | /a̰/
အားကစား (Sport) | /á ɡəzá/
အာကာသ (Space) | /à kà θa̰/
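Phone coverage of the kind reported above (67 phonemes, 94.37% covered) can be measured by intersecting the lexicon's phoneme inventory with the phonemes attested in the training transcripts. A toy sketch, using an invented three-word lexicon with ASCII phoneme labels rather than the real 38,376-word lexicon:

```python
# word -> phoneme sequence (toy lexicon, loosely modeled on Table 2)
lexicon = {
    "a":     ["a."],
    "sport": ["a:", "g", "@", "z", "a:"],
    "space": ["a:", "k", "a:", "th", "a."],
}

def phoneme_coverage(lexicon, train_words):
    """Fraction of the lexicon's phoneme inventory attested in the training words."""
    inventory = {p for phones in lexicon.values() for p in phones}
    seen = {p for w in train_words if w in lexicon for p in lexicon[w]}
    return len(seen) / len(inventory)

print(round(phoneme_coverage(lexicon, ["sport"]), 3))
```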
The distributions of both consonant and vowel phonemes occurring in the speech corpus are analyzed. The frequency data on the consonant distribution of the corpus are given in Figure 3. The phoneme /j/ has the most occurrences in the corpus. This is because the phoneme represents the medials ◌ျ /ya̰ p̰̃/ and ြ◌ /ya̰ yiʔ/, and the consonants ရ and ယ are also defined as the /j/ phoneme. The second most frequent is the phoneme /d/, because the consonants ဒ and ဓ are represented by the same phoneme /d/. The Myanmar word တိ /trḭ/ rarely appears in the Myanmar language; therefore, its pronunciation phoneme, /tr/, is found only once in the texts. Few of the nasal phonemes /ng/ and /nj/ are found.

Figure 3. Consonant phoneme distribution of the UCSY-SC1 corpus
The frequency of the vowel distribution of the corpus is shown in Figure 4. All vowel phonemes appear in the corpus. The most frequent phoneme is /a/ with tone 1, as most word pronunciations are formed with this vowel phoneme. For example, the word ေကာင်း /káʊɴ/ is composed of the phonemes /k/ + /a/ + /un:/, and ကိင်း /káɪɴ/ is formed by the combination of the phonemes /k/ + /a/ + /in:/. The second most frequent phoneme is /a-/ with neutral tone. In the Myanmar language, the basic vowels (/i/ /ì/, /ei/ /èi/, /e/ /è/, /a/ /à/, /o/ /ɔ̀/, /ou/ /ò/, /u/ /ù/) have their own properties. When these vowels are influenced by the contextual sounds, they change to neutralized vowels as their own properties decrease. Therefore, most of the Myanmar words are found with neutral tone in the corpus, for example, /ná/ + /jwɛʔ/ ==> /nə/ + /jwɛʔ/. Nasalized vowels such as /ai/ /aiʔ/, /an./ /a̰ɴ/, /ei/ /eɪʔ/, /in./ /ɪ̰ɴ/, /u/ /ʊ/, and /un./ /ṵ̃/ are the least frequent phonemes in the corpus.
Figure 4. Vowel phoneme distribution of the UCSY-SC1 corpus
3.4. Statistics of the Corpus

The speech corpus consists of two types of domain: web news and conversational data. The detailed statistics of the corpus are shown in Table 3. The corpus consists of 306,088 words, of which 11,696 are unique. Nearly 37% of the unique words occur only once, about 5% appear between 100 and 1,000 times, and only about 1% are found more than 1,000 times.
Table 3. UCSY-SC1 corpus statistics

Data | Size | Speakers (Female / Male / Total) | Utterance | Unique Words
Web News | 25 Hrs 20 Mins | 177 / 84 / 261 | 9,066 | 9,956
Daily Conversations | 17 Hrs 19 Mins | 42 / 4 / 46 | 22,048 | 1,740
Total | 42 Hrs 39 Mins | 219 / 88 / 307 | 31,114 | 11,696
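The frequency buckets quoted in Section 3.4 can be computed directly from the word counts. A sketch on toy data follows (the percentages printed are for the toy list, not the corpus):

```python
from collections import Counter

def frequency_buckets(words):
    """Summarize unique-word frequencies the way Section 3.4 reports them."""
    counts = Counter(words)
    n_unique = len(counts)
    once = sum(1 for c in counts.values() if c == 1)
    mid = sum(1 for c in counts.values() if 100 <= c <= 1000)
    over_1000 = sum(1 for c in counts.values() if c > 1000)
    return {
        "unique": n_unique,
        "once_pct": 100 * once / n_unique,
        "100_1000_pct": 100 * mid / n_unique,
        "over_1000_pct": 100 * over_1000 / n_unique,
    }

toy = ["a"] * 1500 + ["b"] * 200 + ["c", "d"]
print(frequency_buckets(toy))
```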
4. EVALUATION ON THE CORPUS

In this work, experiments are done to evaluate the quality of the speech corpus on Myanmar ASR.
4.1. Experimental Setup

This section describes the experimental setup for the data sets and the acoustic and language models. The impact of training data size on ASR performance is investigated in this experiment. Four different data sizes (10 hrs, 20 hrs, 30 hrs, and 42 hrs) are used for incremental training. The detailed statistics of the train and test sets are displayed in Table 4. TestSet1 is open test data consisting of web news. TestSet2 is also open test data; it is conversational data from natives, recorded with voice recorders and microphones.
Table 4. Statistics on train and test sets

Data | Size | Female | Male | Total | Utterance
TrainSet | 10 Hrs 5 Mins | 79 | 23 | 102 | 3,530
TrainSet | 20 Hrs 2 Mins | 126 | 52 | 178 | 7,332
TrainSet | 30 Hrs 3 Mins | 174 | 86 | 260 | 15,556
TrainSet | 42 Hrs 39 Mins | 219 | 88 | 307 | 31,114
TestSet1 | 31 Mins 55 Sec | 5 | 3 | 8 | 193
TestSet2 | 32 Mins 40 Sec | 3 | 2 | 5 | 887
4.2. GMM-based Acoustic Model

The Kaldi speech recognition toolkit [11] is adopted for the experiments. For GMM-based acoustic model training, the standard 13-dimensional Mel-Frequency Cepstral Coefficients (MFCC) features and their first and second derivatives, without energy features, are applied. After that, cepstral mean and variance normalization (CMVN) is performed on the features. Linear discriminant analysis (LDA) is used to splice 9 frames together and project down to 40 dimensions. A maximum likelihood linear transform (MLLT) is estimated on the LDA features. Feature-space Maximum Likelihood Linear Regression (fMLLR) is used for speaker adaptive training (SAT). The baseline GMM model has 2,052 context-dependent (CD) triphone states and an average of 34 Gaussian components per state.
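The first and second derivatives appended to the MFCCs can be illustrated with a simple central-difference approximation. Kaldi's actual delta computation uses a regression window over several frames, so this is a simplification for illustration only:

```python
def deltas(frames):
    """First-order dynamic features via a central difference, with edge clamping.

    `frames` is a list of per-frame feature vectors (e.g. 13-dim MFCCs).
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

mfcc = [[1.0, 0.0], [2.0, 0.0], [4.0, 0.0]]   # toy 2-dimensional "MFCC" frames
d1 = deltas(mfcc)   # first derivatives (deltas)
d2 = deltas(d1)     # second derivatives (delta-deltas)
print(d1)
print(d2)
```

The final feature vector per frame is the concatenation of the static coefficients with d1 and d2, tripling the dimensionality before CMVN and LDA.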
4.3. CNN and DNN Acoustic Models

As input features, 40-dimensional log mel-filterbank features are applied for the CNN and DNN acoustic models. For the DNN, 4 hidden layers with 300 units per hidden layer are used. For the CNN, 256 and 128 feature maps are set in the first and second convolutional layers, with filter sizes of 8 and 4, respectively. The pooling size is set to 3 with pool step 1. The fully connected network has 2 hidden layers with 300 units per hidden layer. Cross-entropy training is performed on the CNN and DNN acoustic models. Restricted Boltzmann machines (RBMs) are built on top of the CNN training. Additionally, a 6-layer DNN with cross-entropy training is done, and 6 iterations of state-level minimum Bayes risk (sMBR) discriminative training are performed [12]. The training procedure of the CNN (sMBR) is depicted in Figure 5.

A constant learning rate of 0.008 is used to train the neural networks. The learning rate is then halved as the cross-validation error reduction slows. When the error rate stops decreasing or starts increasing, the training procedure is stopped. Stochastic gradient descent is applied with a mini-batch of 256 training examples for backpropagation. A Tesla K80 GPU is used for all neural network training.
Figure 5. Training flow of CNN (sMBR)
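The learning-rate schedule described above (constant 0.008, halving once cross-validation improvement slows, stopping when the error rises) can be sketched as follows. The halving threshold is an assumed illustrative value, not one stated in the paper:

```python
def lr_trajectory(cv_errors, initial_lr=0.008, halve_below=0.005):
    """Learning rates applied per epoch, given cross-validation error after each epoch.

    Halving begins once the per-epoch improvement drops below `halve_below`
    (illustrative threshold); training stops when the error starts increasing.
    """
    lr = initial_lr
    rates = [lr]
    halving = False
    for prev, cur in zip(cv_errors, cv_errors[1:]):
        if cur > prev:            # error started increasing -> stop training
            break
        if prev - cur < halve_below:
            halving = True        # improvement slowed: switch to halving mode
        if halving:
            lr /= 2.0
        rates.append(lr)
    return rates

errs = [0.40, 0.30, 0.297, 0.295, 0.2949, 0.296]
print(lr_trajectory(errs))
```

This mirrors the "newbob"-style scheduling commonly used for hybrid neural-network acoustic models.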
4.4. Experimental Results

In this experiment, ASR performance is evaluated on different corpus sizes. Three different acoustic models, GMM, DNN, and CNN, are developed and their results compared. The Convolutional Neural Network (CNN) has achieved better performance than the Deep Neural Network (DNN) and the Gaussian Mixture Model (GMM) in various large vocabulary continuous speech recognition (LVCSR) tasks [13], [14], [15], because the fully connected nature of the DNN can cause overfitting, which decreases ASR performance for low-resourced languages. The CNN can model tone patterns well because of its ability to reduce translational variance and exploit spectral correlations in the input signal. Furthermore, as sequence discriminative training can minimize the error on the state labels in a sentence, DNN sequence training is done on top of the CNN training. It is evident in this work that CNN (sMBR) significantly outperforms the GMM and DNN acoustic models for a low-resourced and tonal language, the Myanmar language.

Figures 6 and 7 show the word error rates (WERs) on TestSet1 and TestSet2 as functions of training data size. According to Figure 6, when the training data set size is increased from 10 hrs to 20 hrs, the WERs on TestSet1 decrease considerably, because the added data is from the same domain as the test set. However, the error rates on TestSet1 are not reduced notably when the training data size is increased from 30 hrs to 42 hrs, because the augmented data is from a different domain. In Figure 7, the word error rates on TestSet2 clearly decrease as the training data size grows. This is because the data added to the 30-hr and 42-hr training sets is from the same domain as TestSet2, which diminishes its word error rates. It can be clearly observed that WERs decrease as the amount of training data increases. The largest training set, the 42-hr data set, yields the lowest WERs on both test sets. Thus, the training data size has a great impact on ASR performance.
Figure 6. Word error rate on TestSet1 versus training data

Figure 7. Word error rate on TestSet2 versus training data
According to the evaluation results, the error rates on TestSet1 are lower than those on TestSet2. This is because the news presenters have clearer and sharper voices than those in the recorded conversational data. Moreover, the total length of the web news data is longer than that of the recorded conversational data. It is found that the CNN outperformed the DNN and GMM on both test sets. As a result, using CNN (sMBR) leads to the lowest WERs: 15.61% on TestSet1 and 24.43% on TestSet2.
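The word error rates reported here follow the standard definition: the word-level edit distance (substitutions, insertions, deletions) between reference and hypothesis, divided by the reference length. A self-contained sketch, shown for illustration only:

```python
def wer(ref, hyp):
    """Word error rate (%) between a reference and a hypothesis transcription."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[len(r)][len(h)] / len(r)

print(wer("the corpus has two domains", "the corpus had two domain"))  # -> 40.0
```

Two substitutions out of five reference words give 40% WER in the toy example above; Kaldi's scoring tools compute the same quantity over the full test set.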
5. CONCLUSION

This paper introduces the UCSY-SC1 corpus for Myanmar speech processing research. The corpus consists of two domains: web news and daily conversational data recorded by ourselves. A detailed description of the collection of the text and speech corpus for each domain is presented. The total duration of the UCSY-SC1 corpus is 42 hrs and 39 mins. The corpus comprises 261 speakers for the web news and 46 speakers for the conversational domain. Moreover, the phone coverage of the corpus is analyzed. The speech corpus is used as training data for building Myanmar ASR; this is a milestone for Myanmar ASR development. The effect of training data size on recognition accuracy is also analyzed by means of GMM, DNN, and CNN acoustic models. Two test sets, web news and recorded conversational data, are used to evaluate the ASR accuracy. It is found that the accuracy on web news data is better than that on the recorded conversational data. The CNN (sMBR)-based model outperforms the GMM and DNN models, leading to the lowest error rates of 15.61% WER on TestSet1 and 24.43% WER on TestSet2 using this corpus. As Myanmar is a low-resourced language, creating speech corpora is essential, and it is believed that this corpus will be of use for future Myanmar speech processing research. The corpus will be further expanded with more speech data, and Myanmar ASR will hopefully be developed by means of the end-to-end learning approach.
ACKNOWLEDGMENTS

We would like to thank all faculty members and students of the University of Computer Studies, Yangon, for participating in the data collection task.
REFERENCES
[1] J. Xu, et al., "Agricultural Price Information Acquisition Using Noise-Robust Mandarin Auto Speech Recognition," International Journal of Speech Technology, vol/issue: 21(3), pp. 681-688, 2018.
[2] M. O. M. Khelifa, et al., "Constructing Accurate and Robust HMM/GMM Models for an Arabic Speech Recognition System," International Journal of Speech Technology, vol/issue: 20(4), pp. 937-949, 2017.
[3] N. D. Londhe, et al., "Chhattisgarhi speech corpus for research and development in automatic speech recognition," International Journal of Speech Technology, vol/issue: 21(2), pp. 193-210, 2018.
[4] P. Zelasko, et al., "AGH corpus of Polish speech," Language Resources and Evaluation Journal, vol. 50, 2015.
[5] A. N. Mon, et al., "Developing a Speech Corpus from Web News for Myanmar (Burmese) Language," 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1-6, 2017.
[6] U. T. Htun, "Some Acoustic Properties of Tones in Burmese," in South-East Asian Linguistics 8: Tonation, D. Bradley, Ed. (Australian National University, Canberra, 1982), pp. 77-116, 1982.
[7] T. Nadungodage, et al., "Developing a Speech Corpus for Sinhala Speech Recognition," ICON-2013: 10th International Conference on Natural Language Processing, CDAC Noida, India, 2013.
[8] J. Staš, et al., "TEDxSK and JumpSK: A New Slovak Speech Recognition Dedicated Corpus," Journal of Linguistics/Jazykovedný časopis, vol/issue: 68(2), pp. 346-354, 2017.
[9] M. Ziołko, et al., "Automatic Speech Recognition System Dedicated for Polish," INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, pp. 3315-3316, 2011.
[10] M. L. Commission, "Myanmar-English Dictionary," Department of the Myanmar Language Commission, Yangon, Ministry of Education, Myanmar, 1993.
[11] D. Povey, et al., "The Kaldi Speech Recognition Toolkit," IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.
[12] K. Vesely, et al., "Sequence-discriminative Training of Deep Neural Networks," INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France, pp. 2345-2349, August 25-29, 2013.
[13] W. Chan and I. Lane, "Deep Convolutional Neural Networks for Acoustic Modeling in Low Resource Languages," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, pp. 2056-2060, 2015.
[14] T. N. Sainath, et al., "Improvements to Deep Convolutional Neural Networks for LVCSR," 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 315-320, 2013.
[15] T. Sercu, et al., "Very Deep Multilingual Convolutional Neural Networks for LVCSR," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, pp. 4955-4959, 2016.