International Journal of Electrical and Computer Engineering (IJECE)
Vol. 9, No. 4, August 2019, pp. 3194-3202
ISSN: 2088-8708, DOI: 10.11591/ijece.v9i4.pp3194-3202

UCSY-SC1: A Myanmar Speech Corpus for Automatic Speech Recognition

Aye Nyein Mon 1, Win Pa Pa 2, Ye Kyaw Thu 3
1,2 Natural Language Processing Lab, University of Computer Studies, Yangon
3 Language and Speech Science Research Lab., Waseda University, Japan

Article history: Received Dec 18, 2018; Revised Feb 15, 2019; Accepted Mar 21, 2019

Keywords: Automatic Speech Recognition, Myanmar Language, Speech Corpus, Convolutional Neural Network (CNN)

ABSTRACT
This paper introduces a speech corpus developed for Myanmar Automatic Speech Recognition (ASR) research. ASR research is conducted by researchers around the world to improve their language technologies. Speech corpora are essential for developing ASR systems, and creating them is particularly necessary for low-resourced languages. Myanmar can be regarded as a low-resourced language because it lacks pre-built resources for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus 1) is created for Myanmar ASR research. The corpus covers two domains, news and daily conversations, and its total size is over 42 hours: 25 hours of web news and 17 hours of recorded conversational data. The news data was collected from 177 female and 84 male speakers, and the conversational data from 42 female and 4 male speakers. This corpus was used as training data for developing Myanmar ASR.
Three types of acoustic models were built and their results compared: Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN). Experiments were conducted on different data sizes, and evaluation was done on two test sets: TestSet1, web news, and TestSet2, recorded conversational data. The Myanmar ASR systems trained on this corpus gave satisfactory results on both test sets, with word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.

Copyright © 201x Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Aye Nyein Mon
Natural Language Processing Lab, University of Computer Studies, Yangon, Myanmar.
Email: ayenyeinmon@ucsy.edu.mm

Journal homepage: http://iaescore.com/journals/index.php/IJECE

1. INTRODUCTION
Speech is the most natural form of communication among humans, and numerous spoken languages are used throughout the world. Because communication among human beings is mostly vocal, it is natural for people to expect speech interfaces to computers. Automatic speech recognition (ASR) is the conversion of spoken words into computer text. A great deal of ASR research is currently being conducted by researchers around the world for their own languages [1] [2]. Current ASR systems use statistical models constructed from speech data; a speech corpus is therefore essential for statistical-model-based ASR and directly affects the performance of a speech recognizer. For well-resourced languages, speech researchers can draw on publicly available online resources. However, for low-resourced languages, they have to build
the corpora themselves to develop ASR systems [3] [4]. They use online resources such as broadcast news and daily conversational data, or record the data themselves. Myanmar can be considered a low-resourced language with respect to the linguistic resources available for NLP. The lack of proper data is the main obstacle to speech recognition research in the Myanmar language; a speech corpus therefore needs to be built to develop Myanmar ASR. This gives the impetus to build a speech corpus for the Myanmar language.

In earlier work, we developed a speech corpus for the Myanmar language [5]. There, speech data was collected from only one domain, web news. The total length of that corpus is 20 hours; it involves 126 female and 52 male speakers, 178 in total, and contains 7,332 utterances collected from local and foreign news. The corpus was used to develop Myanmar continuous speech recognition and was evaluated on two test sets: web data and news recorded by 10 native speakers. It yielded 24.73% WER on TestSet1 and 22.59% WER on TestSet2.

In this work, the domain of the speech corpus is extended. The speech corpus, named "UCSY-SC1", is constructed from two domains: web news and daily conversations. The web news data, already-recorded material collected from the web, is increased to 25 hours; its sentences are longer than the conversational ones. The corpus also contains daily conversational data: very short sentences obtained from the ASEAN language speech translation project through U-STAR 1 and from the web, which we recorded ourselves with a recording device. There are 17 hours of conversational data.
Thus, the total speech corpus size is over 42 hours. This corpus is used as training data, and experimental results of GMM, DNN, and CNN models with sequence-discriminative training are presented. This is a milestone for Myanmar ASR development.

This paper is organized as follows. Section 2 introduces the Myanmar language. Section 3 explains the speech corpus developed for Myanmar. Section 4 evaluates the corpus, and Section 5 summarizes conclusions and future work.

2. NATURE OF THE MYANMAR LANGUAGE
The Myanmar language (formerly known as Burmese) is the official language of Myanmar. The Myanmar script derives from the Brahmi script of South India. Myanmar text is a string of characters without any word boundary markup; there are no spaces between words. The language has 33 basic consonants, 12 vowels, and 4 medials. Its phonology is a system of combinations of vowels and consonants: a syllable is structured by a vowel alone, or by a vowel together with a consonant and consonant-combination symbols. Vowels have their own sounds in Myanmar, so a single vowel alone produces a clear sound, as in အ /a̰/, အာ /à/, and အား /á/. Myanmar consonants have no clear sound of their own; combined with a vowel, a consonant produces a clear sound, e.g. က /k/ + အား /á/ = ကား /ká/. The 12 basic vowels are အ /a̰/, အာ /à/, /ḭ/, /ì/, /ṵ/, /ù/, ေအ /èi/, /ɛ́/, ေအာ /ɔ́/, ေအာ /ɔ̀/, /àɴ/, and /ò/. Myanmar syllables are basically formed by consonant-vowel combination [6]. For example, combining the vowel /ù/ with the consonant က /k/ makes the syllable ကူ /kù/: က /k/ + /ù/ = ကူ /kù/.

3.
UCSY-SC1 SPEECH CORPUS BUILDING
Building a speech corpus is the first step in developing any automatic speech recognition (ASR) system, especially for low-resourced languages, and it is crucial for a statistical ASR system. Moreover, the accuracy of a speech recognizer depends on the speech corpora. Speech corpora for well-resourced languages such as English are publicly available for ASR research; however, being a low-resourced language, Myanmar has no existing speech corpora. A speech corpus can be built in two main ways. The first method is to gather speech that has already been recorded and manually transcribe it into text. The second is to create the text corpus first and record the speech by reading the collected text [7].

1 http://www.ustar-consortium.com/qws/slot/u50227/index.html
3.1. Collecting the Data from the Web News
The first approach is used to collect the web news data. Today, the internet offers various types of resources, for example social media, blogs, Twitter, and news portals, which provide a lot of speech data that can be freely downloaded. Moreover, corpora built from internet resources have been shown to yield promising results [8] [9]. Therefore, speech data was first collected from the web news. The web data collection process lasted one year and involved two people, including the author.

3.1.1. Speech Corpus Preparation
The web news is downloaded from the sites of Myanmar Radio and Television (MRTV) and Voice of America (VOA), and from the Facebook pages of Eleven Broadcasting, 7days TV, ForInfo news, GoodMorningMyanmar, British Broadcasting Corporation (BBC) Burmese news, and breakfast news. Both local and foreign news are contained in the corpus. The web news videos are converted to wave file format, and the audio files are then segmented with Praat 2. All audio files are formatted with a sample frequency of 16,000 Hz and a mono channel. The length of each audio file is between 2 and 30 seconds.

3.1.2. Speaker Information
The news presenters are professional, well experienced, and well trained; they therefore have clear voices in news broadcasts. Female presenters dominate the web news, so fewer male than female speakers are involved in this corpus. The speakers are under 35 years of age.

3.1.3. Text Corpus Preparation
Most broadcast news items from the web have transcriptions; where they are unavailable, the transcriptions are done manually, using Myanmar3 Unicode. Word segmentation is done by hand, as the Myanmar language has no word boundaries.
This segmentation is performed based on the Myanmar-English dictionary [10], which is also applied to check the spelling of the words. The average utterance length in this corpus is 33 words, or 54 syllables. The web news data has 8,973 unique sentences and 11,040 unique words. Example news sentences from the corpus are shown in Figure 1; the format of each sentence is the utterance ID followed by the transcription.

Figure 1. Example sentences of the corpus on news

3.2. Recording Daily Conversations
The second approach (designing the text corpus first and recording the speech by reading the collected text) was used for the conversational data. Data recording took 3 months, and 11 people were involved in the speech and text segmentation.

3.2.1. Text Corpus Preparation
The daily English conversations from the ASEAN language speech translation project through U-STAR are translated into Myanmar for text corpus building. The conversational data contains 2,156 unique sentences and 1,740 unique words. There are 2,000 sentences from the ASEAN language speech translation data, covering conversations in hotels, restaurants, streets, telephone calls, etc. The remaining 156 daily conversational sentences are collected from the web. The spelling of the text is manually checked, and the words are segmented as for the news data. The sentences in this domain are shorter than those of the news domain: the average conversational sentence length is 11 words, or 15 syllables. Example sentences for the daily conversational domain are shown in Figure 2.

2 http://www.fon.hum.uva.nl/praat/
The format of each sentence is similar to that of the news domain (utterance ID followed by each utterance).

Figure 2. Example sentences of the conversational data

3.2.2. Speaker Information
The sentences are recorded by 4 male and 42 female speakers, who are faculty members and students of the University of Computer Studies, Yangon, Myanmar. Since females outnumber males in our university, many female speakers are represented in the corpus. The speakers are between 19 and 40 years of age.

3.2.3. Speech Recording and Segmentation
The recording work was done in a laboratory of our university, a very quiet room with no external effects such as echo and background noise. It is also a healthy place to work, where people can breathe well and feel relaxed. A Tascam DR-100MKIII 3 was used for speech recording; it is intended for audio designers and engineers and has an easy-to-use interface with robust reliability. The audio files are formatted with a sample frequency of 16,000 Hz, a mono channel, and 16-bit encoding. The recorded files are segmented with the Audacity tool 4, and the silent portion of each utterance is discarded. In a speech corpus, audio and text data should be aligned, so each recorded sentence is listened to, checked against its corresponding text transcription, and corrected as necessary. If a speaker's voice is not clear, the recording is repeated until it is satisfactory and smooth. All speakers read at a normal pace.

3.2.4. Normalization of Transcriptions
Some of the transcriptions of broadcast news and daily conversations obtained online consist of non-standard words.
These include numbers, dates, abbreviations, acronyms, symbols, and English names, such as names of organizations, things, persons, animals, social media, etc. The pronunciations of these words cannot be found in the dictionary, so text normalization and transliteration into Myanmar are necessary. In this work, those words are manually transcribed into Myanmar words by the transcribers while listening to the corresponding audio. Table 1 shows example words that need to be normalized.

Table 1. Examples of text normalization

Description   Example                          Normalization
Date          ၂၀၁၆-၂၀၁၇ (2016-2017)            ေထာင ဆယ ေြခာက ေထာင ဆယ နစ
Time          နာရ ၅၅ နစ (3 hours 55 minutes)   နာရ ငါး ဆယ ငါး နစ
Number        ၁၁၄ ဦး (114 persons)             တစ ရာ တစ ဆယ ေလး ဦး
Digit         09-448045577                     က ေလး ေလး ေလး ငါး ငါး နစ နစ
Acronym       FDA                              အက ေအ
Person Name   Mr. Filippo Grandi               မစ တာ လစ ဂရမ းဒ

3 https://tascam.com/us/product/dr-100mkiii/top
4 https://www.audacityteam.org/
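The digit-by-digit reading used for phone numbers in Table 1 can be sketched as a small lookup-table rule. The romanized digit readings below are illustrative placeholders standing in for the Myanmar number words, not the authors' actual transcriptions:

```python
# Hedged sketch of digit-by-digit normalization, as applied to entries
# like "09-448045577" in Table 1. The romanized readings are placeholder
# assumptions, not the corpus's real Myanmar spellings.
DIGIT_READING = {
    "0": "thoun nya", "1": "tiʔ", "2": "hniʔ", "3": "thoun",
    "4": "lei", "5": "nga", "6": "chauʔ", "7": "khun niʔ",
    "8": "shiʔ", "9": "kou",
}

def normalize_digits(token: str) -> str:
    """Spell out a digit string one digit at a time (phone-number style)."""
    return " ".join(DIGIT_READING[c] for c in token if c.isdigit())

print(normalize_digits("09-44"))  # each digit read out in sequence
```

Dates, times, and quantities would need additional place-value rules (hundreds, tens, and units), which is one reason such items are transcribed manually while listening to the audio.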
3.3. Phone Coverage in the Speech Corpus
Phone coverage is vital for improving ASR accuracy. The Myanmar-English dictionary developed by the Myanmar Language Commission (MLC) [10] is used as the baseline, and this dictionary is extended with the vocabulary of the speech corpus. There are 38,376 words in the lexicon. The training set contains 67 phonemes and covers 94.37% of the phonemes. Table 2 shows examples from the Myanmar lexicon.

Table 2. Example of the Myanmar lexicon

Myanmar Word       Phoneme
(Dump)             /a̰/
အားကစား (Sport)     ɡəzá/
အာကာသ (Space)      θa̰/

The distributions of consonant and vowel phonemes occurring in the speech corpus are analyzed. The frequencies of the consonant distribution are given in Figure 3. The phoneme /j/ occurs most often in the corpus, because it represents medials such as ◌ျ /ya̰ pḭ̃/ and ြ◌ /ya̰ yiʔ/, and certain consonants are also mapped to the /j/ phoneme. The second most frequent phoneme is /d/, because two consonants are represented by the same phoneme /d/. The Myanmar word တ /trḭ/ rarely appears in the language, so its pronunciation phoneme /tr/ is found only once in the texts. A few nasal phonemes, /ng/ and /nj/, are found.

Figure 3. Consonant phoneme distribution of the UCSY-SC1 corpus

The frequency of the vowel distribution of the corpus is shown in Figure 4. All vowel phonemes appear in the corpus. The most frequent is the phoneme /a/ with tone 1, and most word pronunciations are formed with this vowel phoneme. For example, the word ေကာင /káʊɴ/ is composed of the phonemes /k/ + /a/ + /un:/, and က /káɪɴ/ is formed by the combination /k/ + /a/ + /in:/. The second most frequent phoneme is /a-/ with the neutral tone.
In the Myanmar language, the basic vowels (/i/ /ì/, /ei/ /èi/, /e/ /è/, /a/ /à/, /o/ /ɔ̀/, /ou/ /ò/, /u/ /ù/) have their own properties. When these vowels are influenced by contextual sounds, they change to neutralized vowels as their own properties weaken; therefore, most Myanmar words appear with the neutral tone in the corpus. For example:

/ná/ + /jwɛʔ/ ==> /nə/ + /jwɛʔ/

Most of the nasalized vowels, such as /ai/ /aiʔ/, /an./ /a̰ɴ/, /ei/ /eɪʔ/, /in./ /ḭɴ/, /u/ /ʊ/, and /un./ /ṵ̃/, are the least frequent phonemes in the corpus.
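The phoneme-distribution analysis described in this section amounts to a frequency count over the pronunciation lexicon. A minimal sketch in Python (the toy lexicon entries below are hypothetical placeholders, not actual MLC lexicon data):

```python
from collections import Counter

# Toy lexicon: word -> phoneme sequence. Entries are illustrative
# placeholders, not the corpus's real pronunciations.
lexicon = {
    "word1": ["k", "a", "un:"],
    "word2": ["k", "a", "in:"],
    "word3": ["j", "a"],
}

def phoneme_counts(lex):
    """Count how often each phoneme occurs across all pronunciations."""
    counts = Counter()
    for phones in lex.values():
        counts.update(phones)
    return counts

counts = phoneme_counts(lexicon)
print(counts.most_common(2))  # most frequent phonemes first
```

Run over the full lexicon weighted by word frequency in the transcriptions, the same count produces the consonant and vowel distributions plotted in Figures 3 and 4.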
Figure 4. Vowel phoneme distribution of the UCSY-SC1 corpus

3.4. Statistics of the Corpus
The speech corpus consists of two domains: web news and conversational data. Detailed statistics are shown in Table 3. The corpus contains 306,088 words, of which 11,696 are unique; nearly 37% of the unique words occur only once, about 5% appear between 100 and 1,000 times, and only about 1% appear more than 1,000 times.

Table 3. UCSY-SC1 corpus statistics

Data                  Size             Female  Male  Total  Utterances  Unique Words
Web News              25 Hrs 20 Mins     177    84    261       9,066         9,956
Daily Conversations   17 Hrs 19 Mins      42     4     46      22,048         1,740
Total                 42 Hrs 39 Mins     219    88    307      31,114        11,696

4. EVALUATION ON THE CORPUS
In this work, experiments are conducted to evaluate the quality of the speech corpus for Myanmar ASR.

4.1. Experimental Setup
This section details the experimental setup for the data sets and the acoustic and language models. The impact of training data size on ASR performance is investigated: four data sizes (10, 20, 30, and 42 hours) are used for incremental training. Detailed statistics of the train and test sets are displayed in Table 4. TestSet1 is open test data consisting of web news. TestSet2 is also open test data; it is conversational data from native speakers recorded with voice recorders and microphones.

Table 4. Statistics of the train and test sets

Data       Size             Female  Male  Total  Utterances
TrainSet   10 Hrs 5 Mins        79    23    102       3,530
           20 Hrs 2 Mins       126    52    178       7,332
           30 Hrs 3 Mins       174    86    260      15,556
           42 Hrs 39 Mins      219    88    307      31,114
TestSet1   31 Mins 55 Sec        5     3      8         193
TestSet2   32 Mins 40 Sec        3     2      5         887
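The word-frequency profile reported in Section 3.4 (share of unique words occurring once, 100-1,000 times, and over 1,000 times) can be computed from the corpus token stream with a simple counter; a minimal sketch:

```python
from collections import Counter

def frequency_bands(tokens):
    """Share of unique words per occurrence band, as in the corpus statistics."""
    freq = Counter(tokens)
    n_unique = len(freq)
    once = sum(1 for c in freq.values() if c == 1)
    mid = sum(1 for c in freq.values() if 100 <= c <= 1000)
    high = sum(1 for c in freq.values() if c > 1000)
    return {
        "unique": n_unique,
        "once_pct": 100.0 * once / n_unique,
        "100_1000_pct": 100.0 * mid / n_unique,
        "over_1000_pct": 100.0 * high / n_unique,
    }
```

Applied to the segmented transcriptions, this yields the unique-word count and the percentage bands quoted above.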
4.2. GMM-based Acoustic Model
The Kaldi speech recognition toolkit [11] is used to develop the experiments. For GMM-based acoustic model training, standard 13-dimensional Mel-Frequency Cepstral Coefficient (MFCC) features and their first and second derivatives, without energy features, are applied. Cepstral mean and variance normalization (CMVN) is then performed on the features. Linear discriminant analysis (LDA) is used to splice 9 frames together and project down to 40 dimensions, and a maximum likelihood linear transform (MLLT) is estimated on the LDA features. Feature-space maximum likelihood linear regression (fMLLR) is used for speaker adaptive training (SAT). The baseline GMM model has 2,052 context-dependent (CD) triphone states and an average of 34 Gaussian components per state.

4.3. CNN and DNN Acoustic Models
As input features, 40-dimensional log mel-filterbank features are applied for the CNN and DNN acoustic models. The DNN uses 4 hidden layers with 300 units per hidden layer. The CNN uses 256 and 128 feature maps in the first and second convolutional layers, with filter sizes of 8 and 4 respectively; the pooling size is 3 with a pool step of 1, and the fully connected network has 2 hidden layers with 300 units per hidden layer. Cross-entropy training is performed on the CNN and DNN acoustic models, and restricted Boltzmann machines (RBMs) are built on top of the CNN training. Additionally, a 6-layer DNN is trained with cross-entropy, and 6 iterations of state-level minimum Bayes risk (sMBR) discriminative training are performed [12]. The training procedure of the CNN (sMBR) is depicted in Figure 5. A constant learning rate of 0.008 is used to train the neural networks; the learning rate is then halved according to the reduction in cross-validation error.
When the error rate stops decreasing or starts increasing, training is stopped. Stochastic gradient descent with mini-batches of 256 training examples is applied for backpropagation. A Tesla K80 GPU is used for all neural network training.

Figure 5. Training flow of CNN (sMBR)

4.4. Experimental Results
In this experiment, ASR performance is evaluated on different corpus sizes. Three acoustic models, GMM, DNN, and CNN, are developed and their results compared. Convolutional Neural Networks (CNNs) have achieved better performance than Deep Neural Networks (DNNs) and Gaussian Mixture Models (GMMs) in various large-vocabulary continuous speech recognition (LVCSR) tasks [13] [14] [15], because the fully connected nature of a DNN can cause overfitting, which reduces ASR performance for low-resourced languages. A CNN can model tone patterns well because of its ability to handle translational variations and spectral correlations in the input signal. Furthermore, since sequence-discriminative training minimizes the error on the state labels in a sentence, DNN sequence training is performed on top of the CNN training. In this work, CNN (sMBR) clearly outperforms the GMM and DNN acoustic models for Myanmar, a low-resourced tonal language.

Figures 6 and 7 show the word error rates (WERs) on TestSet1 and TestSet2 as a function of training data size. According to Figure 6, when the training set grows from 10 to 20 hours, the WERs on TestSet1 decrease considerably, because this data is from the same domain as the training sets. However, the error rates on TestSet1 are not reduced notably when the training data grows from 30 to 42 hours, because the added data is from a different domain.
In Figure 7, the word error rates on TestSet2 decrease markedly as the training data size increases. This is because the data added to the 30-hour and 42-hour training sets is from the same domain as TestSet2, which diminishes its word error rates. It can be clearly observed that WERs drop as the amount of training data increases: the largest training set, 42 hours, gives the lowest WERs on both test sets. Thus, training data size has a great impact on ASR performance.
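The WER figures quoted throughout this evaluation are the standard measure: the word-level edit distance (substitutions, deletions, and insertions) between reference and hypothesis, divided by the reference length. A minimal implementation:

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance between word sequences,
    divided by the reference length, as a percentage."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return 100.0 * dp[len(r)][len(h)] / len(r)
```

For example, wer("a b c d", "a x c") gives 50.0: one substitution plus one deletion over four reference words.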
Figure 6. Word error rate on TestSet1 versus training data
Figure 7. Word error rate on TestSet2 versus training data

According to the evaluation results, the error rates on TestSet1 are lower than those on TestSet2. This is because the news presenters have clearer and sharper voices than the speakers in the recorded conversational data, and the total length of the web news data is greater than that of the recorded conversational data. CNN outperformed DNN and GMM on both test sets; as a result, CNN (sMBR) achieves the lowest WERs of 15.61% on TestSet1 and 24.43% on TestSet2.

5. CONCLUSION
This paper introduces the UCSY-SC1 corpus for Myanmar speech processing research. The corpus consists of two domains: web news and daily conversational data recorded by ourselves. A detailed description of the collection of text and speech for each domain is presented. The total duration of the UCSY-SC1 corpus is 42 hours and 39 minutes, with 261 speakers for the web news and 46 for the conversational domain. The phone coverage of the corpus is also analyzed. The speech corpus is used as training data for building Myanmar ASR, a milestone for Myanmar ASR development. The effect of training data size on recognition accuracy is analyzed by means of GMM, DNN, and CNN acoustic models, evaluated on two test sets: web news and recorded conversational data. The accuracy on web news data is better than that on the recorded conversational data, and the CNN (sMBR)-based model
outperforms the GMM and DNN models, leading to the lowest error rates of 15.61% WER on TestSet1 and 24.43% WER on TestSet2 using this corpus. As Myanmar is a low-resourced language, creating speech corpora is essential, and it is believed that this corpus will be of use for future Myanmar speech processing research. The corpus will be further expanded with more speech data, and Myanmar ASR will be developed by means of the end-to-end learning approach.

ACKNOWLEDGMENTS
We would like to thank all faculty members and students of the University of Computer Studies, Yangon, for participating in the data collection task.

REFERENCES
[1] J. Xu, et al., "Agricultural Price Information Acquisition Using Noise-Robust Mandarin Auto Speech Recognition," International Journal of Speech Technology, vol/issue: 21(3), pp. 681-688, 2018.
[2] M. O. M. Khelifa, et al., "Constructing Accurate and Robust HMM/GMM Models for an Arabic Speech Recognition System," International Journal of Speech Technology, vol/issue: 20(4), pp. 937-949, 2017.
[3] N. D. Londhe, et al., "Chhattisgarhi speech corpus for research and development in automatic speech recognition," International Journal of Speech Technology, vol/issue: 21(2), pp. 193-210, 2018.
[4] P. Zelasko, et al., "AGH corpus of Polish speech," Language Resources and Evaluation Journal, vol. 50, 2015.
[5] A. N. Mon, et al., "Developing a Speech Corpus from Web News for Myanmar (Burmese) Language," 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1-6, 2017.
[6] U. T. Htun, "Some Acoustic Properties of Tones in Burmese," in South-East Asian Linguistics 8: Tonation, D. Bradley, Ed. (Australian National University, Canberra, 1982), pp. 77-116, 1982.
[7] T. Nadungodage, et al., "Developing a Speech Corpus for Sinhala Speech Recognition," ICON-2013: 10th International Conference on Natural Language Processing, CDAC Noida, India, 2013.
[8] J. Staš, et al., "TEDxSK and JumpSK: A New Slovak Speech Recognition Dedicated Corpus," Journal of Linguistics/Jazykovedný časopis, vol/issue: 68(2), pp. 346-354, 2017.
[9] M. Ziołko, et al., "Automatic Speech Recognition System Dedicated for Polish," INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, pp. 3315-3316, 2011.
[10] M. L. Commission, "Myanmar-English Dictionary," Department of the Myanmar Language Commission, Yangon, Ministry of Education, Myanmar, 1993.
[11] D. Povey, et al., "The Kaldi Speech Recognition Toolkit," IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.
[12] K. Vesely, et al., "Sequence-discriminative Training of Deep Neural Networks," INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France, pp. 2345-2349, August 25-29, 2013.
[13] W. Chan and I. Lane, "Deep Convolutional Neural Networks for Acoustic Modeling in Low Resource Languages," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, pp. 2056-2060, 2015.
[14] T. N. Sainath, et al., "Improvements to Deep Convolutional Neural Networks for LVCSR," 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 315-320, 2013.
[15] T. Sercu, et al., "Very Deep Multilingual Convolutional Neural Networks for LVCSR," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, pp. 4955-4959, 2016.