International Journal of Electrical and Computer Engineering (IJECE)
Vol. 7, No. 6, December 2017, pp. 3358-3368
ISSN: 2088-8708, DOI: 10.11591/ijece.v7i6.pp3358-3368

Recent advances in LVCSR: A benchmark comparison of performances

Rahhal Errattahi and Asmaa El Hannani
Laboratory of Information Technology, National School of Applied Sciences, University of Chouaib Doukkali, El Jadida, Morocco

Article history: Received Nov 26, 2016; Revised Jun 28, 2017; Accepted Jul 10, 2017

Keywords: Large Vocabulary Continuous Speech Recognition; Automatic Speech Recognition; Word Error Rates; Deep Neural Networks; Hidden Markov Models; Gaussian Mixture Models

ABSTRACT
Large Vocabulary Continuous Speech Recognition (LVCSR), which is characterized by a high variability of the speech, is the most challenging task in automatic speech recognition (ASR). Believing that the evaluation of ASR systems on relevant and common speech corpora is one of the key factors that help accelerate research, we present, in this paper, a benchmark comparison of the performances of current state-of-the-art LVCSR systems over different speech recognition tasks. Furthermore, we objectively put into evidence the best performing technologies and the best accuracy achieved so far in each task. The benchmarks have shown that Deep Neural Networks and Convolutional Neural Networks have proven their efficiency on several LVCSR tasks by outperforming the traditional Hidden Markov Models and Gaussian Mixture Models. They have also shown that, despite the satisfying performances in some LVCSR tasks, the problem of large-vocabulary speech recognition is far from being solved in some others, where more research efforts are still needed.

Copyright © 2017 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author: Rahhal Errattahi, Laboratory of Information Technology, National School of Applied Sciences, University of Chouaib Doukkali, El Jadida, Morocco. Email: errattahi.r@ucd.ac.ma

1. INTRODUCTION
Speech is a natural and fundamental communication vehicle which can be considered one of the most appropriate media for human-machine interaction. The aim of Automatic Speech Recognition (ASR) systems is to convert a speech signal into a sequence of words, either for text-based communication purposes or for device control. ASR is typically used when the keyboard becomes inconvenient, for example when our hands are busy or have limited mobility, when we are using the phone, when we are in the dark, or when we are moving around. ASR finds application in many different areas: dictation, meeting and lecture transcription, speech translation, voice search, phone-based services and others. These systems are, in general, extremely dependent on the data used for training the models, the configuration of the front-ends, etc. Hence, a large part of system development usually involves investigating the appropriate configurations for a new domain, new training data and a new language.
There are several tasks of speech recognition, and the differences between these tasks rest mainly on: (i) the speech type (isolated or continuous speech), (ii) the speaker mode (speaker dependent or independent), (iii) the vocabulary size (small, medium or large) and (iv) the speaking style (read or spontaneous speech). Even though ASR has matured to the point of commercial applications, Speaker Independent Large Vocabulary Continuous Speech Recognition tasks (commonly designated as LVCSR) pose a particular challenge to ASR technology developers. Three of the major problems that arise when LVCSR systems are being developed are the following. First, speaker-independent systems require a large amount of training data in order to cover speaker variability. Second, continuous speech recognition is very complex because of the difficulty of locating word boundaries and the high degree of pronunciation variation due to dialects, coarticulation and noise, unlike isolated word
speech recognition, where the system operates on single words at a time [1, 2, 3]. Finally, with a large vocabulary, it becomes increasingly harder to find sufficient data to train the acoustic models and even the language models. Thus, subword models are usually used instead of word models, which negatively affects the performance of the recognition. Moreover, LVCSR tasks themselves vary in difficulty; for example, the read speech task (human-to-machine speech, e.g. dictation) is much easier than the spontaneous speech task (human-to-human speech, e.g. telephone conversation).
To deal with all these problems, a plethora of algorithms and technologies has been proposed by the scientific community for all steps of LVCSR over the last decade: pre-processing, feature extraction, acoustic modeling, language modeling, decoding and result post-processing. Many papers were dedicated to presenting an overview of the advances in LVCSR: [4, 5, 6, 7, 8]. However, the scope of those works focuses primarily on system architectures, the techniques used and the key issues. There is to date no work which has attempted to report and analyze the current advances in terms of performance across all the major tasks of LVCSR. Our ambition has been to fill this gap. We carried out a benchmark comparison of the performances of the current state-of-the-art LVCSR systems and have covered different tasks: read continuous speech recognition, mobile voice search, conversational telephone speech recognition, broadcast news speech recognition, video speech recognition and distant conversational speech recognition. We tried to objectively put into evidence the best performing technologies and the best accuracy achieved so far in each task. Note that in this paper we only address the English language and that we have constrained the review to systems that have been evaluated on widely used speech corpora. This choice was forced by the lack of publications on some other corpora, and also because evaluating systems on relevant and common speech corpora is a key factor for measuring progress and discovering the remaining difficulties, especially when comparing systems produced by different labs.

Table 1. Overview of the selected LVCSR corpora

Corpus | Year | Type of data | Size | Audio source
Wall Street Journal I [9] | 1991 | Read excerpts from the Wall Street Journal | 80 hours, 123 speakers | Close-talking microphone
Switchboard [10] | 1993 | Phone conversations between strangers on an assigned topic | 250 hours, 543 speakers, 2400 conversations | Variable telephone handsets
CallHome [11] | 1997 | Phone conversations between family members or close friends | 120 conversations, up to 30 min each | Variable telephone handsets
Broadcast News [12, 13] | 1996/1997 | Television and radio broadcasts | LDC97S44: 104 hours; LDC98S71: 97 hours | Head-mounted microphone
AMI [14] | 2007 | Scenario and non-scenario meetings from various groups | 100 hours | Close-talking and far-field microphones
Bing Mobile Data [15] | 2010 | Mobile voice queries | 21400 hours | Variable mobile phones
Google Voice Search Data | - | Mobile Voice Search and Android Voice Input | 5780 hours | -
YouTube Video | - | Videos from YouTube | 1400 hours | -
2. COMPARING STATE-OF-THE-ART LVCSR SYSTEMS PERFORMANCES
The industry and research community can benefit greatly when different systems are evaluated on common ground, and particularly on the same speech corpora. In this perspective, we report the recent progress in the area of LVCSR on a selection of the most popular English speech corpora, with vocabularies ranging from 5K to more than 65K words and content ranging from read speech to spontaneous conversations. Only corpora with recent publications were considered. An overview of the properties of the chosen sets is given in Table 1. In the following subsections, we briefly introduce each task and the datasets used, and report the performances of systems produced by different labs.
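All results in this comparison are reported in terms of Word Error Rate (WER). As a reminder (this is the standard definition, not specific to any of the cited systems), the WER is derived from the minimum edit-distance alignment between the recognizer output and the reference transcript:

\[ \text{WER} = \frac{S + D + I}{N} \times 100\% , \]

where S, D and I are respectively the numbers of substituted, deleted and inserted words in the alignment and N is the number of words in the reference; for instance, a 20-word reference recognized with 2 substitutions, 1 deletion and 1 insertion yields a WER of (2+1+1)/20 = 20%. A minimal Python sketch of this computation (illustrative only; the function name and interface are ours and are not taken from any toolkit used in the cited papers):

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length (in %)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: word_error_rate("the cat sat", "the cat sat down") -> 33.3 (one insertion)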
2.1. Read continuous speech recognition task
Early speech recognition systems were often designed for read speech transcription tasks (dictation). As implied by the name, the data used in this domain consists of read sentences, in general in a speaker-independent mode. Its popularity arose because the lexical and syntactic content of the data can be controlled and it is significantly less expensive to collect than spontaneous speech. The primary applications in this domain include the dictation of notes and the transcription of important information by some professionals (e.g. medical, military and law) and by persons with learning disabilities (e.g. dyslexia and dysgraphia), limited motor skills or vision impairment.
The Wall Street Journal corpus I (also known as CSR-I or WSJ0) [9] is a reference corpus in the field; it is an American English read speech corpus with texts taken from machine-readable Wall Street Journal news, recorded under clean conditions. The systems presented here were evaluated on the November 1992 ARPA CSR (Nov-92) benchmark test set, a 5K-word closed-vocabulary subset derived from the WSJ0 corpus which consists of 330 utterances from 8 speakers.
Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) have been used extensively since the beginning of research in the area of speech recognition. More than 40 years later, they still predominate and are usually used as baselines when it comes to comparing systems with different acoustic models. Besides this, several techniques were developed around the HMMs/GMMs in order to improve the performance of ASR systems.

Table 2. Word Error Rates (WER) in % on the Nov-92 subset of the WSJ0 corpus using bigram and trigram language models

Acoustic model / Features | Bigram | Trigram
MLP-HMM [16] | 8.5 | 6.5
RC-HMM [17] | 6.2 | 3.9
GMM-HMM (ML) [17] | 6.0 | 3.8
GMM-HMM (MMI+VTLN) [16] | - | 3.0
DNN-HMM (STC features) [18] | 5.2 | -

Table 2 shows a recapitulation of the key performances of some state-of-the-art systems in the field. Two of the systems in the list are based on GMM-HMM acoustic models: one was trained using the maximum-likelihood (ML) criterion [17], while the other used the maximum mutual information (MMI) criterion with vocal tract length normalization (VTLN) [16]. Triefenbach et al. [17] also proposed a Reservoir Computing (RC) HMM hybrid system for phoneme recognition using a bigram phonotactic utterance model. The RC-HMM performs significantly better than the MLP-HMM hybrid proposed by Gemello et al. [19]. However, it is still outperformed by the GMM system with VTLN. Another study [18] demonstrated the effectiveness of Deep Neural Networks (DNNs) in speech recognition. The best result of this study belongs to a DNN system with 5 hidden layers, where each hidden layer has 2048 nodes. In terms of complexity, both the DNN-HMM and the RC-HMM involve a massive number of parameters in the training stage. On the other hand, the GMM-HMM is much more efficient, as it reaches good performance with a small number of parameters. The results obtained with the bigram language model show that the DNN-HMM acoustic model presents the best performance on the WSJ0 task; this performance could be further enhanced using a trigram language model.
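The bigram/trigram distinction in Table 2 refers to the order of the n-gram language model used during decoding. As a standard reminder (not specific to the cited systems), an n-gram model approximates the probability of a word sequence by conditioning each word on its n-1 predecessors:

\[ P(w_1^T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-1}) \;\; \text{(bigram)}, \qquad P(w_1^T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-2}, w_{t-1}) \;\; \text{(trigram)}, \]

so the trigram model exploits a longer lexical context, which is consistent with the systematically lower WERs in the trigram column of Table 2.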
Generally, using a trigram language model was crucial and clearly superior to using bigram models over the Nov-92 test set in the various studies.

2.2. Voice search speech recognition task
Voice search is the technology allowing users to use their voice to access information. The advent of smartphones and other small, Web-enabled mobile devices in recent years has spurred more interest in voice search, especially in usage scenarios where our hands are busy or have limited mobility, when we are using the phone, when we are in the dark, or when we are moving around. There is a plethora of mobile applications which allow users to give speech commands to a mobile device, either for search purposes (e.g. web search, maps, directions, travel resources such as airlines, hotels, etc.) or for question-answering assistance. Mobile voice search speech recognition is considered one of the most challenging tasks in the field of speech recognition due to many factors: the utterances tend to be very short, yet unconstrained and open-domain. Hence, the
vocabularies are unlimited with unpredictable input, and there is a high degree of acoustic variability caused by noise, side-speech, accents, sloppy pronunciation, hesitation, repetition, interruptions, and mobile phone differences. In this section we report results on two voice search applications that have been built in recent years: Google Voice Input and Bing mobile voice search.

2.2.1. Google Voice Search
Google Voice Search transcribes speech input used for user interaction with mobile devices, such as voice search queries, short messages and emails. The Google Voice Search system was trained using approximately 5780 hours of data from mobile Voice Search and Android Voice Input.

Table 3. WER in % on a test set from the Google live Voice Input dataset

Acoustic model | WER
GMM-HMM [20] | 16.0
DNN(DBN)-HMM [20] | 12.3
+ MMI discriminative training [20] | 12.2
DNN + GMM (combination) [20] | 11.8

The baseline GMM-HMM system [20] created by Google's group consisted of triphone HMMs with decision-tree clustered states, and used PLP features transformed by Linear Discriminant Analysis (LDA). The GMM-HMM model was trained discriminatively using the Boosted-MMI criterion. All these parameters generate a context-dependent model with a total of 7969 states. The same data were used to train a Deep Belief Network (DBN) based DNN acoustic model to predict the 7969 HMM states. The DBN-DNN used in [20] was composed of four hidden layers with 2560 nodes per hidden layer, a final layer with 7969 states, and an input layer of 11 contiguous frames of 40 log filter-bank coefficients modelled with a DBN. As shown in Table 3, the DNN(DBN)-HMM system achieved a 23% relative reduction over the baseline. Further improvement resulted from combining both systems using the segmental conditional random field (SCARF) framework; this combination gave a word error rate of 11.8%.

2.2.2. Bing mobile voice search
Bing mobile voice search, known as Live Search for Mobile (LS4M) [15], is a mobile application that allows users to do web-based search (e.g. maps, directions, traffic, and movies) from their mobile phones. LS4M was developed by Microsoft and was trained on a data set of around 24 hours with a high degree of acoustic variability.

Table 4. WER in % on the Bing Voice Search set

Acoustic model | WER
DNN-HMM (no PT) [21] | 37.1
DNN-HMM (with PT) [21] | 35.4
CNN-HMM (no PT) [21] | 34.2
CNN-HMM (with PT) [21] | 33.4
RNNLM [22] | 23.2

Abdel-Hamid et al. [21] at Microsoft Research investigated the performance of both DNNs and CNNs on an LVCSR task and the effects of RBM-based pretraining on their recognition performance. Both models were trained using a subset of 18 hours from the Bing Voice Search task in order to predict the triphone HMM state labels. The DNN architecture consisted of three hidden layers, while the CNN had one pair of convolution and pooling plies in addition to two hidden fully connected layers. The results in Table 4 show that the CNN outperforms the DNN on the Bing Voice Search task, providing about an 8% relative error reduction without pretraining and a relative word error rate reduction of 6% when using pretraining. According to Abdel-Hamid et al. [21], pretraining is more effective for the CNN than for the DNN.
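The relative reductions quoted in this section follow the usual convention of normalizing the absolute WER difference by the baseline WER:

\[ \Delta_{\text{rel}} = \frac{\text{WER}_{\text{baseline}} - \text{WER}_{\text{system}}}{\text{WER}_{\text{baseline}}} , \]

so, for example, the 23% figure for the DBN-DNN of Table 3 corresponds to (16.0 - 12.3)/16.0 ≈ 0.23, and the roughly 8% figure for the non-pretrained CNN of Table 4 to (37.1 - 34.2)/37.1 ≈ 0.08.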
Another study [22] suggests applying Recurrent Neural Network Language Models (RNNLMs) directly in the first pass of speech recognition decoding, which outperforms both of the n-gram-based systems above (the DNN and the CNN) on the Bing Voice Search task with a word
error rate of 23.2%. However, the computational expense of RNNLMs is very high; to reduce the cost of using RNNLMs, the authors propose cache-based RNN inference, which drops the runtime from 100xRT (when no caching is done) to just under 1.2xRT, where xRT denotes the ratio of processing time to audio duration. Though the experimental setup was not described in sufficient detail in either paper, we can only assume that the 10% absolute improvement in WER of the RNNLM system over the CNN system could be due to differences in the amount of training data or differences in the Bing Voice Search subset used to evaluate the systems.

2.3. Conversational telephone speech recognition task
Owing to the revolution in the telecommunication domain, people all over the world spend millions of hours communicating via their phones. For many reasons, such as security, the transcription of spontaneous casual speech, and particularly conversational telephone speech, has become indispensable. Transcribing this type of speech is very challenging due to many factors, including poor articulation, increased coarticulation, highly variable speaking rate, and various types of disfluency such as hesitations, false starts, and corrections.
We report pertinent results on a highly challenging test set, the NIST 2000 Hub5 (Hub5'00). Hub5'00 is composed of two subsets: an "easy" split which contains 20 conversations from the Switchboard corpus [10, 23], and a "hard" split containing 20 conversations from the CallHome corpus [11]; results are often reported on the easier portion alone. Switchboard is a corpus of American English spontaneous conversational telephone speech; it is composed of about 2,400 two-sided telephone conversations between 543 speakers (302 male, 241 female) from all areas of the United States. The CallHome corpus consists of 120 unscripted telephone conversations between native speakers of English, mostly between family members or close friends overseas. The CallHome data is harder to recognize than Switchboard, partly due to a greater presence of foreign-accented speech.
In the last two years, several labs have conducted benchmarking experiments using the Switchboard corpus [24, 25, 26, 27, 28, 29]. In Table 5, we summarize the best performing systems on the Hub5'00 dataset splits. All systems have been trained on the 300-hour Switchboard dataset, except the Deep Speech system from [26], which has been trained on both the Switchboard and Fisher datasets. The Fisher corpus [30] offers 2000 hours of conversational telephone speech collected in a similar manner to Switchboard.

Table 5. WER in % on the Switchboard subset "SWB" of the Hub5'00 dataset

Acoustic model | SWB
GMM/HMM fBMMI [31] | 14.5
DNN-HMM-sMBR fMLLR [24] | 12.6
RNN (Deep Speech) [26] | 12.6
DNN fMLLR [31] | 12.2
CNN log-mel [31] | 11.8
CNN+DNN log-mel+fMLLR+I-vector [31] | 10.7
MLP/CNN+I-Vector [28] | 10.4

The GMM system in [31] was trained using speaker adaptation with VTLN and feature-space Maximum Likelihood Linear Regression (fMLLR), followed by feature- and model-space discriminative training with the Boosted Maximum Mutual Information (BMMI) criterion. The DNN-HMM sMBR system from [24] was trained on LDA+STC+fMLLR features on the full 300-hour training set, and was composed of 7 layers, where each hidden layer has 2048 neurons, and an output layer of 8859 units. Hannun et al.
[26] proposed an RNN-based system called Deep Speech that uses deep learning to learn from large datasets (more than 7380 hours). In [26], the authors used multi-GPU computation for training the RNN model, and a combination of collected and synthesized data, which makes the system able to learn robustness to realistic noise and speaker variation. The DNN system in [31] was trained using fMLLR features and was composed of 6 hidden layers, each containing 2048 sigmoidal neurons, and a softmax layer with 8192 output units. The CNN in [31] was trained using log-mel features, with an architecture consisting of two convolutional layers, each containing 512 hidden units, five fully connected layers, each containing 2048 hidden units, and a softmax layer with 8192 output units.
The results summarized in Table 5 show that the CNNs clearly outperform the other systems, giving a 20% relative word error rate improvement over the GMM/HMM system and a 3% relative word error rate
improvement over the hybrid DNN. Another form of system combination has been proposed in [28]: a jointly trained MLP/CNN model with I-Vectors, where the outputs of the first MLP hidden layer are combined with the outputs of the second CNN layer. This system has given the best result on this task so far (10.4% WER on the SWB split of Hub5'00).

2.4. Broadcast news speech recognition task
Broadcast News Automatic Speech Recognition (ASR BN) consists of recognizing speech from news-oriented content from either television or radio, including news, multi-speaker roundtable discussions, debates, and even open-air interviews outside the studio. The English Broadcast News Speech Corpus [12, 13] is one of the most common datasets in this domain. It is a collection of radio and television news broadcasts (from the ABC, CNN and CSPAN television networks and the NPR and PRI radio networks) with corresponding transcripts. The acoustic models of the systems reviewed in this section were trained on 50 hours of data from the 1996 (LDC97S44, 104 hours) and 1997 (LDC98S71, 97 hours) English Broadcast News Speech Corpora. State-of-the-art system performances are reported on both the EARS Dev-04f (3 hours from 6 shows) and RT-04 (6 hours from 12 shows) sets.

Table 6. WER in % on the Dev-04f and RT-04 English Broadcast News sets

Acoustic model | Dev-04f | RT-04
Baseline GMM/HMM [27] | 18.8 | 18.1
SAT MLP fBMMI [32] | 21.9 | 23.6
SAT DBN fBMMI [32] | 17.0 | 17.7
SAT GMM fMPE+MPE [33] | 16.5 | 14.8
SAT DNN cross-entropy [33] | 16.7 | 14.6
SAT DNN HF sMBR [33] | 15.1 | 13.4
Hybrid DNN [27] | 16.3 | 15.8
DNN-based Features [27] | 16.7 | 16.0
Hybrid CNN [27] | 15.8 | 15.0
CNN-based Features [27] | 15.2 | 15.0

In [27], the baseline GMM-HMM system was trained using speaker-based mean normalization with VTLN and an LDA transform to project the 13-dimensional MFCCs to 40 dimensions. Next, an fMLLR transform followed by a BMMI transform were applied to obtain a GMM system with 2220 quinphone states and 30k diagonal-covariance Gaussians. Sainath et al. [32] compared the performance of Deep Belief Networks (DBNs) to simple Multi-Layer Perceptrons (MLPs), where both the DBN and the MLP were trained with the same architecture (6 layers x 1,024 units) using speaker-adapted and fBMMI features with 2220 output targets. The DBN contains two types of Restricted Boltzmann Machines (RBMs) that were used to pre-train the weights of the Artificial Neural Networks (ANNs). For the first layer of the DBN, the authors used a Gaussian-Bernoulli RBM trained for 50 epochs; this layer takes 9 frames as inputs and produces 1,024 output features. For all subsequent layers, Bernoulli-Bernoulli RBMs are trained for 25 epochs and contain 1,024 hidden units. A further study by Sainath et al. [33] suggests different refinements to improve DNN training speed. The DNN systems use fMLLR features with 5,999 quinphone states and are composed of six hidden layers, each containing 1,024 sigmoidal units. In this study, the authors succeeded in reducing the number of parameters from 10.7M to 5.5M, a 49% reduction, thanks to low-rank factorization. A recent study by Sainath et al. [27] explored the performance of CNNs compared to DNNs. The Hybrid DNN has an architecture of 5 layers with 1024 hidden units per layer and a softmax output layer with 2220 target units. The DNN-based feature system has the same architecture, but with only 512 output targets.
Both the Hybrid CNN and the CNN-based feature systems are trained using VTLN-warped mel filter-bank, delta and double-delta features. Table 6 shows that RBM pre-training of the DBN improves the WER over the MLP for all feature sets. Following sMBR training, the DNN is the best model: it is 20% better than the baseline GMM on Dev-04f and 36% better on RT-04. Furthermore, the CNN-based features present competitive performance, with a 19% relative improvement over the baseline GMM-HMM. The performance of CNN-based features can reach 13.1% WER on Dev-04f and 12.0% WER on RT-04 [27] when a larger-scale task is used for training (400 hours of English Broadcast News).
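The low-rank factorization used in [33] to roughly halve the parameter count replaces the large weight matrix feeding the softmax output layer by a product of two thinner matrices. A generic illustration (the layer sizes follow the description above, but the rank value r is chosen here purely for illustration and is not taken from the cited paper):

\[ W \in \mathbb{R}^{h \times s} \approx A B, \qquad A \in \mathbb{R}^{h \times r}, \; B \in \mathbb{R}^{r \times s}, \qquad \text{parameters: } h s \;\rightarrow\; r(h + s). \]

With h = 1024 hidden units, s = 5999 output states and, say, r = 128, the output layer shrinks from about 6.1M to about 0.9M parameters, a saving of roughly 5.2M that is consistent with the overall 10.7M to 5.5M reduction reported in [33].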
2.5. Video speech recognition task
The advent of the web, low-cost digital cameras, and smartphones has significantly broadened the quantity as well as the reach of videos. The key challenge for many web video producers is making it easy for others, including hearing-impaired and non-native speakers, to find and enjoy their content. One way to do that is to use hand transcription, even though this solution can be time-consuming and expensive, and cannot cope with the huge amount of content being uploaded every minute to the internet. On the other hand, automatic video transcription represents an alternative solution to improve the accessibility of video content. In this section we chose to present the results of studies done by Google researchers on YouTube video data. The goal of this task is to transcribe YouTube data; unlike the previous tasks, YouTube data is extremely challenging for current ASR technology [34].
Jaitly et al. [20] used 1400 hours of YouTube data to train a Context-Dependent ANN/HMM with speaker-adapted features and 17552 triphone target states. The baseline system used 9 frames of MFCCs as inputs, transformed using LDA. The acoustic models were further improved with BMMI. During decoding, fMLLR and MLLR transforms were applied. For the DBN-HMMs, the acoustic data used in the training stage were the fMLLR-transformed features. For complexity reasons and to make the training faster, the ANN/HMM has an architecture of only 4 hidden layers, with 2000 units in the output layer and 1000 units in the layers above. In order to generate additional semi-supervised training data, Liao et al. [34] proposed to use the owner-uploaded video transcripts and DNN acoustic models. The proposed DNNs are fully-connected, feed-forward neural networks with sigmoid non-linearities and a softmax output layer, and were trained using minibatch stochastic gradient descent and back-propagation.
Reported results are summarized in Table 7. It should be noted that the training sets used in the experiments are not exactly the same, but both experiments were conducted on a comparable amount of data, namely 1400 hours in [20] vs 1781 hours in [34]; however, all results are reported on the same test set, YtiDev11 (6.6 hours, 2.4M frames). The baseline 7x1024 system with 7k output states from [34] outperforms all of the results previously reported in [20], while merging the wide hidden layer architecture of 2048 nodes with a low-rank approximation and a high number of CD states in the output layer yielded the best result on the YtiDev test set, 40.9% WER.

Table 7. WER in % on the YtiDev11 YouTube set

Acoustic model | WER
MFCC GMM, 18k states, 450k comps [20] | 52.3
DBN-HMM pretrained with sparsity [20] | 47.6
+ MMI [20] | 47.1
+ system combination with SCARF [20] | 46.2
Fbank DNN 7x1024, 7k states [34] | 44.0
Fbank DNN 6x2048, 7k states [34] | 42.7
Fbank DNN 7x1024, low-rank 256, 45k states [34] | 42.5
Fbank DNN 7x2048, low-rank 256, 45k states [34] | 40.9

2.6. Distant conversational speech recognition task
Distant conversational speech (DCS) recognition deals with speech captured using multiple distant microphones, typically configured in a calibrated array, and is very challenging since the speech signals to be recognized are degraded by the presence of overlapping talkers, background noise, and reverberation.
Classroom lectures, parliamentary meetings, and scientific meetings are the main applications of distant conversational speech recognition. In this section we chose to report results on the AMI Meeting corpus, prompted by the wide use of this corpus in several recent studies.
The AMI corpus [14] contains around 100 hours of meeting recordings from three European sites (UK, Netherlands, Switzerland). Each meeting usually has four participants and the meetings are in English; many of the participants are non-native English speakers. The AMI corpus was divided into train, development, and test sets, where about 78 hours of recorded meeting speech were used as the training set, and about 9 hours each were used as the development and test sets. Evaluations in the meeting domain are usually
conducted in three conditions: Single Distant Microphone (SDM), Multiple Distant Microphones (MDM) and Individual Headset Microphones (IHM).

Table 8. WER in % on the AMI set for various microphone configurations; SDM, MDM and IHM are respectively Single Distant Microphone, Multiple Distant Microphones and Individual Headset Microphones

Acoustic model | Dev SDM | Dev MDM | Dev IHM | Test SDM | Test MDM | Test IHM
GMM LDA+STC [35] | 63.2 | 54.8 | 29.4 | 67.66 | 59.4 | 31.6
DNN LDA+STC [35] | 55.4 | 51.4 | 26.7 | 59.8 | 56.0 | 28.4
DNN Fbank [35] | 55.8 | 51.1 | 28.3 | 60.8 | 55.6 | 31.5
CNN [36] | 52.5 | 46.3 | 25.6 | - | - | -

Swietojanski et al. [35] applied DNN-HMMs to the meeting speech recognition task. The authors compared their results to a conventional system based on GMMs. The baseline GMM-HMM systems have a total of 80000 Gaussians, and were discriminatively trained using BMMI with Linear Discriminant Analysis (LDA) and decorrelated using a semi-tied covariance (STC) transform. The DNNs were configured to have 6 hidden layers, with 2048 units in each hidden layer, and were trained using RBMs and either LDA+SAT features or FBANK features. A further study by Swietojanski et al. [36] suggests using CNNs for large vocabulary distant speech recognition, trained using the three types of microphone input: SDM, MDM or simply IHM. The CNNs were trained using Fbank features and were composed of a single CNN layer followed by five fully-connected layers. Table 8 shows that replacing the GMM with a DNN improves recognition accuracy for speech recorded with distant microphones, while in [36] the authors found that CNNs improve the WER by 6.5% relative compared to conventional deep neural network (DNN) models and by 15.7% over a discriminatively trained GMM baseline.

3. DISCUSSION
From a first look at the results reported over the different tasks using various acoustic models, we can see that the traditional GMM-based HMM models have been outperformed by several other models. Despite their ability to model the probability distributions over vectors of input features associated with each state of an HMM, GMMs have a serious shortcoming. As Hinton et al. [37] stated, "Despite all their advantages, GMMs have a serious shortcoming; they are statistically inefficient for modeling data that lie on or near a non-linear manifold in the data space". Therefore, other classifiers, which can better capture the properties of acoustic features, could deliver better accuracy than GMMs for acoustic modeling of speech. Machine learning algorithms are more efficient than the traditional GMM for acoustic modeling of speech. In particular, neural network (NN) based technology presents strong performance over a variety of speech recognition benchmarks. Their successes in acoustic modeling of speech come from their capability to classify data even with a small number of parameters and their potential to learn much better models of data that lie on or near a nonlinear manifold. DNNs, which are feed-forward artificial neural networks that have more than one layer of hidden units between their inputs and their outputs, are the new generation of NNs; they came to solve the problems of training time and overfitting by adding an initial stage of generative pre-training using RBMs. Performances of LVCSR systems vary from one domain to another, as summarized in Figure 1.

Figure 1. Best state-of-the-art performances over the six tasks (reported WER: RCS 3%, VSS 11.8%, CTS 16%, BNS 13.4%, VS 40.9%, DCS 46.3%). RCS, VSS, CTS, BNS, VS and DCS are respectively read continuous speech, voice search speech, conversational telephone speech, broadcast news speech, video speech and distant conversational speech.
In some domains, like read continuous speech, where the speech was generally recorded under clean conditions, results are satisfying, with an error rate under 5%. In other domains that contain more speech variation, such as video speech or distant conversational speech (meetings), results are not acceptable, with an error rate near 50%. This huge difference in performance is caused by the nature of the speech: the more natural and spontaneous the speech is, the more the error rate increases. The difficulties encountered in modeling spontaneous speech stem from many factors: foreign accents, extraneous words, out-of-vocabulary words, ungrammatical sentences, disfluency, partial words, repairs, hesitations, repetitions, style shifting, etc.
It must be said that in this paper we have constrained the benchmarks to performance in terms of word error rate, because the majority of researchers use it as the common measure to report the performance of their systems. However, other important aspects of ASR systems should also be taken into account in the future, such as efficiency and usability. Most of the systems presented in the literature require either lots of training data (thousands of
hours of speech and billions of words of text) or large computational expense, which is inefficient. Therefore, we believe there is a need for corpora and evaluations that include more objective criteria, oriented towards usability, in order to develop more user-centered ASR applications. It should also be noted that the ASR system must ensure reactiveness, by considering the real-time factor of the algorithms used, and robustness in the face of accents and impaired speech.

4. CONCLUSION
In this paper we have summarized the recent developments of LVCSR research and presented a benchmark comparison of the performances of ASR systems on different LVCSR tasks: read continuous speech recognition, mobile voice search, conversational telephone speech recognition, broadcast news speech recognition, video speech recognition and distant conversational speech recognition. Most of the presented results show that replacing GMMs with other machine learning algorithms gives competitive results. In particular, the DNN gives fascinating performances over a variety of speech recognition benchmarks. The biggest disadvantage of DNNs is their complexity; it is hard to train a large model on massive data sets. Nevertheless, we suggest that any improvement on a clean speech corpus such as WSJ is promising. On the other hand, more research is needed in several domains that are characterized by noisy and spontaneous speech, such as video and distant conversational speech.

REFERENCES
[1] T. Adam, et al., "Wavelet cepstral coefficients for isolated speech recognition," Indonesian Journal of Electrical Engineering and Computer Science, vol. 11, no. 5, pp. 2731-2738, 2013.
[2] N. R. Emillia, et al., "Isolated word recognition using ergodic hidden markov models and genetic algorithm," TELKOMNIKA (Telecommunication Computing Electronics and Control), vol. 10, no. 1, pp. 129-136, 2012.
[3] F. Jalili and M. J. Barani, "Speech recognition using combined fuzzy and ant colony algorithm," International Journal of Electrical and Computer Engineering (IJECE), vol. 6, no. 5, pp. 2205-2210, 2016.
[4] S. Young, "A review of large-vocabulary continuous-speech," IEEE Signal Processing Magazine, vol. 13, no. 5, pp. 45-57, Sept 1996.
[5] G. Zweig and M. Picheny, "Advances in large vocabulary continuous speech recognition," Advances in Computers, vol. 60, pp. 249-291, 2004.
[6] G. Saon and J.-T. Chien, "Large-vocabulary continuous speech recognition systems: A look at some recent advances," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 18-33, 2012.
[7] J. Baker, et al., "Developments and directions in speech recognition and understanding, part 1," IEEE Signal Processing Magazine, vol. 26, no. 3, pp. 75-80, 2009.
[8] J. Baker, et al., "Updated MINDS report on speech recognition and understanding, part 2," IEEE Signal Processing Magazine, vol. 26, no. 4, pp. 78-85, 2009.
[9] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in DARPA Speech and Language Workshop. Morgan Kaufmann Publishers, 1992.
[10] J. Godfrey and E. Holliman, "Switchboard-1 release 2 LDC97S62," 1993.
[11] A. Canavan, et al., "CallHome American English speech LDC97S42," Linguistic Data Consortium, Philadelphia, 1997.
[12] J. Fiscus, et al., "1997 English broadcast news speech (HUB4) LDC98S71," Linguistic Data Consortium, Philadelphia, 1997.
[13] D. Graff, et al., "1996 English broadcast news speech (HUB4) LDC97S44," Linguistic Data Consortium, Philadelphia, 1996.
[14] J. Carletta, "Unleashing the killer corpus: experiences in creating the multi-everything AMI meeting corpus," Language Resources and Evaluation, vol. 41, no. 2, pp. 181-190, 2007.
[15] A. Acero, et al., "Live search for mobile: web services by voice on the cellphone," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2008, pp. 5256-5259.
[16] G. Heigold, et al., "Discriminative HMMs, log-linear models, and CRFs: what is the difference?" in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2010, pp. 5546-5549.
[17] F. Triefenbach, K. Demuynck, and J.-P. Martens, "Large vocabulary continuous speech recognition with reservoir-based acoustic models," IEEE Signal Processing Letters, vol. 21, no. 3, pp. 311-315, March 2014.
[18] S. Siniscalchi, T. Svendsen, and C.-H. Lee, "A bottom-up modular search approach to large vocabulary continuous speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 4, pp. 786-797, April 2013.
[19] R. Gemello, et al., "Linear hidden transformations for adaptation of hybrid ANN/HMM models," Speech Communication, vol. 49, no. 10, pp. 827-835, 2007.
[20] N. Jaitly, et al., "Application of pretrained deep neural networks to large vocabulary speech recognition," in Proceedings of Interspeech, 2012.
[21] O. Abdel-Hamid, et al., "Convolutional neural networks for speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533-1545, 2014.
[22] Z. Huang, et al., "Cache based recurrent neural network language model inference for first pass speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2014, pp. 6354-6358.
[23] J. J. Godfrey, et al., "Switchboard: Telephone speech corpus for research and development," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1. IEEE, 1992, pp. 517-520.
[24] K. Vesely, et al., "Sequence-discriminative training of deep neural networks," in Interspeech, 2013.
[25] F. Seide, et al., "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, December 2011.
[26] A. Y. Hannun, et al., "Deep speech: Scaling up end-to-end speech recognition," CoRR, vol. abs/1412.5567, 2014.
[27] T. N. Sainath, et al., "Deep convolutional neural networks for LVCSR," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2013, pp. 8614-8618.
[28] H. Soltau, et al., "Joint training of convolutional and non-convolutional neural networks," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014.
[29] A. L. Maas, et al., "Increasing deep neural network acoustic model size for large vocabulary continuous speech recognition," arXiv preprint arXiv:1406.7806, 2014.
[30] C. Cieri, et al., "The Fisher corpus: a resource for the next generations of speech-to-text," in LREC, vol. 4, 2004, pp. 69-71.
[31] T. N. Sainath, et al., "Deep convolutional neural networks for large-scale speech tasks," Elsevier, Special Issue on Deep Learning, 2014.
[32] T. Sainath, et al., "Making deep belief networks effective for large vocabulary continuous speech recognition," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec 2011, pp. 30-35.