International Journal of Electrical and Computer Engineering (IJECE)
Vol. 7, No. 6, December 2017, pp. 3344-3357
ISSN: 2088-8708

Limited Data Speaker Verification: Fusion of Features

T. R. Jayanthi Kumari (1) and H. S. Jayanna (2)
(1) Department of Electronics and Communication Engineering, Siddaganga Institute of Technology, Karnataka, India
(2) Department of Information Science and Engineering, Siddaganga Institute of Technology, Karnataka, India

Article history: Received Mar 29, 2017; Revised Jul 18, 2017; Accepted Aug 3, 2017

Keywords: MFCC, LPCC, LPR, LPRP, GMM, GMM-UBM

ABSTRACT
The present work demonstrates an experimental evaluation of speaker verification for different speech feature extraction techniques under the constraint of limited data (less than 15 seconds). The state-of-the-art speaker verification techniques provide good performance for sufficient data (greater than 1 minute). It is a challenging task to develop techniques which perform well for speaker verification under limited data conditions. In this work, different features such as Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), Delta (Δ), Delta-Delta (ΔΔ), Linear Prediction Residual (LPR) and Linear Prediction Residual Phase (LPRP) are considered. The performance of the individual features is studied and, for better verification performance, combinations of these features are attempted. A comparative study is made between the Gaussian mixture model (GMM) and the GMM-universal background model (GMM-UBM) through experimental evaluation. The experiments are conducted using the NIST-2003 database. The experimental results show that the combination of features provides better performance compared to the individual features. Further, GMM-UBM modeling gives a reduced equal error rate (EER) compared to GMM.

Copyright (c) 2017 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
T. R. Jayanthi Kumari
Department of Electronics and Communication Engineering
Siddaganga Institute of Technology, Bengaluru-560077, Karnataka, India
Email: trjayanthikumari@gmail.com

1. INTRODUCTION
Speech signals play a central role in communication, carrying the conversation between people [1]. Speaker recognition is a technique to recognize a speaker from his/her speech and can be used for either speaker verification or speaker identification [2]. Over the last decade, speaker verification has been used in many commercial applications, and these applications prefer limited data conditions. Here, limited data refers to speech data of a few seconds (less than 15 s). Based on the nature of the training and test speech data, speaker verification is classified as text-dependent or text-independent [3]. In the text-dependent mode, the training and testing speech text remains the same, whereas in the text-independent mode the training and testing speech data are different. Text-independent speaker verification under limited data conditions has always been a challenging task. The speaker verification system contains four stages, namely analysis of speech data, extraction of features, modeling and testing [4].
The analysis stage analyzes the speaker information using vocal tract [5], excitation source [6] and suprasegmental features like duration, accent and modulation [7]. The amount of data available under the limited data condition is very small, which gives poor verification performance. To improve the verification performance under the limited data condition, different levels of information need to be extracted from the speech data and combined to achieve good verification performance. In the present study, the vocal tract and excitation source information are combined to improve the performance of the speaker verification system under the limited data condition.

The second stage of speaker verification is feature extraction. Speech production usually generates a large amount of data, which includes variability due to sensor, channel, language, style, etc. [8].
Figure 1. Block diagram of the combination of different features for the speaker verification system (speech → feature extraction: MFCC or LPCC, Δ, ΔΔ, LPR, LPRP → combination of features → GMM or GMM-UBM modeling → testing → verified speaker).

The purpose of feature extraction is to extract feature vectors of reduced dimension. In these feature vectors the relevant speaker information is emphasized and other redundant factors are suppressed [3][9]. The vocal tract information can be extracted using the Mel-frequency cepstral coefficient (MFCC) [10] and linear prediction cepstral coefficient (LPCC) [11] extraction methods. The speech signal contains both static and dynamic characteristics. The MFCC and LPCC feature sets contain only static characteristics. The dynamic characteristics, represented by Delta (Δ) and Delta-Delta (ΔΔ) features, contain additional speaker information which is useful for speaker verification [4]. Excitation source features are extracted using the linear prediction residual (LPR) and the linear prediction residual phase (LPRP) [12]. In this work, LPR, LPRP, MFCC, ΔMFCC, ΔΔMFCC, LPCC, ΔLPCC and ΔΔLPCC features are used to evaluate the performance of the system under the limited data condition. Further, each of these features offers different information, and a combination of them may improve the performance of speaker verification. Hence, the performance of speaker verification using combinations of features is also evaluated in the present work. Figure 1 shows the block diagram of the combination of different features for the speaker verification system.

The paper is organized into the following sections: Section 2 describes the speaker verification studies using different feature extraction techniques. The modeling and testing techniques are presented in Section 3. Experimental results are reported in Section 4. Section 5 contains the conclusion and future scope of the work.

2. SPEAKER VERIFICATION STUDIES USING DIFFERENT FEATURES
Speaker-specific information can be extracted by feature extraction techniques at a reduced data rate [13]. These feature vectors contain vocal tract, excitation source and behavioral traits of speaker-specific information [4]. A good feature is one which contains all components of speaker-specific information. To create a good feature set, different feature extraction techniques need to be understood.

2.1. Vocal tract features for speaker verification
The vocal tract features are extracted using the MFCC and LPCC feature extraction techniques.
The features extracted by these techniques are different and therefore their performance varies, for the following reasons.

In the case of MFCC, the spectral distortion is minimized using a Hamming window. The magnitude frequency response is obtained by applying the Fourier transform to the windowed frame. The resulting spectrum is passed through 22 triangular band-pass filters on the mel scale, and the discrete cosine transform is applied to the log output of the mel filters in order to obtain the cepstral coefficients. The obtained MFCC features are then used for training and testing.
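A minimal sketch (not the authors' code) of the MFCC, Δ and ΔΔ extraction described above is given below, using the librosa library. The file name, the librosa-based implementation and the exact parameter mapping are assumptions; the 8 kHz sampling, 20 ms / 10 ms framing, Hamming window, 22 mel filters and 13 coefficients follow the setup reported in this paper. The Δ and ΔΔ (velocity and acceleration) features discussed later in this section are included for completeness.

# Assumed sketch of MFCC and dynamic-feature extraction with librosa
import numpy as np
import librosa

y, sr = librosa.load("speaker_utterance.wav", sr=8000)  # hypothetical file

frame_len = int(0.020 * sr)   # 20 ms frame size
hop_len = int(0.010 * sr)     # 10 ms frame rate

# 13 static MFCCs from a 22-filter mel filterbank with a Hamming window
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=frame_len, hop_length=hop_len,
                            window="hamming", n_mels=22)

# Dynamic features: first- and second-order temporal derivatives of the MFCCs
delta = librosa.feature.delta(mfcc)            # Δ features
delta2 = librosa.feature.delta(mfcc, order=2)  # ΔΔ features

print(mfcc.shape, delta.shape, delta2.shape)   # (13, num_frames) each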
LPCC reflects differences in the biological structure of the human vocal tract. LPCC is computed by a recursion from the LPC parameters to the LPC cepstrum according to the all-pole model. The LPC coefficients are simply the coefficients of this all-pole filter and are equivalent to the smoothed envelope of the log spectrum of the speech. The LPC coefficients can be calculated either by the autocorrelation or the covariance method directly from the windowed portion of speech. Durbin's recursive method is then used to calculate LPCC without using the discrete Fourier transform (DFT) and the inverse DFT, which are more complex and time consuming [14].

The MFCC and LPCC extraction techniques are widely used and have proven to be effective in speaker verification. However, they do not provide satisfactory performance under the limited data condition. Therefore, there is a need to improve the performance of the speaker verification system by obtaining extra information from the speech data. The feature sets of MFCC and LPCC contain only the static properties of the speech signal. In addition, the dynamic characteristics of the speech signal can be used to improve the performance of speaker verification [15]. Two types of dynamics are available in speech processing [16]: the velocity of the features, known as Δ features, obtained as an averaged first-order temporal derivative, and the acceleration of the features, known as ΔΔ features, obtained as an averaged second-order temporal derivative.

2.2. Excitation source features for speaker verification
The spectral features extracted from the vocal tract are computed over frames in the range of 10-30 ms. These spectral features ignore some of the speaker-specific excitation information, such as the linear prediction (LP) residual and the LP residual phase, that can be used for speaker verification [6]. To calculate the LP residual, the vocal tract information is first predicted from the speech data using LP analysis, and inverse filtering is then used to suppress it from the speech data [17][6]. To calculate the LPRP, the LP residual is divided by its Hilbert envelope [17]. The LPRP contains speaker-specific information, and the LPR contains information obtained from the excitation source, mainly the glottal closure instants (GCIs) [18]. The LPR and LPRP features contain speaker-specific excitation source information that is dissimilar in character, so the two features can be combined to gain further advantage.
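The LPR and LPRP computation described in Section 2.2 can be sketched as follows. This is an assumed illustration, not the authors' implementation: it uses a 10th-order LP analysis, inverse filtering to obtain the residual and division by the Hilbert envelope to obtain the residual phase, and it processes the whole utterance at once, whereas the paper uses 12 ms frames with a 6 ms shift for these features.

# Assumed sketch of LPR and LPRP extraction
import numpy as np
import librosa
from scipy.signal import lfilter, hilbert

def lp_residual_and_phase(y, order=10):
    # LP coefficients a = [1, a1, ..., a_order] of the all-pole model
    a = librosa.lpc(y, order=order)
    # Inverse (FIR) filtering with A(z) suppresses the vocal tract
    # contribution and leaves the LP residual (excitation estimate)
    residual = lfilter(a, [1.0], y)
    # Hilbert envelope of the residual (magnitude of the analytic signal)
    envelope = np.abs(hilbert(residual)) + 1e-12
    # LP residual phase: residual normalized by its Hilbert envelope
    residual_phase = residual / envelope
    return residual, residual_phase

y, sr = librosa.load("speaker_utterance.wav", sr=8000)  # hypothetical file
lpr, lprp = lp_residual_and_phase(y, order=10)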
3. SPEAKER MODELING AND TESTING
Different modeling techniques are available for speaker modeling, including vector quantization (VQ), the hidden Markov model (HMM), the Gaussian mixture model (GMM) and the GMM-universal background model (GMM-UBM). Among these, GMM and GMM-UBM are used as classifiers in the present work. When the available training data is inadequate, GMM-UBM is widely used for speaker verification [19]. The UBM represents the speaker-independent distribution of features and is the core part of a GMM-UBM speaker verification system. Constructing a UBM requires a large amount of speech data, and a balance of male and female speakers must be ensured. The simplest approach to training a UBM is to pool all the data and use the expectation-maximization (EM) algorithm [20]. The coupled target and background speaker model components are integrated effectively during speaker recognition when maximum a posteriori (MAP) adaptation is used [13].

The advantage of the UBM model is that a large number of speakers are used to design a speaker-independent model trained for the required task. Even with minimal speaker data, UBM-based modeling provides good performance. Its drawback is that a large, gender-balanced speaker set is required for training [20]. The speakers are also modeled using GMM alone to verify its effectiveness for limited data speaker verification.

During testing, the test feature vectors are compared against the reference models and a score is generated; the score represents how well the test feature vectors match the reference models [4]. In practical applications, there is a chance of rejecting true speakers and a chance of accepting false speakers. In the present work, the log-likelihood ratio test [21] is adopted.
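The following is a minimal sketch, under stated assumptions rather than the authors' exact system, of the GMM-UBM pipeline described in this section: a UBM trained with EM on pooled background data, a target speaker model obtained by mean-only MAP adaptation of the UBM, and a trial scored with the log-likelihood ratio. The variable names, the diagonal covariance choice and the relevance factor value are assumptions; feature matrices are (num_frames x feature_dim) arrays.

# Assumed sketch of GMM-UBM training, MAP adaptation and LLR scoring
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=64):
    # EM training on pooled, gender-balanced background data
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=100, random_state=0)
    ubm.fit(background_features)
    return ubm

def map_adapt_means(ubm, target_features, relevance_factor=16.0):
    # Mean-only MAP adaptation of the UBM towards the target speaker
    resp = ubm.predict_proba(target_features)           # (T, M) responsibilities
    n_k = resp.sum(axis=0) + 1e-10                       # soft counts per mixture
    e_k = (resp.T @ target_features) / n_k[:, None]      # data means per mixture
    alpha = n_k / (n_k + relevance_factor)               # adaptation coefficients
    speaker = copy.deepcopy(ubm)
    speaker.means_ = alpha[:, None] * e_k + (1.0 - alpha[:, None]) * ubm.means_
    return speaker

def llr_score(speaker_model, ubm, test_features):
    # Average per-frame log-likelihood ratio used for the accept/reject decision
    return np.mean(speaker_model.score_samples(test_features)
                   - ubm.score_samples(test_features))

# Example with random data standing in for real feature vectors
rng = np.random.default_rng(0)
ubm = train_ubm(rng.normal(size=(5000, 13)), n_components=64)
target = map_adapt_means(ubm, rng.normal(size=(300, 13)))
print(llr_score(target, ubm, rng.normal(size=(300, 13))))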
4. RESULTS AND DISCUSSIONS
4.1. Experimental setup
In the current analysis, the NIST-2003 database is used for verifying the speakers [22]. It contains 356 training and 2559 test speakers. The training set contains 149 male and 207 female speakers, and the UBM is built from 251 male and female speakers. The duration of the test, training and UBM speech varies from a few seconds to a few minutes. Since the present work addresses limited data, speech of duration 3s-3s (train-test), 4s-4s, 5s-5s, 6s-6s, 9s-9s and 12s-12s is taken for each speaker to create the database for this study.

4.2. Speaker verification results
Text-independent speaker verification experiments were conducted. The verification performance of the system is measured using the equal error rate (EER), i.e., the operating point at which the false rejection rate (FRR) equals the false acceptance rate (FAR). The extracted features MFCC, LPCC, LPR, LPRP and the transitional characteristics Δ and ΔΔ each have a dimension of 13. For MFCC, LPCC and their derivatives, the speech data is analyzed with a frame size (FS) of 20 ms and a frame rate (FR) of 10 ms. For LPCC, a 10th-order LP analysis is used because the speech is sampled at 8 kHz; the LP order typically varies from 8 to 12, and the 10th order has been shown to be appropriate for computing LPCC [23]. An FS of 12 ms and an FR of 6 ms are fixed for LPR and LPRP. The speaker-specific information captured by each of these features is different; therefore, combinations of these features may give better performance. The modeling techniques used are GMM and GMM-UBM, and the speakers are modeled with Gaussian mixtures of 16, 32, 64, 128 and 256.

4.3. Individual feature performance using GMM and GMM-UBM
Figure 2. Performance of the speaker verification system based on MFCC individual features using GMM modeling (EER (%) versus training/testing data in seconds for Gaussian mixtures of 16, 32 and 64).

Figure 3. Performance of the speaker verification system based on LPCC individual features using GMM modeling (EER (%) versus training/testing data in seconds for Gaussian mixtures of 16, 32 and 64).

The experimental results are shown in Figures 2, 3 and 4 for the individual features (MFCC, ΔMFCC, ΔΔMFCC), (LPCC, ΔLPCC, ΔΔLPCC) and (LPR, LPRP), respectively. The experiments are conducted for 3s-3s, 4s-4s, 5s-5s, 6s-6s, 9s-9s and 12s-12s data and for different Gaussian mixtures.
Table 1. Comparison of minimum EER (%) for individual features using different amounts of training and testing data for GMM

  Individual features   3s-3s   4s-4s   5s-5s   6s-6s   9s-9s   12s-12s
  MFCC                  45.16   44.21   42.36   41.89   38.25   35.68
  ΔMFCC                 45.75   43.54   42.09   44.89   38.07   35.63
  ΔΔMFCC                45.27   44.67   42.95   42.14   38.70   37.30
  LPCC                  43.08   41.41   39.97   38.70   31.34   28.18
  ΔLPCC                 44.89   44.12   41.82   41.05   37.17   35.32
  ΔΔLPCC                44.76   43.13   42.00   41.10   37.48   35.86
  LPR                   47.85   48.34   47.34   47.06   46.59   46.09
  LPRP                  47.16   47.43   46.08   46.58   46.66   46.62

Further, the modeling is done using GMM with Gaussian mixtures of 16, 32 and 64; since the data is very small, the number of Gaussian mixtures is limited to 64. The minimum EER for each data size, irrespective of the number of Gaussian mixtures, is tabulated in Table 1.

Figure 4. Performance of the speaker verification system based on the LPR and LPRP individual features using GMM modeling (EER (%) versus training/testing data in seconds for Gaussian mixtures of 16, 32 and 64).

The performance of the individual features is first analyzed for the 3s-3s data size, as shown in Figure 2. From the experimental results it is observed that the individual feature MFCC provides an EER that is lower by 0.59% and 0.11% than ΔMFCC and ΔΔMFCC, respectively. The results for the LPCC features for the same data size are shown in Figure 3; the individual feature LPCC provides an EER that is lower by 1.81% and 1.68% than ΔLPCC and ΔΔLPCC, respectively. Two points can be noticed from these results. First, the static characteristics provide better performance than the dynamic characteristics. Second, LPCC and its derivatives give better verification performance than MFCC and its derivatives.

The results for the LPR and LPRP features for the same data size are shown in Figure 4. From the experimental results it is observed that the minimum EER of LPR is higher by 2.69% and 4.77% than that of MFCC and LPCC, respectively. Further, the minimum EER of LPRP is higher by 2.00% and 4.08% than that of MFCC and LPCC, respectively. This clearly shows that the vocal tract features give better EER than the excitation source features. The same study is also conducted for the other data sizes of 4s-4s, 5s-5s, 6s-6s, 9s-9s and 12s-12s to verify the performance of the individual features. In all cases, the results show that the EER decreases as the training and test data are increased.

GMM modeling works very well when sufficient data is available [20]. To overcome this limitation, GMM-UBM modeling is used. The UBM should be trained with an equal number of male and female speakers; in our experiment the total duration of the male and of the female speech used for the UBM is 1506 s each.
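All numbers reported in Tables 1-3 are EERs. The following is a minimal sketch, an assumption about the evaluation procedure rather than the authors' scoring scripts, of how an EER can be computed from genuine (true-speaker) and impostor trial scores such as the log-likelihood ratios produced by the GMM or GMM-UBM models.

# Assumed sketch of EER computation from trial scores
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    # Sweep candidate thresholds over all observed scores and find the point
    # where the false acceptance rate (FAR) and false rejection rate (FRR)
    # are closest; the EER is reported at that operating point.
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return 100.0 * (far[idx] + frr[idx]) / 2.0   # EER in percent

# Example with synthetic scores standing in for real trial scores
rng = np.random.default_rng(1)
print(equal_error_rate(rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 5000)))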
To study the significance of GMM-UBM modeling, the same set of experiments is conducted. The experimental results are shown in Figures 5, 6 and 7 for the individual features (MFCC, ΔMFCC, ΔΔMFCC), (LPCC, ΔLPCC, ΔΔLPCC) and (LPR, LPRP), respectively. The Gaussian mixtures considered are 16, 32, 64, 128 and 256, since additional UBM speech data is used for training. Table 2 reports the minimum EER of the individual features for different data sizes across these Gaussian mixtures.

Figure 5. Performance of the speaker verification system based on MFCC individual features using GMM-UBM modeling (EER (%) versus training/testing data in seconds for Gaussian mixtures of 16 to 256).

Figure 6. Performance of the speaker verification system based on LPCC individual features using GMM-UBM modeling (EER (%) versus training/testing data in seconds for Gaussian mixtures of 16 to 256).

Considering the 3s-3s data for MFCC and its derivatives, MFCC provides an EER that is lower by 1.18% and 0.73% than ΔMFCC and ΔΔMFCC, respectively. For the LPCC features at the same data size, LPCC provides an EER that is lower by 0.90% and 1.49% than ΔLPCC and ΔΔLPCC, respectively. With this modeling too, the static characteristics provide better performance than the dynamic characteristics, and LPCC and its derivatives give better verification performance than MFCC and its derivatives.

Considering the LPR and LPRP features for the 3s-3s data size, the minimum EER of LPR is higher by 1.35% and 2.25% than that of MFCC and LPCC, respectively. Further, LPRP has an EER that is higher by 1.15% and 2.05% than that of MFCC and LPCC, respectively. The same study is also conducted for the other data sizes of 4s-4s, 5s-5s, 6s-6s, 9s-9s and 12s-12s; here also, the results show that the EER decreases as the training and test data are increased.

From these two modeling techniques it is clear that the vocal tract features give better EER than the excitation source features. Further, the individual features extracted by the various extraction techniques are different, and hence they may be combined to further improve the speaker verification performance under the limited data condition. From Tables 1 and 2 it is observed that, irrespective of data size and individual feature, the minimum EER with GMM-UBM is better than with GMM.
Figure 7. Performance of the speaker verification system based on the LPR and LPRP individual features using GMM-UBM modeling (EER (%) versus training/testing data in seconds for Gaussian mixtures of 16 to 256).

Table 2. Comparison of minimum EER (%) for individual features using different amounts of training and testing data for GMM-UBM

  Individual features   3s-3s   4s-4s   5s-5s   6s-6s   9s-9s   12s-12s
  MFCC                  40.01   39.02   37.75   36.90   29.35   27.12
  ΔMFCC                 41.19   41.14   39.79   39.88   36.31   33.10
  ΔΔMFCC                40.74   40.28   38.79   38.03   30.98   29.14
  LPCC                  39.11   37.75   36.35   35.99   29.08   26.91
  ΔLPCC                 40.01   39.20   38.12   37.75   32.33   30.57
  ΔΔLPCC                40.60   39.11   38.07   36.54   29.58   27.41
  LPR                   41.36   40.32   39.25   37.04   36.06   33.28
  LPRP                  41.16   40.43   39.16   37.23   36.16   33.02

4.4. Combination of features performance using GMM and GMM-UBM
Under the limited data condition, only a few seconds of speech are available and hence the number of feature vectors is small. The performance of the speaker verification system can be increased by combining the feature vectors of different features. The combination is accomplished by a simple frame-level concatenation of the feature sets obtained from the different feature extraction techniques (a brief code sketch of this fusion is given below). The performance of the speaker verification system for the combinations of features (MFCC, Δ, ΔΔ, LPR and LPRP) for different data sizes, with modeling done by GMM, is examined first. The experimental results for the MFCC-based combinations are shown in Figures 8 and 9, and the minimum EER across Gaussian mixtures for each data size is tabulated in Table 3.

Further, consider Figure 8(a) to analyze the performance of multiple feature combinations with MFCC using 3s-3s data. The combination (MFCC+Δ+ΔΔ) provides a minimum EER of 44.35% for a Gaussian mixture of 32, while the individual features MFCC, ΔMFCC and ΔΔMFCC provide minimum EERs of 45.16%, 45.75% and 45.27%, respectively, for a Gaussian mixture of 16. Thus (MFCC+Δ+ΔΔ) gives an EER that is lower by 0.81%, 1.40% and 0.92% than MFCC, ΔMFCC and ΔΔMFCC, respectively. The performance of MFCC with its derivatives (MFCC+Δ+ΔΔ) is better than the individual performance of MFCC, ΔMFCC and ΔΔMFCC because both the static and the dynamic characteristics of the speech data are used in training and testing.

The (MFCC+LPR) combination provides a minimum EER of 37.75% for a Gaussian mixture of 16. The individual feature LPR provides a minimum EER of 47.85% for a Gaussian mixture of 16, which is higher by 10.10% than (MFCC+LPR). The (MFCC+LPR) combination also gives an EER that is lower by 6.60% than (MFCC+Δ+ΔΔ), so (MFCC+LPR) performs better than (MFCC+Δ+ΔΔ). The (MFCC+LPRP) combination has a minimum EER of 37.62% for a Gaussian mixture of 64. The individual feature LPRP provides a minimum EER of 47.16% for a Gaussian mixture of 32, which is higher by 9.54% than (MFCC+LPRP). The (MFCC+LPRP) combination provides an EER that is lower by 6.73% than (MFCC+Δ+ΔΔ). This is due to the combination of both vocal tract and excitation source information: the LPR contains glottal closure instants (GCIs) related to the excitation source, whereas the LPRP contains speaker-specific sequence information [24], so the LPR and LPRP features capture different characteristics of the speaker-specific excitation information.
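The following is a minimal sketch, an assumption about the fusion step rather than the authors' exact code, of the frame-level feature concatenation used in Section 4.4: per-frame vectors from the different extraction techniques are stacked so that, for example, MFCC+Δ+ΔΔ+LPR+LPRP forms one longer feature vector per frame. The individual feature matrices are assumed to be (num_frames x dim) arrays computed for the same utterance; since LPR/LPRP use a different frame rate in the paper, the sketch simply truncates all streams to a common frame count.

# Assumed sketch of frame-level feature concatenation (fusion)
import numpy as np

def fuse_features(*feature_matrices):
    # Truncate all streams to the same number of frames, then concatenate
    # along the feature dimension
    n_frames = min(f.shape[0] for f in feature_matrices)
    return np.hstack([f[:n_frames] for f in feature_matrices])

# Example: five 13-dimensional streams fused into a 65-dimensional representation
rng = np.random.default_rng(2)
streams = [rng.normal(size=(300, 13)) for _ in range(5)]  # MFCC, Δ, ΔΔ, LPR, LPRP
fused = fuse_features(*streams)
print(fused.shape)  # (300, 65)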
Figure 8. Performance of the speaker verification system for MFCC and the different combined systems using (a) 3s-3s, (b) 4s-4s and (c) 5s-5s data, with modeling using GMM (EER (%) versus number of Gaussian mixtures).

Figure 9. Performance of the speaker verification system for MFCC and the different combined systems using (a) 6s-6s, (b) 9s-9s and (c) 12s-12s data, with modeling using GMM (EER (%) versus number of Gaussian mixtures).

The (MFCC+Δ+ΔΔ+LPR) and (MFCC+Δ+ΔΔ+LPRP) combinations have minimum EERs of 37.63% and 37.03%, respectively, for a Gaussian mixture of 16, and provide EERs that are lower by 0.12% and 0.59% than (MFCC+LPR) and (MFCC+LPRP), respectively. The combination (MFCC+Δ+ΔΔ+LPR+LPRP) provides a minimum EER of 34.32% for a Gaussian mixture of 32. This combination gives an EER that is lower by 10.03%, 3.43%, 3.30%, 3.31% and 2.71% than (MFCC+Δ+ΔΔ), (MFCC+LPR), (MFCC+LPRP), (MFCC+Δ+ΔΔ+LPR) and (MFCC+Δ+ΔΔ+LPRP), respectively. The combined (MFCC+Δ+ΔΔ+LPR+LPRP) system performs better than the other combined systems for all training and testing data sizes because its speaker-specific information includes the static, the transitional and the excitation source characteristics. The same trend is observed for the remaining data sizes shown in Figures 8 and 9. From the above results, it is also observed that increasing the training and testing data gives a significant improvement in the EER of the combined systems.
To study the significance of the LPCC-based combined systems, the same set of experiments is conducted as in the case of MFCC. The experimental results for the combinations of features (LPCC, Δ, ΔΔ, LPR and LPRP) are shown in Figure 10 and Figure 11; the modeling is done using GMM.
Figure 10. Performance of the speaker verification system for LPCC and the different combined systems using (a) 3s-3s, (b) 4s-4s and (c) 5s-5s data, with modeling using GMM (EER (%) versus number of Gaussian mixtures).

Figure 11. Performance of the speaker verification system for LPCC and the different combined systems using (a) 6s-6s, (b) 9s-9s and (c) 12s-12s data, with modeling using GMM (EER (%) versus number of Gaussian mixtures).

Considering the 3s-3s data, the following experimental results are observed from Figure 10(a). The combination (LPCC+Δ+ΔΔ) provides a minimum EER of 41.37% for a Gaussian mixture of 16, while the individual features LPCC, ΔLPCC and ΔΔLPCC provide minimum EERs of 43.08%, 44.89% and 44.76%, respectively, for a Gaussian mixture of 16. Thus (LPCC+Δ+ΔΔ) gives an EER that is lower by 1.71%, 3.52% and 3.39% than LPCC, ΔLPCC and ΔΔLPCC, respectively. The performance of LPCC with its derivatives (LPCC+Δ+ΔΔ) is better than the individual performance of LPCC, ΔLPCC and ΔΔLPCC because both the static and the dynamic characteristics of the speech data are used in training and testing.

The (LPCC+LPR) combination provides a minimum EER of 36.26% for a Gaussian mixture of 16, which is lower by 6.48% than LPR and by 5.11% than (LPCC+Δ+ΔΔ); thus (LPCC+LPR) performs better than (LPCC+Δ+ΔΔ). The (LPCC+LPRP) combination has a minimum EER of 37.57% for a Gaussian mixture of 32, while the individual feature LPRP has an EER that is higher by 9.59% than (LPCC+LPRP). The (LPCC+LPRP) combination provides an EER reduction of 1.31% with respect to (LPCC+LPR) and performs better than (LPCC+LPR); this is because LPR and LPRP contain different speaker-specific information. The (LPCC+Δ+ΔΔ+LPR) and (LPCC+Δ+ΔΔ+LPRP) combinations have minimum EERs of 36.12% and 37.54%, respectively, for a Gaussian mixture of 16, and provide EERs that are lower by 0.14% and 0.03% than (LPCC+LPR) and (LPCC+LPRP), respectively.
The combination (LPCC+Δ+ΔΔ+LPR+LPRP) provides a minimum EER of 33.69% for a Gaussian mixture of 16. This combination gives an EER that is lower by 7.68%, 2.57%, 3.88%, 2.73% and 3.85% than (LPCC+Δ+ΔΔ), (LPCC+LPR), (LPCC+LPRP), (LPCC+Δ+ΔΔ+LPR) and (LPCC+Δ+ΔΔ+LPRP), respectively. The combined (LPCC+Δ+ΔΔ+LPR+LPRP) system performs better than the other combined systems for all training and testing data sizes.
This is because the (LPCC+Δ+ΔΔ+LPR+LPRP) combination contains the static, the transitional and the excitation source characteristics. The same trend is observed for the remaining data sizes, as shown in Figures 10 and 11.

Table 3. Comparison of minimum EER (%) for different combinations of features using different amounts of training and testing data for GMM

  Combined features        3s-3s   4s-4s   5s-5s   6s-6s   9s-9s   12s-12s
  MFCC+Δ+ΔΔ                44.35   44.12   41.86   41.41   38.07   32.61
  MFCC+LPR                 37.75   37.57   37.48   37.98   33.55   32.06
  MFCC+LPRP                37.62   36.99   36.22   36.54   32.07   31.44
  MFCC+Δ+ΔΔ+LPR            37.63   37.52   37.34   36.40   31.65   31.03
  MFCC+Δ+ΔΔ+LPRP           37.03   36.35   36.12   36.31   31.33   31.23
  MFCC+Δ+ΔΔ+LPR+LPRP       34.32   33.96   33.83   33.46   28.95   28.31
  LPCC+Δ+ΔΔ                41.37   39.97   38.88   37.98   32.61   30.26
  LPCC+LPR                 36.26   37.86   37.48   36.58   32.06   33.42
  LPCC+LPRP                37.57   36.94   36.17   36.31   30.44   29.53
  LPCC+Δ+ΔΔ+LPR            36.12   36.85   35.64   35.86   31.32   30.89
  LPCC+Δ+ΔΔ+LPRP           37.54   36.48   36.12   34.13   30.15   29.12
  LPCC+Δ+ΔΔ+LPR+LPRP       33.69   33.78   33.73   33.42   28.31   28.22

Table 3 provides the comparison of the different combined systems for different amounts of training and testing data. The EER of (LPCC+Δ+ΔΔ) is lower by 2.98%, 4.15%, 2.98%, 3.43%, 5.46% and 5.46% than (MFCC+Δ+ΔΔ) for the 3s-3s, 4s-4s, 5s-5s, 6s-6s, 9s-9s and 12s-12s data, respectively. The same trend is observed for the remaining combinations. In this experimental study, we observe that when both the training and testing data are limited, (LPCC+Δ+ΔΔ+LPR+LPRP) has the minimum EER among all combinations in the case of GMM modeling. This is because LPCC and its derivatives, along with the excitation source features, capture more speaker-specific information from the speech data, which better discriminates between speakers [25].

To study the significance of GMM-UBM for the combinations of features, the following experiments are analyzed. The performance of the speaker verification system for the combinations of features (MFCC, LPCC, Δ, ΔΔ, LPR and LPRP) for different data sizes, using GMM-UBM as the modeling technique, is shown in Figures 12 to 15. Further, the minimum EER across Gaussian mixtures for each data size is tabulated in Table 4.

Considering the 3s-3s data, the following points are observed from Figures 12 and 14. The combinations (MFCC+Δ+ΔΔ) and (LPCC+Δ+ΔΔ) have minimum EERs of 38.84% and 36.44%, respectively; (LPCC+Δ+ΔΔ) thus provides an EER that is lower by 2.40% than (MFCC+Δ+ΔΔ). The combinations (MFCC+LPR) and (MFCC+LPRP) have minimum EERs of 38.30% and 36.54%, respectively, while the minimum EERs of (LPCC+LPR) and (LPCC+LPRP) are 36.94% and 34.55%, respectively. Thus (LPCC+LPR) and (LPCC+LPRP) provide EERs that are lower by 1.36% and 1.99% than (MFCC+LPR) and (MFCC+LPRP), respectively. The combinations (MFCC+Δ+ΔΔ+LPR) and (MFCC+Δ+ΔΔ+LPRP) have minimum EERs of 34.73% and 34.74%, respectively, while the minimum EERs of (LPCC+Δ+ΔΔ+LPR) and (LPCC+Δ+ΔΔ+LPRP) are 34.12% and 33.93%, respectively. Thus (LPCC+Δ+ΔΔ+LPR) and (LPCC+Δ+ΔΔ+LPRP) provide EERs that are lower by 0.61% and 0.81% than (MFCC+Δ+ΔΔ+LPR) and (MFCC+Δ+ΔΔ+LPRP), respectively.
The combination (MFCC+Δ+ΔΔ+LPR+LPRP) provides an EER that is lower by 6.19%, 4.29%, 2.08%, 3.89% and 2.09% than (MFCC+Δ+ΔΔ), (MFCC+LPR), (MFCC+LPRP), (MFCC+Δ+ΔΔ+LPR) and (MFCC+Δ+ΔΔ+LPRP), respectively. The combination (LPCC+Δ+ΔΔ+LPR+LPRP) provides an EER that is lower by 2.61%, 4.47%, 0.72%, 2.29% and 0.10% than (LPCC+Δ+ΔΔ), (LPCC+LPR), (LPCC+LPRP), (LPCC+Δ+ΔΔ+LPR) and (LPCC+Δ+ΔΔ+LPRP), respectively.