International Journal of Electrical and Computer Engineering (IJECE)
Vol. 7, No. 6, December 2017, pp. 3369-3384
ISSN: 2088-8708

Query by Example of Speaker Audio Signals using Power Spectrum and MFCCs

Pafan Doungpaisan¹ and Anirach Mingkhwan²
¹Faculty of Information Technology, King Mongkut's University of Technology North Bangkok, 1518 Pracharat 1 Road, Wongsawang, Bangsue, Bangkok 10800, Thailand
²Faculty of Industrial Technology and Management, King Mongkut's University of Technology North Bangkok, 1518 Pracharat 1 Road, Wongsawang, Bangsue, Bangkok 10800, Thailand

Article Info

Article history:
Received: Mar 13, 2017
Revised: Aug 9, 2017
Accepted: Aug 22, 2017

Keywords:
Speaker identification
Acoustic signal processing
Content-based audio retrieval
Speaker recognition
Database query processing

ABSTRACT

"Search engine" is the popular term for an information retrieval (IR) system. Typically, a search engine is based on full-text indexing. Moving from text data to multimedia data types makes the information retrieval process more complex, as in the retrieval of images or sounds from large databases. This paper introduces the use of language- and text-independent speech as input queries to a large sound database, using a speaker identification algorithm. The method consists of two main processing steps: first, vocal and non-vocal segments are separated; the vocal segments are then used for speaker identification, enabling audio queries by speaker voice. For speaker identification and query by example, we estimate the similarity between the example signal and the samples in the queried database by computing the Euclidean distance between Mel-frequency cepstral coefficient (MFCC) and energy spectrum acoustic features. Simulations show good performance at a sustainable computational cost, with an average accuracy rate of more than 90%.

Copyright © 2017 Institute of Advanced Engineering and Science. All rights reserved.

1. INTRODUCTION

The Internet has become a major component of everyday social life and business. One important use of the Internet is search engine technology. Though users rarely give it a moment's thought, the search engines that help them navigate the mass of information, web pages, images, video files, and audio recordings on the World Wide Web have become essential. Search engine technology was developed over 20 years ago [1][2]. It has changed how we get information at school, college, work and home. A search engine is an information retrieval system designed to search for information on the World Wide Web. The search results are generally presented as search engine results pages (SERPs). A search engine results page, or SERP, is the web page that appears in a browser window when a keyword query is entered into a search field on a search engine. The results may be a mix of text, web pages, images, video, and other types of files. Some search engines also mine data available in databases. Search engines also maintain real-time information by running an algorithm on a web crawler. Information is easy to find when searching by keywords; however, using a search engine to find an image or a sound
is far more difficult and complicated.

1.1. Content-based image retrieval or reverse image search engines

Content-based image retrieval systems, or reverse image search engines, are a special kind of search engine that does not need any keyword to find pictures [3][4][5][6]. Instead, the user supplies a picture, and the search engine finds images similar to the one entered. Thus, everything one wishes to know can be discovered with the help of a single picture. Practical uses for reverse image search include:

- Searching for duplicated images or content.
- Locating the source information for an image.
- Ensuring compliance with copyright regulations.
- Finding information about unidentified products and other objects.
- Finding information about faked images.
- Finding higher-resolution versions of images.

There are three types of content-based image retrieval, or image search engines: search by meta-data, search by example, and hybrid search.

1.1.1. Search by meta-data

Metadata is data that summarizes basic information about an image, which can make finding and working with particular instances of data easier. Author, file size, date created and date modified are all examples of very basic document metadata. Famous search engines such as Google present a text box into which keywords are typed, and a button to click: Google Search. Manually typed keywords yield interrelated results. In fact, a meta-data image search engine is only marginally different from the text search engine mentioned above. A search-by-meta-data image search engine rarely examines the actual image itself; instead, it relies on textual clues, which can come from a variety of sources. The two main methods of search by meta-data are manual annotations and contextual hints.

1.1.2. Search image by example image

To search by example image, we can use Google or TinEye, or we can build a search-by-example image search engine ourselves. These types of image search engines try to quantify the image itself and are called Content-Based Image Retrieval (CBIR) systems. An example would be to characterize the color of an image by the standard deviation, mean, and skewness of its pixel intensities. Given a dataset of images, we would compute these moments over all images in the dataset and store them on disk; this step is called indexing. When we quantify an image, we describe it by extracting image features. These features are an abstraction of the image, used to characterize its content. Suppose, for instance, that we are building an image search engine for Twitter.

1.1.3. Hybrid approach

An interesting hybrid approach is Twitter itself. Twitter allows text and images to be included in tweets, and lets users add hashtags to their own tweets. We can use the hashtags to build a search-by-meta-data image search engine, and then analyze and quantify the image itself to build a search-by-example image search engine. Following this concept, we would be building a hybrid image search engine that uses both keywords and hashtags together with features extracted from the images.

1.2. Content-based audio retrieval or audio search engines

Content-based functionalities aim at finding new ways of querying and browsing audio documents, as well as automatically generating metadata, mainly via classification. Query-by-example and similarity measures that allow perceptual browsing of an audio collection are addressed in the literature and exist in commercial products; see for instance www.findsounds.com and www.soundfisher.com. There are three types of content-based audio retrieval: search from text, search from image, and search from audio.

1.2.1. Audio search from text or search by meta-data

Text entered into a search bar by the user is compared to the search engine's database.
Matching results are accompanied by a description, or meta-data, of the audio file and its characteristics, such as sample frequency, bit rate, file type, length, duration, or coding type. The user is given the option of downloading the resulting files. Alternatively, keywords can be generated from the analyzed audio by using speech recognition techniques to convert audio to text; these keywords are then used to search for audio files in the database, as in Google Voice Search.
1.2.2. Audio search from image

The Query by Example (QBE) system is a search algorithm that uses content-based image retrieval (CBIR). Keywords are generated from the analyzed image and used to search for audio files in the database. The results of the search are displayed according to the user's preferences regarding the file type, such as wav, mp3, aiff, etc.

1.2.3. Audio search from audio

In audio search from audio, the user plays the audio of a song, sings, or hums into the microphone. An audio pattern is then derived from the audio waveform, and a frequency representation is derived from its Discrete Fourier Transform. This pattern is matched against patterns corresponding to the waveforms of the sound files in the database. All audio files in the database whose patterns are similar to the search pattern are displayed as search results. The most popular form of audio search from audio is the audio fingerprint [7][8][9]. An audio fingerprint is a content-based compact signature that summarizes an audio file. Audio fingerprinting technologies have recently attracted attention because they allow audio to be monitored independently of its format and without the need for watermark embedding or meta-data. Audio fingerprinting, also called audio hashing, is well known as a powerful technique for audio identification and synchronization. Figure 1 describes a model of audio fingerprinting. Audio fingerprinting involves two major steps: fingerprint (voice pattern) design and matching search. While the first step concerns the derivation of a compact and robust audio signature, the second step usually requires knowledge about the database and quick-search algorithms [10].

Figure 1. State-of-the-art audio fingerprinting algorithms.

Examples of audio fingerprinting applications are Shazam (http://www.shazam.com/) and SoundHound (http://www.soundhound.com/) [11][12]. A human listener can only identify a piece of music if he has heard it before, unless he has access to more information than just the audio signal. Similarly, fingerprinting systems require previous knowledge of the audio signals in order to identify them, since no information other than the audio signal itself is available to the system in the identification phase. Therefore, a musical knowledge database must be built, containing the fingerprints of all the songs the system is supposed to identify. During detection, the fingerprint of the input signal is calculated and a matching algorithm compares it to all fingerprints in the database. The knowledge database must be updated as new songs come out. As the number of songs in the database grows, memory requirements and computational costs also grow; thus, the complexity of the detection process increases with the size of the database. This technique is useful, but it has its limitations. An audio fingerprint cannot find a live version of a piece of music, because of a different key or tempo. It cannot find a cover version, because of different instruments. It cannot find a hummed version, because that is a single melody. And it cannot find the music when a singer sings text-independently.
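To make the two fingerprinting steps concrete, the sketch below hashes coarse spectral band relationships per frame and looks them up in an inverted table. This is only a minimal illustration of the fingerprint-design and matching-search idea; the hashing scheme, frame length, and all function names are our own illustrative assumptions, not the algorithm of Shazam, SoundHound, or any cited system.

```python
import numpy as np

def fingerprint(signal, frame_len=1024, n_bands=8):
    """Toy fingerprint: one hash per frame from coarse spectral band energies."""
    hashes = []
    for start in range(0, len(signal) - frame_len, frame_len):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame_len]))
        bands = np.array_split(spectrum, n_bands)
        energies = [band.sum() for band in bands]
        # Encode which bands rise relative to the previous band: a crude but
        # somewhat format-robust bit pattern.
        bits = tuple(int(energies[i] > energies[i - 1]) for i in range(1, n_bands))
        hashes.append(bits)
    return hashes

def build_database(songs):
    """Invert fingerprints into a hash -> [(song_id, frame_index)] table
    so the matching search is a quick lookup instead of a full scan."""
    table = {}
    for song_id, signal in songs.items():
        for idx, h in enumerate(fingerprint(signal)):
            table.setdefault(h, []).append((song_id, idx))
    return table

def match(query, table):
    """Vote for the song whose frames share the most hashes with the query."""
    votes = {}
    for h in fingerprint(query):
        for song_id, _ in table.get(h, []):
            votes[song_id] = votes.get(song_id, 0) + 1
    return max(votes, key=votes.get) if votes else None
```

A real system would use more robust landmarks (for example, pairs of spectral peaks) and time-offset voting, but the lookup-table structure is the same.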
Unfortunately, present audio search engines cannot search for a human voice in a database of speakers by example of spoken audio signals. To address this problem, this paper proposes a method for query by example of spoken audio signals using a speaker identification algorithm.
2. LITERATURE REVIEW (SPEAKER VERIFICATION AND IDENTIFICATION)

Speaker identification is one of the main tasks in speech processing. In addition to identification accuracy, large-scale applications of speaker identification give rise to another challenge: fast search in the database of speakers. In speaker recognition research there are two different tasks [13][14]: speaker verification and speaker identification. Speaker verification is the process of accepting or rejecting the claimed identity of a speaker. Speaker identification is the process of determining which registered speaker produced a given utterance. In the speaker identification task, the voice of an unknown speaker is analyzed and then compared with speech samples of known speakers; the unknown speaker is identified as the speaker whose model best matches the input. There are two types of speaker identification: open-set and closed-set. Open-set identification resembles a combination of closed-set identification and speaker verification. For example, a closed-set identification may be performed first and the resulting ID used to run a speaker verification session. If the test speaker matches the target speaker, based on the ID returned from the closed-set identification, the ID is accepted and passed back as the true ID of the test speaker. If the verification fails, the speaker may be rejected altogether, with no valid identification result. Closed-set identification is the simpler of the two problems: the audio of the test speaker is compared against all available speaker models and the speaker ID of the closest-matching model is returned. In closed-set identification, the ID of one of the speakers in the database will always be closest to the audio of the test speaker; there is no rejection scheme.

Speaker verification is the process of verifying the claimed identity of a speaker based on the speech signal from the speaker, called a voiceprint. In speaker verification, a voiceprint of an unknown speaker who claims an identity is compared with a model of the speaker whose identity is being claimed. If the match is good enough, the identity claim is accepted. A high threshold reduces the probability of impostors being accepted by the system, but increases the risk of falsely rejecting valid users. A low threshold, on the other hand, enables valid users to be accepted consistently, but at the risk of accepting impostors. To set the threshold at the optimal trade-off between impostor acceptance (false acceptance) and customer rejection (false rejection), data showing the impostor and customer score distributions are needed. There are two types of speaker verification systems: text-independent speaker verification (TI-SV) and text-dependent speaker verification (TD-SV). TD-SV requires the speaker to say exactly the enrolled or given password. Text-independent speaker verification verifies the identity without constraints on the speech content. Compared to TD-SV it is more convenient, because the user can speak freely to the system, but it requires longer training and testing utterances to achieve good accuracy and performance. Identification from audio involves many factors, both in the characteristics of the human voice and in the technologies used to acquire the sound.
Factors within the characteristics of human sound include:

- The context of the speaking experience, i.e., the environment or situation in which the speech occurs [14].
- Each person's unique communication style, including pitch, tone or timbre, melody, volume or intensity, and rhythm [15][16][17].
- Emotional speech, or the mood of the speaker, such as angry, sad, fearful or cheerful.
- Physiology: people may have an illness, or be under the influence of alcohol or drugs, which affects the voice [17].
- Counterfeiting or disguising the voice: speakers sometimes change their own voice from the original, whether higher or lower, or alter the rhythm of their speech, which influences the characteristics of the sound [18].

Factors within the technologies related to the acquisition of sound include:

- The quality of the microphone or other recording equipment, which greatly affects the quality of the sound. Microphones differ in features such as frequency response and sensitivity to sound from various directions [18][19][20].
- The recording environment, such as background noise and the distance from the microphone [21][22].
- The basics of digital audio: sample rate, bit rate, and how analog signals are represented digitally.

In this research, we have worked on text-independent speaker verification. Related speaker recognition research includes the following.

Poignant, J. [23] used an unsupervised approach to identify speakers in TV broadcasts without biometric models. Existing methods usually use pronounced names as a source of names for identifying the speech clusters provided by a speaker diarisation step, but this source is too imprecise to give sufficient confidence. They propose two approaches for finding speaker identity based only on names written in the image track. With the late naming approach, they propose different propagations of written names onto clusters. Their second proposition, early naming, modifies the speaker diarisation module by adding constraints that prevent two clusters with different associated written names from being merged. These methods were tested on the REPERE corpus phase 1, containing 3 hours of annotated videos. The late naming system reaches an F-measure of 73.1%, and early naming improves over this result both in identification error rate and in the stability of the clustering stopping criterion. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only provided a 57.2% F-measure.

M. K. Nandwana [24] used an unsupervised approach for detecting human scream vocalizations in continuous recordings from noisy acoustic environments. The proposed detection solution is based on compound segmentation, which employs weighted mean distance, T2-statistics and Bayesian information criteria for the detection of screams. It also employs an unsupervised, threshold-optimized Combo-SAD for removal of non-vocal noisy segments in a preliminary stage. Five different noisy environments were simulated, with noise levels ranging from -20 dB to +20 dB. Performance of the proposed system was compared using two alternative acoustic front-end features: (i) Mel-frequency cepstral coefficients (MFCC) and (ii) perceptual minimum variance distortionless response (PMVDR). Evaluation results show that the new scream detection solution works well at clean, +20 and +10 dB SNR levels, with performance declining as SNR decreases to -20 dB across a number of the noise sources considered.

Almaadeed, N. [25] investigated the problem of identifying a speaker from voice regardless of the content. The authors designed and implemented a novel text-independent multimodal speaker identification system based on wavelet analysis and neural networks. The system was found to be competitive: it improved the identification rate by 15% compared with the classical MFCC and reduced the identification time by 40% compared with the back-propagation neural network, the Gaussian mixture model and principal component analysis. Performance tests conducted on the GRID database corpora have shown that this approach achieves faster identification and greater accuracy than traditional approaches, and that it is applicable to real-time, text-independent speaker identification systems.
Xiaojia Zhao [26] investigated the problem of speaker identification and verification in noisy conditions, assuming that speech signals are corrupted by environmental noise. The paper focuses on several issues relating to the implementation of the new model for real-world applications, including the generation of multicondition training data to model noisy speech, the combination of different training data to optimize recognition performance, and the reduction of the model's complexity. The new algorithm was tested on two databases with simulated and realistic noisy speech data. The first database is a redevelopment of the TIMIT database, produced by rerecording the data in the presence of various noise types, and is used to test the model for speaker identification with a focus on the varieties of noise. The second database is a handheld-device database collected in realistic noisy conditions, used to further validate the model for real-world speaker verification. The new model is compared to baseline systems and is found to achieve lower error rates.

Pathak, M.A. and Raj, B. [27] present frameworks for speaker verification and speaker identification systems in which the system is able to perform the necessary operations without observing the speech input provided by the user. The paper formalizes the privacy criteria for the speaker verification and speaker identification problems and constructs Gaussian mixture model-based protocols. The authors also report experiments with a prototype implementation of the protocols on a standardized dataset, measuring execution time and accuracy.

Bhardwaj, S. [28] presents three novel methods for speaker identification, two of which utilize both the continuous-density hidden Markov model (HMM) and the generalized fuzzy model (GFM), which has the advantages of both the Mamdani and Takagi-Sugeno models. In the first method, the HMM is utilized to extract a shape-based batch feature vector that is fitted with the GFM to identify the speaker. The second method makes use of the Gaussian mixture model (GMM) and the GFM for the identification of speakers. Finally, the third method is inspired by the way humans cash in on mutual acquaintances
while identifying a speaker. To test the validity of the proposed models [HMM-GFM, GMM-GFM, and HMM-GFM (fusion)] in a real-life scenario, they are evaluated on the VoxForge speech corpus and on a subset of the 2003 National Institute of Standards and Technology evaluation data set. The models are also evaluated on a corrupted VoxForge speech corpus, produced by mixing in different types of noise at different signal-to-noise ratios, and their performance is found to be superior to that of the well-known models.

Abrham Debasu Mengistu and Dagnachew Melesew Alemayehu [29] presented an implementation of text-independent Amharic-language speaker identification using VQ (vector quantization), GMM (Gaussian mixture models), BPNN (back-propagation neural network), MFCC (Mel-frequency cepstrum coefficients) and GFCC (gammatone frequency cepstral coefficients). For the identification process, speech signals were collected from speakers of both sexes; the data set comprised speech samples of 90 speakers, with 10 seconds of speech from each individual. On these speakers, accuracies of 59.2%, 70.9% and 84.7% were achieved when VQ, GMM and BPNN, respectively, were applied to the combined feature vector of MFCC and GFCC.

Wajdi Ghezaiel, Amel Ben Slimane and Ezzedine Ben Braiek [30] proposed extracting minimally corrupted speech that is considered useful for various speech processing systems, with a focus on co-channel speaker identification (SID). They employ a newly proposed usable-speech extraction method based on pitch information obtained from linear multi-scale decomposition by the discrete wavelet transform. The idea is to retain the speech segments in which only one pitch is detected and remove the others. The detected usable speech was used as input to a speaker identification system. The system was evaluated on co-channel speech, and the results show a significant improvement in speaker identification across various target-to-interferer ratios (TIR).

Syeiva Nurul Desylvia [31] presented an implementation of text-independent speaker identification. Text-independent speaker identification on Indonesian speaker data was modelled with vector quantization (VQ) using K-means initialization: K-means clustering was used to initialize the means, and hierarchical agglomerative clustering was used to identify the K value for VQ. The best VQ accuracy was 59.67%, obtained when k was 5. According to these results, the Indonesian language can be modelled by VQ. This research can be developed further by optimizing the VQ parameters with methods such as genetic algorithms or particle swarm optimization.

Hery Heryanto, Saiful Akbar and Benhard Sitohang [32] present a new direct access strategy for speaker identification systems. DAMClass is a direct access strategy that speeds up the identification process without drastically decreasing the identification rate. The method classifies speakers by original characteristics of the human voice, such as pitch, flatness, brightness, and rolloff. DAMClass decomposes the available dataset into smaller sub-datasets, in the form of classes or buckets, based on the similarity of the speakers' original characteristics.
DAMClass builds a speaker dataset index based on range-based indexing with a direct access facility, and uses nearest-neighbor search, range-based searching and multiclass-SVM mapping as its access methods. Experiments show that the direct access strategy with the multiclass-SVM algorithm outperforms the indexing accuracy of range-based indexing and nearest neighbor by one to nine percent. DAMClass is shown to speed up the identification process 16 times over sequential access, with 91.05% indexing accuracy.

3. RESEARCH METHOD

This paper presents an audio search engine that can retrieve sound files from a large file system based on similarity to a query sound. Sounds are characterized by speech templates derived from MFCCs and the power spectrum. Audio similarity can be measured by comparing templates, which works both for simple sounds and for complex audio such as music.

Development in speech technology [33][13] has been inspired by the desire to build mechanical models that emulate human verbal communication capabilities. Speech processing allows computers to follow voice commands and handle different human languages. Relevant tasks include source identification, automatic speech recognition, automatic music transcription, labeling/classification/tagging, music/speech/environmental sound segmentation, and sentiment/emotion recognition, together with common machine learning techniques applied in related fields such as image processing and natural language processing.

Figure 2 describes a model of an audio recognition system comprising different stages: pre-processing, feature extraction, classification and a language model [13]. The pre-processing stage transforms the input signal before any information is extracted at the feature extraction stage.
Figure 2. State-of-the-art audio classification.

Feature vectors must be robust to noise for better accuracy [34][35]. Feature extraction [35][36][37] is the most important part of a recognizer. If the features are ideally good, the type of classification architecture does not matter much. Conversely, if the features cannot discriminate between the classes concerned, no classifier will be efficient, however advanced it may be. In practical situations, features always present some degree of overlap from one class to another; it is therefore worth using good, well-adapted classification architectures. Features used for classification include linear prediction coefficients (LPC), cepstral coefficients, Mel-frequency cepstral coefficients (MFCC), cepstral mean subtraction (CMS) and the post-filtered cepstrum (PFL).

The classification stage performs recognition using the extracted features and a language model, where the language model contains language-related syntax that helps the classifier recognize the input. In pattern classification problems, the goal is to discriminate between features representing different classes of interest. Based on learning behavior, classifiers can be divided into two groups: classifiers that use supervised learning (supervised classification) and those that use unsupervised learning (unsupervised classification).

In supervised classification, we provide examples of the correct classification, i.e., feature vectors along with their correct classes, to teach the classifier. Based on these examples, commonly termed training samples, the classifier learns how to assign an unseen feature vector to the correct class. Examples of supervised classifiers include the hidden Markov model (HMM), Gaussian mixture models (GMM), k-nearest neighbor (k-NN), the support vector machine (SVM), artificial neural networks (ANN), Bayesian networks (BN) and dynamic time warping (DTW) [38][39][40][41][42][43].

In unsupervised classification, or clustering [38], there is neither an explicit teacher nor training samples. The classification of the feature vectors must be based on the similarity between them, according to which they are divided into natural groupings. Whether any two feature vectors are similar depends on the application. Obviously, unsupervised classification is a more difficult problem than supervised classification, and supervised classification is preferable when it is possible. In some cases, however, unsupervised learning is necessary, for example when the feature vector describing an object can be expected to change with time. Examples of unsupervised classifiers include k-means clustering, self-organizing maps (SOM) and learning vector quantization (LVQ) [43][44][45].

Classifiers can also be grouped by reasoning process into probabilistic and deterministic classifiers. Deterministic reasoning classifiers classify sensed data into distinct states and produce a distinct output that cannot be uncertain or disputable. Probabilistic reasoning, on the other hand, considers sensed data to be uncertain input and thus outputs multiple contextual states with associated degrees of truthfulness or probabilities; the class to which the feature belongs is decided based on the highest probability.
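As a concrete illustration of the two learning styles, here is a minimal sketch (our own illustration, not code from any cited system): a 1-nearest-neighbor classifier learns from labeled training samples, while plain k-means groups unlabeled feature vectors into natural clusters.

```python
import numpy as np

def nearest_neighbor_classify(train_x, train_y, query):
    """Supervised: assign the label of the closest training feature vector (1-NN)."""
    dists = np.linalg.norm(train_x - query, axis=1)
    return train_y[int(np.argmin(dists))]

def kmeans_cluster(x, k, iters=20, seed=0):
    """Unsupervised: group feature vectors into k natural clusters (plain k-means)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest center, then recompute the centers.
        labels = np.argmin(
            np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2), axis=1)
        centers = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers
```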
Because of these limitations of the audio fingerprint concept, audio fingerprints cannot be used to find audio files when a speaker speaks text-independently. Given this technical limitation, the lack of flexibility in searching for audio information, and the inapplicability to other types of search such as voice search, this research instead applies the speaker identification concept to a speaker voice retrieval system. The system operates as follows.

3.1. Feature extraction

Feature extraction is the process of computing a compact numerical representation that can be used to characterize a segment of audio. This research uses Mel-frequency cepstral coefficient analysis, which is based on the discrete Fourier transform (DFT), together with the energy spectrum, as shown in Figure 3. The use of MFCC coefficients is common in automatic speech recognition (ASR), where 10-13 coefficients are often considered sufficient for coding speech [38]. A subjective pitch scale, the Mel frequency scale, is used to capture the important phonetic characteristics of speech. MFCC [38][39] is based on human hearing perception, whose frequency resolution is not linear above 1 kHz. Figure 3 shows the process of creating MFCC features.
Figure 3. Calculating the energy spectrum (power spectrum) and MFCCs.

Figure 4. Creating an (N-1)-point Hamming window and displaying the result.

The first step is to segment the audio signal into frames whose length is a power of two, usually applying a Hamming window function, as shown in Figure 4. The next step is to take the discrete Fourier transform (DFT) of each frame. The next step is Mel filter bank processing: the frequency range of the DFT spectrum is very wide, and the voice signal does not follow a linear scale, so the spectrum is mapped onto the Mel scale. The final step is the discrete cosine transform (DCT), which converts the log Mel spectrum back into the time domain. The result of the conversion is the Mel-frequency cepstral coefficients (12 cepstral features plus energy).

The process of creating energy spectrum features is similar. The first step is to segment the audio signal into frames whose length is a power of two, usually applying a Hamming window function. The next step is to take the discrete Fourier transform (DFT) of each frame. The next step is to take the power of each frame, denoted by P(k) and computed by equation (1):

$$P(k) = 2595 \log(\mathrm{DFT}(k)) \qquad (1)$$

The result of P(k) is called the energy spectrum. A minimal code sketch of this feature extraction follows.
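The sketch below mirrors the pipeline just described: Hamming-windowed 512-sample frames, a DFT, the log energy spectrum of equation (1), and MFCCs. It is a minimal illustration that assumes librosa is available for the MFCC step and that the logarithm in equation (1) is base 10, as in the Mel-scale formula; the function names and frame alignment are our own choices, not code from the paper.

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do

def energy_spectrum(signal, frame_len=512):
    """Per-frame energy spectrum following equation (1):
    Hamming-windowed frames -> DFT magnitude -> P(k) = 2595 * log(DFT)."""
    window = np.hamming(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    spectra = np.abs(np.fft.rfft(np.array(frames), axis=1))
    return 2595.0 * np.log10(spectra + 1e-10)  # small epsilon avoids log(0)

def extract_features(path):
    """Concatenate MFCCs (12 cepstral coefficients + energy) with the energy
    spectrum, as in Figure 6. Frame sizes here are illustrative assumptions."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=512).T  # one row per frame
    spec = energy_spectrum(y, frame_len=512)
    n = min(len(mfcc), len(spec))  # align frame counts before concatenation
    return np.hstack([mfcc[:n], spec[:n]])
```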
3.2. Measure of similarity

The purpose of a measure of similarity is to compare two vectors and compute a single number that evaluates their similarity. In other words, the objective is to determine to what extent two variables co-vary, which is to say, have the same values for the same cases. Euclidean distance is most often used to compare profiles of respondents across variables. For example, suppose our data consist of demographic information on a sample of individuals, arranged as a respondent-by-variable matrix. Each row of the matrix is a vector of m numbers, where m is the number of variables; we can evaluate the similarity, or the distance, between any pair of rows. Euclidean distance is the basis of many measures of similarity and dissimilarity. The distance between vectors X and Y is defined as follows:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2} \qquad (2)$$

In other words, the Euclidean distance is the square root of the sum of squared differences between corresponding elements of the two vectors. Note that the formula takes the values of X and Y as they are: no adjustment is made for differences in scale, so Euclidean distance is only appropriate for data measured on the same scale. The correlation coefficient is related to the Euclidean distance between standardized versions of the data.

3.3. Content-based retrieval of spoken audio

This section discusses the methodology used in our proposed technique, including the description of the experimental setup, the comparative study method and the implementation details. The speaker voice retrieval system consists of two stages, as shown in Figure 5.

Figure 5. The process of content-based retrieval of spoken audio.

In the first stage, audio files are stored in the voice retrieval system and the vocal and non-vocal areas of each file are identified, because only the vocal areas are used for speaker voice retrieval. Vocal sound is sound produced by one or more speakers, with or without noise and instrumental accompaniment. In this step, Euclidean distance to a vocal template and a non-vocal template is used to label each area of an audio file. The vocal template consists of singing and speech from both men and women, approximately 10 minutes in length. The non-vocal template, also approximately 10 minutes long, consists of varied environmental background sounds, including meeting rooms of various sizes, an office, a construction site, a television studio, streets, parks, the International Space Station, etc. The procedure, sketched in code after this list, is:

1. Read the audio file whose vocal and non-vocal areas are to be extracted.
2. Convert the audio data to the energy spectrum and Mel-frequency cepstral coefficients (MFCCs) with windows of 512 samples, as described in Section 3.1 (the energy spectrum and MFCCs are concatenated to form a longer feature vector, as shown in Figure 6).
3. Calculate the distance between each query instance of the audio file and all sample vectors in the vocal and non-vocal templates.
4. Sort the distances and determine the nearest samples, based on the minimum distance, for each sample window.
5. Use a simple majority over the categories chosen by the smallest distances to predict whether the query instance is vocal or non-vocal.
6. Reject the non-vocal vectors, leaving only the vocal vectors of each audio file, and store the files in the voice retrieval system.
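Here is a minimal sketch of steps 3-6: nearest-template labeling by Euclidean distance, followed by a majority vote over groups of frames. The group size and the function names are illustrative assumptions; the paper does not specify them.

```python
import numpy as np

def classify_frames(features, vocal_template, nonvocal_template):
    """Label each feature frame by its nearest template frame (Euclidean distance).
    Templates are 2-D arrays of the same feature type (MFCC + energy spectrum)."""
    labels = []
    for frame in features:
        d_vocal = np.min(np.linalg.norm(vocal_template - frame, axis=1))
        d_nonvocal = np.min(np.linalg.norm(nonvocal_template - frame, axis=1))
        labels.append(d_vocal < d_nonvocal)  # True = vocal
    return np.array(labels)

def keep_vocal(features, vocal_template, nonvocal_template, group=32):
    """Majority vote over groups of frames, then drop non-vocal groups (steps 4-6).
    The group size of 32 frames is an assumption, not a value from the paper."""
    labels = classify_frames(features, vocal_template, nonvocal_template)
    kept = []
    for start in range(0, len(features), group):
        votes = labels[start:start + group]
        if votes.mean() > 0.5:  # simple majority says "vocal"
            kept.append(features[start:start + group])
    return np.vstack(kept) if kept else np.empty((0, features.shape[1]))
```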
Figure 6. Forming the input vector from the energy spectrum (power spectrum) and MFCCs.

After this procedure, the data in the audio files consist only of the vocal vectors of each file, ready for use in the speaker voice retrieval system. The speaker voice retrieval process then proceeds according to Figure 7.

The second stage is the retrieval procedure; below is a step-by-step algorithm for using Euclidean distance to retrieve a human voice from the audio files. The goal of this stage is to find the target files and reject the non-target files. This research uses a relative frequency distribution to decide whether a file belongs to the target class. A frequency distribution shows the number of elements in a data set that belong to each class; in a relative frequency distribution, the value assigned to each class is the proportion of the total data set that belongs to the class. The formula for calculating the relative frequency of a class is:

$$\text{relative frequency of a class} = \frac{\text{class frequency}}{n} \qquad (3)$$

Class frequency refers to the number of observations in each class; n represents the total number of observations in the entire data set.

Figure 7. The process of relative frequency and searching for a speaker's voice.

1. Read a voice file of the target person, convert it to the energy spectrum and Mel-frequency cepstral coefficients (MFCCs), and assign it to the target class template. The target class template is approximately 5-10 minutes long. After that, read a voice file of a non-target person, convert it to the energy spectrum and MFCCs, and assign it to the non-target class template, which is also approximately 5-10 minutes long.
2. Next, read each vocal file in the voice retrieval system and split it into frames of predefined duration. Each frame is further split into N non-overlapping segments, where N is called the frame size. Afterwards, each segment in a frame is compared with the target class template and the non-target class template by Euclidean distance, and the minimum distance for each sample window determines whether the segment is assigned to the target class or the non-target class.
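A minimal sketch of this decision rule follows: each segment votes for the nearer template, and equation (3) turns the votes into a relative frequency. The 0.5 acceptance threshold and the function names are our own assumptions; the paper does not state the threshold it used.

```python
import numpy as np

def target_relative_frequency(segments, target_template, nontarget_template):
    """For each segment, pick the class of the nearest template vector, then
    return the relative frequency of the target class, per equation (3)."""
    target_votes = 0
    for seg in segments:
        d_target = np.min(np.linalg.norm(target_template - seg, axis=1))
        d_nontarget = np.min(np.linalg.norm(nontarget_template - seg, axis=1))
        target_votes += int(d_target < d_nontarget)
    return target_votes / len(segments)  # class frequency / n

def is_target_file(segments, target_template, nontarget_template, threshold=0.5):
    """Accept the file as the target speaker when the target class's relative
    frequency exceeds the threshold; the 0.5 default is an assumption."""
    return target_relative_frequency(
        segments, target_template, nontarget_template) > threshold
```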