TELKOMNIKA, Vol. 14, No. 3, September 2016, pp. 1024-1034
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/telkomnika.v14.i3.3281

Musical Genre Classification Using Support Vector Machines and Audio Features

A.B. Mutiara*, R. Refianti, and N.R.A. Mukarromah
Faculty of Computer Science and Information Technology, Gunadarma University,
Jl. Margonda Raya No. 100, Depok 16424, Indonesia, fax +62-21-78
*corresponding author, e-mail: amutiara@staff.gunadarma.ac.id

Abstract
The need for advanced Music Information Retrieval grows along with the huge volume of digital music files distributed on the internet. Musical genres are the main top-level descriptors used to organize digital music files, yet most genre labeling is done manually, so an automatic way of assigning a genre to digital music files is needed. The standard approach to automatic musical genre classification is feature extraction followed by supervised machine learning. This research aims to find the best combination of audio features using several kernels of non-linear Support Vector Machines (SVM). The 31 combinations of audio features examined here differ from those used in related research. Furthermore, among the proposed audio features, Linear Predictive Coefficients (LPC), originally developed for speech coding, have not previously been used in works on musical genre classification. An experiment in classifying digital music files into genres is carried out: feature sets related to timbre, rhythm, tonality and LPC are extracted from the music files, and all possible combinations of the extracted features are classified using three different kernels of the SVM classifier, namely Radial Basis Function (RBF), polynomial and sigmoid.
The results show that the most appropriate kernel for automatic musical genre classification is the polynomial kernel, and the best combination of audio features is musical surface, Mel-Frequency Cepstrum Coefficients (MFCC), tonality and LPC, which achieves 76.6% classification accuracy.

Keywords: Support Vector Machine, Audio Features, Mel-Frequency Cepstrum Coefficients, Linear Predictive Coefficients

Copyright (c) 2016 Universitas Ahmad Dahlan. All rights reserved.

1. Introduction
A standard approach for automatic musical genre classification is feature extraction followed by supervised machine learning. Feature extraction transforms the input data into a reduced set of representative features instead of the full-size input. Many features related to the main dimensions of music, including timbre, rhythm, pitch and tonality, can be extracted from audio signals. In genre classification, a rigorous selection of features that can distinguish one genre from another is the key factor in achieving high classification accuracy.

Several feature sets have been proposed for representing genres, related to timbre, rhythm, pitch and tonality. Musical surface (spectral flux, spectral centroid, spectral rolloff, zero-crossings and low-energy) [1][2] and Mel-Frequency Cepstrum Coefficients (MFCC) [1][3] are used as timbre features. As rhythmic features, strongest beat, strength of strongest beat and beat sum are used [4]; these are obtained from the calculation of a Beat Histogram. The pitch feature accumulates the results of multiple pitch detections in a Pitch Histogram; according to [5], this feature contributes poorly to classification accuracy.
Features related to tonality are chromagram, key strength and the peak of key strength [6]. There is another cepstral-based feature similar to MFCC called Linear Predictive Coefficients (LPC). So far, LPC has not been used in works related to musical genre classification.

Received December 17, 2015; Revised April 19, 2016; Accepted May 2, 2016
For the genre classification task, various supervised machine-learning classifiers are applied to the extracted feature sets, such as Linear Discriminant Analysis (LDA), k-Nearest Neighbor (kNN), Gaussian Mixture Model (GMM), and Support Vector Machines (SVM). A comparison of these algorithms by [7] showed that SVM reaches the highest accuracy.

The need for advanced Music Information Retrieval grows along with the huge volume of digital music files distributed on the internet. There is much research on musical genre classification, and the most standard approach is feature extraction followed by supervised machine learning. Several feature sets and various supervised machine-learning algorithms have been proposed for automatic musical genre classification. The problems addressed by this research are: i) How to find the best combination of feature sets extracted from music files for the automatic musical genre classification task? ii) How to find the most appropriate kernel of the non-linear SVM kernel method for this task?

The scope of the research is as follows: i) The genres used to classify music files are limited to ten: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock. ii) Although many music data sets are available online, this research uses the GTZAN data set [8], an online data set containing 1000 music files with a duration of 30 seconds each; every 100 music files represent one genre. iii) The music files use a 22050 Hz sample rate, mono channel, 16-bit samples and .wav format.
iv) The following software is used in the research: Windows 7 Operating System, MATLAB R2009a, MIRtoolbox 1.5, jAudio 1.0.4, and Weka 3.7.10.

This research aims to find the best combination of feature sets and SVM kernel for automatic musical genre classification. To do this, an experiment in classifying digital music files into genres is carried out. In the experiments, all possible combinations of feature sets related to timbre, rhythm, tonality and LPC extracted from music files are classified using three different kernels of the SVM classifier. The classification accuracies of the experiments are then compared to determine the best among these trials.

2. Literature Review
2.1. Music Genre
Musical genres are categories that have arisen through a complex interplay of cultures, artists and market forces to characterize similarities between musicians or compositions and to organize music collections [9]. Nowadays, music genres are often used to categorize music on radio, television and especially the internet.

There is no agreement on a musical genre taxonomy. Therefore, most music industries and internet music stores use different genre taxonomies when categorizing music pieces into logical groups within a hierarchical structure. For example, allmusic.com uses 531 genres, mp3.com uses 430 genres, and amazon.com uses 719 genres in their databases. Pachet and Cazaly [10] tried to define a general taxonomy of musical genres but eventually gave up and used a self-defined two-level taxonomy of 20 genres and 250 subgenres in their Cuidado music browser [11].

There are studies on the human ability to classify music into genres. One of them is a study conducted by R.O. Gjerdingen and D.
Perrott [12] that uses ten different genres, namely Blues, Classical, Country, Dance, Jazz, Latin, Pop, R&B, Rap, and Rock. The subjects of the study were 52 college students enrolled in their first year of psychology. The accuracy of the genre prediction for the 3 s samples was around 70%. The accuracy for the 2.5 s samples was around 40%, and the average between the 2.5 s classification and the 3 s classification was around 44%.

2.2. Automatic Music Genre Classification
Musical genre classification is a classification problem, and such a task consists of two basic steps: feature extraction and classification. The goal of the first step, feature extraction, is to get the essential information out of the input data. The second step is to find which combinations of feature values correspond to which categories, which is done in the classification part. The two steps can be clearly separated: the output of the feature extraction step is the input to the classification step [13]. The standard approach to the music genre classification task can be seen in Figure 1.

Figure 1. Music genre classification standard approach.

2.3. Feature Extraction
Feature extraction transforms the input data into a reduced representation set of features (also called a feature vector). If the extracted features are carefully chosen, the feature set is expected to capture the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full-size input. In the case of an audio signal, feature extraction is the process of computing a compact numerical representation that can be used to characterize a segment of audio [1].

To represent musical genre, features extracted from the audio signal related to timbre, rhythm and tonality are used. These features can be divided into time-domain and frequency-domain features. Time-domain features are computed directly on the audio waveform (amplitude versus time). For frequency-domain features, the Fourier Transform is needed to obtain the spectrum of the audio (energy or magnitude versus frequency) from which the features are calculated. The Fourier Transform is a mathematical transformation employed to transform signals from the time domain into the frequency domain. To compute the Fourier Transform digitally, the Discrete Fourier Transform (DFT) algorithm is used, especially the Fast Fourier Transform (FFT) and the Short-Time Fourier Transform (STFT). The FFT is a faster version of the DFT: it utilizes some clever algorithms to do the same thing as the DFT, but in much less time.
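The DFT/FFT equivalence described here can be checked directly. The sketch below (illustrative, using NumPy rather than the MATLAB tooling of this paper) evaluates the O(N^2) definition and compares it with the O(N log N) FFT:

```python
import numpy as np

def naive_dft(x):
    """Direct evaluation of the DFT definition: X_k = sum_n x_n * exp(-2j*pi*k*n/N)."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

x = np.random.default_rng(0).standard_normal(256)
X_slow = naive_dft(x)    # O(N^2): N outputs, each a sum of N terms
X_fast = np.fft.fft(x)   # O(N log N): same numbers, much faster
print(np.allclose(X_slow, X_fast))  # True
```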
Evaluating the DFT definition directly requires O(N^2) operations: there are N outputs X_k, and each output requires a sum of N terms. An FFT is any method to compute the same results in O(N log N) operations. However, the FFT has a drawback: it contains only frequency information, and no time information is retained, so it works well only for stationary signals. It is not useful for analyzing time-variant, non-stationary signals, since it only shows the frequencies occurring over the whole signal rather than at specific times. To handle non-stationary signals, the STFT is needed. The idea of the STFT is framing or windowing the signal into narrow time intervals (possibly overlapping) and taking the FFT of each segment.

2.3.1. Features: Timbre, Rhythm and Tonality
The features used to represent timbre are based on standard features proposed for music-speech discrimination and speech recognition [5]. The features are spectral flux, spectral centroid, spectral rolloff, zero-crossings and low-energy, which are grouped as musical surface features in [2], and also the Mel-Frequency Cepstrum Coefficients (MFCC). The timbre features are calculated from the short-time Fourier transform, which is performed frame by frame along the time axis. The individual features are elaborated elsewhere: spectral centroid in [15], spectral rolloff in [4], spectral flux in [16], zero crossings in [2], and MFCC in [17].

Features related to rhythm, proposed by [4], are strongest beat, beat sum and strength of strongest beat. These features are based on the calculation of a beat histogram, which shows the strength of different rhythmic periodicities in a signal. The beat histogram autocorrelates the RMS of each bin in order to construct a histogram representing rhythmic regularities. This is
calculated by taking the RMS of 256 windows and then taking the FFT of the result [4]. RMS is used to calculate the amplitude of a window and can be computed with the following equation:

RMS = sqrt( (1/N) * sum_{n=1}^{N} x_n^2 )    (1)

where N is the total number of samples (frames) in the time domain and x_n is the amplitude of the signal at the n-th sample.

Features related to tonality are proposed by [6]. Tonality denotes a system of relationships between a series of pitches (forming melodies and harmonies) having a tonic, or central pitch class, as its most important (or most stable) element. The tonality features are chromagram, key strength and the peak of key strength.

2.3.2. Linear Predictive Coefficients (LPC)
The basic idea behind linear prediction is that a signal carries relative information: the values of consecutive samples are approximately the same, and the differences between them are small, so it becomes easy to predict an output sample from a linear combination of previous samples. For example, let x be a continuous time-varying signal whose value at time t is x(t). When this signal is converted into discrete-time samples, the value of the n-th sample is x(n). To determine x(n) using linear prediction techniques, the values of past samples x(n-1), x(n-2), x(n-3), ..., x(n-p) are used, where p is the predictor order of the filter. Coding based on the values of previous samples in this way is known as linear predictive coding [18].

2.4. Classification
In machine-learning terminology, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available. An algorithm that implements classification, especially in a concrete implementation, is known as a classifier.

2.4.1. Support Vector Machine
Support Vector Machine (SVM) is a technique for prediction tasks, either classification or regression. It was first introduced in 1992. SVM is a member of the supervised-learning family: categories are given, and the SVM algorithm maps instances into them.

Support Vector Machines are based on the concept of decision planes that define decision boundaries. A decision plane separates sets of objects having different class memberships [19]. The basic idea of SVM is to find the best separating function (classifier/hyperplane) between two kinds of objects. The best hyperplane is the one located in the middle between the two sets of objects; finding it is equivalent to maximizing the margin between the two sets [20].

In the real world, most classification tasks are not that simple to solve linearly. More complex structures are needed to make an optimal separation, that is, to correctly classify new objects (test cases) on the basis of the available examples (training cases). The kernel method is one way to solve this problem. Kernels rearrange the original objects using a set of mathematical functions; the process of rearranging the objects is known as mapping (transformation). The kernel function represents a dot product of input data points mapped into a higher-dimensional feature space by the transformation [19]. A linear operation in the feature space is equivalent to a non-linear operation in the input space, so classification can become easier with a proper transformation [21].

There are a number of kernels that can be used in SVM models, including the polynomial, Radial Basis Function (RBF) and sigmoid kernels:

1. Polynomial: K(X_i, X_j) = (X_i . X_j + C)^d
2. RBF: K(X_i, X_j) = exp(-gamma * ||X_i - X_j||^2)

3. Sigmoid: K(X_i, X_j) = tanh(X_i . X_j + C)

where K(X_i, X_j) = phi(X_i) . phi(X_j); which kernel function should be used to substitute the dot product in the feature space is highly dependent on the data.

2.4.2. K-Fold Cross Validation
For classification problems, the performance of a model is measured in terms of its error rate (the percentage of incorrectly classified instances in the data set). In classification, two data sets are used: the training set (seen data) to build the model (determine its parameters) and the test set (unseen data) to measure its performance (holding the parameters constant). To split data into training and test sets, k-Fold Cross Validation is used. K-Fold Cross Validation divides the data randomly into k folds (subsets) of equal size. K-1 folds are used for training the model, and one fold is used for testing. This process is repeated k times so that every fold is used for testing once. The overall performance is obtained by averaging the performance on the k test sets. This method effectively uses all the data for both training and testing. Typically k=5 or k=10 are used as effective fold sizes [22].

2.4.3. Confusion Matrix
In the field of machine learning, a confusion matrix is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised-learning one [23]. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. As seen in Figure 2, the entries in the confusion matrix have the following meaning:

Figure 2. Confusion matrix.
a is the number of correct predictions that an instance is negative, b is the number of incorrect predictions that an instance is positive, c is the number of incorrect predictions that an instance is negative, and d is the number of correct predictions that an instance is positive.

3. Methodology
This research comprises several steps: data collection, feature extraction, data preprocessing, classification, and performance evaluation.

3.1. Data Collection, Feature Extraction, Data Preprocessing
The data set of music files classified in the experiments is GTZAN [8]. The data set contains 1000 music files in .wav format. Each file is 30 seconds long and is recorded at a 22050 Hz sample rate with a 16-bit sample size in mono channel. This
data set consists of ten genres, namely blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock, where each genre is represented by 100 music files.

The feature extraction process is applied to the music files in the data set. Time-domain features are calculated for every short-time frame of the audio signal, and frequency-domain features are calculated from the short-time Fourier transform (STFT). The features are extracted from the files by utilizing MIRtoolbox [24], a MATLAB toolbox for musical feature extraction from audio, and jAudio, a Java feature extraction library [4][25]. The feature data obtained from the feature extraction phase are stored in two ARFF (Attribute-Relation File Format) files, features.arff and features2.arff. Before entering the classification stage, the data are preprocessed so that they are no longer separated into different files.

3.2. Classification
In general, the classification process in the experiments uses the SVM classifier algorithm with the kernel method and a 10-fold cross-validation strategy. The data set is first distributed into 10 equal-sized sets. Nine sets (sets 2 to 10) are used as the training set to generate a classifier model with the SVM classifier, and the model is tested on set 1 to measure its accuracy. In the second iteration, set 2 is used as the testing set and sets 1 and 3 to 10 as the training set. This process is iterated until 10 classifier models have been produced and every set has been used as the testing set. The accuracies of the 10 classifier models are averaged to obtain the overall accuracy. The classification is undertaken using the Weka [26] data mining software. In the experiments, several classification processes are carried out.
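The fold rotation just described can be sketched at the index level as follows; this is an illustrative stand-in for what Weka's 10-fold cross-validation does internally, not the code used in the research:

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices, split them into k near-equal folds, and
    yield (train, test) index pairs so that each fold is the test set once."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

# 1000 GTZAN files, 10 folds of 100: ten train/test splits of 900/100 samples.
splits = list(k_fold_indices(1000, k=10))
print(len(splits))  # 10
```

Each file ends up in exactly one test fold, so every file contributes to both training and evaluation.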
Each classification process uses a different combination of audio features as well as one of three different SVM kernels. The feature sets in the experiments are grouped as Musical Surface (spectral flux, spectral centroid, spectral rolloff, zero-crossings, RMS energy and low-energy), MFCC, rhythm (strongest beat, beat sum and strength of strongest beat), tonality (chromagram, key strength and peak of key strength) and LPC. All possible combinations of these feature sets are listed in Table 1. The SVM kernels used in the experiments are the RBF, polynomial and sigmoid kernels. Multiplying the possible feature combinations by the kernels used gives 93 classification processes to be tried out in the experiments.

Table 1. Possible Combinations of the Feature Sets.
1  Musical Surface (MS)
2  MFCC
3  Rhythm
4  Tonality
5  LPC
6  MS+MFCC
7  MS+Rhythm
8  MS+Tonality
9  MS+LPC
10 MFCC+Rhythm
11 MFCC+Tonality
12 MFCC+LPC
13 Rhythm+Tonality
14 Rhythm+LPC
15 Tonality+LPC
16 MS+MFCC+Rhythm
17 MS+MFCC+Tonality
18 MS+MFCC+LPC
19 MS+Rhythm+Tonality
20 MS+Rhythm+LPC
21 MS+Tonality+LPC
22 MFCC+Rhythm+Tonality
23 MFCC+Rhythm+LPC
24 MFCC+Tonality+LPC
25 Rhythm+Tonality+LPC
26 MS+MFCC+Rhythm+Tonality
27 MS+MFCC+Rhythm+LPC
28 MS+MFCC+Tonality+LPC
29 MS+Rhythm+Tonality+LPC
30 MFCC+Rhythm+Tonality+LPC
31 MS+MFCC+Rhythm+Tonality+LPC
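Table 1's 31 rows are simply every non-empty subset of the five feature groups. Under that reading, they can be enumerated mechanically:

```python
from itertools import combinations

groups = ["MS", "MFCC", "Rhythm", "Tonality", "LPC"]
# Every non-empty subset of the five groups, smallest first.
feature_sets = [c for r in range(1, len(groups) + 1)
                for c in combinations(groups, r)]
print(len(feature_sets))  # 2**5 - 1 = 31, matching Table 1
```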
3.3. Performance Evaluation
The performance of automatic musical genre classification in the experiments is evaluated by the classification accuracy obtained in each trial. A comparative analysis of the accuracies of the 93 classification processes is carried out in order to determine the best combination of audio features and the most appropriate kernel for automatic musical genre classification. Detailed classification results are also provided in the form of confusion matrices, which show whether the music files are correctly classified into their genres or not; the rows represent the actual genre and the columns represent the predicted genre.

4. Results and Discussions
4.1. Feature Extraction Results
The feature extraction process is applied to the 1000 music files in the data set, in which each group of 100 files represents one genre. The process is carried out in two steps: the first step uses MIRtoolbox as the feature extraction tool and the second uses jAudio.

In the first step, the code written in a MATLAB file called feature_extraction.m is executed; running this file takes 21,874.955 seconds. The statistical data of the features related to timbre and tonality, as well as the label of each file, are obtained. The program then generates the features.arff file, which contains the statistical information of the features and the file labels.

In the second step, jAudio is used as the feature extraction tool to extract the features related to rhythm. The process takes 2.127 seconds. The output of this step is the statistical information of the features, stored in the features2.arff file.

4.2. Classification Results
The classification processes are done using the Weka data mining software.
The input is the features_final.arff file, a merger of the features.arff and features2.arff files obtained from the feature extraction stage. In this research, the experiments classify 31 different combinations of feature sets using three different kernels of the SVM classifier for non-linear data. The accuracy and the detailed classification results of the experiments are provided in the following sections.

4.2.1. Classification Accuracy of Experiments
From the experiments conducted, the overall performance results of 93 classification processes are obtained. The feature sets of Musical Surface, MFCC, Rhythm, Tonality and LPC and their combinations are used. Table 2 compares the classification accuracy of the various feature sets and their combinations using the three kernels of the SVM classifier: RBF, polynomial and sigmoid.

According to Table 2, the polynomial kernel is the most appropriate kernel for automatic musical genre classification, since it always yields the best accuracy among all kernels for the same combination of features. The other two kernels are less suitable for this task: the RBF kernel does not give optimal results, and the sigmoid kernel does not fit the data distribution of the audio features. The best accuracy, 76.6% of correctly classified instances, is obtained by the combination of MS+MFCC+Tonality+LPC features with the polynomial kernel. That is, the best accuracy is achieved by the combination of all feature sets excluding the Rhythm features, which is reasonable since, in the experiments on individual feature sets, the Rhythm features had the worst performance of all (30.3%).

4.2.2. Classification Results in Confusion Matrix
This section discusses the detailed genre classification results in the form of confusion matrices.
The discussion is restricted to the three best accuracies in the experiments. The best accuracy (76.6%) is obtained by the combination of MS+MFCC+Tonality+LPC features.
Table 2. Accuracy of the Classification Processes in the Experiments.
Feature/Kernel                    RBF      Polynomial  Sigmoid
Musical Surface (MS)              37.1 %   56.7 %      10 %
MFCC                              43.7 %   61.8 %      14.6 %
Rhythm                            18.5 %   30.3 %      10 %
Tonality                          46.2 %   46.3 %      17.5 %
LPC                               37.1 %   53.9 %      32.6 %
MS+MFCC                           53.2 %   67.5 %      10 %
MS+Rhythm                         38.3 %   58.9 %      10 %
MS+Tonality                       57.2 %   61.9 %      10 %
MS+LPC                            43.9 %   66.5 %      10 %
MFCC+Rhythm                       47.8 %   62.7 %      10 %
MFCC+Tonality                     59.3 %   64.2 %      22.3 %
MFCC+LPC                          52.4 %   66.3 %      22.1 %
Rhythm+Tonality                   50.5 %   53.8 %      10 %
Rhythm+LPC                        41.3 %   62 %        10 %
Tonality+LPC                      58.9 %   63.3 %      22.1 %
MS+MFCC+Rhythm                    55.8 %   69.2 %      10 %
MS+MFCC+Tonality                  65.1 %   72.1 %      10 %
MS+MFCC+LPC                       57 %     71.1 %      10 %
MS+Rhythm+Tonality                58.6 %   63.4 %      10 %
MS+Rhythm+LPC                     46.4 %   68.1 %      10 %
MS+Tonality+LPC                   61.4 %   68.9 %      10 %
MFCC+Rhythm+Tonality              61.6 %   66.6 %      10 %
MFCC+Rhythm+LPC                   56.8 %   69.8 %      10 %
MFCC+Tonality+LPC                 65.6 %   72.6 %      26 %
Rhythm+Tonality+LPC               60.5 %   66.3 %      10 %
MS+MFCC+Rhythm+Tonality           66.6 %   70.8 %      10 %
MS+MFCC+Rhythm+LPC                60.4 %   73.4 %      10 %
MS+MFCC+Tonality+LPC              68.9 %   76.6 %      10 %
MS+Rhythm+Tonality+LPC            62.5 %   69.3 %      10 %
MFCC+Rhythm+Tonality+LPC          68.9 %   73.2 %      10 %
MS+MFCC+Rhythm+Tonality+LPC       70 %     75.1 %      10 %

The second best accuracy (75.1%) is obtained by the combination of all feature sets, and the third by the combination of MS+MFCC+Rhythm+LPC features. In Tables 3, 4 and 5, bl, cl, co, di, hi, ja, me, po, re and ro stand for the blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock genres. From Table 3, it can be seen that 87 blues instances are correctly classified as their genre, while 7 instances are misclassified as country, 1 as disco, 2 as metal, 2 as reggae and 1 as rock.
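A confusion-matrix row can be read as per-genre accuracy (recall): the blues row of Table 3, transcribed below, sums to the 100 blues files, and the diagonal entry gives the fraction classified correctly:

```python
# Blues row of Table 3, predicted-genre counts in the order
# (bl, cl, co, di, hi, ja, me, po, re, ro):
blues_row = [87, 0, 7, 1, 0, 0, 2, 0, 2, 1]
total = sum(blues_row)         # each genre contributes exactly 100 files
recall = blues_row[0] / total  # diagonal entry / row total
print(total, recall)  # 100 0.87
```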
Table 3. Confusion Matrix of MS+MFCC+Tonality+LPC Features.
     bl  cl  co  di  hi  ja  me  po  re  ro
bl   87   0   7   1   0   0   2   0   2   1
cl    0  95   1   0   0   4   0   0   0   0
co    7   1  73   2   0   1   1   3   3   9
di    0   0   2  77   4   1   1   3   7   5
hi    1   0   1   5  82   0   3   5   3   0
ja    3   4   4   0   0  88   1   0   0   0
me    2   0   0   2   1   0  84   0   0  11
po    0   0   7   4   4   1   0  78   4   2
re    7   0   6   7   4   0   0   6  62   8
ro   12   1  10   8   4   0   9   6  10  40

Table 4. Confusion Matrix of All Feature Sets.
     bl  cl  co  di  hi  ja  me  po  re  ro
bl   86   0   3   3   0   1   2   0   3   2
cl    0  93   1   0   0   5   0   0   0   1
co    7   1  74   2   0   1   0   3   4   8
di    1   0   2  71   4   1   1   3   9   8
hi    1   0   1   4  83   0   3   5   3   0
ja    3   4   3   0   0  88   1   0   0   1
me    4   0   0   0   1   0  86   0   0   9
po    1   0   8   5   4   1   0  74   4   3
re    4   0   7   9   7   0   0   6  58   9
ro   11   1  11   9   4   1   8   6  11  38

Table 5. Confusion Matrix of MS+MFCC+Rhythm+LPC Features.
     bl  cl  co  di  hi  ja  me  po  re  ro
bl   83   0   6   1   0   1   4   0   1   4
cl    0  90   0   0   0   8   0   0   0   2
co    8   0  71   3   0   4   0   3   3   8
di    1   0   3  68   7   1   2   4   9   5
hi    5   0   2  13  65   0   2   4   8   1
ja    4   4   6   0   0  81   1   1   0   3
me    4   0   0   3   1   0  87   0   0   5
po    0   0   6   4   6   1   0  75   4   4
re    6   0   8   8   8   0   0   3  60   7
ro   10   0  10   4   2   2  10   3   5  54

In Table 3, the best genre is classical, with 95 instances correctly classified, and the worst is rock, with just 40 instances correctly classified. In Tables 4 and 5, classical is still the best genre, with 93 and 90 instances correctly classified respectively. The worst genre in Table 4 is again rock, with only 38 correctly classified instances, but in Table 5 the number of correctly classified rock instances increases to 54.

Figure 3 visualizes the per-genre accuracy comparison between the three different combinations of feature sets. The combination of musical surface, MFCC, tonality and LPC features (without the rhythm features) has the largest number of correctly classified instances overall, but a low count for the rock genre, as does the combination of all features. The combination of musical surface, MFCC, rhythm and LPC features (without the tonality features) has the smallest number of correctly classified instances, but the distribution is fairly even: no genre has fewer than 50 correctly classified instances. This suggests that the tonality features perform well in classification overall but hurt a specific genre, while the rhythm features perform poorly overall but give evenly distributed results.
Figure 3. Accuracy comparison of the three different feature combinations for each genre.

5. Concluding Remarks
The experiments on automatic musical genre classification have been carried out successfully using three SVM kernels on different combinations of audio feature sets. The kernels are polynomial, radial basis function (RBF) and sigmoid. The audio feature sets used are musical surface, MFCC, rhythm, tonality and LPC, giving 31 possible combinations of the feature sets. With the implementation of the three kernels, 93 experiments are done in this research. The results show that the most appropriate kernel for automatic musical genre classification is the polynomial kernel, while the best combination of audio features is musical surface, MFCC, tonality and LPC, which achieves a classification accuracy of 76.6%. This result is comparable to human performance in classifying music into genres.

For future research, several things should be considered. These include using a wider variety of genres and developing an application for real-time classification, i.e., detecting the genre while a music piece is played on the radio, TV, an mp3 player, or another electronic device. Another important direction is labeling music with multiple genres, since these days many music pieces are composed from two or more genres.

Acknowledgement
The authors would like to thank the Gunadarma Foundations for financial support.

References
[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 293-302, July 2002.
[2] G. Tzanetakis, G. Essl, and P. Cook, "Automatic musical genre classification of audio signals," in Proc. ISMIR, 2001.
[3] M. F. McKinney and J. Breebaart, Features for Audio and Music Classification, Philips Research Laboratories, Eindhoven, The Netherlands, 2003.
[4] D. McEnnis, C. McKay, I. Fujinaga, and P. Depalle, "jAudio: A feature extraction library," in Proc. ISMIR, 2005, Queen Mary, University of London.
[5] T. Li and G. Tzanetakis, "Factors in automatic musical genre classification of audio signals," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, 19-22 Oct. 2003.
[6] E. G. Gutierrez, "Tonal description of music audio signals," Ph.D. dissertation, Universitat Pompeu Fabra, 2006.
[7] D. Jang, Genre Classification Using Novel Features and Weighted Voting Method, Div. of EE, School of EECS, KAIST, Korea, 2008.