Inter national J our nal of Electrical and Computer Engineering (IJECE) V ol. 6, No. 6, December 2016, pp. 3222 3228 ISSN: 2088-8708 3222 A New P aradigm f or De v elopment of Data Imputation A ppr oach f or Missing V alue Estimation Madhu G and Nagachandrika G Dept. of Information T echnology VNR V ignana Jyothi Institute of Engineering and T echnology Hyderabad-90, T elang ana, India Article Inf o Article history: Recei v ed Mar 26, 2016 Re vised No v 8, 2016 Accepted No v 21, 2016 K eyw ord: Centroids Distance Imputation Missing v alue Nearest Neighbour ABSTRA CT Man y re al-w orld applications encountered a common issue in data analysis is the pres- ence of missing data v alue and challenging task in man y applications such as wireless sensor netw orks, medical applications and psychological domain and others. Learning and prediction in the presence of missing v alue can be treacherous in machine learning, data mining and statistical analysis. A missing v alue can signify important information about dataset in the mining process. Handling missing data v alue is a challenging task for the data mining process. In this paper , we propose ne w paradigm for the de v elop- ment of data imputation method for missing data v alue estimation based on centroids and the nearest neighbours. Firstly , identify clusters based on the k-means algorithm and calculate c entroids and the nearest neighbour data records. Secondly , the nearest distances from complete dataset as well as incomplete dataset from the centroids and estimated the nearest data record which tends to be curse dimensionality . Finally , im- pute the missing v alue based nearest neighbour record using stat istical measure called z-score. The e xperimental study demonstrates strengthen of the proposed paradigm for the imputation of the missing data v alue estimation in dataset. T ests ha v e been run us- ing dif ferent types of datasets in order to v alidate our approach and compare the results with other imputation methods such as KNNI, SVM I, WKNNI, KMI and FKNNI. The proposed approach is geared to w ards maximizing the utility of imputation with respect to missing data v alue estimation. Copyright c 2016 Institute of Advanced Engineering and Science . All rights r eserved. Corresponding A uthor: Madhu.G Department of Information T echnology VNR V ignana Jyothi Institute of Engineering and T echnology T elang ana, Hyderabad-90, INDIA. E-mail: madhu g@vnrvjiet.in 1. INTR ODUCTION Man y real-w orld applications encountered a common issue in data analysis is the presence of miss- ing or incomplet e data v alues. Ho we v er , data mining applications are associated with industry applications, wireless sensor netw orks, medical applications, psychological applications and others. Modern computational techniques of data cleaning require complete dataset.These databases are highly susceptible to missing and inconsistent data due to their huge amount of data sizes [1-2]. Missing v alue has dif ferent causes lik e the data v alue might not be recorded, equipment malfunctions, improper measurements and den y answers to certain questions [3-4]. Missing data v alue in the training dataset can reduce the performance of the model or that can lead biased results to the model and it leads to misleading inferences in data analysts. Recently , man y researchers ha v e been proposed se v eral methods to treat m issing v alue problems for real-w orld applications [5-10]. Generally , missing v alue treatment methods classified into : i) ignoring and discarding data v alue or case/pairwise deletion [11]; ii) parameter estimation/Expectation-Maximization algorithm [12]; iii) Imputa- J ournal Homepage: http://iaesjournal.com/online/inde x.php/IJECE DOI:  10.11591/ijece.v6i6.10632 Evaluation Warning : The document was created with Spire.PDF for Python.
IJECE ISSN: 2088-8708 3223 tion, represents the process of filling the missing data v alues in datasets by some plausible v alues based on information a v ailable in the dataset [13]. First, tw o methods are suf fering with se v eral limitations such as: i) high sensiti vity to the outliers and slo w computations; ii) Imputation methods a re suf fering with ho w to select the conditioning v ariables and bias those results from a bad choice. T o address these issues, we de v elop a ne w paradigm for data imputation of the missing data v alue estimation using cluster cent roids and the nearest neigh- bours. Also, our approach uses the centroids and the nearest neighbours for the f amily of k-means clustering algorithm. This study demonstrates that the proposed imputation method can significantly estimate the missing v alue in datasets. The rest of the paper is or g anized as follo ws: related w ork presented in section-2, while in sect ion- 3, proposed a ne w paradigm for data imputation method for the missing data v alue estimation st ep by step approach. In section-4 presents the results and analysis. Conclusions are deferred to section-5. 2. RELA TED W ORKS This section represents mechanisms of dif ferent data imputation methods and treatments for the pur - pose of data cleaning in the data mining process. Description of the data imputation mechanisms and treatments are as follo ws: 2.1. Mechanism of missing data imputation Little, R.J., Rubin, D.B [14] suggested three types of mechanisms for imputation of missi ng data such as: i) Missing Completely at Random (MCAR): Suppose fe w data records are missing on X, then these data records are said to be MCAR if the probability of X is missing with unrelated X or other v ariable Y then the pr ob ( X ismissing j Y ; X ) = pr ob ( X ismissing ) : ii) Missing at random (MAR): Suppose fe w data records X are m issing at random, i f the probability of that X is missing does not depend on the v alue X, after controlling the o bs erv ed data b ut not on missing data then the pr ob ( X ismissing j Y ; X ) = pr ob ( X ismissing j X ); iii) Not missing at random (NMAR): ): In this approach assumption is violated and sharing of a sample ha ving a missing data v alue for an attrib ute depends on the missing v alues. 2.2. Missing data tr eatment methods In literature v arious approaches ha v e been suggested to treat missing v alue problems. Little and Rubin [14] classified these approaches into three cate gories: i) Ignoring and deleting data records: this approach to discard data with the missing v alue into a complete case analysis and delete instances and/or attrib utes. These methods are applicable only missing data are MCAR otherwise it can lead bias results. ii) P arameter estimation approach: this approach is based on estimation of the parameters in gi v en model then the presence of missing v alue based on Expectation-Maximization algorithm [12]. iii) Imputation Methods: to impute the missing v alue with probable v alue based on information obtainable in the dataset. The main idea of this approach is to emplo y kno wn v alue that can be identified in the v alid data v alues of the gi v en dataset to assist in assessing the missing v alues. T o imput e the missing v alue based dif ferent type of methods such as single imputation and multiple imputation methods [15]. But multiple imputation approaches are computationally more e xpensi v e than the single imputation methods [15]. Ho we v er , the y accommodate for sample v ariability of the estim ated v alue and uncertainty associated with a model used for computation [16]. Also, these tw o methods can be classified into three cate gories such as i) data dri v en approach; ii) model based approach; and iii) machine learning approach [14-17]. The abo v e discussion leads us to propose this research w ork presents a no v el frame w ork for data imputation which addresses the problem of missing data v alues. 3. A NEW P ARADIGM FOR D A T A IMPUT A TION: PR OPOSED APPR O A CH This section, presents a ne w paradigm for data imputation to address the problem of missing v alue which is based on tw o distance measures that are used to define a ne w feature between its cluster center and the nearest neighbour respecti v ely as sho wn in Fig.1. Let M be the gi v en dataset containing of ro ws and columns and each ro w is represented by a ro w v ector ( R 1 ; R 2 ; :R m ) and each column is represented by a column v ector ( C 1 ; C 2 ; ::C n ) and the dataset M is represented as m x n matrix: A Ne w P ar adigm for De velopment of Data Imputation Appr oac h for Missing V alue Estimation (Madhu G) Evaluation Warning : The document was created with Spire.PDF for Python.
3224 ISSN: 2088-8708 Figure 1. A Ne w P aradigm for Data Imputation (NPDI) M mn = 0 B B B @ a 11 a 12 a 1 n a 21 a 22 a 2 n . . . . . . . . . . . . a m 1 a m 2 a mn 1 C C C A Here matrix M represents the samples of 0 m 0 observ ations and 0 n 0 v ariables and each cell is denoted as ordered n-tuples of data records such as ( a i 1 ; a i 2 ; ; a i ( n 1) ; a in ) for each i = 1 ; 2 ; 3 ; m where the last column data attrib ute ( a in ) for each i characterizes the tar get class or decision data attrib ute of the dataset M. Clearly , we say that all datasets are finite set. An object 0 a 0 i is kno wn as the complete dataset, if f a ij 6 = j ; 8 1 j R g and the object 0 a 0 i is called the incomplete dataset f a ij = j ; 9 1 j R g . No w split the gi v en dataset into tw o forms that is: i) first form is a complete dataset which contains records without missing v alues. ii) Second form is an incomplete dataset which contains missing records with one or more attrib utes which is kno wn as the missing v alue dataset. Then consi der complete dataset as the training set and missing data set as the testing set. This w ork is moti v ated by researchers W ei-Chao Lin [18], Congnan Luo [19]. The k-means clustering algorithm [20-21] is one of simplest and common clustering algorithms used to classify the number of clusters based on k-v alue. The v alue k is the cluster centroid of each cluster . This study step-1 we use a k-means clustering algorithm to form clusters based its decision classes (here k=2) and to e xtract the cluster centroids using the centroid v ector c k is defined as follo ws: c k = 1 j N j N X m =1 a im (1) Then, calculate its nearest neighbours based on a Euclidian distance measure on without missing v al ues of dataset. The nearest neighbour function defined as follo ws: Let S is a finite set of elements and a and b belongs to S. B is said to be a the nearest neighbour of A if B is closest to A am o ng all the points in S A . Then B is said to be the nearest neighbour of A if and only if 4. EXPERIMENTS AND RESUL TS In order to e v aluate NPDI model, we carried out e xperimentions on the benchmark datasets from Kno wledge Extraction based on Ev olutionary Learning (KEEL) repository [22], includes Bands, Cle v eland, IJECE V ol. 6, No. 6, December 2016: 3222 3228 Evaluation Warning : The document was created with Spire.PDF for Python.
IJECE ISSN: 2088-8708 3225 Hepatitis, Horse-Colic, Iris, Magic, Pima, W ine. The completely summary of datasets are used in the e xperi- ments are sho wn in T able.1. T able 1. Summary of Datasets used in e xperiments [22]. Datasets Instances Classes MVs Databases(KB) Bands 539 2 32.28 608 Cle v eland 303 5 1.98 233 Hepatitis 155 2 48.39 129 Horse-Colic 368 2 98.1 444 Magic 1902 2 58.20 1423 Pima 768 2 50.65 440 W ine 178 3 70.22 136 First, we di vided the gi v en dataset into 10 folds of the equal size based on the stratified cross v al- idation test and each step of the e xperiments 1-fold is used for the test dataset and remaining 9- folds are used for the training dataset.Also, this study presents t he results obtained from the e xperimental e v aluation of our proposed approach discuss ed in Section .3 and we took results for the comparison with other impu- tation methods such as Imputation with K-Nearest Neighbour (KNNI), Support V ector Machine Imputation (SVMI), W eighted Imputation with K-Nearest Neighbour (WKNNI), K-Means clustering Imputation (KMI) and Fuzzy-Means clustering Imputation (FKNNI) on a popular decision tree C4.5 classifier [23]. T able 2. T est classifier accurac y using C4.5 deceson tree Methods NPDI KNNI SVMI W -KNNI KMI F-KNNI Bands 71.15 70.32 69.18 69.57 70.11 68.25 Cle v eland 57.32 56.09 55.41 56.09 55.43 56.09 Hepatitis 89.95 84.10 84.82 85.09 83.20 83.47 Horse-Colic 90.15 83.10 83.67 83.40 82.04 83.94 Magic 82.92 79.96 78.23 79.75 79.86 80.07 Pima 75.56 71.09 73.32 73.69 72.78 72.91 W ine 88.55 87.64 86.56 88.75 86.54 87.45 In T able 2 sho ws that e v aluation results, from these results we clearly state that the proposed no v el frame w ork for data Imputation is higher than KNNI, SVMI, WKNNI, KMI and FKNNI. Then classifier accu- rac y is impro v ed significantly compared with other imputation methods such as KNNI, SVMI, WKNNI, KMI and FKNNI on benchmark datasets using C4.5 decision tree, which pro v ed the v alidity of no v el frame w ork for data Imputation. The predictions of classifier for the proposed approach is compared with other imputation algorithms are sho wn separately in Figure 2-6. From the bar graphs, we can clearly state that the proposed NPDI has attai ned the best accurac y among all other se v en benchmark datasets. Then we compared with other imputation methods such as KNNI, SVMI, WKNNI, KMI and FKNNI based on C4.5 decision tree classifie rs ha v e scored approximately 6.75 percentage impro v ement in terms of accurac y . . 5. CONCLUSION This paper proposed a ne w paradigm for data imputation called NPDI for estimating the missing v alue in datasets. Firstly , split the gi v en dataset into training and testing datasets that is complete dataset and incom- plete dataset. Secondly , applied k-means algorithm on training dataset to genera te clusters and its centroids, then calculated the nearest neighbours distance between centroids and other complete data records and miss- ing data records. Finally , applied a popular statistical measure called z-score on mapping distances and then imputed plausible v alue for imputation. Further , we applied C4.5 decision tree classifier for test classifier accu- rac y . The classifier accurac y is impro v ed significantly compared with other imputation methods such as KNNI, SVMI, WKNNI, KMI and FKNNI on benchmark datasets using C4.5 decision tree, which pro v ed the v alidity of the no v el frame w ork for data Imputation. Approximately 6.75 percentage impro v ement in terms of classifier A Ne w P ar adigm for De velopment of Data Imputation Appr oac h for Missing V alue Estimation (Madhu G) Evaluation Warning : The document was created with Spire.PDF for Python.
3226 ISSN: 2088-8708 Figure 2. NPDI vs KNNI Figure 3. NPDI vs SVMI accurac y with compared to KNNI algorithm, WKNNI algorithm for Hepatitis dataset and KMI algorithm for Horse-Colic dataset. Major strength of this approach is not lost the an y attrib ute information during the dimen- sionality reduction process. Limitation of this research w ork is before imputations need suitable normalization technique to smooth dataset. REFERENCES [1] Jia wei Han, Micheline Kamber . ”Data Mining: Concepts and T echniques”, 2nd edition. Mor g an Kaufmann, San Francisco, CA, 2006. [2] Mehmed Kantardzic et al., Data Mining: Concepts, Models, Methods, and Algorithms, 2nd edition, IEEE Press, 2011. [3] G.Madhu et al., A No v el Inde x Measure Mimputation Algorithm for Missing Data V alues: A Machine Learning Approach, IEEE International Conference on Computational Intelligence and Computing Re- search, pp.1-7, 2012. [4] Dan Li et al., T o w ards Missing Data Imputation: A Study of Fuzzy k-Means Clustering Method, RSCTC 2004, LN AI 3066, pp. 573579, 2004. [5] Rahman, M.M and Da vis, D.N. ”Machine Learning-based Missing V alue Imputation Methods for Clinical datasets”, IAENG T ransactions on Engineering T echnologies, Springer Netherlands, pp.245-257, 2013. [6] Rupam Deb et al., ”Missing V alue Imputation for the Analysis of Incomplete T raf fic Accident Data”, Information Sciences, v ol.339, pp.274-289, 2016. [7] Bashir .F , et al., ”P arametric and Non-P arametric Methods to Enhance Prediction Performance in the Pres- ence of Missing Data”, 19th International conference on system theory , control and computing (ICSTCC), pp.337-342, 2015. IJECE V ol. 6, No. 6, December 2016: 3222 3228 Evaluation Warning : The document was created with Spire.PDF for Python.
IJECE ISSN: 2088-8708 3227 Figure 4. NPDI vs WKKNI Figure 5. NPDI vs KMI [8] F abio Lobato, et al., ”Multi-Objecti v e Genetic Algorithm for Missing Data Imputation”, pattern recognition letters, v ol.68, part-1, pp.126-131, 2015. [9] G. Madhu, et al., ”A Non-P arametric Discretization Based Imputation Algorithm for Continuous Attrib utes with Missing Data V alues”, International Journal of Information Processing, v ol.8, no.1, pp.64- 72, 2014. [10] Chandan Gautam and V adlamani Ra vi, ”Data Imputation via Ev olutionary Computation, Clustering and a Neural Netw ork”, Neurocomput, 156, 134-142, 2015. [11] Gary , K., et al., ”Listwise Deletion is Evil: What to do about Missing Data in Pol itical Science”, (2000) (a v aible http://GKing.Harv ard.edu) [12] Dempster , A.P ., Laird, N.M., Rubin, D.B., ”Maximum lik elihood from incomplete data via the EM algo- rithm”, J. of Ro yal Statistical Society Series, v ol. 39,pp, 138, 1977. [13] Myrtv eit, I., et al., ”Analyzing data sets with missing data: an empirical e v aluation of imputation methods and lik elihood-based methods”, IEEE T ransactions on Softw are Engineering, v ol.27, pp. 9991013, 2001. [14] Little, R.J., Rubin, D.B., ”Statistical Analysis with Missing Data”, W ile y , Ne w Y ork, 1987. [15] Alireza F arhangf ar et al., ”A No v el Frame w ork for Imputation of Missing v alues in Databases”, IEEE T ransaction on systems, man, and c ybernetics-part-a: systems and humans, v ol.37, no.5, sept 2007. [16] Lakshminarayan K, et al., ”Imputation of Missing data in industrial databases”, Appl, Intell, v ol.11, no.3, pp. 259-275, 1999. [17] H.L.Oh and F .L Scheuren, ”W eighting adjustments for unit nonresponse, incomplete data in sample sur - v e y”, Theory and Bibliographies, v ol.2, pp.143-183, 1983. [18] W ei-Chao Lin, Shih-W en K e, and Chih-F ong Tsai ”CANN. An Intrusi on Detection System based on Combining Cluster Centers and Nearest Neighbors, Kno w .-Based Syst”, v ol. 78, pp. 13-21, 2015.. [19] Congnan Luo, Y anjun Li, and Soon M. Chung ”T e xt document clustering based on neighbors”, Data A Ne w P ar adigm for De velopment of Data Imputation Appr oac h for Missing V alue Estimation (Madhu G) Evaluation Warning : The document was created with Spire.PDF for Python.
3228 ISSN: 2088-8708 Figure 6. NPDI vs FKNNI Kno wl. Eng, v ol.68, issue. 11, pp. 1271-1288, 2009. [20] J.A. Hartig an, ”Clustering Algorithms”, John W ile y and Sons, 1975. [21] A.K. Jain, M.N. Murty , P .J. Flynn, ”Data clustering: a re vie w”, A CM Comput.Surv e y , v ol.31 (3), pp. 264323, 1999. [22] J. Alcal-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. Garca, L. Snchez, F . Herrera. ”KEEL Data-Mining Softw are T ool: Data Set Repository , Inte gration of Algorithms and Experim ental Analysis Frame w ork”, Journal of Multiple-V alued Logic and Soft Computing 17:2-3 (2011) 255-287. [23] J.R. Quinlan, ”C4.5: Programs for Machine Learning”. Mor g an Kaufmann Publishers, Inc., 1993. BIOGRAPHIES OF A UTHORS Madhu.G recei v ed Ph.D De gree in Compter Science and Engineering from Ja w aharlal Nehru T ech- nological Uni v ersity , Hyderabad in 2015. He is currently w orking as Assocaite Professor , Dept of Information T echnology , VNR VJIET , JNTU Hyderabad, T .S, INDIA. He has 27 research publi- cations at International/National Journals and Conferences. His research interest includes Data Mining, Machine Learning and Methematical Modeling and he is also a re vie wer of research pa- pers of v arious Journals lik e JCIT ,JET AI and Journal of Soft Computing. He is a Member of IEEE, IEEE Computational Intelligence Society , Life Member of CSI,Member of IRSS, and Member of IAENG. Recently , machine learning based algorithms on malaria daignosis has been tackled. G.Naga Chandrika recei v ed M.T ech De gree in Compter Science and Information T echnology from Ja w aharlal Nehru T echnological Uni v ersity , Hyderabad. He is c urrently w orking as Assis- tant Professor , Dept of Information T echnology , VNR VJIET , JNTU Hyderabad, T .S, INDIA. Her research interest includes Data Mining, T e xt Mining and Theory of Computation. IJECE V ol. 6, No. 6, December 2016: 3222 3228 Evaluation Warning : The document was created with Spire.PDF for Python.