Indonesian Journal of Electrical Engineering and Computer Science
Vol. 5, No. 2, February 2017, pp. 410-415
DOI: 10.11591/ijeecs.v5.i2.pp410-415

An Empirical Comparison of Latest Data Clustering Algorithms with State-of-the-Art

Xianjin Shi, Wanwan Wang, and Chongsheng Zhang*
Henan University, No. 1 JinMing Street, 475001 KaiFeng, China, Tel: +86 13837150021
*Corresponding author, e-mail: chongsheng.zhang@yahoo.com

Abstract
Clustering technology has been applied in numerous applications. It can enhance the performance of information retrieval systems; it can also group Internet users to help improve the click-through rate of on-line advertising. Over the past few decades, a great many data clustering algorithms have been developed, including K-Means, DBSCAN, Bi-Clustering and Spectral clustering. In recent years, two new data clustering algorithms have been proposed: affinity propagation (AP, 2007) and density peak based clustering (DP, 2014). In this work, we empirically compare the performance of these two latest data clustering algorithms with the state-of-the-art, using 6 external and 2 internal clustering validation metrics. Our experimental results on 16 public datasets show that the two latest clustering algorithms, AP and DP, do not always outperform DBSCAN. Therefore, to find the best clustering algorithm for a specific dataset, all of AP, DP and DBSCAN should be considered. Moreover, we find that the comparison of different clustering algorithms is closely related to the clustering evaluation metrics adopted. For instance, when using the Silhouette clustering validation metric, the overall performance of K-Means is as good as that of AP and DP. This work has important reference value for researchers and engineers who need to select appropriate clustering algorithms for their specific applications.
Keywords: Affinity Propagation, Density peak based clustering, Clustering Evaluation

Copyright © 2017 Institute of Advanced Engineering and Science. All rights reserved.

1. Introduction
Clustering or cluster analysis is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters) [1]. Clustering has been widely used in many applications, such as discovering customer groups based on their purchase behaviours to design targeted advertisements, identifying co-regulated genes to provide a genetic fingerprint for various diseases, and differentiating between different types of tissue and blood in medical images.
Over the past few decades, considerable research effort has been put into the development of new data clustering algorithms [2,3,4,5]. Among them, K-Means, DBSCAN [6], Bi-Clustering [7] and Spectral clustering [8] are the best-known ones. K-Means is by far the most popular clustering tool used in scientific and industrial applications. Starting with random centroids, K-Means iteratively assigns each data point to its nearest centroid, then recomputes each centroid as the mean of the data points assigned to it, and repeats these two steps until the assignments stabilize. DBSCAN [6] is a classic density-based clustering algorithm that can discover clusters of arbitrary shape. Bi-Clustering and Co-Clustering [7] allow overlap between clusters; this class of algorithms has been widely used in bioinformatics. Spectral clustering [8] techniques perform dimensionality reduction before clustering, by utilizing the eigenvalues of the similarity matrix of the data.
In recent years, two novel and famous data clustering algorithms have been proposed.
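Before turning to the two new algorithms, the classic K-Means loop described above can be sketched in a few lines. This is a minimal, self-contained illustration of the assign-then-update idea only, not the implementation compared in this paper (the experiments use the Scikit-learn implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: start from random centroids, then
    alternate point assignment and centroid recomputation."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each data point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids
```

On well-separated data this sketch converges in a handful of iterations; production implementations add smarter initialization (e.g. k-means++) and multiple restarts.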
The first clustering algorithm is affinity propagation (hereafter referred to as AP) [9], which was published in Science in 2007. Its highlight is that it does not require users to specify the number of clusters. AP alternates between two message-passing steps: one reflects how well a data point would serve as the cluster center for another data point; the other takes into account other data points' preference for a given data point to be a cluster center.

Received November 16, 2016; Revised January 6, 2017; Accepted January 19, 2017

IJEECS, ISSN: 2502-4752

The second (and the latest) data clustering algorithm, published in Science in 2014, is the density peak based clustering algorithm (hereafter referred to as DP) [10]. It is based on the idea that cluster centers are characterized by a higher density than their neighbours and by a relatively large distance from points with higher densities [10]. For each data point, DP computes its local density and its distance to points of higher density. The data points ranked highest on both quantities are selected as cluster centers.
Overwhelmed with so many clustering algorithms, especially with the arrival of the two new ones, a question naturally arises: which is the best data clustering algorithm? Can the two latest data clustering algorithms, AP and DP, truly outperform the state-of-the-art?
Problem Statement. In this work, we empirically compare the performance of the two latest data clustering algorithms, DP and AP, with other well-established clustering algorithms, in particular K-Means, DBSCAN, Bi-Clustering, Co-Clustering and Spectral clustering.
Contributions. This work reveals that the two latest data clustering algorithms, AP and DP, do not always outperform classic clustering algorithms such as DBSCAN. Hence, to find the best clustering algorithm for a specific application, all of AP, DP and DBSCAN should be tested. Furthermore, we find that the comparison of different clustering algorithms is closely related to the clustering validation metrics adopted. Thus, before selecting the best clustering algorithm for a given dataset/application, it is necessary to pick the clustering evaluation metric in advance.
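The two per-point quantities that DP computes, described above, can be illustrated with a short sketch. This is a simplified, hypothetical illustration (the function name `dp_scores`, the cutoff choice `d_c`, and the toy data are ours), not the authors' reference implementation [13]; real implementations additionally handle density ties and choose the cutoff distance more carefully:

```python
import numpy as np

def dp_scores(X, d_c):
    """For each point, compute DP's two quantities: local density rho
    (number of other points within cutoff distance d_c) and delta
    (distance to the nearest point of strictly higher density)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = (D < d_c).sum(axis=1) - 1        # subtract 1 for the point itself
    delta = np.empty(len(X))
    for i in range(len(X)):
        denser = np.where(rho > rho[i])[0]
        # Convention from the paper: the highest-density point gets
        # the maximum distance, so it always ranks high on delta too.
        delta[i] = D[i, denser].min() if len(denser) else D[i].max()
    return rho, delta

# Points ranked high on BOTH rho and delta (e.g. by the product
# gamma = rho * delta) are taken as cluster centers.
```

On a dense cluster plus an outlier, only the density peak of the cluster scores high on both quantities, which is exactly the center-selection rule DP relies on.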
The remainder of the paper is organized as follows. In Section 2, we briefly review common clustering evaluation metrics. Next, in Section 3, we describe the setup for the experiments. We analyse the experimental results in Section 4 and conclude the paper in Section 5.

2. Clustering Evaluation Metrics
There are two types of measures to evaluate the results of clustering algorithms (i.e., the quality of the clusters): internal and external validation metrics [11]. The basic idea of the internal validation measures is to check whether the intra-cluster similarities (the similarities between the data points inside the same cluster) are high while, at the same time, the inter-cluster similarities (the similarities between data points from different clusters) are low. For instance, an intuitive internal validation measure is simply the ratio of the intra-cluster similarities to the inter-cluster similarities. In this work, we use two very well-known internal clustering validation metrics, Dunn and Silhouette [11,12].
The external validation metrics calculate, for each cluster, the distribution of the true class labels over all the data points in that cluster. Therefore, this type of clustering evaluation metric requires each data point to have a class label. If all or the majority of the data points in a cluster share the same class label, the clustering is very successful, and the corresponding score, in terms of an external clustering validation metric, will be high. In this paper, we utilize 6 external clustering validation metrics: Purity, Homogeneity, Completeness, V-measure, Adjusted Rand and Mutual info score (Mutual information) [11,12].

3. Experimental Setup
3.1. Clustering Algorithms to be Compared
In this work, we compare the performance of the two latest data clustering algorithms (i.e., AP and DP) with 5 classic clustering algorithms: K-Means, DBSCAN, Bi-Clustering, Co-Clustering and Spectral clustering. We adopt the implementations from Scikit-learn¹ for AP and these 5 classic algorithms, while the implementation of DP was obtained from its official website [13].

¹ Scikit-learn is a well-known open source machine learning library. http://scikit-learn.org
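As a hedged sketch of how such a comparison can be wired up with Scikit-learn: the class names below (KMeans, DBSCAN, AffinityPropagation) are our assumptions for a recent Scikit-learn version, the make_blobs data is a toy stand-in for the paper's datasets, and since Purity has no built-in implementation in Scikit-learn, a small helper is included:

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import DBSCAN, KMeans, AffinityPropagation
from sklearn.datasets import make_blobs

def purity(y_true, y_pred):
    """Purity: sum over clusters of the majority true-label count,
    divided by the total number of points."""
    total = 0
    for c in np.unique(y_pred):           # DBSCAN's noise label -1 included
        members = y_true[y_pred == c]
        total += np.bincount(members).max()
    return total / len(y_true)

# Toy stand-in for one of the labelled datasets in Table 1.
X, y_true = make_blobs(n_samples=200, centers=3, random_state=0)

algorithms = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "DBSCAN":  DBSCAN(eps=0.9, min_samples=13),
    "AP":      AffinityPropagation(random_state=0),
}
for name, alg in algorithms.items():
    y_pred = alg.fit_predict(X)
    scores = {
        "Purity":        purity(y_true, y_pred),
        "Homogeneity":   metrics.homogeneity_score(y_true, y_pred),
        "Completeness":  metrics.completeness_score(y_true, y_pred),
        "V-measure":     metrics.v_measure_score(y_true, y_pred),
        "Adjusted Rand": metrics.adjusted_rand_score(y_true, y_pred),
        "Mutual info":   metrics.mutual_info_score(y_true, y_pred),
    }
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

These are the same six external metrics used in Table 2; the internal metrics would be computed on the unlabelled datasets instead (Silhouette is available as `metrics.silhouette_score`; Dunn is not built into Scikit-learn).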
Table 1. Summary of datasets

Dataset      Instances  Attributes  Class label?  Source
Aggregation  788        7           Yes           [14]
Flame        240        2           Yes           [14]
Compound     399        6           Yes           [14]
Spiral       312        3           Yes           [14]
Pathbased    300        3           Yes           [14]
R15          600        15          Yes           [14]
D31          3100       31          Yes           [14]
Jain         373        2           Yes           [14]
Breast       699        2           No            [14]
Thyroid      215        5           No            [14]
Yeast        1484       8           No            [14]
Wine         178        13          No            [14]
Dim4         2701       4           No            [14]
Dim8         5401       8           No            [14]
Dim32        1009       32          No            [14]
Dim64        1024       64          No            [14]

3.2. Datasets
We use 16 datasets to validate the quality of the different clustering algorithms. These datasets can be divided into two categories: 1) 8 datasets commonly used for clustering [14], including Aggregation, Flame, Compound, Spiral, Pathbased, R15, D31, and Jain; all of these datasets contain class labels (ground truth) for the data points. 2) 8 datasets that do not contain class labels, including Dim4, Dim8, Dim32, Dim64, Breast, Thyroid, Yeast, and Wine [14]. Table 1 gives a summary of these datasets.

3.3. Parameter Settings
For K-Means clustering, we use the default value for the number of clusters, which is 8; we also try other cluster numbers such as 2 and 3. DP needs the initial number of clusters (or an initial cluster) to be specified, in order to automatically search for a good radius value; we try three different values: 2, 3, and 6. DBSCAN has two parameters: eps, the maximum radius of the neighbourhood around a point, and minPts, the minimum number of data points within this distance. In our experiments, we try different eps values, such as 0.2, 0.4, 0.9, 1.0 and 3.0, and different values for minPts, such as 13. For all the clustering algorithms whose parameters need tuning, we manually choose the set of parameters that achieves the best clustering quality.
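The manual parameter search described above can also be automated with a small grid sweep. The following is only a hypothetical illustration of that tuning step, scoring each (eps, minPts) pair with the Silhouette metric on a synthetic dataset; the paper instead picked the best parameters by hand for each dataset:

```python
from itertools import product

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in dataset; in the paper, tuning is done per real dataset.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

best_params, best_score = None, -1.0
for eps, min_pts in product([0.2, 0.4, 0.9, 1.0, 3.0], [5, 13]):
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    clusters = set(labels) - {-1}          # label -1 marks noise points
    if len(clusters) < 2:                  # Silhouette needs >= 2 clusters
        continue
    mask = labels != -1                    # score non-noise points only
    score = silhouette_score(X[mask], labels[mask])
    if score > best_score:
        best_params, best_score = (eps, min_pts), score

print("best (eps, minPts):", best_params, "Silhouette:", round(best_score, 3))
```

Note that selecting parameters with an internal metric such as Silhouette can itself bias the comparison, which is one reason the choice of evaluation metric matters (Section 4).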
4. Experimental Results
In Table 2, we depict the experimental results of the 7 clustering algorithms on the 8 datasets that contain class label information, using 6 external clustering validation metrics. In Table 3, we present the clustering results on the other 8 datasets, which have no class labels, evaluated in terms of 2 internal clustering validation metrics.

4.1. The Performance of Co-Clustering, Bi-Clustering and Spectral Clustering
We observe from Table 2 and Table 3 that the Co-Clustering algorithm never shows outstanding performance, while Bi-Clustering obtains a leading clustering result on only one dataset, Thyroid. Hence, both of them can be neglected when selecting the best clustering algorithm for a specific dataset.
Spectral clustering seldom achieves excellent clustering results, except on the Aggregation dataset using the Mutual info score metric, as well as on the Jain and Pathbased datasets. However, it never shows the best performance according to the two internal metrics, as can be observed in Table 4.
Table 2. Evaluation of algorithms using external validation metrics.

Dataset      Algorithm  Purity  Homogeneity  Completeness  V-measure  Adjusted Rand  Mutual info
Aggregation  AP         0.996   0.990        0.599         0.746      0.377          0.589
             DP         0.511   0.253        0.909         0.396      0.202          0.251
             DBSCAN     0.827   0.718        1.000         0.836      0.734          0.716
             Spectral   0.834   0.846        0.786         0.815      0.549          0.783
             Cocluster  0.346   0.010        0.011         0.010      0.001          0.001
             Bicluster  0.506   0.340        0.902         0.494      0.348          0.339
             K-Means    0.911   0.909        0.765         0.830      0.668          0.761
Compound     AP         0.910   0.855        0.563         0.679      0.389          0.548
             DP         0.627   0.416        1.000         0.588      0.437          0.414
             DBSCAN     0.935   0.725        0.920         0.811      0.784          0.721
             Spectral   0.875   0.857        0.667         0.750      0.481          0.659
             Cocluster  0.396   0.012        0.013         0.012      -0.004         -0.005
             Bicluster  0.627   0.416        1.000         0.588      0.437          0.414
             K-Means    0.875   0.769        0.596         0.671      0.453          0.586
Flame        AP         0.833   0.932        0.244         0.387      0.134          0.236
             DP         0.988   1.000        0.420         0.591      0.410          0.416
             DBSCAN     0.742   0.741        0.396         0.516      0.456          0.393
             Spectral   0.971   0.873        0.279         0.423      0.202          0.274
             Cocluster  0.654   0.024        0.010         0.015      0.004          0.005
             Bicluster  0.838   0.410        0.388         0.399      0.453          0.386
             K-Means    0.983   0.910        0.289         0.438      0.206          0.284
Jain         AP         1.000   1.000        0.238         0.385      0.120          0.233
             DP         1.000   1.000        0.812         0.897      0.954          0.812
             DBSCAN     0.997   0.247        0.375         0.298      0.337          0.243
             Spectral   1.000   1.000        0.285         0.444      0.185          0.282
             Cocluster  0.740   0.018        0.007         0.010      0.006          0.003
             Bicluster  0.786   0.407        0.337         0.369      0.324          0.336
             K-Means    0.987   0.924        0.261         0.406      0.166          0.257
Pathbased    AP         0.963   0.916        0.380         0.537      0.273          0.367
             DP         0.633   0.402        0.636         0.493      0.401          0.400
             DBSCAN     0.927   0.340        0.620         0.439      0.325          0.338
             Spectral   0.877   0.781        0.430         0.555      0.348          0.423
             Cocluster  0.387   0.006        0.004         0.005      -0.004         -0.005
             Bicluster  0.633   0.401        0.634         0.491      0.399          0.399
             K-Means    0.847   0.710        0.406         0.517      0.391          0.398
R15          AP         0.997   0.994        0.994         0.994      0.993          0.994
             DP         0.667   0.773        0.984         0.886      0.579          0.763
             DBSCAN     0.533   0.590        1.000         0.743      0.264          0.576
             Spectral   0.530   0.704        0.988         0.822      0.517          0.694
             Cocluster  0.108   0.021        0.039         0.027      0.001          0.003
             Bicluster  0.133   0.244        0.980         0.391      0.120          0.241
             K-Means    0.533   0.590        1.000         0.743      0.264          0.576
Spiral       AP         0.888   0.770        0.288         0.419      0.144          0.272
             DP         1.000   1.000        1.000         1.000      1.000          1.000
             DBSCAN     0.772   0.394        0.393         0.393      0.142          0.387
             Spectral   0.808   0.684        0.373         0.483      0.258          0.366
             Cocluster  0.397   0.013        0.009         0.010      0.001          0.001
             Bicluster  0.353   0.001        0.001         0.001      -0.003         -0.002
             K-Means    0.487   0.182        0.098         0.127      0.048          0.088
D31          AP         0.975   0.966        0.966         0.966      0.950          0.964
             DP         0.161   0.360        0.958         0.524      0.107          0.357
             DBSCAN     0.065   0.042        0.976         0.081      0.004          0.040
             Spectral   0.258   0.575        0.988         0.727      0.328          0.571
             Cocluster  0.045   0.006        0.0013        0.008      -0.000         -0.000
             Bicluster  0.065   0.188        0.933         0.313      0.060          0.187
             K-Means    0.258   0.566        0.944         0.708      0.336          0.562

4.2. The Performance of AP, DP, DBSCAN and K-Means
We can see from Table 2 and Table 4 that, on the external validation metrics, AP, DP and DBSCAN show very good clustering results. Moreover, DP and DBSCAN achieve the best clustering results on the Dunn internal metric. Surprisingly, on the Silhouette internal metric, the overall performance of K-Means is as good as that of DP and AP.
We now compare the two density-based clustering algorithms, DP vs DBSCAN. On two datasets, Aggregation and Compound, DBSCAN outperforms DP, while on the remaining 6 datasets DP outperforms DBSCAN on almost all the metrics (with few exceptions), as can be seen in Table 4.

4.3. Efficiency Comparison of Different Clustering Algorithms
We have also checked the efficiency of the different clustering algorithms. We find that AP is very time-consuming, especially when the number of data points is large, say, more than 3000. DP is faster than AP, but slower than Co-Clustering, Bi-Clustering, and K-Means in general. Figure 1 depicts the running time of these algorithms on the Aggregation, Yeast, and Dim8 datasets.
It should be mentioned that on the Yeast and Dim8 datasets, AP was again found to be the slowest algorithm, at least 3 times slower than DP, so we removed it from those plots for clarity.
Table 3. Evaluation of algorithms using internal validation metrics.

Dataset  Metric      AP     DP      DBSCAN  Spectral  Cocluster  Bicluster  K-Means
Breast   Dunn        0.000  0.408   0.093   0.000     0.000      0.041      0.051
         Silhouette  0.182  -0.298  0.631   -0.057    0.067      0.122      0.722
Thyroid  Dunn        0.067  0.019   0.050   0.017     0.008      0.091      0.047
         Silhouette  0.230  -0.022  0.651   0.194     0.354      0.577      0.462
Yeast    Dunn        0.054  0.087   0.019   0.000     0.000      0.025      0.026
         Silhouette  0.139  -0.042  -0.282  0.204     0.323      0.191      0.356
Wine     Dunn        0.178  0.256   0.265   0.190     0.109      0.256      0.147
         Silhouette  0.115  0.279   0.310   0.118     0.304      0.280      0.473
Dim4     Dunn        0.561  0.561   4.904   0.003     0.001      0.463      0.601
         Silhouette  0.951  0.951   0.950   0.480     0.871      0.511      0.912
Dim8     Dunn        0.068  0.453   4.171   0.292     0.003      0.745      5.085
         Silhouette  0.837  0.932   0.995   0.583     0.604      0.311      0.851
Dim32    Dunn        4.035  4.035   0.016   0.003     0.000      0.645      0.771
         Silhouette  0.945  0.945   0.050   -0.212    0.389      0.191      0.523
Dim64    Dunn        5.820  5.820   0.012   0.004     0.001      0.777      0.764
         Silhouette  0.966  0.966   0.077   -0.194    0.355      0.154      0.511

Table 4. Total number of datasets where a clustering algorithm ranked first.

Metric          AP  DP  DBSCAN  K-Means  Spectral
Dunn            2   4   2       1        0
Silhouette      3   3   2       3        0
Purity          5   2   1       0        1
Homogeneity     5   3   0       0        2
Completeness    0   5   2       0        1
V-measure       2   3   2       0        1
Adjusted Rand   2   3   3       0        0
Mutual info     2   3   1       0        1

4.4. Summary of the Results
In summary, we draw the following conclusions from the above analyses:
1. Although AP and DP are the two latest (and very popular) clustering algorithms, they do not always outperform DBSCAN.
2. AP, DP, and DBSCAN, taken together as a group, show much better performance than the other clustering algorithms. Therefore, AP, DP and DBSCAN should be the major candidates for clustering tasks in real-world applications.
3. Co-Clustering, K-Means, and Bi-Clustering are generally the most efficient clustering algorithms.
AP is the slowest one, whereas DP is more than 3 times faster than AP.
4. The comparison of different clustering algorithms depends on the evaluation metrics adopted.

5. Conclusions
In this work, we empirically compared the performance of the two latest data clustering algorithms, affinity propagation (AP) and density peak based clustering (DP), with state-of-the-art clustering algorithms, using 6 external and 2 internal clustering validation metrics. Our experimental results demonstrate that the two latest clustering algorithms, AP and DP, do not always outperform DBSCAN. This means that, even with the two latest clustering algorithms, there is no single clustering algorithm that is the best for all datasets. Therefore, to select the best clustering algorithm for a specific dataset, all of AP, DP and DBSCAN should be considered and tested. In addition, we find that the comparison of different clustering algorithms is closely related to the clustering evaluation metrics adopted.
In terms of running-time efficiency, our experiments show that AP is the least efficient clustering algorithm; it consumes at least 3 times more running time than the others, while Co-Clustering, K-Means, and Bi-Clustering are usually the top-3 most efficient algorithms. This work provides a valuable empirical reference for researchers and engineers who need to select the best clustering algorithm for their specific applications.
Figure 1. Running time of different algorithms on the Aggregation (left), Yeast (center), and Dim8 (right) datasets.

References
[1] Cluster analysis. Wikipedia. https://en.wikipedia.org/wiki/Cluster_analysis.
[2] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, Vol. 31, No. 3, 1999, pp. 264-323.
[3] S. Kotsiantis, P. Pintelas. Recent Advances in Clustering: A Brief Survey. WSEAS Transactions on Information Science and Applications, Vol. 1, No. 1, 2004, pp. 73-81.
[4] Rui Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005, pp. 645-678.
[5] P. Berkhin. A survey of clustering data mining techniques. In: J. Kogan, C. Nicholas, M. Teboulle (Eds.), Grouping Multidimensional Data, Springer, Berlin, Heidelberg, 2006, pp. 25-71.
[6] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. 1996.
[7] Mirkin, Boris. Mathematical Classification and Clustering. Kluwer Academic Publishers, 1996.
[8] Sepandar Kamvar, Dan Klein, Christopher Manning. Spectral learning. In IJCAI'03, pp. 561-566, 2003.
[9] Brendan J. Frey, Delbert Dueck. Clustering by Passing Messages Between Data Points. Science, 2007.
[10] Alex Rodriguez, Alessandro Laio. Clustering by fast search and find of density peaks. Science, 2014.
[11] Erendira Rendon, Itzel Abundez, Alejandra Arizmendi, Elvia M. Quiroz. Internal versus External cluster validation indexes. International Journal of Computers and Communications, Issue 1, Volume 5, 2011.
[12] Clustering. Scikit-learn 0.17.1 documentation. http://scikit-learn.org/stable/modules/clustering.html.
[13] http://people.sissa.it/~laio/Research/Res_clustering.php.
[14] Clustering datasets. http://cs.joensuu.fi/sipu/datasets/.
[15] http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets.