International Journal of Electrical and Computer Engineering (IJECE)
Vol. 10, No. 6, December 2020, pp. 6361-6369
ISSN: 2088-8708, DOI: 10.11591/ijece.v10i6.pp6361-6369

Text documents clustering using modified multi-verse optimizer

Ammar Kamal Abasi 1, Ahamad Tajudin Khader 2, Mohammed Azmi Al-Betar 3, Syibrah Naim 4, Mohammed A. Awadallah 5, Osama Ahmad Alomari 6
1,2 School of Computer Sciences, Universiti Sains Malaysia (USM), Malaysia
3 Department of Information Technology, Al-Huson University College, Jordan
4 Technology Department, Endicott College of International Studies (ECIS), Woosong University, Korea
5 Department of Computer Science, Al-Aqsa University, Palestine
6 Department of Computer Engineering, Faculty of Engineering and Architecture, Istanbul Gelisim University, Turkey

Article Info
Article history:
Received Mar 29, 2020
Revised May 2, 2020
Accepted May 18, 2020

Keywords:
Multi-verse optimizer
Optimization
Swarm intelligence
Text document clustering

ABSTRACT
In this study, a multi-verse optimizer (MVO) is utilised for the text document clustering (TDC) problem. TDC is treated as a discrete optimization problem, and an objective function based on the Euclidean distance is applied as a similarity measure. TDC is tackled by the division of the documents into clusters; documents belonging to the same cluster are similar, whereas those belonging to different clusters are dissimilar. MVO, which is a recent metaheuristic optimization algorithm established for continuous optimization problems, can intelligently navigate different areas in the search space and search deeply in each area using a particular learning mechanism. The proposed algorithm is called MVOTDC, and it adapts the convergence behaviour of MVO operators to deal with discrete, rather than continuous, optimization problems.
For evaluating MVOTDC, a comprehensive comparative study is conducted on six text document datasets with various numbers of documents and clusters. The quality of the final results is assessed using precision, recall, F-measure, entropy, accuracy, and purity measures. Experimental results reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms. Statistical analysis is also conducted and shows that MVOTDC can produce significant results in comparison with three well-established methods.

Copyright (c) 2020 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Ammar Kamal Abasi,
School of Computer Sciences, Universiti Sains Malaysia (USM),
11800 Pulau Pinang, Malaysia.
Email: ammar_abasi@student.usm.my

1. INTRODUCTION
In the current digital era, massive amounts of online text documents inundate the web every day. Manipulating these text documents is important for improving the query results returned by search engines, unsupervised text organisation systems, text classification, text summarization, knowledge extraction processes, information retrieval services, text mining processes and scientific document clustering [1].
Many approaches have been proposed for organising text documents in an unsupervised manner. Text document clustering (TDC) is an effective and efficient technique used by researchers in this domain [2]; this field of text mining enables the organisation of large amounts of textual data. It can be defined as an unsupervised automatic document clustering technique that utilises the document similarity rule to divide documents into homogeneous clusters. In other words, text documents in the same cluster are similar,

Journal homepage: http://ijece.iaescore.com
whereas those in different clusters are dissimilar [3]. Conventionally, clustering methods can be classified into two main groups: (i) partitional clustering and (ii) hierarchical clustering. K-means and K-medoids are simple and easy-to-use methods that can be tailored to suit large-scale text document datasets. They are iterative clustering-based techniques initiated with predefined numbers of cluster centroids. At each iteration, documents are distributed into clusters according to similarity functions, depending on the distance between each centroid and its closest document. Then, the cluster centroid is iteratively updated according to the documents belonging to the same cluster. This operation is stopped when all the documents are moved into the right cluster by means of stagnated cluster centroids. The main shortcoming of these methods is their convergence behaviour: they move in one direction in a single search space region and do not perform a wider scan of the whole search space. Therefore, they can easily become trapped in local optima due to the unknown shapes of search spaces. Given that K-means is a local search method and TDC is formulated as an optimization problem [4], optimization methods that can escape local optima can be utilized for TDC [5].
The most successful algorithms recently utilized for TDC are metaheuristic-based algorithms [6]. The first type of metaheuristic algorithm is evolutionary algorithms (EAs), which are initiated with a group of provisional individuals called a population. Generation after generation, the population is evolved on the basis of three main operators: recombination for mixing individual features, mutation for diversifying the search and selection for utilising the survival-of-the-fittest principle. The EA is stopped when no further evolution can be achieved.
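The iterative K-means procedure described earlier in this section (assign each document to its nearest centroid, recompute centroids, stop when assignments stagnate) can be sketched as follows. This is a minimal illustration; the toy 2-D "document vectors" and the naive first-k initialisation are hypothetical, not from the paper's implementation:

```python
def kmeans(docs, k, iters=100):
    """Minimal K-means: distribute documents to their nearest centroid,
    recompute centroids, and stop when the assignments stagnate."""
    centroids = [list(doc) for doc in docs[:k]]  # naive init: first k documents
    labels = [-1] * len(docs)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(doc, centroids[c])))
            for doc in docs
        ]
        if new_labels == labels:  # stagnated cluster centroids -> stop
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster's documents.
        for c in range(k):
            members = [doc for doc, lab in zip(docs, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two well-separated groups of hypothetical 2-D "document vectors".
docs = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = kmeans(docs, k=2)
```

Because K-means only refines the assignment it started from, a bad initialisation can leave it stuck in a local optimum, which is exactly the shortcoming the metaheuristic methods below try to address.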
The main shortcoming of EAs is that although they can simultaneously navigate several areas in the search space, they cannot perform a deep search in each area to which they navigate. Consequently, EAs mostly suffer from premature convergence. EAs that have been successfully utilized for TDC include the genetic algorithm (GA) [7] and harmony search [8].
The second type of metaheuristic algorithm is trajectory-based algorithms (TAs); a single solution is used to launch such an algorithm. This solution is improved by repetition using neighbouring-move operators until a local optimal solution, which is in the same search space region, is reached [9]. While TAs can extensively search the initial solution's search region and achieve local optima, they cannot simultaneously navigate numerous search space regions. The main TAs utilized for TDC are K-means and K-medoids. Other TAs used for TDC are self-organizing maps (SOM) [10] and β-hill climbing.
The last type of metaheuristic algorithm is swarm intelligence (SI); an SI algorithm is also initiated with a set of random solutions called a swarm. Iteration after iteration, the solutions in the swarm are reconstructed by attracting them towards the best solutions found so far [11]. SI-based algorithms can easily converge prematurely. Several SI-based TDC methods have been utilized, such as particle swarm optimization (PSO) [12] and the artificial bee colony [13].
The multi-verse optimizer (MVO) algorithm was recently proposed as a stochastic population-based algorithm [14] inspired by multi-verse theory. The big bang theory explains the origin of the universe as a massive explosion; according to this theory, the origin of everything in our universe requires one big bang. Multi-verse theory holds that more than one explosion (big bang) occurred, with each big bang creating a new and independent universe.
This theory is modelled as an optimization algorithm with three concepts: the white hole, black hole and wormhole, for performing exploration, exploitation and local search, respectively. MVO has been utilized for a wide range of optimization problems, such as identifying the optimal parameters of PEMFC stacks [15], unmanned aerial vehicle path planning [16], clustering problems [17], feature selection [18], neural networks [19], and optimising SVM parameters [18].
This paper adapts the MVO algorithm for the TDC problem using the Euclidean distance as a similarity measure. The adaptation includes modifying the convergence behaviour of MVO operators to deal with a discrete, rather than continuous, optimization problem. The main advantage of the proposed method is that it improves the quality of the final outcomes for TDC problems. A comprehensive comparative study is conducted on six text document benchmark datasets that have different numbers of clusters and documents. The quality of the final results is analysed and discussed using accuracy, precision, recall, F-measure, purity and entropy criteria. The findings of the experimental analyses reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms.
The rest of this paper is organised as follows. Section 2 presents the structure of MVO for TDC. Section 3 discusses the experimental results of MVO. Section 4 gives the conclusion of this study and the authors' future work.
2. MULTI-VERSE OPTIMIZER FOR TEXT DOCUMENT CLUSTERING
This section describes the main components utilized for tackling TDC. The term 'components' denotes the adaptation elements that are conducted for solving the TDC problem using the MVO algorithm, and their sequence, including (i) TDC pre-processing, (ii) solution representation and (iii) calculation of the objective function (evaluation of the solutions). Finally, the question of how MVO is adapted to TDC is addressed.

2.1. Text document clustering (TDC)
TDC aims to divide documents into clusters; each cluster has similar documents, whereas documents in different clusters are dissimilar [20, 21]. Before any clustering algorithm is applied to TDC, the text needs some necessary preliminary steps (pre-processing); this step filters unnecessary data, such as special formatting, special characters and numbers, out of the text. Thereafter, the pre-processed text document terms are converted into numerical form for further processing. The main goal of this step is to improve the quality of the features and reduce the implementation complexity of the TDC algorithm [22]. The text pre-processing includes tokenization, stop word removal and stemming, which are discussed in detail in the succeeding subsections.

2.1.1. Tokenization phase
In the tokenization phase, each document is broken down into a set of tokens (words), where a token is any sequence of characters separated by spaces. Each document is then formulated as a word instance count (pieces), as in the bag-of-words model [23]. Note that the word instance count is filtered through removal of empty sequences, number formatting and collapsing, among other tasks [24].

2.1.2. Stop word removal phase
In the stop word removal phase, commonly repeated terms (e.g., 'a', 'an', 'the', 'who', 'be', 'about', 'again', 'any', 'against'), pronouns (e.g.
'she', 'he', 'it'), conjunctions (e.g. 'but', 'and', 'or') and similar words are removed because they have high frequencies and negatively affect the clustering process (i.e. they hinder the clustering algorithm). This process improves the clustering performance and reduces the number of processed words or terms [22].

2.1.3. Stemming phase
Stemming is the process of decomposing terms to their roots by removal of affixes (prefixes and suffixes) [25]. For example, the root of the word 'stemming' is 'stem'. In the English language, many terms may share the same root; for example, the words 'connects', 'connected' and 'connecting' all stem from the same root, 'connect', as in www.text-processing.com/demo/stem/. The stemming process attempts to improve the clustering by reducing the number of different terms that have similar grammatical properties and stem from a single term [26].

2.1.4. Solution representation
Each solution is represented as a vector x = (x_1, x_2, ..., x_d), where d is the number of documents. Figure 1 shows an example of a solution representation. In this example, five clusters contain twenty documents; for example, cluster two has three documents {4, 8, 9}, and document 10 (i.e., x_10) is in cluster three.

Figure 1. Solution representation

2.1.5. Objective function
In this study, for each solution x the objective function is calculated using the average distance of documents to the cluster centroid (ADDC), as shown in (1):

\min f(x) = \frac{\sum_{i=1}^{k} \left( \frac{1}{n_i} \sum_{j=1}^{n_i} D(C_i, d_j) \right)}{k} \qquad (1)
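A minimal sketch of this solution representation and of computing the ADDC objective in (1), assuming D is the Euclidean distance as in this paper; the toy document vectors and variable names below are hypothetical, not from the authors' implementation:

```python
import math

def addc(solution, docs, k):
    """Average distance of documents to their cluster centroid (ADDC).
    `solution[j]` is the cluster label (1..k) of document j, as in the
    solution representation x = (x_1, ..., x_d)."""
    total = 0.0
    for i in range(1, k + 1):
        members = [docs[j] for j, c in enumerate(solution) if c == i]
        if not members:
            continue  # an empty cluster contributes nothing
        centroid = [sum(col) / len(members) for col in zip(*members)]
        # Mean Euclidean distance of the cluster's documents to its centroid.
        total += sum(math.dist(doc, centroid) for doc in members) / len(members)
    return total / k  # lower ADDC means tighter clusters, i.e. a better solution

# Hypothetical 2-D "document vectors" and two candidate solution vectors.
docs = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0)]
good = [1, 1, 2, 2]  # each document grouped with its near neighbour
bad = [1, 2, 1, 2]   # documents mixed across clusters
assert addc(good, docs, k=2) < addc(bad, docs, k=2)
```

The optimizer minimises this value, so solutions that place nearby documents in the same cluster are preferred.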
where D(C_i, d_j) is the distance between the centroid of cluster i and document j, n_i is the number of documents in cluster i, k is the number of clusters, and f(x) is the objective function (i.e. the distance to be minimized).

2.2. Multi-verse optimizer in the optimization context
MVO [14] is inspired by multi-verse theory. According to multi-verse theory, universes connect and might even collide with each other. MVO engages and reformulates this theory using three main concepts: wormholes, black holes and white holes. A probability derived from the inflation rate of each universe (which corresponds to the objective function value in the optimization context) is used to assign one of these holes to it. Given that a universe has a high inflation rate, the probability of a white hole existing increases. Meanwhile, a low inflation rate leads to an increased probability of a black hole existing [14]. Regardless of a universe's inflation rate, wormholes move objects towards the best universe randomly [15, 27]. The black and white hole concepts in MVO are formulated for exploring search spaces, and the wormhole concept is formulated for exploiting search spaces.
As in other EAs, MVO is initiated with a population of individuals (universes). Thereafter, MVO improves these solutions until a stopping criterion is met. The conceptual model of MVO in [14] shows the movement of objects between universes via white/black hole tunnels. These tunnels are created between two universes on the basis of the inflation rate of each universe (i.e. one universe has a higher inflation rate than the other). Objects move out of universes with high inflation rates through white holes and are received by universes with low inflation rates through black holes.
After a population of solutions is initiated, all solutions in MVO are sorted from high inflation rates to low ones. Thereafter, MVO visits the solutions one by one to attract them towards the best one, under the assumption that the visited solution contains the black hole. As for the white holes, the roulette wheel mechanism is used for selecting one solution.

2.3. Adapting MVO for TDC
After the pre-processing step, MVO is used to split documents into their parent clusters. In this study, the solution representation and the objective function formulated above are used. The steps of classical MVO in [14] are adopted for TDC with certain modifications related to the nature of the problem variables. Given that the clustering problem is discrete in nature [28] and MVO was originally proposed for continuous optimization problems, MVO has to deal with discrete values of the decision variables of each TDC solution. During the MVO execution, the generation function and the wormhole equation (2) are adjusted for deciding the feasible solution as follows:

x_i^j =
\begin{cases}
  \begin{cases}
    X_j + \mathit{TDR} \times \big((ub_j - lb_j) \times r_4 + lb_j\big), & r_3 < 0.5 \\
    X_j - \mathit{TDR} \times \big((ub_j - lb_j) \times r_4 + lb_j\big), & r_3 \ge 0.5
  \end{cases} & r_2 < \mathit{WEP} \\
  x_i^j & r_2 \ge \mathit{WEP}
\end{cases} \qquad (2)

where x_i^j is the j-th variable of the i-th universe (solution), X_j is the j-th variable of the best universe, TDR (travelling distance rate) and WEP (wormhole existence probability) are the MVO coefficients, ub_j and lb_j are the upper and lower bounds of variable j, and r_2, r_3 and r_4 are random numbers in [0, 1].

A general overview of MVO for TDC is provided in Figure 2, which visualises the procedural steps.

Figure 2. Process of MVOTDC
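The two operators above (roulette-wheel selection of a white-hole universe by inflation rate, and the wormhole update of (2)) could be sketched for discrete cluster labels as follows. This is an illustrative reading of the discretisation, rounding and clamping the continuous result into the label range 1..k, with hypothetical parameter values; it is not the authors' exact implementation:

```python
import random

def roulette_select(inflation_rates, rng):
    """White-hole selection: roulette wheel weighted so that universes with
    better (lower, for a minimization objective) inflation rates are more
    likely to donate objects through a white hole."""
    worst = max(inflation_rates)
    weights = [worst - r + 1e-12 for r in inflation_rates]
    return rng.choices(range(len(inflation_rates)), weights=weights, k=1)[0]

def wormhole_update(universe, best, k, wep, tdr, rng):
    """Equation (2), discretised: with probability WEP, each variable is
    moved around the best universe by at most TDR * ((ub - lb) * r4 + lb),
    then rounded and clamped back to a valid cluster label in 1..k."""
    lb, ub = 1, k
    new = list(universe)
    for j in range(len(universe)):
        r2, r3, r4 = rng.random(), rng.random(), rng.random()
        if r2 < wep:
            step = tdr * ((ub - lb) * r4 + lb)
            value = best[j] + step if r3 < 0.5 else best[j] - step
            new[j] = min(ub, max(lb, round(value)))  # back to a discrete label
    return new

rng = random.Random(1)
best = [1, 1, 2, 2, 3]          # best universe (cluster labels)
candidate = [3, 2, 1, 3, 1]
updated = wormhole_update(candidate, best, k=3, wep=1.0, tdr=0.1, rng=rng)
```

With a small TDR the update pulls a candidate's labels towards the best universe, which mirrors the exploitation role the wormhole plays in continuous MVO.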
3. EXPERIMENTAL RESULTS
For evaluating the performance of the proposed method, a set of designed experiments is conducted using six instances of standard datasets formulated for measuring the performance of text clustering techniques. Six evaluation measures are used, as conventionally done: precision, recall, F-measure, entropy, accuracy, and purity criteria. For comparative evaluation, the results obtained in terms of these measures are compared with those obtained by three state-of-the-art algorithms (K-means clustering, GA and PSO) using the same objective function. The experiments are conducted using the MATLAB programming language. Thorough descriptions of the experimental results are given in the following subsections.

3.1. Standard datasets
Table 1 provides the characteristics of the six text document datasets used in this study: CSTR, 20Newsgroups and Classic4 from sites.labic.icmc.usp.br/text-collections, and tr12, tr41 and Wap from glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/datasets.tar.gz.

Table 1. Text document dataset characteristics

ID   Dataset        Number of documents (d)   Number of features or terms (t)   Number of clusters (K)
DS1  CSTR           299                       1725                              4
DS2  20Newsgroups   300                       2275                              3
DS3  tr12           313                       5329                              8
DS4  tr41           878                       6743                              10
DS5  Wap            1560                      7512                              20
DS6  Classic4       2000                      6500                              4

3.2. Results and discussion
The results obtained by MVOTDC are summarised in Table 2, and the parameter settings used in the experiments are given in Table 3. The results are summarised in terms of precision, recall, F-measure, entropy, accuracy, and purity for the six datasets. The findings prove the validity and effectiveness of the proposed MVOTDC in the distribution of text documents to the right clusters. The experiments are also conducted to show the validity of the proposed method in comparison with three well-known methods: GA, K-means and PSO.
Table 3 shows the parameter setting values for each compared algorithm. These parameter settings are used as suggested in [17]. A comparative analysis of K-means, GA, PSO and MVOTDC is provided in Table 2 in terms of precision, recall, F-measure, entropy, accuracy, and purity; the average values for each measure are recorded. The results obtained by the K-means clustering algorithm are worse than those obtained by the other algorithms for nearly all datasets. The possible justification is that K-means is a local search algorithm; therefore, it is highly likely to fall into local optima due to its inability to explore the problem search space effectively. Meanwhile, population-based metaheuristic algorithms, such as GA, PSO and MVOTDC, can explore different areas of the search space simultaneously and consequently achieve better exploration properties. Table 2 also shows that MVOTDC attains the minimum entropy and the maximum purity, precision, recall, F-measure and accuracy for five datasets (i.e. DS1, DS2, DS3, DS4, DS6). The ability of the proposed MVOTDC algorithm to reach the right balance between exploitation and exploration during the search, with a powerful learning mechanism, strengthens its performance in achieving impressive outcomes in comparison with the other methods.
Table 2 provides the results of the F-measure for all compared methods, including MVOTDC. Notably, MVOTDC produces the best F-measure values for five datasets. Furthermore, GA, PSO and MVOTDC outperform K-means on all the datasets. From a different perspective, Table 2 also shows the accuracy of all compared algorithms. In general, the results obtained by MVOTDC are better than those of the other methods. In fact, the results can change slightly from one dataset to another because clustering algorithms are normally highly sensitive to the dataset search space.
This can be validated by the finding that MVOTDC obtains the best accuracy on five datasets and the second-best on DS5. The purity measure of clusters is another external evaluation; it measures the maximum class for each cluster. In general, the closer the purity value is to 1, the better the clustering solution. Table 2 shows the results of the purity measure for all compared methods on all datasets. MVOTDC outperforms K-means, GA and PSO on five datasets (i.e. DS1, DS2, DS3, DS4, DS6). The proposed algorithm obtains a 21.5% improvement percentage for DS1 with respect to K-means. For DS2, MVOTDC's purity values show improvements of
6.0%, 2.6% and 2.4% over those acquired by K-means, GA and PSO, respectively. Meanwhile, the obtained improvements over K-means, GA and PSO are 15.4%, 9.3% and 5.7% for DS3; 19.7%, 4.7% and 2.9% for DS4; and 8.0%, 4.2% and 4.9% for DS6. In summary, the results shown in Table 2 reveal that MVOTDC outperforms all compared algorithms in terms of cluster quality (i.e. F-measure and purity).
Entropy is another external measure used in evaluating and comparing the quality of clustering algorithms. The entropy value is zero only when all documents of a single class are placed in a single cluster; in this case, the clustering solution is considered the best. Table 2 shows the entropy values obtained by all the compared algorithms on the different datasets. The bigger the entropy value, the worse the clustering solution. According to the results, MVOTDC provides low entropy values for most of the datasets, which means that it performs better than the other algorithms and offers the best clustering solution. Notably, K-means produces the worst entropy measure for all datasets, whereas GA and PSO are again ranked between MVOTDC and K-means. The superior performance of MVOTDC is due to its explorative capability in the search space.
The objective function is determined by ADDC for all clustering algorithms so that the distance between the documents in each cluster is minimized. Figure 3 depicts the convergence trends of GA, PSO and MVOTDC using ADDC values. The x-axis is the stream of iteration numbers, whereas the y-axis is the stream of ADDC values. Notably, the convergence rate of MVOTDC is fairly fast for all datasets except DS5.

Table 2.
Results of the accuracy, precision, recall, F-measure, purity and entropy for the K-means, GA, PSO and MVOTDC algorithms over 30 independent runs for DS1, DS2, DS3, DS4, DS5 and DS6

Dataset  Measure    K-means [17]  GA [29]  PSO [12]  MVOTDC
DS1      Accuracy   0.3573        0.3398   0.4355    0.4593
         Precision  0.4091        0.4416   0.5340    0.571
         Recall     0.3091        0.3417   0.4359    0.4829
         F-Measure  0.3459        0.3886   0.4819    0.5243
         Purity     0.3524        0.4050   0.4953    0.5684
         Entropy    0.8201        0.7170   0.6199    0.5206
DS2      Accuracy   0.3180        0.3675   0.3498    0.4044
         Precision  0.3121        0.4209   0.4134    0.4391
         Recall     0.3099        0.3676   0.3496    0.384
         F-Measure  0.3406        0.3935   0.3803    0.4109
         Purity     0.3741        0.4080   0.4096    0.4343
         Entropy    0.8028        0.7546   0.7722    0.7120
DS3      Accuracy   0.2971        0.3676   0.4075    0.4485
         Precision  0.3522        0.4128   0.4297    0.5075
         Recall     0.2944        0.3549   0.4263    0.4398
         F-Measure  0.3221        0.3826   0.4277    0.4705
         Purity     0.3907        0.4512   0.4877    0.5448
         Entropy    0.7137        0.6233   0.5719    0.5224
DS4      Accuracy   0.4125        0.4320   0.4870    0.4630
         Precision  0.3944        0.4140   0.4505    0.4568
         Recall     0.3812        0.4007   0.4496    0.4418
         F-Measure  0.3876        0.4071   0.4497    0.4568
         Purity     0.4107        0.5602   0.5789    0.6081
         Entropy    0.5874        0.5469   0.5391    0.5355
DS5      Accuracy   0.5011        0.5316   0.5622    0.5291
         Precision  0.4626        0.5313   0.5249    0.5213
         Recall     0.4010        0.4705   0.4810    0.4496
         F-Measure  0.4314        0.4997   0.5016    0.4831
         Purity     0.4759        0.4916   0.6124    0.6069
         Entropy    0.7043        0.6216   0.5765    0.6625
DS6      Accuracy   0.5858        0.6620   0.6363    0.7042
         Precision  0.5698        0.6725   0.6603    0.6919
         Recall     0.5259        0.6319   0.6163    0.6843
         F-Measure  0.5471        0.6518   0.6377    0.6880
         Purity     0.5938        0.6319   0.6242    0.6742
         Entropy    0.5600        0.5780   0.5306    0.5112
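The purity and entropy measures reported in Table 2 follow standard external-evaluation formulations, which can be sketched as follows; the toy class and cluster labels below are hypothetical:

```python
import math
from collections import Counter

def purity(classes, clusters):
    """Fraction of documents belonging to the majority class of their
    cluster; the closer to 1, the better the clustering solution."""
    total = 0
    for c in set(clusters):
        members = [cls for cls, k in zip(classes, clusters) if k == c]
        total += Counter(members).most_common(1)[0][1]  # majority-class count
    return total / len(classes)

def entropy(classes, clusters):
    """Cluster-size-weighted entropy of the class distribution inside each
    cluster; 0 means every cluster holds documents of a single class."""
    result = 0.0
    for c in set(clusters):
        members = [cls for cls, k in zip(classes, clusters) if k == c]
        h = 0.0
        for count in Counter(members).values():
            p = count / len(members)
            h -= p * math.log2(p)
        result += (len(members) / len(classes)) * h
    return result

classes = ['a', 'a', 'b', 'b']  # hypothetical ground-truth classes
perfect = [1, 1, 2, 2]          # clusters matching the classes exactly
mixed = [1, 2, 1, 2]            # classes split across clusters
assert purity(classes, perfect) == 1.0 and entropy(classes, perfect) == 0.0
```

For the mixed assignment above, purity drops to 0.5 and entropy rises, matching the interpretation used in the discussion of Table 2.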
Table 3. Parametric values for different variants of the TDC algorithms

Algorithm                    Parameter                      Value
All optimization algorithms  Population size                60
All optimization algorithms  Maximum number of iterations   1000
All optimization algorithms  Runs                           30
Proposed method (MVOTDC)     WEP max                        1
Proposed method (MVOTDC)     WEP min                        0.2
Proposed method (MVOTDC)     p                              6
GA                           Crossover probability          0.80
GA                           Mutation probability           0.02
PSO                          Maximum inertia weight         0.9
PSO                          Minimum inertia weight         0.2
PSO                          C1                             2
PSO                          C2                             2

[Figure 3 consists of six panels, one per dataset (DS1-DS6), each plotting ADDC against the number of iterations (0 to 1000) for GA, PSO and MVO.]

Figure 3. Convergence characteristics of GA, PSO and MVOTDC on datasets DS1, DS2, DS3, DS4, DS5 and DS6

It is worth emphasizing that MVOTDC can be used to address specific optimization problems such as EEG signal denoising [30], the gene selection problem [31], and power scheduling problems [32]. Despite MVOTDC's superiority among the competitive algorithms, it remains sensitive to the characteristics of the datasets, making its behavior on new datasets difficult to predict.
4. CONCLUSION AND FUTURE WORK
This paper proposes a metaheuristic optimization algorithm, the multi-verse optimizer (MVO), for solving the text document clustering (TDC) problem; the resulting method is called MVOTDC. This method introduces a new strategy of sharing information between solutions on the basis of an objective function, and it learns from the best solution instead of the global best (i.e. all solutions). The convergence of the results of MVOTDC is impressive due to the method's achievement of an appropriate balance between exploitation and exploration during each run.
The proposed MVOTDC is evaluated using six text document datasets with various sizes and complexities. The numbers of documents and clusters in each dataset are given. The quality of the obtained results is assessed using six measures: precision, recall, F-measure, entropy, accuracy, and purity. These measures are also used for a comparative evaluation against three well-known clustering algorithms: K-means, the genetic algorithm (GA) and particle swarm optimisation (PSO). For all measures, the results obtained by MVOTDC are significantly better than those produced by the three compared methods. In terms of computational time, MVOTDC is slower than K-means and requires nearly the same computational time as GA and PSO. Therefore, MVOTDC can be considered an efficient clustering method for the text clustering domain.
Given the successful outcomes of MVO for the TDC problem, MVOTDC can be implemented for different types of clustering problems. MVO can also be further improved by the addition or modification of its operators so that it can address other discrete optimisation problems, such as scheduling. In addition, datasets other than those used in this work can be used in future studies.
In addition, the MVO can be hybridized with local search strategies in order to improve the initial solutions and the exploitation capability during the optimization process.

REFERENCES
[1] N. Saini, S. Saha, and P. Bhattacharyya, "Automatic scientific document clustering using self-organized multi-objective differential evolution," Cognitive Computation, pp. 1-23, 2018.
[2] I. Arın, M. K. Erpam, and Y. Saygın, "I-twec: Interactive clustering tool for twitter," Expert Systems with Applications, vol. 96, pp. 1-13, 2018.
[3] A. K. Abasi, A. T. Khader, M. A. Al-Betar, S. Naim, S. N. Makhadmeh, and Z. A. A. Alyasseri, "Link-based multi-verse optimizer for text documents clustering," Applied Soft Computing, vol. 87, 2020.
[4] W. Song, W. Ma, and Y. Qiao, "Particle swarm optimization algorithm with environmental factors for clustering analysis," Soft Computing, vol. 21, no. 2, pp. 283-293, 2017.
[5] Z. A. A. Alyasseri, A. T. Khader, M. A. Al-Betar, M. A. Awadallah, and X.-S. Yang, "Variants of the flower pollination algorithm: a review," Nature-Inspired Algorithms and Applied Optimization, pp. 91-118, 2018.
[6] A. K. Abasi, A. T. Khader, M. A. Al-Betar, S. Naim, S. N. Makhadmeh, and Z. A. A. Alyasseri, "A text feature selection technique based on binary multi-verse optimizer for text clustering," IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology, pp. 1-6, 2019.
[7] J.-H. Jiang, J.-H. Wang, X. Chu, and R.-Q. Yu, "Clustering data using a modified integer genetic algorithm (IGA)," Analytica Chimica Acta, vol. 354, no. 1, pp. 263-274, 1997.
[8] R. Forsati, M. Mahdavi, M. Shamsfard, and M. R. Meybodi, "Efficient stochastic algorithms for document clustering," Information Sciences, vol. 220, pp. 269-291, 2013.
[9] O. A. Alomari, A. T. Khader, M. A. Al-Betar, and M. A.
Awadallah, "A novel gene selection method using modified MRMR and hybrid bat-inspired algorithm with β-hill climbing," Applied Intelligence, vol. 48, no. 11, pp. 4429-4447, 2018.
[10] N. Saini, S. Saha, A. Harsh, and P. Bhattacharyya, "Sophisticated SOM based genetic operators in multi-objective clustering framework," Applied Intelligence, pp. 1-20, 2018.
[11] M. Mavrovouniotis, C. Li, and S. Yang, "A survey of swarm intelligence for dynamic optimization: Algorithms and applications," Swarm and Evolutionary Computation, vol. 33, pp. 1-17, 2017.
[12] T. Cura, "A particle swarm optimization approach to clustering," Expert Systems with Applications, vol. 39, no. 1, pp. 1582-1588, 2012.
[13] K. K. Bharti and P. K. Singh, "Chaotic gradient artificial bee colony for text clustering," Soft Computing, vol. 20, no. 3, pp. 1113-1126, 2016.
[14] S. Mirjalili, S. M. Mirjalili, and A. Hatamlou, "Multi-verse optimizer: a nature-inspired algorithm for global optimization," Neural Computing and Applications, vol. 27, no. 2, pp. 495-513, 2016.
[15] A. Fathy and H. Rezk, "Multi-verse optimizer for identifying the optimal parameters of PEMFC model," Energy, vol. 143, pp. 634-644, 2018.
[16] P. Kumar, S. Garg, A. Singh, S. Batra, N. Kumar, and I. You, "MVO-based two-dimensional path planning scheme for providing quality of service in UAV environment," IEEE Internet of Things Journal, 2018.
[17] A. K. Abasi, A. T. Khader, M. A. Al-Betar, S. Naim, S. N. Makhadmeh, and Z. A. A. Alyasseri, "A novel hybrid multi-verse optimizer with k-means for text documents clustering," Neural Computing and Applications, 2020.
[18] H. Faris, M. A. Hassonah, A.-Z. Ala'M, S. Mirjalili, and I. Aljarah, "A multi-verse optimizer approach for feature selection and optimizing SVM parameters based on a robust system architecture," Neural Computing and Applications, pp. 1-15, 2017.
[19] I. Benmessahel, K. Xie, and M. Chellal, "A new evolutionary neural networks based on intrusion detection systems using multiverse optimization," Applied Intelligence, pp. 1-13, 2017.
[20] H.-S. Park and C.-H. Jun, "A simple and fast algorithm for k-medoids clustering," Expert Systems with Applications, vol. 36, no. 2, pp. 3336-3341, 2009.
[21] A. Huang, "Similarity measures for text document clustering," Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), pp. 49-56, 2008.
[22] A. I. Kadhim, Y.-N. Cheah, and N. H. Ahamed, "Text document preprocessing and dimension reduction techniques for text document clustering," 4th International Conference on Artificial Intelligence with Applications in Engineering and Technology (ICAIET), pp. 69-73, 2014.
[23] R. Zhao and K.
Mao, "Fuzzy bag-of-words model for document representation," IEEE Transactions on Fuzzy Systems, vol. 26, no. 2, pp. 794-804, 2018.
[24] K. K. Bharti and P. K. Singh, "Opposition chaotic fitness mutation based adaptive inertia weight BPSO for feature selection in text clustering," Applied Soft Computing, vol. 43, pp. 20-34, 2016.
[25] J. Singh and V. Gupta, "A systematic review of text stemming techniques," Artificial Intelligence Review, vol. 48, no. 2, pp. 157-217, 2017.
[26] M. N. P. Katariya, M. Chaudhari, B. Subhani, G. Laxminarayana, K. Matey, M. A. Nikose, S. A. Tinkhede, and S. Deshpande, "Text preprocessing for text mining using side information," International Journal of Computer Science and Mobile Applications, vol. 3, no. 1, pp. 01-05, 2015.
[27] D. Janiga, R. Czarnota, J. Stopa, P. Wojnarowski, and P. Kosowski, "Performance of nature inspired optimization algorithms for polymer enhanced oil recovery process," Journal of Petroleum Science and Engineering, vol. 154, pp. 354-366, 2017.
[28] W. Song, Y. Qiao, S. C. Park, and X. Qian, "A hybrid evolutionary computation approach with its application for optimizing text document clustering," Expert Systems with Applications, vol. 42, no. 5, pp. 2517-2524, 2015.
[29] D. Mustafi and G. Sahoo, "A hybrid approach using genetic algorithm and the differential evolution heuristic for enhanced initialization of the k-means algorithm with applications in text clustering," Soft Computing, pp. 1-18, 2018.
[30] Z. A. A. Alyasseri, A. T. Khader, M. A. Al-Betar, A. K. Abasi, and S. N. Makhadmeh, "EEG signals denoising using optimal wavelet transform hybridized with efficient metaheuristic methods," IEEE Access, vol. 8, pp. 10584-10605, 2019.
[31] M. A. Al-Betar, O. A. Alomari, and S. M. Abu-Romman, "A TRIZ-inspired bat algorithm for gene selection in cancer classification," Genomics, vol. 112, no. 1, pp. 114-126, 2020.
[32] S. N. Makhadmeh, A. T. Khader, M. A.
Al-Betar, S. Naim, A. K. Abasi, and Z. A. A. Alyasseri, "Optimization methods for power scheduling problems in smart home: Survey," Renewable and Sustainable Energy Reviews, vol. 115, 2019.