International Journal of Electrical and Computer Engineering (IJECE)
Vol. 9, No. 1, February 2019, pp. 531-538
ISSN: 2088-8708, DOI: 10.11591/ijece.v9i1.pp531-538

Generating similarity cluster of Indonesian languages with semi-supervised clustering

Arbi Haza Nasution 1,3, Yohei Murakami 2, and Toru Ishida 3
1,3 Department of Social Informatics, Kyoto University, Japan
2 College of Information Science and Engineering, Ritsumeikan University, Japan
1 Department of Informatics Engineering, Universitas Islam Riau, Indonesia

Article Info
Article history: Received Jan 11, 2018; Revised Jul 6, 2018; Accepted Aug 22, 2018
Keywords: Lexicostatistic, Language Similarity, Hierarchical Clustering, K-means Clustering, Semi-Supervised Clustering

ABSTRACT
Lexicostatistic and language similarity clusters are useful for computational linguistic research that depends on language similarity or cognate recognition. Nevertheless, no published lexicostatistic/language similarity clusters of Indonesian ethnic languages are available. We formulate an approach to creating language similarity clusters: we use the ASJP database to generate the language similarity matrix, generate hierarchical clusters with complete linkage and mean linkage clustering, and extract two stable clusters with high language similarities. We introduce an extended k-means clustering semi-supervised learning to evaluate the stability level of the hierarchical stable clusters, that is, how consistently they remain grouped together as the number of clusters changes. The higher the number of trials, the more likely we can distinctly find the two hierarchical stable clusters in the generated k-clusters. However, for all five experiments, the stability level of the two hierarchical stable clusters is the highest on 5 clusters. Therefore, we take the 5 clusters as the best clusters of Indonesian ethnic languages.
Finally, we plot the generated 5 clusters to a geographical map.

Copyright © 2019 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Arbi Haza Nasution, Department of Social Informatics, Kyoto University, Japan, Yoshida-Honmachi, Sakyo-ku, Kyoto 606-8501, Japan. +818078554376
Email: arbi@ai.soc.i.kyoto-u.ac.jp, arbi@eng.uir.ac.id

1. INTRODUCTION
Nowadays, machine-readable bilingual dictionaries are utilized in actual services [1] to support intercultural collaboration [2, 3, 4] and other research domains [5, 6, 7, 8, 9], but low-resource languages lack such resources. Indonesia has a population of 221,398,286 and 707 living languages, which cover 57.8% of the Austronesian family and 30.7% of the languages in Asia [10]. There are 341 Indonesian ethnic languages facing various degrees of language endangerment (in trouble / dying), where some of the native speakers do not speak Bahasa Indonesia well since they live in remote areas. Unfortunately, 13 Indonesian ethnic languages are already extinct. In order to save low-resource languages like Indonesian ethnic languages from language endangerment, prior works tried to enrich the basic language resource, i.e., the bilingual dictionary [11, 12, 13, 14]. Those works require lexicostatistic/language similarity clusters of the low-resource languages to select the target languages. However, to the best of our knowledge, there are no published lexicostatistic/language similarity clusters of Indonesian ethnic languages. To fill the void, we address this research goal: formulating an approach to creating a language similarity cluster. We first obtain 40-item word lists from the Automated Similarity Judgment Program (ASJP), then generate the language similarity matrix, then generate the hierarchical and k-means clusters, and finally plot the generated clusters to a map.
Journal Homepage: http://iaescore.com/journals/index.php/IJECE

2. AUTOMATED SIMILARITY JUDGMENT PROGRAM
Historical linguistics is the scientific study of language change over time in terms of sound, analogical, lexical, morphological, syntactic, and semantic information [15]. Comparative linguistics is a branch of historical linguistics that is concerned with language comparison to determine historical relatedness and to construct language families [16]. Many methods, techniques, and procedures have been utilized in investigating the potential distant genetic relationship of languages, including lexical comparison, sound correspondences, grammatical evidence, borrowing, semantic constraints, chance similarities, sound-meaning isomorphism, etc. [17]. The genetic relationship of languages is used to classify languages into language families. Closely-related languages are those that came from the same origin or proto-language, and belong to the same language family.

The Swadesh List is a classic compilation of basic concepts for the purposes of historical-comparative linguistics. It is used in lexicostatistics (quantitative comparison of lexical cognates) and glottochronology (chronological relationship between languages). There are various versions of the Swadesh list, with 225 words [18], 215 and 200 words [19], and lastly 100 words [20]. To find the best size of the list, Swadesh states that “The only solution appears to be a drastic weeding out of the list, in the realization that quality is at least as important as quantity. Even the new list has defects, but they are relatively mild and few in number.” [21]

A widely-used notion of string/lexical similarity is the edit distance, also known as Levenshtein Distance (LD): the minimum number of insertions, deletions, and substitutions required to transform one string into the other [22].
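
The definition above corresponds to a standard dynamic program. The following Python sketch is illustrative only: the function name `levenshtein` is ours and is not part of ASJP, and the implementation keeps just two rows of the edit table.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `source` into `target`."""
    # previous[j] holds the distance between the already processed prefix
    # of `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning i characters into the empty prefix: i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("flaw", "lawn"))  # 2: delete "f", insert "n"
```

The two-row layout is a memory-saving choice; a full table would be needed only if the alignment itself had to be recovered.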
For example, the LD between “kitten” and “sitting” is 3 since three transformations are needed: kitten → sitten (substitution of “s” for “k”), sitten → sittin (substitution of “i” for “e”), and finally sittin → sitting (insertion of “g” at the end).

Many previous works use Levenshtein distances, such as the dialect groupings of Irish Gaelic [23], where the data were gathered from a questionnaire given to native speakers of Irish Gaelic in 86 sites, yielding 312 different Gaelic words or phrases. Another work studies dialect pronunciation differences of 360 Dutch dialects [24], with 125 words obtained from Reeks Nederlandse Dialectatlassen; it normalizes LD by dividing it by the length of the longer alignment. Tang and van Heuven [25] measure linguistic similarity and intelligibility of 15 Chinese dialects and obtain 764 common syllabic units. Petroni and Serva [26] define the lexical distance between two words as the LD normalized by the number of characters of the longer of the two. Wichmann et al. [27] extend Petroni's definition as LDND and use it in the Automated Similarity Judgment Program (ASJP).

The ASJP, an open source software, was proposed by [28] with the main goal of developing a database of Swadesh lists [21] for all of the world's languages, from which a lexical similarity or lexical distance matrix between languages can be obtained by comparing the word lists. The classification is based on the 100-item reference list of Swadesh [20], further reduced to the 40 most stable items [29]. Item stability is the degree to which words for an item are retained over time and not replaced by another lexical item from the language itself or by a borrowed element. Words resistant to replacement are more stable. Stable items have a greater tendency to yield cognates (words that have a common etymological origin) within groups of closely related languages.

3. LANGUAGE SIMILARITY CLUSTERING APPROACH
We formalize an approach to create language similarity clusters by utilizing the ASJP database to generate the language similarity matrix, then generating the hierarchical clusters, and further extracting the stable clusters with high language similarities. The hierarchical stable clusters are evaluated utilizing our extended k-means clustering. Finally, the obtained k-means clusters are plotted to a geographical map. The flowchart of the whole process is shown in Figure 1.

In this paper, we focus on Indonesian ethnic languages. We obtain word lists of 119 Indonesian ethnic languages with at least 100,000 speakers. However, it is difficult to classify 119 languages and obtain valuable information from the generated clusters; therefore, we further filter the target languages based on the number of speakers and the availability of language information in Wikipedia. We obtain 32 target languages, as shown in Table 1, from the intersection between 46 Indonesian ethnic languages with more than 300,000 speakers provided by Wikipedia and 119 Indonesian ethnic languages with more than 100,000 speakers provided by ASJP. We further generate the similarity matrix of those 32 languages as shown in Figure 2. We added a white-red color scale where white means the two languages are totally different (0% similarity) and the reddest color means the two languages are exactly the same (100% similarity). For better clarity and to avoid redundancy, we only show the bottom-left part of the table. The headers follow the language codes in Table 1.
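
As a rough illustration of how a similarity matrix can be derived from word lists, the sketch below averages a length-normalized LD over the items two lists share and converts it to a percentage. This is a simplification under stated assumptions: the ASJP figures used in this paper are based on LDND, which applies a further normalization that is not reproduced here, and the word lists, language labels, and function names are hypothetical toy values rather than ASJP entries.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance via a two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """LD divided by the length of the longer word (0 means identical)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def similarity_percent(words_a, words_b):
    """Mean of (1 - normalized LD) over the shared items, as a percentage."""
    shared = [item for item in words_a if item in words_b]
    if not shared:
        return 0.0
    mean_distance = sum(normalized_distance(words_a[item], words_b[item])
                        for item in shared) / len(shared)
    return 100.0 * (1.0 - mean_distance)


# Hypothetical toy word lists (informal spellings, NOT ASJP transcriptions).
word_lists = {
    "lang_x": {"water": "air", "stone": "batu", "eye": "mata"},
    "lang_y": {"water": "aek", "stone": "batu", "eye": "mata"},
    "lang_z": {"water": "banyu", "stone": "watu", "eye": "mata"},
}

# Lower-triangular style storage: one entry per unordered language pair.
similarity_matrix = {
    (a, b): similarity_percent(word_lists[a], word_lists[b])
    for a, b in combinations(word_lists, 2)
}
for pair, value in similarity_matrix.items():
    print(pair, round(value, 1))
```

Storing one value per unordered pair mirrors the bottom-left presentation of Figure 2, since the measure is symmetric.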

Figure 1. Flowchart of Generating Language Similarity Clusters
[Flowchart nodes: Start; ASJP Words List; Generate Similarity Matrix; Generate Hierarchical Cluster; Stable Clusters; Evaluate Stable Clusters with Cluster Stability Evaluator (using k-means clustering semi-supervised learning); High Stability Level? (Yes / No); Plot k-means clusters to a map; End]

Table 1. List of 32 Indonesian Ethnic Languages Ranked by Population According to ASJP database

Code  Population  Language
L1    232004800   INDONESIAN
L2    84300000    OLD OR MIDDLE JAVANESE
L3    34000000    SUNDANESE
L4    15848500    MALAY
L5    15848500    PALEMBANG MALAY
L6    6770900     MADURESE
L7    5530000     MINANGKABAU
L8    5000000     BUGINESE
L9    5000000     BETAWI
L10   3502300     BANJARESE MALAY
L11   3500032     ACEH
L12   3330000     BALI
L13   2130000     MAKASAR
L14   2100000     SASAK
L15   2000000     TOBA BATAK
L16   1100000     BATAK MANDAILING
L17   1000000     GORONTALO
L18   1000000     JAMBI MALAY
L19   900000      MANGGARAI
L20   770000      NIAS NORTHERN
L21   750000      BATAK ANGKOLA
L22   700000      UAB METO
L23   600000      KARO BATAK
L24   500000      BIMA
L25   470000      KOMERING
L26   350000      REJANG
L27   331000      TOLAKI
L28   300000      GAYO
L29   300000      MUNA
L30   250000      TAE
L31   245020      AMBONESE MALAY
L32   230000      MONGONDOW

      L1  L2  L3  L4  L5  L6  L7  L8  L9 L10 L11 L12 L13 L14 L15 L16 L17 L18 L19 L20 L21 L22 L23 L24 L25 L26 L27 L28 L29 L30 L31
L2    24
L3    39  22
L4    85  21  41
L5    68  32  39  73
L6    34  15  20  34  34
L7    62  25  31  62  64  34
L8    31  18  25  32  31  18  32
L9    69  10  25  67  58  23  50  24
L10   72  33  39  71  64  34  60  33  55
L11   27  11  19  27  30  22  25  16  21  25
L12   38  20  29  35  39  23  31  30  24  37  22
L13   33  22  24  30  32  25  33  36  25  33  16  29
L14   44  20  28  42  44  30  44  31  37  47  22  29  35
L15   37  24  23  37  36  21  40  25  35  37  13  21  25  35
L16   25  16  14  27  27  20  27  23  24  25  14  20  18  24  58
L17   19  14  16  18  19   9  18  20  14  17  12  12  18  20  17   9
L18   79  26  40  78  78  34  69  31  70  73  27  35  38  46  39  21  20
L19   30  18  24  30  34  19  32  36  26  32  10  23  29  31  32  21  16  34
L20   26  21  17  23  25  13  29  26  24  29  12  16  19  24  29  21  19  24  25
L21   24  16  15  26  26  19  26  21  21  24  12  21  18  23  59  98   9  20  19  20
L22   13  10   9  11  14  12  18  19  10  19  10  12  21  18  15   9  14  15  22  16   9
L23   47  22  28  48  50  23  40  30  40  44  21  32  27  35  51  40  17  47  28  33  40  12
L24   18  10  16  17  18  12  18  21  18  19   6  14  21  25  22  14   8  17  30  19  14  18  19
L25   33  19  25  33  33  18  25  23  29  36  14  23  22  22  24  24  16  30  26  29  25  20  36  14
L26   28  20  16  27  32  18  30  17  21  29  15  17  17  30  25  20  11  32  18  15  19  12  29   4  19
L27   30  14  18  28  27  17  26  32  23  33  11  21  27  21  26  14  11  28  36  25  14  19  28  26  20  13
L28   37  27  28  36  37  20  37  26  28  38  18  25  23  35  28  18  17  40  26  23  17  20  41  18  37  29  28
L29   14  12  12  14  13  13  11  21  18  12   8  16  24  14  14   9  11  13  15  15  10  11  14  21  14   4  29  11
L30   42  29  31  41  39  27  42  60  30  47  20  28  42  40  34  27  23  44  38  35  26  29  38  30  29  21  38  38  25
L31   72  23  35  70  58  37  59  36  62  60  23  34  36  43  33  28  19  69  33  29  26  17  36  19  29  24  29  31  16  42
L32   30  18  24  32  31  13  26  26  27  34  11  21  25  24  24  17  26  32  23  24  17  12  28  14  24  20  20  27  15  38  24

Figure 2. Lexicostatistic / Similarity Matrix of 32 Indonesian Ethnic Languages by ASJP (%)

Hierarchical clustering is an approach which builds a hierarchy from the bottom up and does not require us to specify the number of clusters beforehand. The algorithm works as follows: (1) put each data point in its own cluster; (2) identify the closest two clusters and combine them into one cluster; (3) repeat the above step until all the data points are in a single cluster. The result is usually represented by a dendrogram-like structure.
There are a few ways to determine how close two clusters are: (1) complete linkage clustering: find the maximum possible distance between points belonging to two different clusters; (2) single linkage clustering: find the minimum possible distance between points belonging to two different clusters; (3) mean/average linkage clustering: find all possible pairwise distances for points belonging to two different clusters and then calculate the average; (4) centroid linkage clustering: find the centroid of each cluster and calculate the distance between the centroids of two clusters. Complete linkage and mean (average) linkage clustering are the ones used most often. We generate the distance matrix from the similarity matrix shown in Figure 2 and further generate the hierarchical clusters with the hclust function, with a complete linkage clustering method as shown in Figure 3(a) and a mean linkage clustering method as shown in Figure 3(b), using R, a free software environment for statistical computing and graphics.

Figure 3. Hierarchical Clusters Dendrogram of 32 Indonesian Ethnic Languages: (a) Method: Complete; (b) Method: Average. Axis: Language Similarity.
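
The agglomerative procedure and the two linkage criteria used here can be sketched in plain Python. This is not the R `hclust` call used to produce Figure 3; it is a minimal illustration that assumes the distance is taken as 100 minus the similarity percentage (the conversion is not specified above) and uses only five languages whose pairwise similarities are copied from Figure 2 (L1, L4, L15, L16, L21). All function and variable names are ours.

```python
from itertools import combinations

# Pairwise similarities (%) copied from Figure 2 for five languages.
similarity = {
    ("L1", "L4"): 85, ("L1", "L15"): 37, ("L4", "L15"): 37,
    ("L1", "L16"): 25, ("L4", "L16"): 27, ("L15", "L16"): 58,
    ("L1", "L21"): 24, ("L4", "L21"): 26, ("L15", "L21"): 59,
    ("L16", "L21"): 98,
}
# Assumption: distance = 100 - similarity.
distance = {frozenset(pair): 100 - value for pair, value in similarity.items()}
languages = ["L1", "L4", "L15", "L16", "L21"]


def cluster_distance(cluster_a, cluster_b, linkage):
    """Distance between two clusters under complete or average linkage."""
    pairwise = [distance[frozenset((a, b))] for a in cluster_a for b in cluster_b]
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unsupported linkage: {linkage}")


def agglomerate(items, linkage):
    """Merge the two closest clusters until one remains; return the merges."""
    clusters = [frozenset([item]) for item in items]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], linkage))
        merged = a | b
        merges.append((merged, cluster_distance(a, b, linkage)))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges


complete_merges = agglomerate(languages, "complete")
average_merges = agglomerate(languages, "average")

# Clusters (excluding the final all-in-one cluster) formed by both linkages.
shared_clusters = ({c for c, _ in complete_merges[:-1]}
                   & {c for c, _ in average_merges[:-1]})
for cluster in sorted(shared_clusters, key=len):
    print(sorted(cluster))
```

On this subset both linkages merge L16 with L21 first, then L1 with L4, and then attach L15 to the L16/L21 pair; only the merge heights differ between the two methods.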

From the two hierarchical clusterings in Figure 3, we select two stable clusters that are always grouped together regardless of the linkage clustering method. The first cluster consists of TOBA BATAK, BATAK MANDAILING, and BATAK ANGKOLA, while the second cluster consists of MINANGKABAU, BETAWI, AMBONESE MALAY, BANJARESE MALAY, PALEMBANG MALAY, JAMBI MALAY, MALAY, and INDONESIAN. Since the two stable clusters have language similarities above 50% between the languages, they are good clusters to refer to when selecting target languages for computational linguistic research that depends on language similarity or cognate recognition for inducing bilingual lexicons from the target languages [11, 12, 14, 30]. The two clusters are actually enough for selecting the target languages for such research. However, we still need to evaluate the stability of those clusters, and we also need to identify the clusters with low language similarities in order to grasp the whole picture of Indonesian ethnic languages. Thus, we utilize an alternative clustering approach, k-means clustering.

K-means clustering is an unsupervised learning algorithm that tries to cluster data based on their similarity. Unsupervised learning means that there is no outcome to be predicted, and the algorithm just tries to find patterns in the data. In k-means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm works as follows: (1) the algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster; (2) then, the algorithm iterates through two steps: (2a) reassign data points to the cluster whose centroid is closest; (2b) calculate the new centroid of each cluster. These two steps are repeated until the within-cluster variation cannot be reduced any further.
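
The k-means steps listed above can be sketched as follows. This is an illustrative plain-Python version, not the R `kmeans` function; the toy points, the function names, and the handling of empty clusters (re-seeding with a random data point) are assumptions made for the example. The within-cluster variation is computed here as the sum of squared Euclidean distances to the centroids, the usual k-means objective.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def compute_centroids(points, labels, k, rng):
    """Mean of each cluster; an empty cluster is re-seeded with a random point."""
    centroids = []
    for cluster in range(k):
        members = [p for p, label in zip(points, labels) if label == cluster]
        if members:
            centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            centroids.append(rng.choice(points))  # assumption, see lead-in
    return centroids


def kmeans(points, k, rng, max_iterations=100):
    # Step 1: random initial assignment of every observation to a cluster.
    labels = [rng.randrange(k) for _ in points]
    centroids = compute_centroids(points, labels, k, rng)
    for _ in range(max_iterations):
        # Step 2a: reassign each point to the cluster with the closest centroid.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # no change, so the variation cannot drop further
            break
        labels = new_labels
        # Step 2b: recompute the centroid of each cluster.
        centroids = compute_centroids(points, labels, k, rng)
    return labels, centroids


def within_cluster_variation(points, labels, centroids):
    """Sum of squared distances between points and their cluster centroids."""
    return sum(squared_distance(p, centroids[label])
               for p, label in zip(points, labels))


# Toy 2D data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2, rng=random.Random(0))
print(labels, within_cluster_variation(points, labels, centroids))
```

Passing the random generator in explicitly keeps a run reproducible while still allowing repeated trials with different initial assignments.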
The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.

It is well known that standard agglomerative hierarchical clustering techniques are not tolerant to noise [31, 32]. There are many previous works on finding clusters which are robust to noise [33, 34, 35]. However, to evaluate the stability of the hierarchical stable clusters, we introduce a simple approach: calculating their stability level, that is, how often they remain grouped together as the number of k-means clusters changes. We extend the unsupervised k-means clustering to a semi-supervised k-means clustering, as shown in Algorithm 1, by labeling the two hierarchical stable clusters beforehand.

Algorithm 1: Cluster Stability Evaluator
Input: similarityMatrix, stableClusters, minimumK, maximumTrial
Output: stabilityLevel
trial ← 1
currentK ← minimumK
maximumK ← length(similarityMatrix)
scale2D ← cmdscale(similarityMatrix)   // multidimensional to 2D scaling
while currentK <= maximumK do
    successfulTrial ← 0   // initialized for each currentK
    while trial <= maximumTrial do
        kClusters ← kmeans(scale2D, currentK)
        if stableClusters distinctly found in kClusters then
            successfulTrial++
        end
        trial++   // try again with the same number of clusters (currentK)
    end
    stabilityLevel[currentK] ← successfulTrial / maximumTrial
    currentK++   // increase the number of clusters
    trial ← 1   // reset the number of trials
end
return stabilityLevel

4. RESULT AND DISCUSSION
Initially, we manually conduct several trials to estimate the minimum and maximum number of k-means clusters for which the generated clusters contain the stable clusters distinctly. Based on the initial trials, we estimate the minimum k = 4 and maximum k = 21. Then, we calculate the stability level of the two hierarchical stable clusters with the number of clusters ranging from the minimum k = 4 to the maximum k = 21, following Algorithm 1. We have five sets of experiments with maximumTrial equal to 50, 500, 5,000, 50,000, and 500,000. In each experiment, the stability level of the two hierarchical stable clusters is measured for each number of k-means clusters by calculating the success rate of obtaining the two hierarchical stable clusters in the generated k-clusters, as shown in Figure 4. The higher the number of trials, the more likely we can distinctly find the two hierarchical stable clusters in the generated k-clusters with a large number of clusters.
For example, within 50 trials, we cannot find the two hierarchical stable clusters distinctly in the generated k-clusters for a large number of clusters (k > 14). However, within 50,000 and 500,000 trials, we can find the two hierarchical stable clusters distinctly in the generated k-clusters for every number of clusters between the minimum k = 4 and the maximum k = 21, even though the success rate gets lower as the number of clusters increases. For all five experiments, the stability level of the two hierarchical stable clusters is the highest (0.78) on 5 clusters. Therefore, we take the 5 clusters as shown in Figure 5 as the best clusters of Indonesian ethnic languages to refer to when selecting target languages for computational linguistic research that depends on language similarity or cognate recognition. We further plot the 5 clusters to a geographical map as shown in Figure 6.

Figure 4. Obtaining Stable Clusters in n Trials: (a) 50 Trials; (b) 500 Trials; (c) 5,000 Trials; (d) 50,000 Trials; (e) 500,000 Trials. Success rate by number of clusters (k):

k    (a) 50   (b) 500   (c) 5,000   (d) 50,000   (e) 500,000
4    0.66     0.758     0.7538      0.75526      0.75787
5    0.78     0.784     0.7852      0.78222      0.77950
6    0.48     0.558     0.5556      0.55570      0.55481
7    0.28     0.408     0.3756      0.37472      0.37357
8    0.42     0.230     0.2542      0.25894      0.25639
9    0.20     0.140     0.1804      0.17536      0.17515
10   0.12     0.104     0.1194      0.11554      0.11639
11   0.08     0.076     0.0760      0.07222      0.07434
12   0.04     0.046     0.0438      0.04466      0.04515
13   0.02     0.030     0.0230      0.02680      0.02644
14   0.06     0.010     0.0144      0.01490      0.01425
15   0        0.008     0.0066      0.00754      0.00742
16   0        0.004     0.0054      0.00316      0.00333
17   0        0.002     0.0010      0.00172      0.00142
18   0        0.002     0.0006      0.00056      0.00054
19   0        0         0.0008      0.00012      0.00018
20   0        0         0           0.00004      0.00004
21   0        0         0           0.00004      0.00001
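
Algorithm 1 can be sketched in Python as follows. The sketch makes several assumptions that go beyond the text: the 2D coordinates are taken as given (the `cmdscale` step is not reproduced), the clustering routine is passed in as a function, "distinctly found" is interpreted as every stable cluster falling entirely inside one k-means cluster while different stable clusters land in different k-means clusters, and the toy points and index lists are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(points, k, rng, max_iterations=100):
    """Compact Lloyd-style k-means returning one cluster label per point."""
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iterations):
        centroids = []
        for cluster in range(k):
            members = [p for p, label in zip(points, labels) if label == cluster]
            centroids.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members else rng.choice(points)  # re-seed an empty cluster
            )
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def distinctly_found(labels, stable_clusters):
    """Assumed reading of 'distinctly found' (see the lead-in)."""
    seen = []
    for members in stable_clusters:
        member_labels = {labels[index] for index in members}
        if len(member_labels) != 1:
            return False  # this stable cluster was split
        seen.append(member_labels.pop())
    return len(set(seen)) == len(seen)  # stable clusters must not be merged


def stability_levels(points, stable_clusters, minimum_k, maximum_k,
                     maximum_trial, cluster_fn):
    """Success rate of recovering the stable clusters for each k."""
    levels = {}
    for current_k in range(minimum_k, maximum_k + 1):
        successful = sum(
            distinctly_found(cluster_fn(points, current_k), stable_clusters)
            for _ in range(maximum_trial)
        )
        levels[current_k] = successful / maximum_trial
    return levels


# Hypothetical 2D coordinates; indices 0-2 and 3-4 play the stable clusters.
points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (4, 0), (0, 6), (9, 2)]
stable_clusters = [[0, 1, 2], [3, 4]]
rng = random.Random(0)
levels = stability_levels(points, stable_clusters, 2, 5, 200,
                          lambda pts, k: kmeans_labels(pts, k, rng))
print(levels)
```

Taking the clustering routine as a parameter keeps the evaluator independent of any particular k-means implementation, so a library routine could be substituted without touching the trial loop.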

Figure 5. K-means Clusters of 32 Indonesian Ethnic Languages - 5 Clusters

Figure 6. Similarity Clusters Map of 32 Indonesian Ethnic Languages - 5 Clusters

5. CONCLUSION
We utilized the ASJP database to generate the language similarity matrix, then generated the hierarchical clusters with complete linkage and mean linkage clustering, and further extracted two stable clusters with the highest language similarities. We applied our extended k-means clustering semi-supervised learning to evaluate the stability level of the hierarchical stable clusters, that is, how consistently they are grouped together as the number of clusters changes. The higher the number of trials, the more likely we can distinctly find the two hierarchical stable clusters in the generated k-clusters. However, for all five experiments, the stability level of the two hierarchical stable clusters is the highest (0.78) on 5 clusters. Therefore, we take the 5 clusters as the best clusters of Indonesian ethnic languages to refer to when selecting target languages for computational linguistic research that depends on language similarity or cognate recognition. Finally, we plot the generated 5 clusters to a geographical map. Our algorithm can be used to find and evaluate other stable clusters of Indonesian ethnic languages or other language sets.

ACKNOWLEDGEMENT
This research was partially supported by a Grant-in-Aid for Scientific Research (A) (17H00759, 2017-2020) and a Grant-in-Aid for Young Scientists (A) (17H04706, 2017-2020) from Japan Society for the Promotion of Science (JSPS). The first author was supported by Indonesia Endowment Fund for Education (LPDP).

REFERENCES
[1] T. Ishida, Y. Murakami, D. Lin, T. Nakaguchi and M. Otani, “Language Service Infrastructure on the Web: The Language Grid,” IEEE Computer, vol. 51, Issue 6, pp. 72-81, June, 2018.
[2] T. Ishida, “Intercultural collaboration and support systems: A brief history,” in International Conference on Principles and Practice of Multi-Agent Systems (PRIMA 2016), pages 3-19. Springer, 2016.
[3] A. H. Nasution, N. Syafitri, P. R. Setiawan and D. Suryani, “Pivot-Based Hybrid Machine Translation to Support Multilingual Communication,” in International Conference on Culture and Computing (Culture and Computing), Kyoto, Japan, 2017, pp. 147-148. doi: 10.1109/Culture.and.Computing.2017.22.
[4] A. H. Nasution, “Pivot-Based Hybrid Machine Translation to Support Multilingual Communication for Closely Related Languages,” World Transactions on Engineering and Technology Education, 16, 2, 12-17, 2018.
[5] A. H. Nasution, Y. Murakami, and T. Ishida, “Designing a Collaborative Process to Create Bilingual Dictionaries of Indonesian Ethnic Languages,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Paris, France, 3397-3404, 2018.
[6] R. Tanaka, Y. Murakami and T. Ishida, “Context-Based Approach for Pivot Translation Services,” in International Joint Conference on Artificial Intelligence (IJCAI-09), pp. 1555-1561, 2009.
[7] E. W. Pamungkas, R. Sarno, and A. Munif, “B-BabelNet: Business-Specific Lexical Database for Improving Semantic Analysis of Business Process Models,” Telkomnika, 15(1), 407, 2017.
[8] H. Hassan, “A framework for Arabic concept-level sentiment analysis using SenticNet,” International Journal of Electrical and Computer Engineering (IJECE), 8(6), 2018.
[9] P. Bajpai, P. Verma and S. Q. Abbas, “Two Level Disambiguation Model for Query Translation,” International Journal of Electrical and Computer Engineering (IJECE), 8(5), 2018.
[10] Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig (eds.), Ethnologue: Languages of the World, Eighteenth edition. Dallas, Texas: SIL International. Online version: http://www.ethnologue.com, 2015.
[11] A. H. Nasution, Y. Murakami, and T. Ishida, “Constraint-based bilingual lexicon induction for closely related languages,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 3291-3298, Paris, France, May, 2016.
[12] A. H. Nasution, Y. Murakami and T. Ishida, “A generalized constraint approach to bilingual dictionary induction for low-resource language families,” ACM Trans. Asian Low-Resour. Lang. Inf. Process., 17, 2, Article 9 (November 2017), 29 pages, 2017.
[13] A. H. Nasution, Y. Murakami and T. Ishida, “Plan Optimization for Creating Bilingual Dictionaries of Low-Resource Languages,” in International Conference on Culture and Computing (Culture and Computing), Kyoto, Japan, 2017, pp. 35-41. doi: 10.1109/Culture.and.Computing.2017.21.
[14] M. Wushouer, D. Lin, T. Ishida and K. Hirayama, “A constraint approach to pivot-based bilingual dictionary induction,” ACM Trans. Asian Low-Resour. Lang. Inf. Process., 15(1):4:1-4:26, November, 2015.
[15] L. Campbell. Historical Linguistics. Edinburgh University Press, 2013.
[16] W. P. Lehmann. Historical linguistics: an introduction. Routledge, 2013.
[17] L. Campbell and W. J. Poser. Language classification. History and method. Cambridge, 2008.
[18] M. Swadesh, “Salish Internal Relationships,” International Journal of American Linguistics, vol. 16, 157-167, 1950.
[19] M. Swadesh, “Lexicostatistic Dating of Prehistoric Ethnic Contacts,” in Proceedings of the American Philosophical Society, vol. 96, 452-463, 1952.
[20] M. Swadesh. The Origin and Diversification of Language, Ed. post mortem by Joel Sherzer. Chicago: Aldine, p. 283, 1971.
[21] M. Swadesh, “Towards Greater Accuracy in Lexicostatistic Dating,” International Journal of American Linguistics, vol. 21, 121-137, 1955.
[22] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet physics doklady, vol. 10, No. 8, pp. 707-710, 1966.
[23] B. Kessler, “Computational dialectology in Irish Gaelic,” in Proceedings of the seventh conference on European chapter of the Association for Computational Linguistics (EACL ’95), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 60-66. DOI: https://doi.org/10.3115/976973.976983
[24] W. J. Heeringa. Measuring dialect pronunciation differences using Levenshtein distance, Doctoral dissertation, University Library Groningen, 2004.
[25] C. Tang and V. J. van Heuven, “Predicting mutual intelligibility of Chinese dialects from multiple objective linguistic distance measures,” Linguistics, 53(2), 285-312, 2015.
[26] F. Petroni and M. Serva, “Language distance and tree reconstruction,” Journal of Statistical Mechanics: Theory and Experiment 2008, no. 08 (2008): P08012.
[27] S. Wichmann, E. W. Holman, D. Bakker, and C. H. Brown, “Evaluating linguistic distance measures,” Physica A: Statistical Mechanics and its Applications, 389(17), 3632-3639, 2010.
[28] E. W. Holman, C. H. Brown, S. Wichmann, A. Müller, V. Velupillai, H. Hammarström, S. Sauppe, H. Jung, D. Bakker and P. Brown, “Automated dating of the world’s language families based on lexical similarity,” Current Anthropology 52, 6, 841-875, 2011.
[29] E. W. Holman, S. Wichmann, C. H. Brown, V. Velupillai, A. Müller, and D. Bakker, “Explorations in automated language classification,” Folia Linguistica, 42(3-4), 331-354, 2008.
[30] G. S. Mann and D. Yarowsky, “Multipath translation lexicon induction via bridge languages,” in Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, Association for Computational Linguistics, 1-8, 2001.
[31] G. Nagy, “State of the art in pattern recognition,” in Proceedings of the IEEE, 56, no. 5, pp. 836-863, 1968.
[32] M. Narasimhan, N. Jojic and J. Bilmes, “Q-clustering,” Advances in Neural Information Processing Systems, 2006.
[33] M. F. Balcan, Y. Liang and P. Gupta, “Robust hierarchical clustering,” The Journal of Machine Learning Research, 15(1), 3831-3871, 2014.
[34] S. Guha, R. Rastogi and K. Shim, “ROCK: A robust clustering algorithm for categorical attributes,” Information systems, 25(5), 345-366, 2000.
[35] P. Langfelder and S. Horvath, “Fast R functions for robust correlations and hierarchical clustering,” Journal of statistical software, 46(11), 2012.

BIOGRAPHY OF AUTHORS
Arbi Haza Nasution is currently working toward the Ph.D. degree in Social Informatics at the Graduate School of Informatics, Kyoto University. He obtained a Bachelor Degree in Computer Science from the National University of Malaysia in 2010 and a Master Degree in Management Information System from the National University of Malaysia in 2012.
He has been a Lecturer with the Department of Informatics Engineering, Universitas Islam Riau, Indonesia, since 2013. His current research interests include computational linguistics, natural language processing and machine learning.

Yohei Murakami received the Ph.D. degree in informatics from Kyoto University, Kyoto, Japan, in 2006. He has been an Associate Professor with Ritsumeikan University since 2018. He currently leads the research and development of the Language Grid, the purpose of which is to share various language resources as Web services and enable users to create new services. He received the Achievement Award of the Institute of Electronics, Information and Communication Engineers for this work in 2013. His current research interests include services computing and multiagent systems. He founded the Technical Committee on Services Computing with the Institute of Electronics, Information and Communication Engineers in 2012.

Toru Ishida has been a Professor with Kyoto University, Kyoto, Japan, since 1993. His current research interests include autonomous agents and multiagent systems. He has performed research in the above areas for over 20 years. Since 2006, he has been running the Language Grid Project. Prof. Ishida served as the Program Co-Chair of the second ICMAS, the Chair of the first PRIMA, and the General Co-Chair of the first AAMAS. He was also an Editor-in-Chief of the Journal on Web Semantics (Elsevier) and an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the Journal on Autonomous Agents and Multi-Agent Systems (Springer). He was a Board Member of the International Foundation on Autonomous Agent and Multiagent Systems. He has also started workshops/conferences on digital cities and intercultural collaboration.