International Journal of Electrical and Computer Engineering (IJECE)
Vol. 9, No. 1, February 2019, pp. 531-538
ISSN: 2088-8708, DOI: 10.11591/ijece.v9i1.pp531-538

Generating similarity cluster of Indonesian languages with semi-supervised clustering

Arbi Haza Nasution (1,3), Yohei Murakami (2), and Toru Ishida (3)
(1,3) Department of Social Informatics, Kyoto University, Japan
(2) College of Information Science and Engineering, Ritsumeikan University, Japan
(1) Department of Informatics Engineering, Universitas Islam Riau, Indonesia

Article Info

Article history:
Received Jan 11, 2018
Revised Jul 6, 2018
Accepted Aug 22, 2018

Keywords:
Lexicostatistic
Language Similarity
Hierarchical Clustering
K-means Clustering
Semi-Supervised Clustering

ABSTRACT

Lexicostatistic and language similarity clusters are useful for computational linguistic research that depends on language similarity or cognate recognition. Nevertheless, no lexicostatistic or language similarity clusters of Indonesian ethnic languages have been published. We formulate an approach to creating language similarity clusters: we use the ASJP database to generate the language similarity matrix, generate hierarchical clusters with complete linkage and mean linkage clustering, and then extract two stable clusters with high language similarities. We introduce an extended k-means clustering with semi-supervised learning to evaluate the stability level of the hierarchical stable clusters, that is, how often they remain grouped together as the number of clusters changes. The higher the number of trials, the more likely we are to find the two hierarchical stable clusters distinctly in the generated k clusters. However, in all five experiments, the stability level of the two hierarchical stable clusters is highest at 5 clusters. Therefore, we take these 5 clusters as the best clusters of Indonesian ethnic languages. Finally, we plot the generated 5 clusters on a geographical map.

Copyright © 2019 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Arbi Haza Nasution,
Department of Social Informatics, Kyoto University, Japan,
Yoshida-Honmachi, Sakyo-ku, Kyoto 606-8501, Japan.
+818078554376
Email: arbi@ai.soc.i.kyoto-u.ac.jp, arbi@eng.uir.ac.id

1. INTRODUCTION

Nowadays, machine-readable bilingual dictionaries are being utilized in actual services [1] to support intercultural collaboration [2, 3, 4] and other research domains [5, 6, 7, 8, 9], but low-resource languages lack such sources. Indonesia has a population of 221,398,286 and 707 living languages, which cover 57.8% of the Austronesian family and 30.7% of the languages in Asia [10]. There are 341 Indonesian ethnic languages facing various degrees of language endangerment (in trouble or dying), and some of their native speakers do not speak Bahasa Indonesia well since they live in remote areas. Unfortunately, 13 Indonesian ethnic languages are already extinct.

In order to save low-resource languages like Indonesian ethnic languages from language endangerment, prior works tried to enrich the basic language resource, i.e., the bilingual dictionary [11, 12, 13, 14]. Those works require lexicostatistic or language similarity clusters of the low-resource languages to select the target languages. However, to the best of our knowledge, there are no published lexicostatistic or language similarity clusters of Indonesian ethnic languages. To fill the void, we address this research goal: formulating an approach to creating language similarity clusters. We first obtain 40-item word lists from the Automated Similarity Judgment Program (ASJP), then generate the language similarity matrix, then generate the hierarchical and k-means clusters, and finally plot the generated clusters on a map.

Journal Homepage: http://iaescore.com/journals/index.php/IJECE

2. AUTOMATED SIMILARITY JUDGMENT PROGRAM

Historical linguistics is the scientific study of language change over time in terms of sound, analogical, lexical, morphological, syntactic, and semantic information [15]. Comparative linguistics is a branch of historical linguistics that is concerned with comparing languages to determine historical relatedness and to construct language families [16]. Many methods, techniques, and procedures have been utilized in investigating the potential distant genetic relationship of languages, including lexical comparison, sound correspondences, grammatical evidence, borrowing, semantic constraints, chance similarities, sound-meaning isomorphism, etc. [17]. The genetic relationship of languages is used to classify languages into language families. Closely related languages are those that came from the same origin or proto-language and belong to the same language family.

The Swadesh list is a classic compilation of basic concepts for the purposes of historical-comparative linguistics. It is used in lexicostatistics (quantitative comparison of lexical cognates) and glottochronology (chronological relationship between languages). There are various versions of the Swadesh list, with 225 words [18], 215 and 200 words [19], and lastly 100 words [20]. On finding the best size of the list, Swadesh states that “The only solution appears to be a drastic weeding out of the list, in the realization that quality is at least as important as quantity. Even the new list has defects, but they are relatively mild and few in number.” [21]

A widely used notion of string or lexical similarity is the edit distance, also known as the Levenshtein Distance (LD): the minimum number of insertions, deletions, and substitutions required to transform one string into the other [22]. For example, the LD between “kitten” and “sitting” is 3, since three transformations are needed: kitten → sitten (substitution of “s” for “k”), sitten → sittin (substitution of “i” for “e”), and finally sittin → sitting (insertion of “g” at the end).

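The definition of LD can be made concrete with a short dynamic-programming sketch. This is an illustrative Python implementation, not code from the ASJP software; the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] is the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3, the example from the text
```

Only two rows of the distance table are kept at a time, which is enough when just the distance (and not the alignment) is needed.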
Many previous works use the Levenshtein Distance. One studies dialect groupings of Irish Gaelic [23], gathering data from a questionnaire given to native speakers of Irish Gaelic at 86 sites and obtaining 312 different Gaelic words or phrases. Another work studies dialect pronunciation differences of 360 Dutch dialects [24], obtaining 125 words from the Reeks Nederlandse Dialectatlassen; it normalizes LD by dividing it by the length of the longer alignment. Tang and van Heuven [25] measure linguistic similarity and intelligibility of 15 Chinese dialects and obtain 764 common syllabic units. Petroni and Serva [26] define the lexical distance between two words as the LD normalized by the number of characters of the longer of the two. Wichmann et al. [27] extend the Petroni definition as LDND and use it in the Automated Similarity Judgment Program (ASJP).

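The two normalizations mentioned above can be sketched as follows. The normalized distance (called `ldn` here) follows the description of [26] in the text. The `ldnd` function reflects our understanding of the measure in [27], namely the mean normalized distance over same-meaning word pairs divided by the mean over different-meaning pairs; treat that formulation, the function names, and the toy word lists as assumptions of this sketch rather than as the ASJP implementation.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def ldn(a, b):
    """LD divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(words_x, words_y):
    """Assumed LDND: mean LDN of same-meaning pairs over mean LDN of
    different-meaning pairs. Inputs map a meaning to a word and need at
    least two shared meanings."""
    shared = [m for m in words_x if m in words_y]
    same = [ldn(words_x[m], words_y[m]) for m in shared]
    diff = [ldn(words_x[m], words_y[n])
            for m in shared for n in shared if m != n]
    return (sum(same) / len(same)) / (sum(diff) / len(diff))


# Toy word lists with made-up strings (not real ASJP transcriptions).
lang_x = {"m1": "abcd", "m2": "efg"}
lang_y = {"m1": "abce", "m2": "efg"}
print(ldn("abcd", "abce"))   # 0.25
print(ldnd(lang_x, lang_y))  # 0.125
```

Dividing by the different-meaning average is intended to discount similarity that arises by chance, for example from overlapping sound inventories.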
The ASJP, an open-source software, was proposed by Holman et al. [28] with the main goal of developing a database of Swadesh lists [21] for all of the world’s languages, from which a lexical similarity or lexical distance matrix between languages can be obtained by comparing the word lists. The classification is based on the 100-item reference list of Swadesh [20], further reduced to the 40 most stable items [29]. Item stability is the degree to which words for an item are retained over time and not replaced by another lexical item from the language itself or by a borrowed element. Words resistant to replacement are more stable. Stable items have a greater tendency to yield cognates (words that have a common etymological origin) within groups of closely related languages.

3. LANGUAGE SIMILARITY CLUSTERING APPROACH

We formalize an approach to creating language similarity clusters: we use the ASJP database to generate the language similarity matrix, generate the hierarchical clusters, and then extract the stable clusters with high language similarities. The hierarchical stable clusters are evaluated with our extended k-means clustering. Finally, the obtained k-means clusters are plotted on a geographical map. The flowchart of the whole process is shown in Figure 1.

In this paper, we focus on Indonesian ethnic languages. We obtain word lists of 119 Indonesian ethnic languages with at least 100,000 speakers. However, it is difficult to classify 119 languages and obtain valuable information from the generated clusters. Therefore, we further filtered the target languages based on the number of speakers and the availability of language information in Wikipedia. We obtain the 32 target languages shown in Table 1 from the intersection between the 46 Indonesian ethnic languages with more than 300,000 speakers listed by Wikipedia and the 119 Indonesian ethnic languages with more than 100,000 speakers provided by ASJP. We then generate the similarity matrix of those 32 languages as shown in Figure 2. We added a white-red color scale, where white means the two languages are totally different (0% similarity) and the reddest color means the two languages are exactly the same (100% similarity). For clarity and to avoid redundancy, we only show the bottom-left part of the table. The headers follow the language codes in Table 1.

[Flowchart nodes: Start; ASJP Words List; Generate Similarity Matrix; Generate Hierarchical Cluster; Stable Clusters; Evaluate Stable Clusters with Cluster Stability Evaluator (using k-means clustering semi-supervised learning); High Stability Level? (Yes / No); Plot k-means clusters to a map; End]

Figure 1. Flowchart of Generating Language Similarity Clusters

Table 1. List of 32 Indonesian Ethnic Languages Ranked by Population According to ASJP database

Code  Population  Language
L1    232004800   INDONESIAN
L2    84300000    OLD OR MIDDLE JAVANESE
L3    34000000    SUNDANESE
L4    15848500    MALAY
L5    15848500    PALEMBANG MALAY
L6    6770900     MADURESE
L7    5530000     MINANGKABAU
L8    5000000     BUGINESE
L9    5000000     BETAWI
L10   3502300     BANJARESE MALAY
L11   3500032     ACEH
L12   3330000     BALI
L13   2130000     MAKASAR
L14   2100000     SASAK
L15   2000000     TOBA BATAK
L16   1100000     BATAK MANDAILING
L17   1000000     GORONTALO
L18   1000000     JAMBI MALAY
L19   900000      MANGGARAI
L20   770000      NIAS NORTHERN
L21   750000      BATAK ANGKOLA
L22   700000      UAB METO
L23   600000      KARO BATAK
L24   500000      BIMA
L25   470000      KOMERING
L26   350000      REJANG
L27   331000      TOLAKI
L28   300000      GAYO
L29   300000      MUNA
L30   250000      TAE
L31   245020      AMBONESE MALAY
L32   230000      MONGONDOW

      L1  L2  L3  L4  L5  L6  L7  L8  L9 L10 L11 L12 L13 L14 L15 L16 L17 L18 L19 L20 L21 L22 L23 L24 L25 L26 L27 L28 L29 L30 L31
L2    24
L3    39  22
L4    85  21  41
L5    68  32  39  73
L6    34  15  20  34  34
L7    62  25  31  62  64  34
L8    31  18  25  32  31  18  32
L9    69  10  25  67  58  23  50  24
L10   72  33  39  71  64  34  60  33  55
L11   27  11  19  27  30  22  25  16  21  25
L12   38  20  29  35  39  23  31  30  24  37  22
L13   33  22  24  30  32  25  33  36  25  33  16  29
L14   44  20  28  42  44  30  44  31  37  47  22  29  35
L15   37  24  23  37  36  21  40  25  35  37  13  21  25  35
L16   25  16  14  27  27  20  27  23  24  25  14  20  18  24  58
L17   19  14  16  18  19   9  18  20  14  17  12  12  18  20  17   9
L18   79  26  40  78  78  34  69  31  70  73  27  35  38  46  39  21  20
L19   30  18  24  30  34  19  32  36  26  32  10  23  29  31  32  21  16  34
L20   26  21  17  23  25  13  29  26  24  29  12  16  19  24  29  21  19  24  25
L21   24  16  15  26  26  19  26  21  21  24  12  21  18  23  59  98   9  20  19  20
L22   13  10   9  11  14  12  18  19  10  19  10  12  21  18  15   9  14  15  22  16   9
L23   47  22  28  48  50  23  40  30  40  44  21  32  27  35  51  40  17  47  28  33  40  12
L24   18  10  16  17  18  12  18  21  18  19   6  14  21  25  22  14   8  17  30  19  14  18  19
L25   33  19  25  33  33  18  25  23  29  36  14  23  22  22  24  24  16  30  26  29  25  20  36  14
L26   28  20  16  27  32  18  30  17  21  29  15  17  17  30  25  20  11  32  18  15  19  12  29   4  19
L27   30  14  18  28  27  17  26  32  23  33  11  21  27  21  26  14  11  28  36  25  14  19  28  26  20  13
L28   37  27  28  36  37  20  37  26  28  38  18  25  23  35  28  18  17  40  26  23  17  20  41  18  37  29  28
L29   14  12  12  14  13  13  11  21  18  12   8  16  24  14  14   9  11  13  15  15  10  11  14  21  14   4  29  11
L30   42  29  31  41  39  27  42  60  30  47  20  28  42  40  34  27  23  44  38  35  26  29  38  30  29  21  38  38  25
L31   72  23  35  70  58  37  59  36  62  60  23  34  36  43  33  28  19  69  33  29  26  17  36  19  29  24  29  31  16  42
L32   30  18  24  32  31  13  26  26  27  34  11  21  25  24  24  17  26  32  23  24  17  12  28  14  24  20  20  27  15  38  24

Figure 2. Lexicostatistic / Similarity Matrix of 32 Indonesian Ethnic Languages by ASJP (%)

Hierarchical clustering is an approach that builds a hierarchy from the bottom up and does not require us to specify the number of clusters beforehand. The algorithm works as follows: (1) put each data point in its own cluster; (2) identify the closest two clusters and combine them into one cluster; (3) repeat the above step until all the data points are in a single cluster. Once this is done, the result is usually represented by a dendrogram-like structure. There are a few ways to determine how close two clusters are: (1) complete linkage clustering: find the maximum possible distance between points belonging to two different clusters; (2) single linkage clustering: find the minimum possible distance between points belonging to two different clusters; (3) mean/average linkage clustering: find all possible pairwise distances for points belonging to two different clusters and then calculate the average; (4) centroid linkage clustering: find the centroid of each cluster and calculate the distance between the centroids of two clusters. Complete linkage and mean (average) linkage clustering are the ones used most often.

We generate the distance matrix from the similarity matrix shown in Figure 2 and then generate the hierarchical clusters with the hclust function, using a complete linkage clustering method as shown in Figure 3(a) and a mean linkage clustering method as shown in Figure 3(b). We use R, a free software environment for statistical computing and graphics.

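The agglomerative procedure and the two linkage rules used here can be illustrated with a small pure-Python sketch. It is not the R hclust call used in the paper: it stops at a requested number of flat clusters instead of building the full dendrogram, and it assumes the distance is simply 100 minus the similarity percentage, which the text does not specify. The five languages and their similarity values are taken from Figure 2 (codes L1, L4, L15, L16, L21).

```python
def agglomerate(names, dist, linkage, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(names))]  # each point starts alone

    def cluster_distance(c1, c2):
        pairwise = [dist[i][j] for i in c1 for j in c2]
        if linkage == "complete":
            return max(pairwise)                  # farthest pair
        if linkage == "average":
            return sum(pairwise) / len(pairwise)  # mean over all pairs
        raise ValueError("unknown linkage: " + linkage)

    while len(clusters) > n_clusters:
        a, b = min(
            ((x, y) for x in range(len(clusters))
             for y in range(x + 1, len(clusters))),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(names[i] for i in c) for c in clusters]


names = ["INDONESIAN", "MALAY", "TOBA BATAK", "BATAK MANDAILING", "BATAK ANGKOLA"]
similarity = [  # percentages from Figure 2, mirrored to a full matrix
    [100, 85, 37, 25, 24],
    [85, 100, 37, 27, 26],
    [37, 37, 100, 58, 59],
    [25, 27, 58, 100, 98],
    [24, 26, 59, 98, 100],
]
# Assumed conversion: distance = 100 - similarity.
distance = [[100 - s for s in row] for row in similarity]

for method in ("complete", "average"):
    print(method, agglomerate(names, distance, method, 2))
```

On this subset both linkage rules return the same two groups, one with INDONESIAN and MALAY and one with the three Batak languages, which mirrors the idea of a cluster that stays together when the linkage method changes.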
[Figure 3: two dendrograms of the 32 languages, each with a Language Similarity axis from 20 to 100 and the 50 level marked. Panels: (a) Method: Complete; (b) Method: Average.]

Figure 3. Hierarchical Clusters Dendrogram of 32 Indonesian Ethnic Languages.

From the two hierarchical clusterings in Figure 3, we select two stable clusters that are always grouped together regardless of the linkage clustering method. The first cluster consists of TOBA BATAK, BATAK MANDAILING, and BATAK ANGKOLA, while the second cluster consists of MINANGKABAU, BETAWI, AMBONESE MALAY, BANJARESE MALAY, PALEMBANG MALAY, JAMBI MALAY, MALAY, and INDONESIAN. Since the languages within each of the two stable clusters have language similarities above 50%, they are good clusters to refer to when selecting target languages for computational linguistic research that depends on language similarity or cognate recognition for inducing bilingual lexicons from the target languages [11, 12, 14, 30]. The two clusters are actually enough for selecting the target languages for such research. However, we still need to evaluate the stability of those clusters, and we also need to identify the clusters with low language similarities in order to grasp the whole picture of Indonesian ethnic languages. Thus, we utilize an alternative clustering approach, k-means clustering.

K-means clustering is an unsupervised learning algorithm that tries to cluster data based on their similarity. Unsupervised learning means that there is no outcome to be predicted, and the algorithm just tries to find patterns in the data. In k-means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm works as follows: (1) the algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster; (2) then, the algorithm iterates through two steps: (2a) reassign data points to the cluster whose centroid is closest; (2b) calculate the new centroid of each cluster. These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.

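The steps above can be sketched in plain Python for 2D points. This is a minimal illustration, not the R kmeans function used in the paper. Two choices are assumptions of this sketch: the random initial assignment is dealt out round-robin after a shuffle so that no cluster starts empty, and iteration stops when the assignments no longer change.

```python
import random


def kmeans(points, k, rng, max_iter=100):
    """K-means on 2D points; returns (labels, centroids). Requires k <= len(points)."""
    # Step 1: random initial assignment (shuffled, then dealt round-robin
    # so that every cluster receives at least one point).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k

    def centroid_of(cluster_id, fallback):
        members = [points[i] for i in range(len(points)) if labels[i] == cluster_id]
        if not members:
            return fallback  # keep the previous centroid if a cluster empties
        return (sum(p[0] for p in members) / len(members),
                sum(p[1] for p in members) / len(members))

    centroids = [centroid_of(c, (0.0, 0.0)) for c in range(k)]
    for _ in range(max_iter):
        # Step 2a: reassign each point to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the variation cannot drop further
        labels = new_labels
        # Step 2b: recompute the centroid of each cluster.
        centroids = [centroid_of(c, centroids[c]) for c in range(k)]
    return labels, centroids


# Two well-separated toy groups (made-up coordinates).
points = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
labels, centroids = kmeans(points, 2, random.Random(0))
print(labels)
```

Because the start is random, repeated runs on less clearly separated data can end in different partitions, which is the behavior the stability evaluation in this paper relies on.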
It is well known that standard agglomerative hierarchical clustering techniques are not tolerant to noise [31, 32]. There are many previous works on finding clusters that are robust to noise [33, 34, 35]. However, to evaluate the stability of the hierarchical stable clusters, we introduce a simple approach: calculating their stability level, that is, how often they remain grouped together as the number of k-means clusters changes. We extend k-means clustering from unsupervised learning to semi-supervised learning, as shown in Algorithm 1, by labeling the two hierarchical stable clusters beforehand.

Algorithm 1: Cluster Stability Evaluator
Input: similarityMatrix, stableClusters, minimumK, maximumTrial;
Output: stabilityLevel
trial ← 1;
currentK ← minimumK;
maximumK ← length(similarityMatrix);
scale2D ← cmdscale(similarityMatrix); // multidimensional to 2D scaling
while currentK <= maximumK do
    successfulTrial ← 0; // initialized for each currentK
    while trial <= maximumTrial do
        kClusters ← kmeans(scale2D, currentK);
        if stableClusters distinctly found in kClusters then
            successfulTrial++;
        end
        trial++; // try again with the same number of clusters (currentK)
    end
    stabilityLevel[currentK] ← successfulTrial / maximumTrial;
    currentK++; // increase the number of clusters
    trial ← 1; // reset the number of trials
end
return stabilityLevel;

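The loop structure of Algorithm 1 can be sketched in Python as follows. Several parts are assumptions of this sketch. The test for "distinctly found" is our reading: every member of a stable cluster receives the same label and no two stable clusters share a label. The clustering routine is passed in as `cluster_fn`, which in the paper's setting would be a randomized k-means on the 2D-scaled data. The `bin_clusterer` stand-in and the toy positions are hypothetical and deterministic, used only so that the example output can be checked.

```python
def distinctly_found(labels, stable_clusters):
    """True if each stable cluster shares one label and no two share a label."""
    used = set()
    for members in stable_clusters:
        member_labels = {labels[i] for i in members}
        if len(member_labels) != 1:
            return False  # this stable cluster was split
        label = member_labels.pop()
        if label in used:
            return False  # two stable clusters were merged into one
        used.add(label)
    return True


def stability_levels(points, stable_clusters, min_k, max_k, max_trial, cluster_fn):
    """Success rate of finding the stable clusters, for each number of clusters k."""
    levels = {}
    for k in range(min_k, max_k + 1):
        successful = 0  # reset for each k
        for _ in range(max_trial):
            labels = cluster_fn(points, k)
            if distinctly_found(labels, stable_clusters):
                successful += 1
        levels[k] = successful / max_trial
    return levels


# Hypothetical deterministic stand-in for k-means: bins integer positions
# in [0, 100) into k equal-width bins.
def bin_clusterer(points, k):
    return [x * k // 100 for x in points]


positions = [5, 10, 15, 80, 85, 50]
stable = [[0, 1, 2], [3, 4]]  # indices of the pre-labeled stable clusters
print(stability_levels(positions, stable, 2, 6, 3, bin_clusterer))
# {2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 0.0}
```

With a randomized clusterer the rates fall between 0 and 1, and a larger `max_trial` gives a finer-grained estimate for values of k where success is rare.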

4. RESULT AND DISCUSSION

Initially, we manually conduct several trials to estimate the minimum and maximum number of k-means clusters for which the resulting clusters contain the stable clusters distinctly. Based on the initial trials, we estimate the minimum k = 4 and the maximum k = 21. Then, we calculate the stability level of the two hierarchical stable clusters for numbers of clusters ranging from the minimum k = 4 to the maximum k = 21, following Algorithm 1. We run five sets of experiments with the maximum trial equal to 50, 500, 5,000, 50,000, and 500,000. In each experiment, the stability level of the two hierarchical stable clusters is measured for each number of k-means clusters by calculating the success rate of obtaining the two hierarchical stable clusters in the generated k clusters, as shown in Figure 4. The higher the number of trials, the more likely we are to find the two hierarchical stable clusters distinctly in the generated k clusters when the number of clusters is large. For example, within 50 trials, we cannot find the two hierarchical stable clusters distinctly in the generated k clusters for a large number of clusters (k > 14). However, within 50,000 and 500,000 trials, we can find the two hierarchical stable clusters distinctly in the generated k clusters for every number of clusters between the minimum k = 4 and the maximum k = 21, even though the success rate decreases as the number of clusters increases. In all five experiments, the stability level of the two hierarchical stable clusters is highest (0.78) at 5 clusters. Therefore, we take the 5 clusters shown in Figure 5 as the best clusters of Indonesian ethnic languages to refer to when selecting target languages for computational linguistic research that depends on language similarity or cognate recognition. We further plot the 5 clusters on a geographical map as shown in Figure 6.

[Figure 4: five bar charts of Success Rate (0 to 1) against the number of clusters k (4 to 21). Bar labels per panel:]

k    (a) 50 Trials  (b) 500 Trials  (c) 5,000 Trials  (d) 50,000 Trials  (e) 500,000 Trials
4    0.66           0.758           0.7538            0.75526            0.75787
5    0.78           0.784           0.7852            0.78222            0.77950
6    0.48           0.558           0.5556            0.55570            0.55481
7    0.28           0.408           0.3756            0.37472            0.37357
8    0.42           0.230           0.2542            0.25894            0.25639
9    0.20           0.140           0.1804            0.17536            0.17515
10   0.12           0.104           0.1194            0.11554            0.11639
11   0.08           0.076           0.0760            0.07222            0.07434
12   0.04           0.046           0.0438            0.04466            0.04515
13   0.02           0.030           0.0230            0.02680            0.02644
14   0.06           0.010           0.0144            0.01490            0.01425
15   0              0.008           0.0066            0.00754            0.00742
16   0              0.004           0.0054            0.00316            0.00333
17   0              0.002           0.0010            0.00172            0.00142
18   0              0.002           0.0006            0.00056            0.00054
19   0              0               0.0008            0.00012            0.00018
20   0              0               0                 0.00004            0.00004
21   0              0               0                 0.00004            0.00001

Figure 4. Obtaining Stable Clusters in n Trials

Figure 5. K-means Clusters of 32 Indonesian Ethnic Languages – 5 Clusters

Figure 6. Similarity Clusters Map of 32 Indonesian Ethnic Languages – 5 Clusters

5. CONCLUSION

We utilized the ASJP database to generate the language similarity matrix, then generated the hierarchical clusters with complete linkage and mean linkage clustering, and further extracted two stable clusters with the highest language similarities. We applied our extended k-means clustering with semi-supervised learning to evaluate the stability level of the hierarchical stable clusters, that is, how often they remain grouped together as the number of clusters changes. The higher the number of trials, the more likely we are to find the two hierarchical stable clusters distinctly in the generated k clusters. However, in all five experiments, the stability level of the two hierarchical stable clusters is highest (0.78) at 5 clusters. Therefore, we take the 5 clusters as the best clusters of Indonesian ethnic languages to refer to when selecting target languages for computational linguistic research that depends on language similarity or cognate recognition. Finally, we plot the generated 5 clusters on a geographical map. Our algorithm can be used to find and evaluate other stable clusters of Indonesian ethnic languages or other language sets.

ACKNOWLEDGEMENT

This research was partially supported by a Grant-in-Aid for Scientific Research (A) (17H00759, 2017-2020) and a Grant-in-Aid for Young Scientists (A) (17H04706, 2017-2020) from Japan Society for the Promotion of Science (JSPS). The first author was supported by Indonesia Endowment Fund for Education (LPDP).

REFERENCES

[1] T. Ishida, Y. Murakami, D. Lin, T. Nakaguchi and M. Otani, “Language Service Infrastructure on the Web: The Language Grid,” IEEE Computer, vol. 51, Issue 6, pp. 72-81, June, 2018.
[2] T. Ishida, “Intercultural collaboration and support systems: A brief history,” in International Conference on Principles and Practice of Multi-Agent Systems (PRIMA 2016), pages 3-19. Springer, 2016.
[3] A. H. Nasution, N. Syafitri, P. R. Setiawan and D. Suryani, “Pivot-Based Hybrid Machine Translation to Support Multilingual Communication,” in International Conference on Culture and Computing (Culture and Computing), Kyoto, Japan, 2017, pp. 147-148. doi: 10.1109/Culture.and.Computing.2017.22.
[4] A. H. Nasution, “Pivot-Based Hybrid Machine Translation to Support Multilingual Communication for Closely Related Languages,” World Transactions on Engineering and Technology Education, 16, 2, 12-17, 2018.
[5] A. H. Nasution, Y. Murakami, and T. Ishida, “Designing a Collaborative Process to Create Bilingual Dictionaries of Indonesian Ethnic Languages,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Paris, France, 3397-3404, 2018.
[6] R. Tanaka, Y. Murakami and T. Ishida, “Context-Based Approach for Pivot Translation Services,” in International Joint Conference on Artificial Intelligence (IJCAI-09), pp. 1555-1561, 2009.
[7] E. W. Pamungkas, R. Sarno, and A. Munif, “B-BabelNet: Business-Specific Lexical Database for Improving Semantic Analysis of Business Process Models,” Telkomnika, 15(1), 407, 2017.
[8] H. Hassan, “A framework for Arabic concept-level sentiment analysis using SenticNet,” International Journal of Electrical and Computer Engineering (IJECE), 8(6), 2018.
[9] P. Bajpai, P. Verma and S. Q. Abbas, “Two Level Disambiguation Model for Query Translation,” International Journal of Electrical and Computer Engineering (IJECE), 8(5), 2018.
[10] Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig (eds.), Ethnologue: Languages of the World, Eighteenth edition. Dallas, Texas: SIL International. Online version: http://www.ethnologue.com, 2015.
[11] A. H. Nasution, Y. Murakami, and T. Ishida, “Constraint-based bilingual lexicon induction for closely related languages,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 3291-3298, Paris, France, May, 2016.
[12] A. H. Nasution, Y. Murakami and T. Ishida, “A generalized constraint approach to bilingual dictionary induction for low-resource language families,” ACM Trans. Asian Low-Resour. Lang. Inf. Process., 17, 2, Article 9 (November 2017), 29 pages, 2017.
[13] A. H. Nasution, Y. Murakami and T. Ishida, “Plan Optimization for Creating Bilingual Dictionaries of Low-Resource Languages,” in International Conference on Culture and Computing (Culture and Computing), Kyoto, Japan, 2017, pp. 35-41. doi: 10.1109/Culture.and.Computing.2017.21.
[14] M. Wushouer, D. Lin, T. Ishida and K. Hirayama, “A constraint approach to pivot-based bilingual dictionary induction,” ACM Trans. Asian Low-Resour. Lang. Inf. Process., 15(1):4:1-4:26, November, 2015.
[15] L. Campbell. Historical Linguistics. Edinburgh University Press, 2013.
[16] W. P. Lehmann. Historical linguistics: an introduction. Routledge, 2013.
[17] L. Campbell and W. J. Poser. Language classification. History and method. Cambridge, 2008.
[18] M. Swadesh, “Salish Internal Relationships,” International Journal of American Linguistics, vol. 16, 157-167, 1950.
[19] M. Swadesh, “Lexicostatistic Dating of Prehistoric Ethnic Contacts,” in Proceedings of the American Philosophical Society, vol. 96, 452-463, 1952.
[20] M. Swadesh. The Origin and Diversification of Language, Ed. post mortem by Joel Sherzer. Chicago: Aldine, p. 283, 1971.
[21] M. Swadesh, “Towards Greater Accuracy in Lexicostatistic Dating,” International Journal of American Linguistics, vol. 21, 121-137, 1955.
[22] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet physics doklady, vol. 10, No. 8, pp. 707-710, 1966.
[23] B. Kessler, “Computational dialectology in Irish Gaelic,” in Proceedings of the seventh conference on European chapter of the Association for Computational Linguistics (EACL ’95), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 60-66. DOI: https://doi.org/10.3115/976973.976983
[24] W. J. Heeringa. Measuring dialect pronunciation differences using Levenshtein distance, Doctoral dissertation, University Library Groningen, 2004.
[25] C. Tang and V. J. van Heuven, “Predicting mutual intelligibility of Chinese dialects from multiple objective linguistic distance measures,” Linguistics, 53(2), 285-312, 2015.
[26] F. Petroni and M. Serva, “Language distance and tree reconstruction,” Journal of Statistical Mechanics: Theory and Experiment 2008, no. 08 (2008): P08012.
S.
W
ichmann,
E.
W
.
Holman,
D.
Bakk
er
,
and
C.
H.
Bro
wn,
“Ev
aluating
linguistic
distance
measures,
”
Gener
ating
similarity
cluster
of
Indonesian
langua
g
es...
(Arbi
Haza
Nasution)
Evaluation Warning : The document was created with Spire.PDF for Python.
538
ISSN:
2088-8708
Ph
ysica
A:
Statistical
Mechanics
and
its
Applications
,
389(17),
3632-3639,
2010.
[28] E. W. Holman, C. H. Brown, S. Wichmann, A. Müller, V. Velupillai, H. Hammarström, S. Sauppe, H. Jung, D. Bakker and P. Brown, “Automated dating of the world’s language families based on lexical similarity,” Current Anthropology 52, 6, 841-875, 2011.
[29] E. W. Holman, S. Wichmann, C. H. Brown, V. Velupillai, A. Müller, and D. Bakker, “Explorations in automated language classification,” Folia Linguistica, 42(3-4), 331-354, 2008.
[30] G. S. Mann and D. Yarowsky, “Multipath translation lexicon induction via bridge languages,” in Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, Association for Computational Linguistics, 1-8, 2001.
[31] G. Nagy, “State of the art in pattern recognition,” in Proceedings of the IEEE, 56, no. 5, pp. 836-863, 1968.
[32] M. Narasimhan, N. Jojic and J. Bilmes, “Q-clustering,” Advances in Neural Information Processing Systems, 2006.
[33] M. F. Balcan, Y. Liang and P. Gupta, “Robust hierarchical clustering,” The Journal of Machine Learning Research, 15(1), 3831-3871, 2014.
[34] S. Guha, R. Rastogi and K. Shim, “ROCK: A robust clustering algorithm for categorical attributes,” Information systems, 25(5), 345-366, 2000.
[35] P. Langfelder and S. Horvath, “Fast R functions for robust correlations and hierarchical clustering,” Journal of statistical software, 46(11), 2012.

BIOGRAPHY OF AUTHORS

Arbi Haza Nasution is currently working toward the Ph.D. degree in Social Informatics at the Graduate School of Informatics, Kyoto University. He obtained a Bachelor's degree in Computer Science from the National University of Malaysia in 2010 and a Master's degree in Management Information System from the National University of Malaysia in 2012. He has been a Lecturer with the Department of Informatics Engineering, Universitas Islam Riau, Indonesia, since 2013. His current research interests include computational linguistics, natural language processing and machine learning.

Yohei Murakami received the Ph.D. degree in informatics from Kyoto University, Kyoto, Japan, in 2006. He has been an Associate Professor with Ritsumeikan University since 2018. He currently leads the research and development of the Language Grid, the purpose of which is to share various language resources as Web services and enable users to create new services. He received the Achievement Award of the Institute of Electronics, Information and Communication Engineers for this work in 2013. His current research interests include services computing and multiagent systems. He founded the Technical Committee on Services Computing with the Institute of Electronics, Information and Communication Engineers in 2012.

Toru Ishida has been a Professor with Kyoto University, Kyoto, Japan, since 1993. His current research interests include autonomous agents and multiagent systems. He has performed research in the above areas for over 20 years. Since 2006, he has been running the Language Grid Project. Prof. Ishida served as the Program Co-Chair of the second ICMAS, the Chair of the first PRIMA, and the General Co-Chair of the first AAMAS. He was also an Editor-in-Chief of the Journal on Web Semantics (Elsevier) and an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the Journal on Autonomous Agents and Multi-Agent Systems (Springer). He was a Board Member of the International Foundation on Autonomous Agent and Multiagent Systems. He has also started workshops/conferences on digital cities and intercultural collaboration.