TELKOMNIKA, Vol.12, No.3, September 2014, pp. 581~588
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v12i3.79
Received February 9, 2014; Revised May 14, 2014; Accepted June 10, 2014
A Novel Part-of-Speech Set Developing Method for Statistical Machine Translation
Herry Sujaini, Kuspriyanto, Arry Akhmad Arman, Ayu Purwarianti
School of Electrical Engineering and Informatics
Bandung Institute of Technology, Jl. Ganesha No. 10, Bandung, Indonesia
e-mail: herry_sujaini@yahoo.com
Abstract
Part of speech (PoS) is one of the features that can be used to improve the quality of statistical machine translation. Typically, the PoS set of a language is determined from the grammar of the language or adopted from the PoS sets of other languages. This work aims to formulate a model that automatically develops a PoS set as a linguistic factor for improving machine translation quality. The model is based on a word similarity approach, in which we perform word clustering on a corpus. The result of the word clustering is defined as the PoS set for the given language. The PoS sets produced by word clustering were compared with a manually defined PoS set in a machine translation (MT) experiment that used English as the source language and Indonesian as the target language.

Keywords: method, part-of-speech, statistical machine translation, moses, word similarity
1. Introduction
The dream of automatically translating documents between two languages is one of the oldest pursuits of artificial intelligence research. Now, armed with vast amounts of example translations and powerful computers, we can witness significant progress toward achieving that dream. Statistical analysis of bilingual parallel corpora allows for the automatic construction of machine translation systems. Already, for some language pairs, statistical systems are the best machine translation systems currently available.
Statistical machine translation is corpus-based and consequently requires a parallel corpus to learn a model [1],[2]. Parallel corpora are different from normal text corpora in that they are not just a collection of texts, but are bilingual or multilingual and structured so that every sentence is linked to its translations.
Some works have shown that translation quality can be increased by using additional features such as lemma, part of speech (PoS), gender and others. Koehn and Hoang [3] reported that adding a part-of-speech factor to an English-German translation system increased translation quality from 18.04% to 18.15%. They also showed that using morphological factors and part-of-speech increased the quality of an English-Spanish translation system from 23.41% to 24.25%. Youssef et al. [4] examined adding part-of-speech factors to a statistical translation system for English-Arabic; their results showed that the part-of-speech factor improved translation quality from 0.6095% to 0.6394%. Razavian and Vogel [5] examined adding factors to statistical translation systems: for the English-Iraqi system, translation quality improved from 15.62% to 16.41%; for the Spanish-English system, from 32.53% to 32.84%; and for the Arabic-English system, from 41.70% to 42.74%.
For English-Indonesian, Sujaini et al. [6] studied the addition of PoS factors to a factored statistical translation system. The results indicated that the PoS factor increased the quality of English-Indonesian translation by 2 percentage points, from 31.26% to 33.26%.
Grammatically, words can be divided into two categories: open class and closed class. An open class is a category whose number of words keeps increasing over time, while a closed class is a category whose words are fixed. These grammatically different categories of words are commonly called parts of speech [1].
The function of PoS in natural language processing is to provide information about a word and the words around it. This applies to general categories (noun vs. verb) as well as to more specialized ones, for example a set of tags that distinguishes possessive pronouns (my, your, his, her, its) from personal pronouns (I, you, he, she) [7]. PoS tagging, in turn, is the process of labeling each word in a sentence with the appropriate tag from a PoS set [8].
In general, a set of tags encodes both the target feature of classification, which tells the user useful information about the grammatical class of a word, and the predictive features, which encode features that are useful in predicting the behavior of other words in the context. The two tasks should overlap, but they are not always identical [9].
PoS generally refers to a class of words used in a particular language, and each language has different PoS categories. Word classes for Greek were defined by Dionysius Thrax around 100 BC and consist of eight classes: noun, verb, pronoun, preposition, adverb, conjunction, participle, and article. Indonesian word classes are divided into verbs, adjectives, nouns, numerals, pronouns, adverbs, conjunctions, demonstratives, interjections, interrogatives, articles, prepositions, and reduplication [10].
PoS sets for various languages have been developed for computational use. One of them is the Penn Treebank by the LINC Laboratory, Computer and Information Science, University of Pennsylvania [11], which divides English words into 48 PoS. Previously, Francis [12] divided English words into 87 PoS in the Brown corpus. Additionally, Garside et al. [13] divided English words into 146 PoS for the C7 tagset.
Various Indonesian PoS sets have been used in natural language processing research. In the PAN Localization Project, an Indonesian PoS set was developed in 2009 specifically for translation into English [14]; it is based on the Penn Treebank PoS tagset [11] and consists of 29 PoS tags. Pisceldo et al. [15] defined 37 tags for Indonesian. Wicaksono and Purwarianti [16],[17] used a 35-tag tagset modified from the tagsets of Adriani [14] and Pisceldo et al. [15]. Lastly, Larasati et al. [18] used only 19 tags in their work.
Several other works also show variation in the size of the tagset used across languages. For Arabic, Hajic et al. [19] used 21 tags in the Arabic treebank data and tools. Brants et al. [20] used 54 tags to build the TIGER treebank for German. Simov et al. [21] used 54 tags to build a corpus of Bulgarian. Csendes et al. [22] used 43 tags to build the Szeged treebank for Hungarian. Civit and Martí [23] used 47 tags to build a Spanish treebank. For part-of-speech tagger development, Avontuur et al. [24] used 25 tags for Dutch, Singha et al. [25] used 97 tags for Manipuri, and Neunerdt et al. [26] used 54 tags for German.
In this article, we propose a method to determine a PoS set automatically using a word similarity approach for Indonesian. The contributions of this research are a novel method for developing the PoS set of a language automatically and an alternative Indonesian PoS set for use in statistical machine translation.
2. Developing Part-of-Speech Set Method
The input of this method is a monolingual corpus that contains a collection of sentences. The output of this method is a PoS set. The model for determining a PoS set computationally consists of 4 (four) steps: computing word similarity, word clustering, cluster visualization, and PoS categorization, as shown in Figure 1.
Step 1: Computing word similarity
At this step, the monolingual corpus is processed using the Extended Word Similarity Based (EWSB) algorithm, which was developed and presented by Sujaini et al. [27]. The mutual information between w1 and w2 is defined as:
[Equation (1): definition of the mutual information between w1 and w2; the formula is not recoverable from this copy]   (1)
and the word similarity between w1 and w2 is defined as:
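[Equation (2): definition of the word similarity sim(w1, w2) as a ratio of sums of mutual information terms; the formula is not recoverable from this copy]   (2)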
The output of this step is a list of word pairs along with the similarity value.
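To make the shape of this output concrete, the sketch below builds a ranked list of word pairs with a similarity value from a tokenized monolingual corpus. It does not implement EWSB: as a stand-in it uses cosine similarity over counts of immediate left and right neighbours, which is an assumption made only for illustration. The function names `context_vectors`, `cosine` and `similarity_list` are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences):
    """Count the immediate left and right neighbours of every token."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for i in range(1, len(padded) - 1):
            vectors[padded[i]][("L", padded[i - 1])] += 1
            vectors[padded[i]][("R", padded[i + 1])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (stand-in, not EWSB)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_list(sentences):
    """Return (word1, word2, score) triples, highest score first."""
    vectors = context_vectors(sentences)
    words = sorted(vectors)
    pairs = [
        (w1, w2, cosine(vectors[w1], vectors[w2]))
        for i, w1 in enumerate(words)
        for w2 in words[i + 1:]
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Calling `similarity_list` on a list of token lists returns triples in the same layout as a word similarity table: two words and their score, sorted from most to least similar.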
Figure 1. Block Diagram of the Part of Speech Set Determination Model
Step 2: Word clustering
Word clustering at this step uses an agglomerative approach, customized to record the clustering history in Newick format. Adopted in 1986, Newick format (Newick notation) is a way to represent graph-theoretical trees using parentheses and commas [28].
The agglomerative algorithm, adjusted to produce its results in Newick format, is as follows:
1. Initialize each unique word (token) as a cluster.
2. Calculate the similarity between each pair of clusters.
3. Rank all pairs of clusters by similarity, then combine the two top clusters.
4. Add the combined clusters to the Newick-format output.
5. Stop when a single cluster remains; otherwise, return to step 2.
To calculate the similarity between two clusters in step 2, we used the formula in equation (3) [27]:
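[Equation (3): sim(C1, C2), a weighted double sum of the word similarities sim(w1, w2) over w1 ∈ C1 and w2 ∈ C2; the weighting factor is not recoverable from this copy]   (3)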
where N1 and N2 denote the numbers of words in the classes C1 and C2, respectively. Jeff et al. [29] added the term […] to the class similarity computation, which tends to give smaller classes a higher priority for merging. In our experiments we set […] ≈ 0.
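The five clustering steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cluster similarity is assumed to be the plain average of pairwise word similarities (sum divided by N1·N2), with no extra term favouring small classes, and the names `cluster_similarity`, `agglomerate`, `example_scores` and `lookup` are hypothetical. The example scores are rounded from Table 1.

```python
from itertools import combinations


def cluster_similarity(c1, c2, sim):
    """Assumed average linkage: mean word similarity across the two clusters."""
    total = sum(sim(w1, w2) for w1 in c1 for w2 in c2)
    return total / (len(c1) * len(c2))


def agglomerate(words, sim):
    """Merge clusters bottom-up and return the merge history as a Newick string."""
    if not words:
        raise ValueError("need at least one word")
    # Step 1: every unique word starts as its own cluster, written "(word)".
    clusters = [((w,), f"({w})") for w in words]
    # Step 5: repeat until a single cluster remains.
    while len(clusters) > 1:
        # Steps 2-3: score every pair of clusters and pick the most similar pair.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_similarity(
                clusters[ij[0]][0], clusters[ij[1]][0], sim
            ),
        )
        (members1, newick1), (members2, newick2) = clusters[i], clusters[j]
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        # Step 4: record the merge in Newick notation.
        clusters = [(members1 + members2, f"({newick1},{newick2})")] + rest
    return clusters[0][1]


# Illustrative scores, rounded from Table 1.
example_scores = {
    frozenset(("dilakukan", "dilaksanakan")): 0.080,
    frozenset(("dilakukan", "digunakan")): 0.073,
    frozenset(("digunakan", "dilaksanakan")): 0.051,
}


def lookup(w1, w2):
    """Symmetric score lookup; unseen pairs count as 0."""
    return example_scores.get(frozenset((w1, w2)), 0.0)
```

With these three verbs, `agglomerate(["dilakukan", "dilaksanakan", "digunakan"], lookup)` returns `(((dilakukan),(dilaksanakan)),(digunakan))`, the same nesting that appears for these words in the verb clustering result of Section 3.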
Step 3: Cluster visualization
The results of hierarchical clustering are illustrated with a dendrogram, a diagram that describes the cluster grouping. At this stage, the Newick-format output generated in the previous stage is used as input to obtain a dendrogram visualization of the clusters. We use "Dendroscope" to display the clusters; it can be accessed at http://www-ab2.informatik.uni-tuebingen.de/software/dendroscope/.
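Before loading a Newick string into a viewer, it can be useful to check how deeply each word is nested, since words that share a deep subtree were merged early. The sketch below is a small helper written for this illustration only (`leaf_depths` is a hypothetical name, not part of Dendroscope); it assumes unlabeled internal nodes, no branch lengths, and unique leaf labels.

```python
def leaf_depths(newick):
    """Map each leaf label to its parenthesis nesting depth."""
    depths = {}
    depth = 0
    label = []
    for ch in newick:
        if ch in "(),":
            # A delimiter ends the label that is currently being read.
            if label:
                depths["".join(label)] = depth
                label = []
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
        elif not ch.isspace():
            label.append(ch)
    return depths
```

For the string `(((a),(b)),(c))`, the helper reports depth 3 for `a` and `b` and depth 2 for `c`, reflecting that `a` and `b` were combined first.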
Step 4: PoS categorization
The last process of this model is PoS categorization, which is performed manually from the dendrogram visualization. The output of this process is the grouping and naming of the PoS.
3. Determining Indonesian PoS Set
The purpose of this experiment is to determine the Indonesian PoS set computationally. In this experiment, we use an Indonesian corpus of 171K sentences, which has 3.4M tokens (114K unique tokens).
We determined the PoS set in two (2) ways: clustering the words of each PoS category separately, and clustering all words as a whole. In the separate approach, we classified selected words that fit each category. The PoS categories used are:
verbs, nouns, adjectives, numerals, adverbs, conjunctions and other categories. We chose some appropriate and varied words from the list of unique tokens (uni-grams) for each category. As an example, we computed the word similarity among the words in the verbs category; the computational process produces a word similarity list, whose 20 highest scores can be seen in Table 1.
Table 1. Word Similarity Scores for Verbs Category
No   Word 1                       Word 2                       Word Similarity Score
1    membaik (getting better)     melemah (weakened)           0.1257617113
2    menguat (strengthened)       melemah (weakened)           0.0984328977
3    membaik (getting better)     menguat (strengthened)       0.0810526508
4    dilakukan (do)               dilaksanakan (implemented)   0.0801780730
5    mengatakan (say)             menyatakan (state)           0.0738027565
6    dilakukan (do)               digunakan (used)             0.0725432867
7    memberikan (provide)         memberi (give)               0.0692650245
8    berdiri (stand)              duduk (sit)                  0.0629038361
9    dibuat (made)                dilaksanakan (implemented)   0.0606981494
10   membaik (getting better)     memburuk (deteriorate)       0.0597877822
11   diatur (regulated)           dilaksanakan (implemented)   0.0562082758
12   digunakan (used)             dibuat (made)                0.0550282651
13   dilaksanakan (implemented)   berkembang (thrive)          0.0543281608
14   digunakan (used)             ditemukan (found)            0.0517041060
15   digunakan (used)             dilaksanakan (implemented)   0.0505865651
16   diatur (regulated)           dibuat (made)                0.0496155258
17   dilakukan (do)               dibuat (made)                0.0486961755
18   duduk (sit)                  tidur (sleep)                0.0473421617
19   dilakukan (do)               diberikan (given)            0.0457435151
20   bergerak (move)              berkembang (thrive)          0.0446583898
From the results of the above process, we proceeded to the next step, i.e. grouping the words to obtain the clustering results in Newick format. The word similarity clustering result for the verb PoS category is:
(((((((((((diberikan),((ditemukan),(((dibuat),(((dilakukan),(dilaksanakan)),(digunakan))),(diatur)))),((bergerak),(berkembang))),(((bermain),(bertemu)),(((berdiri),(duduk)),((makan),(tidur))))),((mandi),(minum))),(terbawa)),((((memburuk),((menguat),((melemah),(membaik)))),((mengecil),(melambat))),((membesar),(memudar)))),(terpakai)),(terdengar)),((ialah),(((((adalah),(merupakan)),(((((memberikan),(mendapatkan)),(mempunyai)),(menggunakan)),((membuat),(melakukan)))),((((mengatakan),(menyatakan)),(melihat)),(merasa))),(mengalami)))),(((((ingin),(((akan),(dapat)),(harus))),(sudah)),(boleh)),(mesti)))
Furthermore, we visualized the verbs PoS category with the Dendroscope software; the resulting visualization is shown in Figure 2.
Figure 2. Dendrogram Visualization of Verbs Category Clustering Results
Next, we determined the PoS for the verbs category based on the dendrogram visualization. The verb tags already in use in the grammar-based PoS are VBT (transitive verb), VBI (intransitive verb) and MD (modal). We can see that MD (akan, dapat, harus, etc.) and VBT (membuat, melakukan, merupakan, etc.) form separate groups, while VBI is dispersed into several groups: a group of passive verbs (dibuat, digunakan, dilaksanakan, etc.), a group with the meaning of the verb "to be" (melemah, membaik, mengecil, etc.), and other scattered VBI. Based on the foregoing, we recommend the verbs PoS set shown in Table 2, derived from the computational results. In the same way, we experimented with the other types of words in order to obtain the PoS set for Indonesian shown in Table 3.
Table 2. PoS Set Recommended for Verbs Category
No   Word Examples                                PoS Tag   Description
1    dapat, akan, ingin, sudah                    MD        Modal
2    mengatakan, melakukan, membuat, melihat      VBT       Transitive
3    duduk, minum, mandi, berkembang, terpakai    VBI       Intransitive
4    digunakan, dibuat, diatur, dilaksanakan      VBI1      Passive verbs
5    melemah, membaik, mengecil, memudar          VBI2      Meaning of the verb "to be"
Table 3. Indonesian PoS Set Recommended by Computational Based
No   Tag    Description                      Word Examples
1    OP     Opening parenthesis              ( { [
2    CP     Closing parenthesis              ) } ]
3    GM     Slash                            /
4    ;      Semicolon                        ;
5    :      Colon                            :
6    “      Quotation                        “ ’
7    .      Sentence terminator              . ? !
8    ,      Comma                            ,
9    -      Dash                             -
10   ...    Ellipsis                         ...
11   JJ1    Adjectives 1                     panjang, kuat, indah, besar
12   JJ2    Adjectives 2                     genap, buntu, negatif
13   RB     Adverbs                          sekedar, hampir, tidak
14   RB1    Adverbs 1                        sangat, amat, cukup, paling
15   NN     Common noun                      mobil, air, negara
16   NNP    Proper nouns                     tvri, jokowi, persib
17   NNG    Genitive nouns                   bukunya, hatinya
18   VBI    Intransitive verb                duduk, pergi, makan
19   VBI1   Intransitive verb 1              dibuat, diambil
20   VBI2   Intransitive verb 2              mengecil, menguat
21   VBT    Transitive verb                  membeli, memukul
22   IN     Preposition                      di, ke, dari
23   MD     Modal                            akan, harus
24   CC     Coor-conjunction                 dan, atau, ketika, jika
25   DT     Determiner                       ini, itu
26   UH     Interjections                    wah, aduh, oi
27   CDO    Ordinal numerals                 pertama, kedua
28   CDC    Collective numerals              berdua, bertiga
29   CDP    Primary numerals                 1, 2, 3
30   CDP1   Primary numerals 1               satu, dua
31   CDP2   Primary numerals 2               puluh, ribu, juta
32   CDP3   Primary numerals 3               1990, 2001, 2013
33   CDI    Irregular numerals               beberapa
34   PRP    Personal pronoun                 saya, kamu
35   WP     WH-pronouns                      apa, siapa
36   PRN    Number pronouns                  kedua-duanya
37   PRL+   Locative proper nouns/pronouns   sini, situ, Jakarta, Bali
38   SYM    Symbols                          @#$%^&
39   RP     Particles                        pun, kah
40   FW     Foreign words                    foreign, Word
41   ART    Articles                         sang, si, para
42   COP    Copula                           adalah, bukan, merupakan
4. Experiments on SMT
The purpose of this experiment is to compare the accuracy of a translation system that uses the computationally derived PoS with that of a translation system that uses a grammar-based PoS. In addition, we compared both with translation without PoS features. For the grammar-based PoS, this work used Wicaksono's PoS set, hereinafter called Grammar PoS.
We used several instruments in this experiment: Moses [1] as the machine translator, SRILM [30] to build the language and PoS models, Giza++ [31] for the word alignment process, and the Grammar PoS tagger for PoS tagging. Furthermore, we used the BLEU method [32] for scoring the translation results. We used a parallel corpus for training the translation model and a monolingual corpus for training the language model.
We used the "Identic" parallel corpus [33], which contains 27K English-Indonesian sentence pairs. The monolingual corpus is the same 170K-sentence corpus used in the clustering experiments.
We tested the factor-based statistical machine translation by PoS-tagging the English-Indonesian parallel corpus. The test set totals 1,500 sentences in 5 test groups of 300 sentences each, with sentence lengths of 10, 15, 20, 25 and 30 words (reference sentence).
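As an illustration of what a PoS-tagged training line looks like for a factored system: Moses reads factored corpora in which the factors of each token are joined by a vertical bar. The sketch below formats one sentence that way. The helper name `to_factored` is hypothetical, the tokens and tags are assumed to be already aligned one to one, and the example tags are taken from Table 3.

```python
def to_factored(tokens, tags):
    """Join each surface token with its PoS tag as 'token|TAG'."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    for token in tokens:
        # The bar is the factor separator, so it cannot appear inside a token.
        if "|" in token:
            raise ValueError(f"token contains the factor separator: {token!r}")
    return " ".join(f"{token}|{tag}" for token, tag in zip(tokens, tags))


# Example sentence: "saya harus pergi" (I must go).
line = to_factored(["saya", "harus", "pergi"], ["PRP", "MD", "VBI"])
```

Here `line` is `saya|PRP harus|MD pergi|VBI`. Swapping the tag list is all that changes between a run with one PoS set and a run with another, which is why the two tagsets can be compared on the same parallel corpus.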
The BLEU scores of the SMT experiments can be seen in Table 4. The increase in BLEU score of the translation results using computational PoS and Grammar PoS over the translation results without PoS is illustrated in Figure 3.
From Table 4 we can see that translation accuracy using Grammar PoS is better than without PoS, and that the computationally derived PoS improves the accuracy of the translation results further compared with Grammar PoS.
The increase in accuracy due to PoS features is larger on short sentences. The best improvement from computational PoS is a relative gain of 8.89%, on the corpus containing sentences 10 words long, while the lowest increase, 1.57%, occurs on corpus E, which contains sentences 30 words long. Compared with Grammar PoS, SMT with computational PoS increases average accuracy by 4.13%. The average increase in translation accuracy from Grammar PoS over no PoS is 2.23%.
Table 4. BLEU Score of Grammar and Computational PoS
Corpus    Base (no PoS)   Grammar PoS   Computational PoS
A         56.93           57.90         61.99
B         47.86           49.06         51.38
C         44.98           46.94         48.56
D         43.52           44.92         46.56
E         55.39           55.44         56.26
Average   49.74           50.85         52.95
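The percentages quoted in the text are relative gains over the lower BLEU score, not differences in BLEU points. The short script below recomputes them from the Table 4 values; the dictionary layout and the function name `relative_gain` are choices made for this illustration.

```python
# BLEU scores from Table 4: (base without PoS, Grammar PoS, computational PoS).
bleu = {
    "A": (56.93, 57.90, 61.99),
    "B": (47.86, 49.06, 51.38),
    "C": (44.98, 46.94, 48.56),
    "D": (43.52, 44.92, 46.56),
    "E": (55.39, 55.44, 56.26),
    "Average": (49.74, 50.85, 52.95),
}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


for corpus, (base, grammar, comp) in bleu.items():
    print(
        f"{corpus:8s}"
        f" grammar vs base: {relative_gain(grammar, base):5.2f}%"
        f"  comp vs base: {relative_gain(comp, base):5.2f}%"
        f"  comp vs grammar: {relative_gain(comp, grammar):5.2f}%"
    )
```

Rounded to two decimals, the script reproduces the reported figures: 8.89% for corpus A and 1.57% for corpus E (computational PoS over the base), and 2.23%, 6.45% and 4.13% on the averages.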
Figure 3. Graph of Translation Accuracy Against Without PoS
[Chart: test groups A to E on the horizontal axis, vertical axis ticks from 0 to 6; series: Grammar PoS and Computational PoS]
Examples of BLEU scores from each group can be seen in Table 5, which lists the English source sentence, a reference translation, and the translations with Grammar PoS and computational PoS for cases of increased, unchanged, and decreased accuracy.
Based on the experimental results, we conclude that a computationally generated PoS set can reduce the weaknesses of a grammar-based PoS set and thereby improve the quality of statistical machine translation. This is because grammar-based PoS is generally determined by function and meaning, which does not guarantee that words in the same PoS category have a similar distribution within sentences.
Table 5. BLEU Score for Grammar and Computational PoS Used
No   Sentences                                                                                                BLEU Score (%)
1    Input:    did i not just say i 'm saving the film ?
     Ref:      bukan kah saya sudah bilang untuk menghemat film nya ?
     Grammar:  apa kah saya tidak hanya bilang aku untuk menghemat film nya ?                                 37.70
     Comp:     bukan kah saya sudah bilang untuk menghemat film nya ?                                         100.00
2    Input:    the challenges to meet the investment needs will come from the government itself
     Ref:      tantangan pemenuhan kebutuhan investasi itu justru berasal dari pemerintah sendiri
     Grammar:  tantangan pemenuhan kebutuhan investasi itu akan datang dari pemerintah sendiri               52.54
     Comp:     tantangan pemenuhan kebutuhan investasi itu akan datang dari pemerintah sendiri               52.54
3    Input:    proven that all policies are aimed for successing liberalization implementation
     Ref:      terbukti bahwa segala kebijakan ditujukan untuk menyukseskan berlaku nya liberalisasi
     Grammar:  terbukti bahwa segala kebijakan ditujukan untuk menyukseskan berlaku nya liberalisasi         100.00
     Comp:     terbukti bahwa semua kebijakan ditujukan untuk menyukseskan berlaku nya liberalisasi          70.71
5. Conclusion
The model for determining a PoS set computationally consists of 4 (four) steps: computing word similarity, word clustering, cluster visualization, and PoS categorization. From the experimental results, we recommend a set of 42 Indonesian PoS tags for machine translation. The average increase in translation accuracy from Grammar PoS over no PoS is 2.23%. The computationally derived PoS improves accuracy by 6.45% compared with translation without PoS and by about 4.13% compared with Grammar PoS. For both Grammar PoS and the computational PoS, the accuracy gains are low on long sentences (30 words).
References
[1] Koehn P. Statistical Machine Translation. New York: Cambridge University Press. 2010.
[2] Peng L. A Survey of Machine Translation Methods. TELKOMNIKA Indonesian Journal of Electrical Engineering. 2013; 11(12): 7125-7130.
[3] Koehn P, Hoang H. Factored Translation Models. Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Prague. 2007: 868-876.
[4] Youssef I, Sakr M, Kouta M. Linguistic Factors in Statistical Machine Translation Involving Arabic Language. IJCSNS International Journal of Computer Science and Network Security. 2009; 9(11): 154-159.
[5] Razavian N.S, Vogel S. Fixed Length Word Suffix for Factored Statistical Machine Translation. Proceedings of the ACL 2010 Conference Short Papers. Uppsala. 2010: 147-150.
[6] Sujaini H, Kuspriyanto, Arman A.A, Purwarianti A. Pengaruh Part-Of-Speech pada Mesin Penerjemah Bahasa Inggris-Indonesia Berbasis Factored Translation Model. SNATI-2012. Yogyakarta. 2012: H77-H82.
[7] Jurafsky D, Martin J.H. Speech and Language Processing. New Jersey: Pearson International Edition. 2009.
[8] Raja F, Tasharofi S, Oroumchian F. Statistical POS Tagging Experiments on Persian Text. Proceedings of the Second Workshop on Computational Approaches to Arabic Script-based Languages. California. 2007: 128-133.
[9] Manning C.D, Schütze H. Foundations of Statistical Natural Language Processing. Cambridge: The MIT Press. 1999.
[10] Waridah E. EYD dan Seputar Kebahasa-Indonesiaan. Jakarta: Kawan Pustaka. 2008.
[11] Marcus M.P, Marcinkiewicz M.A, Santorini B. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics - Special issue on using large corpora: II. 1993; 19(2): 313-330.
[12] Francis W.N. A Tagged Corpus – Problems and Prospects. In: Greenbaum S, Leech G, Svartvik J. Editors. Studies in English Linguistics for Randolph Quirk. London: Longman; 1979: 192-209.
[13] Garside R, Leech G, McEnery A. Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman. 1997.
[14] Adriani M, Riza H. Research Report on Local Language Computing: Development of Indonesian Language Resources and Translation System. PAN Localization, 102042. 2008.
[15] Pisceldo F, Adriani M, Manurung R. Probabilistic Part of Speech Tagging for Bahasa Indonesia. Third International Workshop on Malay and Indonesian Language Engineering. Singapore. 2009.
[16] Wicaksono A.F, Purwarianti A. HMM Based Part-of-Speech Tagger for Bahasa Indonesia. The 4th International Malindo Workshop. Jakarta. 2010: 94-100.
[17] Purwarianti A, Saelan A, Afif I, Ferdian F, Wicaksono A.F. Natural Language Understanding Tools with Low Language Resource in Building Automatic Indonesian Mind Map Generator. International Journal on Electrical Engineering and Informatics. 2013; 5(3): 256-269.
[18] Larasati S.D, Kuboň V, Zeman D. Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus. SFCM 2011. Springer CCIS proceedings of the Workshop on Systems and Frameworks for Computational Morphology. Zurich. 2011: 119-129.
[19] Hajic O, Smrz P, Zemanek J.S, Beska E. Prague Arabic dependency treebank: Development in Data and Tools. Network for Euro-Mediterranean Language Resources (NEMLAR). Cairo. 2004.
[20] Brants S, Dipper S, Hansen S, Lezius W, Smith G. The TIGER Treebank. Workshop on Treebanks and Linguistic Theories. Sozopol. 2002: 24-41.
[21] Simov K, Osenova P, Kolkovska S, Balabanova E, Doikoff D, Ivanova K, Simov A, Kouylekov M. Building a Linguistically Interpreted Corpus of Bulgarian: the BulTreeBank. European Language Resources Association LREC. Canary Islands. 2002.
[22] Csendes D, Csirik J, Gyimóthy T, Kocsor A. The Szeged Treebank. Proceedings of the 8th International Conference on Text, Speech and Dialogue. Karlovy Vary. 2005: 123-131.
[23] Civit M, Martí M.A. Building cast3lb: A Spanish treebank. Research on Language & Computation. 2004; 2(4): 549–574.
[24] Avontuur T, Balemans I, Elshof L, Noord N.V, Zaanen M.V. Developing a part-of-speech tagger for Dutch tweets. Computational Linguistics in the Netherlands Journal. 2012; 2: 34–51.
[25] Singha K.R, Purkayastha B.S, Singha K.D. Part of Speech Tagging in Manipuri: A Rule-based Approach. International Journal of Computer Applications. 2012; 51(14): 31-36.
[26] Neunerdt M, Reyer M, Mathar R. A POS Tagger for Social Media Texts trained on Web Comments. Research journal on Computer science and computer engineering with applications. 2013; 1(48): 61-68.
[27] Sujaini H, Kuspriyanto, Arman A.A, Purwarianti A. Extended Word Similarity Based Clustering on Unsupervised PoS Induction to Improve English-Indonesian Statistical Machine Translation. 16th ORIENTAL COCOSDA/CASLRE-2013. Gurgaon. 2013: 47-48.
[28] Felsenstein J. Inferring Phylogenies. Sunderland: Sinauer Associates, Inc. 2004.
[29] Jeff M.A, Matsoukas S, Schwartz R. Improving Low-Resource Statistical Machine Translation with a Novel Semantic Word Clustering Algorithm. Proceedings of the MT Summit XIII. Xiamen. 2011: 352-359.
[30] Stolcke A, Zheng J, Wang W, Abrash V. SRILM at Sixteen: Update and Outlook. IEEE Automatic Speech Recognition and Understanding Workshop. Waikoloa. 2011.
[31] Och F.J, Ney H. A Systematic Comparison Of Various Statistical Alignment Models. Journal Computational Linguistics. 2003; 29(1): 19-51.
[32] Papineni K, Roukos S, Ward T, Zhu W.J. BLEU: A Method For Automatic Evaluation of Machine Translation. ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. 2002: 311-318.
[33] Larasati S.D. IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus. LREC, European Language Resources Association ELRA. 2012: 902-906.