TELKOMNIKA, Vol. 14, No. 4, December 2016, pp. 1462~1471
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v14i4.3876
Received April 24, 2016; Revised October 25, 2016; Accepted November 13, 2016
Supervised Entity Tagger for Indonesian Labor Strike Tweets using Oversampling Technique and Low Resource Features
Ayu Purwarianti*1, Lisa Madlberger2, Muhammad Ibrahim3
1School of Electrical Engineering and Informatics, Bandung Institute of Technology,
Jl. Ganesa No. 10, Bandung, Indonesia, telp/fax: +62-22-2508135/+62-22-2500940
2Institute of Software Technology and Interactive Systems, Vienna University of Technology,
Favoritenstrasse 9-11/188, A-1040 Vienna, Austria, telp: +43-1-58801-188642
3Faculty of Computer Science, University of Indonesia,
Kampus UI Depok 16424 Indonesia, telp: +62-21-7863419, fax: +62-21-7863415
*Corresponding author, e-mail: ayu@stei.itb.ac.id1, lisa.madlberger@tuwien.ac.at2, mochamad.ibrahim@ui.ac.id3
Abstract
We propose an entity tagger for Indonesian tweets sent during labor strike events using supervised learning methods. The aim of the tagger is to extract the date, the location and the person/organization involved in the strike. We use SMOTE (Synthetic Minority Oversampling Technique) as an oversampling technique and conducted several experiments on Twitter data to evaluate different settings with varying machine learning algorithms and training data sizes. In order to test the low resource features, we also conducted experiments for the system without employing the word list feature and the word normalization. Our results indicate that different treatment of different types of machine learning algorithms with low resource features can lead to a good accuracy score. Here, we tried the Naïve Bayes, C4.5, Random Forest and SMO (Sequential Minimal Optimization) algorithms using Weka as the machine learning tool. For Naïve Bayes, whose class probabilities depend directly on the data distribution, the best accuracy was achieved by removing data duplication. For C4.5 and Random Forest, SMOTE gave a higher accuracy result compared to the original data and the data with data duplication removal. For SMO, there is no significant difference among the various sizes of training data.
Keywords: Indonesian Entity Tagger, SMOTE, supervised learning, word level feature, word window feature, labor strike tweets
Copyright © 2016 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
A strike can be defined as a planned action of employees or workers' unions that is performed collectively to stop or to slow down work. Labor strikes have severe consequences for all involved parties, foremost corporations, employees and customers. Supply chain disruptions, blocked transportation routes, delays of delivery, loss of productivity and reputational damage are just some of the consequences companies and customers face as a result of labor strikes. Damages could be reduced through timely and efficient responses; however, valuable time is often lost because parties are informed too late about a strike at their supplier or transportation partners. The problem is that there is a lack of structured information on labor strike events that is provided in a timely manner. At the same time, more and more people use social media to report what is happening around them in real-time. We want to use Twitter data in order to extract structured event information on labor strikes.
In this paper we use strike-related Tweets posted by local users such as citizens, activists, local news media or labor unions in order to extract the date, the location and the organizations involved in strike events. We focus our experiments on Indonesia and Indonesian Twitter data, as Indonesia counts as an important supplier country of raw materials and manufactured goods in international supply chains and at the same time has a high number of social media users.
The goal of Entity Recognition is to identify and classify entities in a given text, which is an information extraction task. Different from Named Entity Recognition (NER), an entity in our research is not only a named entity, but can also be a common entity having certain roles in our domain of labor strikes, for example "taxi drivers". Applying entity tagging to user-generated texts originating in social networks imposes additional difficulty compared to formal texts. User-generated texts typically involve informal words, abbreviations and affixes, as well as the use of informal word order and grammar. Furthermore, Indonesian social media data in particular exhibits the mixed use of languages, including the official language Bahasa Indonesia, English and several Indonesian regional languages.
For Indonesian social media, there are several research efforts on NER ([1-5]). Existing approaches can generally be divided into rule-based ([1, 2]) and statistical approaches ([3-5]). In rule-based systems, researchers define rules in the form of string patterns used to identify and classify named entities. In statistical systems, the named entity extraction rules are learned automatically from previously labeled data by machine learning algorithms. Since the rules are not easily defined manually by a human, recent research tends to apply and enhance the statistical methods for named entity recognition. In line with these arguments, we chose to employ the statistical approach in our study.
Existing studies applied Indonesian NER to different application scenarios in multiple specific domains, including e-commerce transactions ([3]), citizen complaints ([4]) and traffic conditions ([1-2], [5]), as well as generic domains ([6, 7]). We developed the first entity tagger for tweets sent during civic strike and protest events in Indonesia.
A statistical entity tagger requires the definition and extraction of features from the original text. Features for an entity tagger can be divided into word level features [8], word window features, word list features [8] and document features [8]. Word level features are characteristics of a particular word, e.g. the length of a word. Word window features are characteristics relating to a defined number of preceding or succeeding words, e.g. whether the previous word has been identified as an entity. Word list features indicate whether a word occurs in a predefined list of entities, e.g. geographic gazetteers. Document features relate to other documents, in our case other tweets, e.g. the occurrence count of a word.
Most NER systems developed for Indonesian texts use word level features, word window features and word list features. Khodra & Purwarianti [3] employed word level and word window features. They reported an accuracy of 81.49% by including two preceding words in the word window. The best algorithm employed was IBk, compared to Naïve Bayes and C4.5. Anggareska & Purwarianti [4] employed word level features, word window features and word list features. They reported a best accuracy of 85.6%, achieved by applying the SMO algorithm (compared to Naïve Bayes and IBk). The features applied included word window features, the current word with its orthographical information as the word level features, and several word lists (location gazetteer, clue list and stop words) as the word list features. Here, in our research, similar to Anggareska & Purwarianti [4], we use the word window features, the word level features and word list features. However, in the experiments we show that without using the word list features, the system can still achieve a good accuracy by using an oversampling technique on the training data. Another difference with Anggareska & Purwarianti [4] is that we do not use a clue list as a word list feature, since it is not easily built for a new entity class such as ours in labor strike information.
2. Indonesian Entity Tagger on Strike Information for Twitter Text using Supervised Learning
In this research, our goal is to tag important entities automatically for strike information from Indonesian tweets. For strike information, there are several candidates for important entity types, such as the people who carry out the strike, the strike target (which can be people or an organization), the location of the strike, and the date or time of the strike. In our research, we decided to have three types of entities: 1) people-organization (involved in the strike); 2) location of the strike; 3) date or time of the strike event. Examples of tweets and their important entities are shown in Table 1.
The complete process applied in our entity tagger is depicted in Figure 1. The entity tagger consists of three parts, namely preprocessing, feature extraction and classification. Each part has an important role in achieving a high accuracy score for the entity tagger.
Table 1. Example of Tweets and the Important Entities on Strike Information

Tweet Text | People/Org | Location | Date
Mahasiswa mulai menggulirkan rencana aksi 10 September melalui #IndonesiaDarurat, ayo kita dukung. @ypaonganan {English: Students start rolling the September 10th strike plan through #IndonesiaDarurat, let's support @ypaonganan} | Mahasiswa (English: Students) | - | 10 September
Angkot di bogor pada mogok kerja, jalanan tuh serasa milik sendiri bebas dari macet :D {English: Bus driver in bogor doing strike, feels like having our own road without traffic jam :D} | Angkot (English: Bus driver) | Bogor | -
Figure 1. Flow of Labor Strike Entity Tagger System
2.1. Preprocessing
The preprocessing module first splits the tweet text into tokens (tokenization) and subsequently transforms informal words into formal words (normalization). Another step which could be applied here in the future would be a Part-of-Speech tagger. Up to now, no POS tagger for Indonesian social media data has been available for use in our system; the available one is a POS tagger for common Indonesian sentences such as those in articles [9]. Thus, we only employed tokenization and word normalization. The word normalization and POS tagging are language-dependent modules which would have to be replaced or removed when the system is applied to another language. In our experiments we compare the system's accuracy in settings including and excluding the word normalization module.
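The sketch below illustrates this preprocessing step. It is a minimal illustration rather than the system's actual implementation: the tokenization pattern and the small informal-to-formal lexicon are assumptions, since the Indonesian normalization resource used in the system is not reproduced here.

import re

# Hypothetical informal-to-formal lexicon; a stand-in for the normalization
# resource used in the system, which is not reproduced here.
NORMALIZATION = {
    "yg": "yang",
    "dgn": "dengan",
    "tdk": "tidak",
    "gak": "tidak",
}

def tokenize(tweet):
    """Split a tweet into tokens, keeping URLs, mentions and hashtags intact."""
    return re.findall(r"https?://\S+|[@#]\w+|\w+|[^\w\s]", tweet)

def normalize(tokens):
    """Map informal word forms to their formal equivalents when known."""
    return [NORMALIZATION.get(t.lower(), t) for t in tokens]

print(normalize(tokenize("Angkot di bogor pada mogok kerja")))
# -> ['Angkot', 'di', 'bogor', 'pada', 'mogok', 'kerja']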
2.2. Feature Extraction
The feature extraction module extracts information from single tokens (word level features) and from sequences of tokens (word window features). In particular, the word level features include the normalized lexical form of a token and the token's orthographical information. The word window features are taken from the token sequence and include the preceding and succeeding token lexicals along with the entity class of the preceding token. An additional feature is the word list feature. Here, we employ an easily gathered word list, namely the gazetteer of locations provided by Geonames (http://www.geonames.org/). Another word list that we used is a stop word list. As we mentioned earlier, we compare the usage of these word lists in the experiments. Table 2 shows the features and their example values for the word "bogor" in the second tweet of Table 1.
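A minimal sketch of how the features of Table 2 could be assembled for one token is given below; the exact value encodings for orthography and token kind are assumptions, as they are not spelled out in the text.

def token_features(tokens, i, prev_label, gazetteer, stopwords):
    """Build the feature set of Table 2 for the token at position i."""
    tok = tokens[i]
    return {
        "lexical_n":      tok.lower(),
        "lexical_n-1":    tokens[i - 1].lower() if i > 0 else "<s>",
        "lexical_n+1":    tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        "ne_class_n-1":   prev_label,                       # entity class of the preceding token
        "orthography_n":  "capitalized" if tok[:1].isupper() else "normal alphabet",
        "token_kind_n":   "word" if tok.isalpha() else "other",
        "is_mention_n":   tok.startswith("@"),
        "is_link_n":      tok.startswith("http"),
        "is_time_n":      any(ch.isdigit() for ch in tok),  # crude date/time heuristic
        "is_gazetteer_n": tok.lower() in gazetteer,
        "is_stopword_n":  tok.lower() in stopwords,
    }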
2.3. Classification
In the classification, the features taken from the tweet are classified by a classification model into seven classes that cover the three information types mentioned before. The seven classes are as follows: 1) the token is the beginning of a location (LOC-B); 2) the token is part of a location (LOC-I); 3) the token marks the beginning of people or an organization involved in the strike (PEORG-B); 4) the token is part of a people or organization identifier (PEORG-I); 5) the token marks the beginning of a date (DATE-B); 6) the token is part of a date (DATE-I); 7) the token is not an entity of interest (OTHER). A small sketch of mapping annotated entity spans to these seven token labels is shown below.
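The following sketch is a simplified illustration rather than the actual annotation pipeline; the span format it takes as input is an assumption.

LABELS = ["LOC-B", "LOC-I", "PEORG-B", "PEORG-I", "DATE-B", "DATE-I", "OTHER"]

def token_labels(tokens, spans):
    """spans: list of (start, end, type), end exclusive, type in {"LOC", "PEORG", "DATE"}."""
    labels = ["OTHER"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = f"{etype}-B"          # first token of the entity
        for j in range(start + 1, end):
            labels[j] = f"{etype}-I"          # remaining tokens of the entity
    return labels

tokens = ["Angkot", "di", "bogor", "pada", "mogok", "kerja"]
print(token_labels(tokens, [(0, 1, "PEORG"), (2, 3, "LOC")]))
# -> ['PEORG-B', 'OTHER', 'LOC-B', 'OTHER', 'OTHER', 'OTHER']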
Table 2. Features on the Entity Tagger on Strike Information and their Examples

Feature Name    | Description                                                  | Example
Lexical (n)     | Lexical of the current token                                 | bogor
Lexical (n-1)   | Lexical of one preceding token                               | di (eng: in)
Lexical (n+1)   | Lexical of one succeeding token                              | pada (eng: doing)
NE class (n-1)  | NE tag of one preceding token                                | Other
Orthography (n) | Orthography information of the current token                 | normal alphabet
TokenKind (n)   | Type of current token                                        | word
IsMention (n)   | True if the current token is a mention                       | False
IsLink (n)      | True if the current token is a link                          | False
IsTime (n)      | True if the current token format is a date or time           | False
IsGazetteer (n) | True if the current token is a member of the Gazetteer       | True
IsStopWord (n)  | True if the current token is a member of the stop word list  | False
2.3.1. Using SMOTE to Handle the Imbalanced Dataset
As is common in NER, in each input the number of tokens labeled with a defined entity (LOC, PEORG, DATE) is smaller than the number of non-entity labels. This imbalanced dataset condition may lead to low accuracy of the labor strike entity tagger. Since only few tokens carry information about the location, organization or date, most of the tokens naturally fall into the last class, "OTHER". To handle the imbalanced data, we employed SMOTE (Synthetic Minority Oversampling Technique) [10] as the oversampling technique. SMOTE oversamples the minority classes by adding new synthetic instances interpolated from several nearest-neighbor data points of each minority class. As a basic step, we first removed the duplicate data and then resampled the dataset several times by applying SMOTE.
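For illustration only, the sketch below shows an equivalent oversampling step with the imbalanced-learn library; the experiments themselves used the SMOTE filter in Weka, so the library choice and the target counts here are assumptions.

from collections import Counter
from imblearn.over_sampling import SMOTE  # stand-in for Weka's SMOTE filter

def oversample(X_train, y_train, target_counts):
    """Oversample the minority entity classes toward the requested sizes.

    target_counts: dict mapping each minority class label to its desired
    instance count, e.g. doubling every class except "OTHER".
    """
    smote = SMOTE(sampling_strategy=target_counts, k_neighbors=5, random_state=42)
    X_res, y_res = smote.fit_resample(X_train, y_train)
    print("class sizes after SMOTE:", Counter(y_res))
    return X_res, y_res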
2.3.2. Machine Learning Algorithms
We compared several machine learning algorithms to build the classification model, including Naive Bayes, C4.5, SMO and Random Forest. We selected these algorithms since they are widely used in text mining studies. Naïve Bayes uses Bayes' theorem to calculate the class probability of a given feature input representing a token, under the assumption that the features are independent [11]. C4.5 (known as J48 in Weka [11]; this term is used later in the experiments) uses a divide-and-conquer algorithm on the training data to form a decision tree that represents classification rules [12]. SMO uses the sequential minimal optimization algorithm to train a support vector classifier [13]. Random Forest employs voting over several trees constructed from random samples of the training data [14].
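For illustration, the sketch below compares scikit-learn counterparts of these four learners on a fixed split; it is a stand-in for the Weka setup rather than the actual experimental code (MultinomialNB for NaiveBayes, DecisionTreeClassifier for J48/C4.5, RandomForestClassifier for Random Forest, LinearSVC for the SVM trained by SMO), and it assumes the categorical features of Table 2 have already been one-hot encoded into numeric vectors.

from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

CLASSIFIERS = {
    "NB":  MultinomialNB(),
    "J48": DecisionTreeClassifier(),
    "RF":  RandomForestClassifier(n_estimators=100),
    "SMO": LinearSVC(),
}

def compare_algorithms(X_train, y_train, X_test, y_test):
    """Train each classifier on the same split and report token-level accuracy."""
    for name, clf in CLASSIFIERS.items():
        clf.fit(X_train, y_train)
        acc = accuracy_score(y_test, clf.predict(X_test))
        print(f"{name}: accuracy = {acc:.4f}")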
3. Experiments
We have conducted several experiments applying the proposed strike entity tagger to Indonesian tweets in order to compare different machine learning algorithms, different features, and training data sizes. To evaluate the performance on each class, we used the F-Measure score, calculated as below.
F\text{-}measure = \frac{2 \cdot precision \cdot recall}{precision + recall}    (1)

Where the precision and recall scores are calculated as follows:

precision = \frac{\#\,\text{correctly tagged tokens of a class}}{\#\,\text{tokens tagged as that class}}    (2)

recall = \frac{\#\,\text{correctly tagged tokens of a class}}{\#\,\text{tokens of that class in the test data}}    (3)
For the overall evaluation, we employed the accuracy score with the equation below. For example, if there are 100 correctly classified tokens among 1000 tokens in the testing data, then the accuracy is 100/1000 = 10%.
accuracy = \frac{\#\,\text{correctly classified tokens}}{\#\,\text{tokens in the test data}}    (4)
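A small sketch of these evaluation measures, written here purely to make Equations (1)-(4) concrete, is given below.

def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F-measure (Equations 1-3)."""
    tp   = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    pred = sum(1 for p in y_pred if p == label)
    gold = sum(1 for t in y_true if t == label)
    precision = tp / pred if pred else 0.0
    recall    = tp / gold if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Overall token accuracy (Equation 4); 100 correct out of 1000 tokens gives 0.10."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)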
3.1. Experimental Data and Baseline Experiment of the Indonesian Strike Entity Tagger
We used the DMI Twitter Capture and Analysis Toolkit to collect Tweets from the public Twitter Streaming API that matched a list of defined strike-related keywords: "aksi demo", "aksi dukung", "aksi kerja", "aksi mogok", "mogok kerja". We collected Tweets for a period of two months (01.05.2015-30.06.2015). For our experiments, we retrieved a random sample of this dataset consisting of 18,999 tokens (from 1,046 tweets). Ten Indonesians were asked to annotate words in the Twitter data manually using a mobile annotation tool [15]. The final annotation label was chosen by majority vote. We selected 20% as the testing data and 80% as the training data. The data size for each class is shown in Table 3 below.
Table 3. Data Size for Each Entity Class

NE Class | Data Size (token) | Training | Testing
OTHER    | 14046 | 11195 | 2799
LOC-B    |   525 |   420 |  105
LOC-I    |   584 |   467 |  117
PEORG-B  |  1740 |  1391 |  348
PEORG-I  |  1513 |  1211 |  303
DATE-B   |   255 |   204 |   51
DATE-I   |   336 |   268 |   68
Total    | 18999 | 15156 | 3791
For the baseline, we conducted an experiment using only the current word's lexical surface and the previous entity label as features. We compared several algorithms in the baseline experiments: Naive Bayes (NB), J48, Random Forest (RF) and SMO, using Weka [11] for the entity tagger. The experimental result is shown below. Here, the SMO algorithm achieved the best accuracy result of 94.54%.
Figure 2. Accuracy of Baseline Experiment on Indonesian Strike Entity Tagger
The overall accuracy of each algorithm above does not show the exact performance for each entity. Thus, Table 4 below shows the F-Measure score for each entity type in the baseline experiment. Here, even though the accuracy of J48 and Random Forest outperformed Naïve Bayes, for the LOC-I entity type the best F-measure was achieved by the Naïve Bayes algorithm. On the other hand, the Naïve Bayes algorithm was not able to extract DATE-B, since out-of-vocabulary words could not be handled by only using the current lexical entry and the previous entity label.
Table 4. F-Measure for Each Strike Entity Class for the Baseline Experiment

         | NB    | J48   | RF    | SMO
OTHER    | 0.94  | 0.961 | 0.963 | 0.967
LOC-B    | 0.391 | 0.729 | 0.734 | 0.773
LOC-I    | 0.835 | 0.657 | 0.711 | 0.877
PEORG-B  | 0.797 | 0.88  | 0.9   | 0.914
PEORG-I  | 0.856 | 0.798 | 0.86  | 0.916
DATE-B   | 0     | 0.623 | 0.68  | 0.66
DATE-I   | 0.868 | 0.85  | 0.899 | 0.919
3.2. Experiment on Using Low Resource Features for the Indonesian Strike Entity Tagger
In this experiment, we propose the usage of several easily prepared features whose aim is to handle the out-of-vocabulary problem. The complete features are shown in Table 2. The experimental result for each algorithm is shown in Figure 3 below. For the Naïve Bayes, Random Forest and SMO algorithms, the accuracy is higher, but for the J48 algorithm the accuracy is lower; the larger number of features does not give J48 a higher accuracy. Using the complete features for the J48 algorithm yielded a tree whose root node is the previous entity label, while using only two features yielded a tree whose root node is the current lexical surface. The complete-feature J48 tree contains a rule saying that if the previous entity label is OTHER then the predicted class is OTHER; this rule is supported by 10,420 correct and 1,771 incorrect instances and caused the low accuracy of the J48 algorithm.
Figure 3. Accuracy of the Complete Features Experiment on Indonesian Strike Entity Tagger for Various Machine Learning Algorithms with the Original Data Collection (15156 Training Data Size)
Table 5. F-Measure for Each Strike Entity Class for the Original Data Collection

         | NB    | J48   | RF    | SMO
OTHER    | 0.944 | 0.905 | 0.974 | 0.981
LOC-B    | 0.734 | 0.188 | 0.888 | 0.893
LOC-I    | 0.829 | 0.915 | 0.713 | 0.935
PEORG-B  | 0.815 | 0.006 | 0.936 | 0.934
PEORG-I  | 0.848 | 0.893 | 0.908 | 0.95
DATE-B   | 0.435 | 0.109 | 0.699 | 0.782
DATE-I   | 0.919 | 0.788 | 0.939 | 0.97
The per-class F-Measure scores in Table 5 show that, for the J48 algorithm, using only the two baseline features gave a better result than using the complete features. Low results appear for all "B" labels, i.e. LOC-B, PEORG-B and DATE-B; most "B" tokens were classified as OTHER. Another finding is that, similar to the baseline conclusion, the SMO algorithm outperformed the other machine learning algorithms.
3.3. Experiment on Using SMOTE for the Indonesian Strike Entity Tagger
The dataset sizes shown in Table 3 indicate that the dataset at hand is highly unbalanced, which means that the number of entities in the training data varies strongly between classes. This is a challenge, because a small number of instances of one class can be insufficient to represent that class in the training data, so it becomes harder for the learning algorithm to learn the characteristics of that group. To overcome this problem, we conducted experiments on using SMOTE to enlarge the dataset. Here, we tried two strategies; the difference is that in the second strategy we employed data duplication removal before applying SMOTE. The experimental result for each strategy is explained below.
3.3.1. Only Using SMOTE to Enhance the Training Dataset
We applied SMOTE several times to the original dataset, yielding several new training datasets as shown in Table 6. There are four SMOTE settings: (1) double the data of each class (except OTHER) by SMOTE; (2) multiply the data of each class (except OTHER) four times by SMOTE; (3) conduct SMOTE until the data in each class reaches about a 1:2 ratio compared to the OTHER class; (4) conduct SMOTE until the data in each class has about the same size as the OTHER class.
Table 6. Various Training Dataset Sizes for Experiments on the Indonesian Strike Entity Tagger using SMOTE only (1st Strategy)

         | Original Data | SMOTE (x2) (1) | SMOTE (x4) (2) | SMOTE (1:2) (3) | SMOTE (1:1) (4)
OTHER    | 11195 | 11195 | 11195 | 11195 | 11195
LOC-B    |   420 |   840 |  1680 |  5040 | 10080
LOC-I    |   467 |   934 |  1868 |  5604 | 11208
PEORG-B  |  1391 |  2782 |  5564 |  5564 | 11128
PEORG-I  |  1211 |  2422 |  4844 |  4844 |  9688
DATE-B   |   204 |   408 |   816 |  5304 | 10608
DATE-I   |   268 |   536 |  1072 |  5360 | 10720
Total    | 15156 | 19117 | 27039 | 42911 | 74627
Figure 4. Accuracy Scores for Various Dataset Sizes of the Indonesian Strike Entity Tagger using SMOTE only (1st Strategy)
The experimental result shown in Figure 4 indicates the different effects of SMOTE for each algorithm. Applying SMOTE for Naïve Bayes unfortunately lowers the accuracy, since the Naïve Bayes algorithm depends heavily on the data distribution. This differs from the decision tree algorithms such as J48 or Random Forest (RF). For the decision tree algorithms, the SMOTE technique increases the accuracy, since the information gain of each tree branch depends on the class distribution of the data. The most significant improvement is for the J48 algorithm, where only a single tree is built. In both algorithms, there is a data size at which the accuracy reaches its maximum; beyond that point the model overfits the training data and the accuracy becomes lower. As for the SMO algorithm, the SMOTE technique does not have any effect, since SMO does not depend on the class distribution, meaning that a rare class making up 10% of the data yields the same result as one making up 2%. We also show the highest F-Measure score for each algorithm in Table 7. Almost in line with the conclusion for Table 5, the highest F-Measure score was achieved by SMO, except for LOC-B, where it is lower than the Random Forest algorithm.
Table 7. F-Measure Score for Each Algorithm using SMOTE (1st Strategy)

         | Naïve Bayes | J48         | Random Forest | SMO
OTHER    | 0.944       | 0.97        | 0.98          | 0.981
LOC-B    | 0.734       | 0.828       | 0.924         | 0.893
LOC-I    | 0.829       | 0.918       | 0.874         | 0.935
PEORG-B  | 0.815       | 0.917       | 0.934         | 0.934
PEORG-I  | 0.848       | 0.926       | 0.948         | 0.95
DATE-B   | 0.435       | 0.72        | 0.69          | 0.782
DATE-I   | 0.919       | 0.886       | 0.933         | 0.97
Schema   | Original    | SMOTE (1:2) | SMOTE (x4)    | Original
3.3.2. Using Data Duplication Removal before SMOTE to Enhance the Training Dataset
In the second strategy, before applying SMOTE, we first conducted data duplication removal. The resulting training datasets are shown in Table 8. There are six treatments: (1) only using the data duplication removal; (2) applying SMOTE for the Loc and Date classes after the duplication removal; (3) applying SMOTE for all classes except the OTHER class after the duplication removal; (4) applying SMOTE until each class reaches a 1:4 ratio compared to the OTHER class, after the duplication removal; (5) applying SMOTE until each class reaches a 1:2 ratio compared to the OTHER class, after the duplication removal; (6) enlarging the data in each class to almost the same size as the OTHER class, after the duplication removal.
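A minimal sketch of this second strategy is given below; it assumes that "duplication" means an exact match of the feature vector and label, which is not stated explicitly in the text, and it again uses imbalanced-learn in place of Weka.

from imblearn.over_sampling import SMOTE

def dedupe_then_smote(X, y, target_counts):
    """Drop exact duplicate (features, label) instances, then oversample with SMOTE."""
    seen, keep = set(), []
    for i, (row, label) in enumerate(zip(X, y)):
        key = (tuple(row), label)
        if key not in seen:
            seen.add(key)
            keep.append(i)
    X_clean = [X[i] for i in keep]
    y_clean = [y[i] for i in keep]
    smote = SMOTE(sampling_strategy=target_counts, random_state=42)
    return smote.fit_resample(X_clean, y_clean)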
Table 8. Various Training Dataset Sizes for Experiments on the Indonesian Strike Entity Tagger using SMOTE and Data Duplication Removal (2nd Strategy)

         | Original Data | Duplication Removal (1) | Removal + SMOTE (x2 for Loc & Date) (2) | Removal + SMOTE (x2) (3) | Removal + SMOTE (1:4) (4) | Removal + SMOTE (1:2) (5) | Removal + SMOTE (1:1) (6)
OTHER    | 11195 | 6442 | 6442 |  6442 |  6442 |  6442 |  6442
LOC-B    |   420 |  277 |  554 |   554 |  1662 |  3324 |  6315
LOC-I    |   467 |  290 |  580 |   580 |  1740 |  3393 |  6446
PEORG-B  |  1391 |  913 |  913 |  1826 |  1826 |  3469 |  6591
PEORG-I  |  1211 |  779 |  779 |  1558 |  1558 |  3116 |  6232
DATE-B   |   204 |  118 |  236 |   236 |  1652 |  3304 |  6277
DATE-I   |   268 |  177 |  354 |   354 |  1770 |  3451 |  6556
Total    | 15156 | 8996 | 9858 | 11550 | 16650 | 26499 | 44859
Figure 5 shows the accuracy score for each algorithm and each SMOTE configuration in our second strategy. The conclusion is similar to the previous section: SMOTE mostly affected the decision tree algorithms, while it lowered the accuracy of the Naïve Bayes algorithm and had no effect on the SMO algorithm. Figure 5 also shows that, since Naïve Bayes is a probabilistic algorithm that depends heavily on the data distribution of each class, the data duplication removal gives the best training data for Naïve Bayes, reaching the highest accuracy among all scenarios for that algorithm.
Figure 5. Accuracy Scores for Various Dataset Sizes of the Indonesian Strike Entity Tagger using SMOTE and Data Duplication Removal (2nd Strategy)
Figure 6 shows that the strongest effect of applying SMOTE is on the J48 algorithm, where it was able to raise the accuracy score from 84% to more than 94%. To be more precise, Figure 6 details the F-Measure score on each training dataset for the J48 algorithm. Here, SMOTE was able to enhance the F-Measure score significantly for LOC-B, PEORG-B and DATE-B.
Figure 6. F-Measure Scores for Various Dataset Sizes of the Indonesian Strike Entity Tagger using the J48 Algorithm with SMOTE and Data Duplication Removal (2nd Strategy)
3.4. Indonesian Entity Tagger without the Word List Feature and without Word Normalization
Our last experiment evaluates the usage of only easily prepared features, namely the word level features and word window features. For this purpose we eliminated the word list features, keeping only 9 features by removing isGazetteer and isStopWord, since these features are built manually. In addition, we also eliminated the word normalization preprocessing to see its effect on the entity tagger. The experimental result for the original data (15156 tokens) is shown in Figure 7.
Figure 7. Accuracy of the Entity Tagger without Word Normalization and Word List Features
The result shown in Figure 7 indicates that removing the features that depend on a word list gave only a slightly different accuracy score compared to the complete feature set. This means that, in the future, our social media entity tagger can easily be adapted to other languages in the strike domain.
4. Conclusion
We proposed an Indonesian entity tagger to extract entities from tweets sent during strike events using supervised learning. To handle the imbalanced distribution of classes, we employed SMOTE as an oversampling technique. Our experiments on several machine learning algorithms showed that SMOTE has different effects on different algorithms. SMOTE gives a significant positive effect on decision-tree-based algorithms such as J48 and Random Forest, but it lowers the accuracy of the Naïve Bayes algorithm and has no effect on the SMO algorithm. Using only word level features and word window features, the system can still achieve similar accuracy to when using all features, since the word normalization and word lists employed here do not have a significant effect on the accuracy. In future research we want to apply the system to other languages as well.
Acknowledgements
Research reported in this publication was jointly supported by the ASEAN-European Academic University Network (ASEA-UNINET), the Austrian Federal Ministry of Science, Research and Economy and the Austrian Agency for International Cooperation in Education and Research (OeAD-GmbH).
References
[1] Endarnoto SK, Pradipta S, Nugroho AS, Purnama J. Traffic Condition Information Extraction & Visualization from Social Media Twitter for Android Mobile Application. International Conference on Electrical Engineering and Informatics. Bandung, Indonesia. 2011.
[2] Hanifah R, Supangkat SH, Purwarianti A. Twitter Information Extraction for Smart City. International Conference on ICT for Smart Society (ICISS). Bandung, Indonesia. 2014.
[3] Khodra ML, Purwarianti A. Ekstraksi Informasi Transaksi Online pada Twitter. Cybermatika. 2013; 1(1).
[4] Anggareska D, Purwarianti A. Information Extraction of Public Complaints on Twitter Text for Bandung Government. International Conference on Data and Software Engineering. Bandung, Indonesia. 2014.
[5] Hasby M, Khodra ML. Optimal Path Finding based on Traffic Information Extraction from Twitter. International Conference on ICT for Smart Society (ICISS). Jakarta, Indonesia. 2013.
[6] Liu X, Zhang S, Wei F, Zhou M. Recognizing Named Entities in Tweets. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon. 2011: 359-367.
[7] Ritter A, Clark S, Mausam, Etzioni O. Named Entity Recognition in Tweets: an Experimental Study. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Edinburgh, Scotland, UK. 2011: 1524-1534.
[8] Nadeau D. Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision. Dissertation. Ottawa, Canada: Ottawa-Carleton Institute for Computer Science, School of Information Technology and Engineering, University of Ottawa; 2007.
[9] Wicaksono A, Purwarianti A. HMM based Part-of-Speech Tagger for Bahasa Indonesia. Fourth International MALINDO Workshop. Jakarta, Indonesia. 2010.
[10] Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research. 2002; 16(1): 321-357.
[11] Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd edition. Morgan Kaufmann. 2005.
[12] Quinlan JR. C4.5: Programs for Machine Learning. San Francisco: Morgan Kaufmann. 1993.
[13] Platt J. Fast training of support vector machines using sequential minimal optimization. In: Scholkopf B, Burges C, Smola A, editors. Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press. 1998.
[14] Breiman L. Random Forests. Machine Learning. 2001; 45(1): 5-32.
[15] Madlberger L, Romadhony A, Ibrahim M, Purwarianti A. Gotong Royong in NLP Research – a Mobile Tool for Collaborative Text Annotation in Indonesia. The 20th International Conference on Asian Language Processing (IALP). Tainan, Taiwan. 2016.