TELKOMNIKA, Vol. 14, No. 2, June 2016, pp. 665~673
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v14i1.3113
Received January 10, 2016; Revised April 10, 2016; Accepted April 26, 2016
Analysis of Stemming Influence on Indonesian Tweet Classification
Ahmad Fathan Hidayatullah*1, Chanifah Indah Ratnasari2, Satrio Wisnugroho3
Department of Informatics, Universitas Islam Indonesia,
Jl. Kaliurang km 14.5 Sleman Yogyakarta Indonesia, Telp. (0274) 895297/Fax. (0274) 895007
*Corresponding author, e-mail: fathan@uii.ac.id1, chanifah.indah@uii.ac.id2, wisnugrohosatrio@gmail.com3
Abstract
Stemming has been commonly used by researchers in natural language processing areas such as text mining, text classification, and information retrieval. In information retrieval, stemming may help to raise retrieval performance. However, there are indications that stemming does not have a significant influence on accuracy in text classification. Therefore, this paper further analyzes the influence of stemming on tweet classification in Bahasa Indonesia. This work compares classification accuracy under two conditions: involving stemming and not involving stemming in the pre-processing task for tweet classification. The contribution of this research is to identify a better pre-processing task for obtaining good accuracy in text classification. According to the experiments, all accuracy results in tweet classification tend to decrease when stemming is applied. Stemming does not raise the accuracy using either the SVM or the Naive Bayes algorithm. Therefore, this work concludes that the stemming process does not significantly affect accuracy performance.

Keywords: stemming, pre-processing, tweet classification, text classification

Copyright © 2016 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
Stemming has been commonly used by researchers in natural language processing areas such as text mining, text classification, and information retrieval. The purpose of stemming is to obtain the root of a word by removing its affixes. Nevertheless, the stemming process differs from language to language, depending on how words are formed. For example, English stemming reduces the words "plays", "player", "players", and "played" to the word "play". To obtain the root of English words, there are several stemming algorithms such as Lovins, Dawson, and Porter. On the other hand, the Indonesian language has different characteristics from English. Words in Bahasa Indonesia have a special and complex morphological structure compared with words in other languages.
In Bahasa Indonesia, words are composed of inflectional and derivational structures. The inflectional structure is a collection of suffixes that alter neither the form nor the meaning of the root word. The derivational structure consists of prefixes, suffixes, and several combinations of the two. For the Indonesian language, there are various stemming algorithms such as Vega, Nazief-Adriani, Arifin-Setiono, and the Enhanced Confix Stripping Stemmer.
However, previous research has shown differing results regarding the influence of stemming in text mining. Several studies reported that stemming could enhance accuracy, while others claimed that stemming did not have a significant influence on accuracy.
Basnur and Sensuse [1] classified news articles in the Indonesian language using an ontology by observing two aspects, stopwords and stemming. They reported that stemming could enhance accuracy when classifying text documents. Ramasubramanian and Ramya [2] built an effective pre-processing step using an improved stemming algorithm. The research concluded that improved Porter's stemming with a spell-check utility increased the accuracy of the output content. Porter's stemming algorithm was also utilized by [3] to evaluate stemming and stop-word techniques in a classification problem. That paper revealed that stemming techniques have a significant effect on the size of the feature set, with different sparsity values.
Gaustad and Bouma [4] investigated the use of stemming for the classification of Dutch e-mail texts. This research also used Porter's algorithm for the stemming process. It was found that stemming does not consistently improve classification accuracy. Yu [5] evaluated the effect of stemming on classification performance. The research examined whether the overall classification accuracies change significantly after stemming. In addition, it also compared the contribution of individual features toward classification before and after stemming. According to the experiment, the accuracy did not change significantly before and after stemming.
Torunoglu, et al. [6] analyzed the effect of preprocessing methods for text classification on Turkish texts. They concluded that stemming has very little impact on accuracies. Toman, et al. [7] compared various lemmatization and stemming algorithms using English and Czech datasets to examine the influence of word normalization on a general classification task. This research summarized that lemmatization and stemming as word normalization did not significantly affect text classification. Wahbeh, et al. [8] examined the effect of stemming as part of the pre-processing tasks on Arabic text classification. Their experiment indicated that stemming decreased the accuracy. Hidayatullah [9] likewise found that stemming does not raise the accuracy.
This research uses the Nazief and Adriani algorithm in the stemming process to obtain the root of the words. The Nazief and Adriani Indonesian stemming algorithm is chosen because it has better accuracy than other stemming algorithms [10].
This paper addresses the issue of text classification in Bahasa Indonesia. Furthermore, this research conducts a further investigation of the influence of stemming on text classification using a tweet dataset in the Indonesian language. Moreover, this work examines the difference in effect between two conditions, involving stemming and not involving stemming in the pre-processing task. Compared with the previous research by Hidayatullah [9], the number of datasets in this research is also increased to obtain a more valid result.
The rest of this paper is organized as follows. Section 2 describes the research method in this work. Section 3 explains the result and discussion of this research. Finally, the conclusion of this work is described in Section 4.
2. Research Method
Figure 1 depicts the detailed experimental design in this work. In data collection, tweet datasets were gathered using the Twitter Search API v1.1. Secondly, a pre-processing step is conducted to clean the tweets from noise. The third step is feature selection, which aims to keep the influential features and remove the uninfluential ones. This research uses two term weighting methods in the feature selection task, term frequency and TF-IDF (Term Frequency-Inverse Document Frequency). The tweet datasets are then classified using the Naive Bayes Classifier and the Support Vector Machine (SVM) method.
Figure 1. Experimental design
Finally, performance evaluation is carried out to evaluate the proposed method. The objective of performance evaluation is to measure how precise the method is in classifying text. The confusion matrix model is chosen for two-class prediction, as can be seen in Table 1.
Table 1. Confusion Matrix for Two Classes Prediction

                              Actual Class
                              Class-1                Class-2
Predicted Class   Class-1     True positive (TP)     False negative (FN)
                  Class-2     False positive (FP)    True negative (TN)
The confusion matrix above is used to calculate the accuracy of the proposed methods. Accuracy is the total validity of the model, calculated as the sum of true classifications divided by the total number of classifications. The formula to obtain accuracy is described below:

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$    (1)
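For concreteness, Equation (1) can be written as a small Python helper over the confusion matrix counts. This is a minimal sketch of the formula, not code taken from the original experiments.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Equation (1): correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Example counts (illustrative only): 90 TP, 85 TN, 10 FP, 15 FN
print(accuracy(90, 85, 10, 15))  # 0.875
```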
2.1. Datasets
Tweet datasets were obtained from the previous datasets used by Hidayatullah [9]. Moreover, this research also increased the number of tweet datasets and examined 2000 tweets. Those tweets were labelled manually into two sentiment polarities, positive and negative. The datasets contain 1074 positive tweets and 926 negative tweets, as shown in Table 2.
Table 2. Distribution of Tweet Polarity

Tweet Polarity    Quantity
Positive          1074
Negative          926
2.2. Text Pre-processing
This research proposes several steps in text pre-processing, listed below (a minimal code sketch of these steps is given after the list):
1. Removing URLs: this task handles the URLs in a tweet, for example http://www.website.com.
2. Changing emoticons: this step replaces the emoticons present in a tweet by transforming each emoticon into a representative word, as shown in Table 3.

Table 3. Emoticon Conversion

Emoticon                  Conversion
:) :-) :)) :-)) =) =))    Senyum (smile)
:D :-D =D                 Tawa (laugh)
:-( :(                    Sedih (sad)
;-) ;)                    Berkedip (wink)
:-P :P                    Mengejek (stick out tongue)
:-/ :/                    Ragu (hesitate)

3. Removing special characters of Twitter such as #hashtags, @username, and RT (retweet).
4. Removing symbols or numbers (e.g. !, #, $, *, 1234, etc.).
5. Normalizing lengthened words, for example the word 'semangaaaaatttt' will be normalized into 'semangat', which means spirit.
6. Tokenization separates a stream of text into parts called tokens.
7. Removing public figure names: any public figure name that appears in the tweet is omitted in this step.
8. Case folding transforms words into a uniform form (lowercase or uppercase).
9. Changing slang words into standard words based on a dictionary.
10. Stemming is used to remove the affixes and suffixes of the word.
11. Removing stop words: this task removes the stopwords.
12. Concatenating negation: this step recognizes negation in a tweet; for example, when the term 'tidak' (not) appears, the word 'tidak' will be concatenated with the next word.
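The following is a minimal Python sketch of the pre-processing steps above (URL removal, emoticon conversion, Twitter-specific character removal, lengthening normalization, case folding, and tokenization). The regular expressions and the tiny emoticon map are illustrative assumptions, not the exact rules or dictionaries used in this research; slang normalization, stemming, stop-word removal, and negation handling are only indicated by comments.

```python
import re

# Illustrative subset of Table 3; assumed, not the full emoticon dictionary used in the paper.
EMOTICONS = {":)": "senyum", ":-)": "senyum", ":D": "tawa", ":-(": "sedih", ";)": "berkedip"}

def preprocess(tweet: str) -> list[str]:
    text = re.sub(r"http\S+|www\.\S+", " ", tweet)        # 1. remove URLs
    for emo, word in EMOTICONS.items():                   # 2. convert emoticons to words
        text = text.replace(emo, f" {word} ")
    text = re.sub(r"(#\w+|@\w+|\bRT\b)", " ", text)       # 3. remove hashtags, mentions, RT
    text = re.sub(r"[^A-Za-z\s]", " ", text)              # 4. remove symbols and numbers
    text = re.sub(r"(\w)\1{2,}", r"\1", text)             # 5. normalize lengthened words
    text = text.lower()                                   # 8. case folding
    tokens = text.split()                                 # 6. tokenization
    # Steps 7 and 9-12 (name removal, slang dictionary, stemming, stop words,
    # negation concatenation) would be applied here with the appropriate resources.
    return tokens

print(preprocess("RT @user semangaaaaatttt :) http://www.website.com"))
# ['semangat', 'senyum']
```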
Figure 2. Two different approaches in pre-processing
Two different approaches were conducted to compare the resulting accuracy between the first approach, which involves stemming, and the second, which does not involve stemming in the pre-processing steps. The stemming rules in the first approach utilized the Nazief and Adriani algorithm. Figure 2 shows the two different approaches in the pre-processing task.
2.3. Nazief and Adriani's Algorithm
The steps of Nazief and Adriani's algorithm are described as follows [11] (a simplified code sketch is given after the steps):
1. Firstly, look up the current word in the dictionary. The word is considered a root word if it is found, and the process stops.
2. The inflection suffix ("-lah", "-kah", "-ku", "-mu", or "-nya") is removed. If this succeeds and the removed suffix is a particle ("-lah" or "-kah"), then also remove the inflectional possessive pronoun suffix ("-ku", "-mu", or "-nya").
3. Remove the derivation suffix ("-i" or "-an"). If this succeeds, then go to Step 4. Otherwise, if Step 4 does not succeed, do these steps:
a. If "-an" was removed and the last letter of the word is "-k", then the "-k" is also removed and Step 4 is tried again. If that fails, Step 3b will be carried out.
b. The removed suffix ("-i", "-an", or "-kan") is restored.
4. Remove the derivation prefix ("di-", "ke-", "se-", "me-", "be-", "pe-", or "te-") by attempting these steps:
a. If a suffix was eliminated in Step 3, then check the disallowed prefix-suffix combinations listed in Table 4.
Table 4. Disallowed Prefix-Suffix Combinations

Prefix    Disallowed Suffixes
be-       -i
di-       -an
ke-       -i, -kan
me-       -an
te-       -i, -kan
se-       -an
b. The algorithm returns if the current prefix is identical to a previously removed prefix.
c. The algorithm returns if three prefixes have already been removed.
d. The prefix type is recognized by one of the following actions:
1) If the prefix of the word is "di-", "ke-", or "se-", then the prefix type is "di", "ke", or "se", respectively.
2) When the prefix is "te-", "be-", "me-", or "pe-", an extra process of extracting character sets is needed to determine the prefix type. For example, consider stemming the word "terlambat" (late). After the prefix "te-" is removed to obtain "rlambat", the first collection of characters is extracted and matched against the "Set 1" rules in Table 5. In this example, the letter after the prefix "te-" is "r", which fits the first five rows of the table. The letter "r" is followed
by "l", which matches the third to fifth rows (Set 2). The text after the letter "l" is "ambat", which matches exactly the fifth row (Set 3), so the prefix type of the word "terlambat" is "ter-".
Table 5. Determining the prefix type for words prefixed with "te-"

Following Characters                                                    Prefix Type
Set 1                   Set 2                   Set 3        Set 4
"-r-"                   "-r-"                   -            -          none
"-r-"                   vowel                   -            -          ter-luluh
"-r-"                   not ("-r-" or vowel)    "-er-"       vowel      ter
"-r-"                   not ("-r-" or vowel)    "-er-"       not vowel  none
"-r-"                   not ("-r-" or vowel)    not "-er-"   -          ter
not (vowel or "-r-")    "-er-"                  vowel        -          none
not (vowel or "-r-")    "-er-"                  not vowel    -          te
3) The algorithm returns when the first two characters do not match "di-", "ke-", "se-", "te-", "be-", "me-", or "pe-".
e. If the prefix type is "none", then the algorithm returns. If the prefix type is not "none", the corresponding prefix is found in Table 6 and removed from the word.
Table 6. Determining the prefix from the prefix type

Prefix type    Prefix to be removed
di             di-
ke             ke-
se             se-
te             te-
ter            ter-
ter-luluh      ter-
f. Step 4 is recursively attempted as long as the root word has not been found.
g. Perform the recoding step, which depends on the prefix type. Only the prefix type "ter-luluh" is shown in Tables 5 and 6.
In this case, the letter "r" is added to the word after removing the prefix "ter-". If the new word is not found in the dictionary, then Step 4 is carried out again. The "r" is removed and "ter-" is restored when the root word is still not found. The prefix type is set to "none" and the algorithm returns.
h. The algorithm returns the original word if all steps fail.
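The overall control flow of the algorithm can be summarized in a short Python skeleton. This is a simplified sketch of the dictionary lookup and suffix/prefix stripping order described above; the toy root-word dictionary, the missing disallowed-combination check, and the missing recoding rules of Tables 4-6 are all simplifications, so it is not a complete Nazief-Adriani implementation.

```python
import re

ROOT_WORDS = {"lambat", "main", "makan"}  # assumed toy dictionary; the real algorithm uses a full lexicon

def stem(word: str) -> str:
    if word in ROOT_WORDS:                                  # Step 1: dictionary lookup
        return word
    candidate = re.sub(r"(lah|kah|ku|mu|nya)$", "", word)   # Step 2: inflection suffixes
    candidate = re.sub(r"(kan|an|i)$", "", candidate)       # Step 3: derivation suffixes
    if candidate in ROOT_WORDS:
        return candidate
    for _ in range(3):                                      # Step 4: at most three derivation prefixes
        # Longest prefixes first; the real algorithm instead uses the prefix-type
        # and recoding rules of Tables 5 and 6 to handle cases like "ter-".
        stripped = re.sub(r"^(di|ke|se|ter|te|be|me|pe)", "", candidate)
        if stripped == candidate:
            break
        candidate = stripped
        if candidate in ROOT_WORDS:
            return candidate
    return word                                             # Step h: give back the original word

print(stem("terlambat"))  # 'lambat'
```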
2.4. Feature Selection Methods
1. Term Frequency
Term frequency is the standard notion of frequency in corpus-based natural language processing (NLP). It counts the number of times that a type (term/word/n-gram) shows up in a corpus [12]. The term frequency of a term $t$ in a document $d$ can be utilized for term weighting and is denoted as $tf_{t,d}$.
2. TF-IDF
TF-IDF combines both TF and IDF to determine the weight of a term [13]. The TF-IDF weight of a term $t$ in a document $d$ is given by [14]:

$w_{t,d} = tf_{t,d} \times idf_t$    (2)

The weight of term $t$ in document $d$ is denoted as $w_{t,d}$, whereas $idf_t$ is the inverse document frequency of term $t$, which is derived from:

$idf_t = \log \frac{N}{df_t}$    (3)
In Equation (3), the number of all documents is represented as $N$ and $df_t$ is the number of documents that contain term $t$.
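As an illustration of Equations (2) and (3), the sketch below computes TF-IDF weights for a tiny corpus of tokenized tweets. It is a minimal, from-scratch example using raw term counts; the actual experiments used RapidMiner's built-in word-vector generation, which may normalize the weights differently.

```python
import math
from collections import Counter

docs = [["semangat", "senyum"], ["tidak", "semangat"], ["senyum", "senyum", "sedih"]]  # toy tokenized tweets

N = len(docs)
df = Counter(term for doc in docs for term in set(doc))     # document frequency df_t

def tf_idf(doc):
    tf = Counter(doc)                                        # raw term frequency tf_{t,d}
    return {t: tf[t] * math.log(N / df[t]) for t in tf}      # Equation (2), with idf_t from Equation (3)

for d, doc in enumerate(docs):
    print(d, tf_idf(doc))
```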
2.5. Classification Method
1. Naive Bayes
Naive Bayes is a probabilistic learning approach [14]. The likelihood of a document $d$ being in class $c$ is computed as:

$P(c|d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k|c)$    (4)
$P(t_k|c)$ is the probability of term $t_k$ occurring in a document of class $c$, whereas $P(c)$ is the prior probability of a document occurring in class $c$. The variable $n_d$ is the number of tokens in document $d$.
The best class in Naive Bayes classification is the most probable one, obtained as the maximum a posteriori (MAP) class:

$c_{map} = \arg\max_{c \in C} P(c|d) = \arg\max_{c \in C} P(c) \prod_{1 \le k \le n_d} P(t_k|c)$    (5)
The value of $P(c)$ is calculated using Equation (6) by dividing the number of documents in class $c$ ($N_c$) by the total number of documents in the training data ($N'$):

$P(c) = \frac{N_c}{N'}$    (6)
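To make Equations (4)-(6) concrete, the following is a minimal multinomial Naive Bayes sketch in log space. The add-one smoothing is an assumption for the sketch; the paper does not state how RapidMiner's Bayesian operator estimates the term probabilities.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels):
    prior = Counter(labels)                                  # N_c per class, for Equation (6)
    term_counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        term_counts[c].update(doc)
    vocab = {t for doc in docs for t in doc}
    return prior, term_counts, vocab, len(labels)

def predict(doc, prior, term_counts, vocab, n_docs):
    best_class, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c] / n_docs)                  # log P(c), Equation (6)
        total = sum(term_counts[c].values())
        for t in doc:                                        # sum of log P(t_k|c), Equation (4)
            score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))  # add-one smoothing (assumed)
        if score > best_score:
            best_class, best_score = c, score                # argmax over classes, Equation (5)
    return best_class

docs = [["senyum", "semangat"], ["sedih", "tidak_semangat"]]  # toy training data
labels = ["positive", "negative"]
model = train(docs, labels)
print(predict(["semangat", "senyum"], *model))  # 'positive'
```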
2. Support Vector Machine (SVM)
SVM finds a hyperplane with the largest possible margin to separate two categories. Hyperplanes with a larger margin are less likely to overfit the training data. Suppose that we have datasets $(x_1, y_1), \ldots, (x_n, y_n)$, where $y_i \in \{-1, +1\}$ is the class label of $x_i$. Figure 3 shows the hyperplane with the largest margin, which separates the two classes. The hyperplane can be written as:

$w \cdot x + b = 0$    (7)
In Equation (7), $w$ is a weight vector $\{w_1, w_2, \ldots, w_n\}$ and $b$ is an additional weight (bias). According to Figure 3, there are two hyperplanes that define the sides of the margin. Both hyperplanes can be written as:

$w \cdot x_i + b \ge +1 \quad \text{for } y_i = +1$    (8)

$w \cdot x_i + b \le -1 \quad \text{for } y_i = -1$    (9)
Figure 3. Hyperplane with the largest margin
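The role of Equations (7)-(9) in classification can be shown with a tiny numeric sketch: given a learned weight vector w and bias b (the values below are made up for illustration, not learned from the paper's data), the predicted class is the sign of the decision function.

```python
def svm_predict(x, w, b):
    """Sign of the decision function w.x + b from Equation (7)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return +1 if score >= 0 else -1          # the +1 / -1 sides follow Equations (8) and (9)

w = [0.8, -0.5]   # assumed, illustrative weight vector
b = -0.1          # assumed bias
print(svm_predict([1.0, 0.2], w, b))   # +1
print(svm_predict([0.1, 1.5], w, b))   # -1
```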
671
2.6. Experiment
This work uses RapidMiner 5.3.000 to conduct the experiment. Moreover, this research proposes four combinations in the experiment: TF-IDF with stemming, TF-IDF without stemming, term frequency (TF) with stemming, and term frequency (TF) without stemming. Each combination is classified using the Naive Bayes Classifier and SVM. The main process in RapidMiner can be seen in Figure 4 below.
Figure 4. Main Process in RapidMiner
The main process in RapidMiner uses four operators: read database, process document, set role, and split validation. The read database operator retrieves tweet data from the database. The process document operator generates word vectors from string attributes using term frequency or TF-IDF. The set role operator is used to change the attribute role (e.g. regular, special, label, id, etc.). The split validation operator randomly splits the data into a training and a test set, then evaluates the model.
The training and testing process are conducted in the subprocess of the split validation operator. Three operators are chosen in the subprocess: a modeling operator, apply model, and performance. As the modeling operator, the experiment uses Bayesian Modeling for the Naive Bayes method. For the SVM method, it uses Support Vector Modeling. Figure 5 depicts the training and testing process using the Naive Bayes method.
Figure 5. Training and testing process using Naive Bayes
The scenario for the training and testing process using SVM can be seen in Figure 6 below.
Figure 6. Training and testing process using SVM
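For readers without RapidMiner, a rough scikit-learn analogue of this pipeline (word-vector generation, a random train/test split in place of the split validation operator, and Naive Bayes or a linear SVM as the modeling operator) could look like the sketch below. The toy tweets, the labels, and the 70/30 split ratio are illustrative assumptions; the paper does not report the exact split used by the split validation operator.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

tweets = ["semangat senyum", "tidak_semangat sedih", "senyum senyum", "sedih tidak_semangat"]  # toy data
labels = ["positive", "negative", "positive", "negative"]

X = TfidfVectorizer().fit_transform(tweets)              # process document: TF-IDF word vectors
X_train, X_test, y_train, y_test = train_test_split(     # split validation: random train/test split
    X, labels, test_size=0.3, random_state=0, stratify=labels)

for model in (MultinomialNB(), LinearSVC()):             # modeling operator: Naive Bayes or SVM
    y_pred = model.fit(X_train, y_train).predict(X_test)     # apply model
    print(type(model).__name__, accuracy_score(y_test, y_pred))  # performance: accuracy
```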
3. Results and Analysis
To examine the influence of stemming in tweet classification, our experiments compare the accuracy based on the number of datasets. The datasets are divided into three different subset sizes (1500, 1750, and 2000).
The experiments using Naive Bayes and TF-IDF can be seen in Table 7. The accuracy results with stemming are lower than without stemming. However, the accuracy difference between both treatments is not too high, with an average of 0.86%.
Table 7. Experiments Results Using Naive Bayes and TF-IDF

Number of Datasets    Accuracy (TF-IDF + Stemming)    Accuracy (TF-IDF + Without Stemming)    Difference
1500                  85.33%                          86.00%                                  0.67%
1750                  84.00%                          84.57%                                  0.57%
2000                  84.00%                          85.33%                                  1.33%
Average               84.44%                          85.30%                                  0.86%
The results for Naive Bayes and term frequency also indicate the same condition. Stemming in the pre-processing task does not help to raise the accuracy. The average accuracy difference in this experiment is 2.34%, which is higher than in the previous experiment using TF-IDF. More detail about the results using Naive Bayes and term frequency is shown in Table 8.
Table 8. Experiments Results Using Naive Bayes and Term Frequency

Number of Datasets    Accuracy (Term Frequency + Stemming)    Accuracy (Term Frequency + Without Stemming)    Difference
1500                  88.00%                                  88.67%                                          0.67%
1750                  86.29%                                  89.14%                                          2.85%
2000                  85.50%                                  89.00%                                          3.50%
Average               86.60%                                  88.94%                                          2.34%
On the other hand, our experiments using SVM also show that stemming in the pre-processing task does not enhance the accuracy. Table 9 depicts the experiment results using SVM and TF-IDF.
Table 9. Experiments Results Using SVM and TF-IDF

Number of Datasets    Accuracy (TF-IDF + Stemming)    Accuracy (TF-IDF + Without Stemming)    Difference
1500                  89.56%                          91.33%                                  1.77%
1750                  91.62%                          92.00%                                  0.38%
2000                  91.50%                          91.67%                                  0.17%
Average               90.89%                          91.67%                                  0.77%
Table 10 shows the accuracy results using SVM and term frequency. Based on our experiments, the accuracy attained without stemming is better than when stemming is conducted in pre-processing. According to both experiments using SVM, the accuracy differences are almost the same. The average accuracy difference using SVM and TF-IDF is 0.77%, whereas for SVM and term frequency it is 0.78%. Based on our experiments, it is clear that the pre-processing task without stemming gives better accuracy than when stemming is conducted in pre-processing.
Table 10. Experiments Results Using SVM and Term Frequency

Number of Datasets    Accuracy (Term Frequency + Stemming)    Accuracy (Term Frequency + Without Stemming)    Difference
1500                  88.67%                                  89.11%                                          0.44%
1750                  91.05%                                  91.62%                                          0.57%
2000                  90.67%                                  92.00%                                          1.33%
Average               90.13%                                  90.91%                                          0.78%
4. Conclusion
This paper has examined the influence of stemming on tweet classification. To examine the stemming influence, this work compared two approaches in the pre-processing task: the first pre-processing pipeline involved stemming and the other did not. According to the experiments, all accuracy results in tweet classification tend to decrease when stemming is applied. Moreover, the stemming task does not help to raise the accuracy using either the SVM or the Naive Bayes algorithm. Finally, this work concludes that the stemming process does not significantly affect the accuracy performance.
References
[1] Basnur PW, Sensuse DI. Pengklasifikasian Otomatis Berbasis Ontologi untuk Artikel Berbahasa Indonesia. MAKARA of Technology Series. 2010; 14(1): 29-35.
[2] Ramasubramanian C, Ramya R. Effective Pre-processing Activities in Text Mining Using Improved Porter's Stemming Algorithm. International Journal of Advanced Research in Computer and Communication Engineering. 2013; 2(12): 4536-4538.
[3] Sharma D, Jain S. Evaluation of Stemming and Stop Word Techniques on Text Classification Problem. International Journal of Scientific Research in Computer Science and Engineering. 2015; 3(2): 1-4.
[4] Gaustad T, Bouma G. Accurate Stemming of Dutch for Text Classification. Language Computing. 2002; 45(1): 104-177.
[5] Yu B. An Evaluation of Text Classification Methods for Literary Study. Literary and Linguistic Computing. 2008; 23(3): 327-343.
[6] Torunoğlu D, Çakırman E, Ganiz MC, Akyokuş S, Gürbüz MZ. Analysis of Preprocessing Methods on Classification of Turkish Texts. In: Innovations in Intelligent Systems and Applications (INISTA), 2011 International Symposium on. IEEE. 2011: 112-117.
[7] Toman M, Tesar R, Jezek K. Influence of Word Normalization on Text Classification. Proceedings of InSciT. 2006; 4: 354-358.
[8] Wahbeh A, Al-Kabi M, Al-Radaideh Q, Al-Shawakfa E, Alsmadi I. The Effect of Stemming on Arabic Text Classification: An Empirical Study. International Journal of Information Retrieval Research. 2011; 1(3): 54-70.
[9] Hidayatullah AF. The Influence of Indonesian Stemming on Indonesian Tweet Sentiment Analysis. Proceeding of International Conference on Electrical Engineering, Computer Science and Informatics (EECSI 2015). Palembang, Indonesia. 2015; 2(1): 182-187.
[10] Agusta L. Perbandingan Algoritma Stemming Porter dengan Algoritma Nazief & Adriani untuk Stemming Dokumen Teks Bahasa Indonesia. In: Proceeding Konferensi Nasional Sistem dan Informatika. Bali, Indonesia. 2009: 196-201.
[11] Asian J, Williams HE, Tahaghoghi SMM. Stemming Indonesian. Proceedings of the Twenty-eighth Australasian Conference on Computer Science. 2005; 38: 307-314.
[12] Yamamoto M, Church KW. Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus. Computational Linguistics. 2001; 27(1): 1-30.
[13] Srividhya V, Anitha R. Evaluating Preprocessing Techniques in Text Categorization. International Journal of Computer Science and Application. 2010; 47(11): 49-51.
[14] Manning C, Raghavan P, Schutze H. Introduction to Information Retrieval. Cambridge University Press. 2009.