TELKOMNIKA, Vol.14, No.1, March 2016, pp. 219~227
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v14i1.1926

Received April 25, 2015; Revised December 3, 2015; Accepted December 26, 2015

Stemming Influence on Similarity Detection of Abstract
Written in Indonesia

Tari Mardiana*1, Teguh Bharata Adji2, Indriana Hidayah3
1,2,3 Department of Electrical Engineering and Information Technology, UGM,
Jalan Grafika No. 2, Yogyakarta, Indonesia
1 Informatics Engineering, Tanjungpura University,
Jalan Jenderal Ahmad Yani, Pontianak, Indonesia
*Corresponding author, email: tari.mardiana@gmail.com1, adji.tba@gmail.com2, indriana.h@ugm.ac.id3

Abstract
This paper discusses the effect of stemming with the Nazief and Adriani algorithm on the results of similarity detection for abstracts written in Indonesian. Detecting similarity in the content of publication abstracts can serve as an early indication of whether plagiarism has occurred in a piece of writing. Text processing usually includes a pre-processing stage, one step of which is stemming: reducing each word to its root word in order to improve the search process. The stemmed text is converted into a set of word n-grams, and Fingerprint Matching is then applied to measure the similarity between texts. Based on the F1-score, which balances precision and recall, detection that applies both stemming and stopword removal gives the best result in detecting similarity between texts, with an average of 42%. This is higher than similarity detection that uses only stemming (31%) or no text pre-processing at all (34%) when bigrams are applied.

Keywords: abstract, Indonesian, similarity, stemming, word

Copyright © 2016 Universitas Ahmad Dahlan. All rights reserved.

1. Introduction
Indonesian can be regarded as a high-context language because of the complexity of its word usage. In the field of Natural Language Processing (NLP), Indonesian is a less-resourced language, so research on it is limited. Indonesian words rarely appear in their exact root form, because affixes attached to the stem (prefixes, suffixes, or word repetition) can change the meaning of the word. A process is therefore needed to reduce affixed words to their root form so that the meaning of the word is easier to identify; this process is called stemming. Stemming is similar to lemmatization, but stemming does not need to consider the meaning of the word it produces.

Stemming algorithms differ from one language to another, so applying the same technique to a different language can lead to different results; stemming is a language-dependent process. As a pre-processing phase in text retrieval, one popular stemmer, the Porter algorithm, has been applied not only to English text but also to Spanish and Portuguese [1, 2]. To make the Porter algorithm applicable to Indonesian, it was modified in [3] by adding some rules and measurement requirements. Several other stemming algorithms have been proposed for Indonesian [4], such as the Nazief and Adriani algorithm, the Arifin and Setiono algorithm, the Vega algorithm, the Ahmad, Yusof, and Sembok algorithm, and the Idris algorithm. A study [5] that compared the Porter and Nazief and Adriani (NA) algorithms on Indonesian documents shows that the Porter algorithm yields lower precision. Another study reports that the NA algorithm can stem words with a success rate of up to 93% [6]. However, there is no established rule as to which algorithm should be applied so that stemming improves the accuracy of Information Retrieval, given that a document is basically regarded as coherent text that contains useful information.

Duplication in a text document can take the form of sentences that have been rearranged and modified into smaller groups of words. To find pairs of similar words, the sentence must be divided into smaller fragments. The word-forming process used to find the keywords of a sentence can be carried out through text pre-processing, which consists of keyphrase matching [7] and tokenization [8]. The state of the art in similarity matching is that duplication in a text must be detected and analyzed even when only a small part of the wording has been modified, which is known as text re-use. Applying stemming in statistical similarity detection is believed to be one way to compare words across different texts and to identify identical words that have the same meaning [9]. Similarity is usually identified on the basis of word similarity, fingerprint similarity, or latent semantic analysis (LSA) [10].

No previous study has examined stemming in Indonesian as part of text pre-processing, particularly for similarity detection in written abstracts. Detection that involves stemming can erase important information, so further study is needed to see how important text pre-processing is in similarity detection. This paper aims to examine the influence of stemming on similarity detection by comparing the matching results for abstracts that were stemmed with those for abstracts that were not. This study uses the Nazief and Adriani algorithm and n-grams to process the ordered word stems produced by stemming; the similarity value is taken from fingerprint matching, which is performed to see whether similarity exists among the abstracts. Based on that analysis, this paper then explains the probable factors that affect the detection result and the error factors that probably cause stemming to fail.

2. Research Method
In detecting similarity between written abstracts, this research goes through several phases: preprocessing of the input abstract, tokenization, text hashing, and similarity analysis, with the similarity value between texts as the output. Figure 1 shows the flow of these phases.

Figure 1. System flows of similarity detection

An abstract that has passed preprocessing is formed into a set of word-gram terms through the tokenization process, according to a gram value determined beforehand. Using a hash function, each gram set is represented as a hexadecimal value and used as a unique identity for matching. Each process is explained in detail below.

2.1. Preprocessing
Preprocessing is the very first step before a document or text is processed further. It consists of whitespace insensitivity, stopword removal, and stemming.
a) Whitespace insensitivity: this process eliminates all unnecessary characters such as spaces, colons, semicolons, and numbers, and applies case folding to change all characters to lowercase.
b) Stopword removal: this process eliminates common words that are used often, or meaningless repetitive words in sentences, for example “masing-masing”. It yields a set of unique words, which is expected to improve the accuracy of the similarity result [8].
c) Stemming: this process reduces related words to the same root word; for example, “membeli” and “dibeli” both come from the root “beli”. Many Indonesian words contain affixes (prefixes, infixes, suffixes, or confixes) or word repetition attached to a stem. After stemming, the text contains only root words, which allows plagiarism detection even for text whose word order has been changed.
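
The first two preprocessing steps can be sketched as follows. This is a minimal illustration, not the system used in the study: the function name `preprocess`, the regular expression, and the four-word stoplist are assumptions (the study's stoplist is not reproduced in the paper), and spaces are kept as token boundaries so that stopwords can be matched.

```python
import re

# Hypothetical mini stoplist for illustration only; the study's own
# stoplist is not reproduced in the paper.
STOPWORDS = {"di", "yang", "adalah", "masing-masing"}

def preprocess(text, stopwords=STOPWORDS):
    """Case-fold, drop digits and punctuation, then remove stopwords."""
    text = text.lower()
    # Keep letters, hyphens (for repeated words such as "masing-masing")
    # and spaces; everything else becomes a space.
    text = re.sub(r"[^a-z\- ]+", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in stopwords]

print(preprocess("Sistem yang dibangun adalah: 2 modul, masing-masing di server."))
# ['sistem', 'dibangun', 'modul', 'server']
```

Stemming would follow as a third step on the returned tokens; it is omitted here because it needs the full rule set and a root-word dictionary.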

2.1.1. Stemmer Nazief Adriani
In Indonesian morphology, word formation follows rules for inflectional and derivational words, defined as follows.

Inflectional = (root + possessive_pronouns) | (root + particle) | (root + possessive_pronouns + particle)
Derivational = prefixed | suffixed | confixed | double_prefixed

in which:
Prefixed = prefix + root;
Suffixed = root + suffix;
Confixed = prefix + root + suffix;
Double_prefixed = (prefix + prefixed) | (prefix + confixed) | (prefix + prefixed + suffix)

From the definitions above, the general morphological structure of Indonesian, as described in [3], can be written as the following rule.

[prefix1] + [prefix2] + root + [suffix] + [possessive_pronouns] + [particle]

In addition, some prefixes such as ber-, meng-, peng-, per-, and ter- change from their original form, which is called nasal substitution. That is, the articulation changes when a prefix (e.g. meng-) is attached to the stem of a word [11], as can be seen in Table 1. These changes depend largely on the first letter of the stem to which the prefix is attached.

Table 1. Examples of word formation rules with prefix meng-
Prefix   Origin   Formation    Substitution
meng-    tulis    menulis      {meng|t} = men-
meng-    sewa     menyewa      {meng|s} = meny-
meng-    pakai    memakai      {meng|p} = mem-
meng-    kritik   mengkritik   {meng|k} = meng-
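
The four patterns in Table 1 can be reproduced with a small sketch. It covers only the letters shown in the table, and the rule that the first consonant is dropped only when a vowel follows it is an assumption inferred from the four examples, not a rule stated in the paper. The names `MENG_FORMS` and `attach_meng` are hypothetical.

```python
# Forms of meng- taken from Table 1, keyed by the first letter of the root.
MENG_FORMS = {"t": "men", "s": "meny", "p": "mem", "k": "meng"}
VOWELS = set("aiueo")

def attach_meng(root):
    """Attach meng- to a root, following only the four patterns of Table 1."""
    first = root[0]
    prefix = MENG_FORMS.get(first)
    if prefix is None:
        raise ValueError("first letter not covered by Table 1: " + first)
    # Assumption: the initial consonant is dropped only when a vowel follows.
    if len(root) > 1 and root[1] in VOWELS:
        return prefix + root[1:]
    return prefix + root

for root in ["tulis", "sewa", "pakai", "kritik"]:
    print(root, "->", attach_meng(root))
```

A stemmer has to run this mapping in reverse, which is where ambiguity arises: a surface form beginning with men- may hide a root that starts with t or with n.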

Indonesian morphology is more complicated than English morphology because, during the morphological process, Indonesian always combines affixes, root words, and grammatical rules at the same time.

The Nazief and Adriani (NA) algorithm was first introduced in 1996 in a technical report from the University of Indonesia and was developed further in [12]. The algorithm is based on morphological rules that are interlinked and grouped together, and that encapsulate the allowed and disallowed affixes (prefixes, suffixes, and confixes) in order to obtain the root of a word. Basically, every stemmer can increase the sensitivity of document retrieval; however, searching for a root word through stemming often removes part of the meaning of the word. If the removal rules are applied in the prescribed order, stemming failures in which the stem cannot be found can be minimized. These failures are overstemming, in which too much of the word is removed, and understemming, in which the word cannot be stemmed because no removal rule matches it. The basic removal process of the NA stemmer, based on the explanation in [6], is shown in Figure 2.

Figure 2. Basic removal process of NA stemmer

The Nazief and Adriani algorithm applies 37 rules to perform word stemming. Its performance depends on three parts: the grouping of affixes, the usage rules and the limits set on them, and the dictionary used. The dictionary is important because it is used to check whether a word has already been reduced to its stem. Before the affix removal process, several things must be considered when using this algorithm.
a) Inflection suffixes: a set of suffixes that do not change the stem, such as -lah, -kah, -ku, -mu, -pun, and -nya. They comprise the particles -lah, -kah, -tah, and -pun, and the possessive pronouns -ku, -mu, and -nya.
b) Derivation suffixes: a set of suffixes, such as -i, -an, and -kan, that are attached directly to the root, although a word may carry more than one suffix.
c) Derivation prefixes: a set of prefixes, such as di-, ke-, se-, te-, be-, me-, and pe-, that are attached directly to the stem or, in words with two prefixes, placed together.
d) Prefix disallowed suffixes: combinations of a prefix and a suffix that are not allowed to be attached to the same stem, as listed in Table 2.

2.2. Tokenization
Tokenization is an initialization phase that extracts structured text in the form of single words. In this stage, the abstract text is turned into a set of shingles using word n-grams (WNG). WNG is one type of construction model that can be used as a means of verification in the detection process [13]. The larger the value of n, the fewer shingles are formed. An n-gram approach requires a proper value of n in order to produce a clear distinction between the sentences in a document [14]. Examples of tokenization with WNG are shown in Table 3.

Table 2. Disallowed combinations of prefix and suffix
Prefix   Suffix
be-      -i
di-      -an
ke-      -i, -kan
me-      -an
se-      -i, -kan
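
Table 2 lends itself to a simple lookup. The sketch below is an illustration only; the names `DISALLOWED` and `is_disallowed` are hypothetical, and the paper does not describe how the check is implemented.

```python
# Disallowed prefix-suffix pairs, copied from Table 2.
DISALLOWED = {
    "be": {"i"},
    "di": {"an"},
    "ke": {"i", "kan"},
    "me": {"an"},
    "se": {"i", "kan"},
}

def is_disallowed(prefix, suffix):
    """Return True if the prefix and suffix may not surround the same stem."""
    return suffix.strip("-") in DISALLOWED.get(prefix.strip("-"), set())

print(is_disallowed("be-", "-i"))    # True
print(is_disallowed("me-", "-kan"))  # False
```

A stemmer would consult such a table before stripping a prefix, so that a word carrying a disallowed pair is not reduced any further.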

Table 3. Examples of shingle formation with WNG
Original   belajar komputer itu tidak sulit
Unigram    {belajar}{komputer}{itu}{tidak}{sulit}
Bigram     {belajarkomputer}{komputeritu}{itutidak}{tidaksulit}
Trigram    {belajarkomputeritu}{komputeritutidak}{itutidaksulit}
Fourgram   {belajarkomputeritutidak}{komputeritutidaksulit}
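
The shingles in Table 3 can be produced with a short function. This is a sketch: the name `word_ngrams` is hypothetical, and joining the words without spaces simply mirrors how the shingles are printed in the table.

```python
def word_ngrams(tokens, n):
    """Return the word n-gram shingles of a token list, joined without spaces."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "belajar komputer itu tidak sulit".split()
for n in (1, 2, 3, 4):
    print(n, word_ngrams(tokens, n))
```

For a text of k words the function returns k - n + 1 shingles, which is why a larger n yields fewer shingles.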

2.3. Hashing Text
In this process, every shingle in the set is represented by a hash function as a group of hexadecimal digits called a fingerprint. The objective of hashing is to obtain unique values that serve as identities to distinguish the formed words from one another. Fingerprinting is one of the techniques that can be used for similarity analysis that may point to an act of plagiarism [15].
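
A minimal sketch of the hashing step follows. The paper does not name the hash function it uses, so MD5 from Python's standard `hashlib` module is an assumption made purely for illustration; the function names are hypothetical as well.

```python
import hashlib

def fingerprint(shingle):
    """Map one shingle to a hexadecimal digest (MD5 chosen only for illustration)."""
    return hashlib.md5(shingle.encode("utf-8")).hexdigest()

def fingerprints(shingles):
    """Return the set of fingerprints for a list of shingles."""
    return {fingerprint(s) for s in shingles}

print(fingerprint("belajarkomputer"))
```

Identical shingles always map to the same fingerprint, so comparing two documents reduces to comparing two sets of hexadecimal strings.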

2.4. Similarity Analysis
The last process matches the fingerprint values produced by hashing between the abstract suspected of plagiarism and the abstracts already stored in the database. Given the numbers of shingles A and B of the two texts and the number of shingles C that they have in common, the similarity value is calculated using the Dice coefficient according to formula (1) and expressed as a percentage.

$$\text{similarity} = \frac{2C}{A + B} = \frac{2\,|A \cap B|}{|A| + |B|} \tag{1}$$
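
Formula (1) can be sketched on two collections of fingerprints as follows. The function name `dice_similarity` is hypothetical, and treating each document's fingerprints as a set (so that repeated shingles count once) is an assumption; the paper does not say how duplicates are handled.

```python
def dice_similarity(fp_a, fp_b):
    """Dice coefficient between two fingerprint collections, as a percentage."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty inputs
    common = len(a & b)
    return 100.0 * 2 * common / (len(a) + len(b))

# Two toy documents that share two of their three shingles each.
print(round(dice_similarity({"x", "y", "z"}, {"y", "z", "w"}), 2))  # 66.67
```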

To support processing and testing, a dataset of 30 abstract documents in the field of Information Technology was used: 25 documents as training material and 5 documents for testing. The dictionary used to assist stemming contains 31,296 stems, while the stoplist available for stopword removal contains 756 words.

For further evaluation, the fault cases that commonly occur when applying the NA algorithm are classified by Asian et al. [6] into the following categories: 1) Non-root words in dictionary, 2) Hyphenated words, 3) Incomplete dictionary, 4) Misspellings, 5) Incomplete affix rules, 6) Overstemming, 7) People's names, 8) Combined words, 9) Recoding ambiguity (dictionary related), 10) Acronyms, 11) Recoding ambiguity (rule related), 12) Other, 13) Understemming, 14) Foreign words, and 15) Human error.

Stemming accuracy is calculated as the number of correctly stemmed words (S_B) divided by the total number of unique words in the text (S_T), according to formula (2). Successful detection is evaluated on the basis of precision and recall, obtained from equations (3) and (4). The F1-score, calculated using equation (5), is used to balance precision and recall.

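$$\text{Accuracy} = \frac{S_B}{S_T} \times 100\% \tag{2}$$

$$\text{precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{4}$$

$$F_1\text{-score} = \frac{2 \cdot p \cdot r}{p + r} \tag{5}$$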
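
Formulas (2) to (5) translate directly into code. The sketch below uses hypothetical function names and returns 0.0 when a denominator is zero, which is a convention chosen here rather than one stated in the paper. The accuracy example uses the counts reported for 1_ABS.docx in Table 4 (158 correct stems out of 161 unique words); the TP, FP, and FN counts are made up for illustration.

```python
def accuracy(correct_stems, unique_words):
    """Formula (2): share of correctly stemmed unique words, in percent."""
    return 100.0 * correct_stems / unique_words

def precision(tp, fp):
    """Formula (3)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Formula (4)."""
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    """Formula (5): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

print(round(accuracy(158, 161), 2))  # 98.14

# Illustrative counts only: 2 true positives, 1 false positive, 0 false negatives.
p, r = precision(2, 1), recall(2, 0)
print(round(p, 2), round(r, 2), round(f1_score(p, r), 2))  # 0.67 1.0 0.8
```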

3. Results and Discussion
This section presents the test results and the analysis of the stemmer algorithm and the similarity detection system that was built. Testing is divided into two scenarios: the first test examines how successfully words in the text are stemmed by the NA algorithm, and the second evaluates detection by comparing the similarity obtained from text that was stemmed with that from text that was not.

The first test measured the performance of the NA stemmer on 10 abstracts. Prior to stemming, stopword removal was applied to discard common Indonesian words considered to have no significance, such as “di”, “yang”, and “adalah”, so that only unique words remain. Table 4 shows the stemming results of the NA algorithm, and Table 5 shows the classification of the fault cases (stemming errors) that occurred in the abstract documents.

Table 4. Results of word stemming using the NA algorithm
Abstract Doc   Word Count   Unique Words   Correct Stems (NA)   Accuracy (%)
1_ABS.docx     248          161            158                  98.14
2_ABS.docx     263          190            183                  96.32
7_ABS.docx     330          232            222                  95.69
9_ABS.docx     164          117            114                  97.43
10_ABS.docx    207          143            141                  98.60
12_ABS.docx    287          217            212                  97.69
13_ABS.docx    244          175            171                  97.71
14_ABS.docx    286          198            190                  95.95
15_ABS.docx    281          218            195                  89.45
20_ABS.docx    126          85             83                   97.65
AVERAGE                                                         96.46

Table 5. Most Fault Cases in Abstract
Fault cases                                Total Case(s)
Non-root words in dictionary               4
Hyphenated words                           8
Incomplete dictionary                      4
Misspellings                               2
Incomplete affix rules                     10
Overstemming                               6
Combined words                             2
Recoding ambiguity (dictionary related)    13
Recoding ambiguity (rule related)          8
Other                                      2
Understemming                              1
Foreign words                              2
Human error                                5
TOTAL                                      67

Several factors behind the stemming failures in the abstract documents should be noted. Based on the failure classification in Table 5, they are detailed below.
a) Some words regarded as foreign words, for example meminimalisir and normalisasi, fail to be stemmed because no removal rule covers them. Meanwhile, the words sedangkan, pencari, pelaku, and diolah are overstemmed into dang, pencar, pela, and o. Overstemming can occur because the process removes as many affixes as possible according to the rules applied. Another cause of stemming failure is the use of uncommon words in the abstract, e.g. kerapkali.
b) The most prevalent stemming failures involve words that fall under recoding ambiguity (dictionary related). The NA algorithm relies on the dictionary as the basis for stem matching. For example, the removal rule for words with the confix per- and -an leads to stemming failure when the stem begins with r: perawatan becomes awat and perancangan becomes ancang. This happens because the dictionary contains both rancang and ancang, both rawat and awat, and other such pairs.
c) Words that refer to quantity, for example sejumlah and berjumlah, are wrongly subjected to the inflectional suffix removal rule for -lah, which leads to failure. In addition, there are still many stemming errors for repeated words, e.g. sehari-hari and berbeda-beda.
d) Failures due to human error occur where words are typed without a space, so that two words are written as if they were one.

The second test aims to detect similarity with an emphasis on text preprocessing, comparing pure stemming (ST), stemming combined with stopword removal (ST+SWR), and detection without preprocessing; success is measured using precision and recall. The word n-gram (WNG) lengths used in this study to construct word terms are 2, 3, and 4. Five abstracts were chosen at random as test samples to evaluate the detection system that was built. The precision and recall values for these scenarios are shown in Figures 3 and 4.

Figure 3. Evaluation of similarity detection based on precision

Figure 4. Evaluation of similarity detection based on recall

Figures 3 and 4 show that similarity detection on abstracts that underwent both stemming and stopword removal during preprocessing has a higher precision, 67% with fourgrams, than detection that used only stemming or no preprocessing. However, the best recall, 100%, is obtained by detection without preprocessing when bigrams and trigrams are used. This result supports the finding of [4] that stemming in text retrieval helps to increase recall but reduces precision. Some other false-positive cases also contribute to the low precision.

The similarity value obtained without preprocessing and with a low gram value (bigram) was higher, at 4.75%, than the values obtained with preprocessing. The low similarity value shows that each final project abstract can be categorized as a unique text, since the abstracts differ in content according to their research fields. This indicates that the similarity value alone is not a decisive way to determine reduplication, although it can still be used as a first filter before similarity detection on a full written discourse. Based on Baždarić's assumption [16], plagiarism in a piece of writing is estimated at 5-10% similarity, or around 100 similar words in one document, so the size of the document being checked must also be taken into account.

Figure 5. Evaluation of similarity detection based on F1-score

From the overall results shown in Figure 5, text preprocessing that combines stemming and stopword removal gives the highest value across all word n-gram levels applied in the two scenarios evaluated, with an F1 rate of 42%. The faults that occur during stemming leave some meaningless and unsuitable words in the text, which reduces the number of similar terms in the intersection of two texts, especially when the result is compared with the other detection scenarios. Adding stopword removal improves on stemming alone by reducing the number of commonly repeated words. This therefore needs further discussion whenever stemming is applied in detection that relies on word statistics.

This result also shows that combining stemming and stopword removal gives better similarity detection because more of the unique words in the text are checked. It is quite possible that even the smallest modification can be detected with this process. The use of WNG, which disregards the position of a word or term during matching, also helps detection in cases where word positions have been changed. However, the use of document fingerprints as the matching tool, which depends on the word-gram length, also needs attention, because the gram length determines how clearly the reduplicated parts of the text can be seen, and vice versa.

4. Conclusion and Future Work
Based on previous studies, the Nazief and Adriani algorithm is considered well qualified for stemming, although it still produces some errors. This problem makes it important to build a dictionary that contains root words. It is also necessary to build a dictionary containing the standardized root words of the KBBI (Kamus Besar Bahasa Indonesia) that can meet users' needs. Text preprocessing for similarity detection in abstracts written in Indonesian is more suitable when stemming is applied together with stopword removal, because the unique words resulting from this combination help the matching process in cases where word positions or terms have been changed. However, similarity detection without preprocessing remains applicable as an alternative, since the measurement is based on the intersecting words in the compared texts. Moreover, the similarity value alone is not adequate to decide whether an abstract is the result of reduplication. It can still be used as a guide to whether some parts of an abstract duplicate another abstract.

For further research, it is suggested to focus on resolving the fault cases found during stemming, such as those related to non-root words in the dictionary and hyphenated words, in order to reduce errors in the stemming process. Similarity detection for research abstracts should also focus more on semantics, to cover the weakness of word matching, which commonly only looks for similarity between word terms, whereas words may carry different meanings depending on the research field.

References
[1] CG Figuerola, R Gómez, EL De San Román. Stemming and n-grams in Spanish: An evaluation of their impact on information retrieval. Journal of Information Science. 2000; 26(6): 461-467.
[2] MVB Soares, RC Prati, MC Monard. Improvement on the Porter’s Stemming Algorithm for Portuguese. Latin America Transactions, IEEE (Revista IEEE America Latina). 2009; 7(4): 472-477.
[3] FZ Tala. Effects on Information Retrieval in Bahasa Indonesia. Master Thesis. University of Amsterdam, Institute for Logic, Language and Computation; 2003.
[4] J Asian. Effective Techniques for Indonesian Text Retrieval. PhD Thesis. RMIT University, Melbourne, School of Computer Science and Information Technology, Science, Engineering, and Technology Portfolio; 2007.
[5] L Agusta. Perbandingan Algoritma Stemming Porter Dengan Algoritma Nazief & Adriani Untuk Stemming Dokumen Teks Bahasa Indonesia. Presented at the Konferensi Nasional Sistem dan Informatika, Bali. 2009: 196-201.
[6] J Asian, HE Williams, SMM Tahaghoghi. Stemming Indonesian. in Twenty-Eighth Australasian Computer Science Conference (ACSC2005). Newcastle, Australia. 2005; 38: 307-314.
[7] I Veritawati, I Wasito, T Basaruddin. Text Preprocessing using Annotated Suffix Tree with Matching Keyphrase. International Journal of Electrical and Computer Engineering (IJECE). 2015; 5(3): 409-420.
[8] Z Ceska, C Fox. The influence of text pre-processing on plagiarism detection. in International Conference Recent Advances in Natural Language Processing, RANLP. 2009: 55-59.
[9] S Hariharan. Automatic Plagiarism Detection Using Similarity Analysis. The International Arab Journal of Information Technology. 2012; 9(4): 322-326.
[10] Z Alfikri, A Purwarianti. Detailed Analysis of Extrinsic Plagiarism Detection System Using Machine Learning Approach Naive Bayes and SVM. TELKOMNIKA Indonesian Journal of Electrical Engineering. 2014; 12(11): 7884-7894.
[11] M Haspelmath, AD Sims. Understanding Morphology. 2nd ed. London: Hodder Education, an Hachette UK Company. 2010.
[12] M Adriani, J Asian, B Nazief, SM Tahaghoghi, HE Williams. Stemming Indonesian: A confix-stripping approach. ACM Transactions on Asian Language Information Processing (TALIP). 2007; 6(4): 1-33.
[13] B Stein. Technology for Text Plagiarism Analysis. Bauhaus-Universität Weimar. 2010.
[14] AZ Broder. Syntactic clustering of the Web. Computer Networks. 1997; 29(8-13): 1157-1166.
[15] B Stein, SM Eissen. Near similarity search and plagiarism analysis. in From Data and Information Analysis to Knowledge Engineering. Springer. 2006: 430-437.
[16] K Baždarić. Plagiarism detection – quality management tool for all scientific journals. Croatian Medical Journal. 2012; 53(1): 1-3.