TELKOMNIKA Indonesian Journal of Electrical Engineering
Vol. 12, No. 11, November 2014, pp. 7884 ~ 7894
DOI: 10.11591/telkomnika.v12i11.6652
ISSN: 2302-4046
Received September 8, 2013; Revised October 2, 2014; Accepted October 10, 2014
Detailed Analysis of Extrinsic Plagiarism Detection System Using Machine Learning Approach (Naive Bayes and SVM)
Zakiy Firdaus Alfikri*1, Ayu Purwarianti2
Institut Teknologi Bandung, Jl. Ganeca 10, Bandung, Indonesia
*Corresponding author, e-mail: zakiy_f_a@yahoo.co.id1, ayu@stei.itb.ac.id2
Abstract

In this report we propose a detailed analysis method for a plagiarism detection system using a machine learning approach. We used Naive Bayes and Support Vector Machine (SVM) as learning algorithms. The learning features used in the method are words similarity, fingerprints similarity, latent semantic analysis (LSA) similarity, and word pairs. Those features are adapted from several state-of-the-art methods for the detailed analysis stage of a plagiarism detection system. The purpose of selecting those features is to retrieve information from the state-of-the-art detailed analysis methods (words similarity, fingerprinting, and LSA) in order to integrate the strength of each method in detecting plagiarism. Several experiments were conducted to test the performance of the proposed method in detecting many cases of plagiarism. The experiments used 70 test data. The test data contain cases of literal plagiarism, partial literal plagiarism, paraphrased plagiarism, plagiarism with changed sentence structure, and translated plagiarism. The test data also contain cases of non-plagiarism on different topics and non-plagiarism on the same topic. The results obtained in the experiments using SVM showed an average accuracy of 92.86% (reaching 95.71% without using the words similarity feature), while the results obtained using Naive Bayes showed an average accuracy of 54.29% (reaching 84.29% without using the word pair features). Using the SVM algorithm showed better results because it is naturally suitable for classification problems that have two classes and it is better than Naive Bayes at resolving high-dimensional problems (which have many features). The proposed method (using SVM) has a high average accuracy for each of the tested cases of plagiarism. This proves that the proposed method (using SVM) is able to integrate the information for detecting plagiarism from the state-of-the-art detailed analysis methods (words similarity, fingerprinting, and LSA) to obtain more accurate detection results.

Keywords: detailed analysis, plagiarism detection, machine learning approach, learning algorithm, learning feature

Copyright © 2014 Institute of Advanced Engineering and Science. All rights reserved.
1. Introduction

Plagiarism is a form of cheating which is done by taking the writings of others and presenting them as one's own without any credit given to the original [1]. Likewise, according to IEEE, plagiarism is the reuse of ideas, processes, results, or words of another person without explicitly giving information about the original author and source [2]. Plagiarism is theft of an idea, which is a person's intellectual property right [3].
There are some previous works on Indonesian monolingual (single language) plagiarism detection studied by [4-7]. In addition, there is Indonesian-English cross-language plagiarism detection studied by [8]. Plagiarism detection needs to be designed so that it does not depend on whether the plagiarism is monolingual or cross-lingual.
There are two approaches to detect plagiarism: the extrinsic and the intrinsic plagiarism detection approaches [9]. Plagiarism detection in this study uses the extrinsic plagiarism detection approach because it is expected to be able to detect the presence of intelligent plagiarism that is performed using idea adoption and translation.
The general architecture of a plagiarism detection system consists of three main stages, namely heuristic retrieval, detailed analysis, and knowledge-based post-processing [10]. Heuristic retrieval is the process of retrieving documents from the corpus that are likely to be the source of plagiarism; we call these candidate documents. Detailed analysis is the process of searching for similarities between the input document and the candidate documents at a more detailed level (sentence or paragraph). Knowledge-based post-processing is the process of filtering false positives that might be produced by the previous processes.
There are several state-of-the-art methods that can be used to perform detailed analysis. Methods such as fingerprinting and latent semantic analysis (LSA) have been used to perform detailed analysis in plagiarism detection. Fingerprinting methods tend to have high performance for cases of literal plagiarism, while the LSA method tends to have good performance for cases of intelligent plagiarism such as plagiarism made using paraphrasing. The fingerprinting method has weaknesses in detecting cases of plagiarism made using paraphrasing, while the LSA method cannot detect sentence pairs that are not plagiarism but still stand in one topic.
To obtain information for detecting plagiarism from the fingerprinting and LSA methods, and to get the combined strength of each state-of-the-art method, we propose a method of performing detailed analysis based on a machine learning approach. Machine learning is a method of improving performance in line with the experiences made on a particular task. Machine learning has been shown to have a high utility value for a variety of application domains [11]. The quality of the machine learning depends on the selection of the training experience (training data features), the target function and its representation, and the learning algorithms.
In this study we used two learning algorithms: the Naive Bayes and Support Vector Machine (SVM) algorithms. We used the SVM learning algorithm because it has excellent performance in text classification problems, as can be seen in the experiment results of [12] and [13]. The SVM algorithm is also suitable for the plagiarism detection problem because SVM is naturally suitable for classification problems that have two classes, as had been analyzed by [14]. The Naive Bayes algorithm is chosen because it is suitable as a comparative baseline in text classification problems [15], as can be seen in the experiments of [12] and [13].
2. Detailed Analysis of Plagiarism Detection System
2.1. Similarity Measurement Methods
There are many methods that can be used to compare the similarity between sentences. Some of them are fingerprinting, vector space model (VSM), and latent semantic analysis (LSA).
In the fingerprinting method, the amount of similar fingerprints is used as the similarity indicator between sentences. A fingerprint is a statement that can characterize an object [16]. The fingerprint of a sentence is in the form of integer values that are calculated using a hash function [17]. The similarity between sentences is calculated by comparing the portion of one sentence's fingerprint that is similar to another sentence's fingerprint.
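As an illustration of the fingerprint comparison described above, the following is a minimal Python sketch; the character n-gram size and the use of Python's built-in hash() are assumptions, since the exact hash function is not specified here:

```python
def fingerprints(sentence, n=3):
    # Hashed character n-grams of the sentence serve as its fingerprint set.
    # Python's built-in hash() stands in for the (unspecified) hash function.
    text = sentence.lower().replace(" ", "")
    return {hash(text[i:i + n]) for i in range(len(text) - n + 1)}

def fingerprint_similarity(s1, s2, n=3):
    # Portion of shared fingerprints between the two sentences.
    f1, f2 = fingerprints(s1, n), fingerprints(s2, n)
    if not f1 or not f2:
        return 0.0
    return len(f1 & f2) / len(f1 | f2)

# Identical sentences share every fingerprint, so the similarity is 1.0.
print(fingerprint_similarity("POS tag can be obtained using HMM technique",
                             "POS tag can be obtained using HMM technique"))
```

Because the fingerprints are sets of hashed n-grams, a small edit to one sentence changes only a few n-grams and therefore only slightly lowers the shared portion.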
In the vector space model method, the sentences are represented in vector form based on the weights of the words in the sentence [18]. The similarity between sentences is calculated using the cosine similarity function between the vectors of the sentences.
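The cosine comparison of word-weight vectors can be sketched as follows; raw term frequencies are assumed as the word weights, since the exact weighting scheme is not specified here:

```python
from collections import Counter
import math

def cosine_similarity(s1, s2):
    # Term-frequency vectors stand in for the word weights; the cosine of the
    # angle between the two vectors is the similarity.
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(cosine_similarity("the cat sat", "the cat ran"))  # shares 2 of 3 words
```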
In the latent semantic analysis method, a matrix is formed that represents the term-sentence matrix. This matrix is then split using singular value decomposition (SVD) into three matrices: the matrix U representing the terms, the matrix V^T representing the sentences, and the matrix S, which is a diagonal matrix of singular values [19]. Similarity is calculated using the cosine similarity function between the vectors formed by the matrices.
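A minimal numpy sketch of this SVD decomposition and sentence comparison; the toy term-sentence matrix and the retained rank k are illustrative assumptions, since the actual matrix is not given here:

```python
import numpy as np

# Assumed toy term-sentence matrix: rows are terms, columns are sentences.
A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]], dtype=float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(S) @ Vt
k = 2                                             # keep the k largest singular values
sent_vecs = (np.diag(S[:k]) @ Vt[:k]).T           # one row per sentence, in concept space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Conceptual similarity between sentences 0 and 1 of the toy matrix.
print(cos(sent_vecs[0], sent_vecs[1]))
```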
2.2. Similarity Measurement Method Using Machine Learning Approach
Machine learning is a branch of artificial intelligence. Machine learning can construct a learning model from available training data. The goal of using machine learning is to improve performance (based on a particular performance measure) on a task based on a learning model constructed from training data (experiences) [11].
Machine learning can solve many kinds of problems. Machine learning can be used to decide whether two sentences are plagiarism or not from the information contained in them. So, machine learning can be used as an approach in a detailed analysis method. The use of machine learning in detailed analysis requires training data in the form of a collection of sentence pairs and their labels (plagiarism or not). From the collection of training data, machine learning will extract the appropriate features, such as words similarity, fingerprint similarity, LSA similarity, and so on.
If a lot of features are used, we can do feature selection. Feature selection is performed to get a number of features that is not too big but is still able to represent the required information. Several feature selection methods are frequency threshold selection and mutual information ranked selection.
The frequency threshold is the minimum frequency limit a feature must possess to qualify for selection. Mutual information is the dependency value of a feature in deciding the value of the class. It ranks the features based on their mutual information values; the top n features are then selected from the resulting ranking.
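The two selection steps can be sketched as follows; the binary presence encoding and the toy data are assumptions for illustration only:

```python
import math

def mutual_information(presence, labels):
    # I(F; C) for a binary feature-presence column against binary class labels.
    n = len(labels)
    mi = 0.0
    for f in (0, 1):
        for c in (0, 1):
            n_fc = sum(1 for p, l in zip(presence, labels) if p == f and l == c)
            n_f = sum(1 for p in presence if p == f)
            n_c = sum(1 for l in labels if l == c)
            if n_fc:
                mi += (n_fc / n) * math.log2(n * n_fc / (n_f * n_c))
    return mi

def select_features(feature_columns, labels, freq_threshold=1, top_n=2):
    # 1) drop features below the frequency threshold,
    # 2) rank the rest by mutual information, 3) keep the top n.
    kept = {name: col for name, col in feature_columns.items()
            if sum(col) >= freq_threshold}
    return sorted(kept, key=lambda name: mutual_information(kept[name], labels),
                  reverse=True)[:top_n]

features = {"good": [1, 1, 0, 0], "noise": [1, 0, 1, 0], "rare": [0, 0, 0, 0]}
print(select_features(features, [1, 1, 0, 0]))  # "good" predicts the class perfectly
```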
A learning model can be built from the features obtained by the previous process. Using the learning model, we can classify whether a pair of sentences is a case of plagiarism or not.
By using appropriate features, machine learning methods can be used to detect plagiarism between two documents in different languages.
There are many kinds of learning algorithms that can be used in machine learning. Some of them are Naive Bayes and Support Vector Machine.
Naive Bayes is a learning algorithm that represents each instance (data item) as the conjunction of its attribute values [11]. Naive Bayes can only classify data that has a limited set of target function values (classes); this means the class value cannot be continuous. Naive Bayes classifies the data by selecting the class that has the highest probability value, assuming that the attributes are independent of each other.
Naive Bayes is based on Bayes' rule. It assumes the attributes a_1 ... a_n are all conditionally independent of one another, given V [11]. It can be described by:

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)        (1)
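Equation (1) can be sketched directly in code; the toy binary attributes and the add-one smoothing are illustrative assumptions, not the exact estimator used in the experiments:

```python
import math
from collections import Counter, defaultdict

def train_nb(instances, labels):
    # Estimate P(v) and the per-attribute counts needed for P(a_i | v).
    classes = Counter(labels)
    cond = defaultdict(Counter)          # (class, attribute index) -> value counts
    for inst, label in zip(instances, labels):
        for i, value in enumerate(inst):
            cond[(label, i)][value] += 1
    return classes, cond, len(labels)

def classify_nb(model, inst):
    # Pick v maximizing log P(v) + sum_i log P(a_i | v), per equation (1),
    # with add-one smoothing so unseen values keep a nonzero probability.
    classes, cond, n = model
    best, best_score = None, -math.inf
    for v, count in classes.items():
        score = math.log(count / n)
        for i, value in enumerate(inst):
            score += math.log((cond[(v, i)][value] + 1) / (count + 2))
        if score > best_score:
            best, best_score = v, score
    return best

# Toy binary attributes: (high words similarity, high fingerprint similarity).
X = [(1, 1), (1, 0), (0, 0), (0, 1)]
y = ["plag", "plag", "not", "not"]
model = train_nb(X, y)
print(classify_nb(model, (1, 1)))  # -> plag
```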
Support Vector Machine (SVM) is a learning algorithm that analyzes data and recognizes patterns [14]. SVM constructs a hyperplane or set of hyperplanes in a high-dimensional space, which can be used for classification. It can make a hyperplane that separates two classes [14]. This algorithm analyzes the data, looks for a pattern, and then creates a hyperplane separator to divide the data based on each class. SVM models the existing data as points in a space; the location of each point depends on the values of the features used.
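As a sketch of the separating-hyperplane idea, here is a minimal linear SVM trained by subgradient descent on the hinge loss; this stand-in solver and the toy data are assumptions, since the SVM implementation used in the experiments is not named here:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.001, lr=0.1, epochs=100):
    # Subgradient descent on the hinge loss; the hyperplane passes through the
    # origin for simplicity. A real system would use a mature SVM solver.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):            # yi in {-1, +1}
            if yi * (xi @ w) < 1:           # margin violated: move the hyperplane
                w += lr * (yi * xi - lam * w)
            else:
                w -= lr * lam * w
    return w

# Toy 2-D data with two linearly separable classes.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-2.5, -1.5]])
y = np.array([1, 1, -1, -1])
w = train_linear_svm(X, y)
print(np.sign(X @ w))  # the learned hyperplane separates the two classes
```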
3. Analysis of Plagiarism Detection Detailed Analysis
3.1. Plagiarism Cases
There are several cases of plagiarism that may occur. The following are some cases of plagiarism.
a) Literal plagiarism, which is plagiarism performed by copying the source text directly without modification.
b) Partial literal plagiarism, which is plagiarism performed by copying a small portion of the source text.
c) Paraphrased plagiarism, which is plagiarism performed by paraphrasing the copied source text.
d) Plagiarism with changed sentence structure, which is plagiarism performed by changing the structure of the copied sentence.
e) Translated plagiarism, which is plagiarism performed by translating the copied text.
There are also cases that are not plagiarism. The following are some cases that are not plagiarism.
a) Non-plagiarism with different topic, which is the case where one sentence and the other are not plagiarism and their contents are on different topics.
b) Non-plagiarism but in one topic, which is the case where one sentence and the other are not plagiarism but their contents have the same topic.
3.2. Feature Selection
To create the model and to perform classification, we need to define which learning features are to be used. The selection of features is based on each feature's influence on the class value (plagiarism or not). The following are the features to be used in the proposed method. To explain more about the features, as an example we used the sentence "POS tag can be obtained using HMM technique" as the to-be-detected sentence and the sentence "POS tag can be obtained using HMM technique" as the source sentence.
a) Word pairs
Word pairs are chosen to be features in creating the model and performing classification because the word pairs that exist between two sentences can determine whether they are plagiarism or not. The word pairs features are expected to give information about different words that have a similar context.
We used stop word removal and stemming on the words. We listed all possible word pairs from the training data and filtered them using mutual information ranking and a frequency threshold to generate the word pairs. For example, the generated word pairs are obtain_technique, obtain_obtain, technique_technique, technique_text, algorithm_model, and algorithm_technique. So the sample sentence has an attribute value of 1 for obtain_technique, obtain_obtain, and technique_technique, and an attribute value of 0 for technique_text, algorithm_model, and algorithm_technique.
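The word pair attributes of the example above can be reproduced with a short sketch; the cross-sentence pair-generation rule is an assumption consistent with the listed pairs:

```python
def generate_word_pairs(detected, source):
    # Every cross-sentence pair of (preprocessed) words, joined with "_".
    return {f"{a}_{b}" for a in detected.split() for b in source.split()}

def word_pair_attributes(detected, source, selected_pairs):
    # Attribute value is 1 if the selected pair occurs between the sentences, else 0.
    present = generate_word_pairs(detected, source)
    return {pair: int(pair in present) for pair in selected_pairs}

# Stemmed, stop-word-free forms of the sample sentences.
detected = "pos tag obtain hmm technique"
source = "pos tag obtain hmm technique"
selected = ["obtain_technique", "obtain_obtain", "technique_technique",
            "technique_text", "algorithm_model", "algorithm_technique"]
print(word_pair_attributes(detected, source, selected))
```

Running this gives 1 for the first three pairs and 0 for the last three, matching the attribute values stated above.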
b) Words similarity
Words similarity is chosen as one of the features because a plagiarized sentence tends to have a high level of similarity with the source sentence. However, deciding plagiarism or not by looking only at the words similarity is not guaranteed to always be right. For example, the value of this attribute for the sample sentences is 1, because the to-be-detected and source sentences have 100% words similarity.
c) Fingerprint similarity
Fingerprint similarity is chosen because it provides information about the similarity between sentences at a more detailed level, which is the level of n-gram structure. The similarity is the percentage of similar fingerprints between sentences. The fingerprint gives a more detailed similarity value (the n-gram) than words similarity. After processing the sample sentences, the fingerprint values for both sentences are the same, so the value of this attribute is 1.
d) LSA similarity
LSA similarity is chosen because it provides information in the form of conceptual similarity of context (semantics) between the two sentences. Conceptual similarity of context has a great influence in determining plagiarism between sentences. We calculate the cosine similarity between the to-be-detected sentence's semantic vector and the source sentence's semantic vector. The result is 1, so the value of this attribute is 1.
3.3. Preprocess Supplementary Components
In preprocessing, we used supplementary components such as a stop word removal component and a stemming component. The stop word removal component deletes stop words that exist in the sentences. This component is utilized to reduce the number of features generated and to avoid the generation of low-influence features. Stemming is used to get the stem form of every word that exists in the sentence. This component is used to reduce the variation of features that have the same context and to generate more high-influence features.
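A minimal sketch of the two components; the stop word list and the suffix-stripping stemmer are illustrative stand-ins, not the actual components used:

```python
# Illustrative stop word list; the actual list used is not given here.
STOP_WORDS = {"can", "be", "the", "a", "an", "is", "using"}

def stem(word):
    # Toy suffix stripping; a real system would use a proper stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    # Stop word removal followed by stemming, as described in Section 3.3.
    return [stem(w) for w in sentence.lower().split() if w not in STOP_WORDS]

print(preprocess("POS tag can be obtained using HMM technique"))
# -> ['pos', 'tag', 'obtain', 'hmm', 'technique']
```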
3.4. Architectures
The detailed analysis architecture is divided into two main parts: model creation and classification.
In model creation, preprocessing of the training data is performed first. Then, the word pairs features are generated from the training data. After that, the values of each feature are extracted from the training data. Finally, a learning model is created that can be used for classification.
Classification preprocesses the two inputs (the text to be detected and the candidate text), extracts the values of each feature, and classifies the inputs as plagiarism or not plagiarism using the learning model. The classification architecture can be seen in Figure 1.
3.5. How The System Works
The proposed detailed analysis method requires model creation and classification. For model creation, the first thing conducted is the preprocessing of the training data. The preprocessing takes the values of the detected sentences, the source-candidate sentence, and the label of each data item. In this process, stop word removal and stemming are also performed. Next, in the feature extraction process, first the generation of word pairs features is performed. The generated word pairs features can be selected using the mutual information ranking and frequency threshold to reduce the features so that only high-influential features are used.
Figure 1. Classification Architecture
Then, the extraction of each feature value is executed for each of the existing data in the training data. The value of a word pair feature is the number of occurrences of the word pair in each data item. The words similarity value is the percentage of words similarity between the two sentences in each data item. The fingerprint similarity value is the percentage of similar fingerprints between the two sentences in each data item. The LSA similarity value can be calculated using the cosine similarity function between the sentence-context vectors of the two sentences in each data item. The results of this process are data with their features' values.
After the feature extraction process, the model creation process is performed using the chosen learning algorithm. The learning algorithm will build a model to fit the training data.
For classification, according to Figure 1, the first thing carried out is the preprocessing of the inputs by taking the values of the to-be-detected sentence and the source-candidate sentence. In preprocessing, the stop word removal and stemming processes are also performed. Next, in the feature extraction process, the extraction of each feature value is performed for the input data. The results of this process are data with their features' values.
Finally, the classification of the inputs is performed. It uses the input data as instances that have the features' values calculated in the feature extraction process. The learning algorithm performs classification using the learning model generated by the modeling subsystem. The result of the classification is the decision whether the input data is plagiarism or not.
4. Experiments
4.1. Model Creation
In this experiment, model creation processes were performed. They used 80 training data that are plagiarism cases and 80 training data that are not cases of plagiarism, and all of the features. The value of the frequency threshold used was 1 and the value of mutual information used was 10000-first-ranked. There were two experiments performed, one using Naive Bayes and one using Support Vector Machine (SVM). The accuracy of each model was calculated using 10-fold cross-validation.
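The 10-fold cross-validation accuracy computation can be sketched as follows; the interleaved fold split and the toy majority-class learner are illustrative stand-ins, not the actual models evaluated:

```python
def k_fold_accuracy(data, labels, train_fn, predict_fn, k=10):
    # Each example serves in exactly one test fold; accuracy is pooled overall.
    n = len(data)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # simple interleaved split
        train_X = [x for i, x in enumerate(data) if i not in test_idx]
        train_y = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train_fn(train_X, train_y)
        correct += sum(predict_fn(model, data[i]) == labels[i] for i in test_idx)
    return correct / n

# Illustrative stand-in learner: always predicts the majority training label.
majority_train = lambda X, y: max(set(y), key=y.count)
majority_predict = lambda model, x: model

data, labels = list(range(20)), [1] * 15 + [0] * 5
print(k_fold_accuracy(data, labels, majority_train, majority_predict, k=10))  # 0.75
```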
4.2. Detecting Plagiarism Cases
In the experiments on detecting plagiarism cases, the test cases were created such that their data illustrate possible cases of plagiarism. Detecting or classifying the existing test data for each test case was then performed. The results were also compared to the results obtained with state-of-the-art detailed analysis methods such as the word similarity method, fingerprinting, and LSA.
4.2.1. Testing Data
The data used for testing consisted of 70 data that were divided into seven test cases. The following are the test cases.
a) Literal plagiarism sentence pairs
b) Partial literal plagiarism sentence pairs
c) Paraphrased plagiarism sentence pairs
d) Changed structure plagiarism sentence pairs
e) Translated plagiarism sentence pairs
f) Non-plagiarism sentence pairs which are different in topic
g) Non-plagiarism sentence pairs which have the same topic
4.2.2. Experiment Scenario
In this experiment, the accuracy of the classifier was calculated. The experiment used the test cases as the input of the classification processes. The accuracy was obtained by calculating the percentage of correctly classified instances. The experiment used all of the features. The value of the frequency threshold used was 1 and the value of mutual information used was 10000-first-ranked.
4.3. Testing the Effect of Frequency Threshold and Mutual Information
This experiment was conducted to see the effect of changing the values of the frequency threshold and mutual information parameters, which play a role in the selection of the word pairs features. The experiment was performed by creating a model and classifying test data. The experiment used all of the features. The values of the frequency threshold tested were 1, 2, 3, and 4, and the values of mutual information used were 1000, 5000, 7000, and 10000-first-ranked.
4.4. Testing Each Feature's Influence
This experiment was conducted to see the effect of each feature and also to see the performance of each learning algorithm used. The value of the frequency threshold used was 1 and the value of mutual information used was 10000-first-ranked. The experiment was performed by creating a model and classifying test data. The experiment was divided into five experiment cases, which were:
a) Using all features
b) Using all features except the word pairs features
c) Using all features except the LSA similarity feature
d) Using all features except the fingerprint similarity feature
e) Using all features except the words similarity feature
5. Result
5.1. Result and Analysis of Model Creation Experiment
The accuracy results of the model creation experiment can be seen in Table 1. Judging from the accuracy evaluation of model creation, the created learning model has a pretty good performance. The model created using SVM has an accuracy rate of 84.375%. This is because the features selected for use are relevant and appropriate for resolving the problem of detecting plagiarism in detailed analysis.
For the learning algorithm, it can be seen that SVM has better performance than Naive Bayes. SVM has a better performance due to the characteristics of its learning, which fits high-dimensional data. Additionally, because there are only two classes (plagiarism or not), SVM naturally only needs to create a hyperplane that separates the two parts of the class.
Table 1. The Accuracy of Model Creation
Learning algorithm    Accuracy
Naive Bayes           76.25 %
SVM                   84.375 %
5.2. Result and Analysis of Detecting Plagiarism Cases Experiment
The experiment results are presented in Table 2. The data in the table show the accuracy of each model in classifying the test data used as input.
The experiment results show an average accuracy of 92.86% for the SVM models and an average accuracy of 54.29% for the Naive Bayes models. The SVM model could classify the types of literal plagiarism, partial literal plagiarism, and translated plagiarism with an accuracy rate of 100%. For other cases of plagiarism, such as plagiarism using paraphrasing and changed sentence structure, the accuracy of the SVM is still very high (90% accuracy).
90% accuracy).
In
the case of
non-
plagia
r
ism
wi
th differe
nt to
pic, the
a
c
cu
racy
of the
resultin
g mo
d
e
l of SVM i
s
also q
u
ite hi
gh,
90% accu
ra
cy. Then for non-pl
agia
r
ism
within t
he sa
me topic, SVM only rea
c
h
80% accu
ra
cy.
It is shown that SVM works better than Naive Bayes in almost all cases of plagiarism. SVM has better accuracy in detecting literal plagiarism, partial literal plagiarism, paraphrased plagiarism, and changed structure plagiarism. SVM is also better at detecting the non-plagiarism cases. Naive Bayes has the same performance in detecting translated plagiarism.
Table 2. The Accuracy of Detecting Plagiarism Cases
Test case                                  Naive Bayes Accuracy    SVM Accuracy
1. Literal                                 50 %                    100 %
2. Partial literal                         40 %                    100 %
3. Paraphrased                             50 %                    90 %
4. Changed structured                      50 %                    90 %
5. Translated                              100 %                   100 %
6. Non-plagiarism with different topic     30 %                    90 %
7. Non-plagiarism in the same topic        60 %                    80 %
Mean:                                      54.29 %                 92.86 %
The generated Naive Bayes models tend to exhibit poor accuracy. In almost all cases of plagiarism the accuracy is not more than 50%, except for the case of translated plagiarism, which reaches 100%. The same holds for the cases of non-plagiarism, where the average accuracy obtained is only 45%.
Judging from the results obtained, Naive Bayes models are less suitable for use as the classifier solution in detecting plagiarism.
The reason the results have poor accuracy is the number of features used (the word pairs features consist of about 1000 features) against the fact that the training data used is only about 160 data, so Naive Bayes cannot form a statistical/probabilistic model that is relevant for the test data.
Table 3. Comparison between State-of-the-art Methods and the Proposed Method (using SVM)
Test case                                  Words similarity    Fingerprinting    LSA        SVM
1. Literal                                 100 %               100 %             100 %      100 %
2. Partial literal                         90 %                90 %              100 %      100 %
3. Paraphrased                             50 %                40 %              90 %       90 %
4. Changed structured                      100 %               80 %              100 %      90 %
5. Translated                              100 %               80 %              100 %      100 %
6. Non-plagiarism with different topic     100 %               100 %             100 %      90 %
7. Non-plagiarism in the same topic        100 %               100 %             0 %        80 %
Mean:                                      91.42 %             84.29 %           84.29 %    92.86 %
This experiment also conducted a comparison of the results obtained using the SVM models with the other state-of-the-art detailed analysis methods. A comparison of the results can be seen in Table 3.
It can be seen that the average level of accuracy of the proposed SVM method is on top of the other methods. Viewed from each case in general, the proposed SVM method's accuracy in almost all cases is above average. Only in the cases of structure-changed plagiarism and non-plagiarism is the accuracy slightly below average.
In the case of literal plagiarism, all methods have a very high accuracy. In the cases of partial literal plagiarism and changed structure, the word similarity and fingerprinting methods have worse accuracy than the other methods. In the case of translated plagiarism, all methods have fairly good accuracy; however, the other methods need machine translation while the proposed method doesn't need any machine translation. And in the case of non-plagiarism within the same topic, the LSA method has an accuracy that is far below average. It can be seen that the proposed method combines the strengths of the other methods.
5.3. Result and Analysis of Testing the Effect of Frequency Threshold and Mutual Information
The results of testing the influence of the frequency threshold and mutual information parameters can be found in Table 4.
It can be seen from the evaluation accuracy that the average accuracy of the Naive Bayes model is around 73% and the average accuracy of the SVM model is around 83%. The most optimal results were obtained at a frequency threshold value of 1 with an n of mutual information of 10000, and at a frequency threshold value of 2 regardless of the value of n of mutual information. This is because for 160 training data the number of most-influential word pairs is only around 10000 pairs of words, with a frequency boundary value of only 1.
Table 4. The Accuracy of each Model for each Parameter (NB / SVM, in %)

n mutual information | Freq. threshold 1 | Freq. threshold 2 | Freq. threshold 3 | Freq. threshold 4
1000                 | 74.375 / 81.875   | 75 / 85.625       | 73.125 / 84.375   | 71.875 / 78.75
5000                 | 74.375 / 81.875   | 75 / 85.625       | 73.125 / 84.375   | 71.875 / 78.75
7000                 | 76.25 / 83.125    | 75 / 85.625       | 73.125 / 84.375   | 71.875 / 78.75
10000                | 76.25 / 84.375    | 75 / 85.625       | 73.125 / 84.375   | 71.875 / 78.75
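The word-pair selection that these two parameters control can be sketched as follows: keep pairs whose document frequency reaches the threshold, then rank them by mutual information with the plagiarism label and keep the top n. The data layout and function names here are assumptions for illustration, since the paper does not publish its implementation:

```python
import math
from collections import Counter

def mutual_information(n11, n10, n01, n00):
    """MI between a word-pair feature (present/absent) and the plagiarism
    label, from the four cell counts of the 2x2 contingency table."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_tc, n_t, n_c in (
        (n11, n11 + n10, n11 + n01),   # pair present, plagiarised
        (n10, n11 + n10, n10 + n00),   # pair present, not plagiarised
        (n01, n01 + n00, n11 + n01),   # pair absent, plagiarised
        (n00, n01 + n00, n10 + n00),   # pair absent, not plagiarised
    ):
        if n_tc > 0:
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

def select_word_pairs(docs, labels, freq_threshold=1, n_top=10000):
    """docs: list of sets of (source_word, suspicious_word) pairs;
    labels: 1 = plagiarised, 0 = not. Keep pairs whose document frequency
    reaches freq_threshold, ranked by mutual information."""
    df = Counter(p for d in docs for p in d)
    scored = []
    for pair, freq in df.items():
        if freq < freq_threshold:      # frequency-threshold filter
            continue
        n11 = sum(1 for d, y in zip(docs, labels) if y == 1 and pair in d)
        n10 = sum(1 for d, y in zip(docs, labels) if y == 0 and pair in d)
        n01 = sum(1 for d, y in zip(docs, labels) if y == 1 and pair not in d)
        n00 = sum(1 for d, y in zip(docs, labels) if y == 0 and pair not in d)
        scored.append((mutual_information(n11, n10, n01, n00), pair))
    scored.sort(reverse=True)
    return [pair for _, pair in scored[:n_top]]
```

With few training documents, only a small number of pairs can clear even a low frequency threshold, which matches the observation above that roughly 10000 pairs remain at a boundary of 1.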
Table 5. The Accuracy of each Model in Classifying Test Data for each Parameter (NB / SVM, in %)

n mutual information | Freq. threshold 1 | Freq. threshold 2 | Freq. threshold 3 | Freq. threshold 4
1000                 | 54.29 / 92.86     | 54.29 / 80.00     | 54.29 / 85.71     | 54.29 / 84.29
5000                 | 54.29 / 92.86     | 54.29 / 80.00     | 54.29 / 85.71     | 54.29 / 84.29
7000                 | 54.29 / 92.86     | 54.29 / 80.00     | 54.29 / 85.71     | 54.29 / 84.29
10000                | 54.29 / 92.86     | 54.29 / 80.00     | 54.29 / 85.71     | 54.29 / 84.29
For the classification of test data, as can be seen in Table 5, the average accuracy of Naive Bayes classification is approximately 54% and that of SVM classification is approximately 85%. The best results are obtained at a frequency threshold of 1, regardless of the value of n on mutual information. If the classification test results are analyzed together with the model creation results, the optimal result is obtained with a frequency threshold of 1 and n on mutual information of 10000. As already discussed, this is because, for 160 training data, there are only around 10000 most-influential word pairs whose frequency reaches the boundary of 1.
5.4. Result and Analysis of Testing Each Feature's Influence
The results of testing the effect of each feature on the accuracy of model creation can be seen in Table 6. In cases 2 to 5 the accuracy of SVM is reduced compared to case 1, which shows that all of the features contribute to improving SVM accuracy. For Naive Bayes, the accuracy increases in case 2, which shows that the word pair features are less suitable for Naive Bayes, while the other features do help improve its accuracy.
Table 6. The Accuracy of each Model in each Experiment Case

Case   | Naive Bayes Accuracy | SVM Accuracy
Case 1 | 76.25 %              | 84.375 %
Case 2 | 77.5 %               | 78.75 %
Case 3 | 75 %                 | 80.625 %
Case 4 | 71.25 %              | 77.5 %
Case 5 | 71.875 %             | 80.625 %
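The experiment design behind Table 6 can be sketched as a leave-one-feature-out ablation: one run with all features, then one run per dropped feature. The mapping of case number to dropped feature is an assumption here, since the table only reports the resulting accuracies:

```python
FEATURES = ("word_similarity", "fingerprint_similarity",
            "lsa_similarity", "word_pairs")

def ablation_cases(features=FEATURES):
    """Case 1 trains on all features; each subsequent case drops exactly
    one feature (hypothetical case-to-feature mapping)."""
    cases = {1: tuple(features)}
    for i, dropped in enumerate(features, start=2):
        cases[i] = tuple(f for f in features if f != dropped)
    return cases

def run_ablation(train_and_score, features=FEATURES):
    """train_and_score(feature_subset) -> accuracy; returns accuracy per case."""
    return {case: train_and_score(subset)
            for case, subset in ablation_cases(features).items()}
```

Comparing each case's score against case 1 then isolates the contribution of the dropped feature, which is how the per-feature conclusions above are read off the table.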
In addition to testing model creation, classification of the test data was also performed. Each test case was applied to the seven types of test data that had been defined. For case 1, the results obtained can be seen in Table 2; these results serve as a baseline for comparison with the other experiment cases.
For case 2, the average accuracy of Naive Bayes classification increased by about 30%. An accuracy of 100% is reached for almost all cases, except paraphrased plagiarism (which still has a high accuracy of 90%) and translated plagiarism (which drops to 0% accuracy). The average accuracy of SVM classification decreased by about 14.29%; a significant decrease occurs in the cases of translated plagiarism and non-plagiarism within one topic.
For case 3, the average accuracy of Naive Bayes classification does not differ from the average classification accuracy using all features. The average accuracy of SVM classification decreased by about 12.86%; a significant decrease occurred in the cases of paraphrased plagiarism and changed-structure plagiarism.
For case 4, the average accuracy of Naive Bayes classification again does not differ from the average classification accuracy using all features. The average accuracy of SVM classification decreased by about 10%; a significant decrease occurred in the cases of non-plagiarism on different topics and non-plagiarism within one topic.
For case 5, the average accuracy of Naive Bayes classification does not differ from the average classification accuracy using all features. The change in the average SVM classification accuracy is not significant, only about 2.85%. Thus, the word similarity feature does not have a significant effect on the accuracy of the plagiarism classifiers.
6. Conclusion
The conclusion that can be drawn from this paper is that the performance of the proposed detailed analysis of plagiarism detection using machine learning approaches is quite high. The experiments using SVM showed an average accuracy of 92.86% (reaching 95.71% without the words similarity feature), while Naive Bayes showed an average accuracy of 54.29% (reaching 84.29% without the word pair features).
Detailed analysis performance varied across cases. In all test cases the proposed detailed analysis performed quite well; the worst accuracy was 80%, for cases of non-plagiarism on a similar topic.
Compared with state-of-the-art detailed analysis methods such as the word similarity, fingerprinting, and LSA methods, the proposed method has advantages in detecting plagiarism and non-plagiarism in almost all cases, whereas the other methods show shortcomings (seen in their low accuracy) in specific cases.
It can be concluded that the proposed method retrieves information from the other state-of-the-art methods and combines that information to obtain a high level of plagiarism detection accuracy.
The features most suitable for machine-learning-based detailed analysis are the fingerprint similarity feature, the LSA similarity feature, and the word pair features, and the most suitable learning algorithm is SVM. The SVM algorithm showed better results because it is naturally suited to two-class classification problems and handles high-dimensional problems (with many features) better than Naive Bayes. Using the SVM learning algorithm with the fingerprint similarity, LSA similarity, and word pair features, the accuracy obtained averages 95.71%. This result shows good performance of the proposed method in detecting extrinsic plagiarism at the detailed level.
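At prediction time, a trained SVM over these features reduces to a decision rule applied to each document pair's feature vector. A minimal sketch of a linear decision function follows; the feature ordering, weights, and bias are illustrative placeholders, not the trained model from the paper (which may also use a non-linear kernel):

```python
def svm_decision(features, weights, bias):
    """Linear SVM decision rule: sign(w . x + b).
    Returns 1 for 'plagiarism', 0 for 'non-plagiarism'."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else 0

# Feature vector: (fingerprint_similarity, lsa_similarity, word_pair_score).
# Placeholder parameters only; real values come from SVM training.
WEIGHTS, BIAS = (1.0, 1.0, 1.0), -1.5
```

For example, a pair with high similarity on all three features is classified as plagiarism, while a pair with uniformly low similarity falls on the non-plagiarism side of the hyperplane.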
For further research, experiments in detecting plagiarism in other language pairs (beyond Indonesian-English) may be performed using the detailed analysis method proposed in this paper; training data for the related languages is required for such experiments. Research may also be done on suitable feature selection methods for the word pair features, allowing an optimal feature selection to be generated specifically for plagiarism detection.