Indonesian Journal of Electrical Engineering and Computer Science
Vol. 1, No. 2, February 2016, pp. 371 ~ 374
ISSN: 2502-4752
DOI: 10.11591/ijeecs.v1.i2.pp371-374
Received October 12, 2015; Revised January 17, 2016; Accepted January 30, 2016
Semantic Similarity/Relatedness for Cross Language
Plagiarism Detection
Hanane Ezzikouri*, Mohammed Erritali, Mohamed Oukessou
LMACS Laboratory, Mathematics Department, Faculty of Sciences and Techniques, Sultan Moulay Slimane University, Beni-Mellal, Morocco
*Corresponding author, e-mail: ezzikourihanane@gmail.com
Abstract
Utterances in natural language are generally highly ambiguous, and a unique interpretation can usually be determined only by taking into account the context in which the utterance occurred. Automatically determining the correct sense of a polysemous word is a complicated problem, especially in multilingual corpora. This paper presents an application programming interface for several semantic relatedness/similarity metrics that measure the semantic similarity/distance between multilingual words and concepts, with the aim of later applying it to sentences and paragraphs in Cross-Language Plagiarism Detection (CLPD). It uses WordNet for the English-French and English-Arabic multilingual plagiarism cases.
Keywords: Semantic similarity, CLPD, Plagiarism.
Copyright © 2016 Institute of Advanced Engineering and Science. All rights reserved.
1. Introduction
Plagiarism can be defined as the reuse of someone else’s ideas, results, or words without acknowledging the original source. Cross-Language Plagiarism Detection (CLPD) consists in discriminating semantically similar texts independently of the languages they are written in, when no reference to the original source is given. A CLPD case arises when we deal with the unacknowledged reuse of a text involving its translation from one language to another [1].
The CLPD issue has lately acquired pronounced importance, since the semantic content of a document can be easily and discreetly plagiarized through the use of translation (human or machine-based).
Arabic is a Central Semitic language belonging to the Afro-Asiatic family. It has many specific features that make it very different from Indo-European languages. Detecting plagiarism in Arabic documents is a particularly challenging task, and it becomes even harder across languages, because translation is often a fuzzy process that is hard to search for and because of the complex linguistic structure of Arabic. Although much research has been conducted on plagiarism detection in the last decades, work concerning Arabic text remains quite limited and mainly addresses monolingual plagiarism.
Similarity is a fundamental and widely used concept, and a large number of similarity measures have been proposed in the last few years. The similarity between two subjects (e.g., A and B) is related to their commonalities and differences: the more commonalities they share, the more similar they are, and the more differences they have, the less similar they are. Semantic similarity (Ss) [2] refers to the similarity between two concepts in a taxonomy such as WordNet, where the semantic similarity between them is based on the likeness of their meaning or semantic content, as opposed to similarity estimated from their syntactic representation [3].
Semantic similarity is often confused with semantic relatedness, although the latter includes any relation between two terms. For example, “car” and “bus” are similar in that they are connected via a relation with “vehicle”, whereas “car” is only related to “road” and “driving”.
Natural language utterances are, in general, highly ambiguous, because of the multiple possible meanings or senses that words may have (polysemy) or because of malapropism, which is the confounding of an intended word with another word of similar sound or similar spelling that has a quite different and malapropos meaning. An interpretation can generally be determined only by taking into account the context in which the utterance occurred. However, algorithms and programs do not have the benefit of a human’s vast experience of the language.
The steps of the Cross-Language Plagiarism Detection process were defined in [4], whose authors proposed some heuristic retrieval strategies and evaluated the performance of the models for the detailed analysis.
Although many studies have addressed plagiarism detection in recent years, those concerning Arabic text remain quite limited. Works in this area include those of Alzahrani et al. [5], Menai et al. [6] and others [7, 8]. All of them addressed the monolingual external approach.
N. Abdul Jaleel et al. [9] worked on statistical transliteration for English-Arabic Cross-Language Information Retrieval (CLIR). The authors worked with an n-gram model, evaluated the statistically trained model and a simpler hand-crafted model on a test set of named entities from the Arabic AFP corpus, and demonstrated that they perform better than online translation sources.
Imene Bensalem et al. [10] worked on Arabic intrinsic plagiarism detection. They presented a set of preliminary experiments on intrinsic plagiarism detection in Arabic text using the Stylysis tool and a small corpus. Their approach consists in testing whether some language-independent stylistic features are effective in discriminating between plagiarized and non-plagiarized sentences. They found that average word length and average sentence length are not reliable stylistic discriminators for Arabic text.
2. Used Metrics
2.1. HSO [11]
A malapropism can be defined as a correctly spelled word that is unsuitable for the context in which it is used, because it may be a spelling error of another intended word. The detection of malapropisms relies, to a large extent, on lexical chains, which represent words exhibiting semantic continuity. Hirst and St-Onge defined a mechanism that generates spelling replacements, which can be used as replacement candidates for a malapropism, basing their argument on the fact that words that cannot be used with other words can be considered potential malapropisms. The proposed algorithm uses the WordNet thesaurus to automatically quantify semantic relations between words. In WordNet, a word may have one or many synsets, each corresponding to a different meaning. When we look for a relation between two different words, we consider the synsets of all the senses of each word, looking for a possible connection between some meanings of the two words.
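The search over all sense pairs described above can be sketched as follows. This is only a minimal illustration in Python, not the HSO scoring itself: the sense inventory `SENSES`, the score table `TOY_SCORES` and the function names are hypothetical stand-ins for real WordNet lookups.

```python
from itertools import product

# Hypothetical sense inventory: each word maps to the synsets (senses) it may denote.
SENSES = {
    "bank": ["bank.financial", "bank.river"],
    "shore": ["shore.land_edge"],
}

# Hypothetical scores between senses; a real system would derive them from WordNet relations.
TOY_SCORES = {
    frozenset(["bank.river", "shore.land_edge"]): 5.0,
    frozenset(["bank.financial", "shore.land_edge"]): 0.0,
}


def toy_relatedness(sense_a, sense_b):
    """Look up the (symmetric) toy score of two senses, defaulting to 0.0."""
    return TOY_SCORES.get(frozenset([sense_a, sense_b]), 0.0)


def word_relatedness(word_a, word_b, score=toy_relatedness):
    """Return the best score over all sense pairs, or 0.0 if a word has no senses."""
    pairs = product(SENSES.get(word_a, []), SENSES.get(word_b, []))
    return max((score(a, b) for a, b in pairs), default=0.0)
```

Taking the maximum over sense pairs is one common way to turn a sense-level score into a word-level score; it is an assumption here, since the text only states that all senses of both words are considered.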
2.2. Lesk [12]
Word sense disambiguation is the task of identifying the intended meaning of a given target word from the context in which it is used. In WordNet, each concept (or word sense) is defined by a short gloss. A super-gloss of a concept is an expanded gloss that concatenates the glosses of other concepts connected to it via some WordNet relation.
The Adapted Lesk measure was developed to overcome the problem of short definitions in most dictionaries, a problem already of interest to Lesk (1986) when he presented the notion of using definition overlaps for word sense disambiguation. In the A-Lesk measure, the similarity between two word senses (concepts) is obtained by finding and scoring the intersections between the glosses of the two concepts. The larger the number of overlapping gloss words, the stronger the relation and the higher the similarity value between the two concepts.
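The gloss-overlap idea can be illustrated with a short Python sketch. It is a deliberate simplification under stated assumptions: it only counts distinct shared words, whereas the published Adapted Lesk measure works on expanded glosses and scores multi-word overlaps more heavily. The stop-word list and function names are hypothetical.

```python
import re

# Hypothetical, tiny stop-word list; function words carry no evidence of relatedness.
STOPWORDS = {"a", "an", "the", "of", "or", "and", "in", "to", "that", "is", "for"}


def tokens(gloss):
    """Lower-case a gloss and return its set of content words."""
    return {w for w in re.findall(r"[a-z]+", gloss.lower()) if w not in STOPWORDS}


def gloss_overlap(gloss_a, gloss_b):
    """Count the distinct content words shared by two glosses."""
    return len(tokens(gloss_a) & tokens(gloss_b))
```

For example, the invented glosses "a motor vehicle with four wheels" and "a vehicle carrying many passengers on wheels" share the content words "vehicle" and "wheels", so their overlap count is 2.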
2.3. LCH [13]
The LCH similarity/relatedness measure (Leacock and Chodorow) is:
\[ \mathrm{sim}_{LCH} = -\log\left(\frac{\mathit{length}}{2 \cdot D}\right) \]
Where:
length is the length of the shortest path between the two synsets (using node-counting)
D is the maximum depth of the taxonomy
The LCH measure is very sensitive to the presence or absence of a unique root node, because it considers the depth of the taxonomy in which the synsets are found.
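As a worked illustration of the LCH definition, the following Python sketch computes the score from a path length and a taxonomy depth that are assumed to be already known. The function and parameter names are illustrative, and finding the shortest path in WordNet is outside its scope.

```python
import math


def lch_similarity(path_length, max_depth):
    """Leacock-Chodorow score: -log(length / (2 * D)).

    path_length: node-counted length of the shortest path between the synsets (>= 1).
    max_depth:   maximum depth D of the taxonomy containing the synsets (>= 1).
    """
    if path_length <= 0 or max_depth <= 0:
        raise ValueError("path_length and max_depth must be positive")
    return -math.log(path_length / (2.0 * max_depth))
```

Because the path length appears in the numerator inside the logarithm, shorter paths give higher scores, and the same path length yields a different score in a taxonomy of different depth, which is the sensitivity to the root node mentioned above.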
2.4. LIN [3]
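The LIN similarity measure is:
\[ \mathrm{sim}_{LIN} = \frac{2 \cdot IC(LCS)}{IC(concept_1) + IC(concept_2)} \]
where IC(x) is the information content of x and LCS is the least common subsumer of concept1 and concept2. The LIN similarity satisfies
\[ 0 \le \mathrm{sim}_{LIN} \le 1 \]
If there is any lack of data, or if the information content of either concept1 or concept2 is 0, then 0 is returned as the similarity score.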
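A minimal Python sketch of the LIN computation, including the zero fallback described above, could look like this. The information-content values are assumed to be precomputed from a corpus, and the function and parameter names are illustrative.

```python
def lin_similarity(ic_concept1, ic_concept2, ic_lcs):
    """Lin score: 2 * IC(lcs) / (IC(concept1) + IC(concept2)).

    Returns 0.0 when the information content of either concept is missing or zero.
    """
    if not ic_concept1 or not ic_concept2:
        return 0.0
    if ic_concept1 < 0 or ic_concept2 < 0:
        return 0.0
    return 2.0 * ic_lcs / (ic_concept1 + ic_concept2)
```

Since a subsumer is never more informative than the concepts it subsumes, the score stays within the interval from 0 to 1, reaching 1 only when both concepts coincide with their least common subsumer.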
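2.5. WUP [14]
The WUP similarity/relatedness measure (Wu & Palmer) is:
\[ \mathrm{sim}_{WUP} = \frac{2 \cdot \mathrm{depth}(LCS)}{\mathrm{depth}(synset_1) + \mathrm{depth}(synset_2)} \]
where depth(synset1) and depth(synset2) are the depths of the two synsets in the WordNet taxonomies, and depth(LCS) is the depth of their Least Common Subsumer (LCS).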
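The WUP score can be sketched in Python in the same way, assuming the three depths have already been read from the taxonomy. The function and parameter names are illustrative.

```python
def wup_similarity(depth_synset1, depth_synset2, depth_lcs):
    """Wu-Palmer score: 2 * depth(lcs) / (depth(synset1) + depth(synset2))."""
    if depth_synset1 <= 0 or depth_synset2 <= 0:
        raise ValueError("synset depths must be positive")
    return 2.0 * depth_lcs / (depth_synset1 + depth_synset2)
```

Two synsets at depth 5 whose least common subsumer lies at depth 4 score 0.8, while the same pair with a subsumer at depth 1 scores only 0.2: the deeper the shared ancestor, the more similar the synsets.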
3. Application
The semantic relatedness/similarity computation is divided into a number of smaller sub-tasks, each of which uses relatedness/similarity metrics. Each of the sequential sub-tasks or steps accepts data from the previous stage, performs a transformation on the data, and then passes the processed data structures on to the next step. In the development of our system, we used some semantic relatedness/similarity algorithms available in the Java API WordNet Similarity for Java (WS4J), with some modifications to make them suitable for Arabic.
Figure 1. A generalized process for Arabic/English semantic similarity measure
3.1. Pre-processing
Some optional pre-processing should be performed on the data structures entered by the user. This includes tasks such as dealing with “Harakat/Tashkeel” (diacritics) in the user input, that is, removing them from the Arabic concepts, based on heuristics and some algorithms.
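Diacritic removal of this kind can be sketched in a few lines of Python. This is an assumption about what such a step might look like, not the system's actual heuristics: it simply deletes the Unicode combining marks U+064B to U+0652 and U+0670, and additionally drops the tatweel (U+0640), which is a choice made here for illustration.

```python
import re

# U+064B..U+0652: tanween, short vowels, shadda and sukun.
# U+0670: superscript alef. U+0640: tatweel (elongation mark, dropped by assumption).
_TASHKEEL = re.compile("[\u064B-\u0652\u0670\u0640]")


def strip_tashkeel(text):
    """Remove Arabic diacritics so that vocalized and bare spellings compare equal."""
    return _TASHKEEL.sub("", text)
```

After this step, a fully vocalized word and its unvocalized spelling map to the same string, so both can be looked up under a single key.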
3.2. Transliteration
We used the Java port of the morphological analyzer originally developed in Perl by Tim Buckwalter. It works on a transliteration of the Arabic word, based on Buckwalter's transliteration system. It includes Java classes for the morphological analysis of Arabic text files, whatever their encoding.
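To make the transliteration step concrete, here is a minimal Python sketch of the letter part of the Buckwalter scheme, which maps each Arabic letter to one ASCII character. It is not the analyzer's code: the names are illustrative, diacritics are left out on the assumption that they were removed during pre-processing, and unmapped characters are passed through unchanged.

```python
# Arabic letters U+0621..U+063A and U+0641..U+064A, in code-point order.
_ARABIC = [chr(c) for c in range(0x0621, 0x063B)] + [chr(c) for c in range(0x0641, 0x064B)]

# Their Buckwalter equivalents, in the same order.
_LATIN = "'|>&<}AbptvjHxd*rzs$SDTZEg" + "fqklmnhwYy"

BUCKWALTER = dict(zip(_ARABIC, _LATIN))


def to_buckwalter(text):
    """Transliterate Arabic letters to Buckwalter ASCII; other characters are kept as-is."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)
```

For instance, the word for "book" (kaf, teh, alef, beh) becomes `ktAb`, and the word for "school" (meem, dal, reh, seen, teh marbuta) becomes `mdrsp`.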
3.3. Proposed System
We developed a graphical interface to provide convenient access to the system. The GUI is written in Java. The interface allows the user to input words and submit them for semantic similarity calculation.
Figure 2. Proposed System
4. Conclusion
As part of this work, we developed a system to calculate semantic similarity/relatedness for Arabic-English word pairs. Our objective is to extend the system to cover scientific articles and academic research. We believe that validating the results requires further experiments over time.
References
[1] Potthast M, Barrón-Cedeño A, Stein B, Rosso P. Cross-language plagiarism detection. Language Resources and Evaluation. 2011; 45(1): 45-62.
[2] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of IJCAI-95. Montreal, Canada. 448-453.
[3] Dekang Lin. An Information-Theoretic Definition of Similarity. ICML. 1998: 296-304.
[4] Pereira RC, et al. A new approach for cross-language plagiarism analysis. In Multilingual and Multimodal Information Access Evaluation. Springer Berlin Heidelberg. 2010: 15-26.
[5] Alzahrani, Salha, Naomie Salim. Fuzzy semantic-based string similarity for extrinsic plagiarism detection. Braschler and Harman. 2010.
[6] Menai MEB. Detection of plagiarism in Arabic documents. International Journal of Information Technology and Computer Science (IJITCS). 2012; 4(10): 80.
[7] Zitouni, Abdelaziz, et al. Corpus-based Arabic stemming using N-grams. Information Retrieval Technology. Springer Berlin Heidelberg. 2010: 280-289.
[8] Siddiqui, Muazzam Ahmed, et al. Developing An Arabic Plagiarism Detection Corpus. Grant No. 11-INF-1520-03.
[9] Abdul Jaleel, Nasreen, Leah S Larkey. Statistical transliteration for English-Arabic cross language information retrieval. Proceedings of the twelfth international conference on Information and knowledge management. ACM. 2003.
[10] Bensalem, Imene, Paolo Rosso, Salim Chikhi. Intrinsic plagiarism detection in Arabic text: Preliminary experiments. II Spanish Conference on Information Retrieval (CERI’12). 2012.
[11] Graeme Hirst, David St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. WordNet, edited by Christiane Fellbaum, Cambridge, MA: The MIT Press. 1995.
[12] Satanjeev Banerjee, Ted Pedersen. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. 2002.
[13] Claudia Leacock, Martin Chodorow. Combining Local Context and WordNet Similarity for Word Sense Identification. WordNet: An Electronic Lexical Database, Publisher: MIT Press. 265-283.
[14] Wu Z, Palmer M. Verbs semantics and lexical selection. Proceedings of the 32nd annual meeting on Association for Computational Linguistics. Association for Computational Linguistics. 1994: 133-138.