International Journal of Electrical and Computer Engineering (IJECE)
Vol. 9, No. 5, October 2019, pp. 3804~3812
ISSN: 2088-8708, DOI: 10.11591/ijece.v9i5.pp3804-3812
Journal homepage: http://iaescore.com/journals/index.php/IJECE
Enhancing code clone detection using control flow graphs
Dong Kwan Kim
Department of Computer Engineering, Mokpo National Maritime University, Korea
Article Info
Article history:
Received Mar 6, 2019
Revised Apr 7, 2019
Accepted Apr 19, 2019
ABSTRACT
Code clones are syntactically or semantically equivalent fragments of source code. Copy-and-paste programming allows software developers to improve development productivity, but it can produce code clones that introduce non-trivial difficulties in software maintenance. In this paper, a code clone detection framework is presented with a feature extractor and a clone classifier using deep learning. The clone classifier is trained with true and false clones and is then tested with a test dataset to evaluate the performance of the proposed approach to clone detection. In particular, the proposed approach uses Control Flow Graphs (CFGs) to extract features of a given code snippet. The selected features are used to compute similarity scores for comparing two code fragments. The clone classifier is trained and tested with similarity scores that quantify how similar two code fragments are. The experimental results demonstrate that using CFG features is a viable methodology in terms of the effectiveness of clone detection for both syntactic and semantic clones.
Keywords:
Clone detection
Clone types
Control flow graph
Deep learning
Copyright © 2019 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Dong Kwan Kim,
Department of Computer Engineering,
Mokpo National Maritime University,
91, Haeyangdaehak-ro, Mokpo-si, Jeollanam-do, Korea.
Email: dongkwan@gmail.com
1. INTRODUCTION
Software reuse refers to a series of activities that use existing code to develop new software products. Software reuse improves development productivity and quality, but simple copy-and-paste reuse can produce redundant and duplicate code (also known as code clones) across a program. Code clones can be defined as syntactically or semantically equivalent fragments of source code. The presence of code clones can hinder the consistency of an application during software maintenance tasks such as bug fixes, security updates, and code refactoring. Such code changes can have undesirable consequences because of the clones: if only part of a code clone is modified and the change does not cover all of its copies, potential problems can be introduced during the lifecycle of a software system. Therefore, for better software maintenance, locating and keeping track of code clones is a crucial part of modifying the code of an existing program.
In this paper, a novel approach to clone detection is presented with a feature extractor and a clone classifier using deep learning. The feature extractor constructs feature vectors to characterize a given code fragment for clone detection. The clone classifier is based on supervised learning, which produces a deep learning model from training data consisting of input vectors and desired output values. Therefore, the clone classifier is trained with known true clones and false clones in a training phase. In a testing phase, the clone classifier predicts whether or not the two methods of a pair have a clone relationship. The proposed approach to clone detection extracts input features from the CFGs of a code fragment. The input features are used to compute similarity scores for comparing two code fragments. The clone classifier is trained and tested with the similarity scores, which quantify how similar two code fragments are.
The proposed clone detection framework represents a code fragment as a control flow graph to catch semantically similar clones. A control flow graph is a directed graph that represents the control flow of a code fragment (e.g., a function or a method). The CFG consists of a set of nodes (also known as vertices) and edges. A node represents a basic statement of the code fragment such as an expression, an if-statement, or a for-statement. An edge of the CFG connects one node with another and denotes a program control flow between the two nodes. A node can be connected to multiple nodes if it is involved in more than one control flow. The CFG has a single entry node where a control flow starts and a single exit node where a control flow ends. There can be multiple control flows between the entry node and the exit node. A path in a CFG is a sequence of directed edges in which all nodes are distinct. For example, 3_Path in this paper means that any three distinct nodes are connected in a CFG-based representation. A path represents structural features of a code fragment, and path lengths vary in a CFG-based representation. The code fragments are characterized by the length and the occurrence counts of their CFG paths.
The first step of clone detection is to determine the scope of a code fragment, which is a continuous segment of source code. Some clone detection tools take a function or a method as the code fragment. In some cases, a portion of a function can be considered as the code fragment. A code fragment is identified by three elements: a source file, a starting line number, and an ending line number. If a whole method is considered as the code fragment, the starting line number and the ending line number of the code fragment are the same as those of the method. The proposed clone detection approach takes each method in Java source files as the code fragment.
Code clones are pairs of code fragments of source code that are syntactically or semantically equivalent. There are four types of code clones: Type-1 (T1), Type-2 (T2), Type-3 (T3), and Type-4 (T4). Type-1 clones are syntactically identical code fragments, except for differences in white space, layout and comments [1]. Type-2 clones are syntactically identical code fragments, except for differences in identifier names, literal values, white space, layout and comments [1]. Type-3 clones are syntactically similar code fragments that differ at the statement level. The fragments have statements added, modified and/or removed with respect to each other [1]. Finally, Type-4 clones are syntactically dissimilar code fragments that implement the same functionality [2]. The proposed approach to clone detection uses BigCloneBench [3], which is one of the popular benchmarks of code clones. The BigCloneBench data are categorized by separating Type-3 and Type-4 clone pairs into four categories based on their syntactical similarity: Very-Strongly Type-3 (VST3) clones with a similarity in the range 90% (inclusive) to 100%, Strongly Type-3 (ST3): 70-90%, Moderately Type-3 (MT3): 50-70%, and Weakly Type-3 or Type-4 (WT3/4): 0-50% [4].
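The four similarity bands above can be written as a small lookup. This is a minimal sketch, not part of BigCloneBench itself: the function name `bcb_category` is hypothetical, and the source states only that the 90% bound is inclusive, so treating the 70% and 50% lower bounds as inclusive too is an assumption.

```python
def bcb_category(similarity: float) -> str:
    """Map a syntactical similarity in [0, 1] to a Type-3/Type-4 band.

    Assumption: every lower bound is inclusive (the source states this
    only for the 90% bound of VST3).
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    if similarity >= 0.9:
        return "VST3"   # Very-Strongly Type-3
    if similarity >= 0.7:
        return "ST3"    # Strongly Type-3
    if similarity >= 0.5:
        return "MT3"    # Moderately Type-3
    return "WT3/4"      # Weakly Type-3 or Type-4


if __name__ == "__main__":
    for value in (0.95, 0.75, 0.55, 0.10):
        print(value, bcb_category(value))
```

Type-1 and Type-2 pairs are identified by normalization rather than by a similarity band, so they are deliberately outside the scope of this helper.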
Figure 1. The deep neural network architecture
Deep learning is a specific kind of machine learning that allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction [5]. Deep learning algorithms have been applied to problems in artificial intelligence and have produced promising results for specific tasks such as natural language processing, computer vision, and image processing [6]. These advanced learning algorithms can also contribute to solving software engineering problems that are analogous to those in artificial intelligence. Figure 1 shows the deep neural network architecture, which consists of three kinds of layers: the input layer, the hidden layers, and the output layer. The input features (e.g., similarity scores) are given to the input nodes in the input layer. The nodes in the different layers are connected with connection weights and biases. The deep learning network model can possess multiple hidden layers to improve the performance of the code clone classifier. In the clone classifier model, the Rectified Linear Unit (ReLU) is the activation function for the hidden layers and the softmax function is the activation function for the output layer. As two nodes reside in the output layer, the prediction result of the clone classifier for a given method pair is either "Clone" or "Non-Clone".
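To make the layer structure concrete, the following is a minimal forward-pass sketch in plain Python. It is illustrative only and is not the trained model from the experiments. The layer widths, the random untrained weights, and the choice of output index 1 for "Clone" are all assumptions. Only the ReLU hidden layers, the softmax output, the four inputs, and the two outputs come from the text.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    # Subtract the maximum for numerical stability.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    # One output per weight row: dot(row, inputs) + bias.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def build_network(sizes, seed=0):
    """Create untrained layers; sizes lists the node count of each layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def predict(network, features):
    activation = list(features)
    for weights, biases in network[:-1]:      # hidden layers use ReLU
        activation = relu(dense(activation, weights, biases))
    weights, biases = network[-1]             # output layer uses softmax
    return softmax(dense(activation, weights, biases))


LABELS = ("Non-Clone", "Clone")               # assumed index order

if __name__ == "__main__":
    network = build_network([4, 8, 8, 2])     # hidden width 8 is assumed
    probabilities = predict(network, [0.97, 0.95, 0.90, 0.88])
    print(LABELS[probabilities.index(max(probabilities))], probabilities)
```

Because the weights are random, the printed label carries no meaning; the sketch only shows how similarity scores flow through the layers to two output probabilities that sum to one.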
The rest of this paper is organized as follows. Section 2 describes the proposed approach to detecting code clones using control flow graphs. Section 3 presents experimental results to demonstrate the effectiveness of the CFG-based clone detection. Section 4 summarizes related studies, and Section 5 concludes the paper and outlines future research directions.
2. APPROACH
Figure 2 illustrates the overall workflow of the proposed code clone detection framework. The clone detection system has a training phase and a testing phase. The solid arrows show the flow of the training phase, which produces a trained clone classifier from a list of features. In the testing phase, shown by the dashed arrows in Figure 2, the clone classifier determines whether the two methods of each pair are clones of each other.
Figure 2. The overall flow of the proposed code clone detection framework
The clone detection system uses BigCloneBench, which is one of the popular benchmarks of code clones and has been used in clone-related studies. The BigCloneBench data are separated into training and test datasets. The clone classifier model in the detection system is trained and optimized on the training dataset and is then evaluated on the test dataset.
The first step of the training phase is to identify methods in the given Java source files through lexical analysis and syntactic analysis. Method pairs are constructed from the separate methods and are used to train or test the deep learning model. The proposed clone detection system extracts syntactic and semantic features of code fragments from the CFGs of the identified methods. A CFG is generated for each method, and CFG features are generated from the CFG. The CFG feature sets are represented by feature vectors that can be processed effectively by the deep learning model. Similarity scores are used to quantify how similar two methods are. The similarity score of a method pair is calculated and given to the input layer of the clone classifier. A normalization step keeps the similarity score in the range [0, 1]. The trained clone classifier is produced at the end of the training phase. In the testing phase, the similarity scores of the method pairs are created from the test dataset. The steps for generating the similarity scores are the same as those of the training phase. Using the similarity scores, the clone classifier can tell whether the given method pairs are clones or not.
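The text does not state which formula produces the similarity score, only that it lies in [0, 1]. As a hedged illustration, the sketch below assumes a weighted Jaccard ratio over path-frequency maps, which is one common way to compare two multisets and is not necessarily the measure used in the experiments. The names `path_similarity` and `similarity_vector` are hypothetical.

```python
def path_similarity(freq_a, freq_b):
    """Weighted Jaccard similarity of two {path: frequency} maps.

    Assumed measure: sum of per-path minimum counts divided by the sum
    of per-path maximum counts, which always lies in [0, 1].
    """
    keys = set(freq_a) | set(freq_b)
    overlap = sum(min(freq_a.get(k, 0), freq_b.get(k, 0)) for k in keys)
    union = sum(max(freq_a.get(k, 0), freq_b.get(k, 0)) for k in keys)
    if union == 0:
        return 1.0  # two empty feature sets are treated as identical
    return overlap / union


def similarity_vector(features_a, features_b,
                      path_types=("1_Path", "2_Path", "3_Path", "4_Path")):
    """One score per path type, suitable as a four-value classifier input."""
    return [path_similarity(features_a.get(t, {}), features_b.get(t, {}))
            for t in path_types]


if __name__ == "__main__":
    method_a = {"1_Path": {"IfStatementContext": 1, "ReturnStatementContext": 2}}
    method_b = {"1_Path": {"IfStatementContext": 1, "ReturnStatementContext": 1}}
    print(similarity_vector(method_a, method_b))
```

Producing one score per path type gives a four-value vector, which matches the four input nodes described for the classifier.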
The proposed clone detection framework is based on a deep learning network model that acts as a clone classifier and determines the semantic similarity of methods using their corresponding control flow graphs. A CFG is a directed graph that represents all possible execution paths of a method. It also encodes the behavior of the method at a higher level of abstraction. A set of features is extracted from the control flow graphs. This feature set includes node information and path information of the control flow graph. To compare two methods, their feature sets are used to compute the similarity score.
Figure 3 shows an example of Java source code and its corresponding control flow graph. In this example, the class Sample has only one method, main, and the control flow graph of the main method is presented on the right side of Figure 3. A node of the CFG is known as a basic block and represents expressions, if-statements, for-statements, etc. An edge of the CFG represents a possible control flow from the end of one block to the beginning of another. As seen in Figure 3, CFGs also have a single entry point and a single exit point.
public class Sample {
    public static String VERSION = "2.0.0";
    public static void main(String[] args) {
        System.out.printf("PROGEX (Program Graph Extractor) [ver. %s]\n", VERSION);
        System.out.println("Visit project website @ https://github.com/ghaffarian/progex\n");
        if (args.length == 0) {
            CLI.printHelp(null);
            new GUI().show();
        } else {
            new CLI().parse(args).execute();
        }
    }
}
Figure 3. Example of control flow graph
A path in a CFG is a sequence of directed edges that connect a sequence of distinct nodes. For example, 2_Path means that any two distinct nodes are connected in the control flow graph. The proposed clone detection framework considers 1_Path, 2_Path, 3_Path, and 4_Path to compare method pairs. Table 1 lists the feature paths extracted from the control flow graph shown in Figure 3. The first column shows the type of path and the second column is the total frequency of each path type in the control flow graph. The last column shows the distinct instances of each path type and their frequencies. The sum of the frequencies of the path instances equals the total number in the second column.
Table 1. Example of feature paths extracted from a control flow graph
Type    Total  Path/Frequency
1_Path  8      [startMethod/1], [StatementExpressionContext/5], [IfStatementContext/1], [endif/1]
2_Path  8      [startMethod-StatementExpressionContext/1], [StatementExpressionContext-StatementExpressionContext/2], [StatementExpressionContext-IfStatementContext/1], [IfStatementContext-StatementExpressionContext/2], [StatementExpressionContext-endif/2]
3_Path  7      [startMethod-StatementExpressionContext-StatementExpressionContext/1], [StatementExpressionContext-StatementExpressionContext-IfStatementContext/1], [StatementExpressionContext-IfStatementContext-StatementExpressionContext/2], [IfStatementContext-StatementExpressionContext-StatementExpressionContext/1], [IfStatementContext-StatementExpressionContext-endif/1], [StatementExpressionContext-StatementExpressionContext-endif/1]
4_Path  6      [startMethod-StatementExpressionContext-StatementExpressionContext-IfStatementContext/1], [StatementExpressionContext-StatementExpressionContext-IfStatementContext-StatementExpressionContext/2], [StatementExpressionContext-IfStatementContext-StatementExpressionContext-StatementExpressionContext/1], [StatementExpressionContext-IfStatementContext-StatementExpressionContext-endif/1], [IfStatementContext-StatementExpressionContext-StatementExpressionContext-endif/1]
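The counts in Table 1 can be reproduced with a short sketch that enumerates every path of k distinct nodes in the CFG of Sample.main. The graph encoding below (node ids `s1` to `s5`, the `EDGES` map, and the function names) is an assumption made for illustration; the CFG itself appears only as a figure, so the edges here are inferred from the Java source in Figure 3.

```python
from collections import Counter

SEC = "StatementExpressionContext"
IF = "IfStatementContext"

# Node id -> node type label, following the labels used in Table 1.
LABELS = {
    "start": "startMethod", "s1": SEC, "s2": SEC, "if": IF,
    "s3": SEC, "s4": SEC, "s5": SEC, "end": "endif",
}

# Assumed edges: printf -> println -> if; the then-branch holds two
# statements (s3, s4), the else-branch one (s5); both reach endif.
EDGES = {
    "start": ["s1"], "s1": ["s2"], "s2": ["if"], "if": ["s3", "s5"],
    "s3": ["s4"], "s4": ["end"], "s5": ["end"], "end": [],
}


def k_paths(edges, k):
    """Yield every directed path that visits exactly k distinct nodes."""
    def extend(path):
        if len(path) == k:
            yield tuple(path)
            return
        for successor in edges[path[-1]]:
            if successor not in path:
                yield from extend(path + [successor])

    for node in edges:
        yield from extend([node])


def path_features(edges, labels, k):
    """Count k-node paths by their sequence of node type labels."""
    return Counter("-".join(labels[n] for n in path)
                   for path in k_paths(edges, k))


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        features = path_features(EDGES, LABELS, k)
        print(f"{k}_Path", sum(features.values()), dict(features))
```

Under this reading a k_Path is a path through k nodes, so 1_Path counts node types and 2_Path counts edges, which is consistent with the totals of 8, 8, 7, and 6 in the table.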
3. EVALUATION
The choice of features can seriously affect the performance of the clone classifier. Therefore, extracting meaningful features from a dataset is one of the key steps in developing an acceptable deep learning model. Figure 4 shows the CFG feature extraction algorithm, which identifies a set of CFG features for a given method. The CFG features of each method are later used to compute the similarity score for the method comparison. Given a target method in a program, the proposed clone detection system extracts a set of CFG features from the corresponding control flow graph of the method. After creating the CFG of the input method, the detection system finds all paths on the control flow graph. The next step is to count the frequency of the CFG features. If the label of a CFG node on a CFG path is already in the feature set, the frequency of that CFG feature node increases by one. If the CFG node is new, it is added to the feature set with a frequency of one. The counting continues until all CFG paths have been considered. Finally, the feature extraction algorithm produces a CFG feature set that holds pairs of a feature node type and its frequency.
Input: Let MED be a target method in a program.
Output: Let FeatureSet be a set of CFG features. FeatureSet = {(f_1, feq_1), ···, (f_n, feq_n)}
FeatureSet ← ∅  // Starting from an empty feature set
aCFG ← generateCFG(MED)  // Creating the CFG by calling the CFG generator on MED
allPaths ← findAllPaths(aCFG)  // Finding all paths on the control flow graph
foreach path_i from allPaths do
    foreach node_j from path_i do
        if node_j ∈ FeatureSet then
            // Getting the frequency to which node_j is mapped in FeatureSet
            feq_j ← getFequency(FeatureSet, node_j)
            feq_j ← feq_j + 1  // Increasing the frequency by one
        else
            // Adding node_j to FeatureSet if node_j is new, with a frequency of one
            FeatureSet = FeatureSet ∪ {(node_j, 1)}
        endIf
    endForeach
endForeach
return FeatureSet
Figure 4. Extracting CFG features from pairwise methods to identify code clones
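The counting loop of Figure 4 translates almost line for line into Python. The sketch below is an assumed rendering, not the original implementation: generateCFG and findAllPaths are not specified in the text, so the function takes the already-computed paths as input, with each path given as a list of node type labels.

```python
def extract_features(all_paths):
    """Return {node label: frequency} over every node of every path.

    Mirrors the loop in Figure 4; `all_paths` stands in for the result
    of findAllPaths, which is assumed to be computed elsewhere.
    """
    feature_set = {}
    for path in all_paths:
        for node in path:
            if node in feature_set:
                feature_set[node] += 1   # known label: increase by one
            else:
                feature_set[node] = 1    # new label: start at one
    return feature_set


if __name__ == "__main__":
    paths = [
        ["startMethod", "IfStatementContext", "endif"],
        ["startMethod", "StatementExpressionContext", "endif"],
    ]
    print(extract_features(paths))
```

A node label that lies on several paths is counted once per path in this literal reading, so the resulting frequencies depend on how the path finder enumerates paths.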
The proposed clone detection system needs to train a binary classifier for code clone detection, using both true clone pairs and false clone pairs. For the training of the classifier, the similarity scores of each feature vector and its label are provided from the training dataset. The true clones are labelled '1' while the false clones are labelled '0'. Keras [7], an open source machine learning library for deep learning, is used to train the clone classifier. Four input nodes and two output nodes are configured for the input layer and the output layer, respectively. Since the proposed clone detection approach considers four features of the control flow graph, the input layer holds four nodes, each of which takes in one feature value. The output layer contains two nodes because clone detection is a binary classification task that checks whether two methods are clones or not. The proposed clone detection system configures the deep learning model with eight hidden layers and runs 200 iterations for training.
Table 2 lists the node types of the control flow graph that are considered to detect code clones. A node type is a construct of Java programs such as an expression, a control construct, a conditional construct, or an exception handling construct. The Java syntax of each node type is given in the last column and follows BNF-style conventions: [expr] means zero or one occurrence of expr, {expr} means zero or more occurrences of expr, and (x | y) denotes either x or y. The proposed clone detection system extracts node types from methods and then generates the edges and paths that are made up of the nodes.
Table 3 shows the training and testing datasets for the clone classifier. Folder #4 in BigCloneBench is used as the training dataset since it has the largest number of true and false clone pairs. The true clone pairs in the training dataset include T1, T2, VST3, and ST3 clones, and the total number of true clones is 13,399, which is equal to the total number of false clones. MT3 and WT3/4 clones are intentionally excluded from the training dataset because they could contain noisy data that may produce false alarms when detecting code clones.
For the testing of the clone classifier, the test dataset is made from Java source files in four folders of BigCloneBench: #2, #3, #7, and #10. The source files in folder #4 are excluded because they are used as the training dataset. Unlike the training dataset, the test dataset contains only true clone pairs without false clone pairs. The similarity scores of the method pairs are computed through the same procedures as in the training phase. These procedures include the CFG construction, the CFG path generation, the CFG feature extraction, and the similarity score computation.
Table 2. A list of CFG nodes
No  Node Type                         Java Syntax
1   StatementExpressionContext        StatementExpression ;
2   LocalVariableDeclarationContext   {VariableModifier} Type VariableDeclarators ;
3   IfStatementContext                if ParExpression Statement [else Statement]
4   ForStatementContext               for ( ForControl ) Statement
5   WhileStatementContext             while ParExpression Statement
6   DoWhileStatementContext           do Statement while ParExpression ;
7   SwitchStatementContext            switch ParExpression { SwitchBlockStatementGroups }
                                      SwitchBlockStatementGroups: { SwitchBlockStatementGroup }
                                      SwitchBlockStatementGroup: SwitchLabels BlockStatements
                                      SwitchLabels: SwitchLabel { SwitchLabel }
                                      SwitchLabel: case Expression :
                                                   case EnumConstantName :
                                                   default :
8   LabelStatementContext             Identifier : Statement
9   ReturnStatementContext            return [Expression] ;
10  BreakStatementContext             break [Identifier] ;
11  ContinueStatementContext          continue [Identifier] ;
12  SynchBlockStatementContext        synchronized ParExpression Block
13  TryStatementContext               try Block ( CatchClause | [CatchClause] FinallyBlock )
14  TryWithResourceStatementContext   try ResourceSpecification Block [CatchClause] [FinallyBlock]
                                      ResourceSpecification: ( Resources [;] )
                                      Resources: Resource { ; Resource }
                                      Resource: {VariableModifier} ClassOrInterfaceType VariableDeclaratorId = Expression
15  ThrowStatementContext             throw Expression ;
Table 3. Datasets of training and testing
Dataset         True Clones                                           False Clones
                T1      T2      VST3    ST3     MT3     WT3/4
Training (#4)   4,500   3,103   1,204   4,592   0       0         13,399
Testing #2      1,552   7       22      1,411   2,687   10,107
Testing #3      630     583     523     2,734   24,335  825,838
Testing #7      33      4       21      211     1,627   10,665
Testing #10     151     61      282     922     1159    296
Figure 5 shows the experimental results on the test dataset. The proposed clone classifier has been evaluated with different similarity thresholds in order to check how the threshold affects the performance of the clone detection. The classifier reports a clone when the likelihood value is greater than or equal to the similarity threshold. In these experiments, the clone classifier is configured with 8 hidden layers and 200 epochs. The proposed detection system effectively identified T1, T2, and VST3 clones: the recall results are close to 100% (except on the folder #2 dataset) with the different similarity thresholds of 0.95, 0.96, 0.97, and 0.98. The proposed classifier also effectively detects ST3 clones on datasets #2 and #10, where the recall values are greater than 90%. On datasets #3 and #7, the performance of the clone detection decreases slightly as the detection threshold becomes larger. The detection system did not show desirable detection performance on the MT3 and WT3/4 clones. It is still challenging to detect semantic clone types, even though the proposed approach is based on control flow graphs, which may represent semantically similar code blocks at a more abstract level than token-based approaches do.
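Because the test dataset contains only true clone pairs, recall at a given threshold reduces to the fraction of pairs whose likelihood reaches that threshold. The sketch below illustrates this decision rule. The function names and the sample likelihood values are made up for illustration and are not results from the experiments.

```python
def is_clone(likelihood, threshold):
    """Decision rule from the text: report a clone when likelihood >= threshold."""
    return likelihood >= threshold


def recall(true_clone_likelihoods, threshold):
    """Fraction of true clone pairs that the classifier reports as clones."""
    if not true_clone_likelihoods:
        raise ValueError("at least one true clone pair is required")
    detected = sum(1 for p in true_clone_likelihoods if is_clone(p, threshold))
    return detected / len(true_clone_likelihoods)


if __name__ == "__main__":
    # Hypothetical classifier outputs for four true clone pairs.
    likelihoods = [0.99, 0.97, 0.955, 0.60]
    for threshold in (0.95, 0.96, 0.97, 0.98):
        print(threshold, recall(likelihoods, threshold))
```

Raising the threshold can only keep recall the same or lower it, which is the trend described for datasets #3 and #7.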
Deep neural network models may be affected by hyperparameters such as the number of hidden layers and the number of epochs. Tables 4 and 5 show the recall results with different parameter settings for training the proposed clone detection system. Five clone detection models are created with different numbers of hidden layers: 2, 4, 6, 8, and 10. The default number of epochs is 100. Although the trained models are not severely affected by the number of hidden layers in this experiment, the model with 8 hidden layers achieves the best performance. Table 5 shows how the number of epochs affects the performance of the clone detection. The proposed clone classifier works best when the number of epochs is 200.
Figure 5. Recall results with different similarity thresholds (0.95~0.98)
Table 4. Recall results with different hidden layers (Threshold = 0.98)
# of hidden layers   T1    T2    VST3   ST3   MT3   WT3/4
2                    100   100   99     74    28    2
4                    100   100   99     73    27    2
6                    100   100   99     71    28    2
8                    100   100   99     74    29    2
10                   100   100   99     70    25    2
Table 5. Recall results with different epochs (Threshold = 0.98)
# of epochs   T1    T2    VST3   ST3   MT3   WT3/4
50            100   100   99     72    27    2
100           100   100   99     71    28    2
150           100   100   99     72    28    2
200           100   100   99     73    28    2
250           100   100   99     72    27    2
4. RELATED WORK
Many studies have been conducted to overcome research challenges in detecting code clones across source code. Most existing clone detection methodologies can be categorized into five types: text-based [8], token-based [9], tree-based [10], graph-based [11], and metrics-based [12] approaches. One of the emerging research trends is to leverage deep learning algorithms to enhance the performance of the existing clone detection strategies. The promising outcomes of deep learning affect many fields beyond the artificial intelligence community. Software engineers have been actively applying deep learning to typical problems in software engineering such as clone detection, bug prediction, and security prediction.
Sh
e
neam
er
an
d
Kali
ta
’s
cl
on
e
detect
io
n
a
ppr
oac
h
[13]
use
s
ty
pical
m
ac
hin
e
le
arn
i
ng
a
lgorit
hm
s
so
to
detect
se
m
antic
cod
e
cl
ones.
T
hey
us
e
a
su
pervised
le
arn
ing
a
ppr
oach
wh
e
re
se
m
antic
featur
es
are
extra
ct
ed
f
r
om
AS
Ts
an
d
P
D
Gs
an
d
trai
ning
data
are
la
be
ll
ed
with
cl
on
e
s
or
no
n
-
cl
on
e
s.
Their
a
pproach
is
base
d
on
m
ac
hin
e
le
ar
ning
al
gorithm
s,
not
deep
le
ar
ning
al
gorithm
s.
Wh
it
e
et
al
.
[
14]
us
e
a
deep
le
arn
i
ng
Evaluation Warning : The document was created with Spire.PDF for Python.
In
t J
Elec
&
C
om
p
En
g
IS
S
N: 20
88
-
8708
En
hancin
g
c
od
e clon
e
d
et
ect
i
on u
si
ng c
on
tr
ol fl
ow
gr
ap
hs
(
Do
ng Kw
an K
im
)
3811
appr
oach
to
de
te
ct
cod
e
cl
ones
by
com
bin
in
g
rec
urr
ent
neural
netw
ork
with
rec
ur
s
ive
ne
ur
al
net
work
at
m
et
ho
d
an
d
file
le
vels.
The
e
xp
e
rim
ental
resu
lt
s
s
how
t
heir
m
et
ho
dolo
gy
is
feasi
ble
in
so
m
e
cases,
bu
t
they
sti
ll
need
to
cond
uct
m
or
e
case
stud
ie
s
on
popu
la
r
cl
one
ben
c
hm
ark
s.
Li et al. [15] provide a token-based clone detection approach that uses a deep learning-based clone classifier. They extract feature vectors by tokenizing method pairs and then compute similarity scores from the feature vectors. The clone classifier is trained with eight similarity scores of known true clones and false clones. In the testing phase, the trained clone classifier predicts code clones in a codebase of unknown clones. This approach still has room for improvement in finding semantic clones such as Type-3 and Type-4 clones.
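As a rough illustration of the token-based pipeline described above, the following Python sketch tokenizes two code fragments, counts tokens per category, and derives one similarity score per category. The category names, the keyword set, the regular-expression tokenizer, and the score formula are assumptions made for illustration only; they are not taken from [15], which uses eight similarity scores whose definitions are not given here.

```python
import re
from collections import Counter

# Hypothetical keyword set and categories, chosen only for this sketch.
KEYWORDS = {"if", "else", "for", "while", "return", "int"}


def token_counts(code):
    """Count the tokens of a code fragment per (assumed) token category."""
    counts = {"keyword": Counter(), "identifier": Counter(),
              "literal": Counter(), "operator": Counter()}
    # Identifiers/keywords, integer literals, or single punctuation characters.
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code):
        if tok in KEYWORDS:
            counts["keyword"][tok] += 1
        elif tok[0].isalpha() or tok[0] == "_":
            counts["identifier"][tok] += 1
        elif tok.isdigit():
            counts["literal"][tok] += 1
        else:
            counts["operator"][tok] += 1
    return counts


def category_similarity(a, b):
    """Assumed score: 1 - (sum of frequency differences) / (sum of frequencies)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 1.0  # both fragments have no token in this category
    diff = sum(abs(a[t] - b[t]) for t in set(a) | set(b))
    return 1.0 - diff / total


def similarity_vector(code_a, code_b):
    """One similarity score per category, in alphabetical category order."""
    ca, cb = token_counts(code_a), token_counts(code_b)
    return [category_similarity(ca[c], cb[c]) for c in sorted(ca)]


# Order of scores: identifier, keyword, literal, operator.
print(similarity_vector("int a = 1;", "int b = 1;"))  # [0.0, 1.0, 1.0, 1.0]
```

In a setting like the one described, score vectors computed for labelled true and false clone pairs would serve as the training input of the classifier.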
Saini et al. [16] propose a clone detection approach that focuses on harder-to-detect semantic clones. Their approach is based on a deep neural network with a Siamese architecture in which information retrieval and metric-based methods are combined. To detect semantic clones, their approach excludes semantically dissimilar clones using a semantic filter instead of using semantic features.
Phan et al. [17] show that CFG-based deep learning can be used for software defect prediction as well as clone detection. They apply convolutional neural networks to predict software defects using control flow graphs. The control flow graphs are built from assembly files and then represented as vectors, which are given to the convolutional neural networks. After being trained on training datasets, the convolutional neural network explores the behavior of the target code using the vector representations and reports software defects in unseen datasets.
5. CONCLUSIONS AND FUTURE WORK
This paper presents a code clone detection framework that effectively finds clone types with a deep learning-based clone classifier. The proposed approach to clone detection is based on the extraction of features from the CFGs of given code fragments. The CFG features are represented as feature vectors so that they can be compared to determine whether code fragments are similar or dissimilar. The clone detection classifier is trained and tested with similarity scores that are computed from the feature vectors. The proposed detection framework effectively found syntactic clone types such as T1, T2, and VST3 clones. It also identified ST3 clones with acceptable but not excellent recall. In the case of semantic clone types such as MT3 and WT3/4 clones, the detection performance still needs to be improved. Although the proposed approach to clone detection needs to be improved further, the promising experimental results suggest that deep learning-based clone detection classifiers can be effective in finding code clones.
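The comparison of CFG feature vectors summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the feature extractor of the proposed framework: the CFG is assumed to be given as a dictionary that maps each basic-block id to its successor ids, and the four features (node count, edge count, branch count, cyclomatic complexity) and the per-feature score formula are hypothetical choices.

```python
def cfg_features(cfg):
    """Build a small feature vector from a CFG given as {block: [successors]}."""
    nodes = len(cfg)
    edges = sum(len(succ) for succ in cfg.values())
    # A block with more than one successor is counted as a branch.
    branches = sum(1 for succ in cfg.values() if len(succ) > 1)
    # McCabe's cyclomatic complexity for a single connected CFG.
    cyclomatic = edges - nodes + 2
    return [nodes, edges, branches, cyclomatic]


def similarity_scores(fa, fb):
    """Assumed per-feature score in [0, 1]; 1.0 means the values are equal."""
    scores = []
    for a, b in zip(fa, fb):
        if a == b:
            scores.append(1.0)
        else:
            scores.append(1.0 - abs(a - b) / (abs(a) + abs(b)))
    return scores


# Straight-line code: block 0 -> block 1 -> block 2.
straight = {0: [1], 1: [2], 2: []}
# An if/else: block 0 branches to 1 or 2, and both join in block 3.
branchy = {0: [1, 2], 1: [3], 2: [3], 3: []}

fa = cfg_features(straight)  # [3, 2, 0, 1]
fb = cfg_features(branchy)   # [4, 4, 1, 2]
print(similarity_scores(fa, fb))
```

Score vectors of this kind, computed for known clone and non-clone pairs, are what a classifier such as the one described in this paper would be trained and tested on.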
In the future, more code clones will be explored to enhance the proposed clone detection on semantic clone types. Furthermore, unsupervised deep learning algorithms can be considered to address the weaknesses of the clone classifier, which relies on supervised learning. The clone detection framework will also be applied to other programming languages to extend the generality of the proposed clone detection methods.
ACKNOWLEDGEMENTS
This research was financially supported by Mokpo National Maritime University in 2018.

REFERENCES
[1] S. Bellon, et al., "Comparison and evaluation of clone detection tools," IEEE Transactions on Software Engineering, vol. 33, no. 9, pp. 577–591, 2007.
[2] C. K. Roy, et al., "A survey on software clone detection research," Queen's University, Tech. Rep. 2007-541, 115 pp, 2007.
[3] J. Svajlenko, et al., "Towards a big data curated benchmark of inter-project code clones," 30th IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, September 29-October 3, 2014.
[4] J. Svajlenko, et al., "BigCloneEval: A clone detection tool evaluation framework with BigCloneBench," 32nd IEEE International Conference on Software Maintenance and Evolution, pp. 596–600, 2016.
[5] Y. LeCun, et al., "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[6] P. Bielik, et al., "Programming with Big Code: Lessons, Techniques and Applications," The Inaugural Summit on Advances in Programming Languages, 2015.
[7] "Keras," https://keras.io/, accessed: Nov. 2018.
[8] S. Ducasse, et al., "A language independent approach for detecting duplicated code," In Proceedings of the IEEE International Conference on Software Maintenance, pp. 109–118, 1999.
[9] T. Kamiya, et al., "CCFinder: a multilinguistic token-based code clone detection system for large scale source code," IEEE Transactions on Software Engineering, vol. 28, issue 7, pp. 654–670, 2002.
[10] L. Jiang, et al., "DECKARD: scalable and accurate tree-based detection of code clones," International Conference on Software Engineering, pp. 96–105, 2007.
[11] J. Krinke, "Identifying similar code with program dependence graphs," In Proceedings of the Eighth Working Conference on Reverse Engineering, pp. 301–309, 2001.
[12] J. Mayrand, et al., "Experiment on the Automatic Detection of Function Clones in a Software System Using Metrics," In Proceedings of International Conference on Software Maintenance, 1996.
[13] A. Sheneamer, et al., "Semantic Clone Detection Using Machine Learning," 15th IEEE International Conference on Machine Learning and Applications, Dec. 2016.
[14] M. White, et al., "Deep learning code fragments for code clone detection," In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, 2016.
[15] L. Li, et al., "CCLearner: A Deep Learning-Based Clone Detection Approach," IEEE International Conference on Software Maintenance and Evolution, Sep. 2017.
[16] V. Saini, et al., "Oreo: Detection of Clones in the Twilight Zone," The ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Nov. 2018.
[17] A. V. Phan, et al., "Convolutional Neural Networks over Control Flow Graphs for Software Defect Prediction," IEEE 29th International Conference on Tools with Artificial Intelligence, Nov. 2017.

BIOGRAPHY OF AUTHOR
Dong Kwan Kim is an associate professor in the Department of Computer Engineering at Mokpo National Maritime University. His research interests include deep learning, software evolution, run-time systems, and mobile programming.