TELKOMNIKA, Vol. 14, No. 3, September 2016, pp. 1157–1165
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/telkomnika.v14.i3.3850
Metamorphic Malware Detection Based on Support Vector Machine Classification of Malware Sub-Signatures
Ban Mohammed Khammas, Alireza Monemi, Ismahani Ismail, Sulaiman Mohd Nor, and M.N. Marsono*
Faculty of Electrical Engineering, Universiti Teknologi Malaysia, 81310, Johor Bahru, Johor, Malaysia
*Corresponding author, e-mail: mnadzir@utm.my
Abstract
Achieving accurate and efficient metamorphic malware detection remains a challenge. Metamorphic malware is able to mutate and alter its code structure in each infection, which allows it to circumvent signature-matching detection. However, some vital functionalities and code segments remain unchanged between mutations. We exploit these unchanged features by means of classification using a Support Vector Machine (SVM). N-gram features are extracted directly from malware binaries to avoid disassembly, and these features are then masked with n-grams extracted from known malware signatures. Masking reduces the number of selected n-gram features considerably. Our method is capable of detecting metamorphic malware with ~99% accuracy and a low false positive rate. The proposed method is also superior to commercially available anti-viruses for detecting metamorphic malware.
Keywords: SVM classification, Metamorphic, n-gram, Snort

Copyright © 2016 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
Malware is one of the security attacks on Internet users, as it breaches computer security and data confidentiality. Malware is categorized into general (non-mutable) and mutable types. Anti-virus software relies on signature-based detection as the primary detection mechanism. Mutable malware, such as packed, polymorphic, and metamorphic malware, makes detection based on signature matching difficult. Metamorphic malware mutates and changes its code structure and signatures in each infection, which makes it difficult to detect [1].
Lately, several host-based dynamic analysis techniques were proposed for metamorphic malware detection [2]. However, these techniques require a separate environment in which the malware is analyzed before it can be detected. At the same time, the binary code disassembly required by opcode-based methods [3–5] is not suitable for timely metamorphic detection in host-level intrusion detection systems.
We propose metamorphic malware detection based on static analysis of metamorphic malware binaries without disassembly. Features are extracted from the binary, which can be in the form of packet payloads in a network detection system or files in a host-based detection system, using n-gram feature extraction and machine learning SVM classification. In addition, the extracted n-gram features are masked with known malware signature n-grams so that only informative malware features are represented. This technique reduces the n-gram search space.
This paper is organized as follows. Section 2 provides a critical review of the relevant literature. The methodology for metamorphic malware detection in network and host-based IDS is described in Section 3. Section 4 highlights the experimental setup, datasets, and evaluation criteria. The data analysis and comparison with commercial anti-virus software are presented in Section 5. Section 6 concludes with the research findings and contributions of the paper.
2. Related Work
Host-based anti-viruses (AV) and NIDS such as Snort [6] and Bro [7] primarily rely on signature matching as one of the detection techniques. These tools search for specific signatures in the file or packet payload to detect malware and other types of attacks [6]. However, the inability to detect metamorphic malware or previously unseen malware without known signatures is the main limitation of signature-matching methods [5, 8].

Received April 5, 2016; Revised July 16, 2016; Accepted July 30, 2016
Metamorphic malware is a type of advanced mutating malware that changes its structure in each mutation. These changes may involve a small number of instructions for a specific functionality, or multiple instructions that perform similar functionality. In each mutation, these instructions are expanded or minimized according to the obfuscation techniques used. The taxonomy of morphing techniques used in metamorphic malware includes insertion of redundant code, dead instructions, NOP instructions, and unreachable code, reordering of instructions, register swapping, and substitution of instructions with equivalent instructions. The detection of mutating malware is non-trivial [3, 5, 9].
Several studies proposed the use of machine learning (ML) to detect malware [10, 11]. Schultz et al. [12], in their pioneering study, used ML techniques to detect malware, investigating different feature extraction methods such as string features, program headers, and byte sequence features. Kolter and Maloof [11] used n-gram feature extraction and improved the detection accuracy by combining the n-gram technique with ML to classify malware executable files (worm, virus, and trojan). The n-gram features are extracted and selected from malware byte code using the Information Gain (IG) method. Furthermore, the effects of several ML classifiers (naive Bayes, instance-based learning, SVM, decision tree, and boosted classifiers) as well as the size of n-grams on the classification accuracy were analyzed. The 4-gram was reported to offer the best accuracy.
Shabtai et al. [13] used n-gram features extracted from opcodes instead of byte code. They examined the influence of different n-gram sizes (1 to 5) with assorted feature selection methods and classifiers. The effects of term frequency-inverse document frequency (TFIDF) and normalized term frequency (TF) were compared. It is reported that TFIDF and TF produced almost identical results.
Santos et al. [10] proposed the use of opcode-sequence frequency to represent features based on sampled malware and normal files, where the top 1000 features are selected using IG. Several ML classifiers such as decision trees, SVM, k-nearest neighbours, and Bayesian classifiers were used to analyze the data set.
Generally, static metamorphic malware detection uses opcodes or opcode n-grams as extracted features for different detection techniques such as ML and statistical analysis [3–5, 14]. Lately, a control-flow-based method by Alam et al. [1] statistically analyzed the performance of host-based metamorphic malware detection. In their method, the dataset is translated into an intermediate language called MAIL after disassembling the program binary. The detection accuracy for a dataset containing 1020 metamorphic files and 2330 normal files was reported to be 94.69% with a false positive rate (FPR) of 10.59%. Runwal et al. [15] proposed an opcode graph technique to find the similarity of executable files for detecting metamorphic malware. Opcode sequences are extracted and weights are measured based on the frequency of opcode occurrences.
Most of the existing literature uses opcode-based features to detect metamorphic malware despite their computational cost. For instance, opcode-based methods require disassembly of binary code to obtain the opcode features, which is not suitable for timely metamorphic detection, especially when targeting network-level detection. To circumvent this shortcoming, we propose the use of ML and n-gram term frequency [10, 13] with modified feature extraction and selection processes. The n-gram search space is further minimized via a two-stage feature selection scheme. First, features that match the sub-signature features are selected. Second, the most effective features are chosen using IG before being classified using the SVM classifier.
3. Proposed Method for Metamorphic Malware Detection
Earlier studies on metamorphic detection [3–5, 14] mostly employed opcode-based feature extraction, which is applicable only in host-based detection because of the disassembly requirement. The proposed method overcomes this by extracting the features directly from binaries. Those features are masked with sub-signature n-grams for classification. Similar to previous methods [10, 13], the term frequency of the features is computed and the ML classifier is used to classify unknown files based on the term frequency of these features.
Figure 1 displays a screenshot of one original metamorphic file and its mutated version. The left pane shows all similar and dissimilar metamorphic code of these two files. Some instructions remain unaltered during the metamorphic malware mutation when a new metamorphic file is generated.

Figure 1. Comparison of two NGVCK viruses.
It is important to note that the mutation process generates metamorphic malware with code segments inherited from its ancestor. Some features of the old metamorphic malware are kept unchanged in the mutated malware, as malware writers reuse old code segments [1]. Complete mutation of a metamorphic malware is deemed impossible due to the need to keep the same functionality [14]. Most versions of the same malware share a combination of several unchanged code segments [16]. Based on this hypothesis, an n-gram analysis for metamorphic malware detection can be made by mining these inherited code segments.
Figure 2 shows the processing steps of the proposed method. The n-gram sub-signatures are augmented with sub-signatures obtained from existing metamorphic files.
[Figure 2 block labels: Malware / Normal Binary File; N-gram Feature Extraction; Snort Sub-Signature (n-gram Snort Signature); Select Only n-grams Appearing in Snort Sub-Signature; Feature Selection; Classifier]
Figure 2. Various stages of metamorphic malware detection via the proposed method.
1. References [17, 18] split the signature into equal pieces. The present study instead extracts overlapping 4-gram features from Snort signatures for malware detection. These signatures are used as a case study, although Bro [7] rules can also be used. Snort signatures are split into overlapping 4-gram sub-signatures. Then, the term frequency (TF) of each unique n-gram in every file is counted and normalized, as term or byte frequency analysis is effective in file classification [19].
2. In order to reduce the large n-gram search space, a feature selection method is used. To select the most informative sub-signature n-gram features that appear in a file, the information gain (IG) feature selection method is employed [11].
$$IG(j) = \sum_{v_j \in \{0,1\}} \sum_{C_i} P(v_j, C_i) \log \frac{P(v_j, C_i)}{P(v_j)\,P(C_i)} \qquad (1)$$
where $v_j$ is the value of the $j$-th attribute, $C_i$ is the $i$-th class, $P(C_i)$ is the probability that the training data is in class $C_i$, $P(v_j)$ is the probability that the $j$-th n-gram has value $v_j$ in the training dataset, and $P(v_j, C_i)$ is the probability that in class $C_i$, the $j$-th attribute has the value $v_j$.
3. The classification process requires generation of a classifier model from the training dataset. The learning algorithm trains the SVM classifier to predict whether a new file is malware or non-malware. Moreover, SVM kernel methods can map the features into a higher dimensional space, converting nonlinear features into linear ones [14]. Compared to other ML techniques, SVM has been reported to provide the highest malware attack detection accuracy [20]. The success of SVMs is due to the use of statistical learning theory [21], which is characterized by low estimation probability of generalization errors. Several SVM kernel functions can be used in our proposed technique. These include the Polynomial kernel (PLY), Linear (LN), Sigmoidal, and Gaussian Radial Basis Function (RB), as summarized in Table 1 [22].
Table 1. SVM kernel functions.

| Function | Equation |
|---|---|
| Polynomial kernel | $k(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$ |
| Linear | $k(x_i, x_j) = x_i^T x_j$ |
| Sigmoidal function | $k(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$ |
| Gaussian RB function | $k(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, $\gamma > 0$ |

where $x_i$ and $x_j$ are the training vectors, $d$ is a natural number, $r$ is a shifting parameter that controls the threshold, and $\gamma$ is a scaling parameter. The variables $d$, $r$, and $\gamma$ are the kernel parameters. Further discussion on each kernel function can be found in [22].
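As a minimal illustration of Table 1, the four kernels can be written directly in Python. The parameter names `gamma` (scaling), `r` (shifting), and `d` (degree), as well as their default values, are assumptions chosen for this sketch; they are not taken from the experiments.

```python
import math


def dot(x, y):
    """Inner product x^T y of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, r=0.0, d=3):
    # gamma > 0 scales the inner product, r shifts it, d is the degree.
    return (gamma * dot(x, y) + r) ** d


def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(x, y) + r)


def rbf_kernel(x, y, gamma=1.0):
    # Squared Euclidean distance between the two vectors.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

The RB kernel depends only on the distance between two feature vectors, so identical vectors always give a kernel value of 1 regardless of `gamma`.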
It is reported [23] that SVM is widely used in IDS because it offers high-speed classification and delivers scalability. Moreover, it is comparatively insensitive to the number of data points and has low generalization error.
As mentioned above, the proposed method can also be used for metamorphic malware detection in both host and network-based detection systems (directly from traffic flows). Since NIDS also use signature matching to detect attacks and malware, the present study adopts sub-signature features similar to Varghese et al. [17]. The proposed method uses only the n-gram features that appear in known malware sub-signatures to reduce the feature search space.
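The masking step described above can be sketched as follows. This is an illustrative sketch only: the function names and the two byte strings standing in for Snort signature content are hypothetical, and real Snort rules would first need their content fields parsed out.

```python
def overlapping_ngrams(data: bytes, n: int = 4):
    """Return every overlapping n-byte window of data, in order."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def build_subsignature_set(signatures, n: int = 4):
    """Split each signature into overlapping n-gram sub-signatures."""
    subsignatures = set()
    for signature in signatures:
        subsignatures.update(overlapping_ngrams(signature, n))
    return subsignatures


def mask_features(file_bytes: bytes, subsignatures, n: int = 4):
    """Keep only the file n-grams that also occur as a sub-signature."""
    return [g for g in overlapping_ngrams(file_bytes, n) if g in subsignatures]


# Hypothetical signature contents and payload, for illustration only.
signatures = [b"\x90\x90\xeb\x05\xe8", b"MZ\x90\x00\x03"]
subs = build_subsignature_set(signatures)
payload = b"\x00\x90\x90\xeb\x05\x00\x00"
kept = mask_features(payload, subs)
```

Storing the sub-signatures in a set keeps each membership check constant-time on average, so masking stays linear in the size of the file or payload.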
4. Experimental Setup
Experiments were performed on an Intel® Core™ i7-4710HQ at 2.50 GHz (8 GB RAM) on the Linux Ubuntu 14.04 platform. All executable files (metamorphic malware and normal) were first converted to overlapping 4-gram features. Only the 4-gram metamorphic features that appeared in Snort features (sub-signatures) were selected. Snort signatures were extracted from Snort rules version 2.9.5.5. The most important features are chosen using the IG feature selection method in the Waikato Environment for Knowledge Analysis (WEKA) [24]. The high-ranked features are used to train and build the classifier model. Next, the model is used to classify the testing dataset.
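The ranking in Equation (1) can be illustrated with a small sketch. It assumes binary feature values (n-gram present or absent in a file) and a base-2 logarithm; the experiments used WEKA for this step, so the functions below are hypothetical stand-ins, not the WEKA implementation.

```python
import math
from collections import Counter


def information_gain(feature_values, labels):
    """IG of one feature: sum over (v, C) of P(v, C) * log2(P(v, C) / (P(v) * P(C)))."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    value_counts = Counter(feature_values)
    class_counts = Counter(labels)
    ig = 0.0
    for (v, c), count in joint.items():
        p_vc = count / n
        p_v = value_counts[v] / n
        p_c = class_counts[c] / n
        ig += p_vc * math.log2(p_vc / (p_v * p_c))
    return ig


def rank_features(feature_columns, labels):
    """Return feature names sorted from most to least informative."""
    scores = {name: information_gain(col, labels)
              for name, col in feature_columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: feature "a" separates the classes perfectly, "b" carries no information.
labels = ["malware", "malware", "normal", "normal"]
columns = {"a": [1, 1, 0, 0], "b": [1, 0, 1, 0]}
ranking = rank_features(columns, labels)
```

Only value and class pairs that actually occur are summed, which avoids evaluating the logarithm of zero.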
The SVM classifier was built using the Library for Support Vector Machines (LIBSVM) (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) to simulate online metamorphic detection after extracting informative 4-gram features directly from executable binary files. Grid search is used in training to select the most effective SVM parameters [25]. The training data set contained labeled metamorphic and non-malware files.
The hexdump utility is used to convert the content of a binary executable to hexadecimal code, from which n-grams are extracted by combining each 4-byte sequence into one feature [11]. Following [11, 18], the n-gram size is set to 4. Features are represented in terms of normalized frequency instead of as Boolean attributes.
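A minimal sketch of this feature representation is given below. It reads raw bytes directly rather than parsing hexdump output, and it normalizes each count by the total number of overlapping 4-grams in the file; both choices are assumptions, since the exact normalization is not specified here.

```python
from collections import Counter


def ngram_term_frequency(data: bytes, n: int = 4):
    """Map each overlapping n-gram (as a hex string) to its normalized frequency."""
    total = len(data) - n + 1
    if total <= 0:
        return {}  # input shorter than one n-gram
    counts = Counter(data[i:i + n].hex() for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


# Toy input: eight bytes in which the 4-gram 4d5a9000 occurs twice.
sample = bytes.fromhex("4d5a90004d5a9000")
tf = ngram_term_frequency(sample)
```

Because each value is a fraction of the file's own n-gram count, files of very different sizes produce comparable feature vectors.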
4.1. Dataset Used
The training datasets consisted of normal and metamorphic files. A total of 1,971 normal (benign) executable files were collected from Windows 7, Windows XP, and Cygwin [26]. A total of 1020 metamorphic files were collected. Of these, 109 metamorphic files were collected in assembly format from [27], and Turbo Assembler was used to convert them from assembly to executable format: 50 files were generated by the Next Generation Virus Construction Kit (NGVCK), 50 files were Second Generation (G2) viruses, and the other 9 files were Mass Produced Code Generation Kit (MPCGEN) viruses. Furthermore, the NGVCK kit was also used to generate another 1000 metamorphic files under an identical configuration setting. This construction kit is available from VX Heavens [28]. It can generate strong metamorphic variants with various obfuscation techniques [1, 9]. 911 files were compiled using Turbo Assembler (TASM) version 5 under Oracle VM VirtualBox version 4.2.16.
4.2. Dataset Preprocessing
n-gram features were extracted from the executable files of the training dataset. These include thousands of features, most of which do not contribute to classification. Therefore, Snort malware rules are used to extract Snort 4-gram sub-signatures (called Snort features). Since the primary aim is the detection of metamorphic malware from its payload, only the content of the malware signatures is extracted to generate 4-grams.
5. Results and Discussion
In total, 1542 Snort 4-gram features appeared in metamorphic files. For a balanced training data set, 50% of the metamorphic files are selected for training and building the model (510 metamorphic and 510 normal files). The rest of the files are used as the testing dataset (510 metamorphic and 1461 normal files).
5.1. Accuracy
Figure 3 shows the accuracy on the testing data set after building the classifier from the training data set, for different numbers of features selected by IG and different kernel functions: Gaussian Radial Basis Function (RB), Linear (LN), and Polynomial Kernel (PLY). The number of features affected the accuracy, area under the curve (AUC), false negative rate (FNR), false positive rate (FPR), true negative rate (TNR), and true positive rate (TPR), as shown in Figure 3. When the number of features is greater than 500, the accuracy on the testing data set appears stable, as do AUC, FPR, FNR, TPR, and TNR. The best SVM kernel function is also analyzed. The RB kernel produced the best accuracy, FPR, FNR, TPR, and TNR compared to the other tested kernels.
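For reference, the evaluation measures named above follow from the confusion-matrix counts. The sketch below treats metamorphic files as the positive class; the counts in the example are hypothetical and are not results from the experiments.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int):
    """Compute accuracy and the four rates from confusion-matrix counts."""
    positives = tp + fn   # all metamorphic files
    negatives = tn + fp   # all normal files
    return {
        "accuracy": (tp + tn) / (positives + negatives),
        "TPR": tp / positives,   # detection rate
        "FNR": fn / positives,   # missed malware
        "FPR": fp / negatives,   # normal files flagged as malware
        "TNR": tn / negatives,
    }


# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
```

TPR and FNR always sum to 1, as do TNR and FPR, so reporting one of each pair is sufficient.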
For further comparison with the work of Alam et al. [1], 359 new normal files with sizes ranging from 1 to 10 MB were added. Thus, the dataset contained 2330 normal files and 1020 metamorphic files with a total size of 3.9 GB, which were classified using 5-fold cross validation in 4.68 s.
Experiments were repeated with different training and testing dataset ratios, as shown in Figure 4, where the values of accuracy, FPR, FNR, TPR, and TNR are illustrated. The accuracy of the classifier is high when the percentage of metamorphic files in the training dataset ranges between 10% and 50%. Furthermore, the accuracy decreases when the share of metamorphic files in the training dataset exceeds 50%. This reduction in accuracy may be attributed to the imbalance problem caused by having fewer normal files than metamorphic ones.
In the rest of the experiments, the classifier was trained with 10% metamorphic files (102 files) and 90% normal files (918 files). These ratios of malware and normal files were chosen to resemble the real-life situation, in which malware makes up less than 10% of files [13]. The testing data set contained the remaining 90% of the metamorphic files (918 files) and 1412 normal files. The accuracy is found to be 99.7% when the RB kernel function is used.
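The file counts quoted above can be checked with simple arithmetic. The sketch below only reproduces the numbers stated in the text (1020 metamorphic and 2330 normal files in total); the variable names are illustrative.

```python
TOTAL_METAMORPHIC = 1020
TOTAL_NORMAL = 2330

# Training set: 10% of the metamorphic files, padded with normal files
# so that metamorphic files form 10% of the training set.
train_metamorphic = TOTAL_METAMORPHIC // 10   # 102
train_normal = 9 * train_metamorphic          # 918

# Testing set: everything that was not used for training.
test_metamorphic = TOTAL_METAMORPHIC - train_metamorphic  # 918
test_normal = TOTAL_NORMAL - train_normal                 # 1412

train_share = train_metamorphic / (train_metamorphic + train_normal)  # 0.1
```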
[Figure 3: six panels plotting Acc., AUC, FNR, FPR, TNR, and TPR against the number of features (0 to 1500) for the RB, LN, and PLY kernels]
Figure 3. Testing dataset results for different numbers of features and different kernel functions.
[Figure 4: two panels plotting accuracy (Acc.), FNR, FPR, TNR, and TPR against the percentage of metamorphic files in the training dataset]
Figure 4. Accuracy, FPR, TPR, FNR, TNR of the testing dataset.
5.2. Comparison with Related Works
The performance of the proposed metamorphic malware detection method was compared with the results of Alam et al. [1] using metamorphic files generated by the same kit [28]. In the experiment, NGVCK was used to generate highly metamorphic viruses [29]. A data set of identical size to that used in [1] (1020 metamorphic files and 2,330 normal files) was evaluated through 5-fold cross validation. The results were compared with several results in the literature, as summarized in Table 2. The proposed technique achieved a very high detection rate. The results presented in [4, 15, 30] showed better accuracy than the present method and [1], possibly because of their limited dataset sizes.
Table 2 compares the performance of our proposed method with some existing works. The high detection rate (DR, equivalent to TPR) achieved by the proposed method is a clear indication that it detects major metamorphic malware such as NGVCK, which is not possible to detect using the other methods [9]. Furthermore, the use of small data sets in other methods [4, 15, 30] makes their detection rates somewhat higher. The extremely low FPR of our method may be ascribed to the feature selection by Snort sub-signatures, which can select informative metamorphic features, and to the effective implementation of the radial-basis SVM.
Table 2. Comparison with related works.

| System | Analysis | DR | FPR | Malware/Normal | Platform |
|---|---|---|---|---|---|
| Proposed method | Static | 99.6% | 0.3% | 1020/2330 | Win&Linux64 |
| Opcode-Histogram [4] | Static | 100% | 0% | 60/40 | Win&Linux32 |
| Opcode-histogram [14] | Static | 99.5% | 1.3% | 1090/921 | Win32 |
| SWOD-CFWeight [1] | Static | 94.69% | 10.59% | 1020/2330 | Win&Linux64 |
| Opcode-Graph [15] | Static | 100% | 1% | 200/41 | Win&Linux32 |
| Opcode-SD [5] | Static | ~98% | ~0.5% | 800/40 | Linux32 |
| Chi-Squared [29] | Static | ~98% | ~2% | 200/40 | Win&Linux32 |
| Opcode-HMM [30] | Static | 100% | 0% | 200/40 | Win&Linux32 |
| Opcode-PHMM [9] | Static | ~100% | - | 240/70 | Win32 |

5.3. Comparison with Anti-Virus Scanners
The effectiveness of the proposed technique was evaluated by comparing its results against commercial anti-viruses in detecting metamorphic files. The anti-virus tools were tested with an identical dataset of metamorphic files. The detection rates are shown in Table 3. The best detection rate among the anti-virus tools is obtained with Kaspersky, which is able to detect MPCGEN correctly and several G2 files. However, it could not detect any NGVCK viruses that were generated by the kit [28]. Thus, our method can detect complex metamorphic malware types that remain unrecognized by commercial anti-viruses. This high level of detection capability is ascribed to the combined effects of term frequency feature extraction and masking with known malware sub-signatures for accurate metamorphic classification.
Table 3. Comparison with commercial anti-viruses.

| Anti-virus | Detection Rate |
|---|---|
| Proposed method | 99.69% |
| Kaspersky | 30.68% |
| AVG | 17.35% |
| Comodo | 14.80% |
| Avast | 6.96% |
| Avira | 6.17% |
6. Conclusion
We proposed an effective technique to detect metamorphic malware with high accuracy and a low false positive rate by combining n-gram Snort signatures and SVM. Metamorphic malware executable detection is achieved using only 500 n-gram features of Snort-masked sub-signatures. The proposed method achieves higher accuracy and a lower false positive rate than various AVs. It improves the metamorphic detection accuracy remarkably; however, the proposed method cannot detect some metamorphic files if they have mutated several times by substituting instructions with equivalent instructions. Since the detection of some malware types in networks is very arduous [31], the present method can be extended to NIDS to detect metamorphic malware in network traffic with real traffic traces, especially when implemented as a hardware system, for example using a field-programmable gate array (FPGA).
Acknowledgment
The first author would like to thank the Ministry of Higher Education and Scientific Research, Iraq, for providing a doctoral scholarship for her study.
References
[1] S. Alam, R. N. Horspool, I. Traore, and I. Sogukpinar, “A framework for metamorphic malware analysis and real-time detection,” Computers & Security, vol. 48, pp. 212–233, 2015.
[2] A. A. E. Elhadi, M. A. Maarof, B. I. Barry, and H. Hamza, “Enhancing the detection of metamorphic malware using call graphs,” Computers & Security, vol. 46, pp. 62–78, 2014.
[3] G. Canfora, A. N. Iannaccone, and C. A. Visaggio, “Static analysis for the detection of metamorphic computer viruses using repeated-instructions counting heuristics,” Journal of Computer Virology and Hacking Techniques, pp. 1–17, 2013.
[4] B. B. Rad, M. Masrom, and S. Ibrahim, “Opcodes histogram for classifying metamorphic portable executables malware,” in International Conference on e-Learning and e-Technologies in Education (ICEEE). IEEE, 2012, pp. 209–213.
[5] G. Shanmugam, R. M. Low, and M. Stamp, “Simple substitution distance and metamorphic detection,” Journal of Computer Virology and Hacking Techniques, vol. 9, no. 3, pp. 159–170, 2013.
[6] “Snort.” [Online]. Available: https://www.snort.org/
[7] “Bro.” [Online]. Available: https://www.bro.org/
[8] P. Li, M. Salour, and X. Su, “A survey of internet worm detection and containment,” Communications Surveys & Tutorials, IEEE, vol. 10, no. 1, pp. 20–35, 2008.
[9] S. Attaluri, S. McGhee, and M. Stamp, “Profile hidden Markov models and metamorphic virus detection,” Journal in Computer Virology, vol. 5, no. 2, pp. 151–169, 2009.
[10] I. Santos, F. Brezo, X. Ugarte-Pedrero, and P. G. Bringas, “Opcode sequences as representation of executables for data-mining-based unknown malware detection,” Information Sciences, vol. 231, pp. 64–82, 2013.
[11] J. Z. Kolter and M. A. Maloof, “Learning to detect and classify malicious executables in the wild,” The Journal of Machine Learning Research, vol. 7, pp. 2721–2744, 2006.
[12] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, “Data mining methods for detection of new malicious executables,” in Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on. IEEE, 2001, pp. 38–49.
[13] A. Shabtai, R. Moskovitch, C. Feher, S. Dolev, and Y. Elovici, “Detecting unknown malicious code by applying classification techniques on opcode patterns,” Security Informatics, vol. 1, no. 1, pp. 1–22, 2012.
[14] D. K. Mahawer and A. Nagaraju, “Metamorphic malware detection using base malware identification approach,” Security and Communication Networks, vol. 7, no. 11, pp. 1719–1733, 2014.
[15] N. Runwal, R. M. Low, and M. Stamp, “Opcode graph similarity and metamorphic detection,” Journal in Computer Virology, vol. 8, no. 1-2, pp. 37–52, 2012.
[16] A. H. Sung and S. Mukkamala, “The feature selection and intrusion detection problems,” in Advances in Computer Science-ASIAN 2004. Higher-Level Decision Making. Springer, 2005, pp. 468–482.
[17] G. Varghese, J. A. Fingerhut, and F. Bonomi, “Detecting evasion attacks at high speeds without reassembly,” ACM SIGCOMM Computer Communication Review, vol. 36, no. 4, pp. 327–338, 2006.
[18] I. Ismail, M. N. Marsono, B. M. Khammas, and S. M. Nor, “Incorporating known malware signatures to classify new malware variants in network traffic,” International Journal of Network Management, vol. 25, no. 6, pp. 471–489, 2015.
[19] J. Clemens, “Automatic classification of object code using machine learning,” Digital Investigation, vol. 14, pp. S156–S162, 2015.
[20] P. Wang and Y.-S. Wang, “Malware behavioural detection and vaccine development by using a support vector model classifier,” Journal of Computer and System Sciences, vol. 81, no. 6, pp. 1012–1026, 2015.
[21] V. N. Vapnik and V. Vapnik, Statistical learning theory. Wiley New York, 1998, vol. 1.
[22] C.-W. Hsu, C.-C. Chang, C.-J. Lin et al., “A practical guide to support vector classification,” 2003.
[23] Y. Chen, Y. Li, X.-Q. Cheng, and L. Guo, “Survey and taxonomy of feature selection algorithms in intrusion detection system,” in Information Security and Cryptology. Springer, 2006, pp. 153–167.
[24] “Weka.” [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/downloading.html
[25] M. Shankarpani, K. Kancherla, R. Movva, and S. Mukkamala, Computational Intelligent Techniques and Similarity Measures for Malware Classification. Springer, 2012, pp. 215–236.
[26] “Cygwin.” [Online]. Available: http://cygwin.com/
[27] D. Sayali, P. Younghee, and S. Mark, “Eigenvalue analysis for metamorphic detection,” Springer, 2013.
[28] “Ngvck.” [Online]. Available: http://vxheaven.org/vx.php?id=tn02
[29] A. H. Toderici and M. Stamp, “Chi-squared distance and metamorphic virus detection,” Journal of Computer Virology and Hacking Techniques, vol. 9, no. 1, pp. 1–14, 2013.
[30] W. Wong and M. Stamp, “Hunting for metamorphic engines,” Journal in Computer Virology, vol. 2, no. 3, pp. 211–229, 2006.
[31] T.-F. Yen and M. K. Reiter, “Traffic aggregation for malware detection,” in Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2008, pp. 207–227.