International Journal of Informatics and Communication Technology (IJ-ICT)
Vol. 14, No. 2, August 2025, pp. 737-750
ISSN: 2252-8776, DOI: 10.11591/ijict.v14i2.pp737-750

Malware detection using Gini, Simpson diversity, and Shannon-Wiener indexes

Yeong Tyng Ling1, Kang Leng Chiew1, Piau Phang1, Xiaowei Zhang2
1 Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, Kota Samarahan, Malaysia
2 Faculty of Biomedical Engineering, Chengde Medical University, Chengde, China

Article Info
Article history:
Received Aug 20, 2024
Revised Nov 19, 2024
Accepted Dec 15, 2024

Keywords:
Gini coefficient
Malware detection
MLP
Shannon-Wiener
Simpson diversity
XGBoost

ABSTRACT
The increasing number of malware attacks poses a significant challenge to cybersecurity. This paper proposes a methodology for static malware analysis using three biodiversity-inspired metrics, namely the Gini coefficient, Simpson diversity, and the Shannon-Wiener index, for malware detection. These metrics are used to build a structural feature representation of the raw binary file, which serves as the feature space. The effectiveness of these metrics is evaluated using multilayer perceptron (MLP) neural network and extreme gradient boosting (XGBoost) models. A deterministic algorithm is used to generate these features, which represent the feature signature of the executable file. Additionally, we investigated the effectiveness of different chunk sizes (in bytes) as the input feature for these two classifiers. According to the results, the Gini coefficient with a chunk size of 128 achieved an average F1 score of more than 98.7% using the XGBoost model.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Yeong Tyng Ling
Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak
Kota Samarahan, Sarawak, Malaysia
Email: ytling@unimas.my

1. INTRODUCTION
Malware attacks are one of the most significant and prevailing issues in information security. According to [1], there has been an increase in malicious tasks since Q1 2024. Hackers use malicious software to cause harm to a computer or its users in the form of viruses, worms, rootkits, keyloggers, trojan horses, ransomware, and spyware. Traditional commercial anti-malware tools that use signature-based detection are infamously inefficient when faced with newly launched (a.k.a. "zero-day") malware. Essentially, this method extracts unique byte sequences, which define the malware's signature, from the file contents of previously seen malware. However, this method is time-consuming and costly since it requires newly extracted signatures to be compared against large databases of malicious signatures [2]. It also needs periodic updates since malware writers are constantly developing new code to thwart detection. Hence, advanced protection technology using machine learning (ML) is needed.

There are two methods for malware analysis: dynamic and static analysis. In dynamic analysis [3]–[5], malware features such as runtime API or system call traces are generated by executing a malware file and observing its behavior in a controlled environment, e.g. a sandbox, to prevent infection and spreading during analysis. In static analysis [6]–[8], malware features such as n-grams, image representations, and opcodes are generated without executing the malware file. Table 1 shows a summary of the most recent studies related to our work.

Journal homepage: http://ijict.iaescore.com

There exist several compelling works that used ML models for malware detection and classification. These works differ mainly in the algorithms and the types of analysis used. For dynamic analysis, the study in [3] evaluated eight machine learning algorithms for malware detection through analysis of the frequency of Windows API system function calls. The authors observed the behavior of malware in an isolated environment by using Cuckoo [9]. The behavioral events reported by Cuckoo were the features fed into the ML models. The authors also applied the Gini index in the decision tree model. A similar technique was used by Syeda and Asghar [5], who applied both Chi2 and the Gini index to filter and select significant features before feeding them into six ML models.

For static analysis, studies such as [6] incorporated a word embedding technique on opcode sequence features with long short-term memory (LSTM) networks for malware classification. They used the 30 most frequent opcodes extracted after disassembling the executable files of 20 different malware families. Other studies such as [7], [8] compared model performance for malware classification and showed promising results using extreme gradient boosting (XGBoost). The hybrid work in [4] conducted both static and dynamic malware analysis with different ML models. They obtained the best detection accuracy of 91.9% on the static analysis dataset and 96.4% on the dynamic analysis dataset by using the XGBoost algorithm. Their study indicates that combining static and dynamic analysis with ML is an effective approach for identifying malware. Their results show that the efficacy of an ML model depends on the respective algorithm and the type of data on which the model is built.

Table 1. Malware detection related studies
Reference  Analysis  Feature              Approach             Dataset                  Size       Accuracy
[3]        Dynamic   API                  MLs                  Malpedia                 7400       99.50%
[5]        Dynamic   API                  Random forest        MalwareBazaar            582        96.00%
[10]       Static    Aggregation metrics  ELM                  APK                      600        82.50%
[8]        Static    String               XGBoost              EMBER                    5000K      98.50%
[6]        Static    Opcode               LSTM, CNN            Malicia; previous study  25901      81.00%
[7]        Static    Entropy, Gini        MLs, neural network  VirusShare               938        92.17%
[11]       Static    Image                CNNs, ELMs           MalImg                   9300       97.70%
[12]       Static    Image                CNN                  Malimg                   9435       98.82%
[13]       Static    Image                CNN                  Malimg                   9389       97.32%
[14]       Static    Entropy, image       SNN                  Andro-Dumpsys            906        91.20%
[4]        Both      Function, API        XGBoost              VirusShare               2747-2937  96.48%

Lately, deep learning has been gaining popularity due to its superior accuracy when trained with large amounts of data. Neural networks are a subset of ML and the heart of deep learning algorithms. Many studies have utilized neural networks with added convolution and pooling layers. The studies [12]–[14] used variants of convolutional neural network (CNN) models with image-based and other types of feature representation for malware detection. To address Android malware prediction, [10] studied the patterns of intermediate code and source code of an APK file by extracting 16 types of metrics, such as mean, median, Gini index, and entropy. From their empirical study, an extreme learning machine (ELM) with polynomial kernels provides better performance than other ML classifiers.

Regardless of whether ML or deep learning is used as the classifier, feature representation plays a crucial role in malware detection. In static analysis, most malware comes in the form of raw binary executable files. To quantify the raw bytes, [15] introduced diversity indexes to quantify the qualitative value of malware data. The authors used [16] to compute different diversity indexes such as the Shannon index, Simpson, inverse Simpson, and Fisher's log. Their experimental results show that ecological metrics can be used in the malware context to better understand the patterns in malware. Other studies such as [17], [18] adopted mathematical models of biodiversity in ecology for detection. Their studies demonstrated that biodiversity-related metrics can improve the understanding of how diversity affects detection.

Inspired by the above related work, in this paper we explore the effectiveness of structural features based on the Gini index, Simpson diversity, and the Shannon-Wiener index with multilayer perceptron (MLP) neural network and XGBoost models. The rest of this paper is organized as follows. The construction of the feature representation is described in section 2. The detailed results and discussion are presented in section 3. Conclusions and suggested future work are discussed in section 4.

2. METHOD
In this section, the proposed method is presented. A general flow of the proposed approach is shown in Figure 1.

Figure 1. General architecture flow

2.1. Input files
The input files used in this study were in Windows executable format. A total of 7,852 binary files were collected as the dataset. Table 2 lists information about the malware families used in this study. These malware samples were collected from [19]. As for the corresponding benign files, we randomly selected 1,000 Windows applications with sizes of 0.01 to 94 MB from https://download.cnet.com. Figure 2 shows the distribution of the malware families in the selected dataset.

Table 2. Malware information
Family      Type     Size (MB)     Samples
Bho         Trojan   0.005 - 16.0  1,391
Ceeinject   Virtool  0.004 - 7.67  1,077
Fakerean    Rogue    0.003 - 21.9  1,016
Winwebsec   Rogue    0.325 - 0.60  1,023
Zbot        Trojan   0.031 - 0.37  1,039
Zeroaccess  Trojan   0.048 - 0.28  1,306

Figure 2. Malware distribution used in the experiment (see online version for colors)

2.2. File splitting
In this step, an input file was split into a series of chunks. To achieve this, the technique in [20] was adopted for generating the proposed structural feature representation by splitting the entire file into fixed-size chunks of bytes. A chunk is a string of non-overlapping consecutive bytes, where each chunk contains the same number of bytes. To do this, a common file length, F, and therefore the number of chunks, N, have to be determined.
We fixed the file length, counted in chunks, to be a power of 2, i.e. $N = 2^{\alpha}$ for some $\alpha \in \mathbb{N}$:

$$\alpha = \left\lceil \log \frac{\min\{\mathrm{median}(M),\ \mathrm{median}(B)\}}{c} \right\rceil$$

For convenience, the steps to determine α are restated here as follows:
− Step 1: compute the median size of the group of malware files, M, and of the group of benign files, B. Here, differing from [20], the median is used as it usually provides a better measure of the central tendency of the file sizes.
− Step 2: determine the minimum of these two median sizes.
− Step 3: divide the minimum median size by the chunk size, say c = 256 bytes; this gives D.
− Step 4: find the base-2 logarithm of the value from the previous step and take the largest whole integer.
− Step 5: if the whole integer in the previous step is not a power of 2, reduce the D in step 3 by 1 and repeat step 4 until the condition is met.
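
Steps 1 to 4 can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `chunk_exponent` and the `round_up` switch are hypothetical, the rounding direction in step 4 is an assumption (it is exposed as a parameter because "largest whole integer" can be read either way), and step 5 is left out.

```python
import math
from statistics import median


def chunk_exponent(malware_sizes, benign_sizes, chunk_size=256, round_up=True):
    """Return (alpha, N) where N = 2**alpha is the number of chunks per file.

    File sizes and chunk_size are in bytes. `round_up` is an assumption of
    this sketch: True rounds the logarithm up, False rounds it down.
    """
    # Steps 1-2: the smaller of the two group medians.
    smallest_median = min(median(malware_sizes), median(benign_sizes))
    # Step 3: number of chunks that fit into a file of that size (D).
    d = smallest_median / chunk_size
    # Step 4: base-2 logarithm, rounded to a whole number.
    log_d = math.log2(d)
    alpha = math.ceil(log_d) if round_up else math.floor(log_d)
    return alpha, 2 ** alpha


# Hypothetical file sizes in bytes, for illustration only.
alpha, n_chunks = chunk_exponent([100_000, 300_000, 500_000],
                                 [200_000, 400_000, 900_000])
```

With these made-up sizes the smaller median is 300,000 bytes, so D is about 1,172 and its base-2 logarithm is about 10.19.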

Different chunk sizes, namely 128, 256, 512, 1,024, and 2,048 bytes, were examined in this study. For convenience, the sliding window used for file splitting has the same length as the chunk size, so the chunks do not overlap. These chunks provide granular variations and represent the structure of a file.
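
A minimal sketch of the splitting step is given below. The helper name `split_into_chunks` is hypothetical, and dropping a trailing partial chunk is an assumption of this sketch; the text does not say how a remainder shorter than the chunk size is handled.

```python
from typing import List


def split_into_chunks(data: bytes, chunk_size: int = 256) -> List[bytes]:
    """Split `data` into consecutive, non-overlapping chunks of equal size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Assumption: a trailing remainder shorter than chunk_size is discarded
    # so that every chunk holds exactly chunk_size bytes.
    usable = len(data) - len(data) % chunk_size
    # Stepping by chunk_size makes the window slide without overlap.
    return [data[i:i + chunk_size] for i in range(0, usable, chunk_size)]


# Ten bytes split into chunks of four bytes: two full chunks, remainder dropped.
example_chunks = split_into_chunks(bytes(range(10)), chunk_size=4)
```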

2.3. Feature generation
Based on [20], once the number of chunks, N, has been determined, a deterministic algorithm using the Procrustean notion is adopted to choose evenly spread chunks from each file to produce a vector of N chunks in order. In other words, the number of chunks for a file is either reduced to or increased to N. An example is provided here for illustration. Given two files, P and Q, and a chunk size of c, with length(P) = 10c and length(Q) = 7c, there are 10 chunks for P and 7 chunks for Q. Suppose that α = 3 is chosen; then $N = 2^{\alpha} = 8$ chunks. Since P has more chunks than N, it needs to be reduced from 10 to 8 chunks, and since Q has fewer chunks than N, it needs to be increased from 7 to 8 chunks. To choose these chunks, a subset of the current chunks is generated for each file using a jump factor. The chunk index is initially set to 0, and it is incremented in every step by $inc_1 = 9/7 \approx 1.29$ for P and $inc_2 = 6/7 \approx 0.86$ for Q. The indices are selected using the floor of the accumulated jump value, so the chosen indices will be:

$$I_P = (0, 1, 2, 3, 5, 6, 7, 9)$$
$$I_Q = (0, 0, 1, 2, 3, 4, 5, 6)$$

These indices give the locations of the chunks needed to build the structural feature representation of a file.
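
The index selection in this example can be reproduced with the short sketch below. The function name `select_chunk_indices` is hypothetical, and the general jump factor (n - 1) / (N - 1) is inferred from the two increments 9/7 and 6/7 in the example rather than stated in the text. Integer arithmetic is used for the floor so that floating-point drift cannot shift an index.

```python
from typing import List


def select_chunk_indices(n_chunks: int, target: int) -> List[int]:
    """Pick `target` evenly spread chunk indices from a file with `n_chunks` chunks.

    Indices repeat when the file has fewer chunks than `target` and skip
    when it has more, so every file ends up with exactly `target` entries.
    """
    if n_chunks < 1 or target < 2:
        raise ValueError("need at least one chunk and a target of two or more")
    # k-th index = floor(k * jump) with jump = (n_chunks - 1) / (target - 1),
    # computed with integer division to stay exact.
    return [(k * (n_chunks - 1)) // (target - 1) for k in range(target)]


indices_p = select_chunk_indices(10, 8)  # file P: 10 chunks reduced to 8
indices_q = select_chunk_indices(7, 8)   # file Q: 7 chunks increased to 8
```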

Three biodiversity-inspired metrics were adopted in this study to build the structural feature representation on these chunks, namely the Gini coefficient, Simpson diversity, and Shannon-Wiener.

2.3.1. Gini coefficient
The Gini coefficient, also known as the Gini index [21] and named after the Italian statistician Corrado Gini, is a way to measure statistical dispersion or inequality, especially in economics and ecology [22]. The Gini coefficient is defined as:

$$\mathit{gini} = \frac{t}{b^{2} \cdot a} \tag{1}$$

where t is the sum of the differences among the elements of a list, b is the length of the list, and a is the mean value of that list.
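
Equation (1) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: t is taken to be the sum of absolute differences over every unordered pair of elements, and a chunk whose mean is zero is assigned a value of 0.0 to avoid division by zero. The function name `gini` is chosen for this sketch.

```python
def gini(values) -> float:
    """Gini coefficient t / (b**2 * a) of a sequence of non-negative numbers."""
    values = list(values)
    b = len(values)                 # length of the list
    if b == 0:
        raise ValueError("empty input")
    a = sum(values) / b             # mean value of the list
    if a == 0:
        return 0.0                  # assumption: an all-zero chunk has no dispersion
    # Assumption: t sums |x - y| once for each unordered pair (x, y).
    t = sum(abs(x - y) for i, x in enumerate(values) for y in values[i + 1:])
    return t / (b ** 2 * a)


# A bytes object iterates as integers, so a chunk can be passed directly.
uniform = gini(b"\x01\x01\x01\x01")   # all bytes equal
skewed = gini(b"\x00\x00\x00\x04")    # one byte carries all the mass
```

The pairwise sum is quadratic in the chunk length, which is acceptable for a sketch at the chunk sizes considered here.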

2.3.2. Simpson diversity
The Simpson diversity index [23] was introduced by Edward H. Simpson to measure the probability that two samples belong to the same group. The value of Simpson diversity ranges from 0 to 1, with 0 representing large diversity and 1 representing no diversity. The formula is given as:

$$D = \frac{\sum_{i=1}^{R} n_i (n_i - 1)}{N (N - 1)} \tag{2}$$

where R is the number of groups, $n_i$ is the number of individuals in group i, and N is the total number of individuals in a sample.
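
A minimal sketch of equation (2) applied to one chunk is shown below. The mapping onto bytes is an assumption of this sketch: each distinct byte value is treated as a group and each byte in the chunk as an individual. The function name `simpson_index` is hypothetical.

```python
from collections import Counter


def simpson_index(chunk: bytes) -> float:
    """Simpson index D = sum(n_i * (n_i - 1)) / (N * (N - 1)) of a chunk."""
    total = len(chunk)              # N: number of bytes in the chunk
    if total < 2:
        raise ValueError("the index needs at least two bytes")
    counts = Counter(chunk)         # n_i: occurrences of each byte value
    return sum(n * (n - 1) for n in counts.values()) / (total * (total - 1))


no_diversity = simpson_index(b"aaaa")    # a single repeated byte value
max_diversity = simpson_index(b"abcd")   # every byte value is different
```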

2.3.3. Shannon-Wiener
The Shannon-Wiener index [24] was developed from information theory and is based on measuring uncertainty. The computational formula is:

$$H' = \frac{N \ln N - \sum (n_i \ln n_i)}{N} \tag{3}$$

where N is the total number of individuals and $n_i$ is the number of individuals in group i.

Each metric, which is used to quantify the byte randomness in a chunk, defines the N-vector structural feature of a file. Thus, for a given executable file, three different types of structural features were generated.
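
Equation (3) and the construction of the feature vector can be sketched as follows. As before, treating distinct byte values as groups and bytes as individuals is an assumption of this sketch, and the names `shannon_wiener` and `feature_vector` are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def shannon_wiener(chunk: bytes) -> float:
    """Shannon-Wiener index H' = (N ln N - sum(n_i ln n_i)) / N of a chunk."""
    total = len(chunk)              # N: number of bytes in the chunk
    if total == 0:
        raise ValueError("empty chunk")
    counts = Counter(chunk)         # n_i: occurrences of each byte value
    weighted = sum(n * math.log(n) for n in counts.values())
    return (total * math.log(total) - weighted) / total


def feature_vector(chunks: Sequence[bytes],
                   metric: Callable[[bytes], float]) -> List[float]:
    """Apply one diversity metric to every selected chunk, keeping the order."""
    return [metric(chunk) for chunk in chunks]


vector = feature_vector([b"aaaa", b"abcd"], shannon_wiener)
```

Passing a different metric function to `feature_vector` yields the other two feature types from the same list of chunks.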

2.4. Classifiers
Next, each type of structural feature generated in the previous step was fed into two selected classifiers, namely an MLP neural network and XGBoost. All the experiments were conducted under a Windows 10 64-bit operating system using the Python programming language (scikit-learn library). A 10-fold cross validation was performed to estimate the generalization performance of the proposed approach. The prediction performance was measured in terms of accuracy rate, area under the curve (AUC), and F1 score.
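
For reference, accuracy and F1 score can be computed from binary predictions as in the sketch below. It is a plain-Python illustration rather than the authors' evaluation code; the label convention (1 for malware, 0 for benign) and the function name `accuracy_and_f1` are assumptions. In practice, scikit-learn's `accuracy_score` and `f1_score` provide the same quantities.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels where 1 marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("no predictions given")
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # malware caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # benign passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # benign flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # malware missed
    accuracy = (tp + tn) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


acc, f1 = accuracy_and_f1([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```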

2.4.1. Multilayer perceptron
Based on the layer construction of MLP in [25], a 3-layer MLP model is constructed with one input layer, two hidden layers, and an output layer. After several experiments using 3 and 4 layers, we decided to use the 3-layer MLP model as it gave better performance than the 4-layer model. In this study, the first layer (an input layer implemented as a linear layer) has N nodes, where N is the number of chunks generated during the file splitting step. The activation function was set to the rectified linear unit (ReLU) to suppress negative values. This process is repeated for another hidden layer, with half the number of nodes of the previous layer. The last layer uses a sigmoid function, which allows binary prediction. Additionally, dropout layers with a rate of 0.2 are added between hidden layers to prevent overfitting. The model was trained with forward passes and backpropagation of errors for 50 epochs. The learning rate is 0.01, with binary cross entropy (BCE) as the loss function and stochastic gradient descent as the optimizer.
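
To make the output layer and the loss concrete, the sketch below shows a sigmoid activation and the BCE loss in plain Python. It illustrates the two formulas only and does not reproduce the authors' network or training loop; the function names and the clipping constant `eps` are assumptions of this sketch.

```python
import math


def sigmoid(z: float) -> float:
    """Map a raw output score to a probability in (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_loss(y_true, probs, eps: float = 1e-12) -> float:
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)     # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# An undecided model (probability 0.5 for every file) has a loss of ln 2.
undecided_loss = bce_loss([1, 0], [sigmoid(0.0), sigmoid(0.0)])
```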

2.4.2. XGBoost
XGBoost [26] is an implementation of gradient boosted decision trees (GBDTs). XGBoost provides a wrapper class that allows models to be treated like classifiers in the scikit-learn framework in Python. There are many parameters for the XGBoost classifier package. We kept them at their defaults for simplicity and only set objective='binary:logistic'. A 10-fold cross validation was performed on the different chunk sizes (e.g. 128) of the proposed structural feature representation as mentioned in section 2.3.

3. RESULTS AND DISCUSSION
We want to study how effective the proposed structural features are in discriminating malware from benign files in terms of accuracy, AUC, and F1 score. We consider a validation result of at least 90% as a high detection rate. A discussion of the findings follows.

3.1. Results
Figure 3 shows the average time (in seconds) taken to generate the proposed structural feature representations for each malware family with chunk sizes of 128 and 2,048 bytes. As expected, extraction with a smaller chunk size, say 128 bytes, takes longer than with a chunk size of 2,048 bytes. Based on the figure, using the Simpson index to generate the structural feature is the fastest, followed by the Gini coefficient and Shannon-Wiener. It is surprising that, when using the Gini coefficient and Shannon-Wiener, the time taken by the Winwebsec family is longer than that of other families such as Bho, which contains more samples than Winwebsec. One possible conjecture is that the Winwebsec family has file sizes that are much larger than those of the other families.

3.1.1. MLP model
Table 3 shows the comparison of the best F1 score out of the 10-fold cross validation. Bold highlighting indicates the highest score achieved among the six malware families for the respective chunk size. It is observed that the Gini coefficient produced the highest F1 scores on the Winwebsec family except with chunk size 256. With chunk sizes 128, 512, 1,024, and 2,048, the MLP model achieved F1 scores of 99.84%, 99.32%, 99.17%, and 98.75% on Winwebsec, respectively.

Figure 3. Average time of feature generation of the malware family

Table 3. The best F1 score (%) using the MLP model
Chunk size  Family      Gini coefficient  Simpson diversity  Shannon-Wiener
128         Bho         87.51             89.19              88.73
128         Ceeinject   89.37             90.27              92.23
128         Fakerean    87.86             90.44              89.62
128         Winwebsec   99.84             99.00              98.89
128         Zbot        96.93             98.07              97.72
128         Zeroaccess  99.22             99.49              97.71
256         Bho         87.32             89.35              86.10
256         Ceeinject   89.15             90.41              90.60
256         Fakerean    86.07             91.53              86.34
256         Winwebsec   99.36             99.52              98.73
256         Zbot        98.11             98.49              94.90
256         Zeroaccess  98.73             99.25              96.83
512         Bho         89.24             89.26              83.44
512         Ceeinject   88.52             87.77              87.36
512         Fakerean    90.85             91.09              83.24
512         Winwebsec   99.32             99.21              98.05
512         Zbot        96.84             98.13              92.04
512         Zeroaccess  98.57             98.85              91.55
1024        Bho         88.30             87.66              81.50
1024        Ceeinject   84.53             85.46              80.26
1024        Fakerean    90.06             91.00              81.69
1024        Winwebsec   99.17             98.85              95.45
1024        Zbot        96.78             97.52              90.55
1024        Zeroaccess  97.81             98.08              85.39
2048        Bho         87.21             89.35              78.73
2048        Ceeinject   86.07             84.98              64.07
2048        Fakerean    90.63             92.15              83.75
2048        Winwebsec   98.75             97.99              90.59
2048        Zbot        97.11             97.00              85.40
2048        Zeroaccess  96.87             97.86              79.95

Figure 4 depicts a closer look at the performance of two selected malware families, i.e. Winwebsec and Bho. Based on the figure, the Winwebsec family in Figure 4(a) is detected more easily than the Bho family in Figure 4(b) when using the Gini coefficient as the feature representation. The Simpson diversity index achieved stable performance across all chunk sizes. Table 4 shows the accuracy rate based on the best F1 score achieved across the malware families. It measures how often the MLP model correctly predicts the outcome. As the chunk size grows larger, the F1 score decreases when using Shannon-Wiener as the structural feature representation.

Figure 4. The F1 score for (a) Winwebsec and (b) Bho families (see online version for colors)

Table 4. The accuracy rate (%) based on the best F1 score
Chunk size  Family      Gini coefficient  Simpson diversity  Shannon-Wiener
128         Bho         84.54             87.04              86.76
128         Ceeinject   89.08             89.24              91.81
128         Fakerean    87.76             90.08              89.09
128         Winwebsec   99.83             99.01              98.84
128         Zbot        97.05             98.03              97.71
128         Zeroaccess  99.13             99.42              97.25
256         Bho         87.32             86.62              82.86
256         Ceeinject   88.44             89.88              90.20
256         Fakerean    85.45             91.40              86.66
256         Winwebsec   99.34             99.50              98.68
256         Zbot        98.03             98.36              94.93
256         Zeroaccess  98.55             99.13              96.53
512         Bho         86.90             87.46              82.31
512         Ceeinject   87.64             87.47              86.99
512         Fakerean    96.41             91.23              84.29
512         Winwebsec   99.34             99.17              98.02
512         Zbot        96.73             98.03              91.50
512         Zeroaccess  98.41             98.29              89.73
1024        Bho         86.35             85.65              78.83
1024        Ceeinject   83.78             84.10              80.73
1024        Fakerean    89.42             91.23              84.95
1024        Winwebsec   99.17             98.84              95.38
1024        Zbot        96.73             97.38              90.52
1024        Zeroaccess  97.39             97.83              80.92
2048        Bho         84.81             88.02              71.86
2048        Ceeinject   85.87             84.91              70.30
2048        Fakerean    90.74             92.06              84.01
2048        Winwebsec   98.68             98.02              90.28
2048        Zbot        96.89             96.73              84.64
2048        Zeroaccess  96.38             97.68              72.39

It is observed that the accuracy rates are consistent with the F1 scores: the Gini coefficient achieves the highest accuracy rate for the Winwebsec family except with chunk size 256. Among the six families, the Gini coefficient can effectively distinguish malware from benign files for the Winwebsec, Zbot, and Zeroaccess families. This implies that these three malware families can easily be detected using the Gini coefficient, but not the Bho, Ceeinject, and Fakerean families, when compared with the other two structural feature representations. On average, the feature representation using Simpson diversity showed a relatively higher accuracy rate than the Gini coefficient across all the malware families. A closer observation shows that this structural feature representation can achieve more than 90% for four of the families, the exceptions being the Bho and Ceeinject families.
This shows that Simpson diversity demonstrated stronger discrimination for quantifying byte information. Shannon-Wiener was the least effective structural feature representation in this study. The lowest accuracy rate it yields is 71.86% for the Bho family with chunk size 2,048, and the highest accuracy rate it yields is 98.68% for the Winwebsec family with chunk size 256. One can observe that as the chunk size increases, the discrimination power of this feature representation becomes worse.

Figure 5 depicts the AUC performance of the three proposed structural feature representations in Figures 5(a) to 5(c). AUC represents the degree or measure of separability. It can be observed that there is a clear distinction in AUC between certain malware families. For example, both the Gini coefficient and the Simpson index yielded higher AUC for the Winwebsec, Zbot, and Zeroaccess families, but not for the Bho, Ceeinject, and Fakerean families. The performance of all three structural feature representations declines as the chunk size increases.
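
The notion of separability behind AUC can be illustrated with the rank-based sketch below: AUC equals the probability that a randomly chosen malware sample receives a higher score than a randomly chosen benign sample, with ties counted as one half. This is a plain-Python illustration, not the authors' evaluation code; the label convention (1 for malware, 0 for benign) and the function name `auc_score` are assumptions. scikit-learn's `roc_auc_score` computes the same quantity.

```python
def auc_score(y_true, scores) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    # Count positive/negative pairs that are ranked correctly; a tie is half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


perfect = auc_score([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])   # classes fully separated
partial = auc_score([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])   # one pair out of order
```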

Figure 5. AUC performance: (a) Gini coefficient, (b) Simpson index, and (c) Shannon-Wiener

3.1.2. XGBoost model
Table 5 shows the comparison of the best F1 score out of the 10-fold cross validation. Bold highlighting indicates the highest score achieved for the respective chunk size. It is observed that the average precision and recall vary for each feature representation. Using the XGBoost model, the Winwebsec family achieved a 100% F1 score with all three proposed structural feature representations across all chunk sizes. The Zeroaccess family, similar to the Winwebsec family, also reached an F1 score of 100% for all the chunk sizes except chunk size 1,024. In terms of feature performance, Shannon-Wiener achieved the highest F1 score most often, followed by the Gini coefficient and Simpson diversity. However, its average F1 score is 98.26%, while the Gini coefficient and Simpson diversity yielded average F1 scores of 98.63% and 98.27%, respectively. In terms of chunk size, Shannon-Wiener and Simpson diversity demonstrated their discriminative power mostly with chunk sizes 256 and 512, respectively. As for the Gini coefficient, it demonstrated its discriminative power mostly with chunk sizes 128 and 1,024. Figure 6 depicts a closer look at the performance of two malware families, i.e. Winwebsec and Bho. Based on the figure, the Winwebsec family in Figure 6(a) is also detected more easily than the Bho family in Figure 6(b) using all three proposed structural feature representations. However, for the Bho family, the Shannon-Wiener index performed best only with chunk sizes 256 and 1,024. The Gini coefficient and the Simpson index produced the best results with chunk sizes 512 and 2,048, respectively.

Table 5. The F1 score (%) of the best model performance using the XGBoost model
Chunk size  Family      Gini coefficient  Simpson diversity  Shannon-Wiener
128         Bho         95.53             94.94              95.20
128         Ceeinject   98.24             97.52              97.41
128         Fakerean    99.09             98.13              97.69
128         Winwebsec   100.0             100.0              100.0
128         Zbot        99.54             99.53              100.0
128         Zeroaccess  100.0             100.0              100.0
256         Bho         96.24             96.84              97.57
256         Ceeinject   96.96             97.41              96.96
256         Fakerean    98.21             97.32              95.32
256         Winwebsec   100.0             100.0              100.0
256         Zbot        99.09             99.50              100.0
256         Zeroaccess  100.0             100.0              100.0
512         Bho         97.27             97.14              96.86
512         Ceeinject   95.37             95.85              95.85
512         Fakerean    97.32             97.67              97.75
512         Winwebsec   100.0             100.0              100.0
512         Zbot        99.54             100.0              99.50
512         Zeroaccess  100.0             100.0              100.0
1024        Bho         96.86             96.50              96.88
1024        Ceeinject   95.39             96.10              94.68
1024        Fakerean    98.65             97.65              98.21
1024        Winwebsec   100.0             100.0              100.0
1024        Zbot        99.06             99.38              99.50
1024        Zeroaccess  100.0             99.28              99.63
2048        Bho         96.00             97.16              96.52
2048        Ceeinject   94.49             93.10              94.82
2048        Fakerean    98.64             97.65              98.18
2048        Winwebsec   100.0             100.0              100.0
2048        Zbot        99.09             99.49              99.50
2048        Zeroaccess  99.59             100.0              100.0

Figure 6. F1 score comparison between (a) Winwebsec and (b) Bho families (see online version for colors)

Table 6 shows the accuracy rate based on the best F1 score across the malware families. Bold highlighting indicates the highest rate achieved among the malware families for the respective chunk size. It is observed that Shannon-Wiener and the Gini coefficient most often yield the higher accuracy rates across the malware families. Figure 7 depicts the AUC performance of the three structural feature representations, as shown in Figures 7(a) to 7(c). Here, it is clear that the Gini coefficient produced higher AUC for most of the chunk sizes except chunk size 512. On the other hand, the Simpson diversity and Shannon-Wiener features performed better with chunk size 512. Shannon-Wiener yields high detection rates for the Winwebsec, Zbot, and Zeroaccess families with chunk size 2,048, that is, 100%, 99.01%, and 100%, respectively.

Table 6. The accuracy rate (%) of the best model based on the F1 score
Chunk size  Family      Gini coefficient  Simpson diversity  Shannon-Wiener
128         Bho         95.53             94.94              95.20
128         Ceeinject   98.24             97.52              97.41
128         Fakerean    99.09             98.13              97.69
128         Winwebsec   100.0             100.0              100.0
128         Zbot        99.54             99.53              100.0
128         Zeroaccess  100.0             100.0              100.0
256         Bho         96.24             96.84              97.57
256         Ceeinject   96.96             97.41              96.96
256         Fakerean    98.21             97.32              95.32
256         Winwebsec   98.21             97.32              95.32
256         Zbot        100.0             100.0              100.0
256         Zeroaccess  99.09             99.50              100.0
512         Bho         97.27             97.14              96.86
512         Ceeinject   95.37             95.85              95.85
512         Fakerean    97.32             97.67              97.75
512         Winwebsec   100.0             100.0              100.0
512         Zbot        99.54             100.0              99.50
512         Zeroaccess  100.0             100.0              100.0
1024        Bho         96.86             96.50              96.88
1024        Ceeinject   95.39             96.10              94.68
1024        Fakerean    98.65             97.65              98.21
1024        Winwebsec   100.0             100.0              100.0
1024        Zbot        96.06             99.38              99.50
1024        Zeroaccess  100.0             99.28              99.63
2048        Bho         96.00             97.16              96.52
2048        Ceeinject   94.49             93.10              94.82
2048        Fakerean    98.64             97.65              98.18
2048        Winwebsec   100.0             100.0              100.0
2048        Zbot        99.09             99.49              99.50
2048        Zeroaccess  99.59             100.0              100.0

3.2. Discussion
While earlier studies have explored the impact of biodiversity-related metrics, they have not explicitly studied their effectiveness for quantifying binary files. This study investigated the effectiveness of three biodiversity-related metrics, namely the Gini coefficient, Simpson diversity, and Shannon-Wiener, on binary files for malware detection. The validation results from our study suggest that the computation steps to extract and generate the structural feature representation should be considered if performance speed is a concern. In this study, the number of large files in the Winwebsec family may be contributing to the fact that it required more feature generation time than the Bho family. Due to the computational steps in the Gini coefficient, it takes a longer time to generate the structural feature of a file.

Comparing the three structural feature representations based on the F1 score, our study suggests that, on average, the Gini coefficient with XGBoost can be an effective metric to quantify binary files for detection. This may be due to the fact that this metric measures the probability for a value within a chunk of bytes; the type of malware family can also affect the performance. It is unclear why the Shannon-Wiener index produces low performance in the experiment here. It is suspected that the computation of ln causes the diversity value