International Journal of Electrical and Computer Engineering (IJECE)
Vol. 6, No. 2, April 2016, pp. 778~784
ISSN: 2088-8708, DOI: 10.11591/ijece.v6i2.8909
Journal homepage: http://iaesjournal.com/online/index.php/IJECE
A Preliminary Performance Evaluation of K-means, KNN and EM Unsupervised Machine Learning Methods for Network Flow Classification
Alhamza Munther¹, Rozmie Razif¹, Mosleh Abu Alhaj², Mohammed Anbar³, Shahrul Nizam¹
¹ School of Computer and Communication Engineering, Universiti Malaysia Perlis, Perlis, Malaysia
² Dept. of Network and Information Security, Faculty of Information Technology, Al-Ahliyya Amman University, Amman, Jordan
³ National Advanced IPv6 Centre of Excellence, Universiti Sains Malaysia, Penang, Malaysia
Article Info

Article history:
Received Aug 26, 2015
Revised Nov 19, 2015
Accepted Dec 7, 2015

ABSTRACT

Unsupervised learning is a popular method for classifying unlabeled datasets, i.e. datasets without prior knowledge about the data classes. Many unsupervised learning methods are used to inspect and classify network flows. This paper presents an in-depth study of three unsupervised classifiers, namely K-means, K-nearest neighbor and Expectation Maximization. Their methodologies, and how they are employed to classify network flows, are elaborated in detail. The three classifiers are evaluated using three significant metrics, which are classification accuracy, classification speed and memory consumption. K-nearest neighbor delivers the best results for accuracy and memory, while K-means achieves the lowest processing time.

Keyword:
Machine learning
Network traffic classification
Network traffic engineering
Unsupervised learning

Copyright © 2016 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Alhamza Munther,
School of Computer and Communication Engineering,
Universiti Malaysia Perlis, Perlis, Malaysia.
Email: alhamza.munther@gmail.com
1. INTRODUCTION
Network traffic classification plays a significant role in several fields, such as network security, network management, surveillance, etc. It is the process of assigning network traffic to the original application that generated it. The challenges facing this process have increased because emerging new applications have redoubled the size of the data [1]. Port-based classification is one of the first techniques used for traffic classification. However, this technique is no longer used, since it is easy to masquerade by reusing the well-known ports of some applications for other applications. For example, some VoIP applications use port 23, which is allocated by IANA to the Telnet protocol [2], [3]. Payload-based and signature-based classification [4] are two alternative methods used for traffic classification. Unfortunately, both approaches suffer from high memory consumption and long processing times. In addition, they fail to classify encrypted packets accurately. Behavior-based classification [5] is another method used for traffic classification. However, it fails in real-time and online classification evaluation. As can be seen, the aforementioned methods suffer from many problems, which compelled researchers to suggest a new approach for traffic classification, namely machine learning.
Machine learning [6] appears to be a suitable solution since it is powerful for automation, identification, and prediction. Basically, machine learning can be classified into supervised and unsupervised learning [7], [8], [9]. Supervised learning classifies a dataset with prior knowledge about the resulting classes. In contrast, unsupervised learning has the potential to classify a dataset without knowledge about the resulting class. In this paper, we evaluate and compare three popular unsupervised classification methods, namely k-means, k-nearest neighbor, and
expectation maximization. The three methods are evaluated in terms of classification accuracy, classification speed and memory consumption.
This paper is organized as follows. Section 2 introduces the three classification methods and shows how each method is employed in data classification. Section 3 evaluates and compares the results of the three methods and discusses them. Finally, Section 4 presents the conclusion.
2. NETWORK TRAFFIC CLASSIFICATION METHODS
This section reveals the approaches of three unsupervised classification methods and how they are employed to classify network flows. The procedure of each one is explained in detail in order to fully understand these methods.
2.1. K-means Clustering
Bernaille et al. [10] proposed using the K-means clustering unsupervised learning method to classify network flows by categorizing a dataset into a definite number of clusters (assume k clusters) fixed a priori. The key idea is to select centroids randomly, one for each cluster. Each input is represented as a coordinate vector built from its feature values, so the dataset consists of a group of points; each point is allocated to the closest centroid, and each group of points allocated to a centroid is a cluster, where closeness is determined by a distance measure. The centroid of each cluster is then updated based on the points allocated to that cluster. Network flows are represented by points in a P-dimensional space (a dimension refers to a feature such as packet size), where each packet is linked with a dimension; the coordinate on dimension p is the size of packet p in the flow. The procedure is repeated, updating the assignments and centroids, until the clusters no longer change, or equivalently, until the centroids remain the same. Figure 1 simply shows the steps of the K-means idea [11].
Figure 1. Key steps of the K-means method [11]
The similarity between flows is represented by measuring the distance between each point in a cluster and the centroid, which is calculated using the Euclidean distance as formulated in equation (1). The K-means method attempts to find an optimal solution by reducing the square error, which is defined as in equation (2). The square error is calculated as the squared distance between each point (object) and the center of its cluster.
$$dist(x, y) = \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2} \qquad (1)$$

$$E = \sum_{j=1}^{k} \sum_{i=1}^{n} dist(x_i, y_j)^2 \qquad (2)$$
where:
E = objective function
k = number of clusters
n = number of cases (points)
x_i = case i
y_j = centroid for cluster j
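To make the clustering loop described above concrete, the following is a minimal Python sketch of K-means over flow feature vectors, using the Euclidean distance of equation (1) and the square error of equation (2). The feature matrix, the value of k and the random data are illustrative assumptions, not the setup used by the authors.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means: pick random centroids, assign points, update, repeat."""
    rng = np.random.default_rng(seed)
    # Select k centroids randomly, one for each cluster.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Euclidean distance between every point and every centroid (equation 1).
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)  # allocate each point to its closest centroid
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # stop when centroids no longer move
            break
        centroids = new_centroids
    # Square error: sum of squared distances to the assigned centroids (equation 2).
    sse = ((points - centroids[labels]) ** 2).sum()
    return labels, centroids, sse

# Illustrative use: each row is a flow described by the sizes of its first P packets.
flows = np.random.default_rng(1).random((200, 5))  # hypothetical 200 flows, P = 5
labels, centroids, sse = kmeans(flows, k=4)
```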
The results illustrate that more than 80% of all flows are correctly classified for a number of applications. One exceptional case is the POP3 application: the classifier labels 86% of POP3 flows as NNTP and 12.6% as SMTP, because of the clusters that POP3 flows consistently fall into [12]. However, this method fails to classify some applications, giving low accuracy for them; furthermore, its main weakness is that the initial partitions (clusters) are very important. If the initial clusters are not well selected, then K-means can converge to a local minimum instead of the global minimum solution. To avoid that, one solution is to run the algorithm several times and preserve the best solution. This, in turn, raises two issues: the approach becomes computationally expensive and requires extra processing time [13].
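As one way to picture the restart remedy just described, the short sketch below (reusing the hypothetical kmeans function from the earlier listing) runs the algorithm several times with different random initializations and preserves the solution with the lowest square error; the number of restarts is an illustrative assumption.

```python
# Run K-means several times and keep the solution with the smallest square error.
best = None
for seed in range(10):  # 10 restarts is an illustrative choice, not the authors' setting
    labels, centroids, sse = kmeans(flows, k=4, seed=seed)
    if best is None or sse < best[2]:
        best = (labels, centroids, sse)
labels, centroids, sse = best
```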
2.2. K-Nearest Neighbor
Roughan et al. [14] suggested k-nearest neighbor (KNN) to classify network traffic. K-nearest neighbor is a common type of method called instance-based learning (IBL), which uses specific training instances to make classifications without having to build a model from the training data. IBL algorithms require a proximity measure to determine the similarity or distance between data inputs (instances), and a classification function that returns the resulting class of a test instance based on its proximity to other instances. A nearest neighbor classifier represents each instance as a data point in a d-dimensional space, where d is the number of attributes. For a given test instance, we compute its proximity to the rest of the data points in the training set by measuring the distance between the instance and each training point. The k-nearest neighbors of an instance are the points that are closest to it. As an example, Figure 2 demonstrates the 1-, 2- and 3-nearest neighbors of a data point located at the center of each circle. The data point is predicted based on the class labels of its neighbors. In the case where the neighbors have more than one class, the data point is assigned to the majority class of its nearest neighbors. In Figure 2a, the 1-nearest neighbor of the data point is a negative instance; therefore, the data point is assigned to the negative class. If the number of nearest neighbors is three, as shown in Figure 2c, then the neighborhood contains two positive samples and one negative sample. Based on the majority voting scheme, the data point is allocated to the positive class. K-nearest neighbor computes the similarity by measuring the distance between each test instance and all the training instances in the dataset D (where D represents the whole dataset) to compute its nearest neighbor list. Commonly, there are different ways to compute the distance between a point and its neighbors. For continuous features, the Euclidean, Manhattan and Minkowski distances, formulated in equations (1), (3) and (4) respectively, are used; for discrete features, the Hamming distance between points is used.
$$Manhattan\ Dist = \sum_{i=1}^{k} |x_i - y_i| \qquad (3)$$

$$Minkowski\ Dist = \left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q} \qquad (4)$$
Figure 2. The 1-, 2- and 3-nearest neighbors of an instance (panels a, b and c) [15]
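The following is a minimal sketch of the nearest-neighbor procedure described above: it measures the distance from a test instance to every training instance (Minkowski as in equation (4), which covers Manhattan, equation (3), for q = 1 and Euclidean for q = 2), keeps the k closest points, and assigns the majority class. The toy training data and the choice of k = 3 are illustrative assumptions that mirror the Figure 2c example.

```python
import numpy as np
from collections import Counter

def minkowski(x, y, q=2):
    """Minkowski distance (equation 4): q=1 gives Manhattan, q=2 gives Euclidean."""
    return (np.abs(x - y) ** q).sum(axis=-1) ** (1.0 / q)

def knn_predict(test_point, train_X, train_y, k=3, q=2):
    """Classify one test instance by majority vote among its k nearest training points."""
    dist = minkowski(train_X, test_point, q=q)  # distance to every training instance
    nearest = np.argsort(dist)[:k]              # indices of the k closest points
    votes = Counter(train_y[nearest])           # count the class labels of the neighbors
    return votes.most_common(1)[0][0]           # return the majority class

# Illustrative use: as in Figure 2c, two positive and one negative neighbor -> positive.
train_X = np.array([[1.0, 1.0], [1.2, 0.9], [3.0, 3.0], [0.2, 0.1]])
train_y = np.array(["positive", "positive", "negative", "negative"])
print(knn_predict(np.array([1.1, 1.0]), train_X, train_y, k=3))  # -> "positive"
```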
Generally, KNN possesses some limitations. First, it needs to determine the neighbor list for each instance; such computation can be costly if the training dataset is large. In addition, the classifier is sensitive to
the chosen value of k. In other words, if the dataset is too small, the nearest neighbor classifier may be susceptible to overfitting because of noise in the training data. On the other hand, if k is too large, the nearest neighbor classifier may misclassify the test instance, because the list of nearest neighbors may include points that are located far away from its neighborhood, as shown in Figure 3.
2.3. Expectation Maximization
Jeffrey Erman et al. [16] employed the expectation maximization (EM) unsupervised machine learning method to classify network traffic according to the application. EM is an iterative procedure that converges to a maximum likelihood using a posterior probability function. EM works in two steps. In the first step, EM computes the expected cluster probabilities (i.e. the expected class values); this step is therefore described as "expectation". In the second step, EM calculates the distribution parameters, which is the "maximization" of the likelihood of the distribution given the data. Figure 4 shows how an EM iteration alternates between performing an expectation (E) step, which produces a function for the expectation of the likelihood calculated using the two estimated parameters, the mean µ and the variance σ² of the points, and a maximization (M) step, which computes the parameters that maximize the expected likelihood found in the (E) step.
Figure 4. Life cycle of expectation-maximization
The probability of each class (application type) given a certain feature vector is estimated using the posterior probability function in equation (5), as used in the Naïve Bayes method. The maximum likelihood is calculated by repeatedly re-estimating the values of the mean µ and the variance σ² and substituting them again into the conditional probability function P(X|C), calculated using the formulas below, where N is the number of instances of each feature. The authors used 200 iterations as the condition to stop the EM loop.
$$P(X|C) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (5)$$

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (6)$$

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \qquad (7)$$
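As a rough illustration of the alternation between the E and M steps described above, the sketch below fits a one-dimensional Gaussian mixture with EM: the E-step computes cluster responsibilities with the Gaussian density of equation (5), and the M-step re-estimates the mean and variance with responsibility-weighted versions of equations (6) and (7). The synthetic data and the number of clusters are illustrative assumptions; only the 200-iteration cap is taken from the text, and this is not the authors' implementation.

```python
import numpy as np

def em_gaussian_mixture(x, n_clusters=2, n_iter=200, seed=0):
    """1-D Gaussian-mixture EM: alternate expectation and maximization steps."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=n_clusters, replace=False)  # initial means
    var = np.full(n_clusters, x.var())                  # initial variances
    weight = np.full(n_clusters, 1.0 / n_clusters)      # cluster priors
    for _ in range(n_iter):  # 200 iterations, as the stopping condition in the text
        # E-step: posterior probability of each cluster for each point (equation 5).
        dens = (1.0 / np.sqrt(2 * np.pi * var)) * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
        resp = weight * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and variances from the responsibilities (equations 6-7).
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        weight = nk / len(x)
    return mu, var, weight

# Illustrative use on synthetic packet-size-like data drawn from two populations.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(100, 10, 300), rng.normal(900, 50, 300)])
print(em_gaussian_mixture(x, n_clusters=2))
```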
The authors used nine different classes in the experiment, namely SMTP, HTTP, DNS, FTP-CONTROL, Socks, IRC, POP3, FTP-DATA and P2P (LimeWire). The overall classification accuracy was 91% for the collected dataset. However, the iteration process consumes resources (memory, CPU) and adds extra processing time, since the parameter (means and covariances) calculation is repeated up to 200 times [16].
3. PERFORMANCE EVALUATION
In this section, the performance of the K-means clustering, K-nearest neighbor and expectation maximization methods is evaluated and compared. We have evaluated and compared the three methods using three factors, namely classification accuracy, classification speed, and memory consumption. We used these
three factors in the comparison because they play a significant role in real-time and online classification environments.
3.1. Testing Environment
The Weka software version 3.7.10 [17] and the Moore dataset [18] are used to evaluate and compare the three classification methods, namely K-means clustering, K-nearest neighbor and expectation maximization. The Moore dataset consists of 24863 instances, 248 attributes, and 11 classes, which are WWW, FTP-CONTROL, MAIL, FTP-PASV, P2P, ATTACK, FTP-DATA, DATABASE, SERVICES, MULTIMEDIA, and INTERACTIVE. 14918 out of the 24863 records are used as the training dataset, while the remaining 9945 records are adopted as testing data.
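The experiments reported here were run in Weka; as a hedged illustration of the same train/test split outside Weka, the snippet below loads a hypothetical ARFF export of the Moore dataset with Python and splits it into the 14918 training and 9945 testing records mentioned above. The file name and the use of scipy/pandas are assumptions for illustration, not the authors' setup.

```python
# Illustrative reproduction of the train/test split described above (not the Weka workflow).
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("moore_dataset.arff")  # hypothetical ARFF file of the Moore set
df = pd.DataFrame(data)                           # 24863 instances, 248 attributes + class

train = df.iloc[:14918]       # 14918 records used for training
test = df.iloc[14918:]        # remaining 9945 records used for testing
print(len(train), len(test))  # -> 14918 9945
```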
3.2. Results and Discussion
The classification accuracy is evaluated by testing the overall accuracy, i.e. by determining the correctly and incorrectly classified instances. Figure 5 shows the overall accuracy of the three classification methods. The results show that K-nearest neighbor (KNN), using three neighbors, achieved the highest accuracy at up to 98%, expectation maximization (EM) achieved the second highest accuracy at up to 91%, and K-means achieved the lowest accuracy at up to 80%. Figure 6 shows the total processing time of the three classification methods, including the build-up time. The results show that the total processing time of EM, KNN, and K-means is 900, 350, and 60 seconds, respectively. As can be seen, K-means achieved the best processing time among the three methods, which makes it a suitable solution for online classification. Figure 7 shows the memory consumption of the three classification methods. The results show that the memory consumption of EM, KNN, and K-means is 223 MB, 60 MB, and 130 MB, respectively.
The results show that KNN with 3 nearest neighbors is the best in terms of accuracy and memory consumption, due to the power and low complexity of the method, but its memory use and processing time threaten to increase as the number of neighbors increases. K-means was the best in terms of time consumption; this aspect makes it an appropriate solution for real-time classification, but it still needs enhancement with regard to accuracy and memory consumption where data traffic is pumped at a high rate in real-time and online environments [19]. Expectation maximization ranked last because its computational cost leads to high memory and time consumption.
Figure 5. Overall accuracy of k-means, KNN, and EM

Figure 6. Total processing time of k-means, KNN, and EM
Figure 7. Memory consumption of k-means, KNN, and EM
4. CONCLUSION
Unsupervised learning is a well-known method widely used to identify and classify network traffic. Several unsupervised classifiers have been proposed by researchers to classify network flows. These researchers compete, in terms of QoS, to test which classifiers are more suitable for real-time and online classification. This paper presents a comparative study of three popular unsupervised classifiers, namely K-means, K-nearest neighbor (KNN) and Expectation Maximization (EM). These classifiers were studied in depth by explaining the methodology of each and how it is employed to classify network flows. The classifiers are evaluated with regard to three significant metrics, especially relevant for real-time and online environments: classification accuracy, classification speed and memory consumption. As a result, KNN was the best in terms of accuracy and memory consumption, but K-means introduced better performance with regard to total processing time, while expectation maximization was the worst on all three metrics. Based on the generated results, we recommend studying avenues to optimize KNN to reduce its processing time so that it fits real-time and online environments. Furthermore, we recommend enhancing the classification accuracy and decreasing the memory consumption of K-means. Thereafter, both should be implemented on a huge dataset.
REFERENCES
[1] A. Callado, C. Kamienski, S. N. Fernandes, and D. Sadok, "A Survey on Internet Traffic Identification and Classification", 2009.
[2] IANA, Internet Assigned Numbers Authority. Available: http://www.iana.org/assignments/port-numbers.
[3] L. Bernaille, R. Teixeira, and K. Salamatian, "Early application identification", in Proceedings of the 2006 ACM CoNEXT conference, 2006, p. 6.
[4] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, "ACAS: automated construction of application signatures", in Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data, 2005, pp. 197-202.
[5] H. T. Marques Nt, L. C. Rocha, P. H. Guerra, J. M. Almeida, W. Meira Jr, and V. A. Almeida, "Characterizing broadband user behavior", in Proceedings of the 2004 ACM workshop on Next-generation residential broadband challenges, 2004, pp. 11-18.
[6] Z. Shi, Principles of machine learning: International Academic Publishers, 1992.
[7] T. M. Mitchell, "Machine learning. 1997", Burr Ridge, IL: McGraw Hill, vol. 45, 1997.
[8] A. Munther, R. Razif, S. Nizam, N. Sabri, and M. Anbar, "Active Build-Model Random Forest Method for Network Traffic Classification", International Journal of Engineering & Technology (0975-4024), vol. 6, 2014.
[9] Y. Wang and Q. Li, "Review on the Studies and Advances of Machine Learning Approaches", TELKOMNIKA Indonesian Journal of Electrical Engineering, vol. 12, pp. 1487-1494, 2014.
[10] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, "Traffic classification on the fly", ACM SIGCOMM Computer Communication Review, vol. 36, pp. 23-26, 2006.
[11] T. Pang-Ning, M. Steinbach, and V. Kumar, "Introduction to data mining", in Library of Congress, 2006, p. 74.
[12] T. T. Nguyen and G. Armitage, "A survey of techniques for internet traffic classification using machine learning", Communications Surveys & Tutorials, IEEE, vol. 10, pp. 56-76, 2008.
[13] L. Bernaille, A. Soule, M. I. Jeannin, and K. Salamatian, "Blind application recognition through behavioral classification", ed, 2005.
[14] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, "Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification", in Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, 2004, pp. 135-148.
[15] J. Han and M. Kamber, "Data mining: concepts and techniques (the Morgan Kaufmann Series in data management systems)", 2000.
[16] J. Erman, A. Mahanti, and M. Arlitt, "Qrp05-4: Internet traffic identification using machine learning", in Global Telecommunications Conference, 2006. GLOBECOM'06. IEEE, 2006, pp. 1-6.
[17] I. H. Witten, E. Frank, L. E. Trigg, M. A. Hall, G. Holmes, and S. J. Cunningham, "Weka: Practical machine learning tools and techniques with Java implementations", 1999.
[18] A. W. Moore and K. Papagiannaki, "Toward the accurate identification of network applications", in Passive and Active Network Measurement, ed: Springer, 2005, pp. 41-54.
[19] Y. Zhao, X. Xie, and M. Jiang, "Hierarchical Real-time Network Traffic Classification Based on ECOC", TELKOMNIKA Indonesian Journal of Electrical Engineering, vol. 12, pp. 1551-1560, 2014.
BIOGRAPHIES OF AUTHORS

Alhamza Munther is a Ph.D. student in the School of Computer and Communication Engineering at Universiti Malaysia Perlis, Malaysia. He received his Master degree in advanced computer networks from Universiti Sains Malaysia (USM) in 2012 and his B.Sc. in Computer and Software Engineering from the University of Technology in 2003. His research focuses on overlay networks, multimedia distribution, traffic engineering and machine learning.

Rozmie R. Othman obtained his BEng degree in Electronics (Computer) from Multimedia University, Malaysia in 2006, MEng in Telecommunications Engineering from Universiti Malaya, Malaysia in 2009 and PhD in Software Engineering from University Sains Malaysia in 2012. He is currently a senior lecturer attached to the Embedded, Networks and Advanced Computing Research Group (ENAC), School of Computer and Communication Engineering, Universiti Malaysia Perlis. His main research interests include Software Engineering, Combinatorial Software Testing, and Algorithm Design. He is a member of IEEE, the British Computer Society (BCS), and the Board of Engineers Malaysia (BEM).

Dr. Mosleh M. Abu-Alhaj is a senior lecturer at Al-Ahliyya Amman University. He received his first degree in Computer Science from Philadelphia University, Jordan, in July 2004, his master degree in Computer Information Systems from the Arab Academy for Banking and Financial Sciences, Jordan, in July 2007, and his doctorate degree in Multimedia Networks Protocols from Universiti Sains Malaysia in 2011. His research areas of interest include VoIP, Multimedia Networking, and Congestion Control. Apart from research, Dr. Mosleh M. Abu-Alhaj also provides consultancy services in the above research areas and directs the Cisco academy team at Al-Ahliyya Amman University.

Dr. Mohammed Anbar holds a PhD in the area of Advanced Computer Networks, an M.Sc. in Information Technology, and a B.Sc. in Computer System Engineering. His research interests include Malware Detection, Web Security, Intrusion Detection Systems (IDS), Intrusion Prevention Systems (IPS) and network monitoring.

Dr. Shahrul Nizam is a member of the teaching staff at the School of Computer and Communication Engineering, University Malaysia Perlis. He obtained his BSc. degree from University Technology Malaysia in 2004, his M.Sc. in Computer Engineering from University Malaysia Perlis in 2006 and his Ph.D. degree in Computer System Engineering at the University of South Australia in 2010. His research areas of interest include artificial intelligence and machine learning systems. At the moment he is focusing on agent learning algorithm architecture design for both single and multi-agent systems.