TELKOMNIKA Indonesian Journal of Electrical Engineering
Vol.12, No.6, June 2014, pp. 4306 ~ 4313
DOI: 10.11591/telkomnika.v12i6.4562

Received September 29, 2013; Revised December 23, 2013; Accepted January 22, 2014

Online Imbalanced Support Vector Machine for
Phishing Emails Filtering
XiaoQing Gu, TongGuang Ni*, Wei Wang
School of Information Science and Engineering, Changzhou University, Changzhou 213164, China
Telp 86-519-86330558, Fax 86-519-86330284
Corresponding author, e-mail: tiddyddd@163.com, hbxtntg-12@163.com*, super_asd@163.com

Abstract
Phishing emails are a real threat to internet communication and the web economy. In real-world email datasets, data are predominantly composed of ham samples with only a small percentage of phishing ones. A standard Support Vector Machine (SVM) can produce suboptimal results in filtering phishing emails, and it often requires much time to perform classification on large data sets. In this paper, an online version of imbalanced SVM (OISVM) is proposed. First, an email is converted into 20 features that are selected based on its content and link characteristics. Second, OISVM is developed to optimize the classification accuracy and reduce computation time; it uses a novel method to adjust the separating hyperplane for imbalanced data sets and an online algorithm to make the retraining process much faster. Compared with existing methods, the experimental results show that OISVM can achieve significantly better performance, using a proposed expressive evaluation method.

Keywords: phishing emails, filtering, support vector machine, imbalanced data, online

Copyright © 2014 Institute of Advanced Engineering and Science. All rights reserved.

1. Introduction
Phishing email has increased enormously over the last years and is a serious threat to global security and economy. Phishing is the act of attempting to fraudulently acquire, through deception, sensitive personal information such as passwords and credit card details by assuming another's identity in an official-looking email. The user is provided with a convenient link in the same email that takes the recipient to a fake web page appearing to be that of a trustworthy company. When the user enters personal information on the fake page, it is captured by the fraudster.
According to a report from RSA [1], the number of phishing attacks in 2011 increased by 37% compared with 2010, and approximately one in every 300 emails delivered on the Internet in 2011 was a phishing email. Phishers can obtain $4500 in stolen funds in each phishing attack. PhishTank, an organization that tracks phishing, recorded 31,850 unique phishing attacks during July 2012. In addition, there are phishing attacks against non-traditional sites, such as automotive associations. Highly targeted attacks on the employees or members of a certain company, government agency, or organization are called “spear phishing”. Here the phisher wants to gain access to a company’s computer system.
Among the countermeasures used against phishing, three main alternatives have been used: black list/white list, network- and encryption-based countermeasures, and content-based filtering [2]. The first alternative consists in using public lists of malicious phishing websites (black list) and lists of legitimate, non-malicious websites (white list), where each link in an email must be checked against both lists. Blacklist-based anti-phishing toolbars have been developed by many companies, such as Netcraft. The main problem of this countermeasure is that phishing websites are short-lived, which makes it difficult to keep an up-to-date list of malicious websites.
The second alternative is based on email authentication methods. Email authentication mechanisms allow receiving mail agents to accept mail from known good senders, reject mail from known spammers, or use reputation mechanisms such as blacklists to decide how to handle mail from other senders. Herzberg et al. [3] examined DNS-based email sender authentication mechanisms, which use the DNS system to identify the sender. Dhanalakshmi et al. [4] identified spoofed emails using various techniques such as Sender Policy Framework, Sender ID and DomainKeys Identified Mail.
TELKOMNIKA ISSN: 2302-4046. Online Imbalanced Support Vector Machine for Phishing Emails Filtering (XiaoQing Gu)
The third alternative is content-based phishing filtering, which attempts to distinguish phishing emails from legitimate emails using machine learning techniques. Fette et al. [5] developed a scheme to filter phishing emails based on features collected from information internal and external to the emails. Garera et al. [6] identified a set of fine-grained heuristics from URLs and applied a logistic regression model to these URL signatures. Zhuang et al. [7] proposed an anti-phishing framework using a combination of multiple classifiers. Chen et al. [8] adopted a hybrid text and data mining model that used a key phrase extraction technique to discover important semantic categories from the textual content of phishing alerts. Shih et al. [9] designed and implemented an email virus filter with an embedded system.
As phishing emails are often nearly identical to messages from legitimate websites, current detection approaches have limited success in detecting these attacks. On the other hand, real-world email data sets usually have class imbalance problems, because ham emails are represented by a much larger number of instances than phishing emails. In this paper, we propose a new online imbalanced SVM (OISVM) for phishing filtering. Based on the standard SVM, an imbalanced algorithm and an online learning strategy are combined, which overcomes the imbalance problem in SVM and uses the incremental training samples in re-training. As a result, the training time can be reduced greatly without much loss of classification precision.
The contributions of our work are: (1) A number of new features of emails are incorporated, in particular content features and link features. (2) A new online imbalanced SVM for the phishing filtering problem is developed. It is easy to model and fast to implement, and it gives stable classification results when tested on different datasets.

2. Content and Link Features of Phishing Emails
In this section we discuss the content and link features used across all the phishing emails, with the intention of identifying a set of generic features to be used for filtering.
Link identity: The objective of this module is to extract the link identity of an email, which is defined by analyzing the hyperlink structure of the email. The hyperlinks of a regular email often link to its own domain, while phishing emails are usually the opposite: a phishing email often contains hyperlinks that point to a foreign domain. Here anchor links are analyzed, specifically the href attribute of <a> and <area> tags. For each anchor link, the base domain part is extracted from the URL, and then the occurrences are counted for each base domain. The base domain with the highest occurrence is the link identity.
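The link-identity step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`base_domain`, `link_identity`) are hypothetical, and the base domain is approximated naively as the last two labels of the host name, which is not correct for multi-part suffixes such as `.co.uk`.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class _HrefCollector(HTMLParser):
    """Collects href values from <a> and <area> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "area"):
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def base_domain(url):
    """Naive base domain: last two labels of the host (an assumption)."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def link_identity(html_body):
    """Return the most frequent base domain among anchor links, or None."""
    parser = _HrefCollector()
    parser.feed(html_body)
    counts = Counter(d for d in map(base_domain, parser.hrefs) if d)
    return counts.most_common(1)[0][0] if counts else None
```

A production version would likely rely on a public-suffix list rather than the two-label heuristic used here.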
Next, the feature generation step determines the feature values of an email based on its content and link characteristics. The features that serve as input to our filter are presented according to [5, 6], but they also differ from the lists proposed there. First, some features have been changed along with the evolution of phishing techniques; second, features that require special information, such as the age of linked-to domains and spam-filter output, are not included in our approach; third, all features in our approach are binary features.
Feature 1: HTML format. Phishing emails tend to use some formatting of the content to display the logo or design of the corresponding message. For this reason, it is common for phishing emails to be in HTML format.
Feature 2: Using IP addresses instead of URL. Frequently, phishers attempt to conceal the destination website by obscuring the URL. Due to the low cost of phishing, many phishing sites can only be addressed by an IP-address URL instead of a domain or host name. On the other hand, legitimate companies rarely link to pages by an IP address, so such a link in an email is a potential indicator of a phishing attack.
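A minimal sketch of how Feature 2 might be computed, using only the Python standard library. The function names are hypothetical, and the rule "1 if any link uses a literal IP address" is an assumption; the paper does not spell out how multiple links are aggregated.

```python
import ipaddress
from urllib.parse import urlparse


def uses_ip_address(url):
    """True if the host part of the URL is a literal IPv4/IPv6 address."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def feature_ip_url(urls):
    """Binary feature: 1 if any link in the email points to an IP address."""
    return int(any(uses_ip_address(u) for u in urls))
```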
Feature 3 and 4: Dots in URL and slashes in URL. To construct a legitimate-looking URL, a phisher may use a larger number of dots in the URL. A legitimate URL can also contain a number of dots, but a URL is less credible if there are too many dots in it. The average number of dots over all URLs in an email is computed by Equation (1):

$$
AVG_{dots} = \frac{\sum_{u \in U} d_u}{|U|} \tag{1}
$$

where $u$ is a URL in an email $d$, $d_u$ is the number of dots in the URL $u$, and $|U|$ is the number of URLs. Feature 3 is a binary feature obtained by comparing $AVG_{dots}$ with a threshold of five dots. Similar to Feature 3, $AVG_{slashes}$ is the average number of slashes in all URLs in an email. Feature 4 is also a binary feature, obtained by comparing $AVG_{slashes}$ with a threshold of five slashes.
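Equation (1) and the two thresholded features can be illustrated as follows. This is a sketch under assumptions: the feature is taken to be 1 when the average strictly exceeds five, and slashes are counted over the whole URL string (including the `//` after the scheme); neither detail is fixed by the text. The function names are hypothetical.

```python
def average_count(urls, char):
    """Average occurrences of `char` per URL (Equation (1) when char is '.')."""
    if not urls:
        return 0.0
    return sum(u.count(char) for u in urls) / len(urls)


def dot_and_slash_features(urls, threshold=5):
    """Return (feature 3, feature 4) as 0/1 values."""
    avg_dots = average_count(urls, ".")
    avg_slashes = average_count(urls, "/")
    return int(avg_dots > threshold), int(avg_slashes > threshold)
```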
Feature 5: Usage of well-defined underlying contents or “Here” links. Most phishing emails use underlying contents or “Here” links that invoke a sense of false urgency, threat, wheedling, or concern to deceive users into clicking on the hyperlink.
Feature 6: Domain in href is different from the display string. In phishing emails, the link text seen in the email is usually different from the actual link destination. For example, in <a href="http://www.eBay123.com">www.eBay.com</a>, the display string refers to eBay, but the link discreetly redirects the user to a website whose domain is eBay123.
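The comparison between the href domain and the display string could look roughly like this. It is a sketch, not the authors' implementation: only link texts that themselves look like a URL or host name are compared, a leading `www.` is ignored, and all helper names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _AnchorCollector(HTMLParser):
    """Collects (href, visible text) pairs for <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def _host(text):
    """Host name of a URL-like string; empty if the text is not URL-like."""
    text = text.strip()
    if "." not in text or " " in text:
        return ""
    if "://" not in text:
        text = "http://" + text
    host = (urlparse(text).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def feature_href_mismatch(html_body):
    """1 if some link displays one host but points to another, else 0."""
    parser = _AnchorCollector()
    parser.feed(html_body)
    for href, text in parser.anchors:
        shown, actual = _host(text), _host(href)
        if shown and actual and shown != actual:
            return 1
    return 0
```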
Feature 7: Domain in header fields is different from link identity. Companies normally tend to host their own mail servers and web servers within their own network domains. On the other hand, phishers often use a free email account from public email service providers. To mislead recipients of such messages, phishers often use the name of the target as part of the email account name and the full user name of the email account. Therefore, the link identity is compared to the following three header fields: “From:”, “Return-Path:” and “Reply-to:”.
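One possible reading of Feature 7 is sketched below with the standard `email` package. The rule used here (the feature is 1 if any of the three header domains neither equals nor is a subdomain of the link identity) is an assumption, as are the function names; the paper only states that the link identity is compared to the three fields.

```python
from email import message_from_string
from email.utils import parseaddr


def header_domains(raw_email, fields=("From", "Return-Path", "Reply-To")):
    """Map each present header field to the domain of its email address."""
    msg = message_from_string(raw_email)
    domains = {}
    for field in fields:
        _, addr = parseaddr(msg.get(field, ""))
        if "@" in addr:
            domains[field] = addr.rsplit("@", 1)[1].lower()
    return domains


def feature_header_mismatch(raw_email, link_identity):
    """1 if any header domain does not match the link identity, else 0."""
    if not link_identity:
        return 0
    identity = link_identity.lower()
    for domain in header_domains(raw_email).values():
        if domain != identity and not domain.endswith("." + identity):
            return 1
    return 0
```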
Feature 8: Country in header fields. This feature obtains the geographic location of the network domain of the claimed email addresses found in the headers “Return-Path:”, “From:” and “Reply-To:”. The location for all three should be consistent, that is, in the same country. It is noted that a subset of phishing emails rely on email communication between the phishers and recipients to carry out the phishing attack, instead of relying on redirecting recipients to a fraudulent web site. In this case, the country codes of the domains in the headers are compared to each other.
Feature 9-19: Keywords. Given the nature of phishing emails, they often contain some distinctive words. We use a positive word list, i.e., a list of words hinting at the possibility of phishing. For each word in the list we record a binary feature indicating whether or not the word occurs in the email. The list contains a total of ten word stems: account, update, password, bank, log, inconvenience, security, access, verify, credit.
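The keyword features can be sketched as one binary value per stem. Matching a stem as a word prefix is an assumption standing in for whatever stemming the authors used, and the function name is hypothetical.

```python
import re

WORD_STEMS = ("account", "update", "password", "bank", "log",
              "inconvenience", "security", "access", "verify", "credit")


def keyword_features(text, stems=WORD_STEMS):
    """One 0/1 value per stem: 1 if any word in the text starts with it."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [int(any(tok.startswith(stem) for tok in tokens)) for stem in stems]
```

Prefix matching is crude (for instance, "logic" would match the stem "log"), so a real filter would probably use a proper stemmer.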
Feature 20: Spam Filter. A trained, off-line version of SpamAssassin is used to generate a feature: the class assigned to the email, either “ham” or “spam”. This is a binary feature using the trained version of SpamAssassin with the default rule weights and threshold. This feature’s importance is discussed in more detail in [5].

3. Online Imbalanced SVM
3.1. Imbalanced SVM
Support vector machine (SVM) learning is a promising pattern classification technique proposed by Cortes and Vapnik [10]. SVM learning aims at minimizing an upper bound of the generalization error through maximizing the margin between the separating hyperplane and the data. Although SVMs often work effectively with balanced datasets, they could produce suboptimal results with imbalanced datasets. More specifically, an SVM classifier trained on an imbalanced dataset often produces models which are biased towards the majority class and have low performance on the minority class.
A novel method is proposed in [11] for adjusting the separating hyperplane in binary classification of imbalanced data. Firstly, the original samples are preliminarily trained by the standard support vector machine, and a normal vector of the separating hyperplane is obtained. Secondly, one-dimensional data are generated by projecting the high-dimensional data onto the normal vector. Then, the ratio of the two-class penalty factors is determined based on the information derived from the standard deviation of the projected data and the two class sample sizes. Finally, a new separating hyperplane is obtained by a second training.
Given a training set of $n$ samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i \in R^d$ represents a $d$-dimensional data point and $y_i \in \{+1, -1\}$ represents the class label of that data point, for $i = 1, \ldots, n$. Let $\varphi(X) = [\varphi(x_1), \varphi(x_2), \ldots, \varphi(x_n)]$ denote the data matrix in feature space $H$; then a kernel function $K$ can be found such that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$. Thus, the nonlinear OISVM can be achieved by solving the following quadratic problem:

$$
\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C^{+}\sum_{i=1}^{n^{+}} \xi_i + C^{-}\sum_{i=n^{+}+1}^{n} \xi_i
$$
$$
\text{s.t.}\quad y_i\left(w \cdot \varphi(x_i) + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, 2, \ldots, n \tag{2}
$$

In imbalanced SVM, the SVM soft margin objective function is modified to assign two misclassification costs, such that $C^{+}$ is the misclassification cost for positive class examples, while $C^{-}$ is the misclassification cost for negative class examples. Here we also assume the positive class to be the minority class and the negative class to be the majority class. Here $C^{+} = C s^{+} / n^{+}$ and $C^{-} = C s^{-} / n^{-}$, where $C$ is a constant, $s^{+}$ is the projective standard deviation of the positive class, $s^{-}$ is the projective standard deviation of the negative class, and $n = n^{+} + n^{-}$.
To solve Equation (2), the original samples are preliminarily trained by the standard SVM. After finding the optimal values of $\alpha_i$, the normal vector $w_1$ can be recovered as:

$$
w_1 = \sum_{i=1}^{n} \alpha_i y_i \varphi(x_i) \tag{3}
$$

So the projection value is:

$$
w_1 \cdot \varphi(x_j) = \sum_{i=1}^{n} \alpha_i y_i \varphi(x_i) \cdot \varphi(x_j) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x_j), \quad j = 1, 2, \ldots, n \tag{4}
$$

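As a result, the parameters $s^{+}$ and $s^{-}$ are computed by the following equations:

$$
s^{+} = \sqrt{\frac{1}{n^{+}-1}\sum_{j=1}^{n^{+}}\Big[\sum_{i=1}^{n}\alpha_i y_i K(x_i, x_j)\Big]^2 - \frac{1}{n^{+}(n^{+}-1)}\Big[\sum_{j=1}^{n^{+}}\sum_{i=1}^{n}\alpha_i y_i K(x_i, x_j)\Big]^2}
$$
$$
s^{-} = \sqrt{\frac{1}{n^{-}-1}\sum_{j=n^{+}+1}^{n}\Big[\sum_{i=1}^{n}\alpha_i y_i K(x_i, x_j)\Big]^2 - \frac{1}{n^{-}(n^{-}-1)}\Big[\sum_{j=n^{+}+1}^{n}\sum_{i=1}^{n}\alpha_i y_i K(x_i, x_j)\Big]^2} \tag{5}
$$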
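The preliminary step can be made concrete with a small pure-Python sketch. It assumes that the dual coefficients from a first, standard SVM training are already available, that the projective standard deviation is the sample standard deviation (divisor n - 1) of the projected values in Equation (4), and that the class penalties take the form C+ = C s+ / n+ and C- = C s- / n-. All function names are hypothetical.

```python
import math


def projections(alphas, labels, kernel_matrix):
    """p_j = sum_i alpha_i * y_i * K(x_i, x_j), as in Equation (4)."""
    n = len(labels)
    return [sum(alphas[i] * labels[i] * kernel_matrix[i][j] for i in range(n))
            for j in range(n)]


def sample_std(values):
    """Sample standard deviation with divisor len(values) - 1."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def class_penalties(alphas, labels, kernel_matrix, C):
    """Return (C_plus, C_minus) from the projected data of each class."""
    proj = projections(alphas, labels, kernel_matrix)
    pos = [p for p, y in zip(proj, labels) if y == +1]
    neg = [p for p, y in zip(proj, labels) if y == -1]
    s_pos, s_neg = sample_std(pos), sample_std(neg)
    return C * s_pos / len(pos), C * s_neg / len(neg)
```

The two returned values would then be passed as per-class penalties to the second training.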
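To solve this optimization problem, the Lagrangian is constructed:

$$
L(w, b, \xi, \alpha, \beta, \gamma) = \frac{1}{2}\|w\|^2 + C^{+}\sum_{i=1}^{n^{+}}\xi_i + C^{-}\sum_{i=n^{+}+1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left[y_i\left(w \cdot \varphi(x_i) + b\right) - 1 + \xi_i\right] - \sum_{i=1}^{n^{+}}\beta_i\xi_i - \sum_{i=n^{+}+1}^{n}\gamma_i\xi_i \tag{6}
$$

with Lagrangian multipliers $\alpha_i \ge 0$, $\beta_i \ge 0$ and $\gamma_i \ge 0$. By the Karush-Kuhn-Tucker (KKT) conditions, the derivatives of $L(w, b, \xi, \alpha, \beta, \gamma)$ with respect to the primal variables should vanish:

$$
\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n}\alpha_i y_i \varphi(x_i) = 0 \tag{7}
$$
$$
\frac{\partial L}{\partial b} = -\sum_{i=1}^{n}\alpha_i y_i = 0 \tag{8}
$$
$$
\frac{\partial L}{\partial \xi_i} = C^{+} - \alpha_i - \beta_i = 0, \quad i = 1, \ldots, n^{+} \tag{9}
$$
$$
\frac{\partial L}{\partial \xi_j} = C^{-} - \alpha_j - \gamma_j = 0, \quad j = n^{+}+1, \ldots, n \tag{10}
$$
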
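Substituting (7)-(10) into (6), we obtain the dual form of the optimization problem:

$$
\max_{\alpha} \; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j)
$$
$$
\text{s.t.}\quad \sum_{i=1}^{n}\alpha_i y_i = 0, \tag{11}
$$
$$
0 \le \alpha_i \le C s^{+} / n^{+}, \quad y_i = +1, \quad i = 1, 2, \ldots, n^{+}
$$
$$
0 \le \alpha_j \le C s^{-} / n^{-}, \quad y_j = -1, \quad j = n^{+}+1, \ldots, n
$$
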
Equation (11) is a typical convex quadratic programming problem which is easy to solve numerically. A training sample $x_i$ ($1 \le i \le n$) is called a Support Vector (SV) if the corresponding Lagrange multiplier $\alpha_i > 0$. Denote the SV sets as $SV_1 = \{x_i \mid 0 < \alpha_i < C s^{+}/n^{+},\; 1 \le i \le n^{+}\}$ and $SV_2 = \{x_j \mid 0 < \alpha_j < C s^{-}/n^{-},\; n^{+}+1 \le j \le n\}$. Suppose $\alpha^{*} = [\alpha_1^{*}, \ldots, \alpha_n^{*}]$ solves the above optimization problem; the optimal threshold $b^{*}$ is then computed by the following formula:
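
$$
b^{*} = \frac{|SV_2| \sum_{x_j \in SV_1}\Big(y_j - \sum_{i=1}^{n}\alpha_i^{*} y_i k(x_i, x_j)\Big) + |SV_1| \sum_{x_j \in SV_2}\Big(y_j - \sum_{i=1}^{n}\alpha_i^{*} y_i k(x_i, x_j)\Big)}{2\,|SV_1|\,|SV_2|} \tag{12}
$$
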
Finally, the SVM decision function can be given by:

$$
f(x) = \mathrm{sign}\Big(\sum_{i=1}^{n}\alpha_i^{*} y_i K(x_i, x) + b^{*}\Big) \tag{13}
$$
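Equation (13) translates directly into code once the support vectors, their multipliers and the threshold are known. The sketch below takes the kernel as a plain Python callable; the function names are hypothetical, and mapping a decision value of exactly zero to +1 is an arbitrary tie-breaking assumption.

```python
def decision_value(x, support_vectors, alphas, labels, b, kernel):
    """Real-valued output: sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b


def predict(x, support_vectors, alphas, labels, b, kernel):
    """Sign of the decision value: +1 (phishing) or -1 (legitimate)."""
    value = decision_value(x, support_vectors, alphas, labels, b, kernel)
    return 1 if value >= 0 else -1
```

Only samples with a non-zero multiplier contribute to the sum, which is why the support vectors alone are sufficient at prediction time.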
The dual optimization problem can be solved in the same way as the standard SVM optimization problem. The modified SVM algorithm would not tend to skew the separating hyperplane towards the minority class examples to reduce the total misclassifications, as the minority class examples are now assigned a higher misclassification cost.

3.2. Online Imbalanced SVM (OISVM)
In standard SVM applications, an SVM is trained on an entire set of training data and is then tested on a separate set of testing data. Phishing email filtering is typically tested and deployed in an online setting, which proceeds incrementally. Online learning is performed in a sequence of trials. At trial $t$ the algorithm first receives an instance $x_t$ and is required to predict the label associated with that instance. After the online learning algorithm has predicted the label, the true label is revealed and the algorithm pays a unit cost if its prediction is wrong. The ultimate goal of the algorithm is to minimize the total number of prediction mistakes it makes along its run. To achieve this goal, the algorithm may update its prediction mechanism after each trial so as to be more accurate in later trials.
Based on the Relaxed Online Support Vector Machine (ROSVM) algorithm described by Sculley in [12], the proposed OISVM classifier is stated in Table 1.
After the initial training of the imbalanced SVM, only a small fraction of the training emails end up as support vectors. Given an incoming message $x_i$ and a label $y_i$, if the classifier's optimality condition is well satisfied, the hypothesis does not change and it is not necessary to re-train. If it is not satisfied, the hyperplane parameters are updated using the imbalanced SVM algorithm over the seen messages (seenData). The re-training uses only the historical support vector samples and the incremental training samples; all non-SV samples are discarded after the previous training. Consequently, the training time can be reduced greatly without much loss of classification precision. A parameter $m$ is used to set the definition of well classified, which reduces the number of updates. A parameter $p$ is used to set the number of messages in seenData.

Table 1. Pseudo Code for Proposed OISVM Classifier

(1) Initialize:
    Data set X = (x_1, y_1), ..., (x_n, y_n)
    Seed the imbalanced SVM classifier with a few examples of each class;
    Train an initial imbalanced SVM filter;
(2) Online Learning:
    For each x_i in X do:
        Classify x_i
        IF y_i f(x_i) < m
            Find w', b' with imbalanced SVM with parameters C+, C- on seenData,
            using w, b as seed hypothesis.
            Set (w, b) := (w', b')
        IF size(seenData) > p
            Remove oldest example from seenData
        Add x_i to seenData
(3) Finishing:
    Repeat until x_n is finished
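
The control flow of Table 1 can be sketched in Python with the SVM solver abstracted away. Here `train` and `predict_value` are hypothetical hooks standing in for the imbalanced SVM trainer and its real-valued decision function. One deliberate deviation is labeled in the code: the current example is handed to the trainer together with seenData, following the prose statement that re-training uses the incremental samples, whereas Table 1 adds it to seenData only after the update.

```python
from collections import deque


def oisvm_online(stream, train, predict_value, model, m=0.8, p=1000):
    """Online loop in the spirit of Table 1.

    stream        -- iterable of (x, y) pairs with y in {+1, -1}
    train         -- callable(examples, seed_model) -> new model (hypothetical hook)
    predict_value -- callable(model, x) -> real-valued f(x) (hypothetical hook)
    model         -- initial model trained on a few seed examples
    Returns the final model and the number of prediction mistakes.
    """
    seen = deque()
    mistakes = 0
    for x, y in stream:
        score = predict_value(model, x)
        if (1 if score >= 0 else -1) != y:
            mistakes += 1
        if y * score < m:
            # Not classified with margin m: re-train, seeded by the old model.
            # Assumption: the current example is included in the training set.
            model = train(list(seen) + [(x, y)], model)
        if len(seen) > p:
            seen.popleft()          # drop the oldest example
        seen.append((x, y))
    return model, mistakes
```

Larger values of m trigger more re-training, while p bounds the memory used for seenData.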

Our algorithm seems similar to ROSVM; however, they are used in different contexts. First, we use an imbalanced SVM. Second, ROSVM uses the linear kernel as it assumes that phishing and ham are linearly separable. However, most real-life email datasets are not completely linearly separable, even when they are mapped into a higher dimensional feature space. For OISVM we use the Gaussian Radial Basis Function (RBF) kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$.
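
For reference, a Gaussian RBF kernel on plain Python sequences can be written as below. The parameter name `gamma` and the form exp(-gamma * ||x - x'||^2) are assumptions, since the kernel parameter symbol is not legible in the source; the default of 0.1 mirrors the value reported in Section 4.1.

```python
import math


def rbf_kernel(x, x_prime, gamma=0.1):
    """Gaussian RBF kernel exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)
```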

4. Experimental Settings and Results
In this section, we present the experiments conducted and discuss the results. All classification modeling is carried out on a computer with an Intel Xeon at 1.86 GHz and 8 GB of memory. The features describing the properties of emails are extracted as described in Section 2, and the size of each feature vector is 20.

4.1. Dataset Description and Evaluation Criteria
We rely on four different datasets to carry out the evaluation studies of our work. The first one is a phishing dataset containing phishing emails collected between 2005 and 2008 by Jose Nazario [13]. The second one is a ham dataset collected from the Apache SpamAssassin Project [14]. Using these two collections of phishing emails and ham emails, we constructed a dataset NAZA. To perform experiments on data from a real-world mailbox, we use the dataset REAL, which was collected over a period of 10 months in 2012 and gathered from several users’ personal mailboxes. To simplify our evaluation studies, all emails in the four datasets are in English. The summary of the key figures of each dataset used is given in Table 2.

Table 2. Summary of the Datasets Used

Dataset   Size    Training (Ham, Phishing)   Testing (Ham, Phishing)
NAZA      10520   7890 (5523, 2367)          2630 (1841, 789)
REAL      4208    2524 (2272, 252)           1684 (1515, 169)

In this paper, a group of performance metrics for classification problems is used for the evaluation of the results, consisting of FPR, FNR, accuracy, precision, recall and ROC. True Positive (TP) means correctly classified phishing emails, True Negative (TN) means correctly classified ham emails, False Positive (FP) means ham emails wrongly classified as phishing, and False Negative (FN) means phishing messages wrongly classified as ham. Therefore, the False Positive Rate (FPR) and the False Negative Rate (FNR) are the proportions of wrongly classified ham and phishing email messages respectively (FPR = FP/(FP+TN), FNR = FN/(TP+FN)). Accuracy states the overall percentage of correctly classified email messages (Accuracy = (TP+TN)/(TP+FP+TN+FN)). Precision, as the classifier’s safety, states the degree to which messages identified as phishing are indeed malicious (Precision = TP/(TP+FP)). Recall, as the classifier’s effectiveness, states the percentage of phishing messages that the classifier manages to classify correctly (Recall = TP/(TP+FN)). The Receiver Operating Characteristic (ROC), as a measure of the classifier’s ability to balance its FPR against its FNR, is a function of a varying classification threshold.
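
The threshold-free metrics above follow directly from the four confusion-matrix counts, as the short sketch below shows. The function name is hypothetical, and any counts passed to it in an example are illustrative only, not results from this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """FPR, FNR, accuracy, precision and recall from confusion counts."""
    return {
        "FPR": fp / (fp + tn),
        "FNR": fn / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }
```

Note that recall and FNR always sum to one, since both are computed over the same set of true phishing messages.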
The classification algorithms online SVM, imbalanced SVM, ROSVM and OISVM are implemented. We use Platt’s SMO algorithm as the core SVM solver, and the imbalanced SVM classifier is implemented with the libSVM library. Since an email is considered either legitimate or phishing, this is naturally a binary classification problem. The SVM produces output in two classes: +1 means phishing, and -1 means legitimate. The robustness of the classifiers is evaluated using 10-fold cross validation. For the online setting, the ROSVM [12] was used.
For training the SVM classifier, we need to specify two parameters: the $\gamma$ value in the kernel function and the penalty value $C$. In online SVM and ROSVM, the RBF kernel with parameters $C$ = 100 and $\gamma$ = 0.1 turned out to be the most accurate and stable. The ROSVM and OISVM tuning parameters were estimated over a 20% subset of the training dataset. According to the size of the datasets, we set $m$ = 0.8 and $p$ = 1000 for the thresholds. The penalty values $C^{+}$ and $C^{-}$ used in imbalanced SVM and OISVM are given in Table 3, where $C^{+}$ is for phishing examples and $C^{-}$ is for ham examples.

Table 3. The Optimal Values of the Penalty Values C+ and C-

Dataset   Imbalanced SVM       OISVM
          C+       C-          C+       C-
NAZA      22.86    4.19        26.35    3.24
REAL      31.50    2.39        36.21    2.02

4.2. Results
We compared online SVM, imbalanced SVM, ROSVM and OISVM on the two datasets using 10-fold cross validation. The results are shown in Tables 4 and 5. The training time is the time for training the classifier, not including the preprocessing of the emails, and is expressed in seconds. As we can see from Table 4, imbalanced SVM is much more effective than online SVM and ROSVM, but it is expensive in terms of time. Our proposed OISVM, however, achieves similarly accurate classification performance in far less time. The results demonstrate that OISVM offers the best balance of accuracy and training time among the compared SVM approaches in the detection of phishing emails.

Table 4. Performance of the Methods for the NAZA Dataset

Method   Accuracy   FPR     FNR     Precision   Recall   ROC      Training time (s)
OnSVM    92.52%     6.68%   6.60%   93.06%      94.01%   96.46%   120.2
ImSVM    97.35%     2.74%   2.16%   98.00%      98.91%   98.51%   200.9
ROSVM    92.17%     7.16%   6.91%   92.58%      93.23%   96.09%   10.6
OISVM    97.16%     2.99%   2.43%   97.69%      97.75%   98.22%   13.8

Table 5. Performance of the Methods for the REAL Dataset

Method   Accuracy   FPR     FNR     Precision   Recall   ROC      Training time (s)
OnSVM    90.47%     8.25%   7.84%   91.52%      91.74%   90.12%   72.3
ImSVM    95.13%     5.62%   5.26%   96.04%      95.89%   95.67%   100.5
ROSVM    90.28%     8.67%   8.03%   90.36%      90.25%   90.00%   8.5
OISVM    95.01%     6.01%   5.49%   96.22%      95.33%   95.24%   9.6

We can observe that the results on the REAL dataset in Table 5 are somewhat inferior. The FPR and FNR increase compared with NAZA, and accuracy, precision, recall and ROC decrease a little. The cause is that the public datasets are somewhat artificial, in that they are collected from diverse sources and even cover different time periods. Since we are using SVM for classification, the detection results also depend on the quality and quantity of the training dataset. If the training dataset of REAL could represent all characteristics of ham and phishing emails, then the detection performance would become better.

5. Conclusion
This paper addresses the problem of filtering phishing emails from ham ones with an imbalanced and online learning SVM (OISVM). In OISVM, the SVM soft margin objective function is modified to assign two misclassification costs. By assigning a higher misclassification cost to the minority class examples than to the majority class examples, the effect of class imbalance can be reduced. Furthermore, an online learning algorithm is used, whereby only subsets of the data are considered at any one time and the results are subsequently combined; this makes the retraining process much faster and avoids much of the storage cost. Thus OISVM can be scaled up to handle extremely large data sets. The experiments show that OISVM is able to obtain very good results with the different validation datasets employed. Furthermore, a number of features have been described that are particularly well suited to filtering phishing mails; they are binary features selected from the email's content and link characteristics.
In our future work, we plan to adjust existing feature extraction methods and seek more relevant features to get a better result. Furthermore, the method used to collect a dataset must be improved.

Acknowledgements
This work was supported by the National Natural Science Foundation of China under contract (61070121).

References
[1] Sanchez F, Duan Z. A sender-centric approach to detecting phishing emails. Proceedings of ASE International Conference on Cyber Security. USA. 2012: 248-257.
[2] Bergholz A, Beer J, Glahn S. New filtering approaches for phishing email. Journal of Computer Security. 2010; 18(1): 7-35.
[3] Herzberg A, Jbara A. Security DNS-based email sender authentication mechanisms: A critical review. Computers & Security. 2009; 28(8): 731-742.
[4] Dhanalakshmi R, Kavisankar L, Chellappan C. Enhanced Email Authentication Against Spoofing Attacks To Mitigate Phishing. European Journal of Scientific Research. 2011; 54(1): 165-170.
[5] Fette I, Sadeh N, Tomasic A. Learning to Detect Phishing Emails. Proceedings of the International World Wide Web Conference Committee (IW3C2). Canada. 2007; 649-656.
[6] Garera S, Provos N, Chew M. A framework for detection and measurement of phishing attacks. Proceedings of the 2007 ACM Workshop on Recurring Malcode. Germany. 2007; 1–8.
[7] Zhuang W, Jiang Q. Intelligent Anti-phishing Framework Using Multiple Classifiers Combination. Journal of Computational Information Systems. 2012; 8(17): 7267-7281.
[8] Chen X, Bose I. Assessing the severity of phishing attacks: A hybrid data mining approach. Decision Support Systems. 2011; 50(4): 662–672.
[9] Shih D, Chiang H, Yen D. An intelligent embedded system for malicious email filtering. Computer Standards & Interfaces. 2013; 35(5): 1289-1302.
[10] Cortes C, Vapnik V. Support vector networks. Machine Learning. 1995; 20(3): 273–297.
[11] Liu W, Liu S, Xue Z. Balance Method for imbalanced Support Vector Machines. Pattern Recognition & Artificial Intelligence. 2008; 21(2): 136-141.
[12] Sculley D, Wachman G. Relaxed online SVMs for spam filtering. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. USA. 2007: 415–422.
[13] Gomez J, Moens M. PCA document reconstruction for email classification. Computational Statistics and Data Analysis. 2012; 56(3): 741–751.
[14] Ramanathan V. Phishing detection and impersonated entity discovery using Conditional Random Field and Latent Dirichlet Allocation. Computers & Security. 2013; 34(5): 123-139.