TELKOM
NIKA Indonesia
n
Journal of
Electrical En
gineering
Vol.12, No.7, July 201
4, pp
. 5685 ~ 56
9
2
DOI: 10.115
9
1
/telkomni
ka.
v
12i7.575
6
5685
Re
cei
v
ed Fe
brua
ry 7, 201
3; Revi
se
d March 23, 201
4
;
Accepte
d
April 15, 201
4
Application of Boosting Algorithm in Spam Filtration
Gu Wen
c
hen
g
Net
w
ork a
nd In
formation C
ent
er, Qiqihar U
n
i
v
ersit
y
Qiqih
a
r, Heil
on
gjia
ng, Ch
in
a, 161
00
6
email: g
u
w
e
nc
hen
g@1
63.co
m
A
b
st
r
a
ct
In ord
e
r to
i
m
prove
accur
a
c
y
an
d
effectiveness
of
junk
mail
filtrati
on,
a fi
lterin
g
method
i
s
prop
osed
base
d
on Boosti
ng
algor
ith
m
, w
h
ich uses Bo
os
ting al
gorith
m
to construct a spam filt
erer
to
ide
n
tify junk
mail. Besi
des, referenc
e techn
i
cal i
ndex
es
in
the field of te
xt classificatio
n
and i
n
for
m
at
i
o
n
retrieval
ar
e us
ed to
co
nstruct a s
p
a
m
fi
ltere
r
eva
l
uati
o
n
sy
stem, w
i
th
it, e
x
peri
m
e
n
tal
d
a
ta obta
i
n
ed fro
m
simulati
on w
e
r
e
tested
an
d e
v
alu
a
ted. T
he
results of
test
and
eva
l
uati
o
n
prove
d
that, c
o
mpar
ed w
i
th t
h
e
traditio
nal Bay
e
sia
n
alg
o
rith
m, the spa
m
filterer base
d
o
n
Boostin
g
alg
o
rith
m is abl
e to filter junk mai
l
s
mor
e
exce
lle
ntl
y
, and the effe
ctiveness of Bo
osting
a
l
g
o
rith
m in sp
a
m
filter
ing is ver
i
fied.
Ke
y
w
ords
: bo
osting a
l
g
o
rith
m, junk
ma
il, fil
t
ration, classifi
er, evalu
a
tio
n
Copy
right
©
2014 In
stitu
t
e o
f
Ad
van
ced
En
g
i
n
eerin
g and
Scien
ce. All
rig
h
t
s reser
ve
d
.
1. Introduc
tion
Boosting
algo
rithm is a
n
al
gorithm u
s
e
d
to im
prove t
he accu
ra
cy of
wea
k
cl
as
sif
i
cat
i
o
n
algorith
m
. Va
liant [1] prop
ose
d
the
PAC (P
rob
abl
y
Approximatel
y Corre
c
t) l
e
arnin
g
m
odel
in
1984, PAC i
s
the the
o
re
tical ba
si
s of statisti
cal m
a
chi
ne lea
r
ni
ng, integratio
n of machin
e
learni
ng met
hod
s; Scha
pi
re [2] in 19
90 at the e
a
rlie
st PAC learni
ng mo
d
e
l con
s
truct
s
a
polynomial
al
gorithm, and
gives a
p
o
siti
ve
proof,
th
is is th
e first B
oostin
g
al
gori
t
hm; Fre
und
[3]
prop
osed a
more effici
en
t Boosting a
l
gorithm in
1
991, in 19
9
5
Freu
nd a
n
d Scha
pire [
4
]
improve
d
Bo
osting al
gorit
hm, we p
r
o
pose a
ne
w iterative algorithm Ad
a
B
oost (Ad
a
p
t
ive
Boosting
) alg
o
rithm.
In theo
ry, Bo
osting
alg
o
rit
h
m i
s
a
n
al
g
o
rithm f
r
ame
w
ork that
ca
n integ
r
ate
a
n
y we
ak
cla
ssifi
cation
algorithm
s, and ha
s a relatively
com
p
lete basi
s
of mathemati
c
al theo
ry and
experim
ents
sho
w
that, Boostin
g
algo
ri
thm has
st
ro
ng ap
plica
b
ili
ty for a small
size of sa
m
p
les
and high
-dim
ensi
onal data
,
comp
ared with
other
cl
as
sificatio
n
alg
o
r
ithms. Bo
osti
ng al
gorithm
i
s
fast,
sim
p
le, easily pro
g
ra
mmed with h
i
gh
a
dapta
b
ility
and
a
c
curacy,
an
d cap
able of
featu
r
e
sele
ct
ion whil
e
cla
ssif
y
ing.
With the in
creasi
ng d
e
vel
opment
and i
m
provem
ent
of Boostin
g
a
l
gorithm, it h
a
s b
een
widely used i
n
more an
d more a
r
ea
s. In the fi
eld of image re
cog
n
ition and ret
r
ieval, Boosti
ng
algorith
m
ha
s be
en
su
ccessfully appli
ed in h
and
written charact
e
r recognitio
n
[5], and O
CR
cha
r
a
c
ter re
cognition [6]; i
n
term
s
of sp
eed
and
accura
cy of fa
ce
dete
c
tion, B
oostin
g
alg
o
ri
thm
has
achieved
good
re
sult
s [7]; in the field of me
dical diag
no
sis,
it has b
een
u
s
ed i
n
lun
g
a
nd
ski
n
can
c
e
r
di
agno
si
s [8]; a
s
fo
r biol
ogi
cal information
pro
c
e
s
sing, i
t
has be
en
e
m
ployed i
n
g
ene
expre
ssi
on p
r
ofiles cl
assifi
cation [9]; in military fiel
d, it has su
cce
s
sfully been a
pplied to iden
ti
fy
rada
r sig
nals
[10];
in
the
field
of heal
th
ca
re, sub
-
health gro
u
p
s
a
r
e cla
s
sified
by
Bo
osti
ng
algorith
m
[11
]; in addition
, Boosting
al
gorithm
ha
s
also
bee
n a
pplied i
n
intrusio
n dete
c
ti
on
techn
o
logy [1
2], semi-stru
c
tured info
rma
t
ion extr
actio
n
[13], target
tracking [1
4], oil flooded l
a
yer
identificatio
n [15], human
a
c
tion
recognit
i
on [16] an
d o
t
her field
s
. Here,
with the
help of Boo
s
ti
ng
algorith
m
, we
filter junk ma
il.
This
pap
er i
s
org
ani
zed
a
s
follo
ws. Th
e theo
retical
basi
s
i
s
p
r
e
s
ented fo
r thi
s
wo
rk in
Section 2. In
Section 3, Boosting al
g
o
rithm is
d
e
signed for ju
n
k
mail filterin
g. In Section
4
,
evaluation
system of junk
mail f
iltering
based on Bo
osting al
gorit
hm is co
nst
r
u
c
ted. In Secti
on 5,
detailed
information o
n
th
e expe
riment
al re
sult
s a
n
d
analy
s
is is
d
i
scusse
d a
n
d
sum
m
ari
z
e
d
. In
Section 6, co
nclu
sio
n
s a
r
e
drawn.
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NI
KA
Vol. 12, No. 7, July 201
4: 5685 – 56
92
5686
2.
Boos
ting Algorithm
Boosting al
g
o
rithm ca
n raise the re
cognition
rate
of weak cl
assificatio
n
algorithm,
who
s
e
co
re
idea is, viewing the othe
r wea
k
cla
s
si
fication alg
o
rithm as ba
se cla
ssifi
cati
on
algorith
m
s, to
put the
m
int
o
the f
r
ame
w
ork of
Bo
osti
ng al
gorith
m
s, unde
r
whi
c
h
the
sam
p
le
set
training
is op
erated,
so tha
t
different trai
ning
sam
p
le
s
su
bsets are obtaine
d, an
d
then
by trai
ni
ng
these sa
mple
s
sub
s
ets
ba
se cla
ssifie
r
s are obtaine
d;
after
N
roun
d
of su
ch give
n
training,
N
base
cla
ssifi
ers a
r
e
gen
e
r
ated; Bo
osti
ng al
gorith
m
s ,by
weig
hted fu
sion
of
these
N
ba
se
cla
ssif
i
e
r
s,
pr
odu
ce
a f
i
nal
cla
ssif
i
e
r
.
W
h
ile f
o
r t
h
e
N
b
a
se
cl
assifie
r
s, ea
ch
in
dividual
cla
s
sifier
recognitio
n
ra
te is not very
high, after th
e weig
hted fu
sion the fin
a
l cla
ssifie
r
ofte
n has
a high
er
recognitio
n
ra
te, so as to improve the
weak
cla
ssifi
ca
tion algorith
m
the recogniti
on ratio.
Boosting al
g
o
rithm is a
n
iterative algo
ri
thm,
who
s
e specifi
c
ope
rat
i
on is to co
n
s
tru
c
t a
seri
es of
pre
d
ictive fun
c
tio
n
, then i
n
a
certain
way
th
ey are
combi
ned i
n
to a
p
r
edictive fu
ncti
on.
The
ba
sic id
ea of
the Boosting
alg
o
rit
h
m i
s
: in
a
given
wea
k
l
earni
ng
algo
rithm an
d a
total
sampl
e
set
)
,
(
...,
),
,
(
),
,
(
:
2
2
1
1
n
n
v
u
v
u
v
u
U
,
i
u
is the i-th trai
ning sampl
e
s inp
u
t,
}
1
,
1
{
V
v
i
is t
he cla
s
s mar
k
of
cla
s
sif
i
cat
i
o
n
pro
b
lems
; in B
oos
ting algorithm, firs
t the
training
sam
p
le wei
ght distrib
u
tion i
s
initialized
with
n
/
1
as th
e spe
c
ified
training
set
distrib
u
tion, that is, the sa
mple wei
ght
i
D
for each traini
ng
i
U
is
n
/
1
, and then the app
ro
priate
wea
k
lea
r
nin
g
algorith
m
is used fo
r
N
iterations , after
each iteration
, acco
rdin
g to the re
sults
of the t
r
ainin
g
the
distri
buti
on of
the t
r
ai
ning
set
is up
dated, fo
r tho
s
e fail
ed
train
i
ng
sam
p
le
s t
he
weig
ht is re
di
stribute
d
to a greate
r
on
e , so that
in the
next iteration,
more attenti
on is to be p
a
i
d
on the
s
e trai
ning sampl
e
s; at the end
of the it
eratio
n , there i
s
a pre
d
ictio
n
functio
n
)
(
U
H
seq
uen
ce
n
h
h
h
,
…
,
,
2
1
, where e
a
ch predictio
n funct
i
on
i
h
is corre
s
pond
ed with
a weight valu
e
i
D
, for the
p
r
ed
iction fu
nctio
n
with
excell
ent pe
rforma
nce, th
e
co
rresp
ondi
ng
weight valu
e i
s
greate
r
, whe
r
eas the p
r
edi
ction fun
c
tion
with bad performan
ce, th
e corre
s
po
ndi
ng weig
ht value
is sm
aller. A
fter
N
iteration
s
, the final predi
ction fu
nction
)
(
U
H
is ge
nerate
d
by a
joint
weig
hted fu
si
on
i
w
. For a
si
n
g
le
wea
k
le
arning al
go
rith
m, its lea
r
nin
g
a
c
cura
cy ra
te is
not hig
h
,
but after Boo
s
ting
algo
rith
m, the a
c
cura
cy of the
final
re
sults will
b
e
greatly imp
r
oved. Algo
rithm
in Figure 1 ca
n be used to descri
be the
pro
c
e
ss.
T
o
t
a
l s
a
m
p
le
s
e
t
U
X
1
D
1
h
1
X
2
h
2
D
2
D
t
…
…
X
T
-1
h
T
-1
D
T
-1
X
T
h
T
D
T
H
(
X
)
w
1
h
1
(
x
)
w
2
h
2
(
x
)
w
T
h
T
(
x
)
w
T-
-
1
h
T-
1
(x
)
X
1
D
1
h
1
X
2
h
2
D
2
D
t
…
…
X
T
-1
h
T
-1
D
T
-1
X
T
h
T
D
T
H
(
X
)
w
1
h
1
(
x
)
w
2
h
2
(
x
)
w
T
h
T
(
x
)
w
T-
-
1
h
T-
1
(x
)
Figure 1. Algorithm Pro
c
e
s
ses
3. Filtering of Junk Mail Bas
e
d on
Bo
osting Alg
o
r
i
thm
E-mail is
essential to peo
ple's
wo
rk
a
nd lif
e as th
e informatio
n
transmission
mean
s,
bringi
ng th
em
co
nvenien
ce
, but at the
same time
it
al
so
ca
uses a l
o
t of ne
w
pro
b
lems, th
e m
a
in
probl
em is th
e emergen
ce
of a large nu
mber of
jun
k
mail, so jun
k
mail filtering
has b
e
come
one
of the issu
es
the majority o
f
e-
mail users are co
ncerne
d about mo
st.
Curre
n
tly, junk mail filterin
g method
s consi
s
t
of filte
r
ing ba
se
d o
n
IP address, filtering
based o
n
be
havioral fe
ature
s
, and filt
ering
ba
sed
on me
ssage
and
so o
n
. The IP add
ress
filtering tech
n
i
que
s is to restri
ct or filter by
e-mail a
ddre
s
s, IP or "black/
white
list" of domain
name [1
7]; the be
havio
ra
l feature
s
filt
ering
is
to d
e
termin
e wh
ether th
e me
ssage
is j
u
n
k
by
behavio
r feat
ure
s
of ju
nk
mail differe
nt from the
normal mail, such as
sp
eci
a
l time to send,
the
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Applicatio
n of Boosting Alg
o
rithm
in Spam
Filtration (Gu We
nchen
g)
5687
high tran
smi
ssi
on freq
ue
ncy in a sh
ort te
rm, mail forwa
r
din
g
, abnormal
e-mail add
ress,
abno
rmal SM
TP sessio
n in
formation, attacking oth
e
r
host
s
, severa
l transit ro
ute
r
s, false
se
rver
informatio
n,
e-mail
he
ade
r info
rmation
abn
or
m
a
litie
s, an
d hi
dde
n sendi
ng
a
ddre
s
s;
cont
ent
filtering tech
nique [18] i
s
to take e-mail me
ssa
g
e
head
er, sende
r, re
cipi
ent, subje
c
t
line,
messag
e
con
t
ent, the five cha
r
a
c
teri
stics a
s
b
a
si
s for judgm
ent, a
nalysi
s
, statistics a
nd
extra
c
t,
enabli
ng e-m
a
il filtering.
This a
r
ticle d
e
scrib
e
s a m
e
thod of jun
k
email filtering
base
d
on me
ssage
conte
n
t.
3.1. Junk Email Filtering Bas
e
d on M
essag
e Con
t
ent
Jun
k
em
ail filtering i
s
to d
i
vide email
s
i
n
to two type
s, jun
k
mail
s and le
gitima
te mils,
filtering out the junk mail
s, "spam".
E
-
mail cla
s
sif
i
cat
i
on i
s
crit
ical t
o
sp
am
f
ilt
ering in orde
r to ensure effective filtration,
conte
n
t-ba
se
d spa
m
filteri
ng focu
se
s o
n
the co
ntent
contain
ed in
e-mail a
s
a
resea
r
ch obj
ect,
gene
ral cla
ssification algo
rithm
begi
n wi
th
so
me kno
w
n sp
am
trai
n
ing as
sam
p
les, extracti
ng
spam fe
ature
s
, and the
n
constructin
g
a
spam
filterer, by which n
e
w me
ssag
e
s
are analy
z
ed,
judge
d, and l
egitimate mai
l
is disting
u
ished from
spa
m
and sp
am filtering is a
c
hi
eved.
Conte
n
t-ba
se
d spam
filteri
ng typically i
n
clu
de m
a
il
colle
ction, m
a
il man
age
m
ent an
d
mail filtering
and oth
e
r
steps. G
ene
rall
y spea
kin
g
the e-mail filtering
co
nsi
s
t
s
of two
sta
ges:
learni
ng a
nd
analysi
s
; in th
e learning
sta
ge, ce
rt
ain
al
gorithm
s a
r
e
use
d
to an
alyze a
nd p
r
o
c
e
s
s
the colle
cted
me
ssage
s
(inclu
ding
sp
am a
nd le
gitimate mail
) t
o
e
s
tabli
s
h
an a
pprop
ria
t
e
cla
ssifie
r
, and
then it is app
lied to filter mails in the an
alysis p
h
a
s
e.
Gene
rally, the spe
c
ific imp
l
ementation
steps are:
1.
First a
ce
rtai
n numb
e
r
of spam
and l
egitimate em
ails a
r
e
colle
cted in
ord
e
r to
establi
s
h two sets of spam
and legitimat
e
e-mail
s.
2.
The app
ro
pri
a
te cla
ssifi
ca
tion algorith
m
is use
d
for traini
ng these kn
own
spam
sampl
e
s, a
n
a
l
yzing the
e-mail me
ssag
es, extra
c
ti
ng
feature
s
fro
m
the e-m
a
il
s an
d collecti
ng
corre
s
p
ondin
g
data.
3.
A message
cl
assifier is
con
s
tru
c
ted.
4.
An app
ro
priat
e
thre
sh
old
s
i
s
sele
cted
for sp
am ju
dgm
ent, by the
u
s
e
of e
s
tabli
s
hed
e-mail
cla
ssif
i
ers m
e
s
s
a
g
e
s
ar
e cla
s
sif
i
e
d
.
The f
l
owch
art of spam
mail based o
n
conte
n
t filtering
is sh
owed in
Figure 2.
Trai
ni
ng s
e
t o
f
mai
l
sam
p
le
A c
e
rtai
n al
gori
t
hm i
s
use
d
to
cl
as
si
fy
and l
earn
Ma
il
cl
as
si
fie
r
is
co
nst
r
u
c
te
d
M
a
il
i
s
d
e
taec
ted
L
egi
tim
a
te m
a
i
l
S
pam
Figure 2. Flowchat of Spam
Filter Base
d on Co
ntent
3.2. Descrip
tion of the
Bo
osting Alg
o
r
i
thm
Boosting
alg
o
rithm [1
9] is a
n
iterative al
go
rithm,
also
an
ef
fective cl
assi
fication
algorith
m
, by whi
c
h the a
ccura
cy of disti
ngui
shi
ng
ma
ils can be i
m
p
r
oved g
r
e
a
tly in mail filterin
g,
enabli
ng legiti
mate mail not
to be misjud
ged a
s
jun
k
mail while
red
u
cin
g
false
co
ntractin
g rate.
Boosting
alg
o
rithm, aimi
n
g
at traini
ng
samp
l
e
set, learn
thro
ugh
an iterative
pro
c
e
ss,
one of who
s
e
core ide
a
s i
s
to remain a
weight
distri
bution on trai
ning sa
mple
set. In the initial
training, th
e
weig
ht of all
sample
s i
s
eq
ual, 1/
n
, after ea
ch ite
r
atio
n, the weight
of failed
sam
p
les
is in
cre
a
sed,
and effe
ct of
the we
ak l
e
a
r
ning ma
ch
i
n
e
is st
ren
g
then
ed for th
ose
difficult traini
ng
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NI
KA
Vol. 12, No. 7, July 201
4: 5685 – 56
92
5688
sampl
e
s.
Trai
ning a
nd l
earning p
r
o
c
e
s
s, the p
r
o
c
e
s
s
of
st
ru
ct
u
r
e a
nd cla
ssif
i
cat
i
on
of
cla
ssif
i
er
together dete
r
mine
the accura
cy
of
spa
m
filtering, therefo
r
e the
v
a
riou
s p
a
ra
m
e
ters
used in
the
pro
c
e
ss
sh
ou
ld be sel
e
cte
d
rea
s
o
nably
,
which in th
e actual o
p
e
r
ation and te
sting, shoul
d be
adju
s
ted time
ly, and ultima
tely a rel
a
tively rea
s
on
able
value i
s
d
e
termined,
so
as
to red
u
ce fal
s
e
positive
s
to
a
minimum
a
s
much
a
s
po
ssible. It i
s
pro
v
ed that a
s
lo
ng a
s
th
e e
r
ror
rate
of e
a
ch
predi
ction fu
n
c
tion i
s
le
ss t
han 0.5
(i.e.
better than
ra
ndom g
u
e
ssi
ng), the
pre
d
i
c
tion a
c
curacy of
the final pred
iction fun
c
tio
n
s are often high, so in
a
c
tual ap
plicat
ion, accordin
g to the actual
situation
a b
e
tter cla
s
sifie
r
ca
n be
obt
ained by m
e
ans
of test a
nd othe
rs,
so
the accu
ra
cy
to
disting
u
ish legitimate messag
es i
s
improved, redu
cin
g
spam " sli
p
ping a
w
ay."
Boosting al
go
rithm is de
scribed a
s
follows:
Suppo
se th
e sam
p
le
set
)}
,
(
,
…
),
,
(
),
,
{(
2
2
1
1
n
n
v
u
v
u
v
u
U
, where
U
u
i
,
}
1
,
1
{
V
v
i
,
n
i
,
…
,
2
,
1
, in addition
, weight-di
s
tri
buting of trai
ning sample
s in
iteration
is
)
(
i
D
t
, the predi
ctive function gene
rated in
t
iteration is re
pre
s
ente
d
by
t
h
.
1.
Initial weight
distrib
u
tion of
the training sample
s
For ea
ch
U
v
u
i
i
)
,
(
,
n
v
u
D
i
i
/
1
)
,
(
1
,
i.e. the weig
ht of each training
sampl
e
are
equal in the fi
rst iteratio
n is
n
/
1
.
2. FOR
1
t
to
N
//
iterate the
followin
g
p
r
o
c
ed
ure,
wh
ere
N
is the
nu
mber of itera
t
ions, typicall
y an
experie
nce value.
1)
Boosting al
go
rithm is called
,
where
t
D
is the para
m
eter
Boosting al
go
rithm;
//
t
D
is weight of
t
round of cy
cle.
2)
predi
ction fun
c
tion is o
b
tain
ed:
}
1
,
1
{
:
V
U
h
t
;
3)
the error rate
t
h
is cal
c
ulate
d
:
i
i
t
v
u
h
t
t
i
D
)
(
)
(
;
4)
R
t
t
t
)
1
ln(
2
1
//
t
is the weig
ht value of
t
h
in
t
round.
5)
according to
the erro
r ra
te, the actua
l
weig
ht valu
e of the trai
ning
sampl
e
s is
update
d
:
t
i
t
i
t
i
i
t
i
i
t
i
i
t
t
t
t
Z
u
h
v
v
u
D
v
u
h
v
u
h
e
e
Z
i
D
i
D
t
t
))
(
exp(
)
,
(
)
(
)
(
)
(
)
(
1
Whe
r
e
t
Z
is a normali
zatio
n
factor
,
1
)
,
(
)
,
(
1
U
v
u
i
i
t
i
i
v
u
D
.
6)
at the end of the cycl
e, the cla
ssifie
r
is o
b
tained ultim
a
tely
)
)
(
(
)
(
1
T
t
t
t
u
h
sign
u
H
(1)
4. Ev
aluation of Spam Filtering Ba
se
d on Boos
tin
g
Algorithm
In
the pro
c
e
s
s of spam filteri
ng, the filterer divide
s e
m
ails into two
catego
rie
s
: spam and
legitimate me
ssage
s, so it
may ca
use t
w
o
co
rre
sp
o
nding
cla
s
sification
erro
r i
n
sp
am filteri
n
g
pro
c
e
ss:
mist
akin
g the
legi
timate me
ssa
ges a
s
s
pam,
or taki
ng
sp
am a
s
le
gitim
a
te mail. In
the
former case, use
r
s’
impo
rtant mail i
s
lo
st, leadin
g
to
seri
ou
s
con
s
eque
nces; in
the latter
ca
se, a
lot of trouble is ca
used for
use
r
s. Th
eref
ore, t
he conseque
nces of
misjud
gme
n
t cau
s
e
d
by these
two va
ry grea
tly, the co
st o
f
misjud
ging
a legiti
mate
e
-
mail i
s
much
greater than
slippi
ng
a ju
n
k
i
U
t
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Applicatio
n of Boosting Alg
o
rithm
in Spam
Filtration (Gu We
nchen
g)
5689
mail a
w
ay,
which
is an
im
portant
re
aso
n
why ma
ny
use
r
s d
o
n
o
t wa
nt to
use
sp
am filteri
n
g
d
e
v
ic
es
. T
hus
, in
the
s
p
am filte
r
in
g p
r
oc
es
s
,
le
g
i
tima
te messa
g
e
s
sho
u
ld
not to
be mi
sjud
ged
as
spam
as m
u
ch a
s
po
ssibl
e
, but at the same
time
spam false po
sitives shoul
d
be red
u
ced
as
much a
s
po
ssible to imp
r
o
v
e the accura
cy of classification.
In this pape
r, indexes in the field of info
rmatio
n re
trieval and te
xt categori
z
at
ion are
use
d
[20], co
nstru
c
ting
a spam filt
er eva
l
uation
syste
m
to evaluate
t
he effect of spam filte
r
s.
For
the convenie
n
ce
of d
e
scription, va
ria
b
les
are d
e
fined
as foll
o
w
s:
su
ppo
se
there
a
r
e
S
mess
ages in
tes
t
set,
hs
C
is th
e num
be
r of l
egitimate m
e
ssage
s mi
sta
k
en
a
s
spam,
h
N
is the
total numbe
r of legitimate messag
es,
sh
C
is the numb
e
r of spam miscla
ssifie
d
as
legal mail,
s
N
is the total numbe
r of spa
m
messag
es,
then
s
h
N
N
S
.
Evaluation is
defined a
s
fol
l
ows:
1. Erro
r
rate
E
S
C
C
N
N
C
C
E
sh
hs
s
h
sh
hs
(2)
Erro
r rate
E
reflects h
o
w th
e
e-mail
s are misjud
ged, in
cludi
ng two f
a
lse p
o
sitive
s rate:
spam mi
scla
ssified a
s
legiti
mate messa
g
e
s, legitimate
message
s m
i
scl
assified a
s
sp
am.
2.
False n
egativ
e rate
M
s
sh
N
C
M
(3)
False n
egati
v
e rate
M
refers to the ratio of spa
m
miscl
assifi
ed as le
gitimate
messag
es, which reflect
s
the sp
am re
co
gnition of the filterer.
3.
False rate
F
h
hs
N
C
F
(4)
False rate
F
reflects the le
gitimate messag
es in
co
rre
c
tly identified
as sp
am, in spam
filtering proce
ss, which sh
o
u
ld be avoid
e
d
from hap
pe
ning.
4. Recall
rate
R
Recall rate
R
include
s spam rec
a
ll
s
R
and le
gitimate messag
es recall
h
R
.
h
hs
h
h
s
sh
s
s
N
C
N
F
R
N
C
N
M
R
1
1
(5)
Recall rate
R
reflec
ts
h
ow spam is filtere
d
out and leg
i
timate message
s get thro
ugh.
Spam re
call
s
R
is spam
det
ection rate,
whi
c
h
reflect
s
the
spam
filterer's ability
to find spam,
s
pam recall rate
s
R
highe
r, the le
ss
spam
"slippi
ng
away"; legitimate mail
re
call
h
R
is the rat
e
of legitimate mail getting through, whi
c
h reflects
the spam filterer’s abilit
y to let legitimate mail
throug
h, the recall rate
h
R
hig
her, the less l
egitimate me
ssage
s misj
u
dged.
5.
Preci
s
io
n rate
P
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NI
KA
Vol. 12, No. 7, July 201
4: 5685 – 56
92
5690
The preci
s
io
n
rate
P
includ
e
s
sp
am preci
s
ion rate
s
P
an
d legitimate e
-
mail preci
s
io
n
rate
h
P
.
sh
hs
h
hs
h
h
hs
sh
s
sh
s
s
C
C
N
C
N
P
C
C
N
C
N
P
(6)
Preci
s
io
n rat
e
P
reflects h
o
w
the email
s
are so
rted o
u
t, the preci
s
ion rate
s
P
is the
rate of
sp
am
co
rrectly sorted out, which refle
c
ts
the
filterer’
s
a
b
il
ity to find sp
am, the hi
gh
er
spam pre
c
i
s
i
on
rate
s
P
mea
n
s fewer le
gitimate messa
g
e
s mi
sjud
ged
as spam; leg
i
timate e-mail
pre
c
isi
on rat
e
h
P
is the rate of legitimate message
s
detected
co
rre
ctly, which
reflects the
filterer’
s
ability to find l
egit
i
mate e-mail,
the hi
gher l
egitimate e-mail precisi
o
n rate
h
P
me
ans
fewer
spa
m
mistaken a
s
legitimate me
ssage
s.
6.
A
ccu
ra
cy
rat
e
A
E
S
C
C
S
C
N
C
N
A
hs
sh
hs
h
sh
s
1
1
)
(
)
(
(7)
A
ccu
ra
cy
rat
e
A
is the for all mail (incl
uding
spam
and legitimat
e
mail) jud
g
m
ents,
whi
c
h
reflect
s
the filterer’
s
ability to d
e
term
in
e spa
m
as
sp
am,
and le
gitimate me
ssage
s
as
legitimate me
ssage
s.
7.
Value
F
Value
F
includ
es
spa
m
Value
F
s
F
and legitimate messa
g
e
s
Value
F
h
F
.
h
h
h
h
h
s
s
s
s
s
R
P
R
P
F
R
P
R
P
F
2
2
(8)
Value
F
is the ha
rmo
n
ic me
an of
pre
c
isi
on rate
P
and re
call
rate
R
, integrati
ng the
c
o
rrec
t rate
P
and recall
R
into one
evaluat
ion. The p
r
e
c
i
s
ion
rate
P
an
d re
call
rate
R
, from
different a
ngl
es,
refle
c
t th
e filtere
r
’s cl
assifica
tio
n
q
uality; in ge
n
e
ral, the
s
e
t
w
o i
ndexe
s
are
compl
e
me
nt
a
r
y
,
t
he accu
r
a
cy
rat
e
P
being improve will lead to lower re
call rate
R
, and vice
versa
. Spam
Value
F
s
F
the ha
rmo
n
i
c me
an of
sp
am a
c
cura
cy
rate
s
P
and
spa
m
re
call
rate
s
R
, legitimate
m
e
ssag
es
Value
F
h
F
is the h
a
rm
oni
c
mean
of le
gitimate e
-
mail
pre
c
isi
on
h
P
and legitimat
e
messag
es
recall rate
h
R
.
8.
Total Cos
t
Ratio (
TCR
)
In spam filteri
ng, it is not d
e
sirable th
at
legitimate me
ssage
s a
r
e in
corre
c
tly iden
tified as
spam. In o
r
d
e
r to express spam
filterer’s
co
st in
different situ
ations, And
r
o
u
tsop
oulo
s
I and
others propo
sed the co
nce
p
t
of Total Cos
t
Ratio,
TCR
[21],
TCR
is defined as:
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Applicatio
n of Boosting Alg
o
rithm
in Spam
Filtration (Gu We
nchen
g)
5691
sh
hs
s
C
C
N
TCR
(9)
is
ratio of l
e
gitimate me
ssag
es incorre
c
tly identified
as
sp
am
an
d spam i
n
correctly
identified a
s
spam, i.e.
sh
hs
C
C
/
.
TCR
r
e
flec
ts
th
e filte
r
e
r
’s
pe
r
f
or
ma
nc
e
,
th
e
h
i
g
h
e
r
TCR
is, the lower th
e l
o
ss of the
filterer is, an
d
the better the filterer’
s
performan
ce i
s
.
All of the above perfo
rma
n
ce in
dexes,
from
differe
nt angle
s
, evaluate a sp
a
m
filterer,
and th
erefore
an
evaluatio
n sy
stem
ca
n
be
con
s
ti
tuted by th
e a
b
o
v
e indexe
s
,
a
c
cordi
n
g
to t
he
experim
ental
test data, the filterer
can be
evaluated comprehe
nsiv
el
y, classifica
tion effect of the
filterer ba
se
d on Boostin
g
a
l
gorithm a
r
e j
udge
d.
5. Experiment Re
sults a
nd Analy
s
is
For the expe
rimental e
n
vironm
ent, the hard
w
a
r
e u
s
ed in thi
s
p
aper i
s
Sun
serve
r
s;
softwa
r
e i
s
the Solari
s Op
erating Syste
m
to bu
ild a
dedi
cated ma
il serve
r
. Multiple experi
m
e
n
ts
are
con
d
u
c
te
d with the
sp
am filtere
r
s
b
a
se
d
on B
oosting algo
rithm
and the
sp
a
m
filterers b
a
s
ed
on the traditi
onal Bayesi
a
n
algorith
m
(i
n each ex
periment 800 dif
f
erent e-mail
s are coll
ect
ed,
among
whi
c
h
is 300 legitimate e-mail
s, spams
50
0), the experimental data strike a mean t
o
obtain
re
sults clo
s
e
r
to th
e re
al
situati
on.
A nu
mbe
r
of
simulatio
n
expe
riment
s of i
n
form
ation
filtering a
r
e
condu
cted
wit
h
alg
o
rithm
b
a
se
d o
n
Bo
o
s
ting
algo
rith
m an
d the
tra
d
itional Baye
sian
algorith
m
[22], the data are
sho
w
n in Ta
ble 1.
Table 1. Experime
n
tal Re
sults of Filteri
ng S
pam u
s
in
g Boosting Al
gorithm a
nd
Traditio
nal
Bayes
i
an Algorithm
Algorithm
C
hs
N
h
C
sh
N
s
Boosting 6.82
300
16.56
500
Ba
y
e
sian 16.38
300
32.78
500
Table 1 is
cla
ssifi
cation of test data obtai
ned
by the junk mail filtere
r
based on B
oostin
g
algorith
m
on
800 me
ssage
s, inclu
d
ing 3
00 legiti
mate messag
es an
d
500 spam messag
es,
th
e
filtration re
sul
t
s are 3
10.28
legitimate m
a
ils an
d 489.
72 jun
k
mail
s; while filterin
g re
sults b
a
sed
on the traditio
nal Bayesia
n
algorith
m
is 3
16.40 legitim
a
te messag
e
s
and 4
83.60
junk mail
s.
We h
a
ve al
re
ady esta
blish
ed an
evalua
tion syste
m
compo
s
ed
of
E
,
M
,
F
,
R
,
P
,
A
,
Value
F
,
TCR
, the eight performan
ce
indicato
rs; b
y
means of
data in
Table
1 and the
perfo
rman
ce
indicators of t
he evalu
a
tion
system
, we
obtaine
d
pe
rf
orma
nce
indi
cators statisti
cal
data
comp
ari
s
on
between
the two filte
r
e
r
ba
se
d on
B
oostin
g
alg
o
ri
thm
and
tradi
tional Baye
si
an
algorith
m
as
sho
w
n by
Ta
ble
2.
Table 2. Co
m
pari
s
on b
e
tween the Expe
rimental
Re
sults Base
d on
the Two Algo
rithms
Algorithm
E
%
M
%
F
%
R
P
A
%
F
v
alue
TCR
R
s
%
R
h
%
P
s
%
P
h
%
F
s
%
F
h
%
Boosting
2.92
3.31
2.27
96.69
97.73
98.61
94.65
97.08
97.64
96.17
25.81
Ba
y
e
sian
6.15
6.56
5.46
93.44
94.54
96.61
89.64
93.86
95.00
92.02
12.21
From th
e d
a
ta in T
able
2
it can
be
cle
a
rly d
r
awn th
at Boostin
g
a
l
gorithm
ba
se
d spam
filter evaluati
on on
E
,
M
and
F
were si
gnificantly lower t
han tho
s
e of
traditional Ba
yesian
algorith
m
-b
ased sp
am filtering devi
c
e, while Bo
o
s
tin
g
algorith
m
-b
ase
d
sp
am filterer’
s
R
,
P
,
A
and
Value
F
were
si
gnifica
ntly higher tha
n
the four
eval
uatio
n by the spa
m
filterers ba
sed
on the traditi
onal Baye
sia
n
algo
rithm, esp
e
ci
ally as for the
TCR
, Boosting al
go
rithm-b
a
sed
spam filterer i
s
mu
ch high
e
r
than the tra
d
it
ional Baye
sian al
gorith
m
-ba
s
e
d
sp
a
m
filterers.
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NI
KA
Vol. 12, No. 7, July 201
4: 5685 – 56
92
5692
6. Conclusio
n
The
above
compa
r
ison
of expe
rimenta
l
data
sh
ows that jun
k
m
a
il filtere
r
b
a
sed
on
Boosting al
go
rithm is si
gnifi
cantly better t
han the
jun
k
mail filterer b
a
se
d on tradit
i
onal Bayesi
an
algorith
m
, the
forme
r
i
s
abl
e to a
c
hi
eve
an ex
celle
nt
spam
filterin
g
functio
n
, so t
he effe
ctiven
ess
of Boosting
algorith
m
in
junk m
a
il filtering i
s
p
r
ov
ed. Furth
e
rm
ore, a m
o
re
suitabl
e kernel
function and its
paramet
ers
will be
selected in
our future re
search so as to improve
the
identificatio
n rate of the filtere
r
.
Ackn
o
w
l
e
dg
ements
This
work is
supp
orte
d by the Natura
l
Science Fo
undatio
n of Heilongjia
ng Province,
Chin
a unde
r
grant No. F20
1331.
Referen
ces
[1]
Vali
ant LG. A theor
y of the l
e
arna
ble.
Co
mmu
n
ic
ation of the ACM
. 198
4;
27(11): 11
34-
114
2.
[2]
Schap
ire R . T
he strength
of w
e
ak l
earn
a
b
ili
t
y
.
Machi
ne le
a
r
nin
g
. 199
0; 5(2): 197-2
27.
[3]
F
r
eund Y. B
o
osting
a
w
e
ak
lear
nin
g
a
l
go
rithm b
y
ma
jo
rit
y
.
Infor
m
ati
o
n an
d C
o
mpu
t
ation
. 1
995
;
121(
2): 256-
28
5.
[4]
F
r
eund Y, Sch
apir
e
RE. A decisio
n-the
o
reti
c gener
aliz
atio
n of on-li
ne l
e
arni
ng a
nd an
app
licati
on t
o
boosti
ng.
Jour
nal of Co
mpute
r
and Syste
m
Scienc
es
. 199
7; 55(1): 11
9-1
39.
[5]
Sch
w
enk H, Bengio Y.
Ada
p
t
ive boosti
ng
of neura
l
netw
o
rks for chara
c
ter recogn
itio
n
. T
e
chnica
l
Rep
o
rt, Univer
sité de Montré
al, Montréa
l
, 1997: 1-9
.
[6]
Mao J, M
ohi
ud
din
KM. Improv
ing
OCR
perfor
m
ance
usi
n
g
c
haracter
d
egra
datio
n mo
de
ls
and
b
oostin
g
algorithm.
Pattern Rec
ogn
itio
n Letters
. 199
7
;
18(11): 14
15-
141
9.
[7]
Viol
a P, Jones
M. Roubus
t real-time
obj
ect detection.
IEE
E
T
r
ansaction
on Ne
ural N
e
tw
orks
. 2001
;
(4): 151-1
55.
[8] Hoffmann
F
.
Boosting a Geneti
c Fu
z
z
y
Classifier
. Procee
din
g
s of the J
o
int 9th Inter
n
ation
a
l F
u
zz
y
S
y
stems Asso
ciatio
n W
o
rld Con
g
ress an
d
20th Inte
rnati
ona
l Confer
en
ce of
North American F
u
zz
y
Information Pr
ocessi
ng Soci
e
t
y
.
Vanco
u
ver, Can
ada. 2
001;
3: 1564-l
5
6
9
.
[9]
Liu Qua
n
j
i
n, Li
Ying
xin. App
lic
ation of b
oosti
ng
al
gor
ithm to sample c
a
teg
o
rizati
on of ge
ne e
x
pr
essi
o
n
profil
es.
Co
mp
uter Engi
ne
erin
g and Ap
pl
icati
ons.
200
8; 44(
14): 228-
23
0.
[10]
Che
n
W
e
i, Zhou
Xi
ao, Ye
F
e
i,
T
an Ying. T
he Appli
c
ation of ad
a
B
oost-NN i
n
Rad
a
r Sign
a
l
Reco
gniti
on.
El
ectronic Infor
m
ation W
a
rfare
T
e
chno
logy.
2
005; 20(
1):29-
33.
[11]
Li Xia, He L
i
yun, Liu C
h
a
o
. Boostin
g
al
gori
t
hm
and its a
pplic
atio
n of the sub-h
ealt
h
classificati
on
.
Chin
ese Jo
urn
a
l of Hea
l
th Statistics.
2008;
25(2): 15
8-1
6
1
.
[12]
Dan
g
Cha
n
g
q
i
ng, Liu Jie,
Ni
u F
enzho
ng. Intrusio
n detect
i
on b
a
sed o
n
boosti
ng meth
od an
d RBF
neur
al net
w
o
rk
.
Comp
uter En
gin
eeri
ng a
nd
Appl
icatio
ns.
2
008; 44(
15): 11
8-12
0.
[13]
Liu C
h
u
nni
an,
Song
Xi
a. Se
mi-structured t
e
xt i
n
formatio
n
extracti
on b
a
s
ed on
bo
osting al
gorit
hm.
Journ
a
l of Beij
i
ng Un
iversity o
f
T
e
chnolo
g
y
. 200
5; 31(2): 19
9-20
3.
[14]
Yi Ch
en. T
a
rget tracking
fea
t
ure
sel
e
ctio
n
alg
o
rithm b
a
se
d on
ad
ab
oost
.
T
E
LKOMNIKA Indo
nesi
a
n
Journ
a
l of Elec
trical Eng
i
ne
eri
n
g
. 201
3; 11(1
2
): 7373-
73
78.
[15]
Shan
g F
uhu
a, Yi Xi
on
g
y
in
g.
A boostin
g
-bas
ed al
gorithm a
nd its appl
ic
ati
on in floo
de
d oil fiel
d la
ye
r
identification.
Journ
a
l of South
w
est University
for Nationa
litie
s
. 2007; 33(
1): 124-
128.
[16]
Ye Yinl
an. Re
cogn
ition
of
hu
man actio
n
usi
ng bo
ostin
g
method a
nd RB
F
neural n
e
t
w
ork.
Comput
e
r
Engi
neer
in
g an
d Appl
icatio
ns
. 200
8; 44(1
3
): 188-1
90.
[17]
Yang F
e
ng, C
a
o Qili
n, Dua
n
Hai
x
i
n
, et al. D
e
si
g
n
an
d Impl
ementati
on
of
an Anti-S
pam
S
y
stem B
a
s
e
d
on DNS b
l
ackl
i
s
t.
Comp
uter Engi
neer
in
g and
Applic
ations
. 2
003; 7: 11-
12, 45.
[18] Pan
W
e
nfen
g.
Rese
arch on C
ontent-Bas
ed
Spa
m
F
ilter
in
g
. Beij
in
g: Institu
t
e of com
putin
g tech
nol
og
y
Chin
ese ac
ad
e
m
y
of scie
n
ces
.
2004.
[19]
F
r
eund Y, Sch
apir
e
RE, Abe
N. A short introductio
n
to boos
ting.
Journ
a
l-J
apa
nes
e Socie
t
y for Artificia
l
Intelli
genc
e
. 19
99; 14
(5): 7
71-
780.
[20]
Z
eng C
h
u
n
,
Xi
ng C
h
u
n
x
ia
o,
Z
hou L
i
zh
u. A
perso
na
lize
d
search al
gorith
m
b
y
usi
ng c
o
ntent-bas
ed
filterin
g.
Journ
a
l of Softw
are
.
200
3; 14(5): 99
9-10
04.
[21]
Andro
u
tsop
oul
os I, Koutsi
as
J, Ch
andr
in
o
s
KV, et a
l
.
A
n
ev
alu
a
tio
n
o
f
naiv
e
B
a
yesi
an
anti-sp
a
m
filterin
g
. Proce
edi
ngs of the
w
o
rksho
p
on Ma
chin
e Le
arni
ng in the Ne
w
Inf
o
rmation Age, G. Potamias,
V. Moustakis
and M. va
n
Somere
n (eds
.), 11th
Euro
pea
n Co
nfere
n
ce o
n
Mac
h
i
ne L
ear
nin
g
,
Barcel
ona, Sp
ain. 20
00:9-
17.
[22]
Schne
ider KM.
A compar
iso
n
of event mode
l
s
for Na
ive Ba
yes anti-sp
a
m
e-mail filter
in
g
. Proceed
ing
s
of the tenth confere
n
ce o
n
Europ
e
a
n
cha
p
ter of
the Associati
on for Comp
utation
a
l
Ling
uistics
.
Associati
on for
Computati
o
n
a
l
Ling
uistics,
Bu
dap
est, Hung
a
r
y
. 20
03; 1: 30
7-31
4.
Evaluation Warning : The document was created with Spire.PDF for Python.