TELKOMNIKA, Vol. 12, No. 4, December 2014, pp. 905~910
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v12i4.536
Received August 23, 2014; Revised October 28, 2014; Accepted November 15, 2014
Feature Selection Method Based on Improved Document Frequency
Wei Zheng^1*, Guohe Feng^2
^1 College of Science, Hebei North University, Zhangjiakou 075000, Hebei, China
^2 School of Economics & Management, South China Normal University, Guangzhou 510006, Guangdong, China
*Corresponding author, e-mail: 43399078@qq.com
Abstract
Feature selection is an important part of the text classification process, and the evaluation function has a direct impact on the quality of feature selection. Document frequency (DF) is one of several commonly used feature selection methods; its shortcoming is the lack of a theoretical basis for the construction of its evaluation function, so it tends to select high-frequency words. To solve this problem, we put forward an improved algorithm named DFM that combines DF with the class distribution of characteristics, and we realize the algorithm in a program. DFM was compared experimentally with several commonly used feature selection methods, using a support vector machine as the text classifier. The results show that the performance of the DFM method is stable during feature selection and is better than the other methods in classification results.
Keywords: feature selection, document frequency, text classification
1. Introduction
Classification technology automatically assigns a new document into one or more predefined classes based on its contents. With the development of the WWW in recent years, text categorization (TC) has become one of the key techniques for handling and organizing text data, and the technology is used extensively in spam mail filtering and in the classification of Web pages and documents. Therefore, it is necessary and meaningful to study the key technology of text categorization in order to improve the speed and accuracy of categorization. Some classification algorithms used in TC are: Support Vector Machines, k-Nearest Neighbor (kNN) and naive Bayes [1],[2]. A major difficulty of text categorization is the high dimensionality of the original feature space. Feature selection is an important method to reduce the number of features in text categorization, and its goal is to improve classification effectiveness and computational efficiency. Currently, the principle of operation of feature selection methods is to compute a score for each feature word using statistical knowledge, sort the feature words accordingly, and then select the features with higher scores to act as the final document features. Some well-known methods are document frequency (DF), information gain (IG), expected cross entropy (ECE), the weight of evidence of text (WET), the chi-square statistic (CHI) and so on [3],[4],[5], and it is highly desirable to reduce the feature space without loss of classification accuracy.
Document frequency (DF) thresholding, the simplest method with the lowest computational cost, has been shown to behave well when compared to more elaborate methods; it can be reliably used instead of IG or CHI when the computation of those measures is too expensive. An experiment in literature [6] was carried out on the feature extraction performance of DF, IG, WET and CHI, and the experimental results show that the DF method has its own advantages: it is easy to realize, it is widely used in both Chinese and English text categorization, and its feature selection performance is good compared with other feature selection methods. However, this method overlooks the useful low-frequency feature words in each category and does not consider the contributions to each category, so it is usually considered an empirical method to improve efficiency, not a principled criterion for selecting predictive features.
In this paper we propose a feature selection method based on improved document frequency, named DFM, derived from the original DF definition. DFM overcomes the shortcomings of DF, namely overlooking the useful low-frequency feature words in each category and not considering the contributions to each category. Experiments on a Chinese text data collection collected by Fudan University show the performance of the DFM method.
The rest of this paper is organized as follows. Section 2 describes the term selection methods commonly used and gives an improved document frequency method, DFM. Section 3 discusses the classifier used in the experiment to compare DFM with other text feature selection methods, and presents the experiment's results and analysis. In the last section, we give the conclusion and future work.
2. Research Method
In this section we summarize and reexamine the feature selection methods DF, IG, ECE and CHI, which are commonly used in feature selection for text categorization, and we implement a new feature selection method, DFM (Document Frequency Modified), whose evaluation function is based on the document frequency (DF) method.
2.1. Feature Selection Methods
The following definitions of DF, IG, ECE and CHI are taken from [7],[8], and they will be introduced briefly.
1. Document Frequency (DF)
Document frequency is the number of documents in which a term occurs. In text categorization, according to a preset threshold, the term is retained or removed. DF is the simplest technique for feature reduction. It scales easily to very large corpora, with a computational complexity approximately linear in the number of training documents [8].
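As a concrete illustration of DF thresholding, the count can be taken in one pass over tokenized documents. This is our own sketch, not code from the paper; the function names are ours:

```python
from collections import Counter

def df_scores(docs):
    """Document frequency: the number of documents each term occurs in.
    `docs` is a list of token lists."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return df

def df_select(docs, threshold=2):
    """Keep only terms whose document frequency meets the threshold."""
    return {t for t, n in df_scores(docs).items() if n >= threshold}

docs = [["cat", "dog"], ["cat", "fish"], ["cat", "dog", "bird"]]
print(df_scores(docs)["cat"])   # 3
print(sorted(df_select(docs)))  # ['cat', 'dog']
```

A single pass over the corpus suffices, which matches the linear complexity noted above.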
2. Information Gain (IG)
Information gain measures the amount of information obtained for category prediction by knowing the presence or absence of a term in category documents. Let C = {c_1, c_2, ..., c_m} be the training set of categories. The information gain of term t is defined as follows:
IG(t) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{m} P(c_i|t) \log P(c_i|t) + P(\bar{t}) \sum_{i=1}^{m} P(c_i|\bar{t}) \log P(c_i|\bar{t})    (1)
All the feature terms are scored according to formula (1), and those whose information gain is less than some predetermined threshold are removed from the feature space.
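Formula (1) can be estimated from training counts. The sketch below is our own illustration; the function name and the count-based (maximum-likelihood) probability estimates are assumptions, not the authors' code:

```python
import math

def information_gain(docs, labels, term):
    """IG of `term` per formula (1): class entropy minus the conditional
    entropy of the class given the term's presence/absence."""
    n = len(docs)
    present = [term in doc for doc in docs]
    p_t = sum(present) / n
    # First sum of (1): the entropy of the class distribution
    ig = -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
              for c in set(labels))
    # Second and third sums: conditioned on term presence / absence
    for flag, p_flag in ((True, p_t), (False, 1.0 - p_t)):
        if p_flag == 0:
            continue
        for c in set(labels):
            joint = sum(1 for f, l in zip(present, labels)
                        if f == flag and l == c) / n
            if joint > 0:
                ig += p_flag * (joint / p_flag) * math.log2(joint / p_flag)
    return ig

# A term that perfectly predicts the class carries the full class entropy:
print(information_gain([["a"], ["a"], ["b"], ["b"]], [0, 0, 1, 1], "a"))  # 1.0
```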
3. Expected Cross Entropy (ECE)
Cross entropy is also known as the KL distance. It reflects the probability distribution of text topic classes and can compute the distance between a specific term and a text topic class under the condition of the probability distribution. If the cross entropy of a term is bigger, its effect on the distribution of text topic classes is bigger. The difference from information gain is that ECE considers the relation of word occurrence and categories by only calculating over the terms that appear in the text.
ECE(t) = P(t) \sum_{i=1}^{m} P(c_i|t) \log \frac{P(c_i|t)}{P(c_i)}    (2)
4. Chi Statistic (CHI)
The chi statistic method measures the lack of independence between the term and the category. If term t and category c_i are independent, then CHI is 0.
CHI(t, c_i) = \frac{[P(t, c_i) P(\bar{t}, \bar{c_i}) - P(\bar{t}, c_i) P(t, \bar{c_i})]^2}{P(t) P(\bar{t}) P(c_i) P(\bar{c_i})}    (3)
If there are m classes, then each term will have m correlation values; the average value for a term is calculated as follows:
CHI(t) = \sum_{i=1}^{m} P(c_i) \, CHI(t, c_i)    (4)
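Formulas (3) and (4) can likewise be estimated from counts. The sketch below is our own; the count-based probability estimates and function names are assumptions, not the paper's implementation:

```python
def chi_square(docs, labels, term, c):
    """CHI(t, c_i) per formula (3), built from joint and marginal probabilities."""
    n = len(docs)
    p_tc = sum(1 for d, l in zip(docs, labels) if term in d and l == c) / n
    p_t = sum(1 for d in docs if term in d) / n
    p_c = labels.count(c) / n
    denom = p_t * (1 - p_t) * p_c * (1 - p_c)
    if denom == 0:
        return 0.0  # term or class is constant: no evidence either way
    p_t_nc = p_t - p_tc            # term present, other class
    p_nt_c = p_c - p_tc            # term absent, this class
    p_nt_nc = 1 - p_t - p_c + p_tc
    return (p_tc * p_nt_nc - p_nt_c * p_t_nc) ** 2 / denom

def chi_avg(docs, labels, term):
    """Class-averaged CHI per formula (4)."""
    n = len(docs)
    return sum((labels.count(c) / n) * chi_square(docs, labels, term, c)
               for c in set(labels))

docs, labels = [["a"], ["b"], ["a"], ["b"]], [0, 0, 1, 1]
print(chi_square(docs, labels, "a", 0))  # 0.0: "a" is independent of class 0
```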
The above four methods are the most common in experiments. The difference between ECE and IG is that ECE only considers the effect on a category when the word appears in its documents. The DF method is simple, and its complexity is lower. The CHI method shows that the greater the CHI statistic value, the stronger the correlation between features and categories.
The experiments in literature [3] show that IG and CHI are the most effective in English text classification, with DF following. Experiments prove that DF can be applied to large-scale calculation instead of CHI, whose complexity is larger. Literature [7],[9] points out that the IG, ECE and CHI methods have the same effect on feature extraction in Chinese text classification, followed by DF.
DF is one of the simplest feature term extraction methods, and its extraction cost has a linear relationship with the corpus size. We can see that when a term belongs to more than one class, the evaluation function gives it a high score; however, if the term belongs to a single category, its lower frequency of occurrence leads to a lower score. The DF evaluation function is based on the hypothesis that a rare term does not contain useful information, or that its information is too little to exert a useful influence on classification. However, this assumption conflicts somewhat with the general information-theoretic view. In information theory, there is a point of view that some rare terms with a greater amount of information can reflect the category information better than high-frequency words, and therefore those terms should not be arbitrarily removed; so a choice made using only the DF method will lose some valuable features. The document frequency method is easy to implement and simple, and its effect is similar to other methods in Chinese and English text classification.
Aiming at this shortcoming of the DF method, we present an improved feature selection method based on DF.
2.2. A Feature Selection Method Based on Improved Document Frequency
From the research of literature [10],[11], this paper concludes that terms meeting the following three requirements are helpful for classification:
1. Concentration degree: in a corpus of many categories, if a feature term appears in one or a few categories, but not in the text of other categories, the term's representation ability is strong and it is helpful for text classification.
2. Disperse degree: if a term appears throughout a category, it has a strong correlation with the category. That is, a feature term is more helpful to classification when it is dispersed in a large portion of the text of a category.
3. Contribution degree: if a feature term's correlation with a certain category is stronger, its amount of information is greater and it is valuable for classification.
This article uses document frequency (DF) and adopts the following methods to quantitatively describe the above three principles:
(1) Concentration degree: expressed using the following formula; the bigger the ratio, the more concentrated the term is in the class.

DF(t, c_i) / (1 + \sum_{j=1, j \neq i}^{m} DF(t, c_j))    (5)
(2) Disperse degree: there are m different classes, C = {c_1, c_2, ..., c_m}. Using DF(t, c_i) / N(c_i) as the expression, where N(c_i) is the text amount of c_i: the bigger the ratio, the bigger the term's disperse degree.
(3) Contribution degree: expected cross entropy (ECE) considers the relationship between feature appearance and categories, so the information that a feature contributes to a category can be obtained through calculation. This article uses a simplified formula of ECE to examine the contribution to a category. The simplified formula is:
P(t, c_i) \log \frac{P(c_i|t)}{P(c_i)}    (6)
In this paper, we implement a new feature selection method, DFM (Document Frequency Modified), whose evaluation function is based on the document frequency (DF) method; the concentration degree, disperse degree and contribution degree are introduced in DFM. The DFM evaluation function is as follows:
DFM(t, c_i) = \frac{DF(t, c_i)}{1 + \sum_{j=1, j \neq i}^{m} DF(t, c_j)} \cdot P(t|c_i) \cdot P(t, c_i) \log \frac{P(c_i|t)}{P(c_i)}    (7)
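Formula (7) can be estimated directly from raw counts. The sketch below is our own illustration, not the authors' implementation; the function name and the maximum-likelihood estimates of the probabilities are assumptions:

```python
import math
from collections import Counter

def dfm_score(docs, labels, term, c):
    """DFM(t, c_i) per formula (7): concentration * dispersion * contribution.
    `docs` is a list of token lists; `labels` gives each document's category."""
    n = len(docs)
    df_in = Counter()                      # DF(t, c_j) for each category c_j
    for d, l in zip(docs, labels):
        if term in d:
            df_in[l] += 1
    # Concentration degree, formula (5): DF in c_i vs. DF in all other classes
    concentration = df_in[c] / (1 + sum(v for l, v in df_in.items() if l != c))
    # Disperse degree: DF(t, c_i) / N(c_i), i.e. P(t | c_i)
    dispersion = df_in[c] / labels.count(c)
    # Contribution degree, formula (6): P(t, c_i) * log(P(c_i | t) / P(c_i))
    p_tc, p_t, p_c = df_in[c] / n, sum(df_in.values()) / n, labels.count(c) / n
    if p_tc == 0:
        return 0.0
    contribution = p_tc * math.log((p_tc / p_t) / p_c)
    return concentration * dispersion * contribution
```

Terms can then be ranked per category by this score, in the same way a plain DF threshold would be applied.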
2.3. Data Collections
The experimental data used in this paper is from the Chinese natural language processing group in the Department of Computing and Information Technology at Fudan University. The training corpus is "train.rar", which has 20 categories and includes about 9804 documents; "test.rar", which includes about 9833 documents, is used for testing. We choose only some of the documents for our experiments in consideration of the efficiency of the algorithm. Table 1 shows the specific quantity of samples in each category we chose.
Table 1. Experimental Data
Experiment environment: the CPU is an Intel Pentium Dual-Core processor, Intel G2020; memory is 2 GB DDR3; Windows XP + Microsoft Visual C++.
2.4. Performance Measure
To evaluate the performance of a text classifier, we use the F1 measure put forward by Rijsbergen (1979) [12]. This measure combines recall and precision as follows:
Recall = (number of correct positive predictions) / (number of positive examples)    (8)

Precision = (number of correct positive predictions) / (number of positive predictions)    (9)

F1 = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision}    (10)
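Formulas (8) to (10) translate directly into code. This is a generic sketch (the function name is ours), treating one class as the positive class:

```python
def f1_score(y_true, y_pred, positive):
    """F1 per formulas (8)-(10) for one class treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    pred_pos = sum(1 for p in y_pred if p == positive)    # denominator of (9)
    actual_pos = sum(1 for t in y_true if t == positive)  # denominator of (8)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return (2 * recall * precision / (recall + precision)
            if recall + precision else 0.0)

print(f1_score([1, 1, 0, 0], [1, 0, 1, 0], positive=1))  # 0.5
```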
3. Results and Analysis
Our objective is to compare the DF, IG, ECE and CHI methods with the DFM method. A number of statistical classification and machine learning techniques have been applied to text categorization; we use the SVM classifier. We choose SVM because evaluations have shown that it outperforms all the other classifiers.
Category         Quantity of training documents   Quantity of testing documents
C1-Computer      134                              66
C2-Environment   134                              67
C3-Education     147                              73
C4-Medical       136                              68
C5-Traffic       143                              71
In all           694                              345
Figure 1 shows the selection performance using SVM on the Fudan corpus after feature selection with DF, IG, ECE, CHI, and DFM. It can be seen in Figure 1 that the DFM method outperforms the DF method.
Figure 1. The performance (F1 value against the number of category features) of the five feature selection methods
A further observation emerges from the categorization results of SVM. From Figure 1, we find that the F1 value of DF rises along with the increase of the characteristic dimension. IG and ECE produce similar classification performance, because ECE is a simplified version of IG and only takes into account the condition of feature terms that appear in the corresponding category. CHI and DFM are the most effective in our experiment, and the CHI classification F1 value is very stable over the course of the classification. The curve of the feature selection method DFM is significantly higher than the other methods while the characteristic dimension is between 5000 and 8000; especially when the characteristic dimension is 8000, Figure 1 shows a maximum point whose F1 value is 99.133%. The classification effect of DFM is better than the other four feature selection methods. The extreme values of the five feature selection methods are shown in Table 2 with precision, recall and F1.
From Table 2, we can notice that all five feature selection methods show good performance, and DFM gets the best categorization performance with an F1 value of 99.133%.
Table 2. The best performance of the five feature selection methods
From Figure 1 and Table 2, we can see that DFM can extract category characteristics in Chinese text classification and improve the classification accuracy, and it has stability in feature extraction.
4. Conclusion
This paper has proposed an improved feature selection method based on DF, named DFM. DFM implements three principles, which are concentration degree, disperse degree and contribution degree. The experiments have shown that DFM is an effective method to extract category characteristics for feature selection, and it can effectively improve the performance of
Method   Recall     Precision   F1
DF       95.868%    95.875%     95.871%
IG       98.226%    98.269%     98.233%
ECE      98.284%    98.287%     98.280%
CHI      98.233%    98.289%     98.261%
DFM      99.121%    99.146%     99.133%
text categorization. In the future, we will continue to work on the study of the contribution of category characterization.
Acknowledgements
This work was supported by the National Foundation of Social Science (No: 08CTQ003), and the Foundation of Hebei North University Natural Science for Young (2010).
References
[1] Yiming Yang, Xin Liu. A Re-Examination of Text Categorization Methods. Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 1999: 42-49.
[2] McCallum A, Nigam K. A Comparison of Event Models for Naive Bayes Text Classification. Proceedings of the AAAI-98 Workshop on Learning for Text Categorization. 1998: 41-48.
[3] Yang Y, Pedersen JO. A Comparative Study on Feature Selection in Text Categorization. Proceedings of the 14th International Conference on Machine Learning. 1997: 412-420.
[4] Mladenic D, Grobelnik M. Feature Selection for Unbalanced Class Distribution and Naive Bayes. Proceedings of the Sixteenth International Conference on Machine Learning. 1999: 258-267.
[5] Forman G. Experimental Study of Feature Selection Metrics for Text Categorization. Journal of Machine Learning Research. 2003: 1289-1305.
[6] KaiFeng Yang, YiKun Zhang, Yan Li. A feature selection method based on document frequency. Computer Engineer. 2010; 26(17): 33-35.
[7] Mladenic D, Grobelnik M. Feature Selection for Unbalanced Class Distribution and Naive Bayes. Proceedings of the Sixteenth International Conference on Machine Learning. 1999: 258-267.
[8] Forman G. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research. 2003; 3(1): 1289-1305.
[9] Yan Qiu Chen, Nixon MS, Damper RI. Implementing the k-nearest neighbour rule via a neural network. Proceedings IEEE International Conference on Neural Networks. 1995; 1: 136-140.
[10] DeDi Tai, Jun Wang. Improved Feature Weighting Algorithm for Text Categorization. Computer Engineer. 2010; 36(17): 33-35.
[11] Yong Lou. An Improved Study on Feature Selection of Text Categorization Based on Mutual Information. FuJian Computer. 2009; 12(4): 82-83.
[12] Yang YM. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval. 1999; 1: 67-88.