TELKOMNIKA, Vol. 11, No. 6, June 2013, pp. 3213 ~ 3219
e-ISSN: 2087-278X

Received January 13, 2013; Revised April 5, 2013; Accepted April 20, 2013
Bursty Hot-Words Detection for Campus BBS
Geng Changxin*, Zhu Xiaoguang, Nie Peiyao, Lin Peiguang
School of Computer Science & Technology, Shandong University of Finance and Economics, Jinan 250014, China
*Corresponding author, e-mail: g_changxin@163.com
Abstract
In the monitoring of campus public opinion, hot words often reflect the latest bursty hot topics within a certain period of time. This paper therefore studies bursty hot-words detection in depth. In calculating word weights, we consider not only traditional features such as TF and IDF, but also burstiness, part of speech, length, location in the text and other factors. Accordingly, a measurement formula for burstiness and a weight calculating formula based on symphysic multi-features are proposed. The weight calculating formula can identify bursty hot words quickly and accurately and then discover bursty events, so that early warning of campus public opinion can be realized effectively.
Keywords: hot-words, bursty, weight
Copyright © 2013 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
With the rapid development of the Internet, people tend to express their real thoughts online. The Internet is gradually becoming the primary place for the generation and dissemination of public opinion, and it plays an increasingly important role in social life [1]. The number of Internet users is growing steadily; it exceeded 500 million and reached 513 million at the end of December 2011. Students are the largest group among Internet users, accounting for 30.2% [2]. This large population of university students reacts sensitively to many social phenomena, realities and issues. They like to exchange ideas with each other via BBS, blogs, microblogs and other information platforms, and they disseminate public spotlights, hot issues and major issues, whether international, domestic or on campus, by keeping abreast of events, posting, commenting and other means of communication. College students form a special group with a strong desire to express themselves through media. These characteristics make hot events spread at high speed, and the university campus easily becomes both a gathering place and a distribution center for negative public opinion.
Therefore, the following research topics are very important in college management:
- Standardized management and monitoring of network public opinion on the university campus.
- Identifying bursty hot words quickly and accurately from the huge, messy and disordered college network information, and then finding hot topics, especially the latest bursty hot topics.
- Controlling the trend of hot topics and correctly guiding college network public opinion in a healthy direction, thereby reducing the negative impact of the network.
The frequency of certain terms may change abruptly over time because of the appearance of sudden hot events, which is the emergence of hot words. At present, many in-depth studies have been carried out on aspects such as hot-word discovery and hot-word analysis. Zheng Kui et al. proposed an automatic discovery method for hot information on network public opinion based on ICTCLAS segmentation technology. This method reads news text, computes word frequency statistics after segmentation, removes stop words from the word frequency table, and merges multi-unit keywords to obtain a keyword list of hot information about bursty events. It achieves timely retrieval of network information and thus provides technical support for emergency decisions about bursty events [3]. Xue Feng et al. proposed a dynamic text model, the dynamic bursty vector space model, which can describe the dynamic attributes of text efficiently. They also proposed an online detection and tracking method that combines this model with a text clustering method [4].
Chen Lizhang et al. designed a public opinion analysis system that analyzes hot spots using a high-efficiency improved LC frequent pattern mining algorithm for data flow analysis. The system is based on a bypass mode of data flow distribution according to visitors' access to the forum. The content of forum posts is clustered by reduced forum theme through an incremental hierarchical clustering algorithm [5]. Wang Tai et al. presented a method to capture popular search words. By deploying an optical splitter on the Internet portal of the district, popular search words are extracted from the session content that is assembled from the data packets filtered from the optical splitter [6]. Li Yuqin et al. studied hot-word discovery and association techniques in depth. In the word discovery phase, they use named entity recognition and statistical techniques for high-frequency phrases to mine phrase strings, and then compute the hot-word weight on the basis of weight and weight fluctuations. In the hot-word association phase, hot words are divided according to the difference in their weight values, and hot-word relationships are computed from the principle of co-occurrence rate [7].
A hot word is a web vocabulary phenomenon that reflects a problem of widespread concern within a particular scope, such as a person's name, a place, an organization or another common phrase. Many network hot words are new words that have not yet been included in the dictionary. Hot words are usually characterized by frequent occurrence, wide distribution and sudden changes over time. In Figure 1, the word “word1” appears with high frequency but little fluctuation, so it belongs to the high-frequency words; the word “word2” suddenly grows at moment Tk and shows a rapid growth trend, so it belongs to the hot words; the word “word3” appears with low frequency and small fluctuations, so it belongs to the low-frequency words.
Figure 1. The Presenting Features of Hot Words
Therefore, the steps for extracting hot words are presented according to their features. We also develop a measurement method for burstiness and a weight calculation formula based on the integration of multiple features; finally, a text representation method with a dynamic space model is developed using this weight calculation formula.
2. Hot Words Analysis
The emergence of hot events leads to sudden changes in the frequency of certain words over time. Based on this feature, a zero-copy network packet capture platform [5, 8] combined with bypass listening is used to capture information from college forums in real time [9, 10]. By analyzing abnormal fluctuations of the words contained in the crawled posts, hot words can be identified, and bursty hot topics can then be discovered. The main steps are as follows:
- Perform word segmentation and frequency statistics on the captured text.
- Filter the words according to the relevant rules, removing meaningless words, symbols, etc.
- Measure the burstiness of words to identify hot words.
- Initialize the weight of each hot word using the weight calculation method based on symphysic multi-features, and then sort the hot words by weight. Select a certain number of hot keywords from the sorted list to constitute the hot keywords library.
2.1. Word Segmentation Process
Words are the smallest constituent units of a document, so lexical analysis is the foundation and a key step of information processing. In this paper, ICTCLAS, developed by the Institute of Computing Technology, Chinese Academy of Sciences, is used for word segmentation. The system is based on a cascaded Hidden Markov Model, and its key functions include Chinese word segmentation, part-of-speech tagging, named entity recognition, new word identification and support for user dictionaries. It offers high segmentation accuracy and efficiency [11]. During segmentation, the location of each word is marked and single-character words are removed; at the same time, words composed of two or more nouns are counted. A compound whose count exceeds a certain threshold is added to the keywords list as a noun. For example, “Wenchuan earthquake” is composed of two nouns, “Wenchuan” and “earthquake”; it is added to the keywords list when its number of occurrences exceeds the predefined threshold. After word segmentation, a list of keywords with part-of-speech attributes is generated. Hot words generally have a higher word frequency, so a lower threshold is set, and words whose frequency is below this threshold are removed from the keywords list.
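The counting rules of this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is taken to be already segmented and tagged (for example by ICTCLAS, which is not called here), the tag "n" is assumed to mark nouns, and the function name `build_keyword_list` and both threshold parameters are hypothetical.

```python
from collections import Counter


def build_keyword_list(tagged_posts, compound_min=2, freq_min=2):
    """Count words and noun compounds in already segmented posts.

    tagged_posts: list of posts, each a list of (word, pos) pairs.
    compound_min, freq_min: illustrative thresholds (assumptions).
    """
    word_counts = Counter()
    compound_counts = Counter()
    for post in tagged_posts:
        noun_run = []
        # A sentinel token flushes a noun run that ends the post.
        for word, pos in list(post) + [("", "END")]:
            if pos == "n":
                noun_run.append(word)
            else:
                if len(noun_run) >= 2:
                    # Two or more adjacent nouns form one compound.
                    compound_counts[" ".join(noun_run)] += 1
                noun_run = []
            if len(word) > 1:
                # Single-character words are dropped.
                word_counts[(word, pos)] += 1
    # Lower frequency threshold for ordinary words.
    keywords = {k: c for k, c in word_counts.items() if c >= freq_min}
    # Frequent compounds enter the list as nouns.
    for compound, count in compound_counts.items():
        if count >= compound_min:
            keywords[(compound, "n")] = count
    return keywords
```

With two posts that both contain the adjacent nouns "Wenchuan" and "earthquake", the compound "Wenchuan earthquake" reaches the compound threshold and is added as a noun, while words seen only once are dropped.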
2.2. Words Filter
After word segmentation, the keywords list contains a large number of words. Most of them contribute little to the post, so corresponding filter rules [12, 13, 14] need to be developed to filter the keywords list.
1. Part of Speech Filter
Different parts of speech play different roles in text representation. The semantics of a sentence is primarily expressed by nouns and verbs. Prepositions, conjunctions, adverbs and the like occur frequently in a document but carry no real meaning, such as “of, the, in, though, but”. For this reason, words whose part of speech is preposition, auxiliary word, etc. are discarded, and only nouns and verbs are retained.
2. Stop Words Filter
Meaningless words, punctuation, numbers and special symbols may occur in posts, such as “Ho, Aha, #, [, (”. These words and symbols can be added to a stop words list. In this paper, the list combines and extends the stop word list of Harbin Institute of Technology.
3. Similar Words Filter
Different expressions with similar or identical meanings may exist in the keywords list. Statistical methods can be used to merge synonyms such as “neglect” and “ignore”. By judging similarity and word frequency, the longer word can be retained when the frequencies are about the same, as with “Influenza A”, “Influenza A H1N1” and “Influenza A H1N1 flu”.
4. Rule Filter
The rule filter is generally used to filter useless strings with an obvious pattern, such as high-frequency combinations of numerals and quantifiers, and common meaningless prefixes and suffixes.
5. Background Noise Filter
Background noise consists of strings that are unrelated to the subject and meaningless but cannot be filtered by the stop words and rule filters, such as “Beijing daily news”, “one of those”, “at last”. The noise is large in volume and chaotic, so it cannot be sorted out manually. Therefore, a corpus needs to be collected as a training set so that a background noise library can be extracted by program.
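The five filters can be chained as in the following Python sketch. It is a simplified illustration, not the filter rules of the paper: the tag set, the tiny stop word and noise sets, the numeral pattern and the `tolerance` used to decide that two frequencies are "about the same" are all assumptions, and the similar-words step only handles the case where a shorter keyword is contained in a longer one.

```python
import re

KEEP_POS = {"n", "v"}                           # keep nouns and verbs (assumed tags)
STOP_WORDS = {"Ho", "Aha", "#", "[", "("}       # toy stop word list
BACKGROUND_NOISE = {"one of those", "at last"}  # toy background noise library
# Crude rule: a numeral optionally followed by one quantifier-like token.
NUMERAL_PATTERN = re.compile(r"^\d+(\.\d+)?\s*\w*$")


def merge_similar(keywords, tolerance=0.2):
    """Keep the longer of two nested keywords with similar frequency."""
    items = sorted(keywords.items(), key=lambda kv: len(kv[0][0]), reverse=True)
    kept = {}
    for (word, pos), freq in items:
        absorbed = any(
            word in longer and abs(freq - f) <= tolerance * max(freq, f)
            for (longer, _), f in kept.items()
        )
        if not absorbed:
            kept[(word, pos)] = freq
    return kept


def filter_keywords(keywords):
    """keywords: dict mapping (word, pos) -> frequency."""
    kept = {}
    for (word, pos), freq in keywords.items():
        if pos not in KEEP_POS:            # 1. part of speech filter
            continue
        if word in STOP_WORDS:             # 2. stop words filter
            continue
        if NUMERAL_PATTERN.match(word):    # 4. rule filter
            continue
        if word in BACKGROUND_NOISE:       # 5. background noise filter
            continue
        kept[(word, pos)] = freq
    return merge_similar(kept)             # 3. similar words filter
```

Running the similar-words step last keeps it from comparing against entries that the cheaper filters would have removed anyway; this ordering is a design choice of the sketch.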
2.3. Words Burstiness Measurement
Hot words are bursty: their word frequency grows abnormally within a certain period. For this reason, the word frequency distribution within a certain time period can be analyzed to determine whether a word is bursty.
The distribution of word frequency is treated as a function f(x) in Figure 2, in which the x-axis represents time and the y-axis represents the word frequency at a given moment. Now take the word frequency distribution within the time period Ti-Tj, and assume that the frequency of word t is Pi at moment Ti and Pj at moment Tj.
Figure 2. Word’s Frequency Distribution
Considering that the distribution of word frequency influences burstiness within a certain period of time, the cosine theorem can be used to measure the burstiness of word t in the period Ti-Tj, as in the following formula:

$$B(t) = 1 - \frac{T_j - T_i}{\sqrt{(T_j - T_i)^2 + (p_b - p_a)^2}} \qquad (1)$$
In formula (1), the word burstiness $B(t)$ is a value between 0 and 1. The greater the value, the greater the burstiness of the word, and conversely, the smaller the value, the smaller the burstiness; $T_j - T_i$ is the difference between the two time points, which can be measured in hours, days, etc., selected according to the actual situation; $p_b - p_a$ is the difference in word frequency between times $T_i$ and $T_j$.
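Formula (1) can be sketched in Python as follows, reading it as one minus the cosine of the angle between the time axis and the line joining the two frequency samples. The function and parameter names are illustrative, and the check that the interval is positive is an added assumption.

```python
import math


def burstiness(t_i, t_j, p_a, p_b):
    """Burstiness B(t) of a word between times t_i and t_j.

    p_a and p_b are the word frequencies at t_i and t_j.
    Returns a value in [0, 1): 0 for a flat frequency curve,
    approaching 1 as the frequency change dominates the interval.
    """
    dt = t_j - t_i
    if dt <= 0:
        raise ValueError("t_j must be later than t_i")
    dp = p_b - p_a
    # math.hypot(dt, dp) is sqrt(dt**2 + dp**2).
    return 1.0 - dt / math.hypot(dt, dp)
```

Because the frequency difference is squared, a falling frequency gives the same value as a rising one, so the direction of the change has to be checked separately.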
Hot words drift over time: the discussion of one topic later shifts to another topic, which is topic drift. The current word frequency of a topic that has drifted is lower than its word frequency before the drift. This paper focuses on extracting words from current bursty events, so drifted topics are outside our scope. Therefore, words whose later frequency is lower than their earlier frequency are removed from the keywords list. A lower threshold can then be set, and words whose burstiness is higher than this threshold are selected into the candidate hot words list. High-frequency words are usually more evenly distributed, so this method also removes high-frequency words during the selection of bursty words.
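A minimal Python sketch of this selection step is given below. It assumes burstiness is computed as one minus the cosine of the slope angle, as in formula (1); the function name, the input layout and the default threshold of 0.1 are illustrative assumptions.

```python
import math


def select_candidates(freqs, t_i, t_j, threshold=0.1):
    """Select candidate hot words for the period t_i..t_j.

    freqs: dict mapping word -> (frequency at t_i, frequency at t_j).
    Returns a dict mapping each selected word to its burstiness.
    """
    dt = t_j - t_i
    candidates = {}
    for word, (p_a, p_b) in freqs.items():
        if p_b < p_a:
            # Later frequency below the earlier one: a drifted topic.
            continue
        b = 1.0 - dt / math.hypot(dt, p_b - p_a)
        if b > threshold:
            candidates[word] = b
    return candidates
```

An evenly distributed high-frequency word, for example one moving from 80 to 82 over the period, has a burstiness close to 0 and is therefore dropped together with the drifted words.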
2.4. Weight Calculating Method based on Symphysic Multi-features
Each candidate hot word carries useful information, such as TF, IDF, burstiness, part of speech, location and length [10]. Therefore, the following factors need to be considered when generating hot-word weights.
1. TF: term frequency, representing how often a word occurs. The greater the TF, the higher the degree of concern for the word.
2. IDF: inverse document frequency, a measure of whether the term is common or rare across all documents. A greater IDF indicates that the word is more discriminative and more relevant to the subject.
3. Burstiness: hot words are characterized by abnormal growth in a short time, so burstiness is introduced to measure the growth of words within the period of time.
4. POS: part of speech. Named entities in posts, such as person names, place names and institution names, contribute much more than non-named entities to distinguishing topics, so named entities are given an increased weight, with verbs taking second place.
5. Location: words in different locations contribute differently to the entire post. Words in the post title contribute the most; words in the first and last paragraphs of the body text also contribute more; the first and last sentences of each paragraph in the body text contribute as well; words in replies follow. Therefore, words in different locations have different weights.
6. Length: the longer a word is, the more information it carries.
Considering the above factors, the weight calculation formula is constructed as follows:

$$w(t,d) = a \cdot \frac{TF(t,d) \cdot \log\left(\frac{N}{DF(t)} + L\right) \cdot weight(POS(t))}{\sqrt{\sum_{t \in D} \left[ TF(t,d) \cdot \log\left(\frac{N}{DF(t)} + L\right) \cdot weight(POS(t)) \right]^2}} + b \cdot B(t) \cdot \frac{length(t)}{avglen} \cdot weight(position(t,d)) \qquad (2)$$
In formula (2), a and b are adjustment coefficients, 0 < a, b < 1, a + b = 1; $w(t,d)$ denotes the weight of word t in post d; D is the whole set of posts; [...] is the adjustment coefficient of the location weight; $TF(t,d)$ denotes the occurrence frequency of word t in post d; $DF(t)$ denotes the number of posts that include word t; L is an empirical constant; $weight(POS(t))$ is the POS weight of word t, generally initialized to 2 for named entities and 1.5 for verbs; $length(t)$ denotes the length of word t; $avglen$ denotes the average length of keywords; $weight(position(t,d))$ is the location weight of word t in post d; $B(t)$ is the weight of the burstiness factor, calculated by formula (1).
After the weights of the above words are calculated, a certain number of words are chosen in descending order of weight to construct the hot words feature library $K_i$ for a certain period of time.
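The weighting and selection steps can be sketched in Python as follows. This follows one plausible reading of formula (2): a normalized TF-IDF term scaled by the POS weight, plus a term combining burstiness, relative length and location weight, mixed by a and b. Several points are assumptions rather than statements of the paper: N is taken to be the total number of posts, L defaults to 0.01, words that are neither named entities nor verbs get POS weight 1.0, the adjustment coefficient of the location weight is folded into `position_weight`, and all names are hypothetical.

```python
import math

# POS weights: 2 for named entities and 1.5 for verbs follow the text;
# 1.0 for every other tag is an assumption.
POS_WEIGHT = {"ne": 2.0, "v": 1.5}


def tfidf_pos(tf, df, n_posts, pos, L=0.01):
    """TF * log(N / DF + L) * weight(POS) for one word."""
    return tf * math.log(n_posts / df + L) * POS_WEIGHT.get(pos, 1.0)


def hot_word_weights(words, n_posts, avglen, a=0.5, b=0.5, L=0.01):
    """Weight each candidate word following one reading of formula (2).

    words: list of dicts with keys word, tf, df, pos, burstiness and
    position_weight. a + b should equal 1.
    """
    raw = [tfidf_pos(w["tf"], w["df"], n_posts, w["pos"], L) for w in words]
    # Euclidean norm over all candidates; fall back to 1.0 if it is zero.
    norm = math.sqrt(sum(r * r for r in raw)) or 1.0
    weights = {}
    for w, r in zip(words, raw):
        feature = w["burstiness"] * len(w["word"]) / avglen * w["position_weight"]
        weights[w["word"]] = a * r / norm + b * feature
    return weights


def top_k(weights, k):
    """Return the k highest-weighted words, in descending order of weight."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Two words with identical frequency statistics then differ only through burstiness, length and location, which is the effect the multi-feature weight is meant to add on top of TF-IDF.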
3. Results and Discussion
The experimental data is collected at the entrance of the college network. The college forum is captured through this entrance twice a day for thirty consecutive days, which yields thirty days of historical data. For each collected document, word segmentation and word frequency statistics are performed first, and the result is then filtered according to the rules in section 2.2, which gives a candidate hot keywords list.
3.1. Results and Discussion – Burstiness Measurement
Because the keywords list contains too many words, this paper takes only two words, “The Olympic Games” and “Zhou Jun”, as samples to measure word burstiness within five days. For convenience, both axes are normalized: word frequency is normalized to 0-100, and time is normalized to 0-100 as well. Figure 3 shows the normalized word frequency distribution.
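The text does not say how the normalization is performed. One common choice is min-max scaling, sketched below in Python as an assumption rather than the procedure actually used; the function name is hypothetical.

```python
def normalize_0_100(values):
    """Scale a sequence of numbers linearly onto the range 0-100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series has no spread; map it to 0.
        return [0.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]
```

Applying the same function to the time stamps and to the word frequencies puts both axes on a common scale, so that the ratio in formula (1) does not depend on the original units.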
Figure 3. Word’s Frequency Distribution after Normalization
Now calculate the burstiness of these two words within the interval 40-90 by formula (1):

The burstiness of the word “The Olympic Games”:

$$B(t) = 1 - \frac{T_j - T_i}{\sqrt{(T_j - T_i)^2 + (p_b - p_a)^2}} = 0.019419$$

The burstiness of the word “Zhou Jun”:

$$B(t) = 1 - \frac{T_j - T_i}{\sqrt{(T_j - T_i)^2 + (p_b - p_a)^2}} = 0.292893$$
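As a consistency check on these two values, assume formula (1) is read as one minus the cosine of the slope angle. The interval 40-90 gives $T_j - T_i = 50$, and the reported results then correspond to normalized frequency differences of 50 and 10. These two differences are inferred from the reported values and are not stated in the text.

```latex
\begin{align*}
B(\text{Zhou Jun}) &= 1 - \frac{50}{\sqrt{50^2 + 50^2}} = 1 - \frac{1}{\sqrt{2}} \approx 0.292893 \\
B(\text{The Olympic Games}) &= 1 - \frac{50}{\sqrt{50^2 + 10^2}} = 1 - \frac{50}{\sqrt{2600}} \approx 0.019419
\end{align*}
```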
In Figure 3, the word “Zhou Jun” is visibly burstier than “The Olympic Games”, which is consistent with the calculated results. Therefore, the burstiness measurement formula proposed in this article meets practical requirements.
3.2. Results and Discussion - The Weight Calculating Formula based on Symphysic Multi-features
For the candidate hot words list obtained by burstiness measurement, this article first calculates hot-word weights using the traditional TF-IDF weight formula and the weight formula based on symphysic multi-features respectively, and then runs extraction experiments on the background corpus based on the calculated results. Because of the size of the background corpus, we cannot verify the keyword extraction result for every document. Therefore, 500 documents are drawn at random, and the extraction results of the two methods for hot keywords are verified by program, analyzed and compared. The comparison results are shown in Figure 4.
Figure 4. Effect Comparison of Two Different Extraction Methods
With the traditional TF-IDF method, the average precision is 76.9% and the average recall is 72.3%; with the weight calculating method based on symphysic multi-features, the average precision is 90.7% and the average recall is 85.7%. Compared with the traditional TF-IDF method, the method based on symphysic multi-features raises precision by about 13.8 percentage points and recall by about 13.4 percentage points. The experimental results show that the weight calculation method based on symphysic multi-features clearly outperforms the traditional TF-IDF method.
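For reference, precision and recall for one document can be computed as in the Python sketch below, using the standard definitions. The text does not describe how the reference keywords were obtained or how the per-document values were averaged, so the function and its inputs are illustrative only.

```python
def precision_recall(extracted, relevant):
    """Precision and recall of extracted keywords against a reference set."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Gains reported in the text, in percentage points.
precision_gain = round(90.7 - 76.9, 1)
recall_gain = round(85.7 - 72.3, 1)
```

The two gains are differences between percentages, so they are percentage points rather than relative improvements.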
4. Conclusion
With the rapid expansion of the Internet, the discovery and tracking of bursty hot topics on college networks has become an important means of regulating network public opinion. Identifying hot words quickly and accurately is the premise of discovering hot events. Based on the burstiness of hot words, this paper developed a method for measuring burstiness. In calculating hot-word weights, it considers not only occurrence frequency and burstiness over time, but also features of a word's occurrence in a post, such as location, POS and length. Research on these key technologies lays a foundation for detecting bursty hot events and, ultimately, for effective and accurate early warning of college network public opinion.
Certainly, the burstiness measurement method proposed in this paper is not perfect. For example, the measurement considers only the amount of growth within a period of time, not the growth rate of the words. In practical applications, definite integration could be used to measure the growth rate of word frequency, which needs further study.
Acknowledgments
This work is supported by the Ministry of Education Humanities and Social Sciences Project (10YJC880076) and the Shandong Province Natural Science Foundation Project (ZR2010FL008).
References
[1] CHEN Hua, LIANG Xun, RUAN Jin. Design and Implementation of correlation analysis in Cyberworld opinion. NCIRCS’2007. 2007; 45-49.
[2] The 29th China Internet network development state statistic report. China Internet Network Information Center. 2012.
[3] ZHENG Kui et al. Hot Spot Information Auto-detection Method of Network Public Opinion. Computer Engineering. 2010; 36(3): 4-6.
[4] XUE Feng, ZHOU Yadong, GAO Feng. An Online Detection and Tracking Method for Bursty Topics. Journal of Xi’an Jiaotong University. 2011; 45(12): 64-69.
[5] CHEN Lizhang, LI Bin, CHEN Xiaopeng. Design and Realization of Monitoring System of Campus BBS Public Opinion. Microprocessors. 2012; 2(1): 40-48.
[6] WANG Tai, JIANG Guangrong, YU Lixia. Capturing and analyzing popular search words in micro district. Computer Engineering and Design. 2012; 33(2): 556-560.
[7] LI Yuqin, SUN Lihua. Hot-Word Detection for Internet Public Sentiment. Journal of Chinese Information Processing. 2011; 25(1): 48-59.
[8] WANG Meng, LI Bin, SUN Chunqi. Research of Network Public Opinion Hotspots Detection Based on Frequent Items Mining. Microcomputer Information. 2010; 26(12-3): 35-38.
[9] Nyoman Rizkha Emillia, Suyanto, Warih Maharani. Isolated Word Recognition Using Ergodic Hidden Markov Models and Genetic Algorithm. TELKOMNIKA Indonesian Journal of Electrical Engineering. 2012; 10(1): 129-136.
[10] Abeer El-Korany, Salma Mokhtar Khatab. Ontology-based Social Recommender System. IAES International Journal of Artificial Intelligence. 2012; 1(3): 127-138.
[11] LUO Huixia. The Network Public Opinion Monitoring System Research And Exploitation. Dissertation. Taiyuan: North University of China; 2010.
[12] LI Hengxun. Key Technology Research on Web Forums Crawling and Hot Topic Detection. Dissertation. Beijing: Capital Normal University; 2011.
[13] ZENG Yiling, XU Hongbo. Research on Internet hotspot information detection. Journal on Communication. 2007; 12(28): 141-146.
[14] LAN Kaimei. BBS Hot Topic Detection and Monitoring System. Dissertation. Beijing: Beijing Jiaotong University; 2011.