TELKOM
NIKA Indonesia
n
Journal of
Electrical En
gineering
Vol.12, No.5, May 2014, pp
. 4038 ~ 40
4
9
DOI: http://dx.doi.org/10.11591/telkomni
ka.v12i5.4170
4038
Re
cei
v
ed O
c
t
ober 1
5
, 201
3; Revi
se
d Decem
b
e
r
28, 2013; Accept
ed Ja
nua
ry 1
7
, 2014
Community Detection Based on Topic Distance in
Social Tagging Networks
Hong
tao Liu,
Hui Chen*,
Mao Lin, Yu Wu
Schoo
l of Com
puter Scie
nce
and T
e
chno
log
y
, Ch
ong
qi
ng
Univers
i
t
y
of Posts and T
e
lecommunic
a
tio
n
s
,
Cho
ngq
in
g, 40
006
5
*Corres
p
o
ndi
n
g
author, e-ma
i
l
: chenh
ui
88h
a
p
p
y
@
g
mai
l
.co
m
A
b
st
r
a
ct
Rese
arch o
n
the co
mmunity
detectio
n
in s
o
cial
ta
ggi
ng n
e
t
w
o
rks has attracted
muc
h
attentio
n i
n
the last
deca
d
e
. Extracting t
he h
i
d
den t
opi
c infor
m
ati
on f
r
om ta
gs pr
ovi
des a
new
w
a
y of thinki
ng f
o
r
community d
e
t
e
ction
in soc
i
al
taggi
ng n
e
tw
orks. In this
pap
er, a topic ta
gg
ing
netw
o
rk by
extracting s
e
ve
ral
topics fro
m
th
e
tags thro
ug
h
usin
g t
he
Late
n
t Diric
hlet A
llo
cation (
L
DA)
mode
l is
bui
lt fir
s
tly. T
hen a
to
pic
distanc
e betw
e
en users is d
e
fine
d,
w
h
ich de
pen
ds on the
book
markin
g relati
onsh
i
ps b
e
tw
een users a
n
d
tags. F
u
rther, a mo
dul
arity
clusterin
g
ap
proac
h bas
ed
on the topic
distance is
p
r
opos
ed to de
tect
communiti
es i
n
soci
al tag
g
i
n
g netw
o
rks. Emp
i
rica
l
stud
ie
s on re
al-w
orl
d
netw
o
rks d
e
m
o
n
strate that
the
prop
osed
meth
od can effectiv
ely detect
comm
unities in tagging networks.
Ke
y
w
ords
: co
mmu
n
ity detect
i
on, tagg
in
g ne
tw
orks,
topic di
stance, modu
la
rity optimi
z
a
t
i
o
n
Copy
right
©
2014 In
stitu
t
e o
f
Ad
van
ced
En
g
i
n
eerin
g and
Scien
ce. All
rig
h
t
s reser
ve
d
.
1. Introduc
tion
Comm
unity detection p
r
ovi
des a
n
impo
rtant way to further u
nde
rst
and an
d appl
y social
netwo
rks, th
at is, identifying commu
ni
ties o
r
group
s in
whi
c
h n
ode
s a
r
e d
e
n
sely
con
n
e
c
ted
insid
e
while
l
oosely conn
e
c
ted
outsi
de t
h
rou
gh
so
me
metho
d
s. A
s
one
succe
ssful kin
d
of
so
cial
netwo
rks, so
cial
ta
gging
netwo
rks su
ch
a
s
Del.
ici
o
.us,
CiteULike
and
Flickr h
a
ve devel
op
ed
fastly in the past seve
ra
l years. In social
tag
g
ing
netwo
rks, u
s
ers
can e
a
s
ily use tag
s
to
orga
nize,
sh
are and retrieve
onlin
e resou
r
ces, which ha
s
dif
f
erent
me
ani
ng
in differe
nt
environ
ment
s, for exa
m
ple
,
Web
pa
ge
s
in Del.icio
.u
s,
re
se
arch
pa
pers i
n
CiteULike
an
d p
hot
os
in Flickr. At the sam
e
time
, those taggin
g
netwo
rks h
a
ve cre
a
ted l
a
rge a
m
ou
nts of tagging d
a
ta
whi
c
h have
attracte
d mo
re an
d mo
re re
sea
r
ch
e
r
s to pay at
tention to id
entify potential
comm
unity structu
r
e in
so
cial tagging n
e
t
works.
At prese
n
t, there a
r
e ma
ny different community de
tection app
ro
ach
e
s, and t
w
o kin
d
s
among
them
are
appli
ed
more
[1]. Th
e first i
s
t
r
adi
tional top
o
log
y
-based
co
m
m
unity dete
c
tion
approa
ch, which m
a
p
s
th
e real
wo
rld
netwo
rk i
n
to a
gra
ph stru
cture
with nod
es
rep
r
esenti
n
g
use
r
s in
re
al
wo
rld
netwo
rk and
ed
ge
s
rep
r
e
s
entin
g the i
n
tera
ction rel
a
tion
betwe
en
use
r
s.
Comm
unity detectio
n
ap
proa
ch b
a
se
d on grap
h
partitioning
or clu
s
terin
g
tries to detect
sub
g
ra
ph
s wi
th high
den
sity, such
a
s
G
N
alg
o
rithm [
2
], K-L alg
o
rit
h
m [3], the spectral bi
se
ct
ion
method
[4] a
nd
so
on. B
u
t those m
e
thod
s m
o
stly
resea
r
ch o
n
the st
ru
ctural
prope
rties o
f
comm
unity, ignori
ng oth
e
r
impo
rtant chara
c
te
risti
c
s, espe
cially the them
e ch
ara
c
teri
stics of
comm
unity, for
example,
use
r
s b
e
long
to a
commu
nity tend to
have
simila
r
hobbi
es,
so
ci
al
function,
occupation, i
n
terest o
r
vie
w
p
o
int on
the
same topi
c an
d so o
n
. Th
e
r
efore, an
oth
e
r
topic-ba
sed
community det
ection m
e
tho
d
s h
a
s
bee
n
wide
sp
rea
d
concern, that i
s
, co
nsi
deri
n
g
the text information in n
e
t
works and
detectin
g
co
mmunitie
s
a
c
cording to th
e co
ntent u
s
ers
publi
s
hed. Hi
era
r
chical clu
s
terin
g
ba
sed
on distan
ce
or simil
a
rity metrics is a
common o
ne. The
topic-mod
e
lin
g app
ro
ach [5] is a
nothe
r topic-ba
se
d
com
m
unity
detectio
n
ap
proa
ch,
su
ch
as
Latent Di
richl
e
t Allocation
(LDA
) [6] and
its vari
ou
s variation
s
, for
example, Author-Topi
c mo
del
[7], Commu
ni
ty-Use
r-Topi
c model
(CUT
) [8], Topi
c-User-To
p
ic mo
del (TUCM) [9] and
so
on
.
And di
scoveri
ng
comm
unities
co
nsi
s
ting
of simil
a
r
users is an
imp
o
rtant p
r
o
b
le
m and
can fi
nd
pra
c
tical a
ppl
ication
s
in so
ciolo
g
y, biology,
compute
r
scie
n
ce and
other a
r
ea
s. There ha
s be
en
some
relate
d
work a
bout h
o
w to defin
e the simil
a
rity betwe
en u
s
e
r
s ba
sed
on
which p
eopl
e a
r
e
grou
ped
into
comm
unitie
s
. One
com
m
o
n
app
ro
ach
is to treat
com
m
unities a
s
g
r
oup
of no
de
s in
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Com
m
unity Detection Ba
se
d on Topi
c Di
stan
ce in Social Taggin
g
Networks (Hon
gtao Liu)
4039
so
cial n
e
two
r
k that th
e co
nne
ction a
m
o
ng them
se
lve
s
a
r
e m
o
re d
ensely than
with the
re
st
of
netwo
rk, whi
c
h ma
ke
s the comm
unity detection a
gr
ap
h clu
s
teri
ng pro
b
lem. In Sacha
n
’s
work
[9], communi
ties are
con
s
ide
r
ed a
s
“grou
p
s of u
s
ers
(no
d
e
s
)
who a
r
e inte
rco
nne
cted a
nd
comm
uni
cate
on sha
r
ed to
pics”. Thi
s
pa
per follo
ws th
e viewpoi
nt about com
m
un
ities.
In this p
ape
r, we p
r
op
ose
a commu
nity detection
a
ppro
a
ch ba
sed on
topic
distan
ce
whi
c
h i
s
sho
r
t for TDS
H
RINK, com
b
inin
g topic inform
ation with
mo
dularity
clu
s
tering
ap
pro
a
c
h.
To summa
ri
ze, this
wo
rk
contri
bute
s
o
n
the follo
win
g
a
s
pe
cts:
(1
) We d
e
fine
a topic di
sta
n
ce
based on th
e boo
kma
r
k
relation
shi
p
betwe
en u
s
e
r
s o
n
sa
me
topics. (2) A
new mo
dul
arity
clu
s
terin
g
alg
o
rithm ba
se
d
on topic di
stance in
so
ci
al tagging n
e
tworks i
s
p
r
o
posed, that is,
grou
ping
two
use
r
s into
a
community if t
hey are
inte
rested
in th
e
same topi
cs. (3) T
he
algo
rithm
we p
r
o
p
o
s
ed
can
dete
c
t ov
erlap
p
ing
co
mmunitie
s
, th
at is, a
user i
s
allo
we
d to
b
e
long to
multi
p
le
comm
unitie
s
. And the a
c
cura
cy and
efficien
cy
of ou
r algo
rithm a
r
e imp
r
oved
comp
ared
wit
h
other meth
od
s ba
sed o
n
modula
r
ity.
This
pape
r i
s
orga
nized a
s
follo
ws.
We
introdu
ce
th
e relate
d
work on
LDA m
o
del an
d
modula
r
ity in
Section
2.
The fo
rmal
d
e
finition is
prese
n
ted in
Section
3. In
Section
4, we
descri
be the t
opic
dista
n
ce
-ba
s
ed
mod
u
l
a
rity clu
s
teri
n
g
algo
rithm
specifi
c
ally. Th
e experi
m
ent
a
l
results a
r
e prese
n
ted in Se
ction 5. Finall
y
, we con
c
lud
e
in Section 6
.
2. Related
w
o
rk
Our work is
related to two research a
r
ea
s: LDA m
odel to extract latent topics an
d
comm
unity detection a
pproach ba
s
ed o
n
modula
r
ity optimizatio
n.
2.1. Topic Model LDA
In real life,
we always
wan
t
to find a bri
e
f
description
or summ
ary to
rep
r
e
s
ent or
reflect
the feature i
n
formatio
n of
a la
rge-scal
e data
s
et. F
o
r exam
ple,
extracting
se
veral topi
cs
to
rep
r
e
s
ent th
e
total text dat
aset,
while
th
e topi
c mo
del
is th
e m
odel
whi
c
h
ca
n eff
e
ctively an
alyze
large am
ount
s of text
[10]. The most wi
dely used
in
the topic mo
del is LDA m
odel, whi
c
h i
s
a
topic g
ene
rat
i
on tool by
Blei et al. [6] prop
os
ed i
n
200
3. LDA
is a th
ree
-
l
e
vel hierarch
ical
Bayesian
mo
del, in
whi
c
h
a topi
c i
s
sim
u
lated
as
the
dist
ribution
o
f
different
wo
rds,
ea
ch
arti
cle
is co
nstituted
by a mixture of several
different
topi
cs. So the t
opic g
ene
rati
on process i
s
a
prob
abili
stic
gene
ration p
r
ocess. LDA
use
s
a
-di
m
ensi
onal
Dirichl
e
t rand
o
m
variable to
rep
r
e
s
ent th
e prob
ability distributio
n over
do
cum
e
nt and topics, simulatin
g
the generation
pro
c
e
s
s of do
cume
nts,
whi
c
h i
s
mai
n
ly use
d
to
id
entify the hidde
n
topic i
n
form
a
t
ion from l
a
rg
e-
scale
do
cum
ent set o
r
corpu
s
. In
re
cent yea
r
s,
L
D
A mo
del
a
nd its vari
ou
s exten
ded
LDA
model
s h
a
ve
increa
sin
g
ly bee
n u
s
e
d
i
n
imag
e
pro
c
e
ssi
ng, n
a
tu
ral la
ngu
age
processing
and
other field
s
. More
over, in
the context of
tagging
syst
ems, where
multiple
u
s
ers are boo
km
a
r
kin
g
r
e
sour
ces w
i
t
h
multiple tags
, th
e
resulting topi
cs
ca
n reflect
a
share
d
vie
w
o
f
use
r
s o
n
th
e
document, a
nd the tag
s
belon
ging to
the topics
can refle
c
t a
comm
on vo
cabula
r
y. So it is
possibl
e to
consi
der that
usin
g L
D
A m
odel to
extra
c
t topi
cs i
n
fo
rmation f
r
om
so
cial ta
ggi
ng
netwo
rks.
2.2. Modularit
y
Optimization
The mo
dula
r
i
t
y function
is the mo
st wid
e
ly use
d
indi
cator to
cha
r
a
c
teri
ze the
strength
of co
mmunity
features,
whi
c
h i
s
fi
rst
pro
posed
by Ne
wman
et
al. [
11] in
200
4.
And
with time
s g
o
,
it become
s
a
stand
ard to m
easure h
o
w i
s
the re
su
lt of
commu
nity detection, which is define
d
a
s
:
(
1
)
Whe
r
e
is th
e
fractio
n
of th
e num
ber of
edge
s in
com
m
unity
to th
e num
ber of
edge
s in
the
grap
h, and
is the fractio
n
of the n
u
m
ber
of
the
edge
s that
conne
ct to no
des i
n
comm
unity
to the num
be
r of edg
es i
n
the gra
ph.
And
is the f
i
nal communi
ty number
of
comm
unity detection. Th
e
range
of
function i
s
, and in pra
c
tice it is found that a value
above 0.3 is
a good in
dica
tor of significant comm
uni
t
y
structu
r
e in
a netwo
rk, an
d the large
r
the
value is, th
e better the
quality of the comm
unit
y
structu
r
e i
n
netwo
rk. Therefore, th
e
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NI
KA
Vol. 12, No. 5, May 2014: 4038 – 40
49
4040
modula
r
ity m
e
thod
s a
c
hie
v
e the o
p
tima
l clu
s
teri
ng
re
sults by maxi
mizing
v
a
lue
.
Ho
wev
e
r, it i
s
a NP-h
ard p
r
oblem to max
i
mizing
. So a lot of heuristi
c app
roa
c
h
e
s, which try to approximate
the optim
al
modula
r
ity value, h
a
s b
een
propo
se
d [12]. Su
ch
app
ro
aches incl
ude
g
r
e
edy
agglom
eration [13], mathe
m
atical
pro
g
ramming
[14
], spe
c
tral
meth
ods [15], sim
u
lated a
nne
al
in
g
[16] and
so
o
n
. Ho
weve
r, in Fortu
nato’
s wo
rk [1
7], they found th
at, modula
r
ity o
p
timization
m
a
y
fail to ide
n
tify co
mmunitie
s
sm
aller than
a
ce
rtain
scale, which d
e
pend
s
on th
e
total
size of
the
netwo
rk a
nd the deg
ree of i
n
terconn
ecte
dne
ss of the
comm
unitie
s
.
Beside
s, in
real
worl
d n
e
t
w
ork,
som
e
parts
in
com
m
unities a
r
e
overl
appin
g
, that i
s
,
some
no
de
s can
bel
ong
to multiple
com
m
unitie
s
with multi
p
le commu
ni
ty attributes.
For
example, a
schol
ar m
a
y collabo
rate
with others in
m
any different
area
s in
scie
ntific coll
abo
ration
netwo
rks; a use
r
wh
o
h
a
s
a
b
r
oa
d ra
nge of
inte
re
sts may b
e
a
s
soci
ated
wit
h
mo
re
than
one
comm
unity. Therefo
r
e, Pall
a et al. [18] p
r
opo
se
d a
cli
que p
e
rcolati
on metho
d
(CPM), whi
c
h
can
detect overl
a
pping commu
nities, but is not suitabl
e f
o
r dete
c
ting
hiera
r
chi
c
al structu
r
e
s
. Hu
ang
et al. [19] prop
osed a
para
m
eter-fre
e algo
rithm
SHRINK,
wh
ich
can n
o
t only discov
er
overlap
p
ing
a
nd hie
r
a
r
chical com
m
uniti
es b
u
t also th
e hub
nod
es
and o
u
tliers a
m
ong the
m
. And
based o
n
th
eir work, Lin
et al. [20] consi
dered
th
at there
are
many incom
p
lete informa
t
ion
netwo
rks
with
missing
a
lot of ed
ge
s
co
mmon i
n
real
worl
d n
e
two
r
ks, i
n
whi
c
h
n
ode
s a
r
e
kn
o
w
n
and ed
ge
s are locally
kno
w
n. They p
r
o
posed a hi
era
r
chy
clu
s
terin
g
method u
s
e
d
for co
mmu
nity
detectio
n
in
incom
p
lete i
n
formatio
n n
e
tworks with
missi
ng e
d
ges, that i
s
, distan
ce
-ba
s
ed
shri
nki
ng app
roa
c
h (DSHRINK),
which
l
earn
ed a
di
st
ance met
r
ic from lo
cal
information
regi
on
s,
and then e
s
ti
mated the di
stance b
e
twe
e
n
any pair
of
node
s in the
netwo
rk. Ba
sed on thei
r work,
a topic
dista
n
ce
-ba
s
e
d
sh
rinki
ng a
ppro
a
ch
(T
DS
HRINK) i
s
propo
sed i
n
this p
aper,
whe
r
e t
h
e
topic
dista
n
ce is defin
ed
by the b
o
o
k
marking
re
l
a
tionship
on th
e same to
pics b
e
twe
en
u
s
ers,
and then th
e
modula
r
ity ba
sed
on topi
c
distan
ce i
s
d
e
fined, too. Fi
nally, clust
e
ri
ng the no
de
s b
y
usin
g TDS
H
RINK algorithm
to reach the goal of co
m
m
unity detectio
n
in so
cial tag
g
ing net
works.
3. Terminolog
y
Definition
s
To e
s
tabli
s
h
a soci
al n
e
twork g
r
ap
h, we mu
st
con
s
i
der the i
n
tera
ction
betwee
n
u
s
e
r
s.
While th
e intera
ction b
e
tween two users in
soci
al ta
gging n
e
two
r
k can be
se
e
n
as b
o
o
k
ma
rk
relation
shi
p
o
n
all topi
cs. T
he form
al d
e
finition
of
so
ci
al taggin
g
n
e
t
work
s can b
e
expresse
d
as
, where
is the set of u
s
e
r
s in the n
e
tworks,
is the
set of
tags, and
is the set of re
so
urces.
Definition 1
(Topi
c Tag
g
ing Netwo
r
k): A topic tagging ne
twork is d
e
noted a
s
, where
is the set of
vertice
s
,
is the set of
edge
s, which
rep
r
e
s
ent
s to
pic di
stan
ce
betwe
en u
s
e
r
s.
is the se
t of topics
that are
extracte
d from
tags, i
n
which
ea
ch t
opic is mad
e
up
of
se
veral tag
s
and
, and
is the set of reso
urce
s in taggin
g
n
e
twork.
In this pap
er,
we extra
c
t the topics fro
m
tags in
so
cial taggi
ng
netwo
rk
by u
s
ing L
D
A
model, to cap
t
ure the pote
n
tial sema
ntic relation
ship
s betwee
n
tag
s
and topi
cs.
Definition
2 (Topic Di
stan
ce):
The to
pi
c di
st
an
ce
o
n
the
same
topic
between
any two
use
r
s
is exp
r
esse
d a
s
th
e boo
kma
r
k relation
sh
ip
s
on the topi
c,
whi
c
h i
s
me
a
s
ured
by
co
sine
simila
rity between them and the f
o
rmul
a is a
s
follows.
(
2
)
Whe
r
e
and
. While th
ere
are
seve
ral t
ags i
n
a topi
c,
is the fra
c
tion of the
numbe
r of re
sou
r
ces that
use
r
boo
kma
r
ke
d with tag
to the numbe
r of the total reso
urce
s that
use
r
boo
km
arked. T
he t
opic
dista
n
ce
is to m
e
a
s
u
r
e the topi
c
si
milarity between u
s
e
r
s, th
e
rang
e of whi
c
h is
, the sma
ller the value
is, the
highe
r simila
rity users have.
Definition 3 (Average To
pi
c Di
stan
ce):
Since there are
topics in total, any
pair of
u
s
er
s ha
ve
topic di
stan
ce
, that is,
.
Th
e
ave
r
age
to
pic dista
n
ce betwe
en any
two users is
calcul
ated with
Equation (3
).
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Com
m
unity Detection Ba
se
d on Topi
c Di
stan
ce in Social Taggin
g
Networks (Hon
gtao Liu)
4041
(
3
)
Whe
r
e
is the proba
bility that each topi
c
appe
are
d
re
spectively.
Definition
4
(I
nitial Commu
nity): Set a to
pic th
re
sh
old
, and
a i
n
itial
co
mmunity
is a set of u
s
ers who
s
e
averag
e topi
c dista
n
ce
in
the range of
the topic thresh
old, whi
c
h is
defined a
s
:
(
4
)
Whe
r
e
is the commu
nity radiu
s
of the initial comm
u
n
ity.
Definition 5
(Comm
unity Center
): Given
any three
co
mmunity
,
,
with the kno
w
n
comm
unity center i
s
,
,
re
spe
c
tively, while th
e t
opic
dista
n
ce of ea
ch
two
comm
unitie
s
are re
garded
as
,
,
. Therefore,
whe
n
and
c
l
us
tered into a
new com
m
u
n
ity
, the community ce
n
t
er
of whi
c
h
is dete
r
min
ed by the two initial
comm
unitie
s
, and the topi
c dista
n
ce be
tween the n
e
w
co
mmunity
and
is cal
c
ulated with
Equation (5).
(
5
)
The com
m
un
ity
cente
r
of initial com
m
unity
is t
he n
ode
. And
whe
n
clu
s
tere
d into
new co
mmu
nity, the new community
center is d
e
termined by Def
i
nition 5, whil
e
the topic di
stance is calc
ul
ated by Equa
tion (5).
For
conveni
e
n
ce, we map
the topic tag
g
ing
net
wo
rk
into a two-di
mensi
onal m
ap, each
node
has two coo
r
din
a
te
values
, and the geom
etri
cal di
stan
ce
betwe
en ea
ch pair of
node
s
is
. Therefo
r
e,
co
rresp
ondi
ng to
the
Definitio
n
5
in the p
aper, the
topic d
i
stan
ce b
e
t
ween
ea
ch pair
o
f
node
s is
, which i
s
sh
o
w
as Fi
gure 1
.
Figure 1. The
Sketch Ma
p of the Dist
an
ce Cal
c
ulation
betwe
en Co
mmunitie
s
4. Modularity
C
l
ustering
Algorit
hm Based on To
p
i
c Distan
ce
Based
on
th
e di
stan
ce-b
ase
d
mo
dula
r
ity and
DS
HRI
NK alg
o
rithm that Li
n et al.
prop
osed
[20
], we
pro
p
o
s
e a
topi
c di
st
ance-b
a
sed
modula
r
ity cl
usteri
ng
algo
rithm T
D
SHRINK,
whi
c
h dete
c
ts commu
nities according to
the topic
di
stance betwee
n
use
r
s, t
hat is, the use
r
s with
sho
r
ter topi
c distan
ce will
be grou
ped
into
the sa
me com
m
uni
ty and the users
with lon
ger
distan
ce into
different co
mmunitie
s
. And it can
al
so detect overlappin
g
com
m
unities in
social
tagging n
e
tworks.
4.1. Topic Distan
ce-bas
e
d
Modularit
y
Before the
T
D
SHRINK al
gorithm, the
topic
di
stan
ce-ba
s
e
d
mo
d
u
larity is
defi
ned a
s
follows
.
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NI
KA
Vol. 12, No. 5, May 2014: 4038 – 40
49
4042
Definition 6
(Topi
c Dist
ance-b
a
sed
Modul
a
r
ity): Given a topic taggin
g
network
and the comm
unitie
s
, the topic dista
n
ce-b
ase
d
modula
r
ity
is defined a
s
:
(
6
)
W
h
er
e
is the
num
ber of
communitie
s
,
repre
s
e
n
ts the
su
m of
avera
ge
topic di
stan
ce between
a
n
y pair
of n
ode
s withi
n
comm
unity
,
rep
r
e
s
ent
s th
e su
m of ave
r
age topi
c di
st
ance bet
wee
n
any n
ode i
n
comm
unity
and a
n
y node
in the
topi
c ta
gging
net
work
, and
r
e
pr
es
e
n
t
s
th
e
s
u
m o
f
a
v
er
ag
e
topic di
stan
ce
betwee
n
any
two node
s in
.
As the
ran
ge
of modul
arity
that Ne
wma
n
pro
p
o
s
ed i
n
2004 i
s
[11], in this
pape
r,
the
rang
e of topi
c di
stan
ce-ba
s
ed m
odul
ari
t
y is
. If
, it
mean
s that a
ll the node
s
are
either
group
e
d
into
one
community o
r
grou
ped
into
different
co
mmunitie
s
ra
ndomly. And
the
la
r
g
er
va
lu
e
of
, the better the quality of c
l
us
tering result.
To enha
nce
the efficiency of the algor
ithm, we cal
c
ul
ate
the modularity
increme
n
tally, that is, th
e gai
n of
m
e
rgin
g the
two
com
m
uni
ties
and
into a
ne
w
comm
unity, whi
c
h is
call
ed topic di
st
ance-b
a
sed
modula
r
ity gain
. The calcul
ational
formula is
as
follows
.
(
7
)
Whe
r
e
re
prese
n
ts the
sum of ave
r
a
ge topi
c di
stance bet
wee
n
node
s in com
m
unity
and node
s in com
m
unity
.
The topi
c di
stance-b
a
sed
modula
r
ity de
fined ab
ov
e is a met
r
ic to
evaluate the
quality of
a partition. And we u
s
e th
e gain of topi
c dista
n
ce
m
odula
r
ity to control the
clu
s
ter p
r
o
c
e
ss t
o
get
a good result of commu
nity detection.
4.2. Topic Distan
ce-bas
e
d
Modul
arit
y
Cluste
ring Algorithm
The topi
c di
stance
-
ba
se
d
modula
r
ity cl
usteri
ng al
go
rithm T
D
SHRINK is p
r
e
s
e
n
ted in
Table
1. T
he
approa
ch
ca
n
be
divided
in
to two
pha
se
s. Firstly, buil
d
ing
a topi
c taggin
g
n
e
two
r
k
, and cal
c
ul
ating the averag
e topic
distan
ce
betwee
n
any
pair of nod
es.
Secon
d
ly, (1
) finding all
initial comm
unities a
c
cording to Defi
nition 4. (2) For ea
ch i
n
itial
comm
unity, we find its
community ce
nter by De
finition 5, and
cal
c
ulate th
e topic di
sta
n
ce
betwe
en
any
pair of i
n
itia
l co
mmunitie
s
by E
quatio
n (5) to
sto
r
e an
a
rray.
(3) fin
d
the
two
comm
unitie
s
that their topic dista
n
ce is
sm
alle
st, and calcul
ate the topi
c di
stan
ce
-b
ase
d
modula
r
ity g
a
in
. If
, it
mean
s that t
he me
rge
r
o
f
the two co
mmunitie
s
can
increa
se the topic di
stan
ce
-ba
s
ed mo
dul
arity
. Then merg
e the two comm
unitie
s
into a new
one. Othe
rwi
s
e, do
not m
e
rge
and
co
ntinue to find
two commu
nities that ha
ve less sm
all
e
st
distan
ce. Re
peat until all
cluste
re
s are “visited
” an
d mergi
ng th
e last two co
mmunitie
s
ca
n’t
incr
ea
se
. Finally, the communities a
r
e
pre
s
ente
d
.
5. Experiments
In this
se
ctio
n, we
use t
w
o
real
-worl
d
data
set
s
to validate t
he effe
ctiveness an
d
efficien
cy of the app
roa
c
h
we propo
se
d.
5.1. Data Se
ts
(1) CiteU
L
ik
e
Data
set
The
data
s
et i
n
this pa
pe
r
is from th
e a
v
ailable
data
s
ets in
Cite
ULike,
whi
c
h
i
s
a
fre
e
servi
c
e
for
m
anagi
ng and
discoveri
ng schol
arly re
fe
rences p
r
ovid
ed by
Sprin
g
e
r.
Whe
n
a
u
s
er
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Com
m
unity Detection Ba
se
d on Topi
c Di
stan
ce in Social Taggin
g
Networks (Hon
gtao Liu)
4043
in intere
sted
in an a
r
ticle,
he or
sh
e ca
n add it into
his o
r
he
r o
w
n libra
ry with
several rel
a
ted
tags. Th
e format of Cite
ULike
data i
n
cl
ude
s fou
r
fiel
ds,
whi
c
h a
r
e
articl
e id, u
s
er n
a
me
(a
salted
MD5 ha
sh of
the true u
s
ername
)
, the date and time
and tag. And
if a user p
o
st
s an a
r
ticle
with
several tags,
then this will result in se
veral rows, whi
c
h is sh
own in Table 2.
Table 1. The
De
scription of
TDSHRINK
A
l
g
o
ri
th
m 1 T
D
SHRINK
Inpu
t:
, topic threshold
;
Ou
tpu
t
:
:
Process:
1
Initialize, build the topic tagging network
;
for
each
do
for
each
do
Calculate
according to Equation
2;
;
end
end
for
each
do
if
then c
o
n
t
in
ue
Find a intial communit
y
according to Definition 4
;
for
each
do
;
end
;
end
for
each
do
for
each
do
Calculate
the
distance
bet
w
e
en
and
acco
rding to Equatio
n 5 and stor
e th
e distance
into
arra
y;
end
end
While true
do
Select the smalle
st distance
in
ar
ra
y
;
Calculate corresponding
according to Equati
on 7;
if
do
and update
arra
y
;
else
;
if
then bre
a
k
;
end
return
;
Table 2. The
Format of the
Raw
Data
Article id
Username(M
D
5)
Date and time
tag
9168221
654442b4e
aff27
91d205c4abdeb
99375
2012-01
-01
00:21:27.814
194
+00
pvalue
5827136
654442b4e
aff27
91d205c4abdeb
99375
2012-01
-01
00:22:17.990
863
+00
pvalue
10186672
aac9848472688
04c15d115fbee
0
b3652
2012-01
-01
00:26:26.822
489
+00
rsvp_iconchat
10186790
9730960ed
e281
beae74190
06b4
7dbf41
2012-01
-01
01:55:47.960
338
+00
motivation
10186791
9730960ed
e281
beae74190
06b4
7dbf41
2012-01
-01
01:58:43.275
636
+00
massively
_multiplayer_online_ga
mes
Since the raw data is la
rge, to facilitate the latter exper
im
ents, we intercept all
the data
from Ja
n. 20
12 to De
c. 20
12 as a d
a
ta set. In addi
tio
n
, we focu
s o
n
the study o
f
tags that user
use
d
an
d the
co
rre
sp
ondi
n
g
arti
cle
re
so
urces, th
en
calcul
ate the t
opic
dista
n
ce
betwe
en
use
r
s,
whi
c
h i
s
not
dire
ctly relate
d to the time.
Therefor
e,
we nee
d to extract th
ree fiel
ds of d
a
ta fro
m
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NI
KA
Vol. 12, No. 5, May 2014: 4038 – 40
49
4044
the ra
w data,
which are article
id, username a
nd ta
g re
spe
c
tivel
y
. In data cleaning, we firstly
apply ste
mmi
ng to the ta
gs, split the
tags that
are
linke
d by
symlinks into
a
few
words.
In
addition, del
e
t
e those in
sig
n
ificant tag
s
, su
ch a
s
,
“no
-
tag”, prepo
siti
ons, pu
re di
gital tags and
so
on. Finally, in
ord
e
r to
sim
p
lify the su
bseque
nt ca
l
c
ul
ations and
e
n
su
re th
e a
c
curacy
rate,
we
delete the tag
s
that are bo
okma
rked by less than
10
use
r
s. After d
a
ta clea
ning, there a
r
e 185
12
tags, 1
308
6
use
r
s,
137
30
6 arti
cle
s
. Th
roug
h the
L
D
A topic extra
c
tion
pro
c
e
d
u
r
e that
Zho
u
Li
publi
c
ly avail
able,
we
get
100
topi
cs
ultimate
ly an
d ea
ch
topi
c incl
ude
s
se
veral tag
s
wi
th
corre
s
p
ondin
g
prob
ability, whi
c
h is
sho
w
n in Tabl
e 3
.
Table 3. The
Data of Part Topic
with Co
rre
sp
ondi
ng
Tags a
nd Pro
bability
Topic1
p1
Topic 2
p2
Topic 3
p3
paper
0.0869874
healthcare
0.0514422
attention
0.0245588
lanlsec 0.0159081
privacy
0.0332143
disorder
0.0194616
holopedia 0.0144484
security
0.0276437
auditor
y
0.0160443
access
0.00988849
mhealth 0.0222972
deficit
0.0156371
open
0.00894257
rechtslinguistik 0.0216415
adhd
0.015277
(2) DBLP-A
d
a
taset
DBLP-A i
s
the data
set
extracted
fro
m
DB
LP
we
bsite whi
c
h provide
s
bi
bli
ogra
phi
c
informatio
n o
n
compute
r
sci
en
ce jo
urnals and
p
r
oce
edin
g
. In ord
e
r to
co
mpare
with the
algorith
m
pro
posed in pap
er [20], we process t
he d
a
ta as the way Lin et al. do. To fit th
e
approa
ch
we
prop
osed,
we
ca
rry
out
so
me p
r
o
c
e
ssi
n
g
of th
e d
a
ta t
o
buil
d
topi
c taggin
g
n
e
two
r
k,
view the
arti
cle
s
that a
u
thor
s coa
u
t
h
o
r
ed as
re
so
u
r
ce
s of
. And the
choi
ce
of tags is
importa
nt, to ensure th
e re
levance of topics, we
vie
w
the words i
n
the title as tags, then
ap
ply
the stand
ard
text proce
s
sing, su
ch a
s
stemming, st
o
p
words
remo
val. The rest
of the pro
c
e
s
s is
simila
r to the above sectio
n of CiteULi
k
e.
The pro
c
e
s
sed d
a
ta is shown in Tabl
e 4.
Table 4. The
Data Sets u
s
ed in Experim
ent
Dataset User
s
T
ags
Resour
ces
T
opics
CiteULike 13086
18512
137306
100
DBLP-A
5417
3393
5455
6
5.2. The Cho
i
ce of Topic
Thresh
old
From th
e a
b
o
ve analy
s
is,
the la
rge
r
th
e value
of to
pic
distan
ce
-based m
odul
arity
,
the better the clusteri
ng result
s are. Moreover, t
he
choi
ce of topic thre
shold
will influence the
formation of initial commu
nities, and further influen
ce
the effectiveness of re
sult
s. Therefore,
in
this p
ape
r, th
e topi
c th
re
sh
old
is dete
r
m
i
ned by
, that is, the val
ue t
hat ma
ke
s
of
initial
comm
unitie
s
large
s
t, as th
e topic th
re
sh
old of ou
r ap
proa
ch. T
he range
of
is
,
in st
ep
s of
0.01. To ge
nerate i
n
itial
comm
unitie
s
a
c
cordi
ng
to corre
s
p
o
n
d
ing
value, and calcula
t
e
corre
s
p
ondin
g
. The result is sh
own in Figure 2.
(a)
C
ite
U
Li
ke
(b)
D
BLP-A
Figure 2. The
Choi
ce of To
pic Th
re
shol
d
0.
1
0.
2
0.
3
0.
4
0.
5
0.
6
0.
7
0.
8
0.
9
1
0
0.
1
0.
2
0.
3
0.
4
0.
5
0.
6
0.
7
0.
8
0.
9
t
o
pi
c
t
h
r
e
s
h
ol
d
M
o
du
l
a
r
i
ty
Q
td
Ci
t
e
UL
i
k
e
0.
1
0.
2
0.
3
0.
4
0.
5
0.
6
0.
7
0.
8
0.
9
1
1
0
0.
1
0.
2
0.
3
0.
4
0.
5
0.
6
0.
7
0.
8
t
opi
c
t
h
re
s
h
o
l
d
M
odul
ari
t
y
Q
td
DB
L
P
-A
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Com
m
unity Detection Ba
se
d on Topi
c Di
stan
ce in Social Taggin
g
Networks (Hon
gtao Liu)
4045
We
ca
n o
b
tai
n
that the
topi
c di
stan
ce
-ba
s
ed
mod
u
la
rity
of the t
w
o
datasets fi
rst
rise
to pea
k, then
decrea
s
e, a
nd ultimately
kee
p
st
a
b
le.
It is mainly becau
se that
the small
e
r
the
threshold i
s
,
the more the
numbe
r of i
n
itial co
mm
u
n
ities, and th
e more scatt
e
r the u
s
e
r
s
in
netwo
rk. Whil
e as the incre
a
sin
g
of threshold,
the num
ber of initial communitie
s
b
e
com
e
s
small,
the use
r
s be
come concentrated. At last,
the modula
r
it
y
become in
variant wh
en
rea
c
he
s a
certai
n value.
While th
e co
rre
sp
ondi
ng t
opic th
re
shol
d
of large
s
t
value are 0.5
5
and
0.32
respe
c
tively,
whi
c
h are as
the value of topic threshol
d
in our app
roach.
5.3. Ev
aluation Measu
res
In orde
r to measure the ef
fectivene
ss o
f
our approa
ch and co
mpa
r
e with the a
ppro
a
ch
that Lin et al. propo
se
d [20], we adopt
the same ev
a
l
uation criteri
on, that is, Purity, to evaluate
the quality of the comm
un
ities gen
erate
d
by diffe
rent
approa
che
s
.
The definitio
n of purity is
as
follows: ea
ch
clu
s
ter in first a
ssi
gned
with the
mo
st frequ
ent
cla
s
s in the
cl
uster, and
then
the
purity is
mea
s
ured
by co
m
puting the
nu
mber
of
the i
n
stan
ce
s a
s
si
gned
with th
e
sam
e
lab
e
ls
in
all clu
s
ters, which i
s
cal
c
ul
ated with Equ
a
tion (8
).
(
8
)
Whe
r
e
is th
e
set of
clu
s
ters that g
ene
ra
ted by ea
ch
d
e
tection
app
roach,
is the
-
th cla
ss l
abel
. The ra
nge
of purity is
, and the
high
er pu
rity mea
n
s the
high
er accuracy o
f
the metho
d
. The
comm
un
ity structu
r
e
gene
rat
ed
by each comp
a
r
ed m
e
thod
will be
evalu
a
ted
usin
g the true
label of each
node. Since
each user
ca
n have multip
le intere
sts in
CiteULi
k
e a
n
d
each autho
r
can
have mu
ltiple re
sea
r
ch are
a
s i
n
DB
LP as its
cl
ass lab
e
ls, th
at is, ea
ch n
ode
can
belo
ng t
o
overla
ppin
g
co
mmuniti
es. Th
erefo
r
e,
we
com
p
u
t
e the pu
rity of the cl
ust
e
ring
results ba
se
d
on label sep
a
rately, and the avera
ge
result
s over 1
00 or 6 lab
e
ls are re
porte
d.
5.4. Experiment Results and An
aly
s
is
To verify the
availability and
effectiveness of the T
D
S
HRI
NK algorit
h
m
we
proposed,
we
have experi
m
ent on the true data
s
et
from the so
ci
al tagging n
e
t
work Cite
ULi
k
e. And at the
same
time, to
co
mpa
r
e
wit
h
the
DS
HRI
NK alg
o
rith
m
that Lin
et al.
prop
osed [2
0], we
expe
rim
ent
with the sam
e
dataset DBLP-A.
5.4.1. Visualizatio
n and Analy
s
is of Results
As a vi
suali
z
ation tool
used u
s
ually i
n
com
p
lex net
work, Paje
k
can
effectivel
y analyze
and de
mon
s
t
r
ate the
stru
ctural prope
rti
e
s of comp
le
x networks. In this pa
per,
we choo
se P
a
jek
to demon
strat
e
the effect of result
s of co
mmunity dete
c
tion.
(a) T
opolo
g
y of original
netwo
rk
(b)
Distri
bution of initial
comm
unitie
s
(c) Di
splay of part final
r
e
sults
Figure 3. The
Proce
s
s Di
splay from Ori
g
i
nal Netwo
r
k to Final Results of DBLP-A
F
i
g
u
r
e
3
(
a)
is th
e
o
r
ig
in
a
l
ne
tw
o
r
k
o
f
D
B
L
P
-
A
d
a
t
a
s
e
t
w
e
co
ns
tr
uc
te
d
,
w
h
er
e
e
a
c
h
n
ode
rep
r
e
s
ent
s o
ne of the
aut
hors, the
blu
e
line
re
p
r
e
s
ents th
e ave
r
age topi
c
dist
ance bet
wee
n
any
pair of u
s
e
r
s.
We
ca
n
se
e f
r
om th
e fig
u
re, nod
es
in
th
e net
wo
rk a
r
e
divided
into
two
pa
rts,
wh
ere
node
s
i
n
sid
e
are clo
s
ely conne
cted, out
side
are
n
o
t conne
cted
wit
h
ed
ge
s, whi
c
h i
ndi
cate
s t
hat
the nod
es o
u
t
side a
r
e i
s
ol
ated no
de
s, and
corre
s
po
ndi
ng to the
result in T
able
4. Figure 3(b
)
is
Evaluation Warning : The document was created with Spire.PDF for Python.
ISSN: 23
02-4
046
TELKOM
NI
KA
Vol. 12, No. 5, May 2014: 4038 – 40
49
4046
displ
a
y of re
sults of initial
comm
unitie
s
gene
rat
ed
by the algo
rith
m we
pro
p
o
s
ed. Different from
Figure 3
(
a), i
n
this fig
u
re,
each no
de
re
pre
s
ent
s
a
n
i
n
itial co
mmu
nity, and there are 16
34 i
n
itial
comm
unitie
s
in total. We can find that
those
initial
commu
nities are clo
s
ely
linke
d, but the
c
o
mmu
n
i
ty s
t
r
u
c
t
ur
e is no
t ve
r
y
ob
vio
u
s, a
n
d
fu
rthe
r
detecte
d
can
get final
re
sul
t
of commu
nity
detectio
n
.
We
only sele
ct three re
pre
s
e
n
tative
co
m
m
unities for
display in thi
s
p
aper,
whi
c
h
a
r
e
comm
unity numbe
red by 23, 129, and
614 re
sp
ectiv
e
ly, as sho
w
n in Figure 3
(
c). Each nod
e in
the figure rep
r
esents a
user, ea
ch
de
n
s
e
clu
s
te
r re
pre
s
ent
s a
community, an
d the p
o
ints that
linke
d betwee
n
comm
unitie
s
are n
ode
s that belon
g to overlap
p
ing
communitie
s
.
Table 4. Re
sults of Comm
unity Detectio
n
CiteULike
DBLP-A
Number of
nodes
13086
5417
Clustering coefficient
0.35187
0.75708
Number of initial communities
4825
1634
Number of
final communities
2922
1033
Number of isolat
ed nodes
118
1003
Average size of communit
y
94
16
Number of
nodes
in maximum communit
y
1570
289
Thro
ugh
stati
s
tics in T
able
4, we
ca
n fi
nd
that the
p
r
opo
se
d alg
o
r
ithm can
effectively
detect comm
unities on two kind of net
works, t
he numbe
r of co
mmunity, the average
size of
comm
unity is rea
s
o
nabl
e. There i
s
a te
rm n
a
me
d
n
u
mbe
r
of i
s
ol
ated no
de
s,
whi
c
h a
r
e
so
me
node
s th
at d
o
not
pa
rticip
ate in th
e
co
mmunity det
e
c
tion. An
d it i
ndicates that
there a
r
e
so
me
use
r
s d
on’t i
n
volve in th
e
boo
km
ark of
topics, in
ot
her wo
rd
s, th
e tag
s
that
th
ey used
are
not
inclu
ded in t
opics, pa
rt of use
r
s a
r
e
intere
st
ed i
n
ce
rtain to
pic, othe
r are not, which
is
rea
s
on
able.
Clu
s
terin
g
co
efficient is a
paramet
er
to
me
as
ur
e
ne
tw
o
r
k
c
o
mmunity effec
t
in
compl
e
x net
work, the
ra
n
ge of
whi
c
h i
s
. In actual
n
e
twork,
clu
s
tering
coeffici
ent is
much
smalle
r tha
n
1, but mu
ch
greate
r
tha
n
. Therefore, t
he ave
r
ag
e clusteri
ng
co
efficient of
CiteULike is
0.3518
7, whi
c
h i
s
living
up to a
c
tual
netwo
rk an
d DBLP
-A is 0.7570
8, which
indicates that
DBLP-A ha
s high net
work
clu
s
te
rin
g
effect, clo
s
ely conne
cted b
e
twee
n nod
es.
On the othe
r
hand, we can
get obviou
s
ly from
Table
4, there a
r
e l
e
ss isol
ated
node
s in
CiteULike dataset than DB
LP-A dataset, which i
s
mai
n
ly becau
se t
he appli
c
atio
n backg
rou
n
d
of
this al
gorith
m
is ta
ggin
g
n
e
twork,
while
DBLP-A
is
coa
u
thor net
work
witho
u
t tag info
rmati
on.
Therefore, th
ere a
r
e mo
re
isolated no
d
e
s of re
sult in DBLP-A, b
u
t the remai
n
ing nod
es t
hat
partici
pated i
n
comm
unity detectio
n
hav
e high cl
uste
ring effect.
Figure 4. The
Distrib
u
tion o
f
Final Comm
unities of Two Data
sets
We h
a
ve ma
de stati
s
tics
of distrib
u
tio
n
of
numb
e
r of simila
r community si
ze of two
datasets Cite
ULi
k
e and DBLP-A
re
sp
e
c
tively,
whi
c
h
is
sh
own in
Figure 4.
We
ca
n find
that
the
comm
unity result
s of t
w
o
data
s
ets are mo
stly
co
n
c
entrated
on
small
nu
mb
er of
nod
es,
and
there a
r
e mo
st comm
uniti
es in
range,
whi
c
h indi
cat
e
s that t
he ne
twork clu
s
te
ri
ng effect is
stron
g
, and
the re
sults
of comm
unit
y
detection
are
satisfa
c
t
o
ry. And th
e distri
bution
of
comm
unitie
s
of Cite
ULi
k
e
is m
o
re
com
p
reh
e
n
s
ive, whe
r
e
large comm
unitie
s
that
num
ber of
w
h
ic
h
is
i
n
rang
e, middl
e com
m
unitie
s
in
range, small
comm
unities in
0
200
400
600
800
1000
1
200
1400
1600
0
200
400
600
800
1000
1200
1400
1600
1800
N
u
m
ber of
node
s
i
n
C
o
m
m
uni
t
y
N
u
m
b
er
o
f
c
o
m
m
uni
t
i
es
Ci
t
e
UL
i
k
e
DB
L
P
-
A
Evaluation Warning : The document was created with Spire.PDF for Python.
TELKOM
NIKA
ISSN:
2302-4
046
Com
m
unity Detection Ba
se
d on Topi
c Di
stan
ce in Social Taggin
g
Networks (Hon
gtao Liu)
4047
ran
ge a
r
e
d
e
tected,
whi
c
h also indi
ca
tes
that u
s
e
r
s have
a
wid
e
ra
nge
of in
terest
on
multiple topics.
5.4.2. Effec
t
iv
eness and Comparis
on Analy
s
is
As the ap
plication ba
ckground
of the
TDSHRINK a
l
gorithm
we prop
osed
i
s
different
from
DSHRI
N
K alg
o
rithm
that Lin
et
al. pro
p
o
s
ed
[20], this al
gorithm
is a
pplied
to tag
g
ing
netwo
rk,
such a
s
CiteULi
k
e,
Del.ici
o
.u
s a
nd
so
o
n
. Whil
e
DSHRINK
algo
rithm i
s
a
pplie
d to
incom
p
lete in
formation
net
work. Th
erefore,
we ma
ke statisti
c ab
out Purity value of cl
uste
ri
ng
results of TDSHRINK al
gorithm in two data
s
et
s CiteULi
k
e a
nd DBLP-A
in different topic
threshold
s
,
whi
c
h is
sho
w
n in Figu
re
5.
(a) CiteU
L
ik
e
(b) DBLP-A
Figure 5. Accura
cy of TDS
HRI
NK Algor
i
t
hm in Differe
nt Topic Th
re
shol
ds
From Fi
gure
5(a
)
we
ca
n find that in Cit
e
UL
i
k
e d
a
taset the cha
n
g
e
of topic th
resh
old
prod
uces little influe
nce o
n
Purity valu
e, whi
c
h
is in
rang
e, indi
cating that
TDSHRINK
algorith
m
we
prop
osed i
s
effective in so
cial t
aggi
n
g
networks,
and h
a
s hi
gh
accuracy
rat
e
in
total. In Figu
re 5
(
b), th
e
ch
ange
of topi
c thre
sh
old
wil
l
ca
use d
r
am
atic
cha
nge
s
on Pu
rity value,
whe
n
is sm
all,
su
ch as 0.1,
Pu
rity value is only
0.323,
while
i
n
crea
se
s 0.2
,
Purity value
jumps to 0.7
59. And the topic threshol
d
we cho
s
e
in the pape
r is 0.32, co
rre
s
po
ndin
g
Purity
value i
s
0.7
7
2
, whi
c
h
is e
quivalent
wit
h
the
over
all
accu
ra
cy of
Lin’
s p
ape
r
[20] in the
same
dataset DBL
P
-A, indicatin
g
that
the propo
sed al
gori
t
hm is effecti
v
e. Moreove
r
, when the to
pic
threshold
is cho
s
en ap
p
r
op
riately, the accuracy
o
f
TDSHRI
NK algorithm we prop
osed is
highe
r tha
n
L
i
n’s al
go
rithm
.
In addition,
from Fig
u
re
5
(
a) t
o
Figu
re
5(b
)
, we
can
clea
rly find th
at
the Purity val
ue of T
D
SHRINK al
gorith
m
in Ci
teULik
e dataset is higher than
DBLP-A datas
e
t,
whi
c
h i
s
m
a
i
n
ly be
cau
s
e
the a
ppli
c
atio
n ba
ckgr
oun
d of T
D
SHRI
NK alg
o
rithm
is
so
cial
tag
g
ing
netwo
rks.
And that, the
numb
e
r
of final
comm
unities
of
Cite
UL
ike
and
DBL
P
-A in diffe
re
nt topi
c
threshold
is shown in Figure 6.
Figure 6. The
Distrib
u
tion o
f
Numbe
r
of Final
Com
m
uni
ties in Differe
nt Topic Th
re
shol
d
0.
1
0.
2
0.
3
0.
4
0.
5
0.
6
0.
7
0.
8
0.
9
0.
9
0.
6
0.
65
0.
7
0.
75
0.
8
0.
85
0.
9
0.
95
1
t
opi
c
t
h
re
s
h
o
l
d
Pu
r
i
t
y
Ci
t
e
UL
i
k
e
0.
1
0.
2
0.
3
0.
4
0.
5
0.
6
0.
7
0.
8
0.
9
0.
9
0.
3
0.
4
0.
5
0.
6
0.
7
0.
8
0.
9
0.
9
t
opi
c
t
h
res
h
ol
d
Pu
r
i
t
y
DB
L
P
-A
0.
1
0.
2
0.
3
0.
4
0.
5
0.
6
0.
7
0.
8
0.
9
50
0
10
00
15
00
20
00
25
00
30
00
35
00
40
00
t
o
p
i
c t
h
r
e
sh
o
l
d
N
u
m
ber
of
c
o
m
m
u
n
i
t
i
e
s
Ci
t
e
U
L
i
k
e
DB
L
P
-A
Evaluation Warning : The document was created with Spire.PDF for Python.