TELKOMNIKA, Vol. 14, No. 1, March 2016, pp. 286~293
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v14i1.2330
Received July 14, 2015; Revised November 23, 2015; Accepted December 20, 2015
Improving Multi-Document Summary Method Based on
Sentence Distribution
Aminul Wahib*1, Agus Zainal Arifin2, Diana Purwitasari3
1Study Program of Informatics, Politeknik Kota Malang, Indonesia, 65132
2,3Department of Informatics, Faculty of Information Technology, ITS Surabaya, Indonesia, 60111
*Corresponding author, email: wahib@poltekom.ac.id1, agusza@cs.its.ac.id2, diana@if.its.ac.id3
Abstract
Automatic multi-document summarization methods have been developed by many researchers. The method used to select sentences from the source documents determines the quality of the summary result. One of the most popular sentence-weighting methods is to calculate the frequency of occurrence of the words forming the sentences. However, choosing sentences with that method can lead to a selected sentence that does not represent the content of the source documents optimally, because the weight of a sentence is measured only by the number of word occurrences. This study proposes a new sentence-weighting strategy based on sentence distribution, which chooses the most important sentences by paying attention to how the elements forming each sentence are distributed as words. The sentence distribution method enables the extraction of important sentences in multi-document summarization and serves as a strategy to improve the quality of the resulting summaries. Three concepts are used in this study: (1) clustering sentences with similarity based histogram clustering, (2) ordering clusters by cluster importance, and (3) selecting important sentences by sentence distribution. Experimental results show that the proposed method performs better than the SIDeKiCK and LIGI methods. On ROUGE-1 the proposed method improves 3% over the SIDeKiCK method and 5.1% over the LIGI method; on ROUGE-2 it improves 13.7% over SIDeKiCK and 14.4% over LIGI.

Keywords: Multi-document summaries, Extracting important sentences, Sentence distribution

Copyright © 2016 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
The number of digital documents has increased very rapidly, which raises many new problems in digging up and obtaining information quickly and accurately. The growing number of documents forces information seekers to spend extra time searching for and reading information. Another issue is the large potential loss of important information contained in the documents. Researchers have tried to resolve this problem by developing document summarization methods.
A good document summary is a summary that covers (coverage) as much as possible of the important concepts (saliency) that exist in the source documents [1]. Since coverage and saliency are the major problems in document summarization, the strategy for selecting sentences is very important: it should choose the main phrases and avoid redundancy so as to include as many concepts as possible [2]. Several studies [2-5] have developed methods for selecting important sentences to address the issues of coverage and saliency.
One good method is the combination of sentence information density and keywords of sentence clusters (SIDeKiCK) [2]. According to [2], an important sentence is a sentence that has high information density and contains many keywords of its sentence cluster. Sentence information density can be extracted with a positional text graph approach, and the keywords of sentence clusters can be extracted using the TF.IDF method [2]. However, when several sentences have almost the same weight, the positional text graph approach has difficulty determining the important sentences [6], and the keywords of sentence clusters obtained with the TF.IDF concept are not able to give maximum weight to the cluster keywords [7].
This study proposes a sentence-weighting method as a new strategy for selecting important sentences, based on the sentence distribution method. The sentence distribution method takes into account where the elements forming a sentence are located, so as to give a higher weight to the sentences that should be the topic of the document. Selecting important sentences with this scheme is expected to yield representative sentences in a multi-document summary and to improve the quality of the summary result.
2. Research Method
The research method used in this study was adopted from [3]; the same framework was also employed in [2]. Figure 1 shows the phases that must be carried out to obtain the final summaries. Those phases are text preprocessing, sentence clustering, cluster ordering, sentence extraction and, finally, summary arrangement. The sentence extraction phase is the contribution of this research.
Figure 1. The framework of multi-document summary
2.1. Text Preprocessing Phase
The text preprocessing phase includes tokenizing, stopword removal and stemming. Tokenizing is the process of splitting the text into words so that each word stands alone. Stopword removal is the process of removing words that are not appropriate to keep, such as conjunctions, prepositions and pronouns. Stemming is the process of obtaining the base form of each word. In this study, tokenizing is done using the Stanford natural language processing toolkit, stopword removal uses a stoplist dictionary, and stemming uses the English Porter stemmer library.
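As a rough illustration of this phase, the sketch below uses NLTK's tokenizer, English stopword list and Porter stemmer as stand-ins for the Stanford tokenizer and stoplist dictionary used by the authors; it is a minimal approximation of the pipeline, not the exact toolchain of this study.

```python
# Requires: pip install nltk, then nltk.download('punkt') and nltk.download('stopwords').
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPLIST = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_sentence(sentence):
    """Tokenize, drop stopwords and punctuation, then stem the remaining words."""
    tokens = nltk.word_tokenize(sentence.lower())                        # tokenizing
    content = [t for t in tokens if t.isalpha() and t not in STOPLIST]   # stopword removal
    return [STEMMER.stem(t) for t in content]                            # stemming

print(preprocess_sentence("The researchers developed new summarization methods."))
# e.g. ['research', 'develop', 'new', 'summar', 'method']
```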
2.2. Sentence Clustering Phase with Similarity based Histogram Clustering (SHC)
Sentence clustering is an important part of an automatic summarization system: each topic in the set of documents should be properly identified, finding similarities and dissimilarities in the documents so as to ensure good coverage [3, 8]. The similarity function used is the uni-gram matching-based similarity of equation (1). The similarity between two sentences is calculated from the number of corresponding words between sentence s_i and sentence s_j (|s_i ∩ s_j|), divided by the total length of sentences s_i and s_j (|s_i| + |s_j|):

$$\mathrm{sim}(s_i, s_j) = \frac{2\,|s_i \cap s_j|}{|s_i| + |s_j|} \qquad (1)$$
The uni-gram matching-based similarity measure is used to measure the similarity of each pair of sentences in a cluster. If a cluster has n sentences, the total number of sentence pairs is m, where m = n(n+1)/2, and Sim = {sim_1, sim_2, sim_3, ..., sim_m} is the collection of the m pairwise similarities between sentences.
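To make equation (1) and the pair enumeration concrete, here is a small sketch that treats each preprocessed sentence as a list of unigram tokens; the function names are ours, chosen for illustration.

```python
from collections import Counter

def unigram_similarity(sent_i, sent_j):
    """Equation (1): sim(s_i, s_j) = 2 * |s_i intersect s_j| / (|s_i| + |s_j|), on token lists."""
    overlap = sum((Counter(sent_i) & Counter(sent_j)).values())   # matched unigrams
    return 2.0 * overlap / (len(sent_i) + len(sent_j))

def pairwise_similarities(cluster):
    """The m = n(n+1)/2 pair similarities of a cluster (pairs with i <= j)."""
    return [unigram_similarity(cluster[i], cluster[j])
            for i in range(len(cluster)) for j in range(i, len(cluster))]

cluster = [['summar', 'method', 'sentenc'], ['sentenc', 'cluster', 'method']]
print(pairwise_similarities(cluster))   # [1.0, 0.666..., 1.0]
```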
The similarity histogram of a cluster is denoted H = {h_1, h_2, h_3, ..., h_nb}. The function to calculate h_i is shown in equation (2):

$$h_i = \mathrm{count}(sim_j), \quad sim_{li} \le sim_j < sim_{ui} \qquad (2)$$
The number of similarity pairs that fall in bin i (h_i) of a particular cluster is obtained by counting the similarity pairs (sim_j) of that cluster whose values lie between the lower limit of bin i (sim_li) and the upper limit of bin i (sim_ui). The histogram ratio (HR) of a cluster can then be calculated by equations (3) and (4):

$$HR = \frac{\sum_{i=T}^{n_b} h_i}{\sum_{j=1}^{n_b} h_j} \qquad (3)$$

$$T = S_T \times n_b \qquad (4)$$
The histogram ratio of a cluster is calculated by counting the similarity pairs (h_i) that lie above the similarity threshold (S_T) and dividing by the total number of similarity pairs over all n_b bins. A sentence is placed in a cluster if it meets the criteria of that cluster; if the sentence does not meet the criteria of any existing cluster, a new cluster is formed. The SHC method for clustering sentences used in this study was adapted from [3].
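The following sketch illustrates how the similarity histogram and histogram ratio of equations (2)-(4) can drive the membership decision; the fixed number of bins, the example similarity values and the join test are simplifying assumptions on our part, not the exact SHC procedure of [3].

```python
import math

def histogram_ratio(similarities, n_bins=10, s_t=0.4):
    """HR (eq. 3): share of similarity pairs falling in bins at or above T (eq. 4)."""
    h = [0] * n_bins                                    # similarity histogram H = {h_1, ..., h_nb}
    for sim in similarities:
        h[min(int(sim * n_bins), n_bins - 1)] += 1      # eq. (2): count pairs per bin
    t = math.ceil(s_t * n_bins)                         # T = S_T * n_b, rounded up to a bin index
    return sum(h[t:]) / max(sum(h), 1)

# Membership test: a sentence may join a cluster if, after adding it, the histogram
# ratio of the cluster's pairwise similarities stays at or above HR_min; otherwise a
# new cluster is created. The similarity list below is hypothetical.
sims_after_adding = [0.9, 0.75, 0.8, 0.55, 0.62, 0.7]
print(histogram_ratio(sims_after_adding) >= 0.7)        # True -> the sentence may join
```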
2.3. Cluster Ordering Phase
One of the weaknesses of the sentence clustering phase is that similarity based histogram clustering (SHC) does not know in advance how many clusters will be formed. Therefore, cluster ordering is used to determine which clusters are appropriate to take part in building the summary. This is done by testing every word available in a cluster against a threshold value θ: if the frequency of a word w (count(w)) fulfills the threshold θ, that word is considered a frequent word. The weight of the word w is calculated based on its frequency over all of the words in the input documents, and the weight of a cluster is computed from the frequencies of all frequent words w owned by that cluster. The cluster ordering method used in this study is the cluster importance method suggested in [3]. Cluster ordering based on the weight of cluster importance is calculated by equation (5):

$$\mathrm{Weight}(c_j) = \sum_{w \in c_j} \log(1 + \mathrm{count}(w)) \qquad (5)$$
The weight of the j-th cluster (Weight(c_j)) is calculated by summing log(1 + count(w)) over all frequent words w found in cluster j, where count(w) is the frequency of w in the input documents.
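A minimal sketch of the cluster importance ordering of equation (5), assuming the frequent-word test with threshold θ described above; the data structures and example counts are hypothetical.

```python
import math
from collections import Counter

def order_clusters(clusters, doc_word_counts, theta=10):
    """Sort clusters by Weight(c_j): sum of log(1 + count(w)) over frequent words w in c_j."""
    def cluster_weight(words_in_cluster):
        frequent = {w for w in words_in_cluster if doc_word_counts[w] >= theta}  # count(w) >= theta
        return sum(math.log(1 + doc_word_counts[w]) for w in frequent)           # equation (5)
    return sorted(clusters, key=lambda c: cluster_weight(set(c)), reverse=True)

# Hypothetical example: clusters flattened to the words of their sentences, and
# word counts taken over the whole set of input documents.
doc_counts = Counter({'summar': 25, 'method': 40, 'cluster': 14, 'rare': 2})
clusters = [['rare', 'cluster'], ['summar', 'method']]
print(order_clusters(clusters, doc_counts))   # the 'summar'/'method' cluster is ranked first
```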
2.4. Sentence Extraction Phase
Sentence extraction is the phase of selecting the important sentences that will form the summary. To select important sentences, this study proposes a new sentence-weighting strategy that uses the distributions of the local and global sentence methods. This strategy is called the sentence distribution method.
2.4.1. Sentence Distribution Method
The sentence distribution method is formed from the local and global sentence distributions, as shown in equation (6):

$$\mathrm{Weight}(s_{ik}) = W_{ls}(s_{ik}) \times W_{gs}(s_{ik}) \qquad (6)$$
The weight of a sentence (Weight(s_ik)) is obtained by multiplying the local sentence distribution weight (W_ls(s_ik)) by the global sentence distribution weight (W_gs(s_ik)). The local sentence distribution is used to determine the position of each sentence within a cluster, under the assumption that a sentence whose elements are more widely spread within the cluster deserves a higher position in that cluster. This is expected to select the sentences that best represent the cluster. The global sentence distribution is used to position each sentence of a cluster with respect to the whole set of clusters, under the assumption that sentences whose elements are more widely spread have a higher position. This is expected to determine the level of importance, or the position, of each sentence globally.
The local and global sentence distribution weights are multiplied so that the two weights reinforce each other. If the local weight of a sentence is large but its global weight is small, the multiplication decreases the overall weight of that sentence. Conversely, if both the local and the global weight are high, that sentence deserves to represent the cluster more than another sentence with lower values.
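For illustration, with hypothetical weights, a sentence with W_ls = 0.8 but W_gs = 0.1 receives Weight = 0.8 × 0.1 = 0.08, while a sentence with W_ls = 0.6 and W_gs = 0.7 receives Weight = 0.6 × 0.7 = 0.42, so the second sentence is preferred even though its local weight is lower.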
2.4.1.1. Local Sentence Distribution Method
The local sentence distribution is the distribution of the important words forming the sentences in a cluster. The local sentence method is formed through the following steps: (1) calculating the distribution probability, (2) calculating the total distribution, (3) calculating the expansion of the distribution, (4) calculating the weights of the sentence components, and (5) calculating the sentence weight based on the local sentence method.
Consider a cluster containing i sentences and j words, and let s_ik denote the i-th sentence in the k-th cluster. Assuming every word j in the set of sentences is distributed equally over the sentences it belongs to, the distribution probability of sentence i is calculated using K. Pearson's theory as in equation (7):

$$r_{ij} = \frac{|s_{ik}^{dt}|}{|c_k|} \qquad (7)$$
The distribution probability (r_ij) is obtained from the number of distinct words forming sentence s_i in cluster k (|s_ik^dt|) divided by the number of words in cluster k (|c_k|). The difference between the observed word frequencies and the frequencies expected from the distribution of word j in sentence i can then be measured using the chi-square test statistic, so the distribution of word j in cluster k is given by equation (8):

$$\chi_{jk}^{2} = \sum_{j=1}^{|c_k^{dt}|} \frac{(v_{ij} - n_{jk}\, r_{ij})^{2}}{n_{jk}\, r_{ij}} \qquad (8)$$
The distribution of a sentence component (χ²_jk) is derived from the squared difference between the observed frequency of the sentence component (v_ij) and the expected frequency of the distribution of component j in cluster k (n_jk r_ij), divided by that expected frequency. The variable n_jk is the frequency of sentence component j in cluster k, and |c_k^dt| is the number of different words in cluster k.
A smaller value of equation (8) indicates that sentence component j is closer to the maximum of the distribution; this value is related to the word weight by a non-linear correlation [7]. Therefore, equation (9) is obtained:

$$U_{jk} = \frac{1}{1 + \chi_{jk}^{2}} \qquad (9)$$
The weight of sentence component j in cluster k is thus its spread (U_jk), which decreases as the distribution value of the component (χ²_jk) grows. To weight the distribution of component j in cluster k optimally, an expansion of the calculation is carried out, giving equation (10):

$$St_{jk} = \log_2\!\left(1 + \frac{p_{jk}}{P_k}\right) \qquad (10)$$
The expansion of the distribution of component j in cluster k (St_jk) is obtained from the number of sentences that contain word j in cluster k (p_jk) and the total number of sentences in cluster k (P_k). The weight of sentence component j in cluster k can then be calculated by equation (11):

$$Wt_{l,jk} = U_{jk} \times \log_2(1 + St_{jk}) \qquad (11)$$
The local weights of the components j forming a sentence in cluster k (Wt_l,jk) form the local weight of sentence i (W_ls(s_ik)) by summing the weights of all components forming sentence s_i in cluster k and dividing by the number of components forming sentence s_i in cluster k (|s_ik|), as shown in equation (12):

$$W_{ls}(s_{ik}) = \frac{1}{|s_{ik}|} \sum_{Wt_{l,jk} \in s_{ik}} Wt_{l,jk} \qquad (12)$$
Equations (7) to (11) are adopted from [7], where they were originally used to calculate the distribution of words within a paragraph of a document; in this study they are developed for weighting sentences within a sentence cluster.
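The sketch below walks through equations (7)-(12) for a single cluster. The variable names mirror the symbols above, but the mapping is our simplified reading: v_ij is taken as the count of word j in sentence i, n_jk as the count of word j in cluster k, and the χ² term is computed per word rather than as the full summation, so this is an illustration rather than a reference implementation.

```python
import math
from collections import Counter

def local_sentence_weights(cluster):
    """W_ls(s_ik) for every sentence of one cluster; cluster = list of token lists."""
    cluster_counts = Counter(w for sent in cluster for w in sent)    # n_jk: count of word j in cluster k
    total_words = sum(cluster_counts.values())                       # |c_k|
    n_sentences = len(cluster)                                       # P_k
    weights = []
    for sent in cluster:
        v = Counter(sent)                                            # v_ij: count of word j in sentence i
        r = len(set(sent)) / total_words                             # eq. (7): r_ij
        wt_sum = 0.0
        for word in set(sent):
            expected = cluster_counts[word] * r                      # n_jk * r_ij
            chi2 = (v[word] - expected) ** 2 / expected              # eq. (8), one term per word
            u = 1.0 / (1.0 + chi2)                                   # eq. (9): U_jk
            p_jk = sum(1 for s in cluster if word in s)              # sentences of the cluster containing j
            st = math.log2(1 + p_jk / n_sentences)                   # eq. (10): St_jk
            wt_sum += u * math.log2(1 + st)                          # eq. (11): Wt_l,jk
        weights.append(wt_sum / len(set(sent)))                      # eq. (12): average over |s_ik| components
    return weights

cluster = [['summar', 'method', 'sentenc'],
           ['sentenc', 'cluster', 'method'],
           ['summar', 'sentenc', 'weight']]
print(local_sentence_weights(cluster))                               # one local weight per sentence
```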
2.4.1.2. Global Sentence Distribution Method
The global sentence distribution is the distribution of the important components forming sentences over the whole set of clusters. The global sentence method is formed in a similar manner to the local sentence method, that is: (1) calculating the distribution probability, (2) calculating the total distribution, (3) calculating the expansion of the distribution, (4) calculating the weights of the sentence components, and (5) calculating the sentence weight based on the global sentence method.
Suppose the set of clusters contains m clusters, indexed k = 1, 2, 3, ..., m. The number of distinct words in cluster k is denoted |c_k^dt'| and the number of distinct words in the whole set of clusters is denoted |c^dt'|. The probability that sentence component j in cluster k is spread is then given by equation (13):

$$r'_{jk} = \frac{|c_k^{dt'}|}{|c^{dt'}|} \qquad (13)$$
The variable n'_j is the frequency of sentence component j in the whole collection of clusters. The sum of the squared differences between the observed word frequencies (v'_jk) and the frequencies expected from the distribution of word j in cluster k (n'_j r'_jk), divided by that expected frequency, is used to calculate the distribution of sentence component j over the set of clusters, which gives equation (14):

$$\chi'^{2}_{j} = \sum_{j=1}^{|c^{dt'}|} \frac{(v'_{jk} - n'_{j}\, r'_{jk})^{2}}{n'_{j}\, r'_{jk}} \qquad (14)$$
Equation (14) shows that a smaller spread value of sentence component j (χ'²_j) means the component is closer to the maximum of the distribution. This value has a negative, non-linear correlation with the weight of the word [7], so equation (15) can be derived:

$$U'_{j} = \frac{1}{\chi'^{2}_{j}} \qquad (15)$$
Equation (15) shows that the spread weight of sentence component j (U'_j) is inversely proportional to the distribution value of the component (χ'²_j). To weight the spread of component j optimally, an expansion of the calculation is carried out to obtain equation (16):

$$St'_{j} = \log_2\!\left(1 + \frac{p'_{j}}{P'}\right) \qquad (16)$$
The expansion of the distribution of sentence component j (St'_j) is obtained from the number of sentences that contain word j (p'_j) and the total number of sentences in the set of clusters (P').
The weight of sentence component j in the set of clusters (Wt_g,j) can then be calculated with equation (17):

$$Wt_{g,j} = U'_{j} \times \log_2(1 + St'_{j}) \qquad (17)$$
Equation (18) shows that the global sentence weight (W_gs(s_ik)) is derived by summing the global weights of all words forming sentence s_i in cluster k while considering the length of the sentence, that is, the number of components forming sentence s_i in cluster k. This prevents sentences that simply have the largest number of components from always appearing as important sentences regardless of the meaning contained in them.
$$W_{gs}(s_{ik}) = \frac{1}{|s_{ik}|} \sum_{Wt_{g,j} \in s_{ik}} Wt_{g,j} \qquad (18)$$
Equations (13) to (17) are adopted from [7], where they were originally used to calculate the distribution of words within a paragraph of a document; here they are used to develop the weighting of sentences within sentence clusters. Equations (6), (12) and (18) are our original contribution in this study.
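For completeness, a companion sketch of the global weights of equations (13)-(18), under the same simplifying assumptions as the local sketch (counts used as frequencies, per-word χ² terms, and a pragmatic guard when χ'² is zero). Equation (6) then multiplies these values with the local weights from the earlier sketch.

```python
import math
from collections import Counter

def global_sentence_weights(clusters):
    """W_gs(s_ik) for every sentence, keyed by (cluster index, sentence index)."""
    all_sents = [s for c in clusters for s in c]
    corpus_counts = Counter(w for s in all_sents for w in s)          # n'_j: count of word j overall
    distinct_corpus = len(corpus_counts)                              # |c^dt'|
    total_sents = len(all_sents)                                      # P'
    weights = {}
    for k, cluster in enumerate(clusters):
        distinct_k = len({w for s in cluster for w in s})             # |c_k^dt'|
        r = distinct_k / distinct_corpus                              # eq. (13): r'_jk
        counts_k = Counter(w for s in cluster for w in s)             # v'_jk: count of word j in cluster k
        for i, sent in enumerate(cluster):
            wt_sum = 0.0
            for word in set(sent):
                expected = corpus_counts[word] * r                    # n'_j * r'_jk
                chi2 = (counts_k[word] - expected) ** 2 / expected    # eq. (14), one term per word
                u = 1.0 / chi2 if chi2 > 0 else 1.0                   # eq. (15), guarded when chi2 = 0 (our cap)
                p_j = sum(1 for s in all_sents if word in s)          # sentences containing word j
                st = math.log2(1 + p_j / total_sents)                 # eq. (16): St'_j
                wt_sum += u * math.log2(1 + st)                       # eq. (17): Wt_g,j
            weights[(k, i)] = wt_sum / len(set(sent))                 # eq. (18): W_gs(s_ik)
    return weights

# Equation (6): Weight(s_ik) = W_ls(s_ik) * W_gs(s_ik), combining this with the local sketch.
```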
2.5. Summary Arrangement Phase
The summary arrangement phase composes the summary from the sentences obtained in the important-sentence extraction phase. The cluster order produced in the cluster ordering phase serves as the reference for arranging the summary. The sentence with the highest weight in each cluster becomes a main point of the expected summary. The number and order of summary sentences equal the number and order of the sentence clusters.
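A short sketch of this arrangement step, assuming sentence weights from equation (6) and a cluster list already sorted by equation (5); the names and example values are illustrative only.

```python
def arrange_summary(ordered_clusters, sentence_weights):
    """One summary sentence per cluster: the highest-weight sentence, in cluster order.
    ordered_clusters: clusters already sorted by Weight(c_j); each a list of sentence strings.
    sentence_weights: mapping from sentence to its Weight(s_ik) from equation (6)."""
    summary = [max(cluster, key=lambda s: sentence_weights.get(s, 0.0))
               for cluster in ordered_clusters]
    return ' '.join(summary)

clusters = [['Method A improves coverage.', 'Coverage matters.'],
            ['ROUGE is used for evaluation.']]
weights = {'Method A improves coverage.': 0.42, 'Coverage matters.': 0.08,
           'ROUGE is used for evaluation.': 0.30}
print(arrange_summary(clusters, weights))
# -> Method A improves coverage. ROUGE is used for evaluation.
```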
2.6. Evaluation of Summary Results
The evaluation of the summary results in this study uses ROUGE. ROUGE measures the quality of a summary by calculating overlapping units such as n-grams, word order and word pairs between the candidate summary and a reference summary. ROUGE is effective for evaluating document summaries [9]. The variants used in this study are ROUGE-1 and ROUGE-2. ROUGE-1 uses a unigram matching concept and is calculated from the total number of single words (unigrams) that match between the summary produced by the system and a reference summary made manually by experts. ROUGE-2 is calculated from the total number of word pairs (bigrams) that match between the system summary and the reference summary. The best possible score for both ROUGE-1 and ROUGE-2 is 1.
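As a rough illustration of the n-gram overlap behind these scores, the sketch below computes only the core recall of ROUGE-N on whitespace tokens; the official ROUGE package [9] adds stemming, stopword handling and precision/F variants, so this is a simplification.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Clipped matching n-grams divided by the number of n-grams in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.lower().split())
    ref = ngrams(reference.lower().split())
    overlap = sum((cand & ref).values())            # matches, clipped by reference counts
    return overlap / max(sum(ref.values()), 1)

candidate = "the proposed method improves summary quality"
reference = "the proposed method improves the quality of the summary"
print(round(rouge_n_recall(candidate, reference, n=1), 3))   # ROUGE-1 recall: 0.667
print(round(rouge_n_recall(candidate, reference, n=2), 3))   # ROUGE-2 recall: 0.375
```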
3. Results and Analysis
The experiments in this study used three methods of extracting important sentences: the sentence distribution method, the SIDeKiCK method [2] and the LIGI method [3]. The data used in this study is DUC (Document Understanding Conferences) 2004 task 2, which consists of 50 groups of documents. The evaluation measures used are ROUGE-1 and ROUGE-2, where a higher ROUGE value indicates a better quality of the resulting summary.
3.1. Testing of the Sentence Distribution Method
Testing was carried out to compare the results of the proposed method with the LIGI (local importance and global importance) method and the SIDeKiCK (sentence information density and keyword of sentence clusters) method. The parameters used in the testing process were the combination HR_min = 0.7, ε = 0.3, similarity threshold (S_T) = 0.4 and θ = 10. The parameter β of the LIGI method, established in [3], was 0.5. The scores α = 0.4 and λ = 0.2 were used in the SIDeKiCK method [2], where those scores are considered the recommended optimal values.
The results of testing the sentence distribution method against the LIGI and SIDeKiCK methods can be seen in Table 1. Table 1 shows that the sentence distribution method has a higher average ROUGE score than the LIGI and SIDeKiCK methods for both ROUGE-1 and ROUGE-2. The average ROUGE-1 score of the sentence distribution method was 0.4042, while the SIDeKiCK method achieved 0.3924 and the LIGI method 0.3845. This means the sentence distribution method is better, with a 3% improvement over the SIDeKiCK method and a 5.1% improvement over the LIGI method under the ROUGE-1 testing scheme. Under ROUGE-2, the sentence distribution method achieved an average score of 0.1209, the SIDeKiCK method 0.1063 and the LIGI method 0.1057, meaning the sentence distribution method is better, with an improvement of 13.7% over the SIDeKiCK method and 14.4% over the LIGI method.
Table 1. Testing of Extracting Important Sentence Methods

Summary Method                                                          ROUGE-1   ROUGE-2
Sentence Clustering (SHC) + Cluster Ordering + Sentence Distribution    0.404     0.1209
Sentence Clustering (SHC) + Cluster Ordering + SIDeKiCK                 0.392     0.1063
Sentence Clustering (SHC) + Cluster Ordering + LIGI                     0.384     0.1057
4. Conclusion
The quality of the summaries in multi-document summarization can be improved by extracting important sentences using the sentence distribution method. The sentence distribution method has proven better than the SIDeKiCK and LIGI methods, with average scores of 0.404 on ROUGE-1 and 0.121 on ROUGE-2. The optimal parameters for selecting important sentences with the sentence distribution method are HR_min = 0.7, epsilon (ε) = 0.3, similarity threshold (S_T) = 0.4 and cluster ordering threshold (θ) = 10, with which the ROUGE-1 score of the sentence distribution method increased 3% compared to the SIDeKiCK method (0.392) and 5.1% compared to the LIGI method (0.385). The ROUGE-2 result of the sentence distribution method increased 13.7% compared to the SIDeKiCK method (0.106) and 14.4% compared to the LIGI method (0.105).
References
[1] Ouyang Y, Li W, Zhang R, Li S, Lu Q. A Progressive Sentence Selection Strategy for Document Summarization. Journal of Information Processing and Management. 2013; 49(1): 213-221.
[2] Suputra HGI, Arifin ZA, Yuniarti A. Strategi Pemilihan Kalimat pada Peringkasan Multi-Dokumen Berdasarkan Metode Clustering Kalimat. Master Thesis. Surabaya: Postgraduate ITS; 2013.
[3] Sarkar K. Sentence Clustering-based Summarization of Multiple Text Documents. International Journal of Computing Science and Communication Technologies. 2009; 2(1): 325-335.
[4] He T, Li F, Shao W, Chen J, Ma L. A New Feature-Fusion Sentence Selecting Strategy for Query-Focused Multi-document Summarization. Proceedings of the International Conference on Advanced Language Processing and Web Information Technology. Eds: Ock C. et al. University of Normal, Wuhan, China. 2008: 81-86.
[5] V Kumar R, Raghuveer K. Legal Documents Clustering and Summarization using Hierarchical Latent Dirichlet Allocation. IAES International Journal of Artificial Intelligence (IJ-AI). 2014; 2(1): 27-35.
[6] Kruengkrai C, Jaruskulchai C. Generic Text Summarization Using Local and Global Properties of Sentences. Proceedings of the IEEE/WIC International Conference on Web Intelligence (WI'03). IEEE Computer Society Washington DC, Halifax, Canada. 2003: 201-206.
[7] Tian X, Chai Y. An Improvement to TF-IDF: Term Distribution based Term Weight Algorithm. Journal of Software. 2011; 6(3): 413-420.
[8] Amoli VP, Sh Sojoodi O. Scientific Documents Clustering Based on Text Summarization. IAES International Journal of Electrical and Computer Engineering (IJECE). 2015; 5(4): 782-787.
[9] Lin CY. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out. Eds: Moens MF and Szpakowicz S. Association for Computational Linguistics. Barcelona. 2004: 74-81.